See how our moderation models operate. Skip to the next page for API information.

Chatsight Kali

A Transformer Content Moderation Hydra.

Kali is made up of three models operating in consensus, eliminating the need to ask what "60% hateful" means.

Features

  • 5+ indicators of text content, including Identity-Aware, Political, Vulgarity, and Toxicity.
  • Sentiment Analysis included.
  • Language Detection: using PBMT, we support over 100 languages.
  • A unique flagging system with clear, concise actions to take: "pass", "review", "delete".
  • Optional: Automatically handle "review" events to avoid exposing staff to harmful material.
  • Advanced Unicode Normalization:
    • Zalgo
    • Unicode letter-like symbols
    • Leetspeak
  • Optional: Automatically detects words written without spaces (e.g. ilikepizzasometimes) and intelligently separates them into real words.
  • Updated regularly, including new and emerging types of phrasing.
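
The Unicode normalization features above might be sketched roughly as follows. This is an illustrative pipeline, not Kali's implementation: the leetspeak table is a toy example, and the real system presumably covers far more cases.

```python
import unicodedata

# Toy leetspeak table for illustration; Kali's real mapping is not public.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})

def normalize(text: str) -> str:
    # NFKD decomposition folds letter-like symbols (fullwidth forms,
    # mathematical alphanumerics) back to plain characters and splits
    # diacritics off as combining marks.
    text = unicodedata.normalize("NFKD", text)
    # Dropping combining marks strips Zalgo-style stacked diacritics.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # A crude leetspeak pass restores common digit/symbol substitutions.
    return text.translate(LEET).lower()

print(normalize("ｈ4ｔ3"))  # fullwidth letters + leetspeak -> "hate"
```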

Methodology

This information is not required to use the API; it exists to give outside readers a better understanding of how Kali works.

Chatsight Kali is a 3-model Content Moderation "hydra" served over API. Traditionally, content moderation models have relied upon multi-label classification in overlapping detection "arcs" to provide well-rounded detection capabilities.

Key issues with existing moderation models:

  • Datasets have readily been created from sources like Twitter and Reddit. These platforms have extensively de-noised their available data to protect their communities, for obvious reasons. Too much noise is bad, but no noise, or extremely little noise, is just as bad.
  • It is extremely difficult to reliably tune a single moderation model to behave well on increasingly nuanced and elaborate references to Race, Religion, Sexuality, and other sensitive topics.
  • End-users may find it difficult to quantify their response to confidence intervals for traditional moderation labels.
  • Moderation models have failed to advance from being purely a "detection layer" to being a "moderation layer".

Kali at a glance:

  • Operates as a quorum ⌊(3/2)+1⌋, or in a deadlock resolver mode ⌊(2/2)+1⌋.
  • Available to operate as ⌊(2/2)+1⌋ with GPT-3 fine-tuned with our datasets.
  • Models retrained based on internal metrics, from our corpus of 22 million hate comments. Comments are not sourced from traditional surface-level social networks, to avoid "noise starvation".
  • Extensive focus on preventing AAVE and Racial false-positives, at a slight cost to extended-sentence detection (greater than 227 characters). This makes it suitable for live-streaming comments, user submissions averaging around 1-5 sentences, and captions.
  • Kali makes final decisions through our consensus algorithm, designed to reflect our empirical findings on reliable content-moderation decision-making.
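
The quorum arithmetic above (⌊n/2⌋ + 1 of n models must agree) amounts to a majority vote. A minimal sketch, assuming a conservative fall-back to "review" on deadlock; Kali's actual tie-break behaviour is not documented here:

```python
from collections import Counter

def consensus(verdicts: list[str]) -> str:
    """Majority vote over per-model verdicts ("pass"/"review"/"delete").

    With n models, a verdict wins once it reaches the quorum
    floor(n/2) + 1 (2 of 3 in the three-model mode). If no verdict
    reaches quorum, we fall back to "review" so a human sees the item;
    this tie-break is an assumption, not Kali's published behaviour.
    """
    quorum = len(verdicts) // 2 + 1
    winner, count = Counter(verdicts).most_common(1)[0]
    return winner if count >= quorum else "review"

print(consensus(["pass", "pass", "delete"]))    # -> "pass"
print(consensus(["pass", "review", "delete"]))  # -> "review"
```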

Comparisons

We've produced several comparisons against available open-source moderation models. Kali is evaluated as a unified response, as returned by the API: we take Kali's decision from the suggestedAction array. We do not consider the performance of Kali's individual models in isolation, as we maintain that doing so would invalidate the very reason Kali is built the way it is. Kali is a distributed decision.
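
In code, consuming Kali as a unified decision means reading suggestedAction from the response and nothing else. The response shape below is an assumption for illustration; see the API page for the real contract:

```python
import json

# A hypothetical response body; only the suggestedAction field is
# described in this document, the rest of the shape is assumed.
raw = '{"suggestedAction": ["review"], "language": "en"}'

def decide(response_json: str) -> str:
    """Return the single unified verdict from a Kali API response."""
    action = json.loads(response_json)["suggestedAction"][0]
    if action not in {"pass", "review", "delete"}:
        raise ValueError(f"unexpected action: {action}")
    return action

print(decide(raw))  # -> "review"
```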

Kali's hydra models are not perfect on their own, and we would never claim they are; rather, we aim to show why you should not rely on a single moderation model, and why Kali is a superior solution.

All comparisons use the same eight phrases. We chose these eight phrases because they are known fail patterns observed by Chatsight during the development of Kali. Equally, they highlight bias in commonly used datasets, like those from Jigsaw's Kaggle competitions and open-source releases. In some cases, to reproduce a result you may need to copy a phrase exactly as it appears, rather than re-typing it.

Kali is suffixed with a build number where a result only holds for builds at or after that version.

Perspective (Conducted 09/09/2021)

Axioms: Perspective is an API-based multi-label classification model developed by Jigsaw, a subsidiary of Google. Because its outputs are interval-based rather than binary, we assume two breakpoints for harmful content, at 51% and 70%. We round to the nearest whole integer after multiplying by 100.

Source: These phrases are a mixture of self-created examples and real-life comments from our moderation dataset.
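
The breakpoint rule above can be applied mechanically: scale the score to a percentage, round to the nearest whole integer, and compare it against the breakpoint. A small helper reflecting just those stated rules:

```python
def label(score: float, breakpoint_pct: int) -> str:
    """Map a 0-1 Perspective toxicity score to Toxic/Not-Toxic.

    The score is scaled to a percentage and rounded to the nearest
    whole integer before comparing against the breakpoint, per the
    axioms above (breakpoints of 51 and 70 are used in this write-up).
    """
    pct = round(score * 100)
    return "Toxic" if pct > breakpoint_pct else "Not-Toxic"

print(label(0.7943, 51))  # -> "Toxic"      (Phrase 1 at the 51% breakpoint)
print(label(0.5436, 70))  # -> "Not-Toxic"  (Phrase 4 at the 70% breakpoint)
```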

Phrase 1: i hate my black curtains

  • Kali: Pass
  • Perspective (>51%): Toxic
  • Perspective (>70%): Toxic
    Perspective gave 79.43%

Phrase 2: black americans are being unfairly killed by the police

  • Kali (>0.1.5): Pass
  • Perspective (>51%): Toxic
  • Perspective (>70%): Toxic
    Perspective gave 76.83%

Phrase 3: we should be proud of our black identity and blackness dont be ashamed of it

  • Kali: Pass
  • Perspective (>51%): Toxic
  • Perspective (>70%): Toxic
    Perspective gave 69.54%

Phrase 4: jews have been persecuted for too long

  • Kali: Pass
  • Perspective (>51%): Toxic
  • Perspective (>70%): Not-Toxic
    Perspective gave 54.36%

Phrase 5: look just because im black doesn't mean cops have the right to put me in a headlock

  • Kali: Pass
  • Perspective (>51%): Not-Toxic
  • Perspective (>70%): Not-Toxic
    Perspective gave 47.05%

Phrase 6: kevinleeisaretard

  • Kali: Delete
  • Perspective (>51%): No Answer
  • Perspective (>70%): No Answer
    Perspective didn't give an answer, as it assumed the text was not English.

Phrase 7: Sox fan checking in from maine, we are getting suffocated by Janet mills! Libtard. Keep preaching, we love you big guy

  • Kali: Delete
  • Perspective (>51%): Not-Toxic
  • Perspective (>70%): Not-Toxic
    Perspective gave 32.83%

Note on ethics: this phrase is engineered to defeat moderation models through Unicode manipulation and word placement, highlighting dataset-produced failures.

Phrase 8: oh no daddy trump is upset because spending four years doing nothing but pissing people off lead to people not liking him quickly riot and burn things no we're not acting like antifa nіggers it's different i swear

  • Kali: Delete
  • Perspective (>51%): Toxic
  • Perspective (>70%): Toxic
    Perspective gave 70.05%

Detoxify (Conducted 09/09/2021)

Axioms: Detoxify is a multi-label classification model developed by Unitary. We assume one breakpoint at 51%, and use the "toxicity" label plus the "identity hate"/"identity attack" label for Detoxify. Each phrase is tested against both the "original" and "unbiased" models. We round to the nearest whole integer after multiplying by 100.

Source: These phrases are a mixture of self-created examples and real-life comments from our moderation dataset.
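
Each Detoxify cell in the tables below is graded independently: a label counts as Incorrect when crossing the 51% breakpoint disagrees with the verdict the phrase deserved. A sketch of that bookkeeping (the dict keys are illustrative):

```python
def grade(scores: dict[str, float], should_flag: bool,
          breakpoint_pct: int = 51) -> dict[str, str]:
    """Mark each Detoxify label Correct/Incorrect against the expected verdict.

    A label "fires" when round(score * 100) exceeds the breakpoint; it is
    Correct when firing agrees with whether the phrase deserved flagging.
    """
    out = {}
    for name, score in scores.items():
        fired = round(score * 100) > breakpoint_pct
        out[name] = "Correct" if fired == should_flag else "Incorrect"
    return out

# Phrase 1 ("i hate my black curtains") should pass, i.e. not be flagged.
print(grade({"toxicity": 0.83, "identity_hate": 0.85}, should_flag=False))
# -> {'toxicity': 'Incorrect', 'identity_hate': 'Incorrect'}
```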

Phrase 1: i hate my black curtains

  • Kali: Pass - Correct
  • Detoxify (Original, Toxicity): 83% - Incorrect
  • Detoxify (Original, Identity Hate): 85% - Incorrect
  • Detoxify (Unbiased, Toxicity): 94% - Incorrect
  • Detoxify (Unbiased, Identity Attack): 64% - Incorrect

Phrase 2: black americans are being unfairly killed by the police

  • Kali (>0.1.5): Pass - Correct
  • Detoxify (Original, Toxicity): 94% - Incorrect
  • Detoxify (Original, Identity Hate): 14% - Correct
  • Detoxify (Unbiased, Toxicity): 67% - Incorrect
  • Detoxify (Unbiased, Identity Attack): 64% - Incorrect

Phrase 3: we should be proud of our black identity and blackness dont be ashamed of it

  • Kali: Pass - Correct
  • Detoxify (Original, Toxicity): 25% - Correct
  • Detoxify (Original, Identity Hate): 10% - Correct
  • Detoxify (Unbiased, Toxicity): 44% - Correct
  • Detoxify (Unbiased, Identity Attack): 45% - Correct

Phrase 4: jews have been persecuted for too long

  • Kali: Pass - Correct
  • Detoxify (Original, Toxicity): 45% - Correct
  • Detoxify (Original, Identity Hate): 31% - Correct
  • Detoxify (Unbiased, Toxicity): 2% - Correct
  • Detoxify (Unbiased, Identity Attack): 2% - Correct

Phrase 5: look just because im black doesn't mean cops have the right to put me in a headlock

  • Kali: Pass - Correct
  • Detoxify (Original, Toxicity): 7% - Correct
  • Detoxify (Original, Identity Hate): 85% - Incorrect
  • Detoxify (Unbiased, Toxicity): 59% - Incorrect
  • Detoxify (Unbiased, Identity Attack): 56% - Incorrect

Phrase 6: kevinleeisaretard

  • Kali: Delete - Correct
  • Detoxify (Original, Toxicity): 0.001% - Incorrect
  • Detoxify (Original, Identity Hate): 0.0..1% - Incorrect
  • Detoxify (Unbiased, Toxicity): 0.0..8% - Incorrect
  • Detoxify (Unbiased, Identity Attack): 0.0..5% - Incorrect

Phrase 7: Sox fan checking in from maine, we are getting suffocated by Janet mills! Libtard. Keep preaching, we love you big guy

  • Kali: Delete - Correct
  • Detoxify (Original, Toxicity): 13% - Incorrect
  • Detoxify (Original, Identity Hate): 0.0..2% - Incorrect
  • Detoxify (Unbiased, Toxicity): 1% - Incorrect
  • Detoxify (Unbiased, Identity Attack): 0.0..5% - Incorrect

Note on ethics: this phrase is engineered to defeat moderation models through Unicode manipulation and word placement, highlighting dataset-produced failures.

Phrase 8: oh no daddy trump is upset because spending four years doing nothing but pissing people off lead to people not liking him quickly riot and burn things no we're not acting like antifa nіggers it's different i swear

  • Kali: Delete - Correct
  • Detoxify (Original, Toxicity): 54% - Correct
  • Detoxify (Original, Identity Hate): 7% - Incorrect
  • Detoxify (Unbiased, Toxicity): 17% - Incorrect
  • Detoxify (Unbiased, Identity Attack): 13% - Incorrect

cardiffnlp/bertweet-base-offensive (Conducted 09/09/2021)

Axioms: bertweet-base-offensive is a binary classification model from Cardiff University, published on Hugging Face. We assume that LABEL_0 is Non-Offensive, while LABEL_1 is Offensive. Due to the binary nature, we treat an exact split decision as paralyzing, and therefore incorrect. We round to the nearest whole integer after multiplying by 100.

Source: These phrases are a mixture of self-created examples and real-life comments from our moderation dataset.
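
The binary decision rule above can be made explicit: the Offensive label must strictly beat Non-Offensive after rounding, and an exact 50/50 split is treated as a paralyzed (and therefore incorrect) decision. A minimal sketch:

```python
def binary_verdict(non_offensive: float, offensive: float) -> str:
    """Resolve a two-label softmax into a verdict.

    Percentages are rounded first, per the axioms; a 50/50 split is a
    paralyzed decision and is returned as such rather than guessed.
    """
    a, b = round(non_offensive * 100), round(offensive * 100)
    if a == b:
        return "Split"  # counted as Incorrect in the tables below
    return "Offensive" if b > a else "Non-Offensive"

print(binary_verdict(0.50, 0.50))  # -> "Split"      (Phrase 1)
print(binary_verdict(0.42, 0.58))  # -> "Offensive"  (Phrase 5)
```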

Phrase 1: i hate my black curtains

  • Kali: Pass
  • LABEL_0 (Non-Offensive): 50%
  • LABEL_1 (Offensive): 50%
    bertweet-base-offensive: Incorrect

Phrase 2: black americans are being unfairly killed by the police

  • Kali (>0.1.5): Pass
  • LABEL_0 (Non-Offensive): 55%
  • LABEL_1 (Offensive): 45%
    bertweet-base-offensive: Correct

Phrase 3: we should be proud of our black identity and blackness dont be ashamed of it

  • Kali: Pass
  • LABEL_0 (Non-Offensive): 81%
  • LABEL_1 (Offensive): 19%
    bertweet-base-offensive: Correct

Phrase 4: jews have been persecuted for too long

  • Kali: Pass
  • LABEL_0 (Non-Offensive): 77%
  • LABEL_1 (Offensive): 22%
    bertweet-base-offensive: Correct

Phrase 5: look just because im black doesn't mean cops have the right to put me in a headlock

  • Kali: Pass
  • LABEL_0 (Non-Offensive): 42%
  • LABEL_1 (Offensive): 58%
    bertweet-base-offensive: Incorrect

Phrase 6: kevinleeisaretard

  • Kali: Delete
  • LABEL_0 (Non-Offensive): 78%
  • LABEL_1 (Offensive): 22%
    bertweet-base-offensive: Incorrect

Phrase 7: Sox fan checking in from maine, we are getting suffocated by Janet mills! Libtard. Keep preaching, we love you big guy

  • Kali: Delete
  • LABEL_0 (Non-Offensive): 84%
  • LABEL_1 (Offensive): 16%
    bertweet-base-offensive: Incorrect

Note on ethics: this phrase is engineered to defeat moderation models through Unicode manipulation and word placement, highlighting dataset-produced failures.

Phrase 8: oh no daddy trump is upset because spending four years doing nothing but pissing people off lead to people not liking him quickly riot and burn things no we're not acting like antifa nіggers it's different i swear

  • Kali: Delete
  • LABEL_0 (Non-Offensive): 36%
  • LABEL_1 (Offensive): 64%
    bertweet-base-offensive: Correct

cardiffnlp/bertweet-base-hate (Conducted 09/09/2021)

Axioms: bertweet-base-hate is a binary classification model from Cardiff University, published on Hugging Face. We assume that LABEL_0 is Non-Hateful, while LABEL_1 is Hateful. Due to the binary nature, we treat an exact split decision as paralyzing, and therefore incorrect. We round to the nearest whole integer after multiplying by 100.

Source: These phrases are a mixture of self-created examples and real-life comments from our moderation dataset.

Phrase 1: i hate my black curtains

  • Kali: Pass
  • LABEL_0 (Non-Hateful): 94%
  • LABEL_1 (Hateful): 7%
    bertweet-base-hate: Correct

Phrase 2: black americans are being unfairly killed by the police

  • Kali (>0.1.5): Pass
  • LABEL_0 (Non-Hateful): 86%
  • LABEL_1 (Hateful): 14%
    bertweet-base-hate: Correct

Phrase 3: we should be proud of our black identity and blackness dont be ashamed of it

  • Kali: Pass
  • LABEL_0 (Non-Hateful): 97%
  • LABEL_1 (Hateful): 5%
    bertweet-base-hate: Correct

Phrase 4: jews have been persecuted for too long

  • Kali: Pass
  • LABEL_0 (Non-Hateful): 77%
  • LABEL_1 (Hateful): 22%
    bertweet-base-hate: Correct

Phrase 5: look just because im black doesn't mean cops have the right to put me in a headlock

  • Kali: Pass
  • LABEL_0 (Non-Hateful): 90%
  • LABEL_1 (Hateful): 10%
    bertweet-base-hate: Correct

Phrase 6: kevinleeisaretard

  • Kali: Delete
  • LABEL_0 (Non-Hateful): 97%
  • LABEL_1 (Hateful): 5%
    bertweet-base-hate: Incorrect

Phrase 7: Sox fan checking in from maine, we are getting suffocated by Janet mills! Libtard. Keep preaching, we love you big guy

  • Kali: Delete
  • LABEL_0 (Non-Hateful): 94%
  • LABEL_1 (Hateful): 7%
    bertweet-base-hate: Incorrect

Note on ethics: this phrase is engineered to defeat moderation models through Unicode manipulation and word placement, highlighting dataset-produced failures.

Phrase 8: oh no daddy trump is upset because spending four years doing nothing but pissing people off lead to people not liking him quickly riot and burn things no we're not acting like antifa nіggers it's different i swear

  • Kali: Delete
  • LABEL_0 (Non-Hateful): 92%
  • LABEL_1 (Hateful): 9%
    bertweet-base-hate: Incorrect

Hate-speech-CNERG/dehatebert-mono-english (Conducted 09/09/2021)

Axioms: dehatebert-mono-english is a binary classification model from the Indian Institute of Technology Kharagpur, published on Hugging Face. It is a popular model, with roughly 75,000 downloads per month. The model uses "NON_HATE" as the label for acceptable samples and "HATE" for harmful samples. Due to the binary nature, we treat an exact split decision as paralyzing, and therefore incorrect. We round to the nearest whole integer after multiplying by 100.

Its authors have published a companion research paper.

Source: These phrases are a mixture of self-created examples and real-life comments from our moderation dataset.

Phrase 1: i hate my black curtains

  • Kali: Pass
  • NON_HATE: 42%
  • HATE: 58%
    dehatebert-mono-english: Incorrect

Phrase 2: black americans are being unfairly killed by the police

  • Kali (>0.1.5): Pass
  • NON_HATE: 56%
  • HATE: 44%
    dehatebert-mono-english: Correct

Phrase 3: we should be proud of our black identity and blackness dont be ashamed of it

  • Kali: Pass
  • NON_HATE: 93%
  • HATE: 7%
    dehatebert-mono-english: Correct

Phrase 4: jews have been persecuted for too long

  • Kali: Pass
  • NON_HATE: 34%
  • HATE: 67%
    dehatebert-mono-english: Incorrect

Phrase 5: look just because im black doesn't mean cops have the right to put me in a headlock

  • Kali: Pass
  • NON_HATE: 91%
  • HATE: 9%
    dehatebert-mono-english: Correct

Phrase 6: kevinleeisaretard

  • Kali: Delete
  • NON_HATE: 97%
  • HATE: 3%
    dehatebert-mono-english: Incorrect

Phrase 7: Sox fan checking in from maine, we are getting suffocated by Janet mills! Libtard. Keep preaching, we love you big guy

  • Kali: Delete
  • NON_HATE: 89%
  • HATE: 12%
    dehatebert-mono-english: Incorrect

Note on ethics: this phrase is engineered to defeat moderation models through Unicode manipulation and word placement, highlighting dataset-produced failures.

Phrase 8: oh no daddy trump is upset because spending four years doing nothing but pissing people off lead to people not liking him quickly riot and burn things no we're not acting like antifa nіggers it's different i swear

  • Kali: Delete
  • NON_HATE: 85%
  • HATE: 2%
    dehatebert-mono-english: Incorrect