September 11, 2025
OpenAI, DeepSeek, and Google vary widely in identifying hate speech

With the proliferation of online hate speech—which, research shows, can increase political polarization and damage mental health—leading artificial intelligence companies have released large language models that promise automatic content filtering.
"Private technology companies have become the de facto arbiters of what speech is permissible in the digital public square, yet they do so without any consistent standard," says Yphtach Lelkes, associate professor in the Annenberg School for Communication.
He and Annenberg doctoral student Neil Fasching have produced the first large-scale comparative analysis of AI content moderation systems—which social media platforms employ—and tackled the question of how consistent they are in evaluating hate speech. Their study is published in the Findings of the Association for Computational Linguistics: ACL 2025.
Lelkes and Fasching analyzed seven models, some designed specifically for content classification and others more general: two from OpenAI and two from Mistral, along with Claude 3.5 Sonnet, DeepSeek V3, and Google Perspective API. Their analysis covers 1.3 million synthetic sentences that make statements about 125 groups, referred to with both neutral terms and slurs, spanning categories from religion to disability to age. Each sentence combines "all" or "some," a group term, and a hate speech phrase.
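The study's own group lists and phrasings are not reproduced here, but the template described above, a quantifier plus a group term plus a phrase, can be illustrated with a short Python sketch; every term below is a placeholder rather than an item from the actual dataset.

```python
from itertools import product

# Placeholder vocabulary; the study's 125 group terms and its hateful,
# neutral, and positive phrases are not reproduced here.
quantifiers = ["All", "Some"]
group_terms = ["teachers", "immigrants", "gamers"]
predicates = [
    "should be banned from public life",  # stand-in for a hateful phrase
    "are great people",                   # stand-in for a positive control phrase
]

# Crossing the three slots is how a modest vocabulary expands into
# roughly 1.3 million test sentences in the study's full design.
sentences = [f"{q} {g} {p}." for q, g, p in product(quantifiers, group_terms, predicates)]

for sentence in sentences:
    print(sentence)
```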
Here are three takeaways from their research:
The models make different decisions about the same content
"The research shows that content moderation systems have dramatic inconsistencies when evaluating identical hate speech content, with some systems flagging content as harmful while others deem it acceptable," Fasching says. This is a critical issue for the public, Lelkes says, because inconsistent moderation can erode trust and create perceptions of bias.
Fasching and Lelkes also found variation in the internal consistency of models: One demonstrated high predictability for how it would classify similar content, another produced different results for similar content, and others showed a more measured approach, neither over-flagging nor under-detecting content as hate speech. "These differences highlight the challenge of balancing detection accuracy with avoiding over-moderation," the researchers write.
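The paper's metrics are not reproduced here, but one simple way to see the kind of disagreement and over- or under-flagging described above is to record each system's binary decision for every sentence and compare the columns; the sketch below uses hypothetical flags and generic model names.

```python
import itertools
import pandas as pd

# Hypothetical 0/1 "flagged as hate speech" decisions for the same five sentences;
# real data would come from the moderation systems being compared.
flags = pd.DataFrame({
    "model_a": [1, 1, 0, 1, 0],
    "model_b": [1, 0, 0, 1, 1],
    "model_c": [0, 0, 0, 1, 0],
})

# Pairwise agreement: the fraction of sentences two models judge identically.
for m1, m2 in itertools.combinations(flags.columns, 2):
    agreement = (flags[m1] == flags[m2]).mean()
    print(f"{m1} vs {m2}: {agreement:.0%} agreement")

# Each model's overall flag rate hints at over- or under-moderation.
print(flags.mean().rename("flag_rate"))
```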
The variations are especially pronounced for certain groups
"These inconsistencies are especially pronounced for specific demographic groups, leaving some communities more vulnerable to online harm than others," Fasching says.
He and Lelkes found that hate speech evaluations across the seven systems were more similar for statements about groups based on sexual orientation, race, and gender, while inconsistencies intensified for groups based on education level, personal interest, and economic class. This suggests "that systems generally recognize hate speech targeting traditional protected classes more readily than content targeting other groups," the authors write.
Models handle neutral and positive sentences differently
A minority of the 1.3 million synthetic sentences were neutral or positive, included to gauge how often the models falsely flag hate speech and how they handle pejorative terms in non-hateful contexts, such as "All [slur] are great people."
The researchers found that Claude 3.5 Sonnet and Mistral's specialized content classification system treat slurs as harmful across the board, whereas the other systems weigh context and intent. The authors say they were surprised to find that every model fell consistently into one camp or the other, with little middle ground.
More information: Neil Fasching et al, Model-Dependent Moderation: Inconsistencies in Hate Speech Detection Across LLM-based Systems, Findings of the Association for Computational Linguistics: ACL 2025 (2025). DOI: 10.18653/v1/2025.findings-acl.1144
Provided by University of Pennsylvania