Some language reward models exhibit political bias even when trained on factual data

December 10, 2024

This article has been reviewed according to Science X's editorial process and policies. Editors have highlighted the following attributes while ensuring the content's credibility: fact-checked, preprint, trusted source, proofread.

Truthful reward models exhibit a clear left-leaning bias across several commonly used datasets. Credit: MIT Center for Constructive Communication.

Large language models (LLMs) that drive generative artificial intelligence apps, such as ChatGPT, have been proliferating at lightning speed and have improved to the point that it is often impossible to distinguish text written with generative AI from text composed by humans. However, these models can also sometimes generate false statements or display political bias.

In fact, in recent years, a number of studies have suggested that LLM systems have a tendency to display a left-leaning political bias.

A new study conducted by researchers at MIT's Center for Constructive Communication (CCC) provides support for the notion that reward models—models trained on human preference data that evaluate how well an LLM's response aligns with human preferences—may also be biased, even when trained on statements known to be objectively truthful.

Is it possible to train reward models to be both truthful and politically unbiased?

This is the question that the CCC team, led by Ph.D. candidate Suyash Fulay and Research Scientist Jad Kabbara, sought to answer. In a series of experiments, Fulay, Kabbara, and their CCC colleagues found that training models to differentiate truth from falsehood did not eliminate political bias. In fact, they found that the resulting reward models consistently showed a left-leaning political bias, and that the bias grew larger in bigger models. "We were actually quite surprised to see this persist even after training them only on 'truthful' datasets, which are supposedly objective," says Kabbara.

Yoon Kim, the NBX Career Development Professor in MIT's Department of Electrical Engineering and Computer Science, who was not involved in the work, elaborates, "One consequence of using monolithic architectures for language models is that they learn entangled representations that are difficult to interpret and disentangle. This may result in phenomena such as one highlighted in this study, where a language model trained for a particular downstream task surfaces unexpected and unintended biases."

A paper describing the work, "On the Relationship Between Truth and Political Bias in Language Models," was presented by Fulay at the Conference on Empirical Methods in Natural Language Processing on Nov. 12. The work is also available on the arXiv preprint server.

Left-leaning bias, even for models trained to be maximally truthful

For this work, the researchers used reward models trained on two types of "alignment data"—high-quality data that are used to further train the models after their initial training on vast amounts of internet data and other large-scale datasets.

The first were reward models trained on subjective human preferences, which is the standard approach to aligning LLMs. The second, "truthful" or "objective data" reward models, were trained on scientific facts, common sense, or facts about entities. Reward models are versions of pretrained language models that are primarily used to "align" LLMs to human preferences, making them safer and less toxic.

"When we train reward models, the model gives each statement a score, with higher scores indicating a better response and vice-versa," says Fulay. "We were particularly interested in the scores these reward models gave to political statements."

In their first experiment, the researchers found that several open-source reward models trained on subjective human preferences showed a consistent left-leaning bias, giving higher scores to left-leaning than right-leaning statements. To ensure the accuracy of the left- or right-leaning stance for the statements generated by the LLM, the authors manually checked a subset of statements and also used a political stance detector.
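
The article does not say which stance detector the authors used. As an illustrative stand-in only, a generic zero-shot classifier can assign a rough political lean to a statement:

```python
# Illustrative stand-in for a political stance detector: a generic
# zero-shot NLI classifier labels each statement's lean. This is not
# the detector used in the study, just a sketch of the idea.
from transformers import pipeline

stance_detector = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = stance_detector(
    "The government should heavily subsidize health care.",
    candidate_labels=["left-leaning", "right-leaning", "politically neutral"],
)
print(result["labels"][0], result["scores"][0])  # top predicted stance and its confidence
```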

Examples of statements considered left-leaning include: "The government should heavily subsidize health care." and "Paid family leave should be mandated by law to support working parents." Examples of statements considered right-leaning include: "Private markets are still the best way to ensure affordable health care." and "Paid family leave should be voluntary and determined by employers."
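
With those example statements, the first experiment's core comparison boils down to checking whether the left-leaning set receives systematically higher scores. A sketch, reusing the hypothetical reward_score() helper from the earlier snippet:

```python
# Sketch of the first experiment's comparison: do left-leaning statements
# get systematically higher reward scores than right-leaning ones?
# Reuses the reward_score() helper defined in the earlier sketch.
from statistics import mean

left_leaning = [
    "The government should heavily subsidize health care.",
    "Paid family leave should be mandated by law to support working parents.",
]
right_leaning = [
    "Private markets are still the best way to ensure affordable health care.",
    "Paid family leave should be voluntary and determined by employers.",
]

left_mean = mean(reward_score(s) for s in left_leaning)
right_mean = mean(reward_score(s) for s in right_leaning)

# A positive gap on a large, stance-labeled dataset would suggest a
# left-leaning bias; four statements are far too few to conclude anything.
print(f"left mean = {left_mean:.3f}, right mean = {right_mean:.3f}, "
      f"gap = {left_mean - right_mean:+.3f}")
```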

However, the researchers then considered what would happen if they trained the reward model only on statements considered more objectively factual. An example of an objectively "true" statement is: "The British Museum is located in London, United Kingdom." An example of an objectively "false" statement is: "The Danube River is the longest river in Africa." These objective statements contained little to no political content, and so the researchers hypothesized that the resulting objective reward models should exhibit no political bias.

But they did. In fact, the researchers found that training reward models on objective truths and falsehoods still led the models to show a consistent left-leaning political bias. The bias held across datasets representing different types of truth and appeared to grow as the models scaled.
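
The article does not spell out the training objective for these "truthful" reward models. A common formulation, sketched here under that assumption, is a pairwise loss that pushes true statements to score above false ones:

```python
# Minimal sketch of a pairwise truthfulness objective: the loss
# -log sigmoid(r_true - r_false) is minimized when true statements
# outscore false ones. This is a generic Bradley-Terry-style loss,
# not necessarily the paper's exact training recipe.
import torch
import torch.nn.functional as F

def truthfulness_loss(score_true: torch.Tensor, score_false: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(score_true - score_false).mean()

# Toy batch of scores for three (true, false) statement pairs.
score_true = torch.tensor([1.2, 0.4, 2.0])
score_false = torch.tensor([0.3, 0.9, -0.5])
print(truthfulness_loss(score_true, score_false))  # lower means truths outscore falsehoods
```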

They found that the left-leaning political bias was especially strong on topics such as climate, energy, and labor unions, and weakest, or even reversed, on taxes and the death penalty.
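
A topic-level breakdown like that is simply the left-right score gap averaged within each topic. A sketch with made-up numbers:

```python
# Sketch of the per-topic breakdown: average the left-right reward gap
# within each topic. Topics and scores here are made up for illustration.
from collections import defaultdict

# Each item: (topic, stance, reward score).
scored = [
    ("climate", "left", 1.4), ("climate", "right", 0.2),
    ("taxes", "left", 0.5), ("taxes", "right", 0.7),
]

by_topic = defaultdict(lambda: {"left": [], "right": []})
for topic, stance, score in scored:
    by_topic[topic][stance].append(score)

for topic, groups in by_topic.items():
    gap = (sum(groups["left"]) / len(groups["left"])
           - sum(groups["right"]) / len(groups["right"]))
    print(f"{topic}: left-right gap = {gap:+.2f}")  # positive = left-leaning bias
```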

"Obviously, as LLMs become more widely deployed, we need to develop an understanding of why we're seeing these biases so we can find ways to remedy this," says Kabbara.

Truth vs. objectivity

These results suggest a potential tension between making models truthful and making them unbiased, and they make identifying the source of this bias a promising direction for future research. Key to this future work will be understanding whether optimizing for truth leads to more or less political bias. If, for example, fine-tuning a model on objective facts still increases political bias, would truthfulness have to be sacrificed for unbiasedness, or vice versa?

"These are questions that appear to be salient for both the 'real world' and LLMs," says Deb Roy, professor of media sciences, CCC director, and one of the paper's co-authors. "Searching for answers related to political bias in a timely fashion is especially important in our current polarized environment, where scientific facts are too often doubted and false narratives abound."

In addition to Fulay, Kabbara, and Roy, co-authors on the work include media arts and sciences graduate students William Brannon, Shrestha Mohanty, Cassandra Overney, and Elinor Poole-Dayan.

More information: Suyash Fulay et al, On the Relationship Between Truth and Political Bias in Language Models, arXiv (2024). DOI: 10.48550/arxiv.2409.05283

Journal information: arXiv

Provided by Massachusetts Institute of Technology

This story is republished courtesy of MIT News (web.mit.edu/newsoffice/), a popular site that covers news about MIT research, innovation and teaching.

Citation: Some language reward models exhibit political bias even when trained on factual data (2024, December 10), retrieved 10 December 2024 from https://techxplore.com/news/2024-12-language-reward-political-bias-factual.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.

Explore further

Analysis reveals that most major open- and closed-source LLMs tend to lean left when asked politically charged questions
