August 4, 2025
AI models simulate human subjects to aid social science research, but limits remain

LLMs that emulate human speech are being used to cost-effectively test assumptions and run pilot studies, producing promising early results. But researchers note that human data remains essential.
By improving our understanding of human behavior, social science research helps businesses design successful marketing programs, ensures governmental policies are responsive to people's needs, and supports the development of appropriate strategies for fighting disease and maintaining public safety.
This research spans the fields of economics, psychology, sociology, and political science and uses a variety of approaches, from fieldwork to online polling, randomized controlled trials, focus groups, observation, and more.
But all social science research is complicated by its subject: people.
"We're not dealing with petri dishes or plants that sit still and allow us to experiment over long periods of time," says Jacy Anthis, visiting scholar at the Stanford Institute for Human-Centered AI (HAI) and a Ph.D. candidate at the University of Chicago. "And because we study human subjects, this research can be time-consuming, expensive, and hard to replicate."
With advances in AI, though, social scientists can now simulate human data. Large language models (LLMs) that emulate human speech can roleplay expert social scientists or diverse human subjects to inexpensively test assumptions, run pilot studies, estimate optimal sample sizes, and leverage the statistical power that a combination of human and LLM subjects provides.
Yet in some ways LLMs remain a poor stand-in for human subjects, Anthis notes in a new paper posted to the arXiv preprint server: their answers are often less varied than humans', biased, or sycophantic, and they don't generalize well to new settings.
Still, Anthis and others are optimistic about using LLMs for social science research since some rough-and-ready methods have already produced promising results.
If other researchers heed his rallying cry, Anthis says, one more year of work could lead to substantial improvements. "As technology and society rapidly evolve, we need social science tools like simulations that can keep pace."
Evaluating AI as human proxy
While AI has made major leaps on popular benchmarks, its ability to mimic humans is a more recent development. To determine how well it predicts human behavior, Luke Hewitt, a senior research fellow at Stanford PACS, and colleagues Robb Willer, Ashwini Ashokkumar, and Isaias Ghezae tested LLMs against previous randomized controlled trials (RCTs): Could the LLMs successfully replicate the results of trials done with human subjects?
Typical RCTs involve a "treatment"—some piece of information or action that scholars expect to impact a person's attitudes or behavior. So, for example, a researcher might ask participants to read a piece of text, watch a short video, or participate in a game about a topic (climate change or vaccines, for example), then ask them their opinion about that topic and compare their answers to those of a control group that did not undergo the treatment. Did their opinions shift compared to the controls? Are they more likely to change, start, or stop relevant behaviors?
For their project, Hewitt and his colleagues used the language model GPT-4 to simulate how a representative sample of Americans would respond to 476 randomized treatments that had been previously studied. They found that, in online survey experiments, the LLM's simulated responses predicted treatment effects about as accurately as human experts' forecasts, correlating strongly (0.85) with the measured effects.
This accuracy is impressive, Hewitt says. The team was especially encouraged to find the same level of accuracy even when replicating studies that were published after GPT-4 was trained. "Many would have expected to see the LLM succeed at simulating experiments that were part of its training data and fail on new ones it hadn't seen before," Hewitt says. "Instead, we found the LLM could make fairly accurate predictions even for entirely novel experiments."
Unfortunately, he says, newer models are more difficult to vet. That's not just because their training data includes more-recently conducted studies, but also because LLMs are starting to do their own web searches, giving them access to information they weren't trained on. To evaluate these models, scholars may need to create an archive of unpublished studies never before on the internet.
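To make this kind of benchmarking concrete, the sketch below compares hypothetical LLM-simulated treatment effects against measured effects and reports their correlation. The numbers are synthetic stand-ins generated for illustration, not data from Hewitt's study.

```python
# Illustrative sketch: correlate LLM-simulated treatment effects with
# measured effects from prior experiments. All numbers are synthetic.
import numpy as np

rng = np.random.default_rng(0)

# Measured standardized treatment effects for 20 hypothetical experiments.
measured_effects = rng.normal(loc=0.15, scale=0.10, size=20)

# LLM-simulated effects for the same experiments: correlated with the
# measured effects but noisy, mimicking an imperfect simulator.
simulated_effects = 0.8 * measured_effects + rng.normal(scale=0.05, size=20)

# Pearson correlation between simulated and measured effects
# (the study reports roughly 0.85 across 476 real treatments).
r = np.corrcoef(simulated_effects, measured_effects)[0, 1]
print(f"correlation between simulated and measured effects: {r:.2f}")
```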
AI is narrow-minded
While LLMs show potential accuracy in replicating studies, they face other major challenges that scholars would need to find ways to address.
One is distributional alignment: LLMs have a remarkable inability to match the variation of responses from humans. For example, in response to a "pick a number" game, LLMs will often choose a narrower (and oddly predictable) range of answers than people will. "They can misportray and flatten a lot of groups," says Nicole Meister, a graduate student in electrical engineering at Stanford.
In a recent paper, Meister and her colleagues evaluated different ways to prompt for and measure the distribution of an LLM's responses to various questions. For example, an LLM might be prompted to answer a question about the morality of drinking alcohol by selecting one of four multiple choice options: A, B, C, or D.
An LLM typically outputs just one answer, but one approach to measuring the distribution of possible answers is to look one layer deeper in the model to see the model's assessed likelihood of each of the four answers before it makes a final choice. But it turns out that this so-called "log probability" distribution is not very similar to human distributions, Meister says. Other approaches yielded more human-like variation: asking the LLM to simulate 30 people's answers, or asking the LLM to verbalize the likely distribution.
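The sketch below illustrates how such elicitation strategies can be compared: it takes made-up answer distributions over the four options (standing in for log-probability, simulate-30-people, and verbalized elicitations) and measures how far each is from a human survey distribution using total variation distance. The numbers and the choice of distance metric are assumptions for illustration, not those used in Meister's paper.

```python
# Illustrative comparison of elicited LLM answer distributions (over
# options A-D) against a human distribution. All values are invented.
import numpy as np

human = np.array([0.10, 0.35, 0.40, 0.15])         # hypothetical survey distribution

# Hypothetical distributions elicited from the model in three ways:
log_prob = np.array([0.02, 0.08, 0.85, 0.05])      # renormalized token probabilities
simulate_30 = np.array([0.07, 0.30, 0.43, 0.20])   # fractions from 30 simulated respondents
verbalized = np.array([0.10, 0.30, 0.45, 0.15])    # percentages the model states directly

def total_variation(p, q):
    """Total variation distance between two distributions: 0 = identical, 1 = disjoint."""
    return 0.5 * np.abs(p - q).sum()

for name, dist in [("log probability", log_prob),
                   ("simulate 30 people", simulate_30),
                   ("verbalized", verbalized)]:
    print(f"{name:>20}: distance from human distribution = {total_variation(human, dist):.2f}")
```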
The team saw even better results when they provided the LLM with distributional information about how a group typically responds to a related prompt, an approach Meister calls "few-shot" steering. For example, an LLM responding to a question about how Democrats and Republicans feel about the morality of drinking alcohol would better align to real human responses if the model was primed with Democrats' and Republicans' distribution of opinions regarding religion or drunk driving.
The few-shot approach works best for opinion-based questions and less well for preferences, Meister notes. "If someone thinks that self-driving cars are bad, they will likely think that technology is bad, and the model will make that leap," she says. "But if I like war books, it doesn't mean that I don't like mystery books, so it's harder for an LLM to make that prediction."
That's a growing concern as some companies start to use LLMs to predict things like product preferences. "LLMs might not be the correct tool for this purpose," she says.
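As a rough illustration of few-shot steering, the sketch below builds a prompt that primes a model with a group's response distribution on a related question before asking the target question. The prompt wording, question texts, and percentages are invented for illustration; they are not the prompts used in the study.

```python
# Illustrative "few-shot" steering prompt: supply a related distribution,
# then ask for a verbalized distribution on the target question.
def few_shot_steering_prompt(group, related_question, related_distribution,
                             target_question, options):
    """Build a prompt that primes the model with a related group-level distribution."""
    related = ", ".join(f"{answer}: {pct}%" for answer, pct in related_distribution.items())
    option_list = ", ".join(options)
    return (
        f"Among {group}, responses to '{related_question}' were distributed as "
        f"follows: {related}.\n"
        f"Given this, estimate what percentage of {group} would choose each option "
        f"({option_list}) for the question: '{target_question}'."
    )

prompt = few_shot_steering_prompt(
    group="Democrats",
    related_question="Is drunk driving morally wrong?",
    related_distribution={"Always wrong": 88, "Sometimes wrong": 10, "Not wrong": 2},
    target_question="Is drinking alcohol morally acceptable?",
    options=["Always", "Sometimes", "Never"],
)
print(prompt)  # this string would be sent to the LLM and its verbalized distribution parsed
```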
Other challenges: Validation, bias, sycophancy, and more
As with most AI technologies, the use of LLMs in the social sciences could be harmful if people use LLM simulations to replace human experiments, or if they use them in ways that are not well validated, Hewitt says. When using a model, people need to have some sense of whether they should trust it: Is their use case close enough to other uses the model has been validated on? "We're making progress, but in most instances I don't think we have that level of confidence quite yet," Hewitt says.
It will also be important, Hewitt says, to better quantify the uncertainty of model predictions. "Without uncertainty quantification," he says, "people might trust a model's predictions insufficiently in some cases and too much in others."
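One generic way to attach uncertainty to a simulated estimate, sketched below with synthetic data, is to bootstrap the simulated respondents and report a confidence interval on the predicted treatment effect. This is an illustration of the general idea, not the specific method Hewitt proposes.

```python
# Illustrative bootstrap interval for a simulated treatment effect.
# All outcomes are synthetic.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical outcomes (e.g., support for a policy on a 1-7 scale)
# for LLM-simulated respondents in the treatment and control conditions.
treated = rng.normal(4.6, 1.2, size=300)
control = rng.normal(4.3, 1.2, size=300)

# Resample the simulated respondents many times to see how much the
# predicted treatment effect moves around.
boot_effects = []
for _ in range(2000):
    t = rng.choice(treated, size=treated.size, replace=True)
    c = rng.choice(control, size=control.size, replace=True)
    boot_effects.append(t.mean() - c.mean())

low, high = np.percentile(boot_effects, [2.5, 97.5])
print(f"predicted effect: {treated.mean() - control.mean():.2f} "
      f"(95% interval {low:.2f} to {high:.2f})")
```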
According to Anthis, other key challenges to using LLMs for social science research include:
- Bias: Models systematically present particular social groups inaccurately, often relying on racial, ethnic, and gender stereotypes.
- Sycophancy: Models designed as "assistants" tend to offer answers that may seem helpful to people, regardless of whether they are accurate.
- Alienness: Models' answers may resemble what a human might say, but on a deeper level are utterly alien. For example, an LLM might say 3.11 is greater than 3.9, or it might solve a simple mathematical problem using a bizarrely complex method.
- Generalization: LLMs don't accurately generalize beyond the data at hand, so social scientists may struggle using them to study new populations or large group behavior.
These challenges are tractable, Anthis says. Researchers can already apply certain tricks to alleviate bias and sycophancy; for example, interview-based simulation, asking the LLM to roleplay an expert, or fine-tuning a model to optimize for social simulation. Addressing the alienness and generalization issues is more challenging and may require a general theory of how LLMs work, which is currently lacking, he says.
Current best practice? A hybrid approach
Despite the challenges, today's LLMs can still play a role in social science research. David Broska, a sociology graduate student at Stanford, has developed a general methodology for using LLMs responsibly that combines human subjects and LLM predictions in a mixed subjects design.
"We now have two data types," he says. "One is human responses, which are very informative but expensive, and the other, LLM predictions, is not so informative but cheap."
The idea is to first run a small pilot study with humans and also run the same experiment with an LLM to see how interchangeable the results are. The approach, called prediction-powered inference, combines the two data resources effectively while preventing the LLM from introducing bias.
"We want to keep what the human subjects tell us and increase our confidence in the overall treatment effect while also statistically preventing the LLM from diminishing the credibility of our results," he says.
An initial hybrid pilot study can also provide a power analysis—a concrete estimate of the proportion of human and LLM subjects that will be most likely to generate a statistically meaningful result, Broska says. This sets researchers up for success in a hybrid study that could potentially be less expensive.
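The sketch below shows the flavor of such an allocation analysis: given pilot-based estimates of the relevant variances and assumed per-subject costs, it searches for the mix of human and LLM-simulated subjects that minimizes the standard error of a prediction-powered mean under a fixed budget. The variance formula is the standard one for a prediction-powered mean; the cost and variance values are made up.

```python
# Illustrative allocation search for a hybrid human + LLM study. Assumed values.
import numpy as np

var_gap = 0.25        # variance of (human outcome - LLM prediction) from the pilot
var_llm_pred = 1.20   # variance of LLM-only predictions
cost_human = 5.00     # assumed cost per human subject, in dollars
cost_llm = 0.05       # assumed cost per LLM-simulated subject
budget = 2000.0

# Standard error of a prediction-powered mean is roughly
# sqrt(var_gap / n_human + var_llm_pred / n_llm); search the budget line
# for the allocation that minimizes it.
best = None
for n_human in range(20, 400):
    n_llm = int((budget - n_human * cost_human) / cost_llm)
    if n_llm <= 0:
        break
    se = np.sqrt(var_gap / n_human + var_llm_pred / n_llm)
    if best is None or se < best[0]:
        best = (se, n_human, n_llm)

se, n_h, n_l = best
print(f"lowest standard error {se:.3f} with {n_h} human and {n_l} LLM-simulated subjects")
```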
More broadly, Hewitt sees cases where LLM simulations are already useful. "If I was designing a study right now to test an intervention for shifting people's attitudes about climate in relation to a news event or new policy, or to increase public trust in vaccines, I would definitely first simulate that experiment in an LLM and use the results to augment my intuition."
Trust in the model is less important if the LLM is only helping with selecting experimental conditions or the wording of a survey question, Hewitt says. Human subjects are still paramount.
"At the end of the day, if you're studying human behavior, your experiment needs to ground out in human data."
More information: Jacy Reese Anthis et al., LLM Social Simulations Are a Promising Research Method, arXiv (2025). DOI: 10.48550/arxiv.2504.02234
Journal information: arXiv
Provided by Stanford University