July 29, 2025
Why AI leaderboards are inaccurate and how to fix them
Gaby Clark, scientific editor
Andrew Zinin, lead editor

Faulty ranking mechanisms used in AI leaderboards can be corrected with approaches evaluated at the University of Michigan.
In their study, U-M researchers assessed the performance of four ranking methods used in popular online AI leaderboards, such as Chatbot Arena, as well as in sports and gaming leaderboards. They found that both the choice of ranking method and how it is implemented can change the results, even with the same crowdsourced dataset of model performance. Based on their results, the researchers developed guidelines to help leaderboards better reflect AI models' true performance.
"Large companies keep announcing newer and larger gen AI models, but how do you know which model is truly the best if your evaluation methods aren't accurate or well studied?" said Lingjia Tang, associate professor of computer science and engineering and a co-corresponding author of the study.
"Society is increasingly interested in adopting this technology. To do that effectively, we need robust methods to evaluate AI for a variety of use cases. Our study identifies what makes an effective AI ranking system, and provides guidelines on when and how to use them."
Gen AI models are difficult to evaluate because judgments on AI-generated content can be subjective. Some leaderboards evaluate how accurately AI models perform specific tasks, such as answering multiple choice questions, but those leaderboards don't assess how well an AI creates diverse content without a single right answer.
To evaluate more open-ended output, other leaderboards, such as the popular Chatbot Arena, ask people to rate the generated content in head-to-head comparisons, in what the researchers call an "LLM Smackdown." Human contributors submit a prompt to two randomly selected AI models without knowing which models they are, then record their preferred answer in the leaderboard's database, which in turn feeds the ranking system.
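For illustration, that comparison protocol can be sketched in a few lines of Python. The function and field names below are hypothetical, not Chatbot Arena's actual interface or schema:

```python
import random
from dataclasses import dataclass

@dataclass
class Comparison:
    # One blind head-to-head vote; field names are illustrative only.
    model_a: str
    model_b: str
    winner: str  # "a", "b", or "tie"

def run_blind_round(prompt, models, generate, ask_rater):
    # Pick two distinct models at random; the rater never learns which is which.
    name_a, name_b = random.sample(models, 2)
    answer_a = generate(name_a, prompt)   # generate() stands in for calling a model
    answer_b = generate(name_b, prompt)
    verdict = ask_rater(prompt, answer_a, answer_b)  # rater returns "a", "b", or "tie"
    return Comparison(model_a=name_a, model_b=name_b, winner=verdict)
```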
But the rankings can depend on how those systems are implemented. Chatbot Arena once used a ranking system called Elo, which is also commonly used to rank chess players and athletes. Elo has settings that control how drastically a win or a loss shifts the leaderboard's rankings, and how that impact changes with a player's or model's age. In theory, these features make a ranking system more flexible, but the proper settings for evaluating AI aren't always obvious.
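For reference, the core Elo update is shown below in Python. The k argument is the kind of user-chosen setting the article refers to; the default of 32 is a common chess value, used here purely for illustration:

```python
def expected_score(rating_a, rating_b):
    # Probability that A beats B under the standard Elo logistic curve.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32):
    # score_a: 1 if A won, 0 if A lost, 0.5 for a tie.
    # k is the "K-factor": how drastically a single result moves the ratings.
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

Because each result nudges ratings sequentially, the outcome depends on both the K-factor and the order in which games arrive, which is part of why configuration matters so much.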

"In chess and sport matches, there's a logical order of games that proceed as the players' skills change over their careers. But AI models don't change between releases, and they can instantly and simultaneously play many games," said Roland Daynauth, U-M doctoral student in computer science and engineering and the study's first author.
To help prevent accidental misuse, the researchers evaluated each ranking system by feeding it a portion of two crowdsourced datasets of AI model performance, one from Chatbot Arena and another previously collected by the researchers. They then checked how accurately the resulting rankings matched the win rates in a withheld portion of the datasets.
They also checked to see how sensitive each system's rankings were to user-defined settings, and whether the rankings followed the logic of all the pairwise comparisons: If A beats B, and B beats C, then A must be ranked higher than C.
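The two checks can be roughly sketched as follows, using hypothetical data structures rather than the paper's exact protocol: score a ranking by how often it predicts held-out head-to-head outcomes, and flag model triples that break the A-beats-B, B-beats-C ordering.

```python
from itertools import permutations

def heldout_agreement(ranking, heldout):
    # ranking: model -> score from any rating system (higher is better).
    # heldout: (model_a, model_b, winner) tuples not used to fit the ranking.
    correct = total = 0
    for a, b, winner in heldout:
        if winner == "tie":
            continue
        predicted = a if ranking[a] > ranking[b] else b
        actual = a if winner == "a" else b
        correct += predicted == actual
        total += 1
    return correct / total if total else float("nan")

def transitivity_violations(ranking, win_rates):
    # win_rates[(x, y)]: empirical rate at which x beats y in the data.
    # Flags triples where A beats B and B beats C head-to-head, yet the
    # ranking places C above A.
    violations = []
    for a, b, c in permutations(list(ranking), 3):
        if win_rates.get((a, b), 0) > 0.5 and win_rates.get((b, c), 0) > 0.5:
            if ranking[c] > ranking[a]:
                violations.append((a, b, c))
    return violations
```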
They found that Glicko, a ranking system used in e-sports, tends to produce the most consistent results, especially when the number of comparisons per model is uneven. Other ranking systems, such as the Bradley-Terry system that Chatbot Arena implemented in December 2023, could also be accurate, but only when each model had roughly the same number of comparisons. Without that balance, such a system could allow a newer model to appear stronger than is warranted.
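Glicko differs from Elo in that it also tracks a rating deviation that shrinks as a player or model accumulates games, which plausibly helps when comparison counts are uneven. Below is a minimal Glicko-1 single-period update following Glickman's published formulas (omitting the step that inflates the deviation between rating periods); it is a generic sketch, not the study's implementation:

```python
import math

Q = math.log(10) / 400.0  # scale constant from the Glicko-1 formulas

def g(rd):
    # Down-weights opponents whose own ratings are uncertain.
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q * rd) ** 2 / math.pi ** 2)

def glicko_update(rating, rd, results):
    # One Glicko-1 rating-period update.
    # results: list of (opp_rating, opp_rd, score) with score 1, 0.5, or 0.
    d2_inv = 0.0
    delta = 0.0
    for opp_rating, opp_rd, score in results:
        g_j = g(opp_rd)
        e_j = 1.0 / (1.0 + 10 ** (-g_j * (rating - opp_rating) / 400.0))
        d2_inv += (Q ** 2) * (g_j ** 2) * e_j * (1.0 - e_j)
        delta += g_j * (score - e_j)
    denom = 1.0 / rd ** 2 + d2_inv
    new_rating = rating + (Q / denom) * delta
    new_rd = math.sqrt(1.0 / denom)  # deviation shrinks as evidence accumulates
    return new_rating, new_rd
```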
"Just because a model comes onto the scene and beats a grandmaster doesn't necessarily mean it's the best model. You need many, many games to know what the truth is," said Jason Mars, U-M associate professor of computer science and engineering and a co-corresponding author of the study.
In contrast, the rankings produced by the Elo system, as well as by Markov chain methods like those Google uses to rank pages in web search, were highly dependent on how users configured the system. The Bradley-Terry system has no user-defined settings, so it could be the best option for large datasets with a similar number of comparisons for each AI model.
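Bradley-Terry, by contrast, fits one strength value per model directly from all pairwise outcomes at once, with no K-factor and no dependence on game order. The sketch below uses the standard minorization-maximization updates and is illustrative only; it is not necessarily how Chatbot Arena or the study implements the model:

```python
from collections import defaultdict

def fit_bradley_terry(comparisons, iters=200, eps=1e-9):
    # comparisons: list of (winner, loser) model-name pairs; ties omitted for brevity.
    # iters and eps are numerical details, not rating hyperparameters; eps only
    # keeps the strength of a model that never wins from collapsing to zero.
    wins = defaultdict(int)        # total wins per model
    pair_games = defaultdict(int)  # games played per unordered model pair
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(
                n / (strength[i] + strength[j])
                for pair, n in pair_games.items() if i in pair
                for j in pair if j != i
            )
            updated[i] = (wins[i] + eps) / denom if denom else strength[i]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}  # normalize scale
    return strength  # higher strength implies a higher rank
```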
"There's no single right answer, so hopefully our analysis will help guide how we evaluate the AI industry moving forward," Tang said.
More information: Roland Daynauth et al., "Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat," ACL 2025: aclanthology.org/2025.acl-long.1265/
Provided by University of Michigan