CRYPTOREPORTCLUB
Why AI leaderboards are inaccurate and how to fix them

July 29, 2025
Gaby Clark

scientific editor

Andrew Zinin

lead editor

Editors' notes

This article has been reviewed according to Science X's editorial process and policies. Editors have highlighted the following attributes while ensuring the content's credibility:

fact-checked

trusted source

proofread

Online leaderboards evaluate AI models by asking people to rate the generated content in head-to-head comparisons, in what the researchers call an "LLM Smackdown." A faulty ranking system could give a model the championship belt for the wrong reasons. Credit: Generated by Google Gemini 2.5 Flash and edited by Derek Smith

Faulty ranking mechanisms used in AI leaderboards can be overcome through approaches evaluated at the University of Michigan.

In their study, U-M researchers assessed the performance of four ranking methods used in popular online AI leaderboards, such as Chatbot Arena, as well as in sports and gaming leaderboards. They found that both the choice of ranking method and how it is implemented can change the resulting rankings, even with the same crowdsourced dataset of model performance. From their results, the researchers developed guidelines to help leaderboards represent the AI models' true performance.

"Large companies keep announcing newer and larger gen AI models, but how do you know which model is truly the best if your evaluation methods aren't accurate or well studied?" said Lingjia Tang, associate professor of computer science and engineering and a co-corresponding author of the study.

"Society is increasingly interested in adopting this technology. To do that effectively, we need robust methods to evaluate AI for a variety of use cases. Our study identifies what makes an effective AI ranking system, and provides guidelines on when and how to use them."

Gen AI models are difficult to evaluate because judgments on AI-generated content can be subjective. Some leaderboards evaluate how accurately AI models perform specific tasks, such as answering multiple choice questions, but those leaderboards don't assess how well an AI creates diverse content without a single right answer.

To evaluate more open-ended output, other leaderboards, such as the popular Chatbot Arena, ask people to rate the generated content in head-to-head comparisons, in what the researchers call an "LLM Smackdown." The human contributors blindly submit a prompt to two random AI models, then record their preferred answer in the leaderboard's database, which is then fed into the ranking system.

But the rankings can depend on the implementation of the systems. Chatbot Arena once used a ranking system called Elo, which is also commonly used to rank chess players and athletes. It has settings that allow users to set how drastically a win or a loss changes the leaderboard's rankings, and how that impact changes based on the player or model's age. In theory, these features allow a ranking system to be more flexible, but the proper settings for evaluating AI aren't always obvious.
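
To make those settings concrete, here is a minimal sketch of a textbook Elo update in Python. The K-factor of 32, the starting rating of 1000 and the function names are illustrative assumptions, not Chatbot Arena's actual configuration. It replays the same four toy results in two different orders to show that the final ratings depend on the order of play, which has no natural meaning for AI models.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def run_elo(matches, k=32.0, start=1000.0):
    """Replay (winner, loser) pairs in order and return the final ratings."""
    ratings = {}
    for winner, loser in matches:
        r_w = ratings.setdefault(winner, start)
        r_l = ratings.setdefault(loser, start)
        # k controls how drastically one result moves both ratings.
        gain = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + gain
        ratings[loser] = r_l - gain
    return ratings


# Toy data (hypothetical): the same four results, replayed in two orders.
matches = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "B")]
forward = run_elo(matches)
backward = run_elo(list(reversed(matches)))

for model in sorted(forward):
    print(model, round(forward[model], 1), round(backward[model], 1))
```

Changing `k` or the replay order shifts every rating, so two leaderboards built on identical votes can still disagree.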

Different ranking algorithms can produce different rankings with the same human evaluation data, making it difficult to determine which algorithm is appropriate for various use cases. Credit: Roland Daynauth et al.

"In chess and sport matches, there's a logical order of games that proceed as the players' skills change over their careers. But AI models don't change between releases, and they can instantly and simultaneously play many games," said Roland Daynauth, U-M doctoral student in computer science and engineering and the study's first author.

To help prevent accidental misuse, the researchers evaluated each rating system by feeding it a portion of two crowdsourced datasets of AI model performance: one from Chatbot Arena and another previously collected by the researchers. They then checked how closely each system's rankings matched the win rates in a withheld portion of the datasets.
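
The held-out check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's code: the ranking is built from raw win rates as a stand-in for a real rating system, the agreement metric (the share of withheld comparisons won by the higher-ranked model) is one plausible reading of "matching the win rate," and the data are invented.

```python
from collections import Counter


def rank_by_win_rate(comparisons):
    """Rank models by share of games won (a stand-in for a real rating system)."""
    wins, games = Counter(), Counter()
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return sorted(games, key=lambda m: wins[m] / games[m], reverse=True)


def heldout_agreement(ranking, heldout):
    """Fraction of withheld (winner, loser) pairs won by the higher-ranked model."""
    position = {model: i for i, model in enumerate(ranking)}  # 0 = best
    agree = sum(position[w] < position[l] for w, l in heldout)
    return agree / len(heldout)


# Toy data (hypothetical): one portion to fit the ranking, one withheld.
train = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")]
heldout = [("A", "C"), ("B", "C"), ("C", "B"), ("A", "B")]

ranking = rank_by_win_rate(train)
agreement = heldout_agreement(ranking, heldout)
print(ranking, agreement)
```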

They also checked to see how sensitive each system's rankings were to user-defined settings, and whether the rankings followed the logic of all the pairwise comparisons: If A beats B, and B beats C, then A must be ranked higher than C.
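
A rough Python sketch of that pairwise-logic check follows. It assumes "A beats B" means A won the majority of their head-to-head comparisons; the helper names and the toy data are hypothetical and not taken from the study.

```python
from collections import Counter
from itertools import permutations


def majority_winners(comparisons):
    """Return the set of (x, y) pairs where x beat y more often than y beat x."""
    counts = Counter(comparisons)
    return {(w, l) for (w, l), n in counts.items() if n > counts.get((l, w), 0)}


def transitivity_violations(ranking, beats):
    """List triples where A beats B and B beats C, yet A is ranked below C."""
    position = {model: i for i, model in enumerate(ranking)}  # 0 = best
    return [
        (a, b, c)
        for a, b, c in permutations(ranking, 3)
        if (a, b) in beats and (b, c) in beats and position[a] > position[c]
    ]


# Toy data (hypothetical): A beats B two games to one, and B beats C.
comparisons = [("A", "B"), ("A", "B"), ("B", "A"), ("B", "C")]
beats = majority_winners(comparisons)

print(transitivity_violations(["A", "B", "C"], beats))  # consistent ranking
print(transitivity_violations(["C", "A", "B"], beats))  # C is placed above A
```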

They found that Glicko, a ranking system used in e-sports, tends to produce the most consistent results, especially when the number of comparisons is uneven across models. Other ranking systems, such as the Bradley-Terry system that Chatbot Arena implemented in December 2023, could also be accurate, but only when each model had an equal number of comparisons. When the counts are uneven, such a system could allow a newer model to appear stronger than is warranted.
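
For readers unfamiliar with Bradley-Terry, the sketch below fits it in Python with the standard iterative (minorization-maximization) update. It is a generic textbook version, not Chatbot Arena's implementation, and the win counts are invented. The toy data are deliberately balanced, with every pair of models meeting 10 times, which is the setting where the article says this method works well. The iteration count is a numerical detail rather than a tuning knob.

```python
def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from wins[(i, j)] = times i beat j.

    Returns strengths that sum to 1. Assumes every model has at least one
    win and one loss; otherwise the estimate is not finite.
    """
    models = sorted({m for pair in wins for m in pair})
    p = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in models
                if j != i
            )
            new_p[i] = total_wins / denom
        norm = sum(new_p.values())
        p = {m: v / norm for m, v in new_p.items()}
    return p


# Toy data (hypothetical): each pair of models is compared exactly 10 times.
wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("B", "C"): 6, ("C", "B"): 4,
    ("A", "C"): 8, ("C", "A"): 2,
}
strength = fit_bradley_terry(wins)
# Modeled probability that A beats B in a single comparison.
p_a_beats_b = strength["A"] / (strength["A"] + strength["B"])
print(strength, p_a_beats_b)
```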

"Just because a model comes onto the scene and beats a grandmaster doesn't necessarily mean it's the best model. You need many, many games to know what the truth is," said Jason Mars, U-M associate professor of computer science and engineering and a co-corresponding author of the study.

In contrast, the rankings made by the Elo system, as well as by the Markov chains that Google uses to rank pages in a web search, were highly dependent on how users configured the system. The Bradley-Terry system lacks user-defined settings, so it could be the best option for large datasets with an equal number of comparisons for each AI model.
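
As an illustration of a configurable Markov-chain ranking, the Python sketch below uses a generic PageRank-style random walk in which each loss passes score from the loser to the winner. The damping value, the function name and the win counts are assumptions made for this example; the formulation evaluated in the study may differ.

```python
def markov_rank(wins, damping=0.85, iters=100):
    """PageRank-style scores from wins[(winner, loser)] = number of wins."""
    models = sorted({m for pair in wins for m in pair})
    n = len(models)
    score = {m: 1.0 / n for m in models}
    # Total games each model lost; its score is split among the models that beat it.
    losses = {m: sum(wins.get((w, m), 0) for w in models if w != m) for m in models}
    for _ in range(iters):
        new = {m: (1.0 - damping) / n for m in models}
        for loser in models:
            if losses[loser] == 0:
                # An unbeaten model has no outgoing links; spread its score evenly.
                for m in models:
                    new[m] += damping * score[loser] / n
                continue
            for winner in models:
                share = wins.get((winner, loser), 0) / losses[loser]
                new[winner] += damping * score[loser] * share
        score = new
    return score


# Toy data (hypothetical): each pair of models is compared exactly 10 times.
wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("B", "C"): 6, ("C", "B"): 4,
    ("A", "C"): 8, ("C", "A"): 2,
}
default_scores = markov_rank(wins)
low_damping_scores = markov_rank(wins, damping=0.5)
print(default_scores)
print(low_damping_scores)
```

The `damping` argument plays the role of a user-defined setting: lowering it pulls every score toward an even split, so the gaps between models shrink even though the votes are unchanged.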

"There's no single right answer, so hopefully our analysis will help guide how we evaluate the AI industry moving forward," Tang said.

More information: Roland Daynauth et al. Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat: aclanthology.org/2025.acl-long.1265/

Provided by University of Michigan

Citation: Why AI leaderboards are inaccurate and how to fix them (2025, July 29), retrieved 29 July 2025 from https://techxplore.com/news/2025-07-ai-leaderboards-inaccurate.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
