CRYPTOREPORTCLUB
Why AI leaderboards are inaccurate and how to fix them

July 29, 2025
Gaby Clark, scientific editor

Andrew Zinin, lead editor

Editors' notes

This article has been reviewed according to Science X's editorial process and policies. Editors have highlighted the following attributes while ensuring the content's credibility: fact-checked, trusted source, proofread.

Online leaderboards evaluate AI models by asking people to rate the generated content in head-to-head comparisons, in what the researchers call an "LLM Smackdown." A faulty ranking system could give a model the championship belt for the wrong reasons. Credit: Generated by Google Gemini 2.5 Flash and edited by Derek Smith

Faulty ranking mechanisms used in AI leaderboards can be overcome through approaches evaluated at the University of Michigan.

In their study, U-M researchers assessed the performance of four ranking methods used in popular online AI leaderboards, such as Chatbot Arena, as well as other sporting and gaming leaderboards. They found that the type and implementation of a ranking method can yield different results, even with the same crowdsourced dataset of model performance. From their results, the researchers developed guidelines for leaderboards to represent the AI models' true performance.

"Large companies keep announcing newer and larger gen AI models, but how do you know which model is truly the best if your evaluation methods aren't accurate or well studied?" said Lingjia Tang, associate professor of computer science and engineering and a co-corresponding author of the study.

"Society is increasingly interested in adopting this technology. To do that effectively, we need robust methods to evaluate AI for a variety of use cases. Our study identifies what makes an effective AI ranking system, and provides guidelines on when and how to use them."

Gen AI models are difficult to evaluate because judgments on AI-generated content can be subjective. Some leaderboards evaluate how accurately AI models perform specific tasks, such as answering multiple choice questions, but those leaderboards don't assess how well an AI creates diverse content without a single right answer.

To evaluate more open-ended output, other leaderboards, such as the popular Chatbot Arena, ask people to rate the generated content in head-to-head comparisons, in what the researchers call an "LLM Smackdown." The human contributors submit a prompt to two randomly chosen AI models without knowing which is which, then record their preferred answer in the leaderboard's database. Those records are then fed into the ranking system.

But the rankings can depend on how the ranking system is implemented. Chatbot Arena once used a ranking system called Elo, which is also commonly used to rank chess players and athletes. Elo has settings that control how drastically a win or a loss changes the leaderboard's rankings, and how that impact changes with the age of the player or model. In theory, these features make a ranking system more flexible, but the proper settings for evaluating AI aren't always obvious.
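As a rough illustration of the first of those settings, here is a minimal sketch of the standard Elo update in Python. The step size is usually called the K-factor. The function names and the K values are illustrative assumptions, not settings taken from Chatbot Arena or the study, and the age-based adjustment mentioned above is not modeled.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float):
    """Return updated (rating_a, rating_b) after one head-to-head comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    k is the user-chosen step size (the K-factor); it is an assumption here.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# The same single result, scored with two illustrative K settings.
small_k = elo_update(1000.0, 1000.0, 1.0, k=4)
large_k = elo_update(1000.0, 1000.0, 1.0, k=32)
print(small_k)  # (1002.0, 998.0)
print(large_k)  # (1016.0, 984.0)
```

One identical vote moves the two models 4 points apart with the small setting and 32 points apart with the large one, which is why the choice of setting can reshuffle a closely packed leaderboard.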

Different ranking algorithms can produce different rankings with the same human evaluation data, making it difficult to determine which algorithm is appropriate for various use cases. Credit: Roland Daynauth et al.

"In chess and sport matches, there's a logical order of games that proceed as the players' skills change over their careers. But AI models don't change between releases, and they can instantly and simultaneously play many games," said Roland Daynauth, U-M doctoral student in computer science and engineering and the study's first author.

To help prevent accidental misuse, the researchers evaluated each ranking system by feeding it a portion of two crowdsourced datasets of AI model performance: one from Chatbot Arena and another previously collected by the researchers. They then checked how closely each system's rankings matched the win rates in a withheld portion of the datasets.
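A held-out check of this kind can be sketched in a few lines of Python. The article does not give the study's exact metric, so the agreement score below (the share of withheld comparisons won by the higher-scored model) is an assumption chosen for simplicity, and the function names and the 20 percent split are hypothetical.

```python
import random


def split_matches(matches, held_out_fraction=0.2, seed=0):
    """Shuffle (winner, loser) pairs and split them into (train, held_out)."""
    rng = random.Random(seed)
    shuffled = list(matches)
    rng.shuffle(shuffled)
    n_held = int(len(shuffled) * held_out_fraction)
    return shuffled[n_held:], shuffled[:n_held]


def held_out_agreement(scores, held_out):
    """Fraction of held-out comparisons won by the higher-scored model.

    scores: dict mapping model name to a score from any ranking method
            that was fit on the training portion only.
    held_out: list of (winner, loser) pairs the ranking method never saw.
    Pairs whose two models have equal scores are skipped.
    """
    decided = [(w, l) for w, l in held_out if scores[w] != scores[l]]
    if not decided:
        return 0.0
    hits = sum(1 for w, l in decided if scores[w] > scores[l])
    return hits / len(decided)


# Toy scores and withheld votes, invented for illustration.
toy_scores = {"A": 3.0, "B": 2.0, "C": 1.0}
toy_held_out = [("A", "B"), ("A", "C"), ("C", "B"), ("B", "C")]
print(held_out_agreement(toy_scores, toy_held_out))  # 0.75
```

Because the score is computed only on votes the ranking method never saw, it rewards rankings that generalize rather than ones that merely memorize the training comparisons.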

They also checked to see how sensitive each system's rankings were to user-defined settings, and whether the rankings followed the logic of all the pairwise comparisons: If A beats B, and B beats C, then A must be ranked higher than C.
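The A-beats-B, B-beats-C rule can be checked mechanically. The sketch below assumes that "beats" means winning the majority of the head-to-head comparisons between two models; the study's precise definition is not given in the article, and the function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def pairwise_winners(matches):
    """Set of (x, y) pairs where x won more head-to-head votes than y."""
    wins = Counter(matches)  # (winner, loser) -> number of votes
    models = {m for pair in matches for m in pair}
    return {
        (x, y)
        for x, y in permutations(models, 2)
        if wins[(x, y)] > wins[(y, x)]
    }


def transitivity_violations(ranking, matches):
    """Triples (a, b, c) where a beats b and b beats c, yet a is not above c.

    ranking: list of model names, best first.
    matches: list of (winner, loser) pairs.
    """
    position = {m: i for i, m in enumerate(ranking)}
    beats = pairwise_winners(matches)
    return [
        (a, b, c)
        for a, b, c in permutations(ranking, 3)
        if (a, b) in beats and (b, c) in beats and position[a] > position[c]
    ]


# Toy votes: A beats B 2-1, and B beats C 2-0.
votes = [("A", "B"), ("A", "B"), ("B", "A"), ("B", "C"), ("B", "C")]
print(transitivity_violations(["A", "B", "C"], votes))  # []
print(transitivity_violations(["C", "A", "B"], votes))  # [('A', 'B', 'C')]
```

A ranking that places C above A fails the check even though A and C never met directly, because the chain of results through B implies A should come first.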

They found that Glicko, a ranking system used in e-sports, tends to produce the most consistent results, especially when the number of comparisons is uneven across models. Other ranking systems, such as the Bradley-Terry system that Chatbot Arena implemented in December 2023, could also be accurate, but only when each model had an equal number of comparisons. With uneven counts, such a system could allow a newer model to appear stronger than is warranted.
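One reason Glicko copes with uneven data is that it stores an uncertainty value, the rating deviation (RD), next to each rating. The sketch below follows the publicly documented Glicko-1 update for a single rating period. The article does not say which Glicko variant or settings the study used, so treat this as an assumption, and note that the step that inflates RD between rating periods is omitted. The input numbers come from the worked example in Mark Glickman's description of the system, which reports a result of roughly 1464 with an RD of about 151.4.

```python
import math

Q = math.log(10) / 400.0  # scaling constant from the Glicko-1 definition


def g(rd: float) -> float:
    """Discount factor: opponents with a large RD count for less."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)


def glicko_update(rating: float, rd: float, results):
    """Return (new_rating, new_rd) after one rating period.

    results: list of (opponent_rating, opponent_rd, score) tuples,
             where score is 1.0 for a win, 0.0 for a loss, 0.5 for a tie.
    """
    if not results:
        return rating, rd
    info = 0.0   # accumulates 1 / d^2, the information gained from the games
    total = 0.0  # accumulates the weighted surprise of each outcome
    for opp_rating, opp_rd, score in results:
        g_j = g(opp_rd)
        expected = 1.0 / (1.0 + 10 ** (-g_j * (rating - opp_rating) / 400.0))
        info += Q**2 * g_j**2 * expected * (1.0 - expected)
        total += g_j * (score - expected)
    denom = 1.0 / rd**2 + info
    return rating + (Q / denom) * total, math.sqrt(1.0 / denom)


# A model rated 1500 with RD 200 wins once and loses twice.
games = [(1400.0, 30.0, 1.0), (1550.0, 100.0, 0.0), (1700.0, 300.0, 0.0)]
new_rating, new_rd = glicko_update(1500.0, 200.0, games)
print(round(new_rating), round(new_rd, 1))
```

The RD falls from 200 to about 151 after only three games. A model with few comparisons keeps a large RD, so its rating is flagged as uncertain instead of being treated as equally reliable as that of a heavily tested model.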

"Just because a model comes onto the scene and beats a grandmaster doesn't necessarily mean it's the best model. You need many, many games to know what the truth is," said Jason Mars, U-M associate professor of computer science and engineering and a co-corresponding author of the study.

In contrast, the rankings made by the Elo system, as well as by the Markov chain method that Google uses to rank pages in web search, were highly dependent on how users configured the system. The Bradley-Terry system lacks user-defined settings, so it could be the best option for large datasets with an equal number of comparisons for each AI model.
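To show what a ranking method without tunable settings looks like, here is a minimal Bradley-Terry fit in Python. It uses the classic iterative (minorization-maximization) update, which is one common way to fit the model and not necessarily the procedure used by Chatbot Arena or the study. The function name is hypothetical, and `iterations` is only a numerical stopping choice for this sketch, not a modeling setting. The sketch assumes every model has at least one win.

```python
from collections import defaultdict


def fit_bradley_terry(matches, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict mapping model name to a strength; strengths sum to 1,
    and P(i beats j) is modeled as strength[i] / (strength[i] + strength[j]).
    """
    wins = defaultdict(float)
    games = defaultdict(float)  # unordered pair -> number of comparisons
    models = set()
    for winner, loser in matches:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))
    models = sorted(models)
    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            denom = 0.0
            for j in models:
                n_ij = games.get(frozenset((i, j)), 0.0)
                if j != i and n_ij > 0:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {m: v / total for m, v in updated.items()}
    return strength


# Balanced toy data: every pair meets four times with a 3-1 split.
toy = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
fitted = fit_bradley_terry(toy)
print(sorted(fitted, key=fitted.get, reverse=True))  # ['A', 'B', 'C']
```

Nothing in the fit depends on the order of the votes or on a step size, which matches the article's point that there is nothing for an operator to misconfigure.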

"There's no single right answer, so hopefully our analysis will help guide how we evaluate the AI industry moving forward," Tang said.

More information: Roland Daynauth et al., Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat, aclanthology.org/2025.acl-long.1265/

Provided by University of Michigan

Citation: Why AI leaderboards are inaccurate and how to fix them (2025, July 29), retrieved 29 July 2025 from https://techxplore.com/news/2025-07-ai-leaderboards-inaccurate.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
