A new metric to quantify capabilities of AI systems in terms of human capabilities

March 20, 2025


Our methodology for measuring AI agent time horizon. Credit: arXiv (2025). DOI: 10.48550/arxiv.2503.14499

A team of AI researchers at the startup METR is proposing a new metric to quantify the capabilities of AI systems in terms of human capabilities. They have published a paper on the arXiv preprint server describing the new metric, which they call the "task-completion time horizon" (TCTH).

LLMs such as GPT-2 are getting better at producing reliable results with each new iteration. In this new study, the team in California noted that such models are still being described in ways that fall short of fully capturing a system's capabilities. Because of that, they have come up with a metric to quantify those capabilities in ways that can be applied across multiple fields, such as writing computer programs or generating the steps needed to carry out a task.

With TCTH, tasks can be quantified by testing them against humans. As one example, the researchers found that early versions of LLMs failed to complete any of a certain group of tasks given to human experts, who could get them done in a single minute. In sharp contrast, the latest version of Claude 3.7 Sonnet can complete 50% of certain tasks that took humans on average 59 minutes to achieve.
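The methodology figure above, and the 50% figure cited for Claude 3.7 Sonnet, suggest one simple way to picture how such a time horizon can be read off: fit a curve for the agent's success probability against how long each task takes a human, and find the task length at which that probability crosses 50%. Below is a minimal sketch of that idea in Python using logistic regression; the task times and success outcomes are hypothetical placeholders, not data from the paper.

```python
# Minimal sketch of estimating a 50% task-completion time horizon.
# Assumes a hypothetical set of tasks, each with the time a human expert
# needed (in minutes) and whether the AI agent succeeded (1) or failed (0).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical example data (placeholders, not figures from the paper).
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
agent_success = np.array([1, 1, 1, 1, 1,  1,  0,   1,   0,   0])

# Model the agent's success probability as a function of log2(human time).
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, agent_success)

# The 50% time horizon is where the predicted probability crosses 0.5,
# i.e. where the logit (intercept + coef * log2(t)) equals zero.
log2_horizon = -clf.intercept_[0] / clf.coef_[0][0]
print(f"Estimated 50% time horizon: {2 ** log2_horizon:.1f} human-minutes")
```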

The length of tasks (measured by how long they take human professionals) that generalist autonomous frontier model agents can complete with 50% reliability has been doubling roughly every 7 months for the last 6 years. Credit: arXiv (2025). DOI: 10.48550/arxiv.2503.14499

By establishing a list of tasks and then measuring how long it takes a human to achieve them, the new metric could be used to develop a benchmark for how well AI models are stacking up. And such benchmarks, they suggest, should be based on a 50% success rate, because it has so far proven to be the most robust when used in data distribution analysis.

As part of their work with the new metric, the research team found that AI models are improving dramatically at completing long tasks, such as programming, carrying out cybersecurity assignments, general reasoning assignments and machine learning. Such progress suggests that they may soon be used to carry out major assignments like chemical discovery or even entire engineering projects.
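The figure caption above puts the doubling time of this 50% horizon at roughly seven months over the past six years. As a rough illustration of how such a doubling time can be estimated and extrapolated, the sketch below fits a straight line to the log of the horizon against model release date; the (year, horizon) pairs are invented placeholders, not the values measured in the paper.

```python
# Sketch of estimating the doubling time of the 50% time horizon,
# assuming (as the figure suggests) roughly exponential growth.
import numpy as np

# Hypothetical data: years since an arbitrary start date, and the
# corresponding 50% horizons in human-minutes (placeholders only).
years = np.array([0.0, 1.5, 3.0, 4.5, 6.0])
horizon_minutes = np.array([0.1, 0.6, 3.5, 12.0, 59.0])

# Linear fit in log2 space: log2(horizon) = slope * years + intercept,
# so slope is the number of doublings per year.
slope, intercept = np.polyfit(years, np.log2(horizon_minutes), 1)
print(f"Doubling time: {12.0 / slope:.1f} months")

# Naive extrapolation of the trend a couple of years further out.
future = 8.0
print(f"Extrapolated horizon at year {future:.0f}: "
      f"{2 ** (slope * future + intercept):.0f} human-minutes")
```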

More information: Thomas Kwa et al, Measuring AI Ability to Complete Long Tasks, arXiv (2025). DOI: 10.48550/arxiv.2503.14499

Journal information: arXiv

© 2025 Science X Network

