CRYPTOREPORTCLUB
Friday, July 25, 2025

Improving AI models: Automated tool detects silent errors in deep learning training

July 24, 2025

Gaby Clark, scientific editor
Andrew Zinin, lead editor

Editors' notes

This article has been reviewed according to Science X's editorial process and policies. Editors have highlighted the following attributes while ensuring the content's credibility: fact-checked, preprint, trusted source, proofread.

Silent error in BLOOM-176B training. Credit: arXiv (2025). DOI: 10.48550/arxiv.2506.14813

TrainCheck uses training invariants to find the root cause of hard-to-detect errors before they cause downstream problems, saving time and resources.

A new open-source framework developed at the University of Michigan proactively detects silent errors as they happen during deep learning training. These difficult-to-detect issues do not cause obvious training failures, but quietly degrade model performance while wasting valuable resources and time.

In evaluations, the TrainCheck framework identified 18 out of 20 real-world silent training errors in just one iteration—while current methods only caught two—and uncovered six previously unknown bugs in popular training libraries. Researchers introduced TrainCheck in a study recently presented at the USENIX Symposium on Operating Systems Design and Implementation (OSDI) in Boston.

"By developing TrainCheck, we aim to empower developers with better tools to address silent errors, ultimately enabling more robust AI systems," said Ryan Huang, a U-M associate professor of computer science and engineering and senior author of the study.

During deep learning, artificial neural networks learn to perform tasks using large amounts of data, tweaking parameters over several cycles to reach the desired performance. Large-scale AI models, like large language models (LLMs) and computer vision models, are expensive to train, making silent errors particularly costly because they allow training to continue, leading to a suboptimal model.

Current methods monitor deep learning training with high-level signals, like loss (how wrong the model's predictions are compared to the correct answer), accuracy (the percentage of correct responses) and gradient norms (the magnitude of the gradients that determine how much the model's parameters change at each training step).
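To make two of these signals concrete, here is a minimal sketch in plain Python. The function names and the toy values are illustrative assumptions, not part of TrainCheck or any training library.

```python
import math

def global_grad_norm(grads):
    """L2 norm over all gradient values, a common high-level training signal."""
    return math.sqrt(sum(g * g for g in grads))

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Toy values, chosen only for illustration.
print(global_grad_norm([3.0, 4.0]))          # 5.0
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

Both functions reduce the whole training state to a single number, which is why such signals are cheap to monitor but carry little information about where a problem originates.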

However, these bird's-eye-view metrics are noisy and fluctuate naturally during training, which makes it hard to tell normal variation from an actual problem. For example, Hugging Face's training of its BLOOM-176B LLM missed a silent error because it didn't cause obvious changes in loss or accuracy. The bug caused copies of the model running on different GPUs to drift apart, making the final trained models unusable and thus wasting months of expensive computation.
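A check for this kind of drift can be sketched in a few lines. The example below assumes each replica's parameters are available as a plain dictionary of floats and that every listed parameter is meant to be identical across replicas; the function names and the data layout are hypothetical and do not reflect how BLOOM or TrainCheck represent model state.

```python
def max_replica_divergence(replicas):
    """Largest absolute difference between any replica's parameter and replica 0's."""
    reference = replicas[0]
    worst = 0.0
    for params in replicas[1:]:
        for name, value in params.items():
            worst = max(worst, abs(value - reference[name]))
    return worst

def replicas_in_sync(replicas, tol=0.0):
    """True when no replicated parameter differs by more than tol."""
    return max_replica_divergence(replicas) <= tol

# Hypothetical parameter snapshots from two GPUs.
synced = [{"w": 0.5, "b": 0.1}, {"w": 0.5, "b": 0.1}]
drifted = [{"w": 0.5, "b": 0.1}, {"w": 0.5, "b": 0.3}]

print(replicas_in_sync(synced))   # True
print(replicas_in_sync(drifted))  # False
```

Unlike loss or accuracy, this check is either satisfied or violated at a given step, so a failure is not easily mistaken for normal noise.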

TrainCheck's new approach relies on training invariants, which are rules that should hold throughout training. The framework continuously monitors these invariants, immediately alerts developers to deviations and provides detailed debugging information to help find out what went wrong. This is a big step up from previous high-level methods, which could not find the root cause even when they detected a problem.
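The idea of continuously checking invariants can be illustrated with a toy monitor. This is a sketch under stated assumptions: the class, the invariant names and the state dictionary are all hypothetical, and the code does not use or reproduce TrainCheck's actual API.

```python
class InvariantViolation(Exception):
    """Raised when one or more registered invariants fail at a training step."""

class InvariantMonitor:
    def __init__(self):
        self._invariants = []  # list of (name, predicate) pairs

    def register(self, name, predicate):
        """Add a rule; predicate takes a state dict and returns True if the rule holds."""
        self._invariants.append((name, predicate))

    def check(self, step, state):
        """Evaluate every rule and report the violated ones together with the step."""
        violated = [name for name, pred in self._invariants if not pred(state)]
        if violated:
            raise InvariantViolation(f"step {step}: violated {violated}")

monitor = InvariantMonitor()
# Illustrative rules only; real invariants depend on the training code.
monitor.register("params_change_after_optimizer_step",
                 lambda s: s["params_before"] != s["params_after"])
monitor.register("grads_zeroed_before_backward",
                 lambda s: all(g == 0.0 for g in s["grads_at_start"]))

healthy = {"params_before": [0.5], "params_after": [0.4], "grads_at_start": [0.0]}
buggy = {"params_before": [0.5], "params_after": [0.5], "grads_at_start": [0.0]}

monitor.check(1, healthy)  # passes silently
try:
    monitor.check(2, buggy)
except InvariantViolation as err:
    print(err)  # names the step and the violated rule
```

Because the report names the specific rule and step that failed, it points much closer to the faulty operation than a drifting loss curve would.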

"By automatically inferring and monitoring training invariants, TrainCheck enables rapid identification and resolution of errors, which is a significant advancement over traditional methods. It sets a new standard for error detection in machine learning frameworks," said Yuxuan Jiang, a doctoral student in computer science and engineering at U-M and lead author of the study.
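One simple way to picture invariant inference is to propose candidate rules and keep only those that hold on every record of known-good training runs. The sketch below follows that assumption; the candidate rules, the trace format and the function name are hypothetical, and the article does not describe TrainCheck's actual inference procedure at this level of detail.

```python
def infer_invariants(candidates, traces):
    """Keep the candidate rules that hold on every record of every trace."""
    return {
        name: rule
        for name, rule in candidates.items()
        if all(rule(record) for trace in traces for record in trace)
    }

# Hypothetical candidate rules over per-step records.
candidates = {
    "lr_is_positive": lambda r: r["lr"] > 0,
    "loss_below_one": lambda r: r["loss"] < 1.0,
    "step_matches_optimizer_step": lambda r: r["step"] == r["optimizer_steps"],
}

# Hypothetical trace from one healthy run.
traces = [
    [
        {"step": 1, "optimizer_steps": 1, "lr": 0.1, "loss": 2.3},
        {"step": 2, "optimizer_steps": 2, "lr": 0.1, "loss": 0.9},
    ],
]

inferred = infer_invariants(candidates, traces)
print(sorted(inferred))  # ['lr_is_positive', 'step_matches_optimizer_step']
```

The rule about loss is discarded because the healthy run itself violates it, which shows how this style of inference filters out rules that merely describe noisy metrics.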

The research team put TrainCheck to the test on 20 silent errors while comparing performance to four existing detection methods. Six of the silent errors were drawn from previous research and the other 14 came from issues discussed in developer forums (GitHub, StackOverflow and social media) to ensure they were testing the framework against issues developers actually faced.

TrainCheck successfully detected 18 out of 20 silent errors while high-level signal detectors only detected two. Diagnostics revealed that of the 18 errors TrainCheck detected, the violation reports found the exact root cause for 10 cases and localized close to the root for the other eight. In contrast, the high-level detectors could only provide diagnostic hints for one error.
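The reported counts translate into rates as follows. The numbers come from the paragraph above; the variable names and the helper function are illustrative.

```python
total_errors = 20
detected_traincheck = 18
detected_high_level = 2
exact_root_cause = 10
near_root_cause = 8

def pct(part, whole):
    """Percentage of whole represented by part."""
    return 100.0 * part / whole

# The two diagnostic categories account for every detected error.
assert exact_root_cause + near_root_cause == detected_traincheck

print(f"TrainCheck detection rate: {pct(detected_traincheck, total_errors):.0f}%")  # 90%
print(f"High-level detection rate: {pct(detected_high_level, total_errors):.0f}%")  # 10%
print(f"Exact root cause among detected: {pct(exact_root_cause, detected_traincheck):.0f}%")  # 56%
```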

"We were impressed by how well TrainCheck performed in handling real-world issues using its principled invariant-based approach," said Huang.

TrainCheck did raise some false alarms, but at a low rate, and they followed recognizable patterns that made them relatively easy to dismiss.

The results suggest that TrainCheck can be integrated into various machine learning frameworks, giving developers a proactive guard against silent errors. By detecting such errors early, it reduces wasted resources and helps improve model accuracy and robustness.

Future work could extend TrainCheck to give developers additional debugging help and could apply its continuous validation approach to other computational domains where silent errors are common, such as distributed systems, to improve their resilience and performance.

More information: Yuxuan Jiang et al, Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks, arXiv (2025). DOI: 10.48550/arxiv.2506.14813

GitHub: github.com/OrderLab/TrainCheck

Journal information: arXiv

Provided by University of Michigan

Citation: Improving AI models: Automated tool detects silent errors in deep learning training (2025, July 24) retrieved 24 July 2025 from https://techxplore.com/news/2025-07-ai-automated-tool-silent-errors.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.


Advertising: digestmediaholding@gmail.com

Disclaimer: Information found on cryptoreportclub.com is those of writers quoted. It does not represent the opinions of cryptoreportclub.com on whether to sell, buy or hold any investments. You are advised to conduct your own research before making any investment decisions. Use provided information at your own risk.
cryptoreportclub.com covers fintech, blockchain and Bitcoin bringing you the latest crypto news and analyses on the future of money.

© 2023-2025 Cryptoreportclub. All Rights Reserved
