Innovative detection method makes AI smarter by cleaning up bad data before it learns

June 12, 2025

Lisa Lock, scientific editor

Andrew Zinin, lead editor

In machine learning and artificial intelligence, clean data is everything. Even a small number of mislabeled examples, known as label noise, can derail a model's performance, especially for models such as support vector machines (SVMs) that rely on a few key data points to make decisions.

SVMs are a widely used type of machine learning algorithm, applied in everything from image and speech recognition to medical diagnostics and text classification. These models operate by finding a boundary that best separates different categories of data. They rely on a small but crucial subset of the training data, known as support vectors, to determine this boundary. If these few examples are incorrectly labeled, the resulting decision boundaries can be flawed, leading to poor performance on real-world data.
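To make that sensitivity concrete, here is a minimal one-dimensional sketch in Python. It is not an SVM solver and not code from the paper: it assumes two separable classes on a line, where the maximum-margin boundary is simply the midpoint between the two closest opposite-class points (the support vectors). The function name `max_margin_threshold` and the toy numbers are invented for illustration.

```python
def max_margin_threshold(points, labels):
    """Toy 1-D hard-margin boundary (illustrative assumption, not a real SVM).

    Assumes label -1 lies entirely to the left of label +1. The boundary is
    the midpoint between the two closest opposite-class points, which play
    the role of support vectors.
    """
    right_edge_of_negatives = max(x for x, y in zip(points, labels) if y == -1)
    left_edge_of_positives = min(x for x, y in zip(points, labels) if y == +1)
    if right_edge_of_negatives >= left_edge_of_positives:
        raise ValueError("classes are not separable in this toy setting")
    return (right_edge_of_negatives + left_edge_of_positives) / 2.0


points = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0, 9.0]
clean_labels = [-1, -1, -1, +1, +1, +1, +1]
noisy_labels = [-1, -1, -1, -1, +1, +1, +1]  # the point at 6.0 is mislabeled

clean_threshold = max_margin_threshold(points, clean_labels)
noisy_threshold = max_margin_threshold(points, noisy_labels)
print(clean_threshold, noisy_threshold)
```

Relabeling the single point at 6.0 moves the boundary from 4.5 to 6.5, so any positive example falling between those values would now be misclassified.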

Now, a team of researchers from the Center for Connected Autonomy and Artificial Intelligence (CA-AI) within the College of Engineering and Computer Science at Florida Atlantic University and collaborators have developed an innovative method to automatically detect and remove faulty labels before a model is ever trained—making AI smarter, faster and more reliable.

Before the AI even starts learning, the researchers clean the data using a mathematical technique that looks for unusual examples that do not fit with the rest of their class. These "outliers" are removed or flagged, so the AI gets high-quality information from the start. The paper is published in IEEE Transactions on Neural Networks and Learning Systems.

"SVMs are among the most powerful and widely used classifiers in machine learning, with applications ranging from cancer detection to spam filtering," said Dimitris Pados, Ph.D., Schmidt Eminent Scholar Professor of Engineering and Computer Science in the FAU Department of Electrical Engineering and Computer Science, director of CA-AI and an FAU Sensing Institute (I-SENSE) faculty fellow.

"What makes them especially effective—but also uniquely vulnerable—is that they rely on just a small number of key data points, called support vectors, to draw the line between different classes. If even one of those points is mislabeled—for example, if a malignant tumor is incorrectly marked as benign—it can distort the model's entire understanding of the problem.

"The consequences of that could be serious, whether it's a missed cancer diagnosis or a security system that fails to flag a threat. Our work is about protecting models, any machine learning and AI model including SVMs, from these hidden dangers by identifying and removing those mislabeled cases before they can do harm."

The data-driven method that "cleans" the training dataset uses a mathematical approach called L1-norm principal component analysis. Unlike conventional methods, which often require manual parameter tuning or assumptions about the type of noise present, this technique identifies and removes suspicious data points within each class purely based on how well they fit with the rest of the group.

"Data points that appear to deviate significantly from the rest—often due to label errors—are flagged and removed," said Pados. "Unlike many existing techniques, this process requires no manual tuning or user intervention and can be applied to any AI model, making it both scalable and practical."

The process is robust, efficient and entirely touch-free—even handling the notoriously tricky task of rank selection (which determines how many dimensions to keep during analysis) without user input.
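The general idea of per-class cleaning can be sketched in a few lines of Python. This is not the authors' algorithm: it is a simplified illustration that assumes a rank-1 subspace, uses a well-known sign fixed-point iteration to approximate the L1-norm principal component, and flags points with a hand-picked cutoff (three times the median residual). The published method is described as needing no such manual cutoff or rank choice. The function names, the cutoff and the toy data are all hypothetical.

```python
import math
from statistics import median


def l1_principal_component(rows, max_iterations=100):
    """Rank-1 L1-norm PCA via sign fixed-point iteration (local optimum).

    Seeks a unit vector q that maximizes sum_i |x_i . q|.
    Assumes at least one row is nonzero.
    """
    dim = len(rows[0])
    start = max(rows, key=lambda r: sum(v * v for v in r))  # largest-norm row
    scale = math.sqrt(sum(v * v for v in start))
    q = [v / scale for v in start]
    for _ in range(max_iterations):
        signs = [1.0 if sum(a * b for a, b in zip(r, q)) >= 0 else -1.0
                 for r in rows]
        combo = [sum(s * r[j] for s, r in zip(signs, rows)) for j in range(dim)]
        scale = math.sqrt(sum(v * v for v in combo))
        new_q = [v / scale for v in combo]
        if new_q == q:  # signs stopped changing, so the iteration has converged
            break
        q = new_q
    return q


def flag_outliers(rows, cutoff=3.0):
    """Flag points of ONE class that sit far from its L1 principal direction.

    The cutoff rule is an assumption made for this sketch only.
    """
    dim = len(rows[0])
    center = [median(r[j] for r in rows) for j in range(dim)]
    centered = [[r[j] - center[j] for j in range(dim)] for r in rows]
    q = l1_principal_component(centered)
    residuals = []
    for r in centered:
        projection = sum(a * b for a, b in zip(r, q))
        residuals.append(
            math.sqrt(sum((a - projection * b) ** 2 for a, b in zip(r, q))))
    limit = cutoff * median(residuals)
    return [i for i, res in enumerate(residuals) if res > limit]


# Toy class: nine points near the line y = x, plus two that do not fit.
class_points = [
    [-4.0, -4.1], [-3.0, -2.9], [-2.0, -2.1], [-1.0, -0.9], [0.0, 0.0],
    [1.0, 0.9], [2.0, 2.1], [3.0, 2.9], [4.0, 4.1],
    [-2.0, 3.0], [3.0, -2.0],  # suspicious points (indices 9 and 10)
]
flagged = flag_outliers(class_points)
print(flagged)
```

In this toy class the two off-line points are the only ones flagged. In a full pipeline the same step would be run separately on each class, and the flagged examples dropped before the classifier is trained.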

Researchers extensively tested their technique on real and synthetic datasets with various levels of label contamination. Across the board, it produced consistent and notable improvements in classification accuracy, demonstrating its potential as a standard pre-processing step in the development of high-performance machine learning systems.
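The article does not say how the contamination was introduced. One common protocol, shown here purely as an assumed example, is to flip a fixed fraction of labels at random and then compare a classifier trained on the noisy set with one trained on the cleaned set. The helper name `flip_labels`, the seed and the 20% rate below are illustrative choices, not details from the study.

```python
import random


def flip_labels(labels, rate, seed=0):
    """Return a copy of binary labels (+1/-1) with a fraction `rate` flipped.

    Symmetric random flipping is an assumption for illustration; the study's
    actual contamination procedure is not described in the article.
    """
    rng = random.Random(seed)
    n_flips = round(rate * len(labels))
    flipped_indices = rng.sample(range(len(labels)), n_flips)
    noisy = list(labels)
    for i in flipped_indices:
        noisy[i] = -noisy[i]
    return noisy, sorted(flipped_indices)


labels = [+1] * 10 + [-1] * 10
noisy_labels, flipped = flip_labels(labels, rate=0.2)
changed = sum(1 for a, b in zip(labels, noisy_labels) if a != b)
print(changed, "of", len(labels), "labels flipped")
```

Because the flipped indices are returned, a cleaning step can be scored directly by checking how many of them it flags.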

"What makes our approach particularly compelling is its flexibility," said Pados. "It can be used as a plug-and-play preprocessing step for any AI system, regardless of the task or dataset. And it's not just theoretical—extensive testing on both noisy and clean datasets, including well-known benchmarks like the Wisconsin Breast Cancer dataset, showed consistent improvements in classification accuracy.

"Even in cases where the original training data appeared flawless, our new method still enhanced performance, suggesting that subtle, hidden label noise may be more common than previously thought."

Looking ahead, the research opens the door to even broader applications. The team is interested in exploring how this mathematical framework might be extended to tackle deeper issues in data science such as reducing data bias and improving the completeness of datasets.

"As machine learning becomes deeply integrated into high-stakes domains like health care, finance and the justice system, the integrity of the data driving these models has never been more important," said Stella Batalama, Ph.D., dean of the FAU College of Engineering and Computer Science.

"We're asking algorithms to make decisions that impact real lives—diagnosing diseases, evaluating loan applications, even informing legal judgments. If the training data is flawed, the consequences can be devastating. That's why innovations like this are so critical.

"By improving data quality at the source—before the model is even trained—we're not just making AI more accurate; we're making it more responsible. This work represents a meaningful step toward building AI systems we can trust to perform fairly, reliably and ethically in the real world."

More information: Shruti Shukla et al., Training Dataset Curation by L1-Norm Principal-Component Analysis for Support Vector Machines, IEEE Transactions on Neural Networks and Learning Systems (2025). DOI: 10.1109/TNNLS.2025.3568694

Provided by Florida Atlantic University

Citation: Innovative detection method makes AI smarter by cleaning up bad data before it learns (2025, June 12) retrieved 12 June 2025 from https://techxplore.com/news/2025-06-method-ai-smarter-bad.html
