October 14, 2025
Multimodal AI learns to weigh text and images more evenly
Edited by Lisa Lock, scientific editor, and Robert Egan, associate editor

Just as human eyes tend to focus on pictures before reading the accompanying text, multimodal artificial intelligence (AI), which processes multiple types of sensory data at once, also tends to depend more heavily on certain types of data. KAIST researchers have now developed a new multimodal AI training technique that allows models to weigh both text and images evenly, leading to far more accurate predictions.
A research team led by Professor Steven Euijong Whang from the School of Electrical Engineering has developed a novel data augmentation method that enables multimodal AI systems to make balanced use of all input data. The findings are posted on the arXiv preprint server.
Multimodal AI combines various forms of information, such as text and video, to make judgments. However, AI models often show a tendency to rely excessively on one particular type of data, resulting in degraded prediction performance.
To solve this problem, the research team deliberately trained AI models using mismatched or incongruent data pairs. By doing so, the model learned to rely on all modalities—text, images, and even audio—in a balanced way, regardless of context.
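In broad strokes, one way to realize this kind of misalignment-based augmentation is to pair each image with text taken from a different sample and flag the result as mismatched, so the model cannot succeed by leaning on a single modality. The sketch below only illustrates that idea and is not the authors' released code; the field names, the misalign_ratio parameter, and the choice to keep the image-side label are assumptions.

```python
# Illustrative sketch (not the paper's implementation): build "misaligned"
# training pairs by drawing the text from a different sample than the image.
import random

def make_misaligned_pairs(samples, misalign_ratio=0.3, seed=0):
    """samples: list of dicts with 'image', 'text', and 'label' entries.
    Returns the original pairs plus extra pairs whose text comes from a
    different sample, marked with aligned=False."""
    rng = random.Random(seed)
    augmented = [dict(s, aligned=True) for s in samples]

    n_extra = int(len(samples) * misalign_ratio)
    for _ in range(n_extra):
        a, b = rng.sample(samples, 2)   # two distinct source samples
        augmented.append({
            "image": a["image"],        # image from one sample
            "text": b["text"],          # text from another
            "label": a["label"],        # keep the image-side label (assumption)
            "aligned": False,           # flag so the loss can treat it differently
        })
    rng.shuffle(augmented)
    return augmented

# Toy usage with placeholder data
data = [{"image": f"img_{i}.jpg", "text": f"caption {i}", "label": i % 2} for i in range(10)]
print(len(make_misaligned_pairs(data)))  # 10 original + 3 misaligned pairs
```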
The team further improved performance stability by incorporating a training strategy that compensates for low-quality data while emphasizing more challenging examples. The method is not tied to any specific model architecture and can be easily applied to various data types, making it highly scalable and practical.
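One common way to implement such compensation and emphasis is to reweight each sample's loss during training. The sketch below shows that general idea, with a focal-style factor that up-weights examples the model still gets wrong and a fixed discount that softens the contribution of deliberately misaligned pairs; the weighting scheme, parameter names, and values are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch of per-sample loss reweighting (an assumption about the
# general approach, not the paper's exact method).
import torch
import torch.nn.functional as F

def weighted_multimodal_loss(logits, labels, aligned_mask,
                             hard_gamma=2.0, misaligned_weight=0.5):
    """logits: (B, C) fused-model outputs; labels: (B,) class ids;
    aligned_mask: (B,) bool, True for original (well-aligned) pairs."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")  # (B,)

    # Emphasize challenging examples: the factor grows as the model's
    # confidence in the true class shrinks.
    p_true = torch.softmax(logits, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
    hardness = (1.0 - p_true) ** hard_gamma

    # Compensate for lower-quality (misaligned) pairs with a smaller weight.
    quality = torch.where(aligned_mask, torch.ones_like(p_true),
                          torch.full_like(p_true, misaligned_weight))

    return (hardness * quality * per_sample).mean()

# Toy usage
logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
aligned = torch.tensor([True, True, False, True])
print(weighted_multimodal_loss(logits, labels, aligned).item())
```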
Professor Whang explained, "Improving AI performance is not just about changing model architectures or algorithms—it's much more important how we design and use the data for training. This research demonstrates that designing and refining the data itself can be an effective approach to help multimodal AI utilize information more evenly, without becoming biased toward a specific modality such as images or text."
The study was co-led by doctoral student Seong-Hyeon Hwang and master's student Soyoung Choi, with Professor Steven Euijong Whang serving as the corresponding author. The results will be presented at the Conference on Neural Information Processing Systems (NeurIPS 2025), which will be held this December in San Diego, U.S., and Mexico City, Mexico.
More information: Seong-Hyeon Hwang et al, MIDAS: Misalignment-based Data Augmentation Strategy for Imbalanced Multimodal Learning, arXiv (2025). DOI: 10.48550/arXiv.2509.25831
Journal information: arXiv
Provided by The Korea Advanced Institute of Science and Technology (KAIST)
Citation: Multimodal AI learns to weigh text and images more evenly (2025, October 14), retrieved 14 October 2025 from https://techxplore.com/news/2025-10-multimodal-ai-text-images-evenly.html
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.