October 14, 2025
Multimodal AI learns to weigh text and images more evenly
Edited by Lisa Lock, scientific editor, and Robert Egan, associate editor

Just as human eyes tend to focus on pictures before reading the accompanying text, multimodal artificial intelligence (AI), which processes multiple types of sensory data at once, also tends to depend more heavily on certain types of data. KAIST researchers have now developed a new multimodal AI training technique that allows models to weigh both text and images evenly, leading to far more accurate predictions.
A research team led by Professor Steven Euijong Whang from the School of Electrical Engineering has developed a novel data augmentation method that enables multimodal AI systems to make balanced use of all input data. The findings are posted on the arXiv preprint server.
Multimodal AI combines various forms of information, such as text and video, to make judgments. However, AI models often show a tendency to rely excessively on one particular type of data, resulting in degraded prediction performance.
To solve this problem, the research team deliberately trained AI models using mismatched or incongruent data pairs. By doing so, the model learned to rely on all modalities—text, images, and even audio—in a balanced way, regardless of context.
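In broad strokes, one way to realize this kind of misalignment-based augmentation is to pair each image with text taken from a different sample and flag the result as mismatched, so the model cannot succeed by leaning on a single modality. The sketch below only illustrates that idea and is not the authors' released code; the field names, the misalign_ratio parameter, and the choice to keep the image-side label are assumptions.

```python
# Illustrative sketch (not the paper's implementation): build "misaligned"
# training pairs by drawing the text from a different sample than the image.
import random

def make_misaligned_pairs(samples, misalign_ratio=0.3, seed=0):
    """samples: list of dicts with 'image', 'text', and 'label' entries.
    Returns the original pairs plus extra pairs whose text comes from a
    different sample, marked with aligned=False."""
    rng = random.Random(seed)
    augmented = [dict(s, aligned=True) for s in samples]

    n_extra = int(len(samples) * misalign_ratio)
    for _ in range(n_extra):
        a, b = rng.sample(samples, 2)   # two distinct source samples
        augmented.append({
            "image": a["image"],        # image from one sample
            "text": b["text"],          # text from another
            "label": a["label"],        # keep the image-side label (assumption)
            "aligned": False,           # flag so the loss can treat it differently
        })
    rng.shuffle(augmented)
    return augmented

# Toy usage with placeholder data
data = [{"image": f"img_{i}.jpg", "text": f"caption {i}", "label": i % 2} for i in range(10)]
print(len(make_misaligned_pairs(data)))  # 10 original + 3 misaligned pairs
```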
The team further improved performance stability by incorporating a training strategy that compensates for low-quality data while emphasizing more challenging examples. The method is not tied to any specific model architecture and can be easily applied to various data types, making it highly scalable and practical.
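One common way to implement such compensation and emphasis is to reweight each sample's loss during training. The sketch below shows that general idea, with a focal-style factor that up-weights examples the model still gets wrong and a fixed discount that softens the contribution of deliberately misaligned pairs; the weighting scheme, parameter names, and values are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch of per-sample loss reweighting (an assumption about the
# general approach, not the paper's exact method).
import torch
import torch.nn.functional as F

def weighted_multimodal_loss(logits, labels, aligned_mask,
                             hard_gamma=2.0, misaligned_weight=0.5):
    """logits: (B, C) fused-model outputs; labels: (B,) class ids;
    aligned_mask: (B,) bool, True for original (well-aligned) pairs."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")  # (B,)

    # Emphasize challenging examples: the factor grows as the model's
    # confidence in the true class shrinks.
    p_true = torch.softmax(logits, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
    hardness = (1.0 - p_true) ** hard_gamma

    # Compensate for lower-quality (misaligned) pairs with a smaller weight.
    quality = torch.where(aligned_mask, torch.ones_like(p_true),
                          torch.full_like(p_true, misaligned_weight))

    return (hardness * quality * per_sample).mean()

# Toy usage
logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
aligned = torch.tensor([True, True, False, True])
print(weighted_multimodal_loss(logits, labels, aligned).item())
```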
Professor Whang explained, "Improving AI performance is not just about changing model architectures or algorithms—it's much more important how we design and use the data for training. This research demonstrates that designing and refining the data itself can be an effective approach to help multimodal AI utilize information more evenly, without becoming biased toward a specific modality such as images or text."
The study was co-led by doctoral student Seong-Hyeon Hwang and master's student Soyoung Choi, with Professor Steven Euijong Whang serving as the corresponding author. The results will be presented at the Conference on Neural Information Processing Systems (NeurIPS 2025), which will be held this December in San Diego, U.S., and Mexico City, Mexico.
More information: Seong-Hyeon Hwang et al, MIDAS: Misalignment-based Data Augmentation Strategy for Imbalanced Multimodal Learning, arXiv (2025). DOI: 10.48550/arXiv.2509.25831
Journal information: arXiv
Provided by The Korea Advanced Institute of Science and Technology (KAIST)
Citation: Multimodal AI learns to weigh text and images more evenly (2025, October 14), retrieved 14 October 2025 from https://techxplore.com/news/2025-10-multimodal-ai-text-images-evenly.html
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.