July 21, 2025
AI vision, reinvented: Vision-language models gain clearer sight through synthetic training data

In the race to develop AI that understands complex images like financial forecasts, medical diagrams and nutrition labels—essential for AI to operate independently in everyday settings—closed-source systems like ChatGPT and Claude currently set the pace. But no one outside their makers knows how those models were trained or what data they used, leaving open-source alternatives scrambling to catch up.
Now, researchers at Penn Engineering and the Allen Institute for AI (Ai2) have developed a new approach to train open-source models: using AI to create scientific figures, charts and tables that teach other AI systems how to interpret complex visual information.
Their tool, CoSyn (short for Code-Guided Synthesis), taps open-source AI models' coding skills to render text-rich images and generate relevant questions and answers, giving other AI systems the data they need to learn how to "see" and understand scientific figures.
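The idea can be illustrated with a minimal, self-contained sketch (not the CoSyn codebase itself): in CoSyn, a language model writes the rendering code, whereas here that generated code is stood in for by a hand-written matplotlib snippet. Because the data behind the figure is known exactly, matching question-answer pairs can be produced alongside the image.

```python
# Minimal sketch of code-guided synthesis (illustrative, not the CoSyn pipeline).
# In CoSyn an LLM writes the rendering code; here it is written by hand so the
# example runs on its own.
import json
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Ground-truth data that the rendering code will draw (hypothetical values).
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [1.2, 1.8, 2.4, 3.1]  # USD millions

# Step 1: render a text-rich image from code.
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(quarters, revenue, color="steelblue")
ax.set_title("Revenue by quarter (USD millions)")
ax.set_ylabel("Revenue")
fig.tight_layout()
fig.savefig("synthetic_chart.png", dpi=150)

# Step 2: derive instruction data from the same ground truth the code rendered.
qa_pairs = [
    {"question": "Which quarter had the highest revenue?",
     "answer": quarters[revenue.index(max(revenue))]},
    {"question": "What was the revenue in Q2, in millions?",
     "answer": str(revenue[quarters.index("Q2")])},
]
with open("synthetic_chart.json", "w") as f:
    json.dump({"image": "synthetic_chart.png", "qa": qa_pairs}, f, indent=2)
```

Because the figure and its answers come from the same underlying code and data, the resulting image-instruction pairs are correct by construction, which is what lets a text-only model teach a vision-language model to "see."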
As the researchers detail in a paper for ACL 2025, one of the world's leading AI conferences, CoSyn-trained models match or outperform their proprietary peers.
"This is like taking a student who's great at writing and asking them to teach someone how to draw, just by describing what the drawing should look like," says Yue Yang (GrEng'25), co-first author and Research Scientist at Ai2's PRIOR: Perceptual Reasoning and Interaction Research group. "We're essentially transferring the strengths of open-source AI from text to vision."
Synthetic images, real results
The resulting dataset, called CoSyn-400K, includes more than 400,000 synthetic images and 2.7 million sets of corresponding instructions, in categories as varied as scientific charts, chemical structures and user-interface screenshots. CoSyn-trained models outperformed top proprietary systems like GPT-4V and Gemini 1.5 Flash on a suite of seven benchmark tests.
In one particularly striking case, the researchers synthetically generated just 7,000 nutrition labels to train a model for a new benchmark they created, NutritionQA. That small, targeted dataset enabled their model to beat others trained on millions of real images.
"Training AI with CoSyn is incredibly data efficient," says Mark Yatskar, Assistant Professor in CIS and Yang's doctoral co-advisor. "We're showing that synthetic data can help models generalize to real-world scenarios that could be unique to a person's needs, like reading a nutrition label for someone with low vision."
Scaling and diversifying the dataset
Creating hundreds of thousands of useful, varied training examples posed its own challenges.
To reach the scale required, co-first author Ajay Patel, a doctoral student in CIS, developed a software library called DataDreamer that automated the entire data-generation process. DataDreamer let the team prompt language models in parallel, enabling large-scale production of synthetic images and instructions.
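The sketch below only illustrates the kind of parallel prompt execution such a pipeline automates; it does not use DataDreamer's actual API, and generate() is a placeholder for a call to whatever open-source language model the pipeline runs.

```python
# Illustrative sketch of fanning prompts out in parallel; DataDreamer's real API differs.
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    # Placeholder: a real pipeline would call an open-source LLM endpoint here.
    return f"[model output for: {prompt[:40]}...]"

# Assumed example prompts asking the model to write rendering code.
prompts = [
    f"Write matplotlib code for a {kind} about {topic}."
    for kind in ("bar chart", "line plot", "table")
    for topic in ("nutrition facts", "chemical yields", "quarterly revenue")
]

# Run many prompts concurrently so synthetic examples are produced in bulk.
with ThreadPoolExecutor(max_workers=8) as pool:
    outputs = list(pool.map(generate, prompts))

for prompt, output in zip(prompts[:2], outputs[:2]):
    print(prompt, "->", output)
```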
To avoid repetition, the team used "personas," short character profiles like "a sci-fi novelist" or "a chemistry teacher," to guide the AI's responses and shape the content and tone of each example. Embedding these personas into prompts led CoSyn to produce richer, more varied training data across a wide range of domains.
"AI models tend to repeat themselves unless you nudge them into different perspectives," explains Patel. "Personas give us a scalable way to do that, and the results speak for themselves."
Leveling the playing field for open-source AI
By building CoSyn entirely with open-source tools, the researchers hope to democratize access to powerful vision-language training methods without the ethical and legal challenges surrounding web scraping and copyrighted content.
"This is a step towards AI helping us make new scientific discoveries," adds Chris Callison-Burch, Professor in CIS, who co-advised Yang and currently advises Patel. "It opens the door to AI systems that can reason about scientific documents, which could help a wide range of people, from college students to researchers."
From understanding to action
The team has released the full CoSyn code and dataset to the public, inviting the global research community to build upon their work.
Yang is already looking ahead to synthetic data that helps AI not only understand images but also interact with them, so that models can serve as intelligent digital agents that click buttons, fill out forms and assist users in daily tasks.
"In the long run, we want AI that can act in the world, not just describe it," Yang says. "This is one way to teach it how."
More information: Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation, yueyang1996.github.io/papers/cosyn.pdf
Provided by University of Pennsylvania