July 21, 2025
AI vision, reinvented: Vision-language models gain clearer sight through synthetic training data

In the race to develop AI that understands complex images like financial forecasts, medical diagrams and nutrition labels—essential for AI to operate independently in everyday settings—closed-source systems like ChatGPT and Claude currently set the pace. But no one outside their makers knows how those models were trained or what data they used, leaving open-source alternatives scrambling to catch up.
Now, researchers at Penn Engineering and the Allen Institute for AI (Ai2) have developed a new approach to train open-source models: using AI to create scientific figures, charts and tables that teach other AI systems how to interpret complex visual information.
Their tool, CoSyn (short for Code-Guided Synthesis), taps open-source AI models' coding skills to render text-rich images and generate relevant questions and answers, giving other AI systems the data they need to learn how to "see" and understand scientific figures.
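The idea can be illustrated with a minimal, self-contained sketch (not the CoSyn codebase itself): in CoSyn, a language model writes the rendering code, whereas here that generated code is stood in for by a hand-written matplotlib snippet. Because the data behind the figure is known exactly, matching question-answer pairs can be produced alongside the image.

```python
# Minimal sketch of code-guided synthesis (illustrative, not the CoSyn pipeline).
# In CoSyn an LLM writes the rendering code; here it is written by hand so the
# example runs on its own.
import json
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Ground-truth data that the rendering code will draw (hypothetical values).
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [1.2, 1.8, 2.4, 3.1]  # USD millions

# Step 1: render a text-rich image from code.
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(quarters, revenue, color="steelblue")
ax.set_title("Revenue by quarter (USD millions)")
ax.set_ylabel("Revenue")
fig.tight_layout()
fig.savefig("synthetic_chart.png", dpi=150)

# Step 2: derive instruction data from the same ground truth the code rendered.
qa_pairs = [
    {"question": "Which quarter had the highest revenue?",
     "answer": quarters[revenue.index(max(revenue))]},
    {"question": "What was the revenue in Q2, in millions?",
     "answer": str(revenue[quarters.index("Q2")])},
]
with open("synthetic_chart.json", "w") as f:
    json.dump({"image": "synthetic_chart.png", "qa": qa_pairs}, f, indent=2)
```

Because the figure and its answers come from the same underlying code and data, the resulting image-instruction pairs are correct by construction, which is what lets a text-only model teach a vision-language model to "see."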
As the researchers detail in a paper for ACL 2025, one of the world's leading AI conferences, CoSyn-trained models match or outperform their proprietary peers.
"This is like taking a student who's great at writing and asking them to teach someone how to draw, just by describing what the drawing should look like," says Yue Yang (GrEng'25), co-first author and Research Scientist at Ai2's PRIOR: Perceptual Reasoning and Interaction Research group. "We're essentially transferring the strengths of open-source AI from text to vision."
Synthetic images, real results
The resulting dataset, called CoSyn-400K, includes more than 400,000 synthetic images and 2.7 million sets of corresponding instructions, in categories as varied as scientific charts, chemical structures and user-interface screenshots. CoSyn-trained models outperformed top proprietary systems like GPT-4V and Gemini 1.5 Flash on a suite of seven benchmark tests.
In one particularly striking case, the researchers synthetically generated just 7,000 nutrition labels to train a model for a new benchmark they created, NutritionQA. That small, targeted dataset enabled their model to beat others trained on millions of real images.
"Training AI with CoSyn is incredibly data efficient," says Mark Yatskar, Assistant Professor in CIS and Yang's doctoral co-advisor. "We're showing that synthetic data can help models generalize to real-world scenarios that could be unique to a person's needs, like reading a nutrition label for someone with low vision."
Scaling and diversifying the dataset
Creating hundreds of thousands of useful, varied training examples posed its own challenges.
To reach the scale required, co-first author Ajay Patel, a doctoral student in CIS, developed a software library called DataDreamer that automated the entire data-generation process. DataDreamer let the team prompt language models in parallel, enabling large-scale production of synthetic images and instructions.
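The sketch below only illustrates the kind of parallel prompt execution such a pipeline automates; it does not use DataDreamer's actual API, and generate() is a placeholder for a call to whatever open-source language model the pipeline runs.

```python
# Illustrative sketch of fanning prompts out in parallel; DataDreamer's real API differs.
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    # Placeholder: a real pipeline would call an open-source LLM endpoint here.
    return f"[model output for: {prompt[:40]}...]"

# Assumed example prompts asking the model to write rendering code.
prompts = [
    f"Write matplotlib code for a {kind} about {topic}."
    for kind in ("bar chart", "line plot", "table")
    for topic in ("nutrition facts", "chemical yields", "quarterly revenue")
]

# Run many prompts concurrently so synthetic examples are produced in bulk.
with ThreadPoolExecutor(max_workers=8) as pool:
    outputs = list(pool.map(generate, prompts))

for prompt, output in zip(prompts[:2], outputs[:2]):
    print(prompt, "->", output)
```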
To avoid repetition, the team used "personas," short character profiles like "a sci-fi novelist" or "a chemistry teacher," to guide the AI's responses and shape the content and tone of each example. Embedding these personas into prompts led CoSyn to produce richer, more varied training data across a wide range of domains.
"AI models tend to repeat themselves unless you nudge them into different perspectives," explains Patel. "Personas give us a scalable way to do that, and the results speak for themselves."
Leveling the playing field for open-source AI
By building CoSyn entirely with open-source tools, the researchers hope to democratize access to powerful vision-language training methods without the ethical and legal challenges surrounding web scraping and copyrighted content.
"This is a step towards AI helping us make new scientific discoveries," adds Chris Callison-Burch, Professor in CIS, who co-advised Yang and currently advises Patel. "It opens the door to AI systems that can reason about scientific documents, which could help a wide range of people, from college students to researchers."
From understanding to action
The team has released the full CoSyn code and dataset to the public, inviting the global research community to build upon their work.
Yang is already looking ahead to synthetic data that helps AI not only understand images but also interact with them, so that models can serve as intelligent digital agents that click buttons, fill out forms and assist users in daily tasks.
"In the long run, we want AI that can act in the world, not just describe it," Yang says. "This is one way to teach it how."
More information: Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation, yueyang1996.github.io/papers/cosyn.pdf
Provided by University of Pennsylvania