July 21, 2025
AI models learn to split up tasks, slashing wait times for complex prompts

As large language models (LLMs) like ChatGPT continue to advance, user expectations keep growing, including how quickly they can respond to our increasingly intricate prompts posing ever more challenging problems and tasks.
Conventional LLMs rely on the concept of "autoregressive decoding," where each item ("token") in a sequence is predicted based on previously generated outputs. This approach inevitably leads to delays for more complicated prompts, though researchers have tried to mitigate this with projects that leverage the parallelism of multicore computer chips more effectively. For example, speculative decoding uses a fast draft model to propose tokens that are then verified in parallel by a slower, high-quality model.
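To make the sequential dependency concrete, here is a minimal, illustrative Python sketch of greedy autoregressive decoding, plus the verification idea behind speculative decoding. The lookup table and the toy_next_token function are hypothetical stand-ins for a real model's forward pass, not any actual LLM API.

# Minimal sketch of greedy autoregressive decoding with a toy stand-in
# for a language model. TOY_TABLE and toy_next_token are hypothetical
# placeholders for a real model's forward pass, not an actual LLM API.

TOY_TABLE = {
    ("The",): "cat",
    ("The", "cat"): "sat",
    ("The", "cat", "sat"): "<eos>",
}

def toy_next_token(prefix):
    """Stand-in for one forward pass: return the most likely next token."""
    return TOY_TABLE.get(tuple(prefix), "<eos>")

def autoregressive_decode(prompt_tokens, max_new_tokens=16):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Each step depends on every previously generated token,
        # so the steps cannot run in parallel.
        nxt = toy_next_token(tokens)
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return tokens

def speculative_step(prefix, draft_tokens):
    """Verification idea behind speculative decoding: a fast draft model
    proposes several tokens, and the slower target model keeps the longest
    prefix it agrees with (checked in one batched pass in a real system)."""
    accepted = []
    for tok in draft_tokens:
        if toy_next_token(prefix + accepted) == tok:
            accepted.append(tok)
        else:
            break
    return accepted

print(autoregressive_decode(["The"]))                   # ['The', 'cat', 'sat']
print(speculative_step(["The"], ["cat", "sat", "on"]))  # ['cat', 'sat']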
A newer class of methods instead exploits "semantic independence," identifying syntactic patterns like bullet points and expanding each in parallel. But these methods rely on hand-crafted syntactic heuristics, which are brittle and often fail when responses deviate from expected formats.
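For contrast, the following minimal sketch shows what such a syntactic heuristic might look like: split a bulleted skeleton on a fixed pattern and expand each point concurrently. The expand_point function is a hypothetical stand-in for a model call; the sketch is only in the spirit of skeleton-based methods, not an actual implementation of any of them.

# Minimal sketch of a syntactic-heuristic approach: find bullet points in a
# skeleton and expand each in parallel. expand_point is a hypothetical
# placeholder for an LLM call, not a real API.
import re
from concurrent.futures import ThreadPoolExecutor

def expand_point(point: str) -> str:
    return f"{point}: ...expanded text..."  # stand-in for a model call

def parallel_expand(skeleton: str) -> list[str]:
    # The heuristic hinges entirely on this fixed bullet pattern.
    points = re.findall(r"^\s*[-*]\s+(.*)$", skeleton, flags=re.MULTILINE)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(expand_point, points))

skeleton = "- define the problem\n- outline the method\n- report results"
print(parallel_expand(skeleton))

If the model answers in running prose rather than bullets, the regular expression finds nothing to parallelize, which is exactly the brittleness described above.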
These shortcomings inspired researchers at MIT's Computer Science and Artificial Intelligence Lab (CSAIL) and Google to use a learning-based approach to parallel decoding. Instead of relying on fixed rules, their method trains LLMs to recognize semantic independence—that is, to identify and decode semantically independent chunks of text in parallel.
The result: PASTA.
Specifically, the CSAIL team's Parallel Structure Annotation (PASTA) enables LLMs to generate text in parallel, dramatically accelerating their response times. Unlike previous attempts that relied on rigid, hand-coded rules to identify independent text segments, PASTA teaches LLMs to inherently understand and express these parallelization opportunities within their own responses.
This approach—called learned asynchronous decoding—marks a shift toward teaching models to orchestrate their own parallel decoding strategy. The findings are published on the arXiv preprint server.
"Traditional LLMs are like a single cook making lasagna, one step at a time," explained Tian Jin, lead author of a new paper on the project that was presented at the International Conference on Machine Learning (ICML 2025) in Vancouver. "PASTA teaches the cook to recognize when different parts of the lasagna can be prepared simultaneously, like mixing a subset of ingredients while the oven preheats, leading to a much faster process overall."
This innovation tackles a fundamental bottleneck in LLM inference, where the sequential nature of decoding often results in underutilized hardware and lengthy wait times for users. Current LLMs can take seconds or even minutes to fulfill user requests, a latency issue that PASTA aims to resolve.
At the heart of PASTA are two main components: PASTA-LANG, an annotation language that allows LLMs to tag semantically independent parts of their responses, and an interpreter that acts on these tags to orchestrate parallel decoding during inference. As Jin explains, you can think of PASTA-LANG as a set of instructions the LLM writes for itself, marking sections of its output that can be worked on simultaneously. The interpreter then reads these instructions and manages the parallel generation of those sections.
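The following illustrative-only sketch shows that annotate-then-orchestrate pattern: output text carries tags marking independent spans, and an interpreter decodes those spans concurrently and splices them back in. The tag names and the fill_chunk function are hypothetical stand-ins, not the actual PASTA-LANG syntax or the paper's interpreter.

# Illustrative-only sketch of the annotate-then-orchestrate idea. The
# <parallel .../> tag format and fill_chunk are hypothetical stand-ins,
# not the real PASTA-LANG syntax or interpreter.
import re
from concurrent.futures import ThreadPoolExecutor

def fill_chunk(topic):
    """Hypothetical stand-in for decoding one independent chunk of text."""
    return f"[generated text about {topic}]"

def interpret(annotated):
    # The model's own output marks which spans are independent of each other.
    tags = re.findall(r'<parallel id="(\d+)" topic="([^"]+)"/>', annotated)
    # The interpreter decodes the marked spans concurrently...
    with ThreadPoolExecutor() as pool:
        fills = dict(zip((tag_id for tag_id, _ in tags),
                         pool.map(fill_chunk, (topic for _, topic in tags))))
    # ...and splices each generated chunk back in place of its tag.
    return re.sub(r'<parallel id="(\d+)"[^>]*/>',
                  lambda m: fills[m.group(1)], annotated)

annotated = ('Here are two tips for better focus. '
             '<parallel id="1" topic="sleep"/> '
             '<parallel id="2" topic="exercise"/>')
print(interpret(annotated))

In the real system, the tagged spans would be decoded by the model itself in parallel decoding threads; the sketch only shows the orchestration pattern the interpreter is responsible for.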
The team trained LLMs to generate these PASTA-LANG annotations through a two-stage fine-tuning process. This training not only optimizes for decoding speed but also roughly maintains, and in some cases improves, the quality of the generated responses. This dual optimization is a significant leap forward, as it enables continued improvements in both speed and quality as more training compute becomes available.
In experiments with PASTA on the AlpacaEval benchmark, the team's self-parallelizing model showed geometric mean speedups approaching 2x while experiencing only minor changes in response quality (ranging from a 2% gain to a 7% drop). This means users can expect responses nearly twice as fast without a noticeable decrease in accuracy or coherence.
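For readers unfamiliar with the metric, a geometric mean speedup is the n-th root of the product of per-prompt speedups, which keeps a few unusually easy or hard prompts from dominating the average. The numbers in this small computation are made up for illustration, not results from the paper.

# Illustrative arithmetic only: how a geometric-mean speedup is computed
# across prompts. The per-prompt speedups below are made-up numbers, not
# results from the PASTA paper.
import math

speedups = [1.6, 2.3, 1.9, 2.1]  # hypothetical per-prompt speedups
geo_mean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
print(f"geometric mean speedup: {geo_mean:.2f}x")  # ~1.96x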
"It was surprising to see this behavior of having an LLM orchestrate its own inference-time behavior," Jin says. "It was illuminating—and in a way, magical—to see how throwing more compute at these algorithms yields increasingly sophisticated self-orchestration behavior."
The research highlights a critical challenge in the field: balancing speed and quality. Prior methods such as Skeleton-of-Thought (SoT) and APAR attempted parallel decoding by looking for manually specified syntactic structures like bullet points or paragraphs. However, these methods were often rigid and imprecise, failing to identify parallelization opportunities when responses deviated even slightly from expected patterns. PASTA's learning-based approach, in contrast, offers a more robust and scalable solution.
"It's about empowering the LLM to be smarter about how it generates content," says Jin, a Ph.D. student at CSAIL. "Instead of us trying to guess where it can work in parallel, we're teaching the LLM to identify those opportunities itself, on the fly."
Looking ahead, the team is optimistic about the broader implications of PASTA. The ability to significantly reduce LLM decoding latency could lead to reduced computational resource requirements, making these powerful AI models more accessible and affordable to a wider range of users and applications.
"We've essentially designed a protocol for an LLM to optimize itself," says Jin. "By improving the efficacy of LLM inference, PASTA could significantly reduce computational resource requests and improve accessibility of LLMs."
Jin spearheaded the project alongside his two faculty advisers, MIT professors Michael Carbin and Jonathan Ragan-Kelley. Other paper co-authors include CSAIL's Ellie Y. Cheng and Zack Ankner, and Google researchers Suvinay Subramanian, Nikunj Saunshi, Blake M. Elias, and Amir Yazdanbakhsh.
More information: Tian Jin et al, Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding, arXiv (2025). DOI: 10.48550/arxiv.2502.11517
Journal information: arXiv
Provided by Massachusetts Institute of Technology