Beating the AI bottleneck: Communications innovation could markedly improve AI training process

July 11, 2025
Gaby Clark, scientific editor
Andrew Zinin, lead editor

Editors' notes: This article has been reviewed according to Science X's editorial process and policies. Editors have highlighted the following attributes while ensuring the content's credibility: fact-checked, trusted source, proofread.

ZEN System Overview. Credit: Zhuang Wang et al.

Artificial intelligence (AI) is infamous for its resource-heavy training, but a new study may have found a solution in a novel communications system, called ZEN, that markedly improves the way large language models (LLMs) train.

The research team at Rice University was led by doctoral graduate Zhuang Wang and computer science professor T.S. Eugene Ng, with contributions from two other computer science faculty members, assistant professor Yuke Wang and professor Anshumali Shrivastava. Zhaozhuo Xu of the Stevens Institute of Technology and Jingyi Xi of Zhejiang University also contributed to the project.

Distributed training, sparsity and communication

Wang said there are two phases where LLMs can bottleneck during the distributed training process: computation and communication.

The first occurs when the model needs to crunch through a large amount of data, which can bog down the system and consume time and computing power. Splitting the data among hundreds, sometimes thousands, of graphics processing units (GPUs) helps manage that problem: each GPU processes its own share of the data samples separately, and the results are then fed back into the model.

The second bottleneck happens when all those GPUs need to sync up so they can "talk" to the model and convey what they've learned. They need to efficiently communicate with one another to complete each training run smoothly and can slow down if the model gradients they have to sync are very large, which they often are.
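
The two phases Wang describes can be sketched in a few lines. The worker count, tensor sizes, and the toy local_gradient function below are illustrative assumptions for this sketch, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_gradient(params, batch):
    """Stand-in for backprop: each worker computes a gradient on its own shard.
    Toy quadratic loss, so the gradient is just params - mean(batch)."""
    return params - batch.mean(axis=0)

# Hypothetical setup: 4 workers, each holding its own shard of the batch
params = rng.normal(size=1000)
shards = [rng.normal(size=(64, 1000)) for _ in range(4)]

# Computation phase: every worker crunches its shard independently
grads = [local_gradient(params, shard) for shard in shards]

# Communication phase: all gradients must be synchronized (here, averaged,
# as in an all-reduce) before any worker can take the next optimizer step
avg_grad = np.mean(grads, axis=0)
params -= 0.1 * avg_grad
```

The communication step is the choke point the article describes: no worker can proceed until every gradient has been exchanged, so its cost grows with the size of the gradients being synchronized.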

"The previous solution was to send all the data out. But in practice, we observe that the data has a lot of zero values in the 'talk,'" Wang said. "We need a data structure to represent the communication information correctly."

Removing those zero or near-zero values and leaving only the relevant ones to be synchronized during communication is called "sparsification." The values that are left are aptly named "sparse tensors." It's common practice in LLM training and can save the system the effort of communicating billions of extra gradients. But it still leaves the communication bottleneck, which is where the team focused its research.
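
Sparsification as described here amounts to sending (index, value) pairs instead of the full dense tensor. A minimal sketch, with the tensor size and zero fraction chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy gradient tensor: most entries are zero, as Wang observed in practice
grad = rng.normal(size=10_000)
grad[rng.random(10_000) < 0.99] = 0.0  # zero out ~99% of entries

# Sparsification: keep only the nonzero values plus their positions
indices = np.flatnonzero(grad)
values = grad[indices]

# The pair (indices, values) is what gets communicated
dense_cost = grad.size          # entries sent without sparsification
sparse_cost = 2 * indices.size  # one index + one value per survivor
print(f"compression: {dense_cost / sparse_cost:.1f}x")

# The receiver reconstructs the dense gradient exactly
rebuilt = np.zeros_like(grad)
rebuilt[indices] = values
assert np.array_equal(rebuilt, grad)
```

Since each surviving entry costs an index as well as a value, this pays off only when the tensor is sufficiently sparse, which is why the structure of the sparsity matters so much.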

"There's actually not a lot of fundamental understanding of how to support these sparse tensors inside of distributed training," Ng said. "People propose the idea, but they don't understand what the optimal way of handling them is. One of the contributions of our work is to analyze these sparse tensors to understand how they behave."

Mapping the system, finding the structure

There were essentially three parts to this research: Part one was figuring out the characteristics of sparse tensors in popular models. The nonzero gradients left after sparsification aren't uniformly distributed; their location and tensor density depend on factors like the training model and dataset used.

That scattering of nonzero gradients leads to an imbalance during the communication phase that slows down synchronization and, by extension, slows down the training process. This new understanding sheds light on how to design better communication schemes to use with sparse tensors.
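
One way to see the imbalance: if synchronization splits the tensor into equal index ranges, one per worker, clustered nonzeros overload some ranges while leaving others nearly idle. The beta-distributed positions and equal-range partitioning below are illustrative assumptions for this sketch, not the paper's actual scheme:

```python
import numpy as np

rng = np.random.default_rng(2)

# Nonzero gradients cluster in some regions of the tensor rather than
# spreading uniformly; model that with a skewed position distribution
positions = (rng.beta(2, 8, size=5_000) * 1_000_000).astype(int)

# Suppose synchronization partitions the tensor into equal index ranges,
# one per worker; count the nonzeros landing in each worker's range
workers = 8
counts, _ = np.histogram(positions, bins=workers, range=(0, 1_000_000))

# The step finishes only when the busiest worker does, so the ratio of the
# heaviest load to the average is a rough proxy for time lost to imbalance
imbalance = counts.max() / counts.mean()
print(counts, f"imbalance: {imbalance:.2f}x")
```

With a perfectly uniform spread the ratio would be close to 1; the skewed distribution pushes it well above that, which is the slowdown a sparsity-aware communication scheme must avoid.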

Once they knew how to approach their design, part two was figuring out the optimal communication schemes to use. Wang and Ng analyzed several options to determine what those were.

Because there was no optimal solution before this research, the third and final step was building a real-world system based on their analysis and applying it to practical LLM training to see if it worked. ZEN was that system, and it markedly improved training speed when used on real-world LLMs.

"What we basically show is that we can accelerate the time to completion of the training because the communication is more efficient. … The time it takes to perform one step in the training is much faster," Ng said.

Since sparse tensors are used often and the field of LLM training is so broad, this discovery can be applied to just about any model with, as Ng phrased it, "the characteristics of sparsity." Be it text or image generation, ZEN can speed up model training if sparse tensors are present.

Wang isn't new to this area of research. He and Ng previously collaborated on a project to minimize the failure recovery overhead of LLMs after a hardware or software failure during training, which they named GEMINI—unveiled at the ACM Symposium on Operating Systems Principles in 2023.

Wang recently presented his paper on this newer research, entitled "ZEN: Empowering Distributed Training with Sparsity-driven Data Synchronization," at the 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI) held in Boston.

More information: ZEN: Empowering Distributed Training with Sparsity-driven Data Synchronization, www.usenix.org/conference/osdi … entation/wang-zhuang

Provided by Rice University

Citation: Beating the AI bottleneck: Communications innovation could markedly improve AI training process (2025, July 11) retrieved 11 July 2025 from https://techxplore.com/news/2025-07-ai-bottleneck-communications.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.




Disclaimer: Information found on cryptoreportclub.com reflects the views of the writers quoted. It does not represent the opinion of cryptoreportclub.com on whether to sell, buy or hold any investments. You are advised to conduct your own research before making any investment decisions. Use the provided information at your own risk.
cryptoreportclub.com covers fintech, blockchain and Bitcoin, bringing you the latest crypto news and analyses on the future of money.

© 2023-2025 Cryptoreportclub. All Rights Reserved
