Algorithm based on LLMs doubles lossless data compression rates

May 14, 2025 feature



Image comparing the lossless compression rates of LMCompress with traditional state-of-the-art methods and with the large-model-based method proposed independently by a DeepMind-Meta&INRIA team. The comparison covers four types of data: image, video, audio, and text. It shows that LMCompress consistently outperforms the others on all data types. Note that the DeepMind result on video is not available. Credit: Li et al.

People store large quantities of data on their digital devices and transfer some of this data to others, whether for professional or personal reasons. Data compression methods are thus of the utmost importance, as they can improve the efficiency of devices and communications, making users less reliant on cloud data services and external storage devices.

Researchers at the Central China Institute of Artificial Intelligence, Peng Cheng Laboratory, Dalian University of Technology, the Chinese Academy of Sciences and the University of Waterloo recently introduced LMCompress, a new data compression approach based on large language models (LLMs), such as the model underpinning the AI conversational platform ChatGPT.

Their proposed method, outlined in a paper published in Nature Machine Intelligence, was found to be significantly more powerful than classical data compression algorithms.

"In January 2023, after I taught a Kolmogorov complexity course on the College of Waterloo, I mirrored on the concept compression is knowing," Ming Li, senior creator of the paper, instructed Tech Xplore. "In different phrases, when you perceive one thing, you may categorical it succinctly; and when you can categorical one thing in very brief expression or in a couple of phrases, then it’s essential to perceive it.

"On this paper: we proved that compression implies the most effective studying/understanding. The other was proved in one in all our different papers, which was a precursor to this work, whereas one other paper by Google DeepMind independently obtained our preliminary outcomes."

Image illustrating the key insight of the team's paper. The insight that understanding is equivalent to compression bridges a cognitive concept (comprehension) and a technological concept (compression). It sheds light on developing understanding-based technologies, such as semantic communication. Credit: Li et al.

As part of their recent study, Li and his colleagues set out to prove that the better models grasp data, the better they can summarize and compress it. This idea dates back to 1948, specifically to Claude Shannon's renowned mathematical theory of communication.

"Shannon primarily proposed that when you perceive the info to be communicated, then you may compress it, or in different phrases, shorten communication time," defined Li. "For 80 years, this analysis thought problem remained open, till AI and huge language fashions got here alongside. Our paper primarily proposes that if a big language mannequin can perceive knowledge properly, it should have the ability to guess what we plan to jot down, which permits us to compress the info considerably higher than the most effective classical lossless knowledge compressors (e.g., bzip for textual content, JPEG-2000 for photos)."

The basic idea behind the researchers' data compression algorithm is that if an LLM knows what a user will be writing, it does not need to transmit all the data, but can largely generate what the user wants to transmit on the other end (i.e., on a receiver's device). When Li and his colleagues tested their proposed approach, they found that it at least doubled compression rates for various types of data, including texts, images, videos and audio files.
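The mechanism Li describes, where better prediction yields shorter codes, can be sketched with a toy character-level model standing in for the LLM. The probability table below is invented purely for illustration (it is not the authors' model), and the function computes the ideal code length an arithmetic coder driven by such a model would achieve:

```python
import math

# Toy stand-in for a predictive model: probability of the next character
# given the previous one. LMCompress queries a large generative model for
# these probabilities; this hand-built table is an illustrative assumption.
def next_char_probs(prev):
    if prev == "q":
        return {"u": 0.9, "a": 0.05, "e": 0.05}
    return {"a": 0.3, "e": 0.3, "u": 0.3, "q": 0.1}

def ideal_code_length_bits(text):
    """Bits an ideal arithmetic coder driven by the model would emit:
    the sum of -log2 P(char | context) over the text."""
    bits = 0.0
    prev = ""
    for ch in text:
        p = next_char_probs(prev).get(ch, 1e-6)  # tiny floor for unmodeled chars
        bits += -math.log2(p)
        prev = ch
    return bits

# A sequence the model predicts well ("qu") costs far fewer bits than a
# surprising one ("qa"): better prediction means better compression.
print(ideal_code_length_bits("qu"))  # ~3.47 bits
print(ideal_code_length_bits("qa"))  # ~7.64 bits
```

The receiver, running the same model, can decode because it reproduces the identical probability distributions at every step; the stronger the model's "understanding" of the data, the fewer bits need to cross the wire.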

"That is wonderful within the sense that after 80 years of analysis, when you simply enhance a lossless compression algorithm by even 1%, that is already outstanding, and we had been capable of double compression charges," mentioned Li. "LMCompress is a compression algorithm utilizing massive fashions (massive language mannequin for texts, massive picture mannequin for photos, and so forth.). It compresses texts greater than two instances higher than classical algorithms, photos and audios two instances higher, and video barely lower than two instances higher. Due to this fact, while you transmit knowledge, you may go roughly two instances sooner."

This recent paper by Li and his colleagues could inform future efforts aimed at developing increasingly advanced data compression techniques, inspiring other researchers to leverage LLMs. Moreover, the team's LMCompress algorithm could soon be improved further and deployed in real-world settings.

"We demonstrated that understanding equals compression, and we expect that is of essential significance," added Li. "We additionally paved the way in which for a brand new period of compressing knowledge utilizing LLMs. We predict sooner or later, when these massive fashions are on our cell telephones and in every single place, our technique of compressing knowledge will change the classical ones (e.g., .zip recordsdata). In our subsequent research, we additionally plan to make use of our methodology to match massive fashions and detect plagiarism."

More information: Ziguang Li et al, Lossless data compression by large models, Nature Machine Intelligence (2025). DOI: 10.1038/s42256-025-01033-7

Journal information: Nature Machine Intelligence

© 2025 Science X Network

Citation: Algorithm based on LLMs doubles lossless data compression rates (2025, May 14) retrieved 15 May 2025 from https://techxplore.com/news/2025-05-algorithm-based-llms-lossless-compression.html This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
