this post was submitted on 15 May 2025

88 points (83.3% liked)

Algorithm based on LLMs doubles lossless data compression rates (techxplore.com)

submitted 1 month ago by NoSpotOfGround@lemmy.world to c/technology@lemmy.world


[–] dotdi@lemmy.world 70 points 1 month ago (1 children)

Can’t wait to find hallucinated data in your uncompressed files.

[–] MuAraeOracle@real.lemmy.fan 37 points 1 month ago (3 children)

Ultimate compression: it just replaces the video with a prompt, like Be Kind Rewind.

[–] Bezier@suppo.fi 24 points 1 month ago (1 children)

Compress file
Edit the prompt to "die hard 4k h265"
Decompress
Free movie

[–] MuAraeOracle@real.lemmy.fan 3 points 1 month ago

Hey, it's illegal under the new anti-AI-sweding laws!

[–] audaxdreik@pawb.social 5 points 1 month ago

Except it's Bob from Reboot instead of Jack Black.

[–] fluxion@lemmy.world 4 points 1 month ago

Vid was basically you playing with a cat so it got replaced with a token to fetch some stock footage of someone playing with a cat when you decompress.

[–] AbouBenAdhem@lemmy.world 55 points 1 month ago* (last edited 1 month ago)

The basic idea behind the researchers' data compression algorithm is that if an LLM knows what a user will be writing, it does not need to transmit any data, but can simply generate what the user wants them to transmit on the other end

Great... but if that’s the case, maybe the user should reconsider the usefulness of transmitting that data in the first place.

[–] andallthat@lemmy.world 27 points 1 month ago* (last edited 1 month ago) (1 children)

I tried reading the paper. There is a free preprint version on arxiv. This page (from the article linked by OP) also links the code they used and the data they tried compressing, in the end.

While most of the theory is above my head, the basic intuition is that compression improves if you have some level of "understanding" or higher-level context of the data you are compressing. And LLMs are generally better at doing that than numeric algorithms.

As an example if you recognize a sequence of letters as the first chapter of the book Moby-Dick you'll probably transmit that information more efficiently than a compression algorithm. "The first chapter of Moby-Dick"; there .. I just did it.
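
A rough way to put numbers on that intuition: under any predictive model, the ideal cost of a symbol is -log2 of the probability the model assigned to it, so a model that predicts the data better needs fewer bits. The sketch below is only an illustration, not anything from the paper. The two "models" (a uniform one and a letter-frequency one) and the names `code_length_bits`, `uniform` and `freq` are assumptions made up for this example, and the frequency model cheats a bit by being built from the very text it is scored on.

```python
import math
from collections import Counter

def code_length_bits(text, prob):
    """Ideal code length: sum of -log2 p(symbol) under the given model."""
    return sum(-math.log2(prob(ch)) for ch in text)

text = "call me ishmael. " * 20

# Model A knows nothing: each of the 256 byte values is equally likely.
uniform = lambda ch: 1 / 256

# Model B "understands" the data a little: it knows the letter frequencies.
# (Assumption for the sketch: both sides already share these frequencies.)
counts = Counter(text)
freq = lambda ch: counts[ch] / len(text)

bits_uniform = code_length_bits(text, uniform)
bits_freq = code_length_bits(text, freq)
print(f"uniform model:   {bits_uniform:.0f} bits")
print(f"frequency model: {bits_freq:.0f} bits")
```

An LLM plays the role of a much stronger Model B: it conditions on everything seen so far, so likely continuations cost a fraction of a bit each.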

[–] underrate170@kbin.earth 2 points 1 month ago

Very helpful analogy!

[–] skip0110@lemm.ee 22 points 1 month ago (1 children)

This is not new knowledge and predates the current LLM fad.

See the Hutter prize which has had “machine learning” based compressors leading the ranking for some time: http://prize.hutter1.net/

It’s important to note when applied to compressors, the model does produce a code (aka encoding) that exactly reproduces the input. But on a different input the same model is unlikely to produce an impressive compression.

[–] Dragonstaff@leminal.space 2 points 1 month ago (1 children)

Can you define "compressors" here? (Google was unhelpful.)

[–] skip0110@lemm.ee 2 points 1 month ago

I could have said it better.

I mean compressor as half of a compression/decompression algorithm. The better way I should have worded it is: when you apply machine learning to a compression problem, you can do it losslessly. Your decompressed output will be identical to the input, every time.

“NNCP” is a good search term to learn more, specifically about how this works.

[–] Alphane_Moon@lemmy.world 22 points 1 month ago* (last edited 1 month ago) (2 children)

I found the article to be rather confusing.

One thing to point out is that the video codec used in this research (but for which results weren't published for some reason), H264, is not at all state of the art.

H265 is far newer and they are already working on H266. There are also other much higher quality codecs such as AV1. For what it's worth, they do reference H265, but I don't have access to the source research paper, so it's difficult to say what they are comparing against.

The performance relative to FLAC is interesting though.

[–] InvertedParallax@lemm.ee 5 points 1 month ago* (last edited 1 month ago)

VVC is H.266. The spec is ready; it's just not in a lot of hardware, or even decent software, yet. That often takes a few years. The reference implementation encodes at like 1 fps or less, but reference software is usually slow as hell in favor of correctness and code comprehension.

AV1 isn't much better than HEVC (H.265); it's just open and royalty-free, and Google is pushing it like crazy.

It has, IIRC, one major feature over HEVC, non-square subpictures; beyond that it has some extensions for animation and slideshows, basically.

[–] paraphrand@lemmy.world 3 points 1 month ago (1 children)

I wonder what the practical reasons for starting with h.264 are.

[–] entropicdrift@lemmy.sdf.org 2 points 1 month ago

Low/no patent issues, much lower complexity

[–] deur 20 points 1 month ago

This is just a more complex version of shared dictionary compression which I think one of the web compression algorithms does. Stupid LLM fuckers at it again with dumb garbage.
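
For reference, plain shared dictionary compression can be sketched with the preset-dictionary support in Python's `zlib` module (`zdict`). This is a minimal illustration of the idea named above, not the researchers' method; the sample strings and the helper names `compress` and `decompress` are made up for the example.

```python
import zlib

# Known to both sender and receiver in advance, never transmitted.
shared = b"The quick brown fox jumps over the lazy dog. " * 4
message = b"The quick brown fox jumps over the lazy cat."

def compress(data, zdict=None):
    """Deflate data, optionally priming the compressor with a shared dictionary."""
    c = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return c.compress(data) + c.flush()

def decompress(blob, zdict=None):
    """Inflate data; the same dictionary must be supplied if one was used."""
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob) + d.flush()

plain = compress(message)
with_dict = compress(message, shared)

# Lossless round trip, but only if the receiver has the identical dictionary.
assert decompress(with_dict, shared) == message
print(len(message), len(plain), len(with_dict))
```

The LLM approach swaps the fixed dictionary for a model that both sides must hold, which is where the compatibility concerns elsewhere in this thread come from.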

[–] besselj@lemmy.ca 13 points 1 month ago* (last edited 1 month ago) (1 children)

So if I have two machines running the same local LLM and I pass a prompt between them, I've achieved data compression by transmitting the prompt rather than the LLM's expected response to the prompt? That's what I'm understanding from the article.

Neat idea, but what if you want to transmit some information that an LLM can't tokenize and generate accurately?

[–] taladar@sh.itjust.works 7 points 1 month ago

And how do I get the prompt that will reliably generate the data from the data? Usually for compression we do not start from an already compressed version.

[–] xep@fedia.io 8 points 1 month ago (2 children)

If this really is lossless, it is incredible. I'm skeptical until I see it in action though.

[–] MudMan@fedia.io 15 points 1 month ago (1 children)

Lossless is the big claim that nobody is fixating on because "AI" discussions only ever run one set of talking points.

I get how semantic understanding would trade performance for file size when doing compression. I don't get how you can deterministically use it to always get the exact same complete output from a partial input. I'd love to go over the full paper. And even then the maths would probably go way, way over my head.

[–] barsoap@lemm.ee 3 points 1 month ago (2 children)

So... crystal ball, as I don't have access to the paper either: think arithmetic coders, since neural nets are function approximators. You send an initial token and the NN will start to generate deterministically; once you detect a divergence from the lossless ideal, you send another token to put it on track again. Make it a sliding window so things don't become too computationally expensive. You architect the model not to be smart but to need little guidance following "external reasoning", so to speak.

The actual disadvantage of this kind of thing will be the model size: yes, you might be able to transmit a book in a kilobyte (100x or more compression), but both encoder and decoder will need access to gigabytes of neural weights, and that's just for text. It's also not going to be computationally cheap, though probably cheaper than PAQ.

[–] Snazz@lemmy.world 3 points 1 month ago

Arithmetic coding is one of my favorite algorithms. Any token predictor can be converted into an entropy encoder!
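
A toy version of that conversion, as a sketch only: `predict` is a made-up stand-in for the token predictor (Laplace-smoothed counts instead of an LLM), exact `Fraction` arithmetic replaces the fixed-precision tricks real coders use, and the message length is assumed to be sent separately. None of this comes from the paper.

```python
import math
from collections import Counter
from fractions import Fraction

def predict(history, alphabet):
    """Toy adaptive predictor (hypothetical stand-in for an LLM):
    Laplace-smoothed counts of the symbols seen so far."""
    counts = Counter(history)
    total = len(history) + len(alphabet)
    return [(s, Fraction(counts[s] + 1, total)) for s in alphabet]

def encode(message, alphabet):
    """Narrow [low, low + width) once per symbol, then name a point inside it."""
    low, width = Fraction(0), Fraction(1)
    for i, sym in enumerate(message):
        cum = Fraction(0)
        for s, p in predict(message[:i], alphabet):
            if s == sym:
                low, width = low + cum * width, p * width
                break
            cum += p
    # Shortest binary fraction k / 2**nbits that lies inside the final interval.
    nbits = 0
    while True:
        k = math.ceil(low * 2**nbits)
        if Fraction(k, 2**nbits) < low + width:
            return k, nbits
        nbits += 1

def decode(k, nbits, length, alphabet):
    """Replay the same predictions and pick the sub-interval containing the point."""
    value = Fraction(k, 2**nbits)
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        cum = Fraction(0)
        for s, p in predict(out, alphabet):
            if value < low + (cum + p) * width:
                out.append(s)
                low, width = low + cum * width, p * width
                break
            cum += p
    return "".join(out)

message = "abracadabra"
alphabet = sorted(set(message))
k, nbits = encode(message, alphabet)
assert decode(k, nbits, len(message), alphabet) == message
print(f"{8 * len(message)} bits raw, {nbits} bits coded")
```

The decoder only works because it runs the exact same predictor on the exact same history, so the better the predictor, the wider the final interval and the fewer bits are needed to name a point in it.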

[–] MudMan@fedia.io 2 points 1 month ago (1 children)

Trading processing power for size is a thing. I guess it depends on application and implementation. Well, and on the actual size of the models required.

It's one of those things that makes for a good headline, but then for usability it has to be part of a whole conversation about whether you want to spend the bandwidth, the processing power on compression, the processing power on real time upscaling, the processing power on different compression tools, something else or a mix of the above.

I suppose at some point it's all "benchmarks or it didn't happen" for these things. And when it comes to ML benchmarks are increasingly iffy anyway.

[–] Harlehatschi@lemmy.ml 1 points 1 month ago

But spending a lot of processing power to gain smaller sizes matters mostly in cases where you want to store things long term. You probably wouldn't want to have to keep the exact same LLM with the same weights and stuff around in that case.

[–] besselj@lemmy.ca 6 points 1 month ago

Extraordinary claims require extraordinary evidence.

[–] tekato@lemmy.world 6 points 1 month ago

Interesting how they forgot to go over the architecture for LMDecompress.

[–] Harlehatschi@lemmy.ml 6 points 1 month ago* (last edited 1 month ago) (2 children)

Ok so the article is very vague about what's actually done. But as I understand it the "understood content" is transmitted and the original data reconstructed from that.

If that's the case I'm highly skeptical about the "losslessness" or that the output is exactly the input.

But there are more things to consider like de-/compression speed and compatibility. I would guess it's pretty hard to reconstruct data with a different LLM or even a newer version of the same one, so you have to make sure you decompress your data some years later with a compatible LLM.

And when it comes to speed I doubt it's nearly as fast as using zlib (which is neither the fastest nor the best compressing...).

And all that for a high risk of bricked data.

[–] barsoap@lemm.ee 4 points 1 month ago (1 children)

I would guess it’s pretty hard to reconstruct data with a different LLM

I think the idea is to have compressor and decompressor use the exact same neural network. Looks like arithmetic coding with a learned function.

But yes model size is probably going to be an issue.

[–] Harlehatschi@lemmy.ml 2 points 1 month ago

Ye but that would limit the use cases to very few. Most of the time you compress data to either transfer it to a different system or to store it for some time, in both cases you wouldn't want to be limited to the exact same LLM. Which leaves us with almost no use case.

I mean... cool research... kinda.... but pretty useless.

[–] modeler@lemmy.world 1 points 1 month ago

I'm guessing that exactly the same LLM model is used (somehow) on both sides - using different models or different weights would not work at all.

An LLM is (at core) an algorithm that takes a bunch of text as input and outputs a list of word/probability pairs whose probabilities sum to 1.0. You could place a wrapper on this that sorts the list from most to least probable. A specific word can then be identified by its index in the list, i.e. first word, tenth word etc.

(Technically the system uses 'tokens' which represent either whole words or parts of words, but that's not important here).

A document can be compressed by feeding in each word in turn, creating the sorted list with the LLM, and searching for the next word in that list. If the LLM is good, the output will be a stream of small integers. If the LLM is a perfect predictor, the next word will always be at the top of the list, i.e. a 1. A bad prediction will be a relatively large number, up to the size of the vocabulary.

Streams of small numbers are very well (even optimally) compressed using extant technology.
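
The scheme described above can be sketched in a few lines. This is a guess at the general idea, not the paper's implementation: `ranked_guesses` is a hypothetical stand-in for the LLM (it just ranks words by how often they followed the previous word so far), and the sample sentence is made up.

```python
from collections import Counter

def ranked_guesses(history, vocab):
    """Toy stand-in for the LLM: rank words by how often they have followed
    the previous word so far. Ties fall back to vocab order, so compressor
    and decompressor always build the identical list."""
    prev = history[-1] if history else None
    follow = Counter(b for a, b in zip(history, history[1:]) if a == prev)
    return sorted(vocab, key=lambda w: (-follow[w], vocab.index(w)))

def compress(words, vocab):
    """Replace each word with its 1-based rank in the predictor's list."""
    return [ranked_guesses(words[:i], vocab).index(w) + 1
            for i, w in enumerate(words)]

def decompress(ranks, vocab):
    """Rebuild the text by picking the word at each rank from the same list."""
    words = []
    for r in ranks:
        words.append(ranked_guesses(words, vocab)[r - 1])
    return words

text = "the cat sat on the mat and the cat sat on the hat".split()
vocab = sorted(set(text))
ranks = compress(text, vocab)
assert decompress(ranks, vocab) == text
print(ranks)
```

Once the toy predictor has seen a phrase, repeats of it come out as runs of 1s, which is the stream of small numbers a conventional entropy coder would then squeeze further.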

[–] fluxion@lemmy.world 5 points 1 month ago (1 children)

Middle-LLM compression

[–] sherlock@feddit.nu 3 points 1 month ago

We’re heading for a 5.2 Weissman score

[–] myrrh@ttrpg.network 3 points 1 month ago

...large-language models do not comport with lossless data reconstruction in my experience; quite the opposite...