this post was submitted on 15 May 2025
Technology
Ok, so the article is very vague about what's actually being done. But as I understand it, the "understood content" is what gets transmitted, and the original data is reconstructed from that.
If that's the case, I'm highly skeptical about the "losslessness", i.e. that the output is exactly the input.
But there are more things to consider, like de-/compression speed and compatibility. I would guess it's pretty hard to reconstruct data with a different LLM, or even with a newer version of the same one, so you'd have to make sure that when you decompress your data some years later, you still have a compatible LLM available.
And when it comes to speed, I doubt it's anywhere near as fast as zlib (which is neither the fastest nor the best-compressing option...).
And all that for a high risk of bricked data.
I think the idea is to have the compressor and decompressor use the exact same neural network. Looks like arithmetic coding with a learned probability function.
But yes model size is probably going to be an issue.
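The "arithmetic coding with a learned function" idea can be sketched in a few lines. This is a toy illustration, not the paper's actual method: a fixed, hand-picked probability table stands in for the neural network (in the real scheme the model would supply per-symbol probabilities), exact `Fraction` arithmetic stands in for a practical bit-level coder, and `$` is a made-up end-of-message marker.

```python
from fractions import Fraction

# Hand-picked stand-in for a learned model: P(a)=1/2, P(b)=1/4, P($)=1/4.
# "$" is a hypothetical end-of-message symbol.
probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "$": Fraction(1, 4)}
symbols = ["a", "b", "$"]

def cum(sym):
    """Cumulative probability of all symbols ranked before `sym`."""
    c = Fraction(0)
    for s in symbols:
        if s == sym:
            return c
        c += probs[s]

def encode(msg):
    """Narrow the interval [0, 1) once per symbol; return a point inside it."""
    low, width = Fraction(0), Fraction(1)
    for ch in msg + "$":
        low += width * cum(ch)
        width *= probs[ch]
    return low  # any number in [low, low + width) identifies msg

def decode(code):
    """Replay the same interval narrowing to recover the symbols."""
    out = []
    low, width = Fraction(0), Fraction(1)
    while True:
        for s in symbols:
            lo = low + width * cum(s)
            hi = lo + width * probs[s]
            if lo <= code < hi:
                if s == "$":
                    return "".join(out)
                out.append(s)
                low, width = lo, hi - lo
                break

msg = "abba"
assert decode(encode(msg)) == msg
```

Likely symbols get wide subintervals, so they cost few bits; the better the model's predictions, the shorter the code. The point both comments make holds here too: decoding only works because both sides use *identical* probabilities.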
Yeah, but that would limit it to very few use cases. Most of the time you compress data either to transfer it to a different system or to store it for some time; in both cases you wouldn't want to be tied to the exact same LLM. Which leaves almost no use case.
I mean... cool research... kinda... but pretty useless.
I'm guessing that exactly the same LLM (same architecture and same weights) is somehow used on both sides; using different models or different weights would not work at all.
An LLM is (at its core) an algorithm that takes a bunch of text as input and produces a list of word/probability pairs whose probabilities sum to 1.0. You could place a wrapper on this that sorts the words by probability. A specific word can then be identified by its index in that list, i.e. first word, tenth word, etc.
(Technically the system uses 'tokens' which represent either whole words or parts of words, but that's not important here).
A document can be compressed by feeding in each word in turn, generating the ranked list from the LLM, and finding the actual next word in that list. If the LLM is good, the output will be a stream of small integers. If the LLM were a perfect predictor, the next word would always be at the top of the list, i.e. always a 1. A bad prediction yields a relatively large number, in the thousands or millions.
Streams of small numbers are very well (even optimally) compressed using extant technology.
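The rank-based scheme described above can be sketched with a toy stand-in for the LLM. This is an illustrative assumption, not the actual system: a simple adaptive frequency counter (symbols ranked by how often they've been seen so far) plays the role of the predictor, and compression emits each symbol's rank. Because the decompressor runs the identical model and replays the identical updates, the round trip is exact, which is the whole point both commenters are making about needing the same model on both sides.

```python
from collections import Counter

class AdaptiveRankModel:
    """Toy predictor standing in for an LLM: ranks symbols by how
    often they've been seen so far (ties broken alphabetically)."""
    def __init__(self, alphabet):
        self.counts = Counter({s: 0 for s in alphabet})

    def ranking(self):
        # Most frequent first; deterministic tie-break so both
        # sides always produce the exact same list.
        return sorted(self.counts, key=lambda s: (-self.counts[s], s))

    def update(self, symbol):
        self.counts[symbol] += 1

def compress(text, alphabet):
    model = AdaptiveRankModel(alphabet)
    ranks = []
    for ch in text:
        ranks.append(model.ranking().index(ch))  # rank of actual next symbol
        model.update(ch)  # decompressor will make the same update
    return ranks

def decompress(ranks, alphabet):
    model = AdaptiveRankModel(alphabet)
    out = []
    for r in ranks:
        ch = model.ranking()[r]  # identical model state => identical ranking
        out.append(ch)
        model.update(ch)
    return "".join(out)

alphabet = "abcdefghijklmnopqrstuvwxyz "
msg = "the better the model the smaller the ranks"
ranks = compress(msg, alphabet)
assert decompress(ranks, alphabet) == msg
```

A better predictor would concentrate the ranks near zero, and the resulting stream of small integers is exactly what a conventional entropy coder (or arithmetic coder) then squeezes down. Change the model on either side, even slightly, and the decompressed output is silently wrong, which is the "bricked data" risk raised above.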