If you trust the perplexity measurements that people occasionally publish on the llama.cpp project page or in some papers, the ability to run a larger model completely dwarfs the slight loss introduced by quantization. That holds true down to a level like Q4 or so; the perplexity numbers aren't far from 8-bit or full precision. It starts to deteriorate faster with Q3 and Q2, if I remember correctly. I always try to fit the largest model I can at something between Q4 and Q8. I can't notice any major difference from full precision, though in theory every step of quantization lowers quality by a tiny amount.
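For reference, perplexity is just the exponential of the average negative log-likelihood over a test text (lower is better). A minimal sketch of what those published numbers measure, assuming you already have per-token log-probabilities from whatever runtime you use:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the evaluated tokens.
    A quantized model that barely moves this number has lost very little
    compared to the full-precision weights."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# toy example: natural-log probabilities of each token under the model
logprobs = [-1.2, -0.4, -2.1, -0.9, -0.3]
print(f"PPL = {perplexity(logprobs):.2f}")
```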
I've noticed that for Q4 and above too, with my sweet spot at Q6 if I can manage it. I'm really confused about Q2-Q3 for models that are 2x+ the size of the Q4 models. E.g. sometimes Gemma3 12b Q4 (or Q6) seems better than Gemma3 27b Q3_XS, and sometimes it seems the opposite.
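One concrete way to frame that 12b-vs-27b question is memory: file size is roughly parameters times bits-per-weight divided by 8. A back-of-the-envelope sketch; the bits-per-weight figures below are rough assumptions, not exact GGUF numbers:

```python
def approx_size_gb(params_billion, bits_per_weight):
    """Rough file-size estimate: parameters * bits-per-weight / 8.
    Ignores embeddings and metadata overhead, so treat it as ballpark only."""
    return params_billion * bits_per_weight / 8

# assumed effective bits-per-weight per quant level (approximate)
models = [
    ("Gemma3 12b Q6",    12, 6.6),
    ("Gemma3 12b Q4",    12, 4.8),
    ("Gemma3 27b Q3_XS", 27, 3.4),
    ("Gemma3 27b Q2",    27, 2.7),
]
for name, params, bpw in models:
    print(f"{name:17s} ~{approx_size_gb(params, bpw):4.1f} GB")
```

So the 27b at Q3_XS still needs noticeably more memory than the 12b at Q6, which is part of why the comparison isn't obvious.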
I think it's kind of hard to quantify "better" other than by measuring perplexity. With Q3_XS or Q2 it seems to be a step down, but you'd have to look closely at the numbers to compare 12b-q4 to 27b-q3xs. I'm currently on mobile so I can't do that, but there are quite a few tables buried somewhere in the llama.cpp discussions... However, I'm not sure we have enough research on how perplexity measurements translate to "intelligence". It might not be the same thing. You'd probably need to test a few hundred times, or do something like the LLM Arena, to get a meaningful result on how the models compare across size and quantization. (And I've heard Q2 isn't worth it.)
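To put a number on "test a few hundred times": with simple head-to-head A/B comparisons on the same prompts, you can check whether a win rate is distinguishable from a coin flip. A small sketch using the standard normal approximation; the win counts are made up:

```python
import math

def winrate_zscore(wins, trials):
    """Normal-approximation z-score for 'win rate differs from 50%'.
    |z| above roughly 2 is the usual rough threshold for a real difference."""
    return (wins - trials / 2) / math.sqrt(trials / 4)

# hypothetical head-to-head results (27b-q3xs vs 12b-q4), all at a 56% win rate
for wins, trials in [(14, 25), (56, 100), (224, 400)]:
    z = winrate_zscore(wins, trials)
    print(f"{wins}/{trials} wins -> z = {z:+.2f}")
```

With a modest gap like 56% vs 44%, a couple dozen comparisons tell you almost nothing; it takes on the order of a few hundred trials before the difference clears the noise, which is basically what the Arena-style leaderboards are doing at scale.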