AI

Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals, which involves consciousness and emotionality. The distinction between the former and the latter categories is often revealed by the acronym chosen.

Local LLMs: 16gb vram model advice and quantization question (lemmy.ml)

submitted 1 month ago by tcsenpai@lemmy.ml to c/artificial_intel@lemmy.ml

So, I have a 16GB vram GPU (4070 ti Super) and 32GB DDR4 RAM. The RAM is slow af and thus I tend to run models fully on GPU.

I can easily run up to 21b-ish models with Q4, sometimes high Q3.

I am testing various models out there, but I was wondering if you guys have any recommendations.

I am also really interested in understanding whether quantization really decreases model quality that much. For example, would it be better to run a Q6 12b model (like Gemma 3 12b), a Q2_K_L 32b model (such as QwQ 32b), or a Q3_XS model (such as Gemma 3 27b)?
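To make the trade-off concrete, here is a rough sketch of how much VRAM the weights alone would take for each option. The bits-per-weight values are ballpark assumptions of mine, not exact figures for any specific GGUF file, and the parameter counts are the nominal ones from the model names. KV cache and context overhead are not included, so real usage will be higher.

```python
# Rough estimate of weight size for a quantized model.
# ASSUMPTION: these bits-per-weight numbers are approximate placeholders;
# real quantized files vary by quant variant and model architecture.
APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 3.0,
    "Q3_XS": 3.3,
    "Q4": 4.8,
    "Q6": 6.6,
}


def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Return the approximate size of the weights in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# The three options from the post (nominal parameter counts).
candidates = [
    ("Gemma 3 12b", 12, "Q6"),
    ("Gemma 3 27b", 27, "Q3_XS"),
    ("QwQ 32b", 32, "Q2_K"),
]

for name, params_billion, quant in candidates:
    size = weights_gib(params_billion, APPROX_BITS_PER_WEIGHT[quant])
    print(f"{name} at {quant}: ~{size:.1f} GiB for weights only")
```

Under these assumptions all three fit in 16GB, but the bigger models leave much less room for context, which is part of why the choice is not obvious.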

[–] tcsenpai@lemmy.ml 3 points 1 month ago (1 children)

I noticed that too for Q4 and above, with my sweet spot at Q6 if I can manage it. I am really confused about Q2-Q3 quants of models that are 2x+ the size of the ones I would run at Q4. E.g. sometimes it seems Gemma3 12b Q4 (or Q6) is better than Gemma3 27b Q3_XS, and sometimes it seems the opposite.

[–] hendrik@palaver.p3x.de 2 points 1 month ago* (last edited 1 month ago)

I think it's kind of hard to quantify "better" other than by measuring perplexity. With Q3_XS or Q2 it seems to be a step down. But you'd have to look closely at the numbers to compare 12b-q4 to 27b-q3xs. I'm currently on mobile so I can't do that, but there are quite a few tables buried somewhere in the llama.cpp discussions... However... I'm not sure if we have enough research on how perplexity measurements translate to "intelligence". This might not be the same thing. Idk. But you'd probably need to test a few hundred times or do something like the LLM Arena to get a meaningful result on how the models compare across size and quantization. (And I heard Q2 isn't worth it.)
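For reference, perplexity is just the exponential of the average negative log-likelihood per token, so lower means the model was less "surprised" by the text. A minimal sketch, assuming you already have per-token natural-log probabilities from somewhere; the function name and the example numbers are made up for illustration and are not real measurements.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs: natural-log probabilities the model assigned to each
    actual token of the evaluation text.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# HYPOTHETICAL log-probs for the same text under two quantizations.
logprobs_a = [-1.2, -0.4, -2.1, -0.9]
logprobs_b = [-1.5, -0.6, -2.4, -1.0]

print(f"model A perplexity: {perplexity(logprobs_a):.2f}")
print(f"model B perplexity: {perplexity(logprobs_b):.2f}")
```

The numbers are only comparable when both models are scored on the same text with the same tokenizer, so this works for comparing quants of one model but not directly for 12b vs 27b across model families.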