🔮The Codex

Quantization

Compressing AI models to use less memory and run faster with minimal quality loss.

📖 Apprentice Explanation

Quantization makes AI models smaller and faster by reducing their precision. It's like compressing a photo — you lose a tiny bit of quality but it takes up much less space.

🧙 Archmage Notes

Quantization stores model weights at lower numeric precision, e.g. converting FP32/FP16 weights to INT8 or INT4. Common approaches include the post-training methods GPTQ and AWQ, the GGUF file format used by llama.cpp, and the bitsandbytes library for on-the-fly quantization in PyTorch. Relative to FP16, 4-bit quantization cuts model size roughly 4x, typically with under 5% quality degradation on standard benchmarks.
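The core idea can be sketched in a few lines. This is a minimal illustration of symmetric per-tensor INT8 quantization with NumPy, not how any of the libraries above actually implement it (they use per-channel or per-group scales and calibration): each float weight is mapped to an 8-bit integer via a single scale factor, and dequantization recovers an approximation.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the INT8 values and scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 4)).astype(np.float32)  # toy weight matrix

q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

# INT8 uses 1 byte per weight vs 4 for FP32: a 4x size reduction.
print(q.nbytes, "bytes vs", w.nbytes, "bytes")
# Rounding error is bounded by half a quantization step.
print("max abs error:", np.max(np.abs(w - w_hat)))
```

The maximum reconstruction error is bounded by half the quantization step (`scale / 2`), which is why quality loss stays small as long as the weight distribution has no extreme outliers stretching the scale.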