⚡ Quantization reduces the numerical precision of model weights from 32-bit floats to 8-bit or 4-bit integers, cutting model size by 4-8x and usually speeding up inference, often with only a small accuracy loss. It is why you can run a 7-billion-parameter LLM on a MacBook. Without quantization, many large models would be impractical to deploy outside expensive cloud servers.
Category: MLOps · Difficulty: Intermediate · Last updated: 15 May 2026 · 4 min read
Quantization — How Reducing Number Precision Makes AI Models Smaller, Faster & Deployable Anywhere
What is Quantization?
A neural network stores millions to billions of numbers — the weights that encode what it has learned. Standard floating-point representation stores each weight as a 32-bit number (FP32) — able to represent values with extraordinary precision across a huge range. This precision is important during training, when small gradient updates must be captured accurately.
But at inference time, when the model is making predictions rather than learning, that precision is often overkill. The network’s behaviour is surprisingly robust to small rounding errors in individual weights. If you round every weight from a 32-bit float to an 8-bit integer, the model becomes 4x smaller and typically runs faster, because less data moves through memory and integer arithmetic is cheaper. The actual speedup depends on hardware and kernel support, and in many cases the accuracy loss is under 1%.
That is quantization: deliberately reducing the precision of model weights (and optionally activations) to trade a tiny accuracy loss for a large reduction in memory and compute.
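The rounding step described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming symmetric per-tensor INT8 quantization with a single scale factor. The function names and sample weights are made up for the example, and production libraries use more elaborate schemes.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 if every weight is zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
weights = [0.42, -1.27, 0.003, 0.9, -0.55]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)          # [42, -127, 0, 90, -55]
print(max_error)  # about 0.003: the smallest weight was rounded to zero
```

Each weight now fits in one byte plus a shared scale, and the rounding error per weight stays within half a quantization step (scale / 2).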
Precision levels
FP32 (32-bit float) — 4 bytes per parameter. Standard training precision. A 7B parameter model: 28GB.
FP16 / BF16 (16-bit float) — 2 bytes per parameter. Half precision. Standard for modern GPU training and inference. A 7B model: 14GB. Minimal accuracy loss.
INT8 (8-bit integer) — 1 byte per parameter. 4x smaller than FP32. Small accuracy loss. Standard for production inference at scale. A 7B model: 7GB.
INT4 (4-bit integer) — 0.5 bytes per parameter. 8x smaller than FP32. Noticeable accuracy loss but enables very large models on constrained hardware. A 7B model: 3.5GB — runnable on a MacBook Pro.
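The sizes above follow from one multiplication: parameter count times bytes per parameter. A small sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and counting weights only. The helper name is illustrative.

```python
# Bytes per parameter for each precision level listed above.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params, precision):
    """Weight storage only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(7e9, precision):.1f} GB")
```

For a 7B model this prints 28.0, 14.0, 7.0 and 3.5 GB. Real memory use is higher, because activations, the KV cache, and the scale factors stored alongside quantized weights are not included in this estimate.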
How does quantization work?
Post-training quantization (PTQ): train the model in full precision, then quantize the weights after training. It is fast and needs no retraining, and at most a small calibration set rather than the full training data. The cost is some accuracy loss.
Quantization-aware training (QAT): simulate quantization during training — the model learns to tolerate rounding errors. Better accuracy than PTQ at the same bit width, but requires training data and significantly more compute.
GPTQ: a popular post-training quantization algorithm for LLMs that uses a small calibration dataset to minimise quantization error. Standard for 4-bit quantized LLMs.
GGUF format (llama.cpp): a quantized model format used by the llama.cpp project — enabling CPU inference of quantized LLMs on consumer hardware without a GPU.
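Two ideas from this section can be shown with one small sketch. The first is "fake quantization": quantize and immediately dequantize, so the values stay floats but carry the rounding error. QAT inserts a step like this into the forward pass so training sees that error. The second is per-group scaling: 4-bit LLM formats commonly store one scale per small group of weights rather than one per tensor. The code below uses plain round-to-nearest with made-up weights and an arbitrary group size of 4. It is not the GPTQ algorithm, which goes further and uses calibration data to reduce the error.

```python
def fake_quantize(values, bits):
    """Quantize to signed `bits`-bit integers, then dequantize back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def fake_quantize_grouped(values, bits, group_size):
    """Same as fake_quantize, but each group of weights gets its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(fake_quantize(values[start:start + group_size], bits))
    return out


def mean_abs_error(original, approx):
    return sum(abs(x - y) for x, y in zip(original, approx)) / len(original)


# Hypothetical weights: four small values followed by four large ones.
weights = [0.015, -0.03, 0.01, 0.04, 1.5, -1.2, 0.9, -1.4]

per_tensor = fake_quantize(weights, bits=4)
per_group = fake_quantize_grouped(weights, bits=4, group_size=4)

print(mean_abs_error(weights, per_tensor))
print(mean_abs_error(weights, per_group))
```

With a single scale, the large weights set the step size and the four small weights all round to zero. With one scale per group, the small weights keep most of their information, so the mean error is lower. The trade-off is the extra storage for the additional scales.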
Real-world examples
Not theory — what real teams actually shipped using this technique.
- Running Llama 3 8B locally — in FP32 it needs 32GB of RAM. In Q4_K_M quantization (4-bit with mixed precision), it needs ~5GB — runnable on a 16GB MacBook. Tools like Ollama and LM Studio download and run GGUF-quantized models in one click.
- TensorFlow Lite: Google’s framework for mobile and edge ML deployment supports INT8 post-training quantization as an opt-in step during model conversion, reducing model size for deployment on phones and IoT devices without cloud inference.
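- ChatGPT’s infrastructure: OpenAI has not publicly detailed its serving stack, but large-scale LLM serving commonly relies on reduced-precision inference (FP16 or BF16, and in some deployments INT8) to cut GPU memory requirements and increase throughput, so each GPU serves more users at lower cost.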
Common pitfalls
- Accuracy degradation at low bit widths — INT4 and below can produce noticeable quality degradation, especially for complex tasks requiring precise numerical computation. Always evaluate the quantized model on your specific task before deploying.
- Hardware compatibility: not all hardware supports all precision levels. INT8 inference is well-supported on NVIDIA GPUs (Tensor Cores), x86 CPUs (AVX-512 VNNI), and Apple Silicon. INT4 support varies more, so check that your target runtime has kernels for the format you choose.
- Activation quantization adds complexity — quantizing weights is straightforward. Quantizing activations (the intermediate values during inference) requires calibration and careful handling of outliers, which dramatically affect range and rounding error.
- Quantization is not free — even with minimal accuracy loss on benchmarks, quantized models may fail on specific edge cases. Always test on data representative of your deployment environment.
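The outlier problem mentioned for activation quantization is easy to reproduce. The sketch below uses made-up calibration values (999 small activations and one large outlier) and compares two ways of choosing the INT8 range: the absolute maximum, and an approximate 99.9th percentile. Both choices are illustrative assumptions, not the method of any particular library.

```python
def quantization_error(values, clip_max):
    """Mean absolute error of symmetric INT8 quantization over [-clip_max, clip_max]."""
    scale = clip_max / 127
    total = 0.0
    for v in values:
        q = max(-127, min(127, round(v / scale)))  # values beyond the range saturate
        total += abs(v - q * scale)
    return total / len(values)


# Hypothetical calibration data: values in [-1.0, 0.99] plus a single outlier.
typical = [((i % 200) - 100) / 100 for i in range(999)]
activations = typical + [60.0]

sorted_abs = sorted(abs(v) for v in activations)
abs_max = sorted_abs[-1]                                   # 60.0, set by the outlier
p999 = sorted_abs[int(0.999 * (len(sorted_abs) - 1))]      # 1.0, ignores the outlier

print(quantization_error(typical, abs_max))   # coarse steps for the typical values
print(quantization_error(typical, p999))      # much finer steps for the same values
```

Choosing the range from the single outlier stretches the 255 available levels across [-60, 60], so almost all activations land on a handful of levels. Clipping at the percentile keeps the typical values precise, at the cost of saturating the outlier at 1.0. Which trade-off is acceptable depends on the model, which is why the evaluation advice above matters.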
Frequently asked questions
QUESTION 1 What is quantization in simple terms?
ANSWER 1 Rounding model weights to lower-precision numbers, from 32-bit to 8-bit or 4-bit. This cuts model size by 4-8x and usually speeds up inference, at the cost of a small accuracy loss.
QUESTION 2 What is the difference between FP32, FP16, INT8, and INT4?
ANSWER 2 FP32: 4 bytes, full precision. FP16: 2 bytes, training standard. INT8: 1 byte, production inference standard. INT4: 0.5 bytes, enables very large models on consumer hardware.
QUESTION 3 Why does quantization matter for LLMs locally?
ANSWER 3 A 7B model in FP32 needs 28GB RAM. In INT4 it needs 3.5GB — runnable on a MacBook. Quantization makes local LLM inference possible.
QUESTION 4 What is quantization-aware training?
ANSWER 4 Simulating quantization during training so the model learns to tolerate rounding errors — better accuracy than post-training quantization at the cost of more compute.