Why does quantization matter for running LLMs locally?

A 7-billion-parameter LLM in FP32 needs about 28GB for its weights alone, more than most consumer computers have. In INT4, the same weights take about 3.5GB, which fits on a MacBook or gaming PC. Quantization is what makes running Llama 3, Mistral, and Phi models locally practical. llama.cpp and Ollama use GGUF quantized models to run LLM inference locally, even on machines without a dedicated GPU.

Q: What is quantization in simple terms?

Quantization is rounding model weights to lower-precision numbers to save space and speed up computation. Standard models store weights as 32-bit floats, 4 bytes per number. 8-bit quantization stores them as integers, 1 byte per number: same model, 4x smaller. The rounding introduces small errors, but neural networks are surprisingly robust to them. A well-quantized model loses little accuracy while fitting in 4x less memory and usually running faster, although the speedup depends on the hardware and does not always match the size reduction.

Q: What is the difference between FP32, FP16, INT8, and INT4?

FP32 (32-bit float): standard training precision, 4 bytes per parameter, highest accuracy. FP16 (16-bit float): half precision, 2 bytes, minimal accuracy loss, standard for modern GPU training. INT8 (8-bit integer): 1 byte, 4x smaller than FP32, small accuracy loss, widely used for inference. INT4 (4-bit integer): 0.5 bytes, 8x smaller than FP32, noticeable accuracy loss but enables very large models on constrained hardware. GPTQ (a quantization algorithm) and GGUF (a model file format) are both popular ways to produce and distribute 4-bit LLMs.

Q: What is quantization-aware training?

Post-training quantization applies quantization after training is complete — fast but loses some accuracy. Quantization-aware training (QAT) simulates quantization during training — the model learns to be robust to lower precision while training. QAT typically produces better accuracy at the same bit width than post-training quantization but requires access to training data and significantly more compute.

⚡ Quantization reduces the numerical precision of model weights from 32-bit to 8-bit or 4-bit, cutting model size by 4-8x and usually speeding up inference, often with only a small accuracy loss. It is why you can run a 7-billion parameter LLM on a MacBook. Without quantization, many large models would be impractical to deploy outside expensive cloud servers.

Category: MLOps · Difficulty: Intermediate · Last updated: 15 May 2026 · 4 min read

Quantization — How Reducing Number Precision Makes AI Models Smaller, Faster & Deployable Anywhere

What is Quantization?

A neural network stores millions to billions of numbers — the weights that encode what it has learned. Standard floating-point representation stores each weight as a 32-bit number (FP32) — able to represent values with extraordinary precision across a huge range. This precision is important during training, when small gradient updates must be captured accurately.

But at inference time, when the model is making predictions rather than learning, that precision is often overkill. The network’s behaviour is surprisingly robust to small rounding errors in individual weights. If you round every weight from a 32-bit float to an 8-bit integer, the model becomes 4x smaller and typically runs faster. A well-calibrated INT8 model often loses around 1% accuracy or less, though the exact loss depends on the model and the task.

That is quantization: deliberately reducing the precision of model weights (and optionally activations) to trade a tiny accuracy loss for a large reduction in memory and compute.
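The definition above can be made concrete with a minimal sketch of symmetric "absmax" INT8 quantization in NumPy. The function names, the random weight matrix, and the single per-tensor scale are assumptions chosen for illustration. Production toolchains typically use per-channel or per-block scales.

```python
import numpy as np

def quantize_int8(weights):
    """Map float weights to int8 using one scale for the whole tensor."""
    scale = float(np.max(np.abs(weights))) / 127.0
    if scale == 0.0:          # all-zero tensor: any scale works
        scale = 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

# Hypothetical weight matrix standing in for one layer of a model.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

size_ratio = weights.nbytes / q.nbytes                  # FP32 bytes / INT8 bytes
max_error = float(np.max(np.abs(weights - restored)))   # bounded by about scale / 2

print(f"size ratio: {size_ratio:.1f}x, max rounding error: {max_error:.6f}")
```

The rounding error per weight is at most about half of one quantization step, which is why a tensor with a narrow value range quantizes well and a tensor with a few large outliers does not.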

Precision levels

FP32 (32-bit float) — 4 bytes per parameter. Standard training precision. A 7B parameter model: 28GB.

FP16 / BF16 (16-bit float) — 2 bytes per parameter. Half precision. Standard for modern GPU training and inference. A 7B model: 14GB. Minimal accuracy loss.

INT8 (8-bit integer) — 1 byte per parameter. 4x smaller than FP32. Small accuracy loss. Standard for production inference at scale. A 7B model: 7GB.

INT4 (4-bit integer) — 0.5 bytes per parameter. 8x smaller than FP32. Noticeable accuracy loss but enables very large models on constrained hardware. A 7B model: 3.5GB — runnable on a MacBook Pro.
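The figures above follow directly from parameter count times bytes per parameter. A small sketch of that arithmetic, assuming decimal gigabytes (1GB = 1e9 bytes) and counting weights only:

```python
# Bytes per parameter for each precision level listed above.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params, precision):
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    print(f"{precision:>9}: {weight_memory_gb(7e9, precision):.1f} GB")
```

Actual memory use at runtime is higher, because activations, the KV cache, and any per-block scale metadata stored by the quantization format are not included in this estimate.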

How does Quantization work?

Post-training quantization (PTQ): train the model in full precision, then apply quantization to the weights after training. Fast, needs at most a small calibration set rather than the full training data, some accuracy loss.

Quantization-aware training (QAT): simulate quantization during training — the model learns to tolerate rounding errors. Better accuracy than PTQ at the same bit width, but requires training data and significantly more compute.

GPTQ: a popular post-training quantization algorithm for LLMs that uses a small calibration dataset to minimise quantization error. Standard for 4-bit quantized LLMs.

GGUF format (llama.cpp): a quantized model format used by the llama.cpp project — enabling CPU inference of quantized LLMs on consumer hardware without a GPU.
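To illustrate the QAT idea described above, here is a toy sketch in NumPy: a linear model whose forward pass uses 4-bit "fake-quantized" weights while the update is applied to full-precision latent weights (the straight-through estimator). The model, data, bit width, learning rate, and the `fake_quantize` helper are all assumptions for illustration. Real QAT is done inside a deep learning framework with automatic differentiation.

```python
import numpy as np

def fake_quantize(w, num_bits=4):
    """Round w onto a symmetric integer grid, then map it back to float."""
    qmax = 2 ** (num_bits - 1) - 1                  # 7 for 4-bit
    scale = float(np.max(np.abs(w))) / qmax
    if scale == 0.0:
        return w.copy()
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

# Synthetic regression problem (hypothetical data).
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
true_w = rng.normal(size=(8, 1))
target = x @ true_w

w = np.zeros((8, 1))      # latent full-precision weights
losses = []
for step in range(200):
    w_q = fake_quantize(w)                 # forward pass sees quantized weights
    err = x @ w_q - target
    losses.append(float(np.mean(err ** 2)))
    grad_wq = 2.0 * x.T @ err / len(x)     # gradient w.r.t. the quantized weights
    # Straight-through estimator: treat rounding as the identity in the
    # backward pass, so the same gradient updates the latent weights.
    w -= 0.05 * grad_wq

print(f"loss with 4-bit weights: first step {losses[0]:.3f}, last step {losses[-1]:.3f}")
```

PTQ would instead train `w` without `fake_quantize` in the loop and round once at the end. QAT lets the optimizer see the rounding error during training, which is where its accuracy advantage comes from.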

Real-world examples

Not theory — what real teams actually shipped using this technique.

Running Llama 3 8B locally: in FP32 the weights need about 32GB of RAM. In Q4_K_M quantization (4-bit with mixed precision), they need roughly 5GB, which fits on a 16GB MacBook. Tools like Ollama and LM Studio download and run GGUF-quantized models with a single click or command.
TensorFlow Lite: Google’s framework for mobile and edge ML deployment supports post-training INT8 quantization, which shrinks models for deployment on phones and IoT devices without cloud inference.
Large-scale LLM serving: hosted services such as ChatGPT are generally understood to serve models in reduced precision (FP16/BF16, and often 8-bit) to cut GPU memory requirements and increase throughput, so that each GPU serves more users at lower cost. OpenAI has not published the details of its serving stack.
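One reason the Llama 3 example above lands nearer 5GB than the 4GB that 0.5 bytes per parameter would suggest is that 4-bit formats store extra metadata, such as a scale factor for each small block of weights, and mixed-precision schemes keep some tensors at higher precision. The sketch below shows a simplified block-wise 4-bit scheme with two codes packed per byte. The block size of 32, the FP16 scale, and the function names are assumptions for illustration. This is not the actual GGUF Q4_K_M layout or the GPTQ algorithm.

```python
import numpy as np

BLOCK = 32  # weights per block (assumed; real formats vary)

def quantize_4bit_blocks(weights):
    """Quantize a 1-D float32 array to 4-bit codes with one FP16 scale per block."""
    blocks = weights.reshape(-1, BLOCK)
    scales = (np.max(np.abs(blocks), axis=1, keepdims=True) / 7.0).astype(np.float16)
    safe = np.where(scales == 0, 1.0, scales).astype(np.float32)  # avoid divide-by-zero
    codes = np.clip(np.round(blocks / safe), -8, 7).astype(np.int8) + 8  # shift to 0..15
    codes = codes.astype(np.uint8)
    packed = codes[:, 0::2] | (codes[:, 1::2] << 4)   # two 4-bit codes per byte
    return packed, scales

def dequantize_4bit_blocks(packed, scales):
    """Unpack the 4-bit codes and rescale them to float32."""
    codes = np.empty((packed.shape[0], BLOCK), dtype=np.int8)
    codes[:, 0::2] = (packed & 0x0F).astype(np.int8) - 8
    codes[:, 1::2] = (packed >> 4).astype(np.int8) - 8
    return (codes * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=4096).astype(np.float32)

packed, scales = quantize_4bit_blocks(weights)
restored = dequantize_4bit_blocks(packed, scales)

# Storage cost including the per-block scales, in bits per weight.
bits_per_weight = (packed.nbytes + scales.nbytes) * 8 / weights.size
print(f"bits per weight: {bits_per_weight}")
```

With these assumed settings each block costs 16 bytes of codes plus a 2-byte scale, so the effective rate is 4.5 bits per weight rather than 4.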

Common pitfalls

Accuracy degradation at low bit widths — INT4 and below can produce noticeable quality degradation, especially for complex tasks requiring precise numerical computation. Always evaluate the quantized model on your specific task before deploying.
Hardware compatibility — not all hardware supports all precision levels. INT8 inference is well-supported on NVIDIA GPUs (Tensor Cores), CPUs (AVX-512), and Apple Silicon. INT4 support varies more.
Activation quantization adds complexity — quantizing weights is straightforward. Quantizing activations (the intermediate values during inference) requires calibration and careful handling of outliers, which dramatically affect range and rounding error.
Quantization is not free — even with minimal accuracy loss on benchmarks, quantized models may fail on specific edge cases. Always test on data representative of your deployment environment.
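The outlier problem mentioned under activation quantization can be shown with synthetic data. In this sketch a single large value stretches the INT8 range when the scale is set from the absolute maximum, and a percentile-based clip (one simple form of calibration) keeps the error on typical values much lower. The data, the 99.9th percentile, and the helper name are assumptions for illustration.

```python
import numpy as np

def int8_roundtrip(x, clip_value):
    """Quantize to the int8 range with the given clipping value, then dequantize."""
    scale = clip_value / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
activations = rng.normal(0.0, 1.0, size=10_000)
activations[0] = 100.0                       # one synthetic outlier

absmax_clip = float(np.max(np.abs(activations)))            # range set by the outlier
pct_clip = float(np.percentile(np.abs(activations), 99.9))  # calibrated clip

typical = np.abs(activations) < 3.0          # the bulk of the values
err_absmax = float(np.mean((activations - int8_roundtrip(activations, absmax_clip))[typical] ** 2))
err_pct = float(np.mean((activations - int8_roundtrip(activations, pct_clip))[typical] ** 2))

print(f"MSE on typical values: absmax {err_absmax:.5f}, percentile clip {err_pct:.7f}")
```

The trade-off is that the clipped outlier itself is now represented badly, so whether clipping helps overall depends on how much the model relies on those extreme values. This is one more reason to evaluate on representative data.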

Frequently asked questions

QUESTION 1 What is quantization in simple terms?

ANSWER 1 Rounding model weights to lower-precision numbers, from 32-bit to 8-bit or 4-bit, which cuts model size by 4-8x and usually speeds up inference, often with only a small accuracy loss.

QUESTION 2 What is the difference between FP32, FP16, INT8, and INT4?

ANSWER 2 FP32: 4 bytes, full precision. FP16: 2 bytes, training standard. INT8: 1 byte, production inference standard. INT4: 0.5 bytes, enables very large models on consumer hardware.

QUESTION 3 Why does quantization matter for running LLMs locally?

ANSWER 3 A 7B model in FP32 needs 28GB RAM. In INT4 it needs 3.5GB — runnable on a MacBook. Quantization makes local LLM inference possible.

QUESTION 4 What is quantization-aware training?

ANSWER 4 Simulating quantization during training so the model learns to tolerate rounding errors — better accuracy than post-training quantization at the cost of more compute.
