QLoRA (Quantised LoRA) combines LoRA with 4-bit quantisation of the frozen base model. The base model weights are stored in 4-bit precision rather than 16-bit or 32-bit, which cuts the memory needed for the weights by roughly 4x or 8x respectively. LoRA adapters are trained in 16-bit on top. Together, this allows fine-tuning a 65-billion parameter model on a single 48GB GPU, a job that previously required a cluster of expensive A100s.
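The memory figures above can be checked with some back-of-the-envelope arithmetic. The sketch below is an illustration only: the helper name `weight_memory_gb` is made up for this example, 1 GB is taken as 10^9 bytes, and it counts the stored weights alone, ignoring quantisation constants, activations, the adapters and their optimiser state.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the stored weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 65e9  # the 65-billion parameter model mentioned above

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(N_PARAMS, bits):6.1f} GB")
```

Under these assumptions the 4-bit weights come to 32.5 GB, which leaves headroom on a 48GB card for everything the sketch leaves out, while the 16-bit weights alone (130 GB) would not fit.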

LoRA (Low-Rank Adaptation)

Q: What is LoRA in simple terms?

LoRA is a shortcut for fine-tuning large models. Instead of updating all 7 billion parameters of a model (whose weights alone occupy about 28GB in 32-bit precision, before gradients and optimiser state are counted), LoRA freezes the original weights and adds tiny adapter layers with only a few million parameters. Only the adapters are trained. The result is often almost as good as full fine-tuning at a small fraction of the cost.

Q: How does LoRA work mathematically?

Each adapted weight matrix W (shape d × k) gets two small companion matrices A and B, where A has shape (d × r) and B has shape (r × k), and r (the rank) is much smaller than d and k, typically 4 to 64. During fine-tuning, only A and B are updated. The effective weight update is A×B, a low-rank approximation of what a full weight update would be. After training, A×B can be merged into W, so inference runs with no added latency.
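To make the merge step concrete, here is a small sketch in plain Python. It is a toy with assumed sizes (d=6, k=5, r=2) and random values, using the row-vector convention y = x·W so the shapes match the ones given above. The helper names are invented for this example. It checks that running the base path plus the adapter path gives the same output as a single multiplication by the merged matrix W + A×B.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, k, r = 6, 5, 2  # toy sizes; real layers have d and k in the thousands

W = rand_matrix(d, k, rng)  # frozen pretrained weight, shape (d x k)
A = rand_matrix(d, r, rng)  # adapter, shape (d x r)
B = rand_matrix(r, k, rng)  # adapter, shape (r x k)
x = rand_matrix(1, d, rng)  # one input row vector, shape (1 x d)

# Unmerged: base path plus the low-rank path (x A) B
y_unmerged = add(matmul(x, W), matmul(matmul(x, A), B))

# Merged: fold A x B into W once, then do a single multiplication
W_merged = add(W, matmul(A, B))
y_merged = matmul(x, W_merged)

max_diff = max(abs(u - m) for u, m in zip(y_unmerged[0], y_merged[0]))
print(f"largest difference between the two paths: {max_diff:.2e}")
```

The two outputs agree up to floating-point rounding, which is why a merged model behaves like the adapted one while keeping the original layer shape.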

Q: When should you use LoRA?

When you want to specialise a large pretrained model for a specific task or style but cannot afford full fine-tuning in terms of GPU memory, compute cost, or time. LoRA is the default fine-tuning approach for LLMs on consumer or mid-range hardware. It is also useful when you want to maintain multiple specialisations of the same base model — each as a small separate adapter file.

⚡ LoRA (Low-Rank Adaptation) fine-tunes large models by adding tiny trainable adapter layers instead of updating all parameters. Rather than retraining 7 billion weights, LoRA trains a few million adapter parameters, which shrinks gradient and optimiser memory by orders of magnitude with minimal accuracy loss (the frozen base weights still have to fit in memory). It democratised LLM customisation, making it possible to fine-tune on a single consumer GPU.

Category: Machine Learning · Difficulty: Intermediate · Last updated: 15 May 2026 · 4 min read

LoRA — What It Is and How It Makes Fine-Tuning Billion-Parameter Models Affordable

What is LoRA?

Fine-tuning a 7-billion parameter language model the traditional way requires updating all 7 billion weights. In 32-bit precision that needs approximately 28GB of GPU memory just to store the gradients, on top of the weights themselves and the optimiser state, which is far beyond what a single consumer GPU can handle. For a 70-billion parameter model, you are looking at a cluster of A100s and costs in the tens of thousands of dollars.
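A rough tally shows where the memory goes in full fine-tuning. This is a sketch under stated assumptions, not a measurement: every value is stored in 32-bit precision (4 bytes), the optimiser is assumed to be Adam-style and to keep two extra values per weight, 1 GB is taken as 10^9 bytes, and activations are left out entirely.

```python
def gigabytes(n_values: float, bytes_per_value: int) -> float:
    """Convert a count of stored values to GB (1 GB = 1e9 bytes)."""
    return n_values * bytes_per_value / 1e9


N_PARAMS = 7e9   # 7-billion parameter model
BYTES_FP32 = 4   # assumption: everything kept in 32-bit precision

weights = gigabytes(N_PARAMS, BYTES_FP32)
gradients = gigabytes(N_PARAMS, BYTES_FP32)
# Assumption: an Adam-style optimiser storing two extra values per weight
optimizer_state = 2 * gigabytes(N_PARAMS, BYTES_FP32)
total = weights + gradients + optimizer_state

print(f"weights         : {weights:6.1f} GB")
print(f"gradients       : {gradients:6.1f} GB")
print(f"optimiser state : {optimizer_state:6.1f} GB")
print(f"total           : {total:6.1f} GB (activations not included)")
```

With these assumptions the gradients account for 28 GB and the running total reaches 112 GB before a single activation is stored. Mixed-precision training changes the exact figures but not the overall picture.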

LoRA asks a clever question: do we really need to update all those weights? Most of what makes a large model useful — its knowledge of language, its reasoning patterns, its broad capabilities — is already there. Fine-tuning is changing a small fraction of the behaviour. Maybe we only need to update a small fraction of the weights.

The mathematical insight behind LoRA is that the changes needed during fine-tuning tend to be low-rank, meaning they can be approximated by the product of two small matrices rather than one large one. Instead of updating a 4096×4096 weight matrix directly (about 16.8 million values), LoRA adds two small matrices (4096×8 and 8×4096, with rank r=8, 65,536 values in total) and trains only those. The original weights are frozen. Only the tiny adapters learn. The result is a fine-tuned model with well under 1% of the parameters being trained, and a correspondingly small gradient and optimiser footprint.
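The parameter counts for the 4096×4096 example can be worked out directly. The function names below are invented for this illustration.

```python
def full_params(d: int, k: int) -> int:
    """Values in a full d x k weight update."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Values in the two adapters: (d x r) plus (r x k)."""
    return d * r + r * k


d = k = 4096
r = 8

full = full_params(d, k)
lora = lora_params(d, k, r)

print(f"full update : {full:,} values")    # 16,777,216
print(f"LoRA update : {lora:,} values")    # 65,536
print(f"ratio       : {lora / full:.2%}")  # 0.39%
```

For this one matrix the adapters hold about 0.39% as many trainable values as a full update, and the ratio shrinks further as d and k grow while r stays fixed.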

How LoRA works

Take a pretrained model — Llama 3, Mistral, Falcon, or any transformer.
Freeze all original model weights — they are not updated during fine-tuning.
For each target weight matrix W, add two small trainable matrices: A (d × r) and B (r × k), where rank r << d and k.
The adapted weight is effectively W + A×B during the forward pass.
Only A and B are updated by gradient descent — typically 0.1-1% of total parameters.
After training, merge the adapters back into W for zero-overhead inference, or keep them separate for easy swapping between specialisations.
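The steps above can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: a single 4×3 "layer", rank 2, one training example, a squared-error loss and hand-written gradient descent. B is initialised to zero so that A×B is zero at the start and the adapted layer begins as an exact copy of the pretrained one (the original LoRA paper zero-initialises one adapter matrix for the same reason). A real setup would rely on a deep-learning framework and an adapter library rather than list arithmetic.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def descend(X, G, step):
    """Return X - step * G, element-wise."""
    return [[x - step * g for x, g in zip(rx, rg)] for rx, rg in zip(X, G)]


rng = random.Random(0)
d, k, r = 4, 3, 2

W = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(d)]      # frozen
A = [[rng.uniform(-0.5, 0.5) for _ in range(r)] for _ in range(d)]  # trainable
B = [[0.0] * k for _ in range(r)]                                   # trainable, starts at zero

x = [[0.5, -0.5, 0.5, 0.5]]
shift = [1.0, -1.0, 0.5]  # hypothetical change we want fine-tuning to learn
target = [[y + s for y, s in zip(matmul(x, W)[0], shift)]]


def forward(x, W, A, B):
    """Compute x W + (x A) B, the adapted forward pass."""
    base = matmul(x, W)
    low_rank = matmul(matmul(x, A), B)
    return [[b + l for b, l in zip(base[0], low_rank[0])]]


W_before = [row[:] for row in W]
learning_rate = 0.05
losses = []

for step in range(500):
    y = forward(x, W, A, B)
    err = [[yi - ti for yi, ti in zip(y[0], target[0])]]  # d(loss)/dy
    losses.append(0.5 * sum(e * e for e in err[0]))
    # Gradients flow only into the adapters; W is never touched.
    grad_A = matmul(transpose(x), matmul(err, transpose(B)))  # shape (d x r)
    grad_B = matmul(transpose(matmul(x, A)), err)             # shape (r x k)
    A = descend(A, grad_A, learning_rate)
    B = descend(B, grad_B, learning_rate)

print(f"loss at start: {losses[0]:.4f}, loss at end: {losses[-1]:.2e}")
print("W unchanged:", W == W_before)
```

The loss falls even though W never changes, because the adapters alone absorb the required shift. Merging afterwards is then a one-off addition of A×B to W.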

Real-world examples

Not theory — what real teams actually shipped using this technique.

The open-source community used LoRA to fine-tune Llama and Mistral into hundreds of specialised variants — instruction-following models, coding assistants, creative writing models — on single RTX 3090 consumer GPUs, democratising LLM customisation entirely.
Stable Diffusion fine-tuning using LoRA became ubiquitous — artists train LoRA adapters on their own art style (50-200 images, hours of training) to create a personalised image generator that produces their style on demand.
Enterprise teams use LoRA to fine-tune models on proprietary domain data (legal, medical, financial) — producing specialised models without sending data to external APIs, keeping training entirely on-premises.

Common pitfalls

Rank selection — too small a rank (r=1, 2) may not capture enough of the needed adaptation. Too large and you approach full fine-tuning costs without proportional benefit. r=8 to r=64 is typically the useful range.
Which layers to adapt — LoRA is typically applied to the query and value matrices in attention layers. Including more layers (FFN layers, all projection matrices) captures more adaptation capacity but increases cost.
Merging vs keeping adapters — merged adapters (baked into weights) have zero inference overhead. Unmerged adapters allow hot-swapping between specialisations at the cost of slight inference overhead.
Not a magic fix for data quality — LoRA dramatically reduces compute cost but does not reduce the need for high-quality fine-tuning data. 500 excellent examples still outperform 5,000 poor ones.
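The cost side of the rank and layer choices can be estimated with simple counting. The sketch below assumes a hypothetical model with 32 transformer layers, a hidden size of 4096 and square 4096×4096 attention projections. Real architectures differ: with grouped-query attention, for example, the key and value projections are smaller. Adapter size is estimated at 2 bytes per value (16-bit) and ignores file-format overhead.

```python
HIDDEN = 4096   # assumed hidden size
N_LAYERS = 32   # assumed number of transformer layers

# How many matrices per layer receive adapters under each choice
TARGET_SETS = {
    "query + value": 2,
    "all four attention projections": 4,
}


def lora_params(d: int, k: int, r: int) -> int:
    """Values in one adapter pair: (d x r) plus (r x k)."""
    return r * (d + k)


def adapter_params(rank: int, matrices_per_layer: int) -> int:
    """Trainable values across the whole hypothetical model."""
    return N_LAYERS * matrices_per_layer * lora_params(HIDDEN, HIDDEN, rank)


for label, n_matrices in TARGET_SETS.items():
    for rank in (2, 8, 64):
        n = adapter_params(rank, n_matrices)
        size_mb = n * 2 / 1e6  # 16-bit storage, 1 MB = 1e6 bytes
        print(f"{label:<32} r={rank:<3} {n:>12,} values  ~{size_mb:7.1f} MB")
```

Trainable parameters grow linearly in both the rank and the number of adapted matrices. In this hypothetical model, going from r=8 on query and value (about 4.2 million values) to r=64 on all four projections (about 67 million) multiplies the adapter size by 16.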

Frequently asked questions

Q: What is LoRA in simple terms?

A shortcut for fine-tuning large models: add tiny adapter layers, train only those (typically 0.1-1% of parameters), and freeze everything else. Quality is often close to full fine-tuning at a small fraction of the cost.

Q: How does LoRA work mathematically?

Each adapted weight matrix gets two small companion matrices A and B. Only A and B are trained. Their product A×B approximates the full weight update using far fewer parameters.

Q: What is QLoRA?

LoRA combined with 4-bit quantisation of the frozen base model, enabling fine-tuning of 65B parameter models on a single 48GB GPU.

Q: When should you use LoRA?

When you need to specialise a large model but cannot afford full fine-tuning in GPU memory or compute. It is the default fine-tuning approach for LLMs on consumer hardware.
