RAG (Retrieval-Augmented Generation) lets an LLM answer questions using your own documents rather than only its training data. Before generating a response, a retrieval system finds the most relevant passages in your knowledge base and injects them into the prompt. The LLM then answers from what was retrieved, not from what it vaguely remembers. The answers are more accurate, more current, and citable, which helps explain why RAG is the most widely deployed enterprise AI architecture.

Category: RAG & Retrieval · Difficulty: Intermediate · Last updated: 15 May 2026 · 6 min read


RAG — What It Is, How It Works & Why It Is the Most Deployed Enterprise AI Architecture

What is RAG?

A large language model trained on data from 2023 does not know about your company’s updated product specifications from last month. It does not know the specific terms in your client contracts. It does not know your internal HR policies. If you ask it about any of these, it will either say it does not know or — more dangerously — make something up that sounds plausible.

RAG solves this by giving the model access to your documents at query time. When a user asks a question, the system searches a database of your documents for the most relevant passages. Those passages are handed to the LLM alongside the question. The model reads the retrieved information and generates an answer based on what it actually found in your documents — not what it learned from the internet years ago.

The result: answers grounded in your specific, current, private knowledge. Answers that can be cited to specific documents. Answers that are far less likely to be hallucinated.

How does RAG work?

Offline (build the knowledge base):

  1. Load your documents — PDFs, Word files, web pages, database records.
  2. Chunk them — split into passages of 200-800 tokens. Chunking strategy matters enormously.
  3. Embed each chunk — convert to a dense vector using an embedding model (text-embedding-3-small, bge-large, etc.).
  4. Store in a vector database — Pinecone, Weaviate, pgvector, Qdrant, Chroma.
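
The offline steps above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not a production pipeline: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, the "vector database" is a plain Python list, "tokens" are whitespace-separated words, and the file names and chunk sizes are made up for the example.

```python
import hashlib
import math
import re


def chunk_text(text, max_tokens=200, overlap=20):
    """Split text into overlapping windows of whitespace-separated tokens."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text, dim=64):
    """ASSUMPTION: stand-in for a real embedding model.

    Hashes each word into one of `dim` buckets and L2-normalises the counts.
    """
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(documents, max_tokens=200, overlap=20):
    """Chunk and embed every document; return a list of index records."""
    index = []
    for source, text in documents.items():
        for i, chunk in enumerate(chunk_text(text, max_tokens, overlap)):
            index.append({
                "source": source,      # kept so answers can cite the document
                "chunk_id": i,
                "text": chunk,
                "vector": toy_embed(chunk),
            })
    return index


# Hypothetical documents; tiny chunk sizes so the example stays readable.
documents = {
    "hr_policy.txt": "Employees accrue two days of annual leave per month. "
                     "Unused leave carries over for one year.",
    "expenses.txt": "Travel expenses must be submitted within 30 days of the trip.",
}
index = build_index(documents, max_tokens=8, overlap=2)
print(len(index), "chunks indexed")
```

Storing the source name next to each vector is what later makes citation possible. In a real deployment the list would be replaced by one of the vector databases named above and `toy_embed` by an actual embedding model.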

Online (answer a query):

  1. User asks a question.
  2. Embed the question with the same embedding model.
  3. Retrieve top-K chunks whose vectors are most similar to the question vector (cosine similarity or dot product).
  4. Optionally rerank — use a cross-encoder to reorder the top-K by true semantic relevance.
  5. Construct the prompt — include the retrieved chunks and the question, with instructions to answer based on the provided context.
  6. LLM generates the answer — grounded in retrieved content.
  7. Optionally cite sources — include which documents each part of the answer came from.
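
As a rough sketch of the online steps (embed the question, retrieve the top-K chunks by cosine similarity, build a grounded prompt), the following Python may help. Everything named here is an assumption for illustration: `toy_embed` is a word-count stand-in for a real embedding model, the chunks are hypothetical, and chunk vectors are computed on the fly, whereas a real system would read them from the index built offline. Reranking and the LLM call itself are left out.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """ASSUMPTION: stand-in for a real embedding model (sparse word counts)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return up to k chunks most similar to the question, best first."""
    q_vec = toy_embed(question)  # same embedding function as for the chunks
    scored = [(cosine(q_vec, toy_embed(c["text"])), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]


def build_prompt(question, retrieved):
    """Combine numbered, source-tagged chunks with grounding instructions."""
    context = "\n\n".join(
        f"[{i}] ({c['source']}) {c['text']}"
        for i, c in enumerate(retrieved, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical knowledge base.
chunks = [
    {"source": "leave_policy.txt",
     "text": "Employees accrue two days of annual leave per month."},
    {"source": "expenses.txt",
     "text": "Travel expenses must be submitted within 30 days."},
    {"source": "security.txt",
     "text": "Laptops must use full disk encryption."},
]
question = "How many days of annual leave do employees accrue?"
top = retrieve(question, chunks, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM. Numbering the chunks and tagging them with their source gives the model something concrete to cite in step 7.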

Real-world examples

Not theory — what real teams actually shipped using this technique.

  • Klarna’s customer support AI — RAG-powered system answers customer questions about policies, orders, and disputes grounded in Klarna’s actual policy documents and customer account data — reducing hallucination about specific policy terms.
  • Legal contract review — a law firm’s RAG system retrieves relevant clauses from thousands of precedent contracts when reviewing a new contract, suggesting standard language and flagging deviations from typical terms.
  • NHS clinical decision support — a RAG system that retrieves relevant NICE guidelines, drug interaction data, and patient-specific records when a clinician asks a clinical question — grounding the AI’s response in authoritative medical sources.

Common pitfalls

  • Chunking strategy is critical: chunks that are too small lose context, while chunks that are too large dilute relevance. Splitting mid-sentence or across related paragraphs breaks semantic coherence. Semantic chunking (splitting at natural content boundaries) often outperforms fixed-size splitting.
  • Retrieval quality is the bottleneck: if the right chunk is not retrieved, the LLM cannot use it, however capable the model is. Evaluate retrieval precision and recall independently of generation quality.
  • The model may ignore retrieved context: LLMs sometimes answer from training memory even when relevant context is provided. Instruction tuning and prompt design encourage grounding but do not guarantee it.
  • Stale vector index: when source documents change, the affected chunks must be re-embedded and the index updated, because stale embeddings return stale results. Implement incremental indexing for frequently updated knowledge bases.
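
One common way to approach incremental indexing is to record a content hash for each document at indexing time and re-embed only what has changed since. The sketch below assumes that approach. The function names and file names are hypothetical, and the actual re-embedding and vector-store deletion are left out.

```python
import hashlib


def content_hash(text):
    """Fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_index_update(documents, stored_hashes):
    """Compare current documents with the hashes saved at the last indexing run.

    documents:     dict of doc_id -> current text
    stored_hashes: dict of doc_id -> hash recorded when it was last embedded
    Returns (to_embed, to_delete): new or changed docs, and docs that vanished.
    """
    to_embed = sorted(
        doc_id for doc_id, text in documents.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(
        doc_id for doc_id in stored_hashes if doc_id not in documents
    )
    return to_embed, to_delete


# Hypothetical state: what was indexed last time vs. what exists now.
previous = {
    "pricing.txt": content_hash("old pricing"),
    "faq.txt": content_hash("unchanged"),
    "legacy.txt": content_hash("retired product"),
}
current = {
    "pricing.txt": "new pricing",
    "faq.txt": "unchanged",
    "launch.txt": "brand new product",
}
to_embed, to_delete = plan_index_update(current, previous)
print(to_embed)   # ['launch.txt', 'pricing.txt']
print(to_delete)  # ['legacy.txt']
```

Only the documents in `to_embed` need to be re-chunked and re-embedded, and the chunks of documents in `to_delete` should be removed from the index so they can no longer be retrieved.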

Frequently asked questions

QUESTION 1 What is RAG in simple terms?

ANSWER 1 It is like giving the AI an open-book exam: the system retrieves relevant documents before generating the answer, so the response is grounded in your actual content rather than training memory.

QUESTION 2 How is RAG different from fine-tuning?

ANSWER 2 RAG retrieves knowledge at inference time: the model itself is unchanged, and you update its knowledge by updating the document database. Fine-tuning bakes knowledge into the model's weights, which is expensive and requires retraining whenever the knowledge changes.

QUESTION 3 What are the components of a RAG system?

ANSWER 3 Document loader, text splitter, embedding model, vector database, retriever, optional reranker, LLM, and prompt template.

QUESTION 4 What are common RAG failure modes?

ANSWER 4 Retrieval failure (right document not found), model ignoring context, chunk boundary problems splitting answers across chunks, and context window overflow from too many retrieved chunks.
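
Retrieval failure, the first failure mode listed, can be measured on its own before any answer is generated. A minimal sketch of precision@k and recall@k follows. It assumes you have a small labelled set saying which chunk IDs are relevant to each test question, and the IDs below are invented for illustration.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision and recall of the top-k retrieved chunk IDs.

    retrieved_ids: ranked list of chunk IDs returned by the retriever
    relevant_ids:  chunk IDs a human labelled as relevant to the question
    """
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result for one test question.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4", "c5"}

p3, r3 = precision_recall_at_k(retrieved, relevant, k=3)
p4, r4 = precision_recall_at_k(retrieved, relevant, k=4)
print(f"k=3 precision={p3:.2f} recall={r3:.2f}")
print(f"k=4 precision={p4:.2f} recall={r4:.2f}")
```

Low recall means the right chunk never reaches the LLM, which no prompt can fix. Low precision means the context is padded with irrelevant chunks, which also contributes to context window overflow.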

