Why This Matters
Retrieval-augmented generation (RAG) is one of the most practical patterns for making LLMs useful with your own data. Instead of fine-tuning, which is expensive and brittle, you retrieve relevant context at query time and inject it into the prompt.
The Intuition
Think of an open-book exam. The LLM is the student, and RAG is the process of finding the right pages in the textbook before answering each question. Without RAG, the model relies only on what it memorized during training. With RAG, it gets to look things up.
How It Works
User Query
↓
[1. Embed Query] → query vector
↓
[2. Search Vector DB] → top-k relevant documents
↓
[3. Build Prompt] = system instructions + retrieved docs + user query
↓
[4. LLM Generates] → grounded answer with citations
Key Design Decisions
Chunk Size
| Size | Pros | Cons |
|---|---|---|
| Small (128 tokens) | Precise retrieval | Loses context |
| Medium (512 tokens) | Good balance; a common default | Neither as precise as small chunks nor as rich as large ones |
| Large (1024+ tokens) | Rich context | Dilutes relevance |
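To make the chunk-size trade-off concrete, here is a minimal sketch of fixed-size chunking. It assumes whitespace-separated words as a rough stand-in for model tokens, and it adds an `overlap` parameter (not discussed above) so that text cut at a chunk boundary still appears whole in one chunk. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of `chunk_size` tokens.

    Consecutive windows share `overlap` tokens (an assumption of this
    sketch, meant to soften the "loses context" cost of small chunks).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, so no tail-only duplicate is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Whitespace splitting only approximates a real tokenizer.
words = "retrieval augmented generation looks things up before answering".split()
small_chunks = chunk_tokens(words, chunk_size=4, overlap=1)
```

In practice you would count tokens with the tokenizer of your embedding model, since word counts and token counts can differ noticeably.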
Retrieval Strategy
- Naive RAG: Embed query → top-k → stuff into prompt
- Advanced RAG: Re-ranking, query expansion, hybrid search
- Modular RAG: Routing, filtering, iterative retrieval
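The naive strategy (embed query, take top-k, stuff into prompt) can be sketched in a few lines. This is an illustration under stated assumptions, not a production setup: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector DB, and the final LLM call is left out. All function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=2):
    """Steps 1 and 2: embed the query, then rank documents by similarity."""
    query_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, retrieved):
    """Step 3: system instructions + numbered retrieved docs + user query."""
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(retrieved, start=1))
    return (
        "Answer using only the numbered context below and cite sources like [1].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


DOCS = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]

query = "How many days do refunds take?"
top_docs = retrieve(query, DOCS, k=2)
prompt = build_prompt(query, top_docs)  # Step 4 would send this to an LLM.
```

The advanced and modular strategies keep this skeleton and swap in stronger pieces, for example a re-ranker after `retrieve` or a router in front of it.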
Common Failure Modes
- Retrieved but not used: Model ignores context (fix: better prompting, e.g. instruct the model explicitly to answer from the supplied context)
- Wrong context retrieved: Embedding mismatch (fix: hybrid retrieval, re-ranking)
- Context too large: Exceeds token budget (fix: summarization, chunking)
- Hallucination despite context: Model confabulates (fix: citation enforcement)
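For the "context too large" failure, one simple guard is to pack retrieved chunks into a fixed token budget before building the prompt. The sketch below assumes the chunks are already sorted by relevance and approximates token counts with whitespace-separated words; `count_tokens` and `fit_to_budget` are hypothetical helper names. It complements, rather than replaces, the summarization and chunking fixes listed above.

```python
def count_tokens(text):
    """Rough token count: whitespace-separated words (an approximation)."""
    return len(text.split())


def fit_to_budget(ranked_chunks, max_tokens):
    """Greedily keep the highest-ranked chunks that fit within `max_tokens`.

    A chunk that would overflow the budget is skipped, and smaller
    lower-ranked chunks may still be added after it.
    """
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue
        kept.append(chunk)
        used += cost
    return kept, used


ranked = ["a b c", "d e f g h", "i j"]
kept, used = fit_to_budget(ranked, max_tokens=6)
```

Skipping an oversized chunk instead of stopping at it is a design choice: it fills the budget more fully, at the cost of sometimes including a lower-ranked chunk in place of a higher-ranked one.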
See Also
- Hybrid Retrieval — Improving RAG with keyword + semantic search
- Prompt Engineering — Crafting prompts that use retrieved context
- BCI In Practice — How Amprealize implements RAG for behaviors