Retrieval-Augmented Generation (RAG) is an LLM application pattern that retrieves external knowledge at query time and injects it into the model's context to produce grounded answers. Practitioners use RAG to reduce hallucinations, incorporate fresh or private data, and make outputs auditable through cited sources. A useful mental model is that RAG is two coupled systems: an information retrieval system (indexes, rankers, filters) and a response synthesis system (prompting, citations, formatting). By 2026, production RAG has evolved well beyond naive chunk-and-retrieve: the dominant pattern is agentic RAG, where the LLM itself decides when, what, and how to retrieve. Most "RAG problems" are retrieval problems first: if the right evidence doesn't make it into context, generation cannot recover it.
14 tables, 119 concepts.
Table 1: RAG Building Blocks (Conceptual)
Every RAG pipeline is assembled from the same handful of stages, run in order from ingesting raw data all the way to evaluating the final answer. Knowing this skeleton makes the rest of the cheat sheet easier to navigate — each later table simply zooms into one of these stages and lists the concrete tools you can plug in there.
| Stage | Example | Description |
|---|---|---|
| Retrieval-augmented generation | `answer = LLM(question, context=top_k_docs)` | Generates with retrieved evidence rather than relying only on parametric memory. |
| Loading | `docs = loader.load_data()` | Reads source data and converts it into document objects for downstream processing. |
| Chunking | `splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)` | Splits documents into smaller units so retrieval can target the right passages. |
| Embedding | `emb = client.embeddings.create(model="text-embedding-3-small", input="...")` | Maps text to vectors for similarity search in dense retrieval. |
| Indexing | `index.add(xb)` | Builds a search structure over vectors (exact or ANN) to enable fast retrieval. |
| Retrieval | `docs = retriever.invoke("...")` | Selects candidate chunks/documents relevant to the query. |
| Reranking | `reranked = co.rerank(query=q, documents=texts)` | Reorders retrieved candidates to improve top-k quality. |
| Synthesis | `resp = query_engine.query("...")` | Produces the final answer from retrieved context (often via an LLM). |
| Citation | `resp = citation_engine.query("...")` | Attaches sources to claims (usually at chunk-level granularity). |
| Evaluation | `score = faithfulness` | Measures retrieval and generation quality with task-appropriate metrics. |
Table 2: Chunking and Splitting
How you slice documents into chunks quietly decides how good your retrieval can ever be — too big and you bury the answer in noise, too small and you lose the surrounding context. These splitters range from the dependable recursive-character default to semantic and structure-aware options, plus the parent/child and contextual tricks that fix the boundary problems naive splitting creates.
| Splitter | Example | Description |
|---|---|---|
| Recursive character splitting | `RecursiveCharacterTextSplitter(separators=["\n\n","\n"," ",""], chunk_size=1000, chunk_overlap=200)` | • General-purpose splitter that tries separators in order to form sized chunks • a common, well-tested default. |
| Semantic chunking | `SemanticChunker(embeddings, breakpoint_threshold_type="percentile")` | • Splits at semantic boundary breaks in embedding space • keeps topically coherent passages together. |
| Token splitting | `TokenTextSplitter(chunk_size=512, chunk_overlap=64)` | Splits by token count (useful when you need predictable model-context usage). |
| Sentence splitting | `SentenceSplitter(chunk_size=512, chunk_overlap=50)` | Splits text into sentence-based chunks with size/overlap controls. |
| Markdown header splitting | `MarkdownHeaderTextSplitter(headers_to_split_on=[("#","h1"),("##","h2")])` | Splits Markdown while preserving header structure as metadata. |
| HTML header splitting | `HTMLHeaderTextSplitter(headers_to_split_on=[("h1","h1"),("h2","h2")])` | Splits HTML by header tags to keep section semantics. |
| Character splitting | `CharacterTextSplitter(separator="\n\n", chunk_size=1000, chunk_overlap=200)` | Simple splitter using a fixed separator and target chunk size. |
| Contextual chunking | `contextualized_chunk = f"CONTEXT: {llm(doc, chunk)}\n\n{chunk}"` | • Prepends an LLM-generated description of where each chunk fits in its document • Anthropic found this reduces retrieval failures by 49%. |
| Parent/child retrieval | `ParentDocumentRetriever(vectorstore=vs, docstore=store, child_splitter=small_splitter)` | Indexes small child chunks for precise matching but returns the larger parent chunk to the LLM for richer context. |
| Chunk overlap | `RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)` | Repeats tail tokens/characters across adjacent chunks to reduce boundary misses. |
| Chunk metadata | `Document(page_content=text, metadata={"source": url, "page": 12})` | Carries provenance and filter fields (e.g., doc id, page, section) through retrieval. |
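To make the overlap idea concrete, here is a minimal sketch of fixed-size character windows that share a tail with the previous chunk. It is not the algorithm used by `RecursiveCharacterTextSplitter` (which also tries separators in order); the function name `split_with_overlap` is hypothetical and only the `chunk_size` / `chunk_overlap` parameter names are borrowed from the table.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into character windows; adjacent windows share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text,
        # so we do not emit a trailing chunk that is pure overlap.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(chunk)
```

A sentence that straddles a window boundary appears whole in at least one chunk as long as it is shorter than the overlap, which is the reason overlap reduces boundary misses.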
Table 3: Embedding Models
The embedding model is what turns text into the vectors your search runs on, so its quality sets the ceiling on retrieval accuracy. The choice usually comes down to a few trade-offs — proprietary APIs like OpenAI and Cohere versus self-hostable open models like BGE-M3, dimension count and cost, and how well each handles non-English or domain-specific corpora.
| Model | Example | Description |
|---|---|---|
| text-embedding-3-small | `client.embeddings.create(model="text-embedding-3-small", input="...")` | • OpenAI's cost-effective embedding model • 1536-dim, supports dimension reduction to 512 with minimal recall loss. |
| text-embedding-3-large | `client.embeddings.create(model="text-embedding-3-large", input="...")` | • OpenAI's high-accuracy model • 3072-dim, supports reduction to 256+ • strong English performance across MTEB. |
| Cohere embed-v4.0 | `co.embed(texts=[...], model="embed-v4.0", input_type="search_document")` | • Multimodal and multilingual • 1536-dim, scores ~65.2 on MTEB • best for non-English corpora and mixed-modality inputs. |
| voyage-3-large | `vo.embed(texts, model="voyage-3-large", input_type="document")` | • Voyage AI's flagship model • outperforms text-embedding-3-large on MTEB • domain variants available (code, finance, law). |
| BGE-M3 | `model = BGEM3FlagModel("BAAI/bge-m3"); model.encode(texts)` | • Open-source • supports dense, sparse, and multi-vector retrieval in one model • self-hostable on a single GPU. |
| e5-mistral-7b-instruct | `model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")` | • Open-source 7B model • competitive with proprietary models on MTEB • instruction-tuned for strong passage retrieval. |
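The dimension reduction mentioned for the text-embedding-3 rows can be approximated client-side by keeping the leading dimensions and re-normalizing. The sketch below assumes the model was trained so that leading dimensions carry most of the signal; that assumption does not hold for arbitrary embedding models, and the helper name `shorten_embedding` is hypothetical.

```python
import math


def shorten_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale the result to unit L2 length."""
    if dims <= 0:
        raise ValueError("dims must be positive")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized; return it unchanged
    return [x / norm for x in head]


if __name__ == "__main__":
    full = [3.0, 4.0, 12.0]  # toy 3-dim vector standing in for a real embedding
    print(shorten_embedding(full, 2))
```

Re-normalizing matters because truncation changes the vector's length, and cosine or dot-product search assumes comparable norms across stored and query vectors.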
Table 4: Dense Indexing and ANN
Once you have millions of vectors, scanning all of them per query is too slow — approximate nearest-neighbor (ANN) indexes trade a sliver of recall for orders-of-magnitude faster search. Here you'll find the graph- and partition-based structures (HNSW, IVFFlat) that power production vector DBs, the exact FAISS baselines, the quantization tricks that shrink memory, and the distance metrics that define "similar."
| Index | Example | Description |
|---|---|---|
| HNSW | `CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);` | • Graph-based ANN structure • the de facto standard in production vector DBs due to low query latency at high recall. |
| IVFFlat | `CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);` | • Inverted-file ANN: partitions vectors into lists and probes a subset at query time • lower build cost than HNSW. |
| Flat L2 (FAISS) | `index = faiss.IndexFlatL2(d)` | • Exact brute-force L2-distance baseline in FAISS • useful for small corpora or ground-truth evaluation. |
| Flat inner product (FAISS) | `index = faiss.IndexFlatIP(d)` | Exact inner-product search baseline in FAISS. |
| Product quantization (PQ) | `index = faiss.IndexPQ(d, m, nbits)` | Compresses vectors into short codes (6–10× compression) to reduce memory and speed up search. |
| Scalar quantization | `ScalarQuantization(scalar=ScalarQuantizationConfig(type=ScalarType.INT8))` | • Quantizes each dimension to 8-bit integers • lighter than PQ with minimal recall loss • widely used in Qdrant. |
| Cosine distance | `ORDER BY embedding <=> $1 LIMIT 10` | Common similarity measure for normalized embeddings (`<=>` is the cosine-distance operator in pgvector). |
| Inner product | `scores = x @ q` | • Inner-product similarity • equivalent to cosine when vectors are L2-normalized. |
| L2 normalization | `faiss.normalize_L2(x)` | Scales vectors to unit length so that dot-product search ranks results the same way as cosine similarity. |
| ColBERT (late interaction) | `rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0"); rag.search(q, k=5)` | • Stores per-token embeddings • scores with MaxSim (max cosine per query token) • near cross-encoder accuracy at near bi-encoder speed. |
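The claim that inner product and cosine coincide on L2-normalized vectors can be checked with a few lines of plain Python. This is an illustrative sketch rather than how FAISS or pgvector compute it internally; the helper names are hypothetical.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list[float]) -> list[float]:
    """Scale v to unit length (the per-vector effect of an L2 normalization step)."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


if __name__ == "__main__":
    a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
    # After normalization, a plain dot product reproduces cosine similarity.
    print(cosine(a, b), dot(l2_normalize(a), l2_normalize(b)))
```

In practice this is why many pipelines normalize once at indexing time and then use the cheaper inner-product index.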
Table 5: Vector Databases
A vector database is where your embeddings actually live and get queried in production, bundling the ANN index with storage, metadata filtering, and operations. The real decision is which one fits your situation — pgvector if you already run Postgres, managed Pinecone for zero-ops scale, Qdrant or Weaviate for open-source flexibility, Chroma for quick prototyping.
| Database | Example | Description |
|---|---|---|
| pgvector | `CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);` | • PostgreSQL extension • keeps vectors and app data in the same table/transaction • best default for <5M vectors if you already run Postgres. |
| Pinecone | `index.upsert(vectors=[(id, emb, meta)]); index.query(vector=q, top_k=10)` | • Fully managed SaaS • zero-ops serverless, auto-scales to billions of vectors • supports sparse-dense hybrid search. |
| Qdrant | `client.search("docs", query_vector=q, query_filter=Filter(...), limit=10)` | • Open-source Rust-native DB • rich payload filtering, scalar/product quantization, self-hosted or Qdrant Cloud. |
| Weaviate | `collection.query.near_text("...", limit=10)` | • Open-source • built-in vectorization modules (auto-embeds raw text), built-in hybrid BM25+vector search • GraphQL API. |
| Milvus | `client.search(collection_name="docs", data=[q_emb], limit=10)` | • Open-source • GPU-accelerated, billion-scale distributed clusters • Zilliz Cloud is the managed version. |
| Chroma | `collection.query(query_texts=["..."], n_results=10)` | • Embedded or client-server • zero-setup developer experience • best for prototyping and local development. |
| LanceDB | `table.search(q_emb).limit(10).to_arrow()` | • Zero-copy columnar storage (Lance format) • embedded/in-process, disk-based indexing for larger-than-RAM datasets. |
Table 6: Sparse and Hybrid Retrieval
Dense vectors are great at meaning but weak at exact terms — error codes, product names, and rare keywords often slip through. Sparse methods like BM25 catch exactly those, and hybrid retrieval fuses both signals so you don't have to choose; the rest of the rows cover the filtering, fusion, and diversity controls that shape the final candidate set.
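The sparse side can be sketched in a few lines. The following is a minimal BM25 scorer over pre-tokenized documents, following the usual Okapi form with `k1` and `b`. The IDF variant used here, `ln(1 + (N - n + 0.5) / (n + 0.5))`, is an assumption (implementations differ), and the function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query terms."""
    if not docs:
        return []
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs  # average document length in tokens

    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # term absent from this document contributes nothing
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))  # assumed IDF variant
            length_norm = 1 - b + b * len(d) / avgdl
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [["error", "code", "e42"], ["install", "guide"], ["error", "handling", "guide"]]
    print(bm25_scores(["e42"], corpus))
```

A rare token such as an error code gets a high IDF and matches only documents that literally contain it, which is exactly the case dense retrieval tends to miss.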
| Retriever | Example | Description |
|---|---|---|
| Hybrid search | `collection.query.hybrid(query="...", alpha=0.5, limit=10)` | • Combines sparse and dense vector signals into one ranked list • covers both semantic similarity and exact-match needs. |
| BM25 | $\text{score}(q,d)=\sum_{t\in q} \text{IDF}(t)\cdot\frac{f(t,d)\cdot(k_1+1)}{f(t,d)+k_1\cdot(1-b+b\cdot\frac{\lvert d \rvert}{\text{avgdl}})}$ | • Term-based ranking model used for keyword (sparse) retrieval • strong on exact terms, error codes, and product names. |
| Maximal marginal relevance (MMR) | $\arg\max_{d\in R\setminus S}\,\lambda\,\text{sim}(d,q)-(1-\lambda)\max_{d'\in S}\text{sim}(d,d')$ | • Diversifies selected chunks by trading off relevance vs redundancy • avoids returning near-duplicate passages. |
| Metadata filtering | `where={"path":["source"],"operator":"Equal","valueString":"handbook"}` | Restricts candidates to documents matching structured metadata predicates before vector search. |
| Reciprocal rank fusion (RRF) | $\text{RRF}(d)=\sum_i \frac{1}{k+\text{rank}_i(d)}$ | • Rank aggregation to merge results from multiple retrievers without needing calibrated scores • k=60 is common. |
| Top-k | `docs = retriever.invoke(q, config={"configurable": {"k": 10}})` | Controls how many candidates are returned from retrieval. |
| Similarity score threshold | `as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.8})` | Filters results by minimum similarity score before passing context to the LLM. |
| BM25 parameters | `"similarity": {"default": {"type": "BM25", "k1": 1.2, "b": 0.75}}` | Configures BM25 scoring parameters (`k1`, `b`) in Elasticsearch. |
Table 7: Query Transformation
Users rarely phrase questions the way your documents are written, and a single raw query often misses relevant chunks. These techniques reshape the query before it ever hits the index — rewriting it, expanding it into several variants, generating a hypothetical answer to search with, or breaking a complex question into sub-questions — to close that gap between how people ask and how content is stored.
| Technique | Example | Description |
|---|---|---|
"Rewrite the question for search: ..." | Converts user input into a search-optimized query string to improve vector recall. | |
mqr = MultiQueryRetriever.from_llm(retriever, llm) | Uses an LLM to generate multiple query variants and unions retrieved docs for better recall. | |
hyp = llm("Write a passage answering: ...")docs = retriever.invoke(hyp) | • Retrieves using embeddings of a hypothetical document generated from the query • helps with vague or sparse questions. | |
sq = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info) | Uses an LLM to translate natural language into a structured query + metadata filters. | |
subqs = ["...", "..."] | Splits a complex question into sub-questions that are answered individually and combined. | |
broader = llm(f"What is the more general question behind: '{q}'")docs = retriever.invoke(broader) | • First retrieves for a broader abstraction of the question before the specific query • improves recall on specific questions. | |
route = router.invoke({"question": q}) | • Selects a retriever/index/tool based on query intent or domain • avoids wasting retrieval on the wrong corpus. | |
q' = q + synonyms(q) | Adds related terms/phrases to improve recall in sparse retrieval. | |
docs = retriever.invoke(rephrased_q) | Normalizes user phrasing to reduce retrieval mismatch from casual or ambiguous language. |
Table 8: Reranking and Fusion
First-pass retrieval is tuned for speed and recall, so it usually pulls back more than you need with the best chunk not always on top. A reranker does a second, more expensive pass that scores each query-document pair jointly and reshuffles for precision — Anthropic measured reranking cutting retrieval failures by 67% when combined with contextual retrieval. The fusion and compression rows round out the toolkit for merging multiple ranked lists and trimming context noise.
| Ranker | Example | Description |
|---|---|---|
| Cross-encoder reranking | `scores = cross_encoder.predict([(q, d) for d in docs])` | • Scores each query-document pair jointly with full attention for higher-precision top-k • compute cost is small relative to the LLM call. |
| Cohere Rerank | `co.rerank(model="rerank-english-v3.0", query=q, documents=texts, top_n=10)` | • API reranker producing a relevance-ordered list with scores • Anthropic found that adding reranking to contextual retrieval cuts retrieval failures by 67%. |
| BGE reranker | `reranker = FlagReranker("BAAI/bge-reranker-v2-m3"); scores = reranker.compute_score([(q,d) for d in docs])` | • Open-source cross-encoder reranker • multilingual, competitive with commercial rerankers • self-hostable. |
| Reciprocal rank fusion (RRF) | $\text{RRF}(d)=\sum_i \frac{1}{k+\text{rank}_i(d)}$ | • Fuses multiple ranked lists without needing calibrated similarity scores • k=60 is a common default. |
| Ensemble retriever | `ens = EnsembleRetriever(retrievers=[r1, r2], weights=[0.5, 0.5])` | Combines multiple retrievers and applies rank fusion to merge results. |
| Contextual compression | `cc = ContextualCompressionRetriever(base_retriever=r, base_compressor=compressor)` | Retrieves, then compresses documents to only the query-relevant parts, reducing context noise. |
| Similarity cutoff | `post = SimilarityPostprocessor(similarity_cutoff=0.8)` | Drops nodes below a minimum similarity threshold before synthesis. |
| Deduplication | `unique = list({d.page_content: d for d in docs}.values())` | Removes duplicate chunks before feeding context to the LLM. |
| Top-n | `top_n=10` | Limits reranker output to the highest-scoring items only. |
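Reciprocal rank fusion is short enough to write out directly. The sketch below sums 1/(k + rank) over every ranked list a document appears in, with ranks starting at 1 and k defaulting to 60 as the table suggests. The function name and the alphabetical tie-break are illustrative choices, not part of any particular library.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    dense = ["a", "b", "c"]   # ids ranked by a dense retriever
    sparse = ["b", "d", "a"]  # ids ranked by a sparse (BM25) retriever
    print(reciprocal_rank_fusion([dense, sparse]))
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale, which is the main reason RRF is a popular default for hybrid search.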
Table 9: Query Engines and Answer Grounding
This is the generation half of RAG — the components that take retrieved context and turn it into a written answer. Grounding is the key idea: instructing the model to answer only from the provided evidence is the core defense against hallucination, and the citation engines tie each claim back to its source so readers can verify it.
| Engine | Example | Description |
|---|---|---|
| RetrieverQueryEngine | `qe = RetrieverQueryEngine.from_args(retriever=retriever); resp = qe.query("...")` | LlamaIndex query engine that retrieves nodes, then synthesizes a response. |
| CitationQueryEngine | `qe = CitationQueryEngine.from_args(index=index); resp = qe.query("...")` | Generates answers with inline source citations anchored to retrieved chunks. |
| Grounding instruction | `"Answer only using the provided context."` | • Instructs the LLM to base claims on retrieved evidence rather than latent knowledge • core anti-hallucination mechanism. |
| Prompt template | `prompt = ChatPromptTemplate.from_messages([("system", "..."), ("human", "{question}")])` | Parameterizes prompts so retrieval context and user input can be inserted reliably. |
| Response mode | `response_mode=ResponseMode.COMPACT` | Controls how retrieved text is composed into prompts and how answers are formed. |
| Output token limit | `max_output_tokens=512` | Limits generation length (and indirectly budgets room for retrieved context). |
| Streaming | `stream=True` | • Streams partial tokens while a completion is being generated • reduces perceived latency. |
| Citation chunk size | `citation_chunk_size=512` | Sets the chunk size used to form citation units for per-source attribution. |
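As an illustration of grounding plus citations, the sketch below assembles a prompt that numbers each retrieved chunk so the model can cite it as `[n]`. The chunk dictionary shape (`source`, `text`) and the function name `build_grounded_prompt` are assumptions made for this example and do not reflect the internal format of any query engine listed above.

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Build a prompt that restricts the answer to numbered context chunks."""
    lines = [
        "Answer only using the provided context.",
        "Cite sources as [n]. If the context is insufficient, say so.",
        "",
        "Context:",
    ]
    # Number chunks from 1 so citations in the answer map back to sources.
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[{i}] ({chunk['source']}) {chunk['text']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


if __name__ == "__main__":
    retrieved = [
        {"source": "handbook.md", "text": "Refunds are processed within 14 days."},
        {"source": "faq.md", "text": "Refund requests require an order number."},
    ]
    print(build_grounded_prompt("How long do refunds take?", retrieved))
```

Keeping the numbering stable between the prompt and the stored chunk list is what lets a post-processing step turn `[1]` in the answer back into a link to the source document.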
Table 10: Advanced RAG Architectures
Beyond the basic retrieve-then-generate loop sit the patterns that define modern production RAG. Most share one move — letting the system decide how to retrieve rather than hard-coding a single pass: agentic RAG treats retrieval as a tool the LLM calls, corrective and self-RAG grade their own evidence, multi-hop chains several retrievals, and GraphRAG retrieves over a knowledge graph instead of flat chunks.
| Pattern | Example | Description |
|---|---|---|
| Agentic RAG | `tools=[search_knowledge_base]; agent.run(question)` | • LLM decides when, what, and how to retrieve as a tool call • handles multi-step and conditional retrieval needs. |
| GraphRAG | `graphrag.query(query_type="local", query="...")` | Microsoft's approach: extracts a knowledge graph from the corpus, builds community summaries, and retrieves via local or global search. |
| Corrective RAG (CRAG) | `grade = evaluator.score(doc, q); if grade == "incorrect": web_search(q)` | Lightweight evaluator grades each retrieved document (correct/ambiguous/incorrect) and triggers web search on failures. |
| Self-RAG | `# model uses reflection tokens: [Retrieve], [ISREL], [ISSUP]` | Trains a single LM to retrieve adaptively on demand and to self-critique retrieved passages and its own generations. |
| Multi-hop retrieval | `for hop in range(MAX_HOPS): docs=retrieve(q); q=refine(q,docs)` | • Chains multiple retrieval steps where each hop's results inform the next query • needed for questions spanning multiple documents. |
| Contextual retrieval | `chunk = f"{llm_context(doc, chunk)}\n\n{chunk}"; embed(chunk)` | • Anthropic technique: prepends chunk-specific context before embedding and BM25 indexing • reported to reduce retrieval failures by 49%, or 67% when combined with reranking. |
| Adaptive RAG | `route = classifier(query); pipeline = routes[route]` | A query-complexity classifier routes each query to the appropriate pipeline (no retrieval, single-hop, or multi-hop), saving cost on simple queries. |
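The grade-then-fall-back control flow of the corrective pattern can be sketched with plain callables. This is a simplification made for illustration: documents graded "correct" are kept, and the fallback search runs only when none are, in which case "ambiguous" documents are returned alongside the fallback results. The names `corrective_retrieve`, `retrieve`, `grade`, and `fallback_search` are all hypothetical.

```python
from typing import Callable


def corrective_retrieve(
    question: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, str], str],          # returns "correct", "ambiguous", or "incorrect"
    fallback_search: Callable[[str], list[str]],
) -> list[str]:
    """Return graded evidence, falling back to another source when retrieval fails."""
    docs = retrieve(question)
    grades = [(doc, grade(question, doc)) for doc in docs]

    correct = [doc for doc, g in grades if g == "correct"]
    if correct:
        return correct  # retrieval succeeded; drop everything else

    # No trustworthy document: keep ambiguous ones and add fallback results.
    ambiguous = [doc for doc, g in grades if g == "ambiguous"]
    return ambiguous + fallback_search(question)


if __name__ == "__main__":
    evidence = corrective_retrieve(
        "What is the refund window?",
        retrieve=lambda q: ["Shipping takes 3 days."],
        grade=lambda q, d: "incorrect",
        fallback_search=lambda q: ["(web) Refunds are accepted within 14 days."],
    )
    print(evidence)
```

Passing the retriever, grader, and fallback in as functions keeps the control flow testable without any model or network access.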
Table 11: Storage, Persistence, and Caching
The plumbing that keeps a RAG system durable and affordable lives here — where vectors, full documents, and index structures get persisted so you don't rebuild them on every restart. The caching and incremental-indexing rows matter most at scale: re-embedding only changed documents and caching repeat embeddings are where real cost savings come from.
| Store | Example | Description |
|---|---|---|
| Vector store | `vectors = embed(texts); upsert(vectors, metadata)` | Persists embeddings and payloads for similarity search at retrieval time. |
| Document store | `docstore.add_documents(docs)` | • Stores full documents (separate from chunk/node indexes) • used by ParentDocumentRetriever. |
| Index store | `index_store.persist(persist_dir="./storage")` | Persists index metadata/structures for reload without rebuilding. |
| Storage context | `ctx = StorageContext.from_defaults(persist_dir="./storage")` | Bundles storage backends used by an index/query pipeline. |
| Database persistence | `CREATE TABLE items(id bigserial, content text, embedding vector(1536));` | Makes embeddings and queryable data durable in a database. |
| Incremental indexing | `if doc.hash != stored.hash: re_embed(doc)` | Re-indexes only changed documents to keep the vector store current without full rebuilds. |
| Embedding cache | `cache_key = sha256(text)` | • Avoids recomputing embeddings for identical inputs • critical for cost control at scale. |
| Namespaces | `index.query(namespace="prod", vector=q, top_k=10)` | Separates tenant or environment data within one vector index for multi-tenant isolation. |
| Cache TTL | `SET key value EX 3600` | Expires cached retrieval/generation artifacts after a time-to-live window. |
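The embedding-cache row can be illustrated with a small in-memory wrapper keyed by the SHA-256 of the input text. This is a sketch only: the class name `EmbeddingCache` is hypothetical, `embed_fn` stands in for whatever embedding call you use, and a production cache would normally live in an external store with eviction or a TTL.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Memoize embeddings by content hash so identical text is embedded once."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times the underlying embed_fn was called

    @staticmethod
    def key_for(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self.key_for(text)
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
            self.misses += 1
        return self._store[key]


if __name__ == "__main__":
    # Toy embedding function (assumption): a 1-dim vector holding the text length.
    cache = EmbeddingCache(lambda text: [float(len(text))])
    cache.get("hello")
    cache.get("hello")  # served from cache, embed_fn is not called again
    print(cache.misses)
```

The same content hash doubles as the change detector for incremental indexing: if a document's stored hash matches the new one, it can be skipped entirely.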
Table 12: RAG Evaluation Metrics and Frameworks
You can't improve what you can't measure, and RAG quality splits cleanly into two questions — did retrieval surface the right context, and did generation stay faithful to it. These metrics (faithfulness, context precision/recall, answer relevancy) pin down each side, and the frameworks below let you run them in CI against a golden dataset so a pipeline change can be graded before it ships.
| Metric | Example | Description |
|---|---|---|
| Faithfulness | `faithfulness ∈ [0,1]` | • Measures whether the answer's claims are supported by the retrieved context • the primary anti-hallucination metric. |
| Answer relevancy | `answer_relevancy ∈ [0,1]` | Measures how well the answer addresses the question. |
| Context precision | `context_precision ∈ [0,1]` | Measures whether retrieved contexts are useful for answering the question. |
| Context recall | `context_recall ∈ [0,1]` | Measures how much of the needed information is present in retrieved contexts. |
| Factual correctness | `factual_correctness ∈ [0,1]` | Measures whether the answer is factually correct against a reference. |
| Noise sensitivity | `noise_sensitivity ∈ [0,1]` | Measures robustness to irrelevant context: does the answer degrade when noisy chunks are included? |
| Context entities recall | `context_entities_recall ∈ [0,1]` | Measures entity recall over retrieved context vs reference. |
| Aspect critic | `aspect_critic` | LLM-judge style metric for assessing a specific aspect of the output (e.g., conciseness, harmlessness). |
| DeepEval | | • pytest-compatible LLM testing framework with 14+ metrics • designed for CI/CD quality gates. |
| Arize Phoenix | `px.launch_app(); tracer = register(project_name="rag")` | • Open-source AI observability platform • OpenTelemetry-based tracing with built-in RAG evaluators • self-hostable. |
| LangSmith | `client = langsmith.Client(); client.run_on_dataset(dataset_name="rag-eval", llm_or_chain=chain)` | • LangChain-native tracing and evaluation platform • deep visibility into chain execution steps and LLM calls. |
| Golden dataset | `[{"question": q, "ground_truth": a, "source_docs": []}]` | • Curated Q&A set with approved answers and source documents • used to benchmark pipeline changes before deployment. |
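For the retrieval side, a golden dataset with known source documents allows a cheap, deterministic check before any LLM-judged metric runs. The sketch below computes ID-based precision@k and recall@k. These are simplified proxies and not the LLM-based `context_precision` / `context_recall` definitions used by evaluation frameworks; the function name is hypothetical.

```python
def precision_recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> tuple[float, float]:
    """Return (precision@k, recall@k) for one query, comparing document ids."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


if __name__ == "__main__":
    retrieved = ["a", "b", "c", "d"]   # ids returned by the retriever, best first
    relevant = {"a", "c", "e"}         # ids listed as sources in the golden dataset
    print(precision_recall_at_k(retrieved, relevant, k=3))
```

Averaging these over the golden dataset gives a number that can gate a CI run: a chunking or embedding change that lowers recall@k is caught before generation quality is even measured.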
Table 13: Observability and Security
Once RAG is live you need to see inside it and defend it. The observability rows cover the OpenTelemetry GenAI spans and tracing tools that expose retrieval counts, token usage, and latency for debugging. The security rows map the OWASP LLM risks that hit RAG specifically — prompt injection riding in through retrieved context, data poisoning of the corpus, and sensitive-info leakage.
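One of the security rows in the table that follows, `render_html(llm_output)`, is about treating model output as untrusted. A minimal mitigation, sketched here with the standard library only, is to escape the text before placing it in HTML. The function name `render_untrusted` is hypothetical, and escaping covers only the HTML-injection case, not prompt injection or data poisoning.

```python
import html


def render_untrusted(llm_output: str) -> str:
    """Wrap model output in a paragraph after escaping HTML-significant characters."""
    return f"<p>{html.escape(llm_output)}</p>"


if __name__ == "__main__":
    # A poisoned chunk could steer the model into emitting markup like this.
    print(render_untrusted('<script>alert("x")</script>'))
```

The same principle applies to any downstream sink (SQL, shell, templates): model output that may have been influenced by retrieved content should go through the sink's normal escaping or parameterization path.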
| Signal | Example | Description |
|---|---|---|
| GenAI semantic conventions | `gen_ai.operation.name = "chat"` | Standardizes tracing semantics for GenAI operations (inference, retrieval, tools) via OpenTelemetry. |
| Retrieval span attributes | `gen_ai.retrieval.count = 10` | Captures retrieval metadata (chunk count, latency, scores) to debug relevance vs latency tradeoffs. |
| Token usage | `gen_ai.client.token.usage` | Records token usage to monitor cost and performance over time. |
| Langfuse | `langfuse = Langfuse(); trace = langfuse.trace(name="rag-query")` | Open-source LLM observability platform with trace-based debugging, evals, and a prompt management UI. |
| Prompt injection | `"Ignore previous instructions and ..."` | Attacker-controlled input attempts to override system/developer intent, including via retrieved context. |
| Sensitive information disclosure | `"Print your hidden prompt"` | Leakage of secrets, system prompts, or private data via model outputs. |
| Data poisoning | `"Upload malicious docs into the KB"` | Corrupts the retrieval corpus so the model is grounded in incorrect or malicious context. |
| Improper output handling | `render_html(llm_output)` | Treating model output as trusted can lead to downstream injection or execution. |
| Unbounded consumption | `"Summarize 10MB of text"` | Attacks that drive excessive compute/cost via large inputs or adversarial usage. |
Table 14: Multimodal RAG
A huge share of real knowledge lives in PDFs full of tables, charts, and scanned layouts that text extraction mangles. Multimodal RAG sidesteps that by treating document pages as images — ColPali-style retrievers embed page images directly and a vision-language model reads them, increasingly replacing the old OCR-then-embed pipeline for complex documents.
| Technique | Example | Description |
|---|---|---|
| ColPali | `model = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2"); model.index(pdf_folder)` | • VLM-based retriever that produces multi-vector embeddings directly from document page images via late interaction • no OCR needed. |
| Byaldi | `docs_model = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2"); results = docs_model.search(query, k=3)` | Python library wrapping ColPali with a familiar API for indexing PDFs and searching by visual content. |
| VLM answer generation | `vl_model.generate(images=retrieved_pages, text=query)` | Uses a vision-language model (e.g., Qwen2-VL, GPT-4V) to answer based on retrieved document page images. |
| OCR-then-embed pipeline | `text = ocr_engine.extract(page_image); embed(text)` | • Traditional text-extraction pipeline before embedding • increasingly replaced by ColPali-style retrieval for documents with complex layouts, tables, and figures. |