RAG Patterns That Actually Work
Every tutorial on RAG follows the same script: chunk your docs, embed them, throw them in a vector database, retrieve top-k, stuff them into a prompt. Congratulations, you've built the "hello world" of RAG. It works in demos and falls apart on real data.
I've built retrieval systems across three different products now. Here's what actually moves the needle once you're past the basics.
naive vector search fails silently
The insidious thing about bad retrieval is that it looks like it's working. The model still generates fluent answers. It just generates them from the wrong context. Users don't notice until they act on hallucinated information.
The core problem: embedding similarity doesn't equal semantic relevance. "How do I cancel my subscription?" and "What's your cancellation policy?" have high cosine similarity. But one is a user action and the other is a policy document. Return the wrong one and the LLM gives a confident, wrong answer.
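You can see this with any off-the-shelf embedding model. A quick check with sentence-transformers (the model choice here is just an example, not what any particular product uses) shows the two phrasings land close together in embedding space even though they call for different documents:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "How do I cancel my subscription?",   # a user action
    "What's your cancellation policy?",   # a policy document
])
# High cosine similarity, different intent
print(util.cos_sim(embeddings[0], embeddings[1]))
```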
hybrid retrieval is non-negotiable
Pure vector search misses exact matches. Pure keyword search misses semantic connections. You need both.
```python
from rank_bm25 import BM25Okapi

# Assumes `embed()` and `vector_db` are your embedding function and vector
# store, and that `reciprocal_rank_fusion()` is defined as sketched below.
def hybrid_retrieve(query: str, documents: list, k: int = 10):
    # Vector search: semantic similarity
    query_embedding = embed(query)
    vector_results = vector_db.search(query_embedding, top_k=k)

    # BM25: exact keyword matching
    tokenized_docs = [doc.split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    bm25_scores = bm25.get_scores(query.split())
    keyword_results = sorted(
        enumerate(bm25_scores), key=lambda x: x[1], reverse=True
    )[:k]

    # Combine the two ranked lists
    combined = reciprocal_rank_fusion(vector_results, keyword_results)
    return combined[:k]
```

Reciprocal rank fusion (RRF) is the simplest way to combine them. For each result, score = 1/(rank + 60). Sum across both retrieval methods. Sort by combined score. The constant 60 comes from the original paper; it works well in practice and I haven't found a reason to tune it.
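The snippet above leans on a `reciprocal_rank_fusion` helper it never defines. A minimal sketch, assuming both retrievers' results have already been normalized to ranked lists of document IDs:

```python
def reciprocal_rank_fusion(*ranked_lists, k: int = 60, top_n: int | None = None):
    """Fuse ranked lists of document IDs: score(doc) = sum over lists of 1 / (rank + k)."""
    scores: dict = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n] if top_n else fused
```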
chunking strategy matters more than embedding model
I've tested OpenAI's text-embedding-3-large, Cohere embed-v3, and local sentence-transformers. The quality differences are real but small — maybe 5-8% on retrieval benchmarks.
Chunking strategy differences? 20-40%.
What works:
Semantic chunking over fixed-size. Split on paragraph boundaries, headers, or topic shifts — not every 512 tokens. A chunk should be a self-contained idea.
Overlap is essential. 10-20% overlap between chunks prevents losing context at boundaries. A sentence that spans two chunks should appear in both.
Preserve hierarchy. If a chunk comes from "Section 3.2: Authentication," that metadata needs to travel with it. Prepend section headers to every chunk:
```python
def chunk_with_context(section_title: str, content: str, chunk_size: int = 400):
    chunks = split_by_paragraphs(content, max_tokens=chunk_size)
    return [f"## {section_title}\n\n{chunk}" for chunk in chunks]
```

This single trick, prepending the section header, improved retrieval accuracy by ~15% in my tests on a 2,000-page documentation corpus.
reranking is the highest-ROI addition
Retrieval gets you candidates. Reranking picks the best ones. A cross-encoder reranker looks at the query and each retrieved chunk together, which captures relevance that embedding similarity misses.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, documents: list, k: int = 5):
    # Cast a wide net with hybrid retrieval
    candidates = hybrid_retrieve(query, documents, k=20)

    # Rerank with the cross-encoder, scoring each (query, chunk) pair jointly
    pairs = [(query, doc.content) for doc in candidates]
    scores = reranker.predict(pairs)

    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:k]]
```

Retrieve 20, rerank to 5. The cross-encoder is slower than embedding comparison but it only runs on 20 documents, not your entire corpus. Latency cost is ~50ms. Quality improvement is substantial.
Cohere's Rerank API is the easiest hosted option if you don't want to run inference yourself. $1 per 1,000 searches. Worth it.
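If you go that route, the call looks roughly like this. This is a sketch against the Cohere Python SDK; the model name and exact response fields are assumptions that may vary by SDK version:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

def cohere_rerank(query: str, candidates: list[str], k: int = 5) -> list[str]:
    response = co.rerank(
        model="rerank-english-v3.0",  # assumed model name
        query=query,
        documents=candidates,
        top_n=k,
    )
    # Each result references the index of the original candidate plus a relevance score
    return [candidates[r.index] for r in response.results]
```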
query transformation catches the rest
Sometimes the user's query is just bad. "That thing with the webhook" isn't going to retrieve well no matter how good your pipeline is.
Three techniques that help:
- Query expansion — use the LLM to generate 2-3 alternative phrasings, retrieve for all of them, deduplicate
- Hypothetical document embedding (HyDE) — ask the LLM to write the ideal answer, embed that instead of the query
- Step-back prompting — ask a more general question first, use that context to refine retrieval
I use query expansion for most production systems. HyDE is clever but adds a full LLM call to the critical path, which usually isn't worth the latency hit.
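A minimal version of query expansion, assuming the OpenAI Python SDK and gpt-4o-mini as the expansion model (any small, fast model will do):

```python
from openai import OpenAI

client = OpenAI()

def expand_query(query: str, n: int = 3) -> list[str]:
    """Generate alternative phrasings of a query for multi-query retrieval."""
    prompt = (
        f"Rewrite the following search query in {n} different ways, "
        f"one per line, preserving its intent:\n\n{query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    lines = response.choices[0].message.content.splitlines()
    variants = [line.strip("-*0123456789. ").strip() for line in lines if line.strip()]
    return [query] + variants[:n]
```

Retrieve for each phrasing, merge the result lists with RRF, and deduplicate by document ID before reranking.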
the stack I'd recommend today
For a new RAG system in early 2026:
- Chunking: Semantic, with hierarchy preserved and 15% overlap
- Embedding: OpenAI text-embedding-3-small (good enough, cheap, fast)
- Vector DB: Pinecone or pgvector if you're already on Postgres
- Keyword search: BM25 via Elasticsearch or just in-memory for small corpora
- Fusion: Reciprocal rank fusion
- Reranking: Cohere Rerank or cross-encoder/ms-marco-MiniLM
- Query transformation: LLM-based query expansion
Skip the fancy stuff until this pipeline is solid. Most RAG failures come from bad chunking and missing reranking, not from using the wrong vector database.
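Wired together, the whole pipeline is short. Here's a sketch reusing the pieces defined above (`expand_query`, `hybrid_retrieve`, `reranker`), with documents treated as plain strings; the final `generate_answer` step is a hypothetical stand-in for your LLM call:

```python
def answer(query: str, documents: list[str], k: int = 5) -> str:
    # 1. Expand the query into a few alternative phrasings
    queries = expand_query(query)

    # 2. Hybrid retrieval (vector + BM25, fused with RRF) for each phrasing
    candidates = []
    for q in queries:
        candidates.extend(hybrid_retrieve(q, documents, k=20))
    candidates = list(dict.fromkeys(candidates))  # deduplicate, preserve order

    # 3. Rerank the merged pool with the cross-encoder and keep the top k
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    top = [doc for doc, _ in sorted(zip(candidates, scores),
                                    key=lambda x: x[1], reverse=True)[:k]]

    # 4. Hand the top chunks to the model as context
    return generate_answer(query, context=top)  # hypothetical generation step
```

Each step is independently swappable, which is most of the point: you can upgrade the embedding model, the reranker, or the expansion prompt without touching the rest.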