RAG Patterns That Actually Work
Every tutorial on RAG follows the same script: chunk your docs, embed them, throw them in a vector database, retrieve top-k, stuff them into a prompt. Congratulations, you've built the "hello world" of RAG. It works in demos and falls apart on real data.
I've built retrieval systems across three different products now. Here's what actually moves the needle once you're past the basics.
naive vector search fails silently
The insidious thing about bad retrieval is that it looks like it's working. The model still generates fluent answers. It just generates them from the wrong context. Users don't notice until they act on hallucinated information.
The core problem: embedding similarity doesn't equal semantic relevance. "How do I cancel my subscription?" and "What's your cancellation policy?" have high cosine similarity. But one is a user action and the other is a policy document. Return the wrong one and the LLM gives a confident, wrong answer.
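You can see this with any off-the-shelf embedding model. A quick check with sentence-transformers (the model choice here is just an example, not what any particular product uses) shows the two phrasings land close together in embedding space even though they call for different documents:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "How do I cancel my subscription?",   # a user action
    "What's your cancellation policy?",   # a policy document
])
# High cosine similarity, different intent
print(util.cos_sim(embeddings[0], embeddings[1]))
```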
hybrid retrieval is non-negotiable
Pure vector search misses exact matches. Pure keyword search misses semantic connections. You need both.
```python
from rank_bm25 import BM25Okapi

# Assumes `embed()` and `vector_db` are your embedding function and vector
# store, and that `reciprocal_rank_fusion()` is defined as sketched below.
def hybrid_retrieve(query: str, documents: list, k: int = 10):
    # Vector search: semantic similarity
    query_embedding = embed(query)
    vector_results = vector_db.search(query_embedding, top_k=k)

    # BM25: exact keyword matching
    tokenized_docs = [doc.split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    bm25_scores = bm25.get_scores(query.split())
    keyword_results = sorted(
        enumerate(bm25_scores), key=lambda x: x[1], reverse=True
    )[:k]

    # Combine the two ranked lists
    combined = reciprocal_rank_fusion(vector_results, keyword_results)
    return combined[:k]
```

Reciprocal rank fusion (RRF) is the simplest way to combine them. For each result, score = 1/(rank + 60). Sum across both retrieval methods. Sort by combined score. The constant 60 comes from the original paper; it works well in practice and I haven't found a reason to tune it.
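The snippet above leans on a `reciprocal_rank_fusion` helper it never defines. A minimal sketch, assuming both retrievers' results have already been normalized to ranked lists of document IDs:

```python
def reciprocal_rank_fusion(*ranked_lists, k: int = 60, top_n: int | None = None):
    """Fuse ranked lists of document IDs: score(doc) = sum over lists of 1 / (rank + k)."""
    scores: dict = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n] if top_n else fused
```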
chunking strategy matters more than embedding model
I've tested OpenAI's text-embedding-3-large, Cohere embed-v3, and local sentence-transformers. The quality differences are real but small — maybe 5-8% on retrieval benchmarks.
Chunking strategy differences? 20-40%.
What works:
Semantic chunking over fixed-size. Split on paragraph boundaries, headers, or topic shifts — not every 512 tokens. A chunk should be a self-contained idea.
Overlap is essential. 10-20% overlap between chunks prevents losing context at boundaries. A sentence that spans two chunks should appear in both.
Preserve hierarchy. If a chunk comes from "Section 3.2: Authentication," that metadata needs to travel with it. Prepend section headers to every chunk:
```python
def chunk_with_context(section_title: str, content: str, chunk_size: int = 400):
    chunks = split_by_paragraphs(content, max_tokens=chunk_size)
    return [f"## {section_title}\n\n{chunk}" for chunk in chunks]
```

This single trick, prepending the section header, improved retrieval accuracy by ~15% in my tests on a 2,000-page documentation corpus.
reranking is the highest-ROI addition
Retrieval gets you candidates. Reranking picks the best ones. A cross-encoder reranker looks at the query and each retrieved chunk together, which captures relevance that embedding similarity misses.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, documents: list, k: int = 5):
    # Cast a wide net with hybrid retrieval
    candidates = hybrid_retrieve(query, documents, k=20)

    # Rerank with the cross-encoder, scoring each (query, chunk) pair jointly
    pairs = [(query, doc.content) for doc in candidates]
    scores = reranker.predict(pairs)

    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:k]]
```

Retrieve 20, rerank to 5. The cross-encoder is slower than embedding comparison but it only runs on 20 documents, not your entire corpus. Latency cost is ~50ms. Quality improvement is substantial.
Cohere's Rerank API is the easiest hosted option if you don't want to run inference yourself. $1 per 1,000 searches. Worth it.
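If you go that route, the call looks roughly like this. This is a sketch against the Cohere Python SDK; the model name and exact response fields are assumptions that may vary by SDK version:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

def cohere_rerank(query: str, candidates: list[str], k: int = 5) -> list[str]:
    response = co.rerank(
        model="rerank-english-v3.0",  # assumed model name
        query=query,
        documents=candidates,
        top_n=k,
    )
    # Each result references the index of the original candidate plus a relevance score
    return [candidates[r.index] for r in response.results]
```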
query transformation catches the rest
Sometimes the user's query is just bad. "That thing with the webhook" isn't going to retrieve well no matter how good your pipeline is.
Three techniques that help:
- Query expansion — use the LLM to generate 2-3 alternative phrasings, retrieve for all of them, deduplicate
- Hypothetical document embedding (HyDE) — ask the LLM to write the ideal answer, embed that instead of the query
- Step-back prompting — ask a more general question first, use that context to refine retrieval
I use query expansion for most production systems. HyDE is clever but adds a full LLM call to the critical path, which usually isn't worth the latency hit.
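A minimal version of query expansion, assuming the OpenAI Python SDK and gpt-4o-mini as the expansion model (any small, fast model will do):

```python
from openai import OpenAI

client = OpenAI()

def expand_query(query: str, n: int = 3) -> list[str]:
    """Generate alternative phrasings of a query for multi-query retrieval."""
    prompt = (
        f"Rewrite the following search query in {n} different ways, "
        f"one per line, preserving its intent:\n\n{query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    lines = response.choices[0].message.content.splitlines()
    variants = [line.strip("-*0123456789. ").strip() for line in lines if line.strip()]
    return [query] + variants[:n]
```

Retrieve for each phrasing, merge the result lists with RRF, and deduplicate by document ID before reranking.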
the stack I'd recommend today
For a new RAG system in early 2026:
- Chunking: Semantic, with hierarchy preserved and 15% overlap
- Embedding: OpenAI text-embedding-3-small (good enough, cheap, fast)
- Vector DB: Pinecone or pgvector if you're already on Postgres
- Keyword search: BM25 via Elasticsearch or just in-memory for small corpora
- Fusion: Reciprocal rank fusion
- Reranking: Cohere Rerank or cross-encoder/ms-marco-MiniLM
- Query transformation: LLM-based query expansion
Skip the fancy stuff until this pipeline is solid. Most RAG failures come from bad chunking and missing reranking, not from using the wrong vector database.
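Wired together, the whole pipeline is short. Here's a sketch reusing the pieces defined above (`expand_query`, `hybrid_retrieve`, `reranker`), with documents treated as plain strings; the final `generate_answer` step is a hypothetical stand-in for your LLM call:

```python
def answer(query: str, documents: list[str], k: int = 5) -> str:
    # 1. Expand the query into a few alternative phrasings
    queries = expand_query(query)

    # 2. Hybrid retrieval (vector + BM25, fused with RRF) for each phrasing
    candidates = []
    for q in queries:
        candidates.extend(hybrid_retrieve(q, documents, k=20))
    candidates = list(dict.fromkeys(candidates))  # deduplicate, preserve order

    # 3. Rerank the merged pool with the cross-encoder and keep the top k
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    top = [doc for doc, _ in sorted(zip(candidates, scores),
                                    key=lambda x: x[1], reverse=True)[:k]]

    # 4. Hand the top chunks to the model as context
    return generate_answer(query, context=top)  # hypothetical generation step
```

Each step is independently swappable, which is most of the point: you can upgrade the embedding model, the reranker, or the expansion prompt without touching the rest.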