System Design: Search Engine
Search is one of those features where the difference between "works" and "works well" determines whether people use your product. Google's entire empire was built on making search 10% better than the competition. Elasticsearch powers search for GitHub, Wikipedia, and Netflix. Let's understand what's under the hood.
We're designing a general-purpose search engine for a platform with 1B documents, 10k search queries per second, and sub-100ms response times.
requirements
Functional:
- Full-text search across the 1B-document corpus
- Ranked results by relevance
- Autocomplete/typeahead suggestions
- Filters (date range, category, author)
- Fuzzy matching (handle typos)
- Highlighting matched terms in results
Non-functional:
- Query latency under 100ms (p95)
- Indexing delay under 30 seconds (near real-time)
- Handle 10k queries per second
- Horizontal scalability as document count grows
- High availability with no single point of failure
the inverted index
This is the core data structure that makes search fast. Instead of scanning every document for a keyword, you build an index that maps each word to the list of documents containing it.
Forward Index (slow for search):
doc_1 → ["the", "quick", "brown", "fox"]
doc_2 → ["the", "lazy", "brown", "dog"]
doc_3 → ["quick", "fox", "jumps"]
Inverted Index (fast for search):
"the" → [doc_1, doc_2]
"quick" → [doc_1, doc_3]
"brown" → [doc_1, doc_2]
"fox" → [doc_1, doc_3]
"lazy" → [doc_2]
"dog" → [doc_2]
"jumps" → [doc_3]
A search for "quick fox" intersects the posting lists for "quick" and "fox": [doc_1, doc_3] ∩ [doc_1, doc_3] = [doc_1, doc_3]. Two documents found, zero full scans.
Each posting list entry also stores position information (for phrase queries) and term frequency (for ranking). The inverted index typically uses 10-20% of the original document size in storage.
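Here's a minimal Python sketch of both halves: building the posting lists (with positions, so term frequency and phrase queries fall out for free) and intersecting them at query time. The toy corpus is the one from above.

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, dict[int, list[int]]]:
    """Map each term to {doc_id: [positions]}: posting lists, term frequency
    (the length of the position list), and phrase-query support in one pass."""
    index: dict[str, dict[int, list[int]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search(index: dict[str, dict[int, list[int]]], query: str) -> set[int]:
    """AND semantics: intersect the posting lists of every query term."""
    postings = [set(index.get(term, {})) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick fox jumps",
}
index = build_inverted_index(docs)
print(search(index, "quick fox"))   # {1, 3}
```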
high-level architecture
┌──────────┐     ┌────────────┐     ┌────────────────────┐
│  Client  │────▶│   Query    │────▶│   Search Cluster   │
└──────────┘     │  Service   │     │                    │
                 │  - parse   │     │ ┌── Shard 1 ─────┐ │
                 │  - rewrite │     │ │ Inverted Index │ │
                 │  - route   │     │ └────────────────┘ │
                 └──────┬─────┘     │ ┌── Shard 2 ─────┐ │
                        │           │ │ Inverted Index │ │
                 ┌──────▼─────┐     │ └────────────────┘ │
                 │ Aggregator │     │ ┌── Shard 3 ─────┐ │
                 │ (merge +   │◀────│ │ Inverted Index │ │
                 │  rank)     │     │ └────────────────┘ │
                 └────────────┘     └─────────┬──────────┘
                                              │
                                      ┌───────▼──────┐
                                      │   Indexing   │
                                      │   Pipeline   │
                                      │  (tokenize,  │
                                      │   analyze,   │
                                      │   build idx) │
                                      └──────────────┘
Two separate paths: the query path (user searches, system returns results) and the indexing path (new documents are analyzed and added to the index).
query processing pipeline
A search query goes through multiple transformation stages before hitting the index.
User types: "runing sheos near me"
│
▼
┌──────────────┐
│ Tokenizer │ → ["runing", "sheos", "near", "me"]
└──────┬───────┘
▼
┌──────────────┐
│ Spell Check │ → ["running", "shoes", "near", "me"]
└──────┬───────┘
▼
┌──────────────┐
│ Stop Word │ → ["running", "shoes"]
│ Removal │ (remove "near", "me" for keyword search)
└──────┬───────┘
▼
┌──────────────┐
│ Stemming │ → ["run", "shoe"]
└──────┬───────┘
▼
┌──────────────┐
│ Synonym │ → ["run", "shoe", "sneaker", "trainer"]
│ Expansion │
└──────┬───────┘
▼
Query the inverted index for all terms
Tokenization splits text into searchable units. For English, split on whitespace and punctuation. For CJK languages (Chinese, Japanese, Korean), you need dictionary-based tokenization since there are no spaces between words.
Stemming reduces words to their root form: "running" → "run", "shoes" → "shoe". This dramatically improves recall. Porter Stemmer is the classic algorithm. Lemmatization is more accurate but slower.
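A toy version of the analysis chain, assuming spell correction has already run; the stop-word list, suffix-stripping "stemmer", and synonym map are illustrative stand-ins for real analyzers like Porter/Snowball.

```python
import re

STOP_WORDS = {"the", "a", "an", "near", "me", "of", "in", "to"}  # illustrative subset
SYNONYMS = {"shoe": ["sneaker", "trainer"]}                       # illustrative mapping

def tokenize(text: str) -> list[str]:
    # Split on whitespace/punctuation; real tokenizers also handle CJK, emoji, etc.
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(term: str) -> str:
    # Toy suffix stripping; production systems use Porter/Snowball stemmers.
    for suffix in ("ning", "ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def analyze(query: str) -> list[str]:
    terms = [t for t in tokenize(query) if t not in STOP_WORDS]
    terms = [stem(t) for t in terms]
    expanded = list(terms)
    for t in terms:
        expanded += SYNONYMS.get(t, [])
    return expanded

print(analyze("running shoes near me"))   # ['run', 'shoe', 'sneaker', 'trainer']
```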
ranking: BM25
Once you have candidate documents, rank them. BM25 (Best Matching 25) is the industry standard for text relevance scoring. It's what Elasticsearch uses by default.
BM25 Score Formula (simplified):
                     IDF(term) × TF(term, d) × (k1 + 1)
  score(q, d) = Σ ──────────────────────────────────────────
                  TF(term, d) + k1 × (1 - b + b × |d|/avgdl)
Where:
  IDF(term)   = log((N - n + 0.5) / (n + 0.5))
  TF(term, d) = frequency of term in document d
  N           = total documents
  n           = documents containing the term
  |d|         = document length
  avgdl       = average document length
  k1 ≈ 1.2    (term frequency saturation)
  b  ≈ 0.75   (document length normalization)
The intuition: a term is more important if it's rare across all documents (high IDF) and frequent in this particular document (high TF). BM25 adds length normalization (long documents don't get unfair advantage) and term frequency saturation (the 10th occurrence of a word doesn't help much more than the 5th).
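The formula translates almost line-for-line into code. This sketch assumes the per-document term frequencies and per-term document frequencies have already been read from the inverted index; the numbers in the example call are made up.

```python
import math

def bm25_score(query_terms: list[str],
               doc_tf: dict[str, int],   # term -> occurrences of the term in this document
               doc_len: int,             # |d|
               df: dict[str, int],       # term -> number of documents containing it (n)
               n_docs: int,              # N
               avgdl: float,
               k1: float = 1.2,
               b: float = 0.75) -> float:
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        n = df.get(term, 0)
        # Some engines (Lucene among them) add 1 inside the log so IDF never goes negative.
        idf = math.log((n_docs - n + 0.5) / (n + 0.5))
        norm = tf + k1 * (1 - b + b * doc_len / avgdl)
        score += idf * tf * (k1 + 1) / norm
    return score

# Hypothetical numbers: one document scored for the analyzed query ["run", "shoe"].
print(bm25_score(["run", "shoe"],
                 doc_tf={"run": 3, "shoe": 1}, doc_len=120,
                 df={"run": 5_000_000, "shoe": 800_000},
                 n_docs=1_000_000_000, avgdl=250.0))
```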
For modern search, BM25 is often combined with ML re-ranking. Phase 1: BM25 retrieves the top 1,000 candidates. Phase 2: a neural ranker (BERT-based) re-scores the top 100 for semantic relevance.
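A sketch of that two-phase pattern, where lexical_score and neural_score are hypothetical hooks for BM25 and the neural ranker.

```python
def two_phase_rank(query, candidates, lexical_score, neural_score,
                   retrieve_k=1000, rerank_k=100):
    """candidates: doc ids already matched by the inverted index.
    lexical_score / neural_score: (query, doc_id) -> float scoring hooks."""
    # Phase 1: cheap lexical (BM25-style) scoring over every candidate.
    top = sorted(candidates, key=lambda d: lexical_score(query, d),
                 reverse=True)[:retrieve_k]
    # Phase 2: expensive neural re-scoring of only the head of the list.
    head, tail = top[:rerank_k], top[rerank_k:]
    head.sort(key=lambda d: neural_score(query, d), reverse=True)
    return head + tail
```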
sharding and replication
A billion documents don't fit in one machine's index. Shard the index across multiple nodes.
Sharding Strategy:
  Document arrives → hash(doc_id) mod N → Shard assignment
  Search query → broadcast to ALL shards → each returns top K
               → Aggregator merges N × K results → return top K
  Shard 1 (docs 0-333M):     Primary → Replica 1 → Replica 2
  Shard 2 (docs 333M-666M):  Primary → Replica 1 → Replica 2
  Shard 3 (docs 666M-1B):    Primary → Replica 1 → Replica 2
Each shard maintains its own inverted index. A query hits all shards in parallel, each returns its top K results, and the aggregator merges them. This is a scatter-gather pattern.
Replication for availability: each shard has 2-3 replicas. Queries are load-balanced across replicas. If a node dies, the remaining replicas handle traffic while a replacement rebuilds.
Shard sizing: Elasticsearch recommends 10-50GB per shard. At 1B documents averaging 1KB each, that's roughly 1TB of index data, split across 20-100 shards. In practice, 30 primary shards with 2 replicas each means 90 shard copies spread across a few dozen data nodes. That's a typical production cluster for this scale.
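A sketch of both sides of the sharding story: routing on the indexing path and scatter-gather on the query path. The shard objects and their search(query, k) method are hypothetical; the point is the fan-out and the merge.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def route(doc_id: str, num_shards: int) -> int:
    # Indexing side: a document's shard is a pure function of its id.
    # Python's built-in hash() is not stable across processes; a real system
    # would use a stable hash such as murmur3 here.
    return hash(doc_id) % num_shards

def scatter_gather(shards, query: str, k: int = 10):
    """Query side: fan the query out to every shard in parallel, then merge
    the per-shard top-K lists of (score, doc_id) pairs into a global top-K."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda shard: shard.search(query, k), shards))
    all_hits = (hit for hits in per_shard for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])
```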
autocomplete
Typeahead search has stricter latency requirements than full search — results must update as the user types, so you need sub-50ms responses.
Autocomplete Architecture:
  User types: "syst"
         │
         ▼
  ┌──────────────┐        ┌──────────────┐
  │  Trie Index  │ ────── │    Redis/    │
  │  (prefix →   │        │  In-memory   │
  │ completions) │        │    store     │
  └──────────────┘        └──────────────┘
Trie structure:
          s
         / \
        y   e
        |    \
        s     a
        |      \
        t       r
        |        \
        e         c
        |          \
        m           h
        |
  [system, system design, systems programming]
Implementation: A trie (prefix tree) maps partial queries to the top 10 completions. Store the trie in Redis or in-memory on each API server. Update the trie hourly based on query popularity.
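A sketch of that trie, with the top completions stored directly on each node so a lookup is just a walk down the prefix; the phrases and popularity counts here are made up.

```python
class TrieNode:
    __slots__ = ("children", "top")
    def __init__(self):
        self.children: dict[str, "TrieNode"] = {}
        self.top: list[tuple[int, str]] = []   # precomputed (popularity, completion)

class AutocompleteTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, phrase: str, popularity: int, limit: int = 10):
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
            # Keep each node's completion list small and sorted by popularity.
            node.top.append((popularity, phrase))
            node.top.sort(reverse=True)
            del node.top[limit:]

    def complete(self, prefix: str) -> list[str]:
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [phrase for _, phrase in node.top]

trie = AutocompleteTrie()
for phrase, count in [("system design", 9800), ("system", 7200),
                      ("systems programming", 3100), ("search engine", 4500)]:
    trie.insert(phrase, count)
print(trie.complete("syst"))   # ['system design', 'system', 'systems programming']
```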
Personalization: Weight completions by the user's past searches. "sys" for a DevOps engineer should suggest "systemd" before "system of a down."
near real-time indexing
When a new document is created, it should be searchable within 30 seconds. Elasticsearch achieves this with a "refresh" cycle.
Indexing Pipeline:
  New document ──▶ Kafka ──▶ Index Worker
                                  │
                       ┌──────────▼──────────┐
                       │ 1. Tokenize         │
                       │ 2. Analyze (stem,   │
                       │    lowercase, etc.) │
                       │ 3. Build index      │
                       │    segment (in-mem) │
                       │ 4. Flush to disk    │
                       │    every 1 second   │
                       │ 5. Merge segments   │
                       │    periodically     │
                       └─────────────────────┘
New documents go into an in-memory buffer. Every second (the refresh interval, configurable), the buffer is written out as a new immutable index segment, at which point those documents become searchable. Periodically, small segments are merged into larger ones (like LSM-tree compaction) so the number of segments a query has to touch stays bounded and performance stays stable.
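A toy model of that refresh cycle, reusing build_inverted_index and search from the inverted-index sketch above; real engines also handle deletes, durability (translog/fsync), and merge policies.

```python
import time

class NearRealTimeIndex:
    def __init__(self, refresh_interval: float = 1.0):
        self.buffer: dict[int, str] = {}    # doc_id -> text, not yet searchable
        self.segments: list[dict] = []      # immutable inverted-index segments
        self.refresh_interval = refresh_interval
        self._last_refresh = time.monotonic()

    def add(self, doc_id: int, text: str):
        self.buffer[doc_id] = text
        if time.monotonic() - self._last_refresh >= self.refresh_interval:
            self.refresh()

    def refresh(self):
        """Freeze the in-memory buffer into a new immutable segment.
        Only after this point are the buffered documents searchable."""
        if self.buffer:
            self.segments.append(build_inverted_index(self.buffer))
            self.buffer = {}
        self._last_refresh = time.monotonic()

    def search(self, query: str) -> set[int]:
        # Query every segment and union the results; a background job
        # periodically merges small segments to bound this fan-out.
        hits: set[int] = set()
        for segment in self.segments:
            hits |= search(segment, query)
        return hits
```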
key tradeoffs
Precision vs. recall: Stemming and synonym expansion increase recall (find more relevant results) but decrease precision (more irrelevant results sneak in). Tune aggressively for the use case. E-commerce search needs high precision (show exactly what they're looking for). Research search needs high recall (don't miss anything relevant).
Indexing speed vs. query speed: More analysis during indexing (synonyms, NER, embedding generation) makes queries faster and more accurate but slows down the indexing pipeline. For near real-time requirements, do heavy analysis asynchronously and update the index in a second pass.
Storage vs. freshness: Keeping all historical versions of documents enables time-travel queries but multiplies storage costs. Most systems keep only the latest version in the search index and archive old versions to cold storage.
Relevance vs. latency: A BERT re-ranker produces dramatically better results than BM25 alone. But it adds 50-100ms of latency. For autocomplete (where speed is everything), stick with statistical methods. For page-1 search results (where relevance is everything), add the neural re-ranker.
what Elasticsearch gets right
Elasticsearch's architecture is essentially what we've designed here: Lucene-based inverted indexes, sharded across a cluster, with near real-time refresh cycles. What makes it production-ready is the operational layer — automatic shard rebalancing, rolling upgrades, index lifecycle management, and a query DSL that handles everything from simple keyword search to complex aggregations.
If you're building search for a product, use Elasticsearch (or OpenSearch). If you're in an interview, understand the primitives underneath.