System Design: Search Engine
Search is one of those features where the difference between "works" and "works well" determines whether people use your product. Google's entire empire was built on making search 10% better than the competition. Elasticsearch powers search for GitHub, Wikipedia, and Netflix. Let's understand what's under the hood.
We're designing a general-purpose search engine for a platform with 1B documents, 10k search queries per second, and sub-100ms response times.
requirements
Functional:
- Full-text search across the 1B-document corpus
- Ranked results by relevance
- Autocomplete/typeahead suggestions
- Filters (date range, category, author)
- Fuzzy matching (handle typos)
- Highlighting matched terms in results
Non-functional:
- Query latency under 100ms (p95)
- Indexing delay under 30 seconds (near real-time)
- Handle 10k queries per second
- Horizontal scalability as document count grows
- High availability with no single point of failure
the inverted index
This is the core data structure that makes search fast. Instead of scanning every document for a keyword, you build an index that maps each word to the list of documents containing it.
Forward Index (slow for search):
doc_1 → ["the", "quick", "brown", "fox"]
doc_2 → ["the", "lazy", "brown", "dog"]
doc_3 → ["quick", "fox", "jumps"]
Inverted Index (fast for search):
"the" → [doc_1, doc_2]
"quick" → [doc_1, doc_3]
"brown" → [doc_1, doc_2]
"fox" → [doc_1, doc_3]
"lazy" → [doc_2]
"dog" → [doc_2]
"jumps" → [doc_3]
A search for "quick fox" intersects the posting lists for "quick" and "fox": [doc_1, doc_3] ∩ [doc_1, doc_3] = [doc_1, doc_3]. Two documents found, zero full scans.
Each posting list entry also stores position information (for phrase queries) and term frequency (for ranking). The inverted index typically uses 10-20% of the original document size in storage.
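Here's a minimal Python sketch of both halves: building the posting lists (with positions, so term frequency and phrase queries fall out for free) and intersecting them at query time. The toy corpus is the one from above.

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, dict[int, list[int]]]:
    """Map each term to {doc_id: [positions]}: posting lists, term frequency
    (the length of the position list), and phrase-query support in one pass."""
    index: dict[str, dict[int, list[int]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search(index: dict[str, dict[int, list[int]]], query: str) -> set[int]:
    """AND semantics: intersect the posting lists of every query term."""
    postings = [set(index.get(term, {})) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick fox jumps",
}
index = build_inverted_index(docs)
print(search(index, "quick fox"))   # {1, 3}
```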
high-level architecture
┌──────────┐     ┌────────────┐     ┌────────────────────┐
│  Client  │────▶│   Query    │────▶│   Search Cluster   │
└──────────┘     │  Service   │     │                    │
                 │  - parse   │     │ ┌── Shard 1 ─────┐ │
                 │  - rewrite │     │ │ Inverted Index │ │
                 │  - route   │     │ └────────────────┘ │
                 └──────┬─────┘     │ ┌── Shard 2 ─────┐ │
                        │           │ │ Inverted Index │ │
                 ┌──────▼─────┐     │ └────────────────┘ │
                 │ Aggregator │     │ ┌── Shard 3 ─────┐ │
                 │ (merge +   │◀────│ │ Inverted Index │ │
                 │  rank)     │     │ └────────────────┘ │
                 └────────────┘     └─────────┬──────────┘
                                              │
                                      ┌───────▼──────┐
                                      │   Indexing   │
                                      │   Pipeline   │
                                      │  (tokenize,  │
                                      │   analyze,   │
                                      │   build idx) │
                                      └──────────────┘
Two separate paths: the query path (user searches, system returns results) and the indexing path (new documents are analyzed and added to the index).
query processing pipeline
A search query goes through multiple transformation stages before hitting the index.
User types: "runing sheos near me"
│
▼
┌──────────────┐
│ Tokenizer │ → ["runing", "sheos", "near", "me"]
└──────┬───────┘
▼
┌──────────────┐
│ Spell Check │ → ["running", "shoes", "near", "me"]
└──────┬───────┘
▼
┌──────────────┐
│ Stop Word │ → ["running", "shoes"]
│ Removal │ (remove "near", "me" for keyword search)
└──────┬───────┘
▼
┌──────────────┐
│ Stemming │ → ["run", "shoe"]
└──────┬───────┘
▼
┌──────────────┐
│ Synonym │ → ["run", "shoe", "sneaker", "trainer"]
│ Expansion │
└──────┬───────┘
▼
Query the inverted index for all terms
Tokenization splits text into searchable units. For English, split on whitespace and punctuation. For CJK languages (Chinese, Japanese, Korean), you need dictionary-based tokenization since there are no spaces between words.
Stemming reduces words to their root form: "running" → "run", "shoes" → "shoe". This dramatically improves recall. Porter Stemmer is the classic algorithm. Lemmatization is more accurate but slower.
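A toy version of the analysis chain, assuming spell correction has already run; the stop-word list, suffix-stripping "stemmer", and synonym map are illustrative stand-ins for real analyzers like Porter/Snowball.

```python
import re

STOP_WORDS = {"the", "a", "an", "near", "me", "of", "in", "to"}  # illustrative subset
SYNONYMS = {"shoe": ["sneaker", "trainer"]}                       # illustrative mapping

def tokenize(text: str) -> list[str]:
    # Split on whitespace/punctuation; real tokenizers also handle CJK, emoji, etc.
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(term: str) -> str:
    # Toy suffix stripping; production systems use Porter/Snowball stemmers.
    for suffix in ("ning", "ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def analyze(query: str) -> list[str]:
    terms = [t for t in tokenize(query) if t not in STOP_WORDS]
    terms = [stem(t) for t in terms]
    expanded = list(terms)
    for t in terms:
        expanded += SYNONYMS.get(t, [])
    return expanded

print(analyze("running shoes near me"))   # ['run', 'shoe', 'sneaker', 'trainer']
```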
ranking: BM25
Once you have candidate documents, rank them. BM25 (Best Matching 25) is the industry standard for text relevance scoring. It's what Elasticsearch uses by default.
BM25 Score Formula (simplified):
                     IDF(term) × TF(term, d) × (k1 + 1)
  score(q, d) = Σ ──────────────────────────────────────────
                  TF(term, d) + k1 × (1 - b + b × |d|/avgdl)
Where:
  IDF(term)   = log((N - n + 0.5) / (n + 0.5))
  TF(term, d) = frequency of term in document d
  N           = total documents
  n           = documents containing the term
  |d|         = document length
  avgdl       = average document length
  k1 ≈ 1.2    (term frequency saturation)
  b  ≈ 0.75   (document length normalization)
The intuition: a term is more important if it's rare across all documents (high IDF) and frequent in this particular document (high TF). BM25 adds length normalization (long documents don't get unfair advantage) and term frequency saturation (the 10th occurrence of a word doesn't help much more than the 5th).
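The formula translates almost line-for-line into code. This sketch assumes the per-document term frequencies and per-term document frequencies have already been read from the inverted index; the numbers in the example call are made up.

```python
import math

def bm25_score(query_terms: list[str],
               doc_tf: dict[str, int],   # term -> occurrences of the term in this document
               doc_len: int,             # |d|
               df: dict[str, int],       # term -> number of documents containing it (n)
               n_docs: int,              # N
               avgdl: float,
               k1: float = 1.2,
               b: float = 0.75) -> float:
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        n = df.get(term, 0)
        # Some engines (Lucene among them) add 1 inside the log so IDF never goes negative.
        idf = math.log((n_docs - n + 0.5) / (n + 0.5))
        norm = tf + k1 * (1 - b + b * doc_len / avgdl)
        score += idf * tf * (k1 + 1) / norm
    return score

# Hypothetical numbers: one document scored for the analyzed query ["run", "shoe"].
print(bm25_score(["run", "shoe"],
                 doc_tf={"run": 3, "shoe": 1}, doc_len=120,
                 df={"run": 5_000_000, "shoe": 800_000},
                 n_docs=1_000_000_000, avgdl=250.0))
```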
For modern search, BM25 is often combined with ML re-ranking. Phase 1: BM25 retrieves the top 1,000 candidates. Phase 2: a neural ranker (BERT-based) re-scores the top 100 for semantic relevance.
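A sketch of that two-phase pattern, where lexical_score and neural_score are hypothetical hooks for BM25 and the neural ranker.

```python
def two_phase_rank(query, candidates, lexical_score, neural_score,
                   retrieve_k=1000, rerank_k=100):
    """candidates: doc ids already matched by the inverted index.
    lexical_score / neural_score: (query, doc_id) -> float scoring hooks."""
    # Phase 1: cheap lexical (BM25-style) scoring over every candidate.
    top = sorted(candidates, key=lambda d: lexical_score(query, d),
                 reverse=True)[:retrieve_k]
    # Phase 2: expensive neural re-scoring of only the head of the list.
    head, tail = top[:rerank_k], top[rerank_k:]
    head.sort(key=lambda d: neural_score(query, d), reverse=True)
    return head + tail
```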
sharding and replication
A billion documents don't fit in one machine's index. Shard the index across multiple nodes.
Sharding Strategy:
  Document arrives → hash(doc_id) mod N → Shard assignment
  Search query → broadcast to ALL shards → each returns top K
               → Aggregator merges N × K results → return top K
  Shard 1 (docs 0-333M):     Primary → Replica 1 → Replica 2
  Shard 2 (docs 333M-666M):  Primary → Replica 1 → Replica 2
  Shard 3 (docs 666M-1B):    Primary → Replica 1 → Replica 2
Each shard maintains its own inverted index. A query hits all shards in parallel, each returns its top K results, and the aggregator merges them. This is a scatter-gather pattern.
Replication for availability: each shard has 2-3 replicas. Queries are load-balanced across replicas. If a node dies, the remaining replicas handle traffic while a replacement rebuilds.
Shard sizing: Elasticsearch recommends 10-50GB per shard. At 1B documents averaging 1KB each, that's roughly 1TB of index data, split across 20-100 shards. In practice, 30 primary shards with 2 replicas each means 90 shard copies spread across a few dozen data nodes. That's a typical production cluster for this scale.
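A sketch of both sides of the sharding story: routing on the indexing path and scatter-gather on the query path. The shard objects and their search(query, k) method are hypothetical; the point is the fan-out and the merge.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def route(doc_id: str, num_shards: int) -> int:
    # Indexing side: a document's shard is a pure function of its id.
    # Python's built-in hash() is not stable across processes; a real system
    # would use a stable hash such as murmur3 here.
    return hash(doc_id) % num_shards

def scatter_gather(shards, query: str, k: int = 10):
    """Query side: fan the query out to every shard in parallel, then merge
    the per-shard top-K lists of (score, doc_id) pairs into a global top-K."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda shard: shard.search(query, k), shards))
    all_hits = (hit for hits in per_shard for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])
```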
autocomplete
Typeahead search has stricter latency requirements than full search — results must update as the user types, so you need sub-50ms responses.
Autocomplete Architecture:
  User types: "syst"
         │
         ▼
  ┌──────────────┐        ┌──────────────┐
  │  Trie Index  │ ────── │    Redis/    │
  │  (prefix →   │        │  In-memory   │
  │ completions) │        │    store     │
  └──────────────┘        └──────────────┘
Trie structure:
          s
         / \
        y   e
        |    \
        s     a
        |      \
        t       r
        |        \
        e         c
        |          \
        m           h
        |
  [system, system design, systems programming]
Implementation: A trie (prefix tree) maps partial queries to the top 10 completions. Store the trie in Redis or in-memory on each API server. Update the trie hourly based on query popularity.
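A sketch of that trie, with the top completions stored directly on each node so a lookup is just a walk down the prefix; the phrases and popularity counts here are made up.

```python
class TrieNode:
    __slots__ = ("children", "top")
    def __init__(self):
        self.children: dict[str, "TrieNode"] = {}
        self.top: list[tuple[int, str]] = []   # precomputed (popularity, completion)

class AutocompleteTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, phrase: str, popularity: int, limit: int = 10):
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
            # Keep each node's completion list small and sorted by popularity.
            node.top.append((popularity, phrase))
            node.top.sort(reverse=True)
            del node.top[limit:]

    def complete(self, prefix: str) -> list[str]:
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [phrase for _, phrase in node.top]

trie = AutocompleteTrie()
for phrase, count in [("system design", 9800), ("system", 7200),
                      ("systems programming", 3100), ("search engine", 4500)]:
    trie.insert(phrase, count)
print(trie.complete("syst"))   # ['system design', 'system', 'systems programming']
```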
Personalization: Weight completions by the user's past searches. "sys" for a DevOps engineer should suggest "systemd" before "system of a down."
near real-time indexing
When a new document is created, it should be searchable within 30 seconds. Elasticsearch achieves this with a "refresh" cycle.
Indexing Pipeline:
  New document ──▶ Kafka ──▶ Index Worker
                                  │
                       ┌──────────▼──────────┐
                       │ 1. Tokenize         │
                       │ 2. Analyze (stem,   │
                       │    lowercase, etc.) │
                       │ 3. Build index      │
                       │    segment (in-mem) │
                       │ 4. Flush to disk    │
                       │    every 1 second   │
                       │ 5. Merge segments   │
                       │    periodically     │
                       └─────────────────────┘
New documents go into an in-memory buffer. Every second (the refresh interval, configurable), the buffer is written out as a new immutable index segment, at which point those documents become searchable. Periodically, small segments are merged into larger ones (like LSM-tree compaction) so the number of segments a query has to touch stays bounded and performance stays stable.
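A toy model of that refresh cycle, reusing build_inverted_index and search from the inverted-index sketch above; real engines also handle deletes, durability (translog/fsync), and merge policies.

```python
import time

class NearRealTimeIndex:
    def __init__(self, refresh_interval: float = 1.0):
        self.buffer: dict[int, str] = {}    # doc_id -> text, not yet searchable
        self.segments: list[dict] = []      # immutable inverted-index segments
        self.refresh_interval = refresh_interval
        self._last_refresh = time.monotonic()

    def add(self, doc_id: int, text: str):
        self.buffer[doc_id] = text
        if time.monotonic() - self._last_refresh >= self.refresh_interval:
            self.refresh()

    def refresh(self):
        """Freeze the in-memory buffer into a new immutable segment.
        Only after this point are the buffered documents searchable."""
        if self.buffer:
            self.segments.append(build_inverted_index(self.buffer))
            self.buffer = {}
        self._last_refresh = time.monotonic()

    def search(self, query: str) -> set[int]:
        # Query every segment and union the results; a background job
        # periodically merges small segments to bound this fan-out.
        hits: set[int] = set()
        for segment in self.segments:
            hits |= search(segment, query)
        return hits
```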
key tradeoffs
Precision vs. recall: Stemming and synonym expansion increase recall (find more relevant results) but decrease precision (more irrelevant results sneak in). Tune aggressively for the use case. E-commerce search needs high precision (show exactly what they're looking for). Research search needs high recall (don't miss anything relevant).
Indexing speed vs. query speed: More analysis during indexing (synonyms, NER, embedding generation) makes queries faster and more accurate but slows down the indexing pipeline. For near real-time requirements, do heavy analysis asynchronously and update the index in a second pass.
Storage vs. freshness: Keeping all historical versions of documents enables time-travel queries but multiplies storage costs. Most systems keep only the latest version in the search index and archive old versions to cold storage.
Relevance vs. latency: A BERT re-ranker produces dramatically better results than BM25 alone. But it adds 50-100ms of latency. For autocomplete (where speed is everything), stick with statistical methods. For page-1 search results (where relevance is everything), add the neural re-ranker.
what Elasticsearch gets right
Elasticsearch's architecture is essentially what we've designed here: Lucene-based inverted indexes, sharded across a cluster, with near real-time refresh cycles. What makes it production-ready is the operational layer — automatic shard rebalancing, rolling upgrades, index lifecycle management, and a query DSL that handles everything from simple keyword search to complex aggregations.
If you're building search for a product, use Elasticsearch (or OpenSearch). If you're in an interview, understand the primitives underneath.