Embedding Models Compared: Cost, Quality, Latency
Choosing an embedding model shouldn't take a week. But I've seen teams agonize over this decision, running elaborate benchmarks on curated datasets that don't match their production data. Here's what actually matters and what I'd pick today.
the models I've tested
I've run these across three production systems — a documentation search, a support ticket classifier, and a candidate assessment matching engine:
| Model | Dimensions | Cost (per 1M tokens) | Latency (1K docs) | MTEB Avg |
|-------|------------|----------------------|-------------------|----------|
| OpenAI text-embedding-3-large | 3072 | $0.13 | ~8s | 64.6 |
| OpenAI text-embedding-3-small | 1536 | $0.02 | ~5s | 62.3 |
| OpenAI text-embedding-ada-002 | 1536 | $0.10 | ~6s | 61.0 |
| Cohere embed-v3.0 | 1024 | $0.10 | ~7s | 64.5 |
| Voyage AI voyage-2 | 1024 | $0.12 | ~9s | 63.8 |
| nomic-embed-text (local) | 768 | $0 | ~12s* | 61.5 |
| BGE-large-en-v1.5 (local) | 1024 | $0 | ~15s* | 63.0 |
*Local latency on M3 Max, batch of 1,000 documents. GPU would be faster.
the cost analysis most people skip
Embedding costs aren't just per-token pricing. Factor in:
Initial corpus embedding. A 100,000-document knowledge base at ~500 tokens per doc = 50M tokens. At $0.13/M tokens (text-embedding-3-large), that's $6.50 one-time. At $0.02/M (3-small), it's $1. Neither is significant.
Ongoing query embedding. 10,000 queries/day at ~50 tokens each = 500K tokens/day = 15M tokens/month. text-embedding-3-large: $1.95/month. text-embedding-3-small: $0.30/month. Still not significant.
Vector storage. This is where dimensions matter. 3072-dimension vectors use 3x the storage of 1024-dimension vectors. At 1M documents:
3072 dims × 4 bytes × 1M docs = 12.3 GB
1536 dims × 4 bytes × 1M docs = 6.1 GB
1024 dims × 4 bytes × 1M docs = 4.1 GB
768 dims × 4 bytes × 1M docs = 3.1 GB
Pinecone charges by storage. At $0.33/GB/month for the standard plan, the difference between 3072 and 1024 dimensions is ~$2.70/month per million vectors. Still small, but it adds up with scale.
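If you want to re-run the storage math at your own scale, here's a minimal sketch. It assumes float32 vectors (4 bytes per dimension) and the $0.33/GB/month rate cited above; swap in your own vector count and pricing.

```python
# Storage footprint and monthly cost per million float32 vectors,
# priced at the $0.33/GB/month figure cited above (adjust for your plan).
def vector_storage(dims: int, n_vectors: int = 1_000_000,
                   price_per_gb_month: float = 0.33) -> tuple[float, float]:
    gb = dims * 4 * n_vectors / 1e9
    return gb, gb * price_per_gb_month

for dims in (3072, 1536, 1024, 768):
    gb, cost = vector_storage(dims)
    print(f"{dims:>4} dims: {gb:4.1f} GB  ~${cost:.2f}/month")
```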
The real cost driver is search latency, not embedding cost. Higher-dimension vectors = slower nearest-neighbor search = higher compute costs in production. This matters more than the embedding API price.
quality differences are smaller than you think
Here's what surprised me: on my actual production data, the quality gap between models was much smaller than MTEB benchmarks suggest.
I ran retrieval accuracy tests on 500 hand-labeled queries across my documentation search:
| Model | Top-1 Accuracy | Top-5 Accuracy | Top-10 Accuracy |
|-------|----------------|----------------|-----------------|
| text-embedding-3-large | 71.2% | 89.6% | 94.2% |
| text-embedding-3-small | 68.8% | 87.4% | 93.0% |
| Cohere embed-v3.0 | 70.4% | 88.8% | 93.8% |
| BGE-large-en-v1.5 | 67.6% | 86.2% | 92.4% |
The spread from best to worst is 3.6 percentage points on top-1 and 1.8 points on top-10. By the time you retrieve 10 candidates and rerank, the embedding model choice barely matters.
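The evaluation loop itself is simple. Here's a minimal sketch of how top-k accuracy gets computed over hand-labeled (query, expected document) pairs — `search` is a stand-in for whatever retrieval pipeline you're testing, not a specific library call:

```python
def top_k_accuracy(labeled_queries, search, ks=(1, 5, 10)):
    """labeled_queries: list of (query, expected_doc_id) pairs.
    search(query, k): your retrieval pipeline, returning ranked doc IDs."""
    hits = {k: 0 for k in ks}
    for query, expected_id in labeled_queries:
        results = search(query, k=max(ks))
        for k in ks:
            if expected_id in results[:k]:
                hits[k] += 1
    return {k: hits[k] / len(labeled_queries) for k in ks}
```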
what actually moves the needle
If the embedding model contributes a ~3% quality difference, what contributes the other 97%?
Chunking strategy: ~20% impact. Semantic chunking with hierarchy preservation vs fixed-size chunks. This is the single biggest quality lever.
Reranking: ~15% impact. A cross-encoder reranker on top of any embedding model improves accuracy more than switching from the worst to the best embedding model.
Query preprocessing: ~10% impact. Expanding queries, handling typos, normalizing terminology.
Metadata filtering: ~10% impact. Filtering by document type, recency, or category before vector search narrows the search space and improves relevance.
Spending a week optimizing your embedding model while using fixed-size chunking and no reranker is optimizing the wrong thing.
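Reranking in particular is cheap to try. Here's a minimal sketch using a cross-encoder from sentence-transformers on top of whatever candidates your embedding model retrieved — the model name is a common default, not a specific recommendation:

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, candidate) pair jointly; slower than
# bi-encoder retrieval, but much better at ordering the final shortlist.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```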
my recommendations
Default pick: OpenAI text-embedding-3-small. $0.02/M tokens, 1536 dimensions, good quality. It's cheap enough that cost is irrelevant and good enough that quality isn't a concern. Use dimension reduction to 512 or 768 if storage matters.
```python
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="your text here",
    dimensions=768,  # reduce from 1536 to save storage
)
```

When quality is critical: Cohere embed-v3.0. Slightly better than OpenAI on retrieval benchmarks, with search/document-type optimization built in. The input_type parameter matters:
```python
import cohere

co = cohere.Client(api_key="...")

# Different embeddings for documents vs queries
doc_embeddings = co.embed(
    texts=documents,
    model="embed-english-v3.0",
    input_type="search_document",
).embeddings

query_embedding = co.embed(
    texts=[query],
    model="embed-english-v3.0",
    input_type="search_query",
).embeddings
```

When data can't leave your network: nomic-embed-text or BGE-large. Run them locally via sentence-transformers. You take a ~3% quality hit vs cloud models, but get zero data exposure and zero per-token cost.
When to use text-embedding-3-large: Almost never. The quality improvement over 3-small doesn't justify the 6.5x cost and 2x storage. If you need that last 2%, invest in better chunking and reranking instead.
The embedding model is a commodity. Invest your time in the retrieval pipeline around it.