Embeddings Explained: Choosing the Right Model and Vector Database for Production

Most teams choose their embedding model and vector database in an afternoon.

They pick the model everyone on Twitter recommends, grab whichever vector database has the slickest landing page, and move on. Six months later they are debugging poor retrieval, paying surprising bills, and discovering that switching the model means re-embedding the entire corpus and switching the database means re-indexing it.

These two choices sit at the bottom of every Retrieval-Augmented Generation system. In a previous article on running RAG in production, I covered the failure modes that show up once real users arrive. This one goes a level deeper, into the foundation those failures grow from: the embedding model that decides what "similar" means, and the vector database that has to find similar things fast, at scale, under a budget.

Get these two right and most of RAG becomes tractable. Get them wrong and no amount of prompt engineering or reranking will save you.

What an Embedding Actually Is

An embedding is a list of numbers that represents the meaning of a piece of text.

A model reads a sentence, a paragraph, or a document and outputs a vector, typically a few hundred to a few thousand floating-point numbers. Texts with similar meaning land close together in that high-dimensional space. Texts about unrelated topics land far apart. "How do I reset my password" and "I forgot my login credentials" point in nearly the same direction even though they share almost no words.

That is the entire trick. Retrieval becomes a geometry problem. To find documents relevant to a query, you embed the query, then find the stored vectors nearest to it. Distance stands in for relevance.
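To make the geometry concrete, here is a minimal sketch using hand-made three-dimensional vectors in place of real embeddings. The vectors, the document ids, and the `nearest` helper are all invented for illustration; a real model would produce vectors with hundreds of dimensions, and a real system would not scan every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, stored, k=2):
    """Exact search: score every stored vector, return the k most similar ids."""
    ranked = sorted(
        stored.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy "embeddings" (assumption: made up by hand, not model output).
stored = {
    "forgot_login": [0.9, 0.1, 0.0],
    "reset_password": [0.8, 0.2, 0.1],
    "pizza_recipe": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # stands in for the embedded user question

top = nearest(query, stored, k=2)
print(top)  # the two account-related documents, not the recipe
```

The two account-related vectors point almost the same way as the query, so they win; the recipe vector is nearly orthogonal and loses.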

Two practical consequences follow immediately, and both are easy to underestimate:

The embedding model defines your notion of similarity. If the model does not understand your domain, two texts that a human expert would call related will land far apart, and retrieval fails before the database even runs.
The vector database makes that search fast. Comparing a query against millions of vectors one by one is too slow for production, so the database uses specialized indexes that trade a little accuracy for enormous speed.

Choosing well means understanding both halves. Let's take them in order.

Part 1: Choosing the Embedding Model

Why the MTEB Leaderboard Is a Starting Point, Not an Answer

The Massive Text Embedding Benchmark (MTEB) is the standard scoreboard for embedding models. It aggregates dozens of tasks across retrieval, classification, clustering, and reranking, and teams reach for the top of the list as if it were a ranking of "best to worst."

It is not. It is a ranking of "best to worst on these specific public datasets."

Your corpus is not in those datasets. A model that scores 68 on MTEB English and a model that scores 66 may rank in the opposite order on your medical records, your legal contracts, or your internal support tickets. The leaderboard tells you which models are competitive. It does not tell you which one wins on your data.

The only number that matters is the one you measure yourself. Take 50 to 100 representative queries, label the chunks that should be retrieved for each, and compute recall@K: how often the correct chunks appear in the top K results. A model that scores 0.85 recall@5 on your corpus beats a higher-ranked leaderboard model that scores 0.70, every single time. This is the same evaluation discipline that separates working RAG from demos, and it is cheap to run before you commit.
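The measurement itself is only a few lines. The sketch below assumes you already have, for each query, the ordered chunk ids your retriever returned and the chunk ids a human labeled as correct. The function names and the sample ids are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the labeled-relevant chunk ids found in the top-k results."""
    if not relevant:
        raise ValueError("each query needs at least one labeled relevant chunk")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

def mean_recall_at_k(results, labels, k):
    """Average recall@k across every labeled query."""
    scores = [recall_at_k(results[q], labels[q], k) for q in labels]
    return sum(scores) / len(scores)

# Hypothetical labels: query id -> chunk ids that should be retrieved.
labels = {"q1": ["c1", "c2"], "q2": ["c9"]}
# Hypothetical retriever output: query id -> ranked chunk ids.
results = {
    "q1": ["c1", "c7", "c2", "c4", "c5"],
    "q2": ["c3", "c4", "c5", "c6", "c8"],
}

score = mean_recall_at_k(results, labels, k=5)
print(score)  # q1 finds both chunks, q2 finds none, so the mean is 0.5
```

Run the same labeled set against each candidate model and compare the averages. With only 50 to 100 queries, small differences are noise, so look for gaps that persist across several values of K.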

Dimensions: The Number That Quietly Drives Your Bill

Every embedding has a dimension count, the length of the vector. OpenAI's text-embedding-3-large outputs up to 3072 numbers. Smaller models output 384 or 768.

Dimensions are not a quality dial you should max out. They are a direct multiplier on three production costs:

Storage. Ten million chunks at 3072 dimensions in 32-bit floats is roughly 120 GB before indexing overhead. At 768 dimensions it is 30 GB. Same corpus, four times the footprint.
Memory. Graph-based indexes like HNSW hold vectors in RAM for speed. More dimensions means more RAM, and RAM is the expensive resource in vector search.
Latency. Every distance computation scales with dimension count. Larger vectors mean slower comparisons, multiplied across every query.
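The storage figures above are plain arithmetic, and it is worth rerunning them with your own numbers. This sketch counts raw vector bytes only (decimal gigabytes) and ignores index overhead and metadata; the function name is made up for the example.

```python
def raw_vector_gb(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone, in decimal GB (no index overhead)."""
    return num_vectors * dimensions * bytes_per_value / 1e9

large = raw_vector_gb(10_000_000, 3072)  # 32-bit floats, 3072 dimensions
small = raw_vector_gb(10_000_000, 768)   # same corpus at 768 dimensions

print(large)          # 122.88
print(small)          # 30.72
print(large / small)  # 4.0
```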

Higher dimensions usually buy more accuracy, but with sharply diminishing returns. For most applications, 768 to 1536 dimensions is the sweet spot.

The interesting development in 2026 is Matryoshka representation learning. Models trained this way front-load the most important semantic information into the earliest dimensions, so you can truncate a 3072-dimension vector down to 512 or 256, re-normalize it, and keep most of the quality without re-embedding anything. The stored vectors and the query vector must be truncated to the same length. OpenAI's v3 models, Cohere's Embed v4, and several open-source families support this. It lets you store compact vectors for fast first-pass search and keep the option of higher fidelity later. If you are choosing a model today, Matryoshka support is a feature worth weighting.
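Truncating a Matryoshka vector is just slicing off the tail and rescaling what remains to unit length, so that cosine and dot-product scores stay comparable. A minimal sketch, assuming the model was actually trained with Matryoshka learning (truncating an ordinary embedding this way usually hurts quality much more); the helper name and sample values are illustrative.

```python
import math

def truncate_and_normalize(vector, dims):
    """Keep the first `dims` values, then rescale the result to unit length."""
    if dims > len(vector):
        raise ValueError("cannot truncate to more dimensions than the vector has")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero length")
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]            # toy 4-dimensional unit vector
short = truncate_and_normalize(full, 2)

print(len(short))                       # 2
print(math.sqrt(sum(x * x for x in short)))  # back to unit length (about 1.0)
```

Apply the same truncation length to stored vectors and to every query vector; mixing lengths produces meaningless distances.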

The Embedding Context Window Matters Too

Embedding models have input limits, just like generative LLMs. Feed a model more text than it can handle and, depending on the provider, the request either fails outright or the excess is silently truncated, so the tail of a long chunk never makes it into the vector.

Most embedding models handle 512 to 8192 tokens. Cohere's Embed v4 pushes this to 128,000 tokens, the largest among major commercial options. A larger window does not mean you should embed enormous chunks (chunking strategy still matters, as covered in the RAG article), but it removes a sharp edge where long passages get clipped without warning. If your documents contain long, self-contained sections, check the model's token limit before you trust your pipeline.
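A cheap guard is to count tokens for every chunk before it reaches the embedding call. The sketch below takes the counting function as a parameter, because the right tokenizer is whichever one your embedding model uses. The whitespace counter here is a crude stand-in that typically undercounts, and the 512 limit is only an example value.

```python
def find_oversized_chunks(chunks, max_tokens, count_tokens):
    """Return (index, token_count) for each chunk over the model's input limit."""
    oversized = []
    for index, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > max_tokens:
            oversized.append((index, n))
    return oversized

def whitespace_token_count(text):
    """Placeholder tokenizer (assumption): swap in the model's real tokenizer."""
    return len(text.split())

chunks = ["a short chunk", "word " * 600]
oversized = find_oversized_chunks(
    chunks, max_tokens=512, count_tokens=whitespace_token_count
)
print(oversized)  # [(1, 600)]
```

Failing loudly at ingestion time is usually preferable to discovering months later that the second half of every long section was never searchable.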

API or Self-Hosted, Closed or Open

There are two real axes here, and they are independent.

Hosted API models (OpenAI, Cohere, Voyage AI, Google) are the fastest path to production. No GPUs to provision, no model serving to operate, just an API call. You pay per token and you accept that your text leaves your infrastructure.

Self-hosted open models (the BGE family, Qwen3-Embedding, Jina, Nomic, E5) keep data in-house, cost nothing per call once the hardware is running, and can be fine-tuned on your domain. The price is operational: you run the inference, manage the GPUs, and own the uptime.

The deciding factors are usually data residency, volume, and domain specificity. If you are embedding billions of tokens on a recurring basis, per-token API charges can exceed the cost of running your own GPUs, so run the numbers for your volume. If you handle regulated data that cannot leave your environment, the choice is made for you. If your domain is highly specialized, fine-tuning an open model can lift recall further than any general-purpose API model will reach.
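The volume question comes down to a back-of-the-envelope comparison. In the sketch below, the API price is the approximate figure quoted in this article; the GPU throughput and hourly rate are placeholder inputs invented for the example, not measurements, and the calculation ignores engineering time, idle hardware, and redundancy. Substitute your own numbers.

```python
def api_cost_usd(tokens, price_per_million_usd):
    """Cost of embedding `tokens` through a per-token priced API."""
    return tokens / 1_000_000 * price_per_million_usd

def self_host_cost_usd(tokens, tokens_per_second, gpu_usd_per_hour):
    """GPU rental cost only: hours needed at a given throughput times hourly rate."""
    hours = tokens / tokens_per_second / 3600
    return hours * gpu_usd_per_hour

tokens = 5_000_000_000  # five billion tokens to embed

api = api_cost_usd(tokens, price_per_million_usd=0.13)
# Hypothetical throughput and rate: replace with your own benchmark and quote.
self_hosted = self_host_cost_usd(tokens, tokens_per_second=20_000, gpu_usd_per_hour=2.0)

print(round(api, 2))          # 650.0
print(round(self_hosted, 2))  # 138.89
```

The raw compute gap is only part of the picture. Someone has to build, monitor, and keep the serving stack running, which is why the crossover point tends to sit at higher volumes than the arithmetic alone suggests.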

Domain Specificity and Fine-Tuning

General-purpose models are trained on broad internet text. They are excellent generalists and mediocre specialists.

In a specialized domain, the semantic relationships differ from general English. "Acceleration" means one thing in physics and another in a loan agreement. A general model may place a query about financial acceleration clauses closer to documents about cars than to the relevant contract language. The fix is either a model trained for your domain (code models for codebases, biomedical models for clinical text) or fine-tuning an open model on labeled pairs from your corpus. Fine-tuning a strong open embedding model on a few thousand domain pairs routinely lifts recall by double-digit percentages, which is often the difference between a system users trust and one they abandon.

A 2026 Snapshot of Embedding Models

Prices and scores move quickly, so treat this as a map of the landscape rather than a final verdict. Always confirm current numbers and, more importantly, measure on your own data.

Model	Type	Dimensions	Approx. price / 1M tokens	Notable for
OpenAI text-embedding-3-large	API	up to 3072 (Matryoshka)	~$0.13	Safe, strong default
OpenAI text-embedding-3-small	API	up to 1536 (Matryoshka)	~$0.02	Budget queries
Cohere Embed v4	API	256 to 1536	~$0.12	128k context, multimodal
Voyage voyage-3-large / 4-large	API	1024+	~$0.12 to $0.18	Retrieval-optimized quality
Google Gemini Embedding	API	3072	~$0.006 and up	Top MTEB, multimodal
Qwen3-Embedding-8B	Open / self-host	up to 4096	Compute only	Best multilingual, long docs
Jina embeddings v3 / v4	Open / API	128 to 2048 (Matryoshka)	~$0.02	Price-performance
BGE family	Open / self-host	768 to 1024	Compute only	Strong, fine-tunable baseline

A reasonable default for an English-language product: start with OpenAI text-embedding-3-large or a strong open model like BGE, measure recall on your corpus, and only then optimize for cost or domain fit.

Part 2: Choosing the Vector Database

The embedding model decides what to search for. The vector database decides how fast you find it, how it scales, and how much it costs to keep running.

What a Vector Database Actually Does

At minimum, a vector database stores your embeddings and answers nearest-neighbor queries. Comparing a query against every stored vector (exact search) is accurate but linear in corpus size, which collapses under load. So production systems use Approximate Nearest Neighbor (ANN) search: indexes that find the closest vectors with high probability while skipping most of the comparisons.

That approximation is the whole game. The index you choose, and how you tune it, determines the trade-off between recall, latency, and memory. This is the decision that actually matters, and it is mostly hidden behind default settings.

The Index Is the Real Decision: HNSW vs IVF

Two index families dominate production.

HNSW (Hierarchical Navigable Small World) builds a layered graph of vectors and walks it to find neighbors. It is the default for most production workloads in 2026: high recall, very low query latency (often a few milliseconds or less on millions of vectors, depending on dimensions and tuning), and it handles dynamic data where vectors are constantly added. The cost is memory. HNSW keeps the graph in RAM, and that RAM bill grows with both vector count and dimensions.

IVF (Inverted File Index) partitions vectors into clusters, then searches only the clusters nearest the query. It uses far less memory and builds faster, which makes it the right call when the corpus is very large and mostly static, or when memory is the binding constraint. The trade-off is tuning: you balance the number of clusters and how many you probe per query, and getting high recall takes more care than with HNSW.

A useful rule of thumb: reach for HNSW when latency and recall matter most and you can pay for RAM. Reach for IVF when the dataset is huge, relatively static, and memory is tight.
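The IVF trade-off between clusters probed and recall is easier to see in code. This is a toy sketch, not how any particular engine implements it: the centroids are fixed by hand (real IVF indexes learn them, usually with k-means), distances are computed in pure Python, and the `ToyIVF` class and `nprobe` parameter name are chosen for illustration.

```python
import math

def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class ToyIVF:
    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = [[] for _ in centroids]  # one inverted list per cluster

    def _closest_clusters(self, vector, n):
        order = sorted(
            range(len(self.centroids)),
            key=lambda i: l2(vector, self.centroids[i]),
        )
        return order[:n]

    def add(self, doc_id, vector):
        """Store the vector in the list of its nearest centroid."""
        cluster = self._closest_clusters(vector, 1)[0]
        self.lists[cluster].append((doc_id, vector))

    def search(self, query, k, nprobe):
        """Scan only the `nprobe` clusters nearest the query."""
        candidates = []
        for cluster in self._closest_clusters(query, nprobe):
            candidates.extend(self.lists[cluster])
        candidates.sort(key=lambda item: l2(query, item[1]))
        return [doc_id for doc_id, _ in candidates[:k]]

index = ToyIVF(centroids=[[0.0, 0.0], [10.0, 10.0]])
for doc_id, vec in [("a", [1.0, 1.0]), ("b", [4.9, 4.9]),
                    ("c", [5.5, 5.5]), ("d", [9.0, 9.0])]:
    index.add(doc_id, vec)

query = [5.1, 5.1]  # true nearest neighbor is "b", just across the cluster border
fast = index.search(query, k=1, nprobe=1)
thorough = index.search(query, k=1, nprobe=2)
print(fast, thorough)  # ['c'] ['b']
```

Probing one cluster misses the true neighbor because it sits just across a partition boundary; probing both finds it at the cost of scanning everything. Tuning IVF is choosing where on that curve to sit.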

Quantization: How to Fit More Vectors in Less Memory

When vectors no longer fit comfortably in RAM, quantization compresses them by storing each number with less precision.

Scalar quantization (SQ8) stores each dimension in 1 byte instead of 4, a 4:1 compression that keeps the vector's geometry largely intact. It is the easy first step.
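Product quantization (IVF-PQ) compresses much more aggressively, typically cutting memory by 8 times or more. Recall suffers unless you re-rank the top candidates against full-precision vectors, which is how tuned setups hold 95%+ recall. A billion-vector dataset that needs around 4 TB with raw HNSW (about 4 KB per vector, for example 1024 dimensions in 32-bit floats) can drop to roughly 500 GB with IVF-PQ.
Binary quantization stores each dimension as a single bit, a 32:1 compression against 32-bit floats. It works best as a first-pass filter, with the survivors re-ranked against full-precision vectors.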

Newer approaches like RaBitQ pair IVF partitioning with randomized binary quantization to chase HNSW-like recall at IVF-like memory cost, and are moving from research into engines such as Milvus and into the pgvector ecosystem. The practical point: quantization is how production teams keep large indexes affordable. If your provider supports it, it is usually worth turning on once your corpus grows past a few million vectors.
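Scalar and binary quantization are simple enough to sketch directly. The version below is a simplified illustration rather than any engine's implementation: the SQ8 range (`lo`, `hi`) is passed in by hand, whereas real systems derive it from the data, and the binary encoder just keeps the sign of each dimension.

```python
def sq8_encode(vector, lo, hi):
    """Map each float in [lo, hi] to an integer 0..255: one byte per dimension."""
    scale = (hi - lo) / 255
    return bytes(min(255, max(0, round((x - lo) / scale))) for x in vector)

def sq8_decode(codes, lo, hi):
    """Approximate reconstruction of the original floats."""
    scale = (hi - lo) / 255
    return [lo + c * scale for c in codes]

def binary_encode(vector):
    """Keep only the sign of each dimension, packed into one integer."""
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits: the cheap distance used for first-pass filtering."""
    return bin(a ^ b).count("1")

vector = [0.12, -0.5, 0.33, 0.9]
codes = sq8_encode(vector, lo=-1.0, hi=1.0)
restored = sq8_decode(codes, lo=-1.0, hi=1.0)
max_error = max(abs(x - y) for x, y in zip(vector, restored))

print(len(codes))         # 4 bytes, versus 16 bytes as 32-bit floats
print(max_error < 0.004)  # True: error is bounded by half a quantization step

a = binary_encode(vector)                  # bits 1011
b = binary_encode([0.2, -0.1, -0.3, 0.8])  # bits 1001
print(hamming(a, b))      # 1
```

The reconstruction error from SQ8 is small relative to typical distances between vectors, which is why it is the low-risk first step. Binary codes lose far more information, hence the re-ranking pass.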

Metadata Filtering Is a First-Class Requirement

Real queries are rarely pure vector search. You want "the nearest chunks, but only from documents published this year, only in English, only from the product manual." That is metadata filtering, and databases differ enormously in how well they do it.

Naive filtering retrieves nearest neighbors first, then discards the ones that fail the filter, which can leave you with far fewer results than you asked for. Strong implementations filter and search together. Filtering performance is one of the clearest differentiators between vector databases, and it is easy to miss in a demo where every query is unfiltered. If your application segments content by date, type, tenant, or permission (and most production systems do), test filtering explicitly before you commit.
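The shortfall from naive filtering is easy to reproduce. The sketch below uses brute-force search over a handful of one-dimensional toy vectors; the documents and the `lang` field are invented for the example. Filtering before the search is used here as a simple stand-in for engines that apply the filter during index traversal, which is harder to implement and not shown.

```python
def l2_squared(a, b):
    """Squared Euclidean distance (enough for ranking)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def post_filter_search(query, docs, k, predicate):
    """Naive: take the k nearest overall, then drop those failing the filter."""
    nearest = sorted(docs, key=lambda d: l2_squared(query, d["vector"]))[:k]
    return [d["id"] for d in nearest if predicate(d)]

def pre_filter_search(query, docs, k, predicate):
    """Apply the filter first, then take the k nearest among the survivors."""
    allowed = [d for d in docs if predicate(d)]
    nearest = sorted(allowed, key=lambda d: l2_squared(query, d["vector"]))[:k]
    return [d["id"] for d in nearest]

docs = [
    {"id": "a", "vector": [0.0], "lang": "de"},
    {"id": "b", "vector": [0.1], "lang": "de"},
    {"id": "c", "vector": [0.2], "lang": "en"},
    {"id": "d", "vector": [0.5], "lang": "en"},
    {"id": "e", "vector": [0.9], "lang": "en"},
]

def english_only(doc):
    return doc["lang"] == "en"

naive = post_filter_search([0.0], docs, k=3, predicate=english_only)
filtered = pre_filter_search([0.0], docs, k=3, predicate=english_only)
print(naive)     # ['c']  (asked for 3, got 1)
print(filtered)  # ['c', 'd', 'e']
```

A realistic filter test uses your actual metadata distribution, including the selective filters (one tenant out of thousands) where post-filtering degrades most.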

Hybrid Search: Why Pure Vectors Are Not Enough

Dense vector search is excellent at meaning and poor at exact matches. Search for a specific error code, a product SKU, or a rare proper noun, and embeddings often underweight the literal token while surfacing semantically "related" but wrong results.

The 2026 default for serious retrieval is hybrid search: run keyword search (BM25, a sparse term-matching algorithm) and dense vector search in parallel, then fuse the results, commonly with Reciprocal Rank Fusion. BM25 catches the literal matches embeddings miss; dense vectors recover meaning when users paraphrase. Together they consistently beat either one alone. Native, well-integrated hybrid search is a major reason teams choose one database over another, so weight it accordingly if your queries mix natural language with specific identifiers.
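Reciprocal Rank Fusion needs only the ranked id lists from each retriever, not their raw scores, which is why it is a popular way to merge BM25 and dense results. A minimal sketch: each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is a commonly used default rather than a requirement, and the document ids are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["sku_123", "doc_a", "doc_b"]   # keyword search nails the SKU
dense_ranking = ["doc_a", "doc_c", "sku_123"]  # vector search prefers the paraphrase

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['doc_a', 'sku_123', 'doc_c', 'doc_b']
```

Documents that both retrievers rank well rise to the top, while a document only one retriever found still makes the list.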

A 2026 Snapshot of Vector Databases

Database	Deployment model	Strongest for	Watch out for
Pinecone	Fully managed	Easiest to operate, fast time to production	Cost at scale, less control
Qdrant	Open source (Rust)	Best filtering among open source, performance	You operate it (or pay for cloud)
Weaviate	Open source	Best built-in hybrid search, modules	More moving parts to learn
Milvus	Open source	Billion-vector scale, mature sharding	Operational complexity
pgvector	Postgres extension	Already on Postgres, vectors beside app data	Best under ~10M vectors
Chroma / FAISS	Open source / library	Prototyping and local development	Not aimed at large production scale

Match the Database to Your Scale

Scale is the cleanest way to narrow the field, because it eliminates options rather than relying on preference.

Corpus size	Realistic options
Under 10M vectors	Almost anything works. pgvector is ideal if you already run Postgres.
10M to 1B vectors	Pinecone (managed), or Qdrant / Weaviate / Milvus (self-hosted).
Over 1B vectors	Milvus or Vespa distributed deployments, with quantization.

If you are under 10 million vectors and already run Postgres, the honest answer is that you probably do not need a dedicated vector database yet. pgvector keeps your vectors next to your application data, gives you transactional consistency, and removes an entire system from your stack. Adopt a specialized database when you outgrow it, not before.

The Lock-In Nobody Warns You About

Here is the trap that connects both halves of this article.

Your embedding model and your vector database are both coupled to the data you have already indexed. Vectors from one model are meaningless to another, because each model has its own geometry. Switching embedding models means re-embedding your entire corpus, which costs money and time proportional to its size. Switching vector databases means re-indexing everything and rewriting your retrieval layer.

This is why the afternoon decision is dangerous. The cost of these choices is not paid up front. It is paid later, when you have ten million vectors and realize the model does not understand your domain, or the database cannot filter the way your product now needs.

Two ways to keep your options open:

Abstract your retrieval layer. Do not scatter database-specific calls across your codebase. Put retrieval behind an interface so swapping the backend touches one module, not fifty.
Keep your evaluation set permanent. The labeled query set you built to measure recall is the asset that lets you compare a new model or database against the current one in an afternoon, with numbers instead of vibes.
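The abstraction does not need to be elaborate. One possible shape, sketched below, is a small interface that application code depends on, with each backend as a separate class behind it. Every name here (`Hit`, `Retriever`, `InMemoryRetriever`, `answer_context`) is hypothetical, and the in-memory backend uses word overlap purely as a stand-in for a real vector search.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence

@dataclass
class Hit:
    chunk_id: str
    score: float

class Retriever(Protocol):
    """The only retrieval surface the rest of the application may call."""
    def search(self, query: str, k: int) -> Sequence[Hit]: ...

class InMemoryRetriever:
    """Stand-in backend; a database-backed class would expose the same method."""
    def __init__(self, chunks: dict[str, str]) -> None:
        self._chunks = chunks

    def search(self, query: str, k: int) -> list[Hit]:
        terms = set(query.lower().split())
        hits = [
            Hit(chunk_id, float(len(terms & set(text.lower().split()))))
            for chunk_id, text in self._chunks.items()
        ]
        hits.sort(key=lambda hit: hit.score, reverse=True)
        return hits[:k]

def answer_context(retriever: Retriever, question: str, k: int = 3) -> list[str]:
    """Application code: knows the interface, not the backend."""
    return [hit.chunk_id for hit in retriever.search(question, k)]

retriever = InMemoryRetriever({
    "c1": "reset your password here",
    "c2": "pizza dough recipe",
})
context = answer_context(retriever, "how do I reset my password", k=1)
print(context)  # ['c1']
```

Because the evaluation set calls the same interface, swapping the backend and rerunning recall@K becomes a single-module change.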

A Practical Decision Framework

Before you commit either component, work through this in order:

Embedding model:

[ ] Shortlisted 2 or 3 candidates from MTEB plus domain reputation
[ ] Measured recall@K on your own labeled query set, not the leaderboard
[ ] Chosen dimensions deliberately (768 to 1536 unless you have a reason)
[ ] Confirmed the input token limit fits your chunks
[ ] Decided API vs self-host based on volume, data residency, and domain fit
[ ] Considered fine-tuning if the domain is specialized

Vector database:

[ ] Sized the corpus and used scale to eliminate options
[ ] Chosen an index (HNSW for latency and recall, IVF for huge static data)
[ ] Planned quantization if the index will not fit in RAM affordably
[ ] Tested metadata filtering with realistic filters, not unfiltered demos
[ ] Verified hybrid search support if queries mix meaning and exact terms
[ ] Checked whether pgvector is enough before adopting a dedicated system

Both:

[ ] Retrieval is behind an abstraction layer
[ ] The evaluation set is saved and version-controlled for future comparisons

Frequently Asked Questions

Which embedding model should I use by default in 2026? For a general English product, start with OpenAI text-embedding-3-large or a strong open model such as BGE, then measure recall on your own data. The default is a starting point, not a final answer.

How many dimensions do I need? For most applications, 768 to 1536 dimensions balance quality, storage, and latency well. Maxing out dimensions rarely pays off, and Matryoshka models let you truncate later if needed.

Do I need a dedicated vector database? Not always. Under roughly 10 million vectors, pgvector inside your existing Postgres is often the simplest and most cost-effective choice. Adopt a specialized database when scale, filtering, or hybrid search demand it.

What is the difference between HNSW and IVF? HNSW is a graph index with very low latency and high recall but heavy memory use, ideal for dynamic data. IVF is a cluster-based index using less memory and faster builds, better for very large, mostly static corpora where memory is tight.

Why is hybrid search recommended? Dense vectors capture meaning but miss exact matches like error codes or SKUs. Combining BM25 keyword search with vector search and fusing the results consistently outperforms either approach alone.

The Right Mental Model

Embeddings and the vector database are not setup steps you complete once. They are the foundation the rest of your RAG system is built on, and like any foundation, they are expensive to replace after you have built on top.

The teams that get this right treat both as measured decisions. They benchmark embedding models on their own data, choose dimensions and indexes on purpose, test filtering and hybrid search the way production will actually use them, and keep their retrieval layer loosely coupled so the foundation can be replaced without demolishing the building.

The teams that struggle pick the popular model, accept the default index, and discover the cost of those defaults only when they are ten million vectors deep.

If retrieval is failing in your RAG system, the model and the database are the first places to look. Most of the time, the answer was decided here, long before the prompt ever ran.

This article completes the foundation of my RAG series. If you found it useful, the companion pieces on RAG in production, why tokens matter, and LLM context window limitations go deeper into the layers above it. I write regularly about applied AI engineering and the real lessons from production deployments. Find me on LinkedIn.