Vector Databases and Embeddings: A Practical Guide
Master vector databases and embeddings for AI applications. Compare pgvector, Pinecone, Qdrant, and Chroma with practical implementation examples and production setup guides.
This is Part 4 of the AI Automation Engineer Roadmap series. This post builds on concepts from Part 3: Building RAG Pipelines.
TL;DR
Vector databases store and search high-dimensional embeddings at scale, forming the backbone of RAG systems, semantic search, and AI-powered recommendations. This post covers how embeddings represent meaning, the math behind similarity search, and practical comparisons of pgvector, Pinecone, Qdrant, and Chroma -- with production setup guides for each.
Why This Matters
In Part 3, we built a RAG pipeline using pgvector. But vector storage is a deep topic on its own. Choosing the wrong database, index type, or embedding model can mean the difference between sub-100ms queries and multi-second timeouts, between accurate retrieval and missing the most relevant document. As your dataset grows from thousands to millions of vectors, these decisions become critical. This post gives you the knowledge to make them confidently.
Core Concepts
How Embeddings Represent Meaning
An embedding is a list of numbers (a vector) that captures the semantic meaning of text. The key property: similar meanings produce similar vectors.
When you embed "How do I reset my password?" and "I forgot my login credentials," the resulting vectors will be close together in the embedding space, even though the sentences share almost no words. This is what makes semantic search fundamentally different from keyword search.
import OpenAI from "openai";
const openai = new OpenAI();
async function demonstrateEmbeddings() {
const texts = [
"How do I reset my password?",
"I forgot my login credentials",
"What is the weather today?",
];
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
});
const embeddings = response.data.map((d) => d.embedding);
// Similar meaning = high cosine similarity
console.log(
"Password vs Credentials:",
cosineSimilarity(embeddings[0], embeddings[1]).toFixed(4)
);
// ~0.85 -- very similar
console.log(
"Password vs Weather:",
cosineSimilarity(embeddings[0], embeddings[2]).toFixed(4)
);
// ~0.25 -- very different
}
Cosine Similarity: The Math (Simplified)
Cosine similarity measures the angle between two vectors, ignoring magnitude. It ranges from -1 (opposite) to 1 (identical).
function cosineSimilarity(a: number[], b: number[]): number {
if (a.length !== b.length) throw new Error("Vectors must be same length");
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
// Dot product is faster when vectors are already normalized (unit length).
// OpenAI's text-embedding-3-* models return normalized vectors,
// so dot product = cosine similarity for those models.
function dotProduct(a: number[], b: number[]): number {
let result = 0;
for (let i = 0; i < a.length; i++) {
result += a[i] * b[i];
}
return result;
}
When to use which metric:
- Cosine similarity: Default choice. Works regardless of vector normalization.
- Dot product: Faster. Use when vectors are normalized (OpenAI embeddings are).
- Euclidean distance (L2): Less common for text. Better for spatial/geometric data.
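If your embedding model does not return unit-length vectors, normalizing them once at write time lets you use the faster dot product everywhere afterward. A minimal sketch (`normalize` is a hypothetical helper name, not a library function):

```typescript
// Scale a vector to unit length so that dot product equals cosine similarity.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) throw new Error("Cannot normalize a zero vector");
  return v.map((x) => x / norm);
}

// After normalization: dotProduct(normalize(a), normalize(b)) equals
// cosineSimilarity(a, b), with one fewer sqrt per comparison at query time.
```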
Dimensionality and Its Impact
Embedding dimensionality is the length of the vector. Higher dimensions capture more nuance but cost more to store and search:
- text-embedding-3-small: 1536 dimensions (good default)
- text-embedding-3-large: 3072 dimensions (higher accuracy)
- Cohere Embed v3: 1024 dimensions
- BGE-small: 384 dimensions (open-source, self-hosted)
OpenAI's v3 models support Matryoshka embeddings -- you can truncate vectors to fewer dimensions with graceful quality degradation:
// Generate a lower-dimensional embedding to save storage
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: "Your text here",
dimensions: 512, // Truncated from 1536 -- still good quality, 66% less storage
});
Hands-On Implementation
pgvector: When Postgres Is Enough
If you already run PostgreSQL, pgvector is the pragmatic choice. No new infrastructure, no new billing, no new ops burden.
import { Pool } from "pg";
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
async function setupPgvector() {
await pool.query("CREATE EXTENSION IF NOT EXISTS vector");
await pool.query(`
CREATE TABLE IF NOT EXISTS embeddings (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding vector(1536) NOT NULL,
metadata JSONB DEFAULT '{}',
collection VARCHAR(255) NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
)
`);
// IVFFlat index: faster build, good for < 1M vectors
// A common starting point: lists = rows / 1000 (or sqrt(rows) beyond ~1M rows)
await pool.query(`
CREATE INDEX IF NOT EXISTS embeddings_ivfflat_idx
ON embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100)
`);
// HNSW index: slower build, better recall, good for any scale
// m = max connections per node, ef_construction = build-time search width
await pool.query(`
CREATE INDEX IF NOT EXISTS embeddings_hnsw_idx
ON embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64)
`);
}
// Search with metadata filtering
async function searchPgvector(
queryEmbedding: number[],
collection: string,
topK: number = 5,
metadataFilter?: Record<string, unknown>
) {
let whereClause = "WHERE collection = $2";
const params: unknown[] = [`[${queryEmbedding.join(",")}]`, collection];
if (metadataFilter) {
whereClause += ` AND metadata @> $${params.length + 1}::jsonb`;
params.push(JSON.stringify(metadataFilter));
}
const result = await pool.query(
`SELECT content, metadata,
1 - (embedding <=> $1::vector) AS similarity
FROM embeddings
${whereClause}
ORDER BY embedding <=> $1::vector
LIMIT $${params.length + 1}`,
[...params, topK]
);
return result.rows;
}
IVFFlat vs HNSW:
- IVFFlat: Divides vectors into clusters (lists). Searches only nearby clusters. Faster to build, lower memory, but requires tuning lists and probes.
- HNSW: Builds a hierarchical graph of vectors. Better recall out of the box, works well at any scale, but uses more memory and takes longer to build.
Use HNSW unless you have a specific reason not to.
Pinecone: Managed and Scalable
Pinecone is a fully managed vector database. You trade control for zero-ops:
import { Pinecone } from "@pinecone-database/pinecone";
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
async function setupPinecone() {
// Create an index (one-time setup, usually done via dashboard)
await pinecone.createIndex({
name: "knowledge-base",
dimension: 1536,
metric: "cosine",
spec: {
serverless: {
cloud: "aws",
region: "us-east-1",
},
},
});
}
async function upsertToPinecone(
vectors: Array<{
id: string;
values: number[];
metadata: Record<string, string | number | boolean>;
}>
) {
const index = pinecone.index("knowledge-base");
// Upsert in batches; around 100 vectors per request is a safe size
const batchSize = 100;
for (let i = 0; i < vectors.length; i += batchSize) {
const batch = vectors.slice(i, i + batchSize);
await index.upsert(batch);
}
}
async function searchPinecone(
queryEmbedding: number[],
topK: number = 5,
filter?: Record<string, unknown>
) {
const index = pinecone.index("knowledge-base");
const results = await index.query({
vector: queryEmbedding,
topK,
filter,
includeMetadata: true,
});
return results.matches?.map((match) => ({
id: match.id,
score: match.score,
metadata: match.metadata,
}));
}
Qdrant: Self-Hosted Power
Qdrant runs as a Docker container and gives you Pinecone-level features with full control:
import { QdrantClient } from "@qdrant/js-client-rest";
const qdrant = new QdrantClient({ url: "http://localhost:6333" });
async function setupQdrant() {
await qdrant.createCollection("knowledge-base", {
vectors: {
size: 1536,
distance: "Cosine",
},
optimizers_config: {
default_segment_number: 2,
},
// Enable on-disk storage for large datasets
on_disk_payload: true,
});
// Create payload index for metadata filtering
await qdrant.createPayloadIndex("knowledge-base", {
field_name: "category",
field_schema: "keyword",
});
}
async function searchQdrant(
queryEmbedding: number[],
topK: number = 5,
category?: string
) {
const results = await qdrant.search("knowledge-base", {
vector: queryEmbedding,
limit: topK,
filter: category
? {
must: [{ key: "category", match: { value: category } }],
}
: undefined,
with_payload: true,
});
return results.map((result) => ({
id: result.id,
score: result.score,
payload: result.payload,
}));
}
Chroma: Lightweight and Local
Chroma is perfect for prototyping and small-scale applications:
import { ChromaClient, OpenAIEmbeddingFunction } from "chromadb";
const chroma = new ChromaClient();
const embedder = new OpenAIEmbeddingFunction({
openai_api_key: process.env.OPENAI_API_KEY!,
openai_model: "text-embedding-3-small",
});
async function setupChroma() {
const collection = await chroma.getOrCreateCollection({
name: "knowledge-base",
embeddingFunction: embedder,
metadata: { "hnsw:space": "cosine" },
});
return collection;
}
async function addToChroma(
documents: string[],
ids: string[],
metadata: Array<Record<string, string>>
) {
const collection = await setupChroma();
// Chroma handles embedding generation automatically
await collection.add({
documents,
ids,
metadatas: metadata,
});
}
async function searchChroma(query: string, topK: number = 5) {
const collection = await setupChroma();
// Chroma embeds the query for you
const results = await collection.query({
queryTexts: [query],
nResults: topK,
});
return results;
}
Batch Operations for Production Ingestion
When ingesting thousands of documents, batch operations are critical:
async function batchIngest(
documents: Array<{ content: string; metadata: Record<string, unknown> }>,
batchSize: number = 50
) {
const openai = new OpenAI();
let totalProcessed = 0;
for (let i = 0; i < documents.length; i += batchSize) {
const batch = documents.slice(i, i + batchSize);
// Generate embeddings for the batch
const embeddingResponse = await openai.embeddings.create({
model: "text-embedding-3-small",
input: batch.map((d) => d.content),
});
// Build insert data
const chunks = batch.map((doc, idx) => ({
content: doc.content,
embedding: embeddingResponse.data[idx].embedding,
metadata: doc.metadata,
}));
// Insert into your vector store (using pgvector example)
await insertChunks(chunks);
totalProcessed += batch.length;
console.log(`Processed ${totalProcessed}/${documents.length} documents`);
}
}
Best Practices
- Start with pgvector -- If you run PostgreSQL, add the extension before evaluating dedicated vector databases. It handles millions of vectors with proper indexing.
- Use HNSW indexes by default -- Better recall than IVFFlat with less tuning. The memory overhead is worth it.
- Normalize your vectors -- If your embedding model does not return normalized vectors, normalize them before storage. This lets you use faster dot product distance.
- Index metadata fields -- If you filter by category, tenant, or date alongside vector search, create indexes on those metadata fields.
- Monitor recall, not just latency -- A fast query that returns irrelevant results is worse than a slightly slower one that returns the right documents.
Common Pitfalls
- Mixing embedding models in one collection: If you embed some documents with text-embedding-3-small and others with Cohere, the vectors are incompatible. Similarity scores will be meaningless.
- Not rebuilding indexes after large ingestions: IVFFlat indexes in pgvector need to be rebuilt after significant data changes. HNSW is more forgiving but still benefits from periodic optimization.
- Over-indexing small datasets: If you have fewer than 10,000 vectors, exact search (no index) is fast enough. Adding an index adds complexity without meaningful speed improvement.
- Ignoring dimensionality trade-offs: 3072-dimension embeddings use 2x the storage and are slower to search than 1536-dimension ones. The accuracy gain is often marginal.
- Not setting up backups: Vector databases hold computed data that is expensive to regenerate. Back them up like any other database.
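To make the small-dataset point concrete: exact (brute-force) search is just a scored scan over every vector, and in memory it easily handles thousands of vectors per query. A self-contained sketch:

```typescript
type Doc = { id: string; embedding: number[] };

// Cosine similarity, as defined earlier in the post.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exact nearest-neighbor search: score every document, sort, take the top K.
// O(n * d) per query -- no index, no tuning, and 100% recall by definition.
function exactSearch(query: number[], docs: Doc[], topK: number = 5) {
  return docs
    .map((doc) => ({ id: doc.id, score: cosine(query, doc.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

Only reach for an approximate index once a plain scan is actually too slow for your corpus size and latency budget.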
What's Next
You now understand how embeddings work, how to choose and configure a vector database, and how to build production-grade storage for your RAG pipeline. But retrieval is only half the story. In Part 5: Building AI Agents with Tool Calling, we will take everything you have learned and build autonomous agents that can reason, use tools, and complete multi-step tasks.
FAQ
Should I use pgvector or a dedicated vector database like Pinecone?
Use pgvector if you already run PostgreSQL and need fewer than 10 million vectors. Choose Pinecone or Qdrant for larger scale, managed infrastructure, or when you need advanced filtering and hybrid search out of the box. The key factor is operational overhead -- pgvector means zero new infrastructure, while dedicated databases offer better performance at extreme scale.
What embedding model should I use for my AI application?
OpenAI text-embedding-3-small offers the best cost-to-quality ratio for most use cases. For higher accuracy, use text-embedding-3-large. For open-source, self-hosted options, consider BGE models; Cohere Embed v3 is a strong managed alternative. Always test with your actual data -- domain-specific content may perform better with specialized models.
How do vector similarity searches work?
Vector searches convert queries into embeddings, then find the nearest vectors in the database using distance metrics like cosine similarity or dot product, returning the most semantically similar documents. Indexes like HNSW make this fast by building graph structures that allow approximate nearest neighbor search without comparing against every vector.