Building RAG Pipelines: Retrieval-Augmented Generation
Step-by-step guide to building RAG pipelines with vector databases and embeddings. Learn chunking strategies, retrieval methods, and generation patterns.
This is Part 3 of the AI Automation Engineer Roadmap series. Make sure you have read Part 2: From Prompt Engineering to Context Engineering first.
TL;DR
RAG pipelines ground LLM responses in real data by retrieving relevant documents before generation, dramatically reducing hallucinations and enabling domain-specific AI. This post walks through the full pipeline: document loading, chunking strategies, embedding, vector storage with pgvector, retrieval with hybrid search, reranking, and evaluation with RAGAS metrics.
Why This Matters
LLMs are trained on public data with a knowledge cutoff. They do not know about your company's internal docs, your product's latest features, or your customer's account details. Fine-tuning is expensive and slow to update. RAG solves this by dynamically fetching relevant information at query time and injecting it into the context window. It is the most practical pattern for building AI features that need access to private, domain-specific, or frequently changing data. If you are building anything beyond a generic chatbot, you need RAG.
Core Concepts
RAG Architecture Overview
A RAG pipeline has two phases:
Ingestion (offline):
- Load documents (PDFs, markdown, HTML, database rows)
- Split documents into chunks
- Generate embeddings for each chunk
- Store embeddings in a vector database
Retrieval + Generation (online):
- User asks a question
- Generate an embedding for the query
- Search the vector database for similar chunks
- (Optional) Rerank results for precision
- Inject the top chunks into the LLM context
- Generate a response grounded in the retrieved data
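The final steps of the online phase can be sketched in isolation: given already-retrieved chunks, assemble a grounded prompt for the model. This is a minimal illustration only; the `buildGroundedPrompt` helper and its prompt wording are assumptions for this sketch, not part of the pipeline code later in the post:

```typescript
interface RetrievedChunk {
  content: string;
  similarity: number;
}

// Order chunks by similarity, number them for citation, and wrap them
// in a system-style instruction that grounds the answer in the context.
function buildGroundedPrompt(question: string, chunks: RetrievedChunk[]): string {
  const context = [...chunks]
    .sort((a, b) => b.similarity - a.similarity)
    .map((c, i) => `[${i + 1}] ${c.content}`)
    .join("\n\n");
  return `Answer based ONLY on the context below.\n\n## Context\n${context}\n\n## Question\n${question}`;
}

const prompt = buildGroundedPrompt("What is RAG?", [
  { content: "Chunking splits documents into retrievable pieces.", similarity: 0.74 },
  { content: "RAG retrieves documents before generation.", similarity: 0.91 },
]);
console.log(prompt);
```

Everything before this step is plumbing to get the right chunks into `chunks`; the rest of the post covers that plumbing.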
Document Loading
Before you can chunk and embed, you need to extract text from your source documents:
import { readFile } from "fs/promises";
import pdf from "pdf-parse";
interface Document {
content: string;
metadata: {
source: string;
type: string;
title?: string;
pageNumber?: number;
};
}
// Load different document types
async function loadDocument(filePath: string): Promise<Document[]> {
const ext = filePath.split(".").pop()?.toLowerCase();
switch (ext) {
case "md":
case "mdx":
case "txt": {
const content = await readFile(filePath, "utf-8");
return [{ content, metadata: { source: filePath, type: ext } }];
}
case "pdf": {
const buffer = await readFile(filePath);
const data = await pdf(buffer);
return [
{
content: data.text,
metadata: { source: filePath, type: "pdf", title: data.info?.Title },
},
];
}
default:
throw new Error(`Unsupported file type: ${ext}`);
}
}
Chunking Strategies
Chunking is where most RAG pipelines succeed or fail. The goal is to create chunks that are semantically coherent -- each chunk should contain a complete thought or piece of information.
interface Chunk {
content: string;
metadata: {
source: string;
chunkIndex: number;
tokenCount: number;
};
}
// Strategy 1: Fixed-size with overlap
function fixedSizeChunk(text: string, chunkSize: number = 512, overlap: number = 50): string[] {
const words = text.split(/\s+/);
const chunks: string[] = [];
for (let i = 0; i < words.length; i += chunkSize - overlap) {
const chunk = words.slice(i, i + chunkSize).join(" ");
if (chunk.trim().length > 0) {
chunks.push(chunk);
}
}
return chunks;
}
// Strategy 2: Recursive splitting by structure
function recursiveChunk(
text: string,
maxChunkSize: number = 1000,
separators: string[] = ["\n## ", "\n### ", "\n\n", "\n", ". "]
): string[] {
// If text is small enough, return it as-is
if (text.length <= maxChunkSize) {
return [text.trim()].filter(Boolean);
}
// Try each separator in order of priority
for (const separator of separators) {
const parts = text.split(separator);
if (parts.length > 1) {
const chunks: string[] = [];
let currentChunk = "";
for (const part of parts) {
const candidate = currentChunk
? currentChunk + separator + part
: part;
if (candidate.length <= maxChunkSize) {
currentChunk = candidate;
} else {
if (currentChunk) chunks.push(currentChunk.trim());
currentChunk = part;
}
}
if (currentChunk) chunks.push(currentChunk.trim());
// Recursively split any chunks that are still too large
return chunks.flatMap((chunk) =>
chunk.length > maxChunkSize
? recursiveChunk(chunk, maxChunkSize, separators.slice(1))
: [chunk]
);
}
}
// Fallback: hard split (note: fixedSizeChunk counts words, not characters,
// so maxChunkSize is approximate here)
return fixedSizeChunk(text, maxChunkSize);
}
// Strategy 3: Semantic chunking (split at topic boundaries)
// This uses embeddings to detect topic shifts -- more expensive but higher quality
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
async function semanticChunk(
text: string,
similarityThreshold: number = 0.75
): Promise<string[]> {
const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
// Embed each sentence
const embeddings = await generateEmbeddings(sentences);
const chunks: string[] = [];
let currentChunk = sentences[0];
for (let i = 1; i < sentences.length; i++) {
const similarity = cosineSimilarity(embeddings[i - 1], embeddings[i]);
if (similarity >= similarityThreshold) {
// Similar topic -- keep appending
currentChunk += " " + sentences[i];
} else {
// Topic shift -- start new chunk
chunks.push(currentChunk.trim());
currentChunk = sentences[i];
}
}
if (currentChunk) chunks.push(currentChunk.trim());
return chunks;
}
Which strategy to use:
- Fixed-size: Simple and predictable. Good starting point for uniform text.
- Recursive: Best for structured documents (markdown, code). Respects document hierarchy.
- Semantic: Highest quality but most expensive. Use for heterogeneous document collections.
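To make the overlap arithmetic of the fixed-size strategy concrete, here is a standalone sketch on a synthetic 20-word document. The `chunkWords` helper mirrors the logic of `fixedSizeChunk` above and is included only so the example runs on its own:

```typescript
// Same word-based sliding window as fixedSizeChunk above.
function chunkWords(text: string, size: number, overlap: number): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += size - overlap) {
    const chunk = words.slice(i, i + size).join(" ");
    if (chunk) chunks.push(chunk);
  }
  return chunks;
}

// 20 synthetic "words": w0 w1 ... w19
const sample = Array.from({ length: 20 }, (_, i) => `w${i}`).join(" ");
const chunks = chunkWords(sample, 8, 2);
// The window advances size - overlap = 6 words per step, so chunks start
// at w0, w6, w12, w18 and consecutive chunks share 2 words of context.
console.log(chunks.length); // 4
console.log(chunks[1]); // "w6 w7 ... w13"
```

The shared words at each boundary are what keep a sentence that straddles two chunks retrievable from either side.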
Generating Embeddings
import OpenAI from "openai";
const openai = new OpenAI();
async function generateEmbeddings(texts: string[]): Promise<number[][]> {
// Process in batches of 100 (API limit)
const batchSize = 100;
const allEmbeddings: number[][] = [];
for (let i = 0; i < texts.length; i += batchSize) {
const batch = texts.slice(i, i + batchSize);
const response = await openai.embeddings.create({
model: "text-embedding-3-small", // 1536 dimensions, $0.02/1M tokens
input: batch,
});
const embeddings = response.data.map((d) => d.embedding);
allEmbeddings.push(...embeddings);
}
return allEmbeddings;
}
Hands-On Implementation
Vector Storage with pgvector
pgvector is a PostgreSQL extension that adds vector similarity search. If you already use Postgres, this is the most practical choice -- no new infrastructure required.
import { Pool } from "pg";
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
// Setup: create table with vector column
async function setupVectorStore() {
await pool.query("CREATE EXTENSION IF NOT EXISTS vector");
await pool.query(`
CREATE TABLE IF NOT EXISTS documents (
id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding vector(1536),
metadata JSONB DEFAULT '{}',
created_at TIMESTAMP DEFAULT NOW()
)
`);
// Create an HNSW index for fast approximate nearest neighbor search
await pool.query(`
CREATE INDEX IF NOT EXISTS documents_embedding_idx
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64)
`);
}
// Insert chunks with their embeddings
async function insertChunks(
chunks: Array<{ content: string; embedding: number[]; metadata: Record<string, unknown> }>
) {
const client = await pool.connect();
try {
await client.query("BEGIN");
for (const chunk of chunks) {
await client.query(
`INSERT INTO documents (content, embedding, metadata)
VALUES ($1, $2::vector, $3)`,
[chunk.content, `[${chunk.embedding.join(",")}]`, JSON.stringify(chunk.metadata)]
);
}
await client.query("COMMIT");
} catch (error) {
await client.query("ROLLBACK");
throw error;
} finally {
client.release();
}
}
// Similarity search
async function similaritySearch(
queryEmbedding: number[],
topK: number = 5,
metadataFilter?: Record<string, unknown>
): Promise<Array<{ content: string; similarity: number; metadata: Record<string, unknown> }>> {
let query = `
SELECT content, metadata,
1 - (embedding <=> $1::vector) AS similarity
FROM documents
`;
const params: unknown[] = [`[${queryEmbedding.join(",")}]`];
if (metadataFilter) {
query += ` WHERE metadata @> $2::jsonb`;
params.push(JSON.stringify(metadataFilter));
}
query += ` ORDER BY embedding <=> $1::vector LIMIT $${params.length + 1}`;
params.push(topK);
const result = await pool.query(query, params);
return result.rows;
}
Hybrid Search: Semantic + Keyword
Pure semantic search sometimes misses exact keyword matches. Hybrid search combines both:
async function hybridSearch(
query: string,
queryEmbedding: number[],
topK: number = 5,
semanticWeight: number = 0.7
): Promise<Array<{ content: string; score: number }>> {
const result = await pool.query(
`
WITH semantic AS (
SELECT id, content, metadata,
1 - (embedding <=> $1::vector) AS score
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT $3
),
keyword AS (
SELECT id, content, metadata,
ts_rank(to_tsvector('english', content), plainto_tsquery('english', $2)) AS score
FROM documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $2)
ORDER BY score DESC
LIMIT $3
)
SELECT
COALESCE(s.content, k.content) AS content,
COALESCE(s.score, 0) * $4 + COALESCE(k.score, 0) * (1 - $4) AS score
FROM semantic s
FULL OUTER JOIN keyword k ON s.id = k.id
ORDER BY score DESC
LIMIT $3
`,
[`[${queryEmbedding.join(",")}]`, query, topK, semanticWeight]
);
return result.rows;
}
Reranking with Cohere
Initial retrieval casts a wide net. Reranking narrows it to the most relevant results:
import { CohereClient } from "cohere-ai";
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });
async function rerankResults(
query: string,
documents: Array<{ content: string; score: number }>,
topN: number = 3
): Promise<Array<{ content: string; relevanceScore: number }>> {
const response = await cohere.rerank({
model: "rerank-english-v3.0",
query,
documents: documents.map((d) => d.content),
topN,
});
return response.results.map((result) => ({
content: documents[result.index].content,
relevanceScore: result.relevanceScore,
}));
}
The Full RAG Pipeline
Putting it all together:
import OpenAI from "openai";
const openai = new OpenAI();
async function ragQuery(userQuestion: string): Promise<string> {
// 1. Embed the query
const embeddingResponse = await openai.embeddings.create({
model: "text-embedding-3-small",
input: userQuestion,
});
const queryEmbedding = embeddingResponse.data[0].embedding;
// 2. Retrieve with hybrid search
const candidates = await hybridSearch(userQuestion, queryEmbedding, 10);
// 3. Rerank for precision
const topDocs = await rerankResults(userQuestion, candidates, 3);
// 4. Build context and generate
const context = topDocs.map((d) => d.content).join("\n\n---\n\n");
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: `Answer questions based ONLY on the provided context.
If the context does not contain enough information, say so clearly.
Do not make up information.
## Context
${context}`,
},
{ role: "user", content: userQuestion },
],
temperature: 0,
max_tokens: 1024,
});
return response.choices[0].message.content || "No response generated.";
}
Evaluation with RAGAS Metrics
You cannot improve what you do not measure. RAGAS provides standard metrics for RAG evaluation:
// Simplified RAGAS-style evaluation
interface RAGEvaluation {
faithfulness: number; // Does the answer only use retrieved context?
relevancy: number; // Is the retrieved context relevant to the question?
correctness: number; // Is the answer factually correct?
}
async function evaluateRAG(
question: string,
retrievedDocs: string[],
generatedAnswer: string,
groundTruth: string
): Promise<RAGEvaluation> {
const openai = new OpenAI();
// Evaluate faithfulness: is the answer grounded in the context?
const faithfulnessCheck = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: `Given a context and an answer, rate from 0 to 1 how well the answer is supported by the context. 1 = fully supported, 0 = completely fabricated. Respond with just the number.`,
},
{
role: "user",
content: `Context: ${retrievedDocs.join("\n")}\n\nAnswer: ${generatedAnswer}`,
},
],
temperature: 0,
});
// Evaluate relevancy: are the retrieved docs relevant?
const relevancyCheck = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: `Given a question and retrieved documents, rate from 0 to 1 how relevant the documents are. Respond with just the number.`,
},
{
role: "user",
content: `Question: ${question}\n\nDocuments: ${retrievedDocs.join("\n")}`,
},
],
temperature: 0,
});
// Evaluate correctness against ground truth
const correctnessCheck = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: `Compare the generated answer to the ground truth. Rate from 0 to 1 how correct the answer is. Respond with just the number.`,
},
{
role: "user",
content: `Ground Truth: ${groundTruth}\n\nGenerated Answer: ${generatedAnswer}`,
},
],
temperature: 0,
});
return {
faithfulness: parseFloat(faithfulnessCheck.choices[0].message.content || "0"),
relevancy: parseFloat(relevancyCheck.choices[0].message.content || "0"),
correctness: parseFloat(correctnessCheck.choices[0].message.content || "0"),
};
}
Best Practices
- Chunk size matters more than you think -- Start with 500-1000 characters with 10-20% overlap. Too small and you lose context. Too large and you dilute relevance.
- Always include metadata -- Store source file, page number, section heading, and timestamps with every chunk. You will need them for citations and filtering.
- Use hybrid search from day one -- Pure semantic search misses exact term matches. Hybrid search with 70/30 semantic/keyword weighting is a strong default.
- Rerank before generation -- Retrieval gets you candidates. Reranking selects the best ones. The quality difference is significant.
- Evaluate continuously -- Build an evaluation dataset of question/answer pairs and run it against every pipeline change.
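Continuous evaluation needs a way to compare runs. A minimal sketch: aggregate per-question scores from an evaluation run and gate pipeline changes on regressions against a stored baseline. The `summarize` and `hasRegressed` helpers and the 0.05 tolerance are illustrative assumptions, not a prescribed methodology:

```typescript
interface EvalScores {
  faithfulness: number;
  relevancy: number;
  correctness: number;
}

// Average each metric across all evaluated questions.
function summarize(results: EvalScores[]): EvalScores {
  const avg = (key: keyof EvalScores) =>
    results.reduce((sum, r) => sum + r[key], 0) / results.length;
  return {
    faithfulness: avg("faithfulness"),
    relevancy: avg("relevancy"),
    correctness: avg("correctness"),
  };
}

// Flag a regression if any metric drops more than `tolerance` below baseline.
function hasRegressed(
  current: EvalScores,
  baseline: EvalScores,
  tolerance = 0.05
): boolean {
  return (Object.keys(baseline) as Array<keyof EvalScores>).some(
    (key) => current[key] < baseline[key] - tolerance
  );
}

const run = summarize([
  { faithfulness: 0.9, relevancy: 0.8, correctness: 0.85 },
  { faithfulness: 0.7, relevancy: 0.9, correctness: 0.75 },
]);
console.log(hasRegressed(run, { faithfulness: 0.9, relevancy: 0.8, correctness: 0.8 }));
```

In practice the per-question scores would come from `evaluateRAG` above, run against the same fixed question set before and after each change.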
Common Pitfalls
- Chunking at arbitrary boundaries: Splitting mid-sentence or mid-paragraph destroys context. Use recursive or semantic chunking.
- Not deduplicating chunks: Overlapping chunks can flood your results with near-identical content, wasting context window space.
- Skipping the "I don't know" instruction: Without explicit instructions to admit uncertainty, the model will hallucinate answers from partial context.
- Using the wrong embedding model for the domain: General-purpose embeddings struggle with highly specialized text (legal, medical). Evaluate domain-specific models.
- Ignoring retrieval quality: Most RAG failures are retrieval failures, not generation failures. Debug retrieval first.
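For the deduplication pitfall, a simple first defense is hashing normalized chunk text before results reach the context window. This sketch assumes exact duplicates after whitespace/case normalization; real pipelines with heavy overlap may also need fuzzier near-duplicate detection (e.g. embedding similarity). The `dedupeChunks` helper is an assumption for illustration:

```typescript
import { createHash } from "crypto";

// Drop chunks whose normalized content has already been seen,
// preserving the original order of first occurrence.
function dedupeChunks(chunks: string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const chunk of chunks) {
    const key = createHash("sha256")
      .update(chunk.toLowerCase().replace(/\s+/g, " ").trim())
      .digest("hex");
    if (!seen.has(key)) {
      seen.add(key);
      result.push(chunk);
    }
  }
  return result;
}

console.log(dedupeChunks(["A  b", "a b", "c"])); // ["A  b", "c"]
```

Running this between retrieval and reranking keeps near-identical overlapping chunks from crowding out distinct context.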
What's Next
This post used pgvector for storage, but there is much more to vector databases -- indexing strategies, managed vs self-hosted options, and production scaling patterns. In Part 4: Vector Databases and Embeddings, we will do a deep dive into how embeddings work mathematically, compare pgvector vs Pinecone vs Qdrant vs Chroma, and cover production indexing strategies.
FAQ
What is retrieval-augmented generation (RAG)?
RAG is a pattern that combines information retrieval with LLM generation. It fetches relevant documents from a knowledge base and includes them in the prompt context so the model can generate accurate, grounded responses. Unlike fine-tuning, RAG can work with constantly changing data and does not require retraining the model.
How do you choose the right chunking strategy for RAG?
The best chunking strategy depends on your content type. Use semantic chunking for varied documents, fixed-size with overlap for uniform text, and recursive splitting for structured documents like code or markdown. Start with recursive splitting at 500-1000 characters and iterate based on retrieval quality metrics.
What are the most common RAG pipeline failures?
Common failures include poor chunk boundaries that split context, inadequate embedding models for the domain, missing metadata filtering, and not implementing reranking to improve retrieval precision. The single most common mistake is blaming the LLM for bad answers when the real problem is that retrieval returned irrelevant chunks.