June 08, 2025 (last updated June 08, 2025)

Building RAG Pipelines: Retrieval-Augmented Generation

Step-by-step guide to building RAG pipelines with vector databases and embeddings. Learn chunking strategies, retrieval methods, and generation patterns.

Tags

AI · RAG · Vector Database · Embeddings · LLMs
5 min read

This is Part 3 of the AI Automation Engineer Roadmap series. Make sure you have read Part 2: From Prompt Engineering to Context Engineering first.

TL;DR

RAG pipelines ground LLM responses in real data by retrieving relevant documents before generation, dramatically reducing hallucinations and enabling domain-specific AI. This post walks through the full pipeline: document loading, chunking strategies, embedding, vector storage with pgvector, retrieval with hybrid search, reranking, and evaluation with RAGAS metrics.

Why This Matters

LLMs are trained on public data with a knowledge cutoff. They do not know about your company's internal docs, your product's latest features, or your customer's account details. Fine-tuning is expensive and slow to update. RAG solves this by dynamically fetching relevant information at query time and injecting it into the context window. It is the most practical pattern for building AI features that need access to private, domain-specific, or frequently changing data. If you are building anything beyond a generic chatbot, you need RAG.

Core Concepts

RAG Architecture Overview

A RAG pipeline has two phases:

Ingestion (offline):

  1. Load documents (PDFs, markdown, HTML, database rows)
  2. Split documents into chunks
  3. Generate embeddings for each chunk
  4. Store embeddings in a vector database

Retrieval + Generation (online):

  1. User asks a question
  2. Generate an embedding for the query
  3. Search the vector database for similar chunks
  4. (Optional) Rerank results for precision
  5. Inject the top chunks into the LLM context
  6. Generate a response grounded in the retrieved data
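The online phase above can be sketched as a composition of pluggable steps. The function types and names here (`Embedder`, `answerQuestion`, and so on) are illustrative, not from any library -- each stage gets a concrete implementation later in this post:

```typescript
type Embedder = (text: string) => Promise<number[]>;
type Retriever = (embedding: number[], topK: number) => Promise<string[]>;
type Generator = (question: string, context: string[]) => Promise<string>;

// Online phase as a composition of swappable steps:
// embed the query, retrieve similar chunks, generate a grounded answer.
async function answerQuestion(
  question: string,
  embed: Embedder,
  retrieve: Retriever,
  generate: Generator,
  topK: number = 5
): Promise<string> {
  const queryEmbedding = await embed(question);
  const chunks = await retrieve(queryEmbedding, topK);
  return generate(question, chunks);
}
```

Keeping each stage behind a function type like this makes it easy to swap retrieval strategies or models without touching the rest of the pipeline.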

Document Loading

Before you can chunk and embed, you need to extract text from your source documents:

typescript
import { readFile } from "fs/promises";
import pdf from "pdf-parse";
 
interface Document {
  content: string;
  metadata: {
    source: string;
    type: string;
    title?: string;
    pageNumber?: number;
  };
}
 
// Load different document types
async function loadDocument(filePath: string): Promise<Document[]> {
  const ext = filePath.split(".").pop()?.toLowerCase();
 
  switch (ext) {
    case "md":
    case "mdx":
    case "txt": {
      const content = await readFile(filePath, "utf-8");
      return [{ content, metadata: { source: filePath, type: ext } }];
    }
    case "pdf": {
      const buffer = await readFile(filePath);
      const data = await pdf(buffer);
      return [
        {
          content: data.text,
          metadata: { source: filePath, type: "pdf", title: data.info?.Title },
        },
      ];
    }
    default:
      throw new Error(`Unsupported file type: ${ext}`);
  }
}

Chunking Strategies

Chunking is where most RAG pipelines succeed or fail. The goal is to create chunks that are semantically coherent -- each chunk should contain a complete thought or piece of information.

typescript
interface Chunk {
  content: string;
  metadata: {
    source: string;
    chunkIndex: number;
    tokenCount: number;
  };
}
 
// Strategy 1: Fixed-size with overlap (sizes here are measured in words, not characters)
function fixedSizeChunk(text: string, chunkSize: number = 512, overlap: number = 50): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];
 
  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    const chunk = words.slice(i, i + chunkSize).join(" ");
    if (chunk.trim().length > 0) {
      chunks.push(chunk);
    }
  }
 
  return chunks;
}
 
// Strategy 2: Recursive splitting by structure
function recursiveChunk(
  text: string,
  maxChunkSize: number = 1000,
  separators: string[] = ["\n## ", "\n### ", "\n\n", "\n", ". "]
): string[] {
  // If text is small enough, return it as-is
  if (text.length <= maxChunkSize) {
    return [text.trim()].filter(Boolean);
  }
 
  // Try each separator in order of priority
  for (const separator of separators) {
    const parts = text.split(separator);
    if (parts.length > 1) {
      const chunks: string[] = [];
      let currentChunk = "";
 
      for (const part of parts) {
        const candidate = currentChunk
          ? currentChunk + separator + part
          : part;
 
        if (candidate.length <= maxChunkSize) {
          currentChunk = candidate;
        } else {
          if (currentChunk) chunks.push(currentChunk.trim());
          currentChunk = part;
        }
      }
      if (currentChunk) chunks.push(currentChunk.trim());
 
      // Recursively split any chunks that are still too large
      return chunks.flatMap((chunk) =>
        chunk.length > maxChunkSize
          ? recursiveChunk(chunk, maxChunkSize, separators.slice(1))
          : [chunk]
      );
    }
  }
 
  // Fallback: hard split (note: fixedSizeChunk measures size in words, not characters)
  return fixedSizeChunk(text, maxChunkSize);
}
 
// Cosine similarity between two equal-length vectors (used by semanticChunk below)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
 
// Strategy 3: Semantic chunking (split at topic boundaries)
// Uses embeddings to detect topic shifts -- more expensive but higher quality.
// Relies on generateEmbeddings, defined in the next section.
async function semanticChunk(
  text: string,
  similarityThreshold: number = 0.75
): Promise<string[]> {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
 
  // Embed each sentence
  const embeddings = await generateEmbeddings(sentences);
 
  const chunks: string[] = [];
  let currentChunk = sentences[0];
 
  for (let i = 1; i < sentences.length; i++) {
    const similarity = cosineSimilarity(embeddings[i - 1], embeddings[i]);
 
    if (similarity >= similarityThreshold) {
      // Similar topic -- keep appending
      currentChunk += " " + sentences[i];
    } else {
      // Topic shift -- start new chunk
      chunks.push(currentChunk.trim());
      currentChunk = sentences[i];
    }
  }
 
  if (currentChunk) chunks.push(currentChunk.trim());
  return chunks;
}

Which strategy to use:

  • Fixed-size: Simple and predictable. Good starting point for uniform text.
  • Recursive: Best for structured documents (markdown, code). Respects document hierarchy.
  • Semantic: Highest quality but most expensive. Use for heterogeneous document collections.
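Whichever strategy you pick, sanity-check chunk sizes before embedding. A rough heuristic of ~4 characters per token for English prose is enough to catch oversized chunks (use a real tokenizer such as tiktoken for exact counts); `estimateTokens` and `oversizedChunks` below are hypothetical helpers, not library functions:

```typescript
// Rough token estimate: ~4 characters per token for English prose.
// This is a heuristic, not a tokenizer -- use tiktoken for exact counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Flag chunks likely to exceed an embedding model's input limit.
function oversizedChunks(chunks: string[], maxTokens: number = 8191): string[] {
  return chunks.filter((chunk) => estimateTokens(chunk) > maxTokens);
}
```

Run this over your chunker's output during development; if anything comes back, your separators or size limits need tightening.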

Generating Embeddings

typescript
import OpenAI from "openai";
 
const openai = new OpenAI();
 
async function generateEmbeddings(texts: string[]): Promise<number[][]> {
  // Process in batches to stay well within the API's inputs-per-request limit
  const batchSize = 100;
  const allEmbeddings: number[][] = [];
 
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small", // 1536 dimensions, $0.02/1M tokens
      input: batch,
    });
 
    const embeddings = response.data.map((d) => d.embedding);
    allEmbeddings.push(...embeddings);
  }
 
  return allEmbeddings;
}
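Ingestion runs are often repeated over mostly unchanged corpora, and re-embedding identical chunks costs real money. A content-hash cache skips them. This sketch assumes you pass in an `embed` function (such as `generateEmbeddings` above) and uses an in-memory Map purely for illustration -- swap it for Redis or a Postgres table in production:

```typescript
import { createHash } from "node:crypto";

type EmbedFn = (texts: string[]) => Promise<number[][]>;

// In-memory embedding cache keyed by SHA-256 of the chunk text.
const embeddingCache = new Map<string, number[]>();

async function embedWithCache(texts: string[], embed: EmbedFn): Promise<number[][]> {
  const hashes = texts.map((t) => createHash("sha256").update(t).digest("hex"));

  // Collect unique, uncached texts so each distinct chunk is embedded once.
  const textByHash = new Map<string, string>();
  texts.forEach((t, i) => textByHash.set(hashes[i], t));
  const missing = [...textByHash.keys()].filter((h) => !embeddingCache.has(h));

  if (missing.length > 0) {
    const fresh = await embed(missing.map((h) => textByHash.get(h)!));
    missing.forEach((h, i) => embeddingCache.set(h, fresh[i]));
  }

  // Return embeddings in the original input order, duplicates included.
  return hashes.map((h) => embeddingCache.get(h)!);
}
```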

Hands-On Implementation

Vector Storage with pgvector

pgvector is a PostgreSQL extension that adds vector similarity search. If you already use Postgres, this is the most practical choice -- no new infrastructure required.

typescript
import { Pool } from "pg";
 
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
 
// Setup: create table with vector column
async function setupVectorStore() {
  await pool.query("CREATE EXTENSION IF NOT EXISTS vector");
  await pool.query(`
    CREATE TABLE IF NOT EXISTS documents (
      id SERIAL PRIMARY KEY,
      content TEXT NOT NULL,
      embedding vector(1536),
      metadata JSONB DEFAULT '{}',
      created_at TIMESTAMP DEFAULT NOW()
    )
  `);
 
  // Create an HNSW index for fast approximate nearest neighbor search
  await pool.query(`
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
  `);
}
 
// Insert chunks with their embeddings
async function insertChunks(
  chunks: Array<{ content: string; embedding: number[]; metadata: Record<string, unknown> }>
) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
 
    for (const chunk of chunks) {
      await client.query(
        `INSERT INTO documents (content, embedding, metadata)
         VALUES ($1, $2::vector, $3)`,
        [chunk.content, `[${chunk.embedding.join(",")}]`, JSON.stringify(chunk.metadata)]
      );
    }
 
    await client.query("COMMIT");
  } catch (error) {
    await client.query("ROLLBACK");
    throw error;
  } finally {
    client.release();
  }
}
 
// Similarity search
async function similaritySearch(
  queryEmbedding: number[],
  topK: number = 5,
  metadataFilter?: Record<string, unknown>
): Promise<Array<{ content: string; similarity: number; metadata: Record<string, unknown> }>> {
  let query = `
    SELECT content, metadata,
           1 - (embedding <=> $1::vector) AS similarity
    FROM documents
  `;
 
  const params: unknown[] = [`[${queryEmbedding.join(",")}]`];
 
  if (metadataFilter) {
    query += ` WHERE metadata @> $2::jsonb`;
    params.push(JSON.stringify(metadataFilter));
  }
 
  query += ` ORDER BY embedding <=> $1::vector LIMIT $${params.length + 1}`;
  params.push(topK);
 
  const result = await pool.query(query, params);
  return result.rows;
}

Hybrid Search: Semantic + Keyword

Pure semantic search sometimes misses exact keyword matches. Hybrid search combines both:

typescript
async function hybridSearch(
  query: string,
  queryEmbedding: number[],
  topK: number = 5,
  semanticWeight: number = 0.7
): Promise<Array<{ content: string; score: number }>> {
  const result = await pool.query(
    `
    WITH semantic AS (
      SELECT id, content, metadata,
             1 - (embedding <=> $1::vector) AS score
      FROM documents
      ORDER BY embedding <=> $1::vector
      LIMIT $3
    ),
    keyword AS (
      SELECT id, content, metadata,
             ts_rank(to_tsvector('english', content), plainto_tsquery('english', $2)) AS score
      FROM documents
      WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $2)
      ORDER BY score DESC
      LIMIT $3
    )
    SELECT
      COALESCE(s.content, k.content) AS content,
      COALESCE(s.score, 0) * $4 + COALESCE(k.score, 0) * (1 - $4) AS score
    FROM semantic s
    FULL OUTER JOIN keyword k ON s.id = k.id
    ORDER BY score DESC
    LIMIT $3
    `,
    [`[${queryEmbedding.join(",")}]`, query, topK, semanticWeight]
  );
 
  return result.rows;
}
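One caveat with the weighted sum above: cosine similarity lives in [0, 1] while ts_rank is unnormalized, so the two scales do not mix cleanly. Reciprocal Rank Fusion (RRF) is a common alternative that combines ranks instead of raw scores. A standalone sketch -- the `Ranked` shape is illustrative, and k = 60 is the conventional default:

```typescript
interface Ranked {
  id: number;
  content: string;
}

// Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank).
// Documents ranked highly in multiple lists float to the top regardless of
// the underlying score scales.
function reciprocalRankFusion(lists: Ranked[][], k: number = 60): Ranked[] {
  const scores = new Map<number, { doc: Ranked; score: number }>();
  for (const list of lists) {
    list.forEach((doc, rank) => {
      const entry = scores.get(doc.id) ?? { doc, score: 0 };
      entry.score += 1 / (k + rank + 1);
      scores.set(doc.id, entry);
    });
  }
  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .map((entry) => entry.doc);
}
```

You would run the semantic and keyword queries separately, then fuse their ranked results with this function instead of the SQL weighted sum.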

Reranking with Cohere

Initial retrieval casts a wide net. Reranking narrows it to the most relevant results:

typescript
import { CohereClient } from "cohere-ai";
 
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });
 
async function rerankResults(
  query: string,
  documents: Array<{ content: string; score: number }>,
  topN: number = 3
): Promise<Array<{ content: string; relevanceScore: number }>> {
  const response = await cohere.rerank({
    model: "rerank-english-v3.0",
    query,
    documents: documents.map((d) => d.content),
    topN,
  });
 
  return response.results.map((result) => ({
    content: documents[result.index].content,
    relevanceScore: result.relevanceScore,
  }));
}

The Full RAG Pipeline

Putting it all together:

typescript
import OpenAI from "openai";
 
const openai = new OpenAI();
 
async function ragQuery(userQuestion: string): Promise<string> {
  // 1. Embed the query
  const embeddingResponse = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: userQuestion,
  });
  const queryEmbedding = embeddingResponse.data[0].embedding;
 
  // 2. Retrieve with hybrid search
  const candidates = await hybridSearch(userQuestion, queryEmbedding, 10);
 
  // 3. Rerank for precision
  const topDocs = await rerankResults(userQuestion, candidates, 3);
 
  // 4. Build context and generate
  const context = topDocs.map((d) => d.content).join("\n\n---\n\n");
 
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Answer questions based ONLY on the provided context.
If the context does not contain enough information, say so clearly.
Do not make up information.
 
## Context
${context}`,
      },
      { role: "user", content: userQuestion },
    ],
    temperature: 0,
    max_tokens: 1024,
  });
 
  return response.choices[0].message.content || "No response generated.";
}

Evaluation with RAGAS Metrics

You cannot improve what you do not measure. RAGAS provides standard metrics for RAG evaluation:

typescript
// Simplified RAGAS-style evaluation
interface RAGEvaluation {
  faithfulness: number; // Does the answer only use retrieved context?
  relevancy: number; // Is the retrieved context relevant to the question?
  correctness: number; // Is the answer factually correct?
}
 
async function evaluateRAG(
  question: string,
  retrievedDocs: string[],
  generatedAnswer: string,
  groundTruth: string
): Promise<RAGEvaluation> {
  const openai = new OpenAI();
 
  // Evaluate faithfulness: is the answer grounded in the context?
  const faithfulnessCheck = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Given a context and an answer, rate from 0 to 1 how well the answer is supported by the context. 1 = fully supported, 0 = completely fabricated. Respond with just the number.`,
      },
      {
        role: "user",
        content: `Context: ${retrievedDocs.join("\n")}\n\nAnswer: ${generatedAnswer}`,
      },
    ],
    temperature: 0,
  });
 
  // Evaluate relevancy: are the retrieved docs relevant?
  const relevancyCheck = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Given a question and retrieved documents, rate from 0 to 1 how relevant the documents are. Respond with just the number.`,
      },
      {
        role: "user",
        content: `Question: ${question}\n\nDocuments: ${retrievedDocs.join("\n")}`,
      },
    ],
    temperature: 0,
  });
 
  // Evaluate correctness against ground truth
  const correctnessCheck = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Compare the generated answer to the ground truth. Rate from 0 to 1 how correct the answer is. Respond with just the number.`,
      },
      {
        role: "user",
        content: `Ground Truth: ${groundTruth}\n\nGenerated Answer: ${generatedAnswer}`,
      },
    ],
    temperature: 0,
  });
 
  return {
    faithfulness: parseFloat(faithfulnessCheck.choices[0].message.content || "0"),
    relevancy: parseFloat(relevancyCheck.choices[0].message.content || "0"),
    correctness: parseFloat(correctnessCheck.choices[0].message.content || "0"),
  };
}
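Per-question scores only become actionable once aggregated across an evaluation set and tracked over pipeline changes. A minimal aggregation helper, re-declaring the `RAGEvaluation` shape from above so the snippet is self-contained:

```typescript
interface RAGEvaluation {
  faithfulness: number;
  relevancy: number;
  correctness: number;
}

// Average each metric across an evaluation run; returns zeros for an empty set.
function aggregateEvaluations(evals: RAGEvaluation[]): RAGEvaluation {
  const n = evals.length || 1;
  return {
    faithfulness: evals.reduce((sum, e) => sum + e.faithfulness, 0) / n,
    relevancy: evals.reduce((sum, e) => sum + e.relevancy, 0) / n,
    correctness: evals.reduce((sum, e) => sum + e.correctness, 0) / n,
  };
}
```

Log these aggregates per pipeline version so a chunking or retrieval change that regresses faithfulness is caught before it ships.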

Best Practices

  1. Chunk size matters more than you think -- Start with 500-1000 characters with 10-20% overlap. Too small and you lose context. Too large and you dilute relevance.
  2. Always include metadata -- Store source file, page number, section heading, and timestamps with every chunk. You will need them for citations and filtering.
  3. Use hybrid search from day one -- Pure semantic search misses exact term matches. Hybrid search with 70/30 semantic/keyword weighting is a strong default.
  4. Rerank before generation -- Retrieval gets you candidates. Reranking selects the best ones. The quality difference is significant.
  5. Evaluate continuously -- Build an evaluation dataset of question/answer pairs and run it against every pipeline change.

Common Pitfalls

  • Chunking at arbitrary boundaries: Splitting mid-sentence or mid-paragraph destroys context. Use recursive or semantic chunking.
  • Not deduplicating chunks: Overlapping chunks can flood your results with near-identical content, wasting context window space.
  • Skipping the "I don't know" instruction: Without explicit instructions to admit uncertainty, the model will hallucinate answers from partial context.
  • Using the wrong embedding model for the domain: General-purpose embeddings struggle with highly specialized text (legal, medical). Evaluate domain-specific models.
  • Ignoring retrieval quality: Most RAG failures are retrieval failures, not generation failures. Debug retrieval first.
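The deduplication pitfall is cheap to fix for exact and near-exact copies: hash normalized chunk text and keep only the first (highest-ranked) occurrence. True near-duplicates still need an embedding-similarity check on top. A sketch:

```typescript
import { createHash } from "node:crypto";

// Drop exact-duplicate chunks, keeping the first (highest-ranked) occurrence.
function dedupeChunks<T extends { content: string }>(chunks: T[]): T[] {
  const seen = new Set<string>();
  return chunks.filter((chunk) => {
    // Normalize whitespace so trivially reformatted copies still collide.
    const hash = createHash("sha256")
      .update(chunk.content.trim().replace(/\s+/g, " "))
      .digest("hex");
    if (seen.has(hash)) return false;
    seen.add(hash);
    return true;
  });
}
```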

What's Next

This post used pgvector for storage, but there is much more to vector databases -- indexing strategies, managed vs self-hosted options, and production scaling patterns. In Part 4: Vector Databases and Embeddings, we will do a deep dive into how embeddings work mathematically, compare pgvector vs Pinecone vs Qdrant vs Chroma, and cover production indexing strategies.

FAQ

What is retrieval-augmented generation (RAG)?

RAG is a pattern that combines information retrieval with LLM generation. It fetches relevant documents from a knowledge base and includes them in the prompt context so the model can generate accurate, grounded responses. Unlike fine-tuning, RAG can work with constantly changing data and does not require retraining the model.

How do you choose the right chunking strategy for RAG?

The best chunking strategy depends on your content type. Use semantic chunking for varied documents, fixed-size with overlap for uniform text, and recursive splitting for structured documents like code or markdown. Start with recursive splitting at 500-1000 characters and iterate based on retrieval quality metrics.

What are the most common RAG pipeline failures?

Common failures include poor chunk boundaries that split context, inadequate embedding models for the domain, missing metadata filtering, and not implementing reranking to improve retrieval precision. The single most common mistake is blaming the LLM for bad answers when the real problem is that retrieval returned irrelevant chunks.


Article Author

Sadam Hussain

Senior Full Stack Developer with over 7 years of experience building React, Next.js, Node.js, TypeScript, and AI-powered web platforms.
