Build a Production RAG Chatbot with Next.js and pgvector

July 20, 2025


Build a production-ready RAG chatbot using Next.js, pgvector, and OpenAI. Complete tutorial covering ingestion, retrieval, streaming responses, and deployment.

Tags: AI, RAG, Chatbot, Next.js, pgvector


This is part of the AI Automation Engineer Roadmap series.

TL;DR

A production RAG chatbot combines document ingestion, chunking, embeddings, vector retrieval, prompt construction, and streaming answer generation. With Next.js for the application layer, PostgreSQL plus pgvector for storage, and OpenAI for embeddings and generation, you can ship a retrieval system that answers questions against your own documents instead of relying only on the model's pretraining.

Why This Matters

Plain chatbots are limited to what the model learned during training. They can explain general concepts, but they do not know your internal docs, product policies, onboarding guides, or support content unless you inject that information at request time.

That is exactly what retrieval-augmented generation solves. A RAG system:

  • ingests documents into a searchable vector index
  • retrieves the most relevant chunks for a question
  • injects those chunks into the prompt
  • asks the model to answer using grounded context

The result is a chatbot that is far more useful for internal knowledge bases, product help centers, and support assistants.

Core Concepts

What RAG Actually Adds

RAG does not "train" the model on your data. It improves answer quality by retrieving relevant context before generation.

The request lifecycle looks like this:

  1. A user asks a question.
  2. The question is converted into an embedding vector.
  3. The vector is compared against stored document chunk embeddings.
  4. The top matches are retrieved.
  5. Those matches are passed into the model prompt.
  6. The model answers using the retrieved material.

That architecture matters because it keeps your data fresh. Update the documents, re-index them, and the chatbot can answer from the new content without retraining the model.

Why pgvector Is a Practical Choice

For many teams, pgvector is the right starting point because:

  • it keeps structured data and vector data in one database
  • it reduces operational complexity
  • it works well for moderate corpus sizes
  • PostgreSQL tooling is mature and familiar

If you are already using Postgres, pgvector often wins on simplicity. You can move to a dedicated vector database later if scale or latency requires it.

Chunking Is More Important Than Most People Think

Many weak RAG systems fail because of bad chunking.

If chunks are too small:

  • retrieval loses context
  • answers become fragmented

If chunks are too large:

  • irrelevant context pollutes the prompt
  • retrieval quality drops

A common starting point is:

  • 500 to 1,000 tokens per chunk
  • 10% to 20% overlap
  • metadata for source, section, title, and document type
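
As a rough sketch, those token targets can be translated into the character lengths that a simple character-based chunker needs, using the common heuristic of about four characters per token for English text. The names `tokensToChars` and `chunkSettings` here are illustrative, not from any library:

```typescript
// Rough heuristic: ~4 characters per token for English text.
const CHARS_PER_TOKEN = 4;

function tokensToChars(tokens: number): number {
  return tokens * CHARS_PER_TOKEN;
}

// A 750-token chunk with 15% overlap, expressed in characters:
const chunkSettings = {
  maxLength: tokensToChars(750),                  // 3000 characters
  overlap: Math.round(tokensToChars(750) * 0.15), // 450 characters
};
```

For precise budgets you would count real tokens with a tokenizer, but the heuristic is close enough to pick sane chunk sizes.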

Hands-On Implementation

Step 1: Create the Embeddings Table

Start by enabling pgvector and creating a table for chunks:

sql
CREATE EXTENSION IF NOT EXISTS vector;
 
CREATE TABLE document_chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id TEXT NOT NULL,
  title TEXT,
  content TEXT NOT NULL,
  source_url TEXT,
  section TEXT,
  metadata JSONB DEFAULT '{}'::jsonb,
  embedding VECTOR(1536) NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);
 
CREATE INDEX document_chunks_embedding_idx
ON document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

The exact vector dimension depends on the embedding model you use. Make sure the schema matches the model's output dimension.
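
As a small guard (illustrative, not from any library), the ingestion path can verify that each embedding matches the `VECTOR(1536)` column before inserting:

```typescript
// Guard against schema/model mismatch before inserting.
// EXPECTED_DIMENSION must match the VECTOR(n) column in document_chunks.
const EXPECTED_DIMENSION = 1536; // text-embedding-3-small output size

function assertEmbeddingDimension(embedding: number[]): number[] {
  if (embedding.length !== EXPECTED_DIMENSION) {
    throw new Error(
      `Embedding has ${embedding.length} dimensions, expected ${EXPECTED_DIMENSION}`,
    );
  }
  return embedding;
}
```

Failing fast here is much cheaper than debugging a Postgres dimension error mid-ingestion.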

Step 2: Build the Ingestion Pipeline

Your ingestion job should:

  1. load source documents
  2. normalize text
  3. split into chunks
  4. generate embeddings
  5. store chunks plus metadata

Example:

typescript
import OpenAI from "openai";
import { sql } from "@vercel/postgres";
 
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});
 
function chunkText(text: string, maxLength = 1200, overlap = 200) {
  const chunks: string[] = [];
  let start = 0;
 
  while (start < text.length) {
    const end = Math.min(start + maxLength, text.length);
    chunks.push(text.slice(start, end));
    start += maxLength - overlap;
  }
 
  return chunks;
}
 
export async function ingestDocument(documentId: string, title: string, body: string) {
  const chunks = chunkText(body);
 
  for (const chunk of chunks) {
    const embedding = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: chunk,
    });
 
    await sql`
      INSERT INTO document_chunks (document_id, title, content, embedding)
      VALUES (
        ${documentId},
        ${title},
        ${chunk},
        ${JSON.stringify(embedding.data[0].embedding)}::vector
      )
    `;
  }
}

In production, batch work and run ingestion asynchronously. Document parsing, embedding generation, and database writes should not block user requests.
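
One way to batch the embedding calls: the OpenAI embeddings endpoint accepts an array of inputs, so a whole batch of chunks can be embedded per request. This sketch uses a minimal structural client type so it stands alone; `batchSize` and the helper names are assumptions, not library APIs:

```typescript
// Minimal structural type for the embeddings client, so this sketch
// does not depend on importing the openai package here.
type EmbeddingsClient = {
  embeddings: {
    create(args: {
      model: string;
      input: string[];
    }): Promise<{ data: { embedding: number[] }[] }>;
  };
};

// Split an array into fixed-size batches.
function batchArray<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// One request embeds a whole batch of chunks instead of one chunk per call.
async function embedChunks(
  client: EmbeddingsClient,
  chunks: string[],
  batchSize = 100,
): Promise<number[][]> {
  const all: number[][] = [];
  for (const batch of batchArray(chunks, batchSize)) {
    const response = await client.embeddings.create({
      model: "text-embedding-3-small",
      input: batch,
    });
    all.push(...response.data.map((d) => d.embedding));
  }
  return all;
}
```

The real OpenAI client instance should satisfy this structural type, so `embedChunks(openai, chunks)` slots into the ingestion loop above.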

Step 3: Retrieve Relevant Chunks

At query time, embed the question and run a similarity search:

typescript
export async function retrieveRelevantChunks(query: string) {
  const embeddingResponse = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
 
  const queryEmbedding = embeddingResponse.data[0].embedding;
 
  const result = await sql`
    SELECT
      document_id,
      title,
      content,
      source_url,
      1 - (embedding <=> ${JSON.stringify(queryEmbedding)}::vector) AS similarity
    FROM document_chunks
    ORDER BY embedding <=> ${JSON.stringify(queryEmbedding)}::vector
    LIMIT 5
  `;
 
  return result.rows;
}

The point is not to retrieve the most text. The point is to retrieve the most useful context.
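
One cheap way to bias toward useful context is a minimum similarity cutoff applied to the rows returned above. This is a sketch; the `0.3` threshold is a starting point to tune against your own data, not a universal constant:

```typescript
type RetrievedChunk = { content: string; similarity: number };

// Drop weak matches so low-similarity chunks never reach the prompt.
function filterBySimilarity<T extends RetrievedChunk>(
  rows: T[],
  minSimilarity = 0.3,
): T[] {
  return rows.filter((row) => row.similarity >= minSimilarity);
}
```

If every row falls below the cutoff, that is a strong signal the system should abstain rather than answer.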

Step 4: Generate a Grounded Answer

Once you have the relevant chunks, pass them into a system prompt:

typescript
import { streamText } from "ai";
// Note: this `openai` helper from the AI SDK provider package is not the
// OpenAI client instance used in the ingestion code; keep them in separate
// modules (or rename one) to avoid a naming collision.
import { openai } from "@ai-sdk/openai";
 
export async function answerQuestion(question: string) {
  const chunks = await retrieveRelevantChunks(question);
 
  const context = chunks
    .map(
      (chunk, index) =>
        `Source ${index + 1}: ${chunk.title}\n${chunk.content}`,
    )
    .join("\n\n");
 
  return streamText({
    model: openai("gpt-4o"),
    system: `You are a support assistant. Answer using the retrieved context.
 
If the answer is not supported by the context, say you do not know.
Be concise and cite the most relevant source title when possible.`,
    prompt: `Question: ${question}\n\nContext:\n${context}`,
  });
}

The most important line in that prompt is the instruction to admit uncertainty. That is how you reduce hallucinations.
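
A related refinement, sketched here, is capping how much retrieved text reaches the prompt. The context string above concatenates every chunk; a rough character budget keeps retrieval from overflowing the model's context window. `buildContext` and the `8000`-character default are illustrative:

```typescript
type ContextChunk = { title: string; content: string };

// Build the context block, stopping before a rough character budget
// so retrieved text cannot blow past the model's context window.
function buildContext(chunks: ContextChunk[], maxChars = 8000): string {
  const parts: string[] = [];
  let used = 0;
  for (const [index, chunk] of chunks.entries()) {
    const part = `Source ${index + 1}: ${chunk.title}\n${chunk.content}`;
    if (used + part.length > maxChars) break;
    parts.push(part);
    used += part.length + 2; // account for the "\n\n" separator
  }
  return parts.join("\n\n");
}
```

Because retrieval returns chunks best-first, truncating from the tail drops the least relevant context first.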

Step 5: Expose It Through a Next.js Route

In a Next.js App Router project, your route handler can stream directly:

typescript
// app/api/chat/route.ts
import { NextRequest } from "next/server";
import { answerQuestion } from "@/lib/rag";
 
export async function POST(req: NextRequest) {
  const { message } = await req.json();
  const result = await answerQuestion(message);
  return result.toDataStreamResponse();
}
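
The handler above trusts the request body as-is. A minimal validation guard (illustrative, not part of any framework API) keeps malformed requests from reaching the RAG pipeline:

```typescript
// Validate the parsed JSON body before running retrieval.
function parseMessage(body: unknown): string {
  if (
    typeof body !== "object" ||
    body === null ||
    typeof (body as { message?: unknown }).message !== "string" ||
    (body as { message: string }).message.trim() === ""
  ) {
    throw new Error("Request body must include a non-empty message");
  }
  return (body as { message: string }).message;
}
```

In the route handler, wrap the call in a try/catch and return a 400 response instead of letting the error surface as a 500.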

And on the client:

typescript
"use client";
 
import { useChat } from "@ai-sdk/react";
 
export function ChatUI() {
  const { messages, input, handleInputChange, handleSubmit } = useChat({
    api: "/api/chat",
  });
 
  return (
    <form onSubmit={handleSubmit}>
      <input value={input} onChange={handleInputChange} />
      <button type="submit">Ask</button>
      {messages.map((message) => (
        <div key={message.id}>
          <strong>{message.role}:</strong> {message.content}
        </div>
      ))}
    </form>
  );
}

Production Considerations

Add Metadata and Source Attribution

Do not store only raw text. Store:

  • title
  • document type
  • section
  • source URL
  • version
  • timestamps

That lets you:

  • show citations in the UI
  • filter retrieval by source type
  • rebuild indexes safely
  • debug bad answers

Handle Ingestion as a Background Job

If you ingest PDFs, docs, or large markdown sets, do it asynchronously. The ingestion pipeline should support:

  • retry logic
  • progress tracking
  • idempotent re-ingestion
  • delete-and-rebuild by document ID
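
Idempotent re-ingestion can be sketched with a content hash: store a hash alongside each document (a stored hash column is an assumption here, not part of the schema above) and only re-embed when it changes:

```typescript
import { createHash } from "node:crypto";

// Hash each document's content; re-embed only when the hash changes.
function contentHash(body: string): string {
  return createHash("sha256").update(body).digest("hex");
}

function needsReingestion(storedHash: string | null, body: string): boolean {
  return storedHash !== contentHash(body);
}
```

When a document does change, delete its existing rows by `document_id` and re-insert, so re-runs never duplicate chunks.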

Evaluate Retrieval, Not Just Generation

A lot of teams evaluate the final answer but ignore retrieval quality. That is a mistake.

Track:

  • whether the right chunks were retrieved
  • whether chunk ranking was sensible
  • whether the answer cited the correct sources
  • whether the system abstained when context was weak

If retrieval is wrong, the model cannot recover.
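
Retrieval quality can be tracked with a small recall@k harness over a hand-labeled set of questions. This is a sketch; `EvalCase` and `recallAtK` are illustrative names, not a library API:

```typescript
// For each eval question, expectedIds lists the chunk ids that should be
// retrieved; retrievedIds is what the system actually returned, best-first.
type EvalCase = { expectedIds: string[]; retrievedIds: string[] };

// Fraction of expected chunks that appear in the top-k results,
// averaged across cases (recall@k).
function recallAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const perCase = cases.map(({ expectedIds, retrievedIds }) => {
    const topK = new Set(retrievedIds.slice(0, k));
    const hits = expectedIds.filter((id) => topK.has(id)).length;
    return expectedIds.length === 0 ? 1 : hits / expectedIds.length;
  });
  return perCase.reduce((a, b) => a + b, 0) / perCase.length;
}
```

Run this after every chunking or index change; a drop in recall@k usually explains a drop in answer quality.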

Add Query Rewriting Carefully

Query rewriting can help for short or ambiguous user questions, but it also creates risk if it changes the meaning of the question. Start simple before adding an extra rewrite stage.

Use Human Review for High-Stakes Domains

For internal docs, product help, and support, a grounded answer may be good enough. For legal, medical, finance, or contract workflows, keep a human-in-the-loop review step.

Common Pitfalls

Indexing Poorly Cleaned Documents

If the ingestion stage includes headers, footers, duplicate nav text, or OCR junk, retrieval quality drops immediately.

Over-Retrieving Context

More context is not always better. Too many chunks create noisy prompts and weaker answers.

No Fallback for Unknown Answers

If the model is always forced to answer, it will confidently invent information when retrieval is weak.

Ignoring Retrieval Metadata

If you cannot inspect which chunks were used, debugging becomes much harder.

Final Recommendations

If you want a production-ready RAG chatbot, optimize for reliability before adding complexity. A simple system with good chunking, strong metadata, clear grounding prompts, and visible citations will outperform a more complicated system with weak retrieval and no evaluation.

The most pragmatic stack for many teams is:

  • Next.js for the application layer
  • PostgreSQL with pgvector for vector storage
  • OpenAI embeddings for retrieval
  • streamed generation for UX
  • background ingestion jobs for indexing

That gets you a maintainable system quickly, and it leaves room to add reranking, feedback loops, or hybrid search later.

Next Steps

Once the basic chatbot works, useful next upgrades are:

  • hybrid search with keyword plus vector retrieval
  • chunk reranking
  • conversation memory
  • source citation UI
  • evaluation datasets for retrieval quality
  • admin tooling for ingestion and re-indexing

And if you are building broader AI automation systems, the natural next step is learning how to orchestrate multiple tools and services together in a reusable way, which is where multi-tool AI agents with MCP become relevant.


Article Author

Sadam Hussain, Senior Full Stack Developer with over 7 years of experience building React, Next.js, Node.js, TypeScript, and AI-powered web platforms.
