Published: November 01, 2025
Last updated: November 01, 2025

LLMOps: Evaluation, Tracing, and Monitoring

Master LLMOps practices for evaluating, tracing, and monitoring AI systems in production. Set up Langfuse observability and automated evaluation pipelines.

Tags: AI, LLMOps, Evaluation, Langfuse, Monitoring

LLMOps: Evaluation, Tracing, and Monitoring

This is Part 8 of the AI Automation Engineer Roadmap series.

TL;DR

LLMOps brings DevOps rigor to AI systems through systematic evaluation, real-time tracing, and continuous monitoring to maintain quality in production. Without it, you are flying blind -- shipping AI features with no way to know if they are actually working, how much they cost, or when they regress.

Why This Matters

Through Parts 1-7 of this series, we have built increasingly sophisticated AI systems -- from basic LLM calls to RAG pipelines to multi-agent orchestration. But here is the uncomfortable truth: none of that matters if you cannot measure whether your system is producing good results.

Traditional software has deterministic tests. Given input X, you expect output Y. LLM-powered systems are inherently non-deterministic. The same prompt can produce different outputs across runs. A model update can silently degrade quality. A subtle change in your retrieval pipeline can tank relevance without any errors in the logs.

LLMOps solves this by giving you three capabilities: evaluation (is the output good?), tracing (what happened during the request?), and monitoring (how is the system performing over time?). Together, they turn your AI application from a black box into an observable, measurable system.

Core Concepts

The Evaluation Challenge

Evaluating LLM outputs is fundamentally different from testing traditional software. You are not checking for exact matches -- you are assessing quality along multiple dimensions:

  • Relevance: Does the output actually answer the question?
  • Faithfulness: Is the output grounded in the provided context, or is the model hallucinating?
  • Completeness: Does the output cover all aspects of the question?
  • Harmlessness: Does the output avoid toxic, biased, or unsafe content?
  • Format compliance: Does the output follow the requested structure?

No single metric captures all of these. Effective evaluation requires combining automated metrics, LLM-as-judge assessments, and targeted human review.
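
One common pattern when combining these dimensions is to gate on the minimum score rather than the average, so a single failing dimension (say, faithfulness) cannot hide behind strong scores elsewhere. A minimal sketch -- the dimension names and the 0-1 scale here are illustrative assumptions, not a standard:

```typescript
// Illustrative multi-dimension score record; the names and the
// 0-1 scale are assumptions for this sketch, not a standard.
type DimensionScores = Record<string, number>;

// Pass only if every evaluated dimension clears the threshold --
// averaging would let one failing dimension hide behind the rest.
function passesGate(scores: DimensionScores, threshold = 0.7): boolean {
  return Object.values(scores).every((s) => s >= threshold);
}

const example: DimensionScores = {
  relevance: 0.9,
  faithfulness: 0.4, // hallucination detected
  completeness: 0.85,
};
console.log(passesGate(example)); // false: faithfulness fails the gate
```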

Golden Test Sets

A golden test set is a curated collection of input-output pairs that represent your critical use cases. Think of it as your regression test suite for AI. Each entry contains:

  • An input query or prompt
  • The expected context (for RAG systems)
  • A reference answer or acceptance criteria
  • Metadata about which aspects to evaluate

Building a good golden test set takes time, but it is the single most valuable investment in LLMOps. Start with 20-50 examples covering your most important scenarios, then grow it as you discover edge cases in production.
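
In code, a golden test entry can be a simple typed record mirroring the four fields above. The shape and field names here are one possible layout, not a standard, and the example values are invented:

```typescript
// One possible shape for a golden test entry; field names are
// illustrative and should match whatever your eval runner expects.
interface GoldenTestCase {
  id: string;
  input: string;
  expectedContext?: string; // for RAG systems
  referenceAnswer: string;
  criteria: string[]; // which aspects to evaluate
}

const example: GoldenTestCase = {
  id: "shipping-times",
  input: "How long does standard shipping take?",
  expectedContext: "Standard shipping: 5-7 business days.",
  referenceAnswer: "Standard shipping takes 5-7 business days.",
  criteria: ["relevance", "faithfulness"],
};
```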

Tracing and Observability

A single user request to an AI system can trigger dozens of internal operations: embedding generation, vector search, context assembly, prompt construction, LLM calls, tool execution, response parsing. Tracing captures this entire chain so you can debug issues, identify bottlenecks, and understand costs.

A good trace captures:

  • Each LLM call with its prompt, response, model, and token counts
  • Retrieval operations with query, results, and relevance scores
  • Tool calls with inputs and outputs
  • Latency at every step
  • Total cost for the request
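
Conceptually, a trace is a tree of timed spans, and totals like latency and cost fall out of the span data directly. A stripped-down model for intuition only -- the field names are not Langfuse's actual schema:

```typescript
// Minimal trace model for illustration only; real tools like
// Langfuse use a richer schema than this.
interface Span {
  name: string; // e.g. "vector-search", "llm-call"
  startMs: number;
  endMs: number;
  costUsd?: number;
}

interface Trace {
  traceId: string;
  spans: Span[];
}

// Request-level cost is just the sum over spans that carry a cost.
function totalCost(trace: Trace): number {
  return trace.spans.reduce((sum, s) => sum + (s.costUsd ?? 0), 0);
}

const trace: Trace = {
  traceId: "req-123",
  spans: [
    { name: "vector-search", startMs: 0, endMs: 120 },
    { name: "llm-call", startMs: 120, endMs: 950, costUsd: 0.004 },
  ],
};
console.log(totalCost(trace)); // 0.004
```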

RAGAS Metrics for RAG Systems

If you built a RAG pipeline following Part 3, you need RAG-specific evaluation metrics. RAGAS (Retrieval Augmented Generation Assessment) provides four key metrics:

  • Context Relevancy: How relevant is the retrieved context to the question? Irrelevant chunks waste context window space and can confuse the model.
  • Faithfulness: Is the answer actually supported by the retrieved context? This catches hallucinations where the model generates plausible-sounding answers not grounded in your data.
  • Answer Relevancy: Does the answer address the question that was asked? A faithful answer can still be irrelevant if it latches onto the wrong part of the context.
  • Context Recall: Does the retrieved context contain the information needed to answer the question? Low recall means your retrieval pipeline is missing relevant documents.
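
RAGAS itself is a Python library and its metrics are LLM-based, but the intuition behind faithfulness can be shown with a deliberately crude word-overlap proxy: what fraction of the answer's words appear in the retrieved context. This is a toy approximation for building intuition only -- real faithfulness metrics decompose the answer into claims and verify each one with an LLM:

```typescript
// Toy faithfulness proxy: fraction of answer words found in the
// context. NOT how RAGAS actually works -- real faithfulness
// extracts claims and verifies each with an LLM.
function naiveFaithfulness(answer: string, context: string): number {
  const tokenize = (s: string) => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  const contextWords = new Set(tokenize(context));
  const answerWords = tokenize(answer);
  if (answerWords.length === 0) return 0;
  const grounded = answerWords.filter((w) => contextWords.has(w));
  return grounded.length / answerWords.length;
}

const context = "Standard shipping takes 5 to 7 business days.";
console.log(naiveFaithfulness("Shipping takes 5 days.", context)); // 1
console.log(naiveFaithfulness("Shipping takes 10 days.", context)); // 0.75
```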

Hands-On Implementation

Setting Up Langfuse for Tracing

Langfuse is an open-source LLM observability platform that you can self-host or use as a managed service. Here is how to integrate it with a TypeScript AI application:

typescript
// lib/langfuse.ts
import { Langfuse } from "langfuse";
 
export const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
  secretKey: process.env.LANGFUSE_SECRET_KEY!,
  baseUrl: process.env.LANGFUSE_BASE_URL || "https://cloud.langfuse.com",
});
 
// Wrapper for traced LLM calls
export async function tracedLlmCall({
  traceId,
  name,
  model,
  messages,
  callLlm,
}: {
  traceId: string;
  name: string;
  model: string;
  messages: Array<{ role: string; content: string }>;
  callLlm: () => Promise<{
    content: string;
    usage: { promptTokens: number; completionTokens: number };
  }>;
}) {
  const trace = langfuse.trace({ id: traceId });
  const generation = trace.generation({
    name,
    model,
    input: messages,
  });
 
  const startTime = Date.now();
  try {
    const result = await callLlm();
 
    generation.end({
      output: result.content,
      usage: {
        promptTokens: result.usage.promptTokens,
        completionTokens: result.usage.completionTokens,
      },
      metadata: {
        latencyMs: Date.now() - startTime,
      },
    });
 
    return result;
  } catch (error) {
    generation.end({
      statusMessage: (error as Error).message,
      level: "ERROR",
    });
    throw error;
  }
}

Integrating Tracing with Vercel AI SDK

If you are using the Vercel AI SDK (common in Next.js applications), Langfuse integrates cleanly:

typescript
// app/api/chat/route.ts
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
import { langfuse } from "@/lib/langfuse";
 
export async function POST(req: Request) {
  const { messages, sessionId } = await req.json();
 
  const trace = langfuse.trace({
    name: "chat-completion",
    sessionId,
    metadata: {
      messageCount: messages.length,
      lastUserMessage: messages.at(-1)?.content?.slice(0, 100),
    },
  });
 
  const generation = trace.generation({
    name: "stream-response",
    model: "gpt-4o",
    input: messages,
  });
 
  const result = streamText({
    model: openai("gpt-4o"),
    messages,
    async onFinish({ text, usage }) {
      generation.end({
        output: text,
        usage: {
          promptTokens: usage.promptTokens,
          completionTokens: usage.completionTokens,
        },
      });
 
      // Score based on response length as a basic quality signal
      trace.score({
        name: "response_length",
        value: text.length > 50 ? 1 : 0,
        comment:
          text.length > 50
            ? "Sufficient response"
            : "Suspiciously short response",
      });
 
      // Await the flush so queued events are not dropped when the
      // serverless function suspends after the response is sent
      await langfuse.flush();
    },
  });
 
  return result.toDataStreamResponse();
}

Building an Evaluation Pipeline

Here is a comprehensive evaluation pipeline that runs your golden test set against your AI system and scores the results:

typescript
// evaluation/run-eval.ts
import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";
import { langfuse } from "@/lib/langfuse";
 
interface TestCase {
  id: string;
  input: string;
  expectedContext?: string;
  referenceAnswer: string;
  criteria: string[];
}
 
interface EvalResult {
  testCaseId: string;
  scores: Record<string, number>;
  reasoning: Record<string, string>;
  latencyMs: number;
  tokenUsage: { prompt: number; completion: number };
}
 
// LLM-as-judge evaluator
async function llmJudge(
  question: string,
  answer: string,
  reference: string,
  criterion: string
): Promise<{ score: number; reasoning: string }> {
  const { text } = await generateText({
    model: openai("gpt-4o"),
    prompt: `You are an expert evaluator. Score the following
answer on a scale of 1-5 for the criterion: ${criterion}.
 
Question: ${question}
Reference Answer: ${reference}
Actual Answer: ${answer}
 
Respond in JSON format:
{"score": <1-5>, "reasoning": "<brief explanation>"}`,
  });
 
  // Models sometimes wrap the JSON in extra text or code fences;
  // extract the object before parsing
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  return JSON.parse(text.slice(start, end + 1));
}
 
async function runEvaluation(
  testCases: TestCase[],
  generateAnswer: (input: string) => Promise<{
    answer: string;
    usage: { promptTokens: number; completionTokens: number };
  }>
): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
 
  for (const testCase of testCases) {
    const startTime = Date.now();
    const { answer, usage } = await generateAnswer(testCase.input);
    const latencyMs = Date.now() - startTime;
 
    const scores: Record<string, number> = {};
    const reasoning: Record<string, string> = {};
 
    // Run LLM-as-judge for each criterion
    for (const criterion of testCase.criteria) {
      const evaluation = await llmJudge(
        testCase.input,
        answer,
        testCase.referenceAnswer,
        criterion
      );
      scores[criterion] = evaluation.score;
      reasoning[criterion] = evaluation.reasoning;
    }
 
    // Log scores to Langfuse
    const trace = langfuse.trace({
      name: "evaluation",
      metadata: { testCaseId: testCase.id },
    });
 
    for (const [criterion, score] of Object.entries(scores)) {
      trace.score({
        name: criterion,
        value: score / 5, // Normalize to 0-1
        comment: reasoning[criterion],
      });
    }
 
    results.push({
      testCaseId: testCase.id,
      scores,
      reasoning,
      latencyMs,
      tokenUsage: {
        prompt: usage.promptTokens,
        completion: usage.completionTokens,
      },
    });
  }
 
  await langfuse.flush();
  return results;
}
 
// Example golden test set
const goldenTestSet: TestCase[] = [
  {
    id: "pricing-basic",
    input: "What are your pricing plans?",
    referenceAnswer:
      "We offer three plans: Starter ($29/mo), Pro ($99/mo), and Enterprise (custom pricing).",
    criteria: ["relevance", "completeness", "faithfulness"],
  },
  {
    id: "refund-policy",
    input: "How do I get a refund?",
    referenceAnswer:
      "Contact support within 30 days of purchase for a full refund. No questions asked.",
    criteria: ["relevance", "completeness", "helpfulness"],
  },
];
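
To feed the regression check later in this post, individual EvalResult records need to be rolled up into per-criterion averages. A sketch of that aggregation -- the shapes are restated locally (with slightly trimmed fields) so the snippet stands alone:

```typescript
// Shapes restated locally so this sketch is self-contained; they
// mirror the EvalResult / EvalRun interfaces used in this post.
interface EvalResultLite {
  testCaseId: string;
  scores: Record<string, number>;
}

interface EvalRunLite {
  timestamp: Date;
  averageScores: Record<string, number>;
  testCaseCount: number;
}

function summarizeRun(results: EvalResultLite[]): EvalRunLite {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const r of results) {
    for (const [criterion, score] of Object.entries(r.scores)) {
      sums[criterion] ??= { total: 0, count: 0 };
      sums[criterion].total += score;
      sums[criterion].count += 1;
    }
  }
  // Average each criterion over the test cases that evaluated it
  const averageScores: Record<string, number> = {};
  for (const [criterion, { total, count }] of Object.entries(sums)) {
    averageScores[criterion] = total / count;
  }
  return { timestamp: new Date(), averageScores, testCaseCount: results.length };
}

const run = summarizeRun([
  { testCaseId: "a", scores: { relevance: 4 / 5, faithfulness: 5 / 5 } },
  { testCaseId: "b", scores: { relevance: 2 / 5 } },
]);
console.log(run.averageScores.relevance); // ~0.6
```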

Cost Tracking Per Request

Track costs at the request level to identify expensive patterns and optimize spending:

typescript
// lib/cost-tracker.ts
const MODEL_COSTS: Record<
  string,
  { input: number; output: number }
> = {
  "gpt-4o": { input: 2.5 / 1_000_000, output: 10 / 1_000_000 },
  "gpt-4o-mini": { input: 0.15 / 1_000_000, output: 0.6 / 1_000_000 },
  "claude-sonnet-4-20250514": { input: 3 / 1_000_000, output: 15 / 1_000_000 },
};
 
export function calculateCost(
  model: string,
  promptTokens: number,
  completionTokens: number
): number {
  const costs = MODEL_COSTS[model];
  if (!costs) return 0;
  return (
    promptTokens * costs.input +
    completionTokens * costs.output
  );
}
 
export function trackRequestCost(
  trace: any,
  model: string,
  usage: { promptTokens: number; completionTokens: number }
) {
  const cost = calculateCost(
    model,
    usage.promptTokens,
    usage.completionTokens
  );
 
  trace.score({
    name: "cost_usd",
    value: cost,
    comment: `${model}: ${usage.promptTokens} in, ${usage.completionTokens} out`,
  });
 
  return cost;
}
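
As a quick sanity check on the arithmetic, per-request cost is just tokens multiplied by the per-token rate. Using the gpt-4o rates from the table above ($2.50 per million input tokens, $10 per million output tokens):

```typescript
// Sanity check for the cost math above, using the gpt-4o rates
// from the MODEL_COSTS table ($2.50 / 1M input, $10 / 1M output).
const inputRate = 2.5 / 1_000_000;
const outputRate = 10 / 1_000_000;

// A request with 1,000 prompt tokens and 500 completion tokens
const cost = 1000 * inputRate + 500 * outputRate;
console.log(cost.toFixed(4)); // "0.0075"
```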

Regression Detection

Set up automated regression detection that alerts you when evaluation scores drop:

typescript
// evaluation/regression-check.ts
interface EvalRun {
  timestamp: Date;
  averageScores: Record<string, number>;
  testCaseCount: number;
}
 
function detectRegression(
  current: EvalRun,
  baseline: EvalRun,
  threshold: number = 0.1
): { regressed: boolean; details: string[] } {
  const details: string[] = [];
  let regressed = false;
 
  for (const [criterion, currentScore] of Object.entries(
    current.averageScores
  )) {
    const baselineScore = baseline.averageScores[criterion];
    if (baselineScore === undefined) continue;
 
    const delta = currentScore - baselineScore;
    if (delta < -threshold) {
      regressed = true;
      details.push(
        `${criterion}: ${baselineScore.toFixed(2)} -> ${currentScore.toFixed(2)} (${(Math.abs(delta) * 100).toFixed(1)} point drop)`
      );
    }
  }
 
  return { regressed, details };
}
 
// Run in CI pipeline after prompt or retrieval changes.
// runEvaluationSuite and getLastPassingBaseline are placeholders
// for your own suite runner and baseline store.
async function ciEvalCheck() {
  const currentRun = await runEvaluationSuite();
  const baseline = await getLastPassingBaseline();
 
  const { regressed, details } = detectRegression(
    currentRun,
    baseline
  );
 
  if (regressed) {
    console.error("REGRESSION DETECTED:");
    details.forEach((d) => console.error(`  - ${d}`));
    process.exit(1);
  }
 
  console.log("All evaluation scores within threshold.");
}

Best Practices

  • Evaluate before shipping, not after complaints. Run your golden test set as part of your CI/CD pipeline. Catch regressions before they reach users.
  • Use LLM-as-judge for subjective criteria. Automated metrics like BLEU and ROUGE are cheap but correlate poorly with human judgment for generative tasks. LLM-as-judge evaluations are more expensive but significantly more meaningful.
  • Track cost per feature, not just per request. Aggregate costs by feature (chat, search, summarization) to understand which capabilities are driving your bill.
  • Set up alerting on latency p95, not just averages. Average latency hides tail cases where users wait 30+ seconds. Monitor p95 and p99 to catch the worst experiences.
  • Version your prompts alongside your code. When evaluation scores drop, you need to know which prompt version caused the regression. Langfuse's prompt management or a simple version string in your traces makes this traceable.
  • Sample human evaluations weekly. Even the best automated evaluation misses things. Review a random sample of production responses each week to calibrate your automated metrics.
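
The p95 alerting point above is easy to wire up once you collect latency samples. Percentile math is easy to get subtly wrong; here is a minimal nearest-rank helper (one of several common percentile definitions):

```typescript
// Nearest-rank percentile: sort ascending, take the value at
// rank ceil(p * n). One of several common percentile definitions.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length);
  return sorted[Math.min(rank, sorted.length) - 1];
}

// 100 synthetic latencies: 10ms, 20ms, ..., 1000ms
const latencies = Array.from({ length: 100 }, (_, i) => (i + 1) * 10);
console.log(percentile(latencies, 0.95)); // 950
console.log(percentile(latencies, 0.5)); // 500
```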

Common Pitfalls

  • Testing only the happy path. Your golden test set should include adversarial inputs, edge cases, ambiguous queries, and out-of-scope questions. These are where AI systems fail most visibly.
  • Treating evaluation as a one-time task. Models change, data drifts, user behavior evolves. Evaluation is a continuous process, not a launch checklist item.
  • Ignoring cost until the bill arrives. LLM costs can spike unexpectedly from prompt injection attacks, retry storms, or a viral feature. Set up cost alerts from day one.
  • Over-relying on a single evaluation metric. An answer can score high on relevance but hallucinate facts. Always evaluate across multiple dimensions.
  • Not correlating traces with user feedback. When a user reports a bad response, you should be able to pull up the exact trace showing what happened. Connect user feedback to trace IDs.

What's Next

Now that we can evaluate and monitor our AI systems, it is time to put them to work in real-world automation. In Part 9: AI Automation Workflows with n8n and LangChain, we will build end-to-end automation pipelines that combine visual workflow tools with custom AI code to automate complex business processes.

FAQ

What is LLMOps and why does it matter?

LLMOps applies operational best practices to LLM-powered applications, covering evaluation, tracing, monitoring, and cost management. It ensures AI systems remain reliable, performant, and cost-effective in production.

How do you evaluate LLM outputs systematically?

Use a combination of automated metrics (relevance, faithfulness, toxicity), LLM-as-judge evaluations for subjective quality, and human evaluation for edge cases. Build evaluation datasets that cover your key use cases.

What does Langfuse provide for LLM observability?

Langfuse provides request tracing, latency tracking, cost monitoring, prompt version management, and evaluation scoring. It integrates with LangChain, Vercel AI SDK, and direct API calls for comprehensive observability.


Article Author

Sadam Hussain, Senior Full Stack Developer with over 7 years of experience building React, Next.js, Node.js, TypeScript, and AI-powered web platforms.
