LLMOps: Evaluation, Tracing, and Monitoring
Master LLMOps practices for evaluating, tracing, and monitoring AI systems in production. Set up Langfuse observability and automated evaluation pipelines.
This is Part 8 of the AI Automation Engineer Roadmap series.
TL;DR
LLMOps brings DevOps rigor to AI systems through systematic evaluation, real-time tracing, and continuous monitoring to maintain quality in production. Without it, you are flying blind -- shipping AI features with no way to know if they are actually working, how much they cost, or when they regress.
Why This Matters
Through Parts 1-7 of this series, we have built increasingly sophisticated AI systems -- from basic LLM calls to RAG pipelines to multi-agent orchestration. But here is the uncomfortable truth: none of that matters if you cannot measure whether your system is producing good results.
Traditional software has deterministic tests. Given input X, you expect output Y. LLM-powered systems are inherently non-deterministic. The same prompt can produce different outputs across runs. A model update can silently degrade quality. A subtle change in your retrieval pipeline can tank relevance without any errors in the logs.
LLMOps solves this by giving you three capabilities: evaluation (is the output good?), tracing (what happened during the request?), and monitoring (how is the system performing over time?). Together, they turn your AI application from a black box into an observable, measurable system.
Core Concepts
The Evaluation Challenge
Evaluating LLM outputs is fundamentally different from testing traditional software. You are not checking for exact matches -- you are assessing quality along multiple dimensions:
- Relevance: Does the output actually answer the question?
- Faithfulness: Is the output grounded in the provided context, or is the model hallucinating?
- Completeness: Does the output cover all aspects of the question?
- Harmlessness: Does the output avoid toxic, biased, or unsafe content?
- Format compliance: Does the output follow the requested structure?
No single metric captures all of these. Effective evaluation requires combining automated metrics, LLM-as-judge assessments, and targeted human review.
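Of the dimensions above, format compliance is the one you can usually verify deterministically, with no LLM judge required. A minimal sketch (the function and field names are illustrative, not from any library):

```typescript
// Hypothetical deterministic check: verify a model response parses as
// JSON and contains the fields your downstream code expects.
interface FormatCheck {
  passed: boolean;
  errors: string[];
}

export function checkJsonFormat(
  raw: string,
  requiredFields: string[]
): FormatCheck {
  const errors: string[] = [];
  let parsed: Record<string, unknown>;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { passed: false, errors: ["Response is not valid JSON"] };
  }
  for (const field of requiredFields) {
    if (!(field in parsed)) {
      errors.push(`Missing required field: ${field}`);
    }
  }
  return { passed: errors.length === 0, errors };
}
```

Running cheap checks like this first lets you reserve the expensive LLM-as-judge calls for the subjective dimensions.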
Golden Test Sets
A golden test set is a curated collection of input-output pairs that represent your critical use cases. Think of it as your regression test suite for AI. Each entry contains:
- An input query or prompt
- The expected context (for RAG systems)
- A reference answer or acceptance criteria
- Metadata about which aspects to evaluate
Building a good golden test set takes time, but it is the single most valuable investment in LLMOps. Start with 20-50 examples covering your most important scenarios, then grow it as you discover edge cases in production.
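A lightweight way to manage the set is a JSON file checked into your repo, validated on load so a malformed entry fails fast instead of silently skewing your scores. A sketch, with an illustrative entry shape:

```typescript
// Hypothetical golden test set entry; the shape mirrors the fields
// described above but the exact names are illustrative.
interface GoldenEntry {
  id: string;
  input: string;
  referenceAnswer: string;
  criteria: string[];
}

// Sanity-check entries before an eval run: duplicate IDs and empty
// inputs are the most common authoring mistakes.
export function validateGoldenSet(entries: GoldenEntry[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const e of entries) {
    if (seen.has(e.id)) problems.push(`Duplicate id: ${e.id}`);
    seen.add(e.id);
    if (!e.input.trim()) problems.push(`${e.id}: empty input`);
    if (!e.criteria.length) problems.push(`${e.id}: no criteria`);
  }
  return problems;
}
```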
Tracing and Observability
A single user request to an AI system can trigger dozens of internal operations: embedding generation, vector search, context assembly, prompt construction, LLM calls, tool execution, response parsing. Tracing captures this entire chain so you can debug issues, identify bottlenecks, and understand costs.
A good trace captures:
- Each LLM call with its prompt, response, model, and token counts
- Retrieval operations with query, results, and relevance scores
- Tool calls with inputs and outputs
- Latency at every step
- Total cost for the request
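Conceptually, a trace is just an ordered list of spans that you can aggregate. A minimal sketch of that idea (the span shape is illustrative, not a Langfuse API):

```typescript
// Hypothetical span shape for one step of a request; real tracing
// platforms capture more, but this is enough to aggregate totals.
interface Span {
  name: string;
  latencyMs: number;
  promptTokens?: number;
  completionTokens?: number;
}

// Roll up per-step measurements into request-level totals.
export function summarizeTrace(spans: Span[]) {
  return spans.reduce(
    (acc, s) => ({
      totalLatencyMs: acc.totalLatencyMs + s.latencyMs,
      promptTokens: acc.promptTokens + (s.promptTokens ?? 0),
      completionTokens: acc.completionTokens + (s.completionTokens ?? 0),
    }),
    { totalLatencyMs: 0, promptTokens: 0, completionTokens: 0 }
  );
}
```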
RAGAS Metrics for RAG Systems
If you built a RAG pipeline following Part 3, you need RAG-specific evaluation metrics. RAGAS (Retrieval Augmented Generation Assessment) provides four key metrics:
- Context Relevancy: How relevant is the retrieved context to the question? Irrelevant chunks waste context window space and can confuse the model.
- Faithfulness: Is the answer actually supported by the retrieved context? This catches hallucinations where the model generates plausible-sounding answers not grounded in your data.
- Answer Relevancy: Does the answer address the question that was asked? A faithful answer can still be irrelevant if it latches onto the wrong part of the context.
- Context Recall: Does the retrieved context contain the information needed to answer the question? Low recall means your retrieval pipeline is missing relevant documents.
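To make one of these concrete: faithfulness is essentially the fraction of claims in the answer that the retrieved context supports. RAGAS (a Python library) uses an LLM to extract and verify the claims; this simplified TypeScript sketch assumes the verdicts are already given:

```typescript
// Simplified illustration of the faithfulness metric: the fraction of
// answer claims supported by the retrieved context. The claim
// extraction and verification steps (done by an LLM in RAGAS) are
// assumed to have happened already.
interface ClaimVerdict {
  claim: string;
  supportedByContext: boolean;
}

export function faithfulnessScore(verdicts: ClaimVerdict[]): number {
  if (verdicts.length === 0) return 0;
  const supported = verdicts.filter((v) => v.supportedByContext).length;
  return supported / verdicts.length;
}
```

An answer with three claims, one of them unsupported, scores 2/3; a fully grounded answer scores 1.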
Hands-On Implementation
Setting Up Langfuse for Tracing
Langfuse is an open-source LLM observability platform that you can self-host or use as a managed service. Here is how to integrate it with a TypeScript AI application:
// lib/langfuse.ts
import { Langfuse } from "langfuse";
export const langfuse = new Langfuse({
publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
secretKey: process.env.LANGFUSE_SECRET_KEY!,
baseUrl: process.env.LANGFUSE_BASE_URL || "https://cloud.langfuse.com",
});
// Wrapper for traced LLM calls
export async function tracedLlmCall({
traceId,
name,
model,
messages,
callLlm,
}: {
traceId: string;
name: string;
model: string;
messages: Array<{ role: string; content: string }>;
callLlm: () => Promise<{
content: string;
usage: { promptTokens: number; completionTokens: number };
}>;
}) {
const trace = langfuse.trace({ id: traceId });
const generation = trace.generation({
name,
model,
input: messages,
});
const startTime = Date.now();
try {
const result = await callLlm();
generation.end({
output: result.content,
usage: {
promptTokens: result.usage.promptTokens,
completionTokens: result.usage.completionTokens,
},
metadata: {
latencyMs: Date.now() - startTime,
},
});
return result;
} catch (error) {
generation.end({
statusMessage: (error as Error).message,
level: "ERROR",
});
throw error;
}
}
Integrating Tracing with Vercel AI SDK
If you are using the Vercel AI SDK (common in Next.js applications), Langfuse integrates cleanly:
// app/api/chat/route.ts
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
import { langfuse } from "@/lib/langfuse";
export async function POST(req: Request) {
const { messages, sessionId } = await req.json();
const trace = langfuse.trace({
name: "chat-completion",
sessionId,
metadata: {
messageCount: messages.length,
lastUserMessage: messages.at(-1)?.content?.slice(0, 100),
},
});
const generation = trace.generation({
name: "stream-response",
model: "gpt-4o",
input: messages,
});
const result = streamText({
model: openai("gpt-4o"),
messages,
onFinish({ text, usage }) {
generation.end({
output: text,
usage: {
promptTokens: usage.promptTokens,
completionTokens: usage.completionTokens,
},
});
// Score based on response length as a basic quality signal
trace.score({
name: "response_length",
value: text.length > 50 ? 1 : 0,
comment:
text.length > 50
? "Sufficient response"
: "Suspiciously short response",
});
      // Fire-and-forget; in serverless environments, prefer awaiting
      // langfuse.flushAsync() from an async onFinish so events are sent
      // before the function freezes
      langfuse.flush();
},
});
return result.toDataStreamResponse();
}
Building an Evaluation Pipeline
Here is a comprehensive evaluation pipeline that runs your golden test set against your AI system and scores the results:
// evaluation/run-eval.ts
import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";
import { langfuse } from "@/lib/langfuse";
interface TestCase {
id: string;
input: string;
expectedContext?: string;
referenceAnswer: string;
criteria: string[];
}
interface EvalResult {
testCaseId: string;
scores: Record<string, number>;
reasoning: Record<string, string>;
latencyMs: number;
tokenUsage: { prompt: number; completion: number };
}
// LLM-as-judge evaluator
async function llmJudge(
question: string,
answer: string,
reference: string,
criterion: string
): Promise<{ score: number; reasoning: string }> {
const { text } = await generateText({
model: openai("gpt-4o"),
prompt: `You are an expert evaluator. Score the following
answer on a scale of 1-5 for the criterion: ${criterion}.
Question: ${question}
Reference Answer: ${reference}
Actual Answer: ${answer}
Respond with only a JSON object in this format:
{"score": <1-5>, "reasoning": "<brief explanation>"}`,
  });
  // Models sometimes wrap JSON in markdown fences; strip them before parsing
  const cleaned = text.replace(/^```(?:json)?\s*|\s*```$/g, "").trim();
  return JSON.parse(cleaned);
}
async function runEvaluation(
testCases: TestCase[],
generateAnswer: (input: string) => Promise<{
answer: string;
usage: { promptTokens: number; completionTokens: number };
}>
): Promise<EvalResult[]> {
const results: EvalResult[] = [];
for (const testCase of testCases) {
const startTime = Date.now();
const { answer, usage } = await generateAnswer(testCase.input);
const latencyMs = Date.now() - startTime;
const scores: Record<string, number> = {};
const reasoning: Record<string, string> = {};
// Run LLM-as-judge for each criterion
for (const criterion of testCase.criteria) {
const evaluation = await llmJudge(
testCase.input,
answer,
testCase.referenceAnswer,
criterion
);
scores[criterion] = evaluation.score;
reasoning[criterion] = evaluation.reasoning;
}
// Log scores to Langfuse
const trace = langfuse.trace({
name: "evaluation",
metadata: { testCaseId: testCase.id },
});
for (const [criterion, score] of Object.entries(scores)) {
trace.score({
name: criterion,
value: score / 5, // Normalize to 0-1
comment: reasoning[criterion],
});
}
results.push({
testCaseId: testCase.id,
scores,
reasoning,
latencyMs,
tokenUsage: {
prompt: usage.promptTokens,
completion: usage.completionTokens,
},
});
}
  // flushAsync returns a promise, so the process can wait for events to send
  await langfuse.flushAsync();
return results;
}
// Example golden test set
const goldenTestSet: TestCase[] = [
{
id: "pricing-basic",
input: "What are your pricing plans?",
referenceAnswer:
"We offer three plans: Starter ($29/mo), Pro ($99/mo), and Enterprise (custom pricing).",
criteria: ["relevance", "completeness", "faithfulness"],
},
{
id: "refund-policy",
input: "How do I get a refund?",
referenceAnswer:
"Contact support within 30 days of purchase for a full refund. No questions asked.",
criteria: ["relevance", "completeness", "helpfulness"],
},
];
Cost Tracking Per Request
Track costs at the request level to identify expensive patterns and optimize spending:
// lib/cost-tracker.ts
const MODEL_COSTS: Record<
string,
{ input: number; output: number }
> = {
"gpt-4o": { input: 2.5 / 1_000_000, output: 10 / 1_000_000 },
"gpt-4o-mini": { input: 0.15 / 1_000_000, output: 0.6 / 1_000_000 },
"claude-sonnet-4-20250514": { input: 3 / 1_000_000, output: 15 / 1_000_000 },
};
export function calculateCost(
model: string,
promptTokens: number,
completionTokens: number
): number {
const costs = MODEL_COSTS[model];
if (!costs) return 0;
return (
promptTokens * costs.input +
completionTokens * costs.output
);
}
export function trackRequestCost(
trace: any,
model: string,
usage: { promptTokens: number; completionTokens: number }
) {
const cost = calculateCost(
model,
usage.promptTokens,
usage.completionTokens
);
trace.score({
name: "cost_usd",
value: cost,
comment: `${model}: ${usage.promptTokens} in, ${usage.completionTokens} out`,
});
return cost;
}
Regression Detection
Set up automated regression detection that alerts you when evaluation scores drop:
// evaluation/regression-check.ts
interface EvalRun {
timestamp: Date;
averageScores: Record<string, number>;
testCaseCount: number;
}
function detectRegression(
current: EvalRun,
baseline: EvalRun,
threshold: number = 0.1
): { regressed: boolean; details: string[] } {
const details: string[] = [];
let regressed = false;
for (const [criterion, currentScore] of Object.entries(
current.averageScores
)) {
const baselineScore = baseline.averageScores[criterion];
if (baselineScore === undefined) continue;
const delta = currentScore - baselineScore;
if (delta < -threshold) {
regressed = true;
      details.push(
        // Scores are normalized to 0-1, so the delta is reported in points
        `${criterion}: ${baselineScore.toFixed(2)} -> ${currentScore.toFixed(2)} (${Math.abs(delta * 100).toFixed(1)} point drop)`
      );
}
}
return { regressed, details };
}
// Run in CI pipeline after prompt or retrieval changes.
// runEvaluationSuite and getLastPassingBaseline are placeholders for
// your own eval runner and baseline storage.
async function ciEvalCheck() {
const currentRun = await runEvaluationSuite();
const baseline = await getLastPassingBaseline();
const { regressed, details } = detectRegression(
currentRun,
baseline
);
if (regressed) {
console.error("REGRESSION DETECTED:");
details.forEach((d) => console.error(` - ${d}`));
process.exit(1);
}
console.log("All evaluation scores within threshold.");
}
Best Practices
- Evaluate before shipping, not after complaints. Run your golden test set as part of your CI/CD pipeline. Catch regressions before they reach users.
- Use LLM-as-judge for subjective criteria. Automated metrics like BLEU and ROUGE are cheap but correlate poorly with human judgment for generative tasks. LLM-as-judge evaluations are more expensive but significantly more meaningful.
- Track cost per feature, not just per request. Aggregate costs by feature (chat, search, summarization) to understand which capabilities are driving your bill.
- Set up alerting on latency p95, not just averages. Average latency hides tail cases where users wait 30+ seconds. Monitor p95 and p99 to catch the worst experiences.
- Version your prompts alongside your code. When evaluation scores drop, you need to know which prompt version caused the regression. Langfuse's prompt management or a simple version string in your traces makes this traceable.
- Sample human evaluations weekly. Even the best automated evaluation misses things. Review a random sample of production responses each week to calibrate your automated metrics.
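For the percentile alerting above, a small helper is all you need once you collect per-request latency samples. A sketch using the nearest-rank method:

```typescript
// Nearest-rank percentile over collected latency samples, for the
// p95/p99 monitoring described above. Input is copied before sorting
// so the caller's array is left untouched.
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(rank, sorted.length) - 1];
}
```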
Common Pitfalls
- Testing only the happy path. Your golden test set should include adversarial inputs, edge cases, ambiguous queries, and out-of-scope questions. These are where AI systems fail most visibly.
- Treating evaluation as a one-time task. Models change, data drifts, user behavior evolves. Evaluation is a continuous process, not a launch checklist item.
- Ignoring cost until the bill arrives. LLM costs can spike unexpectedly from prompt injection attacks, retry storms, or a viral feature. Set up cost alerts from day one.
- Over-relying on a single evaluation metric. An answer can score high on relevance but hallucinate facts. Always evaluate across multiple dimensions.
- Not correlating traces with user feedback. When a user reports a bad response, you should be able to pull up the exact trace showing what happened. Connect user feedback to trace IDs.
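Connecting feedback to traces can be as simple as turning a thumbs up/down into a score payload keyed by trace ID, which a feedback endpoint can then forward to your observability platform (Langfuse exposes a score API for this). The helper below is an illustrative sketch, not a library function:

```typescript
// Hypothetical sketch: translate a user's thumbs up/down into a score
// payload keyed by trace ID, so feedback lands on the exact trace that
// produced the response.
interface FeedbackScore {
  traceId: string;
  name: string;
  value: number;
  comment?: string;
}

export function buildFeedbackScore(
  traceId: string,
  helpful: boolean,
  comment?: string
): FeedbackScore {
  return {
    traceId,
    name: "user_feedback",
    value: helpful ? 1 : 0,
    comment,
  };
}
```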
What's Next
Now that we can evaluate and monitor our AI systems, it is time to put them to work in real-world automation. In Part 9: AI Automation Workflows with n8n and LangChain, we will build end-to-end automation pipelines that combine visual workflow tools with custom AI code to automate complex business processes.
FAQ
What is LLMOps and why does it matter?
LLMOps applies operational best practices to LLM-powered applications, covering evaluation, tracing, monitoring, and cost management. It ensures AI systems remain reliable, performant, and cost-effective in production.
How do you evaluate LLM outputs systematically?
Use a combination of automated metrics (relevance, faithfulness, toxicity), LLM-as-judge evaluations for subjective quality, and human evaluation for edge cases. Build evaluation datasets that cover your key use cases.
What does Langfuse provide for LLM observability?
Langfuse provides request tracing, latency tracking, cost monitoring, prompt version management, and evaluation scoring. It integrates with LangChain, Vercel AI SDK, and direct API calls for comprehensive observability.