From Prompt Engineering to Context Engineering
This is Part 2 of the AI Automation Engineer Roadmap series. If you have not read Part 1: Understanding LLMs, start there.
TL;DR
Context engineering goes beyond prompt tricks by systematically structuring the entire context window to produce reliable AI outputs at scale. This post covers zero-shot and few-shot prompting, Chain-of-Thought reasoning, the ReAct pattern, structured outputs with JSON mode and Zod schemas, and practical strategies for managing context windows in production.
Why This Matters
"Prompt engineering" sounds like you are crafting a clever sentence. In reality, building production AI systems requires engineering the entire context -- system prompts, user inputs, retrieved documents, conversation history, examples, and output constraints -- all within a finite token budget. The shift from "what prompt should I write?" to "how should I structure the context window?" is what separates hobby projects from production-grade AI features. The term "context engineering" captures this more accurately, and understanding it is the single highest-leverage skill for an AI automation engineer.
Core Concepts
Zero-Shot vs Few-Shot Prompting
Zero-shot means giving the model a task with no examples. It relies entirely on the model's training data to understand what you want:
// Zero-shot: The model figures out the format on its own
const zeroShotPrompt = "Classify this support ticket as 'billing', 'technical', or 'general': " +
"'My payment was charged twice last month'";

Few-shot means including examples in the prompt. This dramatically improves consistency because the model pattern-matches against your examples:
// Few-shot: Providing examples teaches the model your exact format
const fewShotPrompt = `Classify each support ticket. Respond with only the category.
Ticket: "I can't log into my account after resetting my password"
Category: technical
Ticket: "Can I get a refund for my annual subscription?"
Category: billing
Ticket: "What are your business hours?"
Category: general
Ticket: "My payment was charged twice last month"
Category:`;

Few-shot prompting is often the fastest way to improve output quality without changing models or adding complexity. Start with 3-5 examples that cover edge cases.
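Rather than hand-writing the prompt string, you can assemble it from labeled examples kept as data, which makes swapping edge cases in and out trivial. A minimal sketch (the helper name and shape are ours, not from any library):

```typescript
interface LabeledTicket {
  ticket: string;
  category: string;
}

// Build a few-shot classification prompt from labeled examples,
// ending with the new, unlabeled ticket for the model to complete.
function buildFewShotPrompt(examples: LabeledTicket[], newTicket: string): string {
  const header = "Classify each support ticket. Respond with only the category.";
  const shots = examples
    .map((e) => `Ticket: "${e.ticket}"\nCategory: ${e.category}`)
    .join("\n\n");
  return `${header}\n\n${shots}\n\nTicket: "${newTicket}"\nCategory:`;
}
```

Keeping examples as structured data also sets you up for later refinements, like selecting the most relevant examples per request.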
Chain-of-Thought (CoT) Prompting
Chain-of-Thought prompting asks the model to show its reasoning before giving a final answer. This significantly improves accuracy on tasks that require multi-step reasoning:
const cotSystemPrompt = `You are a pricing calculator for a SaaS product.
When calculating prices, think through each step:
1. Identify the base plan and its price
2. Apply any quantity discounts
3. Add or remove add-ons
4. Apply promotional discounts last
5. Show the final total
Always show your reasoning before the final answer.`;
const userMessage = "We need 25 seats on the Pro plan ($49/seat/mo) " +
"with the analytics add-on ($10/seat/mo). We have a 15% annual discount.";
// The model will break down the calculation step by step
// rather than jumping to a number that might be wrong

The key insight: when the model "thinks out loud," it is less likely to skip steps or make arithmetic errors. For classification tasks, you can ask it to reason first and then provide the classification on the final line, making it easy to parse.
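That final-line convention takes only a few lines to parse. A sketch (the function name is ours; it returns null when the last line is not a known label, so callers can retry or fall back):

```typescript
// Extract the label from a reason-then-answer response: the model
// reasons freely, then puts only the category on the last line.
function parseFinalLineLabel(output: string, allowed: string[]): string | null {
  const lines = output.trim().split("\n");
  const last = lines[lines.length - 1].trim().toLowerCase();
  return allowed.includes(last) ? last : null;
}
```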
The ReAct Pattern
ReAct (Reasoning + Acting) combines Chain-of-Thought with tool use. The model alternates between reasoning about what to do and taking actions:
Thought: The user wants flight prices from NYC to London. I need to search flights.
Action: searchFlights({ from: "NYC", to: "LON", date: "2025-06-15" })
Observation: Found 12 flights, cheapest is $450 on British Airways.
Thought: I have the results. Let me format them for the user.
Answer: The cheapest flight from NYC to London on June 15th is $450 on British Airways...
We will implement ReAct loops fully in Part 5: Building AI Agents. For now, understand that this pattern is the foundation of every AI agent framework.
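To make the shape of that loop concrete ahead of Part 5, here is a schematic control-flow sketch with a stubbed model call and a tool map. Everything here -- the names, the line-based `Action:`/`Answer:` format, the parsing -- is illustrative, not a real framework API:

```typescript
type Tool = (args: Record<string, unknown>) => Promise<string>;

// The ReAct loop: ask the model for its next step, and either return
// its final answer or run the requested tool and loop again with the
// observation appended to the transcript.
async function reactLoop(
  callModel: (transcript: string) => Promise<string>,
  tools: Record<string, Tool>,
  question: string,
  maxSteps = 5
): Promise<string> {
  let transcript = `Question: ${question}`;
  for (let i = 0; i < maxSteps; i++) {
    const step = await callModel(transcript);
    transcript += `\n${step}`;
    const answer = step.match(/Answer: (.*)/s);
    if (answer) return answer[1];
    const action = step.match(/Action: (\w+)\((.*)\)/s);
    if (action) {
      // A real implementation would validate the tool name and arguments
      const observation = await tools[action[1]](JSON.parse(action[2]));
      transcript += `\nObservation: ${observation}`;
    }
  }
  throw new Error("Max steps reached without an answer");
}
```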
System Prompts: Best Practices
The system prompt is your most powerful lever. It sets the model's persona, constraints, and behavioral rules. Here is a battle-tested structure:
function buildSystemPrompt(context: {
role: string;
rules: string[];
outputFormat: string;
examples?: string;
}): string {
return `# Role
${context.role}
# Rules
${context.rules.map((r, i) => `${i + 1}. ${r}`).join("\n")}
# Output Format
${context.outputFormat}
${context.examples ? `# Examples\n${context.examples}` : ""}`;
}
// Usage
const systemPrompt = buildSystemPrompt({
role: "You are a senior code reviewer analyzing TypeScript pull requests.",
rules: [
"Focus on bugs, security issues, and performance problems",
"Ignore stylistic preferences unless they impact readability",
"Rate severity as 'critical', 'warning', or 'info'",
"If the code is fine, say so briefly -- do not invent issues",
],
outputFormat: `Respond in JSON format:
{
"issues": [{ "line": number, "severity": string, "description": string }],
"summary": "One sentence overall assessment"
}`,
});

Structured Outputs with JSON Mode and Zod
Getting reliable JSON from LLMs is one of the most common requirements. Here is how to do it properly:
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";
// Define your output schema with Zod
const TicketAnalysis = z.object({
category: z.enum(["billing", "technical", "general", "urgent"]),
sentiment: z.enum(["positive", "negative", "neutral"]),
summary: z.string().describe("One sentence summary of the ticket"),
suggestedAction: z.string().describe("Recommended next step"),
priority: z.number().min(1).max(5).describe("1 = lowest, 5 = highest"),
});
type TicketAnalysis = z.infer<typeof TicketAnalysis>;
async function analyzeTicket(ticketText: string): Promise<TicketAnalysis> {
const openai = new OpenAI();
const response = await openai.beta.chat.completions.parse({
model: "gpt-4o",
messages: [
{
role: "system",
content: "Analyze support tickets and extract structured information.",
},
{ role: "user", content: ticketText },
],
response_format: zodResponseFormat(TicketAnalysis, "ticket_analysis"),
});
const parsed = response.choices[0].message.parsed;
if (!parsed) throw new Error("Failed to parse response");
return parsed;
}

With Anthropic, you achieve structured outputs via explicit instructions in the system prompt combined with Zod validation on the client side:
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
async function analyzeWithClaude(ticketText: string): Promise<TicketAnalysis> {
const anthropic = new Anthropic();
const response = await anthropic.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 512,
system: `Analyze support tickets. Respond ONLY with valid JSON matching this schema:
{
"category": "billing" | "technical" | "general" | "urgent",
"sentiment": "positive" | "negative" | "neutral",
"summary": "string",
"suggestedAction": "string",
"priority": 1-5
}`,
messages: [{ role: "user", content: ticketText }],
});
const text = response.content.find((b) => b.type === "text")?.text;
if (!text) throw new Error("No text in response");
// Strip markdown code fences in case the model wraps the JSON anyway,
// then parse and validate with Zod
const cleaned = text.replace(/^```(?:json)?\s*|\s*```\s*$/g, "").trim();
const parsed = JSON.parse(cleaned);
return TicketAnalysis.parse(parsed);
}

Hands-On Implementation
Context Window Management
In production, you will quickly run into context limits. Here is a practical approach to managing context windows:
import { encoding_for_model, type TiktokenModel } from "tiktoken";
function countTokens(text: string, model: TiktokenModel = "gpt-4o"): number {
const encoder = encoding_for_model(model);
const tokens = encoder.encode(text);
encoder.free();
return tokens.length;
}
interface ContextBudget {
system: number;
examples: number;
retrievedDocs: number;
conversationHistory: number;
userMessage: number;
reservedForOutput: number;
}
function buildManagedContext(config: {
maxContextTokens: number;
systemPrompt: string;
examples: string[];
retrievedDocs: string[];
conversationHistory: Array<{ role: string; content: string }>;
userMessage: string;
maxOutputTokens: number;
}): { messages: Array<{ role: string; content: string }>; budget: ContextBudget } {
const budget: ContextBudget = {
system: countTokens(config.systemPrompt),
examples: 0,
retrievedDocs: 0,
conversationHistory: 0,
userMessage: countTokens(config.userMessage),
reservedForOutput: config.maxOutputTokens,
};
let remainingTokens =
config.maxContextTokens - budget.system - budget.userMessage - budget.reservedForOutput;
// 1. Add examples (highest priority after system prompt)
const includedExamples: string[] = [];
for (const example of config.examples) {
const tokens = countTokens(example);
if (remainingTokens - tokens > 0) {
includedExamples.push(example);
budget.examples += tokens;
remainingTokens -= tokens;
}
}
// 2. Add retrieved documents
const includedDocs: string[] = [];
for (const doc of config.retrievedDocs) {
const tokens = countTokens(doc);
if (remainingTokens - tokens > 0) {
includedDocs.push(doc);
budget.retrievedDocs += tokens;
remainingTokens -= tokens;
}
}
// 3. Add conversation history (most recent first)
const includedHistory: Array<{ role: string; content: string }> = [];
for (const msg of [...config.conversationHistory].reverse()) {
const tokens = countTokens(msg.content);
if (remainingTokens - tokens > 0) {
includedHistory.unshift(msg);
budget.conversationHistory += tokens;
remainingTokens -= tokens;
} else {
break; // Stop adding history when budget is exhausted
}
}
// Assemble the final messages array
const systemContent = [
config.systemPrompt,
includedExamples.length > 0
? "\n## Examples\n" + includedExamples.join("\n---\n")
: "",
includedDocs.length > 0
? "\n## Relevant Documents\n" + includedDocs.join("\n---\n")
: "",
]
.filter(Boolean)
.join("\n");
return {
messages: [
{ role: "system", content: systemContent },
...includedHistory,
{ role: "user", content: config.userMessage },
],
budget,
};
}

Prompt Templates and Versioning
Do not hardcode prompts as string literals scattered across your codebase. Treat them as versioned configuration:
// prompts/ticket-classifier.ts
export const TICKET_CLASSIFIER = {
version: "1.3.0",
model: "gpt-4o-mini" as const,
temperature: 0,
systemPrompt: `You are a support ticket classifier for an e-commerce platform.
# Categories
- billing: Payment issues, refunds, charges, invoices
- technical: Bugs, errors, login problems, performance issues
- shipping: Delivery status, tracking, address changes
- general: Everything else
# Rules
1. Choose exactly ONE category
2. If a ticket spans multiple categories, pick the PRIMARY concern
3. Respond with only the category name, nothing else`,
// Track changes
changelog: [
{ version: "1.3.0", change: "Added shipping category" },
{ version: "1.2.0", change: "Added rule about multi-category tickets" },
{ version: "1.1.0", change: "Switched from gpt-4o to gpt-4o-mini" },
{ version: "1.0.0", change: "Initial version" },
],
};

Best Practices
- Start with few-shot before reaching for fine-tuning -- 3-5 good examples in the prompt solve most consistency issues.
- Use structured output schemas -- Zod + JSON mode eliminates fragile regex parsing of LLM output.
- Budget your context window explicitly -- Know exactly how many tokens each component uses. Surprises here cause silent failures.
- Version your prompts -- Treat prompts like code. Track changes, run evaluations against previous versions, and never edit production prompts without testing.
- Separate instructions from data -- Use clear delimiters (XML tags, markdown headers) between your instructions and the content the model should process.
- Fail gracefully on parse errors -- Even with JSON mode, always wrap parsing in try/catch with a retry or fallback.
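The last point fits in a small generic wrapper. A minimal sketch (the names and the deliberately simple retry policy are ours; a production version might add a corrective message on retry):

```typescript
// Retry an LLM call whose output must parse; if every attempt fails,
// return a caller-supplied fallback instead of throwing to the user.
async function withParseRetry<T>(
  generate: () => Promise<string>,
  parse: (raw: string) => T, // e.g. (raw) => MySchema.parse(JSON.parse(raw))
  opts: { retries: number; fallback: T }
): Promise<T> {
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    try {
      return parse(await generate());
    } catch {
      // Parse or validation failed; regenerate and try again
    }
  }
  return opts.fallback;
}
```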
Common Pitfalls
- Prompt spaghetti: Prompts scattered as string literals across your codebase become unmaintainable fast. Centralize them.
- Stuffing the context window: More context is not always better. Irrelevant context dilutes the signal and hurts accuracy.
- Ignoring token costs of examples: Five few-shot examples at 200 tokens each is 1000 tokens per request. At scale, this adds up.
- Not testing prompt changes: A "small tweak" to a system prompt can change output behavior across your entire application. Always run evaluations.
- Over-engineering prompt templates: Template systems with variable interpolation are useful. Full DSLs for prompt construction are usually overkill.
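The example-cost pitfall is easy to quantify with back-of-the-envelope arithmetic (the helper and the request volume below are illustrative):

```typescript
// Rough prompt overhead from few-shot examples alone, ignoring the
// rest of the context window.
function exampleOverhead(nExamples: number, tokensPerExample: number, requestsPerDay: number) {
  const perRequest = nExamples * tokensPerExample;
  return { perRequest, perDay: perRequest * requestsPerDay };
}
```

Five 200-token examples at 10,000 requests per day is 10 million prompt tokens per day spent on examples alone -- worth multiplying against your model's per-token pricing before you add a sixth example.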
What's Next
You now know how to structure context windows and get reliable outputs from LLMs. But what happens when the model needs information it was not trained on -- your company's documentation, product data, or customer records? In Part 3: Building RAG Pipelines, we will cover how to retrieve relevant documents and inject them into context for accurate, grounded responses.
FAQ
What is the difference between prompt engineering and context engineering?
Prompt engineering focuses on crafting individual prompts, while context engineering involves systematically designing the entire context window including system prompts, examples, retrieved data, and conversation history for consistent results. Context engineering treats the full context as an engineered artifact with explicit token budgets, versioning, and testing -- not just a clever sentence you hope works.
Why is context engineering important for production AI systems?
Production systems need reliable, repeatable outputs. Context engineering provides a structured approach to controlling AI behavior that scales better than ad-hoc prompt tweaking. When you have hundreds of different prompts running thousands of requests per day, you need systematic approaches to testing, versioning, and managing context -- not just "try different phrasings until it works."
What are the key techniques in context engineering?
Key techniques include few-shot example selection, dynamic context assembly, retrieval-augmented context, structured output formatting, and systematic context window management. The most impactful is usually combining few-shot examples with structured output schemas (Zod + JSON mode), which gives you consistency and type safety in a single pattern.