May 10, 2025
Last updated: May 10, 2025

From Prompt Engineering to Context Engineering

Discover the shift from prompt engineering to context engineering. Learn how to structure context windows for reliable and consistent AI outputs.

Tags: AI, Prompt Engineering, Context Engineering, LLMs
5 min read


This is Part 2 of the AI Automation Engineer Roadmap series. If you have not read Part 1: Understanding LLMs, start there.

TL;DR

Context engineering goes beyond prompt tricks by systematically structuring the entire context window to produce reliable AI outputs at scale. This post covers zero-shot and few-shot prompting, Chain-of-Thought reasoning, the ReAct pattern, structured outputs with JSON mode and Zod schemas, and practical strategies for managing context windows in production.

Why This Matters

"Prompt engineering" sounds like you are crafting a clever sentence. In reality, building production AI systems requires engineering the entire context -- system prompts, user inputs, retrieved documents, conversation history, examples, and output constraints -- all within a finite token budget. The shift from "what prompt should I write?" to "how should I structure the context window?" is what separates hobby projects from production-grade AI features. The term "context engineering" captures this more accurately, and understanding it is the single highest-leverage skill for an AI automation engineer.

Core Concepts

Zero-Shot vs Few-Shot Prompting

Zero-shot means giving the model a task with no examples. It relies entirely on the model's training data to understand what you want:

typescript
// Zero-shot: The model figures out the format on its own
const zeroShotPrompt = "Classify this support ticket as 'billing', 'technical', or 'general': " +
  "'My payment was charged twice last month'";

Few-shot means including examples in the prompt. This dramatically improves consistency because the model pattern-matches against your examples:

typescript
// Few-shot: Providing examples teaches the model your exact format
const fewShotPrompt = `Classify each support ticket. Respond with only the category.
 
Ticket: "I can't log into my account after resetting my password"
Category: technical
 
Ticket: "Can I get a refund for my annual subscription?"
Category: billing
 
Ticket: "What are your business hours?"
Category: general
 
Ticket: "My payment was charged twice last month"
Category:`;

Few-shot prompting is often the fastest way to improve output quality without changing models or adding complexity. Start with 3-5 examples that cover edge cases.
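If you keep your example set in code, the prompt above can be assembled from data rather than hand-written strings, which makes it easy to add or swap examples later. A minimal sketch (the helper name and shape are illustrative, not from a library):

```typescript
// Illustrative helper: build a few-shot classification prompt from labeled examples.
interface LabeledExample {
  ticket: string;
  category: string;
}

function buildFewShotPrompt(examples: LabeledExample[], newTicket: string): string {
  const header = "Classify each support ticket. Respond with only the category.";
  // Each shot follows the same Ticket/Category pattern the model will imitate
  const shots = examples
    .map((e) => `Ticket: "${e.ticket}"\nCategory: ${e.category}`)
    .join("\n\n");
  // End with the open "Category:" so the model completes it
  return `${header}\n\n${shots}\n\nTicket: "${newTicket}"\nCategory:`;
}
```

Keeping examples as data also lets you measure how each one affects accuracy and token cost independently.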

Chain-of-Thought (CoT) Prompting

Chain-of-Thought prompting asks the model to show its reasoning before giving a final answer. This significantly improves accuracy on tasks that require multi-step reasoning:

typescript
const cotSystemPrompt = `You are a pricing calculator for a SaaS product.
 
When calculating prices, think through each step:
1. Identify the base plan and its price
2. Apply any quantity discounts
3. Add or remove add-ons
4. Apply promotional discounts last
5. Show the final total
 
Always show your reasoning before the final answer.`;
 
const userMessage = "We need 25 seats on the Pro plan ($49/seat/mo) " +
  "with the analytics add-on ($10/seat/mo). We have a 15% annual discount.";
 
// The model will break down the calculation step by step
// rather than jumping to a number that might be wrong

The key insight: when the model "thinks out loud," it is less likely to skip steps or make arithmetic errors. For classification tasks, you can ask it to reason first and then provide the classification on the final line, making it easy to parse.
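For that reason-then-classify pattern, the answer can be pulled out with a small parser. A hedged sketch, assuming you have instructed the model to put the category alone on its final line:

```typescript
// Sketch: extract a classification from the final non-empty line of a CoT response.
function parseFinalLine(response: string, allowed: string[]): string | null {
  const lines = response
    .trim()
    .split("\n")
    .map((l) => l.trim())
    .filter(Boolean);
  const last = lines[lines.length - 1]?.toLowerCase() ?? "";
  // Return the first allowed category found on the final line, or null
  return allowed.find((c) => last.includes(c)) ?? null;
}
```

Returning `null` instead of throwing lets the caller decide whether to retry or fall back.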

The ReAct Pattern

ReAct (Reasoning + Acting) combines Chain-of-Thought with tool use. The model alternates between reasoning about what to do and taking actions:

Thought: The user wants flight prices from NYC to London. I need to search flights.
Action: searchFlights({ from: "NYC", to: "LON", date: "2025-06-15" })
Observation: Found 12 flights, cheapest is $450 on British Airways.
Thought: I have the results. Let me format them for the user.
Answer: The cheapest flight from NYC to London on June 15th is $450 on British Airways...

We will implement ReAct loops fully in Part 5: Building AI Agents. For now, understand that this pattern is the foundation of every AI agent framework.
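As a small preview, here is an illustrative parser for the `Action:` line of such a trace. It assumes tool arguments are written as valid JSON, which is my simplification; production frameworks use native tool-call APIs rather than string parsing:

```typescript
// Illustrative: parse one "Action: toolName({...})" line from a ReAct-style trace.
interface ReActAction {
  tool: string;
  args: Record<string, unknown>;
}

function parseAction(line: string): ReActAction | null {
  // Match "Action: <name>(<args>)" -- args are assumed to be a JSON object
  const match = line.match(/^Action:\s*(\w+)\((.*)\)\s*$/);
  if (!match) return null;
  try {
    return { tool: match[1], args: JSON.parse(match[2]) as Record<string, unknown> };
  } catch {
    return null; // Malformed args -- let the loop re-prompt the model
  }
}
```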

System Prompts: Best Practices

The system prompt is your most powerful lever. It sets the model's persona, constraints, and behavioral rules. Here is a battle-tested structure:

typescript
function buildSystemPrompt(context: {
  role: string;
  rules: string[];
  outputFormat: string;
  examples?: string;
}): string {
  return `# Role
${context.role}
 
# Rules
${context.rules.map((r, i) => `${i + 1}. ${r}`).join("\n")}
 
# Output Format
${context.outputFormat}
 
${context.examples ? `# Examples\n${context.examples}` : ""}`;
}
 
// Usage
const systemPrompt = buildSystemPrompt({
  role: "You are a senior code reviewer analyzing TypeScript pull requests.",
  rules: [
    "Focus on bugs, security issues, and performance problems",
    "Ignore stylistic preferences unless they impact readability",
    "Rate severity as 'critical', 'warning', or 'info'",
    "If the code is fine, say so briefly -- do not invent issues",
  ],
  outputFormat: `Respond in JSON format:
{
  "issues": [{ "line": number, "severity": string, "description": string }],
  "summary": "One sentence overall assessment"
}`,
});

Structured Outputs with JSON Mode and Zod

Getting reliable JSON from LLMs is one of the most common requirements. Here is how to do it properly:

typescript
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";
 
// Define your output schema with Zod
const TicketAnalysis = z.object({
  category: z.enum(["billing", "technical", "general", "urgent"]),
  sentiment: z.enum(["positive", "negative", "neutral"]),
  summary: z.string().describe("One sentence summary of the ticket"),
  suggestedAction: z.string().describe("Recommended next step"),
  priority: z.number().min(1).max(5).describe("1 = lowest, 5 = highest"),
});
 
type TicketAnalysis = z.infer<typeof TicketAnalysis>;
 
async function analyzeTicket(ticketText: string): Promise<TicketAnalysis> {
  const openai = new OpenAI();
 
  const response = await openai.beta.chat.completions.parse({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: "Analyze support tickets and extract structured information.",
      },
      { role: "user", content: ticketText },
    ],
    response_format: zodResponseFormat(TicketAnalysis, "ticket_analysis"),
  });
 
  const parsed = response.choices[0].message.parsed;
  if (!parsed) throw new Error("Failed to parse response");
 
  return parsed;
}

With Anthropic, you achieve structured outputs via explicit instructions in the system prompt combined with Zod validation on the client side:

typescript
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
 
async function analyzeWithClaude(ticketText: string): Promise<TicketAnalysis> {
  const anthropic = new Anthropic();
 
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 512,
    system: `Analyze support tickets. Respond ONLY with valid JSON matching this schema:
{
  "category": "billing" | "technical" | "general" | "urgent",
  "sentiment": "positive" | "negative" | "neutral",
  "summary": "string",
  "suggestedAction": "string",
  "priority": 1-5
}`,
    messages: [{ role: "user", content: ticketText }],
  });
 
  const text = response.content.find((b) => b.type === "text")?.text;
  if (!text) throw new Error("No text in response");
 
  // Parse and validate with Zod
  const parsed = JSON.parse(text);
  return TicketAnalysis.parse(parsed);
}

Hands-On Implementation

Context Window Management

In production, you will quickly run into context limits. Here is a practical approach to managing context windows:

typescript
import { encoding_for_model, type TiktokenModel } from "tiktoken";
 
function countTokens(text: string, model: TiktokenModel = "gpt-4o"): number {
  // Note: creating an encoder is relatively expensive -- cache it in hot paths
  const encoder = encoding_for_model(model);
  const tokens = encoder.encode(text);
  encoder.free();
  return tokens.length;
}
 
interface ContextBudget {
  system: number;
  examples: number;
  retrievedDocs: number;
  conversationHistory: number;
  userMessage: number;
  reservedForOutput: number;
}
 
function buildManagedContext(config: {
  maxContextTokens: number;
  systemPrompt: string;
  examples: string[];
  retrievedDocs: string[];
  conversationHistory: Array<{ role: string; content: string }>;
  userMessage: string;
  maxOutputTokens: number;
}): { messages: Array<{ role: string; content: string }>; budget: ContextBudget } {
  const budget: ContextBudget = {
    system: countTokens(config.systemPrompt),
    examples: 0,
    retrievedDocs: 0,
    conversationHistory: 0,
    userMessage: countTokens(config.userMessage),
    reservedForOutput: config.maxOutputTokens,
  };
 
  let remainingTokens =
    config.maxContextTokens - budget.system - budget.userMessage - budget.reservedForOutput;
 
  // 1. Add examples (highest priority after system prompt)
  const includedExamples: string[] = [];
  for (const example of config.examples) {
    const tokens = countTokens(example);
    if (remainingTokens - tokens > 0) {
      includedExamples.push(example);
      budget.examples += tokens;
      remainingTokens -= tokens;
    }
  }
 
  // 2. Add retrieved documents
  const includedDocs: string[] = [];
  for (const doc of config.retrievedDocs) {
    const tokens = countTokens(doc);
    if (remainingTokens - tokens > 0) {
      includedDocs.push(doc);
      budget.retrievedDocs += tokens;
      remainingTokens -= tokens;
    }
  }
 
  // 3. Add conversation history (most recent first)
  const includedHistory: Array<{ role: string; content: string }> = [];
  for (const msg of [...config.conversationHistory].reverse()) {
    const tokens = countTokens(msg.content);
    if (remainingTokens - tokens > 0) {
      includedHistory.unshift(msg);
      budget.conversationHistory += tokens;
      remainingTokens -= tokens;
    } else {
      break; // Stop adding history when budget is exhausted
    }
  }
 
  // Assemble the final messages array
  const systemContent = [
    config.systemPrompt,
    includedExamples.length > 0
      ? "\n## Examples\n" + includedExamples.join("\n---\n")
      : "",
    includedDocs.length > 0
      ? "\n## Relevant Documents\n" + includedDocs.join("\n---\n")
      : "",
  ]
    .filter(Boolean)
    .join("\n");
 
  return {
    messages: [
      { role: "system", content: systemContent },
      ...includedHistory,
      { role: "user", content: config.userMessage },
    ],
    budget,
  };
}

Prompt Templates and Versioning

Do not hardcode prompts as string literals scattered across your codebase. Treat them as versioned configuration:

typescript
// prompts/ticket-classifier.ts
export const TICKET_CLASSIFIER = {
  version: "1.3.0",
  model: "gpt-4o-mini" as const,
  temperature: 0,
  systemPrompt: `You are a support ticket classifier for an e-commerce platform.
 
# Categories
- billing: Payment issues, refunds, charges, invoices
- technical: Bugs, errors, login problems, performance issues
- shipping: Delivery status, tracking, address changes
- general: Everything else
 
# Rules
1. Choose exactly ONE category
2. If a ticket spans multiple categories, pick the PRIMARY concern
3. Respond with only the category name, nothing else`,
 
  // Track changes
  changelog: [
    { version: "1.3.0", change: "Added shipping category" },
    { version: "1.2.0", change: "Added rule about multi-category tickets" },
    { version: "1.1.0", change: "Switched from gpt-4o to gpt-4o-mini" },
    { version: "1.0.0", change: "Initial version" },
  ],
};

Best Practices

  1. Start with few-shot before reaching for fine-tuning -- 3-5 good examples in the prompt solve most consistency issues.
  2. Use structured output schemas -- Zod + JSON mode eliminates fragile regex parsing of LLM output.
  3. Budget your context window explicitly -- Know exactly how many tokens each component uses. Surprises here cause silent failures.
  4. Version your prompts -- Treat prompts like code. Track changes, run evaluations against previous versions, and never edit production prompts without testing.
  5. Separate instructions from data -- Use clear delimiters (XML tags, markdown headers) between your instructions and the content the model should process.
  6. Fail gracefully on parse errors -- Even with JSON mode, always wrap parsing in try/catch with a retry or fallback.
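For the last point, a generic retry wrapper is usually enough. A sketch with exponential backoff (the helper name and defaults are illustrative):

```typescript
// Generic retry wrapper with exponential backoff for flaky LLM calls or parse failures.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Back off: 500ms, 1s, 2s, ... with the default base delay
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastError;
}
```

Wrap the whole call-then-validate sequence, not just the network call, so Zod validation failures also trigger a retry.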

Common Pitfalls

  • Prompt spaghetti: Prompts scattered as string literals across your codebase become unmaintainable fast. Centralize them.
  • Stuffing the context window: More context is not always better. Irrelevant context dilutes the signal and hurts accuracy.
  • Ignoring token costs of examples: Five few-shot examples at 200 tokens each is 1000 tokens per request. At scale, this adds up.
  • Not testing prompt changes: A "small tweak" to a system prompt can change output behavior across your entire application. Always run evaluations.
  • Over-engineering prompt templates: Template systems with variable interpolation are useful. Full DSLs for prompt construction are usually overkill.
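The token-cost pitfall above is easy to quantify up front. A back-of-envelope calculator (the price used below is a placeholder, not a current rate):

```typescript
// Estimate the daily dollar cost of fixed prompt overhead (e.g. few-shot examples).
function exampleTokenCost(
  examplesTokens: number,
  requestsPerDay: number,
  pricePerMillionInputTokens: number,
): number {
  return (examplesTokens * requestsPerDay * pricePerMillionInputTokens) / 1_000_000;
}

// 1000 example tokens * 100,000 requests/day at $2.50 per million input tokens = $250/day
const dailyCost = exampleTokenCost(1000, 100_000, 2.5);
```

Running this math before shipping tells you whether trimming two examples is worth the accuracy trade-off.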

What's Next

You now know how to structure context windows and get reliable outputs from LLMs. But what happens when the model needs information it was not trained on -- your company's documentation, product data, or customer records? In Part 3: Building RAG Pipelines, we will cover how to retrieve relevant documents and inject them into context for accurate, grounded responses.

FAQ

What is the difference between prompt engineering and context engineering?

Prompt engineering focuses on crafting individual prompts, while context engineering involves systematically designing the entire context window including system prompts, examples, retrieved data, and conversation history for consistent results. Context engineering treats the full context as an engineered artifact with explicit token budgets, versioning, and testing -- not just a clever sentence you hope works.

Why is context engineering important for production AI systems?

Production systems need reliable, repeatable outputs. Context engineering provides a structured approach to controlling AI behavior that scales better than ad-hoc prompt tweaking. When you have hundreds of different prompts running thousands of requests per day, you need systematic approaches to testing, versioning, and managing context -- not just "try different phrasings until it works."

What are the key techniques in context engineering?

Key techniques include few-shot example selection, dynamic context assembly, retrieval-augmented context, structured output formatting, and systematic context window management. The most impactful is usually combining few-shot examples with structured output schemas (Zod + JSON mode), which gives you consistency and type safety in a single pattern.


Article Author

Sadam Hussain

Senior Full Stack Developer

Senior Full Stack Developer with over 7 years of experience building React, Next.js, Node.js, TypeScript, and AI-powered web platforms.

Related Articles

AI Evaluation for Production Workflows
Mar 21, 2026 · 6 min read
Tags: AI, Evaluation, LLMOps

Learn how to evaluate AI workflows in production using task-based metrics, human review, regression checks, and business-aligned quality thresholds.

How to Build an AI Workflow in a Production SaaS App
Mar 21, 2026 · 7 min read
Tags: AI, SaaS, Workflows

A practical guide to designing and shipping AI workflows inside a production SaaS app, with orchestration, fallback logic, evaluation, and user trust considerations.

Building AI Features Safely: Guardrails, Fallbacks, and Human Review
Mar 21, 2026 · 6 min read
Tags: AI, LLM, Guardrails

A production guide to shipping AI features safely with guardrails, confidence thresholds, fallback paths, auditability, and human-in-the-loop review.