May 05, 2025
Last updated: May 05, 2025

Building an LLM Orchestration Layer for Interview Prep

How we built an LLM orchestration layer that chains multiple AI models for interview preparation, with prompt management, response streaming, cost optimization, and fallback strategies.

Tags

AI · LLM · OpenAI · Architecture
10 min read


TL;DR

An orchestration layer that routes prompts to different LLM providers based on task complexity reduced API costs significantly while maintaining response quality for interview preparation scenarios. We built a provider abstraction that unified OpenAI and Gemini behind a single interface, added intelligent routing, prompt versioning, streaming support, and automatic failover — turning a tightly coupled GPT integration into a resilient multi-model system.

The Challenge

The platform was an AI-powered interview preparation tool. Users could practice behavioral interviews, system design discussions, and coding challenges with an AI interviewer that adapted to their responses, provided real-time feedback, and generated detailed performance reports.

The initial implementation was a direct OpenAI integration. Every feature called the OpenAI API directly with hardcoded prompts scattered across the codebase. This created several problems that compounded as the platform grew.

First, cost. Every interaction — from generating a simple follow-up question to producing a detailed performance analysis — used GPT-4. A system design mock interview could consume 15-20 API calls, each using the most expensive model regardless of task complexity. Monthly API costs were growing linearly with users and threatening the unit economics.

Second, reliability. OpenAI had periodic rate limiting and occasional outages. When their API went down, our platform went down. There was no fallback, no graceful degradation. Users in the middle of a mock interview would hit an error wall.

Third, maintainability. Prompts were string literals in service files. Changing the tone of the interviewer meant finding and editing strings across multiple files. There was no way to A/B test prompt variations or roll back a prompt change that degraded quality. And because every feature was coupled to OpenAI's specific API shape, evaluating alternative providers meant rewriting every integration point.

The goal was an orchestration layer that decoupled the application from any specific LLM provider, routed requests intelligently based on task requirements, managed prompts as versioned artifacts, and handled failures gracefully.

The Architecture

Provider Abstraction

The foundation was a provider interface that normalized the differences between OpenAI and Gemini into a common contract. Each provider adapter translated between our internal representation and the provider's API.

```ts
// llm/providers/types.ts
export interface LLMMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}
 
export interface LLMRequest {
  messages: LLMMessage[];
  model: string;
  temperature?: number;
  maxTokens?: number;
  stream?: boolean;
}
 
export interface LLMResponse {
  content: string;
  model: string;
  provider: string;
  usage: {
    promptTokens: number;
    completionTokens: number;
    totalTokens: number;
  };
  latencyMs: number;
}
 
export interface LLMProvider {
  name: string;
  complete(request: LLMRequest): Promise<LLMResponse>;
  stream(request: LLMRequest): AsyncIterable<string>;
  isAvailable(): Promise<boolean>;
  estimateCost(request: LLMRequest): number;
}
```

```ts
// llm/providers/openai.ts
import OpenAI from 'openai';
import { LLMProvider, LLMRequest, LLMResponse } from './types';
 
export class OpenAIProvider implements LLMProvider {
  name = 'openai';
  private client: OpenAI;
 
  constructor(apiKey: string) {
    this.client = new OpenAI({ apiKey });
  }
 
  async complete(request: LLMRequest): Promise<LLMResponse> {
    const start = Date.now();
 
    const response = await this.client.chat.completions.create({
      model: request.model,
      messages: request.messages,
      temperature: request.temperature ?? 0.7,
      max_tokens: request.maxTokens,
    });
 
    return {
      content: response.choices[0]?.message?.content ?? '',
      model: request.model,
      provider: this.name,
      usage: {
        promptTokens: response.usage?.prompt_tokens ?? 0,
        completionTokens: response.usage?.completion_tokens ?? 0,
        totalTokens: response.usage?.total_tokens ?? 0,
      },
      latencyMs: Date.now() - start,
    };
  }
 
  async *stream(request: LLMRequest): AsyncIterable<string> {
    const response = await this.client.chat.completions.create({
      model: request.model,
      messages: request.messages,
      temperature: request.temperature ?? 0.7,
      max_tokens: request.maxTokens,
      stream: true,
    });
 
    for await (const chunk of response) {
      const content = chunk.choices[0]?.delta?.content;
      if (content) yield content;
    }
  }
 
  async isAvailable(): Promise<boolean> {
    try {
      await this.client.models.list();
      return true;
    } catch {
      return false;
    }
  }
 
  estimateCost(request: LLMRequest): number {
    const estimatedTokens = request.messages
      .reduce((sum, m) => sum + m.content.length / 4, 0);
    // GPT-4 pricing approximation per 1K tokens
    const rates: Record<string, number> = {
      'gpt-4': 0.03,
      'gpt-4-turbo': 0.01,
      'gpt-3.5-turbo': 0.0005,
    };
    return (estimatedTokens / 1000) * (rates[request.model] ?? 0.01);
  }
}
```

The Gemini adapter followed the same pattern, translating our LLMMessage format into Gemini's Content structure and normalizing the response back. The key was that the calling code never knew which provider was handling the request.
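
As an illustration, the core of that translation can be sketched as a pure mapping function. This is a simplified approximation, not the production adapter: `toGeminiContents` and the `GeminiContent` shape are stand-ins for the SDK's actual types, and `LLMMessage` is redeclared here so the sketch is self-contained.

```ts
// Sketch of the role/shape mapping a Gemini adapter performs.
// Gemini's API has no 'system' role in the message list — system
// instructions travel separately — and 'assistant' maps to 'model'.
interface LLMMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface GeminiContent {
  role: 'user' | 'model';
  parts: { text: string }[];
}

function toGeminiContents(messages: LLMMessage[]): {
  systemInstruction?: string;
  contents: GeminiContent[];
} {
  // Pull the system message out; it is not part of the content list.
  const systemInstruction = messages.find((m) => m.role === 'system')?.content;
  const contents = messages
    .filter((m) => m.role !== 'system')
    .map((m) => ({
      role: m.role === 'assistant' ? ('model' as const) : ('user' as const),
      parts: [{ text: m.content }],
    }));
  return { systemInstruction, contents };
}
```

The inverse direction — normalizing Gemini's response and usage metadata back into `LLMResponse` — followed the same idea in reverse.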

Intelligent Router

The router decided which provider and model handled each request. It considered the task type, the estimated complexity, cost constraints, and provider health.

```ts
// llm/router/TaskRouter.ts
import { LLMProvider, LLMRequest } from '../providers/types';
 
interface TaskConfig {
  taskType: string;
  preferredProvider: string;
  preferredModel: string;
  fallbackProvider: string;
  fallbackModel: string;
  maxLatencyMs: number;
  temperature: number;
}
 
const TASK_CONFIGS: Record<string, TaskConfig> = {
  'interview.followup': {
    taskType: 'interview.followup',
    preferredProvider: 'gemini',
    preferredModel: 'gemini-1.5-flash',
    fallbackProvider: 'openai',
    fallbackModel: 'gpt-3.5-turbo',
    maxLatencyMs: 3000,
    temperature: 0.8,
  },
  'interview.feedback': {
    taskType: 'interview.feedback',
    preferredProvider: 'openai',
    preferredModel: 'gpt-4-turbo',
    fallbackProvider: 'gemini',
    fallbackModel: 'gemini-1.5-pro',
    maxLatencyMs: 10000,
    temperature: 0.3,
  },
  'report.generation': {
    taskType: 'report.generation',
    preferredProvider: 'openai',
    preferredModel: 'gpt-4-turbo',
    fallbackProvider: 'gemini',
    fallbackModel: 'gemini-1.5-pro',
    maxLatencyMs: 30000,
    temperature: 0.2,
  },
  'interview.question': {
    taskType: 'interview.question',
    preferredProvider: 'gemini',
    preferredModel: 'gemini-1.5-flash',
    fallbackProvider: 'openai',
    fallbackModel: 'gpt-3.5-turbo',
    maxLatencyMs: 5000,
    temperature: 0.7,
  },
};
 
export class TaskRouter {
  private providers: Map<string, LLMProvider>;
  private healthStatus: Map<string, boolean> = new Map();
 
  constructor(providers: LLMProvider[]) {
    this.providers = new Map(providers.map((p) => [p.name, p]));
    this.startHealthChecks();
  }
 
  async route(
    taskType: string,
    messages: LLMRequest['messages'],
    forceFallback = false
  ): Promise<{
    provider: LLMProvider;
    request: LLMRequest;
  }> {
    const config = TASK_CONFIGS[taskType];
    if (!config) throw new Error(`Unknown task type: ${taskType}`);
 
    const primaryHealthy = this.healthStatus.get(config.preferredProvider) !== false;
    const usePrimary = primaryHealthy && !forceFallback;
    const providerName = usePrimary
      ? config.preferredProvider
      : config.fallbackProvider;
    const model = usePrimary
      ? config.preferredModel
      : config.fallbackModel;
 
    const provider = this.providers.get(providerName);
    if (!provider) throw new Error(`Provider not found: ${providerName}`);
 
    return {
      provider,
      request: {
        messages,
        model,
        temperature: config.temperature,
        stream: config.maxLatencyMs <= 5000, // Stream for latency-sensitive tasks
      },
    };
  }
 
  private startHealthChecks() {
    setInterval(async () => {
      for (const [name, provider] of this.providers) {
        this.healthStatus.set(name, await provider.isAvailable());
      }
    }, 30000);
  }
}
```

The routing logic was intentional about which tasks went where. Follow-up questions during an interview needed to be fast and conversational — Gemini Flash handled those well at a fraction of GPT-4's cost. Detailed performance feedback and report generation required deeper reasoning, so those went to GPT-4 Turbo. This task-based routing was the single biggest lever for cost reduction.

Prompt Management

Prompts were extracted from application code into versioned templates stored in a structured format. Each prompt had a unique identifier, version history, and variable slots that the runtime filled in.

```ts
// llm/prompts/PromptManager.ts
interface PromptTemplate {
  id: string;
  version: number;
  template: string;
  variables: string[];
  metadata: {
    taskType: string;
    description: string;
    createdAt: string;
    updatedAt: string;
  };
}
 
const PROMPTS: Record<string, PromptTemplate> = {
  'behavioral-interviewer': {
    id: 'behavioral-interviewer',
    version: 4,
    template: `You are a senior engineering manager conducting a behavioral interview.
The candidate is preparing for {{company}} interviews at the {{level}} level.
 
Your approach:
- Ask one question at a time
- Use the STAR method to probe for specifics
- If the answer is vague, ask a targeted follow-up
- Be conversational but professional
- After the candidate finishes a story, briefly acknowledge it before moving on
 
Current topic: {{topic}}
Questions asked so far: {{questionCount}}
 
Previous conversation context:
{{context}}
 
Generate the next interviewer response.`,
    variables: ['company', 'level', 'topic', 'questionCount', 'context'],
    metadata: {
      taskType: 'interview.question',
      description: 'Behavioral interview question generation',
      createdAt: '2024-08-01',
      updatedAt: '2025-01-15',
    },
  },
};
 
export class PromptManager {
  render(promptId: string, variables: Record<string, string>): string {
    const prompt = PROMPTS[promptId];
    if (!prompt) throw new Error(`Prompt not found: ${promptId}`);
 
    let rendered = prompt.template;
    for (const [key, value] of Object.entries(variables)) {
      rendered = rendered.replace(new RegExp(`\\{\\{${key}\\}\\}`, 'g'), value);
    }
 
    // Validate all variables were replaced
    const unreplaced = rendered.match(/\{\{(\w+)\}\}/g);
    if (unreplaced) {
      throw new Error(
        `Unreplaced variables in prompt ${promptId}: ${unreplaced.join(', ')}`
      );
    }
 
    return rendered;
  }
 
  getVersion(promptId: string): number {
    return PROMPTS[promptId]?.version ?? -1;
  }
}
```

This separation meant the product team could iterate on prompt wording without touching application code. Version numbers let us track which prompt version produced which user interactions, which was critical for quality analysis. When a prompt change degraded response quality, we could identify affected sessions by version and roll back.
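
The tracking itself can be as simple as stamping every stored interaction with the prompt id and version, then grouping sessions by version when comparing metrics. The sketch below is illustrative — `InteractionLog` is a hypothetical record shape, not the production schema:

```ts
// Hypothetical interaction record that stamps each LLM call with the
// prompt version that produced it, so sessions can be compared later.
interface InteractionLog {
  sessionId: string;
  promptId: string;
  promptVersion: number;
  model: string;
}

// Groups logged interactions by prompt version, e.g. to compare
// engagement metrics between version 3 and version 4 of a prompt.
function groupByPromptVersion(
  logs: InteractionLog[]
): Map<number, InteractionLog[]> {
  const groups = new Map<number, InteractionLog[]>();
  for (const log of logs) {
    const bucket = groups.get(log.promptVersion) ?? [];
    bucket.push(log);
    groups.set(log.promptVersion, bucket);
  }
  return groups;
}
```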

Streaming and Real-Time Delivery

For the interview conversation flow, streaming was essential. Users needed to see the AI interviewer's response appear token by token, just like a chat interface. Waiting 5-10 seconds for a complete response before displaying anything felt broken.

```ts
// llm/orchestrator/Orchestrator.ts
import { TaskRouter } from '../router/TaskRouter';
import { PromptManager } from '../prompts/PromptManager';
import { LLMResponse } from '../providers/types';
 
export class LLMOrchestrator {
  constructor(
    private router: TaskRouter,
    private promptManager: PromptManager
  ) {}
 
  async *streamInterviewResponse(
    promptId: string,
    variables: Record<string, string>,
    taskType: string
  ): AsyncIterable<string> {
    const systemPrompt = this.promptManager.render(promptId, variables);
    const messages = [{ role: 'system' as const, content: systemPrompt }];
 
    const { provider, request } = await this.router.route(taskType, messages);
 
    try {
      yield* provider.stream(request);
    } catch (error) {
      // Failover: re-route the same task, forcing the fallback provider
      console.error(`Primary provider failed for ${taskType}:`, error);
      const fallback = await this.router.route(taskType, messages, true);
      yield* fallback.provider.stream(fallback.request);
    }
  }
 
  async generateReport(
    promptId: string,
    variables: Record<string, string>
  ): Promise<LLMResponse> {
    const systemPrompt = this.promptManager.render(promptId, variables);
    const messages = [{ role: 'system' as const, content: systemPrompt }];
 
    const { provider, request } = await this.router.route(
      'report.generation',
      messages
    );
 
    const maxRetries = 3;
    let lastError: Error | null = null;
 
    for (let attempt = 0; attempt < maxRetries; attempt++) {
      try {
        return await provider.complete(request);
      } catch (error) {
        lastError = error as Error;
        // Exponential backoff
        await new Promise((r) => setTimeout(r, 1000 * Math.pow(2, attempt)));
      }
    }
 
    throw lastError;
  }
}
```

On the frontend, we consumed the stream through a Server-Sent Events endpoint that forwarded tokens from the orchestrator to the browser:

```ts
// api/interview/stream.ts (Next.js API route)
import { NextRequest } from 'next/server';
import { orchestrator } from '@/lib/llm';
 
export async function POST(request: NextRequest) {
  const { promptId, variables, taskType } = await request.json();
 
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      try {
        for await (const token of orchestrator.streamInterviewResponse(
          promptId,
          variables,
          taskType
        )) {
          controller.enqueue(encoder.encode(`data: ${JSON.stringify({ token })}\n\n`));
        }
        controller.enqueue(encoder.encode('data: [DONE]\n\n'));
        controller.close();
      } catch (error) {
        controller.error(error);
      }
    },
  });
 
  return new Response(stream, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      Connection: 'keep-alive',
    },
  });
}
```
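
On the client, each SSE frame carries either a JSON-encoded token or the `[DONE]` sentinel. A minimal parser for those frames might look like the sketch below; the browser-side reader loop that splits the response body into lines is omitted, and `parseSSELine` is a hypothetical helper rather than production code:

```ts
// Parses one SSE line from the streaming endpoint into a token.
// Returns null for non-data lines, malformed payloads, and [DONE].
function parseSSELine(line: string): string | null {
  if (!line.startsWith('data: ')) return null;
  const payload = line.slice('data: '.length);
  if (payload === '[DONE]') return null;
  try {
    const parsed = JSON.parse(payload) as { token?: string };
    return parsed.token ?? null;
  } catch {
    return null; // ignore malformed frames rather than break the stream
  }
}
```

The browser reads the response body with a `ReadableStream` reader, splits the buffer on blank lines, and appends each parsed token to the visible transcript.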

Key Decisions & Trade-offs

Task-based routing over user-tier routing. An alternative approach was routing based on the user's subscription tier — free users get cheaper models, paid users get GPT-4. We rejected this because it created a visible quality gap that hurt conversion. Instead, we routed based on task complexity. Every user got GPT-4 for detailed feedback and reports (the high-value moments), while conversational turns used faster, cheaper models. Users perceived consistent quality because the high-impact interactions were always high-quality.

Provider abstraction over direct multi-model calls. We could have simply added if (provider === 'gemini') branches everywhere. The abstraction layer was more upfront work, but it paid off within weeks when we needed to evaluate Claude as a third provider. Adding a new provider meant implementing one adapter class rather than touching every callsite.

Streaming for conversations, batch for reports. We streamed interview interactions for perceived speed but used synchronous completion for report generation. Reports needed the full response for post-processing (parsing structured sections, extracting scores), which was easier with a complete response than accumulating a stream. The latency tradeoff was acceptable because users expected report generation to take a few seconds.
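
For illustration, the report post-processing could be a simple section parser over the complete response. This is a hypothetical sketch assuming the prompt asks for markdown-style `## Heading` sections; the real parsing rules depended on each report prompt's output format:

```ts
// Splits a completed report into named sections using '## Heading'
// markers, so scores and structured fields can be extracted afterwards.
function parseReportSections(report: string): Record<string, string> {
  const sections: Record<string, string> = {};
  let current: string | null = null;
  const buffer: string[] = [];
  for (const line of report.split('\n')) {
    const match = line.match(/^## (.+)$/);
    if (match) {
      // Flush the previous section before starting a new one.
      if (current) sections[current] = buffer.join('\n').trim();
      current = match[1];
      buffer.length = 0;
    } else if (current) {
      buffer.push(line);
    }
  }
  if (current) sections[current] = buffer.join('\n').trim();
  return sections;
}
```

Running this over an accumulated stream would require buffering the whole response anyway, which is why batch completion was the simpler fit for reports.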

Template-based prompts over a database. Storing prompts in code (as typed constants) rather than a database meant prompt changes required a deployment. This was intentional. Prompts were the core product logic — a bad prompt could ruin user experience. Treating them as code meant they went through code review, had version control history, and could be rolled back with a git revert. The tradeoff was iteration speed, but the guardrails were worth it.

Health checks over circuit breakers. A full circuit breaker pattern (with half-open states and failure thresholds) would have been more robust. Our health check approach was simpler: poll providers every 30 seconds and mark them healthy or not. This worked because our scale didn't require sub-second failover. If a provider went down between health checks, the first request to it would fail, trigger an immediate failover, and the next health check would update the status. For our traffic volume, this was sufficient.
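
For comparison, the circuit breaker we decided against can be sketched in a few lines. The thresholds here are hypothetical, and the injectable clock exists only to make the sketch easy to exercise:

```ts
// Minimal circuit breaker: opens after `threshold` consecutive
// failures, then permits a trial request (half-open) once
// `cooldownMs` has elapsed.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private threshold = 3,
    private cooldownMs = 10_000,
    private now: () => number = Date.now
  ) {}

  canRequest(): boolean {
    if (this.openedAt === null) return true; // closed
    return this.now() - this.openedAt >= this.cooldownMs; // half-open
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null; // close the circuit again
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```

The extra states buy faster, per-request failure detection; the 30-second poll bought the same outcome at our traffic volume with far less machinery.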

Results & Outcomes

The most measurable outcome was cost reduction. By routing conversational turns to Gemini Flash and reserving GPT-4 Turbo for feedback and reports, API costs dropped substantially relative to the all-GPT-4 baseline. The exact savings varied month to month with usage patterns, but the cost per interview session became sustainable for the business model.

Reliability improved noticeably. Before the orchestration layer, an OpenAI outage meant a full platform outage. After, the router automatically failed over to Gemini. During one notable OpenAI incident that lasted several hours, the platform continued serving users on Gemini without any user-facing impact. The team received zero support tickets during that window.

Response quality remained consistent despite using cheaper models for some tasks. We ran blind quality evaluations where team members rated interview interactions without knowing which model generated them. The conversational turns from Gemini Flash scored comparably to GPT-3.5 Turbo for the follow-up question task, validating the routing strategy.

Developer velocity improved because the abstraction made LLM interactions predictable. Adding a new AI-powered feature meant defining a task type, writing a prompt template, configuring the routing, and calling the orchestrator. The boilerplate was gone. New features that previously took a week to integrate with the LLM were done in a day or two.

Prompt iteration became systematic. Each prompt had a version, and each user session recorded which prompt versions it used. When the team updated the behavioral interviewer prompt from version 3 to version 4, they could compare user engagement metrics across versions. This turned prompt engineering from guesswork into a measurable practice.

What I'd Do Differently

I'd add semantic caching from the start. Many interview preparation requests are similar — "tell me about a time you resolved a conflict" generates similar system prompts across users. An embedding-based cache that matches semantically similar prompts could serve cached responses for common scenarios, reducing both cost and latency. We built this later, but retrofitting it was harder than designing for it.
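
A semantic cache of that kind boils down to a similarity lookup over stored prompt embeddings. The sketch below shows the lookup half under simplifying assumptions — the embedding call itself would come from a provider and is not shown, and the `0.95` similarity threshold is illustrative:

```ts
// Embedding-based cache sketch: returns a cached response when some
// stored prompt's embedding is close enough to the incoming one.
interface CacheEntry {
  embedding: number[];
  response: string;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function lookupSemanticCache(
  queryEmbedding: number[],
  entries: CacheEntry[],
  minSimilarity = 0.95
): string | null {
  // Keep the best match at or above the threshold.
  let best: CacheEntry | null = null;
  let bestScore = minSimilarity;
  for (const entry of entries) {
    const score = cosineSimilarity(queryEmbedding, entry.embedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best?.response ?? null;
}
```

At scale, the linear scan would be replaced with a vector index, but the cache-hit semantics stay the same.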

I'd implement structured output parsing at the provider level. We relied on prompt instructions to get models to return JSON for structured tasks like scoring, but models occasionally returned malformed JSON. Adding a validation and retry layer within the provider abstraction — or using function calling / structured output features where available — would have eliminated an entire class of runtime errors.
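
A validation-and-retry wrapper of that kind might look like this sketch, where `complete` stands in for any provider call and the type-guard validator is task-specific (both are hypothetical names, not production APIs):

```ts
// Retries a completion until the output parses as JSON and passes a
// task-specific validator, up to `maxAttempts` tries.
async function completeWithValidJSON<T>(
  complete: () => Promise<string>,
  validate: (parsed: unknown) => parsed is T,
  maxAttempts = 3
): Promise<T> {
  let lastError: Error = new Error('no attempts made');
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = await complete();
    try {
      const parsed: unknown = JSON.parse(raw);
      if (validate(parsed)) return parsed;
      lastError = new Error('validation failed: shape mismatch');
    } catch (err) {
      lastError = err as Error; // malformed JSON — try again
    }
  }
  throw lastError;
}
```

Where a provider supports function calling or native structured output, that feature removes most of the retries; the wrapper remains useful as a last line of defense.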

I'd also build a prompt playground into the admin tooling. Prompt iteration happened through code changes, deployments, and monitoring production metrics. A playground where the product team could test prompt variations against sample inputs, compare outputs side-by-side, and then promote a version to production would have shortened the iteration cycle significantly.

FAQ

What is an LLM orchestration layer?

An LLM orchestration layer sits between your application and AI model providers, handling prompt routing, model selection, response caching, rate limiting, and fallback logic. It abstracts away provider-specific details and gives you a unified API for AI interactions. In our architecture, the application code called orchestrator.streamInterviewResponse() without knowing whether OpenAI or Gemini was handling the request. The orchestrator consulted the task router to pick the right provider and model, rendered the prompt template with the provided variables, handled streaming or batch completion, and managed retries and failover. This separation meant adding a new provider, changing routing rules, or updating prompts didn't require changes to application features. The orchestration layer was the single integration point between business logic and AI capabilities.

How do you optimize LLM costs in production?

Key strategies include routing simple tasks to cheaper models, caching frequent prompt-response pairs, using streaming to reduce timeout waste, implementing token budgets per user, and batching similar requests where latency allows. Our biggest cost lever was task-based routing. We profiled every LLM interaction in the platform, categorized them by complexity, and assigned each to the cheapest model that maintained acceptable quality. Conversational follow-ups went to Gemini Flash. Detailed feedback went to GPT-4 Turbo. Report generation used GPT-4 Turbo with tightly controlled max token limits. We also estimated costs before sending requests using token count approximations, which let us set per-session budgets and alert when interactions were consuming more tokens than expected. The combination of routing and monitoring made costs predictable and controllable.
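
The per-session budget mentioned above can be a small accounting object fed by the `usage` field each provider response already returns. A sketch, with illustrative thresholds:

```ts
// Tracks token spend for one interview session against a fixed
// budget, flagging when usage crosses an alert threshold.
class SessionTokenBudget {
  private used = 0;

  constructor(
    private limit: number,
    private alertRatio = 0.8 // alert at 80% of budget
  ) {}

  record(tokens: number): void {
    this.used += tokens;
  }

  shouldAlert(): boolean {
    return this.used >= this.limit * this.alertRatio;
  }

  isExceeded(): boolean {
    return this.used >= this.limit;
  }

  remaining(): number {
    return Math.max(0, this.limit - this.used);
  }
}
```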

How do you handle LLM provider outages?

The orchestration layer maintains a priority list of providers per task type. If the primary provider returns an error or exceeds latency thresholds, requests automatically failover to the next provider with compatible capabilities, ensuring uninterrupted service. Our implementation used periodic health checks (every 30 seconds) that proactively marked providers as available or unavailable. When a provider was marked unhealthy, the router immediately directed traffic to the fallback. For mid-request failures — where a request was sent but the provider returned an error — the orchestrator caught the exception, logged it, and retried with the fallback provider. For streaming requests, this meant the user might experience a brief pause during failover, but the stream resumed from the fallback provider. We also tracked provider performance metrics (latency, error rate, token throughput) over time, which helped us make informed decisions about default provider assignments for each task type.



Article Author

Sadam Hussain

Senior Full Stack Developer

Senior Full Stack Developer with over 7 years of experience building React, Next.js, Node.js, TypeScript, and AI-powered web platforms.
