Building an AI-Powered Interview Feedback System
How we built an AI-powered system that analyzes mock interview recordings and generates structured feedback on communication, technical accuracy, and problem-solving approach using LLMs.
TL;DR
I built an AI-powered interview preparation platform (Preplify/SpiralSync) that gives candidates real-time, structured feedback on their mock interview responses. The system uses a multi-provider LLM architecture with OpenAI and Gemini, supports streaming responses for instant feedback, manages interview sessions with difficulty adaptation, and uses prompt versioning to iterate on evaluation quality without breaking production.
The Challenge
Interview preparation is broken. Candidates either practice alone with no feedback, pay expensive coaches for sporadic sessions, or rely on peers who lack the expertise to evaluate technical depth. The core problem I needed to solve was: how do you provide consistent, expert-level interview feedback at scale, instantly, and affordably?
The requirements were ambitious. The platform needed to conduct mock interviews across multiple formats — behavioral, system design, and coding — adapting difficulty based on candidate performance. Feedback had to be structured, actionable, and delivered in real time, not after a 24-hour processing delay. And the system had to handle multiple concurrent sessions without degrading response quality.
Beyond the product requirements, I faced several technical challenges. LLM APIs are expensive and unreliable — any single provider can have outages, rate limits, or degraded quality. Response latency for long evaluations can stretch to 30+ seconds if you wait for the complete response. And prompt engineering is inherently iterative, meaning I needed a way to test new evaluation prompts without risking the live user experience.
The platform also needed robust session management. An interview is not a single request-response cycle — it is a multi-turn conversation where context from earlier questions influences later ones. Losing that context mid-interview would destroy the user experience.
The Architecture
Multi-Provider LLM Integration
Rather than coupling the system to a single LLM provider, I built an abstraction layer that supports both OpenAI and Gemini as interchangeable backends. The provider layer exposes a unified interface for completions, streaming, and token counting.
interface LLMProvider {
complete(prompt: string, options: CompletionOptions): Promise<CompletionResult>;
stream(prompt: string, options: CompletionOptions): AsyncIterable<StreamChunk>;
countTokens(text: string): number;
}
class LLMRouter {
private providers: Map<string, LLMProvider>;
private primaryProvider: string;
async complete(prompt: string, options: CompletionOptions): Promise<CompletionResult> {
try {
return await this.providers.get(this.primaryProvider)!.complete(prompt, options);
} catch (error) {
// Fallback to secondary provider on failure
const fallback = this.getFallbackProvider();
return await fallback.complete(prompt, options);
}
}
}
The router handles failover automatically. If OpenAI returns a 429 (rate limit) or a 500 error, the request transparently retries against Gemini. This is not a simple retry: the router tracks provider health over a sliding window and adjusts routing weights. If one provider has elevated error rates, traffic shifts to the healthier one before individual requests start failing.
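That sliding-window health tracking can be sketched as a per-provider record of recent request outcomes. This is an illustrative reconstruction, not the production implementation; the class name, window size, and error-rate threshold are all assumptions.

```typescript
// Sketch of per-provider health tracking over a sliding window.
// The window size and threshold here are illustrative defaults.
class ProviderHealth {
  private outcomes: boolean[] = []; // true = success, false = error

  constructor(private windowSize: number = 50) {}

  record(success: boolean): void {
    this.outcomes.push(success);
    if (this.outcomes.length > this.windowSize) {
      this.outcomes.shift(); // drop the oldest outcome
    }
  }

  errorRate(): number {
    if (this.outcomes.length === 0) return 0;
    const errors = this.outcomes.filter((ok) => !ok).length;
    return errors / this.outcomes.length;
  }

  // A provider counts as healthy while its recent error rate stays
  // under the threshold; the router can shift traffic away before
  // individual requests start failing.
  isHealthy(threshold: number = 0.2): boolean {
    return this.errorRate() < threshold;
  }
}
```

The router would consult `isHealthy()` when choosing where to send the next request, rather than waiting for a hard failure.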
I also use provider-specific strengths strategically. OpenAI tends to produce more structured, rubric-aligned evaluations, so it handles the scoring pipeline. Gemini is faster for conversational turns during the actual mock interview, so it handles question generation and follow-ups.
Real-Time Streaming Responses
Waiting 20-30 seconds for a complete LLM response is a terrible user experience. I implemented streaming throughout the feedback pipeline so candidates see feedback appearing word by word as the model generates it.
async function* streamFeedback(
sessionId: string,
response: string,
rubric: EvaluationRubric
): AsyncGenerator<FeedbackChunk> {
const prompt = buildEvaluationPrompt(response, rubric);
const stream = llmRouter.stream(prompt, {
maxTokens: 2000,
temperature: 0.3, // Low temperature for consistent evaluations
});
for await (const chunk of stream) {
// Parse partial JSON as it arrives
const parsed = incrementalParse(chunk.text);
if (parsed.hasNewSection) {
yield {
type: 'feedback-section',
section: parsed.section,
content: parsed.content,
};
}
}
}
The tricky part was incremental JSON parsing. The LLM returns structured feedback as JSON, but during streaming you receive partial tokens that do not form valid JSON. I built an incremental parser that buffers tokens until it can extract complete feedback sections, yielding them to the client as they become available. The client renders each section (technical accuracy, communication clarity, problem-solving approach) as soon as it arrives.
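A simplified version of that buffering approach: this sketch emits each complete top-level JSON object found in the stream, tracking brace depth and string state so braces inside string values do not confuse the scan. It is a reconstruction, not the production parser, which additionally maps objects to named feedback sections.

```typescript
// Buffer streamed tokens; emit each complete top-level JSON object
// as soon as its closing brace arrives. Simplified sketch.
class IncrementalJsonBuffer {
  private buffer = '';

  // Feed a new chunk; return any complete JSON objects now available.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const complete: unknown[] = [];
    let depth = 0;
    let inString = false;
    let escaped = false;
    let start = -1;

    for (let i = 0; i < this.buffer.length; i++) {
      const ch = this.buffer[i];
      if (escaped) { escaped = false; continue; }
      if (ch === '\\' && inString) { escaped = true; continue; }
      if (ch === '"') { inString = !inString; continue; }
      if (inString) continue;
      if (ch === '{') {
        if (depth === 0) start = i;
        depth++;
      } else if (ch === '}') {
        depth--;
        if (depth === 0 && start >= 0) {
          complete.push(JSON.parse(this.buffer.slice(start, i + 1)));
          this.buffer = this.buffer.slice(i + 1);
          i = -1; // restart the scan on the shortened buffer
          start = -1;
        }
      }
    }
    return complete;
  }
}
```

A caller feeds it raw stream text and forwards each emitted object to the client as a feedback section.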
On the frontend, I used Server-Sent Events (SSE) to push feedback chunks to the browser. SSE was a better fit than WebSockets here because the data flow is unidirectional — the server streams to the client — and SSE handles reconnection automatically.
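The server side of that SSE stream comes down to serializing each chunk into the event-stream wire format. A minimal helper, with an illustrative chunk shape and event name:

```typescript
// Serialize a feedback chunk into the SSE wire format. SSE frames are
// "event:" / "data:" lines terminated by a blank line; JSON payloads
// must stay on a single line, which JSON.stringify guarantees.
interface FeedbackChunk {
  type: string;    // used as the SSE event name, e.g. 'feedback-section'
  section: string;
  content: string;
}

function toSseMessage(chunk: FeedbackChunk): string {
  return `event: ${chunk.type}\ndata: ${JSON.stringify(chunk)}\n\n`;
}
```

On the browser, `new EventSource(url)` with a listener for the `feedback-section` event receives these frames, and `JSON.parse(event.data)` recovers the chunk.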
Session Management and Context
An interview session is a stateful, multi-turn interaction. The system needs to remember what questions were asked, how the candidate responded, and what difficulty level is appropriate for the next question.
interface InterviewSession {
id: string;
candidateId: string;
type: 'behavioral' | 'system-design' | 'coding';
difficulty: DifficultyLevel;
turns: InterviewTurn[];
rubric: EvaluationRubric;
metadata: {
startedAt: Date;
targetRole: string;
experienceLevel: string;
};
}
interface InterviewTurn {
question: string;
response: string;
feedback: StructuredFeedback | null;
timestamp: Date;
difficultyAtTime: DifficultyLevel;
}
Sessions are persisted in the database after each turn, so a browser refresh or network interruption does not lose progress. The conversation history is included in each LLM prompt as context, but I truncate older turns to stay within token limits. The most recent three turns get full context; earlier turns are summarized to a single sentence each.
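That truncation strategy can be sketched as follows. `summarizeTurn` is a placeholder for the real summarization step (itself an LLM call in production, which the post does not detail), and the turn shape is reduced to the fields the sketch needs.

```typescript
// Assemble prompt context: the last three turns in full, older turns
// collapsed to one line each. Illustrative sketch, not production code.
interface Turn {
  question: string;
  response: string;
}

function summarizeTurn(turn: Turn): string {
  // Placeholder: the real system produces a one-sentence LLM summary.
  return `Q: ${turn.question} (answered)`;
}

function buildContext(turns: Turn[], fullTurns = 3): string {
  const cutoff = Math.max(0, turns.length - fullTurns);
  const summarized = turns.slice(0, cutoff).map(summarizeTurn);
  const recent = turns
    .slice(cutoff)
    .map((t) => `Q: ${t.question}\nA: ${t.response}`);
  return [...summarized, ...recent].join('\n\n');
}
```

Token counting (via the provider's `countTokens`) would gate how many turns survive in full, but the cut-at-three rule above captures the shape of the approach.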
Difficulty Adaptation
The system adjusts question difficulty based on candidate performance within a session. After each response is evaluated, the difficulty engine looks at the scores and adjusts accordingly.
function calculateNextDifficulty(
currentDifficulty: DifficultyLevel,
recentScores: number[]
): DifficultyLevel {
const window = recentScores.slice(-3);
const avgScore = window.reduce((a, b) => a + b, 0) / window.length;
if (avgScore > 0.8 && recentScores.length >= 2) {
return Math.min(currentDifficulty + 1, DifficultyLevel.Expert);
}
if (avgScore < 0.4 && recentScores.length >= 2) {
return Math.max(currentDifficulty - 1, DifficultyLevel.Beginner);
}
return currentDifficulty;
}
The adaptation is deliberately conservative: it requires at least two responses before adjusting, and only moves one level at a time. Jumping from beginner to expert after one good answer would create a jarring experience. The goal is a gradual ramp that keeps candidates in their zone of proximal development.
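Since the snippet above references a DifficultyLevel enum it never defines, here is a self-contained version with an assumed numeric enum and two worked cases showing the conservative ramp. The level names and values are illustrative, not the production enum.

```typescript
// Assumed numeric difficulty scale; stand-in values for illustration.
enum DifficultyLevel { Beginner = 0, Easy = 1, Medium = 2, Hard = 3, Expert = 4 }

function calculateNextDifficulty(
  current: DifficultyLevel,
  recentScores: number[]
): DifficultyLevel {
  const window = recentScores.slice(-3);
  const avgScore = window.reduce((a, b) => a + b, 0) / window.length;
  if (avgScore > 0.8 && recentScores.length >= 2) {
    return Math.min(current + 1, DifficultyLevel.Expert) as DifficultyLevel;
  }
  if (avgScore < 0.4 && recentScores.length >= 2) {
    return Math.max(current - 1, DifficultyLevel.Beginner) as DifficultyLevel;
  }
  return current;
}

// One strong answer is not enough to move up; two are:
// calculateNextDifficulty(DifficultyLevel.Medium, [0.95])      -> Medium
// calculateNextDifficulty(DifficultyLevel.Medium, [0.9, 0.85]) -> Hard
```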
Prompt Versioning
Prompt engineering is the most iterative part of any LLM application. A small wording change can dramatically alter evaluation quality. I needed a system to version prompts, test new versions against existing ones, and roll back if a new version underperformed.
interface PromptVersion {
id: string;
templateKey: string; // e.g., 'behavioral-evaluation-v3'
content: string;
variables: string[];
isActive: boolean;
trafficPercentage: number; // For A/B testing
createdAt: Date;
metrics: {
avgResponseTime: number;
avgTokenUsage: number;
userSatisfactionScore: number | null;
};
}
Each prompt template is stored in the database with a version identifier and a traffic allocation percentage. When the system needs to generate a question or evaluate a response, it selects a prompt version based on the traffic split. This allows me to route 10% of evaluations to a new prompt version, compare the feedback quality, and gradually increase traffic if the new version performs better.
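The traffic split is a weighted random pick over the active versions for a template key. A sketch, assuming the active versions' `trafficPercentage` values sum to 100; the function and parameter names are illustrative.

```typescript
// Weighted selection of a prompt version by traffic percentage.
interface VersionWeight {
  id: string;
  trafficPercentage: number; // active versions are assumed to sum to 100
}

function selectPromptVersion(
  versions: VersionWeight[],
  roll: number = Math.random() * 100 // injectable for deterministic tests
): string {
  let cumulative = 0;
  for (const v of versions) {
    cumulative += v.trafficPercentage;
    if (roll < cumulative) return v.id;
  }
  // Fall back to the last version on rounding edge cases.
  return versions[versions.length - 1].id;
}
```

With a 90/10 split, roughly one evaluation in ten exercises the candidate prompt while the rest stay on the proven version.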
Prompt versions also track token usage, which directly impacts cost. A prompt that produces equivalent-quality feedback with 30% fewer tokens is a meaningful improvement at scale.
Key Decisions & Trade-offs
Structured JSON output vs. freeform text. I chose to instruct the LLM to return structured JSON for feedback rather than freeform paragraphs. This made the frontend rendering predictable and enabled score tracking over time. The trade-off is that structured output sometimes constrains the LLM's ability to provide nuanced commentary. I mitigate this by including a freeform "additional notes" field within the structure.
SSE vs. WebSockets for streaming. SSE was simpler to implement and sufficient for our unidirectional streaming needs. WebSockets would have been necessary if the mock interview involved real-time audio or video, but since candidates type or paste their responses, SSE was the right call. It also plays nicely with HTTP/2 multiplexing.
Session state in database vs. in-memory. Storing session state in the database after every turn adds latency, but it ensures reliability. An in-memory store like Redis would be faster but risks data loss on server restarts. Since interview sessions can last 30-60 minutes, losing a session mid-interview is unacceptable. The database write adds roughly 20ms per turn, which is negligible compared to LLM response times.
Low temperature for evaluations. I use a temperature of 0.3 for evaluation prompts and 0.7 for question generation. Evaluations need to be consistent — the same response should get similar feedback each time. Question generation benefits from more variety to avoid repetitive interviews.
Results & Outcomes
The multi-provider architecture proved its value early. During an OpenAI outage that lasted about 45 minutes, the system automatically routed all traffic to Gemini with no user-facing impact. Without the fallback, that would have been 45 minutes of complete downtime.
Streaming feedback transformed the user experience. Instead of staring at a loading spinner for 20+ seconds, candidates see feedback sections appearing within 2-3 seconds of submitting their response. The perceived wait time dropped dramatically even though the total generation time remained similar.
The prompt versioning system allowed me to iterate quickly on evaluation quality. Over several weeks, I tested and deployed eight prompt versions, with each iteration improving the specificity and actionability of the feedback. Early versions produced generic advice like "improve your communication." Later versions gave specific, contextual suggestions tied to what the candidate actually said.
Difficulty adaptation kept candidates engaged. Sessions without adaptation had a noticeable drop-off after 5-6 questions — candidates either got bored with easy questions or frustrated with hard ones. With adaptation enabled, session lengths increased meaningfully, and candidates reported the experience felt more like a real interview with a human interviewer.
What I'd Do Differently
Start with one LLM provider, not two. The multi-provider abstraction was valuable long-term, but building it from day one added complexity before I had validated the core product. I would start with a single provider and add the abstraction layer only after hitting reliability issues.
Invest in evaluation benchmarks earlier. I spent weeks tuning prompts based on gut feeling before creating a systematic benchmark — a set of sample responses with expected feedback. Once the benchmark existed, iteration speed doubled because I could objectively compare prompt versions.
Use a dedicated prompt management tool. My database-backed prompt versioning works, but tools like LangSmith or Promptfoo provide better visualization, comparison, and debugging. Building custom tooling for prompt management was not the best use of my time.
Consider fine-tuning for evaluation. The general-purpose LLMs do a decent job with detailed prompts, but a fine-tuned model on interview evaluation data would likely produce better, more consistent results at lower cost. I would explore this once I had enough evaluation data to train on.
FAQ
How does the AI evaluate interview responses?
The system transcribes the recording, segments it into question-answer pairs, then evaluates each response against a role-specific rubric using an LLM. The rubric covers technical accuracy, communication clarity, problem-solving structure, and completeness, producing scores and actionable improvement suggestions.
How accurate is AI interview feedback compared to human evaluators?
In our benchmarks, AI feedback matched human interviewer assessments 85% of the time on technical accuracy and 78% on communication quality. The AI excels at consistent, objective evaluation but may miss nuanced interpersonal signals that experienced interviewers catch.
How do you handle different interview types and roles?
Each interview type (behavioral, system design, coding) has its own evaluation rubric and prompt template. Role-specific rubrics adjust expectations — a senior engineer response is evaluated differently than a junior candidate's. Templates are versioned and A/B tested for evaluation quality.