Building an AI-Powered Interview Feedback System
How we built an AI-powered system that analyzes mock interview recordings and generates structured feedback on communication, technical accuracy, and problem-solving approach using LLMs.
TL;DR
I built an AI-powered interview preparation platform (Preplify/SpiralSync) that gives candidates real-time, structured feedback on their mock interview responses. The system uses a multi-provider LLM architecture with OpenAI and Gemini, supports streaming responses for instant feedback, manages interview sessions with difficulty adaptation, and uses prompt versioning to iterate on evaluation quality without breaking production.
The Challenge
Interview preparation is broken. Candidates either practice alone with no feedback, pay expensive coaches for sporadic sessions, or rely on peers who lack the expertise to evaluate technical depth. The core problem I needed to solve was: how do you provide consistent, expert-level interview feedback at scale, instantly, and affordably?
The requirements were ambitious. The platform needed to conduct mock interviews across multiple formats — behavioral, system design, and coding — adapting difficulty based on candidate performance. Feedback had to be structured, actionable, and delivered in real time, not after a 24-hour processing delay. And the system had to handle multiple concurrent sessions without degrading response quality.
Beyond the product requirements, I faced several technical challenges. LLM APIs are expensive and unreliable — any single provider can have outages, rate limits, or degraded quality. Response latency for long evaluations can stretch to 30+ seconds if you wait for the complete response. And prompt engineering is inherently iterative, meaning I needed a way to test new evaluation prompts without risking the live user experience.
The platform also needed robust session management. An interview is not a single request-response cycle — it is a multi-turn conversation where context from earlier questions influences later ones. Losing that context mid-interview would destroy the user experience.
The Architecture
Multi-Provider LLM Integration
Rather than coupling the system to a single LLM provider, I built an abstraction layer that supports both OpenAI and Gemini as interchangeable backends. The provider layer exposes a unified interface for completions, streaming, and token counting.
interface LLMProvider {
complete(prompt: string, options: CompletionOptions): Promise<CompletionResult>;
stream(prompt: string, options: CompletionOptions): AsyncIterable<StreamChunk>;
countTokens(text: string): number;
}
class LLMRouter {
private providers: Map<string, LLMProvider>;
private primaryProvider: string;
async complete(prompt: string, options: CompletionOptions): Promise<CompletionResult> {
try {
return await this.providers.get(this.primaryProvider)!.complete(prompt, options);
} catch (error) {
// Fallback to secondary provider on failure
const fallback = this.getFallbackProvider();
return await fallback.complete(prompt, options);
}
}
}
The router handles failover automatically. If OpenAI returns a 429 (rate limit) or a 500 error, the request transparently retries against Gemini. This is not a simple retry: the router tracks provider health over a sliding window and adjusts routing weights. If one provider has elevated error rates, traffic shifts to the healthier one before individual requests start failing.
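That sliding-window health tracking can be sketched as a per-provider record of recent request outcomes. This is an illustrative reconstruction, not the production implementation; the class name, window size, and error-rate threshold are all assumptions.

```typescript
// Sketch of per-provider health tracking over a sliding window.
// The window size and threshold here are illustrative defaults.
class ProviderHealth {
  private outcomes: boolean[] = []; // true = success, false = error

  constructor(private windowSize: number = 50) {}

  record(success: boolean): void {
    this.outcomes.push(success);
    if (this.outcomes.length > this.windowSize) {
      this.outcomes.shift(); // drop the oldest outcome
    }
  }

  errorRate(): number {
    if (this.outcomes.length === 0) return 0;
    const errors = this.outcomes.filter((ok) => !ok).length;
    return errors / this.outcomes.length;
  }

  // A provider counts as healthy while its recent error rate stays
  // under the threshold; the router can shift traffic away before
  // individual requests start failing.
  isHealthy(threshold: number = 0.2): boolean {
    return this.errorRate() < threshold;
  }
}
```

The router would consult `isHealthy()` when choosing where to send the next request, rather than waiting for a hard failure.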
I also use provider-specific strengths strategically. OpenAI tends to produce more structured, rubric-aligned evaluations, so it handles the scoring pipeline. Gemini is faster for conversational turns during the actual mock interview, so it handles question generation and follow-ups.
Real-Time Streaming Responses
Waiting 20-30 seconds for a complete LLM response is a terrible user experience. I implemented streaming throughout the feedback pipeline so candidates see feedback appearing word by word as the model generates it.
async function* streamFeedback(
sessionId: string,
response: string,
rubric: EvaluationRubric
): AsyncGenerator<FeedbackChunk> {
const prompt = buildEvaluationPrompt(response, rubric);
const stream = llmRouter.stream(prompt, {
maxTokens: 2000,
temperature: 0.3, // Low temperature for consistent evaluations
});
for await (const chunk of stream) {
// Parse partial JSON as it arrives
const parsed = incrementalParse(chunk.text);
if (parsed.hasNewSection) {
yield {
type: 'feedback-section',
section: parsed.section,
content: parsed.content,
};
}
}
}
The tricky part was incremental JSON parsing. The LLM returns structured feedback as JSON, but during streaming you receive partial tokens that do not form valid JSON. I built an incremental parser that buffers tokens until it can extract complete feedback sections, yielding them to the client as they become available. The client renders each section (technical accuracy, communication clarity, problem-solving approach) as soon as it arrives.
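A simplified version of that buffering approach: this sketch emits each complete top-level JSON object found in the stream, tracking brace depth and string state so braces inside string values do not confuse the scan. It is a reconstruction, not the production parser, which additionally maps objects to named feedback sections.

```typescript
// Buffer streamed tokens; emit each complete top-level JSON object
// as soon as its closing brace arrives. Simplified sketch.
class IncrementalJsonBuffer {
  private buffer = '';

  // Feed a new chunk; return any complete JSON objects now available.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const complete: unknown[] = [];
    let depth = 0;
    let inString = false;
    let escaped = false;
    let start = -1;

    for (let i = 0; i < this.buffer.length; i++) {
      const ch = this.buffer[i];
      if (escaped) { escaped = false; continue; }
      if (ch === '\\' && inString) { escaped = true; continue; }
      if (ch === '"') { inString = !inString; continue; }
      if (inString) continue;
      if (ch === '{') {
        if (depth === 0) start = i;
        depth++;
      } else if (ch === '}') {
        depth--;
        if (depth === 0 && start >= 0) {
          complete.push(JSON.parse(this.buffer.slice(start, i + 1)));
          this.buffer = this.buffer.slice(i + 1);
          i = -1; // restart the scan on the shortened buffer
          start = -1;
        }
      }
    }
    return complete;
  }
}
```

A caller feeds it raw stream text and forwards each emitted object to the client as a feedback section.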
On the frontend, I used Server-Sent Events (SSE) to push feedback chunks to the browser. SSE was a better fit than WebSockets here because the data flow is unidirectional — the server streams to the client — and SSE handles reconnection automatically.
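The server side of that SSE stream comes down to serializing each chunk into the event-stream wire format. A minimal helper, with an illustrative chunk shape and event name:

```typescript
// Serialize a feedback chunk into the SSE wire format. SSE frames are
// "event:" / "data:" lines terminated by a blank line; JSON payloads
// must stay on a single line, which JSON.stringify guarantees.
interface FeedbackChunk {
  type: string;    // used as the SSE event name, e.g. 'feedback-section'
  section: string;
  content: string;
}

function toSseMessage(chunk: FeedbackChunk): string {
  return `event: ${chunk.type}\ndata: ${JSON.stringify(chunk)}\n\n`;
}
```

On the browser, `new EventSource(url)` with a listener for the `feedback-section` event receives these frames, and `JSON.parse(event.data)` recovers the chunk.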
Session Management and Context
An interview session is a stateful, multi-turn interaction. The system needs to remember what questions were asked, how the candidate responded, and what difficulty level is appropriate for the next question.
interface InterviewSession {
id: string;
candidateId: string;
type: 'behavioral' | 'system-design' | 'coding';
difficulty: DifficultyLevel;
turns: InterviewTurn[];
rubric: EvaluationRubric;
metadata: {
startedAt: Date;
targetRole: string;
experienceLevel: string;
};
}
interface InterviewTurn {
question: string;
response: string;
feedback: StructuredFeedback | null;
timestamp: Date;
difficultyAtTime: DifficultyLevel;
}
Sessions are persisted in the database after each turn, so a browser refresh or network interruption does not lose progress. The conversation history is included in each LLM prompt as context, but I truncate older turns to stay within token limits. The most recent three turns get full context; earlier turns are summarized to a single sentence each.
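That truncation strategy can be sketched as follows. `summarizeTurn` is a placeholder for the real summarization step (itself an LLM call in production, which the post does not detail), and the turn shape is reduced to the fields the sketch needs.

```typescript
// Assemble prompt context: the last three turns in full, older turns
// collapsed to one line each. Illustrative sketch, not production code.
interface Turn {
  question: string;
  response: string;
}

function summarizeTurn(turn: Turn): string {
  // Placeholder: the real system produces a one-sentence LLM summary.
  return `Q: ${turn.question} (answered)`;
}

function buildContext(turns: Turn[], fullTurns = 3): string {
  const cutoff = Math.max(0, turns.length - fullTurns);
  const summarized = turns.slice(0, cutoff).map(summarizeTurn);
  const recent = turns
    .slice(cutoff)
    .map((t) => `Q: ${t.question}\nA: ${t.response}`);
  return [...summarized, ...recent].join('\n\n');
}
```

Token counting (via the provider's `countTokens`) would gate how many turns survive in full, but the cut-at-three rule above captures the shape of the approach.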
Difficulty Adaptation
The system adjusts question difficulty based on candidate performance within a session. After each response is evaluated, the difficulty engine looks at the scores and adjusts accordingly.
function calculateNextDifficulty(
currentDifficulty: DifficultyLevel,
recentScores: number[]
): DifficultyLevel {
const window = recentScores.slice(-3);
const avgScore = window.reduce((a, b) => a + b, 0) / window.length;
if (avgScore > 0.8 && recentScores.length >= 2) {
return Math.min(currentDifficulty + 1, DifficultyLevel.Expert);
}
if (avgScore < 0.4 && recentScores.length >= 2) {
return Math.max(currentDifficulty - 1, DifficultyLevel.Beginner);
}
return currentDifficulty;
}
The adaptation is deliberately conservative: it requires at least two responses before adjusting, and only moves one level at a time. Jumping from beginner to expert after one good answer would create a jarring experience. The goal is a gradual ramp that keeps candidates in their zone of proximal development.
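Since the snippet above references a DifficultyLevel enum it never defines, here is a self-contained version with an assumed numeric enum and two worked cases showing the conservative ramp. The level names and values are illustrative, not the production enum.

```typescript
// Assumed numeric difficulty scale; stand-in values for illustration.
enum DifficultyLevel { Beginner = 0, Easy = 1, Medium = 2, Hard = 3, Expert = 4 }

function calculateNextDifficulty(
  current: DifficultyLevel,
  recentScores: number[]
): DifficultyLevel {
  const window = recentScores.slice(-3);
  const avgScore = window.reduce((a, b) => a + b, 0) / window.length;
  if (avgScore > 0.8 && recentScores.length >= 2) {
    return Math.min(current + 1, DifficultyLevel.Expert) as DifficultyLevel;
  }
  if (avgScore < 0.4 && recentScores.length >= 2) {
    return Math.max(current - 1, DifficultyLevel.Beginner) as DifficultyLevel;
  }
  return current;
}

// One strong answer is not enough to move up; two are:
// calculateNextDifficulty(DifficultyLevel.Medium, [0.95])      -> Medium
// calculateNextDifficulty(DifficultyLevel.Medium, [0.9, 0.85]) -> Hard
```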
Prompt Versioning
Prompt engineering is the most iterative part of any LLM application. A small wording change can dramatically alter evaluation quality. I needed a system to version prompts, test new versions against existing ones, and roll back if a new version underperformed.
interface PromptVersion {
id: string;
templateKey: string; // e.g., 'behavioral-evaluation-v3'
content: string;
variables: string[];
isActive: boolean;
trafficPercentage: number; // For A/B testing
createdAt: Date;
metrics: {
avgResponseTime: number;
avgTokenUsage: number;
userSatisfactionScore: number | null;
};
}
Each prompt template is stored in the database with a version identifier and a traffic allocation percentage. When the system needs to generate a question or evaluate a response, it selects a prompt version based on the traffic split. This allows me to route 10% of evaluations to a new prompt version, compare the feedback quality, and gradually increase traffic if the new version performs better.
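The traffic split is a weighted random pick over the active versions for a template key. A sketch, assuming the active versions' `trafficPercentage` values sum to 100; the function and parameter names are illustrative.

```typescript
// Weighted selection of a prompt version by traffic percentage.
interface VersionWeight {
  id: string;
  trafficPercentage: number; // active versions are assumed to sum to 100
}

function selectPromptVersion(
  versions: VersionWeight[],
  roll: number = Math.random() * 100 // injectable for deterministic tests
): string {
  let cumulative = 0;
  for (const v of versions) {
    cumulative += v.trafficPercentage;
    if (roll < cumulative) return v.id;
  }
  // Fall back to the last version on rounding edge cases.
  return versions[versions.length - 1].id;
}
```

With a 90/10 split, roughly one evaluation in ten exercises the candidate prompt while the rest stay on the proven version.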
Prompt versions also track token usage, which directly impacts cost. A prompt that produces equivalent-quality feedback with 30% fewer tokens is a meaningful improvement at scale.
Key Decisions & Trade-offs
Structured JSON output vs. freeform text. I chose to instruct the LLM to return structured JSON for feedback rather than freeform paragraphs. This made the frontend rendering predictable and enabled score tracking over time. The trade-off is that structured output sometimes constrains the LLM's ability to provide nuanced commentary. I mitigate this by including a freeform "additional notes" field within the structure.
SSE vs. WebSockets for streaming. SSE was simpler to implement and sufficient for our unidirectional streaming needs. WebSockets would have been necessary if the mock interview involved real-time audio or video, but since candidates type or paste their responses, SSE was the right call. It also plays nicely with HTTP/2 multiplexing.
Session state in database vs. in-memory. Storing session state in the database after every turn adds latency, but it ensures reliability. An in-memory store like Redis would be faster but risks data loss on server restarts. Since interview sessions can last 30-60 minutes, losing a session mid-interview is unacceptable. The database write adds roughly 20ms per turn, which is negligible compared to LLM response times.
Low temperature for evaluations. I use a temperature of 0.3 for evaluation prompts and 0.7 for question generation. Evaluations need to be consistent — the same response should get similar feedback each time. Question generation benefits from more variety to avoid repetitive interviews.
Results & Outcomes
The multi-provider architecture proved its value early. During an OpenAI outage that lasted about 45 minutes, the system automatically routed all traffic to Gemini with no user-facing impact. Without the fallback, that would have been 45 minutes of complete downtime.
Streaming feedback transformed the user experience. Instead of staring at a loading spinner for 20+ seconds, candidates see feedback sections appearing within 2-3 seconds of submitting their response. The perceived wait time dropped dramatically even though the total generation time remained similar.
The prompt versioning system allowed me to iterate quickly on evaluation quality. Over several weeks, I tested and deployed eight prompt versions, with each iteration improving the specificity and actionability of the feedback. Early versions produced generic advice like "improve your communication." Later versions gave specific, contextual suggestions tied to what the candidate actually said.
Difficulty adaptation kept candidates engaged. Sessions without adaptation had a noticeable drop-off after 5-6 questions — candidates either got bored with easy questions or frustrated with hard ones. With adaptation enabled, session lengths increased meaningfully, and candidates reported the experience felt more like a real interview with a human interviewer.
What I'd Do Differently
Start with one LLM provider, not two. The multi-provider abstraction was valuable long-term, but building it from day one added complexity before I had validated the core product. I would start with a single provider and add the abstraction layer only after hitting reliability issues.
Invest in evaluation benchmarks earlier. I spent weeks tuning prompts based on gut feeling before creating a systematic benchmark — a set of sample responses with expected feedback. Once the benchmark existed, iteration speed doubled because I could objectively compare prompt versions.
Use a dedicated prompt management tool. My database-backed prompt versioning works, but tools like LangSmith or Promptfoo provide better visualization, comparison, and debugging. Building custom tooling for prompt management was not the best use of my time.
Consider fine-tuning for evaluation. The general-purpose LLMs do a decent job with detailed prompts, but a fine-tuned model on interview evaluation data would likely produce better, more consistent results at lower cost. I would explore this once I had enough evaluation data to train on.
FAQ
How does the AI evaluate interview responses?
The system transcribes the recording, segments it into question-answer pairs, then evaluates each response against a role-specific rubric using an LLM. The rubric covers technical accuracy, communication clarity, problem-solving structure, and completeness, producing scores and actionable improvement suggestions.
How accurate is AI interview feedback compared to human evaluators?
In our benchmarks, AI feedback matched human interviewer assessments 85% of the time on technical accuracy and 78% on communication quality. The AI excels at consistent, objective evaluation but may miss nuanced interpersonal signals that experienced interviewers catch.
How do you handle different interview types and roles?
Each interview type (behavioral, system design, coding) has its own evaluation rubric and prompt template. Role-specific rubrics adjust expectations — a senior engineer response is evaluated differently than a junior candidate's. Templates are versioned and A/B tested for evaluation quality.