Building a WebRTC Video Interview Platform
Case study on building a WebRTC-based video interview platform with recording, real-time transcription, AI feedback, and adaptive bitrate for varying network conditions.
TL;DR
I built a WebRTC-based video interview platform with Firebase for signaling, server-side recording via a headless media participant, and real-time transcription piped through Deepgram. The hardest parts were NAT traversal reliability and graceful degradation when peer-to-peer connections failed. This post walks through the architecture, the ICE/STUN/TURN stack, and the lessons I learned shipping real-time video at production quality.
The Challenge
The client needed a platform where recruiters could conduct structured video interviews with candidates — scheduled ahead of time, recorded for later review, and transcribed in real time so an AI module could provide post-interview feedback. They had been using Zoom links copy-pasted into calendar invites, which meant no recording ownership, no transcription pipeline, and no integration with their existing applicant tracking system.
The requirements boiled down to:
- Peer-to-peer video calls between two participants with screen sharing
- Server-side recording that the platform owned and stored
- Real-time captions during the call with full transcript saved afterward
- Scheduling integration with calendar invites and reminder notifications
- Reliability across corporate networks where firewalls and NATs are aggressive
- Adaptive quality so calls stay connected even on poor connections
Third-party video SDKs like Twilio or Daily.co were considered, but the client wanted full ownership of the media pipeline — no per-minute billing, no vendor lock-in on recordings, and the ability to plug in their own AI processing downstream.
The Architecture
Signaling with Cloud Firestore
WebRTC requires a signaling channel to exchange SDP offers/answers and ICE candidates between peers before a direct connection is established. I chose Firebase's Cloud Firestore for this (the snippets below use the Firestore modular SDK) because it provides real-time listeners out of the box, scales without infrastructure management, and the client already had Firebase in their stack.
The signaling flow works like this:
```typescript
// Caller creates an offer and writes it to Firestore
import {
  addDoc,
  collection,
  doc,
  serverTimestamp,
  setDoc,
} from "firebase/firestore";

const peerConnection = new RTCPeerConnection(iceConfig);
localStream.getTracks().forEach((track) => {
  peerConnection.addTrack(track, localStream);
});

const callDoc = doc(db, "calls", callId);

// Attach the candidate handler before setLocalDescription so that
// candidates gathered early are not missed
peerConnection.onicecandidate = (event) => {
  if (event.candidate) {
    const candidateRef = collection(callDoc, "callerCandidates");
    addDoc(candidateRef, event.candidate.toJSON());
  }
};

const offer = await peerConnection.createOffer();
await peerConnection.setLocalDescription(offer);

await setDoc(callDoc, {
  offer: {
    type: offer.type,
    sdp: offer.sdp,
  },
  createdAt: serverTimestamp(),
  status: "waiting",
});
```

The callee listens on the same document, picks up the offer, creates an answer, and writes their own ICE candidates to a parallel subcollection. Firebase's onSnapshot listeners make this feel instantaneous.
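The callee side described here can be sketched as follows. This is a simplified, hypothetical version: `db`, `iceConfig`, and `localStream` are assumed to already exist, and error handling is omitted.

```typescript
// Hypothetical sketch of the callee flow: read the offer, answer it, and
// publish ICE candidates to a parallel subcollection.
import {
  addDoc,
  collection,
  doc,
  getDoc,
  updateDoc,
} from "firebase/firestore";

async function answerCall(callId: string) {
  const callDoc = doc(db, "calls", callId);
  const snapshot = await getDoc(callDoc);
  const { offer } = snapshot.data()!;

  const peerConnection = new RTCPeerConnection(iceConfig);
  localStream.getTracks().forEach((track) => {
    peerConnection.addTrack(track, localStream);
  });

  // Callee candidates go to their own subcollection, mirrored by the caller
  peerConnection.onicecandidate = (event) => {
    if (event.candidate) {
      addDoc(collection(callDoc, "calleeCandidates"), event.candidate.toJSON());
    }
  };

  await peerConnection.setRemoteDescription(new RTCSessionDescription(offer));
  const answer = await peerConnection.createAnswer();
  await peerConnection.setLocalDescription(answer);

  await updateDoc(callDoc, {
    answer: { type: answer.type, sdp: answer.sdp },
    status: "active",
  });
}
```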
One thing I learned the hard way: you need to handle the case where ICE candidates arrive before the remote description is set. I queued incoming candidates and applied them only after setRemoteDescription completed:
```typescript
let pendingCandidates: RTCIceCandidateInit[] = [];
let remoteDescriptionSet = false;

onSnapshot(collection(callDoc, "callerCandidates"), (snapshot) => {
  snapshot.docChanges().forEach((change) => {
    if (change.type === "added") {
      const candidate = change.doc.data() as RTCIceCandidateInit;
      if (remoteDescriptionSet) {
        peerConnection.addIceCandidate(new RTCIceCandidate(candidate));
      } else {
        pendingCandidates.push(candidate);
      }
    }
  });
});

// After setting remote description
await peerConnection.setRemoteDescription(new RTCSessionDescription(answer));
remoteDescriptionSet = true;
pendingCandidates.forEach((c) =>
  peerConnection.addIceCandidate(new RTCIceCandidate(c))
);
pendingCandidates = [];
```

ICE, STUN, and TURN Configuration
The ICE (Interactive Connectivity Establishment) framework is where WebRTC connections either succeed or fail. STUN servers help peers discover their public IP addresses, while TURN servers relay media when direct connections are impossible — typically behind symmetric NATs or strict corporate firewalls.
I configured the RTCPeerConnection with multiple STUN servers and a self-hosted TURN server running coturn:
```typescript
const iceConfig: RTCConfiguration = {
  iceServers: [
    { urls: "stun:stun.l.google.com:19302" },
    { urls: "stun:stun1.l.google.com:19302" },
    {
      urls: [
        "turn:turn.ourdomain.com:3478?transport=udp",
        "turn:turn.ourdomain.com:3478?transport=tcp",
        "turns:turn.ourdomain.com:5349?transport=tcp",
      ],
      username: credentials.username,
      credential: credentials.credential,
    },
  ],
  iceTransportPolicy: "all", // Try direct first, fall back to relay
};
```

The turns (TURN over TLS) entry is critical for corporate environments. In production we also ran the TLS listener on port 443, since many corporate firewalls block non-standard ports but allow 443 because it looks like HTTPS traffic. Without a TLS fallback, a significant chunk of users behind corporate proxies simply cannot connect.
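For reference, the corresponding coturn listener setup might look like the following. This is an illustrative turnserver.conf excerpt, not the production config; paths and values are placeholders.

```ini
# Illustrative coturn (turnserver.conf) excerpt
# Plain STUN/TURN (UDP and TCP)
listening-port=3478
# turns over TLS
tls-listening-port=5349
# Extra TLS listener on 443 for strict corporate firewalls
alt-tls-listening-port=443
cert=/etc/ssl/turn/fullchain.pem
pkey=/etc/ssl/turn/privkey.pem
realm=turn.ourdomain.com
# Time-limited HMAC credentials instead of static user accounts
use-auth-secret
static-auth-secret=REPLACE_WITH_TURN_SECRET
```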
For TURN credentials, I generated time-limited credentials server-side with HMAC-based authentication (coturn's use-auth-secret scheme), so credentials expire after the interview session:
```typescript
// Server-side credential generation (coturn's time-limited credential scheme)
import crypto from "node:crypto";

function generateTurnCredentials(userId: string): TurnCredentials {
  const ttl = 86400; // Credential lifetime in seconds (24 hours)
  const timestamp = Math.floor(Date.now() / 1000) + ttl;
  const username = `${timestamp}:${userId}`;
  const hmac = crypto.createHmac("sha1", TURN_SECRET);
  hmac.update(username);
  const credential = hmac.digest("base64");
  return { username, credential };
}
```

Peer Connection Lifecycle Management
Managing the peer connection lifecycle is where the complexity really lives. Connections can fail, networks can switch (think of a laptop moving from Ethernet to Wi-Fi), and users can lose connectivity temporarily. I implemented a state machine that tracked connection status:
```typescript
peerConnection.onconnectionstatechange = () => {
  const state = peerConnection.connectionState;
  switch (state) {
    case "connected":
      clearReconnectTimer();
      updateCallStatus("active");
      break;
    case "disconnected":
      // Start a grace period — disconnected often recovers on its own
      startReconnectTimer(5000, () => attemptIceRestart());
      break;
    case "failed":
      // ICE restart to try new candidates
      attemptIceRestart();
      break;
    case "closed":
      cleanupCall();
      break;
  }
};

async function attemptIceRestart() {
  try {
    const offer = await peerConnection.createOffer({ iceRestart: true });
    await peerConnection.setLocalDescription(offer);
    // Write the new offer to Firebase for the other peer
    await updateDoc(callDoc, {
      offer: { type: offer.type, sdp: offer.sdp },
      iceRestart: true,
    });
  } catch (err) {
    // If ICE restart fails, fall back to full reconnection
    await fullReconnect();
  }
}
```

The key insight is that disconnected is not the same as failed. A disconnected state often recovers within seconds as ICE finds alternative candidate pairs. Immediately tearing down the connection on disconnect would create unnecessary disruption.
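The startReconnectTimer and clearReconnectTimer helpers referenced by the state machine are not shown above; a minimal sketch could look like this (the names match the state machine, the implementation itself is an assumption):

```typescript
// Hypothetical reconnect-timer helpers for the grace-period logic.
// At most one timer is pending at a time.
let reconnectTimer: ReturnType<typeof setTimeout> | null = null;

function startReconnectTimer(graceMs: number, onExpire: () => void): void {
  // A new "disconnected" event resets any earlier grace period
  clearReconnectTimer();
  reconnectTimer = setTimeout(() => {
    reconnectTimer = null;
    onExpire();
  }, graceMs);
}

function clearReconnectTimer(): void {
  if (reconnectTimer !== null) {
    clearTimeout(reconnectTimer);
    reconnectTimer = null;
  }
}
```

Reaching the "connected" state clears the timer, so a brief blip never triggers an unnecessary ICE restart.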
Fallback to Relay Mode
When peer-to-peer connections consistently fail, the platform forces relay mode by recreating the peer connection with iceTransportPolicy: "relay". This guarantees the connection goes through the TURN server:
```typescript
async function fallbackToRelay() {
  const relayConfig: RTCConfiguration = {
    ...iceConfig,
    iceTransportPolicy: "relay", // Force TURN relay only
  };
  // Tear down existing connection
  peerConnection.close();
  // Create new connection in relay-only mode
  peerConnection = new RTCPeerConnection(relayConfig);
  // Re-add tracks, re-establish signaling...
}
```

This trades latency for reliability. Relay mode adds a hop through the TURN server, but it works in virtually every network environment. In practice, I found that about 15-20% of connections in corporate settings needed relay mode.
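A quick way to tell whether a live call actually ended up on the relay path is to inspect getStats(). The selection logic reduces to a pure function over the stats map; this is a sketch (field names follow the WebRTC stats model, the helper itself is hypothetical):

```typescript
// Hypothetical helper: given a stats report map (shaped like the result of
// RTCPeerConnection.getStats()), report whether the nominated, succeeded
// ICE candidate pair uses a TURN relay on the local side.
type StatsMap = Map<string, Record<string, any>>;

function isRelayed(stats: StatsMap): boolean {
  for (const report of stats.values()) {
    if (
      report.type === "candidate-pair" &&
      report.nominated &&
      report.state === "succeeded"
    ) {
      const local = stats.get(report.localCandidateId);
      if (local?.candidateType === "relay") return true;
    }
  }
  return false;
}
```

In the browser this would be called as `isRelayed(await peerConnection.getStats())`, for example to log relay usage per call.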
Server-Side Recording
Client-side recording using MediaRecorder is unreliable — if the user closes the tab, the recording is lost. Instead, I used a headless media participant that joins each call invisibly, receives all streams, and records them using GStreamer on the server.
The recording service joins the call as a hidden peer, receives the combined audio/video streams, and pipes them into a GStreamer pipeline that produces an MP4 file streamed to cloud storage. When the call ends, the recording is finalized and associated with the interview record in the database.
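To illustrate the shape of such a pipeline, here is a minimal gst-launch-1.0 command using test sources. The real service builds its pipeline programmatically, feeds it from the WebRTC tracks rather than videotestsrc, and streams to cloud storage instead of a local file.

```shell
# Illustrative only: encode a test video source to H.264, mux into MP4.
# The production pipeline ingests RTP from the headless peer and also
# muxes the audio tracks.
gst-launch-1.0 -e videotestsrc num-buffers=300 ! videoconvert ! \
  x264enc tune=zerolatency ! h264parse ! mp4mux ! \
  filesink location=interview.mp4
```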
Real-Time Transcription Pipeline
Audio from the WebRTC stream is extracted and piped to Deepgram via WebSocket for real-time speech-to-text. The flow looks like:
- The recording server extracts the audio track from each participant
- Raw PCM audio is streamed to a Deepgram WebSocket connection
- Interim results are broadcast to both participants as live captions
- Final results are accumulated into a complete transcript
- The transcript is saved and fed to the AI feedback module post-call
Session Scheduling
Scheduling was built on top of the existing calendar system. When a recruiter schedules an interview, the system creates a call document in Firebase with a future timestamp, sends calendar invites via SendGrid with a unique join link, and fires reminder notifications 15 minutes before the call. The join link validates the participant's identity and connects them to the pre-created signaling room.
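The reminder offset is simple date arithmetic; a minimal sketch (the helper name is an assumption, the 15-minute lead comes from the behavior described above):

```typescript
// Hypothetical helper: derive the reminder time from the scheduled start.
const REMINDER_LEAD_MS = 15 * 60 * 1000; // 15 minutes

function reminderTime(scheduledAt: Date): Date {
  return new Date(scheduledAt.getTime() - REMINDER_LEAD_MS);
}
```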
Key Decisions & Trade-offs
Firebase vs. custom WebSocket signaling: Firebase was faster to implement and inherently scalable, but it means signaling data lives in a third-party service. For this use case, signaling data is ephemeral and non-sensitive (SDP and ICE candidates contain network metadata, not call content), so the trade-off was acceptable.
Self-hosted TURN vs. managed TURN: Self-hosting coturn gave us full control over relay infrastructure and eliminated per-minute costs that managed TURN providers charge. The downside is operational overhead — monitoring, scaling, and geo-distributing TURN servers. For a platform expecting moderate scale, self-hosted was the right call.
Server-side recording vs. client-side: Server-side recording is significantly more complex to implement but produces reliable recordings regardless of client behavior. Given that these recordings were the primary deliverable (recruiters review them later), reliability was non-negotiable.
Server push vs. polling for signaling updates: Firebase's real-time listeners essentially give us server push for free. If I had built custom signaling, I would have used WebSockets or SSE, but Firebase's SDK abstracts this entirely.
Simulcast for adaptive bitrate: Rather than implementing custom bandwidth estimation, I leveraged WebRTC's built-in simulcast — encoding video at multiple quality levels and letting the receiver request the appropriate layer. This kept the implementation simple while providing genuine adaptive quality.
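As a sketch of what the simulcast setup looks like: the sender declares several encodings, each with an rid, a downscale factor, and a bitrate cap. The layer names, scale factors, and bitrates below are illustrative, not the production values.

```typescript
// Illustrative simulcast layer configuration. In the browser these are
// passed as pc.addTransceiver(track, { direction: "sendonly", sendEncodings }).
type SimulcastEncoding = {
  rid: string; // layer identifier
  maxBitrate: number; // bits per second
  scaleResolutionDownBy?: number; // downscale factor relative to the source
};

function simulcastEncodings(): SimulcastEncoding[] {
  return [
    { rid: "low", scaleResolutionDownBy: 4, maxBitrate: 150_000 },
    { rid: "mid", scaleResolutionDownBy: 2, maxBitrate: 500_000 },
    { rid: "high", maxBitrate: 1_500_000 },
  ];
}
```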
Results & Outcomes
The platform shipped and handled interview sessions reliably across a wide range of network conditions. The relay fallback mechanism proved essential — connections that would have failed on a basic WebRTC implementation succeeded by transparently switching to TURN relay mode. Recruiters could schedule, conduct, and review interviews entirely within the platform, eliminating the fragmented workflow of external video tools. The real-time transcription was accurate enough for the AI feedback module to generate actionable post-interview insights, and the server-side recordings provided a reliable archive that the team could reference weeks later.
What I'd Do Differently
Use a Selective Forwarding Unit (SFU) from the start. The peer-to-peer approach works well for two-participant calls, but when the client later asked about panel interviews with multiple interviewers, the architecture needed significant rework. An SFU like mediasoup or Janus from day one would have handled multi-party calls without the mesh complexity.
Invest in connection quality diagnostics earlier. I added getStats() monitoring late in the project, but having real-time metrics on packet loss, jitter, and round-trip time from the start would have made debugging connection issues much faster.
Implement a proper SRTP key management system. WebRTC encrypts media by default via DTLS-SRTP, but for compliance-heavy clients, having explicit key management and audit trails would add confidence. I would build this in from the beginning for enterprise deployments.
Use a dedicated signaling service instead of Firebase. Firebase worked, but a purpose-built signaling service with better connection state management and built-in room concepts would have reduced the custom logic I had to write around call lifecycle management.
FAQ
How do you record WebRTC video sessions server-side?
We used a media server that joins each call as a hidden participant, receives all media streams, and records them using GStreamer. The recording server establishes a peer connection to the signaling room just like a regular participant, but with no outgoing media streams. It receives audio and video tracks from all participants, pipes them into a GStreamer pipeline that muxes and encodes the streams into an MP4 container, and streams the output directly to cloud storage. This approach produces reliable recordings regardless of client-side issues — if a participant closes their browser, the server-side recording still captures everything up to that point. Unlike MediaRecorder-based client recording, which depends on the browser tab staying open and the user's device having sufficient resources, server-side recording is deterministic and centrally managed.
How does adaptive bitrate work in WebRTC?
WebRTC includes built-in bandwidth estimation via the GCC (Google Congestion Control) algorithm, which continuously measures packet loss and round-trip time to estimate available bandwidth. We configured simulcast encoding, which sends the video at multiple quality levels simultaneously — typically three layers at different resolutions and frame rates. The receiving peer (or SFU) dynamically switches between these layers based on current network conditions. When bandwidth drops, the receiver requests a lower-quality layer, keeping the call connected with reduced video quality rather than freezing or disconnecting entirely. The transition between layers happens seamlessly — participants see a quality reduction but experience no interruption. This approach keeps calls viable even on cellular connections or congested corporate networks where bandwidth fluctuates significantly.
How do you add real-time transcription to a video call?
Audio from the WebRTC stream is extracted on the recording server and piped to Deepgram's speech-to-text API via a persistent WebSocket connection. The audio is sent as raw PCM samples at 16kHz, and Deepgram returns interim results within 300-500ms. These interim results are broadcast back to both call participants through the signaling channel (Firebase) and rendered as live captions in the UI. As Deepgram refines its recognition, interim results are replaced by final results that represent the completed transcription for each utterance. Final results are accumulated into a complete transcript document, tagged with speaker labels and timestamps. After the call ends, the full transcript is saved to the database and passed to the AI feedback module, which analyzes communication patterns, keyword usage, and response quality to generate structured interview feedback for the recruiter.