June 25, 2025
Last updated: June 25, 2025

Building a WebRTC Video Interview Platform

Case study on building a WebRTC-based video interview platform with recording, real-time transcription, AI feedback, and adaptive bitrate for varying network conditions.

Tags

WebRTC, Video, Firebase, Next.js
9 min read


TL;DR

I built a WebRTC-based video interview platform with Firebase for signaling, server-side recording via a headless media participant, and real-time transcription piped through Deepgram. The hardest parts were NAT traversal reliability and graceful degradation when peer-to-peer connections failed. This post walks through the architecture, the ICE/STUN/TURN stack, and the lessons I learned shipping real-time video at production quality.

The Challenge

The client needed a platform where recruiters could conduct structured video interviews with candidates — scheduled ahead of time, recorded for later review, and transcribed in real time so an AI module could provide post-interview feedback. They had been using Zoom links copy-pasted into calendar invites, which meant no recording ownership, no transcription pipeline, and no integration with their existing applicant tracking system.

The requirements boiled down to:

  • Peer-to-peer video calls between two participants with screen sharing
  • Server-side recording that the platform owned and stored
  • Real-time captions during the call with full transcript saved afterward
  • Scheduling integration with calendar invites and reminder notifications
  • Reliability across corporate networks where firewalls and NATs are aggressive
  • Adaptive quality so calls stay connected even on poor connections

Third-party video SDKs like Twilio or Daily.co were considered, but the client wanted full ownership of the media pipeline — no per-minute billing, no vendor lock-in on recordings, and the ability to plug in their own AI processing downstream.

The Architecture

Signaling with Firebase (Cloud Firestore)

WebRTC requires a signaling channel to exchange SDP offers/answers and ICE candidates between peers before a direct connection is established. I chose Firebase for this, specifically Cloud Firestore, because its real-time listeners work out of the box, it scales without infrastructure management, and the client already had Firebase in their stack.

The signaling flow works like this:

typescript
// Caller creates an offer and writes it to Firebase
import {
  addDoc,
  collection,
  doc,
  serverTimestamp,
  setDoc,
} from "firebase/firestore";

const peerConnection = new RTCPeerConnection(iceConfig);

localStream.getTracks().forEach((track) => {
  peerConnection.addTrack(track, localStream);
});

const callDoc = doc(db, "calls", callId);

// Register the candidate handler before setLocalDescription, which starts
// ICE gathering; otherwise early candidates can be missed
peerConnection.onicecandidate = (event) => {
  if (event.candidate) {
    addDoc(collection(callDoc, "callerCandidates"), event.candidate.toJSON());
  }
};

const offer = await peerConnection.createOffer();
await peerConnection.setLocalDescription(offer);

await setDoc(callDoc, {
  offer: {
    type: offer.type,
    sdp: offer.sdp,
  },
  createdAt: serverTimestamp(),
  status: "waiting",
});

The callee listens on the same document, picks up the offer, creates an answer, and writes their own ICE candidates to a parallel subcollection. Firebase's onSnapshot listeners make this feel instantaneous.
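The callee flow can be sketched as follows. This is my reconstruction, not the verbatim production code: the Firestore helpers (getDoc, updateDoc, collection, addDoc) are injected as a parameter so the snippet stands alone, whereas the real code imports them from firebase/firestore.

```typescript
// Browser globals, declared so this sketch also typechecks outside the DOM
declare const RTCPeerConnection: any;
declare const RTCSessionDescription: any;

// Callee: read the caller's offer, publish an answer, and stream ICE
// candidates into a parallel "calleeCandidates" subcollection.
async function answerCall(
  callDoc: any, // Firestore DocumentReference for this call
  localStream: any, // callee's MediaStream
  iceConfig: any,
  fs: { getDoc: any; updateDoc: any; collection: any; addDoc: any }
) {
  const pc = new RTCPeerConnection(iceConfig);
  localStream.getTracks().forEach((track: any) => pc.addTrack(track, localStream));

  // Register before setLocalDescription so no early candidates are missed
  pc.onicecandidate = (event: any) => {
    if (event.candidate) {
      fs.addDoc(fs.collection(callDoc, "calleeCandidates"), event.candidate.toJSON());
    }
  };

  // Apply the caller's offer as the remote description
  const snapshot = await fs.getDoc(callDoc);
  const { offer } = snapshot.data();
  await pc.setRemoteDescription(new RTCSessionDescription(offer));

  // Create the answer and publish it for the caller to pick up
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);
  await fs.updateDoc(callDoc, {
    answer: { type: answer.type, sdp: answer.sdp },
    status: "active",
  });

  return pc;
}
```

The caller, meanwhile, listens for the answer field to appear and applies it as its own remote description, which is where the candidate-queueing logic below comes in.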

One thing I learned the hard way: you need to handle the case where ICE candidates arrive before the remote description is set. I queued incoming candidates and applied them only after setRemoteDescription completed:

typescript
let pendingCandidates: RTCIceCandidateInit[] = [];
let remoteDescriptionSet = false;
 
onSnapshot(collection(callDoc, "callerCandidates"), (snapshot) => {
  snapshot.docChanges().forEach((change) => {
    if (change.type === "added") {
      const candidate = change.doc.data() as RTCIceCandidateInit;
      if (remoteDescriptionSet) {
        peerConnection.addIceCandidate(new RTCIceCandidate(candidate));
      } else {
        pendingCandidates.push(candidate);
      }
    }
  });
});
 
// After setting remote description
await peerConnection.setRemoteDescription(new RTCSessionDescription(answer));
remoteDescriptionSet = true;
pendingCandidates.forEach((c) =>
  peerConnection.addIceCandidate(new RTCIceCandidate(c))
);
pendingCandidates = [];

ICE, STUN, and TURN Configuration

The ICE (Interactive Connectivity Establishment) framework is where WebRTC connections either succeed or fail. STUN servers help peers discover their public IP addresses, while TURN servers relay media when direct connections are impossible — typically behind symmetric NATs or strict corporate firewalls.

I configured the RTCPeerConnection with multiple STUN servers and a self-hosted TURN server running coturn:

typescript
const iceConfig: RTCConfiguration = {
  iceServers: [
    { urls: "stun:stun.l.google.com:19302" },
    { urls: "stun:stun1.l.google.com:19302" },
    {
      urls: [
        "turn:turn.ourdomain.com:3478?transport=udp",
        "turn:turn.ourdomain.com:3478?transport=tcp",
        "turns:turn.ourdomain.com:5349?transport=tcp",
      ],
      username: credentials.username,
      credential: credentials.credential,
    },
  ],
  iceTransportPolicy: "all", // Try direct first, fall back to relay
};

The turns (TURN over TLS on port 443) entry is critical for corporate environments. Many corporate firewalls block non-standard ports but allow 443 since it looks like HTTPS traffic. Without this, a significant chunk of users behind corporate proxies simply cannot connect.
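For context, the matching coturn setup is only a few directives. This is a sketch with placeholder paths and secret, not the production file:

```ini
# /etc/turnserver.conf (sketch; paths, realm, and secret are placeholders)
listening-port=3478

# TLS listener on 443 so turns: traffic passes corporate firewalls
tls-listening-port=443
cert=/etc/ssl/turn/fullchain.pem
pkey=/etc/ssl/turn/privkey.pem

realm=turn.ourdomain.com

# Time-limited credentials via coturn's shared-secret mechanism;
# must match the TURN_SECRET used by the credential generator
use-auth-secret
static-auth-secret=change-me
```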

For TURN credentials, I generated time-limited credentials server-side using coturn's HMAC-based shared-secret mechanism, so access to the relay expires after the interview window:

typescript
// Server-side credential generation (Node.js)
import crypto from "crypto";

interface TurnCredentials {
  username: string;
  credential: string;
}

function generateTurnCredentials(userId: string): TurnCredentials {
  const ttl = 86400; // credentials valid for 24 hours
  const timestamp = Math.floor(Date.now() / 1000) + ttl;
  const username = `${timestamp}:${userId}`;
  // coturn recomputes this HMAC from the shared secret to verify the credential
  const hmac = crypto.createHmac("sha1", TURN_SECRET);
  hmac.update(username);
  const credential = hmac.digest("base64");
  return { username, credential };
}
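A companion check (my addition, mirroring the validation coturn performs server-side) makes the expiry explicit, since the expiry timestamp is embedded in the username itself:

```typescript
// A TURN username of the form "<expiry-timestamp>:<userId>" is expired
// once the embedded Unix timestamp is in the past.
function turnCredentialExpired(
  username: string,
  nowSeconds: number = Math.floor(Date.now() / 1000)
): boolean {
  const [expiry] = username.split(":");
  return Number(expiry) <= nowSeconds;
}
```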

Peer Connection Lifecycle Management

Managing the peer connection lifecycle is where the complexity really lives. Connections can fail, networks can switch (think laptop moving from ethernet to WiFi), and users can lose connectivity temporarily. I implemented a state machine that tracked connection status:

typescript
peerConnection.onconnectionstatechange = () => {
  const state = peerConnection.connectionState;
 
  switch (state) {
    case "connected":
      clearReconnectTimer();
      updateCallStatus("active");
      break;
    case "disconnected":
      // Start a grace period — disconnected often recovers on its own
      startReconnectTimer(5000, () => attemptIceRestart());
      break;
    case "failed":
      // ICE restart to try new candidates
      attemptIceRestart();
      break;
    case "closed":
      cleanupCall();
      break;
  }
};
 
async function attemptIceRestart() {
  try {
    const offer = await peerConnection.createOffer({ iceRestart: true });
    await peerConnection.setLocalDescription(offer);
    // Write the new offer to Firebase for the other peer
    await updateDoc(callDoc, {
      offer: { type: offer.type, sdp: offer.sdp },
      iceRestart: true,
    });
  } catch (err) {
    // If ICE restart fails, fall back to full reconnection
    await fullReconnect();
  }
}

The key insight is that disconnected is not the same as failed. A disconnected state often recovers within seconds as ICE finds alternative candidate pairs. Immediately tearing down the connection on disconnect would create unnecessary disruption.

Fallback to Relay Mode

When peer-to-peer connections consistently fail, the platform forces relay mode by recreating the peer connection with iceTransportPolicy: "relay". This guarantees the connection goes through the TURN server:

typescript
async function fallbackToRelay() {
  const relayConfig: RTCConfiguration = {
    ...iceConfig,
    iceTransportPolicy: "relay", // Force TURN relay only
  };
 
  // Tear down existing connection
  peerConnection.close();
 
  // Create new connection in relay-only mode
  peerConnection = new RTCPeerConnection(relayConfig);
  // Re-add tracks, re-establish signaling...
}

This trades latency for reliability. Relay mode adds a hop through the TURN server, but it works in virtually every network environment. In practice, I found that about 15-20% of connections in corporate settings needed relay mode.
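One way to collect that kind of relay-vs-direct breakdown is a getStats() probe that classifies the selected candidate pair. This is a sketch of the approach rather than the platform's exact code; since an RTCStatsReport is Map-like, the function accepts any Map of stats records keyed by id:

```typescript
// Returns true if the selected candidate pair routes through a TURN relay.
// Walks the "transport" stats entry to its selected pair, then to the
// local candidate, and checks its candidateType.
function usesTurnRelay(stats: Map<string, any>): boolean {
  for (const report of stats.values()) {
    if (report.type === "transport" && report.selectedCandidatePairId) {
      const pair = stats.get(report.selectedCandidatePairId);
      const local = pair && stats.get(pair.localCandidateId);
      return local?.candidateType === "relay";
    }
  }
  return false;
}
```

In the browser this would be called as `usesTurnRelay(await peerConnection.getStats())`, for example on a periodic timer that logs connection quality metrics.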

Server-Side Recording

Client-side recording using MediaRecorder is unreliable — if the user closes the tab, the recording is lost. Instead, I used a headless media participant that joins each call invisibly, receives all streams, and records them using GStreamer on the server.

The recording service joins the call as a hidden peer, receives the combined audio/video streams, and pipes them into a GStreamer pipeline that produces an MP4 file streamed to cloud storage. When the call ends, the recording is finalized and associated with the interview record in the database.

Real-Time Transcription Pipeline

Audio from the WebRTC stream is extracted and piped to Deepgram via WebSocket for real-time speech-to-text. The flow looks like:

  1. The recording server extracts the audio track from each participant
  2. Raw PCM audio is streamed to a Deepgram WebSocket connection
  3. Interim results are broadcast to both participants as live captions
  4. Final results are accumulated into a complete transcript
  5. The transcript is saved and fed to the AI feedback module post-call
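Step 2 hides a format detail worth showing: WebRTC and Web Audio hand you 32-bit float samples in the -1..1 range, while linear16 streaming expects 16-bit signed PCM. A minimal conversion helper (my sketch; resampling to the target rate happens elsewhere):

```typescript
// Convert Float32 audio samples (Web Audio range -1..1) to 16-bit
// signed PCM, the linear16 format expected by the streaming STT API.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to the valid range before scaling to avoid integer overflow
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

The resulting Int16Array's underlying buffer is what gets written to the WebSocket as binary frames.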

Session Scheduling

Scheduling was built on top of the existing calendar system. When a recruiter schedules an interview, the system creates a call document in Firebase with a future timestamp, sends calendar invites via SendGrid with a unique join link, and fires reminder notifications 15 minutes before the call. The join link validates the participant's identity and connects them to the pre-created signaling room.

Key Decisions & Trade-offs

Firebase vs. custom WebSocket signaling: Firebase was faster to implement and inherently scalable, but it means signaling data lives in a third-party service. For this use case, signaling data is ephemeral and non-sensitive (SDP and ICE candidates contain network metadata, not call content), so the trade-off was acceptable.

Self-hosted TURN vs. managed TURN: Self-hosting coturn gave us full control over relay infrastructure and eliminated per-minute costs that managed TURN providers charge. The downside is operational overhead — monitoring, scaling, and geo-distributing TURN servers. For a platform expecting moderate scale, self-hosted was the right call.

Server-side recording vs. client-side: Server-side recording is significantly more complex to implement but produces reliable recordings regardless of client behavior. Given that these recordings were the primary deliverable (recruiters review them later), reliability was non-negotiable.

Server push vs. polling for signaling updates: Firebase's real-time listeners essentially give us server push for free. If I had built custom signaling, I would have used WebSockets, but Firebase's SDK abstracts this entirely.

Simulcast for adaptive bitrate: Rather than implementing custom bandwidth estimation, I leveraged WebRTC's built-in simulcast — encoding video at multiple quality levels and letting the receiver request the appropriate layer. This kept the implementation simple while providing genuine adaptive quality.
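Concretely, simulcast comes down to a sendEncodings array handed to addTransceiver. The rids, scale factors, and bitrate caps below are illustrative assumptions (tuned for roughly 720p capture), not the production values:

```typescript
// Three simulcast layers, lowest quality first. Used as:
//   pc.addTransceiver(videoTrack, { direction: "sendonly", sendEncodings: simulcastEncodings })
const simulcastEncodings = [
  { rid: "q", scaleResolutionDownBy: 4.0, maxBitrate: 150_000 }, // ~180p
  { rid: "h", scaleResolutionDownBy: 2.0, maxBitrate: 500_000 }, // ~360p
  { rid: "f", scaleResolutionDownBy: 1.0, maxBitrate: 1_500_000 }, // full resolution
];
```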

Results & Outcomes

The platform shipped and handled interview sessions reliably across a wide range of network conditions. The relay fallback mechanism proved essential — connections that would have failed on a basic WebRTC implementation succeeded by transparently switching to TURN relay mode. Recruiters could schedule, conduct, and review interviews entirely within the platform, eliminating the fragmented workflow of external video tools. The real-time transcription was accurate enough for the AI feedback module to generate actionable post-interview insights, and the server-side recordings provided a reliable archive that the team could reference weeks later.

What I'd Do Differently

Use a Selective Forwarding Unit (SFU) from the start. The peer-to-peer approach works well for two-participant calls, but when the client later asked about panel interviews with multiple interviewers, the architecture needed significant rework. An SFU like mediasoup or Janus from day one would have handled multi-party calls without the mesh complexity.

Invest in connection quality diagnostics earlier. I added getStats() monitoring late in the project, but having real-time metrics on packet loss, jitter, and round-trip time from the start would have made debugging connection issues much faster.

Implement a proper SRTP key management system. WebRTC encrypts media by default via DTLS-SRTP, but for compliance-heavy clients, having explicit key management and audit trails would add confidence. I would build this in from the beginning for enterprise deployments.

Use a dedicated signaling service instead of Firebase. Firebase worked, but a purpose-built signaling service with better connection state management and built-in room concepts would have reduced the custom logic I had to write around call lifecycle management.

FAQ

How do you record WebRTC video sessions server-side?

We used a media server that joins each call as a hidden participant, receives all media streams, and records them using GStreamer. The recording server establishes a peer connection to the signaling room just like a regular participant, but with no outgoing media streams. It receives audio and video tracks from all participants, pipes them into a GStreamer pipeline that muxes and encodes the streams into an MP4 container, and streams the output directly to cloud storage. This approach produces reliable recordings regardless of client-side issues — if a participant closes their browser, the server-side recording still captures everything up to that point. Unlike MediaRecorder-based client recording, which depends on the browser tab staying open and the user's device having sufficient resources, server-side recording is deterministic and centrally managed.

How does adaptive bitrate work in WebRTC?

WebRTC includes built-in bandwidth estimation via the GCC (Google Congestion Control) algorithm, which continuously measures packet loss and round-trip time to estimate available bandwidth. We configured simulcast encoding, which sends the video at multiple quality levels simultaneously — typically three layers at different resolutions and frame rates. The receiving peer (or SFU) dynamically switches between these layers based on current network conditions. When bandwidth drops, the receiver requests a lower-quality layer, keeping the call connected with reduced video quality rather than freezing or disconnecting entirely. The transition between layers happens seamlessly — participants see a quality reduction but experience no interruption. This approach keeps calls viable even on cellular connections or congested corporate networks where bandwidth fluctuates significantly.

How do you add real-time transcription to a video call?

Audio from the WebRTC stream is extracted on the recording server and piped to Deepgram's speech-to-text API via a persistent WebSocket connection. The audio is sent as raw PCM samples at 16kHz, and Deepgram returns interim results within 300-500ms. These interim results are broadcast back to both call participants through the signaling channel (Firebase) and rendered as live captions in the UI. As Deepgram refines its recognition, interim results are replaced by final results that represent the completed transcription for each utterance. Final results are accumulated into a complete transcript document, tagged with speaker labels and timestamps. After the call ends, the full transcript is saved to the database and passed to the AI feedback module, which analyzes communication patterns, keyword usage, and response quality to generate structured interview feedback for the recruiter.


Article Author

Sadam Hussain

Senior Full Stack Developer

Senior Full Stack Developer with over 7 years of experience building React, Next.js, Node.js, TypeScript, and AI-powered web platforms.
