What a 100K-Concurrent WebRTC Signaling Architecture Looks Like
A control-plane deep dive into deterministic room placement, backpressure budgets, autoscaling signals, and failure containment at high concurrency.
What Changes in WebRTC at 100K Concurrency
At this scale, signaling becomes a control-plane design problem. You need deterministic room placement, admission control, backpressure, and resilient reconnect semantics.
Baseline topology
- Edge gateway: TLS, auth, and WebSocket termination.
- Signaling cluster: room membership, offer/answer routing, ICE lifecycle.
- Message bus: cross-node fanout for participant updates.
- Media plane (SFU): isolated from signaling failures.
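To make these boundaries concrete, here is a minimal sketch of the control-plane message envelope that crosses them; the event and field names are illustrative assumptions, not a specific protocol. The media plane is deliberately absent, matching the isolation noted above.

// Illustrative control-plane types; names are assumptions, not a specific protocol.
type SignalingEvent =
  | { kind: "join"; roomId: string; participantId: string }
  | { kind: "offer"; roomId: string; from: string; to: string; sdp: string }
  | { kind: "answer"; roomId: string; from: string; to: string; sdp: string }
  | { kind: "ice-candidate"; roomId: string; from: string; to: string; candidate: string }
  | { kind: "participant-updated"; roomId: string; participantId: string; speaking: boolean };

// Envelope carried on the cross-node message bus for fanout between shards.
interface BusEnvelope {
  topic: string;        // e.g. "room.participant.updated"
  roomId: string;       // routing key so a room's events stay on its shard
  originNodeId: string; // signaling node that produced the event
  event: SignalingEvent;
}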
Deterministic Room Placement
Avoid random node selection. Use rendezvous (highest-random-weight) hashing so reconnects tend to land on the same shard while it stays healthy, reducing cross-node chatter and cache misses.
interface SignalingNode {
  id: string;
  region: string;
  activeConnections: number;
  maxConnections: number;
}

// Assumed: any stable, well-distributed string hash (e.g. a MurmurHash3
// implementation); only determinism across nodes matters.
declare function murmurHash(input: string): number;

function score(nodeId: string, roomId: string) {
  return murmurHash(nodeId + ":" + roomId);
}

export function selectNode(roomId: string, nodes: SignalingNode[]) {
  const healthy = nodes.filter((n) => n.activeConnections < n.maxConnections);
  if (healthy.length === 0) throw new Error("No capacity available");
  return healthy
    .map((node) => ({
      node,
      hash: score(node.id, roomId),
      loadRatio: node.activeConnections / node.maxConnections,
    }))
    // Highest rendezvous hash wins; load ratio only breaks exact hash ties.
    .sort((a, b) => b.hash - a.hash || a.loadRatio - b.loadRatio)[0].node;
}
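A short usage sketch with illustrative node values: every caller that sees the same healthy set picks the same node for a given roomId, which is what keeps reconnects sticky.

// Illustrative node list; values are made up for the example.
const nodes: SignalingNode[] = [
  { id: "sig-1", region: "us-east-1", activeConnections: 1200, maxConnections: 2000 },
  { id: "sig-2", region: "us-east-1", activeConnections: 1950, maxConnections: 2000 },
  { id: "sig-3", region: "us-east-1", activeConnections: 800, maxConnections: 2000 },
];

// Deterministic for a given roomId and healthy set; if the winning node fills up
// or dies, the same roomId falls through to the next-highest hash.
const placement = selectNode("room-42", nodes);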
Backpressure and Fanout Discipline

One noisy room can starve a node. Apply per-room and per-connection budgets. Drop non-critical updates first (e.g., speaking indicators) before critical protocol events.
// Per-room (or per-connection) budget: refill at a steady rate, spend on publish.
class TokenBucket {
  constructor(
    private capacity: number,
    private refillPerSecond: number,
    private tokens = capacity,
  ) {}

  // Call on a timer (or lazily before consuming) with the elapsed wall-clock time.
  tick(elapsedSeconds: number) {
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
  }

  tryConsume(cost = 1) {
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}
// Assumed process-wide clients: a stats emitter and the cross-node message bus.
declare const metrics: { increment(name: string): void };
declare const bus: { publish(topic: string, payload: unknown): void };

export function publishParticipantUpdate(roomBucket: TokenBucket, payload: unknown) {
  if (!roomBucket.tryConsume(1)) {
    metrics.increment("rtc.room_updates_dropped");
    return;
  }
  bus.publish("room.participant.updated", payload);
}
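The bucket above meters volume but treats every update alike. One way to express the "drop non-critical first" rule is a priority tag on outbound messages; the classes and costs below are illustrative assumptions layered on the same TokenBucket, metrics, and bus.

// Hypothetical priority classes: protocol events (offer/answer/ICE) are never
// shed, presence updates are next, cosmetic signals (speaking indicators) go first.
type UpdatePriority = "critical" | "normal" | "cosmetic";

export function publishRoomEvent(
  roomBucket: TokenBucket,
  priority: UpdatePriority,
  topic: string,
  payload: unknown,
) {
  if (priority === "critical") {
    // Offers, answers, and ICE candidates bypass the room budget entirely.
    bus.publish(topic, payload);
    return;
  }
  // Cosmetic updates cost more tokens, so they fail the budget check (and are
  // dropped) before normal-priority updates as the bucket drains.
  const cost = priority === "cosmetic" ? 4 : 1;
  if (!roomBucket.tryConsume(cost)) {
    metrics.increment("rtc.room_updates_dropped");
    return;
  }
  bus.publish(topic, payload);
}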
Autoscaling Signals That Matter

CPU alone is insufficient. Feed domain metrics into autoscaling: active connections, fanout queue depth, join-latency p95, and reconnect failure rate. The Pods metrics below assume a custom-metrics adapter (for example, prometheus-adapter) already exposes active_websocket_connections and fanout_queue_depth to the cluster.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: signaling-hpa
spec:
  # Assumed target workload name; point this at your signaling Deployment.
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: signaling
  minReplicas: 8
  maxReplicas: 64
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_websocket_connections
        target:
          type: AverageValue
          averageValue: "1700"
    - type: Pods
      pods:
        metric:
          name: fanout_queue_depth
        target:
          type: AverageValue
          averageValue: "250"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
Resilience principle

If a signaling node dies, the blast radius should be shard-local and clients must reconnect within an SLO window. Design for partial failure, not perfect uptime.
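On the client side, a minimal sketch of reconnect semantics that respects such an SLO window, assuming a hypothetical connect() callback and a 30-second reconnect SLO; full jitter keeps the clients of a dead shard from stampeding back in lockstep.

// Hypothetical client reconnect loop: exponential backoff with full jitter,
// bounded so the whole attempt sequence stays inside the reconnect SLO.
async function reconnectWithinSlo(
  connect: () => Promise<void>, // assumed to resolve once signaling is re-established
  sloMs = 30_000,
  baseDelayMs = 250,
  maxDelayMs = 5_000,
): Promise<boolean> {
  const deadline = Date.now() + sloMs;
  for (let attempt = 0; Date.now() < deadline; attempt++) {
    try {
      await connect();
      return true; // reconnected inside the SLO window
    } catch {
      const backoff = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      const jitteredDelay = Math.random() * backoff; // full jitter
      await new Promise((resolve) => setTimeout(resolve, jitteredDelay));
    }
  }
  return false; // SLO missed; surface a hard failure to the application
}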