Node.js · WebSockets · Redis · Systems Design

Real-Time at Scale: Building a Hospital Nurse Allocation System for 5,000 Patients/Day

February 18, 2026 · 9 min read

When a client needed to coordinate nurse assignments across 15 hospitals in real time, polling wasn't going to cut it. Here's the architecture we landed on — WebSockets, Redis pub/sub, and a rule-based scheduling engine that runs without a hitch in a live clinical environment.

When the client first described the problem, it sounded straightforward: nurses needed to know their assignments without refreshing a page. Fifteen hospitals, ~5,000 patients a day, dozens of wards. In practice it was one of the more interesting real-time problems I've worked on.

## Why Polling Was Never an Option

The naive approach — periodic HTTP polling every few seconds — falls apart quickly at this scale. With even 200 concurrent ward users polling every 3 seconds, you're looking at roughly 4,000 unnecessary API hits per minute, most of them database reads. More critically, in a clinical environment, a 3-second lag in an assignment update can create genuine operational risk. We needed true push. The question was which protocol to use and what the fanout architecture should look like.

## The Architecture

The stack we landed on:

- **Node.js + ws** for the WebSocket server (we evaluated Socket.IO, but the extra abstraction wasn't worth it here — raw ws gives you full control over heartbeats and reconnect logic)
- **Redis pub/sub** as the message bus between hospital-specific channels
- **PostgreSQL** as the source of truth for all assignment state
- **React** on the frontend, with a custom `useSocket` hook that handles connection lifecycle

Each ward gets its own Redis channel: `ward:{hospitalId}:{wardId}`. When a dispatcher makes an assignment change, the write goes to Postgres first, then an event is published to the relevant Redis channel. Every WebSocket server subscribed to that channel fans the message out to its connected clients in that ward.

```
Dispatcher action
  ↓
POST /assignments → Postgres write
  ↓
Redis PUBLISH ward:H1:ICU { type: "ASSIGN", nurseId, patientId }
  ↓
WS Server (subscribed to ward:H1:ICU)
  ↓
Push to connected ward clients
```

## Handling Disconnects and Reconnection

The trickiest part wasn't the happy path — it was what happens when a nurse's tablet drops off the network for 30 seconds and reconnects. You can't just re-subscribe; you need to reconcile state.
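Stripped of the Redis and WebSocket plumbing, the per-ward fanout described in the architecture section above reduces to a channel registry keyed by ward. Here's a minimal in-memory sketch of that routing — the `Map`-based registry and the function names are my illustration, not the production code; in the real system `publish` is a Redis `PUBLISH` and the subscriber callbacks are WebSocket pushes:

```javascript
// In-memory stand-in for Redis pub/sub: channel name -> Set of subscriber
// callbacks (each callback stands in for a connected WebSocket client).
const subscribers = new Map();

// Channel naming convention from the article: ward:{hospitalId}:{wardId}
function wardChannel(hospitalId, wardId) {
  return `ward:${hospitalId}:${wardId}`;
}

// Register a client on a ward channel; returns an unsubscribe function
// (called when the client's socket closes).
function subscribe(channel, onMessage) {
  if (!subscribers.has(channel)) subscribers.set(channel, new Set());
  subscribers.get(channel).add(onMessage);
  return () => subscribers.get(channel).delete(onMessage);
}

// Fan an event out to every subscriber on that channel — and only that
// channel, which is the whole point of per-ward granularity.
function publish(channel, event) {
  for (const onMessage of subscribers.get(channel) ?? []) {
    onMessage(event);
  }
}
```

After the Postgres write succeeds, the HTTP layer would call something like `publish(wardChannel('H1', 'ICU'), { type: 'ASSIGN', nurseId, patientId })`, matching the diagram above.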
Our solution: every client sends a `lastSeenEventId` on reconnect. The server replays any missed events from a Redis list (we keep a 5-minute rolling window per channel). If the client has been offline longer than that, it gets a full-state snapshot from Postgres and resumes from there. This pattern — "delta sync with snapshot fallback" — is something I've since applied to several other real-time features. It's not original, but it's reliable.

## What 35% Better Utilization Actually Means

The metric we tracked post-launch: the ratio of nurse-hours actually assigned to wards vs. available nurse-hours. Before the system, dispatchers were working from whiteboards and Excel sheets, and assignments often lagged 15–20 minutes behind patient intake events. After: assignments pushed in under 500ms from the dispatch event. The 35% improvement came from eliminating that lag, not from any algorithmic scheduling magic — though we did add a basic rules engine for flagging under-staffed wards.

## Lessons

A few things that weren't obvious before we built this:

**Redis channel granularity matters a lot.** We initially had one channel per hospital. This meant every connected client in a 1,200-bed hospital received every event, regardless of ward. Granular channels (per ward) cut message volume by ~85%.

**WebSocket heartbeat tuning is underrated.** The default ping/pong intervals in most WS libraries are too conservative for a clinical environment, where a dropped connection should be detected within 5–10 seconds, not 60. We settled on an 8s ping with a 4s pong timeout.

**Don't put business logic in the WS layer.** The WebSocket server's only job is fanout. It doesn't validate, doesn't compute — it subscribes to Redis and pushes. All logic lives in the HTTP API layer. This made the system much easier to test and reason about.
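As a closing reference, the "delta sync with snapshot fallback" decision from the reconnection section reduces to a small pure function. This is a sketch only — it assumes monotonically increasing integer event ids and a `recentEvents` array standing in for the per-channel rolling Redis list; the names are mine, not the production code:

```javascript
// Decide how to reconcile a reconnecting client: replay missed deltas if
// the client's lastSeenEventId is still covered by the rolling buffer,
// otherwise fall back to a full-state snapshot (Postgres in the real system).
function reconcile(lastSeenEventId, recentEvents) {
  // Events the client missed while offline.
  const missed = recentEvents.filter((e) => e.id > lastSeenEventId);

  // The client is "covered" if no events fell off the buffer between its
  // last seen id and the oldest event we still hold.
  const oldestBuffered = recentEvents.length > 0 ? recentEvents[0].id : null;
  if (oldestBuffered !== null && lastSeenEventId >= oldestBuffered - 1) {
    return { mode: 'delta', events: missed };
  }

  // Gap detected (or empty buffer): replaying deltas could silently skip
  // events, so send a snapshot instead.
  return { mode: 'snapshot', events: [] };
}
```

The conservative branch matters: when in doubt, a snapshot costs one extra Postgres read, whereas a missed delta leaves a nurse looking at a stale assignment.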
