Shipping an LLM-powered feature to production is different from running a notebook. Here's how we built a real-time inference pipeline for an automated detection system — handling job queuing, streaming responses, and failure recovery at the backend level.
The project: automate a verification workflow that was previously done manually by operators reviewing images. The client needed inference results streamed back in real time — operators couldn't stare at a loading spinner for 8 seconds while the model ran.
The output needed to feel live. That meant streaming.
The Pipeline Architecture
Client (WebSocket)
↓
FastAPI WS endpoint → Redis Stream (XADD job)
↓
Worker Process (model inference)
↓
Redis Stream (XADD result chunk)
↓
FastAPI consumer → WebSocket push to client
We used Redis Streams (not pub/sub) here intentionally. Streams give you persistent, ordered, consumer-group-based processing. If a worker crashes mid-inference, the job isn't lost — another worker picks it up via XAUTOCLAIM.
FastAPI + Streaming Responses
FastAPI's WebSocket support is genuinely good for this pattern. Here's the core of the WS endpoint:
@app.websocket("/ws/inference/{job_id}")
async def inference_ws(websocket: WebSocket, job_id: str):
await websocket.accept()
# Subscribe to result stream for this job
last_id = "0"
while True:
results = await redis.xread(
{f"results:{job_id}": last_id}, block=200, count=10
)
if results:
for _, messages in results:
for msg_id, data in messages:
last_id = msg_id
await websocket.send_json(data)
if data.get("done"):
await websocket.close()
return
The worker runs the model inference in a separate process (we used Ray for worker management), writing result chunks to the stream as they're generated. The WS endpoint just polls and pushes.
Handling the Hard Parts
Backpressure. LLM inference is slow relative to WebSocket throughput. We added a simple token bucket on the worker side to prevent Redis stream buildup when a batch of jobs arrived simultaneously.
Job timeouts. Every job gets a 30-second TTL in a Redis sorted set. A background task sweeps expired jobs and sends a {done: true, error: "timeout"} event so the client isn't left hanging.
Model cold starts. Running on AWS ECS, the first inference after a scale-up could take 4–6 seconds for model loading. We pre-warmed the container by sending a dummy inference on startup. Not elegant, but it worked.
What 70% Accuracy Improvement Actually Means
The baseline was manual operator review — subjectively accurate but slow and inconsistent across operators. The LLM-assisted workflow surfaced the most likely classification first, with a confidence score, and operators confirmed or overrode. The 70% improvement is against a benchmark set the client defined based on their historical error rate.
The key insight: you don't need the model to be right 100% of the time if the workflow is designed around human confirmation. The model's job is to reduce cognitive load, not replace judgment.
FastAPI vs. Flask for This Use Case
We chose FastAPI specifically for async support throughout. Flask's sync model would have required a thread per WebSocket connection, which doesn't scale. FastAPI's async WebSocket handling means a single event loop manages hundreds of concurrent connections efficiently.
The Pydantic validation on request/response bodies was also genuinely useful — it caught a surprising number of integration bugs early.