
How Superhuman Handles 200K Requests per Second for Real-Time AI Inference

From Prototype Models to Internet-Scale AI Inference

Superhuman’s productivity suite now delivers AI-powered writing assistance to more than 40 million daily users, correcting grammar and refining tone across dozens of languages in real time. Behind every suggestion is a custom large language model that must respond at over 200,000 queries per second with roughly 50 input tokens and 50 output tokens per request. The platform’s production target is aggressive: p99 latency under one second and four nines of reliability. Meeting those goals at such volume highlights the gap between lab benchmarks and enterprise-grade AI inference scaling. A model that appears fast in isolation can crumble once exposed to diurnal traffic spikes, noisy neighbors, and cascading retries. Superhuman’s journey shows that moving from prototype to production is less about squeezing a few more tokens per second from a GPU and more about building a resilient serving architecture that treats AI inference as a critical, always-on system.
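
To put those figures in perspective, here is a quick back-of-envelope calculation (illustrative only, using the numbers above) of the aggregate token throughput the serving layer has to sustain:

```python
# Back-of-envelope throughput implied by the figures above (illustrative only).
qps = 200_000                     # sustained queries per second
tokens_in, tokens_out = 50, 50    # rough tokens per request

print(f"~{qps * tokens_in / 1e6:.0f}M input tokens/s, "
      f"~{qps * tokens_out / 1e6:.0f}M output tokens/s")
# -> ~10M input tokens/s, ~10M output tokens/s
```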

Why Superhuman Moved from DIY Serving to a Foundation Model API

Originally, Superhuman ran its grammar model on a self-managed vLLM stack backed by L40S GPUs and internal tooling for training and model management. This DIY approach worked but incurred mounting costs in engineering time: each new model version required months of hand-tuned performance optimization, while a small infrastructure team managed capacity planning, autoscaling policies, and reliability. To keep up with product ambitions, Superhuman partnered with Databricks and migrated to Databricks model serving, effectively treating inference as a foundation model API. This shift abstracted away much of the infrastructure complexity, letting Superhuman’s engineers refocus on model quality and user experience instead of low-level deployment mechanics. Crucially, the partnership was anchored in explicit service-level objectives: sub-second p99 latency with no quality regression on Superhuman’s internal evaluation harnesses. That contract framed the engineering work as a shared responsibility between AI product and platform provider.

Load Balancing and Autoscaling for 200K QPS Without Latency Spikes

Hitting latency targets on a single pod is only the starting point; the real challenge is sustaining those targets under 200K+ QPS, sharp diurnal ramps, and unpredictable bursts. Superhuman and Databricks discovered that Kubernetes’ default round-robin load balancing produced uneven request distribution at high throughput, creating hotspots and inflated tail latency. They instead built an Endpoint Discovery Service that watches Kubernetes Services and EndpointSlices and drives a custom load balancer based on the “power of two choices” algorithm. For each request, two candidate pods are sampled and traffic is sent to the one with fewer active requests, smoothing load across replicas. Autoscaling is tuned around average request concurrency per pod, with targets informed by benchmarked sustainable RPS. Scale-up is intentionally aggressive to absorb sudden surges, while scale-down is conservative to prevent flapping that can destabilize latency. Joint shadow testing across both teams helped refine these parameters and catch edge cases early.
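
The routing and scaling logic is easiest to see in a small sketch. The snippet below is an illustrative Python rendering of the power-of-two-choices policy and a concurrency-driven replica target; the class, helper names, and thresholds are hypothetical, not the actual Endpoint Discovery Service or Databricks autoscaler code.

```python
import math
import random
from collections import defaultdict


class PowerOfTwoChoicesBalancer:
    """Illustrative router: sample two replicas and send the request to the
    one with fewer in-flight requests (names and structure are hypothetical)."""

    def __init__(self, pods: list[str]):
        self.pods = pods                   # e.g. addresses discovered from EndpointSlices
        self.in_flight = defaultdict(int)  # active requests per pod

    def start_request(self) -> str:
        a, b = random.sample(self.pods, 2)                    # two random candidates
        pod = a if self.in_flight[a] <= self.in_flight[b] else b
        self.in_flight[pod] += 1
        return pod

    def finish_request(self, pod: str) -> None:
        self.in_flight[pod] -= 1


def desired_replicas(total_concurrency: float, per_pod_target: float,
                     current: int) -> int:
    """Concurrency-based scaling sketch: jump straight to the computed target
    when scaling up, but shed at most ~10% of replicas per step when scaling
    down (the 10% threshold is an assumption, not a published setting)."""
    target = math.ceil(total_concurrency / per_pod_target)
    if target >= current:
        return target                                # aggressive scale-up
    return max(target, math.floor(current * 0.9))    # conservative scale-down
```

Picking the less-loaded of two random replicas avoids both the hotspots of naive round-robin and the coordination cost of tracking global state for a "least loaded of all" policy, which is why it smooths tail latency at high throughput.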

Durable Execution: A Safety Net for Production AI Workflows

As AI systems move from single-shot prompts to complex, multi-step workflows, reliability demands extend beyond model serving to orchestration. Temporal’s durable execution platform offers a complementary pattern: it turns fragile, distributed workflows into crash-proof sequences by automatically persisting state and resuming execution after failures. Originally forked from Uber’s Cadence engine, Temporal lets developers express business logic in normal code rather than DSLs, while the underlying engine handles retries, timeouts, and recovery. This approach is particularly relevant for AI inference scaling, where chains of model calls, data fetches, and post-processing steps must complete reliably despite flaky networks or transient API failures. With more than 3,000 paying customers relying on its open-source and cloud offerings, Temporal demonstrates how a workflow “safety net” can sit beneath AI-powered applications, complementing foundation model APIs by ensuring that every inference-driven process either runs to completion or fails predictably, without bespoke error-handling scaffolding.
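
As a concrete illustration, here is a minimal sketch using Temporal's Python SDK; the correct_grammar activity, its placeholder body, and the timeout and retry values are hypothetical stand-ins for a call to a model-serving endpoint, not code from Temporal's or Superhuman's deployments.

```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def correct_grammar(text: str) -> str:
    # Hypothetical stand-in for a call to the inference endpoint; transient
    # failures raised here are retried by Temporal, not by bespoke code.
    return text  # placeholder


@workflow.defn
class GrammarCorrectionWorkflow:
    @workflow.run
    async def run(self, text: str) -> str:
        # The workflow's progress is persisted as event history, so a crashed
        # worker resumes here without re-running completed activities.
        return await workflow.execute_activity(
            correct_grammar,
            text,
            start_to_close_timeout=timedelta(seconds=5),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
```

Running the sketch requires a Temporal server plus a worker that registers the workflow and activity; the engine then owns retries, timeouts, and recovery, so the application code stays plain Python.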

The Emerging Blueprint for High-Throughput, Low-Latency AI

Taken together, Superhuman’s partnership with Databricks and Temporal’s rise as a durable execution layer outline an emerging blueprint for AI inference in production. At the core is a separation of concerns: foundation model APIs handle the heavy lifting of serving models at high throughput with strict production latency optimization, while workflow engines like Temporal ensure that complex AI-driven processes are resilient by default. Engineering teams can then focus on model architecture, evaluation, and product experience, instead of rebuilding load balancers, autoscalers, and retry logic from scratch. Real-world deployments show that scaling to 200K QPS with sub-second p99 latency is less about any single breakthrough and more about rigorous systems thinking: smarter load distribution, data-informed autoscaling, explicit SLAs, and fault-tolerant orchestration. For startups and enterprises alike, these patterns mark the path from promising prototypes to dependable, large-scale AI experiences.
