From Turn-Based Chatbots to Always-On Interaction
Thinking Machines, founded by former OpenAI CTO Mira Murati in 2025, is challenging the turn-based design that has defined mainstream AI so far. Today’s frontier systems typically process one complete user input, generate a full reply, and only then accept new information, an interaction style the company describes as a “collaboration bottleneck.” In that model, users must wait for the system to finish before they can correct it, guide it, or add context, so complex work feels more like filing tickets with a machine than collaborating with a partner. Murati’s team argues that interactivity must scale with intelligence: if models grow more capable but remain locked into slow, sequential exchanges, much of users’ tacit knowledge never reaches the AI. Their answer is a new class of real-time multimodal AI, called interaction models, built to treat conversation as one continuous shared stream of audio, video, and text rather than a series of isolated turns.
Inside the 0.4-Second Low-Latency AI Response
The flagship TML-Interaction-Small model is built around a “micro-turn” architecture that targets fluid, near-human latency. Instead of waiting for full sentences or finished video clips, the system slices interaction into 200-millisecond chunks. In each chunk, it ingests fresh audio and video while simultaneously generating its own output, creating the feel of a live counterpart that listens while it speaks. This design underpins its headline benchmark: a 0.40-second response time on FD-bench turn-taking tests, faster than OpenAI’s GPT-realtime-2.0 minimal and Google’s Gemini-3.1-flash-live-preview. Crucially, the model handles audio, video, and text natively within one transformer stack rather than stitching together separate encoders. Audio arrives as dMel signals via a lightweight embedding layer, while images are decomposed into 40×40 patches through an hMLP. All of these components are co-trained to support continuous, low-latency responses without external orchestration.
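To make the numbers above concrete, here is a minimal Python sketch of the two slicing steps: cutting an audio buffer into 200-millisecond micro-turns and cutting a frame into 40×40 patches. It is an illustration only, not Thinking Machines’ implementation. The function names, the 16 kHz sample rate used in the usage note, and the choice to drop partial patches at the image edge are all assumptions; the article gives only the chunk length and the patch size.

```python
from typing import Iterator, List, Sequence, Tuple

CHUNK_MS = 200  # micro-turn length stated in the article
PATCH = 40      # patch side length stated in the article


def micro_turn_chunks(samples: Sequence[float], sample_rate: int,
                      chunk_ms: int = CHUNK_MS) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-duration slices of an audio buffer.

    Hypothetical helper: the last slice may be shorter than the rest.
    """
    step = sample_rate * chunk_ms // 1000  # samples per micro-turn
    for start in range(0, len(samples), step):
        yield samples[start:start + step]


def patch_grid(height: int, width: int, patch: int = PATCH) -> Tuple[int, int]:
    """Count whole patches along each axis (assumption: edge remainder is dropped)."""
    return height // patch, width // patch


def split_patches(image: List[List[int]],
                  patch: int = PATCH) -> List[List[List[int]]]:
    """Cut a 2-D image (list of pixel rows) into patch x patch tiles, row-major."""
    rows, cols = patch_grid(len(image), len(image[0]), patch)
    tiles = []
    for r in range(rows):
        band = image[r * patch:(r + 1) * patch]  # the rows covered by this tile row
        for c in range(cols):
            tiles.append([row[c * patch:(c + 1) * patch] for row in band])
    return tiles
```

Under the assumed 16 kHz sample rate, one second of audio yields five micro-turns of 3,200 samples each, and a 640×480 frame yields a 12-by-16 grid of patches. A real system would feed each chunk and patch through learned embedding layers, which this sketch leaves out.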
Real-Time Multimodal AI: Seeing, Counting, and Timing on the Fly
What distinguishes Thinking Machines’ voice interaction models is not only speed but how they use real-time audio, video, and text processing to act proactively. Because the system is always watching and listening, it can react to visual changes without waiting for explicit prompts: counting push-up repetitions from a camera feed, tracking posture and noticing when a person slouches, or providing live sports-style commentary. Time awareness is built in, letting the model answer questions such as how long a task took, or guide timed breathing exercises, based solely on its continuous stream of perception. Demos highlight translation in which the AI overlaps speech rather than waiting for full sentences, enabling back-and-forth conversations that resemble talking to a bilingual colleague. For more complex reasoning that cannot be resolved within micro-turns, an asynchronous background model runs in parallel, while the interaction model keeps the conversation flowing and blends deeper results in as they arrive.
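The split between a fast interaction loop and a slower background model can be sketched with Python’s asyncio. This is a toy under stated assumptions, not the company’s design: `background_reasoner`, `interaction_loop`, and `run_demo` are hypothetical names, the "models" are simulated with sleeps, and the timings are shortened so the demo finishes quickly.

```python
import asyncio
from typing import List


async def background_reasoner(question: str, delay_s: float) -> str:
    """Stand-in for a slower model; the sleep simulates deeper reasoning."""
    await asyncio.sleep(delay_s)
    return f"deep answer to {question!r}"


async def interaction_loop(question: str, turns: int, tick_s: float,
                           delay_s: float) -> List[str]:
    """Emit one output per micro-turn, folding in the background result once ready."""
    # Start the slow work without blocking the conversational loop.
    task = asyncio.create_task(background_reasoner(question, delay_s))
    transcript: List[str] = []
    delivered = False
    for i in range(turns):
        await asyncio.sleep(tick_s)  # one simulated micro-turn of listening and speaking
        if task.done() and not delivered:
            transcript.append(task.result())  # blend the deeper result into the stream
            delivered = True
        else:
            transcript.append(f"tick {i}")  # keep the conversation flowing
    if not delivered:
        transcript.append(await task)  # result arrived after the last micro-turn
    return transcript


def run_demo() -> List[str]:
    """Run five short micro-turns while a slower answer is computed in parallel."""
    return asyncio.run(interaction_loop("how long did the task take?",
                                        turns=5, tick_s=0.02, delay_s=0.05))
```

Calling `run_demo()` returns a transcript that begins with ordinary tick outputs and contains the background answer exactly once, at whichever micro-turn it became available. The point of the pattern is that the loop never waits on the slow task; it only checks whether a result is ready.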
Mira Murati’s Vision Amid Talent Turbulence
The technical debut of interaction models also serves as a statement of intent for a young company that has already weathered high-profile departures. Half of Thinking Machines’ original six-person founding team reportedly left within its first year, including then-CTO Barret Zoph, who returned to OpenAI following controversy. Meta is also said to have hired away seven founding members after failing to acquire the startup. In response, Murati recruited PyTorch creator Soumith Chintala as CTO and continued to push a vision of AI collaboration in which interface design is not an afterthought. Her background leading ChatGPT at OpenAI and briefly serving as interim CEO during that company’s leadership crisis lends weight to the effort. For now, TML-Interaction-Small is in research preview, with limited access promised for research partners in the coming months and a broader release targeted for later this year, as the team works to scale the architecture to larger models without sacrificing responsiveness.
