From Turn-Based Chats to Live Collaboration
Thinking Machines, founded by former OpenAI CTO Mira Murati, is challenging the turn-based design of today’s AI assistants. Most frontier models still wait for users to finish typing or speaking, then generate a full response before listening again. The company argues this “collaboration bottleneck” keeps humans at arm’s length, more like sending emails than having a live discussion. Its answer is a new class of real-time AI models called interaction models, built to handle audio, video, and text as continuous streams rather than discrete turns. Instead of bolting together separate tools for speech, vision, and dialog control, Thinking Machines has architected a single system that treats multimodal input as native. The ambition is clear: move beyond batch-style workflows toward low-latency AI that can stay in the loop with you, adapting as your intent and environment change moment by moment.
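To make the contrast concrete, here is a minimal sketch of the two interaction styles. Every name in it (respond, step, MediaSlice) is an illustrative assumption, not Thinking Machines’ actual API:

```python
from typing import Iterator

# Turn-based: one complete request in, one complete response out.
# The model is deaf while it generates and mute while it listens.
def turn_based_chat(model, user_message: str) -> str:
    return model.respond(user_message)

# Streaming: slices of audio/video/text flow in continuously while the
# model emits output, with no hard boundary between "turns".
def streaming_chat(model, media_slices: Iterator["MediaSlice"]) -> Iterator["AudioChunk"]:
    state = model.initial_state()
    for media_slice in media_slices:   # arrives every fraction of a second
        state, out = model.step(state, media_slice)
        if out is not None:
            yield out                  # output can overlap incoming input
```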
Inside the 0.4-Second Interaction Engine
At the core of Thinking Machines’ approach is a 200-millisecond “micro-turn” architecture. Every 0.2 seconds, the model ingests fresh slices of audio and video while simultaneously generating its own output, erasing the artificial boundaries between speaking and listening. Its flagship TML-Interaction-Small model posts a 0.40-second response time on FD-bench turn-taking tests, outpacing Gemini-3.1-flash-live and GPT-realtime-2.0 on this metric. That speed underpins use cases where even a minor delay breaks the illusion of conversation: live translation, coaching, on-the-fly explanation of complex tasks, or responsive creative brainstorming. Under the hood, audio is encoded via lightweight dMel embeddings and images are split into compact 40×40 patches, all co-trained with a transformer rather than handled by separate encoders. The result is a low-latency AI system optimized not for offline intelligence, but for staying present in the flow of interaction.
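In rough pseudocode, the micro-turn loop might look something like the sketch below. The mic, camera, and model objects are hypothetical stand-ins, and in this sketch the dMel encoding and 40×40 patching are assumed to happen inside the model step rather than in the loop itself:

```python
import time

MICRO_TURN_S = 0.2  # the 200-millisecond micro-turn cadence

def run_interaction_loop(model, mic, camera, speaker):
    """Duplex loop: every micro-turn both ingests input and may emit output."""
    state = model.initial_state()
    while True:
        t0 = time.monotonic()
        audio_slice = mic.read(seconds=MICRO_TURN_S)  # raw audio for dMel encoding
        frame = camera.latest_frame()                 # raw image for 40x40 patching
        # A single forward step consumes the fresh slices and may produce
        # speech, so "listening" and "speaking" are no longer separate phases.
        state, out_audio = model.step(state, audio_slice, frame)
        if out_audio is not None:
            speaker.play(out_audio)  # playback can overlap the user's speech
        # Hold the fixed cadence so ingestion never pauses for generation.
        time.sleep(max(0.0, MICRO_TURN_S - (time.monotonic() - t0)))
```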
Seeing, Timing, and Talking at Once
These interaction models aim to go beyond faster replies to unlock qualitatively new behavior. Because the system processes video and audio continuously, it can exhibit what Thinking Machines calls visual proactivity: it reacts to what changes in view, not just to explicit verbal commands. In demos, the real-time AI models count exercise repetitions from a camera feed, notice posture shifts like slouching, and keep up a conversational thread while doing so. Time awareness is baked in as well, allowing the assistant to answer questions such as how long a coding task took or to pace timed breathing exercises without extra tooling. Crucially, the model can speak over you, interject, or yield the conversational floor dynamically, enabling experiences like live sports commentary and simultaneous translation that standard turn-based conversational AI struggles to support.
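One way to picture this floor-control behavior is as a tiny policy evaluated once per micro-turn, weighing the user’s speech against visually salient events. The signals and threshold below are illustrative assumptions; in practice, the model presumably learns this behavior end to end rather than following hand-written rules:

```python
from dataclasses import dataclass

@dataclass
class MicroTurnDecision:
    speak: bool        # emit audio during this micro-turn
    yield_floor: bool  # stop talking because the user has the floor

def decide_turn(user_is_speaking: bool, model_is_speaking: bool,
                visual_change_score: float, interject_threshold: float = 0.8):
    """Toy floor-control policy evaluated once per 200 ms micro-turn."""
    # Visual proactivity: a salient change in view (a completed rep,
    # a posture shift) can trigger speech without any verbal prompt.
    wants_to_interject = visual_change_score > interject_threshold
    if user_is_speaking and not wants_to_interject:
        # Yield the floor if we were mid-sentence when the user cut in.
        return MicroTurnDecision(speak=False, yield_floor=model_is_speaking)
    return MicroTurnDecision(speak=model_is_speaking or wants_to_interject,
                             yield_floor=False)

# The user keeps talking, but a salient visual event appears in view:
print(decide_turn(True, False, visual_change_score=0.9))
# MicroTurnDecision(speak=True, yield_floor=False)
```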
A Two-Model Stack for Speed and Depth
To reconcile responsiveness with complex reasoning, Thinking Machines separates interaction from heavy thinking. The fast interaction model runs in the foreground, maintaining context, reading incoming audio and video streams, and generating immediate responses in micro-turns. When a request demands deeper analysis, it hands work to a slower background model, then blends the results back into the ongoing conversation as they arrive. This architecture keeps the user’s experience fluid: you can clarify instructions, change direction, or point the camera at something new while the system continues working. Benchmarks reflect this design trade-off. On FD-bench for interactivity, TML-Interaction-Small significantly outperforms other real-time AI models. On broader intelligence tests like Audio MultiChallenge, it lands slightly below the highest-quality GPT-realtime setting but above many live competitors, underscoring its orientation toward practical, real-time applications rather than pure offline score-chasing.
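That foreground/background split maps naturally onto an asynchronous handoff. Below is a minimal sketch under stated assumptions: heavy_reasoning, quick_reply, and the queue-based interface are stand-ins for illustration, since the real system blends background results into live speech rather than a text queue:

```python
import asyncio

def needs_deep_analysis(msg: str) -> bool:
    return "analyze" in msg  # toy heuristic standing in for a learned router

def quick_reply(msg: str) -> str:
    return f"quick take on: {msg}"

async def heavy_reasoning(task: str) -> str:
    await asyncio.sleep(2.0)  # simulate a slow background reasoning model
    return f"deep answer for: {task}"

async def interaction_loop(incoming: asyncio.Queue, outgoing: asyncio.Queue):
    """Foreground loop that stays responsive while background work runs."""
    pending = set()
    while True:
        # Surface any finished background results mid-conversation.
        for task in [t for t in pending if t.done()]:
            pending.discard(task)
            await outgoing.put(task.result())
        try:  # wait at most one micro-turn for new input
            msg = await asyncio.wait_for(incoming.get(), timeout=0.2)
        except asyncio.TimeoutError:
            continue
        if needs_deep_analysis(msg):
            pending.add(asyncio.create_task(heavy_reasoning(msg)))
            await outgoing.put("working on it; keep talking in the meantime")
        else:
            await outgoing.put(quick_reply(msg))

async def demo():
    inbox, outbox = asyncio.Queue(), asyncio.Queue()
    loop_task = asyncio.create_task(interaction_loop(inbox, outbox))
    await inbox.put("analyze this codebase")
    await inbox.put("what did I just say?")
    for _ in range(3):
        print(await outbox.get())  # quick replies land before the deep answer
    loop_task.cancel()

asyncio.run(demo())
```

The key design property the sketch illustrates is that the foreground loop never blocks on the slow call: new input keeps being consumed every 0.2 seconds, and the deep result is folded in whenever it completes.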
Murati’s Vision: Interactivity as a First-Class Feature
Mira Murati’s track record at OpenAI, where she oversaw the development of ChatGPT and briefly served as interim CEO, gives context to Thinking Machines’ strategy. After leaving to found the company, she positioned it squarely around a belief that interactivity should scale alongside intelligence, arguing that the way we work with AI cannot remain an afterthought. The young startup has already weathered aggressive talent poaching and leadership changes, yet its interaction models represent a clear technical thesis: the next wave of conversational AI will be judged less on isolated benchmark scores and more on how naturally it collaborates with humans in real time. With a research preview available and limited access for partners planned before a wider rollout, the question now is whether this low-latency, multimodal approach can become a practical alternative to today’s turn-based assistants in everyday workflows.
