Thinking Machines’ 0.4-Second Interaction Models Push Conversational AI Toward Truly Real-Time Dialogue

From Turn-Based Chats to Continuous Real-Time AI Interaction

Thinking Machines’ new interaction models are designed to solve a fundamental problem in today’s AI assistants: they still behave like turn-based systems. Users speak or type, then wait while the model processes a full request before responding. This lag breaks conversational flow and limits how much nuance, intent, and context can be conveyed. The company’s Interaction Voice Models instead enable real-time AI interaction across audio, video, and text simultaneously, treating conversation as a continuous stream rather than discrete turns. The system listens while it talks, tracks what it sees in video, and manages dialogue in parallel, aiming to feel more like a collaborator than a command line. By positioning these models as a research preview for developers and researchers, Thinking Machines is signaling a shift in focus from raw model intelligence toward interaction quality as a core dimension of multimodal AI models.
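
To make the contrast concrete, here is a minimal Python sketch of the two interaction styles. The EchoModel class and its methods are illustrative placeholders, not any vendor’s real API.

```python
# Schematic contrast between turn-based and continuous interaction.
# EchoModel is a toy placeholder, not anyone's real model interface.
from typing import Iterable, Iterator

class EchoModel:
    def complete(self, request: str) -> str:
        return request.upper()   # pretend this takes seconds
    def update(self, chunk: str) -> str:
        return chunk.upper()     # pretend this takes milliseconds

def turn_based_chat(model: EchoModel, request: str) -> str:
    # Classic assistant: the user finishes, the model processes the
    # whole request, and only then does a reply appear.
    return model.complete(request)

def streaming_chat(model: EchoModel, input_stream: Iterable[str]) -> Iterator[str]:
    # Continuous interaction: partial input flows in, partial output
    # flows back, and neither side waits for a completed turn.
    for chunk in input_stream:
        yield model.update(chunk)

print(turn_based_chat(EchoModel(), "hello there"))
print(list(streaming_chat(EchoModel(), ["hel", "lo ", "the", "re"])))
```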

How a 0.4-Second Voice AI Response Time Changes the Experience

The flagship TML-Interaction-Small model responds in roughly 0.40 seconds, a voice AI response time that edges ahead of other real-time systems such as Gemini-3.1-flash-live and GPT-realtime-2.0. That gap may look small on paper, but in conversation it’s the difference between a pause and a natural back-and-forth. Under the hood, Thinking Machines processes interaction in 200-millisecond micro-turns, continuously updating what it hears and sees while generating replies. A time-aware interaction model manages the live exchange, while an asynchronous background model tackles deeper reasoning and tool calls, feeding results back into the conversation without visible lag. This architecture lets the AI translate speech in real time, track posture, or count exercise reps from video while still talking, effectively shrinking the bandwidth bottleneck between humans and multimodal AI models and making continuous collaboration feel far more fluid.
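
Thinking Machines has not published developer documentation for this pipeline, but the loop the article describes can be sketched with ordinary asyncio code. Everything below, including MicroTurn, ToyInteractionModel.step, and play_audio, is a hypothetical stand-in for the pattern, not the company’s actual interface.

```python
# A minimal sketch of a 200 ms micro-turn loop (Python 3.10+).
import asyncio
from dataclasses import dataclass, field

MICRO_TURN_S = 0.2  # the article's 200-millisecond interaction slice

@dataclass
class MicroTurn:
    audio: bytes               # partial audio heard during this slice
    video_frame: bytes | None  # latest camera frame, if any
    text: str                  # any text typed during this slice

@dataclass
class ConversationState:
    transcript: list[str] = field(default_factory=list)

async def capture_micro_turn() -> MicroTurn:
    """Stand-in for microphone/camera capture over one 200 ms window."""
    await asyncio.sleep(MICRO_TURN_S)
    return MicroTurn(audio=b"", video_frame=None, text="")

def play_audio(chunk: bytes) -> None:
    """Stand-in for a speaker/output stream."""

class ToyInteractionModel:
    """Toy model that 'replies' with whatever audio it just heard."""
    async def step(self, turn: MicroTurn, state: ConversationState) -> bytes | None:
        state.transcript.append(f"heard {len(turn.audio)} bytes")
        return turn.audio or None

async def interaction_loop(model: ToyInteractionModel, state: ConversationState) -> None:
    # The model keeps listening while it talks: each 200 ms slice of
    # input is folded into the running state before the next output chunk.
    for _ in range(25):  # ~5 seconds of conversation for the demo
        turn = await capture_micro_turn()
        reply_chunk = await model.step(turn, state)
        if reply_chunk is not None:
            play_audio(reply_chunk)

asyncio.run(interaction_loop(ToyInteractionModel(), ConversationState()))
```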

Inside the Interaction Architecture: Micro-Turns and Multimodal Streams

Instead of treating a prompt as a single block of text or audio, Thinking Machines’ system slices interaction into 200-millisecond chunks, or micro-turns. Each micro-turn carries partial audio, visual, and textual context into the model, which updates its understanding of the scene and the conversation state in near real time. The time-aware interaction model keeps track of who is speaking, what is visible in the video, and how the dialogue is evolving, allowing the AI to interrupt, clarify, or react mid-utterance the way a human conversation partner can. In parallel, a background model handles extended reasoning, tool usage, or longer-running tasks, then merges those outputs back into the live stream without forcing the user to stop and wait. This multi-stream design is what enables the conversational AI breakthrough: responses that adapt on the fly to changing inputs instead of being locked to a single finished prompt.
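
The live-versus-background split can likewise be illustrated with standard asyncio primitives. The names here (deep_reasoning, live_loop, the merge queue) are assumptions made for the sake of the sketch, not Thinking Machines’ architecture.

```python
# Hypothetical sketch of the dual-model split: a live loop stays
# responsive in ~200 ms beats while a background task does slow
# reasoning or tool calls, merging its result back mid-conversation.
import asyncio

async def deep_reasoning(question: str) -> str:
    """Stand-in for a slow reasoning pass or external tool call."""
    await asyncio.sleep(2.0)  # e.g. a web lookup or a long chain of thought
    return f"(background answer to: {question!r})"

async def live_loop(merge_queue: "asyncio.Queue[str]") -> None:
    """Keep the conversation flowing, weaving in any background
    result the moment it lands on the queue."""
    for beat in range(20):
        try:
            result = merge_queue.get_nowait()
            print(f"[live] weaving in: {result}")
        except asyncio.QueueEmpty:
            print(f"[live] beat {beat}: still chatting, no visible stall")
        await asyncio.sleep(0.2)

async def main() -> None:
    merge_queue: asyncio.Queue[str] = asyncio.Queue()

    async def run_background(question: str) -> None:
        merge_queue.put_nowait(await deep_reasoning(question))

    # Fire off the slow task without blocking the live exchange.
    task = asyncio.create_task(run_background("how many reps was that set?"))
    await live_loop(merge_queue)
    await task  # confirm the background work finished cleanly

asyncio.run(main())
```

The key property in this arrangement is that the live cadence never blocks on slow work; background results are spliced into the conversation whenever they become ready.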

Mira Murati’s Strategic Bet on Interaction, Not Just Intelligence

Led by former OpenAI CTO Mira Murati, Thinking Machines is explicitly arguing that “interactivity should scale alongside intelligence.” Her experience launching mainstream systems like ChatGPT gives weight to the thesis that raw model power is no longer enough; how humans and AI collaborate has to be a first-class design concern. The startup has reportedly attracted significant investor attention and even acquisition overtures, underscoring how central real-time, multimodal AI models are becoming to the next wave of products. By building its stack from scratch around continuous interaction, the company is framing itself as a challenger to incumbents that still rely on turn-based paradigms. While the interaction models remain in research preview, early benchmarks and demos suggest a new baseline for what natural human-AI conversation should feel like, potentially reshaping expectations across productivity tools, creative software, and voice-first interfaces.

What Near-Instant Multimodal AI Means for the Future of Conversation

If Thinking Machines can deliver these capabilities at scale, the implications for everyday interaction with AI are substantial. A system that can watch, listen, and respond at human conversation speed unlocks use cases that clunky, latency-prone assistants could never support: live coaching during workouts, instant translation while maintaining eye contact, or collaborative work sessions where AI tracks documents, screens, and speech in tandem. Crucially, it also keeps humans in the loop during complex tasks, since users can interrupt, correct, or redirect the AI mid-flow. As more developers gain access to the research preview later this year, the broader ecosystem will test how robust these interaction models are in noisy, unpredictable real-world settings. Whether or not Thinking Machines maintains a lead, its 0.4-second benchmark is likely to set a new bar for any system claiming real-time AI interaction.
