From Turn-Based Chats to Continuous Interaction
Thinking Machines, founded by former OpenAI CTO Mira Murati, is challenging the core pattern of how people use AI. Most AI chat interfaces today work like email: you type or speak, then wait while the system processes your request and responds in one big block. The company argues this turn-based approach creates a “collaboration bottleneck,” because real work rarely fits neatly into single prompts and complete answers. Its new real-time multimodal AI “interaction models” are built to handle audio, video, and text at once, without external scaffolding or separate encoders bolted on later. Instead of treating speech, vision, and language as separate modes, the system co-trains everything in one transformer stack. The goal is an AI that stays in the loop with you, listening, watching, and responding continuously, so interacting with it feels less like operating a tool and more like collaborating with a human teammate.
Inside the 0.4-Second Response Time
At the heart of Thinking Machines’ approach is a 200-millisecond “micro-turn” architecture. Rather than waiting for you to finish speaking or for a video clip to end, the model slices interaction into tiny time windows. Every 200 milliseconds, it ingests fresh audio and video while simultaneously generating its next output. That design underpins the flagship TML-Interaction-Small model, which the company says delivers responses in about 0.40 seconds, faster than the reported latencies of Google’s Gemini-3.1-flash-live and OpenAI’s GPT-realtime-2.0 minimal. This speed is not just a benchmark brag; it is what makes real-time audio and video processing feel natural instead of laggy. The system uses lightweight audio embeddings and image patches rather than heavy standalone encoders, and it runs as a Mixture-of-Experts model with only a fraction of its parameters active at any moment, keeping inference within tight latency constraints.
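To make the micro-turn idea concrete, here is a minimal sketch of what a 200 ms ingest-and-generate loop could look like. Thinking Machines has not published its implementation; the names here (MicroTurnModel, capture_audio, capture_video_frame) are hypothetical stand-ins, and the “model” is a stub that only illustrates the timing pattern.

```python
# Hypothetical sketch of a 200 ms "micro-turn" loop: on every tick, the model
# ingests whatever audio/video arrived since the last tick and may emit output.
# These names are illustrative, not Thinking Machines' actual API.
import time

MICRO_TURN_SEC = 0.2  # 200 ms window


class MicroTurnModel:
    """Stand-in for an interaction model that consumes streams incrementally."""

    def step(self, audio_chunk: bytes, video_frame: bytes | None) -> str | None:
        # A real model would update its state with new audio embeddings and
        # image patches, then decide whether to speak or stay silent.
        if audio_chunk:
            return "uh-huh"  # placeholder partial response
        return None


def capture_audio(duration_sec: float) -> bytes:
    """Pretend to read `duration_sec` of microphone audio (stub)."""
    return b"\x00" * int(16_000 * duration_sec)  # 16 kHz mono placeholder


def capture_video_frame() -> bytes | None:
    """Pretend to grab the latest camera frame (stub)."""
    return b"frame"


def run_micro_turns(model: MicroTurnModel, total_sec: float = 2.0) -> None:
    start = time.monotonic()
    while time.monotonic() - start < total_sec:
        tick_start = time.monotonic()
        audio = capture_audio(MICRO_TURN_SEC)   # ingest fresh audio
        frame = capture_video_frame()           # and the latest frame
        output = model.step(audio, frame)       # generate (or stay silent)
        if output:
            print(f"[{time.monotonic() - start:4.1f}s] model: {output}")
        # Sleep out the remainder of the 200 ms window.
        elapsed = time.monotonic() - tick_start
        time.sleep(max(0.0, MICRO_TURN_SEC - elapsed))


if __name__ == "__main__":
    run_micro_turns(MicroTurnModel())
```

The point of the pattern is that perception and generation share the same small time budget, so the system never has to wait for a “finished” utterance before it can start responding.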
Multimodal Awareness: Seeing, Hearing, and Timing in Real Time
The most distinctive feature of these AI interaction models is how they treat perception as continuous. Because the model is always listening and watching, it can react to what changes on screen or in a room, not just to what is said. In demos, the AI counts push-up repetitions from video, notices slouching posture, and provides real-time translation, all while keeping up a conversation. Thinking Machines calls this “visual proactivity”: ask it to track an action and it does so as events unfold, instead of acknowledging the request and going silent. The system is also explicitly time-aware, so it can handle tasks like timed breathing cues or answering questions about how long a coding session lasted. These capabilities set it apart from traditional real-time multimodal AI systems, which typically rely on external timers, separate vision services, or pre-scripted logic rather than a single model that perceives and responds in one continuous loop.
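The behavior described above is easier to see in a toy example. The sketch below is not how the model works internally (it is a single co-trained transformer, not a rule-based counter); it simply illustrates, with a hypothetical pose signal, the kind of proactive, time-stamped callouts that older systems would delegate to separate vision services and external timers.

```python
# Toy illustration of "visual proactivity": reacting to a video-derived signal
# as it changes, rather than only when asked. The Frame type, hip_height field,
# and thresholds are made up for this sketch.
from dataclasses import dataclass


@dataclass
class Frame:
    t: float           # seconds since the session started
    hip_height: float  # normalized 0..1, from a hypothetical pose estimator


def count_pushups(frames: list[Frame], down_thresh: float = 0.3,
                  up_thresh: float = 0.7) -> None:
    reps, is_down = 0, False
    for f in frames:
        if not is_down and f.hip_height < down_thresh:
            is_down = True                       # lowered into a push-up
        elif is_down and f.hip_height > up_thresh:
            is_down = False
            reps += 1
            # Proactive callout, timestamped because the loop is time-aware.
            print(f"[{f.t:4.1f}s] that's {reps}!")
    print(f"session lasted {frames[-1].t - frames[0].t:.1f}s, {reps} reps total")


if __name__ == "__main__":
    # Simulated pose signal: three dips and rises over roughly six seconds.
    heights = [0.8, 0.5, 0.2, 0.5, 0.8, 0.2, 0.8, 0.4, 0.1, 0.6, 0.9]
    count_pushups([Frame(t=i * 0.6, hip_height=h) for i, h in enumerate(heights)])
```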
Two-Model Design: Instant Reactions, Deeper Reasoning
To balance speed with reasoning depth, Thinking Machines uses a two-model architecture. The front-facing interaction model handles the live conversation: it manages turn-taking implicitly, detects when you are thinking or yielding the floor, and can interject mid-sentence. In the background, a separate asynchronous model tackles heavier tasks such as complex reasoning, long-form planning, or intricate analyses. As the background model finishes subtasks, the interaction layer weaves the results back into the live dialogue without freezing or dropping context. This design mirrors how humans often think ahead while talking. It also contrasts with many existing AI workflows, where a single model processes a request end-to-end and only then returns a result. By decoupling responsiveness from heavy computation, Thinking Machines aims to build AI interaction models that are both conversationally fluid and intellectually capable, narrowing the gap between human-to-human and human-to-AI collaboration.
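A rough sketch of the fast-path/slow-path split is shown below, assuming made-up quick_reply and deep_reasoner functions and toy latencies. Thinking Machines has not published this interface; the sketch only shows how a live loop can keep responding while a slower task finishes and then fold the result back into the dialogue.

```python
# Minimal sketch of a two-model split: a fast interaction path keeps the
# conversation going while a slower background model works, and its result is
# woven into the dialogue once ready. All names and timings are illustrative.
import asyncio


async def quick_reply(user_utterance: str) -> str:
    """Fast interaction model: answers in well under a second."""
    await asyncio.sleep(0.2)  # stands in for ~200 ms micro-turn latency
    return f"Got it, looking into '{user_utterance}' now."


async def deep_reasoner(task: str) -> str:
    """Slower background model: heavy reasoning, planning, long analyses."""
    await asyncio.sleep(2.0)  # stands in for seconds of heavy computation
    return f"Detailed plan for '{task}' (3 steps, 2 risks flagged)."


async def converse(user_utterance: str) -> None:
    # Kick off the heavy work without blocking the live conversation.
    background = asyncio.create_task(deep_reasoner(user_utterance))

    print("AI:", await quick_reply(user_utterance))

    # The interaction layer keeps talking until the background result lands.
    while not background.done():
        await asyncio.sleep(0.5)
        print("AI: (still chatting, keeping context alive...)")

    print("AI:", background.result())  # weave the result into the dialogue


if __name__ == "__main__":
    asyncio.run(converse("plan the product launch"))
```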
Positioning Against Today’s AI and What Comes Next
Thinking Machines is still in research-preview mode, with limited access promised for partners ahead of a wider release. Yet the startup is already positioning itself against leading AI platforms by emphasizing interactivity over raw benchmark scores. On interactivity-focused tests, the company reports that its TML-Interaction-Small model surpasses current competitors in both responsiveness and certain proactive tasks, even if some high-quality configurations from other providers maintain an edge on specific intelligence benchmarks. Strategically, Murati’s team is betting that the next wave of AI adoption will be defined less by static chatbots and more by systems that can collaborate across audio, video, and text in real time. If their real-time multimodal AI approach scales, it could reshape expectations for how quickly and naturally AI should respond, moving the field away from prompt-response workflows toward continuous, human-like interaction patterns.
