
Real-Time Multimodal AI Is Here: How Audio and Video Processing Changes What’s Possible

From Turn-Based Chats to Continuous Interaction

Thinking Machines is challenging the way people work with AI by attacking what it calls the “collaboration bottleneck.” Most conversational AI technology still behaves like email: you type or speak, the system waits, then sends back a complete reply before listening again. That turn-based model suits the AI, not the human, and makes it hard to steer or correct the system in real time. The company’s new interaction models flip this pattern by treating audio and video as continuous streams instead of discrete turns. Rather than waiting for you to finish, the model listens, sees, and responds in near real time, using what the company describes as 200-millisecond “micro-turns.” This approach aims to match how people actually work—talking, gesturing, and adjusting on the fly—so that real-time AI processing can keep up with human intent instead of lagging behind it.
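Thinking Machines has not published its implementation, but the micro-turn idea can be illustrated with a toy loop: instead of waiting for a full user turn, the system ticks on a fixed cadence, and on each tick it both ingests any new input and emits the next chunk of its response. The class name, the tick structure, and the sample data below are all hypothetical; only the 200-millisecond cadence comes from the article.

```python
from dataclasses import dataclass, field

MICRO_TURN_MS = 200  # cadence described in the article; everything else is illustrative

@dataclass
class MicroTurnLoop:
    """Toy interleaved loop: each tick ingests whatever the user said
    during the slice AND emits the next response chunk, rather than
    finishing one full turn before listening again."""
    transcript: list = field(default_factory=list)

    def tick(self, incoming_audio, pending_response):
        # Ingest user input captured during this slice (may be empty).
        if incoming_audio:
            self.transcript.append(("user", incoming_audio))
        # Emit the next chunk of the in-progress response, if any remains.
        if pending_response:
            chunk = pending_response.pop(0)
            self.transcript.append(("model", chunk))

loop = MicroTurnLoop()
response = ["Sure,", "let's", "adjust", "that."]
user_stream = ["Can you...", None, "actually, slower", None]  # None = silence
for audio in user_stream:
    loop.tick(audio, response)
```

The point of the sketch is the interleaving: the user's correction ("actually, slower") lands in the transcript while the model is mid-response, which is exactly the steering a strict turn-based system cannot support.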

How Interaction Models Work Across Audio, Video, and Text

At the core of Thinking Machines’ multimodal AI models is a continuous pipeline that processes sound, visuals, and language together. Audio arrives as lightweight embeddings, while images are broken into small patches before being fed into the same transformer that drives conversation. Every 200 milliseconds, the system ingests new audio or video frames and simultaneously generates its next response segment. This “listens while talking” design enables the model to handle audio video AI interaction more like a human partner than a scripted assistant. To balance speed with depth, the architecture is split in two: a fast interaction model keeps the conversation flowing, while a separate background model tackles more complex reasoning tasks and streams results back into the dialogue. The result is an AI that can stay present in the moment without sacrificing the deeper analysis users expect from advanced conversational AI technology.
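The two-model split described above can be sketched as a fast path that always answers within the current micro-turn, plus a slow path that posts heavier results onto a queue for the fast path to weave into later replies. The function names and message formats here are invented for illustration; the article confirms only the fast/background division itself.

```python
import queue

def fast_model(user_input, background_results):
    """Fast interaction path: replies immediately, folding in any
    background work that has finished since the last reply."""
    parts = [f"ack: {user_input}"]
    while not background_results.empty():
        parts.append(f"(update: {background_results.get_nowait()})")
    return " ".join(parts)

def background_model(task, background_results):
    """Slow reasoning path: performs heavier analysis and streams the
    result back into the dialogue via a shared queue."""
    background_results.put(f"{task} -> done")

results = queue.Queue()
reply1 = fast_model("summarise this clip", results)  # nothing finished yet
background_model("deep summary", results)            # heavy work completes
reply2 = fast_model("anything else?", results)       # update streamed back
```

The design choice this models: conversation latency is decoupled from reasoning latency, so a slow analysis never blocks the next 200-millisecond response segment.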

New Everyday Uses: Accessibility, Creation, and Hands-Free Work

Real-time AI processing opens up use cases that static, turn-based systems could only approximate. Thinking Machines’ demos highlight features like counting exercise repetitions from live video, gently correcting posture when someone slouches, and translating speech on the fly while maintaining a natural dialogue. Because the model can see changes and track time directly, it can handle tasks such as timing breathing exercises or answering “how long did that take?” without special app logic. For accessibility, this kind of audio video AI interaction could support live captioning, signposting visual cues, or acting as a proactive co-pilot for people who rely on voice and vision rather than keyboards. For creators, it promises on-the-spot feedback while filming or recording. And for hands-free interaction—whether in workshops, kitchens, or classrooms—multimodal AI models that can watch, listen, and respond continuously may finally feel like collaborative partners instead of static tools.
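Counting repetitions from live video reduces, once a pose estimator supplies a per-frame joint position, to threshold detection on a 1-D signal. The sketch below assumes a normalised position stream (values in [0, 1], e.g. hip height per frame) and hypothetical `high`/`low` thresholds; it is not the company's method, just a minimal version of the task.

```python
def count_reps(positions, high=0.8, low=0.2):
    """Count repetitions in a normalised per-frame position signal:
    one rep = the signal rising past `high`, then falling past `low`.
    Hysteresis (two thresholds) avoids double-counting jittery frames."""
    reps, up = 0, False
    for p in positions:
        if not up and p >= high:
            up = True          # top of the movement reached
        elif up and p <= low:
            up = False         # back to the bottom: rep complete
            reps += 1
    return reps

# Illustrative signal: two full squats across nine frames.
signal = [0.1, 0.5, 0.9, 0.6, 0.1, 0.4, 0.85, 0.3, 0.15]
count_reps(signal)  # -> 2
```

Because the same frame stream carries timestamps, "how long did that take?" falls out of the same loop by recording frame indices at each rep boundary, which is why the article notes no special app logic is needed.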

Why Mira Murati’s Involvement Signals a Bigger Shift

Thinking Machines’ push into real-time multimodal AI comes with notable pedigree. Founder Mira Murati previously served as CTO at OpenAI, where she helped lead the development of ChatGPT and briefly stepped in as interim CEO during a high-profile leadership shake-up. Her new company has already attracted intense attention, from reported acquisition interest to high-profile hires such as PyTorch creator Soumith Chintala as CTO. Even amid early turbulence, including several founding members departing for larger firms, the launch of interaction models is a clear signal that enterprise-level players see conversational AI technology as moving beyond text boxes and chat logs. By prioritising interactivity as much as raw intelligence, Murati’s team is betting that the next competitive edge will come from systems that collaborate fluidly with humans in real time—watching, listening, and responding as naturally as a colleague across the table.
