Real-Time Multimodal AI Is Here: Startups Race To Make Machines Conversational in Audio and Video

From Turn-Based Chatbots to Real-Time Multimodal AI

For years, conversational AI systems have behaved like email: you talk, they wait, then respond in a single block. Thinking Machines, founded by former OpenAI CTO Mira Murati, is challenging that turn-based paradigm with what it calls “interaction models” – real-time multimodal AI built to handle audio, video and text as continuous streams. Instead of bolting together separate tools for speech, vision and dialogue management, these models treat everything as one ongoing interaction, cutting both response times and system complexity. The company argues that traditional interfaces create a “collaboration bottleneck,” because humans must fully finish their input before the system reacts. In real work, however, people course-correct mid-sentence, gesture, share screens and expect others to keep up. By letting AI perceive and respond continuously, interaction models aim to align machine behavior with the way people actually communicate, opening the door to more natural and efficient human–AI collaboration.

Inside Thinking Machines’ 0.40-Second Interaction Models

Thinking Machines’ flagship model, TML-Interaction-Small, responds in about 0.40 seconds while handling audio, video and text simultaneously. Under the hood, the system slices interaction into 200-millisecond “micro-turns,” processing incoming speech or video frames while generating its own response at the same time. One subsystem manages conversation flow, while another works on more complex tasks in the background, mirroring how a person can keep talking while planning their next point. Early demos show the model counting exercise reps from video, translating speech in real time and even noticing posture changes like slouching, all while maintaining a live conversation. By removing artificial turn boundaries, the model can interject, speak over you when appropriate and react to what it “sees” without waiting for you to stop. This kind of low-latency audio-video AI processing is a foundational shift toward AI that feels less like a search box and more like a teammate in the room.
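
To make the micro-turn idea concrete, here is a minimal sketch of how continuous perception and response might run side by side rather than turn by turn. It is purely illustrative and assumes nothing about Thinking Machines’ actual implementation: the frame source, the 200 ms tick, the context window and the fast/slow split are all hypothetical stand-ins for the behavior described above.

```python
import asyncio
import itertools

MICRO_TURN = 0.2  # 200 ms micro-turns, as described in the article


async def perceive(state: dict) -> None:
    """Stand-in perception loop: a synthetic audio/video frame arrives every 50 ms."""
    for i in itertools.count():
        state["frames"].append(f"frame-{i}")
        await asyncio.sleep(0.05)


async def respond(state: dict) -> None:
    """Fast subsystem: emits a partial response each micro-turn from recent context."""
    while True:
        await asyncio.sleep(MICRO_TURN)
        recent = state["frames"][-4:]  # hypothetical context window
        print(f"[fast] responding with {len(recent)} recent frames in view")


async def plan(state: dict) -> None:
    """Slow subsystem: background reasoning over the full history while the fast path keeps talking."""
    while True:
        await asyncio.sleep(1.0)
        print(f"[slow] re-planning over {len(state['frames'])} frames so far")


async def main() -> None:
    state: dict = {"frames": []}
    # Perception, response and planning run concurrently, never blocking each other.
    tasks = [asyncio.create_task(coro(state)) for coro in (perceive, respond, plan)]
    await asyncio.sleep(2.1)  # run the demo briefly, then shut down
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    asyncio.run(main())
```

The point of the sketch is the structure: nothing waits for a “turn” to end, so the fast path can interject at any tick while the slow path keeps working in the background.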

Voice AI Technology Moves Past Dictation With Wispr

While labs like Thinking Machines push multimodal performance, voice AI technology startups are reimagining how people feed work into software. Wispr, maker of Wispr Flow, is in talks to raise about USD 260 million (approx. RM1.2 billion) in a Menlo Ventures–led round that could value the company near USD 2 billion (approx. RM9.2 billion). Rather than classic dictation that dumps a messy transcript into a document, Flow lets users speak naturally in any app and turns speech into clean, context-aware writing. It strips filler words, formats text and adapts to where you are working, whether that is email, Slack, documents or even a code editor. The bet is that if people speak faster than they type, the next everyday interface for work will be voice. Funding momentum suggests investors see voice-first conversational AI systems as a front door to productivity, not just an accessibility add-on.
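
The gap between raw dictation and what Flow reportedly produces can be sketched in a few lines. The snippet below is a hypothetical illustration, not Wispr’s pipeline: the filler-word list, the cleanup regex and the per-app formatting rules are all invented for the example, and a real system would handle these with learned models rather than patterns.

```python
import re

# Naive, invented filler list; this would also strip legitimate uses of "like".
FILLERS = re.compile(r"(,\s*)?\b(um+|uh+|like|you know|I mean)\b,?\s*", re.IGNORECASE)


def clean_transcript(raw: str) -> str:
    """Strip filler words, collapse spacing and capitalise the opening."""
    text = FILLERS.sub(" ", raw)
    text = re.sub(r"\s{2,}", " ", text).strip()
    return text[:1].upper() + text[1:]


def format_for_app(text: str, app: str) -> str:
    """Adapt the cleaned text to its destination (invented rules for illustration)."""
    if app == "email":
        return f"Hi,\n\n{text}.\n\nBest regards,"
    if app == "code_editor":
        return f"# {text}"  # dictation lands as a comment
    return text  # chat apps: leave the casual register alone


raw = "um so like we should, you know, ship the beta on Friday"
print(format_for_app(clean_transcript(raw), "email"))
# -> a tidy email body: "So we should ship the beta on Friday."
```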

Otter.ai and the Rise of Conversational Knowledge

Otter.ai, which helped create the AI meeting assistant category, is pushing beyond transcription toward what it calls a Conversational Knowledge Engine. After processing billions of meetings, CEO Sam Liang argues that most tools still stop at summary and basic chat, failing to connect the underlying knowledge. Otter’s new approach aggregates conversational data across an organisation into a longitudinal knowledge graph, mapping clients, projects, people and topics, and tracking who said what over time. It effectively becomes a system of record for spoken conversations, similar to how CRM, HR and ERP systems capture other business data. With configurable permissions and data retention controls inspired by Slack-style channels, Otter aims to make meetings queryable while respecting privacy. As real-time multimodal AI matures, this kind of persistent conversational layer could turn every call, stand-up and customer interaction into structured, searchable intelligence rather than quickly forgotten talk.
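
A longitudinal knowledge graph over meetings can be pictured as timestamped utterances indexed by the entities they touch. The sketch below is a deliberate simplification, not Otter’s engine: the Utterance fields, the topic tags and the query shape are invented to show how “who said what about X, and when” becomes answerable across meetings.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Utterance:
    speaker: str
    text: str
    meeting: str
    when: datetime
    topics: list[str]  # tags would come from an upstream extraction step


@dataclass
class KnowledgeGraph:
    """Minimal longitudinal index: topic -> utterances across all meetings."""
    by_topic: dict[str, list[Utterance]] = field(default_factory=dict)

    def record(self, utt: Utterance) -> None:
        for topic in utt.topics:
            self.by_topic.setdefault(topic, []).append(utt)

    def who_said_what(self, topic: str) -> list[tuple[str, str, str]]:
        """Return (date, speaker, text) for a topic, oldest first."""
        hits = sorted(self.by_topic.get(topic, []), key=lambda u: u.when)
        return [(u.when.date().isoformat(), u.speaker, u.text) for u in hits]


graph = KnowledgeGraph()
graph.record(Utterance("Aisha", "Acme wants the rollout done by Q3.",
                       "client-sync", datetime(2025, 4, 2), ["acme", "rollout"]))
graph.record(Utterance("Ben", "The rollout slipped two weeks.",
                       "stand-up", datetime(2025, 5, 6), ["rollout"]))
print(graph.who_said_what("rollout"))
```

Even this toy version shows the shift from transcript to system of record: the unit of retrieval is no longer a meeting file but an entity’s history across every conversation that mentions it.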

How Real-Time Multimodal AI Could Reshape Work and Services

Taken together, these advances in real-time multimodal AI and voice AI technology hint at a new interface layer for both enterprises and consumers. In the workplace, AI agents that listen, watch and respond continuously could join video calls, track action items, reference past decisions and even react to visual cues like a prototype on screen or a participant’s body language. Customer service systems might blend live voice, screen content and camera feeds to troubleshoot problems without forcing users through rigid menus. For accessibility, multimodal support promises richer assistance: real-time captioning, visual scene descriptions and conversational guidance in one coherent experience. The big shift is that audio-video AI processing is no longer a specialty feature; it is becoming the default mode of interaction. Startups like Thinking Machines, Wispr and Otter.ai are showing that the next wave of AI will be judged less by model size and more by how seamlessly it fits into everyday human communication.
