Thinking Machines Introduces AI Models for Live Multimodal Collaboration

Thinking Machines Lab introduced a research preview of “interaction models” designed for continuous real-time collaboration across audio, video, and text. The system combines live multimodal interaction with asynchronous reasoning and tool use.

By Daniel Mercer
Edited by Maria Konash
Published: May 12, 2026 at 12:25 pm UTC

Thinking Machines Lab introduced a research preview of what it calls “interaction models,” a new class of AI systems designed to collaborate with users continuously across audio, video, and text rather than through traditional turn-based prompts.

The company said the models are trained from scratch to support real-time interaction, allowing users and AI systems to speak, interrupt, observe, respond, and work simultaneously. The architecture is built around “micro-turns” that each process roughly 200 milliseconds of input and output, so the model can keep interacting in both directions rather than waiting for a user to finish speaking or typing before it responds.
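To make the micro-turn idea concrete, the sketch below slices an audio stream into 200-millisecond chunks and hands each one to a response callback as it arrives. It is only an illustration, not the company's implementation: the 16 kHz sample rate and the names `micro_turns` and `run_session` are assumptions introduced here.

```python
from typing import Callable, Iterator, List, Sequence

MICRO_TURN_MS = 200       # from the article: roughly 200 ms per micro-turn
SAMPLE_RATE_HZ = 16_000   # assumption: a common speech sample rate, not from the source


def micro_turns(
    samples: Sequence[float],
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    turn_ms: int = MICRO_TURN_MS,
) -> Iterator[Sequence[float]]:
    """Yield consecutive chunks of audio, one per micro-turn."""
    step = sample_rate_hz * turn_ms // 1000  # samples per micro-turn
    for start in range(0, len(samples), step):
        yield samples[start:start + step]


def run_session(
    samples: Sequence[float],
    respond: Callable[[Sequence[float]], object],
) -> List[object]:
    """Call the (hypothetical) model once per micro-turn instead of once per utterance."""
    return [respond(chunk) for chunk in micro_turns(samples)]
```

Under these assumptions, one second of audio becomes five micro-turns of 3,200 samples each, so the model has five chances to react within that second rather than one at the end of the utterance.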

According to Thinking Machines, the system combines a real-time interaction model with a separate asynchronous background model responsible for longer reasoning tasks, tool use, browsing, and workflow execution. The interaction layer remains active throughout the process while integrating results from the background model as they arrive.
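The split between a live interaction layer and a background model can be pictured as two cooperating tasks. The following is a minimal sketch using Python's `asyncio`, assuming a simple queue as the hand-off mechanism. The function names, the queue, and the placeholder task string are all hypothetical, and the article does not describe how the real system passes results between the two models.

```python
import asyncio
from typing import List


async def background_model(task: str, results: asyncio.Queue) -> None:
    """Stand-in for slow reasoning or tool use; yields control a few times."""
    for _ in range(3):
        await asyncio.sleep(0)
    await results.put(f"result for {task!r}")


async def interaction_loop(ticks: int) -> List[str]:
    """Keep 'interacting' every tick while polling for background results."""
    results: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(background_model("look up docs", results))
    log: List[str] = []
    for tick in range(ticks):
        try:
            # Integrate a background result if one has arrived.
            log.append(f"tick {tick}: {results.get_nowait()}")
        except asyncio.QueueEmpty:
            # Otherwise keep the conversation going without blocking.
            log.append(f"tick {tick}: listening")
        await asyncio.sleep(0)  # stand-in for one micro-turn of live interaction
    await worker
    return log


def run_demo(ticks: int = 8) -> List[str]:
    return asyncio.run(interaction_loop(ticks))
```

The point of the design is that the foreground loop never awaits the background task directly while it is interacting. It only checks whether a result is ready, so the conversation continues while the slower work finishes.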

The company argued that current AI systems create a “collaboration bottleneck” because most models operate through rigid turn-taking interfaces that limit human involvement during reasoning and execution. Thinking Machines said its approach aims to make AI collaboration function more like natural human conversation.

The research preview demonstrates several capabilities that are difficult to achieve in standard voice assistants or multimodal chat systems. These include simultaneous speech between user and model, proactive verbal and visual interjections, continuous visual monitoring, real-time translation, concurrent tool use during conversations, and direct awareness of elapsed time.

For example, the company showed scenarios where the model corrected spoken language mistakes while users continued speaking, counted physical exercises through live video streams, reacted to coding errors as they appeared onscreen, and performed live multilingual translation without pausing conversations.

Interaction Becomes A Core AI Capability

The announcement reflects a broader shift in AI development toward systems optimized for continuous collaboration rather than isolated prompt-response exchanges.

Most current real-time AI products rely on external orchestration layers such as voice activity detection systems and separate dialogue managers to simulate interactivity. Thinking Machines argues those approaches create limitations because the intelligence governing interruptions, timing, and conversational flow exists outside the model itself.

Instead, the company embedded interaction directly into model training and architecture. That allows responsiveness, interruption handling, simultaneous speaking, and multimodal awareness to improve alongside overall model capability as systems scale.

The architecture also differs from many multimodal systems by minimizing reliance on large standalone audio or video encoders. Audio, video, and text are processed together through shared transformer infrastructure using lightweight embedding layers and early fusion techniques.
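As a rough illustration of early fusion, the sketch below projects raw audio, video, and text features into one shared width with a single linear map per modality, then concatenates them into one token sequence that a shared transformer could consume. Everything specific here is an assumption made for the example: the width `D_MODEL`, the constant placeholder weights, and the function names are not taken from the announcement.

```python
from typing import Dict, List

Vector = List[float]
D_MODEL = 4  # hypothetical shared embedding width


def linear_embed(features: Vector, weights: List[Vector]) -> Vector:
    """Lightweight embedding: one matrix-vector product into the shared width."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def make_weights(in_dim: int, scale: float) -> List[Vector]:
    """Placeholder constant weights; a trained model would learn these."""
    return [[scale] * in_dim for _ in range(D_MODEL)]


def fuse_micro_turn(
    audio: List[Vector],
    video: List[Vector],
    text: List[Vector],
    weights: Dict[str, List[Vector]],
) -> List[Vector]:
    """Early fusion: embed each modality, then join everything into one sequence."""
    sequence: List[Vector] = []
    for name, frames in (("audio", audio), ("video", video), ("text", text)):
        sequence.extend(linear_embed(frame, weights[name]) for frame in frames)
    return sequence
```

The contrast with a late-fusion design is that no large per-modality encoder runs first. Each input gets only a thin projection before all tokens share the same model.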

Benchmarks Highlight Speed And Responsiveness

Thinking Machines said its TML-Interaction-Small model achieved stronger combined responsiveness and interaction quality than several existing commercial real-time AI systems across internal and public benchmarks.

The company highlighted improvements in latency, interruption handling, simultaneous conversation, proactive responses, and continuous multimodal awareness. Internal evaluations also tested capabilities that many current voice models cannot reliably perform, including reacting to visual changes without explicit prompts and speaking concurrently with users during live tasks.

The released model is currently a 276-billion-parameter mixture-of-experts system with 12 billion active parameters at runtime. Thinking Machines said larger interaction models are already pretrained but remain too computationally expensive for low-latency deployment today.
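The two parameter counts imply how sparse the model is at inference time. The short calculation below uses only the figures quoted above. The function name is introduced for this example, and the result says nothing about actual latency or memory use, which depend on details the company has not published.

```python
TOTAL_PARAMS = 276_000_000_000   # total parameters, as reported
ACTIVE_PARAMS = 12_000_000_000   # parameters active at runtime, as reported


def active_fraction(total: int, active: int) -> float:
    """Share of the model's parameters used at runtime."""
    return active / total


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"{fraction:.1%} of parameters active")               # 4.3% of parameters active
print(f"1 in {TOTAL_PARAMS // ACTIVE_PARAMS} parameters")   # 1 in 23 parameters
```

In other words, roughly one parameter in 23 is active at runtime.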

The company added that future work will focus on longer session memory management, infrastructure optimization, safety research for real-time multimodal interaction, and deeper coordination between interactive and background reasoning systems.

The announcement also follows a recently expanded partnership between NVIDIA and Thinking Machines Lab to deploy next-generation Vera Rubin AI systems for frontier model training.