Voice is rapidly becoming the next major interface for artificial intelligence, according to ElevenLabs co-founder and CEO Mati Staniszewski. Speaking at Web Summit in Doha, Staniszewski said advances in voice models are shifting how people interact with machines, moving beyond text and screens toward more natural, conversational control.
Voice technology, he said, has evolved past simply mimicking the qualities of human speech, such as emotion and intonation. Instead, it is increasingly being paired with the reasoning capabilities of large language models, allowing AI systems to understand context, respond intelligently, and take action with fewer explicit instructions.
“In the years ahead, hopefully all our phones will go back in our pockets,” Staniszewski said, describing a future where voice allows users to stay immersed in the physical world while AI operates seamlessly in the background.
Industry Momentum Behind Voice AI
That vision underpins ElevenLabs's recent $500 million funding round, which valued the company at $11 billion. Voice is also gaining traction across the broader AI industry: OpenAI and Google have both made it a central component of their next-generation AI models, while Apple has been quietly building voice-adjacent technologies through acquisitions such as Q.ai.
As AI expands into cars, wearables, and other hardware, control mechanisms are shifting. Interaction is becoming less about tapping screens and typing commands, and more about speaking naturally. Investors see this transition as a foundational change in how users will engage with technology.
Iconiq Capital general partner Seth Pierrepont echoed that view at Web Summit, noting that while screens will remain important for entertainment and gaming, traditional inputs like keyboards are beginning to feel outdated. As AI systems become more autonomous, he said, interactions will require less direct prompting and more contextual understanding.
Agentic Systems and Persistent Context
Staniszewski highlighted agentic AI as a key driver of this shift. Future voice systems, he said, will rely on persistent memory and accumulated context rather than isolated commands. That approach could make interactions feel more continuous and human-like, with AI systems anticipating needs instead of waiting for detailed instructions.
This evolution is also shaping how voice models are deployed. While high-quality audio processing has largely relied on cloud infrastructure, ElevenLabs is moving toward a hybrid approach that blends cloud-based and on-device processing. The goal is to support always-on use cases in hardware such as headphones and wearables, where voice becomes a constant interface rather than an optional feature.
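As a rough, purely illustrative sketch of the kind of system being described, the Python snippet below shows how an always-on voice agent might accumulate context across turns and route each request between an on-device model and a cloud service. All class and function names are hypothetical and do not reflect ElevenLabs's actual products or APIs.

```python
# Conceptual sketch only: hypothetical names, no real ElevenLabs or device APIs.

from dataclasses import dataclass, field


@dataclass
class Turn:
    """One exchange kept in the agent's persistent memory."""
    user: str
    reply: str


@dataclass
class HybridVoiceAgent:
    """Accumulates context across turns and splits work between device and cloud."""
    history: list[Turn] = field(default_factory=list)

    def handle(self, transcript: str) -> str:
        # Short, simple utterances are handled on-device to keep latency low;
        # anything else goes to a (stubbed) cloud model with the full history.
        if len(transcript.split()) <= 4:
            reply = self._on_device(transcript)
        else:
            reply = self._cloud(transcript, self.history)
        self.history.append(Turn(transcript, reply))  # persistent context
        return reply

    def _on_device(self, transcript: str) -> str:
        return f"[local] acknowledged: {transcript}"

    def _cloud(self, transcript: str, history: list[Turn]) -> str:
        return f"[cloud] answered with {len(history)} prior turns of context"


if __name__ == "__main__":
    agent = HybridVoiceAgent()
    print(agent.handle("lights off"))                          # routed on-device
    print(agent.handle("summarize what I asked you earlier"))  # routed to cloud
```

In a real deployment, the routing decision would weigh latency, privacy, and battery constraints rather than a simple word count, but the pattern of persistent memory plus hybrid processing is the core idea.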
Partnerships and Privacy Questions
ElevenLabs is already working with Meta to integrate its voice technology into products including Instagram and Horizon Worlds. Staniszewski said the company would also consider partnerships involving Meta’s Ray-Ban smart glasses as voice-driven interfaces extend into new form factors.
However, as voice systems become more persistent and embedded in everyday devices, concerns around privacy and data collection are growing. Always-on voice interfaces could store vast amounts of personal information, raising questions about surveillance and misuse. Companies such as Google have previously faced scrutiny over how voice data is handled, underscoring the risks as voice becomes more central to AI experiences.
Staniszewski acknowledged those concerns, suggesting that trust and responsible deployment will be critical as voice moves closer to users’ daily lives and becomes one of the primary ways people interact with AI.