Microsoft Launches MAI Models for Speech, Voice, and Image AI

Microsoft has introduced MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, expanding its AI model lineup with faster performance and competitive pricing for developers. The models are now available through Microsoft Foundry.

By Samantha Reed. Edited by Maria Konash. Published: Apr 2, 2026 at 6:21 pm UTC

Microsoft has unveiled a new suite of AI models under its MAI branding, including MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, aimed at strengthening its position in multimodal AI. The models are now available through Microsoft Foundry and the MAI Playground, targeting developers building applications across speech, voice, and visual content.

MAI-Transcribe-1 is a speech-to-text model that delivers state-of-the-art accuracy across 25 widely used languages. According to benchmark results, it achieves a lower average word error rate than several competing systems, indicating improved transcription quality. It is also designed for real-world conditions, including noisy or complex audio environments.
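For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The sketch below illustrates that generic definition only. The article does not describe how Microsoft's benchmark normalized text or scored results, and the function name and sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Generic WER: word-level edit distance divided by reference length.

    Illustrative only; not Microsoft's benchmark methodology.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (one row back) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical sample: one word ("the") is missing from the transcript.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A lower value means fewer word-level mistakes per reference word, which is why a lower average across languages is read as better transcription quality.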

Speed is a key differentiator. Microsoft states that MAI-Transcribe-1 processes batch transcription tasks up to 2.5 times faster than its existing Azure-based offerings. Pricing starts at $0.36 per hour, which positions the model competitively among cloud providers offering similar services.

MAI-Voice-1, the company’s latest voice generation model, emphasizes realism and expressive output. It supports natural speech synthesis with emotional nuance and can maintain speaker identity across longer audio segments. Developers can also create custom voices using short audio samples, expanding use cases in voice assistants, media production, and enterprise applications.

Focus on Speed, Cost, and Enterprise Adoption

MAI-Image-2 completes the model trio, targeting image generation with improved speed and quality. Microsoft reports that the model delivers at least twice the generation speed of earlier versions while maintaining visual fidelity. It is designed for professional use cases such as marketing, design, and content creation, with a focus on realistic lighting, accurate textures, and legible in-image text.

Pricing reflects Microsoft’s broader strategy to compete on cost efficiency. MAI-Image-2 is offered at $5 per million tokens for text input and $33 per million tokens for image output, while MAI-Voice-1 starts at $22 per million characters. The company positions the MAI family as offering strong price-to-performance across modalities.
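To put the quoted rates in concrete terms, the following sketch turns them into rough cost estimates. It uses only the prices reported in this article and treats them as flat rates, which is an assumption: several are "starting at" prices, and actual billing may vary by tier, region, or usage. The function names and the sample workloads are hypothetical.

```python
# Rates as quoted in the article (USD). Treated as flat rates for illustration;
# real billing may differ.
TRANSCRIBE_USD_PER_HOUR = 0.36              # MAI-Transcribe-1, "starting at"
VOICE_USD_PER_MILLION_CHARS = 22.0          # MAI-Voice-1, "starts at"
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0    # MAI-Image-2, text input
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0  # MAI-Image-2, image output


def transcription_cost(hours: float) -> float:
    """Estimated cost of transcribing the given number of hours."""
    return hours * TRANSCRIBE_USD_PER_HOUR


def voice_cost(characters: int) -> float:
    """Estimated cost of synthesizing speech for a number of characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(text_input_tokens: int, image_output_tokens: int) -> float:
    """Estimated cost of image generation from input and output token counts."""
    input_cost = text_input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
    output_cost = image_output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical workloads, chosen only to show the arithmetic.
    print(f"100 hours of transcription: ${transcription_cost(100):.2f}")
    print(f"500,000 characters of speech: ${voice_cost(500_000):.2f}")
    print(f"200 input + 4,000 output image tokens: ${image_cost(200, 4_000):.4f}")
```

At these rates, 100 hours of transcription comes to about $36 and 500,000 characters of synthesized speech to about $11. The article does not say how many output tokens a typical image consumes, so the image figure depends entirely on the token counts supplied.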

Enterprise adoption is already underway. WPP, a global marketing and communications group, is among early partners using MAI-Image-2 for large-scale creative production. Microsoft plans to integrate these models across its own ecosystem, including Copilot products and enterprise tools.

The company said the MAI models were developed with built-in safety measures and tested through internal evaluation processes, reflecting ongoing efforts to align performance improvements with responsible AI deployment.

The company is also expanding Copilot with multi-model AI workflows that let systems like GPT and Claude collaborate on responses to improve accuracy and reliability. The move reinforces its strategy of integrating diverse AI capabilities into a unified platform.