The first is MAI-Voice-1, a speech generation model that’s now live in Copilot Daily and Podcasts. Redmond says the system can spit out a full minute of expressive audio in under a second on a single GPU.
To show it off, Vole has bolted a “Copilot Audio Expressions” demo into Copilot Labs, letting punters paste in text, pick a voice, style, and mood, and then download the generated clip if they fancy.
The second is MAI-1-Preview, Microsoft’s first homegrown foundation model trained end-to-end. It’s being trialled on LMArena, a community benchmarking platform, and was built using nearly 15,000 of Nvidia’s H100 GPUs.
Vole claims the mixture-of-experts design helps the model follow instructions more closely and deliver more useful responses in everyday scenarios.
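For readers unfamiliar with the jargon: a mixture-of-experts model routes each input to a small subset of specialised sub-networks ("experts") rather than running the whole network every time, which is how these models keep inference costs down at scale. Microsoft hasn't published MAI-1-Preview's internals, so the snippet below is only a toy NumPy sketch of the general top-k routing idea — the dimensions, expert count, and `MoELayer` class are all illustrative, not anything from the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MoELayer:
    """Toy mixture-of-experts layer: a router scores every expert,
    only the top-k highest-scoring experts actually run, and their
    outputs are blended using the renormalised router weights."""

    def __init__(self, dim, n_experts=4, top_k=2):
        self.top_k = top_k
        # Router: one score per expert for a given input vector.
        self.router = rng.normal(size=(dim, n_experts))
        # Each "expert" here is just a random linear map for illustration.
        self.experts = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]

    def forward(self, x):
        scores = softmax(x @ self.router)           # per-expert probabilities
        chosen = np.argsort(scores)[-self.top_k:]   # indices of the top-k experts
        weights = scores[chosen] / scores[chosen].sum()  # renormalise over top-k
        out = np.zeros_like(x)
        for w, i in zip(weights, chosen):
            out += w * (x @ self.experts[i])        # only k experts do any work
        return out, chosen

layer = MoELayer(dim=8)
y, used = layer.forward(rng.normal(size=8))
print(f"experts used: {sorted(used.tolist())} of 4")
```

The point of the pattern is in that `for` loop: with 4 experts and `top_k=2`, half the network's parameters sit idle on any given token, which is the sparsity trick that makes very large models cheaper to serve.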
MAI-1-Preview is set to start creeping into text-based Copilot features in the coming weeks. According to Redmond, this is just the start of its in-house push, with more upgrades lined up once the training flywheel starts spinning harder.
For now, Microsoft seems keen to show it can build models itself rather than leaning entirely on OpenAI, and it’s betting that sheer scale and speed will help keep Copilot in the spotlight.