ai-Pulse 2024: Kyutai CEO On Conversational AI

"AI has become way too important these days to be only developed and operated behind closed doors."

A year after its launch, the French non-profit AI research lab Kyutai showcased what it believes true conversational AI should sound like. At the ai-Pulse conference in Paris, CEO Patrick Perez demonstrated Moshi, a groundbreaking voice AI system released earlier this year that breaks free from the robotic back-and-forth of traditional voice assistants.

"It's about overall communication," Perez explained. Unlike the stilted exchanges we're used to with current voice assistants, Moshi can engage in natural, fluid conversations with proper timing, emotion, and even overlapping speech – just like humans do.

Founded with donations from tech luminaries Xavier Niel, Rodolphe Saadé, and Eric Schmidt, Kyutai is taking an unusual approach in today's AI landscape: making everything open source. The lab has already released Moshi's model weights and audio codec, which have been downloaded half a million times, along with a detailed 70-page technical report explaining how it all works, Perez said.

The technology behind Moshi represents a fundamental shift in how voice AI operates. Traditional systems use a chain of separate modules: voice detection, speech recognition, language processing, and text-to-speech synthesis. This approach, while functional, loses important emotional nuances and creates unnatural pauses in conversation.
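To make the contrast concrete, here is a minimal sketch of such a cascaded pipeline. It is illustrative only, not Kyutai's or any vendor's code; the four stage functions are hypothetical stand-ins for separate off-the-shelf components (voice activity detection, speech recognition, a language model, and text-to-speech).

```python
# Minimal sketch of a traditional cascaded voice assistant (illustrative only).
# Each stage function is a hypothetical stand-in for a separate component.

def detect_voice(audio: bytes) -> bool:
    return len(audio) > 0                      # stand-in for a VAD model

def transcribe(audio: bytes) -> str:
    return "hello there"                       # stand-in for a speech recognizer

def generate_reply(text: str) -> str:
    return f"You said: {text}"                 # stand-in for a language model

def synthesize_speech(text: str) -> bytes:
    return text.encode()                       # stand-in for a TTS engine

def cascaded_turn(audio_chunk: bytes) -> bytes:
    """One conversational turn: every stage waits for the previous one, so
    latencies add up, and stages 2-4 exchange plain text, which is where
    timing cues and emotional nuance are lost."""
    if not detect_voice(audio_chunk):          # 1. voice activity detection
        return b""
    text = transcribe(audio_chunk)             # 2. speech recognition
    reply = generate_reply(text)               # 3. language processing
    return synthesize_speech(reply)            # 4. text-to-speech synthesis

print(cascaded_turn(b"\x00\x01"))
```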

Moshi takes a different path. At its core are a 7-billion-parameter language model called Helium and a state-of-the-art audio codec named Mimi. These work together in what Perez calls a "multi-stream architecture" that can simultaneously process multiple streams of audio and text. The result is "full duplex" communication – Moshi can listen, think, and speak all at once, with a theoretical latency of just 160 milliseconds.
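The sketch below shows the full-duplex idea in the abstract, assuming the model steps once per audio frame. The class name, frame size, and step signature are illustrative assumptions, not Kyutai's actual API; the real implementation is in Moshi's open-source release.

```python
# Minimal sketch of a full-duplex loop (illustrative only, not Kyutai's API).
# Assumption: the model advances once per fixed-length audio frame.

import time

FRAME_MS = 80  # assumed codec frame length; two frames is on the order of 160 ms


class MoshiLikeModel:
    """Stands in for a multi-stream model: each step consumes one incoming
    audio frame and emits one outgoing audio frame plus a text token, so
    listening, thinking, and speaking happen in the same loop, not in turns."""

    def step(self, user_frame: bytes) -> tuple[bytes, str]:
        return b"\x00" * 128, "token"  # placeholder output frame and text token


def full_duplex_loop(model: MoshiLikeModel, microphone, speaker, steps: int = 5):
    for _ in range(steps):
        user_frame = microphone()                   # always listening
        bot_frame, _token = model.step(user_frame)  # always thinking
        speaker(bot_frame)                          # always speaking, may overlap
        time.sleep(FRAME_MS / 1000)


full_duplex_loop(MoshiLikeModel(), microphone=lambda: b"", speaker=lambda f: None)
```

Because input and output advance frame by frame in the same loop, the system never has to wait for the user to finish a sentence before it starts forming a response, which is what allows overlapping speech and low latency.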

Training such a system was no small feat. The team used 7 million hours of English audio for initial training, then fine-tuned the model using an eclectic mix of data: 2,000 hours of 1990s phone calls, recordings with voice actors, and 200,000 hours of synthetic conversations generated using their own tools.

In live demonstrations, Moshi showed it could engage in natural conversations about images, demonstrate different speaking voices (Moshi-ka and Moshi-ko), and even continue speaking from audio prompts while maintaining consistent style and tone. The system can run on cloud infrastructure with just 200 milliseconds of latency, or even on a MacBook Pro for users prioritizing accessibility or privacy.

"We built it from scratch in six months with a very small team using 1,000 GPUs," Perez noted. "It's really the first voice AI of its type, and it's open."

Looking ahead, Kyutai plans to expand Moshi's capabilities. The team is working on releasing training and fine-tuning code to allow developers to adapt the model to their needs. They're also developing versions that can speak other languages, starting with French. To support this expansion, they're partnering with European media groups to access quality audio and text data in various languages.

The success of Moshi – with its demo already experienced in half a million sessions – suggests there's a strong appetite for more natural and engaging voice AI, he said.

Beyond voice interaction, Moshi demonstrates promising capabilities in other areas, including multi-speaker expressive text-to-speech synthesis and robust audio transcription with timestamps. The system can even engage in conversations about images, suggesting its potential as a broader multimodal AI platform.

Perez said that Kyutai's mission extends beyond just creating impressive technology. The lab's commitment to open science and accessible innovation is just as essential: "AI has become way too important these days to be only developed and operated behind closed doors."
