It’s accurate and nearly as fast as expert human interpreters.
The dream of a universal AI interpreter just got a bit closer. This week, tech giant Meta released a new AI that can almost instantaneously translate speech in 101 languages as soon as the words tumble out of your mouth.
AI translators are nothing new. But they generally work best with text and struggle to transform spoken words from one language to another. The process is usually multistep. The AI first turns speech into text, translates the text, and then converts it back to speech. Though already useful in everyday life, these systems are inefficient and laggy. Errors can also sneak in at each step.
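The difference can be sketched with toy numbers (the latencies and accuracies below are hypothetical, chosen only to show how stages compose): in a cascaded pipeline, delays add up and errors compound across stages, while a direct model has a single stage.

```python
# Toy model of cascaded vs. direct translation. All numbers are
# hypothetical, purely to illustrate how stages compose.

def pipeline_stats(stages):
    """Total latency is the sum of stage latencies; end-to-end
    accuracy is the product of stage accuracies."""
    latency = sum(s["latency_s"] for s in stages)
    accuracy = 1.0
    for s in stages:
        accuracy *= s["accuracy"]
    return latency, accuracy

cascaded = [
    {"name": "speech-to-text", "latency_s": 1.0, "accuracy": 0.95},
    {"name": "text translation", "latency_s": 0.5, "accuracy": 0.92},
    {"name": "text-to-speech", "latency_s": 0.8, "accuracy": 0.97},
]
direct = [{"name": "speech-to-speech", "latency_s": 1.2, "accuracy": 0.90}]

print("cascaded:", pipeline_stats(cascaded))  # delays add, errors compound
print("direct:  ", pipeline_stats(direct))
```

Even when each cascaded stage is individually quite accurate, the product of three stages can fall below a single direct stage, which is the intuition behind end-to-end speech translation.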
Meta’s new AI, dubbed SEAMLESSM4T, can directly convert speech into speech. Using a voice synthesizer, the system translates words spoken in 101 languages into 36 others—not just into English, which tends to dominate current AI interpreters. In a head-to-head evaluation, the algorithm is 23 percent more accurate than today’s top models—and nearly as fast as expert human interpreters. It can also handle text-to-text, text-to-speech, and speech-to-text translation.
Meta is releasing all the data and code used to develop the AI to the public for non-commercial use, so others can optimize and build on it. In a sense, the algorithm is “foundational,” in that “it can be fine-tuned on carefully curated datasets for specific purposes—such as improving translation quality for certain language pairs or for technical jargon,” wrote Tanel Alumäe at Tallinn University of Technology, who was not involved in the project. “This level of openness is a huge advantage for researchers who lack the massive computational resources needed to build these models from scratch.”
It’s “a hugely interesting and important effort,” Sabine Braun at the University of Surrey, who was also not part of the study, told Nature.
Self-Learning AI
Machine translation has made strides in the past few years thanks to large language models. These models, which power popular chatbots like ChatGPT and Claude, learn language by training on massive datasets scraped from the internet—blogs, forum comments, Wikipedia.
In translation, humans carefully vet and label these datasets, or “corpora,” to ensure accuracy. Labels or categories provide a sort of “ground truth” as the AI learns and makes predictions.
But not all languages are equally represented. Training corpora are easy to come by for high-resource languages, such as English and French. Meanwhile, data for low-resource languages, largely spoken in mid- or low-income countries, is harder to find—making it difficult to train a data-hungry AI translator on trusted datasets.
“Some human-labeled resources for translation are freely available, but often limited to a small set of languages or in very specific domains,” wrote the authors.
To get around the problem, the team used a technique called parallel data mining, which crawls the internet and other resources for audio snippets in one language with matching subtitles in another. These pairs, which match in meaning, add a wealth of training data in multiple languages—no human annotation needed. Overall, the team collected roughly 443,000 hours of audio with matching text, along with about 30,000 hours of speech aligned to speech in other languages.
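In spirit, the mining step works like the sketch below: embed audio clips and candidate sentences into a shared vector space, then keep only the closest, most confident matches. The vectors here are toy stand-ins for real multilingual encoder outputs, and the function names are invented for illustration.

```python
import numpy as np

# Minimal sketch of parallel data mining: audio clips and sentences are
# embedded into a shared vector space, and pairs whose embeddings are
# sufficiently similar become training examples. Toy vectors throughout.

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mine_pairs(audio_embs, text_embs, threshold=0.9):
    pairs = []
    for i, a in enumerate(audio_embs):
        sims = [cosine(a, t) for t in text_embs]
        j = int(np.argmax(sims))
        if sims[j] >= threshold:  # keep only confident matches
            pairs.append((i, j))
    return pairs

audio = np.array([[1.0, 0.1, 0.0],   # clip 0, e.g. a spoken sentence
                  [0.0, 1.0, 0.1]])  # clip 1
text = np.array([[0.0, 0.0, 1.0],    # sentence 0: unrelated
                 [1.0, 0.0, 0.0],    # sentence 1: matches clip 0
                 [0.1, 1.0, 0.0]])   # sentence 2: matches clip 1
print(mine_pairs(audio, text))  # → [(0, 1), (1, 2)]
```

The threshold is what keeps the mined corpus clean: raising it trades quantity for quality, which matters when no human ever checks the pairs.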
SEAMLESSM4T consists of three different blocks: some handle text and speech input, others text and speech output. The translation part of the AI was pre-trained on a massive dataset containing 4.5 million hours of spoken audio in multiple languages. This initial step helped the AI “learn patterns in the data, making it easier to fine-tune the model for specific tasks” later on, wrote Alumäe. In other words, the AI learned to recognize general structures in speech regardless of language, establishing a baseline that made it easier to translate low-resource languages later.
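The pretrain-then-fine-tune idea can be shown with a deliberately tiny example—plain gradient descent on a one-parameter model, nothing like the real architecture: a model that starts from weights learned on a big generic task needs far fewer steps on a new, smaller task than one starting from scratch.

```python
# Toy pretrain/fine-tune demo: a one-parameter linear model trained by
# gradient descent on squared error. Purely illustrative numbers.

def train(w, slope, steps, lr=0.05, xs=(1.0, 2.0, 3.0)):
    """Fit y = w * x toward data generated by y = slope * x."""
    for _ in range(steps):
        grad = sum(2 * x * (w * x - slope * x) for x in xs) / len(xs)
        w -= lr * grad
    return w

# "Pretrain" at length on a large generic task (true slope 2.0)...
w_pretrained = train(0.0, slope=2.0, steps=50)
# ...then fine-tune briefly on a small specific task (true slope 2.2).
w_finetuned = train(w_pretrained, slope=2.2, steps=3)
# The same brief training from scratch lands much farther from 2.2.
w_scratch = train(0.0, slope=2.2, steps=3)
print(abs(w_finetuned - 2.2), abs(w_scratch - 2.2))
```

Because the pretrained starting point is already close to the new target, a few fine-tuning steps suffice—the same logic that lets a speech model pretrained on millions of hours adapt to a low-resource language with comparatively little data.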
The AI was then trained on the speech pairs and evaluated against other translation models.
Spoken Word
A key advantage of the AI is its ability to directly translate speech, without having to convert it into text first. To test this ability, the team hooked up an audio synthesizer to the AI to broadcast its output. Starting with any of the 101 languages it knew, the AI translated speech into 36 different tongues—including low-resource languages—with only a few seconds of delay.
The algorithm outperformed existing state-of-the-art systems, achieving 23 percent greater accuracy on a standardized test. It also handled background noise and variation between speakers better, although—like humans—it struggled with heavily accented speech.
Lost in Translation
Language isn’t just words strung into sentences. It reflects cultural contexts and nuances. For example, translating a gender-neutral language into a gendered one could introduce biases. Does “I am a teacher” in English translate to the masculine “Soy profesor” or to the feminine “Soy profesora” in Spanish? What about translations for doctor, scientist, nanny, or president?
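One way researchers quantify this is a bias audit: translate a batch of gender-neutral sentences and count how the gendered outputs split. The sketch below uses a hard-coded stub in place of a real translator—the lookup table and every name in it are invented for illustration.

```python
# Toy gender-bias audit. toy_translate is a hypothetical stub standing
# in for a real model; it mimics a system biased toward masculine forms.

MASCULINE = {"profesor", "doctor", "científico"}

def toy_translate(sentence):
    table = {
        "I am a teacher": "Soy profesor",
        "I am a doctor": "Soy doctor",
        "I am a scientist": "Soy científico",
    }
    return table[sentence]

def masculine_share(sentences):
    """Fraction of translations defaulting to a masculine noun."""
    hits = sum(
        1 for s in sentences if toy_translate(s).split()[-1] in MASCULINE
    )
    return hits / len(sentences)

sentences = ["I am a teacher", "I am a doctor", "I am a scientist"]
print(masculine_share(sentences))  # → 1.0: every output defaults masculine
```

A share far from the real-world split across such prompts is one signal that a model has absorbed a gender bias from its training data.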
Mistranslations may also add “toxicity,” when the AI spews out offensive or harmful language that doesn’t reflect the original meaning—especially for words that don’t have a direct counterpart in the other language. While easy to laugh off as a comedy of errors in some cases, these mistakes are deadly serious when it comes to medical, immigration, or legal scenarios.
“These sorts of machine-induced error could potentially induce real harm, such as erroneously prescribing a drug, or accusing the wrong person in a trial,” wrote Allison Koenecke at Cornell University, who wasn’t involved in the study. The problem is likely to disproportionately affect people speaking low-resource languages or unusual dialects, due to a relative lack of training data.
To their credit, the Meta team analyzed their model for toxicity and fine-tuned it during multiple stages to lower the chances of gender bias and harmful language.
“This is a step in the right direction, and offers a baseline against which future models can be tested,” wrote Koenecke.
Meta is increasingly supporting open-source technology. Previously, the tech giant released PyTorch, a software library for AI training, which is used by companies including OpenAI and Tesla, and by researchers around the globe. SEAMLESSM4T will also be made public for others to build on its abilities.
The AI is just the latest machine translator that can handle speech-to-speech translation. Previously, Google showcased AudioPaLM, an algorithm that can turn 113 languages into English—but only English. SEAMLESSM4T broadens the scope. Although it only scratches the surface of the roughly 7,000 languages spoken, the AI inches closer to a universal translator—like the Babel fish in The Hitchhiker’s Guide to the Galaxy, which translates languages from species across the universe when popped into the ear.
“The authors’ methods for harnessing real-world data will forge a promising path towards speech technology that rivals the stuff of science fiction,” wrote Alumäe.
The post Meta’s New AI Translates Speech in Real Time Across More Than 100 Languages appeared first on SingularityHub.