Meta, the tech giant formerly known as Facebook, has reached a milestone in artificial intelligence (AI). The company has developed AI models that can recognize and generate speech in more than 1,100 languages, a tenfold increase over previous work. As Meta noted, the achievement could help preserve endangered languages.
"Today we're sharing new progress on our AI speech work. Our Massively Multilingual Speech (MMS) project has now scaled speech-to-text & text-to-speech to support over 1,100 languages, a 10x increase from previous work. Details + access to new pretrained models ⬇️"
Meta AI (@MetaAI), May 22, 2023
Meta is making the models available to the public through the code hosting service GitHub. By open-sourcing its technology, Meta aims to help developers working in many different languages build new speech applications. These could range from messaging services that understand users speaking different languages to virtual reality systems usable in any language.
While approximately 7,000 languages are spoken worldwide, existing speech recognition models cover only about 100 of them comprehensively. The main reason is that such models need large amounts of labeled training data, which is available for only a small number of languages, such as English, Spanish, and Chinese.
To overcome this challenge, Meta's researchers retrained an existing AI model that the company developed in 2020, which can learn speech patterns from audio without relying on large amounts of labeled data such as transcripts. The team trained this model on two new datasets. The first comprised audio recordings of the New Testament and the corresponding text in 1,107 languages, collected from the internet. The second consisted of unlabeled New Testament audio recordings in 3,809 languages. The researchers processed the speech audio and text data to improve its quality, then used an algorithm to align the audio recordings with the accompanying text. They repeated this process with a second algorithm trained on the newly aligned data. With this method, the researchers were able to train the model to learn new languages more easily and efficiently, even without accompanying text.
Michael Auli, a research scientist at Meta involved in the project, expressed excitement about the findings, stating, "We can use what that model learned to then quickly build speech systems with very, very little data." Auli further highlighted the challenge they faced: "For English, we have lots and lots of good data sets, and we have that for a few more languages, but we just don’t have that for languages that are spoken by, say, 1,000 people."
According to the researchers, their AI models can transcribe and generate speech in over 1,000 languages and can identify more than 4,000 languages. Comparing its models with those of rival companies, including OpenAI's Whisper, Meta claims that its models achieve half the error rate while covering 11 times more languages.
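The article does not say which error metric lies behind the "half the error rate" comparison. Assuming it is word error rate (WER), a common measure for speech recognition, the following minimal sketch shows how such a figure is typically computed: the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative only and do not come from Meta's code or data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative sketch only; assumes whitespace tokenization and
    no text normalization (casing, punctuation), which real
    evaluations usually apply first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far (before the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

Under this definition, a model with "half the error rate" of another makes half as many word-level mistakes per reference word on the same test set.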
However, the team acknowledges certain limitations. The models may mistranscribe specific words or phrases, which could result in inaccurate or offensive labels. In addition, Meta's speech recognition models produced slightly more biased words than other models, albeit only 0.7% more.
While the accomplishment of Meta's researchers is impressive, the use of religious texts to train AI models has raised concerns among experts. Chris Emezue, a researcher at Masakhane, an organization focused on natural language processing for African languages, who was not involved in the project, said: “The Bible has a lot of bias and misrepresentations.”
Meta's expansion of AI speech recognition to many more languages is a significant step toward preserving and making use of the world's linguistic diversity. With its open-source release, Meta aims to foster collaboration and innovation among developers, paving the way for more inclusive and accessible technology across many languages and cultures.
As a reminder, Meta has recently introduced an AI model that can isolate and mask objects within images.