Meta MMS vs. OpenAI Whisper: An Evaluation Debate

Chapter 1: Introduction to Meta's MMS

Meta AI has released the Massively Multilingual Speech (MMS) project, which provides speech-to-text and text-to-speech models for over 1,100 languages. This advancement holds great potential for language technologies, particularly for languages that are often overlooked or under-resourced.

Despite the ambitious claims made by Meta regarding MMS, verifying these assertions can be quite challenging, given that the company seldom shares details about its evaluation methodologies. However, there are instances where the information presented is evidently inaccurate or misleading, necessitating a closer examination of the research to identify discrepancies.

Section 1.1: The Evaluation Challenge

Upon seeing a tweet from Yann LeCun about this project, I spent about an hour uncovering issues with their evaluation approach.

To illustrate the problem, one common misstep is copying scores from several studies into a single table and drawing conclusions from them, even though the scores were computed under different conditions. Researchers frequently mishandle these comparisons, which leads to erroneous interpretations. I have documented similar issues across many organizations, including Meta, Google, OpenAI, and Amazon. The result is often a chain of mistakes in which errors propagate from one study to the next.

Section 1.2: Meta's Missteps in MMS

In early July 2022, Meta AI introduced an important paper titled "No Language Left Behind" (NLLB), which aims to extend machine translation to low-resource languages.

However, the MMS project exhibits similar flaws:

The results table in the MMS paper itself indicates that the Whisper scores are not directly comparable.

Update v1: I initially misread the paper. Meta does not claim that MMS outperforms Whisper on MLS. I appreciate the clarification from Michael Auli.

To clarify the analysis, the text states:

“Whisper results are not strictly comparable due to differing normalization, but MMS seems to perform better on MLS.”

This can be misinterpreted as:

MMS is superior.

Our scores are better.

However, the scores lack comparability.

Thus, MMS is superior.

This reasoning is flawed.

Corrected version:

As noted in the analysis:

“Whisper results are not strictly comparable due to differing normalization, but Whisper seems to perform better on MLS.”

This interpretation suggests:

Whisper is superior.

Its score is better.

Yet, the score's comparability is questionable.

Thus, Whisper is deemed superior.

This line of reasoning is equally problematic.

Note: Phrasing like “not strictly” and “seems” raises questions. What does “not strictly” comparable mean? Can we compare the scores or not? The answer is no.

Chapter 2: The Broader Context of AI Evaluation

In previous evaluations, OpenAI exhibited similar issues, where scores that lack comparability were nonetheless presented as superior based on higher numerical values. I have voiced my concerns regarding these practices in earlier analyses.

While I only skimmed the MMS paper, I suspect there may be other inconsistencies similar to the ones highlighted. I remain doubtful that all other scores presented in the evaluation table, especially those not marked with an asterisk, are indeed comparable.

Word Error Rate (WER) scores are typically calculated on normalized texts. I question how Meta can guarantee that all their scores were derived using the same normalization techniques. Google does not share code or specific methodologies, while the VoxPopuli project has published an evaluation process that includes lowercasing text and maintaining only ASCII characters alongside some punctuation adjustments.
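To make the normalization problem concrete, here is a minimal sketch of how the same system output can receive very different WER scores depending on how the texts are normalized before scoring. Everything in it is an assumption made for illustration: the function names, the example sentences, and the two normalization recipes are hypothetical. The second recipe only loosely mirrors the kind of steps described above (lowercasing, keeping ASCII, adjusting punctuation) and is not the actual VoxPopuli script or anything used by Meta or OpenAI.

```python
import re
import string


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis, normalize=lambda text: text):
    """WER = word-level edit distance / number of reference words."""
    ref_words = normalize(reference).split()
    hyp_words = normalize(hypothesis).split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def normalize_lower(text):
    """Hypothetical recipe A: lowercase only."""
    return text.lower()


def normalize_lower_ascii_nopunct(text):
    """Hypothetical recipe B: lowercase, drop non-ASCII, strip punctuation."""
    text = text.lower()
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


# Made-up example pair, not taken from any benchmark.
REFERENCE = "Hello, world! It costs 5 euros."
HYPOTHESIS = "hello world it costs 5 euros"

if __name__ == "__main__":
    for name, fn in [("no normalization", lambda text: text),
                     ("lowercase only", normalize_lower),
                     ("lowercase + ASCII + no punctuation",
                      normalize_lower_ascii_nopunct)]:
        print(f"{name}: WER = {wer(REFERENCE, HYPOTHESIS, fn):.3f}")
```

On this made-up pair, the identical hypothesis scores a WER of about 0.667 with no normalization, 0.5 with lowercasing only, and 0.0 with the fuller recipe. A gap of that size comes from the scoring setup alone, which is why WER scores copied from papers that used different normalization cannot be ranked against each other.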

Did all organizations follow the same procedures? I have my doubts. If time allows, I intend to recalculate some of these scores myself and will provide an update to this article.
