Essential metrics in machine translation evaluation

In March 2018, Microsoft announced a historic milestone: its neural machine translation system could allegedly match human performance in translating news from Chinese to English. But how can we compare and evaluate the quality of different systems? That is the job of machine translation evaluation.

Methods of Machine Translation Evaluation

With the rapid development of deep learning, machine translation (MT) research has moved from rule-based models to neural models in recent years. Neural MT (NMT) is currently a hot topic. The number of publications has spiked recently, with big players like IBM, Microsoft, Facebook, Amazon, and Google all actively researching NMT.

Machine translation evaluation is difficult because natural languages are highly ambiguous. MT can be evaluated with both manual and automatic approaches. Manual evaluation gives a more reliable measure of MT quality and makes it possible to analyze errors in the system output. Typical methods include adequacy and fluency scores, post-editing measures, sentence-level human ranking of translations, and task-based evaluations. The main drawbacks of human evaluation are its high cost and the time it takes. For this reason, various automatic metrics have been proposed to measure the quality of MT output, such as BLEU, METEOR, F-measure, Levenshtein distance, WER (Word Error Rate), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), and NIST (a metric named after the US National Institute of Standards and Technology).
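To make one of these automatic metrics concrete, here is a minimal Python sketch of WER computed as a word-level Levenshtein distance divided by the reference length. The function name `word_error_rate` and the plain whitespace tokenization are assumptions made for illustration; real toolkits usually normalize and tokenize text more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref)


# One missing word out of six reference words gives a WER of 1/6.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Unlike BLEU, lower is better here: 0 means the hypothesis matches the reference word for word, and the value can exceed 1 when the hypothesis is much longer than the reference.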

BLEU: the most popular

Currently, the most popular metric in machine translation evaluation is BLEU, short for “Bilingual Evaluation Understudy”. Introduced in 2002, it compares the candidate (hypothesis) translation to one or more reference translations and awards a higher score when the candidate shares many word sequences (n-grams) with the references. BLEU scores a translation on a scale of 0 to 1, although the score is frequently displayed as a percentage: the closer to 1, the more closely the translation matches a human translation. The main difficulty is that there is rarely one single correct translation, but many equally good alternatives.
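The idea can be sketched in a few lines of Python. This is a simplified, sentence-level illustration and not a reference implementation: the function names `ngram_counts` and `simple_bleu` are made up for this example, tokenization is plain whitespace splitting, and no smoothing is applied, so a candidate with no matching 4-gram scores 0.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count every n-gram (as a tuple of words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, references: List[str], max_n: int = 4) -> float:
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand = candidate.split()
    refs = [ref.split() for ref in references]
    if not cand or not refs:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        # For each n-gram, the highest count seen in any single reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        # "Clipping": a candidate n-gram is credited at most as often
        # as it appears in a reference.
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # assumption: no smoothing in this sketch
        log_precision_sum += math.log(clipped / sum(cand_counts.values()))

    # Brevity penalty: penalize candidates shorter than the closest reference.
    ref_len = min((len(ref) for ref in refs),
                  key=lambda length: (abs(length - len(cand)), length))
    penalty = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return penalty * math.exp(log_precision_sum / max_n)


# An exact match with a reference scores 1.0 (often reported as 100).
print(simple_bleu("the cat sat on the mat", ["the cat sat on the mat"]))
```

Clipping is what stops a degenerate candidate from gaming the score: with `max_n=1`, "the the the the the the the" against "the cat is on the mat" gets credit for only two of its seven words.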

METEOR: an emphasis on recall and precision

The second most popular metric in machine translation evaluation is METEOR, which stands for Metric for Evaluation of Translation with Explicit Ordering. Originally developed and released in 2004, METEOR was designed with the explicit goal of producing sentence-level scores that correlate well with human judgments of translation quality. Several key design decisions support this goal. In contrast to IBM’s BLEU, which uses only precision-based features, METEOR uses and emphasizes recall in addition to precision, a property that has been confirmed by several metrics as being critical for high correlation with human judgments. METEOR also addresses the problem of reference translation variability through flexible word matching, which allows morphological variants and synonyms to count as legitimate correspondences.
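The recall emphasis can be illustrated with a recall-weighted harmonic mean of unigram precision and recall. The sketch below is a strong simplification: it uses exact word matches only and leaves out METEOR's stemming, synonym matching, and word-order penalty. The function name `recall_weighted_fmean` is invented for this example, and the default weight of 9 is an assumption modeled on the weighting described for the original METEOR.

```python
from collections import Counter


def recall_weighted_fmean(candidate: str, reference: str,
                          recall_weight: float = 9.0) -> float:
    """Harmonic mean of unigram precision and recall, weighted toward recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())

    # Exact matches only; each reference word can be matched at most once.
    matches = sum((cand & ref).values())
    if matches == 0:
        return 0.0

    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    return ((1 + recall_weight) * precision * recall
            / (recall + recall_weight * precision))


reference = "the cat sat on the mat"
# Every word is correct, but most of the reference is missing.
print(recall_weighted_fmean("the cat", reference))
# The whole reference is covered, with a few extra words.
print(recall_weighted_fmean("the cat sat on the mat with a hat on", reference))
```

With this weighting, the second candidate scores far higher than the first: leaving content out costs much more than adding a few extra words, which is the opposite of what a purely precision-based score would reward.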

Beyond the choice of metric, some researchers point to a lack of consensus on how to report scores from the dominant metric, BLEU. Although people refer to “the” BLEU score, BLEU scores can vary wildly with changes to the metric’s parameterization and, especially, to reference processing schemes, yet these details are often absent from papers or hard to determine.
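Tokenization is one processing choice of this kind. As a small illustration, the Python sketch below scores the same candidate against the same reference with two different tokenizers. All names are hypothetical, and unigram precision stands in for a full BLEU computation to keep the example short.

```python
import re
from collections import Counter
from typing import Callable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Split on whitespace only, so punctuation stays attached to words."""
    return text.split()


def punctuation_tokenize(text: str) -> List[str]:
    """Split words and punctuation marks into separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def unigram_precision(candidate: str, reference: str,
                      tokenize: Callable[[str], List[str]]) -> float:
    """Share of candidate tokens that also appear in the reference (clipped)."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum((cand & ref).values()) / total


candidate = "Hello world!"
reference = "Hello, world!"
print(unigram_precision(candidate, reference, whitespace_tokenize))   # 0.5
print(unigram_precision(candidate, reference, punctuation_tokenize))  # 1.0
```

Same sentences, same metric, and the score doubles purely because of how the text was split into tokens. This is why a reported score is hard to compare unless the processing steps are stated alongside it.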

Nevertheless, human and automatic metrics are both essential in assessing MT quality, and they serve different purposes. Good human metrics greatly help in developing good automatic metrics.

Thank you for reading. We hope you found this article insightful.

Thanks from the Tcloc web team