# BLEU

BLEU (Bilingual Evaluation Understudy) is a method for evaluating the quality of text that has been machine-translated from one natural language to another. BLEU was one of the first metrics to report high correlation with human judgements of quality, and it remains one of the most popular in the field. The central idea behind the metric is that "the closer a machine translation is to a professional human translation, the better it is".[1]

The metric calculates scores for individual segments, generally sentences, and then averages these scores over the whole corpus to reach a final score. It has been shown to correlate highly with human judgements of quality at the corpus level.[2][3] Translation quality is reported as a number between 0 and 1 and is measured as statistical closeness to a given set of good-quality human reference translations. The metric therefore does not directly take into account translation intelligibility or grammatical correctness.

The metric works by measuring the n-gram co-occurrence between a given translation and the set of reference translations and then taking the weighted geometric mean. BLEU is specifically designed to approximate human judgement on a corpus level and performs badly if used to evaluate the quality of isolated sentences.

Algorithm

BLEU uses a modified form of precision to compare a candidate translation against multiple reference translations. Simple precision is modified because machine translation systems have been known to generate more words than appear in a reference text. This is illustrated in the following example from Papineni et al. (2002):

:Candidate: the the the the the the the
:Reference 1: the cat is on the mat
:Reference 2: there is a cat on the mat

In this example, the candidate text is given a unigram precision of,

:$P = \frac{m}{w_t} = \frac{7}{7} = 1$

Here $m$ is the number of candidate words that occur in a reference translation and $w_t$ is the total number of words in the candidate. All seven words of the candidate translation appear in the reference translations. This presents a problem for a metric, as the candidate translation above is complete nonsense, retaining none of the content of either of the references. The modification that BLEU makes is fairly straightforward.

For each word in the candidate translation, the algorithm takes the maximum count of that word in any single reference translation. In the example above, the word 'the' appears twice in reference 1 and once in reference 2. The larger value, here 2, is taken as the "maximum reference count".

For each word in the candidate translation, the count of the word is then compared against its maximum reference count, and the lower value is taken. Here the count of the word 'the' in the candidate translation is 7, while its maximum reference count is 2, so the count is clipped to 2. The sum of these "modified counts" is then divided by the total number of words in the candidate translation. In the above example, the modified unigram precision score would be,

:$P = \frac{2}{7}$
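The clipping procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `ngrams` and `modified_precision` are hypothetical, tokenisation is assumed to be plain whitespace splitting, and the sentences are assumed to be the example from Papineni et al. (2002) in which 'the' occurs twice in reference 1 and once in reference 2.

```python
from collections import Counter
from fractions import Fraction


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return Fraction(0)
    # Maximum reference count: the largest count of each n-gram
    # observed in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Modified count: the candidate count clipped to the maximum reference count.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return Fraction(clipped, sum(cand_counts.values()))


# Assumed example sentences (whitespace-tokenised).
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(modified_precision(candidate, references, n=1))  # 2/7
```

Passing a larger `n` gives the corresponding modified bigram, trigram or 4-gram precision under the same assumptions.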

The above method is used to calculate a score for each $n$-gram length $n$. The value of $n$ which has the "highest correlation with monolingual human judgements"[4] was found to be 4. The unigram scores are found to account for the adequacy of the translation, in other words, how much information is retained in the translation. The longer $n$-gram scores account for the fluency of the translation, or to what extent it reads like "good English".

The modification made to precision does not solve the problem of short translations. Short translations can produce very high precision scores, even using modified precision. An example of a candidate translation for the same references as above might be:

:the cat

In this example, the modified unigram precision would be,

:$P = \frac{1}{2} + \frac{1}{2} = \frac{2}{2}$

as the word 'the' and the word 'cat' each appear once in the candidate, and the total number of words is two. The modified bigram precision would be $1 / 1$, as the bigram "the cat" appears once in the candidate. It has been pointed out that precision is usually twinned with recall to overcome this problem,[5] as the unigram recall of this example would be $2 / 6$ or $2 / 7$. The problem is that, because there are multiple reference translations, a bad translation could easily have an inflated recall, such as a translation consisting of all the words in each of the references.[6]
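The recall figures quoted above, and the way recall can be inflated, can be checked with a short sketch. The helper `unigram_recall` is a hypothetical name, tokenisation is assumed to be whitespace splitting, and the two references are assumed to be "the cat is on the mat" and "there is a cat on the mat". The function returns an unreduced (matched, total) pair so that the result reads the same way as the fractions in the text.

```python
from collections import Counter


def unigram_recall(candidate, reference):
    """Return (matched, total): reference words covered by the candidate, with clipping."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # A reference word is matched at most as often as it occurs in the candidate.
    matched = sum(min(count, cand_counts[word])
                  for word, count in ref_counts.items())
    return matched, len(reference)


ref1 = "the cat is on the mat".split()
ref2 = "there is a cat on the mat".split()

short = "the cat".split()
print(unigram_recall(short, ref1))   # (2, 6)
print(unigram_recall(short, ref2))   # (2, 7)

# A nonsensical candidate made of every word from both references
# reaches perfect recall against each reference.
padded = ref1 + ref2
print(unigram_recall(padded, ref1))  # (6, 6)
print(unigram_recall(padded, ref2))  # (7, 7)
```

The second pair of results shows why recall is awkward to use with multiple references: simply concatenating the references scores perfectly.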

In order to produce a score for the whole corpus, the modified precision scores for the segments are combined using the geometric mean and multiplied by a brevity penalty, whose purpose is to prevent very short candidates from receiving too high a score. Let $r$ be the total length of the reference corpus, and $c$ the total length of the translation corpus. If $c \leq r$, the brevity penalty applies and is defined to be $e^{(1-r/c)}$; otherwise no penalty is applied. (In the case of multiple reference sentences, $r$ is taken to be the sum of the lengths of the reference sentences whose lengths are closest to the lengths of the candidate sentences. However, in the version of the metric used by NIST, the shortest reference sentence is used.)
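The combination step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the official scoring script: the function names are hypothetical, the modified precisions are assumed to be computed already, the geometric mean is assumed to use equal weights, and ties between equally close reference lengths are broken in favour of the shorter reference.

```python
import math


def effective_reference_length(candidate_lengths, reference_lengths):
    """Sum, over segments, the reference length closest to each candidate length."""
    total = 0
    for c_len, ref_lens in zip(candidate_lengths, reference_lengths):
        # Assumption of this sketch: ties go to the shorter reference.
        total += min(ref_lens, key=lambda r_len: (abs(r_len - c_len), r_len))
    return total


def brevity_penalty(c, r):
    """Return e^(1 - r/c) when c <= r, and 1 (no penalty) otherwise."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(precisions, c, r):
    """Equal-weight geometric mean of the precisions times the brevity penalty."""
    if not precisions or min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(c, r) * math.exp(log_mean)


# The two-word candidate "the cat" has unigram and bigram precision 1,
# but is heavily penalised against references of length 6 and 7.
r = effective_reference_length([2], [[6, 7]])
print(r)                                 # 6
print(round(bleu([1.0, 1.0], 2, r), 4))  # 0.1353
```

With $c = 2$ and $r = 6$ the penalty is $e^{(1-6/2)} = e^{-2}$, so the perfect precision scores of the short candidate are scaled down to roughly 0.135.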

Performance

BLEU has frequently been reported as correlating well with human judgement,[7][8][9] and remains a benchmark for any new evaluation metric to beat. A number of criticisms have nevertheless been voiced. It has been noted that, while in theory capable of evaluating any language, BLEU does not in its present form work on languages without word boundaries.[10]

Callison-Burch et al. argue that although BLEU has significant advantages, there is no guarantee that an increase in BLEU score indicates improved translation quality.[11] As BLEU scores are taken at the corpus level, it is difficult to give a textual example. Nevertheless, they highlight two instances where BLEU seriously underperformed: the 2005 NIST evaluations,[12] in which a number of different machine translation systems were tested, and their own study of the SYSTRAN engine versus two engines using statistical machine translation (SMT) techniques.[13]

In the 2005 NIST evaluation, they report that the scores generated by BLEU failed to correspond to the scores produced in the human evaluations: the system ranked highest by the human judges was ranked only 6th by BLEU. In their own study, they compared SMT systems with SYSTRAN, a knowledge-based system. The BLEU scores for SYSTRAN were substantially worse than the scores given to SYSTRAN by the human judges. They note that the SMT systems were trained using BLEU minimum error rate training,[14] and point out that this could be one of the reasons behind the difference. They conclude by recommending that BLEU be used in a more restricted manner, for comparing the results from two similar systems and for tracking "broad, incremental changes to a single system".[15]

BLEU and real applications of MT: criticism

Another possible criticism of evaluation measures such as BLEU is that they are far from measuring the performance of machine translation in real situations, which may be grouped into two categories: "assimilation" (use of machine translation output "as is" as an aid to understanding) and "dissemination" (use of machine translation to produce drafts that will be corrected or "postedited" before publishing). This is because BLEU tries to measure how close machine translation output is to a reference translation or a set of reference translations produced by human translators, which may or may not correlate with indicators of quality in those two groups of real situations.

Indeed, one of the underlying assumptions of BLEU is that quality equals human likeness. But this may be one reason for criticism. For instance, "human-unlikely" translations such as English text without definite articles but otherwise "correct" may be very close to being adequate for assimilation purposes, but far from being adequate for dissemination purposes (too many words to insert). Conversely, "human-unlikely" translations with obvious errors affecting understandability of the text (for instance, lexical selection errors caused by ambiguity) may be easily rendered adequate by a human posteditor.

See also

* NIST (metric)
* METEOR

Notes

1. Papineni, K., et al. (2002)
2. Papineni, K., et al. (2002)
3. Coughlin, D. (2003)
4. Papineni, K., et al. (2002)
5. Papineni, K., et al. (2002)
6. Papineni, K., et al. (2002)
7. Papineni, K., et al. (2002)
8. Coughlin, D. (2003)
9. Doddington, G. (2002)
10. Denoul, E. and Lepage, Y. (2005)
11. Callison-Burch, C., Osborne, M. and Koehn, P. (2006)
12. Lee, A. and Przybocki, M. (2005)
13. Callison-Burch, C., Osborne, M. and Koehn, P. (2006)
14. Lin, C. and Och, F. (2004)
15. Callison-Burch, C., Osborne, M. and Koehn, P. (2006)

References

* Papineni, K., Roukos, S., Ward, T., and Zhu, W. J. (2002). "BLEU: a method for automatic evaluation of machine translation" in "ACL-2002: 40th Annual meeting of the Association for Computational Linguistics", pp. 311–318.
* Callison-Burch, C., Osborne, M. and Koehn, P. (2006). "Re-evaluating the Role of BLEU in Machine Translation Research" in "11th Conference of the European Chapter of the Association for Computational Linguistics: EACL 2006", pp. 249–256.
* Doddington, G. (2002). "Automatic evaluation of machine translation quality using n-gram cooccurrence statistics" in "Proceedings of the Human Language Technology Conference (HLT), San Diego, CA", pp. 128–132.
* Coughlin, D. (2003). "Correlating Automated and Human Assessments of Machine Translation Quality" in "MT Summit IX, New Orleans, USA", pp. 23–27.
* Denoul, E. and Lepage, Y. (2005). "BLEU in characters: towards automatic MT evaluation in languages without word delimiters" in "Companion Volume to the Proceedings of the Second International Joint Conference on Natural Language Processing", pp. 81–86.
* Lee, A. and Przybocki, M. (2005). "NIST 2005 machine translation evaluation official results".
* Lin, C. and Och, F. (2004). "Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics" in "Proceedings of the 42nd Annual Meeting of the Association of Computational Linguistics".

Wikimedia Foundation. 2010.
