BLEU — Bilingual Evaluation Understudy


A step-by-step approach to understanding BLEU, a metric for evaluating the effectiveness of Machine Translation (MT)

What will you learn in this post?

  • How do we measure the effectiveness of translating one language into another?
  • What is BLEU, and how do we calculate the BLEU score to evaluate the effectiveness of a machine translation?
  • Understanding the formula for BLEU: modified precision, count clip, and the brevity penalty (BP)
  • Step-by-step calculation of BLEU using an example
  • Calculating the BLEU score using the Python nltk library

You are watching a very popular movie in a language that you do not understand, so you read the captions in a language that you know.

How do we know that the translations are good enough to convey the right meaning?

We look at the adequacy, fluency, and fidelity of the translations to judge their effectiveness.

Adequacy is a measure of whether all the meaning of the source language is expressed in the target language.

Fidelity is the extent to which a translation accurately renders the meaning of the source text.

Fluency measures how grammatically well-formed the sentences are, along with their ease of interpretation.

Another challenge is that the same sentence can be translated with different word choices or a different word order. Below are a few examples.

Different word choices but conveying the same meaning

I enjoyed the concert

I liked the show

I relished the musical

Different word order conveying the same message

I was late for the office due to a traffic jam

The traffic jam was responsible for my delay to the office

The traffic jam delayed me on my way to the office

With all these complexities, how can we measure the effectiveness of a machine translation?

We will use the main idea as described by Kishore Papineni et al. in the original BLEU paper:

We will measure the closeness of a translation by finding legitimate differences in word choice and word order between the reference human translation and the translation generated by the machine.

A few terms in context with BLEU

Reference translation: the human translation

Candidate translation: the machine translation

To measure the effectiveness of a machine translation, we evaluate the closeness of the machine translation to the human reference translations using a metric known as BLEU (Bilingual Evaluation Understudy).

Let’s take an example where we have the following reference translations.

  1. I always do.
  2. I invariably do.
  3. I perpetually do.

We have two different candidates from the machine translation:

  1. I always invariably perpetually do.
  2. I always do

Candidate 2, “I always do,” shares the most words and phrases with these three reference translations. We come to this conclusion by comparing the n-gram matches between each candidate translation and the reference translations.

What do we mean by n-gram?

An n-gram is a sequence of words occurring within a given window where n represents the window size.

Let’s take the sentence “Once you stop learning, you start dying” to understand n-grams.

Unigrams: Once | you | stop | learning, | you | start | dying

Bigrams: Once you | you stop | stop learning, | learning, you | you start | start dying

Trigrams: Once you stop | you stop learning, | stop learning, you | learning, you start | you start dying
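We can generate the same n-grams with nltk’s ngrams helper. A minimal sketch, splitting on whitespace (which keeps the comma attached to “learning,”):

from nltk.util import ngrams

sentence = "Once you stop learning, you start dying"
tokens = sentence.split()  # simple whitespace tokenization

# n = 1, 2, 3 give the unigrams, bigrams, and trigrams
for n in (1, 2, 3):
    print(f"{n}-grams:", [" ".join(gram) for gram in ngrams(tokens, n)])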

BLEU compares the n-grams of the candidate translation with the n-grams of the reference translations to count the number of matches. These matches are independent of the positions where they occur.

The more matches between the candidate and reference translations, the better the machine translation.

Let’s start with a familiar metric: Precision.

In terms of Machine Translation, we define Precision as ‘the count of candidate translation words which occur in any reference translation’ divided by the ‘total number of words in the candidate translation.’

Let’s take an example and calculate the precision for two candidate translations. We will reuse the same sentences in the Python example later in this post.

Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
Candidate 1: the the the mat on the the.
Candidate 2: The cat is on the mat.

  • Ignoring case and punctuation, every word of candidate 1 (“the”, “mat”, “on”) occurs in some reference, so the precision for candidate 1 is 7/7 (100%).
  • The precision for candidate 2 is 6/6 (100%).

These precisions are unreasonably high: candidate 1 is clearly not a good translation, yet it scores as well as candidate 2.
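A minimal sketch of this plain (unclipped) precision, lowercasing and stripping punctuation for simplicity (the helper names are just for illustration):

from string import punctuation

def tokenize(sentence):
    # lowercase and drop punctuation so "mat." matches "mat"
    return sentence.lower().translate(str.maketrans("", "", punctuation)).split()

def unigram_precision(candidate, references):
    # fraction of candidate words that occur in any reference (no clipping)
    reference_words = set(word for ref in references for word in ref)
    matches = sum(1 for word in candidate if word in reference_words)
    return matches / len(candidate)

references = [tokenize("The cat is on the mat."), tokenize("There is a cat on the mat.")]
print(unigram_precision(tokenize("the the the mat on the the."), references))  # 1.0
print(unigram_precision(tokenize("The cat is on the mat."), references))       # 1.0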

To solve the issue, we will use modified n-gram precision . It is computed in multiple steps for each n-gram.

Let’s take an example from the Papineni paper and understand how the modified precision score is calculated. We have three human reference translations and a machine-translated candidate.

Candidate: It is a guide to action which ensures that the military always obeys the commands of the party.

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

We first calculate the Count clip for each n-gram using the following steps:

  • Step 1: Count the number of times a candidate n-gram occurs in the candidate translation; this is referred to as the Count.
  • Step 2: For each reference translation, count the number of times the candidate n-gram occurs. As we have three reference translations, we calculate the Ref 1 count, Ref 2 count, and Ref 3 count.
  • Step 3: Take the maximum number of occurrences of the n-gram in any single reference. This is known as the Max Ref Count.
  • Step 4: Take the minimum of the Count and the Max Ref Count. This is known as the Count clip, as it clips the total count of each candidate n-gram by its maximum reference count.
  • Step 5: Add up all these clipped counts.

  • Step 6: Finally, divide the sum of the clipped counts by the total (unclipped) number of candidate n-grams to get the modified precision score Pₙ.

Applying these steps to the unigrams and bigrams of the candidate above:

  • The modified precision score for unigrams is 17/18 (only “obeys” occurs in no reference)
  • The modified precision score for bigrams is 10/17

Summarizing the modified precision score:

Modified precision Pₙ: the sum of the clipped n-gram counts for all the candidate sentences in the corpus, divided by the total number of candidate n-grams.
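Here is a minimal sketch of modified n-gram precision with count clipping, using the candidate and references from the Papineni example above:

from collections import Counter
from nltk.util import ngrams

def modified_precision(candidate, references, n):
    # count each candidate n-gram, then clip it by its maximum count
    # in any single reference (Steps 1-6 above)
    candidate_counts = Counter(ngrams(candidate, n))
    clipped = sum(min(count, max(Counter(ngrams(ref, n))[gram] for ref in references))
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())

candidate = ("It is a guide to action which ensures that the military "
             "always obeys the commands of the party").split()
references = [
    ("It is a guide to action that ensures that the military "
     "will forever heed Party commands").split(),
    ("It is the guiding principle which guarantees the military forces "
     "always being under the command of the Party").split(),
    ("It is the practical guide for the army always to heed "
     "the directions of the party").split(),
]

print(modified_precision(candidate, references, 1))  # 17/18 ≈ 0.944
print(modified_precision(candidate, references, 2))  # 10/17 ≈ 0.588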

How does this modified precision score help?

Modified n-gram precision score captures two aspects of translation: adequacy and fluency.

  • A translation using the same words as the references tends to satisfy adequacy.
  • Longer n-gram matches between the candidate and reference translations account for fluency.

What happens if the translations are too short or too long?

We add a brevity penalty to handle translations that are too short. Translations that are too long need no extra penalty, because their spurious words already lower the modified n-gram precision.

The Brevity Penalty (BP) will be 1.0 when the candidate translation length is the same as any reference translation length. The closest reference sentence length is called the “best match length.”

With the brevity penalty, a high-scoring candidate translation must match the reference translations in length, in word choice, and in word order.

BP is an exponential decay and is calculated as shown below:

BP = 1 if c > r
BP = e^(1 − r/c) if c ≤ r

r: count of words in the reference translation whose length is closest to the candidate (the best match length)

c: count of words in the candidate translation

Note: Neither the brevity penalty nor the modified n-gram precision directly considers the source length; instead, they only consider the range of reference translation lengths in the target language.
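A minimal sketch of the brevity penalty, assuming sentence-level lengths; on a length tie it prefers the shorter reference, mirroring how nltk picks the closest reference length:

import math

def brevity_penalty(candidate_len, reference_lens):
    # r: the reference length closest to the candidate length
    r = min(reference_lens, key=lambda ref_len: (abs(ref_len - candidate_len), ref_len))
    c = candidate_len
    return 1.0 if c > r else math.exp(1 - r / c)

print(brevity_penalty(6, [6, 7]))  # 1.0 -- same length as a reference
print(brevity_penalty(4, [6, 7]))  # exp(1 - 6/4) ≈ 0.61 -- short translations are penalized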

Finally, we calculate BLEU as the brevity penalty multiplied by the geometric mean of the modified precisions:

BLEU = BP · exp(Σₙ₌₁ᴺ wₙ · log Pₙ)

BP: brevity penalty

N: maximum n-gram size; we usually use unigrams, bigrams, 3-grams, and 4-grams, so N = 4

wₙ: weight for each modified precision; with the default N = 4, wₙ = 1/4 = 0.25

Pₙ: modified precision

The BLEU metric ranges from 0 to 1. When the machine translation is identical to one of the reference translations, it will attain a score of 1. For this reason, even a human translator will not necessarily score 1.
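Putting the pieces together, a from-scratch sketch of sentence-level BLEU (no smoothing, so any n-gram order with zero matches drives the score to 0):

import math
from collections import Counter
from nltk.util import ngrams

def modified_precision_counts(candidate, references, n):
    # clipped and total n-gram counts for one n-gram order
    counts = Counter(ngrams(candidate, n))
    clipped = sum(min(count, max(Counter(ngrams(ref, n))[gram] for ref in references))
                  for gram, count in counts.items())
    return clipped, sum(counts.values())

def bleu(candidate, references, max_n=4):
    # geometric mean of the modified precisions P_1..P_N, each weighted 1/N
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        clipped, total = modified_precision_counts(candidate, references, n)
        if clipped == 0:
            return 0.0  # unsmoothed BLEU: one missing n-gram order zeroes the score
        log_precision_sum += (1.0 / max_n) * math.log(clipped / total)
    # brevity penalty against the closest reference length
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda l: (abs(l - c), l))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_precision_sum)

references = ["The cat is on the mat".split(), "There is a cat on the mat".split()]
print(bleu("The cat is on the mat".split(), references))  # 1.0 -- identical to a reference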

I hope you now have a good understanding of BLEU.

The BLEU metric is used for:

  • Machine Translation
  • Image captioning
  • Text summarization
  • Speech recognition

How can I calculate BLEU in Python?

The nltk library provides an implementation for calculating the BLEU score.

Importing the required library

import nltk.translate.bleu_score as bleu

Setting up the two candidate translations that we will compare against two reference translations:

reference_translation = ['The cat is on the mat.'.split(),
                         'There is a cat on the mat.'.split()]
# .split() tokenizes on whitespace, so the final period stays attached to the last word
candidate_translation_1 = 'the the the mat on the the.'.split()
candidate_translation_2 = 'The cat is on the mat.'.split()

Calculating the BLEU score for candidate translation 1

print("BLEU Score: ",bleu.sentence_bleu(reference_translation, candidate_translation_1))

Calculating the BLEU score for candidate translation 2, where the candidate matches one of the reference translations exactly:

print("BLEU Score: ",bleu.sentence_bleu(reference_translation, candidate_translation_2))

We can also create our own Python methods for calculating BLEU using the nltk library, whose implementation is available on GitHub.

References:

BLEU: a Method for Automatic Evaluation of Machine Translation — Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu (ACL 2002)

https://www.statmt.org/book/slides/08-evaluation.pdf

http://www.nltk.org/_modules/nltk/translate/bleu_score.html

