BLEU — Bilingual Evaluation Understudy


A step-by-step approach to understanding BLEU, a metric for evaluating the effectiveness of Machine Translation (MT)

What will you learn in this post?

  • How to measure the effectiveness of translating one language into another
  • What BLEU is, and how to calculate the BLEU score to measure the effectiveness of a machine translation
  • The formula for BLEU: modified precision, count clip, and the brevity penalty (BP)
  • A step-by-step calculation of BLEU using an example
  • Calculating the BLEU score using the Python nltk library

You are watching a very popular movie of a language that you do not understand, so you read the captions in a language that you know.

How do we know that the translations are good enough to convey the right meaning?

We look at the adequacy, fluency, and fidelity of the translations to know their effectiveness.

Adequacy is a measure of whether all the meaning of the source language was expressed in the target language.

Fidelity is the extent to which a translation accurately renders the meaning of the source text.

Fluency measures how grammatically well-formed the sentences are, along with ease of interpretation.

Another challenge is that translations of the same sentence can use different word choices and different word orders. Below are a few examples.

Different word choices but conveying the same meaning

I enjoyed the concert

I liked the show

I relished the musical

Different word order conveying the same message

I was late for office due to traffic jam

The traffic jam was responsible for my delay to office

Traffic jam delayed me to office

With all these complexities, how can we measure the effectiveness of a machine translation?

We will use the main idea as described by Kishore Papineni et al. in the original BLEU paper:

We will measure the closeness of a translation by finding legitimate differences in word choice and word order between the reference human translation and the translation generated by the machine.

A few terms used in the context of BLEU:

The reference translation is the human translation.

The candidate translation is the machine translation.

To measure the effectiveness of a machine translation, we evaluate the closeness of the machine translation to the human reference translations using a metric known as BLEU (Bilingual Evaluation Understudy).

Let’s take an example where we have the following reference translations.

  1. I always do.
  2. I invariably do.
  3. I perpetually do.

We have two different candidates from machine translation

  1. I always invariably perpetually do.
  2. I always do

Candidate 2, "I always do", shares the most words and phrases with these three reference translations. We come to this conclusion by comparing the n-gram matches between each candidate translation and the reference translations.

What do we mean by n-gram?

An n-gram is a sequence of words occurring within a given window where n represents the window size.

Let’s take the sentence "Once you stop learning, you start dying" to understand n-grams.

Unigrams, bigrams, and trigrams for the sentence "Once you stop learning, you start dying."
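
As a quick sketch of what these n-grams look like in code, we can list them with the ngrams helper from nltk (assuming the sentence is tokenized with a simple split, which keeps the comma attached to "learning,"):

from nltk.util import ngrams

sentence = "Once you stop learning, you start dying".split()

# Unigrams: ('Once',), ('you',), ('stop',), ('learning,',), ...
print(list(ngrams(sentence, 1)))

# Bigrams: ('Once', 'you'), ('you', 'stop'), ('stop', 'learning,'), ...
print(list(ngrams(sentence, 2)))

# Trigrams: ('Once', 'you', 'stop'), ('you', 'stop', 'learning,'), ...
print(list(ngrams(sentence, 3)))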

BLEU compares the n-gram of the candidate translation with n-gram of the reference translation to count the number of matches. These matches are independent of the positions where they occur.

The more the number of matches between candidate and reference translation, the better is the machine translation.

Let’s start with a familiar metric: Precision.

In terms of Machine Translation, we define Precision as ‘the count of the number of candidate translation words which occur in any reference translation’ divided by the ‘total number of words in the candidate translation.’

Let’s take an example and calculate the precision for the candidate translations.

  • The precision for candidate 1 is 2/7 (28.5%)
  • The precision for candidate 2 is 1 (100%)

These precision values are unreasonably high, even though we know these are not good translations.
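
To see why plain precision can be misleading, here is a minimal sketch using the degenerate example from the Papineni et al. paper: a candidate consisting only of the word "the" gets a perfect plain precision, because every one of its words appears somewhere in a reference.

def plain_precision(candidate, references):
    # Count candidate words that occur in any reference, divided by the candidate length
    reference_words = set(word for ref in references for word in ref)
    matches = sum(1 for word in candidate if word in reference_words)
    return matches / len(candidate)

references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
candidate = "the the the the the the the".split()

print(plain_precision(candidate, references))  # 1.0, even though the candidate is useless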

To solve the issue, we will use modified n-gram precision. It is computed in multiple steps for each n-gram.

Let’s take an example and understand how the modified precision score is calculated. We have three human reference translations and a machine-translated candidate.

We first calculate Count clip for any n-gram using the following steps

  • Step 1: Count the number of times each candidate n-gram occurs in the candidate translation; this is referred to as Count.
  • Step 2: For each reference sentence, count the number of times the candidate n-gram occurs. As we have three reference translations, we calculate the Ref 1 count, Ref 2 count, and Ref 3 count.
  • Step 3: Take the maximum number of n-gram occurrences in any reference count. This is known as the Max Ref Count.
  • Step 4: Take the minimum of the Count and the Max Ref Count. This is known as Count clip, as it clips the total count of each candidate n-gram by its maximum reference count.
  • Step 5: Add up all these clipped counts.

Below we have the clip counts for unigrams and bigrams for this example.

  • Step 6: Finally, divide the clipped counts by the total (unclipped) number of candidate n-grams to get the modified precision score, Pₙ.

  • The modified precision score for unigrams is 17/18
  • The modified precision score for bigrams is 10/17

Summarizing modified precision score

Modified precision Pₙ: the sum of the clipped n-gram counts for all the candidate sentences in the corpus, divided by the total number of candidate n-grams.
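
Here is a minimal Python sketch of the count-clip computation, assuming tokenized sentences; on the degenerate example from the paper it gives the 2/7 ≈ 0.29 modified unigram precision.

from collections import Counter
from nltk.util import ngrams

def modified_precision(candidate, references, n):
    # Step 1: Count -- how often each n-gram occurs in the candidate
    candidate_counts = Counter(ngrams(candidate, n))

    # Steps 2-3: Max Ref Count -- maximum occurrences of each n-gram in any single reference
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    # Steps 4-5: Count clip -- clip each candidate count by its Max Ref Count, then sum
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in candidate_counts.items())

    # Step 6: divide by the total (unclipped) number of candidate n-grams
    return clipped / sum(candidate_counts.values())

references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
candidate = "the the the the the the the".split()

print(modified_precision(candidate, references, 1))  # 2/7 ≈ 0.286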

How does this modified precision score help?

Modified n-gram precision score captures two aspects of translation: adequacy and fluency.

  • A translation using the same words as in the references tends to satisfy adequacy.
  • Longer n-gram matches between the candidate and reference translations account for fluency.

What happens if the translations are too short or too long?

We add brevity penalty to handle too short translations.

The Brevity Penalty (BP) will be 1.0 when the candidate translation length is the same as any reference translation length. The closest reference sentence length is the "best match length."

With the brevity penalty, we see that a high-scoring candidate translation will match the reference translations in length, in word choice, and word order.

BP is an exponential decay and is calculated as shown below
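
The standard definition from the original BLEU paper is:

BP = 1, if c > r
BP = e^(1 − r/c), if c ≤ r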

r: count of words in the reference translation (the best-match reference length)

c: count of words in the candidate translation

Note: Neither the brevity penalty nor the modified n-gram precision directly considers the source length; they only consider the range of reference translation lengths in the target language.

Finally, we calculate BLEU
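
Using the notation defined below, the formula from the original paper is the brevity penalty multiplied by the weighted geometric mean of the modified n-gram precisions:

BLEU = BP · exp( Σₙ wₙ · log Pₙ ), with the sum running from n = 1 to N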

BP: brevity penalty

N: the maximum n-gram order; we usually use unigrams, bigrams, trigrams, and 4-grams

wₙ: the weight for each modified precision; by default N is 4, so each wₙ is 1/4 = 0.25

Pₙ: the modified precision for n-grams of order n
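
Putting the pieces together, here is a small sketch that combines a brevity penalty with modified precisions; the precision values below are made up purely for illustration.

import math

def bleu_from_components(bp, precisions, weights=None):
    # Weighted geometric mean of the modified precisions, scaled by the brevity penalty
    weights = weights or [1.0 / len(precisions)] * len(precisions)
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

# Hypothetical modified precisions for n = 1..4, with a candidate as long as its reference (BP = 1)
print(bleu_from_components(1.0, [0.9, 0.7, 0.5, 0.4]))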

The BLEU metric ranges from 0 to 1. When the machine translation is identical to one of the reference translations, it attains a score of 1. For this reason, even a human translator will not necessarily score 1.

I hope you now have a good understanding of BLEU.

The BLEU metric is used for

  • Machine Translation
  • Image captioning
  • Text summarization
  • Speech recognition

How can I calculate BLEU in Python?

The nltk library provides an implementation for calculating the BLEU score.

Importing the required library

import nltk.translate.bleu_score as bleu

Setting up the two different candidate translations that we will compare against the two reference translations

reference_translation = ['The cat is on the mat.'.split(),
                         'There is a cat on the mat.'.split()]
candidate_translation_1 = 'the the the mat on the the.'.split()
candidate_translation_2 = 'The cat is on the mat.'.split()

Calculating the BLEU score for candidate translation 1

print("BLEU Score: ",bleu.sentence_bleu(reference_translation, candidate_translation_1))

Calculating the BLEU score for candidate translation 2, where the candidate translation matches one of the reference translations

print("BLEU Score: ",bleu.sentence_bleu(reference_translation, candidate_translation_2))

We can also write our own methods for calculating BLEU in Python using the nltk library; an example implementation is available on GitHub.

References:

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: a Method for Automatic Evaluation of Machine Translation." ACL 2002.

https://www.statmt.org/book/slides/08-evaluation.pdf

http://www.nltk.org/_modules/nltk/translate/bleu_score.html

