Talk:BLEU/GA1

GA Reassessment

This article has been reviewed as part of WikiProject Good articles/Project quality task force in an effort to ensure all listed Good articles continue to meet the Good article criteria. In reviewing the article, I have found there are some issues that may need to be addressed, listed below. I will check back in seven days. If these issues are addressed, the article will remain listed as a Good article. Otherwise, it may be delisted (such a decision may be challenged through WP:GAR). If improved after it has been delisted, it may be nominated at WP:GAN. Feel free to drop a message on my talk page if you have any questions, and many thanks for all the hard work that has gone into this article thus far.


 * The "BLEU and real applications of MT: criticism" section is completely uncited, and so gives an impression of original research.


 * Removed.


 * "However, in the version of the metric used by NIST ...". NIST needs to be spelled out the first time it's encountered.


 * It doesn't really make much sense to spell it out. I will however change the link to point to the metric.


 * Do none of the publications listed have isbns/issns?


 * Probably not, most are conference papers.


 * "It has been argued that although BLEU certainly has significant advantages ...". By whom has it been argued? What are these advantages? Over what?


 * If you look at the footnote: Callison-Burch, C., Osborne, M. and Koehn, P. (2006)


 * "As BLEU scores are taken at the corpus level ...." Calculated at the corpus level?


 * It means they are calculated over a whole corpus rather than over individual sentences. While the scores over a whole corpus might correlate with human judgement, scores over individual sentences might not. I could give an example of this, but it would probably be considered original research.
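To illustrate the corpus-versus-sentence distinction, here is a rough Python sketch; the sentence pairs are invented, and plain clipped unigram precision stands in for the full BLEU metric:

```python
from collections import Counter

def clipped_matches(candidate, reference):
    """Count candidate words that appear in the reference, clipping
    each word's count at its count in the reference."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    return sum(min(n, ref[w]) for w, n in cand.items())

# Two hypothetical sentence pairs: the first is translated well,
# the second badly.
pairs = [
    ("the cat sat on the mat", "the cat sat on the mat"),
    ("dog barking loud", "the dog barked loudly"),
]

# Sentence-level precision: computed separately per sentence, so the
# second pair scores very poorly on its own.
for cand, ref in pairs:
    print(clipped_matches(cand, ref) / len(cand.split()))

# Corpus-level precision: matches and lengths are pooled first, so the
# bad sentence is averaged out by the good one.
total_m = sum(clipped_matches(c, r) for c, r in pairs)
total_w = sum(len(c.split()) for c, r in pairs)
print(total_m / total_w)
```

The pooled score can thus look respectable even when individual sentences score badly, which is why corpus-level correlation with human judgement need not carry over to the sentence level.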


 * Several important terms such as precision and recall are linked to disambiguation pages.


 * Fixed.


 * I found the description given in the Algorithm section unclear.
 * The terms unigram, bigram and n-gram are not explained


 * I recommend you look over the article n-gram
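For readers who don't want to follow the link: an n-gram is simply a sequence of n consecutive words, which a few lines of Python can show (the example sentence is made up):

```python
def ngrams(sentence, n):
    """Return all sequences of n consecutive words in the sentence."""
    words = sentence.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "the cat sat on the mat"
print(ngrams(sentence, 1))  # unigrams: single words
print(ngrams(sentence, 2))  # bigrams: pairs of consecutive words
```

So a unigram is a single word, a bigram is a pair of consecutive words, and so on for larger n.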


 * $P = \frac{m}{w_{t}}$ — what do the terms $m$ and $w_{t}$ represent?


 * m = the number of words from the candidate translation that are found in the reference, wt = the total number of words in the candidate.
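A quick sketch of that formula in Python; the degenerate candidate below is the standard illustration of why plain unigram precision is too generous (BLEU's modified precision clips the counts to avoid rewarding it):

```python
def unigram_precision(candidate, reference):
    """P = m / w_t, where m = candidate words also present in the
    reference, and w_t = total number of words in the candidate."""
    cand, ref = candidate.split(), set(reference.split())
    m = sum(1 for w in cand if w in ref)
    return m / len(cand)

# Degenerate candidate: every word appears in the reference, so plain
# precision is a perfect 1.0 despite the translation being useless.
print(unigram_precision("the the the the the the the",
                        "the cat is on the mat"))
```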


 * "The above method is used to calculate scores for each n." What does n represent? Each word in the sentence? Each sentence in the corpus?


 * the 'n' is from 'n-gram' (see article on n-gram)
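As a rough sketch, one score is computed for each n in turn — here using clipped ("modified") n-gram precision, as described in the cited Papineni et al. paper; the sentences are invented:

```python
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as many times as it appears in the reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    matches = sum(min(c, ref[g]) for g, c in cand.items())
    return matches / max(1, sum(cand.values()))

candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
for n in (1, 2, 3, 4):  # one precision score per n
    print(n, modified_precision(candidate, reference, n))
```

The per-n scores typically fall as n grows, since longer word sequences are harder to match exactly.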


 * "It has been pointed out that precision is usually twinned with recall ...". By whom?


 * The footnote says: Papineni, K., et al. (2002)... "Traditionally, precision has been paired with recall to overcome such length-related problems." -- All of the papers are online, so you can look up this stuff yourself.


 * "However, in the version of the metric used by NIST, the short reference sentence is used." Citation required.


 * I believe this is from the Doddington article.


 * Is the term "corpus" being used to apply to the whole of the work being translated, selected chunks of it, or both?


 * Corpus is opposed to sentence.


 * "In the case of multiple reference sentences, r is taken to be the sum of the lengths of the sentences whose lengths are closest to the lengths of the candidate sentences." This is unclear, especially given the short example at the start of the section. As the algorithm is being applied to a corpus, wouldn't there always be multiple reference sentences, one for each candidate sentence? Summing the lengths of multiple reference sentences for one candidate sentence clearly wouldn't make sense.


 * When you're calculating the score over several references, it takes 'r' to be the sum of the lengths of the subset of sentences from the set of reference translations which are closest in length to the test sentences.
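A hedged sketch of that computation — the function name and example sentences below are mine, not from the article. For each candidate sentence, the single reference whose length is closest is selected, and r is the sum of those selected lengths:

```python
def effective_reference_length(candidates, reference_sets):
    """For each candidate sentence, pick the reference translation whose
    length (in words) is closest to the candidate's length; r is the
    sum of those closest lengths over the whole corpus."""
    r = 0
    for cand, refs in zip(candidates, reference_sets):
        c_len = len(cand.split())
        ref_lens = [len(ref.split()) for ref in refs]
        r += min(ref_lens, key=lambda rl: abs(rl - c_len))
    return r

candidates = ["the cat sat", "a small dog barked loudly at night"]
reference_sets = [
    ["the cat sat down", "a cat was sitting"],        # lengths 4, 4
    ["the small dog barked loudly", "a dog barked"],  # lengths 5, 3
]
print(effective_reference_length(candidates, reference_sets))
```

So only one length per candidate contributes to the sum, which resolves the reviewer's concern about summing multiple references for one sentence.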

--Malleus Fatuorum 22:04, 19 February 2009 (UTC)

Thanks for the review; you've brought up some good points. - Francis Tyers · 10:27, 20 February 2009 (UTC)


 * If all of the papers are online, then links ought to be provided in this article.


 * Done.


 * I wasn't really asking you to explain the answers to my questions to me; I was asking you to explain in the article, to your readers. So terms like unigram and so on need to be explained here, without the need to follow the link. The article is otherwise somewhat inaccessible to the non-specialist reader.


 * Why should they be explained here when they are explained in another article?


 * ... to make the article more accessible to your readers, who may not immediately realize that unigram is here being used as a synonym for "word", or that bigram is synonymous with two consecutive words. Or is it any pair of words, consecutive or not? --Malleus Fatuorum 11:06, 23 February 2009 (UTC)


 * Good point. I've added a brief explanation in parentheses. - Francis Tyers · 23:16, 23 February 2009 (UTC)


 * "In the 2005 NIST evaluation, they report ...". Who are "they"? I know I could look inside the citation, but the sentence as written is incorrect, as there is no antecedent to whom "they" refers. The use of NIST in apparently two different contexts (the standards organisation and the algorithm) is also rather confusing.


 * This is from Callison-Burch; the footnote is at the end of the paragraph. I've changed this into the passive, as you're right, it did sound a bit odd. You're also right that using 'NIST' for both the organisation and the metric is confusing — you would have thought they could come up with something a bit more imaginative!

--Malleus Fatuorum 11:49, 20 February 2009 (UTC)


 * -Francis Tyers · 09:00, 23 February 2009 (UTC)

I have no particular complaints about the edits; they are mostly stylistic, and in some cases they leave the text clearer, although in other cases they give no substantive improvement. I would ask that the following parts be re-inserted:

The metric works by measuring the n-gram (sequences of one or more words) co-occurrence between a given translation and the set of reference translations and then taking the weighted geometric mean.

and

The quality of translation is indicated as a number between 0 and 1 and is measured as statistical closeness to a given set of good quality human reference translations.

This quote is also useful:

The central idea behind the metric is that, "the closer a machine translation is to a professional human translation, the better it is".

Please feel free to paraphrase it, but I think that it is worthwhile including. I also prefer the wording "report a high correlation ..." to "demonstrate a ..."

- Francis Tyers · 14:19, 25 February 2009 (UTC)
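For what it's worth, the weighted geometric mean mentioned in the first passage above can be sketched as follows — uniform weights over n = 1..4 are an assumption here, and the brevity penalty is omitted:

```python
import math

def bleu_combine(precisions, weights=None):
    """Combine per-n modified precisions with a weighted geometric
    mean: exp(sum_n w_n * log p_n). Assumes uniform weights by default;
    precisions must be positive, since log(0) is undefined."""
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

# Hypothetical per-n precisions for n = 1..4.
print(bleu_combine([0.8, 0.6, 0.4, 0.2]))
```

With uniform weights this reduces to the ordinary geometric mean of the four precisions.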


 * I've reintroduced the quotation, but with a slightly different lead in. Do you think that's OK now? --Malleus Fatuorum 14:48, 25 February 2009 (UTC)


 * Looks good. - Francis Tyers · 21:25, 26 February 2009 (UTC)


 * I wasn't happy with "report a high correlation" because metrics don't "report", they "measure". I've changed "demonstrate a high correlation" to "achieve a high correlation", which we can hopefully agree on. --Malleus Fatuorum 14:55, 25 February 2009 (UTC)


 * The metric doesn't "report", the people writing the paper "report". But it is ok as it stands. - Francis Tyers · 21:25, 26 February 2009 (UTC)

Thanks for the work that's been done to address my concerns; I'm now satisfied that the article meets the GA criteria, so I'm closing the review. --Malleus Fatuorum 00:24, 11 March 2009 (UTC)