Statistical machine translation

Statistical machine translation (SMT) was a machine translation approach that superseded the earlier rule-based approach, which required an explicit description of every linguistic rule; this was costly and often did not generalize to other languages. Since 2003, the statistical approach itself has been gradually superseded by deep learning-based neural machine translation.

The first ideas of statistical machine translation were introduced by Warren Weaver in 1949, including the idea of applying Claude Shannon's information theory. Statistical machine translation was re-introduced in the late 1980s and early 1990s by researchers at IBM's Thomas J. Watson Research Center.

Basis
The idea behind statistical machine translation comes from information theory. A document is translated according to the probability distribution $$p(e|f)$$ that a string $$e$$ in the target language (for example, English) is the translation of a string $$f$$ in the source language (for example, French).

The problem of modeling the probability distribution $$p(e|f)$$ has been approached in a number of ways. One approach that lends itself well to computer implementation is to apply Bayes' theorem, that is $$p(e|f) \propto p(f|e) p(e)$$, where the translation model $$p(f|e)$$ is the probability that the source string is the translation of the target string, and the language model $$p(e)$$ is the probability of seeing that target language string. This decomposition is attractive as it splits the problem into two subproblems. Finding the best translation $$\tilde{e}$$ is done by picking the one that gives the highest probability:
 * $$ \tilde{e} = \arg\max_{e \in e^*} p(e|f) = \arg\max_{e \in e^*} p(f|e) p(e) $$.

For a rigorous implementation of this, one would have to perform an exhaustive search over all strings $$e^*$$ in the native language. Performing the search efficiently is the work of a machine translation decoder, which uses the foreign string, heuristics and other methods to limit the search space while keeping acceptable quality. This trade-off between quality and time usage can also be found in speech recognition.
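The decision rule above can be illustrated with a minimal sketch that scores a small, fixed list of candidate translations instead of searching all strings. The probability tables, the candidate list and the function names are hypothetical values made up for this example; a real decoder would generate and prune candidates itself.

```python
import math

# Hypothetical toy probabilities, invented for illustration only.
# translation_model[(f, e)] stands in for p(f|e); language_model[e] for p(e).
translation_model = {
    ("la maison", "the house"): 0.6,
    ("la maison", "house the"): 0.6,
    ("la maison", "the home"): 0.3,
}
language_model = {
    "the house": 0.02,
    "house the": 0.0001,
    "the home": 0.01,
}

def score(f, e):
    """Return log p(f|e) + log p(e), or -inf if either factor is zero or unknown."""
    tm = translation_model.get((f, e), 0.0)
    lm = language_model.get(e, 0.0)
    if tm == 0.0 or lm == 0.0:
        return float("-inf")
    return math.log(tm) + math.log(lm)

def best_translation(f, candidates):
    """Pick the candidate e that maximises p(f|e) * p(e)."""
    return max(candidates, key=lambda e: score(f, e))

candidates = ["the house", "house the", "the home"]
best = best_translation("la maison", candidates)
print(best)
```

In this toy setting the translation model cannot distinguish "the house" from "house the"; the language model breaks the tie in favour of the fluent word order.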

As the translation systems were not able to store all native strings and their translations, a document was typically translated sentence by sentence, but even this was not enough. Language models were typically approximated by smoothed n-gram models, and similar approaches have been applied to translation models, but there was additional complexity due to different sentence lengths and word orders in the languages.
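As an illustration of a smoothed n-gram language model, the following sketch estimates bigram probabilities with add-one smoothing. The choice of bigrams, the add-one scheme, the boundary markers and the tiny corpus are assumptions made for this example; practical systems used higher orders and more refined smoothing.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of p(word | prev)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_prob(sentence, unigrams, bigrams):
    """Probability of a sentence as a product of smoothed bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word, unigrams, bigrams)
    return prob

# Hypothetical three-sentence training corpus.
corpus = ["the house is small", "the house is big", "the home is small"]
unigrams, bigrams = train_bigram_lm(corpus)
fluent = sentence_prob("the house is small", unigrams, bigrams)
scrambled = sentence_prob("small is house the", unigrams, bigrams)
print(fluent > scrambled)
```

Smoothing gives the scrambled sentence a small non-zero probability rather than zero, so unseen word sequences can still be compared.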

The statistical translation models were initially word-based (Models 1-5 from IBM, the Hidden Markov model from Stephan Vogel, and Model 6 from Franz Josef Och), but significant advances were made with the introduction of phrase-based models. Later work incorporated syntax or quasi-syntactic structures.

Benefits
The most frequently cited benefits of statistical machine translation (SMT) over the rule-based approach were:


 * More efficient use of human and data resources
 * There were many parallel corpora in machine-readable format and even more monolingual data.
 * Generally, SMT systems were not tailored to any specific pair of languages.
 * More fluent translations owing to use of a language model

Shortcomings

 * Corpus creation can be costly.
 * Specific errors are hard to predict and fix.
 * Results may have superficial fluency that masks translation problems.
 * Statistical machine translation usually works less well for language pairs with significantly different word order.
 * The benefits obtained for translation between Western European languages are not representative of results for other language pairs, owing to smaller training corpora and greater grammatical differences.

Phrase-based translation
In phrase-based translation, the aim was to reduce the restrictions of word-based translation by translating whole sequences of words, where the lengths may differ. The sequences of words were called blocks or phrases. Typically, however, they were not linguistic phrases but phrasemes found using statistical methods from corpora. It has been shown that restricting the phrases to linguistic phrases (syntactically motivated groups of words, see syntactic categories) decreased the quality of translation.

The chosen phrases were then mapped one-to-one using a phrase translation table, and could be reordered. This table could be learnt from word alignments or directly from a parallel corpus. In the latter case, the model was trained using the expectation-maximization algorithm, similarly to the word-based IBM models.
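One way to derive phrase pairs from a word alignment is to keep every pair of spans whose words are aligned only to each other. The sketch below applies that consistency check; the function name, the `max_len` limit and the example alignment are assumptions for illustration, and the usual extension over unaligned boundary words is omitted.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=3):
    """Return phrase pairs consistent with a word alignment.

    alignment is a set of (source_index, target_index) links.
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Reject the span if a target word inside [j1, j2] links outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

# Hypothetical alignment; "does" is left unaligned.
src = "John does not live here".split()
tgt = "John wohnt hier nicht".split()
alignment = {(0, 0), (2, 3), (3, 1), (4, 2)}
pairs = extract_phrase_pairs(src, tgt, alignment)
for pair in sorted(pairs):
    print(pair)
```

Here "live here" pairs with "wohnt hier", while "not live" yields no pair because its target span would also cover "hier", which is aligned to a word outside the source span.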

Syntax-based translation
Syntax-based translation was based on the idea of translating syntactic units, that is, (partial) parse trees of sentences or utterances, rather than single words or strings of words (as in phrase-based MT). The statistical counterpart of the old idea of syntax-based translation did not take off until the advent of strong stochastic parsers in the 1990s. Examples of this approach included DOP-based MT and later synchronous context-free grammars.

Hierarchical phrase-based translation
Hierarchical phrase-based translation combined the phrase-based and syntax-based approaches to translation. It used synchronous context-free grammar rules, but the grammars could be constructed by an extension of methods for phrase-based translation without reference to linguistically motivated syntactic constituents. This idea was first introduced in Chiang's Hiero system (2005).
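The effect of a synchronous rule with gaps can be sketched as follows. The rule set, the romanized source strings and the regex-based matching are all hypothetical simplifications chosen for this example; an actual hierarchical system would use a chart parser with weighted rules rather than first-match recursion.

```python
import re

# Hypothetical toy grammar: (source side, target side) pairs.
# X1 and X2 are gaps shared by both sides of a rule.
RULES = [
    ("yu X1 you X2", "have X2 with X1"),
    ("Beihan", "North Korea"),
    ("bangjiao", "diplomatic relations"),
]

def pattern_to_regex(source_side):
    """Turn a source side into a regex with one named group per gap."""
    parts = []
    for token in source_side.split():
        if re.fullmatch(r"X\d", token):
            parts.append(f"(?P<{token}>.+)")
        else:
            parts.append(re.escape(token))
    return re.compile(" ".join(parts))

def translate(source):
    """Apply the first rule whose source side matches, translating gaps recursively."""
    for source_side, target_side in RULES:
        match = pattern_to_regex(source_side).fullmatch(source)
        if match is None:
            continue
        gaps = {name: translate(text) for name, text in match.groupdict().items()}
        if any(value is None for value in gaps.values()):
            continue
        return " ".join(gaps.get(token, token) for token in target_side.split())
    return None  # no rule covers this string

result = translate("yu Beihan you bangjiao")
print(result)
```

The first rule reorders its two gaps on the target side, which is the kind of long-distance reordering that plain phrase pairs cannot express.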

Challenges with statistical machine translation
Problems that statistical machine translation did not solve included:

Sentence alignment
In parallel corpora, single sentences in one language can be found translated into several sentences in the other and vice versa. Long sentences may be broken up, and short sentences may be merged. There are even some languages that use writing systems without a clear indication of a sentence end (for example, Thai). Sentence alignment can be performed with the Gale-Church alignment algorithm. With this and other mathematical models, efficient search and retrieval of the highest-scoring sentence alignment is possible.
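The search for the best sentence alignment can be sketched as a dynamic program over sentence lengths. The cost used here (absolute difference in character length plus a fixed penalty for merges) is a crude stand-in assumed for this example, not the probabilistic cost of the Gale-Church algorithm, and only 1-1, 1-2 and 2-1 groupings are considered.

```python
def align_sentences(src, tgt):
    """Align two sentence lists using 1-1, 1-2 and 2-1 groupings.

    Returns a list of (source_indices, target_indices), or None if no
    alignment with these groupings exists.
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                # Assumed cost: length mismatch plus a penalty of 1 for merges.
                step = abs(src_len - tgt_len) + (0 if (di, dj) == (1, 1) else 1)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        return None
    groups, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return groups[::-1]

# Hypothetical example: one long English sentence split into two French ones.
english = ["I came home.", "It was late and I was tired."]
french = ["Je suis rentré.", "Il était tard.", "J'étais fatigué."]
groups = align_sentences(english, french)
print(groups)
```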

Word alignment
Sentence alignment is usually either provided by the corpus or obtained by the aforementioned Gale-Church alignment algorithm. To learn the translation model, for example, one also needs to know which words align in a source-target sentence pair. The IBM models and the HMM approach were attempts at solving this challenge.

Function words that have no clear equivalent in the target language were another challenge for the statistical models. For example, when translating the English sentence "John does not live here" into German, the word "does" has no clear alignment in the translated sentence "John wohnt hier nicht." Through logical reasoning, it may be aligned with the word "wohnt" (as in English it contains grammatical information for the word "live"), with "nicht" (as it only appears in the sentence because it is negated), or it may be left unaligned.
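How word translation probabilities can be learnt from sentence pairs without given word alignments can be sketched with expectation maximization in the style of IBM Model 1. This is a simplified sketch under stated assumptions: the NULL word is omitted, the three-sentence corpus is invented, and the function name and iteration count are illustrative.

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=10):
    """Estimate t[(f, e)], an approximation of p(f|e), by EM.

    corpus is a list of (source_tokens, target_tokens) pairs.
    """
    src_vocab = {f for fs, _ in corpus for f in fs}
    # Start from a uniform distribution over source words.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for fs, es in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalise the counts for each target word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

# Hypothetical toy corpus of (source, target) token lists.
corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
t = train_ibm_model1(corpus)
print(round(t[("das", "the")], 3), round(t[("Haus", "the")], 3))
```

Because "das" co-occurs with "the" in two sentence pairs while "Haus" and "Buch" each do so only once, the probability mass for "the" shifts towards "das" over the iterations.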

Statistical anomalies
Skewed frequencies in the training set could override the correct translation. For example, "I took the train to Berlin" was mis-translated as "I took the train to Paris" due to the statistical abundance of "train to Paris" in the training set.

Idioms
Depending on the corpora used, idioms might not translate "idiomatically". For example, using Canadian Hansard as the bilingual corpus, "hear" was almost invariably translated to "Bravo!", since in Parliament "Hear, Hear!" becomes "Bravo!".

This problem is connected with word alignment: in very specific contexts, an idiomatic expression may align with words that form an idiomatic expression of the same meaning in the target language. This is unlikely to help in general, however, as such an alignment usually does not work in any other context. For that reason, idioms could only be subjected to phrasal alignment, as they could not be decomposed further without losing their meaning. This problem was specific to word-based translation.

Different word orders
Word order differs between languages. Some classification can be done by naming the typical order of subject (S), verb (V) and object (O) in a sentence, and one can talk, for instance, of SVO or VSO languages. There are also additional differences in word order, for instance, where modifiers for nouns are located, or where the same words are used as a question or a statement.

In speech recognition, the speech signal and the corresponding textual representation can be mapped to each other in blocks in order. This is not always the case with the same text in two languages. For SMT, the machine translator can only manage small sequences of words, and word order has to be taken into account by the program designer. Attempts at solutions have included re-ordering models, where a distribution of location changes for each item of translation is estimated from aligned bi-text. Different location changes can be ranked with the help of the language model and the best can be selected.
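Estimating a distribution of location changes from aligned text can be sketched by counting how far the source position jumps between consecutive target words. The input format (source positions listed in target order), the jump definition and the example alignments are assumptions made for this illustration.

```python
from collections import Counter

def jump_distribution(alignments):
    """Estimate p(jump) by relative frequency.

    Each alignment lists the source position of every target word, in
    target order. A jump of 0 means the source is read monotonically.
    """
    jumps = Counter()
    for source_positions in alignments:
        previous = -1
        for position in source_positions:
            jumps[position - previous - 1] += 1
            previous = position
    total = sum(jumps.values())
    return {jump: n / total for jump, n in jumps.items()}

# Hypothetical alignments: two monotone sentences and one with a swap.
alignments = [[0, 1, 2], [0, 2, 1], [0, 1, 2, 3]]
distribution = jump_distribution(alignments)
print(distribution)
```

A decoder could combine such jump probabilities with the language model score when ranking alternative orderings.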

Out of vocabulary (OOV) words
SMT systems typically store different word forms as separate symbols without any relation to each other, and word forms or phrases that were not in the training data cannot be translated. This might be because of a lack of training data, changes in the human domain where the system is used, or differences in morphology.
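The effect of storing word forms as unrelated symbols can be seen in a minimal lookup sketch. The table entries and the fallback of copying unknown words through unchanged are assumptions for this example.

```python
# Hypothetical word table; each surface form is a separate, unrelated key.
word_table = {"house": "Haus", "houses": "Häuser", "small": "klein"}

def translate_words(sentence, table):
    """Translate word by word, returning the output and the list of unknown words."""
    translated, unknown = [], []
    for word in sentence.split():
        if word in table:
            translated.append(table[word])
        else:
            translated.append(word)  # assumed fallback: copy the word unchanged
            unknown.append(word)
    return " ".join(translated), unknown

output, unknown = translate_words("small housing", word_table)
print(output, unknown)
```

Although "house" and "houses" are in the table, the related form "housing" is not, and nothing in the table links it to the known entries.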