How Does Machine Translation Work?
The faculty of language is one of the most important features distinguishing human beings from other creatures, and the human languages which have evolved over thousands of years are among humanity’s greatest cultural achievements. They are organic structures which are so highly flexible that they are often thought to be chaotic.
How can it be that a computer program can cope with these organic structures, the languages, understand them and even translate from one into another? Without entering into philosophical considerations, it is fair to say that a computer program understands as little of languages as it does of the orbits of satellites which it can accurately compute, or of chess, even if it is able to beat Kasparov.
Translation programs apply rules and knowledge with which their developers try to model language. Sometimes such rules are found by statistically analyzing huge amounts of text data, but in every case the aim is to imitate the behavior of a translator. Since languages are so highly complex, nobody has yet been able to model the functioning of languages completely and accurately. This becomes evident when translation programs make mistakes or even break down.
The main difficulty which translation programs have to struggle with is the ambiguity of many linguistic utterances, both of single words and of whole sentences. A large portion of the rules in translation systems describe which meaning is required under which conditions. This can be illustrated by examples such as the following:
Der Kurs findet statt.
(The course takes place.)
Der Kurs fällt.
(The rate is falling.)
Briefträger beißen Hunde selten.
(Dogs seldom bite postmen.)
(Postmen seldom bite dogs.)
The first example shows different readings of the German word Kurs which are disambiguated by the context; the second example shows an ambiguous sentence structure (typical for German) – it is not clear whether Briefträger is the subject or the object of the sentence.
In spite of all difficulties, machine translation, which has been worked on since the beginning of computers in the forties of the last century, has made enough progress to become a big help when dealing with foreign language texts. How translation programs work is sketched here.
Translation in Seven Steps
We describe here the translation of written texts or documents, not the interpretation of spoken utterances. The transfer of spoken to written language, and the synthesis of spoken language from written texts are topics in their own right which can be treated separately.
Segmenting Documents into Words, Sentences and Formatting Information
The basic elements of translation programs are words and rules for combining them to form sentences, paragraphs and complete texts. Every document to be translated first needs to be decomposed into words, numbers and punctuation marks. Since the layout of the translation in most cases should look just like the original, this information must also be recognized so it can be inserted into the translation at the proper places.
Since the rules of combining – the grammatical rules – address sentences, sentence boundaries also need to be determined. Unfortunately, this is less easy than it may appear at first sight. A period may mark the end of a sentence, an abbreviation or a German ordinal number, or it may be a decimal point or part of an e-mail or internet address.
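The ambiguity of the period can be made concrete with a small sketch. The following Python fragment is only an illustration under simplifying assumptions: the abbreviation list is a tiny hypothetical sample, tokens are separated by whitespace, and a number followed by a period is treated as a German ordinal unless it is the last token. Decimal points and addresses are left alone simply because their periods sit inside a token. Real systems need far more careful rules.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"Dr.", "Nr.", "z.B.", "usw.", "e.g.", "etc."}

# A number followed by a period, e.g. "3." in "am 3. Mai".
ORDINAL = re.compile(r"^\d+\.$")


def split_sentences(text):
    """Split text into sentences at periods that plausibly end a sentence."""
    sentences, current = [], []
    tokens = text.split()
    for i, token in enumerate(tokens):
        current.append(token)
        is_last = i == len(tokens) - 1
        # Periods inside a token (3.50, info@example.com) never reach this test.
        if not token.endswith((".", "!", "?")):
            continue
        # Known abbreviations do not end a sentence unless the text ends here.
        if token in ABBREVIATIONS and not is_last:
            continue
        # Heuristic: "3." in the middle of the text is read as an ordinal.
        if ORDINAL.match(token) and not is_last:
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = ("Dr. Meier kommt am 3. Mai. Der Kurs kostet 3.50 Euro. "
        "Schreiben Sie an info@example.com.")
for sentence in split_sentences(text):
    print(sentence)
```

The heuristics fail in obvious cases, for instance when an abbreviation such as usw. really does end a sentence, which is exactly why this step is harder than it looks.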
Reduction of Word Forms to their Canonical Form and Dictionary Lookup
Every translation program needs a dictionary. It stores all the information which is necessary for the analysis of sentences and their translation, e.g. part of speech, gender, or semantic classification.
In principle, each possible form of a word could be put into the dictionary, e.g. German schlafen, schlafe, schläfst, schläft, schlaft, schlief, etc. Often this is not done, but a so-called morphological decomposition is preferred where the different word forms are reduced to a canonical form – the keyword in conventional dictionaries. This form is then used to do the dictionary lookup, and the word form at hand is assigned its corresponding grammatical information. E.g. schläfst – 2nd person singular present.
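A minimal sketch of such a morphological decomposition is given below. It is not how any particular system works: the stem table, the suffix list and the lexicon entry are hypothetical and cover only the verb schlafen, and the person labels are deliberately coarse.

```python
# Hypothetical dictionary entry, keyed by the canonical form.
LEXICON = {"schlafen": {"pos": "verb", "translation": "sleep"}}

# Hypothetical stem table: each stem variant points to its canonical form and tense.
STEMS = {
    "schlaf": ("schlafen", "present"),
    "schläf": ("schlafen", "present"),
    "schlief": ("schlafen", "past"),
}

# Suffixes are tried in this order; the empty suffix comes last.
SUFFIXES = [
    ("st", "2nd person singular"),
    ("en", "infinitive or 1st/3rd person plural"),
    ("e", "1st person singular"),
    ("t", "3rd person singular or 2nd person plural"),
    ("", "1st/3rd person singular"),
]


def analyze(word_form):
    """Reduce a word form to its canonical form and attach grammatical information."""
    for suffix, person in SUFFIXES:
        if not word_form.endswith(suffix):
            continue
        stem = word_form[: len(word_form) - len(suffix)]
        if stem in STEMS:
            lemma, tense = STEMS[stem]
            return {"lemma": lemma, "tense": tense, "person": person,
                    "entry": LEXICON[lemma]}
    return None  # unknown word form


print(analyze("schläfst"))
```

With this table, schläfst is reduced to the canonical form schlafen and labeled as present tense, 2nd person singular, and the dictionary entry is then found under schlafen.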
Recognizing Sentential Structures
In the beginning many researchers believed that they could obtain reasonable translations by having a program translate word by word. It became clear very quickly that this was an illusion, because firstly, languages differ very much in word order, and secondly, many words can have more than one meaning, of which only one is valid in a given sentence. The results were completely unintelligible sequences of alternative word translations which nobody could use.
So, a translation program must “know” grammar. Each word and each phrase must be assigned its role in the sentence, and it must be determined as precisely as possible which combinations are probable, possible, or excluded. The precision of these rules is decisive for translation quality.
The meaning of words depends not only on the context within a sentence but also on relationships between sentences. The use of pronouns such as German er, sie, es can make the interpretation of a sentence more difficult. E.g., how should the word einstellen be translated in the sentence
Das Unternehmen stellt sie ein.
Is it hire, adjust, stop or something else again? This depends on whether sie refers to a person, a machine or the production of something. If that is not known, neither a human being nor a program is able to produce a reasonable translation for this sentence.
Assigning Translations to Single Words
Each word and many word groups are associated with one or more translations in the dictionary. Once grammatical analysis has established the contexts of the words, the appropriate translations can be selected.
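How context can drive this selection may be sketched with the word Kurs from the introductory examples. The dictionary layout below, in which each translation lists the verbs that license it, is an assumption made for illustration; real dictionaries use much richer conditions.

```python
# Hypothetical dictionary: each translation lists the governing verbs that select it.
DICTIONARY = {
    "Kurs": [
        {"translation": "course", "verbs": {"stattfinden"}},
        {"translation": "rate", "verbs": {"fallen", "steigen"}},
    ],
}


def select_translation(noun, governing_verb):
    """Pick the translation whose context condition matches the analyzed sentence."""
    candidates = DICTIONARY.get(noun, [])
    for candidate in candidates:
        if governing_verb in candidate["verbs"]:
            return candidate["translation"]
    # No condition matched: fall back to the first translation listed, if any.
    return candidates[0]["translation"] if candidates else None


print(select_translation("Kurs", "stattfinden"))  # course
print(select_translation("Kurs", "fallen"))       # rate
```

The sketch assumes that the grammatical analysis has already reduced the verb to its canonical form and identified Kurs as its subject.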
Generating the Structure of Target Sentences
Starting from the structure of the source sentence and the word translations selected, the structure of the target sentence is built up. It can be quite different from the original. Thus
John grows a beard.
becomes
John lässt sich einen Bart wachsen.
because the German verb wachsen is intransitive, and therefore an additional verb, lassen, is required as a kind of intermediary.
Generating Word Forms
During the generation of the correct word order for the target sentence, translation programs usually work with canonical forms or word stems. Only after the structure has been established do forms such as lass, ein and wachs of the previous example become lässt, einen and wachsen.
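This last step can be pictured as a lookup from stems plus grammatical features to finished word forms. The table and the feature labels in the following sketch are hypothetical and cover only the beard example; a real generator would apply general inflection rules.

```python
# Hypothetical inflection table: (stem, grammatical features) -> word form.
FORMS = {
    ("lass", "3sg-present"): "lässt",
    ("ein", "masc-acc-sg"): "einen",
    ("wachs", "infinitive"): "wachsen",
}


def inflect(stem, features):
    """Return the inflected form; items without features pass through unchanged."""
    if features is None:
        return stem
    return FORMS[(stem, features)]


def generate(structure):
    """Turn an ordered list of (stem, features) pairs into a sentence."""
    return " ".join(inflect(stem, features) for stem, features in structure) + "."


# Target structure after word order has been established (stems only).
structure = [
    ("John", None),
    ("lass", "3sg-present"),
    ("sich", None),
    ("ein", "masc-acc-sg"),
    ("Bart", None),
    ("wachs", "infinitive"),
]
print(generate(structure))  # John lässt sich einen Bart wachsen.
```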
Adding Layout Information
The layout information which was taken out in the first step must now be added to the translation, so that in the end there is a new text which looks almost like the original. One note may be in order here: some formatting information such as bold face must be available even during the translation process, since the corresponding translations should appear in bold as well.
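Why bold face has to travel with the words can be shown with a small sketch. It assumes that each source token carries a bold flag, that the translation step supplies a word alignment from target positions to source positions, and that bold is written with b tags; all three are assumptions for illustration, not a description of any particular system.

```python
def transfer_bold(source_tokens, target_words, alignment):
    """Render target words, wrapping those aligned to a bold source token in <b> tags."""
    rendered = []
    for index, word in enumerate(target_words):
        source_index = alignment.get(index)
        is_bold = source_index is not None and source_tokens[source_index][1]
        rendered.append(f"<b>{word}</b>" if is_bold else word)
    return " ".join(rendered)


# Source tokens with their bold flag; only "Kurs" is bold in the original.
source = [("Der", False), ("Kurs", True), ("fällt", False)]
target = ["The", "rate", "is", "falling"]
# Hypothetical alignment: target position -> source position.
alignment = {0: 0, 1: 1, 2: 2, 3: 2}

print(transfer_bold(source, target, alignment))  # The <b>rate</b> is falling
```

If the bold flag had been stripped before translation and only its position in the source text remembered, there would be no way to know that it belongs on rate.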