For roughly a decade, statistical machine translation (SMT) has predominated in academic research. However, most commercial machine translation (MT) suppliers continue to offer systems based on more traditional rule-based architectures (RBMT). The difficulty of replacing the translation engine in an existing product set-up may explain this discrepancy in part. The main reasons, however, are that RBMT offers a range of functions which SMT does not provide, including human-readable, fully worked-out “conventional” dictionaries, and that for a number of language pairs RBMT quality is still higher.
SMT needs huge bilingual text corpora to compute satisfactory translation models, and it is inherently weak when dealing with rare data and non-local phenomena. Its advantages are low cost and robustness. The main disadvantages of RBMT are high cost and shortcomings with respect to resolving structural and lexical ambiguities.
We propose a hybrid architecture for high-quality machine translation which combines the strengths of both approaches and minimizes their weaknesses. At the core is a rule-based MT system which provides morphology, declarative grammars, semantic categories, and small dictionaries, but which avoids all expensive kinds of manual knowledge acquisition. Instead of manually working out large dictionaries and compiling information on disambiguation preferences, we suggest a novel corpus-based bootstrapping method for automatically expanding dictionaries, and for training the analytical performance and the choice of transfer alternatives.
As bilingual corpora with good literal translations are a scarce resource, we focus in particular on exploiting comparable monolingual corpora. We locate unknown words and expressions, and then use a statistically tuned analysis component in combination with similarity assumptions to identify relations across languages. This approach should make it possible to overcome the data acquisition bottleneck of conventional SMT.
Project Overview
We design and implement a hybrid architecture for high-quality machine translation (HyghTra) which combines the strengths of the statistical and the rule-based approaches and minimizes their weaknesses.
HyghTra will consist of a rule-based MT core system which provides morphology, declarative grammars, semantic categories, and small (cheap) bilingual dictionaries, and which omits all kinds of (expensive) disambiguating preference knowledge. Instead of compiling such knowledge and working out large dictionaries manually, we make use of a bootstrapping method for automatically extending dictionaries and for training the analytical performance and the choice of transfer alternatives, using monolingual and bilingual corpora.
Since bilingual data with good literal translations are scarce, we focus in particular on searching monolingual corpora for new words, and we use the statistically tuned analysis components of the system together with similarity assumptions to relate these words to each other across languages. This should overcome the data acquisition bottleneck of conventional SMT to a significant degree.
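The idea of relating unknown words across languages through similarity assumptions can be illustrated with a minimal sketch. The project description does not specify the similarity measure or the data structures, so everything below is an assumption made for illustration: a small seed dictionary (standing in for the small bilingual dictionaries of the rule-based core), co-occurrence counts within a fixed window, and cosine similarity. The function names `context_vector`, `cosine`, and `rank_candidates` and the toy German/English sentences are hypothetical and are not part of HyghTra.

```python
from collections import Counter
from math import sqrt


def context_vector(word, sentences, seed_map, window=2):
    """Count seed-dictionary entries that occur within `window` tokens of `word`.

    `seed_map` maps a context token to the label used as the vector dimension
    (its translation on the source side, itself on the target side).
    """
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in seed_map:
                    counts[seed_map[tokens[j]]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(unknown, src_sents, tgt_sents, seed, candidates, window=2):
    """Rank target-language candidates for an unknown source word.

    Assumption: words that are translations of each other occur in similar
    contexts, once the source contexts are mapped through the seed dictionary.
    """
    src_vec = context_vector(unknown, src_sents, seed, window)
    identity = {t: t for t in seed.values()}  # target contexts keep their own labels
    scored = [
        (cosine(src_vec, context_vector(c, tgt_sents, identity, window)), c)
        for c in candidates
    ]
    return sorted(scored, reverse=True)


# Toy comparable corpora (not translations of each other sentence by sentence).
src_sents = [["der", "hund", "bellt", "laut"], ["der", "hund", "frisst", "fleisch"]]
tgt_sents = [
    ["the", "dog", "barks", "loudly"],
    ["the", "dog", "eats", "meat"],
    ["the", "car", "drives", "fast"],
]
seed = {"bellt": "barks", "laut": "loudly", "frisst": "eats", "fleisch": "meat"}

ranking = rank_candidates("hund", src_sents, tgt_sents, seed, ["dog", "car"])
best_score, best_word = ranking[0]
print(best_word)  # dog
```

In a bootstrapping setting, a sufficiently confident pair found this way could be added to the seed dictionary so that later iterations see richer context vectors; the acceptance threshold and the interaction with the rule-based analysis are left open here.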
More information on the official webpage.
More information on HyghTra workshops.
Project Participants
Project Details
Research area | FP7-PEOPLE-2009-IAPP Marie Curie IAPP transfer of knowledge programme |
Project Acronym | HYGHTRA |
Project Reference | 251534 |
Start Date | 2010-12-01 |
Duration | 48 months |
Contract Type | Industry-Academia Partnerships and Pathways (IAPP) |
End Date | 2014-11-30 |
Project Status | Execution |