Machine Translation Using Dependency Structure

These days are called "The Information Era", and people have begun to look for information not just in their native languages but also in others. It is nearly impossible for a person to learn every language, or even all of the widely used ones such as English and Chinese; furthermore, translating all the documents in those languages by hand would take an indefinite amount of time. This is where Machine Translation becomes practical. If we can find the right mappings between languages, we can write computer programs, machine translators, to do the job.

The question is how to find the right mappings. There have been several distinct approaches to this problem. One uses phrase structure to find similarities between languages; in most cases, however, it is very difficult to find common ground between languages at the phrase-structure level. The current state of the art, built by Google, uses n-grams: the approach is simple yet, with the vast amount of data that Google has, it performs relatively better than the others. I say 'relatively' because, for many languages, its results are still far worse than human translations.

This is why a different approach is needed. Here, I suggest dependency structure. Since dependency structure is independent of any individual grammar yet offers much more common ground between languages than phrase structure does, it is natural to apply it to Machine Translation. Furthermore, for languages such as Korean and Japanese, parsing into dependency structure is much easier than parsing into phrase structure. Moreover, converting raw sentences to dependency structure is much faster than converting them to phrase structure, by a factor of almost O(n^2) in the sentence length n.
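To make the "common ground" claim concrete, here is a minimal sketch of a dependency structure stored as head indices. The Token type, the relations helper, the label names, and the two example sentences are all hypothetical and chosen only for illustration; they are not taken from the thesis or from any treebank. The point is that the same head-dependent relations can describe a subject-verb-object ordering and a subject-object-verb ordering, even though the word order differs.

```python
from typing import List, NamedTuple, Set, Tuple


class Token(NamedTuple):
    form: str   # the word as it appears in the sentence
    head: int   # 1-based position of the head word; 0 marks the root
    label: str  # dependency label (illustrative names, not a real tag set)


def relations(sentence: List[Token]) -> Set[Tuple[str, str, str]]:
    """Return the order-independent set of (head, label, dependent) triples."""
    triples = set()
    for tok in sentence:
        head_form = "ROOT" if tok.head == 0 else sentence[tok.head - 1].form
        triples.add((head_form, tok.label, tok.form))
    return triples


# Subject-verb-object ordering (as in English).
svo = [Token("John", 2, "subj"), Token("ate", 0, "root"), Token("apples", 2, "obj")]

# The same words in a hypothetical subject-object-verb ordering
# (the ordering used by languages such as Korean and Japanese).
sov = [Token("John", 3, "subj"), Token("apples", 3, "obj"), Token("ate", 0, "root")]

# The linear order differs, but the dependency relations are identical.
print(relations(svo) == relations(sov))
```

A phrase-structure tree would bracket the two orderings differently, whereas the dependency triples above do not change; this is one way to read the claim that dependency structure gives more common ground between languages.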

For the thesis, I used both Treebank and Propbank for training: Treebank provides more information within one language, whereas Propbank provides more language-independent relations. Three languages are tested: Arabic, Chinese, and Korean. Since there is no Korean Propbank, I wrote a dependency parser to convert the Korean Treebank into dependency trees. The evaluation is based on both BLEU and NIST scores.
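Both BLEU and NIST are built on n-gram overlap between a candidate translation and reference translations. The sketch below shows only the clipped ("modified") n-gram precision that BLEU starts from, against a single reference. It is an illustration under those assumptions, not the evaluation code used for the thesis: full BLEU also combines several n-gram orders and applies a brevity penalty, and NIST weights n-grams by how informative they are. The function names and example sentences are made up for this sketch.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Clipped n-gram precision of a candidate against one reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference, so repeating a correct word does not help.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


candidate = "the the the cat".split()
reference = "the cat sat".split()

print(modified_precision(candidate, reference, 1))  # unigram precision
print(modified_precision(candidate, reference, 2))  # bigram precision
```

In this example the candidate repeats "the" three times, but the reference contains it only once, so only one of those occurrences counts toward the unigram precision.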

Last modified 11 December 2007 at 5:13 am by choijd