Colloquium Summary 1

Chinese Treebank and Its Uses/Abuses
Nianwen (Bert) Xue
Sr. Research Associate
Institute of Cognitive Science, Palmer Lab
September 28, 2007

For years Dr. Xue has been investigating what it means to be a word in Chinese. This task is not as simple as one might think, because written Chinese does not separate words with spaces the way English does. More recently his work has focused on developing Chinese versions of the Penn Treebank and PropBank. A treebank is a corpus annotated with syntactic structure, and treebanks are often used to study syntax in linguistics or to train parsers in computational linguistics.[1] Similarly, PropBank annotates a corpus with verbal propositions and their arguments.[2]

For much of his presentation Dr. Xue discussed what makes Chinese difficult, not only for his work with the Treebank and PropBank but for natural language processing in general. Several linguistic phenomena, or the lack of them, make Chinese hard to process. As mentioned above, written Chinese has no natural word boundaries. Its morphology is also not as rich as that of English, so there are no affixes to indicate tense or plurality. The same string of characters can often be segmented in more than one way, and the correct segmentation depends purely on context; it cannot be disambiguated using syntax or grammar. Adding to the ambiguity, some Chinese words can be interpreted as verbs, nouns, or adjectives depending on the sentence.
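The segmentation ambiguity described above can be illustrated with a small sketch. The example string and the five-word lexicon are my own illustration, not material from the talk, and the function name `segmentations` is hypothetical. The code simply enumerates every way to split a string into words found in the lexicon; it makes no attempt to choose among them, which is the hard part.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words drawn from `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment whatever follows it.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Toy lexicon (an assumption for illustration only).
LEXICON = {"研究", "研究生", "生命", "命", "起源"}

for seg in segmentations("研究生命起源", LEXICON):
    print(" / ".join(seg))
```

With this lexicon the string has two segmentations: one beginning with 研究 ("research") followed by 生命 ("life"), and one beginning with 研究生 ("graduate student") followed by 命. Both consist entirely of real words, so a dictionary alone cannot decide between them.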

With a team of annotators, Dr. Xue has populated the Chinese Treebank with nearly 1 million annotated words that are segmented, part-of-speech (POS) tagged, and syntactically bracketed. Plans for the future include expanding to 1.1 million words and adding further layers, such as a timebank for annotating temporal relationships. While this may sound like purely a matter of brute force, care must be taken in designing a treebank so that it can accommodate ever-changing linguistic theories.

The size and breadth of this data have led some critics to question whether it is harder to parse the actual source data or the Treebank itself. Dr. Xue argues that while its structures are more complicated (more recursive) than those found in the English-language Penn Treebank, the Chinese Treebank actually has fewer rules and a lower branching factor, so its barrier to entry is about equivalent.
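To make "rules" and "branching factor" concrete, here is a minimal sketch of how such figures could be computed from a bracketed tree. It rests on several assumptions of mine: the toy tree is not taken from the Chinese Treebank (it only borrows its label style), "branching factor" is taken to mean the average number of children per phrase-level rule, and the function names `parse`, `rules`, and `mean_branching` are hypothetical. The talk may have used a different definition.

```python
from collections import Counter


def parse(bracketed):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def rules(tree, counts=None):
    """Count phrase-level rules (parent -> child labels), skipping POS -> word."""
    counts = Counter() if counts is None else counts
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    if subtrees:
        counts[(label, tuple(c[0] for c in subtrees))] += 1
        for child in subtrees:
            rules(child, counts)
    return counts


def mean_branching(counts):
    """Average number of children per rule instance (assumed definition)."""
    total = sum(counts.values())
    return sum(len(rhs) * n for (_, rhs), n in counts.items()) / total


# Toy tree for illustration only; not an actual treebank entry.
TREE = "(IP (NP (NN 学生)) (VP (VV 读) (NP (NN 书))))"

counts = rules(parse(TREE))
print(len(counts), "distinct rules; mean branching", mean_branching(counts))
```

Run over a whole treebank instead of one sentence, the same counts would give the size of the rule inventory and the average branching, which are the two quantities compared in the argument above.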

A key lesson I took away from this lecture is that one needs a proper understanding of the linguistic data, theories, and methods before using a tool like the Treebank. Simply processing the data without regard to its underlying meaning can be inefficient or wasteful.

[1] Treebank, Wikipedia - http://en.wikipedia.org/wiki/Treebank
[2] PropBank, Wikipedia - http://en.wikipedia.org/wiki/PropBank