Abstract:
|
Part-of-speech (POS) tagging is a fundamental task in Natural Language Processing. Because POS tags feed into many downstream applications, detecting errors in POS annotation is itself an important task. There is a substantial body of research on automatically detecting POS tagging errors for English, with good reported results. For Vietnamese, however, this work is still at an early stage, with little research and few reference documents available. VietTreeBank is a corpus built to serve as a base resource for other applications and studies of the Vietnamese language. Unfortunately, its POS annotation contains many errors. In this thesis, we therefore aim to help make VietTreeBank more accurate by detecting errors in classifier (noun classifier) tagging.
The variation n-gram algorithm was proposed by Markus Dickinson. It was originally applied to the Wall Street Journal (WSJ) corpus, part of the Penn Treebank 3 release. The basic idea of the algorithm is that the longer the identical context in which a word receives different tags, the more likely it is that the variation is an annotation error. In this thesis, we combine this algorithm with knowledge of classifier constructions to detect and repair erroneous tags.
Our results show that classifier tagging in the corpus contains many errors, mainly confusion between nouns (N) and classifiers (Nc). In addition, the word segmentation is not completely correct. |
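
The variation n-gram idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `variation_ngrams`, the toy sentences, and their tags are assumptions made up for illustration and are not taken from VietTreeBank. The sketch collects every word n-gram in a tagged corpus and reports those in which the same position receives more than one tag.

```python
from collections import defaultdict

def variation_ngrams(sentences, n):
    """Find word n-grams in which one position is tagged inconsistently.

    sentences: list of sentences, each a list of (word, tag) pairs.
    Returns a dict mapping (ngram_words, position) to the set of tags
    observed at that position, keeping only entries with more than one tag.
    """
    seen = defaultdict(set)
    for sent in sentences:
        words = [w for w, _ in sent]
        tags = [t for _, t in sent]
        for start in range(len(sent) - n + 1):
            gram = tuple(words[start:start + n])
            for i in range(n):
                # Record which tag this position got in this word context.
                seen[(gram, i)].add(tags[start + i])
    return {key: tagset for key, tagset in seen.items() if len(tagset) > 1}

# Hypothetical toy corpus (illustrative tags: M numeral, Nc classifier, N noun).
toy_corpus = [
    [("một", "M"), ("con", "Nc"), ("mèo", "N")],
    [("một", "M"), ("con", "N"), ("mèo", "N")],   # same words, different tag for "con"
    [("hai", "M"), ("con", "Nc"), ("chó", "N")],
]

for n in (1, 3):
    print(n, variation_ngrams(toy_corpus, n))
```

In this toy data, "con" varies between Nc and N both as a unigram and inside the identical trigram "một con mèo". Under the heuristic described above, the trigram match is the stronger signal that one of the two tags is an error, because the surrounding context is the same in both occurrences.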