Khmer Treebank Construction via Interactive Tree Visualization

Bonpagna Kann(1*), Thodsaporn Chay-intr(2), Hour Kaing(3), Thanaruk Theeramunkong(4)

(1) Institute of Technology of Cambodia
(2) Sirindhorn International Institute of Technology, Thammasat University
(3) Institute of Technology of Cambodia
(4) Sirindhorn International Institute of Technology, Thammasat University
(*) Corresponding Author


Although a number of studies have addressed the Khmer language in Natural Language Processing, and some resources exist for word segmentation and POS tagging, high-level syntactic resources such as treebanks and grammars are still lacking. This paper presents a semi-automatic framework for constructing a Khmer treebank and extracting Khmer grammar rules from a set of sentences taken from Khmer grammar books. The sentences are first manually annotated; once the treebank is obtained, it is processed to generate grammar rules with their probabilities. In our experiments, the annotated trees and the extracted grammar rules are analyzed both quantitatively and qualitatively. Finally, the results are evaluated under three schemes, Self-Consistency, 5-Fold Cross-Validation, and Leave-One-Out Cross-Validation, using three metrics: precision, recall, and F1-measure. According to these results, Self-Consistency performs best at more than 92%, followed by Leave-One-Out Cross-Validation and 5-Fold Cross-Validation with averages of 88% and 75%, respectively. On the other hand, the crossing-bracket data show that Leave-One-Out Cross-Validation has the highest average at 96%, while the other two reach 85% and 89%, respectively.


Treebank Construction; Grammar Construction; Visualization Tool; Syntactic Parsing
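
As a rough illustration of two steps named in the abstract, the sketch below estimates rule probabilities from annotated trees by relative frequency and scores a predicted tree against a gold tree with labeled-bracket precision, recall, and F1. It is a minimal sketch under stated assumptions, not the authors' implementation: the nested-tuple tree format, the category labels, the English placeholder words, and all function names are hypothetical, and the paper may estimate probabilities or count brackets differently (for example, by excluding POS-level brackets).

```python
from collections import Counter

# Toy trees: (label, child, ...) where a child is a subtree or a word string.
# Labels and words are placeholders, not the paper's tagset or data.
TREEBANK = [
    ("S", ("NP", ("N", "dog")), ("VP", ("V", "runs"))),
    ("S", ("NP", ("N", "cat")), ("VP", ("V", "sees"), ("NP", ("N", "dog")))),
]


def productions(tree):
    """Yield (lhs, rhs) for every internal node of a tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def rule_probabilities(trees):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(p for t in trees for p in productions(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


def brackets(tree, start=0):
    """Return (set of (label, start, end) spans, end position).

    Every node is counted, including POS-level nodes; a set is used,
    so identical repeated spans collapse into one.
    """
    label, *children = tree
    spans, pos = set(), start
    for child in children:
        if isinstance(child, str):
            pos += 1  # a word advances the position by one token
        else:
            sub_spans, pos = brackets(child, pos)
            spans |= sub_spans
    spans.add((label, start, pos))
    return spans, pos


def precision_recall_f1(gold, predicted):
    """Labeled-bracket precision, recall, and F1 for one tree pair."""
    gold_spans, _ = brackets(gold)
    pred_spans, _ = brackets(predicted)
    matched = len(gold_spans & pred_spans)
    precision = matched / len(pred_spans)
    recall = matched / len(gold_spans)
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    for (lhs, rhs), prob in sorted(rule_probabilities(TREEBANK).items()):
        print(f"{lhs} -> {' '.join(rhs)}  [{prob:.3f}]")

    gold = TREEBANK[1]
    # Hypothetical parser output that attaches the object NP to S, not VP.
    predicted = ("S", ("NP", ("N", "cat")), ("VP", ("V", "sees")), ("NP", ("N", "dog")))
    print(precision_recall_f1(gold, predicted))
```

Under self-consistency the same trees would serve for both rule extraction and scoring, whereas the cross-validation schemes would call `rule_probabilities` on the training folds only. The crossing-bracket measure is not covered by this sketch.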







Copyright (c) 2019 IJITEE (International Journal of Information Technology and Electrical Engineering)

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

ISSN  : 2550-0554 (online)
