CONSTRUCTION OF VIETNAMESE CORPORA FOR NAMED ENTITY RECOGNITION

Thao Pham T. X.1, Tri T. Q.1, Ai Kawazoe2, Dien Dinh3, Nigel Collier2
1 Faculty of Computer Sciences, University of Information Technology - VNU of HCMC, Vietnam. Email: [email protected], [email protected]
2 National Institute of Informatics, Tokyo, Japan. Email: {kawazoe, collier}@nii.ac.jp
3 Faculty of Information Technology, University of Natural Sciences - VNU of HCMC, Vietnam. Email: [email protected]

Abstract

To build an automatic named entity recognition (NER) system using a machine learning approach, a large tagged corpus is widely seen as a necessary knowledge resource. However, manual construction is time-consuming, labor-intensive and expensive. Building NER corpora for European languages has been extensively studied, whereas less-studied languages such as Vietnamese have not yet received much attention. This paper describes the construction of a Vietnamese corpus, Vietnamese guidelines for annotators and a tagging tool that we make publicly available. We also report a comparison with the English named entity (NE) corpus in our multilingual NER system.

I. Introduction

The early detection of epidemic outbreaks in Southeast Asia is an important goal. To help public health workers with this surveillance, we are developing a system called BioCaster [1] for detecting and monitoring disease outbreaks, based on text mining of online news articles. One of the key components of the BioCaster system is the use of automated learning methods to identify NEs and events using features derived from annotated examples in a multilingual collection of news articles. The initial target languages are English, Japanese, Vietnamese and Thai. Building a Vietnamese NE annotated corpus is a necessary task in this project. To achieve this, we need to collect relevant news articles, choose an appropriate tagset, write detailed guidelines for annotators and construct a tagging tool. We report here on the results of these activities. For the Vietnamese named entity task, a character-based NER model using Conditional Random Fields (CRF) was proposed (Tu et al., 2005); however, the construction of the Vietnamese NER corpus was not reported in detail. Our aim is to build a corpus not only for general Vietnamese NER but also for NER in the BioCaster project in particular. Hence our NE tagset includes general target entity classes such as person, organization, location, temporal expressions, monetary values and percentages, as well as domain entities (e.g. disease, symptom, virus, etc.) according to the ontological analysis for the annotation of terms in the BioCaster project (Kawazoe et al., 2006) and the multilingual ontology for infectious disease surveillance (Collier et al., 2007).

This paper is organized as follows: in the next section, we discuss corpus preparation. The tagset is presented in Section III, and Section IV introduces the corpus annotation process and a tool supporting annotation. Finally, our experiment and conclusion are presented in Section V.

[1] The BioCaster project: http://biocaster.nii.ac.jp

Conference RIAO2007, Pittsburgh PA, U.S.A. May 30-June 1, 2007 - Copyright C.I.D. Paris, France

II. Corpus preparation

We collected e-news articles from Tuoi Tre and VnExpress published in the last 6 months of 2005. These are two of the most popular online newspapers in Vietnam and cover a variety of topics. We extracted articles from seven fields: society, health, entertainment, sport, politics, business and scitech. The English (EN) corpus covers the same seven fields as the Vietnamese (VN) corpus; however, the number of potential news sources is larger. These include specialized news sources such as IRIN (UK) and World Health Organization (WHO) outbreak alerts, and more general ones such as Reuters (USA), CBS news (USA), Wall Street Journal (USA), New York Times (USA), BBC (UK) and Xinhua (China). In the BioCaster project, the initial Vietnamese corpus consists of 500 annotated articles while the English corpus includes 498 articles. Table 1 shows the proportion of articles in each field for the English and Vietnamese corpora. The distribution is broadly balanced across fields.

Fields          EN         VN
Society         12.85 %    13.20 %
Health          43.17 %    42.00 %
Entertainment    0.40 %     1.00 %
Sport            4.62 %     5.00 %
Politics        12.05 %    12.80 %
Business        22.89 %    22.20 %
Scitech          4.02 %     3.80 %

Table 1. The distribution of files in each field in English and Vietnamese corpora
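As a rough cross-check of Table 1, the percentages can be converted back into approximate article counts using the corpus sizes stated above (498 English and 500 Vietnamese articles). The sketch below is illustrative only: the per-field counts are obtained by rounding and are not figures reported in this paper, and all names in the code are hypothetical.

```python
# Percentages copied from Table 1 as (EN, VN); corpus sizes come from the text.
TABLE_1 = {
    "Society": (12.85, 13.20),
    "Health": (43.17, 42.00),
    "Entertainment": (0.40, 1.00),
    "Sport": (4.62, 5.00),
    "Politics": (12.05, 12.80),
    "Business": (22.89, 22.20),
    "Scitech": (4.02, 3.80),
}
EN_TOTAL, VN_TOTAL = 498, 500


def approx_counts(total, column):
    """Estimate per-field article counts by rounding (an assumption)."""
    return {
        field: round(total * pct[column] / 100)
        for field, pct in TABLE_1.items()
    }


en_counts = approx_counts(EN_TOTAL, 0)
vn_counts = approx_counts(VN_TOTAL, 1)

# Field with the largest EN/VN difference, measured in percentage points.
largest_gap_field = max(
    TABLE_1, key=lambda f: abs(TABLE_1[f][0] - TABLE_1[f][1])
)
largest_gap = round(
    abs(TABLE_1[largest_gap_field][0] - TABLE_1[largest_gap_field][1]), 2
)
```

Under this rounding the estimated counts add up to the stated corpus sizes, and the largest difference between the two distributions is in the Health field, at about 1.2 percentage points.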

Since Vietnamese is an isolating language, we must perform word segmentation before annotating the articles in our corpus. In Vietnamese, word boundaries are not always marked by spaces as they are in English, and words are usually composed of special linguistic units called ‘morphosyllables’. A morphosyllable may be a morpheme, a word, or neither (Dien & Thuy, 2006a). For example, the Vietnamese sentence “Một luật gia cầm cự với tình hình hiện nay” can be understood in different ways depending on its word segmentation (here, we use the underscore “_” to link the morphosyllables of a Vietnamese word), e.g.:
• “A lawyer contends with the present situation” (“Một luật_gia cầm_cự với tình_hình hiện_nay”)
• “A law poultry resists the present situation” (“Một luật gia_cầm cự với tình_hình hiện_nay”)
The comparison of Vietnamese and English word segmentation is shown in Table 2.


VN    Một   luật gia cầm cự       với    tình hình   hiện nay
VN1   Một   luật_gia cầm_cự       với    tình_hình   hiện_nay
EN1   A     lawyer contends       with   situation   present
VN2   Một   luật gia_cầm cự       với    tình_hình   hiện_nay
EN2   A     law poultry resists          situation   present

Table 2. An ambiguous example in Vietnamese word segmentation
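The ambiguity in Table 2 can be reproduced by enumerating candidate segmentations of the ambiguous span. The sketch below is only an illustration and is not the segmentation tool used in this work: the three-entry lexicon and the function name are hypothetical, single morphosyllables are always allowed as words, and words are assumed to span at most two morphosyllables.

```python
def segmentations(syllables, lexicon, max_len=2):
    """Return every grouping of syllables into words.

    A single syllable is always accepted as a word; a longer group is
    accepted only if its underscore-joined form is in the lexicon.
    """
    if not syllables:
        return [[]]
    results = []
    for n in range(1, min(max_len, len(syllables)) + 1):
        word = "_".join(syllables[:n])
        if n == 1 or word in lexicon:
            for rest in segmentations(syllables[n:], lexicon, max_len):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon covering only the ambiguous span of the example.
LEXICON = {"luật_gia", "gia_cầm", "cầm_cự"}

candidates = [
    " ".join(words)
    for words in segmentations("luật gia cầm cự".split(), LEXICON)
]
```

Both the VN1 grouping (luật_gia cầm_cự) and the VN2 grouping (luật gia_cầm cự) appear among the five candidates, so a lexicon alone cannot choose between them and the segmenter needs context or statistics.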

In this example, there is more than one way of understanding the sentence. If we segment the words as shown in VN1 (the better reading in terms of semantics), we can recognize that “luật_gia” (lawyer) is a PERSON entity. If we segment the words as shown in VN2, that entity can no longer be recognized. Word segmentation therefore has a significant effect on named entity recognition and must be resolved in a preprocessing step before further processing can take place. We used a word segmentation tool (Dien & Thuy, 2006b) in 2 stages:
• Spelling normalization: Vietnamese morphosyllables can be written in more than one way, differing in the placement of the tone mark (aesthetics or main vowel method), capitalization (partial or total), the use of “i” or “y”, etc., so we normalize them as follows:
  - Hòa (main vowel method) → Hoà (aesthetics method) (name of a person)
  - Ngân_hàng công_thương Việt_Nam → Ngân_hàng Công_thương Việt_Nam (Industrial and Commercial Bank of Vietnam)
• Word segmentation correction: suppose that we have the phrase
  Thủ tướng Trung Quốc Ôn Gia Bảo
  (Chinese Prime Minister On Gia Bao; “Trung Quốc” is “Chinese” and “Thủ tướng” is “Prime Minister” in English)
  The NER result we want is
  Thủ_tướng [Trung_Quốc]LOC [Ôn_Gia_Bảo]PER
  [Chinese]LOC Prime Minister [On Gia Bao]PER
  However, if the result after word segmentation is
  Thủ_tướng Trung Quốc_Ôn Gia_Bảo
  we cannot correctly recognize the named entities, because “Quốc_Ôn” is treated as an inseparable word.

III. Named entity tagset

One of the first named entity tagsets (Ralph & Beth, 1996) had only 7 types (organization, location, person, date, time, monetary expressions and percentages) because its purpose was limited to information extraction for business activities. This 7-type tagset was used, for example, in MUC-7 [2] (Message Understanding Conference 7). Since then, the number of tags and their meaning have been changed or extended according to the goals of each project. In the IREX [3] (Information Retrieval and Extraction Exercise) project (Sekine & Isahara, 2000), another kind of tag, artifact (a product of human workmanship such as the Nobel Prize), was added. In addition, Sekine proposed an NE hierarchy which contains about 150 NE types (Sekine, Sudo & Nobata, 2002). In the biomedical field, there are several corpora related to our work, among which NLPBA [4] and GENIA [5] annotate 5 and 38 biological entity types respectively. However, these do not consider diseases or public health issues, and they focus on Medline abstracts rather than news. The tagset in our corpus has 22 tags, including some tags specialized for the BioCaster project. Each tag has its own meaning, but all of them contribute to the recognition of outbreaks, cases of disease with their symptoms, and the condition of each patient, so that damage from outbreaks can be prevented. Some categories have properties that make their meaning clearer.

IV. Corpus annotating process

We first studied the definitions of general target entity classes such as person, location, organization, etc. (Chinchor, 1998; Hirschman et al., 1999), biological named entities (Kim et al., 2003; MeSH, 2006), the features of the Vietnamese language and the definition of Vietnamese proper names (Social Science Committee, 1983), as well as the BioCaster NE annotation guidelines for English (Kawazoe, 2006). We then wrote the NE guidelines for Vietnamese to promote consistent human annotation (Thao et al., 2006). The current annotated corpus was built semi-automatically. Before NE annotation, the corpus was word-segmented. Then, about a hundred raw articles were manually annotated. We used Yamcha [6] to train a support vector machine (SVM) model for NER based on the annotated corpus.
A -2/+2 feature window including surface word, orthography, gazetteer and previous class predictions was used. We used this model to bootstrap the annotation of unannotated files and then corrected the output manually. At each retraining stage, we annotated about one hundred articles. These corrected articles were then added to the current corpus, which initially had 100 manually annotated articles, and we retrained the SVM on the expanded corpus. This process was repeated until the corpus was large enough. The annotated files were then made available in XML format. Although the XML-tagged files could have been produced with a text editor, semantically annotated corpora must be created by domain experts, who are not always familiar with XML tagging schemes. To make manual tagging more convenient and to reduce errors during tagging, we built a graphical user interface (GUI). This tagging tool allows annotators to choose tags from a defined tagset and annotate files with those tags. The tool was given to our colleagues in the VCL Group, who complimented its user-friendliness and its ability to speed up annotation.
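The -2/+2 feature window described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the configuration actually passed to Yamcha: the function names, the feature keys, the orthography classes, the "<PAD>" marker and the "O" label are all hypothetical.

```python
def orthography(token):
    """Coarse orthographic class of a token (hypothetical classes)."""
    if token.isdigit():
        return "DIGIT"
    if token.isupper():
        return "ALLCAPS"
    if token[:1].isupper():
        return "INITCAP"
    return "LOWER"


def window_features(tokens, prev_labels, i, gazetteer, size=2):
    """Build features for token i from a -size/+size window.

    prev_labels holds the classes already predicted for tokens before i.
    """
    feats = {}
    for off in range(-size, size + 1):
        j = i + off
        if 0 <= j < len(tokens):
            feats[f"w[{off}]"] = tokens[j]
            feats[f"orth[{off}]"] = orthography(tokens[j])
            feats[f"gaz[{off}]"] = tokens[j] in gazetteer
        else:
            feats[f"w[{off}]"] = "<PAD>"  # position outside the sentence
    # Previous class predictions exist only for positions to the left.
    for off in range(-size, 0):
        j = i + off
        feats[f"label[{off}]"] = prev_labels[j] if j >= 0 else "<PAD>"
    return feats


tokens = ["Thủ_tướng", "Trung_Quốc", "Ôn_Gia_Bảo"]
feats = window_features(tokens, ["O"], 1, gazetteer={"Trung_Quốc"})
```

Here the features for “Trung_Quốc” combine its own surface form, orthographic class and gazetteer membership with those of its neighbours and with the class already predicted for “Thủ_tướng”.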

[2] MUC7 http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html
[3] IREX homepage http://www.cs.nyu.edu/cs/project/proteus/irex
[4] NLPBA http://www-tsujiii.is.s.u-tokyo.ac.jp/GENIA/ERtask/report.html
[5] GENIA http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/genia-ontology.html
[6] http://chasen.org/~taku/software/yamcha/


V. Experiment and Conclusion

Our annotated corpus has 49,644 words, 74,414 tokens and 26,406 NEs, while the English corpus has 16,041 NEs. Nearly all target entity classes have more entities in our Vietnamese corpus than in the English corpus, except for four classes: CONTROL, NON_HUMAN, OUTBREAK and VIRUS. Table 3 shows an annotated sentence from our corpus.

EN: 4 April 2005 The Ministry of Health in Viet Nam has confirmed five additional cases of human infection with the H5 subtype of avian influenza virus .
VN: Ngày 4 tháng 5 năm 2005 Bộ Y_tế Việt_Nam đã xác_nhận thêm 5 trường_hợp người nhiễm virus cúm gia_cầm H5 .

Table 3. A sample of an annotated sentence in English and Vietnamese

During annotation, it was sometimes difficult to determine whether an entity belonged to LOCATION or ORGANIZATION due to polysemy. In addition, some entities in the VIRUS or SYMPTOM classes can also be annotated as DISEASE. In such cases, annotators had to judge the meaning from the context. For example, in the two sentences “Việt Nam thông báo có 8 người được chẩn đoán là đã nhiễm SARS.” (Vietnam reported that eight people have been diagnosed as SARS cases.) and “Có 8 ca nhiễm SARS ở Việt Nam.” (There are 8 SARS cases in Vietnam), “Việt Nam” (Vietnam) should be annotated as an ORGANIZATION entity in the first sentence and as a LOCATION entity in the second.

We developed an NER system based on this corpus, using the SVM method discussed earlier. The result is fairly high, with an overall F-measure of 83.56%. The most problematic point is that some classes, such as BACTERIA and CHEMICAL, have rather low results because they contain few named entities in the corpus: the sources of this corpus are two online newspapers, which tend to focus on general information. In the future, we will focus our corpus collection on these low-frequency target entity classes. In addition, ambiguity between the ORGANIZATION and LOCATION classes slightly affects performance and needs to be addressed in the guidelines. Errors caused by word segmentation have little effect on performance. The length of a named entity is also a weakness of our NER system: it has difficulty recognizing named entities that are four or five words long.

In many fields of natural language processing (NLP), such as machine translation and NER, a corpus is both necessary and useful. However, an annotated corpus for Vietnamese NER has not been fully described before. In this paper we have described the construction of a tagged corpus for Vietnamese NER and discussed language-specific difficulties arising from word segmentation.
We also compared it with the corresponding English NE corpus in our multilingual NER system. The resulting corpus has been used in the multilingual BioCaster project, where it proved its effectiveness. In the future, we will extract articles from other fields and annotate more files to expand the corpus and make its content more varied. We will also enhance the corpus with new linguistic information, e.g. chunks, grammatical relations, coreference, etc.

Acknowledgement

We would like to thank the Global Liaison Office of the National Institute of Informatics in Tokyo for granting us the travel fund to research this problem. We also sincerely thank colleagues in the VCL Group (Vietnamese Computational Linguistics) for their invaluable and insightful comments.

References

Shih, C. W.; Tsai, R. T.; Wu, S. H.; Hsieh, C. C. & Hsu, W. L. (2004). The Construction of a Chinese Named Entity Tagged Corpus: CNEC1.0. Proceedings of ROCLING 2004.

Chinchor, N. (1998). MUC-7 named entity task definition. In Proceedings of the 7th Message Understanding Conference.

Collier, N.; Kawazoe, A.; Jin, L.; Shigematsu, M.; Dien, D.; Barrero, R. A.; Takeuchi, K. & Kawtrakul, A. (2007). A multilingual ontology for infectious disease surveillance: rationale, design and challenges. (in press)

Dien, D. & Thuy, V. (2006). A maximum entropy approach for Vietnamese word segmentation. In Proceedings of the 4th IEEE International Conference on Computer Science - Research, Innovation and Vision of the Future 2006 (RIVF’06). Ho Chi Minh City, Vietnam, Feb 12-16, 2006.

Thao, P. T. X.; Tri, T. Q.; Dien, D. & Kawazoe, A. (2006). BioCaster NE Annotation Guidelines (Vietnamese), version 1.0, April 10th 2006. Project report.

Hirschman, L.; Chinchor, N.; Grishman, R. & Sundheim, B. (1999). Hub-4 Event Guidelines Version 2.6. http://www-nlpir.nist.gov/related_projects/muc/proceedings/hub4/guidelines.html

Kawazoe, A.; Jin, L.; Shigematsu, M.; Barrero, R.; Taniguchi, R. & Collier, N. (2006). The development of a schema for the annotation of terms in the BioCaster disease detection/tracking system. In Proceedings of the International Workshop on Biomedical Ontology in Action at KR-MED 2006, pp. 77-85.

Kawazoe, A. (2006). BioCaster NE Annotation Guidelines, version 1.9, 20th April 2006. Project report.

Kim, J. D.; Ohta, T.; Tateishi, Y. & Tsujii, J. (2003). GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics 19 (suppl. 1), pp. 80-82, Oxford University Press, 2003.

Ohta, T.; Tateishi, Y.; Collier, N.; Nobata, C. & Tsujii, J. (2000). Building an Annotated Corpus from Biology Research Papers. In Proceedings of the Workshop on Semantically Annotated Corpora (at COLING'2000), Saarbrucken, Germany, August.

Ralph, G. & Beth, S. (1996). Message Understanding Conference – 6: A brief history. In Proceedings of COLING-96.

Sekine, S.; Sudo, K. & Nobata, C. (2002). Extended Named Entity Hierarchy. In Proceedings of the LREC-2002.

Tu, N. C.; Oanh, T. T.; Hieu, P. X. & Thuy, H. Q. (2005). Named Entity Recognition in Vietnamese Free-Text and Web Documents Using Conditional Random Fields. The 8th Conference on Some selection problems of Information Technology and Telecommunication. Hai Phong, Vietnam.

U.S. National Library of Medicine. Medical Subject Headings (MeSH), 2006.

Social Science Committee (1983). Vietnamese grammar. Social Science Publisher, Hanoi, Vietnam.
