Building a Semantic Role Labelling System for Vietnamese

05/11/2017 ∙ by Thai Hoang Pham, et al. ∙ FPT University VNU

Semantic role labelling (SRL) is a task in natural language processing which detects and classifies the semantic arguments associated with the predicates of a sentence. It is an important step towards understanding the meaning of natural language. SRL systems exist for well-studied languages such as English, Chinese and Japanese, but no such system exists for Vietnamese. In this paper, we present the first SRL system for Vietnamese with encouraging accuracy. We first demonstrate that a simple application of SRL techniques developed for English does not give good accuracy for Vietnamese. We then introduce a new algorithm for extracting candidate syntactic constituents, which is much more accurate than the node-mapping algorithm commonly used in the identification step. Finally, in the classification step, in addition to the common linguistic features, we propose novel and useful features for use in SRL. Our SRL system achieves an F_1 score of 73.53% on the Vietnamese PropBank corpus. This system, including software and corpus, is available as an open source project and we believe that it is a good baseline for the development of future Vietnamese SRL systems.

Code Repositories

vnSRL

Tool for Vietnamese Semantic Role Labelling Task


I Introduction

SRL is the task of identifying the semantic roles of predicates in a sentence. In particular, it answers the question "Who did What to Whom, When, Where, Why?". A simple Vietnamese sentence, Nam giúp Huy học bài vào hôm qua (Nam helped Huy to do homework yesterday), is given in Figure 1.

Fig. 1: An example sentence

To assign semantic roles for the sentence above, we must analyse and label the propositions concerning the predicate giúp (helped) of the sentence. Figure 2 shows the result of SRL for this example; the meaning of the labels is described in detail in the evaluation section.

Fig. 2: Semantic roles for the example sentence

SRL has been used in many natural language processing (NLP) applications such as question answering [Shen:2007], machine translation [Lo:2010], document summarization [Aksoy:2009] and information extraction [Christensen:2010]. Therefore, SRL is an important task in NLP.

The first SRL system was developed by Gildea and Jurafsky [Gildea:2002]. This system was built on the FrameNet corpus and targeted English. Since then, the SRL task has been widely studied by the NLP community. In particular, there have been two shared tasks, CoNLL-2004 [Carreras:2004] and CoNLL-2005 [Carreras:2005], focusing on the SRL task for English. Most of the systems participating in these shared tasks treated the problem as a classification problem and applied supervised machine learning techniques. In addition, systems have been developed for other languages such as Chinese [Xue:2005] and Japanese [Tagami:2009].

In this paper, we present the first SRL system for Vietnamese with encouraging accuracy. We first demonstrate that a simple application of SRL techniques developed for English or other languages does not give good accuracy for Vietnamese. In particular, in the constituent identification step, the widely used 1-1 node-mapping algorithm for extracting argument candidates performs poorly on the Vietnamese dataset, with an F_1 score of 35.84%. We thus introduce a new algorithm for extracting candidates, which is much more accurate, achieving an F_1 score of 83.63%.

In the classification step, in addition to the common linguistic features, we propose novel and useful features for use in SRL, including function tags and word clusters obtained by performing a Gaussian mixture analysis on the distributed representations of Vietnamese words. These features are employed in two statistical classification models, Maximum Entropy and Support Vector Machines, which have proved effective for many classification problems.

Our SRL system achieves an F_1 score of 73.53% on the Vietnamese PropBank corpus. This system, including software and corpus, is available as an open source project and we believe that it is a good baseline for the development of future Vietnamese SRL systems.

The paper is structured as follows. Section II briefly introduces the SRL task and two well-known corpora for English. Section III describes the methodologies of some existing systems and of our system; some difficulties of SRL for Vietnamese are also discussed there. The evaluation section then presents the results and discussion. Finally, the concluding section summarizes the paper and suggests some directions for future work.

II Background

II-A SRL Task Description

The SRL task is usually divided into two steps. The first step is argument identification. The goal of this step is to identify the syntactic constituents of a sentence which are the most likely to be semantic arguments of its predicates. This is a difficult problem since the number of constituent candidates is exponentially large, especially for long sentences.

The second step is argument classification, which decides the exact semantic role for each constituent candidate identified in the first step. For example, the identification step for the sentence in the previous example, Nam giúp Huy học bài vào hôm qua, is described in Figure 3, and in the classification step, semantic roles are labelled as shown in Figure 2.

Fig. 3: Example of identification task
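The two steps can be read as a small pipeline: a function that proposes candidate spans, followed by a function that assigns a role to each span. The following is only a minimal sketch of that structure in Python. The names label_roles, toy_identify, toy_classify and ASSUMED_ROLES are hypothetical, the candidate spans are hand-written for the example sentence, and the role labels are an illustrative assumption rather than the contents of Figure 2 or the output of the system described in this paper.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]  # token indices [start, end), end exclusive


def label_roles(
    tokens: List[str],
    predicate_index: int,
    identify: Callable[[List[str], int], List[Span]],
    classify: Callable[[List[str], int, Span], str],
) -> Dict[Span, str]:
    """Run step 1 (argument identification), then step 2 (classification)."""
    candidates = identify(tokens, predicate_index)
    return {span: classify(tokens, predicate_index, span) for span in candidates}


def toy_identify(tokens: List[str], predicate_index: int) -> List[Span]:
    # Hypothetical stand-in: hand-written candidate spans for the
    # example sentence only. A real system derives them from a parse tree.
    return [(0, 1), (2, 3), (3, 5), (5, 8)]


# Assumed labels, for illustration only (not taken from Figure 2).
ASSUMED_ROLES: Dict[Span, str] = {
    (0, 1): "Arg0",
    (2, 3): "Arg1",
    (3, 5): "Arg2",
    (5, 8): "ArgM-TMP",
}


def toy_classify(tokens: List[str], predicate_index: int, span: Span) -> str:
    # Hypothetical stand-in: a table lookup instead of a trained classifier.
    return ASSUMED_ROLES[span]


TOKENS = "Nam giúp Huy học bài vào hôm qua".split()
ROLES = label_roles(TOKENS, 1, toy_identify, toy_classify)

for (start, end), role in ROLES.items():
    print(role, " ".join(TOKENS[start:end]))
```

Keeping the two steps behind separate function arguments makes it possible to swap in a different candidate extractor or classifier without touching the rest of the pipeline.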

II-B Existing Corpora for SRL

II-B1 FrameNet

The FrameNet project is a lexical database of English. It was built by annotating examples of how words are used in actual texts. It consists of more than 10,000 word senses, most of them with annotated examples that show their meaning and usage, and more than 170,000 manually annotated sentences [Baker:2003]. This is the most widely used dataset upon which SRL systems for English have been developed and tested.

FrameNet is based on the theory of Frame Semantics [Boas:2005]. The basic idea is that the meanings of most words can be best understood on the basis of a semantic frame: a description of a type of event, relation, or entity and the participants in it. The participants in a semantic frame are called frame elements. For example, a FrameNet sentence annotated with the cooking concept is shown in Figure 4.

Fig. 4: An example sentence in the FrameNet corpus

II-B2 PropBank

PropBank is a corpus that is annotated with verbal propositions and their arguments [Babko:2005]. PropBank tries to supply a general-purpose labelling of semantic roles for a large corpus to support the training of automatic semantic role labelling systems. However, defining such a universal set of semantic roles for all types of predicates is a difficult task; therefore, only the Arg0 and Arg1 semantic roles can be generalized. In addition to the core roles, PropBank defines several adjunct roles that can apply to any verb. These are called argument modifiers. The semantic roles covered by PropBank are the following:

  • Core Arguments (Arg0-Arg5, ArgA): Arguments that define predicate-specific roles. Their semantics depend on the predicate in the sentence.

  • Adjunct Arguments (ArgM-): General arguments that can belong to any predicate. There are 13 types of adjuncts.

  • Reference Arguments (R-): Arguments that refer to arguments realized in other parts of the sentence.

  • Predicate (V): Participant realizing the verb of the proposition.
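The four groups above can be told apart from the label string alone. The following is a minimal sketch in Python, assuming labels are spelled exactly as in the list (Arg0 to Arg5, ArgA, the ArgM- and R- prefixes, and V). The function name role_group and the group names it returns are hypothetical, and ArgM-TMP is used only as an example of an adjunct label.

```python
import re

# Core arguments as listed above: Arg0 to Arg5, plus ArgA.
CORE_PATTERN = re.compile(r"^Arg(?:[0-5]|A)$")


def role_group(label: str) -> str:
    """Map a PropBank-style role label to one of the four groups."""
    if label == "V":
        return "predicate"
    if label.startswith("R-"):
        return "reference"
    if label.startswith("ArgM-"):
        return "adjunct"
    if CORE_PATTERN.match(label):
        return "core"
    raise ValueError(f"unrecognized role label: {label!r}")


for example in ["Arg0", "ArgA", "ArgM-TMP", "R-Arg0", "V"]:
    print(example, "->", role_group(example))
```

The reference prefix is tested before the core pattern so that a label such as R-Arg0 is grouped as a reference argument rather than rejected.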

For example, the sentence of Figure 4 can be annotated in the PropBank role schema as shown in Figure 5.

Fig. 5: An example sentence in the PropBank corpus

III Methodology

III-A Existing Approaches

This section summarizes existing approaches used by typical SRL systems for well-studied languages. We describe these systems along two aspects, namely the types of data that the systems use and their strategies for labelling semantic roles, including model types, labelling strategies and degrees of granularity.

III-A1 Data Types

Several kinds of data are used in the training of SRL systems. Some systems use bracketed trees as the input data. A bracketed tree of a sentence is the tree of nested constituents representing its constituency structure. Other systems use the dependency tree of a sentence, which represents dependencies between its individual words. A syntactic dependency represents the fact that the presence of a word is licensed by another word, which is its governor. In a typed dependency analysis, grammatical labels are added to the dependencies to mark their grammatical relations, for example nominal subject (nsubj) or direct object (dobj). Figure 6 shows the bracketed tree and the dependency tree of an example sentence.
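To make the two data types concrete, the following minimal Python sketch reads a bracketed tree into nested tuples, lists its constituents, and stores typed dependencies as (governor, relation, dependent) triples. The English sentence, the tree, the dependency triples and all function names are hypothetical illustrations and are not taken from the figure or from any corpus.

```python
from typing import Dict, List, Tuple, Union

# A node is (label, children); a child is either another node or a word.
Node = Tuple[str, list]


def parse_bracketed(text: str) -> Node:
    """Parse a bracketed tree such as '(S (NP (N John)) (VP (V reads)))'."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node() -> Node:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        label = tokens[pos + 1]
        pos += 2
        children: List[Union[Node, str]] = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing input after the tree")
    return tree


def words(tree: Node) -> List[str]:
    """Return the words covered by a node, left to right."""
    out: List[str] = []
    for child in tree[1]:
        out.extend([child] if isinstance(child, str) else words(child))
    return out


def constituents(tree: Node) -> List[Tuple[str, List[str]]]:
    """List (label, covered words) for every node, in pre-order."""
    result = [(tree[0], words(tree))]
    for child in tree[1]:
        if not isinstance(child, str):
            result.extend(constituents(child))
    return result


# Hypothetical example sentence in both representations.
TREE = parse_bracketed("(S (NP (N John)) (VP (V reads) (NP (N books))))")
DEPENDENCIES = [("reads", "nsubj", "John"), ("reads", "dobj", "books")]
GOVERNOR: Dict[str, str] = {dep: gov for gov, _, dep in DEPENDENCIES}

for label, covered in constituents(TREE):
    print(label, covered)
```

In the bracketed tree every candidate argument is a node covering a contiguous run of words, whereas in the dependency representation the same information is carried by word-to-word links, which is why the two kinds of input lead to different candidate extraction procedures.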