Tool for Vietnamese Semantic Role Labelling Task
Semantic role labelling (SRL) is a task in natural language processing which detects and classifies the semantic arguments associated with the predicates of a sentence. It is an important step towards understanding the meaning of natural language. SRL systems exist for well-studied languages such as English, Chinese and Japanese, but no such system exists for Vietnamese. In this paper, we present the first SRL system for Vietnamese with encouraging accuracy. We first demonstrate that a simple application of SRL techniques developed for English does not give good accuracy for Vietnamese. We then introduce a new algorithm for extracting candidate syntactic constituents, which is much more accurate than the common node-mapping algorithm usually used in the identification step. Finally, in the classification step, in addition to the common linguistic features, we propose novel and useful features for use in SRL. Our SRL system achieves an F_1 score of 73.53% on the Vietnamese PropBank corpus. This system, including software and corpus, is available as an open source project, and we believe that it is a good baseline for the development of future Vietnamese SRL systems.
SRL is the task of identifying the semantic roles of the predicates in a sentence. In particular, it answers the question "Who did What to Whom, When, Where, Why?". A simple Vietnamese sentence, Nam giúp Huy học bài vào hôm qua (Nam helped Huy to do homework yesterday), is given in Figure 1.
To assign semantic roles to the sentence above, we must analyse and label the propositions concerning the predicate giúp (helped). Figure 2 shows the SRL result for this example; the meaning of the labels is described in detail in the evaluation section.
SRL has been used in many natural language processing (NLP) applications such as question answering [Shen:2007], machine translation [Lo:2010, Aksoy:2009] and information extraction [Christensen:2010]. Therefore, SRL is an important task in NLP.
The first SRL system was developed by Gildea and Jurafsky [Gildea:2002]. It was built on the FrameNet corpus and targeted English. Since then, the SRL task has been widely studied by the NLP community. In particular, two shared tasks, CoNLL-2004 [Carreras:2004] and CoNLL-2005 [Carreras:2005], focused on SRL for English. Most of the systems participating in these shared tasks treated the problem as a classification problem and applied supervised machine learning techniques. In addition, systems have been developed for other languages such as Chinese [Xue:2005] and Japanese [Tagami:2009].
In this paper, we present the first SRL system for Vietnamese with encouraging accuracy. We first demonstrate that a simple application of SRL techniques developed for English or other languages does not give good accuracy for Vietnamese. In particular, in the constituent identification step, the widely used 1-1 node-mapping algorithm for extracting argument candidates performs poorly on the Vietnamese dataset, with an F_1 score of 35.84%. We thus introduce a new algorithm for extracting candidates, which is much more accurate, achieving an F_1 score of 83.63%.
In the classification step, in addition to the common linguistic features, we propose novel and useful features for use in SRL, including function tags and word clusters obtained by performing a Gaussian mixture analysis on the distributed representations of Vietnamese words. These features are employed in two statistical classification models, Maximum Entropy and Support Vector Machines, which have proven effective on many classification problems.
Our SRL system achieves an F_1 score of 73.53% on the Vietnamese PropBank corpus. This system, including software and corpus, is available as an open source project, and we believe that it is a good baseline for the development of future Vietnamese SRL systems.
The paper is structured as follows. Section II briefly introduces the SRL task and two well-known corpora for English. Section III describes the methodologies of some existing systems and of our system, and discusses some difficulties of SRL for Vietnamese. The evaluation section presents the results and discussion. Finally, the concluding section summarizes the paper and suggests some directions for future work.
The SRL task is usually divided into two steps. The first step is argument identification. The goal of this step is to identify the syntactic constituents of a sentence which are the most likely to be semantic arguments of its predicates. This is a difficult problem since the number of constituent candidates is exponentially large, especially for long sentences.
The second step is argument classification, which decides the exact semantic role of each constituent candidate identified in the first step. For example, the identification step for the earlier example sentence, Nam giúp Huy học bài vào hôm qua, is illustrated in Figure 3, and the semantic roles assigned in the classification step are shown in Figure 2.
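The two-step decomposition can be sketched in a few lines of Python. This is only an illustration of how the steps compose, not the system described in this paper: the names `label_sentence`, `toy_identify` and `toy_classify` are hypothetical, the two toy functions are hand-written stand-ins for learned models, and the roles they return are assumed for the sake of the example (Figure 2 is the authoritative labelling). The sentence is split on whitespace, which yields syllables rather than properly segmented Vietnamese words.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # token offsets, end exclusive


@dataclass(frozen=True)
class Argument:
    span: Span
    role: str


def label_sentence(
    tokens: List[str],
    predicate_index: int,
    identify: Callable[[List[str], int], List[Span]],
    classify: Callable[[List[str], int, Span], str],
) -> List[Argument]:
    """Step 1 proposes argument spans; step 2 assigns a role to each span."""
    candidates = identify(tokens, predicate_index)
    return [
        Argument(span, classify(tokens, predicate_index, span))
        for span in candidates
    ]


def toy_identify(tokens: List[str], predicate_index: int) -> List[Span]:
    # Hand-written spans for the example sentence (not a real identifier).
    return [(0, 1), (2, 3), (3, 5), (5, 8)]


def toy_classify(tokens: List[str], predicate_index: int, span: Span) -> str:
    # Assumed roles for illustration only (not a real classifier).
    assumed = {(0, 1): "Arg0", (2, 3): "Arg1", (3, 5): "Arg2", (5, 8): "ArgM-TMP"}
    return assumed[span]


tokens = "Nam giúp Huy học bài vào hôm qua".split()
arguments = label_sentence(tokens, 1, toy_identify, toy_classify)
for argument in arguments:
    start, end = argument.span
    print(" ".join(tokens[start:end]), "->", argument.role)
```

Keeping the two steps behind separate callables mirrors the description above: an identifier can be replaced or evaluated on its own without touching the classifier.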
The FrameNet project is a lexical database of English. It was built by annotating examples of how words are used in actual texts. It consists of more than 10,000 word senses, most of them with annotated examples that show their meaning and usage, and more than 170,000 manually annotated sentences [Baker:2003]. This is the most widely used dataset upon which SRL systems for English have been developed and tested.
FrameNet is based on the Frame Semantics theory [Boas:2005]. The basic idea is that the meanings of most words can be best understood on the basis of a semantic frame: a description of a type of event, relation, or entity and the participants in it. The participants in a semantic frame are called frame elements. For example, Figure 4 shows a FrameNet sentence annotated with respect to the cooking concept.
PropBank is a corpus that is annotated with verbal propositions and their arguments [Babko:2005]. PropBank tries to supply a general-purpose labelling of semantic roles for a large corpus to support the training of automatic semantic role labelling systems. However, defining such a universal set of semantic roles for all types of predicates is a difficult task; therefore, only the Arg0 and Arg1 semantic roles can be generalized. In addition to the core roles, PropBank defines several adjunct roles that can apply to any verb; these are called argument modifiers. The semantic roles covered by PropBank are the following:
Core Arguments (Arg0-Arg5, ArgA): Arguments that define predicate-specific roles. Their semantics depend on the predicate in the sentence.
Adjunct Arguments (ArgM-): General arguments that can belong to any predicate. There are 13 types of adjuncts.
Reference Arguments (R-): Arguments that refer to arguments realized in other parts of the sentence.
Predicate (V): Participant realizing the verb of the proposition.
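The four groups above can be told apart from the label string alone. The following is a minimal sketch under the assumption that labels are spelled exactly as in the list (for example `Arg0`, `ArgM-TMP`, `R-Arg0`, `V`); other corpora and tools may spell them differently, and the function name `role_category` is hypothetical.

```python
import re

# Core labels as listed above: Arg0 to Arg5, plus ArgA.
CORE_PATTERN = re.compile(r"^Arg(?:[0-5]|A)$")


def role_category(label: str) -> str:
    """Map a PropBank-style label to one of the four groups listed above."""
    if label == "V":
        return "predicate"
    if label.startswith("R-"):
        return "reference"
    if label.startswith("ArgM-"):
        return "adjunct"
    if CORE_PATTERN.match(label):
        return "core"
    raise ValueError(f"unrecognized role label: {label!r}")


for example in ["Arg0", "ArgA", "ArgM-TMP", "R-Arg0", "V"]:
    print(example, "->", role_category(example))
```

The reference check comes before the core check so that a label such as `R-Arg0` is not mistaken for a core argument.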
This section summarizes existing approaches used by typical SRL systems for well-studied languages. We describe these systems along two aspects: the type of data that the systems use, and their strategies for labelling semantic roles, including model types, labelling strategies and degrees of granularity.
Several kinds of data are used in the training of SRL systems. Some systems use bracketed trees as input. A bracketed tree of a sentence is the tree of nested constituents representing its constituency structure. Other systems use the dependency tree of a sentence, which represents dependencies between its individual words. A syntactic dependency represents the fact that the presence of a word is licensed by another word, its governor. In a typed dependency analysis, grammatical labels are added to the dependencies to mark their grammatical relations, for example nominal subject (nsubj) or direct object (dobj). Figure 6 shows the bracketed tree and the dependency tree of an example sentence.
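To make the two input types concrete, the sketch below reads a bracketed tree from a string and lists its constituents as token spans, then shows the same sentence as typed dependency triples. The example bracketing, its tag names (S, NP, VP, N, V) and the helper names are assumptions made for illustration; they are not taken from the figure or from any particular treebank.

```python
from typing import List, Tuple


def parse_bracketed(text: str):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def parse_node():
        nonlocal position
        if tokens[position] != "(":
            raise ValueError("expected '('")
        position += 1
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[position])  # a word (leaf)
                position += 1
        position += 1  # consume ")"
        return (label, children)

    tree = parse_node()
    if position != len(tokens):
        raise ValueError("trailing input after the tree")
    return tree


def leaves(tree) -> List[str]:
    """Return the words of the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [word for child in children for word in leaves(child)]


def constituents(tree, start: int = 0):
    """Return (spans, end); spans holds (label, start, end) with end exclusive."""
    label, children = tree
    spans = []
    position = start
    for child in children:
        if isinstance(child, str):
            position += 1
        else:
            child_spans, position = constituents(child, position)
            spans.extend(child_spans)
    spans.insert(0, (label, start, position))
    return spans, position


# Hypothetical bracketing of a short sentence.
tree = parse_bracketed("(S (NP (N Nam)) (VP (V helped) (NP (N Huy))))")
spans, _ = constituents(tree)
print(leaves(tree))
for label, start, end in spans:
    print(label, start, end)

# The same sentence as typed dependencies: (governor, relation, dependent).
dependencies: List[Tuple[str, str, str]] = [
    ("helped", "nsubj", "Nam"),
    ("helped", "dobj", "Huy"),
]
```

In the constituency view every candidate argument is a node covering a contiguous span, whereas in the dependency view an argument is reached through a labelled edge from its governor.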