Sentence Object Notation: Multilingual sentence notation based on Wordnet

01/03/2018 · by Abdelkrime Aries, et al. · Ecole nationale Supérieure d'Informatique

The representation of sentences is an important task: it can serve as a medium for exchanging data between applications. Two key properties a notation must have are a minimal size and a representative form, which reduce transfer time and, hopefully, processing time as well. Usually, sentence representation is tied to the processed language, whose grammar shapes how the sentence is represented. To avoid language-dependent notations, we have to come up with a representation that uses not words but their meanings. This can be done using a lexicon such as WordNet: instead of words, we use their synsets. As for syntactic relations, they have to be as universal as possible. Our new notation is called STON, "SenTences Object Notation", which has some similarities to JSON. It is meant to be a minimal, representative and language-independent syntactic representation. We also want it to be readable and easy to create, which simplifies developing automatic generators and building test banks manually. Its purpose is to serve as a medium between the parts of applications such as text summarization, language translation, etc. The notation is based on four languages: Arabic, English, French and Japanese; there are some cases where these languages do not agree on one representation. Also, given the diversity of grammatical structures across the world's languages, the notation may fail for some of them, which leaves room for future improvements.

1 Introduction

Tagging sentences is a very important task in natural language processing. One of the best-known methods is syntactic tagging. The main idea is to detect the different parts (structure) of a sentence, such as nominal phrases, verbal phrases, nouns, verbs, etc. This structure can be expressed using languages like XML, JSON, etc. (Bertran et al., 2008; Recasens and Martí, 2009; Lopatková et al., 2011). The problem with syntactic tagging is its dependence on the processed language. Indeed, it works well if our system is intended for a specific language. But when it comes to multilingual or cross-lingual systems, it is better to come up with a way to represent the sentence structure independently of languages.

Semantic representation of sentences is one solution to this problem. The meaning of a sentence is something beyond languages; it is related to the different concepts of its words and the different relations between them. Words are just a way to describe a concept in a given language. For example, the words ”^sajaraT /shajarah/” in Arabic, ”tree” in English, ”arbre” in French and ”木 /ki/” in Japanese refer to the same concept. This concept is defined in WordNet 3.0 (Miller, 1995) as ”a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms”. (WordNet is a large lexical database of English; nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms, called synsets, each expressing a distinct concept. URL: https://wordnet.princeton.edu) The idea of semantic representation is to represent words as concepts, so that each language can link its own words to these concepts. Then, the semantic relations between these concepts are extracted from the sentence (Uchida et al., 1999; Banarescu et al., 2013). It would be ideal to gather all the concepts of the different languages and link each language's words to them. However, detecting the semantic relations between the concepts of a sentence automatically requires a large amount of knowledge and processing power.

Our idea is to propose a simple language which will help us transfer the information in sentences between applications. For example, we can use it to send sentences from a summarization system to a translation system, which allows us to create what is known as cross-lingual summarization. The language has to be simple and unambiguous so that developers can easily create tools to encode a specific natural language into it. It must have a minimal size to minimize transfer time and also processing time. One of the main characteristics on which we insist is readability, which helps us create STON representations manually. To this end, we want to represent the syntactic relations in sentences while keeping the multilingual aspect in mind. We must insist on the term multilingual, which means ”using several languages”; the notation does not have to represent all the world's languages. In our study, we are interested in four languages: Arabic, English, French and Japanese. For non-Latin scripts, namely Arabic and Japanese, we provide the ALA-LC romanization. (ALA-LC, American Library Association - Library of Congress, is a set of standards for romanization, i.e. representing texts of other writing systems using the Latin script. URL: https://www.loc.gov/catdir/cpso/roman.html)

The rest of this paper is organized as follows. Section 2 presents some related works on text annotation. Section 3 describes our main proposition and the different parts of STON. Section 4 addresses the cases of adpositional phrases, relative clauses and comparison. In Section 5, some solutions are proposed concerning coordination between references, proper names, verbs with two objects, complementizers and passive voice. Section 8 discusses our work, its grounds, its contribution in comparison with other works, its benefits and its limits. Finally, Section 9 is reserved for the conclusion and future improvements.

2 Related works

Sentence structure can be represented using general-purpose languages such as XML or JSON. Let us take XML as an example: to represent a sentence, we have to specify some rules in a DTD file. It would contain the different structures of a sentence, such as subject, object, verb, tense of the verb, etc. For example, Recasens and Martí (2009) use XML to represent sentences in order to annotate corpora for Spanish and Catalan. They used AnCoraPipe (Bertran et al., 2008) to create the corpus. Figure 1 shows an example of the XML annotation of the sentence ”La Comisión Europea anunció …”, which means ”The European Commission announced …”. They use XML tags to express nouns, verbs, nominal phrases, etc., where each tag has some properties.

<sn arg="arg0" entity="entity1" entityref="ne"
 func="suj" ne="organization" tem="agt">
   <spec gen="f" num="s">
      <d gen="f" lem="el" num="s" postype="article" wd="La" />
   </spec>
   <grup.nom gen="f" num="s">
      <n entityref="ne" gen="c" lem="Comisión_Europea"
       ne="organization" num="c" postype="proper"
       sense="16:cs1" wd="Comisión_Europea" />
   </grup.nom>
</sn>
<grup.verb>
   <v els="a2" lem="anunciar" mood="indicative" num="s"
    person="3" postype="main" tense="past" wd="anunció" />
</grup.verb>
Figure 1: Example of XML sentence annotation (Recasens and Martí, 2009)

Knowledge-based interlingual machine translation uses a representation of sentences as a medium between the source language and the destination one. KANT (Mitamura et al., 1991) is an example of a system that uses an interlingua to represent sentences before translation. The KANT interlingua is a list-based structural representation scheme using nested frames. Each interlingua frame contains a head concept, a series of feature-value pairs, and semantic slots containing nested interlingua frames. Concepts are symbols which begin with an asterisk (*) followed by a concept prefix defining their category (e.g. *A-DRIVE, the action drive). The KANT interlingua distinguishes many categories, such as action, object, manner, proper name, etc. It uses certain features of the input text, such as modality, aspect, discourse markers, etc., in order to generate grammatically accurate output texts (Mitamura et al., 1991). Semantic roles are relations between frames, such as agent, theme, etc. Figure 2 presents the KANT interlingua representation of the sentence ”The default rate remained close to zero during this time.”. It contains concepts such as *A-REMAIN, *K-DURING, etc.; features such as FORM, TENSE, etc.; and semantic roles such as Q-MODIFIER, THEME, etc.

(*A-REMAIN  ; action rep for ’remain’
    (FORM FINITE)
    (TENSE PAST)
    (MOOD DECLARATIVE)
    (PUNCTUATION PERIOD)
    (IMPERSONAL -) ; passive + expletive subject
    (ARGUMENT-CLASS THEME+PREDICATE) ; predicate argument structure
    (Q-MODIFIER ; PP semrole (generic)
        (*K-DURING ; PP interlingua
            (POSITION FINAL) ; clue for translation
            (OBJECT ; PP object semrole
                (*O-TIME ; object rep for ’time’
                    (UNIT -)
                    (NUMBER SINGULAR)
                    (REFERENCE DEFINITE)
                    (DISTANCE NEAR)
                    (PERSON THIRD)))))
    (THEME ; object semrole
        (*O-DEFAULT-RATE ; object rep for ’default rate’
            (PERSON THIRD)
            (UNIT -)
            (NUMBER SINGULAR)
            (REFERENCE DEFINITE)))
    (PREDICATE ; adjective phrase semrole
        (*P-CLOSE ; property rep for ’closer’
            (DEGREE POSITIVE)
            (Q-MODIFIER
                (*K-TO
                    (OBJECT
                        (*O-ZERO
                            (UNIT -)
                            (NUMBER SINGULAR)
                            (REFERENCE NO-REFERENCE)
                            (PERSON THIRD))))))))
Figure 2: Representing the sentence ”The default rate remained close to zero during this time.” in KANT interlingua (Czuba et al., 1998)

Despite being an interlingua, KANT shows some limitations when representing sentences (Czuba et al., 1998). It is closer to English semantics than to being truly multilingual, because it is intended for English-to-other-languages translation. Also, it is designed for technical domains, so its vocabulary is limited to a subset of meanings.

Universal Networking Language (UNL) is a knowledge representation language meant to represent the meaning of texts without ambiguity. It was developed in 1996 as an intermediate multilingual language to be used over the Internet (Uchida et al., 1999; Uchida and Zhu, 2005). The major commitments of UNL are the following:

  • It must represent information: represent ”what was meant” and not ”what was said” or ”how it was said”.

  • It must be a language for computers: like HTML, SGML, XML, etc.

  • It must be self-sufficient: The UNL representation must not depend on any implicit knowledge and should explicitly codify all information.

  • It must be general-purpose: Its primary objective is to serve as an infrastructure for handling knowledge. It can be used for different tasks such as: translation, text mining, multilingual document generation, summarization, etc.

  • It must be independent from any particular natural language.

UNL defines some tags for the structure of the text: document ”[D]”, paragraph ”[P]” and sentence ”[S]”. Concepts are represented as character strings called ”Universal Words (UWs)”, where each natural language (En, Fr, etc.) has its own word dictionary. UNL expressions are based on binary relations, where each binary relation takes two UWs as parameters. Also, UNL specifies some attributes to represent information conveyed by natural language grammatical categories (such as tense, mood, aspect, number, etc.). Figure 3 presents the UNL formulation of the sentence ”Human affect the environment”. The sentence starts with the tag ”{unl}”, followed by two binary relations. The relation ”agt” defines a thing which initiates an action. In our example, ”Human”, a plural noun, performs (in the present tense) the action of ”affecting”. The relation ”obj” defines a thing in focus which is directly affected by an event or state. In our example, ”environment” is the direct object of the action of ”affecting”.

{unl}
   agt(affect(icl>do).@present.@entry:01,
      human(icl>animal).@pl)
   obj(affect(icl>do).@present.@entry:01,
      environment(icl>abstract thing).@pl)
{/unl}
Figure 3: Representing the sentence ”Human affect the environment” in UNL

Indeed, UNL has a minimal size, a multilingual background, and covers a large number of languages. However, it shows some limitations when it comes to which relations to choose. Boguslavsky (2013) claims that the selection of relations differs from team to team, as it is sometimes ambiguous which one to choose. An example is the phrase ”freedom for all”, which was described with the purpose relation ”pur” by one team and with the beneficiary relation ”ben” by another. Martins (2013) raises some other issues concerning UNL. One of them concerns proper nouns: are they treated as permanent UWs or just temporary ones? Also, a concept can be represented by a simple UW or a compound UW. For example, ”the physiological need for food” can be represented using the UW ”hunger”, or using the compound UW ”hungry.@ness”. Technically speaking, it is hard to design an encoder from natural language to UNL. After parsing a sentence, we have to find the relations between each part of it. It is sometimes too ambiguous to select between two relations manually, let alone automatically.

A more recent representation language is AMR (Abstract Meaning Representation), proposed by Banarescu et al. (2013). It is a semantic representation language designed to represent the meaning of English sentences. The text is represented as a graph, where the leaves are labeled with concepts such as ”(b / boy)”, which means an instance called ”b” of the concept ”boy” (see Figure 4). Concepts are English words (”boy”), PropBank framesets (Kingsbury and Palmer, 2002) (”want-01”), or special keywords: special entity types (”date-entity”, ”world-region”, etc.), quantities (”monetary-quantity”, ”distance-quantity”, etc.) and logical conjunctions (”and”, etc.). The relations between concepts are:

  • Frame arguments, following PropBank conventions.

  • General semantic relations.

  • Relations for quantities.

  • Relations for date-entities.

  • Relations for lists.

LOGIC format:
   ∃ w, b, g:
      instance(w, want-01) ∧ instance(g, go-01) ∧ instance(b, boy)
      ∧ arg0(w, b) ∧ arg1(w, g) ∧ arg0(g, b)

AMR format:
   (w / want-01
     :ARG0 (b / boy)
     :ARG1 (g / go-01
             :ARG0 b))

Figure 4: AMR representation of the sentence ”The boy wants to go” (Banarescu et al., 2013).

AMR is a lightweight annotation, but it is heavily based on English. It has some limitations when it comes to inflectional morphology for tense and number. It does not deeply capture many noun-noun or noun-adjective relations. Also, because it relies on PropBank framesets, it is subject to PropBank's constraints.

3 STON representation

STON is intended to represent sentences' syntactic structure in a multilingual context. Figure 5 presents our vision of how STON may be used. We can use lexical parsers for several languages to get the different parts of speech. Then, we can create the STON representation using a STON generator and the WordNet lexicon of each language. The representation can be used in multiple multilingual applications, like the example mentioned earlier. Then, given a STON parser, we can extend it to handle each language separately. Using a realizer or a language generator, we can reproduce a readable text in the destination language.

Figure 5: Example of STON’s purpose

In our representation, we look at the different parts of sentences as either actions or roles. Actions are the dynamic part of the sentence: each action contains the verb and its morphological specifications such as tense, negation, etc. Roles, on the other hand, are generally nominal phrases that have a purpose in the action. They can be agents, themes, places, times, etc.

3.1 Roles

Nominal phrases play roles in an action; they can be agents, themes, places, times, etc. In our case, agents are those who do the action (in either active or passive voice), and themes are those who undergo or experience the action. For example, in Table 1, ”the fat man” and ”the delicious meat” are roles played in the action of ”eating”, which happens in the present. The first phrase plays the role of an agent, while the second plays the role of a theme.

Language Sentence
Arabic alrrajulu alssamiynu ya’ku–lu al-la.hma al-la_diy_d.
/al-Rajulu al-Samīnu ya’kulu al-Laḥma al-Ladhidh./
English The fat man eats the delicious meat.
French Le gros homme mange la viande déliciouse.
Japanese 太った男性は美味しい肉を食べます。
/futotta dansei wa oishii niku o tabemasu./
Table 1: Roles and Actions in Arabic, English, French and Japanese.

In our representation (STON), we list all the roles that appear in the sentences. Each role contains the following attributes (see Figure 6):

  • id: a name for the role in order to reference it.

  • syn: the synset number of the noun in the lexicon (Wordnet in our case).

  • A set of adjective blocks that modify the noun. Each adjective block contains its synset and a set of adverb synsets.

  • qnt: the quantity; it describes the amount of the noun. For example, in ”10 apples” the quantity is 10. By default, it equals 1. It is a number, but it can take the value ”PL” for plural. Some languages, like Arabic, have a dual number (”mu_tannY /muthanna/”), which can simply be represented by the number 2; the language generator can then handle it. As for ordinal numbers, we add ”O” before the number; for example, ”O2” means second.

  • def: defined; many languages, such as Arabic, English and French, can single out a specific noun from many using definite articles.

@r:[
   r:{
      id: <role-ID>;
      syn: <synset>;
      qnt: <[O]number/PL>;
      def: <Y/N>;
      @adj:[
         adj:{
            syn: <synset>;
            adv: [synset, ...];
         adj:}
         adj:{
            ...
         adj:}
      adj:];
   r:}
   r:{
      ...
   r:}
r:]
Figure 6: Roles representation.
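To show how simply this layout can be emitted programmatically, the following Python sketch serializes role blocks in the shape of Figure 6. The dictionary keys (”id”, ”syn”, ”definite”, ”adjs”, ”advs”) and the function names are our own illustration, not part of STON itself; the ”man” synset is taken from Figure 10, while the adjective synset in the usage example is a placeholder, not looked up in WordNet.

```python
def serialize_role(role):
    """Render one role block following the layout of Figure 6."""
    lines = ["   r:{",
             "      id: %s;" % role["id"],
             "      syn: %s;" % role["syn"]]
    qnt = role.get("qnt", 1)          # a number, "PL", or "O<n>" for ordinals
    if qnt != 1:                      # qnt defaults to 1, so omit it then
        lines.append("      qnt: %s;" % qnt)
    if "definite" in role:
        lines.append("      def: %s;" % ("Y" if role["definite"] else "N"))
    adjs = role.get("adjs", [])
    if adjs:
        lines.append("      @adj:[")
        for adj in adjs:
            lines.append("         adj:{")
            lines.append("            syn: %s;" % adj["syn"])
            if adj.get("advs"):       # adverb synsets modifying the adjective
                lines.append("            adv: [%s];" %
                             ", ".join(str(a) for a in adj["advs"]))
            lines.append("         adj:}")
        lines.append("      adj:];")
    lines.append("   r:}")
    return lines

def serialize_roles(roles):
    """Wrap all role blocks in the '@r:[' ... 'r:]' list."""
    body = [line for r in roles for line in serialize_role(r)]
    return "\n".join(["@r:["] + body + ["r:]"])

# "the fat man" from Table 1: noun synset 10287213 ("man", from Figure 10);
# 4999999 is a placeholder adjective synset for this illustration.
print(serialize_roles([{"id": "man", "syn": 10287213, "definite": True,
                        "adjs": [{"syn": 4999999}]}]))
```

Such a generator would sit between a language-specific parser and the STON output in the pipeline of Figure 5.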

3.2 Actions

The verb in a sentence, or in a clause, represents an action. In some languages, like Arabic, there are nominal sentences, which do not include a verb. Nevertheless, we can add a verb to express the action, as in the examples of Table 2. The first sentence is composed of a subject ”mubtada’ /mubtada’/” (a noun) and a predicate ”xabar /khabar/” (an adjective). In the second one, the subject is a noun and the predicate is a prepositional phrase. In both sentences, we can use the copula (to be) as the action; the subject is then considered the agent and the predicate the theme. In the case of the third sentence, where we have an active participle ”’ism faA‘il /ism fā‘il/” whose origin verb is a movement verb like ”to go”, we can consider it as present continuous (Haak, 1997).

Structure                        Arabic                                  English

Noun + Adjective                 alrrajulu (yakuwnu) samiynuN.           The man (is) fat.
                                 /al-Rajulu (yakūnu) samīnun./

Noun + Prep. + Noun              alrrajulu fiy al-ssuwqi.                The man (is) in the market.
                                 /al-Rajulu fī al-Sūqi./

Noun + Active participle + PP    alrrajulu _daAhibuN ’ilY al-ssuwqi.     The man is going to the market.
                                 /al-Rajulu dhahibun ilá al-Sūqi./

Table 2: Example of nominal sentences in Arabic.

To represent the actions, we define a set of action blocks, where each block contains some attributes. The attributes are generally related to the verb, since the action is all about the verb, but a block also contains links to agents, themes, etc. These are the main attributes contained in each action, where ”id” and ”syn” are compulsory (see Figure 7):

  • id: a name for the action in order to reference it.

  • syn: the synset number of the verb in the lexicon (Wordnet).

  • agt: a set of role IDs; those who did the action.

  • thm: a set of role IDs; those who receive the action.

  • A set of adverb blocks that modify the verb. Each adverb block contains its synset and a set of adverb synsets.

  • Some verb specifications: tense (”tns”), progressive aspect (”prg”), perfect aspect (”prf”), negation (”neg”) and modality (”mod”).

@act:[
   act:{
      id: <action_ID>;
      syn: <synset>;
      tns: <PA/PR/FU>;
      prg: <Y/N>;
      neg: <Y/N>;
      mod: <CAN/MAY/MUST>;
      agt: [<id>,...];
      thm: [<id>,...];

      @adv:[
         adv:{
            syn: <synset>;
            adv: [synset, ...];
         adv:}
      adv:]
   act:}
   act:{
      ...
   act:}
act:]
Figure 7: Actions representation.

The tense can be past (”PA”), present (”PR”) or future (”FU”). The absence of tense means the action is tense-free; e.g. ”To do so, you must try”. There are languages which do not have a future tense, such as Arabic and Japanese. In this case, we can detect the tense using adverbs (”tomorrow”) or temporal prepositional phrases (”in the next year”). Other languages distinguish (far/near) past and (far/near) future. For example, in Arabic there is no tense called ”future”, but it can be expressed using auxiliaries. For the near future, ”sa- /sa/” is attached to the verb in the present tense (”sa’a_dhabu /sa’dhhabu/”, I will go <soon>). For the far future, ”sawfa /sawfa/” is detached from the verb in the present tense (”sawfa ’a_dhabu /sawfa adhhabu/”, I will go <later>). But since this distinction can be conveyed using adverbs such as ”soon” and ”later”, we can ignore it.

There are two aspects which are mostly used in Western languages: progressive and perfect. The perfect aspect refers to an action prior to the time under consideration, viewed as already completed. In the STON representation, the aspect is imperfect unless we add (”prf: Y;”). As for the progressive aspect, it describes a situation where a verb is (or was) in motion for an interval of time. Likewise, the action is not progressive unless we add (”prg: Y;”).

Modality, in our case, can express possibility (”MAY”), admissibility (”CAN”) or obligation (”MUST”). The modal verb ”will” is used to express the future, which is a tense in our case. Advice (”You should see a doctor”), prohibition (”You mustn’t smoke here”), certainty (”He must be rich, since he lives there”), permission (”You can leave now”) and lack of necessity (”You don’t have to do anything”) can be represented using these three modal verbs.
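The tense, aspect and modality attributes above can be read off an English verb group with simple rules. The following Python sketch is our own simplification for illustration (the rules, the default present tense, and the function name are assumptions; a real encoder would rely on a parser).

```python
def verb_features(aux_words):
    """Map a list of English auxiliaries to STON verb specifications."""
    feats = {"tns": "PR"}             # assumed default: present tense
    words = [w.lower() for w in aux_words]
    if "not" in words:                # negation: "neg: Y;"
        feats["neg"] = "Y"
    if "will" in words or "shall" in words:
        feats["tns"] = "FU"           # "will" marks the future, not modality
    elif any(w in ("was", "were", "did", "had") for w in words):
        feats["tns"] = "PA"
    if any(w in ("have", "has", "had") for w in words):
        feats["prf"] = "Y"            # perfect aspect: "have/has/had + participle"
    if any(w in ("am", "is", "are", "was", "were", "be", "been") for w in words):
        feats["prg"] = "Y"            # progressive: "be + -ing" verb group
    for modal, code in (("can", "CAN"), ("could", "CAN"), ("may", "MAY"),
                        ("might", "MAY"), ("must", "MUST"), ("should", "MUST")):
        if modal in words:
            feats["mod"] = code       # mapping "should" to MUST is our guess
    return feats

print(verb_features(["will", "have", "been"]))
# {'tns': 'FU', 'prf': 'Y', 'prg': 'Y'}
```

For instance, the verb group of ”is drinking” would yield the present progressive features used in Figure 10.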

3.3 Sentences

A Role-Action representation is not sufficient, since there are sentences which contain consecutive actions. In Table 3, we can observe three actions: going to the market, coming back and watching T.V. If we represented this sentence as three separate sentences instead, we would lose the information that these actions are consecutive. Moreover, we have to specify which actions are the main ones (some actions serve as relatives of roles or of other actions).

Language Sentence
Arabic _dahaba kariym ’ilY as-suwqi _tumma ‘aAda ’ilY al-manzili _tumma ^saAhada at-tilfaAza.
/dhahaba Karīm ilá al-Sūqi thumma ‘āda ilá al-manzili thumma shāhada al-Tilfāza./
English Karim went to the market, came back home, then watched T.V.
French Karim est allé au marché, il est revenu à la maison, ensuite il a regardé la télé.
Japanese カリムさんは市場に行って、家に戻って、テレビを見ました。
/Karimu-san wa ichiba ni itte, ie ni modotte, terebi o mimashita./
Table 3: Example of Consecutive actions.

The sentence part lists sentence blocks. Each block contains the type of the sentence: affirmation (”AFF”), exclamation (”EXC”), question (”QST”) or imperative (”IMP”). It has an attribute ”act” to list the references of the actions in this sentence. The annotation of sentences in STON is illustrated in Figure 8.

@st:[
   st:{
      typ: <AFF/EXC/QST/IMP>;
      act: [<act_id>,...|...]
   st:}
st:]
Figure 8: Sentences representation in STON.
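Since the notation is meant to be easy to process, the whole surface syntax of Figures 6-8 can be read with a small tokenizer. The following Python sketch is our own illustration: the delimiter forms are taken from the paper's figures, but the function names and the Python output layout are assumptions.

```python
import re

# Token forms in STON's surface syntax: list open '@t:[', block open 't:{',
# block close 't:}', list close 't:]', and attribute 'key: value;'.
_TOKEN = re.compile(
    r"@(?P<lopen>\w+):\[|(?P<bopen>\w+):\{|(?P<bclose>\w+):\}"
    r"|(?P<lclose>\w+):\]|(?P<key>\w+):\s*(?P<val>[^;{}@]+);")

def parse_ston(text):
    """Parse STON text into {list_tag: [block_dict, ...]}."""
    tokens = list(_TOKEN.finditer(text))
    doc, i = {}, 0
    while i < len(tokens):
        tag = tokens[i].group("lopen")
        assert tag, "expected a list opener '@tag:['"
        doc[tag], i = _parse_list(tokens, i + 1, tag)
    return doc

def _parse_list(tokens, i, tag):
    """Read blocks until the matching 'tag:]' closer."""
    blocks = []
    while tokens[i].group("lclose") != tag:
        block, i = _parse_block(tokens, i + 1, tag)
        blocks.append(block)
    return blocks, i + 1

def _parse_block(tokens, i, tag):
    """Read attributes (and nested lists such as '@adj:[') until 'tag:}'."""
    block = {}
    while tokens[i].group("bclose") != tag:
        t = tokens[i]
        if t.group("lopen"):          # nested list inside the block
            block[t.group("lopen")], i = _parse_list(tokens, i + 1,
                                                     t.group("lopen"))
        else:
            block[t.group("key")] = t.group("val").strip()
            i += 1
    return block, i + 1

sample = """
@st:[
   st:{
      typ: AFF;
      act: [eat];
   st:}
st:]
"""
print(parse_ston(sample))   # {'st': [{'typ': 'AFF', 'act': '[eat]'}]}
```

Attribute values such as ”[eat]” are kept as raw strings here; resolving them into references to role and action blocks would be the next step of a full parser.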

4 Relations

There are many relations between the clauses and phrases of a sentence. Attributes such as agents and themes in an action can represent only simple sentences. A more complicated sentence can contain adpositional phrases, relative clauses, etc. Figure 9 shows our view of the four relations between roles and actions.

Figure 9: Relations between roles and actions

A Role-Role relation can be expressed by adpositional relations, as in the example ”the man in the car”. Adpositional relations can also be used to express an Action-Role relation (”he is in the car”). Another relation is the Role-Action relation found in relative clauses (”the man who is driving”). Like adjectives, a relative clause can modify (describe) a nominal phrase. A noun can be described by a relative clause, as in the sentence ”the man who was strong”. The latter example is different from ”the strong man”, because the first adds the information of a past quality of strength. The Action-Action relation can be found in adverbial clauses (”he is where I can see him”). Comparison between two roles is another issue: are they equal, or is one more or less than the other? They can share a verb, as in ”I do more work than you”, or an adjective, as in ”I am stronger than you”. More structures have to be added in order to allow these types of relations.

4.1 Relative clauses

Relative clauses are a little challenging to represent using our Role-Action notation. Our notation starts with roles, then actions. Since relative clauses are actions which describe a role, they have to be represented in the role itself. In Table 4, the man drinking water is the one who ate the meat earlier.

Language Sentence
Arabic alrrajulu alla_diy ’akala al-la.hma ya^srabu almaA’a.
/al-Rajulu al-Ladhī akala al-Laḥma yashrabu al-Mā’./
English The man who ate the meat is drinking water.
French L’homme qui a mangé de la viande boit de l’eau.
Japanese 肉を食べた男性が水を飲んでいます。
/niku o tabeta dansei ga mizu o nonde imasu./
Table 4: Example of relative clauses.

One proposition is to use referencing in roles to point to an action as a relative clause. Then, the relationship between the role and the relative clause must be specified. For example, the sentence in Table 4 can be represented as in Figure 10.

@r:[
   r:{
      id: man;
      syn: 10287213;
      @rel:{
         typ: SBJ;
         ref: [ate]
      rel:}
   r:}
   r:{
      id: meat;
      syn: 7649854;
   r:}
   r:{
      id: water;
      syn: 14845743;
   r:}
r:]

@act:[
   act:{
      id: ate;
      syn: 1168468;
      tns: PA;
      thm: [meat];
   act:}
   act:{
      id: drink;
      syn: 1168468;
      tns: PR;
      prg: Y;
      agt: [man];
      thm: [water];
   act:}
act:]
Figure 10: Example of STON representation in case of relative clauses.

Looking at relative pronouns (see Table 5), we can consider four main types of relative phrases: subject, possessive, direct object and indirect object. In this categorization, we treat a person the same as a thing, because the difference between them can be taken into consideration in the text generation task.

             Person           Thing         Place    Time    Reason
Subject      who/that         which/that
Object       who/whom/that    which/that    where    when    why
Possessive   whose            whose
Table 5: Relative pronouns.

  • The subject type (”SBJ”) indicates that the main clause is a subject of the relative one. For example, in the phrase ”the man who ate the meat”, the main clause ”the man” is the subject of the relative one ”ate the meat”.

  • The possessive type (”POS”) indicates that the noun in the relative clause is possessed by the main one. For example, the phrase ”the man whose car is so expensive” describes a man who possess an expensive car.

  • The object type (”OBJ”) indicates that the main clause is a direct object of the relative one. For example, the phrase ”the man whom I saw yesterday” describes a man who is an object of me seeing him. We can deduce the following information from it ”I saw the man yesterday”.

  • The indirect object type covers the clauses that start with prepositions, which are various. The meaning of these clauses follows the meaning of their prepositions. In this case, we form their types using the prefix ”IO_” followed by the adpositional relation (next subsection). For example, ”The town from which I came” describes a town which is an indirect object of me coming; we can deduce from it that ”I came from the town”. As for the relative adverbs ”where” and ”when”, we can replace them with a preposition. For example, ”The town where I met him” is equivalent to ”The town in which I met him”.

  • One other type is the reason (”RSN”), expressed by the relative adverb ”why”.
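The mapping from English relative introducers to these type codes can be sketched as a small lookup. This Python fragment is our own illustration: the function name and the is_object flag are assumptions, and ”where”/”when” are rewritten as ”IO_” plus the IN adpositional relation, as the text suggests.

```python
REL_TYPE = {
    "who": "SBJ",       # subject by default; "whom" forces the object reading
    "which": "SBJ",
    "that": "SBJ",
    "whom": "OBJ",
    "whose": "POS",
    "why": "RSN",
    "where": "IO_IN",   # "the town where I met him" ~ "the town in which ..."
    "when": "IO_IN",
}

def relative_type(pronoun, is_object=False):
    """Return the STON relative-clause type code for an introducer."""
    # "who"/"which"/"that" are ambiguous: the caller must say whether the
    # modified noun is the object of the relative clause.
    if is_object and pronoun in ("who", "which", "that"):
        return "OBJ"
    return REL_TYPE[pronoun]

print(relative_type("whose"))                  # POS
print(relative_type("that", is_object=True))   # OBJ
```

Clauses introduced by an explicit preposition (”from which”, ”to whom”) would instead combine ”IO_” with the relation code of that preposition.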

4.2 Adpositional phrases

Adpositional phrases include prepositional phrases, postpositional phrases and circumpositional phrases. Usually, they are used to describe the time or the place of an action. Arabic, English and French use prepositional phrases, while Japanese uses postpositional phrases. They can be related to an action (”He works at 8”) or to a role (”The mother of the boy”).

Our objective is to define as few relations as possible while still representing the meaning of most adpositions. To that end, we define the following relations, which can be extended in the future:

  • AGO: An amount of time back in the past. e.g. ”I bought it 2 years ago”.

  • FRM: An origin or change of state; e.g. ”I came from Algeria”.

  • IN: The existence at a particular time, place or situation. e.g. ”I wake up at 7 am.”, ”I was born in 1986”, ”I am in town.”, ”I am at home.”.

  • SNC: A period from past till a particular time. e.g. ”I have worked since 2013”.

  • TO: A destination; It can be a location, a person, etc. e.g. ”I am going to the market.”, ”I will give it to him.”,”We waited till noon”.

  • FOR: An amount of time or an objective. e.g. ”I will sleep for 2 hours”, ”the foundation for the support of the cinema.”, ”during the war”.

  • BEF: A time earlier than another (before) or a place in front of something. e.g. ”She's always up before dawn.”, ”I sit in front of the TV.”.

  • AFT: A time later than another (after) or a place at the back of something (behind). e.g. ”He always sleeps after 10 pm.”, ”the towel is behind the door”.

  • BY: Not later than a specified time, a place beside something, or an agent; here, we consider these prepositions the same: ”by”, ”next to”, ”beside” and ”near”. e.g. ”I will finish by 5 pm.”, ”He walks by the river.”, ”diffraction by crystals”.

  • INS: Something inside something else. e.g. ”The present is inside the box.”.

  • OUT: Something outside something else. e.g. ”I am outside the home”.

  • BLW: A place under something; Here, we treat ”Below” and ”Under” as the same. e.g. ”He swims under the bridge.”.

  • ABV: A place above something; Here, we treat ”Above” and ”over” as the same. e.g. ”we walked over the bridge.”.

  • BTW: A place between two or more things. e.g. ”The museum is between two flats.”.

  • THR: From one side to another, or surrounded; Here, we don’t make a difference between ”through” and ”across”. e.g. ”She walked through the forest.”.

  • ON: A subject or something connected to something else. e.g. ”She borrowed a book about mathematics”.

  • WTH: Being together or being involved. e.g. ”I am with him”.

  • OF: Possession or belonging. e.g. ”The leaves of the tree”. This relation is also used to represent compound nouns: ”printer cartridge” becomes ”cartridge of printer”.

  • AS: A role. e.g. ”He works as a teacher.”.

  • UND: A situation. e.g. ”He did the exam under a lot of pressure.”.

In English, ”at” and ”in” are used to express ”exact” and ”wide” locations or times, respectively. In Japanese, there is no difference between exact and wide-range locations and times, but another aspect distinguishes the postpositions ”に /ni/” and ”で /de/”. The first (/ni/) is used to express the existence of something or someone in a place, e.g. ”犬は公園にいます。 /inu wa kōen ni imasu./” (”The dog is in the park.”). The second (/de/) is used to express the place where an action takes place, e.g. ”犬が公園で吠えます。 /inu ga kōen de hoemasu./” (”The dog barks in the park.”). Unfortunately, our representation does not take these aspects into consideration. Nevertheless, we believe this can be solved in the text generation task. For the ”at-in” distinction, we can test the noun to generate the right preposition; likewise, for the Japanese postpositions ”ni-de”, we can use the verb.
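The collapsing of English prepositions into these relation codes can be written as a lookup table. The table below is our own sketch of the mapping described above; context-dependent cases (such as situational ”under”, which the text maps to UND rather than BLW) would need the surrounding phrase, so this flat table is only an approximation.

```python
# English adposition -> STON adpositional relation, following the list above.
ADPOS = {
    "ago": "AGO", "from": "FRM", "in": "IN", "at": "IN",
    "since": "SNC", "to": "TO", "till": "TO",
    "for": "FOR", "during": "FOR",
    "before": "BEF", "in front of": "BEF",
    "after": "AFT", "behind": "AFT",
    "by": "BY", "next to": "BY", "beside": "BY", "near": "BY",
    "inside": "INS", "outside": "OUT",
    "under": "BLW", "below": "BLW",     # situational "under" would be UND
    "above": "ABV", "over": "ABV",
    "between": "BTW", "through": "THR", "across": "THR",
    "on": "ON", "about": "ON",
    "with": "WTH", "of": "OF", "as": "AS",
}

def adpos_relation(adposition):
    """Map an English adposition to its STON relation code (None if unknown)."""
    return ADPOS.get(adposition.lower())

print(adpos_relation("through"))   # THR
print(adpos_relation("of"))        # OF
```

A real encoder would disambiguate by the type of the governed noun (time, place, agent), exactly as the text proposes for ”at”/”in” and ”ni”/”de”.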

4.3 Adverbial clauses

An adverbial clause is a dependent clause which functions as an adverb. In our representation, it expresses the relation between one action and another. These are the relations used to express adverbial clauses:

  • WHN: A specific time. e.g. ”He listens when you talk.”.

  • WHL: A period of time. e.g. ”He listens while you talk.”.

  • WHR: A place. e.g. ”He started where he stopped.”.

  • IF: A condition. e.g. ”He will come if he can.”.

  • SO: A purpose. e.g. ”He tries so he can succeed.”.

  • BCS: A reason. e.g. ”He can’t be angry with her because he likes her.”.

  • THG: A concession. e.g. ”I will come although I don’t like traveling.”.

  • LIK: A manner. e.g. ”He did the job as you asked.”.

  • FTR: After (time). e.g. ”He starts the job after he wakes up.”.

  • BFR: Before (time). e.g. ”Before we came here, the door was shut.”.
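These adverbial relation codes plug into the same labeled ”rel” block structure used elsewhere in STON (cf. the relation block of Figure 13). A minimal serializer sketch, assuming that block layout (the function name is ours):

```python
def rel_block(typ, refs, indent="   "):
    """Print a STON relation block linking an action to another action
    or role, e.g. typ='WHN' for 'He listens when you talk.'.
    The layout follows the rel blocks shown in the paper's figures."""
    pad = indent
    return (pad + "@rel:[\n"
            + pad + "   rel:{\n"
            + pad + "      typ: " + typ + ";\n"
            + pad + "      ref: [" + ",".join(refs) + "];\n"
            + pad + "   rel:}\n"
            + pad + "rel:]")

print(rel_block("WHN", ["talk"]))
```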

4.4 Comparison

Comparison is a little bit tricky; mostly it involves an adjective which is shared between two roles, but it also relates to the action. For instance, in the sentence ”Karim is taller than his brother”, the adjective ”tall” is a property of ”Karim” as it is of ”his brother”. We could represent the adjective in both roles, but the relation between them (the comparison) is better represented in the action ”to be”. True, the comparison relation could be represented in one of the roles (”Karim”) with a reference to the other (”his brother”). But there are cases where the comparison doesn’t involve an adjective; it is about the verb instead. For example, the sentence ”Karim helps less than his brother” contains a comparison over the action and not over an adjective. In the end, the comparison must be represented in the action rather than in a role. Table 6 contains the two examples quoted previously. We must point out that the Arabic example for the sentence ”Karim helps less than his brother” is not very fluent. In order to be fluent, the sentence would be ”kariym ’aqallu musaA‘adaTaN min” ’axiyh. /Karīm aqallu musā‘adatan min akhīh./”, which can be literally translated as ”Karim is less helpful than his brother”. In the Japanese example, the direct translation is ”Karim is less in help than his brother”. We could also use ”カリムさんは兄弟よりも少ないながらも手伝いします。 /Karimu-san wa kyōdai yorimo sukunai nagara tetsudaishimasu./”, which literally means ”Karim helps although it is less than his brother does”, a form which is more polite. What matters to us is the comparison itself: when it means ”less than”, we represent it in the ”cmp” block. Then, in the text generation task, each language decides how to generate it fluently.

Language Sentence
Arabic kariym ’a.twalu min” ’axiyh. /Karīm aṭwalu min akhīh./
kariym yusaA‘idu ’aqalla min” ’axiyh. /Karīm yusā‘idu aqalla min akhīh./
English Karim is taller than his brother.
Karim helps less than his brother.
French Karim est plus grand que son frère.
Karim aide moins que son frère.
Japanese カリムさんは兄弟より背が高いです。 /Karimu-san wa kyōdai yori se ga takai desu./
カリムさんは兄弟より手伝うのが少ないです。 /Karimu-san wa kyōdai yori tetsudau no ga sukunai desu./

Table 6: Example of sentences with comparative.

The idea is to use a ”cmp” block inside the action block (see Figure 11). There are three kinds of comparison: comparative, superlative and equality. Distinguishing the ”less” direction as well, we get five types: less (”L”) and more (”M”) for the comparative, least (”LT”) and most (”MT”) for the superlative, and equal (”EQ”). The first parameter of the comparison is the agent itself, and the second is a reference inside the comparison block. The STON representations of the examples of Table 6 are given in Figure 12.

@act:[
   act:{
      ...
      @cmp:[
         cmp:{
            typ: <L/M/LT/MT/EQ>;
            adj: [<synset>, ...];
            ref: [<id>,...|...];
         cmp:}
      cmp:]
   act:}
act:]
Figure 11: Comparison block.
@act:[
   act:{
      id: be_tall;
      syn: 2604760;
      tns: PR;
      @cmp:[
         cmp:{
            typ: M;
            adj: [2385102];
            ref: [brother];
         cmp:}
      cmp:];
      agt: [karim];
   act:}
act:]
@act:[
   act:{
      id: help;
      syn: 2547586;
      tns: PR;
      @cmp:[
         cmp:{
            typ: L;
            ref: [brother];
         cmp:}
      cmp:];
      agt: [karim];
   act:}
act:]
Karim is taller than his brother. Karim helps less than his brother.
Figure 12: Example of STON representation of comparison.
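The comparison block of Figure 11 is regular enough to be emitted mechanically. A minimal Python sketch (the function name is ours), producing the ”cmp” block of the first example of Figure 12:

```python
def cmp_block(typ, refs, adj=None, indent="      "):
    """Emit a STON comparison block (layout as in Figure 11).
    typ  : one of L, M, LT, MT, EQ
    refs : role ids compared against
    adj  : optional list of adjective synset offsets"""
    lines = [indent + "@cmp:[", indent + "   cmp:{"]
    lines.append(indent + "      typ: " + typ + ";")
    if adj:
        lines.append(indent + "      adj: ["
                     + ", ".join(str(a) for a in adj) + "];")
    lines.append(indent + "      ref: [" + ",".join(refs) + "];")
    lines += [indent + "   cmp:}", indent + "cmp:]"]
    return "\n".join(lines)

print(cmp_block("M", ["brother"], adj=[2385102]))
```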

5 Some issues and their solutions

With the Role-Action-Sentence notation, we can represent a lot of sentences. But some ambiguities arise when we want to represent certain others. To handle this, we show some problems and their solutions in the context of STON.

5.1 Coordination between references

One problem is how to represent the coordination between references. There are, principally, two main coordinations: disjunction (”or”) and conjunction (”and”). To limit ambiguity when representing a sentence, we must use either disjunctions of conjunctions or the inverse. For example:

  • Mother and son ate food.

  • Father and son ate food.

These two sentences can be aggregated in two ways:

  1. Mother and son or father and son ate food. (disjunctions of conjunctions);

  2. Mother or father and son ate food. (conjunctions of disjunctions).

The first one sounds more appropriate than the second. Moreover, the second sounds as if either the mother alone, or the father and the son, ate the food. So the notation of agents and themes must be changed: using disjunctions of conjunctions, the agents of the sentence can be represented as ”agt: [mother, son | father, son];”.
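In code, a disjunction of conjunctions maps naturally onto a list of lists. A small illustrative helper (not part of any STON tool) that renders such an agents field:

```python
def agents_field(alternatives):
    """Render agents as a disjunction of conjunctions:
    each inner list is an 'and' group; groups are separated by '|'."""
    groups = [", ".join(group) for group in alternatives]
    return "agt: [" + " | ".join(groups) + "];"

print(agents_field([["mother", "son"], ["father", "son"]]))
# agt: [mother, son | father, son];
```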

5.2 Proper names

A proper name is a phrase that identifies one unique entity within a class of entities. For example, ”London” is distinguished from the common noun ”city”; it is more specific. Proper names can denote people (e.g. ”Abdelkrime Aries”), locations (e.g. ”Algeria”), organizations (e.g. ”ESI”), etc.

The problem with proper names is how to represent them inside a role. STON is heavily based on Wordnet’s synsets (or any other lexicon): each word must have a synset to be represented. Some proper names already have synsets in Wordnet, such as U.S. cities. But a city such as ”Jijel”, an Algerian city, has no synset number in Wordnet, and there are many other proper names, such as persons’ names, which don’t exist in Wordnet. One solution is to add an attribute ”nam” to the role section, which contains the named entity.

In STON, a role must always have a synset or a pronoun. To provide more information about the role (here, the proper name), we can specify its hypernym’s synset. For instance, the sentence ”Karim lives at Jijel” (see Figure 13) has two proper names, ”Karim” and ”Jijel”; for each, we provide the synset of ”person” and ”city” respectively.

@r:[
   r:{
      id: karim;
      syn: 7846;
      nam: Karim;
   r:}
   r:{
      id: jijel;
      syn: 8524735;
      nam: Jijel;
   r:}
r:]

@act:[
   act:{
      id: lives;
      syn: 2649830;
      tns: PR;
      agt: [karim];
      @rel:[
         rel:{
            typ: IN;
            ref: [jijel];
         rel:}
      rel:]
   act:}
act:]

@st:[
   st:{
      typ: AFF;
      act: [lives];
   st:}
st:]
Figure 13: STON representation of the sentence ”Karim lives at Jijel”.
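A role for a proper name thus carries both a ”nam” attribute and its hypernym’s synset. A hedged sketch of a builder for such roles (the function name is ours), reproducing the ”jijel” role of Figure 13:

```python
def proper_name_role(role_id, name, hypernym_synset):
    """Build a STON role block for a proper name missing from Wordnet:
    'nam' holds the entity name (spaces become underscores) and
    'syn' holds its hypernym's synset, e.g. 'city' for 'Jijel'."""
    return ("r:{\n"
            "   id: " + role_id + ";\n"
            "   syn: " + str(hypernym_synset) + ";\n"
            "   nam: " + name.replace(" ", "_") + ";\n"
            "r:}")

print(proper_name_role("jijel", "Jijel", 8524735))
```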

5.3 Verbs with two objects

In languages like Arabic and English, a verb can take two objects without any preposition. Table 7 shows an example with the sentence ”the man gave the boy a gift”; the bold phrase in each language is the indirect object. We can instead use the prepositions ”to” and ”li– /li/” with the indirect object: the English sentence becomes ”The man gave a gift to the boy.”, and similarly the Arabic sentence becomes ”’a‘”.taY al-rrajulu hadiyyaTaN lil-.t.tif”li. /a‘ṭá al-Rajulu hadiyyatan li-al-Ṭifli./”. So we can use adpositional phrases to represent the relation with the indirect object.

Language Sentence
Arabic ’a‘”.taY al-rrajulu al-.t.tif”la hadiyyaTaN.
/a‘ṭá al-Rajulu al-Ṭifla hadiyyatan./
English The man gave the boy a gift.
French L’homme a donné un cadeau à l’enfant.
Japanese 男性は少年に贈り物があげました。
/dansei wa shōnen ni okurimono ga agemashita./
Table 7: Example of sentences with two objects.

5.4 Pronouns

At first, we thought of using references to the original role instead of personal pronouns; in the generation phase, we can generate them when the role is referred to several times. This can be applied in normal situations where we have all the information, including the anaphoric relations. But in situations where we want to represent sentences generated from extractive summarization, for example, we may have a lot of trouble recovering these relations. Also, to speed up the analysis (from natural language to STON) and generation (from STON to natural language) tasks, pronouns are needed.

The pronouns can be classified using many features; Table 8 represents the eight features used for pronoun classification according to Seah and Bond (2014). Based on these features, we represent the pronouns using two attributes:

  • typ: the type of the pronoun, which is encoded in 6 characters:

    • 1st: ”D” for demonstrative pronoun (this, that, etc.); ”S” for subjective personal pronoun (I, he, etc.); ”O” for objective personal pronoun (me, him, etc.) and ”P” for possessive pronoun (my, his, etc.).

    • 2nd: ”F” for first person; ”S” for second person and ”T” for third person.

    • 3rd: ”S” for singular; ”D” for dual; ”P” for plural and ”N” for a not defined number.

    • 4th: ”F” for female; ”M” for male and ”N” for neuter. Even if the source language attributes gender to objects, as Arabic and French do, we consider them neuter. For example, in Arabic the chair is masculine, while it is feminine in French.

    • 5th: This is reserved for formality and politeness. ”R” for rude; ”C” for casual; ”F” for formal and ”P” for polite. In languages where there is no formality level in pronouns, we choose to use formal by default.

    • 6th: The proximity can be: ”D” for distal; ”M” for medial; ”P” for proximal and ”N” for not defined.

  • ref: the reference(s) to the role(s) related to this pronoun.

Head: Demonstratives, Entity, Time, Manner, Person, Place, Reason, Thing, Personal (1e, 1i, 2, 3), Quantifier
Number: Dual, Plural, Singular
Gender: Feminine, Masculine, Neuter
Case: Objective, Possessive, Subjective
Type: Assertive, Elective, Negative, Other, Reciprocal, Universal, Interrogative, Reflexive
Formality: Formal, Informal
Politeness: Polite
Proximity: Distal, Medial, Proximal

Table 8: Pronoun features according to (Seah and Bond, 2014).

The pronoun attribute can be accompanied by a synset to provide a more compact format. For instance, Figure 14 represents a case where a pronoun and a noun are packed together. The pronoun ”his” is represented as ”PTSMFN”, which means: possessive, third person, singular, masculine, formal and without proximity. This representation is better than transforming the clause to ”first novel of him” before representing it.

r:{
    id: his_novel;
    syn: 3833065;<novel>
    typ: PTSMFN;<his>
    qnt: O1;<first>
r:}
Figure 14: STON representation of the role ”his first novel”.
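Since the 6-character code is positional, decoding it is a table lookup per position. An illustrative decoder (our own sketch, not part of any STON tool), applied to the ”PTSMFN” code above:

```python
# Positional feature tables for the 6-character pronoun code of Section 5.4.
PRONOUN_FEATURES = [
    ("kind",      {"D": "demonstrative", "S": "subjective",
                   "O": "objective", "P": "possessive"}),
    ("person",    {"F": "first", "S": "second", "T": "third"}),
    ("number",    {"S": "singular", "D": "dual", "P": "plural",
                   "N": "undefined"}),
    ("gender",    {"F": "female", "M": "male", "N": "neuter"}),
    ("formality", {"R": "rude", "C": "casual", "F": "formal", "P": "polite"}),
    ("proximity", {"D": "distal", "M": "medial", "P": "proximal",
                   "N": "undefined"}),
]

def decode_pronoun(code):
    """Decode a 6-character pronoun type code, e.g. 'PTSMFN'."""
    assert len(code) == 6
    return {name: table[char]
            for (name, table), char in zip(PRONOUN_FEATURES, code)}

print(decode_pronoun("PTSMFN"))
```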

5.5 Passive voice

The passive voice doesn’t have a specific representation in STON. A sentence like ”Karim ate the apple” (see Table 9) has the same representation as ”The apple was eaten by Karim”. The choice between active and passive voice is made by the text generation task. As for sentences which don’t have an agent, such as ”An apple was eaten”, we simply don’t define an agent in the representation. In this case, when we generate the sentence, we have to use the passive voice.

Language Sentence
Arabic al-ttuffaA.haTu ’ukilat” min” .tarafi kariym”.
/al-tuffaḥah ukilat min ṭarafi Karīm./
English The apple was eaten by Karim.
French La pomme a été mangée par Karim.
Japanese 林檎がカリムさんに食べられました。
/Ringo ga Karimu-san ni taberaremashita./
Table 9: Example of sentences in passive voice.

5.6 Complementizer phrase

A complementizer phrase can be a subject or an object. In our representation, we rather consider it an agent or a theme. Figure 15 represents an example of STON annotation in the case of complementizer phrases. Here, the phrase ”STON represents sentences” is a theme, and ”Karim” is the agent of the action ”hoping”.

@act:[
   act:{
      id: represent;
      syn: 988028;
      tns: PR;
      agt: [ston];
      thm: [sentence];
   act:}
   act:{
      id: hope;
      syn: 1811441;
      tns: PR;
      agt: [karim];
      thm: [represent];
   act:}
act:]
Figure 15: Actions of the sentence ”Karim hopes that STON represents sentences”.

The structure ”VERB + TO + INFINITIVE” is handled as a complementizer. A sentence like ”I want to go there” can be seen as if ”to go there” were the theme of ”wanting”.

6 STON tools and corpora

The STON notation is intended to be used in NLP applications concerned with sentence syntax in a multilingual context. It can be used in language generation as an intermediate language for machines. This implies that we can use it as a means for text translation, text summarization, etc. Parsing the STON language should be fast, since its grammar is well defined (blocks, references, etc.). The most challenging task is to create tools to parse text into STON and to generate text from STON. Corpora for testing are a must; this is why we started to annotate some materials already annotated by UNL.

6.1 Annotation process

Annotating some materials can be helpful in the future, especially since we intend to use STON for other applications. Some grammatical structures cannot be represented without prior transformation, such as apposition (which can be replaced by ”which is”). Here are some steps to follow in order to produce a good STON annotation:

  • Begin by representing the roles (nominal phrases) so that the dependent role comes last. For example, in the sentence ”the statue of liberty” we represent the role ”liberty” first, then the role ”statue”, which has an ”OF” relation to the first one.

  • Adjectives can’t be represented alone. For predicative adjectives, create a role combining the noun of the subject and the adjective. For example, ”The man is friendly” would be represented as ”The man is friendly man”.

  • As for proper names, if they exist in Wordnet, just put their synset (e.g. ”Cairo”). If they don’t, put them as the value of the attribute ”nam” (spaces must be transformed to underscores) and provide the synset of their type: city, person, animal, dog, cat, etc. (the more specific the type, the better).

  • Transform enumerations of a role to a relation ”OF”. For example, the sentence ”He did many jobs: web designer, engineer and teacher” will be ”He did many jobs of web designer, engineer and teacher”.

  • When you find prepositions we didn’t talk about, try to find a similar one among the relations. For example, the expression ”as well as” can be considered a simple ”and”.

  • Omit chronological order indicators (first, then, finally) and represent them as consecutive actions.

  • Try to transform expressions in order to fit the STON representation. For example: ”much of it experimental” will be ”which is much experimental”.

  • Comparison doesn’t exist in roles; it exists only in actions. To work around this, we add the expression ”an amount which is”; such a comparison must not carry an adjective. For example, ”the author of more than 200 articles” becomes ”the author of an amount which is more than 200 articles”.

We started to annotate some biographies (http://www.undl.org/unldoc/bb.htm) already annotated by UNL. You can check the annotated texts in the Nolporas project (https://github.com/kariminf/NaLanPar), which aims to create corpora mostly for STON. Table 10 shows some statistics on the two biographies we annotated. The biographies contain long sentences, with an average of 29 words. The synsets in STON are the equivalent of UWs in UNL; here the number is lower because we use references to roles: roles are sometimes repeated in other sentences, so we don’t need to repeat their descriptions. Relations in UNL correspond to our relations (adpositions, relatives and adverbials) and the references of themes and agents. Concerning the manual annotation, each sentence took us from half an hour to one hour, since we have to choose between different senses and pick the right relations.

Naguib Mahfouz Bio Louis de Broglie Bio
Original texts statistics:
# Sentences 10 35
# Words 287 1055
# Words/Sentence 29 30


UNL statistics:
# UWs with redundancy 264 803
# Relations 172 507
# Attributes 229 703


STON statistics:
# Synsets with redundancy 132 373
# Roles 85 237
# Actions 23 82
# Relations 61 183


Table 10: Statistics on the annotated texts.

Automating the annotation process would be more interesting. We started working on another project called NaLanPar (https://github.com/kariminf/NaLanPar), a natural language parser which aims to generate sentence representations, including STON, from texts. It uses open-source text parsers such as the Stanford Parser (Chen and Manning, 2014), a syntactic parser for many languages (English, Arabic, etc.) licensed under the GPL. Currently, we are working on transforming English text to STON; more languages are to be added afterwards. At the moment, it can handle sentences of the form ”Subject Verb Object Preposition Noun”.

6.2 STON parsing

We created a parser for STON in a project called SentRep (https://github.com/kariminf/SentRep), for ”sentence representation”. Its aim is to implement parsers for different sentence representation languages, taking STON as the primary focus. The parser is an abstract class which can be extended to handle the cases where the parser finds a role, an adjective, an action, etc.
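Because every block is opened and closed with an explicit label (act:{ ... act:}), a tokenizer for STON needs only a single regular-expression pass. A minimal sketch (not the actual SentRep code, which is in Java):

```python
import re

# One pattern covers both token kinds of the STON grammar:
# labeled delimiters like "@act:[", "act:{", "act:}", "act:]",
# and fields like "id: lives;".
TOKEN = re.compile(r"@?(\w+):([\[\]{}])|(\w+):\s*([^;]+);")

def tokenize(ston):
    tokens = []
    for m in TOKEN.finditer(ston):
        if m.group(2):  # block open/close delimiter
            tokens.append(("block", m.group(1), m.group(2)))
        else:           # key/value field
            tokens.append(("field", m.group(3), m.group(4).strip()))
    return tokens

print(tokenize("@act:[ act:{ id: lives; tns: PR; act:} act:]"))
```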

To test the parsing speed, we used a PC with an i7 processor (2 GHz, 4 cores) and OpenJDK 7. The parser does nothing when it finds the different parts (it just calls the functions which will treat them). For each biography (Naguib Mahfouz and Louis de Broglie) we executed the parser 10 times and measured the parsing time in milliseconds. Parsing the Naguib Mahfouz biography took between 60ms and 71ms, with an average of 67ms, i.e. almost 7ms per sentence. As for the Louis de Broglie biography, parsing took between 132ms and 176ms, with an average of 163ms, i.e. almost 5ms per sentence.

6.3 Generation

To generate sentences from a STON representation, we are currently working on another project called NaLanGen (https://github.com/kariminf/NaLanGen), a natural language generator. NaLanGen aims to generate text from different sentence representations (it uses SentRep for that). It uses open-source text realizers such as SimpleNLG (Gatt and Reiter, 2009), a realization engine for English licensed under the MPL. It has been adapted to many other languages: German (Bollmann, 2011), French (Vaudry and Lapalme, 2013) and Brazilian Portuguese (de Oliveira and Sripada, 2014). To map Wordnet synsets to other languages, we use the Open Multilingual Wordnet (Bond and Foster, 2013). The mapping is not complete for many languages; as a result, generation may fail when a synset is not found.

We tried to generate some English and French text from a STON annotation. Table 11 shows an example with the first two sentences of the Naguib Mahfouz biography. The resulting text was fair for English and rather poor for French. This is why more work has to be done to address some issues we found:

  • Sometimes the mapping to other languages is done automatically which leads to some errors in translating concepts. For example, we found that the concept ”Cairo” was mapped to ”OSS 117 : Le Caire” in French. As we can see in the example, many other concepts are wrong.

  • The synsets contain many words, and some words are more adequate than others. This is why we have to propose a method to choose these words, based for example on their frequency of use.

  • We can’t generate directly from STON; some changes must be made to obtain more fluent texts. For example, predicative adjectives are expressed as a role with a noun: the sentence ”The child is happy” is represented as ”The child is happy child”.

Source Born in Cairo in 1911, Naguib Mahfouz began writing when he was seventeen. His first novel was published in 1939 and ten more were written before the Egyptian Revolution of July 1952, when he stopped writing for several years.
English generated text Naguib Mahfouz which was given birth in Cairo in 1911 began writing when he was 17 years. First, his novel was published in 1939 and 10, more novels were written before the revolution of Egyptians of July 1952 in which he discontinued writing for several years.
French generated text Naguib Mahfouz que a été accouché à un Le Caire à 1911 a débuté un œuvre quand lui a été de 17 années. Son premier nouveau a été publié à 1939 et de 10 nouveaux plus ont été écrits avant le tour de des égyptiens de July 1952 à lequel lui a cessé un œuvre pour des années es.
Table 11: Example of sentences generation to English and French.

7 Limitations and challenges

Although STON can represent a wide range of sentences, it shows some limitations. Sometimes, the variation between languages prevents us from representing sentences properly: we don’t always find the same syntactic realization of a meaning in every language. For instance, considering Table 12, the adjective ”hungry” has no exact translation in Japanese; the Japanese sentence literally means ”my stomach emptied”. The same adjective is translated to a noun in French, where the sentence literally means ”I have a hunger”. In Arabic, we can use the sentence ”’anA ju‘t. /anā ju‘t/”, which means the same thing; it is composed of the pronoun ”I” and the verb ”to be hungry” conjugated in the past. Also, since STON is based on Wordnet synsets, it is limited to concepts extracted from the English language. So, if we want to represent a sentence from Arabic, for example, we have to find Wordnet concepts that are close to the meaning of the sentence’s words.

STON can’t deal with text structures like paragraphs, lists, tables, etc. Many languages have complex predicates which can express the same meaning as a single verb. For instance, ”I gave the baby a bath” and ”I bathed the baby” have the same meaning. Unfortunately, STON is not a fully semantic representation, which means it handles these as two different forms. A sentence like ”the purpose is to teach mathematics and develop physics” can be represented without problems. But in the case of the sentence ”the purpose is to teach and develop physics”, we have to use redundancy and represent it as ”the purpose is to teach physics and develop physics”.

Language Sentence Romanization
Arabic ’anA jaw‘aAn. /anā jaw‘ān./
English I am hungry.
French J’ai faim.
Japanese お腹が空いた。 /onaka ga suita./
Table 12: Example of sentences which don’t have the same POS in different languages.

8 Discussion

STON is all about representing the different parts of a sentence independently from its structure in natural languages. It is meant to transport sentence information between different applications (programs). It represents the syntactic relations between the different parts of sentences in a multilingual way; when syntax fails to keep the multilingual aspect, it uses semantic relations instead. To understand STON better, we have to know what this language is not about:

  • It does not represent the relations between the parts of speech semantically, even if there are some relations in relative clauses. For example, it does not allow us to represent relations like UNL does, such as beneficiary relation and purpose relation.

  • It is not a format for storing texts, such as Open document format (ODF) and Microsoft office format which are based on XML.

There are similarities and differences between STON and other representation languages. A concise comparison between KANT, UNL, AMR and STON is given in Table 13.

Objective — KANT: interlingua for machine translation of English technical manuals. UNL: represent the meaning of texts without ambiguity, to be used as a language of the web. AMR: write down the meanings of English sentences. STON: represent sentences in a multilingual way without relying heavily on semantics, to be used as an interchange format between applications.
Aspects — KANT: semantics, with morphological aspects (tense, etc.). UNL: morphology, semantics, pragmatics. AMR: semantics, without morphological aspects (tense, etc.). STON: syntax, with morphological aspects (tense, etc.), plus semantics when syntax fails to be multilingual.
Concepts — KANT: its own. UNL: UWs (UNL ontology). AMR: PropBank frames. STON: WordNet synsets.
Dependency — KANT: domain dependent. UNL: language independent. AMR: English language dependent. STON: language independent.
Relations — KANT: its own. UNL: its own. AMR: PropBank relations. STON: its own.
Readability — KANT: readable. UNL: difficult to follow. AMR: less readable. STON: less readable, difficult for a big text.
Table 13: Comparison between STON and other well-known annotation formats.

Regarding meaning, UNL is the representation that captures it most fully. AMR, on the other hand, also represents meaning, but it lacks some morphological aspects such as verb tense. The KANT and STON annotations depend less on semantics than the two previous ones. Our objective is to represent sentences with a minimum cost (time and processing effort). This is why we represent relations like ”from” as they are, even if they can mean many things: ”from 2 am”, ”from London”, etc.

Semantic representation is a powerful tool because it allows us to represent what could be understood from a sentence. It can eliminate redundancy in sentences; For example (Banarescu et al., 2013):

  • The soldier was afraid of battle.

  • The soldier feared battle.

  • The soldier had a fear of battle.

These sentences have the same meaning, therefore their representation should be the same. In contrast with UNL and AMR, STON doesn’t go deep into the semantic relations, nor does it try to comprehend the sentence as a whole. Figure 16 shows the STON representation of these 3 sentences. It is clear that the representation distinguishes between the verbs, nouns and adjectives.

act:{
   id: was;
   syn: 2604760;
   tns: PA;
   agt: [soldier];
   thm: [afraid-soldier];
   @rel:[
     rel:{
       typ: OF;
       ref: [battle];
     rel:}
   rel:]
act:}

act:{
   id: feared;
   syn: 1780729;
   tns: PA;
   agt: [soldier];
   thm: [battle];
act:}

act:{
   id: had;
   syn: 121046;
   tns: PA;
   agt: [soldier];
   thm: [fear];
   @rel:[
     rel:{
       typ: OF;
       ref: [battle];
     rel:}
   rel:]
act:}
Figure 16: STON action representation of 3 sentences with same meaning.

The four languages are based on different ontologies and lexicons to represent their concepts. The KANT annotation is based on concepts defined especially for the KANT system, generally limited to the domain of technical reports. Likewise, UNL defines its own concept base called the UNL ontology, in which each concept is referred to as a universal word (UW). AMR and STON use PropBank and Wordnet respectively, which is why they are limited to these two bases.

AMR and KANT are more based on English, while UNL and STON seek to be multilingual. When it comes to the multilingual aspect, the UWs of UNL are very powerful. For the time being, we use Wordnet to represent the different concepts. Unfortunately, to date, the mapping to other languages is not complete. For this reason, we had some problems generating French text from the STON annotation when a synset couldn’t be found.

Readability is important when we want to create a sentence representation manually or to check it after automatic generation. The KANT annotation is more readable than the other three representations. STON is developed as a machine language, like UNL; even so, we want to allow some space for readability. This can be helpful when we want to test a system that uses STON, because it will be easier to create test banks.

9 Conclusion

In this work, we proposed an inter-application language (STON) aimed at representing sentence structures in a multilingual context. STON is somewhat based on the JSON representation, with some adjustments to speed up parsing. It rests on the assumption that anything in the sentence is either a role or an action, with relations between them. The representation uses both syntactic structure (noun definition, verb tense, subjects, objects, etc.) and semantic relations (time and place relations, etc.). To support multilingualism, we use concepts instead of words (in our case, Wordnet’s synsets). Our intention is to use STON as a means of communication between different applications. More specifically, the language is intended to be used in cross-lingual automatic text summarization.

STON is far from complete or perfect. There are still improvements to be made in the future, such as the problem of cross-language syntactic alignment of words. Because it is based on Wordnet, the concepts are limited to English; there are many concepts that don’t exist in English but do exist in other languages. Exploiting a larger semantic network with a richer knowledge base, like BabelNet (http://babelnet.org) (Navigli and Ponzetto, 2012), may improve inter-language representation.

Acknowledgement

Special thanks to Hisham Omar for his valuable feedback concerning Japanese examples.

References

  • Bertran et al. [2008] Manuel Bertran, Oriol Borrega, Marta Recasens, and Bàrbara Soriano. Ancorapipe: A tool for multilevel annotation. 41:291–292, 2008. ISSN 1135-5948.
  • Recasens and Martí [2009] Marta Recasens and M. Antònia Martí. Ancora-co: Coreferentially annotated corpora for spanish and catalan. Language Resources and Evaluation, 44(4):315–345, 2009. ISSN 1574-0218. doi: 10.1007/s10579-009-9108-x. URL http://dx.doi.org/10.1007/s10579-009-9108-x.
  • Lopatková et al. [2011] Markéta Lopatková, Petr Homola, and Natalia Klyueva. Annotation of sentence structure. Language Resources and Evaluation, 46(1):25–36, 2011. doi: 10.1007/s10579-011-9162-z. URL http://dx.doi.org/10.1007/s10579-011-9162-z.
  • Miller [1995] George A. Miller. Wordnet: A lexical database for english. Commun. ACM, 38(11):39–41, November 1995. ISSN 0001-0782. doi: 10.1145/219717.219748. URL http://doi.acm.org/10.1145/219717.219748.
  • Uchida et al. [1999] Hiroshi Uchida, Meiying Zhu, and Tarcisio Della Senta. The unl, a gift for a millennium. 1999. URL http://www.undl.org/publications/gm/index.htm.
  • Banarescu et al. [2013] Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W13-2322.
  • Mitamura et al. [1991] Teruko Mitamura, Eric H Nyberg, and Jaime G Carbonell. An efficient interlingua translation system for multi-lingual document production. In Proceedings of the Third Machine Translation Summit, 1991.
  • Czuba et al. [1998] Krzysztof Czuba, Teruko Mitamura, and Eric H Nyberg. Can practical interlinguas be used for difficult analysis problems? In Proceedings of AMTA-98 Workshop on Interlinguas, 1998.
  • Uchida and Zhu [2005] Hiroshi Uchida and Meiying Zhu. Unl2005 from language infrastructure toward knowledge infrastructure. Special Speech, Pacific Association for Computational Linguistics (PCLING 2005), 2005.
  • Boguslavsky [2013] Igor Boguslavsky. Some lexical issues of unl. Universal Networking Language: Advances in Theory and Applications, pages 101–108, 2013.
  • Martins [2013] R. Martins. Lexical Issues of UNL: Universal Networking Language 2012 Panel. EBSCO ebook academic collection. Cambridge Scholars Publishing, 2013. ISBN 9781443852814. URL https://books.google.dz/books?id=tdcwBwAAQBAJ.
  • Kingsbury and Palmer [2002] Paul Kingsbury and Martha Palmer. From TreeBank to PropBank. In Language Resources and Evaluation, 2002.
  • Haak [1997] M. Haak. The Verb in Literary and Colloquial Arabic. Functional grammar series. Mouton de Gruyter, 1997. ISBN 9783110154016. URL https://books.google.dz/books?id=I21G6qVQibkC.
  • Seah and Bond [2014] Yu Jie Seah and Francis Bond. Annotation of pronouns in a multilingual corpus of Mandarin Chinese, English and Japanese. In 10th Joint ACL - ISO Workshop on Interoperable Semantic Annotation, pages 82–87, Reykjavik, Iceland, 2014.
  • Chen and Manning [2014] Danqi Chen and Christopher Manning. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar, October 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D14-1082.
  • Gatt and Reiter [2009] Albert Gatt and Ehud Reiter. SimpleNLG: A realisation engine for practical applications. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009), pages 90–93, Athens, Greece, March 2009. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W09-0613.
  • Bollmann [2011] Marcel Bollmann. Adapting SimpleNLG to German. In Proceedings of the 13th European Workshop on Natural Language Generation, pages 133–138, Nancy, France, September 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W11-2817.
  • Vaudry and Lapalme [2013] Pierre-Luc Vaudry and Guy Lapalme. Adapting SimpleNLG for bilingual English-French realisation. In Proceedings of the 14th European Workshop on Natural Language Generation, pages 183–187, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W13-2125.
  • de Oliveira and Sripada [2014] Rodrigo de Oliveira and Somayajulu Sripada. Adapting SimpleNLG for Brazilian Portuguese realisation. In Proceedings of the 8th International Natural Language Generation Conference (INLG), pages 93–94, Philadelphia, Pennsylvania, U.S.A., June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W14-4412.
  • Bond and Foster [2013] Francis Bond and Ryan Foster. Linking and extending an open multilingual wordnet. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1352–1362, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P13-1133.
  • Navigli and Ponzetto [2012] Roberto Navigli and Simone Paolo Ponzetto. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250, 2012.