The ARIEL-CMU Systems for LoReHLT18

by Aditi Chaudhary, et al.

This paper describes the ARIEL-CMU submissions to the Low Resource Human Language Technologies (LoReHLT) 2018 evaluations for the tasks Machine Translation (MT), Entity Discovery and Linking (EDL), and detection of Situation Frames in Text and Speech (SF Text and Speech).


I Introduction

The Low Resource Human Language Technologies (LoReHLT) program is a DARPA-sponsored program aiming to dramatically advance the state of computational linguistics and human language technology to enable rapid, low-cost development of capabilities for low-resource languages (https://www.nist.gov/itl/iad/mig/lorehlt-evaluations). The ARIEL-CMU team participated in three tasks (Entity Discovery and Linking, Machine Translation, and Situation Frame detection for Text and Speech) and also submitted a number of contrastive systems. We built systems for two incident languages (ILs): IL9 (Kinyarwanda) and IL10 (Sinhala).

II Submission Highlights

  • NER/EDL Highlights:

    • For both IL9 and IL10, our NER system is trained on data acquired primarily through cross-lingual transfer from English and related languages, together with annotations from native and non-native speakers. The system benefits from additional training data, pre-trained word embeddings, gazetteers for post-processing, and model ensembling.

    • Our major improvements this year came from efficient use of both native informants and non-native annotators. Different strategies leveraging outputs from the Situation Frame and Machine Translation teams were used to obtain annotations of both good quality and quantity.

  • MT Highlights:

    • Our MT systems took a two-pronged approach of using phrase-based statistical MT and neural MT.

    • Our neural MT models were trained massively multilingually before the evaluation started, then adapted to the incident languages and other related languages.

    • We performed extensive data cleaning and selection to ensure that noise in the provided training data did not adversely affect results.

    Fig. 1: Overall architecture of ARIEL-CMU system in LoReHLT 2018.
  • SF Highlights:

    • We processed the IL data and English data using the same models and pipelines, as our cross-lingual models can handle both IL and English texts. Speech data was first converted into text before feeding it into our SF pipeline.

    • Our major improvements this year came from training-data augmentation with a bootstrapping approach and data cleanup assisted by active learning.

  • SF Speech Highlights:

    • Our Speech pipeline was designed to convert the IL speech data into text to be fed to the existing individual pipelines. This would unify the SF, NER, and EDL systems built for speech and text.

    • Our major improvements this year came from efficient NI data collection, a domain-robust feature extractor, vocabulary pruning, and cross-lingual transfer from Swahili for IL9.

III Data Resources

III-A Resources Included in the IL Packs

III-A1 LDC2018E55 (IL9: Kinyarwanda)

From the LDC2018E55 pack, we made use of:

  • the monolingual Kinyarwanda text in Set0 and Set1 (for constructing language models, word vectors, and as data for annotation and NI recording)

  • the parallel Kinyarwanda-English data in Set0 (for machine translation training data, for deriving bilingual lexicons, for training multilingual word vectors, and as an aid to non-speaker annotators)

  • the monolingual English text in SetS (for verifying our English systems and as data for annotation)

  • the included and linked Kinyarwanda-English bilingual dictionaries (for MT training data, multilingual word vectors).

III-A2 LDC2018E57 (IL10: Sinhala)

From the LDC2018E57 pack, we made use of:

  • the monolingual Sinhala text in Set0 and Set1 (for constructing language models, word vectors, and as data for annotation and NI recording)

  • the parallel Sinhala-English data in Set0 (for machine translation training data, for deriving bilingual lexicons, for training multilingual word vectors, and as an aid to non-speaker annotators)

  • the monolingual English text in SetS (for verifying our English systems and as data for annotation)

  • the included and linked Sinhala-English bilingual dictionaries (for MT training data, multilingual word vectors).

As was the case last year, the parallel text data (for both languages) had significant deficiencies. The alignment and pre-processing were of such poor quality that both had to be redone to make the data usable as training data for MT or multilingual embeddings, or as scaffolding for annotators.

III-B Other LDC Resources

III-B1 LDC2017S05 (Babel Swahili Language Pack)

This pack was used as a training set for speech recognition in IL9, mentioned in §VIII-A1. There are around 55k utterances in this dataset. Swahili is the high-resource language closest to Kinyarwanda.

III-B2 LDC97S44 (English Broadcast News), LDC98S74 (Spanish Broadcast News), LDC98S73 (Mandarin Broadcast News), LDC2012S06 (Turkish Broadcast News), LDC2004S01 (Czech Broadcast News)

We used these packs for the large multilingual broadcast news speech recognition model we built for faster and better adaptation to the IL languages. The acoustics of this data are similar to the IL audio. These packs contain around 300k utterances in total (125k from English, 100k from Turkish, 30k each from Spanish and Mandarin, and around 15k from Czech).

III-C Additional Resources

III-C1 Leidos HA/DR data

As in LoReHLT17, we made use of the English in-domain text collection based on ReliefWeb, collected and annotated by our Leidos sub-team (Horwood and Bartem, 2016). The text was used in training in-domain English language models for data selection and in creating back-translations for machine translation, and the text and annotations were used to generate keywords for our SF Keyword Model (§VIII-B1) and as training data for our SF Neural Model (§VIII-B1).

III-C2 Leidos LRLP data

Our Leidos sub-team sampled 10K English text snippets from the LDC’s LORELEI Representative Language Packs (LRLPs) and annotated them with Situation Frames, named entities, and relations among them. Snippets consist of 1-3 segments of text selected automatically according to density of terms found in the LORELEI HA/DR Lexicon. We used the English portions of parallel corpora in the following languages: Amharic, Arabic, Bengali, Chinese, Farsi, Hindi, Hungarian, Indonesian, Russian, Somali, Swahili, Tamil, Tagalog, and Yoruba. This data was used as training data for our SF Neural Model (§VIII-B1).

III-C3 Additional Sinhala speech data

Read Sinhala speech data from SLR52 (http://www.openslr.org/52/) and SLR30 (http://www.openslr.org/30/) was used as a training set for speech recognition in IL10, mentioned in §VIII-A1. The two datasets together contain around 200k utterances.

III-C4 Additional Swahili speech data

Broadcast news Swahili speech data from ALFFA (African Languages in the Field: speech Fundamentals and Automation, http://www.openslr.org/25/) was used as a training set for speech recognition in IL9, mentioned in §VIII-A1. There are around 12k utterances in this dataset. Swahili is the high-resource language closest to Kinyarwanda.

III-C5 Multilingual Bible Corpora

A collection of Bible audio and text aligned at the chapter level, used to create training data for speech ASR (§VIII-A1). It covers religious texts in around 1,000 languages and was pre-downloaded from bible.is.

III-C6 Additional entity gazetteers

We created gazetteers from Wikipedia resources – Wikipedia inter-language links, and inline translations of entities in English Wikipedia articles. We also collected the high frequency n-grams from the Set1 data and translated the named entities to English, assisted by the native informants. We then manually annotated these with their respective EDL knowledge base IDs. Additionally, we compiled a gazetteer using annotations acquired from native and non-native informants.

IV Orthography, Phonology, and Morphology

IV-A IPA Conversion

In some components, it is desirable to obtain a phonetic/phonological representation for the incident language data. This can help make the text more accessible to annotators and linguists and can reveal relationships between languages that are obscured by orthography. Data was converted from orthographic representation to the International Phonetic Alphabet (IPA) in both IL9 and IL10 using our open source G2P library, Epitran (Mortensen et al, 2018). Epitran consists of a set of mappings between orthography and phonological representations, as well as a collection of pre- and post-processors for languages where there is not a straightforward, many-to-one mapping between orthographic units and phonological units. At the beginning of the evaluation, IL9 was already supported by Epitran, but three person-hours of the first day were spent adding IL10 support. One additional hour was spent improving IL9 support.

IV-A1 Re-romanization

New this year was a “re-romanizer,” a generalized mechanism that converted IPA representations into a more familiar romanized form (similar to what is used in the romanization of foreign names in English). This meant that, while accurate IPA transcriptions were still available, annotators who were not trained in the IPA had access to a familiar representation of names in the incident languages. This was primarily of interest for IL10.

IV-A2 Epitran Backoff

A second new addition was a backoff model, useful for mixed-language data (due to code-switching and borrowing). When using this model, the programmer instantiates Epitran with a list of language-script pairs rather than a single pair. When a token is passed to the resulting object, it attempts to transliterate as much of the token as possible using the first language, falling back on the other languages (in succession) when this is not possible. This is especially useful for cases, as in IL10, where documents mix scripts (Sinhala in Sinhala script and English in Latin script) and it is desirable to produce a single IPA representation of the whole document.
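A minimal sketch of this usage, assuming Epitran's Backoff interface and the 'sin-Sinh' (Sinhala) and 'eng-Latn' (English) language-script codes:

    # Sketch only: assumes the epitran package is installed and that the
    # 'sin-Sinh' and 'eng-Latn' mappings are available.
    from epitran.backoff import Backoff

    # Try Sinhala first; fall back to English for tokens in Latin script.
    backoff = Backoff(["sin-Sinh", "eng-Latn"])

    for token in ["කොළඹ", "flood", "relief"]:
        print(token, "->", backoff.transliterate(token))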

IV-B Morphological Parsing

For morphological parsing, we relied again on hand-crafted, rule-based systems. However, rather than using parser combinator-based analyzers as in the previous two evaluations, we wrote morphological analyzers for Foma (Hulden, 2009), a reimplementation of Xerox's XFST suite of finite state tools. These were typical Xerox-style analyzers (Beesley and Karttunen, 2003) with a thin Python layer providing disambiguation and a convenient interface. Using Foma allowed us to leverage existing work on the morphology of IL9 as well as create analyzers with an impressive runtime (allowing for the easy re-lemmatization of the whole dataset in short order as the analyzers improved).

At a very high level, the analyzers for the two incident languages had a very similar structure: each consisted of an FST that parsed words having attested lemmas (lemmas present in the lexicon) and a second FST with a "guesser" that would parse forms with arbitrary stems/lemmas. For every input word, the two FSTs were applied in succession: the guesser was tried if and only if the FST for attested lemmas failed to yield a parse. Since the first FST tended to generate only a limited number of parses, this strategy helped to alleviate the problem of ambiguity that plagued our morphological parses in LoReHLT16 and LoReHLT17.
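A minimal sketch of this cascade, assuming compiled Foma binaries and the flookup command-line tool; the file names are illustrative, not the actual evaluation artifacts:

    import subprocess

    def flookup(fst_path, word):
        # Apply a compiled Foma FST to one word and return its analyses.
        out = subprocess.run(["flookup", fst_path], input=word + "\n",
                             capture_output=True, text=True, check=True).stdout
        analyses = []
        for line in out.splitlines():
            if not line.strip():
                continue
            analysis = line.split("\t")[-1]
            if analysis != "+?":          # flookup marks parse failures with "+?"
                analyses.append(analysis)
        return analyses

    def parse(word):
        # Try the FST restricted to attested lemmas; fall back to the guesser
        # only when the first FST yields no parse.
        return flookup("attested.bin", word) or flookup("guesser.bin", word)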

Disambiguation was still a challenge, however, and was based on a variety of criteria. Parses were assigned costs based on the phonological shape of the lemma (e.g. a single consonant incurred a high cost for both languages) and the relative frequency of morphological properties (vocative case incurred a high cost for Sinhala). In general, our design strategy allowed for lower rates of over-parsing than those in previous evaluations.

At a low level, the two morphological analyzers were very different. The IL9 analyzer, based on an analyzer originally written for a MURI project, attempted to cover the whole morphology of the language—all parts of speech, and derivation as well as inflection. To some degree, this was suboptimal for our purposes, since most of the downstream tasks required a form of lemmatization and the “lemmas” output by the analyzer were actually roots rather than stems or citation forms. This likely hurt precision, while potentially benefiting recall.

The morphological analyzer for IL10 was philosophically opposite to that for IL9. It represented completely original work and was tailored to the needs of the downstream tasks (except for MT). It was a very conservative parser that targeted only the inflectional morphology of nouns. It yielded lemmas that were identical to the citation form of nouns and seldom yielded more than two parses per word. When it did, the best parse was usually obvious (the one that was not nominative singular).

The morphological analyzers were used primarily to lemmatize data, which was consumed by all downstream tasks.

V Native Informants and Linguistic Analysts

V-A Native Informants

Data and annotations of many kinds were elicited from the native informants (NIs):

  • Translations

    • Translations of English SF keywords

    • Translations of named entities (eng→IL) occurring in the incident description and in Set1

    • Translations of named entities (IL→eng) occurring in Set0 and Set1

    • Translations of high-frequency phrases and sentences in set1

  • Annotations

    • EDL

      • Named entity annotations (see table I)

      • Correction of system output in active learning paradigm

    • Situation frame annotations of set0, set1, and transcribed speech

    • Audio transcription from set0 and set1

    • Speech recording of sentences selected from set0 and set1 by our SF system

V-B Non-Native Annotators

In addition to employing NIs to do translation, annotation, classification, and error correction, we made expanded use of non-speaker annotators who had varying levels of linguistic training but no direct knowledge of the incident languages. This builds upon our earlier work, in LoReHLT16 and LoReHLT17, with linguistic annotators.

We addressed the integration of annotators into our data pipelines with an improved annotation interface (based on the one introduced in LoReHLT17). This allows annotators to view multiple levels of linguistic representation, including the original text, glosses from lexical resources, IPA transcriptions, and conventional romanizations, all in one integrated and efficient interface. The same interface was used by the NIs and the annotators, but with different levels of representation visible. Using this tool, the annotators were able to produce a large number of annotations, especially named entity annotations (see table I).

                       IL9    IL10   Total
Tokens               11096    5253   16349
  - NI                3500    2569    6069
  - non-NI            7596    2684   10280
Total NEs (by CP1)     891     567    1458
Total NEs (by CP2)    7570    2958   10528
Unique NEs            4231    2084    6311
TABLE I: Named Entity annotations from NIs and non-native annotators by CP2

V-C Active Learning

In order to use informants efficiently, we developed an active learning system for Named Entity Recognition (NER). This system selects sub-spans from sentences for annotation based on uncertainty under a trained model, in addition to the frequency of the tokens in the sentence.

To compute the uncertainty, we measure sub-span-level entropy under an NER model trained on the target low-resource language, using data transferred from English and related languages via bilingual lexicons. The transfer is done through word-to-word translation of the source language data: we first cast two sets of individually trained word embeddings into a common space using a bilingual lexicon, and then retrieve the nearest-neighbor target language word as the translation of each source language word. Subsequently, we train an NER model on the transferred data.
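A minimal sketch of the word-to-word transfer step, assuming the two embedding tables have already been mapped into a common space (the helper names are hypothetical):

    import numpy as np

    def translate_word(word, src_vecs, tgt_matrix, tgt_vocab):
        # Return the target-language word whose vector is closest (cosine) to `word`'s.
        if word not in src_vecs:
            return None
        v = src_vecs[word]
        sims = tgt_matrix @ v / (np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(v) + 1e-8)
        return tgt_vocab[int(sims.argmax())]

    def transfer_sentence(tokens, tags, src_vecs, tgt_matrix, tgt_vocab):
        # Translate an annotated English sentence token by token, keeping the NER tags.
        out = []
        for tok, tag in zip(tokens, tags):
            tr = translate_word(tok.lower(), src_vecs, tgt_matrix, tgt_vocab)
            if tr is not None:
                out.append((tr, tag))
        return out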

The active learning model selects the sub-spans from different sentences which have the highest entropy among all spans, while tagging the remaining spans with labels using the Viterbi algorithm for a sequence CRF (Ma and Hovy, 2016). The informants performed two tasks: a) annotating the uncertain sub-spans, and b) correcting the labels and the span boundaries for the rest of the spans, which were tagged using the NER model.

When active learning could not be set up before an NI session, we selected unlabeled sentences using a heuristic: we ranked sentences by the sum of the top-5 TF-IDF scores of their words (as a measure of sentence importance), while maintaining the same ratio of sentence sources (WL/NW/SN) as in setE.
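A minimal sketch of this fallback heuristic, assuming sentences are given as (tokens, source) pairs; the setE source ratio shown is illustrative:

    import math
    from collections import Counter

    def rank_by_tfidf(sentences):
        # Document frequency over the pool of candidate sentences.
        df = Counter()
        for tokens, _src in sentences:
            df.update(set(tokens))
        n = len(sentences)
        scored = []
        for tokens, src in sentences:
            tf = Counter(tokens)
            tfidf = sorted((tf[w] * math.log(n / df[w]) for w in tf), reverse=True)
            scored.append((sum(tfidf[:5]), tokens, src))   # sum of the top-5 scores
        return sorted(scored, key=lambda x: x[0], reverse=True)

    def select_for_annotation(sentences, budget, ratio={"WL": 0.3, "NW": 0.4, "SN": 0.3}):
        # Take the highest-ranked sentences per source, keeping the setE source ratio.
        quota = {src: int(budget * r) for src, r in ratio.items()}
        picked = []
        for _score, tokens, src in rank_by_tfidf(sentences):
            if quota.get(src, 0) > 0:
                picked.append(tokens)
                quota[src] -= 1
        return picked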

Additionally, we used non-native annotators to help with NER annotations by leveraging multiple linguistic resources.

  • Represented the incident language in the IPA space.

  • Added indicator features such as honorifics (Mister, Miss, Dr., etc.), location indicators (river, Mount, Mt., etc.), organization indicators (Association, Ministry, etc.), and geo-political indicators (Republic of, etc.).

  • Augmented the interface with word-by-word translations acquired from the Situation Frame team; translations provided by the Machine Translation team were added subsequently.

VI Entity Discovery and Linking

VI-A Named Entity Recognition

Our NER submissions are primarily based on the neural CRF model proposed in Ma and Hovy (2016), while also utilizing certain additional resources such as the compiled IL9 and IL10 gazetteer. Figure 2 provides a high level overview of our NER system.

Fig. 2: Overall architecture of the NER system in LoReHLT 2018.

VI-A1 Model - Neural CRF

The neural CRF model leverages the strength of a strong neural representation learner: words in the sequence are modeled at both the type and the token level. A character-level convolution layer models the token-level information and is concatenated with pre-trained word embeddings, which capture the type-level information. FastText (Bojanowski et al, 2016) was used to train word embeddings for both ILs. For Checkpoint 1, monolingual data extracted from set0 and setE was combined for training; for Checkpoint 2, monolingual data extracted from set0, set1 and setE was used. The model can optionally incorporate discrete linguistic features such as indicator features and Brown clusters, as shown in Figure 3. For IL9, we noticed that many entity words are capitalized, so we designed a capitalization-ratio feature: for each word w in the monolingual corpus we compute ratio(w) = (number of times w appears capitalized) / (total number of times w appears), bucket this ratio, and use the bucket as a discrete feature when training models for IL9. Together, these token-level representations are modeled with a bi-directional LSTM (Dyer et al, 2015), which is known to help in tagging tasks by capturing the left and right context in a sequence. Finally, a CRF layer is used for sequence labeling. CRFs are undirected graphical models that compute the conditional probability of a label sequence given the observations; the Viterbi algorithm allows the model to perform efficient inference over the space of entire output sequences (i.e., global/sequence-level normalization as opposed to local/word-level normalization). We experimented with two strategies for combining the discrete features in the neural CRF model described above, which we describe below.
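As a concrete illustration of the capitalization-ratio feature described above, a minimal sketch (the bucket count is illustrative):

    from collections import Counter

    def capitalization_buckets(corpus_tokens, num_buckets=5):
        # ratio(w) = capitalized occurrences of w / total occurrences of w,
        # bucketed into a small number of discrete feature values.
        total, capped = Counter(), Counter()
        for tok in corpus_tokens:
            key = tok.lower()
            total[key] += 1
            if tok[:1].isupper():
                capped[key] += 1
        return {w: min(int(capped[w] / n * num_buckets), num_buckets - 1)
                for w, n in total.items()}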

Fig. 3: Neural CRF architecture of the NER system in LoReHLT 2018.
  • Model: Sep-Neural CRF: Separate bidirectional LSTM encoders were used to encode the embedding features (word level and character level) and the linguistic features (indicator features, Brown clusters), respectively. The outputs of these two encoders are concatenated before the discriminative CRF layer.

  • Model: Cat-Neural CRF: Instead of having separate encoders for different feature types, all features are concatenated into a single continuous representation, which is encoded with a single bidirectional LSTM before the discriminative CRF layer.

Submissions for both IL9 and IL10 were made by varying the combination of features used, for instance:

  • Sep-Neural CRF + indicator features + brown clusters

  • Sep-Neural CRF + brown clusters

  • Cat-Neural CRF + indicator features + brown clusters

  • Cat-Neural CRF + brown clusters

VI-A2 Noisy Training Data Acquisition

For both IL9 and IL10, no labeled data is provided in the LoReHLT18 setting. We developed the following approaches to acquire noisy training data.

  • Collection of Gazetteers: We collect named entities and their entity types from several different sources: (a) named entities extracted from the native informants' annotation sessions; (b) named entities extracted from non-native informants' annotation sessions; (c) named entities extracted from titles of incident language Wikipedia pages; (d) named entities extracted from the knowledge base provided in the LDC language packs; (e) entities from the IL-English bilingual dictionaries in the LDC IL language packs, partially annotated by non-native informants.

  • Normalization of Gazetteer and Creation of Negative Gazetteer: To make the collected Gazetteer generalize to different situations, we expand it by removing special characters such as #, @ and punctuation marks from each entity word and lower-casing all entities in Roman script. We use this normalized Gazetteer together with the original Gazetteer. For IL9, we select the top 1500 words based on the capitalization ratios described in §VI-A1 and ask non-native informants to pick out words that are not entities, forming a negative entity set. For IL10, we manually build a negative entity set containing several words picked out by the non-native informants while correcting the Gazetteer.

  • In-domain Data Selection for Training: To select in-domain training data, we score each sentence in the monolingual data with the TF-IDF scores of n-grams from setE, the number of keywords provided by the SF team, and the length of n-grams appearing in setE. For IL9, we also consider the number of capitalized words. We scale these scores to make them comparable, rank sentences by their scores, and maintain the same sentence type (WL/SN/NW) ratio as in setE when selecting training data.

  • Label Propagation: Given a Gazetteer, we use it to annotate the selected training data with label propagation as follows (a sketch is given after this list): we iterate over each word in a sentence and look ahead over an n-gram window (from five down to zero in our experiments). Once a span is found in the Gazetteer, we label the span with entity tags and skip to the next unread word. If no span is found in the Gazetteer, we label the word as a non-entity.

  • Cross-lingual Transfer: We transfer training data from English and related languages to the target languages using bilingual lexicons. For English, we use the CoNLL 2003 training data (Tjong Kim Sang and De Meulder, 2003) and perform transfer through word-to-word translation: we first cast two sets of individually trained word embeddings into a common space using a bilingual lexicon, and then retrieve the nearest-neighbor target language word as the translation of each source English word. As a related language, we use Swahili for IL9. We use English as a pivot language to form a source-to-target lexicon, since we are provided English lexicons for both languages. Differently from transferring from English, we first translate source words using the resulting lexicon, then use target words with an edit distance of less than 1, and lastly the nearest-neighbor target words in the shared embedding space. The resulting training data from English and the related languages can then be used for IL9 and IL10.
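A minimal sketch of the gazetteer-based label propagation mentioned above, assuming a gazetteer that maps lower-cased entity strings to entity types and BIO-style output tags:

    def propagate_labels(tokens, gazetteer, max_window=5):
        # Tag a sentence by matching gazetteer entries over n-gram windows,
        # preferring the longest match starting at each position.
        tags = ["O"] * len(tokens)
        i = 0
        while i < len(tokens):
            matched = False
            for n in range(min(max_window, len(tokens) - i), 0, -1):
                span = " ".join(tokens[i:i + n]).lower()
                if span in gazetteer:
                    etype = gazetteer[span]
                    tags[i] = "B-" + etype
                    for j in range(i + 1, i + n):
                        tags[j] = "I-" + etype
                    i += n                    # skip to the next unread word
                    matched = True
                    break
            if not matched:
                i += 1                        # leave the word tagged as a non-entity
        return tags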

VI-A3 Target Language Specificities

Some submissions were specific to a particular IL, as described below.

  • IL9: (a) Since IL9 has Roman script, an additional capitalization ratio feature was added as part of the discrete features. (b) For all capitalized words that are unlabeled and not in the negative Gazetteer, we mark them with UNK labels, and during training marginalize over all labels at UNK words to calculate the score of a sentence. We denote this model output “partial-CRF”.

  • IL10: (a) Joint training with Hindi was used for IL10 due to similarities in pronunciation. Word embeddings were trained by converting both IL10 and Hindi into the common IPA space. Hindi NER annotations, extracted from an existing language pack (LDC2017E62), were added to the IL10 training data. (b) Edit-distance based label propagation: we could not collect a sufficient number of gazetteer items because it is more difficult for non-native annotators to annotate non-Roman scripts. As a post-processing step, we first perform label propagation on the IL10 setE: for each word in setE that does not exist in the Gazetteer, we compare it with each word in the current Gazetteer; if the two have an edit distance of less than a threshold d, we record the entity label of the Gazetteer word, and we then assign the majority label to the unlabeled word. Empirically, we set d to 2.

VI-A4 Post-processing

We first perform label propagation with the Gazetteer and extract all predicted entities. Then we perform within- and across-document label propagation over the whole setE.

VI-A5 English Data

For English, we used Stanford CoreNLP (Manning et al, 2014) for CP1 and a vanilla neural CRF model (without additional features) for CP2. For the neural CRF model, we used data from the CoNLL 2003 dataset (Tjong Kim Sang and De Meulder, 2003) and the OntoNotes 5.0 dataset (https://catalog.ldc.upenn.edu/ldc2013t19) for training and tuning. Since the CoNLL 2003 data used different entity types, we converted them following the Freebase-type-based procedure described by Tsai et al (2016). We used publicly available pre-trained GloVe word embeddings (Pennington et al, 2014; https://nlp.stanford.edu/projects/glove/) as the word embedding inputs.

As a domain-specific preprocessing step, we performed lower-case exact string matching of all word n-grams (up to length 4) in the text data against the KB (pruned as described in the next section), and tagged those found in the KB with their corresponding KB entity types. We perform matching starting from the longest n-grams and simply skip n-grams that share overlapping spans with other named entities. We also skipped n-grams containing very common words, defined as the top 5,000 most frequent words in GloVe. Lastly, we tagged all hashtagged words by performing lower-case exact string matching with the hashtag and spaces removed, using a list of known abbreviations manually compiled from Set1 and SetS. These preprocessing steps handle named entities such as rare words, lower-case words, and hashtagged words. We add more entities by running the NER system on the preprocessed texts.

Next, we detect nominal mentions using the constituency parser from Stanford CoreNLP (Manning et al, 2014), which is implemented as part of our English EDL pipeline (Ma et al, 2017). In short, we select noun phrases returned by the parser that do not share overlapping spans with other NEs, and perform filtering post-processing steps based on WordNet types, noun types, etc. For more details, please refer to Ma et al (2017).

VI-B Entity Linking

Fig. 4: Overall architecture of the entity linking system LoReHLT 2018.

After obtaining output from the NER system, we use a two-stage entity linking approach to link the detected mentions to the knowledge base (KB). We first apply a high-precision system that performs word-to-word translation of the mention strings and fast lookup in the KB. The mentions not linked by this system are then processed by a neural character-level encoder system. The overall architecture of the pipeline is shown in Figure 4.

VI-B1 Pre-Processing

  • KB Pruning: For all checkpoints, we linked mentions to a pruned version of the KB in order to reduce processing time and to remove entities unlikely to be related to the incident. The pruning applied only to GPE/LOC entities; we used all PER and ORG entities in our linking pipeline. For GPE/LOC, we selected all entries in the KB associated with the incident country as well as surrounding countries, and also added GPEs that had a population of more than 50,000 according to the KB.

VI-B2 Translation-Based Linking

We make a first attempt at linking mention strings using a translation-based system. For each mention string, the system uses various lexical resources to generate possible English translations of the string by performing a lookup of each token in the lexicons (word-by-word translation). We then find the KB link for the mention by looking up each translation in the KB and selecting the best KB entry match according to highest Jaccard similarity on the strings, with a threshold tuned on experiments with other languages (before the evaluation).
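A minimal sketch of this lookup step; the KB is assumed to be a list of (kb_id, canonical_name) pairs, and the threshold value shown is illustrative rather than the tuned one:

    def jaccard(a, b):
        # Token-level Jaccard similarity between two strings.
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    def link_mention(candidate_translations, kb_entries, threshold=0.5):
        # Return the best (kb_id, score) over all candidate translations,
        # or None (NIL) if even the best match falls below the threshold.
        best = None
        for cand in candidate_translations:
            for kb_id, name in kb_entries:
                score = jaccard(cand, name)
                if best is None or score > best[1]:
                    best = (kb_id, score)
        return best if best is not None and best[1] >= threshold else None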

The lexical resources we used were:

  • Native informant translations of entities from the incident description

  • Wikipedia inter-language links (parallel article titles) between Kinyarwanda-English, Swahili-English and Sinhala-English

  • PanLex for the incident languages

  • Extracted lexicon from the parallel data in the given language packs using fast_align (Dyer et al, 2013a), pruned by the number of occurrences of the alignment.

  • Alternate names for entities in the Geonames database – in both the incident languages as well as English

Additional entity lexicon creation: The linguistics team and non-native annotators in the team translated over 800 high-priority entities from Set1 (300 in IL9 and 500 in IL10). Several of these were linked to the KB by team members and annotators, improving the quality of the translation-based entity linking system. A part of these annotations were used as a development set for model selection. We also asked the native informants to translate entities while annotating data for NER. A non-native annotator or EDL team member recorded these translations during the session, and mapped them back to the original IL entity as post-processing. We obtained over 300 translations for IL10 through the NI sessions. This number was less significant for IL9.

We attempted to use the morphological parsers to obtain variants of both the lexical resources as well as the mention string. However, this did not show improvement in the entity linking performance on the development set and we did not use these in our final system submission.

VI-B3 Neural Scoring for KB Entries

The inputs to our second linking step are the entities that remain without links after running the translation-based system. This can occur because the available bilingual lexicons do not have full coverage of the input entities. We attempt to link these entities to increase the overall recall, based on character-level and phoneme-level word similarity between each entity and all entries in the KB. Specifically, for each input entity, we compute a score for every entry in the KB and sort the KB entries by these scores. We set a threshold on the difference between the highest and second-highest scores to determine whether the entity should be linked to the KB or remain NIL.
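A minimal sketch of this decision rule, assuming scores maps each KB ID to its similarity score; the margin value is illustrative:

    def link_or_nil(scores, margin=0.1):
        # Link only when the best KB entry clearly beats the runner-up;
        # otherwise leave the mention as NIL (returned as None).
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        if not ranked:
            return None
        if len(ranked) == 1:
            return ranked[0][0]
        (best_id, best_score), (_, second_score) = ranked[0], ranked[1]
        return best_id if best_score - second_score >= margin else None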

For each incident language, we build a model that uses two LSTM encoders – one that encodes strings in the incident language and the other that encodes English strings. The model is trained to maximize the cosine similarity between parallel entity pairs (between the IL and English). We use negative sampling with a max-margin objective during training (Bordes et al, 2011; Mikolov et al, 2013). We train models in both the orthographic (grapheme) space and the phoneme space (by converting the parallel data into IPA using Epitran (Mortensen et al, 2018)).

Joint training with related languages: Apart from training models on the ILs themselves, we also leverage parallel data in languages closely related to the IL for training the model. Specifically, we used Swahili and Zulu for IL9, and Marathi and Hindi for IL10. For IL9, the writing systems used by the related languages are the same and we jointly trained models in both grapheme and phoneme spaces. With IL10, we used only the phoneme space for joint training.

While testing, we use the IL encoder to encode the input mention and the English encoder to encode all the KB entries. We then compute the similarity between each KB entry encoding and the input mention encoding. The selected entity link is the top-scoring KB entry for that mention.

To determine whether an entity is NIL or linkable to the KB, we set a threshold on the cosine similarity score. This threshold is tuned on the development set created using Set1 entities. We observe that the encoder-based linking system offers diminishing utility with increasing size of the lexicon used for the translation-based entity linking system. Interestingly, with the large number of entities translated by the end of CP2, the neural encoding system offered little to no improvement in entity linking performance over the translation-based system (which offers better scores in terms of precision).

VI-B4 English Data

We use the EDL system of Ma et al (2017) for the English data, which takes in English NER outputs that contain both named and nominal mentions, finds a list of candidate entities for each named entity based on string similarity, and performs document-level inference for all the named entities within the same document using graphs built from Wikipedia. The highest scoring subgraph formed by the candidate entities is selected based on a graph densification procedure from (Moro et al, 2014). For more details of the complete linking system, please refer to Ma et al (2017).

After the system outputs the entity ID for each named entity, we perform a few post-processing steps. First, as the system outputs a Wikipedia ID for each linkable named entity, we have to map the Wikipedia ID to the LORELEI KB ID. For Geonames entries, we obtain the mappings for all entries that have Wikipedia links in the alternative-names table, and we perform exact string matching for all other entries to retrieve the mappings. If no mapping to a LORELEI KB ID can be found for the Wikipedia ID, it is mapped to NIL. Second, we rerun the exact string matching step of the English NER system to handle mislinked and left-out entities, which arise, for example, when a named entity exists in the LORELEI KB but not in Wikipedia, or when the system declines to predict because it cannot find any reliable candidates (e.g., hashtagged words sometimes do not have high string similarity with any Wikipedia entry). Lastly, we perform NIL clustering based on the mention surface form. Note that entities that do not share a surface form, but are linked to the same Wikipedia ID that cannot be mapped to a LORELEI KB ID, will be grouped in the same NIL cluster.

The complete list of submissions is shown in Table V and Table VI at the end of this report.

VII Machine Translation

The ARIEL MT strategy was based on two pillars:

  1. Creating a diversity of systems, as different varieties of systems work better in different situations.

  2. Making best use of the resources available, including parallel IL data, lexicons, data in related languages, and monolingual data in English.

The details of all of these methods are described below.

VII-A System Varieties

We created two varieties of MT systems: ones based on hierarchical phrase-based machine translation and ones based on neural machine translation. We also experimented with system combination to merge multiple strong systems and achieve better results.

VII-A1 Hierarchical Phrase-based Machine Translation Systems

We used hierarchical phrase-based machine translation models (Chiang, 2007) trained using the cdec toolkit (Dyer et al, 2010). We used the fast-align toolkit (Dyer et al, 2013b) to compute bidirectional word alignments and symmetrized them using the grow-diag-final-and option. We heuristically scored parallel sentences in the Set 0 data using an in-domain keyword list and split the Set 0 data into train, development and test sets. The system parameters were tuned on the development set with MIRA for 20 iterations. Five language models were trained on Gigaword, Leidos, Set 0 English data, Set 1 English data, and the combination of Set 0 and Set 1 English data, respectively, using KenLM (Heafield, 2011). For both incident languages, we used the morphological analyzer to extract the lemma for each source word and trained our hierarchical phrase-based MT systems on the lemmatized data.

VII-A2 Neural Machine Translation Systems

All neural systems were trained using the xnmt Toolkit (Neubig et al, 2018) using a standard attentional encoder-decoder translation model (Bahdanau et al, 2014). The model used one or two layers of bi-directional LSTMs (Dyer et al, 2015) for the encoder, and a single layer of LSTM for the decoder. Training was performed with Adam (Kingma and Ba, 2014) with a learning rate of 0.001, batches were created based on the number of words such that the average batch size was 48 sentences, and a dropout rate of 0.3 was applied throughout the model.

The input and output were split into subword units using sentencepiece (https://github.com/google/sentencepiece), using the unigram-based segmentation strategy of Kudo (2018). A subword vocabulary of size 8,000 was used on the target side, and 32,000 on the source side.
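A minimal sketch of this segmentation setup, assuming the sentencepiece Python bindings; the file names are illustrative:

    import sentencepiece as spm

    # Unigram-LM segmentation (Kudo, 2018): 32k pieces on the source side,
    # 8k pieces on the target (English) side.
    spm.SentencePieceTrainer.train(input="train.il", model_prefix="src_spm",
                                   vocab_size=32000, model_type="unigram")
    spm.SentencePieceTrainer.train(input="train.en", model_prefix="tgt_spm",
                                   vocab_size=8000, model_type="unigram")

    sp = spm.SentencePieceProcessor(model_file="src_spm.model")
    print(sp.encode("amakuru y'umwuzure", out_type=str))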

These relatively simple settings were chosen and fixed mainly because our main innovations lay in the creation and utilization of training data, rather than new neural network architectures, which we detail next.

VII-B Data Sources and Preparation

VII-B1 Massively Multilingual Data Collection

Before the evaluation, the ARIEL team gathered a large set of multilingual resources that were deemed potentially useful. These resources spanned over 1,095 languages and amounted to a total of 1.7 billion parallel sentences, covering genres from religious texts and news to TED talks and movie subtitles. Many of these were gathered from the OPUS online archive (Tiedemann, 2009), along with a number of other sources.

Per the LORELEI rules, no new resources were gathered after the incident languages were announced, but the resources gathered before the evaluation included a small number of extra parallel resources for each incident language:

  • IL9: Data from the bible (123k sentences), from the GNOME project (233k sentences), from the KDE project (39k sentences), and from the Ubuntu project (6k sentences).

  • IL10: Data from the GNOME project (13k sentences), from the KDE project (26k sentences), from OpenSubtitles (392k sentences), and from the Ubuntu project (6k sentences).

For each of the languages, we additionally harvested a lexicon from existing resources, as detailed in the NER section, and all native informant parallel resources were added to the training data.

VII-B2 Data Augmentation with Entities

One known weakness of NMT systems is that they are much worse at handling rare words, including named entities (Arthur et al, 2016). This weakness is particularly problematic when we lack in-domain training data, since some named entities at test time might not occur in the training data at all. Therefore, we add synthesized training data created by replacing named entity pairs in the training data with random named entity pairs from the provided lexicon. The first step in augmenting the data is to detect the locations of the named entity pairs in the parallel training data: 1) we first run the NER tagger from NLTK (https://www.nltk.org/) on the target English side; 2) word alignments between source and target are extracted by fast_align (https://github.com/clab/fast_align); 3) for each named entity detected on the English side, its corresponding source-side location is determined from the word alignments. Once the locations of the named entity pairs are extracted, we can easily replace each named entity pair with a randomly sampled named entity from the lexicon.

VII-B3 Selection of In-domain Data

As previously stated, we collected a total of 1.7 billion sentences parallel between English and one of over a thousand other languages to be used in our polyglot neural MT system. Due to time and compute constraints, we sought a smaller sub-corpus of the sentence pairs most relevant to each of the two incidents.

To do this data selection we used a set of relevant terms extracted from setE by the NER team and computed the relevance of the English side of each sentence pair in the large corpus to these terms. (Before the LoReHLT18 evaluation period, we used terms extracted from all previous setEs; during the evaluation period we used the setEs of the respective ILs.) First, for each English word w in the vocabulary of the large corpus we pre-compute the number of unique sentences in the corpus that contain w. We call this quantity df(w) ("document frequency") and its inverse idf(w) = 1 / df(w) ("inverse document frequency").

Next, for each sentence i in the large corpus we compute a (sparse) vector with one dimension per vocabulary word w. We write tf(i, w) for the number of times the word w occurs in the i-th sentence ("term frequency"). The w-th dimension of sentence i's vector is computed as tf(i, w) * idf(w), the TF-IDF score (Salton and McGill, 1986) of w in sentence i. The intuition behind this approach is to assign high values to words that are common in sentence i (a high tf) but uncommon in the whole corpus (a low df and thus a high idf).

Given the list of relevant terms provided by the NER team, we compute the term frequency of each word in the list and again divide by the df pre-computed on the large corpus. Finally, we rank the sentences in the large corpus by the cosine similarity between their vectors and the relevant-term vector, and use the highest scoring sentences (subject to some constraints discussed in §VII-C) as our relevant sub-corpus.
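A minimal sketch of this relevance ranking, assuming corpus is a list of tokenized English sentences and relevant_terms is a term-frequency Counter supplied by the NER team:

    import math
    from collections import Counter

    def rank_by_relevance(corpus, relevant_terms):
        # Document frequency over the large corpus, and its inverse.
        df = Counter()
        for sent in corpus:
            df.update(set(sent))
        idf = {w: 1.0 / df[w] for w in df}

        # TF-IDF vector for the relevant-term list.
        query = {w: tf * idf.get(w, 0.0) for w, tf in relevant_terms.items()}
        qnorm = math.sqrt(sum(v * v for v in query.values())) or 1.0

        scored = []
        for i, sent in enumerate(corpus):
            tf = Counter(sent)
            vec = {w: tf[w] * idf[w] for w in tf}
            dot = sum(v * query.get(w, 0.0) for w, v in vec.items())
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            scored.append((dot / (norm * qnorm), i))
        # Indices of the most relevant sentence pairs first.
        return [i for _score, i in sorted(scored, reverse=True)]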

VII-B4 Data Cleaning and Filtering

The original parallel data in the language packs provided by LDC was both messy and highly misaligned. To fix this problem, we re-aligned the data and further filtered out sentence pairs that did not appear to be parallel.

To do the re-alignment, 1) we concatenated all training data and split it into separate documents; 2) for each document, we split the text into small sentence segments and ran a sentence re-alignment algorithm using the yasa toolkit (Lamraoui and Langlais, 2013) to obtain the re-aligned data.

After re-aligning the data, we performed parallel sentence filtering based on a variant of the method described by Munteanu and Marcu (2005), as implemented in the nafil toolkit (https://github.com/neubig/nafil).

This method works by training a classifier to determine whether sentences are parallel or not: it takes a "clean" corpus where the sentences are highly parallel and artificially introduces noise by swapping some of the neighboring sentences and labeling them as incorrectly aligned (we used a swap rate of 0.1). This classifier is then applied to the noisy data, and sentences that are labeled as incorrectly aligned are deleted from the corpus. For IL9, we used our pre-collected version of the Bible as clean data, and for IL10 we used OpenSubtitles. We trained a logistic regression classifier using liblinear (https://www.csie.ntu.edu.tw/~cjlin/liblinear/) and removed any sentences deemed noisy with a probability of over 0.5. As a result of filtering, the data size for IL9 was reduced from 327k sentences to 296k, and for IL10 from 434k sentences to 336k.
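A minimal sketch of how such noisy training labels can be generated from a clean corpus; the feature extraction itself is omitted (the submitted systems used the nafil toolkit's Munteanu-and-Marcu-style features):

    import random

    def make_training_pairs(clean_pairs, swap_rate=0.1, seed=0):
        # Return (src, tgt, label) triples; label 1 = parallel, 0 = misaligned.
        rng = random.Random(seed)
        examples = []
        i = 0
        while i + 1 < len(clean_pairs):
            (s1, t1), (s2, t2) = clean_pairs[i], clean_pairs[i + 1]
            if rng.random() < swap_rate:
                # Swap neighbouring target sentences to simulate misalignment.
                examples.append((s1, t2, 0))
                examples.append((s2, t1, 0))
            else:
                examples.append((s1, t1, 1))
                examples.append((s2, t2, 1))
            i += 2
        return examples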

Finally, since many of the sentences in both the training data and the data we were expected to translate were extracted from online sources such as Twitter or magazines, both the source and target sentences contain a large number of tokens that should stay identical across languages, for example URLs, email addresses, or hashtags. We developed a tagger, called the do-not-translate (DNT) tagger, which extracts source tokens that should simply be passed through rather than translated into English. These tokens are removed before translation and restored afterwards.
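A minimal sketch of such a DNT tagger; the patterns and placeholder token are illustrative:

    import re

    # URLs, email addresses, hashtags, and @-handles are passed through untranslated.
    DNT_PATTERN = re.compile(r"(https?://\S+|\S+@\S+\.\S+|#\w+|@\w+)")

    def remove_dnt(sentence):
        # Replace pass-through tokens with placeholders before translation.
        dnt_tokens = DNT_PATTERN.findall(sentence)
        return DNT_PATTERN.sub("<DNT>", sentence), dnt_tokens

    def restore_dnt(translation, dnt_tokens):
        # Put the original tokens back in place of the placeholders, in order.
        for tok in dnt_tokens:
            translation = translation.replace("<DNT>", tok, 1)
        return translation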

VII-C Multilingual Training of NMT

To take advantage of the large-scale multi-lingual resources, we performed multi-lingual training of our neural machine translation systems.

Before the evaluation started, we trained a large system translating from all 1,095 languages in our database into English. This system was trained on data whose score under the TF-IDF criterion above exceeded a threshold of -9, with at least 4,000 sentences included per language. This resulted in a total of approximately 60M sentences of training data.

Once the evaluation started, we started adapting this pre-trained system to the incident languages. This was done by taking the pre-trained model and re-initializing only its word embeddings to reflect the new vocabulary in the source language, then continuing training of all parameters of the model.

In addition to performing this continued training on only the incident language itself, we also tested models that performed training with the source language and related languages, again using the TF-IDF based data selection to select relevant data. Specifically,

  • IL9: We were not able to find large amounts of data for any of the typologically related languages for Kinyarwanda. However, because both Kinyarwanda and English are written in Roman script, and because many of the entity names are shared, we decided to add additional English-to-English data as a pseudo-translation task. This data was selected so that the TF-IDF threshold was greater than -7, resulting in 317k sentences that contained keywords related to the incident.

  • IL10: For Sinhala, we were fortunate to have reasonably sized resources for two related languages: Hindi and Bengali. We used all of the resources in our database for these two related languages, which resulted in a total of 4.39M training sentences.

VII-D System Combination

For the final submission, multiple systems were ensembled together to create the final results.

For combining NMT systems that share identical output vocabularies, it is simple to perform ensembling at hypothesis generation time: multiple systems are run in parallel, and the average of their predicted word probabilities is used to predict the next word in the sequence (Sutskever et al, 2014). We used this method to combine multiple NMT systems.

In addition, to combine more heterogeneous systems, we used the MEMT (Heafield and Lavie, 2010) system combination method. Since MEMT requires a large n-gram language model to rescore hypotheses, we built a large 4-gram Kneser-Ney (Kneser and Ney, 1995) language model using KenLM (Heafield et al, 2013). We then combined the 1-best outputs of eleven (for IL9) or seven (for IL10) of our best neural MT systems. Training was performed using setF and all of the MEMT toolkit's default settings.

VII-E Creation of Data with the Native Informant

The native informant sessions were used to create two varieties of data: word or phrase lexicons, and translated sentences from set1.

To select words or phrases for the native informant to translate, we used the method of Bloodgood and Callison-Burch (2010), which selects words or phrases that occur in the monolingual data (i.e. set1) but not in the bilingual data (i.e. set0), sorted in descending order of frequency. We additionally followed Miura et al (2016) in removing shorter phrases that are completely subsumed by longer phrases. These words or phrases were translated by having the native informant type translations into a Google Sheet. This resulted in approximately 200 high-frequency words/phrases in each language.
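A minimal sketch of this selection heuristic, assuming tokenized sentences for the monolingual and bilingual data; the n-gram lengths and cut-offs are illustrative:

    from collections import Counter

    def ngram_counts(sents, max_n):
        c = Counter()
        for toks in sents:
            for n in range(1, max_n + 1):
                for i in range(len(toks) - n + 1):
                    c[" ".join(toks[i:i + n])] += 1
        return c

    def select_phrases(mono_sents, bi_sents, max_n=3, top_k=200):
        mono = ngram_counts(mono_sents, max_n)
        seen_in_bitext = set(ngram_counts(bi_sents, max_n))
        # Phrases unseen in the bilingual data, most frequent first.
        unseen = [p for p, _ in mono.most_common() if p not in seen_in_bitext][:1000]
        # Drop shorter phrases completely subsumed by another candidate phrase.
        kept = [p for p in unseen
                if not any(p != q and " " + p + " " in " " + q + " " for q in unseen)]
        return kept[:top_k]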

We also translated sentences from each of the three genres of text: newswire, social networks, and weblogs, to be used as development data to assess the accuracy of our systems. These sentences were translated by having the native informant speak the translations out loud, while a member of the ARIEL team typed in the English translations. This resulted in 79 sentences in Kinyarwanda, and 233 sentences in Sinhala.

VII-F Final Submitted Systems

Our NMT system submissions fall into several categories: 1) NMT-adapt: we pre-train a large NMT system on the multilingual corpus collected as described previously, then fine-tune the system on the incident language; 2) NMT-mult: the NMT system is trained from scratch on a concatenation of the incident language training data, and the parallel data of two related languages in our multi-lingual corpus. For Kinyarwanda, we used Swahili and Zulu. For Sinhala, we used Marathi and Hindi; 3) NMT-plain: the NMT system is trained from scratch on the incident language training data only.

Here we provide a summary of our submissions. The complete list of submissions is shown in Table VII and Table VIII at the end of this report.

VII-F1 Checkpoint 1

The submission statistics for the different systems are summarized in Table II.

SMT

We trained the hierarchical phrase-based system described in §VII-A1 on the LDC tokenized data. We also made submissions that utilized the realigned data.

NMT

For checkpoint 1, a particular challenge for NMT-mult, which utilizes the multilingual corpus, is that training takes much longer to converge because of the increased amount of training data. NMT-plain takes much less time to train than NMT-mult, but its performance may not match models that utilize the multilingual corpus. NMT-adapt, on the other hand, can quickly adapt to the incident language while taking advantage of the multilingual corpus. Because of the time constraint, we made NMT-adapt submissions for both languages and submissions from the other NMT systems for only one language.

          SMT   NMT-adapt   NMT-mult   NMT-plain
IL9         4           2          1           3
IL10        1           2          -           -
TABLE II: Checkpoint 1 MT submission statistics (all constrained)

VII-F2 Checkpoint 2

The submission statistics for the different systems are summarized in Table III. For checkpoint 2, we performed data filtering on the re-aligned data to further remove misaligned sentences. We also used the small parallel set1 data created with the help of the native informants for system evaluation.

SMT

A big challenge for the hierarchical phrase-based MT system is that it cannot translate source words that are morphological variants of their lemmas. Our attempts to address this include: 1) submissions that utilize words segmented into morphemes; 2) further splitting words into subword units with the sentencepiece toolkit.

NMT

We found that in general, the NMT-adapt outperformed the NMT-mult and NMT-plain by a large margin on the small set1 test set we created, so we focused on tuning NMT-adapt for checkpoint 2. Some of our attempts include: 1) we added back-translation data by directly copying monolingual English data as the source. This was especially helpful for Kinyarwanda, as it encouraged the model to pass through the English words on the source side; 2) we averaged system checkpoints for decoding; 3) we ran system combination on all the NMT outputs.

             Constrained                    Unconstrained
        SMT   NMT-adapt   NMT-mult    SMT   NMT-adapt   NMT-mult
IL9       4           6          -      4           5          1
IL10      3           7          -      2           7          1
TABLE III: Checkpoint 2 MT submission statistics

VIII Situation Frames

In this year’s evaluation, we used the same SF type classification pipeline for both text and audio input by first converting the audio files into text.

We note that all of our systems satisfy the constrained condition of the task; in our submissions we marked 10 of them as constrained based on what we believed to be the best systems. A summary of our submissions under the different settings is shown in Table IV. KW refers to our Keyword Model, and NN refers to our Neural Model. Each submission consists of English, IL text, and IL speech output. The complete list of submissions is shown in Table IX and Table X at the end of this report.

                Constrained       Unconstrained
Model:           KW      NN        KW      NN
Checkpoint 1
  IL9             6       -         -       -
  IL10            7       3         4       -
Checkpoint 2
  IL9             4       6         5       5
  IL10            5       5         5       5
TABLE IV: Number of SF submissions in various settings

VIII-A Speech to Text

The main focus of our Speech SF pipeline was on making decisions over grounded words, which would unify the pipelines of all the tasks. This involved building an Automatic Speech Recognition (ASR) system for both languages. Since Sinhala was a higher-resourced language than Kinyarwanda in terms of the ASR resources available to us, we applied different approaches for each language. The core model of our speech recognizer remained the same as last year: a sequence-based ASR using CMU's EESEN system (Miao et al, 2015) trained with the Connectionist Temporal Classification (CTC) loss. The target labels were generated using Epitran (Mortensen et al, 2018), the grapheme-to-phoneme library discussed earlier in this report, which was used to generate lexicons for the words present in the training, development and test sets. To ground the acoustic model output to words, WFST-based decoding was performed with a language model built on the monolingual text corpus (whichever of the 3-gram and 4-gram models had the lowest perplexity was chosen). The decoding vocabulary was carefully chosen to ensure that we did not miss any possible locations or situation frame keywords.

To perform ASR on the IL speech data provided to us, we remove silence and split the data into small single-speaker segments using 4-class Hidden Markov Model (HMM) segmentation followed by BIC clustering, based on the LIUM toolkit (Meignier and Merlin, 2010). This automatically segmented incident data was converted to text using the ASR and passed on to the Text SF, NER and EDL models. An illustration of our pipeline is shown in Figure 5.

Fig. 5: Speech Pipeline of ARIEL-CMU system in LoReHLT 2018.

VIII-A1 Speech Recognition System

IL9 Speech Recognition System

For CP1, due to the short development time, we had almost no training data available. We tried developing a system on our Kinyarwanda and Kirundi Bible data, which was aligned automatically using a speech synthesis module (explained in §VIII-A4) and later fine-tuned using the downstream and upstream tasks of ASR and speech synthesis (Prahallad et al, 2007). The Bible ASR was trained using domain-robust features (Dalmia et al, 2018a), and its output was grounded to words using beam search decoding with the ASR and a phoneme RNN-based language model.

We found that our Kinyarwanda output for CP1 was poor, and after further verification using the NI-collected data we concluded that the alignment between the text and audio of the Bible data was not reliable enough to train a system.

For CP2, we shifted our focus to multilingual models and tried to transfer models from a close high-resource language, as discussed in Dalmia et al (2018b). We used a pre-trained multilingual broadcast news model of English, Turkish, Spanish, Czech and Mandarin, built from the resources mentioned in §III, which we adapted to Kinyarwanda using phoneme mappings. Even though this gave us improvements over our CP1 model, the transfer languages were quite distant from the incident language. After careful selection we found that Swahili was the closest language from which to transfer the ASR. For our final model we trained a Swahili-based recognizer, using the resources mentioned in §III, and mapped its phones to Kinyarwanda. We also added around 550 utterances collected from the NI, which was crucial for improving the recognizer and fixing some of the phoneme confusions that had arisen from the transfer from Swahili. To ground the ASR output to words we used WFST-based decoding.

For language model training, we cleaned the monolingual newswire text by filtering out all scripts except Latin. The vocabulary of the Kinyarwanda decoder was restricted to around 100k words.

IL10 Speech Recognition System

For CP1, since we had Sinhala speech data available (§III), we started developing an ASR directly on it. Around 180k utterances were chosen to train the model and 1.5k were used as the validation set. We built a WFST-based decoding graph using a trigram language model of the set0 and setE data for checkpoint 1. The best decoding parameters were chosen based on the performance of the system on NI recordings.

We found that there was a clear mismatch between the training data and the IL audio, which was mostly broadcast news data, and this could potentially lower the quality of the ASR.

For CP2, we used the domain-robust feature extraction technique discussed in (Dalmia et al, 2018a). This gave us a 10% relative reduction in WER on the NI-collected data. We also improved our language model by using more in-domain Set1 data, restricting it to newswire text only. To clean the monolingual newswire text for the language model we filtered out all scripts except Sinhala and English; since Sinhala is written in a non-Latin script, we assumed the text would contain Latin loan words and be influenced by English. The vocabulary of the Sinhala decoder was restricted to around 50k words.
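A minimal sketch of this script-based cleanup is shown below, keeping only tokens written in Sinhala (Unicode block U+0D80–U+0DFF) or basic Latin; the exact normalization and thresholds used in the evaluation may have differed.

```python
# Minimal sketch: keep only lines whose tokens are Sinhala or Latin
# (for English loan words). Thresholds and normalization are illustrative.
import re

SINHALA = r'\u0D80-\u0DFF'
LATIN = r'A-Za-z'
ALLOWED_WORD = re.compile(r'^[' + SINHALA + LATIN + r']+$')

def clean_line(line):
    kept = [tok for tok in line.split() if ALLOWED_WORD.match(tok)]
    return ' '.join(kept)

def clean_corpus(in_path, out_path):
    with open(in_path, encoding='utf-8') as fin, \
         open(out_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            cleaned = clean_line(line)
            if cleaned:
                fout.write(cleaned + '\n')
```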

Viii-A2 NI speech annotation systems

To collect speech data from NIs effectively, we developed two web applications. During the NI sessions, we sent the links of these applications to the NIs, who followed the instructions to provide speech data.

  1. The first application is the annotation application, with which we collect transcriptions of specific audio clips presented to the NIs. During the testing period, we found that it was much easier for the NIs to transcribe shorter audio clips, because longer clips usually contain many words and require the NIs to replay the clip several times. Prior to the NI session, we automatically segmented the speech audio using the technique mentioned in the previous section. All clips shorter than 2 seconds were discarded, as they usually contained only background noise or music. Clips between 2 and 6 seconds were played to the NI after being manually checked to ensure that the automatic segmentation had not let through any clips containing music or unrelated noise. If the NI judged that the text spoken in the audio clearly indicated an SF type, this was noted and passed to the Text SF system as development data.

  2. The second application is the recording application, which allows an NI to record their voice while reading texts we have prepared. The recording application is more effective than the annotation application in terms of collection speed, since it is easier to read sentences than to transcribe audio; it is particularly useful when an NI is poor at typing or transcribing. However, its major drawback is that the speakers are confined to the few NIs, and their recording environment does not match the audio in our datasets; background noise in the NI’s room also makes the recordings harder to use. The text that the NI was asked to read consisted of Set0/Set1 segments of documents for which the SF keyword system had high confidence, which also allowed us to verify whether the SF prediction was correct.

Across all sessions with the NIs, we collected 680 audio/transcription pairs for IL9 and 477 for IL10. As IL9 had very little speech training data, we used most of these pairs as training data and reserved 100 utterances for validation. IL10, on the other hand, had a sufficient amount of training data, so we used all of its utterances as the validation set.

Fig. 6: Situation Frame identification pipeline, which includes SF Type, SF Location, SF Justification, Status, and Urgency.

Viii-A3 IPA conversion for Speech

We performed all of our experiments in IPA space. This is particularly useful because it makes it easy to transfer acoustic models across languages; we found a high overlap between Kinyarwanda and Swahili phonemes, which we believe was crucial to making the transfer successful, even though we grounded the words back to their orthography as part of the WFST decoder. We converted the orthography into IPA using the Backoff mode in Epitran (§IV-A2), which helped filter out noise in the IL text such as URLs, non-IL scripts, emoticons, and numbers, all of which tend to affect the decoding of the ASR. In our initial experiments we also found that, beyond this filtering, doing SF in IPA space can be useful when the orthography is highly irregular, and it can often help normalize some spelling errors.
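The conversion itself can be sketched as below for IL10, assuming Epitran's Backoff class and the 'sin-Sinh'/'eng-Latn' language codes; tokens that no language in the backoff chain can transliterate come back empty and are dropped, which is the filtering effect described above.

```python
# Sketch of Backoff-mode IPA conversion; language codes are assumptions and
# the exact backoff chain used in the evaluation may have differed.
from epitran.backoff import Backoff

backoff = Backoff(['sin-Sinh', 'eng-Latn'])

def to_ipa(text):
    # Tokens that cannot be transliterated by any language in the chain
    # (URLs, emoticons, other scripts) come out empty and are discarded.
    ipa_tokens = [backoff.transliterate(tok) for tok in text.split()]
    return ' '.join(t for t in ipa_tokens if t)
```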

Viii-A4 Speech Synthesis

We built speech synthesizers for Kinyarwanda and Kirundi. We used CMU’s Clustergen parametric speech synthesis system (Black and Muthukumar, 2015), as it is robust to data noise and produces reliable synthesis even from small amounts of data; pronunciations were produced by Epitran. We further used these synthesis models to align read Bible data in the incident language (even though that data had background music) to produce synthesizers built on actual native acoustic data (Prahallad et al, 2007).

Viii-B Situation Frame Pipeline

The identification of SFs and the selection of SF types are performed by two primary models, each with numerous variations. The sentences (and the surrounding context sentences) justifying the models’ predictions are used to further enrich the situation frames with location, status, resolution, and urgency. An illustration of our pipeline is shown in Figure 6.

Viii-B1 SF Type Identification

Keyword Model

The Keyword Model is a lexicon-matching model using a list of curated keywords for each SF type. We created the list of keywords in two steps: (1) build a list of keywords for each SF type in English, then (2) translate the keywords into the target language using the provided dictionary and with the help of the native informants during NI sessions in the first checkpoint. During the translation process, the NIs were shown English keywords with an example usage taken from ReliefWeb; the idea was to give the NIs the context of each English keyword in order to get the most relevant translation.

Building the English keywords (step 1 above) is itself a two-step process. First, we used the ReliefWeb dataset to generate a list of 100 candidate keywords for each class by taking the top-100 words with the highest TF-IDF scores. Similar to the keyword generation method described by Littell et al (2017), we manually refined the keyword list by pruning based on world knowledge. For each candidate keyword, we added the 30 most similar words using the English word2vec model trained on the Google News corpus.141414https://code.google.com/archive/p/word2vec We retained only those words with cosine-similarity scores greater than 70%. For each candidate keyword in this extended list, we computed a label-affinity score with each class label (e.g., water, evacuation) using the cosine similarity between their word2vec embeddings. Candidate keywords with similarity above a certain threshold were retained and used as keywords for the corresponding classes. We primarily used one threshold value in our submissions, but also submitted versions with other values for comparison.
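The expansion and label-affinity steps can be sketched as follows with gensim and the Google News word2vec vectors; the file name and any threshold beyond the 70% similarity cut-off stated above are illustrative.

```python
# Sketch of keyword expansion and label affinity scoring; not evaluation code.
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

def expand_keywords(seed_keywords, topn=30, sim_threshold=0.70):
    # Add the 30 nearest neighbors of each seed keyword, keeping only those
    # with cosine similarity above the 70% cut-off described above.
    expanded = set(seed_keywords)
    for kw in seed_keywords:
        if kw not in w2v:
            continue
        for neighbor, score in w2v.most_similar(kw, topn=topn):
            if score > sim_threshold:
                expanded.add(neighbor)
    return expanded

def label_affinity(keyword, label):
    # Cosine similarity between a candidate keyword and a class label
    # (e.g., "water", "evacuation"); keywords above a threshold are retained.
    if keyword in w2v and label in w2v:
        return w2v.similarity(keyword, label)
    return 0.0
```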

During testing, we retrieved the sentences containing any of the keywords in our list and assigned the top-2 SF types to each such sentence, based on the sum of the confidence scores of the matched keywords.
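A minimal sketch of this matching step, assuming a keyword table that maps each translated keyword to (SF type, confidence) pairs produced by the steps above:

```python
# Sketch: score SF types for a sentence by summing matched keyword confidences
# and keep the top-2 types, as described above. Data structures are assumptions.
from collections import defaultdict

def assign_sf_types(sentence_tokens, keyword_table, top_k=2):
    # keyword_table: {keyword: [(sf_type, confidence), ...]}
    scores = defaultdict(float)
    for tok in sentence_tokens:
        for sf_type, conf in keyword_table.get(tok, []):
            scores[sf_type] += conf
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```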

At checkpoint 2, we used NI sessions to verify the outputs of the Keyword Model and, based on our interactions with the NIs, pruned keywords that were not useful for prediction. By this time, we were also using a morphological analyzer from our linguistics team for each IL to match lemmatized keywords and improve recall. We expected this to be particularly useful for IL9, which is morphologically rich.

We also made another submission where we did the keyword matching on the IPA version of the IL texts and of our speech-to-text outputs. The keywords were first converted into IPA, and matching was done as usual.

Neural Model

Our second model is a convolutional neural network (CNN) that takes sequences of word embeddings as input and classifies them into SF types.

The first step is to train a bilingual word embedding as a shared feature representation between English and the ILs. We used XlingualEmb (Duong et al, 2016) and trained our bilingual word embedding for English and the IL. XlingualEmb is a cross-lingual extension of the word2vec model (Mikolov et al, 2013) to bilingual text, using monolingual corpora and a bilingual dictionary.

Then the CNN model takes a sequence of (bilingual) word embeddings as input and applies a 1-D convolution to extract semantic features. The features are then passed through a fully-connected layer before reaching the final softmax layer. Thanks to the bilingual word embeddings, which map the words of the two languages into the same feature representation space, a model trained on English can also be applied to documents in the ILs. This enables us to use the same model and parameters for predicting SF types in both English documents and IL documents.

As described in our recent publication (Muis et al, 2018), we primarily trained our model on English data and fine-tuned it on IL annotations when they existed. Our English data include the ReliefWeb dataset (Horwood and Bartem, 2016) and the LORELEI Representative Language Packs (LRLP) dataset (§III-C2). We extended the ReliefWeb dataset with sentences found by our bootstrapped keyword system, an extension of the Keyword Model described above with an additional keyword bootstrapping step to obtain more keywords. Because the resulting ReliefWeb dataset was biased by keywords, we filtered out false positives using an SVM classifier, which we built rapidly with an active learning strategy: we alternately performed manual annotation and SVM training, where subsequent annotations were done on sentences for which the SVM model from the previous iteration gave a low confidence score (i.e., sentences closer to the decision boundary), as sketched below. This resulted in 2,562 annotations over 11 SF types. Finally, we took the top-25% positive and negative predictions151515That is, we removed 50% of the data in the middle. on the whole ReliefWeb dataset as our final training data.
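The uncertainty-sampling loop and the final top-25% selection can be sketched as follows with scikit-learn; the batch size and the feature extraction are illustrative assumptions, not the settings used in the evaluation.

```python
# Sketch of uncertainty-based active learning with a linear SVM, plus the
# final selection of the most confident positives and negatives.
import numpy as np
from sklearn.svm import LinearSVC

def active_learning_round(X_labeled, y_labeled, X_pool, batch_size=100):
    clf = LinearSVC()
    clf.fit(X_labeled, y_labeled)
    # decision_function gives the signed distance to the decision boundary;
    # examples closest to the boundary are the least confident.
    margins = np.abs(clf.decision_function(X_pool))
    to_annotate = np.argsort(margins)[:batch_size]
    return clf, to_annotate

def final_training_indices(clf, X_all, keep_fraction=0.25):
    # Keep the top-25% most confident positive and negative predictions,
    # i.e., drop the middle 50% of the pool, as described above.
    scores = clf.decision_function(X_all)
    order = np.argsort(scores)
    n_keep = int(len(scores) * keep_fraction)
    return np.concatenate([order[:n_keep], order[-n_keep:]])
```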

During testing, we ran our model on each sentence in the test set to obtain a probability estimate of each SF type for each sentence. We then adopted two approaches to filter out SF types with low probability estimates. In the first approach, we calculated the average probability $\mu$ and standard deviation $\sigma$ for each SF type over all sentences, and filtered out the predictions whose estimates fall below $\mu + \alpha\sigma$, where $\alpha$ is a hyperparameter set based on the results in previous evaluations. The second approach builds on the assumption that a single document is not likely to describe many topics: we took only the top-$k$ SF types per document, where $k$ is a function of the number of sentences in the document. On top of this top-$k$ extraction, we also filtered out the predictions with low probability estimates using the first approach, with a value of $\alpha$ chosen separately.
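The first filtering approach, as reconstructed above, can be sketched as follows; the value of $\alpha$ used in the submissions is not reproduced here.

```python
# Sketch of per-type thresholding: keep only predictions at or above
# mu + alpha * sigma, computed per SF type over all test sentences.
import numpy as np

def filter_by_type_statistics(probs, alpha):
    # probs: array of shape (num_sentences, num_sf_types)
    mu = probs.mean(axis=0)
    sigma = probs.std(axis=0)
    threshold = mu + alpha * sigma
    # Boolean mask of (sentence, SF type) predictions to keep.
    return probs >= threshold
```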

In checkpoint 2, we also experimented with the moment matching method of Zellinger et al (2017) to alleviate the domain mismatch between our English training data and the IL test data, aiming to build a feature extractor that captures only the semantics of the event types and not the differences in language usage between English and the ILs. In other words, we tried to make the features captured by the CNN informative for event type classification and language-invariant at the same time. The method does this by minimizing the distance between the feature vectors of English and IL text obtained from the CNN: we treat the sets of extracted feature vectors as samples from probability distributions and constrain the CNN to minimize the difference between the higher-order central moments of these distributions.
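A condensed sketch of the central moment discrepancy (CMD) regularizer, written here in PyTorch for illustration; the moment order and the weighting against the classification loss are assumptions, and the per-moment normalization constants of Zellinger et al (2017) are omitted.

```python
# Sketch of a CMD-style regularizer between source (English) and target (IL)
# CNN feature batches; simplified relative to Zellinger et al (2017).
import torch

def cmd(source_feats, target_feats, K=5):
    # source_feats, target_feats: (batch, feat_dim) activations from the CNN,
    # assumed to be bounded (e.g., after a sigmoid or tanh).
    ms, mt = source_feats.mean(dim=0), target_feats.mean(dim=0)
    loss = torch.norm(ms - mt, p=2)
    cs, ct = source_feats - ms, target_feats - mt
    for k in range(2, K + 1):
        # Match the k-th order central moments of the two feature distributions.
        loss = loss + torch.norm(cs.pow(k).mean(dim=0) - ct.pow(k).mean(dim=0), p=2)
    return loss

# The total loss would combine the classification loss on English sentences
# with lambda * cmd(...), pushing the CNN toward language-invariant features.
```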

In checkpoint 2 we also showed our Keyword Model outputs on Set1 to the NIs to be annotated with SF types, for use as a development set to estimate the performance of the numerous variants of the Neural Model. Although this development set is not perfect, since part of the training data of the Neural Model comes from a variant of the Keyword Model, it allowed us to identify hyperparameter combinations that performed poorly and thus helped us decide which systems to submit.

As with our Keyword Model, we also made another set of submissions in which the IL texts were converted into IPA. For the Neural Model, this also required training the bilingual word embeddings in IPA, after which the same SF pipeline could be applied.

Viii-B2 SF Justification

Our two models make predictions at the sentence level, so we simply take the sentence that was used to predict a particular SF type as its justification sentence.

Viii-B3 SF Location linking

After populating the SFs in a document, for each SF we assign locations based on the GPE and LOC entities found in the sentences surrounding the justification sentence, up to $w$ sentences away; if there are no location entities in those surrounding sentences, we assign the most recently assigned location mention. In our submissions we used two settings of $w$.
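A minimal sketch of this heuristic is given below; the entity representation is an assumption, the fallback to the most recently assigned location is approximated by the nearest preceding location entity, and the window size is left as a parameter.

```python
# Sketch of SF location assignment around a justification sentence.
def assign_location(justification_idx, entities, window):
    # entities: list of (sentence_idx, entity_type, mention), in document order.
    nearby = [e for e in entities
              if e[1] in ('GPE', 'LOC')
              and abs(e[0] - justification_idx) <= window]
    if nearby:
        # Prefer the location mention closest to the justification sentence.
        return min(nearby, key=lambda e: abs(e[0] - justification_idx))[2]
    # Fallback: approximate "most recently assigned location" by the nearest
    # preceding location entity in the document.
    previous = [e for e in entities
                if e[1] in ('GPE', 'LOC') and e[0] < justification_idx]
    return previous[-1][2] if previous else None
```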

Viii-B4 SF Status and Urgency detection

In our SF systems we always predict “insufficient” and “current” for the resolution and status field, respectively, and focus our effort on urgency prediction.

For our Checkpoint 1 submissions, we had IL-English parallel data in which the English documents were labeled with sentiment, emotion, and urgency labels using pre-trained sentiment and emotion systems. The sentiment system was a bidirectional LSTM trained on English Twitter data with bilingual word embeddings. The emotion system was a gated recurrent neural network trained on a multi-genre English corpus (emotional blog posts, tweets, news titles, and movie reviews). Urgency labels were derived from the emotion tag distribution (e.g., anger and fear) according to a targeted urgency distribution. These labels were then projected onto the IL documents, and an SVM classifier was trained in the IL using sentiment as a feature. For the English urgency classifier, we combined the English documents from both the IL9 and IL10 parallel data and used them as training data.

For our Checkpoint 2 submissions, we used our Checkpoint 1 classifiers to predict urgency labels on a subset of Set1, selected by taking the justification sentences of our Keyword Model when run on Set1. We combined this data with the parallel data from Set0, with 20% stratified sub-sampling to avoid having the parallel data dominate the Set1 data. We considered two variants of the additional training data: (1) all sentences in this Set1 subset, or (2) only those sentences whose confidence scores from the urgency classifier passed a certain threshold. After evaluating various models on a development set we created by eliciting annotations from the NIs on Set1, we re-trained the best model with the development set as additional training data.
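The stratified sub-sampling of the Set0 parallel data can be sketched with scikit-learn as follows; variable names and the random seed are illustrative assumptions.

```python
# Sketch: take a 20% stratified sample of the parallel data so that it does
# not dominate the Set1 subset, while preserving the urgency label distribution.
from sklearn.model_selection import train_test_split

def subsample_parallel(X_parallel, y_parallel, fraction=0.20, seed=0):
    X_sub, _, y_sub, _ = train_test_split(
        X_parallel, y_parallel,
        train_size=fraction,
        stratify=y_parallel,   # keep the urgency label proportions
        random_state=seed)
    return X_sub, y_sub
```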

For the English urgency classifier in Checkpoint 2, we trained an SVM classifier on both the IL9 and IL10 Set S data, automatically labeled by our classifier from Checkpoint 1. We also added data collected from the Figure Eight crowdsourcing platform, consisting of tweets about disasters.

Acknowledgment

This project was sponsored by the Defense Advanced Research Projects Agency (DARPA) Information Innovation Office (I2O), program: Low Resource Languages for Emergent Incidents (LORELEI), issued by DARPA/I2O under Contract No. HR0011-15-C-0114.

References

  • Arthur et al (2016) Arthur P, Neubig G, Nakamura S (2016) Incorporating discrete translation lexicons into neural machine translation. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas, pp 1557–1567, URL https://aclweb.org/anthology/D16-1162
  • Bahdanau et al (2014) Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473, URL http://arxiv.org/abs/1409.0473
  • Beesley and Karttunen (2003) Beesley KR, Karttunen L (2003) Finite-State Morphology. Center for Study of Language and Information Publications, Stanford, CA
  • Black and Muthukumar (2015) Black AW, Muthukumar PK (2015) Random forests for statistical speech synthesis. In: Sixteenth Annual Conference of the International Speech Communication Association, http://festvox.org
  • Bloodgood and Callison-Burch (2010) Bloodgood M, Callison-Burch C (2010) Bucking the trend: Large-scale cost-focused active learning for statistical machine translation. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, pp 854–864
  • Bojanowski et al (2016) Bojanowski P, Grave E, Joulin A, Mikolov T (2016) Enriching word vectors with subword information. arXiv preprint arXiv:160704606
  • Bordes et al (2011) Bordes A, Weston J, Collobert R, Bengio Y (2011) Learning structured embeddings of knowledge bases. In: Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2011, San Francisco, California, USA, August 7-11, 2011, URL http://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/view/3659
  • Chiang (2007) Chiang D (2007) Hierarchical phrase-based translation. Computational Linguistics 33(2):201–228
  • Dalmia et al (2018a) Dalmia S, Li X, Metze F, Black AW (2018a) Domain robust feature extraction for rapid low resource asr development. 2018 IEEE Spoken Language Technology Workshop (SLT) pp 258–265
  • Dalmia et al (2018b) Dalmia S, Sanabria R, Metze F, Black AW (2018b) Sequence-based multi-lingual low resource speech recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp 4909–4913
  • Duong et al (2016) Duong L, Kanayama H, Ma T, Bird S, Cohn T (2016) Learning crosslingual word embeddings without bilingual corpora. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Texas, USA, pp 1285–1295
  • Dyer et al (2010) Dyer C, Weese J, Setiawan H, Lopez A, Ture F, Eidelman V, Ganitkevitch J, Blunsom P, Resnik P (2010) cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In: Proceedings of the ACL 2010 System Demonstrations, Association for Computational Linguistics, pp 7–12
  • Dyer et al (2013a) Dyer C, Chahuneau V, Smith NA (2013a) A simple, fast, and effective reparameterization of IBM Model 2. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Atlanta, Georgia, pp 644–648
  • Dyer et al (2013b) Dyer C, Chahuneau V, Smith NA (2013b) A simple, fast, and effective reparameterization of ibm model 2. Association for Computational Linguistics
  • Dyer et al (2015) Dyer C, Ballesteros M, Ling W, Matthews A, Smith NA (2015) Transition-based dependency parsing with stack long short-term memory. CoRR abs/1505.08075, URL http://arxiv.org/abs/1505.08075
  • Heafield (2011) Heafield K (2011) Kenlm: Faster and smaller language model queries. In: Proceedings of the Sixth Workshop on Statistical Machine Translation, Association for Computational Linguistics, pp 187–197
  • Heafield and Lavie (2010) Heafield K, Lavie A (2010) Combining machine translation output with open source: The carnegie mellon multi-engine machine translation scheme. The Prague Bulletin of Mathematical Linguistics 93:27–36
  • Heafield et al (2013) Heafield K, Pouzyrevsky I, Clark JH, Koehn P (2013) Scalable modified Kneser-Ney language model estimation. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, pp 690–696, URL https://kheafield.com/papers/edinburgh/estimate_paper.pdf
  • Horwood and Bartem (2016) Horwood G, Bartem K (2016) LORELEI HA/DR Lexicon V1. DOI 10.7910/DVN/TGOPRU, URL https://gitlab.nextcentury.com/Graham.Horwood/adriel
  • Hulden (2009) Hulden M (2009) Foma: a finite-state compiler and library. In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, pp 29–32
  • Kingma and Ba (2014) Kingma DP, Ba J (2014) Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:14126980
  • Kneser and Ney (1995) Kneser R, Ney H (1995) Improved backing-off for m-gram language modeling. In: icassp, vol 1, p 181e4
  • Kudo (2018) Kudo T (2018) Subword regularization: Improving neural network translation models with multiple subword candidates. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, pp 66–75, URL http://www.aclweb.org/anthology/P18-1007
  • Lamraoui and Langlais (2013) Lamraoui F, Langlais P (2013) Yet another fast, robust and open source sentence aligner. time to reconsider sentence alignment? In: XIV Machine Translation Summit, Nice, France
  • Littell et al (2017) Littell P, Tian T, Xu R, Sheikh Z, Mortensen D, Levin L, Tyers F, Hayashi H, Horwood G, Sloto S, Tagtow E, Black A, Yang Y, Mitamura T, Hovy E (2017) The ARIEL-CMU situation frame detection pipeline for LoReHLT16: a model translation approach. Machine Translation pp 1–22, DOI 10.1007/s10590-017-9205-3, URL http://link.springer.com/10.1007/s10590-017-9205-3
  • Ma and Hovy (2016) Ma X, Hovy E (2016) End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:160301354
  • Ma et al (2017) Ma X, Fauceglia NR, Lin YC, Hovy E (2017) Cmu system for entity discovery and linking at tac-kbp 2016. In: Proceedings of Text Analytics Conference (TAC 2017).
  • Manning et al (2014) Manning CD, Surdeanu M, Bauer J, Finkel J, Bethard SJ, McClosky D (2014) The Stanford CoreNLP natural language processing toolkit. In: Association for Computational Linguistics (ACL) System Demonstrations, pp 55–60, URL http://www.aclweb.org/anthology/P/P14/P14-5010
  • Meignier and Merlin (2010) Meignier S, Merlin T (2010) Lium spkdiarization: an open source toolkit for diarization. In: CMU SPUD Workshop
  • Miao et al (2015) Miao Y, Gowayyed M, Metze F (2015) Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding. In: Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on, IEEE, pp 167–174
  • Mikolov et al (2013) Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, pp 3111–3119
  • Miura et al (2016) Miura A, Neubig G, Paul M, Nakamura S (2016) Selecting syntactic, non-redundant segments in active learning for machine translation. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, pp 20–29, URL http://www.aclweb.org/anthology/N16-1003
  • Moro et al (2014) Moro A, Raganato A, Navigli R (2014) Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics 2:231–244
  • Mortensen et al (2018) Mortensen DR, Dalmia S, Littell P (2018) Epitran: Precision G2P for many languages. In: Calzolari N, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T (eds) Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Paris, France
  • Muis et al (2018) Muis AO, Otani N, Vyas N, Xu R, Yang Y, Mitamura T, Hovy E (2018) Low-resource cross-lingual event type detection via distant supervision with minimal effort. In: Proceedings of the 27th International Conference on Computational Linguistics (COLING)
  • Munteanu and Marcu (2005) Munteanu DS, Marcu D (2005) Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics 31(4):477–504
  • Neubig et al (2018) Neubig G, Sperber M, Wang X, Felix M, Matthews A, Padmanabhan S, Qi Y, Sachan DS, Arthur P, Godard P, Hewitt J, Riad R, Wang L (2018) XNMT: The extensible neural machine translation toolkit. In: Conference of the Association for Machine Translation in the Americas (AMTA) Open Source Software Showcase, Boston, URL https://arxiv.org/pdf/1803.00188.pdf
  • Pennington et al (2014) Pennington J, Socher R, Manning C (2014) Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
  • Prahallad et al (2007) Prahallad K, Toth AR, Black AW (2007) Automatic building of synthetic voices from large multi-paragraph speech databases. In: INTERSPEECH, pp 2901–2904, http://festvox.org
  • Salton and McGill (1986) Salton G, McGill MJ (1986) Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA
  • Sutskever et al (2014) Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp 3104–3112, URL http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks
  • Tiedemann (2009) Tiedemann J (2009) News from opus-a collection of multilingual parallel corpora with tools and interfaces. In: Recent advances in natural language processing, vol 5, pp 237–248
  • Tjong Kim Sang and De Meulder (2003) Tjong Kim Sang EF, De Meulder F (2003) Introduction to the conll-2003 shared task: Language-independent named entity recognition. In: Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, Association for Computational Linguistics, pp 142–147
  • Tsai et al (2016) Tsai CT, Mayhew S, Roth D (2016) Cross-lingual named entity recognition via wikification. In: CoNLL, URL http://cogcomp.org/papers/TsaiMaRo16.pdf
  • Zellinger et al (2017) Zellinger W, Grubinger T, Lughofer E, Natschläger T, Saminger-Platz S (2017) Central Moment Discrepancy (CMD) for Domain-Invariant Representation Learning. In: ICLR 2017, pp 1–13, DOI 10.1145/3097983.3098126, URL http://arxiv.org/abs/1702.08811, 1702.08811