Interactive Refinement of Cross-Lingual Word Embeddings

Cross-lingual word embeddings transfer knowledge between languages: models trained for a high-resource language can be used in a low-resource language. These embeddings are usually trained on general-purpose corpora but used for a domain-specific task. We introduce CLIME, an interactive system that allows a user to quickly adapt cross-lingual word embeddings for a given classification problem. First, words in the vocabulary are ranked by their salience to the downstream task. Then, salient keywords are displayed on an interface. Users mark the similarity between each keyword and its nearest neighbors in the embedding space. Finally, CLIME updates the embeddings using the annotations. We evaluate CLIME on a cross-lingual text classification benchmark for four low-resource languages: Ilocano, Sinhalese, Tigrinya, and Uyghur. Embeddings refined by CLIME capture more nuanced word semantics and have higher test accuracy than the original embeddings. CLIME also improves test accuracy faster than an active learning baseline, and a simple combination of CLIME with active learning has the highest test accuracy.


1 Introduction

Figure 1:

A hypothetical topographic map of an English-French embedding space tailored for sentiment analysis. Dots are English words, and squares are French words. Positive sentiment words such as “excellent”, “exceptional”, and “géniale” are grouped in one clime, while negative sentiment words such as “disappointing”, “lackluster”, and “décevant” are grouped in another clime. We expect that these climes make text classification easier.

Modern text classification models are highly accurate when trained on large labeled datasets and pre-trained word embeddings (Kim, 2014; Iyyer et al., 2015; Joulin et al., 2017). For low-resource languages, however, labeled data are scarce. One way to bridge the resource gap between languages is to use cross-lingual word embeddings (clwe), where words from different languages are mapped to a shared vector space. Using clwe as features, models trained in a resource-rich language (e.g., English) can predict labels for documents in other languages.

clwe are usually learned from pre-defined resources such as parallel text, monolingual corpora, and dictionaries (Ruder et al., 2019; Glavas et al., 2019). The success of these methods relies on the domain and quality of training data (Søgaard et al., 2018). Furthermore, while these methods have impressive word translation accuracy, they are not tailored for downstream tasks such as text classification (Glavas et al., 2019).

We develop a system, CLassifying Interactively with Multilingual Embeddings (clime), that efficiently specializes clwe with human interaction. Given a pre-trained clwe, the interface brings a human in the loop by asking a bilingual speaker to review the nearest neighbors of each word. clime capitalizes on the intuition that neighboring words in an ideal embedding space should have similar semantic attributes. These regions resemble climes, geographic zones with distinctive meteorological features. Here, we refer to climes as areas in the embedding space where words share similar semantic features. Our goal is to convert neighborhoods in the embedding space into classification climes with words that induce similar labels for a given classification task. For example, in an ideal embedding space for English-French sentiment analysis, positive sentiment words such as “excellent”, “exceptional”, and their French translations are mapped to one area, while negative sentiment words such as “disappointing”, “lackluster”, and their translations are mapped to another area (Figure 1). By separating the positive and negative sentiment words in the embedding space, these climes should help train sentiment classifiers.

To build classification climes, clime first uses a gradient-based approach to find keywords with high salience in the downstream task (Section 2.1). Focusing on these keywords lets users refine clwe most efficiently. Then, the keywords and their nearest neighbors are displayed on an interface for a user to mark their similarity or dissimilarity (Section 2.2). To fine-tune the embeddings for the downstream classification task, we define similarity between words as inducing similar labels. After collecting annotations, clime updates the embeddings such that similar words are pulled closer and dissimilar words are pushed further apart (Section 3), establishing desired climes similar to Figure 1.

clime is related to active learning (Settles, 2012), which also improves a classifier through user interaction. Traditional active learning focuses on document-level interaction by asking users to label selected examples. In contrast, the word-level interactions with clime update the embeddings in the model. Intuitively, the two methods are complementary: clime teaches the model lexical semantics by aligning words in the embedding space, while active learning provides document-level supervision that teaches how to predict labels from words.

To study the relationship between clime and active learning, we evaluate clime on a cross-lingual document classification task for four low-resource languages: Ilocano, Sinhalese, Tigrinya, and Uyghur (Section 4). Using clime, a bilingual speaker can significantly increase test accuracy in less than an hour. We then compare clime with an active learning baseline that asks a user to label target language documents. clime is very effective in this low-resource setting. Under the same annotation time constraint, clime has a higher test accuracy. Combining active learning with clime further improves classification.

Figure 2: The clime interface. A keyword is displayed on the top, and its nearest neighbors in the two languages appear in the two columns below. A user can accept or reject each neighbor with the buttons, and add new neighbors by typing them in the “add word” textboxes. They may also click on any word to read its context in the training set.

2 Interactive Neighborhood Reshaping

This section introduces the interface designed to solicit human feedback on clwe neighborhoods and our keyword selection criterion. Suppose that we have two languages with vocabularies $V_1$ and $V_2$, respectively. Let $\mathbf{E}$ be a pre-computed clwe matrix, where row $\mathbf{e}_w$ is the vector representation of word type $w$ in the joint vocabulary $V = V_1 \cup V_2$. Our goal is to improve the clwe matrix for a downstream task through inspection of neighboring words by a bilingual user.

2.1 Keyword Selection

Due to the large size of the vocabulary $V$, it is impossible for the user to check the neighbors of every word. Therefore, we focus on a subset of keywords that are the most salient for a target downstream task. For example, if the downstream task is sentiment analysis, our keyword set should include sentiment words such as “good” and “bad”. Prior work in active learning solicits keywords using information gain (Settles, 2011), but this cannot be applied to continuous embeddings. Li et al. (2016) suggest that the contribution of one dimension of a word embedding to the loss function can be approximated by the absolute value of its partial derivative, and they therefore use partial derivatives to visualize the behavior of neural models. However, rather than understanding the importance of individual dimensions, we want to compute the salience of an entire word vector. Therefore, we extend their idea by defining the salience of a word embedding as the magnitude of the loss function's gradient. This score summarizes the salience of all dimensions of a word embedding. Formally, let $\mathbf{x} = \langle w_1, w_2, \ldots, w_n \rangle$ be a document of $n$ words with label $y$, and let $L(\mathbf{x}, y)$ be the training loss function. We measure the local salience of word $w_i$ in document $\mathbf{x}$ as

$$ s(w_i, \mathbf{x}) = \left\lVert \nabla_{\mathbf{e}_{w_i}} L(\mathbf{x}, y) \right\rVert \quad (1) $$

Equation 1 defines the local contribution of a token in one document, but we are interested in the global importance of a word type across documents. To compute the global salience score of a word type $w$, we add the local salience scores (Equation 1) of all occurrences of $w$ in a large labeled dataset $D$ and multiply by the inverse document frequency (idf) of $w$:

$$ S(w) = \operatorname{idf}(w) \sum_{\mathbf{x} \in D} \; \sum_{w_i \in \mathbf{x} \,:\, w_i = w} s(w_i, \mathbf{x}) \quad (2) $$

The idf term is necessary because it discounts stop words with high document frequency (e.g., “the” and “of”). These words are often irrelevant to the downstream task, yet they accumulate high total local salience simply because they appear in many examples.

We sort word types by Equation 2 and choose the top-$k$ most salient words as the keyword set $\mathcal{K}$. The hyperparameter $k$ is the number of keywords that we wish to display to the user, which controls the length of a session.
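As an illustration, a minimal PyTorch sketch of this ranking might look as follows. The hooks `model.embed` and `model.classify` are hypothetical stand-ins for whatever classifier clime is attached to, not the authors' released code; the sketch assumes `corpus` is a list of (token ids, label) pairs.

```python
import math
from collections import defaultdict

import torch


def rank_keywords(model, loss_fn, corpus, k=50):
    """Rank word types by gradient-based salience (Equations 1-2).

    model.embed is assumed to be the nn.Embedding holding the clwe
    matrix; model.classify maps embedded tokens to logits.
    """
    total_salience = defaultdict(float)  # summed local salience per type
    doc_freq = defaultdict(int)          # document frequency per type
    n_docs = len(corpus)

    for token_ids, label in corpus:
        vecs = model.embed(torch.tensor(token_ids))  # (n_tokens, dim)
        vecs.retain_grad()                           # keep per-token grads
        loss = loss_fn(model.classify(vecs), torch.tensor([label]))
        loss.backward()
        # Equation 1: local salience of a token is the magnitude of the
        # loss gradient with respect to that token's embedding.
        for wid, s in zip(token_ids, vecs.grad.norm(dim=1).tolist()):
            total_salience[wid] += s
        for wid in set(token_ids):
            doc_freq[wid] += 1
        model.zero_grad()

    # Equation 2: global salience is the idf-weighted sum of local saliences.
    score = {w: s * math.log(n_docs / doc_freq[w])
             for w, s in total_salience.items()}
    return sorted(score, key=score.get, reverse=True)[:k]
```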

2.2 User Interaction

For each keyword $w$, we want to collect a positive set $P_w$ of semantically similar words and a negative set $N_w$ of unrelated words. To specialize embeddings for a classification task, we ask the user to treat semantic similarity as inducing the same label. As an example, if the task is English-French sentiment analysis, then “good” should be considered similar to “excellent” and “génial” but dissimilar to “bad” and “décevant”. On the interface, the keyword $w$ is displayed with its nearest neighbors in the two languages, arranged in two columns (Figure 2). These neighbors are the words whose embeddings are closest to $\mathbf{e}_w$ in cosine similarity. The number of displayed nearest neighbors in each language can be adjusted as a hyperparameter. For each nearest neighbor, the user can either: (1) press the green checkmark to add a positive neighbor to $P_w$, (2) press the red “X” mark to add a negative neighbor to $N_w$, or (3) leave an uncertain neighbor alone. Users can also use the “add word” textbox to add a new word not in the current neighbor list; the added word can then be marked as positive or negative. We explain how clime refines the embeddings with the feedback sets $P_w$ and $N_w$ in Section 3. The interface also provides word concordance, a brief overview of the contexts where a word appears, to disambiguate and clarify words. Users can click on any word to find example sentences.
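A sketch of how the displayed neighbors could be retrieved is below; the function and the `word_lang` lookup are illustrative names, assuming the clwe matrix is a NumPy array with one row per word type.

```python
import numpy as np


def nearest_neighbors(E, word_lang, keyword, lang, n=5):
    """Return the n word ids of language `lang` whose embeddings are
    closest to the keyword's embedding in cosine similarity.

    E: (|V|, dim) clwe matrix; word_lang[i] is the language of word i.
    """
    q = E[keyword]
    sims = (E @ q) / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-9)
    sims[keyword] = -np.inf  # never show the keyword as its own neighbor
    ranked = np.argsort(-sims)
    return [int(i) for i in ranked if word_lang[i] == lang][:n]
```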

3 Fitting Word Embeddings to Feedback

After receiving cross-lingual information from the user, clime updates the embeddings to reflect feedback. The algorithm reshapes the neighborhood so that words marked as similar are pulled closer, and words marked as dissimilar are pushed away. Thus, the words near a keyword would share similar semantic attributes. Together, these embeddings form desired task-specific connections between words across languages.

Our update equations are inspired by the attract-repel algorithm (Mrkšić et al., 2017), which fine-tunes word embeddings with a set of synonym and antonym constraints. attract-repel optimizes a margin-based cost function that pulls synonyms closer and pushes antonyms further apart relative to their respective nearest neighbors. The margin-based objective works well with lexical resources, like BabelNet (Navigli and Ponzetto, 2010), where hundreds of thousands of linguistic constraints are available, but our pilot experiment suggests that the method is not suitable for small constraint sets. Since clime is designed for low-resource languages, we optimize a novel objective that reshapes the neighborhood more drastically than attract-repel.

3.1 Feedback Cost

For each keyword $w$, we collect a positive set $P_w$ and a negative set $N_w$ (Section 2.2). To refine the embeddings with human feedback, we increase the similarity between $\mathbf{e}_w$ and each positive word and decrease the similarity between $\mathbf{e}_w$ and each negative word. Formally, we minimize the following cost function:

$$ C_f(\mathbf{E}) = \sum_{w \in \mathcal{K}} \left( \sum_{v \in N_w} \cos(\mathbf{e}_w, \mathbf{e}_v) - \sum_{u \in P_w} \cos(\mathbf{e}_w, \mathbf{e}_u) \right) \quad (3) $$

where $\cos(\mathbf{e}_w, \mathbf{e}_v)$ measures the similarity between the keyword $w$ and a negative word $v \in N_w$, and $\cos(\mathbf{e}_w, \mathbf{e}_u)$ measures the similarity between the keyword $w$ and a positive word $u \in P_w$. Minimizing $C_f$ is equivalent to maximizing the similarities of positive pairs while minimizing the similarities of negative pairs.
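A minimal sketch of Equation 3 in PyTorch, assuming `feedback` maps each keyword id to its positive and negative id sets:

```python
import torch
import torch.nn.functional as F


def feedback_cost(E, feedback):
    """Feedback cost of Equation 3.

    E: (|V|, dim) tensor of embeddings; feedback maps each keyword id
    to a (positive ids, negative ids) pair of feedback sets.
    """
    cost = E.new_zeros(())
    for w, (positives, negatives) in feedback.items():
        for v in negatives:  # dissimilar words: penalize high similarity
            cost = cost + F.cosine_similarity(E[w], E[v], dim=0)
        for u in positives:  # similar words: reward high similarity
            cost = cost - F.cosine_similarity(E[w], E[u], dim=0)
    return cost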

3.2 Topology-Preserving Regularization

Ilocano:
… Makapaladingit ta maysa a sister a taga-Benguet ti natay ken uppat a brother ti nasugatan gapu iti bagyo …
… Nagtalinaed dagiti pito a balod ti Bureau of Jail Management and Penology (BJMP) ditoy ciudad ti Laoag iti isolation room gapo iti tuko …
… 3 nasugatan iti panagdinungpar ti traysikel ken kotse …

English:
… Sadly, as a result of Typhoon Mangkhut, one sister from Benguet was killed and four brothers were injured …
… Seven inmates from the Bureau of Jail Management and Penology (BJMP), Laoag City, have been transferred to the isolation room due to chicken pox …
… 3 injured in a tricycle and car collision …

Figure 3: Excerpts of three positive Ilocano test examples (top) and their English translations (bottom). They are marked as positive because they describe medical emergencies.

Prior post-processing methods emphasize regularization as a means of maintaining the topology of the embedding space, i.e., the properties that should be preserved under transformations (Mrkšić et al., 2016; Mrkšić et al., 2017; Glavaš and Vulić, 2018). If the original clwe align certain translations together, those translated words should remain close after updating the embeddings. The topology also encodes important semantic information that should not be discarded. Therefore, following previous post-processing methods, we regularize toward the original embeddings. Let $\mathbf{E}^{(0)}$ be the original embedding matrix; we add a regularizer to the cost function:

$$ C_r(\mathbf{E}) = \sum_{w \in V} \left\lVert \mathbf{e}_w - \mathbf{e}_w^{(0)} \right\rVert^2 \quad (4) $$

Minimizing the regularizer prevents $\mathbf{E}$ from drifting too far away from the original embeddings $\mathbf{E}^{(0)}$.

The final cost function combines the feedback cost (Equation 3) and the regularizer (Equation 4):

$$ C(\mathbf{E}) = C_f(\mathbf{E}) + \lambda \, C_r(\mathbf{E}) \quad (5) $$

where the hyperparameter $\lambda$ controls the strength of the regularizer. The updated embeddings enforce the constraints from user feedback while preserving other structure from the original embeddings. We set $\lambda$ to one and optimize with Adam (Kingma and Ba, 2015) using default hyperparameters.
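Putting Equations 3 through 5 together, the refinement step can be sketched as below, reusing `feedback_cost` from the earlier sketch. The step count is a placeholder assumption; the paper only specifies Adam with default hyperparameters and $\lambda = 1$.

```python
import torch


def refine_embeddings(E0, feedback, lam=1.0, steps=200):
    """Minimize Equation 5: feedback cost (Eq. 3) plus the
    topology-preserving regularizer (Eq. 4), starting from E0."""
    E = E0.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([E])  # default hyperparameters
    for _ in range(steps):
        optimizer.zero_grad()
        regularizer = ((E - E0) ** 2).sum()  # Equation 4
        loss = feedback_cost(E, feedback) + lam * regularizer
        loss.backward()
        optimizer.step()
    return E.detach()
```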

4 Cross-Lingual Classification Experiments

We evaluate clime on a cross-lingual document-classification task (Klementiev et al., 2012), where we build a document classifier for a low-resource target language using labeled documents in another high-resource source language through clwe. Our classification task is to identify whether a document describes a medical emergency, which is useful for planning disaster relief. We use English as the source language and experiment with four low-resource target languages: Ilocano, Sinhalese, Tigrinya, and Uyghur.

Our experiments confirm that a bilingual user can quickly improve the test accuracy of cross-lingual models through clime. An alternative use of a bilingual annotator is to label more training documents. Therefore, we also compare clime to an active learning baseline that queries the user for document labels and find that clime improves the test accuracy faster. Finally, we combine clime and active learning, which leads to even faster improvement of test accuracy.

4.1 Experiment Setup

Figure 4: Test accuracy of four methods on four target languages: (a) Ilocano, (b) Tigrinya, (c) Uyghur, and (d) Sinhalese. Base uses the original clwe and the original training set. Active uses the original clwe and a training set augmented by active learning: we select and label fifty target language documents by uncertainty sampling (Lewis and Gale, 1994) and combine them with the source language (English) training set. clime uses the clwe refined by clime and the original training set. A+C uses the clwe refined by clime and a training set augmented by active learning. We control the number of user interactions so that Active, clime, and A+C require the same interaction time (Section 4.2). The Sinhalese clime results are averaged over ten users; we have only one user for each of the other languages. clime has higher accuracy than active learning, and the combined A+C model has the highest accuracy across languages.

Labeled Data.

We train models on 572 English documents and test on 48 Ilocano documents, 58 Sinhalese documents, 158 Tigrinya documents, and 94 Uyghur documents. The documents are extracted from lorelei language packs (Strassel and Tracey, 2016), a multilingual collection of incident-related documents. We use documents labeled with medical situation as positive examples and sample an equal number of negative examples to balance the two labels. Figure 3 shows excerpts of positive Ilocano examples.

Word Embeddings.

To transfer knowledge between languages, we build clwe between English and each target language. We train monolingual embeddings with word2vec (Mikolov et al., 2013) and map them to a shared space with multiCCA (Ammar et al., 2016). The English embeddings are trained on Wikipedia, and the target language embeddings are trained on unlabeled documents from the lorelei language packs.

Text Classifier.

Our classifier is a convolutional neural network. Each document is represented as the concatenation of its word embeddings and passed through a convolutional layer, followed by max-pooling and a final softmax layer (Kim, 2014). We minimize cross-entropy on the training set by running Adam (Kingma and Ba, 2015) with default hyperparameters for thirty epochs.
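A minimal sketch of a Kim (2014)-style classifier in PyTorch is below; the filter width and count are illustrative guesses, not the paper's reported configuration.

```python
import torch
import torch.nn as nn


class CNNClassifier(nn.Module):
    """Text CNN: embed, convolve, max-pool over positions, softmax."""

    def __init__(self, emb_matrix, n_filters=100, width=3, n_classes=2):
        super().__init__()
        vocab_size, dim = emb_matrix.shape
        # Initialize from the (possibly clime-refined) clwe matrix.
        self.embed = nn.Embedding.from_pretrained(emb_matrix, freeze=False)
        self.conv = nn.Conv1d(dim, n_filters, kernel_size=width, padding=1)
        self.out = nn.Linear(n_filters, n_classes)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, dim, seq_len)
        h = torch.relu(self.conv(x))
        pooled = h.max(dim=2).values               # max-pool over positions
        return self.out(pooled)                    # logits for the softmax
```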

User Study.

We hire participants who are fluent in both English and the low-resource target language. We have ten users for the Sinhalese experiments and one user each for the Ilocano, Tigrinya, and Uyghur experiments. Each user annotates fifty keywords, compiled with the method described in Section 2.1. For each keyword, we show five nearest neighbors for each language. On average, each user provides 9.4 constraints per keyword, and reviewing fifty keywords takes forty-five minutes.

4.2 Methods

After receiving feedback, we update the embeddings (Section 3). We then evaluate the embeddings by retraining a classifier using the new embeddings and computing its test accuracy. For each set of embeddings, we train ten models with different random seeds and report average test accuracy.

We compare the test accuracy of a classifier trained on the updated embeddings (clime in Figure 4) against two baselines. The first baseline is a classifier trained on original embeddings (Base in Figure 4). We expect the classifier trained on refined embeddings to have higher test accuracy because the embeddings are fine-tuned for the classification task.

If we have access to a bilingual speaker, an alternative to using clime is to ask the user to annotate more training documents. Therefore, we also compare clime to uncertainty sampling (Lewis and Gale, 1994), an active learning method that asks a user to label documents (Active in Figure 4). We choose the fifty documents on which model outputs are most uncertain from a set of unlabeled target language documents and ask an annotator to label them as additional training documents. We then retrain a model on both the English training set and the fifty target language documents, using the original embeddings. On average, labeling fifty documents takes approximately the same time as a clime session (forty-five minutes). Thus, we can evaluate which method improves test accuracy more under a fixed time constraint.
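A sketch of the document selection step follows. We use prediction entropy as the uncertainty measure, which for binary labels peaks when the model predicts 0.5/0.5; Lewis and Gale (1994) describe closely related confidence-based criteria. `model(doc)` is assumed to return logits for a tokenized document.

```python
import torch


def uncertainty_sample(model, unlabeled_docs, n=50):
    """Pick the n documents with the most uncertain predictions."""
    entropies = []
    with torch.no_grad():
        for doc in unlabeled_docs:
            probs = torch.softmax(model(doc), dim=-1)
            # Entropy of the predicted label distribution.
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum()
            entropies.append(entropy.item())
    ranked = sorted(range(len(entropies)),
                    key=entropies.__getitem__, reverse=True)
    return [unlabeled_docs[i] for i in ranked[:n]]
```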

Finally, we experiment with a simple combination of active learning and clime (A+C in Figure 4). Document-level and word-level interactions are complementary, so combining the two methods may lead to higher accuracy. To keep the results comparable, we allocate half of the user interaction time to active learning, and the other half to clime. Specifically, we use active learning to expand the training set with twenty-five target language documents, and we refine the embeddings by running clime on only twenty-five keywords. We then retrain a model using both the augmented training set and the refined embeddings.

disease ambulance flu dengue bacteria
coughing hospitals contagious symptoms wounded
breathing fluids casualties injuries ill
victims vaccine glaucoma over rhinitis
outbreak clot coverage drain caused
given attacked wide blood happen
throat tube reduce tuberculosis lives
innocent plague both came typhoid
clotting taken heavily worst killed
ebola terrorist deaths somalia prevent
Table 1: List of English keywords used in experiments. These words are automatically identified with the method described in Section 2.1. Most keywords are relevant to medical emergency. A few words seem out of place (e.g., “over”), which may come from training data biases.

4.3 Results and Analysis

Effectiveness of clime.

Figure 4 compares the four methods described in the previous section. On all four target languages, the classifier trained on embeddings refined by clime has higher accuracy than the classifier trained on the original embeddings, which shows that clime reshapes embeddings in a way that helps classification. clime also has higher accuracy than uncertainty sampling, which demonstrates that our approach is useful in these low-resource settings. The combined method has the highest accuracy, confirming the intuition that active learning and clime are complementary.

Keyword Detection.

Table 1 shows the list of the fifty most salient keywords used in the experiments. The keywords are identified as described in Section 2.1. Most keywords have obvious connections to our classification task of detecting medical emergencies, such as “ambulance”, “hospitals”, and “disease”. However, the list also contains some words that are unrelated to medical emergency, including “over”, “given”, “happen”, and “both”. These words may be biases or artifacts from training data (Feng et al., 2018).

Figure 5: Test accuracy for embeddings refined on different numbers of keywords, for the (a) Tigrinya and (b) Uyghur user studies. The leftmost point corresponds to the Base model in Figure 4, while the rightmost point corresponds to the clime model. We also plot a fitted linear model with confidence intervals. Test accuracy generally improves with more feedback at the beginning and then slightly drops after reaching an optimal number of keywords.

Number of Keywords.

To evaluate how feedback quantity changes accuracy, we vary the number of keywords and compare test accuracy. For each $k$ from one to fifty, we update the original embeddings using only the feedback on the top-$k$ keywords, and we evaluate each set of embeddings by test accuracy. Figure 5 plots test accuracy against the number of keywords for the Tigrinya and Uyghur experiments. For both languages, test accuracy generally increases with more feedback at the beginning of the session. Interestingly, test accuracy plateaus and slightly drops after reaching an optimal number of keywords: around twenty for Tigrinya and about forty for Uyghur. One explanation is that the later keywords are less salient, so the feedback becomes less relevant; these redundant constraints hamper optimization and slightly hurt test accuracy.
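This sweep can be sketched as below, reusing `refine_embeddings` from Section 3; the `evaluate` hook, which retrains a classifier on a given embedding matrix and returns test accuracy, is a hypothetical name.

```python
def accuracy_by_keyword_count(E0, feedback, ranked_keywords, evaluate):
    """For each k, refine E0 with feedback on only the top-k keywords
    and record the resulting test accuracy (as in Figure 5)."""
    results = []
    for k in range(1, len(ranked_keywords) + 1):
        subset = {w: feedback[w]
                  for w in ranked_keywords[:k] if w in feedback}
        E_k = refine_embeddings(E0, subset)
        results.append((k, evaluate(E_k)))
    return results
```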

Qualitative Analysis.

Human interaction helps with the low-resource classification task by adding missing links in the cross-lingual embedding space. For example, many users add the Sinhalese translation for “patients” as a nearest neighbor of the keyword “hospital”. This improves classification because both words are indicators for medical emergency. By pulling the Sinhalese word for “patients” closer to “hospital”, this forms a classification clime that helps transfer knowledge between languages (similar to Figure 1). Indeed, the Sinhalese word for “patients” appears in several test documents. The classifier trained on the original embeddings incorrectly marks some of them as negative, but these mistakes are fixed when using the refined embeddings.

5 Related Work

Cross-Lingual Word Embeddings.

Ruder et al. (2019) summarize previous clwe methods. These methods learn from existing resources such as dictionaries, parallel text, and monolingual corpora. Therefore, success of these methods is largely determined by the availability and quality of training data (Søgaard et al., 2018). To improve the suitability of clwe methods in low-resource settings, recent work focuses on learning without cross-lingual supervision (Zhang et al., 2017a, b; Conneau et al., 2018; Artetxe et al., 2018; Alvarez-Melis and Jaakkola, 2018; Hoshen and Wolf, 2018) and normalizing monolingual embeddings before alignment (Zhang et al., 2019). In contrast, we design a human-in-the-loop system to efficiently improve clwe. Moreover, previous clwe methods are heavily tuned for the intrinsic evaluation task of dictionary induction, sometimes at the cost of performance on downstream tasks (Glavas et al., 2019). Our method is tailored for downstream tasks such as text classification.

Cross-Lingual Document Classification.

Cross-lingual document classification has been extensively studied. Previous methods transfer knowledge with cross-lingual resources, such as bilingual dictionaries (Wu et al., 2008; Shi et al., 2010), parallel text (Xu and Yang, 2017), labeled data from related languages (Zhang et al., 2018), multilingual topic models (Ni et al., 2011; Andrade et al., 2015), machine translation systems (Banea et al., 2008; Wan, 2009; Zhou et al., 2016), and clwe (Klementiev et al., 2012). Our method instead brings a bilingual speaker in the loop to actively provide cross-lingual knowledge, which is more reliable in low-resource settings.

Human-in-the-Loop Multilingual Systems.

Our method is inspired by previous work that bridges the language gap with a human in the loop. Brown and Grinter (2016) build an interactive translation platform to help refugees resettle in America. Yuan et al. (2018) interactively align topic models across languages.

Active Learning.

A common solution to data scarcity is active learning, a framework in which the learner iteratively queries an oracle (often a human) for annotations on unlabeled data. Settles (2012) summarizes popular active learning methods. Most active learning methods solicit labels for training examples, while clime asks for word-level annotations. One exception is dualist, an active learning system that seeks both word-level and document-level annotations (Settles, 2011). However, dualist only applies to a monolingual naive Bayes classifier, while clime is designed for modern neural networks and clwe.

Neural Network Interpretation.

Our keyword detection algorithm (Section 2.1) expands upon prior work in interpreting neural networks. Li et al. (2016) use the gradient of the objective function to linearly approximate the salience of one dimension, which helps interpret and visualize word compositionality in neural networks. Their ideas are inspired by visual salience in computer vision (Erhan et al., 2009; Simonyan et al., 2013; Zeiler and Fergus, 2014). We further extend the idea to compute the salience of an entire word vector. More recent work investigates feature salience in neural networks with higher-order derivatives (Singla et al., 2019).

Specializing Word Embeddings.

Our update equations are modified from previous work on specializing word embeddings, which is designed for improving word embeddings with a large lexical knowledge base. Faruqui et al. (2015) retrofit word embeddings to synonym constraints. Mrkšić et al. (2016) expand the method by also fitting antonym relations. Mrkšić et al. (2017) include both monolingual and cross-lingual constraints to improve clwe. Glavaš and Vulić (2018) use a neural network to learn an external specialization function that can generalize to words with no lexical constraints. We modify the objective function proposed by Mrkšić et al. (2017) to fit the smaller constraint sets used in clime.

6 Conclusion and Future Work

We design clime, an interactive system that enhances clwe for a specific task by asking a bilingual speaker for word-level similarity annotations. We test clime on a cross-lingual document classification dataset for four low-resource languages. With clime, bilingual users improve a model faster than with an active learning baseline, and combining active learning with clime further improves the system.

clime has a modular design with three components: keyword ranking, user interface, and embedding refinement. The keyword ranking and embedding refinement modules build upon existing methods for interpreting neural networks (Li et al., 2016) and fine-tuning word embeddings (Mrkšić et al., 2017). Therefore, future advances in these areas may be used to improve clime. Another line of future work is to investigate alternative user interfaces. For example, we could ask bilingual users to rank nearest neighbors (Sakaguchi and Van Durme, 2018) instead of accepting or rejecting individual neighbors.

In the experiments, we explored a simple combination of active learning and clime and showed that mixing the two types of interaction is better than using only one. In the future, we plan to combine the two interactions in a more principled way, for example by training a policy through reinforcement learning (Fang et al., 2017).

References

  • Alvarez-Melis and Jaakkola (2018) David Alvarez-Melis and Tommi S. Jaakkola. 2018. Gromov-Wasserstein alignment of word embedding spaces. In Proceedings of Empirical Methods in Natural Language Processing.
  • Ammar et al. (2016) Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.
  • Andrade et al. (2015) Daniel Andrade, Kunihiko Sadamasa, Akihiro Tamura, and Masaaki Tsuchida. 2015. Cross-lingual text classification using topic-dependent word probabilities. In Conference of the North American Chapter of the Association for Computational Linguistics.
  • Artetxe et al. (2018) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the Association for Computational Linguistics.
  • Banea et al. (2008) Carmen Banea, Rada Mihalcea, Janyce Wiebe, and Samer Hassan. 2008. Multilingual subjectivity analysis using machine translation. In Proceedings of Empirical Methods in Natural Language Processing.
  • Brown and Grinter (2016) Deana Brown and Rebecca E Grinter. 2016. Designing for transient use: A human-in-the-loop translation platform for refugees. In International Conference on Human Factors in Computing Systems.
  • Conneau et al. (2018) Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In Proceedings of the International Conference on Learning Representations.
  • Erhan et al. (2009) Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2009. Visualizing higher-layer features of a deep network. Technical report.
  • Fang et al. (2017) Meng Fang, Yuan Li, and Trevor Cohn. 2017. Learning how to active learn: A deep reinforcement learning approach. In Proceedings of Empirical Methods in Natural Language Processing.
  • Faruqui et al. (2015) Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Conference of the North American Chapter of the Association for Computational Linguistics.
  • Feng et al. (2018) Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In Proceedings of Empirical Methods in Natural Language Processing.
  • Glavas et al. (2019) Goran Glavas, Robert Litschko, Sebastian Ruder, and Ivan Vulic. 2019. How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions. In Proceedings of the Association for Computational Linguistics.
  • Glavaš and Vulić (2018) Goran Glavaš and Ivan Vulić. 2018. Explicit retrofitting of distributional word vectors. In Proceedings of the Association for Computational Linguistics.
  • Hoshen and Wolf (2018) Yedid Hoshen and Lior Wolf. 2018. Non-adversarial unsupervised word translation. In Proceedings of Empirical Methods in Natural Language Processing.
  • Iyyer et al. (2015) Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the Association for Computational Linguistics.
  • Joulin et al. (2017) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the European Chapter of the Association for Computational Linguistics.
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of Empirical Methods in Natural Language Processing.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.
  • Klementiev et al. (2012) Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of the International Conference on Computational Linguistics.
  • Lewis and Gale (1994) David D Lewis and William A Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval.
  • Li et al. (2016) Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2016. Visualizing and understanding neural models in NLP. In Conference of the North American Chapter of the Association for Computational Linguistics.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of Advances in Neural Information Processing Systems.
  • Mrkšić et al. (2016) Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Conference of the North American Chapter of the Association for Computational Linguistics.
  • Mrkšić et al. (2017) Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017. Semantic specialisation of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association for Computational Linguistics, 5:309–324.
  • Navigli and Ponzetto (2010) Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In Proceedings of the Association for Computational Linguistics.
  • Ni et al. (2011) Xiaochuan Ni, Jian-Tao Sun, Jian Hu, and Zheng Chen. 2011. Cross lingual text classification by mining multilingual topics from wikipedia. In Proceedings of ACM International Conference on Web Search and Data Mining.
  • Ruder et al. (2019) Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2019. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65:569–631.
  • Sakaguchi and Van Durme (2018) Keisuke Sakaguchi and Benjamin Van Durme. 2018. Efficient online scalar annotation with bounded support. In Proceedings of the Association for Computational Linguistics.
  • Settles (2011) Burr Settles. 2011. Closing the loop: Fast, interactive semi-supervised annotation with queries on features and instances. In Proceedings of Empirical Methods in Natural Language Processing.
  • Settles (2012) Burr Settles. 2012. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114.
  • Shi et al. (2010) Lei Shi, Rada Mihalcea, and Mingjun Tian. 2010. Cross language text classification by model translation and semi-supervised learning. In Proceedings of Empirical Methods in Natural Language Processing.
  • Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
  • Singla et al. (2019) Sahil Singla, Eric Wallace, Shi Feng, and Soheil Feizi. 2019. Understanding impacts of high-order loss approximations and features in deep learning interpretation. In Proceedings of the International Conference on Machine Learning.
  • Søgaard et al. (2018) Anders Søgaard, Sebastian Ruder, and Ivan Vulić. 2018. On the limitations of unsupervised bilingual dictionary induction. In Proceedings of the Association for Computational Linguistics.
  • Strassel and Tracey (2016) Stephanie Strassel and Jennifer Tracey. 2016. LORELEI language packs: Data, tools, and resources for technology development in low resource languages. In Proceedings of the Language Resources and Evaluation Conference.
  • Wan (2009) Xiaojun Wan. 2009. Co-training for cross-lingual sentiment classification. In Proceedings of the Association for Computational Linguistics.
  • Wu et al. (2008) Ke Wu, Xiaolin Wang, and Bao-Liang Lu. 2008. Cross language text categorization using a bilingual lexicon. In International Joint Conference on Natural Language Processing.
  • Xu and Yang (2017) Ruochen Xu and Yiming Yang. 2017. Cross-lingual distillation for text classification. In Proceedings of the Association for Computational Linguistics.
  • Yuan et al. (2018) Michelle Yuan, Benjamin Van Durme, and Jordan Boyd-Graber. 2018. Multilingual anchoring: Interactive topic modeling and alignment across languages. In Proceedings of Advances in Neural Information Processing Systems.
  • Zeiler and Fergus (2014) Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision.
  • Zhang et al. (2017a) Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017a. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the Association for Computational Linguistics.
  • Zhang et al. (2017b) Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017b. Earth mover’s distance minimization for unsupervised bilingual lexicon induction. In Proceedings of Empirical Methods in Natural Language Processing.
  • Zhang et al. (2018) Mozhi Zhang, Yoshinari Fujinuma, and Jordan Boyd-Graber. 2018. Exploiting cross-lingual subword similarities in low-resource document classification. arXiv preprint arXiv:1812.09617.
  • Zhang et al. (2019) Mozhi Zhang, Keyulu Xu, Ken-ichi Kawarabayashi, Stefanie Jegelka, and Jordan Boyd-Graber. 2019. Are girls neko or shōjo? Cross-lingual alignment of non-isomorphic embeddings with Iterative Normalization. In Proceedings of the Association for Computational Linguistics.
  • Zhou et al. (2016) Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2016. Cross-lingual sentiment classification with bilingual document representation learning. In Proceedings of the Association for Computational Linguistics.