Distributional word embeddings, which represent the “meaning” of a word via a low-dimensional vector, have been widely applied by many natural language processing (NLP) pipelines and algorithms Goldberg (2016). Following the success of recent neural Mikolov et al. (2013) and matrix-factorization Pennington et al. (2014) methods, researchers have sought to extend the approach to other text features, from subword elements to n-grams to sentences Bojanowski et al. (2016); Poliak et al. (2017); Kiros et al. (2015). However, the performance of both word embeddings and their extensions is known to degrade in small corpus settings Adams et al. (2017) or when embedding sparse, low-frequency features Lazaridou et al. (2017). Attempts to address these issues often involve task-specific approaches Rothe and Schütze (2015); Iacobacci et al. (2015); Pagliardini et al. (2018) or extensively tuning existing architectures such as skip-gram Poliak et al. (2017); Herbelot and Baroni (2017).
For computational efficiency it is desirable that methods be able to induce embeddings for only those features (e.g. bigrams or synsets) needed by the downstream task, rather than having to pay a computational prix fixe to learn embeddings for all features occurring frequently-enough in a corpus. We propose an alternative, novel solution via à la carte embedding, a method which bootstraps existing high-quality word vectors to learn a feature representation in the same semantic space via a linear transformation of the average word embeddings in the feature’s available contexts. This can be seen as a shallow extension of the distributional hypothesis Harris (1954), “a feature is characterized by the words in its context,” rather than the computationally more-expensive “a feature is characterized by the features in its context” that has been used implicitly by past work Rothe and Schütze (2015); Logeswaran and Lee (2018).
Despite its elementary formulation, we demonstrate that the à la carte method can learn faithful word embeddings from single examples and feature vectors improving performance on important downstream tasks. Furthermore, the approach is resource-efficient, needing only pretrained embeddings of common words and the text corpus used to train them, and easy to implement and compute via vector addition and linear regression. After motivating and specifying the method, we illustrate these benefits through several applications:
- n-gram embeddings: we build seven million n-gram embeddings from large text corpora and use them to construct document embeddings that are competitive with unsupervised deep learning approaches when evaluated on linear text classification.
Our experimental results (code: www.github.com/NLPrinceton/ALaCarte) clearly demonstrate the advantages of à la carte embedding. For word embeddings, the approach is an easy way to get a good vector for a new word from its definition or a few examples in context. For feature embeddings, the method can embed anything that does not need labeling (such as a bigram) or occurs in an annotated corpus (such as a word-sense). Our document embeddings, constructed directly using à la carte n-gram vectors, compete well with recent deep neural representations; this provides further evidence that simple methods can outperform modern deep learning on many NLP benchmarks Arora et al. (2017); Mu and Viswanath (2018); Arora et al. (2018a, b); Pagliardini et al. (2018).
2 Related Work
Many methods have been proposed for extending word embeddings to semantic feature vectors, with the aim of using them as interpretable and structure-aware building blocks of NLP pipelines Kiros et al. (2015); Yamada et al. (2016). Many exploit the structure and resources available for specific feature types, such as methods for senses, synsets, and lexemes Rothe and Schütze (2015); Iacobacci et al. (2015) that make heavy use of the graph structure of the Princeton WordNet (PWN) and similar resources Fellbaum (1998). By contrast, our work is more general, with incorporation of structure left as an open problem. Embeddings of n-grams are of special interest because they do not need annotation or expert knowledge and can often be effective on downstream tasks. Their computation has been studied both explicitly Yin and Schutze (2014); Poliak et al. (2017) and as an implicit part of models for document embeddings Hill et al. (2016); Pagliardini et al. (2018), which we use for comparison. Supervised and multi-task learning of text embeddings has also been attempted Wang et al. (2017); Wu et al. (2017).
A main motivation of our work is to learn good embeddings, of both words and features, from only one or a few examples. Efforts in this area can in many cases be split into contextual approaches Lazaridou et al. (2017); Herbelot and Baroni (2017) and morphological methods Luong et al. (2013); Bojanowski et al. (2016); Pado et al. (2016). The current paper provides a more effective formulation for context-based embeddings, which are often simpler to implement, can improve with more context information, and do not require morphological annotation. Subword approaches, on the other hand, are often more compositional and flexible, and we leave the extension of our method to handle subword information to future work. Our work is also related to some methods in domain adaptation and multi-lingual correlation, such as that of Bollegala et al. (2014).
Mathematically, this work builds upon the linear algebraic understanding of modern word embeddings developed by Arora et al. (2018b) via an extension to the latent-variable embedding model of Arora et al. (2016). Although there have been several other applications of this model for natural language representation Arora et al. (2017); Mu and Viswanath (2018), ours is the first to provide a general approach for learning semantic features using corpus context.
3 Method Specification
We begin by assuming a large text corpus $\mathcal{C}_\mathcal{V}$ consisting of contexts $c$ of words $w$ in a vocabulary $\mathcal{V}$, with the contexts themselves being sequences of words in $\mathcal{V}$ (e.g. a fixed-size window around the word or feature). We further assume that we have trained word embeddings $\mathbf{v}_w \in \mathbb{R}^d$ on this collocation information using a standard algorithm (e.g. word2vec / GloVe). Our goal is to construct a good embedding $\mathbf{v}_f \in \mathbb{R}^d$ of a text feature $f$ given a set $\mathcal{C}_f$ of contexts it occurs in. Both $f$ and its contexts are assumed to arise via the same process that generates the large corpus $\mathcal{C}_\mathcal{V}$. In many settings below, the number $|\mathcal{C}_f|$ of contexts available for a feature $f$ of interest is much smaller than the number $|\mathcal{C}_w|$ of contexts that the typical word $w \in \mathcal{V}$ occurs in. This could be because the feature is rare (e.g. unseen words, n-grams) or due to limited human annotation (e.g. word senses, named entities).
3.1 A Linear Approach
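A naive first approach to construct feature embeddings using context is additive, i.e. taking the average over all contexts $c \in \mathcal{C}_f$ of a feature $f$ of the average word vector $\mathbf{v}_w$ in each context:

$$\mathbf{v}_f^{\mathrm{additive}} = \frac{1}{|\mathcal{C}_f|} \sum_{c \in \mathcal{C}_f} \frac{1}{|c|} \sum_{w \in c} \mathbf{v}_w \tag{1}$$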
This formulation reflects the training of commonly used embeddings, which employs additive composition to represent the context Mikolov et al. (2013); Pennington et al. (2014). It has proved successful in the bag-of-embeddings approach to sentence representation Wieting et al. (2016); Arora et al. (2017), which can compete with LSTM representations, and has also been given theoretical justification as the maximum a posteriori (MAP) context vector under a generative model related to popular embedding objectives Arora et al. (2016). Lazaridou et al. (2017) use this approach to learn embeddings of unknown word amalgamations, or chimeras, given a few context examples.
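The additive baseline described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `additive_embedding`, the dictionary-of-arrays representation of the pretrained vectors, and the choice to skip out-of-vocabulary words are all assumptions made for the example.

```python
import numpy as np

def additive_embedding(contexts, vectors):
    """Average, over all contexts, of the average word vector in each context.

    contexts: list of contexts, each a list of word strings (assumed format).
    vectors:  dict mapping word -> 1-D numpy array (assumed format).
    """
    per_context = []
    for context in contexts:
        # Assumption: words without a pretrained vector are simply ignored.
        known = [vectors[w] for w in context if w in vectors]
        if known:
            per_context.append(np.mean(known, axis=0))
    if not per_context:
        raise ValueError("no context contains a word with a known vector")
    return np.mean(per_context, axis=0)

# Toy 2-dimensional vectors, purely illustrative.
vectors = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.0, 1.0]),
    "pet": np.array([1.0, 1.0]),
}
contexts = [["cat", "dog"], ["pet"]]
v = additive_embedding(contexts, vectors)
```

Each context contributes equally regardless of its length, because the inner average is taken before the outer one.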
The additive approach has some limitations because the set of all word vectors is seen to share a few common directions. Simple addition amplifies the component in these directions, at the expense of less common directions that presumably carry more “signal.” Stop-word removal can help to ameliorate this Lazaridou et al. (2017); Herbelot and Baroni (2017), but does not deal with the fact that content-words also have significant components in the same direction as these deleted words. Another mathematical framework to address this lacuna is to remove the top one or top few principal components, either from the word embeddings themselves Mu and Viswanath (2018) or from their summations Arora et al. (2017). However, this approach is liable to either not remove enough noise or cause too much information loss without careful tuning (c.f. Figure 1).
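Removing the top principal components can itself be written as multiplication by a fixed matrix, which the following sketch makes concrete. It is an illustration under stated assumptions, not the exact procedure of either cited paper: the helper name `top_component_projector` is hypothetical, principal directions are computed from mean-centered vectors, and the data are synthetic with one dominant shared direction.

```python
import numpy as np

def top_component_projector(X, k=1):
    """Return P = I - U U^T, where the columns of U are the top-k
    principal directions of the rows of X (after mean-centering)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = Vt[:k].T  # shape (d, k)
    return np.eye(X.shape[1]) - U @ U.T

# Synthetic "embeddings": isotropic noise plus a strong common direction.
rng = np.random.default_rng(0)
common = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
X = rng.normal(size=(200, 5)) + 10.0 * rng.normal(size=(200, 1)) * common

P = top_component_projector(X, k=1)
cleaned = X @ P  # P is symmetric, so this removes the component from each row
```

Because `P` is a fixed matrix once the vectors are given, applying it to a sum of vectors gives the same result as summing the cleaned vectors.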
We now note that removing the component along the top few principal directions is tantamount to multiplying the additive composition by a fixed (but data-dependent) matrix. Thus a natural extension is to use an arbitrary linear transformation which will be learned from the data, and hence guaranteed to do at least as well as any of the above ideas. Specifically, we find the transform that can best recover existing word vectors $\mathbf{v}_w$, which are presumed to be of high quality, from their additive context embeddings $\mathbf{v}_w^{\mathrm{additive}}$. This can be posed as the following linear regression problem

$$\mathbf{v}_w \approx \mathbf{A} \mathbf{v}_w^{\mathrm{additive}} = \mathbf{A} \left( \frac{1}{|\mathcal{C}_w|} \sum_{c \in \mathcal{C}_w} \sum_{w' \in c} \mathbf{v}_{w'} \right) \tag{2}$$

where $\mathbf{A} \in \mathbb{R}^{d \times d}$ is learned and we assume for simplicity that $\frac{1}{|c|}$ is constant (e.g. if $c$ has a fixed window size) and is thus subsumed by the transform. After learning the matrix, we can embed any text feature $f$ in the same semantic space as the word embeddings via the following expression:

$$\mathbf{v}_f = \mathbf{A} \mathbf{v}_f^{\mathrm{additive}} = \mathbf{A} \left( \frac{1}{|\mathcal{C}_f|} \sum_{c \in \mathcal{C}_f} \sum_{w \in c} \mathbf{v}_w \right) \tag{3}$$

Note that $\mathbf{A}$ is fixed for a given corpus and set of pretrained word embeddings and so does not need to be re-computed to embed different features or feature types.
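The two steps described above, learning the transform by linear regression and then applying it to a feature's summed context vectors, might look as follows. This is a sketch under assumptions: the function names, the row-per-word matrix layout, and the use of `numpy.linalg.lstsq` are choices made for illustration and are not taken from the released code.

```python
import numpy as np

def learn_transform(U, V):
    """Least-squares matrix A minimizing sum_w ||v_w - A u_w||^2.

    U: (n_words, d) array whose rows are additive context embeddings u_w.
    V: (n_words, d) array whose rows are pretrained word vectors v_w.
    """
    # lstsq solves U @ X ~= V for X; since v_w^T ~= u_w^T A^T, A is X^T.
    X, *_ = np.linalg.lstsq(U, V, rcond=None)
    return X.T

def induce_embedding(A, contexts, vectors):
    """Apply A to the per-context sum of word vectors, averaged over contexts."""
    total = np.zeros(A.shape[1])
    for context in contexts:
        for w in context:
            if w in vectors:  # assumption: unknown words are skipped
                total += vectors[w]
    return A @ (total / len(contexts))

# Synthetic check: if V really is a linear image of U, A is recovered.
rng = np.random.default_rng(0)
A_true = rng.normal(size=(4, 4))
U = rng.normal(size=(50, 4))
V = U @ A_true.T
A = learn_transform(U, V)

# Toy induction with a hand-picked transform and 2-dimensional vectors.
toy_vectors = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
v_f = induce_embedding(2.0 * np.eye(2), [["a", "b"], ["a"]], toy_vectors)
```

Since the transform depends only on the corpus and the pretrained vectors, `learn_transform` would be run once and `induce_embedding` reused for every new feature.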
As shown by Arora et al. (2018b, Theorem 1), the approximation (2) holds exactly in expectation for some matrix $\mathbf{A}$ when contexts are generated by sampling a context vector $\mathbf{v}_c \in \mathbb{R}^d$ from a zero-mean Gaussian with fixed covariance and drawing $|c|$ words using $\mathbb{P}(w \mid \mathbf{v}_c) \propto \exp\langle \mathbf{v}_c, \mathbf{v}_w \rangle$. The correctness (again in expectation) of (3) under this model is a direct extension. Arora et al. (2018b) use large text corpora to verify their model assumptions, providing theoretical justification for our approach. We observe that the best linear transform $\mathbf{A}$ can recover vectors with mean cosine similarity as high as 0.9 or more with the embeddings used to learn it, thus also justifying the method empirically.
3.2 Practical Details
The basic à la carte method, as motivated in Section 3.1 and specified in Algorithm 1, is straightforward and parameter-free (the dimension $d$ is assumed to have been chosen beforehand, along with the other parameters of the original word embeddings). In practice we may wish to modify the regression step in an attempt to learn a better transformation matrix $\mathbf{A}$. However, the standard first approach of using $\ell_2$-regularized (Ridge) regression instead of simple linear regression gives little benefit, even when we have more parameters than word embeddings (i.e. when $d^2 > |\mathcal{V}|$).
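A more useful modification is to weight each point by some non-decreasing function $\alpha$ of each word’s corpus count $c_w$, i.e. to solve

$$\mathbf{A} = \arg\min_{\mathbf{A} \in \mathbb{R}^{d \times d}} \sum_{w \in \mathcal{V}} \alpha(c_w) \left\| \mathbf{v}_w - \mathbf{A} \mathbf{u}_w \right\|_2^2 \tag{4}$$

where $\mathbf{u}_w$ is the additive context embedding. This reflects the fact that more frequent words likely have better pretrained embeddings. In settings where $|\mathcal{V}|$ is large we find that a hard threshold ($\alpha(c) = \mathbf{1}_{c \ge \tau}$ for some $\tau \ge 1$) is often useful. When we do not have many embeddings we can still give more importance to words with better embeddings via a function such as $\alpha(c) = \log c$, which we use in Section 5.1.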
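A count-weighted regression of this kind can be reduced to ordinary least squares by scaling each row by the square root of its weight. The sketch below assumes this standard reduction; the names `learn_weighted_transform` and `hard_threshold`, the default threshold of 1000, and the synthetic data are illustrative choices rather than details of the released code.

```python
import numpy as np

def hard_threshold(count, tau=1000):
    """Weight 1 for words seen at least tau times, else 0 (tau is an assumption)."""
    return 1.0 if count >= tau else 0.0

def learn_weighted_transform(U, V, counts, alpha):
    """Minimize sum_w alpha(c_w) * ||v_w - A u_w||^2 over A.

    U, V:   (n_words, d) context embeddings and pretrained vectors, row per word.
    counts: corpus count of each word, in the same row order.
    alpha:  non-negative weighting function of a count.
    """
    weights = np.array([alpha(c) for c in counts], dtype=float)
    s = np.sqrt(weights)[:, None]
    # Scaling rows by sqrt(weight) turns the weighted problem into plain lstsq.
    X, *_ = np.linalg.lstsq(s * U, s * V, rcond=None)
    return X.T

# Synthetic data: rare words (low counts) have corrupted "pretrained" vectors.
rng = np.random.default_rng(0)
A_true = rng.normal(size=(3, 3))
U = rng.normal(size=(60, 3))
V = U @ A_true.T
V[40:] += rng.normal(size=(20, 3))
counts = [5000] * 40 + [10] * 20

A_hard = learn_weighted_transform(U, V, counts, hard_threshold)
A_log = learn_weighted_transform(U, V, counts, np.log)
```

With the hard threshold the corrupted rows receive zero weight, so the transform is fit on the frequent words only; the logarithmic weight keeps every word but trusts frequent ones more.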
4 One-Shot and Few-Shot Learning of Word Embeddings
While we can use our method to embed any type of text feature, its simplicity and effectiveness is rooted in word-level semantics: the approach assumes pre-existing high quality word embeddings and only considers collocations of features with words rather than with other features. Thus to verify that our approach is reasonable we first check how it performs on word representation tasks, specifically those where word embeddings need to be learned from very few examples. In this section we first investigate how representation quality varies with number of occurrences, as measured by performance on a similarity task that we introduce. We then apply the à la carte method to two tasks measuring the ability to learn new or synthetic words from context, achieving strong results on the nonce task of Herbelot and Baroni (2017).
4.1 Similarity Correlation vs. Sample Size
Performance on pairwise word similarity tasks is a standard way to evaluate word embeddings, with success measured via the Spearman correlation between a human score and the cosine similarity between word vectors. An overview of widely used datasets is given by Faruqui and Dyer (2014). However, none of these datasets can be used directly to measure the effect of word frequency on embedding quality, which would help us understand the data requirements of our approach. We address this issue by introducing the Contextual Rare Words (CRW) dataset, a subset of 562 pairs from the Rare Word (RW) dataset Luong et al. (2013) supplemented by 255 sentences (contexts) for each rare word sampled from the Westbury Wikipedia Corpus (WWC) Shaoul and Westbury (2010). In addition we provide a subset of the WWC from which all sentences containing these rare words have been removed. The task is to use embeddings trained on this subcorpus to induce rare word embeddings from the sampled contexts.
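The evaluation protocol described above, Spearman correlation between human scores and cosine similarities, can be sketched without external statistics libraries. The helper names and the toy data below are assumptions made for illustration; ties are handled by average ranks, which is the usual convention.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        # Sorted positions i..j hold equal values; give them the mean rank.
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def similarity_correlation(pairs, human_scores, vectors):
    """Correlate human scores with cosine similarity of each word pair."""
    model_scores = [cosine(vectors[a], vectors[b]) for a, b in pairs]
    return spearman(human_scores, model_scores)

# Toy example with made-up words, vectors and scores.
toy = {
    "x": np.array([1.0, 0.0]),
    "y": np.array([1.0, 0.1]),
    "z": np.array([0.0, 1.0]),
    "w": np.array([1.0, 1.0]),
}
rho = similarity_correlation([("x", "y"), ("x", "w"), ("x", "z")], [9.0, 5.0, 1.0], toy)
```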
More specifically, the CRW dataset is constructed using all pairs from the RW dataset where the rarer word occurs between 512 and 10000 times in WWC; this yields a set of 455 distinct rare words. The lower bound ensures that we have a sufficient number of rare word contexts, while the upper bound ensures that a significant fraction of the sentences from the original WWC remain in the subcorpus we provide. In CRW, the first word in every pair is the more frequent word and occurs in the subcorpus, while the second word occurs in the 255 sampled contexts but not in the subcorpus. We provide word2vec embeddings trained on all words occurring at least 100 times in the WWC subcorpus; these vectors include those assigned to the first (non-rare) words in the evaluation pairs.
For every rare word the method under consideration is given eight disjoint subsets containing $1, 2, 4, \dots, 128$ example contexts. The method induces an embedding of the rare word for each subset, letting us track how the quality of rare word vectors changes with more examples. We report the Spearman $\rho$ (as described above) at each sample size, averaged over 100 trials obtained by shuffling each rare word’s 255 contexts.
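One trial of the splitting procedure described above could be generated as follows. The function name and the `seed` parameter are hypothetical; the default sizes are doubling sizes from 1 to 128, used here because eight such subsets consume exactly 255 contexts.

```python
import random

def disjoint_subsets(contexts, sizes=(1, 2, 4, 8, 16, 32, 64, 128), seed=0):
    """Shuffle the contexts and cut them into consecutive disjoint subsets."""
    if sum(sizes) > len(contexts):
        raise ValueError("not enough contexts for the requested subset sizes")
    shuffled = list(contexts)
    random.Random(seed).shuffle(shuffled)
    subsets, start = [], 0
    for size in sizes:
        subsets.append(shuffled[start:start + size])
        start += size
    return subsets

# Stand-in for the 255 sampled sentences of one rare word.
contexts = [f"sentence {i}" for i in range(255)]
subsets = disjoint_subsets(contexts, seed=1)
```

Repeating this with different seeds and averaging the resulting correlations at each subset size gives one curve of quality against sample size.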
The results in Figure 2 show that our à la carte method significantly outperforms the additive baseline (1) and its variants, including stop-word removal, SIF-weighting Arora et al. (2017), and top principal component removal Mu and Viswanath (2018). We find that combining SIF-weighting and top component removal also beats these baselines, but still does worse than our method. These experiments consolidate our intuitions from Section 3 that removing common components and frequent words is important and that learning a data-dependent transformation is an effective way to do this. However, if we train word2vec embeddings from scratch on the subcorpus together with the sampled contexts we achieve a Spearman correlation of 0.45; this gap between word2vec and our method shows that there remains room for even better approaches for few-shot learning of word embeddings.
4.2 Learning Embeddings of New Concepts: Nonces and Chimeras
Nonce columns: Herbelot and Baroni (2017); Chimera columns: Lazaridou et al. (2017).

| Method | Nonce: Mean Recip. Rank | Nonce: Med. Rank | Chimera: 2 Sent. | Chimera: 4 Sent. | Chimera: 6 Sent. |
|---|---|---|---|---|---|
| additive, no stop words | 0.03686 | 861 | 0.3376 | 0.3624 | 0.4080 |
| à la carte | 0.07058 | 165.5 | 0.3634 | 0.3844 | 0.3941 |
We now evaluate our work directly on the tasks posed by Herbelot and Baroni (2017), who developed simple datasets and methods to “simulate the process by which a competent speaker encounters a new word in known contexts.” The general goal will be to construct embeddings of new concepts in the same semantic space as a known embedding vocabulary using contextual information consisting of definitions or example sentences.
We first discuss the definitional nonce dataset made by the authors themselves, which has a test-set consisting of 300 single-word concepts and their definitions. The task of learning each concept’s embedding is simulated by removing or randomly re-initializing its vector and requiring the system to use the remaining embeddings and the definition to make a new vector that is close to the original. Because the embeddings were constructed using data that includes these concepts, an implicit assumption is made that including or excluding one word does not greatly affect the semantic space; this assumption is necessary in order to have a good target vector for the system to be evaluated against.
Using 259,376 word2vec embeddings trained on Wikipedia as the base vectors, Herbelot and Baroni (2017) heavily modify the skip-gram algorithm to successfully learn on one definition, creating the nonce2vec system. The original skip-gram algorithm and the additive embedding $\mathbf{v}_w^{\mathrm{additive}}$ are used as baselines, with performance measured as the mean reciprocal rank and median rank of the concept’s original vector among the nearest neighbors of the output.
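The two metrics named above can be computed as in the following sketch. It assumes cosine similarity defines the nearest neighbors and that the rank of the original vector is one plus the number of vocabulary vectors strictly more similar to the induced vector; the function names are illustrative.

```python
import numpy as np

def rank_of_target(induced, target_index, vocab_matrix):
    """1-based rank of the target row among cosine neighbors of `induced`."""
    normed = vocab_matrix / np.linalg.norm(vocab_matrix, axis=1, keepdims=True)
    sims = normed @ (induced / np.linalg.norm(induced))
    # Count vocabulary vectors strictly closer to the induced vector.
    return int(np.sum(sims > sims[target_index])) + 1

def mrr_and_median(ranks):
    """Mean reciprocal rank and median rank over all test concepts."""
    ranks = np.asarray(ranks, dtype=float)
    return float(np.mean(1.0 / ranks)), float(np.median(ranks))

# Toy vocabulary of three orthogonal vectors.
vocab = np.eye(3)
induced = np.array([0.9, 0.1, 0.0])
best = rank_of_target(induced, 0, vocab)
worst = rank_of_target(induced, 2, vocab)
mrr, median = mrr_and_median([1, 2, 4])
```

A median over an even number of test items can fall halfway between two integer ranks, which is why fractional median ranks can appear in reported results.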
To compare directly to their approach, we use their word2vec embeddings along with contexts from the Wikipedia corpus to construct context vectors for all words apart from the 300 nonces. We then learn the à la carte transform $\mathbf{A}$, weighting the data points in the regression (4) using a hard threshold of at least 1000 occurrences in Wikipedia. An embedding for each nonce can then be constructed by multiplying $\mathbf{A}$ by the sum over all word embeddings in the nonce’s definition. As can be seen in Table 1, this approach significantly improves over both baselines and nonce2vec; the median rank of 165.5 of the original embedding among the nearest neighbors of the nonce vector is very low considering the vocabulary size is more than 250,000, and is also significantly lower than that of all previous methods.
The second dataset Herbelot and Baroni (2017) consider is that of Lazaridou et al. (2017), who construct unseen concepts by combining two related words into a fake nonce word (the “chimera”) and provide two, four, or six example sentences for this nonce drawn from sentences containing one of the two component words. The desired nonce embedding is then evaluated via the correlation of its cosine similarity with the embeddings of several other words, with ratings provided by human judges.
We use the same approach as in the nonce task, except that the chimera embedding is the result of summing over multiple sentences. From Table 1 we see that, while our method is consistently better than both the additive baseline and nonce2vec, removing stop-words from the additive baseline leads to stronger performance for more sentences. Since the à la carte algorithm explicitly trains the transform to match the true word embedding rather than human similarity measures, it is perhaps not surprising that our approach is much more dominant on the definitional nonce task.
5 Building Feature Embeddings using Large Corpora
Having witnessed its success at representing unseen words, we now apply the à la carte method to two types of feature embeddings: synset embeddings and n-gram embeddings. Using these two examples we demonstrate the flexibility and adaptability of our approach when handling different corpora, base word embeddings, and downstream applications.
5.1 Supervised Synset Embeddings for Word-Sense Disambiguation
|SemEval-2013 Task 12||SemEval-2015 Task 13|
|à la carte (SemCor)||60.0||72.2||67.7||85.2||60.6||68.1|
|à la carte (glosses)||51.8||75.3||62.5||79.0||55.8||64.2|
|à la carte (combined)||60.5||74.1||70.3||86.4||59.4||69.6|
|Raganato et al. (2017)||66.9||72.4|
Embeddings of synsets, or sets of cognitive synonyms, and related entities such as senses and lexemes have been widely studied, often due to the desire to account for polysemy Rothe and Schütze (2015); Iacobacci et al. (2015). Such representations can be evaluated in several ways, including via their use for word-sense disambiguation (WSD), the task of determining a word’s sense from context. While current state-of-the-art methods often use powerful recurrent models Raganato et al. (2017), we will instead use a simple similarity-based approach that heavily depends on the synset embedding itself and thus serves as a more useful indicator of representation quality. A major target for our simple systems is to beat the most-frequent sense (MFS) method, which returns for each word the sense that occurs most frequently in a corpus such as SemCor. This baseline is “notoriously hard-to-beat,” routinely besting many systems in SemEval WSD competitions Navigli et al. (2013).
We use SemCor Langone et al. (2004), a subset of the Brown Corpus (BC) Francis and Kucera (1979) annotated using PWN synsets. However, because the corpus is quite small we use GloVe trained on Wikipedia instead of on BC itself. The transform $\mathbf{A}$ is learned using context embeddings computed with windows of size ten around occurrences of $w$ in BC and weighting each word by the log of its count during the regression stage (4). Then we set the context embedding $\mathbf{u}_s$ of each synset $s$ to be the average sum of word embeddings representation over all sentences in SemCor containing $s$. Finally, we apply the à la carte transform to get the synset embedding $\mathbf{v}_s = \mathbf{A} \mathbf{u}_s$.
To determine the sense of a word $w$ given its context $c$, we convert $c$ into a vector using the à la carte transform on the sum of its word embeddings and return the synset $s$ of $w$ whose embedding is most similar to this vector. We try two different synset embeddings: those induced from SemCor as above and those obtained by embedding a synset using its gloss, or PWN-provided definition, in the same way as a nonce in Section 4.2. We also consider a combined approach in which we fall back on the gloss vector if the synset does not appear in SemCor and thus has no induced embedding.
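The similarity-based disambiguation with gloss fallback described above might be sketched as follows. Everything concrete here is an assumption for illustration: the function name, the two dictionaries of synset embeddings, the use of cosine similarity as the similarity measure, and the synset identifiers, which are made-up labels rather than real PWN entries.

```python
import numpy as np

def disambiguate(context_words, candidates, vectors, A, corpus_synsets, gloss_synsets):
    """Return the candidate synset whose embedding is most cosine-similar
    to the transformed sum of the context's word vectors."""
    context_sum = np.zeros(A.shape[1])
    for w in context_words:
        if w in vectors:  # assumption: unknown words are skipped
            context_sum += vectors[w]
    query = A @ context_sum

    best, best_sim = None, -np.inf
    for s in candidates:
        # Combined approach: prefer the corpus-induced vector, else the gloss vector.
        emb = corpus_synsets.get(s, gloss_synsets.get(s))
        if emb is None:
            continue
        denom = np.linalg.norm(query) * np.linalg.norm(emb) + 1e-12
        sim = float(query @ emb) / denom
        if sim > best_sim:
            best, best_sim = s, sim
    return best

# Toy setup with two hypothetical senses of "bank".
vectors = {"river": np.array([1.0, 0.0]), "money": np.array([0.0, 1.0])}
A = np.eye(2)
corpus_synsets = {"bank_sense_river": np.array([1.0, 0.0])}
gloss_synsets = {"bank_sense_finance": np.array([0.0, 1.0])}
candidates = ["bank_sense_river", "bank_sense_finance"]

picked_money = disambiguate(["money"], candidates, vectors, A, corpus_synsets, gloss_synsets)
picked_river = disambiguate(["river"], candidates, vectors, A, corpus_synsets, gloss_synsets)
```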
As shown in Table 2, synset embeddings induced from SemCor alone beat MFS overall, largely due to good noun results. The method improves further when combined with the gloss approach. While we do not match the state-of-the-art, our success in besting a difficult baseline using very little fine-tuning and exploiting none of the underlying graph structure suggests that the à la carte method can learn useful synset embeddings, even from relatively small data.
5.2 N-Gram Embeddings for Classification
|Method||beef up||cutting edge||harry potter||tight lipped|
|meat, out||cut, edges||deathly, azkaban||loose, fitting|
|but, however||which, both||which, but||but, however|
|ECO||meats, meat||weft, edges||robards, keach||scaly, bristly|
|Sent2Vec||add, reallocate||science, multidisciplinary||naruto, pokemon||wintel, codebase|
|à la carte||need, improve||innovative, technology||deathly, hallows||worried, very|
As some of the simplest and most useful linguistic features, n-grams have long been a focus of embedding studies. Compositional approaches, such as sums and products of unigram vectors, are often used and work well on some evaluations, but are often order-insensitive or very high-dimensional Mitchell and Lapata (2010). Recent work by Poliak et al. (2017) works around this while staying compositional; however, as we will see, their approach does not seem to capture a bigram’s meaning much better than the sum of its word vectors. n-gram embeddings have also gained interest for low-dimensional document representation schemes Hill et al. (2016); Pagliardini et al. (2018); Arora et al. (2018a), largely due to the success of their sparse high-dimensional Bag-of-n-Grams (BonG) counterparts Wang and Manning (2012). This setting of document embeddings derived from n-gram features will be used for quantitative evaluation in this section.
We build n-gram embeddings using two corpora: 300-dimensional Wikipedia embeddings, which we evaluate qualitatively, and 1600-dimensional embeddings on the Amazon Product Corpus McAuley et al. (2015), which we use for document classification. For both we use as source embeddings GloVe vectors trained on the respective corpora over words occurring at least a hundred times. Context embeddings are constructed using a window of size ten and a hard threshold at 1000 occurrences is used as the word-weighting function in the regression (4). Unlike Poliak et al. (2017), who can construct arbitrary embeddings but need to train at least two sets of vectors of dimension at least to do so, and Yin and Schutze (2014), who determine which n-grams to represent via corpus counts, our à la carte approach allows us to train exactly those embeddings that we need for downstream tasks. This, combined with our method’s efficiency, allows us to construct more than two million bigram embeddings and more than five million trigram embeddings, constrained only by their presence in the large source corpus.
|à la carte||1||1600||79.8||81.3||92.6||87.4||85.6||84.1||46.7||89.0|
Vocabulary sizes (i.e. BonG dimensions) vary by task; usually 10K-100K.
We first compare bigram embedding methods by picking some idiomatic and entity-related bigrams and examining the closest word vectors to their representations. These word-pairs are picked because we expect sophisticated feature embedding methods to encode a better vector than the sum of the two embeddings, which we use as a baseline. From Table 3 we see that embeddings based on corpora rather than composition are better able to embed these bigrams to be close to concepts that are semantically similar. On the other hand, as discussed in Section 3 and evident from these results, the additive context approach is liable to emphasize stop-word directions due to their high frequency.
Our main application and quantitative evaluation of n-gram vectors is to use them to construct document embeddings. Given a length-$L$ document $D = \{w_1, \dots, w_L\}$, we define its embedding $\mathbf{v}_D$ as a weighted concatenation over sums of our induced n-gram embeddings, i.e.

$$\mathbf{v}_D = \left( \sum_{t=1}^{L} \mathbf{v}_{w_t} \quad \cdots \quad \frac{1}{n} \sum_{t=1}^{L-n+1} \mathbf{v}_{(w_t, \dots, w_{t+n-1})} \right)$$

where $\mathbf{v}_{(w_t, \dots, w_{t+n-1})}$ is the embedding of the n-gram $(w_t, \dots, w_{t+n-1})$. Following Arora et al. (2018a), we weight each n-gram component by $\frac{1}{n}$ to reflect the fact that higher-order n-grams have lower quality embeddings because they occur less often in the source corpus. While we concatenate across unigram, bigram, and trigram embeddings to construct our text representations, separate experiments show that simply adding up the vectors of all features also yields a smaller but still substantial improvement over the unigram performance. The higher embedding dimension due to concatenation is in line with previous methods and can also be theoretically supported as yielding a less lossy compression of the n-gram information Arora et al. (2018a).
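The weighted concatenation over n-gram sums can be illustrated with a short sketch. The function name, the tuple-keyed dictionary of n-gram vectors, and the choice to skip n-grams without a vector are assumptions; the weight of 1/n per block is also an assumption here, chosen to be consistent with down-weighting higher-order n-grams.

```python
import numpy as np

def document_embedding(words, ngram_vectors, dim, max_n=3):
    """Concatenate, for n = 1..max_n, the sum of the document's n-gram
    vectors scaled by 1/n.

    ngram_vectors: dict mapping a tuple of words to a 1-D array of length dim.
    """
    blocks = []
    for n in range(1, max_n + 1):
        total = np.zeros(dim)
        for t in range(len(words) - n + 1):
            gram = tuple(words[t:t + n])
            if gram in ngram_vectors:  # assumption: missing n-grams are skipped
                total += ngram_vectors[gram]
        blocks.append(total / n)
    return np.concatenate(blocks)

# Toy vectors for a two-word document.
ngram_vectors = {
    ("good",): np.array([1.0, 0.0]),
    ("movie",): np.array([0.0, 1.0]),
    ("good", "movie"): np.array([2.0, 2.0]),
}
doc = document_embedding(["good", "movie"], ngram_vectors, dim=2, max_n=3)
```

The result has dimension `dim * max_n`; a document shorter than n words simply contributes a zero block for that order.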
In Table 4 we display the result of running cross-validated, $\ell_2$-regularized logistic regression on documents from MR movie reviews (Pang and Lee, 2005), CR customer reviews (Hu and Liu, 2004), SUBJ subjectivity dataset (Pang and Lee, 2004), MPQA opinion polarity subtask (Wiebe et al., 2005), TREC question classification (Li and Roth, 2002), SST sentiment classification (binary and fine-grained) (Socher et al., 2013), and IMDB movie reviews (Maas et al., 2011). The first four are evaluated using tenfold cross-validation, while the others have train-test splits.
Despite the simplicity of our embeddings (a concatenation over sums of à la carte n-gram vectors), we find that our results are very competitive with many recent unsupervised methods, achieving the best word-level results on two of the tested datasets. The fact that we do especially well on the sentiment tasks indicates strong exploitation of the Amazon review corpus, which was also used by DisC, CNN-LSTM, and byte mLSTM. At the same time, the fact that our results are comparable to neural approaches indicates that local word-order may contain much of the information needed to do well on these tasks. On the other hand, separate experiments do not show a substantial improvement from our approach over unigram methods such as SIF Arora et al. (2017) on sentence similarity tasks such as STS Cer et al. (2017). This could reflect either noise in the n-gram embeddings themselves or the comparatively lower importance of local word-order for textual similarity compared to classification.
We have introduced à la carte embedding, a simple method for representing semantic features using unsupervised context information. A natural and principled integration of recent ideas for composing word vectors, the approach achieves strong performance on several tasks and promises to be useful in many linguistic settings and to yield many further research directions. Of particular interest is the replacement of simple window contexts by other structures, such as dependency parses, that could yield results in domains such as question answering or semantic role labeling. Extensions of the mathematical formulation, such as the use of word weighting when building context vectors as in Arora et al. (2018b) or of spectral information along the lines of Mu and Viswanath (2018), are also worthy of further study.
More practically, the Contextual Rare Words (CRW) dataset we provide will support research on few-shot learning of word embeddings. Both in this area and for n-grams there is great scope for combining our approach with compositional approaches Bojanowski et al. (2016); Poliak et al. (2017) that can handle settings such as zero-shot learning. More work is needed to understand the usefulness of our method for representing (potentially cross-lingual) entities such as synsets, whose embeddings have found use in enhancing WordNet and related knowledge bases Camacho-Collados et al. (2016); Khodak et al. (2017). Finally, there remain many language features, such as named entities and morphological forms, whose representation by our method remains unexplored.
We thank Karthik Narasimhan and our three anonymous reviewers for helpful suggestions. The work in this paper was in part supported by SRC JUMP, Mozilla Research, NSF grants CCF-1302518 and CCF-1527371, Simons Investigator Award, Simons Collaboration Grant, and ONR-N00014-16-1-2329.
- Adams et al. (2017) Oliver Adams, Adam Makarucha, Graham Neubig, Steven Bird, and Trevor Cohn. 2017. Cross-lingual word embeddings for low-resource language modeling. In Proc. EACL.
- Arora et al. (2018a) Sanjeev Arora, Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. 2018a. A compressed sensing view of unsupervised text embeddings, bag-of-n-grams, and lstms. In Proc. ICLR.
- Arora et al. (2016) Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. A latent variable model approach to pmi-based word embeddings. TACL.
- Arora et al. (2018b) Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2018b. Linear algebraic structure of word senses, with applications to polysemy. TACL.
- Arora et al. (2017) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In Proc. ICLR.
- Bojanowski et al. (2016) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. ArXiv.
- Bollegala et al. (2014) Danushka Bollegala, David Weir, and John Carroll. 2014. Learning to predict distributions of words across domains. In Proc. ACL.
- Camacho-Collados et al. (2016) José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. AI.
- Cer et al. (2017) Daniel Cer, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. In Proc. SemEval.
- Faruqui and Dyer (2014) Manaal Faruqui and Chris Dyer. 2014. Community evaluation and exchange of word vectors at wordvectors.org. In Proc. ACL: System Demonstrations.
- Fellbaum (1998) Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.
- Francis and Kucera (1979) W. Nelson Francis and Henry Kucera. 1979. Brown Corpus Manual. Brown University.
- Gan et al. (2017) Zhe Gan, Yunchen Pu, Ricardo Henao, Chunyuan Li, Xiaodong He, and Lawrence Carin. 2017. Learning generic sentence representations using convolutional neural networks. In Proc. EMNLP.
- Goldberg (2016) Yoav Goldberg. 2016. A primer on neural network models for natural language processing. JAIR.
- Harris (1954) Zellig Harris. 1954. Distributional structure. Word, 10:146–162.
- Herbelot and Baroni (2017) Aurélie Herbelot and Marco Baroni. 2017. High-risk learning: Acquiring new word vectors from tiny data. In Proc. EMNLP.
- Hill et al. (2016) Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Proc. NAACL.
- Hu and Liu (2004) Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proc. KDD.
- Iacobacci et al. (2015) Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. Sensembed: Learning sense embeddings for word and relational similarity. In Proc. ACL-IJCNLP.
- Khodak et al. (2017) Mikhail Khodak, Andrej Risteski, Christiane Fellbaum, and Sanjeev Arora. 2017. Automated wordnet construction using word embeddings. In Proc. Workshop on Sense, Concept and Entity Representations and their Applications.
- Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. In Adv. NIPS.
- Langone et al. (2004) Helen Langone, Benjamin R. Haskell, and George A. Miller. 2004. Annotating wordnet. In Proc. Workshop on Frontiers in Corpus Annotation.
- Lazaridou et al. (2017) Angeliki Lazaridou, Marco Marelli, and Marco Baroni. 2017. Multimodal word meaning induction from minimal exposure to natural text. Cognitive Science.
- Li and Roth (2002) Xin Li and Dan Roth. 2002. Learning question classifiers. In Proc. COLING.
- Logeswaran and Lee (2018) Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In Proc. ICLR.
- Luong et al. (2013) Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proc. CoNLL.
- Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proc. ACL-HLT.
- McAuley et al. (2015) Julian McAuley, Rahul Pandey, and Jure Leskovec. 2015. Inferring networks of substitutable and complementary products. In Proc. KDD.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Adv. NIPS.
- Mitchell and Lapata (2010) Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science.
- Moro and Navigli (2015) Andrea Moro and Roberto Navigli. 2015. Semeval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. In Proc. SemEval.
- Mu and Viswanath (2018) Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-top: Simple and effective post-processing for word representations. In Proc. ICLR.
- Navigli et al. (2013) Roberto Navigli, David Jurgens, and Daniele Vannella. 2013. Semeval-2013 task 12: Multilingual word sense disambiguation. In Proc. SemEval.
- Pado et al. (2016) Sebastian Pado, Aurelie Herbelot, Max Kisselew, and Jan Snajder. 2016. Predictability of distributional semantics in derivational word formation. In Proc. COLING.
- Pagliardini et al. (2018) Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proc. NAACL.
- Pang and Lee (2004) Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proc. ACL.
- Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proc. ACL.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proc. EMNLP.
- Poliak et al. (2017) Adam Poliak, Pushpendre Rastogi, M. Patrick Martin, and Benjamin Van Durme. 2017. Efficient, compositional, order-sensitive n-gram embeddings. In Proc. EACL.
- Radford et al. (2017) Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment. ArXiv.
- Raganato et al. (2017) Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017. Neural sequence learning models for word sense disambiguation. In Proc. EMNLP.
- Rothe and Schütze (2015) Sascha Rothe and Hinrich Schütze. 2015. Autoextend: Extending word embeddings to embeddings for synsets and lexemes. In Proc. ACL-IJCNLP.
- Shaoul and Westbury (2010) Cyrus Shaoul and Chris Westbury. 2010. The westbury lab wikipedia corpus.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. EMNLP.
- Wang et al. (2017) Dingquan Wang, Nanyun Peng, and Kevin Duh. 2017. A multi-task learning approach to adapting bilingual word embeddings for cross-lingual named entity recognition. In Proc. IJCNLP.
- Wang and Manning (2012) Sida Wang and Christopher D. Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proc. ACL.
- Wiebe et al. (2005) Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Proc. LREC.
- Wieting et al. (2016) John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. In Proc. ICLR.
- Wu et al. (2017) Ledell Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, and Jason Weston. 2017. Starspace: Embed all the things! ArXiv.
- Yamada et al. (2016) Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint learning of the embedding of words and entities for named entity disambiguation. In Proc. CoNLL.
- Yin and Schutze (2014) Wenpeng Yin and Hinrich Schutze. 2014. An exploration of embeddings for generalized phrases. In Proc. ACL 2014 Student Research Workshop.