Learning embeddings for classification, retrieval and ranking.
We present StarSpace, a general-purpose neural embedding model that can solve a wide variety of problems: labeling tasks such as text classification, ranking tasks such as information retrieval/web search, collaborative filtering-based or content-based recommendation, embedding of multi-relational graphs, and learning word, sentence or document level embeddings. In each case the model works by embedding those entities comprised of discrete features and comparing them against each other -- learning similarities dependent on the task. Empirical results on a number of tasks show that StarSpace is highly competitive with existing methods, whilst also being generally applicable to new cases where those methods are not.
We introduce StarSpace, a neural embedding model that is general enough to solve a wide variety of problems:
Text classification, or other labeling tasks, e.g. sentiment classification.
Ranking of sets of entities, e.g. ranking web documents given a query.
Collaborative filtering-based recommendation, e.g. recommending documents, music or videos.
Content-based recommendation where content is defined with discrete features, e.g. words of documents.
Embedding graphs, e.g. multi-relational graphs such as Freebase.
Learning word, sentence or document embeddings.
StarSpace can be viewed as a straight-forward and efficient strong baseline for any of these tasks. In experiments it is shown to be on par with or outperforming several competing methods, whilst being generally applicable to cases where many of those methods are not.
The method works by learning entity embeddings with discrete feature representations from relations among collections of those entities directly for the task of ranking or classification of interest. In the general case, StarSpace embeds entities of different types into a vectorial embedding space, hence the “star” (“*”, meaning all types) and “space” in the name, and in that common space compares them against each other. It learns to rank a set of entities, documents or objects given a query entity, document or object, where the query is not necessarily of the same type as the items in the set.
We evaluate the quality of our approach on six different tasks, namely text classification, link prediction in knowledge bases, document recommendation, article search, sentence matching and learning general sentence embeddings. StarSpace is available as an open-source project at https://github.com/facebookresearch/Starspace.
Latent text representations, or embeddings, are vectorial representations of words or documents, traditionally learned in an unsupervised way over large corpora. Work on neural embeddings in this domain includes [bengio2003], [collobert2011], word2vec [word2vec] and more recently fastText [fasttext-unsup]. In our experiments we compare to word2vec and fastText as representative scalable models for unsupervised embeddings; we also compare on the SentEval tasks [infersent] against a wide range of unsupervised models for sentence embedding.
In the domain of supervised embeddings, SSI [bai2009supervised] and WSABIE [wsabie] are early approaches that showed promise in NLP and information retrieval tasks ([weston2013connecting], [hermann2014]). Several more recent works including [tang2015document], [zhang2015text], [conneau2016very], TagSpace [tagspace] and fastText [fasttext] have yielded good results on classification tasks such as sentiment analysis or hashtag prediction.
In the domain of recommendation, embedding models have had a large degree of success, starting from SVD [goldberg2001eigentaste] and its improvements such as SVD++ [koren2015advances], as well as a host of other techniques, e.g. [rendle2010factorization, lawrence2009non, shi2012climf]. Many of those methods have focused on the collaborative filtering setup where user IDs and movie IDs have individual embeddings, such as in the Netflix challenge setup (see e.g. [koren2015advances]), and so new users or items cannot naturally be incorporated. We show how StarSpace can naturally cater for both that setting and the content-based setting where users and items are represented as features, and hence have natural out-of-sample extensions rather than considering only a fixed set.
Performing link prediction in knowledge bases (KBs) with embedding-based methods has also shown promising results in recent years. A series of work has been done in this direction, such as [transE] and [garcia2015composing]. In our work, we show that StarSpace can be used for this task as well, outperforming several methods, and matching the TransE method presented in [transE].
The StarSpace model consists of learning entities, each of which is described by a set of discrete features (bag-of-features) coming from a fixed-length dictionary. An entity such as a document or a sentence can be described by a bag of words or $n$-grams, an entity such as a user can be described by the bag of documents, movies or items they have liked, and so forth. Importantly, the StarSpace model is free to compare entities of different kinds. For example, a user entity can be compared with an item entity (recommendation), or a document entity with label entities (text classification), and so on. This is done by learning to embed them in the same space such that comparisons are meaningful, by optimizing with respect to the metric of interest.
Denoting the dictionary of $D$ features as $F$, which is a $D \times d$ matrix where $F_i$ indexes the $i$-th feature (row), yielding its $d$-dimensional embedding, we embed an entity $a$ with $\sum_{i \in a} F_i$.
That is, like other embedding models, our model starts by assigning a $d$-dimensional vector to each of the discrete features in the set that we want to embed directly (which we call a dictionary; it can contain features like words, etc.). Entities comprised of features (such as documents) are represented by a bag-of-features of the features in the dictionary and their embeddings are learned implicitly. Note an entity could consist of a single (unique) feature like a single word, name or user or item ID if desired.
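The bag-of-features embedding described above can be sketched in a few lines. This is a minimal illustration, not the StarSpace implementation: the function name `embed_entity`, the toy matrix `F` and its values are assumptions made for the example.

```python
from typing import List, Sequence


def embed_entity(features: Sequence[int], F: List[List[float]]) -> List[float]:
    """Embed an entity as the sum of the rows of F selected by its features."""
    d = len(F[0])                      # embedding dimension
    vec = [0.0] * d
    for i in features:                 # each feature is a row index into F
        for j in range(d):
            vec[j] += F[i][j]
    return vec


# Hypothetical toy dictionary: 4 features, d = 2 (arbitrary values).
F = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]

doc = [0, 2, 3]        # an entity made of three features, e.g. a short document
single = [1]           # an entity made of one unique feature, e.g. a label or ID

doc_vec = embed_entity(doc, F)
single_vec = embed_entity(single, F)
```

Because both kinds of entity end up as vectors of the same dimension, they can be compared directly, which is what the training objective relies on.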
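To train our model, we need to learn to compare entities. Specifically, we want to minimize the following loss function:

$$\sum_{(a,b) \in E^+,\; b^- \in E^-} L_{batch}\big(sim(a,b),\, sim(a,b_1^-),\, \ldots,\, sim(a,b_k^-)\big)$$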
There are several ingredients to this recipe:
The generator of positive entity pairs $(a, b)$ coming from the set $E^+$. This is task dependent and will be described subsequently.
The generator of negative entities $b_i^-$ coming from the set $E^-$. We utilize a $k$-negative sampling strategy [word2vec] to select $k$ such negative pairs for each batch update. We select randomly from within the set of entities that can appear in the second argument of the similarity function (e.g., for text labeling tasks $a$ are documents and $b$ are labels, so we sample $b^-$ from the set of labels). An analysis of the impact of $k$ is given in Sec. 4.
The similarity function $sim(\cdot, \cdot)$. In our system, we have implemented both cosine similarity and inner product, and selected the choice as a hyperparameter. Generally, they work similarly well for small numbers of label features (e.g. for classification), while cosine works better for larger numbers, e.g. for sentence or document similarity.
The loss function $L_{batch}$ that compares the positive pair $(a, b)$ with the negative pairs $(a, b_i^-)$, $i = 1, \ldots, k$. We also implement two possibilities: margin ranking loss (i.e. $\max(0, \mu - sim(a, b))$, where $\mu$ is the margin parameter), and negative log loss of softmax. All experiments use the former as it performed on par or better.
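The ingredients listed above can be sketched as follows. This is a hedged illustration rather than the released code: the text gives the margin loss only in abbreviated form, so the pairwise hinge used here (summing over the sampled negatives) is an assumed, commonly used form, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner-product similarity."""
    return sum(x * y for x, y in zip(u, v))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot(u, v) / (nu * nv)


def margin_ranking_loss(pos_sim: float, neg_sims: Sequence[float],
                        margin: float = 1.0) -> float:
    """Assumed form: sum of hinge terms, one per sampled negative."""
    return sum(max(0.0, margin - pos_sim + s) for s in neg_sims)


def softmax_nll(pos_sim: float, neg_sims: Sequence[float]) -> float:
    """Negative log-probability of the positive under a softmax over all scores."""
    scores = [pos_sim] + list(neg_sims)
    m = max(scores)                                   # for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_sim


# One positive and k = 2 negatives, already embedded (toy vectors).
a: List[float] = [1.0, 0.0]
b: List[float] = [1.0, 0.0]
negatives = [[0.0, 1.0], [1.0, 1.0]]

pos = cosine(a, b)
negs = [cosine(a, n) for n in negatives]
loss = margin_ranking_loss(pos, negs, margin=0.5)
```

Only the second negative violates the margin here, so it alone contributes to the loss.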
We optimize by stochastic gradient descent (SGD), i.e., each SGD step is one sample from $E^+$ in the outer sum, using Adagrad [adagrad] and hogwild [hogwild] over multiple CPUs. We also apply a max norm of the embeddings to restrict the vectors learned to lie in a ball of radius $r$ in space $\mathbb{R}^d$, as in other works, e.g. [wsabie].
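A single-threaded sketch of one Adagrad update followed by the max-norm projection may make the optimization step concrete. It assumes the gradient for one embedding row is already available; the names `adagrad_step` and `project_to_ball`, the learning rate and the `eps` constant are illustrative choices, and the hogwild-style parallelism is not shown.

```python
import math
from typing import List


def adagrad_step(row: List[float], grad: List[float], accum: List[float],
                 lr: float = 0.1, eps: float = 1e-8) -> None:
    """Update one embedding row in place with a per-coordinate Adagrad step."""
    for j, g in enumerate(grad):
        accum[j] += g * g                              # running sum of squared gradients
        row[j] -= lr * g / (math.sqrt(accum[j]) + eps)


def project_to_ball(row: List[float], r: float) -> None:
    """Rescale the row in place so that its L2 norm is at most r."""
    norm = math.sqrt(sum(x * x for x in row))
    if norm > r:
        scale = r / norm
        for j in range(len(row)):
            row[j] *= scale


row = [0.0, 0.0]
accum = [0.0, 0.0]
adagrad_step(row, grad=[1.0, -2.0], accum=accum)

big = [3.0, 4.0]          # norm 5, outside a ball of radius 1
project_to_ball(big, r=1.0)
```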
At test time, one can use the learned function $sim(\cdot, \cdot)$ to measure similarity between entities. For example, for classification, a label is predicted at test time for a given input $a$ using $\max_{\hat{b}} sim(a, \hat{b})$ over the set of possible labels $\hat{b}$. Or in general, for ranking one can sort entities by their similarity. Alternatively the embedding vectors can be used directly for some other downstream task, e.g., as is typically done with word embedding models. However, if $sim(\cdot, \cdot)$ directly fits the needs of your application, this is recommended as this is the objective that StarSpace is trained to be good at.
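Test-time use of the similarity function can be sketched as below, assuming the query and candidate embeddings have already been computed. The helper names and the toy label vectors are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def predict_label(a_vec: Vector, label_vecs: Dict[str, Vector],
                  sim: Callable[[Vector, Vector], float]) -> str:
    """Return the label whose embedding is most similar to the input."""
    return max(label_vecs, key=lambda name: sim(a_vec, label_vecs[name]))


def rank_entities(a_vec: Vector, candidates: Dict[str, Vector],
                  sim: Callable[[Vector, Vector], float]) -> List[str]:
    """Sort candidate names from most to least similar to the query."""
    return sorted(candidates, key=lambda name: sim(a_vec, candidates[name]),
                  reverse=True)


labels = {"sports": [1.0, 0.0], "politics": [0.0, 1.0], "tech": [0.7, 0.7]}
query = [0.9, 0.2]

best = predict_label(query, labels, dot)
ranking = rank_entities(query, labels, dot)
```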
We now describe how this model can be applied to a wide variety of tasks, in each case describing how the generators $E^+$ and $E^-$ work for that setting.
Multiclass classification (e.g. text classification): The positive pair generator comes directly from a training set of labeled data specifying $(a, b)$ pairs where $a$ are documents (bags-of-words) and $b$ are labels (singleton features). Negative entities $b^-$ are sampled from the set of possible labels.
Multilabel classification: In this case, each document $a$ can have multiple positive labels; one of them is sampled as $b$ at each SGD step to implement multilabel classification.
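A sketch of the generators for the classification settings follows. One positive label is drawn per step and $k$ negatives are drawn from the label set. Excluding the document's own labels from the negatives is an assumption made for this example (the text only says negatives come from the set of possible labels), and the function name is hypothetical.

```python
import random
from typing import List, Sequence, Set, Tuple


def sample_classification_example(
    doc_words: Sequence[str],
    doc_labels: Set[str],
    all_labels: Sequence[str],
    k: int,
    rng: random.Random,
) -> Tuple[Sequence[str], str, List[str]]:
    """Return (a, b, negatives) for one SGD step."""
    b = rng.choice(sorted(doc_labels))                 # one positive label per step
    # Assumption: never use one of the document's true labels as a negative.
    candidates = [lab for lab in all_labels if lab not in doc_labels]
    negatives = [rng.choice(candidates) for _ in range(k)]
    return doc_words, b, negatives


rng = random.Random(0)
all_labels = ["sports", "politics", "tech", "health", "travel"]
a, b, negs = sample_classification_example(
    ["cup", "final", "goal"], {"sports", "travel"}, all_labels, k=3, rng=rng
)
```

With a single label per document this reduces to the multiclass case; with several, repeated calls visit each positive label over the course of training.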
Collaborative filtering-based recommendation: The training data consists of a set of users, where each user is described by a bag of items (described as unique features from the dictionary) that the user likes. The positive pair generator picks a user, selects $a$ to be the unique singleton feature for that user ID, and a single item that they like as $b$. Negative entities $b^-$ are sampled from the set of possible items.
Collaborative filtering-based recommendation with out-of-sample user extensions: One problem with classical collaborative filtering is that it does not generalize to new users, as a separate embedding is learned for each user ID. Using the same training data as before, one can learn an alternative model using StarSpace. The positive pair generator instead picks a user, selects $a$ as all the items they like except one, and $b$ as the left out item. That is, the model learns to estimate if a user would like an item by modeling the user not as a single embedding based on their ID, but by representing the user as the sum of embeddings of items they like.
Content-based recommendation: This task consists of a set of users, where each user is described by a bag of items, where each item is described by a bag of features from the dictionary (rather than being a unique feature). For example, for document recommendation, each user is described by the bag-of-documents they like, while each document is described by the bag-of-words it contains. Now $a$ can be selected as all of the items except one, and $b$ as the left out item. The system now extends to both new items and new users as both are featurized.
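The leave-one-out pair generators for the two recommendation settings that avoid user-ID features can be sketched as follows. Function names and the toy data are illustrative assumptions.

```python
import random
from typing import Dict, List, Sequence, Tuple


def leave_one_out_pair(liked_items: Sequence[str],
                       rng: random.Random) -> Tuple[List[str], List[str]]:
    """a = all liked items except one, b = the held-out item."""
    items = list(liked_items)
    held = rng.randrange(len(items))
    return items[:held] + items[held + 1:], [items[held]]


def content_based_pair(liked_items: Sequence[str],
                       item_features: Dict[str, List[str]],
                       rng: random.Random) -> Tuple[List[str], List[str]]:
    """Same split, but every item is expanded into its own bag of features."""
    a_items, b_items = leave_one_out_pair(liked_items, rng)
    a = [f for item in a_items for f in item_features[item]]
    b = list(item_features[b_items[0]])
    return a, b


rng = random.Random(1)
liked = ["doc1", "doc2", "doc3", "doc4"]
words = {
    "doc1": ["football", "cup"],
    "doc2": ["election", "vote"],
    "doc3": ["goal", "league"],
    "doc4": ["budget", "tax"],
}

a_items, b_item = leave_one_out_pair(liked, rng)
a_feats, b_feats = content_based_pair(liked, words, rng)
```

In the first function a new user can be embedded from the items they like; in the second, a new item can also be embedded from its words.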
Multi-relational knowledge graphs (e.g. link prediction): Given a graph of $(h, r, t)$ triples, consisting of a head concept $h$, a relation $r$ and a tail concept $t$, e.g. (Beyoncé, born-in, Houston), one can learn embeddings of that graph. Instantiations of $h$, $r$ and $t$ are all defined as unique features in the dictionary. We select uniformly at random either: (i) $a$ consists of the bag of features $h$ and $r$, while $b$ consists only of $t$; or (ii) $a$ consists of $h$, and $b$ consists of $r$ and $t$. Negative entities $b^-$ are sampled from the set of possible concepts. The learnt embeddings can then be used to answer link prediction questions such as (Beyoncé, born-in, ?) or (?, born-in, Houston) via the learnt function $sim(a, b)$.
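The two ways of splitting a triple into an input and a target can be sketched as below; the function name is hypothetical and features are shown as plain strings.

```python
import random
from typing import List, Tuple


def graph_pair(h: str, r: str, t: str,
               rng: random.Random) -> Tuple[List[str], List[str]]:
    """Pick one of the two splits of a (head, relation, tail) triple at random."""
    if rng.random() < 0.5:
        return [h, r], [t]        # (i) predict the tail from head + relation
    return [h], [r, t]            # (ii) match the head against relation + tail


rng = random.Random(0)
pairs = [graph_pair("Beyoncé", "born-in", "Houston", rng) for _ in range(50)]
```

Split (i) corresponds to questions of the form (Beyoncé, born-in, ?), and split (ii) to (?, born-in, Houston).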
Information retrieval (e.g. document search) and document embeddings: Given supervised training data consisting of (search keywords, relevant document) pairs one can directly train an information retrieval model: $a$ contains the search keywords, $b$ is a relevant document and $b^-$ are other irrelevant documents. If only unsupervised training data is available consisting of a set of unlabeled documents, an alternative is to select $a$ as random keywords from the document and $b$ as the remaining words. Note that both these approaches implicitly learn document embeddings which could be used for other purposes.
Learning word embeddings: We can also use StarSpace to learn unsupervised word embeddings using training data consisting of raw text. We select $a$ as a window of words (e.g., four words, two either side of a middle word), and $b$ as the middle word, following [collobert2011, word2vec, fasttext-unsup].
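The window-based generator for word embeddings can be sketched like this. Truncating the window at sentence boundaries is an assumption of the example, as is the function name.

```python
from typing import List, Tuple


def window_pairs(tokens: List[str],
                 half_window: int = 2) -> List[Tuple[List[str], str]]:
    """For each position, pair the surrounding words (a) with the middle word (b)."""
    pairs = []
    for i, middle in enumerate(tokens):
        left = tokens[max(0, i - half_window): i]
        right = tokens[i + 1: i + 1 + half_window]
        context = left + right
        if context:                       # skip one-word inputs with no context
            pairs.append((context, middle))
    return pairs


tokens = "the cat sat on the mat".split()
pairs = window_pairs(tokens, half_window=2)
```

With `half_window=2`, interior positions get the four-word window mentioned in the text, while positions near the edges get a shorter one.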
Learning sentence embeddings: Learning word embeddings (e.g. as above) and using them to embed sentences does not seem optimal when you can learn sentence embeddings directly. Given a training set of unlabeled documents, each consisting of sentences, we select $a$ and $b$ as a pair of sentences both coming from the same document; $b^-$ are sentences coming from other documents. The intuition is that semantic similarity between sentences is shared within a document (one can also only select sentences within a certain distance of each other if documents are very long). Further, the embeddings will automatically be optimized for sets of words of sentence length, so train time matches test time, rather than training with short windows as typically learned with word embeddings; window-based embeddings can deteriorate when the sum of words in a sentence gets too large.
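A sketch of the sentence-pair generators follows. The optional `max_distance` argument models the remark about very long documents; it and the function names are assumptions of this example.

```python
import random
from typing import List, Optional, Tuple


def sentence_pair(doc: List[str], rng: random.Random,
                  max_distance: Optional[int] = None) -> Tuple[str, str]:
    """Pick two different sentences from one document as (a, b)."""
    if len(doc) < 2:
        raise ValueError("need at least two sentences in the document")
    i = rng.randrange(len(doc))
    choices = [j for j in range(len(doc))
               if j != i and (max_distance is None or abs(i - j) <= max_distance)]
    if not choices:
        raise ValueError("max_distance must be at least 1")
    return doc[i], doc[rng.choice(choices)]


def negative_sentences(docs: List[List[str]], doc_index: int, k: int,
                       rng: random.Random) -> List[str]:
    """Sample k negatives from sentences belonging to other documents."""
    others = [s for d, doc in enumerate(docs) if d != doc_index for s in doc]
    return [rng.choice(others) for _ in range(k)]


rng = random.Random(0)
docs = [
    ["s0a", "s0b", "s0c", "s0d"],
    ["s1a", "s1b"],
    ["s2a", "s2b", "s2c"],
]
a, b = sentence_pair(docs[0], rng, max_distance=1)
negs = negative_sentences(docs, doc_index=0, k=5, rng=rng)
```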
Multi-task learning: Any of these tasks can be combined, and trained at the same time if they share some features in the base dictionary $F$. For example one could combine supervised classification with unsupervised word or sentence embedding, to give semi-supervised learning.
We employ StarSpace for the task of text classification and compare it with a host of competing methods, including fastText, on three datasets which were all previously used in [fasttext]. To ensure fair comparison, we use an identical dictionary to fastText and use the same implementation of $n$-grams and pruning (those features are implemented in our open-source distribution of StarSpace). In these experiments we set the dimension of embeddings to be 10, as in [fasttext].
We use three datasets:
AG news (http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) is a 4-class text classification task given title and description fields as input. It consists of 120K training examples, 7,600 test examples, 4 classes, 100K words and 5M tokens in total.
DBpedia [lehmann2015dbpedia] is a 14-class classification problem given the title and abstract of Wikipedia articles as input. It consists of 560K training examples, 70K test examples, 14 classes, 800K words and 32M tokens in total.
The Yelp reviews dataset is obtained from the 2015 Yelp Dataset Challenge (https://www.yelp.com/dataset_challenge). The task is to predict the full number of stars the user has given (from 1 to 5). It consists of 1.2M training examples, 157K test examples, 5 classes, 500K words and 193M tokens in total.
Results are given in Table 2. Baselines are quoted from the literature (some methods are only reported on AG news and DBpedia, others only on Yelp15). StarSpace outperforms a number of methods, and performs similarly to fastText. We measure the training speed for $n$-grams in Table 3. fastText and StarSpace are both efficient compared to deep learning approaches, e.g. [zhang2015text] takes 5h per epoch on DBpedia, 375x slower than StarSpace. Still, fastText is faster than StarSpace. However, as we will see in the following sections, StarSpace is a more general system.
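Table 1 (content-based document recommendation):

| Metric | Hits@1 | Hits@10 | Hits@20 | Mean Rank | Training Time |
|---|---|---|---|---|---|
| fastText (public Wikipedia model) | 0.5% | 1.7% | 2.5% | 4154.4 | - |
| fastText (our dataset) | 0.79% | 2.5% | 3.7% | 3910.9 | 4h30m |
| SVM Ranker: BoW features | 0.99% | 3.3% | 4.6% | 2440.1 | - |
| SVM Ranker: fastText features (our dataset) | 0.92% | 3.3% | 4.2% | 3833.8 | - |

Table 3 (training speed on text classification):

| Training time | AG news | DBpedia | Yelp15 |
|---|---|---|---|
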
We consider the task of recommending new documents to a user given their past history of liked documents. We follow a process very similar to that described in [tagspace] in our experiment. The data for this task is comprised of anonymized two-week-long interaction histories for a subset of people on a popular social networking service. For each of the 641,385 people considered, we collected the text of public articles that s/he clicked to read, giving a total of 3,119,909 articles. Given the person’s trailing $n-1$ clicked articles, we use our model to predict the $n$’th article by ranking it against 10,000 other unrelated articles, and evaluate using ranking metrics. The score of the $n$’th article is obtained by applying StarSpace: the input $a$ is the previous $n-1$ articles, and the output $b$ is the $n$’th candidate article. We measure the results by computing hits@k, i.e. the proportion of correct entities ranked in the top $k$ for $k$ = 1, 10, 20, and the mean predicted rank of the clicked article among the 10,000 articles.
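The ranking metrics used in this evaluation can be sketched as below. Ranks are 1-based; counting only strictly higher scores as outranking the target (an optimistic treatment of ties) is an assumption of this example, and the function names are illustrative.

```python
from typing import List, Sequence


def rank_of_target(target_score: float, other_scores: Sequence[float]) -> int:
    """1-based rank of the true item among the distractors."""
    return 1 + sum(1 for s in other_scores if s > target_score)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of test cases whose true item is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_rank(ranks: Sequence[int]) -> float:
    return sum(ranks) / len(ranks)


# Toy ranks for four test users (hypothetical values).
ranks: List[int] = [1, 5, 12, 30]
h1, h10, h20 = hits_at_k(ranks, 1), hits_at_k(ranks, 10), hits_at_k(ranks, 20)
mr = mean_rank(ranks)
example_rank = rank_of_target(0.9, [0.1, 0.95, 0.3])
```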
As this is not a classification task (i.e. there is not a fixed set of labels to classify amongst, but a variable set of never-before-seen documents to rank per user) we cannot use supervised classification models directly. StarSpace, however, can deal directly with this task, which is one of its major benefits. Following [tagspace], we hence use the following models as baselines:
Word2vec model. We use the publicly available word2vec model trained on Google News articles (https://code.google.com/archive/p/word2vec/), and use the word embeddings to generate article embeddings (by bag-of-words) and users’ embeddings (by bag-of-articles in users’ click history). We then use cosine similarity for ranking.
Unsupervised fastText model. We try both the previously trained publicly available model on Wikipedia (https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md), and train on our own dataset. Unsupervised fastText is an enhancement of word2vec that also includes subwords.
Linear SVM ranker, using either bag-of-words features or fastText embeddings (component-wise multiplication of $a$’s and $b$’s features, which are of the same dimension).
Tagspace model trained on a hashtag task, and then the embeddings are used for document recommendation, a reproduction of the setup in [tagspace]. In that work, the Tagspace model was shown to outperform word2vec.
TFIDF bag-of-words cosine similarity model.
Table 4 (link prediction on Freebase 15k; r. = raw, f. = filtered):

| Metric | Hits@10 r. | Mean Rank r. | Hits@10 f. | Mean Rank f. | Train Time |
|---|---|---|---|---|---|

Table 6 (Wikipedia article search, Task 1):

| Metric | Hits@1 | Hits@10 | Hits@20 | Mean Rank | Training Time |
|---|---|---|---|---|---|
| fastText (public Wikipedia model) | 18.08% | 36.36% | 42.97% | 987.27 | - |
| fastText (our dataset) | 16.89% | 37.60% | 45.25% | 786.77 | 40h |
| SVM Ranker: BoW features | 56.73% | 69.24% | 71.86% | 723.47 | - |
| SVM Ranker: fastText features (public) | 18.44% | 37.80% | 45.91% | 887.96 | - |

Table 7 (Wikipedia sentence matching, Task 2):

| Metric | Hits@1 | Hits@10 | Hits@20 | Mean Rank | Training Time |
|---|---|---|---|---|---|
| fastText (public Wikipedia model) | 5.77% | 14.08% | 17.79% | 2393.38 | - |
| fastText (our dataset) | 5.47% | 13.54% | 17.60% | 2363.74 | 40h |
| StarSpace (word-level training) | 5.89% | 16.41% | 20.60% | 1614.21 | 45h |
| SVM Ranker: BoW features | 26.36% | 36.48% | 39.25% | 2368.37 | - |
| SVM Ranker: fastText features (public) | 5.81% | 12.14% | 15.20% | 1442.05 | - |
| StarSpace (sentence pair training) | 30.07% | 50.89% | 57.60% | 422.00 | 36h |
| StarSpace (word+sentence training) | 25.54% | 45.21% | 52.08% | 484.27 | 69h |

For fair comparison, we set the dimension of all embedding models to be 300. We show the results of our StarSpace model comparing with the baseline models in Table 1. Training time for StarSpace and fastText [fasttext-unsup] trained on our dataset is also provided.
Tagspace was previously shown to provide superior performance to word2vec, and we observe the same result here. Unsupervised fastText, which is an enhancement of word2vec, is also slightly inferior to Tagspace, but better than word2vec. However, StarSpace, which is naturally more suited to this task, outperforms all those methods, including Tagspace and SVMs, by a significant margin. Overall, from the evaluation one can see that unsupervised methods of learning word embeddings are inferior to training specifically for the document recommendation task at hand, which StarSpace does.
We show that one can also use StarSpace on tasks of knowledge representation. We use the Freebase 15k dataset from [transE], which consists of a collection of triplets (head, relation_type, tail) extracted from Freebase (http://www.freebase.com). This data set can be seen as a 3-mode tensor depicting ternary relationships between synsets. There are 14,951 concepts (mids) and 1,345 relation types among them. The training set contains 483,142 triplets, the validation set 50,000 and the test set 59,071. As described in [transE], evaluation is performed by, for each test triplet, removing the head and replacing it by each of the entities in the dictionary in turn. Scores for those corrupted triplets are first computed by the models and then sorted; the rank of the correct entity is finally stored. This whole procedure is repeated while removing the tail instead of the head. We report the mean of those predicted ranks and the hits@10. We also conduct a filtered evaluation that is the same, except all other valid heads or tails from the train or test set are discarded in the ranking, following [transE].
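The difference between the raw and filtered settings can be sketched for a single test triplet as follows, assuming the model has already produced a score for every candidate entity. The function name, the entity names and the scores are hypothetical.

```python
from typing import Dict, Set, Tuple


def raw_and_filtered_rank(scores: Dict[str, float], correct: str,
                          known_valid: Set[str]) -> Tuple[int, int]:
    """Rank of the correct entity, with and without other known-valid answers.

    `known_valid` holds entities that also form a true triplet with the same
    relation and remaining entity; the filtered rank ignores them.
    """
    target = scores[correct]
    raw = 1 + sum(1 for e, s in scores.items()
                  if e != correct and s > target)
    filtered = 1 + sum(1 for e, s in scores.items()
                       if e != correct and e not in known_valid and s > target)
    return raw, filtered


# Candidate tails for one (head, relation, ?) query.
scores = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.1}
raw, filtered = raw_and_filtered_rank(scores, correct="C", known_valid={"A"})
```

Here entity A outranks the correct answer C but is itself a valid answer, so it counts against the raw rank only.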
We compare with a number of methods, including TransE presented in [transE]. TransE was shown to outperform RESCAL [rescal], RFM [jenatton2012latent], SE [bordes2011learning] and SME [bordes2014semantic] and is considered a standard benchmark method. TransE uses an L2 similarity $\|\text{head} + \text{relation} - \text{tail}\|^2$ and SGD updates with single entity corruptions of head or tail that should have a larger distance. In contrast, StarSpace uses a dot product, $k$-negative sampling, and two different embeddings to represent the relation entity, depending on whether it appears in $a$ or $b$.
The results are given in Table 4. Results for SE, SME and LFM are reported from [transE] and optimize the dimension from the choices 20, 50 and 75 as a hyperparameter. RESCAL is reported from [hole]. For TransE we ran it ourselves so that we could report the results for different embedding dimensions, and because we obtained better results by fine tuning it than previously reported. Comparing TransE and StarSpace for the same embedding dimension, these two methods then give similar performance. Note there are some recent improved results on this dataset using larger embeddings [kadlec2017knowledge] or more complex, but less general, methods [shen2017modeling].
In this section, we ran experiments on the Freebase 15k dataset to illustrate the complexity of our model in terms of the number of negative search examples. We set , and the max training time of the algorithm to be 1 hour for all experiments. We report the number of epochs the algorithm completes within the time limit and the best filtered hits@10 result over possible learning rate choices, for different $k$ (number of negatives searched for each positive training example). We set .
The result is presented in Table 5. We observe that the number of epochs finished within the 1 hour training time constraint is close to an inverse linear function of $k$. In this particular setup, [1, 100] is a good range of $k$ and the best result is achieved at .
Table 8 (example predictions on Wikipedia article search, Task 1):

| Input Query | StarSpace result | fastText result |
|---|---|---|
| She is the 1962 Blue Swords champion and 1960 Winter Universiade silver medalist. | Article: Eva Grožajová. Paragraph: Eva Grožajová, later Bergerová-Grožajová, is a former competitive figure skater who represented Czechoslovakia. She placed 7th at the 1961 European Championships and 13th at the 1962 World Championships. She was coached by Hilda Múdra. | Article: Michael Reusch. Paragraph: Michael Reusch (February 3, 1914–April 6, 1989) was a Swiss gymnast and Olympic Champion. He competed at the 1936 Summer Olympics in Berlin, where he received silver medals in parallel bars and team combined exercises… |
| The islands are accessible by a one-hour speedboat journey from Kuala Abai jetty, Kota Belud, 80 km north-east of Kota Kinabalu, the capital of Sabah. | Article: Mantanani Islands. Paragraph: The Mantanani Islands form a small group of three islands off the north-west coast of the state of Sabah, Malaysia, opposite the town of Kota Belud, in northern Borneo. The largest island is Mantanani Besar; the other two are Mantanani Kecil and Lungisan… | Article: Gum-Gum. Paragraph: Gum-Gum is a township of Sandakan, Sabah, Malaysia. It is situated about 25km from Sandakan town along Labuk Road. |
| Maggie withholds her conversation with Neil from Tom and goes to the meeting herself, and Neil tells her the spirit that contacted Tom has asked for something and will grow upset if it does not get done. | Article: Stir of Echoes. Paragraph: Stir of Echoes is a 1999 American supernatural horror-thriller released in the United States on September 10, 1999, starring Kevin Bacon and directed by David Koepp. The film is loosely based on the novel "A Stir of Echoes" by Richard Matheson… | Article: The Fabulous Five. Paragraph: The Fabulous Five is an American book series by Betsy Haynes in the late 1980s. Written mainly for preteen girls, it is a spin-off of Haynes’ other series about Taffy Sinclair… |

In this section, we apply our model on a Wikipedia article search and a sentence match problem. We use the Wikipedia dataset introduced by [chen2017reading], which is the 2016-12-21 dump of English Wikipedia. For each article, only the plain text is extracted and all structured data sections such as lists and figures are stripped. It contains a total of 5,075,182 articles with 9,008,962 unique uncased token types. The dataset is split into 5,035,182 training examples, 10,000 validation examples and 10,000 test examples. We then consider the following evaluation tasks:
Table 9 (SentEval transfer tasks; * marks baselines quoted from [infersent]):

| Task | MR | CR | SUBJ | MPQA | SST | TREC | MRPC | SICK-R | SICK-E | STS14 |
|---|---|---|---|---|---|---|---|---|---|---|
| Unigram-TFIDF* | 73.7 | 79.2 | 90.3 | 82.4 | - | 85.0 | 73.6 / 81.7 | - | - | 0.58 / 0.57 |
| ParagraphVec (DBOW)* | 60.2 | 66.9 | 76.3 | 70.7 | - | 59.4 | 72.9 / 81.1 | - | - | 0.42 / 0.43 |
| SDAE* | 74.6 | 78.0 | 90.8 | 86.9 | - | 78.4 | 73.7 / 80.7 | - | - | 0.37 / 0.38 |
| SIF (GloVe+WR)* | - | - | - | 82.2 | - | - | - | - | 84.6 | 0.69 / - |
| word2vec* | 77.7 | 79.8 | 90.9 | 88.3 | 79.7 | 83.6 | 72.5 / 81.4 | 0.80 | 78.7 | 0.65 / 0.64 |
| GloVe* | 78.7 | 78.5 | 91.6 | 87.6 | 79.8 | 83.6 | 72.1 / 80.9 | 0.80 | 78.6 | 0.54 / 0.56 |
| fastText (public Wikipedia model)* | 76.5 | 78.9 | 91.6 | 87.4 | 78.8 | 81.8 | 72.4 / 81.2 | 0.80 | 77.9 | 0.63 / 0.62 |
| StarSpace [word] | 73.8 | 77.5 | 91.53 | 86.6 | 77.2 | 82.2 | 73.1 / 81.8 | 0.79 | 78.8 | 0.65 / 0.62 |
| StarSpace [sentence] | 69.1 | 75.1 | 85.4 | 80.5 | 72.0 | 63.0 | 69.2 / 79.7 | 0.76 | 76.2 | 0.70 / 0.67 |
| StarSpace [word + sentence] | 72.1 | 77.1 | 89.6 | 84.1 | 77.5 | 79.0 | 70.2 / 80.3 | 0.79 | 77.8 | 0.69 / 0.66 |
| StarSpace [ensemble w+s] | 76.6 | 80.3 | 91.8 | 88.0 | 79.9 | 85.2 | 71.8 / 80.6 | 0.78 | 82.1 | 0.69 / 0.65 |

Table 10 (SentEval semantic textual similarity tasks):

| Task | STS12 | STS13 | STS14 | STS15 | STS16 |
|---|---|---|---|---|---|
| fastText (public Wikipedia model) | 0.60 / 0.59 | 0.62 / 0.63 | 0.63 / 0.62 | 0.68 / 0.69 | 0.62 / 0.66 |
| StarSpace [word] | 0.53 / 0.54 | 0.60 / 0.60 | 0.65 / 0.62 | 0.68 / 0.67 | 0.64 / 0.65 |
| StarSpace [sentence] | 0.58 / 0.58 | 0.66 / 0.65 | 0.70 / 0.67 | 0.74 / 0.73 | 0.69 / 0.69 |
| StarSpace [word+sentence] | 0.58 / 0.59 | 0.63 / 0.63 | 0.68 / 0.65 | 0.72 / 0.72 | 0.68 / 0.68 |
| StarSpace [ensemble w+s] | 0.58 / 0.59 | 0.64 / 0.64 | 0.69 / 0.65 | 0.73 / 0.72 | 0.69 / 0.69 |

Task 1: given a sentence from a Wikipedia article as a search query, we try to find the Wikipedia article it came from. We rank the true Wikipedia article (minus the sentence) against 10,000 other Wikipedia articles using ranking evaluation metrics. This mimics a web search like scenario where we would like to search for the most relevant Wikipedia articles (web documents). Note that we effectively have supervised training data for this task.
Task 2: pick two random sentences from a Wikipedia article, use one as the search query, and try to find the other sentence coming from the same original document. We rank the true sentence against 10,000 other sentences from different Wikipedia articles. This fits the scenario where we want to find sentences that are closely semantically related by topic (but do not necessarily have strong word overlap). Note also that we effectively have supervised training data for this task.
We can train our StarSpace model in the following way: each update step selects a Wikipedia article from our training set. Then, one random sentence is picked from the article as the input, and for Task 2 another random sentence (different from the input) is picked from the article as the label (otherwise the rest of the article for Task 1). Negative entities can be selected at random from the training set. In the case of training for Task 1, for label features we use a feature dropout probability of 0.8, which both regularizes and greatly speeds up training. We also try StarSpace word-level training, and multi-tasking both sentence and word-level for Task 2.
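Feature dropout on the label side can be sketched as follows: each feature of the (long) article is dropped independently with the given probability before it is embedded. Keeping at least one feature when everything would be dropped is an assumption of this example, as is the function name; the released implementation may handle this differently.

```python
import random
from typing import List, Sequence


def feature_dropout(features: Sequence[str], drop_prob: float,
                    rng: random.Random) -> List[str]:
    """Keep each feature independently with probability 1 - drop_prob."""
    kept = [f for f in features if rng.random() >= drop_prob]
    # Assumption: never return an empty bag, fall back to one random feature.
    return kept if kept else [rng.choice(list(features))]


rng = random.Random(0)
article = [f"w{i}" for i in range(1000)]      # stand-in for an article's words
kept = feature_dropout(article, drop_prob=0.8, rng=rng)
```

With a dropout probability of 0.8, only about a fifth of the label features are summed at each step, which is where the speed-up mentioned in the text comes from.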
We compare StarSpace with the publicly released fastText model, as well as a fastText model trained on the text of our dataset (fastText training is unsupervised even on our dataset since its original design does not support directly using supervised data here). We also compare to a TFIDF baseline. For fair comparison, we set the dimension of all embedding models to be 300. The results for Tasks 1 and 2 are summarized in Tables 6 and 7 respectively. StarSpace outperforms TFIDF and fastText by a significant margin; this is because StarSpace can train directly for the tasks of interest, whereas this is not in the declared scope of fastText. Note that StarSpace word-level training, which is similar to fastText in method, obtains similar results to fastText. Crucially, it is StarSpace’s ability to do sentence and document level training that brings the performance gains.
A comparison of the predictions of StarSpace and fastText on the article search task (Task 1) on a few random queries is given in Table 8. While fastText results are semantically in roughly the right part of the space, they lack finer precision. For example, the first query is looking for articles about an Olympic skater, which StarSpace correctly understands, whereas fastText picks an Olympic gymnast. Note that the query does not specifically mention the word skater; StarSpace can only understand this by understanding related phrases, e.g. the phrase “Blue Swords” refers to an international figure skating competition. The other two examples given yield similar conclusions.
In this section, we evaluate sentence embeddings generated by our model and use SentEval (https://github.com/facebookresearch/SentEval), which is a tool from [infersent] for measuring the quality of general purpose sentence embeddings. We use a total of 14 transfer tasks including binary classification, multi-class classification, entailment, paraphrase detection, semantic relatedness and semantic textual similarity from SentEval. Detailed descriptions of these transfer tasks and baseline models can be found in [infersent].
We train the following models on the Wikipedia Task 2 from the previous section, and evaluate sentence embeddings generated by those models:
StarSpace trained on word level.
StarSpace trained on sentence level.
StarSpace trained (multi-tasked) on both word and sentence level.
Ensemble of StarSpace models trained on both word and sentence level: we train a set of 13 models, multi-tasking on Wikipedia sentence match and word-level training then concatenate all embeddings together to generate a dimension embedding for each word.
We present the results in Table 9 and Table 10. StarSpace performs well, outperforming many methods on many of the tasks, although no method wins outright across all tasks. Particularly on the STS (Semantic Textual Similarity) tasks, StarSpace has very strong results. Please refer to [infersent] for further results and analysis of these datasets.
In this paper, we propose StarSpace, a method of embedding and ranking entities using the relationships between entities, and show that the method we propose is a general system capable of working on many tasks:
Text Classification / Sentiment Analysis: we show that our method achieves good results, comparable to fastText [fasttext] on three different datasets.
Content-based Document recommendation: it can directly solve these tasks well, whereas applying off-the-shelf fastText, Tagspace or word2vec gives inferior results.
Link Prediction in Knowledge Bases: we show that our method outperforms several methods, and matches TransE [transE] on Freebase 15K.
Wikipedia Search and Sentence Matching tasks: it outperforms off-the-shelf embedding models due to directly training sentence and document-level embeddings.
Learning Sentence Embeddings: It performs well on the 14 SentEval transfer tasks of [infersent] compared to a host of embedding methods.
StarSpace should also be highly applicable to other tasks we did not evaluate here such as other classification, ranking, retrieval or metric learning tasks. Importantly, what is more general about our method compared to many existing embedding models is: (i) the flexibility of using features to represent labels that we want to classify or rank, which enables it to train directly on a downstream prediction/ranking task; and (ii) different ways of selecting positives and negatives suitable for those tasks. Choosing the wrong generators $E^+$ and $E^-$ gives greatly inferior results, as shown e.g. in Table 7.
Future work will consider the following enhancements: going beyond discrete features, e.g. to continuous features, considering nonlinear representations and experimenting with other entities such as images. Finally, while our model is relatively efficient, we could consider hierarchical classification schemes as in fastText to try to make it more efficient; the trick here would be to do this while maintaining the generality of our model which is what makes it so appealing.
We would like to thank Timothee Lacroix for sharing with us his implementation of TransE. We also thank Edouard Grave, Armand Joulin and Arthur Szlam for helpful discussions on the StarSpace model.