Implementing FastSent in theano
Unsupervised methods for learning distributed representations of words are ubiquitous in today's NLP research, but far less is known about the best ways to learn distributed phrase or sentence representations from unlabelled data. This paper is a systematic comparison of models that learn such representations. We find that the optimal approach depends critically on the intended application. Deeper, more complex models are preferable for representations to be used in supervised systems, but shallow log-linear models work best for building representation spaces that can be decoded with simple spatial distance metrics. We also propose two new unsupervised representation-learning objectives designed to optimise the trade-off between training time, domain portability and performance.
Distributed representations - dense real-valued vectors that encode the semantics of linguistic units - are ubiquitous in today’s NLP research. For single words or word-like entities, there are established ways to acquire such representations from naturally occurring (unlabelled) training data based on comparatively task-agnostic objectives (such as predicting adjacent words). These methods are well understood empirically [Baroni et al.2014b] and theoretically [Levy and Goldberg2014]. The best word representation spaces reflect consistently-observed aspects of human conceptual organisation [Hill et al.2015b], and can be added as features to improve the performance of numerous language processing systems [Collobert et al.2011].
By contrast, there is comparatively little consensus on the best ways to learn distributed representations of phrases or sentences (see the contrasting conclusions in [Mitchell and Lapata2008, Clark and Pulman2007, Baroni et al.2014a, Milajevs et al.2014] among others). With the advent of deeper language processing techniques, it is relatively common for models to represent phrases or sentences as continuous-valued vectors. Examples include machine translation [Sutskever et al.2014], image captioning [Mao et al.2015] and dialogue systems [Serban et al.2015]. While it has been observed informally that the internal sentence representations of such models can reflect semantic intuitions [Cho et al.2014], it is not known which architectures or objectives yield the ‘best’ or most useful representations. Resolving this question could ultimately have a significant impact on language processing systems. Indeed, it is phrases and sentences, rather than individual words, that encode the human-like general world knowledge (or ‘common sense’) [Norman1972] that is a critical missing part of most current language understanding systems.
We address this issue with a systematic comparison of cutting-edge methods for learning distributed representations of sentences. We constrain our comparison to methods that do not require labelled data gathered for the purpose of training models, since such methods are more cost-effective and applicable across languages and domains. We also propose two new phrase or sentence representation learning objectives - Sequential Denoising Autoencoders (SDAEs) and FastSent, a sentence-level log-linear bag-of-words model. We compare all methods on two types of task - supervised and unsupervised evaluations - reflecting different ways in which representations are ultimately to be used. In the former setting, a classifier or regression model is applied to representations and trained with task-specific labelled data, while in the latter, representation spaces are directly queried using cosine distance.
We observe notable differences in approaches depending on the nature of the evaluation metric. In particular, deeper or more complex models (which require greater time and resources to train) generally perform best in the supervised setting, whereas shallow log-linear models work best on unsupervised benchmarks. Specifically, SkipThought Vectors [Kiros et al.2015] perform best on the majority of supervised evaluations, but SDAEs are the top performer on paraphrase identification. In contrast, on the (unsupervised) SICK sentence relatedness benchmark, FastSent, a simple, log-linear variant of the SkipThought objective, performs better than all other models. Interestingly, the method that exhibits strongest performance across both supervised and unsupervised benchmarks is a bag-of-words model trained to compose word embeddings using dictionary definitions [Hill et al.2015a]. Taken together, these findings constitute valuable guidelines for the application of phrasal or sentential representation-learning to language understanding systems.
To constrain the analysis, we compare neural language models that compute sentence representations from unlabelled, naturally-occurring data, as with the predominant methods for word representations (this excludes innovative supervised sentence-level architectures including [Socher et al.2011, Kalchbrenner et al.2014] and many others). Likewise, we do not focus on ‘bottom up’ models where phrase or sentence representations are built from fixed mathematical operations on word vectors (although we do consider a canonical case - see CBOW below); these were already compared by [Milajevs et al.2014]. Most space is devoted to our novel approaches, and we refer the reader to the original papers for more details of existing models.
SkipThought Vectors For consecutive sentences $S_{i-1}, S_i, S_{i+1}$ in some document, the SkipThought model [Kiros et al.2015] is trained to predict target sentences $S_{i-1}$ and $S_{i+1}$ given source sentence $S_i$. As with all sequence-to-sequence models, in training the source sentence is ‘encoded’ by a Recurrent Neural Network (RNN) (with Gated Recurrent Units [Cho et al.2014]) and then ‘decoded’ into the two target sentences in turn. Importantly, because RNNs employ a single set of update weights at each time-step, both the encoder and decoder are sensitive to the order of words in the source sentence.
For each position in a target sentence, the decoder computes a softmax distribution over the model’s vocabulary. The cost of a training example is the sum of the negative log-likelihood of each correct word in the target sentences $S_{i-1}$ and $S_{i+1}$. This cost is backpropagated to train the encoder (and decoder), which, when trained, can map sequences of words to a single vector.
ParagraphVector [Le and Mikolov2014] proposed two log-linear models of sentence representation. The DBOW model learns a vector $\mathbf{s}$ for every sentence $S$ in the training corpus which, together with word embeddings $v_w$, define a softmax distribution optimised to predict words $w \in S$ given $S$. The $v_w$ are shared across all sentences in the corpus. In the DM model, $k$-grams of consecutive words $w_i \ldots w_{i+k}$ in $S$ are selected and $\mathbf{s}$ is combined with $v_{w_i} \ldots v_{w_{i+k}}$ to make a softmax prediction (parameterised by additional weights) of $w_{i+k+1}$.
We used the Gensim implementation (https://radimrehurek.com/gensim/), treating each sentence in the training data as a ‘paragraph’ as suggested by the authors. During training, both DM and DBOW models store representations for every sentence (as well as word) in the training corpus. Even on large servers it was therefore only possible to train models with a restricted representation size, and DM models whose combination operation was averaging (rather than concatenation).
Bottom-Up Methods We train CBOW and SkipGram word embeddings [Mikolov et al.2013b] on the Books corpus, and compose by elementwise addition as proposed by [Mitchell and Lapata2010] (we also tried multiplication but this gave very poor results).
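The additive composition used for the bottom-up baselines can be illustrated with a minimal sketch. The function name `compose_additive`, the toy embedding values and the choice to skip out-of-vocabulary words are assumptions made for illustration; they are not taken from the original implementation.

```python
def compose_additive(sentence, embeddings):
    """Sum the embeddings of the in-vocabulary words of a tokenised sentence."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in sentence:
        vec = embeddings.get(word)
        if vec is None:  # assumption: out-of-vocabulary words are skipped
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total


# Toy, hypothetical 3-dimensional embeddings (not trained values).
toy_embeddings = {
    "a": [0.1, 0.0, 0.2],
    "brown": [0.0, 0.5, 0.1],
    "horse": [0.7, 0.2, 0.0],
}
sentence_vector = compose_additive(["a", "brown", "horse", "gallops"], toy_embeddings)
```

Because addition is commutative, the resulting sentence vector is insensitive to word order, which is the defining property of the BOW models compared in this study.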
We also compare to C-PHRASE [Pham et al.2015], an approach that exploits a (supervised) parser to infer distributed semantic representations based on a syntactic parse of sentences. C-PHRASE achieves state-of-the-art results for distributed representations on several evaluations used in this study. Since code for C-PHRASE is not publicly available, we use the available pre-trained model (http://clic.cimec.unitn.it/composes/cphrase-vectors.html). Note this model is trained on more text than others in this study.
Non-Distributed Baseline We implement a TFIDF BOW model in which the representation of sentence $S$ encodes the count in $S$ of a set of feature-words weighted by their tfidf in $C$, the corpus. The feature-words are the 200,000 most common words in $C$.
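A minimal sketch of such a TFIDF BOW representation follows. The exact tf-idf variant is not specified in the text, so the smoothed idf formula below is an assumption, as are the function names `fit_idf` and `tfidf_vector` and the toy corpus; `max_features` plays the role of the 200,000-word feature set.

```python
import math
from collections import Counter


def fit_idf(corpus, max_features):
    """Pick the most common words as features and compute an idf weight for each."""
    term_counts = Counter(w for sent in corpus for w in sent)
    doc_counts = Counter(w for sent in corpus for w in set(sent))
    features = [w for w, _ in term_counts.most_common(max_features)]
    n = len(corpus)
    # Assumption: smoothed idf = log((1 + N) / (1 + df)) + 1.
    idf = {w: math.log((1 + n) / (1 + doc_counts[w])) + 1.0 for w in features}
    return features, idf


def tfidf_vector(sentence, features, idf):
    """Count of each feature-word in the sentence, weighted by its corpus idf."""
    counts = Counter(sentence)
    return [counts[w] * idf[w] for w in features]


toy_corpus = [
    ["the", "horse", "runs"],
    ["the", "girl", "rides", "the", "horse"],
    ["the", "pool", "is", "full"],
]
features, idf = fit_idf(toy_corpus, max_features=4)
vec = tfidf_vector(["the", "horse", "the"], features, idf)
```

Each sentence is mapped to a sparse vector with one dimension per feature-word, so no semantic similarity between distinct words is captured.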
The following models rely on (freely-available) data that has more structure than raw text.
DictRep [Hill et al.2015a] trained neural language models to map dictionary definitions to pre-trained word embeddings of the words defined by those definitions. They experimented with BOW and RNN (with LSTM) encoding architectures and variants in which the input word embeddings were either learned or pre-trained (+embs.) to match the target word embeddings. We implement their models using the available code and training data (https://www.cl.cam.ac.uk/~fh295/). Definitions from the training data matching those in the WordNet STS 2014 evaluation (used in this study) were excluded.
CaptionRep Using the same overall architecture, we trained (BOW and RNN) models to map captions in the COCO dataset [Chen et al.2015] to pre-trained vector representations of images. The image representations were encoded by a deep convolutional network [Szegedy et al.2014] trained on the ILSVRC 2014 object recognition task [Russakovsky et al.2014]. Multi-modal distributed representations can be encoded by feeding test sentences forward through the trained model.
NMT We consider the sentence representations learned by neural MT models. These models have identical architecture to SkipThought, but are trained on sentence-aligned translated texts. We used a standard architecture [Cho et al.2014] on all available En-Fr and En-De data from the 2015 Workshop on Statistical MT (WMT; www.statmt.org/wmt15/translation-task.html).
We introduce two new approaches designed to address certain limitations with the existing models.
Sequential (Denoising) Autoencoders The SkipThought objective requires training text with a coherent inter-sentence narrative, making it problematic to port to domains such as social media or artificial language generated from symbolic knowledge. To avoid this restriction, we experiment with a representation-learning objective based on denoising autoencoders (DAEs). In a DAE, high-dimensional input data is corrupted according to some noise function, and the model is trained to recover the original data from the corrupted version. As a result of this process, DAEs learn to represent the data in terms of features that explain its important factors of variation [Vincent et al.2008]. Transforming data into DAE representations (as a ‘pre-training’ or initialisation step) gives more robust (supervised) classification performance in deep feedforward networks [Vincent et al.2010].
The original DAEs were feedforward nets applied to (image) data of fixed size. Here, we adapt the approach to variable-length sentences by means of a noise function $N(S \mid p_o, p_x)$, determined by free parameters $p_o, p_x \in [0, 1]$. First, for each word $w$ in $S$, $N$ deletes $w$ with (independent) probability $p_o$. Then, for each non-overlapping bigram $w_i w_{i+1}$ in $S$, $N$ swaps $w_i$ and $w_{i+1}$ with probability $p_x$. We then train the same LSTM-based encoder-decoder architecture as NMT, but with the denoising objective to predict (as target) the original source sentence $S$ given a corrupted version $N(S \mid p_o, p_x)$ (as source). The trained model can then encode novel word sequences into distributed representations. We call this model the Sequential Denoising Autoencoder (SDAE). Note that, unlike SkipThought, SDAEs can be trained on sets of sentences in arbitrary order.
We label the case with no noise (i.e. $p_o = p_x = 0$) SAE. This setting matches the method applied to text classification tasks by [Dai and Le2015]. The ‘word dropout’ effect when $p_x = 0$ has also been used as a regulariser for deep nets in supervised language tasks [Iyyer et al.2015], and for large $p_x$ the objective is similar to word-level ‘debagging’ [Sutskever et al.2011]. For the SDAE, we tuned $p_o$, $p_x$ on the validation set (see Section 3.2), searching over candidate values of both parameters and keeping the best-performing setting. We also tried a variant (+embs) in which words are represented by (fixed) pre-trained embeddings.
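The two-stage corruption procedure described above (word deletion followed by bigram swapping) can be sketched as follows. The function name `corrupt`, the parameter names `p_o` and `p_x`, and the decision to form bigrams over the sentence after deletion are assumptions of this sketch rather than details confirmed by the text.

```python
import random


def corrupt(sentence, p_o, p_x, rng=random):
    """Delete each word with probability p_o, then swap each
    non-overlapping bigram with probability p_x."""
    # Stage 1: independent word deletion.
    noisy = [w for w in sentence if rng.random() >= p_o]
    # Stage 2: swap non-overlapping bigrams (positions 0-1, 2-3, ...).
    # Assumption: bigrams are taken from the sentence after deletion.
    for i in range(0, len(noisy) - 1, 2):
        if rng.random() < p_x:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy


words = ["a", "b", "c", "d", "e"]
unchanged = corrupt(words, 0.0, 0.0)        # the no-noise (SAE) setting
all_swapped = corrupt(words, 0.0, 1.0)      # every bigram swapped
sample = corrupt(words, 0.1, 0.1, rng=random.Random(0))
```

During training, the corrupted sequence would be fed to the encoder and the original sequence used as the decoder target.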
FastSent The performance of SkipThought vectors shows that rich sentence semantics can be inferred from the content of adjacent sentences. The model could be said to exploit a type of sentence-level Distributional Hypothesis [Harris1954, Polajnar et al.2015]. Nevertheless, like many deep neural language models, SkipThought is very slow to train (see Table 1). FastSent is a simple additive (log-linear) sentence model designed to exploit the same signal, but at much lower computational expense. Given a BOW representation of some sentence in context, the model simply predicts adjacent sentences (also represented as BOW).
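More formally, FastSent learns a source embedding $u_w$ and a target embedding $v_w$ for each word $w$ in the model vocabulary. For a training example $S_{i-1}, S_i, S_{i+1}$ of consecutive sentences, $S_i$ is represented as the sum of its source embeddings $\mathbf{s_i} = \sum_{w \in S_i} u_w$. The cost of the example is then simply:
$$\sum_{w \in S_{i-1} \cup S_{i+1}} \phi(\mathbf{s_i}, v_w) \qquad (1)$$
where $\phi(v_1, v_2)$ is the softmax function.
We also experiment with a variant (+AE) in which the encoded (source) representation must predict its own words as target in addition to those of adjacent sentences. Thus in FastSent+AE, (1) becomes
$$\sum_{w \in S_{i-1} \cup S_i \cup S_{i+1}} \phi(\mathbf{s_i}, v_w) \qquad (2)$$
At test time the trained model (very quickly) encodes unseen word sequences into distributed representations with $\mathbf{s} = \sum_{w \in S} u_w$.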
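The FastSent objective can be illustrated with a small, framework-free sketch. It assumes that the per-word cost is the negative log of the softmax probability of each target word given the summed source embeddings, and that every token occurrence in the adjacent sentences counts as a target (a set-union reading would de-duplicate). The function names and the untrained, all-zero toy embeddings are illustrative only; this is not the Theano implementation.

```python
import math


def encode(sentence, source_emb):
    """Sentence representation: sum of the source embeddings of its words."""
    dim = len(next(iter(source_emb.values())))
    s = [0.0] * dim
    for w in sentence:
        if w in source_emb:
            s = [a + b for a, b in zip(s, source_emb[w])]
    return s


def log_softmax_scores(s, target_emb):
    """Log-probability of every vocabulary word under a softmax over dot products."""
    scores = {w: sum(a * b for a, b in zip(s, v)) for w, v in target_emb.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in scores.values()))
    return {w: x - log_z for w, x in scores.items()}


def fastsent_cost(prev_sent, sent, next_sent, source_emb, target_emb, autoencode=False):
    """Negative log-likelihood of the adjacent sentences' words (plus the
    sentence's own words when autoencode=True, as in the +AE variant)."""
    log_probs = log_softmax_scores(encode(sent, source_emb), target_emb)
    targets = list(prev_sent) + list(next_sent)
    if autoencode:
        targets += list(sent)
    return -sum(log_probs[w] for w in targets if w in log_probs)


vocab = ["horse", "rider", "field", "rain"]
source_emb = {w: [0.0, 0.0] for w in vocab}  # untrained toy embeddings
target_emb = {w: [0.0, 0.0] for w in vocab}
example = (["rain", "field"], ["horse", "rider"], ["rider", "field"])
cost = fastsent_cost(*example, source_emb, target_emb)
cost_ae = fastsent_cost(*example, source_emb, target_emb, autoencode=True)
```

With all-zero embeddings the softmax is uniform, so each target word contributes log |V| to the cost; training would lower this by moving source and target embeddings of co-occurring words together. A full softmax over a realistic vocabulary is expensive, and practical implementations typically use an approximation.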
Table 2: Example sentence pairs and relatedness ratings.

| Dataset | Category | Sentence 1 | Sentence 2 | Rating |
|---|---|---|---|---|
| STS 2014 | News | Mexico wishes to guarantee citizens’ safety. | Mexico wishes to avoid more violence. | 4 |
| STS 2014 | Forum | The problem is simpler than that. | The problem is simple. | 3.8 |
| STS 2014 | WordNet | A social set or clique of friends. | An unofficial association of people or groups. | 3.6 |
| STS 2014 | Twitter | Taking Aim #Stopgunviolence #Congress #NRA | Obama, Gun Policy and the N.R.A. | 1.6 |
| STS 2014 | Images | A woman riding a brown horse. | A young girl riding a brown horse. | 4.4 |
| STS 2014 | Headlines | Iranians Vote in Presidential Election. | Keita Wins Mali Presidential Election. | 0.4 |
| SICK (test+train) | | A lone biker is jumping in the air. | A man is jumping into a full pool. | 1.7 |
Unless stated above, all models were trained on the Toronto Books Corpus (http://www.cs.toronto.edu/~mbweb/), which has the inter-sentential coherence required for SkipThought and FastSent. The corpus consists of 70m ordered sentences from over 7,000 books.
Specifications of the models are shown in Table 1. The log-linear models (SkipGram, CBOW, ParagraphVec and FastSent) were trained for one epoch on one CPU core. The representation dimension for these models was found after tuning on the validation set (for ParagraphVec only a restricted range was possible due to the high memory footprint). All other models were trained on one GPU. The S(D)AE models were trained for one epoch (taking days). The SkipThought model was trained for two weeks, covering just under one epoch (downloaded from https://github.com/ryankiros/skip-thoughts). For CaptionRep and DictRep, performance was monitored on held-out training data and training was stopped after 24 hours after a plateau in cost. The NMT models were trained for 72 hours.
In previous work, distributed representations of language were evaluated either by measuring the effect of adding representations as features in some classification task - supervised evaluation [Collobert et al.2011, Mikolov et al.2013a, Kiros et al.2015] - or by comparing with human relatedness judgements - unsupervised evaluation [Hill et al.2015a, Baroni et al.2014b, Levy et al.2015]. The former setting reflects a scenario in which representations are used to inject general knowledge (sometimes considered as pre-training) into a supervised model. The latter pertains to applications in which the sentence representation space is used for direct comparisons, lookup or retrieval. Here, we apply and compare both evaluation paradigms.
Table 3: Performance of the models on the supervised evaluations, grouped according to the data required by their objective.

| Data | Model | MSRP (Acc / F1) | MR | CR | SUBJ | MPQA | TREC |
|---|---|---|---|---|---|---|---|
| Unordered Sentences (Toronto Books: 70m sents, 0.9B words) | SAE | 74.3 / 81.7 | 62.6 | 68.0 | 86.1 | 76.8 | 80.2 |
| | SAE+embs. | 70.6 / 77.9 | 73.2 | 75.3 | 89.8 | 86.2 | 80.4 |
| | SDAE | 76.4 / 83.4 | 67.6 | 74.0 | 89.3 | 81.3 | 77.6 |
| | SDAE+embs. | 73.7 / 80.7 | 74.6 | 78.0 | 90.8 | 86.9 | 78.4 |
| | ParagraphVec DBOW | 72.9 / 81.1 | 60.2 | 66.9 | 76.3 | 70.7 | 59.4 |
| | ParagraphVec DM | 73.6 / 81.9 | 61.5 | 68.6 | 76.4 | 78.1 | 55.8 |
| | Skipgram | 69.3 / 77.2 | 73.6 | 77.3 | 89.2 | 85.0 | 82.2 |
| | CBOW | 67.6 / 76.1 | 73.6 | 77.3 | 89.1 | 85.0 | 82.2 |
| | Unigram TFIDF | 73.6 / 81.7 | 73.7 | 79.2 | 90.3 | 82.4 | 85.0 |
| Ordered Sentences (Toronto Books) | SkipThought | 73.0 / 82.0 | 76.5 | 80.1 | 93.6 | 87.1 | 92.2 |
| | FastSent | 72.2 / 80.3 | 70.8 | 78.4 | 88.7 | 80.6 | 76.8 |
| | FastSent+AE | 71.2 / 79.1 | 71.8 | 76.7 | 88.8 | 81.5 | 80.4 |
| Other structured data resource | NMT En to Fr | 69.1 / 77.1 | 64.7 | 70.1 | 84.9 | 81.5 | 82.8 |
| | NMT En to De | 65.2 / 73.3 | 61.0 | 67.6 | 78.2 | 72.9 | 81.6 |
| | CaptionRep BOW | 73.6 / 81.9 | 61.9 | 69.3 | 77.4 | 70.8 | 72.2 |
| | CaptionRep RNN | 72.6 / 81.1 | 55.0 | 64.9 | 64.9 | 71.0 | 62.4 |
| | DictRep BOW | 73.7 / 81.6 | 71.3 | 75.6 | 86.6 | 82.5 | 73.8 |
| | DictRep BOW+embs. | 68.4 / 76.8 | 76.7 | 78.7 | 90.7 | 87.2 | 81.0 |
| | DictRep RNN | 73.2 / 81.6 | 67.8 | 72.7 | 81.4 | 82.5 | 75.8 |
| | DictRep RNN+embs. | 66.8 / 76.0 | 72.5 | 73.5 | 85.6 | 85.7 | 72.0 |
| | CPHRASE | 72.2 / 79.6 | 75.7 | 78.8 | 91.1 | 86.2 | 78.8 |
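Table 4 (partial): Performance of the models on the unsupervised evaluations (STS 2014 categories and SICK).

| Model | News | Forum | WordNet | Twitter | Images | Headlines | All | SICK (Test + Train) |
|---|---|---|---|---|---|---|---|---|
| NMT En to Fr | .35/.32 | .18/.18 | .47/.43 | .55/.53 | .44/.45 | .43/.43 | .43/.42 | .47/.49 |
| NMT En to De | | | | | | | | |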
Representations are applied to 6 sentence classification tasks: paraphrase identification (MSRP) [Dolan et al.2004], movie review sentiment (MR) [Pang and Lee2005], product reviews (CR) [Hu and Liu2004], subjectivity classification (SUBJ) [Pang and Lee2004], opinion polarity (MPQA) [Wiebe et al.2005] and question type classification (TREC) [Voorhees2002]. We follow the procedure (and code) of [Kiros et al.2015]: a logistic regression classifier is trained on top of sentence representations, with 10-fold cross-validation used when a train-test split is not pre-defined.
We also measure how well representation spaces reflect human intuitions of the semantic sentence relatedness, by computing the cosine distance between vectors for the two sentences in each test pair, and correlating these distances with gold-standard human judgements. The SICK dataset [Marelli et al.2014] consists of 10,000 pairs of sentences and relatedness judgements. The STS 2014 dataset [Agirre et al.2014] consists of 3,750 pairs and ratings from six linguistic domains. Example ratings are shown in Table 2. All available pairs are used for testing apart from the 500 SICK ‘trial’ pairs, which are held-out for tuning hyperparameters (representation size of log-linear models, and noise parameters in SDAE). The optimal settings on this task are then applied to both supervised and unsupervised evaluations.
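The unsupervised evaluation procedure described above can be sketched in a few lines. The text does not say here which correlation coefficient is used, so both Pearson and Spearman (rank) correlations are shown as an assumption; the function names, toy vectors and gold ratings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Toy sentence-vector pairs with hypothetical gold relatedness ratings.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [1.0, 1.0]), ([1.0, 0.0], [0.0, 1.0])]
gold = [5.0, 3.0, 1.0]
similarities = [cosine(u, v) for u, v in pairs]
rho = spearman(similarities, gold)
r = pearson(similarities, gold)
```

No parameters are fitted at this stage, which is why this setting directly probes the geometry of the representation space.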
Performance of the models on the supervised evaluations (grouped according to the data required by their objective) is shown in Table 3. Overall, SkipThought vectors perform best on three of the six evaluations, the BOW DictRep model with pre-trained word embeddings performs best on two, and the SDAE on one. SDAEs perform notably well on the paraphrasing task, going beyond SkipThought by three percentage points and approaching state-of-the-art performance of models designed specifically for the task [Ji and Eisenstein2013]. SDAE is also consistently better than SAE, which aligns with other findings that adding noise to AEs produces richer representations [Vincent et al.2008].
Results on the unsupervised evaluations are shown in Table 4. The same DictRep model performs best on four of the six STS categories (and overall) and is joint-top performer on SICK. Of the models trained on raw text, simply adding CBOW word vectors works best on STS. The best performing raw text model on SICK is FastSent, which achieves almost identical performance to C-PHRASE’s state-of-the-art performance for a distributed model [Pham et al.2015]. Further, it uses less than a third of the training text and does not require access to (supervised) syntactic representations for training. Together, the results of FastSent on the unsupervised evaluations and SkipThought on the supervised benchmarks provide strong support for the sentence-level distributional hypothesis: the context in which a sentence occurs provides valuable information about its semantics.
Across both unsupervised and supervised evaluations, the BOW DictRep with pre-trained word embeddings exhibits by some margin the most consistent performance. This robust performance suggests that DictRep representations may be particularly valuable when the ultimate application is non-specific or unknown, and confirms that dictionary definitions (where available) can be a powerful resource for representation learning.
Different objectives yield different representations It may seem obvious, but the results confirm that different learning methods are preferable for different intended applications (and this variation appears greater than for word representations). For instance, it is perhaps unsurprising that SkipThought performs best on TREC because the labels in this dataset are determined by the language immediately following the represented question (i.e. the answer) [Voorhees2002]. Paraphrase detection, on the other hand, may be better served by a model that focused entirely on the content within a sentence, such as SDAEs. Similar variation can be observed in the unsupervised evaluations. For instance, the (multimodal) representations produced by the CaptionRep model do not perform particularly well apart from on the Image category of STS where they beat all other models, demonstrating a clear effect of the well-studied modality differences in representation learning [Bruni et al.2014].
The nearest neighbours in Table 5 give a more concrete sense of the representation spaces. One notable difference is between (AE-style) models whose semantics come from within-sentence relationships (CBOW, SDAE, DictRep, ParagraphVec) and SkipThought/FastSent, which exploit the context around sentences. In the former case, nearby sentences generally have a high proportion of words in common, whereas for the latter it is the general concepts and/or function of the sentence that is similar, and word overlap is often minimal. Indeed, this may be a more important trait of FastSent than the marginal improvement on the SICK task. Readers can compare the CBOW and FastSent spaces at http://126.96.36.199/.
Differences between supervised and unsupervised performance Many of the best performing models on the supervised evaluations do not perform well in the unsupervised setting. In the SkipThought, S(D)AE and NMT models, the cost is computed based on a non-linear decoding of the internal sentence representations, so, as also observed by [Almahairi et al.2015], the informative geometry of the representation space may not be reflected in a simple cosine distance. The log-linear models generally perform better in this unsupervised setting.
Differences in resource requirements As shown in Table 1, different models require different resources to train and use. This can limit their possible applications. For instance, while it was easy to make an online demo for fast querying of near neighbours in the CBOW and FastSent spaces, it was not practical for other models owing to memory footprint, encoding time and representation dimension.
Table 5: Sentence nearest neighbours in the representation spaces of different models.

| Query | If he had a weapon, he could maybe take out their last imp, and then beat up Errol and Vanessa. | An annoying buzz started to ring in my ears, becoming louder and louder as my vision began to swim. |
|---|---|---|
| CBOW | Then Rob and I would duke it out, and every once in a while, he would actually beat me. | Louder. |
| SkipThought | If he could ram them from behind, send them sailing over the far side of the levee, he had a chance of stopping them. | A weighty pressure landed on my lungs and my vision blurred at the edges, threatening my consciousness altogether. |
| FastSent | Isak’s close enough to pick off any one of them, maybe all of them, if he had his rifle and a mind to. | The noise grew louder, the quaking increased as the sidewalk beneath my feet began to tremble even more. |
| SDAE | He’d even killed some of the most dangerous criminals in the galaxy, but none of those men had gotten to him like Vitktis. | I smile because I’m familiar with the knock, pausing to take a deep breath before dashing down the stairs. |
| DictRep (FF+embs.) | Kevin put a gun to the man’s head, but even though he cried, he couldn’t tell Kevin anything more. | Then gradually I began to hear a ringing in my ears. |
| Paragraph Vector (DM) | I take a deep breath and open the doors. | They listened as the motorcycle-like roar of an engine got louder and louder then stopped. |
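Table 6: Internal consistency (Cronbach’s $\alpha$) within the supervised and unsupervised cohorts of evaluations.

Supervised cohort:

| MSRP | MR | CR | SUBJ | MPQA | TREC |
|---|---|---|---|---|---|
| 0.94 (6) | 0.85 (1) | 0.86 (4) | 0.85 (1) | 0.86 (3) | 0.89 (5) |

Unsupervised cohort:

| News | Forum | WordNet | Twitter | Images | Headlines | All STS | SICK |
|---|---|---|---|---|---|---|---|
| 0.92 (4) | 0.92 (3) | 0.92 (4) | 0.93 (6) | 0.95 (8) | 0.92 (2) | 0.91 (1) | 0.93 (7) |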
The role of word order is unclear
The average scores of models that are sensitive to word order (76.3) and of those that are not (76.6) are approximately the same across supervised evaluations. Across the unsupervised evaluations, however, BOW models score 0.55 on average compared with 0.42 for RNN-based (order sensitive) models. This seems at odds with the widely held view that word order plays an important role in determining the meaning of English sentences. One possibility is that order-critical sentences that cannot be disambiguated by a robust conceptual semantics (that could be encoded in distributed lexical representations) are in fact relatively rare. However, it is also plausible that currently available evaluations do not adequately reflect order-dependent aspects of meaning (see below). This latter conjecture is supported by the comparatively strong performance of TFIDF BOW vectors, in which the effective lexical semantics are limited to simple relative frequencies.
The evaluations have limitations The internal consistency (Cronbach’s $\alpha$) of all evaluations considered together is just above the ‘acceptable’ level (see wikipedia.org/wiki/Cronbach’s_alpha). Table 6 shows that consistency is far higher (‘excellent’) when considering the supervised or unsupervised tasks as independent cohorts. This indicates that, with respect to common characteristics of sentence representations, the supervised and unsupervised benchmarks do indeed prioritise different properties. It is also interesting that, by this metric, the properties measured by MSRP and image-caption relatedness are the furthest removed from other evaluations in their respective cohorts.
While these consistency scores are a promising sign, they could also be symptomatic of a set of evaluations that are all limited in the same way. The inter-rater agreement is only reported for one of the 8 evaluations considered (MPQA, [Wiebe et al.2005]), and for MR, SUBJ and TREC, each item is only rated by one or two annotators to maximise coverage. Table 2 illustrates why this may be an issue for the unsupervised evaluations; the notion of sentential ’relatedness’ seems very subjective. It should be emphasised, however, that the tasks considered in this study are all frequently used for evaluation, and, to our knowledge, there are no existing benchmarks that overcome these limitations.
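For readers unfamiliar with the internal-consistency statistic discussed above, the following is a minimal sketch of the standard Cronbach's alpha formula, $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_{total}^2}\right)$ for $k$ items. Treating each evaluation as an item and each model as an observation is an assumption about how the statistic was applied here, and the score matrix is a toy example rather than data from this study.

```python
def variance(xs):
    """Unbiased sample variance."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def cronbach_alpha(scores):
    """scores[m][e] is the score of model m on evaluation e.

    Assumption: evaluations are the 'items' and models the 'observations'.
    """
    k = len(scores[0])
    item_vars = [variance([row[e] for row in scores]) for e in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)


# Toy case: three evaluations that rank three models identically.
toy_scores = [
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
    [3.0, 3.0, 3.0],
]
alpha = cronbach_alpha(toy_scores)
```

When all evaluations order the models in exactly the same way, as in the toy matrix, the statistic reaches its maximum of 1; disagreement between evaluations lowers it.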
Advances in deep learning algorithms, software and hardware mean that many architectures and objectives for learning distributed sentence representations from unlabelled data are now available to NLP researchers. We have presented the first (to our knowledge) systematic comparison of these methods. We showed notable variation in the performance of approaches across a range of evaluations. Among other conclusions, we found that the optimal approach depends critically on whether representations will be applied in supervised or unsupervised settings - in the latter case, fast, shallow BOW models can still achieve the best performance. Further, we proposed two new objectives, FastSent and Sequential Denoising Autoencoders, which perform particularly well on specific tasks (SICK sentence relatedness and MSRP respectively). We make all code for training and evaluating these new models publicly available, together with pre-trained models and an online demo of the FastSent sentence space. If the application is unknown, however, the best all round choice may be DictRep: learning a mapping of pre-trained word embeddings from the word-phrase signal in dictionary definitions. While we have focused on models using naturally-occurring training data, in future work we will also consider supervised architectures (including convolutional, recursive and character-level models), potentially training them on multiple supervised tasks as an alternative way to induce the ‘general knowledge’ needed to give language technology the elusive human touch.
This work was supported by a Google Faculty Award to AK and FH and a Google European Doctoral Fellowship to FH. Thanks also to Marek Rei, Tamara Polajnar, Laural Rimell, Jamie Ryan Kiros and Piotr Bojanowski for helpful discussion and comments.
[Collobert et al.2011] The Journal of Machine Learning Research, 12:2493–2537.
[Hill et al.2015b] Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.
[Kalchbrenner et al.2014] A convolutional neural network for modelling sentences. In Proceedings of EMNLP.
[Milajevs et al.2014] Evaluating neural word representations in tensor-based compositional settings. In Proceedings of EMNLP.
[Pang and Lee2004] A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics, page 271. Association for Computational Linguistics.
[Russakovsky et al.2014] International Journal of Computer Vision, pages 1–42.