Composition of Sentence Embeddings: Lessons from Statistical Relational Learning

04/04/2019 · Damien Sileo et al.

Various NLP problems -- such as the prediction of sentence similarity, entailment, and discourse relations -- are all instances of the same general task: the modeling of semantic relations between a pair of textual elements. A popular model for such problems is to embed sentences into fixed size vectors, and use composition functions (e.g. concatenation or sum) of those vectors as features for the prediction. At the same time, composition of embeddings has been a main focus within the field of Statistical Relational Learning (SRL) whose goal is to predict relations between entities (typically from knowledge base triples). In this article, we show that previous work on relation prediction between texts implicitly uses compositions from baseline SRL models. We show that such compositions are not expressive enough for several tasks (e.g. natural language inference). We build on recent SRL models to address textual relational problems, showing that they are more expressive, and can alleviate issues from simpler compositions. The resulting models significantly improve the state of the art in both transferable sentence representation learning and relation prediction.


1 Introduction

Predicting relations between textual units is a widespread task, essential for discourse analysis, dialog systems, information retrieval, or paraphrase detection. Since relation prediction often requires a form of understanding, it can also be used as a proxy to learn transferable sentence representations. Several tasks that are useful to build sentence representations are derived directly from text structure, without human annotation: sentence order prediction (Logeswaran et al., 2016; Jernite et al., 2017), the prediction of previous and subsequent sentences (Kiros et al., 2015; Jernite et al., 2017), or the prediction of explicit discourse markers between sentence pairs (Nie et al., 2017; Jernite et al., 2017). Human labeled relations between sentences can also be used for that purpose, e.g. inferential relations (Conneau et al., 2017). While most work on sentence similarity estimation, entailment detection, answer selection, or discourse relation prediction seemingly uses task-specific models, they all involve predicting whether a relation $R$ holds between two sentences $s_1$ and $s_2$. This genericity has been noticed in the literature before (Baudiš et al., 2016) and it has been leveraged for the evaluation of sentence embeddings within the SentEval framework (Conneau et al., 2017).

A straightforward way to predict the probability of $R(s_1, s_2)$ being true is to represent $s_1$ and $s_2$ with $d$-dimensional embeddings $h_1$ and $h_2$, and to compute sentence pair features $f(h_1, h_2)$, where $f$ is a composition function (e.g. concatenation, product, …). A softmax classifier $g$ can then learn to predict $R$ from those features; $g(f(h_1, h_2))$ can be seen as a reasoning based on the content of $h_1$ and $h_2$ (Socher et al., 2013).
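As a concrete illustration, here is a minimal sketch of this pipeline (not the authors' code; PyTorch and toy dimensions are assumed, and `compose` stands in for an arbitrary $f$):

```python
import torch
import torch.nn as nn

d, n_relations = 8, 3            # toy embedding size and number of relation labels

def compose(h1, h2):
    # one possible instantiation of f: concatenation (equation 1 below)
    return torch.cat([h1, h2], dim=-1)

h1 = torch.randn(d)              # embedding of sentence s1 (would come from an encoder)
h2 = torch.randn(d)              # embedding of sentence s2

classifier = nn.Linear(2 * d, n_relations)   # softmax regression g over f(h1, h2)
logits = classifier(compose(h1, h2))
probs = torch.softmax(logits, dim=-1)        # P(R | s1, s2) for each relation R
print(probs)
```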

Our contributions are as follows:

  • we review composition functions used in textual relational learning and show that they lack expressiveness (section 2);

  • we draw analogies with existing SRL models (section 3) and design new compositions inspired by SRL (section 4);

  • we perform extensive experiments to test composition functions and show that some of them can improve the learning of representations and their downstream uses (section 6).

2 Composition functions for relation prediction

We review here popular composition functions used for relation prediction based on sentence embeddings. Ideally, they should simultaneously fulfill the following minimal requirements:

  • make use of interactions between the representations of the sentences to be related;

  • allow for the learning of asymmetric relations (e.g. entailment, order);

  • be usable with high dimensionalities (the composed features and the relation-specific parameters should fit in GPU memory).

Additionally, if the main goal is transferable sentence representation learning, compositions should also incentivize gradually changing sentences to lie on a linear manifold, since transfer usually uses linear models. Another goal can be the learning of transferable relation representations. Concretely, a sentence encoder and a composition $f$ can be trained on a base task, and the composed features $f(h_1, h_2)$ can be used as features for transfer in another task. In that case, the geometry of the sentence embedding space is less relevant, as long as the composed feature space works well for transfer learning. Our evaluation will cover both cases.

A straightforward instantiation of $f$ is concatenation (Hooda & Kosseim, 2017):

$f_{\text{cat}}(h_1, h_2) = [h_1; h_2]$    (1)

However, interactions between $h_1$ and $h_2$ cannot be modeled with $f_{\text{cat}}$ followed by a softmax regression. Indeed, the resulting logits can be rewritten as a sum of independent contributions from $h_1$ and $h_2$, namely $W_1 h_1 + W_2 h_2$. Using a multi-layer perceptron before the softmax would solve this issue, but it also harms sentence representation learning (Conneau et al., 2017; Logeswaran & Lee, 2018), possibly because the perceptron allows for accurate predictions even if the sentence embeddings lie in a convoluted space. To promote interactions between $h_1$ and $h_2$, the element-wise product has been used in Baudiš et al. (2016):

$f_\odot(h_1, h_2) = h_1 \odot h_2$    (2)

Absolute difference is another solution for sentence similarity (Mueller & Thyagarajan, 2016), and its element-wise variation may equally be used to compute informative features:

$f_{-}(h_1, h_2) = |h_1 - h_2|$    (3)

The latter two were combined into a popular instantiation, sometimes referred to as heuristic matching (Tai et al., 2015; Kiros et al., 2015; Mou et al., 2015):

$f_{\text{heur}}(h_1, h_2) = [\,h_1 \odot h_2;\ |h_1 - h_2|\,]$    (4)

Although effective for certain similarity tasks, $f_{\text{heur}}$ is symmetrical, and should be a poor choice for tasks like entailment prediction or prediction of discourse relations. For instance, if $R$ denotes entailment and $(s_1, s_2)$ = (“It just rained”, “The ground is wet”), $R(s_1, s_2)$ should hold but not $R(s_2, s_1)$. The composition function $f_{\text{heur}}$ is nonetheless used to train/evaluate models on entailment (Conneau et al., 2017) or discourse relation prediction (Nie et al., 2017).

Sometimes $[h_1; h_2]$ is concatenated to $f_{\text{heur}}(h_1, h_2)$ (Ampomah et al., 2016; Conneau et al., 2017). While the resulting composition is asymmetrical, the asymmetrical component involves no interaction, as noted previously. We note that this composition is very commonly used: on the SNLI benchmark (nlp.stanford.edu/projects/snli/, as of February 2019), most of the listed sentence embedding based models use it, and the others use a weaker form (e.g. omitting $h_1 \odot h_2$).

The outer product $\otimes$ has been used instead for asymmetric multiplicative interaction (Jernite et al., 2017):

$f_\otimes(h_1, h_2) = h_1 \otimes h_2$, where $(h_1 \otimes h_2)_{i,j} = h_{1i}\, h_{2j}$    (5)

This formulation is expressive but it forces the classifier to have $d^2$ parameters per relation, which is prohibitive when there are many relations and $d$ is high.

The problems outlined above are well known in SRL. Thus, existing compositions (except $f_\otimes$) can only model relations superficially for tasks currently used to train state of the art sentence encoders, like NLI or discourse connectives prediction.
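For reference, the compositions from equations 1-5 can be sketched as follows (PyTorch assumed; function names are ours). The last lines illustrate the symmetry of $f_{\text{heur}}$ and the $d^2$ feature size of the outer product:

```python
import torch

def f_cat(h1, h2):                      # equation 1: concatenation
    return torch.cat([h1, h2], -1)

def f_prod(h1, h2):                     # equation 2: element-wise product
    return h1 * h2

def f_diff(h1, h2):                     # equation 3: element-wise absolute difference
    return (h1 - h2).abs()

def f_heur(h1, h2):                     # equation 4: heuristic matching
    return torch.cat([f_prod(h1, h2), f_diff(h1, h2)], -1)

def f_outer(h1, h2):                    # equation 5: outer product, d^2 features
    return torch.outer(h1, h2).reshape(-1)

h1, h2 = torch.randn(4), torch.randn(4)
print(torch.allclose(f_heur(h1, h2), f_heur(h2, h1)))   # True: f_heur is symmetric
print(f_outer(h1, h2).shape)                            # torch.Size([16]) = d^2
```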

3 Statistical Relational Learning models

Model | Scoring function | Parameters
Unstructured | $-\|e_1 - e_2\|_p$ | -
TransE | $-\|e_1 + w_r - e_2\|_p$ | $w_r \in \mathbb{R}^d$
RESCAL | $e_1^\top W_r\, e_2$ | $W_r \in \mathbb{R}^{d \times d}$
DistMult | $\langle e_1, w_r, e_2 \rangle$ | $w_r \in \mathbb{R}^d$
ComplEx | $\mathrm{Re}(\langle e_1, w_r, \bar{e}_2 \rangle)$ | $w_r \in \mathbb{C}^d$
Table 1: Selected relational learning models. Unstructured is from Bordes et al. (2013a), TransE from Bordes et al. (2013b), RESCAL from Nickel et al. (2011), DistMult from Yang et al. (2015) and ComplEx from Trouillon et al. (2016). Following the latter, $\langle a, b, c \rangle$ denotes $\sum_i a_i b_i c_i$, $\mathrm{Re}(x)$ is the real part of $x$, $\bar{e}$ is the complex conjugate of $e$, and $p$ is commonly set to 1 or 2.

In this section we introduce the context of statistical relational learning (SRL) and relevant models. Recently, SRL has focused on efficient and expressive relation prediction based on embeddings. A core goal of SRL (Getoor & Taskar, 2007) is to induce whether a relation $r$ holds between two arbitrary entities $e_1, e_2$. As an example, we would like to assign a score to the triple (Paris, located_in, France) that reflects a high probability. In embedding-based SRL models, entities have vector representations in $\mathbb{R}^d$ and a scoring function reflects truth values of relations. The scoring function should allow for relation-dependent reasoning over the latent space of entities. Scoring functions can have relation-specific parameters, which can be interpreted as relation embeddings. Table 1 presents an overview of a number of state of the art relational models. We can distinguish two families of models: subtractive and multiplicative.

The TransE scoring function is motivated by the idea that translations in latent space can model analogical reasoning and hierarchical relationships. Dense word embeddings trained on tasks related to the distributional hypothesis naturally allow for analogical reasoning with translations without explicit supervision (Mikolov et al., 2013). TransE generalizes the older Unstructured model. We call them subtractive models.

The RESCAL, DistMult, and ComplEx scoring functions can be seen as a dot product matching between $e_1$ and a relation-specific linear transformation of $e_2$ (Liu et al., 2017). This transformation helps checking whether $e_1$ matches with some aspects of $e_2$. RESCAL allows a full linear mapping $W_r e_2$ but has a high complexity, while DistMult is restricted to a component-wise weighting $w_r \odot e_2$. ComplEx has fewer parameters than RESCAL but still allows for the modeling of asymmetrical relations. As shown in Liu et al. (2017), ComplEx boils down to a restriction of RESCAL where $W_r$ is a block diagonal matrix; these blocks are 2-dimensional, antisymmetric, and have equal diagonal terms. Using such a form, even and odd indexes of an entity embedding's dimensions play the roles of real and imaginary parts respectively. The ComplEx model (Trouillon et al., 2016) and its variations (Lacroix et al., 2018) yield state of the art performance on knowledge base completion in numerous evaluations.
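The scoring functions of Table 1 can be written in a few lines (a sketch, assuming PyTorch, an L1 norm for the subtractive models, and random toy embeddings):

```python
import torch

def unstructured(e1, e2):
    return -(e1 - e2).abs().sum()            # -||e1 - e2||_1, no relation parameters

def transe(e1, w_r, e2):
    return -(e1 + w_r - e2).abs().sum()      # -||e1 + w_r - e2||_1, w_r in R^d

def rescal(e1, W_r, e2):
    return e1 @ W_r @ e2                     # e1^T W_r e2, W_r in R^{d x d}

def distmult(e1, w_r, e2):
    return (e1 * w_r * e2).sum()             # <e1, w_r, e2>

def complex_score(e1, w_r, e2):
    # entities and relation are complex vectors: Re(<w_r, e1, conj(e2)>)
    return (w_r * e1 * e2.conj()).sum().real

d = 4
e1, e2 = torch.randn(d), torch.randn(d)
w_r, W_r = torch.randn(d), torch.randn(d, d)
c1, c2, cw = (torch.randn(d, dtype=torch.cfloat) for _ in range(3))
print(unstructured(e1, e2), transe(e1, w_r, e2), rescal(e1, W_r, e2),
      distmult(e1, w_r, e2), complex_score(c1, cw, c2))
```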

4 Embeddings composition as SRL models

We claim that several existing models (Conneau et al., 2017; Nie et al., 2017; Baudiš et al., 2016) boil down to SRL models where the sentence embeddings ($h_1, h_2$) act as entity embeddings ($e_1, e_2$). This framework is depicted in figure 1. In this article we focus on sentence embeddings, although our framework can straightforwardly be applied to other levels of language granularity (such as words, clauses, or documents).

Some models (Chen et al., 2017b; Seo et al., 2016; Gong et al., 2018; Radford, 2018; Devlin et al., 2018) do not rely on explicit sentence encodings to perform relation prediction. They combine information of input sentences at earlier stages, using conditional encoding or cross-attention. There is however no straightforward way to derive transferable sentence representations in this setting, and so these models are out of the scope of this paper. They sometimes make use of composition functions, so our work could still be relevant to them in some respect.

In this section we will make a link between sentence composition functions and SRL scoring functions, and propose new scoring functions drawing inspiration from SRL.

Figure 1: Implicit SRL model in text relation prediction.

Figure 2: Possible scoring function values according to different composition functions: (a) Unstructured, (b) TransE, (c) DistMult, (d) ComplEx. The embedding $h_1$ and the relation parameters are fixed, and color brightness reflects the likelihood of the relation for each position of embedding $h_2$. (b) and (d) are respectively more expressive than (a) and (c).

4.1 Linking composition functions and SRL models

The composition function $f_\odot$ from equation 2 followed by a softmax regression yields a score whose analytical form is identical to the DistMult model score described in section 3. Let $w_R$ denote the softmax weights for relation $R$. The logit score for the truth of $R(s_1, s_2)$ is $\langle w_R, h_1 \odot h_2 \rangle = \langle h_1, w_R, h_2 \rangle$, which is equal to the DistMult scoring function if $h_1, h_2$ act as entity embeddings and $w_R$ as the relation weight $w_r$.

Similarly, the composition $f_{-}$ from equation 3 followed by a softmax regression can be seen as an element-wise weighted score of Unstructured (both are equal if the softmax weights are all unitary).

Thus, $f_{\text{heur}}$ from equation 4 (with softmax regression) can be seen as a weighted ensemble of Unstructured and DistMult. These two models are respectively outperformed by TransE and ComplEx on knowledge base link prediction by a large margin (Trouillon et al., 2016; Bordes et al., 2013a). We therefore propose to change the Unstructured and DistMult components of $f_{\text{heur}}$ so that they match their respective state of the art variations, as described in the following sections. We will also show the implications of these refinements.
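A quick numerical check of the correspondence between $f_\odot$ with a linear softmax layer and the DistMult score (a sketch, PyTorch assumed):

```python
import torch

d = 6
h1, h2, w_R = torch.randn(d), torch.randn(d), torch.randn(d)

# f_prod followed by a linear (softmax) layer: logit = <w_R, h1 * h2>
logit_from_composition = (w_R * (h1 * h2)).sum()
# DistMult score with h1, h2 as entity embeddings and w_R as the relation weight
logit_distmult = (h1 * w_R * h2).sum()
print(torch.allclose(logit_from_composition, logit_distmult))   # True

# likewise, weighting |h1 - h2| element-wise generalizes the (negated) Unstructured score
```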

4.2 Casting TransE as a composition

Simply replacing $f_{-}$ with

$f_t(h_1, h_2) = h_1 - h_2 + t$    (6)

would make the model analogous to TransE. The translation $t$ is learned and is shared by all relations. A relation-specific translation could be used, but it would make the composition relation-specific. Instead, here, each dimension of $f_t(h_1, h_2)$ can be weighted according to a given relation through the softmax weights. A non-zero $t$ makes $f_t$ asymmetrical and also yields features that allow for the checking of an analogy between $h_1$ and $h_2$. Sentence embeddings often rely on pre-trained word embeddings, which have demonstrated strong capabilities for analogical reasoning. Some analogies, such as part-whole, are computable with off-the-shelf word embeddings (Chen et al., 2017a) and should be very informative for natural language inference tasks. As an illustration, let us consider an artificial semantic space (depicted in figure 2) where we posit that there is a “to the past” translation $t_{\text{past}}$ so that $h + t_{\text{past}}$ is the embedding of a sentence changed to the past tense. Unstructured is not able to leverage this semantic space to correctly score the “changed to the past” relation, while TransE is well tailored to provide the highest scores for sentences near $h_1 + \hat{t}$, where $\hat{t}$ is an estimation of $t_{\text{past}}$ that could be learned from examples.
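The following toy sketch illustrates this idea in a hypothetical 2-dimensional semantic space (the "to the past" direction and all values are made up for illustration; PyTorch assumed):

```python
import torch

def f_t(h1, h2, t):
    # equation 6: subtractive composition with a learned translation t (shared by relations)
    return h1 - h2 + t

# hypothetical 2-d semantic space with a "to the past" direction
t_past = torch.tensor([1.0, 0.0])
h_rains = torch.tensor([0.2, 0.5])        # embedding of "It rains"
h_rained = h_rains + t_past               # embedding of "It rained" under our assumption

t_learned = t_past                        # what training could converge to
print(f_t(h_rains, h_rained, t_learned))  # tensor([0., 0.]): the pair fits the relation
print(f_t(h_rained, h_rains, t_learned))  # tensor([2., 0.]): the reversed pair does not
# a softmax layer can weight these residual features per relation, unlike |h1 - h2|,
# which is identical for both orders of the pair
```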

4.3 Casting ComplEx as a composition

Let us partition the dimensions of $h$ into two equally sized sets $R$ and $I$, e.g. even and odd dimension indices, and write $h^R$ and $h^I$ for the corresponding halves of $h$. We propose a new function $f_{\mathbb{C}}$ as a way to fit the ComplEx scoring function into a composition function:

$f_{\mathbb{C}}(h_1, h_2) = [\,h_1^R \odot h_2^R + h_1^I \odot h_2^I;\ h_1^R \odot h_2^I - h_1^I \odot h_2^R\,]$    (7)

$f_{\mathbb{C}}(h_1, h_2)$ multiplied by softmax weights is equivalent to the ComplEx scoring function: the first half of the weights corresponds to the real part of the ComplEx relation weights, while the last half corresponds to the imaginary part.

$f_{\mathbb{C}}$ is to the ComplEx scoring function what $f_\odot$ is to the DistMult scoring function. Intuitively, ComplEx is a minimal way to model interactions between distinct latent dimensions, while DistMult only allows identical dimensions to interact.

Let us consider a new artificial semantic space (shown in figure 2) where the first dimension is high when a sentence means that it just rained, and the second dimension is high when the ground is wet. Over this semantic space, DistMult is only able to detect entailment for paraphrases, whereas ComplEx is also able to naturally model that (“it just rained”, entails, “the ground is wet”) should score high while its converse should not.

We also propose two more general versions of $f_{\mathbb{C}}$:

$f_{\mathbb{C}'}(h_1, h_2) = [\,h_1 \odot h_2;\ h_1^R \odot h_2^I - h_1^I \odot h_2^R\,]$    (8)

$f_{\mathbb{C}''}(h_1, h_2) = [\,h_1^R \odot h_2^R;\ h_1^I \odot h_2^I;\ h_1^R \odot h_2^I;\ h_1^I \odot h_2^R\,]$    (9)

$f_{\mathbb{C}'}$ can be seen as DistMult concatenated with the asymmetrical part of ComplEx, and $f_{\mathbb{C}''}$ can be seen as RESCAL with unconstrained block diagonal relation matrices.
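A sketch of equations 7-9, together with a check that a linear layer on top of $f_{\mathbb{C}}$ reproduces the ComplEx score (PyTorch assumed; the even/odd split follows the partition described above):

```python
import torch

def split(h):
    # partition dimensions into "real" (R) and "imaginary" (I) halves: even/odd indices
    return h[0::2], h[1::2]

def f_complex(h1, h2):
    # equation 7: features whose weighted sum reproduces the ComplEx score
    ar, ai = split(h1); br, bi = split(h2)
    return torch.cat([ar * br + ai * bi, ar * bi - ai * br])

def f_complex_prime(h1, h2):
    # equation 8: DistMult features plus the asymmetric ComplEx part
    ar, ai = split(h1); br, bi = split(h2)
    return torch.cat([h1 * h2, ar * bi - ai * br])

def f_complex_double_prime(h1, h2):
    # equation 9: all four per-pair products (RESCAL with unconstrained 2x2 blocks)
    ar, ai = split(h1); br, bi = split(h2)
    return torch.cat([ar * br, ai * bi, ar * bi, ai * br])

d = 8
h1, h2, w = torch.randn(d), torch.randn(d), torch.randn(d)
w_r, w_i = split(w)
score_from_features = torch.cat([w_r, w_i]).dot(f_complex(h1, h2))
a, b, w_c = torch.complex(*split(h1)), torch.complex(*split(h2)), torch.complex(w_r, w_i)
score_complex = (w_c * a * b.conj()).sum().real      # Re(<w, h1, conj(h2)>)
print(torch.allclose(score_from_features, score_complex))   # True
```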

5 On the evaluation of relational models

The SentEval framework (Conneau et al., 2017) provides a general evaluation for transferable sentence representations, with open source evaluation code. One only needs to specify a sentence encoder function, and the framework performs classification tasks or relation prediction tasks using cross-validated logistic regression on embeddings or composed sentence embeddings. Tasks include sentiment analysis, entailment, textual similarity, textual relatedness, and paraphrase detection. These tasks are a rich way to train or evaluate sentence representations, since in a triple $(s_1, R, s_2)$ we can see $R$ as a label for the pair $(s_1, s_2)$ (Baudiš et al., 2016). Unfortunately, the relational tasks hard-code the composition function $f_{\text{heur}}$ from equation 4. From our previous analysis, we believe this composition function favors the use of contextual/lexical similarity rather than high-level reasoning, and can penalize representations based on more semantic aspects. This bias could harm research, since semantic representation is an important next step for sentence embedding. Training/evaluation datasets are also arguably flawed with respect to relational aspects, since several recent studies (Dasgupta et al., 2018; Poliak et al., 2018; Gururangan et al., 2018; Glockner et al., 2018) show that InferSent, despite being state of the art on SentEval evaluation tasks, has poor performance when dealing with asymmetrical tasks and non-additive composition of words. In addition to providing new ways of training sentence encoders, we will also extend the SentEval evaluation framework with a more expressive composition function for relational transfer tasks, which improves results even when the sentence encoder was not trained with it.
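To illustrate what such an extension looks like, here is a self-contained sketch of evaluating a relational task with two different compositions via cross-validated logistic regression (scikit-learn assumed; the embeddings and labels are random stand-ins, not SentEval data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 200, 16
h1, h2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))   # stand-ins for sentence embeddings
y = rng.integers(0, 2, size=n)                               # stand-in relation labels

def heur(a, b):                       # SentEval's hard-coded composition (equation 4)
    return np.concatenate([a * b, np.abs(a - b)], axis=1)

def complex_like(a, b):               # alternative composition in the spirit of equation 9
    ar, ai, br, bi = a[:, 0::2], a[:, 1::2], b[:, 0::2], b[:, 1::2]
    return np.concatenate([ar * br, ai * bi, ar * bi, ai * br], axis=1)

for f in (heur, complex_like):
    acc = cross_val_score(LogisticRegression(max_iter=1000), f(h1, h2), y, cv=5).mean()
    print(f.__name__, round(acc, 3))
```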

6 Experiments

Our goal is to show that transferable sentence representation learning and relation prediction tasks can be improved when our expressive compositions are used instead of the composition $f_{\text{heur}}$ from equation 4. We train our relational model adaptations on two relation prediction base tasks, one supervised ($T$ = NLI) and one unsupervised ($T$ = Disc), described below, and evaluate sentence/relation representations on base and transfer tasks using the SentEval framework in order to quantify the generalization capabilities of our models. Since we use minor modifications of InferSent and SentEval, our experiments are easily reproducible.

name | N | task | C | representation(s) used
MR | 11k | sentiment (movies) | 2 | $h$
SUBJ | 10k | subjectivity/objectivity | 2 | $h$
MPQA | 11k | opinion polarity | 2 | $h$
TREC | 6k | question-type | 6 | $h$
SICK | 10k | NLI | 3 | $f(h_1, h_2)$
MRPC | 4k | paraphrase detection | 2 | $f(h_1, h_2)$
PDTB | 17k | discursive relation | 5 | $f(h_1, h_2)$
STS14 | 4.5k | similarity | - | $\cos(h_1, h_2)$
Table 2: Transfer evaluation tasks. N = number of training examples; C = number of classes if applicable. $h, h_1, h_2$ are sentence representations, $f$ a composition function from section 4.

6.1 Training tasks

Natural language inference ($T$ = NLI) aims to predict whether the relation between two sentences (premise and hypothesis) is Entailment, Contradiction or Neutral. We use the combination of the SNLI dataset (Bowman et al., 2015) and the MNLI dataset (Williams et al., 2017), and call the resulting dataset AllNLI. Conneau et al. (2017) claim that NLI data allows universal sentence representation learning. They used the composition function $f_{\text{heur}}$ concatenated with the sentence representations $[h_1; h_2]$ in order to train their InferSent model.

We also train on the prediction of discourse connectives between sentences/clauses ($T$ = Disc). Discourse connectives make discourse relations between sentences explicit. In the sentence I live in Paris but I'm often elsewhere, the word but highlights that there is a contrast between the two clauses it connects. We use the dataset of Malmi et al. (2017), which consists of selected instances with discourse connectives (e.g. however, for example), with the provided train/dev/test split. This dataset has no supervision other than the list of 20 connectives. Nie et al. (2017) used $f_{\text{heur}}$ concatenated with the sum of the sentence representations to train their model, DisSent, on a similar task, and showed that their encoder was general enough to perform well on SentEval tasks. They use a dataset that is, at the time of writing, not publicly available.

6.2 Evaluation tasks

Table 2 provides an overview of the different transfer tasks that will be used for evaluation. We added another relation prediction task to SentEval: the PDTB coarse-grained implicit discourse relation task. This task involves predicting a discursive link between two sentences among {Comparison, Contingency, Entity based coherence, Expansion, Temporal}. We followed the setup of Pitler et al. (2009), without sampling negative examples in training. MRPC, PDTB and SICK will be tested with two composition functions: besides the SentEval composition $f_{\text{heur}}$, we will use $f_{\mathbb{C}''}$ (equation 9) for transfer learning evaluation, since it has the most general multiplicative interaction and it does not penalize models that do not learn a translation. For all tasks except STS14, a cross-validated logistic regression is used on the sentence or relation representation. The evaluation of the STS14 task relies on Pearson or Spearman correlation between cosine similarity and the target. We force the composition function to be symmetrical on the MRPC task, since paraphrase detection should be invariant to the permutation of input sentences.

6.3 Setup

We want to compare the different instances of the composition function $f$. We follow the setup of InferSent (Conneau et al., 2017): we learn to encode sentences into a vector $h$ with a bi-directional LSTM using element-wise max pooling over time; the dimension size of $h$ matches the original InferSent setup. Word embeddings are fixed 300-dimensional GloVe vectors trained on Common Crawl 840B (https://nlp.stanford.edu/projects/glove/). Optimization is done with SGD and a decreasing learning rate until convergence.

The only difference with regard to InferSent is the composition. Sentences are composed with six different compositions for training, according to the following template:

$f(h_1, h_2) = [\,h_1;\ h_2;\ m(h_1, h_2);\ s(h_1, h_2)\,]$    (10)

where $s$ (subtractive interaction) is in $\{|h_1 - h_2|,\ h_1 - h_2 + t\}$ and $m$ (multiplicative interaction) is in $\{h_1 \odot h_2,\ f_{\mathbb{C}}(h_1, h_2),\ f_{\mathbb{C}'}(h_1, h_2)\}$. We do not consider the remaining variants since they yielded inferior results in our early experiments using the NLI and SentEval development sets.

$f(h_1, h_2)$ is fed directly to a softmax regression. Note that InferSent uses a multi-layer perceptron before the softmax, but with only linear activations, so our setup is analytically equivalent to InferSent when $m = h_1 \odot h_2$ and $s = |h_1 - h_2|$.
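A condensed sketch of this setup (PyTorch assumed; module names, hidden sizes, and the particular choice of $m$ and $s$ below are ours, for illustration only):

```python
import torch
import torch.nn as nn

class MaxPoolBiLSTMEncoder(nn.Module):
    """Bi-directional LSTM with element-wise max pooling over time (InferSent-style)."""
    def __init__(self, word_dim=300, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(word_dim, hidden, bidirectional=True, batch_first=True)
    def forward(self, word_embs):                 # (batch, seq_len, word_dim)
        out, _ = self.lstm(word_embs)
        return out.max(dim=1).values              # (batch, 2 * hidden)

def compose(h1, h2, t):
    # template of equation 10: [h1; h2; m(h1, h2); s(h1, h2)]
    m = h1 * h2                                   # one multiplicative option
    s = h1 - h2 + t                               # subtractive option with translation
    return torch.cat([h1, h2, m, s], dim=-1)

d, n_classes = 1024, 3                            # d = 2 * hidden
encoder = MaxPoolBiLSTMEncoder()
t = nn.Parameter(torch.zeros(d))                  # learned translation shared across relations
classifier = nn.Linear(4 * d, n_classes)          # softmax regression over the composed features

w1 = torch.randn(2, 7, 300)                       # toy word embeddings for a batch of premises
w2 = torch.randn(2, 5, 300)                       # and hypotheses
h1, h2 = encoder(w1), encoder(w2)
loss = nn.functional.cross_entropy(classifier(compose(h1, h2, t)), torch.tensor([0, 2]))
loss.backward()
```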

6.4 Results

Models trained on natural language inference ($T$ = NLI)
m,s MR SUBJ MPQA TREC MRPC PDTB SICK STS14 NLI AVG
81.2 92.7 90.4 89.6 76.1 46.7 86.6 69.5 84.2 79.1
81.4 92.8 90.5 89.6 75.4 46.6 86.7 69.5 84.3 79.1
81.2 92.6 90.5 89.6 76 46.5 86.6 69.5 84.2 79.1
81.1 92.7 90.5 89.7 76.5 46.4 86.5 70.0 84.8 79.2
81.3 92.6 90.6 89.2 76.2 47.2 86.5 70.0 84.6 79.2
81.2 92.7 90.4 88.5 75.8 47.3 86.8 69.8 84.2 79.1
Table 3: SentEval and base task evaluation results for the models trained on natural language inference ($T$ = NLI); AllNLI is used for training. All scores are accuracy percentages, except STS14, which is a Pearson correlation percentage. AVG denotes the average of the SentEval scores.
Models trained on discourse connective prediction ($T$ = Disc)
m,s MR SUBJ MPQA TREC MRPC PDTB SICK STS14 Disc AVG
80.4 92.7 90.2 89.5 74.5 47.3 83.2 57.9 35.7 77
80.4 92.9 90.2 90.2 75 47.9 83.3 57.8 35.9 77.2
80.2 92.8 90.2 88.4 74.9 47.5 82.9 57.7 35.9 76.8
80.2 92.8 90.2 90.4 74.6 48.5 83.4 58.6 36.1 77.3
80.2 92.9 90.3 90.3 75.1 47.8 83.2 58.3 36.1 77.3
80.2 92.8 90.3 89.7 74.4 47.9 83.7 58.2 35.7 77.2
Table 4: SentEval and base task evaluation results for the models trained on discourse connective prediction ($T$ = Disc). All scores are accuracy percentages, except STS14, which is a Pearson correlation percentage. AVG denotes the average of the SentEval scores.
Comparison models
model MR SUBJ MPQA TREC MRPC PDTB SICK STS14 AVG
InferSent 81.1 92.4 90.2 88.2 76.2 46.7 86.3 70 78.9
SkipT 76.5 93.6 87.1 92.2 73 - 82.3 29 -
BoW 77.2 91.2 87.9 83 72.2 43.9 78.4 54.6 73.6
Table 5: Comparison models from previous work. InferSent represents the original results from Conneau et al. (2017), SkipT is SkipThought from Kiros et al. (2015), and BoW is our re-evaluation of GloVe Bag of Words from Conneau et al. (2017). AVG denotes the average of the SentEval scores.
m,s | MRPC PDTB SICK AVG ($T$ = Disc) | MRPC PDTB SICK AVG ($T$ = NLI)
74.8 48.2 83.6 68.9 76.2 47.2 86.9 70.1
74.9 49.3 83.8 69.3 75.9 47.1 86.9 70
75 48.8 83.4 69.1 75.8 47 87 69.9
74.9 48.7 83.6 69.1 76.2 47.8 86.8 70.3
75.2 48.6 83.5 69.1 76.2 47.6 87.3 70.4
74.6 48.9 83.9 69.1 76.2 47.8 87 70.3
Table 6: Results for sentence relation tasks using an alternative composition function ($f_{\mathbb{C}''}$, equation 9) during evaluation. AVG denotes the average of the three tasks.

Having run several experiments with different initializations, we found that the standard deviations between them are not negligible. We decided to take these into account when reporting scores, contrary to previous work (Kiros et al., 2015; Conneau et al., 2017): we average the scores of 6 distinct runs for each task and use standard deviations under a normality assumption to compute significance. Table 3 shows model scores for $T$ = NLI, while Table 4 shows scores for $T$ = Disc. For comparison, Table 5 shows a number of important models from previous work. Finally, in Table 6, we present results for sentence relation tasks that use an alternative composition function during evaluation instead of the standard composition used in SentEval.

For sentence representation learning, the baseline composition ($m = h_1 \odot h_2$, $s = |h_1 - h_2|$) already performs rather well, being on par with the InferSent scores of the original paper, as would be expected. However, macro-averaging all accuracies, it is the second worst performing model. The best performing model, and in fact all three best models, use the translation ($s = h_1 - h_2 + t$). On relational transfer tasks, training with our compositions and using the ComplEx-inspired composition for transfer (Table 6) always outperforms the baseline (trained and evaluated with $f_{\text{heur}}$, Tables 3 and 4). Averaging accuracies of those transfer tasks, this result is significant for both training tasks (using a Bonferroni correction accounting for the 5 comparisons). On base tasks and on the average of non-relational transfer tasks (MR, MPQA, SUBJ, TREC), our proposed compositions are on average slightly better than $f_{\text{heur}}$. Representations learned with our proposed compositions can still be compared with simple cosine similarity: all three methods using the translational composition ($s = h_1 - h_2 + t$) very significantly outperform the baseline on STS14 (again with a Bonferroni correction). Thus, we believe the translational composition has more robust results and could be a better default choice than $|h_1 - h_2|$ for representation learning. Note that our compositions are also beneficial with regard to convergence speed: on average, each of our proposed compositions needed fewer epochs to converge than the baseline $f_{\text{heur}}$, for both training tasks.

Additionally, using $f_{\mathbb{C}''}$ (Table 6) instead of $f_{\text{heur}}$ (Tables 3 and 4) for transfer learning on the relational transfer tasks (PDTB, MRPC, SICK) yields a significant improvement on average, even when $f_{\text{heur}}$ was used for training. Therefore, we believe $f_{\mathbb{C}''}$ is an interesting composition for the inference or evaluation of models, regardless of how they were trained.

7 Related work

There are numerous interactions between SRL and NLP. We believe that our framework merges two specific lines of work: relation prediction and modeling textual relational tasks.

Some previous NLP work focused on composition functions for relation prediction between text fragments, even though they ignored SRL and only dealt with word units. Word2vec (Mikolov et al., 2013) sparked great interest in this task with word analogies in the latent space. Levy & Goldberg (2014) explored different scoring functions between words, notably for analogies. Hypernymy relations were also studied by Chang et al. (2017) and Fu et al. (2014). Levy et al. (2015) proposed tailored scoring functions. Even the skipgram model (Mikolov et al., 2013) can be formulated as finding relations between context and target words. We did not empirically explore textual relational learning at the word level, but we believe that it would fit in our framework and could be tested in future studies. Numerous approaches (Chen et al., 2017b; Seo et al., 2016; Gong et al., 2018; Joshi et al., 2018) were proposed to predict inference relations between sentences, but they do not explicitly use sentence embeddings. Instead, they encode sentences jointly, possibly with the help of the previously cited word compositions; it would therefore also be interesting to apply our techniques within their framework.

Some modeling aspects of textual relational learning have been formally investigated by Baudiš et al. (2016). They noticed the genericity of relational problems and explored multi-task and transfer learning on relational tasks. Their work is complementary to ours since their framework unifies tasks while ours unifies composition functions. Subsequent approaches use relational tasks for training and evaluation on specific datasets (Conneau et al., 2017; Nie et al., 2017).

8 Conclusion

We have demonstrated that a number of existing models used for textual relational learning rely on composition functions that are already used in Statistical Relational Learning. By taking into account previous insights from SRL, we proposed new composition functions and evaluated them. These composition functions are all simple to implement, and we hope that it will become standard practice to try them on relational problems. Larger scale data might make better use of these more expressive compositions, as might more compositional, asymmetric, and arguably more realistic datasets (Dasgupta et al., 2018; Gururangan et al., 2018). Finally, our compositions can also help to improve the interpretability of embeddings, since they can help measure the asymmetry of relation predictions. Analogies through translations helped interpreting word embeddings, and perhaps analyzing our learned translation could help interpreting sentence embeddings.

References