1 Introduction
Predicting relations between textual units is a widespread task, essential for discourse analysis, dialog systems, information retrieval, and paraphrase detection. Since relation prediction often requires some form of understanding, it can also be used as a proxy to learn transferable sentence representations. Several tasks that are useful for building sentence representations can be derived directly from text structure, without human annotation: sentence order prediction (Logeswaran et al., 2016; Jernite et al., 2017), prediction of previous and subsequent sentences (Kiros et al., 2015; Jernite et al., 2017), or prediction of explicit discourse markers between sentence pairs (Nie et al., 2017; Jernite et al., 2017). Human-labeled relations between sentences can also be used for that purpose, e.g. inferential relations (Conneau et al., 2017).

While most work on sentence similarity estimation, entailment detection, answer selection, or discourse relation prediction seemingly uses task-specific models, all of these tasks involve predicting whether a relation $R$ holds between two sentences $s_1$ and $s_2$. This genericity has been noticed in the literature before (Baudiš et al., 2016) and it has been leveraged for the evaluation of sentence embeddings within the SentEval framework (Conneau et al., 2017). A straightforward way to predict the probability of $R(s_1, s_2)$ being true is to represent $s_1$ and $s_2$ with $d$-dimensional embeddings $h_1$ and $h_2$, and to compute sentence pair features $f(h_1, h_2)$, where $f$ is a composition function (e.g. concatenation, product, ...). A softmax classifier can then learn to predict $R$ from those features. The combination of $f$ and the classifier can be seen as reasoning based on the content of $h_1$ and $h_2$ (Socher et al., 2013).

Our contributions are as follows: we review the composition functions used for textual relation prediction and their limitations (section 2); we draw a parallel between these compositions and the scoring functions of Statistical Relational Learning models (sections 3 and 4), from which we derive new, more expressive compositions; we extend the SentEval evaluation framework with such a composition for relational transfer tasks (section 5); and we show empirically that the proposed compositions benefit both sentence representation learning and relation prediction (section 6).
2 Composition functions for relation prediction
We review here popular composition functions used for relation prediction based on sentence embeddings. Ideally, they should simultaneously fulfill the following minimal requirements:
- make use of interactions between the representations $h_1$ and $h_2$ of the sentences to relate;
- allow for the learning of asymmetric relations (e.g. entailment, order);
- be usable with high dimensionalities (the classifier parameters and the composed features should fit in GPU memory).
Additionally, if the main goal is transferable sentence representation learning, compositions should also incentivize gradually changing sentences to lie on a linear manifold, since transfer usually uses linear models. Another possible goal is to learn transferable relation representations. Concretely, a sentence encoder and a composition function $f$ can be trained on a base task, and $f(h_1, h_2)$ can then be used as features for transfer to another task. In that case, the geometry of the sentence embedding space is less relevant, as long as the space of composed features works well for transfer learning. Our evaluation will cover both cases.
A straightforward instantiation of $f$ is concatenation (Hooda & Kosseim, 2017):

$f_{\text{cat}}(h_1, h_2) = [h_1, h_2]$   (1)

However, interactions between $h_1$ and $h_2$ cannot be modeled with $f_{\text{cat}}$ followed by a softmax regression. Indeed, the resulting logit $W[h_1, h_2]$ can be rewritten as a sum of independent contributions from $h_1$ and $h_2$, namely $W_1 h_1 + W_2 h_2$. Using a multi-layer perceptron before the softmax would solve this issue, but it also harms sentence representation learning (Conneau et al., 2017; Logeswaran & Lee, 2018), possibly because the perceptron allows for accurate predictions even if the sentence embeddings lie in a convoluted space. To promote interactions between $h_1$ and $h_2$, the element-wise product has been used by Baudiš et al. (2016):

$f_{\odot}(h_1, h_2) = h_1 \odot h_2$   (2)
Absolute difference is another solution for sentence similarity (Mueller & Thyagarajan, 2016), and its element-wise variant may equally be used to compute informative features:

$f_{-}(h_1, h_2) = |h_1 - h_2|$   (3)
The latter two were combined into a popular instantiation, sometimes referred to as heuristic matching (Tai et al., 2015; Kiros et al., 2015; Mou et al., 2015):

$f_{\odot-}(h_1, h_2) = [\, h_1 \odot h_2,\ |h_1 - h_2| \,]$   (4)
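All four compositions are one-liners; the following minimal NumPy sketch (variable names and the toy dimension $d$ are ours) also checks that concatenation followed by a linear layer decomposes into independent per-sentence contributions:

```python
import numpy as np

d = 8  # toy dimension; real sentence embeddings are much larger
rng = np.random.default_rng(0)
h1, h2 = rng.normal(size=d), rng.normal(size=d)

f_cat  = np.concatenate([h1, h2])            # equation 1
f_prod = h1 * h2                             # equation 2
f_diff = np.abs(h1 - h2)                     # equation 3
f_heur = np.concatenate([f_prod, f_diff])    # equation 4

# A linear (softmax) layer on f_cat decomposes into independent
# contributions: w . [h1, h2] = w1 . h1 + w2 . h2, so no
# multiplicative interaction between h1 and h2 is captured.
w = rng.normal(size=2 * d)
assert np.isclose(w @ f_cat, w[:d] @ h1 + w[d:] @ h2)
```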
Although effective for certain similarity tasks, $f_{\odot-}$ is symmetrical, and should be a poor choice for tasks like entailment prediction or prediction of discourse relations. For instance, if $R$ denotes entailment and $(s_1, s_2)$ = (“It just rained”, “The ground is wet”), $R(s_1, s_2)$ should hold but not $R(s_2, s_1)$. The composition $f_{\odot-}$ is nonetheless used to train/evaluate models on entailment (Conneau et al., 2017) or discourse relation prediction (Nie et al., 2017).
Sometimes $[h_1, h_2]$ is concatenated to $f_{\odot-}(h_1, h_2)$ (Ampomah et al., 2016; Conneau et al., 2017). While the resulting composition is asymmetrical, the asymmetrical component involves no interaction, as noted previously. This composition is very commonly used: on the SNLI benchmark (nlp.stanford.edu/projects/snli/, as of February 2019), the majority of the listed sentence embedding based models use it, and the remainder use a weaker form (e.g. omitting one of the terms).
The outer product has been used instead for asymmetric multiplicative interaction (Jernite et al., 2017):

$f_{\otimes}(h_1, h_2) = h_1 \otimes h_2 = \mathrm{vec}(h_1 h_2^{\top})$   (5)

This formulation is expressive but it forces the classifier to have $d^2$ parameters per relation, which is prohibitive when there are many relations and $d$ is high.
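Continuing the sketch above, the outer product is also a one-liner, but the feature (and per-relation weight) count grows quadratically:

```python
# Equation 5: expressive, but a softmax layer now needs d*d weights
# per relation; with d = 4096 that is ~16.8M float32 weights (~67 MB)
# for a single relation.
f_outer = np.outer(h1, h2).ravel()  # shape (d * d,)
```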
The problems outlined above are well known in SRL. Thus, existing compositions (except $f_{\otimes}$) can only model relations superficially for the tasks currently used to train state of the art sentence encoders, like NLI or discourse connective prediction.
3 Statistical Relational Learning models
Table 1: Scoring functions and relation parameters of popular SRL models ($e_1, e_2 \in \mathbb{R}^d$; for ComplEx, vectors are viewed as complex, of dimension $d/2$).

| Model | Scoring function | Parameters |
|---|---|---|
| Unstructured | $-\lVert e_1 - e_2 \rVert_p$ | (none) |
| TransE | $-\lVert e_1 + t_r - e_2 \rVert_p$ | $t_r \in \mathbb{R}^d$ |
| RESCAL | $e_1^{\top} W_r e_2$ | $W_r \in \mathbb{R}^{d \times d}$ |
| DistMult | $\langle e_1, w_r, e_2 \rangle$ | $w_r \in \mathbb{R}^d$ |
| ComplEx | $\mathrm{Re}(\langle e_1, w_r, \bar{e}_2 \rangle)$ | $w_r \in \mathbb{C}^{d/2}$ |
In this section we introduce the context of statistical relational learning (SRL) and the relevant models. Recently, SRL has focused on efficient and expressive relation prediction based on embeddings. A core goal of SRL (Getoor & Taskar, 2007) is to induce whether a relation $r$ holds between two arbitrary entities $e_1, e_2$. As an example, we would like to assign a score to the triple (Paris, located_in, France) that reflects a high probability. In embedding-based SRL models, entities have vector representations in $\mathbb{R}^d$ and a scoring function reflects the truth value of relations. The scoring function should allow for relation-dependent reasoning over the latent space of entities. Scoring functions can have relation-specific parameters, which can be interpreted as relation embeddings. Table 1 presents an overview of a number of state of the art relational models. We can distinguish two families of models: subtractive and multiplicative.
The TransE scoring function (Bordes et al., 2013b) is motivated by the idea that translations in latent space can model analogical reasoning and hierarchical relationships. Dense word embeddings trained on tasks related to the distributional hypothesis naturally allow for analogical reasoning through translations, without explicit supervision (Mikolov et al., 2013). TransE generalizes the older Unstructured model, in which no translation is learned. We call these two subtractive models.
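The scoring functions of Table 1 (including the multiplicative family discussed next) are easy to state in code; a minimal NumPy sketch, with scores oriented so that higher means more plausible:

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
e1, e2 = rng.normal(size=d), rng.normal(size=d)  # entity embeddings

def unstructured(e1, e2, p=2):
    return -np.linalg.norm(e1 - e2, ord=p)        # no relation parameters

def transe(e1, e2, t_r, p=2):
    return -np.linalg.norm(e1 + t_r - e2, ord=p)  # t_r: relation translation

def rescal(e1, e2, W_r):
    return e1 @ W_r @ e2                          # W_r: (d, d) relation matrix

def distmult(e1, e2, w_r):
    return np.sum(e1 * w_r * e2)                  # w_r: (d,) relation vector

def complex_score(e1, e2, w_r):
    # e1, e2, w_r are complex-valued vectors of dimension d // 2
    return np.real(np.sum(w_r * e1 * np.conj(e2)))
```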
The RESCAL, DistMult, and ComplEx scoring functions can all be seen as a dot product between $e_1$ and a relation-specific linear transformation of $e_2$ (Liu et al., 2017). This transformation helps checking whether $e_1$ matches some aspects of $e_2$. RESCAL allows a full linear mapping $W_r e_2$ but has a high complexity, while DistMult is restricted to a component-wise weighting $w_r \odot e_2$. ComplEx has fewer parameters than RESCAL but still allows for the modeling of asymmetrical relations. As shown in Liu et al. (2017), ComplEx boils down to a restriction of RESCAL where $W_r$ is a block-diagonal matrix whose blocks are 2-dimensional, antisymmetric, and have equal diagonal terms. Under this form, even and odd indices of an embedding's dimensions play the roles of real and imaginary parts, respectively. The ComplEx model (Trouillon et al., 2016) and its variations (Lacroix et al., 2018) yield state of the art performance on knowledge base completion in numerous evaluations. We call these three multiplicative models.

4 Embeddings composition as SRL models
We claim that several existing models (Conneau et al., 2017; Nie et al., 2017; Baudiš et al., 2016) boil down to SRL models where the sentence embeddings $h_1, h_2$ act as entity embeddings $e_1, e_2$. This framework is depicted in figure 1. In this article we focus on sentence embeddings, although our framework can straightforwardly be applied to other levels of language granularity (such as words, clauses, or documents).
Some models (Chen et al., 2017b; Seo et al., 2016; Gong et al., 2018; Radford, 2018; Devlin et al., 2018) do not rely on explicit sentence encodings to perform relation prediction. They combine information of input sentences at earlier stages, using conditional encoding or cross-attention. There is however no straightforward way to derive transferable sentence representations in this setting, and so these models are out of the scope of this paper. They sometimes make use of composition functions, so our work could still be relevant to them in some respect.
In this section we will make a link between sentence composition functions and SRL scoring functions, and propose new composition functions drawing inspiration from SRL models.
[Figure 1 and Figure 2: illustrations of the framework (sentence embeddings acting as entity embeddings) and of the artificial semantic spaces discussed in sections 4.2 and 4.3; images not reproduced here.]
4.1 Linking composition functions and SRL models
The composition function $f_{\odot}$ from equation 2 followed by a softmax regression yields a score whose analytical form is identical to the DistMult scoring function described in section 3. Let $w_r$ denote the softmax weights for relation $r$. The logit score for the truth of $R(s_1, s_2)$ is $\langle w_r, h_1 \odot h_2 \rangle = \langle h_1, w_r, h_2 \rangle$, which is exactly the DistMult scoring function if $h_1, h_2$ act as entity embeddings and $w_r$ as the relation weights. Similarly, the composition $f_{-}$ from equation 3 followed by a softmax regression can be seen as an element-wise weighted score of Unstructured (both are equal if the softmax weights are all unitary and $p = 1$).
Thus, $f_{\odot-}$ from equation 4 (with softmax regression) can be seen as a weighted ensemble of Unstructured and DistMult. These two models are respectively outperformed by TransE and ComplEx on knowledge base link prediction by a large margin (Trouillon et al., 2016; Bordes et al., 2013a). We therefore propose to change the Unstructured and DistMult components of $f_{\odot-}$ so that they match their respective state of the art variations, as described in the following sections. We will also show the implications of these refinements.
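Reusing h1, h2 from the first sketch and distmult from the SRL sketch, the correspondences can be checked numerically:

```python
# A softmax logit over the element-wise product is exactly DistMult:
w_r = rng.normal(size=d)
assert np.isclose(w_r @ (h1 * h2), distmult(h1, h2, w_r))

# Unit softmax weights over |h1 - h2| recover Unstructured (up to sign)
# with p = 1:
assert np.isclose(np.ones(d) @ np.abs(h1 - h2),
                  np.linalg.norm(h1 - h2, ord=1))
```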
4.2 Casting TransE as a composition
Simply replacing $|h_1 - h_2|$ in equation 4 with

$s_t(h_1, h_2) = |h_1 + t - h_2|$   (6)

would make the model analogous to TransE. The translation $t \in \mathbb{R}^d$ is learned and is shared by all relations. A relation-specific translation $t_r$ could be used instead, but it would make the sentence pair features relation-specific. Instead, here, each dimension of $s_t(h_1, h_2)$ can be weighted according to a given relation by the softmax weights. A non-zero $t$ makes $s_t$ asymmetrical and also yields features that allow for the checking of an analogy between $s_1$ and $s_2$. Sentence embeddings often rely on pre-trained word embeddings, which have demonstrated strong capabilities for analogical reasoning. Some analogies, such as part-whole, are computable with off-the-shelf word embeddings (Chen et al., 2017a) and should be very informative for natural language inference tasks. As an illustration, let us consider an artificial semantic space (depicted in figure 2) where we posit a “to the past” translation $t_{past}$ such that $h + t_{past}$ is the embedding of the sentence represented by $h$ changed to the past tense. Unstructured is not able to leverage this semantic space to correctly score $R_{past}(s_1, s_2)$, while TransE is well tailored to give the highest scores to sentences whose embeddings are near $h_1 + t$, where $t$ is an estimate of $t_{past}$ that can be learned from examples.
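A sketch of this translation-augmented subtraction, continuing the NumPy example (the shared translation $t$ would be a trainable parameter in practice):

```python
t = rng.normal(size=d)          # learned, shared across relations
s_t = np.abs(h1 + t - h2)       # equation 6

# A non-zero t breaks symmetry: s_t(h1, h2) != s_t(h2, h1) in general,
# so downstream softmax weights can exploit directional (analogical)
# structure such as the "to the past" translation discussed above.
assert not np.allclose(s_t, np.abs(h2 + t - h1))
```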
4.3 Casting ComplEx as a composition
Let us partition the dimensions of $h$ into two equally sized sets $R$ and $I$, e.g. even and odd dimension indices, and write $h = (h^R, h^I)$. We propose a new function $f_{\mathbb{C}}$ as a way to fit the ComplEx scoring function into a composition function:

$f_{\mathbb{C}}(h_1, h_2) = [\, h_1^R \odot h_2^R + h_1^I \odot h_2^I,\ \ h_1^R \odot h_2^I - h_1^I \odot h_2^R \,]$   (7)

$f_{\mathbb{C}}$ multiplied by softmax weights $w_r$ is equivalent to the ComplEx scoring function $\mathrm{Re}(\langle h_1, w_r, \bar{h}_2 \rangle)$: the first half of $w_r$ corresponds to the real part of the ComplEx relation weights, while the second half corresponds to the imaginary part.
$f_{\mathbb{C}}$ is to the ComplEx scoring function what $f_{\odot}$ is to the DistMult scoring function. Intuitively, ComplEx is a minimal way to model interactions between distinct latent dimensions, while DistMult only allows identical dimensions to interact.
Let us consider a new artificial semantic space (also shown in figure 2) where the first dimension is high when a sentence means that it just rained, and the second dimension is high when the ground is wet. Over this semantic space, DistMult is only able to detect entailment between paraphrases, whereas ComplEx can naturally model that $R$(“it just rained”, “the ground is wet”) should score high while its converse should not.
We also propose two more general versions of $f_{\mathbb{C}}$:

$f_{\mathbb{C}'}(h_1, h_2) = [\, h_1 \odot h_2,\ \ h_1^R \odot h_2^I - h_1^I \odot h_2^R \,]$   (8)

$f_{\mathbb{C}''}(h_1, h_2) = [\, h_1 \odot h_2,\ \ h_1^R \odot h_2^I,\ \ h_1^I \odot h_2^R \,]$   (9)

$f_{\mathbb{C}'}$ can be seen as DistMult concatenated with the asymmetrical part of ComplEx, and $f_{\mathbb{C}''}$ can be seen as RESCAL with unconstrained block-diagonal relation matrices.
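The three compositions are cheap to compute; the following sketch (continuing the NumPy example, with the first/second halves of the dimensions standing in for the real/imaginary parts) also verifies that a logit over $f_{\mathbb{C}}$ matches the complex-valued ComplEx score:

```python
# Any fixed partition works; we take the first and second halves.
hR1, hI1 = h1[: d // 2], h1[d // 2 :]
hR2, hI2 = h2[: d // 2], h2[d // 2 :]

f_C  = np.concatenate([hR1 * hR2 + hI1 * hI2,    # equation 7
                       hR1 * hI2 - hI1 * hR2])
f_C1 = np.concatenate([h1 * h2,                  # equation 8
                       hR1 * hI2 - hI1 * hR2])
f_C2 = np.concatenate([h1 * h2,                  # equation 9
                       hR1 * hI2, hI1 * hR2])

# Softmax weights [wR, wI] on f_C reproduce Re(<h1, w_r, conj(h2)>):
wR, wI = rng.normal(size=d // 2), rng.normal(size=d // 2)
logit = np.concatenate([wR, wI]) @ f_C
c1, c2, cw = hR1 + 1j * hI1, hR2 + 1j * hI2, wR + 1j * wI
assert np.isclose(logit, np.real(np.sum(cw * c1 * np.conj(c2))))
```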
5 On the evaluation of relational models
The SentEval framework (Conneau et al., 2017) provides a general evaluation for transferable sentence representations, with open source evaluation code. One only needs to specify a sentence encoder function, and the framework performs classification tasks or relation prediction tasks using cross-validated logistic regression on embeddings or composed sentence embeddings. Tasks include sentiment analysis, entailment, textual similarity, textual relatedness, and paraphrase detection. These tasks are a rich way to train or evaluate sentence representations, since in a triple $(s_1, R, s_2)$ we can see $R$ as a label for the pair $(s_1, s_2)$ (Baudiš et al., 2016). Unfortunately, the relational tasks hard-code the composition function from equation 4. From our previous analysis, we believe this composition function favors the use of contextual/lexical similarity rather than high-level reasoning, and can penalize representations based on more semantic aspects. This bias could harm research, since semantic representation is an important next step for sentence embedding. Training/evaluation datasets are also arguably flawed with respect to relational aspects, since several recent studies (Dasgupta et al., 2018; Poliak et al., 2018; Gururangan et al., 2018; Glockner et al., 2018) show that InferSent, despite being state of the art on SentEval evaluation tasks, performs poorly on asymmetrical tasks and non-additive composition of words. In addition to providing new ways of training sentence encoders, we will also extend the SentEval evaluation framework with a more expressive composition function for relational transfer tasks, which improves results even when the sentence encoder was not trained with it.

6 Experiments
Our goal is to show that transferable sentence representation learning and relation prediction tasks can be improved when our expressive compositions are used instead of the composition from equation 4. We train our relational model adaptations on two relation prediction base tasks $T$, one supervised ($T$ = NLI) and one unsupervised ($T$ = Disc), described below, and evaluate sentence/relation representations on base and transfer tasks using the SentEval framework, in order to quantify the generalization capabilities of our models. Since we use minor modifications of InferSent and SentEval, our experiments are easily reproducible.
Table 2: Evaluation tasks. N is the number of training examples and C the number of classes.

| name | N | task | C | representation(s) used |
|---|---|---|---|---|
| MR | 11k | sentiment (movies) | 2 | $h$ |
| SUBJ | 10k | subjectivity/objectivity | 2 | $h$ |
| MPQA | 11k | opinion polarity | 2 | $h$ |
| TREC | 6k | question-type | 6 | $h$ |
| SICK-E | 10k | NLI | 3 | $f(h_1, h_2)$ |
| MRPC | 4k | paraphrase detection | 2 | $f(h_1, h_2)$ |
| PDTB | 17k | discursive relation | 5 | $f(h_1, h_2)$ |
| STS14 | 4.5k | similarity | - | $\cos(h_1, h_2)$ |
6.1 Training tasks
Natural language inference ($T$ = NLI) aims to predict whether the relation between two sentences (premise and hypothesis) is Entailment, Contradiction or Neutral. We use the combination of the SNLI dataset (Bowman et al., 2015) and the MNLI dataset (Williams et al., 2017), and call AllNLI the resulting dataset of roughly one million examples. Conneau et al. (2017) claim that NLI data allows universal sentence representation learning. They used the composition $[h_1, h_2, h_1 \odot h_2, |h_1 - h_2|]$, i.e. equation 4 with concatenated sentence representations, in order to train their InferSent model.
We also train on the prediction of discourse connectives between sentences/clauses ($T$ = Disc). Discourse connectives make discourse relations between sentences explicit. In the sentence I live in Paris but I'm often elsewhere, the word but highlights that there is a contrast between the two clauses it connects. We use Malmi et al.'s (2017) dataset of selected instances with discourse connectives (e.g. however, for example), with the provided train/dev/test split. This dataset has no other supervision than the list of 20 connectives. Nie et al. (2017) used equation 4's features concatenated with the sum of the sentence representations to train their model, DisSent, on a similar task, and showed that their encoder was general enough to perform well on SentEval tasks. They use a dataset that was, at the time of writing, not publicly available.
6.2 Evaluation tasks
Table 2 provides an overview of the transfer tasks used for evaluation. We added another relation prediction task to SentEval, the PDTB coarse-grained implicit discourse relation task. This task involves predicting a discursive link between two sentences among {Comparison, Contingency, Entity based coherence, Expansion, Temporal}. We followed the setup of Pitler et al. (2009), without sampling negative examples in training. MRPC, PDTB and SICK will be tested with two composition functions: besides the SentEval composition $[h_1, h_2, h_1 \odot h_2, |h_1 - h_2|]$, we will use its variant $[h_1, h_2, f_{\mathbb{C}''}(h_1, h_2), |h_1 - h_2|]$ for transfer learning evaluation, since $f_{\mathbb{C}''}$ is the most general multiplicative interaction, and keeping $|h_1 - h_2|$ does not penalize models that did not learn a translation. For all tasks except STS14, a cross-validated logistic regression is used on the sentence or relation representation. The evaluation of the STS14 task relies on Pearson or Spearman correlation between cosine similarity and the target. We force the composition function to be symmetrical on the MRPC task, since paraphrase detection should be invariant to permutations of the input sentences.
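One simple way to enforce this symmetry, sketched here as an illustration rather than as the exact implementation, is to pool the features of both argument orders:

```python
def symmetrize(f, h1, h2):
    """Order-invariant features: pool a composition over both orders."""
    return f(h1, h2) + f(h2, h1)
```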
6.3 Setup
We want to compare the different instantiations of $f$. We follow the setup of InferSent (Conneau et al., 2017): we learn to encode sentences into $h \in \mathbb{R}^d$ with a bi-directional LSTM using element-wise max pooling over time, with a dimension size $d$ of 4096. Word embeddings are fixed GloVe vectors with 300 dimensions, trained on Common Crawl 840B (https://nlp.stanford.edu/projects/glove/). Optimization is done with SGD and a decreasing learning rate until convergence. The only difference with regard to InferSent is the composition. Sentences are composed with six different compositions for training, according to the following template:

$f(h_1, h_2) = [\, h_1,\ h_2,\ m(h_1, h_2),\ s(h_1, h_2) \,]$   (10)

where the subtractive interaction $s$ is in $\{\, |h_1 - h_2|,\ |h_1 + t - h_2| \,\}$ (equations 3 and 6) and the multiplicative interaction $m$ is in $\{\, h_1 \odot h_2,\ f_{\mathbb{C}},\ f_{\mathbb{C}'} \,\}$ (equations 2, 7 and 8). We do not consider $f_{\mathbb{C}''}$ for training, since it yielded inferior results in our early experiments using the NLI and SentEval development sets. $f(h_1, h_2)$ is fed directly to a softmax regression. Note that InferSent uses a multi-layer perceptron before the softmax, but with only linear activations, so our setup is analytically equivalent to InferSent when $(m, s) = (\odot, -)$.
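A hedged sketch of this training-time head (function and variable names are ours; the actual model follows the InferSent codebase):

```python
import numpy as np

def template(h1, h2, m, s):
    """Equation 10: concatenation plus multiplicative and subtractive features."""
    return np.concatenate([h1, h2, m(h1, h2), s(h1, h2)])

# One of the six trained instantiations: m = element-wise product,
# s = |h1 + t - h2| with a shared translation t (trainable in practice).
d = 8
rng = np.random.default_rng(0)
t = rng.normal(size=d)
m = lambda a, b: a * b
s = lambda a, b: np.abs(a + t - b)

h1, h2 = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(3, 4 * d))        # softmax weights, e.g. 3 NLI classes
logits = W @ template(h1, h2, m, s)    # fed to a softmax/cross-entropy loss
```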
6.4 Results
Table 3: Models trained on natural language inference ($T$ = NLI). Transfer task scores, base task (NLI) accuracy, and macro-average (AVG); STS14 reports correlation × 100.

| m, s | MR | SUBJ | MPQA | TREC | MRPC | PDTB | SICK-E | STS14 | NLI | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| $\odot$, $-$ | 81.2 | 92.7 | 90.4 | 89.6 | 76.1 | 46.7 | 86.6 | 69.5 | 84.2 | 79.1 |
| $\mathbb{C}$, $-$ | 81.4 | 92.8 | 90.5 | 89.6 | 75.4 | 46.6 | 86.7 | 69.5 | 84.3 | 79.1 |
| $\mathbb{C}'$, $-$ | 81.2 | 92.6 | 90.5 | 89.6 | 76 | 46.5 | 86.6 | 69.5 | 84.2 | 79.1 |
| $\odot$, $t$ | 81.1 | 92.7 | 90.5 | 89.7 | 76.5 | 46.4 | 86.5 | 70.0 | 84.8 | 79.2 |
| $\mathbb{C}$, $t$ | 81.3 | 92.6 | 90.6 | 89.2 | 76.2 | 47.2 | 86.5 | 70.0 | 84.6 | 79.2 |
| $\mathbb{C}'$, $t$ | 81.2 | 92.7 | 90.4 | 88.5 | 75.8 | 47.3 | 86.8 | 69.8 | 84.2 | 79.1 |
Table 4: Models trained on discourse connective prediction ($T$ = Disc). Same columns as Table 3, with the base task (Disc) accuracy.

| m, s | MR | SUBJ | MPQA | TREC | MRPC | PDTB | SICK-E | STS14 | Disc | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| $\odot$, $-$ | 80.4 | 92.7 | 90.2 | 89.5 | 74.5 | 47.3 | 83.2 | 57.9 | 35.7 | 77 |
| $\mathbb{C}$, $-$ | 80.4 | 92.9 | 90.2 | 90.2 | 75 | 47.9 | 83.3 | 57.8 | 35.9 | 77.2 |
| $\mathbb{C}'$, $-$ | 80.2 | 92.8 | 90.2 | 88.4 | 74.9 | 47.5 | 82.9 | 57.7 | 35.9 | 76.8 |
| $\odot$, $t$ | 80.2 | 92.8 | 90.2 | 90.4 | 74.6 | 48.5 | 83.4 | 58.6 | 36.1 | 77.3 |
| $\mathbb{C}$, $t$ | 80.2 | 92.9 | 90.3 | 90.3 | 75.1 | 47.8 | 83.2 | 58.3 | 36.1 | 77.3 |
| $\mathbb{C}'$, $t$ | 80.2 | 92.8 | 90.3 | 89.7 | 74.4 | 47.9 | 83.7 | 58.2 | 35.7 | 77.2 |
Table 5: Comparison models from previous work (same columns as Tables 3 and 4, without the base task score).

| model | MR | SUBJ | MPQA | TREC | MRPC | PDTB | SICK-E | STS14 | AVG |
|---|---|---|---|---|---|---|---|---|---|
| InferSent | 81.1 | 92.4 | 90.2 | 88.2 | 76.2 | 46.7 | 86.3 | 70 | 78.9 |
| SkipThought | 76.5 | 93.6 | 87.1 | 92.2 | 73 | - | 82.3 | 29 | - |
| BoW | 77.2 | 91.2 | 87.9 | 83 | 72.2 | 43.9 | 78.4 | 54.6 | 73.6 |
Table 6: Relational transfer tasks evaluated with the $f_{\mathbb{C}''}$-based composition (equation 9) instead of the default SentEval composition. Left: models trained with $T$ = Disc; right: models trained with $T$ = NLI.

| m, s | MRPC | PDTB | SICK-E | AVG | MRPC | PDTB | SICK-E | AVG |
|---|---|---|---|---|---|---|---|---|
| $\odot$, $-$ | 74.8 | 48.2 | 83.6 | 68.9 | 76.2 | 47.2 | 86.9 | 70.1 |
| $\mathbb{C}$, $-$ | 74.9 | 49.3 | 83.8 | 69.3 | 75.9 | 47.1 | 86.9 | 70 |
| $\mathbb{C}'$, $-$ | 75 | 48.8 | 83.4 | 69.1 | 75.8 | 47 | 87 | 69.9 |
| $\odot$, $t$ | 74.9 | 48.7 | 83.6 | 69.1 | 76.2 | 47.8 | 86.8 | 70.3 |
| $\mathbb{C}$, $t$ | 75.2 | 48.6 | 83.5 | 69.1 | 76.2 | 47.6 | 87.3 | 70.4 |
| $\mathbb{C}'$, $t$ | 74.6 | 48.9 | 83.9 | 69.1 | 76.2 | 47.8 | 87 | 70.3 |
Having run several experiments with different initializations, we observed non-negligible standard deviations between runs. We decided to take these into account when reporting scores, contrary to previous work (Kiros et al., 2015; Conneau et al., 2017): we average the scores of 6 distinct runs for each task and use standard deviations under a normality assumption to compute significance. Table 3 shows model scores for $T$ = NLI, while Table 4 shows scores for $T$ = Disc. For comparison, Table 5 reports a number of important models from previous work. Finally, Table 6 presents results for sentence relation tasks that use the alternative composition function based on $f_{\mathbb{C}''}$ instead of the standard composition function used in SentEval.

For sentence representation learning, the baseline composition $(\odot, -)$ already performs rather well, being on par with the InferSent scores of the original paper, as would be expected. However, macro-averaging all accuracies, it is the second worst performing model; the best performing models all use the translation ($s = t$).
On relational transfer tasks, each training composition combined with the $f_{\mathbb{C}''}$-based transfer composition (Table 6) outperforms the baseline, i.e. the composition of equation 4 at both training and transfer time (Tables 3 and 4). Averaging the accuracies of those transfer tasks, this result is significant for both training tasks (using a Bonferroni correction accounting for the 5 comparisons).
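As an illustration of the reported significance computations (a sketch under the stated normality assumption; the exact procedure may differ in details):

```python
import numpy as np
from scipy.stats import norm

def corrected_p(mean_a, std_a, mean_b, std_b, n_runs=6, n_comparisons=5):
    """Two-sample z-test on averages over n_runs runs, Bonferroni-corrected."""
    se = np.sqrt(std_a ** 2 / n_runs + std_b ** 2 / n_runs)
    p = 2.0 * (1.0 - norm.cdf(abs(mean_a - mean_b) / se))
    return min(1.0, p * n_comparisons)
```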
On the base tasks and on the average of non-relational transfer tasks (MR, MPQA, SUBJ, TREC), our proposed compositions are on average slightly better than the baseline.
Representations learned with our proposed compositions can still be compared with simple cosine similarity: all three methods using the translational composition ($s = t$) very significantly outperform the baseline on STS14 (with Bonferroni correction).
Thus, we believe the translation-augmented subtraction $s_t$ yields more robust results and could be a better default choice than $|h_1 - h_2|$ as the subtractive composition for representation learning. Note that our compositions are also beneficial with regard to convergence speed: on average, each of our proposed compositions needed fewer epochs to converge than the baseline.
Additionally, using the $f_{\mathbb{C}''}$-based composition (Table 6) instead of the standard one (Tables 3 and 4) for transfer learning on relational tasks (PDTB, MRPC, SICK) yields a significant improvement on average, even when the standard composition was used for training. Therefore, we believe the $f_{\mathbb{C}''}$-based composition is an interesting choice for inference or evaluation of models, regardless of how they were trained.
7 Related work
There are numerous interactions between SRL and NLP. We believe that our framework merges two specific lines of work: relation prediction and modeling textual relational tasks.
Some previous NLP work focused on composition functions for relation prediction between text fragments, even though it ignored SRL and only dealt with word units. Word2vec (Mikolov et al., 2013) sparked great interest in this task with word analogies in the latent space. Levy & Goldberg (2014) explored different scoring functions between words, notably for analogies. Hypernymy relations were also studied by Chang et al. (2017) and Fu et al. (2014), and Levy et al. (2015) proposed tailored scoring functions. Even the skipgram model (Mikolov et al., 2013) can be formulated as finding relations between context and target words. We did not empirically explore textual relational learning at the word level, but we believe that it would fit in our framework and could be tested in future studies. Numerous approaches (Chen et al., 2017b; Seok et al., 2016; Gong et al., 2018; Joshi et al., 2018) were proposed to predict inference relations between sentences without explicitly using sentence embeddings. Instead, they encode sentences jointly, possibly with the help of the previously cited word compositions; it would therefore be interesting to apply our techniques within their frameworks.
Some modeling aspects of textual relational learning have been formally investigated by Baudiš et al. (2016). They noticed the genericity of relational problems and explored multi-task and transfer learning on relational tasks. Their work is complementary to ours since their framework unifies tasks while ours unifies composition functions. Subsequent approaches use relational tasks for training and evaluation on specific datasets (Conneau et al., 2017; Nie et al., 2017).
8 Conclusion
We have demonstrated that a number of existing models used for textual relational learning rely on composition functions that are already used in Statistical Relational Learning. By taking into account previous insights from SRL, we proposed new composition functions and evaluated them. These composition functions are all simple to implement, and we hope that it will become standard practice to try them on relational problems. Larger scale data might better leverage these more expressive compositions, as might more compositional, asymmetric, and arguably more realistic datasets (Dasgupta et al., 2018; Gururangan et al., 2018). Finally, our compositions can also help improve the interpretability of embeddings, since they can be used to measure the asymmetry of relation predictions. Analogies through translations helped in interpreting word embeddings, and perhaps analyzing our learned translation could help in interpreting sentence embeddings.
References
- Ampomah et al. (2016) Isaac K E Ampomah, Seong-bae Park, and Sang-jo Lee. A Sentence-to-Sentence Relation Network for Recognizing Textual Entailment. World Academy of Science, Engineering and Technology International Journal of Computer and Information Engineering, 10(12):1955–1958, 2016.
- Baudiš et al. (2016) Petr Baudiš, Jan Pichl, Tomáš Vyskočil, and Jan Šedivý. Sentence Pair Scoring: Towards Unified Framework for Text Comprehension. 2016. URL http://arxiv.org/abs/1603.06127.
- Bordes et al. (2013a) Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. A Semantic Matching Energy Function for Learning with Multi-relational Data. Machine Learning, 2013a. ISSN 0885-6125. doi: 10.1007/s10994-013-5363-6. URL http://arxiv.org/abs/1301.3485.
- Bordes et al. (2013b) Antoine Bordes, Nicolas Usunier, Jason Weston, and Oksana Yakhnenko. Translating Embeddings for Modeling Multi-Relational Data. Advances in NIPS, 26:2787–2795, 2013b.
- Bowman et al. (2015) Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 632–642, 2015.
- Chang et al. (2017) Haw-Shiuan Chang, ZiYun Wang, Luke Vilnis, and Andrew McCallum. Unsupervised Hypernym Detection by Distributional Inclusion Vector Embedding. 2017. URL http://arxiv.org/abs/1710.00880.
- Chen et al. (2017a) Dawn Chen, Joshua C. Peterson, and Thomas L. Griffiths. Evaluating vector-space models of analogy. CoRR, abs/1705.04416, 2017a.
- Chen et al. (2017b) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced lstm for natural language inference. In Regina Barzilay and Min-Yen Kan (eds.), ACL (1), pp. 1657–1668. Association for Computational Linguistics, 2017b. ISBN 978-1-945626-75-3. URL http://dblp.uni-trier.de/db/conf/acl/acl2017-1.html#ChenZLWJI17.
- Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In EMNLP, 2017.
- Dasgupta et al. (2018) Ishita Dasgupta, Demi Guo, Andreas Stuhlmüller, Samuel J. Gershman, and Noah D. Goodman. Evaluating Compositionality in Sentence Embeddings. (2011), 2018. URL http://arxiv.org/abs/1802.04302.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Fu et al. (2014) Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. Learning Semantic Hierarchies via Word Embeddings. In ACL, pp. 1199–1209, 2014.
- Getoor & Taskar (2007) Lise Getoor and Ben Taskar. Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning). The MIT Press, 2007. ISBN 0262072882.
- Glockner et al. (2018) Max Glockner, Vered Shwartz, and Yoav Goldberg. Breaking NLI Systems with Sentences that Require Simple Lexical Inferences. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), (3):1–6, 2018. URL http://arxiv.org/abs/1805.02266.
- Gong et al. (2018) Yichen Gong, Heng Luo, and Jian Zhang. Natural language inference over interaction space. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1dHXnH6-.
- Gururangan et al. (2018) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. Annotation Artifacts in Natural Language Inference Data. 2018. URL http://arxiv.org/abs/1803.02324.
- Hooda & Kosseim (2017) Sohail Hooda and Leila Kosseim. Argument Labeling of Explicit Discourse Relations using LSTM Neural Networks. 2017. URL http://arxiv.org/abs/1708.03425.
- Jernite et al. (2017) Yacine Jernite, Samuel R. Bowman, and David Sontag. Discourse-Based Objectives for Fast Unsupervised Sentence Representation Learning. 2017. URL http://arxiv.org/abs/1705.00557.
- Joshi et al. (2018) Mandar Joshi, Eunsol Choi, Omer Levy, Daniel S. Weld, and Luke Zettlemoyer. pair2vec: Compositional word-pair embeddings for cross-sentence inference. CoRR, abs/1810.08854, 2018. URL http://arxiv.org/abs/1810.08854.
- Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in neural information processing systems, pp. 3294–3302, 2015.
- Lacroix et al. (2018) Timothée Lacroix, Nicolas Usunier, and Guillaume Obozinski. Canonical tensor decomposition for knowledge base completion. In ICML, 2018.
- Levy & Goldberg (2014) Omer Levy and Yoav Goldberg. Linguistic Regularities in Sparse and Explicit Word Representations. Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pp. 171–180, 2014. doi: 10.3115/v1/W14-1618. URL http://aclweb.org/anthology/W14-1618.
- Levy et al. (2015) Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. Do Supervised Distributional Methods Really Learn Lexical Inference Relations? Naacl-2015, pp. 970–976, 2015. URL http://www.aclweb.org/anthology/N/N15/N15-1098.pdf.
- Liu et al. (2017) Hanxiao Liu, Yuexin Wu, and Yiming Yang. Analogical Inference for Multi-Relational Embeddings. In ICML, 2017. URL http://arxiv.org/abs/1705.02426.
- Logeswaran & Lee (2018) Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. pp. 1–16, 2018. URL http://arxiv.org/abs/1803.02893.
- Logeswaran et al. (2016) Lajanugen Logeswaran, Honglak Lee, and Dragomir Radev. Sentence Ordering using Recurrent Neural Networks. pp. 1–15, 2016. URL http://arxiv.org/abs/1611.02654.
- Malmi et al. (2017) Eric Malmi, Daniele Pighin, Sebastian Krause, and Mikhail Kozhevnikov. Automatic Prediction of Discourse Connectives. 2017. URL http://arxiv.org/abs/1702.00992.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, pp. 1–9, 2013.
- Mou et al. (2015) Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. Natural Language Inference by Tree-Based Convolution and Heuristic Matching. pp. 130–136, 2015. URL http://arxiv.org/abs/1512.08422.
- Mueller & Thyagarajan (2016) Jonas Mueller and Aditya Thyagarajan. Siamese Recurrent Architectures for Learning Sentence Similarity. In AAAI, pp. 2786–2792, 2016.
- Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A Three-Way Model for Collective Learning on Multi-Relational Data. In ICML, pp. 809–816, 2011.
- Nie et al. (2017) Allen Nie, Erin D. Bennett, and Noah D. Goodman. DisSent: Sentence Representation Learning from Explicit Discourse Relations. 2017. URL http://arxiv.org/abs/1710.04334.
- Pitler et al. (2009) Emily Pitler, Annie Louis, and Ani Nenkova. Automatic sense prediction for implicit discourse relations in text. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 683–691, 2009. URL http://www.aclweb.org/anthology/P/P09/P09-1077.
- Poliak et al. (2018) Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. Hypothesis Only Baselines in Natural Language Inference. Proceedings of the 7th Joint Conference on Lexical and Computational Semantics, (1):180–191, 2018.
- Radford (2018) Alec Radford. Improving language understanding by generative pre-training. 2018.
- Seo et al. (2016) Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603, 2016. URL http://arxiv.org/abs/1611.01603.
- Seok et al. (2016) Miran Seok, Hye-Jeong Song, Chan-Young Park, Jong-Dae Kim, and Yu-Seop Kim. Named Entity Recognition using Word Embedding as a Feature 1. International Journal of Software Engineering and Its Applications, 10(2):93–104, 2016. ISSN 1738-9984. doi: 10.14257/ijseia.2016.10.2.08. URL http://dx.doi.org/10.14257/ijseia.2016.10.2.08.
- Socher et al. (2013) Richard Socher, Danqi Chen, Christopher Manning, Danqi Chen, and Andrew Ng. Reasoning With Neural Tensor Networks for Knowledge Base Completion. In Neural Information Processing Systems (2003), pp. 926–934, 2013. URL https://nlp.stanford.edu/pubs/SocherChenManningNg{_}NIPS2013.pdf.
- Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. 2015. URL http://arxiv.org/abs/1503.00075.
- Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex Embeddings for Simple Link Prediction. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, 2016. URL http://arxiv.org/abs/1606.06357.
- Williams et al. (2017) Adina Williams, Nikita Nangia, and Samuel R. Bowman. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. 2017. URL http://arxiv.org/abs/1704.05426.
- Yang et al. (2015) Min Chul Yang, Do Gil Lee, So Young Park, and Hae Chang Rim. Knowledge-based question answering using the semantic embedding space. Expert Systems with Applications, 42(23):9086–9104, 2015. ISSN 09574174. doi: 10.1016/j.eswa.2015.07.009. URL http://dx.doi.org/10.1016/j.eswa.2015.07.009.