
pair2vec: Compositional Word-Pair Embeddings for Cross-Sentence Inference

Reasoning about implied relationships (e.g. paraphrastic, common sense, encyclopedic) between pairs of words is crucial for many cross-sentence inference problems. This paper proposes new methods for learning and using embeddings of word pairs that implicitly represent background knowledge about such relationships. Our pairwise embeddings are computed as a compositional function of each word's representation, which is learned by maximizing the pointwise mutual information (PMI) with the contexts in which the two words co-occur. We add these representations to the cross-sentence attention layer of existing inference models (e.g. BiDAF for QA, ESIM for NLI), instead of extending or replacing existing word embeddings. Experiments show a gain of 2.72 F1 on the SQuAD 2.0 benchmark and of 1.3 points on MultiNLI. Our representations also aid in better generalization, with gains of around 6-7% on the adversarial SQuAD datasets and 8.8% on the adversarial NLI test set of Glockner et al. (2018).





1 Introduction

Reasoning about implied relationships (e.g. paraphrastic, common sense, encyclopedic) between pairs of words is crucial for cross-sentence inference problems such as question answering (QA) and natural language inference (NLI). In NLI, for example, given a premise such as “golf is prohibitively expensive”, inferring that the hypothesis “golf is a cheap pastime” is a contradiction requires one to know that expensive and cheap are antonyms. Recent work by Glockner et al. (2018) has shown that current models, which rely heavily on unsupervised single-word embeddings, struggle to learn such relationships. In this paper, we show that they can be learned with word-pair vectors (pair2vec), which are trained, unsupervised, at a very large scale, and which significantly improve performance when added to existing cross-sentence attention mechanisms.

X          Y        Contexts
hot        cold     with X and Y baths
                    too X or too Y
                    neither X nor Y
Portland   Oregon   in X, Y
                    the X metropolitan area in Y
                    X International Airport in Y
crop       wheat    food X are maize, Y, etc
                    dry X, such as Y,
                    more X circles appeared in Y fields
Android    Google   X OS comes with Y play
                    the X team at Y
                    X is developed by Y

Table 1: Example word pairs and their contexts (Wikipedia).

Unlike single-word representations, which are typically trained by modeling the co-occurrence of a target word with its context, our word-pair representations are learned by modeling the three-way co-occurrence between two words and the context that ties them together, as illustrated in Table 1. While similar training signals have been used to learn models for ontology construction Hearst (1992); Snow et al. (2005); Turney (2005); Shwartz et al. (2016) and knowledge base completion Riedel et al. (2013), this paper shows, for the first time, that large-scale learning of pairwise embeddings can be used to directly improve the performance of neural cross-sentence inference models.

More specifically, we train a feedforward network that learns representations for the individual words $x$ and $y$, as well as how to compose them into a pairwise embedding. Training is done by maximizing a generalized notion of the pointwise mutual information (PMI) among $x$, $y$, and their context $c$ using a variant of negative sampling Mikolov et al. (2013a); Levy and Goldberg (2014). Making the pair embedding a compositional function of individual word embeddings alleviates, at least partially, the sparsity that necessarily comes with embedding pairs of words, even at a very large scale.

We show that our embeddings can be added to existing cross-sentence inference models, such as BiDAF++ (DocumentQA) Seo et al. (2017); Clark and Gardner (2018) for QA and ESIM Chen et al. (2017) for NLI. Instead of changing the word embeddings that are fed into the encoder, we add the pretrained pair representations to higher layers in the network where cross-sentence attention mechanisms are used. This allows the model to use the background knowledge that the pair embeddings implicitly encode to better reason about the likely relationships between the pairs of words it aligns to each other.

Experiments on varied cross-sentence inference tasks show strong gains over high-performing models, which already use contextualized ELMo embeddings Peters et al. (2018), by simply adding our word-pair embeddings with no other modifications, not even hyperparameter tuning. We show a gain of 2.72 F1 points over the BiDAF++ model Clark and Gardner (2018) on the SQuAD 2.0 QA benchmark Rajpurkar et al. (2018), as well as a 1.3 point gain over ESIM Chen et al. (2017) on MultiNLI Williams et al. (2018). Additionally, we show that our approach generalizes well to adversarial out-of-domain examples, with a 6-7% F1 increase on adversarial SQuAD Jia and Liang (2017) and an 8.8% gain on the NLI benchmark of Glockner et al. (2018) (both absolute). An analysis of pair2vec on word analogies suggests that it contains information complementary to single-word representations, and is particularly useful for encyclopedic and lexicographic relations.

2 Unsupervised Pretraining

Extending the distributional hypothesis to word pairs, we posit that similar word pairs tend to occur in similar contexts and that the contexts provide strong clues about the likely relationships that hold between the words (e.g. see Table 1). We assume a dataset of $(x, y, c)$ triplets, where each instance depicts a word pair $(x, y)$ and the context $c$ in which they appeared. We learn two compositional representation functions, $R(x, y)$ and $C(c)$, to encode the pair and the context, respectively, as $d$-dimensional vectors (Section 2.1). The functions are trained using a variant of the negative sampling objective, which tries to embed word pairs close to the contexts with which they were observed (Section 2.2).

2.1 Representation

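Our word-pair and context representations are both fixed-length vectors, composed from individual words. The word-pair representation function $R(x, y)$ first embeds and normalizes the individual words with a shared lookup matrix $E_a$:

$$\mathbf{x} = \frac{E_a(x)}{\lVert E_a(x) \rVert}, \qquad \mathbf{y} = \frac{E_a(y)}{\lVert E_a(y) \rVert}$$

These vectors, along with their element-wise product, are fed into a four-layer perceptron:

$$R(x, y) = \mathrm{MLP}^4(\mathbf{x}, \mathbf{y}, \mathbf{x} \circ \mathbf{y})$$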
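As a minimal sketch of this composition in plain Python: the toy dimensionality, the random initialisation, the ReLU nonlinearity on hidden layers, and all helper names (`normalize`, `linear`, `init_layer`, `pair_repr`) are illustrative assumptions rather than details given in the text.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(t * t for t in v))
    return [t / norm for t in v]

def linear(W, b, v):
    """Affine map W v + b, with W stored as a list of rows."""
    return [sum(w * t for w, t in zip(row, v)) + bias for row, bias in zip(W, b)]

def init_layer(rng, n_in, n_out):
    """Random weights and zero biases (assumed initialisation scheme)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def pair_repr(e_x, e_y, layers):
    """Compose a pair vector from [x; y; x * y] using normalised word vectors."""
    x, y = normalize(e_x), normalize(e_y)
    h = x + y + [p * q for p, q in zip(x, y)]   # list concatenation
    for i, (W, b) in enumerate(layers):
        h = linear(W, b, h)
        if i < len(layers) - 1:                 # ReLU on hidden layers only (assumption)
            h = [max(0.0, t) for t in h]
    return h

rng = random.Random(0)
d = 4                                           # toy dimensionality
sizes = [3 * d, d, d, d, d]                     # four affine layers
layers = [init_layer(rng, sizes[i], sizes[i + 1]) for i in range(4)]
e_hot = [rng.gauss(0, 1) for _ in range(d)]
e_cold = [rng.gauss(0, 1) for _ in range(d)]
r_hot_cold = pair_repr(e_hot, e_cold, layers)
```

Because the word vectors are normalised before composition, rescaling an input embedding leaves the pair vector unchanged; only its direction matters.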

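The context $c = c_1 \ldots c_n$ is encoded as a $d$-dimensional vector using the function $C(c)$, which embeds each token $c_i$ with a lookup matrix $E_c$, contextualizes it with a single-layer Bi-LSTM Hochreiter and Schmidhuber (1997), and then aggregates the entire context with attentive pooling:

$$\mathbf{c}_i = E_c(c_i)$$
$$\mathbf{h}_1 \ldots \mathbf{h}_n = \mathrm{BiLSTM}(\mathbf{c}_1 \ldots \mathbf{c}_n)$$
$$w = \mathrm{softmax}_i(\mathbf{k} \cdot \mathbf{h}_i)$$
$$C(c) = \sum_i w_i W \mathbf{h}_i$$

where $W \in \mathbb{R}^{d \times d}$ and $\mathbf{k} \in \mathbb{R}^d$. All parameters, including the lookup tables $E_a$ and $E_c$, are trained.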
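The attentive pooling step can be sketched as follows. This assumes the hidden states are already available (the Bi-LSTM itself is omitted), that attention scores are inner products with a learned query vector `k`, and that each state is projected by a matrix `W` before the weighted sum; the names and toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pool(hidden, k, W):
    """Pool hidden states h_1..h_n into one vector: sum_i w_i * (W h_i)."""
    weights = softmax([sum(ki * hi for ki, hi in zip(k, h)) for h in hidden])
    projected = [[sum(w * t for w, t in zip(row, h)) for row in W] for h in hidden]
    dim = len(W)
    return [sum(wt * p[j] for wt, p in zip(weights, projected)) for j in range(dim)]

identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # stand-ins for Bi-LSTM outputs
uniform = attentive_pool(hidden, [0.0, 0.0], identity)      # zero query: plain average
focused = attentive_pool(hidden, [20.0, -20.0], identity)   # query that favours h_1
```

With a zero query the pooling reduces to a plain average of the hidden states, while a strongly aligned query concentrates nearly all the weight on a single token.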

Our representation is similar to two recently-proposed frameworks by Washio and Kato (2018a, b), but differs in that: (1) they use dependency paths as context, while we use surface form; (2) they encode the context as either a lookup table or the last state of a unidirectional LSTM. We also use a different objective, which we discuss next.

2.2 Objective

To optimize our compositional representation functions, we consider two variants of the negative sampling objective Mikolov et al. (2013a): bivariate and multivariate. The bivariate objective models the two-way distribution of context and (monolithic) word pair co-occurrences. We extend this to model the multivariate (three-way) distribution of word-word-context co-occurrences. We further augment the multivariate objective with typed sampling to adversarially generate additional negative examples. We discuss the impact of the bivariate and multivariate objectives (and other components) in Section 4.3.

Bivariate Negative Sampling

On one hand, our objective aspires to make $R(x, y)$ and $C(c)$ similar (i.e. have high inner products) for $(x, y, c)$ that were observed together in the data. At the same time, we wish to keep our pair vectors dissimilar from random context vectors. In a straightforward application of the original (bivariate) negative sampling objective, we could generate a negative example from each observed $(x, y, c)$ instance by replacing the context $c$ with a randomly-sampled context $c^N$:

$$J_{2NS}(x, y, c) = \log \sigma\big(R(x, y) \cdot C(c)\big) + \sum_{i=1}^{k_c} \log \sigma\big(-R(x, y) \cdot C(c^N_i)\big) \qquad (1)$$

where $k_c$ is the number of sampled contexts.

Assuming that the negative contexts are sampled from the empirical distribution $P(\cdot, \cdot, c)$ (with $P(x, y, c)$ being the portion of $(x, y, c)$ instances in the dataset), we can follow the proof in Levy and Goldberg (2014) to show that this objective converges into the pointwise mutual information (PMI) between the word pair and the context:

$$R(x, y) \cdot C(c) = \log \frac{P(x, y, c)}{k_c \, P(x, y, \cdot) \, P(\cdot, \cdot, c)}$$

This objective primarily captures co-occurrences of word pairs and contexts. The modeling of co-occurrences between the target words $x$ and $y$ themselves is limited by the fact that the training data consists of word pairs that occur within a sentence. For better generalization to cross-sentence tasks, where the distribution of word pairs is different from that of the training data, we require a multivariate objective that captures the full three-way interaction between $x$, $y$, and $c$.
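The per-instance bivariate objective can be sketched in a few lines of Python. The vectors below are toy stand-ins for the pair and context representations, and the function names are hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def bivariate_ns(pair_vec, ctx_vec, neg_ctx_vecs):
    """Reward the observed context and penalise each sampled negative context."""
    value = log_sigmoid(dot(pair_vec, ctx_vec))
    value += sum(log_sigmoid(-dot(pair_vec, c_neg)) for c_neg in neg_ctx_vecs)
    return value

pair = [1.0, 0.0]
observed = [2.0, 0.0]
good = bivariate_ns(pair, observed, [[-2.0, 0.0]])   # negative context points away
bad = bivariate_ns(pair, observed, [[2.0, 0.0]])     # negative context looks like the positive
```

The objective is higher (closer to zero) when sampled contexts are dissimilar from the pair vector, which is what training pushes towards.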

Multivariate Negative Sampling

Here we depart from previous work, and in addition to negative sampling of contexts $c$, we also introduce negative sampling of the target words $x$ and $y$. Our ternary negative sampling objective, for each data instance, can be expressed as follows, with $k_c$, $k_x$, and $k_y$ negative samples of each kind (the global objective over the entire dataset is more complicated; we present it in Appendix A.1):

$$J_{3NS}(x, y, c) = \log \sigma\big(R(x, y) \cdot C(c)\big) + \sum_{i=1}^{k_c} \log \sigma\big(-R(x, y) \cdot C(c^N_i)\big) + \sum_{i=1}^{k_x} \log \sigma\big(-R(x^N_i, y) \cdot C(c)\big) + \sum_{i=1}^{k_y} \log \sigma\big(-R(x, y^N_i) \cdot C(c)\big)$$

Our new objective also converges to a novel multivariate generalization of PMI, different from previous PMI extensions that were inspired by information theory Van de Cruys (2011) and heuristics Jameel et al. (2018) (see Appendix A.3 for exact formulations). However, our unique formulation is derived from the multivariate negative sampling objective. Following the technique of Levy and Goldberg (2014), we can show that when replacing target words in addition to contexts, our multivariate objective will converge to the following (a full proof is provided in Appendix A.2):

$$R(x, y) \cdot C(c) = \log \frac{P(x, y, c)}{Z_{x, y, c}} \qquad (2)$$

where $Z_{x, y, c}$, the denominator, is:

$$Z_{x, y, c} = k_c \, P(\cdot, \cdot, c) \, P(x, y, \cdot) + k_x \, P(x, \cdot, \cdot) \, P(\cdot, y, c) + k_y \, P(\cdot, y, \cdot) \, P(x, \cdot, c) \qquad (3)$$

i.e. a linear mixture of marginal probability products. By introducing terms such as $P(x, \cdot, c)$ and $P(\cdot, y, c)$, the objective is essentially penalizing spurious correlations between one word and the context that disregard the other word. For instance, it would assign the pattern “X is a type of Y” a high score with (banana, fruit), but a lower score with (cat, fruit).

Though not necessary in our case, our objective can easily be extended to higher-order multivariate distributions by replacing each argument independently, which will likewise converge into a higher-order generalization of PMI.
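To make the banana/cat example concrete, the sketch below computes a ratio of the form $\log P(x, y, c) / Z$ from raw triplet counts, taking $Z$ to be the mixture of marginal products that results from replacing the context, the first word, or the second word. The toy corpus, the default of one negative sample per kind, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

def prob(counts, x=None, y=None, c=None):
    """Empirical probability of triplets matching the given slots (None = any)."""
    total = sum(counts.values())
    matched = sum(n for (tx, ty, tc), n in counts.items()
                  if x in (None, tx) and y in (None, ty) and c in (None, tc))
    return matched / total

def multivariate_pmi(counts, x, y, c, k_c=1, k_x=1, k_y=1):
    """log P(x, y, c) / Z, with Z a mixture of marginal probability products."""
    z = (k_c * prob(counts, c=c) * prob(counts, x=x, y=y)
         + k_x * prob(counts, x=x) * prob(counts, y=y, c=c)
         + k_y * prob(counts, y=y) * prob(counts, x=x, c=c))
    return math.log(prob(counts, x=x, y=y, c=c) / z)

IS_A = "X is a type of Y"
counts = Counter({
    ("banana", "fruit", IS_A): 4,
    ("apple", "fruit", IS_A): 4,
    ("cat", "animal", IS_A): 4,
    ("cat", "fruit", "X ate the Y"): 4,
    ("cat", "fruit", IS_A): 1,          # a single noisy observation
})
pmi_banana = multivariate_pmi(counts, "banana", "fruit", IS_A)
pmi_cat = multivariate_pmi(counts, "cat", "fruit", IS_A)
```

In this toy corpus (cat, fruit) co-occurs often, and both words appear with the pattern separately, yet the three-way score for (cat, fruit, “X is a type of Y”) stays well below that of (banana, fruit).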

Typed Sampling

In multivariate negative sampling, we typically replace $x$ and $y$ by sampling from their unigram distributions. In addition to this, we also sample uniformly from the top 100 words according to cosine similarity using distributional word vectors. The rationale for doing so is to encourage the model to learn relations between specific instances as opposed to more general types. For example, using California as a negative sample for Oregon helps the model to learn that the pattern “X is located in Y” fits the pair (Portland, Oregon), but not the pair (Portland, California). Similar adversarial constraints were used in knowledge base completion Toutanova et al. (2015) as well as word embeddings Li et al. (2017). (Applying typed sampling also changes the value to which our objective will converge, and will replace the unigram probabilities in Equation (3) to reflect the type-based distribution.)
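Typed sampling can be sketched as a nearest-neighbour lookup followed by a uniform draw. The two-dimensional vectors are toy stand-ins for distributional word vectors, and `typed_negative` is a hypothetical helper name.

```python
import math
import random

def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v)))

def typed_negative(word, vectors, rng, top_k=100):
    """Sample uniformly from the top_k words most similar to `word` by cosine."""
    neighbours = sorted((w for w in vectors if w != word),
                        key=lambda w: cosine(vectors[word], vectors[w]),
                        reverse=True)[:top_k]
    return rng.choice(neighbours)

vectors = {
    "oregon": [1.0, 0.1],
    "california": [0.9, 0.2],
    "texas": [0.8, 0.3],
    "wheat": [0.0, 1.0],
    "banana": [-1.0, 0.5],
}
rng = random.Random(0)
samples = {typed_negative("oregon", vectors, rng, top_k=2) for _ in range(50)}
```

With `top_k=2` the negatives for "oregon" are always other states, so the model has to separate (Portland, Oregon) from (Portland, California) rather than from an unrelated word.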

3 Adding pair2vec to Inference Models

Figure 1: The typical architecture of a cross-sentence attention model (left), and how pair2vec is added to it (right). Given two sequences $a$ and $b$, existing models create $b$-aware representations of words in $a$. For any word $a_i$, this typically involves the Bi-LSTM representation of word $a_i$, and the attention-weighted (with respect to $a_i$) representation of the BiLSTM states of $b$. To this, we add the attention-weighted word-pair representation of $a_i$ and the words in $b$. The boldfaced arrows in the attention layer indicate relatively stronger word-pair alignments such as (cheap, expensive).

We first present a general outline for incorporating pair2vec into attention-based architectures, and then discuss changes made to BiDAF++ and ESIM. The key idea is to inject our pairwise representations into the attention layer by reusing the cross-sentence attention weights. In addition to attentive pooling over single word representations, we also pool over cross-sentence word pair embeddings (Figure 1).

3.1 General Approach

Pair Representation

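We assume that we are given two sequences $a = a_1 \ldots a_n$ and $b = b_1 \ldots b_m$. We represent the word-pair embeddings between $a$ and $b$ using the pretrained pair2vec model as:

$$\mathbf{r}_{i,j} = \left[ \frac{R(a_i, b_j)}{\lVert R(a_i, b_j) \rVert} ; \frac{R(b_j, a_i)}{\lVert R(b_j, a_i) \rVert} \right] \qquad (4)$$

We include embeddings in both directions, $R(a_i, b_j)$ and $R(b_j, a_i)$, because many relations can be expressed in both directions; e.g., hypernymy can be expressed via “X is a type of Y” as well as “Y such as X”. We take the L2 normalization of each direction’s pair embedding because the heavy-tailed distribution of word pairs results in significant variance of their norms.

Base Model

Let $\mathbf{a}_1 \ldots \mathbf{a}_n$ and $\mathbf{b}_1 \ldots \mathbf{b}_m$ be the vector representations of sequences $a$ and $b$, as produced by the input encoder (e.g. ELMo embeddings contextualized with model-specific BiLSTMs). Furthermore, we assume that the base model computes soft word alignments between $a$ and $b$ via co-attention (5, 6), which are then used to compute $b$-aware representations of $a$:

$$s_{i,j} = f_{att}(\mathbf{a}_i, \mathbf{b}_j) \qquad (5)$$
$$\alpha = \mathrm{softmax}_j(s_{i,j}) \qquad (6)$$
$$\bar{\mathbf{b}}_i = \sum_{j} \alpha_{i,j} \mathbf{b}_j \qquad (7)$$
$$\mathbf{a}^{inf}_i = [\mathbf{a}_i ; \bar{\mathbf{b}}_i] \qquad (8)$$

The symmetric term $\mathbf{b}^{inf}_j$ is defined analogously. We refer to $\mathbf{a}^{inf}$ and $\mathbf{b}^{inf}$ as the inputs to the inference layer, since this layer computes some function over aligned word pairs, typically via a feedforward network and LSTMs Seo et al. (2017); Peters et al. (2018); Chen et al. (2017). The inference layer is followed by an aggregation function (usually sum) and an output layer.

Injecting pair2vec

We conjecture that the inference layer effectively learns word-pair relationships from training data, and it should, therefore, help to augment its input with pair2vec. We augment $\mathbf{a}^{inf}_i$ (8) with the pair vectors $\mathbf{r}_{i,j}$ (4) by concatenating a weighted average of the pair vectors involving $a_i$, where the weights are the same $\alpha_{i,j}$ computed via attention in (6):

$$\mathbf{a}^{inf}_i = \Big[ \mathbf{a}_i ; \bar{\mathbf{b}}_i ; \sum_{j} \alpha_{i,j} \mathbf{r}_{i,j} \Big] \qquad (9)$$

The symmetric term $\mathbf{b}^{inf}_j$ is defined analogously.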
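The injection step can be sketched as follows: the attention weights computed for pooling the other sequence are reused to pool the pair vectors. Dot-product attention is an assumption here (the base model defines its own attention function), and the toy encodings and pair vectors are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]

def inference_inputs(a, b, r):
    """For each position i in a, build [a_i; attended b; attended pair vectors]."""
    out = []
    for i, a_i in enumerate(a):
        # Dot-product attention scores (assumption; models use their own f_att).
        scores = [sum(p * q for p, q in zip(a_i, b_j)) for b_j in b]
        alpha = softmax(scores)
        b_bar = weighted_sum(alpha, b)        # standard attended representation
        r_bar = weighted_sum(alpha, r[i])     # same weights, applied to pair vectors
        out.append(a_i + b_bar + r_bar)       # list concatenation
    return out

a = [[1.0, 0.0], [0.0, 1.0]]
b = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]
r = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
     [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]
augmented = inference_inputs(a, b, r)
```

Since no new attention parameters are introduced, the only change to the base model is a wider input to the inference layer.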

3.2 Question Answering

We augment the inference layer in the BiDAF++ model with pair2vec. BiDAF++ is an improved version of the BiDAFNoAnswer model Seo et al. (2017); Levy et al. (2017), which includes self-attention and ELMo embeddings from Peters et al. (2018). We found this variant to be stronger than the baselines presented in Rajpurkar et al. (2018) by over 2.5 F1. We use BiDAF++ as a baseline since its architecture is typical for QA systems, and, until recently, was state-of-the-art on SQuAD 2.0 and other benchmarks.


Let $\mathbf{a}_1 \ldots \mathbf{a}_n$ be the output of the passage encoder, and $\mathbf{b}_1 \ldots \mathbf{b}_m$ be the output of the question encoder. (The passage is typically denoted $p$ and the question $q$, but we use $a$ for passage and $b$ for question for consistency with our general notation.) The inference layer’s inputs $\mathbf{a}^{inf}_i$ are defined similarly to the generic model’s in (8), but also contain an aggregation of the elements in $a$, with better-aligned elements receiving larger weights:


In the later layers, $\mathbf{a}^{inf}$ is projected and recontextualized using a BiGRU. This is followed by self-attention, and finally a prediction layer that predicts start and end tokens.

BiDAF++ with pair2vec

To add our pair vectors, we simply concatenate (4) to (13):


3.3 Natural Language Inference

For NLI, we augment the ESIM model Chen et al. (2017), which was previously state-of-the-art on both SNLI Bowman et al. (2015) and MultiNLI Williams et al. (2018) benchmarks.


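Let $\mathbf{a}_1 \ldots \mathbf{a}_n$ be the output of the premise encoder, and $\mathbf{b}_1 \ldots \mathbf{b}_m$ be the output of the hypothesis encoder. (The premise is typically denoted $p$ and the hypothesis $h$, but we use $a$ for premise and $b$ for hypothesis for consistency with our general notation.) The inference layer’s inputs $\mathbf{a}^{inf}_i$ (and $\mathbf{b}^{inf}_j$) are defined similarly to the generic model’s in (8):

$$\mathbf{a}^{inf}_i = [\mathbf{a}_i ; \bar{\mathbf{b}}_i ; \mathbf{a}_i \circ \bar{\mathbf{b}}_i ; \mathbf{a}_i - \bar{\mathbf{b}}_i] \qquad (15)$$

In the later layers, $\mathbf{a}^{inf}$ and $\mathbf{b}^{inf}$ are projected, recontextualized, and converted to a fixed-length vector for each sentence using multiple pooling schemes. These vectors are then passed on to an output layer, which predicts the class.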

ESIM with pair2vec

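To add our pair vectors, we simply concatenate (4) to (15):

$$\mathbf{a}^{inf}_i = \Big[ \mathbf{a}_i ; \bar{\mathbf{b}}_i ; \mathbf{a}_i \circ \bar{\mathbf{b}}_i ; \mathbf{a}_i - \bar{\mathbf{b}}_i ; \sum_{j} \alpha_{i,j} \mathbf{r}_{i,j} \Big] \qquad (16)$$

The vector $\mathbf{b}^{inf}_j$ is augmented analogously.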

A very similar augmentation of ESIM was recently proposed in KIM Chen et al. (2018). However, the pair vectors are composed of manually-engineered WordNet features, while our pair embeddings are learned directly from text (see further discussion in Section 6).

4 Experiments

For experiments on question answering (Section 4.1) and natural language inference (Section 4.2), we use our full model, which includes multivariate and typed negative sampling. We test other objectives and composition functions in Section 4.3.


We use a January 2018 dump of Wikipedia, containing 96M sentences, to train pair2vec. We restrict the vocabulary to the top 100K words based on frequency. During preprocessing, we remove all words in the corpus that aren’t in the vocabulary. We consider each word pair within a window of 5 in the preprocessed corpus, and subsample instances based on word-pair probability with a threshold of . (As in word2vec, subsampling reduces the size of the training dataset and speeds up training. We define the word-pair probability as the product of the unigram probabilities of the individual words in the pair.) We define the context as one word to the left, all the words in between, and one word to the right of each word-pair occurrence. Additionally, we replace both words of the pair in the context with the placeholder tokens X and Y (see Table 1).
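The context definition can be illustrated with a short extraction routine. It assumes that the left word of each pair plays the role of X and the right word the role of Y, and that "within a window of 5" means at most five positions apart; vocabulary filtering and subsampling are omitted, and `extract_triplets` is a hypothetical name.

```python
def extract_triplets(tokens, window=5):
    """Return (x, y, context) for every token pair at most `window` positions apart."""
    triplets = []
    for i in range(len(tokens)):
        last = min(i + window, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            left = max(0, i - 1)                 # one word to the left of x
            right = min(len(tokens), j + 2)      # one word to the right of y
            context = list(tokens[left:right])
            context[i - left] = "X"              # placeholder for the first word
            context[j - left] = "Y"              # placeholder for the second word
            triplets.append((tokens[i], tokens[j], " ".join(context)))
    return triplets

sentence = "the android team at google".split()
triplets = extract_triplets(sentence)
```

For this sentence the pair (android, google) yields the context "the X team at Y", matching the form of the entries in Table 1.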


For both word pairs and contexts, we use 300-dimensional word embeddings initialized with FastText Bojanowski et al. (2017). The context representation uses a single-layer Bi-LSTM with a hidden layer size of 100. We use 2 negative context samples and 3 negative argument samples for each pair-context tuple.

For pre-training, we use stochastic gradient descent with an initial learning rate of 0.01. We reduce the learning rate by a factor of 0.9 if the loss does not decrease for 300K steps. We use a batch size of 600, and train for 12 epochs. (On Titan X GPUs, the training takes about a week.)
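One way to read this schedule is as a plateau-based decay. The sketch below assumes that "reduce by a factor of 0.9" means multiplying the rate by 0.9 and that the step counter resets after each decay; the class name is hypothetical and this is not the training code used in the experiments.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` steps without a new best loss."""

    def __init__(self, lr=0.01, factor=0.9, patience=300_000):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.stale = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr

schedule = PlateauDecay(patience=3)              # tiny patience for the demo
for loss in [1.0, 0.9, 0.95, 0.95, 0.95]:
    current_lr = schedule.step(loss)
```

In the demo the loss fails to improve for three consecutive steps, so the rate drops once from 0.01 to 0.009.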

For both end-task models, we use AllenNLP’s implementations Gardner et al. (2017) with default hyperparameters; we did not change any setting before or after injecting pair2vec. We use 0.15 dropout on our pretrained pair embeddings.

4.1 Question Answering

Benchmark    Metric   BiDAF   + pair2vec   Δ
SQuAD 2.0    EM       65.66   68.02        +2.36
             F1       68.86   71.58        +2.72
AddSent      EM       37.50   44.20        +6.70
             F1       42.55   49.69        +7.14
AddOneSent   EM       48.20   53.30        +5.10
             F1       54.02   60.13        +6.11

Table 2: Performance on SQuAD 2.0 and adversarial SQuAD benchmarks, with and without pair2vec. All models have ELMo.

We experiment on the SQuAD 2.0 extractive question answering benchmark Rajpurkar et al. (2018), as well as the adversarially-created datasets of SQuAD 1.1 Rajpurkar et al. (2016); Jia and Liang (2017). Table 2 shows the performance of BiDAF++, with ELMo embeddings Peters et al. (2018), before and after adding pair2vec. Experiments on SQuAD 2.0 show that our pair representations improve performance by 2.72 F1. Moreover, adding pair2vec also results in better generalization on the adversarial SQuAD datasets with gains of 7.14 and 6.11 F1.

4.2 Natural Language Inference

Benchmark    ESIM    + pair2vec   Δ
Matched      79.68   81.03        +1.35
Mismatched   78.80   80.12        +1.32

Table 3: Performance on MultiNLI, with and without pair2vec. All models have ELMo.

Model                       Glockner
Rule-based Models
  WordNet Baseline          85.8
Models with GloVe
  ESIM Chen et al. (2017)   77.0
  KIM Chen et al. (2018)    87.7
  ESIM + pair2vec           92.9
Models with ELMo
  ESIM Peters et al. (2018) 84.6
  ESIM + pair2vec           93.4

Table 4: Performance on the out-of-domain NLI test set of Glockner et al. (2018).

We report the performance of our model on MultiNLI and the out-of-domain test set from Glockner et al. (2018) in Tables 3 and 4, respectively. We outperform the ESIM + ELMo baseline by 1.3% on the matched as well as mismatched portions of the dataset.

We also record a gain of 8.8% absolute over ESIM on the Glockner et al. (2018) test set, making our approach the state-of-the-art. Following the experimental setup in Glockner et al. (2018), we train all models on a combination of SNLI Bowman et al. (2015) and MultiNLI. Glockner et al. (2018) show that with the exception of KIM Chen et al. (2018), which uses WordNet features, several NLI models fail to generalize to this setting, which involves lexical inference. For a fair comparison with KIM on the Glockner test set, we replace ELMo with GloVe embeddings, and still outperform KIM by almost halving the error rate.
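The "almost halving" claim can be checked directly from the GloVe rows of Table 4. The short calculation below uses only the two reported accuracies; the variable names are ours.

```python
kim_accuracy = 87.7          # KIM, Table 4
pair2vec_accuracy = 92.9     # ESIM + pair2vec with GloVe, Table 4

kim_error = 100.0 - kim_accuracy
pair2vec_error = 100.0 - pair2vec_accuracy
relative_error_reduction = (kim_error - pair2vec_error) / kim_error
```

The error rate falls from 12.3% to 7.1%, a relative reduction of roughly 42%.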

4.3 Ablations

Model                     EM (Δ)          F1 (Δ)
pair2vec (Full Model)     69.20           72.68
Composition: 2 Layers     68.35 (-0.85)   71.65 (-1.03)
Composition: Multiply     67.10 (-2.20)   70.20 (-2.48)
Objective: Bivariate NS   68.63 (-0.57)   71.98 (-0.70)
Unsupervised: Pair Dist   67.07 (-2.13)   70.24 (-2.44)
No pair2vec (BiDAF)       66.66 (-2.54)   69.90 (-2.78)

Table 5: Ablations on the SQuAD 2.0 development set show that argument sampling as well as using a deeper composition function are useful.

Ablating parts of pair2vec shows that all components of the model (Section 2) are useful. To show the importance of each component, we ablate it and report the EM and F1 on the development set of SQuAD 2.0 (Table 5). The full model, which uses a 4-layer MLP for $R(x, y)$ and trains with multivariate negative sampling, achieves the highest F1 of 72.68.

We experiment with two alternative composition functions, a 2-layer MLP (Composition: 2 Layers) and element-wise multiplication (Composition: Multiply), which yield significantly smaller gains over the baseline BiDAF++ model. This result demonstrates the need for a deep and complicated composition function. By eliminating sampling of target words from the objective, we are also able to evaluate the bivariate negative sampling objective (Objective: Bivariate NS). Here we see a drop of 0.7 F1, accounting for about a quarter of the overall gain. This suggests that while the bulk of the signal is mined from the pair-context interactions, there is also valuable information in other interactions as well.

We also test whether specific pre-training of word-pair representations is useful by replacing pair2vec embeddings with the vector offsets of pre-trained word embeddings (Unsupervised: Pair Dist). We follow the PairDistance method for word analogies Mikolov et al. (2013b), and represent the pair $(x, y)$ as the L2-normalized difference of single-word vectors: $R(x, y) = (\mathbf{e}_x - \mathbf{e}_y) / \lVert \mathbf{e}_x - \mathbf{e}_y \rVert$. We use the same FastText word vectors with which we initialized pair2vec before training. We observe a gain of only 0.34 F1 over the baseline, about 8 times smaller than the gains from the full pair2vec model.

5 Analysis

Figure 2: Accuracy as a function of the interpolation parameter $\alpha$ (see Equation (17)). The $\alpha = 0$ configuration relies only on FastText vector offsets, while $\alpha = 1$ reflects pair2vec.

Relation            3CosAdd   +pair2vec   α*
Country:Capital     1.2       86.1        0.9
Name:Occupation     1.8       44.6        0.8
Name:Nationality    0.1       42.0        0.9
UK City:County      0.7       31.7        1.0
Country:Language    4.0       28.4        0.8
Verb 3pSg:Ved       49.1      61.7        0.6
Verb Ving:Ved       61.1      73.3        0.5
Verb Inf:Ved        58.5      70.1        0.5
Noun+less           4.8       16.0        0.2
Substance Meronym   3.8       14.5        0.6

Table 6: The top 10 analogy relations for which interpolating with pair2vec improves performance. α* is the optimal interpolation parameter for each relation, at single-digit resolution.

Relation             Context                   X           Y (Top 3)
Antonymy/Exclusion   either X or Y             accept      reject, refuse, recognise
                                               birth       death, fertility, arrival
                                               hard        soft, brittle, polished
Hypernymy            including X and other Y   copper      ones, metals, mines
                                               google      apps, browsers, searches
                                               oak         oaks, hardwoods, elms
Hyponymy             X like Y                  cities      solaris, speyer, medina
                                               browsers    chrome, firefox, netscape
                                               companies   bravo, nike, uber
Co-hyponymy          , X , Y ,                 copper      malachite, flint, ivory
                                               google      microsoft, bing, yahoo
                                               oak         araucaria, hornbeam, blackjack
Meronymy             X is made of Y            cake        marzipan, icing, candles
                                               statue      bronze, lenin, wax
                                               utensils    copper, tea, brass
City-State           in X , Y .                portland    oregon, maine, dorset
                                               dallas      tx, texas, va
                                               oakland     california, ca, piedmont
City-City            from X to Y .             portland    salem, astoria, ogdensburg
                                               dallas      denton, allatoona, addison
                                               oakland     hayward, emeryville, berkeley
Profession           X , a famous Y ,          ronaldo     footballer, portuguese, player
                                               monet       painter, painting, butterfly
                                               orwell      writer, author, essay

Table 7: Given a context $c$ and a word $x$, we select the top 3 words $y$ from the entire vocabulary using our trained scoring function $R(x, y) \cdot C(c)$. The analysis suggests that the model tends to rank correct matches over others.

In Section 4 we showed that pair2vec adds additional information on top of single-word representations such as ELMo Peters et al. (2018). Here, we ask what this extra information is, and try to characterize which lexical relations are indeed better captured by pair2vec. To that end, we evaluate performance on a word analogy dataset with over 40 different relation types (Section 5.1), and observe how pair2vec fills hand-crafted relation patterns (Section 5.2).

5.1 Quantitative Analysis: Word Analogies

Word Analogy Dataset

Given a word pair $(a, b)$ and a word $x$, the word analogy task involves predicting a word $y$ such that $a : b :: x : y$. We use the Bigger Analogy Test Set (BATS) Gladkova et al. (2016), which contains four groups of relations: encyclopedic semantics (e.g., person-profession as in Einstein-physicist), lexicographic semantics (e.g., antonymy as in cheap-expensive), derivational morphology (e.g., noun forms as in oblige-obligation), and inflectional morphology (e.g., noun-plural as in bird-birds). Each group contains 10 sub-relations.


We interpolate pair2vec scores with those of the 3CosAdd method Mikolov et al. (2013b); Levy et al. (2014) on FastText embeddings, as follows:

$$\mathrm{score}(y) = \alpha \cdot \cos(\mathbf{r}_{a,b}, \mathbf{r}_{x,y}) + (1 - \alpha) \cdot \cos(\mathbf{b} - \mathbf{a} + \mathbf{x}, \mathbf{y}) \qquad (17)$$

where $\mathbf{a}$, $\mathbf{b}$, $\mathbf{x}$, and $\mathbf{y}$ represent FastText embeddings and $\mathbf{r}_{a,b}$, $\mathbf{r}_{x,y}$ represent the pair2vec embeddings for the word pairs $(a, b)$ and $(x, y)$, respectively; $\alpha$ is the linear interpolation parameter. (The FastText embeddings in the analysis were retrained using the same Wikipedia corpus used to train pair2vec, to control for the corpus when comparing the two methods.) Following Mikolov et al. (2013b), we return the highest-scoring $y$ in the entire vocabulary, excluding the given words $a$, $b$, and $x$.
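The interpolation can be sketched as below, assuming the score is a convex combination of a pair-vector cosine (weight `alpha`) and the 3CosAdd cosine (weight `1 - alpha`). The two-dimensional word vectors and the hand-written pair vectors are toy stand-ins, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v)))

def interpolated_score(alpha, r_ab, r_xy, a, b, x, y):
    """alpha weights the pair-vector similarity; (1 - alpha) weights 3CosAdd."""
    offset = [bi - ai + xi for ai, bi, xi in zip(a, b, x)]
    return alpha * cosine(r_ab, r_xy) + (1.0 - alpha) * cosine(offset, y)

def solve_analogy(alpha, a, b, x, word_vecs, pair_vecs):
    """Return the best-scoring word, excluding the three given words."""
    candidates = [w for w in word_vecs if w not in (a, b, x)]
    return max(candidates, key=lambda w: interpolated_score(
        alpha, pair_vecs[(a, b)], pair_vecs[(x, w)],
        word_vecs[a], word_vecs[b], word_vecs[x], word_vecs[w]))

word_vecs = {"france": [1.0, 0.0], "paris": [1.0, 1.0], "japan": [2.0, 0.0],
             "tokyo": [2.0, 1.0], "kyoto": [2.2, 0.9]}
pair_vecs = {("france", "paris"): [1.0, 0.0], ("japan", "tokyo"): [1.0, 0.0],
             ("japan", "kyoto"): [0.0, 1.0]}
answers = [solve_analogy(alpha, "france", "paris", "japan", word_vecs, pair_vecs)
           for alpha in (0.0, 0.5, 1.0)]
```

Setting `alpha` to 0 recovers plain 3CosAdd and setting it to 1 uses the pair vectors alone, which is how the two endpoints of the interpolation are compared.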


Figure 2 shows how the accuracy on each category of relations varies with $\alpha$. For all four groups, adding pair2vec to 3CosAdd results in significant gains. In particular, the biggest relative improvements are observed for encyclopedic (356%) and lexicographic (51%) relations.

Table 6 shows the specific relations in which pair2vec made the largest absolute impact. The gains are particularly significant for relations where FastText embeddings provide limited signal. For example, the accuracy for substance meronyms goes from 3.8% to 14.5%. In some cases, there is also a synergistic effect; for instance, in noun+less, pair2vec alone scored 0% accuracy, but mixing it with 3CosAdd, which got 4.8% on its own, yielded 16% accuracy.

These results, alongside our experiments in Section 4, strongly suggest that pair2vec encodes information complementary to that in single-word embedding methods such as FastText and ELMo.

5.2 Qualitative Analysis: Slot Filling

To further explore how pair2vec encodes such complementary information, we consider a setting similar to that of knowledge base completion: given a Hearst-like context pattern $c$ and a single word $x$, predict the other word $y$ from the entire vocabulary. We rank candidate words $y$ based on the scoring function in our training objective: $R(x, y) \cdot C(c)$. We use a fixed set of example relations and manually define their predictive context patterns and a small set of candidate words $x$.
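Slot filling then amounts to ranking candidate words by the inner product of the pair vector with the context vector. In the sketch below the pair vectors and context vectors are hand-written toy values, and `rank_fillers` is a hypothetical helper, so it only illustrates the ranking step.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def rank_fillers(x, context_vec, candidates, pair_repr, k=3):
    """Return the k candidates y with the highest pair-context inner product."""
    scored = sorted(((dot(pair_repr(x, y), context_vec), y)
                     for y in candidates if y != x), reverse=True)
    return [y for _, y in scored[:k]]

toy_pairs = {
    ("portland", "oregon"): [0.9, 0.1],
    ("portland", "maine"): [0.7, 0.2],
    ("portland", "salem"): [0.1, 0.9],
    ("portland", "banana"): [-0.5, -0.5],
}
lookup = lambda x, y: toy_pairs[(x, y)]          # stand-in for the trained pair function
city_state = [1.0, 0.0]                          # stand-in for the "in X , Y ." context
city_city = [0.0, 1.0]                           # stand-in for the "from X to Y ." context
vocab = ["oregon", "maine", "salem", "banana"]
top_states = rank_fillers("portland", city_state, vocab, lookup, k=2)
top_cities = rank_fillers("portland", city_city, vocab, lookup, k=1)
```

The same word receives different top fillers under different patterns, mirroring the Portland rows of Table 7.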

Table 7 shows the top three words. The model embeds correct pairs close to contexts predictive of the relation between them. For example, substituting Portland in the city-state pattern (“in X, Y .”), the top two words are Oregon and Maine, both US states with cities named Portland. When used with the city-city pattern (“from X to Y .”), the top two words are Salem and Astoria, both cities in Oregon. The word-context interaction often captures multiple relations; for example, Monet is used to refer to the painter (profession) as well as his paintings.

As intended, pair2vec captures the three-way interaction between word pairs and context, and not just the two-way interaction between a single word and its context (as in single-word embeddings). This profound difference allows pair2vec to complement single-word embeddings with additional information.

6 Related Work

Pretrained Word Embeddings

Many state-of-the-art models initialize their word representations using pretrained word embeddings such as word2vec Mikolov et al. (2013a) or ELMo Peters et al. (2018). These representations are typically trained using an interpretation of the Distributional Hypothesis Harris (1954) in which the bivariate distribution of target words and contexts is modeled. Our work deviates from the word embedding literature in two major aspects. First, our goal is to represent pairs of words, not individual words. Second, our new PMI formulation models the trivariate distribution between the first word, the second word, and their joint context. Experiments show that our pair embeddings can complement individual word embeddings, and that they are perhaps capturing information that eludes the traditional interpretation of the Distributional Hypothesis.

Mining Textual Patterns

There is extensive literature on mining textual patterns to predict relations between words Hearst (1992); Snow et al. (2005); Turney (2005); Riedel et al. (2013); Toutanova et al. (2015); Shwartz and Dagan (2016). These approaches focus on relations between pairs of nouns (perhaps with the exception of VerbOcean Chklovski and Pantel (2004), which models relations between verbs). More recently, these approaches have been expanded to predict relations between unrestricted pairs of words Jameel et al. (2018); Espinosa Anke and Schockaert (2018), assuming that each word pair was observed together during pretraining. Washio and Kato (2018a, b) relax this assumption with a compositional model that can represent any word pair, as long as each word appeared (individually) in the corpus.

These methods are evaluated on either intrinsic relation prediction tasks, such as BLESS Baroni and Lenci (2011) and CogALex Santus et al. (2016), or knowledge-base population benchmarks, e.g. FB15k Bordes et al. (2013). To the best of our knowledge, our work is the first to integrate pattern-based methods into modern high-performing semantic models and evaluate their impact on complex end-tasks like QA and NLI.

Integrating Knowledge in Complex Models

There has been some work on injecting external knowledge into NLP models. Ahn et al. (2016) integrate Freebase facts into a language model using a copying mechanism over fact attributes. Yang and Mitchell (2017) modify the LSTM cell to incorporate WordNet and NELL knowledge for event and entity extraction. For cross-sentence inference tasks, Weissenborn et al. (2018) and Mihaylov and Frank (2018) dynamically refine word representations by reading free text assertions from ConceptNet and Wikipedia abstracts. Our approach, on the other hand, relies on a relatively simple extension of existing cross-sentence inference models. Furthermore, we do not need to dynamically retrieve and process knowledge base facts or Wikipedia texts, and just pretrain our pair vectors in advance.

Perhaps the most similar approach to ours is KIM Chen et al. (2018), which integrates word-pair vectors into the ESIM model for NLI in a very similar way to ours. However, KIM’s word-pair vectors contain only hand-engineered word-relation indicators from WordNet, whereas our word-pair vectors are automatically learned from unlabeled text. Our vectors can therefore reflect relation types that do not exist in WordNet (such as profession) as well as word pairs that do not have a direct link in WordNet (e.g. bronze and statue); see Table 7 for additional examples.

7 Conclusion

We presented new methods for learning and using embeddings of word pairs that implicitly represent background knowledge. Our pairwise embeddings are computed as a compositional function of representations of the individual words, which is learned by maximizing a variant of the pointwise mutual information (PMI) with the contexts in which the two words co-occur. Experiments on benchmark cross-sentence inference datasets demonstrated that adding these representations to existing models results in sizable improvements for both in-domain and adversarial settings.


Appendix A Multivariate Negative Sampling

In this appendix, we elaborate on mathematical details of multivariate negative sampling to support our claims in Section 2.2.

A.1 Global Objective

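The multivariate objective in Section 2.2 characterizes the local objective for each data instance. To understand the mathematical properties of this objective, we must first describe the global objective in terms of the entire dataset. However, this cannot be done by simply summing the local objective for each $(x, y, c)$, since each such example may appear multiple times in our dataset. Moreover, due to the nature of negative sampling, the number of times an $(x, y, c)$ triplet appears as a positive example will almost always be different from the number of times it appears as a negative one. Therefore, we must determine the frequency with which each triplet appears in each role.

We first denote the number of times the example $(x, y, c)$ appears in the dataset as $\#(x, y, c)$; this is also the number of times $(x, y, c)$ is used as a positive example. We observe that the expected number of times $(x, y, c)$ is used as a corrupt $x$ example is $k_x P(x, \cdot, \cdot) \#(\cdot, y, c)$, since $(x, y, c)$ can only be created as a corrupt $x$ example by randomly sampling $x$ from an example that already contained $y$ and $c$. The number of times $(x, y, c)$ is used as a corrupt $y$ or $c$ example can be derived analogously. Therefore, the global objective of our ternary negative sampling approach is:

$$J^{Global}_{3NS} = \sum_{(x, y, c)} J^{x, y, c}_{3NS} \qquad (18)$$
$$J^{x, y, c}_{3NS} = \#(x, y, c) \cdot \log \sigma(S_{x, y, c}) + Z'_{x, y, c} \cdot \log \sigma(-S_{x, y, c}) \qquad (19)$$

where

$$Z'_{x, y, c} = k_c P(\cdot, \cdot, c) \#(x, y, \cdot) + k_x P(x, \cdot, \cdot) \#(\cdot, y, c) + k_y P(\cdot, y, \cdot) \#(x, \cdot, c) \qquad (20)$$
$$S_{x, y, c} = R(x, y) \cdot C(c) \qquad (21)$$

Van de Cruys (2011)
Jameel et al. (2018)
(This Work)
Table 8: Multivariate generalizations of PMI.

A.2 Relation to Multivariate PMI

With the global objective, we can now ask what is the optimal value of $S_{x, y, c}$ (21) by comparing the partial derivative of (18) to zero. This derivative is in fact equal to the partial derivative of (19), since it is the only component of the global objective in which $R(x, y) \cdot C(c)$ appears:

$$0 = \frac{\partial J^{Global}_{3NS}}{\partial S_{x, y, c}} = \frac{\partial J^{x, y, c}_{3NS}}{\partial S_{x, y, c}}$$

The partial derivative of (19) can be expressed as:

$$0 = \#(x, y, c) \cdot \sigma(-S_{x, y, c}) - Z'_{x, y, c} \cdot \sigma(S_{x, y, c})$$

which can be reformulated as:

$$S_{x, y, c} = \log \frac{\#(x, y, c)}{Z'_{x, y, c}}$$

By expanding the fraction by $1 / \#(\cdot, \cdot, \cdot)$ (i.e. dividing by the size of the dataset), we essentially convert all the frequency counts (e.g. $\#(x, y, c)$) to empirical probabilities (e.g. $P(x, y, c)$), and arrive at Equation (2) in Section 2.2.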
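The stationary point can be verified numerically. The sketch below assumes the per-triplet objective has the form "positive count times log-sigmoid of the score, plus expected negative count times log-sigmoid of the negated score"; the two counts are arbitrary toy values and the names are ours.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def local_objective(s, n_pos, z_neg):
    """n_pos * log sigma(s) + z_neg * log sigma(-s) for a single triplet."""
    return n_pos * math.log(sigmoid(s)) + z_neg * math.log(sigmoid(-s))

n_pos, z_neg = 40.0, 10.0                 # toy positive count and expected negative count
s_star = math.log(n_pos / z_neg)          # claimed optimum: log of the count ratio
gradient_at_optimum = n_pos * sigmoid(-s_star) - z_neg * sigmoid(s_star)
```

The gradient vanishes at the log-ratio of the two counts, and because the objective is concave in the score, nudging the score in either direction lowers its value.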

A.3 Other Multivariate PMI Formulations

Previous work has proposed different multivariate formulations of PMI, shown in Table 8. Van de Cruys (2011) presented specific interaction information () and specific correlation (). In addition to those metrics, Jameel et al. (2018) experimented with , which is the bivariate PMI between and , and with . Our formulation deviates from previous work, and, to the best of our knowledge, cannot be trivially expressed by one of the existing metrics.