New Embedded Representations and Evaluation Protocols for Inferring Transitive Relations

by   Sandeep Subramanian, et al.

Beyond word embeddings, continuous representations of knowledge graph (KG) components, such as entities, types and relations, are widely used for entity mention disambiguation, relation inference and deep question answering. Great strides have been made in modeling general, asymmetric or antisymmetric KG relations using Gaussian, holographic, and complex embeddings. None of these directly enforces the transitivity inherent in the is-instance-of and is-subtype-of relations. A recent proposal, called order embedding (OE), demands that the vector representing a subtype elementwise dominates the vector representing a supertype. However, the manner in which such constraints are asserted and evaluated has some limitations. In this short research note, we make three contributions specific to representing and inferring transitive relations. First, we propose and justify a significant improvement to the OE loss objective. Second, we propose a new representation of types as hyper-rectangular regions, which generalizes and improves on OE. Third, we show that some current protocols to evaluate transitive relation inference can be misleading, and offer a sound alternative. Rather than use black-box deep learning modules off-the-shelf, we develop our training networks using elementary geometric considerations.




1. Introduction

Contemporary information extraction from text, relation inference in knowledge graphs (KGs), and question answering (QA) are informed by continuous representations of words, entities, types and relations. Faced with the query “Name scientists who played the violin,” and having collected candidate response entities, a QA system will generally want to verify if a candidate is a scientist. Testing if $e \in t$ or $t_1 \subseteq t_2$, where $e$ is an entity and $t, t_1, t_2$ are types, is therefore a critical requirement. Unlike Albert Einstein, lesser-known candidates may not be registered in knowledge graphs, and we may need to assign a confidence score of belongingness to a target type.

A common recipe for inferring general relations between entities is to fit suitable vectors to each of them, and to train a network to input query vectors and predict presence or absence of the probed relationship. A key question has been whether types merit a special representation, different from the generic devices that represent KG relations, because of their special properties. Two types may be disjoint, overlapping, or one may contain the other. Containment is transitive.

Compared to the vast array of entity-relation representations available (Bordes et al., 2013; Nickel, 2013; Nickel et al., 2016; Trouillon et al., 2016; Xie et al., 2017), few proposals exist (Vilnis and McCallum, 2014; Vendrov et al., 2015; Jameel and Schockaert, 2016) for representing types to satisfy their specific requirements. Of these, only order embedding (OE) by Vendrov et al. (2015) directly enforces transitivity by modeling it as elementwise vector dominance.

We make three contributions. First, we present a significant improvement to the OE loss objective. Second, we generalize OE to rectangle embeddings for types: types and entities are represented by (hyper-)rectangles and points respectively. Ideally, type rectangles contain subtype rectangles and entity instance points. Rather than invoke established neural gadgets as black boxes, we introduce constraints and loss functions in a transparent manner, suited to the geometric constraints induced by the task at hand. Third, we remove a limitation in the training and evaluation protocol of Vendrov et al. (2015), and propose a sound alternative. Experiments using synsets from the WordNet noun hierarchy (same as Vendrov et al. (2015)) show the benefits of our new formulations. Our code will be available.

2. Related work

Words and entities (also see wiki2vec) are usually embedded as points or rays from the origin (Mikolov et al., 2013; Pennington et al., 2014; Yamada et al., 2017). It is well appreciated that relations need more sophisticated representation (Bordes et al., 2013; Nickel, 2013; Nickel et al., 2016; Trouillon et al., 2016; Xie et al., 2017), but types seem to have fallen by the wayside, in relative terms. Vilnis and McCallum (2014) pioneered a Gaussian density representation for words, to model hypernymy via the asymmetric KL divergence as an inference gadget. Items are represented by Gaussian densities (with suitable mean and covariance parameters). If $t_1 \subseteq t_2$, we want low $\mathrm{KL}(t_1 \,\|\, t_2)$. Normalized densities with unit mass seem inappropriate for types with diverse population sizes. Athiwaratkun and Wilson (2018) have used a thresholded divergence. However, modeling asymmetry does not, in itself, enforce transitivity. Neither is anti-symmetry modeled. Jameel and Schockaert (2016) proposed using subspaces to represent types. They do not address type hierarchies or transitive containment. Recently, Nickel and Kiela (2017) introduced an elegant hyperbolic geometry to represent types, but moving away from Euclidean space can complicate the use of such embeddings in downstream applications, in conjunction with conventional word embeddings. Vendrov et al. (2015) proposed a simpler mechanism: embed each type $t$ to a vector $\mathbf{u}_t \in \mathbb{R}^D$, and, if $t_1 \subseteq t_2$, then require $\mathbf{u}_{t_1} \ge \mathbf{u}_{t_2}$, where $\ge$ is elementwise. That is, $\mathbf{u}_{t_1}$ must dominate $\mathbf{u}_{t_2}$. OE was found better at modeling hypernymy than Gaussian embeddings. In OE, types are open cones with infinite volume, which complicates representing various intersections.

3. OE with improved loss objective

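In what follows, we use the partial order $\prec$ to unify $\in$ and $\subseteq$ for notational simplicity. If $x \prec y$, OE required $\mathbf{u}_x \ge \mathbf{u}_y$. OE defines $\ell(x,y) = \|\max\{\mathbf{0}, \mathbf{u}_y - \mathbf{u}_x\}\|_2^2$, which is 0 iff $\mathbf{u}_x \ge \mathbf{u}_y$. Given labeled positive instances $x \prec y$ and negative instances $x \not\prec y$, the overall loss is the sum of two parts:

$$L_+ = \sum_{x \prec y} \ell(x,y), \qquad L_- = \sum_{x \not\prec y} \max\{0, \alpha - \ell(x,y)\},$$

where $\alpha$ is a tuned additive margin. The intuition is that when $x \not\prec y$, we want $\ell(x,y) \ge \alpha$. There are two limitations to the above loss definitions. First, the squared L2 norm is too sensitive to outliers. This is readily remedied by redefining $\ell$ using the L1 norm, as

$$\ell(x,y) = \big\| [\mathbf{u}_y - \mathbf{u}_x]_+ \big\|_1 = \sum_d [u_{y,d} - u_{x,d}]_+,$$

where $[\cdot]_+$ is the hinge/ReLU operator. Second, the semantics of $L_-$ are wrong: we are needlessly encouraging all dimensions to violate dominance, whereas violation in just one dimension would have been enough.

Specifically, for $x \not\prec y$, the loss should be zero if $u_{x,d} < u_{y,d}$ for any $d$. Accordingly, we redefine

$$\ell_-(x,y) = \min_d \, [u_{x,d} - u_{y,d}]_+$$

so that the loss is zero if dominance fails in at least one dimension. To balance this form in case of positive instances, we redefine

$$\ell_+(x,y) = \max_d \, [u_{y,d} - u_{x,d}]_+$$

so that the loss is zero only if dominance holds in all dimensions.

The unbounded hinge losses above mean a few outliers can hijack the aggregate losses $L_+$ and $L_-$. Moreover, the absence of an SVM-like geometric margin (as distinct from the loss margin $\alpha$ above) also complicates separating $\prec$ and $\not\prec$ cases confidently. Our final design introduces a nonlinearity (the sigmoid function $\sigma$) to normalize per-instance losses, an additive margin $\Delta$ and a standard stiffness hyperparameter $\psi$:

$$\ell_+(x,y) = \sigma\Big(\psi \max_d \, [u_{y,d} - u_{x,d} + \Delta]_+\Big) - \tfrac{1}{2}, \qquad \ell_-(x,y) = \sigma\Big(\psi \min_d \, [u_{x,d} - u_{y,d} + \Delta]_+\Big) - \tfrac{1}{2}.$$

(Obviously the ‘$-\tfrac{1}{2}$’ terms are immaterial for optimization, but bring the loss expression to zero when there are no constraint violations.)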
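The per-instance losses described in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' TensorFlow code: the function names, the argument names `margin` and `stiffness`, and the toy vectors are all assumptions made for the example. The positive loss takes the largest per-dimension shortfall (zero only if dominance holds everywhere), and the negative loss takes the smallest per-dimension surplus (zero as soon as dominance fails somewhere).

```python
import numpy as np

def relu(z):
    """Hinge operator [z]_+ applied elementwise."""
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def positive_loss(u_x, u_y):
    """x is labeled below y: zero only if u_x >= u_y in every dimension."""
    return float(np.max(relu(u_y - u_x)))

def negative_loss(u_x, u_y):
    """x is labeled NOT below y: zero if dominance fails in at least one dimension."""
    return float(np.min(relu(u_x - u_y)))

def smooth_positive_loss(u_x, u_y, margin, stiffness):
    """Margin + stiffness + sigmoid variant; the 0.5 offset makes 'no violation' score 0."""
    return float(sigmoid(stiffness * np.max(relu(u_y - u_x + margin))) - 0.5)

def smooth_negative_loss(u_x, u_y, margin, stiffness):
    return float(sigmoid(stiffness * np.min(relu(u_x - u_y + margin))) - 0.5)

# Hypothetical 2-d embeddings: the subtype vector dominates the supertype vector.
u_sub = np.array([2.0, 3.0])
u_sup = np.array([1.0, 1.0])
u_other = np.array([2.0, 0.0])  # fails to dominate u_sup in the second dimension

print(positive_loss(u_sub, u_sup), negative_loss(u_sub, u_sup))
print(positive_loss(u_other, u_sup), negative_loss(u_other, u_sup))
```

Because the sigmoid is bounded, a single badly violated pair contributes at most 0.5 to the aggregate loss in the smooth variants, which is the point of the normalization.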

4. Rectangle embeddings

Despite its novelty and elegance, OE has some conceptual limitations. A type $t$ with embedding $\mathbf{u}_t$ is the infinite axis-aligned open convex cone with its apex at $\mathbf{u}_t$. Thus, types cannot “turn off” dimensions, all pairs of types intersect (although the intersection may be unpopulated), and all types have the same infinite measure, irrespective of their training population sizes.

We propose to represent each type by a hyper-rectangle (hereafter, just ‘rectangle’), a natural generalization of OE cones. A rectangle is convex, bounded and can have collapsed dimensions (i.e., with zero width). Obviously, rectangles can be positioned to be disjoint, and their sizes can give some indication of the number of known instances of corresponding types. Containment of one rectangle in another is transitive by construction, just like OE. Entities remain represented as points (or infinitesimal rectangles for uniform notation).

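Each type or entity $x$ is represented by a base vector $\mathbf{u}_x \in \mathbb{R}^D$, as well as a nonnegative width vector $\mathbf{v}_x \in \mathbb{R}^D_{\ge 0}$, so that in dimension $d$, the rectangle has extent $[u_{x,d},\, u_{x,d} + v_{x,d}]$. Informally, the rectangle representing $x$ is bounded by “lower left corner” $\mathbf{u}_x$ and “upper right corner” $\mathbf{u}_x + \mathbf{v}_x$. For entities, $\mathbf{v}_x = \mathbf{0}$. For types, $\mathbf{v}_x$ are regularized with an L2 penalty. The rectangles are allowed to float around freely, so $\mathbf{u}_x$ are not regularized.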

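If $e \in t$ or $t_1 \subseteq t_2$, written uniformly as $x \prec y$, the rectangle representing $x$ must be contained in the rectangle representing $y$. Let the violation in the $d$th dimension be

$$\delta_d(x,y) = \max\big\{ u_{y,d} - u_{x,d},\; (u_{x,d} + v_{x,d}) - (u_{y,d} + v_{y,d}) \big\}.$$

Then the loss expression for positive instances is

$$\ell_+(x,y) = \max_d \, [\delta_d(x,y)]_+ .$$

This ensures that the loss is proportional to the largest violating margin and that the loss is zero if the rectangle of $x$ is contained in the rectangle of $y$. Analogously, we define

$$\ell_-(x,y) = \min_d \, [-\delta_d(x,y)]_+ .$$

As in Section 3, we can add margin $\Delta$, stiffness $\psi$, and the sigmoid nonlinearity $\sigma$ to rectangles, and get

$$\ell_+(x,y) = \sigma\Big(\psi \max_d \, [\delta_d(x,y) + \Delta]_+\Big) - \tfrac{1}{2}, \qquad \ell_-(x,y) = \sigma\Big(\psi \min_d \, [\Delta - \delta_d(x,y)]_+\Big) - \tfrac{1}{2}.$$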

5. Training and evaluation protocols

Because the training and evaluation instances are tuple samples from a single (partially observed) partial order, great care is needed in designing the training, development and testing folds. To use unambiguous short subscripts, we call them learn, dev and eval folds, each with positive and negative instances, written $\mathrm{learn}_+, \mathrm{learn}_-$, and so on. Let $T$ be the raw set of tuples ($e \in t$ or $t_1 \subseteq t_2$). The transitive closure (TC) of $T$, denoted $c(T)$, includes all tuples implied by $T$ via transitivity.
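To make the transitive closure concrete, here is a small self-contained sketch that computes it by graph search from every source node. The function name, the tuple orientation (child first, ancestor second), and the toy hierarchy are assumptions for illustration only.

```python
from collections import defaultdict

def transitive_closure(tuples):
    """Return all (x, z) pairs such that z is reachable from x via the given tuples."""
    successors = defaultdict(set)
    for x, y in tuples:
        successors[x].add(y)

    closure = set()
    for start in list(successors):
        stack = list(successors[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            # .get avoids creating new keys in the defaultdict while iterating
            stack.extend(successors.get(node, ()))
    return closure

# Hypothetical three-edge chain.
raw = {("poodle", "dog"), ("dog", "mammal"), ("mammal", "animal")}
closed = transitive_closure(raw)
print(len(raw), len(closed))
```

On a chain of three edges the closure adds the three implied tuples, which is the same effect that grows the WordNet edge set when its closure is taken.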

5.1. OE protocol

Vendrov et al. (2015) followed this protocol:

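  1. Compute the transitive closure $c(T)$ of the raw tuple set $T$.

  2. Sample positive eval fold $\mathrm{eval}_+ \subset c(T)$.

  3. Sample positive learn fold $\mathrm{learn}_+ \subset c(T) \setminus \mathrm{eval}_+$.

  4. Sample positive dev fold $\mathrm{dev}_+ \subset c(T) \setminus (\mathrm{eval}_+ \cup \mathrm{learn}_+)$.

  5. Generate negative eval, learn and dev folds, $\mathrm{eval}_-$, $\mathrm{learn}_-$ and $\mathrm{dev}_-$ (see below).

  6. Return $(\mathrm{learn}_+, \mathrm{learn}_-, \mathrm{dev}_+, \mathrm{dev}_-, \mathrm{eval}_+, \mathrm{eval}_-)$.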

A negative tuple is generated by taking a positive tuple $(x, y)$ and perturbing either element randomly to get $(x', y)$ or $(x, y')$, where $x', y'$ are sampled uniformly at random. In OE, negative folds were the same size as positive folds.

The WordNet (Miller et al., 1993) hypernymy data set used by Vendrov et al. (2015) has $|T| = 82115$ and $|c(T)| = 838073$. The positive dev and eval folds, sampled from $c(T)$, had only 4000 tuples each. All remaining tuples were in the learn fold. Vendrov et al. (2015) freely admit that “the majority of test set edges can be inferred simply by applying transitivity, giving [them] a strong baseline.” They reported that the TC baseline gave a 0/1 accuracy of 88.2%, Gaussian embeddings (Vilnis and McCallum, 2014) were at 86.6%, and OE at 90.6%.

Figure 1. A large fraction of test instances can be inferred by simply computing the transitive closure of the training fold in the OE protocol.

Instead of 0/1 accuracy, Figure 1 shows the more robust F1 score on test instances achieved by transitive closure and OE, as the size of training data is varied. Vendrov et al. (2015) reported accuracy near the right end of the scale, where OE has little to offer beyond TC. In fact, OE does show significant lift beyond TC when training data is scarce. As we shall see, even with ample training data, both OE with the improved loss (Section 3) and rectangle embeddings improve on OE.

5.2. Sanitized OE protocol

Clearly, evaluation results must be reported separately for instances that cannot be trivially inferred via TC, where the algorithm needs to discover a suitable geometry from the combinatorial structure of $T$ beyond mere reachability. To this end, we propose the following sanitized protocol.

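  1. Sample positive learn fold $\mathrm{learn}_+ \subset c(T)$.

  2. Negative learn fold $\mathrm{learn}_-$ of size $|\mathrm{learn}_+|$ is generated by repeating as needed:

    1. Sample $(x, y) \in \mathrm{learn}_+$ uniformly.

    2. Perturb one of $x$ or $y$ to get $(x', y')$.

    3. If $(x', y') \in c(T)$, discard.

  3. Sample positive dev fold $\mathrm{dev}_+ \subset c(T) \setminus c(\mathrm{learn}_+)$.

  4. Discard $(x, y)$ from $\mathrm{dev}_+$ if $x$ or $y$ is not found in $\mathrm{learn}_+$ (explained below).

  5. Sample positive eval fold $\mathrm{eval}_+$ from tuples of $c(T)$ that are not in $c(\mathrm{learn}_+)$.

  6. Discard elements from $\mathrm{eval}_+$ using the same protocol used to discard elements from $\mathrm{dev}_+$.

  7. Generate negative dev and eval folds, $\mathrm{dev}_-$ and $\mathrm{eval}_-$, using the same protocol used to generate $\mathrm{learn}_-$ from $\mathrm{learn}_+$.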

An entity or type never encountered in the learn fold cannot be embedded meaningfully (unless corpus information is harnessed, see Section 7), so it is pointless to include in dev or eval folds instances that contain such entities or types. Such sampled instances are discarded. To fill folds up to desired sizes, we repeatedly sample pairs until we can retain enough instances.
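The two filtering steps of the sanitized protocol, rejecting perturbed tuples that are actually positive and dropping dev or eval tuples that mention items unseen during learning, might be sketched as below. Everything here is a hypothetical illustration: the function names, the 50/50 choice of which side to perturb, the explicit `vocabulary` argument, and the extra rejection of self-pairs are assumptions, not details taken from the paper.

```python
import random

def sample_negatives(positives, closure, vocabulary, size, rng):
    """Perturb one side of a random positive tuple; reject tuples found in the closure."""
    positives = list(positives)
    vocabulary = list(vocabulary)
    negatives = []
    while len(negatives) < size:
        x, y = rng.choice(positives)
        if rng.random() < 0.5:          # assumption: perturb either side with equal chance
            candidate = (rng.choice(vocabulary), y)
        else:
            candidate = (x, rng.choice(vocabulary))
        if candidate in closure:        # actually a positive tuple, so discard
            continue
        if candidate[0] == candidate[1]:  # assumption: self-pairs are not useful negatives
            continue
        negatives.append(candidate)
    return negatives

def drop_unseen(fold, learn_positives):
    """Keep only tuples whose two items both occur somewhere in the learn fold."""
    seen = {item for pair in learn_positives for item in pair}
    return {(x, y) for (x, y) in fold if x in seen and y in seen}

# Hypothetical toy data.
closure = {("poodle", "dog"), ("dog", "mammal"), ("mammal", "animal"),
           ("poodle", "mammal"), ("poodle", "animal"), ("dog", "animal")}
learn_pos = {("poodle", "dog"), ("dog", "animal")}
vocab = {"poodle", "dog", "mammal", "animal", "cat"}

learn_neg = sample_negatives(learn_pos, closure, vocab, size=len(learn_pos),
                             rng=random.Random(0))
dev_pos = drop_unseen({("cat", "animal"), ("poodle", "animal")}, learn_pos)
print(learn_neg, dev_pos)
```

Passing a seeded `random.Random` keeps fold construction reproducible, which matters when the same folds are reused to compare several embedding methods.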

6. Experiments

Data set:

We prepare our data set similar to Vendrov et al. (2015). WordNet (Miller et al., 1993) gives 82115 (hypernym, hyponym) pairs which we use as directed edges to construct our KG. The WordNet noun hierarchy is prepared by experts, and is also at the heart of other type systems (Suchanek et al., 2007; Murty et al., 2017) used in KG completion and information extraction. We augment the KG by computing its transitive closure, which increases the edge count to 838073. Then we use the two protocols in Section 5 to create training, dev and test folds. The sanitized protocol produces 679241 positive and 679241 negative training instances, 4393 positive and 4393 negative dev instances, and 4316 positive and 4316 negative test instances. These sizes are close to those of Vendrov et al. (2015).

Code and hyperparameter details:

OE and our enhancements, OE with the improved loss and rectangle embeddings, were coded in TensorFlow with the Adam optimizer. Hyperparameters, such as batch size (500), initial learning rate (0.1), margin and stiffness, were tuned using the dev fold. Optimization was stopped if the loss on the dev fold did not improve by more than 0.1% for 20 consecutive epochs. All types and entities were embedded to $D$ dimensions.

Vendrov et al. (2015) reported only microaveraged 0/1 accuracy (‘Acc’). Here we also report average precision (AP), recall (R), precision (P) and F1 score, thus covering both ranking and set-retrieval objectives. AP and R-P curves are obtained by ordering test instances by the raw score given to them by OE, OE with the improved loss, and rectangle embeddings. Table 1 compares the three systems after using the two sampling protocols to generate folds.

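| | OE protocol: OE | OE protocol: OE, improved loss | OE protocol: Rect | Sanitized: OE | Sanitized: OE, improved loss | Sanitized: Rect |
|---|---|---|---|---|---|---|
| Acc | 0.922 | 0.921 | 0.926 | 0.574 | 0.742 | 0.767 |
| AP | 1 | 1 | 1 | 0.977 | 0.969 | 0.986 |
| P | 0.994 | 0.915 | 0.973 | 0.987 | 0.925 | 0.983 |
| R | 0.850 | 0.929 | 0.877 | 0.151 | 0.527 | 0.544 |
| F1 | 0.916 | 0.922 | 0.923 | 0.262 | 0.671 | 0.700 |

Table 1. Performance of OE, OE with the improved loss, and rectangle embeddings (Rect) under the OE protocol and the sanitized protocol, on the WordNet hypernymy relation.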

It is immediately visible that absolute performance numbers are very high under the original OE protocol, for reasons made clear earlier. As soon as the OE protocol is replaced by the sanitized protocol, no system is given any credit for computing transitive closure. The 0/1 accuracy of OE drops from 0.922 to 0.574. F1 score drops even more drastically from 0.916 to 0.262. In contrast, OE with the improved loss and rectangle embeddings fare better overall, with rectangle embeddings improving beyond OE with the improved loss.
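As a quick consistency check, the F1 row of Table 1 can be recomputed from the reported precision and recall using the standard harmonic-mean formula. The snippet below does this for all six columns; the dictionary layout and the label "improved loss" are only illustrative naming, while the numbers are copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# (precision, recall, reported F1) per protocol and system, copied from Table 1.
TABLE_1 = {
    ("OE protocol", "OE"): (0.994, 0.850, 0.916),
    ("OE protocol", "improved loss"): (0.915, 0.929, 0.922),
    ("OE protocol", "Rect"): (0.973, 0.877, 0.923),
    ("sanitized", "OE"): (0.987, 0.151, 0.262),
    ("sanitized", "improved loss"): (0.925, 0.527, 0.671),
    ("sanitized", "Rect"): (0.983, 0.544, 0.700),
}

for (protocol, system), (p, r, reported) in TABLE_1.items():
    print(f"{protocol:12s} {system:14s} recomputed={f1_score(p, r):.3f} reported={reported:.3f}")
```

The recomputed values agree with the reported ones up to rounding, and they make plain that the collapse of OE under the sanitized protocol is driven by recall (0.151) rather than precision.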

Figure 2. Recall-precision profiles on WordNet.

Whereas OE with the improved loss and rectangle embeddings improve on OE at the task of set retrieval, their ranking abilities are slightly different. Figure 2 shows that OE with the improved loss is inferior at ranking to both OE and rectangle embeddings. Rectangle embeddings have the best precision profile at low recall. Modifying our code to use ranking-oriented loss functions (Cao et al., 2007) may address ranking applications better.

7. Concluding remarks

Here we have addressed the problem of completing $\in$ and $\subseteq$ relations starting from an incomplete KG, but without corpus support. For out-of-vocabulary (not seen during training) entities, mention contexts in a corpus are vital typing clues (Ling and Weld, 2012; Yaghoobzadeh and Schütze, 2015; Shimaoka et al., 2016). We plan to integrate context (word) embeddings with order and rectangle embeddings. It would be of interest to see how our refined loss objectives and testing protocols compare with other corpus-based methods (Chang et al., 2017; Yamane et al., 2016).


Thanks to Aditya Kusupati and Anand Dhoot for helpful discussions, and nVidia for a GPU grant.