1 A General Framework for Multi-Relational Representation Learning
Most existing neural embedding models for multi-relational learning can be derived from a general framework. The input is a relation triplet describing two entities, the subject and the object, that are in a certain relation. The output is a scalar measuring the validity of the relationship. Each input entity can be represented as a high-dimensional sparse vector (a “one-hot” index vector or an “n-hot” feature vector). The first neural network layer projects the input vectors to low-dimensional vectors, and the second layer projects these vectors to a real value for comparison via a relation-specific operator (which can also be viewed as a scoring function).
More formally, denote as the input for entity and as the first layer neural network parameter. The scoring function for a relation triplet can be written as
(1) 
Many choices for the form of the scoring function are available. Most of the existing scoring functions in the literature can be unified based on a basic linear transformation, a bilinear transformation, or their combination, where and are defined as
(2)
which are parametrized by and , respectively.
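The two building blocks can be illustrated with a minimal sketch. It assumes that the linear transformation applies a relation-specific matrix to the concatenation of the two entity vectors and that the bilinear transformation is the usual form y1^T B y2. The names `linear_transform`, `bilinear_transform`, `A_r`, and `B_r` are illustrative choices, and the numbers are toy values.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear_transform(A_r: Matrix, y1: Vector, y2: Vector) -> Vector:
    """Apply A_r^T to the concatenation [y1; y2] (assumed form).

    A_r has shape (2n, m), so the result has m components.
    """
    y = y1 + y2  # concatenate the two entity vectors
    m = len(A_r[0])
    return [sum(A_r[i][k] * y[i] for i in range(len(y))) for k in range(m)]


def bilinear_transform(B_r: Matrix, y1: Vector, y2: Vector) -> float:
    """Compute y1^T B_r y2, a single real value."""
    return sum(
        y1[i] * B_r[i][j] * y2[j]
        for i in range(len(y1))
        for j in range(len(y2))
    )


# Toy values, not taken from the paper.
y1, y2 = [1.0, 2.0], [3.0, 4.0]
A_r = [[1.0], [0.0], [-1.0], [0.0]]  # picks out y1[0] - y2[0]
B_r = [[1.0, 0.0], [0.0, 1.0]]       # identity: reduces to a dot product
linear_out = linear_transform(A_r, y1, y2)
bilinear_out = bilinear_transform(B_r, y1, y2)
```

With an identity matrix the bilinear form reduces to a plain dot product, which is the case used when TransE is rewritten in terms of these two functions.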
In Table 1, we summarize several popular scoring functions in the literature for a relation triplet , reformulated in terms of the above two functions. Denote by two entity vectors. Denote by and matrix or vector parameters for linear transformation . Denote by and matrix or tensor parameters for bilinear transformation . is an identity matrix. is an additional parameter for relation . The scoring function for TransE is derived from as in [2].

| Models | Scoring Function |
|---|---|
| Distance [3] | |
| Single Layer [25] | |
| TransE [2] | |
| Bilinear [14] | |
| NTN [25] | |
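To illustrate how a translation-based score can be rewritten with a linear and a bilinear term, the sketch below checks the expansion numerically. It assumes that the TransE distance is the squared L2 norm of y1 + v_r - y2, following the translation idea of [2]. The function names and the toy vectors are illustrative.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def transe_distance(y1, v_r, y2):
    """Squared L2 distance ||y1 + v_r - y2||^2 (assumed TransE form)."""
    return sum((a + v - b) ** 2 for a, v, b in zip(y1, v_r, y2))


def transe_expanded(y1, v_r, y2):
    """Same quantity, expanded into a linear term, a bilinear term, and norms."""
    linear = dot(v_r, y1) - dot(v_r, y2)  # linear in the pair (y1, y2)
    bilinear = dot(y1, y2)                # bilinear form with an identity matrix
    return 2 * linear - 2 * bilinear + dot(v_r, v_r) + dot(y1, y1) + dot(y2, y2)


# Toy unit-length entity vectors and a toy relation vector.
y1, y2, v_r = [0.6, 0.8], [1.0, 0.0], [0.5, -0.5]
direct = transe_distance(y1, v_r, y2)
expanded = transe_expanded(y1, v_r, y2)
```

When the entity vectors are kept at unit length, the last two terms are constant, so ranking by the negated distance depends only on the linear term, the bilinear term, and the norm of the relation vector.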
This general framework for relationship modeling also applies to the recent deep-structured semantic model [12, 22, 23, 9, 28], which learns the relevance or a single relation between a pair of word sequences. The framework above applies when using multiple neural network layers to project entities and using a relation-independent scoring function . The cosine scoring function is a special case of with normalized and .
The neural network parameters of all the models discussed above can be learned by minimizing a margin-based ranking objective, which encourages the scores of positive relationships (triplets) to be higher than the scores of any negative relationships (triplets). (Other objectives, such as mutual information (as in [12]) and reconstruction loss (as in tensor decomposition approaches [4]), can also be applied; comparisons among these objectives are beyond the scope of this paper.) Usually only positive triplets are observed in the data. Given a set of positive triplets, we can construct a set of negative triplets by corrupting either one of the relation arguments. The training objective is to minimize the margin-based ranking loss
(3) 
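The construction of negative triplets and the ranking loss can be sketched as follows. This is a minimal illustration that assumes a hinge form with a margin of 1 and uniform sampling of the replacement entity. The names `corrupt` and `margin_ranking_loss`, the margin value, and the toy scoring function are assumptions. For simplicity, the sketch does not check whether a corrupted triplet happens to be another observed positive.

```python
import random


def corrupt(triplet, entities, rng):
    """Return two negatives: one with a corrupted subject, one with a corrupted object."""
    subj, rel, obj = triplet
    new_subj = rng.choice([e for e in entities if e != subj])
    new_obj = rng.choice([e for e in entities if e != obj])
    return [(new_subj, rel, obj), (subj, rel, new_obj)]


def margin_ranking_loss(score, positives, negatives_of, margin=1.0):
    """Sum of hinge penalties max(S(neg) - S(pos) + margin, 0) over all pairs."""
    total = 0.0
    for pos in positives:
        for neg in negatives_of[pos]:
            total += max(score(neg) - score(pos) + margin, 0.0)
    return total


# Toy data: three entities and one observed triplet.
entities = ["a", "b", "c"]
positives = [("a", "likes", "b")]
rng = random.Random(0)
negatives_of = {p: corrupt(p, entities, rng) for p in positives}


def toy_score(triplet):
    """Hypothetical scorer: observed triplets get 1.0, all others 0.5."""
    return 1.0 if triplet in positives else 0.5


loss = margin_ranking_loss(toy_score, positives, negatives_of)
```

The loss is zero only when every positive triplet outscores each of its negatives by at least the margin.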
2 Experiments and Discussion
Datasets and evaluation metrics
We used the WordNet (WN) and Freebase (FB15k) datasets introduced in [2]. WN contains triplets with entities and relations, and FB15k consists of triplets with entities and relations. We also consider a subset of FB15k (FB15k-401) containing only frequent relations (relations with at least training examples). This results in triplets with entities and relations. We use link prediction as our prediction task, as in [2]. For each test triplet, each entity is treated as the target entity to be predicted in turn. Scores are computed for the correct entity and all the corrupted entities in the dictionary and are ranked in descending order. We consider Mean Reciprocal Rank (MRR) (an average of the reciprocal rank of an answered entity over all test triplets), HITS@10 (top-10 accuracy), and Mean Average Precision (MAP) (as used in [4]) as the evaluation metrics.
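The ranking metrics described above can be sketched in a few lines. The helper names are illustrative, and the tie-breaking rule (a tied candidate does not push the target down) is an assumption. HITS@10 is reported as a percentage, to match the tables.

```python
def rank_of_target(scores, target):
    """1-based rank of `target` when candidates are sorted by descending score."""
    target_score = scores[target]
    better = sum(1 for e, s in scores.items() if e != target and s > target_score)
    return 1 + better


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test predictions."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k=10):
    """Percentage of test predictions whose target is ranked in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


# Toy candidate scores for one test triplet; "paris" is the correct entity.
candidate_scores = {"paris": 0.9, "london": 0.95, "rome": 0.2}
example_rank = rank_of_target(candidate_scores, "paris")

# Toy ranks collected over three test predictions.
ranks = [1, 2, 20]
mrr = mean_reciprocal_rank(ranks)
hits10 = hits_at_k(ranks, k=10)
```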
Implementation details
All the models were implemented in C# and run on a GPU. Training was implemented using mini-batch stochastic gradient descent with AdaGrad [8]. At each gradient step, we sampled two negative triplets for each positive triplet, one with a corrupted subject entity and one with a corrupted object entity. The entity vectors are renormalized to have unit length after each gradient step (an effective technique that empirically improved all the models). For the relation parameters, we used standard L2 regularization. For all models, we set the number of mini-batches to , the dimensionality of the entity vector , the regularization parameter , and the number of training epochs on FB15k and FB15k-401 and on WN ( was determined based on the learning curves where the performance of all models plateaued). The learning rate was initially set to and then adapted during training by AdaGrad.
2.1 Model Comparisons
We examine five embedding models in decreasing order of complexity: (1) NTN with tensor slices as in [25]; (2) Bilinear+Linear, NTN with tensor slice and without the nonlinear layer; (3) TransE with L2 norm, a special case of Bilinear+Linear as described in [2] (empirically, we found no significant differences between the L1 norm and the L2 norm for the TransE objective); (4) Bilinear; (5) Bilinear-diag, a special case of Bilinear where the relation matrix is a diagonal matrix.
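Table 2: Performance of the compared models on FB15k, FB15k-401, and WN.

| Model | FB15k MRR | FB15k HITS@10 | FB15k-401 MRR | FB15k-401 HITS@10 | WN MRR | WN HITS@10 |
|---|---|---|---|---|---|---|
| NTN | 0.25 | 41.4 | 0.24 | 40.5 | 0.53 | 66.1 |
| Bilinear+Linear | 0.30 | 49.0 | 0.30 | 49.4 | 0.87 | 91.6 |
| TransE (DistAdd) | 0.32 | 53.9 | 0.32 | 54.7 | 0.38 | 90.9 |
| Bilinear | 0.31 | 51.9 | 0.32 | 52.2 | 0.89 | 92.8 |
| Bilinear-diag (DistMult) | 0.35 | 57.7 | 0.36 | 58.5 | 0.83 | 94.2 |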
Table 2 shows the results of all compared methods on all the datasets. In general, we observe that the performance increases as the complexity of the model decreases on FB. NTN, the most complex model, provides the worst performance on both FB and WN, which suggests overfitting. Compared to the previously published results of TransE [2], our implementation achieves much better results (53.9% vs. 47.1% on FB15k and 90.9% vs. 89.2% on WN) using the same evaluation metric (HITS@10). We attribute this discrepancy mainly to the different choice of SGD optimization: AdaGrad vs. a constant learning rate. We also found that Bilinear consistently provides comparable or better performance than TransE, especially on WN. Note that WN contains many more entities than FB; it may require the parametrization of relations to be more expressive to better handle the richness of entities. Interestingly, we found that a simple variant of Bilinear, Bilinear-diag, clearly outperforms all baselines on FB and achieves comparable performance to Bilinear on WN.
2.2 Multiplicative vs. Additive Interactions
Note that Bilinear-diag and TransE have the same number of model parameters; their difference lies in the operation used to compose the two entity vectors: Bilinear-diag uses a weighted elementwise dot product (a multiplicative operation), and TransE uses elementwise subtraction with a bias (an additive operation). To highlight the difference, here we use DistMult and DistAdd to refer to Bilinear-diag and TransE, respectively. Comparing these two models provides more insight into the effect of two common choices of compositional operations, multiplication and addition, for modeling entity relations. Overall, we observed superior performance of DistMult on all the datasets in Table 2. Table 3 shows the HITS@10 score on four types of relation categories (as defined in [2]) on FB15k-401 when predicting the subject entity and the object entity, respectively. We can see that DistMult significantly outperforms DistAdd in almost all the categories. More qualitative results can be found in the Appendix.
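Table 3: HITS@10 by relation category on FB15k-401, when predicting subject entities and object entities.

| Model | Subject: 1-to-1 | Subject: 1-to-n | Subject: n-to-1 | Subject: n-to-n | Object: 1-to-1 | Object: 1-to-n | Object: n-to-1 | Object: n-to-n |
|---|---|---|---|---|---|---|---|---|
| DistAdd | 70.0 | 76.7 | 21.1 | 53.9 | 68.7 | 17.4 | 83.2 | 57.5 |
| DistMult | 75.5 | 85.1 | 42.9 | 55.2 | 73.7 | 46.7 | 81.0 | 58.8 |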
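The two compositional choices compared in this section can be placed side by side in a short sketch. It assumes that DistMult sums a relation-weighted elementwise product and that DistAdd is the negated squared L2 distance after translation. The function names, the squared norm, and the toy vectors are assumptions for illustration.

```python
def distmult_score(y1, w_r, y2):
    """Multiplicative composition: relation-weighted elementwise product, summed."""
    return sum(a * w * b for a, w, b in zip(y1, w_r, y2))


def distadd_score(y1, v_r, y2):
    """Additive composition: negated squared L2 distance between y1 + v_r and y2."""
    return -sum((a + v - b) ** 2 for a, v, b in zip(y1, v_r, y2))


# Toy entity vectors and one relation vector per model (same size n = 2).
y1, y2 = [1.0, 2.0], [3.0, 1.0]
w_r = [0.5, 2.0]   # DistMult relation weights
v_r = [2.0, -1.0]  # DistAdd translation; here y1 + v_r equals y2 exactly

mult = distmult_score(y1, w_r, y2)
add = distadd_score(y1, v_r, y2)
```

Both models store a single n-dimensional vector per relation, which is why their parameter counts match; only the way the relation vector interacts with the two entity vectors differs.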
2.3 Entity Representations
In the following, we examine the learning of entity representations and introduce two further improvements: using a nonlinear projection and initializing entity vectors with pretrained phrase vectors. We focus on DistMult as our baseline and compare it with two modifications on FB15k-401: DistMult-tanh (using tanh for entity projection; when applying the nonlinearity, we remove the normalization steps on entity parameters during training, as tanh already helps control the scaling freedom) and DistMult-tanh-EV-init (initializing the entity parameters with the dimensional pretrained phrase vectors released by word2vec [17]). We also reimplemented the word vector representation and initialization technique introduced in [25]: each entity is represented as an average of its word vectors, and the word vectors are initialized using the dimensional pretrained word vectors released by word2vec. We denote this method as DistMult-tanh-WV-init. Inspired by [4], we design a new evaluation setting where the predicted entities are automatically filtered according to “entity types” (entities that appear as the subjects/objects of a relation have the same type defined by that relation). This provides a better understanding of the model performance when some entity type information is provided.
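Table 4: Results of DistMult and its variants on FB15k-401.

| Model | MRR | HITS@10 | MAP (w/ type checking) |
|---|---|---|---|
| DistMult | 0.36 | 58.5 | 64.5 |
| DistMult-tanh | 0.39 | 63.3 | 76.0 |
| DistMult-tanh-WV-init | 0.28 | 52.5 | 65.5 |
| DistMult-tanh-EV-init | 0.42 | 73.2 | 88.2 |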
In Table 4, we can see that DistMult-tanh-EV-init provides the best performance on all the metrics. Surprisingly, we observed a performance drop with DistMult-tanh-WV-init. We suspect that this is because word vectors are not appropriate for modeling entities described by non-compositional phrases (more than 73% of the entities in FB15k-401 are person names, locations, organizations, and films). The promising performance of DistMult-tanh-EV-init suggests that the embedding model can greatly benefit from pretrained entity-level vectors.
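The type-filtered evaluation setting described in this section can be sketched as follows. The sketch assumes that the "type" of a relation argument is induced from the entities observed in that position in the training triplets. The helper names and the toy triplets are hypothetical.

```python
from collections import defaultdict


def build_type_sets(triplets):
    """Collect, per relation, the entities seen as subjects and as objects."""
    subjects, objects = defaultdict(set), defaultdict(set)
    for subj, rel, obj in triplets:
        subjects[rel].add(subj)
        objects[rel].add(obj)
    return subjects, objects


def filter_candidates(candidates, relation, allowed):
    """Keep only the candidates whose type is compatible with the relation slot."""
    return [e for e in candidates if e in allowed[relation]]


# Toy training triplets (illustrative entities and relations).
train = [
    ("alice", "born_in", "paris"),
    ("bob", "born_in", "rome"),
    ("acme", "located_in", "rome"),
]
subjects, objects = build_type_sets(train)

all_entities = ["alice", "bob", "acme", "paris", "rome"]
object_candidates = filter_candidates(all_entities, "born_in", objects)
subject_candidates = filter_candidates(all_entities, "born_in", subjects)
```

Only the surviving candidates are then scored and ranked, so a model is not penalized for ranking entities that could never fill the slot.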
3 Conclusion
In this paper we present a unified framework for modeling multi-relational representations, scoring, and learning, and conduct an empirical study of several recent multi-relational embedding models under the framework. We investigate the different choices of relation operators based on linear and bilinear transformations, and also the effects of entity representations by incorporating unsupervised vectors pretrained on extra textual resources. Our results show several interesting findings, enabling the design of a simple embedding model that achieves new state-of-the-art performance on a popular knowledge base completion task evaluated on Freebase. Given the recent successes of deep learning in various applications, e.g., [11, 27, 5], our future work will aim to exploit deep structure, possibly including tensor constructs, in computing the neural embedding vectors, e.g., [12, 29, 13]. This will extend the current multi-relational neural embedding model to a deep version that is potentially capable of capturing hierarchical structure hidden in the input data.
References
 [1] Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. A semantic matching energy function for learning with multi-relational data. Machine Learning, pages 1–27, 2013.
 [2] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, 2013.
 [3] Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. Learning structured embeddings of knowledge bases. In AAAI, 2011.
 [4] Kai-Wei Chang, Wen-tau Yih, Bishan Yang, and Chris Meek. Typed tensor decomposition of knowledge bases for relation extraction. In EMNLP, 2014.
 [5] Li Deng, G. Hinton, and B. Kingsbury. New types of deep neural network learning for speech recognition and related applications: An overview. In ICASSP, 2013.
 [6] Pedro Domingos. Prospects and challenges for multi-relational data mining. ACM SIGKDD explorations newsletter, 5(1):80–83, 2003.
 [7] Xin Luna Dong, K Murphy, E Gabrilovich, G Heitz, W Horn, N Lao, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: A Web-scale approach to probabilistic knowledge fusion. In KDD, 2014.
 [8] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
 [9] Jianfeng Gao, Patrick Pantel, Michael Gamon, Xiaodong He, Li Deng, and Yelong Shen. Modeling interestingness with deep neural networks. In EMNLP, 2014.
 [10] Lise Getoor and Ben Taskar, editors. Introduction to Statistical Relational Learning. The MIT Press, 2007.
 [11] Geoff Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Sig. Proc. Mag., 29:82–97, 2012.
 [12] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for Web search using clickthrough data. In CIKM, 2013.
 [13] B Hutchinson, L. Deng, and D. Yu. Tensor deep stacking networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1944–1957, 2013.
 [14] Rodolphe Jenatton, Nicolas Le Roux, Antoine Bordes, and Guillaume Obozinski. A latent factor model for highly multi-relational data. In NIPS, 2012.
 [15] Charles Kemp, Joshua B Tenenbaum, Thomas L Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. In AAAI, volume 3, page 5, 2006.
 [16] Ni Lao, Tom Mitchell, and William W Cohen. Random walk inference and learning in a large scale knowledge base. In EMNLP, pages 529–539, 2011.
 [17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
 [18] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, pages 809–816, 2011.
 [19] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. Factorizing YAGO: scalable machine learning for linked data. In WWW, pages 271–280, 2012.
 [20] Alberto Paccanaro and Geoffrey E. Hinton. Learning distributed representations of concepts using linear relational embedding. IEEE Transactions on Knowledge and Data Engineering, 13(2):232–244, 2001.
 [21] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine learning, 62(1-2):107–136, 2006.
 [22] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
 [23] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. Learning semantic representations using convolutional neural networks for Web search. In WWW, pages 373–374, 2014.
 [24] Ajit P Singh and Geoffrey J Gordon. Relational learning via collective matrix factorization. In KDD, pages 650–658. ACM, 2008.
 [25] Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. Reasoning with neural tensor networks for knowledge base completion. In NIPS, 2013.
 [26] Ilya Sutskever, Joshua B Tenenbaum, and Ruslan Salakhutdinov. Modelling relational data using Bayesian clustered tensor factorization. In NIPS, pages 1821–1828, 2009.
 [27] O. Vinyals, Y. Jia, L. Deng, and T. Darrell. Learning with recursive perceptual representations. In NIPS, 2012.
 [28] Wen-tau Yih, Xiaodong He, and Christopher Meek. Semantic parsing for single-relation question answering. In ACL, 2014.
 [29] D. Yu, L. Deng, and F. Seide. The deep tensor neural network with applications to large vocabulary speech recognition. IEEE Trans. Audio, Speech and Language Proc., 21(2):388–396, 2013.
Appendix
Figures 1 and 2 illustrate the relation embeddings learned by DistMult and DistAdd using t-SNE. We selected relations in the FB15k-401 dataset. The embeddings learned by DistMult nicely reflect the clustering structures among these relations (e.g., /film/release_region is close to /film/country), whereas the embeddings learned by DistAdd present structure that is harder to interpret.
Table 5 shows some concrete examples: top nearest neighbors in the relation space learned by DistMult and DistAdd along with the distance values (Frobenius distance between two relation matrices or Euclidean distance between two relation vectors). We can see that the nearest neighbors found by DistMult are much more meaningful. DistAdd tends to retrieve irrelevant relations that take completely different types of arguments.
Table 5: Nearest neighbors in the relation space learned by DistMult and DistAdd, with distance values in parentheses.

| Relation | DistMult | DistAdd |
|---|---|---|
| /film_distributor/film | /film/distributor (2.0); /production_company/films (3.4); /film/production_companies (3.4) | /production_company/films (2.6); /award_nominee/nominated_for (2.7); /award_winner/honored_for (2.9) |
| /film/film_set_decoration_by | /film_set_designer/film_sets_designed (2.5); /film/film_art_direction_by (6.8); /film/film_production_design_by (9.6) | /award_nominated_work/award_nominee (2.7); /film/film_art_direction_by (2.7); /award_winning_work/award_winner (2.8) |
| /organization/leadership/role | /leadership/organization (2.3); /job_title/company (12.5); /business/employment_tenure/title (13.0) | /organization/currency (3.0); /location/containedby (3.0); /university/currency (3.0) |
| /person/place_of_birth | /location/people_born_here (1.7); /person/places_lived (8.0); /people/marriage/location_of_ceremony (14.0) | /us_county/county_seat (2.6); /administrative_division/capital (2.7); /educational_institution/campuses (2.8) |
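The nearest-neighbor lookup behind a table of this kind can be sketched with plain Euclidean distance between relation vectors. For diagonal relation matrices, the Frobenius distance between two matrices equals the Euclidean distance between their diagonals, so one function covers both cases. The relation names and vectors below are toy placeholders, not learned embeddings.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_relations(query, embeddings, k=3):
    """Return the k relations closest to `query`, as (name, distance) pairs."""
    others = [
        (name, euclidean(embeddings[query], vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    return sorted(others, key=lambda pair: pair[1])[:k]


# Toy relation vectors (hypothetical names and values).
embeddings = {
    "rel_query": [0.0, 0.0],
    "rel_near": [3.0, 4.0],
    "rel_far": [6.0, 8.0],
}
neighbors = nearest_relations("rel_query", embeddings, k=2)
```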