Representation Learning for Visual-Relational Knowledge Graphs

09/07/2017 ∙ by Daniel Oñoro-Rubio, et al. ∙ Universidad de Alcalá ∙ NEC Corp.

A visual-relational knowledge graph (KG) is a KG whose entities are associated with images. We propose representation learning for relation and entity prediction in visual-relational KGs as a novel machine learning problem. We introduce ImageGraph, a KG with 1,330 relation types, 14,870 entities, and 829,931 images. Visual-relational KGs lead to novel probabilistic query types treating images as first-class citizens. We approach the query answering problems by combining ideas from the areas of computer vision and embedding learning for KGs. The resulting ML models can answer queries such as "How are these two unseen images related to each other?" We also explore a novel zero-shot learning scenario where an image of an entirely new entity is linked with multiple relations to entities of an existing KG. Our experiments show that the proposed deep neural networks are able to answer the visual-relational queries efficiently and accurately.







1 Introduction

Several application domains can be modeled with knowledge graphs where entities are represented by nodes, object attributes by node attributes, and relationships between entities by directed edges between the nodes. For instance, a product recommendation system can be represented as a knowledge graph where nodes represent customers and products and where typed edges represent customer reviews and purchasing events. In the medical domain, there are several knowledge graphs that model diseases, symptoms, drugs, genes, and their interactions (cf. [1, 2, 3]). Increasingly, entities in these knowledge graphs are associated with visual data. For instance, in the online retail domain, there are product and advertising images and in the medical domain, there are patient-associated imaging data sets (MRIs, CTs, and so on).

The ability of knowledge graphs to compactly represent a domain, its attributes, and relations makes them an important component of numerous AI systems. KGs facilitate the integration, organization, and retrieval of structured data and support various forms of reasoning. In recent years, KGs have been playing an increasingly crucial role in fields such as question answering [4, 5], language modeling [6], and text generation [7]. Even though there is a large body of work on learning and reasoning in KGs, the setting of visual-relational KGs, where entities are associated with visual data, has not received much attention. A visual-relational KG represents entities, relations between these entities, and a large number of images associated with the entities (see Figure 3(a) for an example). While ImageNet [8] and the VisualGenome [9] datasets are based on KGs such as WordNet, they are predominantly used either as object classification data sets, as in the case of ImageNet, or to facilitate scene understanding in a single image. With ImageGraph, we propose the problem of reasoning about visual concepts across a large set of images organized in a knowledge graph.

Figure 3: A small part of a visual-relational knowledge graph and a set of query types.

The core idea is to treat images as first-class citizens both in the KG and in relational KG completion queries. In combination with the multi-relational structure of a KG, numerous more complex queries are possible. The main objective of our work is to understand to what extent visual data associated with entities of a KG can be used in conjunction with deep learning methods to answer visual-relational queries. Allowing images to be arguments of queries facilitates numerous novel query types. In Figure 3(b) we list some of the query types we address in this paper. In order to answer these queries, we build both on KG embedding methods and on deep representation learning approaches for visual data. There has been a flurry of machine learning approaches tailored to specific problems such as link prediction in knowledge graphs. Examples are knowledge base factorization and embedding approaches [10, 11, 12, 13] and random-walk based ML models [14, 15]. We combine these approaches with deep neural networks to facilitate visual-relational query answering.

There are numerous application domains that could benefit from query answering in visual KGs. For instance, in online retail, visual representations of novel products could be leveraged for zero-shot product recommendations. Crucially, instead of only being able to retrieve similar products, a visual-relational KG would support the prediction of product attributes and more specifically what attributes customers might be interested in. For instance, in the fashion industry visual attributes are crucial for product recommendations [16, 17, 18, 19]. In general, we believe that being able to ground novel visual concepts into an existing KG with attributes and various relation types is a reasonable approach to zero-shot learning.

We make the following contributions. First, we introduce ImageGraph, a visual-relational KG with 1,330 relations where 829,931 images are associated with 14,870 different entities. Second, we introduce a new set of visual-relational query types. Third, we propose a novel set of neural architectures and objectives that we use for answering these novel query types. This is the first time that deep CNNs and KG embedding learning objectives are combined into a joint model. Fourth, we show that the proposed class of deep neural networks is also successful for zero-shot learning, that is, creating relations between entirely unseen entities and the KG using only visual data at query time.

2 Related Work

We discuss the relation of our contributions to previous work with an emphasis on object detection, scene understanding, existing data sets, and zero-shot learning.

Relational and Visual Data

Answering queries in a visual-relational knowledge graph is our main objective. Previous work on combining relational and visual data has focused on object detection [20, 21, 22, 23, 24] and scene recognition [25, 26, 27, 28, 29], which are required for more complex visual-relational reasoning. Recent years have witnessed a surge in reasoning about human-object, object-object, and object-attribute relationships [30, 31, 32, 33, 20, 34, 35, 36]. The VisualGenome project [9] is a knowledge base that integrates language and vision modalities. The project provides a knowledge graph, based on WordNet, which provides annotations of categories, attributes, and relation types for each image. Recent work has used the dataset to focus on scene understanding in single images. For instance, Lu et al. [37] proposed a model to detect relation types between objects depicted in an image by inferring sentences such as “man riding bicycle." Veit et al. [17] propose a siamese CNN to learn a metric representation on pairs of textile products so as to learn which products have similar styles. There is a large body of work on metric learning where the objective is to generate image embeddings such that a pairwise distance-based loss is minimized [38, 39, 40, 41, 42]. Recent work has extended this idea to directly optimize a clustering quality metric [43]. Zhou et al. propose a method based on a bipartite graph that links depictions of meals to their ingredients. Johnson et al. [44] propose to use the VisualGenome data to recover images from text queries. ImageGraph is different from these data sets in that the relation types hold between different images and image-annotated entities. This defines a novel class of problems where one seeks to answer queries such as “How are these two images related?" With this work, we address problems ranging from predicting the relation types for image pairs to multi-relational image retrieval.

            Entities  Relations  Triples                      Images
                                 Train     Valid    Test      Train    Valid    Test
FB15k [10]  14,951    1,345      483,142   50,000   59,071    0        0        0
ImageGraph  14,870    1,330      460,406   47,533   56,071    411,306  201,832  216,793
Table 1: Statistics of the knowledge graphs used in this paper.

Zero-shot Learning

We focus on exploring ways in which KGs can be used to find relationships between unseen images, that is, images depicting novel entities that are not part of the KG, and visual depictions of known KG entities. This is a form of zero-shot learning (ZSL) where the objective is to generalize to novel visual concepts without seeing any training examples. Generally, ZSL methods (e.g. [45, 46]) rely on an underlying embedding space, such as attributes, in order to recognize the unseen categories. However, in this paper, we do not assume the availability of such a common embedding space; instead, we assume the existence of an external visual-relational KG. When such explicit knowledge is not encoded in the underlying embedding space, other works rely on finding similarities through the linguistic space (e.g. [47, 37]), leveraging distributional word representations so as to capture a notion of taxonomy and similarity. But these works address scene understanding in a single image, i.e., these models detect the visual relationships within one given image. In contrast, our models are able to find relationships between different images and entities.

3 ImageGraph: A Visual-Relational Knowledge Graph

ImageGraph is a visual-relational KG whose relational structure is based on that of Freebase [48]. More specifically, it is based on FB15k, a subset of Freebase, which has been used as a benchmark data set [13]. Since FB15k does not include visual data, we performed the following steps to enrich the KG entities with image data. We implemented a web crawler that is able to parse query results for the image search engines Google Images, Bing Images, and Yahoo Image Search. To minimize the amount of noise due to polysemous entity labels (for example, there are more than 100 Freebase entities with the text label “Springfield"), we extracted, for each entity in FB15k, all Wikipedia URIs from the billion-triple Freebase RDF dump. For instance, for Springfield, Massachusetts, we obtained URIs such as Springfield_(Massachusetts,United_States) and Springfield_(MA). These URIs were processed and used as search queries for disambiguation purposes. We used the crawler to download more than 2.4M images (more than 462 GB of data). We removed corrupted, low-quality, and duplicate images, and we used the 25 top images returned by each of the image search engines whenever there were more than 25 results. The images were scaled to have a maximum height or width of 500 pixels while maintaining their aspect ratio. This resulted in 829,931 images associated with 14,870 different entities (about 56 images per entity on average). After filtering out triples where either the head or tail entity could not be associated with an image, the visual KG consists of 564,010 triples expressing 1,330 different relation types between 14,870 entities. We provide three sets of triples for training, validation, and testing, plus three corresponding image splits for training, validation, and testing. Table 1 lists the statistics of the resulting visual KG. Any KG derived from FB15k, such as FB15k-237 [49], can also be associated with the crawled images.
Since providing the images themselves would violate copyright law, we provide the code for the distributed crawler and the list of image URLs crawled for the experiments in this paper.¹

¹ ImageGraph crawler and URLs:
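The image normalization step described above (capping the longer side at 500 pixels while preserving aspect ratio) can be sketched as follows. The function name and the rounding behavior are our own assumptions; the paper does not specify them:

```python
def scaled_size(width, height, max_side=500):
    """Return a new (width, height) with the longer side capped at max_side,
    preserving aspect ratio. Images already within the cap are left untouched."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to whole pixels; never collapse a dimension to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

In practice this computation would feed an image-resizing library; only the target-size arithmetic is shown here.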

The distribution of relation types is depicted in Figure 6(a). We show, on a logarithmic scale, the number of times each relation occurs in the KG. We observe that some relation types occur quite frequently while others occur only a few times. Some of the relation types are symmetric, some are asymmetric, and some are neither (see Figure 6(b)). There are 585 distinct entity types. In Figure 9(b) we plot the most frequent entity types, and in Figure 9(a) we plot the entity frequencies and some example entities.

Figure 6: (a) plots the distribution of relation type frequencies. The y-axis represents the number of occurrences and the x-axis the relation type index. (b) shows the most common relation types.

To the best of our knowledge, ImageGraph is the visual-relational KG with the most entities, relation types, and entity-level images. The main differences between ImageGraph and ImageNet are the following. ImageNet is based on WordNet, a lexical database where synonymous words from the same lexical category are grouped into synsets; its relations express connections between synsets. In Freebase, on the other hand, there are two orders of magnitude more relation types. In FB15k, the subset we focus on, there are 1,345 relations expressing, for example, the locations of places, the positions of basketball players, and the gender of entities. Moreover, entities in ImageNet exclusively represent entity types, whereas entities in FB15k are either entity types or instances of entity types. This renders the computer vision problems associated with ImageGraph more challenging than those for existing datasets. Moreover, with ImageGraph the focus is on learning relational ML models that incorporate visual data both during learning and at query time.

Figure 9: (b) depicts the 10 most frequent entity types. (a) plots the entity distribution. The y-axis represents the total number of occurrences and the x-axis the entity indices.

4 Representation Learning for Visual-Relational Graphs

A knowledge graph (KG) is given by a set of triples T ⊆ E × R × E, that is, statements of the form (h, r, t), where h, t ∈ E are the head and tail entities, respectively, and r ∈ R is a relation type. Figure 3(a) depicts a small fragment of a KG with relations between entities and images associated with the entities. Prior work has not included image data and has, therefore, focused on the following two types of queries. First, the query type (h, ?, t) asks for the relations between a given pair of head and tail entities. Second, the query types (h, r, ?) and (?, r, t) ask for entities correctly completing the triple. The latter query type is often referred to as knowledge base completion. Here, we focus on queries that involve visual data as query objects, that is, objects that are either contained in the queries, the answers to the queries, or both.

4.1 Visual-Relational Query Answering

When entities are associated with image data, several completely novel query types are possible. Figure 3(b) lists the query types we focus on in this paper. We refer to images used during training as seen and all other images as unseen.

  1. Given a pair of unseen images for which we do not know their KG entities, determine the unknown relations between the underlying entities.

  2. Given an unseen image, for which we do not know the underlying KG entity, and a relation type, determine the seen images that complete the query.

  3. Given an unseen image of an entirely new entity that is not part of the KG, and an unseen image for which we do not know the underlying KG entity, determine the unknown relations between the two underlying entities.

  4. Given an unseen image of an entirely new entity that is not part of the KG, and a known KG entity, determine the unknown relations between the two entities.

For each of these query types, the sought-after relations between the underlying entities have never been observed during training. Query types (3) and (4) are a form of zero-shot learning since neither the new entity’s relationships with other entities nor its images have been observed during training. These considerations illustrate the novel nature of the visual query types. The machine learning models have to be able to learn the relational semantics of the KG and not simply a classifier that assigns images to entities. These query types are also motivated by the fact that for typical KGs the number of entities is orders of magnitude greater than the number of relations.

Figure 12: (a) Proposed architecture for query answering. (b) Illustration of two possible approaches to visual-relational query answering. One can predict relation types between two images directly (green arrow; our approach) or combine an entity classifier with a KB embedding model for relation prediction (red arrows; the VGG16+DistMult baseline).

4.2 Deep Representation Learning for Query Answering

We first discuss the state of the art of KG embedding methods and translate the concepts to query answering in visual-relational KGs. Let x_h be the raw feature representation for entity h, and let g and emb be differentiable functions. Most KG completion methods learn an embedding of the entities in a vector space via some scoring function that is trained to assign high scores to correct triples and low scores to incorrect triples. Scoring functions often have the form f_r(e_h, e_t) = w_r · g(e_h, e_t), where r is a relation, e_h = emb(h) and e_t = emb(t) are d-dimensional vectors (the embeddings of the head and tail entities, respectively), and where emb is an embedding function that maps the raw input representation of entities to the embedding space. In the case of KGs without additional visual data, the raw representation of an entity is simply its one-hot encoding.

Existing KG completion methods use the embedding function emb(h) = W x_h, where W is a d × |E| matrix, and differ only in their scoring function, that is, in the way that the embedding vectors of the head and tail entities are combined with the parameter vector w_r:

  • Difference (TransE [10]): f_r(e_h, e_t) = w_r · (e_h − e_t), where w_r is a d-dimensional vector;

  • Multiplication (DistMult [50]): f_r(e_h, e_t) = w_r · (e_h ⊙ e_t), where ⊙ is the element-wise product and w_r a d-dimensional vector;

  • Circular correlation (HolE [51]): f_r(e_h, e_t) = w_r · (e_h ⋆ e_t), where ⋆ is the circular correlation and w_r a d-dimensional vector; and

  • Concatenation: f_r(e_h, e_t) = w_r · [e_h; e_t], where [·;·] is the concatenation operator and w_r a 2d-dimensional vector.
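The four composition operations and the resulting score can be sketched in plain Python as follows. This is illustrative only: `compose` and `score` are our names, the circular correlation is computed naively in O(d²), and a real system would use vectorized tensor operations:

```python
def circular_correlation(a, b):
    # (a * b)[k] = sum_i a[i] * b[(i + k) mod d]  -- the HolE composition
    d = len(a)
    return [sum(a[i] * b[(i + k) % d] for i in range(d)) for k in range(d)]

def compose(e_h, e_t, op):
    """g(e_h, e_t) for the four composition operations listed above."""
    if op == "diff":   # TransE-style difference
        return [h - t for h, t in zip(e_h, e_t)]
    if op == "mult":   # DistMult-style element-wise product
        return [h * t for h, t in zip(e_h, e_t)]
    if op == "corr":   # HolE-style circular correlation
        return circular_correlation(e_h, e_t)
    if op == "cat":    # concatenation; w_r must then be 2d-dimensional
        return list(e_h) + list(e_t)
    raise ValueError(op)

def score(e_h, e_t, w_r, op):
    """f_r(e_h, e_t) = w_r . g(e_h, e_t)"""
    return sum(w * x for w, x in zip(w_r, compose(e_h, e_t, op)))
```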

For each of these instances, the matrix W (storing the entity embeddings) and the vectors w_r are learned during training. In general, the parameters Θ are trained such that f_r(e_h, e_t) is high for true triples and low for triples assumed not to hold in the KG. The training objective is often based on the logistic loss, which has been shown to be superior for most of the composition functions [52]:

min_Θ Σ_{(h,r,t)∈T_pos} log(1 + exp(−f_r(e_h, e_t))) + Σ_{(h,r,t)∈T_neg} log(1 + exp(f_r(e_h, e_t))) + λ‖Θ‖²_2,

where T_pos and T_neg are the sets of positive and negative training triples, respectively, Θ are the parameters trained during learning, and λ is a regularization hyperparameter. For the above objective, a process for creating corrupted triples is required. This often involves sampling a random entity for either the head or tail entity. To answer queries of the types (h, r, ?) and (?, r, t) after training, we form all possible completions of the queries and compute a ranking based on the scores assigned by the trained model to these completions.
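The logistic loss above can be sketched as follows, assuming the scores f_r(e_h, e_t) have already been computed for the positive and negative triples (the function name and argument layout are ours):

```python
import math

def logistic_loss(pos_scores, neg_scores, params=None, lam=0.0):
    """Logistic loss: softplus(-f) for positive triples, softplus(f) for
    negative triples, plus optional L2 regularization lam * ||theta||^2."""
    loss = sum(math.log1p(math.exp(-s)) for s in pos_scores)
    loss += sum(math.log1p(math.exp(s)) for s in neg_scores)
    if params:
        loss += lam * sum(p * p for p in params)
    return loss
```

Minimizing this pushes positive scores up and negative scores down, as described in the text.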

For queries of the type (h, ?, t) one typically uses the softmax activation in conjunction with the categorical cross-entropy loss, which does not require negative triples:

min_Θ − Σ_{(h,r,t)∈T_pos} log( exp(f_r(e_h, e_t)) / Σ_{r′∈R} exp(f_{r′}(e_h, e_t)) ),

where Θ are the parameters trained during learning.
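A corresponding sketch of the categorical cross-entropy over relation types for a (h, ?, t) query, assuming one score per relation type. The log-sum-exp stabilization is a standard implementation detail, not something stated in the paper:

```python
import math

def relation_cross_entropy(scores, true_r):
    """-log softmax(scores)[true_r], where scores[r] = f_r(e_h, e_t)
    for every relation type r and true_r indexes the gold relation."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[true_r]
```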

For visual-relational KGs, the input consists of raw image data instead of the one-hot encodings of entities. The approach we propose builds on the ideas and methods developed for KG completion. Instead of a simple embedding function that multiplies the input with a weight matrix, however, we use deep convolutional neural networks to extract meaningful visual features from the input images. For the composition function g, we evaluate the four operations that were used in the KG completion literature: difference, multiplication, concatenation, and circular correlation. Figure 12(a) depicts the basic architecture we trained for query answering. The weights of the parts of the neural network responsible for embedding the raw image input are tied. We also experimented with additional hidden layers, indicated by the dashed dense layer. The composition operation is either difference, multiplication, concatenation, or circular correlation. To the best of our knowledge, this is the first time that KG embedding learning and deep CNNs have been combined for visual-relational query answering.
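The overall architecture can be sketched as a siamese forward pass: a tied embedding layer applied to the CNN features of both images, a composition step (CAT shown here), and a softmax over relation types. The tiny dimensions and random weights below are stand-ins for the paper's 4096-dimensional VGG16 features, 256-dimensional embeddings, and 1,330 relation types:

```python
import math
import random

# Stand-in dimensions (paper: 4096 CNN features -> 256-dim embedding, 1,330 relations).
D_CNN, D_EMB, N_REL = 8, 4, 5
rng = random.Random(0)
W_emb = [[rng.gauss(0, 0.1) for _ in range(D_CNN)] for _ in range(D_EMB)]      # tied embedding layer
W_out = [[rng.gauss(0, 0.1) for _ in range(2 * D_EMB)] for _ in range(N_REL)]  # softmax layer

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(s - m) for s in z]
    total = sum(e)
    return [v / total for v in e]

def predict_relations(feat_head, feat_tail):
    """Siamese forward pass: both images pass through the same (tied) embedding
    layer, the embeddings are concatenated (CAT), and a softmax over all
    relation types answers a (img_h, ?, img_t) query."""
    e_h, e_t = matvec(W_emb, feat_head), matvec(W_emb, feat_tail)
    return softmax(matvec(W_out, e_h + e_t))  # list concatenation implements CAT
```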

5 Experiments

We conduct a series of experiments to evaluate our proposed approach to visual-relational query answering. First, we describe the experimental set-up that applies to all experiments. Second, we report and interpret results for the different types of visual-relational queries.

5.1 General Set-up

We used Caffe, a deep learning framework [53], for designing, training, and evaluating the proposed models. The embedding function is based on the VGG16 model introduced in [54]. We pre-trained the VGG16 on the ILSVRC2012 data set derived from ImageNet [8] and removed the softmax layer of the original VGG16. We added a 256-dimensional layer after the last dense layer of the VGG16. The output of this layer serves as the embedding of the input images. The reduction of the embedding dimensionality from 4096 to 256 is motivated by the objective of obtaining an efficient and compact latent representation that is feasible for KGs with billions of entities. For the composition function g, we performed one of the four operations: difference, multiplication, concatenation, or circular correlation. We also experimented with an additional hidden layer with ReLU activation. Figure 12(a) depicts the generic network architecture. The output layer of the architecture has a softmax or sigmoid activation with cross-entropy loss. We initialized the weights of the newly added layers with the Xavier method [55].

We used the largest batch size that fit into GPU memory. To create the training batches, we sample a triple uniformly at random from the training triples. For the given triple, we randomly sample one image for the head and one for the tail from the set of training images. We applied SGD with a smaller learning rate for the parameters of the VGG16 and a larger learning rate for the remaining parameters. It is crucial to use two different learning rates, since the large gradients in the newly added layers would lead to unreasonable changes in the pretrained part of the network. We also applied weight decay. We reduced the learning rate every 40,000 iterations. Each of the models was trained for 100,000 iterations.
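The two-learning-rate schedule can be sketched as per-group SGD updates: pretrained VGG16 parameters get a smaller base rate than the newly added layers, and both rates are reduced every 40,000 iterations. The decay factor of 0.1 and the function layout are our assumptions; the paper states the decay interval but not the factor:

```python
def sgd_step(params, grads, groups, step, base_lrs, decay_every=40000, decay=0.1):
    """One SGD update with per-group learning rates. `groups` maps each
    parameter name to its group ("pretrained" vs newly added), and `base_lrs`
    maps each group to its base learning rate."""
    factor = decay ** (step // decay_every)  # step-wise learning-rate decay
    updated = {}
    for name, values in params.items():
        lr = base_lrs[groups[name]] * factor
        updated[name] = [p - lr * g for p, g in zip(values, grads[name])]
    return updated
```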

Since the answers to all query types are either rankings of images or rankings of relations, we utilize metrics measuring the quality of rankings. In particular, we report results for hits@1 (hits@10, hits@100), measuring the percentage of times the correct relation was ranked highest (ranked in the top 10, top 100). We also compute the median of the ranks of the correct entities or relations and the Mean Reciprocal Rank (MRR) for entity and relation rankings, respectively, defined as

MRR = (1/|T|) Σ_{(h,r,t)∈T} 1/rank(h,r,t),

where T is the set of all test triples and rank(h,r,t) is, depending on the query type, the rank of the correct relation or the rank of the highest ranked image of the correct entity. For each query, we remove all triples that are also correct answers to the query from the ranking. All experiments were run on commodity hardware with 128 GB RAM, a single 2.8 GHz CPU, and an NVIDIA 1080 Ti.
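The filtered ranking metrics described above can be sketched as follows (function names are ours):

```python
from statistics import median

def ranking_metrics(ranked, target, known_positives=()):
    """Filtered rank of `target` in `ranked` (best first): other known correct
    answers are removed from the ranking before computing the rank."""
    filtered = [x for x in ranked if x == target or x not in known_positives]
    rank = filtered.index(target) + 1
    return {"rank": rank, "hits@1": rank <= 1, "hits@10": rank <= 10, "rr": 1.0 / rank}

def aggregate(ranks):
    """Median rank, MRR, and hits@10 (in percent) over a set of test queries."""
    return {"median": median(ranks),
            "mrr": sum(1.0 / r for r in ranks) / len(ranks),
            "hits@10": 100.0 * sum(r <= 10 for r in ranks) / len(ranks)}
```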

5.2 Visual Relation Prediction

Model           Median  Hits@1  Hits@10  MRR
VGG16+DistMult      94     6.0     11.4  0.087
Prob. Baseline      35     3.7     26.5  0.104
DIFF                11    21.1     50.0  0.307
MULT                 8    15.5     54.3  0.282
CAT                  6    26.7     61.0  0.378
DIFF+1HL             8    22.6     55.7  0.333
MULT+1HL             9    14.8     53.4  0.273
CAT+1HL              6    25.3     60.0  0.365
Table 2: Results for the relation prediction problem.

Given a pair of unseen images, we want to determine the relations between their underlying unknown entities. This can be expressed with queries of the form (img_h, ?, img_t). Figure 3(b), query (1), illustrates this query type, which we refer to as visual relation prediction. We train the deep architectures using the training and validation triples and images, respectively. For each triple in the training data set, we sample one training image uniformly at random for both the head and the tail entity. We use the architecture depicted in Figure 12(a) with the softmax activation and the categorical cross-entropy loss. For each test triple, we sample one image uniformly at random from the test images of the head and tail entity, respectively. We then use the pair of images to query the trained deep neural networks. To get a more robust statistical estimate of the evaluation measures, we repeat the above process three times per test triple. Again, none of the test triples and images are seen during training, nor are any of the training images used during testing. Computing the answer to one query takes the model 20 ms.

We compare the proposed architectures to two different baselines: one based on entity classification followed by a KB embedding method for relation prediction (VGG16+DistMult), and a probabilistic baseline (Prob. Baseline). The entity classification baseline consists of fine-tuning a pretrained VGG16 to classify images into the entities of ImageGraph. To obtain the relation type ranking at test time, we predict the entities for the head and the tail using the VGG16 and then use the KB embedding method DistMult [50] to return a ranking of relation types for the given (head, tail) pair. DistMult is a KB embedding method that achieves state-of-the-art results for KB completion on FB15k [56]. For this experiment, therefore, we just substitute the original output layer of the VGG16 pretrained on ImageNet with a new output layer suitable for our problem. To train it, we join the train and validation splits, set a single learning rate for all layers, and train following the same strategy that we use in all of our experiments. Once the system is trained, we test the model by classifying the entities of the images in the test set. To train DistMult, we sample negative triples for each positive triple and use a fixed embedding dimensionality. Figure 12(b) illustrates the VGG16+DistMult baseline and contrasts it with our proposed approach. The second baseline (probabilistic baseline) computes the probability of each relation type using the set of training and validation triples. The baseline ranks relation types based on these prior probabilities.
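The probabilistic baseline amounts to ranking relation types by their empirical frequency in the training and validation triples, e.g.:

```python
from collections import Counter

def relation_prior_ranking(triples):
    """Rank relation types by how often they occur in the given (h, r, t)
    triples; most frequent first. Returned for every query regardless of input."""
    counts = Counter(r for _, r, _ in triples)
    return [r for r, _ in counts.most_common()]
```

With the highly skewed relation distribution in ImageGraph, this static ranking is already a competitive baseline, as the results below show.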

Figure 13: Example queries and results for the multi-relational image retrieval problem.

Table 2 lists the results for the two baselines and the different proposed architectures. The probabilistic baseline outperforms the VGG16+DistMult baseline on 3 of the metrics. This is due to the highly skewed distribution of relation types in the training, validation, and test triples: a small number of relation types makes up a large fraction of triples. Figures 6(a) and 9(a) plot the counts of relation types and entities. Moreover, despite DistMult achieving a hits@1 value of 0.46 for the relation prediction problem between entity pairs, the VGG16+DistMult baseline performs poorly. This is due to the poor entity classification performance of the VGG16 (accuracy: 0.082, F1: 0.068). In the remainder of the experiments, therefore, we only compare to the probabilistic baseline. In the lower part of Table 2, we list the results of the experiments. DIFF, MULT, and CAT stand for the different possible composition operations. We omitted the circular correlation composition operation since we were not able to make the corresponding model converge, despite trying several different optimizers and hyperparameter settings. The post-fix 1HL stands for architectures where we added an additional hidden layer with ReLU activation before the softmax. The concatenation operation clearly outperforms the multiplication and difference operations. This is contrary to findings in the KG completion literature, where MULT and DIFF outperformed the concatenation operation. The models with the additional hidden layer did not perform better than their shallower counterparts, with the exception of the DIFF model. We hypothesize that this is because difference is the only linear composition operation and therefore benefits from an additional non-linearity. Each of the proposed models outperforms the baselines.

            Median         Hits@100      MRR
Model       Head    Tail   Head   Tail   Head   Tail
Baseline    6504    2789   11.9   18.4   0.065  0.115
DIFF        1301     877   19.6   26.3   0.051  0.094
MULT        1676    1136   16.8   22.9   0.040  0.080
CAT         1022     727   21.4   27.5   0.050  0.087
DIFF+1HL    1644    1141   15.9   21.9   0.045  0.085
MULT+1HL    2004    1397   14.6   20.5   0.034  0.069
CAT+1HL     1323     919   17.8   23.6   0.042  0.080
CAT-SIG      814     540   23.2   30.1   0.049  0.082
Table 3: Results for the multi-relational image retrieval problem.

5.3 Multi-Relational Image Retrieval

Given an unseen image, for which we do not know the underlying KG entity, and a relation type, we want to retrieve existing images that complete the query. If the image for the head entity is given, we return a ranking of images for the tail entity; if the tail entity image is given, we return a ranking of images for the head entity. This problem corresponds to query type (2) in Figure 3(b). Note that this is equivalent to performing multi-relational metric learning which, to the best of our knowledge, has not been done before. We performed experiments with each of the three composition functions and with two different activation/loss functions. First, we used the models trained with the softmax activation and the categorical cross-entropy loss to rank images. Second, we took the models trained with the softmax activation and substituted the softmax activation with a sigmoid activation and the corresponding binary cross-entropy loss. For each training triple (h, r, t), we then created two negative triples by sampling once the head and once the tail entity from the set of entities. The negative triples are then used in conjunction with the logistic loss of Section 4.2 to refine the pretrained weights. Directly training a model with the binary cross-entropy loss was not possible since the model did not converge properly. Pretraining with the softmax activation and categorical cross-entropy loss was crucial to make the binary loss work.

During testing, we used the test triples and ranked the images based on the probabilities returned by the respective models. For instance, given a query of the form (img_h, r, ?), we substituted the tail image with all training and validation images, one at a time, and ranked the images according to the probabilities returned by the models. We use the rank of the highest ranked image belonging to the true tail entity to compute the values for the evaluation measures. We repeat the same experiment three times (each time randomly sampling the images) and report average values. Again, we compare the results for the different architectures with a probabilistic baseline. For the baseline, however, we compute a distribution of head and tail entities for each of the relation types; that is, for each relation type we compute two distributions, one for head and one for tail entities. We used the same measures as in the previous experiment to evaluate the returned image rankings.

Table 3 lists the results of the experiments. As for relation prediction, the best performing models are based on the concatenation operation, followed by the difference and multiplication operations. The architectures with an additional hidden layer do not improve the performance. We also provide the results for the concatenation-based model with softmax activation where we refined the weights using a sigmoid activation and negative sampling as described before (CAT-SIG). This is the best performing model. All neural network models are significantly better than the baseline with respect to the median and hits@100. However, the baseline has slightly superior results for the MRR. This is due to the skewed distribution of entities and relations in the KG (see Figures 6(a) and 9(a)). This shows once more that the baseline is highly competitive for the given KG. Figure 13 visualizes the answers the CAT-SIG model provided for a set of four example queries. For the two queries on the left, the model performed well and ranked the correct entity in the top 3 (green frame). The examples on the right illustrate queries for which the model returned an inaccurate ranking. To perform query answering in a highly efficient manner, we precomputed and stored all image embeddings once, and only compute the scoring function (involving the composition operation and a dot product with w_r) at query time. Answering one multi-relational image retrieval query (which would otherwise require 613,138 individual queries, one per possible image) took only 90 ms.
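The efficiency trick described above can be sketched for the CAT score: with all image embeddings precomputed, the head half of the dot product w_r · [e_h; e_t] is constant across candidates, so each query reduces to one small dot product per stored image embedding (pure Python shown for clarity; a real system would batch this as a matrix-vector product):

```python
def retrieve_tails(e_h, w_r, image_embs):
    """Rank precomputed candidate tail-image embeddings for a (head image, r, ?)
    query under the CAT score f_r = w_r . [e_h; e_t]. Returns candidate
    indices, best first."""
    d = len(e_h)
    head_part = sum(w * v for w, v in zip(w_r[:d], e_h))  # shared by all candidates
    scores = [head_part + sum(w * v for w, v in zip(w_r[d:], e_t))
              for e_t in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: -scores[i])
```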

                       Median        Hits@1        Hits@10        MRR
Zero-Shot Query (3)
  Base                 34    31      1.9    2.3    18.2   28.7    0.074  0.089
  CAT                   8     7     19.1   22.4    54.2   57.9    0.306  0.342
Zero-Shot Query (4)
  Base                  9     5     13.0   22.6    52.3   64.8    0.251  0.359
  CAT                   5     3     26.9   33.7    62.5   70.4    0.388  0.461

Table 4: Results for the zero-shot learning experiments.

5.4 Zero-Shot Visual Relation Prediction

Figure 14: Example results for zero-shot learning. For each pair of images the top three relation types (as ranked by the CAT model) are listed. For the pair of images at the top, the first relation type is correct. For the pair of images at the bottom, the correct relation type is not among the top three relation types.

The last set of experiments addresses zero-shot learning via visual relation prediction. For both query types, we are given a new image of an entirely new entity that is not part of the KG. The first query type asks for the relations between the given image and another unseen image whose underlying KG entity is unknown. The second query type asks for the relations between the given image and an existing KG entity. We believe that creating multi-relational links to existing KG entities is a reasonable approach to zero-shot learning, since an unseen entity or category is integrated into an existing KG: its relations to existing visual concepts and their attributes characterize the new entity or category. This problem cannot be addressed with standard KG embedding methods, since those models require every entity to be part of the KG during training.

For the zero-shot experiments, we generated a new set of training, validation, and test triples. We randomly sampled 500 entities occurring as head and 500 occurring as tail in the set of test triples. We then removed all training and validation triples whose head or tail is one of these 1,000 entities. Finally, we kept only those test triples with one of the 1,000 entities as either head or tail, but not both. For query type (4), where the target entity is known, we sample 10 of its images, apply the models 10 times, and rank the relations by the average of the resulting probabilities. For query type (3) we use a single randomly sampled image. As in the previous experiments, we repeated the procedure three times and averaged the results. For the baseline, we compute the relative frequencies of the relation types in the training and validation sets (for query type (3)) and the probabilities of the relation types conditioned on the target entity (for query type (4)). Again, these are very competitive baselines due to the skewed distribution of relations and entities. Table 4 lists the results of the experiments. The model based on the concatenation operation (CAT) outperforms the baseline and performs surprisingly well: the deep models generalize to unseen images, since their performance is comparable to that on the relation prediction task (query type (1)), where the entity was part of the KG during training (see Table 2). Figure 14 depicts example queries for zero-shot query type (3). For the first query example, the CAT model ranked the correct relation type first (indicated by the green bounding box). The second example is more challenging, and the correct relation type was not among the top 10 ranked relation types.
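The relation-ranking step for query type (4) can be sketched as follows. The function name and the `model_probs` interface are assumptions: we treat the trained model as a black box that returns one probability per relation type for a pair of images, and average these vectors over the sampled images of the target entity before ranking.

```python
import numpy as np

def rank_relations(model_probs, head_img, tail_imgs):
    """Rank relation types for an unseen head image against a known KG
    entity, averaging the per-relation probabilities over several sampled
    images of that entity (10 in the setup above). `model_probs(h, t)` is
    assumed to return a vector with one probability per relation type."""
    probs = np.mean([model_probs(head_img, t) for t in tail_imgs], axis=0)
    return np.argsort(-probs)  # relation indices, most probable first
```

Averaging over several images of the target entity makes the ranking less sensitive to any single atypical depiction of that entity.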

6 Conclusion

KGs are at the core of numerous AI applications. Prior research has focused either on KG completion methods that operate only on the relational structure or on scene understanding within a single image. We presented a novel visual-relational KG whose entities are enriched with visual data, proposed several novel query types, and introduced neural architectures suitable for probabilistic query answering. We also proposed a novel approach to zero-shot learning that visually maps an image of an entirely new entity to an existing KG.

We have observed that for some relation types, the proposed models tend to learn a fine-grained visual type that typically occurs as the head or tail of the relation type. In these cases, conditioning on either the head or tail entity does not influence the predictions of the models substantially. This is a potential shortcoming of the proposed methods and we believe that there is a lot of room for improvement for probabilistic query answering in visual-relational KGs.


  • [1] Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G.: Gene Ontology: tool for the unification of biology. Nat Genet 25(1) (2000) 25–29
  • [2] Wishart, D.S., Knox, C., Guo, A., Cheng, D., Shrivastava, S., Tzur, D., Gautam, B., Hassanali, M.: Drugbank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Research 36 (2008) 901–906
  • [3] Rotmensch, M., Halpern, Y., Tlimat, A., Horng, S., Sontag, D.: Learning a health knowledge graph from electronic medical records. Nature Scientific Reports 5994(7) (2017)
  • [4] Bordes, A., Usunier, N., Chopra, S., Weston, J.: Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075 (2015)
  • [5] Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J.M.F., Parikh, D., Batra, D.: Visual dialog. In: CVPR. (July 2017)
  • [6] Ahn, S., Choi, H., Parnamaa, T., Bengio, Y.: A neural knowledge language model. arXiv preprint arXiv:1608.00318 (2016)
  • [7] Serban, I.V., García-Durán, A., Gulcehre, C., Ahn, S., Chandar, S., Courville, A., Bengio, Y.: Generating factoid questions with recurrent neural networks: The 30m factoid question-answer corpus. arXiv preprint arXiv:1603.06807 (2016)
  • [8] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR. (2009)
  • [9] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., Bernstein, M., Fei-Fei, L.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. In: arXiv preprint arXiv:1602.07332. (2016)
  • [10] Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems. (2013) 2787–2795
  • [11] Nickel, M., Tresp, V., Kriegel, H.P.: A three-way model for collective learning on multi-relational data. In: Proceedings of the 28th international conference on machine learning (ICML-11). (2011) 809–816
  • [12] Guu, K., Miller, J., Liang, P.: Traversing knowledge graphs in vector space. arXiv preprint arXiv:1506.01094 (2015)
  • [13] Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E.: A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104(1) (2016) 11–33
  • [14] Lao, N., Mitchell, T., Cohen, W.W.: Random walk inference and learning in a large scale knowledge base. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics (2011) 529–539
  • [15] Gardner, M., Mitchell, T.M.: Efficient and expressive knowledge base completion using subgraph feature extraction. In: EMNLP. (2015) 1488–1498
  • [16] Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In: CVPR. (June 2016)
  • [17] Veit, A., Kovacs, B., Bell, S., McAuley, J., Bala, K., Belongie, S.: Learning visual clothing style with heterogeneous dyadic co-occurrences. In: ICCV. (2015)
  • [18] Simo-Serra, E., Ishikawa, H.: Fashion style in 128 floats: Joint ranking and classification using weak data for feature extraction. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2016)
  • [19] Vaccaro, K., Shivakumar, S., Ding, Z., Karahalios, K., Kumar, R.: The elements of fashion style. In: Proceedings of the 29th Annual Symposium on User Interface Software and Technology, ACM (2016)
  • [20] Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence 32(9) (2010) 1627–1645
  • [21] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. (2014) 580–587
  • [22] Russakovsky, O., Deng, J., Huang, Z., Berg, A.C., Fei-Fei, L.: Detecting avocados to zucchinis: what have we done, and where are we going? In: International Conference on Computer Vision (ICCV). (2013)
  • [23] Marino, K., Salakhutdinov, R., Gupta, A.: The more you know: Using knowledge graphs for image classification. In: CVPR. (2017)
  • [24] Li, Y., Huang, C., Tang, X., Loy, C.C.: Learning to disambiguate by asking discriminative questions. In: ICCV. (2017)
  • [25] Doersch, C., Gupta, A., Efros, A.A.: Mid-level visual element discovery as discriminative mode seeking. In: Advances in Neural Information Processing Systems 26. (2013) 494–502
  • [26] Pandey, M., Lazebnik, S.: Scene recognition and weakly supervised object localization with deformable part-based models. In: Computer Vision (ICCV), 2011 IEEE International Conference on. (2011) 1307–1314
  • [27] Sadeghi, F., Tappen, M.F.: Latent pyramidal regions for recognizing scenes. In: Proceedings of the 12th European Conference on Computer Vision - Volume Part V. (2012) 228–241
  • [28] Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: Large-scale scene recognition from abbey to zoo. In: The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition. (2010) 3485–3492
  • [29] Teney, D., Liu, L., van den Hengel, A.: Graph-structured representations for visual question answering. In: CVPR. (July 2017)
  • [30] Gupta, A., Kembhavi, A., Davis, L.S.: Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009) 1775–1789
  • [31] Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.A.: Describing objects by their attributes. In: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. (2009) 1778–1785
  • [32] Malisiewicz, T., Efros, A.A.: Beyond categories: The visual memex model for reasoning about object relationships. In: Advances in Neural Information Processing Systems. (2009)
  • [33] Yao, B., Fei-Fei, L.: Modeling mutual context of object and human pose in human-object interaction activities. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. (2010) 17–24
  • [34] Chen, X., Shrivastava, A., Gupta, A.: Neil: Extracting visual knowledge from web data. In: Proceedings of the IEEE International Conference on Computer Vision. (2013) 1409–1416
  • [35] Izadinia, H., Sadeghi, F., Farhadi, A.: Incorporating scene context and object layout into appearance modeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 232–239
  • [36] Zhu, Y., Fathi, A., Fei-Fei, L.: Reasoning about object affordances in a knowledge base representation. In: European conference on computer vision. (2014) 408–424
  • [37] Lu, C., Krishna, R., Bernstein, M., Fei-Fei, L.: Visual relationship detection with language priors. In: ECCV. (2016)
  • [38] Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015) 815–823
  • [39] Bell, S., Bala, K.: Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG) 34(4) (2015)  98
  • [40] Oh Song, H., Xiang, Y., Jegelka, S., Savarese, S.: Deep metric learning via lifted structured feature embedding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4004–4012
  • [41] Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. In: Advances in Neural Information Processing Systems. (2016) 1857–1865
  • [42] Wang, J., Zhou, F., Wen, S., Liu, X., Lin, Y.: Deep metric learning with angular loss. In: International Conference on Computer Vision (ICCV). (2017)
  • [43] Song, H.O., Jegelka, S., Rathod, V., Murphy, K.: Deep metric learning via facility location. In: Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  • [44] Johnson, J., Krishna, R., Stark, M., Li, L.J., Shamma, D.A., Bernstein, M., Fei-Fei, L.: Image retrieval using scene graphs. In: CVPR. (2015)
  • [45] Romera-Paredes, B., Torr, P.: An embarrassingly simple approach to zero-shot learning. In: ICML. (2015)
  • [46] Zhang, Z., Saligrama, V.: Zero-shot learning via semantic similarity embedding. In: ICCV. (2015)
  • [47] Ba, J., Swersky, K., Fidler, S., Salakhutdinov, R.: Predicting deep zero-shot convolutional neural networks using textual descriptions. In: CVPR. (2015)
  • [48] Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: A collaboratively created graph database for structuring human knowledge. In: SIGMOD. (2008) 1247–1250
  • [49] Toutanova, K., Chen, D.: Observed versus latent features for knowledge base and text inference. In: Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality. (2015) 57–66
  • [50] Yang, B., Yih, W.t., He, X., Gao, J., Deng, L.: Learning multi-relational semantics using neural-embedding models. arXiv preprint arXiv:1411.4072 (2014)
  • [51] Nickel, M., Rosasco, L., Poggio, T.A.: Holographic embeddings of knowledge graphs. In: Proceedings of the Thirtieth Conference on Artificial Intelligence. (2016) 1955–1961
  • [52] Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., Bouchard, G.: Complex embeddings for simple link prediction. arXiv preprint arXiv:1606.06357 (2016)
  • [53] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
  • [54] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR (2014)
  • [55] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: AISTATS. (2010)
  • [56] Kadlec, R., Bajgar, O., Kleindienst, J.: Knowledge base completion: Baselines strike back. arXiv preprint arXiv:1705.10744 (2017)