Visual Semantic Information Pursuit: A Survey

03/13/2019 ∙ by Daqi Liu, et al. ∙ University of Surrey

Visual semantic information comprises two important parts: the meaning of each visual semantic unit and the coherent visual semantic relation conveyed by these visual semantic units. Essentially, the former one is a visual perception task while the latter one corresponds to visual context reasoning. Remarkable advances in visual perception have been achieved due to the success of deep learning. In contrast, visual semantic information pursuit, a visual scene semantic interpretation task combining visual perception and visual context reasoning, is still in its early stage. It is the core task of many different computer vision applications, such as object detection, visual semantic segmentation, visual relationship detection or scene graph generation. Since it helps to enhance the accuracy and the consistency of the resulting interpretation, visual context reasoning is often incorporated with visual perception in current deep end-to-end visual semantic information pursuit methods. However, a comprehensive review for this exciting area is still lacking. In this survey, we present a unified theoretical paradigm for all these methods, followed by an overview of the major developments and the future trends in each potential direction. The common benchmark datasets, the evaluation metrics and the comparisons of the corresponding methods are also introduced.

1 Introduction

Semantics is the linguistic and philosophical study of meaning, in language, programming languages or formal logic. In linguistics, the semantic signifiers can be words, phrases, sentences or paragraphs. To interpret complicated signifiers such as phrases or sentences, we need to understand the meaning of each word as well as the semantic relation among those words. Here, words are the basic semantic units and the semantic relation is any relationship between two or more words based on their meanings. In other words, the semantic relation defines the consistency among the associated semantic units in terms of meaning, which guarantees that the corresponding complex semantic signifier can be interpreted.

The above strategy can be seamlessly applied to the visual semantic information pursuit, in which the basic semantic units are potential pixels or potential bounding boxes while the visual semantic relation is represented as the local visual relationship structure or the holistic scene graph. For visual perception tasks such as visual semantic segmentation or object detection, the visual semantic relation promotes smoothness and consistency among the input visual semantic units. It acts as a regularizer and causes the associated visual semantic units to be biased towards certain configurations which are more likely to occur. For visual context reasoning applications, such as visual relationship detection or scene graph generation, the corresponding visual semantic units are considered as the associated context information and different inference methods are applied to pursue the visual semantic relation. In short, the visual semantic units and the visual semantic relation are complementary. The visual semantic units are the prerequisites of the visual semantic relation, while the visual semantic relation can be explored to further improve the detection accuracy of the visual semantic units.

The extent to which a visual semantic information pursuit method can interpret the input visual stimuli is totally dependent on the prior knowledge of the observer. Vocabulary is one part of the knowledge, which defines the meaning of each visual semantic unit. The vocabulary itself may be enough for some specific visual perception tasks (such as weakly-supervised learning for object detection). However, for visual context reasoning applications, it is certainly not sufficient since we still need additional knowledge to identify and understand the interpretable visual semantic relations. In most cases, besides the vocabulary, the associated benchmark datasets should provide other ground-truth information about the visual semantic relations.

In this survey, four main research topics in visual semantic information pursuit are introduced: object detection, visual semantic segmentation, visual relationship detection and scene graph generation. Traditionally, the first two tasks are categorized as visual perception applications. Instead of considering them as a single visual perception task, the current visual semantic information pursuit research treats them as a combination of perception and reasoning. Therefore, unlike the previous surveys, all applications mentioned in this article include visual context reasoning modules and can be trained end-to-end through the associated deep learning models.

Specifically, object detection (OD) aims at detecting all possible objects appearing in the input image by assigning corresponding bounding boxes as well as their associated labels. Essentially, it consists of two modules: localization and classification. The former is achieved by an associated regression algorithm while the latter is typically implemented by a corresponding classification method. In this article, we focus only on object detection methods with visual context reasoning modules [1], [2], [3], [4]. A comprehensive survey of conventional object detection methods can be found in [5].

Visual semantic segmentation (VSS) [6], [7], [8] refers to labelling each pixel with one of the semantic categories. To robustly parse input images, effective visual context modelling is essential. Due to its intrinsic characteristics, visual semantic segmentation is often formulated as an undirected graphical model, such as an Undirected Cyclic Graph (UCG) [8] or a Conditional Random Field (CRF) [9]. In most cases, the energy function corresponding to the undirected graphical model is factorized into two potential functions: a unary function and a binary function. The former generates the predicted label for each input pixel while the latter defines the pairwise interaction between adjacent pixels. As a constraint term, the binary potential function is used to regulate the predicted labels generated from the unary potential function to be spatially consistent and smooth.
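
To make this factorization concrete, the following minimal numpy sketch (not tied to any particular method; the costs and the 4-connected Potts-style binary penalty are illustrative assumptions) evaluates such an energy for a given label map:

```python
import numpy as np

def crf_energy(labels, unary, pairwise_weight=1.0):
    """Energy of a labelling under a simple grid CRF.

    labels: (H, W) integer label map.
    unary:  (H, W, C) per-pixel assignment costs (e.g. negative log-scores).
    The binary term is a Potts penalty on 4-connected neighbours that charges
    `pairwise_weight` whenever adjacent pixels disagree.
    """
    H, W = labels.shape
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Unary term: cost of the chosen label at every pixel.
    energy = unary[rows, cols, labels].sum()
    # Binary term: penalise label changes between horizontal/vertical neighbours.
    energy += pairwise_weight * (labels[:, 1:] != labels[:, :-1]).sum()
    energy += pairwise_weight * (labels[1:, :] != labels[:-1, :]).sum()
    return energy

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    unary = rng.random((4, 4, 3))      # toy costs for 3 classes
    noisy = unary.argmin(axis=-1)      # per-pixel best guess (perception only)
    smooth = np.zeros_like(noisy)      # a spatially consistent labelling
    print(crf_energy(noisy, unary), crf_energy(smooth, unary))
```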

Fig. 1: Four main visual semantic information pursuit applications are introduced in this survey, which include object detection (OD), visual semantic segmentation (VSS), visual relationship detection (VRD) and scene graph generation (SGG).

Visual relationship detection (VRD) [10], [11], [12], [13], [14], [15], [16], [17], [18] focuses on recognizing the potential relationship between pairs of detected objects, in which the output is often formulated as a triplet of the form ⟨subject, predicate, object⟩. Generally, it is not sufficient to interpret the input image by only recognizing the individual objects. The visual relationship triplet, in particular the predicate, plays an important role in understanding the input images. However, it is often hard to predict the predicates since they tend to exhibit a long-tail distribution. In most cases, for the same predicate, the diversity of the subject-object combinations is enormous [19]. In other words, compared with the individual objects, the corresponding predicates capture more general abstractions from the input images.

Scene graph generation (SGG) [20], [21], [22], [23], [24], [25], [26] builds a visually-grounded scene graph to explicitly model the objects and their relationships. Unlike the visual relationship detection, the scene graph generation aims to build a global scene graph instead of producing local visual relationship triplets. The contextual information conveyed within the scene graph is not limited to isolated triplets, but extends to all the related objects and predicates. To jointly infer the scene graph, message passing among the associated objects and predicates is essential in scene graph generation tasks.

The above visual semantic information pursuit applications, as shown in Fig.1, try to interpret the input image at different semantic levels. For instance, object detection tries to interpret the visual semantic units while visual semantic segmentation seeks to interpret the visual semantic regions (essentially, semantic regions are semantic units with different representation forms); visual relationship detection tries to interpret the visual semantic phrases, while scene graph generation attempts to interpret the visual semantic scene. Visual semantic information from the above low- and mid-level visual intelligence tasks is the basis of the high-level visual intelligence tasks such as visual captioning [27], [28], [29] or visual question answering [30], [31], [32].

Specifically, to accomplish the visual semantic information pursuit, three key questions need to be answered: 1) What kind of visual context information is required? 2) How to model the required visual context information? 3) How to infer the posterior distribution given the visual context information? This article presents a comprehensive survey of the state-of-the-art visual semantic information pursuit algorithms, which try to answer the above questions.

This survey is organized as follows: Section II presents the terminologies and the fundamentals of visual semantic information pursuit. Section III introduces a unified paradigm for all visual semantic information pursuit methods. The major developments and the future research directions are covered in Section IV and Section V, respectively. The common benchmarks and the evaluation metrics are summarized in Section VI, and the experimental comparison of the key methods is presented in Section VII. Finally, the conclusions are drawn in Section VIII.

2 Preliminary Knowledge

A typical visual semantic information pursuit method consists of two modules: a visual perception module and a visual context reasoning module. The visual perception module tries to detect visual semantic units from the input visual stimuli and assign specific meaning to them. Convolutional neural network (CNN) architectures such as fully convolutional networks (FCNs) [33] or faster regional CNNs (Faster R-CNNs) [34] are often used to model the visual perception tasks. They provide not only the initial predictions of the visual semantic units but also the locations of the associated bounding boxes. A comprehensive introduction to these CNN models can be found in the previous surveys [5], [35]. In general, region proposal networks (RPNs) [34], [36] are often used to produce the proposal bounding boxes, and region of interest (ROI) pooling [34] or bilinear feature interpolation [15] is usually applied to obtain the corresponding feature vectors. Three possible prior factors - visual appearance, class information and relative spatial relationship - are often considered in forming the visual semantic perception module, as shown in Fig.2.
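
As a concrete illustration of this feature-extraction step, the sketch below uses torchvision's roi_pool and roi_align operators on a randomly generated feature map; the feature-map size, stride and box coordinates are made-up placeholders rather than values from any cited method:

```python
import torch
from torchvision.ops import roi_align, roi_pool

# Assume a backbone has already produced a feature map for one image
# (batch of 1, 256 channels, 1/16 resolution of an 800x800 input).
features = torch.randn(1, 256, 50, 50)

# Proposal boxes in (batch_index, x1, y1, x2, y2) format, in input-image
# coordinates; in practice these would come from an RPN.
boxes = torch.tensor([[0, 32.0, 48.0, 256.0, 304.0],
                      [0, 400.0, 100.0, 720.0, 640.0]])

# ROI pooling (quantised bins) vs. ROI align (bilinear feature interpolation).
pooled = roi_pool(features, boxes, output_size=(7, 7), spatial_scale=1 / 16)
aligned = roi_align(features, boxes, output_size=(7, 7), spatial_scale=1 / 16)

# Flatten each 256x7x7 region into the per-proposal feature vector that the
# reasoning module consumes.
proposal_vectors = aligned.flatten(start_dim=1)
print(pooled.shape, proposal_vectors.shape)   # (2, 256, 7, 7), (2, 12544)
```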

Fig. 2: Three possible prior factors are often considered in the current visual perception modules.

Given the detected visual semantic units, the aim of the visual context reasoning module is to produce the most probable interpretation through a maximum a posteriori (MAP) inference. Basically, the above MAP inference is an NP-hard integer programming problem [37]. However, it is possible to address this integer programming problem via a linear relaxation, where the resulting linear programming problem can be presented as a variational free energy minimization approximation. Within current visual semantic information pursuit methods, to accomplish the MAP inference, the marginal polytopes corresponding to the target variational free energies are often approximated by their corresponding feasible polytopes [38]. Such feasible polytopes can be further factorized into numerous regions, which can be trained sequentially or in parallel using the corresponding optimization methods. Essentially, the aim of the above approximation is to find an upper bound for the target variational free energy. The tighter the upper bound, the better the MAP inference will be.
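
The following toy example (purely illustrative, with hand-picked costs) shows the relaxation idea on a two-node, two-label model: the MAP integer program is written over indicator pseudo-marginals and relaxed to a linear program over a feasible (local) polytope, solved here with scipy:

```python
import numpy as np
from scipy.optimize import linprog

# Variable order:
#   mu1(0), mu1(1), mu2(0), mu2(1), mu12(00), mu12(01), mu12(10), mu12(11)
unary1, unary2 = [0.2, 0.8], [0.7, 0.3]
pairwise = [0.0, 0.5, 0.5, 0.0]          # Potts: penalise disagreement
c = np.array(unary1 + unary2 + pairwise)

A_eq = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],            # mu1 sums to one
    [0, 0, 1, 1, 0, 0, 0, 0],            # mu2 sums to one
    [-1, 0, 0, 0, 1, 1, 0, 0],           # pairwise marginals consistent with mu1
    [0, -1, 0, 0, 0, 0, 1, 1],
    [0, 0, -1, 0, 1, 0, 1, 0],           # ... and consistent with mu2
    [0, 0, 0, -1, 0, 1, 0, 1],
])
b_eq = np.array([1, 1, 0, 0, 0, 0])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(res.x.round(2), "energy:", res.fun)  # integral optimum: both nodes take label 0
```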

Furthermore, to accomplish the MAP inference, the visual context reasoning module often incorporates prior knowledge to regularize the target variational free energy. Generally, there are two types of prior knowledge: internal prior knowledge and external prior knowledge. The internal prior knowledge is acquired from the visual stimulus itself. For instance, the adjacency of the visual stimuli is a typical form of internal prior knowledge, i.e. adjacent objects tend to have a relationship and adjacent pixels tend to have the same label. The external prior knowledge is obtained from external sources, such as tasks, contexts or knowledge bases. For instance, linguistic resources such as word2vec embeddings [39], [40] are often used as external prior knowledge since the embeddings can be used to measure the semantic similarity among different words. Specifically, the word embeddings are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another. Accordingly, the current visual semantic information pursuit algorithms can generally be divided into two categories: bottom-up methods and top-down methods. The former only use internal prior knowledge, while the latter incorporate both internal and external prior knowledge.

Besides the above MAP inference step, one still needs a model selection step to solve visual semantic information pursuit applications. Specifically, the MAP inference step finds the most probable interpretation of the input visual stimuli, while the model selection step aims to find the best model (an optimum member within a distribution family) by maximizing the corresponding conditional likelihood. Within the current visual semantic information pursuit methods, deep learning-based message passing optimization strategies are often used to accomplish the MAP inference step, while the model selection step is generally implemented by stochastic gradient descent (SGD) methods.

3 Unified Paradigm

Within a visual semantic information pursuit system, the visual perception module initializes the visual context reasoning module, while the visual context reasoning module constrains the visual perception module. The two modules are complementary since each can provide contextual information to the other. In recent years, deep learning models such as CNNs have been shown to achieve superior performance in numerous visual perception tasks [41], [42], [43], and they have become the de facto choice for visual perception modules in current research. However, conventional CNNs are still not close to solving the inference tasks within the visual context reasoning modules.

3.1 Formulation

To accomplish an inference task, a probabilistic graphical model is often adopted as a visual semantic information pursuit framework. It uses a graph-based representation as the foundation for encoding a distribution over a multi-dimensional space, i.e. a factorized representation of the set of independencies that hold in a specific distribution. Two types of graphical models are commonly used, namely, Bayesian Networks and Markov Random Fields. The former are directed acyclic graphical models with causality connections, while the latter are undirected graphical models, with cycles in most cases. Both can be reformulated as corresponding factor graph models. Specifically, within the associated factor graph model, the visual semantic units are represented as the variable nodes while the visual semantic relations are depicted as the factor nodes.

Given the associated factor graph model, the aim of the visual context reasoning module is to infer the most probable interpretation given the observed input visual stimuli. In other words, given the input images and other ground-truth information (such as the locations of the associated bounding boxes), we want to maximize the corresponding posterior distribution. This MAP inference is an NP-hard integer programming problem and it is often reformulated as a linear programming problem via a linear relaxation [38]. Furthermore, within the probabilistic graphical model, the posterior can be derived from the corresponding energy function according to Boltzmann's Law: the lower the energy, the more probable the potential interpretation. The above linear programming problem can be further expressed as a variational free energy minimization problem. In short, instead of performing exact inference, the visual context reasoning module interprets the input visual stimuli using a relevant variational free energy minimization approximation.
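
The energy-to-posterior relation can be illustrated with a tiny numerical example (the energy values below are arbitrary):

```python
import numpy as np

# Boltzmann's law on a toy problem: three candidate interpretations with
# hand-picked energies.  Lower energy -> higher posterior probability.
energies = np.array([1.2, 0.3, 2.5])
posterior = np.exp(-energies) / np.exp(-energies).sum()   # (1/Z) * exp(-E)
print(posterior)            # the interpretation with energy 0.3 dominates
print(posterior.argmax())   # MAP interpretation = index of the lowest energy
```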

3.2 Unified Paradigm

Based on the above analysis, the current visual semantic information pursuit methods follow a unified paradigm. Specifically, the visual perception module applies a corresponding CNN model to produce the visual semantic units, while the visual context reasoning module uses a relevant deep learning-based variational free energy minimization method to approximate the target visual semantic relations from the above visual semantic units. Fig.3 schematically represents the unified paradigm of the current visual semantic information pursuit methods.

Fig. 3: The unified paradigm of the current visual semantic information pursuit methods.

In this survey, we use the more general scene graph generation task as an example to develop a mathematical model corresponding to the unified paradigm. Other visual semantic information pursuit applications are special cases of this formulation. To generate a visually-grounded scene graph, the corresponding visual perception module applies a CNN model such as an RPN to automatically obtain an initial set of object bounding boxes $B_{I}$ from the input image $I$. For each proposal bounding box, the visual context reasoning module needs to infer three variables: 1) the associated object class label; 2) the corresponding four bounding box offsets relative to the proposal box coordinates; 3) the relevant predicate labels between the potential object pairs.

Given a set of object classes $\mathcal{C}$ and a set of relationship types $\mathcal{R}$, the above set of variables can be depicted as $x = \{ x_{i}^{cls}, x_{i}^{bbox}, x_{i \to j} \mid i = 1, \dots, n;\; j \neq i \}$, where $n$ is the number of the proposal bounding boxes, $x_{i}^{cls} \in \mathcal{C}$ represents the class label of the $i$-th proposal bounding box, $x_{i}^{bbox}$ depicts the bounding box offsets relative to the $i$-th proposal box coordinates, and $x_{i \to j} \in \mathcal{R}$ is the relationship predicate between the $i$-th and the $j$-th proposal bounding boxes. Generally, the ground-truth posterior $P(x \mid I, B_{I})$ is computationally intractable. Therefore, in current research, a tractable variational distribution $Q(x \mid I, B_{I})$ is often used to approximate the ground-truth posterior and we need to accomplish the following MAP inference to obtain the optimal interpretation:

$x^{*} = \arg\max_{x} Q(x \mid I, B_{I}) \qquad (1)$

where $x$ is a possible configuration or interpretation and $Q(x \mid I, B_{I})$ is generally considered as the target posterior, which can also be derived as follows:

$Q(x \mid I, B_{I}) = \frac{1}{Z} \exp\left( -E(x, I, B_{I}) \right) \qquad (2)$

where $E(x, I, B_{I})$ is the energy function, which computes the assignment cost for the potential interpretation $x$, and $Z$ is the associated partition function. Generally, the energy function can be factorized into a summation of numerous potential terms. For instance, the following equation demonstrates one possible factorization:

$E(x, I, B_{I}) = \sum_{i} \psi_{u}(x_{i} \mid I, B_{I}) + \sum_{i \neq j} \psi_{b}(x_{i}, x_{j} \mid I, B_{I}) \qquad (3)$

where $\psi_{u}$ represents the unary potential term and $\psi_{b}$ depicts the binary potential term. Essentially, the unary potential terms relate to the visual perception module, while the higher order potential terms characterize the visual context reasoning module.

Furthermore, supposing the above energy function is parametrized by $\theta$ (i.e. $E_{\theta}$), the current visual semantic information pursuit methods tend to minimize the corresponding target variational free energy $F(Q, \theta)$ by applying different deep learning-based coordinate descent strategies:

$(Q^{*}, \theta^{*}) = \arg\min_{Q,\, \theta} F(Q, \theta) \qquad (4)$

where the optimal $Q^{*}$ and $\theta^{*}$ can be obtained by alternating between a MAP inference step and a model selection step. In the current literature, deep learning-based message passing strategies are generally applied to implement the MAP inference step, while the model selection step is often accomplished by SGD methods. Fortunately, since the message passing update rule has already implicitly accomplished the MAP inference step, it is not necessary to state the variational free energy explicitly if one chooses to use a message passing optimization strategy. In other words, one can choose different types of variational free energy by changing the corresponding message passing update rules.
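
As a minimal illustration of the unary-plus-binary factorization in Eq. (3), the toy example below scores a two-box "scene graph" and finds the lowest-energy configuration by brute force; the class names, predicate names and costs are invented, and exhaustive search merely stands in for the approximate inference used in practice:

```python
import numpy as np

# Unary terms score each box's class label; the binary term scores the
# predicate assigned to the ordered pair, conditioned on the two classes.
CLASSES = ["person", "horse"]
PREDICATES = ["rides", "wears"]

unary = {0: np.array([0.1, 1.5]),        # box 0 looks like a person
         1: np.array([1.2, 0.2])}        # box 1 looks like a horse
# binary[c_i, c_j, p]: cost of predicate p between classes c_i and c_j.
binary = np.full((2, 2, 2), 2.0)
binary[0, 1, 0] = 0.1                    # <person, rides, horse> is cheap

def energy(cls0, cls1, pred):
    return unary[0][cls0] + unary[1][cls1] + binary[cls0, cls1, pred]

# Exhaustive MAP over the tiny configuration space.
configs = [(c0, c1, p) for c0 in range(2) for c1 in range(2) for p in range(2)]
best = min(configs, key=lambda cfg: energy(*cfg))
print(CLASSES[best[0]], PREDICATES[best[2]], CLASSES[best[1]], energy(*best))
```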

3.3 Training Strategy

The existing visual semantic information pursuit methods generally follow two main training strategies: modular training and end-to-end training. Within the model selection step, the error differentials of the former are only allowed to back-propagate within the visual context reasoning module, while the latter can further back-propagate the error differentials to the previous visual perception module so that the whole learning system can be trained end-to-end. Essentially, within the modular training strategy, the visual context reasoning module can be considered as a post-processing stage of visual perception. For instance, [44] formulates the sequential scene parsing task as a binary tree graphical model and proposes a Bayesian framework to infer the associated target posterior. Specifically, three variants of VGG nets [45] are applied as the visual perception module, while the proposed Bayesian framework is used as the post-processing visual context reasoning module. To accomplish the visual semantic segmentation task, [46], [47] use FCNs as the visual perception modules and apply CRF models as the post-processing visual context reasoning modules.

Instead of using modular training, the current visual semantic information pursuit methods tend to apply end-to-end training. Moreover, they tend to use deep learning based variational free energy minimization methods to model the visual context reasoning module. Such changes have several advantages: 1) since the error differentials within the visual context reasoning module can be back-propagated to the previous visual perception module, the whole system can be trained end-to-end and the final performance would be improved accordingly; 2) with the deep learning models, the classical inference operations like message passing or aggregation can be easily accomplished by a simple tensor manipulation; 3) the visual context reasoning module based on deep learning models can fully utilize the advanced parallel capability of the modern GPUs so that the inference speed can be improved.
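
The tensor-manipulation point can be made concrete with a small PyTorch sketch: one parallel message-passing step is a single matrix product over an (assumed) adjacency matrix, and gradients flow back through it, which is what enables end-to-end training:

```python
import torch

# Message aggregation as a single tensor operation: each node's incoming
# message is a normalised sum of its neighbours' hidden states.
num_nodes, hidden = 5, 16
adjacency = torch.tensor([[0, 1, 1, 0, 0],
                          [1, 0, 0, 1, 0],
                          [1, 0, 0, 1, 1],
                          [0, 1, 1, 0, 0],
                          [0, 0, 1, 0, 0]], dtype=torch.float32)
states = torch.randn(num_nodes, hidden, requires_grad=True)

degree = adjacency.sum(dim=1, keepdim=True).clamp(min=1)
messages = (adjacency @ states) / degree        # one parallel message-passing step
updated = torch.tanh(states + messages)         # simple state update

# Because everything is differentiable, the error signal can flow back through
# the reasoning step into the perception features (end-to-end training).
updated.sum().backward()
print(states.grad.shape)                        # (5, 16)
```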

4 Major Developments

In this survey, we will limit the discussion to the key deep learning based visual context reasoning methods. Specifically, based on the manner in which prior knowledge is applied, the current deep learning based visual context reasoning methods can be categorized as either bottom-up or top-down. In the following sections, we will introduce the major developments in these two directions in terms of the applied variational free energies and the corresponding optimization methods.

4.1 Major Developments in Bottom-up Methods

A visual semantic information pursuit task can generally be represented in terms of associated probabilistic graphical models. For instance, MRFs or CRFs are often used to model the visual semantic segmentation tasks. Given a probabilistic graphical model, the visual context reasoning module is often formulated as a MAP estimation or a variational free energy minimization. This optimization problem is generally NP-hard, so we need to relax the original tight constraints and use variational methods to approximate the target posterior. Specifically, we first need to define an associated variational free energy and then find a corresponding optimization method to minimize it. In most cases, the applied variational free energy depends on the corresponding relaxation strategy.

Even though numerous types of optimization methods [48] are capable of minimizing the target variational free energy, one particular type of optimization strategy - message passing [49], [50] - stands out from the competition and is widely applied in the current deep learning based visual context reasoning methods. This is because, unlike the sequential optimization methods (such as the steepest descent algorithm [51] and its many variants), the message passing strategy is capable of optimizing different decomposed sub-problems (usually from dual decomposition) in parallel. Furthermore, the message passing or aggregation operation can be easily accomplished by a simple tensor manipulation. For the bottom-up deep learning based visual context reasoning methods, only the internal prior knowledge can be used to regularize the associated variational free energies. Within the existing bottom-up methods, numerous message passing variants have been proposed in various deep learning architectures, which can be summarized as follows:

4.1.1 Triplet-based Reasoning Models

The visual relationship triplet ⟨subject, predicate, object⟩ plays an important role in understanding the input image. Instead of categorizing the triplet as a whole, the current bottom-up methods tend to jointly classify each component, since the computational complexity is thereby reduced from $\mathcal{O}(N^{2}K)$ to $\mathcal{O}(N + K)$ (for $N$ objects and $K$ predicates). However, it is extremely hard to detect predicates since they often obey a long-tail distribution (the complexity becomes quadratic when considering all possible subject-object pairs). Given a set of object classes $\mathcal{C}$ and a set of relationship types $\mathcal{R}$, the triplet variable can be depicted as $x = \{ x_{i}^{cls}, x_{i}^{bbox}, x_{i \to j} \}$, where $x_{i}^{cls} \in \mathcal{C}$ represents the class label of the associated proposal bounding box, $x_{i}^{bbox}$ depicts the bounding box offsets relative to the associated proposal box coordinates, and $x_{i \to j} \in \mathcal{R}$ is the relationship predicate. The aim of the triplet-based reasoning model is to maximize the posterior $P(x \mid F)$, in which $F$ represents the associated observed feature vectors. Through modeling the unary potential terms of the associated variational free energies, CNN-based visual perception modules are generally used to generate the above feature vectors.

Fig. 4: Message passing strategies for different types of triplet-based reasoning models, in which messages are passed among the corresponding triplet components.

In the current literature, the triplet-based reasoning models generally use CNN architectures to model higher order potential terms. Moreover, to minimize the variational free energy, the associated message passing strategies are often proposed based on the connection configuration of the triplet graphical model, as demonstrated in Fig.4. One typical triplet-based reasoning model is the relationship proposal network [16], which formulates the triplet structure as a fully connected clique and only considers a third-order potential term within the associated variational free energy. Two CNN-based compatibility modules are proposed to model the third-order potential terms so that a consistent unary prediction combination is rendered more likely. Inspired by the Faster R-CNN model, the authors in [12] propose a CNN-based phrase-guided message passing structure (PMPS) to infer input triplet proposals, in which the subjects and the objects are only connected through the predicates. They place the predicate at the dominant position and specifically design a gather-and-broadcast message passing strategy, which is applied in both convolutional and fully connected layers. Unlike the above methods, [15] proposes to model the predicate as a vector translation between the subject and the object, in which both subject and object are mapped into a low-dimensional relation space with less variance. Instead of using the conventional ROI pooling, the authors use bilinear feature interpolation to transfer knowledge between object and predicate.
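
A rough sketch of this translation-embedding idea is given below; the random vectors stand in for learned subject/object projections and predicate translations, and the scoring rule is a simplified distance rather than the exact formulation of [15]:

```python
import numpy as np

# Predicate as a vector translation in a low-dimensional relation space:
# a triplet is plausible when  subject + predicate ~ object.
rng = np.random.default_rng(0)
dim = 8
subj = rng.normal(size=dim)                      # projected subject feature
pred_vectors = {"rides": rng.normal(size=dim),   # learned predicate translations
                "wears": rng.normal(size=dim)}
obj = subj + pred_vectors["rides"] + 0.05 * rng.normal(size=dim)  # near "rides"

def translation_score(s, p, o):
    # Smaller distance between (s + p) and o means a more plausible triplet.
    return -np.linalg.norm(s + p - o)

scores = {name: translation_score(subj, vec, obj) for name, vec in pred_vectors.items()}
print(max(scores, key=scores.get))               # "rides"
```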

4.1.2 MRF-based or CRF-based Reasoning Models

MRFs and CRFs are commonly used undirected probabilistic graphical models in the computer vision community. They are capable of capturing rich contextual information exhibited in natural images or videos. MRFs are generative models while CRFs are discriminative models. Most of the visual semantic information pursuit applications, in particular the visual semantic segmentation tasks, are often formulated as MRFs or CRFs. Given an input stimulus and the variable representing the semantic information of interest, the aim of the MRF-based or CRF-based reasoning model is to maximize the posterior or, equivalently, to minimize the corresponding variational free energy. Exact inference methods only exist for special MRF or CRF structures such as junction trees or local cliques. For instance, the authors in [11] propose a deep relational network to detect the potential visual relationship triplets. They apply CRFs to model the associated fully connected triplet cliques and use sequential CNN computing layers to exactly infer the corresponding marginals factorized from the joint posterior.

Fig. 5: Message passing strategies for MRF-based or CRF-based reasoning models, in which messages are generally passed within the same semantic level.

In the current literature relating to the bottom-up approach, the general MRF or CRF structures often need variational inference methods such as mean field (MF) approximation [52], [53] or loopy belief propagation (BP) [54] to infer the target posterior. Specifically, the applied variational free energy is often devised depending on the relaxation strategy of the corresponding constrained optimization problem, and the message passing optimization methodology is generally applied to minimize the above variational free energy. Fig.5 shows the general message passing strategies of the MRF-based or CRF-based reasoning models. A well-known CRF-based reasoning model is CRF-as-RNN [6], [35], which incorporates the RNN-based visual context reasoning module into the FCN visual perception module so that the proposed visual semantic segmentation system can be trained end-to-end. Specifically, FCN layers are used to formulate the unary potentials of the DenseCRF model [55] while the binary potentials are formed by a sequence of CNN layers. As a result, the associated mean field inference method can be accomplished by the corresponding RNN. Essentially, two relaxation measures are applied within the mean field approximation: 1) a tractable variational distribution is used to approximate the underlying posterior; 2) the joint variational distribution is fully factorized into a combination of independent nodes, which implies that maximizing the marginal of each independent node is guaranteed to accomplish the original MAP estimation. Inspired by the above methodology, the authors in [20] use a more general RNN architecture - gated recurrent units (GRUs) - to generate scene graphs from the input images using the mean field inference method. Specifically, they use the internal memory cells in GRUs to store the generated contextual information and apply a primal-dual update rule to speed up the inference procedure. Unlike the above methods, which optimize the CRFs using an iterative strategy, the deep parsing network (DPN) proposed in [7] is able to achieve high visual semantic segmentation performance by applying only one MF iteration; it can also be considered as a generalized case of the existing models since it can represent various types of binary potential terms.
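
The following stripped-down numpy sketch shows the shape of one mean field update for a fully factorised variational distribution; the uniform coupling matrix and Potts compatibility are simplifying assumptions standing in for the learned bilateral kernels of DenseCRF/CRF-as-RNN:

```python
import numpy as np

def mean_field_step(q, unary, pairwise, compat):
    """One mean field update for a fully factorised variational distribution.

    q:        (N, C) current per-node label beliefs.
    unary:    (N, C) unary costs from the perception module.
    pairwise: (N, N) coupling weights between nodes, zero on the diagonal.
    compat:   (C, C) label compatibility costs (e.g. Potts).
    """
    message = pairwise @ q @ compat          # expected neighbour cost per label
    logits = -(unary + message)              # lower total cost -> higher belief
    logits -= logits.max(axis=1, keepdims=True)
    q_new = np.exp(logits)
    return q_new / q_new.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N, C = 6, 3
    unary = rng.random((N, C))
    pairwise = np.ones((N, N)) - np.eye(N)   # everything interacts (dense CRF flavour)
    compat = 1.0 - np.eye(C)                 # Potts: only disagreement is penalised
    q = np.full((N, C), 1.0 / C)
    for _ in range(5):                       # unrolled iterations = RNN steps
        q = mean_field_step(q, unary, pairwise, compat)
    print(q.argmax(axis=1))
```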

Fig. 6: Message passing strategies for the visual semantic hierarchy reasoning models, in which messages are passed through different semantic levels.

4.1.3 Visual Semantic Hierarchy Reasoning Models

In visual semantic information pursuit tasks, visual semantic hierarchies are ubiquitous. They often consist of visual semantic units, visual semantic phrases, local visual semantic regions and the scene graph. Given a visual semantic hierarchy, the contextual information within other semantic layers is often used to maximize the posterior of the current semantic layer. Essentially, such an inference procedure tends to have a tighter upper bound on the target variational free energy and thus often achieves better MAP inference performance. Specifically, to dynamically build visual semantic hierarchies, the visual semantic hierarchy reasoning models are often required to align the contextual information of different visual semantic levels. As shown in Fig.6, by passing the contextual information among different visual semantic levels within the generated visual semantic hierarchy, each visual semantic level can obtain a much more consistent posterior.

Unlike the previous methods that only model pairwise potential terms within the same visual semantic level, the structure inference network (SIN) [4] propagates contextual information from the holistic scene and the adjacent connected nodes to the target node. Within this object detection framework, GRUs are used to store the contextual information. Motivated by computational considerations, a max pooling layer is applied to aggregate the messages from the adjacent connected nodes into an integrated message. To leverage the contextual information across different semantic levels, the multi-level scene description network (MSDN) [21] establishes a dynamic graph consisting of object nodes, phrase nodes and region nodes. For each semantic level, a CNN-based merge-and-refine strategy is proposed to pass the contextual information along the graph structure.

4.1.4 DAG-based Reasoning Models

Generally, the variational free energy employed in CRFs fails to enforce higher-order contextual consistency due to computational considerations. Furthermore, small-sized objects are often smoothed out by the CRFs, which degrades the semantic segmentation performance. Instead of applying conventional CRFs, some bottom-up methods tend to use undirected cyclic graph (UCG) models to formulate the visual semantic segmentation tasks. However, due to their loopy structure, UCG models generally cannot be formulated as RNNs directly. To resolve this issue, as shown in Fig.7, UCGs are often decomposed into a sequence of directed acyclic graphs (DAGs), in which each DAG can be modelled by a corresponding RNN. Such an RNN architecture is also known as a DAG-RNN, which explicitly propagates local contextual information based on the directed graph structure. Essentially, a DAG-based reasoning model like the DAG-RNN has two main advantages: 1) compared with conventional CNN models (such as FCNs), it is empirically found to be significantly more effective at aggregating context; 2) it requires substantially fewer parameters and fewer computational operations, which makes it more favourable for applications on resource-limited embedded platforms.

Fig. 7: The applied UCG is decomposed into a sequence of DAGs (one possible decomposition), in which the messages in each DAG are passed along its specific structure.

In current DAG-based reasoning models [8], [56], the DAG-RNNs apply 8-neighborhood UCGs to effectively encode long-range contextual information so that the discriminative capability of the local representations is greatly improved. Inspired by the conventional tree-reweighted max-product algorithm (TRW) [57], the applied UCGs are decomposed into a sequence of DAGs such that any vertex pair is mutually reachable in the resulting set of DAGs. Furthermore, the DAG-RNNs are often integrated with convolution and deconvolution layers, and a novel class-weighted loss is applied since the class occurrence frequencies are generally imbalanced in most visual semantic information pursuit tasks, especially the visual semantic segmentation applications.
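
A simplified sketch of a single DAG sweep is shown below; it uses a 4-neighborhood (top and left parents) rather than the 8-neighborhood UCG of [8], and the weights are random placeholders:

```python
import numpy as np

def dag_rnn_sweep(features, W, U, b):
    """One DAG-RNN pass over an (H, W, D) feature grid.

    The DAG here is the 'south-east' member of the decomposition: every cell
    receives messages from its top and left neighbours only, so a single
    raster-order sweep respects the topological order of the DAG.
    """
    H, Wd, _ = features.shape
    hidden = np.zeros((H, Wd, W.shape[0]))
    for i in range(H):
        for j in range(Wd):
            context = np.zeros(W.shape[0])
            if i > 0:
                context += hidden[i - 1, j]
            if j > 0:
                context += hidden[i, j - 1]
            hidden[i, j] = np.tanh(W @ features[i, j] + U @ context + b)
    return hidden

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, Hdim = 4, 8
    feats = rng.normal(size=(5, 5, D))
    W, U, b = rng.normal(size=(Hdim, D)), rng.normal(size=(Hdim, Hdim)), np.zeros(Hdim)
    print(dag_rnn_sweep(feats, W, U, b).shape)   # (5, 5, 8)
```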

4.1.5 External Memory Reasoning Models

One of the difficult issues for visual semantic information pursuit is the dataset imbalance problem. To address this issue, one generally needs to accomplish so-called few-shot learning tasks [58], [59], [60], since most categories in the datasets have only a few training samples. Unlike the above models, instead of using internal memory cells such as long short-term memory (LSTM) or gated recurrent units (GRUs), the external memory reasoning models apply external memory cells, such as the neural Turing machine (NTM) [61] or the memory-augmented neural network (MANN) [62], to store the generated contextual information. More importantly, they tend to use a meta-learning strategy [62] within the inference procedure. Such a meta-learning strategy can be summarized as "learning to learn": it selects the parameters $\theta$ to reduce the expected learning cost $\mathcal{L}$ across a distribution of datasets $p(D)$: $\theta^{*} = \arg\min_{\theta} \mathbb{E}_{D \sim p(D)}\left[ \mathcal{L}(D; \theta) \right]$. To prevent the network from slowly learning specific sample-class bindings, it directly stores the new input stimuli in the corresponding external memory cells instead of relearning them. Through such meta-learning, the convergence speed of the visual semantic information pursuit task is greatly improved, so that only a few training samples are enough to converge to a stable state.
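
The sketch below illustrates content-based addressing of an external memory in the NTM/MANN spirit; the softmax read and the least-used-slot write are simplifications of the actual addressing and usage-weight mechanisms:

```python
import numpy as np

def cosine_read(memory, key):
    """Content-based read from an external memory matrix.

    memory: (slots, width) stored context vectors.
    key:    (width,) query produced by the controller for the current input.
    Returns the read vector (attention-weighted sum of memory rows) and weights.
    """
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    weights = np.exp(sims) / np.exp(sims).sum()       # softmax addressing
    return weights @ memory, weights

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    memory = rng.normal(size=(8, 16))                 # 8 slots of stored context
    new_sample = rng.normal(size=16)
    read_vec, weights = cosine_read(memory, new_sample)
    # Rapid binding: write the new sample into the least-attended slot instead
    # of re-learning it through slow gradient updates.
    memory[weights.argmin()] = new_sample
    print(read_vec.shape, weights.round(2))
```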

Fig. 8: Overview of 2-D spatial external memory iterations for object detection. The old detection is marked with a green box, and the new detection is marked with orange. Here, the spatial memory network is only unrolled one iteration.

Instead of detecting objects in parallel like the conventional object detection methods, the authors in [1] propose a novel instance-level spatial reasoning strategy, which tries to recognize objects conditioned on the previous detections. To this end, a spatial memory network (SMN) [1], a 2-D spatial external memory, is devised to store the generated contextual information and extract spatial patterns using an effective reasoning module. Essentially, this leads to a new sequential reasoning module in which image and memory are processed in parallel to obtain detections that update the memory again, as shown in Fig.8. Unlike the above method, which makes sequential updates to memory, the authors in [2] propose to update the regions in parallel as an approximation, in which a cell can be covered multiple times by different regions in overlapping cases. Specifically, a weight matrix is devised to keep track of how much a region has contributed to a memory cell. The final value of each updated cell is the weighted average of all regions.

4.2 Major Developments in Top-down Methods

For visual semantic information pursuit applications, the associated visual semantic relations generally reside in a huge semantic space. Unfortunately, only limited training samples are available, which implies it is impossible to fully train every possible visual relation. To maximize the target posterior from this long-tail distribution, the existing top-down methods generally transform the MAP inference tasks into linear programming problems. More importantly, they often distill the external linguistic prior knowledge into the associated learning systems so that the objective functions of the target constraint optimization problems can be further regularized accordingly. Therefore, compared with the bottom-up methods, the top-down methods generally converge relatively easily. In this section, based on their distillation strategies, we divide the existing top-down methods into the following categories:

Fig. 9: The diagram of semantic affinity distillation models.

4.2.1 Semantic Affinity Distillation Models

Even though the visual semantic relations obey a long-tail distribution, they are often semantically related to each other, which means it is possible to infer an infrequent relation from similar relations. To this end, the semantic affinity distillation models project the corresponding feature vectors (which are often generated from the union of bounding boxes of the associated objects) into a low-dimensional semantic relation embedding space and use their semantic affinities as the external prior knowledge to regularize the target optimization problem. In general, the projection function is trained by enforcing similar visual semantic relations to be close together in the semantic relation embedding space; a visual semantic relation should therefore lie close to semantically similar relations and far away from dissimilar ones in the associated embedding space, as illustrated in Fig.9. The semantic affinity distillation models are capable of resolving zero-shot learning tasks since visual semantic relations without any training samples can still be recognized through the external linguistic knowledge, which is clearly impossible for the bottom-up methods that only use internal visual prior knowledge.

One of the pioneering works is the visual relationship detection with language priors method [10], which trains the visual models for objects and predicates individually, and later combines them by applying the external semantic affinity-based linguistic knowledge to predict consistent visual semantic relations. Unlike the above algorithm, the context-aware visual relationship detection method [19] tries to recognize the predicate by incorporating the semantic contextual information of the subject-object pair. Specifically, the context is encoded via word2vec into a semantic embedding space and is applied to generate a classification result for the predicate. To summarize, the external semantic affinity-based linguistic knowledge can not only improve the inference speed but also lead to zero-shot generalization.

Fig. 10: The diagram of teacher-student distillation models.

4.2.2 Teacher-student Distillation Models

To resolve the long-tail distribution issue, instead of relying on semantic affinities, the teacher-student distillation models tend to use external linguistic knowledge (the conditional distribution of a visual semantic relation given specific visual semantic units) generated from public knowledge bases to constrain the target optimization problem. Given the input stimuli $x$ and the predictions $y$, the optimal teacher network is selected from an associated candidate set by minimizing $\mathrm{KL}\big( s_{T}(y \mid x) \,\|\, s_{S}(y \mid x; \theta) \big) - C \, \mathbb{E}_{s_{T}}\big[ g(x, y) \big]$, where $s_{T}$ and $s_{S}$ represent the prediction results of the teacher and student networks, respectively; $\theta$ is the parameter set of the student network and $C$ is a balancing term; $g(x, y)$ depicts the constraint function, in which the predictions that satisfy the constraints are rewarded and the remaining ones are penalized; $\mathrm{KL}(\cdot \,\|\, \cdot)$ measures the divergence of the teacher's and the student's prediction distributions. Essentially, through solving the above optimization, the teacher's output can be viewed as a projection of the student's output onto the feasible polytopes constrained by the external linguistic prior knowledge.

However, the teacher network itself, in most cases, is not enough to provide accurate predictions since the external linguistic prior knowledge is often noisy. In general, the student network represents the architecture without any external linguistic knowledge, while the framework incorporating both internal visual and external linguistic knowledge is formulated as the teacher network. They each have their own advantages: the teacher outperforms in cases with sufficient training samples, while the student achieves superior performance in few-shot or zero-shot learning scenarios. Therefore, unlike the previous distillation methods [63], [64], [65] that only use either the teacher or the student as the output, the current teacher-student distillation models [14], [17] tend to incorporate the prediction results from both student and teacher networks, as shown in Fig.10.
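
A generic student-side objective of this kind can be sketched as follows; this is not the exact formulation of [14] or [17], and it assumes the teacher distribution has already been projected onto the constraint set derived from the external linguistic knowledge:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, labels, alpha=0.5):
    """Student objective combining ground-truth supervision with imitation of
    a (linguistically constrained) teacher distribution.

    student_logits: (B, C) raw predicate scores from the student network.
    teacher_probs:  (B, C) teacher predictions projected onto the constraint set.
    labels:         (B,) ground-truth predicate indices.
    alpha:          balance between the two terms.
    """
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits, dim=1), teacher_probs,
                    reduction="batchmean")
    return (1 - alpha) * hard + alpha * soft

# Toy usage with random tensors standing in for real network outputs.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_probs = torch.softmax(torch.randn(4, 10), dim=1)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_probs, labels)
loss.backward()
print(float(loss))
```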

5 Future Research Directions

Even though the current visual semantic information pursuit methods have achieved levels of performance not seen before, there are still numerous challenging yet exciting research directions to investigate in the future.

5.1 Weakly-supervised Pursuit Methods

Annotations for the visual semantic information pursuit applications, especially the visual semantic segmentation tasks, are generally hard to obtain since tremendous time and effort must be invested in the labelling procedure. However, most current visual semantic information pursuit methods are essentially fully-supervised algorithms, which implies that rich annotations are required to train them. For instance, to locate the objects, the fully-supervised methods require the ground-truth locations of the associated bounding boxes. To alleviate the annotation burden, weakly-supervised pursuit methods [66], [67], [68], [69], [70] have been proposed in recent years, which only require the vocabulary information to extract the visual semantic information. Unfortunately, even though the current weakly-supervised pursuit methods generally require more computation, they still do not achieve performance comparable to that of fully-supervised pursuit methods. This is the reason why the fully-supervised methods are prevalent in the visual semantic information pursuit literature.

5.2 Pursuit Methods using Region-based Decomposition

For most visual semantic information graphical models, it is generally computationally intractable to infer the target posteriors. To resolve this NP-hard problem, mean field approximation is often applied in the current visual semantic information pursuit methods, in which the associated graphical model is fully decomposed into independent nodes. Unfortunately, such a simple decomposition strategy only incorporates unary pseudo-marginals into the associated variational free energy, which is clearly not enough for complicated visual semantic information pursuit applications. Moreover, the resulting dense (fully connected) inference model is often slow to converge. Inspired by the generalized belief propagation algorithm [71], some pursuit methods [24], [25] have started to apply a region-based decomposition strategy, in which the associated graphical model is factorized into various regions and the nodes within each region are not independent. The applied region-based decomposition strategy can not only improve the inference speed in some cases, but also incorporate higher-order pseudo-marginals into the associated variational free energy. However, there are still several open questions: 1) How many regions are enough for most visual semantic information pursuit applications? 2) How to efficiently compute the higher-order pseudo-marginals given the decomposed regions? 3) How to properly propagate contextual information between different regions?

5.3 Pursuit Methods with Higher-order Potential Terms

To extract the visual semantic information, the scene modelling methods generally need to factorize the associated energy functions into various potential terms, in which the unary terms produce the predictions while the higher-order potential terms constrain the generated predictions to be consistent. However, the current visual semantic information pursuit methods typically incorporate only pair-wise potential terms into the associated variational free energy, which is clearly not enough. To resolve this issue, current visual semantic information pursuit methods have started to incorporate higher-order potential terms into the associated variational free energies. For instance, for the visual semantic segmentation tasks, the recently proposed UCG-based pursuit methods [8], [56] replace the pair-wise potential terms (applied in most CRF-based models) with higher-order potential terms, and thus achieve state-of-the-art segmentation performance. However, by incorporating the higher-order potential terms, the target constrained optimization problems become much harder to solve since the polynomial higher-order potential terms inject more non-convexities into the objective function [72]. Therefore, further efforts are needed to address this non-convexity issue.

5.4 Pursuit Methods with Advanced Domain Adaptation

One of the most difficult issues for the visual semantic information pursuit applications is the dataset imbalance problem. In most cases, only a few categories have enough training samples, while the remaining categories have only a few or even zero training samples. Due to the long-tail distribution, this situation becomes even worse when we try to pursue the visual semantic relation information. To address this few-shot or zero-shot learning issue, domain adaptation [73], [74], [75], [76] becomes a natural choice since it can transfer the missing knowledge from related domains into the target domain. For instance, to resolve the corresponding few-shot learning problems, the top-down pursuit methods use the distilled external linguistic knowledge to regularize the variational free energy, while the bottom-up pursuit methods achieve meta-learning by using external memory cells. Essentially, the current domain adaptation strategies used in existing visual semantic information pursuit methods mainly focus on learning generic feature vectors from one domain that are transferable to other domains. Unfortunately, they generally transfer unary features and largely ignore more structured graphical representations [77]. To transfer structured graphs to the corresponding domains, more advanced domain adaptation methodologies are much needed in the future.

5.5 Pursuit Methods without Message Passing

To resolve the target NP-hard constrained optimization problems, even though numerous constrained optimization strategies are available [50], the current deep learning based visual semantic information pursuit methods depend entirely on one specific optimization methodology - message passing. Essentially, message passing is generally motivated by linear programming and variational optimization. For modern deep learning architectures, such a parallel optimization methodology has proven more effective than other optimization strategies. Besides being widely applied in different visual semantic information pursuit tasks, it has also succeeded in the quantum chemistry area [78], [79], [80], [81]. However, the parallel message passing strategies empirically underperform compared to the sequential optimization methodologies and typically do not provide feasible integer solutions [50]. To address these issues, visual semantic information pursuit methods that do not use the message passing optimization strategy are certainly needed in the future.

6 Benchmarks and Evaluation Metrics

In this section, we will introduce the main benchmarks and evaluation metrics for the four research applications investigated in this survey, which include object detection, visual semantic segmentation, visual relationship detection and scene graph generation.

6.1 Object Detection

6.1.1 Benchmarks

Two benchmarks are commonly used in object detection applications: the test set of PASCAL VOC 2007 [82] and the validation set of MS COCO [83]. More specifically, the test set of PASCAL VOC 2007 contains 4,952 images and 14,976 object instances from 20 categories. To evaluate the performance of object detection methods in different image scenes, it includes a large number of objects with abundant within-category variations in viewpoint, scale, position, occlusion and illumination. Compared with the PASCAL VOC 2007 test set, the MS COCO benchmark is more challenging since its images are gathered from complicated day-to-day scenes that contain common objects in their natural contexts. Specifically, the MS COCO benchmark contains 80,000 training images and 500,000 instance annotations. To evaluate the detection performance, most object detection methods use the first 5,000 MS COCO validation images. In some cases, an additional non-overlapping set of 5,000 images has also been used for validation.

6.1.2 Evaluation Metrics

To evaluate object detection methods, one needs to consider two performance measures: the quality of the object proposals generated by the detection method and the quality of the final object detections. In the existing literature, the metrics for evaluating object proposals are often functions of the intersection over union (IOU) between the proposal locations and the associated ground-truth annotations. Given the IOU, recall can be obtained as the fraction of ground-truth bounding boxes covered by proposal locations above a certain IOU overlap threshold. To evaluate the detection performance, the mean average precision (mAP) metric is often used for the VOC 2007 test benchmark (the IOU threshold is normally set to 0.5), while the MS COCO 2015 test-dev benchmark generally applies two types of metrics: average precision (AP) over all categories and different IOU thresholds, and average recall (AR) over all categories and IOUs (which is basically computed on a per-category basis, i.e. the maximum recall given a fixed number of detections per image). Specifically, AP, AP50 and AP75 represent the average precision over different IOU thresholds from 0.5 to 0.95 with a step of 0.05 (written as 0.5:0.95), the average precision with an IOU threshold of 0.5 and the average precision with an IOU threshold of 0.75, respectively. AR1, AR10 and AR100 depict the average recall given 1, 10 and 100 detections per image, respectively. We recommend interested readers refer to the relevant papers [34], [84] for the details and the mathematical formulations of the above metrics.
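
For reference, the sketch below computes the IOU of two boxes and the resulting proposal recall at a given threshold; the box coordinates are made up, and the full AP/mAP computation (which additionally integrates precision over ranked, score-sorted detections) is omitted:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def proposal_recall(gt_boxes, proposals, threshold=0.5):
    """Fraction of ground-truth boxes covered by at least one proposal."""
    covered = sum(any(iou(gt, p) >= threshold for p in proposals) for gt in gt_boxes)
    return covered / len(gt_boxes)

gt = [(10, 10, 50, 50), (60, 60, 120, 140)]
props = [(12, 8, 48, 52), (200, 200, 260, 260)]
print(iou(gt[0], props[0]), proposal_recall(gt, props))
```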

6.2 Visual Semantic Segmentation

6.2.1 Benchmarks

In this survey, three main benchmarks are chosen from the abundant datasets for visual semantic segmentation methods: Pascal Context [85], Sift Flow [86] and COCO Stuff [87]. Specifically, the Pascal Context benchmark contains 10,103 images extracted from the Pascal VOC 2010 dataset, of which 4,998 images are used for training. The images are relabelled as pixel-wise segmentation maps covering 540 semantic categories (including the original 20 categories), and each image has an approximate size of 375 x 500 pixels. The Sift Flow dataset contains 2,688 images obtained from 8 specific kinds of outdoor scenes. Each image has a size of 256 x 256 pixels, with each pixel belonging to one of 33 semantic classes. COCO Stuff is a recently released scene segmentation dataset. It includes 10,000 images extracted from the Microsoft COCO dataset, of which 9,000 images are used for training, and the previously unlabelled stuff pixels are further densely annotated with 91 extra classes.

6.2.2 Evaluation Metrics

To evaluate the performance of visual semantic segmentation methods, three main metrics are generally applied in the existing literature: Global Pixel Accuracy (GPA), Average per-Class Accuracy (ACA) and mean Intersection over Union (mIOU). Specifically, GPA represents the percentage of all correctly classified pixels, ACA depicts the mean of the class-wise pixel accuracies and mIOU is the mean of the per-class IOU scores. The details and the corresponding mathematical formulations of the above metrics can be found in [33].
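
The three metrics can be computed directly from a class confusion matrix, as in the following sketch with an invented 3-class matrix:

```python
import numpy as np

def segmentation_metrics(confusion):
    """GPA, ACA and mIOU from a (C, C) confusion matrix whose entry (i, j)
    counts pixels of ground-truth class i predicted as class j."""
    tp = np.diag(confusion).astype(float)
    gpa = tp.sum() / confusion.sum()                 # global pixel accuracy
    aca = np.mean(tp / confusion.sum(axis=1))        # mean per-class accuracy
    union = confusion.sum(axis=1) + confusion.sum(axis=0) - tp
    miou = np.mean(tp / union)                       # mean IOU over classes
    return gpa, aca, miou

confusion = np.array([[50, 5, 0],
                      [3, 40, 2],
                      [0, 4, 16]])
print(segmentation_metrics(confusion))
```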

6.3 Visual Relationship Detection

6.3.1 Benchmarks

The current visual relationship detection methods often use two benchmarks: the visual relationship dataset [10] and visual genome [88]. Unlike the datasets for object detection, visual relationship datasets should contain more than just objects localized in the image; they should capture the rich variety of interactions between subject and object pairs. Various types of interactions are considered in the above visual relationship benchmark datasets, e.g. verbs (e.g. wear), spatial (e.g. in front of), prepositions (e.g. with) or comparative (e.g. higher than). Moreover, the number of predicate types per object category should be large enough, so that a single object category can be associated with many different predicates. Specifically, the visual relationship dataset contains 5,000 images with 100 object categories and 70 predicates. In total, the dataset contains 37,993 relationships with 6,672 relationship types and 24.25 predicates per object category. Unlike the visual relationship dataset, the recently proposed visual genome dataset incorporates numerous kinds of annotations, one of which is visual relationships. The visual genome relationship dataset contains 108,077 images and 1,531,448 relationships. However, it generally needs to be cleaned since the corresponding annotations often contain misspellings and noisy characters, and the verbs and nouns also appear in different forms.

6.3.2 Evaluation Metrics

The current visual relationship detection methods often use two evaluation metrics: Recall@50 (R@50) and Recall@100 (R@100). Here, Recall@K [89] represents the fraction of times the correct relationship is predicted among the top K most confident relationship predictions. The reason for using Recall@K instead of the widely applied mean average precision (mAP) metric is that mAP is a pessimistic evaluation metric in this setting: since we cannot exhaustively annotate all possible relationships in an image, mAP would penalize a correct prediction whenever the particular ground-truth annotation is missing.
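
A simplified sketch of the Recall@K computation is given below; it treats triplets as exact labels and ignores the bounding-box localization requirement (real evaluations additionally require the predicted boxes to overlap the ground truth above an IOU threshold), and the example triplets are invented:

```python
def recall_at_k(ranked_predictions, ground_truth, k=50):
    """Recall@K for visual relationship detection (simplified).

    ranked_predictions: list of (subject, predicate, object) triplets for one
                        image, sorted by decreasing confidence.
    ground_truth:       set of annotated triplets for the same image.
    """
    top_k = set(ranked_predictions[:k])
    return len(top_k & ground_truth) / len(ground_truth)

predictions = [("person", "rides", "horse"), ("person", "wears", "hat"),
               ("horse", "on", "grass"), ("person", "holds", "hat")]
annotations = {("person", "rides", "horse"), ("horse", "on", "grass"),
               ("person", "next to", "dog")}
print(recall_at_k(predictions, annotations, k=2))   # 1 of 3 annotations recovered
```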

6.4 Scene Graph Generation

6.4.1 Benchmarks

The visual genome dataset [88] is often used as the benchmark for scene graph generation. Unlike the previous visual relationship datasets, the visual relationships in the visual genome scene graph dataset are embedded in the associated scene graphs and are generally not independent of each other. Specifically, the visual genome scene graph dataset contains 108,077 images with an average of 38 objects and 22 relationships per image. However, a substantial fraction of the object annotations are of poor quality, with overlapping bounding boxes and/or ambiguous object names. Therefore, a cleaned version of the visual genome scene graph dataset is often needed in the actual evaluation procedure. In the current scene graph generation methods, instead of training on all possible categories and predicates, only the most frequent categories and predicates are chosen for evaluation.

6.4.2 Evaluation Metrics

Similar to the visual relationship detection methods above, the scene graph generation methods generally apply the Recall@K metric [89] instead of the mAP metric. Specifically, R@50 and R@100 are often used to evaluate the corresponding scene graph generation methods.

7 Experimental Comparison

In this section, we compare the performance of different visual semantic information pursuit methods for each potential application mentioned in this survey. Specifically, in each of the following subsections, we choose the most representative methods and compare their pursuit performance. Moreover, the benchmarks and the evaluation metrics introduced in the previous section are used to carry out the performance comparisons.

7.1 Object Detection

In this section, two benchmarks - VOC 2007 test [82] and MS COCO 2015 test-dev [83] - are used to compare different cutting-edge object detection methods. For the VOC 2007 test dataset, we select 5 current object detection methods, namely Fast R-CNN [36], Faster R-CNN [34], SSD500 [90], ION [91] and SIN [4], as shown in Table I. For the MS COCO 2015 test-dev benchmark, Fast R-CNN [36], Faster R-CNN [34], YOLOv2 [84], ION [91] and SIN [4] are included in the performance comparison shown in Table II. Among the above methods, only ION [91] and SIN [4] incorporate visual context reasoning modules within the learning procedure, while the others merely apply visual perception modules. The reason for including various state-of-the-art visual perception models in the comparison is to provide a complete picture and to gain further understanding of the impact of the visual context reasoning modules.

Method Train
Fast R-CNN [36]
Faster R-CNN [34]
SSD500 [90]
ION [91]
SIN [4]
  • Note: the training set consists of VOC 07 trainval + VOC 12 trainval.

TABLE I: Performance comparison on VOC 2007 test.
Method Train
Fast R-CNN [36]
Faster R-CNN [34]
YOLOv2 [84]
ION [91]
SIN [4]
  • Note: the training set consists of the COCO train set plus 35k validation images [91].

TABLE II: Performance comparison on COCO 2015 test-dev.

From Tables I and II, we can observe that the object detection methods with visual context reasoning modules (such as ION and SIN) generally achieve better performance than the purely perception-based object detection algorithms (such as Fast R-CNN, Faster R-CNN, SSD500 and YOLOv2). This is because they treat object detection as a combination of perception and reasoning instead of concentrating on perception alone. Specifically, they consider the previously detected objects or the holistic scene as contextual information and try to improve the detection performance by performing inference over this contextual information. In some cases, such contextual information can be crucial for detecting the target objects, for instance when the target object is partly occluded by other objects or occupies only an extremely small region of the image. In such scenarios, it is almost impossible for purely perception-based object detection methods to detect the target objects. However, given the contextual information around the target objects, it is still possible to infer them even under such harsh conditions.

7.2 Visual Semantic Segmentation

In this section, we compare several state-of-the-art visual semantic segmentation methods on three benchmarks: Pascal Context [85], Sift Flow [86] and COCO Stuff [87]. For Pascal Context, only the most frequent 59 classes are selected for evaluation, and the least frequent classes are considered rare classes according to the 85-15 percent rule. For Sift Flow, similar to [86], we split the whole dataset into training and testing sets with 2,488 and 200 images, respectively. Each pixel within these images is classified into one of the 33 most frequent semantic categories, and the rare classes are again determined by the 85-15 percent rule. For COCO Stuff, each pixel is categorized into one of 171 semantic classes in total, and a frequency threshold is likewise used to determine the rare classes. A simple illustration of such a frequency-based split is sketched below.
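The exact thresholds used by the surveyed methods vary, but the underlying idea of the 85-15 percent split can be sketched as follows: classes are ranked by pixel frequency, the most frequent ones covering roughly 85% of the labelled pixels are treated as frequent, and the remainder are treated as rare.

```python
# Hedged illustration (the exact rule in the surveyed papers may differ) of
# splitting classes into frequent and rare ones by the 85-15 percent rule.
import numpy as np

def split_frequent_rare(pixel_counts, coverage=0.85):
    counts = np.asarray(pixel_counts, dtype=np.float64)
    order = np.argsort(-counts)                      # classes, most frequent first
    cum = np.cumsum(counts[order]) / counts.sum()    # cumulative pixel coverage
    cutoff = np.searchsorted(cum, coverage) + 1      # classes needed to reach 85%
    return order[:cutoff].tolist(), order[cutoff:].tolist()

# Toy usage with 5 classes and their pixel counts.
frequent, rare = split_frequent_rare([500, 300, 120, 50, 30])
print(frequent, rare)   # classes 0, 1, 2 cover >= 85% of pixels; classes 3, 4 are rare
```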

Method GPA ACA mIOU
CFM [92]
DeepLab [46]
FCN-8s [93]
CRF-RNN [6]
DeepLab + CRF [46]
ParseNet [7]
ConvPP-8s [94]
UoA-Context + CRF [95]
DAG-RNN [8]
DAG-RNN + CRF [8]
  • Note: For fair comparison, all the above methods apply VGG-16 [45] as the visual perception module.

TABLE III: Performance comparison (%) on Pascal Context dataset (59 classes).

Specifically, 10 current visual semantic segmentation methods, including CFM [92], DeepLab [46], DeepLab + CRF [46], FCN-8s [93], CRF-RNN [6], ParseNet [7], ConvPP-8s [94], UoA-Context + CRF [95], DAG-RNN [8] and DAG-RNN + CRF [8], are compared on the Pascal Context benchmark, as shown in Table III. For the Sift Flow dataset, besides state-of-the-art semantic segmentation methods such as ParseNet [7], ConvPP-8s [94], FCN-8s [93], UoA-Context + CRF [95], DAG-RNN [8] and DAG-RNN + CRF [8], we also include various earlier methods, namely Byeon et al. [96], Liu et al. [86], Pinheiro et al. [97], Farabet et al. [98], Tighe et al. [99], Sharma et al. [100], Yang et al. [101] and Shuai et al. [102], as shown in Table IV. For the recently released COCO Stuff benchmark, we compare 5 different visual semantic segmentation methods, namely FCN [87], DeepLab [46], FCN-8s [93], DAG-RNN [8] and DAG-RNN + CRF [8], as depicted in Table V.

Method GPA ACA mIOU
Byeon et al. [96]
Liu et al. [86]
Pinheiro et al. [97]
Farabet et al. [98]
Tighe et al. [99]
Sharma et al. [100]
Yang et al. [101]
Shuai et al. [102]
ParseNet [7]
ConvPP-8s [94]
FCN-8s [93]
DAG-RNN + CRF [8]
DAG-RNN [8]
UoA-Context + CRF [95]
  • Note: For fair comparison, all current methods below the middle horizontal line apply VGG-16 [45] as the visual perception module, while the earlier methods employ their default settings.

TABLE IV: Performance comparison (%) on Sift Flow dataset (33 classes).
Method GPA ACA mIOU
FCN [87]
DeepLab [46]
FCN-8s [93]
DAG-RNN [8]
DAG-RNN + CRF [8]
  • Note: For fair comparison, all the above methods apply VGG-16 [45] as the visual perception module.

TABLE V: Performance comparison (%) on COCO Stuff dataset (171 classes).

Recently, owing to their effective feature generation, CNN-based visual semantic segmentation methods have become popular; FCN [87] and its variant FCN-8s [93] are the best-known examples. However, the direct predictions of such visual perception models are generally of low resolution. To obtain high-resolution predictions, various visual semantic segmentation methods with visual context reasoning modules have been proposed, e.g. DeepLab + CRF [46], CRF-RNN [6], UoA-Context + CRF [95] and DAG-RNN + CRF [8]. Essentially, combining the strengths of CNNs and CRFs for semantic segmentation has become the focus. Among these methods, only DeepLab + CRF [46] first trains an FCN [87] and then applies a dense CRF as a post-processing step, while the others jointly learn the dense CRFs and CNNs. Most of the above methods only incorporate pairwise (binary) potential terms within their corresponding variational free energies.
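To make the role of the pairwise terms concrete, the following simplified sketch (a 4-connected Potts model refined by mean-field updates, not the fully connected Gaussian CRF actually used in [46], [6], [55]) shows how a CRF can refine the softmax output of a visual perception module as a post-processing step.

```python
# Simplified sketch: mean-field refinement of CNN softmax outputs with a
# 4-connected Potts smoothness term (pairwise potentials only).
import numpy as np

def meanfield_refine(unary_probs, smooth_weight=1.0, iters=5):
    """unary_probs: (H, W, L) softmax output of the perception module."""
    q = unary_probs.copy()
    log_unary = np.log(unary_probs + 1e-8)
    for _ in range(iters):
        # Message from the 4-neighbourhood: sum of neighbouring marginals.
        msg = np.zeros_like(q)
        msg[1:, :] += q[:-1, :]
        msg[:-1, :] += q[1:, :]
        msg[:, 1:] += q[:, :-1]
        msg[:, :-1] += q[:, 1:]
        # Potts model: reward labels that neighbouring pixels already favour.
        logits = log_unary + smooth_weight * msg
        logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
        q = np.exp(logits)
        q /= q.sum(axis=-1, keepdims=True)
    return q.argmax(axis=-1)

# Toy usage: a noisy 2-class prediction on an 8x8 image.
rng = np.random.default_rng(0)
probs = rng.dirichlet([1, 1], size=(8, 8))
print(meanfield_refine(probs))
```

Jointly trained variants such as CRF-RNN [6] unroll a fixed number of such mean-field iterations as differentiable network layers, so that the CRF parameters are learned together with the CNN.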

According to the comparison results shown in Tables III, IV and V, the recently proposed DAG-RNN + CRF [8] method achieves the best performance in most scenarios, while the earlier UoA-Context + CRF [95] algorithm only outperforms it in a few cases on the Sift Flow benchmark. This is because the DAG-RNN module within DAG-RNN + CRF [8] incorporates higher-order potential terms into the variational free energy instead of only applying pairwise potential terms as in the previous methods. It is therefore capable of enforcing not only local consistency but also, to a large extent, higher-order semantic coherence [8]. Moreover, the CRF module boosts the unary predictions and improves the localization of object boundaries, which the DAG-RNN module alone handles poorly [8].

Predicate Phrase Relationship
Method R@50 R@100 R@50 R@100 R@50 R@100
LP [10]
VTransE [15]
PPRFCN [103]
SA-Full [104]
CAI [19]
ViP [12]
VRL [13]
Zoom-Net [105]
LK [14]
CAI + SCA-M [105]
  • Note: All the above methods apply RPN [34] and triplet NMS [12] to generate object proposals and remove redundant triplet candidates, respectively.

TABLE VI: Performance comparison on Visual Relationship dataset ().

7.3 Visual Relationship Detection

In this section, two main benchmarks - the visual relationship dataset [10] and visual genome [88] - are used to compare different visual relationship detection methods on three tasks: predicate recognition, where both the bounding boxes and the labels of the subject and object are given; phrase recognition, which predicts the triplet labels given a union bounding box covering the whole triplet; and relationship recognition, which also outputs the triplet labels but detects separate bounding boxes for the subject and the object. Specifically, for the visual relationship dataset, 10 state-of-the-art visual relationship detection methods are chosen for the performance comparison, including LP [10], VTransE [15], CAI [19], ViP [12], VRL [13], LK [14], PPRFCN [103], SA-Full [104], Zoom-Net [105] and CAI + SCA-M [105], as shown in Table VI. For visual genome, we compare three current visual relationship detection methods: DR-Net [11], ViP [12] and Zoom-Net [105], as demonstrated in Table VII. Moreover, the Recall@K [89] scores depend on the number of predicates per subject-object pair considered in the top predictions, and different values are chosen for the visual relationship dataset and for visual genome. The IoU between the predicted bounding boxes and the ground truth is required to be above a fixed threshold (typically 0.5) for the above methods. Furthermore, for a fair comparison, all methods mentioned above apply RPN [34] and triplet NMS [12] to generate object proposals and to remove redundant triplet candidates, respectively (a sketch of the latter is given below).
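For illustration, a hedged sketch of triplet NMS in the spirit of [12] is given below (the exact criterion in the original work may differ): a lower-scored triplet is discarded when a kept triplet carries the same labels and both of its boxes overlap the kept triplet's boxes above an IoU threshold.

```python
# Hedged sketch of triplet non-maximum suppression over relationship candidates.
def iou(a, b):
    """Boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def triplet_nms(triplets, thresh=0.5):
    """triplets: list of dicts with keys 'labels' (subj, pred, obj),
    'subj_box', 'obj_box' and 'score'; returns the kept triplets."""
    kept = []
    for t in sorted(triplets, key=lambda x: -x["score"]):
        redundant = any(
            t["labels"] == k["labels"]
            and iou(t["subj_box"], k["subj_box"]) > thresh
            and iou(t["obj_box"], k["obj_box"]) > thresh
            for k in kept)
        if not redundant:
            kept.append(t)
    return kept
```

Compared with standard per-class NMS on individual boxes, this operates on whole (subject, predicate, object) candidates, so duplicate triplets produced by overlapping proposals are removed before Recall@K is computed.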

Predicate Phrase Relationship
Method R@50 R@100 R@50 R@100 R@50 R@100
DR-Net [11]
ViP [12]
Zoom-Net [105]
  • Note: All the above methods apply RPN [34] and triplet NMS [12] to generate object proposals and remove redundant triplet candidates, respectively.

TABLE VII: Performance comparison on Visual Genome dataset ().

According to the results shown in Table VI, the recently proposed CAI + SCA-M method [105] outperforms the previous visual relationship detection methods on all comparison criteria. For a better understanding of the comparison results, we divide the methods into three categories. Specifically, PPRFCN [103] and SA-Full [104] are weakly-supervised visual relationship detection methods and therefore cannot reach the detection performance of the fully-supervised algorithms. Among the fully-supervised methods, VTransE [15], ViP [12] and Zoom-Net [105] are essentially bottom-up visual relationship pursuit models, which only incorporate internal visual prior knowledge into the detection procedure. In contrast, the top-down visual relationship pursuit methods, such as LP [10], CAI [19], VRL [13], LK [14] and CAI + SCA-M [105], distill external linguistic prior knowledge into their learning frameworks. Generally, the external linguistic prior knowledge regularizes the original constrained optimization problems, so that the associated top-down methods are biased towards certain feasible polytopes.

Unlike Table VI, all methods in Table VII are bottom-up visual relationship pursuit methods. We can observe in Table VII that Zoom-Net [105] outperforms the other two methods by a large margin, especially on the visual relationship recognition task. As a visual semantic hierarchy reasoning model, Zoom-Net [105] propagates contextual information among different visual semantic levels. Essentially, within the associated MAP inference, it obtains a tighter upper bound on the target variational free energy and thus generally converges to a better local optimum, as reflected in Table VII.

7.4 Scene Graph Generation

In this section, seven available scene graph generation methods - IMP [20], MSDN [21], NM-Freq [22], Graph R-CNN [23], MotifNet [22], GPI [25] and LinkNet [26] - are compared on the visual genome dataset [88], as shown in Table VIII. Various visual genome cleaning strategies exist in the current literature; for a fair comparison, we choose the one used in the pioneering work [20] as the universal preprocessing step for all the above methods. This cleaning strategy produces training and test sets with 75,651 and 32,422 images, respectively. Moreover, the 150 most frequent object classes and the 50 most frequent relation classes are selected in this survey. On average, each image then has around 11.5 objects and 6.2 relationships in its scene graph. Furthermore, three evaluation settings - Predicate Classification (PredCls), Phrase Classification (PhrCls) and Scene Graph Generation (SGGen) - are considered in this survey. Specifically, PredCls measures the performance of recognizing the relation between two objects given the ground-truth locations; PhrCls measures the performance of recognizing the two object categories and their relation given the ground-truth locations; SGGen measures the performance of detecting the objects (IoU ≥ 0.5) and recognizing the predicates linking the object pairs. The matching criterion used for SGGen is sketched below.
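As an illustration of the strictest setting, the sketch below (our own simplification; PredCls and PhrCls are obtained by replacing the predicted boxes and/or object labels with the ground truth) checks whether a predicted triplet counts as a hit under SGGen.

```python
# Hedged sketch of the SGGen matching criterion: a predicted (subject,
# predicate, object) triplet is a hit when all three labels match a
# ground-truth triplet and both predicted boxes overlap the corresponding
# ground-truth boxes with IoU >= 0.5.

def iou(a, b):  # boxes as (x1, y1, x2, y2); same helper as in the NMS sketch
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-8)

def sggen_hit(pred, gt_triplets, iou_thresh=0.5):
    """pred and each ground-truth triplet: dict with 'labels' (subj, pred, obj),
    'subj_box' and 'obj_box'."""
    return any(
        pred["labels"] == gt["labels"]
        and iou(pred["subj_box"], gt["subj_box"]) >= iou_thresh
        and iou(pred["obj_box"], gt["obj_box"]) >= iou_thresh
        for gt in gt_triplets)
```

Recall@K is then computed, as before, as the fraction of ground-truth triplets covered by the top K scored predictions.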

PredCls PhrCls SGGen
Method R@50 R@100 R@50 R@100 R@50 R@100
IMP [20]
MSDN [21]
NM-Freq [22]
Graph R-CNN [23]
MotifNet [22]
GPI [25]
LinkNet [26]
  • Note: All the above methods apply the same cleaning strategy proposed in [20].

TABLE VIII: Performance comparison on Visual Genome dataset.

Unlike visual relationship detection, scene graph generation needs to model the global inter-dependency among all object instances, rather than focusing on local relationship triplets in isolation. Essentially, the strong independence assumptions made by local predictors limit the quality of the global predictions [22]. As shown in Table VIII, the first four methods (IMP [20], MSDN [21], NM-Freq [22] and Graph R-CNN [23]) use graph-based inference to propagate local contextual information in both directions between object and relationship nodes, while the last three methods (MotifNet [22], GPI [25] and LinkNet [26]) incorporate global contextual information within the inference procedure. From Table VIII, it can be seen that the latter methods, which incorporate global contextual information, outperform the former ones by a large margin. Among them, the recently proposed LinkNet [26] achieves the best performance on almost all comparison criteria, mainly because its authors propose a simple and effective relational embedding module that explicitly models the global contextual information.

8 Conclusion

This survey presents a comprehensive review of state-of-the-art visual semantic information pursuit methods. Specifically, we mainly focus on four related applications: object detection, visual semantic segmentation, visual relationship detection and scene graph generation. To capture the essence of these methods, a unified paradigm is distilled. The main developments and the future trends in each potential direction are also reviewed, followed by a summary of the most popular benchmarks, the evaluation metrics and the relative performance of the key algorithms.

Acknowledgments

This work was supported in part by the U.K. Defence Science and Technology Laboratory, and in part by the Engineering and Physical Sciences Research Council (a collaboration between the U.S. DOD, the U.K. MOD and the U.K. EPSRC through the Multidisciplinary University Research Initiative) under Grant EP/K014307/1 and Grant EP/R018456/1.

References

  • [1] X. Chen and A. Gupta, “Spatial memory for context reasoning in object detection,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4106–4116.
  • [2] X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta, “Iterative visual reasoning beyond convolutions,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7239–7248.
  • [3] X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou, B. Yang, Z. Wang et al., “Crafting gbd-net for object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 9, pp. 2109–2123, 2018.
  • [4] Y. Liu, R. Wang, S. Shan, and X. Chen, “Structure inference net: Object detection using scene-level context and instance-level relationships,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6985–6994.
  • [5] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu, “Advanced deep-learning techniques for salient and category-specific object detection: a survey,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 84–100, 2018.
  • [6] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, “Conditional random fields as recurrent neural networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1529–1537.
  • [7] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang, “Deep learning markov random field for semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, pp. 1814–1828, 2018.
  • [8] B. Shuai, Z. Zuo, B. Wang, and G. Wang, “Scene segmentation with dag-recurrent neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1480–1493, 2018.
  • [9] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of the Eighteenth International Conference on Machine Learning (ICML), 2001, pp. 282–289.
  • [10] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei, “Visual relationship detection with language priors,” in Proceedings of European Conference on Computer Vision (ECCV).   Springer, 2016, pp. 852–869.
  • [11] B. Dai, Y. Zhang, and D. Lin, “Detecting visual relationships with deep relational networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3298–3308.
  • [12] Y. Li, W. Ouyang, X. Wang, and X. Tang, “ViP-CNN: Visual phrase guided convolutional neural network,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7244–7253.
  • [13] X. Liang, L. Lee, and E. P. Xing, “Deep variation-structured reinforcement learning for visual relationship and attribute detection,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4408–4417.
  • [14] R. Yu, A. Li, V. I. Morariu, and L. S. Davis, “Visual relationship detection with internal and external linguistic knowledge distillation,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1068–1076.
  • [15] H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua, “Visual translation embedding network for visual relation detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3107–3115.
  • [16] J. Zhang, M. Elhoseiny, S. Cohen, W. Chang, and A. Elgammal, “Relationship proposal networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5226–5234.
  • [17] F. Plesse, A. Ginsca, B. Delezoide, and F. Prêteux, “Visual relationship detection based on guided proposals and semantic knowledge distillation,” arXiv preprint arXiv:1805.10802, 2018.
  • [18] J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. Elgammal, and M. Elhoseiny, “Large-scale visual relationship understanding,” arXiv preprint arXiv:1804.10660, 2018.
  • [19] B. Zhuang, L. Liu, C. Shen, and I. Reid, “Towards context-aware interaction recognition for visual relationship detection,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 589–598.
  • [20] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, “Scene graph generation by iterative message passing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3097–3106.
  • [21] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang, “Scene graph generation from objects, phrases and region captions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1261–1270.
  • [22] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi, “Neural motifs: Scene graph parsing with global context,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5831–5840.
  • [23] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph r-cnn for scene graph generation,” in Proceedings of European Conference on Computer Vision (ECCV), September 2018, pp. 690–706.
  • [24] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang, “Factorizable net: An efficient subgraph-based framework for scene graph generation,” in Proceedings of European Conference on Computer Vision (ECCV), September 2018, pp. 346–363.
  • [25] R. Herzig, M. Raboh, G. Chechik, J. Berant, and A. Globerson, “Mapping images to scene graphs with permutation-invariant structured prediction,” in Advances in Neural Information Processing Systems (NIPS), 2018, pp. 7211–7221.
  • [26] S. Woo, D. Kim, D. Cho, and I. S. Kweon, “Linknet: Relational embedding for scene graph,” in Advances in Neural Information Processing Systems (NIPS), 2018, pp. 558–568.
  • [27] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3128–3137.
  • [28] X. He and L. Deng, “Deep learning for image-to-text generation: a technical overview,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 109–116, 2017.
  • [29] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang, “Aligning where to see and what to tell: Image captioning with region-based attention and scene-specific contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2321–2334, 2017.
  • [30] M. Malinowski, M. Rohrbach, and M. Fritz, “Ask your neurons: A deep learning approach to visual question answering,” International Journal of Computer Vision, vol. 125, no. 1-3, pp. 110–135, 2017.
  • [31] A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Parikh, and D. Batra, “Vqa: Visual question answering,” International Journal of Computer Vision, vol. 123, no. 1, pp. 4–31, 2017.
  • [32] D. Teney, Q. Wu, and A. van den Hengel, “Visual question answering: A tutorial,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 63–75, 2017.
  • [33] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.
  • [34] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 91–99.
  • [35] A. Arnab, S. Zheng, S. Jayasumana, B. Romera-Paredes, M. Larsson, A. Kirillov, B. Savchynskyy, C. Rother, F. Kahl, and P. H. Torr, “Conditional random fields meet deep neural networks for semantic segmentation: Combining probabilistic graphical models with deep learning for structured prediction,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 37–52, 2018.
  • [36] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
  • [37] T. Werner, “A linear programming approach to max-sum problem: A review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 7, pp. 1165–1179, 2007.
  • [38] Q. Liu and A. Ihler, “Variational algorithms for marginal map,” The Journal of Machine Learning Research, vol. 14, no. 1, pp. 3165–3200, 2013.
  • [39] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
  • [40] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 3111–3119.
  • [41] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
  • [42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [43] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [44] E. Jahangiri, E. Yoruk, R. Vidal, L. Younes, and D. Geman, “Information pursuit: A bayesian framework for sequential scene parsing,” arXiv preprint arXiv:1701.02343, 2017.
  • [45] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [46] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • [47] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
  • [48] C. Zhang, J. Butepage, H. Kjellstrom, and S. Mandt, “Advances in variational inference,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [49] S. Ross, D. Munoz, M. Hebert, and J. A. Bagnell, “Learning message-passing inference machines for structured prediction,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 2737–2744.
  • [50] J. Kappes, B. Andres, F. Hamprecht, C. Schnorr, S. Nowozin, D. Batra, S. Kim, B. Kausler, J. Lellmann, N. Komodakis et al., “A comparative study of modern inference techniques for discrete energy minimization problems,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 1328–1335.
  • [51] R. Battiti, “First-and second-order methods for learning: between steepest descent and newton’s method,” Neural Computation, vol. 4, no. 2, pp. 141–166, 1992.
  • [52] A. Georges, G. Kotliar, W. Krauth, and M. J. Rozenberg, “Dynamical mean-field theory of strongly correlated fermion systems and the limit of infinite dimensions,” Reviews of Modern Physics, vol. 68, no. 1, p. 13, 1996.
  • [53] A.-L. Barabási, R. Albert, and H. Jeong, “Mean-field theory for scale-free random networks,” Physica A: Statistical Mechanics and its Applications, vol. 272, no. 1-2, pp. 173–187, 1999.
  • [54] K. P. Murphy, Y. Weiss, and M. I. Jordan, “Loopy belief propagation for approximate inference: An empirical study,” in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI), 1999, pp. 467–475.
  • [55] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in Advances in Neural Information Processing Systems (NIPS), 2011, pp. 109–117.
  • [56] B. Shuai, Z. Zuo, B. Wang, and G. Wang, “Dag-recurrent neural networks for scene labeling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3620–3629.
  • [57] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, “A new class of upper bounds on the log partition function,” IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2313–2335, 2005.
  • [58] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng, “Zero-shot learning through cross-modal transfer,” in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 935–943.
  • [59] B. Romera-Paredes and P. Torr, “An embarrassingly simple approach to zero-shot learning,” in Proceedings of International Conference on Machine Learning (ICML), 2015, pp. 2152–2161.
  • [60] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 3630–3638.
  • [61] A. Graves, G. Wayne, and I. Danihelka, “Neural turing machines,” arXiv preprint arXiv:1410.5401, 2014.
  • [62] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, “Meta-learning with memory-augmented neural networks,” in Proceedings of International Conference on Machine Learning (ICML), 2016, pp. 1842–1850.
  • [63] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [64] Z. Hu, X. Ma, Z. Liu, E. Hovy, and E. Xing, “Harnessing deep neural networks with logic rules,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2016, pp. 2410–2420.
  • [65] Z. Hu, Z. Yang, R. Salakhutdinov, and E. Xing, “Deep neural networks with massive learned knowledge,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016, pp. 1670–1679.
  • [66] H. Bilen, M. Pedersoli, and T. Tuytelaars, “Weakly supervised object detection with convex clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1081–1089.
  • [67] H. Bilen and A. Vedaldi, “Weakly supervised deep detection networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2846–2854.
  • [68] Y. Wei, X. Liang, Y. Chen, X. Shen, M.-M. Cheng, J. Feng, Y. Zhao, and S. Yan, “Stc: A simple to complex framework for weakly-supervised semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11, pp. 2314–2320, 2017.
  • [69] R. G. Cinbis, J. Verbeek, and C. Schmid, “Weakly supervised object localization with multi-fold multiple instance learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 1, pp. 189–203, 2017.
  • [70] D. Zhang, J. Han, L. Zhao, and D. Meng, “Leveraging prior-knowledge for weakly supervised object detection under a collaborative self-paced curriculum learning framework,” International Journal of Computer Vision, pp. 1–18, 2018.
  • [71] J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Generalized belief propagation,” in Advances in Neural Information Processing Systems (NIPS), 2001, pp. 689–695.
  • [72] G. Fazelnia and J. Paisley, “Crvi: Convex relaxation for variational inference,” in Proceedings of International Conference on Machine Learning (ICML), 2018, pp. 1476–1484.
  • [73] X. Glorot, A. Bordes, and Y. Bengio, “Domain adaptation for large-scale sentiment classification: A deep learning approach,” in Proceedings of the 28th International Conference on Machine Learning (ICML), 2011, pp. 513–520.
  • [74] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, 2011.
  • [75] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in Proceedings of International Conference on Machine Learning (ICML), 2015, pp. 1180–1189.
  • [76] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa, “Visual domain adaptation: A survey of recent advances,” IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 53–69, 2015.
  • [77] Z. Yang, J. Zhao, B. Dhingra, K. He, W. W. Cohen, R. R. Salakhutdinov, and Y. LeCun, “Glomo: Unsupervised learning of transferable relational graphs,” in Advances in Neural Information Processing Systems (NIPS), 2018, pp. 8964–8975.
  • [78] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” in Proceedings of International Conference on Learning Representations (ICLR), 2016.
  • [79] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende et al., “Interaction networks for learning about objects, relations and physics,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 4502–4510.
  • [80] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko, “Quantum-chemical insights from deep tensor neural networks,” Nature Communications, vol. 8, p. 13890, 2017.
  • [81] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of International Conference on Machine Learning (ICML), 2017, pp. 1263–1272.
  • [82] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge 2007 (voc 2007) results (2007),” 2008.
  • [83] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proceedings of European Conference on Computer Vision (ECCV), 2014, pp. 740–755.
  • [84] J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6517–6525.
  • [85] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The role of context for object detection and semantic segmentation in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 891–898.
  • [86] C. Liu, J. Yuen, and A. Torralba, “Nonparametric scene parsing: Label transfer via dense scene alignment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1972–1979.
  • [87] H. Caesar, J. Uijlings, and V. Ferrari, “Coco-stuff: Thing and stuff classes in context,” arXiv preprint arXiv:1612.03716, 2016.
  • [88] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision, vol. 123, no. 1, pp. 32–73, 2017.
  • [89] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the objectness of image windows,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2189–2202, 2012.
  • [90] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in Proceedings of European Conference on Computer Vision (ECCV), 2016, pp. 21–37.
  • [91] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick, “Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2874–2883.
  • [92] J. Dai, K. He, and J. Sun, “Convolutional feature masking for joint object and stuff segmentation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3992–4000.
  • [93] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2017.
  • [94] S. Xie, X. Huang, and Z. Tu, “Top-down learning for structured labeling with convolutional pseudoprior,” in Proceedings of European Conference on Computer Vision (ECCV), 2016, pp. 302–317.
  • [95] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid, “Efficient piecewise training of deep structured models for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3194–3203.
  • [96] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki, “Scene labeling with lstm recurrent neural networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3547–3555.
  • [97] P. H. Pinheiro and R. Collobert, “Recurrent convolutional neural networks for scene labeling,” in Proceedings of International Conference on Machine Learning (ICML), 2014, pp. 82–90.
  • [98] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.
  • [99] J. Tighe and S. Lazebnik, “Finding things: Image parsing with regions and per-exemplar detectors,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3001–3008.
  • [100] A. Sharma, O. Tuzel, and M.-Y. Liu, “Recursive context propagation network for semantic scene labeling,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2447–2455.
  • [101] J. Yang, B. Price, S. Cohen, and M.-H. Yang, “Context driven scene parsing with attention to rare classes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 3294–3301.
  • [102] B. Shuai, Z. Zuo, G. Wang, and B. Wang, “Scene parsing with integration of parametric and non-parametric models,” IEEE Transactions on Image Processing, vol. 25, no. 5, pp. 2379–2391, 2016.
  • [103] H. Zhang, Z. Kyaw, J. Yu, and S.-F. Chang, “Ppr-fcn: Weakly supervised visual relation detection via parallel pairwise r-fcn,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4233–4241.
  • [104] J. Peyre, I. Laptev, C. Schmid, and J. Sivic, “Weakly-supervised learning of visual relations,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5179–5188.
  • [105] G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang, J. Shao, and C. Change Loy, “Zoom-net: Mining deep feature interactions for visual relationship recognition,” in Proceedings of European Conference on Computer Vision (ECCV), 2018, pp. 330–347.