Despite the recent sweep of progress in Machine Learning (ML) which is stirring public imagination about the capabilities of future Artificial Intelligence (AI) systems, the research community seems more composed, in part due to the realization that current achievements are largely based on engineering advancements, and only partly on novel scientific progress. Undoubtedly, the accomplishments are neither small nor temporary; in the latest “One Hundred Year Study on AI”, a panel of renowned AI experts is foreseeing tremendous impact of AI in a multitude of technological and societal domains in the next decades, fueled primarily by systems running ML algorithms. Yet, there is still a lot of ground to cover, before we can obtain a deep understanding of how to overcome the limitations of data-driven approaches at a more generic level.
Aiming at exploiting the full potential of AI, a growing body of research is devoted to the idea of integrating ML and knowledge-based approaches. Davis and Marcus (Davis15), for instance, while discussing the multifaceted challenges related to automating commonsense reasoning, a crucial ability for any intelligent entity operating in real-world conditions, underline the need to combine the strengths of diverse AI approaches from these two fields. Others, such as Bengio et al. (Bengio19) and Pearl (Pearl18), emphasize the inability of Deep Learning to effectively recognize cause and effect relations. Pearl, Geffner (Geffner18) and, recently, Lenat (https://towardsdatascience.com/statistical-learning-and-knowledge-engineering-all-the-way-down-1bb004040114) suggest seeking solutions by bridging the gap between model-free, data-intensive learners and knowledge-based models and by building on the synergy between heuristic level and epistemological level languages.
In this paper, we review recent progress in the direction of coupling the strengths of ML and knowledge-based methods, focusing our attention on the topic of Object Perception (OP), an important sub-field of Computer Vision (CV). Tasks related to OP are at the core of a wide spectrum of practical systems and relevant research has traditionally relied on ML to approach the related problems. The recent developments have significantly advanced the field, but, interestingly, state-of-the-art studies try to integrate symbolic methods, in order to achieve broader visual intelligence. It seems that it is becoming less of a paradox within the CV community that in order to build intelligent vision systems, much of the information needed is not directly observable.
This review is aimed at researchers and engineers working on OP-related problems, as well as on the broader area of CV and AI. Existing surveys on the intersection of ML and knowledge engineering (e.g., ) are indeed very informative, but they usually offer a high-level understanding of the challenges involved. The rich literature on CV reviews, on the other hand, adopts a more problem-specific analysis, studying in detail the requirements of each particular CV task (see e.g., [62, 19, 33]). Only recently was an attempt made to show how studies that utilize background knowledge can benefit tasks, such as image understanding ; our goal is to explore this direction on the topic of intelligent OP, reporting state-of-the-art achievements and showing how different facets of knowledge-based research can contribute to addressing the rich diversity of OP tasks.
The rest of the paper investigates state-of-the-art literature on intelligent OP along three pillars (see also Figure 1): i) symbolic models, further analyzed from the perspectives of expressive representations, reasoning capacity, as well as open domain, Web-based knowledge exploitation; ii) commonsense knowledge exploitation, a key skill for any intelligent system; and, iii) enhanced learning ability, building on hybrid approaches. The paper concludes with a discussion on open questions and prominent research directions. Table 1 at the end summarizes the reviewed literature.
2 Exploitation of Symbolic Models
The scope of OP research ranges over a wide spectrum of problems, from object, action and affordance detection, to localization and recognition in images, to motion and structure inference in videos, to scene understanding and visual reasoning. Traditionally, OP relied on ML methodologies to find patterns in realms of data, taking as input feature vectors representing entities in terms of numeric or categorical (membership to a more general class) attributes.
In this section, we discuss how top-down knowledge related to visual entities can improve the performance of OP algorithms in manifold ways. We start by reviewing the state of the art in coupling data-driven approaches and rich knowledge representations about aspects such as context, space and affordances. We then consider the level of complex querying and reasoning that has been achieved on top of such representations. Finally, we explore how more generic Web-based knowledge sources are employed in an attempt to address similar problems.
2.1 Expressive Representations of Knowledge
Although the notion of a representation is rather general, the modeling of relational knowledge in the form of individuals (entities) and their associated relations is well-studied in AI, especially in the context of symbolic representations (see for instance [57, 7, 13]). The goal is to offer the level of abstraction needed to design a system versatile enough to adapt to the requirements of a particular domain, yet rigid enough to be encoded in a computer program, having clear semantics and elegant properties . As a result, expressiveness, i.e., what can or cannot be represented in a given model, and computation, i.e., how fast conclusions are drawn or which statements can be evaluated by algorithms guaranteed to terminate, are two, often competing, aspects to be taken into consideration.
The form of the representational model applied plays a decisive role on such considerations, affecting the richness of semantics that can be captured. While the range of models varies significantly, even relatively shallow representations have proven to offer improvements in the performance of OP methodologies (see for example [14, 70, 72, 44]). Models as simple as relational tables or flat weighted graphs, but also more complex multi-relational graphs, often called Knowledge Graphs (KGs), or even semantically-rich conceptualizations with formal semantics, often called ontologies, are proposed in the relevant literature. In the sequel, we refer to any model that offers at least a basic structuring of data as a Knowledge Base (KB).
2.1.1 Utilization of Contextual Knowledge
Context awareness is the ability of a system to understand the state of the environment, and to perceive the interplay of the entities inhabiting it. This feature offers advantages in accomplishing inference tasks, but also enhances a system in terms of reusability, as contextual knowledge enables it to adapt to new situations and environments that resemble known ones.
A preliminary approach towards this direction utilized a special semantic fusion network, which combined novel object- and scene-based information and was able to capture relationships between video class labels and semantic entities (objects and scenes). This was a Convolutional Neural Network (CNN) consisting of three layers, with the first layer detecting low-level features, the second layer object features and the third layer scene features. Although no symbolic representation was applied, this modeling of abstraction layers for the representation of the information depicted in an image is similar to the hierarchical structuring of knowledge used by top-down methods.
In a similar style, Liu et al. (Liu2018) recently attempted to address the problem of object detection by proposing an algorithm that exploits jointly the context of a visual scene and an object’s relationships. Such features are typically taken into consideration in isolation by most object detection methods. The algorithm uses a CNN-based framework tailored for object detection and combines it with a graphical model designed for the inference of object states. A special graph is created for each image in the training set, with nodes corresponding to objects and edges to object relationships. The intuition behind this approach is that the graph’s topology enables an object state to be determined not only by its low-level characteristics (appearance details), but also by the states of other objects interacting with it and by the overall scene context. The experimental evaluation conducted on different datasets underscored the importance of knowledge stemming from local and global context.
Currently, a collection of prominent approaches are oriented towards Gated Graph Neural Networks (GGNNs), a variation of Graph Neural Networks (GNNs), in order to integrate contextual knowledge in the training of a system. GNNs are a special type of NNs tailored to the learning of information encoded in a graph. In GGNNs, each node corresponds to a hidden state vector that is updated in an iterative way. At each time step of the training process, the history of the node, along with any incoming message from nodes with which it is linked, determines the way that the hidden state of a node will be updated. These updates are applied simultaneously to all nodes in the graph at each propagation step. After the prescribed number of propagation steps concludes, the output of the network is produced as a function of its nodes' hidden states. The building blocks of GGNNs are gated recurrent units, which serve as recurrent functions, with recurrence being applied for a fixed number of steps. Therefore, there is no need for constraining the parameters in order to ensure convergence. Furthermore, the fact that, in GGNNs, information can move bi-directionally and that many nodes can be updated per time step differentiates them from standard Recurrent Neural Networks (RNNs), which also use the mechanism of recurrence.
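The propagation scheme described above can be illustrated with a minimal sketch. It is written under several assumptions that do not come from the reviewed studies: plain Python lists instead of a tensor library, a single edge type, summation as the message aggregation function, and untrained constant weights. The names `propagate`, `gru_update`, `constant_matrix` and the toy scene graph are hypothetical and serve illustration only.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_update(h, m, p):
    """Gated update of one node state h, given its aggregated message m."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], m), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], m), matvec(p["Ur"], h))]
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(p["Wh"], m), matvec(p["Uh"], reset_h))]
    # The update gate z interpolates between the old state and the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def propagate(neighbors, states, p, steps):
    """Run a fixed number of synchronous propagation steps.

    neighbors: dict mapping each node to the nodes that send it messages.
    states:    dict mapping each node to its hidden state (list of floats).
    """
    dim = len(next(iter(states.values())))
    for _ in range(steps):
        new_states = {}
        for node, h in states.items():
            # Aggregate the transformed states of all linked nodes.
            m = [0.0] * dim
            for nb in neighbors.get(node, []):
                m = [a + b for a, b in zip(m, matvec(p["Wmsg"], states[nb]))]
            new_states[node] = gru_update(h, m, p)
        states = new_states  # every node is updated simultaneously
    return states


def constant_matrix(dim, value):
    return [[value] * dim for _ in range(dim)]


# Toy scene graph (hypothetical): edges in both directions, so that
# information can flow both ways between related objects.
neighbors = {"cup": ["table"], "table": ["cup", "chair"], "chair": ["table"]}
states = {"cup": [1.0, 0.0], "table": [0.0, 1.0], "chair": [0.5, 0.5]}
names = ["Wmsg", "Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]
params = {name: constant_matrix(2, 0.1) for name in names}  # untrained placeholders

final_states = propagate(neighbors, states, params, steps=3)
```

Because the number of steps is fixed in advance, the loop always terminates, which is why no contraction constraint on the weights is needed. In a trained model the weight matrices would be learned, and an output function over `final_states` would produce the prediction.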
Chuang et al. (Chuang2018) propose a GGNN-based method which exploits contextual information, such as the types of objects and their spatial relations, in order to detect action-object affordances. In the same vein,  utilize a two-stage pipeline built around a CNN to detect functional areas in indoor scenes. More recently,  utilized a GGNN, which takes into account the global context of a scene and infers the affordances of the contained objects. It also proposes the most suitable object for a specific task. The authors show that this approach yields better results compared to methods that rely solely on the results of an object classification algorithm.
In , an approach aiming to perform situation recognition is presented, based on the detection of human-object interactions. The goal is to predict the most representative verb to describe what is taking place in a scene, capturing also relevant semantic information (roles), such as the actor, the source and target of the action, etc. The authors utilize a GGNN, which enables combined reasoning about verbs and their roles through the iterative propagation of messages along the edges.
2.1.2 Spatial Contextual Knowledge
A particular type of contextual knowledge concerns the spatial properties of the entities populating a visual scene. These may involve simple spatial relations, such as “object A is usually part of object B” and “object A is usually situated near the objects B, C”, but also semantically enriched statements, such as “objects of type T are usually found inside object B (e.g., in the fridge), located in room R”. Due to the ubiquitous nature of spatial data in practical domains, such relations are often captured as a separate class of context.
Semantic spatial knowledge, when fused with low-level metric information, gives great flexibility to a system. This is demonstrated in the study of Gemignani et al. (gemignani2016living), where a novel representation is introduced that combines the metric information of the environment with the symbolic information that conveys meaning to the entities inhabiting it, as well as with topological graphs. Although delivering a generic model with clear semantics is not the main objective of that study, the resulting integrated representation enables a system to perform high-level spatial reasoning, as well as understanding target locations and positions of objects in the environment.
Generality is the aim of the model proposed by Tenorth and Beetz (tenorth2017representations), which uses a DL-based OWL ontology, combining information from OpenCyc and other Web sources, which help compile new classes while forming the environmental map. The main reasoning mechanism of this study is Prolog, although probabilistic reasoners are also used to tackle fuzzy information or uncertain relations. A CV system annotates objects based on their shape, their distances and the dimension of the environment, using a monotonic Description Logic (DL), to build the environmental map. As a result, a coherent and well-formalized representation of environments is achieved, which can offer high-quality datasets for training data-driven models. The use of DL can also offer a wide spectrum of spatial reasoning capabilities with well-specified properties and formal semantics.
Adopting a different approach, the KG given in  enables spatial reasoning both in a local and a global context which, in turn, results in improved performance in semantic scene understanding. The proposed framework consists of two distinct modules, one focusing on local regions of the image and one dealing with the whole image. The local module is convolution-based, analyzing small regions in the image, whereas the basic component of the global module is a KG representing regions and classes as nodes, capturing spatial and semantic relations. According to the experimental evaluation performed on the ADE and Visual Genome datasets, the network achieves better performance over other CNN-based baselines for region classification, by a margin which sometimes is close to 10%. According to the ablation study, the most decisive factor for the framework’s performance was the KG.
2.1.3 Modeling Affordances
Building on the geometrical structure and physical properties of objects, such as rigidity and hollowness, the representation of affordances helps develop systems that can reason about how human-level tasks are performed. While ML is invaluable for automating the process of learning from example when data is available, rich representations can generalize and reuse the obtained models in situations where data-based training is not possible.
One of the first studies that demonstrated that even very basic semantic models can improve the performance of recognizing human-object interaction was . The authors succeeded in boosting the performance of visual classifiers by exploiting the compositionality and concurrency of semantic concepts contained in images.
KNOWROB 2.0, which is the result of a series of research activities in the field of Cognitive Robotics, is an excellent example of integrating top-down knowledge engineering with bottom-up information structuring, involving, among others, a variety of CV tasks. A combination of KBs helps the KNOWROB 2.0 framework capture rich models of the world. The framework combines an inner world KB with objects' 6D poses, equipped with a physics engine; a virtual KB for the state of the environment; a logic KB, which contains abstracted symbolic sensor and action data with logical axioms and an inference mechanism; and an episodic memory KB with past collected experiences that empowers the system’s visual and cognitive skills. The representation of knowledge is based on OWL-DL ontologies, a decidable fragment of First-order Logic (FOL), yet adequately expressive for most practical domains. Temporal aspects are also captured. The aforementioned features enable the system to answer questions, such as “How to pick up the cup”, “Which body part to use”, “What was the pose then”. The authors provide evidence that the learning of human manipulation tasks by existing methods can be boosted by using symbolic level structured knowledge, so that the system can formalize learning problems and generate training datasets based on its episodic memories.
A recently proposed novel representation model that manages to balance between concept abstraction, uncertainty modeling and scalability is given in . The so-called ROBOCSE framework encodes the abstract, semantic knowledge of an environment, i.e., the main concepts and their relations, such as location, material and affordance, obtained by observations, simulations, or even from external sources, into multi-relational embeddings. These embeddings are used to represent the knowledge graph of the domain in vector space, encoding vertices that represent entities as vectors and edges that represent relations as mappings. While the majority of similar approaches rely on Bayesian Logic Networks and Markov Logic Networks, suffering from well-known intractability problems, the authors show that their model is highly scalable, robust to uncertainty, and generalizes learned semantics.
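To make the idea of entities as vectors and relations as mappings concrete, the following minimal sketch scores candidate triples in a two-dimensional embedding space. It assumes a translation-based mapping (the relation is added to the head entity), which is one common choice for multi-relational embeddings and not necessarily the one used by ROBOCSE. All vectors, entity names and the relation name `at_location` are hypothetical and hand-picked rather than learned.

```python
import math

# Hypothetical 2-D embeddings; in practice these would be learned from triples.
entity_vec = {
    "cup": [0.9, 0.1],
    "kitchen": [1.0, 1.0],
    "bedroom": [-1.0, 0.8],
}
# Each relation is modeled as a mapping; here, a simple translation vector.
relation_vec = {"at_location": [0.1, 0.9]}


def score(head, relation, tail):
    """Negative distance between the mapped head and the tail (higher is better)."""
    mapped = [h + r for h, r in zip(entity_vec[head], relation_vec[relation])]
    return -math.dist(mapped, entity_vec[tail])


def rank_tails(head, relation, candidates):
    """Order candidate tail entities from most to least plausible."""
    return sorted(candidates, key=lambda t: score(head, relation, t), reverse=True)


ranking = rank_tails("cup", "at_location", ["bedroom", "kitchen"])
```

Since plausibility is a continuous score rather than a logical truth value, unseen triples can still be ranked, which is one way such models cope with uncertainty and generalize beyond the stored facts.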
Learning from demonstration, or imitation learning, is a relevant, yet broader objective, which introduces interesting opportunities and challenges to a CV system (see [56, 43]). Purely ML-based methods constitute the predominant research direction, and only a few state-of-the-art studies utilize knowledge-based methods, taking advantage of the reusability and generalization of the learned information. A popular choice is to deploy expressive OWL-DL [42, 27] or pure DL  representations to capture world knowledge. The CV modules are assigned the task to extract information about the state of the environment, the expert agent’s pose and location, grasping areas of objects, affordances, shapes etc. On top of these, the coupling with knowledge-based systems assists in visual interpretation, for example to track human motion, to semantically annotate the movement (i.e., “how the human performs the action”) or to understand if a task is doable in a given setting. These studies show that such representations enable a system to reuse the learned knowledge in diverse settings and under different conditions, without having to re-train classifiers from scratch. Moreover, complex queries can be answered, a topic that is considered in the next subsection.
2.2 Reasoning over Expressive KBs
Encoding knowledge in a semantically structured way is only part of the story; a rich representation model can also offer inference capabilities to a CV system, which are needed for accomplishing complex tasks, such as scene understanding, or simpler tasks under realistic conditions, such as scene analysis with occlusions, noisy or erroneous input etc. A reasoning system can be used to connect the dots that relate concepts together when only partial observation is available, especially in data-scarce situations, where sufficient annotated data are not available. In such situations, the compositionality of information, an inherent characteristic of the entities encountered in visual domains, can be exploited by applying reasoning mechanisms.
2.2.1 Complex Query Answering
Probably the field that highlights most clearly the needs and challenges faced by a CV system in answering complex queries about a visual scene is the field of Visual Question Answering (VQA). VQA was recently introduced as a collection of benchmark image-based open-domain questions that, in order to be answered, call for a deep understanding of the visual setting. VQA goes beyond traditional CV, since apart from image analysis, the proposed methods apply also a repertoire of AI techniques, such as Natural Language Processing, in order to correctly analyze the textual form of the question, and inferencing, in order to interpret the purpose and intentions of the entities acting in the scene . The challenges posed by this field are complex and multifaceted, a fact which is also demonstrated by the rather poor performance of state-of-the-art systems in comparison to humans. VQA is probably the area of CV that has drawn the most inspiration from symbolic AI approaches to date.
An indicative example is the approach recently presented by Wu et al. (Wu2018), who introduced a VQA model combining observations obtained from the image with information extracted from a general KB, namely DBpedia. Given an image-question pair, a CNN is utilized to predict a set of attributes from the image, i.e., the most recognizable objects in the image, in terms of clarity and size. Subsequently, a series of captions based on the attributes is generated, which is then used to extract relevant information from DBpedia through appropriately formulated queries. In a similar style, in  an external RDF repository is used to retrieve properties of visual concepts, such as category, used for, created by, etc. The technique utilizes a Graph Convolutional Network (GCN), a variation of GNN, before producing an answer. In both cases, the ablation analysis reveals the impact of the KB in improving performance.
Other types of questions in VQA require inferencing about the properties of the objects depicted in an image. For example, queries such as “How is the man going to work?” or more complex queries, such as “When did the plane land?”, have been the subject of the study presented by Krishna et al. (Krishna2017), who introduced the Visual Genome dataset and a VQA method. In fact, this is one of the first studies to bring a model trained on an RDF-based scene graph that had good recall results to all What, Where, When, Who, Why, How queries. Even further,  introduced the visual knowledge memory network (VKMN) in order to handle questions, whose answers cannot be directly inferred from the image visual content but require reasoning over structured human knowledge.
The importance of capturing the semantic knowledge in VQA collections led also to the creation of the Relation-VQA dataset, which extends the Visual Genome dataset with a special module measuring the semantic similarity of images. In contrast to previous methods mining only concepts or attributes, this model extracts relation facts related to both concepts and attributes. The experimental evaluation conducted on the VQA and COCO datasets showed that the method outperformed other state-of-the-art ones. Moreover, the ablation studies show that the incorporated semantic knowledge was crucial for the performance of the network.
Despite its increasing popularity in the research community, the VQA field is still hard to confront. The generality of existing methods is also questioned . Developing generic solutions, less tightly coupled to specific datasets, will definitely benefit the pursuit of broader visual intelligence.
2.2.2 Visual Reasoning
A task related to VQA that has gained popularity in recent years is that of Visual Reasoning (VR). In this case, the questions that have to be answered are more complex and require a multi-step reasoning procedure. For example, given an image containing objects of different shapes and colors, the task of recognizing the color of an object of a certain shape that lies in a certain area w.r.t. the position of another object of a certain shape and color falls into the category of VR (in this case, first the “source” object must be detected, then the “target” object, and, finally, its color must be recognized). Similar to the case of VQA, a number of VR works have drawn inspiration from symbolic AI-based ideas.
In general, many VR works are based on Neural Module Networks (NMNs), which are NNs of adaptable architecture, the topology of which is determined by the parsing of the question that has to be answered. NMNs decompose complex questions into simpler sub-questions (sub-tasks), which can be more easily addressed. The modules that constitute the NMNs are pre-defined neural networks that implement the functions required for tackling the sub-tasks, and they are assembled into a layout dynamically. Central to many NMNs is the utilization of prior symbolic (structured) knowledge, which facilitates the handling of the sub-tasks.
Hu et al. (Hu2017) propose End-to-End Module Networks as a variation of NMNs. The network first uses coarse functional expressions describing the structure of the computation required for answering the question and then refines them according to the textual input in order to assemble the network. For example, for the question “how many other objects of the same size as the purple cube exist?”, crude functional expressions for counting and relocating would first be predicted as relevant to answering the question; these would subsequently be refined by the parameters obtained from text analysis (in this case, one such parameter is the color of the cube).
Similarly, Johnson et al. (johnson2017inferring) propose a variation of NMNs, which is based on the concept of programs. Programs are symbolic structures of certain specification written in a Domain-Specific Language and are defined by a syntax and semantics. In the context of VR, programs describe a sequence of functions that must be executed, in order for an answer to be computed. During testing on the CLEVR dataset the model exhibited notable performance, generalizing better in a variety of settings, such as for new question types and human-posed questions. Building on the notion of programs, Yi et al. (Yi2018) further incorporated knowledge regarding the structural scene representation of the image. The method achieved near-perfect accuracy, while also providing transparency to the reasoning process.
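The notion of a program executed over a structured scene representation can be sketched as follows. The example answers the question “how many other objects of the same size as the purple cube exist?” on a hand-written scene. The operation names (`filter`, `unique`, `same`, `count`, `query`), the scene encoding and the `execute` function are hypothetical and do not reproduce the Domain-Specific Language of the reviewed studies; in those studies the program is predicted from the question by a neural network, whereas here it is written by hand.

```python
# Structured scene representation: one attribute dictionary per object.
scene = [
    {"shape": "cube", "color": "purple", "size": "large"},
    {"shape": "sphere", "color": "red", "size": "large"},
    {"shape": "cylinder", "color": "green", "size": "small"},
]


def execute(program, scene):
    """Run a sequence of symbolic operations; each one transforms the state."""
    state = scene
    for op, *args in program:
        if op == "filter":      # keep objects whose attribute has a given value
            state = [o for o in state if o[args[0]] == args[1]]
        elif op == "unique":    # expect exactly one object and select it
            if len(state) != 1:
                raise ValueError("expected exactly one object")
            state = state[0]
        elif op == "same":      # other objects sharing an attribute value
            ref = state
            state = [o for o in scene
                     if o is not ref and o[args[0]] == ref[args[0]]]
        elif op == "count":
            state = len(state)
        elif op == "query":     # read an attribute of the selected object
            state = state[args[0]]
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


how_many_same_size = [
    ("filter", "color", "purple"),
    ("filter", "shape", "cube"),
    ("unique",),
    ("same", "size"),
    ("count",),
]
answer = execute(how_many_same_size, scene)
```

Each intermediate `state` can be inspected, which is the sense in which program-based approaches make the reasoning process transparent.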
An alternative NN-based approach for VR is found in , where the incorporation of Relation Networks (RNs) in CNNs and Long Short-Term Memory (LSTM) architectures is proposed. RNs are architectures whose computations focus explicitly on relational reasoning and are characterized by three important features: they can infer relations, they are data efficient, and they operate on a set of objects, a flexible symbolic input format that is agnostic to the kind of inputs it receives. For example, an object could correspond to the background, to a particular physical object, a texture, conjunctions of physical objects etc.
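The set-based computation of an RN can be sketched as a pairwise function g applied to pairs of objects, followed by a sum and a second function f applied to the pooled result. In the sketch below, g and f are hand-written stand-ins for what would normally be learned networks, only pairs of distinct objects are considered, and the objects are hypothetical 2-D positions; none of these choices is taken from the reviewed study.

```python
import math
from itertools import permutations


def relation_network(objects, g, f):
    """Apply g to every ordered pair of distinct objects, sum, then apply f."""
    pair_features = [g(a, b) for a, b in permutations(objects, 2)]
    if not pair_features:
        raise ValueError("at least two objects are required")
    dim = len(pair_features[0])
    # Summation makes the result independent of the order of the objects.
    pooled = [sum(feat[k] for feat in pair_features) for k in range(dim)]
    return f(pooled)


def g(a, b):
    """Stand-in relation function: [is the pair close?, pair counter]."""
    return [1.0 if math.dist(a, b) < 2.0 else 0.0, 1.0]


def f(pooled):
    """Stand-in output function: fraction of pairs that are close."""
    return pooled[0] / pooled[1]


objects = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
close_fraction = relation_network(objects, g, f)
```

Because g sees only pairs and the pooling is a sum, the same function is reused for every pair regardless of what the objects represent.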
2.3 The Web as a Problem-Agnostic Source of Data
As the recent renaissance in AI is partly due to the availability of big volumes of training data, along with the computational power to analyze them, it is only reasonable to expect that data-driven approaches will turn their attention to the Web in order to collect the data needed. Although the benefits mentioned in the previous sections are still achievable, the challenges faced when using a Web repository rather than a custom-made KB are now different.
The vast majority of large-scale Web repositories are not problem-specific, containing a lot of irrelevant information for an ML system to be trained correctly. For the time being, ML systems are highly specific, excelling only when trained for a particular task and tested under conditions similar to those seen during training. As a result, state-of-the-art approaches try to rely on the semantics of structured KBs, in order to filter out noisy or irrelevant knowledge, by integrating external knowledge when visual information is not sufficiently reliable for conclusion making.
2.3.1 Exploitation of Web-based Knowledge Graphs and Semantic Repositories
There exists a multitude of studies that use external knowledge from structured or semi-structured Web resources, in order to answer visual queries or to perform cognitive tasks. A characteristic example is found in , where the ConceptNet KG, a semantic repository of commonsense Linked Open Data, is used to answer open domain questions on entities such as “What is the dog’s favorite food?”. The approach proceeds in a step-wise manner: first, visual objects and keywords are extracted from an image, using a Fast-RCNN for the objects and an LSTM for the syntactical analysis; then, queries to ConceptNet provide properties and values for the entities found in the image. When an answer is considered correct, a Dynamic Memory Network, which is an embedding vector space that contains vector representations of symbolic knowledge triples, is updated for future encounters of the same query. In a rather similar style, Wu et al. (wu2016ask) extract properties from DBpedia, by retrieving and performing semantic analysis on the comment boxes of relevant Wikipedia pages. Here, a CNN performs object detection on the image, whereas a pre-trained RNN correlates attributes to sentence descriptions.
The approach presented in  is the first attempt to answer a more knowledge-intensive category of questions, such as “Who is to the left of Barack Obama?” or “Do all the people in the image have a common occupation?”. These questions make reference to the named entities contained in an image, e.g., Barack Obama, White House, France etc. and require large KBs to retrieve the relevant information. In this case, the authors choose Wikidata, an RDF repository. They first extract named entities and then try to connect them with a Wikidata entity using SPARQL queries. In addition, they extract spatial relations with other entities shown in the image and feed them to a Bi-LSTM. A multi-layered perceptron calculates the prediction for an answer, taking as input the output of the LSTM, along with the SPARQL results.
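As an illustration of how a question such as “Do all the people in the image have a common occupation?” could be mapped to a KB lookup, the sketch below builds a SPARQL query for Wikidata and intersects the returned occupations. It is not the query used in the reviewed study, which is not given here. The sketch assumes that named entities have already been linked to Wikidata item identifiers, that `wdt:P106` denotes the occupation property, and that the `wd:` and `wdt:` prefixes are predefined by the query endpoint. The result rows are made up for the example and no network request is issued.

```python
def occupation_query(qids):
    """Build a SPARQL query fetching the occupations of the given Wikidata items."""
    values = " ".join(f"wd:{q}" for q in qids)
    return (
        "SELECT ?person ?occupation WHERE {\n"
        f"  VALUES ?person {{ {values} }}\n"
        "  ?person wdt:P106 ?occupation .\n"
        "}"
    )


def common_occupations(rows):
    """Intersect the occupation sets of all persons in (person, occupation) rows."""
    by_person = {}
    for person, occupation in rows:
        by_person.setdefault(person, set()).add(occupation)
    if not by_person:
        return set()
    return set.intersection(*by_person.values())


# Q76 is the Wikidata item for Barack Obama; Q42 is the item for Douglas Adams.
query = occupation_query(["Q76", "Q42"])

# Made-up result rows standing in for the response of the endpoint.
mock_rows = [
    ("Q76", "politician"),
    ("Q76", "writer"),
    ("Q42", "writer"),
]
shared = common_occupations(mock_rows)
```

A non-empty `shared` set would correspond to a positive answer; in the reviewed approach, such query results are instead passed, together with the Bi-LSTM output, to a learned classifier.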
2.3.2 Aligning Data Obtained from Diverse Online Sources
Entity resolution, also known as instance matching, concerns the task of identifying which entities across different KBs refer to the same individual. As the Web is growing in size, this problem is becoming crucial, especially in application domains that need to integrate and align knowledge obtained from various sources. An increasing number of CV studies face this problem, in an attempt to interpret visual information based on commonsense, non-visual knowledge.
Two characteristic approaches are given in  and  that try to assign labels to a visual scene using Bayesian Logic Networks (BLNs) and relying on commonsense knowledge. In , knowledge is extracted from WordNet, ConceptNet, and Wikipedia. WordNet is utilized in order to disambiguate seed words returned by the CV annotator with the aid of their hypernym. ConceptNet properties, such as or that may point the location of an object, are also retrieved. With this method, the system can generate a compact semantic KB given only a small number of objects.
In the second approach, a CNN trained on ImageNet is used to annotate objects recognized in images. The system is capable of assigning semantic categories to specific regions, by relying on DBpedia comment boxes to calculate the semantic relatedness between objects. As expected, high accuracy of such an approach is difficult to achieve, due to the diversity of information retrieved from DBpedia; consequently, smarter ways of identifying only the relevant part of the comment boxes need to be devised.
3 Exploitation of Commonsense Knowledge
Much of the information presented in a visual scene is not explicitly related to the features captured at the pixel level, but concerns observations implicitly depicted in images. Understanding the structure and dynamics of visual entities requires being able to interpret the semantic and commonsense (CS) features that are relevant, in addition to the low-level information obtained by photorealistic rendering techniques . This is a popular conclusion reached within the CV community in the pursuit of visual intelligence, and there is a long line of studies that attempt to address the problem of extracting commonsense knowledge from visual scenes or, similarly, of utilizing commonsense inferences to improve scene understanding. In this section, we discuss state-of-the-art approaches that advance the field in these two directions.
3.1 Mining Commonsense Knowledge from Images
Even though ML is becoming part of many systems, it is still not able to easily capture CS knowledge from the perceived information. Additional techniques need to be devised to extract this valuable type of knowledge from visual scenes. Certain studies, e.g., [58, 32], combine textual and visual analysis to extract subject-predicate-object (SPO) triples about objects recognized in a scene. ML classifiers for object recognition are trained on image datasets, while pre-trained NN classifiers help extract SPO triples, by considering both the entities identified by the classifiers and the textual description of the images.
In a different direction, in  the authors rely on Web images to verify the validity of simple phrases, such as “horses eat hay”, analyzing the spatial consistency of the relative configurations of the entities and the relations involved. This unsupervised method is particularly interesting, due to the leverage it offers in automatically enriching CS repositories. In fact, the authors show how CV-based analysis can help improve recall in KBs, such as WordNet, Cyc and ConceptNet, offering a complementary and orthogonal source of evidence.
Aditya et al. (aditya2015images) address the problem of generating linguistic descriptions of images by utilizing a special type of graph, namely scene description graphs (SDGs). Such graphs are built by using both low-level information derived using perception methods and high-level features capturing CS knowledge stemming from the image annotations and lexical ontological knowledge from Web resources. SDGs produce object, scene and constituent detection tuples, accompanied by a confidence score; pre-processed background knowledge helps remove noise contained in the detection. A Bayesian Network is utilized, in order for the dependencies among co-occurring entities and knowledge regarding abstract visual concepts to be captured. Experimental evaluations of the method on the image-sentence alignment quality, i.e., how close the generated description is to the image being described, on Flickr8k, 30k and COCO datasets, showed that the method achieves comparable performance to previous state-of-the-art methods.
3.2 Commonsense Knowledge in Addressing OP Tasks
State-of-the-art CS-based methodologies improve the performance of a CV system, mainly by taking into account textual descriptions about the entities found in a visual scene or by retrieving semantic information from external sources that is relevant to the image and the task at hand.
A combination of external Web-based knowledge, text processing and vision analysis is at the core of the study presented in . The framework annotates objects with a Fast-RNN, trained over the MS COCO dataset. The extracted entities are enriched with (i) knowledge retrieved from Wikipedia, in order to perform entity classification; (ii) knowledge from WebChild, attempting a comparative analysis between relevant entities; and (iii) CS knowledge obtained from ConceptNet, to create a semantically rich description. The enriched entity is stored in an RDF graph and is used to address a variety of tasks. For instance, the framework has achieved improved accuracy in VQA benchmarks, and it can also be used to generate explanations for its answers. Prominent recent studies, as in  and , also build on the direction of combining textual and visual analysis with the help of knowledge obtained from CS repositories.
Another problem that researchers try to address with the help of CS knowledge is the sparsity of categorical variables in the training datasets. For example, Ramanathan et al. (Ramanathan2015) utilize a neural network framework that uses different types of cues (linguistic, visual and logical) in the context of human action identification. Similarly, Lu et al. (Lu2016) exploit language priors extracted from the semantic features of an image, in order to facilitate the understanding of visual relationships. The proposed model combines a visual module tailored to the learning of visual appearance models for objects and predicates with a language module capable of detecting semantically related relationships.
More recently, Gu et al. (gu2019scene) utilize commonsense knowledge stemming from an external KB in the context of scene graph generation. Namely, a special knowledge-based feature refinement module is used, which incorporates CS knowledge from ConceptNet for the prediction of object labels consisting of triplets containing the top-K corresponding relationships, the object entity and a weight corresponding to the frequency of the triplet. This strategy, aiming to address the long tail distribution of relationships, differentiates the approach from the linguistic-based ones described previously, managing to showcase improvement in generalizability and accuracy.
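The retrieval step described above can be pictured with a minimal sketch: given a predicted object label, look up its facts in a knowledge store and keep the top-K by weight. The triples, the weights and the names `TOY_FACTS`, `build_index` and `top_k_triplets` are hypothetical stand-ins chosen for illustration; they are not ConceptNet data and not the implementation of Gu et al.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object, weight) facts. A real system
# would query ConceptNet; the weights here are made up for illustration.
TOY_FACTS = [
    ("dog", "IsA", "animal", 9.0),
    ("dog", "CapableOf", "bark", 6.0),
    ("dog", "AtLocation", "kennel", 4.0),
    ("dog", "Desires", "bone", 2.0),
    ("cup", "UsedFor", "drinking", 7.0),
]

def build_index(facts):
    """Group facts by subject so that lookups by object label are cheap."""
    index = defaultdict(list)
    for subj, rel, obj, weight in facts:
        index[subj].append((rel, obj, weight))
    return index

def top_k_triplets(index, label, k=3):
    """Return the k highest-weighted (relation, object, weight) facts for a label."""
    candidates = index.get(label, [])
    return sorted(candidates, key=lambda fact: fact[2], reverse=True)[:k]

index = build_index(TOY_FACTS)
dog_facts = top_k_triplets(index, "dog", k=2)
```

Labels that are absent from the store simply yield an empty list, so a downstream refinement module could fall back to purely visual features in that case.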
CS knowledge is also used to tackle other CV problems, such as understanding relevant information about unknown objects existing in a visual scene. In  or  for instance, external CS Web-based repositories are used as a source for locating relevant information. The general idea in both approaches is to retrieve as much information as possible about the recognizable objects that, based on diverse metrics, are considered semantically close to the unknown ones. Properties found in ConceptNet and comment boxes retrieved from DBpedia are all relevant knowledge that can be used for developing semantic similarity measures. Similar to some extent is the approach presented in , which relies on RDF graphs with a probabilistic distribution over relations to capture the CS knowledge, but also reverts to a human-supervised learning approach whenever unknown objects are encountered.
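One simple way to picture a semantic similarity measure of this kind is set overlap between the properties known for each object. The sketch below uses Jaccard similarity over hypothetical property strings; the objects, properties and function names are illustrative assumptions, and the studies discussed above rely on their own, more elaborate metrics.

```python
def jaccard(props_a, props_b):
    """Jaccard similarity between two collections of property strings."""
    a, b = set(props_a), set(props_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical properties of recognizable objects, written as "Relation:value".
KNOWN_OBJECTS = {
    "mug": {"UsedFor:drinking", "AtLocation:kitchen", "HasProperty:ceramic"},
    "hammer": {"UsedFor:hitting", "AtLocation:toolbox"},
}

# Hypothetical properties gathered for an object the detector cannot name.
unknown_props = {"UsedFor:drinking", "AtLocation:kitchen", "HasProperty:glass"}

def closest_known(props, known):
    """Return the name of the known object most similar to the given properties."""
    return max(known, key=lambda name: jaccard(props, known[name]))

best_match = closest_known(unknown_props, KNOWN_OBJECTS)
```

Here the unknown object shares two of four distinct properties with the mug and none with the hammer, so the mug is selected as its closest known neighbour.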
4 Ability to Learn New Knowledge
The majority of state-of-the-art studies covered in the previous sections exploit a loosely-coupled combination of ML and knowledge-based methodologies. A tighter integration of methodologies of the two fields is expected to achieve much broader impact, especially in the process of learning. In the sequel, we consider prominent attempts towards this direction, originating either from a model-free standpoint or from a more declarative, inductive-based perspective.
4.1 Model-Free Learning
A host of recent works devise methods that attempt to exploit information contained in higher-level representations, in order to improve scalability and generalization for tasks, such as Zero-Shot Learning (ZSL). ZSL is the problem of recognizing objects for which no visual examples have been obtained and is typically achieved by exploring a semantic embedding space, e.g., attribute space or semantic word vector space.
For example, Fu et al. (fu2015zero) utilize a semantic class label graph, which results in a more accurate distance metric in the semantic embedding space and an improved performance in ZSL. Likewise, Xian et al. (xian2016latent) address the same problem by proposing a novel latent embedding model, which learns a compatibility function between the image and semantic (class) embeddings. The model utilizes image and class-level side-information that is either collected through human annotation or through an unsupervised way from a Web repository of text corpora.
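To make the notion of a compatibility function concrete, the following sketch scores an image feature against class embeddings with a single bilinear map and predicts the best-scoring unseen class. This is a deliberately simplified stand-in, not the model of either study above; the matrix `W`, the class embeddings and the image feature are hypothetical values chosen so that the arithmetic is easy to follow.

```python
def bilinear_score(x, W, y):
    """Compatibility F(x, y) = x^T W y for image feature x and class embedding y."""
    # Project the image feature into the class-embedding space: x^T W.
    projected = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(y))]
    # Dot product with the class embedding.
    return sum(p * y_j for p, y_j in zip(projected, y))

def predict(x, W, class_embeddings):
    """Return the class whose embedding is most compatible with the image feature."""
    return max(class_embeddings, key=lambda c: bilinear_score(x, W, class_embeddings[c]))

# Hypothetical "learned" map from a 3-d image space to a 2-d semantic space.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, 0.5]]

# Hypothetical semantic embeddings of classes with no training images.
UNSEEN_CLASSES = {"zebra": [1.0, 0.0], "whale": [0.0, 1.0]}

image_feature = [0.9, 0.1, 0.2]
predicted = predict(image_feature, W, UNSEEN_CLASSES)
```

In a zero-shot setting `W` would be fitted on seen classes only, and the unseen classes enter solely through their semantic embeddings at prediction time.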
Lee et al. (Lee2018) propose a novel deep learning architecture for multi-label ZSL, which relies on KGs for the discovery of the relationships between multiple classes of objects. The KG is built on knowledge stemming from WordNet and contains three types of label relations: super-subordinate, positive correlation, and negative correlation. The KG is coupled to a GGNN-type module for predicting labels.
In the same vein, Wang, Ye and Gupta (Wang2018b) exploit the information contained in KGs about unseen objects, in order to infer visual attributes that enable their detection. The KG nodes correspond to semantic categories and the edges to semantic relationships, whereas the input to each node is the vector representation (semantic embedding) of each category. A GCN is used to transfer information between different categories. This way, by utilizing the semantic embedding of a novel category, the method can link it to familiar categories in the KG and, thus, infer its attributes. The experimental evaluation demonstrated a significant improvement on the ImageNet dataset, while the ablation studies indicated that the incorporation of KGs enabled the system to learn meaningful classifiers on top of semantic embeddings.
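The way a graph convolution transfers information between neighbouring categories can be sketched with a single propagation step. The version below adds self-loops, averages neighbour features, applies a linear map and a ReLU; the mean aggregation is a simplifying assumption, and the toy graph, features and weights are hypothetical rather than taken from the study above.

```python
def gcn_layer(adj, features, weight):
    """One simplified graph-convolution step: ReLU(mean over neighbours and self of H, times W)."""
    n = len(adj)
    d_in = len(features[0])
    d_out = len(weight[0])
    output = []
    for i in range(n):
        # Adjacency row with a self-loop, so a node also keeps its own features.
        row = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        degree = sum(row)
        # Mean of the features of the node and its neighbours.
        aggregated = [
            sum(row[j] * features[j][k] for j in range(n)) / degree
            for k in range(d_in)
        ]
        # Linear transformation followed by ReLU.
        transformed = [
            sum(aggregated[k] * weight[k][m] for k in range(d_in))
            for m in range(d_out)
        ]
        output.append([max(0.0, value) for value in transformed])
    return output

# Hypothetical chain graph 0 - 1 - 2; node 2 plays the role of a novel category
# that carries no information of its own.
adjacency = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]
node_features = [[1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 0.0]]
identity_weight = [[1.0, 0.0],
                   [0.0, 1.0]]

updated = gcn_layer(adjacency, node_features, identity_weight)
```

After one step, node 2 has acquired half of node 1's feature vector, which is the sense in which knowledge flows along KG edges from familiar categories to unseen ones.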
In , the use of structured prior knowledge led to improved performance on the task of multi-label image classification. The KG is built using WordNet for the concepts and Visual Genome for the relations among them. An interesting aspect of this study is the introduction of a novel NN architecture, Graph Search Neural Network, as a means to efficiently incorporate large knowledge graphs, in order to be exploited for CV tasks.
4.2 Inductive Learning
The benefits of developing intelligent visual components with reasoning and learning abilities are becoming evident in domains broader than CV, such as Robotics. This conclusion was nicely demonstrated in a recent special issue of the AI Journal, where causality-based reasoning emerged as a key contribution. It is, therefore, interesting to investigate how the recent trend of combining knowledge-based representations with model-free models for the development of intelligent robots is making an impact on related OP research.
A highly prominent line of research for modeling uncertainty and high-level action knowledge is focusing on combining expressive logical probabilistic formalisms, ontological models and ML. In  for example, the system learns probabilistic first-order rules describing relational affordances and pre-grasp configurations from uncertain video data. It uses the ProbFOIL+ rule learner, along with a simple ontology capturing object categories.
More recently, Moldovan et al. (Moldovan2018) significantly extended this approach, using the Distributional Clauses (DCs) formalism that integrates logic programming and probability theory. DCs can use both continuous and discrete variables, which is highly appropriate for modeling uncertainty, in comparison for instance to ProbLog, which is commonly found in relevant literature. Compared to approaches that model affordances with Bayesian Networks, this approach scales much better, but most importantly, due to its relational nature, structural parts of the theory, such as the abstract action-effect rules, can be transferred to similar domains without the need to be learned again.
A similar objective is pursued by Katzouris et al. (Katzouris19), who propose an abductive-inductive incremental algorithm for learning and revising causal rules, in the form of Event Calculus programs. The Event Calculus is a highly expressive, non-monotonic formalism for capturing causal and temporal relations in dynamic domains. The approach uses the XHAIL system as a basis, but sacrifices completeness due to its incremental nature. Yet, it is able to learn weighted causal temporal rules, in the form of Markov Logic Networks, scaling up to large volumes of sequential data with a time-like structure.
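For readers unfamiliar with the Event Calculus, the sketch below illustrates only its core inference pattern, the law of inertia, in a simplified discrete-time form: a fluent starts to hold after it is initiated and keeps holding until it is terminated. It says nothing about how rules are learned, and the fluent name, time points and function `holds_at` are hypothetical.

```python
def holds_at(fluent, t, initiated, terminated, initially=frozenset()):
    """Return True if `fluent` holds at integer time `t` under the law of inertia.

    `initiated` and `terminated` are sets of (fluent, time) pairs; an event at
    time s affects the fluent from time s + 1 onwards.
    """
    state = fluent in initially
    for step in range(t):
        if (fluent, step) in terminated:
            state = False
        # Checked second, so initiation wins if both occur at the same step.
        if (fluent, step) in initiated:
            state = True
    return state

# Hypothetical narrative: a cup is picked up at time 2 and put down at time 5.
initiated = {("holding(cup)", 2)}
terminated = {("holding(cup)", 5)}

holding_timeline = [holds_at("holding(cup)", t, initiated, terminated) for t in range(8)]
```

The fluent is false up to and including time 2, true from time 3 through time 5, and false again from time 6, without any event having to restate it at the intermediate time points.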
Also worth mentioning is the study of Antanas et al. (antanas2018semantic), which, instead of learning how to map visual perceptions to task-dependent grasps, uses a probabilistic logic module to semantically reason about the most likely object part to be grasped, given the object properties and task constraints. The approach models rules in Causal Probabilistic logic, implemented in ProbLog, in order to reason about object categories, about the most affordable tasks and about the best semantic pre-grasps.
5 Discussion and Conclusion
| Indicative Recent Literature | CV Problem Focus | ML Methods Applied | KB Methods Applied | KB Contribution | KB-ML Impact |
| , , , , | affordance detection | CNN, GNN, GGNN | Knowledge Graphs | 3, 4, 5, 7 | offers new insights |
| , , , , | affordance detection | scoring functions, probabilistic programming models, Bayesian Networks | OWL Ontology | 1, 2, 3, 4, 5, 6, 9 | offers new insights and improves SotA |
| , , | object detection | RCNN, CNN | Knowledge Graph, BLN | 1, 3, 4, 5, 8 | offers new insights |
| , , , | object detection | scoring functions, probabilistic programming models | OWL Ontology, DL, MLN | 1, 2, 3, 4, 5, 8, 9 | improves SotA |
| , , | scene understanding | probabilistic programming, Bayesian Network | BLN | 2, 3, 4, 8 | offers new insights |
| , , | scene understanding | GGNN | Knowledge Graph | 3, 4, 7 | improves SotA |
| , , , , , , , , , , | VQA | CNN, LSTM, RCNN | Knowledge Graphs (RDF mostly) | 1, 2, 3, 4, 5, 8 | offers new insights and improves SotA |
| , | VQA | Gaussian Mixture Model, SVM | RDF Graph | 2, 3, 4, 5 | improves SotA |
| | VQA | Gated Recurrent Unit Network | RDF Graph | 1, 2, 4, 8 | offers new insights |
| , , , | visual reasoning | Neural Module Network | Symbolic Programming Language | 2, 3, 5 | offers new insights |
| , , | image classification / zero-shot recognition | GGNN, GCN | Knowledge Graph, RDF Graph | 1, 2, 5 | offers new insights and improves SotA |
| , | image classification / zero-shot recognition | Latent embedding model, Markov Chain Process | Knowledge Graph | 1, 2, 5 | offers new insights |
| , , , | affordance learning | scoring functions, probabilistic programming models | FOL, Causal Probabilistic Logic, MLN, Event Calculus | 1, 2, 3, 6, 7, 9 | improves SotA |
KB Contribution: 1: concept abstraction/reuse, 2: complex data querying, 3: spatial reasoning, 4: contextual reasoning, 5: relational reasoning, 6: temporal reasoning, 7: causal reasoning, 8: access to open-domain knowledge, 9: formal semantics
The investigation of state-of-the-art research discussed so far revealed prominent solutions for various OP-related subtopics and new insights obtained from novel contributions (Table 1). It can also help frame open questions that are still hard to tackle in the attempt to combine ML and knowledge-based approaches in the given domain.
The exploitation of CS knowledge is a characteristic example: its significance was acknowledged more than two decades ago and the research conducted over the years has contributed studies that combine methodologies from diverse fields of AI. At the same time, it is also evident that the progress in coupling textual and visual analysis, which is the mainstream in current VQA-related studies, still has a long way to go. Moreover, it will require further integration of other aspects, such as complex forms of inferencing or ways to fuse the huge volume of general knowledge that exists on the Web, while eliminating the problems related to the bias of information found online.
Progress in the field of learning from demonstration can also prove a vital contribution to CS inferencing and vice versa. Apart from the visual challenges involved, this application domain is also characterized by the central role of agents, mostly human ones. Interaction with human users calls for intuitive means of communication, where high-level, declarative languages seem to offer a natural way of capturing human intuition. Transferring knowledge between high-level languages and low-level models is a key area of investigation for future symbiotic systems and a fruitful domain for symbolic approaches. Apart from the ontological perspectives related to the representation of high-level concepts, complex reasoning tasks, such as causal, temporal or counterfactual reasoning, cannot be left aside.
Still, the most demanding outcomes expected from the integration of knowledge-based and ML methodologies concern the aspects of causality learning and explainability. As argued by Pearl (Pearl18), ML needs to go beyond the detection of associations, in order to exhibit explainability and counterfactual reasoning. Furthermore, the black-box character of ML-based methods hinders the understanding of their behavior, and eventually the acceptance of such systems. For example, a recent work demonstrated the fundamental inability of feed-forward neural networks to efficiently and robustly learn visual relations, which renders the high performance that networks of this type achieved highly suspicious [24, 45].
The present review shows initial evidence that progress in these two aspects can indeed be achieved by building on appropriately combined hybrid methodologies. The example of GNNs is indicative, offering a well-established combination of model-free and model-based methods, and having achieved prominent results in diverse CV tasks, such as object classification, affordance detection and scene understanding. Other approaches also present promising behavior, overall supporting the view that hybrid methods constitute an avenue worth exploring.
References
-  (2015) From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv preprint arXiv:1511.03292. Cited by: Table 1.
-  (2019) Integrating knowledge and reasoning in image understanding. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 6252–6259. Cited by: §1.
-  (2017) Efficient interactive decision-making framework for robotic applications. Artificial Intelligence 247, pp. 187–212. Cited by: §2.1.3, Table 1.
-  (2018) Relational affordance learning for task-dependent robot grasping. In Inductive Logic Programming, N. Lachiche and C. Vrain (Eds.), Cham, pp. 1–15. Cited by: §4.2, Table 1.
-  (2018) Semantic and geometric reasoning for robotic grasping: a probabilistic logic approach. Autonomous Robots, pp. 1–26. Cited by: Table 1.
-  (2018) Know rob 2.0: a 2nd generation knowledge processing framework for cognition-enabled robotic agents. In 2018 IEEE ICRA, pp. 512–519. Cited by: §2.1.3, Table 1.
-  (2004) Knowledge representation and reasoning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. Cited by: §2.1.
-  (2015) HICO: A benchmark for recognizing human-object interactions in images. IEEE ICCV 2015 Inter, pp. 1017–1025. Cited by: §2.1.3, Table 1.
-  (2018) Iterative Visual Reasoning beyond Convolutions. IEEE CVPR, pp. 7239–7248. Cited by: §2.1.2, Table 1.
-  (2017) Situated bayesian reasoning framework for robots operating in diverse everyday environments. In International Symposium on Robotics Research (ISRR). Cited by: §2.3.2, Table 1.
-  (2018) Learning to Act Properly: Predicting and Explaining Affordances from Images. IEEE CVPR, pp. 975–983. Cited by: Table 1, §5.
-  (2019) RoboCSE: robot common sense embedding. arXiv preprint arXiv:1903.00412. Cited by: §2.1.3.
-  (1995) Artificial intelligence: theory and practice. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA. Cited by: §2.1.
-  (2014) Large-scale object classification using label relation graphs. In ECCV, pp. 48–64. Cited by: §2.1.
-  (2015) Zero-shot object recognition by semantic manifold distance. In IEEE CVPR, pp. 2635–2644. Cited by: Table 1.
-  (2016) Living with robots: interactive environmental knowledge acquisition. Robotics and Autonomous Systems 78, pp. 1–16. Cited by: Table 1.
-  (2019) Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. IJCV 127 (4), pp. 398–414. Cited by: §2.2.1.
-  (2019) Scene graph generation with external knowledge and image reconstruction. In IEEE CVPR, pp. 1969–1978. Cited by: Table 1.
-  (2017) Going deeper into action recognition: a survey. IMAVIS 60, pp. 4–21. Cited by: §1.
-  (2017) Learning to Reason: End-to-End Module Networks for Visual Question Answering. IEEE ICCV 2017-Octob (Figure 1), pp. 804–813. Cited by: Table 1.
-  (2017) How a general-purpose commonsense ontology can improve performance of learning-based image retrieval. arXiv preprint arXiv:1705.08844. Cited by: §3.2, Table 1.
-  (2017) Inferring and executing programs for visual reasoning. In IEEE ICCV, pp. 2989–2998. Cited by: Table 1.
-  (2019) Online learning of weighted relational rules for complex event recognition. In Machine Learning and Knowledge Discovery in Databases, M. Berlingerio, F. Bonchi, T. Gärtner, N. Hurley, and G. Ifrim (Eds.), Cham, pp. 396–413. Cited by: Table 1.
-  (2018) Not-so-clevr: learning same–different relations strains feedforward neural networks. Interface focus 8 (4), pp. 20180011. Cited by: §5.
-  (2017) Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. IJCV 123 (1), pp. 32–73. Cited by: §2.2.1, Table 1.
-  (2018) Multi-label Zero-Shot Learning with Structured Knowledge Graphs. IEEE CVPR, pp. 1576–1585. Cited by: Table 1.
-  (2017) Artificial cognition for social human–robot interaction: an implementation. Artificial Intelligence 247, pp. 45–69. Cited by: §2.1.3, Table 1.
-  (2017) Incorporating external knowledge to answer open-domain visual questions with dynamic memory networks. arXiv preprint arXiv:1712.00733. Cited by: §2.3.1, Table 1.
-  (2019) Visual question answering as reading comprehension. In IEEE CVPR, pp. 6319–6328. Cited by: §3.2, Table 1.
-  (2017) Situation recognition with graph neural networks. In IEEE ICCV, pp. 4173–4182. Cited by: §2.1.1, Table 1, §5.
-  (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §2.1.1.
-  (2015) Don’t just listen, use your imagination: leveraging visual common sense for non-visual tasks. In IEEE CVPR, pp. 2984–2993. Cited by: §3.1, Table 1.
-  (2018) Deep learning for generic object detection: a survey. arXiv preprint arXiv:1809.02165. Cited by: §1.
-  (2018) Structure Inference Net: Object Detection Using Scene-Level Context and Instance-Level Relationships. IEEE CVPR, pp. 6985–6994. Cited by: Table 1.
-  (2018) R-VQA: Learning visual relation facts with semantic attention for visual question answering. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1880–1889. Cited by: §2.2.1, Table 1.
-  (2017) The more you know: using knowledge graphs for image classification. IEEE CVPR 2017-Janua, pp. 20–28. Cited by: §4.1, Table 1, §5.
-  (2018-01-01) Relational affordances for multiple-object manipulation. Autonomous Robots 42 (1), pp. 19–44. Cited by: Table 1.
-  (2018) Straight to the facts: learning knowledge base retrieval for factual visual question answering. In Proceedings of the ECCV (ECCV), pp. 451–468. Cited by: §2.2.1, §3.2, Table 1.
-  (2016) A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104 (1), pp. 11–33. Cited by: §1.
-  (2018) Theoretical impediments to machine learning with seven sparks from the causal revolution. arXiv preprint arXiv:1801.04016. Cited by: §5.
-  (2015) Learning semantic relationships for better action retrieval in images (Supplementary). Computer Vision and Pattern Recognition, pp. 1–4. Cited by: Table 1.
-  (2017) Transferring skills to humanoid robots by extracting semantic representations from observations of human activities. Artificial Intelligence 247, pp. 95–118. Cited by: §2.1.3, Table 1.
-  (2019) Robot learning from demonstration: a review of recent advances. Annual Review of Control, Robotics, and Autonomous Systems, pp. In Press. Cited by: §2.1.3.
-  (2017) YOLO9000: better, faster, stronger. In IEEE CVPR, pp. 7263–7271. Cited by: §2.1, Table 1.
-  (2018) The elephant in the room. arXiv preprint arXiv:1808.03305. Cited by: §5.
-  (2016) Probability and common-sense: tandem towards robust robotic object recognition in ambient assisted living. In Ubiquitous Computing and Ambient Intelligence, pp. 3–8. Cited by: §3.2.
-  (2015) Viske: visual knowledge extraction and question answering by visual verification of relation phrases. In IEEE CVPR, pp. 1456–1464. Cited by: §3.1, Table 1.
-  (2017) A simple neural network module for relational reasoning. (Nips). Cited by: §2.2.2, Table 1.
-  (2019) What Object Should I Use? - Task Driven Object Detection. Cited by: §2.1.1, Table 1.
-  (2008) The graph neural network model. IEEE Transactions on NN 20 (1), pp. 61–80. Cited by: §2.1.1.
-  (2019) KVQA: knowledge-aware visual question answering. Cited by: §2.3.1, Table 1.
-  (2017-06) Special issue on ai and robotics. In Artificial Intelligence, K. Rajan and A. Saffiotti (Eds.), Vol. 247, pp. 1–440. Cited by: §4.2.
-  (2016) Artificial intelligence and life in 2030. One Hundred Year Study on Artificial Intelligence: Report of the 2015-2016 Study Panel. Cited by: §1.
-  (2018) Learning Visual Knowledge Memory Networks for Visual Question Answering. IEEE CVPR, pp. 7736–7745. Cited by: §2.2.1, Table 1.
-  (2017) Representations for robot knowledge in the knowrob framework. Artificial Intelligence 247, pp. 151–169. Cited by: Table 1.
-  (2019-07) Recent advances in imitation learning from observation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 6325–6331. Cited by: §2.1.3.
-  (2007) Handbook of knowledge representation. Elsevier Science, San Diego, USA. Cited by: §2.1.
-  (2015) Learning common sense through visual abstraction. In IEEE ICCV, pp. 2542–2550. Cited by: §3.1, §3, Table 1.
-  (2018) Fvqa: fact-based visual question answering. IEEE Trans. on PAMI 40 (10), pp. 2413–2427. Cited by: §3.2, Table 1.
-  (2018) Zero-Shot Recognition via Semantic Embeddings and Knowledge Graphs. IEEE CVPR, pp. 6857–6866. Cited by: Table 1.
-  (2018) Image Captioning and Visual Question Answering Based on Attributes and External Knowledge. IEEE Trans. on PAMI 40 (6), pp. 1367–1381. Cited by: Table 1.
-  (2017) Visual question answering: a survey of methods and datasets. CVIU 163, pp. 21–40. Cited by: §1.
-  (2016) Ask me anything: free-form visual question answering based on knowledge from external sources. In IEEE CVPR, pp. 4622–4630. Cited by: Table 1.
-  (2016-06) Harnessing object and scene semantics for large-scale video understanding. In The IEEE CVPR. Cited by: §2.1.1.
-  (2016) Latent embeddings for zero-shot classification. In IEEE CVPR, pp. 69–77. Cited by: Table 1.
-  (2017) What can i do around here? Deep functional scene understanding for cognitive robots. IEEE ICRA, pp. 4604–4611. Cited by: §2.1.1, Table 1.
-  (2018) Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. Advances in Neural Information Processing Systems 2018-Decem (NeurIPS), pp. 1031–1042. Cited by: Table 1.
-  (2016) Towards lifelong object learning by integrating situated robot perception and semantic web mining. In Proceedings of the Twenty-second European Conference on Artificial Intelligence, pp. 1458–1466. Cited by: §3.2, Table 1.
-  (2017) Making sense of indoor spaces using semantic web mining and situated robot perception. In European Semantic Web Conference, pp. 299–313. Cited by: §2.3.2, §2.3.2, Table 1.
-  (2014) Reasoning about object affordances in a knowledge base representation. In ECCV, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Cham, pp. 408–424. Cited by: §2.1.
-  (2015) Visual7W: Grounded Question Answering in Images. Cited by: Table 1.
-  (2015) Building a large-scale multimodal knowledge base for visual question answering. CoRR abs/1507.05670. Cited by: §2.1.