RoboCSE: Robot Common Sense Embedding

03/01/2019 ∙ by Angel Daruna, et al. ∙ Georgia Institute of Technology

Autonomous service robots require computational frameworks that allow them to generalize knowledge to new situations in a manner that models uncertainty while scaling to real-world problem sizes. The Robot Common Sense Embedding (RoboCSE) showcases a class of computational frameworks, multi-relational embeddings, that have not been leveraged in robotics to model semantic knowledge. We validate RoboCSE on a realistic home environment simulator (AI2Thor) to measure how well it generalizes learned knowledge about object affordances, locations, and materials. Our experiments show that RoboCSE can perform prediction better than a baseline that uses pre-trained embeddings, such as Word2Vec, achieving statistically significant improvements while using orders of magnitude less memory than our Bayesian Logic Network baseline. In addition, we show that predictions made by RoboCSE are robust to significant reductions in data available for training as well as domain transfer to MatterPort3D, achieving statistically significant improvements over a baseline that memorizes training data.




I Introduction

Robots operating in human environments benefit from encoding world information in a semantically-meaningful representation in order to facilitate generalization and domain transfer. This work focuses on the problem of semantically representing a robot’s world in a robust, generalizable, and scalable fashion. Semantic knowledge is typically modeled by a set of entities representing concepts known to the robot (e.g. apple, fabric, kitchen), and a set of possible relations (e.g. atLocation, hasMaterial, hasAffordance) between them [tenorth2013knowrob, chernovasituated, saxena2014robobrain, zhu2014reasoning, celikkanat2015probabilistic, beetz2018know].

While some semantic information can be hard-coded, large-scale and long-term deployments of autonomous systems require the development of computational frameworks that i) enable abstract concepts to be learned and generalized from observations, ii) effectively model the uncertain nature of complex real-world environments, and iii) are scalable, incorporating data from a wide range of environments (e.g., hundreds of households). Previous work in semantic reasoning for robot systems has addressed subsets of the above challenges. Directed graphs [koller2009probabilistic] used in [saxena2014robobrain] allowed individual observations to adapt generalized concepts at large scale, integrating multiple projects. Bayesian Logic Networks (BLNs) [jain2009bayesian] in [chernovasituated] allowed for precise probabilistic inference and learning, assuming knowledge graphs have manageable sizes. Description Logics (DL) [baader2008description] used in [tenorth2010knowrob] allowed for large-scale deterministic reasoning about many concepts. In summary, each of these representations has limitations with respect to at least one of the three characteristics above.

Fig. 1: RoboCSE can be queried by a robot to infer knowledge and make decisions.

In this work, we contribute Robot Common Sense Embedding (RoboCSE), a novel computational framework for semantic reasoning that is highly scalable, robust to uncertainty, and generalizes learned semantics. Given a knowledge graph G, formalized in Section III-A, RoboCSE encodes semantic knowledge using multi-relational embeddings [nickel2016review], embedding G into a high-dimensional vector space that preserves graphical structure between nodes and edges, while also facilitating generalization (see Figures 1(a) and 1(e)). We show that RoboCSE can be trained on simulated environments (AI2Thor [kolve2017ai2]), and that the resulting learned model effectively transfers to data from real-world domains, including both pre-recorded household scenes (MatterPort3D [chang2017matterport3d]) and real-time execution on a robot (Figure 1).

Fig. 2: From a directed graph (1(a)) we can learn a vector embedding containing the same nodes and edges. Multi-relational embeddings begin randomly initialized (1(b)) and are updated by calculating the losses between target transformations and actual transformations (1(c)-1(d)) until they converge on a semantically meaningful structure (1(e)-1(f)).

We compare our work to three baselines: BLNs, Google’s pre-trained Word2Vec embeddings [mikolov2013distributed], and a theoretical upper bound on the performance of logic-based methods [baader2008description]. Our results show that RoboCSE uses orders of magnitude less memory than BLNs (implemented using ProbCog [jain2010soft]) and outperforms the Word2Vec and logic-based baselines across all accuracy metrics. RoboCSE also successfully generalizes beyond the training data, inferring triples held out from the training set, by leveraging latent interactions between multiple relations for a given entity. Furthermore, results returned by RoboCSE are ranked by confidence score, enabling robot behavior architectures to be constructed that effectively reason about the level of uncertainty in the robot’s knowledge. Combined, the memory efficiency and learned generalizations of RoboCSE allow a robot to semantically model a larger world while accounting for uncertainty.

II Related Work

Data-driven methods using convolutional neural networks have shown promise in semantically reasoning about high-level tasks [zhu2017visual] and trajectory selection [sung2018robobarista]. However, the representations learned by these methods can lose the graph structure beneficial for reasoning, often require large amounts of data, and use end-to-end pipelines that could lead to performance drops across domains even with similar semantics. Inspired by data-driven approaches, RoboCSE is a learned representation but makes simplifying mathematical assumptions about the embedding space to promote structure. In addition, we decouple perceptual stimuli from semantics by operating on symbols (vectors) to allow domain transfer without tuning (e.g. virtual vs. real-world).

We propose to use multi-relational embeddings to learn a knowledge graph G. Multi-relational embeddings represent knowledge graphs in vector space, encoding vertices that represent entities as vectors and edges that represent relations as mappings. The simplest of these models are Translational Methods [wang2017kge_survey] such as TransE [bordes2013transe], TransH [wang2014transh], and TransR [lin2015transr]. Semantic Matching Methods have outperformed translational methods because they offer a wider range of possible relations between entities than vector addition [wang2017kge_survey]. Semantic matching methods [socher2013reasoning, dong2014knowledge] leverage neural networks to capture non-linear transforms between entities. However, the increase in modeling parameters requires more data and training to avoid over-fitting, which may be difficult for robot systems to acquire.

Our work uses ANALOGY [liu2017analogical], a semantic matching method, to learn multi-relational embeddings. ANALOGY constrains relations to be normal linear mappings between entities to promote structure in the learned embeddings and simplify the optimization objective. The multiplicative relationship between entities allows for more complex relations to be expressed than vector addition while only requiring a single matrix per relation, balancing scalability with expressiveness to achieve state-of-the-art results [wang2017kge_survey].

III Approach

III-A Background: Multi-Relational Embeddings

The objective of the multi-relational (i.e. knowledge graph) embedding problem is to learn a continuous vector representation of a knowledge graph G, encoding vertices that represent entities as a set of vectors E ⊂ ℝ^d_E and edges that represent relations as mappings between vectors R ⊂ ℝ^d_R, where d_E and d_R are the dimensions of the entity vectors and relation mappings, respectively [nickel2016review, wang2017kge_survey]. The knowledge graph G is composed of individual knowledge triples (h, r, t) such that h, t ∈ E are identified as the head and tail entities of the triple, respectively, for which the relation r ∈ R holds (e.g. cup, hasAffordance, fill). Collectively, the set of all triples from a dataset forms a directed graph expressing the knowledge for that domain (note this directed graph is considered incomplete because some set of triples may be missing).

Generically, a multi-relational embedding is learned by minimizing a loss L using a scoring function f(h, r, t) over the set of knowledge triples from G. In addition to knowledge triples from G, embedding performance substantially improves when negative triples are sampled from a negative-triple knowledge graph G′ [nickel2016review]. Therefore, the loss is defined over labeled triples (h, r, t, y), where y ∈ {−1, +1} is the negative or positive label for the triple.

III-B RoboCSE

RoboCSE is a computational framework for semantic reasoning that uses multi-relational embeddings to encode abstract knowledge obtained by the robot from its sensors, simulation, or even external knowledge graphs (Figure 3). The robot can use the resulting knowledge representation as a queryable database to obtain information about its environment, such as likely object locations, material properties of objects, object affordances, and any other relation-based semantic information the robot is able to mine.

Figure 2 summarizes the embedding process, conceptually. Training instances are provided in the form of knowledge triples (Figure 1(a)). The embedding space containing all entity vectors has no structure before training (Figure 1(b)) because the relational embeddings must be learned. Therefore, all entities and relations are initialized as random vectors and mappings, respectively.


Each training instance provided is used to perform stochastic gradient descent. The loss function is formulated as L = −log σ(y · f(h, r, t)), where σ is the sigmoid function and f is a bilinear scoring function of the form f(h, r, t) = hᵀ W_r t [liu2017analogical]. Given a particular multi-relational embedding, its loss function is used to compute a loss between a current vector and a target vector (Figure 1(d)). The current vector is calculated using a subset of the knowledge triple (e.g. pick up in Figure 1(c)) and the target vector is calculated using the remaining subset (e.g. mug hasAffordance in Figure 1(c)). This loss is used to update the appropriate vectors and mappings to better approximate the correct representation (Figure 1(e)).
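As a concrete illustration, the score and loss above can be sketched in a few lines of NumPy. This is a hedged sketch, not the authors' implementation: W_r here is an unconstrained dense matrix, whereas ANALOGY additionally constrains relation matrices to be normal, mutually commuting mappings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(h, W_r, t):
    """Bilinear score f(h, r, t) = h^T W_r t; higher means more plausible."""
    return h @ W_r @ t

def triple_loss(h, W_r, t, y):
    """-log sigma(y * f(h, r, t)); y = +1 for observed triples, -1 for negatives."""
    return -np.log(sigmoid(y * score(h, W_r, t)))

# Toy check: a positive triple with a strongly positive score yields a small loss.
d = 4
h = np.ones(d)
t = np.ones(d)
W_r = np.eye(d)                          # identity relation: score = h . t = 4
loss_pos = triple_loss(h, W_r, t, +1)    # small, triple fits the embedding
loss_neg = triple_loss(h, W_r, t, -1)    # large, same triple labeled negative
```

The gradients of this loss with respect to h, t, and W_r are what drive the stochastic gradient descent updates described above.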

The entity vectors and relation mappings converge to semantically meaningful values after repeating the training process with different subsets of knowledge instances (Figure 1(e)), and can then be used to perform inference (see [nickel2016review] for more on learning in multi-relational embeddings). In Figure 1(f) we see that similar entities are grouped horizontally, cabinets are more likely to be filled than picked up, and mugs are equally likely to have either affordance.

Inference in RoboCSE is done by completing a knowledge triple given only partial information. For example, given (h, r, ?), RoboCSE returns a list of the most likely tails to complete the knowledge instance. Mathematically, given (h, r, ?), the relation r maps the head h by some transformation, and the entity vectors with the highest scores relative to the resultant vector are selected as results, which represent the most likely tails. In the case of RoboCSE, which uses ANALOGY [liu2017analogical], r maps h via W_r h. Result tails are ordered using the bilinear scoring function f(h, r, t) = hᵀ W_r t, in which higher scores indicate more likely results (i.e. more closely aligned vectors). RoboCSE can make inferences about knowledge triples it has never seen before because these transformations can be applied to any entities in the embedding space, allowing for generalization.
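A tail query can thus be answered by mapping the head through the relation and ranking every entity vector by its alignment with the result. A minimal sketch, where the entity names and 2-D vectors are hypothetical toy values:

```python
import numpy as np

def rank_tails(h, W_r, entities):
    """Complete (h, r, ?): map h through W_r, then rank candidate tails by score."""
    mapped = W_r @ h                                   # the 'resultant' vector
    scores = {name: float(mapped @ vec) for name, vec in entities.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical embedding: 'fill' aligns with the mapped head, 'open' opposes it.
entities = {"fill": np.array([1.0, 0.0]),
            "pick_up": np.array([0.0, 1.0]),
            "open": np.array([-1.0, 0.0])}
ranking = rank_tails(np.array([1.0, 0.0]), np.eye(2), entities)
# ranking == ['fill', 'pick_up', 'open']
```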

Fig. 3: Overview of RoboCSE framework integrated with mobile robot.

An assumption made widely across prior multi-relational embedding work [bordes2013transe, liu2017analogical, wang2017kge_survey] is that query responses are deterministic (i.e. either always true with rank 1 or false with lower ranks), and only factual relations are provided in the training data. However, the semantic data we are modeling is highly stochastic; for example, multiple potential locations are likely for a given object. As a result, the ground truth rank of responses is often not 1. Instead, ground truth ranks reflect the number of observations in the data. To support this in our evaluation, we extend the standard performance metrics of mean-reciprocal-rank (MRR) and hits-at-top-k (Hits@K), which assume a ground truth rank of 1, and for the experiments in Section V, we instead report:

MRR* = (1/N) Σᵢ 1 / (|rᵢ − r̂ᵢ| + 1),    Hits@5* = (1/N) Σᵢ 𝟙(|rᵢ − r̂ᵢ| < 5),

where N is the number of triples tested, rᵢ is the ground truth rank, and r̂ᵢ is the inferred rank. For both these metrics, scores range from 0 to 1 with 1 being the best performance. MRR* is a more complete ranking metric for which the inferred and ground truth ranks must match exactly to get an MRR* of 1. Hits@5* gives a less granular look at rankings, but is informative of how often the correct response is within some threshold. We discuss how the ground truth set and ranks are generated for each experiment in Section IV.
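A small helper pair makes the starred metrics concrete. The exact penalty form (the absolute difference between ground-truth and inferred ranks) is our reading of the description above:

```python
def mrr_star(gt_ranks, inferred_ranks):
    """Equals 1.0 only when every inferred rank matches its ground-truth rank."""
    pairs = list(zip(gt_ranks, inferred_ranks))
    return sum(1.0 / (abs(r - rh) + 1) for r, rh in pairs) / len(pairs)

def hits_at_k_star(gt_ranks, inferred_ranks, k=5):
    """Fraction of inferred ranks within k of the ground-truth rank."""
    pairs = list(zip(gt_ranks, inferred_ranks))
    return sum(abs(r - rh) < k for r, rh in pairs) / len(pairs)

perfect = mrr_star([1, 3, 2], [1, 3, 2])   # -> 1.0, exact rank agreement
```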

IV Experimental Procedure

We evaluated RoboCSE’s generalization capability on two scenarios: inferring the ranks of unseen triples (triple generalization) and accurately predicting the properties and locations of objects in previously unseen environments (environment generalization).

IV-A RoboCSE Knowledge Source: AI2Thor

TABLE I: RoboCSE Knowledge Source (AI2Thor: 3 Relation Types, 117 Entities)

Environment   Median count per environment      Environments
Bathroom      28     21     46     18           30
Bedroom       28.5   16     54.5   20           30
Kitchen       59.5   51     109    27           30
Livingroom    22.5    8     37     20           30
All           29.5   18.5   50     20           120

In this work, our semantic reasoning framework targets common sense knowledge for residential service robots. Knowledge embedded in RoboCSE was mined from a highly realistic simulation environment of household domains, AI2Thor [kolve2017ai2]. AI2Thor offers realistic environments from which instances of semantic triples about affordances and locations of objects can be mined (see Table I). Entities include 83 household items (e.g. microwave, toilet, kitchen) and 17 affordances (e.g. pick up, open, turn on). Additionally, we manually extended objects within AI2Thor to model 17 material properties (e.g. wood, fabric, glass), which were assigned probabilistically based on materials encountered in the SUNCG dataset [song2016ssc]. The addition of material properties brought the total number of triples available for training, validation, and testing to over 15K.

Prior work on multi-relational embedding has shown that inclusion of negative examples in the training data, defined as triples known to be false, leads to improved training performance [nickel2016review]. To take advantage of this result, we additionally trained on negative examples for our domain.

Similar to prior work, we used the closed-world assumption to sample negative triples. However, we found that the perturbing method suggested in [liu2017analogical] did not give the best results. Instead, better results were achieved after filtering perturbed triples to verify that each sample was not in the training set. The reason for this empirical phenomenon needs further evaluation.
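The filtered sampling described above can be sketched as follows (function and variable names are ours, not the paper's):

```python
import random

def sample_filtered_negative(triple, entities, train_set, rng):
    """Perturb the head or tail of a true triple, rejecting any candidate that
    already appears in the training set (filtered closed-world sampling)."""
    h, r, t = triple
    while True:
        if rng.random() < 0.5:
            candidate = (rng.choice(entities), r, t)   # corrupt the head
        else:
            candidate = (h, r, rng.choice(entities))   # corrupt the tail
        if candidate not in train_set:
            return candidate

train = {("mug", "hasAffordance", "fill"), ("cabinet", "hasAffordance", "open")}
entities = ["mug", "cabinet", "sink", "fill", "open"]
neg = sample_filtered_negative(("mug", "hasAffordance", "fill"),
                               entities, train, random.Random(0))
```

Note that the unfiltered variant would accept any perturbed triple, including ones that happen to be true in the training data; the filter above is the difference that improved our results.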

IV-B Inferring Unseen Triples: Triple Generalization

Fig. 4: Diagram of triples contained in train, validation, and test sets for each fold (not to scale) during triple generalization (above) and environment generalization (below).

Inevitably, an autonomous robot operating in a real-world environment will encounter problems that require answers to queries it was not trained on (e.g. can mugs be filled?). To probe how well RoboCSE can correctly generalize to do triple prediction, triple generalization performance was tested for each algorithm as follows.

Five-fold cross-validation was performed to estimate each algorithm’s performance. To generate each fold, a set of all unique triples was generated from the set of all triples in our test-case dataset by filtering out repeated triples (i.e. each triple appears exactly once in the unique set) (see Figure 4). The unique set was split into five equally sized sets of triples, one per fold. Each fold’s held-out set was divided in half to create a validation portion and a test portion. The training set for each fold was generated by ensuring that none of that fold’s validation or test triples appear in it. For each fold, the embedding was trained on the training portion while validating on the validation portion, and the learned embedding was then tested on the test portion.
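One way to realize this split, a sketch under our reading of the procedure, is:

```python
import random

def five_fold_splits(triples, seed=0):
    """Deduplicate, shuffle, and yield (train, valid, test) per fold;
    each held-out fifth is halved into validation and test portions."""
    unique = sorted(set(triples))
    random.Random(seed).shuffle(unique)
    folds = [unique[i::5] for i in range(5)]
    for i, held_out in enumerate(folds):
        half = len(held_out) // 2
        valid, test = held_out[:half], held_out[half:]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        yield train, valid, test
```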

The training process follows the same procedure as in [liu2017analogical]. Testing was done by generating three ranks for each triple (i.e. the rank of h given (?, r, t), of r given (h, ?, t), and of t given (h, r, ?)), then comparing them to their respective ground truth ranks. Each triple in the test set was a held-out triple ranked using the full distribution of triples. Ground truth ranking was calculated according to the number of observations (i.e. more observations give a higher rank). Error metrics similar to those from the relational embedding community (MRR* and Hits@K*) were calculated using the ground truth rank for comparison.

IV-C Applying Common Sense: Environment Generalization

Our second test targets the scenario of deploying a robot equipped with a semantic knowledge base in a new environment, with the goal of evaluating how well the embedded knowledge generalizes to new rooms and the degree to which a robot can use its knowledge to predict object properties or locations in the new setting (e.g. in a new house, where would I likely find a towel?). Environment generalization was tested as follows.

Five-fold cross-validation was performed to estimate each algorithm’s performance over a test-case dataset balanced across environment types (i.e. bathroom, bedroom, kitchen, livingroom). To generate each fold, the dataset was separated into four sets, one per environment type, maintaining resolution at the environment level (i.e. a single environment with all the triples it contains is an atomic unit for splitting purposes): one set each for bathrooms, bedrooms, kitchens, and living rooms (see Figure 4). Each environment-type set was then split at environment resolution into five equally sized sets of environments, one per fold. The smaller fraction of each fold was then divided in half to create a validation portion and a test portion, while the larger fraction served as a training set. Finally, the balanced train, validation, and test sets were generated by taking the union of the corresponding portions across the four environment types.
The training process followed the same procedure as in [liu2017analogical]. Testing was done by querying the tested algorithm with triples from new environments that it had not been trained on, found in each fold’s test set. The standard MRR and Hits@K were used to measure the algorithm’s performance, allowing us to assess how frequently the robot was correct on the first attempt.

V Experimental Results

In this section, we report results characterizing the performance of various models trained on AI2Thor data to understand the advantages and limitations of RoboCSE. Pre-trained Word2Vec embeddings were used in Triple Generalization as a comparable baseline not within the class of multi-relational embeddings. An upper bound on the performance of logic-based systems was also included in the Triple Generalization experiment to compare with more historically prevalent approaches [tenorth2010knowrob, lemaignan2010oro, suh2007ontology]. For Environment Generalization and Domain Transfer, an instance-based learning baseline that memorizes the training set was used. This controlled baseline gave a clear indicator of how well RoboCSE generalized knowledge beyond what was available in the training set. Lastly, the memory requirements of RoboCSE were compared to a Bayesian Logic Network (BLN) because both account for uncertainty while modeling a graph of knowledge triples, unlike Word2Vec or logic-based approaches. (Bayesian Logic Networks and Markov Logic Networks were widely used in previous works but suffer from similar intractability problems [chernovasituated, tenorth2010knowrob, zhu2014reasoning]. Due to memory requirements, the BLN baseline could not be included in all experiments.)

V-A Testing Triple Generalization

Triple Generalization was tested to quantify how well RoboCSE could infer missing triples using the learned representation (i.e. infer the rank for (fork, atLocation, kitchen) when it is not in the train set).

Two baselines were used to compare with RoboCSE: Word2Vec and a Description Logics (DL) performance upper bound. The Word2Vec baseline first forms a ‘comparison’ group of responses from all triples in the training set matching a test query (i.e. given (?, atLocation, cabinet), it groups the heads from all training triples matching this query). Using the Word2Vec embeddings of this comparison group, the Word2Vec embeddings of all candidate responses (i.e. all entities) are ranked by cosine distance. We estimated that a DL-based system could at best perform at type-specific chance (e.g. for a total of 17 affordances, guessing the correct affordance to (mug, hasAffordance, ?) in the top five hits has a 5/17 chance). This is because DL can determine the type of result that should be returned by a query but cannot infer which entity within a type would be most likely. Therefore the performance could be estimated for each query assuming type-specific chance (see Figure 4(a)). The bar graphs in Figure 5 show the performance of each algorithm w.r.t. Hits@5* and MRR* metrics for each relation and query type on the x-axis.
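The Word2Vec baseline's ranking step can be sketched as below. The vectors are toy stand-ins for pre-trained embeddings, and aggregating by mean cosine similarity to the comparison group is our assumption about how the grouping is used:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def w2v_rank(comparison_group, candidates):
    """Rank candidate entities by mean cosine similarity to the comparison group."""
    scores = {name: float(np.mean([cosine(vec, c) for c in comparison_group]))
              for name, vec in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy comparison group: heads already seen with (?, atLocation, cabinet).
group = [np.array([1.0, 0.0])]
candidates = {"plate": np.array([0.9, 0.1]),
              "towel": np.array([0.0, 1.0])}
ranking = w2v_rank(group, candidates)   # 'plate' ranks first
```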

RoboCSE outperformed all baselines across all metrics at predicting unseen triples, with statistically significant improvements on head and tail queries (compared using non-parametric two-group Mann-Whitney U tests). The DL bound performs well for relation queries because DL has explicitly defined types for all entities in a T-Box [baader2008description], allowing the framework to select the correct relation given a head and tail. The overall implication of these performance improvements is that a robot using RoboCSE to reason about a task could not only infer new knowledge it might not have been trained on to complete a task, but also reason about the confidence in the inferences to return the best result.

All algorithms performed worse at head queries than at other queries, a trend prevalent across our experiments. This drop in performance is because selecting the right entity as a head to complete a triple is a more difficult learning problem, with a far larger candidate set than selecting the correct affordance, material, or location, which are much fewer in number.

Fig. 5: Performance w.r.t. Hits@5* and MRR* metrics for Triple Generalization in AI2Thor

V-B Testing Environment Generalization

Environment Generalization was tested to measure how well RoboCSE could accurately complete triples in new rooms, motivated by real-world application of RoboCSE and the way training/deployment would proceed when a service robot encounters a new environment.

We compared RoboCSE to an instance-based learning baseline that memorizes the training set (i.e. frequency count), and initial results showed the two methods were comparable. The baseline completed triples by selecting the most-observed matching candidate (i.e. given a query (?, r, t), it returned the head most often observed with the matching relation r and tail t). We trained each algorithm on the 24 rooms of each type available, and the results showed the baseline and RoboCSE had closely matching strong performances (performance of each was within 1% of the other: roughly 90% for relation and tail queries and 40% for head queries). This was because the default rooms of AI2Thor do not have enough variety between rooms (i.e. algorithms rarely have to generalize to unseen triples).

Fig. 6: Performance Trends w.r.t. Hits@5* and MRR* Metrics for Environment Generalization in AI2Thor

However, reducing the number of rooms reveals RoboCSE’s ability to learn from the interactions of triples and generalize to its best performance faster than the baseline (see Figure 6). Lines in this plot were generated by averaging across relations for each query type at varying numbers of rooms in the training set. The trend of RoboCSE generalizing to new rooms faster than the baseline was most pronounced with the fewest rooms in the training set (i.e. one) but continued up to about nine rooms, as shown in the line plots, and was visible on both metrics. This showed, from an application’s perspective, how a robot bootstrapped with RoboCSE can learn general structures from individual instances to perform better in new environments while requiring less training data.

V-C Domain Transfer: Testing on MatterPort3D

The learned embeddings from AI2Thor were tested on MatterPort3D (MP3D) to measure how well RoboCSE transfers to environments from real-world domains. While MP3D does not contain all the object properties we included in AI2Thor (no affordances or materials), it does contain triples about object locations for over 500 real-world environments.

Fig. 7: Performance w.r.t. Hits@5 and MRR metrics for Domain Transfer to MatterPort3D

The results from domain transfer showed that RoboCSE generalized to MP3D better than our instance-based learning baseline that memorizes the training set (i.e. frequency count), effectively inferring new triples not present in the training data. Training and validation for domain transfer closely followed the Environment Generalization procedure (see Section IV-C) but only for atLocation relations. During testing, the models learned from all rooms in AI2Thor were used to answer queries about all rooms in MP3D. The bar graphs in Figure 7 show that the semantics learned in AI2Thor can be directly applied to MatterPort3D, evident in the high performance of both algorithms. Furthermore, inference in RoboCSE successfully generalized beyond the training data to accurately infer more queries, indicated by the statistically significant higher scores RoboCSE achieves on head and tail queries (compared using non-parametric two-group Mann-Whitney U tests). In short, this shows that semantics learned in simulation with RoboCSE can be applied to data from real-world environments.

V-D Analyzing Memory Requirements

We analyzed the memory requirements of RoboCSE and BLNs [jain2009bayesian] to compare the scalability of each.

To analyze memory requirements, all unique triples from AI2Thor were extracted (352) and modeled in a BLN using a standard package (ProbCog [jain2010soft]). The resulting BLN required 9 orders of magnitude more memory than RoboCSE (i.e. 100 TB vs. 96 KB). Although BLNs have been used to model semantic knowledge within robot systems to do accurate probabilistic reasoning [tenorth2010knowrob, chernovasituated], maintaining conditional-probability tables in BLNs can be intractable due to the rapid increase of node in-degree (i.e. number of parents), and therefore table size, for densely connected networks.

RoboCSE’s drastic memory reduction was possible because its space complexity scales linearly with the number of entity and relation types and does not directly depend on node in-degree. RoboCSE requires on the order of d · (number_of_entities + number_of_relations) stored values, where d is the vector-space dimensionality. While this is a considerable improvement in space complexity, RoboCSE cannot represent the joint distribution or true probabilities as a BLN can. Instead, the distances measured using a scoring function between the queried transformation and results are interpreted as confidence (see Section III-B). Furthermore, only the subset of the triple in the query can be used as ‘evidence’ to condition on (e.g. the best tails are selected conditioned on an (h, r, ?) query).
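The reported 96 KB figure is consistent with this linear scaling if we assume a dimensionality of d = 100 and 8-byte double-precision values (both illustrative assumptions on our part):

```python
def robocse_bytes(n_entities, n_relations, d, bytes_per_value=8):
    # One d-dimensional vector per entity; ANALOGY's almost-diagonal relation
    # mappings also cost on the order of d values per relation.
    return d * (n_entities + n_relations) * bytes_per_value

memory = robocse_bytes(n_entities=117, n_relations=3, d=100)
# memory == 96000 bytes, i.e. 96 KB
```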

VI Discussion & Conclusion

In this work we approached the problem of semantically representing a robot’s world via triples in a manner that supports generalization, accounts for uncertainty, and is memory-efficient. We presented RoboCSE, a novel framework for robot semantic knowledge that leverages multi-relational embeddings.

From our experiments, two benefits of using multi-relational embeddings in RoboCSE emerged: (1) the learned generalizations outperformed Word2Vec at prediction and were robust to significant reductions in training data and to domain transfer, and (2) RoboCSE used orders of magnitude less memory to represent a graph than a BLN representation of the same graph. This collectively distinct set of benefits could be leveraged to further progress toward robots that perform semantic reasoning robustly in semantically rich environments.

However, leveraging multi-relational embeddings has its limitations. As previously mentioned, answering head queries is particularly difficult; this query type is useful for robots planning tasks (i.e. which head satisfies (?, hasAffordance, fill)). Secondly, conditioning is very limited compared to a BLN, which leads to the same responses in different environments. Lastly, realistic systems in long-term deployments need the ability to learn incrementally, enabling online adaptation as new knowledge arrives, which is not possible in this RoboCSE formulation.


This work is supported in part by NSF IIS 1564080, NSF GRFP DGE-1650044, and ONR N000141612835. Zsolt Kira was partially supported by NRI/NSF grant #IIS-1426998. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the supporters.