Semantic World Modeling, Perceptual Anchoring, Probabilistic Anchoring, Statistical Relational Learning, Probabilistic Logic Programming, Object Tracking, Relational Particle Filtering, Probabilistic Rule Learning.
Statistical Relational Learning (SRL) (Getoor:2007:ISR:1296231; raedt2016statistical) tightly integrates predicate logic with graphical models in order to extend the expressive power of graphical models towards relational logic and to obtain probabilistic logics that can deal with uncertainty. After two decades of research, a plethora of expressive probabilistic logic reasoning languages and systems exists (e.g., sato2001parameter; richardson2006markov; Getoor.2013.PSL; fierens2015inference). One obstacle that still lies ahead in the field of SRL (but see gardner2014incorporating and beltagy2016representing) is to combine symbolic reasoning and learning, on the one hand, with sub-symbolic data and perception, on the other hand. The question is how to create a symbolic representation of the world from sensor data in order to reason and ultimately plan in an environment riddled with uncertainty and noise. In this paper, we take a probabilistic logic approach to study this problem in the context of perceptual anchoring.
An alternative to using SRL or probabilistic logics would be to resort to deep learning. Deep learning is based on end-to-end learning (e.g., silver2016mastering). Although exhibiting impressive results, deep neural networks do suffer from certain drawbacks. As opposed to probabilistic rules, it is, for example, not straightforward to include prior (symbolic) knowledge in a neural system. Moreover, it is also often difficult to give guarantees for the behavior of neural systems, cf. the debate around safety and explainability in AI (huang2017safety; gilpin2018explaining). Although symbolic systems are not free from this latter shortcoming, it is less of a concern for them, which makes bridging the symbolic/sub-symbolic gap paramount. A notion that aims to bridge the symbolic/sub-symbolic gap is the definition of perceptual anchoring, as introduced by coradeschi&saffiotti-2000; coradeschi&saffiotti-2001. Perceptual anchoring tackles the problem of creating and maintaining, in time and space, the correspondence between symbols and sensor data that refer to the same physical object in the external world (a detailed overview of perceptual anchoring is given in Section 3.1). In this paper, we particularly emphasize sensor-driven bottom-up anchoring (loutfi.et.al-2005), whereby the anchoring process is triggered by the sensory input data.
A further complication in robotics, and perceptual anchoring more specifically, is the inherent dependency on time. This means that a probabilistic reasoning system should incorporate the concept of time natively. One such system, rooted in the SRL community, is the probabilistic logic programming language Dynamic Distributional Clauses (DDC) (nitti2016learning), which can perform probabilistic inference over logic symbols and over time. In our previous work, we coupled the probabilistic logic programming language DDC to a perceptual anchoring system (persson2019semantic), which endowed the perceptual anchoring system with probabilistic reasoning capabilities.
A major challenge in combining perceptual anchoring with a high-level probabilistic reasoner, and one that is still an open research question, is the administration of multi-modal probability distributions in anchoring. A multi-modal probability distribution is a continuous probability distribution with strictly more than one local maximum. The key difference to a uni-modal probability distribution, such as a simple normal distribution, is that summary statistics do not adequately mirror the actual distribution. In perceptual anchoring these multi-modal distributions do occur, especially in the presence of object occlusions, and handling them appropriately is critical for correctly anchoring objects. This kind of phenomenon is well known in filtering and is the reason why particle filters can be preferred over Kalman filters. In this paper, we extend the anchoring notation in order to additionally handle multi-modal probability distributions. A second point that we have not addressed in persson2019semantic is the learning of the probabilistic rules that are used to perform probabilistic logic reasoning. We show that, instead of hand-coding these probabilistic rules, we can adapt existing methods from the SRL literature to learn them from raw sensor data. In other words, instead of providing a model of the world to a robotic agent, the agent learns this model in the form of probabilistic logical rules. These rules are then used by the robotic agent to reason about the world around it, i.e., to perform inference.
In persson2019semantic, we showed that enabling a perceptual anchoring system to reason further allows for correctly anchoring objects under object occlusions. We borrowed the idea of encoding a theory of occlusion as a probabilistic logic theory from nitti2014relational (discussed in more detail in Subsection 3.3). While nitti2014relational operated in a strongly simplified setting, by identifying objects with AR tags, we used a perceptual anchoring system instead, identifying objects from raw RGB-D sensor data. In contrast to the approach presented here, the theory of occlusion was not learned but hand-coded in these previous works and did not take into account the possibility of multi-modal probability distributions. We evaluate the extensions of perceptual anchoring proposed in this paper on two showcase examples, which exhibit exactly this behavior: 1) we perform probabilistic perceptual anchoring when object occlusion induces a multi-modal probability distribution, and 2) we perform probabilistic perceptual anchoring with a learned theory of occlusion.
We structure the remainder of the paper as follows. In Section 3, we introduce the preliminaries of this work by presenting the background and motivation of used techniques. Subsequently, we discuss, in Section 4, our first contribution by first giving a more detailed overview of our prior work (persson2019semantic), followed by introducing a probabilistic perceptual anchoring approach in order to enable anchoring in a multi-modal probabilistic state-space. We continue, in Section 5, by explaining how probabilistic logical rules are learned. In Section 6, we evaluate both our contributions on representative scenarios before closing this paper with conclusions, presented in Section 7.
3.1 Perceptual Anchoring
Perceptual anchoring, originally introduced by coradeschi&saffiotti-2000; coradeschi&saffiotti-2001, addresses a subset of the symbol grounding problem in robotics and intelligent systems. The notion of perceptual anchoring has been extended and refined since its first definition. Some notable refinements include the integration of conceptual spaces (chella.et.al-2003; chella.et.al-2004), the addition of bottom-up anchoring (loutfi.et.al-2005), extensions for multi-agent systems (leblanc&saffiotti-2008), considerations for non-traditional sensing modalities and knowledge based anchoring given full scale knowledge representation and reasoning systems (loutfi-2006; loutfi&coradeschi-2006; loutfi.et.al-2008), and perception and probabilistic anchoring (blodow.et.al-2010). All these approaches to perceptual anchoring share, however, a number of common ingredients from coradeschi&saffiotti-2000; coradeschi&saffiotti-2001, including:
A symbolic system (including: a set of individual symbols; a set of predicate symbols).
A perceptual system (including: a set of percepts; a set of attributes with values in the domain ).
Predicate grounding relations that encode the correspondence between unary predicates and values of measurable attributes (i.e., the relation maps a certain predicate to compatible attribute values).
While the traditional definition of coradeschi&saffiotti-2000; coradeschi&saffiotti-2001 assumed unary encoded perceptual-symbol correspondences, this does not support the maintenance of anchors with different attribute values at different times. To address this problem, persson.et.al-2017 distinguishes two different types of attributes:
Static attributes , which are unary within the anchor according to the traditional definition.
Volatile attributes , which are individually indexed by time , which are maintained in a set of attribute instances , such that .
Without loss of generality, we assume from here on that all attributes stored in an anchor are volatile, i.e., that they are indexed by a time step . Static attributes are trivially converted to volatile attributes by giving them the same attribute value in each time step.
Given the components above, an anchor is an internal data structure , indexed by time and identified by a unique individual symbol (e.g., mug-1 and apple-4), which encapsulates and maintains the correspondences between percepts and symbols that refer to the same physical object, as depicted in Figure 1. Following the definition presented by loutfi.et.al-2005, the principal functionalities to create and maintain anchors in a bottom-up fashion, i.e., functionalities triggered by a perceptual event, are:
Acquire – initiates a new anchor whenever a candidate object is received that does not match any existing anchor . This functionality defines a structure , indexed by time and identified by a unique identifier , which encapsulates and stores all perceptual and symbolic data of the candidate object.
Re-acquire – extends the definition of a matching anchor from time to time . This functionality ensures that the percepts pointed to by the anchor are the most recent and adequate perceptual representation of the object.
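The acquire and re-acquire functionalities can be illustrated with a minimal sketch. The class names, the attribute layout (attribute name mapped to a time-indexed history) and the category-based symbol scheme are assumptions made for illustration only; they are not the data structures of the anchoring system described in this paper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Anchor:
    """Hypothetical anchor: a unique symbol plus volatile (time-indexed) attributes."""
    symbol: str
    # attribute name -> {time step -> attribute value}
    attributes: Dict[str, Dict[int, Any]] = field(default_factory=dict)

    def update(self, t: int, measured: Dict[str, Any]) -> None:
        """Store the attribute values measured at time step t."""
        for name, value in measured.items():
            self.attributes.setdefault(name, {})[t] = value

    def latest(self, name: str) -> Any:
        """Return the most recent value of an attribute."""
        history = self.attributes[name]
        return history[max(history)]


class AnchorStore:
    """Hypothetical container offering the two bottom-up functionalities."""

    def __init__(self) -> None:
        self.anchors: Dict[str, Anchor] = {}
        self._counters: Dict[str, int] = {}

    def acquire(self, t: int, category: str, measured: Dict[str, Any]) -> Anchor:
        """Create a new anchor for a candidate that matched no existing anchor."""
        count = self._counters.get(category, 0) + 1
        self._counters[category] = count
        anchor = Anchor(symbol=f"{category}-{count}")  # assumed naming scheme
        anchor.update(t, measured)
        self.anchors[anchor.symbol] = anchor
        return anchor

    def reacquire(self, symbol: str, t: int, measured: Dict[str, Any]) -> Anchor:
        """Extend an existing, matching anchor with the percepts of time step t."""
        anchor = self.anchors[symbol]
        anchor.update(t, measured)
        return anchor
```

In this sketch the decision between the two calls is left to the caller; in an anchoring system it would be made by the matching function.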
Based on the functionalities above, it is evident that an anchoring matching function is essential to decide whether a candidate object matches an existing anchor or not. Different approaches in perceptual anchoring vary, in particular, in how the matching function is specified. For example, in persson2019semantic, we have shown that the anchoring matching function can be approximated by a learned model trained with manually labeled samples collected through an annotation interface (through which the human user can intervene in the anchoring process and provide feedback about which objects in the scene match previously existing anchors).
In another recently published anchoring approach, ruiz-sarmiento.et.al-2017 focus on spatial features and distinguish unary object features, e.g., the position of an object, from pairwise object features, e.g., the distance between two objects, in order to build a graph-based world model that can be exploited by a probabilistic graphical model (koller-2009) in order to leverage contextual relations between objects to support object recognition. In parallel with our previous work on anchoring, gunther.et.al-2018 have further exploited this graph-based model on spatial features and propose, in addition, to learn the matching function through the use of a Support Vector Machine (trained on samples of object pairs manually labeled as "same or different object"), in order to approximate the similarity between two objects. The assignment of candidate objects to existing anchors is, subsequently, calculated using prior similarity values and a Hungarian method (kuhn-1955). However, in contrast to gunther.et.al-2018, the matching function introduced in persson2019semantic does not only rely upon spatial features (or attributes), but can also take into consideration visual features (such as color features), as well as semantic object categories, in order to approximate the anchoring matching problem.
3.2 Dynamic Distributional Clauses
Dynamic Distributional Clauses (DDC) (nitti2016learning) provide a framework for probabilistic programming that extends the logic programming language Prolog (sterling1994art) to the probabilistic domain. A comprehensive treatise on the field of probabilistic logic programming can be found in de2015probabilistic and riguzzi2018foundations. DDC is capable of representing discrete and continuous random variables and of performing probabilistic inference. Moreover, DDC explicitly models time, which makes it well suited to modeling dynamic systems. The underpinning concepts of DDC are related to ideas presented in milch20071 but embedded in logic programming. Related ideas of combining discrete time steps, Bayesian learning and logic programming are also presented in angelopoulos2008bayesian; angelopoulos2017distributional.
An atom $p(t_1, \dots, t_n)$ consists of a predicate $p/n$ of arity $n$ and terms $t_1, \dots, t_n$. A term is either a constant (written in lowercase), a variable (in uppercase), or a function symbol. A literal is an atom or its negation. Atoms which are negated are called negative atoms and atoms which are not negated are called positive atoms.
A distributional clause is of the form $h \sim \mathcal{D} \leftarrow b_1, \dots, b_n$, where $\sim$ is a predicate in infix notation and the $b_i$'s are literals, i.e., atoms or their negation. $h$ is a term representing a random variable and $\mathcal{D}$ tells us how the random variable is distributed. The meaning of such a clause is that each grounded instance of a clause $(h \sim \mathcal{D} \leftarrow b_1, \dots, b_n)\theta$ defines a random variable $h\theta$ that is distributed according to $\mathcal{D}\theta$ whenever all literals $b_i\theta$ are true. A grounding substitution $\theta = \{V_1/t_1, \dots, V_n/t_n\}$ is a transformation that simultaneously substitutes all logical variables $V_i$ in a distributional clause with non-variable terms $t_i$. DDC can be viewed as a language that defines conditional probabilities for discrete and continuous random variables: $p(h\theta \mid b_1\theta, \dots, b_n\theta)$.
Consider the following DDC program:
The first rule states that the number of objects n in the world is distributed according to a Poisson distribution. The second rule states that the positions of the n objects, which are identified by a number P between 1 and n, are distributed according to a uniform distribution between 0 and 100. Here, the notation n~=N means that the logical variable N takes the value of our random variable n. The label after the colon denotes the point in time. So, pos(P):0 denotes the position of object P at time 0. Next, the program describes how the position evolves over time: at each time step the object moves three units of length, giving it a velocity of three units of length per time step. Finally, the example program defines the left predicate, through which a relationship between objects is introduced at each time step. DDC then allows for querying this program through its built-in query predicate:
Probability in the second argument unifies with the probability of object 1 being to the left of object 2 and having a positive coordinate position.
A DDC program is a set of distributional and/or definite clauses (as in Prolog). A DDC program defines a probability distribution over possible worlds .
One possible world of the uncountably many possible worlds encoded by the program in Example 1. The sampled number n determines how many objects exist, for which the ensuing distributional clauses then generate a position and the left/2 relationship:
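To make the notion of a possible world concrete, the generative story described for Example 1 can be imitated in plain Python. This is a sketch, not DDC code: the Poisson mean is passed in as a parameter, the motion is modeled as a noise-free shift of three units per time step, and the left relation is assumed to hold whenever one position is strictly smaller than another. All function names are hypothetical.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw from a Poisson distribution with Knuth's algorithm (suitable for small means)."""
    threshold = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_world(mean_objects, steps, rng):
    """Sample one possible world: object count, positions and left/2 facts per time step."""
    n = sample_poisson(mean_objects, rng)
    # Positions at time 0 are uniform between 0 and 100; objects are numbered 1..n.
    pos = {obj: rng.uniform(0.0, 100.0) for obj in range(1, n + 1)}
    world = []
    for t in range(steps + 1):
        # Assumed semantics: a is left of b if its position is strictly smaller.
        left = {(a, b) for a in pos for b in pos if a != b and pos[a] < pos[b]}
        world.append({"t": t, "pos": dict(pos), "left": left})
        # Transition: every object moves three units of length per time step.
        pos = {obj: x + 3.0 for obj, x in pos.items()}
    return n, world


if __name__ == "__main__":
    n, world = sample_world(mean_objects=4.0, steps=2, rng=random.Random(0))
    print(n, world[0]["pos"])
```

Each call with a different random state yields a different world, which is the sense in which the program defines a distribution over possible worlds.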
When performing inference within a specific time step, DDC deploys importance sampling combined with backward reasoning (SLD-resolution), likelihood weighting and Rao-Blackwellization (Nitti:2016:PLP:2949339.2949375). Inferring probabilities in the next time given the previous time step is achieved through particle filtering (nitti2013particle). If the DDC program does not contain any predicates labelled with a time index the program represents a Distributional Clauses (DC) (gutmann2011magic) program, where filtering over time steps is not necessary.
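The filtering step across time can be sketched with a generic bootstrap particle filter over a one-dimensional position. This is a textbook illustration under assumed Gaussian motion and observation noise, not the inference algorithm implemented in DDC; the function name and all parameter values are hypothetical.

```python
import math
import random


def particle_filter_step(particles, observation, rng,
                         step=3.0, motion_std=0.5, obs_std=1.0):
    """One predict-weight-resample cycle of a bootstrap particle filter."""
    # 1) Predict: propagate every particle through the (assumed) motion model.
    predicted = [x + step + rng.gauss(0.0, motion_std) for x in particles]
    # 2) Weight: score each particle by the (assumed) Gaussian observation likelihood.
    weights = [math.exp(-0.5 * ((observation - x) / obs_std) ** 2) for x in predicted]
    if sum(weights) == 0.0:
        # All likelihoods underflowed; fall back to the unweighted prediction.
        return predicted
    # 3) Resample: draw a new particle set proportionally to the weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


if __name__ == "__main__":
    rng = random.Random(1)
    particles = [rng.uniform(0.0, 100.0) for _ in range(2000)]
    true_pos = 40.0
    for _ in range(5):
        true_pos += 3.0
        particles = particle_filter_step(particles, true_pos, rng)
    print(sum(particles) / len(particles), true_pos)
```

Because the belief is carried by a set of samples rather than by a mean and a covariance, the same mechanism can represent several separated modes at once.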
Object occlusion is a challenging problem in visual tracking and a plethora of different approaches exist that tackle different kinds of occlusions; a thorough review of the field is given in meshgi2015state. The authors use three different attributes of an occlusion to categorize it: the extent (partial or full occlusion), the duration (short or long), and the complexity (simple or complex). An occlusion of an object is deemed complex if, during the occlusion, the occluded object considerably changes one of its key characteristics (e.g., position, color, size); an occlusion is simple if it is not complex. Another classification separates occlusions into dynamic occlusions, where objects in the foreground occlude each other, and scene occlusions, where objects in the background model are located closer to the camera and occlude target objects by being moved between the camera and the target objects. Further categories exist; we refer the reader to vezzani2011probabilistic; meshgi2015state.
meshgi2015state report that the majority of research on occlusions in visual tracking has been done on partial, temporal and simple occlusions. Furthermore, they report that none of the approaches examined in the comparative studies of smeulders2013visual and wu2013online, handles either partial complex occlusions or full long complex occlusions. To the best of our knowledge, our previous paper on combining bottom-up anchoring and probabilistic reasoning, constitutes the first tracker that is capable of handling occlusions that are full, long and complex (persson2019semantic). This was achieved by declaring a theory of occlusion (ToO) expressed as dynamic distributional clauses.
An excerpt from the set of clauses that constitute the ToO. The example clause describes the conditions under which an object is considered a potential Occluder of another object Occluded.
Out of all the potential Occluders, the actual occluding object is then sampled uniformly:
Declaring a theory of occlusion and coupling it to the anchoring system allows the anchoring system to perform occlusion reasoning and to track objects not by directly observing them but by reasoning about relationships that occluded objects have entered with visible (anchored) objects. The idea of declaring a theory of occlusion first appeared in nitti2013particle, where, however, the data association problem was assumed to be solved by using AR tags.
As the anchoring system was not able to handle probabilistic states in our previous work, the theory of occlusion had to describe unimodal probability distributions. In this paper we repair this deficiency (cf. Section 4.2). Moreover, the theory of occlusion had to be hand-coded (also the case for nitti2013particle). We replace the hand-coded theory of occlusion by a learned one (cf. Section 5).
Considering our previous work from the anchoring perspective, our approach is most related to the techniques proposed in elfring.et.al-2013, who introduced the idea of probabilistic multiple hypothesis anchoring in order to match and maintain probabilistic tracks of anchored objects, and thus, maintain an adaptable semantic world model. From the perspective of how occlusions are handled, elfring.et.al-2013's work and ours differ, however, substantially. elfring.et.al-2013 handle occlusions that are due to scene occlusion. Moreover, the occlusions are handled by means of a multiple hypothesis tracker, which is suited for short occlusions rather than long occlusions. The limitations of multiple hypothesis tracking for world modeling, and consequently also for handling object occlusions in anchoring scenarios (as in elfring.et.al-2013), have likewise been pointed out in a publication by wong.et.al-2015. wong.et.al-2015 reported instead the use of a clustering-based data association approach (as opposed to a tracking-based approach), in order to aggregate a consistent semantic world model from multiple viewpoints, and hence, compensate for partial occlusions from a single viewpoint perspective of the scene.
4 Anchoring of Objects in Multi-Modal States
In this section, we present a probabilistic anchoring framework based on our previous work on conjoining probabilistic reasoning and object anchoring (persson2019semantic). An overview of our proposed framework, which is implemented utilizing the libraries and communication protocols available in the Robot Operating System (ROS), can be seen in Figure 2 (the code can be found online at https://bitbucket.org/reground/anchoring). However, our prior anchoring system, seen in Figure 2 (part 2), was unable to handle probabilistic states of objects. While the probabilistic reasoning module, seen in Figure 2 (part 3), was able to model the position of an object as a probability distribution over possible positions, the anchoring system only kept track of a single deterministic position: the expected position of an object. Therefore, we extend the anchoring notation towards a probabilistic anchoring approach, in order to enable the anchoring system to handle multi-modal probability distributions.
4.1 Requirements for Anchoring and Semantic Object Tracking
Before presenting our proposed probabilistic anchoring approach, we first introduce the necessary requirements and assumptions (which partly originate in our previous work, persson2019semantic):
We assume that unknown anchor representations are supplied by a black-box perceptual processing pipeline, as exemplified in Figure 2 (part 1). They consist of extracted attribute measurements and corresponding grounded predicate symbols. We further assume that for each perceptual representation of an object, we have the following attribute measurements: 1) a color attribute, 2) a position attribute, and 3) a size attribute.
In this paper we use the combined Depth Seeding Network (DSN) and Region Refinement Network (RRN), as presented by xie2019, for the purpose of segmenting arbitrary object instances in tabletop scenarios. This two-stage approach leverages both RGB and depth data (given by a Kinect V2 RGB-D sensor), in order to first segment rough initial object masks (based on depth data), followed by a second refinement stage of these object masks (based on RGB data). The resulting output for each segmented object is then both a spatial percept and a visual percept. For each segmented spatial percept, and with the use of the Point Cloud Library (PCL), a position attribute is measured as the geometrical center and a size attribute is measured as the geometrical bounding box. Similarly, using the Open Computer Vision Library (OpenCV), a color attribute is measured as the discretized color histogram (in HSV color-space) for each segmented visual percept, as depicted in Figure 4.
In order to semantically categorize objects, we assume a Convolutional Neural Network (CNN), such as the GoogLeNet model (szegedy.et.al-2015), is available, cf. persson.et.al-2017. In the context of anchoring, the inputs for this model are segmented visual percepts, while the resulting object categories, denoted by a category predicate, are given together with the predicted probabilities (cf. Section 3.1). An example of segmented objects together with the top-3 object categories, given by an integrated GoogLeNet model, is illustrated in Figure 4. In addition, this integrated model is also used to enhance the traditional acquire functionality such that a unique identifier is generated based on the object category symbol. For example, if the anchoring system detects an object it has not seen before and classifies it as a cup, a corresponding unique identifier cup-4 could be generated (where the 4 means that this is the fourth distinct instance of a cup object perceived by the system).
We require the presence of a probabilistic inference system coupled to the anchoring system, as illustrated in Figure 2 (part 3). The anchoring system is responsible for maintaining objects perceived by the sensory input data and for maintaining the observable part of the world model. Maintained anchored object representations are then treated as observations in the inference system, which uses relational object tracking to infer the state of occluded objects through their relations with perceived objects in the world. This inferred belief of the world is then sent back to the anchoring system, where the state of occluded objects is updated. The feedback-loop between the anchoring system and the probabilistic reasoner results in an additional anchoring functionality (persson2019semantic):
Track – extends the definition of an anchor from time to time . This functionality is directly responding to the state of the probabilistic object tracker, which ensures that the percepts pointed to by the anchor are the adequate perceptual representation of the object, even though the object is currently not perceived.
Even though the mapping between measured attribute values and corresponding predicate symbols is an essential facet of anchoring, we will not cover the predicate grounding in further detail in this paper. However, for completeness, we refer to Figure 4 and exemplify that the predicate grounding relation of a color attribute can, intuitively, be expressed as the encoded correspondence between a specific peak in the color histogram and a certain predicate symbol (e.g., the symbol black for the mug object). Likewise, a broader future ambition of this work is to establish a practical framework through which the spatial relationships between objects are encoded and expressed using symbolic values, e.g., object A is underneath object B.
4.2 Probabilistic Anchoring System
The entry point for the anchoring system, seen in Figure 2–2⃝., is a learned matching function. This function assumes a bottom-up approach to perceptual anchoring, described in loutfi.et.al-2005, where the system constantly receives candidate anchors and invokes a number of attribute specific matching similarity formulas (i.e., one matching formula for each measured attribute). More specifically, a set of attributes of an unknown candidate anchor (given at current time ) is compared against the set of attributes of an existing anchor (defined at time ) through attribute specific similarity formulas. For instance, the similarity between the positions attributes of an unknown candidate anchor, and the last updated position of an existing anchor, is calculated according to the -norm (in space), which is further mapped to a normalized similarity value (blodow.et.al-2010):
Hence, the similarity between two position attributes is given in the interval [0, 1], where a value of 1 is equivalent to perfect correspondence. Likewise, the similarity between two color attributes is calculated by the color correlation, while the similarity between size attributes is calculated according to the generalized Jaccard similarity (for further details regarding similarity formulas, we refer to our previous work (persson2019semantic)). The similarities between the attributes of a known anchor and an unknown candidate anchor are then fed to the learned matching function, which determines whether the unknown anchor is acquired as a new anchor or re-acquired as an existing anchor.
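As an illustration of attribute-specific similarity formulas with values in [0, 1], the sketch below maps a Euclidean distance to a similarity through a negative exponential and computes a generalized Jaccard similarity for size vectors. The exponential mapping and the choice of the Euclidean norm are assumptions made here for illustration; the exact formulas used by the anchoring system are those of persson2019semantic. The function names are hypothetical.

```python
import math


def position_similarity(p, q):
    """Map the Euclidean distance between two positions into (0, 1].

    Assumed mapping: exp(-distance), so identical positions give 1.0.
    """
    return math.exp(-math.dist(p, q))


def generalized_jaccard(a, b):
    """Generalized Jaccard similarity of two non-negative vectors,
    e.g. the side lengths of two bounding boxes."""
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    # Two all-zero vectors are treated as identical.
    return numerator / denominator if denominator > 0 else 1.0


if __name__ == "__main__":
    print(position_similarity((0.0, 0.0, 0.0), (0.1, 0.0, 0.0)))
    print(generalized_jaccard((0.10, 0.08, 0.12), (0.10, 0.10, 0.10)))
```

A vector of such per-attribute similarities is what would be handed to a learned matching function.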
In our prior work on anchoring, the attribute values have always been assumed to be deterministic within a single time step. This assumption keeps the anchoring system de facto deterministic even though it is coupled to a probabilistic reasoning module. We extend the anchoring notation with two distinct specifications of (volatile) attributes:
An attribute is deterministic at time if it takes a single value from the domain .
An attribute is probabilistic at time if it is distributed according to a probability distribution over the domain at time step .
Having a probabilistic attribute value (e.g., a probabilistic position attribute in Equation 1) means that the similarity calculated with the probabilistic attribute value (e.g., the position similarity value) will also be probabilistic. Next, in order to use an anchor matching function together with probabilistic similarity values, two extensions are possible: 1) extend the anchor matching function to accept random variables (i.e., probabilistic similarity values), or 2) retrieve a point estimate of the random variable.
We chose the second option as this allows us to reuse the anchor matching function learned in persson2019semantic without the additional expense of collecting data and re-training the anchor matching function. The algorithm to produce the set of matching similarity values that are fed to the anchor matching function is given in Algorithm 1, where lines 6-7 are the extension proposed in this work.
The function in Algorithm 1 (line 7) is attribute specific (indicated by its subscript), i.e., we can choose a different point estimation function for color attributes than for position attributes. An obvious attribute upon which reasoning can be done is the position attribute, for example, in the case of possible occlusions. In other words, we would like to perform probabilistic anchoring while taking into account the probability distribution of an anchor's position. A reasonable goal is then to match an unknown candidate anchor with the most likely anchor, i.e., with the anchor whose position attribute value is located at the highest mode of the probability distribution of the position attribute values. This is achieved by replacing line 7 in Algorithm 1 with:
Here, the maximum ranges over the set of positions situated at the modes of the probability distribution. In Equation 3 we take the maximum because the co-domain of the position similarity value is [0, 1], where 1 reflects perfect correspondence (cf. Equation 1).
In persson2019semantic, we approximated the probabilistic state of the world in the inference system (cf. Figure 2, part 3) by particles, which are updated by means of particle filtering. The precise information that is passed from the inference system to the anchoring system is a list of particles that approximate a (possibly) multi-modal belief of the world. More specifically, an anchor is updated according to the particles of possible states of a corresponding object, maintained in the inference system, such that possible positions are added to the volatile position attributes. In practice we assume that samples are only drawn around the modes of the probability distribution, which means that we can replace line 7 of Algorithm 1 with:
Where is a sampled position and ranges from to the number of samples .
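The sample-based point estimate can be sketched as taking the maximum similarity between a candidate position and any of the sampled positions of an anchor. The sketch reuses an assumed exp(-distance) similarity; the function names and the toy data are hypothetical and only illustrate why a mean-based estimate can fail for a bi-modal belief.

```python
import math


def position_similarity(p, q):
    """Assumed similarity in (0, 1]: exp of the negative Euclidean distance."""
    return math.exp(-math.dist(p, q))


def probabilistic_position_similarity(candidate, samples):
    """Point estimate of a probabilistic similarity: the best match over all samples."""
    if not samples:
        raise ValueError("at least one sampled position is required")
    return max(position_similarity(candidate, s) for s in samples)


if __name__ == "__main__":
    # Bi-modal belief: the occluded object sits at one of two possible locations.
    samples = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    candidate = (1.0, 0.0, 0.0)
    mean = tuple(sum(c) / len(samples) for c in zip(*samples))
    print(probabilistic_position_similarity(candidate, samples))
    print(position_similarity(candidate, mean))
```

In the toy data the candidate coincides with one mode, so the maximum over samples reports a perfect match, whereas comparing against the mean position, which lies between the modes, reports a lower similarity.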
Performing probabilistic inference in the coordinate space is a choice made in the design of the probabilistic anchoring system. Instead, the probabilistic tracking could also be done in the HSV color space, for instance. In this case, the similarity measure used in Algorithm 1 would have to be adapted accordingly. It is also conceivable to combine the tracking in coordinate space and color space. This introduces, however, the complication of finding a similarity measure that works on the coordinate space and the color space at the same time. A solution to this would be to, yet again, learn this similarity function from data (persson2019semantic).
5 Learning Dynamic Distributional Clauses
While several approaches exist in the SRL literature that learn probabilistic relational models, most of them focus on parameter estimation (sato1995statistical; friedman1999learning; taskar2002discriminative; neville2007relational) and structure learning has been restricted to discrete data. Notable exceptions include the recently proposed hybrid relational formalism by ravkic2015learning, which learns relational models in a discrete-continuous domain but has not been applied to dynamics or robotics, and the related approach of nitti2016learning, where a relational tree learner DDC-TL learns both the structure and the parameters of distributional clauses. DDC-TL has been evaluated on learning action models (pre- and post-conditions) in a robotics setting from before- and after-states of executing the actions. However, there were several limitations of the approach. It simplified perception by resorting to AR tags for identifying the objects, it did not consider occlusion, and it could not deal with uncertainty or noise in the observations.
A more general approach to learning distributional clauses, extended with statistical models, is proposed in kumar2020learning (https://github.com/niteshroyal/DreaML). Such a statistical model relates continuous variables in the body of a distributional clause to parameters of the distribution in the head of the clause. The approach simultaneously learns the structure and parameters of (non-dynamic) distributional clauses, and estimates the parameters of the statistical model in clauses. A DC program consisting of multiple distributional clauses is capable of expressing intricate probability distributions over discrete and continuous random variables. A further shortcoming of DDC-TL (also tackled by kumar2020learning) is its inability to learn in the presence of background knowledge, that is, additional (symbolic) probabilistic information about objects in the world and relations (such as spatial relations) among the objects that the learning algorithm should take into consideration.
However, until now, the approach presented in kumar2020learning has only been applied to the problem of autocompletion of relational databases by learning a (non-dynamic) DC program. We now demonstrate with an example how this general approach can also be applied to learning dynamic distributional clauses in a robotics setting. A key novelty in the context of perceptual anchoring is that we learn a DDC program that allows us to reason about occlusions.
Consider again a scenario where objects might get fully occluded by other objects. We would now like to learn the ToO that describes whether an object is occluded or not given multiple observations of the before and after state. In DDC we represent observations through facts as follows:
For the sake of clarity, we have considered only one-dimensional positions in this example.
Given the data in the form of dynamic distributional clauses, we are now interested in learning the ToO instead of relying on a hand-coded one, as in Example 3. An excerpt from the set of clauses that constitute a learned ToO is given below. As in Example 3, the clauses describe the circumstances under which an object (Occluded) is potentially occluded by another object (Occluder).
Note that, in the second-to-last line of the last clause above, the arbitrary threshold on the Distance is superseded by a learned statistical model, in this case a logistic regression, which maps the input parameter Distance to the probability P1.
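For readers unfamiliar with the model class, the following sketch shows the general shape of a one-dimensional logistic regression from Distance to P1. The function name `occlusion_probability` and the coefficients `W0` and `W1` are placeholders chosen for illustration; they are not the values learned in the paper.

```python
import math

def occlusion_probability(distance, w0, w1):
    """Logistic regression: P1 = 1 / (1 + exp(-(w0 + w1 * distance)))."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * distance)))

# Placeholder coefficients, NOT the learned parameters from the paper.
# A negative slope makes occlusion less likely as the distance grows.
W0, W1 = 4.0, -20.0

for d in (0.05, 0.2, 0.5):
    print(f"distance={d:.2f}  P1={occlusion_probability(d, W0, W1):.3f}")
```

Unlike a hard threshold, the probability degrades smoothly with distance, so nearby candidate occluders are not excluded by an arbitrary cut-off.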
Replacing the hand-coded occluder rule with the learned one in the theory of occlusion allows us to track occluded objects with a partially learned model of the world.
In order to learn dynamic distributional clauses, we first map the predicates with subscripts that refer to the current time step t and the next time step t+1 to standard predicates, which gives us an input DC program. For instance, we map pos(o1_exp1):t to pos_t(o1_exp1), and occluder(o1_exp1,o2_exp2):t+1 to occluder_t1(o1_exp1,o2_exp2). The method introduced in kumar2020learning can now be applied to learn distributional clauses for the target predicate occluder_t1(o1_exp1,o2_exp2) from the input DC program.
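The renaming step can be sketched as a small string rewrite. This is an illustrative helper, not part of the learner: the function `to_standard_predicate` and its regular expression are assumptions, and only the two time subscripts mentioned in the text (t and t+1) are handled.

```python
import re

# Matches e.g. "pos(o1_exp1):t" or "occluder(o1_exp1,o2_exp2):t+1".
_TIMED_ATOM = re.compile(r"^(\w+)\((.*)\):t(\+1)?$")

def to_standard_predicate(atom):
    """Rewrite a time-subscripted atom into a plain (non-dynamic) DC atom."""
    match = _TIMED_ATOM.match(atom)
    if match is None:
        raise ValueError(f"not a time-subscripted atom: {atom!r}")
    name, args, is_next = match.groups()
    suffix = "_t1" if is_next else "_t"  # ":t+1" -> "_t1", ":t" -> "_t"
    return f"{name}{suffix}({args})"

print(to_standard_predicate("pos(o1_exp1):t"))
print(to_standard_predicate("occluder(o1_exp1,o2_exp2):t+1"))
```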
Clauses for the target predicate are learned by inducing a distributional logic tree. An example of such a tree is shown in Figure 5. The key idea is that the set of clauses for the same target predicate is represented by a distributional logic tree, which satisfies the mutual exclusiveness property of distributional clauses. This property states that if there are two distributional clauses defining the same random variable, their bodies must be mutually exclusive. Internal nodes of the tree correspond to atoms in the body of learned clauses. A leaf node corresponds to a distribution in the head and to a statistical model in the body of a learned clause. A path beginning at the root node and proceeding to a leaf node in the tree corresponds to a clause. Parameters of the distribution and the statistical model of the clause are estimated by maximizing the expectation of the log-likelihood of the target in partial possible worlds. The worlds are obtained by proving all possible groundings of the clause in the input DC program. The structure of the induced tree defines the structure of the learned clauses. The approach requires declarative bias to restrict the search space while inducing the tree.
In summary, the input to the learning algorithm is a DC program consisting of:
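The correspondence between root-to-leaf paths and clauses can be made concrete with a short sketch. The classes `Node` and `Leaf`, the function `tree_to_clauses`, and the example atoms and distributions are hypothetical; the printed strings only imitate DC syntax and are not output of the actual learner.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    head: str  # distribution (with its statistical model) for the target

@dataclass
class Node:
    test: str                  # atom tested at this internal node
    yes: Union["Node", Leaf]   # subtree taken when the test succeeds
    no: Union["Node", Leaf]    # subtree taken when the test fails

def tree_to_clauses(tree, target, body=()):
    """Turn every root-to-leaf path into one clause for the target."""
    if isinstance(tree, Leaf):
        rhs = ", ".join(body) if body else "true"
        return [f"{target} ~ {tree.head} := {rhs}."]
    return (tree_to_clauses(tree.yes, target, body + (tree.test,))
            + tree_to_clauses(tree.no, target, body + (f"\\+{tree.test}",)))

# Toy tree with made-up tests and distributions.
tree = Node(
    "observed_t1(Occluded)",
    Leaf("finite([0.0:true,1.0:false])"),
    Node("near_t(Occluded,Occluder)",
         Leaf("finite([0.9:true,0.1:false])"),
         Leaf("finite([0.1:true,0.9:false])")),
)

clauses = tree_to_clauses(tree, "occluder_t1(Occluded,Occluder)")
for clause in clauses:
    print(clause)
```

Because any two paths diverge at some internal node, one taking the positive and the other the negated test, the bodies of the resulting clauses are mutually exclusive by construction.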
background knowledge, in the form of DC clauses;
observations, in the form of DC clauses — these constitute the training data;
the declarative bias, which is necessary to specify the hypothesis space of the DC program (ade1995declarative);
the target predicates for which clauses should be learned.
The output is:
a set of DC clauses represented as a tree for each target predicate specified in the input.
Once the clauses are learned, predicates are mapped back to predicates with subscripts to obtain dynamic distributional clauses. For instance, occluder_t1(Occluded,Occluder) in the learned clauses is mapped back to occluder(Occluded,Occluder):t+1.
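The reverse mapping is equally mechanical. The sketch below assumes, as a naming convention not stated beyond the examples in the text, that every predicate ending in `_t` or `_t1` is a renamed time-subscripted predicate; the helper `to_dynamic_predicate` is hypothetical.

```python
import re

# Matches e.g. "occluder_t1(Occluded,Occluder)" or "pos_t(Occluded)".
_FLAT_ATOM = re.compile(r"^(\w+?)_(t1|t)\((.*)\)$")

def to_dynamic_predicate(atom):
    """Restore the time subscript of an atom from a learned DC clause."""
    match = _FLAT_ATOM.match(atom)
    if match is None:
        raise ValueError(f"no time suffix found in atom: {atom!r}")
    name, step, args = match.groups()
    time = "t+1" if step == "t1" else "t"  # "_t1" -> ":t+1", "_t" -> ":t"
    return f"{name}({args}):{time}"

print(to_dynamic_predicate("occluder_t1(Occluded,Occluder)"))
```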
The data used for learning the theory of occlusion consists of training points of before-after states of two kinds. The first kind are pairs describing a transition of an object from being observed to being occluded. The second kind are pairs describing an object that is occluded in the current state as well as in the next state. Two examples of raw data points of the first kind can be seen in Figure 6. The processed data that was fed to the distributional clauses learner is available online (https://bitbucket.org/reground/anchoring/downloads/), as are models with the learned theory of occlusion (https://bitbucket.org/reground/anchoring).
A probabilistic anchoring system that is coupled to an inference system (cf. Section 4.2) comprises several interacting components. This turns the evaluation of such a combined framework, with many integrated systems, into a challenging task. We, therefore, evaluate the integrated framework as a whole on representative scenarios (videos of which are online at https://vimeo.com/manage/folders/1365568) that demonstrate our proposed extensions to perceptual anchoring. In Section 6.1, we demonstrate how the extended anchoring system can handle probabilistic multi-modal states (described in Section 4). In Sections 6.2 and 6.3, we show that semantic relational object tracking can be performed with learned probabilistic logic rules (in the form of a DDC program) instead of handcrafted ones.
6.1 Multi-Modal Occlusions
We present the evaluation in the form of screenshots captured during the execution of a scenario where we obscure the stream of sensor data. We start out with three larger objects (two mug objects and one box object), and one smaller ball object. During the occlusion phase, seen in Figure 7–①, the RGB-D sensor is covered by a human hand and the smaller ball is hidden underneath one of the larger objects. In Figure 7–①, it should also be noted that the anchoring system preserves the latest update of the objects, which is here illustrated by the outlined contour of each object. Once the sensory input stream is uncovered, and there is no longer any visual perceptual input of the ball object, the system can only speculate about the whereabouts of the missing object. Hence, the belief of the ball’s position becomes a multi-modal probability distribution, from which we draw samples, as seen in Figure 7–②. At this point, we are, however, able to track the smaller ball through its probabilistic relationships with the other larger objects. During all the movements of the larger objects, the probabilistic inference system manages to track the modes of the probability distribution of the position of the smaller ball. The probability distribution for the position of the smaller ball (approximated by samples) is continuously fed back to the anchoring system. Consequently, once the hidden ball is revealed and reappears in the scene, as seen in Figures 7–③ and 7–④, the ball is correctly re-acquired as the initial ball-1 object. This would not have been possible with a non-probabilistic anchoring approach.
6.2 Uni-Modal Occlusions with Learned Rules
The conceptually simplest ToO is one that describes the occlusion of an object by another object. Using the method described in Section 5, we learned such a ToO, which we demonstrate in Figure 8. Two scenarios are shown. In the first, in the upper row, a can gets occluded by a box, as shown in the second screenshot. The can is subsequently tracked through its relation with the observed box and successfully re-anchored as can-1 once it is revealed. Note that in the second screenshot, the mug is also briefly believed to be hidden under the box, shown through the black dots, as the mug is temporarily obscured behind the box and not observed by the vision system. However, once the mug is observed again, the black dots disappear.
In the second scenario, we occlude one of two ball objects with a box and track the ball again through its relation with the box. Note that some of the probability mass accounts for the possibility for the occluded ball to be occluded by the mug. This is due to the fact that the learned rule is probabilistic.
In both scenarios, we included background knowledge that specifies that a ball cannot be the occluder of an object (it does not afford to be the occluder). This is also why we see a probability mass of the occluded ball at the mug’s location and not at the observed ball’s location in the second scenario.
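The effect of this background knowledge can be mimicked in plain Python. The sketch is only an analogy to what the DDC program does during inference: the affordance table `CAN_OCCLUDE`, the candidate scores, and the explicit renormalization in `occluder_distribution` are assumptions made for illustration.

```python
# Hypothetical affordance table; only the ball restriction comes from the text.
CAN_OCCLUDE = {"box": True, "mug": True, "ball": False}

def occluder_distribution(scores, categories):
    """Drop candidates that cannot occlude, then renormalize the rest."""
    allowed = {name: score for name, score in scores.items()
               if CAN_OCCLUDE.get(categories[name], False)}
    total = sum(allowed.values())
    if total == 0:
        return {}
    return {name: score / total for name, score in allowed.items()}

categories = {"box-1": "box", "mug-1": "mug", "ball-2": "ball"}
# Made-up scores for each candidate occluder of the hidden ball.
scores = {"box-1": 0.8, "mug-1": 0.1, "ball-2": 0.6}

dist = occluder_distribution(scores, categories)
print(dist)
```

Even though the observed ball has a high raw score in this toy example, it receives no probability mass, which mirrors the behavior described for the second scenario.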
6.3 Transitive Occlusions with Learned Rules
Learning (probabilistic) rules, instead of a black-box function, has the advantage that a set of rules can easily be extended with further knowledge. For example, if we would like the ToO to be recursive, i.e., objects can be occluded by objects that are themselves occluded, we simply have to add the following rule to the DDC program describing the theory of occlusion:
Extending the ToO from Section 6.2 with the above rule enables the anchoring system to handle recursive occlusions. We demonstrate such a scenario in Figure 9. Initially, we start this scenario with a ball, a mug and a box object (which in the beginning is misclassified as a block object, cf. Figure 4). In the first case of occlusion, seen in Figure 9–①, we have the same type of uni-modal occlusion as described in Section 6.2, where the mug occludes the ball and, subsequently, triggers the learned relational transition (where plotted yellow dots represent samples drawn from the probability distribution of the occluded ball object). In the second, recursive case of occlusion, seen in Figure 9–②, we proceed by also occluding the mug with the box. The above rule handles this transitive occlusion, which is triggered when the ball is still hidden underneath the mug and the mug is occluded by the box. This is illustrated here by both yellow and black plotted dots that represent samples drawn from the probability distributions of the occluded mug and the transitively occluded ball object, respectively. Consequently, once the box is moved, both the mug and the ball are tracked through the transitive relation with the occluding box. Conversely, it can be seen in Figure 9–③ that once the mug object is revealed, it is correctly re-acquired as the same mug-1 object, while the relation between the mug and the occluded ball object is still preserved. Finally, as the ball object is revealed in Figure 9–④, it can also be seen that the object is, likewise, correctly re-acquired as the same ball-1 object.
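The idea of following a chain of occluders can be sketched deterministically in Python. This is a simplification for illustration only: in the paper the relation is probabilistic and expressed as a DDC rule, whereas the dictionary `occluded_by` and the helpers `visible_occluder` and `believed_position` below are hypothetical and track a single hypothesis.

```python
def visible_occluder(obj, occluded_by):
    """Follow the occluded-by chain until an observed object is reached."""
    seen = set()
    while obj in occluded_by:
        if obj in seen:
            raise ValueError("cyclic occlusion chain")
        seen.add(obj)
        obj = occluded_by[obj]
    return obj

def believed_position(obj, occluded_by, observed_positions):
    """An occluded object is assumed to move with its visible occluder."""
    return observed_positions[visible_occluder(obj, occluded_by)]

# The ball is under the mug, and the mug is in turn hidden by the box.
occluded_by = {"ball-1": "mug-1", "mug-1": "box-1"}
observed_positions = {"box-1": (0.4, 0.1)}

# Moving the box moves the believed positions of both hidden objects.
observed_positions["box-1"] = (0.7, 0.3)
print(believed_position("ball-1", occluded_by, observed_positions))
print(believed_position("mug-1", occluded_by, observed_positions))
```

Revealing the mug corresponds to removing the `"mug-1"` entry from `occluded_by`; the ball then remains attached to the mug, as in Figure 9–③.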
7 Conclusions & Future Work
We have presented a two-fold extension to our previous work on semantic world modelling (persson2019semantic), where we proposed an approach to couple an anchoring system to an inference system. Firstly, we extended the notions of perceptual anchoring towards the probabilistic setting by means of probabilistic logic programming. This allowed us to maintain a multi-modal probability distribution of the positions of objects in the anchoring system and to use it for matching and maintaining objects at the perceptual level; thus, we introduce probabilistic anchoring of objects either directly perceived by the sensory input data or logically inferred through probabilistic reasoning. We illustrated the benefit of this approach with the scenario in Section 6.1, which the anchoring system was able to resolve correctly only due to its ability to maintain a multi-modal probability distribution. This also extends an earlier approach to relational object tracking (nitti2014relational), where the symbol-grounding problem was solved by the use of AR tags.
Secondly, we have applied methods from statistical relational learning to the field of anchoring. This approach allowed us to learn, instead of handcraft, the rules needed in the reasoning system. A distinguishing feature of the applied rule learner (kumar2020learning) is its ability to handle both continuous and discrete data. We then demonstrated that combining perceptual anchoring and SRL is also feasible in practice by performing relational anchoring with a learned rule (demonstrated in Section 6.2). This scenario also exhibited a further strength of using SRL in anchoring domains, namely that the resulting system becomes highly modular. In our evaluation, for instance, we were able to integrate an extra rule into the ToO, which enabled us to resolve recursive occlusions (described in Section 6.3).
A possible future direction would be to explore how anchored objects and their spatial relationships, tracked over time, facilitate the learning of both the function of objects and object affordances (kjellstrom.et.al-2011; moldovan2012learning; koppula.et.al-2013; koppula&saxena-2014). Through the introduction of a probabilistic anchoring approach, together with the learning of the rules that express the relations between objects, we have presented a potential framework for future studies of spatial relationships between objects, e.g., the spatial-temporal relationships between objects and human hand actions to learn the function of objects (cf. kjellstrom.et.al-2011; moldovan2012learning). Such a future direction would tackle a similar question, currently discussed in the neural-symbolic community (garcez2019neural), namely how to propagate back symbolic information to sub-symbolic representations of the world. A recent piece of work that combines SRL and neural methods is, for instance, manhaeve2018deepproblog.
Another aspect of our work that deserves future investigation is probabilistic anchoring in itself. With the approach presented in this paper we are merely able to perform MAP inference. In order to perform full probabilistic anchoring, one would need to render the anchor matching function itself fully probabilistic, i.e., the anchor matching function would need to take random variables as arguments and again output probability distributions instead of point estimates. Ideas borrowed from multi-hypothesis anchoring (elfring.et.al-2013) might, therefore, be worthwhile to consider for future work.
PZ and AP outlined the extension of the framework to include probabilistic properties and multi-modal states. PZ and NK integrated SRL with perceptual anchoring. PZ, AP and NK performed the experimental evaluation. AL and LD have developed the notions and the ideas in the paper together with the other authors. PZ, NK, AP, AL and LD have all contributed to the text.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.