Abstraction is humankind’s ability to condense and generalize previous experience into symbolic entities that can act as proxies for reasoning about the world. We ground these abstractions in our sensory experience and let them take on different meanings depending on context Barsalou (2008). The task of learning to detect and abstract affordances differs from visual categorization: a deeper understanding rests on being able to detect intra-category commonality rather than saliency.
As an example of what this deeper understanding implies, we can picture an unstructured environment where the right object for an intended action is unavailable. An agent that can reason in an abstract fashion can replace the unavailable object with other objects that afford the same or similar actions; for example, it can replace a spoon with a pen for stirring, or a pan with a pot. Hence categories in this sense are not binary but loosely defined by a set of abstract functional properties that make up the common denominators of the category.
An additional benefit of learning to abstract is that reasoning about the similarity between categories becomes simpler, as we are comparing similarities across subsets of the feature space. The agent can thus assemble hierarchies of clusters of similar actions, which in turn enable reasoning within specific action domains, better planning, and the synthesis of explorative strategies in unknown domains.
Having this cognitive ability is extremely useful. This paper thus proposes a method for learning affordance abstractions, showing how an agent can ground them in its own sensory input and use them to reason about the semantic similarity between objects.
We hypothesize that the abstract representation of an affordance category is a latent space of the general space of vector representations of objects. Associated with this latent space is a metric that we can use as a proxy for reasoning about similarity. We learn this similarity metric from the data, guided by the notion that similar items should be close in the latent feature space and dissimilar items far away. In this paper, this means learning a feature transform that lets us reason in the latent space under a Euclidean metric.
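As a minimal sketch of this idea, distance in the latent space is simply the Euclidean distance between linearly transformed feature vectors; the transform below is a random stand-in, not a learned one, and all names are illustrative:

```python
import numpy as np

# Stand-in linear transform; in the paper this is learned from data.
rng = np.random.default_rng(0)
L = rng.normal(size=(3, 10))  # project 10-D object features down to 3-D

def latent_distance(x1, x2, transform):
    """Euclidean distance between two feature vectors in the latent space."""
    return float(np.linalg.norm(transform @ x1 - transform @ x2))

x_spoon = rng.normal(size=10)  # hypothetical object feature vectors
x_pen = rng.normal(size=10)
d = latent_distance(x_spoon, x_pen, L)
```

The learned transform replaces hand-picked metrics: once it is fitted, ordinary Euclidean comparisons in the latent space stand in for semantic similarity.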
By learning such a transform, an agent can learn the relevant features for classifying to the affordance, which enables it to ground the affordance in the selected part of the sensory input. This grounding helps the robot locate affordance-specific parts of the object from a set of general global and local object features. Simply put, we teach the agent to point out which parts of the feature representation of an object are important for the affordance. By extension, we enable the agent to point out the physical parts of an object that are important for classifying it to the affordance.
To learn to abstract affordance categories into a set of common features, the agent must ground the affordance in its own sensor input. The grounding and abstraction allow for reasoning about the similarity between affordances, as it is reasonable to expect similar affordances to have similar sets of common features. In light of this, we propose a novel interpretation: that we can understand the grounding of the features for the categories through the similarity transform of the data itself rather than through an analysis of data points in the latent feature space. We show that this semantic meta-similarity analysis is possible by reasoning about the distance between the transforms.
To form a complete understanding of an affordance, the agent needs to learn from interaction with the object, observing object, action, and effect (OAE) triples. However, it has been argued that human design of objects follows, or should follow, certain design principles that through simple cues reveal the affordance of an object to the human observer Norman (2002). In light of this, this paper asks: how much of an affordance can we understand from just observing a set of objects that afford the same action? What similarities exist in the feature space of a category, and can we deduce them just from observing category and non-category members? Are these abstract similarities relevant for interaction with the object?
We start by giving a wide perspective on current approaches to affordance learning and go into detail on related work connected to the proposed method. We proceed to describe our approach in detail and give experiments showing how an agent can learn abstractions for affordance categories and how it can reason about these categories. We end by outlining some important principles and future work that needs addressing.
2 Perspective on Affordance Learning
This paper teaches an agent to abstract affordance categories by observing common features in the representation of objects that afford the same action. Specifically, we learn a linear transform that gives us a latent representation where similar items are close and dissimilar items far away. The latent representation enables us to compare items using the Euclidean metric as a proxy for similarity. We penalize the learning such that the transformation only selects and transforms relevant features. We interpret the selected features as an abstraction of the affordance we are learning and the magnitude of the transform as a measure of the relative relevance of a feature. This relevance enables us to pinpoint important parts of the objects belonging to an affordance category. Further on, we measure the similarity between affordances by measuring the distance between the transform magnitudes; that is, we are able to abstract affordance categories and compare the abstractions.
Our approach to the affordance learning problem is thus quite different from the general affordance learning research being done in robotics, which we divide into developmental methods based on exploration, and methods that learn to predict more advanced affordances from demonstration (LfD) or from annotated datasets.
Developmental methods have so far followed a paradigm of measuring object, action, and effect (OAE) triples. They focus on simple affordances such as pushing, rolling, simple tool use, etc., where the outcome of an action is clear and measurable Chao et al. (2011); Griffith et al. (2010, 2009); Högman et al. (2013); Modayil and Kuipers (2008); Montesano et al. (2008); Niekum et al. (2012); Sinapov and Staley (2013); Sinapov and Stoytchev (2008); Stoytchev (2005); Sahin et al. (2007); Yuruten et al. (2013); Dehban et al. (2016); Gonçalves et al. (2014); Penkov et al. (2017); Ivaldi et al. (2014); Mar et al. (2015); Kraft et al. (2009). This is sensible since their objective is to teach a robot with limited cognitive and motor abilities to connect OAE triples. Learning is often unsupervised and explorative Kraft et al. (2009); Griffith et al. (2009, 2010); Modayil and Kuipers (2008) and based on learning thresholds Chao et al. (2011); Niekum et al. (2012)
for the perceived features, thus requiring clear pre- and post-conditions. These threshold operations are similar in nature to abstracting the features of a category; however, they are often semi-automated and built using heuristics rather than following the automatic process of our approach.
One of the more complete models, with regard to structuring the learning as well as showing experiments in real environments, comes from Modayil and Kuipers (2007, 2008). The authors represent objects not as physical entities but as a “hypothesized entity that accounts for a spatiotemporally coherent cluster of sensory experience”. They represent objects by a set containing a tracker, percept, classes, and actions, which are all more or less temporal. The most interesting aspect of this formulation is the representation of objects as consistent sensory inputs over time, associated with action possibilities that produce certain outcomes. This more integrated view of learning about objects and interacting with the world is much closer to the idea of symbolic grounding and how some researchers think humans organize grounded knowledge.
More recent affordance-based learning approaches also employ the OAE paradigm, still with simple actions, but with some form of convolutional neural network (CNN) used together with massive amounts of collected OAE triples Pinto et al. (2016); Agrawal et al. (2016); Pinto and Gupta (2017). In Agrawal et al. (2016) the authors hypothesize that humans have an internal physics model that allows them to understand and predict OAEs. They suggest learning a similar model via a siamese CNN of the image input from before and after an action. Pinto et al. (2016) take a similar approach; however, there the novelty lies in the construction of a branching deep net. The network has a pre-trained common base that branches out with nets pre-trained for inference of pinch grasping, pushing, and pulling actions. The base net feeds its output into the branches and receives feedback from them, updating the weights in both the base and the branches. This enables the net to refine the input to cater to specific tasks. This is similar to the current perception of how humans process visual information: the processing starts with a unified preprocessing of the visual input and then branches into cortical areas that handle vision for action and vision for cognition.
It is obvious that deep methods offer a great advantage in processing as they can take raw image input and consistently produce good results. In addition, they can process the large amounts of training data that a robot needs to learn affordances that are not toy examples. However, the drawback of these methods is that deep nets are somewhat of a black-box method lacking in interpretability; they are also data-inefficient and can be unreliable in prediction. To the contrary, our approach yields interpretable results, allows us to locate the position of important features on the object for specific affordances, and lets us reason about the similarity between affordances.
Learning OAE triples is thus a seemingly agreed-upon fundamental component of learning affordances. However, learning affordances from everyday object interactions is more complicated. Actions are complex: they involve several steps of manipulation, and outcomes are therefore not always clear-cut. Efforts so far have thus focused on some form of supervision, either in acquiring the training data, the provision of labels, or implicitly in the model. A majority of the models try to infer the affordance or the action instead of teaching the robot to generalize, understand, and perform the action associated with the affordance.
Most methods take a standard supervised computer vision approach, that is, categorizing labeled images, sequences, or action commands Fritz et al. (2006); Stark et al. (2008); Montesano and Lopes (2009); Hermans et al. (2011); Sun et al. (2010); Ye et al. (2017); Myers et al. (2015); Nguyen et al. (2016). Others try to model relationships between an observed actor, typically a human, and the objects it interacts with Pieropan et al. (2013); Koppula et al. (2013); Gall et al. (2011); Aksoy et al. (2011); Kjellström et al. (2011); Bütepage et al. (2018), learning affordances and actions jointly. However, robots are frequently used as well, and they are generally equipped with some form of pre-programmed knowledge such as actions, action effects, features, or object knowledge Wang et al. (2014); Thomaz and Cakmak (2009); Chu et al. (2016); Montesano et al. (2008) to assist in the learning. These methods are good at what they do: predicting actions and outcomes from visual input. However, for a robot trying to understand an action, and perhaps learn to perform the action itself, these methods describe discretization of sensory input and knowledge from a human perspective, not from the robot’s own sensory perspective.
Our approach is similar in that we learn from labeled images, but with multiple affordance labels for the whole image as in Hjelm et al. (2015); Nagarajan et al. (2018), instead of learning to predict pixelwise labels as in Myers et al. (2015); Nguyen et al. (2016); Do et al. (2018); Abelha and Guerin (2017); Nguyen et al. (2017); Detry et al. (2017), and without the addition of actions and outcomes. Our goal is to ground the affordance in the representation of the object. As stated in the introduction, our interest lies in what kind of abstractions an agent can learn from observing the common features in a category and how we can use these grounded features to reason about and perform the affordance.
Humans use rule-based and similarity reasoning to transfer knowledge about categories, but this is almost certainly not how our visual system categorizes everyday objects at the basic category level. Nevertheless, works exploring classification by attributes or attribute learning are important because they touch on the deeper question of how to learn the invariant features of categories, albeit from high-level abstractions. This is an extremely important ability to have when generalizing affordances. When humans substitute objects, it tends to happen in an ad-hoc fashion: we base the selection process on similarity comparisons across the abstraction we have for an affordance to motivate the substitution.
These types of attribute approaches have mostly been explored in computer vision Ferrari and Zisserman (2008); Lampert et al. (2009); Farhadi et al. (2009); Malisiewicz and Efros (2008). Ferrari and Zisserman (2008) segment images and learn a graphical model over the segments that captures relations between segments and contexts, enabling it to predict patterns such as stripes, colors, etc. Lampert et al. (2009) associate specific attributes with specific image categories such that they can infer the image class from knowledge about the attributes. The attributes act as an intermediate layer in a graphical model, which enables conditioning novel classes on learned classes and attributes. The model does not recognize new attributes but rather relies on the notion that learned attributes contain information relevant for novel classes.
The approaches most similar to ours are those of Farhadi et al. (2009); Malisiewicz and Efros (2008). Malisiewicz and Efros (2008) formulate the categorization problem as a data association problem, that is, they define an exemplar by a small set of visually similar objects, each with associated distance functions learned from the data. Farhadi et al. (2009) equate the ability to predict attributes with the ability to predict image classes from the learned attributes. They stack a broad number of different features and use feature selection to filter out irrelevant ones. They realize that the number of attributes they have specified is not sufficient to classify to the specific categories and opt to learn additional attributes from the data.
The abstractions we want to learn can also be considered discriminative attributes, however, we learn these through similarity comparisons rather than through discrimination. Our aim is to simultaneously learn to predict categories and abstract them as we view them as different aspects of the same process.
A related approach uses a Bayesian Network (BN) that relates class, features, and attributes. The authors teach a robot to recognize key attributes of objects such as size, shape, color, material, and weight, which they use to predict affordances such as traverse, move, etc. They compare their attribute-based affordance prediction with an SVM trained directly on the feature space. The direct approach performs comparably or better than the attribute-based approach; the explanation they propose is that the feature space contains information not directly explainable as any specific semantic attribute.
This serves to illustrate that we should not program autonomous robots in unstructured environments to process the world in hierarchies of abstract symbols derived from our own human sensorimotor systems Brooks (1990). Human language consists of abstract semantic symbols grounded in invariant features to make conveying, planning, and reasoning easy. Psychophysical evidence indicates that the human sensorimotor system does not count attributes to recognize an object. Further on, we cannot expect different sensorimotor systems, human and robot, to produce the same semantic grounding unless they are exactly identical in construction and experience. This is a key feature of our work: we learn the invariant features by detecting similarities in the representation of intra-category objects. The agent is thus able to ground the semantic meaning of the affordance in its own sensorimotor system, which enables it to locate physical parts of the objects that are important for classifying to the affordance. However, we never use these parts to infer whether an object affords an action or not.
Measuring similarity is difficult, especially for high-dimensional representations, as any arbitrary measure would have to treat each dimension as equally valuable. This leaves the important features open to being drowned out by noise or by the sheer number of non-relevant features. Further on, different metrics are useful for certain distributions of the data while being detrimental for others.
One way of solving this involves specifying a relevant representation or measure for each category. However, this solution does not scale and contrasts with the idea that an agent should ground semantics in its own sensory input. The other approach, which we adhere to, is learning the similarity from the data. We consider the useful representation for a category to be a latent representation of a more general object representation. Learning the latent representation, in turn, enables us to use the Euclidean metric for reasoning about similarity.
The similarity measures used in affordance learning mostly describe the similarity between OAE triples or a subset of them. Many formulate their own measures or use the standard Euclidean measure Gall et al. (2011); Aksoy et al. (2011); Griffith et al. (2010, 2009); Modayil and Kuipers (2008); Sinapov and Staley (2013); Sinapov and Stoytchev (2008); Stark et al. (2008); Wang et al. (2014). The measures are often used in an unsupervised setting to cluster for affordance categories. Others use kernels as an implicit measure of similarity in supervised learning Ek et al. (2010); Hermans et al. (2011); Montesano et al. (2008); Pieropan et al. (2013); Yuruten et al. (2013).
Entropy Shannon (2013) is sometimes used to compute distances between distributions that describe possible actions or object categories Högman et al. (2013) or to measure the stability of unsupervised category learning Sinapov and Staley (2013); Gall et al. (2011); Griffith et al. (2010). Lastly, another popular approach is to model associations as graphical models, Bayesian Networks (BN) or Conditional Random Fields (CRF), as these are good at describing the temporal nature of object interaction and other complex associations Ek et al. (2010); Kjellström et al. (2011); Koppula et al. (2013). Here probability becomes a proxy for similarity.
To the best of our knowledge, no previous method has approached the affordance classification problem by learning the metric from the data. We can think of the CNN approaches described above as learning a transform that enables an implicit similarity mapping; however, as opposed to our approach, they are unable to locate what in the input caused the classification. Further on, CNN-based approaches project non-linearly onto massively high-dimensional spaces using massive amounts of data. We instead show that our linear projection can reduce the dimensionality from 322 dimensions down to 3 with no significant loss in accuracy using low amounts of data. At the same time, our sparsity-inducing regularization forces the projection to use only a small subset of the features on average. Finally, we learn which physical parts of the objects in a category are relevant for the category, giving the agent a deeper connection between sensory input and the actionable parts of the objects.
In this sense, the approach most similar to ours in learning the feature space is the method of Sun et al. (2013). They learn a feature codebook over the RGBD space of objects by optimizing towards a compact representation of the feature space in an unsupervised fashion, similar to an autoencoder. The authors use the codebook to find a lower-dimensional representation of objects and to classify object attributes. They show that by regularizing the classifier they can learn which codewords are important for specific attributes. However, this approach is computationally taxing as they learn the codebook and the latent representation simultaneously. Contrary to our approach, they aim for a general representation for all tasks rather than utilizing class labels to learn task-specific representations.
Our goal is to learn a feature transform for each affordance that, given a general object representation, outputs a latent representation. This latent representation should have the quality that objects that afford the action are close in the latent space and others far away. This implies that we can use the Euclidean metric as a proxy for measuring similarity. We learn the transform from a set of input-target pairs, where each input is a general feature vector and each target is a label denoting whether the object affords the action or not.
Given a set of feature transforms for different affordances our approach has four goals. We want to:
Learn what features of the general object representation are important for classifying instances to each of the affordances.
Formulate a general abstract representation of the affordance based upon the relevant features.
Given an object locate the relevant parts on the object.
Model the relationship between the affordances such that we can understand which affordances are similar.
We capture objects as an RGBD image using a Kinect sensor and convert it into a 2D image and a point cloud. The point cloud representation is noisy, and parts of the object are often missing due to reflective materials. We can therefore expect the relevance and reliability of the features to vary substantially across the different affordance categories. Further on, it is difficult to know beforehand which features are important for classifying to a specific affordance. Because of this, we choose to stack a number of global and local features and let the algorithm decide on relevance. The stacking gives a feature vector of dimension 322.
The global features are:
Object volume - the volume of the convex hull enclosing the object point cloud.
Shape primitive - similarity to the primitive shapes cylinder, sphere, and cube, as fitted by the RANSAC algorithm.
Elongation - the ratio of the minor axes to the major axis.
Dimensions - the length of the sides of the object.
Material - Objects often consist of different materials. We want a vector representation that gives a score for the different materials of an object. To find these scores we train an SVM to classify the materials glass, carton, porcelain, metal, plastic, and wood. The input is the concatenation of a Fisher Vector (FV) representation of the SIFT features of the image and the output of the 5th layer of a re-trained GoogLeNet. We take the scores of the SVM over an object as the decomposition score of the different materials.
We motivate these global features by research showing their usefulness in predicting variables involved in human grasping and affordances, e.g. Baugh et al. (2012); Buckingham et al. (2009); Fabbri et al. (2016); Feix et al. (2014); Martin (2007); Grafton (2010); Jenmalm et al. (2000); Sartori et al. (2011, 2011).
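A rough sketch of how global shape features of this kind might be computed from an object point cloud; this is an illustration only, substituting an axis-aligned bounding box for the convex-hull volume and PCA extents for fitted axes:

```python
import numpy as np

def global_shape_features(points):
    """Simplified global shape features from an (N, 3) object point cloud.
    Sketch only: a bounding box stands in for the convex hull, and the
    principal extents come from an SVD of the centered cloud."""
    dims = points.max(axis=0) - points.min(axis=0)       # side lengths
    volume = float(np.prod(dims))                         # box volume
    centered = points - points.mean(axis=0)
    extents = np.linalg.svd(centered, compute_uv=False)   # principal extents
    elongation = extents[1:] / extents[0]                 # minor over major axis
    return volume, elongation, dims

rng = np.random.default_rng(1)
cloud = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.5])
volume, elongation, dims = global_shape_features(cloud)
```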
The local features are:
Image gradients - histograms of intensity and of image gradients of order 1, 2, and 3.
Color quantization - the mapping of colors to a finite set of colors and computing the histogram over the mapped colors.
Color statistics - entropy, mean, and variance over the color-quantized object.
FPFH - Bag-of-Words over Fast Point Feature Histograms Rusu et al. (2009) for a number of radius scales.
HoG - Bag-of-Words representation over the HoG Dalal and Triggs (2005) features of the image.
Again we motivate these features by studies showing their usefulness, especially shape descriptors e.g. Krüger et al. (2013); Norman (2002). Due to the point cloud representation, we only need to keep the portion of features associated with the point cloud. For example, for the gradients, we only compute the gradients for pixels associated with the point cloud. This works for all features except for HoG as it uses patches overlapping the image.
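As an illustration of one of the local features above, color quantization can be sketched as mapping each pixel to its nearest color in a fixed palette and histogramming the result; the palette here is a hypothetical stand-in for the paper's finite color set:

```python
import numpy as np

def color_quantization_histogram(pixels, palette):
    """Map each RGB pixel to its nearest palette color and return the
    normalized histogram over palette indices."""
    # squared distance from every pixel to every palette color
    d2 = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(palette)).astype(float)
    return hist / hist.sum()

palette = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255], [0, 0, 0]], float)
pixels = np.array([[250, 10, 5], [2, 3, 1], [0, 200, 30]], float)
hist = color_quantization_histogram(pixels, palette)
```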
As discussed in the introduction we use distance as a proxy for similarity. However, with low amounts of data, it is difficult to construct a general high-dimensional feature space that works well under some metric for a number of different labelings.
We, therefore, want to learn a transform for each affordance that puts similar instances close in space and dissimilar instances far away. The transform should help us locate the parts of the feature space that are relevant and project onto a lower-dimensional subspace, which alleviates the curse of dimensionality.
To this end, we use a regularized version of the Large Margin Component Analysis (LMCA) metric learning algorithm Torresani and Lee (2006), which we refer to as LMCA-R. LMCA learns a linear transformation of the input space that pulls the class nearest neighbors (NN) of every instance closer together while pushing non-class members outside a margin, as illustrated in fig.2. We learn the transformation using gradient descent over the following loss function,
Here the first term sums over the nearest neighbors of each instance that belong to the same class, and a binary indicator variable is zero if two instances have the same label and one otherwise. The first term penalizes large distances to the class NN, and the second term penalizes non-class instances that are closer to the instance than the class NN by a margin. A constant controls the relative importance of the pushing component, and the hinge used is the differentiable smooth hinge loss of Rennie and Srebro (2005).
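A schematic, deliberately unvectorized sketch of such a pull-push loss; the smooth hinge variant, constants, and names are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def smooth_hinge(z):
    """A smooth (Huber-like) hinge: one common differentiable variant."""
    return np.where(z <= 0, 0.0, np.where(z >= 1, z - 0.5, 0.5 * z ** 2))

def lmca_style_loss(L, X, y, neighbors, margin=1.0, c=0.5):
    """Pull class nearest neighbors together; push differently-labeled
    impostors outside a margin. `neighbors[i]` holds precomputed indices
    of the class NNs of instance i."""
    Z = X @ L.T  # project all instances into the latent space
    pull, push = 0.0, 0.0
    for i, nns in enumerate(neighbors):
        for j in nns:
            d_ij = float(np.sum((Z[i] - Z[j]) ** 2))
            pull += d_ij
            for k in range(len(X)):
                if y[k] != y[i]:  # impostor with a different label
                    d_ik = float(np.sum((Z[i] - Z[k]) ** 2))
                    push += float(smooth_hinge(margin + d_ij - d_ik))
    return (1.0 - c) * pull + c * push

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
neighbors = [[1], [0], [3], [2]]
loss = lmca_style_loss(np.eye(2), X, y, neighbors)
```

With well-separated classes, as here, the push term vanishes and only the small pull distances contribute.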
A weight term aims to balance the learning in terms of false-positive rates when some of the classes have few exemplars. We formulate the weight of an instance as the ratio between the total number of data points and the number of instances in the class that the instance belongs to. The justification is that if we assume that each of the summands in the loss function is roughly similar in magnitude, then the weight factor will level the contribution from each class to the loss. We additionally rescale the ratio to keep it at a reasonable value and avoid numerical instability.
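The balancing idea can be sketched as follows; the exact rescaling constant in the paper is not reproduced here, so as an assumption we simply normalize the weights to mean one:

```python
import numpy as np

def balance_weights(y):
    """Per-instance class-balancing weights: total count over class count,
    rescaled so the mean weight is one (keeps the loss magnitude stable)."""
    y = np.asarray(y)
    counts = {c: int(np.sum(y == c)) for c in np.unique(y)}
    w = np.array([len(y) / counts[label] for label in y], dtype=float)
    return w / w.mean()

w = balance_weights([0, 0, 0, 0, 1])
```

After weighting, the minority instance carries as much total weight as all majority instances combined, which is the leveling effect described above.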
The last term in eq.1 is a penalization term due to Obozinski et al. (2009). It is the sum of the L2-norms of the columns of the transformation matrix, that is, an L1-norm over the column L2-norms. The column-wise norm is the crucial factor as it acts on the full column, allowing it to be reduced to zero entirely. This means that it will remove irrelevant features completely instead of zeroing individual matrix elements, as happens with an L1-norm over the matrix elements. A constant controls how much weight we put on penalizing non-zero columns.
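Assuming the penalty is the standard L2,1-norm of Obozinski et al. (2009), it can be computed as:

```python
import numpy as np

def l21_norm(L):
    """L2,1-norm of a transform: the sum (L1-norm) of the L2-norms of its
    columns. Driving a column to zero removes that input feature entirely."""
    return float(np.linalg.norm(L, axis=0).sum())

L = np.array([[1.0, 0.0, 3.0],
              [2.0, 0.0, 4.0]])
penalty = l21_norm(L)  # sqrt(5) + 0 + 5
```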
As the transform is a projection, we can choose to let it project onto a subspace of much lower dimension than the input. In the experiments section, we show that we can project from 322 dimensions down to 3 without a significant loss in accuracy, a reduction in dimension of roughly 99%.
To classify to an affordance category we formulate the problem as a binary decision problem, that is, we learn a specific transform for each affordance class. We apply the transform to the data and classify to the affordance using kNN, with the number of neighbors equal to the number used in the learning phase. Our use of kNN is thus a direct evaluation of the metric.
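A minimal sketch of this classification step, with illustrative names and data:

```python
import numpy as np

def knn_predict(transform, X_train, y_train, x_query, k=3):
    """Binary kNN vote in the learned latent space: a direct evaluation
    of the learned metric."""
    Z = X_train @ transform.T
    z = transform @ x_query
    nearest = np.argsort(np.linalg.norm(Z - z, axis=1))[:k]
    return int(y_train[nearest].sum() * 2 > k)  # majority vote on 0/1 labels

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.1, (5, 4)), rng.normal(3.0, 0.1, (5, 4))])
y = np.array([0] * 5 + [1] * 5)
pred = knn_predict(np.eye(4), X, y, X[0])
```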
We analyze the feature selection by taking the magnitude of the columns of the transform: low values indicate an irrelevant feature and high values a relevant one. To analyze the similarity between different affordances, we treat the magnitude vector as having a multivariate Gaussian distribution and use the KL-divergence as the distance measure.
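One plausible reading of this comparison, stated here as an assumption: with unit-variance Gaussians centered at the magnitude profiles, the KL-divergence reduces to half the squared Euclidean distance between the profiles (and is symmetric):

```python
import numpy as np

def column_magnitudes(transform):
    """Feature-relevance profile: the L2 magnitude of each column."""
    return np.linalg.norm(transform, axis=0)

def gaussian_kl(m1, m2):
    """KL divergence between unit-variance diagonal Gaussians centered at
    the two magnitude profiles: 0.5 * ||m1 - m2||^2 under this assumption."""
    return 0.5 * float(np.sum((np.asarray(m1) - np.asarray(m2)) ** 2))

A = np.array([[1.0, 0.0], [0.0, 2.0]])
B = np.array([[1.0, 0.0], [0.0, 0.0]])
d = gaussian_kl(column_magnitudes(A), column_magnitudes(B))
```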
The affordance learning problem is a multiclass problem. However, two factors motivated us to switch from multiclass to a binary decision problem. Firstly, for the feature selection analysis to work, we need to pit the objects in one affordance category against a wide range of different objects. If we are learning multiple classes simultaneously this analysis is not possible; the feature selection will instead surface good general features. Secondly, learning multiple classes at the same time is not optimal, as it would leave fewer parameters and data points for each problem.
We motivate our experiments by the following three questions:
1. Does our approach select features that are sensible as an abstraction for explaining an affordance?
2. Do the selected features map out a similar set of parts on all the objects in an affordance category?
3. How do the affordances relate to each other? Are the affordances we as humans view as similar the same ones the model deems similar?
We collected 265 RGBD images of everyday objects ranging from cups to cereal boxes, tools, cans, and water bottles. To collect the images we placed the objects on different flat surfaces and recorded an RGBD image using a Kinect camera. We took each image under different light conditions and varied the pose of the objects to a reasonable degree. Many of the objects had small parts or parts made of glass or metal, leaving large holes in the depth recordings. Since each image is devoid of clutter, it is simple to segment out the object by removing all point-cloud points not above the planar surface.
We labeled each object as a binary vector specifying if it affords each one of the affordances in table 1. Many of these affordances are quite vague and labeling is not as binary as in standard image classification. This vagueness follows from the vagueness in the definition of the affordance concept. Many objects that afford an action will under normal circumstances not be used for the affordance if other suitable objects are available.
Table 1 (excerpt), affordance label counts (positive, negative):
- Eating From (25, 241)
- Handle Grasp (56, 210)
- Lifting Top (79, 187)
- Loop Grasp (31, 235)
- Squeezing Out (14, 252)
A prerequisite for answering the above questions is to first validate if the algorithm and features provide good affordance classification accuracy, that is, if the similarity metric we learn produces valid results. We compare our results to a kNN, and a linear SVM trained on the provided features. We also compare to an SVM trained on the output from the last fully connected layer of a pre-trained CNN as it has proven to be a good baseline.
As a pre-processing step, we standardize all data. We use five-fold cross-validation to learn the optimal parameters. For the kNN and SVMs, we also cross-validate against PCA projections over a range of dimensions, including performing no PCA at all. For LMCA-R, we cross-validate the impostor-loss parameter and the regularization parameter.
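A minimal stand-in for the five-fold splitting used in this cross-validation (a library splitter would normally be used; this sketch just shows the mechanics):

```python
import numpy as np

def five_fold_splits(n, seed=0):
    """Shuffled five-fold cross-validation splits: each fold serves once
    as the held-out set for hyperparameter selection."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, 5)
    return [
        (np.concatenate([folds[j] for j in range(5) if j != i]), folds[i])
        for i in range(5)
    ]

splits = five_fold_splits(25)
```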
We fix the number of nearest neighbors for kNN and LMCA-R. For the SVMs we use the Scikit-learn library Pedregosa et al. (2011), which uses LibSVM, with a linear kernel and one-against-all classification. For LMCA-R we fix the output dimensionality of the projection. For the CNN features we use Caffe Jia et al. (2014) with a GoogLeNet model pre-trained on ImageNet; we extract the fifth layer as the feature vector.
We create a number of training-test splits of the dataset and give the results as averages over the splits in table 1. We use the F1-score as our main metric as we are performing binary classification over many unbalanced categories; for example, for the spraying affordance the accuracy and the best F1-score differ markedly. As we can see from table 1, LMCA-R performs best in a majority of the cases, outperforming kNN in all but one case.
Comparing the CNN features to the constructed features, we see that they perform roughly the same, except for a few affordances where one or the other significantly outperforms. It is difficult to pinpoint exactly why. We hypothesize that some affordances contain objects for which the depth recordings contain a high amount of noise. This propagates into uncertainties in the constructed features, which mostly depend on depth recordings. For example, objects affording hanging usually have an arched part that can be difficult to record with sufficient accuracy, as such parts are usually around 1 cm in diameter and thus close to the Kinect noise threshold. The CNN features, on the other hand, do not rely on depth measurements and are thus free of this constraint. We also see that LMCA-R performs decently for these classes. This is due to the reweighting factor and the penalization, which are able to disregard irrelevant features and weigh the smaller class as equally important.
The right axis shows the KL-divergence between the normalized weights of a feature and a uniform distribution, indicating the within-feature distribution of magnitude values.
4.3 Feature Selection
We take the average of the column magnitudes over the runs and normalize; this indicates each dimension's importance. We give results for of the more interesting affordances in Figs. 3-4. The bar plots show the sum of the magnitudes for each feature, that is, each feature's fraction of the full magnitude vector. To give a notion of the distribution of magnitude within each feature, we compute the KL-divergence between the normalized magnitudes of each feature and a uniform distribution. The right-hand bars thus indicate how evenly the magnitudes are distributed within each feature.
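This per-feature analysis can be sketched as follows; the weight vector and the two-group split are hypothetical, and the KL-divergence is computed with the natural logarithm under the assumption that all magnitudes are strictly positive:

```python
import numpy as np

def group_importance(w, groups):
    """Per-feature-group share of the total weight magnitude, plus the
    KL-divergence of the within-group magnitudes from a uniform distribution."""
    w = np.abs(w)
    total = w.sum()
    shares, kls = {}, {}
    for name, idx in groups.items():
        m = w[idx]
        shares[name] = m.sum() / total         # fraction of the full magnitude vector
        p = m / m.sum()                        # normalized within-group magnitudes
        u = np.full(len(p), 1.0 / len(p))      # uniform reference distribution
        kls[name] = float(np.sum(p * np.log(p / u)))  # assumes p > 0 everywhere
    return shares, kls

# Hypothetical 6-d weight vector split into two feature groups.
w = np.array([0.5, 0.5, 0.5, 0.9, 0.05, 0.05])
shares, kls = group_importance(w, {"size": [0, 1, 2], "shape": [3, 4, 5]})
print(shares)  # each group's share of total magnitude
print(kls)     # 0 for the evenly spread "size" group, > 0 for "shape"
```

A KL-divergence near zero means the magnitude is spread evenly across the feature's dimensions; a large value means a few dimensions dominate.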
The general tendency is that some features like material, size, and shape are important across the board. Size and material are good for making an initial guess. For example, there are no tools made of paper or very thin objects that afford Stacking or Handle Grasping, etc. Shape features are more specific and vary much more across the different affordances; however, in general, size, shape, and material features are the most important, as expected. Analyzing the diagrams for all the affordances, it is clear that the features *volume, shape primitive, gradients, and color stats* are not as important for classifying affordances compared to the other features.
4.4 Feature Projection
The second question we set out to answer is: are there certain invariant parts of the objects that are valuable for classifying to an affordance? To investigate this, we extract the important local features and locate them on the objects that afford the action.
We proceed in the same way as in the feature selection analysis. We take the mean of the magnitudes of over all the runs. From the mean, we select the subset of features that are point-cloud based, that is, the gradient, color quantization, and FPFH features, and normalize this subset.
To get an indication of the important parts, we assign an importance weight to each point. We compute it by summing the feature weights associated with the point according to each of the selected features,
Here, is a feature function that takes a point-cloud index and returns an index corresponding to the weight value for that feature, and is the weight vector for the feature . For example, for the BoW FPFH features, each codeword has a weight; to find the weight of a point, we thus classify the point to a codeword and look up that codeword's index in the weight vector.
To color the object, we divide all values by the maximum value taken over all points. We feed the values to a gradient function between red and blue, such that values close to the maximum become red and values close to zero become blue.
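The per-point weighting and coloring described above can be sketched as follows. The single BoW feature, its codeword assignments, and its weights are hypothetical stand-ins for the gradient, color quantization, and FPFH features:

```python
import numpy as np

def point_importance(codeword_of_point, weights_per_feature):
    """Sum, over the selected local features, the weight of the codeword
    each point is assigned to (a simplified BoW lookup)."""
    n = len(next(iter(codeword_of_point.values())))
    imp = np.zeros(n)
    for feat, codewords in codeword_of_point.items():
        imp += weights_per_feature[feat][codewords]
    return imp

def red_blue(imp):
    """Map importances to colors: maximum importance -> red, zero -> blue."""
    t = imp / imp.max()
    return np.stack([t, np.zeros_like(t), 1.0 - t], axis=1)  # (r, g, b) per point

# Hypothetical: 4 points, one BoW feature ("fpfh") with 3 codewords.
codes = {"fpfh": np.array([0, 1, 2, 2])}
w = {"fpfh": np.array([0.1, 0.9, 0.4])}
imp = point_importance(codes, w)
colors = red_blue(imp)
print(colors[1])  # the point assigned to the highest-weight codeword is pure red
```

With several features, each point accumulates one weight per feature before normalization, so a part can be highlighted because it matters to any of the local descriptors.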
Before we analyze the results in Fig.5 we want to bring up one important point that cannot be stressed enough: humans and robots are different sensorimotor systems. We have different feature representations and mechanisms for detecting invariant features. Therefore, we cannot expect the invariant selected parts of the objects to be the same for robots and humans. Our approach might detect invariant features that humans are unable to detect or understand. The important part is the consistency of the invariances across objects. With that being said, it would be interesting to see whether there is a correspondence between the invariant parts selected by our model and what one can expect from a human.
The objects selected in Fig.5 are just a small subset of all positive examples, roughly 1400, but give a good representation of the main results.
For Drinking, Fig.44.1-44.2, the highlighted part is the rounded back part of the object. The back part was selected in a similar fashion across most of the objects, even for such diverse objects as the smaller bowl and the teapot.
For Eating From, Fig.44.3-44.5, we see that the algorithm highlighted the flat bottom for two of the objects, in Fig.44.3 and Fig.44.5, but not in Fig.44.4. This illustrates the difficulty of generalizing from a couple of highlights. What these three images show is that the flat parts are important for categorizing those two objects, while the sides of the frying pan in Fig.44.4 are more important than its flat part for categorizing to the affordance. Despite this, a majority of the objects in the category show highlighting of the flat or base parts.
Handle Grasping, Fig.44.6-44.8, gave mixed results. Many objects had colorings similar to those in Fig.44.6-44.7. However, for a number of objects the algorithm either selected the whole object or the connecting part where the handle meets the tool part, as in Fig.44.8. We expected this, as the connecting part is a common shape across objects with handles.
In Hanging, Fig.44.9-44.10, we gave the algorithm a number of objects with loops. The results were not satisfactory. On one hand, we had results as in Fig.44.9, yet most results were similar to Fig.44.10, with significant noise. A closer inspection revealed that a large number of cups skewed the results towards detecting cylindrical parts. The set of objects affording Loop Grasping, a subset of Hanging, Fig.44.13-44.14, showed similar effects.
Lifting Top, Fig.44.11-44.12, also gave mixed results. The objects varied significantly in shape, and we expected the algorithm to detect the small correlations across the objects given by the shape of the tops. On the contrary, the results show that detecting small shapes is difficult at best, due to the Kinect's low resolution and level of noise.
Opening, Fig.44.15-44.17, gave perhaps the most surprising results. The objects had large variations in shape, ranging from toothpaste tubes to milk cartons and bottles. We therefore considered it one of the more difficult categories. Despite this, for a majority of the objects the algorithm consistently highlighted the parts approached for opening.
For Rolling, Fig.44.19-44.20, we expected the whole object to be colored. This happened for the majority of the objects, but there were also some with spurious colorings, such as in Fig.44.20, where the results were more difficult to interpret.
Stacking, Fig.44.21-44.23, proved to be a good illustration of the point made in the beginning about differences between sensorimotor systems. We expected a coloring of the flat parts, but the actual common denominator is the edges. The algorithm selected similar edges for a majority of the objects.
Finally, Stirring, Fig.44.24-44.26, and Tool, Fig.44.27, gave very interesting results. The objects in these two categories are similar, and as we can see from Fig.44.24-44.26, the algorithm selected the whole handle part with almost uncanny certainty. Seemingly, the algorithm has picked up the rule that objects affording stirring should have thin and elongated handle parts.
To conclude, the above results show good consistency in selecting sensible parts of the objects in most categories. It is clear that we need more data points for the categories that showed low consistency, such as *Hanging* and *Loop Grasping*. For example, the algorithm would benefit from more negative examples, such as cups without handles or with occluded handles. Creating good datasets with sensible labelings for learning complex abstractions is a trial-and-error process, since the features you expect to be important might not be. Furthermore, better depth resolution with less noise would provide a major improvement. For example, flat surfaces are not always interpreted as flat due to noise. This makes the FPFH BoW features map flat surfaces differently, introducing large variance between shapes that might not be that different. Lastly, the analysis we made of the selected features differed, in some categories significantly, from the analysis of the projected features. This shows, as mentioned earlier, that drawing conclusions from the belief that different sensorimotor systems will produce similar results can be precarious.
| Eating From | Putting | Loop Grasping | Drinking |
| Squeezing Out | Spraying | Squeezing | Lifting Top |
The three nearest neighbors for each affordance. We compute the distances using the KL-divergence between the Gaussian distributions over the magnitude vectors of the affordance transforms. Distances are therefore non-symmetric.
4.5 Affordance Association
Finally, we examine how the different affordances relate to each other. We start by assuming that the magnitude of follows a multivariate Gaussian distribution. We compute the mean and covariance by treating all 25 runs as samples from the distribution. We can then measure the similarity between affordances using the KL-divergence.
In Table 2 we list the three nearest neighbors (NN) for each affordance. Since the KL-divergence is asymmetric, the NN of one affordance might not be the NN of the other.
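The closed-form KL-divergence between two multivariate Gaussians makes this asymmetry concrete. The two distributions below are hypothetical stand-ins for the fitted magnitude distributions of two affordances:

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) in closed form."""
    k = len(mu0)
    S1inv = np.linalg.inv(S1)
    d = mu1 - mu0
    return 0.5 * (np.trace(S1inv @ S0) + d @ S1inv @ d - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Two hypothetical affordance "transform magnitude" distributions.
mu_a, S_a = np.array([1.0, 0.0]), np.diag([1.0, 1.0])
mu_b, S_b = np.array([0.0, 0.0]), np.diag([2.0, 0.5])

print(kl_gauss(mu_a, S_a, mu_b, S_b))  # 0.5
print(kl_gauss(mu_b, S_b, mu_a, S_a))  # 0.75: the divergence is asymmetric
```

Ranking all other affordances by this divergence from a given affordance yields its nearest-neighbor list; because the two directions differ, the neighbor relation need not be mutual.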
From Table 2 we can see that most of the affordances that we expected to be close to each other are in fact close. For example, objects that afford tool use are similar to objects that afford handle grasping, scraping, and stirring. Rolling is close to Lifting Top and Squeezing, Loop Grasping is close to Hanging and Drinking, and Cutting is close to Tools. Stacking is close to objects that afford Lifting Top and Putting, etc. The results clearly show that our approach can learn to relate affordances in a consistent and sensible manner.
One interpretation of the KL-divergence is the amount of information one learns about the true distribution from the information given by another distribution. In our context, this means how much one affordance says about the features that are important for another affordance. Learning to associate affordances implies learning the interrelation between similar affordances and the objects that make up the clusters of association. This deeper understanding is key to generalizing and abstracting affordances. Practically, this knowledge has the potential to help a robot perform an unknown action demonstrated by another actor. It can do this by analyzing the affordances of the object being manipulated and figuring out which features might be important based on what it has learned from other objects, effectively bootstrapping the learning process.
We started out with the simple notion of distance as a proxy for similarity. This guided us to learn a transform of the feature space that puts similar items close and dissimilar items far apart. Objects are usually similar in only a few aspects of their representation, and we therefore penalized the parts of the feature space that were not relevant for classifying to the affordance.
We analyzed the penalized transform to deduce the relevant features and provide a grounding of the affordances. Since part of the feature space was tied to a point-cloud representation, we could locate the parts of the objects that are important for classifying to an affordance. Our model is thus a proof of concept that a sensible approach to reasoning about similarity facilitates learning abstractions of categories without the need for pixel ground truths, pre-segmentation, or other cues and heuristics.
Furthermore, we showed that the model can learn to associate categories with each other. Instead of analyzing the transformed data, as is common, we analyzed the feature transforms themselves, computing distances between them, again using distance as a proxy for similarity. The key is the realization that the transform itself contains the information necessary to reason about the category. The learned similarities between the affordances proved to be sensible and gave insight into how an agent can learn to reason about categories.
The shortcomings of our model are obvious. Firstly, stacking designed features is not a viable option for a fully autonomous system; it will need to learn the features from the data. This implies that future work should focus on finding ways to analyze and compare activations in deep nets, e.g. Ku et al. (2017), either by developing retinotopic feedback loops similar to how human vision works or through other recurrent ways of learning abstractions, without the need for pixel-wise labeling. Furthermore, when creating these abstractions we need to understand to what degree we should mimic human capabilities, as this will be a crucial component in human-robot interaction.
Secondly, we showed that there is sufficient information in the shape of objects to ground the affordances. However, for a robot to gain a complete understanding of an affordance, it will have to interact with the objects and ground all observed sensorimotor input, both proprioceptive and exteroceptive. If we want grounding and abstraction to be as fluent and effortless as in humans, to enable high-level reasoning, future work needs to focus on building this knowledge in a holistic fashion.
- Barsalou (2008) L. W. Barsalou, Grounded Cognition, Annu. Rev. Psychol. 59 (2008) 617–645.
- Norman (2002) D. A. Norman, The Design of Everyday Things, Basic Books, 2002.
- Chao et al. (2011) C. Chao, M. Cakmak, A. L. Thomaz, Towards Grounding Concepts for Transfer in Goal Learning From Demonstration, in: ICDL, 2011, pp. 1–6.
- Griffith et al. (2010) S. Griffith, J. Sinapov, V. Sukhoy, A. Stoytchev, How to separate containers from non-containers? a behavior-grounded approach to acoustic object categorization, in: ICRA, 2010, pp. 1852–1859.
- Griffith et al. (2009) S. Griffith, J. Sinapov, M. Miller, A. Stoytchev, Toward interactive learning of object categories by a robot: A case study with container and non-container objects, in: ICDL, 2009, pp. 1–6.
- Högman et al. (2013) V. Högman, M. Björkman, D. Kragic, Interactive object classification using sensorimotor contingencies, in: IROS, 2013, pp. 2799–2805.
- Modayil and Kuipers (2008) J. V. Modayil, B. J. Kuipers, The Initial Development of Object Knowledge by a Learning Robot, Rob Auton Syst 56 (2008) 879–890.
- Montesano et al. (2008) L. Montesano, M. Lopes, A. Bernardino, J. Santos-Victor, Learning Object Affordances: From Sensory-Motor Coordination to Imitation, IEEE Trans Robot 24 (2008) 15–26.
- Niekum et al. (2012) S. Niekum, S. Osentoski, G. Konidaris, A. G. Barto, Learning and generalization of complex tasks from unstructured demonstrations, IROS (2012) 5239–5246.
- Sinapov and Staley (2013) J. Sinapov, K. Staley, Grounded Object Individuation by a Humanoid Robot, in: ICRA, 2013.
- Sinapov and Stoytchev (2008) J. Sinapov, A. Stoytchev, Detecting the functional similarities between tools using a hierarchical representation of outcomes, in: ICDL, 2008, pp. 91–96.
- Stoytchev (2005) A. Stoytchev, Behavior-Grounded Representation of Tool Affordances, in: ICRA, 2005, pp. 3060–3065.
- Sahin et al. (2007) E. Sahin, M. Cakmak, M. R. Dogar, E. Ugur, G. Ucoluk, To Afford or Not to Afford: A New Formalization of Affordances Toward Affordance-Based Robot Control, Adapt Behav 15 (2007) 447–472.
- Yuruten et al. (2013) O. Yuruten, E. Sahin, S. Kalkan, The learning of adjectives and nouns from affordance and appearance features, Adapt Behav 21 (2013) 437–451.
- Dehban et al. (2016) A. Dehban, L. Jamone, A. R. Kampff, J. e. S. Victor, Denoising Auto-Encoders for Learning of Objects and Tools Affordances in Continuous Space, in: ICRA, 2016, pp. 4866–4871.
- Gonçalves et al. (2014) A. Gonçalves, J. Abrantes, G. Saponaro, L. Jamone, A. Bernardino, Learning Intermediate Object Affordances: Towards the Development of a Tool Concept, in: ICDL-EpiRob, 2014, pp. 482–488.
- Penkov et al. (2017) S. Penkov, A. Bordallo, S. Ramamoorthy, Physical symbol grounding and instance learning through demonstration and eye tracking, in: ICRA, 2017, pp. 5921–5928.
- Ivaldi et al. (2014) S. Ivaldi, S. M. Nguyen, N. Lyubova, A. Droniou, V. Padois, D. Filliat, P.-Y. Oudeyer, O. Sigaud, Object Learning Through Active Exploration, IEEE Trans Auton Ment Dev 6 (2014) 56–72.
- Mar et al. (2015) T. Mar, V. Tikhanoff, G. Metta, L. Natale, Self-supervised learning of grasp dependent tool affordances on the iCub Humanoid robot, in: ICRA, 2015, pp. 3200–3206.
- Kraft et al. (2009) D. Kraft, R. Detry, N. Pugeault, E. Baseski, J. H. Piater, N. Krüger, Learning Objects and Grasp Affordances through Autonomous Exploration, in: ICVS, 2009, pp. 235–244.
- Modayil and Kuipers (2007) J. V. Modayil, B. J. Kuipers, Autonomous Development of a Grounded Object Ontology by a Learning Robot, in: AAAI, 2007.
- Pinto et al. (2016) L. Pinto, D. Gandhi, Y. Han, Y.-L. Park, A. Gupta, The Curious Robot: Learning Visual Representations via Physical Interactions, in: ECCV, 2016, pp. 3–18.
- Agrawal et al. (2016) P. Agrawal, A. Nair, P. Abbeel, J. Malik, S. Levine, Learning to Poke by Poking: Experiential Learning of Intuitive Physics, in: NIPS, 2016, pp. 5074–5082.
- Pinto and Gupta (2017) L. Pinto, A. Gupta, Learning to push by grasping: Using multiple tasks for effective learning, in: ICRA, 2017, pp. 2161–2168.
- Fritz et al. (2006) G. Fritz, L. Paletta, R. Breithaupt, E. Rome, Learning Predictive Features in Affordance based Robotic Perception Systems, in: IROS, 2006, pp. 3642–3647.
- Stark et al. (2008) M. Stark, P. Lies, M. Zillich, J. Wyatt, B. Schiele, Functional Object Class Detection Based on Learned Affordance Cues, in: ICVS, Springer Berlin Heidelberg, 2008, pp. 435–444.
- Montesano and Lopes (2009) L. Montesano, M. Lopes, Learning grasping affordances from local visual descriptors, in: ICDL, IEEE, 2009, pp. 1–6.
- Hermans et al. (2011) T. Hermans, J. M. Rehg, A. Bobick, Affordance prediction via learned object attributes, in: ICRA Workshop, 2011.
- Sun et al. (2010) J. Sun, J. L. Moore, A. Bobick, J. M. Rehg, Learning Visual Object Categories for Robot Affordance Prediction, Int J Rob Res 29 (2010) 174–197.
- Ye et al. (2017) C. Ye, Y. Yang, R. Mao, C. Fermüller, Y. Aloimonos, What can I do around here? Deep functional scene understanding for cognitive robots, in: ICRA, 2017, pp. 4604–4611.
- Myers et al. (2015) A. Myers, C. L. Teo, C. Fermüller, Y. Aloimonos, Affordance detection of tool parts from geometric features, in: ICRA, IEEE, 2015, pp. 1374–1381.
- Nguyen et al. (2016) A. Nguyen, D. Kanoulas, D. G. Caldwell, N. G. Tsagarakis, Detecting object affordances with Convolutional Neural Networks, in: IROS, 2016, pp. 2765–2770.
- Pieropan et al. (2013) A. Pieropan, C. H. Ek, H. Kjellström, Functional object descriptors for human activity modeling, in: ICRA, IEEE, 2013, pp. 1282–1289.
- Koppula et al. (2013) H. S. Koppula, R. Gupta, A. Saxena, Learning Human Activities and Object Affordances from RGB-D Videos, Int J Rob Res 32 (2013) 951–970.
- Gall et al. (2011) J. Gall, A. Fossati, L. Van Gool, Functional Categorization of Objects Using Real-Time Markerless Motion Capture, in: CVPR, 2011, pp. 1969–1976.
- Aksoy et al. (2011) E. E. Aksoy, A. Abramov, J. Dörr, K. Ning, B. Dellen, F. Wörgötter, Learning the semantics of object–action relations by observation, Int J Rob Res 30 (2011) 1229–1249.
- Kjellström et al. (2011) H. Kjellström, J. Romero, D. Kragic, Visual object-action recognition: Inferring object affordances from human demonstration, Comput Vis Image Underst 115 (2011) 81–90.
- Bütepage et al. (2018) J. Bütepage, H. Kjellström, D. Kragic, Classify, predict, detect, anticipate and synthesize: Hierarchical recurrent latent variable models for human activity modeling, arXiv (2018). arXiv:1809.08875v2.
- Wang et al. (2014) C. Wang, K. V. Hindriks, R. Babuska, Effective transfer learning of affordances for household robots, in: ICDL-EpiRob, IEEE, 2014, pp. 469–475.
- Thomaz and Cakmak (2009) A. L. Thomaz, M. Cakmak, Learning about objects with human teachers, in: HRI, 2009, pp. 15–22.
- Chu et al. (2016) V. Chu, T. Fitzgerald, A. L. Thomaz, Learning Object Affordances by Leveraging the Combination of Human-Guidance and Self-Exploration, in: HRI, IEEE Press, 2016, pp. 221–228.
- Hjelm et al. (2015) M. Hjelm, C. H. Ek, R. Detry, D. Kragic, Learning Human Priors for Task-Constrained Grasping., in: ICVS, 2015, pp. 207–217.
- Nagarajan et al. (2018) T. Nagarajan, C. Feichtenhofer, K. Grauman, Grounded Human-Object Interaction Hotspots from Video, arXiv (2018). arXiv:1812.04558v1.
- Do et al. (2018) T.-T. Do, A. Nguyen, I. D. Reid, AffordanceNet: An End-to-End Deep Learning Approach for Object Affordance Detection, in: ICRA, 2018, pp. 1–5.
- Abelha and Guerin (2017) P. Abelha, F. Guerin, Learning how a tool affords by simulating 3D models from the web, in: IROS, 2017, pp. 4923–4929.
- Nguyen et al. (2017) A. Nguyen, D. Kanoulas, D. G. Caldwell, N. G. Tsagarakis, Object-based affordances detection with Convolutional Neural Networks and dense Conditional Random Fields, in: IROS, 2017, pp. 5908–5915.
- Detry et al. (2017) R. Detry, J. Papon, L. Matthies, Semantic and Geometric Scene Understanding for Task-oriented Grasping of Novel Objects from a Single View, in: ICRA Workshop, 2017.
- Ferrari and Zisserman (2008) V. Ferrari, A. Zisserman, Learning Visual Attributes, in: NIPS, 2008, pp. 433–440.
- Lampert et al. (2009) C. H. Lampert, H. Nickisch, S. Harmeling, Learning to detect unseen object classes by between-class attribute transfer, in: CVPR, 2009, pp. 951–958.
- Farhadi et al. (2009) A. Farhadi, I. Endres, D. Hoiem, D. Forsyth, Describing objects by their attributes, in: CVPR, IEEE, 2009, pp. 1778–1785.
- Malisiewicz and Efros (2008) T. Malisiewicz, A. A. Efros, Recognition by association via learning per-exemplar distances., in: CVPR, 2008.
- Brooks (1990) R. A. Brooks, Elephants don’t play chess, Rob Auton Syst 6 (1990) 3–15.
- Ek et al. (2010) C. H. Ek, D. Song, K. Huebner, D. Kragic, Exploring Affordances in Robot Grasping Through Latent Structure Representation, ECCV (2010).
- Shannon (1948) C. E. Shannon, A Mathematical Theory of Communication, Bell System Technical Journal 27 (1948) 379–423.
- Sun et al. (2013) Y. Sun, L. Bo, D. Fox, Attribute based object identification, in: ICRA, IEEE, 2013, pp. 2096–2103.
- Baugh et al. (2012) L. A. Baugh, M. Kao, R. S. Johansson, J. R. Flanagan, Material evidence: interaction of well-learned priors and sensorimotor memory when lifting objects, J. Neurophysiol. 108 (2012) 1262–1269.
- Buckingham et al. (2009) G. Buckingham, J. S. Cant, M. A. Goodale, Living in A Material World: How Visual Cues to Material Properties Affect the Way That We Lift Objects and Perceive Their Weight, J. Neurophysiol. 102 (2009) 3111–3118.
- Fabbri et al. (2016) S. Fabbri, K. M. Stubbs, R. Cusack, J. Culham, Disentangling Representations of Object and Grasp Properties in the Human Brain, J Neurosci 36 (2016) 7648–7662.
- Feix et al. (2014) T. Feix, I. M. Bullock, A. M. Dollar, Analysis of human grasping behavior: correlating tasks, objects and grasps., IEEE Trans Haptics 7 (2014) 430–441.
- Martin (2007) A. Martin, The Representation of Object Concepts in the Brain, Annu. Rev. Psychol. 58 (2007) 25–45.
- Grafton (2010) S. T. Grafton, The cognitive neuroscience of prehension: recent developments, Exp Brain Res 204 (2010) 475–491.
- Jenmalm et al. (2000) P. Jenmalm, S. Dahlstedt, R. S. Johansson, Visual and Tactile Information About Object-Curvature Control Fingertip Forces and Grasp Kinematics in Human Dexterous Manipulation, J. Neurophysiol. 84 (2000) 2984–2997.
- Sartori et al. (2011) L. Sartori, E. Straulino, U. Castiello, How Objects Are Grasped: The Interplay between Affordances and End-Goals, PLoS One 6 (2011) 1–10.
- Rusu et al. (2009) R. B. Rusu, N. Blodow, M. Beetz, Fast Point Feature Histograms (FPFH) for 3D registration, in: ICRA, IEEE, 2009, pp. 3212–3217.
- Dalal and Triggs (2005) N. Dalal, B. Triggs, Histograms of Oriented Gradients for Human Detection, in: CVPR, 2005, pp. 886–893.
- Krüger et al. (2013) N. Krüger, P. Janssen, S. Kalkan, M. Lappe, A. Leonardis, J. H. Piater, A. J. Rodriguez-Sanchez, L. Wiskott, Deep Hierarchies in the Primate Visual Cortex - What Can We Learn for Computer Vision?, IEEE Trans Pattern Anal Mach Intell 35 (2013) 1847–1871.
- Torresani and Lee (2006) L. Torresani, K.-C. Lee, Large Margin Component Analysis, in: NIPS, 2006, pp. 1385–1392.
- Rennie and Srebro (2005) J. D. M. Rennie, N. Srebro, Fast Maximum Margin Matrix Factorization for Collaborative Prediction, in: ICML, 2005, pp. 713–719.
- Obozinski et al. (2009) G. Obozinski, B. Taskar, M. I. Jordan, Joint covariate selection and joint subspace selection for multiple classification problems, Stat Comput 20 (2009) 231–252.
- Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, É. Duchesnay, Scikit-learn: Machine Learning in Python, J Mach Learn Res 12 (2011) 2825–2830.
- Jia et al. (2014) Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. J. Darrell, Caffe: Convolutional Architecture for Fast Feature Embedding, in: Proc ACM Int Conf Multimed, ACM, New York, NY, USA, 2014, pp. 675–678.
- Ku et al. (2017) L. Y. Ku, E. G. L. Miller, R. A. Grupen, Associating grasp configurations with hierarchical features in convolutional neural networks, in: IROS, 2017, pp. 2434–2441.