When interacting with their environment, humans model the action possibilities directly in the product space of their own capabilities and the environment. This idea of the existence of an intuitive and perceptual representation of the possibilities in an environment is known as affordances.
In this paper, we propose an algorithmic framework to learn and encode such affordances from data. By modeling affordances as probability density functions conditioned on the environment and the kinematic state of the human, we are able to anticipate the human intention by maximum likelihood. This intention can then be combined with a full-body motion prediction system to produce accurate predictions as seen in Fig. 1.
In our experiments, grasp and place densities are defined over submanifolds of the hand's pose space. Placeability is defined over support planes, and grasps are defined over sphere surfaces around objects. Our models for each affordance derive from a common structure based on recurrent neural networks (RNNs) for modeling the human latent state and dedicated networks, i.e., Convolutional Neural Networks (CNNs), for modeling the environment. The models then combine environment and human latent spaces using fully connected layers to produce densities over the submanifolds. Note that we use mixtures to encode multi-modal densities, which is important for placeability.
Given a prediction for placements and grasps, we optimize a full-body movement with a nonlinear program, which accounts for obstacle and goalset constraints and models short-term movements with a data-driven dynamical system. To the best of our knowledge, this paper is the first to accurately leverage the 3D geometry of the environment for combined intention and motion prediction of full-body movements.
We gathered a dataset with 5 participants using a motion capture system. Affordances and short-term motion models were trained on this dataset. Our results demonstrate the superiority of our affordance densities for predicting placement and grasping locations. Finally, we show that combining goalset predictions and motion predictions performs comparably to using oracle goal locations.
II Related Work
II-A Intention and motion prediction
In prior work, graphical models such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) have been used to predict human motion or intention. For instance, Bennewitz et al. modeled human intention using HMMs in order to improve the navigation behavior of a mobile robot. Kulić et al. used HMMs to model full-body motion primitives and applied them to motion imitation. Elfring et al. used growing HMMs to learn humans' goal positions from data and a social-forces-based motion model to predict human motion. Koppula and Saxena focused on movement prediction using conditional random fields. While these approaches are sound, they generally do not scale to large databases of motion capture, or they are limited to predicting 2D motion of humans and do not deal with the full-body case.
II-B Affordances
The concept of affordances has its roots in psychology [2, 8] and relates to the action possibilities offered by a given environment to an animal or human. Jamone et al. present a survey on affordances in the fields of psychology, neuroscience, and robotics. The field of visual affordances treats learning affordances as a computer vision problem. Roy et al. use a Convolutional Neural Network based architecture to extract affordance segmentations in RGB images. Nguyen et al. model affordances using an autoencoder structure. Affordance learning has also been studied in robotics. For example, Montesano et al. use Bayesian networks to encode affordances and demonstrate how a humanoid robot can use them to interact with objects. For Human Robot Interaction, affordance models are used to model human action possibilities and to infer human intent [16, 7]. Koppula and Saxena define object affordances as potential functions depending on how the object will be interacted with.
In this paper, we design and implement a system to understand human object affordances in a real world table setup task performed in a motion capture environment. As we aim to use the affordance model in order to predict human motion, we use a probabilistic model. Given the human state and the scene context, it predicts a density of interaction possibilities for the corresponding affordance. In particular, we concentrate on graspability and placeability affordances, and model them using a probabilistic neural network framework.
II-C Neural Network Human Motion Prediction
Prior work on full-body human motion prediction has focused on recurrent neural network (RNN) architectures. Fragkiadaki et al. proposed a Long Short-Term Memory (LSTM) based model that is able to train across multiple subjects. Martinez et al. introduced a gated recurrent unit (GRU) based approach.
A residual connection forces the network to predict velocities and thus improves the generalization capability of the network. Pavllo et al. changed the joint angle representation to quaternions, which further improved the predictions. Recently, Wang and Feng introduced a position-velocity recurrent encoder-decoder model (VRED). Their model adds an additional velocity connection as an input to the GRU cell in the recurrent structure. Motion prediction approaches based on recurrent neural networks show good results on forecasting purely human motion. However, they do not handle environmental context, an issue we tackle in this paper by encoding the environment in our affordance model.
II-D Motion Optimization
Moreover, motion optimization techniques have been used for human motion synthesis. For example, Mordatch et al. use motion optimization approaches to synthesize realistic motions and animate human behavior [23, 24].
In our prior work we proposed to use motion optimization to improve short-term motion prediction [25, 3]. We built on the VRED model and used trajectory optimization to change the prediction in order to adapt to specific constraints. In this paper we use this method to predict full-body motion towards a goal state that is sampled from a separate affordance model. This makes it possible to take environmental context into account.
III Combined Intention and Full-body Motion Prediction
A schematic overview of the prediction system is shown in Figure 2. Using our captured motion database, we train probabilistic affordance models offline as described in Sections III-B and III-C. Additionally, we train a short-term full-body prediction model. While the affordance models are trained on human data and scene data, the full-body prediction model is trained on human data only. At prediction time, we first use the affordance prediction to extract a goal position. A trajectory optimizer then iteratively changes the trajectory predicted by the full-body model to adapt it to the goal position, as described in Section III-D. Finally, the prediction is returned.
We model affordances by building a relationship between agent and environment. We aim to find a probabilistic model for every object and action that gives us a probability distribution over interaction possibilities. For instance, the placeability model gives us a probability over possible place locations on the table, while the graspability model gives us a probability over possible wrist locations for grasping a jug.
For the full-body prediction we want to predict a future human trajectory over a fixed prediction horizon, based on a previously observed trajectory. We constrain the prediction so that the end state fulfills a sample from the corresponding affordance distribution. For example, the hand of the human should end up at the predicted grasp point or over the predicted place position.
III-B Placeability Affordance
We define the placeability affordance as a probability distribution over possible place locations on a surface. We model the placeability affordance using the neural network architecture shown in Figure 2(a). The inputs to the model are the human skeleton and object positions over a trajectory of 1 s (20 timesteps), a 14-dimensional one-hot encoding of both the object type the human has in the hand and the surface we compute the affordance for, and a grid that covers the plane state.
The network additionally takes plane features as input: a grid consisting of a binary occupancy map, the 2D position of the plane's reference frame, and a signed distance field (SDF) (see Figure 4).
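As an illustration, a signed distance field for such a binary occupancy map could be computed as in the sketch below. The sign convention (positive in free space, negative inside occupied cells) and the grid resolution are our assumptions, not details taken from the paper:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_field(occupancy):
    """Signed distance field over a binary occupancy grid.

    Positive values: distance from a free cell to the nearest occupied cell.
    Negative values: distance from an occupied cell to the nearest free cell.
    """
    free = occupancy == 0
    # distance_transform_edt gives, for each nonzero cell, the Euclidean
    # distance to the nearest zero cell of its input mask.
    return distance_transform_edt(free) - distance_transform_edt(~free)
```

For a 5x5 grid with a single occupied cell, the field is -1 at the occupied cell, 1 at its neighbors, and grows with Euclidean distance elsewhere.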
Placeability is fundamentally multi-modal. For instance, in our experiments we consider a table-setting scenario such as found in a home or restaurant: four people can sit at the table, so there are four possible locations where the human can place a plate.
A standard approach to modeling multi-modal distributions are Mixture Density Networks (MDN), which we make use of for modeling placement distributions:

p(y \mid x) = \sum_{k=1}^{K} \pi_k(x) \, \phi_k(y \mid x),

where K indicates the number of components in the mixture model and \pi_k(x) are the mixing coefficients. The \phi_k(y \mid x) are functions representing the conditional densities of the kernels.
We use multivariate Gaussian kernels with diagonal covariance. We use 7 kernels in the output, which gave good empirical results on our dataset. The network is trained using a negative log-likelihood (NLL) loss with the 2D place position on the surface as ground truth.
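As a concrete illustration of this NLL loss, the following minimal numpy sketch evaluates the negative log-likelihood of a single target under predicted mixture parameters with diagonal Gaussian kernels. Shapes and names (`pi`, `mu`, `sigma`) are hypothetical; the actual models are trained in Keras/TensorFlow:

```python
import numpy as np
from scipy.special import logsumexp

def mdn_nll(y, pi, mu, sigma):
    """NLL of target y under a mixture of K diagonal Gaussians.

    y: (D,) target, pi: (K,) mixing coefficients,
    mu: (K, D) component means, sigma: (K, D) component std deviations.
    """
    # Per-component diagonal-Gaussian log-density.
    log_comp = -0.5 * np.sum(((y - mu) / sigma) ** 2
                             + np.log(2 * np.pi * sigma ** 2), axis=1)
    # log-sum-exp over components for numerical stability.
    return -logsumexp(np.log(pi) + log_comp)
```

For a single standard-normal component evaluated at its mean, this reduces to the familiar 0.5 * log(2*pi) per dimension.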
Constraining affordances to free regions
We improve our placeability model with the intention of making it more robust against placing into regions where objects are already located. We consider two approaches to tackle the issue. In the penalty approach we modify the cost function to include a term penalizing placement in invalid regions using the value of the SDF map. In the transfer learning approach
we learn environment features related to plane occupancy separately. To achieve this, we build an autoencoder network with inputs being the 4 feature maps and the one-hot encoding vector. The encoder uses two convolutional layers with maxpooling to downsample, the decoder upsamples and uses three convolutional layers. It is trained to output the binary occupancy map of the plane after the object is placed on the plane. We train the autoencoder using a standard mean squared loss.
The pre-trained encoder model is connected to the main placeability model. The encoder weights are frozen when the overall placeability network is trained. The intuition here is that, with the autoencoder, we capture a latent representation that is distinctive for different occupancy-map configurations. With the pre-trained encoder network producing distinctive feature representations, the main model should learn not to predict outputs in invalid regions.
III-C Graspability Affordance
We model the graspability affordance as follows: given that the subject wants to grasp an object of a particular type from its current resting surface, predict the likelihood of the right wrist position for a successful grasp action. The model can then be queried for every object in the scene to get the complete dynamic mapping of the grasp affordance from a human's perspective.
The choice of the posterior distribution influences how the affordance is modeled. We compare two probability distributions: the Gaussian distribution and the von Mises-Fisher (vMF) distribution. The vMF distribution describes a probability distribution on a hypersphere, which could be useful because grasp points might lie on a sphere around the object.
The structure of our base graspability model is shown in Figure 2(b). An additional layer is appended at the end, depending on whether we model a Gaussian or a vMF distribution.
In the Gaussian network type, the wrist position is modeled as a 3D position in Euclidean space and the final layer of this network outputs the parameters of a Gaussian distribution with a diagonal covariance structure. We use an NLL cost function:

\mathcal{L}_{\mathrm{NLL}} = -\sum_{i} \log \mathcal{N}(y_i \mid \mu(x_i), \Sigma(x_i)),

with x_i being the data points, y_i the labels, and \Sigma(x_i) the diagonal covariance matrix.
The intention of the vMF formulation is to model grasp points as a distribution on a 2D manifold defined on the surface of a sphere:

f(x; \mu, \kappa) = C_p(\kappa) \exp(\kappa \mu^\top x),

where x lies on the unit hypersphere S^{p-1}, \mu is the mean direction with \|\mu\| = 1, and \kappa \geq 0 is the concentration parameter, which defines the spread of the distribution on the surface of the hypersphere in the direction of \mu. C_p(\kappa) is the normalizing constant given by

C_p(\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2} I_{p/2-1}(\kappa)},

where I_v denotes the modified Bessel function of the first kind at order v.
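The vMF log-density can be evaluated directly with SciPy's modified Bessel function. The sketch below assumes p = 3 (the unit sphere in 3D) and is an illustration, not the paper's implementation:

```python
import numpy as np
from scipy.special import iv

def vmf_logpdf(x, mu, kappa, p=3):
    """Log-density of the von Mises-Fisher distribution on S^{p-1}.

    x, mu: unit vectors of dimension p; kappa: concentration >= 0.
    """
    # log of the normalizing constant C_p(kappa).
    log_c = ((p / 2 - 1) * np.log(kappa)
             - (p / 2) * np.log(2 * np.pi)
             - np.log(iv(p / 2 - 1, kappa)))
    return log_c + kappa * np.dot(mu, x)
```

A quick sanity check: the log-density difference between the mean direction and its antipode is exactly 2 * kappa, and for p = 3 the normalizing constant reduces to the closed form kappa / (4 * pi * sinh(kappa)).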
We have 4 output neurons for the vMF model: 3 for the mean direction and 1 for the concentration parameter that defines the spread. Additionally, the output neurons corresponding to the mean direction should satisfy the unit-norm constraint. The network is trained on two loss functions simultaneously: the negative log-likelihood of the vMF distribution evaluated at the ground truth direction and a mean squared loss for the distance parameter.
III-D Full-Body Prediction
The goal of full-body prediction is to find a trajectory of future human states, given a trajectory of already observed states and our affordance model. For this purpose we use the trajectory prediction framework introduced in our prior work. The framework works in two phases: 1) Offline, a VRED model is trained to predict purely kinematic trajectories based only on human motion. 2) Online, trajectory optimization techniques are used to adapt to environmental objectives while staying close to the prediction. This is done by changing additional controls that are added to the VRED architecture. In this paper we use the low-level objective and the goalset objective:
The low-level objective, c_{\mathrm{low}} = \sum_t \|\delta_t\|^2, ensures that the deltas \delta_t are close to zero and therefore the deviation from what the network predicts is small.
The goalset objective, c_{\mathrm{goal}} = \|\phi(x_T) - g\|^2, optimizes the position of the hand of the human to end up close to the goal position g, with \phi being the forward kinematics map, mapping the last human state x_T to the hand position.
In order to account for our affordance model, we compute the expected place position from the affordance model and use it as the goal position. Thus, the trajectory will be optimized to end up at this position.
The gradient-based optimization algorithm L-BFGS is used to optimize the trajectory with a weighted sum of the low-level and goalset objectives as the loss.
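To illustrate how the two objectives interact under L-BFGS, the following simplified numpy/SciPy sketch optimizes additive state offsets directly. In the actual framework the deltas act as controls inside the VRED recurrence and the goalset term goes through the forward kinematics map; both simplifications here are our assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def adapt_trajectory(x_pred, goal, weight=0.1):
    """Adapt a predicted trajectory so its last state reaches `goal`,
    keeping the per-timestep offsets (deltas) close to zero."""
    T, D = x_pred.shape

    def loss(delta_flat):
        delta = delta_flat.reshape(T, D)
        low_level = weight * np.sum(delta ** 2)                  # stay near prediction
        goalset = np.sum((x_pred[-1] + delta[-1] - goal) ** 2)   # reach the goal
        return low_level + goalset

    res = minimize(loss, np.zeros(T * D), method="L-BFGS-B")
    return x_pred + res.x.reshape(T, D)
```

With a small weight on the low-level term, the optimizer moves the final state almost all the way to the goal while leaving earlier states nearly untouched.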
| Model | Train set NLL | Train set MSE | Test set NLL | Test set MSE |
| MDN + no CNN features | -3.1769 | 0.0184 | -2.6904 | 0.0239 |
The models were implemented using the Keras functional API with TensorFlow as backend. To implement the loss functions used to train our custom models, we used the TensorFlow Distributions package.
In our setup an OptiTrack (https://optitrack.com/) motion capture system was used. The environment has a total size of meters. The human subjects were asked to wear a motion capture suit with 50 markers attached to it. There are objects in the scene that are each fitted with markers for tracking and can be categorized into two types: The first type of objects are the ones that the users can directly interact with, such as cups, plates, a jug and a bowl. The second type of objects remain stationary in a given recording session and also act as supporting bodies over which the first type of objects can be placed, namely a table, a big shelf and a small shelf. We model affordances for the first type of objects. Participants were asked to perform tasks related to setting up the table and clearing it. In the collected data, the users were subject to two affordances, namely graspability and placeability.
A total of 5 users participated in the recording sessions, with each session being approximately 25 minutes long. We extracted a total of 1551 grasp-place sequences. For training the models, we split the data based on the subjects: we used data of 3 of the subjects for training and 2 for testing.
We computed results for different variants of our MDN networks, reporting the NLL loss and the mean squared error between the mean of the MDN and the ground truth. Results can be seen in Table I. The results are computed on place sequences extracted from the training data. The sequences include place actions for several planes, namely the table and the planes of the big and the small shelf. The baseline for the place affordance is based on a heuristic using the SDF and the distance map: it selects a valid point on the surface which is closest to the human and fits the object. The MDN with transfer learning is our full model. In the MDN+CNN variant we remove the autoencoder and replace it with two convolutional layers. In the MDN without CNN features no convolutional layers are used at all. It can be seen that the MDN using the transfer learning technique achieves the best performance on the test set.
In order to measure whether the model predicts to place into an invalid region (outside of the surface or on space occupied by another object), we additionally calculate the percentage of predictions that fall inside the valid region.
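This valid-placement metric could be sketched as follows; the grid convention (row-major occupancy map, `cell_size`, `origin`) is hypothetical and only meant to illustrate the computation:

```python
import numpy as np

def valid_region_rate(preds, occupancy, cell_size=0.05, origin=(0.0, 0.0)):
    """Fraction of predicted 2D place points that fall on a free cell
    of the surface's binary occupancy grid.

    Points outside the grid count as invalid (off-surface)."""
    ij = np.floor((preds - np.asarray(origin)) / cell_size).astype(int)
    h, w = occupancy.shape
    on_surface = ((ij[:, 0] >= 0) & (ij[:, 0] < h)
                  & (ij[:, 1] >= 0) & (ij[:, 1] < w))
    valid = np.zeros(len(preds), dtype=bool)
    valid[on_surface] = occupancy[ij[on_surface, 0], ij[on_surface, 1]] == 0
    return valid.mean()
```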
Figure 5 shows the performance of the different networks over time before the placement happens. Without 2D plane features, the performance of the model is significantly lower compared to the approaches with the CNN network added. It can be seen that the models using the transfer learning or penalty approaches significantly outperform the models without these modifications on the valid placement rates as well as on the Euclidean distance. This holds especially when the subject is farther away from the plane, which is when the prediction of the affordance is most useful.
The transfer learning approach improves the result by a consistent margin along the time axis. This is because our autoencoder model learned a distinctive latent space representation from the CNN features: it was trained to produce a binary occupancy map that exists in the same space as the input feature maps.
The pre-trained encoder part produces distinctive features for the two plane occupancy configurations, thereby forcing the model to predict in the free regions. While the MSE for the model with transfer learning and penalty is about the same as without the penalty, training with the penalty term slightly improves the valid-region metric for all timesteps.
An interesting aspect of placeability is the uncertainty in the network predictions at different time instances before the object is placed. This can be assessed by inspecting the mixture components predicted by the MDN network and the corresponding density. Figure 6 visualizes this on a test set example for placing a cup on the table at three time instances. When the subject is far away from the table, there are multiple potential placeable regions; as the subject moves towards the table, the uncertainty reduces and confines to one most likely dense region.
We compare the MSE for the vMF model, the Gaussian model and a baseline. The results can be seen in Table II. The baseline for the grasp affordance is based on maximum likelihood: for all combinations of object types and surfaces, the mean distance of the wrist from the object being grasped is computed. During inference, the unit vector from the object along the direction of the right wrist is calculated, and the grasp position is computed using this vector and the corresponding mean distance.
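This baseline can be sketched in a few lines; names and the exact frame conventions are our assumptions:

```python
import numpy as np

def baseline_grasp_point(obj_pos, wrist_pos, mean_dist):
    """Heuristic grasp prediction: step from the object centre toward the
    current wrist position by the mean grasp distance for this object type."""
    direction = wrist_pos - obj_pos
    direction = direction / np.linalg.norm(direction)
    return obj_pos + mean_dist * direction
```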
For calculating the MSE with the Gaussian model, the output parameters of the network that correspond to the 3D mean are taken as the prediction point and compared against the ground truth grasp point. For the vMF model, we calculate the 3D position from the predicted mean direction and distance and consider it as the prediction point. Table II shows the best results obtained from the networks. The results are computed at 1 s before grasping.
|Models||Train Set MSE||Test Set MSE|
By observing the MSE, we can see that both neural network models beat the baseline by a significant margin. The Gaussian model achieves slightly better results than the vMF model on the MSE. However, the models differ in the manner of uncertainty estimation. With the Gaussian model, we get a spherical covariance structure indicating the confidence interval around the mean position. This interval gives the possible locations in 3D space where the human wrist should be positioned in order to grasp the object. In the case of the vMF model, uncertainty is defined on a 2D manifold, i.e., the surface of a sphere with its center at the object centroid and radius being the predicted distance. The output vMF parameters inform on the direction and spread of the distribution on this manifold. Since this gives a density on a surface around the object, it reflects the possible approach angle of the wrist for a successful grasp.
IV-D Full-Body Prediction
In order to test the full-body prediction we use the prediction framework introduced in our prior work. We train the position-velocity model (VRED) on the training set. From the test data we extract 27 trajectories for placing on the table. We use the place affordance model and extract the expected place point from the MDN. We predict 1.5 s of motion and optimize the prediction to end up above this point. Table III shows the distance to the ground truth at different times in the future for our method and several baselines. In the first part of the table the sum over distances of key joints (wrists, elbows, knees, ankles and pelvis) is shown; in the second part only the distance of the wrist to the ground truth is shown. Values are averaged over the 27 trajectories.
The zero-velocity baseline keeps the current state as the prediction for all future timesteps. The VRED baseline simply unrolls the recurrent neural network. Our method takes the affordance prediction into account and optimizes the trajectory to end up at the predicted place point. The oracle additionally has oracle information about the true end position of the wrist.
It can be seen that the oracle prediction performs best, which is not surprising, as it uses information that is not available at prediction time. Our method using the predicted place point performs second best and outperforms the prediction without any optimization at all time steps.
Figure 7 shows an example trajectory for predicting motion to place a cup. The top row shows our method, the bottom row shows an uninformed prediction using VRED. It can be seen that our method is very close to the ground truth, while the uninformed baseline predicts that the human only moves forward a bit and then keeps its position.
We presented a system to learn human object affordances for human motion prediction. We demonstrate that the method can be used to predict full-body trajectories.
A user study was conducted to collect a dataset in a motion-capture setup on a table setup task, in which the actors were subject to two affordances, namely graspability and placeability.
We modeled the two affordances as conditional probability distributions using deep learning methods that capture the implicit uncertainty. For the grasp affordance we use a vMF model; the uncertainty encodes the possible approach angles of the human hand for a successful grasp action. The place affordance was modeled with an MDN; the uncertainty is encoded as possible regions on the surface where the object can be placed.
Testing within our experimental framework shows the effectiveness of the proposed method. Furthermore, our experiments demonstrate that the affordances can be used to improve full-body motion prediction within a state-of-the-art motion prediction framework.
This work was funded by the University of Stuttgart and the regional research alliance of Baden-Württemberg “System Mensch” funded by the German Federal Ministry for Science, Research and Arts. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Philipp Kratzer.
-  H. Wang and J. Feng, “Vred: A position-velocity recurrent encoder-decoder for human motion prediction,” arXiv preprint arXiv:1906.06514, 2019.
-  J. J. Gibson, “The senses considered as perceptual systems.” 1966.
-  P. Kratzer, M. Toussaint, and J. Mainprice, “Prediction of human full-body movements with motion optimization and recurrent neural networks,” in IEEE Int. Conf. Robotics And Automation (ICRA), 2020.
-  M. Bennewitz, W. Burgard, G. Cielniak, and S. Thrun, “Learning motion patterns of people for compliant robot motion,” The Int. Journal of Robotics Research, vol. 24, no. 1, pp. 31–48, 2005.
-  D. Kulić, C. Ott, D. Lee, J. Ishikawa, and Y. Nakamura, “Incremental learning of full body motion primitives and their sequencing through human motion observation,” The Int. Journal of Robotics Research, vol. 31, no. 3, pp. 330–345, 2012.
-  J. Elfring, R. Van De Molengraft, and M. Steinbuch, “Learning intentions for improved human motion prediction,” Robotics and Autonm. Systems, vol. 62, no. 4, pp. 591–602, 2014.
-  H. S. Koppula and A. Saxena, “Anticipating human activities using object affordances for reactive robotic response,” IEEE Trans. on Pattern Analysis and Machine Intell., vol. 38, no. 1, pp. 14–29, 2016.
-  J. Gibson, The Ecological Approach to Visual Perception, ser. Resources for ecological psychology. Lawrence Erlbaum Associates, 1979.
-  L. Jamone, E. Ugur, A. Cangelosi, L. Fadiga, A. Bernardino, J. Piater, and J. Santos-Victor, “Affordances in psychology, neuroscience, and robotics: A survey,” IEEE Trans. on Cognitive and Developmental Systems, vol. 10, no. 1, pp. 4–25, 2018.
-  M. Hassanin, S. Khan, and M. Tahtali, “Visual affordance and function understanding: A survey,” 2018.
-  A. Roy and S. Todorovic, “A multi-scale cnn for affordance segmentation in rgb images,” in European Conf. on Computer Vision (ECCV), vol. 9908, 2016, pp. 186–201.
-  A. Nguyen, D. Kanoulas, D. G. Caldwell, and N. G. Tsagarakis, “Detecting object affordances with convolutional neural networks,” in IEEE/RSJ Int. Conf. on Intel. Robots And Systems (IROS), 2016, pp. 2765–2770.
-  L. Montesano, M. Lopes, A. Bernardino, and J. Santos-Victor, “Learning object affordances: From sensory–motor coordination to imitation,” IEEE Trans. Robotics, vol. 24, no. 1, pp. 15–26, 2008.
-  A. Gonçalves, G. Saponaro, L. Jamone, and A. Bernardino, “Learning visual affordances of objects and tools through autonomous robot exploration,” in IEEE Int. Conf. on Autonm. Robot Systems and Competitions (ICARSC), 2014, pp. 128–133.
-  A. Dehban, L. Jamone, A. Kampff, and J. Santos-Victor, “Denoising auto-encoders for learning of objects and tools affordances in continuous space,” in IEEE Int. Conf. Robotics And Automation (ICRA), 2016, pp. 4866–4871.
-  H. S. Koppula, R. Gupta, and A. Saxena, “Learning human activities and object affordances from rgb-d videos,” The Int. Journal of Robotics Research, vol. 32, no. 8, pp. 951–970, 2013.
-  K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik, “Recurrent network models for human dynamics,” in IEEE Int. Conf. on Computer Vision (ICCV), 2015, pp. 4346–4354.
-  J. Martinez, M. J. Black, and J. Romero, “On human motion prediction using recurrent neural networks,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
-  D. Pavllo, C. Feichtenhofer, M. Auli, and D. Grangier, “Modeling human motion with quaternion-based neural networks,” Int. Journal of Computer Vision, pp. 1–18, 2019.
-  E. Todorov and W. Li, “A generalized iterative lqg method for locally-optimal feedback control of constrained nonlinear stochastic systems,” in American Control Conference (ACC). IEEE, 2005, pp. 300–306.
-  N. Ratliff, M. Zucker, J. A. Bagnell, and S. Srinivasa, “Chomp: Gradient optimization techniques for efficient motion planning,” in IEEE Int. Conf. Robotics And Automation (ICRA). IEEE, 2009, pp. 489–494.
-  M. Toussaint, “Newton methods for k-order markov constrained motion problems,” arXiv preprint arXiv:1407.0414, 2014.
-  I. Mordatch, E. Todorov, and Z. Popović, “Discovery of complex behaviors through contact-invariant optimization,” ACM Trans. on Graphics, vol. 31, no. 4, pp. 1–8, 2012.
-  I. Mordatch, J. M. Wang, E. Todorov, and V. Koltun, “Animating human lower limbs using contact-invariant optimization,” ACM Trans. on Graphics, vol. 32, no. 6, pp. 1–8, 2013.
-  P. Kratzer, M. Toussaint, and J. Mainprice, “Towards combining motion optimization and data driven dynamical models for human motion prediction,” in IEEE-RAS Int. Conf. on Humanoid Robots (Humanoids). IEEE, 2018, pp. 202–208.
-  C. M. Bishop, “Mixture density networks,” 1994.
-  D. A. Nix and A. S. Weigend, “Estimating the mean and variance of the target probability distribution,” in IEEE Int. Conf. on Neural Networks (ICNN), vol. 1, 1994, pp. 55–60 vol.1.
-  R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, “A limited memory algorithm for bound constrained optimization,” SIAM Journal on Scientific Computing, vol. 16, no. 5, pp. 1190–1208, 1995.
-  F. Chollet et al., “Keras,” https://keras.io, 2015.
-  M. Abadi et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” 2016.
-  J. V. Dillon et al., “Tensorflow distributions,” 2017.