Object affordances describe potential interactions with an object [gibson1979ecological]. They play an important role in human perception of real-world objects with tremendous variance in appearance [dicarlo2012does]. Object affordances are closely related to the functionality of an object. For example, a chair affords the functionality of sitting; thus, it possesses the sitting affordance. For robots operating in unstructured environments and interacting with previously unseen objects, e.g., personal robots, the understanding of object affordances is highly desirable. It allows robots to assess unseen objects and understand how to interact with them efficiently and intelligently, which further opens up the possibility of more natural and intelligent human-robot interactions. Reasoning about object affordances can also help robots discover the potential of an object to afford a functionality even when it is not a typical instance of the object class to which the functionality is related (e.g., the improvised chair assembled from books and boxes in Fig. 5(d)). Independently identifying affordances represents a higher level of robot intelligence.
To address the problem of affordance reasoning in robotics, we introduced in our previous work [wu2020chair] the interaction-based definition (IBD), which defines objects from the perspective of interactions instead of appearances. In particular, the IBD defines a chair as “an object which can be stably placed on a flat horizontal surface in such a way that a typical human is able to sit (to adopt or rest in a posture in which the body is supported on the buttocks and thighs and the torso is more or less upright) stably above the ground”.
In this paper, we go beyond our previous work. We propose a novel method for robots to imagine how to sit on an unseen chair, i.e., to imagine the sitting pose for the chair. We also develop a robotic system, consisting of a Franka Emika Panda robot arm and a NAO humanoid robot, to actualize the understanding of the sitting affordance by autonomously putting a teddy bear on the chair (Fig. 1). To “put the bear on the chair”, the robot needs to carry the bear to the chair and seat the bear on it. To accomplish this task, the robot first needs to understand the sitting affordance of the chair, i.e., how the chair can be sat on. The robot also needs to plan a whole-body motion to put the bear on the chair while satisfying several constraints on kinematics, stability, and collision. Moreover, if the chair is in an inaccessible pose (e.g., when the chair is facing a wall), the robot should be able to reason about the accessibility and understand how to make the chair accessible. To this end, we introduce a human-robot interaction (HRI) framework with closed-loop visual feedback that enables the robot to instruct a human to rotate the chair and make it accessible for placing the bear. Fig. 2 shows the pipeline of our method; details can be found in Sec. IV. Results show that our method enables the robot to autonomously put the bear on 12 previously unseen chairs with diverse shapes and appearances in 72 trials, including both accessible and inaccessible chair poses, with a high success rate. Our HRI framework is also shown to be very effective: it successfully changes the accessibility of the chair for placing the bear in 36 trials with inaccessible poses. To our knowledge, this work is the first to physically seat an articulated agent on an unseen chair in the real world. The aim of this work is to examine how well robots are able to accomplish this challenging task by leveraging different components of robotics.
We envision promising future applications of our method for robots operating in household environments and interacting with humans.
This work differs from our previous work [wu2020chair] in several ways. Rather than classifying whether an unseen object is a chair or not, in this work we assume the given object is known to be a chair a priori, as this can be determined with our previous work. The robot instead imagines how to sit on the chair, i.e., the sitting pose. Also, unlike [wu2020chair], which focuses on perceiving the object affordance without real-robot experiments, this work goes further by actualizing the understanding of affordances with physical experiments. The main contributions of the paper are as follows:
A method for robots to imagine the sitting pose for a previously unseen chair.
A human-robot interaction framework with closed-loop visual feedback which enables the robot to reason about the accessibility of a chair and interact with a human to change the accessibility if necessary.
A robotic system which is able to autonomously put a teddy bear on a previously unseen chair for sitting.
II Related Work
Object Affordance of Chairs. The class of chairs is an important object class in human life, and the exploration of the sitting affordance has become popular in recent decades [hinkle2013predicting, grabner2011makes, seib2016detecting, wu2020chair, bar2006functional]. Hinkle and Olson [hinkle2013predicting] simulate dropping spheres onto objects and classify objects into containers, chairs, and tables based on the final configuration of the spheres. Grabner et al. [grabner2011makes] fit a humanoid mesh onto an object mesh and exploit the vertex distance and the triangle intersection to evaluate the object affordance as a chair for chair classification. Instead of object classification, we use object affordances to enable robots to understand how to interact with a chair, and we showcase this understanding with real-robot experiments.
There is growing interest in reasoning about object affordances in the fields of computer vision [myers2015affordance, sawatzky2017weakly, roy2016multi, ruiz2020geometric, zhu2014reasoning, ho1987representing, desai2013predicting] and robotics [stoytchev2005behavior, do2018affordancenet, chu2019learning, wu2020can, abelha2017learning, piyathilaka2015affordance]. Stoytchev [stoytchev2005behavior] introduces an approach to ground tool affordances by dynamically applying different behaviors from a behavioral repertoire. [do2018affordancenet, chu2019learning, sawatzky2017weakly, roy2016multi, desai2013predicting] use convolutional neural networks (CNNs) to detect regions of affordance in an image. Ruiz and Mayol-Cuevas [ruiz2020geometric] predict affordance candidate locations in environments via the interaction tensor. In contrast to detecting different types of affordances, we focus on understanding the sitting affordance and use it for real-robot experiments.
Physics Reasoning. Physical reasoning has also been introduced into the study of object affordances [wu2020chair, wu2020can, battaglia2013simulation, zhu2015understanding, zhu2016inferring, kunze2017envisioning]. Battaglia et al. [battaglia2013simulation] introduce the concept of an intuitive physics engine which simulates physics to explain human physical perception of the world. Zhu et al. [zhu2016inferring] employ physical simulations to infer forces exerted on humans and learn human utilities from videos. The digital twin [boschert2016digital] makes use of physical simulations to optimize operations and predict failures in manufacturing. In contrast, we do not infer physics or use it to predict outcomes; instead, we exploit physics to imagine potential interactions with objects in order to perceive object affordances.
III-A Problem Formulation
Throughout the paper, we assume the given previously unseen chair is upright and that the back of the agent (or the bear) is supported while sitting. The virtual agent in the imagination is an articulated humanoid body (Fig. 3(a)). To put the agent (the bear) on the chair, we want to find 1) the sitting pose (p, R), where p and R specify the position and rotation of the agent’s (the bear’s) base link in the world frame, and 2) the joint angles of the agent (the bear). However, the joints of the bear can be considered fixed (the teddy bear is a plush toy; its joints are not rigidly fixed but have large damping coefficients) and its joint angles are already close to those of a sitting configuration (Fig. 1(b)). Thus, in this paper, we simplify the problem to just finding the sitting pose (p, R).
According to the interaction-based definition (IBD) of chairs (Sec. I), the torso is more or less upright when sitting. Thus, we further simplify the problem by restricting R = Rz(θ) R0, where Rz(θ) denotes the rotation about the z-axis of the world frame by the yaw angle θ of the base link, and R0 denotes the initial rotation which sets the agent to an upright sitting configuration with its face towards the x-axis of the world frame. That is, given an unseen chair, the problem becomes that of finding the position p and the yaw angle θ of the base link in the world frame for sitting. We denote the direction indicated by θ in the xy-plane as the sitting direction. The base link of the agent is its pelvis link.
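Under this restriction, the search space reduces to a position and a single yaw angle. A minimal sketch of how the sitting direction follows from the yaw angle (the function names and the symbols p, θ are our notation for illustration, not from the original system):

```python
import math

def sitting_direction(yaw):
    """Unit vector in the xy-plane indicated by the yaw angle (radians)."""
    return (math.cos(yaw), math.sin(yaw))

def wrap_angle(a):
    """Wrap an angle into (-pi, pi] for comparing yaw angles."""
    return math.atan2(math.sin(a), math.cos(a))
```
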
III-B Revisiting the Sitting Affordance Model (SAM)
In the sitting imagination (Sec. III-C), we simulate a virtual humanoid agent sitting on the chair in physical simulations. The configuration of the agent at the end of the simulation is denoted the resultant configuration Qr. The sitting affordance model (SAM) [wu2020chair] evaluates whether Qr is a correct sitting by comparing it with a key configuration Qk. We briefly review SAM here; further details can be found in [wu2020chair].
The evaluation is based on four criteria. 1) Joint Angle Score. The joint angles of a configuration can be described with a vector whose length equals the total number of the agent’s joints. The joint angle score aggregates the weighted differences between the joint angle vectors of Qr and Qk, with each joint assigned its own weight. Lower is better. 2) Link Rotation Score. According to the IBD, the torso is more or less upright. Thus, SAM considers the link rotations in the world frame when evaluating Qr. The link rotation score aggregates the weighted deviations between the z-axis unit vectors of the corresponding link frames in Qr and Qk. Lower is better. 3) Sitting Height. Sitting height is also an important factor in sitting, and SAM takes it into account in the evaluation of Qr. 4) Contact. SAM also counts the number of contact points of the agent’s links. The numbers of contact points of all the body links can be described with a vector whose length equals the total number of links.
Qr is classified as a correct sitting if four threshold conditions are all satisfied: the joint angle score and the link rotation score are below their respective thresholds, the sitting height is above its threshold, and the contact criterion holds. The contact criterion is a binary function which outputs 1 if 1) the total number of contact points is larger than a threshold and 2) the numbers of contact points for the lower and upper body links are both larger than zero, and 0 otherwise.
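The four criteria combine into a single binary check. A minimal sketch under our assumptions (the argument and threshold names are ours; the comparison directions follow the “lower is better” / sitting-height remarks above, and the actual thresholds come from calibration):

```python
def is_correct_sitting(joint_score, rot_score, height,
                       lower_contacts, upper_contacts,
                       t_joint, t_rot, t_height, t_total_contacts):
    """SAM-style check of whether a resultant configuration is a correct sitting."""
    # Scores: lower is better; sitting height: higher is better.
    contact_ok = (lower_contacts + upper_contacts > t_total_contacts
                  and lower_contacts > 0 and upper_contacts > 0)
    return (joint_score < t_joint and rot_score < t_rot
            and height > t_height and contact_ok)
```
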
III-C Sitting Imagination
We physically simulate the agent sitting on the chair in different rotations to find the position and yaw angle for sitting (Fig. 3). Given the 3D model of the chair, we first compute the minimum volume oriented bounding box (OBB) [trimesh]. As in [wu2020chair], we then apply a rigid body transformation to the model (Fig. 3(b)): a horizontal translation that aligns the OBB center with the origin of the world frame in the xy-plane, and a rotation about the z-axis that aligns the OBB with the coordinate system of the world frame. We apply this transformation because we notice that the back of the chair heuristically coincides with one of the OBB faces, which benefits the finding of correct sittings in the imagination. After applying the transformation, we attach the world frame as the body frame of the chair.
The rotation of the chair in the simulation is enumerated over a set of yaw angles about the z-axis of the world frame (Fig. 3(c)). We drop the agent from above the chair to simulate sitting (Fig. 3(d)). Unlike [wu2020chair], which simulates drops for different chair rotations one by one, we simulate them simultaneously to reduce the runtime. Before the drop, the agent is set to a pre-sitting configuration facing the x-axis (Fig. 3(a)). The base link of the agent is placed on a plane 15 cm above the chair OBB. For each rotation, we first sample three positions on the plane to drop: the origin and two positions translated by a distance d along the positive and negative x-axis from the origin, respectively, where d is scaled linearly with respect to the size of the OBB. If no more than one correct sitting is found for all rotations, four extra positions farther from the origin along the x-axis (two on each side) are sampled to drop. The reason we start sampling drops around the origin of the plane, which is aligned horizontally with the center of the OBB, is that most chairs have their seats positioned close to the center of the OBB. However, for some chairs, the seats are close to the OBB periphery. Thus, if not enough correct sittings can be found, our search expands towards the periphery. In total, for a chair, we simulate 24 drops if no extra drops are needed and 56 drops otherwise.
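The drop sampling can be sketched as follows. The figure of 8 enumerated rotations follows from the stated totals (24 = 8 × 3 and 56 = 8 × 7 drops), while the exact offsets of the four extra positions are our illustrative assumption:

```python
def drop_x_offsets(d, expand=False):
    """x-offsets of drop positions on the plane 15 cm above the chair OBB.
    d scales linearly with the OBB size."""
    base = [0.0, d, -d]  # initial samples around the OBB center
    if not expand:
        return base
    # Four extra samples towards the periphery (offsets assumed for illustration).
    return base + [2 * d, -2 * d, 3 * d, -3 * d]

def total_drops(n_rotations=8, expand=False):
    """Total simulated drops for one chair."""
    return n_rotations * len(drop_x_offsets(1.0, expand))
```

The totals reproduce the counts stated in the text: 24 drops without expansion and 56 with it.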
The rotation with the largest number of correct sittings is selected as the best rotation for sitting. If more than one rotation has the largest number (Fig. 3(f)(g)(h)), we select the one with the smallest averaged SAM score (lower is better). The sitting pose of the agent in the world frame is then obtained by mapping the weighted averages of the agent’s position and yaw angle, taken relative to the chair frame over all the correct sittings at the best rotation, back through the inverse of the rigid body transformation applied to the chair model. The weight of each sitting is the reciprocal of its SAM score.
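The score-weighted averaging can be sketched as below (a hypothetical helper; we average the yaw on the unit circle to avoid wrap-around, which the original implementation may handle differently):

```python
import math

def averaged_sitting_pose(sittings):
    """sittings: list of (x, y, yaw, score) for the correct sittings at the
    best rotation; weights are reciprocals of the SAM score (lower is better)."""
    weights = [1.0 / s for (_, _, _, s) in sittings]
    total = sum(weights)
    x = sum(w * p[0] for w, p in zip(weights, sittings)) / total
    y = sum(w * p[1] for w, p in zip(weights, sittings)) / total
    # Weighted circular mean of the yaw angles.
    c = sum(w * math.cos(p[2]) for w, p in zip(weights, sittings))
    s = sum(w * math.sin(p[2]) for w, p in zip(weights, sittings))
    return x, y, math.atan2(s, c)
```
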
III-D Motion Planning
Optimal Control Module. We assume the motion of placing the bear is quasi-static. Due to the complexity of the motion constraints, we formulate the planning of this motion as a trajectory optimization problem [han2020can] and use the direct collocation method [tsang1975optimal] to solve it. The decision variables are the state of the system (the joint angles) and the control inputs at each time interval. The quasi-static assumption simplifies the state transition as Eq. (4), and Eq. (5) limits the magnitude of the control inputs to satisfy the quasi-static assumption. Eq. (7) confines the horizontal projection of the robot’s center of mass (COM) to lie within the supporting polygon formed by the feet to ensure the stability of the robot. Eq. (9) ensures that the trajectory is collision-free.
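The stability constraint of Eq. (7) amounts to a point-in-polygon test on the COM’s horizontal projection. A minimal sketch assuming a convex support polygon with counter-clockwise vertices (the function is illustrative, not the paper’s implementation):

```python
def com_inside_support(com_xy, polygon):
    """Return True if the COM horizontal projection lies inside the convex
    support polygon (vertices ordered counter-clockwise)."""
    n = len(polygon)
    for i in range(n):
        ax, ay = polygon[i]
        bx, by = polygon[(i + 1) % n]
        # The cross product must be non-negative for points inside or on an edge.
        if (bx - ax) * (com_xy[1] - ay) - (by - ay) * (com_xy[0] - ax) < 0:
            return False
    return True
```
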
The initial configuration is a pre-defined standing posture (Fig. 4(a)). The goal configuration is generated via a constrained optimization [han2020can]. The cost (Eq. (10)) aims to minimize 1) the distance between the horizontal projection of the COM and the center of the supporting polygon and 2) the bending angle of the torso. A less bent torso exerts less torque on the motors, which makes uprighting the body easier after placing the bear (Fig. 4(b)). Eq. (13) uses the forward kinematics of the robot to ensure that the robot reaches the goal pose such that the bear sits at the imagined sitting pose.
SE(2) Planning Module. We use RRTConnect [kuffner2000rrt] in the OMPL library [sucan2012the-open-motion-planning-library] to plan the trajectory for the robot to carry the bear to the chair. The robot is encapsulated with an ellipsoid for collision checking with the FCL library [pan2012fcl]. The setting is shown in Fig. 4(c).
IV System Pipeline
Fig. 2 shows the pipeline of our method. The robot arm first scans and reconstructs the 3D model of the chair (Sec. V-A). Then, the sitting imagination is conducted to find the imagined sitting pose (Sec. III-C). With the imagined sitting pose, we first determine the goal pose for the NAO to walk to and place the bear. The rotation of the goal pose is set such that the NAO faces opposite to the sitting direction. The position of the goal pose lies on a horizontal ray which originates from the projection of the sitting position onto the xy-plane and points along the sitting direction. It is initially set such that the NAO stands a preset distance away from the sitting position. If the NAO is in collision with the chair at the goal pose, we move the goal pose along the ray until it is collision-free. After that, if the distance between the NAO and the sitting position is too large, we move the sitting position horizontally along the ray to bring it closer to the robot due to the robot’s workspace constraint. We then use the optimal control module to pre-plan the motion to place the bear and check the validity of the goal pose. To reduce the planning time, the motion is simplified as a bilaterally symmetric motion, i.e., the motion of the left half of the body is symmetric to that of the right half. At the beginning of the motion, the bear is held in the hands facing the NAO (Fig. 4(a)); thus, the bear faces the sitting direction. We restrict the motion such that only the pitch joints are activated, in order to maintain the facing direction of the bear throughout the motion.
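The goal-pose search along the sitting-direction ray can be sketched as follows (the standoff, step, and collision predicate are hypothetical parameters; the actual system checks collision against the reconstructed chair model):

```python
import math

def find_goal_position(sit_xy, yaw, standoff, step, in_collision, max_steps=50):
    """Walk outward along the ray from the sitting position in the sitting
    direction until a collision-free goal position is found."""
    dx, dy = math.cos(yaw), math.sin(yaw)
    dist = standoff
    for _ in range(max_steps):
        goal = (sit_xy[0] + dist * dx, sit_xy[1] + dist * dy)
        if not in_collision(goal):
            return goal
        dist += step
    return None  # no collision-free goal within the search budget
```
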
The trajectory to walk to the goal pose is then planned with the SE(2) planning module. If the goal pose is out of the planning arena or blocked by obstacles, no plan will be made. In this case, the NAO gives a language instruction to the human to rotate the chair about the vertical axis such that the sitting direction points towards the NAO with no obstacles in between (see HRI in Fig. 2). The instruction is generated from a template: “Please rotate the chair about the vertical axis for [angle] degrees”, where the angle is the multiple of 30 degrees closest to the precise rotation angle that would make the sitting direction point towards the NAO (we use multiples of 30 degrees instead of the precise angle because they are easier for humans to understand and act on in the HRI). The pose of the chair during the interaction is tracked by the iterative closest point (ICP) algorithm [besl1992ICP]. After the interaction, the imagined sitting pose and the goal pose are updated by the transformation the chair underwent. The planning module then tries to plan a trajectory to the updated goal pose. This process is repeated until a valid trajectory is found. In practice, we regard the trial as a failure if no trajectory can be found after three interactions.
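Snapping the required rotation to the nearest multiple of 30 degrees for the instruction template can be sketched as follows (the function name is ours):

```python
def instruction_angle(precise_deg):
    """Snap a precise rotation angle (degrees) to the closest multiple of 30,
    as used in the language instruction template."""
    return int(round(precise_deg / 30.0)) * 30
```
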
After a valid trajectory is found, the NAO is controlled to walk and follow the waypoints along the trajectory via a PID controller. The walking motion is controlled by the NAOqi SDK (https://developer.softbankrobotics.com/nao6/naoqi-developer-guide/naoqi-apis). The bear is passed to the NAO manually before it starts walking. When the NAO arrives at the goal pose, the optimal control module plans a whole-body motion to place the bear based on the robot’s current pose. Finally, the robot executes the motion, releases the bear, and uprights its body (Fig. 4(b)).
Fig. 1(a) shows the experiment setup. A PrimeSense Carmine 1.09 RGB-D camera is mounted on the robot arm. Besides scanning the chair, the camera is also used to track the chair during the HRI and to track the NAO, which carries an ArUco tag on top of its head, during its walking.
V-A 3D Scanning
In the experiment, the chair is placed randomly in its upright pose in a 50 cm × 50 cm square area in front of the robot arm. The robot arm is moved to 9 pre-defined collision-free configurations to capture depth images of the scene with the RGB-D camera (Fig. 5(a)(b)(c)). The pose of the camera at each view is obtained from the forward kinematics of the robot arm. This allows us to use TSDF Fusion [curless1996volumetric] to densely reconstruct the scene and the point cloud. The chair is segmented from the scene by plane segmentation in the PCL library [rusu20113d].
Our dataset contains 15 chairs with diverse shapes and appearances (Fig. 5(d)). They are designed for children aged 0 to 3. We choose small chairs because the size of the chair is restricted by the workspaces of the NAO and the robot arm (if the chair is too tall, the NAO will be too short to put the bear on it; if the chair is too large, the robot arm will not be able to scan it). We use 3 chairs (the calibration set) to calibrate the simulation and the motion planning and control. The remaining 12 chairs (the test set), which are unseen by the robot, are used for testing.
V-C Physical Simulation
PyBullet [coumanspybullet] is used as the physics engine for the sitting imagination. The chair and the virtual humanoid agent are imported from URDF files which specify the mass, COM, inertia matrix, friction coefficient, and joint properties. We use the default Coulomb friction model, and collisions in the simulation are modelled as inelastic. The physics attributes of the chair are computed with MeshLab [cignoni2008meshlab]. As the chairs in the dataset are designed for children, we set the height and weight of the agent in the simulation accordingly [onis2008child].
VI Results & Discussions
We implement our method with the Robot Operating System (ROS) on a computer with an Intel Core i9-10920X @ 3.5 GHz CPU. Our unoptimized implementation takes about 70 s, 4 s, and 29 s for 3D scanning (including capturing and model reconstruction), sitting imagination, and motion planning (including both the whole-body and SE(2) planning), respectively. The walking and bear-placing motions take about 67 s and 24 s, respectively.
Accessible. In the first set of experiments, the chair is placed such that it is accessible for seating the bear: the sitting direction points towards the NAO with a small angular deviation. No HRIs are needed in this case. For each chair in the test set, we place it in 3 different poses for testing, resulting in 36 trials in total (Fig. 6).
Inaccessible + Human Obeys. In the second set of experiments, the chair is placed such that it is inaccessible; that is, the sitting direction of the chair points either towards 1) the robot arm or 2) the edges of the planning arena (Fig. 7(a)). In both cases, no valid trajectories can be found in the initial configuration, and HRIs are needed to rotate the chair and make it accessible. We recruit 6 volunteers to participate in the experiments. For each chair in the test set, we place it in 2 different inaccessible poses (24 trials in total). We ask the human to obey the instructions given by the NAO throughout the whole trial (Fig. 7).
Inaccessible + Human Disobeys. There exist many uncertainties in HRIs (e.g., the human is distracted or misunderstands the instruction) which can result in inaccessibility even after the interaction. In this set of experiments, we test the robustness of our method in addressing these uncertainties. For each chair in the test set, we place it in an inaccessible pose as in the Inaccessible + Human Obeys setting (12 trials in total). We ask the human to deliberately disobey the first instruction given by the NAO and obey the following instructions (Fig. 8).
We recruit 15 annotators to annotate the experiment results. Each trial is annotated by five different annotators. For each trial, we show the experiment video and images of the bear at the end of the trial. The annotator is then asked 1) “Do you think the robot has been successful in seating the bear on the chair?” For the trials where the chair is inaccessible, we also ask 2) “Do you think the human obeyed the instruction given by the NAO?” for each HRI in the trial and 3) “Is the chair accessible at the end of all the human-robot interactions?” We recruit human annotators because we believe there is perspective variance (e.g., on whether a trial is successful) among different human subjects. The results on the test set are shown in Tab. I.
(1) Seating Bear Success
| Experiment | Trials | ≥5 | ≥4 | ≥3 |
| Inaccessible + Human Obeys | 24 | 18 | 22 | 23 |
| Inaccessible + Human Disobeys | 12 | 10 | 10 | 11 |
(2) Human Obeys in HRI
| Experiment | Interact. | ≥5 | ≥4 | ≥3 |
| Inaccessible + Human Obeys | 26 | 20 | 23 | 23 |
| Inaccessible + Human Disobeys | 11 | 8 | 11 | 11 |
(3) Chair Accessible After HRI
| Experiment | Interact. | ≥5 | ≥4 | ≥3 |
| Inaccessible + Human Obeys | 23 | 23 | 23 | 23 |
| Inaccessible + Human Disobeys | 11 | 10 | 11 | 11 |
(The columns ≥5, ≥4, and ≥3 give the number with at least 5, 4, and 3 positive annotations, respectively.)
In table (1) of Tab. I, we show the number of trials with at least 5, 4, and 3 positive annotations to the first question. We count a trial as successful if and only if the sitting imagination found the sitting pose and more than half of the 5 annotations (at least 3) are positive. The results justify our point about perspective variance: the numbers of trials with at least 5, 4, and 3 positive annotations differ from each other. For example, some annotators allow the bear to be a bit tilted on the chair while others count that as a failure. In general, we achieve a very high success rate of 94.4% on the 12 unseen chairs in the test set (68 successful trials out of all 72 trials). The success rates of the three sets of experiments are roughly the same. The step stool accounts for all 4 failure cases in the 72 trials: two in Accessible, and one each in Inaccessible + Human Obeys and Inaccessible + Human Disobeys. In all of these failure cases, the sitting imagination fails to find the sitting pose because the depth of the seat is too shallow. A successful trial of the step stool is shown in Fig. 6. Notably, all 6 trials of the improvised chair are successful. This opens up the promising potential of our imagination method to discover the affordance of an object which affords the sitting functionality despite not being a typical chair.
In table (2) of Tab. I, we show the total number of interactions in all the inaccessible trials and the number of interactions with at least 5, 4, and 3 positive annotations to the second question. We only count a trial when HRIs are involved, i.e., we disregard the failure trials where the sitting imagination fails to find the sitting pose and no instruction is given. Also, for Inaccessible + Human Disobeys, the interactions in which the human deliberately disobeys the instruction are not counted. We consider the volunteer to have obeyed the NAO’s instruction if more than half of all the annotations are positive. Interestingly, we observe that in some Inaccessible + Human Obeys trials, although the volunteer was told to obey the robot’s instruction, they somehow disobeyed it. In these trials, the robot was able to give new instructions based on the current pose of the chair until it became accessible, resulting in a larger number of interactions (26) than the total number of trials counted (23). The perspective variance also exists in this annotation: some people require the rotation of the chair to be very close to the angle in the instruction, while others do not. In general, in the vast majority of all HRIs, the human is considered to have followed the robot’s instruction.
In table (3), we show the total number of trials with HRIs and the number of trials with at least 5, 4, and 3 positive annotations to the third question. Similar to table (2), we only count a trial when HRIs are involved. The goal of the HRI is to make the chair accessible for placing the bear, so we evaluate the accessibility of the chair at the end of the HRI to assess the effectiveness of our HRI framework. We consider the chair accessible after the HRI if more than half of all the annotations are positive. The results show that our framework is very effective: it makes the chair accessible in all the trials with successful sitting imagination. The perspective variance in this annotation is not as significant as that in tables (1) and (2).
Besides the test set results shown in Tab. I, we also test on the three chairs in the calibration set with the same three experiment settings (18 trials in total). Although the chairs are seen, the poses of the chairs are new and different from those in the calibration. The success rate of placing the bear in the 18 trials is 100%. In the 9 trials involving HRIs, the success rate of making the chair accessible is 100%.
VII Conclusions & Future Work
We propose a novel method to imagine the sitting pose of a previously unseen chair. We develop a robotic system which is able to put a teddy bear on the chair autonomously via robot imagination. Moreover, we introduce a human-robot interaction (HRI) framework to change the accessibility of the chair when the chair is in an inaccessible pose. Experiment results show that our method enables the robot to put the bear on 12 previously unseen chairs in 72 trials with a very high success rate. The HRI framework is also shown to be very effective in making the chair accessible from inaccessible poses. Future work can adaptively change the size and shape of the humanoid agent in the sitting imagination to imagine the affordances of chairs of various sizes. Mobile manipulators and larger humanoid robots can be used to put the bear on such chairs in the real world. Exploring more versatile whole-body motion planning for placing the bear is also a promising future direction.
The authors thank Yuanfeng Han for his helpful discussions, all the volunteers for the HRI experiments, and all the annotators for the human annotations.