Everyday objects are usually placed in certain orientations that are convenient to humans. For example, a cup is designed to be placed on its bottom rather than on its side. Placing objects down properly in orientations preferable to humans is a fundamental skill for service robots. For example, a robot that is unpacking the dishwasher should place plates, glasses and bowls on shelves in certain orientations. Research in robotic manipulation over the past decades has mostly focused on how to pick up objects, with recent works utilizing advances in deep learning [20, 21, 24, 22]. However, what to do with the object after it has been grasped has largely been overlooked in the field. In fact, the most common practice in pick-and-place robotic manipulation scenarios is to drop the object from a height without any consideration of its resulting pose [21, 22, 11]. Only a handful of researchers have studied how to place the grasped object down, as reviewed in Section II. Furthermore, to our knowledge, no work has leveraged deep learning for the object placement problem.
In this paper, we study the problem of placing a grasped object down on an empty flat surface in the human-preferred orientation. We consider the solution to the problem as finding the required object rotation such that when the robot end-effector is lowered until contact is made and the gripper is opened, the object is stably placed in the human-preferred orientation. We present two neural networks, Placement Rotation Convolutional Neural Network (PR-CNN) and Placement Stability Convolutional Neural Network (PS-CNN), that are used in an iterative algorithm. Both networks take as input depth images of the grasped object (with the end-effector) from three viewpoints. At every iteration, PR-CNN recommends an object rotation to achieve the human-preferred orientation, which the robot executes, and then PS-CNN estimates the stability of the proposed object orientation. The algorithm stops under one of two conditions: 1) PR-CNN’s output is the rotational identity, which suggests that the object orientation has converged, or 2) the maximum number of iterations is reached, at which point the rotation that yields the maximum predicted stability is chosen among all iterations. The iterative approach helps in getting more observations of the object and correcting the errors of PR-CNN. The stability estimate provided by PS-CNN is useful when the output of PR-CNN results in oscillations in the object orientation, which is a common failure mode when distinguishing features that could typically be used to determine the upright orientation are occluded in the depth images. We train both networks in simulation using 18 objects we picked from the KIT object database. We report experimental results starting from random object orientations in the robotic gripper.
The contributions of this paper are four-fold:
PR-CNN: Learning the required rotation to reach desired object orientations, from depth images of the gripper holding an object only, without explicit object pose detection.
PS-CNN: Learning to estimate whether an object would end up in a stable orientation, again from depth images without object pose detection.
Iterative Placement with Stability: An algorithm that iteratively uses PR-CNN to move the arm and PS-CNN to assess the hypothetical stability, which helps gather more observations and correct errors.
An implementation on a robotic system, demonstrating the feasibility of direct sim-to-real transfer.
The organization of the paper is as follows. We first review the relevant literature in Section II. We then describe the problem in Section III. Our approach, called Iterative Placement with Stability, is detailed in Section IV. Experimental setup is presented in Section V, before presenting the results and discussing failure modes in Section VI. Robot implementation details are described in Section VII, before concluding in Section VIII.
II Related Work
II-A Robotic Placing
In one of the earliest implementations of robotic object placement, Edsinger and Kemp use a compliant robotic arm to place an object onto a shelf by moving the arm to a fixed configuration and then lowering the end-effector using force control, hence utilizing contact with the environment. However, this approach assumes that the pose of the object in the gripper is known, which is not the case for unknown objects. Since then, many researchers have approached the placement problem analytically, attempting to find flat features on the object and the surface on which the object can be placed [2, 3, 12, 13, 14]. Baumgart et al. [2, 3] find stable poses of the object analytically by finding a point of first object contact, followed by rotating the object such that additional contact points are found. Their approach runs in real time, but does not take into account preferred placement orientations. Harada et al. [12, 13] match planar surface patches on the object with planar surface patches in the environment, which allows finding placements on large, flat surfaces, but also less obvious placements such as a mug hanging on a flat bar. Their approach, however, requires the 3D model of the object and its pose. Haustein et al. present a similar approach and use Monte Carlo Tree Search to optimize motion planning to reach the stable pose.
The majority of recent approaches in robotic manipulation are driven by machine learning, and object placement is no exception. The approach presented by Jiang et al. [18, 17] uses learning on hand-designed features and reports stable placement success rates for both known and new objects. Paolini et al. estimate the probability of a successful placement, then attempt to solve for the most likely placement location, given a grasped object. Fu et al. use hand-chosen features to find the upright orientation of man-made objects.
II-B Planning for Placement
Object placement is also studied in the task and motion planning context. Alami et al. first formalized robotic manipulation as atomic actions to achieve a higher-level task in configuration space. Grasping is not the only way to move objects to their correct places, as objects can also be moved using non-prehensile manipulation. Scholz et al. optimize for the configuration of tabletop objects and plan pushing actions to achieve the desired table configuration. Cosgun et al. plan a sequence of pushing actions on a cluttered table in order to make space to place a new object.
II-C Representing Rotations in Neural Networks
Rotations can be represented in many ways, such as rotation matrices, Euler angles, quaternions and axis-angle representations. The choice of rotation representation has an important effect on the performance of machine learning models when inputs and/or outputs include rotations. A common issue of angular representations is their discontinuity stemming from their periodic nature. Theoretical analyses suggest that smooth functions or functions which have stronger continuity properties have lower approximation errors [30, 6]. Zhou et al. posit that rotations are discontinuous under any representation using four or fewer dimensions, and propose a six-dimensional representation for rotations, which they show to outperform all other typical representations in the context of machine learning. We adopt this rotational representation in this paper.
III Problem Description
Fig. 2 illustrates the axes and angles we define for an object. The coordinate frame defined by the axes ($x$, $y$, $z$) represents an arbitrary and fixed world frame, in which gravity is acting in the $-z$ direction. We define the stable axis $\mathbf{s}$ as the unit vector orthogonal to the surface on which objects are placed. The placement surface is set as the infinite and uncluttered $xy$-plane, hence we set $\mathbf{s}$ to be in the direction of the $z$-axis. We assign an upright vector $\mathbf{u}$ attached to an object, such that $\mathbf{u} = \mathbf{s}$ when the object is in its human-preferred orientation. An important property of object placement on an uncluttered infinite plane is that the placement is independent of any rotation about the stable axis $\mathbf{s}$. To illustrate this, any rotation of the mug shown in Fig. 2 about $\mathbf{s}$ can be reframed as a rotation of the global reference frame about the same axis. Because of this, there exists a continuous set of rotations that can orient an object to a human-preferred orientation. We uniquely define the ground truth rotation as the shortest possible rotation required to move the object from its current orientation to a human-preferred orientation. The ground truth rotation can therefore be found as the transformation required to rotate the unit vector $\mathbf{u}$ to $\mathbf{s}$. This rotation is calculated most easily using axis-angle representation, which is defined by an angle $\theta$ rotated about an axis $\mathbf{a}$. The ground truth rotation angle is the angle between unit vectors $\mathbf{u}$ and $\mathbf{s}$ and can be found by:
$$\theta = \arccos\left(\mathbf{u} \cdot \mathbf{s}\right)$$
The ground truth axis is defined as the axis perpendicular to both $\mathbf{u}$ and $\mathbf{s}$ and can be found by:
$$\mathbf{a} = \frac{\mathbf{u} \times \mathbf{s}}{\lVert \mathbf{u} \times \mathbf{s} \rVert}$$
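The axis-angle construction above can be illustrated with a short NumPy sketch. It is not the authors' code: the function names, the default stable axis along $z$, and the handling of the degenerate case (upright vector parallel or anti-parallel to the stable axis, where the cross product vanishes and any perpendicular axis works) are assumptions made for illustration.

```python
import numpy as np


def ground_truth_axis_angle(upright, stable_axis=(0.0, 0.0, 1.0)):
    """Shortest rotation (axis, angle) taking `upright` onto `stable_axis`."""
    u = np.asarray(upright, dtype=float)
    s = np.asarray(stable_axis, dtype=float)
    u = u / np.linalg.norm(u)
    s = s / np.linalg.norm(s)
    # Angle between the two unit vectors.
    angle = float(np.arccos(np.clip(np.dot(u, s), -1.0, 1.0)))
    axis = np.cross(u, s)
    norm = np.linalg.norm(axis)
    if norm < 1e-8:
        # Assumption: u is (anti-)parallel to s, so pick any axis
        # perpendicular to s; it only matters when angle == pi.
        helper = np.eye(3)[np.argmin(np.abs(s))]
        axis = np.cross(s, helper)
        norm = np.linalg.norm(axis)
    return axis / norm, angle


def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula: rotation matrix from a unit axis and an angle."""
    ax, ay, az = axis
    K = np.array([[0.0, -az, ay],
                  [az, 0.0, -ax],
                  [-ay, ax, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
```

For an object lying on its side, for example with upright vector `(1, 0, 0)`, the sketch returns a 90 degree rotation whose matrix maps the upright vector onto the stable axis.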
We consider the robotic placement problem as follows. The robot starts with an object already in hand and the task is to successfully place the object down. A solution to the problem is a proposed object rotation that would result in a particular object orientation (the human-preferred orientation in this case). We assume that the robot has no a priori knowledge about the object class, 3D model or the human-preferred orientation; however, it has access to depth cameras and joint force sensors. We use a simple placing behavior: the end-effector is lowered along the $-z$ direction with a fixed orientation until contact with the table is felt. The gripper fingers are then opened and the end-effector is retracted along the reverse direction of the end-effector. Successful placement onto the tabletop requires that the object is stable and in the human-preferred orientation under gravitational and contact forces after release.
IV Iterative Placement with Stability
We propose an iterative, learning-based approach to robotic object placement. Three depth cameras are used in order to observe the object from different viewpoints. We use two neural networks:
A network takes the depth images as input and outputs the rotation that should be applied to the object so that it results in a human-preferred orientation. We call this network PR-CNN: Placement Rotation Convolutional Neural Network.
A network takes the depth images as input and estimates the confidence level that the object would be stable if it is placed in its current orientation. We call this network PS-CNN: Placement Stability Convolutional Neural Network.
Note that the two networks have different criteria. PR-CNN estimates the required rotation towards the ground truth orientation, whereas PS-CNN considers the physics when the object is released. Moreover, a stable placement does not necessarily mean the object would end up in a human-preferred orientation. For example, if a cup is placed upside down, the placement would be stable but not in the human-preferred orientation.
At each iteration, we first get a proposed rotation from PR-CNN and apply the rotation to the object. Assuming the gripper and grasped object act as a single rigid body (i.e. no slipping), we can execute object rotations by applying the same rotation to the robotic gripper. We then estimate the stability of the resulting object orientation using PS-CNN. This is repeated until PR-CNN’s output converges to within a fixed threshold of the rotational identity, since theoretically an object which is already at a stable orientation will yield a rotational identity when evaluated via PR-CNN. If, however, the maximum number of iterations is reached without convergence, we pick the rotation that yielded the highest stability among all the iterations, as estimated by PS-CNN. The pseudocode can be seen in Algorithm 1.
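The loop described above can be sketched in Python as follows. This is a minimal illustration rather than the paper's Algorithm 1: the callables `observe`, `predict_rotation` (standing in for PR-CNN), `predict_stability` (standing in for PS-CNN) and `rotate_gripper` are hypothetical placeholders, rotations are assumed to be 3x3 matrices, and the 5 degree identity tolerance is an assumed value because the threshold is not given in this text.

```python
import numpy as np


def rotation_angle_deg(R):
    """Rotation angle of matrix R in degrees (0 for the identity)."""
    cos_angle = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))


def iterative_placement(observe, predict_rotation, predict_stability,
                        rotate_gripper, max_iters=5, identity_tol_deg=5.0):
    """Rotate the grasped object until PR-CNN proposes (almost) no rotation.

    Returns the net rotation applied and whether the loop converged.
    """
    total = np.eye(3)                      # net rotation applied so far
    best_score, best_total = -np.inf, np.eye(3)
    for _ in range(max_iters):
        R = predict_rotation(observe())    # rotation proposed by PR-CNN
        if rotation_angle_deg(R) < identity_tol_deg:
            return total, True             # converged: keep this orientation
        rotate_gripper(R)                  # object assumed rigid in the gripper
        total = R @ total
        score = predict_stability(observe())   # PS-CNN confidence
        if score > best_score:
            best_score, best_total = score, total
    # No convergence: return to the most stable orientation that was visited.
    rotate_gripper(best_total @ total.T)
    return best_total, False
```

In this sketch the fallback branch undoes the rotations applied after the best-scoring iteration, which is one way to realise "pick the rotation that yielded the highest stability"; a real system might instead replan directly to the stored gripper pose.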
The iterative approach helps with fine-tuning the object orientation especially if the object is slipping in the gripper, as well as getting new observations from the object. The use of the PS-CNN in evaluating stability helps in resolving the situations when proposed rotations show oscillatory behavior, which can happen with near symmetric objects such as cups. A more detailed discussion on common failure modes can be found in Section VI-C.
IV-A Placement Rotation CNN (PR-CNN)
We learn the required rotation that is applied to the object that would result in a human-preferred orientation. The ground truth rotation is obtained analytically using the methodology outlined in Section III, and a single network is trained on all the objects in a given dataset.
PR-CNN inherits its architecture from DenseNet-121 and is pre-trained on ImageNet. PR-CNN defines the set of parameters used to represent the required rotation function. PR-CNN takes as input three depth images of the object. The depth values are saturated at a maximum depth and then normalised to a fixed range. Modifications were made to the final layers to output a six-dimensional representation of rotations as proposed by Zhou et al. This represents the first two columns of a rotation matrix, which is then converted to a full rotation matrix using the Gram-Schmidt-like process described by Zhou et al. Furthermore, we use a loss function similar to that of Zhou et al., which is the geodesic distance between the output and the ground truth:
$$\mathcal{L}_{\text{geo}} = \arccos\left(\frac{\operatorname{tr}\left(R_{\text{pred}} R_{\text{gt}}^{\top}\right) - 1}{2}\right)$$
where $R_{\text{pred}}$ is the predicted rotation, $R_{\text{gt}}$ is the ground truth rotation and $\operatorname{tr}$ is the matrix trace operator. For a $3 \times 3$ rotation matrix $R$, this is defined as:
$$\operatorname{tr}(R) = R_{11} + R_{22} + R_{33}$$
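As an illustration of the output head and loss described above, the following NumPy sketch converts a six-dimensional vector into a rotation matrix with a Gram-Schmidt-like step and evaluates the geodesic distance between two rotations. The function names and the layout of the six numbers (first three for the first column, last three for the second) are assumptions, and a training implementation would use an automatic differentiation framework rather than NumPy.

```python
import numpy as np


def rotation_from_6d(x6):
    """Build a rotation matrix from a 6D vector (two unnormalised columns)."""
    x6 = np.asarray(x6, dtype=float)
    a1, a2 = x6[:3], x6[3:]
    b1 = a1 / np.linalg.norm(a1)           # first column, normalised
    a2_orth = a2 - np.dot(b1, a2) * b1     # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    b3 = np.cross(b1, b2)                  # third column completes the frame
    return np.stack([b1, b2, b3], axis=1)


def geodesic_distance(R_pred, R_gt):
    """Angle (radians) of the relative rotation between R_pred and R_gt."""
    cos_angle = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
```

The clipping before `arccos` guards against values slightly outside [-1, 1] caused by floating-point error.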
IV-B Placement Stability CNN (PS-CNN)
We learn the estimated stability of an object being placed in its current orientation. The ground truth binary label is obtained by physics simulation.
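Our goal is to learn this function over a series of objects and orientations, classifying stability according to a binary success metric:
$$\theta^{*} = \arg\min_{\theta} \mathcal{L}_{\text{BCE}}\left(f_{\theta}(I), y\right)$$
where $\mathcal{L}_{\text{BCE}}$ is the binary cross-entropy loss function, $f_{\theta}(I)$ is the stability predicted from the depth images $I$, $y$ is the ground truth binary label and $\theta$ defines the parameters of PS-CNN.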
PS-CNN is based on VGG16 and is pretrained on ImageNet. PS-CNN defines the set of parameters used to represent the stability function. The first four convolutional blocks are frozen and only the final layers are fine-tuned. The input to the network is three depth images, stacked and scaled. Preprocessing steps similar to those of PR-CNN were applied to the depth images. Modifications were made to the final layer to output a single number, with sigmoid activation. This number represents the confidence that the object would be stable if it were placed in its current orientation.
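The depth preprocessing shared by both networks can be sketched as below. The saturation depth and the target range are not recoverable from this text, so `max_depth` and the [0, 1] output range are assumed values, as are the function name and the channels-first stacking of the three views.

```python
import numpy as np


def preprocess_depth_views(views, max_depth=1.0):
    """Stack three depth images as channels, saturate and normalise them.

    `max_depth` (in the units of the depth images) is a hypothetical value.
    """
    stacked = np.stack([np.asarray(v, dtype=np.float32) for v in views], axis=0)
    stacked = np.clip(stacked, 0.0, max_depth)   # saturate far readings
    return stacked / max_depth                   # normalise to [0, 1]
```

Stacking the three viewpoints as channels yields an input with the same channel count as an RGB image, which is one plausible reason ImageNet-pretrained backbones can be reused without changing the first layer.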
V-A Object Models
We picked 18 everyday objects from the KIT object models database, which provides 3D meshes of the objects. The simulated renderings of all 18 objects can be seen in Fig. 3. The objects were picked such that each one had a well-defined, single upright vector (we avoided objects with multiple stable axes, such as a coke can or cereal box). The upright vector for each object model is annotated with the human-preferred orientation. We use a scene where three depth cameras are positioned orthogonally from each other, to the front, left and right of the gripper, at a 25 cm radius around the point at which objects and the gripper interact (as seen in Fig. 1).
V-B Data Collection
We used the PyRep toolkit for the simulation environment. To generate a data point, we randomly pick one of the 18 objects and randomly sample an orientation, with slight positional variations around a nominal position. We collect the three depth camera images along with the ground truth rotations that would lead the object to the human-preferred orientations and binary labels indicating whether the object is in a stable placement orientation or not.
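One way to draw the random object orientations mentioned above is to sample a unit quaternion from a normalised 4D Gaussian, which gives uniformly distributed rotations. The sketch below is an assumption about the sampling scheme, since the text only says orientations are sampled randomly; the function names and the choice of the object's local $z$-axis as its annotated upright vector are likewise hypothetical.

```python
import numpy as np


def sample_random_rotation(rng):
    """Uniformly random rotation matrix from a normalised Gaussian quaternion."""
    q = rng.normal(size=4)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])


def upright_in_world(R_obj, upright_local=(0.0, 0.0, 1.0)):
    """World-frame upright vector of an object with orientation R_obj."""
    return R_obj @ np.asarray(upright_local, dtype=float)
```

A data point would then pair the rendered depth images for `R_obj` with the ground truth rotation that takes `upright_in_world(R_obj)` onto the stable axis.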
In order to evaluate how our approach performs, we collected two datasets: data with and without the robot arm in simulation. Both datasets contain 85,000 data points each.
Without Robot: In order to evaluate the feasibility of our approach, we first experiment on a setup where there is no robot involved and the object is moved in simulation directly. This mode is simpler, because there are no contact forces between the gripper and the object that can affect the object orientation, and we can move the objects without kinematic constraints. Placement stability is evaluated by setting the position of the object to the lowest possible point without colliding with the placement surface, then enabling gravitational physics to simulate placement.
With Robot: Random object grasps are sampled with a Franka Panda robotic arm. Placement stability is evaluated by using inverse kinematics to rotate and place the object onto a surface. Placement is achieved by lowering the end-effector until contact is detected via force sensing of the robot joints. In order to increase the manipulability of the robot arm, objects are placed on a surface elevated from the ground.
An important distinction between the datasets is that in the “With Robot” dataset, the robot arm appears in the depth images along with the grasped object, whereas in the “Without Robot” dataset the depth images contain only the object of interest.
We compared the performance of the following placement methods to benchmark our proposed algorithm:
Random placement (RND): Object rotation is randomly sampled.
Single pass (SP): A single pass of PR-CNN is used to determine the object rotation.
Iterative (ITR): PR-CNN is run iteratively until the identity rotation is achieved or a maximum number of iterations (5 in this case) has been reached.
Iterative with Stability (ITR-S): Our full approach combining PR-CNN and PS-CNN, as detailed in Section IV.
Success Rate: The percentage of placements where the steady-state object orientation is within a fixed angular threshold of the human-preferred ground truth orientation.
Stability Rate: The percentage of placements where the object stays stationary for a minute and the final object orientation is within a fixed angular threshold of the initial placement orientation. This metric was adopted from [17].
Angular Error: The average angle difference between the upright vector and the stable axis.
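The success rate and angular error above can be computed from the final upright vectors of a batch of placements, as in the following sketch. The function names are hypothetical, and `threshold_deg` is left as a required argument because the angular threshold used in the paper is not recoverable from this text.

```python
import numpy as np


def angular_error_deg(upright, stable_axis=(0.0, 0.0, 1.0)):
    """Angle in degrees between an upright vector and the stable axis."""
    u = np.asarray(upright, dtype=float)
    s = np.asarray(stable_axis, dtype=float)
    cos_angle = np.dot(u, s) / (np.linalg.norm(u) * np.linalg.norm(s))
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


def placement_metrics(final_uprights, threshold_deg):
    """Return (success rate in percent, mean angular error in degrees)."""
    errors = np.array([angular_error_deg(u) for u in final_uprights])
    success_rate = 100.0 * float(np.mean(errors <= threshold_deg))
    return success_rate, float(np.mean(errors))
```

An upside-down placement contributes an error of 180 degrees, which is why a few such failures can dominate the mean angular error even when the success rate is high.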
For the “With Robot” experiments, placements that lead to kinematically infeasible orientations were ignored and did not contribute towards the results.
VI-A Without Robot
The aggregate results on all the objects are shown in Table I. RND expectedly performed very poorly, with a 3.6% success rate. It did, however, achieve a moderate 45.8% stability rate. The discrepancy arises because objects can be stably placed in orientations other than the human-preferred orientation.
| Method | Success Rate | Stability Rate | Angular Error |
SP, ITR and ITR-S all performed similarly, achieving 97.9%, 98.4% and 97.0% success rates respectively. ITR performed the best in terms of success rate and angular error, and outperformed the full approach ITR-S. Fig. 6 shows the placement success rate for each object and method. It is interesting to note that all methods except RND achieved a 100% success rate for 7 out of 18 objects.
These results suggest that the advantage of the stability metric is not apparent when there is no robot arm involved, and that even the SP method is sufficient for reasonable performance.
VI-B With Robot
The aggregate results on all the objects are shown in Table II. Compared to the previous experiment without the robot arm, all methods performed worse in all metrics, due to the inherent difficulty of the scenario: the robot hand appears in the images, and there are contact physics between the end-effector and the object. The success rates were 72.3%, 84.6% and 86.1% for the SP, ITR and ITR-S approaches respectively, and the full approach ITR-S performed the best in all metrics.
| Method | Success Rate | Stability Rate | Angular Error |
By introducing the robotic arm, network errors are more pronounced, which allows iterative approaches to compensate for the errors of SP. This is most noticeable for objects which SP struggles with such as the Toothpaste and various cup models as seen in Figure 7. This is also apparent during training when the PR-CNN training error is still high, as seen in Figure 4. The only exception to this is before the network is trained, at which point the network is outputting random rotations. This validates our intuition that an iterative approach is able to compensate for errors in the network as long as the network is trained to an extent such that it is able to move in the general direction towards the ground truth optimum.
VI-C Failure Modes
A common failure mode of our approach is placing objects upside-down, which occurs with objects that look symmetrical, such as the cups, ShowerGel and ToothPaste object models. This occurs most often when the object is in an orientation such that distinguishing features that could typically be used to determine an upright orientation, such as the handle or mouth of a cup, are not visible to the cameras. These objects also result in oscillatory behaviour when using ITR, as the model oscillates between two perceived stable orientations. When this occurs, the angular error of the object orientation can be very high depending on the iteration at which the algorithm is stopped. This is most noticeable for the Glassbowl object, which exhibited the most oscillatory behavior in our object set. In Fig. 5, the angular error is shown at each iteration for two objects. While the HamburgerSauce object shows the expected behavior, the behavior is oscillatory for Glassbowl. An advantage of ITR-S is that it addresses the oscillation problem by testing for stability at each iteration and outputting the most stable orientation among all iterations.
Another common failure mode occurs with intrinsically unstable objects, such as objects with a small base area relative to their total volume. Objects such as the Toothpaste and Pitcher are examples of objects exhibiting this trait. Although the angular error of the network to the ground truth pose is similar to other more intrinsically stable objects, the success rate of placements is noticeably lower. Due to the different intrinsic stability properties of different objects, it is difficult to benchmark our results against other works.
VII Robot Implementation
To demonstrate the feasibility of our proposed approach and the potential for sim-to-real transfer, we implemented our approach on a Franka Panda robotic arm. The system consists of three computers connected via TCP/IP, where each computer is connected to a Realsense D435 RGB-D camera. Multiple computers were used because of the high USB bandwidth required by the Realsense cameras. Each computer uses the Melodic version of the Robot Operating System (ROS) running on the Ubuntu 18.04 LTS operating system. We use the MoveIt motion planning framework for planning and control.
The physical setup is similar to simulation, with the three RGB-D cameras placed orthogonally from each other. We moved the location of the shelf to the right of the arm due to physical constraints. The depth images are inpainted using OpenCV to assist in noise removal. The inpainted depth images are then processed by PR-CNN, which is trained only in simulation, to calculate the new rotation. The object is then lowered onto the shelf using force feedback to detect when the object is in collision with the shelf. We 3D-printed 6 of the 18 objects in our dataset and the robot was able to successfully place the objects from certain initial object orientations. At the time of writing, the system was not robust enough for a full evaluation. This is due to the limited considerations during motion planning and noisy depth images, which were not considered during the training of PR-CNN.
VIII Conclusion and Future Work
In this work, we proposed an approach to rotate grasped objects into orientations such that they can be placed in stable, human-preferred orientations. We show the feasibility of learning to place objects from depth images without object detection or explicit pose estimation. Our experimental results suggest that iterative approaches such as ITR and ITR-S show improvements over the single pass approach. Our work also shows potential for sim-to-real transfer learning, and justifies the need for more research to make the system more robust and generalize the approach to different classes and shapes of objects.
There are many interesting future research directions for this work. First, our current iterative approaches only reevaluate the object’s orientation after the rotation has been completed, making them slow to react to disturbances. A closed-loop reactive approach, analogous to grasping in [22], that is able to reevaluate the rotation at every time step would be much more efficient at dealing with these problems, while still potentially yielding the same benefit as our iterative approaches. Second, our approach is designed for objects with a single defined preferred orientation; however, some objects such as cans and boxes have multiple preferred orientations. It is possible to extend our representation to include such objects, which may also help to reduce the oscillatory behaviour of iterative approaches. Third, we have observed that in many situations it is impossible to place the object in the preferred orientation, either because the arm would make contact with the surface or because the desired pose is kinematically infeasible. It would be interesting to couple our work with grasping, such that the robot picks up the object in an orientation that enables placement in a desired orientation.
[1] (1990) A geometrical approach to planning manipulation tasks. The case of discrete placements and grasps.
[2] (2013) A geometrical placement planner for unknown sensor-modelled objects and placement areas. In IEEE International Conference on Robotics and Biomimetics (ROBIO).
[3] (2014) A fast, gpu-based geometrical placement planner for unknown sensor-modelled objects and placement areas. In 2014 IEEE International Conference on Robotics and Automation (ICRA).
[4] (2014) Data-driven grasp synthesis—a survey. IEEE Transactions on Robotics.
[5] (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools.
[6] The construction and approximation of neural networks operators with gaussian activation function. In Mathematical Communications.
[7] (2011) Push planning for object placement on cluttered table surfaces. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[8] (2009) Imagenet: a large-scale hierarchical image database. In CVPR.
[9] (2006) Manipulation in human environments. In IEEE-RAS International Conference on Humanoid Robots.
[10] (2008) Upright orientation of man-made objects. In ACM SIGGRAPH.
[11] (2018) FFRob: leveraging symbolic planning for efficient task and motion planning. The International Journal of Robotics Research.
[12] (2012) Object placement planner for robotic pick and place tasks. In IEEE/RSJ International Conference on Intelligent Robots and Systems.
[13] (2014) Validating an object placement planner for robotic pick-and-place tasks. Robotics and Autonomous Systems.
[14] (2019) Object placement planning and optimization for robot manipulators. arXiv preprint arXiv:1907.02555.
[15] (2017) Densely connected convolutional networks.
[16] (2019) Pyrep: bringing v-rep to deep robot learning. arXiv preprint arXiv:1906.11176.
[17] (2012) Learning to place new objects in a scene. The International Journal of Robotics Research.
[18] (2011) Learning to place new objects. Proceedings - IEEE International Conference on Robotics and Automation.
[19] (2012) The kit object models database: an object model database for object recognition, localization and manipulation in service robotics. The International Journal of Robotics Research.
[20] (2015) Deep learning for detecting robotic grasps. The International Journal of Robotics Research.
[21] (2017) Dex-net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics.
[22] (2018) Closing the loop for robotic grasping: a real-time, generative grasp synthesis approach. arXiv preprint arXiv:1804.05172.
[23] (2014) A data-driven statistical framework for post-grasp manipulation. The International Journal of Robotics Research.
[24] Supersizing self-supervision: learning to grasp from 50k tries and 700 robot hours. In IEEE International Conference on Robotics and Automation (ICRA).
[25] ROS: an open-source robot operating system. In ICRA workshop on open source software.
[26] (2010) Combining motion planning and optimization for flexible robot manipulation. In IEEE-RAS International Conference on Humanoid Robots.
[27] (2014) Very deep convolutional networks for large-scale image recognition.
[28] (2019) MoveIt motion planning framework.
[29] (2005) Simultaneous lp-approximation order for neural networks. Neural Networks.
[30] (2004) The essential order of approximation for neural networks. Science in China Series F: Information Sciences.
[31] (2019) On the continuity of rotation representations in neural networks. In CVPR.