Mobile manipulation robots operating autonomously in human environments have limited sensory abilities. Manipulation planning benefits from a complete object model [1, 2]. However, if the object is unfamiliar, the robot must either circumnavigate the object or use a wrist camera to scan it from multiple viewpoints in order to construct such a model. In contrast, humans use strong priors to complete objects, imagining the occluded parts from just a single view. This paper addresses the reconstruction task based on sparse image data.
In this paper, we present a system designed for use by a mobile manipulation robot. The goal is to generate a set of images of an object from desired viewpoints using a single input image. For example, in the situation presented in Fig. 1, the robot collects information about an object using an RGB-D sensor. Some parts of the object (the handle and the rear surface of the mug) are occluded. To avoid careful scanning, we propose a method that recovers information about the object from a single view. If a grasping or motion planning algorithm needs specific information about the object’s visual representation, the recovered images can be used to provide such data.
Recently, reconstruction from a single view has been achieved using deep neural networks. Examples include the 3D Recurrent Reconstruction Neural Network (3D-R2N2) and the 3D Generative Adversarial Network (3D-GAN). In each case, the neural network produces a 3D occupancy grid. However, fine details of the reconstructed objects are difficult to obtain directly. This is caused by the rapid growth of the computational cost and memory demands as the number of voxels increases, driven in turn by a decrease in the size of each voxel. For example, even an object resolution that is not high enough to show small details of the object results in a greater overall number of parameters and almost 17 million voxels. In contrast, neural networks have been shown to perform efficiently in the image space. Thus, an image-based reconstruction has the potential to provide a better spatial resolution of a reconstructed object. The typical problems solved by neural networks are image classification and object detection, but recently they have also proven to be efficient in image synthesis and scene rendering. Thus, in this research, we propose an image-based, view-dependent approach to gather information about the object from a single view.
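To make the memory argument concrete, the following minimal sketch counts the cells of a cubic occupancy grid. The side lengths are assumptions; 256 voxels per edge is used only because it is consistent with the figure of almost 17 million voxels, and the function name is illustrative.

```python
def voxel_count(side: int) -> int:
    """Number of cells in a cubic occupancy grid with `side` voxels per edge."""
    return side ** 3


# Assumed resolutions; 256 is a guess consistent with "almost 17 million voxels".
for side in (32, 64, 128, 256):
    print(f"{side}^3 = {voxel_count(side):,} voxels")
```

Doubling the resolution along each axis multiplies the number of voxels by eight, which is why voxel-based outputs become expensive quickly.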
I-A Related Work
Single-view images can be used for effective planning of grasping points for vacuum-based end effectors because only a single visible point of contact with suitable surface geometry is required. As the number of fingers in a gripper grows, the estimation of grasping points becomes more difficult. A wide variety of grasp planning methods are available. For example, Kopicki et al. presented a method for computing grasp contact points for a multi-finger robot given a partial 3D point cloud model. The grasp success rate decreases when this model is obtained from a single view. The proposed method for image generation can provide the missing data and improve the grasp success rate. Another solution is to recover the 3D model and then apply grasp planning. Given a full 3D model, a grasp can also be transferred to a novel object via contact warping.
It is possible to recover the pose and shape of a known object from a single view using a Convolutional Neural Network (CNN) applied to the single-shot object pose estimation problem. However, most methods for object reconstruction focus on end-to-end learning of a 3D voxel model of the object from a single image. A general approach, which enables the completion of a 3D shape from a single-view 3D point cloud using a CNN, was proposed by Varley et al. The network generates a 3D voxel occupancy grid from a partial point cloud and can also generalize to novel objects. The detailed mesh of the object is obtained by further post-processing of both the input point cloud and the 3D occupancy grid. A similar approach to object reconstruction, based on the 3D Recurrent Reconstruction Neural Network architecture, was proposed by Choy et al. In this case, the 3D occupancy grid is obtained from an RGB image. Another approach to 3D object reconstruction is based on a set of algorithms for object detection, segmentation, and pose estimation, which fit a deformable 3D shape to the image to produce the 3D reconstruction of the object.
Many objects encountered in manipulation tasks are symmetric. The complete shape of a partially observed object can be recovered by finding the symmetry planes and taking the scene context into account. A similar approach to object shape prediction, based on the symmetry plane, was proposed by Bohg et al. In contrast, a CNN can also be used to complete partial 3D shapes. In that case, the network operates on a 3D map of voxels and generates a high-resolution voxel grid.
Recently, CNNs have proven effective in rendering a whole 3D scene from a few images, image synthesis from text, semantic image synthesis, new-view image synthesis from sets of real-world natural imagery, and image completion. However, we are the first to show that a sequence of 2D images of an object from a given set of viewpoints can be generated from a single image using only a CNN.
I-B Approach and Contribution
In this paper, we use a 2D view-dependent approach to generate images of the object from various viewpoints. As a result, the robot can “hallucinate” the appearance of the currently observed object (RGB and depth images) from different viewpoints.
The dominant part of our object reconstruction pipeline is based only on view-dependent representations (images). This emphasis places our method in contrast to others [4, 5]. First, we justify our approach by the fact that the human visual cortex performs the addressed task fairly easily. The human vision pipeline starts with position- and scale-dependent representations. Then, higher layers of the perception system build 3D view-invariant models. Second, new methods from computer vision allow reconstruction of a 3D shape from the silhouette of that shape. This means that the generated 2D views can be used to generate precise point clouds of the object or a 3D voxel map. The generated images can also be used to estimate the relative motion of the camera by comparing the images generated from the reference view with the current camera images. They also make it possible to find the parts of the object that are occluded in the current view (e.g. a mug handle) or to predict views while planning the motion of the robot.
II View-dependent Image Generation
To generate different views of a given object, we propose a complete processing pipeline. The block diagram of the proposed method is presented in Fig. 2. The architecture consists of two main modules: the object extractor and the view generator.
II-A Object extractor
The object extractor utilizes data from the RGB-D camera mounted on the robot. The camera provides raw information about the environment (RGB frames). We use the Mask R-CNN method to find the 2D masks and bounding boxes of objects. After detection, the objects are cut out from the image. Our generative network operates on square images of a fixed size, so we need to process the data obtained from Mask R-CNN. We first scale each image so that its longer side matches the required size (ratio-preserving scaling). Then, to obtain a square image, we pad the image and fill the padding with a constant background (black) color.
The same procedure is applied to the masks of detected objects. After obtaining both the RGB and mask images of the object, we concatenate them and feed them to the appropriate generative network. Apart from detecting objects, Mask R-CNN predicts their class labels. We use this information to decide which generative network should be used in the subsequent part of the processing pipeline.
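The scaling and padding step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: images are plain lists of rows, resizing uses nearest-neighbour sampling, the object is centered in the square canvas, and the names `resize_nearest`, `fit_to_square`, and the `size` parameter are hypothetical.

```python
def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize of an image stored as a list of rows."""
    h, w = len(img), len(img[0])
    return [[img[min(h - 1, r * h // new_h)][min(w - 1, c * w // new_w)]
             for c in range(new_w)]
            for r in range(new_h)]


def fit_to_square(img, size, background=0):
    """Scale so the longer side equals `size`, then pad to size x size."""
    h, w = len(img), len(img[0])
    scale = size / max(h, w)                  # ratio-preserving scale factor
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    resized = resize_nearest(img, new_h, new_w)
    top = (size - new_h) // 2                 # centering is an assumption
    left = (size - new_w) // 2
    canvas = [[background] * size for _ in range(size)]
    for r in range(new_h):
        for c in range(new_w):
            canvas[top + r][left + c] = resized[r][c]
    return canvas


# A 2x4 crop scaled and padded to an 8x8 square with a black (0) background.
crop = [[7] * 4 for _ in range(2)]
square = fit_to_square(crop, size=8)
```

The same function can be applied to the object mask, so that the RGB crop and the mask stay aligned before they are concatenated.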
II-B Generative network
The generative network takes a concatenated RGB and depth image of an object as input, forwards it through the encoder, and computes the latent representation of the input. The encoder is fully convolutional. It uses strides of 1 and kernels of shape 3×3. Independently of the extracted features, information about the desired view angle is fed to the network. The latent representation of the input and the information about the angle are concatenated and forwarded through a set of fully connected layers. Before being passed to the decoder, the features are reshaped to match the 3D shape required by the convolutional layers. The decoder consists of a sequence of bilinear upsampling operations, each followed by standard convolutional layers. After the first convolutional layer, the network branches out into an RGB branch and a depth branch. Both branches contain two convolutional layers and are responsible for generating an RGB image and a depth map, respectively, of the input object observed from the desired angle.
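The conditioning step in the middle of the network (concatenate the latent code with the view angle, pass it through a fully connected layer, reshape for the decoder) can be illustrated with a toy sketch. All sizes, the angle encoding, the absence of an activation function, and the function names are assumptions made for illustration only.

```python
def concat_latent_and_angle(latent, angle):
    """Join the encoder's latent vector with the desired view-angle features."""
    return list(latent) + list(angle)


def dense(x, weights, bias):
    """Fully connected layer without activation: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def reshape_3d(flat, channels, height, width):
    """Reshape a flat vector to channels x height x width for the decoder."""
    if len(flat) != channels * height * width:
        raise ValueError("vector length does not match the requested shape")
    it = iter(flat)
    return [[[next(it) for _ in range(width)] for _ in range(height)]
            for _ in range(channels)]


latent = [1.0, 2.0]            # toy latent code (hypothetical size)
angle = [0.5]                  # toy view-angle feature (hypothetical encoding)
joined = concat_latent_and_angle(latent, angle)
weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
hidden = dense(joined, weights, bias=[0, 0, 0, 0])
feature_map = reshape_3d(hidden, channels=1, height=2, width=2)
```

In a real network the reshaped feature map would then be upsampled and convolved by the decoder branches.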
The network utilizes shortcut connections. The motivation behind this approach is that a network with shortcut connections can keep the most important features in the latent space. Due to the limited size of the latent space, the information about less important features or small texture patches from the feature maps is stored in the encoder. These snippets of information are weighted in the shortcut connections. To avoid overfitting, we rely on batch normalization and gradient clipping to a fixed range. Due to the abundance of data in the generated dataset, we do not use other regularization techniques or data augmentation. The weights in the network were initialized using Xavier initialization.
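As a minimal sketch of two of the training details mentioned above, the code below shows Xavier (Glorot) uniform initialization and value-based gradient clipping. The clipping limit is a hypothetical parameter, since the actual range is not given here, and the layer sizes are arbitrary.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


def clip_gradients(grads, limit):
    """Clamp every gradient component to the interval [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


rng = random.Random(0)
weights = xavier_uniform(fan_in=4, fan_out=2, rng=rng)   # arbitrary layer size
clipped = clip_gradients([-5.0, 0.3, 2.0], limit=1.0)    # limit is hypothetical
```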
During experiments, we found that a single generative network that generates images for multiple object classes is difficult to train. Thus, we decided to use a set of small networks, each dedicated to a single object class. These networks can be trained faster and with limited resources. During inference, we rely on the class predictions from Mask R-CNN to choose the appropriate generative network.
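The per-class dispatch described above amounts to a lookup from the predicted label to a dedicated model. The sketch below assumes a placeholder `ViewGenerator` class (hypothetical; it stands in for a trained network) and uses the twelve class names listed in the dataset description.

```python
class ViewGenerator:
    """Placeholder for one per-class generative network (hypothetical)."""

    def __init__(self, class_name):
        self.class_name = class_name


CLASSES = ["birdhouse", "bottle", "bowl", "can", "car", "chair",
           "faucet", "guitar", "lamp", "microphone", "mug", "table"]

# One small generator per object class.
GENERATORS = {name: ViewGenerator(name) for name in CLASSES}


def select_generator(predicted_label):
    """Pick the generator that matches the label predicted by the detector."""
    try:
        return GENERATORS[predicted_label]
    except KeyError:
        raise ValueError(f"no generator trained for class '{predicted_label}'")


generator = select_generator("mug")
```

Failing loudly on an unknown label is a design choice of this sketch; a deployed system might instead fall back to the closest known class.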
To the best of our knowledge, there is no large dataset of images of real objects acquired from various viewpoints. This is not surprising, given the effort needed to create such a dataset. Each object should be photographed from many viewpoints in a controlled environment, with an adjustable distance between the camera and the object, controlled sampling of view angles, and varied lighting conditions and backgrounds. It should be noted that some attempts to create small versions of such datasets have been made. Lai et al. placed 300 different real-world objects (belonging to 51 classes) on a turntable and photographed them with a step of about 5 degrees. A similar dataset was proposed by Singh et al., where 125 objects were photographed (with 600 samples per object). Moreels and Perona collected images of 100 objects under three different lighting conditions, sampling each object 144 times.
Unfortunately, all of the aforementioned datasets have one disadvantage: the number of different instances of objects belonging to the same class is relatively small, usually below 10. For a typical neural network, such a small amount of data is usually not sufficient. Therefore, a number of attempts have been made to utilize synthetic data, which is much easier to collect. For example, Aubry et al. treat object category detection as a 2D-to-3D alignment problem. The authors employ a large dataset of 3D models of artificially synthesized chairs and then successfully run the method on real-world images.
In this work, we also decided to train our models on synthetic data due to its abundance. We utilized the ShapeNet dataset. It contains 55 common object classes with about 51,000 unique 3D models. The objects are categorized using WordNet synsets, which means that each object will typically belong to several categories arranged in a hierarchy, from coarse (animal) to fine (Siberian Husky). The authors of ShapeNet normalized the initial position of each object. We extracted models of objects belonging to multiple categories: birdhouse, bottle, bowl, can, car, chair, faucet, guitar, lamp, microphone, mug, table. On average, each class contains about 300 different models. Then, for each class, we rendered at most the first 300 models (due to the high computational cost of rendering multiple images) at different angles. We generated both RGB images and depth maps. We sampled the pitch angle from 0° to 30° with a 10° step and the yaw angle from -360° to 360° with a 12° step. We did not modify the roll angle.
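The angle sampling described above can be enumerated with a short sketch. It assumes that both range endpoints are included and that yaw angles which differ by a full turn denote the same view; the function name is illustrative.

```python
def view_angles(pitch_range, pitch_step, yaw_range, yaw_step):
    """Enumerate (pitch, yaw) pairs in degrees, endpoints included (assumed)."""
    pitches = range(pitch_range[0], pitch_range[1] + 1, pitch_step)
    yaws = range(yaw_range[0], yaw_range[1] + 1, yaw_step)
    return [(p, y) for p in pitches for y in yaws]


views = view_angles((0, 30), 10, (-360, 360), 12)

# Yaw values that differ by 360 degrees describe the same camera pose.
unique_views = {(p, y % 360) for p, y in views}

print(len(views), len(unique_views))
```

Under these assumptions the grid has 4 pitch values and 30 distinct yaw values, that is, 120 distinct views per model.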
II-D Training the network
We trained all parts of our reconstruction pipeline independently. For object extraction, we used a pre-trained Mask R-CNN and partially fine-tuned it on our data. The network was pre-trained on the COCO dataset, which contains 80 classes. Unfortunately, COCO has no "can" class. Therefore, we fine-tuned Mask R-CNN on a synthetic dataset of cans. To prepare this dataset, we sampled one thousand images of workshops, interiors, rooms, and similar scenes from Google. Then we randomly chose 20 different can instances and embedded up to 7 random objects into a randomly chosen background image. With this method, we generated 2500 training samples of cans for Mask R-CNN.
For the generative module, we decided to use a set of 12 independent models of identical structure. Each model (neural network) is responsible for generating images of objects belonging to a single class. Each model was trained to minimize the mean square error between the target ground truth image and the generated image (RGB and depth). We set equal weights for the RGB and depth losses. During training, we used the Adam optimizer [32].
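The training objective described above, mean square error on both outputs with equal weights, can be written as the following minimal sketch. Images are represented as flat lists of pixel values for simplicity, and the function names are illustrative.

```python
def mse(pred, target):
    """Mean square error between two equally sized flat pixel lists."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(rgb_pred, rgb_true, depth_pred, depth_true,
               w_rgb=1.0, w_depth=1.0):
    """Weighted sum of the RGB and depth losses (equal weights by default)."""
    return w_rgb * mse(rgb_pred, rgb_true) + w_depth * mse(depth_pred, depth_true)


loss = total_loss([0.0, 2.0], [0.0, 0.0], [1.0], [0.0])
```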
An important aspect of network training is data shuffling. We pair the views within each object instance randomly at every training step. However, we never combine input and output images from different instances. Creating random pairs from mixed instances resulted in a much worse quality of the generated views and a lack of instance-specific detail. In that setting, the network is not able to generalize the view correctly or pay attention to individual instance features.
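The pairing rule described above can be sketched as follows: input and target views are drawn at random, but always from the same object instance. The data layout (a dictionary from instance identifier to a list of view identifiers) and the function name are assumptions, and the sketch allows a view to be paired with itself.

```python
import random


def sample_pairs(views_by_instance, rng):
    """Pair every view with a random view of the same instance, then shuffle."""
    pairs = []
    for instance_id, views in views_by_instance.items():
        for source in views:
            target = rng.choice(views)       # never taken from another instance
            pairs.append((instance_id, source, target))
    rng.shuffle(pairs)
    return pairs


dataset = {
    "mug_01": ["mug_01/yaw000", "mug_01/yaw012", "mug_01/yaw024"],
    "mug_02": ["mug_02/yaw000", "mug_02/yaw012"],
}
pairs = sample_pairs(dataset, random.Random(0))
```

Calling `sample_pairs` again at each training step yields a fresh random pairing while keeping instances separate.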
We are mostly interested in the visual quality of generated RGB images and depth maps. We also check the generalization capabilities of our neural network, both to unseen objects and to viewing angles that the network was not trained on.
Example results for the testing dataset are presented in Fig. 4. We tested our method on the 12 categories of objects. For each example, Fig. 4 shows the input image selected from the dataset and the generated RGB and depth images. RGB images have a black background and depth images have a white background. The top row of the RGB and depth images is generated by the neural network, and the row below shows the reference images. Our network was not trained with various textures, so the color of the object is always blue.
To test how the neural network generalizes the shape of the objects, we provided images of three different instances of the mug class as input to the neural network. Results are presented in Fig. 5. The proposed neural network can extract the visual shape of the object and generate images of this object from different perspectives, even though these objects were not shown during training.
We also verified how the neural network generates images of the same object observed from different viewpoints. In Fig. 6 we show three different images of the same object and sequences of RGB and depth images from these input images. The most interesting example is presented in Fig. 6b. In the input image, the handle is not visible. However, the neural network can generate the images of the mug with the handle when we generate images from different viewpoints. The shape and size of the handle are slightly different than the real handle but the neural network can correctly predict that the handle is located on the occluded side of the mug.
To show the properties of our method, we present how the network generates images for orientations that have never been presented to it. The results of the experiment presented in Fig. 7 show that the model is able to interpolate between the training angle samples. The yaw angle in the training dataset was sampled with a 12° step, whereas the images presented in Fig. 7 are generated for camera poses that differ by 6°. This means that the odd images are obtained for angles presented during the training phase, while the remaining images are obtained for camera orientations not used during training, so the results are interpolated by the network. It can be clearly seen that the continuity of the angle space is preserved. This is interesting given that the autoencoder used during training was not designed to preserve the continuity of the space (as opposed to, for example, variational autoencoders).
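The relation between the training and test yaw grids can be checked with a few lines. The sketch assumes that the displayed sequence starts at a yaw angle that was present in the training set (0° here); the function name is illustrative.

```python
TRAIN_STEP = 12   # yaw step used for training, in degrees
TEST_STEP = 6     # yaw step between consecutive generated images, in degrees


def label_views(count, start=0):
    """Label each generated view as seen in training or interpolated."""
    rows = []
    for index in range(1, count + 1):          # 1-based image index
        yaw = start + (index - 1) * TEST_STEP
        seen = yaw % TRAIN_STEP == 0
        rows.append((index, yaw, "seen in training" if seen else "interpolated"))
    return rows


rows = label_views(6)
for index, yaw, label in rows:
    print(index, yaw, label)
```

With this starting point, odd-numbered images fall on the training grid and even-numbered images lie halfway between two training angles.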
In the next experiment, we checked what happens if Mask R-CNN misclassifies the object. To this end, we provided images of a birdhouse to the neural network trained for cars (Fig. 8a and Fig. 8b), lamps to the guitar model (Fig. 8c and Fig. 8d), and mugs to the bottle model (Fig. 8e and Fig. 8f). The neural network can properly generate the orientation of the object. This suggests that it is easier for the neural network to model the transformation of rigid objects than to generate the object’s shape. However, it is visible how the neural network mixes the input object with the model of the object stored in the network, producing reasonable and interesting images (Fig. 8).
To provide quantitative results, we also compare the images generated by the neural network with the reference images obtained from the 3D model of the object. The error between two images is computed as follows:
where is the size of the image, is the number of layers, and are the corresponding pixels of the reference image and the image generated by the neural network, respectively. The number of layers for the depth image is 1 and the value is set to 3 for the RGB images. We also compute the accuracy of the obtained RGB and depth images:
which is normalized by the maximal value of the image (255).
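One plausible form of such an error and accuracy measure is sketched below. The exact formulas are not reproduced here, so the sketch assumes a mean absolute difference over all pixels and layers, and an accuracy defined as one minus the error normalized by the maximal pixel value (255). Images are flat lists of values, and the function names are illustrative.

```python
def image_error(reference, generated):
    """Mean absolute difference over all pixels and layers (assumed form)."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    return sum(abs(r - g) for r, g in zip(reference, generated)) / len(reference)


def accuracy_percent(reference, generated, max_value=255):
    """Accuracy in percent: 100 * (1 - error / max_value) (assumed form)."""
    return 100.0 * (1.0 - image_error(reference, generated) / max_value)


# A depth image has one layer; an RGB image would contribute three values per pixel.
reference = [0, 255]
generated = [0, 0]
print(image_error(reference, generated), accuracy_percent(reference, generated))
```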
The accuracy of the proposed method is presented in Tab. I. The number of testing objects for each category is the same. The results are obtained for the first five instances of test objects from each category of the ShapeNet dataset. For each selected object we generate five different images that differ in the observation angle. The images are generated by sampling the pitch angle from 0° to 30° with a 3° step and the yaw angle from 0° to 360° with a 6° step. This means that we have 300 input images and we generate 180,000 images for comparison. Surprisingly, we obtained the best accuracy for complex objects like cars, faucets, and guitars. Objects like chairs and tables are more difficult for the proposed neural network. The biggest error is caused by the generated legs of these objects, which are very often bent or vague. The lowest accuracy is obtained for the birdhouse class. This is mainly because the testing objects differ significantly from the training dataset. However, the results are still satisfactory (Fig. 5).
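The image counts quoted above can be verified with simple arithmetic. The sketch assumes that the upper bound of each angle range is excluded, which is the reading that reproduces the stated total.

```python
categories = 12
instances_per_category = 5
inputs_per_instance = 5
input_images = categories * instances_per_category * inputs_per_instance

pitch_values = len(range(0, 30, 3))    # 10 values, upper bound excluded (assumed)
yaw_values = len(range(0, 360, 6))     # 60 values, 360 coincides with 0
generated_images = input_images * pitch_values * yaw_values

print(input_images, generated_images)
```

Each input image is therefore compared against 600 generated views.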
We also checked how the accuracy of the generated RGB and depth images depends on the rotational distance between the input image viewpoint and the reference viewpoint. The results are presented in Fig. 9. Unsurprisingly, the accuracy decreases when the rotation angle between the input and generated image increases, but it stays at a reasonable level.
Finally, we evaluated how well our model performs on real data, unavailable in the synthetic dataset. The example results can be seen in Fig. 10. These results confirm the conclusions drawn from the synthetic data. Note that our neural network was not trained with textured objects and does not generate textures on the output images.
IV Conclusions and Future Work
In this paper, we present a method that generates a sequence of RGB and depth images from a single RGB input image. We proposed a complete processing pipeline, which extracts the objects from the raw input image and generates a set of RGB and depth images of the query object from the desired viewpoints. We show that the proposed neural network is capable of interpolating to viewpoints not used during training and of generating views of novel objects. The proposed method is especially important in the field of mobile manipulation robots. With the proposed method, the robot can better understand the spatial properties of objects without the need for complete scanning, which is time-consuming and sometimes impossible to perform. Our method works end to end without human supervision.
The method also has some limitations, which we are going to address in the future. The neural network cannot handle the texture of the object. We also failed to train a single neural network that can generate images for different object classes. We are convinced that these problems can be solved by investigating the architecture of the proposed neural network and by using more computational resources. In the future, we are also going to use the generated images to reconstruct the 3D model of the object and to estimate the motion of the camera by comparing generated and current camera images.
-  M. Kopicki, R. Detry, M. Adjigble, R. Stolkin, A. Leonardis, J.L. Wyatt, One shot learning and generation of dexterous grasps for novel objects, International Journal of Robotics Research, pp. 959–976, vol. 35(8), 2015
-  U. Hillenbrand, M. Roa, Transferring functional grasps through contact warping and local replanning, IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2963–2970, 2012
-  B. Frank, C. Stachniss, N. Abdo, W. Burgard, Efficient motion planning for manipulation robots in environments with deformable objects, IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2180–2185, 2011
-  C.B Choy, D. Xu, J.Y. Gwak, K. Chen, S. Savarese, 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction, Computer Vision – ECCV 2016, Lecture Notes in Computer Science, vol. 9912, B. Leibe et al. (eds.), Springer, pp. 628–644, 2016
-  J. Wu, C. Zhang, T. Xue, W. Freeman, J. Tenenbaum, Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling, Advances in Neural Information Processing Systems, pp. 82–90, 2016
-  J. Redmon, A. Farhadi, YOLO9000: Better, Faster, Stronger, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525, 2017
-  S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, H. Lee, Generative Adversarial Text-to-Image Synthesis, Proc. of The 33rd International Conference on Machine Learning, pp. 1060–1069, 2016
-  S. M. Ali Eslami, D.J. Rezende, F. Besse, F. Viola, A.S. Morcos, M. Garnelo, A. Ruderman, A.A. Rusu, I. Danihelka, K. Gregor, D.P. Reichert, L. Buesing, T. Weber, O. Vinyals, D. Rosenbaum, N. Rabinowitz, H. King, C. Hillier, M. Botvinick, D. Wierstra, K. Kavukcuoglu, D. Hassabis, Neural Scene Representation and Rendering, Science vol. 360(6394), pp. 1204–1210, 2018
-  J. Mahler, M. Matl, X. Liu, A. Li, D. Gealy, K. Goldberg, Dex-Net 3.0: Computing Robust Robot Vacuum Suction Grasp Targets in Point Clouds using a New Analytic Model and Deep Learning, IEEE/RAS International Conference on Robotics and Automation, pp. 1–8, 2018
-  Y. Xiang, T. Schmidt, V. Narayanan, D. Fox, PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes, Robotics: Science and Systems (RSS), 2018
-  J. Varley, C. DeChant, A. Richardson, J. Ruales, P. Allen, Shape completion enabled robotic grasping, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2442–2447, 2017
-  A. Kar, S. Tulsiani, J. Carreira, J. Malik, Category-specific object reconstruction from a single image, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1966–1974, 2015
-  D. Schiebener, A. Schmidt, N. Vahrenkamp, T. Asfour, Heuristic 3D object shape completion based on symmetry and scene context, IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 74–81, 2016
-  J. Bohg, M. Johnson-Roberson, B. León, J. Felip, X. Gratal, N. Bergström, D. Kragic, A. Morales, Mind the gap - robotic grasping under incomplete observation, IEEE/RAS International Conference on Robotics and Automation, pp. 686–693, 2011
-  A. Dai, C. Ruizhongtai Qi, M. Nießner, Shape Completion Using 3D-Encoder-Predictor CNNs and Shape Synthesis, IEEE Conference on Computer Vision and Pattern Recognition, pp. 6545–6554, 2017
-  Q. Chen, V. Koltun, Photographic Image Synthesis with Cascaded Refinement Networks, IEEE International Conference on Computer Vision, pp. 1520–1529, 2017
-  J. Flynn, I. Neulander, J. Philbin, N. Snavely, Deep Stereo: Learning to Predict New Views from the World’s Imagery, IEEE Conference on Computer Vision and Pattern Recognition, pp. 5515–5524, 2016
-  A. Van Den Oord, N. Kalchbrenner, K. Kavukcuoglu, Pixel recurrent neural networks, International Conference on Machine Learning, vol. 48, pp. 1747–1756, 2016
-  J. Spehr, On Hierarchical Models for Visual Recognition and Learning of Objects, Scenes, and Activities, Studies in Systems, Decision and Control, Springer, 2015
-  A.A. Soltani, H. Huang, J. Wu, T.D. Kulkarni, J.B. Tenenbaum, Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes with Deep Generative Networks, Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1511–1519, 2017
-  K. He, G. Gkioxari, P. Dollar, R. B. Girshick, Mask R-CNN, IEEE International Conference on Computer Vision, pp. 2980–2988, 2017
-  O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, 2015
-  S. Ioffe, C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, International Conference on Machine Learning, vol. 37, pp. 448–456, 2015
-  X. Glorot, Y. Bengio, Understanding the Difficulty of Training Deep Feedforward Neural Networks, Proceedings of the International Conference on Artificial Intelligence and Statistics, Society for Artificial Intelligence and Statistics, pp. 249–256, 2010
-  K. Lai, L. Bo, X. Ren, and D. Fox, A Large-Scale Hierarchical Multi-View RGB-D Object Dataset, IEEE International Conference on Robotics and Automation, pp. 1817–1824, 2011
-  A. Singh, J. Sha, K. Narayan, T. Achim, P. Abbeel, BigBIRD: A large-scale 3D database of object instances, IEEE International Conference on Robotics and Automation, pp. 509–516, 2014
-  P. Moreels and P. Perona, Evaluation of Features Detectors and Descriptors based on 3D objects, International Conference on Computer Vision, pp. 800–807, 2005
-  M. Aubry, D. Maturana, A. Efros, B. Russell and J. Sivic, Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models, Conference on Computer Vision and Pattern Recognition, pp. 3762–3769, 2014
-  A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, F. Yu, ShapeNet: An Information-Rich 3D Model Repository, Computing Research Repository, 2015
-  C. Fellbaum, WordNet: An Electronic Lexical Database, Bradford Books, 1998
-  D. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, International Conference on Learning Representations, 2014
-  R. Hahnloser, R. Sarpeshkar, M A Mahowald, R. J. Douglas, H.S. Seung, Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit, Nature. 405. pp. 947–951, 2000