Semantic Augmented Reality Environment with Material-Aware Physical Interactions

08/03/2017 ∙ by Long Chen, et al. ∙ Bournemouth University

In Augmented Reality (AR) environments, realistic interactions between virtual and real objects play a crucial role in the user experience. Recent advances in AR have largely focused on developing geometry-aware environments, but little has been done to handle interactions at the semantic level. High-level scene understanding and semantic descriptions in AR would allow the effective design of complex applications and an enhanced user experience. In this paper, we present a novel approach and a prototype system that enable a deeper understanding of the semantic properties of the real-world environment, so that realistic physical interactions between real and virtual objects can be generated. A material-aware AR environment is created through deep material learning using a fully convolutional network (FCN). State-of-the-art dense Simultaneous Localisation and Mapping (SLAM) is used for semantic mapping. Together with efficient accelerated 3D ray casting, natural and realistic physical interactions are generated for interactive AR games. Our approach has significant implications for the future development of advanced AR systems and applications.




1 Methods

1.1 Camera tracking and model reconstruction

We have adapted KinectFusion as the core camera tracking system with dense 3D model reconstruction. A Kinect depth sensor is used to fuse the data into a single global surface model while simultaneously obtaining the camera pose via a coarse-to-fine iterative closest point (ICP) algorithm. The tracking and modelling process consists of four steps: (i) each pixel acquired by the depth camera is first transformed into 3D space using the camera's intrinsic parameters and the corresponding depth value; (ii) an ICP alignment algorithm estimates the camera pose between the current frame and the reconstructed model; (iii) with the camera poses available, each consecutive depth frame is fused incrementally into a single 3D reconstruction via a volumetric truncated signed distance function (TSDF); (iv) finally, a surface model is predicted via a ray-casting process.
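Steps (i) and (iii) above can be sketched as follows. This is a minimal, illustrative numpy sketch, not the paper's implementation: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`) and the truncation distance are assumed example values, and per-voxel fusion is reduced to a single weighted running average.

```python
import numpy as np

# Step (i): back-project a depth pixel (u, v) into a 3D camera-space point
# using assumed pinhole intrinsics (fx, fy, cx, cy) -- illustrative values,
# not the actual Kinect calibration.
def back_project(u, v, depth, fx, fy, cx, cy):
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Step (iii): incremental TSDF fusion for one voxel. Each voxel stores a
# running weighted average of the truncated signed distance to the surface.
def update_tsdf(tsdf, weight, sdf, trunc=0.05, max_weight=100.0):
    d = np.clip(sdf / trunc, -1.0, 1.0)          # truncate the signed distance
    new_tsdf = (tsdf * weight + d) / (weight + 1.0)
    new_weight = min(weight + 1.0, max_weight)   # cap weight to stay adaptive
    return new_tsdf, new_weight

p = back_project(320, 240, 1.5, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(p)  # a point near the optical axis: [~0, ~0, 1.5]
```

Capping the fusion weight keeps the model responsive to scene changes while still averaging out sensor noise.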

1.2 Deep learning for material recognition

To train a neural network for material recognition, we follow the method in [11]: the VGG-16 model [10], pre-trained for the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), is used to initialise the weights of our network. We then fine-tuned the network, replacing its 1000 ImageNet output classes with 23 material class labels based on the Materials in Context Database (MINC), which contains 3 million material samples across 23 categories. However, this Convolutional Neural Network (CNN) is designed for classification tasks and produces only a single label for an entire image. We therefore cast the CNN into a Fully Convolutional Network (FCN) for pixel-wise dense outputs [9]. By transforming the last three inner-product layers into convolutional layers, the network can make dense predictions efficiently at the pixel level for tasks such as semantic segmentation. Finally, we trained the FCN-32s, FCN-16s and FCN-8s models consecutively using images with material labels provided in the MINC database.
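The key identity behind "convolutionalising" an inner-product layer can be checked numerically. The toy sizes below are illustrative: a fully connected layer over a C×H×W feature map is equivalent to a convolution whose kernel covers the whole H×W window, so on larger inputs the same weights slide spatially and yield a dense prediction map.

```python
import numpy as np

# Toy dimensions (illustrative, far smaller than VGG-16's fc layers).
C, H, W, K = 3, 4, 4, 5               # channels, spatial size, output classes
rng = np.random.default_rng(0)
x = rng.standard_normal((C, H, W))    # one feature map
fc_w = rng.standard_normal((K, C * H * W))   # inner-product layer weights

# Fully connected layer: flatten, then matrix-multiply.
fc_out = fc_w @ x.reshape(-1)

# Same weights reinterpreted as K convolution kernels of shape C x H x W.
conv_w = fc_w.reshape(K, C, H, W)
# A "valid" convolution over an input exactly the kernel size gives one
# output position -- identical to the fully connected result.
conv_out = np.array([(k * x).sum() for k in conv_w])

assert np.allclose(fc_out, conv_out)
```

On inputs larger than H×W the reshaped kernels simply evaluate the classifier at every window position, which is what turns per-image classification into per-pixel material labelling.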

1.3 Semantic label fusion using CRF

KinectFusion builds a 3D model, but our material recognition network only provides 2D outputs. Therefore, following [1, 4], we employ a Conditional Random Field (CRF) graphical model [5] to guide the fusion process, mapping the 2D semantic labels onto the 3D reconstructed model. The CRF ensures contextual consistency; the final fusion result is shown in Figure 1.
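The incremental per-voxel part of this fusion can be sketched as below. This is a simplified, assumed formulation: each voxel accumulates the per-frame class probabilities projected onto it as a product of likelihoods, and the CRF smoothing step from [5] is omitted entirely. The class count matches MINC; the probability values are illustrative.

```python
import numpy as np

N_CLASSES = 23  # MINC material categories

def fuse_label(voxel_probs, frame_probs):
    """Multiply a frame's class likelihoods into the voxel and renormalise."""
    fused = voxel_probs * frame_probs
    return fused / fused.sum()

voxel = np.full(N_CLASSES, 1.0 / N_CLASSES)      # uniform prior over materials
obs = np.full(N_CLASSES, 0.02)                   # one frame's FCN prediction...
obs[3] = 0.56                                    # ...voting mostly for class 3

for _ in range(5):                               # five consistent observations
    voxel = fuse_label(voxel, obs)

print(voxel.argmax())  # 3 -- repeated agreement sharpens the voxel's label
```

Fusing across frames in this way makes the 3D labels far more stable than any single 2D prediction, which is exactly why the per-frame FCN output alone is insufficient.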

Figure 1: The result of 3D model reconstruction, 3D semantic labelling and semantic 3D model fusing.

2 Result and discussion

We have developed a small shooting game demo (see Figure 6) in Unity to demonstrate our proposed concept of semantic material-aware AR. Our framework is built as a drop-and-play Unity plugin that handles AR camera pose tracking and feeds in the 3D semantic-aware model. The game contains two layers: the top layer displays the live video stream from an RGBD camera, whilst the semantic 3D model serves as the physical interaction layer, correctly aligned with the video stream through synchronised camera poses for semantic inference. An octree acceleration data structure has been implemented for efficient ray casting to query material properties, and the corresponding physical interactions are applied through physics simulation. As can be seen in Figure 6, realistic interactions between real and virtual objects (e.g. bullet holes, flying chips and sound) are simulated in real time with different material responses, i.e. (a) wood, (b) glass and (c) fabric, creating a real-time interactive, semantics-driven AR shooting game. Our work demonstrates a first step towards high-level conceptual interaction modelling for an enhanced user experience in complex AR environments.
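The material query behind each shot can be sketched as a ray march through the labelled volume. This is an illustrative simplification: the paper's octree acceleration is replaced with a dense label grid and fixed-step marching, and the label codes are assumed for the example.

```python
import numpy as np

def raycast_material(labels, origin, direction, step=0.5, max_t=100.0):
    """March a ray through a labelled voxel grid; return the first non-empty
    voxel's semantic label (0 = empty space), or None if nothing is hit."""
    direction = direction / np.linalg.norm(direction)
    t = 0.0
    while t < max_t:
        p = origin + t * direction
        idx = tuple(p.astype(int))
        inside = all(0 <= i < s for i, s in zip(idx, labels.shape))
        if inside and labels[idx] != 0:
            return labels[idx]            # e.g. 1 = wood, 2 = glass, 3 = fabric
        t += step
    return None

grid = np.zeros((10, 10, 10), dtype=int)
grid[5, 5, 5] = 2                          # place a "glass" voxel in the scene
hit = raycast_material(grid, np.array([0.0, 5.2, 5.2]),
                       np.array([1.0, 0.0, 0.0]))
print(hit)  # 2 -> dispatch the glass response (shatter effect, glass sound)
```

In the actual system the returned label selects the material-specific response (decal, particles, audio) applied by the physics simulation; the octree only accelerates the traversal, leaving the query semantics unchanged.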

Figure 6: A prototype shooting game. Different material responses: (a) wood, (b) glass, (c) fabric. (d) shows the hidden material-aware layer that handles the physical interaction.


  • [1] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1543, 2016.
  • [2] P. Chevaillier, T.-H. Trinh, M. Barange, P. De Loor, F. Devillers, J. Soler, and R. Querrec. Semantic modeling of virtual environments using mascaret. In Software Engineering and Architectures for Realtime Interactive Systems (SEARIS), 2012 5th Workshop on, pp. 1–8. IEEE, 2012.
  • [3] O. De Troyer, F. Kleinermann, B. Pellens, and W. Bille. Conceptual modeling for virtual reality. In the 26th international conference on Conceptual modeling-Volume 83, pp. 3–18. Australian Computer Society, Inc., 2007.
  • [4] A. Hermans, G. Floros, and B. Leibe. Dense 3d semantic mapping of indoor scenes from rgb-d images. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pp. 2631–2638. IEEE, 2014.
  • [5] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems, pp. 109–117, 2011.
  • [6] J. McCormac, A. Handa, A. Davison, and S. Leutenegger. Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. arXiv preprint arXiv:1609.05130, 2016.
  • [7] R. A. Newcombe, D. Fox, and S. M. Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE CVPR, pp. 343–352, 2015.
  • [8] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on, pp. 127–136. IEEE, 2011.
  • [9] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(4):640–651, 2017.
  • [10] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • [11] C. Zhao, L. Sun, and R. Stolkin. A fully end-to-end deep learning approach for real-time simultaneous 3d reconstruction and material recognition. In 2017 18th International Conference on Advanced Robotics (ICAR), pp. 75–82, July 2017. doi: 10.1109/ICAR.2017.8023499