1.1 Camera tracking and model reconstruction
We have adapted KinectFusion as the core camera tracking system with dense 3D model reconstruction. A Kinect depth sensor is used to fuse the data into a single global surface model while simultaneously obtaining the camera pose with a coarse-to-fine iterative closest point (ICP) algorithm. The tracking and modelling process consists of four steps: (i) each pixel acquired by the depth camera is first transformed into 3D space using the camera's intrinsic parameters and the corresponding depth value; (ii) an ICP alignment algorithm estimates the camera pose between the current frame and the reconstructed model; (iii) with the camera poses available, each consecutive depth frame is fused incrementally into a single 3D reconstruction via a volumetric truncated signed distance function (TSDF); (iv) finally, a surface model is predicted via a ray-casting process.
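Steps (i) and (iii) above can be sketched in a few lines of NumPy. This is a simplified illustration, not the KinectFusion implementation itself: the function names, the fixed truncation distance, and the per-voxel weighted averaging are our own minimal stand-ins for the full pipeline.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Step (i): back-project an H x W depth map (metres) into an
    H x W x 3 point cloud using pinhole camera intrinsics."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def update_tsdf(tsdf, weights, sdf_obs, trunc=0.05):
    """Step (iii), simplified: clamp the observed signed distance to
    [-trunc, trunc] and fold it into the per-voxel weighted running
    average that defines the volumetric TSDF."""
    d = np.clip(sdf_obs, -trunc, trunc)
    tsdf_new = (tsdf * weights + d) / (weights + 1)
    return tsdf_new, weights + 1
```

In the real system the ICP pose estimate of step (ii) determines which voxel each depth sample lands in before the TSDF update is applied.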
1.2 Deep learning for material recognition
To train a neural network for material recognition, we follow the method of Zhao et al., using the VGG-16 model pre-trained for the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) as the initial weights of our network. We then fine-tuned the network from the 1000 ImageNet object classes to 23 material class labels based on the Materials in Context Database (MINC), which contains 3 million material samples across 23 categories. However, this Convolutional Neural Network (CNN) is designed for classification tasks and produces only a single label for a whole image. We therefore cast the CNN into a Fully Convolutional Network (FCN) for pixel-wise dense output. By transforming the last three inner-product (fully connected) layers into convolutional layers, the network can make dense predictions efficiently at the pixel level for tasks such as semantic segmentation. Finally, we trained the FCN-32s, FCN-16s and FCN-8s variants consecutively using images with material labels provided in the MINC database.
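The inner-product-to-convolution transformation can be illustrated with a small NumPy sketch (function names and the tiny layer sizes are our own; the real conversion operates on VGG-16's fc6/fc7/fc8 weights): a fully connected weight matrix over a k x k feature map is reshaped into a bank of k x k convolution kernels, so the same weights now slide over inputs of any size and emit a prediction at every location.

```python
import numpy as np

def fc_to_conv_kernel(W_fc, in_channels, k):
    """Reshape a fully connected weight matrix of shape
    (out_dim, in_channels * k * k) into (out_dim, in_channels, k, k)
    convolution kernels -- the core of 'convolutionalizing' an fc layer."""
    out_dim = W_fc.shape[0]
    return W_fc.reshape(out_dim, in_channels, k, k)

def conv_valid(x, kernels):
    """Naive valid-mode convolution (cross-correlation, as in CNNs) of a
    (C, H, W) input with (O, C, k, k) kernels."""
    O, C, k, _ = kernels.shape
    H, W = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((O, H, W))
    for o in range(O):
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * kernels[o])
    return out
```

At the original input size the converted layer reproduces the fc output exactly at the single valid location; on larger inputs it yields the dense prediction map the FCN relies on.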
1.3 Semantic label fusion using CRF
KinectFusion builds a 3D model, but our material recognition network provides only 2D outputs. Therefore, following prior work on dense 3D semantic mapping, we employ a Conditional Random Field (CRF) graphical model to guide the fusion of the 2D semantic labels onto the 3D reconstruction model. The CRF enforces contextual consistency between neighbouring surface elements; the final fusion result is shown in Figure 1.
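The full system uses efficient inference in a fully connected CRF; as a toy illustration of the contextual-consistency idea, the sketch below (all names and the weight w are our own simplifications) performs one mean-field update for a Potts-model CRF over an explicit neighbour graph, pulling each element's label distribution towards agreement with its neighbours.

```python
import numpy as np

def mean_field_step(Q, unary, adj, w=1.0):
    """One mean-field update for a Potts-model CRF.

    Q     : (N, L) current per-node label distributions
    unary : (N, L) unary energies (lower = more likely)
    adj   : list of neighbour index lists, one per node
    Agreement with neighbours' expected labels lowers the energy,
    which is what enforces contextual consistency in the fusion."""
    msg = np.zeros_like(Q)
    for i, nbrs in enumerate(adj):
        for j in nbrs:
            msg[i] += Q[j]                      # expected neighbour labels
    logits = -unary + w * msg                   # negative energy per label
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)     # renormalise (softmax)
```

After a few iterations, a node with an ambiguous unary term inherits the confident label of its neighbours, which is how isolated 2D misclassifications get smoothed out on the 3D surface.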
2 Results and discussion
We have developed a small shooting game demo (https://www.youtube.com/watch?v=02ZAqXH2FGU; see Figure 6) in Unity to demonstrate our proposed concept of semantic material-aware AR. Our framework is built as a plug-and-play plugin in Unity, which performs the AR camera pose tracking and feeds the 3D semantic-aware model. The game contains two layers: the top layer displays the current video stream from an RGBD camera, whilst the semantic 3D model serves as the physical interaction layer by correctly mapping the video stream with synchronised camera poses for semantic inference. An octree acceleration data structure has been implemented for efficient ray casting to query material properties, and the corresponding physical interactions are applied through physics simulations. As can be seen in Figure 6, realistic interactions between real and virtual objects (e.g. bullet holes, flying chips and sound) are simulated in real time with different material responses, i.e. (a) wood, (b) glass and (c) fabric, creating a real-time interactive semantic-driven AR shooting game. Our work demonstrates a first step towards high-level conceptual interaction modelling for enhanced user experience in complex AR environments.
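The material query at the heart of the interaction loop can be sketched as follows. This is a uniform-step stand-in for the octree-accelerated ray casting described above, and the material/response table is hypothetical (the real demo maps MINC labels to Unity effects):

```python
import numpy as np

# Hypothetical label -> response table; illustrative only.
RESPONSES = {1: "wood: dark hole + flying chips", 2: "glass: shatter", 3: "fabric: soft thud"}

def raycast_material(grid, origin, direction, step=0.25, max_t=50.0):
    """March a ray through a labelled voxel grid and return the first
    non-empty material label hit (0 = empty). The demo accelerates this
    query with an octree; here we use naive fixed-step marching."""
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    t = 0.0
    while t < max_t:
        p = np.floor(np.asarray(origin) + t * d).astype(int)
        if all(0 <= p[i] < grid.shape[i] for i in range(3)):
            if grid[tuple(p)] != 0:
                return int(grid[tuple(p)])
        t += step
    return 0  # ray left the volume without hitting a surface
```

In the game loop, the returned label selects which visual and audio response the physics simulation plays at the hit point.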
-  I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1534–1543, 2016.
-  P. Chevaillier, T.-H. Trinh, M. Barange, P. De Loor, F. Devillers, J. Soler, and R. Querrec. Semantic modeling of virtual environments using mascaret. In Software Engineering and Architectures for Realtime Interactive Systems (SEARIS), 2012 5th Workshop on, pp. 1–8. IEEE, 2012.
-  O. De Troyer, F. Kleinermann, B. Pellens, and W. Bille. Conceptual modeling for virtual reality. In the 26th international conference on Conceptual modeling-Volume 83, pp. 3–18. Australian Computer Society, Inc., 2007.
-  A. Hermans, G. Floros, and B. Leibe. Dense 3d semantic mapping of indoor scenes from rgb-d images. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pp. 2631–2638. IEEE, 2014.
-  P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems, pp. 109–117, 2011.
-  J. McCormac, A. Handa, A. Davison, and S. Leutenegger. Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. arXiv preprint arXiv:1609.05130, 2016.
-  R. A. Newcombe, D. Fox, and S. M. Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE CVPR, pp. 343–352, 2015.
-  R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on, pp. 127–136. IEEE, 2011.
-  E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(4):640–651, 2017.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
-  C. Zhao, L. Sun, and R. Stolkin. A fully end-to-end deep learning approach for real-time simultaneous 3d reconstruction and material recognition. In 2017 18th International Conference on Advanced Robotics (ICAR), pp. 75–82, July 2017. doi: 10.1109/ICAR.2017.8023499