Simultaneous localization and mapping (SLAM) is one of the hardest and most important fields in computer vision. As pointed out by Montemerlo , SLAM is a fundamental problem of autonomous robots and therefore it plays a key role in robotics aswell as in augmented reality.
The challenge here is to aswell localize a robot within an unknown area of unknown scale and compute an internal global environment map for tracking. With computer vision, it is possible to solve that problem by creating a map of the environment in form of a 3D model. The camera images of the moving robot are used to observe the environment containing all the visible objects and structures. The creation of the map has to be done online, since it is neccessary to simultaneous localize the robot and update the global 3D map, frame by frame. The whole algorithm has a high computational complexity which makes it very challenging to achieve a suffienctly high framerate for an accurate tracking.
In Augmented Reality (AR), registration is one the most essential components. Since we were using the Oculus Rift as an Head-Mounted-Display (HMD), we had to capture the real world using undistorted camera images and render both camera streams onto the intern HMD screen. To achieve a stereoscopic view, we have used two fish-eye cameras IDS uEye UI-122-1LE-C-HQ. Both of them are positioned on the front plate of the Rift. Using this Virtual Reality (VR) device, we could achieve a full-view AR immersion, since we are able to blend holograms everywhere at the intern screen with the real world. The development of our previous AR system is documented in our previous paper .
Microsoft’s new promising HoloLens is based on similar principles. It renders virtual objects onto the screen located within the lenses and tracks the 3D position in world space using the mounted camera. However, since the lenses of the HoloLens are transparent, there is no need for a digital camera stream. On the downside, the HoloLens can augment only a small area of the real world field-of-view (FOV), located in the center of the screen.
In section 2, we talk about Image Registration since this is a fundamental knowledge of SLAM systems. For illustration, we show some examples from the field of Medical Image Analysis. Then we cover SLAM methods in general and go into more detail about the monocular LSD-SLAM.
Section three 3 shows some implementation details about the integration and modifications, which have been done.
1.1 Building upon our AR Oculus Rift System
Considering our previous basic AR system  with a modified Oculus Rift, and our own DirectX stereo engine, we were able to render and animate 3D graphic models super-imposed to the images captured by the cameras.
Using the Oculus Rift’s internal gyro sensor, the augmentations could move consistently with the orientation of the user’s head. But what makes AR systems really interesting, is a complete augmentation of the holograms into realworld space. To give a good illusion of augmentation, it is neccessary to consider translation motions as well.
2.1 Vision in Augmented Reality
Building large-scale, accurate, responsive, and robust AR systems in real-time is challenging. What makes AR systems complex in computation is the real-world camera tracking, especially in large-scale areas without having information about the environment.
According to , the majority of AR systems (in 2007) operated with a prior knowledge of the environment the camera is located in. One possibility is to place a marker, maybe a card or a chessboard into the scene with known position like shown in . By observing the marker, its prior known 3D world position can be used to measure the location of the observing camera in relation to the object in world space. The huge downside in approaches like that is, it is only possible to yield location and rotation of the camera while measuring the marker. That means, augmentation is not possible when the camera is currently not observing and measuring the object with prior known world coordinates.
Compared to these marker-based approaches, we wanted to integrate an extensible tracking, without having any prior information about the environment to fully augment reality. This need of extensible tracking opens the door to SLAM systems.
How do we know about the 3D pose of the Rift in an entirely prior unknown environment, frame-by-frame in real-time? The answer is an algorithm called Monocular LSD-SLAM, introduced by Engel et al . It covers the process of recovering geometrical information from images. The camera could also be mounted on a robot and used for autonomous navigation. As an input sensor, we use the camera images. The underlying process of relating images to each other in LSD-SLAM is called Image Registration and is explained further in the next section with some illustrative examples from Medical Image Analysis.
2.2 Image Registration
The term Image Registration defines the process of determining a one-to-one mapping or transformation between the coordinates in one space and those in another, such that points in the two spaces that correspond to the same point are mapped to each other. It is about to find a function, aligning a moving image with a second fixed image, such that a defined similarity measure is maximized. This underlying process can also be seen as part of LSD-SLAM, it registers images to each other to retrieve information about similarity transform sim(3) of our observed camera keyframes, achieved by image-aligning.
In this section, we talk shortly about the principal methods of Image Registration. We lean here mainly on the book Medical Image Registration from Hill . In Medical Image Analysis, images from anatomical changing patients or different sensors need to be registered to achieve Image Fusion and / or improve Image Guided Surgery based on AR systems.
Image Registration can be done by four differnt methods. Three of them belong to rigid registration which are shortly explained here. Since nonlinear (nonrigid) registrations are not as relevant in our case, we skip them for the sake of this paper. The complexity of the methods increase in descending listing order.
2.2.1 Feature Based Rigid Registration
Placing anatomic or geometric landmarks on the human body makes it easy to identify these features (fiducials) in different images. Since the markers are very clearly to identify on every image, even if the images come from different sensors (CT-MR), the algorithm is straight forward by solving the Procrustes Problem with Point-Based Registration. Given two sets of N corresponding points and , find the similarity transformation (scale factor s, rotation matrix
and translation vector) that minimizes the mean squared distance between the points:
2.2.2 Surface Based Rigid Registration
A good gemoetrical feature of an object is represented by its boundary. This approach can align large number of points in images but segmentation is necessary for surface extraction. The extracted surface is represented either as a point set, triangle set or as level set. Given a set of surface points and a surface Q, it is about to find the rigid-body transformation T, where T is rotation matrix R and translation vector t that minimizes the mean squared distance between the points and the surface:
Iterative Closest Point algorithm yields the solution for this problem.
2.2.3 Intensity Based Rigid Registration (Voxel Similarity)
Intensity Based approaches are more complex, but also more powerful since more information is used. Voxel Similarity measures must distinguish between Single-modality registration and Multi-modality registration. The terms defines if the input images come from the same sensor, or different sensors. Depending on the domain, it is important to use suitable similarity assumptions. When speaking of LSD-SLAM, the scene consists of constantly changing intensities, but the sensor hardware (camera) stays the same.
Single-Modality means images are retrieved from the same sensor, e.g. CT-CT, MR-MR, PET-PET. etc. The main approach is to search a transformation T, which is determined by iteratively minimizing a voxel-based dissimilarity measure C. With Single-Modal, two different similarity assumptions about the intensity can be made, ”Identity” and ”Linear” which form the similarity metric.
Identity as a similarity assumption means, that the image intensity only differs by a Gaussian noise. The similarity metric of the Sum Of Squared Differences (SSD) can be used.
Though, it is sensitive to outliers, where the Sum Of Absolute Differences (SAD) is better suited.
Linear assumes a linear relationship between image intensities. As a similarity metric, the Normalized Cross Correlction (CC) is used.
(MR-CT, MR-PET, CT-PET, etc) assumes that image intensities are related by some unknown function or statistical relationship which is not known a-priori. Images are considered as a (pixel-wise) probability distributions which can be estimated using a histogram. Knowing this, it is possible to create 2D Joint Histograms and retrieve knowledge about the Joint Probability p(a, b) of a pixel having intensity a in one image and intensity b in another image.
The amount of information in combined images A and B is described by the Joint Entropy. Registration is achieved by minimizing the joint entropy between both images. However, it is highly sensitive to the overlap of two images.
Mutual Information on the other hand, describes how well an image can be explained by another image and can be expressed in terms of marginal and joint probability distributions:
Registration can be achieved by maximizing the Mutual Information between both images. Normalizing the Mutual Information leads to an independent term of this expression:
To optimize Voxel Similarity based techniques, Multi-Resolution optimazation leads to an acceleration.
2.3 Simultaneous Localization and Mapping
The task of localization solves the problem of estimating the 3D pose of a robot within an unkown environment. Mapping means computing a map of the environment in which the robot is located in. The map is then used as a reference for estimating 3D position. In principal, it is the same what humans do when trying to navigate within an unknown environment. Speaking of localization and mapping, means speaking of two different problems. However, as Davison  also pointed out, solving one of them requires solving both of them. Different SLAM approches are listet below.
2.3.1 Feature-based monocular SLAM
In feature-based monocular methods, the process of retrieving geometric information from images is splint into sequential tasks. The algorithm tries as a first step to extract feature observations from the input (image). From a set of keypoint obversations, it is then possible to estimate 3D pose of the current keypoint features and thus the camera world-coordinates by matching the features with the set of keypoint observations. This means, all the other image information except of the exctracted and observed features are thrown away, approaching however real-time performance since the complexity of the problem is reduced to the keypoint features.
The successful project MonoSLAM from Davison  uses feature-based SLAM within a probabilistic framework running at a framerate of 30 Hz.
However, Engel  clearly states out the big drawback on this method. Only features who match the defined feature type can be extracted (corners, line segments).
2.3.2 Dense monocular SLAM (Direct)
This intensity-based approach improves over the feature-based SLAM method by using all the information available in the image. It uses some measure derived directly from the intensity of the image pixels which leads to more information about the geometry. New image frames are tracked using whole-image alignment. There is no need for feature extraction as in the previous method. Due to the dense information, tracking accuracy and robustness is widely improved over feature-based methods. That makes direct methods very valuable for robotics or augmented reality systems.
The downside is the costly computation. As a comparison, the dense depthmap is similar to the output of a RGB-D camera shown in the following figure.
2.4 Monocular LSD-SLAM
With Monocular Large-Scale Semi Dense SLAM, Engel [1, 8] proposed in 2013 the propably first feautreless real-time approach for monocular visual odometry, running with real-time framerates on a CPU which cuts out the need for high-parallel performance GPU hardware. The algorithm introduces 2 additional fundamental aspects. One is the estimation of semi-dense depthmaps which improves performance sigificantly over dense depthmaps. The second characteristic is then given by the large scale capability of the algorithm.
The algorithm generates an intern global map, frame by frame, using direct image alignment to estimate the camera’s 3D world position. This global map consists of keyframes as vertices with 3D similarity transform sim(3), represented as a pose graph. For each localization estimation, the most up-to-date depthmap from the camera’s recent keyframe is used as a reference for aligning the new captured image after motion movement.
The key idea behind this approach is not to build a dense depthmap using all intensities within the images, but to estimate a semi-dense inverse depth map for the current observed keyframe. The semi-dense inverse depthmap is modeled by using a Gaussian probability distribution which models one inverse depth hypothesis per pixel. The depthmap is estimated each frame and updated with a variable-baseline stere comparison.
This characteristic of Monocular LSD-SLAM solves one of the biggest challenges in SLAM because world scale can not be observed and it changes during runtime. As depth or stereo cameras are very limited in scale, to provide realible measurements, LSD-SLAM enables a smooth switch between different scales within the world. The scale problem is solved by using inherent correlation between scene depth and tracking accuracy. Keyframes have to be scaled such that the mean inverse depth is one and edges between keyframes are estimated as elements of sim(3) allowing to detect scale-drift.
2.4.3 Complete Algorithm
To give an overview about the LSD-SLAM algorithm from Engel , we summarize the components shortly. The algorithm has 3 main components, tracking, depth map estimation and map optimization, illustrated in figure 14
executes the task of permanently tracking new camera images. The rigid body pose se(3) of each current keyframe is estimated by using the pose of the previous frame as a reference.
Depth map estimation computes the semi-dense depth of each current keyframe by filtering over many per-pixel, small-baseline stereo comparisons. It uses tracked frames, and refines the current keyframe. In the case the camera’s motion was too large by moving too far, a new keyframe is initialized which replaces the old keyframe. The replacement is done by projecting points from existing, close-by keyframes onto the new keyframe.
Map optimization triggers when a new keyframe is replaced instead of depth refinement. It’s purpose is to update the global map with the new tracked keyframe. Loop closures and scale-drift are detected by estimating a similarity transform sim(3) to close-by existing keyframes with scale-aware, direct sim(3)-image alignment.
The project is now split into 2 Visual Studio projects, each accessible on Github. The basic AR engine with Oculus Rift support can be accessed via https://github.com/MaXvanHeLL/ARift.git and the modified LSD-SLAM project for the integration is uploaded here https://github.com/MaXvanHeLL/LSD-SLAM.git.
Here we recap shortly the camera calibration process, documented in our previous paper . We have used the omnidirectional camera model from Scaramuzza  and have written an undistortion shader in HLSL for our engine. Thus, we could bypass the undistortion step for the LSD-SLAM, and benefit from a performance boost, since the undistortion shader runs highly parallel on the GPU. We have computed the calibration with a polynomial function of fifth order. Distortion coefficients are already integrated in the projection function F.
If the undistortion, integrated in the LSD-SLAM project, is desired, the camera matrix has to be built with focal lengths and optical centers measured in pixels.
Also the distortion coefficients are neccessary.
Both, the camera matrix and the distortion coefficients are held in the file out_camera_data.xml, located in the data directory of the project root.
3.2 Engine Integration
In our previous paper  we documented our complete AR stereo engine so far. In this section, we give an overview of the major new components due to the LSD-SLAM integration.
The core of the integration is represented by the new class LsdSlam3D, illustrated in figure 15. It can be seen as an API to the dedicated Visual Studio project of LSD-SLAM and holds for us the important information about the camera’s real world position and rotation.
Since the GraphicsAPI is constantly synchronizing the virtual camera with the now fully tracked LSD-SLAM camera, we added a mutex to ensure stable synchronization and guarantee always clean and valid data information.
Additionally, we have added another mutex, used for the camera images. Since the camera stream is running in another thread, we also synchronized the data exchange between the image capturing and SLAM operation.
In the main function we set up all the needed LSD-SLAM properties. First of all, the CvCapture property, which holds information about the tracked monocular camera. The camera is referenced wit an ID, starting from 0 to ascending order. Also the capture frame dimensions are set here with 640 x 480.
After that, we initialize the camera stream thread, used by LSD-SLAM. Since we modified the original lsd_slam::OpenCVImageStreamThread, we introduced a new one called lsd_slam::IDSuEyeCameraStreamThread. The slight modifications to our custom input stream are documented in section 3.6.
Now we are about to set the monocular camera buffer to the IDSuEyeCameraStreamThread. The camera stream thread will from now on constantly read out the frames stored in the undistortion buffer, retrieved from the HLSL shader.
To complete the initalization, we allocate our new introduced LsdSlam3D API object where we will retrieve the desired 3D camera translation and rotation as an output.
3.4 Render Loop Update
Some program controls allow us to switch between LSD-SLAM rotation and gyro sensor rotation from the Oculus Rift, aswell as to decide if we want to use LSD-SLAM translation or only the virtual head camera. Since we can set those controls individually, the Frame() function has been modified accordingly:
As the name suggests, the updateLsdSlam3DRotation() method updates internally the virtual camera rotation matrix with the one retrieved from LSD-SLAM. Also we convert the Eigen::MatrixXd* to our matching coordinate system convention.
What happens in updateLsdSlam3DTranslation() is basically a smooth 3D movement implementation and a reference comparison with LSD-SLAM data. We store a lsdslam_reference which is directly compared with the accurat retrieved riftPosition from LSD-SLAM . Since we wanted to achieve a smooth movement we have added some thresholding value to erase small correction errors. Further we invented a scaling factor for the movement, to be not as limited and scale in different worlds accordingly. Of course, when we don’t have a lsdslam_reference yet, we are in the initialization phase and just store the current position as the origin reference.
3.5 Undistortion ShaderBuffer for LSD-SLAM
Since we have used our own undistortion images retrieved from our fragment shader, we had to adapt the region-of-interest (ROI) of the undistorted images to ensure that we only pass fully valid image data to our lsd_slam::IDSuEyeCameraStreamThread.
After the undistortion, our image buffer itself holds of course black borders which correct the distortion, illustrated in figure 16. With the parameter camID we can control which of our cameras we want to use for the LSD-SLAM, which changes also the ROI parameters due to undistortion.
The undistorted images were so far fully placed in the GPU memory, since they were computed from our fragment shader. Therefore, we had to setup an explicit mapped CPU access to copy data from the GPU memory.
After we have copied the image data to from the GPU, we converted the images’ data buffer to an IplImage* frame where we then can extract the correct ROI, depending on which camera we use for tracking.
For the last step, we resized the image to 640 x 480 which are the suggested dimensions for LSD-SLAM and copy the images to the image buffer.
3.6 LSD-SLAM Code Modifications
Generally, the camera capturing and undistortion process is implemented in the OpenCVImageStreamThread but we already mentioned above that we have used our own undistortion, retrieved from our undistortion shader and supply LSD-SLAM with a buffer of those undistorted images. We show the code modifications within the LSD-SLAM project itself. For this purpose, we introduced a new class lsd_slam::IDSuEyeCameraStreamThread.
The constructor makes it pretty clear which data is neccessary. The cameraBuffer_ and cameraMutex_ are pointers to instances where the thread should read data from. We just copy the data from cameraBuffer_ to the internally used frameBuffer_.
iDSuEyeFrame_ is the final data type on which LSD-SLAM is processing on and resizedFrame_ is needed in some cases when the image dimensions are not 640 x 480.
Finally, use_shaderUndistortion_ just tells us if we want to use the undistorted images supplied by our shader, or might want to use raw image data or something else for any reason.
This method- setIDSuEyeCameraStream() - is called in the configuration phase directly from main, to set the corresponding pointers and configs for camera streaming.
The run() method is calling the actual runtime loop of the thread.
In the beginning before the Loop starts, we check if we have valid camera objects and jump then into the runtime loop, which overrides the actual image streaming to the LSD-SLAM. In case of using the undistortiion images from the shader, there is not much happening here since everything has been prepared previously from the GraphicsAPI.
However, if we might want to use the image raw data for tracking or something else, we have to sample the images to fit the suggested dimensions.
A short method for re-sampling an IplImage to the desired image dimensions. In our case, we would downsample from 752 x 480 to 640 x 480.
3.7 Light Shading Model - Rendering Equation
As a very small gimmick, we also modified our fragment shader from the previously used Lambert Shading to a Phong Shading model. This reflection model can be implemented without any effort actually but gives quite a more realistic look to the holograms due to the reflection term which yields specular highlights. Since it is still a local illumination model, an approximation of global illumination in form of an ambient light term is added to the rendering equation.
Generally, the shading model has 3 components, a diffuse illumination , a specular reflection component and an ambient term . This gives a formal expression like 11.
Formulating the model more precisely for each component, we get the render equation according to formula 12.
is a vector from the surface point to the emmitting light source. is the normal vector of the surface point. stands for a direction vector, pointing towards the observer (viewer) and is the direction of the reflected light which the light ray takes from this surface point on.
3.8 New Engine Configurations
A list of new configuration mechanics, added during the integration process:
LsdSlam_Threshold - configures a threshold which eliminates smaller correction errors in the tracking
LsdSlam_ScaleFactor - configure the scaling from the virtual movement, retrieved from LSD-SLAM
LsdSlam_UNDISTORTION - provide LSD-SLAM with undistorted shader images
LsdSlam_Translation - use translation from LSD-SLAM or not
LsdSlam_Rotation - use rotation from LSD-SLAM or Oculus Rift’s gyro sensor
VSYNC_ENABLED - enables / disables vertical synchronization
3.9 New Libraries
We are really satisfied with the results. The integration process actually worked quite good, however it was still some coding work to adapt the engine for the use of LSD-SLAM. That’s mainly because we have used an omnidirectional camera model for our high degree fisheye-lenses to compute the calibration. And since we already computed the undistortion with a fragment shader in HLSL we wanted to use them for LSD-SLAM aswell. So basically, we had to tweak the image streaming to the algorithm and could not use the prepared OpenCVImageStreamThread.
The project was shown also at the OpenLAB-Night 2016 of the Institute for Computer Graphics and Computer Vision (ICG) at TU Graz. The guests could try it out and we received alot of good feedback. The people showed alot of interest in the project and generally in AR itself.
Tracking the real-world camera and synchronizing it’s 3D movement with the virtual head camera worked pretty well. It still feels exciting to walk towards 3D holograms which are augmented in the real world, going around them and crouch to see them from below.
The following figures show some final tracking results. The digimon model and texture file from the result images are taken from web source models-resource.com under educational usage purposes. The chessboard which sometimes is in the scene was only used for camera calibration.
5 Future Work
We want to focus now on the interaction with virtual objects. The motivation is an interaction system in AR which feels more natural like using our hands and interacting with virtual objects the same way or at least similar as we do with real objects. We expect that to increase the immersion of AR significantly. However, we strongly consider a hardware upgrade since the Oculus Rift DK1 is pretty old already and the display’s resolution is very low. We will use the HoloLens for this project and see how it feels like if we change the interaction system of it with our own one. It will be very interesting to see how that works out and if we might also use some functionality of the interaction from the HoloLens, like maybe using the ”Gaze” raycaster for highlighting the virtual objects but further interacting with our hand gestures or totally bypass the whole interaction and only use the new one.
To capture our hand skeleton, we will use the Leap Motion controller which is using structured light for measurement. The Leap Motion has already a VR mount to install it on the front plate of VR HMDs. Since that could be tricky on the HoloLens, a custom mount might be the better choice. What might be more tricky is the communication between the HoloLens and the Leap Motion, since we will need to set up a remote communication between them due to the fact that the USB ports of the HoloLens can not be used for periphery devices appearantly. Another challenge will be to translate the coordinates from the Leap Motion’s coordinate system to the AR space of the HoloLens and compute a fine hand gesture detection and collision resolution on virtual objects.
-  J. Engel, T. Schöps, and D. Cremers, “Lsd-slam: Large-scale direct monocular slam,” in Computer Vision–ECCV 2014, pp. 834–849, Springer, 2014.
-  M. Höll, N. Heran, and V. Lepetit, “Augmented reality oculus rift,” arXiv preprint arXiv:1604.08848, 2016.
-  M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, et al., “Fastslam: A factored solution to the simultaneous localization and mapping problem,” in Aaai/iaai, pp. 593–598, 2002.
-  G. Klein and D. Murray, “Parallel tracking and mapping for small ar workspaces,” in Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pp. 225–234, IEEE, 2007.
-  H. Kato and M. Billinghurst, “Marker tracking and hmd calibration for a video-based augmented reality conferencing system,” in Augmented Reality, 1999.(IWAR’99) Proceedings. 2nd IEEE and ACM International Workshop on, pp. 85–94, IEEE, 1999.
-  D. L. Hill, P. G. Batchelor, M. Holden, and D. J. Hawkes, “Medical image registration,” Physics in medicine and biology, vol. 46, no. 3, p. R1, 2001.
-  A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “Monoslam: Real-time single camera slam,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 6, pp. 1052–1067, 2007.
-  J. Engel, J. Sturm, and D. Cremers, “Semi-dense visual odometry for a monocular camera,” in Proceedings of the IEEE international conference on computer vision, pp. 1449–1456, 2013.
-  D. Scaramuzza, A. Martinelli, and R. Siegwart, “A toolbox for easily calibrating omnidirectional cameras,” in Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pp. 5695–5701, IEEE, 2006.