Monocular LSD-SLAM Integration within AR System

by   Markus Höll, et al.

In this paper, we cover the process of integrating Large-Scale Direct Simultaneous Localization and Mapping (LSD-SLAM) algorithm into our existing AR stereo engine, developed for our modified "Augmented Reality Oculus Rift". With that, we are able to track one of our realworld cameras which are mounted on the rift, within a complete unknown environment. This makes it possible to achieve a constant and full augmentation, synchronizing our 3D movement (x, y, z) in both worlds, the real world and the virtual world. The development for the basic AR setup using the Oculus Rift DK1 and two fisheye cameras is fully documented in our previous paper. After an introduction to image-based registration, we detail the LSD-SLAM algorithm and document our code implementing our integration. The AR stereo engine with Oculus Rift support can be accessed via the GIT repository and the modified LSD-SLAM project used for the integration is available here


page 5

page 9

page 10

page 13

page 15

page 25

page 34

page 35


Augmented Reality Oculus Rift

This paper covers the whole process of developing an Augmented Reality S...

DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras

We introduce DROID-SLAM, a new deep learning based SLAM system. DROID-SL...

OV^2SLAM : A Fully Online and Versatile Visual SLAM for Real-Time Applications

Many applications of Visual SLAM, such as augmented reality, virtual rea...

OpenVSLAM: A Versatile Visual SLAM Framework

In this paper, we introduce OpenVSLAM, a visual SLAM framework with high...

A General Framework for Lifelong Localization and Mapping in Changing Environment

The environment of most real-world scenarios such as malls and supermark...

The Mobile AR Sensor Logger for Android and iOS Devices

In recent years, commodity mobile devices equipped with cameras and iner...

A Hybrid Learner for Simultaneous Localization and Mapping

Simultaneous localization and mapping (SLAM) is used to predict the dyna...

1 Introduction

Simultaneous localization and mapping (SLAM) is one of the hardest and most important fields in computer vision. As pointed out by Montemerlo [3], SLAM is a fundamental problem of autonomous robots and therefore it plays a key role in robotics aswell as in augmented reality.

The challenge here is to aswell localize a robot within an unknown area of unknown scale and compute an internal global environment map for tracking. With computer vision, it is possible to solve that problem by creating a map of the environment in form of a 3D model. The camera images of the moving robot are used to observe the environment containing all the visible objects and structures. The creation of the map has to be done online, since it is neccessary to simultaneous localize the robot and update the global 3D map, frame by frame. The whole algorithm has a high computational complexity which makes it very challenging to achieve a suffienctly high framerate for an accurate tracking.

In Augmented Reality (AR), registration is one the most essential components. Since we were using the Oculus Rift as an Head-Mounted-Display (HMD), we had to capture the real world using undistorted camera images and render both camera streams onto the intern HMD screen. To achieve a stereoscopic view, we have used two fish-eye cameras IDS uEye UI-122-1LE-C-HQ. Both of them are positioned on the front plate of the Rift. Using this Virtual Reality (VR) device, we could achieve a full-view AR immersion, since we are able to blend holograms everywhere at the intern screen with the real world. The development of our previous AR system is documented in our previous paper [2].

Microsoft’s new promising HoloLens is based on similar principles. It renders virtual objects onto the screen located within the lenses and tracks the 3D position in world space using the mounted camera. However, since the lenses of the HoloLens are transparent, there is no need for a digital camera stream. On the downside, the HoloLens can augment only a small area of the real world field-of-view (FOV), located in the center of the screen.

In section 2, we talk about Image Registration since this is a fundamental knowledge of SLAM systems. For illustration, we show some examples from the field of Medical Image Analysis. Then we cover SLAM methods in general and go into more detail about the monocular LSD-SLAM.

Section three 3 shows some implementation details about the integration and modifications, which have been done.

In section 4 we summarize the experiences made during development and show some results. In section 5, we list some possible future work.

1.1 Building upon our AR Oculus Rift System

Considering our previous basic AR system [2] with a modified Oculus Rift, and our own DirectX stereo engine, we were able to render and animate 3D graphic models super-imposed to the images captured by the cameras.

Figure 1: Results from our previously built AR system ”Augmented Reality Oculus Rift”. Image taken from [2].

Using the Oculus Rift’s internal gyro sensor, the augmentations could move consistently with the orientation of the user’s head. But what makes AR systems really interesting, is a complete augmentation of the holograms into realworld space. To give a good illusion of augmentation, it is neccessary to consider translation motions as well.

Figure 2: Hardware Modifications on Oculus Rift DK1 for ”Augmented Reality Oculus Rift”. Image property of Markus Höll

2 Theory

2.1 Vision in Augmented Reality

Building large-scale, accurate, responsive, and robust AR systems in real-time is challenging. What makes AR systems complex in computation is the real-world camera tracking, especially in large-scale areas without having information about the environment.

According to [4], the majority of AR systems (in 2007) operated with a prior knowledge of the environment the camera is located in. One possibility is to place a marker, maybe a card or a chessboard into the scene with known position like shown in [5]. By observing the marker, its prior known 3D world position can be used to measure the location of the observing camera in relation to the object in world space. The huge downside in approaches like that is, it is only possible to yield location and rotation of the camera while measuring the marker. That means, augmentation is not possible when the camera is currently not observing and measuring the object with prior known world coordinates.

Figure 3:

Relationship between marker coordinates and estimated camera coordinates. Image taken from


Compared to these marker-based approaches, we wanted to integrate an extensible tracking, without having any prior information about the environment to fully augment reality. This need of extensible tracking opens the door to SLAM systems.

How do we know about the 3D pose of the Rift in an entirely prior unknown environment, frame-by-frame in real-time? The answer is an algorithm called Monocular LSD-SLAM, introduced by Engel et al [1]. It covers the process of recovering geometrical information from images. The camera could also be mounted on a robot and used for autonomous navigation. As an input sensor, we use the camera images. The underlying process of relating images to each other in LSD-SLAM is called Image Registration and is explained further in the next section with some illustrative examples from Medical Image Analysis.

2.2 Image Registration

The term Image Registration defines the process of determining a one-to-one mapping or transformation between the coordinates in one space and those in another, such that points in the two spaces that correspond to the same point are mapped to each other. It is about to find a function, aligning a moving image with a second fixed image, such that a defined similarity measure is maximized. This underlying process can also be seen as part of LSD-SLAM, it registers images to each other to retrieve information about similarity transform sim(3) of our observed camera keyframes, achieved by image-aligning.

Figure 4: Nuclear Medicine: X-ray co-registered with corresponding bone scan. Image taken from [6].

In this section, we talk shortly about the principal methods of Image Registration. We lean here mainly on the book Medical Image Registration from Hill [6]. In Medical Image Analysis, images from anatomical changing patients or different sensors need to be registered to achieve Image Fusion and / or improve Image Guided Surgery based on AR systems.

Figure 5: Bottom: Slice of Image Fusion from CT and MR Volumes Top Left: CT TopRight: MR. Image taken from [6].

Image Registration can be done by four differnt methods. Three of them belong to rigid registration which are shortly explained here. Since nonlinear (nonrigid) registrations are not as relevant in our case, we skip them for the sake of this paper. The complexity of the methods increase in descending listing order.

2.2.1 Feature Based Rigid Registration

Placing anatomic or geometric landmarks on the human body makes it easy to identify these features (fiducials) in different images. Since the markers are very clearly to identify on every image, even if the images come from different sensors (CT-MR), the algorithm is straight forward by solving the Procrustes Problem with Point-Based Registration. Given two sets of N corresponding points and , find the similarity transformation (scale factor s, rotation matrix

and translation vector

) that minimizes the mean squared distance between the points:

Figure 6: Identify corresponding feature points (placed landmarkers on the human head). Image taken from [6].

2.2.2 Surface Based Rigid Registration

A good gemoetrical feature of an object is represented by its boundary. This approach can align large number of points in images but segmentation is necessary for surface extraction. The extracted surface is represented either as a point set, triangle set or as level set. Given a set of surface points and a surface Q, it is about to find the rigid-body transformation T, where T is rotation matrix R and translation vector t that minimizes the mean squared distance between the points and the surface:


Iterative Closest Point algorithm yields the solution for this problem.

Figure 7: Surface P: data shape (contour) and Surface Q: model shape (high-resolution scan). Image taken from [6].

2.2.3 Intensity Based Rigid Registration (Voxel Similarity)

Intensity Based approaches are more complex, but also more powerful since more information is used. Voxel Similarity measures must distinguish between Single-modality registration and Multi-modality registration. The terms defines if the input images come from the same sensor, or different sensors. Depending on the domain, it is important to use suitable similarity assumptions. When speaking of LSD-SLAM, the scene consists of constantly changing intensities, but the sensor hardware (camera) stays the same.

  • Single-Modality means images are retrieved from the same sensor, e.g. CT-CT, MR-MR, PET-PET. etc. The main approach is to search a transformation T, which is determined by iteratively minimizing a voxel-based dissimilarity measure C. With Single-Modal, two different similarity assumptions about the intensity can be made, ”Identity” and ”Linear” which form the similarity metric.

    Identity as a similarity assumption means, that the image intensity only differs by a Gaussian noise. The similarity metric of the Sum Of Squared Differences (SSD) can be used.


    Though, it is sensitive to outliers, where the Sum Of Absolute Differences (SAD) is better suited.

    Linear assumes a linear relationship between image intensities. As a similarity metric, the Normalized Cross Correlction (CC) is used.

  • Multi-Modality

    (MR-CT, MR-PET, CT-PET, etc) assumes that image intensities are related by some unknown function or statistical relationship which is not known a-priori. Images are considered as a (pixel-wise) probability distributions which can be estimated using a histogram. Knowing this, it is possible to create 2D Joint Histograms and retrieve knowledge about the Joint Probability p(a, b) of a pixel having intensity a in one image and intensity b in another image.

    Figure 8: 2D Joint Probability distribution of intensities for aligned MR and FDG PET volumes (left), misaligned with a 2mm translation (middle) and misaligned with a 5mm translation (right). Image taken from [6].

    The amount of information in combined images A and B is described by the Joint Entropy. Registration is achieved by minimizing the joint entropy between both images. However, it is highly sensitive to the overlap of two images.


    Mutual Information on the other hand, describes how well an image can be explained by another image and can be expressed in terms of marginal and joint probability distributions:


    Registration can be achieved by maximizing the Mutual Information between both images. Normalizing the Mutual Information leads to an independent term of this expression:


    To optimize Voxel Similarity based techniques, Multi-Resolution optimazation leads to an acceleration.

2.3 Simultaneous Localization and Mapping

The task of localization solves the problem of estimating the 3D pose of a robot within an unkown environment. Mapping means computing a map of the environment in which the robot is located in. The map is then used as a reference for estimating 3D position. In principal, it is the same what humans do when trying to navigate within an unknown environment. Speaking of localization and mapping, means speaking of two different problems. However, as Davison [7] also pointed out, solving one of them requires solving both of them. Different SLAM approches are listet below.

2.3.1 Feature-based monocular SLAM

In feature-based monocular methods, the process of retrieving geometric information from images is splint into sequential tasks. The algorithm tries as a first step to extract feature observations from the input (image). From a set of keypoint obversations, it is then possible to estimate 3D pose of the current keypoint features and thus the camera world-coordinates by matching the features with the set of keypoint observations. This means, all the other image information except of the exctracted and observed features are thrown away, approaching however real-time performance since the complexity of the problem is reduced to the keypoint features.

The successful project MonoSLAM from Davison [7] uses feature-based SLAM within a probabilistic framework running at a framerate of 30 Hz.

Figure 9: Detected feature patches of interest in camera stream (left) and visualized in world space coordinate system (right). Image taken from [7].

However, Engel [1] clearly states out the big drawback on this method. Only features who match the defined feature type can be extracted (corners, line segments).

2.3.2 Dense monocular SLAM (Direct)

This intensity-based approach improves over the feature-based SLAM method by using all the information available in the image. It uses some measure derived directly from the intensity of the image pixels which leads to more information about the geometry. New image frames are tracked using whole-image alignment. There is no need for feature extraction as in the previous method. Due to the dense information, tracking accuracy and robustness is widely improved over feature-based methods. That makes direct methods very valuable for robotics or augmented reality systems.

The downside is the costly computation. As a comparison, the dense depthmap is similar to the output of a RGB-D camera shown in the following figure.

Figure 10: Left: keypoint depthmap Center: dense depthmap Right: RGB-D camera. Image taken from [8].

2.4 Monocular LSD-SLAM

With Monocular Large-Scale Semi Dense SLAM, Engel [1, 8] proposed in 2013 the propably first feautreless real-time approach for monocular visual odometry, running with real-time framerates on a CPU which cuts out the need for high-parallel performance GPU hardware. The algorithm introduces 2 additional fundamental aspects. One is the estimation of semi-dense depthmaps which improves performance sigificantly over dense depthmaps. The second characteristic is then given by the large scale capability of the algorithm.

The algorithm generates an intern global map, frame by frame, using direct image alignment to estimate the camera’s 3D world position. This global map consists of keyframes as vertices with 3D similarity transform sim(3), represented as a pose graph. For each localization estimation, the most up-to-date depthmap from the camera’s recent keyframe is used as a reference for aligning the new captured image after motion movement.

Figure 11: Camera trajectory with LSD-SLAM on medium scale, showing some estimated semi-dense keyframe depthmaps. Image taken from [1].

2.4.1 Semi-Dense

The key idea behind this approach is not to build a dense depthmap using all intensities within the images, but to estimate a semi-dense inverse depth map for the current observed keyframe. The semi-dense inverse depthmap is modeled by using a Gaussian probability distribution which models one inverse depth hypothesis per pixel. The depthmap is estimated each frame and updated with a variable-baseline stere comparison.

Figure 12: Left: original captured camera image. Right: semi-dense depth map (LSD-SLAM). Image taken from [8].

2.4.2 Large-Scale

This characteristic of Monocular LSD-SLAM solves one of the biggest challenges in SLAM because world scale can not be observed and it changes during runtime. As depth or stereo cameras are very limited in scale, to provide realible measurements, LSD-SLAM enables a smooth switch between different scales within the world. The scale problem is solved by using inherent correlation between scene depth and tracking accuracy. Keyframes have to be scaled such that the mean inverse depth is one and edges between keyframes are estimated as elements of sim(3) allowing to detect scale-drift.

Figure 13: Two different scenes showing scale-differences. The keyframe’s camera viewport is displayed in red rectangles. Image taken from [1].

2.4.3 Complete Algorithm

To give an overview about the LSD-SLAM algorithm from Engel [1], we summarize the components shortly. The algorithm has 3 main components, tracking, depth map estimation and map optimization, illustrated in figure 14

Figure 14: Illustration of the complete LSD-SLAM algorithm. Image taken from [1].
  • Tracking

    executes the task of permanently tracking new camera images. The rigid body pose se(3) of each current keyframe is estimated by using the pose of the previous frame as a reference.

  • Depth map estimation computes the semi-dense depth of each current keyframe by filtering over many per-pixel, small-baseline stereo comparisons. It uses tracked frames, and refines the current keyframe. In the case the camera’s motion was too large by moving too far, a new keyframe is initialized which replaces the old keyframe. The replacement is done by projecting points from existing, close-by keyframes onto the new keyframe.

  • Map optimization triggers when a new keyframe is replaced instead of depth refinement. It’s purpose is to update the global map with the new tracked keyframe. Loop closures and scale-drift are detected by estimating a similarity transform sim(3) to close-by existing keyframes with scale-aware, direct sim(3)-image alignment.

3 Implementation

The project is now split into 2 Visual Studio projects, each accessible on Github. The basic AR engine with Oculus Rift support can be accessed via and the modified LSD-SLAM project for the integration is uploaded here

3.1 Calibration

Here we recap shortly the camera calibration process, documented in our previous paper [2]. We have used the omnidirectional camera model from Scaramuzza [9] and have written an undistortion shader in HLSL for our engine. Thus, we could bypass the undistortion step for the LSD-SLAM, and benefit from a performance boost, since the undistortion shader runs highly parallel on the GPU. We have computed the calibration with a polynomial function of fifth order. Distortion coefficients are already integrated in the projection function F.


If the undistortion, integrated in the LSD-SLAM project, is desired, the camera matrix has to be built with focal lengths and optical centers measured in pixels.

Also the distortion coefficients are neccessary.


Both, the camera matrix and the distortion coefficients are held in the file out_camera_data.xml, located in the data directory of the project root.

3.2 Engine Integration

In our previous paper [2] we documented our complete AR stereo engine so far. In this section, we give an overview of the major new components due to the LSD-SLAM integration.

The core of the integration is represented by the new class LsdSlam3D, illustrated in figure 15. It can be seen as an API to the dedicated Visual Studio project of LSD-SLAM and holds for us the important information about the camera’s real world position and rotation.

Figure 15: LSD-SLAM code integration. Image property of Markus Höll.

Since the GraphicsAPI is constantly synchronizing the virtual camera with the now fully tracked LSD-SLAM camera, we added a mutex to ensure stable synchronization and guarantee always clean and valid data information.

Additionally, we have added another mutex, used for the camera images. Since the camera stream is running in another thread, we also synchronized the data exchange between the image capturing and SLAM operation.

3.3 Initialization

In the main function we set up all the needed LSD-SLAM properties. First of all, the CvCapture property, which holds information about the tracked monocular camera. The camera is referenced wit an ID, starting from 0 to ascending order. Also the capture frame dimensions are set here with 640 x 480.

After that, we initialize the camera stream thread, used by LSD-SLAM. Since we modified the original lsd_slam::OpenCVImageStreamThread, we introduced a new one called lsd_slam::IDSuEyeCameraStreamThread. The slight modifications to our custom input stream are documented in section 3.6.

Now we are about to set the monocular camera buffer to the IDSuEyeCameraStreamThread. The camera stream thread will from now on constantly read out the frames stored in the undistortion buffer, retrieved from the HLSL shader.

To complete the initalization, we allocate our new introduced LsdSlam3D API object where we will retrieve the desired 3D camera translation and rotation as an output.

1blueint main(blueint, bluechar**)
3  green// init DirectX Graphics, Cameras and OculusHMD
4  [...]
6  CvCapture* capture = cvCaptureFromCAM(LsdSlam_CAM)
7  cvSetCaptureProperty(capture, CV_CAP_PROP_FRAME_WIDTH, 640);
8  cvSetCaptureProperty(capture, CV_CAP_PROP_FRAME_HEIGHT, 480);
10  lsd_slam::IDSuEyeCameraStreamThread* inputStream = bluenew lsd_slam::IDSuEyeCameraStreamThread();
11  inputStream->setCameraCapture(capture);
12  blueif (LsdSlam_UNDISTORTION) green// use undistorted imgs
13        inputStream->setIDSuEyeCameraStream(undistBuffer_, undistBufferMutex_, bluetrue);
14  blueelse green// use raw images
15        inputStream->setIDSuEyeCameraStream(cont.camInput_->lsdslamBuffer_, cont.camInput_->lsdslamMutex_, bluefalse);
16  inputStream->run();
18  green// Init LSD-SLAM with outputstream and inputstream
19  lsd_slam::Output3DWrapper* outputStream = bluenew lsd_slam::LsdSlam3D(inputStream->width(), inputStream->height());
20  lsd_slam::LiveSLAMWrapper slamNode(inputStream, outputStream);
21  dx11->setLsdSlamTrackingAndMapping(outputStream);

3.4 Render Loop Update

Some program controls allow us to switch between LSD-SLAM rotation and gyro sensor rotation from the Oculus Rift, aswell as to decide if we want to use LSD-SLAM translation or only the virtual head camera. Since we can set those controls individually, the Frame() function has been modified accordingly:

1bluebool GraphicsAPI::Frame()
3  green// [ Lsd-Slam ] Virtual Camera Update ---------------------------------
4  blueif (LsdSlam_Rotation)
5    updateLsdSlam3DRotation();
6  blueelse
7    OculusMotion();
9  green// [ Lsd-Slam ] Virtual Camera Update ---------------------------------
10  blueif (LsdSlam_Translation)
11    updateLsdSlam3DPosition();
12  blueelse
13    headCamera_->SetPosition(ariftcontrol_->cameraPositionX_, ariftcontrol_->cameraPositionY_, ariftcontrol_->cameraPositionZ_);
15  green// Render the virtual scene.
16  Render();

As the name suggests, the updateLsdSlam3DRotation() method updates internally the virtual camera rotation matrix with the one retrieved from LSD-SLAM. Also we convert the Eigen::MatrixXd* to our matching coordinate system convention.

1bluevoid GraphicsAPI::updateLsdSlam3DRotation()
3  green// waits internally or throws error code
4  blueif (!acquireLsdSlam3DRotationMutex())
5    bluereturn;
7  XMFLOAT3X3 cam.Rot.Matrix = {};
8  blueif (lsdslam3D_)
9  {
10    Eigen::MatrixXd* riftRot.Matrix = bluedynamic_cast<lsd_slam::LsdSlam3D*>(lsdslam3D_)->getRiftRotation3D();
11    cam.Rot.Matrix._11 = (*riftRot.Matrix)(0,0);
12    cam.Rot.Matrix._12 = (*riftRot.Matrix)(1,0);
13    cam.Rot.Matrix._13 = (*riftRot.Matrix)(2,0);
14    cam.Rot.Matrix._21 = (*riftRot.Matrix)(0,1);
15    cam.Rot.Matrix._22 = (*riftRot.Matrix)(1,1);
16    cam.Rot.Matrix._23 = (*riftRot.Matrix)(2,1);
17    cam.Rot.Matrix._31 = (*riftRot.Matrix)(0,2);
18    cam.Rot.Matrix._32 = (*riftRot.Matrix)(1,2);
19    cam.Rot.Matrix._33 = (*riftRot.Matrix)(2,2);
21    cam.Rot.Matrix._32 *= (-1);
22    cam.Rot.Matrix._23 *= (-1);
23    cam.Rot.Matrix._12 *= (-1);
24    cam.Rot.Matrix._21 *= (-1);
25  }
27  headCamera_->LsdSlamRotationMatrix_ = rotationMatrix;
29   releaseLsdSlam3DRotationMutex();

What happens in updateLsdSlam3DTranslation() is basically a smooth 3D movement implementation and a reference comparison with LSD-SLAM data. We store a lsdslam_reference which is directly compared with the accurat retrieved riftPosition from LSD-SLAM . Since we wanted to achieve a smooth movement we have added some thresholding value to erase small correction errors. Further we invented a scaling factor for the movement, to be not as limited and scale in different worlds accordingly. Of course, when we don’t have a lsdslam_reference yet, we are in the initialization phase and just store the current position as the origin reference.

1bluevoid GraphicsAPI::updateLsdSlam3DPosition()
3  green// waits internally or throws error code
4  blueif (!acquireLsdSlam3DPositionMutex())
5    bluereturn;
7  bluebool pos_x_changed = bluefalse;
8  bluebool pos_y_changed = bluefalse;
9  bluebool pos_z_changed = bluefalse;
11  green// calling LsdSlam3D API internally
12  lsd_slam::LsdSlam3D::RiftPosition3D* riftPosition = getLsdSlam3DPosition();
13  blueif (lsdslam_init_ == bluetrue)
14  {
15    lsdslam_ref.x = riftPosition->x;
16    lsdslam_ref.y = riftPosition->y;
17    lsdslam_ref.z = riftPosition->z;
18    lsdslam_init_ = bluefalse;
19  }
21  XMFLOAT3 camera_position = headCamera_->GetPosition();
22  bluefloat move_direction = headCamera_->headToEyeOffset_.rotationY_ * 0.0174532925f;
24  green// [Move] Forward / Backward -------------------
25  blueif ((riftPosition->z > (lsdslam_reference_.z + LsdSlam_Threshold))) green// Move Forwards
26  {
27    bluefloat distance = riftPosition->z - lsdslam_reference_.z;
28    bluefloat new_position_x = (camera_position.x) + sinf(move_direction) * (distance * LsdSlam_ScaleFactor);
29    bluefloat new_position_z = (camera_position.z) + cosf(move_direction) * (distance * LsdSlam_ScaleFactor);
30    headCamera_->SetPositionX((new_position_x));
31    headCamera_->SetPositionZ((new_position_z));
32    pos_x_changed = bluetrue;
33    pos_z_changed = bluetrue;
34  }
35  blueelse blueif ((riftPosition->z < (lsdslam_reference_.z - LsdSlam_Threshold))) green// Move Backwards
36  {
37    bluefloat distance = lsdslam_reference_.z - riftPosition->z;
38    bluefloat new_position_x = (camera_position.x) - sinf(move_direction) * (distance * LsdSlam_ScaleFactor);
39    bluefloat new_position_z = (camera_position.z) - cosf(move_direction) * (distance * LsdSlam_ScaleFactor);
40    headCamera_->SetPositionX((new_position_x));
41    headCamera_->SetPositionZ((new_position_z));
42    pos_x_changed = bluetrue;
43    pos_z_changed = bluetrue;
44  }
45  green// [Strafe] Left / Right -------------------
46  blueif ((riftPosition->x > (lsdslam_reference_.x + LsdSlam_Threshold))) green// Strafe Right
47  {
48    bluefloat distance = riftPosition->x - lsdslam_reference_.x;
49    bluefloat new_position_x = (camera_position.x) + cosf(move_direction) * (distance * LsdSlam_ScaleFactor);
50    bluefloat new_position_z = (camera_position.z) + sinf(move_direction) * (distance * LsdSlam_ScaleFactor);
51    headCamera_->SetPositionX((new_position_x));
52    blueif(!pos_z_changed)
53      headCamera_->SetPositionZ((new_position_z));
54    pos_x_changed = bluetrue;
55    pos_z_changed = bluetrue;
56  }
57  blueelse blueif ((riftPosition->x < (lsdslam_reference_.x - LsdSlam_Threshold))) green// Strafe Left
58  {
59    bluefloat distance = lsdslam_reference_.x - riftPosition->x;
60    bluefloat new_position_x = (camera_position.x) - cosf(move_direction) * (distance * LsdSlam_ScaleFactor);
61    bluefloat new_position_z = (camera_position.z) - sinf(move_direction) * (distance * LsdSlam_ScaleFactor);
62    headCamera_->SetPositionX((new_position_x));
63    blueif (!pos_z_changed)
64      headCamera_->SetPositionZ((new_position_z));
65    pos_x_changed = bluetrue;
66    pos_z_changed = bluetrue;
67  }
68  green// [Fly] Up / Down -------------------
69  blueif ((riftPosition->y > (lsdslam_reference_.y + LsdSlam_Threshold))) green// Down
70  {
71    bluefloat distance = riftPosition->y - lsdslam_reference_.y;
72    bluefloat new_position_y = (camera_position.y) -  (distance * LsdSlam_ScaleFactor);
73    headCamera_->SetPositionY((new_position_y));
74    pos_y_changed = bluetrue;
75  }
76  blueelse blueif ((riftPosition->y < (lsdslam_reference_.y - LsdSlam_Threshold))) green// Up
77  {
78    bluefloat distance = lsdslam_reference_.y - riftPosition->y;
79    bluefloat new_position_y = (camera_position.y) + (distance * LsdSlam_ScaleFactor);
80    headCamera_->SetPositionY((new_position_y));
81    pos_y_changed = bluetrue;
82  }
84  green// update LsdSlam Reference Position
85  blueif (pos_x_changed)
86    lsdslam_reference_.x = riftPosition->x;
87  blueif (pos_y_changed)
88    lsdslam_reference_.y = riftPosition->y;
89  blueif (pos_z_changed)
90    lsdslam_reference_.z = riftPosition->z;
92  releaseLsdSlam3DPositionMutex();

3.5 Undistortion ShaderBuffer for LSD-SLAM

Since we have used our own undistortion images retrieved from our fragment shader, we had to adapt the region-of-interest (ROI) of the undistorted images to ensure that we only pass fully valid image data to our lsd_slam::IDSuEyeCameraStreamThread.

After the undistortion, our image buffer itself holds of course black borders which correct the distortion, illustrated in figure 16. With the parameter camID we can control which of our cameras we want to use for the LSD-SLAM, which changes also the ROI parameters due to undistortion.

Figure 16: Undistortion mapping and ROI extraction. Image property of Markus Höll.

The undistorted images were so far fully placed in the GPU memory, since they were computed from our fragment shader. Therefore, we had to setup an explicit mapped CPU access to copy data from the GPU memory.

After we have copied the image data to from the GPU, we converted the images’ data buffer to an IplImage* frame where we then can extract the correct ROI, depending on which camera we use for tracking.

For the last step, we resized the image to 640 x 480 which are the suggested dimensions for LSD-SLAM and copy the images to the image buffer.

1bluevoid GraphicsAPI::UndistortionForLsdSlam(blueint camID)
3  ID3D11Resource* renderBuffer;
4  blueif (camID == 1)
5    renderTextureLeft_->GetRenderTargetView()->GetResource(&renderBuffer);
6  blueelse
7    renderTextureRight_->GetRenderTargetView()->GetResource(&renderBuffer);
9  D3D11_TEXTURE2D_DESC texDesc;
10  texDesc.ArraySize = 1;
11  texDesc.BindFlags = 0;
12  texDesc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
13  texDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
14  texDesc.Width = screenWidth_;
15  texDesc.Height = screenHeight_;
16  texDesc.MipLevels = 1;
17  texDesc.MiscFlags = 0;
18  texDesc.SampleDesc.Count = 1;
19  texDesc.SampleDesc.Quality = 0;
20  texDesc.Usage = D3D11_USAGE_STAGING;
22  ID3D11Texture2D* undistortedShaderTex;
23  device_->CreateTexture2D(&texDesc, 0, &undistortedShaderTex);
24  devicecontext_->CopyResource(undistortedShaderTex, renderBuffer);
26  green// Map Resource From GPU to CPU
27  D3D11_MAPPED_SUBRESOURCE mappedResource;
28  blueif (FAILED(devicecontext_->Map(undistortedShaderTex, 0, D3D11_MAP_READ, 0, &mappedResource)))
29    std::cout << "Error:[CAM2]couldnotMapRenderedCameraShaderResourceforUndistortion" << std::endl;
31  blueunsigned bluechar* buffer = bluenew blueunsigned bluechar[(screenWidth_ * screenHeight_ * CAMERA_CHANNELS)];
32  blueunsigned bluechar* mappedData = bluereinterpret_cast<blueunsigned bluechar*>(mappedResource.pData);
33  memcpy(buffer, mappedData, (screenWidth_ * screenHeight_ * 4));
34  devicecontext_->Unmap(undistortedShaderTex, 0);
36  green// OpenCV IplImage* Convertion
37  IplImage* frame = cvCreateImageHeader(cvSize(screenWidth_, screenHeight_), IPL_DEPTH_8U, CAMERA_CHANNELS);
38  frame->imageData = (bluechar*)buffer;
39  cvSetData(frame, buffer, frame->widthStep);
40  cv::Mat mymat = cv::Mat(frame, bluetrue);
42  green// Extract ROI
43  cv::Rect roi;
44  blueswitch (camID)
45  {
46    bluecase 1: green// Left Camera
47    blueif (HMD_DISTORTION)
48      roi = cv::Rect(300, 230, screenWidth_ - 300, screenHeight_ - 410);
49    blueelse
50      roi = cv::Rect(10, 170, screenWidth_ - 10, screenHeight_ - 250);
51    bluebreak;
53    bluecase 2: green// Right Camera
54      blueif (HMD_DISTORTION)
55        roi = cv::Rect(5, 230, screenWidth_ - 170, screenHeight_ - 450);
56      blueelse
57        roi = cv::Rect(5, 150, screenWidth_ - 10, screenHeight_ - 290);
58      bluebreak;
60    bluedefault:
61      roi = cv::Rect(0, 0, screenWidth_, screenHeight_);
62  }
63  cv::Mat roi_mat = mymat(roi);
65  green// Resize for LSD-SLAM 640x480
66  IplImage* resized_ipl = cvCreateImage(cvSize(640, 480), IPL_DEPTH_8U, CAMERA_CHANNELS);
67  IplImage* ipl_roi = bluenew IplImage(roi_mat);
68  cvResize((IplImage*)ipl_roi, resized_ipl, CV_INTER_LINEAR);
69  cv::Mat lsdslam_image = cv::Mat(resized_ipl, bluetrue);
71  WaitForSingleObject(undistortedShaderMutex_, INFINITE);
72  memcpy(undistortedShaderBuffer_,, (640 * 480 * CAMERA_CHANNELS));
73  ReleaseMutex(undistortedShaderMutex_);
75  blueif (!undistortionReady_)
76    undistortionReady_ = bluetrue;
78  blueif (undistortedShaderTex)
79    undistortedShaderTex->Release();
81  blueif (buffer)
82   bluedelete(buffer);
84  blueif (resized_ipl)
85    cvReleaseImage(&resized_ipl);
87  blueif (ipl_roi)
88    bluedelete(ipl_roi);

3.6 LSD-SLAM Code Modifications

Generally, the camera capturing and undistortion process is implemented in the OpenCVImageStreamThread but we already mentioned above that we have used our own undistortion, retrieved from our undistortion shader and supply LSD-SLAM with a buffer of those undistorted images. We show the code modifications within the LSD-SLAM project itself. For this purpose, we introduced a new class lsd_slam::IDSuEyeCameraStreamThread.

3.6.1 lsd_slam::IDSuEyeCameraStreamThread

The constructor makes it pretty clear which data is neccessary. The cameraBuffer_ and cameraMutex_ are pointers to instances where the thread should read data from. We just copy the data from cameraBuffer_ to the internally used frameBuffer_.

iDSuEyeFrame_ is the final data type on which LSD-SLAM is processing on and resizedFrame_ is needed in some cases when the image dimensions are not 640 x 480.

Finally, use_shaderUndistortion_ just tells us if we want to use the undistorted images supplied by our shader, or might want to use raw image data or something else for any reason.

1  IDSuEyeCameraStreamThread::IDSuEyeCameraStreamThread()
2  {
3    cameraBuffer_ = 0;
4    cameraMutex_ = 0;
5    frameBuffer_ = 0;
6    iDSuEyeFrame_ = 0;
7    resizedFrame_ = 0;
8    use_shaderUndistortion_ = bluefalse;
9  }

This method- setIDSuEyeCameraStream() - is called in the configuration phase directly from main, to set the corresponding pointers and configs for camera streaming.

1 bluevoid IDSuEyeCameraStreamThread::setIDSuEyeCameraStream(blueunsigned bluechar* cameraBuffer, bluevoid* cameraMutex, bluebool use_shaderUndistortion)
3  cameraBuffer_ = cameraBuffer;
4  cameraMutex_ = cameraMutex;
5  use_shaderUndistortion_ = use_shaderUndistortion;
7  blueif (use_shaderUndistortion_)
8    frameBuffer_ = bluenew bluechar[LsdSlam_BUFFER_LENGTH];
9  blueelse
10  {
11    frameBuffer_ = bluenew bluechar[CAMERA_BUFFER_LENGTH];
12    resizedFrame_ = cvCreateImage(cvSize(LsdSlam_CAM_WIDTH, LsdSlam_CAM_HEIGHT), IPL_DEPTH_8U, CAMERA_CHANNELS);
13  }

The run() method is calling the actual runtime loop of the thread.

1bluevoid IDSuEyeCameraStreamThread::run()
3  boost::thread thread(boost::ref(*bluethis));
4 }

In the beginning before the Loop starts, we check if we have valid camera objects and jump then into the runtime loop, which overrides the actual image streaming to the LSD-SLAM. In case of using the undistortiion images from the shader, there is not much happening here since everything has been prepared previously from the GraphicsAPI.

However, if we might want to use the image raw data for tracking or something else, we have to sample the images to fit the suggested dimensions.

1bluevoid IDSuEyeCameraStreamThread::blueoperator()()
3  WaitForSingleObject(cameraMutex_, INFINITE);
4  blueif (!cameraBuffer_ || !cameraMutex_)
5  {
6    assert("NOvalidIDSuEyeCamerabufferfoundforTrackingandMapping");
7    bluereturn;
8  }
9  ReleaseMutex(cameraMutex_);
10  bluewhile (1)
11  {
12    TimestampedMat bufferItem;
13    bufferItem.timestamp = Timestamp::now();
15    green// Use Undistortion provided by our Fragment Shader ROI (640 x 480)
16    blueif (use_shaderUndistortion_)
17    {
18      WaitForSingleObject(cameraMutex_, INFINITE);
19      memcpy(frameBuffer_, cameraBuffer_, LsdSlam_BUFFER_LENGTH);
20      ReleaseMutex(cameraMutex_);
21      iDSuEyeFrame_ = cvCreateImageHeader(cvSize(LsdSlam_CAM_WIDTH, LsdSlam_CAM_HEIGHT), IPL_DEPTH_8U, CAMERA_CHANNELS);
22      iDSuEyeFrame_->imageData = frameBuffer_;
23      iDSuEyeFrame_->imageDataOrigin = iDSuEyeFrame_->imageData;
24     }
25    blueelse green// use raw image data from cams (752 x 480)
26    {
27      WaitForSingleObject(cameraMutex_, INFINITE);
28      memcpy(frameBuffer_, cameraBuffer_, CAMERA_BUFFER_LENGTH);
29      ReleaseMutex(cameraMutex_);
30      iDSuEyeFrame_ = cvCreateImageHeader(cvSize(CAMERA_WIDTH, CAMERA_HEIGHT), IPL_DEPTH_8U, CAMERA_CHANNELS);
31      iDSuEyeFrame_->imageData = frameBuffer_;
32      iDSuEyeFrame_->imageDataOrigin = iDSuEyeFrame_->imageData;
33      iDSuEyeFrame_ = ResizeFrame(iDSuEyeFrame_, LsdSlam_CAM_WIDTH, LsdSlam_CAM_HEIGHT);
34    }
35 = iDSuEyeFrame_;
36    imageBuffer->pushBack(bufferItem);
37  }
38  exit(0);

A short method for re-sampling an IplImage to the desired image dimensions. In our case, we would downsample from 752 x 480 to 640 x 480.

2IplImage* IDSuEyeCameraStreamThread::ResizeFrame(IplImage* src_frame, blueint new_width, blueint new_height)
4  blueif (new_width != LsdSlam_CAM_WIDTH || new_height != LsdSlam_CAM_HEIGHT)
5    resizedFrame_ = cvCreateImage(cvSize(new_width, new_height), src_frame->depth, src_frame->nChannels);
6  cvResize(src_frame, resizedFrame_, CV_INTER_LINEAR);
7  bluereturn resizedFrame_;

3.7 Light Shading Model - Rendering Equation

As a very small gimmick, we also modified our fragment shader from the previously used Lambert Shading to a Phong Shading model. This reflection model can be implemented without any effort actually but gives quite a more realistic look to the holograms due to the reflection term which yields specular highlights. Since it is still a local illumination model, an approximation of global illumination in form of an ambient light term is added to the rendering equation.

Generally, the shading model has 3 components, a diffuse illumination , a specular reflection component and an ambient term . This gives a formal expression like 11.


Formulating the model more precisely for each component, we get the render equation according to formula 12.


is a vector from the surface point to the emmitting light source. is the normal vector of the surface point. stands for a direction vector, pointing towards the observer (viewer) and is the direction of the reflected light which the light ray takes from this surface point on.

3.8 New Engine Configurations

A list of new configuration mechanics, added during the integration process:

  • LsdSlam_Threshold - configures a threshold which eliminates smaller correction errors in the tracking

  • LsdSlam_ScaleFactor - configure the scaling from the virtual movement, retrieved from LSD-SLAM

  • LsdSlam_UNDISTORTION - provide LSD-SLAM with undistorted shader images

  • LsdSlam_Translation - use translation from LSD-SLAM or not

  • LsdSlam_Rotation - use rotation from LSD-SLAM or Oculus Rift’s gyro sensor

  • VSYNC_ENABLED - enables / disables vertical synchronization

3.9 New Libraries


  • OpenCV

  • boost

  • g2o

4 Conclusion

We are really satisfied with the results. The integration process actually worked quite good, however it was still some coding work to adapt the engine for the use of LSD-SLAM. That’s mainly because we have used an omnidirectional camera model for our high degree fisheye-lenses to compute the calibration. And since we already computed the undistortion with a fragment shader in HLSL we wanted to use them for LSD-SLAM aswell. So basically, we had to tweak the image streaming to the algorithm and could not use the prepared OpenCVImageStreamThread.

The project was shown also at the OpenLAB-Night 2016 of the Institute for Computer Graphics and Computer Vision (ICG) at TU Graz. The guests could try it out and we received alot of good feedback. The people showed alot of interest in the project and generally in AR itself.

Tracking the real-world camera and synchronizing it’s 3D movement with the virtual head camera worked pretty well. It still feels exciting to walk towards 3D holograms which are augmented in the real world, going around them and crouch to see them from below.

The following figures show some final tracking results. The digimon model and texture file from the result images are taken from web source under educational usage purposes. The chessboard which sometimes is in the scene was only used for camera calibration.

Figure 17: Results: near distance. Image property of Markus Höll
Figure 18: Results: crouching, near distance. Image property of Markus Höll
Figure 19: Results: far distance. Image property of Markus Höll
Figure 20: Results: side view, walking around. Image property of Markus Höll

5 Future Work

We want to focus now on the interaction with virtual objects. The motivation is an interaction system in AR which feels more natural like using our hands and interacting with virtual objects the same way or at least similar as we do with real objects. We expect that to increase the immersion of AR significantly. However, we strongly consider a hardware upgrade since the Oculus Rift DK1 is pretty old already and the display’s resolution is very low. We will use the HoloLens for this project and see how it feels like if we change the interaction system of it with our own one. It will be very interesting to see how that works out and if we might also use some functionality of the interaction from the HoloLens, like maybe using the ”Gaze” raycaster for highlighting the virtual objects but further interacting with our hand gestures or totally bypass the whole interaction and only use the new one.

To capture our hand skeleton, we will use the Leap Motion controller which is using structured light for measurement. The Leap Motion has already a VR mount to install it on the front plate of VR HMDs. Since that could be tricky on the HoloLens, a custom mount might be the better choice. What might be more tricky is the communication between the HoloLens and the Leap Motion, since we will need to set up a remote communication between them due to the fact that the USB ports of the HoloLens can not be used for periphery devices appearantly. Another challenge will be to translate the coordinates from the Leap Motion’s coordinate system to the AR space of the HoloLens and compute a fine hand gesture detection and collision resolution on virtual objects.


  • [1] J. Engel, T. Schöps, and D. Cremers, “Lsd-slam: Large-scale direct monocular slam,” in Computer Vision–ECCV 2014, pp. 834–849, Springer, 2014.
  • [2] M. Höll, N. Heran, and V. Lepetit, “Augmented reality oculus rift,” arXiv preprint arXiv:1604.08848, 2016.
  • [3] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, et al., “Fastslam: A factored solution to the simultaneous localization and mapping problem,” in Aaai/iaai, pp. 593–598, 2002.
  • [4] G. Klein and D. Murray, “Parallel tracking and mapping for small ar workspaces,” in Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pp. 225–234, IEEE, 2007.
  • [5] H. Kato and M. Billinghurst, “Marker tracking and hmd calibration for a video-based augmented reality conferencing system,” in Augmented Reality, 1999.(IWAR’99) Proceedings. 2nd IEEE and ACM International Workshop on, pp. 85–94, IEEE, 1999.
  • [6] D. L. Hill, P. G. Batchelor, M. Holden, and D. J. Hawkes, “Medical image registration,” Physics in medicine and biology, vol. 46, no. 3, p. R1, 2001.
  • [7] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “Monoslam: Real-time single camera slam,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 6, pp. 1052–1067, 2007.
  • [8] J. Engel, J. Sturm, and D. Cremers, “Semi-dense visual odometry for a monocular camera,” in Proceedings of the IEEE international conference on computer vision, pp. 1449–1456, 2013.
  • [9] D. Scaramuzza, A. Martinelli, and R. Siegwart, “A toolbox for easily calibrating omnidirectional cameras,” in Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pp. 5695–5701, IEEE, 2006.