A Hybrid SLAM and Object Recognition System for Pepper Robot

03/02/2019 ∙ by Paola Ardón, et al. ∙ 0

Humanoid robots are playing increasingly important roles in real-life tasks, especially in indoor applications. Robust solutions for tasks such as indoor environment mapping, self-localisation and object recognition are essential to make robots more autonomous and hence more human-like. The well-known Aldebaran service robot Pepper is a suitable candidate for achieving these goals. In this paper, a hybrid system combining a Simultaneous Localisation and Mapping (SLAM) algorithm with object recognition is developed and tested with the Pepper robot in real-world conditions for the first time. The ORB SLAM 2 algorithm was taken as the starting point of our research. Then, an object recognition technique based on the Scale-Invariant Feature Transform (SIFT) and Random Sample Consensus (RANSAC) was combined with SLAM to recognise and localise objects in the mapped indoor environment. The results of our experiments showed the system's applicability for the Pepper robot in real-world scenarios. Moreover, we made our source code available to the community at <https://github.com/PaolaArdon/Salt-Pepper>.




I Introduction

The technological advancements of the past decades have significantly improved the quality of our daily life. In an attempt to make human lives more comfortable and independent, a significant step has been made in the robotics and computer vision fields: the development of humanoid robots.

In comparison with the rest of humanoid robots in the market, Pepper, developed by Aldebaran and Softbank, is affordable and has an open platform that allows developers to enhance its capabilities and implement applications to make the robot useful for everyday tasks [1].

I-A Problem statement and objectives

The development of humanoid robots is still a relatively new technological field, and research on the subject is ongoing. To raise interest and bring fresh ideas to the area, many competitions are organised around the world, such as the European Robotics League (ERL) [2]. One of the purposes of this competition is to develop technological applications that will help elderly people to live independently at home for longer [3]. Pepper comes with many built-in functions, among them learn home, object recognition and go to goal. Even though these functions are useful, they are limited in various ways. For example, the built-in learn home function only works in environments below a certain size. To comply with the rules of the ERL Service Robots competition and overcome the limitations of Pepper, we set our primary objective: to develop a hybrid system for Pepper that integrates object recognition into Simultaneous Localisation and Mapping (SLAM). Another limitation that makes the project more challenging is the poor quality of the sensors that come with the Pepper robot. Since both visual SLAM and object recognition use optical cameras, we list some characteristics of the sensors used [3]:

  • RGB camera: located on the forehead and has a resolution of at frames per second (fps).

  • Depth camera: located behind Pepper’s eyes with a resolution of at fps.

I-B Contributions and outline

Currently, many applications have been developed for the robot family of Aldebaran and Softbank. However, to the best of the authors' knowledge, visual SLAM and its combination with a robust object recognition algorithm have never been implemented on the Pepper robot before. Our main contributions are the following:

  1. For the first time, a visual SLAM algorithm is successfully applied on the Pepper robot, so it is no longer limited to a small environment.

  2. We present an accurate and robust object recognition algorithm for Pepper.

  3. We build a hybrid system which integrates the object recognition and SLAM into a unified framework. The recognised objects are marked in the map, and the map can be saved and reused.

  4. The framework has been tested with the Pepper robot in real-world conditions. The demo video is available at https://youtu.be/evFsnWH_bpY.

  5. We also make our implementation code available for the community at https://github.com/PaolaArdon/Salt-Pepper.

The paper’s structure is as follows: Section II discusses modern object recognition and SLAM methods. The theory of our object recognition method and of SLAM with the Pepper robot is presented in sections III and IV respectively. Then, the integration of the two functionalities into one system is described in section V, which is followed by the results in section VI and by final remarks and future work in section VII.

II Related Work

Before going into details about the used algorithms for the implementation, it is useful to review some of the general concepts in the field. Object recognition and SLAM have been active research fields over the last decades. Some of the related works are reviewed in this section.

II-A Object recognition

Object recognition relates to the problem of identifying an object in an image. In general, the algorithms can be divided into two main streams: appearance-based methods and feature-based methods.

An appearance-based recognition method uses example images (or templates) directly to perform recognition tasks. Sung et al. [4] introduced a method where the edges in both the current frame and the template images from the database are extracted. Then, sliding windows at various scales are employed to find the object with the highest similarity measure. Swain and Ballard [5] first showed how object recognition could be performed by comparing colour histograms. Schiele and Crowley [6] applied the histograms of receptive fields proposed by Koenderink and van Doorn [7], and the recognition result is enhanced by using the Gaussian derivative or the Laplacian operator at multiple scales. Linde and Lindeberg [8] generalised the idea of the receptive field histogram to a higher dimensionality. Histograms of wavelet coefficients have proved to be a useful tool for the recognition of cars and faces [9].

Appearance-based methods are usually robust to a particular type of object characteristics, depending on which information is extracted for the comparison between the templates and the objective image. However, they are usually computationally expensive and sensitive to many variations. In contrast, feature-based recognition methods offer a solution to the mentioned problems.

The objective of feature-based recognition algorithms is to find feasible matches between the features extracted from the database images and from the target image. Some of the commonly used features in object recognition are: Shape Context [10], Haar-like features [11], Scale Invariant Feature Transform (SIFT) [12], Speeded Up Robust Features (SURF) [13], Oriented FAST and Rotated BRIEF (ORB) [14], and Binary Robust Invariant Scalable Keypoints (BRISK) [15].

In recent years, Artificial Neural Networks, in particular Convolutional Neural Networks (CNNs), have shown promising results for object detection and recognition problems. Unlike the traditional methods that employ hand-crafted features as described above, CNNs learn the features from the observed data [16]. Recent studies showed that deep CNN architectures are able to outperform the classical algorithms with higher accuracy for object recognition [17, 18, 19]. However, we did not consider using such approaches in this study, as the CNN-based methods rely on the computational power of GPUs, which are not currently available for the Pepper robot.

II-B Simultaneous localisation and mapping (SLAM)

In this section, we are going to review some of the state-of-the-art algorithms for SLAM that we have considered in our project.

Extended Kalman filter SLAM (EKF-SLAM)

In EKF-SLAM, the map is represented by a large vector stacking the sensor and landmark states, which is modelled as a Gaussian variable [20]. A maximum likelihood algorithm is used for data association.

One advantage of EKF-SLAM is that it is relatively easy to implement and is efficient when working with a small number of features and distinct landmarks. On the other hand, its complexity is quadratic in the number of features, it does not guarantee convergence in non-linear and/or non-Gaussian cases, and it does not correct erroneous data association [20].

Collaborative visual SLAM (CoSLAM)

This method [21] operates in dynamic environments where live frames come from multiple cameras that can be independent and mounted at different viewpoints. As an overview, these cameras build a single global map, including the static background points and the dynamic foreground points. These points are used to estimate the poses of all cameras, which should have overlapping fields of view. CoSLAM is considered one of the most efficient approaches for discarding false points caused by incorrect matching.

Large scale direct monocular SLAM (LSD-SLAM)

It has been developed to allow the building of large-scale map environments [22]. Instead of using key points, this SLAM uses the image intensities to track and build a map. The method shows the advantage of allowing the mapping of large areas without extra computational power.

Oriented FAST and rotated BRIEF SLAM (ORB-SLAM)

ORB-SLAM is a SLAM algorithm based on keyframes and ORB features [23]. One of its greatest advantages is that it operates in real time and in large environments, while also being able to close loops and re-localise from different viewpoints. Due to these significant contributions, it is the algorithm chosen for this project. More details about the algorithm are given in section IV.

III Object Recognition Framework

As previously explained, we want to allow the Pepper robot to be more autonomous and helpful in the household. One of our main objectives is to allow Pepper to recognise objects. Some of the classical object recognition algorithms are not efficient enough for some applications. For instance, Haar cascades [11] require a trade-off between the learning/training time and the accuracy of the output.

Based on Lowe’s paper [12], we introduce a robust SIFT-based recognition algorithm for the Pepper robot. Our method not only avoids a long training time but is also robust to rotation, scaling and perspective transformation, among others. The robustness of this algorithm allows Pepper to recognise objects efficiently.

The flow chart of the object recognition algorithm for Pepper is presented in Fig. 1. The main steps of the method include feature extraction, feature matching, and decision making, which are described in the following sections.

Fig. 1: Workflow of the object recognition algorithm.

III-A Feature extraction

The very first step is to extract features from a given image database. This is also done every time a new frame arrives. Under the same recognition framework, we tested several feature extraction techniques including ORB, SURF and SIFT. With ORB and SURF the detection accuracy is acceptable. However, SIFT offered the highest accuracy among them in all our testing rounds (see Section VI), so it was chosen as the primary feature extraction method. Both SURF and ORB are kept as user options in the implementation and can be selected as the main feature descriptor.

III-B Feature matching

Once the SIFT features of the images in the database and the current frame have been extracted, we need to know how many features are matched between the current frame and every image in the database. This will help the following decision-making step described in the next section.

We use the matching method proposed in [24], in which a kd-tree is built for rapid traversal of the features in the current frame. Our extensive experiments showed that kd-tree nearest-neighbour matching significantly speeds up the recognition process (around three times faster than brute-force matching). Since the recognition run-time for each frame stays within the frame update time, kd-tree matching enables Pepper to perform real-time object recognition alongside the SLAM algorithm.
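As an illustration of the idea only, the following minimal sketch builds an exact kd-tree over toy 2-D points and queries the nearest neighbour. It is not the matcher of [24]: real SIFT descriptors are 128-dimensional, and the function names `build_kdtree` and `nearest` are hypothetical.

```python
from math import dist

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree as nested (point, left, right) tuples."""
    if not points:
        return None
    axis = depth % len(points[0])            # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                   # the median splits the set
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, depth=0, best=None):
    """Return the stored point closest to `query` (Euclidean distance)."""
    if node is None:
        return best
    point, left, right = node
    if best is None or dist(query, point) < dist(query, best):
        best = point
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if abs(diff) < dist(query, best):
        best = nearest(far, query, depth + 1, best)
    return best

# Toy "descriptors" standing in for database features (assumed data).
database = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(database)
match = nearest(tree, (9, 2))
```

The pruning test in `nearest` is what saves work compared with brute force: whole subtrees are skipped whenever they cannot contain a closer point.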

III-C Decision making

After the feature matching between the current frame and all the objects in the database, Random Sample Consensus (RANSAC) [25] is applied to determine whether an object from the database exists in the current frame, and if so, which object it is.

We use RANSAC to fit a homography transformation between the feature positions in the current frame and those of an object in the database, and then count the number of inliers. If this number is larger than an empirically set threshold, we keep the object as a candidate. This process is repeated until the number of inliers has been calculated for all database objects. If all inlier counts are smaller than the threshold, we assume that no object is present in the current frame. Otherwise, the candidate with the most inliers is taken as the detected object. Finally, if an object is detected, the bounding box of the best object position is drawn on the frame using the homography matrix obtained from RANSAC.
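The selection rule described above can be sketched as follows, assuming the per-object inlier counts have already been obtained from RANSAC. The function name, the object names and the threshold value are illustrative assumptions, not values from our implementation.

```python
def pick_detected_object(inlier_counts, min_inliers):
    """Return the database object with the most RANSAC inliers, or None.

    inlier_counts: dict mapping object name -> number of homography inliers.
    min_inliers:   threshold an object must exceed to become a candidate
                   (hypothetical value; the real one is set empirically).
    """
    candidates = {name: n for name, n in inlier_counts.items() if n > min_inliers}
    if not candidates:
        return None                      # no object found in the current frame
    return max(candidates, key=candidates.get)

# Assumed example counts for three database objects.
counts = {"mug": 12, "book": 31, "bottle": 4}
detected = pick_detected_object(counts, min_inliers=10)
```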

More implementation details about the Pepper object recognition can be found in the /pepper_recog folder of our GitHub repository.

IV Simultaneous Localisation and Mapping

Nowadays, SLAM is one of the most active research topics in the computer vision and robotics communities. Many SLAM algorithms have been developed, as discussed in Section II. All of these algorithms share a common purpose but use different approaches depending on the available sensors. Some of them use lasers, LiDAR, cameras or RGB-D cameras (or a combination of different sensors). For instance, the LSD and ORB SLAM algorithms are based on RGB(-D) cameras and are called visual SLAM.

The sensors available for this project have been described earlier. Among the tested SLAM algorithms, ORB SLAM coped best with the limited sensor capabilities, such as the low frame rate, and was therefore the algorithm implemented on Pepper. This section gives a brief introduction to ORB features and ORB SLAM, as well as to its extension ORB SLAM 2 [26], which uses an RGB-D camera.

IV-A ORB feature

Since we are working with visual SLAM, extracting features from the input video stream is an essential step. Feature extraction is the basis of the ORB SLAM algorithm: it lets the robot understand the surrounding environment, localise itself and close the trajectory loop. The Oriented FAST and Rotated BRIEF feature [14], known as ORB, is a state-of-the-art feature descriptor that is applied in our SLAM algorithm.

ORB is built on the Features from Accelerated Segment Test (FAST) detector [27] and the Binary Robust Independent Elementary Features (BRIEF) descriptor [28]. The original FAST detector provides neither the keypoint orientation nor a measure of cornerness, so on its own it would not yield rotation-invariant features. Therefore, in the keypoint detection phase of ORB, the intensity centroid [29] and the Harris corner measure [30] are applied to remedy these disadvantages. Similarly, although the BRIEF descriptor can be calculated efficiently and is robust to additive illumination change, perspective distortion, etc., its performance diminishes significantly for rotations of more than a few degrees. To address this weakness, the best BRIEF pairs with large variance and low correlation are learned from PASCAL VOC 2006 [31], and the BRIEF descriptors obtained from the key points of the current image are then steered according to the orientation of the key points.

ORB is made up of the modified versions of FAST and BRIEF described above. It is rotation and scale invariant as well as robust to noise, and it has been shown that the performance of ORB in many real-life applications is equivalent to, or in some cases even slightly better than, that of SIFT. More importantly, ORB is computationally inexpensive: compared with the costly SIFT, ORB is about two orders of magnitude faster, which makes it suitable for our real-time SLAM application.

IV-B ORB SLAM – Monocular

ORB SLAM mainly consists of three components running in parallel: tracking, local mapping and loop closing. In the following sections, the main ideas of each component are described.

IV-B1 Tracking

This process starts with the initialisation of the map. In the monocular case, the depth has to be computed from several images of the same scene, obtained by moving the camera horizontally or vertically with respect to the scene. The authors of [23] proposed a new method for "structure from motion" estimation that combines two geometrical models for camera pose estimation:

  • Assumes the scene is planar and computes the corresponding homography matrix between two frames.

  • Assumes the scene is non-planar and computes the fundamental matrix.

Then the best model is selected using specific heuristics, and the camera pose is estimated based on the selected model. Once the map has been initialised from several consecutive frames of a scene from different viewpoints, the ORB features (key points) are extracted from consecutive frames. Note that the FAST corners are extracted at 8 scale levels, and the modified BRIEF descriptors are computed according to the key point orientation.

The camera pose is computed by searching for matches in a small area around each ORB key point between the current frame and the previous one. The search is optimised by assuming that the camera motion follows a constant velocity model. If there are not enough matches, the search is extended to all map points near the points from the last observed frame. If the track is lost, the current key points are converted into bag-of-words features and used to query the predefined bag-of-words recognition database in order to obtain the best matching keyframe. After that, the robot can be re-localised. Moreover, the Efficient Perspective-n-Point (EPnP) algorithm [32] along with RANSAC is applied to further refine the pose estimate.
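To make the constant velocity assumption concrete, the sketch below predicts the next camera pose by reapplying the relative motion observed between the last two frames. For brevity it uses planar poses (x, y, heading) rather than the full 3D poses used by ORB SLAM, and all function names are hypothetical.

```python
from math import cos, sin, pi

def compose(a, b):
    """Compose two planar poses (x, y, theta): apply b in the frame of a."""
    ax, ay, at = a
    bx, by, bt = b
    return (ax + cos(at) * bx - sin(at) * by,
            ay + sin(at) * bx + cos(at) * by,
            at + bt)

def inverse(p):
    """Inverse of a planar pose, so that compose(inverse(p), p) is the identity."""
    x, y, t = p
    return (-cos(t) * x - sin(t) * y, sin(t) * x - cos(t) * y, -t)

def predict_next_pose(prev_pose, last_pose):
    """Constant velocity model: repeat the last frame-to-frame motion once more."""
    motion = compose(inverse(prev_pose), last_pose)   # motion between the two frames
    return compose(last_pose, motion)

# The camera moved 1 unit forward and turned 90 degrees between two frames.
predicted = predict_next_pose((0.0, 0.0, 0.0), (1.0, 0.0, pi / 2))
```

The predicted pose only seeds the local search for matches; the actual pose still comes from the matched key points.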

IV-B2 Local mapping

The new keyframe is obtained as discussed in the last section. To add the new map points, we need to find the positions of all the new points in world coordinates. Instead of triangulating points only with the closest keyframes, as PTAM does, ORB-SLAM triangulates points with several neighbouring keyframes. As long as a pair of ORB features has been matched, the features can be triangulated.

Sometimes wrong map points may appear. To ensure all the mapped points are the real ones, we should check if a map point remains in the map for a period of time. The authors of ORB-SLAM use a method called pass culling test, which means a key point can be put in the map only after the following two conditions are satisfied: make sure the key point can be found in at least of frames, and the key point should be seen in at least three keyframes.

Finally, local bundle adjustment optimises the current keyframe. The final pose optimisation is performed with the Levenberg-Marquardt method.

IV-B3 Loop closing

Loop closing is one of the most important contributions of the ORB-SLAM and also one of the reasons we chose it for Pepper’s SLAM task. Loop closing means when the robot is moving around the environment and then comes back to the starting point, the system should be able to connect the latest movement with the initial ones. In this case, the trajectory can be closed, and the map will be globally changed. With the loop closing, the built map and the estimated robot trajectory are more accurate.

The main idea of loop closing can be summarised in three steps: loop detection, similarity transformation computation and loop fusion. First, a co-visibility consistency test is performed to check whether a loop has been found. Throughout the whole SLAM process, we keep calculating the similarity between the current keyframe and all its neighbours in the co-visibility graph. The keyframe with the highest similarity score is used to update the reference loop-closing frame. Second, if a keyframe passes the test in the first step, RANSAC is applied iteratively to calculate a similarity transformation with seven parameters: three translations, three rotations and one scale. When the candidate has enough inliers, the loop is considered found. Third, using the similarity transformation matrix obtained in the previous step, the map points in the current keyframe are transformed to the reference loop-closing keyframe. The map points from all the neighbours of the current keyframe are also projected through the same transformation, and all inliers from the previous step are fused.

The last step is to perform a global bundle adjustment. The only difference from section IV-B2 is that all the map points are used in the bundle adjustment and refined. Illustrations of the loop closing can be found in Section VI.
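As a small worked example of the similarity transformation, the sketch below applies a scale, a rotation and a translation to a single map point. The rotation is passed as a 3x3 matrix (which encodes the three rotation parameters); the function name and the numbers are illustrative assumptions.

```python
def apply_similarity(scale, rotation, translation, point):
    """Map a 3D point through p' = scale * R * p + t."""
    rotated = [sum(rotation[i][j] * point[j] for j in range(3)) for i in range(3)]
    return [scale * rotated[i] + translation[i] for i in range(3)]

# Assumed example: scale by 2, rotate 90 degrees about the z axis, shift 1 unit along x.
R_z90 = [[0.0, -1.0, 0.0],
         [1.0,  0.0, 0.0],
         [0.0,  0.0, 1.0]]
corrected = apply_similarity(2.0, R_z90, [1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```

The scale term is what distinguishes this from a rigid transformation; it lets the loop correction absorb the scale drift that a monocular system accumulates.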

IV-C ORB SLAM – RGB-D

As mentioned, the first step of ORB SLAM is the initialisation of the map, which requires several images of a scene from different viewpoints. However, this process takes a long time for the Pepper robot, because at the low frame rate of its camera the sequence of images cannot provide a smooth parallax effect.

To overcome this problem, an extension of monocular ORB SLAM was introduced in [26], where the depth estimation is replaced by an RGB-D camera. In this case, the initialisation process does not involve recovering the camera pose from several images. Instead, the first image taken by the camera can be used directly to initialise the map, because the depth information for the key points is already available. Therefore, using an RGB-D camera significantly speeds up the initialisation process, which is very important when using a camera with a low frame rate, as on the Pepper robot.

V Integration and Architecture

Our final objective was to combine the object recognition with SLAM, i.e. while running SLAM the robot can also identify the detected object’s position and put a marker with the label on the map built by SLAM.

In this section, we show how we accomplished this task and how the whole system is organised so that the robot, ROS and the two previously described algorithms work together.

V-A System overview

The Pepper robot comes with many built-in functions and its own operating system (OS) called NAOqi-OS. This is a Linux distribution based on Gentoo, and it is installed on the computer integrated in the robot. However, Pepper does not allow users to install third-party applications on its OS and requires the use of its own Software Development Toolkit (SDK). To overcome this limitation, the Robot Operating System (ROS) has been used in this project. ROS is a language- and platform-independent framework that lets users create packages in a graph-based structure and provides a powerful tool for sending and receiving messages between processes [33].

ROS – NAOqi driver and plugin for Pepper

Despite the fact that the manufacturers of Pepper limit the access to the OS of Pepper, they provide a driver that can be used to link NAOqi and ROS together. This driver fetches all sensor data and creates ROS nodes and topics which publish the states of all the robot sensors. Moreover, the driver creates topics for controlling joints of the robot allowing other ROS applications to subscribe and publish standard ROS messages (e.g. Twist) to control the robot. The whole process of NAOqi-ROS communication is illustrated in Fig. 2. As can be seen in this figure, the main role of the NAOqi driver is converting NAOqi modules to ROS nodes.

Fig. 2: The diagram illustrating the way of communication of a ROS application through NAOqi ROS driver.

In addition to the NAOqi driver, there must be robot-specific plugins that bring the specific capabilities of the robot to ROS. For example, the Pepper robot shares the same OS with other Aldebaran and Softbank robots, but each of these robots has a different configuration, such as the number of joints and the types of sensors. To make full use of the robot's capabilities, a robot-specific driver must be run. To achieve this, the pepper_bringup and pepper_dcm_bringup plugins [34] have been used for the Pepper robot. The main difference between pepper_bringup and pepper_dcm_bringup is that the former does not block the autonomous life of the robot, whereas the latter turns that functionality off. In our experiments, we did not use the autonomous life, since it makes Pepper imitate human behaviour (such as tracking a human face or reacting to a sudden loud noise), which can be inconvenient while running the algorithm.

V-B Implementation

Now we introduce how SLAM and object recognition are combined using ROS. First of all, we have to mention that the ORB SLAM 2 algorithm that we used has been implemented in C++ programming language as a stand-alone application, i.e. it can be used without ROS. For this reason, it does not use RViz for showing the map, which is a default and convenient visualisation tool of ROS. Instead, it uses Pangolin [35], which is a lightweight library for managing visualization and user interaction that wraps OpenGL [36] functions.

V-B1 Combining SLAM with ROS

To use ORB SLAM 2 in ROS, a ROS node was implemented to instantiate ORB SLAM 2 as an object. The created node subscribes to the topics where RGB and depth images are being published. Note that ORB SLAM 2 in RGB-D mode expects an RGB-D camera, but Pepper has separate RGB and depth cameras. Accordingly, we made two separate subscribers, one for each modality. We also have to make sure that the messages coming from these topics have the same timestamp, because some frames may be delayed or lost due to unexpected technical issues. Furthermore, we make sure that the images from both cameras are correctly registered. The described architecture for SLAM & ROS is illustrated in Fig. 3 (Block-A).
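The timestamp check can be illustrated with the following sketch, which pairs two sorted lists of timestamps and drops frames that have no partner. It is a simplified stand-in, not our ROS node: the function name and the tolerance `max_gap` are assumptions, and a tolerance of zero corresponds to requiring identical timestamps.

```python
def pair_by_timestamp(rgb_stamps, depth_stamps, max_gap):
    """Pair sorted RGB and depth timestamps that differ by at most max_gap seconds."""
    pairs, i, j = [], 0, 0
    while i < len(rgb_stamps) and j < len(depth_stamps):
        gap = rgb_stamps[i] - depth_stamps[j]
        if abs(gap) <= max_gap:
            pairs.append((rgb_stamps[i], depth_stamps[j]))
            i += 1
            j += 1
        elif gap < 0:
            i += 1      # RGB frame has no depth partner: drop it
        else:
            j += 1      # depth frame has no RGB partner: drop it
    return pairs

# Assumed example: the depth frame that should accompany the RGB frame at 0.10 s was lost.
rgb = [0.00, 0.10, 0.20, 0.30]
depth = [0.01, 0.21, 0.32]
synced = pair_by_timestamp(rgb, depth, max_gap=0.03)
```

Only the paired frames would then be handed to the SLAM system, so a lost depth frame costs one RGB frame rather than misaligning the whole stream.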

V-B2 Combining object recognition with ROS

In contrast to the ORB SLAM implementation, we implemented the object recognition module directly as a ROS node, so the algorithm logic (Fig. 1) is placed inside the node. We obtain images from Pepper’s frontal camera and convert the ROS raw image format to an OpenCV image using the CV-Bridge [37] package from ROS.

The object position with respect to the camera coordinate (depth estimation) is computed in this node as well. The estimated depth, which will be used for marking objects on the map, is published as a topic. To publish the object name and its position in camera frame we created a custom ROS message that holds the following fields: 1) flag (boolean type) – accepts true when an object has been detected, false otherwise; 2) depth (float type) – estimated distance from the camera frame origin to the object; 3) name (string type) – name of the object that has been detected.
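The three fields of the custom message can be mirrored by a plain Python data class, shown here purely to make the layout explicit. The class name `RecognitionMsg` and the default values are assumptions; the actual message is defined as a ROS message type.

```python
from dataclasses import dataclass

@dataclass
class RecognitionMsg:
    """Plain-Python mirror of the custom message fields (name is hypothetical)."""
    flag: bool = False    # True when an object has been detected
    depth: float = 0.0    # estimated distance from the camera frame origin to the object
    name: str = ""        # name of the detected object

# Assumed example values for a detection.
msg = RecognitionMsg(flag=True, depth=1.2, name="mug")
```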

V-B3 Marking objects on the map with homography

As mentioned earlier, when an object is detected, the object recognition node publishes a custom message with the flag field set to true. To put a marker with the name of the detected object on the map, we created a subscriber to the custom message in the SLAM node (Fig. 3 (Block-B)). Since the map visualisation is independent of ROS, we cannot directly put markers on the map inside the SLAM node. Therefore, we created a C++ class (referred to as Recognition.class below) that represents the recognised objects in the ORB SLAM 2 package. This class is also included in the ROS node. When the SLAM node receives a message notifying that an object has been detected, we create an instance of Recognition.class with the parameters that came with the message. To handle such instances, we modified the source code of ORB SLAM 2 to process Recognition.class objects along with the RGB and depth images.

Until now, we only know the positions of the objects w.r.t. the camera. Before plotting the object on the map, we have to find its position in the world frame. To do so, we obtain the pose (rotation + translation) of the camera at the moment the object was detected, which is described by a transformation matrix. Then, the position of the object is computed using the following equation:

$$P_w = T \, P_c$$

where $T$ is the camera transformation matrix that shows how the camera is rotated and translated from the origin of the world frame; $P_c$ and $P_w$ are the object positions in the camera and world frames, respectively; and $d$ is the estimated depth, from which $P_c$ is formed. Then we update the corresponding object coordinates with the newly computed world frame coordinates.

Now, by using the transformation matrix, the 3D position of the object with respect to the world coordinate has been found. As a result, we can directly put the marker with the name on that position in the map.
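The change of frame described above can be sketched in a few lines. The sketch assumes a 4x4 homogeneous camera-to-world matrix and, as a simplification that is not taken from our implementation, places the object on the optical axis at the estimated depth; the function name and the numbers are illustrative.

```python
def camera_to_world(T_wc, p_cam):
    """Apply a 4x4 homogeneous camera-to-world transform to a 3D point."""
    x, y, z = p_cam
    homogeneous = (x, y, z, 1.0)
    return tuple(sum(T_wc[r][c] * homogeneous[c] for c in range(4)) for r in range(3))

# Assumed camera pose: no rotation, camera located at (1, 2, 3) in the world frame.
T_wc = [[1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0, 2.0],
        [0.0, 0.0, 1.0, 3.0],
        [0.0, 0.0, 0.0, 1.0]]
depth = 2.0                          # estimated distance to the object (assumed)
p_cam = (0.0, 0.0, depth)            # assumption: object lies on the optical axis
p_world = camera_to_world(T_wc, p_cam)
```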

Fig. 3: System architecture. Block A. Communication of NAOqi ROS driver with the ORB SLAM 2 module. Block B. Integration of ORB SLAM 2 with Object Recognition module. Block C. Robot control module and NAOqi ROS driver communication.

V-C Additional features

This section presents the additional features that are essential for performing SLAM and that make the whole system faster and more practical.

V-C1 Robot control with a joystick

The first thing that needs to be mentioned is the robot control. This is the main module that is used for moving the robot in an indoor environment for building the map. This task is executed with the help of a joystick. The usage of a joystick ensures full control of the robot for SLAM. Additionally, by controlling the robot manually, we can assure that all the necessary areas of the environment are covered and put in the map.

ROS provides generic teleoperation tools [38], a simple library that reads commands from a joystick and publishes a vector with the button states. To make it work with our robot, we created a controller ROS node that subscribes to the joystick node. Then, depending on the pressed button, we define linear and angular velocities for the robot and send them to the /pepper_robot/cmd_vel topic. By sending velocity commands to that topic, we can control the robot base. However, it is worth mentioning that this does not allow control of the other joints of the robot.
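The button-to-velocity mapping can be sketched without ROS as a simple lookup table. The button names and the velocity values below are assumptions chosen for illustration; the actual mapping is in the joypepper.py script.

```python
# Hypothetical mapping: button name -> (linear velocity in m/s, angular velocity in rad/s).
BUTTON_TO_VELOCITY = {
    "up": (0.2, 0.0),
    "down": (-0.2, 0.0),
    "left": (0.0, 0.5),
    "right": (0.0, -0.5),
}

def velocity_for(pressed):
    """Sum the contributions of all pressed buttons into one (linear, angular) command."""
    known = [BUTTON_TO_VELOCITY[b] for b in pressed if b in BUTTON_TO_VELOCITY]
    linear = sum(v[0] for v in known)
    angular = sum(v[1] for v in known)
    return linear, angular

# Pressing "up" and "left" together drives forward while turning.
command = velocity_for(["up", "left"])
```

In a ROS node, the two numbers would be copied into the linear and angular parts of a Twist message before publishing.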

For controlling the robot head, we used NAOqi SDK inside our Robot Control node. First, we retrieve the current position of the head when a button, which was mapped to head movements, is pressed. Then, depending on the movement direction, we calculate the final position of the head (in degrees). Next, using the ALMotion NAOqi module we send a command to the robot. Additionally, we programmed two more buttons that send the robot to Rest and Active status, which is implemented using NAOqi SDK as well.

The general overview of the robot controlling component of the system is illustrated in Fig. 3 (Block-C). The implementation details can be found in /joy_pepper/scripts/joypepper.py in our GitHub repository.

V-C2 Map saving and loading

Once the map of the environment has been built, it is important to be able to reuse it. Saving the map becomes an important task due to the short working period of Pepper’s joints (e.g. overheat). The implementation of the ORB SLAM 2 does not provide a functionality that allows to save the built map and load an existing map. In order to fill this gap and allow Pepper to continue the map building process, we have included this feature to our system.

First, a naive method has been implemented where we save all the key points, keyframes and corresponding bag of words for each keyframe of the map into a text file. To reuse it we load and parse this file. This method appears to be very slow and inefficient, due to the large file size. Moreover, the processes of writing/reading from a text file are known to be slow.

A second approach is to save all the instances of the C++ objects into a binary file, a well-known programming strategy called serialisation. For ORB SLAM 2 there was already existing work in this direction [39], in which serialisation and deserialisation are used for saving and loading the map. However, that work covers only monocular SLAM, so we implemented the same approach for SLAM with the RGB-D camera. More details can be found in the files Map.cc, KeyFrame.cc and MapPoint.cc under /orb_slam2/src in our GitHub repository.
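The idea of binary serialisation can be sketched in a few lines. The system itself serialises C++ objects; the Python example below only illustrates the principle with the standard `struct` module, and the record layout (a point count followed by an id and three coordinates per map point) is an assumption made for the example.

```python
import struct

# Assumed record layout, for illustration only (little-endian, no padding).
HEADER = struct.Struct("<I")    # number of map points
POINT = struct.Struct("<I3d")   # point id, then x, y, z as doubles


def pack_points(points):
    """Serialise a list of (id, x, y, z) tuples into a bytes object."""
    out = bytearray(HEADER.pack(len(points)))
    for point_id, x, y, z in points:
        out += POINT.pack(point_id, x, y, z)
    return bytes(out)


def unpack_points(blob):
    """Rebuild the list of (id, x, y, z) tuples from the binary data."""
    (count,) = HEADER.unpack_from(blob, 0)
    offset = HEADER.size
    points = []
    for _ in range(count):
        points.append(POINT.unpack_from(blob, offset))
        offset += POINT.size
    return points
```

Because every record has a fixed size, loading needs no text parsing, which is where the naive text-file method loses most of its time.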

V-C3 Fast vocabulary loading

For loop closing and camera relocalisation, the authors of ORB SLAM 2 used a bag of words place recognition model [40]. This model uses a vocabulary of visual words built from a large database of images. Whenever ORB SLAM 2 is launched, it spends some time loading the vocabulary, because the vocabulary is stored as a text file of more than a million lines. This makes start-up very slow, so we serialised the vocabulary in the same way as the map [39]. The source code can be found in /tools/bin_vocabulary.cc.

V-C4 Object following and avoiding

We also implemented an object following and avoiding functionality for Pepper. The main idea is to allow the robot to continuously track an object but also avoid it by keeping a certain distance when the object is too close.

The application works as follows: if the estimated depth distance from Pepper to the detected object exceeds a preset threshold, Pepper follows the object at a pre-defined constant speed. Conversely, if the distance falls below a preset threshold, Pepper avoids the object by moving backwards. We also want the detected object to stay at the centre of the frame. To achieve this, we compute the displacement of the central point of the object from the frame centre and, depending on this displacement, send an angular velocity command to the robot to minimise it.
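The behaviour described above can be sketched as a small decision function. The threshold distances, the speed and the proportional gain are hypothetical values chosen for the example, and the use of a proportional rule for the angular velocity is an assumption: the text only states that the command reduces the displacement.

```python
def follow_avoid_command(depth_m, object_cx, frame_width,
                         follow_dist_m=1.2, avoid_dist_m=0.8,
                         speed=0.15, angular_gain=1.0):
    """Return (linear, angular) velocities for following or avoiding an object.

    depth_m      estimated distance to the object in metres
    object_cx    x coordinate of the object centre in the image, in pixels
    frame_width  image width in pixels
    All default parameter values are assumed, for illustration only.
    """
    if depth_m > follow_dist_m:
        linear = speed        # object is far: follow at constant speed
    elif depth_m < avoid_dist_m:
        linear = -speed       # object is too close: move backwards
    else:
        linear = 0.0          # inside the comfortable band: hold position

    # Normalised horizontal offset in [-1, 1]; positive when the object
    # lies to the right of the frame centre.
    half_width = frame_width / 2.0
    offset = (object_cx - half_width) / half_width

    # Turn towards the object: a positive angular velocity is a left turn,
    # so an object on the right yields a negative command.
    angular = -angular_gain * offset
    return linear, angular
```

Leaving a band between the two distances where the linear velocity is zero avoids oscillating between following and retreating when the object sits close to a single threshold.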

VI Results and discussion

In this section we discuss the object recognition, the SLAM and their integration. We present what we have accomplished and compare it with other object recognition and SLAM methods. A demonstration on the real robot is available at https://youtu.be/evFsnWH_bpY.

VI-A Object recognition

Fig. 4: Comparison of the recognition performance with ORB, SURF and SIFT: (a) runtime, (b) accuracy rate. The implementation language is Python and three different objects (book, folder, T-shirt) are stored in the database.

Since we use a feature-based object recognition framework, we first compare the recognition performance obtained with different features. Fig. 4 shows the performance of ORB, SURF and SIFT under our recognition framework. In Fig. 4(a), SIFT takes longer than the other two but stays within the acceptable update time (200 ms). When comparing recognition accuracy (Fig. 4(b)), however, SIFT-based recognition achieves the highest rate. For indoor usage, accuracy is more important than computational time, so SIFT is chosen as our primary feature.

In Table 1, we compare our SIFT + NN + RANSAC method with the well-known Haar Cascade method [11]. Our method outperforms Haar Cascades in almost all cases. The Haar Cascade method requires a long time to train for a single object, while our method requires no training and needs only one image per object. This makes the system more flexible and easier to use, since users can modify the database simply by adding or removing images of objects. The performance of our method is also much more consistent than that of Haar Cascades, and it produces barely any false alarms.

Table 1: Comparison of Haar Cascades and our SIFT + NN + RANSAC method.
Criterion | Haar Cascades | Ours
Training time | |
Detection consistency | |
False alarm | |
Rotation invariant | |
Detect with partial info | |
A large number of objects | |

In our experiments, when recognising the same still object, the recognition accuracy of our method is almost 100%. In contrast, the accuracy of Haar Cascades is below 60%, and false alarms and missed detections may occur even between two consecutive frames. For Haar Cascades, the more negative and positive samples are used for training, the better the recognition rate. However, the training time also increases dramatically.

Another important improvement of our method is rotation invariance. Haar-like features cannot deal with rotated objects unless a huge number of samples at various angles is used for training, and even with a huge training dataset a decent result is not guaranteed. SIFT, in contrast, is well known for its rotation invariance. Fig. 5 shows our method recognising an object under a variety of rotations.

Fig. 5: Illustration of the rotation invariance of the object recognition.

It is worth mentioning that our method can still recognise objects properly when only part of an object is visible, as shown in Fig. 6, where part of the folder is covered. With Haar Cascades, the object is not recognised at all, even if the covered portion is very small.

Fig. 6: The object recognition can work with partial information.

Finally, the main limitation of our method is that recognition becomes slower as the number of objects in the database increases. However, for a home service robot that always stays indoors, it is sufficient for Pepper to recognise only a limited number of objects.

VI-B Visual SLAM with object recognition

The experiments showed that ORB SLAM 2 outperforms its predecessor and LSD-SLAM in terms of initialisation and depth estimation accuracy. LSD-SLAM provides a dense reconstruction of the scene; in an indoor environment, however, this reconstruction is very noisy because of the inaccurate depth estimation of the points. ORB SLAM and ORB SLAM 2 therefore performed better in mapping an indoor environment. It should also be mentioned that LSD-SLAM is oriented towards large-scale environments, whereas ORB SLAM can be applied to both outdoor and indoor environments.

Table 2: Comparison of SLAM algorithms.
Feature | LSD | ORB | ORB-2 | Ours
RGB-D support | | | |
Fast initialization | | | |
Accurate localization | | | |
Map saving & reusing | | | |
Fast vocabulary load | | | |
Recognition + SLAM | | | |

Table 2 summarises the comparison of the SLAM algorithms that we tried, together with our improved version, which adds extra features to the ORB SLAM 2 implementation.

Our final system offers the most complete feature set of the compared methods. As explained before, map saving and reusing play a significant role when performing SLAM with Pepper. Fast initialisation of the tracking process is achieved by leveraging the depth camera, and the launch time is reduced by serialising the vocabulary.

Fig. 7: Illustration of the constructed map and localised objects (a) before and (b) after the loop closing.

The results of running SLAM with object recognition are illustrated in Fig. 7. The map in Fig. 7(a) is a preliminary result obtained before the loop closure. When the robot returned to its initial position, the system closed the loop and reconstructed the map of the environment as well as the trajectory of the robot, as shown in Fig. 7(b). The blue markers in both images represent the inserted keyframes, and the red (active) and black (inactive) points are the key points.

Fig. 8: Placing markers for a detected object. Top-left: map and the inserted marker; top-right: recognised object; bottom: true location of the robot and object.

After the loop closure, we saved the map and reloaded it to perform localisation only and to test object recognition and localisation on the map. The result of this test is shown in Fig. 8. The top-left image shows the previously built map and the robot position, together with the position and label of the recognised object, which is shown in the top-right image. The bottom image shows the robot and the part of the environment where we performed our tests.

VII Final Remarks and Future Work

The main contribution of this work is an innovative application that integrates a robust object recognition algorithm with a modified ORB SLAM 2. The system was implemented and successfully tested on the humanoid Pepper robot within the framework of the European Robotics League.

In summary, for object recognition, SIFT features are extracted and then matched using a kd-tree nearest neighbour search. RANSAC then decides whether an object has been recognised. The algorithm has shown its robustness through consistent detection, high accuracy with almost no false alarms, and rotation invariance.
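The RANSAC decision step can be illustrated with a simplified sketch. It assumes that feature matching has already produced point correspondences, and it fits a pure 2D translation instead of a full geometric model to keep the example short; the inlier threshold, iteration count and minimum inlier count are hypothetical values.

```python
import random


def ransac_translation(matches, threshold=3.0, iterations=100, seed=0):
    """Estimate a 2D translation from matches with RANSAC.

    matches is a list of ((x1, y1), (x2, y2)) correspondences between the
    database image and the current frame. Returns (translation, inliers).
    The translation-only model is a simplification for illustration.
    """
    rng = random.Random(seed)
    best_t, best_inliers = None, 0
    for _ in range(iterations):
        # Minimal sample: one correspondence defines a translation.
        (x1, y1), (x2, y2) = rng.choice(matches)
        tx, ty = x2 - x1, y2 - y1
        # Count correspondences that agree with this hypothesis.
        inliers = sum(
            1 for (ax, ay), (bx, by) in matches
            if (ax + tx - bx) ** 2 + (ay + ty - by) ** 2 <= threshold ** 2
        )
        if inliers > best_inliers:
            best_t, best_inliers = (tx, ty), inliers
    return best_t, best_inliers


def is_recognised(matches, min_inliers=8, **kwargs):
    """Declare the object recognised if enough matches agree on one model."""
    if not matches:
        return False
    _, inliers = ransac_translation(matches, **kwargs)
    return inliers >= min_inliers
```

Requiring a minimum number of geometrically consistent matches is what suppresses false alarms: random mismatches rarely agree on a single transformation.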

Regarding the SLAM application, we have modified and improved the open source ORB SLAM 2 in the following ways: enabling map saving and reuse, accelerating the vocabulary loading and, most importantly, integrating the object recognition. The whole system works successfully on Pepper despite its poor sensors, especially the low resolution and frame rate of the camera, and despite the joint overheating problem.

Finally, several extensions could be built on top of our proposed application, for example:

  • Autonomous control of velocity while building the map.

  • Add path planning algorithms (e.g. Rapidly-exploring Random Tree, Rotational Plane Sweep, etc.).

  • Make Pepper go to the marked position of a certain object and be able to grasp it and take the object back to the initial position.


We want to thank our project supervisor Dr Mauro Dragone for his helpful guidance and support in the process, as well as Dr Yvan Petillot for his comments and feedback.

We also thank Raul Mur-Artal for his outstanding ORB SLAM 2 algorithm, Bence Magyar for his advice on using joystick teleoperation, and José María Sola Durán for his object recognition code framework.


  • [1] S. Brown, “Meet pepper, the emotion reading robot,” Health, 2013.
  • [2] euRobotics, “European robotics league.” https://www.eu-robotics.net/robotics_league/, 2019.
  • [3] Aldebaran, Pepper Documentation NAOqi. Aldebaran SoftBank Group.
  • [4] K.-K. Sung and T. Poggio, “Example-based learning for view-based human face detection,” IEEE Transactions on pattern analysis and machine intelligence, vol. 20, no. 1, pp. 39–51, 1998.
  • [5] M. J. Swain and D. H. Ballard, “Color indexing,” International journal of computer vision, vol. 7, no. 1, pp. 11–32, 1991.
  • [6] B. Schiele and J. L. Crowley, “Object recognition using multidimensional receptive field histograms,” in ECCV, 1996.
  • [7] J. J. Koenderink and A. J. van Doorn, “Generic neighborhood operators,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 6, pp. 597–605, 1992.
  • [8] O. Linde and T. Lindeberg, “Object recognition using composed receptive field histograms of higher dimensionality,” in ICPR, 2004.
  • [9] H. Schneiderman and T. Kanade, “A statistical method for 3d object detection applied to faces and cars,” in CVPR, 2000.
  • [10] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE transactions on pattern analysis and machine intelligence, vol. 24, no. 4, pp. 509–522, 2002.
  • [11] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in CVPR, vol. 1, pp. I–511, 2001.
  • [12] D. G. Lowe, “Object recognition from local scale-invariant features,” in ICCV, vol. 2, pp. 1150–1157, 1999.
  • [13] H. Bay, T. Tuytelaars, and L. V. Gool, “SURF: Speeded up robust features,” in ECCV, pp. 404–417, 2006.
  • [14] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in ICCV, 2011.
  • [15] S. Leutenegger, M. Chli, and R. Y. Siegwart, “BRISK: Binary robust invariant scalable keypoints,” in ICCV, 2011.
  • [16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [17] R. Girshick, “Fast R-CNN,” in ICCV, pp. 1440–1448, 2015.
  • [18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in ECCV, 2016.
  • [19] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in CVPR, 2016.
  • [20] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, et al., “FastSLAM: A factored solution to the simultaneous localization and mapping problem,” in AAAI, pp. 593–598, 2002.
  • [21] D. Zou and P. Tan, “CoSLAM: Collaborative visual SLAM in dynamic environments,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 2, pp. 354–366, 2013.
  • [22] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,” in ECCV, pp. 834–849, 2014.
  • [23] R. Mur-Artal, J. Montiel, and J. D. Tardós, “ORB SLAM: a versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
  • [24] J. S. Beis and D. G. Lowe, “Shape indexing using approximate nearest-neighbour search in high-dimensional spaces,” in CVPR, 1997.
  • [25] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, 1981.
  • [26] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
  • [27] E. Rosten and T. Drummond, “Machine learning for high-speed corner detection,” in ECCV, pp. 430–443, 2006.
  • [28] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “BRIEF: Binary robust independent elementary features,” in ECCV, 2010.
  • [29] P. L. Rosin, “Measuring corner properties,” Computer Vision and Image Understanding, vol. 73, no. 2, pp. 291–307, 1999.
  • [30] C. Harris and M. Stephens, “A combined corner and edge detector,” in Alvey vision conference, vol. 15, p. 50, 1988.
  • [31] M. Everingham, A. Zisserman, C. K. Williams, and L. Van Gool, “The PASCAL visual object classes challenge 2006 (voc2006) results,” 2006.
  • [32] V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An accurate O(n) solution to the PnP problem,” International journal of computer vision, vol. 81, no. 2, pp. 155–166, 2009.
  • [33] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “ROS: an open-source robot operating system,” in ICRA workshop, vol. 3, p. 5, 2009.
  • [34] N. Lyubova, “Pepper bringup plugin.” https://github.com/ros-naoqi/pepper_robot, 2016.
  • [35] S. Lovegrove, “Pangolin.” https://github.com/stevenlovegrove/Pangolin, 2016.
  • [36] OpenGL, OpenGL official documentation. OpenGL.
  • [37] P. Mihelich and J. Bowman, “cv_bridge.” http://wiki.ros.org/cv_bridge, 2016.
  • [38] B. Magyar, “A set of generic teleoperation tools for any robot.” https://github.com/ros-teleop/teleop_tools, 2016.
  • [39] R. Mur-Artal, “ORB SLAM 2 implementation, modified by a github user @poine.” https://github.com/poine/ORB_SLAM2, 2016.
  • [40] D. Gálvez-López and J. D. Tardós, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.