ORBBuf: A Robust Buffering Method for Collaborative Visual SLAM

10/28/2020
by   Yu-Ping Wang, et al.
Tsinghua University

Collaborative simultaneous localization and mapping (SLAM) approaches provide a solution for autonomous robots based on embedded devices. However, visual SLAM systems rely on correlations between visual frames. As a result, the loss of visual frames from an unreliable wireless network can easily damage the results of collaborative visual SLAM systems. From our experiment, a loss of less than 1 second of data can lead to the failure of visual SLAM algorithms. We present a novel buffering method, ORBBuf, to reduce the impact of data loss on collaborative visual SLAM systems. We model the buffering problem as an optimization problem. We use an efficient greedy-like algorithm, and our buffering method drops the frame that results in the least loss to the quality of the SLAM results. We implement our ORBBuf method on ROS, a widely used middleware framework. Through an extensive evaluation on real-world scenarios and tens of gigabytes of datasets, we demonstrate that our ORBBuf method can be applied to different algorithms, different sensor data (both monocular images and stereo images), different scenes (both indoor and outdoor), and different network environments (both WiFi networks and 4G networks). Experimental results show that network interruptions indeed affect the SLAM results, and our ORBBuf method can reduce the RMSE by up to 50 times.



I Introduction

Visual simultaneous localization and mapping (SLAM) is an important research topic in robotics [41], computer vision [6], and multimedia [43]. In the last decade, demand for multiple robots working together has grown across different scenarios, including architecture modeling and landscape exploration; therefore, collaborative visual SLAM has become an emerging research topic [44]. On the other hand, the task of self-localization and mapping is computationally expensive, especially for embedded devices with both power and memory restrictions. Collaborative visual SLAM systems that outsource this task to a powerful central processing node are a feasible solution in this scenario [29].

Fig. 1:

The result of a real-world SLAM experiment run with a TurtleBot3 and a server. (a) A photo from our laboratory. (b) The server received visual data from the robot with ROS and ran a SLAM algorithm but failed due to WiFi unreliability (the door on the left was completely missing). (c) It also failed when employing the random buffering method. (d) By employing our ORBBuf method, the SLAM algorithm successfully estimated the correct trajectory (the red curve) and built a sparse 3D map (the white points).

In collaborative visual SLAM systems, robots transmit the collected visual data to other robots and/or a high-performance server. This requires high network bandwidth and network reliability [37]. There has been work on reducing the bandwidth requirements based on video compression techniques [16, 5] or compact 3D representations [21, 17], though few focus on network reliability. In this paper, we address the problem of network reliability, which can have considerable impact on the accuracy of collaborative visual SLAM systems, and our approach is orthogonal to the methods that reduce the bandwidth requirements.

Network connections, especially wireless ones (e.g. over WiFi or 4G), are not always reliable. A detailed measurement study [40] has shown that lower throughput or even network interruption could occur for dozens of seconds due to tunnels, large buildings, or poor coverage in general. With the advent of 5G, the problems of network bandwidth and latency will be relieved, but the unreliability due to poor coverage still exists [4].

To facilitate our work, we built a collaborative visual SLAM system based on Turtlebot3 and ROS [33]. The robot moved around our laboratory as shown in Figure 1(a). We fixed a camera on top of the robot, and the captured images were transmitted to a server via a public WiFi router. We found that the SLAM algorithm [9] on the server failed at some position. At that position, the network connection was highly unreliable (it may have been affected by the surrounding metal tables). To tolerate such network unreliability, a common solution is buffering, which puts new frames into a buffer and waits for future transmission. When the network is unreliable, the buffer becomes full and the buffering method is responsible for deciding which frame(s) should be discarded. We tried two kinds of commonly used buffering methods, but the SLAM algorithm failed in both cases (shown in Figure 1(b) and (c)).

Main Results: In this paper, we present a novel robust buffering method, ORBBuf, for collaborative visual SLAM systems. By analyzing the requirement of visual SLAM algorithms, we introduce the similarity value between frames and model the buffering problem into an optimization problem that tries to maximize the minimal similarity. We use an efficient greedy-like algorithm, and our ORBBuf method drops the frame that results in the least impact to visual SLAM algorithms. Experimental results on real-world scenarios (shown in Figure 1(d)) and tens of gigabytes of 3D vision datasets show that the network interruptions indeed affect the SLAM results and that our ORBBuf method can greatly relieve the impacts. The main contributions of this paper include:

  • We address the problem of network reliability for collaborative visual SLAM. We model the buffering problem into an optimization problem and propose a novel buffering method named ORBBuf to solve this problem.

  • We conduct studies to show how much SLAM algorithms rely on the relationships between the consecutive input frames, which reveal the characteristics of SLAM algorithms.

  • We implement our ORBBuf method based on a widely used middleware framework (ROS [33]).

  • Through an extensive evaluation on real-world scenarios and tens of gigabytes of 3D vision datasets, we demonstrate that our ORBBuf method can be applied to different algorithms (VinsFusion [31] and DSO [9]), large outdoor (KITTI [12]) and indoor (TUM monoVO [11]) datasets, and different network environments (WiFi network and 4G network [40]). The experimental results show that using our ORBBuf method can reduce the RMSE (root mean square error) up to 50 times.

II Related Work

II-A Visual SLAM

Visual information is usually large in size. To lower the computational requirement, feature-based approaches extract feature points from the original images and only use these feature points to decide the relative pose between frames through matching. A typical state-of-the-art feature-based visual SLAM is ORB-SLAM [26, 27]. The most recent version, ORB-SLAM3 [2], has combined most visual SLAM techniques into a single framework based on the ORB [36] feature. Accurate localization depends on the number of successfully matched feature points among consecutive frames. The correlations among consecutive frames are not robust [1]. When too many consecutive frames are lost, visual SLAM algorithms will fail to match enough corresponding feature points, leading to inaccurate results or even failure of the system. We will verify this claim with a more detailed analysis in Section III-D. In this case, the loop-closure routine, which usually works as a background optimization to relieve accumulated errors, may save the SLAM system from a total failure, but it is both expensive and unstable [24]. In addition to feature-based approaches, there are direct SLAM algorithms that use all raw pixels, including DTAM [28], LSD-SLAM [10], and DSO [9]. They try to minimize the photometric error between frames with probabilistic models operating directly on the raw pixels. The success of all these SLAM algorithms depends on the correlations between consecutive frames, which indicates the requirements of SLAM algorithms. In this paper, we introduce fast feature extraction into a buffering method to relieve the impact of network unreliability on SLAM algorithms.

II-B Collaborative Visual SLAM

Most collaborative visual SLAM systems adopt a centralized architecture, as in PTAMM [3], C2TAM [34], and CVI-SLAM [18]. These methods are easily implementable compared with the emerging decentralized architecture [7]. Despite the different architecture, a major issue for collaborative visual SLAM is what data representation should be transmitted. Compress-then-analyze (CTA) solutions transmit compressed sensor data such as RGB images and depth images [14]. Analyze-then-compress (ATC) solutions assume that robots are powerful enough to build a local map or generate hints of the local map, and then share them among robots or merge them at the server [8]. Opdenbosch et al. [29] proposed an intermediate solution where fast feature extraction is performed at the robots and the server performs the following SLAM by collecting the features.

On the other hand, few works have focused on network reliability. Many works explicitly assume that the network bandwidth is sufficient and stable [19, 7, 25]. This assumption is hard to satisfy in real-world wireless networks [40, 4]. In this paper, we address the problem of network reliability and propose a novel buffering method that is robust against network unreliability.

II-C Buffering Methods in Other Applications

Buffering methods are widely used in other applications such as video streaming and video conferencing. Dynamic Adaptive Streaming over HTTP (DASH) [39] has become the de facto standard for video streaming services. By using the DASH technique, each video is partitioned into segments and each segment is encoded in multiple bitrates. Adaptive BitRate (ABR) algorithms decide which bitrate should be used for the next segment. However, the DASH technique is not quite suitable for collaborative visual SLAM for two main reasons. First, visual SLAM algorithms rely on correlations between consecutive frames. Using multiple bitrates potentially weakens such correlations and increases errors. Second, correlations among consecutive frames are not robust [1]. Since the DASH technique partitions video into segments, losing even a single segment would lead to failure.

In addition to the DASH technique, buffering methods are common solutions against network unreliability and have been researched for the last decade. Specialized for video transmission, enhanced buffering methods introduce the concept of profit and assume that each message has a constant profit value; the goal is to maximize the total profit of the remaining frames [13]. Depending on the application scenario, different types of frames may have different profits [23, 20, 38, 42, 22]. However, in the visual SLAM scenario, the profit of a frame is not a constant value but rather a correlation with its neighbors. Therefore, we need better techniques that understand the requirements that visual SLAM algorithms place on the input frame sequences.

III Our Approach: ORBBuf

In this section, we model the buffering problem as an optimization problem (Section III-A). We then solve this problem with our ORBBuf method (Section III-B) and describe its implementation (Section III-C). Finally, we discuss why alternative buffering methods fail by analyzing the characteristics of visual SLAM algorithms (Section III-D).

III-A Modeling the Buffering Problem

Messages containing 3D visual information are usually large in size. However, network packets are limited in size (e.g., 1500 bytes), meaning that these messages are divided into hundreds of packets. If each message contains 100 packets, losing even 1% of the packets means about 63% of the messages would not be complete. Therefore, unreliable network protocols such as UDP are not preferred in this case. When using reliable network protocols (e.g., TCP) or specialized protocols (e.g., QUIC [30]), a network connection is dedicated to sending a single message at a time while other messages are inserted into a sending buffer to wait for transmission. When the network bandwidth is insufficient or a temporary network interruption occurs, the sending buffer becomes full. In this case, the buffering method has to drop some messages from the sending buffer.
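The 63% figure follows directly from treating packet losses as independent: a message survives only if all 100 of its packets arrive. A quick check:

```python
# Probability that a 100-packet message arrives complete under
# 1% independent packet loss: every packet must survive.
loss_rate = 0.01
packets_per_message = 100

p_complete = (1 - loss_rate) ** packets_per_message
p_incomplete = 1 - p_complete

print(f"complete: {p_complete:.3f}, incomplete: {p_incomplete:.3f}")
# → complete: 0.366, incomplete: 0.634
```

So roughly two out of three messages would be unusable over UDP, which is why a reliable transport plus a sending buffer is the natural design here.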

For clarity, we define some notations and symbols that we will use to model the buffering problem.

  • $S$ denotes a message sequence. Messages in the message sequence are denoted as $s_1, s_2, \dots$. $|S|$ denotes the number of messages of $S$.

  • $G$ denotes the message sequence generated at the edge-side device. Since messages are generated over time, $G$ grows correspondingly. $G^t$ denotes the status of $G$ at time $t$.

  • Without loss of generality, we define time $i$ as the moment message $s_i$ is generated, i.e., $|G^i| = i$.

  • $T$ denotes the message sequence that is actually sent via the network. $R$ denotes the message sequence that is received at the server. Obviously, we have $T \subseteq G$. Since we use reliable network protocols, we can assume that $R = T$.

  • $n$ denotes the size of the sending buffer. $B^t$ denotes the buffer status at time $t$.

Our goal is to keep the SLAM algorithm robust. As we have stated in Section II, the success of SLAM algorithms depends on the correlations between consecutive frames, i.e., between adjacent messages in $R$. In this paper, we define the correlation between two frames as their similarity and denote the similarity function as $Sim(\cdot, \cdot)$. Larger similarity leads to more robust SLAM results. Therefore, we can state the buffering problem as follows.

Problem Statement: Knowing the buffer size $n$ while not knowing the transmission status or the message sequence in advance, produce a buffering method that maximizes the minimal similarity over adjacent messages in $R$, as described in Equation (1).

$$\max \ \min_{1 \le i < |R|} Sim(r_i, r_{i+1}) \qquad (1)$$
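The objective in Equation (1) is easy to state operationally: score a received sequence by the weakest adjacent link. A minimal sketch, with a toy similarity function standing in for the real one:

```python
# Objective of Eq. (1): a received sequence R is only as good as the
# weakest similarity between any two adjacent surviving frames.
def min_adjacent_similarity(frames, sim):
    """frames: received sequence R; sim: pairwise similarity function."""
    return min(sim(a, b) for a, b in zip(frames, frames[1:]))

# Toy similarity on integer "frame indices": nearer frames are more similar.
toy_sim = lambda a, b: 100 - abs(a - b)

print(min_adjacent_similarity([0, 1, 2, 9], toy_sim))  # → 93
```

The large gap between frames 2 and 9 dominates the score, which is exactly the behavior the max-min formulation is designed to penalize.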

From the problem statement, we can see that the key to solving this problem is understanding the similarity between frames. In fact, different SLAM algorithms use different models to represent similarity, but they all reflect the correlation between the frames themselves. Without loss of generality, we use the number of successfully matched ORB feature points to measure the similarity. ORB feature refers to “Oriented FAST and Rotated BRIEF” and corresponds to a very fast binary descriptor based on BRIEF, which is rotation invariant and resistant to noise [36]. The ORB-SLAM algorithm [26] is based on the ORB feature because it can be calculated efficiently.

III-B Our ORBBuf Method

As we have stated in Equation (1), our goal is to maximize the minimal similarity between adjacent frames in $R$. The only information provided to a buffering method is the buffer status $B^t$ and the next message $s_i$. Therefore, we cannot obtain the globally optimal solution. However, we can use a greedy-like algorithm to solve each local problem. Since $R$ is composed of the messages output from the buffer $B$, we can define a local problem as Equation (2). Our ORBBuf method finds the result $B^{t+1}$ that maximizes the minimal similarity among all feasible choices of $B^{t+1}$.

$$B^{t+1} = \operatorname*{arg\,max}_{B \subset B^t \cup \{s_i\},\ |B| = n}\ \min_{1 \le j < n} Sim(b_j, b_{j+1}) \qquad (2)$$

To get the optimal solution of this local problem efficiently, we record a score for each message, which is the similarity between its previous message and the next message in the buffer. The score of a message indicates the resulting adjacent similarity when it is dropped. When the buffer is full, our ORBBuf method will drop the message with the maximal score. Since the similarity between other adjacent messages remains the same, the resulting minimal similarity relies on whether the score is lower than the original minimal similarity.

The pseudo code of our ORBBuf method is shown in Figure 2. From line 3 to line 9, we find the message with the maximal score. This message is dropped in line 10. Dropping this message causes the scores of some messages to change in lines 11 and 12. After the new message is inserted in line 14, the score of the previous message is updated in line 15. Note that the similarity function is called at most 3 times each time a message arrives at the buffer. Since the calculation of the ORB feature is fast, the overhead introduced by our ORBBuf method is negligible.

Overall, with the cost of the memory for recording a score for each message (which is tiny compared with the size of each message), our ORBBuf method can find the solution within O(1) time, which is independent of the buffer size.

Fig. 2: The pseudo code of our ORBBuf method. It is a greedy-like algorithm that maximizes the resulting minimal adjacent similarity within O(1) time.
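A minimal user-level Python sketch of the procedure in Figure 2 (the paper's actual implementation lives inside the ROS transport layer); `sim` is any pairwise similarity function such as an ORB match count, and the class name and endpoint handling are our assumptions:

```python
class ORBBuf:
    """Greedy buffering: when full, drop the message whose removal leaves
    the highest similarity between its former neighbors (its "score")."""

    def __init__(self, capacity, sim):
        self.capacity = capacity       # buffer size n (assumed >= 3)
        self.sim = sim                 # similarity function, e.g. ORB match count
        self.msgs = []                 # buffered messages, oldest first
        self.scores = []               # scores[i] = sim(msgs[i-1], msgs[i+1])

    def _score(self, i):
        # A score is defined only for interior messages; endpoints are
        # never dropped, so they get -inf.
        if 0 < i < len(self.msgs) - 1:
            return self.sim(self.msgs[i - 1], self.msgs[i + 1])
        return float("-inf")

    def push(self, msg):
        if len(self.msgs) == self.capacity:
            # Drop the message with the maximal score (Fig. 2, lines 3-10).
            drop = max(range(len(self.msgs)), key=lambda i: self.scores[i])
            del self.msgs[drop]
            del self.scores[drop]
            # Dropping changes its former neighbors' scores (lines 11-12).
            for j in (drop - 1, drop):
                if 0 <= j < len(self.scores):
                    self.scores[j] = self._score(j)
        self.msgs.append(msg)
        self.scores.append(float("-inf"))          # newest has no next yet
        if len(self.msgs) >= 2:                    # line 15: update previous
            self.scores[-2] = self._score(len(self.msgs) - 2)

    def pop(self):
        """Take the oldest message for transmission."""
        msg = self.msgs.pop(0)
        self.scores.pop(0)
        if self.scores:
            self.scores[0] = self._score(0)
        return msg
```

Per arriving message this calls `sim` at most three times (two neighbor updates plus the new message's predecessor), matching the cost bound claimed for Figure 2.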

III-C Implementation

We implement our ORBBuf method based on ROS [33], which is a middleware widely used in robotic systems. The interface of ROS does not notify the upper-level application when the lower-level transmission is completed. Thus, we add a notification mechanism and an interface to ROS. At runtime, each new message is put into a buffer following our ORBBuf method, and the low-level transmission is invoked if possible. Whenever the low-level transmission is completed, we begin transmission of another message. We modify the code of the class TransportSubscriberLink, located in the ros_comm project [35], and link it into libroscpp.so. Developers can easily call our interface to send a new message with our ORBBuf method.

III-D Analysis and Comparison

Before discussing our quantitative evaluation, we would like to compare two alternative buffering methods by analyzing the characteristics of visual SLAM algorithms.

A commonly-used but trivial buffering method is the Drop-Oldest method. It is designed based on the assumption that newer messages are more valuable than older messages. Therefore, if the buffer is full when a new message arrives, the oldest message is dropped to make room for the new message (formalized in Equation (3)). When message transmission is slower than message generation, the buffer eventually becomes full, and consecutive frames would be dropped during network interruption.

$$B^{t+1} = \{b_2, b_3, \dots, b_n, s_i\} \qquad (3)$$
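In Python, the Drop-Oldest policy is exactly the behavior of a bounded deque, which makes the failure mode easy to see: during an interruption, the dropped frames form one consecutive run.

```python
from collections import deque

# Drop-Oldest (Eq. (3)): a bounded FIFO. Appending to a full deque
# silently discards the element at the opposite (oldest) end.
buffer = deque(maxlen=3)
for frame in [1, 2, 3, 4, 5]:
    buffer.append(frame)

print(list(buffer))  # → [3, 4, 5]
```

Frames 1 and 2, adjacent in time, were both lost, which is precisely the large-frame-distance situation the study below shows to be harmful.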

To show the consequence of losing consecutive frames, we conduct a study with sequence No. 11 of the TUM monoVO dataset [11] and pick the frames from No. 200 to No. 300. For every two frames, we record their distance (the difference between their frame numbers) and their similarity. The results are plotted in Figure 3. At the left of Figure 3, the horizontal axis is the distance between two frames and the vertical axis is their similarity. We can clearly see that there is a coarse inverse relationship between the frame distance and the similarity. We note that there are more exceptions when the distance becomes larger, which is reasonable since there may be similar objects in an indoor scene. In order to show the inverse relationship more clearly, we draw the corresponding histogram at the right of Figure 3. The horizontal axis is the product of the distance and similarity and the vertical axis is the number of pairs of frames within the interval. We note that 70.3% of the pairs are within the first 3 intervals (from 0 to 1500), which is shown as the red curve (on the left) or the red line (on the right). Overall, we can conclude that there is a coarse inverse relationship between the frame distance and the similarity. Therefore, when consecutive frames are dropped by the Drop-Oldest method, a large frame distance occurs, and we are thus more likely to get a lower similarity and unstable SLAM results.

Fig. 3: Relationship between the frame distance and the similarity. 70.3% of points are under the red curve on the left and they show a coarse inverse relationship between the frame distance and the similarity.

A promising buffering method to avoid losing consecutive frames is the Random method. If the buffer is full when a new message arrives, this method randomly drops a message inside the buffer (formalized in Equation (4), where $k$ is a random number between $1$ and $n$). Statistically, the Random method drops messages in a uniform manner. Such a uniform frame distance is more likely to result in an average similarity and to solve the problem formalized in Equation (2).

$$B^{t+1} = (B^t \setminus \{b_k\}) \cup \{s_i\}, \quad k \sim \mathcal{U}\{1, \dots, n\} \qquad (4)$$
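For comparison with the bounded-deque baseline, the Random policy is a one-line eviction; the function name and list representation here are our own sketch:

```python
import random

def random_drop_push(buffer, msg, capacity, rng=random):
    """Random policy (Eq. (4)): if the buffer is full, evict a
    uniformly chosen message, then append the new one."""
    if len(buffer) == capacity:
        del buffer[rng.randrange(len(buffer))]
    buffer.append(msg)
```

Note that relative frame order is preserved, so the buffer stays a subsequence of the input, but which gaps open up is left entirely to chance.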

In practice, this method cannot solve our buffering problem for two reasons. First, as stated in Equation (1), we need to maximize the minimal similarity, which the Random method does not guarantee even statistically. Second, similarity is not distributed in a uniform manner.

To verify this claim, we conduct another study with the same data sequence that we used in the first study. We remove some interval of frames from the original sequence, provide it to a SLAM algorithm (the DSO [9] algorithm), and record whether the algorithm fails. The results are shown in Figure 4. The horizontal axis is the number of the first lost frame, and the vertical axis is the number of consecutive frames that would cause the SLAM algorithm to fail. For example, the value at 250 is 32, which means the algorithm failed when the frames from 250 to 282 were lost, but the algorithm still succeeded when the frames from 250 to 281 were lost. In the worst case (at frame 271), losing 17 consecutive frames (at 25 fps, i.e. 0.68 seconds) would cause the algorithm to fail. From this result, we can see that the SLAM algorithm can tolerate some lost frames, but the tolerance capability is uneven at different positions.

Fig. 4: Simulation results when a SLAM algorithm loses consecutive frames. The SLAM algorithm can tolerate some lost frames, but the tolerance capability is uneven at different positions (less than 1 second in the worst case at frame 271).

Overall, these two solutions cannot solve our problem. They fail because they are general-purpose methods and do not take the requirements of SLAM algorithms into consideration. In contrast, our ORBBuf method employs similarity to guide dropping procedures. In our evaluation, we will compare our ORBBuf method with these two methods.

IV Evaluation

To show the advantage of our ORBBuf method, we carry out three experiments. In these experiments, different kinds of network situations, different SLAM algorithms, and different kinds of input data are employed.

To show practical results, our experiments are carried out completely online on real-world hardware. We use a laptop with an Intel Core i7-8750H @2.20GHz 12x CPU, 16GB memory, and GeForce GTX 1080 GPU as the SLAM server. In the first two experiments, large vision datasets are used. We use a laptop without GPU support to replay each data sequence and transmit it to the server. All software modules are connected with the ROS middleware. The ROS version is Lunar on Ubuntu 16.04.

IV-A Simulated Network Interruption

In this experiment, we use the TUM monoVO [11] dataset as the input data sequence and run the DSO [9] algorithm on the SLAM server. All data sequences are replayed at 25 fps, and the buffer size is set to one second. The TUM monoVO dataset is a large indoor dataset that contains 50 real-world sequences (43GB in total) captured with a calibrated monocular camera. All images in this dataset are compressed as .jpg files and have a resolution of 1280x1024. The robot and the server are connected via a real-world 54Mbps WiFi router. To simulate network interruption, we modify the data-sequence replaying module by adding three new parameters: before transmitting a designated frame, "Intr", we add a latency of T milliseconds to the network using the tc command, and each simulated interruption lasts for L frames. During this experiment, the network latency T is set to 1000 milliseconds, and the network interruption duration L is set to 50 frames.
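The tc-based interruption can be scripted roughly as below; the interface name, the exact netem invocation, and the wiring into the replay loop are our assumptions (the paper does not give its precise commands), and running tc requires root privileges:

```python
import subprocess

def interruption_cmds(iface="wlan0", delay_ms=1000):
    """Build the tc netem commands that start/stop an added latency."""
    start = ["tc", "qdisc", "add", "dev", iface, "root", "netem",
             "delay", f"{delay_ms}ms"]
    stop = ["tc", "qdisc", "del", "dev", iface, "root", "netem"]
    return start, stop

def replay(frames, intr, duration_l, send, iface="wlan0", delay_ms=1000):
    """Replay frames; before frame `intr`, add latency for `duration_l` frames."""
    start, stop = interruption_cmds(iface, delay_ms)
    for i, frame in enumerate(frames):
        if i == intr:
            subprocess.run(start, check=True)   # interruption begins
        if i == intr + duration_l:
            subprocess.run(stop, check=True)    # interruption ends
        send(frame)
```

With T = 1000 ms and L = 50 frames at 25 fps, this holds the extra latency in place for two seconds of replayed data.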

Table I gives the numeric evaluation results. In this table, "Seq" denotes the sequence number in the dataset, "Size" denotes the total size of the data sequence, "Frames" denotes the total number of frames, "Intr" denotes the frame at which network interruption occurs, "Points" denotes the total number of points in the ground-truth result, "Policy" denotes the buffering method used, "Cnt" denotes the number of points in the result when using the buffering method, "Percent" denotes the completeness (Cnt divided by Points), "Mean" denotes the mean error distance between corresponding points, and "Std" denotes the standard deviation of the error distances. Figure 5 shows two visualized results.

TABLE I: Numeric comparison results of different buffering methods tolerating simulated network interruption.
Fig. 5: Two visualized results of different buffering methods tolerating simulated network interruption. Sequence 1 and 11 of the TUM monoVO dataset are both collected from indoor scenes that include rooms and corridors. The comparisons are generated with CloudCompare, where grey points represent the ground truth, red points represent large errors and blue points represent small errors.

As we can see, when using the Drop-Oldest method, the SLAM algorithm is most likely to fail when the network interruption occurs (the percentage is low). When using the Random method, the SLAM algorithm still fails in some cases. When using our ORBBuf method, the SLAM algorithm succeeds in all cases and the error distances between corresponding points are smaller than other successful cases.

IV-B Collected 4G Network Traces

In this experiment, we use the KITTI [12] dataset as the input data sequence and run the VINS-Fusion [32, 31] algorithm on the server. The KITTI dataset is a large outdoor dataset that contains 22 real-world sequences (22.5GB in total) captured with calibrated stereo cameras fixed on top of an automobile. All images in this dataset are .png files and have a resolution of 1241x376. During this experiment, all data sequences are replayed at 10 fps and the buffer size is set to two seconds. We employ the network traces collected from a real-world 4G network by [40]. The robot and the server are connected via a network cable to minimize other factors. In [40], dozens of network traces are collected for different scenarios. Since the KITTI dataset is collected on an automobile, we choose four of the network traces that are also collected on a driving car.

Table II gives numeric evaluation results. In this table, “Seq” denotes the sequence number in the dataset, “Size” denotes the total size of the data sequence, “Frames” denotes the total number of frames, “Net Trace” denotes the network traces, and “RMSE” denotes the root mean square error between the ground truth and the result using a buffering method. Figure 6 shows two visualized results.

TABLE II: Numeric comparison results of different buffering methods handling collected 4G network traces.
Fig. 6: Two visualized results of different buffering methods handling collected 4G network traces. Sequence 00 and 08 of the KITTI dataset are collected from outdoor scenes on urban roads. The comparisons are generated with the evo project [15].

The VINS-Fusion algorithm never reports a failure, but its results can be highly unstable. As we can see, when using the Drop-Oldest method, the resulting RMSE values are relatively large. When using our ORBBuf method, the resulting trajectory fits the ground truth much better, and the RMSE values are reduced by up to 50 times.

We further test the effect of varying the buffer size. We repeat the experiment using sequence No.00 of the KITTI dataset and the network trace labelled Car02 with different buffer sizes. We repeat each test case 10 times, and the results are summarized in the box-plot in Figure 7. When using the Drop-Oldest method, the resulting RMSE becomes low when the buffer size increases to 30 or more. When using our ORBBuf method, the resulting RMSE becomes low once the buffer size is 15 or more. When using the Random method, the resulting RMSE is not stable, even when the buffer size is 35. This result indicates that our ORBBuf method can tolerate the same level of network unreliability with a smaller buffer size.

Fig. 7: The effect of buffer size. Our ORBBuf method can tolerate the same level of network unreliability with a smaller buffer size.

In addition, we test the running time of our ORBBuf method during the experiment. Since our ORBBuf method introduces the calculation of the ORB feature into the message enqueuing routine, it introduces some time overhead. We record the time to enqueue each frame during the previous experiment. The result using sequence 00 of the KITTI dataset and the network trace labelled Car02 is shown in Figure 8. This result is related to the used network trace. When the network is stable, no time overhead is introduced at all. When the network bandwidth is low or network interruption occurs, our ORBBuf method starts to calculate the ORB features. In the worst case, our ORBBuf method introduces overhead of about 23ms, which does not affect the periodic data transmission (i.e. 10 Hz in this experiment and 25Hz in the next experiment).

Fig. 8: Running time of our ORBBuf method. Our ORBBuf method introduces negligible time overhead.

IV-C Real-World Network

In this experiment, we build a TurtleBot3 Burger with a Raspberry Pi with 1GB of memory. Images captured by a camera on top of the TurtleBot are transmitted to the server via a public WiFi router. We run the DSO algorithm on the server and use a Bluetooth keyboard to drive the robot along the same path when using different buffering methods.

The results are shown in Figure 1. When using the Drop-Oldest and Random methods, the SLAM algorithm loses its trajectory. When using our ORBBuf method, the SLAM algorithm successfully estimates the correct trajectory (the red curve) and builds a sparse 3D map (the white points).

Overall, we have shown that our ORBBuf method can be used under different kinds of network situations and can adapt to different kinds of input sensor data. Network interruptions indeed affect collaborative SLAM systems, and with our ORBBuf method, SLAM systems become more robust against network unreliability.

V Conclusion

This paper presented a novel buffering method for collaborative visual SLAM systems. By considering the similarity between frames inside the buffer, our proposed buffering method helps overcome the network interruptions. We integrated our ORBBuf method with ROS, a commonly used communication middleware. Experimental results demonstrated that our ORBBuf method helps visual SLAM algorithms be more robust against network unreliability and reduces the RMSE up to 50 times.

References

  • [1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. D. Reid, and J. J. Leonard (2016) Past, present, and future of simultaneous localization and mapping: toward the robust-perception age. IEEE Trans. Robotics 32 (6), pp. 1309–1332. External Links: Link, Document Cited by: §II-A, §II-C.
  • [2] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. M. Montiel, and J. D. Tardós (2020) ORB-SLAM3: an accurate open-source library for visual, visual-inertial and multi-map SLAM. CoRR abs/2007.11898. External Links: Link, 2007.11898 Cited by: §II-A.
  • [3] R. O. Castle, G. Klein, and D. W. Murray (2008) Video-rate localization in multiple maps for wearable augmented reality. In 12th IEEE International Symposium on Wearable Computers (ISWC 2008), September 28 - October 1, 2008, Pittsburgh, PA, USA, pp. 15–22. External Links: Link, Document Cited by: §II-B.
  • [4] J. Chen, X. Ge, and Q. Ni (2019) Coverage and handoff analysis of 5g fractal small cell networks. IEEE Trans. Wireless Communications 18 (2), pp. 1263–1276. External Links: Link, Document Cited by: §I, §II-B.
  • [5] J. Chen, J. Hou, and L. Chau (2018) Light field compression with disparity-guided sparse coding based on structural key views. IEEE Trans. Image Processing 27 (1), pp. 314–324. External Links: Link, Document Cited by: §I.
  • [6] S. Choi, Q. Zhou, and V. Koltun (2015) Robust reconstruction of indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 5556–5565. External Links: Link, Document Cited by: §I.
  • [7] T. Cieslewski, S. Choudhary, and D. Scaramuzza (2018) Data-efficient decentralized visual SLAM. In 2018 IEEE International Conference on Robotics and Automation, ICRA 2018, Brisbane, Australia, May 21-25, 2018, pp. 2466–2473. External Links: Link, Document Cited by: §II-B, §II-B.
  • [8] S. Dong, K. Xu, Q. Zhou, A. Tagliasacchi, S. Xin, M. Nießner, and B. Chen (2019) Multi-robot collaborative dense scene reconstruction. ACM Trans. Graph. 38 (4), pp. 84:1–84:16. External Links: Link, Document Cited by: §II-B.
  • [9] J. Engel, V. Koltun, and D. Cremers (2018) Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 40 (3), pp. 611–625. External Links: Link, Document Cited by: 4th item, §I, §II-A, §III-D, §IV-A.
  • [10] J. Engel, T. Schöps, and D. Cremers (2014) LSD-SLAM: large-scale direct monocular SLAM. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part II, D. J. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Lecture Notes in Computer Science, Vol. 8690, pp. 834–849. External Links: Link, Document Cited by: §II-A.
  • [11] J. Engel, V. C. Usenko, and D. Cremers (2016) A photometrically calibrated benchmark for monocular visual odometry. CoRR abs/1607.02555. External Links: Link, 1607.02555 Cited by: 4th item, §III-D, §IV-A.
  • [12] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the KITTI dataset. I. J. Robotics Res. 32 (11), pp. 1231–1237. External Links: Link, Document Cited by: 4th item, §IV-B.
  • [13] M. H. Goldwasser (2010) A survey of buffer management policies for packet switches. SIGACT News 41 (1), pp. 100–128. External Links: Link, Document Cited by: §II-C.
  • [14] S. Golodetz, T. Cavallari, N. A. Lord, V. A. Prisacariu, D. W. Murray, and P. H. S. Torr (2018) Collaborative large-scale dense 3d reconstruction with online inter-agent pose optimisation. IEEE Trans. Vis. Comput. Graph. 24 (11), pp. 2895–2905. External Links: Link, Document Cited by: §II-B.
  • [15] M. Grupp (2017) Evo: python package for the evaluation of odometry and slam.. Note: https://github.com/MichaelGrupp/evo Cited by: Fig. 6.
  • [16] J. Hasselgren and T. Akenine-Möller (2006) Efficient depth buffer compression. In Proceedings of the 21st ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware, Vienna, Austria, September 3-4, 2006, pp. 103–110. External Links: Link, Document Cited by: §I.
  • [17] O. Kähler, V. A. Prisacariu, J. P. C. Valentin, and D. W. Murray (2016) Hierarchical voxel block hashing for efficient integration of depth images. IEEE Robotics and Automation Letters 1 (1), pp. 192–197. External Links: Link, Document Cited by: §I.
  • [18] M. Karrer, P. Schmuck, and M. Chli (2018) CVI-SLAM - collaborative visual-inertial SLAM. IEEE Robotics Autom. Lett. 3 (4), pp. 2762–2769. External Links: Link, Document Cited by: §II-B.
  • [19] B. Kim, M. Kaess, L. Fletcher, J. J. Leonard, A. Bachrach, N. Roy, and S. J. Teller (2010) Multiple relative pose graphs for robust cooperative mapping. In IEEE International Conference on Robotics and Automation, ICRA 2010, Anchorage, Alaska, USA, 3-7 May 2010, pp. 3185–3192. External Links: Link, Document Cited by: §II-B.
  • [20] S. Kim and Y. Won (2017) Frame rate control buffer management technique for high-quality real-time video conferencing system. In Advances in Computer Science and Ubiquitous Computing - CSA/CUTE 2017, Taichung, Taiwan, 18-20 December., pp. 777–783. External Links: Link, Document Cited by: §II-C.
  • [21] S. Kim and Y. Ho (2007) Mesh-based depth coding for 3d video using hierarchical decomposition of depth maps. In Proceedings of the International Conference on Image Processing, ICIP 2007, September 16-19, 2007, San Antonio, Texas, USA, pp. 117–120. External Links: Link, Document Cited by: §I.
  • [22] T. Kimura and M. Muraguchi (2017) Buffer management policy based on message rarity for store-carry-forward routing. In 23rd Asia-Pacific Conference on Communications, APCC 2017, Perth, Australia, December 11-13, 2017, pp. 1–6. External Links: Link, Document Cited by: §II-C.
  • [23] A. Krifa, C. Barakat, and T. Spyropoulos (2008) Optimal buffer management policies for delay tolerant networks. In Proceedings of the Fifth Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks, SECON 2008, June 16-20, 2008, Crowne Plaza, San Francisco International Airport, California, USA, pp. 260–268. External Links: Link, Document Cited by: §II-C.
  • [24] M. Labbé and F. Michaud (2019) RTAB-map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation. J. Field Robotics 36 (2), pp. 416–446. External Links: Link, Document Cited by: §II-A.
  • [25] P. Lajoie, B. Ramtoula, Y. Chang, L. Carlone, and G. Beltrame (2020) DOOR-SLAM: distributed, online, and outlier resilient SLAM for robotic teams. IEEE Robotics Autom. Lett. 5 (2), pp. 1656–1663. External Links: Link, Document Cited by: §II-B.
  • [26] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós (2015) ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robotics 31 (5), pp. 1147–1163. External Links: Link, Document Cited by: §II-A, §III-A.
  • [27] R. Mur-Artal and J. D. Tardós (2017) ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robotics 33 (5), pp. 1255–1262. External Links: Link, Document Cited by: §II-A.
  • [28] R. A. Newcombe, S. Lovegrove, and A. J. Davison (2011) DTAM: dense tracking and mapping in real-time. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, D. N. Metaxas, L. Quan, A. Sanfeliu, and L. V. Gool (Eds.), pp. 2320–2327. External Links: Link, Document Cited by: §II-A.
  • [29] D. V. Opdenbosch, M. Oelsch, A. Garcea, T. Aykut, and E. G. Steinbach (2018) Selection and compression of local binary features for remote visual SLAM. In 2018 IEEE International Conference on Robotics and Automation, ICRA 2018, Brisbane, Australia, May 21-25, 2018, pp. 7270–7277. External Links: Link, Document Cited by: §I, §II-B.
  • [30] The Chromium Projects (2019) QUIC, a multiplexed stream transport over UDP. Note: https://www.chromium.org/quic Cited by: §III-A.
  • [31] T. Qin, P. Li, and S. Shen (2018) VINS-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robotics 34 (4), pp. 1004–1020. External Links: Link, Document Cited by: 4th item, §IV-B.
  • [32] T. Qin and S. Shen (2018) Online temporal calibration for monocular visual-inertial systems. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2018, Madrid, Spain, October 1-5, 2018, pp. 3662–3669. External Links: Link, Document Cited by: §IV-B.
  • [33] M. Quigley, B. Gerkey, K. Conley, J. Faust, T. Foote, J. Leibs, E. Berger, R. Wheeler, and A. Ng (2009) ROS: an open-source robot operating system. In IEEE International Conference on Robotics and Automation, ICRA, Workshop on Open Source Software, Kobe, Japan, Cited by: 3rd item, §I, §III-C.
  • [34] L. Riazuelo, J. Civera, and J. M. M. Montiel (2014) Ctam: A cloud framework for cooperative tracking and mapping. Robotics Auton. Syst. 62 (4), pp. 401–413. External Links: Link, Document Cited by: §II-B.
  • [35] ROS (2019) Ros_comm project. Note: https://github.com/ros/ros_comm Cited by: §III-C.
  • [36] E. Rublee, V. Rabaud, K. Konolige, and G. R. Bradski (2011) ORB: an efficient alternative to SIFT or SURF. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, D. N. Metaxas, L. Quan, A. Sanfeliu, and L. V. Gool (Eds.), pp. 2564–2571. External Links: Link, Document Cited by: §II-A, §III-A.
  • [37] S. Saeedi, M. Trentini, M. Seto, and H. Li (2016) Multiple-robot simultaneous localization and mapping: A review. J. Field Robotics 33 (1), pp. 3–46. External Links: Link, Document Cited by: §I.
  • [38] G. Scalosub (2017) Towards optimal buffer management for streams with packet dependencies. Computer Networks 129, pp. 207–214. External Links: Link, Document Cited by: §II-C.
  • [39] T. Stockhammer (2011) Dynamic adaptive streaming over HTTP -: standards and design principles. In Proceedings of the Second Annual ACM SIGMM Conference on Multimedia Systems, MMSys 2011, Santa Clara, CA, USA, February 23-25, 2011, A. C. Begen and K. Mayer-Patel (Eds.), pp. 133–144. External Links: Link, Document Cited by: §II-C.
  • [40] J. van der Hooft, S. Petrangeli, T. Wauters, R. Huysegems, P. Rondao-Alface, T. Bostoen, and F. D. Turck (2016) HTTP/2-based adaptive streaming of HEVC video over 4g/lte networks. IEEE Communications Letters 20 (11), pp. 2177–2180. External Links: Link, Document Cited by: 4th item, §I, §II-B, §IV-B.
  • [41] T. Whelan, M. Kaess, H. Johannsson, M. F. Fallon, J. J. Leonard, and J. McDonald (2015) Real-time large-scale dense RGB-D SLAM with volumetric fusion. I. J. Robotics Res. 34 (4-5), pp. 598–626. External Links: Link, Document Cited by: §I.
  • [42] L. Yang, W. S. Wong, and M. H. Hajiesmaili (2017) An optimal randomized online algorithm for qos buffer management. POMACS 1 (2), pp. 36:1–36:26. External Links: Link, Document Cited by: §II-C.
  • [43] H. Zhang, G. Wang, Z. Lei, and J. Hwang (2019) Eye in the sky: drone-based object tracking and 3d localization. In Proceedings of the 27th ACM International Conference on Multimedia, MM 2019, Nice, France, October 21-25, 2019, L. Amsaleg, B. Huet, M. A. Larson, G. Gravier, H. Hung, C. Ngo, and W. T. Ooi (Eds.), pp. 899–907. External Links: Link, Document Cited by: §I.
  • [44] D. Zou, P. Tan, and W. Yu (2019) Collaborative visual SLAM for multiple agents: A brief survey. Virtual Real. Intell. Hardw. 1 (5), pp. 461–482. External Links: Link, Document Cited by: §I.