I Introduction
2D light detection and ranging (LiDAR) sensors are widely deployed on autonomous vehicles and many kinds of robots because they are robust and inexpensive. A 2D LiDAR sensor can detect the obstacles in the scene and measure their distances. However, as shown in Fig. 1, the scan data are binary once they are transformed into a 2D map, which carries much less information than the output of other sensors such as 3D LiDAR or image/video sensors. Because of this limitation, it is very challenging to perform high-level perception tasks with 2D LiDAR alone, such as predicting the future map of a dynamic scene, which is an important problem for robot navigation and path planning.
To predict the dynamic map, previous methods can be divided into three classes: state based methods, direct methods, and motion flow based methods. State based methods are closely related to the moving object tracking problem. With the help of odometry information, these methods classify the cells in a LiDAR map into two classes, moving and static. The next map can then be predicted by estimating the state of the moving objects. To this end, traditional methods [1, 2, 3, 4, 5, 6] usually divide the pipeline into several steps: detecting separate objects, associating measurements with tracked objects, estimating the state, and predicting the next state of each tracked object. These methods include several hand-designed processes whose parameters must be tuned manually, which limits their applicability.
For direct methods, Ondrúška et al. [7, 8, 9] proposed the DeepTracking method, based on the gated recurrent unit (GRU) [10], to directly predict the occupancy map. DeepTracking formulates the problem as a segmentation problem. For each cell in the map, DeepTracking predicts the probability that it will be occupied at the next moment. In contrast, our method predicts the map by estimating motion flow. As motion flow encodes the velocity of each cell, we obtain the predicted map as well as the motion information of each cell.
As is well known, optical flow [11] encodes the correspondence between adjacent frames and is a powerful tool for processing and understanding video content. However, as shown in Fig. 1, a 2D LiDAR map contains only sparse occupied points of the scene, which leads to inadequate context information and unrepresentative local features. Therefore, estimating the optical flow of 2D LiDAR is a challenging problem. Moreover, as the word optical usually refers to visual sensors, we use the term motion flow instead in this paper. To estimate motion flow, previous works resorted to the Bayesian occupancy filter [12, 13, 14] and the recurrent flow network [15]. The representational ability of these models, however, is usually too weak for this problem, which leads to unsatisfactory performance. Moreover, these methods have many parameters that must be adjusted empirically, which limits their application.
As shown in Fig. 1, we also predict the next map via motion flow. To alleviate the challenge, we resort to a deep learning based method, which is data-driven and can automatically learn representative features from the data. Moreover, as the next frame is unknown, we can only exploit the current and previous frames. To this end, we design a recurrent network based on the gated recurrent unit, named LiDARFlowNet. With the help of the GRU, the spatio-temporal motion information can be effectively captured and encoded. Our LiDARFlowNet simultaneously estimates the backward and forward motion flow between the current map and the unknown next map. A self-supervised strategy is further designed to train the LiDARFlowNet model, so no training data need to be annotated. To alleviate the difficulty of processing binary maps, we propose to filter the maps using Gaussian kernels, which makes the training process more effective. With the backward motion flow, we can easily predict the next dynamic map by warping the current map. Experimental results verify the effectiveness of our LiDARFlowNet as well as the proposed training strategy. The results of the predicted map also show the advantages of our motion flow based method.
Our contributions mainly lie in three aspects:

We propose a recurrent neural network named LiDARFlowNet, which estimates the motion flow between the current 2D LiDAR map and the unknown next map based only on the current map and previous maps.

We design a self-supervised strategy, together with a Gaussian filter, to train LiDARFlowNet effectively. No training data need to be annotated with this strategy.

With the estimated motion flow, we can predict the dynamic LiDAR map at the next moment. The experimental results verify the usefulness and effectiveness of motion flow as well as our proposed LiDARFlowNet.
II Related Work
To predict the dynamic 2D LiDAR map, previous methods can be divided into three classes: state based methods, direct methods, and motion flow based methods. We briefly review these methods below.
State based methods. State based methods are closely related to the moving object tracking problem. These methods usually exploit odometry information to compensate for the motion of the platform, so that the remaining dynamic cells are caused only by moving objects. They first classify the cells into two classes, moving and static. The next map is then predicted by estimating the state of the moving objects. The whole process can be divided into several steps: detection of separate objects, association of measurements with tracked objects, estimation of the state, and prediction of the next state of each tracked object. Following this framework, some methods [4, 5, 6] aim at tackling the general motion of arbitrary types of objects. However, this problem is too complicated for traditional methods, and the performance is unsatisfactory. Other methods [1, 2, 3] consider only a set of presupposed motion patterns for the tracked objects, and therefore improve performance only when the objects actually follow those patterns. Moreover, all of these methods include several hand-designed processes whose parameters must be tuned manually, which limits their applicability.
Direct methods. For direct methods, Ondrúška et al. [7, 8, 9] resorted to a deep learning based method. They formulated the problem as a segmentation problem, namely segmenting the occupied cells from the map. To this end, they designed a recurrent neural network based on the gated recurrent unit (GRU). For each cell, the deep model predicts the probability that it will be occupied at the next moment. To train the model, they took the occupancy map at the next moment as ground truth and minimized a cross entropy loss. The network they used is similar to ours. However, our method predicts the map by estimating motion flow, and the training strategy is also different. As motion flow encodes the velocity of each cell, we obtain the predicted map as well as the motion information of each cell. The estimated motion flow could also be a powerful tool for other tasks.
Motion flow based methods. To estimate the motion flow of 2D LiDAR, Chen et al. [12] proposed the Bayesian occupancy filter. Gindele et al. [13] further extended this method by incorporating prior knowledge. Choi et al. [15] proposed a recurrent flow network. Note that this network differs from what the deep learning community usually means by a recurrent network. It mainly consists of a context layer, which encodes the velocity of each cell in the 2D LiDAR scan map. The velocity can also be regarded as the motion flow. All these methods model the variation of each cell and its local neighbors and then estimate the motion flow. However, the representational ability of these shallow models is usually too weak for such a complex problem, which leads to unsatisfactory performance. Moreover, these methods have many parameters that must be adjusted empirically, which limits their application.
III Overview: Map Prediction Using Motion Flow
In this section, we briefly introduce the pipeline of our method. For each 2D LiDAR scan, we place the LiDAR at the bottom-middle of the map and convert the scan data from a vector to two binary maps, occupancy map and visibility map . A pair of example maps is shown in Fig. 3. In an occupancy map, a cell if it is occupied, and if it is free. In a visibility map, a cell if it is visible, and if it is occluded by occupied cells. In this form, all the information contained in the LiDAR scan is encoded in the 2D maps, which can be processed efficiently by deep learning based methods. Note that we limit the FoV of the LiDAR to .
We denote the backward motion flow between current frame and next frame as , where . Similarly, the forward motion flow is denoted as , where . Moreover, and are not always integers; when they are not, we use bilinear sampling to perform the warping. By this definition, the occupancy map at the next moment can be easily calculated once we have the backward motion flow. The key problem is thus transformed into estimating the motion flow.
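The warping step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the paper. It assumes the convention that the predicted cell at (x, y) is sampled from the current map at (x + u, y + v), and that samples falling outside the map count as free; the function names `bilinear_sample` and `warp_backward` are hypothetical.

```python
import math

def bilinear_sample(grid, x, y):
    """Sample a 2D grid (list of rows) at a real-valued position (x, y).

    Assumption: positions outside the map contribute zero (free space).
    """
    h, w = len(grid), len(grid[0])
    x0, y0 = math.floor(x), math.floor(y)
    ax, ay = x - x0, y - y0
    value = 0.0
    # Blend the four neighbouring cells with bilinear weights.
    for yy, wy in ((y0, 1.0 - ay), (y0 + 1, ay)):
        for xx, wx in ((x0, 1.0 - ax), (x0 + 1, ax)):
            if 0 <= xx < w and 0 <= yy < h:
                value += wx * wy * grid[yy][xx]
    return value

def warp_backward(current, flow_u, flow_v):
    """Predict the next map: next[y][x] = current sampled at (x + u, y + v)."""
    h, w = len(current), len(current[0])
    return [[bilinear_sample(current, x + flow_u[y][x], y + flow_v[y][x])
             for x in range(w)] for y in range(h)]

current = [[0, 0, 0, 0],
           [0, 1, 0, 0],
           [0, 0, 0, 0]]
# An obstacle moving one cell to the right: every target cell looks one cell left.
u = [[-1.0] * 4 for _ in range(3)]
v = [[0.0] * 4 for _ in range(3)]
predicted = warp_backward(current, u, v)

# A non-integer flow spreads the occupied cell over two neighbouring cells.
half_u = [[-0.5] * 4 for _ in range(3)]
blurred = warp_backward(current, half_u, v)
```

With the integer flow the occupied cell moves from column 1 to column 2. With the half-cell flow its mass is split evenly between the two columns, which is why a threshold is needed when the warped map is converted back to a binary map.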
Fig. 2 shows the pipeline of our method. At each moment, we feed the occupancy and visibility maps, along with the hidden states of the GRUs at the previous moment, into the LiDARFlowNet. The hidden states can encode the motion information of each cell. With the LiDARFlowNet, we predict the backward motion flow of the current frame. The next frame is then estimated by warping the current frame according to the motion flow. The mean squared error between the estimated next frame and the real next frame is used as the loss function to train the LiDARFlowNet model. Note that we simultaneously estimate the forward motion flow, which is omitted from the pipeline figure for simplicity.
IV Estimate Motion Flow of 2D LiDAR
In this section, we introduce each step of our method in detail, including the LiDARFlowNet, the self-supervised training strategy, and the Gaussian filter used to facilitate the training process.
IV-A LiDARFlowNet With GRU
To estimate the motion flow using deep neural networks, previous methods [16, 17] usually adopt feed-forward neural networks, i.e. convolutional neural networks. These methods need a pair of frames as input and estimate the motion flow between them via implicit or explicit feature matching. However, in our problem we aim to predict the next frame, which is unknown. Therefore, feed-forward neural networks are not suitable for our problem. On the other hand, recurrent neural networks can encode and exploit the dynamic history information, which can be used to estimate motion flow. Inspired by DeepTracking [9], we believe that recurrent neural networks can estimate the motion flow between the current frame and the unknown next frame from only the current frame and the previous frames.
To this end, we design a recurrent neural network named LiDARFlowNet, which is shown in Fig. 2. Our LiDARFlowNet consists of six layers. The first layer is the input layer, which reads a pair of occupancy and visibility maps and feeds them into the network. The second layer is a convolution layer, which extracts local features from the input. The third, fourth, and fifth layers are gated recurrent layers [10], each of which is a convolutional gated recurrent unit. For each GRU, the output at time can be calculated from its input at the current time and output at the previous time as
(1) 
where is the update gate, is the reset gate, denotes convolution operation, and denotes dot product operation. The details of GRU can be found in [10]. With the help of GRU, the model can encode temporal information of each cell, which can facilitate the motion flow estimation.
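For reference, one common way to write a convolutional GRU, following the gating convention of [10] with matrix products replaced by convolutions, is given below. This is an assumed standard form, not necessarily the exact parameterization used here. The symbols $x_t$ (input), $h_t$ (output), $\tilde{h}_t$ (candidate state), and the weights $W$, $U$ are notation introduced for this illustration, and bias terms are omitted.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z * x_t + U_z * h_{t-1}\right), \\
r_t &= \sigma\left(W_r * x_t + U_r * h_{t-1}\right), \\
\tilde{h}_t &= \tanh\left(W * x_t + U * (r_t \odot h_{t-1})\right), \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t .
\end{aligned}
```

Here $\sigma$ is the logistic sigmoid, $*$ is convolution, and $\odot$ is the element-wise product. Some formulations swap the roles of $z_t$ and $1 - z_t$ in the last line, which is equivalent up to relabeling the gate.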
The last layer is the output layer, which is also a convolution layer and calculates the forward and backward motion flow. As the motion flow of each cell consists of a horizontal and a vertical offset, the output has four channels. The parameters of each layer are listed in Table I. Besides the standard convolution operation, we use dilation [18] to enlarge the receptive field without increasing the computation. With the help of dilation, the model can capture fast-moving objects in the map.
Layer name  Type  Parameters  Output size
input
conv0  Convolution  f: , s: 1, d: 1, p: 1
gru0  GRU  f: , s: 1, d: 1, p: 1
gru1  GRU  f: , s: 1, d: 2, p: 2
gru2  GRU  f: , s: 1, d: 4, p: 4
conv_flow  Convolution  f: , s: 1, d: 1, p: 1
Table I: The structure of our LiDARFlowNet, where f is short for filter size, s for stride, d for dilation, and p for padding.
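The interplay of stride, dilation, and padding in Table I can be checked with the standard convolution size arithmetic. The sketch below is illustrative only: the 3-cell filter size and the 100-cell map side are hypothetical values chosen for the example, not values taken from Table I, and the receptive field counts a single feed-forward pass through the stack, ignoring the recurrence over time.

```python
def conv_output_size(size, kernel, stride=1, dilation=1, padding=0):
    """Output length of a convolution along one axis (standard formula)."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1

def receptive_field(layers, kernel):
    """Receptive field of a stack of stride-1 convolutions with given dilations."""
    field = 1
    for dilation in layers:
        field += dilation * (kernel - 1)
    return field

KERNEL = 3      # hypothetical filter size
MAP_SIDE = 100  # hypothetical map side length in cells

# (layer name, dilation, padding) as listed in Table I; stride is 1 everywhere.
table = [("conv0", 1, 1), ("gru0", 1, 1), ("gru1", 2, 2),
         ("gru2", 4, 4), ("conv_flow", 1, 1)]

sizes = [conv_output_size(MAP_SIDE, KERNEL, stride=1, dilation=d, padding=p)
         for _, d, p in table]
dilated_field = receptive_field([d for _, d, _ in table], KERNEL)
plain_field = receptive_field([1] * len(table), KERNEL)
```

Under these assumptions, choosing the padding equal to the dilation keeps the map size unchanged at every layer, while the dilations of 2 and 4 widen the receptive field from 11 to 19 cells with the same number of filter weights.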
IV-B Self-supervised Training Strategy
Ground truth motion flow is difficult to annotate manually. For this reason, synthetic datasets, e.g. FlyingChairs [16], have been generated and are popular. However, synthetic data are not the same as real data. On the other hand, some self-supervised methods [19, 20, 21, 22, 23] have been proposed to avoid data synthesis and annotation. Inspired by these methods, we propose a similar strategy to train our LiDARFlowNet model. As shown in Fig. 2, at time , the input occupancy map can be warped as according to the estimated backward flow. The warped occupancy map should be similar to the occupancy map at the next time . To evaluate the similarity between them, we use the mean squared error. The LiDARFlowNet model can then be trained by minimizing this error. Moreover, the occupancy map at the next time also can be warped as according to the forward motion flow. As the forward and backward motion flows are related, we formulate their estimation as a multi-task problem and estimate them simultaneously. Performing several related tasks with one multi-task model tends to improve the performance on each task. Then the loss function can be defined as
(2) 
where is the mean squared error function and is the warping function.
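The structure of this loss can be sketched on one-dimensional maps. The snippet is a simplified illustration under stated assumptions: the two warping errors are summed with equal weight, the forward term compares the warped next map with the current map, and linear interpolation stands in for bilinear sampling. The names `warp_1d`, `mse`, and `bidirectional_loss` are hypothetical.

```python
import math

def warp_1d(signal, flow):
    """out[i] = signal sampled at i + flow[i] by linear interpolation.

    Assumption: samples outside the map contribute zero.
    """
    n = len(signal)
    out = []
    for i, f in enumerate(flow):
        pos = i + f
        left = math.floor(pos)
        frac = pos - left
        a = signal[left] if 0 <= left < n else 0.0
        b = signal[left + 1] if 0 <= left + 1 < n else 0.0
        out.append((1.0 - frac) * a + frac * b)
    return out

def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def bidirectional_loss(current, nxt, backward_flow, forward_flow):
    """Backward term: warped current map vs. next map.
    Forward term: warped next map vs. current map (assumed pairing)."""
    loss_backward = mse(warp_1d(current, backward_flow), nxt)
    loss_forward = mse(warp_1d(nxt, forward_flow), current)
    return loss_backward + loss_forward

current = [0.0, 1.0, 0.0, 0.0, 0.0]
nxt = [0.0, 0.0, 1.0, 0.0, 0.0]  # the obstacle moved one cell to the right

perfect = bidirectional_loss(current, nxt, [-1.0] * 5, [1.0] * 5)
zero_flow = bidirectional_loss(current, nxt, [0.0] * 5, [0.0] * 5)
```

No annotated flow appears anywhere in this computation: the only supervision signal is the next map itself, which is what makes the strategy self-supervised.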
Moreover, to train the model, the warping step should be differentiable. This problem has been thoroughly investigated in spatial transformer networks [24]. More specifically, we use bilinear sampling in the forward calculation process.
IV-C Facilitate the Training Process Using Gaussian Filter
The proposed LiDARFlowNet can be trained by minimizing the loss function (2). However, as the occupancy map is very different from an RGB image, the training process is not effective. The problem is that, as shown in Fig. 3 a), a large proportion of cells in the occupancy map are zero, so most gradients will be zero. To illustrate this problem, we take a pair of simple one-dimensional occupancy maps as an example. As shown in Fig. 4 a), the gradient is nonzero only when the motion flow . This means that the gradient will be zero in most instances, which leads to ineffective training.
To alleviate this problem, we propose to filter the occupancy map using a Gaussian kernel before warping. As shown in Fig. 4 b), the Gaussian filter enlarges the range of nonzero gradient. Then the loss function will be
(3) 
where is the Gaussian filter operation and is the filter size. We first set a large filter size , and gradually decrease it until it is along with the training process. When , the loss function (3) is the same as (2).
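The zero-gradient problem and the effect of the Gaussian filter can be reproduced numerically on a pair of one-dimensional maps. The sketch below rests on several assumptions that are not taken from the paper: both the current and the next map are smoothed, the kernel width is tied to the filter size by sigma = size / 4, the filter size of 11 is arbitrary, and the gradient is approximated by a central difference rather than backpropagation.

```python
import math

def gaussian_kernel(size):
    """Normalized 1D Gaussian kernel with an odd size; size 1 is the identity."""
    if size == 1:
        return [1.0]
    sigma = size / 4.0  # illustrative choice
    half = size // 2
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, size):
    """Filter a 1D map with the Gaussian kernel, treating the outside as zero."""
    kernel = gaussian_kernel(size)
    half = len(kernel) // 2
    n = len(signal)
    return [sum(kernel[j + half] * signal[i + j]
                for j in range(-half, half + 1) if 0 <= i + j < n)
            for i in range(n)]

def sample(signal, pos):
    """Linear interpolation of a 1D map at a real-valued position."""
    left = math.floor(pos)
    frac = pos - left
    n = len(signal)
    a = signal[left] if 0 <= left < n else 0.0
    b = signal[left + 1] if 0 <= left + 1 < n else 0.0
    return (1.0 - frac) * a + frac * b

def flow_gradient(current, nxt, cell, flow, eps=1e-4):
    """Central-difference gradient of one cell's squared error w.r.t. its flow."""
    def err(f):
        return (sample(current, cell + f) - nxt[cell]) ** 2
    return (err(flow + eps) - err(flow - eps)) / (2 * eps)

current = [0.0] * 12
nxt = [0.0] * 12
current[2] = 1.0  # obstacle position now
nxt[6] = 1.0      # obstacle position at the next moment (true flow at cell 6: -4)

sharp_grad = flow_gradient(current, nxt, cell=6, flow=0.0)
blurred_grad = flow_gradient(smooth(current, 11), smooth(nxt, 11), cell=6, flow=0.0)
```

On the binary maps the gradient at the initial flow of zero vanishes, because the sampled neighborhood is entirely free. After smoothing, the gradient is positive, so a gradient descent step decreases the flow toward the true value of -4. With a filter size of 1 the kernel is the identity and the maps are left unchanged.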
V Experiments
In this section, we first explain the experimental details of our LiDARFlowNet model. Then, we present the results of LiDAR map prediction to verify the effectiveness of our method as well as the LiDARFlowNet. The results of motion flow estimation are also presented to demonstrate the ability of our LiDARFlowNet.
V-A Experimental Details of LiDARFlowNet
The robotic platform may be static or moving relative to the scene, and the data can be divided into two classes accordingly. The dynamic scenario is clearly more difficult than the static one. We set up a robotic platform to collect both types of data in our indoor office. The robotic platform is equipped with a 2D LiDAR sensor, i.e. SICK TIM561-2050101. The LiDAR scans the scene at fps. For each scenario, we collect about minutes data, i.e. frames, for training and minutes data, i.e. frames, for validation. In the static scenario, as the robotic platform is static relative to the scene, the only moving objects are walking people. In the dynamic scenario, everything moves relative to the robotic platform. Each scan is transformed into an occupancy map and a visibility map. The map size is , and the resolution of each cell is centimeter.
To train the LiDARFlowNet model, we divide the training data into sequences, and each sequence consists of frames. For each sequence, the first frames are fed into the LiDARFlowNet to initialize the hidden state. From the eleventh frame, we take the mean squared error (3) as the loss, and minimize this loss to update the model. For a recurrent network such as LiDARFlowNet, each sequence is one training sample. In our experiment, we set the batch size as , the initial learning rate as . The learning rate is divided by after every epoch. The final model is obtained after epochs training. For the Gaussian filter, we set the initial filter size as , and reduce the filter size by after every epoch until it equals .
V-B LiDAR Map Prediction Results
With the LiDARFlowNet model, we can predict the motion flow of each cell in the scan map of the current frame. The scan map of the next frame is then obtained by warping the current scan map. For each cell in the predicted map, it will be when it is more than a threshold, otherwise . We empirically set the threshold parameter as . To quantitatively evaluate the results, we use the score, which balances precision and recall, and the precision-recall curve.
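The evaluation step can be sketched as follows. The snippet assumes that the score balancing precision and recall is the standard F1 measure, that cells above the threshold are marked occupied (1) and all others free (0), and it uses a hypothetical threshold of 0.5 together with toy values. None of these numbers come from the experiments.

```python
def binarize(scores, threshold):
    """Mark a cell as occupied (1) when its warped value exceeds the threshold."""
    return [[1 if s > threshold else 0 for s in row] for row in scores]

def precision_recall_f1(predicted, truth):
    """Cell-wise precision, recall and F1 for two binary occupancy maps."""
    pairs = [(p, t) for prow, trow in zip(predicted, truth)
             for p, t in zip(prow, trow)]
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

THRESHOLD = 0.5  # hypothetical value for illustration
warped = [[0.9, 0.2],
          [0.6, 0.1]]
truth = [[1, 0],
         [0, 1]]

precision, recall, f1 = precision_recall_f1(binarize(warped, THRESHOLD), truth)
```

Sweeping the threshold over a range of values and recording the resulting precision and recall pairs yields a precision-recall curve of the kind referred to in the text.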
As shown in Table II, the score of our final results are and on the validation data collected by static and dynamic platform respectively. For a static platform, the model only needs to predict the motion of moving objects. For a dynamic platform, however, the model also needs to predict the motion of the robot implicitly, which is more difficult. As a result, our method achieves better results on the static platform than on the dynamic platform. To alleviate the problem on the dynamic platform, one can exploit odometry information after performing spatial calibration and time synchronization between the LiDAR sensor and the odometry. The problem is then transformed into one similar to the static-platform case. However, this may not be the ultimate solution, as odometry information is not always accurate.
To examine the influence of the Gaussian filter on training our LiDARFlowNet, we conduct a comparison experiment. In this experiment, we again train a LiDARFlowNet model but skip the Gaussian filter step, keeping the other steps and parameters unchanged. The trained model is also evaluated on the validation data. As shown in Table II, we can see that the score decreases by about on the dynamic platform, which demonstrates the effectiveness of the Gaussian filter step in our method. At the same time, we notice that the results are almost the same on the static platform. The reason may be that a large number of cells remain unchanged in the map sequences from the static platform, so the zero-gradient problem may be negligible. The precision-recall curve of each method is shown in Fig. 5, and the same conclusion can be drawn from it.
Method  score, static platform  score, dynamic platform
DeepTracking [9]
Our method w/o Gaussian filter
Our final method
We also compare our method with DeepTracking [9], which is among the state-of-the-art methods. For a fair comparison, we reproduce DeepTracking, train it on our training data with the same hyperparameters, and evaluate the model on our validation data. Unlike our method, DeepTracking directly predicts the LiDAR map. We also report the score in Table II. We can see that our method achieves a higher score than DeepTracking. Furthermore, we draw the precision-recall curves in Fig. 5, which lead to the same conclusion.
Finally, we visualize some LiDAR maps predicted by our method and by DeepTracking [9]. To visualize the results, we put the ground truth LiDAR map in the green channel of the visualization image and the predicted result in the red channel. A yellow cell is therefore an occupied cell that is correctly predicted. As shown in Fig. 6, although DeepTracking can predict some occluded objects, it also produces many false positives. With the help of motion flow, our method alleviates this problem.
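The color coding of the visualization can be sketched as below. This is only an illustration of the channel assignment described above, using 8-bit RGB tuples and a hypothetical function name `overlay`. It is not the plotting code behind Fig. 6.

```python
RED, GREEN, YELLOW, BLACK = (255, 0, 0), (0, 255, 0), (255, 255, 0), (0, 0, 0)

def overlay(truth, predicted):
    """Build an RGB image: prediction in the red channel, ground truth in green."""
    return [[(255 * p, 255 * t, 0) for p, t in zip(prow, trow)]
            for prow, trow in zip(predicted, truth)]

truth = [[1, 0],
         [1, 0]]
predicted = [[1, 1],
             [0, 0]]
image = overlay(truth, predicted)
```

In the resulting image, yellow marks a correctly predicted occupied cell, red marks a false positive, green marks an occupied cell that the prediction missed, and black marks a cell that is free in both maps.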
V-C Motion Flow Estimation Results
In this section, we illustrate the predicted motion flow, as it plays a key role in our method. Fig. 7 visualizes some motion flow results. We can see that our model is indeed able to estimate the motion flow of each cell. However, a quantitative evaluation is still an open problem, as it is difficult to collect a 2D LiDAR dataset with motion flow ground truth. We leave this problem as future work. An alternative solution, inspired by synthetic datasets such as FlyingChairs [16], is to synthesize data along with ground truth.
VI Discussion and Conclusion
In this paper, we propose a method to predict the 2D LiDAR map at the next moment using motion flow. This problem is challenging because 2D LiDAR maps are nearly featureless. To alleviate this challenge, we propose to estimate the motion flow of 2D LiDAR with deep neural networks, inspired by their successful application to optical flow estimation in video. To this end, we design a recurrent network based on the gated recurrent unit, named LiDARFlowNet. Our LiDARFlowNet simultaneously estimates the forward and backward motion flow between the current frame and the next frame from only the current frame and past frames. A self-supervised strategy is further designed to train the LiDARFlowNet model effectively, so no training data need to be annotated. With the bidirectional motion flow, it is straightforward to perform some perception tasks, e.g. with the backward motion flow, we can predict the next frame. Experimental results verify the effectiveness of our LiDARFlowNet as well as the proposed training strategy. Moreover, the estimated motion flow can also be used for other tasks, e.g. with the forward motion flow, we can detect the moving objects and separate them from the static background. We leave such applications as future work.
Acknowledgment
This work was supported by grants from the National Basic Research Program of China (2015CB351806), the National Natural Science Foundation of China (61825101) and China Postdoctoral Science Foundation.
References

[1] L. Zhao and C. Thorpe, “Qualitative and quantitative car tracking from a range image sequence,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1998.
[2] K. O. Arras, O. M. Mozos, and W. Burgard, “Using boosted features for the detection of people in 2d range data,” in IEEE International Conference on Robotics and Automation (ICRA), 2007.
[3] A. Petrovskaya and S. Thrun, “Model based vehicle detection and tracking for autonomous urban driving,” Autonomous Robots, vol. 26, no. 2-3, pp. 123–139, 2009.
[4] T.-D. Vu, O. Aycard, and N. Appenrodt, “Online localization and mapping with moving object tracking in dynamic outdoor environments,” in IEEE Intelligent Vehicles Symposium, 2007.
[5] S.-W. Yang and C.-C. Wang, “Simultaneous egomotion estimation, segmentation, and moving object detection,” Journal of Field Robotics, vol. 28, no. 4, pp. 565–588, 2011.
[6] D. Z. Wang, I. Posner, and P. Newman, “Model-free detection and tracking of dynamic objects with 2d lidar,” The International Journal of Robotics Research, vol. 34, no. 7, pp. 1039–1063, 2015.
[7] P. Ondruska and I. Posner, “Deep tracking: Seeing beyond seeing using recurrent neural networks,” in AAAI Conference on Artificial Intelligence (AAAI), 2016.
[8] P. Ondruska, J. Dequaire, D. Z. Wang, and I. Posner, “End-to-end tracking and semantic segmentation using recurrent neural networks,” in Robotics: Science and Systems (RSS), 2016.
[9] J. Dequaire, P. Ondrúška, D. Rao, D. Wang, and I. Posner, “Deep tracking in the wild: End-to-end tracking using recurrent neural networks,” The International Journal of Robotics Research (IJRR), vol. 37, no. 4-5, pp. 492–512, 2018.
[10] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[11] J. J. Gibson, “The perception of the visual world.” 1950.
[12] C. Chen, C. Tay, C. Laugier, and K. Mekhnacha, “Dynamic environment modeling with gridmap: a multiple-object tracking application,” in International Conference on Control, Automation, Robotics and Vision (ICARCV), 2006.
[13] T. Gindele, S. Brechtel, J. Schroder, and R. Dillmann, “Bayesian occupancy grid filter for dynamic environments using prior map knowledge,” in IEEE Intelligent Vehicles Symposium, 2009.
[14] D. Meyer-Delius, M. Beinhofer, and W. Burgard, “Occupancy grid models for robot mapping in changing environments.” in AAAI Conference on Artificial Intelligence (AAAI), 2012.
[15] S. Choi, K. Lee, and S. Oh, “Robust Modeling and Prediction in Dynamic Environments Using Recurrent Flow Networks,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016.
[16] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. V. D. Smagt, D. Cremers, and T. Brox, “FlowNet: Learning optical flow with convolutional networks,” in IEEE International Conference on Computer Vision (ICCV), 2015.
[17] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[18] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs,” arXiv, 2016.
[19] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, “View Synthesis by Appearance Flow,” in European Conference on Computer Vision (ECCV), 2016.
[20] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala, “Video Frame Synthesis using Deep Voxel Flow,” in IEEE International Conference on Computer Vision (ICCV), 2017.
[21] Z. Ren, J. Yan, B. Ni, B. Liu, X. Yang, and H. Zha, “Unsupervised deep learning for optical flow estimation.” in AAAI Conference on Artificial Intelligence (AAAI), vol. 3, 2017, p. 7.
[22] Y. Zhong, Y. Dai, and H. Li, “Self-Supervised Learning for Stereo Matching with Self-Improving Ability,” arXiv, 2017.
[23] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised Monocular Depth Estimation with Left-Right Consistency,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[24] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems (NIPS), 2015.
[25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in Advances in Neural Information Processing Systems (NIPS) Workshop, 2017.