I Introduction
In the field of artificial intelligence, autonomous vehicles are being studied intensively in both academia and industry. In recent years, a large and fast-growing number of companies and universities have set up teams dedicated to self-driving and have launched their own autonomous vehicles. Self-driving can significantly reduce the stress on drivers and improve road safety. Self-driving technology can be divided into three parts: environmental perception, planning and decision making, and execution control, and perception is the basis for autonomous systems. Environmental perception mainly includes visual perception and radar perception, of which visual perception is currently the more widely used.
As a branch of self-driving, automatic parking requires correct information about lane markings and parking slots during the perception process. Compared with front-view images, panoramic images capture the surroundings of the car completely; they are therefore more suitable for low-speed scenarios such as automatic parking.
In recent years, deep learning [1] has made significant breakthroughs in image processing. Traditional visual algorithms require hand-designed rules for extracting features based on prior knowledge, and these rules are complex and hard to make adaptive and robust. Deep learning, in contrast, automatically learns features from sample data during training and achieves much better results as long as the sample data is sufficient. Visual perception in self-driving is essentially image processing, and because of the high accuracy and strong robustness of deep learning, applying it to visual perception has become the current trend. A large number of traffic scene datasets [2][3] and corresponding deep-learning-based methods for object detection [4][5] and semantic segmentation [6][7] have been proposed and proved to be useful. Therefore, we use a deep-learning semantic segmentation method to segment and classify lane markings and parking slots in panoramic images. To improve the accuracy of the results, we propose DFNet for semantic segmentation and make two main contributions:
- One is dynamic loss weights. When computing the loss, we assign a weight to each class, calculated from the pixel count of that class, to overcome the imbalance between classes.
- The other is the residual fusion block (RFB). It is used to refine the segmentation areas and reduce the misclassification of pixels at the boundary between two areas.
II Related Work
In 2006, K. Kato et al. [8] first proposed a panoramic parking system, which can effectively eliminate visual blind spots and thus improve the efficiency and safety of parking. Y. C. Liu et al. [9] presented a driving assistance system that provides a bird's-eye-view image of the vehicle's surroundings using six fisheye cameras mounted around the vehicle. Because of the complete field of view, various vision methods for lane marking and parking slot detection on panoramic images have been proposed. C. Wang [10] extracted parking slots with a Radon-transform-based method; J. K. Suhr and H. G. Jung [11] detected parking slots by exploiting their hierarchical tree structure and combining sequential detection results; H. H. Chi and L. Y. Hsu [12] detected lane markings with a line detection method. For detection on panoramic images, traditional image processing methods are currently dominant; their accuracy is closely tied to the structure of the markings and is strongly affected by noise in the images, such as shadows or blurred structures.
Deep learning has achieved far more accurate results than traditional methods in image processing. There are several works on lane marking and parking slot detection using deep learning on front-view images. J. Kim and M. Lee [13] presented a robust lane detection method based on a convolutional neural network combined with the random sample consensus algorithm; S. Lee et al. [14] proposed a unified end-to-end trainable multi-task network that jointly handles lane and road marking detection and recognition guided by a vanishing point; G. Amato et al. [15] proposed a decentralized and efficient solution for visual parking slot occupancy detection based on a deep convolutional neural network. Traditional image processing methods on panoramic images and deep learning methods on front-view images both achieve excellent results, but little work has aimed at applying deep learning to panoramic images. This is because the accuracy of a deep learning model depends largely on the dataset, and there are few public panoramic image datasets that can be used to train a model. The first public panoramic dataset for lane markings and parking slots is the panoramic surround view (PSV) dataset, released by Yan Wu et al. [16]. This dataset is specially designed for semantic segmentation: it is labeled pixel by pixel, and each pixel has its corresponding class. To combine the advantages of panoramic images and deep learning, we use a semantic segmentation method with a convolutional network to segment and classify lane markings and parking slots on the PSV dataset.
Semantic segmentation is the natural step toward fine-grained inference; its goal is to make dense predictions, inferring a label for every pixel [17]. The most successful model for semantic segmentation is the fully convolutional network (FCN) by Long et al. [18], the first end-to-end semantic segmentation model, realized by enlarging the feature maps to the same size as the input image. Since then, network models have been designed end-to-end, which has the advantage that they do not need post-processing. There are three main ways to enlarge the feature maps: deconvolution [19], unpooling, and bilinear interpolation. FCN uses deconvolution, the reverse process of convolution. SegNet [6] and ENet [20] use unpooling, which requires the position indices of the pooling mask from the corresponding pooling layer. To improve accuracy, more complex models have been proposed. With hundreds of layers, ResNet [21] and DenseNet [22] have become the most common backbone models in convolutional networks. FCCN [23] added soft cost-function weights for different target objects; HFCN [24] proposed a highly fused convolutional network with multiple soft cost functions; RefineNet [25] trained the model with multiple scales of input images; F. Yu [26] proposed dilated convolution to enlarge the receptive field without increasing the number of parameters; PSPNet [7] presented a pyramid pooling module; GCN [27] applied large kernels and residual-based boundary refinement blocks. In [16], the authors segmented lane markings and parking slots on the PSV dataset with a semantic segmentation method and proposed a VH-stage module specialized for linear structures, but the size of their model is too large to meet the requirements of embedded and mobile platforms. In this paper, we put forward a smaller model with higher accuracy, and the two main improvements we propose are shown to be significant.
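For illustration, the three enlargement mechanisms mentioned above can be sketched in PyTorch (the framework used in Section IV); the tensor sizes and layer hyperparameters below are arbitrary example values, not the settings of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 75, 75)  # an example feature map (batch, channels, H, W)

# 1) Deconvolution (transposed convolution), as used in FCN: learnable upsampling.
deconv = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1)
up1 = deconv(x)                                       # -> (1, 64, 150, 150)

# 2) Unpooling, as used in SegNet/ENet: reuses the indices saved by max pooling.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
pooled, indices = pool(x)                             # -> (1, 64, 37, 37) plus pooling indices
up2 = unpool(pooled, indices, output_size=x.size())   # -> (1, 64, 75, 75)

# 3) Bilinear interpolation: parameter-free resizing to an arbitrary size.
up3 = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
```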
III Methods
III-A Network
The proposed model, dense fusion network (DFNet), is illustrated in Fig. 1. DFNet is adapted from PSPNet [7], which was the state-of-the-art model for semantic segmentation for a long time. DFNet can be divided into three parts: the basic module, the feature extraction module, and the refinement module. For the basic module, we use DenseNet [22] as the basic network. Compared with the ResNet [21] used in many semantic segmentation models, DenseNet has a smaller model size and faster training speed but similar accuracy. For the feature extraction module, we use the pyramid pooling module proposed by PSPNet, followed by convolutional layers and an upsampling layer using bilinear interpolation. After these two modules, the feature maps are enlarged to the same size as the input image. However, when the enlargement factor is large, this introduces noise and makes the pixels at the boundary between two areas difficult to classify. Therefore, we add a refinement module at the end of the model to refine the segmentation areas. For the refinement module, we propose the residual fusion block (RFB), a combination of convolutional layers and pooling layers. The RFB refines the segmentation area of each class and reduces the influence of the noise caused by the enlargement layers. It mainly targets the classification of pixels at the boundary between two areas, because these pixels are relatively difficult to classify; the RFB reduces erroneous predictions for these pixels and thereby improves accuracy. The main idea of the RFB, dividing the feature maps into two paths, is similar to the residual block. One path consists of convolutional layers or pooling layers, while the other applies no processing. Finally, we fuse the feature maps of the two paths by averaging or multiplying. After processing with convolutional or pooling layers, the values of the points in the feature maps change slightly, and the degree of change differs between areas: the closer to a boundary, the greater the change. By fusing the feature maps of the two paths, the values of points with a greater difference are corrected. We experimented with several structures, all displayed in Fig. 2; their configurations and effects are described in detail in part C of Section IV.
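To make the idea concrete, a minimal PyTorch sketch of such a block is given below. The class name, the single-convolution path in the usage line, and the use of per-class score maps as input are our own illustrative assumptions, not the exact DFNet layers; the candidate configurations actually used are compared in Section IV-C.

```python
import torch
import torch.nn as nn

class ResidualFusionBlock(nn.Module):
    """Illustrative RFB: a processed path and an untouched identity path,
    fused by averaging ('avg') or element-wise multiplication ('mul')."""
    def __init__(self, layers, fusion="mul"):
        super().__init__()
        assert fusion in ("avg", "mul")
        self.fusion = fusion
        # All layers are assumed to preserve the spatial size (stride 1, suitable padding),
        # so the two paths can be fused point by point.
        self.path = nn.Sequential(*layers)

    def forward(self, x):
        y = self.path(x)
        if self.fusion == "avg":
            return (x + y) / 2   # average the two paths
        return x * y             # multiply the two paths

# Minimal usage: refine 6 per-class score maps with a single 3x3 convolution path.
block = ResidualFusionBlock([nn.Conv2d(6, 6, kernel_size=3, padding=1)], fusion="mul")
out = block(torch.rand(1, 6, 600, 600))
```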

III-B Dynamic Loss Weights
In the process of training a convolutional neural network, the network weights are adjusted according to the error computed by the loss function. However, because the number of pixels per class varies from class to class, the influence of each class on the loss is different: the more pixels a class has, the larger its impact on the loss. To overcome this imbalance between classes, we assign a weight to each class when calculating the loss. In SegNet [6], the weights are computed from the whole training set. But the network weights are adjusted after each training iteration, and the pixel counts of the classes in a batch may differ considerably from those of the whole training set. Therefore, in each iteration we calculate the weights from the current input batch, so the weights differ from iteration to iteration. The weight calculation is shown in Eq. (1):

$w_c = \min\!\bigl(\max\!\bigl(\tfrac{N}{C\,n_c},\, t_l\bigr),\, t_h\bigr)$ for $n_c > 0$, and $w_c = 1$ for $n_c = 0$.  (1)

In the formula, $w_c$ is the weight of class $c$, $C$ is the number of classes, and $c$ ranges from 0 to $C-1$. $t_l$ and $t_h$ are the lower and upper thresholds of $w_c$; we set these thresholds to avoid excessive weight differences. $N$ is the total pixel count of the batch and $n_c$ is the pixel count of class $c$; when $n_c = 0$, i.e. the class does not appear in this batch, we set its weight to 1. Because we need to increase the effect of classes with few pixels on the loss, the smaller $n_c$ is, the larger $w_c$ becomes. $N$ and $C$ are fixed for a given batch, so $w_c$ changes only with $n_c$; when $n_c$ equals the average pixel count $N/C$, $w_c$ equals 1, so the coefficient $1/C$ also decreases $w_c$ for classes with many pixels. The loss function is shown in Eq. (2), where $y_{ij}$ is the label at pixel $(i, j)$, $p_{ij}(y_{ij})$ is the predicted probability of that class at pixel $(i, j)$, and $w_{y_{ij}}$ is the corresponding loss weight:

$L = -\sum_{i,j} w_{y_{ij}} \log p_{ij}(y_{ij})$  (2)
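A minimal PyTorch sketch of this per-batch weighting is given below. The function names are illustrative and the weight formula follows our reading of Eq. (1) with the thresholds exposed as arguments; it is a sketch of the idea, not the exact DFNet implementation.

```python
import torch
import torch.nn.functional as F

def dynamic_class_weights(labels, num_classes, t_low=0.1, t_high=5.0):
    """Per-batch class weights: inversely proportional to the class pixel count,
    clipped to [t_low, t_high]; classes absent from the batch get weight 1."""
    n_total = labels.numel()                                                   # N: pixels in this batch
    counts = torch.bincount(labels.flatten(), minlength=num_classes).float()   # n_c per class
    weights = torch.full((num_classes,), 1.0, device=labels.device)
    present = counts > 0
    weights[present] = (n_total / (num_classes * counts[present])).clamp(t_low, t_high)
    return weights

def weighted_ce_loss(logits, labels, num_classes=6):
    """Cross-entropy with dynamic loss weights recomputed for every batch, as in Eq. (2)."""
    w = dynamic_class_weights(labels, num_classes)
    return F.cross_entropy(logits, labels, weight=w)

# Example: logits of shape (B, C, H, W), labels of shape (B, H, W) with values in [0, C-1].
logits = torch.randn(2, 6, 600, 600)
labels = torch.randint(0, 6, (2, 600, 600))
loss = weighted_ce_loss(logits, labels)
```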
IV Experiments
IV-A Experimental Setup
In our experiments, we evaluate our methods on the PSV dataset [16], which was created and released by the Tongji Intelligent Electric Vehicle (TiEV) team. The images were collected at Tongji University in two sizes, 600x600 and 1000x1000. There are 4249 panoramic RGB images in total, with labeled ground truth for 6 object classes: background, parking slots, white solid line, white dashed line, yellow solid line, and yellow dashed line. The training, test, and validation sets contain 2550, 1274, and 425 images respectively. Our experiments are implemented in PyTorch and trained on an NVIDIA GeForce GTX TITAN X graphics card. We crop the input images to a unified size of 600x600. Three metrics are used to evaluate our methods: pixel accuracy (pacc), mean pixel accuracy (mpacc), and mean intersection over union (mIoU).
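For reference, the three metrics can be computed from a per-class confusion matrix, for example as in the short sketch below (our own helper function, not code from the paper).

```python
import numpy as np

def segmentation_metrics(conf):
    """pacc, mpacc and mIoU from a (num_classes x num_classes) confusion matrix,
    where conf[i, j] counts pixels of ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    gt_per_class = conf.sum(axis=1).astype(float)      # ground-truth pixels per class
    pred_per_class = conf.sum(axis=0).astype(float)    # predicted pixels per class
    pacc = tp.sum() / conf.sum()                       # pixel accuracy
    mpacc = np.nanmean(tp / gt_per_class)              # mean per-class pixel accuracy
    iou = tp / (gt_per_class + pred_per_class - tp)    # per-class intersection over union
    miou = np.nanmean(iou)                             # mean IoU
    return pacc, mpacc, miou
```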
The experiments are divided into three steps: first, we train our model with dynamic weights but without the RFB to find the optimum threshold in the weight formula; second, using the threshold determined in the previous step, we compare the candidate RFB structures to find the best one; finally, we change the aux loss to further improve the accuracy.
IV-B Evaluation with Dynamic Loss Weights
There are two thresholds in the weight formula, $t_l$ for the lower bound and $t_h$ for the upper bound. Because only the pixel count of the background class is far above the average, we fix $t_l$ to 0.1 and vary $t_h$ to find the best value. We also train our model without dynamic loss weights as a reference. The results are shown in Table I. From Table I we can see that the dynamic loss weights indeed greatly improve the results. When $t_h$ is too large, the improvement is small: the excessive weight of one class causes too large a difference between classes, which effectively swaps the minority and the majority instead of balancing them. When $t_h$ is within 10, the improvements are more obvious. We obtain the best result when $t_h$ is 5, where mIoU improves by about 7.7% compared with not using dynamic weights. We therefore choose 5 as the value of $t_h$.
| thresholds | pacc (%) | mpacc (%) | mIoU (%) |
|---|---|---|---|
| -, - (no dynamic weights) | 97.67 | 41.77 | 36.06 |
| t_l = 0.1, t_h = 3 | 95.99 | 81.19 | 42.86 |
| t_l = 0.1, t_h = 5 | 95.73 | 84.58 | 43.79 |
| t_l = 0.1, t_h = 7 | 95.32 | 86.10 | 43.31 |
| t_l = 0.1, t_h = 10 | 95.99 | 81.19 | 42.86 |
| t_l = 0.1, t_h = 50 | 93.64 | 88.91 | 38.87 |
IV-C Evaluation with Refinement Block
Six kinds of structures are shown in Fig. 2; the corresponding configurations and results are listed in Table II. All results are based on dynamic weights with $t_h$ set to 5. The structures differ in the combination of layers in the processed path and in the way the feature maps are fused. Because the value of each point in the final feature maps lies between 0 and 1, the values after fusion by averaging or multiplying also lie between 0 and 1. From the table we can see that when a single layer is used, the two fusion ways make little difference, but with multiple layers, multiplying is better than averaging. In addition, when there are few layers or the fusion is done by averaging, the RFB improves the results only slightly. The best result is obtained with structure (f), which consists of two convolutional layers and an average pooling layer and yields a 2.6% improvement in mIoU compared with not using the RFB; a sketch of this structure is given after the table. We therefore choose structure (f) as the RFB.
| | configuration | fusion | pacc (%) | mpacc (%) | mIoU (%) |
|---|---|---|---|---|---|
| - | - | - | 95.73 | 84.58 | 43.79 |
| (a) | Conv, k=3, d=1, p=1, s=1 | avg | 95.97 | 81.23 | 43.57 |
| (b) | Conv, k=3, d=1, p=1, s=1 | mul | 96.44 | 85.03 | 43.52 |
| (c) | Conv, k=3, d=1, p=1, s=1; Conv, k=3, d=2, p=2, s=1 | avg | 96.10 | 82.44 | 43.11 |
| (d) | Conv, k=3, d=1, p=1, s=1; Conv, k=3, d=2, p=2, s=1 | mul | 96.64 | 79.32 | 45.25 |
| (e) | Conv, k=3, d=1, p=1, s=1; Conv, k=3, d=2, p=2, s=1; avg pool, k=3, p=1, s=1 | avg | 96.03 | 85.49 | 43.01 |
| (f) | Conv, k=3, d=1, p=1, s=1; Conv, k=3, d=2, p=2, s=1; avg pool, k=3, p=1, s=1 | mul | 96.27 | 83.99 | 46.36 |
The configurations and results of the different structures. k, d, p, s denote kernel size, dilation, padding, and stride respectively. The first row shows the results without the RFB.
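Reusing the illustrative ResidualFusionBlock sketch from Section III-A, structure (f) would be instantiated roughly as follows; the channel count (the number of per-class score maps being refined) is our assumption for the PSV setting.

```python
import torch.nn as nn

# Structure (f): two convolutions followed by an average pooling layer, fused by multiplication.
# 'channels' is assumed to equal the number of per-class score maps (6 on the PSV dataset).
channels = 6
rfb_f = ResidualFusionBlock(
    layers=[
        nn.Conv2d(channels, channels, kernel_size=3, dilation=1, padding=1, stride=1),
        nn.Conv2d(channels, channels, kernel_size=3, dilation=2, padding=2, stride=1),
        nn.AvgPool2d(kernel_size=3, padding=1, stride=1),
    ],
    fusion="mul",
)
```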
IV-D Evaluation with Aux Loss
We also use an aux loss to assist the training of the network. In the experiments described in parts B and C of this section, the aux loss is computed from the feature maps after the layer3 block of DenseNet. Since the feature maps have to be enlarged to the input image size to compute the aux loss, feature maps that are too small introduce excessive noise. To further improve the results, we instead compute the aux loss after the layer2 block. As shown in Table III, computing it after the layer2 block is better than after the layer3 block, giving a 1.8% improvement in mIoU.
| aux loss position | pacc (%) | mpacc (%) | mIoU (%) |
|---|---|---|---|
| After layer3 block | 96.27 | 83.99 | 46.36 |
| After layer2 block | 96.75 | 83.83 | 48.13 |
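A sketch of how such auxiliary supervision can be wired up in PyTorch is shown below; the module names, the 1x1 classification head, and the auxiliary weight of 0.4 are our own assumptions for illustration, not the exact DFNet configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxHead(nn.Module):
    """Illustrative auxiliary head: classify intermediate features with a 1x1 convolution,
    then enlarge the logits to the input resolution for the auxiliary loss."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.classify = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, feats, out_size):
        logits = self.classify(feats)
        # Features taken after layer2 are larger than those after layer3, so less
        # enlargement (and thus less interpolation noise) is needed here.
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)

def total_loss(main_logits, aux_logits, labels, class_weights, aux_weight=0.4):
    """Main weighted cross-entropy plus a down-weighted auxiliary term (aux_weight is an assumption)."""
    main = F.cross_entropy(main_logits, labels, weight=class_weights)
    aux = F.cross_entropy(aux_logits, labels, weight=class_weights)
    return main + aux_weight * aux
```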
IV-E Result Comparison
Through the step-by-step experiments of the three methods in parts B, C, and D of this section, we achieve improvements of 7.7%, 2.6%, and 1.8% respectively on the main metric mIoU, for a total gain of 12%. We compare our model with the other models evaluated in [16] on the PSV dataset. The detailed results are shown in Table IV: ours achieves the best result on mIoU and the best IoU on the majority of classes.
| model | background | parking slots | white solid | white dashed | yellow solid | yellow dashed | pacc (%) | mIoU (%) | model size |
|---|---|---|---|---|---|---|---|---|---|
| FCN [18] | 85.88 | 13.16 | 18.42 | 7.40 | 23.09 | 20.32 | 86.18 | 28.04 | 500M |
| FCCN [23] | 92.53 | 22.50 | 29.60 | 11.87 | 41.21 | 27.38 | 92.66 | 37.51 | 537M |
| HFCN [24] | 93.87 | 25.46 | 36.26 | 18.97 | 45.08 | 26.87 | 93.97 | 41.09 | 555M |
| VH-HFCN [16] | 96.22 | 36.16 | 39.56 | 21.46 | 47.64 | 38.03 | 96.25 | 46.51 | 544M |
| ours | 96.69 | 38.52 | 41.43 | 34.91 | 40.27 | 36.96 | 96.75 | 48.13 | 147M |

Per-class columns give the IoU (%) of each class.

In addition, in the other models the accuracy on dashed lines is obviously lower than on solid lines, with a considerable gap between classes, whereas in our model the difference is small. Moreover, our model is 3.7 times smaller than VH-HFCN. A smaller model reduces the required memory, and fewer parameters also mean less computation, which improves the processing speed of the network. Some visual examples of the predictions made by the different models are shown in Fig. 3. Our model precisely segments the areas of lane markings and parking slots, is only slightly influenced by background noise, and also correctly identifies the blank areas within dashed lines.
V Conclusions
In this paper, targeting the panoramic images of the PSV dataset, we segment lane markings and parking slots using semantic segmentation. The two components of our method, dynamic loss weights and the residual fusion block, are shown to be very effective. Moreover, our method is not specially designed for linear structures and can also be used in general semantic segmentation tasks. Compared with other models, ours is more accurate and smaller in model size. In future work, we will further improve the accuracy and compress the network model to meet the requirements of embedded and mobile platforms.
References
- [1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
- [2] J. M. Alvarez, M. Salzmann, and N. Barnes. Large-scale semantic co-labeling of image sets. In IEEE Winter Conference on Applications of Computer Vision, pages 501–508, March 2014.
- [3] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
- [4] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence, 39(6):1137–1149, 2017.
- [5] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
- [6] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
- [7] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017.
- [8] Koichi Kato, Makoto Suzuki, Yukio Fujita, and Yuichi Hirama. Image synthesis display method and apparatus for vehicle camera, November 21 2006. US Patent 7,139,412.
- [9] Yu-Chih Liu, Kai-Ying Lin, and Yong-Sheng Chen. Bird's-eye view vision system for vehicle surrounding monitoring. In International Workshop on Robot Vision, pages 207–218. Springer, 2008.
- [10] Chunxiang Wang, Hengrun Zhang, Ming Yang, Xudong Wang, Lei Ye, and Chunzhao Guo. Automatic parking based on a bird’s eye view vision system. Advances in Mechanical Engineering, 6:847406, 2014.
- [11] Jae Kyu Suhr and Ho Gi Jung. Sensor fusion-based vacant parking slot detection and tracking. IEEE Transactions on Intelligent Transportation Systems, 15(1):21–36, 2014.
- [12] Hao-Han Chi and Li-You Hsu. Dynamic lane line detection system and method, November 29 2016. US Patent 9,508,016.
- [13] Jihun Kim and Minho Lee. Robust lane detection based on convolutional neural network and random sample consensus. In International Conference on Neural Information Processing, pages 454–461. Springer, 2014.
- [14] Seokju Lee, Junsik Kim, Jae Shin Yoon, Seunghak Shin, Oleksandr Bailo, Namil Kim, Tae-Hee Lee, Hyun Seok Hong, Seung-Hoon Han, and In So Kweon. Vpgnet: Vanishing point guided network for lane and road marking detection and recognition. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 1965–1973. IEEE, 2017.
- [15] Giuseppe Amato, Fabio Carrara, Fabrizio Falchi, Claudio Gennaro, Carlo Meghini, and Claudio Vairo. Deep learning for decentralized parking lot occupancy detection. Expert Systems with Applications, 72:327–334, 2017.
- [16] Yan Wu, Tao Yang, Wei Jiang, Junqiao Zhao, and Linting Guan. Vh-hfcn based parking slot and lane markings segmentation on panoramic surround view. arXiv preprint arXiv:1804.07027, 2018.
- [17] Alberto Garcia-Garcia, Sergio Orts-Escolano, Sergiu Oprea, Victor Villena-Martinez, and Jose Garcia-Rodriguez. A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857, 2017.
- [18] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
- [19] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
- [20] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
- [21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [22] Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, and Yoshua Bengio. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1175–1183. IEEE, 2017.
- [23] Yan Wu, Tao Yang, Junqiao Zhao, Linting Guan, and Jiqian Li. Fully combined convolutional network with soft cost function for traffic scene parsing. In International Conference on Intelligent Computing, pages 725–731. Springer, 2017.
- [24] Tao Yang, Yan Wu, Junqiao Zhao, and Linting Guan. Semantic segmentation via highly fused convolutional network with multiple soft cost functions. arXiv preprint arXiv:1801.01317, 2018.
- [25] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [26] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
- [27] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters–improve semantic segmentation by global convolutional network. arXiv preprint arXiv:1703.02719, 2017.