With the rapid growth of intelligent transportation system research, autonomous driving technology has attracted increasing attention. Road segmentation is one of its crucial tasks and a basic building block for autonomous mobility [deepdrive; treml2016speeding]. Various methods have been proposed to find the road area in RGB images [RBNet] or 3D LiDAR point clouds [Luca2017; Chen2017]. However, the colors, textures and shapes of roads vary greatly with illumination, weather and scene type, which keeps road segmentation a challenging task.
Deep learning is a powerful tool for representation learning in basic multimedia tasks, such as image classification [yao2018tip; yao2018ijcai; tkde2019; tmm2018], image search [ijcai2019; shen1; shen2; shen3], image segmentation [Long2015Fully], and scene recognition [YYZ2018; YYZ2019; XGS2017tcsvt; XGS2019; XGS2017ijcv]. The features used in those tasks have a great impact on final performance, and Convolutional Neural Networks (CNNs) have demonstrated that automatic feature learning on massive annotated data surpasses hand-crafted features in many applications. As a result, more and more researchers are exploiting Deep Neural Networks (DNNs) in many fields. Deep convolutional neural networks such as VGG [VGG] and Residual Net [RESNET] are used as encoders, acting as de facto standard feature generators in many applications, and they greatly improve the results of all the tasks mentioned above. Specifically, in the past three years, performance on road segmentation, which is a semantic segmentation problem, has improved dramatically thanks to methods based on variants of the Fully Convolutional Network (FCN) [Long2015Fully]. FCN established a classic encoder-decoder pattern for segmentation using a deep CNN. This leads to an automatic end-to-end feature extraction and segmentation architecture with a very large parameter space that can represent diverse objects in complex ways, making them much easier to classify. Deconvolution layers, which are widely used in neural networks [ZMM2017], are also introduced to handle up-sampling and rebuild the pixel-level label prediction. Many effective semantic segmentation networks were subsequently proposed, bringing substantial performance improvements.
Autonomous cars, however, are equipped with various sensors to enhance their environmental perception. Although image-based road segmentation has been reported to achieve high performance in many tests, road segmentation with a single sensor is not robust enough in complex scenes. These observations motivate us to develop road segmentation solutions based on fusion strategies.
Our work follows the encoder-decoder pattern; however, some drawbacks have to be discussed first:
1) Although there are many successful works and public datasets for purely image-based road segmentation, their weaknesses are obvious: they are insufficient for learning a robust representation of the road area because of limited sample quantity and scene diversity across datasets. At the same time, learning 3D geometric information is difficult, since recovering 3D structure from 2D images remains a challenging problem. To address this issue, LiDAR-based methods [Chen2017; Luca2017] have been proposed in recent years. However, a sparse LiDAR point cloud does not always improve segmentation performance, because sparse LiDAR points are accurate in location but deficient in visual appearance. Therefore, fusing multi-sensor inputs to improve road segmentation performance and maximize sensor utilization is an intuitive idea [Han2016; Schlosser; asvadi]. The fusion pipeline usually requires an alignment procedure: before the image and LiDAR data are fed into the processing system, they are aligned with each other via calibration parameters. Spatial and geometric features embedded in the LiDAR point cloud can then be extracted together with image features by an end-to-end deep neural network. However, since network design is very flexible and there are many possible ways to fuse the data, how to exploit image and LiDAR information within a CNN-based semantic segmentation network remains an open problem.
2) With the progress of deep learning research, the encoder and decoder of the original FCN are no longer good enough, since more computing resources are now available and much deeper networks can be afforded. Besides, as mentioned above, the network model must be able to fuse data from different kinds of sensors. Many existing methods embed their fusion structure in the encoder stage, expecting to fuse information through synergetic feature extraction steps [LiDARCam] or just after a series of side-by-side encoding procedures [Schlosser]. In addition, stage-wise fusion schemes such as cross-fusion [LiDARCam] and siamese fusion [siamese] have been reported recently. These fusion modes are illustrated in Figure 1. As reported, those attempts generally improve performance by fusing features in the encoder, but they make a pre-trained image encoder hardly usable. As a consequence, a very large dataset would be needed to pre-train a fusion model as an alternative that maintains network performance, which is not realistic. Therefore, in our work the encoder is upgraded to a residual network and the decoder is replaced by a more complex model, in which data from different sensors are fused at multiple scales to achieve a better prediction.
Motivated by the drawbacks discussed above, we propose a novel approach with the following contributions:
1. We propose a novel fusion structure in which LiDAR fusion is performed in the decoder instead of the encoder. This design makes it easy to use pre-trained models in the encoder and also produces better label predictions by enhancing the up-sampling.
2. We use pyramid multi-scale re-projection instead of the classic step-by-step pooling method to generate multi-scale LiDAR maps for fusion at each stage, which alleviates degradation during down-sampling.
3. A compact fusion and up-sampling structure is designed to perform segmentation prediction. We conduct a series of experiments on the KITTI ROAD dataset [KITTI], which demonstrate that the performance of our method is competitive with recent state-of-the-art methods.
2 Related Works
Computer vision and pattern recognition research covers many fields, and an important topic is feature extraction and analysis [ywk; ywk2; zwm; ywk3; mmm; aim; huangpu]. In the past decade, deep learning with CNNs has become a very important feature extractor for classification and segmentation problems [neurocom; prl; acmmm]. Since road segmentation is a semantic segmentation task, the following paragraph first reviews traditional ideas on road segmentation. The next paragraph reviews deep neural network based semantic segmentation, as our work is inspired by recent progress in CNN classifier architectures and semantic segmentation methods. Finally, data fusion methods from previous works are listed.
Road segmentation. Before deep learning came into vogue, traditional methods were studied under a probabilistic framework. The most popular idea among these methods is to extract hand-crafted features, perform a pixel-wise prediction, and finally mark the road and non-road areas. For example, Keyu Lu et al. proposed a hierarchical approach for road detection [LUK2014]. They trained a Gaussian mixture model (GMM) to obtain a road probability density map (RPDM) and then divided the images into superpixels. They first selected road superpixels from seeds in a GrowCut framework and then refined the results with a conditional random field (CRF). Liang Chen et al. used only LiDAR point clouds to detect the road [Chen2017]. They resampled the LiDAR point clouds to generate LiDAR imagery and proposed a LiDAR-histogram derived from it. In the LiDAR-histogram representation, the 3D traversable road plane in front of the vehicle is projected as a straight line, and positive and negative obstacles are projected above and below the line respectively. Liang Xiao et al. proposed a hybrid CRF to fuse LiDAR and image data [Xiao2018]. They first extracted features from images and used a boosted decision tree classifier to predict the unary potential. The pairwise potential was then designed as a hybrid model of the contextual consistency within the images and LiDAR point clouds, as well as the cross consistency between them. Finally, they used this CRF to detect the road areas. Those methods share an Achilles' heel: hand-crafted features are difficult to design, yet semantic information requires a large number of features and an efficient way of combining them. To overcome these weaknesses, deep methods were introduced to this field.
Semantic Segmentation. FCN is the first well-known method for end-to-end deep semantic segmentation [Long2015Fully]. Its design follows the encoder-decoder pattern with transposed convolutions and skip layers, and this architecture laid the foundation for later segmentation networks. Meanwhile, SegNet [segnet] uses max-pooling indices in the decoder to up-sample low-resolution feature maps, which retains high-frequency details in the segmented images and reduces the total number of trainable parameters in the decoder. To make strong use of data augmentation on the available annotated samples, UNet [unet] develops an architecture that consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. The DeepLab series has evolved to DeepLabV3+ [deeplab], which extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. Some works achieve very good results by combining FCN with a Conditional Random Field (CRF) using recurrent networks [sisdcncrf2015; CRFasRNN]. Guosheng Lin et al. presented a generic multi-path refinement network called RefineNet, in which the information available along the down-sampling process is explicitly exploited to enable high-resolution prediction using long-range residual connections [refinenet]. The most important component of RefineNet is chained residual pooling (CRP). The key idea of RefineNet is that deeper layers, which capture high-level semantic features, can be directly refined using fine-grained features from earlier convolutions. Experiments have shown that it achieved state-of-the-art performance on many semantic segmentation datasets. The network is flexible, since many configurations are possible after the encoder. Our work is based on the principle of RefineNet, with a fusion module added to the CRP module. Figure 2 illustrates RefineNet.
Data Fusion for DNN based road segmentation. Schlosser et al. discussed different fusion strategies, namely early fusion and late fusion [Schlosser]. Last year, Caltagirone et al. proposed a new fusion architecture named cross-fusion, which learns directly from data where to integrate information by using trainable cross connections between the LiDAR and camera processing branches [LiDARCam]. A siamese fusion method has also been proposed to fuse LiDAR and camera information for road segmentation [siamese]. For road segmentation with camera and LiDAR fusion, the only suitable dataset and benchmark, to our knowledge, is the KITTI ROAD dataset [KITTI]. This dataset provides data from various sensors collected on a real car with careful calibration. Because of this limitation, we test our method on the KITTI benchmark.
3 Data Fusion for Road segmentation
3.1 Architecture and Work Flow
Our approach follows the encoder-decoder design pattern. The encoder module generates feature maps at different scales from the image input. The decoder module predicts the segmentation result by gradually up-sampling the feature maps while fusing LiDAR information. A summary of our approach is depicted on the left side of Figure 3.
Encoder. The encoder backbone of our network is a residual network (ResNet-50 or ResNet-101 in our experiments), which down-samples the image while expanding the number of feature channels. Other backbones are, in fact, also suitable as encoders in our task. The key requirement for the encoder is that multi-scale feature maps are produced during the down-sampling procedure. Taking different encoder structures into consideration, we use the consecutive down-sampled feature maps at successive fractions of the original scale. Each of these feature maps is used to refine the up-sampling prediction. In our network, we use a model pre-trained on ImageNet [imagenet] in the encoder stage, since the encoder is only in charge of generating image feature maps.
Decoder. Our decoder is inspired by multi-path refinement networks [refinenet], which are optimized for high-resolution image segmentation. We make a major revision to the RefineNet block by embedding a fusion step that processes LiDAR information. The original RefineNet block contains two additional modules, the residual convolutional unit (RCU) and chained residual pooling (CRP). In RefineNet, RCU and CRP borrow the idea of residual blocks, in which gradients can be propagated directly, thus reducing the training difficulty. We discard the RCU and replace it with a LiDAR processing block that generates a LiDAR feature map and fuses it with image information. Additionally, the CRP module is slightly modified to reduce the number of parameters. Our design aims at embedding a LiDAR fusion module without enlarging the network too much.
Fusion. There are two kinds of fusion inside our architecture. The first is performed among image feature maps at multiple scales, with the objective of refining image up-sampling with more details. This technique has proven effective in many semantic segmentation networks, such as the skip layers of FCN and the similar structure in UNet. The second takes place between LiDAR feature maps and image feature maps. We designed this sensor-level fusion structure to combine image and LiDAR information at the same scale. The details of the fusion are discussed in Section 3.2.
3.2 Fuse Image and LiDAR information in Decoder
Fusing image and LiDAR data has usually been treated as a synergetic feature extraction problem in previous research. Generally, there are three ways of fusing image and LiDAR data: early fusion, late fusion and stage-wise fusion. Early fusion preprocesses the image and LiDAR data to create a high-dimensional input, which is then fed to an encoder. It is quite easy to implement but does not always make sense because the inputs are unbalanced, and a well pre-trained model is hardly usable. In contrast, late fusion uses a separate encoder for each input source and joins their outputs at the end. In this way, a pre-trained model can be used for each branch; however, developers have to adjust the fusion stages manually, and it is difficult to reduce the network size. Stage-wise fusion was then put forward. It performs a fusion step at each stage of the network (usually at the end of each scale) and passes the fused block to the next stage. Stage-wise fusion forces the image and LiDAR information to fuse at each stage, which fixes the problems of both early and late fusion, but pre-trained models still cannot be used for any input source. These three ways of fusion are illustrated in Figure 1.
Synergetic feature extraction in the encoder is an intuitive idea, but it still has some drawbacks. A CNN extracts features automatically, but it does not guarantee a balance between LiDAR and image usage during training. From another point of view, massive training data are needed to help the encoder find an appropriate combination of image and LiDAR features, so that features from both input sources are used effectively. Unfortunately, training data that contain both color images and carefully calibrated LiDAR point clouds are currently scarce, so overfitting is hard to avoid. Motivated by the above discussion, we adopt a practical way around these issues, in which synergetic feature extraction is avoided and loading a model pre-trained on a large-scale image dataset, instead of training from scratch, becomes possible.
Taking all of the above into consideration, our proposed network fuses LiDAR in the decoding stage instead of blending it with images in the encoding stage (see the decoder in the dashed bounding box, Figure 3(a)). In our network, LiDAR features are extracted and fused only to refine the score maps. At the same time, a pre-trained model can be used for the encoder and fine-tuned. The detailed structure is depicted in Figure 4.
3.3 Lidar Map generation via multi-scale reprojection
To align the image content with the LiDAR point cloud, we need to generate the so-called LiDAR image or LiDAR map. The first step of LiDAR image generation is 3D point projection, which requires a set of calibration parameters between the camera and the LiDAR device. We assume the sensors are well calibrated in advance, so that the projection matrix, including the rotation and translation parameters, is already known. In practice, each frame of LiDAR data usually consists of more than 100K 3D points, each with three location parameters and an intensity value. Only part of those points can be projected onto the RGB image plane. Let us denote a LiDAR point as $P$ and its projection on the RGB image plane, in homogeneous coordinates, as $p$. The rotation matrix is $R$ and the translation vector is $T$. The intrinsic parameter matrix of the camera is $K$. Then the projection can be formulated as follows:

$$ p \simeq K \, (R P + T) $$

The projected LiDAR points are very sparse in the image. Usually, the sparse LiDAR image is interpolated to a dense one with a bilateral filter [bf], and pooling operations then generate multi-scale LiDAR maps. However, we do not think interpolation is essential in this task. We introduce a scale factor $s$ that lets us reproject the LiDAR points at each image scale through pseudo intrinsic parameters $K_s$, obtained by scaling the focal lengths and principal point of $K$ by $s$. The multi-scale reprojection formula is therefore:

$$ p_s \simeq K_s \, (R P + T), \qquad K_s = \mathrm{diag}(s, s, 1) \, K $$
Using the LiDAR image in the decoder is quite different from using it in the encoder. In the decoder, the work flow runs from small scale to large scale. Owing to the projection property, the LiDAR image is dense at small scales, where it is fused with pixels that carry strong semantic descriptions. Compared with arbitrary down-sampling, our reprojection strategy preserves accurate geometric information at each scale.
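The projection and multi-scale reprojection described in this subsection can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `project_points` and `lidar_map` are hypothetical, the pseudo intrinsics are assumed to be the intrinsic matrix with its first two rows multiplied by the scale factor, and the map is assumed to store depth (the text does not say which LiDAR value is written into the map).

```python
import math


def project_points(points, R, T, K, scale=1.0):
    """Project 3D LiDAR points onto the image plane at a given scale.

    points: iterable of (x, y, z, ...) tuples in LiDAR coordinates.
    R: 3x3 rotation (nested lists), T: length-3 translation, K: 3x3 intrinsics.
    Returns a list of (u, v, depth) for points in front of the camera.
    """
    # Assumed form of the pseudo intrinsics: scale focal lengths and
    # principal point (first two rows of K) by the scale factor.
    Ks = [[K[0][j] * scale for j in range(3)],
          [K[1][j] * scale for j in range(3)],
          list(K[2])]
    projected = []
    for p in points:
        # Rigid transform from LiDAR to camera coordinates: R * P + T.
        cam = [sum(R[i][j] * p[j] for j in range(3)) + T[i] for i in range(3)]
        if cam[2] <= 0.0:
            continue  # behind the camera, cannot be projected
        uvw = [sum(Ks[i][j] * cam[j] for j in range(3)) for i in range(3)]
        projected.append((uvw[0] / uvw[2], uvw[1] / uvw[2], cam[2]))
    return projected


def lidar_map(points, R, T, K, scale, height, width):
    """Rasterize reprojected points into a sparse height x width depth map."""
    grid = [[0.0] * width for _ in range(height)]
    for u, v, depth in project_points(points, R, T, K, scale):
        row, col = math.floor(v), math.floor(u)
        if 0 <= row < height and 0 <= col < width:
            # Assumption: when several points hit one cell, keep the nearest.
            if grid[row][col] == 0.0 or depth < grid[row][col]:
                grid[row][col] = depth
    return grid
```

Calling `lidar_map` once per scale with the same point cloud reprojects the original 3D points each time, rather than pooling a full-resolution map, which is the distinction the subsection draws.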
4.1 KITTI Dataset and Data Augmentation
We use the KITTI ROAD dataset [KITTI] to evaluate the performance of our proposed method. The dataset contains 579 frames of color images and LiDAR point clouds in total, together with their corresponding calibration data. Of these, 289 frames are used as training data and the rest as testing data. All training and testing data are divided into 3 categories: UM (urban marked), UMM (urban multiple marked lanes) and UU (urban unmarked).
Since the pictures in the KITTI ROAD dataset are not all the same size, we perform a preprocessing step following [yao2016icme; YYZ2017]. We resize the color images and ground truth images to 384 by 1248, and apply random horizontal flips with a probability of 0.5. Before the images are fed to the network, the mean value of each channel is subtracted.
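The flip and mean-subtraction steps can be illustrated with a small, dependency-free sketch. It assumes images are nested lists of shape height x width x channels and labels are height x width; the helper names and the mean values passed in are hypothetical, and resizing is left out because the interpolation method is not specified in the text.

```python
import random


def hflip(image):
    """Flip a nested-list image (rows of pixels) left to right."""
    return [row[::-1] for row in image]


def augment(image, label, rng, p=0.5):
    """Flip image and ground truth together with probability p."""
    if rng.random() < p:
        return hflip(image), hflip(label)
    return image, label


def subtract_mean(image, mean):
    """Subtract a per-channel mean from every pixel of an H x W x C image."""
    return [[[px[c] - mean[c] for c in range(len(mean))] for px in row]
            for row in image]


if __name__ == "__main__":
    rng = random.Random(0)
    img = [[[10, 20, 30], [40, 50, 60]]]   # 1 x 2 x 3 toy image
    lab = [[0, 1]]                          # 1 x 2 toy ground truth
    img, lab = augment(img, lab, rng)
    print(subtract_mean(img, mean=[1, 2, 3]), lab)
```

The image and its ground truth share one random draw so that they stay aligned after the flip.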
4.2 Implementation Details
In the experiments, the training batch size is set to 4. The reprojected LiDAR point maps at the 3 scales have sizes 24*78, 48*156 and 96*312. In the first refined fusion unit (RFU), the input color image score map and the input LiDAR point map are both 24*78. Likewise, in the second and third units, the input image score maps and the input LiDAR point maps are 48*156 and 96*312 respectively.
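As a quick arithmetic check, the three map sizes correspond to 1/16, 1/8 and 1/4 of the 384 by 1248 input. The strides below are inferred from the reported sizes rather than stated in the text, and the function name is hypothetical.

```python
def rfu_map_sizes(height, width, strides=(16, 8, 4)):
    """Return the (height, width) of the maps fed to each fusion unit.

    The strides are an inference from the reported sizes, not a value
    given explicitly in the text.
    """
    return [(height // s, width // s) for s in strides]


if __name__ == "__main__":
    for size in rfu_map_sizes(384, 1248):
        print(size)
```

Under the same inference, the scale factors used for LiDAR reprojection would be the reciprocals of these strides.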
At the beginning, the learning rate is set to 5e-4 for the encoder and 5e-3 for the decoder, and the momentum and weight decay for both encoder and decoder are 0.9 and 1e-5. The total number of epochs is set to 2000. The optimizer is SGD and the loss function is cross entropy. Before training, ResNet50 or ResNet101 loads an ImageNet pre-trained model. We discard trainable transposed convolutional layers for up-sampling and use bilinear interpolation instead. The decoding procedure stops at 1/4 of the original image size, and a 4x interpolation is appended to recover the prediction size.
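To make the role of the two learning rates concrete, the following sketch applies one SGD update with momentum and weight decay to a single scalar parameter. The update rule shown is the formulation commonly used by deep learning frameworks and is an assumption here, since the text only names SGD; `sgd_step` and `LEARNING_RATES` are illustrative names.

```python
# Initial learning rates reported in the text for the two parameter groups.
LEARNING_RATES = {"encoder": 5e-4, "decoder": 5e-3}


def sgd_step(weight, grad, velocity, lr, momentum=0.9, weight_decay=1e-5):
    """One SGD update with momentum and L2 weight decay (assumed form)."""
    grad = grad + weight_decay * weight      # weight decay as an L2 penalty
    velocity = momentum * velocity + grad    # accumulate momentum
    weight = weight - lr * velocity          # gradient descent step
    return weight, velocity


if __name__ == "__main__":
    for group, lr in LEARNING_RATES.items():
        w, v = sgd_step(weight=1.0, grad=0.5, velocity=0.0, lr=lr)
        print(group, w, v)
```

With the same gradient, the pre-trained encoder parameter moves ten times less than the decoder parameter, which matches the intent of fine-tuning the encoder while training the decoder from scratch.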
4.3 Data Fusion at Different Scales
To verify our proposed refined fusion unit (RFU), we trained several different networks on the KITTI ROAD dataset. Since the KITTI benchmark does not provide ground truth for the testing data, we split the training set into two subsets for training and validation. The training subset contains 240 frames of the training set, and the rest form the validation subset. The training subset is then used to train several networks, all based on our proposed encoder-decoder architecture. They differ in the number of RFUs used: the networks use 1, 2 and 3 RFUs at multiple scales. We evaluate their performance on the validation subset with 2 major metrics (IoU and accuracy). Table 1 shows all the metrics under the different configurations, using ResNet-50 and ResNet-101 independently. The formulas of each metric are listed in the last section.
Table 1. Configurations compared: ResNet-50 and ResNet-101 encoders, each with 1, 2 and 3 RFUs.
From this table we can see that fusing different kinds of data at multiple layers improves the road detection result effectively. The more RFUs we use, the better the network performs. However, the results of our network with 2 RFUs and 3 RFUs are almost the same, indicating that data fusion at the largest scale is not very helpful. Hence, our method uses only 3 RFUs.
4.4 Performance on KITTI Road Dataset
Finally, we use all 289 training frames to train our method and segment the road areas of the testing frames. The results are submitted to the KITTI benchmark server. A set of metrics computed on bird's eye view (BEV) images is used for evaluation: maximum F1-measure (MaxF), average precision (AP), precision (PRE), recall (REC), false positive rate (FPR) and false negative rate (FNR).
Table 2 shows the results of our method on the 3 categories and on the whole urban dataset. Our method performs better on UM and UMM testing images than on UU testing images. This is because the road areas follow more obvious spatial patterns in UM and UMM images. More specifically, in those scenes there are sidewalks or fences above the ground, which separate the road and non-road areas and provide much sharper edges in the LiDAR point clouds.
The results of some recently submitted, non-anonymous methods and ours are shown in Table 2. They are DEEP-DIG [DEEPDIG2017], Up-Conv-Poly [UPCONVPOLY], HybridCRF [Xiao2018] and MixedCRF [Han2016]. The first two are deep learning based road segmentation methods, and the other two are not. Besides, the first two methods use only images to train their networks, while the last two use fused image and LiDAR point cloud data. The table shows that our method achieves the best performance among all these methods. Compared with DEEP-DIG and Up-Conv-Poly, our results show that fusing LiDAR data with images significantly improves the road segmentation ability of deep learning based methods. Our method also obtains clearly better results than HybridCRF and MixedCRF, which shows that fusing data within a deep learning framework brings a remarkable improvement.
Figure 5 shows our final results on the KITTI ROAD benchmark in the perspective images. In the images, red areas denote false negatives, blue areas correspond to false positives and green areas represent true positives. The figure shows that our method segments road areas effectively in all 3 categories.
Although our road segmentation performance is strong, our method overfits somewhat to urban road scenes, given the lack of annotated data from other environments. This problem also leads to reliance on a pre-trained model, as discussed in Section 1. More data should be annotated so that networks with data fusion in deeper layers can be trained in the future.
This paper proposes a novel structure for fusing images and LiDAR point clouds in an end-to-end semantic segmentation network. The fusion is performed at the decoder stage. We generate multi-scale LiDAR maps from LiDAR point clouds with a pyramid projection method and fuse them with the image features at different layers. Additionally, we adapt the multi-path refinement network to our fusion strategy and improve the road segmentation results compared with transposed convolution with skip layers. Our approach has been tested on the KITTI ROAD dataset and achieves competitive performance.
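In this section, we give details of the notation and indicators mentioned above. In the following equations, $T$ is short for TRUE, $F$ is short for FALSE, $P$ is short for POSITIVE and $N$ is short for NEGATIVE, so that $TP$, $FP$, $TN$ and $FN$ denote the numbers of true positive, false positive, true negative and false negative pixels. The definitions are as follows:

$$ \mathrm{PRE} = \frac{TP}{TP + FP}, \qquad \mathrm{REC} = \frac{TP}{TP + FN} $$

$$ \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{FNR} = \frac{FN}{TP + FN} $$

$$ \mathrm{F1} = \frac{2 \cdot \mathrm{PRE} \cdot \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}}, \qquad \mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} $$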
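For readers who want to reproduce the indicators from pixel counts, a minimal sketch follows. It uses the standard textbook definitions of precision, recall, F1, IoU, accuracy, FPR and FNR. The function names are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something specified in the text. MaxF and AP are omitted because they require sweeping a confidence threshold.

```python
def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero (a convention)."""
    return num / den if den else 0.0


def road_metrics(tp, fp, tn, fn):
    """Compute pixel-level road segmentation indicators from raw counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "PRE": precision,
        "REC": recall,
        "F1": safe_div(2 * precision * recall, precision + recall),
        "IoU": safe_div(tp, tp + fp + fn),
        "Accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "FPR": safe_div(fp, fp + tn),
        "FNR": safe_div(fn, tp + fn),
    }


if __name__ == "__main__":
    for name, value in road_metrics(tp=80, fp=10, tn=100, fn=20).items():
        print(f"{name}: {value:.4f}")
```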
- (1) C. Chen, A. Seff, A. Kornhauser, J. Xiao. DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving. IEEE International Conference on Computer Vision, 2015:2722–2730, 2015.
- (2) T. Treml, Arjona-Medina. Speeding up semantic segmentation for autonomous driving. NIPS Workshop, 2016:96–108, 2016.
- (3) Z. Chen, Z. Chen. RBNet: A Deep Neural Network for Unified Road and Road Boundary Detection. Neural Information Processing, 2017:677–687, 2017.
- (4) L. Caltagirone, S. Scheidegger, L. Svensson, M. Wahde. Fast LIDAR-based road detection using fully convolutional neural networks. IEEE Intelligent Vehicles Symposium, 2017:1019–1024, 2017.
- (5) L. Chen, J. Yang, H. Kong. LiDAR-histogram for fast road and obstacle detection. IEEE International Conference on Robotics and Automation, 2017:1343–1348, 2017.
- (6) J. Long, E. Shelhamer, T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition, 2015:3431–3440, 2015.
- (7) Y. Yao, F. Shen, J. Zhang, L. Liu, Z. Tang and L. Shao. Extracting Privileged Information for Enhancing Classifier Learning. IEEE Transactions on Image Processing, 28(1):436–450, 2019.
- (8) Y. Yao, J. Zhang, F. Shen, W. Yang, X. Hua and Z. Tang. Extracting Privileged Information from Untagged Corpora for Classifier Learning. International Joint Conference on Artificial Intelligence, 2018:1085–1091, 2018.
- (9) Y. Yao, J. Zhang, F. Shen, W. Yang, P. Huang, Z. Tang. Discovering and Distinguishing Multiple Visual Senses for Polysemous Words. AAAI Conference on Artificial Intelligence, 2018:523–530, 2018.
- (10) Y. Yao, J. Zhang, F. Shen, X. Hua, J. Xu and Z. Tang. Exploiting Web Images for Dataset Construction: A Domain Robust Approach. IEEE Transactions on Multimedia, 19(8):1771–1784, 2017.
- (11) Y. Yao, F. Shen, J. Zhang, L Liu, Z Tang, and L Shao. Extracting Multiple Visual Senses for Web Learning. IEEE Transactions on Multimedia, 21(1):184–196, 2019.
- (12) W. Zheng. Multichannel EEG-Based Emotion Recognition via Group Sparse Canonical Correlation Analysis. IEEE Transactions on Cognitive and Developmental Systems, 19(3):281–290, 2017.
- (13) J. Fritsch, T. Kuhnl, A. Geiger. A new performance measure and evaluation benchmark for road detection algorithms. IEEE Conference on Intelligent Transportation Systems, 2014:1693–1700, 2014.
- (14) Y. Yao, Z. Sun, F. Shen, L. Liu, L. Wang, F. Zhu, L. Ding, G. Wu, L. Shao. Dynamically Visual Disambiguation of Keyword-based Image Search. International Joint Conference on Artificial Intelligence (IJCAI), 2019.
- (15) K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations 2015:1–14, 2015.
- (16) K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2016:770–778, 2016.
- (17) W. Yang, J. Li, H. Zheng, R. Xu. A Nuclear Norm Based Matrix Regression Based Projections Method for Feature Extraction. IEEE Access, 6:7445–7451, 2017.
- (18) X. Han, J. Lu, C. Zhao, H. Li. Fully Convolutional Neural Networks for Road Detection with Multiple Cues Integration. IEEE International Conference on Robotics and Automation, 2018:1–9, 2018.
- (19) X. Han, H. Wang, J. Lu, C. Zhao. Road detection based on the fusion of Lidar and image data. International Journal of Advanced Robotic Systems, 14:1–10, 2017.
- (20) J. Schlosser, C. Chow and Z. Kira. Fusing LIDAR and images for pedestrian detection using convolutional neural networks. IEEE International Conference on Robotics and Automation, 2016:2198–2205, 2016.
- (21) W. Yang, Z. Wang, C. Sun. A collaborative representation based projections method for feature extraction. Pattern Recognition, 48(1):20–27, 2015.
- (22) A. Asvadi, L. Garrote, C. Premebida, P. Peixoto and U. Nunes. Multi-modal vehicle detection: fusing 3D-LiDAR and color camera data. Pattern Recognition Letters, 115:20–29, 2017.
- (23) L Xiao, R. Wang, B. Dai, Y. Fang, D. Liu. Hybrid conditional random field based camera-LIDAR fusion for road detection. Information Sciences, 432:543–558, 2018.
- (24) K. Lu, J. Li, X. An, H. He. A hierarchical approach for road detection. IEEE International Conference on Robotics and Automation, 2014:517–522, 2014.
- (25) W. Yang, Z. Wang, J. Yin, C. Sun, K. Ricanek. Image classification using kernel collaborative representation with regularized least square. Applied Mathematics and Computation, 222:13–28, 2013.
- (26) L. Caltagirone, M. Bellone, L. Svensson, M. Wahde. LIDAR–camera Fusion for Road Detection using Fully Convolutional Neural Networks. Robotics and Autonomous Systems, 2019:125–131, 2019.
- (27) V. Badrinarayanan, A. Kendall and R. Cipolla. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:2481–2495, 2017.
- (28) O. Ronneberger, P. Fischer and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. International Conference on Medical Image Computing and Computer Assisted Intervention, 2015:234–241, 2015.
- (29) Y. Yao, J. Zhang, F. Shen, L. Liu, F. Zhu, D. Zhang, and H. Shen. Towards Automatic Construction of Diverse, High-quality Image Dataset. IEEE Transactions on Knowledge and Data Engineering, 2019.
- (30) G. Lin, A. Milan, C. Shen and I. Reid. RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation. IEEE Conference on Computer Vision and Pattern Recognition, 2017:5168–5177, 2017.
- (31) Y. Yao, J. Zhang, F. Shen, X. Hua, J. Xu and Z. Tang. A New Web-supervised Method for Image Dataset Constructions. Neurocomputing, 236: 23-31, 2017.
- (32) L. Chen, Y. Zhu, G. Papandreou, F. Schroff and H. Adam. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. European Conference on Computer Vision, 2018:833–851, 2018.
- (33) Y. Yao, F. Shen, J. Zhang, L. Liu, Z. Tang and L. Shao. Discovering and Distinguishing Multiple Visual Senses for Web Learning. IEEE Transactions on Multimedia, 2019.
- (34) M. Xu, Z. Tang, Y. Yao, L. Yao, H. Liu and J. Xu. Deep Learning for Person Reidentification Using Support Vector Machines. Advances in Multimedia, 9874345:1-9874345:12, 2017.
- (35) P. Huang, T. Li, G. Gao, Y. Yao and G. Yang. Collaborative Representation Based Local Discriminant Projection for Feature Extraction. Pattern Recognition Letters, 76: 84-93, 2018.
- (36) Y. Yao, J. Zhang, X. Hua, F. Shen, and Z. Tang. Extracting Visual Knowledge from the Internet: Making Sense of Image Data. International Conference on Multimedia Modeling, 862-873, 2016.
- (37) F. Shen, Y. Xu, L. Liu, Y. Yang, Z. Huang and H. Shen. Unsupervised Deep Hashing with Similarity-Adaptive and Discrete Optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):3034–3044, 2018.
- (38) F. Shen, X. Zhou, Y. Yang, J. Song, H. Shen and D. Tao. A Fast Optimization Method for General Binary Code Learning. IEEE Transactions on Image Processing, 25(12):5610–5621, 2016.
- (39) F. Shen, Y. Yang, L. Liu, W. Liu, D. Tao, H. Shen. Asymmetric Binary Coding for Image Search. IEEE Transactions on Multimedia, 19(9):2022–2032, 2017.
- (40) L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. Yuille. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. International Conference on Learning Representations, 2015:1–1, 2015.
- (41) Y. Yao, W. Yang, P. Huang, Q. Wang, Y. Cai and Z. Tang. Exploiting Textual and Visual Features for Image Categorization. Pattern Recognition Letters, 117: 140-145, 2019.
- (42) S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang and P. Torr. Conditional Random Fields as Recurrent Neural Networks. IEEE International Conference on Computer Vision, 2015:1529–1537, 2015.
- (43) H. Liu, X. Han, X. Li, Y. Yao, P. Huang and Z. Tang. Deep representation learning for road detection using siamese network. Multimedia Tools and Applications, 2018:1–15, 2018.
- (44) C. Premebida, J. Carreira, J. Batista and U. Nunes. Pedestrian detection combining RGB and dense LIDAR data. IEEE International Conference on Intelligent Robots and Systems, 2014:4112–4117, 2014.
- (45) J. Deng, W. Dong, R. Socher, L. Li, K. Li and F. Li. ImageNet: A large-scale hierarchical image database. IEEE International Conference on Computer Vision and Pattern Recognition, 2009:248–255, 2009.
- (46) J. Muñoz-Bulnes, C. Fernandez, I. Parra, D. Fernández-Llorca and M. Sotelo. Deep Fully Convolutional Networks with Random Data Augmentation for Enhanced Generalization in Road Detection. IEEE International Conference on Intelligent Transportation Systems, 2017:366–371, 2017.
- (47) G. Oliveira, W. Burgard and T. Brox. Efficient Deep Methods for Monocular Road Segmentation. International Conference on Intelligent Robots and Systems, 2016:9–14, 2016.
- (48) Y. Yao, J. Zhang, F. Shen, X. Hua, J. Xu and Z. Tang. Automatic Image Dataset Construction with Multiple Textual Metadata. IEEE International Conference on Multimedia and Expo, 2016:1–6, 2016.
- (49) G. Xie, X. Zhang, S. Yan and C. Liu. Hybrid CNN and Dictionary-Based Models for Scene Recognition and Domain Adaptation. IEEE Transactions on Circuits and Systems for Video Technology, vol. 27(6): 1263–1274, 2017.
- (50) Y. Yao, X. Hua, F. Shen, J. Zhang and Z. Tang. A Domain Robust Approach for Image Dataset Construction. ACM Conference on Multimedia, 212-216, 2016.
- (51) G. Xie, L. Liu, X. Jin, F. Zhu, Z. Zhang, J. Qin, Y. Yao, and L. Shao. Attentive Region Embedding Network for Zero-shot Learning. IEEE International Conference on Computer Vision and Pattern Recognition, 2019.
- (52) G. Xie, X. Zhang, S. Yan, C. Liu. SDE: A novel selective, discriminative and equalizing feature representation for visual recognition. International Journal of Computer Vision, 124(2):145–168, 2017.
- (53) F. Zhao, J. Feng, J. Zhao, W. Yang and S. Yan. Robust LSTM-Autoencoders for Face De-Occlusion in the Wild. IEEE Transactions on Image Processing, 27(2):778–790, 2018.
- (54) F. Zhao, J. Li, J. Zhao and J. Feng. Weakly Supervised Phrase Localization with Multi-scale Anchored Transformer Network. IEEE Conference on Computer Vision and Pattern Recognition, 2018:5696–5705, 2018.
- (55) M. Zhao, J. Zhang, F. Porikli, C. Zhang and W. Zhang. Learning a perspective-embedded deconvolution network for crowd counting. IEEE International Conference on Multimedia and Expo, 2017:403–408, 2017.