Deep learning has attracted considerable attention in recent years in areas such as classification, natural language processing, dimension reduction, object detection, and motion modeling [1, 2, 3], owing to its powerful ability to approximate highly nonlinear functions. The convolutional neural network (CNN) is one approach to implementing deep learning and is particularly suited to image recognition, as it can reduce the dimensionality of high-dimensional inputs through convolution. CNNs have therefore been applied to problems in autonomous vehicles, such as the detection and classification of pedestrians and vehicles [6, 7, 8] and environment perception. In addition, researchers have used CNNs in end-to-end frameworks for learning an autonomous vehicle controller. Compared with methods that explicitly decompose controller design into steps such as lane marking detection, path planning, and vehicle control, training an end-to-end controller optimizes these steps simultaneously. For example, robots with an end-to-end vehicle controller have been shown to detect obstacles and navigate around them in real time. Xu et al. also used a CNN to learn an end-to-end vehicle controller that can follow a curved lane accurately. In other work, a car with an end-to-end controller could drive in traffic with or without lane markings, and it also performed well in scenarios with unclear visual guidance, such as parking lots and unpaved roads.
However, since training CNNs requires a large amount of labeled data covering diverse scenarios, the training procedure is typically computationally expensive and time-consuming. Researchers have used multiple graphics processing units (GPUs) to cope with this problem, but this increases the development cost. In addition, the programming model of GPUs differs significantly from that of traditional central processing units (CPUs), which makes GPU programming harder to learn. Li et al. used low-resolution images to shorten the training time, but this reduces training accuracy.
Previous studies have ranked object-level importance in images by training the neural network with semantic abstraction or human-centric annotations. However, it is still unclear how well an end-to-end CNN controller performs if it is trained with the less important image features discarded and then applied to normal driving scenarios.
This paper analyzes the importance of different image features for training an end-to-end autonomous vehicle controller.
We describe a neural network architecture and the training of a CNN-based end-to-end steering controller for autonomous vehicles. Then, we propose two frameworks to analyze the importance of different features for learning the CNN. In this paper, image features are classified into three categories, and new data sets, each with exactly one of the three features excluded, are used for training and validation. The performance of the learned controllers is also validated and analyzed in a closed-loop simulation test in which they control an autonomous vehicle on different tracks.
Section II presents the method used to train end-to-end autonomous vehicle controllers and the experiment design. The learned controllers are validated in Section III. Section IV presents the feature evaluation. Conclusions and future work are discussed in Section V.
We collected images (i.e., driving scenarios) with labels (i.e., steering angles) using The Open Racing Car Simulator (TORCS), which is widely used in AI research. The image features of the collected data sets are then grouped. Two different frameworks are proposed to train and test the CNNs, as shown in Fig. 1. Based on the two frameworks, we can evaluate the importance of each feature for learning a CNN-based controller.
II-A Data Collection
To train a CNN, we need to label images by matching each scenario screenshot with a steering angle. In this paper, the labeled data are collected by a human driver driving cars in TORCS with a joystick wheel. The experimental scenarios include 13 different double-lane tracks without other road users. Examples of five tracks are shown in Fig. 2. We replace the original road surface textures in TORCS with customized asphalt textures and asphalt darkness levels so that the data remain coherent. We sample the images at 10 frames per second (FPS), because a higher frame rate would only produce more similar frames without providing more useful information. The labeled screenshots are down-sampled to 190*100 and stored in a database together with the steering angle normalized to the range from -1 to 1. The car speed is held constant at 60 km/h. Then, we generate a Hierarchical Data Format 5 (HDF5) file that includes the frames and steering angles. This file is the input to the training process in CAFFE, a well-known deep learning framework developed by the Berkeley Vision and Learning Center (BVLC). In total, 33,700 images are included in the training data set.
II-B Network Structure
We define the network structure in CAFFE, as shown in Fig. 3. We use a CNN with four hidden layers: three convolutional layers and one fully connected layer. The input data has the size 190*100*3, and we use a batch size of 100 (a batch is a subset of the input frames). The first two convolutional layers have a kernel size of 5*5. The first one outputs 20 feature maps and the second one outputs 48 feature maps. They are used to extract the features of roads. After the first two layers, we use a pooling layer with a 2*2 kernel to scale the frames (the pooling layer applies a max operation to resize the feature map). The third convolutional layer has a 3*3 kernel size. The fully connected layer has 500 outputs, and the last layer outputs the steering angle. The last layer is the Euclidean loss layer, which calculates the loss $E$, given by

\[ E = \frac{1}{2N} \sum_{n=1}^{N} \lVert \hat{y}_n - y_n \rVert_2^2 , \]

where $\hat{y}_n$ is the estimated value and $y_n$ is the labeled value after an iteration with batch size $N$. The optimization algorithm uses this loss to update the weights and biases so as to reduce the loss in the next iteration.
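The Euclidean loss described above can be illustrated with a short sketch. It assumes the convention documented for Caffe's Euclidean loss layer (sum of squared differences divided by twice the batch size) and scalar steering angles; the function name `euclidean_loss` is hypothetical and is not taken from the paper.

```python
def euclidean_loss(predicted, labeled):
    """Euclidean loss over one batch of scalar steering angles.

    Assumes the 1/(2N) convention of Caffe's Euclidean loss layer,
    where N is the batch size.
    """
    if len(predicted) != len(labeled) or not predicted:
        raise ValueError("batches must be non-empty and of equal length")
    n = len(predicted)
    # Sum of squared differences, scaled by 1 / (2N).
    return sum((p - y) ** 2 for p, y in zip(predicted, labeled)) / (2 * n)


# Two frames, each predicted 0.5 away from the labeled steering angle.
print(euclidean_loss([0.5, -0.5], [0.0, 0.0]))  # 0.125
```

Because the steering angle is normalized to the range from -1 to 1, a per-frame error of 0.5 is already large, which is why converged loss values of this kind are small fractions.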
II-C Experiment Design
II-C1 Framework 1
In Framework 1, 33,700 images are taken as a training data set to learn an end-to-end autonomous vehicle controller, denoted as CNN1. After the training converges, CNN1 is tested on previously unseen test data sets. The features of the test data sets are then manually classified into three categories: sky-related, roadside-related, and road-related. Typical feature areas are shown in Fig. 4. The sky-related feature (region a in Fig. 4) refers to the upper part of an image, often with clouds, birds, or parts of skyscrapers; the roadside-related feature (region b in Fig. 4) denotes the middle right and left parts of an image, often with grass, trees, and buildings; the road-related feature (region c in Fig. 4) is the lower-middle part of an image, often showing roads of different textures. In the analysis stage, each feature is covered up one by one to evaluate its importance.
II-C2 Framework 2
In Framework 2, the training data set with 33,700 images is preprocessed by covering a single feature shown in Fig. 4 with a white polygon. This yields three new training data sets with different features covered: a sky covered-up data set, a roadside covered-up data set, and a road covered-up data set. Three different end-to-end autonomous vehicle controllers (CNN2, CNN3, and CNN4) are then trained in CAFFE, one on each covered-up data set. After that, each controller is tested using a data set with the same feature covered as in its training data set.
In addition, data sets containing all three features are used to test controllers CNN2, CNN3, and CNN4, so that the importance of each single feature can be assessed. Then, to evaluate the features' importance for training a CNN, the three controllers are used to drive the car in TORCS. The duration of successful lane tracking is recorded and compared.
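The covering step can be sketched as follows. This is only an illustration under stated assumptions: the paper covers each region with a white polygon, whereas the sketch uses an axis-aligned box; images are plain nested lists of RGB tuples; and the function names and the 40-row sky height are hypothetical values chosen for the example, not figures from the paper.

```python
WHITE = (255, 255, 255)


def cover_box(image, top, bottom, left, right):
    """Return a copy of `image` with rows top..bottom-1 and columns
    left..right-1 painted white. `image` is a list of rows of RGB tuples."""
    covered = [list(row) for row in image]  # copy so the original is kept
    for r in range(top, bottom):
        for c in range(left, right):
            covered[r][c] = WHITE
    return covered


def cover_sky(image, sky_rows=40):
    """Cover the upper part of the image (hypothetical sky height)."""
    width = len(image[0])
    return cover_box(image, 0, sky_rows, 0, width)


# A black frame, assuming 190 pixels wide and 100 pixels high.
frame = [[(0, 0, 0)] * 190 for _ in range(100)]
sky_covered = cover_sky(frame)
print(sky_covered[0][0], sky_covered[40][0])  # (255, 255, 255) (0, 0, 0)
```

Applying one such mask to every training image would give one covered-up data set; the roadside and road regions would need their own (hypothetical) box or polygon coordinates.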
III Validation of Trained Controllers
We evaluate the effectiveness of different controllers by checking the loss values for all learned controllers.
III-A Verification and Validation of CNN1
First, the end-to-end controller CNN1 is learned using the data set of 33,700 images without any feature covered up. We train the controller on a laptop using only the CPU; the training can also be accelerated by exploiting GPUs. We evaluate the effectiveness of training by examining the loss values. Once the loss value has converged and no longer decreases, either the network is trained well enough or the network structure is not suitable for the task at hand. If the network is already sufficiently trained and the training process is not stopped, overfitting may occur even if dropout is used. Fig. 5 shows the loss value of CNN1 over the training iterations. We note that, after 5,000 iterations, the loss value of CNN1 does not decrease noticeably, so the training process can be stopped at this stage.
The point at which the loss values converge is a good indicator of a well-trained neural network, but the most important factor is how well the trained network performs on new data. Fig. 6 compares the steering angles from a human driver with the steering angles estimated by CNN1 for the new data set. We can see that CNN1 performs well on new data, indicating that the structure of the designed neural network is reasonable and the trained controller is effective.
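The stopping rule described above (stop once the loss no longer decreases noticeably) can be made concrete with a small sketch. The paper judges convergence by inspecting the loss curve; the windowed comparison below, along with the names `has_converged`, `window`, and `tol`, is an assumption introduced only for illustration.

```python
def has_converged(losses, window=500, tol=1e-4):
    """Return True when the mean loss over the last `window` iterations
    is not lower than the mean over the preceding `window` iterations
    by at least `tol`, i.e. the loss no longer decreases noticeably."""
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    previous = sum(losses[-2 * window:-window]) / window
    latest = sum(losses[-window:]) / window
    return previous - latest < tol


# Toy loss histories with a short window for readability.
still_improving = [1.0, 0.5, 0.3, 0.2, 0.1, 0.05]
flat = [0.02] * 6
print(has_converged(still_improving, window=3, tol=0.01))  # False
print(has_converged(flat, window=3, tol=0.01))             # True
```

A flat loss curve only says that training has stalled; as the text notes, it does not distinguish a well-trained network from an unsuitable one, so performance on new data still has to be checked.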
III-B Verification and Validation of CNN2, CNN3, and CNN4
We generate three new data sets from the original data set by discarding different features. Fig. 7 shows the loss values of CNN2, CNN3, and CNN4 over the training iterations. After 5,000 iterations, the loss value does not decrease noticeably, so the training processes are stopped. We note that the end-to-end controllers trained using the sky covered-up and roadside covered-up data sets have similar rates of convergence: both losses decrease significantly in the first 500 iterations and converge at about iteration 1,000. For the controller trained using the road covered-up data set, however, the loss value drops to 0.02 at the very beginning and oscillates around 0.02 until the end.
The verification of CNN1 showed that the network structure is suitable, which is also confirmed by the convergence of CNN2 and CNN3, as seen in Fig. 7. The non-convergence of CNN4 is therefore most likely caused by the data set with the road feature excluded. This is intuitive: even a human driver who knows nothing about the road would find it hard to decide what to do next.
Then, we test the three controllers separately using new data sets. These data sets differ from each other in that each has only one feature covered up. Fig. 8 shows the test results of CNN2, CNN3, and CNN4, compared with the human driver's steering angles. The results for iterations 1,400 to 1,600 are magnified for a detailed comparison. Table I shows the means and standard deviations of the Euclidean loss between the predicted and labeled steering angles for CNN2, CNN3, and CNN4. We note that CNN2, trained using the sky covered-up data set, has the smallest mean loss value. The mean loss values of the other two controllers are an order of magnitude higher than that of CNN2. CNN3, trained using the roadside covered-up data set, has a mean loss of 0.0019 and is much better than CNN4, which is trained using the road covered-up data set.
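The statistics reported in such a table can be sketched as follows. The sketch makes two assumptions that the text does not settle: the per-frame loss is taken as half the squared difference (matching the 1/(2N) convention of Caffe's Euclidean loss layer), and the standard deviation is the population form. The function name `loss_statistics` is hypothetical.

```python
from statistics import mean, pstdev


def loss_statistics(predicted, labeled):
    """Mean and population standard deviation of the per-frame
    Euclidean loss, taken here as 0.5 * (prediction - label) ** 2."""
    losses = [0.5 * (p - y) ** 2 for p, y in zip(predicted, labeled)]
    return mean(losses), pstdev(losses)


# Toy example: one frame off by 0.2, one frame predicted exactly.
avg, spread = loss_statistics([0.2, 0.0], [0.0, 0.0])
print(round(avg, 4), round(spread, 4))  # 0.01 0.01
```

Reporting the spread alongside the mean matters here because a controller can have a low average loss while still producing occasional large steering errors.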
IV Features Evaluation
We use the CNN2, CNN3, and CNN4 controllers to evaluate the importance of sky-related, roadside-related, and road-related features, respectively. The evaluation is carried out using the two proposed frameworks.
In Framework 1, CNN1 is tested using data sets with each single feature covered up one by one. This gives a direct view of the relationship between the controller output and the importance of the features in the scenarios.
Fig. 9 presents the steering angles predicted by CNN1 with different features covered up, compared with the steering angles from human drivers. From Fig. 9 we note that the steering angle predicted using the data set with the sky-related feature covered up closely matches the steering angles from human drivers. This indicates that the neural network can "drive" the vehicle well without any sky-related information. The green line is the steering angle predicted using the data set with the roadside covered up. It differs slightly from the steering angles from human drivers, but its shape is similar. This suggests that, without roadside information, the CNN1 controller still works reasonably well because the road features provide part of the useful information, but that the roadside features are also needed to obtain an accurate control command.
For road-related features, the pink line in Fig. 9 is far from the human driver's control, which means that the road-related feature in the data set is the most important for learning.
Framework 2 evaluates the features from the opposite perspective to Framework 1. In Framework 2, we test the three controllers (CNN2, CNN3, and CNN4) using a data set with all features included. In this way, we can evaluate the influence of different features on training an end-to-end autonomous vehicle controller.
Fig. 10 compares the human driver and the three end-to-end controllers in predicting steering angles on a new data set with all features included. In Fig. 10, none of the controllers matches the labeled values closely. CNN2 behaves better than the other two, and CNN4 behaves irregularly. Since CNN4 is trained using the data set without road-related features, it can be inferred that the road feature is indispensable for training the end-to-end controller. By contrast, the sky-related features are the least important for an end-to-end autonomous controller.
Table II shows the means and standard deviations of the Euclidean loss between the predicted and labeled steering angles of CNN2, CNN3, and CNN4. We can see that the end-to-end controller CNN2 has the smallest mean loss value. The mean loss values of CNN3 and CNN4 are 0.003 and 0.0062, respectively, an order of magnitude higher than that of CNN2. Compared with Table I, the performance of the CNN controllers trained on data sets with a missing feature degrades to varying degrees: both the deviation and its dispersion grow when a controller trained without some features is applied to test scenarios that include them.
To further analyze the influence of discarding features on training an end-to-end autonomous controller, we then run the three controllers (i.e., CNN2, CNN3, and CNN4) in a closed-loop driving test. Fig. 11 shows the new tracks and scenarios for the closed-loop test. In the closed-loop test, screenshots of the scenarios are fed to the controllers, which generate the steering angle used to control the car along the whole track.
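The structure of such a closed-loop test can be sketched as below. Everything in it is a hypothetical stand-in: `ToyTrack` replaces TORCS with a one-dimensional lateral-offset model, the controller receives that offset instead of a screenshot, and the 0.1 s control period and 60 s time limit are assumed values, not settings reported in the paper.

```python
class ToyTrack:
    """Hypothetical stand-in for the simulator: only the lateral
    offset of the car from the lane center is modeled."""

    def __init__(self, half_width=1.0, drift=0.05):
        self.offset = 0.0
        self.half_width = half_width
        self.drift = drift  # constant disturbance pushing the car sideways

    def screenshot(self):
        # A real test would return an image; the toy returns the offset.
        return self.offset

    def on_track(self):
        return abs(self.offset) <= self.half_width

    def apply_steering(self, steering, dt):
        self.offset += (self.drift + steering) * dt


def tracking_duration(controller, simulator, dt=0.1, max_time=60.0):
    """Run the loop frame -> controller -> steering until the car leaves
    the track or `max_time` elapses; return the time spent on track."""
    max_steps = round(max_time / dt)
    steps = 0
    while steps < max_steps and simulator.on_track():
        frame = simulator.screenshot()
        # Clamp to the normalized steering range from -1 to 1.
        steering = max(-1.0, min(1.0, controller(frame)))
        simulator.apply_steering(steering, dt)
        steps += 1
    return steps * dt


# A controller that counter-steers finishes; one that never steers drifts off.
finished = tracking_duration(lambda frame: -frame, ToyTrack())
drifted = tracking_duration(lambda frame: 0.0, ToyTrack())
print(finished > drifted)  # True
```

Unlike the open-loop comparisons on recorded frames, errors here accumulate: a small steering bias moves the car, which changes the next frame, so the time on track measures how well a controller recovers from its own mistakes.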
Table III shows the duration of successful lane tracking for a TORCS car in different test scenarios. From Table III, we can see that the TORCS car with CNN4 drifts off the track at 15 s in every scenario. In simpler scenarios such as Motorway and Spring, the three controllers (CNN1, CNN2, and CNN3) perform well: they keep the TORCS car on the track and successfully finish the tracking test. For tracks with complicated and colored road edges (e.g., E-track1 and Torovo), none of the three controllers can keep the car in the lane for long because of the sudden turns and confusing features of the track, and their durations of successful tracking are similar. The tests in E-track1, Spring, and Motorway show that the end-to-end controllers trained with and without the sky- and roadside-related features perform similarly in both simple and complicated scenarios.
The most interesting scenario is E-track4. The shape of E-track4 is quite simple, with no sharp turns. However, the roadside of E-track4 consists of different textures such as grass and sand. In this scenario, the car with controller CNN1 can finish the track, but the cars with CNN2 and CNN3 stay on the track for only 331.2 s and 127.7 s, respectively. Because CNN2 is trained on the data set that includes roadside-related features, it generalizes better to different roadside scenarios and can therefore stay on the track much longer than CNN3 in a scenario with mixed grass and sand features. From this, we can conclude that the features rank in importance for training an end-to-end controller as follows: road-related features, roadside-related features, and sky-related features.
V Conclusions and Future Work
From the experimental validation and discussion above, we can conclude that none of the controllers performs well without road information. The road-related features are of crucial importance for training end-to-end controllers. The roadside-related features give the controller good generalizability across various scenarios and should therefore also be included in the training data. Sky-related features are the least important and can therefore be excluded to speed up the training of an end-to-end controller. Although this work analyzes three specific categories of features and evaluates the effect of each, its main contribution is a framework for feature analysis and selection, which can be used more generally to reduce the computational cost of training deep learning-based controllers for autonomous vehicles.
In this paper, we classify the image features manually for the sake of clarity and accuracy. In future work, methods such as fully convolutional networks could be used to classify features automatically, which is advantageous when more features are analyzed. In addition, we use only one human driver to collect data, at a constant velocity of 60 km/h. In future work, we will use more drivers operating the car at different vehicle velocities. Furthermore, although we have shown that the road-related features are the most important features in this work, it is still not known which part of the road is more important. Based on our driving experience, at high speed we tend to look further down the lane, whereas at low speed we care more about the nearby traffic situation. In future work, we will conduct the road-related feature analysis for different driving conditions. Real driving data sets will also be used in future research.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun, “Very deep convolutional networks for natural language processing,” arXiv preprint arXiv:1606.01781, 2016.
-  L. Shao, Z. Cai, L. Liu, and K. Lu, “Performance evaluation of deep feature learning for rgb-d image/video classification,” Information Sciences, vol. 385, pp. 266–283, 2017.
-  T. Liu, S. Fang, Y. Zhao, P. Wang, and J. Zhang, “Implementation of training convolutional neural networks,” arXiv preprint arXiv:1506.01195, 2015.
-  M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., “End to end learning for self-driving cars,” arXiv preprint arXiv:1604.07316, 2016.
-  K. A. Ishak, S. A. Samad, and A. Hussain, “A face detection and recognition system for intelligent vehicles,” Information Technology Journal, vol. 5, no. 3, pp. 507–515, 2006.
-  Y. Tian, P. Luo, X. Wang, and X. Tang, “Pedestrian detection aided by deep learning semantic tasks,” in
-  X. Du, M. El-Khamy, J. Lee, and L. S. Davis, “Fused dnn: A deep neural network fusion approach to fast and robust pedestrian detection,” arXiv preprint arXiv:1610.03466, 2016.
-  C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “Learning affordance for direct perception in autonomous driving,” Proceedings of 15th IEEE International Conference on Computer Vision, 2015.
-  Y. LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp, “Off-road obstacle avoidance through end-to-end learning,” in NIPS, 2005, pp. 739–746.
-  H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving models from large-scale video datasets,” arXiv preprint arXiv:1612.01079, 2016.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A convolutional neural network cascade for face detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5325–5334.
-  S. Shalev-Shwartz and A. Shashua, “On the sample complexity of end-to-end training vs. semantic abstraction training,” arXiv preprint arXiv:1604.06915, 2016.
-  E. Ohn-Bar and M. M. Trivedi, “Are all objects equal? deep spatio-temporal importance prediction in driving videos,” Pattern Recognition, vol. 64, pp. 425–436, 2017.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.