Deep neural networks (DNNs) are increasingly used in the automotive industry for realizing perception functions in automated driving. The safety-critical nature of automated driving requires the deployed DNNs to be highly dependable. Among many dependability attributes we consider the robustness criterion, which intuitively requires a neural network to produce similar output values under similar inputs. It is known that DNNs trained under standard approaches often fail to exhibit robustness. For example, by imposing carefully crafted tiny noise on an input data point, the newly generated data point may cause a DNN to produce results that completely deviate from the originally expected output.
In this paper, we study the robustness problem for direct perception networks in automated driving. The concept of direct perception neural networks refers to learning affordances (high-level sensory features) [pomerleau1989alvinn, chen2015deepdriving, sauer2018conditional] such as distance to lane markings or distance to the front vehicles, directly from high-dimensional sensory inputs. In contrast to classification where the output criterion for robustness is merely the sameness of the output label for input data under perturbation, direct perception commonly uses regression output. For practical systems, it can be unrealistic to assume that trained neural networks can produce output regression that perfectly matches the numerical values as specified in the labels.
To address this issue, our proposal is to define a tolerance that explicitly regulates the allowed output deviation from labels. Pragmatically, the tolerance arises from two sources, namely (i) the quality contract between car makers and their suppliers, or (ii) the inherent uncertainty in the manual labelling process. (Our very preliminary experiments demonstrated that, when trying to label the center of the ego lane on the same image having pixels in width, a labelling deviation of pixels for two consecutive trials is very common, especially when the labelling decision needs to be conducted within a short period of time.) We subsequently define a loss function that integrates the tolerance: the prediction error is set to zero so long as the prediction falls within the tolerance bound. Thus the training intuitively emphasizes reducing the worst case (i.e., bringing the prediction back into the tolerance). This is in contrast to the use of standard loss functions (such as mean-squared-error), where the goal is to bring every prediction close to the label.
As robustness requires that perturbed input data should produce results similar to those of unperturbed input data, the tolerance can naturally be overloaded to define the “sameness” of the regression output under perturbation. Based on this concept, we further propose a new criterion for provable robustness [kolter2017provable, sinha2017certifiable, wang2018mixtrain, raghunathan2018certified, wong2018scaling, tsuzuku2018lipschitz, salman2019provably] tailored for regression, which is parametric in the allowed output tolerance, the layer index where perturbation is considered, and the maximum perturbation amount. The robustness criterion requires that for any data point in the training set, applying any feature-level perturbation at the chosen layer with magnitude below the maximum perturbation amount leads the DNN to only a slight output deviation (bounded by the tolerance) from the associated ground truth. Importantly, the introduction of the layer-index parameter overcomes scalability and precision issues, while it also implicitly provides capabilities to capture global input transformations (cf. the related-work section for a detailed comparison to existing work). By carefully defining the loss function as the interval overflow between (i) the computed error bounds due to perturbation and (ii) the allowed tolerance interval, the loss can be efficiently computed by summing the overflow of the two end-points of the propagated symbolic interval.
To evaluate our proposed approach, we have trained a direct perception network with labels created from publicly accessible datasets. The network takes input from road images and produces affordances such as the central position of the ego-lane in pixel coordinates. The positive result of our preliminary experiment demonstrates the potential for further applying the technology in other automated driving tasks that use DNNs.
(Structure of the Paper) The rest of the paper is structured as follows. Section II starts with basic formulations of neural networks and describes the tolerance concept. It subsequently details how the error between predictions and labels, while considering tolerance, can be implemented with GPU support. Section III extends the concept of tolerance for provable robustness by considering feature-level perturbation. Section IV details our initial experiment in a highway vision-based perception system. Finally, we outline related work in the related-work section and conclude the paper with future directions in the concluding remarks.
II Neural Network and Tolerance
A neural network comprises a sequence of layers; operationally, each layer is a function that maps the output vector of the preceding layer to a vector whose length is the dimension of that layer. Given an input data point in, the output of a given layer of the neural network is the functional composition of that layer and all previous layers, applied to in. The output of the final layer is the prediction of the network under input data point in. Throughout this paper, we use subscripts to extract an element of a vector, e.g., a subscripted index on in denotes the corresponding value of in.
Given a neural network following the above definitions, consider a training data set in which each data point in has an associated label lb, and let an output tolerance be given for the output dimensions. We integrate the tolerance to define the error between a prediction of the neural network and the label lb. Precisely, the error is defined separately for each output index, as a piecewise function of the prediction, the label, and the tolerance.
Figure 1 illustrates the intuition behind this error definition. We consider a prediction to be correct (i.e., the error is zero) when it lies within the tolerance interval around the label (Figure 1-a). Otherwise, the error is the distance to the boundary of the interval (Figure 1-b). Finally, we define the in-sample error to be the sum of the squared errors over each output dimension and each data point in the training set.
Definition 1 (Interval tolerance loss)
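Define the error in the training set (i.e., the in-sample error) to be the sum of the squared tolerance-aware errors between prediction and label, taken over every data point in the training set and every output index.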
One may observe that the loss function defined above is designed as an extension of the mean-squared-error (MSE) loss function.
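Lemma 1
When the tolerance is zero, the interval tolerance loss is equal to the mean-squared-error loss function.
By setting the tolerance to zero, the tolerance interval collapses to the label itself, so the per-output error simplifies to the absolute difference between the prediction and the label. Thus, in Definition 1, each summand simplifies to the squared difference between the prediction and the label, which is equivalent to computing the square of the L2-norm of the difference between the prediction vector and the label vector.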
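The tolerance-aware error and its zero-tolerance special case can be illustrated with a minimal pure-Python sketch. All function and parameter names here (`interval_error`, `interval_tolerance_loss`, `squared_error_loss`, `tols`) are hypothetical, and the sketch assumes one tolerance value per output dimension; it is not the authors' implementation.

```python
def interval_error(pred, label, tol):
    """Error for one output: 0 inside [label - tol, label + tol],
    otherwise the distance to the nearest end-point of that interval."""
    low, up = label - tol, label + tol
    if pred < low:
        return low - pred
    if pred > up:
        return pred - up
    return 0.0


def interval_tolerance_loss(preds, labels, tols):
    """Sum of squared tolerance-aware errors over all data points and outputs."""
    total = 0.0
    for pred_vec, label_vec in zip(preds, labels):
        for p, lb, t in zip(pred_vec, label_vec, tols):
            total += interval_error(p, lb, t) ** 2
    return total


def squared_error_loss(preds, labels):
    """Sum of squared differences between predictions and labels."""
    return sum((p - lb) ** 2
               for pred_vec, label_vec in zip(preds, labels)
               for p, lb in zip(pred_vec, label_vec))


preds = [[1.0, 5.0], [3.0, 2.0]]
labels = [[1.5, 4.0], [3.0, 0.0]]
print(interval_tolerance_loss(preds, labels, [1.0, 0.5]))  # only out-of-tolerance outputs contribute
print(interval_tolerance_loss(preds, labels, [0.0, 0.0]) == squared_error_loss(preds, labels))
```

With all tolerances set to zero, the two loss functions in this sketch agree, which mirrors the special case discussed above.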
(Implementing the loss function with GPU support) With frameworks such as TensorFlow (Google TensorFlow: http://www.tensorflow.org) or PyTorch (Facebook PyTorch: https://www.pytorch.org), for training a neural network that uses standard layers such as ReLU [nair2010rectified], ELU [clevert2015fast], Leaky ReLU [maas2013rectifier] as well as convolution, one only needs to manually implement the customized loss function, while back-propagation capabilities for parameter updates are automatically created by the infrastructure. In the following, we demonstrate a rewriting of the error function such that it uses built-in primitives supported by TensorFlow. Such a rewriting makes it possible for the training to utilize GPU parallelization.
Lemma 2 (Error function using GPU function primitives)
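Let clip be a function that returns its argument if the argument lies within the given lower and upper limits; otherwise it returns the limit nearest to the argument. Define a second error function by rewriting the tolerance-aware error in terms of clip. Then the two error functions are equal.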
It can be intriguing to see why the two error functions are equivalent, i.e., why the if-then-else statement in the original definition is implicitly implemented by the clip primitive (implemented in Google TensorFlow as tf.keras.backend.clip) in the rewritten one. To assist understanding, Figure 2 provides a simplified proof by enumerating all three possible cases regarding the relative position of the prediction and the tolerance interval, together with their intermediate computations. Diligent readers can easily replace the constants in Figure 2 with variables and establish a formal correctness proof.
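The three-case argument can be checked numerically with a small sketch. The `clip` helper below is a scalar stand-in for an element-wise clipping primitive, and the expression in `error_with_clip` is only one possible clip-based rewriting, chosen here for illustration; it is not necessarily the exact expression used in Lemma 2. All names are hypothetical.

```python
def clip(x, low, up):
    """Scalar stand-in for an element-wise clip primitive: limits x to [low, up]."""
    return max(low, min(x, up))


def error_if_then_else(pred, label, tol):
    """Tolerance-aware error written with explicit case distinctions."""
    low, up = label - tol, label + tol
    if pred < low:
        return low - pred
    if pred > up:
        return pred - up
    return 0.0


def error_with_clip(pred, label, tol):
    """Same error without branching: distance between the prediction and
    the prediction clipped into the tolerance interval."""
    return abs(pred - clip(pred, label - tol, label + tol))


# The three relative positions of the prediction and the tolerance interval [8, 12].
for pred in (5.0, 11.0, 15.0):
    print(pred, error_if_then_else(pred, 10.0, 2.0), error_with_clip(pred, 10.0, 2.0))
```

Because the branch-free version uses only clipping, subtraction, and an absolute value, it maps directly onto element-wise tensor operations, which is what makes GPU execution straightforward.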
III Provably Robust Training
This section starts by outlining the concept of feature-level perturbation, followed by defining symbolic loss. It then defines provably robust training and shows how training can be made efficient with GPU support. To simplify notation, in this section we use an addition operation between a vector and an operand that (1) if the operand is a scalar, adds it to every dimension of the vector, or (2) if the operand is a vector, performs element-wise addition.
Definition 2 (Output bound under -perturbation)
Given a neural network and an input data point in, the output bound subject to perturbation assigns to each output dimension an interval, given by a lower and an upper end-point, that satisfies the following condition: if a feature vector fv at the chosen layer differs from the feature vector of in by at most the maximum perturbation amount in each dimension, then the prediction computed from fv in that output dimension lies within the interval.
Definition 2 can be understood operationally: first compute the feature vector of in at the chosen layer. Subsequently, perturb it with some noise bounded by the maximum perturbation amount in each dimension, in order to create a perturbed feature vector fv. Finally, continue the computation from the perturbed feature vector through the remaining layers; the computed prediction in each output dimension should be bounded by the corresponding interval. Note that Definition 2 only requires the bound to be an over-approximation of the set of all possible predicted values, as the logical implication is not bidirectional. The bound can be computed efficiently with GPU support via approaches such as abstract interpretation with the box domain (i.e., dataflow analysis [cousot1977abstract, cheng2017maximum]).
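The box-domain computation mentioned above can be sketched in pure Python. The sketch assumes, purely for illustration, that the layers after the perturbed layer are fully connected with an optional ReLU; the function names and the `(weights, bias, apply_relu)` layer encoding are hypothetical and not taken from the paper.

```python
def affine_bounds(lower, upper, weights, bias):
    """Propagate a box through y = W x + b; weights[i][k] multiplies input k."""
    out_low, out_up = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                lo += w * l   # smallest value uses the lower input bound
                hi += w * u
            else:
                lo += w * u   # a negative weight flips the roles of the bounds
                hi += w * l
        out_low.append(lo)
        out_up.append(hi)
    return out_low, out_up


def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to both end-points."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]


def output_bounds(feature_vec, max_perturbation, layers):
    """Over-approximate the outputs reachable from any feature vector within
    max_perturbation (per dimension) of feature_vec."""
    lower = [v - max_perturbation for v in feature_vec]
    upper = [v + max_perturbation for v in feature_vec]
    for weights, bias, apply_relu in layers:
        lower, upper = affine_bounds(lower, upper, weights, bias)
        if apply_relu:
            lower, upper = relu_bounds(lower, upper)
    return lower, upper


layers = [
    ([[1.0, -1.0], [2.0, 0.0]], [1.0, 0.0], True),   # hidden layer with ReLU
    ([[1.0, 1.0]], [0.0], False),                    # single regression output
]
print(output_bounds([1.0, 2.0], 0.5, layers))
print(output_bounds([1.0, 2.0], 0.0, layers))  # zero perturbation collapses to a point
```

The resulting interval is sound but not necessarily tight, which matches the over-approximation allowed by Definition 2.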
Given the output bound under -perturbation, our goal is to define a loss function that computes the overflow of output bounds over the range of tolerant values.
Definition 3 (Symbolic loss)
For each output of the neural network and each input in from the training set, let the output bound, with its lower and upper end-points, be obtained by feeding the network with in following Definition 2. Define the symbolic loss for that output, subject to perturbation with tolerance, as follows: take the maximally disjoint intervals of the output bound that lie outside the tolerance interval, and sum, over these intervals, the shortest distance between the center of each interval and the points in the tolerance interval.
The intuition behind the defined symbolic loss is to (i) compute the intervals of the output bound that are outside the tolerance and subsequently, (ii) consider the loss as the accumulated effort to bring the center of each interval back to the tolerance interval. Figure 3 illustrates the concept.
In Figure 3-a, as both the output lower-bound and the upper-bound are contained in the tolerance interval, the loss is set to zero.
For Figure 3-d, the output bound is while the tolerance interval is . Therefore, there are two maximally disjoint intervals and falling outside the tolerance. The loss is the distance between the center of each interval , to the tolerance boundary, which equals .
Lastly in Figure 3-e, the complete interval is outside the tolerance interval. The loss is the distance between the center of the interval () to the boundary, which equals .
Definition 4 (Symbolic tolerance loss)
Given a neural network , define the loss on the training set to be , where , , and is computed using Definition 2.
(Implementing the loss function with GPU support) The following result states that computing the symbolic loss for each output can be done very efficiently by averaging the interval loss at the lower and upper end-points of the output bound, thereby further utilizing the result from Lemma 2 for efficient computation via GPU support.
Lemma 3 (Computing symbolic loss by taking end-points)
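For each output of the neural network and each input in from the training set, the symbolic loss subject to perturbation with tolerance has the following property: it equals the average of the interval errors evaluated at the lower and upper end-points of the output bound, where the output bound is computed by feeding the network with in using Definition 2.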
(Sketch) Here for simplicity, we illustrate in Figure 3 all possible cases concerning the relative position between the output-bound interval and the tolerance interval. For each case, the results of computing the symbolic loss are shown directly in Figure 3. Readers can easily replace the constants in Figure 3 with variables to create a formal correctness proof.
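The end-point property can also be checked numerically. The sketch below computes the symbolic loss in two ways: from the parts of the output bound that fall outside the tolerance interval, and by averaging the interval error at the two end-points. The function names and the test intervals are hypothetical and are not the constants used in Figure 3.

```python
def interval_error(value, tol_low, tol_up):
    """Shortest distance between a point and the interval [tol_low, tol_up]."""
    return max(tol_low - value, 0.0, value - tol_up)


def symbolic_loss_by_intervals(out_low, out_up, tol_low, tol_up):
    """Sum, over the parts of [out_low, out_up] outside the tolerance interval,
    of the distance from the center of each part to the tolerance interval."""
    loss = 0.0
    if out_low < tol_low:                      # part lying below the tolerance
        part_up = min(out_up, tol_low)
        loss += interval_error((out_low + part_up) / 2.0, tol_low, tol_up)
    if out_up > tol_up:                        # part lying above the tolerance
        part_low = max(out_low, tol_up)
        loss += interval_error((part_low + out_up) / 2.0, tol_low, tol_up)
    return loss


def symbolic_loss_by_endpoints(out_low, out_up, tol_low, tol_up):
    """Average of the interval errors at the two end-points of the output bound."""
    return (interval_error(out_low, tol_low, tol_up)
            + interval_error(out_up, tol_low, tol_up)) / 2.0


# Output bounds in different positions relative to the tolerance interval [4, 8].
cases = [(5.0, 7.0), (2.0, 6.0), (6.0, 11.0), (1.0, 10.0), (9.0, 13.0), (0.0, 2.0)]
for out_low, out_up in cases:
    print((out_low, out_up),
          symbolic_loss_by_intervals(out_low, out_up, 4.0, 8.0),
          symbolic_loss_by_endpoints(out_low, out_up, 4.0, 8.0))
```

The end-point form needs no case analysis over the shape of the overlap, so it can reuse the same clip-style primitive as the interval error.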
One immediate observation from Lemma 3 is that when the maximum allowed perturbation equals zero, no feature perturbation appears, and the output bound can be as tight as a single point, namely the prediction itself. Therefore, the lower and upper end-points coincide with the prediction, which enables the following simplification.
Lemma 4
When the maximum allowed perturbation is zero, values computed using symbolic loss can be the same as values computed from interval loss.
The above lemmas altogether offer a pragmatic method for training. First, one can train a network using the MSE loss function; based on the chained rule of Lemma 1 and Lemma 3, using MSE loss is equivalent to the special case where both the maximum perturbation and the tolerance are zero. Subsequently, one can train the network with interval loss; it is equivalent to the special case where the maximum perturbation is zero. Finally, one enlarges the maximum perturbation towards provably robust training.
Finally, we summarize the theoretical guarantee that the new training approach provides. Intuitively, the lemma below states that if there exists an input in’ (not necessarily contained in the training data) whose feature vector is sufficiently close to the feature vector of an existing training input in, then the output of the network under in’ will be close to the output of the network under in.
Lemma 5 (Theoretical guarantee on provable training)
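Given a neural network and a training data set, if the symbolic tolerance loss on the training set is zero, then for every input in’, if there exists a training input in whose feature vector at the chosen layer differs from the feature vector of in’ by at most the maximum perturbation amount in each dimension, then the output of the network under in’ lies within the tolerance interval around the label of in.
When the loss on the training set is zero, from Definition 4 one knows that for every training input in, the corresponding symbolic loss is zero for every output. This implies that the output lower-bound and upper-bound, computed using Definition 3 with input in, are contained in the tolerance interval around the label of in.
In Definition 3, the computation of the lower-bound and upper-bound considers every feature vector within the maximum perturbation of the feature vector of in. Therefore, so long as the feature vector of in’ lies within that perturbation range, the output of the neural network under in’ should be within the output bound, thereby within the tolerance interval.
As a consequence, if one perturbs a data point in from the training set to in’, so long as the perturbed input produces a similar high-level feature vector at the chosen layer, the output under perturbation is provably guaranteed to fall into the tolerance interval.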
To understand the proposed concept in a realistic setup, we engineered a direct perception network for identifying the horizontal position of the center of the ego lane at a fixed height in pixel coordinates. (It is possible to train a network to produce multiple affordances. Nevertheless, in the evaluation, we decided to produce only one affordance in order to clearly understand the impact of the methodology on robustness.) For repeatability purposes, we take the publicly available TuSimple dataset for lane detection (available at: https://github.com/TuSimple/tusimple-benchmark/wiki) and create labels from its associated ground truth.
IV-A Creating data for experimenting with direct perception
In the TuSimple lane detection dataset, labels for lane markings contain three parts:
A list of vertical (height) coordinates that are used to represent a lane.
A list of lanes, where each lane stores a list of horizontal coordinates.
The corresponding image raw file.
Therefore, for each lane in an image, its lane markings are the points that pair its first horizontal coordinate with the first height, its second horizontal coordinate with the second height, and so on. The lanes are mostly ordered from left to right, with some exceptions (e.g., files clips/0313-1/21180/20.jpg and clips/0313-2/550/20.jpg) where one needs to manually reorder the lanes.
We created a script to automatically generate affordance labels for our experiment. First, fix the height, then find two adjacent lanes where (1) the first lane marking is on the left side of the image, and (2) the second lane marking is on the right side. If the script cannot find two such lanes, it simply omits the data, as manual labelling would be required. Subsequently, we take the average horizontal position of the two lanes to be the center of the ego lane (i.e., the label). Therefore, every output label is an integer between and . See Figure 4 for the ground truth of the lane marking and the generated center-of-ego-lane position (small green dot). Furthermore, we duplicate images whose created labels are far ( pixels) from the center of the image, to highlight the importance of rare events and to compensate for the lack of labelled data.
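The label-generation steps above can be sketched as follows. This is a minimal illustration, not the authors' script: the key names `lanes` and `h_samples` follow the TuSimple label format as we understand it, negative horizontal values are assumed to mark heights where a lane has no marking, and the height and image width used in the example are placeholder values. Sorting the markings by horizontal position is a choice made here to sidestep the lane-ordering exceptions.

```python
def ego_lane_center(lanes, h_samples, height, image_width):
    """Return the integer center of the ego lane at the given height,
    or None when no suitable pair of adjacent lanes is found."""
    if height not in h_samples:
        return None
    idx = h_samples.index(height)
    # Horizontal positions of all lanes that have a marking at this height
    # (assumption: negative values mean "no marking here").
    xs = sorted(lane[idx] for lane in lanes if lane[idx] >= 0)
    middle = image_width / 2.0
    for left, right in zip(xs, xs[1:]):
        # Adjacent markings with one on the left half and one on the right half.
        if left < middle <= right:
            return (left + right) // 2
    return None


# Hypothetical record with three lanes sampled at three heights.
record = {
    "h_samples": [240, 250, 260],
    "lanes": [[-2, 100, 90], [-2, 500, 480], [900, 880, 860]],
}
print(ego_lane_center(record["lanes"], record["h_samples"], 250, 1280))
print(ego_lane_center(record["lanes"], record["h_samples"], 240, 1280))  # omitted sample
```

Returning `None` corresponds to omitting the data point, as described above for images that would need manual labelling.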
Each image in the TuSimple dataset has a size of . We crop the vertical direction to keep only pixels with indices in range , as the cropped elements are largely sky and cloud. Subsequently, we resize the image by and make it grayscale. This ultimately creates, for each input image, a tensor of dimension (in TensorFlow, the shape of the tensor equals ). We also perform simple normalization using such that the value of each pixel, originally in the range , is now in the interval .
In our experiment, we use a network architecture similar to the one shown in Figure 5. As a baseline, we train networks using Xavier weight initialization [glorot2010understanding], with each network starting with a unique random seed between and . This is for repeatability purposes (via fixing random seeds) and for eliminating manual knowledge bias in summarizing our findings (via training many models). The training uses the Adaptive Moment Estimation (Adam) optimization algorithm [kingma2014adam] with learning rate for epochs and subsequently, for another epochs. Finally, we take the best-performing models and further apply robust training techniques, using the Adam optimization algorithm with for yet another epochs. In terms of average-case performance, the baseline model and the model further trained with robust loss perform similarly.
We use the single-step (non-iterative) fast gradient sign method (FGSM) [szegedy2013intriguing] as the baseline perturbation technique to understand the effect of applying robust training. Precisely, we compare the minimum step size needed to make an originally perfect prediction (both for the standard network and the network further trained using robust loss) deviate by a given number of pixels. As we use the single-step method, the step size is directly related to the intensity of perturbation. The bar-chart figure shows the overall summary for each baseline model and its further (robustly) trained model. In some models (such as model 2), further training does not lead to significant improvement, as for these networks the minimum step sizes enabling successful perturbations are largely similar. Nevertheless, for other models such as model 1 or model 6, a large portion of the images require a larger step size in the robustly trained model in order to successfully create the adversarial effect. A further figure details the step size required for successful attacks in model 1, where each image is a point in the coordinate plane. One immediately observes that the majority of the points are located at the top-left of the coordinate plane, i.e., models under robust training require a larger amount of perturbation to reach the desired effect.
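The minimum-step-size comparison can be illustrated with a toy sketch. It replaces the direct perception network with a hypothetical linear regressor and perturbs the input along the sign of the gradient of the output (rather than of a training loss), which is an assumption made here so that the example stays self-contained. The names, weights, and candidate step sizes are all illustrative.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def predict(x, weights, bias):
    """Toy linear regressor standing in for the perception network."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def single_step_perturb(x, grad, step):
    """One sign-gradient step: move every input dimension by +/- step."""
    return [xi + step * sign(g) for xi, g in zip(x, grad)]


def min_step_for_deviation(x, weights, bias, threshold, candidate_steps):
    """Smallest candidate step whose perturbed prediction deviates from the
    original prediction by at least threshold, or None if no candidate does."""
    base = predict(x, weights, bias)
    for step in sorted(candidate_steps):
        # For a linear model, the gradient of the output w.r.t. the input is the weight vector.
        perturbed = single_step_perturb(x, weights, step)
        if abs(predict(perturbed, weights, bias) - base) >= threshold:
            return step
    return None


weights, bias = [2.0, -1.0, 0.0], 0.5
x = [1.0, 1.0, 1.0]
print(min_step_for_deviation(x, weights, bias, 3.0, [0.25, 0.5, 1.0, 2.0]))
```

Running the same search on a baseline model and on its robustly trained counterpart, image by image, yields the pairs of step sizes that the comparison above is based on.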
Although our initial evaluation is promising, it is important to understand that a more systematic analysis, such as evaluating the technique on multiple data sets and developing a deeper understanding of the parameter space of tolerance, layer index, and maximum perturbation, is needed to make the technology truly useful.