The development of algorithms for autonomous navigation of robotic systems in complex unstructured environments is attracting significant research interest due to the emergence of more accurate and cheaper sensors, faster embedded computing devices (e.g., GPUs), and vast amounts of data. In particular, learning-based algorithms are widely used because large datasets are available to train either the perception subsystems or the end-to-end control policy for robust autonomous operation of the robotic system [1, 2].
These learning-based systems cannot be trained in all possible environments and environmental conditions. Such systems are also vulnerable to adversarial attacks [3, 4]. Recently, approaches based on generative modeling [5, 6] and gradient-based search have been proposed to address this challenge. Formal verification and Bayesian optimization are promising but computationally expensive.
For real-time detection of anomalies/malfunctions in a cyber-physical system [11, 12], we propose the framework in Figure 1, which combines complementary anomaly monitoring methods [10, 13], namely a controller-focused anomaly monitor (CFAM) and a system-focused anomaly monitor (SFAM). CFAM uses an image-conditioned EBGAN to validate the mapping from the sensor data to the actuator command. SFAM uses an action-conditioned video prediction system to validate the mapping from the system action to the sensor data. The proposed learning-based anomaly detection framework continuously validates and verifies autonomous operation at run time. The novel aspects of this study are:
- A learning-based online framework to monitor the controller and the system behavior.
- A conditional EBGAN architecture to detect anomalies in controller outputs.
- An action-conditioned video prediction framework to detect anomalies in system behavior.
- A methodology to train a robust video prediction architecture to detect anomalies.
This paper is organized as follows. The background and the problem are discussed in Section II. The architecture, training methodology, and anomaly detection for CFAM and SFAM are described in Section III. Section IV reports results on indoor (collected from our experimental unmanned ground vehicle) and outdoor (Udacity) datasets. Section V concludes the paper.
II Problem Formulation
The presented anomaly detection applies to any closed-loop system such as an autonomous vehicle. In this paper, we consider an unmanned ground vehicle (UGV) instrumented with camera(s) and LIDAR and controlled by an end-to-end learning system. At each time step, the vehicle receives an input command and produces a corresponding sensor output.
The CFAM operates on the stream of sensor data and actuator commands to monitor the validity of the mapping between them and detects anomalies related to the controller. These anomalies can be due to distributional shift (e.g., when the testing and training environments are different) or due to sensor malfunctions/failures/attacks. The approach for CFAM is based on learning a sensor data (image) conditioned energy-based generative adversarial network (CEBGAN), shown in Figure 2, which outputs low energy for valid actuator (steering) commands and high energy for anomalous commands.
The SFAM monitors the validity of the (dynamic) mapping from system action to sensor data (over a time horizon into the future) and detects anomalies caused by malfunctions in the system. Malfunctions/anomalies can occur due to environmental perturbations (e.g., a slippery road) or partial/full subsystem failures (e.g., electrical, mechanical) that cause the system not to act according to the given actuator command. The framework addresses this problem by learning the dynamics of the overall system through action-conditioned video prediction, as shown in Figure 4. In the implementation, the framework takes a sequence of camera frames and actuator commands as input to predict a future frame, which is used to validate the mapping from system action to sensor data. The video prediction and anomaly detection architectures for the proposed framework are shown in Figures 3 to 5.
III Proposed Framework
III-A Image conditioned EBGAN
III-A1 Model architecture
Motivated by the success of GANs in a variety of applications, we propose to use the discriminator of a GAN for CFAM. We combine the conditional GAN with EBGAN to obtain CEBGAN and explore its potential to detect controller-generated anomalies.
Inputs to the generator are a noise vector sampled from a uniform distribution and a condition image. Discriminator inputs are an actuator command and the condition image. The discriminator learns to map these two inputs to a scalar value (energy), while the generator learns to predict the steering command from the condition image. The discriminator output is computed as the mean squared error between the input steering command and the condition-driven steering command prediction. For CFAM, the discriminator is the key component and the generator is merely trained to produce contrastive samples. We train both simultaneously and use only the discriminator in CFAM.
In CEBGAN, the generator and the discriminator are conditioned on camera input images. The condition image feature vector is computed by passing the image through a series of convolutional layers. In our experiments with the indoor (UGV) and outdoor (Udacity) datasets, all images are first rescaled to 3x128x128, and the best-performing discriminator used six convolution layers for image feature extraction in both the generator and the discriminator. The numbers of kernels in the first through sixth convolution layers are 8, 16, 32, 64, 128, and 256, respectively; the kernel size is fixed at 4x4 and applied with stride 2. Spatial batch normalization is performed before each convolution layer, and LeakyReLU activations with a negative slope of 0.2 are used for all layers except the output, which uses Tanh. The resulting feature map is then reshaped into a vector.
In the generator, the noise input is mapped to a fully connected hidden layer of the same size as the feature vector computed from the condition image, to allow their summation. Then, in both the generator and the discriminator, the sum of these vectors is mapped to another fully connected hidden layer before it is reduced to a single output neuron, as shown in Figure 2.
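The size of the condition image feature vector follows from the layer settings above. The short sketch below works through the arithmetic for a 128x128 input passed through six 4x4, stride-2 convolutions. The padding is not stated in the text; a padding of 1 is assumed here, and the function names are illustrative only.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial size after one convolution (floor division, as in common frameworks)."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_shape(size=128, channels=(8, 16, 32, 64, 128, 256), padding=1):
    """Return (channels, height, width) after the six-layer convolution stack.

    padding=1 is an assumption; the text specifies only kernel size and stride.
    """
    for _ in channels:
        size = conv_out(size, padding=padding)
        if size < 1:
            raise ValueError("feature map collapsed before the last layer")
    return channels[-1], size, size


if __name__ == "__main__":
    c, h, w = feature_shape()
    print("feature map:", (c, h, w), "flattened length:", c * h * w)
```

Under this assumption the spatial size halves at every layer. With zero padding the sixth layer would have no valid output for a 128x128 input, which is why some padding is assumed.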
III-A2 Training methodology
In CEBGAN, the discriminator is trained with an objective function that shapes the energy function to attribute low energies to correct steering commands and high energies to the generated (or anomalous) commands. We use different loss functions to train the discriminator and the generator. Given a positive margin, a condition image, the true steering command, and the generated steering command, the discriminator loss and the generator loss are defined in terms of the discriminator output (energy) for the true and the generated commands. The generator and discriminator parameters are optimized using the Adam optimizer.
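As a minimal sketch of margin-based losses of this kind, the following assumes the standard EBGAN form (Zhao et al., listed in the references) extended with the condition image. With energy D(u, c) for a command u and condition image c, margin m, and generated command G(z, c), the discriminator loss would be D(u, c) + max(0, m - D(G(z, c), c)) and the generator loss would be D(G(z, c), c). The symbols and function names are illustrative assumptions, not notation taken from the paper.

```python
def energy(command, predicted_command):
    """Energy as the squared error between an input steering command and the
    condition-driven prediction (scalar case of the mean squared error)."""
    return (command - predicted_command) ** 2


def discriminator_loss(energy_real, energy_fake, margin):
    """Assumed EBGAN-style loss: low energy for true commands, and energy of
    generated commands pushed up until it reaches the margin."""
    return energy_real + max(0.0, margin - energy_fake)


def generator_loss(energy_fake):
    """Assumed EBGAN-style loss: the generator lowers the energy of its samples."""
    return energy_fake


if __name__ == "__main__":
    predicted = 0.10                    # hypothetical condition-driven prediction
    e_real = energy(0.12, predicted)    # true command, close to the prediction
    e_fake = energy(0.60, predicted)    # generated command, far from it
    print(discriminator_loss(e_real, e_fake, margin=1.0), generator_loss(e_fake))
```

Once a generated command already has energy above the margin, the hinge term vanishes and only the energy of the true command contributes to the discriminator loss.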
III-A3 Anomaly detection framework
The real-time CFAM module consists of a discriminator trained as described in Section III-A2. Given a condition image, energies (discriminator outputs) are computed for N equally spaced points over the range of steering command values. The validity of the incoming sensor data is determined by computing the steering command deviation, defined as the magnitude of the difference between the controller output (steering command) and the steering command with minimum energy. Data points whose deviation exceeds a threshold are flagged as anomalous.
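The detection rule above can be sketched as follows. The trained discriminator is replaced by a toy energy function for a fixed condition image, and the command range, N, and threshold are hypothetical values chosen only for illustration.

```python
def linspace(lo, hi, n):
    """n equally spaced points from lo to hi, inclusive."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def min_energy_command(energy_fn, lo, hi, n):
    """Candidate steering command with the lowest energy."""
    return min(linspace(lo, hi, n), key=energy_fn)


def cfam_check(controller_cmd, energy_fn, lo, hi, n, threshold):
    """Return (is_anomalous, deviation) for one controller output."""
    best = min_energy_command(energy_fn, lo, hi, n)
    deviation = abs(controller_cmd - best)
    return deviation > threshold, deviation


def toy_energy(u):
    """Stand-in for the discriminator with a fixed condition image
    (hypothetical: lowest energy near a steering command of 0.1)."""
    return (u - 0.1) ** 2


if __name__ == "__main__":
    for cmd in (0.12, -0.40):
        print(cmd, cfam_check(cmd, toy_energy, -0.5, 0.5, 11, threshold=0.1))
```

Because the energies are evaluated on a grid, the resolution of the minimum-energy command is limited by the grid spacing, so the threshold should not be smaller than that spacing.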
III-B Adversarially learned action conditioned video prediction
III-B1 Model architecture
Learning-based video prediction has been used in various control applications [18, 19, 20, 21]. The SFAM in Figure 3 consists of the action-conditioned video prediction architecture shown in Figure 4, a dissimilarity computation module, and a validation module. The video prediction architecture is based on the four-layer PredNet. Each layer has four sub-modules: (1) convolution, (2) prediction representation, (3) recurrent representation, and (4) error representation.
All convolutions use 3x3 kernels, and max-pooling uses a 2x2 kernel with stride 2x2. The numbers of output convolution channels for the prediction and target representation sub-modules in layers 1 to 4 are 3, 32, 48, and 64, respectively.
The inputs to the network are a sequence of images and the steering commands at the next time step, as shown in Figure 5. At each time step, each layer has a prediction representation and a target representation, generated by the prediction and convolution sub-modules, respectively. At the first layer, the target representation is the actual frame. In layers 2, 3, and 4, the target representation is generated by the convolution sub-module from the error representation of the previous layer. The convolution sub-module has a convolution layer with ReLU nonlinearity followed by a max-pooling layer.
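To make the layer hierarchy concrete, the sketch below lists the shapes of the target (and prediction) representations for the 120x160 input resolution used in the experiments. It assumes that the 3x3 convolutions preserve spatial size (padding is not stated in the text), so that only the 2x2, stride-2 max-pooling between layers changes the resolution.

```python
def layer_shapes(height=120, width=160, channels=(3, 32, 48, 64), pool=2):
    """(channels, height, width) of the target representation in each layer.

    Assumes size-preserving convolutions; pooling halves the resolution
    when moving from one layer to the next.
    """
    shapes = []
    for index, ch in enumerate(channels):
        if index > 0:
            height //= pool
            width //= pool
        shapes.append((ch, height, width))
    return shapes


if __name__ == "__main__":
    for layer, shape in enumerate(layer_shapes(), start=1):
        print("layer", layer, shape)
```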
The prediction representation is generated by the prediction sub-module, which takes the recurrent representation of the current layer as input. The prediction sub-module consists of a convolution layer with ReLU nonlinearity, except at the first layer, where it is followed by a SatLU nonlinearity that saturates the values at the maximum pixel intensity.
The recurrent representation is generated by a convolutional LSTM whose inputs are the error representation from the previous time step of the same layer combined with the upsampled recurrent representation of the next higher layer. The hidden state is the recurrent representation from the prior time step.
The error representation is generated by combining the feature maps of the two differences (prediction minus target, and target minus prediction), each followed by a ReLU nonlinearity. At the first time step, all error representations are reset to zero. The steering command is introduced at the next time step as the input to generate the action-conditioned frame prediction. It is tiled to the spatial dimensions of the fourth-layer error representation, concatenated with it, and used as input to the fourth recurrent representation layer.
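The error representation and the tiling of the steering command can be sketched on small single-channel feature maps stored as nested lists. The sketch assumes that the two rectified differences are stacked as separate channels, as in the original PredNet; all names are illustrative.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def error_representation(prediction, target):
    """Two-channel error map: ReLU(prediction - target) and ReLU(target - prediction)."""
    positive = [[relu(p - t) for p, t in zip(p_row, t_row)]
                for p_row, t_row in zip(prediction, target)]
    negative = [[relu(t - p) for p, t in zip(p_row, t_row)]
                for p_row, t_row in zip(prediction, target)]
    return [positive, negative]


def tile_command(command, height, width):
    """Constant feature map filled with the scalar steering command."""
    return [[command] * width for _ in range(height)]


def fourth_layer_lstm_input(error_channels, command):
    """Append the tiled steering command as an extra channel to the error maps."""
    height = len(error_channels[0])
    width = len(error_channels[0][0])
    return error_channels + [tile_command(command, height, width)]


if __name__ == "__main__":
    err = error_representation([[0.2, 0.8]], [[0.5, 0.5]])
    print(fourth_layer_lstm_input(err, command=0.1))
```

Keeping the positive and negative errors in separate channels preserves the sign of the prediction error even though each channel is non-negative.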
III-B2 Training methodology
The video prediction architecture takes as input four RGB images (scaled between 0 and 1) from a monocular camera with a resolution of 120x160, together with steering commands, to predict the future frame. The training has two stages: error representation minimization and adversarial optimization. The weights are updated using an Adam optimizer with a base learning rate of 0.001.
The loss function for the error representation minimization is a weighted combination of three terms: the average of all error representations over all time steps and layers (weight 0.1), the negative of the structural similarity (SSIM, kernel size 5) between the predicted frame and the actual frame (weight 1), and the SSIM between the predicted frame and the previous frame (weight 0.5).
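The weighting of the three terms can be sketched as follows. For brevity, the sketch uses a single global SSIM window over flattened frames rather than the 5x5 windowed SSIM mentioned above, with the commonly used stabilizing constants (0.01 and 0.03 times the data range, squared). These simplifications and all names are assumptions made for illustration.

```python
def ssim_global(x, y, data_range=1.0):
    """SSIM computed over one global window of two equally sized flat frames."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) * (a - mx) for a in x) / n
    vy = sum((b - my) * (b - my) for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))


def stage_one_loss(mean_error, predicted, actual, previous):
    """0.1 * mean error - 1.0 * SSIM(predicted, actual) + 0.5 * SSIM(predicted, previous)."""
    return (0.1 * mean_error
            - 1.0 * ssim_global(predicted, actual)
            + 0.5 * ssim_global(predicted, previous))


if __name__ == "__main__":
    actual = [0.1, 0.5, 0.9, 0.3]     # hypothetical flattened frames
    previous = [0.9, 0.1, 0.2, 0.8]
    print(stage_one_loss(0.0, actual, actual, previous))      # prediction matches the actual frame
    print(stage_one_loss(0.0, previous, actual, previous))    # prediction copies the previous frame
```

With these weights, a prediction that matches the actual frame receives a lower loss than one that merely reproduces the previous frame.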
Adversarial optimization is introduced to curtail blurry prediction images. In this stage, the video prediction architecture is the generator and is combined with a spectrally normalized discriminator with N+1 labels, where one label is the fake/real label and the other labels are N steering commands, as in Figure 5. We use 15 equally spaced labels for steering command actions for both the indoor and Udacity datasets, with ranges extending up to 0.28 and 0.56, respectively. The discriminator consists of nine spectrally normalized convolution layers, with LeakyReLU nonlinearities (negative slope 0.1) between them and kernel sizes alternating between 3x3 and 4x4 with strides of 1x1 and 2x2, respectively, followed by a spectrally normalized fully connected layer that outputs a vector of size 16. The GAN is trained with an additional regularization term that makes it robust to out-of-distribution samples.
The regularization term added while updating the generator parameters is a divergence term that involves the uniform distribution over the steering command labels, the generator, the video prediction network input, and the parameters of the generator. This term forces the generator to output out-of-distribution samples that lie in the low-data-density region. The generator is updated by minimizing the weighted sum of the cross-entropy loss (fake/real, action likelihood estimate) and the divergence term introduced above, with weights randomly generated between 0 and 1.
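A minimal sketch of such a regularizer is given below. It assumes that the divergence is the Kullback-Leibler divergence from the uniform distribution over the steering labels to the label distribution that the discriminator predicts for a generated frame, in the spirit of the confidence-calibration approach of Lee et al. listed in the references. The exact term is not recoverable from the text, so the functions and the random weighting shown here are illustrative assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_uniform_to(probs):
    """KL(U || p) for a uniform U over len(probs) labels; zero when p is uniform."""
    u = 1.0 / len(probs)
    return sum(u * math.log(u / p) for p in probs)


def generator_objective(cross_entropy, divergence, rng):
    """Weighted sum with weights drawn at random from [0, 1)."""
    w_ce, w_div = rng.random(), rng.random()
    return w_ce * cross_entropy + w_div * divergence


if __name__ == "__main__":
    rng = random.Random(0)
    confident = softmax([5.0] + [0.0] * 14)   # hypothetical logits over 15 steering labels
    print(kl_uniform_to(confident), kl_uniform_to([1.0 / 15] * 15))
    print(generator_objective(2.0, kl_uniform_to(confident), rng))
```

The divergence is zero only when the predicted label distribution is uniform, so minimizing it rewards generated frames on which the steering label is maximally uncertain.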
III-B3 Anomaly detection framework
The video prediction framework in Figure 4 generates one video prediction for each candidate steering command. These video predictions are input to the dissimilarity computation module shown in Figure 3, which calculates the dissimilarity, (1-SSIM)/2, of each prediction with the actual future frame. These dissimilarities are input to the validation module, which selects the action corresponding to the least dissimilar predicted frame and compares it with the actual future action to validate the mapping from the actuator command (system action) to sensor data. A windowing strategy over multiple prediction frames is used to validate the mapping from system action to sensor data.
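The validation step can be sketched as follows, starting from SSIM values that are assumed to have been computed already for each candidate action. The details of the windowing strategy are not given in the text; the rule used here (flag an anomaly when enough frames in a window exceed the deviation threshold) and all numeric values are hypothetical.

```python
def dissimilarity(ssim_value):
    """Map SSIM in [-1, 1] to a dissimilarity in [0, 1]."""
    return (1.0 - ssim_value) / 2.0


def select_action(ssim_by_action):
    """Candidate action whose predicted frame is least dissimilar to the actual frame."""
    return min(ssim_by_action, key=lambda a: dissimilarity(ssim_by_action[a]))


def window_is_anomalous(selected, actual, threshold, min_violations):
    """Assumed windowing rule: count frames whose deviation exceeds the threshold."""
    violations = sum(1 for s, a in zip(selected, actual) if abs(s - a) > threshold)
    return violations >= min_violations


if __name__ == "__main__":
    # Hypothetical SSIM values for three candidate steering commands per frame.
    frames = [
        {-0.2: 0.40, 0.0: 0.90, 0.2: 0.60},
        {-0.2: 0.35, 0.0: 0.88, 0.2: 0.70},
        {-0.2: 0.30, 0.0: 0.85, 0.2: 0.65},
    ]
    selected = [select_action(f) for f in frames]
    commanded = [0.2, 0.2, 0.2]          # commands actually issued to the vehicle
    print(selected, window_is_anomalous(selected, commanded, 0.1, 2))
```

In this toy case the frames consistently look like the result of driving straight although a turn was commanded, so the window is flagged.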
IV Experimental Studies
We empirically validated CFAM and SFAM on indoor and outdoor datasets. The indoor dataset was collected by a human controller driving a UGV in a corridor with varying lighting conditions and changing obstacle placements [2, 26]. The Udacity dataset has images recorded while driving on highways and residential roads (with and without lane markings) in clear weather during daytime and includes driver activities such as staying in and switching lanes.
IV-B Simulated Scenarios
Datasets of RGB camera images and the corresponding steering commands for various anomalous and non-anomalous scenarios were created. To simulate anomalous scenarios due to malfunctions/anomalies in the controller or the external system, the controller output was overridden by a human driver to create anomalous driving conditions. To test CFAM, the human-created anomalous inputs were provided to the CFAM as controller-generated commands. To test SFAM, the correct controller-generated steering commands were provided to the SFAM together with the sensor data recorded while the robotic vehicle was actually given the overridden steering commands.
Two complementary anomalous datasets were simulated by the human driver for the UGV in the indoor environment, resulting in late-right and early-left crashes into a wall. In the late-right turn scenario, the driver provided steering commands to go straight when the vehicle had to take a right turn. In the early-left crash scenario, the driver provided steering commands to move left when the vehicle had to go straight.
CFAM and SFAM performance for non-anomalous scenarios is shown in Figures 7 and 10, respectively, and for anomalous scenarios in Figures 8 and 11, respectively. The predictions generated by the video prediction architecture in SFAM under varying lighting are shown in Figure 9. By using a threshold on the deviation between the actual steering command and the steering command corresponding to the minimum dissimilarity score, a Boolean anomalous/non-anomalous value can be generated, as shown in Figure 6.
The proposed framework continuously validates the mappings from sensor data to actuator command and from actuator command to (future) sensor data, and thereby detects anomalies/malfunctions in the overall system.
-  M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, “End to end learning for self-driving cars,” CoRR, vol. abs/1604.07316, 2016.
-  N. Patel, A. Choromanska, P. Krishnamurthy, and F. Khorrami, “Sensor modality fusion with CNNs for UGV autonomous driving in indoor environments,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Vancouver, Canada, Sep 2017, pp. 1531–1536.
-  K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song, “Robust physical-world attacks on deep learning visual classification,” in
-  N. Patel, K. Liu, P. Krishnamurthy, S. Garg, and F. Khorrami, “Lack of robustness of LIDAR-based deep learning systems to small adversarial perturbations,” in Proceedings of the 50th International Symposium on Robotics, Munich, Germany, Jun 2018, pp. 359–365.
-  S. Liang, Y. Li, and R. Srikant, “Principled detection of out-of-distribution examples in neural networks,” in Proceedings of the International Conference on Learning Representations, Vancouver, Canada, May 2018.
-  K. Lee, H. Lee, K. Lee, and J. Shin, “Training confidence-calibrated classifiers for detecting out-of-distribution samples,” in Proceedings of the International Conference on Learning Representations, Vancouver, Canada, May 2018.
-  Y. Tian, K. Pei, S. Jana, and B. Ray, “Deeptest: automated testing of deep-neural-network-driven autonomous cars,” in Proceedings of the 40th International Conference on Software Engineering, Gothenburg, Sweden, May 2018, pp. 303–314.
-  G. Katz, C. W. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, “Reluplex: An efficient SMT solver for verifying deep neural networks,” in Proceedings of the International Conference on Computer Aided Verification, Heidelberg, Germany, Jul 2017, pp. 97–117.
-  S. Ghosh, F. Berkenkamp, G. Ranade, S. Qadeer, and A. Kapoor, “Verifying controllers against adversarial examples with bayesian optimization,” in Proceedings of the IEEE International Conference on Robotics and Automation, Brisbane, Australia, May 2018.
-  F. Khorrami, P. Krishnamurthy, and R. Karri, “Cybersecurity for control systems: A process-aware perspective,” IEEE Design Test, vol. 33, no. 5, pp. 75–83, Oct 2016.
-  A. Keliris, H. Salehghaffari, B. R. Cairl, P. Krishnamurthy, M. Maniatakos, and F. Khorrami, “Machine learning-based defense against process-aware attacks on industrial control systems,” in Proceedings of the IEEE International Test Conference, Fort Worth, USA, November 2016, pp. 1–10.
-  H. Amrouch, P. Krishnamurthy, N. Patel, J. Henkel, R. Karri, and F. Khorrami, “Emerging (un-)reliability based security threats and mitigations for embedded systems,” in Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, Seoul, Republic of Korea, Oct 2017, pp. 17:1–17:10.
-  P. Krishnamurthy, F. Khorrami, R. Karri, and H. Salehghaffari, “Process-aware side channel shaping and watermarking for cyber-physical systems,” in Proceedings of the 35th American Control Conference, Milwaukee, USA, Jun 2018.
-  Udacity. (2017) Public driving dataset. [Online]. Available: https://www.udacity.com/self-driving-car
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of the Annual Conference on Advances in Neural Information Processing Systems, Montreal, Canada, Dec 2014, pp. 2672–2680.
-  M. Mirza and S. Osindero, “Conditional generative adversarial nets,” CoRR, vol. abs/1411.1784, 2014.
-  J. Zhao, M. Mathieu, and Y. LeCun, “Energy-based generative adversarial networks,” in Proceedings of the International Conference on Learning Representations, Toulon, France, Apr 2017.
-  N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised learning of video representations using lstms,” in Proceedings of the 32nd International Conference on Machine Learning, Lille, France, Jul 2015, pp. 843–852.
-  J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh, “Action-conditional video prediction using deep networks in atari games,” in Proceedings of the Annual Conference on Advances in Neural Information Processing Systems, Montreal, Canada, Dec 2015, pp. 2863–2871.
-  C. Finn and S. Levine, “Deep visual foresight for planning robot motion,” in Proceedings of the IEEE International Conference on Robotics and Automation, Singapore, Singapore, Jun 2017, pp. 2786–2793.
-  M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine, “Stochastic variational video prediction,” in Proceedings of the International Conference on Learning Representations, Vancouver, Canada, May 2018.
-  W. Lotter, G. Kreiman, and D. Cox, “Deep predictive coding networks for video prediction and unsupervised learning,” in Proceedings of the International Conference on Learning Representations, Toulon, France, Apr 2017.
-  M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.
-  T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” in Proceedings of the International Conference on Learning Representations, Vancouver, Canada, May 2018.
-  T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Proceedings of the Annual Conference on Advances in Neural Information Processing Systems, Barcelona, Spain, Dec 2016, pp. 2226–2234.
-  N. Patel, P. Krishnamurthy, and F. Khorrami, “Semantic segmentation guided SLAM using vision and LIDAR,” in Proceedings of the 50th International Symposium on Robotics, Munich, Germany, Jun 2018, pp. 352–358.