Anomalies are unexpected or abnormal behaviours of systems; depending on the context, they are also referred to as outliers, exceptions, peculiarities, discordant observations, contaminants, or aberrations. Anomaly detection is a crucial task in many applications, such as fraud detection for credit card transactions, intrusion detection for cyber-security systems, illegal activity detection for surveillance systems, and fault detection in safety-critical systems. We consider the real-time detection of abnormalities of ground and aerial heterogeneous autonomous systems using their embedded sensor data. For this purpose, information given by the camera (i.e., video) and the inertial measurement unit (IMU) sensor can be exploited to detect anomalies. Employing both video and IMU data improves the robustness of anomaly detection, since the information about some anomalies may be available in only one of the two modalities. For example, a sudden obstruction in front of a regularly moving system can mostly be detected using video data only.
We divide the broad range of anomalies in autonomous agents into two basic categories based on the circumstance of occurrence. They are:
Internal anomalies - Anomalies that occur in the motion of the vehicle such as vehicle acceleration and orientation.
External anomalies - Anomalies that occur in the external environment such as obstacles and external objects moving towards the vehicle.
We also divide the anomalies into two categories based on how they can be captured, as follows:
Instant anomalies - Anomalies which occur and can be observed in time instances, such as high or low values of acceleration and spatial anomalies in camera images.
Transitional anomalies - Anomalies which occur and can be observed in transitions between subsequent time instances, such as abnormal transitions in subsequent camera images and abnormal transitions in velocity readings.
Fig. 0(a) illustrates two examples of instant anomalies, which are a spatial anomaly (left) and unusual values in linear acceleration (right). The image sequence shown in Fig. 0(b) shows an example of a transitional anomaly, where the transitions between subsequent camera frames are anomalous.
In this paper, we propose multiple self-supervised algorithms to capture all types of anomalies discussed above. In particular, to detect both internal and external anomalies, we process IMU and camera data with standalone algorithms. While IMU data are particularly useful in detecting internal anomalies, camera frame sequences can be used to detect both internal and external anomalies. To capture both instant and transitional anomalies, we propose reconstruction-based and forecasting-based algorithms. We assume that in the event of an instant anomaly, a model which has been trained to reconstruct its own input sample would get confused and return a high reconstruction error. Also, we assume that in the event of a transitional anomaly, a model which is trained to forecast the next instance from previous instances would get confused and return a high forecasting error.
To identify anomalies in video data, the immediate-past three frames are fed into a parametric model, which captures both sequential and spatial information to predict the next frame. The next-frame prediction error is compared against a threshold to determine whether the frame corresponds to an anomaly or not. We employ two approaches to detect anomalies with IMU data: the first is a Long Short Term Memory (LSTM) based autoencoder and the second is an LSTM based forecaster. The LSTM autoencoder reconstructs three consecutive IMU vectors and the LSTM forecaster predicts the next vector using the previous three IMU vectors. Based on the reconstruction error or the prediction error, and a threshold, timestamps are classified as anomalous or not. We implement our system using Robot Operating System (ROS) for real-time operation. The combination of these algorithms was the runner-up at the IEEE Signal Processing Cup 2020 anomaly detection challenge.
II Related Work
Video prediction has garnered considerable attention from the research community and the industry due to its application in domains such as autonomous vehicle navigation and robot manipulation, aside from anomaly detection, where it has been used to understand the behavior of the surrounding environment with time. In this context, sequential models such as Recurrent Neural Networks (RNNs) typically outperform non-sequential ones, as adjacent video frames often share valuable information that the latter find hard to capture [35, 31, 8, 24]. The works of [34, 20, 18, 13, 19] demonstrate prediction of high-level information, such as the action of a person in video frames, using supervised learning. To predict low-level details in video frames, such as pixel-level information, the research community mainly proposes models based on unsupervised learning [31, 8]. Several works demonstrate the effective usage of different types of autoencoders to detect anomalies in video streams. In these works, the authors reconstruct video frames with the help of autoencoders and calculate the reconstruction error between the output and the original frame to identify anomalies. Ribeiro et al. and Gutoski et al. showed that deep convolutional autoencoder models can successfully capture high-level spatial and temporal features of video frames to detect anomalies. Furthermore, Chong and Tay, and Xu et al. employed spatio-temporal autoencoders and variational autoencoders, respectively. In addition, Duman and Erdem proposed a method using convolutional autoencoders and convolutional LSTMs to detect anomalies, in which the authors use dense optical flow to extract velocity and location information of objects.
Several works have been presented in the context of anomaly detection of autonomous systems. Olier et al. demonstrated how an autonomous agent can be trained to mimic human behavior using a variational deep generative architecture, and Baydoun et al. introduced an anomaly detection model by fusing two viewpoints together: the shared layer and the private layer. Here, the shared layer observes odometry data of all the moving agents externally, while the private layer is only accessible to the relevant agent and observes visual data captured by each agent. Campo et al. employed Gaussian process regression to obtain the most probable motion of an agent according to its current position and segmented the state space into spatial zones. Here, the authors employ a set of Kalman filters to track the behavior of agents in each region and to detect anomalous behavior. Iqbal et al. discussed the usage of an improved version of the Growing Neural Gas algorithm to cluster multi-sensory data optimally. This approach reduces computational complexity while maintaining detection accuracy. Kanapram et al. proposed a dynamic Bayesian network to evaluate and detect abnormal behavior based on internal cross-correlational parameters, using the Hellinger distance metric as the abnormality measurement. Furthermore, Ravanbakhsh et al. [26, 27] proposed a novel method using an incremental hierarchy of cross-modal Generative Adversarial Networks (GANs) to process visual data obtained from a camera, build a self-awareness model, and detect anomalies by computing the distance between observed data and the image frames generated by the GANs.
III Proposed Network Topologies
The proposed architecture comprises two separate systems to process IMU data and image data, i.e., the frames of the video, where each system identifies abnormal timestamps independently. For IMU processing, we propose two alternatives based on reconstruction and forecasting. The reconstruction-based approach learns to auto-encode normal samples; thus, for an abnormal input sample, the reconstruction error is expected to rise. The forecasting-based approach learns to predict the next sample given previous normal samples in a sequence; hence, it is expected to return a high forecasting error if the inputs are abnormal. For image data processing, we propose a similar forecasting approach, which is then fine-tuned using conditional adversarial training.
III-A Architectures for IMU Processing
The IMU data contains linear acceleration, angular velocity, and orientation data. However, we only use linear acceleration and angular velocity for anomaly detection. Since both of them represent a rate of change in another measure (velocity, angle), high values of these sensor data are often correlated with abnormal behaviour. The IMU data vector at a given timestamp ($t$) is a six-dimensional vector containing the angular velocities and the linear accelerations in the three directions:
$$\mathbf{x}_t = [\omega_x, \omega_y, \omega_z, a_x, a_y, a_z]^T$$
Here, $\omega_x$, $\omega_y$, $\omega_z$ represent angular velocities and $a_x$, $a_y$, $a_z$ represent linear accelerations in the $x$, $y$, $z$ directions, respectively. We propose two approaches to detect abnormalities, which are based on reconstructing input samples and predicting the next sample from previous samples.
III-A1 LSTM Autoencoder
We propose the use of autoencoders to learn to reconstruct the normal data. The autoencoder learns a basis representation of the normal data and reconstructs them with minimal error, allowing the reconstruction error to be used as an anomaly metric. To make the reconstruction process smooth over the past samples, we incorporate an LSTM architecture into the autoencoder so that it also benefits from previous samples. We rely on the assumption that the reconstruction error is high for data points that lie in significantly different ranges from the data points the model has seen.
Fig. 1(a) illustrates the LSTM Autoencoder architecture, which consists of two parts: the encoder and the decoder. The encoder receives three IMU data vectors of three consecutive timestamps. The first LSTM layer outputs 128-dimensional feature vectors and the second LSTM layer reduces the feature size to 64, where the final time step of the second layer outputs a 64-dimensional encoded embedding. This encoded vector is then repeated three times and fed to the three cells of the first LSTM layer of the decoder, which outputs three 64-dimensional feature vectors. The next layer increases the feature size to 128. Finally, a time-distributed dense layer provides an output with the same dimensions as the input. The reconstructed outputs are then compared with the targets, which are the inputs themselves, to calculate the mean squared error that is back-propagated. This process is repeated over such windows of three samples, which slide through the normal sequences.
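The layer sizes described above can be traced with a minimal NumPy sketch. It is not the trained model: the weights are random, and the gate layout and the helper names (`lstm_layer`, `autoencoder_forward`) are assumptions made for illustration. Only the forward pass and the tensor shapes are shown.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_layer(x, units, rng):
    """Run a randomly initialised LSTM over x of shape (steps, features).

    Returns the hidden state of every step, shape (steps, units).
    """
    steps, features = x.shape
    # One stacked weight matrix for the input, forget, cell and output gates.
    W = rng.normal(0.0, 0.1, size=(features + units, 4 * units))
    b = np.zeros(4 * units)
    h = np.zeros(units)
    c = np.zeros(units)
    outputs = []
    for t in range(steps):
        z = np.concatenate([x[t], h]) @ W + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)


def autoencoder_forward(window, rng):
    """Forward pass of the 128-64 / 64-128 autoencoder on a (3, 6) window."""
    enc1 = lstm_layer(window, 128, rng)          # (3, 128)
    enc2 = lstm_layer(enc1, 64, rng)             # (3, 64)
    embedding = enc2[-1]                         # (64,), final time step
    repeated = np.tile(embedding, (3, 1))        # (3, 64), repeated three times
    dec1 = lstm_layer(repeated, 64, rng)         # (3, 64)
    dec2 = lstm_layer(dec1, 128, rng)            # (3, 128)
    W_out = rng.normal(0.0, 0.1, size=(128, 6))  # dense layer shared across steps
    return dec2 @ W_out                          # (3, 6), same shape as the input


rng = np.random.default_rng(0)
window = rng.uniform(-1.0, 1.0, size=(3, 6))     # three scaled IMU vectors
reconstruction = autoencoder_forward(window, rng)
reconstruction_error = float(np.mean((reconstruction - window) ** 2))
```

In a trained model, `reconstruction_error` would be the quantity compared against the threshold.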
III-A2 LSTM Forecaster
Here, we propose the use of encoder-decoder recurrent models, which are conventionally used for sequence-to-sequence prediction problems. Given a sequence of IMU vectors of consecutive timestamps, the recurrent encoder processes the sequential data to return a latent hidden state, which is then fed to the recurrent decoder to predict the IMU vectors of the next samples. We rely on the assumption that once the network has learned to predict future samples from given normal samples, the next-sample prediction gets confused and deviates from the actual sample for any abnormality in the transitions in the input sequence, leading to a high prediction error.
Fig. 1(b) shows the architecture of the LSTM based Forecaster. The encoder, which is an LSTM layer with three steps, receives three IMU vectors from three consecutive timestamps, processes the sequential data, and returns the latent hidden state of the final timestep. The decoder, which is a single LSTM cell, is then initialized with this hidden state and takes a zero vector as its input to predict the IMU vector of the next sample. Although we can extend the decoder to predict several future samples, we limit the prediction to one sample and leave such an investigation outside the scope of this paper. The predicted sample is then compared against the actual future sample to calculate the mean squared error, which is then back-propagated. During inference, this error is expected to rise for any abnormality in the transitions in the input sequence.
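The encoder-decoder flow can be illustrated with a short NumPy sketch. Several details are assumptions rather than facts from the paper: the hidden size of 64, the dense output projection `W_out`, passing the cell state along with the hidden state to the decoder, and the random (untrained) weights.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h, c, W, b):
    """One LSTM cell update; W stacks the input, forget, cell and output gates."""
    z = np.concatenate([x, h]) @ W + b
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c


def init_params(rng, features=6, units=64):
    """Random weights for the encoder, the decoder cell and the output layer."""
    W_enc = rng.normal(0.0, 0.1, size=(features + units, 4 * units))
    W_dec = rng.normal(0.0, 0.1, size=(features + units, 4 * units))
    W_out = rng.normal(0.0, 0.1, size=(units, features))
    return W_enc, np.zeros(4 * units), W_dec, np.zeros(4 * units), W_out


def forecast_next(window, params):
    """Predict the next IMU vector from a (3, 6) window of past vectors."""
    W_enc, b_enc, W_dec, b_dec, W_out = params
    units = b_enc.size // 4
    h = np.zeros(units)
    c = np.zeros(units)
    for x in window:                          # three encoder steps
        h, c = lstm_step(x, h, c, W_enc, b_enc)
    zero_input = np.zeros(window.shape[1])    # the decoder receives a zero vector
    h, c = lstm_step(zero_input, h, c, W_dec, b_dec)
    return h @ W_out                          # (6,) predicted IMU vector


rng = np.random.default_rng(0)
params = init_params(rng)
sequence = rng.uniform(-1.0, 1.0, size=(4, 6))    # three inputs plus the target
prediction = forecast_next(sequence[:3], params)
prediction_error = float(np.mean((prediction - sequence[3]) ** 2))
```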
III-B Architecture for Image Processing
To capture anomalous behaviour from images, it is important to process them sequentially to capture sudden abnormal transitions between consecutive frames. Once the network has learned to predict the future frame from the previous frames, sudden unpredictable transitions between consecutive frames at inference time make the future frame hard to predict; hence, the system gives higher prediction errors. Here, we use a convolutional neural network based sequential model (CNN-LSTM Forecaster) to predict the next frame from the immediate-past frames. This is similar to the LSTM Forecaster introduced for IMU processing (Sec. III-A2). However, the input frames are pre-processed by a convolutional encoder to reduce the dimensions, and the LSTM prediction is post-processed by a convolutional decoder to construct the future frame. The predicted frame and the actual frame are then compared to determine the prediction error, which is minimized during training.
Fig. 2(a) shows the CNN-LSTM Forecaster. For the convolutional encoder and decoder networks, we use the SegNet architecture. The encoder, which consists of nine hidden convolutional layers, takes an image and produces a latent tensor. This is flattened and fed to the LSTM Forecaster, which takes three such embeddings and predicts the fourth frame’s latent representation. Once the LSTM Forecaster produces the future frame embedding, it is reshaped to the shape of the latent tensor, and the convolutional decoder reverses the process of the encoder to construct an image at the output. The decoder consists of convolutional layers followed by up-sampling layers, where the dimensions are increased. All hidden layers in both the encoder and the decoder use leaky ReLU activation with a slope of 0.2, except the final layers, which use tanh activation.
To make the CNN-LSTM Forecaster more robust to unseen data and to prevent the models from overfitting to the relatively small dataset, we further deploy a CGAN approach. Here, the trained Forecaster is used as the generator, and another CNN-LSTM architecture is used as the Discriminator. The Discriminator, as shown in Fig. 2(b), uses a four-step LSTM where the first three steps take the three input frames’ latent representations via a convolutional encoder. The input to the fourth step of the LSTM is either the encoded Forecaster-generated image or the encoded real fourth image. The Discriminator is supposed to distinguish between the real and the generated fourth frame. We feed the real fourth frame through the convolutional encoder and decoder to subject it to information loss similar to that of the generated images; otherwise, the Discriminator would easily overpower the generator.
IV Data Preparation, Training and Inference
IV-A Training with IMU Data
We use only the linear acceleration and angular velocity values of the IMU data as explained in Section III-A. The dataset contains six normal scenarios and six abnormal scenarios that are given in ROS bag files. We refer to the normal scenarios as normal-0, normal-1, normal-2, normal-3, normal-4, and normal-5. The six abnormal bag files are referred to similarly. The LSTM Autoencoder and LSTM Forecaster are trained only on four of the normal bag files (normal-1,…,4) where we use normal-0 for finding the threshold, and normal-5 along with all abnormal cases for testing.
The training dataset consists of 551 IMU vectors, each containing six features. As a preprocessing step, each feature is scaled to the range [-1, 1]. We create two distinct datasets for the two models. For the LSTM Autoencoder, we use a sliding window of three consecutive vectors. The training dataset contains 549 such sets. In this case, the inputs and targets of the model are the same. To train the LSTM Forecaster, we use the same sliding window of length three, with the target being the fourth vector, which follows the three vectors in every window. There are 548 such sets in this training dataset. The normal-0 scenario, which is used for thresholding, contains 302 timestamps, which leads to 298 4-frame segments for the LSTM Forecaster and 299 3-frame segments for the LSTM Autoencoder. Each model is trained on its respective training dataset for 500 epochs with a learning rate of 0.01. We use mean squared error as the loss function and a batch size of 1.
IV-B Training with Image Data
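The preprocessing and windowing steps can be sketched in plain Python. The sketch assumes per-feature min-max scaling (the text only states the target range) and treats the 551 vectors as one contiguous sequence. The IMU values are synthetic placeholders, and the function names are illustrative.

```python
import random


def scale_features(vectors):
    """Min-max scale every feature (column) to the range [-1, 1]."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            2.0 * (v - lo) / (hi - lo) - 1.0 if hi > lo else 0.0
            for v, lo, hi in zip(vec, lows, highs)
        ]
        for vec in vectors
    ]


def autoencoder_windows(vectors, length=3):
    """Windows of `length` consecutive vectors; input and target are identical."""
    return [vectors[i:i + length] for i in range(len(vectors) - length + 1)]


def forecaster_windows(vectors, length=3):
    """Pairs of (window of `length` vectors, the vector that follows it)."""
    return [
        (vectors[i:i + length], vectors[i + length])
        for i in range(len(vectors) - length)
    ]


random.seed(0)
imu = [[random.gauss(0.0, 1.0) for _ in range(6)] for _ in range(551)]
scaled = scale_features(imu)
ae_windows = autoencoder_windows(scaled)
fc_windows = forecaster_windows(scaled)
print(len(ae_windows), len(fc_windows))  # 549 548
```

With 551 vectors, the two window counts match the 549 and 548 training sets stated above.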
The image dataset consists of sequences of RGB images taken from the drone’s frontal camera at fixed time intervals. The images are from six normal scenarios and six abnormal scenarios, as mentioned in Section IV-A. These images are converted to gray-scale and resized. Furthermore, the image pixel values are normalized to the range of [-1, 1]. We augment the normal scenarios with horizontal flipping, which gives us six more normal scenarios that are the mirrors of the original six normal scenarios.
From each normal scenario, we construct sequences of four images where the first three images are inputs and the fourth image is the target for prediction. We construct 810 such sequences by sliding the length-4 window through all normal scenarios. We pool all the image sequences together and randomly pick 100 image sequences as the threshold determination set, and 100 more are allocated for testing. The remaining 610 image sequences are used to train the model. The 100 test sequences are later combined with 196 sequences obtained similarly from the abnormal scenarios to build the total test set of 296 four-image segments.
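The random partition of the 810 sequences can be sketched as below. The seed, the function name, and the use of index lists in place of actual image sequences are illustrative assumptions.

```python
import random


def split_sequences(num_sequences=810, num_threshold=100, num_test=100, seed=0):
    """Randomly split sequence indices into training, thresholding and test sets."""
    indices = list(range(num_sequences))
    random.Random(seed).shuffle(indices)
    threshold_set = indices[:num_threshold]
    test_set = indices[num_threshold:num_threshold + num_test]
    train_set = indices[num_threshold + num_test:]
    return train_set, threshold_set, test_set


train_set, threshold_set, test_set = split_sequences()
print(len(train_set), len(threshold_set), len(test_set))  # 610 100 100
```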
First, we combine the convolutional encoder and decoder parts, which together resemble the SegNet architecture, and train them for the image reconstruction task. We use the individual images in the training set, which are augmented using horizontal flips, random rotation within ten degrees, width and height shifts, and zoom. The trained encoder and decoder parts are then plugged into the CNN-LSTM Forecaster (Fig. 2(a)), which is then trained to predict the fourth frame from three consecutive input frames. During this phase, the weights of the convolutional encoder and decoder are frozen, and only the LSTM cells are learned for the prediction task. For both encoder-decoder training and CNN-LSTM Forecaster training, we use the sum of the mean squared error and the mean absolute error as the loss function. Such a sum enables the model to benefit from stable convergence while being robust to outliers. In both cases, we train the models for 100 epochs, where the learning rate is initialized as 0.001 and decayed by a factor of 10 after 50 epochs and again after 80 epochs.
Afterward, the trained LSTM Forecaster is used as the generator in a CGAN, where the LSTM Forecaster’s next-frame prediction is fine-tuned with the adversarial loss along with the prediction loss. Section III-B and Fig. 2(b) explain the Discriminator architecture we use for this purpose. During this phase, we let all the weights of the Forecaster be freely updated, and we use a combined loss function of the prediction loss and the adversarial loss, as proposed for image-to-image translation with conditional adversarial networks.
IV-C Flagging Anomalies During Inference
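The loss and the learning-rate schedule described above can be written down in a few lines. This is a framework-free sketch: images are represented as flat lists of pixel values, and counting epochs from zero is an assumption.

```python
def combined_loss(predicted, target):
    """Sum of the mean squared error and the mean absolute error."""
    n = len(predicted)
    mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / n
    mae = sum(abs(p - t) for p, t in zip(predicted, target)) / n
    return mse + mae


def learning_rate(epoch, initial=0.001):
    """Step decay: divide by 10 after 50 epochs and again after 80 epochs."""
    rate = initial
    if epoch >= 50:
        rate /= 10.0
    if epoch >= 80:
        rate /= 10.0
    return rate


schedule = [learning_rate(epoch) for epoch in range(100)]
example_loss = combined_loss([0.0, 1.0], [0.0, 0.0])  # MSE 0.5 + MAE 0.5
```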
Both models used for IMU data processing, as well as the model used for image data processing, are based on either the reconstruction of the same sample or the prediction of future samples. Therefore, we record the reconstruction or prediction error of each model on the thresholding dataset and compute error histograms that are converted to probability distributions. We fit statistical distributions to the derived probability distributions. To determine the best-fit curve for each case, we use the Kolmogorov-Smirnov test. Once a statistical curve is fit for each distribution, we set the threshold at the 95% right-tailed confidence level, which is then used to flag abnormal behaviour in the test set. In particular, during inference, if the reconstruction/prediction error is above this threshold, we flag the timestep as abnormal (see Fig. 4).
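The thresholding procedure can be sketched with the Python standard library. To keep the sketch short, it swaps in two simple candidate families (normal and log-normal) in place of the distributions used in the paper, fits them by sample moments, and selects the candidate with the smallest Kolmogorov-Smirnov statistic. The error values are synthetic and the function names are illustrative.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def ks_statistic(sorted_samples, cdf):
    """Largest gap between the empirical CDF and a candidate CDF."""
    n = len(sorted_samples)
    gap = 0.0
    for k, x in enumerate(sorted_samples):
        model = cdf(x)
        gap = max(gap, (k + 1) / n - model, model - k / n)
    return gap


def fit_threshold(errors, confidence=0.95):
    """Fit the candidates, keep the best KS fit, and return its quantile."""
    samples = sorted(errors)
    normal = NormalDist(mean(samples), stdev(samples))
    logs = [math.log(x) for x in samples]      # errors must be positive
    log_normal = NormalDist(mean(logs), stdev(logs))
    candidates = {
        "normal": (normal.cdf, normal.inv_cdf(confidence)),
        "log-normal": (
            lambda x: log_normal.cdf(math.log(x)),
            math.exp(log_normal.inv_cdf(confidence)),
        ),
    }
    best = min(candidates, key=lambda name: ks_statistic(samples, candidates[name][0]))
    return best, candidates[best][1]


random.seed(0)
errors = [random.lognormvariate(-3.0, 0.5) for _ in range(300)]  # synthetic errors
best_fit, threshold = fit_threshold(errors)
flags = [e > threshold for e in errors]  # True marks a timestep flagged as abnormal
```

By construction, roughly 5% of the thresholding samples lie above a well-fitted 95% threshold.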
In IMU data processing, the reconstruction/prediction error for a particular step is a six-dimensional vector containing the errors of the angular velocities and the linear accelerations, each in three directions ($e_{\omega_x}, e_{\omega_y}, e_{\omega_z}, e_{a_x}, e_{a_y}, e_{a_z}$). Here, we plot histograms and compute thresholds for the angular velocities and the linear accelerations separately. In particular, we define the error related to angular velocities, $e_\omega$, as the mean of the three angular velocity errors, and the error related to linear acceleration, $e_a$, as the mean of the three linear acceleration errors. The reason for this division is that angular velocity and linear acceleration are two distinct measures which might not always be correlated.
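The grouping of the six error components can be sketched as follows. Using squared errors per component, flagging a step when either group exceeds its threshold, and the threshold values in the usage lines are all assumptions made for illustration.

```python
def grouped_errors(predicted, actual):
    """Return (angular-velocity error, linear-acceleration error) for one step.

    Both vectors are ordered [wx, wy, wz, ax, ay, az]; each group error is the
    mean of the three squared component errors of that group.
    """
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sum(squared[:3]) / 3.0, sum(squared[3:]) / 3.0


def is_abnormal(predicted, actual, thr_angular, thr_linear):
    """Flag the step if either group error exceeds its own threshold."""
    e_angular, e_linear = grouped_errors(predicted, actual)
    return e_angular > thr_angular or e_linear > thr_linear


predicted = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
actual = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
e_angular, e_linear = grouped_errors(predicted, actual)
flagged = is_abnormal(predicted, actual, thr_angular=0.5, thr_linear=5.0)
```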
For the LSTM Autoencoder, the angular-velocity error $e_\omega$ is fit with a Birnbaum-Saunders distribution (parameters - c: 2.053, location: 0.022, scale: 0.019) and the linear-acceleration error $e_a$ is fit with a Johnson’s SU distribution (parameters - a: 0.89, b: 0.44, location: 0.16, scale: 0.0024). The 95% right-tailed confidence thresholds for $e_\omega$ and $e_a$ are computed as 0.276 and 0.531, respectively. Similarly, the thresholds of $e_\omega$ and $e_a$ in the LSTM Forecaster are calculated as 0.655 and 0.322. Fig. 5 shows these distributions, where the $e_\omega$ and $e_a$ distributions in each case show clear differences from each other. For the vision system, the prediction error is fit with the Normal Inverse Gaussian distribution (parameters - a: 0.326, b: 0.291, location: 0.061, scale: 0.01). The 95% right-tailed confidence threshold is computed as 0.1598.
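As a worked check, the 95% quantile of the Birnbaum-Saunders fit reported for the LSTM Autoencoder can be evaluated in closed form. The sketch assumes the reported parameters follow the common three-parameter (shape c, location, scale) fatigue-life parameterization. The function name is illustrative.

```python
from statistics import NormalDist


def birnbaum_saunders_quantile(p, c, loc, scale):
    """Quantile of the Birnbaum-Saunders (fatigue-life) distribution."""
    z = NormalDist().inv_cdf(p)  # standard normal quantile
    return loc + scale * 0.25 * (c * z + (c * c * z * z + 4.0) ** 0.5) ** 2


threshold = birnbaum_saunders_quantile(0.95, c=2.053, loc=0.022, scale=0.019)
print(round(threshold, 3))
```

Under this assumption the result is about 0.275, which agrees with the reported threshold of 0.276 up to the rounding of the published parameters.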
V Results and Discussion
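Table II: Performance of the CNN-LSTM Forecaster on the image test set.
|Training loss|Precision|Recall|Accuracy|F1 score|
|Only Prediction Loss|0.9381|0.9081|93.81%|0.9518|
|Prediction Loss + CGAN|0.9269|0.9821|94.18%|0.9537|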
To evaluate the IMU models, we use the test set, which contains one normal case (normal-5) and all the abnormal cases. We establish the ground truth by classifying each data point in the test set as an anomaly or not. For linear accelerations along each orthogonal axis, the maximum and minimum values corresponding to normal data are used as thresholds to detect anomalies. If the acceleration along any of the axes is found to be beyond these thresholds, we label the data point as abnormal. Angular velocities are labeled in a similar manner. We assume such irregularities in angular velocity and linear acceleration are correlated with abnormal movement.
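The ground-truth labelling rule described above can be sketched in a few lines. The sample values are made up, and the function names are illustrative.

```python
def normal_ranges(normal_vectors):
    """Per-feature (minimum, maximum) observed in the normal data."""
    columns = list(zip(*normal_vectors))
    return [(min(col), max(col)) for col in columns]


def label_abnormal(vector, ranges):
    """A data point is abnormal if any feature leaves its normal range."""
    return any(v < lo or v > hi for v, (lo, hi) in zip(vector, ranges))


normal_data = [
    [0.1, -0.2, 0.0, 0.5, 0.4, 9.7],
    [0.3, 0.1, -0.1, 0.7, 0.2, 9.9],
    [-0.2, 0.0, 0.2, 0.4, 0.6, 9.8],
]
ranges = normal_ranges(normal_data)
labels = [
    label_abnormal([0.0, 0.0, 0.0, 0.5, 0.3, 9.8], ranges),  # inside every range
    label_abnormal([0.0, 0.0, 0.0, 2.5, 0.3, 9.8], ranges),  # ax exceeds its maximum
]
```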
The normal-5 scenario is not used for the training or thresholding processes, hence we first evaluate our models in this scenario where the LSTM Autoencoder achieves an accuracy of 95.3% and the LSTM Forecaster achieves an accuracy of 100% in predicting whether a particular step is normal. Table I illustrates the performance of the two models in abnormal scenarios where both perform equally well.
To evaluate the image sequence anomaly detection process, we report the performance of the CNN-LSTM Forecaster on a test set. The test set contains 296 frame sequences taken from six abnormal scenarios and unused sequences from the normal cases. We establish the test set ground truth by manually observing each sequence of four consecutive frames in a sliding window and determining whether the fourth frame is abnormal. Here, we specifically pay attention to the ability to predict the fourth frame given the three previous frames. If the initial three frames in a particular segment show unpredictable transitions, or the fourth frame deviates significantly from the pattern followed by its preceding three frames, we label the fourth frame as an anomaly.
Table II compares the CNN-LSTM Forecaster’s performance when trained with only the prediction loss and when further fine-tuned using the conditional adversarial loss. When the Forecaster is fine-tuned with the adversarial loss, the recall, the accuracy, and the F1 score are improved. We further plot handpicked frame sequences representing both normal and abnormal segments from the test set. Fig. 6 shows a normal frame segment where the fourth frame is successfully predicted given the three previous frames. Fig. 7 shows a segment captured by a drone which is already in an anomalous movement. In such a case, it is difficult for the system to predict the future frame. In the frame segment shown in Fig. 8, the model fails to predict the sudden appearance of the man in the fourth frame. During such sudden unpredicted occlusions, the IMU sensors continue to give normal readings since the movement of the vehicle is regular. Hence, these types of anomalies are not flagged by the IMU processing.
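For reference, the recall, accuracy, and F1 score discussed above follow from the confusion counts of a binary classifier, as in the sketch below. The counts in the usage lines are hypothetical and are not taken from the paper's experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy and F1 score from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1


# Hypothetical counts for a test set of 296 segments.
precision, recall, accuracy, f1 = classification_metrics(tp=90, fp=10, tn=180, fn=16)
```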
In this paper, we proposed self-supervised deep learning algorithms to detect anomalies in autonomous systems using video and IMU data. Our algorithms were based on reconstructing the same input sample or predicting the next sample from a given sequence. We expected that for input samples that are significantly dissimilar to the samples that the reconstruction model has seen during training, the reconstruction error tends to rise. Additionally, if there are unpredicted, sudden transitions in a given sequence of samples, it is hard to predict the next sample. Consequently, the prediction error rises. The proposed models for IMU-based anomaly detection achieve an accuracy of 91% and an F1-score of 0.99 whereas the proposed model for image-based anomaly detection achieves an accuracy of 94% and an F1-score of 0.95.
- (2015) SegNet: a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293.
- (2019) Birnbaum-Saunders distribution: a review of models, analysis, and applications. Applied Stochastic Models in Business and Industry 35 (1), pp. 4–49.
- (2018) A multi-perspective approach to anomaly detection for self-aware embodied agents. In 2018 IEEE International Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 6598–6602.
- (2019) Learning probabilistic awareness models for detecting abnormalities in vehicle motions. IEEE Transactions on Intelligent Transportation Systems.
- (2009-Jul.) Anomaly detection: a survey. ACM Computing Surveys 41 (3), pp. 1–58.
- (2017) Abnormal event detection in videos using spatiotemporal autoencoder. In International Symposium on Neural Networks.
- (2019) Anomaly detection in videos using optical flow and convolutional autoencoder. IEEE Access 7 (99), pp. 183914–183923.
- (2016) Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pp. 64–72.
- (2017) Deep visual foresight for planning robot motion. In 2017 IEEE International Conf. Robotics and Automation (ICRA), pp. 2786–2793.
- Detection of video anomalies using convolutional autoencoders and one-class support vector machines.
- (2001-05) The normal inverse Gaussian distribution: a versatile model for heavy-tailed stochastic processes. Acoustics, Speech, and Signal Processing, IEEE International Conference on 6, pp. 3985–3988.
- (1997-Nov.) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- (2014) Action-reaction: forecasting the dynamics of human interaction. In European Conf. Comput. Vis., pp. 489–504.
- (2019) Clustering optimization for abnormality detection in semi-autonomous systems. In Int. Workshop on Multimodal Understanding and Learning for Embodied Applications, pp. 33–41.
- Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1125–1134.
- (1951) The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association 46 (253), pp. 68–78.
- (2019) Self-awareness in intelligent vehicles: experience based abnormality detection. In Iberian Robotics Conference, pp. 216–228.
- (2012) Activity forecasting. In European Conf. Comput. Vis., pp. 201–214.
- (2015) Anticipating human activities using object affordances for reactive robotic response. IEEE Trans. Pattern Anal. Mach. Intell. 38 (1), pp. 14–29.
- A hierarchical representation for future action prediction. In European Conference on Computer Vision, pp. 689–704.
- (2016) Anomaly detection in video using predictive convolutional long short-term memory networks. arXiv preprint arXiv:1612.00390.
- (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
- (2017) Dynamic representations for autonomous driving. In IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6.
- (2015) Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309.
- ROS: an open-source robot operating system. In ICRA Workshop on Open Source Software, Vol. 3, pp. 5.
- (2018) Hierarchy of GANs for learning embodied self-awareness model. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 1987–1991.
- (2018) Learning multi-modal self-awareness models for autonomous vehicles from human driving. In 2018 21st International Conf. on Information Fusion (FUSION), pp. 1866–1873.
- (2017) A study of deep convolutional auto-encoders for anomaly detection in videos. Pattern Recognition Letters.
- (2020-05) Signal processing cup.
- (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
- (2017) Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033.
- (1978) Determining parameters of the Johnson SU. Communications in Statistics - Simulation and Computation 7 (3), pp. 223–226.
- (2017) End-to-end learning of driving models from large-scale video datasets. In Proceedings of the IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2174–2182.
- (2010) A data-driven approach for event prediction. In European Conf. Comput. Vis., pp. 707–720.
- (2016) Learning temporal transformations from time-lapse videos. In European Conf. Comput. Vis., pp. 262–277.