Using Visual Anomaly Detection for Task Execution Monitoring

07/29/2021 ∙ by Santosh Thoduka, et al. ∙ University of Bonn ∙ Hochschule Bonn-Rhein-Sieg

Execution monitoring is essential for robots to detect and respond to failures. Since it is impossible to enumerate all failures for a given task, we learn from successful executions of the task to detect visual anomalies during runtime. Our method learns to predict the motions that occur during the nominal execution of a task, including camera and robot body motion. A probabilistic U-Net architecture is used to learn to predict optical flow, and the robot's kinematics and 3D model are used to model camera and body motion. The errors between the observed and predicted motion are used to calculate an anomaly score. We evaluate our method on a dataset of a robot placing a book on a shelf, which includes anomalies such as falling books, camera occlusions, and robot disturbances. We find that modeling camera and body motion, in addition to the learning-based optical flow prediction, results in an improvement of the area under the receiver operating characteristic curve from 0.752 to 0.804, and the area under the precision-recall curve from 0.467 to 0.549.




I Introduction

Service robots operating in dynamic, unstructured environments are prone to failures. In order to be resilient to failures, execution monitoring is necessary, and usually forms a part of the robot’s software architecture [1]. Monitoring allows the robot to verify that actions are completed successfully, and perform recovery actions if a failure occurs. This, in turn, makes robots safer, more trustworthy, and dependable.

A task can have different modes of failure, some of which might be unforeseeable and therefore impossible to enumerate. However, variations in nominal executions are limited, thereby making them easier to enumerate or model. Therefore, some authors have framed execution monitoring as an anomaly detection task, in which models are learned from successful executions, and deviations from the models are detected as anomalies [2, 3]. Depending on the nature of the failure, different sensors might be used for detection; for example, force-torque sensors for collisions, or vision and auditory sensors for external events such as objects falling. Some methods [4, 5, 6] only make use of visual data such as RGB frames, depth frames, or the output of object detection algorithms. In this paper, we propose a method for visual execution monitoring, using the videos from the robot’s camera and the robot’s kinematics.

Fig. 1: Task failures, such as a book falling while being placed on a shelf, can occur during execution. Modelling expected motions (green arrows) from successful executions allows us to identify unexpected motions (red arrow) using video anomaly detection.

Research on video-based anomaly detection has focused primarily on surveillance videos, and is driven by datasets such as CUHK Avenue [7], Street Scene [8], etc. The typical setup for these datasets involves a static camera observing a fixed scene. For a mobile robot, we can assume neither a static camera nor a fixed scene, since the robot can perform a given task at different locations and the task might require camera motion during execution. Additionally, the manipulator might occlude parts of the scene and cause motions in the scene which the robot must take into account.

Since deep learning dominates most recent approaches for video anomaly detection, large-scale datasets are needed and have been published in recent years [8, 9]. Datasets for robotics also exist [10, 11], and are usually recorded with a single robot, and are task-specific. Generalizing learning models across robots requires large-scale datasets, which can be quite expensive [12]. Using small, robot- and task-specific datasets sidesteps this problem, but constrains the learning method to be data-efficient. A combination of model-based and data-driven techniques loosens the requirement for large-scale datasets, while still making it possible to learn aspects which cannot be easily modelled.

For certain actions, robots use predefined manipulator motions or motion primitives, which are only parameterized by a target pose. This means that the motions resulting from the robot’s actions are predictable for a nominal execution. For the action of placing a book on a shelf, shown in Fig. 1, the dominant motion patterns in a nominal execution include the robot’s arm approaching the shelf, a short downward motion when the book is released, and the motion of the arm retracting. The motion of the book falling off the shelf is not one that occurs during a nominal execution, and therefore should be detected as an anomaly.

In this paper, we learn a model of nominal motions during the execution of an action, and detect anomalies by comparing the observed motion to the expected nominal motion. The motion of the camera and the robot’s body are modelled using the known joint states and kinematics, while motions external to the robot are modelled by learning to predict future optical flow in a self-supervised manner. The combination of a learning-based video anomaly detection method and analytical modelling of the robot’s self motion is in contrast to existing methods which either use only learning [2], or use model knowledge at a later stage (e.g. to isolate faults [13]). For evaluation, we collect a dataset of the Toyota HSR robot placing a book on a shelf, which consists of RGB images from the robot’s head camera, and joint states. Both nominal and failed executions are recorded, but the training data consists only of nominal data. (Footnote 1: Code and dataset are available at …)

II Related Work

II-A Execution monitoring in robotics

Our approach can be considered a knowledge-based method of execution monitoring [1] since we use data for learning, but also use an analytical approach for modelling the camera and body motion. Several works use data-driven methods for execution monitoring and anomaly detection in robotics. Park et al. [2, 14] developed an execution monitor for detecting anomalous events during assistive tasks such as feeding. They use force, sound and kinematic signals from a service robot to learn from nominal executions using hidden Markov models (HMMs) and Gaussian processes. Inceoglu et al. [6] learn from several sensor modalities, using HMMs to classify extracted predicates from each modality into success and failure classes for different actions. They also present an end-to-end convolutional neural network [11], which classifies executions as success or failure, and identifies the failure types as well. Wang et al. [10] present a visual-tactile grasp dataset, consisting of tactile, joint and visual data of a robot arm grasping several objects. Baseline results for slip detection using a recurrent neural network on the tactile data are presented. Çatal et al. [15] present an anomaly detection method for an autonomous guided vehicle patrolling a warehouse, using a variational autoencoder to reconstruct an input image, conditioned on an action vector corresponding to commands sent to the robot.

All of these methods are characterised by the use of learning from multi-modal data using neural networks or hidden Markov models. Colour and depth images are used in case of visual data, and kinematic data is represented as raw signals or processed to extract features. In some cases, only nominal data is used for learning. In our approach, we only focus on the motions in the scene (represented by optical flow) since motion-related anomalies are more likely for tasks which involve interaction with the environment. Instead of using raw kinematic signals, we represent them in a visual form, allowing us to model them in relation to the camera image. Hence, our approach learns from nominal optical flow, and combines this with the kinematics of the robot, represented visually, without using learning.

II-B Video-based anomaly detection

Reconstruction-based methods are one category of video anomaly detection approaches in which networks are trained to either reconstruct the input, or predict a future outcome, typically using some type of autoencoder network [16]. These methods assume that the network will poorly reconstruct anomalous inputs since they are not in the same distribution as the training data. In [17], the authors use a U-Net network along with a discriminator for predicting a future frame. They use intensity, gradient and optical flow loss between the predicted and ground truth frame, in addition to the discriminator loss to train the network. In [18], the authors augment their autoencoder model with a memory module, which is updated with the latent codes of nominal videos during training. At test time, inputs are reconstructed using the latent code in memory which is closest to the latent code of the test input. Other variations of reconstruction-based approaches include using networks for both prediction and reconstruction [19], reconstructing the input and predicting optical flow [20], the use of LSTMs in a variational autoencoder network [21] for future frame prediction, and the use of a graph-convolutional network to model object interactions [22]. We follow a similar approach to several of these methods by using a variational U-Net model to predict a future optical flow image. However, in contrast to most methods, we only use a single optical flow image as input, instead of a sequence of RGB images, since we are primarily interested in modelling the motion as opposed to the appearance of the scene. This makes the learning task less complex, and training and inference are faster due to lower input dimensionality.

III Method

Fig. 2: The expected motion is compared to the observed motion for three types of motions: a) Camera motion: the transformation between images obtained via image registration is compared with the expected transformation from known camera motion obtained from the robot’s joint states; b) Body motion: the observed optical flow of the robot’s body ($F^b_t$) is compared with the optical flow obtained by rendering the robot model ($F'^b_t$) using the robot’s joint states; c) Optical flow: the full optical flow image is compared to the optical flow image predicted by a trained U-Net network, given an optical flow image from the past. An anomaly score is calculated based on the residuals in each case (see text for notation).

Our goal is to create a model of the nominal motion that occurs during the execution of a task, and to detect anomalies by monitoring the deviation from the nominal motion model. By observing the motion, we focus naturally on the salient regions of the scene, particularly those that are relevant for motion-related anomalies. Motions of the robot’s camera and body can be modelled using the internal sensors of the robot, and are less affected by the dynamics of the external environment in nominal scenarios. We therefore use an optical flow prediction network to model the overall motion, but also model the camera and robot body motion separately using the robot’s kinematics and joint states. In all three cases, the error between the expected and observed motion is measured and used to calculate an anomaly score.

The sequence of images, $I_t$, from the robot’s camera and the robot state, $S_t$, sampled at 10 Hz during the execution of the task, are the inputs to the algorithms. The robot state includes positions, velocities and efforts of all joints, and the frames of additional links of the robot. Optical flow, which is calculated from consecutive pairs of images, is represented as a two-channel image (for horizontal and vertical displacement), $F_t$. Using the 3D model of the robot and the robot state, a sequence of images, $I'_t$, is rendered from the point of view of the robot’s camera with the robot model in the configuration defined by the robot state. An additional sequence of optical flow images, $F'_t$, is calculated from this sequence of rendered images.

Fig. 2 illustrates the three types of expected and observed motions which are compared. For camera motion, we compare the expected pixel motion calculated using the known camera motion against the measured pixel motion using image registration. For body motion, we compare the optical flow from rendered images, $F'_t$, against the optical flow from real images, $F_t$. For the optical flow, we use a probabilistic U-Net model [23] to learn to predict the current optical flow, $F_t$, given an optical flow frame from the past, $F_{t-m}$. The errors between the expected (or predicted) output and the measured output (the camera motion error $e_c$, the body motion error $e_b$ and the optical flow error $e_f$) are combined to calculate the anomaly score. Each of the three errors is described in detail in the following sections.

III-A Camera Motion

Motions of a robot-mounted camera cause an apparent motion of the entire scene. The image motion caused by this camera motion is calculated in two ways.

III-A1 Expected Motion

The camera motion relative to a fixed frame (such as the robot base) is obtained using the position sensors for various robot parts, such as the head and torso. With epipolar geometry, the correspondence between image points in two consecutive views of the camera is obtained using Eq. 1 [24, Eq. (9.7)].


$$\mathbf{x}' = K R K^{-1} \mathbf{x} + \frac{K \mathbf{t}}{Z} \qquad (1)$$

Here, $\mathbf{x}$ and $\mathbf{x}'$ are the image points (in homogeneous coordinates) in the two consecutive camera views corresponding to a point in the scene, $K$ is the intrinsic camera matrix at the two time points, $R$ and $\mathbf{t}$ are the rotation matrix and translation vector between the two camera positions and $Z$ is the depth of the point in 3-D space. With a minimum of two such correspondences from the two views, the similarity transform (i.e. translation, $(t_x, t_y)$, scale, $s$, and rotation, $\theta$) between the sets of image points is computed [24, pg. 39]. In practice, more than two correspondences should be used to account for errors in the intrinsic parameters and noise in the depth data and proprioceptive sensors. If the camera motion only consists of a translation, the simplified Eq. 2 [24, Eq. (9.6)] can be used instead.


$$\mathbf{x}' = \mathbf{x} + \frac{K \mathbf{t}}{Z} \qquad (2)$$

In case no depth map of the scene is available, a uniform depth can be assumed for further simplification. The RGB images from the camera are not used in this step.
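As a minimal sketch of Eq. 1 and Eq. 2, the NumPy code below maps a pixel from one camera view to the next. The function names, the example intrinsics, the motion values, and the convention that the rotation and translation take scene points from the first camera frame to the second are assumptions made for illustration only.

```python
import numpy as np


def expected_pixel(x, K, R, t, Z):
    """Map pixel x = (u, v) from the first view to the second view (Eq. 1).

    Assumes R and t take scene points from the first camera frame to the
    second, and that Z is the depth of the point in the first view.
    """
    x_h = np.array([x[0], x[1], 1.0])                  # homogeneous pixel
    x2_h = K @ R @ np.linalg.inv(K) @ x_h + K @ t / Z
    return x2_h[:2] / x2_h[2]                          # back to pixel coordinates


def expected_pixel_translation_only(x, K, t, Z):
    """Simplified mapping for a purely translating camera (Eq. 2)."""
    x_h = np.array([x[0], x[1], 1.0])
    x2_h = x_h + K @ t / Z
    return x2_h[:2] / x2_h[2]


# Hypothetical intrinsics and motion, chosen only for illustration.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.01, 0.0, 0.0])   # 1 cm sideways camera translation
Z = 2.0                          # uniform depth assumption, in metres

start = np.array([320.0, 240.0])
shift = expected_pixel(start, K, np.eye(3), t, Z) - start
```

With an identity rotation both functions agree, which is the sense in which Eq. 2 is a special case of Eq. 1. Evaluating the mapping at several pixels gives the correspondences from which the expected similarity transform can be estimated.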

III-A2 Observed Motion

The observed motion between the images from the two viewpoints is computed using the Fourier-Mellin transform [25]. This method registers two images by estimating the similarity transform, $(t^{obs}_x, t^{obs}_y, s^{obs}, \theta^{obs})$, between the images based on the Fourier shift theorem. Since this method assumes that the dominant motion in the image is caused by camera motion, we mask out the region of the image where the robot body motion is visible before registration. Finally, the absolute error between the expected and measured transforms is calculated. In the use-case described in Sect. IV, the camera motion results only in translations; hence we calculate the error as in Eq. 3.


For errors involving scale and rotation, a weighted sum of the errors should be used, or they should be considered separately from the translation errors.
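The translation part of this registration can be illustrated with plain phase correlation, which follows directly from the Fourier shift theorem. The sketch below is not the full Fourier-Mellin method (it omits the log-polar step that recovers rotation and scale), it only recovers integer shifts, and the form of the translation error (sum of absolute differences per axis) is an assumption. All function names are hypothetical.

```python
import numpy as np


def phase_correlation_shift(ref, moved):
    """Estimate the integer (dy, dx) such that moved ~ np.roll(ref, (dy, dx))."""
    F_ref = np.fft.fft2(ref)
    F_mov = np.fft.fft2(moved)
    cross = F_mov * np.conj(F_ref)
    cross /= np.abs(cross) + 1e-12          # keep only the phase difference
    corr = np.fft.ifft2(cross).real         # peaks at the displacement
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks past the half-way point correspond to negative shifts.
    return tuple(int(p) if p <= s // 2 else int(p) - s
                 for p, s in zip(peak, corr.shape))


def translation_error(expected, observed):
    """Assumed error form: sum of absolute differences over both axes."""
    return float(np.sum(np.abs(np.asarray(expected, dtype=float)
                               - np.asarray(observed, dtype=float))))


rng = np.random.default_rng(0)
image = rng.random((64, 64))
shifted = np.roll(image, (3, -5), axis=(0, 1))   # synthetic camera-induced shift
observed = phase_correlation_shift(image, shifted)
error = translation_error((3, -5), observed)
```

In this synthetic case the observed shift matches the expected one, so the error is zero. A large moving object in the scene would corrupt the correlation peak, which is why the robot body is masked out before registration.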

III-B Body Motion

The robot’s manipulator and end-effector are typically in view of the camera during the execution of a manipulation task. For the expected motion of the robot body, the robot model is rendered using the current robot state ($S_t$) and 3D model of the robot, and images ($I'_t$) are captured from the pose of the real camera. An example of the real and rendered image can be seen in Fig. 3. The expected optical flow caused by the robot body, $F'_t$, is calculated from consecutive rendered frames, using the TV-L1 algorithm [26].

Fig. 3: The real image (left) and rendered image (right) from the point of view of the robot’s camera

The optical flow between the real images, $F_t$, is also calculated using the same algorithm. The two optical flow images are masked such that only motions of the robot body remain, resulting in $F^b_t$ and $F'^b_t$. The mask is created by applying binary thresholding and contour detection on the rendered image of the robot body. The error between the two optical flow images is calculated as the absolute difference between the median magnitude (denoted by a tilde, $\tilde{F}$; we use the median instead of the mean to be robust to outliers) in both directions, as shown in Eq. 4.
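One possible reading of this error is sketched below: for each flow direction (horizontal and vertical), take the median absolute displacement inside the body mask for the real and rendered flow, and sum the absolute differences. Summing the two directions, the array layout and the handling of an empty mask are assumptions, and the function name is hypothetical.

```python
import numpy as np


def body_motion_error(flow_real, flow_rendered, body_mask):
    """Compare median flow magnitudes inside the robot-body mask.

    flow_real, flow_rendered: arrays of shape (H, W, 2) holding horizontal
    and vertical displacement. body_mask: boolean array of shape (H, W).
    """
    if not body_mask.any():
        return 0.0                      # robot body not visible in this frame
    error = 0.0
    for channel in range(2):            # 0: horizontal, 1: vertical
        med_real = np.median(np.abs(flow_real[..., channel][body_mask]))
        med_rend = np.median(np.abs(flow_rendered[..., channel][body_mask]))
        error += abs(med_real - med_rend)
    return float(error)


# Synthetic example: the real arm moves slightly faster than the rendered one.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
real = np.zeros((4, 4, 2))
real[..., 0], real[..., 1] = 2.0, -1.0
rendered = np.zeros((4, 4, 2))
rendered[..., 0], rendered[..., 1] = 1.5, -1.0
example_error = body_motion_error(real, rendered, mask)
```

Using the median means that a few badly estimated flow vectors inside the mask do not dominate the error.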


III-C Optical Flow

To produce an expected optical flow image, we use a neural network which learns to predict the current optical flow image, $F_t$, given a past optical flow image, $F_{t-m}$. We use a probabilistic U-Net [23] model which combines a conditional variational auto-encoder (VAE) with a U-Net [27], by the addition of a prior and posterior network (see Fig. 4). This network architecture is chosen since it allows training with multiple ground truth outputs for a given input, and is relatively small compared to other frame prediction networks, with around 5 million trainable parameters. In our case, we consider the prediction of an optical flow frame to be time-agnostic [28]; the past optical flow frame which can best predict the current frame is selected from a range of past frames rather than from a fixed offset in the past. During training, an optical flow image from the past, $F_{t-m}$, is selected randomly within a certain range of offsets $m$, with the target optical flow image as $F_t$. At inference time, however, $F_t$ is predicted once for each input $F_{t-m}$, for all $m$ in the range, and the prediction with the lowest error is selected. The lower limit of the range, $l$, and the span of the range, $n$, are hyperparameters whose values we determine experimentally.
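The time-agnostic input selection can be sketched as follows. The half-open range of offsets (lower limit up to, but excluding, lower limit plus span), the mean-squared error used to rank predictions, and the `predict` callable standing in for the trained network are all assumptions for illustration.

```python
import numpy as np


def sample_training_offset(lower, span, rng):
    """Pick a random offset into the past; range assumed to be [lower, lower + span)."""
    return int(rng.integers(lower, lower + span))


def best_prediction(flows, t, lower, span, predict):
    """Predict flows[t] from every past frame in the range and keep the best.

    `predict` is a hypothetical stand-in for the trained network: it maps one
    past optical flow image to a predicted current optical flow image.
    """
    target = flows[t]
    best_error, best_offset = np.inf, None
    for m in range(lower, lower + span):
        if t - m < 0:
            continue                                   # not enough history yet
        error = float(np.mean((predict(flows[t - m]) - target) ** 2))
        if error < best_error:
            best_error, best_offset = error, m
    return best_error, best_offset


# Toy sequence in which the flow repeats every 7 frames.
flows = [np.full((4, 4, 2), float(i % 7)) for i in range(20)]
error, offset = best_prediction(flows, t=15, lower=5, span=5, predict=lambda f: f)
```

With the identity function as a placeholder predictor, the frame 7 steps in the past reproduces the target exactly, so it is the one selected.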

The prior network produces a distribution represented by the parameters of a multi-dimensional Gaussian with a diagonal covariance matrix ($\mu_{prior}$, $\sigma_{prior}$). At inference time, a latent vector, $z$, is sampled from this distribution, concatenated with the last activation map of the U-Net, and passed through some final convolutional layers to produce $\hat{F}_t$ (where the input to the U-Net is also $F_{t-m}$). Multiple $\hat{F}_t$ can be generated by sampling multiple times from the prior distribution.

Fig. 4: The probabilistic U-Net architecture consists of a posterior and prior network in addition to the standard U-Net. The network predicts a future optical flow image conditioned on a past optical flow image and a latent vector sampled from the distribution output by the posterior network (during training) or the prior network (during inference).

During training, the latent vector is sampled from the distribution, ($\mu_{post}$, $\sigma_{post}$), parameterized by the posterior network, $Q$, instead of the prior network. The input to the posterior network is a concatenation along the channel dimension of the ground truth, $F_t$, and the input to the prior network, $F_{t-m}$. The posterior network learns to produce a latent vector that will result in the exact ground truth provided in its input. The training objective (Eq. 5) is that of a VAE [29], namely, the sum of mean-squared error (MSE) between the predicted output and the ground truth, and the Kullback-Leibler (KL) divergence between the distributions output by the posterior and prior networks. (Footnote 3: The loss function is identical to that used in [23] with the exception that the output distribution is considered to be a normal distribution instead of a categorical distribution.) The KL-divergence term, which is weighted by $\beta$, a hyperparameter, encourages the prior distribution to move close to the posterior distribution. (Footnote 4: We use the default value of $\beta$ in the implementation by [23].)
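The structure of such an objective can be sketched in NumPy using the closed-form KL divergence between two diagonal Gaussians. The reduction used here (mean over pixels for the MSE, sum over latent dimensions for the KL term) and the function names are assumptions; the KL weight is left as a parameter.

```python
import numpy as np


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(Q || P) for diagonal Gaussians, summed over latent dimensions."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return float(0.5 * np.sum(np.log(var_p / var_q)
                              + (var_q + (mu_q - mu_p) ** 2) / var_p
                              - 1.0))


def training_loss(pred, target, mu_post, sigma_post, mu_prior, sigma_prior, beta):
    """Reconstruction MSE plus beta-weighted KL between posterior and prior."""
    mse = float(np.mean((pred - target) ** 2))
    return mse + beta * kl_diag_gaussians(mu_post, sigma_post, mu_prior, sigma_prior)


# Toy values: a one-dimensional latent whose posterior mean is offset by 1.
loss = training_loss(
    pred=np.full((2, 2), 0.5), target=np.zeros((2, 2)),
    mu_post=np.array([1.0]), sigma_post=np.array([1.0]),
    mu_prior=np.array([0.0]), sigma_prior=np.array([1.0]),
    beta=2.0,
)
```

Here the MSE is 0.25 and the KL term is 0.5, so the weighted sum is 1.25. The KL term vanishes only when the prior and posterior networks output the same distribution.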


The input optical flow images are resized and center-cropped to 64x64 pixels. All models are trained for 50 epochs using a batch size of 128 and learning rate of 0.0001. At inference time, we sample multiple outputs from the network for each input $F_{t-m}$ (one set of predictions per offset $m$ in the range), and the output, $\hat{F}_t$, with the minimum prediction error is used to compute the error as in Eq. 6.


III-D Anomaly Score

We do not expect a significant change in the camera motion error, $e_c$, and the body motion error, $e_b$, between the training and nominal test data since they are based on the robot’s proprioceptive sensors. However, we do expect a change in the optical flow error, $e_f$, at inference time due to distribution shift, even for nominal executions. Therefore, we use a fixed threshold for the camera and body motion, but evaluate the overall score using a range of thresholds. The thresholds for the camera and body motion errors are based on the maximum error from the training set, as in Eq. 7.


The final anomaly score is calculated as:
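As a hedged illustration of how fixed thresholds on the camera and body motion errors might be combined with the optical flow error, the sketch below takes the thresholds as the maximum training error and adds a constant penalty to the optical flow error whenever either threshold is exceeded. The additive rule, the `penalty` constant and the function names are assumptions and are not taken from Eq. 7 or Eq. 8.

```python
import numpy as np


def fit_thresholds(train_camera_errors, train_body_errors):
    """Fixed thresholds: the largest error seen on nominal training data."""
    return float(np.max(train_camera_errors)), float(np.max(train_body_errors))


def anomaly_score(e_f, e_c, e_b, thr_c, thr_b, penalty=1.0):
    """Assumed combination: optical flow error plus a penalty on exceedance."""
    exceeded = (np.asarray(e_c) > thr_c) | (np.asarray(e_b) > thr_b)
    return np.asarray(e_f, dtype=float) + penalty * exceeded


thr_c, thr_b = fit_thresholds([0.1, 0.3, 0.2], [1.0, 0.5])
scores = anomaly_score(
    e_f=[0.2, 0.2, 0.2],
    e_c=[0.1, 0.5, 0.1],    # second frame exceeds the camera threshold
    e_b=[0.5, 0.5, 2.0],    # third frame exceeds the body threshold
    thr_c=thr_c, thr_b=thr_b,
)
```

Under this rule the per-frame score stays continuous, so it can still be swept over a range of thresholds when computing ROC and precision-recall curves.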


IV Experiments

IV-A Data

The method is evaluated on a dataset of executions in which the robot places a book on a shelf. The dataset consists of 61 nominal executions and 60 anomalous executions. Data recorded includes RGB and depth images from the head-mounted 3-D camera, joint states, and other sensor data such as the force-torque sensor. Anomalies include the book falling on or off the shelf, books on the shelf being disturbed significantly, occluded camera, and external collisions and disturbances to the robot. While camera occlusions and disturbances to the robot do not necessarily result in task failures, they are still important to detect since they might indicate other problems that the robot needs to address (for example, the robot may need to re-localize itself if it has been disturbed). Anomalies are annotated frame-wise, so that the objective is to detect anomalous frames. For training the learning model and for determining thresholds for the camera and body motion models, 48 nominal executions are used as training data and an additional 6 nominal executions are used for validation. The test set consists of 60 anomalous executions and 7 nominal executions.

IV-B Evaluation Metrics

The area under the receiver operating characteristic curve (AUC-ROC) is the most common metric used in video anomaly detection. It allows us to compare methods based on their ability to discriminate between anomalous and non-anomalous frames at different thresholds. If there is an imbalance between the two classes, the AUC of the precision-recall (AUC-PR) curve is an additional metric which summarizes how well the classifier is able to detect the anomalous frames at different thresholds. Since the dataset is unbalanced (only about 12% of test frames are anomalous), both metrics are used for evaluation. In both cases, an area of 1.0 represents a perfect detector. Unlike several video-anomaly detection methods (see [16]), we do not normalize the anomaly score per execution, since this assumes that at least one anomaly occurs during an execution.
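Both metrics can be computed from frame-wise labels and scores without fixing a threshold. The sketch below uses the pairwise-ranking form of AUC-ROC and average precision as a step-wise summary of the precision-recall curve. Whether the reported AUC-PR values were computed this way or by another integration rule is not stated, so treat that choice, and the function names, as assumptions.

```python
import numpy as np


def auc_roc(labels, scores):
    """Probability that a random anomalous frame outscores a random nominal one."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))   # ties count half


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each anomalous frame."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    hits = labels[np.argsort(-scores, kind="stable")]        # highest score first
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision[hits]) / hits.sum())


# Toy frame-wise labels (1 = anomalous) and anomaly scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
roc = auc_roc(labels, scores)
ap = average_precision(labels, scores)
```

For these toy values the AUC-ROC is 0.75 and the average precision is about 0.83. Because scores are not normalized per execution, the scores of all test frames would be pooled before calling either function.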

IV-C Results

Fig. 5: Several types of anomalies are illustrated here: top: the book starts to fall off the shelf around frame 160 when the arm is retracted; middle: the camera is occluded at around frame 50; bottom: the robot is disturbed externally, resulting in a shaking camera around frame 130. The threshold shown here corresponds to the optimal point in the precision-recall curve. The full clips for these examples can be found in the supplementary video.

IV-C1 Comparison to other anomaly detection methods

We compare the performance of the probabilistic U-Net to two other future prediction methods, U-Net + GAN [17] and VRNN [21], and an HMM-based method similar to [2]. We train the probabilistic U-Net with a fixed offset of $m = 1$, such that it only predicts the next optical flow frame. For a fair comparison with the future prediction models, we do not consider the body and camera motion, so that the anomaly score is the optical flow prediction error alone. Both future prediction models are trained to predict the next RGB frame given a sequence of input RGB frames. We fit the HMM using the Baum-Welch algorithm with multivariate features comprising the maximum magnitude from optical flow images and magnitudes of observed body motion and image motion due to camera motion. The anomaly score for each frame in the test set is computed using the log likelihood of the sequence of observations up to that frame. The results in Table I show that learning to predict optical flow using the probabilistic U-Net has an advantage over the future prediction methods, and the HMM.

Method              AUC-ROC   AUC-PR
U-Net + GAN [17]    0.675     0.339
VRNN [21]           0.533     0.166
HMM                 0.657     0.197
Prob. U-Net         0.728     0.397
TABLE I: Comparison to other methods

IV-C2 Input range

We determine ideal values for the lower limit, $l$, and the span, $n$, of the range of past frames to be used as the input to the probabilistic U-Net. As seen in Fig. 6, there is a marginal improvement of AUC-ROC as the lower limit and span are increased up to a certain point. Similar trends were observed for other combinations of $l$ and $n$, and for AUC-PR. We obtain the best results with a lower limit of 5 and a span reaching 9 time steps, namely by using an optical flow frame between 5 and 9 time steps in the past to predict the current frame.

Fig. 6: AUC-ROC for different ranges of input optical flow images

IV-C3 Optical flow, body and camera motion

For the learning method, we consider several variants of the input optical flow. The registered optical flow is the optical flow calculated after registering consecutive images. In effect, the camera motion is no longer visible in this variant. The masked optical flow masks out the optical flow in the regions where the robot’s body is visible (using the inversion of the mask described in III-B). This removes observed motions of the robot body from the optical flow. The masked registered variant is a masked version of the registered optical flow. Table II shows the AUC-ROC and AUC-PR for all optical flow variants using (i) only the learning method (OF only; ); (ii) the learning method + error from the body motion (OF + body motion); and (iii) the learning method combined with the body and camera motion errors (OF + body + camera motion; score calculated as in Eq. 8). All results use , and .

Input optical flow variant    OF only   OF + body motion   OF + body + camera motion
AUC-ROC
Optical flow                  0.740     0.741              0.792
Registered optical flow       0.716     0.716              0.773
Masked optical flow           0.752     0.752              0.804
Masked reg. optical flow      0.727     0.727              0.779
AUC-PR
Optical flow                  0.380     0.387              0.489
Registered optical flow       0.353     0.362              0.466
Masked optical flow           0.467     0.464              0.549
Masked reg. optical flow      0.413     0.415              0.509
TABLE II: AUC-ROC and AUC-PR for all input optical flow variants

The masked optical flow variant performs the best in all cases. Incorporating the body motion error has no effect on the AUC-ROC, and only marginally improves AUC-PR. However, incorporating the camera motion error shows an improvement in both metrics. Fig. 5 shows the anomaly score with some corresponding image frames. Unexpected motions such as a falling book (top), and shaking of the camera (bottom) result in an increase in the anomaly score. Occlusion of the camera (middle) also results in a high anomaly score.

IV-D Discussion

The body motion error did not impact the performance significantly; this is probably because the types of anomalies in the dataset did not include discrepancies between the internally sensed arm motion and the observed motion. Body motion errors were visible when, for example, the book occluded the view of the arm, or the camera was occluded.

The camera motion error significantly improved the performance in terms of detecting the anomalies when they occurred (AUC-PR). In addition to detecting disturbances to the robot and occlusions, anomalies which involved large motions also increased the detection rate since they affected the image registration process (which assumes that the dominant motion in the scene is due to camera motion). Using registered optical flow decreased performance in all cases; this is likely due to the same reason.

The motions of the manipulator probably made the task of predicting the motion harder, since they caused significant motion in the image due to being close to the camera. Therefore, using the masked optical flow performed better. The method failed in cases of static anomalies (such as a book nearly falling out of the gripper), and instances where the anomaly was mostly occluded by the arm. False positives were seen in cases of inaccurate optical flow calculation, and often at the time of release of the book. Our dataset does not include background motions (such as persons moving around) which are unrelated to the task. This is an aspect that should be evaluated in future work, since unstructured motions in the background are likely to make predicting future motions harder.

V Conclusions

We investigated using visual anomaly detection for monitoring the execution of tasks by modelling the motions observed during successful executions. We use a combination of a learning and model-based method to generate expectations about the nominal motion, which are compared to the observed motions to detect anomalies. Our experiments show that it is beneficial to separately consider the known motions of the robot (in particular, the camera motion) when comparing the expected and observed motions. The results also improve if the motions of the robot body are removed from the input to the learning method. Our approach incorporates the robot’s kinematics and model as visual inputs to the anomaly detection method. However, non-visual task knowledge, such as task progress, could provide additional context and structure to the neural network or to the overall method, and is a good candidate for future research.


  • [1] O. Pettersson, “Execution monitoring in robotics: A survey,” Robotics and Autonomous Systems, vol. 53, no. 2, pp. 73–88, 2005.
  • [2] D. Park, H. Kim, and C. C. Kemp, “Multimodal anomaly detection for assistive robots,” Autonomous Robots, vol. 43, no. 3, pp. 611–629, 2019.
  • [3] E. Khalastchi, M. Kalech, G. A. Kaminka, and R. Lin, “Online data-driven anomaly detection in autonomous robots,” Knowledge and Information Systems, vol. 43, no. 3, pp. 657–688, 2015.
  • [4] L. Wellhausen, R. Ranftl, and M. Hutter, “Safe Robot Navigation via Multi-Modal Anomaly Detection,” IEEE Robot. Autom. Lett., vol. 5, no. 2, pp. 1326–1333, 2020.
  • [5] L. Mauro, E. Alati, M. Sanzari, V. Ntouskos, G. Massimiani, and F. Pirri, “Deep execution monitor for robot assistive tasks,” in Computer Vision – ECCV 2018 Workshops.   Springer International Publishing, 2019, pp. 158–175.
  • [6] A. Inceoglu, G. Ince, Y. Yaslan, and S. Sariel, “Failure Detection Using Proprioceptive, Auditory and Visual Modalities,” in 2018 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS).   IEEE, 2018, pp. 2491–2496.
  • [7] C. Lu, J. Shi, and J. Jia, “Abnormal Event Detection at 150 FPS in MATLAB,” in Proc. IEEE Int. Conf. on Computer Vision, 2013, pp. 2720–2727.
  • [8] B. Ramachandra and M. Jones, “Street Scene: A new dataset and evaluation protocol for video anomaly detection,” in The IEEE Winter Conf. on Applications of Computer Vision, 2020, pp. 2569–2578.
  • [9] W. Sultani, C. Chen, and M. Shah, “Real-world Anomaly Detection in Surveillance Videos,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2018, pp. 6479–6488.
  • [10] T. Wang, C. Yang, F. Kirchner, P. Du, F. Sun, and B. Fang, “Multimodal grasp data set: A novel visual–tactile data set for robotic manipulation,” Int. Journal of Advanced Robotic Systems, vol. 16, no. 1, p. 1729881418821571, 2019.
  • [11] A. Inceoglu, E. E. Aksoy, A. C. Ak, and S. Sariel, “FINO-Net: A Deep Multimodal Sensor Fusion Framework for Manipulation Failure Detection,” arXiv preprint arXiv:2011.05817, 2020.
  • [12] S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn, “RoboNet: Large-Scale Multi-Robot Learning,” Conf. on Robot Learning, 2019.
  • [13] E. Khalastchi and M. Kalech, “A sensor-based approach for fault detection and diagnosis for robotic systems,” Autonomous Robots, vol. 42, no. 6, pp. 1231–1248, 2018.
  • [14] D. Park, H. Kim, Y. Hoshi, Z. Erickson, A. Kapusta, and C. C. Kemp, “A Multimodal Execution Monitor with Anomaly Classification for Robot-Assisted Feeding,” in 2017 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 5406–5413.
  • [15] O. Çatal, S. Leroux, C. De Boom, T. Verbelen, and B. Dhoedt, “Anomaly Detection for Autonomous Guided Vehicles using Bayesian Surprise,” in 2020 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 8148–8153.
  • [16] B. Ramachandra, M. Jones, and R. R. Vatsavai, “A Survey of Single-Scene Video Anomaly Detection,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 2020.
  • [17] W. Liu, W. Luo, D. Lian, and S. Gao, “Future Frame Prediction for Anomaly Detection–A New Baseline,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2018, pp. 6536–6545.
  • [18] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel, “Memorizing Normality to Detect Anomaly: Memory-augmented Deep Autoencoder for Unsupervised Anomaly Detection,” in Proc. of the IEEE Int. Conf. on Computer Vision, 2019, pp. 1705–1714.
  • [19] Y. Tang, L. Zhao, S. Zhang, C. Gong, G. Li, and J. Yang, “Integrating prediction and reconstruction for anomaly detection,” Pattern Recognition Letters, vol. 129, pp. 123–130, 2020.
  • [20] T.-N. Nguyen and J. Meunier, “Anomaly Detection in Video Sequence with Appearance-Motion Correspondence,” in Proc. of the IEEE Int. Conf. on Computer Vision, 2019, pp. 1273–1283.
  • [21] Y. Lu, K. M. Kumar, S. Shahabeddin Nabavi, and Y. Wang, “Future Frame Prediction Using Convolutional VRNN for Anomaly Detection,” in 2019 16th IEEE Int. Conf. on Advanced Video and Signal Based Surveillance (AVSS).   IEEE, 2019, pp. 1–8.
  • [22] S. Haresh, S. Kumar, M. Zia, and Q. Tran, “Towards Anomaly Detection in Dashcam Videos,” in 31st IEEE Intelligent Vehicles Symp. (IV), 2020.
  • [23] S. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, J. R. Ledsam, K. Maier-Hein, S. A. Eslami, D. J. Rezende, and O. Ronneberger, “A Probabilistic U-Net for Segmentation of Ambiguous Images,” in Advances in Neural Information Processing Systems, 2018, pp. 6965–6975.
  • [24] R. Hartley and A. Zisserman, Multiple View Geometry in computer vision.   Cambridge University Press, 2003.
  • [25] B. S. Reddy and B. N. Chatterji, “An FFT-based Technique for Translation, Rotation, and Scale-Invariant Image Registration,” IEEE Trans. on Image Processing, vol. 5, no. 8, pp. 1266–1271, 1996.
  • [26] C. Zach, T. Pock, and H. Bischof, “A Duality Based Approach for Realtime TV-L1 Optical Flow,” in Joint Pattern Recognition Symposium.   Springer, 2007, pp. 214–223.
  • [27] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Int. Conf. on Medical Image Computing and Computer Assisted Intervention.   Springer, 2015, pp. 234–241.
  • [28] D. Jayaraman, F. Ebert, A. A. Efros, and S. Levine, “Time-Agnostic Prediction: Predicting Predictable Video Frames,” Int. Conf. on Learning Representations (ICLR), 2019.
  • [29] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” Int. Conf. on Learning Representations (ICLR), 2014.