A Multimodal Anomaly Detector for Robot-Assisted Feeding Using an LSTM-based Variational Autoencoder

11/02/2017 ∙ by Daehyung Park, et al. ∙ Georgia Institute of Technology 0

The detection of anomalous executions is valuable for reducing potential hazards in assistive manipulation. Multimodal sensory signals can be helpful for detecting a wide range of anomalies. However, the fusion of high-dimensional and heterogeneous modalities is a challenging problem. We introduce a long short-term memory based variational autoencoder (LSTM-VAE) that fuses signals and reconstructs their expected distribution. We also introduce an LSTM-VAE-based detector using a reconstruction-based anomaly score and a state-based threshold. For evaluations with 1,555 robot-assisted feeding executions including 12 representative types of anomalies, our detector had a higher area under the receiver operating characteristic curve (AUC) of 0.8710 than 5 other baseline detectors from the literature. We also show the multimodal fusion through the LSTM-VAE is effective by comparing our detector with 17 raw sensory signals versus 4 hand-engineered features.



There are no comments yet.


page 1

page 3

page 5

page 6

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

People with disabilities often need physical assistance from caregivers. Robots can provide assistance for activities of daily living such as robot-assisted feeding [1] and shaving [2]. However, its structural complexity, task variability, and sensor uncertainty may result in failure. A lack of detection systems for the failures may also lower the usage of robots due to potential failure cost. The detection of an anomalous task execution (i.e., anomaly) can help to prevent or reduce potential hazards in the assistance by recognizing highly unusual situations and stop in these situations.

In this paper, an anomaly detector is a method to identify when the current execution differs from past successful experiences (i.e., non-anomalous executions). Researchers often use an one-class classifier trained on non-anomalous execution. An ideal detector should detect a variety of anomalies, alert the robot quickly, ignore irrelevant task variation, and handle the stream of sensory signals. Multimodal sensory signals can be helpful to detect various anomalies using its high dimensional information. Researchers often reduce the dimension or select features before applying a classifier. Our previous work used also selected 4 hand-engineered features from 3 modalities for a likelihood-based classifier, HMM-GP, using hidden Markov models (HMM)

[3, 4]

. However, the compressed or selected representations may be missing information relevant to anomaly detection. Creating useful hand-specified features can also involve significant engineering effort and domain expertise.

Fig. 1: Robot-assisted feeding system. A PR2 robot detects anomalous feeding executions collecting 17 sensory signals from 5 types of sensors.

An alternative solution is reconstruction-based detection, such as an autoencoder (AE) based approach that compresses and reconstructs high dimensional inputs based on non-anomalous executions. When an AE is trained only with non-anomalous data, a high reconstruction error can indicate an anomaly. The idea behind this detection is that an AE cannot reconstruct unforeseen patterns of anomalous data well compared to foreseen non-anomalous data. In addition to the reconstruction error, a variational autoencoder (VAE) can compute the reconstruction log-likelihood of the inputs modeling the underlying probability distribution of data. Both AE and VAE can be combined with time-series modeling approaches such as recurrent neural network (RNN) including long short-term memory (LSTM) network.

In this paper, we introduce a long short-term memory-based variational autoencoder (LSTM-VAE) for multimodal anomaly detection. For encoding, an LSTM-VAE projects multimodal observations and their temporal dependencies at each time step into a latent space using serially connected LSTM and VAE layers. For decoding, it estimates the expected distribution of the multimodal inputs from the latent space representation. We train it under a denoising autoencoding criterion

[5] to prevent learning an identity function and improve representation capability. Our LSTM-VAE-based detector detects an anomaly when the log-likelihood of current observation given the expected distribution is lower than a threshold. We also introduce a state-based threshold to increase detection sensitivity and lower the false alarms similar to [3].

We evaluated the LSTM-VAE with robot-assisted feeding data that we collected from 24 able-bodied participants with 1,555 feeding executions. The proposed detector is beneficial in that we could directly use high-dimensional multimodal sensory signals without significant effort for feature engineering. It was able to catch an anomaly online. In particular, it was able to set tight or loose decision boundaries depending on the variations of multimodal signals using the state-based threshold. Our method had higher area under receiver operating characteristic (ROC) curves than other baseline methods from the literature. In our evaluation, the area under the curve (AUC) was 0.044 higher than that of our previous algorithm, HMM-GP given the same data. Our new method also had a 0.064 higher AUC when we used 17-dimensional sensory signals from visual, haptic, kinematic, and auditory modalities instead of 4-dimensional hand-engineered features.

Ii Related Work

Anomaly detection is known as novelty, outlier, or event detections in other domains

[6]. In robotics, it has been used to detect the failure of manipulation tasks: bin picking [7], bottle opening [8]

, etc. Many classic machine learning approaches have also been used: support vector machine (SVM)

[9, 10]

, self-organizing map (SOM)


, k-nearest neighbors (kNN)

[12], etc. To detect anomalies from time-series signals, researchers have also used hidden Markov models [3]

or Kalman filters


Researchers have often reduced the dimension of high-dimensional inputs using principal component analysis (PCA) before applying probabilistic or distance-based detection

[9, 14]. However, the compressed representations of outliers (i.e., anomalous data) may be inliers in latent space. Instead, we use a reconstruction-based method that recovers inputs from its compressed representation so that it can measure reconstruction error with the anomaly score. An AE is a representative reconstruction approach that is a connected network with an encoder and a decoder [15]. It has also been applied for reconstructing time-series data using a sliding time-window [16]. However, the window method does not represent dependencies between nearby windows and a window may not include an anomaly.

To model time-series data with its temporal dependencies, we use an LSTM network [17]

, which is a type of recurrent neural network (RNN). An LSTM network can make use of long-term dependencies and avoid the vanishing gradient problem

[17]. Researchers have used LSTM networks for prediction in this anomaly detection domains such as the following: radio anomaly detection [18] and EEG signal anomaly detection[19]. Malhorta et al. introduced an LSTM-based anomaly detector (LSTM-AD) that measures the distribution of prediction errors [20]. However, the method may not predict time-series under unpredictable external changes such as manual control and load on a machine [21]. Alternatively, researchers have introduced RNN- and LSTM-based autoencoders for reconstruction-based anomaly detection [22, 23]. In particular, Malhorta et al. introduced an LSTM-based encoder-and-decoder (EncDec-AD) that estimates a reconstruction error [21]. We also use this reconstruction scheme for detection and as a baseline method in this paper.

Another relevant approach is a variational autoencoder (VAE) [24]. Unlike the AE, a VAE models the underlying probability distribution of observations using variational inference (VI). Bayer and Osendorfer used VI to learn the underlying distribution of sequences and introduced stochastic recurrent networks [25]. Soelch et al. used their work to detect robot anomalies using unimodal signals [26]. Our work also uses variational inference, but we do not predict and instead only reconstruct data using an LSTM-based autoencoder for multimodal anomaly detection.

Iii LSTM-based Variational Autoencoding

We review an autoencoder and a variational autoencoder. We then describe our proposed LSTM-based variational autoencoder. We represent a vector of multidimensional inputs by and the corresponding latent space vector by , where and are the number of input signals and the dimension of the latent space, respectively.

Iii-a Preliminary: Autoencoder(AE)

An AE is an artificial neural network that consists of sequentially connected encoder and decoder networks. It sets the target of the decoder to be equal to the input of the encoder. The encoder network learns a compressed representation (i.e., bottleneck feature or latent variable) of the input. The decoder network reconstructs the target from the compressed representation. The difference between the input and the reconstructed input is the reconstruction error. During training, the autoencoder minimizes the reconstruction error as an objective function. An AE is often used for data generation as a generative model. An AE’s decoder can generate an output given an artificially assigned compressed representation.

Fig. 2: Illustration of a multimodal anomaly detector with an unrolled LSTM-VAE model. We train the LSTM-VAE using multimodal signals and corresponding progress-based priors. We then train a threshold estimator using the outputs of the LSTM-VAE. For testing, we input sensory signals only. The detector then returns an anomaly when current anomaly score is over an estimated threshold . Note that Linear* and LSTM layers have tanh and softplus activations, respectively. The red arrows are used for training only.

Iii-B Preliminary: Variational Autoencoder(VAE)

A VAE is a variant of an AE rooted in Bayesian inference

[24]. A VAE is able to model the underlying distribution of observations

and generate new data by introducing a set of latent random variables

. We can represent the process as . However, the integral is intractable due to the continuous domain of . Instead, we can represent the marginal log-likelihood of an individual point as using notation from [24], where is KullbackLeibler divergence from a prior to the variational approximation of and is the variational lower bound of the data by Jensen’s inequality. Note that and are the parameters of the encoder and the decoder, respectively.

A VAE optimizes the parameters, and , by maximizing the lower bound of the log likelihood, ,


The first term regularizes the latent variable by minimizing the KL divergence between the approximated posterior and the prior of the latent variable. The second term is the reconstruction of by maximizing the log-likelihood with sampling from .

The choice of distribution types are important since a VAE models the approximated posterior distribution from a prior and likelihood

. A typical choice for the posterior is a Gaussian distribution,

, where a standard normal distribution

is used for the prior. For the likelihood, a Bernoulli distribution or multivariate Gaussian distribution is often used for binary or continuous data, respectively.

Iii-C An LSTM-based Variational Autoencoder (LSTM-VAE)

We introduce a long short-term memory-based variational autoencoder (LSTM-VAE). To use the temporal dependency of time-series data in a VAE, we combine a VAE with LSTMs by replacing the feed-forward network in a VAE to LSTMs similar to conventional temporal AEs such as an RNN Encoder-Decoder [22] or an EncDec-AD [21]. Fig. 2 shows an unrolled structure with LSTM-based encoder-and-decoder modules. Given a multimodal input at time , the encoder approximates the posterior by feeding an LSTM’s output into two linear modules to estimate the mean

and co-variance

of the latent variable. Then, the randomly sampled from the posterior feeds into the decoder’s LSTM. The final outputs are the reconstruction mean and co-variance .

We apply a denoising autoencoding criterion [5] to the LSTM-VAE by introducing corrupted input with Gaussian noise, , where . We then replace the lower bound in Eq. (1) with a denoising variational lower bound [27],


where is an approximated posterior distribution given a corruption distribution around . Given Gaussian distribution for and , can be represented as a mixture of Gaussians. For computational convenience, we use a single Gaussian, .

We introduce a progress-based prior . Unlike conventional static priors using a normal distribution , we vary the center of a normal distribution as , where and are the center and co-variance of the underlying distribution of multimodal inputs, respectively (see Fig. 3). This kind of varying prior can be helpful to introduce the temporal dependency of time-series data into its underlying distribution since a VAE tries to minimize the difference between the approximated posterior and the prior. Unlike the RNN prior of Solch et al. [26] and the transition prior of Karl et al. [28], we gradually change from to as the progress of a task execution. To simplify the prior, we use an isotropic normal distribution so the co-variance matrix is . We can rewrite the regularization term of as

Fig. 3: Illustration of the progress-based prior. The center of the prior linearly changes from as initial progress to as final progress.

To represent the distribution of high-dimensional continuous data, we use a multivariate Gaussian with a diagonal co-variance matrix. We can derive the reconstruction term in as


We use an LSTM with tanh

for each encoder and decoder. We implemented the LSTM-VAE using stateful LSTM models in the Keras deep learning library

[29]. We trained the LSTM-VAE using an Adam optimizer with 3-dimensional latent variables and a 0.001 learning rate. Note that we are not using a sliding window in this work, but a window could be applied.

Iv Anomaly Detection

We now introduce an online anomaly detection framework for multimodal sensory signals with state-based thresholding.

Iv-a Anomaly Score

Our method detects an anomalous execution when the current anomaly score of an observation is higher than a score threshold ,


where is an anomaly score estimator. We define the score as the negative log-likelihood of an observation with respect to the reconstructed distribution of the observation through an encoding-decoding model,


where and are the mean and co-variance of the reconstructed distribution, , from an LSTM-VAE with parameters and . A high score indicates an input has not been reconstructed well by the the LSTM-VAE. In other words, the input has deviated greatly from the non-anomalous training data.

Iv-B State-based Thresholding

We introduce a varying threshold that changes over the estimated state of a task execution motivated by the dynamic threshold [3]. Depending on the state of task executions, reconstruction quality may vary. In other words, anomaly scores in non-anomalous task executions can be high in certain states, so varying the anomaly score can reduce false alarms and improve sensitivity. In this paper, the state is the latent space representation of observations. Given a sequence of observations, the encoder of LSTM-VAE is able to compute a state at each time step. By mapping states and corresponding anomaly scores from non-anomalous dataset, our method is able to train an expected anomaly score estimator . We use support vector regression (SVR) to map from a multidimensional input to a scaler

using a radial basis function (RBF) kernel. To control sensitivity, we add a constant

into the expected score and represent the state-based threshold as .

Iv-C Training and Testing Framework

Algorithm 1 shows the training framework of our LSTM-VAE-based anomaly detector. Given a set of non-anomalous training and validation data, , the framework aims to output the optimized parameters of an LSTM-VAE and an expected anomaly score estimator . Note that we represent sequences of multimodal observations as . and denote the numbers of training and validation data, respectively. We also represent the encoder and decoder functions as and , respectively. Then, we denote the function of serially connected encoder and decoder (i.e., autoencoder) by with noise injection.

The framework preprocesses and by resampling those to have length and normalizing their individual modalities in the range of with respect to . The framework then starts to train the LSTM-VAE with respect to maximizing and stops the training when

does not increase for 4 epochs. Then it extracts a set of latent space representations and corresponding anomaly scores from

as the training set for . Finally, this framework returns the trained SVR object as well as the LSTM-VAE’s parameters. Note that we reset the state of the LSTM in the beginning of a sequence of data only.

In testing, the detector aims to detect an anomaly in real time. Algorithm 2 shows the pseudo code for the online detection process. In each loop, the detector takes multimodal input x. The detector scales individual dimension with respect to the scaled . It then estimates the latent variable and the parameters of the reconstructed distribution. When the anomaly score of the current input is higher than the threshold , our system detects that the detector determines the current task execution is anomalous and returns the decision. We control the sensitivity of the detector by adding a constant to .

input : ,
output : , ,
1 = Preprocessing() ;
2 train LSTM-VAE with (;
3 ;
4 for  to  do
5       Reset the state of LSTM-VAE;
6       for  to  do
7             ;
8             ;
9             ;
10             Add and into and , respectively.
11       end for
13 end for
train an SVR with .
Algorithm 1 Training algorithm for an LSTM-VAE-based anomaly detector
input : 
output : Anomaly or Anomaly
1 while True do
2       get current multimodal data;
3       ;
4       ;
5       ;
6       if  then
7             return Anomaly ;
9       end if
11 end while
Algorithm 2 Testing algorithm for an LSTM-VAE-based anomaly detector.

V Experimental Setup

V-a Instrumental Setup

Our system uses a PR2 from Willow Garage, a general-purpose mobile manipulator with two 7-DOF arms with powered grippers and an omni-directional mobile base. For safety and prevention of possible hazards, we used a low-level PID controller with low gains and a mid-level model predictive controller from [30] without haptic feedback. We used the following sensors: an RGB-D camera with a microphone (Intel SR300) on the right wrist, a force/torque sensor (ATI Nano25) on the utensil handle, joint encoders, and current sensors. These sensors measure mouth position and sound, force on the utensil, spoon position, and joint torque, respectively.

V-B Data Collection

We used data from 1,555 feeding executions collected from 24 able-bodied participants where we newly collected 1,203 non-anomalous feeding executions for this work. 16 participants were male and 8 were female, and the age range was 19-35. We conducted the studies with approval from the Georgia Tech Institutional Review Board (IRB).

We divided our data into two subsets: a training/testing dataset collected from our previous work [4], and a pre-training dataset. The training/testing dataset consists of data from 352 executions (160 anomalous and 192 non-anomalous) collected from 8 able-bodied participants who used the feeding system with yogurt and a silicone spoon.

The pre-training dataset uses data from 1,203 non-anomalous executions from 16 newly recruited participants who used various foods and utensils. Pre-training with this dataset allowed us to initialize the weights of the LSTM-VAE to find a better fit in fine tuning. Among the dataset, 559 non-anomalous executions were from 9 participants who used 3 types of food and corresponding utensils: cottage cheese and silicone spoon, watermelon chunks and metal fork, and fruit mix and plastic spoon. An experimenter also conducted 428 non-anomalous feeding executions as a self-study with 6 foods (yogurt, rice, fruit mix, watermelon chunks, cereal, and cottage cheese) and 5 utensils (small/large plastic spoons, a silicone spoon, and plastic/metal forks). We also collected additional data from 216 non-anomalous executions from 6 participants who used yogurt and a silicone spoon.

V-C Experimental Procedure

This feeding system allows the user to command three autonomous subtasks: scooping/stabbing, clean spoon, and feeding. The user sends these commands using a web-based graphical user interface. A typical feeding sequence consists of scooping or stabbing followed by feeding. In order to approximate one form of limited mobility that people with disabilities might have, we instructed the participants to not move their upper bodies and to eat food off of the utensil using their lips.

Each participant performed anomalous and non-anomalous feeding executions while the participant, experimenters, or the system produced anomalies. We randomly determined the order of these executions. We defined 12 types of representative anomalies through fault tree analysis [31]: touch by user, aggressive eating, utensil collision by user, sound from user, face occlusion, utensil miss by user, unreachable location, environmental collision, environmental noise, utensil miss by system fault, utensil collision by system fault, and system freeze (see Fig. 5). For anomalies caused by the user, we instructed the participants on their actions through demonstration videos and verbal explanation. The participant had control over the details of the anomaly, such as the exact timing and magnitude of their actions.

Fig. 4: Left: Examples of food used in our experiments. Right: The 3D-printed utensil handle and 5 utensils used. Red boxes show yogurt and silicone spoon used for our training/testing dataset.
Fig. 5: 12 representative anomalies caused by either the user, the environment, or the system in our experiments.

V-D Data collection and Pre-processing

For each feeding execution, we collected 17 sensory signals from 5 sensors: sound energy (1), force (3) applied on the end effector, joint torque (7), spoon position (3), and mouth position (3), where the number in parentheses represents the dimension of signals. We zeroed the initial value and resampled each signal to have for the robot’s actual anomaly check frequency. We then scale signals in the non-anomalous dataset to have a value between 0 and 1. Corresponding to this scale, we also scaled signals from the anomalous dataset. Finally, we have a sequence of tuples per execution (i.e., sequence length 17). For visualization and comparison purposes, we also extracted 4-dimensional hand-engineered features used in our previous work [32]: sound energy, 1st joint torque, accumulated force, and spoon-mouth distance. Here, we used sound energy111Root mean square (RMS) of 1,024 frames. instead of raw 16 bit PCM encoding since the under sampling could miss auditory anomalies.

(a) A non-anomalous execution
(b) An anomalous execution
Fig. 8: Visualization of the reconstruction performance and anomaly scores over time using an LSTM-VAE. The upper four sub graphs show observations and reconstructed observations’ distribution. The lower sub graphs show current and expected anomaly scores. The dashed curve shows a state-based threshold where the LSTM-VAE determines an anomaly when current anomaly score is over the threshold. Brown lines represent the time of anomaly detection.

V-E Baseline Methods

To evaluate the performance of the proposed method, we implemented 5-baseline methods,

  • RANDOM: A random binary classifier in which we control its sensitivity by weighting a class.

  • OSVM: A one-class SVM-based detector trained with only non-anomalous executions. We move a sliding window (of size 3 in time like EncDec-AD [21]) one step at a time. We control its sensitivity by adjusting the number of support vectors.

  • HMM-GP: A likelihood-based classifier using an HMM introduced in [32]. We vary the likelihood threshold with respect to the distribution of hidden states.

  • AE: A reconstruction-based anomaly detector using a conventional autoencoder with a 3 time step sliding window based on [33].

  • EncDec-AD: A reconstruction-based anomaly detector using an LSTM-based autoencoder [21]. We use window size as in the paper, but unlike the paper we use a diagonal co-variance matrix when we model the distribution of reconstruction-error vectors.

From now on, we will also use the term LSTM-VAE to refer to our LSTM-VAE-based detector.

Vi Evaluation

We first investigated the reconstruction function of the LSTM-VAE. The upper 4 sub graphs in Fig. 8 show the reconstructed distribution of 4 hand-engineered features from non-anomalous and anomalous feeding executions in the robot-assisted feeding task. For Fig. (a)a, the observed features (blue curves) and the mean of reconstructed distribution (red curves) show a similar pattern of change over time. On the other hand, in an anomalous execution (see Fig. (b)b), the LSTM-VAE resulted in a large deviation between observed and reconstructed accumulated force since the pattern of accumulated force by the collision is not easily observable from non-anomalous executions. Consequently, we can observe the anomaly score (blue curve) gradually increases after the onset of the deviation from the lower sub graph of Fig. (b)b. Note that the anomalous execution came from a face-spoon collision caused intentionally by the user.

The anomaly score metric is effective in distinguishing anomalies. Fig. 9

shows the distributions of the anomaly scores over time of a participant’s 24 anomalous and 20 non-anomalous feeding executions during leave-one-person-out cross validation. The blue and red shaded regions show the mean and standard deviation of non-anomalous and anomalous executions’ anomaly scores, respectively. The score of non-anomalous executions shows a specific pattern of change with a smaller average and variance than that of anomalous executions, making anomalies easily distinguishable from non-anomalous situations.

The lower sub graphs of Fig. 8 also show the state-based threshold is capable of achieving a tighter anomaly decision boundary (red dash lines) than a fixed threshold over time. The expected anomaly scores (red curves) and the actual scores (blue curves) show a similar pattern of change. However, the expected score is lower than the actual score given an anomaly. Brown boxes show the time of anomaly detection where the first detection time matches with the initial increase of accumulated force.

Fig. 9: An example distributions of anomaly scores from a participant’s 20 non-anomalous and 24 anomalous executions over time.

We compared our LSTM-VAE with 5 other baseline methods through a leave-one-person-out cross-validation method. Given the training/testing dataset, we used data from 7 participants for training and tested with the data from the remaining 1 participant. Table I shows our detector outperformed the other 5 baseline methods with higher area under the ROC curve (AUC) when using 4 hand-engineered features. The AUC is 0.044 higher than the next best method, HMM-GP.

We also investigated the performance when using 17 sensory signals with the additional pre-training dataset. Our method resulted in the highest performance of AUC that is 0.064 higher than the next best method, EncDec-AD. It is also higher than the result from hand-engineered features. This indicates an LSTM-VAE is capable of modeling the heterogeneous high-dimensional multimodal signals and detecting anomalies among those signals without significant feature extraction effort. In this evaluation, we pre-trained each method using the pre-training dataset in addition to the dataset from the 7 participants. We then fine-tuned each method with the data from 7 participants. Note that we train an OSVM with the pre-traning dataset only and we did not succeed in training HMM-GP due to underflow errors resulting from the high-dimensional input.

Fig. 10 shows ROC curves from an LSTM-VAE with two thresholding techniques. The red curve shows the result of the proposed state-based thresholding. The yellow curve shows the result of conventional fixed thresholding. The state-based thresholding resulted in higher true positive rates given the same false positive rates. In this evaluation, we used 17 sensory signals with the pre-training dataset.

Fig. 10: Receiver operating characteristic (ROC) curves to compare the performance of LSTM-VAEs with and without a state-based threshold.
4 hand-engineered features 0.5121 0.7427 0.8121 0.8123 0.7995 0.8564
17 raw sensory signals 0.5052 0.7376 N/A 0.8012 0.8075 0.8710
TABLE I: Comparison of the LSTM-VAE and 5 baseline methods with two types of inputs. Numbers represent the area under the ROC curve (AUC).

Vii Conclusion

We introduced an LSTM-VAE-based anomaly detector for multimodal anomaly detection. An LSTM-VAE models the underlying distribution of multi-dimensional signals and reconstructed the signals with expected distribution information. The detector estimated the negative log-likelihood of multimodal input with respect to the distribution as anomaly score. By introducing a denoising autoencoding criterion and state-based thresholding, the detector successfully detected anomalies in robot-assisted feeding resulting in higher AUC than other 5 baseline methods in literature. Without significant effort of feature engineering, the detector with 17 raw inputs outperformed a detector trained with 4 hand-engineered features. Finally, we also showed the LSTM-VAE with the state-based decision boundary is beneficial for more sensitive anomaly detection with lower false alarms.


This work was supported in part by NSF Award IIS-1150157, NIDILRR grant 90RE5016-01-00 via RERC TechSAge, and a Google Faculty Research Award. Dr. Kemp is a cofounder, a board member, an equity holder, and the CTO of Hello Robot, Inc., which is developing products related to this research. This research could affect his personal financial status. The terms of this arrangement have been reviewed and approved by Georgia Tech in accordance with its conflict of interest policies.