1 Introduction
Novelty detection comprises anomaly detection algorithms that first learn a model of the normal data and, at inference time, compute a novelty score of unseen data samples under the learned model. Data samples are flagged as normal or abnormal by comparing their novelty score to a learned decision boundary. In that sense, novelty detection can be seen as a one-class classification task. One important characteristic, and appealing property, of novelty detection algorithms is that they do not require access to instances of anomalous data during training. This matters in situations in which anomalous data is scarce and hard or costly to obtain. Because of this, novelty detection is widely used in many domains, such as medical diagnostics (Tarassenko et al., 1995), security of electronic systems (Patcha & Park, 2007), or mobile robotics (Hornung et al., 2014).
Pimentel et al. (2014) classify novelty detection algorithms into several groups: probabilistic, distance-based, reconstruction-based, domain-based, and information-theoretic techniques. For our application, we initially chose four methods that belong to different algorithm families outlined above: support vector machine (SVM) (Schölkopf et al., 2000), isolation forest (IF) (Liu et al., 2012), and local outlier factor detector (LOF) (Breunig et al., 2000) as representatives of conventional machine learning algorithms, and normalizing flows (NF) (Kingma et al., 2016) as representative of deep learning algorithms. In preliminary tests, LOF yielded the best classification accuracy of the three conventional methods, so we focused our further effort on LOF and normalizing flows.
Normalizing flows are a class of deep generative models that leverage invertible neural networks to learn a mapping between a simple base distribution and a given data distribution. The invertibility allows for two important use cases: generation of new data and classification of input data samples by computing their likelihood. The latter makes them a suitable candidate algorithm for novelty detection. Compared to classical algorithms, normalizing flows allow us to flexibly constrain the learned distributions, for instance by enforcing an autoregressive property when modeling time series.
In this work, we demonstrate the applicability of normalizing flows for novelty detection in time series. We apply two different flow models, masked autoregressive flows (MAF) (Papamakarios et al., 2017) and FFJORD (Grathwohl et al., 2019) restricted by a Masked Autoencoder for Distribution Estimation (MADE) architecture (Germain et al., 2015), to synthetic data and to motor current time series data from an industrial machine. Both flow-based models achieve superior results over the local outlier factor method. Furthermore, we demonstrate the generation of new data samples with the learned flow models, which can give rise to further use cases in the domain of anomaly detection and defect analysis in industrial machines.
2 Datasets
We create synthetic data to test the models' ability for novelty detection by generating random time series with a defined autocorrelation function. We specify the autocorrelation function as $R(\Delta t) = e^{-|\Delta t|/\tau}$ with decay time $\tau$, which defines the covariance matrix $\Sigma$ for a given length of time series, and compute its Cholesky decomposition $L$ such that $\Sigma = L L^{\top}$. We then generate data samples by drawing white noise samples $z \sim \mathcal{N}(0, I)$ and transforming them to samples of our time series with the given autocorrelation as $x = L z$. We define normal samples to have a fixed decay time and vary the decay time to create abnormal samples. By construction, the inter-sample mean at each time step is 0. To keep the variance across samples at every time step equal between normal and abnormal samples, we divide abnormal samples by their inter-sample standard deviation and multiply by that of the normal time series.
To test on real data, we used a dataset generously provided by Kawasaki Heavy Industries Ltd. (KHI). The data is an electric motor current signal (measured in Amperes), collected from the electric motor of one of KHI's products. The dataset contains both the signal during the motor's normal operation and the anomalous signal recorded after a gearbox connected to the motor experienced an undefined problem in its operation. There are eight different patterns of motor operation, each pattern lasting between 5 and 30 seconds. Signals were sampled at a frequency of 500 Hz. The ratio of normal to anomalous data in the dataset was roughly equal, but only a subset of the normal data was used for model training, while held-out normal data and the anomalous data were used for testing. One sample each of normal and anomalous data is shown in Fig3_real_dataA. In this work, we use a subset of one of the eight patterns that, based on the classification performance of the LOF, we found to be the most challenging for anomaly detection.
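The synthetic data generation can be sketched as follows; the exponential form of the autocorrelation, $e^{-|\Delta t|/\tau}$, the decay times, and the sample counts are illustrative assumptions:

```python
import numpy as np

def synthetic_series(n_samples, length, tau, seed=0):
    """Draw Gaussian time series whose autocorrelation decays as
    exp(-|dt|/tau) (assumed form) via a Cholesky factor of the covariance."""
    rng = np.random.default_rng(seed)
    t = np.arange(length, dtype=float)
    cov = np.exp(-np.abs(t[:, None] - t[None, :]) / tau)  # Sigma
    L = np.linalg.cholesky(cov)                           # Sigma = L L^T
    z = rng.standard_normal((n_samples, length))          # white noise
    return z @ L.T                                        # x = L z, per row

normal = synthetic_series(1000, 100, tau=10.0)   # fixed decay time
abnormal = synthetic_series(1000, 100, tau=2.0)  # shorter decay time
# Rescale abnormal samples so the per-time-step spread matches the normal data:
abnormal = abnormal / abnormal.std(axis=0) * normal.std(axis=0)
```

By construction each time step has zero mean and unit variance regardless of the decay time; only the correlation between time steps differs between normal and abnormal samples.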
3 Model setup
We train three different models on our data: Masked Autoregressive Flows, Continuous Normalizing Flows using Free-form Jacobian of Reversible Dynamics with a MADE network, and Local Outlier Factor.

MAF uses a stack of a fixed number of affine layers whose scale and shift parameters are computed by an autoregressive network, here implemented using the MADE architecture (Germain et al., 2015). Given a latent random variable $u$, the transformed variable is computed as $x_i = u_i \exp(\alpha_i) + \mu_i$, where the scale terms $\alpha_i$ and shift terms $\mu_i$ depend only on $x_{1:i-1}$ and are efficiently computed by one forward pass through a MADE network. We chose MAF over Inverse Autoregressive Flows (IAF) (Kingma et al., 2016) because MAF offers fast evaluation of data likelihood, which is essential in novelty detection. IAF, on the other hand, offers fast generation of new data but slow evaluation of test data. In our experiments, we chose the standard normal distribution as the base distribution and stack 5 coupling layers, each with MADE networks consisting of 3 hidden layers with 256 units each and tanh activation function.
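A minimal numpy sketch of one affine MAF layer with MADE-style masks may clarify the construction; the mask generation, weight shapes, and random degree assignment here are illustrative, not the paper's trained model:

```python
import numpy as np

def made_masks(d, hidden, seed=0):
    """Binary masks that enforce the autoregressive property: output i
    may only depend on inputs 1..i-1 (MADE-style degree assignment)."""
    rng = np.random.default_rng(seed)
    m_in = np.arange(1, d + 1)                 # input degrees 1..d
    m_h = rng.integers(1, d, size=hidden)      # hidden degrees 1..d-1
    mask_in = (m_h[:, None] >= m_in[None, :]).astype(float)
    mask_out = (m_in[:, None] > m_h[None, :]).astype(float)
    return mask_in, mask_out

def maf_layer(x, W_in, W_mu, W_alpha, mask_in, mask_out):
    """Inverse pass of one affine MAF layer, u_i = (x_i - mu_i) * exp(-alpha_i),
    where mu_i and alpha_i depend only on x_{1:i-1} through the masks."""
    h = np.tanh(x @ (W_in * mask_in).T)
    mu = h @ (W_mu * mask_out).T
    alpha = h @ (W_alpha * mask_out).T
    u = (x - mu) * np.exp(-alpha)
    log_det = -alpha.sum(axis=-1)              # log|det du/dx| of this layer
    return u, log_det
```

Because $\mu_i$ and $\alpha_i$ never see $x_i$ or later entries, the whole inverse pass (and hence the likelihood) is computed in a single forward pass through the masked network.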

The FFJORD model (Grathwohl et al., 2019) extends the idea of continuous normalizing flows (CNF) (Chen et al., 2018) by an improved estimator of the log-density of samples. CNF models the latent variable $z(t)$ with an ordinary differential equation $\partial z(t)/\partial t = f(z(t), t)$, so that transforming from latent to data space is equivalent to integrating the ODE from pseudo-time $t_0$ to $t_1$: $x = z(t_0) + \int_{t_0}^{t_1} f(z(t), t)\,dt$ (see Grathwohl et al. (2019) for details). The function $f$ is represented by a neural network. We here chose a MADE architecture to enforce the autoregressive property between different time samples: $\partial z_i(t)/\partial t = f_i(z_{1:i}(t), t)$. In our experiments, we use 2 hidden layers with 256 neurons and tanh activation function. More details on the experiments are given in Section 7.
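The latent-to-data transform of a CNF can be sketched with a fixed-step Euler integrator; the experiments use the adaptive 'dopri5' solver (Section 7), so Euler here is for illustration only:

```python
import numpy as np

def cnf_transform(z, f, t0=0.0, t1=1.0, steps=1000):
    """Integrate dz/dt = f(z, t) from t0 to t1 with fixed-step Euler.
    FFJORD additionally integrates -Tr(df/dz) alongside z to track the
    change in log-density (via a Hutchinson trace estimator)."""
    h = (t1 - t0) / steps
    for k in range(steps):
        z = z + h * f(z, t0 + k * h)
    return z

# Example: linear dynamics f(z, t) = -z contract the state towards 0;
# the exact solution is z(t1) = z(t0) * exp(-(t1 - t0)).
z1 = cnf_transform(np.array([[1.0, 2.0]]), lambda z, t: -z)
```

In practice the vector field is a neural network, and an adaptive solver chooses the step size; the structure of the computation is the same.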
Local outlier factor is a distance-based novelty detection algorithm that assigns a degree of being an outlier to each data point. This degree, the local outlier factor, is determined by comparing the local density of a data point to the local density of its neighboring points. A point that has significantly lower local density than its neighbors is considered to be an outlier. The main parameter used to influence the algorithm's performance is MinPts, which specifies the number of points considered as the neighborhood of a data point. Breunig et al. (2000) establish that inliers have a LOF value of approximately 1, while the LOF of outliers is greater than 1. They also provide the tightness of the lower and upper bounds for outliers' LOF values.
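In scikit-learn, LOF can be run as a novelty detector by fitting on normal data only; the parameters below follow Section 7, while the toy 2-D data is purely illustrative:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# With novelty=True, LOF is fit on normal data only and can then score
# unseen samples, matching the novelty detection setting.
rng = np.random.default_rng(0)
normal = rng.standard_normal((500, 2))            # toy "normal" data
lof = LocalOutlierFactor(n_neighbors=50,          # MinPts, as in Section 7
                         metric="chebyshev",      # distance, as in Section 7
                         novelty=True)
lof.fit(normal)
inlier_pred = lof.predict(rng.standard_normal((10, 2)))  # mostly +1
outlier_pred = lof.predict(np.full((10, 2), 8.0))        # -1 (far away)
```

`predict` returns +1 for inliers and -1 for outliers; `score_samples` exposes the underlying continuous outlier score if a custom decision boundary is needed.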
4 Experiments
We test our idea of using flow models as novelty detectors for time series on a set of synthetic data where we control the deviation between normal and abnormal samples. The data is created as time series with a defined autocorrelation where we systematically vary the time constant of the autocorrelation (see Section 2 for details). Normal and abnormal data differ only in their correlation between time points, while the ensemble statistics (mean and variance across samples) are equal at every time step (Fig1_toy_dataA, B). This makes anomaly detection challenging because the model cannot resort to simply representing the mean and variance of each time step independently but rather needs to learn a joint distribution representing the temporal correlations.
We train the two flow models on the normal data and then apply them to unseen normal samples as well as abnormal data samples (Fig1_toy_dataC). Both models transform the normal samples to approximately white noise, while the transformed abnormal samples clearly deviate from white noise. Consistent with the visual inspection of transformed samples, the models assign lower likelihood to abnormal samples than to normal samples, and this difference becomes clearer with increasing deviation of abnormal samples, i.e. decreasing time constant of the autocorrelation function (Fig1_toy_data, dark blue vs. light blue data points).
To use the trained model as a novelty detector, we define a decision boundary, i.e. a likelihood value which separates abnormal from normal data samples. To judge the quality of a novelty detector, a typical metric is the receiver operating characteristic (ROC) curve. Given a model and abnormal data samples, we vary the decision boundary and measure the rate of false positives (normal data classified as abnormal) versus true positives (abnormal data classified as abnormal). Thus, the steeper the slope of the ROC curve, the better. For varying time constants in the autocorrelation of the synthetic data, we compute the ROC curves of the two flow-based models and compare them to the local outlier factor (LOF) (Fig2_novelty_det). Both flow-based models quickly deviate from chance level (reached when the time constant equals that of the normal data, in which case the test data is drawn from the same distribution as the normal data) as the time constant decreases. The LOF model does not impose autoregressive constraints on the time series and thus performs worse at modelling the temporal correlations between data points in the time series.
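Sweeping the decision boundary to obtain the ROC curve can be sketched with scikit-learn; the log-likelihood values below are synthetic placeholders, not the paper's results:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Placeholder log-likelihoods: a trained flow assigns higher likelihood
# to normal test samples than to abnormal ones.
rng = np.random.default_rng(0)
ll_normal = rng.normal(-100.0, 5.0, size=200)
ll_abnormal = rng.normal(-130.0, 5.0, size=200)

# Abnormal is the positive class; negate log-likelihoods so that larger
# scores mean "more anomalous", then sweep the decision boundary.
y_true = np.r_[np.zeros(200), np.ones(200)]
scores = -np.r_[ll_normal, ll_abnormal]
fpr, tpr, thresholds = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)  # close to 1 for well-separated scores
```

Each threshold in `thresholds` corresponds to one candidate decision boundary; the (fpr, tpr) pairs trace out the ROC curve.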
We test the feasibility of the flow-based models on real data (see Section 2 for details). Both normal and abnormal time series follow a very similar global curve (Fig3_real_dataA, top) and are visually hardly distinguishable. We thus split the normal samples into train and test data and subtract the mean across training samples from normal and abnormal samples (Fig3_real_dataA, middle and bottom). Abnormal and normal samples display different temporal correlations (Fig3_real_dataB), and consequently, after training on the normal data, both models clearly separate normal test data from abnormal data (Fig3_real_dataC). We find that FFJORD+MADE diverges when applied to abnormal data samples after training has run for a sufficient number of epochs (depending on the hyperparameters), once very low training and test loss has been reached: the ordinary differential equation becomes unstable for samples deviating from the training distribution. For the purposes of visualization, we thus stop the training after 140 epochs for Fig3_real_data. While the normal samples are approximately mapped to uncorrelated white noise, the transformed abnormal samples clearly deviate from white noise. To score the samples, we again compute the likelihood per time point and bin them into histograms (Fig3_real_dataD). It is trivial to define a decision boundary which reaches a 0% false positive rate and a 100% true positive rate. The LOF model reaches slightly worse true and false positive rates on this task, so the flow-based models marginally outperform the simpler method.
The flow-based models also enable us to generate new, artificial data. We draw random samples of white noise (Fig4_generativeA, top) and pass them through the models to transform them into samples from the data distribution (Fig4_generativeA, bottom). The generated samples fit the empirical autocorrelation very well (Fig4_generativeB), demonstrating that we can indeed generate artificial samples of normal data from the learned models.
5 Discussion
In this work, we evaluate the use of flow-based generative models for novelty detection in time series. Since normalizing flows approximate the distribution of the data and score data samples based on their likelihood, they possess the necessary ingredients for novelty detection. In comparison to conventional methods, they can learn very flexible distributions and allow the modeler to impose helpful constraints such as autoregressive properties in time series.
To test our idea, we use two flow-based models: Masked Autoregressive Flows (MAF) and a Free-form Jacobian of Reversible Dynamics (FFJORD) continuous-time flow. We restrict the FFJORD model with an autoregressive constraint imposed through autoregressive MADE networks, which, to the best of our knowledge, has not been done before. We train the models on synthetic data, which we control to make novelty detection challenging, and demonstrate good classification performance, outperforming a conventional method for novelty detection, Local Outlier Factor (LOF). Applied to less challenging real data from an industrial machine, the models reach perfect accuracy, marginally better than LOF.
As an extension of this work, the flowbased models could be trained to learn the transition from normal to abnormal samples. This could support the analysis of the defect causing the emergence of abnormal data in the industrial machine. Furthermore, the learned transition from normal to abnormal data could be applied to new motor operation patterns and generate abnormal data, thereby helping to understand the possible failure modes of the machine defect.
6 Acknowledgments
We would like to thank Jun Yamaguchi and Hitoshi Hasunuma from Kawasaki Heavy Industries Ltd. and Shunji Goto from Ascent Robotic Inc. for their support of this project.
References
 Bingham et al. (2018) Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., and Goodman, N. D. Pyro: Deep Universal Probabilistic Programming. Journal of Machine Learning Research, 2018.
 Breunig et al. (2000) Breunig, M. M., Kriegel, H. P., Ng, R. T., and Sander, J. LOF: Identifying density-based local outliers. In ACM SIGMOD Int. Conf. on Management of Data, pp. 93–104, 2000.
 Chen et al. (2018) Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6572–6583, 2018.
 Germain et al. (2015) Germain, M., Gregor, K., Murray, I., and Larochelle, H. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pp. 881–889, 2015.
 Grathwohl et al. (2019) Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.
 Hornung et al. (2014) Hornung, R., Urbanek, H., Klodmann, J., Osendorfer, C., and Van Der Smagt, P. Model-free robot anomaly detection. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3676–3683. IEEE, 2014.
 Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751, 2016.
 Liu et al. (2012) Liu, F. T., Ting, K. M., and Zhou, Z. H. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1):3, 2012.
 Papamakarios et al. (2017) Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347, 2017.

 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
 Patcha & Park (2007) Patcha, A. and Park, J.-M. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Computer Networks, 51(12):3448–3470, aug 2007. doi: 10.1016/j.comnet.2007.02.001.
 Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 Pimentel et al. (2014) Pimentel, M. A., Clifton, D. A., Clifton, L., and Tarassenko, L. A review of novelty detection. Signal Processing, 99:215–249, 2014.
 Schölkopf et al. (2000) Schölkopf, B., Williamson, R. C., Smola, A. J., ShaweTaylor, J., and Platt, J. C. Support vector method for novelty detection. In Advances in neural information processing systems, pp. 582–588, 2000.
 Tarassenko et al. (1995) Tarassenko, L., Hayton, P., Cerneaz, N., and Brady, M. Novelty detection for the identification of masses in mammograms. In Proceedings of the 4th International Conference on Artificial Neural Networks, pp. 442–447. IET, 1995.
7 Supplement
For all experiments, the data was preprocessed in the same way: 80% of the normal data was used for training, while the remaining 20% of the normal data and 100% of the abnormal data were used for testing. We subsample the data by a factor of 10 and extract the middle 100 time points of the whole signal. We calculate the mean and standard deviation of the training data and use them to normalize all three datasets.
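The preprocessing pipeline can be sketched as follows; the helper name, array shapes, and sample counts are illustrative:

```python
import numpy as np

def preprocess(normal, abnormal, factor=10, window=100, train_frac=0.8):
    """Subsample by `factor`, crop the middle `window` time points, and
    z-score all splits with statistics of the training split only."""
    def crop(x):
        x = x[:, ::factor]                     # subsample in time
        mid = x.shape[1] // 2
        return x[:, mid - window // 2 : mid + window // 2]
    normal, abnormal = crop(normal), crop(abnormal)
    n_train = int(train_frac * len(normal))
    train, test = normal[:n_train], normal[n_train:]
    mu, sd = train.mean(), train.std()         # training statistics only
    return (train - mu) / sd, (test - mu) / sd, (abnormal - mu) / sd
```

Using only the training statistics for normalization avoids leaking information from the test and abnormal data into the model.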
For the flow-based models, we used the following hyperparameters: Adam optimizer with learning rate and weight decay, ODE solver (for FFJORD+MADE) 'dopri5', batch size, number of epochs 620 (Fig3_real_data, FFJORD+MADE), 2000 (Fig3_real_data, MAF), 140 (Fig3_real_data, FFJORD+MADE), 100 (Fig3_real_data, Fig4_generative, MAF), 600 (Fig4_generative, FFJORD+MADE). Our FFJORD+MADE implementation is based on the ffjord library (https://github.com/rtqichen/ffjord) and the MADE implementation in the pyro library (Bingham et al., 2018). Our MAF implementation is based on a publicly available implementation (https://github.com/ikostrikov/pytorch-flows) using PyTorch (Paszke et al., 2017).
For Local Outlier Factor, the parameter MinPts is set to 50, and the Chebyshev distance metric was used to compute distances between data points. We used the implementation in scikit-learn (Pedregosa et al., 2011).