Anomaly detection, also known as outlier detection, is the problem of detecting data that significantly differ from normal data [1, 2, 3]. Since such anomalies may indicate mistakes or malicious activities, their prompt detection can prevent serious problems. Therefore, anomaly detection has received much attention and has been applied for various purposes.
In this paper, we specifically consider unsupervised anomaly detection (UAD), in which only normal data can be used for training anomaly detection models. UAD is typically solved by first training a normal model with normal data and then estimating the deviation of each testing sample from the trained model. In the anomaly detection field, many types of normal models have been investigated. In early studies, a Gaussian distribution was used [4, 5], and later, more flexible statistical models such as the Gaussian mixture model (GMM) were adopted [6, 7]. More recently, deep neural network (DNN)-based methods have been investigated, such as the Auto-Encoder (AE) [8, 9], the Variational Auto-Encoder (VAE) [10, 11, 12], and Generative Adversarial Networks (GANs) [13, 14, 15].
In the typical UAD setting, one assumes that training and testing data are sampled from the same distribution. However, this assumption does not hold in certain practical scenarios. Consider, for example, anomaly detection for facility equipment. Such equipment typically has various operation patterns, and the environmental noise around it may change due to factors such as the season and the weather. In this case, the above assumption does not always hold; hence, simply applying existing normal models to such problems may significantly decrease anomaly detection accuracy. A naïve way to avoid this is to adapt the normal model to the new distribution by fine-tuning it with newly collected normal data. However, fine-tuning requires high memory and computational costs and cannot easily be conducted on the devices installed in facility equipment, which typically have only limited computational resources. Therefore, a more efficient adaptation method is needed.
To address this problem, we propose a new density estimator named AdaFlow, a unified model of Normalizing Flows (NFs) [16, 17], a powerful DNN-based density estimator, and Adaptive Batch Normalization (AdaBN) [18], a module that enables DNNs to handle data from different domains. AdaBN alleviates the difference between domains by scaling and shifting each domain's input data so that each domain's mean and variance become zero and one, respectively. Since AdaBN can be adapted to a new domain by just adjusting its statistics with that domain's data, the adaptation step of AdaFlow can be done by conducting forward-propagation only once per sample. Therefore, AdaFlow can be used on devices that have limited computational resources.
We also propose a method of applying AdaFlow to the unpaired cross-domain translation problem, in which one has to train a cross-domain translation model with only unpaired data. We show the effectiveness of using AdaFlow for this problem through cross-domain translation experiments on image datasets.
2 Related Work
2.1 Unsupervised anomaly detection
In UAD, the deviation between a normal model and an observation is computed; this deviation is often called the "anomaly score". One way of computing anomaly scores is a density estimation-based approach. This approach first trains a density estimator $p(x; \theta)$, such as a Gaussian distribution function, with normal data, and then computes the negative log-likelihood (NLL) of each testing sample $x$ under $p(x; \theta)$. In this approach, the NLL is used as the anomaly score $A(x)$, i.e.,

$$A(x) = -\ln p(x; \theta). \quad (1)$$
Then, $x$ is determined to be anomalous when the anomaly score exceeds a pre-defined threshold $\phi$:

$$\mathrm{Decision}(x) = \begin{cases} \text{Anomaly} & (A(x) > \phi) \\ \text{Normal} & (\text{otherwise}). \end{cases} \quad (2)$$
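As a concrete sketch of this approach, the following minimal numpy example fits a Gaussian normal model, scores samples by their NLL, and thresholds the score. The toy data, dimensionality, and the 99th-percentile choice of the threshold are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 2))                  # normal training data
mu = train.mean(axis=0)
cov = np.cov(train, rowvar=False)
cov_inv = np.linalg.inv(cov)
_, logdet = np.linalg.slogdet(cov)

def anomaly_score(x):
    """A(x) = -ln p(x; theta) under the fitted Gaussian normal model."""
    diff = x - mu
    return 0.5 * (2 * np.log(2 * np.pi) + logdet + diff @ cov_inv @ diff)

# threshold phi chosen here as the 99th percentile of normal-data scores
phi = np.quantile([anomaly_score(x) for x in train], 0.99)

assert anomaly_score(np.zeros(2)) <= phi      # typical sample: Normal
assert anomaly_score(np.full(2, 6.0)) > phi   # far-away sample: Anomaly
```

The same scoring-and-thresholding structure carries over unchanged when the Gaussian is replaced by a more flexible density estimator.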
Recently, deep learning has also been investigated for defining normal models for UAD. Several studies on deep-learning-based UAD employed an AE [8, 9] (or a VAE [11, 12]). The AE-based anomaly detection framework defines the anomaly score as follows:

$$A(x) = \| x - D(E(x; \theta_E); \theta_D) \|_2^2, \quad (3)$$
where $\| \cdot \|_2$ denotes the $L_2$ norm, $E$ and $D$ are the encoder and decoder of the AE, and $\theta_E$ and $\theta_D$ are their parameters, namely $\theta = \{\theta_E, \theta_D\}$. Then, $\theta$ is trained to minimize the anomaly scores of normal data as follows:

$$\theta \leftarrow \arg\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} A(x_n), \quad (4)$$

where $x_n$ is the $n$-th training sample and $N$ is the number of training samples.
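The AE-based score and its training objective can be sketched with a deliberately tiny linear auto-encoder fitted by plain gradient descent. The data, layer sizes, learning rate, and iteration count are all illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
# correlated "normal" data living near a 2-D subspace of an 8-D space
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 8)) \
    + 0.1 * rng.normal(size=(500, 8))

W1 = rng.normal(scale=0.1, size=(8, 2))   # encoder E (linear)
W2 = rng.normal(scale=0.1, size=(2, 8))   # decoder D (linear)

def anomaly_score(x):
    """Squared reconstruction error ||x - D(E(x))||^2 (linear case)."""
    return np.sum((x - (x @ W1) @ W2) ** 2, axis=-1)

loss_before = anomaly_score(X).mean()
lr = 1e-3
for _ in range(500):                      # minimize the mean score of normal data
    Z = X @ W1
    R = Z @ W2 - X                        # reconstruction residual
    gW2 = 2 * Z.T @ R / len(X)
    gW1 = 2 * X.T @ (R @ W2.T) / len(X)
    W1 -= lr * gW1
    W2 -= lr * gW2

assert anomaly_score(X).mean() < loss_before  # training lowers normal-data scores
```

Note that nothing in this objective constrains the scores of anomalous data, which is exactly the drawback discussed next.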
Although it has been empirically shown that anomaly detection can be addressed by AE-based methods, one drawback is that there is no guarantee that minimizing Eq. (4) encourages the anomaly scores of normal data to be lower than those of anomalous data, because the anomaly scores of anomalous data are not considered in Eq. (4). In contrast, in the density estimation-based approach, minimizing the NLLs of normal data encourages the NLLs of all other data, including anomalous data, to increase, since the likelihood integrates to 1 over the input space. Therefore, instead of the AE-based approach, we adopt the density estimation-based approach. Specifically, in this paper, we adopt a Normalizing Flow (NF), a DNN-based flexible density estimator. We explain its details in Section 3.
2.2 Domain adaptation on DNN-based density estimator
Although a DNN is a powerful tool for anomaly score computation, it may be problematic in practical use. One problem occurs when adjusting the normal model to a new domain. The distribution of normal data often varies due to aging of the target and/or changes in environmental noise; therefore, we need to adapt the normal model to such fluctuations. Let us formulate this problem. Suppose that we have a normal model $p(x; \theta)$ trained on datasets collected in $D$ individual domains. When the distribution changes, we need to adapt $p(x; \theta)$ to the new ($(D+1)$-th) domain to obtain a new normal model. This problem can be regarded as an analogy of domain adaptation [19]. Although several domain adaptation methods have been investigated [20, 21, 22], most require iterative optimization and huge memory, and such methods cannot be easily used with the devices installed in most practical settings, which typically have limited computational resources. Therefore, in terms of computational cost and required memory, a more efficient adaptation method is needed.
3 Proposed method
3.1 Normalizing Flow
We adopt a Normalizing Flow (NF) as a density estimator. NF represents a probability density by transforming a base probability density function $p(z)$ with a series of invertible projections $\{f_k\}_{k=1}^{K}$, each with parameters $\theta_k$. In NF, the observation $x$ is regarded as a variable transformed from $z$ as follows:

$$x = f_K \circ f_{K-1} \circ \cdots \circ f_1(z); \quad (5)$$

thus, $z$ can be obtained by the inverse transform of (5). Following prior works [16, 17], we employ a Gaussian distribution for $p(z)$. Then, the likelihood of a given sample $x$ is obtained by repeatedly applying the change-of-variables rule as follows:

$$\ln p(x; \theta) = \ln p(z) - \sum_{k=1}^{K} \ln \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|, \quad (6)$$

where $z_0 = z$ and $z_k = f_k(z_{k-1})$.
Thus, the anomaly score computed by NF can be expressed as

$$A(x) = -\ln p(x; \theta) = -\ln p(z) + \sum_{k=1}^{K} \ln \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|. \quad (7)$$
Parameters $\theta = \{\theta_k\}_{k=1}^{K}$ can be trained by minimizing the anomaly scores as follows:

$$\theta \leftarrow \arg\min_{\theta} \frac{1}{D} \sum_{d=1}^{D} \frac{1}{N_d} \sum_{n=1}^{N_d} A\!\left(x_n^{(d)}\right), \quad (8)$$

where $x_n^{(d)}$ and $N_d$ are the $n$-th training sample and the number of training samples of the $d$-th dataset, respectively.
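The change-of-variables computation above can be checked on a one-step affine flow, where the likelihood is also available in closed form. The flow $x = a z + b$ and its parameter values are a toy example of our own.

```python
import numpy as np

# one invertible projection x = f(z) = a*z + b, with base density z ~ N(0, 1)
a, b = 2.0, 1.0

def log_base(z):
    """Log-density of the standard Gaussian base distribution."""
    return -0.5 * (np.log(2 * np.pi) + z ** 2)

def log_likelihood(x):
    """ln p(x) = ln p(z) - ln|det df/dz|, with z = f^{-1}(x) = (x - b)/a."""
    z = (x - b) / a
    return log_base(z) - np.log(abs(a))

def anomaly_score(x):
    return -log_likelihood(x)

# x = a*z + b with z ~ N(0, 1) is exactly N(b, a^2); compare with the closed form
x = 0.3
closed_form = -0.5 * np.log(2 * np.pi * a ** 2) - (x - b) ** 2 / (2 * a ** 2)
assert np.isclose(log_likelihood(x), closed_form)
```

With a deeper flow the same pattern applies: invert each projection in turn and accumulate the log-determinant terms.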
We consider domain adaptation for NF. A naïve method of adapting NF to the $(D+1)$-th dataset is to fine-tune all the parameters $\theta$ with that dataset. However, fine-tuning requires high memory and computational costs and cannot easily be conducted on the devices installed in facility equipment, which typically have only limited computational resources. Therefore, a more efficient adaptation method is needed.
3.2 AdaFlow
To address this problem, we propose AdaFlow, a Normalizing Flow-based density estimator that utilizes Adaptive Batch Normalizations (AdaBNs). An AdaBN converts data as follows:

$$f(x) = \mathrm{diag}(\gamma)\,\mathrm{diag}(\sigma_d)^{-1}(x - \mu_d) + \beta, \quad (9)$$

where $\mu_d$ and $\sigma_d^2$ are vectors of the mean and variance computed with the data in the $d$-th domain, respectively, and $\gamma$ and $\beta$ are learnable parameters shared across all domains. The function $\mathrm{diag}(v)$ denotes an operator that converts the vector $v$ into a diagonal matrix whose $(i, j)$-th entry is $v_i$ if $i = j$ and $0$ otherwise. Note that $\mu_d$ and $\sigma_d^2$ are calculated individually for each domain, whereas the same $\gamma$ and $\beta$ are used for all domains. By training the whole set of projections in this manner, $\mu_d$ and $\sigma_d^2$ alleviate the difference up to the second-order moment between domains in the hidden layers. In addition, adapting AdaFlow to a given new $(D+1)$-th domain can be achieved by just computing the AdaBNs' statistics $\mu_{D+1}$ and $\sigma_{D+1}^2$ with data sampled from that domain.
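A single AdaBN layer can be sketched as follows; the class name and interface are our own, and the point is that adaptation is just one statistics-computing pass over the new domain's data.

```python
import numpy as np

class AdaBN:
    """One AdaBN layer: per-domain (mu, sigma^2), shared (gamma, beta)."""
    def __init__(self, dim, eps=1e-5):
        self.gamma = np.ones(dim)     # learnable, shared across domains
        self.beta = np.zeros(dim)     # learnable, shared across domains
        self.eps = eps
        self.stats = {}               # domain id -> (mean, variance)

    def adapt(self, domain, x):
        # adaptation = one forward pass computing the new domain's statistics
        self.stats[domain] = (x.mean(axis=0), x.var(axis=0))

    def __call__(self, domain, x):
        mu, var = self.stats[domain]
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta

rng = np.random.default_rng(0)
bn = AdaBN(3)
dom_a = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))   # training domain
dom_b = rng.normal(loc=-1.0, scale=0.5, size=(1000, 3))  # "new" domain
bn.adapt("a", dom_a)
bn.adapt("b", dom_b)

# both domains end up with (approximately) zero mean and unit variance
for d, x in [("a", dom_a), ("b", dom_b)]:
    y = bn(d, x)
    assert np.allclose(y.mean(axis=0), 0.0, atol=1e-6)
    assert np.allclose(y.var(axis=0), 1.0, atol=1e-2)
```

No gradient step is needed: the shared $\gamma$ and $\beta$ stay fixed, which is what makes adaptation cheap enough for resource-limited devices.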
3.3 Examples of projection implementations
We next explain projections that can be used for implementing AdaFlow. If each projection is easy to invert and the determinant of its Jacobian is easy to compute, exact density estimation at each data point can easily be conducted. We introduce two projections that satisfy these requirements.
Linear Transformation: A linear transformation can be used as a projection for NFs as follows:

$$f(z) = Wz + b, \quad (10)$$

where $W \in \mathbb{R}^{M \times M}$ is a weight matrix and $b \in \mathbb{R}^{M}$ is a bias vector. The determinant of the Jacobian of this projection is $\det W$. Since its computational complexity is $O(M^3)$, we reparametrize $W$ in an LDU-decomposition form $W = LDU$, where $L$ and $U$ are lower and upper triangular matrices whose diagonal elements are all one, respectively, and $D = \mathrm{diag}(d)$. Since $\det L = \det U = 1$ and $\det D = \prod_i d_i$, the computational complexity of the determinant of the Jacobian can be reduced to $O(M)$ by using this reparametrized form.
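The saving from the LDU parametrization can be verified numerically: the log-determinant read off from the diagonal factor matches a full decomposition-based computation. Dimensions and values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 6
# W = L D U with unit-triangular L, U and diagonal D = diag(d)
L = np.tril(rng.normal(size=(M, M)), -1) + np.eye(M)
U = np.triu(rng.normal(size=(M, M)), 1) + np.eye(M)
d = rng.uniform(0.5, 2.0, size=M)
W = L @ np.diag(d) @ U

# det(L) = det(U) = 1, so log|det W| = sum_i log|d_i|: O(M) instead of O(M^3)
logdet_fast = np.sum(np.log(np.abs(d)))
_, logdet_full = np.linalg.slogdet(W)
assert np.isclose(logdet_fast, logdet_full)
```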
Leaky ReLU: A Leaky ReLU can also be used as a projection for NFs, since it is easy to invert and the determinant of its Jacobian is easy to compute:

$$f(z) = \max(z, \alpha z), \quad (11)$$

where $0 < \alpha < 1$ is a hyperparameter and $\max(a, b)$ is an operator that outputs the element-wise maximum of $a$ and $b$. The determinant of its Jacobian is $\alpha^{M_{\mathrm{neg}}}$, where $M_{\mathrm{neg}}$ is the number of elements of $z$ that are less than 0.
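The invertibility and the simple log-determinant of the leaky-ReLU projection can be sketched as follows; the value of $\alpha$ and the test vector are arbitrary.

```python
import numpy as np

alpha = 0.1  # leaky ReLU hyperparameter, 0 < alpha < 1

def f(z):
    """f(z) = max(z, alpha*z), applied element-wise."""
    return np.maximum(z, alpha * z)

def f_inv(x):
    """Exact inverse: positive entries unchanged, negative ones divided by alpha."""
    return np.minimum(x, x / alpha)

def log_abs_det_jacobian(z):
    """log|det J| = M_neg * log(alpha), with M_neg = #elements of z below zero."""
    return np.sum(z < 0) * np.log(alpha)

z = np.array([1.5, -0.4, 0.0, -2.0])
assert np.allclose(f_inv(f(z)), z)                          # exactly invertible
assert np.isclose(log_abs_det_jacobian(z), 2 * np.log(alpha))  # two negatives
```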
4.1 Experimental Settings
4.1.1 Dataset
To verify the effectiveness of AdaFlow, we conducted experiments on an anomaly-detection-in-sound (ADS) task. For the training and test datasets, we constructed a toy-car-running sound dataset in a simulated factory room, as shown in Fig. 2. The toy cars were placed in the room, and two loudspeakers were arranged around each toy car to emit factory noise. For the target and noise sounds, we individually collected four types of car-running sounds and four types of factory noise emitted from the two loudspeakers. Then, nine types of pre-training datasets were generated by mixing three of the four types of car sounds with three of the four types of environmental noise at a signal-to-noise ratio (SNR) of 0 dB. The adaptation and test datasets were generated by mixing the remaining car sound and environmental noise at an SNR of 0 dB. All sounds were recorded at a sampling rate of 16 kHz.
Since it is difficult to generate various types of anomalous sounds, we created synthetic anomalous sounds in the same manner as in a previous study. A part of the training dataset for the task of DCASE-2016 [23, 24] was used as anomalous sounds; 140 sounds (including slamming doors, knocking at doors, keys put on a table, keystrokes on a keyboard, drawers being opened, pages being turned, and phones ringing) were selected. To synthesize the test data, the anomalous sounds were mixed with normal sounds at an anomaly-to-normal power ratio (ANR) of -20 dB. We used the area under the ROC curve (AUROC) and the negative log-likelihood (NLL) as evaluation metrics. Note that the higher the AUROC, the better the model, whereas the lower the NLL, the better the model.
The frame size of the discrete Fourier transform was 512 points, and the frame was shifted every 256 samples. The input vectors were the log amplitude spectra of 64-dimensional Mel-filterbank outputs with a context-window size of 5. Thus, the dimension of the input vector was 64 × (2 × 5 + 1) = 704.
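The context-window stacking of the 64-dimensional Mel-filterbank outputs can be sketched as follows; the edge-padding at the utterance boundaries is our assumption, and the frames here are dummy data.

```python
import numpy as np

n_mels, context = 64, 5            # Mel-filterbank dimension and context size
frames = np.random.default_rng(0).normal(size=(20, n_mels))  # dummy log-Mel frames

def stack_context(feats, c):
    """Concatenate each frame with its c preceding and c following frames."""
    padded = np.pad(feats, ((c, c), (0, 0)), mode="edge")  # repeat edge frames
    return np.concatenate(
        [padded[i:i + len(feats)] for i in range(2 * c + 1)], axis=1)

X = stack_context(frames, context)
assert X.shape == (20, 704)        # 64 * (2*5 + 1) = 704 dimensions per frame
```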
4.1.2 Comparison methods
We compared the following models.
AdaFlow: each model is first trained with data sampled from the nine pre-training datasets and then adapted with data sampled from the target dataset. The architecture is a sequence of linear transformation, AdaBN, leaky ReLU, linear transformation, and AdaBN. For adapting this model, the number of samples used was set to 10, 100, or 1000.
Normalizing Flow (without fine-tuning): each model is trained with data sampled from the nine pre-training datasets (the target dataset is not included). The architecture is a sequence of linear transformation, BN, leaky ReLU, linear transformation, and BN.
Normalizing Flow (fine-tuned): a model is first trained in the same manner as above and then fine-tuned with data sampled from the target dataset. The architecture is the same as above. For fine-tuning this model, the number of samples used was set to 1000.
Auto-encoder: each model is trained with data sampled from the nine pre-training datasets. Since this model cannot be used for density estimation, we only evaluate AUROC. The architecture is a sequence of linear transformation (the output dimension is 128), ReLU, linear transformation (the output dimension is 64), ReLU, linear transformation (the output dimension is 128), ReLU, and linear transformation (the output dimension is 704).
4.2 Objective evaluations
The experimental results are shown in Tables 1 and 2. From these results, we observed the following:
- Both Normalizing Flow and AdaFlow outperformed the Auto-encoder. This indicates the superiority of Normalizing Flows over Auto-encoders in anomaly detection.
- AdaFlow outperformed the Normalizing Flow trained with the nine pre-training datasets, even when it was adapted with only 10 samples. This indicates the superiority of AdaFlow over the non-fine-tuned Normalizing Flow.
- The larger the amount of data used for adapting AdaFlow, the better both metrics were. This indicates that the amount of data used for adaptation should be as large as possible.
- AdaFlow can be adapted to a new dataset about 36 times faster than fine-tuning-based Normalizing Flow adaptation, with only a slight decrease in accuracy. This indicates that AdaFlow is almost as accurate as, yet much more efficient than, fine-tuning-based adaptation.
Table 1: NLL and AUROC on the test dataset.

| Method | NLL | AUROC |
| --- | --- | --- |
| Norm. Flow (Trained with 9 other datasets) | 53.9 | 0.835 |
| Auto-encoder (Trained with 9 other datasets) | N/A | 0.805 |
| AdaFlow (Adapted with 10 samples) | 92.4 | 0.816 |
| AdaFlow (Adapted with 100 samples) | 21.4 | 0.875 |
| AdaFlow (Adapted with 1000 samples) | 15.3 | 0.882 |
| Norm. Flow (Fine-tuned with 1000 samples) | 13.9 | 0.887 |

Table 2: Adaptation time.

| Method | Adaptation time |
| --- | --- |
| Norm. Flow (Fine-tuned with 1000 samples) | 3.23 |
| AdaFlow (Adapted with 1000 samples) | 0.09 |
5 Application to Unpaired Cross-Domain Translation
Though AdaFlow was originally designed for density estimation on multiple domains, we demonstrate that it can also be used for the unpaired cross-domain translation problem, in which one has to train a cross-domain translation model without paired data. We propose the unpaired cross-domain translation framework with AdaFlow in Fig. 1 (c). Given a trained AdaFlow model, data belonging to one domain are first projected to the latent space with that domain's AdaBN statistics; the obtained latent variable is then reprojected to the data space with the target domain's AdaBN statistics.
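In spirit, this statistics swap can be illustrated with a single AdaBN-like normalization layer (a drastic simplification of the actual multi-layer AdaFlow; the data and moments below are toy values).

```python
import numpy as np

rng = np.random.default_rng(0)
src = rng.normal(loc=4.0, scale=2.0, size=(1000, 3))   # domain A data
tgt = rng.normal(loc=-2.0, scale=0.5, size=(1000, 3))  # domain B data
stats = {"A": (src.mean(0), src.std(0)), "B": (tgt.mean(0), tgt.std(0))}

def to_latent(x, domain):
    """Forward pass with the source domain's statistics."""
    mu, sd = stats[domain]
    return (x - mu) / sd

def to_data(z, domain):
    """Inverse pass with the target domain's statistics."""
    mu, sd = stats[domain]
    return z * sd + mu

# A -> latent -> B: translated data match domain B's first/second moments
translated = to_data(to_latent(src, "A"), "B")
assert np.allclose(translated.mean(0), tgt.mean(0))
assert np.allclose(translated.std(0), tgt.std(0))
```

In the full model the same idea applies at every AdaBN layer, while the invertible projections between them carry over the domain-independent structure of the data.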
We used two datasets for these experiments: the first consisted of 400 photos, and the second consisted of 400 paintings by Van Gogh. Examples are shown in Fig. 3 (a) and (b). As the architecture for AdaFlow, we employed a variant of Glow [25], in which the activation normalization layers are replaced with AdaBNs.
The cross-domain translation results are shown in Fig. 3 (c, d). We can see that cross-domain translation can be achieved via AdaFlow even though it was trained without paired data. These results indicate that AdaFlow can be a density-based alternative to other methods for this problem, such as CycleGAN [26].
6 Conclusions
We proposed a new DNN-based density estimator called AdaFlow, a unified model of the Normalizing Flow and Adaptive Batch Normalization. Since AdaFlow can be adapted to a new domain by just adjusting the statistics used in its AdaBNs, iterative parameter updates can be avoided, unlike with fine-tuning; thus, fast and low-cost domain adaptation is achieved. We confirmed the effectiveness of the proposed method through an anomaly-detection-in-sound task. We also proposed a method of applying AdaFlow to the unpaired cross-domain translation problem and demonstrated its effectiveness through cross-domain translation experiments on photo and painting datasets.
AdaFlow has the potential to resolve problems in other important tasks. One possible example is source enhancement [27, 28, 29]. It is known that the performance of DNN-based source enhancement degrades when the target/noise characteristics of test data differ from those of training data. This is also a domain adaptation problem; thus, it might be resolved by using AdaFlow. In the future, we plan to apply AdaFlow to other tasks, including source enhancement.
-  Victoria Hodge and Jim Austin, “A survey of outlier detection methodologies,” Artificial intelligence review, vol. 22, no. 2, pp. 85–126, 2004.
-  Animesh Patcha and Jung-Min Park, “An overview of anomaly detection techniques: Existing solutions and latest technological trends,” Computer networks, vol. 51, no. 12, pp. 3448–3470, 2007.
-  Jiawei Han, Jian Pei, and Micheline Kamber, Data mining: concepts and techniques, Elsevier, 2011.
-  Walter Andrew Shewhart, Economic control of quality of manufactured product, ASQ Quality Press, 1931.
-  Bovas Abraham and George EP Box, “Bayesian analysis of some outlier problems in time series,” Biometrika, vol. 66, no. 2, pp. 229–236, 1979.
-  “Detecting anomalies in cross-classified streams: a Bayesian approach,” Knowledge and Information Systems, vol. 11, no. 1, pp. 29–44, 2007.
-  Yuma Koizumi, Shoichiro Saito, Hisashi Uematsu, and Noboru Harada, “Optimizing acoustic feature extractor for anomalous sound detection based on neyman-pearson lemma,” in EUSIPCO, 2017.
-  Chong Zhou and Randy C Paffenroth, “Anomaly detection with robust deep autoencoders,” in KDD, 2017.
-  Yuma Koizumi, Shoichiro Saito, Hisashi Uematsu, Yuta Kawachi, and Noboru Harada, “Unsupervised detection of anomalous sound based on deep learning and the neyman-pearson lemma,” IEEE/ACM Trans. ASLP, 2018.
-  Diederik P Kingma and Max Welling, “Auto-encoding variational bayes,” in ICLR, 2014.
-  Jinwon An and Sungzoon Cho, “Variational autoencoder based anomaly detection using reconstruction probability,” Special Lecture on IE, vol. 2, pp. 1–18, 2015.
-  Yuta Kawachi, Yuma Koizumi, and Noboru Harada, “Complementary set variational autoencoder for supervised anomaly detection,” in ICASSP, 2018.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in NIPS, 2014.
-  Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs, “Unsupervised anomaly detection with generative adversarial networks to guide marker discovery,” in IPMI, 2017.
-  Swee Kiat Lim, Yi Loo, Ngoc-Trung Tran, Ngai-Man Cheung, Gemma Roig, and Yuval Elovici, “DOPING: Generative data augmentation for unsupervised anomaly detection with GAN,” in ICDM, 2018.
-  Danilo Rezende and Shakir Mohamed, “Variational inference with normalizing flows,” in ICML, 2015.
-  Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio, “Density estimation using real nvp,” in ICLR, 2017.
-  Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou, “Revisiting batch normalization for practical domain adaptation,” in ICLR Workshop, 2016.
-  Sinno Jialin Pan and Qiang Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
-  Yaroslav Ganin and Victor Lempitsky, “Unsupervised domain adaptation by backpropagation,” in ICML, 2015.
-  Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan, “Domain separation networks,” in NIPS, 2016.
-  Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada, “Asymmetric tri-training for unsupervised domain adaptation,” in ICML, 2017.
-  http://www.cs.tut.fi/sgn/arg/dcase2016/.
-  http://www.cs.tut.fi/sgn/arg/dcase2016/download.
-  Diederik P Kingma and Prafulla Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” arXiv preprint arXiv:1807.03039, 2018.
-  Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in ICCV, 2017.
-  Yuma Koizumi, Kenta Niwa, Yusuke Hioka, Kazunori Kobayashi, and Yoichi Haneda, “Dnn-based source enhancement to increase objective sound quality assessment score,” IEEE/ACM Trans. ASLP, 2018.
-  Yuma Koizumi, Noboru Harada, Yoichi Haneda, Yusuke Hioka, and Kazunori Kobayashi, “End-to-end sound source enhancement using deep neural network in the modified discrete cosine transform domain,” in ICASSP, 2018.
-  Shinichi Mogami, Hayato Sumino, Daichi Kitamura, Norihiro Takamune, Shinnosuke Takamichi, Hiroshi Saruwatari, and Nobutaka Ono, “Independent deeply learned matrix analysis for multichannel audio source separation,” in EUSIPCO, 2018.