Learning to Adapt to Domain Shifts with Few-shot Samples in Anomalous Sound Detection

Bingqing Chen, et al.
Carnegie Mellon University

Anomaly detection has many important applications, such as monitoring industrial equipment. Despite recent advances in anomaly detection with deep-learning methods, it is unclear how existing solutions would perform under out-of-distribution scenarios, e.g., due to shifts in machine load or environmental noise. Grounded in the application of machine health monitoring, we propose a framework that adapts to new conditions with few-shot samples. Building upon prior work, we adopt a classification-based approach for anomaly detection and show its equivalence to mixture density estimation of the normal samples. We incorporate an episodic training procedure to match the few-shot setting during inference. We define multiple auxiliary classification tasks based on meta-information and leverage gradient-based meta-learning to improve generalization to different shifts. We evaluate our proposed method on a recently-released dataset of audio measurements from different machine types. It improved upon two baselines by around 10% and is on par with the best-performing model reported on the dataset.





I Introduction

Anomaly detection is the task of identifying anomalous observations [11, 31, 20, 4, 22]. In this paper, we focus on anomaly detection applied to machine health monitoring via audio signals. Detecting anomalies is useful for identifying incipient machine faults, condition-based maintenance, and quality assurance, which are integral components of Industry 4.0. In comparison to direct measurements, audio is a cost-effective, non-intrusive, and scalable sensor modality.

Specifically, we focus on the problem set-up (Figure 1a) where our system adapts to new conditions using only a handful of samples (specifically 3 in the experiment). It is clearly not desirable for an observation to be tagged as anomalous due to changes in operating condition or environmental noise. Thus, we aim to develop an anomalous sound detection (ASD) system that can adapt to new conditions quickly with few-shot samples. Additionally, some meta-information may be available, e.g., machine model and operating load.

Due to the practical difficulty of enumerating potential anomalous conditions and generating such samples, it is typically assumed that only normal samples are available for training [24]. This poses the challenge that anomaly detectors need to learn to identify anomalies in the absence of any anomalous samples. Furthermore, deep anomaly detectors perform in unexpected and not well-understood ways under out-of-distribution scenarios [23]. Last but not least, the unique characteristics of audio add to the challenge.

Fig. 1: Framework. (a) Our anomalous sound detection system adapts to new conditions with 3-shot samples. We adopt a classification-based approach and define the auxiliary classification tasks on available meta-data. (b) In the outer loop, we alternate between the auxiliary classification tasks, such that the model can quickly adapt to new conditions. (c) For a given auxiliary classification task, the anomaly score is calculated based on the distance between each sample and the prototypes in the embedding space. This is equivalent to mixture density estimation on normal samples.

Given its superior performance, especially on similar problems [13, 6, 17], we adopt a classification-based approach for anomaly detection [2], where auxiliary classification tasks are defined on the available meta-data. However, existing classification-based methods [2, 12] are not designed for anomaly detection under domain shift, and thus are not sufficient for competitive performance on their own, as we show in Section V. Thus, we propose a novel approach, with bi-level optimization, to tackle the challenging problem of anomaly detection under domain shifts. In the outer loop (Figure 1b), we alternate between all available auxiliary classification tasks leveraging a gradient-based meta-learning (GBML) algorithm [18], such that the resulting model can quickly adapt to the target domain. In the inner loop (Figure 1c), we train the classification-based anomaly detector with an episodic procedure to match the few-shot setting [27]. While classification-based methods show strong empirical results in the literature [7, 10, 2], there is not a convincing explanation for that performance. Thus, we propose a supplementary explanation that classification-based methods are equivalent to mixture density estimation on normal samples. Finally, we evaluate our proposed approach on a recently-released dataset of audio measurements from a variety of machine types [12] and show strong empirical results.

II Related Work

Anomaly Detection

Recently, there has been a surge of interest in using deep learning approaches for anomaly detection to handle complex, high-dimensional data. Anomaly detection methods can be categorized as density estimation-based, classification-based, and reconstruction-based [2, 23]. Density estimation-based methods fit a probabilistic model on normal data, and data from low-density regions are considered anomalies. For the purpose of anomaly detection, we are only interested in level sets of the distribution, and thus classification-based methods learn decision boundaries that delineate high-density and low-density regions. Reconstruction-based methods train a model to reconstruct normal samples, and use the reconstruction error as a proxy for the anomaly score.

Due to the superior performance of classification-based methods on our application of interest [6, 17, 13], we focus our review on this class of methods and refer interested readers to comprehensive reviews [3, 23] for more information. A representative method is Deep Support Vector Data Description (SVDD) [24], a one-class classification method, where the neural network learns a representation that enforces the majority of normal samples to fall within a hypersphere. The core challenge of this line of work is learning the decision boundary for a binary classification problem in the presence of only normal samples. Alternatively, outlier exposure [10] takes advantage of an auxiliary dataset of outliers, i.e., samples disjoint from both normal and anomalous ones, to improve performance for anomaly detection and generalization to unseen scenarios. Finally, other methods train models on auxiliary classification tasks and use the negative log-likelihood of a sample belonging to the correct class as the anomaly score. Examples of such auxiliary classification tasks include identifying the geometric transformation applied to the original sample [7, 2] and classifying the metadata associated with the sample [6, 17]. This is reminiscent of self-supervised learning (SSL), where learning to distinguish between self and others is conducive to learning salient representations for downstream tasks.


Meta-Learning

The objective of meta-learning is to find a model that can generalize to a new, unseen task with a handful of samples [5]. Metric-based meta-learning algorithms learn feature representations such that query samples can be classified based on their similarity to support samples with known labels. For instance, Matching Networks [30] use an attention mechanism to evaluate the similarity. Prototypical Networks [27] establish class prototypes based on support samples, and assign each query sample to its nearest prototype.

GBML algorithms, popularized by model-agnostic meta-learning (MAML) [5], train models that can quickly adapt to new tasks with gradient-based updates, typically using a bi-level optimization procedure. Variants of MAML have been proposed in works such as [18, 21, 32].

Anomalous Sound Detection

In comparison to general anomaly detection, audio has its unique characteristics, e.g., temporal structure [25]. On the same task, [16] observed that there was limited performance gain by trying a number of common domain adaptation techniques, highlighting the challenge and the need for research on audio-specific methods.

While reconstruction-based [25, 28] and density estimation-based [33] anomaly detection methods have been used for anomalous sound detection, classification-based methods enjoy superior performance on similar or identical tasks [13, 6, 17]. Despite differences in specific implementations, the core idea is to classify metadata, such as machine identification, and compute the anomaly score from the negative log-likelihood of the class assignment. Also relevant to the task are SNIPER [14] and SpiderNet [15], which memorize few-shot anomalous samples to prevent overlooking other anomalous samples. In particular, SpiderNet uses an attention mechanism to evaluate the similarity with known anomalous samples, similar to Matching Networks.

III Approach

III-A Preliminaries

By defining the distribution of normal data as P with density p over the data space X, anomalies may be characterized as the set where normal samples are unlikely to be, i.e. {x ∈ X : p(x) ≤ α}, where α ≥ 0 is a threshold [23] that controls the Type I error.

A fundamental assumption in anomaly detection is the concentration assumption, i.e. the region where normal data reside can be bounded [23]. More precisely, there exists α ≥ 0 such that the level set {x ∈ X : p(x) > α} is nonempty and small. Note that this assumption does not require the support of p to be bounded; only that the support of the high-density region of p be bounded. In contrast, anomalies need not be concentrated, and a commonly-used assumption is that anomalies follow a uniform distribution over X.

III-B Problem Formulation

We study the problem of anomaly detection under domain shift, having access to only few-shot samples in the target domain. Specifically, we have access to a dataset D_S = {(x_i, y_i^1, …, y_i^T)} from the source domain, where each x_i is a normal sample and y_i^1, …, y_i^T are the labels for the T auxiliary classification tasks. We also have access to a small number of samples D_T = {x_j} in the target domain, where each x_j is a normal sample from a domain-shifted condition. Note that we only have access to normal samples in both the source and the target domain. We denote each auxiliary task as τ_t and the set of auxiliary tasks as 𝒯 = {τ_1, …, τ_T}.

We use a neural network f_θ to map samples from the data space X to an embedding space Z, where f_θ : X → Z. We denote the embedding that corresponds to each sample x as z = f_θ(x), where z ∈ Z, and the distribution of normal data in the embedding space as p_Z.

III-C Anomaly Detection via Metadata Classification

To simplify notation, we consider a single auxiliary task with K class labels in this subsection. We denote the class label of a sample as y ∈ {1, …, K}.

As discussed in Section II, anomaly detection methods based on auxiliary classification train models to differentiate between classes, either defined via metadata inherent to the dataset or synthesized by applying various transformations to the sample, and calculate the anomaly score as the negative log-likelihood of a sample belonging to the correct class c (Eqn. 1) [9].

s(x) = −log p(y = c | x)    (1)
Following [2], we define the likelihood by distance on the embedding space (Eqn. 2), rather than by a classifier head, to handle new, unseen classes in the target domain, where d(·,·) is a distance metric and μ_k is the centroid of class k.

p(y = k | x) = exp(−d(f_θ(x), μ_k)) / Σ_{k'=1}^{K} exp(−d(f_θ(x), μ_{k'}))    (2)
The learning objective is to maximize the log-likelihood of assigning each sample to its correct class, or equivalently to minimize the negative log-likelihood as in Eqn. 3, leading to the familiar cross-entropy loss.

L(θ) = −(1/N) Σ_{i=1}^{N} log p(y = c_i | x_i)    (3)
An explanation for the strong empirical performance of anomaly detection based on auxiliary classification is that the auxiliary classification tasks are conducive to learning salient features useful for anomaly detection [7, 6]. However, this explanation provides limited insight into how to define the auxiliary classification tasks and what kind of performance may be expected.

Here, we provide an alternative explanation: the classification objective (Eqn. 3) is equivalent to mixture density estimation on normal samples. In Deep SVDD [24], it is assumed that the neural network can find a latent representation such that the majority of normal points fall within a single hypersphere or, equivalently, that p_Z follows an isotropic Gaussian distribution. Based on the same intuition, we assume p_Z can be characterized by a mixture model (Eqn. 4), where p(y = k) is the prior distribution for class membership. As an example, a machine may operate normally under different operating loads. Instead of assuming that normal data can be modeled by a single cluster, the distribution of normal data can be better characterized with a number of clusters, each corresponding to an operating load. We hypothesize that this more flexible representation enables the model to learn fine-granular features conducive to anomaly detection, and to extrapolate better to unseen scenarios.

p_Z(z) = Σ_{k=1}^{K} p(y = k) p_Z(z | y = k)    (4)
We also assume that p_Z(z | y = k) can be characterized by a distribution in the exponential family, p_Z(z | y = k) ∝ exp(−d_φ(z, μ_k)), where d_φ is a Bregman divergence [1]. The choice of d_φ dictates the modeling assumption on the conditional distribution p_Z(z | y = k). For instance, by choosing the squared Euclidean distance, i.e. d_φ(z, μ_k) = ‖z − μ_k‖², one models each cluster as an isotropic Gaussian. Given these modeling assumptions, it is easy to see that

p(y = k | z) = p(y = k) exp(−d_φ(z, μ_k)) / Σ_{k'=1}^{K} p(y = k') exp(−d_φ(z, μ_{k'}))    (5)
Eqn. 5 is the same as Eqn. 2 under a flat prior on class membership, which can be satisfied by sampling mini-batches with balanced classes. This shows that learning the auxiliary classification task is equivalent to performing mixture density estimation with an exponential family, where each cluster corresponds to a class.
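The equivalence above can be checked numerically. The sketch below (a minimal numpy illustration, assuming the squared-Euclidean Bregman divergence, unit-variance isotropic Gaussian components, and a flat prior; all variable names are ours) shows that the softmax over negative distances (Eqn. 2) matches the Gaussian-mixture posterior (Eqn. 5), since the normalizing constant and the flat prior cancel.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
K, D = 4, 8
mu = rng.normal(size=(K, D))   # class centroids / mixture means
z = rng.normal(size=(D,))      # one embedded sample

# Classification view (Eqn. 2): softmax over negative distances,
# with d chosen as half the squared Euclidean distance.
d = 0.5 * ((z - mu) ** 2).sum(axis=1)
p_cls = softmax(-d)

# Density-estimation view (Eqns. 4-5): posterior of an isotropic
# Gaussian mixture with unit covariance and a flat prior 1/K.
log_lik = -0.5 * ((z - mu) ** 2).sum(axis=1) - 0.5 * D * np.log(2 * np.pi)
joint = (1.0 / K) * np.exp(log_lik)
p_mix = joint / joint.sum()

assert np.allclose(p_cls, p_mix)
```

The Gaussian normalizer and the 1/K prior appear in every term of the posterior's numerator and denominator, so they cancel exactly, which is why the two views coincide.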

input: source data D_S; few-shot target data D_T; test set X_test; model f_θ
parameters: inner learning rate α; outer step-size ε; numbers of inner and fine-tuning iterations M and M′
function ComputeLoss(S, Q, τ)
     // input: support set, query set, task index
     // Compute prototypes μ_k from the support set S
     // Calculate the task loss L on the query set Q (Eqn. 3)
     return L
end function
procedure Train(f_θ, D_S)
     // input: model, training data
     initialize θ
     while not done do
         for each task τ_t in 𝒯 do
              set φ ← θ
              for m = 1, …, M do
                  // Sample support set S and query set Q from D_S
                  L ← ComputeLoss(S, Q, τ_t)
                  update φ ← φ − α ∇_φ L
              end for
              meta update θ ← θ + ε (φ − θ)
         end for
     end while
end procedure
procedure Inference(f_θ, D_T, X_test)
     // input: trained model, few-shot examples, test set
     set φ ← θ
     // Fine-tuning on few-shot examples
     for m = 1, …, M′ do
          L ← ComputeLoss(D_T, D_T, 0)
          update φ ← φ − α ∇_φ L
     end for
     // Compute anomaly score s(x) (Eqn. 1) for each test sample
end procedure
Algorithm 1 Learning to Adapt with Few-shot Samples

Adaptation with Few-shot Samples

A distinction from the typical anomaly detection problem set-up is that we need to adapt to new conditions and only have access to few-shot samples from the target domain. Thus, we draw from the few-shot classification literature, and modify the classification-based anomaly detector for the few-shot setting. Specifically, Prototypical Networks (ProtoNet) [27] is a few-shot learning method that also classifies samples based on distance in the embedding space. During inference, ProtoNet uses the few-shot examples to establish the new class centroids (i.e. prototypes) under domain-shifted conditions. Each test sample is assigned to the nearest class prototype, with the probability defined by Eqn. 2.

ProtoNet adopts an episodic training procedure (see ComputeLoss in Algorithm 1), such that the training condition matches the test condition. During training, ProtoNet splits each mini-batch into a support set S and a query set Q, where the support set simulates the few-shot examples, and the query set simulates the test samples during inference. In other words, the prototypes are computed on the support set and the loss is evaluated on the query set.
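One episode can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the identity "embedding" and the well-separated toy clusters are stand-ins for the trained network and real log-mel samples, and `episode_loss` is a name we introduce here.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def episode_loss(embed, x_support, y_support, x_query, y_query, num_classes):
    """One ProtoNet episode: prototypes from the support set,
    cross-entropy (Eqn. 3) evaluated on the query set."""
    z_s, z_q = embed(x_support), embed(x_query)
    # Class prototypes: mean support embedding per class.
    protos = np.stack([z_s[y_support == k].mean(axis=0)
                       for k in range(num_classes)])
    # Squared Euclidean distance from each query to each prototype.
    d = ((z_q[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    p = softmax(-d, axis=-1)
    return -np.log(p[np.arange(len(y_query)), y_query]).mean()

# Toy check with an identity "embedding" and 3 well-separated classes.
rng = np.random.default_rng(0)
centers = np.array([[0., 0.], [10., 0.], [0., 10.]])
y_s = np.repeat(np.arange(3), 3)   # 3-shot support, as in the paper
y_q = np.repeat(np.arange(3), 5)   # 5 queries per class
x_s = centers[y_s] + 0.1 * rng.normal(size=(9, 2))
x_q = centers[y_q] + 0.1 * rng.normal(size=(15, 2))
loss = episode_loss(lambda x: x, x_s, y_s, x_q, y_q, 3)
assert loss < 0.01   # well-separated clusters give near-zero loss
```

During inference, the same prototype computation is applied to the 3-shot target-domain samples, and the query-side probability supplies the anomaly score.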

Outlier Exposure

We also use the outlier exposure (OE) technique [10] to boost performance. It is commonly assumed that anomalies follow a uniform distribution over the data space [23]. Thus, the auxiliary loss for outlier exposure is defined as the cross-entropy to the uniform distribution, and added to the learning objective with a scalar weight.
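The OE loss can be written in a few lines. A minimal numpy sketch, assuming K classes and logit inputs (`oe_loss` is our name for it): the cross-entropy between the uniform distribution and the model's class posterior on an outlier sample is −(1/K) Σ_k log p_k, which is minimized (at log K) when the model is maximally uncertain.

```python
import numpy as np

def log_softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    return a - np.log(np.exp(a).sum(axis=axis, keepdims=True))

def oe_loss(outlier_logits):
    """Cross-entropy between the uniform distribution and the model's
    class posterior on outlier samples: -(1/K) * sum_k log p_k."""
    return -log_softmax(outlier_logits, axis=-1).mean()

K = 6
# A maximally "uncertain" model (equal logits) attains the minimum, log K.
assert np.isclose(oe_loss(np.zeros((4, K))), np.log(K))
# Confident (peaked) predictions on outliers are penalized more heavily.
peaked = np.zeros((4, K))
peaked[:, 0] = 10.0
assert oe_loss(peaked) > np.log(K)
```

This pushes outlier samples away from all class prototypes, sharpening the decision boundary around the normal data.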

III-D Multi-objective Meta-learning

So far, we have focused the discussion on the case of a single auxiliary classification task. However, more meta-information regarding operating conditions may be available. Since the samples may be subject to different changes due to operating condition, machine load, or environmental noise, we hypothesize that training on a variety of tasks is conducive to generalizing well to different domain shifts. Also, it is not known a priori which auxiliary classification task would be most effective for anomaly detection. Thus, it is sensible to train on all available auxiliary classification tasks. Empirically, we show in Section V that training on all auxiliary classification tasks does outperform training on any single one.

Recall that meta-learning trains the model on a distribution of tasks such that it can quickly learn new, unseen tasks with few-shot samples. Thus, these auxiliary classification tasks can be naturally incorporated into meta-learning algorithms. Wang et al. [32] draw close connections between multi-task learning and meta-learning. A distinction, however, is that meta-learning typically trains on different tasks of the same nature, e.g. 5-way image classification, while multi-task learning may train on functionally related tasks of different natures, e.g. image reconstruction and classification. Our approach falls under the latter case.

We use Reptile [18], a first-order variant of MAML, which learns a parameter initialization that can be fine-tuned quickly on a new task. Reptile repeatedly samples a task τ_t, trains on it, and moves the model parameters towards the trained weights on that task (see Train in Algorithm 1), following

θ ← θ + ε (φ_t − θ)

where φ_t are the model parameters after training on τ_t for M gradient steps and ε is the step-size for the meta-update. Based on a Taylor series analysis, [18] shows that Reptile simultaneously minimizes the expected loss over all tasks and maximizes within-task generalization, i.e. taking a gradient step for a specific task on one mini-batch also improves performance on the other tasks.
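The Reptile update is simple enough to demonstrate on a toy problem. The sketch below uses the batched variant (averaging the adapted weights over all tasks each round, also described in [18]) on quadratic losses ‖φ − t‖² with different optima t; the function name and task family are our own illustration, not the paper's training setup.

```python
import numpy as np

def batched_reptile(theta, task_optima, inner_steps=10, inner_lr=0.1,
                    eps=0.5, rounds=100):
    """Batched Reptile: adapt to each task with plain SGD, then move the
    meta-parameters toward the average of the adapted weights."""
    theta = theta.copy()
    for _ in range(rounds):
        adapted = []
        for target in task_optima:
            phi = theta.copy()
            for _ in range(inner_steps):
                phi -= inner_lr * 2 * (phi - target)  # grad of ||phi-target||^2
            adapted.append(phi)
        theta += eps * (np.mean(adapted, axis=0) - theta)  # meta-update
    return theta

# Toy task family: quadratic losses with different optima.
optima = [np.array([1.0, 0.0]), np.array([-1.0, 0.0]), np.array([0.0, 3.0])]
theta = batched_reptile(np.zeros(2), optima)
# The learned initialization sits at the centroid of the task optima,
# from which every task can be reached in a few gradient steps.
assert np.allclose(theta, np.mean(optima, axis=0), atol=1e-3)
```

In our setting, the "tasks" are the auxiliary classification tasks, and the initialization found this way is what gets fine-tuned on the 3-shot target-domain samples at inference time.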

During inference, the trained model is first fine-tuned on few-shot examples from the target domain, and the anomaly score (Eqn. 1) can be computed on the test set (see Inference in Algorithm 1).

Machine Type | Anomalous Conditions | Variations across Domains | Auxiliary Classification Tasks
Toy Car | Bent shaft; deformed / melted gears; damaged wheels | Car model; speed; microphone type and position | Section ID; model no.; speed
Toy Train | Flat tire; broken shaft; disjoint railway track; obstruction on track | Train model; speed; microphone type and position | Section ID; model no.; speed
Fan | Damaged / unbalanced wing; clogging; over-voltage | Wind strength; size of the fan; background noise | Section ID
Gearbox | Damaged gear; overload; over-voltage | Voltage; arm length; weight | Section ID; voltage; arm length; weight
Pump | Contamination; clogging; leakage; dry run | Fluid viscosity; sound-to-noise ratio (SNR); number of pumps | Section ID
Slider | Damaged rail; loose belt; no grease | Velocity; operation pattern; belt material | Section ID; slider type; velocity; displacement
Valve | Contamination | Operation pattern; existence of pump in background; number of valves | Section ID; pattern; existence of pump; multiple valves
TABLE I: Summary of anomalous conditions, domain shifts, and auxiliary classification tasks

IV Experiment

In this section, we describe the experimental set-up, including the dataset, evaluation metrics, our implementation details, baselines, and ablation study.


Dataset

We use the dataset from the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2021 Task 2 [12], which is composed of subsets of ToyADMOS2 [8] and MIMII DUE [29]. The dataset consists of normal and anomalous samples from seven distinct machine types, i.e. toy car, toy train, fan, gearbox, pump, slider, and valve. Anecdotally, we found it extremely challenging to distinguish normal from anomalous samples with untrained human ears. The anomalous samples are generated by intentionally damaging the machines in different ways (see Table I). Note that the anomalous samples are used for evaluation only.

Each sample is a 10s audio clip at 16kHz sampling rate, including both machine sound and environment noise. For each machine type, the data is grouped into 6 sections, where a section is a unit for performance evaluation and roughly corresponds to a distinct machine. In each section, the samples are collected from two different conditions, which we refer to as the source and target domain. The domain shifts are different across sections. A notable challenge is that there are only 3 normal samples for the target domain in each section, while there are 1000 for the source domain. Table I summarizes the domain shifts in the dataset.

For evaluation, each section has 100 normal samples and 100 anomalous samples for both the source and target domain. Sections 0, 1, 2 are designated as the validation set, and Sections 3, 4, 5 are designated as the test set.

Evaluation Metrics

We follow the same evaluation procedure as the DCASE Challenge, and report the Area Under the Receiver Operating Characteristic curve (AUROC) and the partial AUROC (pAUROC), which restricts the false positive rate to at most 0.1. The metrics are aggregated over sections and machine types with the harmonic mean. We focus on the results on the target domain.
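These metrics are standard, but the aggregation choice matters. A minimal numpy sketch (our own helper names; pAUROC is the same ROC integral restricted to FPR ≤ 0.1 and renormalized, which this sketch omits for brevity): AUROC computed via the Mann-Whitney U statistic, and harmonic-mean aggregation, which, unlike the arithmetic mean, is dominated by the worst-performing section.

```python
import numpy as np

def auroc(scores_normal, scores_anomalous):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    random anomalous sample scores higher than a random normal one."""
    s_n = np.asarray(scores_normal, dtype=float)[:, None]
    s_a = np.asarray(scores_anomalous, dtype=float)[None, :]
    return np.mean(s_a > s_n) + 0.5 * np.mean(s_a == s_n)

def harmonic_mean(values):
    values = np.asarray(values, dtype=float)
    return len(values) / np.sum(1.0 / values)

# Perfect separation gives AUROC 1; indistinguishable scores give 0.5.
assert auroc([0.1, 0.2], [0.8, 0.9]) == 1.0
assert auroc([0.5, 0.5], [0.5, 0.5]) == 0.5
# Harmonic mean is pulled down by the worst section.
hm = harmonic_mean([0.9, 0.9, 0.5])
assert hm < np.mean([0.9, 0.9, 0.5])
```

The harmonic mean therefore rewards systems that perform consistently across all sections and machine types, rather than excelling on a few.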

Implementation Details

Our preprocessing and model architecture follow the 2nd baseline from the challenge organizer [12]. The raw audio is preprocessed into a log-mel-spectrogram with a frame size of 64 ms, a hop size of 50%, and 128 mel filters. After preprocessing, 64 consecutive frames in a context window are treated as a sample, and samples are generated from an audio clip by shifting the context window by 8 frames. Each 10 s audio clip results in 32 samples, with each sample a 128 × 64 patch of the log-mel-spectrogram. We use MobileNetV2 [26] as the backbone, and set the bottleneck size to 128, so each sample is mapped to a 128-dimensional embedding.
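The windowing arithmetic above can be sanity-checked directly. A small sketch, assuming the STFT centers/pads the signal so that the number of frames is 1 + floor(len / hop) (the default in librosa-style front ends; an assumption on our part, as the text does not state the padding):

```python
# Sanity-check of the windowing arithmetic in the text.
sr = 16_000                 # sampling rate (Hz)
clip_len = 10 * sr          # 10 s audio clip -> 160000 samples
frame = int(0.064 * sr)     # 64 ms frame -> 1024 samples
hop = frame // 2            # 50% hop -> 512 samples

# Assumed centered STFT: one frame per hop, plus one.
n_frames = 1 + clip_len // hop          # spectrogram frames per clip
context, shift = 64, 8                  # context window and window shift
n_samples = (n_frames - context) // shift + 1

assert (frame, hop) == (1024, 512)
assert n_samples == 32                  # matches the 32 samples per clip
```

Under this padding convention the numbers in the text are internally consistent: 313 frames per clip yield exactly 32 context windows of 64 frames at a shift of 8.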

We follow the optimization procedure in [18]. In the inner loop, we use ADAM as the optimizer. Each mini-batch is sampled to have balanced classes, with 3 audio clips as support and 5 as query for each class, simulating the 3-shot setting during inference. For the outer loop, we use SGD with a step-size linearly annealed from 1 to 0. The inner loop and fine-tuning each run for a fixed number of iterations. We train a separate model for each machine type for 10K steps. For outlier exposure, we generate outliers by 1) taking samples from machines other than the one being trained, and 2) synthesizing samples via frequency warping following [6].

The implementation is in PyTorch [19] and trained on a machine with an Intel® Core™ i9-10900KF CPU@3.70GHz and an NVIDIA GeForce RTX™ 3090 GPU.

Baselines and Ablation Study

We compare our model to the two baselines provided by the challenge organizer [12]. The 1st baseline adopts a reconstruction-based approach with an autoencoder. The 2nd baseline uses a classification-based approach with MobileNetV2 as the backbone, with the auxiliary classification task defined as section ID. Note that our preprocessing and neural architecture are the same as those of the 2nd baseline. We also compare our model to the best-performing system [16] among the 77 submissions to the challenge, which used an ensemble of two classification-based models and one density estimation-based model. We conduct an ablation study to analyze the contribution of individual components.

V Results

Fig. 2: Performance comparison of using individual vs. all auxiliary classification tasks (evaluated on validation set). For individual tasks, the bar length indicates the averaged score over tasks and error bar indicates the minimum and the maximum over tasks.

Choice of Auxiliary Classification Tasks

While in prior work [6, 17, 16] it is popular to use machine / section ID as the auxiliary classification task, we hypothesize that training on multiple auxiliary classification tasks performs better than training on any individual one. We define auxiliary classification tasks by parsing the meta-information associated with each audio clip, summarized in Table I. Taking Toy Car as an example, we have access to information on car model and speed. Intuitively, being able to differentiate between car models and speeds helps the model adapt to a new model or speed, and potentially to other new conditions.

Fig. 3: Performance comparison with baselines and ablation (evaluated over the test set). The proposed method (Ours) is compared against Baseline 1 and 2, the DCASE 2021 Task 2 winner, ProtoNet, and our proposed method without outlier exposure (ProtoNet + Reptile).

To confirm the hypothesis, we train the anomaly detector on individual classification tasks with ProtoNet, and compare its performance with our proposed approach of alternating between all auxiliary classification tasks using Reptile. We report the score on the validation set in Figure 2. As mentioned in Section IV, we follow the evaluation procedure in the DCASE challenge and report the performance as harmonic mean over AUROC and pAUROC@0.1 of relevant data sections. Fan and Pump are not compared here, as only one auxiliary classification task is available for these two machine types. As expected, training on all tasks performs consistently better across different machine types, confirming the hypothesis.

Overall Results

The overall results evaluated on the test set are summarized in Figure 3. In summary, our model (Ours) improved on average upon Baseline 1 and 2 by 11.7% and 9.6% respectively, and is on par with the best-performing system [16], despite having a third of its model complexity (measured by the number of model parameters). While Baseline 1 and Baseline 2 perform similarly based on the harmonically-averaged scores, there are significant variations across machines: reconstruction-based Baseline 1 and classification-based Baseline 2 excel on different machines. Our proposed model outperforms both baselines for Toy Car, Pump, Slider, and Valve.

In comparison to the vanilla classification-based approach in Baseline 2, we trained the classifier episodically to match the few-shot setting (ProtoNet), and iterated among different auxiliary classification tasks (ProtoNet + Reptile) to improve generalization. Finally, we augmented the dataset via OE, adding up to our proposed approach (Ours). On average, the episodic training procedure improved performance by 1.9%, GBML applied to auxiliary classification improved performance by 4.1%, and data augmentation by OE improved performance by 3.6%. The most significant improvement comes from training on all auxiliary classification tasks via Reptile. There is no improvement from Fan or Pump, as these two machine types have only one auxiliary classification task.

Low-Dimensional Visualization

Qualitatively, we show the embeddings of the test samples from the trained model for Toy Car, before and after fine-tuning. The stars indicate the prototypes established by the 3-shot samples.

(a) Before Fine-tuning
(b) After Fine-tuning
Fig. 4: Trained embedding of Toy Car (via t-SNE) on the test set before (top) and after (bottom) fine-tuning on the target domain. Fine-tuning improves discriminability of normal vs. anomalous samples (bottom-right) on the target domain.

The trained model has not seen any samples from the target domain prior to fine-tuning. Regardless, the model has already learned a meaningful embedding (Figure 4(a)), where the samples are naturally separated by section. However, the normal and anomalous samples are not yet well separated. During fine-tuning, the prototypes attract normal samples, and as a result the normal and anomalous samples become better separated.

VI Conclusions

In this work, we tackle the challenging task of adapting an unsupervised anomaly detector to new conditions using few-shot samples. To achieve this objective, we leverage approaches from the meta-learning literature. We train a classification-based anomaly detector with an episodic procedure to match the few-shot setting during inference. We use a GBML algorithm to find a parameter initialization that can quickly adapt to new conditions with gradient-based updates. Finally, we boost our model with outlier exposure.

Grounded in the application of machine health monitoring, we evaluate our proposed method on a recently-released dataset of audio measurements from different machine types, used for the DCASE Challenge 2021. Our model is on par with the best-performing system among the 77 submissions to the challenge. We conduct an ablation study to analyze the contribution of each component, finding that the GBML procedure leads to the most significant improvement.


  • [1] A. Banerjee, S. Merugu, I. S. Dhillon, J. Ghosh, and J. Lafferty (2005) Clustering with bregman divergences..

    Journal of machine learning research

    6 (10).
    Cited by: §III-C.
  • [2] L. Bergman and Y. Hoshen (2020) Classification-based anomaly detection for general data. arXiv preprint arXiv:2005.02359. Cited by: §I, §II, §II, §III-C.
  • [3] R. Chalapathy and S. Chawla (2019) Deep learning for anomaly detection: a survey. arXiv preprint arXiv:1901.03407. Cited by: §II.
  • [4] J. Feng, C. Zhang, and P. Hao (2010)

    Online learning with self-organizing maps for anomaly detection in crowd scenes


    2010 20th International Conference on Pattern Recognition

    Vol. , pp. 3599–3602. External Links: Document Cited by: §I.
  • [5] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. Cited by: §II, §II.
  • [6] R. Giri, S. V. Tenneti, F. Cheng, K. Helwani, U. Isik, and A. Krishnaswamy (2020) Self-supervised classification for detecting anomalous sounds. In Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), pp. 46–50. Cited by: §I, §II, §II, §III-C, §IV, §V.
  • [7] I. Golan and R. El-Yaniv (2018) Deep anomaly detection using geometric transformations. arXiv preprint arXiv:1805.10917. Cited by: §I, §II, §III-C.
  • [8] N. Harada, D. Niizumi, D. Takeuchi, Y. Ohishi, M. Yasuda, and S. Saito (2021) ToyADMOS2: another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions. arXiv preprint arXiv:2106.02369. Cited by: §IV.
  • [9] D. Hendrycks and K. Gimpel (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. Cited by: §III-C.
  • [10] D. Hendrycks, M. Mazeika, and T. Dietterich (2018) Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606. Cited by: §I, §II, §III-C.
  • [11] M. Ivanovska, J. Perš, D. Tabernik, and D. Skočaj (2021) Evaluation of anomaly detection algorithms for the real-world applications. In 2020 25th International Conference on Pattern Recognition (ICPR), pp. 6196–6203. External Links: Document Cited by: §I.
  • [12] Y. Kawaguchi, K. Imoto, Y. Koizumi, N. Harada, D. Niizumi, K. Dohi, R. Tanabe, H. Purohit, and T. Endo (2021) Description and discussion on DCASE 2021 Challenge Task 2: unsupervised anomalous sound detection for machine condition monitoring under domain shifted conditions. arXiv preprint arXiv:2106.04492. Cited by: §I, §IV, §IV, §IV.
  • [13] Y. Koizumi, Y. Kawaguchi, K. Imoto, T. Nakamura, Y. Nikaido, R. Tanabe, H. Purohit, K. Suefusa, T. Endo, M. Yasuda, et al. (2020) Description and discussion on DCASE2020 Challenge Task 2: unsupervised anomalous sound detection for machine condition monitoring. arXiv preprint arXiv:2006.05822. Cited by: §I, §II, §II.
  • [14] Y. Koizumi, S. Murata, N. Harada, S. Saito, and H. Uematsu (2019) SNIPER: few-shot learning for anomaly detection to minimize false-negative rate with ensured true-positive rate. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 915–919. Cited by: §II.
  • [15] Y. Koizumi, M. Yasuda, S. Murata, S. Saito, H. Uematsu, and N. Harada (2020) SpiderNet: attention network for one-shot anomaly detection in sounds. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 281–285. Cited by: §II.
  • [16] J. A. Lopez, G. Stemmer, P. Lopez-Meyer, P. S. Singh, J. D. H. Ontiveros, and H. Courdourier (2021) Ensemble of complementary anomaly detectors under domain shifted conditions. In Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), Cited by: §II, §IV, §V, §V.
  • [17] K. Morita, T. Yano, and K. Q. Tran (2021) Anomalous sound detection using CNN-based features by self-supervised learning. In Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), Cited by: §I, §II, §II, §V.
  • [18] A. Nichol, J. Achiam, and J. Schulman (2018) On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999. Cited by: §I, §II, §III-D, §IV.
  • [19] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §IV.
  • [20] M. A. Prado-Romero and A. Gago-Alonso (2016) Detecting contextual collective anomalies at a glance. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2532–2537. External Links: Document Cited by: §I.
  • [21] A. Raghu, M. Raghu, S. Bengio, and O. Vinyals (2019) Rapid learning or feature reuse? towards understanding the effectiveness of maml. arXiv preprint arXiv:1909.09157. Cited by: §II.
  • [22] M. Reif, M. Goldstein, A. Stahl, and T. M. Breuel (2008) Anomaly detection by combining decision trees and parametric densities. In 2008 19th International Conference on Pattern Recognition, pp. 1–4. External Links: Document Cited by: §I.
  • [23] L. Ruff, J. R. Kauffmann, R. A. Vandermeulen, G. Montavon, W. Samek, M. Kloft, T. G. Dietterich, and K. Müller (2021) A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE. Cited by: §I, §II, §II, §III-A, §III-A, §III-C.
  • [24] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft (2018) Deep one-class classification. In International conference on machine learning, pp. 4393–4402. Cited by: §I, §II, §III-C.
  • [25] E. Rushe and B. Mac Namee (2019) Anomaly detection in raw audio using deep autoregressive networks. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3597–3601. Cited by: §II, §II.
  • [26] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §IV.
  • [27] J. Snell, K. Swersky, and R. S. Zemel (2017) Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175. Cited by: §I, §II, §III-C.
  • [28] K. Suefusa, T. Nishida, H. Purohit, R. Tanabe, T. Endo, and Y. Kawaguchi (2020) Anomalous sound detection based on interpolation deep neural network. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 271–275. Cited by: §II.
  • [29] R. Tanabe, H. Purohit, K. Dohi, T. Endo, Y. Nikaido, T. Nakamura, and Y. Kawaguchi (2021) MIMII DUE: sound dataset for malfunctioning industrial machine investigation and inspection with domain shifts due to changes in operational and environmental conditions. arXiv preprint arXiv:2105.02702. Cited by: §IV.
  • [30] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. Advances in neural information processing systems 29, pp. 3630–3638. Cited by: §II.
  • [31] C. Wang, Y. Zhang, and C. Liu (2018) Anomaly detection via minimum likelihood generative adversarial networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 1121–1126. External Links: Document Cited by: §I.
  • [32] H. Wang, H. Zhao, and B. Li (2021) Bridging multi-task learning and meta-learning: towards efficient training and effective adaptation. arXiv preprint arXiv:2106.09017. Cited by: §II, §III-D.
  • [33] M. Yamaguchi, Y. Koizumi, and N. Harada (2019) Adaflow: domain-adaptive density estimator with application to anomaly detection and unpaired cross-domain translation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3647–3651. Cited by: §II.