Meta Learning for Few-Shot One-class Classification

09/11/2020, by Gabriel Dahia, et al.

We propose a method that can perform one-class classification given only a small number of examples from the target class and none from the others. We formulate the learning of meaningful features for one-class classification as a meta-learning problem in which the meta-training stage repeatedly simulates one-class classification, using the classification loss of the chosen algorithm to learn a feature representation. To learn these representations, we require only multiclass data from similar tasks. We show how the Support Vector Data Description method can be used with our method, and also propose a simpler variant based on Prototypical Networks that obtains comparable performance, indicating that learning feature representations directly from data may be more important than which one-class algorithm we choose. We validate our approach by adapting few-shot classification datasets to the few-shot one-class classification scenario, obtaining results comparable to the state-of-the-art of traditional one-class classification and better than those of one-class classification baselines employed in the few-shot setting. Our code is available at https://github.com/gdahia/meta_occ


1 Introduction

One-class classification algorithms are the main approach to detecting anomalies from normal data, but traditional methods scale poorly, both in computational resources and in sample efficiency, with the data dimensions. Attempting to overcome these problems, previous work proposed using deep neural networks to learn feature representations for one-class classification. While successful in addressing some of the problems, these methods introduced other limitations. Some of them optimize a metric that is related to, but different from, their true one-class classification objective (e.g., input reconstruction [31]). Other methods require imposing specific structure on the models, like using generative adversarial networks (GANs) [11, 29], or removing biases and restricting the activation functions of the network model [28]. GANs are notoriously hard to optimize [2, 21], and removing biases restricts which functions the models can learn [28]. Furthermore, these methods require thousands of samples from the target class, only to obtain results that are comparable to those of the traditional baselines [28].

We propose a method that overcomes these problems if we have access to data from related tasks. Using recent insights from the meta-learning community on how to learn to learn from related tasks [9, 32], we show that it is possible to learn feature representations suitable for one-class classification by optimizing an estimator of its classification performance. This not only allows us to optimize the one-class classification objective without any restriction on the model besides differentiability, but also improves the data efficiency of the underlying algorithm. Our method obtains performance similar to that of traditional methods while using 1,000 times less data from the target class, defining a trade-off between the availability of data from related tasks and data from the target class.

For some one-class classification problems, related tasks exist, and so our method's requirement is satisfied. For example, in fraud detection, we could use normal activity from other users and create related tasks that consist of identifying whether an activity came from a given user or not, while still employing and optimizing one-class classification.

Figure 1: Overview of the proposed method. During the meta-training stage, we emulate a training stage by first sampling a set $S$ from a distribution similar to that of our target class data $\mathcal{D}$. In practice we use all examples from a class $i$, represented as the sets $\mathcal{D}_i$ in the figure, from a labeled dataset $\mathcal{D}'$. We then sample a mini-batch of pairs $(x, y)$, with $x$ being an example and $y$ a binary label indicating whether $x$ belongs to the same class as the examples in $S$, sampling again from the sets $\mathcal{D}_i$. Then, we use a one-class classification algorithm (e.g. SVDD) on the features resulting from applying $\phi$ to the examples of $S$. We use the resulting classifier to classify each example's features $\phi(x)$ as belonging or not to the same class as $S$, and compute the binary loss $\ell$ with the true labels $y$. We optimize $\phi$ by gradient descent on the value of this loss over many such tasks. After $\phi$ is learned, we run the same one-class classification algorithm on the resulting features, represented by the dashed-dotted arrow from the meta-training to the deployment stage, for the target class data in the true training stage, yielding the final one-class classification method.

We describe an instance of our method, the Meta Support Vector Data Description, obtained by using the Support Vector Data Description (SVDD) [33] as the one-class classification algorithm. We also simplify this method to obtain a one-class classification variant of Prototypical Networks [32], which we call One-class Prototypical Network. Despite its simplicity, this method obtains comparable performance to Meta SVDD. Our contributions thus are:

  • We show how to learn a feature representation for one-class classification (Section 2) by defining an estimator for the classification loss of such algorithms (Section 2.1). We also describe how to efficiently backpropagate through the objective when the chosen algorithm is SVDD, so that we can parametrize the feature representation with deep neural networks (Section 2.2). The efficiency requirement to train our model also makes it work in the few-shot setting.

  • We simplify Meta SVDD by replacing how the center of its hypersphere is computed. Instead of solving a quadratic optimization problem to find the weight of each example in the center's average, we remove the weighting and compute the center as an unweighted average (Section 3). The resulting One-class Prototypical Networks are simpler, have lower computational complexity, and have more stable training dynamics than Meta SVDD.

  • After that, we detail how our method conceptually addresses the limitations of previous work (Section 4). We also show that our method has promising empirical performance by adapting two few-shot classification datasets to the one-class classification setting and obtaining results comparable with the state-of-the-art of the many-shot setting (Section 5). Our results indicate that learning the feature representations may compensate for the simplicity of replacing SVDD with feature averaging, and that our approach is a viable way to trade data from the target class for labeled data from related tasks.

2 Meta SVDD

The Support Vector Data Description (SVDD) method [33] computes the hypersphere of minimum volume that contains every point in the training set. The idea is that only points inside the hypersphere belong to the target class, so minimizing the sphere's volume reduces the chance of including points that do not belong to the target class.

Formally, the radius $R$ of the hypersphere centered at $c$ covering the training set $S$ transformed by $\phi$ is

$R(c) = \max_{x \in S} \lVert \phi(x) - c \rVert$. (1)

The SVDD objective is to find the center $c^*$ that minimizes the radius of such a hypersphere, i.e.

$c^* = \arg\min_{c} \max_{x \in S} \lVert \phi(x) - c \rVert^2$. (2)

Finally, the algorithm determines that a point $x$ belongs to the target class if

$\lVert \phi(x) - c^* \rVert^2 \le R(c^*)^2$. (3)
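As a concrete illustration of this decision rule, the following is a minimal NumPy sketch (not the authors' released implementation): given features and a candidate center, it computes the squared radius over the training set and classifies new points as target-class members when they fall inside the hypersphere. The arrays `train` and `center` are hypothetical stand-ins for $\phi(x)$, $x \in S$, and $c$.

```python
import numpy as np

def svdd_sq_radius(train_features, center):
    """Squared radius: largest squared distance from the center to any training feature."""
    return np.max(np.sum((train_features - center) ** 2, axis=1))

def svdd_predict(features, center, sq_radius):
    """A point belongs to the target class if it falls inside the hypersphere (Equation 3)."""
    sq_dist = np.sum((features - center) ** 2, axis=1)
    return sq_dist <= sq_radius

# Toy usage with random vectors standing in for the features phi(x).
rng = np.random.default_rng(0)
train = rng.normal(size=(5, 64))      # 5 target-class examples, 64-d features
center = train.mean(axis=0)           # any candidate center; SVDD optimizes this choice
r2 = svdd_sq_radius(train, center)
print(svdd_predict(rng.normal(size=(3, 64)), center, r2))
```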

The SVDD objective, however, does not specify how to optimize the feature representation $\phi$. Previous approaches include using dimensionality reduction with Principal Component Analysis (PCA) [28], using a Gaussian kernel with the kernel trick [33], or using features learned with unsupervised learning methods, like deep belief networks [8]. We take a different approach: our goal is to learn $\phi$ for the task, and we detail how next.

2.1 Meta-learning One-class Classification

Our objective is to learn a feature representation $\phi$ such that the minimum volume hypersphere computed by the SVDD covers only the samples from the target class. We therefore divide the learning problem into two stages. In the meta-training stage, we learn the feature representation $\phi$. Once we learn $\phi$, we use it to learn a one-class classifier using the chosen algorithm (in this case, SVDD) from the data of the target class in the training stage. This is illustrated in Figure 1.

Notice how both the decision on unseen inputs (Equation 3) and the hypersphere's center (Equation 2) depend on $\phi$. Perfectly learning $\phi$ in the meta-training stage would map any input distribution into a space that SVDD can correctly classify, and would therefore depend neither on the given data nor on what the target class is; those would be learned by the SVDD after transforming the data with $\phi$ in the subsequent training stage. We do not know how to learn $\phi$ perfectly, but the above observation illustrates that we do not need to learn it with data from the target class.

With that observation, we can use the framework of nested learning loops [26] to describe how we propose to learn $\phi$:

  • Inner loop: Use $\phi$ to transform the inputs, and use SVDD to learn a one-class classification boundary for the resulting features.

  • Outer loop: Learn $\phi$ from the classification loss obtained with the SVDD boundary.

We use the expected classification loss in the outer loop. With this, we can use data that comes from the same distribution as the data of the target class, but with different classification tasks. To make this definition formal, first let $f_\theta$ be a one-class classification function parametrized by $\theta$ which receives as inputs a subset $S$ of examples from the target class and an example $x$, and outputs the probability that $x$ belongs to the target class. For a suitable classification loss $\ell$, our learning loss is

$\mathcal{L}(\theta) = \mathbb{E}_{S}\left[\, \mathbb{E}_{(x, y)}\left[\ell\big(f_\theta(S, x),\, y\big)\right]\right]$, (4)

where $y$ is a binary label indicating whether $x$ belongs to the same distribution as $S$ or not. The outer expectation of Equation 4 defines a one-class classification task, and the inner expectation is over labeled examples for this task (hence the dependency of the labeled example distribution on $S$). Since we have access neither to the distribution over tasks nor to the distribution over labeled examples, we approximate them with related tasks. Intuitively, the closer the distribution of the tasks we use in this approximation, the better our feature representation.

To compute this approximation in practice, we require access to a labeled multiclass classification dataset $\mathcal{D}' = \{(x_j, y_j)\}_j$, where $x_j$ is the $j$-th element and $y_j$ its label, that has a distribution similar to that of our dataset $\mathcal{D}$ but is disjoint from it (i.e. none of the elements of $\mathcal{D}'$ are in $\mathcal{D}$, and none of its elements belong to any of the classes in $\mathcal{D}$). Datasets like $\mathcal{D}'$ are common in the meta-learning and few-shot learning literature, and their existence is a standard assumption in previous work [32, 9, 20]. However, this restricts the tasks to which our method can be applied to those that have such related data available.

We then create the datasets $\mathcal{D}_i$ from $\mathcal{D}'$ by separating its elements by class, i.e.

$\mathcal{D}_i = \{x_j \mid (x_j, y_j) \in \mathcal{D}',\ y_j = i\}$. (5)

We create the required binary classification tasks by picking $\mathcal{D}_i$ as the data for the target class, and the examples from $\mathcal{D}_k$, $k \neq i$, as the input data for the negative class. Finally, we approximate the expectations in Equation 4 by first sampling mini-batches of these binary classification tasks and then averaging over mini-batches of labeled examples from each of the sampled tasks. By making each sampled $S$ have few examples (e.g. 5 or 20), we not only make our method scalable but also learn $\phi$ for few-shot one-class classification.
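To make the episode construction concrete, the sketch below shows one way the mini-batch approximation could be implemented in PyTorch. The names `class_to_examples` (a dict mapping each class index to a list of example tensors, i.e. the sets $\mathcal{D}_i$), `phi` (the feature network), and `one_class_prob` (whichever one-class head converts support and query features into membership probabilities) are hypothetical placeholders, not the authors' code.

```python
import random
import torch
import torch.nn.functional as F

def sample_episode(class_to_examples, n_support=5, n_query=10):
    """Build one binary one-class task: a support set S from a target class and a
    labeled query set with examples from that class and from a negative class."""
    target_cls, negative_cls = random.sample(list(class_to_examples), 2)
    picked = random.sample(class_to_examples[target_cls], n_support + n_query)
    support = torch.stack(picked[:n_support])
    pos = torch.stack(picked[n_support:])
    neg = torch.stack(random.sample(class_to_examples[negative_cls], n_query))
    queries = torch.cat([pos, neg])
    labels = torch.cat([torch.ones(n_query), torch.zeros(n_query)])
    return support, queries, labels

def episode_loss(phi, one_class_prob, support, queries, labels):
    """Estimate the inner expectation of Equation 4 for a single sampled task."""
    probs = one_class_prob(phi(support), phi(queries))
    return F.binary_cross_entropy(probs, labels)

# Outer loop (sketch): average the loss over a mini-batch of tasks, then take a gradient step.
# loss = torch.stack([episode_loss(phi, head, *sample_episode(data)) for _ in range(16)]).mean()
# loss.backward(); optimizer.step()
```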

In the next section, we define a model for $f_\theta$ and a way to optimize it over Equation 4.

2.2 Gradient-based Optimization

If we choose $\phi$ to be a neural network, the chain rule lets us optimize it to minimize the loss in Equation 4 with gradient descent, as long as $f_\theta$ and $\ell$ are differentiable and have meaningful gradients. $\ell$ can be the standard binary cross-entropy between the data and model distributions [10].

We also modify the SVDD to satisfy the requirements of the function $f_\theta$. Neither the way it computes the hypersphere's center, by solving an optimization problem (Equation 2), nor its hard, binary decisions (Equation 3) are immediately suitable for gradient-based optimization.

To address the hard, binary decisions, we adopt the approach of Prototypical Networks [32] and treat the squared distance from the features to the center (the left-hand side of Equation 3) as the input logit of a logistic regression model. Doing this not only solves the problem of uninformative gradients coming from the binary outcomes of SVDD but also simplifies its implementation in modern automatic differentiation/machine learning software, e.g. PyTorch [24]. As our logits are non-negative, using the sigmoid function to convert logits into probabilities would result in probabilities of at least 0.5 for every input, so we replace the sigmoid with another squashing function and keep the binary cross-entropy objective otherwise unchanged.

As for how to compute the center $c$ in a differentiable manner, we can write it as the weighted average of the input features

$c = \sum_{i} \alpha_i\, \phi(x_i)$, (6)

where the weights $\alpha$ are the solution of the following quadratic programming problem, which is the dual of the problem defined in Equation 2 [7, 33]:

$\max_{\alpha} \ \sum_{i} \alpha_i K_{ii} - \sum_{i,j} \alpha_i \alpha_j K_{ij}$ (7)
subject to $\sum_{i} \alpha_i = 1$ (8)
$\alpha_i \ge 0 \ \text{for all } i$, (9)

and

$K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$ (10)

is the kernel matrix of $\phi$ for the input set $S$. Despite such quadratic programs having no known analytical solution and requiring a projection operator to unroll their optimization procedure because of the inequality constraints, the quadratic programming layer [1] can efficiently backpropagate through the solution and supports GPU usage.

Still, the quadratic programming layer has $O(n^3)$ complexity for $n$ optimization variables [1]; in the case of Meta SVDD, $n$ is equal to the number of examples in $S$ during training [20]. As the size of the network is constant, this is the overall complexity of performing a training step in the model. Since we keep the number of examples small, 5 to 20, the runtime is dominated by the computation of the features $\phi(x)$.

In practice, we follow previous work that uses quadratic programming layers [20] and add a small stabilization value $\epsilon$ to the diagonal of the kernel matrix (Equation 10), i.e.

$\tilde{K} = K + \epsilon I$, (11)

and we use $\tilde{K}$ in place of $K$ in Equation 7. Not adding this stabilization term results in failure to converge in some cases.
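The sketch below shows how the weighted center could be computed differentiably with the qpth quadratic programming layer. It assumes qpth's OptNet-style interface, in which `QPFunction(Q, p, G, h, A, b)` solves $\min_\alpha \tfrac{1}{2}\alpha^\top Q \alpha + p^\top \alpha$ subject to $G\alpha \le h$, $A\alpha = b$; the function below is illustrative and not the authors' released code, and the value of $\epsilon$ is an arbitrary placeholder.

```python
import torch
from qpth.qp import QPFunction  # differentiable QP layer of Amos and Kolter [1]

def svdd_center(features, eps=1e-4):
    """Differentiable SVDD center as a weighted average of the features (Equation 6).

    Solves the dual in Equations 7-9, max_a sum_i a_i K_ii - a^T K a with
    sum_i a_i = 1 and a_i >= 0, rewritten as the minimization
    1/2 a^T (2K) a + (-diag K)^T a that QPFunction expects.
    """
    n = features.shape[0]
    K = features @ features.t()            # kernel matrix, Equation 10
    K = K + eps * torch.eye(n)             # stabilization term, Equation 11
    Q = 2.0 * K
    p = -torch.diagonal(K)
    G = -torch.eye(n)                      # -a <= 0, i.e. a >= 0 (Equation 9)
    h = torch.zeros(n)
    A = torch.ones(1, n)                   # sum_i a_i = 1 (Equation 8)
    b = torch.ones(1)
    alpha = QPFunction(verbose=-1)(Q, p, G, h, A, b).squeeze(0)
    return alpha @ features                # weighted average of the input features
```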

Using the program defined by objective 7 and constraints 8 and 9 to solve SVDD also allows us to use the kernel trick to make the decision boundary non-linear with respect to $\phi$ [33]. We believe this would not add much since, in theory, a deep neural network representing $\phi$ can already handle the non-linearities that map the input to the output.

SVDD [33] also introduces slack variables to account for outliers in the input set $S$. Since our setting is few-shot one-class classification, we do not believe these would benefit the method's performance, because outliers are unlikely in such small samples. We leave the analysis to confirm or refute these conjectures to future work.

3 One-class Prototypical Networks

The only reason to solve the quadratic programming problem defined by objective 7 and constraints 8 and 9 is to obtain the weights $\alpha$ for the features of each example in Equation 6.

We experiment with replacing the weights in Equation 6 by uniform weights $\alpha_i = 1/n$, where $n$ is the number of examples in $S$. The center then becomes a simple average of the input features

$c = \frac{1}{n} \sum_{i} \phi(x_i)$, (12)

and we no longer need to solve the quadratic program. The remainder of the method, i.e. its training objective, how tasks are sampled, etc., remains the same. This avoids the cubic complexity of the forward pass and the destabilization issue altogether. We call this method One-class Prototypical Networks because it can be cast as learning binary Prototypical Networks [32] with a binary cross-entropy objective.
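A minimal sketch of this variant follows: the prototype is the unweighted mean of the support features (Equation 12) and the squared distance to it is the non-negative logit. The $\exp(-d)$ squashing used here is an illustrative stand-in, since the exact squashing function that replaces the sigmoid is not specified above.

```python
import torch
import torch.nn.functional as F

def one_class_protonet_probs(support_features, query_features):
    """Score queries against the unweighted mean of the support features (Equation 12)."""
    prototype = support_features.mean(dim=0)                   # simple average, Equation 12
    sq_dist = ((query_features - prototype) ** 2).sum(dim=1)   # non-negative logits
    return torch.exp(-sq_dist)                                 # illustrative squashing: d=0 -> p=1

def one_class_protonet_loss(support_features, query_features, labels):
    """Binary cross-entropy between membership probabilities and the true labels."""
    probs = one_class_protonet_probs(support_features, query_features)
    return F.binary_cross_entropy(probs.clamp(1e-6, 1 - 1e-6), labels)
```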

Despite it being a simpler method than Meta SVDD, we conjecture that learning $\phi$ to be a good representation for One-class Prototypical Networks can compensate for its algorithmic simplicity so that performance does not degrade.

4 Related work

4.1 One-class Classification

The SVDD [33], reviewed in Section 2, is closely related to the One-class Support Vector Machine (One-class SVM) [30]. Whereas the SVDD finds a hypersphere to enclose the input data, the One-class SVM finds a maximum-margin hyperplane that separates the inputs from the origin of the coordinate system. Like the SVDD, it can also be formulated as a quadratic program, solved in kernelized form, and extended with slack variables to account for outliers in the input data. In fact, when the chosen kernel is the commonly used Gaussian kernel, the two methods are equivalent [30].

Besides their equivalence in that case, the One-class SVM more generally suffers from the same limitations as the SVDD: it requires explicit feature engineering (i.e. it prescribes no way to obtain $\phi$), and it scales poorly both with the number of samples and with the dimension of the data.

In Section 2, we propose to learn $\phi$ from related tasks, which addresses the feature engineering problem. We also make it so that only a small set $S$ is required to learn the one-class classification boundary, solving the scalability problem in the number of samples. Finally, by making the feature dimension much smaller than the input dimension, we address the scalability issue regarding the data dimensionality.

The limitations of SVDD and One-class SVMs led to the development of deep approaches to one-class classification, where the previous approaches are known as shallow because they do not rely on deep (i.e. multi-layered) neural networks for feature representation.

Most previous approaches that use deep neural networks to represent the input features for downstream use in one-class classification algorithms are trained with a surrogate objective, like the representation learned for input reconstruction with deep autoencoders [12].

Autoencoder methods learn feature representations by requiring the network to reconstruct inputs while preventing it from learning the identity function. These networks are usually divided into an encoder, tasked with converting an input example into an intermediate representation, and a decoder, which gets the representation and must reconstruct the input [10].

The idea is that if the identity function cannot be learned, then the representation has captured semantic information of the input that is sufficient for its partial reconstruction and for other tasks. How the identity function is prevented determines the type of autoencoder, and many options exist: reducing the dimension of the intermediate representation or imposing specific distributions on it, adding a regularization term to the model's objective, or corrupting the input with noise [10].

Seeböck et al. [31] train a deep convolutional autoencoder (DCAE) on images of the target class, in their case healthy retinal image data; after that, the decoder is discarded and a One-class SVM is trained on the resulting intermediate representations. The main issue with this approach is that the autoencoder training objective does not ensure that the learned representations are useful for classification.

A related approach is to reuse features from networks trained for multiclass classification. Oza and Patel [23] remove the softmax layer of a Convolutional Neural Network (CNN) [19] trained on the ImageNet dataset [6] and use the rest as a feature extractor. The authors then train the fully-connected layers of the pre-trained network alongside a new fully-connected layer tasked with discriminating between features from the target class and data sampled from a spherical Gaussian distribution; the convolutional layers are not updated.

AnoGANs [29] are trained as Generative Adversarial Networks [11] to generate samples from the target class. After that, gradient descent is used to find the sample in the noise distribution that best reconstructs the unseen example to be classified, which is equivalent to approximately inverting the generator using optimization. The classification score is the input reconstruction error, which assumes pixel-level similarity determines membership in the target class.

Like our method, Deep SVDD [28] attempts to learn feature representations for one-class classification from the data using gradient-based optimization with a neural network model. It consists of directly reducing the volume of a hypersphere containing the features, and in that it is a deep version of the original SVDD.

Deep SVDD's algorithm relies on setting the center every few iterations to the mean of the features from a forward pass instead of computing the minimum bounding sphere. Since its objective is to minimize the volume of the hypersphere containing the features, the algorithm must avoid the pathological solution of learning a constant function. This requires imposing architectural constraints on the network, the strongest of which is that the network's layers can have no bias terms. The authors also initialize the weights with those of an encoder from a trained autoencoder. Neural network models in our method have no such restrictions and do not require a pre-training stage.

One advantage of Deep SVDD over our work is that it does not require data from tasks with a similar distribution: it is trained only on target class data. While this is an advantage, there is a downside. It is not clear to us, from the paper describing Deep SVDD, how to decide for how long to train a Deep SVDD model, how to tune its many hyperparameters, or what performance to expect of the method on unseen data. These are usually done by computing useful metrics on a validation set. However, for Deep SVDD, the optimal value of the objective can be reached by pathological solutions, so a validation set is not useful.

Ruff et al. [28] prove that using certain activation functions or keeping bias terms allows the model to learn the constant function, but they do not prove the converse, i.e. they do not prove that constant functions cannot be learned by the restricted models. The authors also do not analyze which functions are no longer learnable when the model is restricted in this way. For Meta SVDD, on the other hand, the related tasks give predictive measures of the metrics of interest and allow hyperparameter tuning and early stopping.

4.2 Few-shot Learning

The main inspiration for the ideas in our paper, besides Deep SVDD, came from the field of meta-learning, in particular few-shot classification. Prototypical Networks [32] are few-shot classifiers that create prototypes from few labeled examples and use their squared distances to an unseen example as the logits to classify it as one of their classes. We first saw the ideas of learning the feature representation from similarly distributed tasks and of using the squared distances in that work. Snell et al. [32] also propose feature averaging as a way to summarize class examples and show its competitive performance despite its simplicity; One-class Prototypical Networks are the one-class variant of this method.

Recently, Lee et al. [20] proposed to learn feature representations for few-shot classification with convex learners, including multi-class Support Vector Machines [4], using gradient-based optimization. Their work is similar to ours in formulating learners as quadratic programs and in solving them with quadratic programming layers, but it does not address one-class classification.

5 Experiments

5.1 Evaluation Protocol

Our first experiment is an adaptation of the evaluation protocol of Deep SVDD [28] to the few-shot setting, to compare Meta SVDD with previous work. The original evaluation protocol consists of picking one of the classes of the dataset, training the method on the examples of that class in the training set (using the train-test split proposed by the maintainers), and using all the examples in the test set to compute the mean and standard deviation of the Area under the curve (AUC) of the trained classifier over 10 repetitions, on the MNIST [18] and CIFAR-10 [16] datasets.

We modified the protocol because there are only 10 classes in these datasets, which is not enough for meta-learning one-class classifiers. This illustrates the trade-off introduced by our approach: despite requiring many fewer examples per class, it requires many more classes. Our modifications only address the number of classes, and we keep the rest of the protocol as similar as possible to make the results comparable.

The first modification is the replacement of CIFAR-10 by the CIFAR-FS dataset [3], a split of CIFAR-100 for few-shot classification in which there is no class overlap between the training, validation, and test sets. CIFAR-FS has 64 classes for training, 16 for validation, and 20 for testing, and each class has 600 images.

No such split is possible for MNIST because it has no fine-grained classes like CIFAR-10 and CIFAR-100. Therefore, we use the Omniglot dataset [17], which is considered the "transposed" version of MNIST because it has many classes with few examples each, instead of the many examples in the 10 classes of MNIST. This dataset consists of 20 images of each of its 1623 handwritten characters, which are usually augmented with rotations in multiples of 90° to obtain more classes [34, 3, 32, 9]. We follow the pre-processing and dataset split proposed by Vinyals et al. [34], resizing the images to 28×28 pixels and using 4800 classes for training and 1692 for testing, which is nowadays standard in few-shot classification work [9, 32, 3].

Another modification concerns reporting: since there are only 10 classes in MNIST and CIFAR-10, Deep SVDD [28] reports the AUC metrics for each class. This is feasible for CIFAR-FS, which has 20 test classes, but not for Omniglot, which has 1692. We summarize these statistics by presenting the minimum, median, and maximum mean AUC alongside their standard deviations.

The last modification is in the number of elements per class used in the test set evaluation. Since there are many classes and we are dealing with few-shot classification, we use only two times the number of examples in $S$ for the target class and for the negative class, e.g. if the task is 5-shot learning, then there are 10 examples from the target class and 10 examples from the negative class for evaluation.
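Per-episode AUC can then be computed directly from the membership scores of these examples. The snippet below is a small sketch of that computation with scikit-learn; the score arrays are hypothetical outputs of whichever one-class classifier is being evaluated.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def episode_auc(scores_target, scores_negative):
    """AUC for one few-shot test episode: higher scores should mean 'target class'."""
    y_true = np.concatenate([np.ones(len(scores_target), dtype=int),
                             np.zeros(len(scores_negative), dtype=int)])
    y_score = np.concatenate([scores_target, scores_negative])
    return roc_auc_score(y_true, y_score)

# Example for a 5-shot task: 10 target-class and 10 negative-class membership scores.
rng = np.random.default_rng(0)
print(episode_auc(rng.uniform(0.4, 1.0, size=10), rng.uniform(0.0, 0.6, size=10)))
```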

To better compare previous methods with ours in the few-shot setting, we evaluate the state-of-the-art method for general deep one-class classification, Deep SVDD [28], in our modified protocol. We run the evaluation protocol on CIFAR-FS using only 5 images for training, and evaluate it using 10 images from the target class and 10 images from a negative class; we do this 10 times for each pair of the 20 test classes to compute mean and standard deviation statistics for the AUC. We do not do this for Omniglot because it would require training more than 1692 Deep SVDD models.

We also conduct a second experiment, based on the standard few-shot classification experiment, in which we evaluate the mean 5-shot one-class classification accuracy over 10,000 episodes of tasks consisting of 10 examples from the target class and 10 examples from the negative class. We use this experiment to compare against a shallow baseline, PCA followed by a Gaussian kernel One-class SVM [30], and against One-class Prototypical Networks. We use the increased number of episodes to compute 95% confidence intervals, like previous work on few-shot multiclass classification [3, 20].

(mean AUC ± std. dev., in %; "--" marks entries not reported)

Dataset    Stat.  DCAE          Deep SVDD   | Dataset    Deep SVDD    One-Class Protonet  Meta SVDD
MNIST      Min.   78.2 ± 2.7    88.5 ± 0.9  | Omniglot   --           89.0 ± 0.2          88.6 ± 0.4
           Med.   86.7 ± 0.9    94.6 ± 0.9  |            --           99.5 ± 0.0          99.5 ± 0.0
           Max.   98.3 ± 0.6    99.7 ± 0.1  |            --           100.0 ± 0.0         100.0 ± 0.0
CIFAR-10   Min.   51.2 ± 5.2    50.8 ± 0.8  | CIFAR-FS   47.9 ± 4.9   60.2 ± 3.4          59.0 ± 5.7
           Med.   58.6 ± 2.9    65.7 ± 2.5  |            64.0 ± 5.0   72.7 ± 3.0          71.0 ± 4.0
           Max.   76.8 ± 1.4    75.9 ± 1.2  |            92.4 ± 2.3   90.1 ± 2.3          92.5 ± 1.7

Table 1: Minimum, median and maximum mean AUC alongside their standard deviation for one-class classification methods for 10 repetitions. We highlight in boldface the highest mean and others which are within one standard deviation from it. The results for the many-shot baselines in MNIST and CIFAR-10 are compiled from the table by Ruff et al. [28]. The results for Omniglot and CIFAR-FS are for 5-shot one-class classification.

5.2 Setup

We parametrize $\phi$ with the neural network architecture introduced by Vinyals et al. [34] that is commonly used in other few-shot learning work [9, 32]. There are four convolutional blocks with 64 filters each, and each block is composed of a 3×3 kernel, stride 1, "same" 2D convolution, batch normalization [13], followed by max-pooling and ReLU activations [14].
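For reference, a minimal PyTorch sketch of such an embedding network follows. It is illustrative rather than the exact released architecture: the 2×2 pooling size is an assumption (the text above only states max-pooling), and `in_channels` would be 1 for Omniglot and 3 for CIFAR-FS.

```python
import torch.nn as nn

def conv_block(in_channels, out_channels=64):
    # One block as described above: 3x3 "same" convolution (stride 1), batch norm,
    # max-pooling (2x2 assumed), then ReLU.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.MaxPool2d(2),
        nn.ReLU(),
    )

def embedding_network(in_channels=1):
    """Four-block embedding network in the style of Vinyals et al. [34]; a sketch only."""
    return nn.Sequential(
        conv_block(in_channels), conv_block(64), conv_block(64), conv_block(64),
        nn.Flatten(),
    )
```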

We implemented the neural network using PyTorch [24] and the qpth package [1] for the quadratic programming layer. We also used Scikit-Learn [25] and NumPy [22] to compute metrics, implement the shallow baselines, and perform miscellaneous tasks, and Torchmeta [5] to sample mini-batches of tasks, as described in Section 2.1.

We optimize both Meta SVDD and One-class Prototypical Networks with stochastic gradient descent [27] on the objective defined in Section 2.1 and Equation 4, using the Adam optimizer [15]. We use a constant learning rate over mini-batches of 16 tasks, each having a set $S$ with 5 examples and a query set with 10 examples from the target class and 10 examples from a randomly picked negative class. The learning rate value was the first one we tried, so no tuning was required. We picked, among the candidate sizes we tried, the task batch size that performed best on the validation set when training halts. We evaluate performance on the validation set with 95% confidence intervals of the model's accuracy on 500 tasks randomly sampled from the validation set, and we consider that a model is better than another if the lower bound of its confidence interval is greater, or if its mean is higher when the lower bounds are equal up to 5 decimal places. Early stopping halts training when performance on the validation set does not increase for 10 evaluations in a row, and we keep the model with the highest validation performance. We evaluate the model on the validation set every 100 training steps.
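A small sketch of this model-selection rule is given below, assuming the usual normal approximation for the 95% confidence interval over sampled validation tasks; it is an illustration of the comparison described above, not the authors' code.

```python
import numpy as np

def confidence_interval(task_accuracies, z=1.96):
    """Mean accuracy and 95% confidence-interval half-width over sampled validation tasks."""
    accs = np.asarray(task_accuracies)
    half_width = z * accs.std(ddof=1) / np.sqrt(len(accs))
    return accs.mean(), half_width

def is_better(candidate, incumbent):
    """Compare lower confidence bounds; break ties (to 5 decimal places) by the mean."""
    (m_c, h_c), (m_i, h_i) = candidate, incumbent
    lb_c, lb_i = round(m_c - h_c, 5), round(m_i - h_i, 5)
    if lb_c != lb_i:
        return lb_c > lb_i
    return m_c > m_i
```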

The results for the few-shot experiment with Deep SVDD are obtained by modifying the code made available by the authors (https://github.com/lukasruff/Deep-SVDD-Pytorch), keeping the same hyperparameters.

For the few-shot baseline accuracy experiment with PCA and One-class SVMs with Gaussian kernel, we use the grid search space used in prior work [28]: the Gaussian kernel parameter $\gamma$ and the One-class SVM parameter $\nu$ are each selected from the same sets of candidate values used there. Furthermore, we give the shallow baseline an advantage by evaluating every parameter combination on the test set and reporting the best result.

Figure 2: Mean AUC with shaded standard deviations for tasks in the CIFAR datasets, sorted by increasing mean value. Comparing Deep SVDD across datasets and protocols shows that the modified protocol is reasonable for evaluating few-shot one-class classification because the trend in task difficulty is similar. Within the few-shot protocol on CIFAR-FS, the meta one-class classification methods are numerically superior, show less variance, and can be meta-trained once for all tasks, with simple adaptation for unseen tasks, but they require related task data.

5.3 Results

We reproduce the results reported for Deep SVDD [28] and its baselines alongside the results for 5-shot Meta SVDD and One-class Prototypical Networks, and our experiment with 5-shot Deep SVDD, in Table 1. Figure 2 also shows the mean AUC with shaded standard deviations for the results on the CIFAR dataset variants.

While the results from different datasets are not directly comparable due to the differences in setting and application listed in Section 5.1, they show that the approach has AUC performance similar to the many-shot state-of-the-art. Figure 2 shows that when we sort the mean AUCs for CIFAR-10 and CIFAR-FS, the performance from hardest to easiest tasks exhibits similar trends despite these differences, and that the modifications to the protocol are reasonable.

This experiment is evidence that our method is able to reduce the required amount of data from the target class when we have labeled data from related tasks. Note that it is not the objective of our experiments to show that our method performs better than previous approaches, since they operate in different settings, i.e. few-shot with related tasks and many-shot without them.

The comparison with Deep SVDD in the few-shot scenario gives further evidence of the relevance of our method: both Meta SVDD and One-Class Prototypical Networks obtain higher minimum and median AUC than Deep SVDD. Another advantage is that we train once on the training set of Omniglot or CIFAR-FS, and learn only either the SVDD or the feature average on each set $S$ in the test set. We also obtain these results without any pre-training, and with a clear validation procedure to guide hyperparameter tuning and early stopping.

These results also show that we can train a neural network for $\phi$ without architectural restrictions to optimize a one-class classification objective, whereas other methods either require feature engineering, optimize another metric, or impose restrictions on the model architecture to prevent learning trivial functions.

Dataset     PCA + One-class SVM   One-class Protonet   Meta SVDD
Omniglot    50.64 ± 0.10%         94.68 ± 0.17%        94.33 ± 0.19%
CIFAR-FS    54.77 ± 0.31%         67.67 ± 0.39%        64.95 ± 0.37%

Table 2: Mean accuracy alongside 95% confidence intervals computed over 10,000 tasks for the Gaussian kernel One-class SVM with PCA, Meta SVDD, and One-class Prototypical Networks. The results with the highest mean and those whose confidence intervals overlap with it are in boldface. We report the best result for the One-class SVM in its parameter search space, which gives it an advantage over the other two methods. Despite employing a simpler algorithm for one-class classification, One-class Prototypical Networks obtain equivalent accuracy on Omniglot and better accuracy on CIFAR-FS than Meta SVDD. This indicates that learning feature representations is more important than which one-class classification algorithm we use.

The results for our second experiment, comparing the accuracies of Meta SVDD, a shallow baseline, and One-class Prototypical Networks, are presented in Table 2.

In this experiment, we see an increase from almost random to almost perfect performance for both methods when compared to the shallow baseline on Omniglot. Both methods for few-shot one-class classification that use related tasks have equivalent performance on Omniglot. The gain is not as large on CIFAR-FS, but it is more than 10% in absolute terms for both methods, which shows they are a marked improvement over the shallow baseline.

Comparing the two proposed methods, we observe the unexpected result that the simpler method, One-class Prototypical Networks, has equivalent accuracy in the Omniglot experiment and better accuracy in the CIFAR-FS experiment. This indicates that learning the feature representation directly from data might be more important than which one-class classification algorithm we choose, and that the increased complexity of using SVDD over simple averaging does not translate into improved performance in this setting.

We also attempted to run this experiment on the miniImageNet dataset [34], a dataset for few-shot learning using images from the ImageNet dataset [6]. The accuracy on the validation set, however, never rose above 50%. One of the motivations for introducing CIFAR-FS was that there was a gap in difficulty between training models on Omniglot and on miniImageNet, and that successfully training models on the latter took hours [3]. Since none of the previous one-class methods attempted ImageNet-level datasets, and the worst performance on the CIFAR datasets is already near random guessing, we leave the problem of training one-class classification algorithms on this dataset open for future work.

Finally, we ran a small variation of the second experiment in which the number of examples in $S$ is greater than during training, using 10 examples instead of 5. The results stayed within the 5-shot accuracy confidence intervals for both models in this 10-shot deployment scenario.

6 Conclusion

We have described a way to learn feature representations so that one-class classification algorithms can learn, from data, decision boundaries that contain the target class, optimizing an estimator of their true objective. Furthermore, this method works with 5 samples from the target class, with performance similar to the state-of-the-art in the setting where target class data is abundant, and better than the many-shot state-of-the-art method when the latter is employed in the few-shot setting. We also provide an experiment showing that a simpler one-class classification algorithm yields comparable performance, displaying the advantages of learning feature representations directly from data.

One possibility for replacing the main requirement of our method with a less limiting one would be the capability to generate related tasks from unlabeled data. A simple approach in this direction could be using weaker learners to define pseudo-labels for the data. Doing this successfully would significantly increase the number of settings where our method can be used.

The main limitations of our method, besides the requirement of related tasks, are the destabilization of the quadratic programming layer, which we addressed by adding a stabilization term to the diagonal of the kernel matrix or by simplifying the one-class classification algorithm to use the mean of the features, and its failure to obtain meaningful results on the miniImageNet dataset.

We believe that not only should solutions to these limitations be investigated in future work, but also other questions left open here, like confirming our hypothesis that introducing slack variables would not benefit Meta SVDD.

Other directions for future work are extending our method to other settings and using other one-class classification methods besides SVDD. Tax and Duin [33] also detail a way to incorporate negative examples into the SVDD objective, so we could try learning $\phi$ with this method, minimizing the hypersphere's volume instead of converting SVDD into a binary classification problem that uses the unseen examples' distances to the center as logits.

References

  • [1] B. Amos and J. Z. Kolter (2017-06–11 Aug) OptNet: differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 136–145. External Links: Link Cited by: §2.2, §2.2, §5.2.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875. Cited by: §1.
  • [3] L. Bertinetto, J. F. Henriques, P. H. S. Torr, and A. Vedaldi (2019) Meta-learning with differentiable closed-form solvers. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §5.1, §5.1, §5.1, §5.3.
  • [4] C. Cortes and V. Vapnik (1995) Support-vector networks. Machine Learning 20 (3), pp. 273–297. External Links: Document Cited by: §4.2.
  • [5] T. Deleu, T. Würfl, M. Samiei, J. P. Cohen, and Y. Bengio (2019) Torchmeta: A Meta-Learning library for PyTorch. Note: Available at: https://github.com/tristandeleu/pytorch-meta External Links: Link Cited by: §5.2.
  • [6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li (2009) ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pp. 248–255. External Links: Document Cited by: §4.1, §5.3.
  • [7] D. J. Elzinga and D. W. Hearn (1972) The minimum covering sphere problem. Management science 19 (1), pp. 96–104. Cited by: §2.2.
  • [8] S. M. Erfani, S. Rajasegarar, S. Karunasekera, and C. Leckie (2016) High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognition 58, pp. 121–134. External Links: ISSN 0031-3203, Document, Link Cited by: §2.
  • [9] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 1126–1135. Cited by: §1, §2.1, §5.1, §5.2.
  • [10] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Note: http://www.deeplearningbook.org Cited by: §2.2, §4.1, §4.1.
  • [11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 2672–2680. External Links: Link Cited by: §1, §4.1.
  • [12] G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507. External Links: Document, ISSN 0036-8075, Link Cited by: §4.1.
  • [13] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 448–456. External Links: Link Cited by: §5.2.
  • [14] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun (2009) What is the best multi-stage architecture for object recognition?. In IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, September 27 - October 4, 2009, pp. 2146–2153. External Links: Link, Document Cited by: §5.2.
  • [15] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §5.2.
  • [16] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report Cited by: §5.1.
  • [17] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. External Links: Document, ISSN 0036-8075, Link, https://science.sciencemag.org/content/350/6266/1332.full.pdf Cited by: §5.1.
  • [18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. External Links: Document, ISSN 1558-2256 Cited by: §5.1.
  • [19] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel (1990) Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pp. 396–404. Cited by: §4.1.
  • [20] K. Lee, S. Maji, A. Ravichandran, and S. Soatto (2019-06) Meta-learning with differentiable convex optimization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1, §2.2, §2.2, §4.2, §5.1.
  • [21] L. Mescheder, A. Geiger, and S. Nowozin (2018-10–15 Jul) Which training methods for GANs do actually converge?. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 3481–3490. External Links: Link Cited by: §1.
  • [22] T. Oliphant (2006–) NumPy: a guide to NumPy. Note: USA: Trelgol Publishing External Links: Link Cited by: §5.2.
  • [23] P. Oza and V. M. Patel (2019) One-class convolutional neural network. IEEE Signal Process. Lett. 26 (2), pp. 277–281. External Links: Document Cited by: §4.1.
  • [24] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. External Links: Link Cited by: §2.2, §5.2.
  • [25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §5.2.
  • [26] A. Raghu, M. Raghu, S. Bengio, and O. Vinyals (2019) Rapid learning or feature reuse? towards understanding the effectiveness of MAML. CoRR abs/1909.09157. External Links: Link, 1909.09157 Cited by: §2.1.
  • [27] H. Robbins and S. Monro (1951-09) A Stochastic Approximation Method. The Annals of Mathematical Statistics 22 (3), pp. 400–407. External Links: ISSN 0003-4851, 2168-8990, Link, Document Cited by: §5.2.
  • [28] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft (2018-10–15 Jul) Deep one-class classification. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 4393–4402. External Links: Link Cited by: §1, §2, §4.1, §4.1, §5.1, §5.1, §5.1, §5.2, §5.3, Table 1.
  • [29] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In Information Processing in Medical Imaging - 25th International Conference, IPMI 2017, Boone, NC, USA, June 25-30, 2017, Proceedings, pp. 146–157. External Links: Document Cited by: §1, §4.1.
  • [30] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (2001) Estimating the support of a high-dimensional distribution. Neural Computation 13 (7), pp. 1443–1471. External Links: Document Cited by: §4.1, §5.1.
  • [31] P. Seeböck, S. M. Waldstein, S. Klimscha, B. S. Gerendas, R. Donner, T. Schlegl, U. Schmidt-Erfurth, and G. Langs (2016) Identifying and categorizing anomalies in retinal imaging data. CoRR abs/1612.00686. External Links: Link, 1612.00686 Cited by: §1, §4.1.
  • [32] J. Snell, K. Swersky, and R. S. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 4077–4087. Cited by: §1, §1, §2.1, §2.2, §3, §4.2, §5.1, §5.2.
  • [33] D. M. J. Tax and R. P. W. Duin (2004) Support vector data description. Machine Learning 54 (1), pp. 45–66. External Links: Document Cited by: §1, §2.2, §2.2, §2.2, §2, §2, §4.1, §6.
  • [34] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 3630–3638. External Links: Link Cited by: §5.1, §5.2, §5.3.