1 Introduction
One-class classification algorithms are the main approach to detecting anomalies from normal data, but traditional methods scale poorly, in both computational resources and sample efficiency, with the data dimension. Attempting to overcome these problems, previous work proposed using deep neural networks to learn feature representations for one-class classification. While successful in addressing some of the problems, these methods introduced other limitations. One problem is that some of them optimize a metric that is related to, but different from, their true one-class classification objective (e.g., input reconstruction [31]). Other methods require imposing specific structure on the models, like using generative adversarial networks (GANs) [11, 29], or removing biases and restricting the activation functions of the network model [28]. GANs are notoriously hard to optimize [2, 21], and removing biases restricts which functions the models can learn [28]. Furthermore, these methods require thousands of samples from the target class, only to obtain results comparable to those of the traditional baselines [28].
We propose a method that overcomes these problems when we have access to data from related tasks. By using recent insights from the meta-learning community on how to learn to learn from related tasks [9, 32], we show that it is possible to learn feature representations suitable for one-class classification by optimizing an estimator of its classification performance. This not only allows us to optimize the one-class classification objective without any restriction on the model besides differentiability, but also improves the data efficiency of the underlying algorithm. Our method obtains performance similar to that of traditional methods while using 1,000 times less data from the target class, defining a trade-off between the availability of data from related tasks and data from the target class.
For some one-class classification tasks, related tasks do exist, and so our method's requirement is satisfied. For example, in fraud detection, we could use normal activity from other users to create related tasks that consist of identifying whether some activity came from a given user or not, while still employing and optimizing one-class classification.
We describe an instance of our method, Meta Support Vector Data Description (Meta SVDD), obtained by using the Support Vector Data Description (SVDD) [33] as the one-class classification algorithm. We also simplify this method to obtain a one-class classification variant of Prototypical Networks [32], which we call One-class Prototypical Networks. Despite its simplicity, this method obtains performance comparable to Meta SVDD. Our contributions thus are:

We show how to learn a feature representation for one-class classification (Section 2) by defining an estimator for the classification loss of such algorithms (Section 2.1). We also describe how to efficiently backpropagate through the objective when the chosen algorithm is the SVDD method, so we can parametrize the feature representation with deep neural networks (Section 2.2). The efficiency with which our model can be trained is what makes it work in the few-shot setting.
We simplify Meta SVDD by replacing how the center of its hypersphere is computed. Instead of solving a quadratic optimization problem to find the weight of each example in the center's average, we remove the weighting and make the center the result of an unweighted average (Section 3). The resulting One-class Prototypical Networks are simpler, have lower computational complexity, and have more stable training dynamics than Meta SVDD.

After that, we detail how our method conceptually addresses the limitations of previous work (Section 4). We also show that our method has promising empirical performance by adapting two few-shot classification datasets to the one-class classification setting and obtaining results comparable to the state of the art of the many-shot setting (Section 5). Our results indicate that learning the feature representations may compensate for the simplicity of replacing SVDD with feature averaging, and that our approach is a viable way to replace data from the target class with labeled data from related tasks.
2 Meta SVDD
The Support Vector Data Description (SVDD) method [33] computes the hypersphere of minimum volume that contains every point in the training set. The idea is that only points inside the hypersphere belong to the target class, so we minimize the sphere's volume to reduce the chance of including points that do not belong to the target class.
Formally, the radius of the hypersphere centered at $c$ covering the training set $\{x_i\}_{i=1}^{n}$ transformed by the feature map $\phi$ is

$$R(c) = \max_i \, \lVert \phi(x_i) - c \rVert. \qquad (1)$$
The SVDD objective is to find the center that minimizes the radius of such a hypersphere, i.e.

$$c^{*} = \arg\min_c R(c) = \arg\min_c \max_i \, \lVert \phi(x_i) - c \rVert. \qquad (2)$$
Finally, the algorithm determines that a point $x$ belongs to the target class if

$$\lVert \phi(x) - c^{*} \rVert^2 \le R(c^{*})^2. \qquad (3)$$
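To make the three equations concrete, the following is a minimal NumPy sketch of our own (not code from the paper); it uses the identity feature map and a simple mean center for illustration, rather than the radius-minimizing center of Equation 2:

```python
import numpy as np

def svdd_radius(features, center):
    # Equation 1: the smallest radius of a sphere at `center` that covers
    # every training feature.
    return np.max(np.linalg.norm(features - center, axis=1))

def svdd_is_target(feature, center, radius):
    # Equation 3: a point belongs to the target class if its feature lies
    # inside the hypersphere.
    return np.linalg.norm(feature - center) <= radius

# Toy data with the identity feature map phi(x) = x.
train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
center = train.mean(axis=0)  # illustrative; Equation 2 would minimize R
radius = svdd_radius(train, center)
inside = svdd_is_target(np.array([0.3, 0.3]), center, radius)
outside = svdd_is_target(np.array([5.0, 5.0]), center, radius)
```

A point near the training data falls inside the sphere, while a distant one falls outside.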
The SVDD objective, however, does not specify how to optimize the feature representation $\phi$. Previous approaches include using dimensionality reduction with Principal Component Analysis (PCA) [28], using a Gaussian kernel with the kernel trick [33], or using features learned with unsupervised learning methods, like deep belief networks [8]. We take a different approach: our goal is to learn $\phi$ for the task, and we detail how next.
2.1 Meta-learning One-class Classification
Our objective is to learn a $\phi$ such that the minimum volume hypersphere computed by the SVDD covers only the samples from the target class. We therefore divide the learning problem into two stages. In the meta-training stage, we learn the feature representation $\phi$. Once we learn $\phi$, we use it to learn a one-class classifier with the chosen algorithm (in this case, SVDD) from the data of the target class in the training stage. This is illustrated in Figure 1.
Notice how both the decision on unseen inputs (Equation 3) and the hypersphere's center (Equation 2) depend on $\phi$. Perfectly learning $\phi$ in the meta-training stage would map any input distribution into a space that can be correctly classified by SVDD, and would therefore depend neither on the given data nor on what the target class is; those would be learned by the SVDD after transforming the data with $\phi$ in the subsequent training stage. We do not know how to learn $\phi$ perfectly, but the above observation illustrates that we do not need to learn it with data from the target class.
With that observation, we can use the framework of nested learning loops [26] to describe how we propose to learn $\phi$:

Inner loop: Use $\phi$ to transform the inputs, and use SVDD to learn a one-class classification boundary for the resulting features.

Outer loop: Learn $\phi$ from the classification loss obtained with the SVDD.
We use the expected classification loss in the outer loop. With this, we can use data that comes from the same distribution as the data for the target class, but with different classification tasks. To make this definition formal, first let $g_\theta$ be a one-class classification function parametrized by $\theta$, which receives as inputs a subset $X$ of examples from the target class and an example $x$, and outputs the probability that $x$ belongs to the target class. For a suitable classification loss $\ell$, our learning loss is

$$L(\theta) = \mathbb{E}_{T}\Big[\, \mathbb{E}_{(x,\,y)\sim p_T}\big[\, \ell\big(g_\theta(X, x),\, y\big) \big] \Big], \qquad (4)$$
where $y$ is a binary label indicating whether $x$ belongs to the same distribution as $X$ or not. The outer expectation of Equation 4 is over one-class classification tasks $T$, and the inner expectation is over labeled examples for each task (hence the dependency on $T$ of the labeled example distribution $p_T$). Since we have access neither to the task distribution nor to $p_T$, we approximate them with related tasks. Intuitively, the closer the distribution of the tasks we use in the approximation, the better our feature representation.
To compute this approximation in practice, we require access to a labeled multi-class classification dataset $D' = \{(x_j, y_j)\}_j$, where $x_j$ is the $j$-th element and $y_j$ its label, that has a distribution similar to that of our dataset $D$ but is disjoint from it (i.e. none of the elements of $D$ are in $D'$, and none of the elements of $D'$ belong to any of the classes in $D$). Datasets like $D'$ are common in the meta-learning or few-shot learning literature, and their existence is a standard assumption in previous work [32, 9, 20]. However, this restricts the tasks to which our method can be applied to those that have such related data available.
We then create the datasets $D_i$ from $D'$ by separating its elements by class, i.e.

$$D_i = \{\, x_j \mid (x_j, y_j) \in D',\; y_j = i \,\}. \qquad (5)$$
We create the required binary classification tasks by picking some $D_i$ as the data for the target class, and examples from the other sets $D_j$, $j \neq i$, to be the input data from the negative class. Finally, we approximate the expectations in Equation 4 by first sampling mini-batches of these binary classification tasks and then averaging over mini-batches of labeled examples from each of the sampled tasks. By making each sampled $X$ have few examples (e.g. 5 or 20), we not only make our method scalable but also learn $\phi$ for few-shot one-class classification.
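The task construction above can be sketched as follows. This is a hypothetical helper of ours; the paper samples tasks with Torchmeta (see Section 5.2):

```python
import random
from collections import defaultdict

def sample_one_class_task(dataset, n_support=5, n_query=10, rng=random):
    # Split D' into the per-class sets D_i of Equation 5.
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    # Pick a target class i and a different negative class j != i.
    target, negative = rng.sample(sorted(by_class), 2)
    picked = rng.sample(by_class[target], n_support + n_query)
    support = picked[:n_support]    # the input set X
    query_pos = picked[n_support:]  # labeled 1: same class as X
    query_neg = rng.sample(by_class[negative], n_query)  # labeled 0
    query = [(x, 1) for x in query_pos] + [(x, 0) for x in query_neg]
    return support, query

# Toy D': 3 classes with 20 one-dimensional "examples" each.
toy = [(float(i), c) for c in range(3) for i in range(20)]
support, query = sample_one_class_task(toy)
```

Averaging the loss of Equation 4 over mini-batches of such tasks gives the training signal for $\phi$.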
In the next section, we define a model for $g_\theta$ and a way to optimize it over Equation 4.
2.2 Gradient-based Optimization
If we choose $\phi$ to be a neural network, it is possible to optimize it to minimize the loss in Equation 4 with gradient descent, as long as $g_\theta$ and $\ell$ are differentiable and have meaningful gradients, because of the chain rule of calculus. $\ell$ can be the standard binary cross-entropy between the data and model distributions [10].
We also modify the SVDD to satisfy the requirements of the function $g_\theta$: neither the way it computes the hypersphere's center, by solving an optimization problem (Equation 2), nor its hard, binary decisions (Equation 3) are immediately suitable for gradient-based optimization.
To solve the problem of the hard, binary decisions, we adopt the approach of Prototypical Networks [32] and consider the squared distance from the features to the center (the left-hand side of Equation 3) as the input logit of a logistic regression model. Doing this not only solves the problem of uninformative gradients coming from the binary outcomes of SVDD, but also simplifies its implementation in modern automatic differentiation/machine learning software, e.g. PyTorch [24]. As our logits are non-negative, using the sigmoid function to convert logits into probabilities would result in probabilities of at least 0.5 for every input, so we replace it with the hyperbolic tangent and keep the binary cross-entropy objective otherwise unchanged.
As for how to compute the center $c$ in a differentiable manner, we can write it as the weighted average of the input features
$$c = \sum_{i} \alpha_i \,\phi(x_i), \qquad (6)$$
where the weights $\alpha_i$ are the solution of the following quadratic program, which is the dual of the problem defined in Equation 2 [7, 33]:

$$\min_{\alpha} \; \alpha^{\top} K \alpha - \alpha^{\top} \operatorname{diag}(K) \qquad (7)$$

subject to

$$\sum_{i} \alpha_i = 1, \qquad (8)$$

$$\alpha_i \ge 0 \quad \text{for all } i, \qquad (9)$$

where

$$K_{ij} = \phi(x_i)^{\top} \phi(x_j) \qquad (10)$$

is the kernel matrix of $\phi$ for the input set $X$. Despite such quadratic programs having no known analytical solution, and despite the inequality constraints requiring a projection operator to unroll the optimization procedure, the quadratic programming layer [1] can efficiently backpropagate through the program's solution and supports GPU usage.
Still, the quadratic programming layer has complexity $O(n^3)$ in the number $n$ of optimization variables [1]; in the case of Meta SVDD, $n$ is equal to the number of examples in $X$ during training [20]. As the size of the network is constant, this is the overall complexity of performing a training step in the model. Since we keep the number of examples small, 5 to 20, the runtime is dominated by the computation of $\phi$.
In practice, we follow previous work that uses quadratic programming layers [20] and add a small stabilization value $\epsilon$ to the diagonal of the kernel matrix (Equation 10), i.e.

$$K' = K + \epsilon I, \qquad (11)$$

and we use $K'$ in Equation 7. Not adding this stabilization term results in failure to converge in some cases.
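For illustration, the stabilized dual can be solved with a simple projected-gradient loop. This is a sketch of ours; the paper instead uses the differentiable qpth layer so that gradients flow through the solution:

```python
import numpy as np

def project_to_simplex(v):
    # Euclidean projection onto {a : a_i >= 0, sum_i a_i = 1},
    # which enforces constraints 8 and 9.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def svdd_dual_weights(features, steps=2000, lr=0.05, eps=1e-6):
    # Minimize a'Ka - a'diag(K) (objective 7) over the simplex.
    K = features @ features.T + eps * np.eye(len(features))  # Equation 11
    a = np.full(len(features), 1.0 / len(features))
    for _ in range(steps):
        grad = 2.0 * K @ a - np.diag(K)
        a = project_to_simplex(a - lr * grad)
    return a

feats = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
alpha = svdd_dual_weights(feats)
center = alpha @ feats  # Equation 6
```

For these three points the recovered center is the circumcenter of the triangle they form, i.e. the center of their minimum enclosing sphere.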
Using the program defined by objective 7 and constraints 8 and 9 to solve SVDD also allows us to use the kernel trick to make the decision boundary non-linear with regard to $\phi$ [33]. We believe this would not add much: in theory, a deep neural network representing $\phi$ can already handle the non-linearities that map the input to the output.
The SVDD formulation [33] also introduces slack variables to account for outliers in the input set $X$. Since our setting is few-shot one-class classification, we do not believe these would benefit the method's performance, because outliers are unlikely in such small samples. We leave the analysis to confirm or refute these conjectures to future work.
3 One-class Prototypical Networks
The only reason to solve the quadratic program defined by objective 7 and constraints 8 and 9 is to obtain the weights $\alpha_i$ for the features of each example in Equation 6.
We experiment with replacing the weights in Equation 6 by uniform weights $\alpha_i = 1/n$. The center then becomes a simple average of the input features,

$$c = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i), \qquad (12)$$
and we no longer need to solve the quadratic program. The remainder of the method, i.e. its training objective, how tasks are sampled, etc., remains the same. This avoids both the cubic complexity in the forward pass and the destabilization issue altogether. We call this method One-class Prototypical Networks because it can be cast as learning binary Prototypical Networks [32] with a binary cross-entropy objective.
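A sketch of the resulting scoring rule follows; note that reading tanh as the sigmoid replacement, and the direction of the resulting probability, are our assumptions about Section 2.2:

```python
import numpy as np

def ocpn_target_probability(support_features, query_feature):
    # Equation 12: the center is the unweighted mean of the support features.
    center = support_features.mean(axis=0)
    d2 = np.sum((query_feature - center) ** 2)  # squared distance = logit
    # tanh maps the non-negative logit into [0, 1); 1 - tanh(d2) then acts
    # as a probability that decreases with the distance to the center
    # (our assumed reading of the paper's sigmoid replacement).
    return 1.0 - np.tanh(d2)

support = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]])
p_near = ocpn_target_probability(support, np.array([1.0, 1.0]))
p_far = ocpn_target_probability(support, np.array([10.0, 10.0]))
```

A query at the prototype gets probability 1, and the probability decays smoothly with distance, which keeps gradients informative.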
Despite its being a simpler method than Meta SVDD, we conjecture that learning $\phi$ to be a good representation for One-class Prototypical Networks can compensate for the algorithmic simplicity, so that performance does not degrade.
4 Related work
4.1 One-class Classification
The SVDD [33], reviewed in Section 2, is closely related to the One-class Support Vector Machine (One-class SVM) [30]. Whereas the SVDD finds a hypersphere to enclose the input data, the One-class SVM finds a maximum margin hyperplane that separates the inputs from the origin of the coordinate system. Like the SVDD, it can also be formulated as a quadratic program, solved in kernelized form, and extended with slack variables to account for outliers in the input data. In fact, when the chosen kernel is the commonly used Gaussian kernel, the two methods are equivalent [30].
Besides their equivalence in that case, the One-class SVM more generally suffers from the same limitations as the SVDD: it requires explicit feature engineering (i.e. it prescribes no way to formulate $\phi$), and it scales poorly both with the number of samples and with the dimension of the data.
In Section 2, we propose to learn $\phi$ from related tasks, which addresses the feature engineering problem. We also make our method require only a small set $X$ to learn the one-class classification boundary, solving the scalability problem in the number of samples. Finally, by making the feature dimension much smaller than the input dimension, we address the scalability issue regarding the data dimensionality.
The limitations of SVDD and One-class SVMs led to the development of deep approaches to one-class classification; the previous approaches are known as shallow because they do not rely on deep (i.e. multi-layered) neural networks for feature representation.
Most previous approaches that use deep neural networks to represent the input features for downstream use in one-class classification algorithms are trained with a surrogate objective, like the representation learned for input reconstruction with deep autoencoders [12].
Autoencoder methods learn feature representations by requiring the network to reconstruct inputs while preventing it from learning the identity function. These networks are usually divided into an encoder, tasked with converting an input example into an intermediate representation, and a decoder, which takes the representation and must reconstruct the input [10]. The idea is that if the identity function cannot be learned, then the representation has captured semantic information of the input that is sufficient for its partial reconstruction and other tasks. How the identity function is prevented determines the type of autoencoder, and many options exist: reducing the dimension of the intermediate representations or imposing specific distributions on them, adding a regularization term to the model's objective, or corrupting the input with noise [10].
Seeböck et al. [31] train a deep convolutional autoencoder (DCAE) on images from the target class, in their case healthy retinal image data; after that, the decoder is discarded and a One-class SVM is trained on the resulting intermediate representations. The main issue with this approach is that the autoencoder training objective does not ensure that the learned representations are useful for classification.
A related approach is to reuse features from networks trained for multi-class classification. Oza and Patel [23] remove the softmax layer of a Convolutional Neural Network (CNN) [19] trained on the ImageNet dataset [6] to use it as a feature extractor. The authors then train the fully-connected layers of the pre-trained network alongside a new fully-connected layer tasked with discriminating between features from the target class and data sampled from a spherical Gaussian distribution; the convolutional layers are not updated.
AnoGANs [29] are trained as Generative Adversarial Networks [11] to generate samples from the target class. After that, gradient descent is used to find the sample in the noise distribution that best reconstructs the unseen example to be classified, which is equivalent to approximately inverting the generator with optimization. The classification score is the input reconstruction error, which assumes that pixel-level similarity determines membership in the target class.
Like our method, Deep SVDD [28] attempts to learn feature representations for one-class classification from the data, using gradient-based optimization of a neural network model. It directly reduces the volume of a hypersphere containing the features, and in that sense it is a deep version of the original SVDD.
Deep SVDD's algorithm relies on setting the center every few iterations to the mean of the features from a forward pass, instead of computing the minimum bounding sphere. Since the objective is to minimize the volume of the hypersphere containing the features, the algorithm must avoid the pathological solution of outputting a constant function. This requires imposing architectural constraints on the network, the strongest of which is that the network's layers can have no bias terms. The authors also initialize the weights with those of an encoder from a trained autoencoder. Neural network models in our method have no such restrictions and require no pre-training stage.
One advantage of Deep SVDD over our work is that it does not require data from tasks with a similar distribution: it is trained only on the target class data. While this is an advantage, there is a downside. It is not clear to us, from the paper describing Deep SVDD, how to know for how long to train a Deep SVDD model, how to tune its many hyperparameters, or what performance to expect of the method on unseen data. These are usually determined by computing useful metrics on a validation set; for Deep SVDD, however, the optimal value of the objective can be reached by pathological solutions, so a validation set is not informative.
Ruff et al. [28] prove that using certain activation functions or keeping bias terms allows the model to learn the constant function, but they do not prove the converse, i.e. they do not prove that constant functions cannot be learned by the restricted models. The authors also do not analyze which functions are no longer learnable once the model is restricted in this way. For Meta SVDD, on the other hand, the related tasks give predictive measures of metrics of interest, and allow hyperparameter tuning and early stopping.
4.2 Few-shot Learning
The main inspiration for the ideas in our paper, besides Deep SVDD, came from the field of meta-learning, in particular few-shot classification. Prototypical Networks [32] are few-shot classifiers that create class prototypes from a few labeled examples and use the squared distances of an unseen example to the prototypes as the logits to classify it as one of their classes. We first saw the idea of learning the feature representation from similarly distributed tasks, and of using the squared distances, in this paper. Its authors also propose feature averaging as a way to summarize class examples and show its competitive performance despite its simplicity; One-class Prototypical Networks are the one-class variant of this method.
Recently, Lee et al. [20] proposed to learn feature representations for few-shot classification with convex learners, including multi-class Support Vector Machines [4], using gradient-based optimization. Their work is similar to ours in formulating learners as quadratic programs, and in solving these with quadratic programming layers, but it does not address one-class classification.
5 Experiments
5.1 Evaluation Protocol
Our first experiment adapts the evaluation protocol of Deep SVDD [28] to the few-shot setting to compare Meta SVDD with previous work. The original evaluation protocol consists of picking one of the classes of the dataset, training the method on that class's examples in the training set (using the train-test split proposed by the maintainers), and using all the examples in the test set to compute the mean and standard deviation of the area under the ROC curve (AUC) of the trained classifier over 10 repetitions, on the MNIST [18] and CIFAR-10 [16] datasets.
We modified the protocol because there are only 10 classes in these datasets, which is not enough for meta-learning one-class classifiers. This illustrates the trade-off introduced by our approach: despite requiring many fewer examples per class, it requires many more classes. Our modifications only address the number of classes, and we kept the protocol as similar as possible to make the results more comparable.
The first modification is the replacement of CIFAR-10 by the CIFAR-FS dataset [3], a split of CIFAR-100 for few-shot classification in which there is no class overlap between the training, validation, and test sets. CIFAR-FS has 64 classes for training, 16 for validation, and 20 for testing, and each class has 600 images.
No such split is possible for MNIST because it has no fine-grained class structure like that of the CIFAR-10 and CIFAR-100 datasets. Therefore, we use the Omniglot dataset [17], which is considered the “transposed” version of MNIST because it has many classes with few examples, instead of many examples in the 10 classes of MNIST. This dataset consists of 20 images of each of its 1623 handwritten characters, which are usually augmented with rotations by multiples of 90° to obtain 6492 classes [34, 3, 32, 9]. We follow the preprocessing and dataset split proposed by Vinyals et al. [34], resizing the images to 28 × 28 pixels and using 4800 classes for training and 1692 for testing, which is nowadays standard in few-shot classification work [9, 32, 3].
Another modification concerns reporting: since there are only 10 classes in MNIST and CIFAR-10, Deep SVDD [28] reports the AUC metrics for each class. This is feasible for CIFAR-FS, which has 20 test classes, but not for Omniglot, which has 1692. We instead summarize these statistics by presenting the minimum, median, and maximum mean AUC alongside their standard deviations.
The last modification is in the number of elements per class in the test set evaluation. Since there are many classes and we are dealing with few-shot classification, we use only two times the number of examples in $X$ for the target class and for the negative class, e.g. if the task is 5-shot learning, then there are 10 examples from the target class and 10 examples from the negative class for evaluation.
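The per-episode AUC under this modified protocol can be computed with the rank formulation of the AUC. This is our sketch with made-up scores, not data from the paper:

```python
import numpy as np

def auc(scores_pos, scores_neg):
    # AUC = probability that a random positive outscores a random negative,
    # with ties counting half; this rank formulation matches the trapezoidal
    # ROC-curve definition.
    sp = np.asarray(scores_pos)[:, None]
    sn = np.asarray(scores_neg)[None, :]
    return (sp > sn).mean() + 0.5 * (sp == sn).mean()

# A 5-shot episode: 10 target and 10 negative test examples, scored by a
# hypothetical one-class model (higher score = more likely target class).
pos = np.array([0.9, 0.8, 0.8, 0.7, 0.95, 0.6, 0.85, 0.9, 0.75, 0.88])
neg = np.array([0.2, 0.3, 0.1, 0.5, 0.4, 0.65, 0.25, 0.35, 0.15, 0.45])
episode_auc = auc(pos, neg)  # -> 0.99: one pos/neg pair is misordered
```

Averaging this quantity over repetitions gives the mean and standard deviation statistics reported in the tables.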
To better compare the previous methods with ours in the few-shot setting, we evaluate the state-of-the-art method for general deep one-class classification, Deep SVDD [28], in our modified protocol. We run the evaluation protocol on CIFAR-FS using only 5 images for training, and we evaluate it using 10 images from the target class and 10 images from a negative class; we do this 10 times for each pair of the 20 test classes to compute mean and standard deviation statistics for the AUC. We do not do this for Omniglot because it would require training more than 1692 Deep SVDD models.
We also conduct a second experiment, based on the standard few-shot classification experiment, in which we evaluate the mean 5-shot one-class classification accuracy over 10,000 episodes of tasks consisting of 10 examples from the target class and 10 examples from the negative class. We use this experiment to compare with a shallow baseline, PCA followed by a Gaussian kernel One-class SVM [30], and with One-class Prototypical Networks. We use the increased number of episodes to compute 95% confidence intervals, like previous work on few-shot multi-class classification [3, 20].
5.2 Setup
We parametrize $\phi$ with the neural network architecture introduced by Vinyals et al. [34] that is commonly used in other few-shot learning work [9, 32]. There are four convolutional blocks, each with 64 filters and composed of a 3 × 3 kernel, stride 1, “same” 2D convolution, batch normalization [13], followed by max-pooling and ReLU activations [14].
We implemented the neural network using PyTorch [24] and the qpth package [1] for the quadratic programming layer. We also used Scikit-learn [25] and NumPy [22] to compute metrics, to implement the shallow baselines, and for miscellaneous tasks, and Torchmeta [5] to sample mini-batches of tasks, as described in Section 2.1.
We optimize both Meta SVDD and One-class Prototypical Networks with stochastic gradient descent [27] on the objective defined in Section 2.1 and Equation 4, using the Adam optimizer [15]. We use a constant learning rate over mini-batches of 16 tasks, each having a set $X$ with 5 examples and a query set with 10 examples from the target class and 10 examples from a randomly picked negative class. The learning rate value was the first one we tried, so no tuning was required. We picked the task batch size that performed best on the validation set when training halts, among the sizes we tried. We evaluate performance on the validation set with 95% confidence intervals of the model's accuracy over 500 tasks randomly sampled from the validation sets, and we consider a model better than another if the lower bound of its confidence interval is greater, or if its mean is higher when the lower bounds are equal up to 5 decimal places. Early stopping halts training when performance on the validation set does not increase for 10 evaluations in a row, and we keep the model with the highest validation performance. We evaluate the model on the validation set every 100 training steps.
The results for the few-shot experiment with Deep SVDD are obtained by modifying the code made available by the authors (https://github.com/lukasruff/Deep-SVDD-PyTorch), keeping the same hyperparameters.
For the few-shot baseline accuracy experiment with PCA and One-class SVMs with Gaussian kernel, we use the grid search space of prior work [28] for the kernel bandwidth $\gamma$ and the outlier-fraction parameter $\nu$. Furthermore, we give the shallow baseline an advantage by evaluating every parameter combination on the test set and reporting the best result.
5.3 Results
We reproduce the results reported for Deep SVDD [28] and its baselines alongside the results for 5-shot Meta SVDD and One-class Prototypical Networks, and our experiment with 5-shot Deep SVDD, in Table 1. Figure 2 also provides mean AUCs with shaded standard deviations for the results on the CIFAR dataset variants.
While the results from different datasets are not comparable due to the differences in setting and application listed in Section 5.1, they show that the approach has performance similar to the many-shot state of the art in terms of AUC. Figure 2 shows that, when we sort the mean AUCs for CIFAR-10 and CIFAR-FS, the performance from hardest to easiest tasks exhibits similar trends despite these differences, and that the modifications to the protocol are reasonable.
This experiment is evidence that our method is able to reduce the required amount of data from the target class when we have labeled data from related tasks. Note that it is not the objective of our experiments to show that our method has better performance than previous approaches, since they operate in different settings, i.e. few-shot with related tasks and many-shot without them.
The comparison with Deep SVDD in the few-shot scenario gives further evidence of the relevance of our method: both Meta SVDD and One-class Prototypical Networks obtain higher minimum and median AUC than Deep SVDD. Another advantage is that we train once on the training set of Omniglot or CIFAR-FS, and then learn only either the SVDD or the average on each of the sets in the test set. We also obtain these results without any pre-training, and we have established a clear validation procedure to guide hyperparameter tuning and early stopping.
These results also show that we can train a neural network for $\phi$ without architectural restrictions to optimize a one-class classification objective, whereas other methods either require feature engineering, optimize another metric, or impose restrictions on the model architecture to prevent learning trivial functions.
The results of our second experiment, comparing the accuracies of Meta SVDD, a shallow baseline, and One-class Prototypical Networks, are presented in Table 2.
In this experiment, we see an increase from almost random performance to almost perfect performance for both methods when compared to the shallow baseline on Omniglot. Both few-shot one-class classification methods that use related tasks have equivalent performance on Omniglot. The gain is not as significant on CIFAR-FS, but it is of more than 10 percentage points for both methods, which shows they are a marked improvement over the shallow baseline.
Comparing the two proposed methods, we observe the unexpected result that the simpler method, One-class Prototypical Networks, has equivalent accuracy in the Omniglot experiment and better accuracy in the CIFAR-FS experiment. This indicates that learning the feature representation directly from data might matter more than the choice of one-class classification algorithm, and that the increased complexity of using SVDD over simple averaging does not translate into improved performance in this setting.
We also attempted to run this same experiment on the miniImageNet dataset [34], a dataset for few-shot learning using images from the ImageNet dataset [6]. The accuracy on the validation set, however, never rose above 50%. One of the motivations for introducing CIFAR-FS was that there was a gap in challenge between training models on Omniglot and on miniImageNet, and that successfully training models on the latter took hours [3]. Since none of the previous methods attempted ImageNet-level datasets, and the worst performance on the CIFAR-derived datasets is already near random guessing, we leave the problem of training one-class classification algorithms on this dataset open for future work.
Finally, we ran a small variation of the second experiment in which the number of examples in $X$ is greater during deployment than during training, using 10 examples instead of 5. The results stayed within the accuracy confidence intervals of the 5-shot results for both models in this 10-shot deployment scenario.
6 Conclusion
We have described a way to learn feature representations so that one-class classification algorithms can learn, from data, decision boundaries that contain the target class, optimizing an estimator of the true objective. Furthermore, this method works with 5 samples from the target class, with performance similar to the state of the art in the setting where target class data is abundant, and better than the many-shot state-of-the-art method when the latter is employed in the few-shot setting. We also provide an experiment showing that a simpler one-class classification algorithm yields comparable performance, displaying the advantages of learning feature representations directly from data.
One possibility for replacing the main requirement of our method with a less limiting one would be the capability of generating related tasks from unlabeled data. A simple approach in this direction could be using weaker learners to define pseudo-labels for the data. Doing this successfully would significantly increase the number of settings where our method can be used.
Besides the requirement of related tasks, the main limitations of our method are the destabilization of the quadratic programming layer, which we addressed either by adding a stabilization term to the diagonal of the kernel matrix or by simplifying the one-class classification algorithm to use the mean of the features, and its failure to obtain meaningful results on the mini-ImageNet dataset.
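The kernel-diagonal stabilization amounts to a standard ridge term: adding a small multiple of the identity keeps the QP layer's matrix positive definite under single-precision arithmetic. A minimal sketch, where the value of `eps` is illustrative rather than the one used in our experiments:

```python
import torch

def stabilize_kernel(K, eps=1e-4):
    """Add a small ridge term to the kernel diagonal.

    A kernel matrix that is only positive semi-definite (or numerically
    indefinite) can make the QP solver fail; K + eps * I is strictly
    positive definite for any eps > 0.
    """
    n = K.shape[-1]
    return K + eps * torch.eye(n, dtype=K.dtype, device=K.device)
```

The trade-off is a slight bias in the solution, so `eps` should be kept as small as the solver's numerics allow.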
We believe future work should investigate not only solutions to these limitations but also the other questions left open here, such as confirming our hypothesis that introducing slack variables would not benefit Meta SVDD.
Other directions for future work are extending our method to other settings and using one-class classification methods other than SVDD. Tax and Duin [33] also detail a way to incorporate negative examples into the SVDD objective, so we could try learning with this method, minimizing the hypersphere's volume directly instead of converting SVDD into a binary classification problem that uses the unseen examples' distances to the center as logits.
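The distances-to-logits conversion mentioned above can be sketched as follows: negate each example's distance to the center so that nearer points receive larger logits, then train with a binary cross-entropy loss against in-class/out-of-class labels. The names below are illustrative; in a full implementation `bias` would be a learnable threshold.

```python
import torch
import torch.nn.functional as F

def one_class_bce_loss(features, center, labels, bias=0.0):
    """Binary cross-entropy over distance-based logits.

    `logits = bias - distance` maps small distances (near the center)
    to large logits, i.e. high confidence in the target class.
    `labels` is 1.0 for target-class examples and 0.0 otherwise.
    """
    logits = bias - torch.norm(features - center, dim=1)
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```

Under this sketch, the correct labeling (near points marked in-class) should produce a lower loss than the reversed one, which is what the meta-training signal exploits.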
References
 [1] B. Amos and J. Z. Kolter (2017) OptNet: differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), PMLR 70, Sydney, Australia, pp. 136–145.
 [2] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875.
 [3] L. Bertinetto, J. F. Henriques, P. H. S. Torr, and A. Vedaldi (2019) Meta-learning with differentiable closed-form solvers. In 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA.
 [4] C. Cortes and V. Vapnik (1995) Support-vector networks. Machine Learning 20 (3), pp. 273–297.
 [5] T. Deleu, T. Würfl, M. Samiei, J. P. Cohen, and Y. Bengio (2019) Torchmeta: a meta-learning library for PyTorch. Available at: https://github.com/tristandeleu/pytorch-meta.
 [6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, Florida, USA, pp. 248–255.
 [7] J. Elzinga and D. W. Hearn (1972) The minimum covering sphere problem. Management Science 19 (1), pp. 96–104.
 [8] S. M. Erfani, S. Rajasegarar, S. Karunasekera, and C. Leckie (2016) High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognition 58, pp. 121–134.
 [9] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 1126–1135.
 [10] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep Learning. MIT Press. http://www.deeplearningbook.org.
 [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Montreal, Quebec, Canada, pp. 2672–2680.
 [12] G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507.
 [13] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, pp. 448–456.
 [14] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun (2009) What is the best multi-stage architecture for object recognition? In IEEE 12th International Conference on Computer Vision (ICCV), Kyoto, Japan, pp. 2146–2153.
 [15] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA.
 [16] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report.
 [17] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338.
 [18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
 [19] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel (1990) Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pp. 396–404.
 [20] K. Lee, S. Maji, A. Ravichandran, and S. Soatto (2019) Meta-learning with differentiable convex optimization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [21] L. Mescheder, A. Geiger, and S. Nowozin (2018) Which training methods for GANs do actually converge? In Proceedings of the 35th International Conference on Machine Learning (ICML), PMLR 80, Stockholm, Sweden, pp. 3481–3490.
 [22] T. E. Oliphant (2006) NumPy: a guide to NumPy. USA: Trelgol Publishing.
 [23] P. Oza and V. M. Patel (2019) One-class convolutional neural network. IEEE Signal Processing Letters 26 (2), pp. 277–281.
 [24] A. Paszke et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
 [25] F. Pedregosa et al. (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
 [26] A. Raghu, M. Raghu, S. Bengio, and O. Vinyals (2019) Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. arXiv preprint arXiv:1909.09157.
 [27] H. Robbins and S. Monro (1951) A stochastic approximation method. The Annals of Mathematical Statistics 22 (3), pp. 400–407.
 [28] L. Ruff, R. Vandermeulen, N. Görnitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft (2018) Deep one-class classification. In Proceedings of the 35th International Conference on Machine Learning (ICML), PMLR 80, Stockholm, Sweden, pp. 4393–4402.
 [29] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In Information Processing in Medical Imaging (IPMI), Boone, NC, USA, pp. 146–157.
 [30] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (2001) Estimating the support of a high-dimensional distribution. Neural Computation 13 (7), pp. 1443–1471.
 [31] P. Seeböck et al. (2016) Identifying and categorizing anomalies in retinal imaging data. arXiv preprint arXiv:1612.00686.
 [32] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, pp. 4077–4087.
 [33] D. M. J. Tax and R. P. W. Duin (2004) Support vector data description. Machine Learning 54 (1), pp. 45–66.
 [34] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29, Barcelona, Spain, pp. 3630–3638.