One-Shot Learning using Mixture of Variational Autoencoders: a Generalization Learning approach

by   Decebal Constantin Mocanu, et al.
TU Eindhoven

Deep learning, even if it is very successful nowadays, traditionally needs very large amounts of labeled data to perform excellent on the classification task. In an attempt to solve this problem, the one-shot learning paradigm, which makes use of just one labeled sample per class and prior knowledge, becomes increasingly important. In this paper, we propose a new one-shot learning method, dubbed MoVAE (Mixture of Variational AutoEncoders), to perform classification. Complementary to prior studies, MoVAE represents a shift of paradigm in comparison with the usual one-shot learning methods, as it does not use any prior knowledge. Instead, it starts from zero knowledge and one labeled sample per class. Afterward, by using unlabeled data and the generalization learning concept (in a way, more as humans do), it is capable to gradually improve by itself its performance. Even more, if there are no unlabeled data available MoVAE can still perform well in one-shot learning classification. We demonstrate empirically the efficiency of our proposed approach on three datasets, i.e. the handwritten digits (MNIST), fashion products (Fashion-MNIST), and handwritten characters (Omniglot), showing that MoVAE outperforms state-of-the-art one-shot learning algorithms.



There are no comments yet.


page 5

page 6


Few Shot Learning with Simplex

Deep learning has made remarkable achievement in many fields. However, l...

One-shot Learning with Absolute Generalization

One-shot learning is proposed to make a pretrained classifier workable o...

Uncertainty in Multitask Transfer Learning

Using variational Bayes neural networks, we develop an algorithm capable...

Transductive Maximum Margin Classifier for Few-Shot Learning

Few-shot learning aims to train a classifier that can generalize well wh...

Infinite Mixture Prototypes for Few-Shot Learning

We propose infinite mixture prototypes to adaptively represent both simp...

ZeroBERTo – Leveraging Zero-Shot Text Classification by Topic Modeling

Traditional text classification approaches often require a good amount o...

One-shot learning for acoustic identification of bird species in non-stationary environments

This work introduces the one-shot learning paradigm in the computational...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

Object recognition is an important problem, and it has many applications, e.g. computer vision

Dalal and Triggs (2005); Agrawal et al. (2015); Ouyang et al. (2015); Thewlis et al. (2017), robotics Loncomilla et al. (2016) and healthcare Litjens et al. (2017)

. Traditional solutions use classifiers built on large amounts of data. In a time with more and more unlabeled data, manually labeling of all these data is costly, time consuming, and inefficient. Hence, the one-shot learning paradigm becomes increasingly important. The aims of this paradigm is to improve the generalization capabilities of the learning models (or algorithms) in such a way that they are capable to achieve a very good performance by having just one labeled sample per class or (at maximum) few labeled samples. To achieve this, usually the state-of-the-art one-shot learning algorithms make use of prior knowledge and large amounts of unlabeled data. Even if serious progress have been made in the last period on this paradigm, still its algorithms are usually immature and in an incipient phase. Thus, there is place for many improvements. For instance, MNIST digits dataset may be considered to be overused and very simple by the scientific community nowadays. Yet, up to our best knowledge, the maximum classification accuracy achieved on MNIST by one-shot learning algorithms with one labeled sample per class (1-shot) is just about 72%.

In this paper, we address the above problem, and we propose a new one-shot learning classification method, dubbed Mixture of Variational Autoencoders (MoVAE). Contrary to the state-of-the-art one-shot learning methods, MoVAE does not need at all any prior knowledge. In fact, it complements these methods. It starts from zero knowledge and one (or few) labeled samples per class, and then it gradually learns to generalize its knowledge using the generalization learning concept Waterman (1970)

. Also, by opposite to the usual direction in artificial neural networks, MoVAE is not an unitary neural network. In fact, it is composed by many Variational Autoencoders (VAEs), each one learning the distribution of a class. Thus, MoVAE can be a good example of collective intelligence. Each VAE took separately can not perform classification, but all of them acting together, are able to learn and classify objects very well.

To assess the performance of our proposed method, we performed empirical studies on three different types of object recognition problems, i.e. digits recognition (MNIST dataset), fashion products classification (Fashion-MNIST dataset Xiao et al. (2017)), and handwritten characters recognition (Omniglot dataset Lake et al. (2015a)). Aiming to rise the one-shot learning performance, we analyzed the MoVAE behavior in different settings. To the end, we proved its advantages in comparison with state-of-the-art.

The remainder of this paper is organized as follows. Section 2 briefly presents some background information on one-shot learning and Variational Autoencoders Kingma and Welling (2013) for the benefit of the non-specialist reader. Section 3 introduces our proposed methods. Section 4 details the experiments performed and the results obtained, while Section 5 concludes the paper.

2. Background

2.1. One-Shot learning

In a time when we have access to big unlabeled datasets, learning from one or just few examples is essential. By the opposite, the most advanced machine learning methods use large labeled datasets in order to learn to classify useful object representations. However, in the last decade, starting with the work of

Fei-Fei et al. (2006), there are many approaches which propose solutions for one-shot learning, e.g. Miller et al. (2000); Lake et al. (2015b); Koch et al. (2015); Hariharan and Girshick (2016); Vinyals et al. (2016). As deep learning became more successfully, in the last few years, a bridge between one-shot learning and deep learning has been developed. The Siamese Net Koch et al. (2015) was used to learn useful representations and a distance metric to say whether a test image and a given image belong to the same class or not. Further on, Santoro et al. proposed a memory-augmented neural network in Santoro et al. (2016) to rapidly assimilate new data and by leveraging this data to make accurate predictions after seeing only just few labeled samples. Maybe one of the most advanced one-shot learning models, i.e Matching Nets, has been proposed in Vinyals et al. (2016)

and employs ideas from both, neural networks with external memories and metric learning. Even though there are many one-shot learning models now, it is interesting to mention here also a new direction which combines reinforcement learning with one-shot learning 

Woodward and Finn (2017).

For clarity, we mention that independently of the used model and dataset, there are two commonly utilized terms in one-shot learning. We will define them here and use them further throughout the paper. These terms are: (1) N-way classification, which means that N classes from the dataset are considered to perform one-shot learning. Usually N is smaller than the total number of classes in the dataset, as the remaining ones are used to build prior knowledge; and (2) k-shot learning which means that from each of the N classes, just k labeled samples are used for one-shot learning. If then one-shot learning is performed, and if then, in fact, few-shot learning is performed.

2.2. Variational auto-encoders (VAE)

Based upon the auto-encodes procedure, variational auto-encoders Kingma and Welling (2013) are providing to be powerful generative models used to reconstruct an output from an input. Their generalization capabilities arise from the adding of a sampling layer between encoder and decoder. Many variations of VAE appeared latter on, such as hierarchical nonparametric variational autoencoders, which combines Bayesian nonparametric priors with VAEs Goyal et al. (2017).

Let us consider our observable variable, and a latent variable (continuous). Thus a typical VAE model follows the next steps. Initially, a latent variable model is constructed in order to learn a mapping from some latent variable to an input distribution, such that by training a neural network, where

is a joint probability density function,

, over the network parameters, , such that . Thus,


where .

2.2.1. Variational inference

Having a relatively similar idea with the Helmholtz machine Dayan et al. (1995) or wake-sleep algorithm Hinton et al. (1995), in VAE models Kingma and Welling (2013); Rezende et al. (2014) the true posterior is aproximated using variational parameters, . Given the intractable posterior , the VAE introduce an inference model (parametrized with another neural network) that learns to approximate the posterior by optimizing the variational lower bound, such that . Hence, the objective function is a sum over the reconstruction error and the regularization terms, given by


The first term in equation 2

is the Kullback-Leibler divergence between the inference model,

, and the posterior distribution Kullback (1959). Further on we can consider the variational lover bound as since the Kullback-Leibler divergence is always positive.

2.2.2. Reparametrization trick

There are so far two types of connections, the top-down connections implementing the generative model and the bottom-up connections implementing the inference model. Finally a reparametrization trick is used in the VAE models by adding a noise signal, , into the latent variable.

VAE simultaneously train both the generative model and the inference model

by optimizing the variational bound using gradient backpropagation. Horewer, a complementary approach to variational inference is the Markov Chain Monte Carlo (MCMC) method, as detailed for example in

Salimans et al. (2015).

3. Mixture of Variational AutoEncoders

In this section, we present the details of our proposed method, dubbed Mixture of Variational AutoEncoders (MoVAE). We start by giving its intuition. Then we continue by describing its technical details, and its inference procedure.

3.1. Intuition

The intuition behind MoVAE is simple and it is inspired by human learning processes. People, when they learn new concepts, they do not manage too well to deal with large amounts of labeled data, but they are often extremely efficient to generalize across various conditions just from one example. Sometimes, they make use of prior acquired knowledge, and sometimes not. They start just from one example and gradually add new representations of that example (or situation) to its default category using generalization Gluck et al. (2011). At a different scale, the learning concept evolved through the human world into a collective intelligence behavior. The advances of human society were mainly made, not by super-humans, but by many humans, connected between them in a social network, sharing a set of values, and working together for a common goal. Moreover, a human is far to be one of the strongest animal in the world. In fact, it is quite weak, but humans collaborative way of being and personal specialization made from the human race one of the most successful in the word Harari (2015).

Keeping the proportion, by analogy, we argue that in machine learning, we should not search for the most powerful model possible, but to create many specialized models, each being capable of doing well its specialized task. Then, these models working together will be able to fulfill a common goal, inaccessible for a singular model. In a way, in artificial intelligence, this approach is followed by ensembles and swarm intelligence, with the difference that each particle or ensemble could do a better or worse job on the common task, while in what we propose next, one singular model would achieve nothing.

These being said, and knowing that a Variational Autoencoder can represent very well a data distribution, in this paper, we propose to build a Mixture of Variational Autoenconders (MoVAE) to perform classification. In the specific case of classification, each VAE of the mixture will be very specialized and will learn the distribution of just one class by being trained on samples belonging just to its specific class. Thus, after the learning phase, our assumption is that each VAE will reconstruct very well unseen images belonging to its encoded class, but if images belonging to other classes will be reconstructed through it, then their reconstructed version will be not so good. And here come the trick of cooperative inference. Each VAE model is not able to discriminate if a given image belongs to its encoded class if it looks just of that image reconstructed version, but the mixture of VAEs it is. If we pass the same image through all the VAEs belonging to the mixture then we obtained a reconstructed version of the original image for any VAE. Then the class of the original image is given by the VAE which obtains the best reconstruction of the original image. Moreover, our assumption is that our proposed approach does not need many labeled images to learn well the class distributions. In fact, it can use just one labeled sample per class to encode in a decent manner the corresponding class in each VAE. Then, by using generalization learning and considering unlabeled data it will be able to gradually increase the quality of the encoded distributions, being capable to improve by itself its discriminative capabilities, as described next.

3.2. Algorithm

MoVAE method is briefly described in Algorithm 1. It starts by initializing its two specific meta-parameters ( which represents the number of unlabeled data samples considered at each generalization iteration and the classes of the dataset) and the other meta-parameters characteristic to any VAE model (i.e. Algorithm 1, Line 1). Then, it creates a VAE model for each class , which is trained on a small amount of randomly selected labeled samples (i.e. Algorithm 1, Lines 3-6). Please note, if we would like MoVAE to perform only pure supervised learning, we have just to consider all labeled samples belonging to the training set in this initialization phase and then we can stop the procedure here.

In the case of one-shot learning, after this initialization phase, we go further to the generalization phase in which a number of recursive generalization iterations is performed (i.e. Algorithm 1, Lines 8-21). At each generalization iteration, MoVAE, firstly, reconstructs all the unconsidered unlabeled samples, computes the distance from these reconstructions to the original samples, and sorts the samples according with this distance (i.e. Algorithm 1, Lines 9-13). Secondly (i.e. Algorithm 1, Lines 14-19), it adds to the training set of each VAE (the VAE model corresponding to class ) a number of samples (where is the total number of classes) which are best reconstructed by VAE (i.e. Algorithm 1, Line 16). At the same time, these samples are badly reconstructed by the others VAEs (i.e. Algorithm 1, Line 17). This is done with the aim of avoiding introducing a propagating error through the generalization iterations, as much as possible. Each of these selected samples are then removed from the set of unconsidered unlabeled samples. Thirdly (i.e. Algorithm 1, Lines 20-21), each VAE belonging to MoVAE is retrained using the new obtained .

1:Initialize meta-parameters, e.g. ,
2:%%initialize all VAEs belonging to MoVAE
3:for each class of the data set do
4:      1 (or few) random samples
5:      VAE MoVAE
6:     train VAE with
7:%%generalization iterations
8:for a specific amount of generalization iterations do
9:     for each VAE MoVAE do
10:         for all unconsidered unlabeled samples do
11:               by VAE
12:               the distance          
13:          and according with      
14:     for each VAE MoVAE do
15:         for a number of samples do
16:               with )
17:              if   then
19:                   from , , and                             
20:     for each VAE MoVAE do
21:          VAE with      
Algorithm 1 MoVAE schematic algorithm.

3.3. Inference

After the MoVAE model is trained, for inference, a very simple metric can be used. The basic idea is that the estimated class for a specific image is given by that VAE

which is capable to obtain the best reconstructed image. Thus, for a given image the following formula can be used:

Figure 1.

One-shot learning. MoVAE performance on the test set of the MNIST dataset during training without data augmentation (left) and with data augmentation (right). At each generalization step 3000 previously unseen unlabeled samples were taken into consideration. The straight lines represent the mean values, while the shadowed areas represent the standard deviation.

To assess the distance between the image and its reconstructed version , more metrics are suitable, but we found out that Pearson Correlation Coefficient is a very good one, being fast and accurate. Moreover, it is good mentioning that during all phases (especially in the initialization phase), techniques such as data augmentation may be used to increase artificially the number of unlabeled or labeled samples, as we show further in the Experimental section.

4. Experiments and results

We assessed the performance of the proposed method on three datasets, proceeding in a step wise fashion. Firstly, we have evaluated MoVAE for pure supervised learning tasks (all available labeled training data were used) in Section 4.1. Secondly, in Section 4.2 we performed experiments in an one-shot semi-supervised learning context. Thirdly, the final evaluation of MoVAE has been done in a pure one-shot learning fashion in Section 4.3.

4.1. Purely supervised learning

As MoVAE is also a new classification model, first we have assessed if MoVAE is a good classifier for pure supervised learning tasks.


In this set of experiments and in the next one (Section 4.2) we have considered two datasets: (1) the MNIST dataset111 Last visit on 25 September 2017. of handwritten digits on which the goal is to perform digits recognition, and (2) Fashion-MNIST Xiao et al. (2017) dataset, released in August 2017 as an alternative to MNIST, on which the goal is to perform fashion products recognition. As MNIST digits dataset is widely known, further on we just briefly describe the Fashion-MNIST dataset Xiao et al. (2017). Similarly, with MNIST it has 60000 training images, 10000 testing images - each image being in a gray-scale matrix of size 28x28. The images are split equally in 10 classes. Differently from MNIST, the images are not digits. Instead they represent fashion products collected from Zalando website and processed to be in the same format as MNIST. The classes of Fashion-MNIST are considerably more difficult than the ones of MNIST as suggested by their names: (0) T-Shirt/Top, (1) Trouser, (2) Pullover, (3) Dress, (4) Coat, (5) Sandals, (6) Shirt, (7) Sneaker, (8) Bag, and (9) Ankle boots. On both datasets, we have used the standard training set for training the models and the standard test set to evaluate them.

Implementation and evaluation details.

MoVAE has been implemented with the help of the Keras library 

Chollet (2015). In all the experiments performed in this set and in the next one (Section 4.2), for a fair and easy comparison, we have used MoVAE models with the same architectures. More exactly, each MoVAE had 10 VAEs, one for each class. Each VAE had 784 input dimensions, 256 intermediate dimensions, and 50 latent dimensions, and dense layers, following the standard VAE example from the Keras library Chollet (2015)

. At each iteration, the MoVAE training was done using the RMSProp optimizer 

Duchi et al. (2011)

for 40 epochs. The metric used to measure the similarity between an original image and its reconstructed version was Pearson Correlation Coefficient (PCC). In this set of experiments and in the next one (Section 4.2), we have evaluated MoVAE against the default Convolutional Neural Network (CNN) offered as an example in the Keras library

222 Last visit on 9 September 2017., and state-of-the-art results for that specific task.

In this set of experiments we used the complete labeled training set of MNIST and Fashion-MNIST to train models. As for this goal the generalization iterations phase of MoVAE is not useful (lines 7-21 of Algorithm 1), we set the number of generalization iterations to 0 to skip it. We did not performed any technique to try improving the results, such as data augmentation, and we evaluate it on the standard test sets of both datasets.

Table 1 presents the results. We may observe that on both datasets, MoVAE models achieves good accuracies, very close to the ones obtained by the CNN models. In fact, on the MNIST dataset a classification accuracy of 97.85% is over or, at least, on par with the accuracies obtained by many state-of-the-art machine learning models. The Fashion-MNIST dataset is very new, so no many results are reported yet in the literature, but still MoVAE achieves a classification accuracy which outperforms most of the models reported as benchmark in the launching paper of the dataset Xiao et al. (2017)

. For instance, gradient boosting achieves in average 84% accuracy. We have good reasons to believe that MoVAE accuracy could be easily improved with fine-tuning, but our goal in this set of experiments was just to prove that MoVAE is a good classifier in a pure supervised learning fashion. This being said, further on we present the most interesting set of experiments, in which we analyze MoVAE performance in the one-shot semi-supervised learning context.

Model MNIST Fashion-MNIST
Accuracy [%] Accuracy [%]
MoVAE (ours) 97.9 86.9
CNN 98.9 92.3
Table 1. Purely supervised learning - MoVAE and CNN classification accuracy when the complete labeled training sets of MNIST and Fashion-MNIST were used to train them.

4.2. One-shot semi-supervised learning

In this set of experiments, we have evaluated MoVAE in a more difficult setting. We considered just 1, 5, and 10 randomly chosen labeled samples per class for each dataset (from here comes the one-shot learning part of the section name). All the other samples belonging to the training sets were used as unlabeled data (from here comes the semi-supervised learning part of the section name). We have used the same datasets and implementations as in the previous set of experiments (Section 4.1), with the exception that now we have used also the generalization iterations phase of Algorithm 1. The amount of unlabeled data selected by MoVAE at each generalization iteration was set to 3000 samples.

For each of the three situations (1-shot, 5-shot, and 10-shot), we have repeated the experiments 10 times and we present the averaged results with mean and standard deviation.

Figure 2. One-shot semi-supervised learning - Illustration of MoVAE behavior on the MNIST digits dataset. The first row ("original") shows the original image given to the model, while the following rows show the reconstructed image by each composing VAE (one per class - column) of MoVAE during the generalization iterations. The white square from the first row shows the truth class, while the gray squares from the other rows shows the class predicted by MoVAE at that specific generalization iteration.
Figure 3. One-shot semi-supervised learning - Illustration of MoVAE behavior on the Fashion-MNIST dataset. The first row ("original") shows the original image given to the model, while the following rows show the reconstructed image by each composing VAE (one per class - column) of MoVAE during the generalization iterations. The white square from the first row shows the truth class, while the gray squares from the other rows shows the class predicted by MoVAE at that specific generalization iteration.

4.2.1. MNIST digits

Same as before, first we evaluated MoVAE on the MNIST dataset without using data augmentation techniques. Figure 1 (left) reflects the evolution of MoVAE accuracy performance and its generalization capabilities while new previously unseen unlabeled samples are taken into consideration. It is interesting to see that in the case when just 1 random chosen labeled sample is considered per class the MoVAE model initially achieves about 45% mean accuracy. Moreover, during the generalization phase, the MoVAE model is capable to improve itself and to reach about 69% mean accuracy. Similarly, in the cases where 5 and 10 randomly chosen labeled samples per class are used, MoVAEs show also impressionable generalization capabilities reaching to the end over 90% mean accuracy. The small standard deviations reflect that MoVAE is a stable model, independently of what labeled samples are given by random chance. In fact, these results would be suffice to overpass state-of-the-art in one-shot learning on the MNIST dataset, as reported in Table 2, but further on we show that they can be improved even more, by using data augmentation techniques.

Model Labeled Data Prior Unlabeled MNIST Fashion-MNIST
samples/ Augmen- Knowledge Data Accuracy [%] Accuracy [%]
class [#] tation
MoVAE (ours) 1-shot no no yes 69.66.5 -
MoVAE (ours) 1-shot yes no yes 91.14.7 61.62.8
CNN 1-shot no no no 17.43.5 -
CNN 1-shot yes no no 22.13.4 21.34.3
CPM Wong and Yuille (2015) 1-shot - yes no 68.8 -
Siamese Net Koch et al. (2015) 1-shot - yes no 70.3 -
Matching Nets Vinyals et al. (2016) 1-shot - yes no 72.0 -
MoVAE (ours) 5-shot no no yes 90.41.6 -
MoVAE (ours) 5-shot yes no yes 94.50.6 66.51.7
CNN 5-shot no no no 24.35.4 -
CNN 5-shot yes no no 28.15.2 28.24.7
CPM Wong and Yuille (2015) 5-shot - yes no 83.8 -
MoVAE (ours) 10-shot no no yes 93.11.1 -
MoVAE (ours) 10-shot yes no yes 94.90.4 70.51.9
CNN 10-shot no no no 33.15.1 -
CNN 10-shot yes no no 47.76.6 36.65.4
CPM Wong and Yuille (2015) 10-shot - yes no 88.0 -
Table 2. One-shot semi-supervised learning - Classification accuracy of MoVAE against baseline CNN and state-of-the-art using 1, 5, and 10 labeled samples per class on the MNIST and Fashion-MNIST datasets.
MNIST digits with data augmentation

More exactly, on the MNIST dataset we have used rotations and small horizontal/vertical shifting. The reason of using data augmentation lays in the wish of creating an initial larger pool of samples (i.e. 500 samples) starting from the 1, 5, or 10 labeled samples considered. Figure 1 (right) shows that this is a good approach, as MoVAE with just 1 labeled sample per class is capable to achieve an amazing performance of about 91% mean accuracy on the MNIST dataset, while state-of-the-art best results, up to our best knowledge, report about 72% accuracy for this scenario. When 5 or 10 random chosen labeled samples are used per class the overall performance increase up to 95% mean accuracy, this being already very close to the results obtained when the full training set is used as labeled data by MoVAE or other classification models. It worths to be highlighted that with data augmentation the standard deviations become even smaller showing an increasing stability of the MoVAE models.

To understand better MoVAE behavior during learning, in Figure 2 we illustrate how each of its composing VAEs is capable to reconstruct a given digit during the generalization learning phase. It may be observed that initially the VAE corresponding to digit 2 of MoVAE is not capable to reconstruct well the given image (i.e. a 2) and in consequence MoVAE fails to classify the given image correctly. Hence, after MoVAE considers 15000 unlabeled samples, it starts generalizing quite well and its corresponding VAE for digit 2 reconstructs the given image well, yielding a correct classification. It is interesting to observe that in this case the other VAEs belonging to the model tend to transform the 2 in their corresponding digit.

4.2.2. Fashion-MNIST

Knowing that data augmentation is a good technique for MoVAE, further on we have used it also on the Fashion-MNIST dataset. More exactly, we have used horizontal flips and a very small zooming. Up to now, there are no results in the literature on the Fashion-MNIST dataset for one-shot learning. Still, our comparisons with the baseline CNN from Table 2 show that MoVAE is a good model for one-shot learning. To understand better the behavior of our proposed model on this dataset, in Figure 3 we give two illustrative examples for MoVAE which uses just 1 labeled sample per class: the reconstruction of a sneaker (left) and the reconstruction of a shirt (right). In the sneaker case the situation is quite simple (all other classes having quite a different shape) and MoVAE is capable to classify it correctly from start without considering any unlabeled data. In the shirt case, the model is bouncing between classifying the given image as a shirt, as a coat, or as a pullover (these three classes being quite similar), but after seeing 30000 unlabeled samples - illustrating the power of the unlabeled data and generalization learning - MoVAE classifies correctly the shirt.

4.3. One-shot learning

In the last set of experiments, we have addressed a pure one-shot learning problem. Herein, we did not consider at all unlabeled data, thus we set the number of generalization iterations to 0, ignoring the lines 7-21 of the Algorithm 1.


We evaluated MoVAE on the Omniglot dataset Lake et al. (2015a), a widely used dataset for assessing novel one-shot learning algorithms. Omniglot has in total 50 different alphabets, totalizing 1623 characters. Each character being hand drawn by 20 different people. We mention that the original images have 105x105 pixels and in our experiments we rescale them to 28x28 pixels. We have considered two scenarios, a 5-way and a 1623-way classification problem, with 1 (1-shot) and 5 (5-shot) labeled samples per class (character). The labeled samples from each class were chosen randomly and they were used as training data, while the other samples belonging to same class (19 and 15 samples, respectively) were used as testing data.

In all the experiments performed, we have used data augmentation. By performing a light grid search, we have arrived at the following settings using the Keras data augmentation engine: random rotations with 20 degrees, horizontal and vertical shifts of 0.2, shear range of 0.2, and a zoom range between 0.8 and 1.2. Starting from the labeled samples that we had and with the previous settings we have generated 10000 augmented images for each class. We used these augmented images in the training process.

Implementation details.

The MoVAE had 5 VAEs for the first scenario, and 1623 VAEs for the second scenario. Each VAE had 784 input dimensions, 784 intermediate dimensions, and 100 latent dimensions, and dense layers. The learning was performed using the RMSProp optimizer Duchi et al. (2011)

for 50 epochs. As in the previous set of experiments, we observed that the CNN model does not perform well with little labeled data, in this set of experiments we have considered a usual baseline method for one-shot learning, i.e. k-Nearest Neighbours (kNN) with 3 neighbours. We trained and tested MoVAE and kNN on exactly the same data.

4.3.1. 5-way Omniglot

Usually, one-shot learning on Omniglot is performed in a 5-way manner. This means that just 5 classes, picked at random from the total available number of classes (the ones that are not used to create prior knowledge), are considered to perform the models evaluation Vinyals et al. (2016). This setting leads to a random guess accuracy of 20%. We repeated the experiments 10 times, and we report the results with mean and standard deviation in Table 3. The results show that MoVAE achieves a very good performance, i.e. over 90% in the case of 1-shot learning, and over 96% in the case of 5-shot learning. In some of the runs MoVAE achieved even 100% accuracy. Moreover, the small standard deviation of MoVAE accuracy in comparison with the one of kNN shows that MoVAE is a very stable method. We highlight that MoVAE obtained an accuracy on par with the state-of-the-art Siamese Net Koch et al. (2015) and Matching Nets Vinyals et al. (2016) methods, while having the advantage of not needed large amounts of labeled data to build prior knowledge.

Model Labeled Data Prior Unlabeled 5-way
samples/ Augmen- Knowledge Data Omniglot
class [#] tation Accuracy [%]
MoVAE (ours) 1-shot yes no no 90.95.4
kNN 1-shot yes no no 72.411.6
Siamese Net Koch et al. (2015) 1-shot yes yes no 96.7
Matching Nets Vinyals et al. (2016) 1-shot yes yes no 98.1
MoVAE (ours) 5-shot yes no no 96.72.8
kNN 5-shot yes no no 85.28.1
Siamese Net Koch et al. (2015) 5-shot yes yes no 98.4
Matching Nets Vinyals et al. (2016) 5-shot yes yes no 98.9
Table 3. One-shot learning - The classification performance of MoVAE and state-of-the-art one-shot learning models on the 5-way Omniglot.

4.3.2. 1623-way Omniglot

One may argue that MoVAE is not a scalable method, as it needs to build a VAE for each class. To show that this is not the case, in our final experiments we have evaluated it by performing an 1623-way one-shot classification on Omniglot. Indeed, this means that we have built 1623 VAEs, one for each class (character). We highlight that there are no results reported in the literature for 1623-way one-shot classification on Omniglot and we propose this problem as a more realistic benchmark for one-shot learning.

Siamese Net Koch et al. (2015) and Matching Nets Vinyals et al. (2016) methods are not capable of performing 1623-way classification on Omniglot as they need all the data from some classes to create the prior knowledge. An exception for this situation would be the use of other datasets to create the prior knowledge. However, to the best of our knowledge, there are no results reported in the literature for this situation. We mention that this is a clear advantage of MoVAE, as it can perform directly 1623-way classification on Omniglot, and we compare its performance against kNN in Table 4. For 1-shot learning MoVAE achieved about 27% accuracy. At a first thought this does not mean too much, but having in mind that the random guess accuracy is just 0.06%, this means that MoVAE accuracy is 463 times higher than the one of random guess. Also, it is 9 times higher than the one of kNN. As in the previous experiments, more labeled data means better results, and in the case of 5-shot classification MoVAE performance rises up to 43%.

In terms of computational time, MoVAE was very fast given the problem considered (i.e. 1623-way Omniglot classification). The total training time for one run was about 4 hours using a standard computer and a M40 GPU. In fact, MoVAE running time was several times faster than the one of kNN.

Model Labeled Data Prior Unlabeled 1623-way
samples/ Augmen- Knowledge Data Omniglot
class [#] tation Accuracy [%]
MoVAE (ours) 1-shot yes no no 27.80.4
kNN 1-shot yes no no 3.10.01
Random guess - - - - 0.06
MoVAE (ours) 5-shot yes no no 43.20.1
kNN 5-shot yes no no 5.90.01
Random guess - - - - 0.06
Table 4. One-shot learning - MoVAE and kNN classification performance on the 1623-way Omniglot (a new challenge for one-shot learning).

4.4. Discussion

To the end, we report that besides MoVAE, we have tried to build a model with similar principles for a different generative model type, i.e. Restricted Boltzmann Machines (RBMs) 

Smolensky (1987). We have tested the mixture model based on RBMs using just the full labeled training set of MNIST and we obtained a mean accuracy of about 95%. This accuracy being lower than the one obtained with MoVAE, we have decided to make the thorough analyze just for MoVAE. Auxiliary, in terms of measuring the distance between an original image and its reconstructed version we have tested also the Root Mean Squared Error (RMSE). However, the accuracy results obtained with RMSE were always smaller with 1-3% than the ones obtained with PCC.

5. Conclusion

In this paper, we introduce MoVAE (Mixture of Variational Autoencoders), taking inspiration from the human world. MoVAE is capable to successfully perform the one-shot learning task, without the need of having prior knowledge, due to its generalization learning capabilities. In terms of performance, it goes much beyond the state-of-the-art one-shot learning methods, being capable to overpass them with more than 25% in accuracy on the MNIST digits recognition task. Besides that, MoVAE achieves very good results (over 60% accuracy) on a much more complicated real-world task, fashion products recognition, using just 1 labeled sample per class.

Even when unlabeled data is unavailable, MoVAE is capable of good performance. Thus, we introduce a new challenge for one-shot learning, 1623-way 1-shot learning classification on Omniglot, on which MoVAE accuracy is 463 times higher than the one of random guess. Also, it is 9 times higher than the one of kNN, while no other state-of-the-art results are reported in this very difficult context. This good behavior opens MoVAE path to real-world applications. Overall, our results hint at a promising avenue of research in the attempt of bringing closer the learning capabilities of machine learning models to the ones of humans in the case of collaborative learning just from few examples.

There are a number of possible future developments following the ideas introduced in this paper. Further on, we enumerate just a few, considering also some limitations of MoVAE: (1) Even MoVAE shows to be very successfully in one-shot learning without prior knowledge, we believe that by making use of prior knowledge, its performance can be further improved; (2) Due to MoVAE generalization capabilities, we believe that if we would combine it with generative replay Mocanu et al. (2016b), it may be adapted to perform on-line learning; (3) It may be that by creating a VAE for each class will generate on some high dimensional datasets too many connections to optimize, and thus we intend to look into techniques to decrease the amount of these connections for MoVAE before training, such as in Mocanu et al. (2016a); and (4) We see benefits for plenty real-world applications.