Deep learning models have achieved great success in image classification. However, in practical applications, the popularization of deep learning models still suffers from many problems. A prominent problem is that deep learning models need a large number of labeled samples for training, and once the scene changes, a previously trained model may no longer work normally, even though the change of scene does not alter the semantics as perceived by humans. For example, if we train a model to recognize samples with a certain background texture, we may need to retrain the model once the background texture changes to another style. That is, the generalization of the model is limited, which makes it difficult for a model trained on dataset A to work on dataset B. Fortunately, studies on one-shot learning and few-shot learning (FSL) in recent years may provide a solution to this problem.
One-shot learning was first proposed with the aim of learning the characteristics of novel patterns from a few labeled samples and using them for classification. If one labeled sample is taken from each of 20 novel patterns for learning, it is called 20-way one-shot learning. In general, FSL methods can be roughly divided into three categories: (1) applying data augmentation and regularization based on prior knowledge to avoid over-fitting of the model; (2) constraining the hypothesis space based on prior knowledge, including multitask learning, embedding learning and so on; (3) optimizing the search strategy for parameters in the hypothesis space based on prior knowledge.
Most datasets for one-shot learning are divided into two subsets, a training set and a probe set, and none of the patterns in the probe set have ever appeared in the training set. A trained one-shot learning model is expected to learn from a few labeled probe samples and then classify the remaining unlabeled probe samples. Therefore, one-shot learning models are regarded as generalizable: they can work on datasets that have never been seen before, based on limited prior knowledge. If we regard the training set and the probe set as two different datasets A and B, respectively, then a one-shot learning model can be regarded as a general model on both datasets. Most researchers dedicate themselves to improving generalizability by adapting the models themselves. However, few consider the relationship between the distribution of the data and the generalizability of models. That is, the generalizability of a model may depend on whether the datasets support the model to generalize. For example, if datasets A and B are totally different, a model trained on A certainly cannot work on B. But if A and B are semantically identical yet distributed differently, a model trained on A should, from the perspective of human cognition, generalize to B.
In this paper, we define an “absolute generalization” for classification from the perspective of the distribution of the dataset. That is, as long as the distributions of datasets A and B satisfy certain conditions, a model trained on dataset A will be able to work on dataset B. We analyze the distribution of concatenated samples formed by coupling two samples and propose a method to make existing methods absolutely generalizable. The proposed method aims to measure the similarity between a pair of samples without any explicit distance metric. Finally, we compare our method with a baseline produced by siamese networks for verification.
The rest of the paper is organized as follows. In Section II, we review related work. In Section III, we give definitions concerning “absolute generalization” and propose our method to build an absolutely generalizable classifier. Next, we conduct experiments with siamese networks as baselines in Section IV. Finally, we point out the remaining problems and future research directions.
II Related work
II-A Image Classification
Deep learning has achieved great success in image classification. Most of these classifiers essentially fit the data to the labels. Consider the forerunner of convolutional neural networks for recognition, LeNet, which succeeded on the MNIST dataset for classification. LeNet fits the training data to the corresponding one-hot codes, where appropriate optimizers, structures, activation functions and loss functions need to be considered. However, if we simply invert the color of the input images during the probe phase, such classifiers may no longer work normally. Although this simple operation does not disturb human cognition, the classifiers may suffer severely. Evidently, the weights in the hidden layers depend strongly on the distribution of the training data. Once the distribution of the probe data is changed, no matter whether the change alters the semantics of the probe data, the classifier may fail. In a broad sense, we can say that the generalizability of such a classifier is low.
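As a concrete illustration of such a semantics-preserving change, color inversion of a grayscale image is a one-line operation (a minimal numpy sketch; the 28×28 size and the stroke are placeholders):

```python
import numpy as np

def invert_grayscale(images):
    """Invert uint8 grayscale intensities: white foreground <-> black background."""
    return 255 - images

# A white stroke on a black background becomes black-on-white:
img = np.zeros((28, 28), dtype=np.uint8)
img[10:18, 12:16] = 255
flipped = invert_grayscale(img)
```

To a human the digit is unchanged; to a classifier fitted to the original pixel distribution, every input value has moved.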
II-B One-shot learning
From the perspective of generalization, one-shot learning models aim to improve generalizability on novel datasets. We can regard a one-shot labeled sample as a template: one-shot learning models recognize unlabeled samples by predicting whether they belong to the same pattern as the template. Recently, a number of one-shot learning models have been developed. Koch et al. employed deep siamese networks for one-shot learning on the Omniglot dataset. Siamese networks were first introduced in the early 1990s by Bromley and LeCun to solve the signature verification problem, aiming to measure the similarity between a pair of samples. LeCun et al. later employed two weight-sharing subnetworks (usually either CNNs or autoencoders) to extract features of a pair of samples, respectively, as shown in Figure 1. They then used a contrastive energy function containing two terms, one to decrease the energy of identical-pattern pairs and one to increase the energy of different-pattern pairs.
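The contrastive energy can be sketched as follows; this is the general margin-based form from the literature, so the exact exponents and margin used in the cited paper may differ:

```python
import numpy as np

def contrastive_loss(f1, f2, same, margin=1.0):
    """Margin-based contrastive energy over feature pairs: pull
    identical-pattern pairs together, push different-pattern pairs
    at least `margin` apart. `same` is 1 for identical-pattern pairs."""
    d = np.linalg.norm(f1 - f2, axis=1)                 # pairwise energy
    pos = same * d ** 2                                 # same-pattern term
    neg = (1 - same) * np.maximum(margin - d, 0) ** 2   # different-pattern term
    return float(np.mean(pos + neg))
```

Identical features under `same=1` give zero loss; well-separated features under `same=0` also give zero loss once the margin is exceeded.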
III-A Absolute Generalization
The concept of generalization describes how well a model works on the probe set of a dataset. To distinguish it from the generic concept, we define “absolute generalization” for two datasets A and B:
Absolute generalization refers to a model that is trained on dataset A and can be employed on dataset B without extra modification; we then say that the model is absolutely generalizable.
In the above definition, the two datasets are different, which is defined as follows:

Dataset B is distinguished from dataset A if p_A(x) ≠ p_B(x),

where p_A(x) and p_B(x) are the data distributions of datasets A and B, respectively. Note that a model with absolute generalization does not always exist for any pair of datasets. Next, we give the conditions for its existence.
Assume that dataset A contains two patterns, denoted ω_0 and ω_1, respectively, and so does dataset B. Generally, we use maximum a posteriori (MAP) estimation to make decisions in a dataset:

ŷ = argmax_i P(ω_i) p_A(x | ω_i), i = 0, 1,

where P(ω_i) denotes the prior probability, a constant that can be ignored if it equals 0.5, and p_A(x | ω_i) denotes the likelihood distribution of dataset A. Because p_A(x | ω_i) ≠ p_B(x | ω_i) (and some models ignore P(ω_i)), a classifier trained on dataset A cannot be employed on dataset B. However, we can relax the constraints to make “absolute generalization” attainable.
Assume that each observed sample x in dataset A and dataset B is generated by the following mapping:

x = g(z, b),

where g denotes a mapping from the latent space to the sample space, and z and b are two independent latent variables. Accordingly, we convert the data distribution into a latent-variable distribution: since z and b are independent, p(x | ω_i) is determined by p(z | ω_i) and p(b). Because the distribution of b has nothing to do with the patterns, we call p_A(b) the background distribution of dataset A.
A classifier with absolute generalization between the two datasets exists if p_A(z | ω_i) = p_B(z | ω_i), i = 0, 1, even though the background distributions differ. In this case, the classifier is

ŷ = argmax_i P(ω_i) p(z | ω_i).

As long as we can isolate the effect of b when we construct a classifier on dataset A, the classifier is absolutely generalizable and can be applied to dataset B.
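The generative assumption and the resulting condition can be summarized compactly; the symbols g, z, b and the subscripts A, B are our reconstruction of the stripped notation:

```latex
\begin{align*}
  x &= g(z, b), \qquad z \;\text{(pattern)} \;\perp\; b \;\text{(background)},\\
  \hat{y} &= \arg\max_{i}\; P(\omega_i)\, p(z \mid \omega_i), \qquad i = 0, 1,\\
  &\text{absolute generalization exists if } p_A(z \mid \omega_i) = p_B(z \mid \omega_i)\\
  &\text{even though } p_A(b) \neq p_B(b).
\end{align*}
```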
III-B Proposed Method
Denote the sample matrices of datasets A and B as X_A and X_B, respectively, where d denotes the dimension of the samples and n_A and n_B denote the numbers of samples in the two datasets, respectively. Each column of X_A is a sample vector, and so is each column of X_B. The samples of the two patterns in dataset A are denoted X_A^(0) and X_A^(1), respectively, where X_A = [X_A^(0), X_A^(1)].
Consider converting the two-pattern classification problem into distinguishing whether two samples belong to an identical pattern or to different patterns. For dataset A, we concatenate samples belonging to an identical pattern, denoted X_A^s, and concatenate samples belonging to different patterns, denoted X_A^d. For dataset B, we define the analogous X_B^s and X_B^d. We consider that classifiers in the sample space spanned by X_A^s and X_A^d remain valid in the sample space spanned by X_B^s and X_B^d if the two datasets satisfy Definition 3.
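Constructing the identical-pattern and different-pattern concatenated sets can be sketched as follows (a minimal numpy sketch; `make_pairs`, the shapes and the labels are our illustrative choices, not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pairs(x0, x1, n_pairs):
    """Build concatenated samples from two pattern matrices x0, x1 of
    shape (n_i, d). Label 1: both halves drawn from one pattern;
    label 0: one half from each pattern."""
    same, diff = [], []
    for _ in range(n_pairs):
        src = x0 if rng.random() < 0.5 else x1      # pick one pattern
        i, j = rng.integers(len(src), size=2)
        same.append(np.concatenate([src[i], src[j]]))
        k, m = rng.integers(len(x0)), rng.integers(len(x1))
        diff.append(np.concatenate([x0[k], x1[m]]))
    X = np.vstack(same + diff)                      # (2 * n_pairs, 2d)
    y = np.array([1] * n_pairs + [0] * n_pairs)
    return X, y
```

Each concatenated sample lives in a 2d-dimensional space, twice the original sample dimension.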
We give a simple example to explain this observation. Assume that d = 2, so that the dimension of the concatenated samples is 4. As shown in Figure 2, the samples of dataset A are distributed in a 2-dim plane with two patterns, represented by red and blue, respectively. The 2-dim plane is duplicated; to visualize the 4-dim space, we regard the two copies as two basis planes. The two orange hollow circles denote concatenated samples from different patterns, and the two black hollow circles denote concatenated samples from an identical pattern. Note that the two black hollow circles lie on a hyperplane passing through the origin, shown as the blue plane, while the two orange hollow circles lie on opposite sides of this hyperplane. In this example, if the two datasets A and B satisfy Definition 3, the slope of the decision boundary is the same in the two datasets, which confirms that the normal vector of the hyperplane is constant. As long as we solve for the normal vector on dataset A, it can be used directly for classification on dataset B.
We use the MNIST dataset to further explain our method. We concatenate two samples from digits 0 and 1 of MNIST as a new sample and generate the concatenated sample sets. We regard the raw samples as dataset A. Then, we modify the distribution of the raw data without altering its semantics, for example by flipping the image colors, adding noise, or replacing the background texture, and regard each modified dataset as a dataset B. Because datasets A and B have the same semantics, they satisfy Definition 3; because their distributions differ, they are two different datasets according to Definition 2. As shown in Figure 3, we used t-SNE to reduce the dimension of the concatenated samples and used different colors to indicate the different concatenated patterns. For the concatenated samples from the raw data (i.e., dataset A), shown in blue and red, the distribution is similar to our assumption in Figure 2. The concatenated samples from identical patterns are distributed near the same hyperplane as those of the other modified datasets in Figure 3.
III-C Models for Neural Networks
Our method couples two samples together into one new sample by concatenating them along one dimension. In image processing, when employing an MLP we directly concatenate the two flattened image vectors, as shown in Figure 4(a). When employing a CNN, we concatenate the two images along the channel dimension, as shown in Figure 4(b).
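Both input formats are plain numpy concatenations (the 28×28 single-channel size is only an assumed example):

```python
import numpy as np

img_a = np.zeros((1, 28, 28))   # (channels, H, W)
img_b = np.ones((1, 28, 28))

# MLP input: flatten each image, then join the two vectors.
mlp_input = np.concatenate([img_a.ravel(), img_b.ravel()])

# CNN input: stack the two images along the channel axis.
cnn_input = np.concatenate([img_a, img_b], axis=0)
```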
As explained in Section III, the output of our model represents the normalized distance from the concatenated sample to the hyperplane. Thus, the sign of the output indicates which side of the hyperplane the sample lies on. We can also regard the output as the probability that the concatenated sample belongs to the identical-pattern class, if the output is limited to (0, 1). If the outputs indicate distances, mean square error (MSE) is employed as the loss function; if they indicate probabilities, binary cross entropy (BCE) is employed.
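The two loss options can be written as plain functions (a hedged numpy sketch; the function names are ours):

```python
import numpy as np

def mse_loss(pred, target):
    """For outputs read as signed distances to the hyperplane."""
    return float(np.mean((pred - target) ** 2))

def bce_loss(prob, target, eps=1e-7):
    """For outputs squashed to (0, 1) and read as the probability that
    the two halves of the concatenated sample share a pattern."""
    p = np.clip(prob, eps, 1 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))
```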
III-D Comparison with Siamese Networks
The idea of coupling two samples together was first proposed with siamese networks, where the authors employed a distance measurement to describe the similarity between two hidden features extracted by two identical subnetworks, respectively. In our method, by contrast, we regard the concatenated sample as a new sample and analyze its distributional properties, aiming to improve the generalizability of classifiers.
From the perspective of the loss function, siamese networks employ a contrastive loss function based on a p-norm distance. In our method, by contrast, the choice of distance metric is effectively left to the neural network to decide by itself.
The training time and inference time of our model are half those of siamese networks, because our model requires only one forward propagation.
IV-A Experiments on MNIST
Table I lists the ten probe datasets: raw, flipped, salt pepper noise (three densities), Gaussian noise (three variances), style 1 and style 2. Note that “salt pepper noise (0.2)” denotes a salt-pepper noise density of 0.2, and “Gaussian noise (0.5)” denotes a Gaussian noise variance of 0.5; the other entries are analogous.
The MNIST dataset contains 70,000 samples of 10 patterns. Each sample is a single-channel 28×28 image with a black background and white foreground. Because our attention is on “absolute generalization”, we only use the digits 4 and 9 in our experiment. We train our model on the raw dataset and test it on modified samples to validate its performance. For comparison, we employ siamese neural networks (SNN) as a baseline.
For fairness, the same MLP structure is employed in our model and in the SNN. As shown in Figure 5, the whole structure is employed in our model, and the first 4 layers are employed in the SNN as a weight-sharing subnetwork. The image samples are reshaped into vectors at the input layer of the network, and the feature dimension halves with each fully connected layer. Because the output of Tanh lies in (-1, 1), we use Tanh as the activation layers.
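The halving-width Tanh MLP can be sketched as a bare numpy forward pass (a sketch only; the depth, initialization and exact widths of Figure 5 are our assumptions):

```python
import numpy as np

def build_mlp(in_dim, n_layers, rng):
    """Random weights for an MLP whose width halves per fully connected layer."""
    dims = [in_dim]
    for _ in range(n_layers):
        dims.append(max(dims[-1] // 2, 1))
    return [rng.normal(0, 0.1, size=(dims[i], dims[i + 1]))
            for i in range(n_layers)]

def forward(weights, x):
    """Tanh after every layer keeps all activations in (-1, 1)."""
    for W in weights:
        x = np.tanh(x @ W)
    return x

rng = np.random.default_rng(0)
layers = build_mlp(2 * 784, 4, rng)          # input: two flattened 28x28 images
out = forward(layers, np.zeros(2 * 784))
```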
In the training phase, we train the models for 100 epochs. For each epoch, 23,582 pairs of samples are chosen randomly from the MNIST dataset with batch size 256. We use the Adam optimizer with a learning rate of 0.001. In the probe phase, besides the raw probe dataset of MNIST, modified samples are produced with noise or different styles based on the raw probe dataset, and these are regarded as datasets different from MNIST. As shown in Table I, we have 10 different probe datasets: raw data, flipped black and white, salt-pepper noise with various noisy-pixel densities, Gaussian noise with various variances, and various background texture styles. We use the area under the curve (AUC) of the receiver operating characteristic (ROC) and F1-scores to measure classification performance, where the F1-score is the harmonic mean of precision and recall. We run each experiment 10 times and report the mean and standard deviation. As a reference, we use the average Structural SIMilarity (SSIM) to measure the similarity between pairs of samples.
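The probe-set corruptions can be produced with a few numpy helpers (a sketch; the exact noise implementation behind Table I is not specified, so these are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def flip(img):
    """Swap black and white in a uint8 image."""
    return 255 - img

def salt_pepper(img, density):
    """Set a `density` fraction of pixels to 0 or 255 at random."""
    out = img.copy()
    mask = rng.random(img.shape) < density
    out[mask] = rng.choice(np.array([0, 255], dtype=np.uint8), size=int(mask.sum()))
    return out

def gaussian(img, var):
    """Add zero-mean Gaussian noise of the given variance on a [0, 1] scale."""
    noisy = img / 255.0 + rng.normal(0.0, np.sqrt(var), img.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255).astype(np.uint8)
```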
Besides, we use a LeNet trained on digits 4 and 9 of the raw training dataset for comparison with our model. Note that the input of LeNet is a single image, which differs from the SNN and our model; that is, the datasets LeNet uses are different from those of the SNN and our model. Thus, the AUC and F1-score of LeNet serve only as a reference.
As shown in Table I, when using a simple MLP structure, the SNN performs poorly on the probe datasets with salt-pepper noise and with various background texture styles, whereas our model performs better on these probe datasets. LeNet does a good job on the probe datasets with added noise, but it performs poorly on “flipped”, “style 1” and “style 2”. Because the input of our model is a concatenated sample coupling two images, one of which provides prior knowledge while the other is regarded as the probe sample, our model achieves better performance.
IV-B Experiments for Face Identification
We conduct this experiment on the ORL Faces Dataset. The dataset contains 40 patterns of faces with 10 images per pattern. All images are stored in grayscale with an image size of 92×112. For each face pattern, the images were captured at different times, under different lighting, with different facial expressions (eyes open/closed, smiling/not smiling) and facial details (with/without glasses). All images were taken against a dark, uniform background, with frontal faces (some slightly turned).
For fairness, we employ the same convolutional neural network in our model and in the SNN, as shown in Figure 6; the difference between the two models lies in the input and output of the network. Note that we use ReLU and Sigmoid as activation layers rather than Tanh. Because the distributions of concatenated samples become complex when a dataset contains more than two patterns, we simply use 0 and 1 to denote the labels of identical-pattern and different-pattern sample pairs, respectively.
In the experiment, we resize all images to a common size. In the training phase, we use 20 face patterns to train our model and the SNN, training for 100 epochs. For each epoch, 660 pairs of samples are chosen randomly, split evenly between identical-pattern pairs and different-pattern pairs. In the probe phase, we evaluate the models on probe datasets built from the remaining 20 face patterns. Similar to Section IV-A, we produce several modified datasets based on the raw probe dataset.
As shown in Table II, the SNN has a lower recognition rate than our model for probe samples with salt-pepper noise. Overall, our model performs better than the SNN in face identification.
IV-C Experiments on Omniglot for One-Shot Learning
The Omniglot dataset is a classical dataset for one-shot learning, collected by Brenden Lake and his collaborators. The dataset contains handwritten character images from 50 alphabets, ranging from well-established international languages to lesser-known local scripts. All images are divided into a 40-alphabet background set and a 10-alphabet evaluation set, which are used for the training and probe phases, respectively.
For fairness, we employ a similar structure in our model and in the SNN as the reference. In the training phase, no data augmentation is used. In the probe phase, we produce 5 datasets based on the raw probe dataset: “flipped”, salt-pepper noise with density 0.5, Gaussian noise with variance 0.9, and style transformations.
As shown in Table III, our model performs better than the SNN on most probe datasets. However, on the “flipped” dataset, our model performs worse. Theoretically, the distribution of the “flipped” dataset is a mirror of the raw dataset, so it should be classified as accurately as the raw dataset. However, in Sections IV-B and IV-C, the accuracy on the “flipped” dataset is far lower than that on the raw dataset, which may be because the weights of the CNN in our model still depend tightly on the distribution of the background.
In practical applications, we always expect that a trained model can deal with other, similar classification tasks. One-shot learning offers a solution to this multi-task need. However, most researchers focus on improving the classification performance of models; few consider whether the datasets they employ are fit for one-shot learning.
In this paper, we proposed the concept of “absolute generalization” in order to explain what kinds of datasets are fit for one-shot learning. We argued that a classifier with absolute generalizability can be obtained when the datasets satisfy certain conditions, and we proposed a method to build such a classifier. In the method, a new dataset is produced by concatenating two samples of the raw datasets, converting a classification problem into an identity-verification problem, or a similarity-metric problem. The distribution of the new dataset hides a constant hyperplane, which supports an absolutely generalizable classifier.
Because open-source datasets cannot satisfy our conditions, we produced artificial datasets based on them. These artificial datasets are both challenging and practically significant. Experiments showed that the proposed method was superior to the baseline method, confirming that the concerns we raised do affect the baseline method.
However, we found that the proposed method performed poorly when combined with CNNs, so in the future we will continue to study our method on CNNs. Besides, we will try concatenating samples along higher dimensions for few-shot learning.
-  (2019) Infinite mixture prototypes for few-shot learning. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 232–241. Cited by: §I.
-  (1993) Signature verification using a “siamese” time delay neural network. 7 (04), pp. 669–688. Cited by: §II-B, §III-D.
-  (2017) One-shot video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 221–230. Cited by: §I.
-  (2005) Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 1, pp. 539–546. Cited by: §I, §I, §II-B.
-  (2020-06) Few-shot object detection with attention-rpn and multi-relation detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I.
-  (2003) A bayesian approach to unsupervised one-shot learning of object categories. In Proceedings Ninth IEEE International Conference on Computer Vision, pp. 1134–1141. Cited by: §I.
-  (2006) One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28 (4), pp. 594–611. Cited by: §I.
-  (2017) Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400. Cited by: §I.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I.
-  (2020) Review and analysis of zero, one and few shot learning approaches. In Intelligent Systems Design and Applications, A. Abraham, A. K. Cherukuri, P. Melin, and N. Gandhi (Eds.), Cham, pp. 100–112. Cited by: §I.
-  (2019-10) Few-shot object detection via feature reweighting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §I.
-  (2015) Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2. Cited by: §I, §I, §II-B, §IV-C.
-  (2017) Imagenet classification with deep convolutional neural networks. Communications of the ACM 60 (6), pp. 84–90. Cited by: §I.
-  (2011) One shot learning of simple visual concepts. In Proceedings of the annual meeting of the cognitive science society, Vol. 33. Cited by: §IV-C.
-  (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §II-A, §III-D, §IV-A.
-  (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §III-B.
-  (2019) One-shot learning for custom identification tasks; a review. Procedia Manufacturing 38, pp. 186–193. Cited by: §I, §II-B.
-  (2020) Meta pseudo labels. arXiv preprint arXiv:2003.10580. Cited by: §II-A.
-  (1994) Parameterisation of a stochastic model for human face identification. In Proceedings of 1994 IEEE workshop on applications of computer vision, pp. 138–142. Cited by: §IV-B.
-  (2016) Meta-learning with memory-augmented neural networks. In International conference on machine learning, pp. 1842–1850. Cited by: §I.
-  (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. Cited by: §I, §II-B.
-  (2017) Few-shot learning through an information retrieval lens. In Advances in Neural Information Processing Systems, pp. 2255–2265. Cited by: §I.
-  (2016) Matching networks for one shot learning. Advances in neural information processing systems 29, pp. 3630–3638. Cited by: §I.
-  (2020) Generalizing from a few examples: a survey on few-shot learning. ACM Computing Surveys (CSUR) 53 (3), pp. 1–34. Cited by: §I.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §IV-A.
-  (2018) Fine-grained visual categorization using meta-learning optimization with sample selection of auxiliary data. In Proceedings of the european conference on computer vision (ECCV), pp. 233–248. Cited by: §I.
-  (2020) Laplacian regularized few-shot learning. In International Conference on Machine Learning, pp. 11660–11670. Cited by: §I.