Since the explosion of deep neural networks, supervised learning algorithms have demonstrated tremendous success in a multitude of tasks, both high-level ones such as classification and detection and low-level ones such as segmentation. However, the same cannot be said for situations where the model is expected to generalize in the absence of densely available labels. This is unlike humans, who generalize incrementally to novel classes after observing only a few examples. A learning model that improves on unseen examples as it gathers more experience is instrumental in almost all practical problems where annotating labels is either not scalable or not possible due to safety or privacy issues.
Motivated by the aforementioned issues, recent approaches to generalizing learning models range from weakly-supervised learning [25], domain adaptation techniques, data augmentation and incremental learning to task-based few-shot learning [26, 23, 22, 24]. Few-shot classification aims to accommodate novel classes unseen during training by using just a few examples at test time. This is unlike fine-tuning, where the classifier uses a previously learnt representation and tunes its parameters to maximize accuracy over the new data. The problem with fine-tuning is that the classifier would most likely overfit to the new data when it is given as few as five examples. In this work, we take inspiration from humans in the sense that, in order to register a scene, we infer it from different perspectives and are then able to generalize in similar future settings. We present a novel end-to-end differentiable data augmentation technique, inspired by Spatial Transformer Networks, together with an inference technique for single-shot and few-shot learning scenarios. Our contributions are as follows:
We propose a theory for a new data augmentation technique inspired by projective transformations in the 3D pinhole camera model.
We demonstrate an algorithm that estimates the data augmentation parameters in an end-to-end neural network model to generalize under a multi-class k-shot classification framework.
We present an analysis of our proposed algorithm using three recent few-shot learning paradigms and establish the efficiency of our method for one-shot and few-shot learning on two versatile datasets.
The rest of the paper is organized as follows. Section 2 reviews previous work on learning with limited labels and on data augmentation techniques. Section 3 describes our method in detail, followed by Section 4, which presents a detailed analysis and comparison. Finally, Section 5 contains concluding remarks and discusses the scope for future work.
2 Related Work
Few-shot learning: Lake et al. propose a generative model and infer handwritten characters from latent strokes in new characters. Ravi and Larochelle use an LSTM-based meta-learner that captures short-term knowledge particular to a task and long-term knowledge common to all tasks. ProtoNets learn a representation-based metric space and perform classification using the “prototypes” (class means) of each class. Vinyals et al. propose Matching Networks, which learn the mapping between a small labelled support set and an unlabelled example; their training procedure follows the principle that the testing and training conditions should match. Few-shot learning has also been explored in the context of meta-learning by Finn et al., who propose an algorithm for fast adaptation of networks on versatile tasks and demonstrate its effectiveness on one-shot learning tasks. Finn et al. further explore one-shot learning for a robot under the framework of meta-learning combined with imitation learning from visual demonstrations. Meta-learning and transfer learning have also been combined to propose an efficient learning curriculum, named the hard-task meta-batch scheme, that improves convergence and accuracy.
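To make the prototype idea mentioned above concrete, here is a minimal sketch of nearest-prototype classification. It is only an illustration: the embeddings are hand-picked 2-D vectors rather than learned features, the labels and helper names (`class_prototypes`, `classify`) are hypothetical, and plain Euclidean distance stands in for whatever metric a trained model would use.

```python
import math


def class_prototypes(support):
    """Map each label to the mean of its support embeddings."""
    sums, counts = {}, {}
    for label, vec in support:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def classify(query, prototypes):
    """Return the label whose prototype is nearest in Euclidean distance."""
    return min(prototypes, key=lambda lab: math.dist(query, prototypes[lab]))


# Hypothetical 2-way, 2-shot episode with made-up 2-D embeddings.
support_set = [
    ("cat", [1.0, 0.0]), ("cat", [0.8, 0.2]),
    ("dog", [0.0, 1.0]), ("dog", [0.2, 0.8]),
]
prototypes = class_prototypes(support_set)
predicted = classify([0.7, 0.1], prototypes)
print(predicted, prototypes)
```

In a real few-shot learner the vectors would come from an embedding network, and the distances would typically feed a softmax so that the model can be trained with a classification loss.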
Data augmentation: Antoniou et al. were the first to demonstrate improved performance on meta-learning tasks using data augmentation techniques. They do so by generalizing the model to generate class-agnostic data samples. Zhang et al. approach few-shot learning using a unified adversarial generator that is capable of learning sharper boundaries in both supervised and semi-supervised few-shot scenarios. This is facilitated by making the GAN generate fake data that provides additional examples for training. Our method is also based on adversarial training, but instead of directly generating augmented examples for training, we generate the parameters for transforming the input to learn a robust classifier. The closest work to ours is AutoAugment, which uses a search algorithm to find the best policy for augmenting a single sample in a mini-batch. The policies consist of sub-policies, each made up of rotation, translation or shearing functions. However, the method is not tested in few-shot settings, and the use of reinforcement learning can be unstable with an evolving reward function. Our work differs in that, instead of considering these image processing functions independently, we use an adversarial scheme to learn the elements of the complete affine transform matrix, which gives better generalization. We also show that a variant which predicts the parameters independently does not perform as well as our method.
Our model takes inspiration from how humans observe novel objects: they do not just register one “snapshot” of the object, but rather look at it from multiple coherent perspectives. Although this is not directly possible, since we do not have images of the same object taken from different perspectives, we can approximate it by assuming that the object is placed far away from the camera.
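As a rough numerical illustration of why the distant-object assumption is useful, the sketch below projects a far-away 3D point through a pinhole camera before and after a small camera motion, and compares the exact result with a first-order affine approximation. Everything specific here is an assumption made for illustration: unit focal length, the `Rz @ Ry @ Rx` rotation convention with angles `alpha`, `beta`, `gamma` about the x, y and z axes, the sample point, and the sample motion. The approximation keeps only first-order terms in the motion and drops terms quadratic in the image coordinates, which are small when the object is far from the camera relative to its lateral offset.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation(alpha, beta, gamma):
    """Exact rotation Rz(gamma) @ Ry(beta) @ Rx(alpha) (assumed convention)."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    rx = [[1, 0, 0], [0, ca, -sa], [0, sa, ca]]
    ry = [[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]]
    rz = [[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]]
    return matmul(rz, matmul(ry, rx))


def project_exact(point, angles, shift):
    """Rotate and translate a 3D point, then apply a unit-focal pinhole projection."""
    r = rotation(*angles)
    moved = [sum(r[i][k] * point[k] for k in range(3)) + shift[i]
             for i in range(3)]
    return moved[0] / moved[2], moved[1] / moved[2]


def project_affine(point, angles, shift):
    """First-order affine approximation acting on the original image coordinates."""
    (px, py, pz), (alpha, beta, gamma), (tx, ty, tz) = point, angles, shift
    x, y = px / pz, py / pz          # original projection
    scale = 1.0 - tz / pz            # a depth change acts as a mild zoom
    x_new = scale * x - gamma * y + (beta + tx / pz)
    y_new = gamma * x + scale * y + (-alpha + ty / pz)
    return x_new, y_new


POINT = (0.5, -0.3, 20.0)      # hypothetical distant point (X, Y, Z)
ANGLES = (0.02, -0.01, 0.03)   # hypothetical small rotations, in radians
SHIFT = (0.1, -0.05, 0.2)      # hypothetical small translation

exact = project_exact(POINT, ANGLES, SHIFT)
approx = project_affine(POINT, ANGLES, SHIFT)
original = (POINT[0] / POINT[2], POINT[1] / POINT[2])
error = math.hypot(exact[0] - approx[0], exact[1] - approx[1])
displacement = math.hypot(exact[0] - original[0], exact[1] - original[1])
print(f"displacement={displacement:.5f}  approximation error={error:.5f}")
```

With these sample values the affine prediction differs from the exact projection by well under a tenth of the induced displacement; the gap grows as the angles increase or as the point moves closer to the camera.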
Consider a 3D point of an object in homogeneous coordinates and its 2D projection onto the image plane.
Consider a slight change of roll, yaw and pitch, and a small change in translation. Plugging these into the rotation matrix and using a Taylor expansion (ignoring third-order terms and higher) gives an approximate rotation matrix, from which the new point in the image plane corresponding to the original 3D coordinate is obtained.
Since we assume a distant object, and the values involved are relatively small, the denominator of the projection can be simplified using a binomial expansion, so that the new point on the image plane is approximated as a function of the original image coordinates and the motion parameters.
We approximate the distortion in rotation and translation using an affine transform of the given form, which encourages only slight deviation from the identity transform. The values of the affine parameters can be determined using an adversary that detects the distortions that the model has not generalized to. This is the core idea that forms the basis of generalization to unseen examples. We use Spatial Transformer Networks (STN), which are end-to-end differentiable spatial manipulators. An STN computes the parameters of the spatial manipulation rather than the manipulated image itself, making it easier to learn a few parameters and perform powerful spatial transformations. STNs are generally used as a starting module that outputs a canonical version of an image to be used as input to a classifier. However, we use the STN in an adversarial manner by backpropagating through the cross-entropy loss of the few-shot learner. Learning the trend of these parameters is simpler and quicker than training GANs, which learn the data distribution over entire images in response to a noise signal or other support images. We show that this form of augmentation is more effective than applying standard augmentations such as random rotations, translations and scaling.
At every epoch, the few-shot learner processes a batch of support and query examples. The few-shot network minimizes the classification loss on the query examples given the support examples. The Transformer takes gradients with respect to the support images to maximize the classification loss on the query images. We treat the transformer and the few-shot learner as two parameterized functions operating on a support dataset and a query dataset, so that these two objectives together define a min-max optimization problem.
To make sure that the Transformer does not deviate far from the identity transform, we add a regularization term, weighted by a hyperparameter, that penalizes the deviation of the predicted affine matrix from the identity affine transform. Regularization plays an important role: without it, the STN can morph the images until their features are unrecognizable, thereby maximizing the classification loss while preventing the classifier from learning useful features. Without explicit regularization, the parameters of the affine matrix predicted by the STN also violate the assumption about the magnitudes of the parameters. This does occur in our experiments when the regularization is removed: the accuracy over the validation set decreases because the classifier fails to learn good features during training.
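In symbols, the regularized min-max problem described above could be written as follows. The notation is assumed for illustration and is not taken from the original equations: $T_\theta$ is the transformer, $f_\phi$ the few-shot learner, $S$ and $Q$ the support and query sets, $\mathcal{L}$ the cross-entropy loss on the query set, $A_\theta(s)$ the $2 \times 3$ affine matrix predicted for support image $s$, $I$ the identity affine matrix, and $\lambda \ge 0$ the regularization weight. The squared Frobenius norm is likewise only one plausible choice of penalty.

```latex
\begin{align}
  \min_{\phi}\,\max_{\theta}\;&
    \mathcal{L}\bigl(f_\phi(T_\theta(S)),\, Q\bigr)
    && \text{(unregularized game)} \\
  R(\theta) &= \frac{1}{|S|}\sum_{s \in S}
    \bigl\lVert A_\theta(s) - I \bigr\rVert_F^2
    && \text{(deviation from identity)} \\
  \min_{\phi}\,\max_{\theta}\;&
    \mathcal{L}\bigl(f_\phi(T_\theta(S)),\, Q\bigr) - \lambda\, R(\theta)
    && \text{(regularized game)}
\end{align}
```

Because the penalty is subtracted from the quantity the transformer maximizes, large deviations from the identity cost the adversary, while the learner's minimization is unaffected since $R$ does not depend on $\phi$. Setting $\lambda = 0$ in this form recovers the unregularized game.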
To analyse the effect of adversarial Spatial Transformer Networks, we test our training framework on the Omniglot and miniImageNet datasets. We show that our method is base-model agnostic by testing it on three different few-shot learning frameworks: Prototypical Networks, Matching Networks and Model-Agnostic Meta-Learning (MAML). We observe that all baselines have very high accuracy on the Omniglot dataset, and adding an STN improves the results only marginally. Therefore, we show results for Omniglot only with Prototypical Networks. However, the improvements in accuracy for miniImageNet are significant, and we test our module with all three baselines.
Concerns have been raised about the reproducibility of the Prototypical Networks results. To provide consistent results for all methods, we use a publicly available few-shot learning code base and incorporate our module into it.
In standard classification tasks, the training data is augmented and the validation data is not. We follow the same procedure: during training, we augment the meta-train (or support) examples and do not augment the meta-validation (or query) examples. At test time, the STN is disabled for both support and query examples. To avoid a potential data distribution shift between the support examples encountered during the training and validation phases, we apply dropout to the output of the STN to retain some of the original support images (by randomly selecting images and setting their affine matrix to identity). The dropout value is fixed, and the hyperparameter values are obtained using a coarse grid search on a log scale, followed by a finer grid search on a linear scale within the best interval from the coarse search. The first baseline does not use any data augmentation. The second baseline uses standard data augmentation such as random scaling, translation and rotation. Unlike random data augmentation, our method obtains the transformation parameters from an adversarial STN. The STN outputs the values of rotation, translation and scale, from which the affine matrix is constructed.
The values are bounded using tanh activations and appropriate scaling. The same bounds, defined in terms of the height and width of the images, are used for all experiments.
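The construction described above could look roughly like the following sketch, which bounds raw network outputs with tanh, assembles a 2x3 affine matrix from rotation, translation and scale, and applies the identity "dropout" to a batch of matrices. The specifics are assumptions for illustration only: the bound values (`max_rot`, `max_shift`, `max_scale_dev`), the use of normalized image coordinates for translation, the `1 + deviation` scale parameterization, the similarity-transform layout of the matrix, and all function names are hypothetical.

```python
import math
import random


def bounded(raw, limit):
    """Squash an unbounded network output into [-limit, limit] with tanh."""
    return limit * math.tanh(raw)


def affine_matrix(raw_rot, raw_tx, raw_ty, raw_scale,
                  max_rot=math.pi / 8, max_shift=0.2, max_scale_dev=0.2):
    """Build a 2x3 similarity matrix from raw outputs (all bounds hypothetical)."""
    angle = bounded(raw_rot, max_rot)
    tx = bounded(raw_tx, max_shift)
    ty = bounded(raw_ty, max_shift)
    scale = 1.0 + bounded(raw_scale, max_scale_dev)  # stays close to 1
    c, s = scale * math.cos(angle), scale * math.sin(angle)
    return [[c, -s, tx],
            [s, c, ty]]


IDENTITY = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0]]


def stn_dropout(matrices, drop_prob, rng):
    """Replace each matrix by the identity with probability drop_prob."""
    return [IDENTITY if rng.random() < drop_prob else m for m in matrices]


rng = random.Random(0)
raw_outputs = [
    (0.0, 0.0, 0.0, 0.0),        # zero outputs give the identity transform
    (2.0, -1.0, 0.5, 1.0),       # moderate outputs
    (50.0, 50.0, -50.0, -50.0),  # saturated outputs hit the bounds
]
matrices = [affine_matrix(*raw) for raw in raw_outputs]
kept_all = stn_dropout(matrices, 0.0, rng)
dropped_all = stn_dropout(matrices, 1.0, rng)
mixed = stn_dropout(matrices, 0.5, rng)
```

In a differentiable framework the same matrix would be passed to a grid-sampling layer so that gradients of the classification loss can flow back into the network that produced the raw outputs.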
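Table 1: Omniglot (Prototypical Networks)
| Setting | Baseline | Baseline (with standard aug.) | Ours (without regularization) | Ours |
|---|---|---|---|---|
| 20 way, 5 shot | 98.70% | 98.89% | 94.25% | 98.80% |
| 20 way, 1 shot | 95.9% | 96.09% | 80.70% | 95.97% |
| 5 way, 5 shot | 99.62% | 99.62% | 99.40% | 99.67% |
| 5 way, 1 shot | 98.42% | 98.60% | 96.40% | 98.61% |
Table 2: miniImageNet
| Base model | Setting | Baseline | Baseline (with standard aug.) | Ours (without regularization) | Ours |
|---|---|---|---|---|---|
| ProtoNets | 5 way, 5 shot | 66.6% | 70.2% | 58.8% | 70.4% |
| ProtoNets | 5 way, 1 shot | 51.4% | 49.8% | 36.2% | 52.8% |
| MAML | 5 way, 5 shot | 65.9% | 66.3% | 57.9% | 67.0% |
| MAML | 5 way, 1 shot | 47.3% | 47.3% | 32.1% | 48.2% |
| Matching Nets | 5 way, 5 shot | 59.8% | 61.4% | 47.8% | 62.0% |
| Matching Nets | 5 way, 1 shot | 47.0% | 48.4% | 34.2% | 50.8% |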
The improvements on Prototypical Networks for the Omniglot dataset (Table 1) are not very significant because the baselines already learn features that are general enough to perform well on this easy dataset. However, miniImageNet is a dataset with more variance and requires a classifier to learn complex features to perform well. Our method improves the performance of the base classifiers (by up to 3.8 percentage points in Table 2) without requiring any change to the model architecture, thereby learning better features than those learnt without the adversarial augmentation. As expected, our method fails to generalize in the absence of regularization, as the STN exploits the freedom of choosing the affine matrix by performing transformations that produce images very far from the original data distribution and often degenerate (for example, excessive zooming can reduce an image to a single color). These images hinder the learning of the classifier, and the accuracy drops significantly below that of the baseline method. This reinforces our hypothesis regarding the importance of regularization while estimating the transformation parameters. The baseline with standard data augmentation performs better than the plain baseline in most cases, but the improvement is not consistent (see Table 2: 5 way, 5 shot in ProtoNets and 5 way, 1 shot in MAML).
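As a quick arithmetic check on Table 2, the sketch below recomputes the gain of the full method over the un-augmented baseline for each setting. The numbers are copied from the table; the interpretation of the four numeric columns (baseline, baseline with standard augmentation, unregularized variant, full method) is an assumption based on the surrounding text, and the variable names are hypothetical.

```python
# Accuracy (%) per setting, copied from Table 2. Assumed column order:
# (baseline, baseline + standard aug., unregularized variant, full method).
table2 = {
    ("ProtoNets", "5 way, 5 shot"): (66.6, 70.2, 58.8, 70.4),
    ("ProtoNets", "5 way, 1 shot"): (51.4, 49.8, 36.2, 52.8),
    ("MAML", "5 way, 5 shot"): (65.9, 66.3, 57.9, 67.0),
    ("MAML", "5 way, 1 shot"): (47.3, 47.3, 32.1, 48.2),
    ("Matching Nets", "5 way, 5 shot"): (59.8, 61.4, 47.8, 62.0),
    ("Matching Nets", "5 way, 1 shot"): (47.0, 48.4, 34.2, 50.8),
}


def gain_over_baseline(row):
    """Percentage-point gain of the full method over the plain baseline."""
    baseline, _standard_aug, _unregularized, ours = row
    return round(ours - baseline, 1)


gains = {key: gain_over_baseline(row) for key, row in table2.items()}
largest_gain = max(gains.values())
unregularized_below_baseline = all(row[2] < row[0] for row in table2.values())

for (model, setting), gain in gains.items():
    print(f"{model:14s} {setting:14s} +{gain:.1f}")
```

Under that reading of the columns, every setting shows a positive gain for the full method, and the unregularized variant falls below the baseline in every row.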
In this paper, we introduced MA, a model-agnostic adversarial augmentation technique for few-shot learning. The method is inspired by an approximate model of how humans “cheat” by observing a novel object from various perspectives. We show that this model can be approximated using an affine transform, and that Spatial Transformer Networks fit naturally into the formulation by predicting affine transforms to which the classifier is not robust. Experiments on top of three well-known works (Prototypical Networks, Matching Networks and the MAML framework) show that the method works with both metric-based and meta-learning approaches. Our method performs better than standard augmentations, which raises the question of which augmentations are actually useful for learning robust features; this is an interesting avenue for future work.
- Issue #5: Reproducing Mini-Imagenet Results. Note: https://github.com/jakesnell/prototypical-networks/issues/5
- (2017) Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340.
- (2016) Weakly supervised deep detection networks. pp. 2846–2854.
- (2019) Reproducibility and stability analysis in metric-based few-shot learning. In RML@ICLR.
- (2019) Autoaugment: learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 113–123.
- Model-agnostic meta-learning for fast adaptation of deep networks. Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135.
- (2017) One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905.
- (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025.
- (2018) Repository for few-shot learning machine learning projects. Note: https://github.com/oscarknagg/few-shot/
- (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338.
- (2011) One shot learning of simple visual concepts. In Proceedings of the annual meeting of the cognitive science society, Vol. 33.
- (2011) One shot learning of simple visual concepts. In Proceedings of the annual meeting of the cognitive science society.
- (2018) Issue #2: Can you release detailed configuration?. Note: https://github.com/jakesnell/prototypical-networks/issues/2
- (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440.
- (2015) Visual domain adaptation: a survey of recent advances. IEEE signal processing magazine 32 (3), pp. 53–69.
- (2016) Optimization as a model for few-shot learning.
- (2019) Incremental few-shot learning with attention attractor networks. In Advances in Neural Information Processing Systems, pp. 5276–5286.
- (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99.
- (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252.
- A survey on image data augmentation for deep learning. Journal of Big Data 6 (1), pp. 60.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2017) Prototypical networks for few-shot learning. In Advances in neural information processing systems, pp. 4077–4087.
- (2019) Meta-transfer learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 403–412.
- (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208.
- (2018) A survey on deep transfer learning. In International conference on artificial neural networks, pp. 270–279.
- (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638.
- (2018) Metagan: an adversarial approach to few-shot learning. In Advances in Neural Information Processing Systems, pp. 2365–2374.