Ortho-Shot: Low Displacement Rank Regularization with Data Augmentation for Few-Shot Learning

10/18/2021
by   Uche Osahor, et al.
West Virginia University

In few-shot classification, the primary goal is to learn representations from a few samples that generalize well to novel classes. In this paper, we propose an efficient low displacement rank (LDR) regularization strategy termed Ortho-Shot: a technique that imposes orthogonal regularization on the convolutional layers of a few-shot classifier, based on the doubly block-Toeplitz (DBT) matrix structure. The regularized convolutional layers of the few-shot classifier enhance model generalization and the intra-class feature embeddings that are crucial for few-shot learning. Overfitting is a typical issue for few-shot models; the lack of data diversity inhibits proper model inference, which weakens the classification accuracy of few-shot learners on novel classes. In this regard, we break down the pipeline of the few-shot classifier and establish that support, query and task data augmentation collectively alleviate overfitting in networks. With compelling results, we demonstrate that combining a DBT-based low-rank orthogonal regularizer with data augmentation strategies significantly boosts the performance of a few-shot classifier. We perform our experiments on the miniImageNet, CIFAR-FS and Stanford datasets, with performance gains of about 5% compared to the state of the art.



1 Introduction

The performance of convolutional neural network (CNN) models largely depends on training a network with a large number of labelled instances and a spectrum of visual variations, often in the thousands per class [Krizhevsky2012ImageNetCW]. The cost of labelling these data manually by human annotation, as well as the scarcity of data that captures the complete diversity of a specific class, significantly limits the potential of current vision models. However, the human visual system (HVS) can identify new classes from fewer labelled examples [Kietzmann2019RecurrenceIR, Nayebi2018TaskDrivenCR]; this unique trait of the HVS reveals the need to explore new paradigms that learn to generalize to new classes with a limited amount of labelled data for each novel class. Recently, significant progress has been made towards better solutions using ideas from meta-learning [Oreshkin2018TADAMTD, Rusu2019MetaLearningWL, Ye2018LearningEA, Lee2019MetaLearningWD, Li2019FindingTF, Motiian2017FewShotAD].

Figure 1: The convolution expression Conv(K, X) is converted into a faster DBT matrix-vector representation y = 𝒦x, where 𝒦 is the DBT matrix derived from the kernel K.

Empirically, it has been observed that the convolutional filters learned in deeper layers are highly correlated and redundant [Wang2020OrthogonalCN], resulting in unstable training performance and vanishing gradients. These shortcomings of convolutional neural networks are even more damaging in few-shot classification due to the small data size. The potential pitfalls of such convolutional layers include under-utilization of model capacity, overfitting, vanishing and exploding gradients [Glorot2010UnderstandingTD, Bengio1994LearningLD], growth in saddle points [Dauphin2014IdentifyingAA] and shifts in feature statistics [Ioffe2015BatchNA], which collectively affect model generalization.

The doubly block-Toeplitz (DBT) matrix [Gray2005ToeplitzAC] belongs to the class of low displacement rank (LDR) matrix constructions [Zhao2017TheoreticalPF] that guarantee model reduction and lower computational complexity in neural networks, achieved by regularizing the weight matrices of the network layers. The storage requirement of such a DBT-regularized network is reduced from O(n²) to O(n) parameters, and the computational complexity of applying a layer can be reduced from O(n²) to O(n log n), due to the fast matrix-vector multiplication property of LDR structured matrices, as shown in Figure 2. It is also well established [Li2017LowRankDE, Thomas2018LearningCT] that when filters are learned to be as orthogonal as possible, model capacity is better utilized, which in turn improves feature expressiveness and intra-class feature representation [Araujo2020OnLR, Vapnik2000TheNO, thomas2019learning, Aghdaie2021AttentionAW].
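To make the fast matrix-vector multiplication property concrete, the sketch below (our own NumPy illustration, not code from the paper) multiplies an n x n Toeplitz matrix by a vector in O(n log n) time by embedding it in a circulant matrix and applying the FFT:

```python
import numpy as np

def toeplitz_matvec(c, r, x):
    """Multiply the n x n Toeplitz matrix with first column c and first row r
    (c[0] == r[0]) by a vector x in O(n log n) time, by embedding it in a
    2n x 2n circulant matrix whose action is a circular convolution (FFT)."""
    n = len(x)
    col = np.concatenate([c, [0.0], r[:0:-1]])      # first column of the circulant embedding
    x_pad = np.concatenate([x, np.zeros(n)])        # zero-pad x to length 2n
    y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(x_pad))
    return y[:n].real

# Quick check against the dense O(n^2) product.
n = 256
c, r, x = np.random.randn(n), np.random.randn(n), np.random.randn(n)
r[0] = c[0]
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(n)])
assert np.allclose(T @ x, toeplitz_matvec(c, r, x))
```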

Our goal is to present an effective baseline model that harnesses good learned representations for few-shot classification tasks and performs better than, or on par with, current few-shot algorithms [Wang2016LearningTL, Vinyals2016MatchingNF, Triantafillou2017FewShotLT, Snell2017PrototypicalNF, Sung2018LearningTC, Oreshkin2018TADAMTD, Rusu2019MetaLearningWL, Motiian2017FewShotAD]. In a nutshell, we tackle the limitations of few-shot learning by imposing orthogonal regularization on the model baseline, a simpler yet effective approach compared to the techniques used previously in [Wang2018LowShotLF, Gidaris2018DynamicFV, Qi2018LowShotLW]. We also incorporate data augmentation strategies that significantly improve data diversity and overall model performance.

Figure 2: Toeplitz covariance matrices estimated from feature samples. This requires learning only O(n) parameters, in contrast to O(n²) for generic covariance matrices.

1.1 Contributions:

  • We adopt an efficient orthogonal regularization technique, based on the doubly block-Toeplitz (DBT) matrix structure, on the convolutional layers of the few-shot classifier that enhances model generalization and intra-class feature embedding.

  • We break down the pipeline of a few-shot learner and, based on our findings, establish three augmentation strategies, namely support augmentation, query augmentation and task augmentation, that help minimize overfitting.

  • We show with compelling results that combining a DBT-based regularizer with a robust augmentation strategy improves few-shot learning performance by an average of about 5%.

2 Related works

Orthogonal regularization. In convolutional networks, orthogonal weights have been used to stabilize layer-wise distributions and to make optimization as efficient as possible. In [Bansal2018CanWG, Mishkin2016AllYN], the authors introduced orthogonal weight initialization driven by the norm-preserving property of an orthogonal matrix. However, it was shown that the orthogonality and isometry properties do not necessarily persist throughout training [Bansal2018CanWG] if the convolutional layers are not properly regularized. Other works [Jia2017ImprovingTO, Ozay2016OptimizationOS, Huang2018OrthogonalWN] considered Stiefel manifold-based hard constraints on the weights [tagare2011notes], but the performance they reported on VGG networks [Simonyan2015VeryDC] was not as promising. These methods [Jia2017ImprovingTO, Ozay2016OptimizationOS, Huang2018OrthogonalWN] rely on hard orthogonality constraints and, in most cases, have to repeat singular value decomposition (SVD) during training, which is computationally expensive on GPUs. More recent work adopted soft orthogonality [Balestriero2018MadMA, Balestriero2018AST, Bansal2018CanWG, Xie2017AllYN], where the Gram matrix of the weight matrix W is required to be close to the identity, expressed as λ‖WᵀW − I‖²_F, where λ is the coefficient of the Frobenius-norm penalty. This is a more efficient approach than the hard orthogonality assumption [Jia2017ImprovingTO, Ozay2016OptimizationOS, Huang2018OrthogonalWN, Harandi2016GeneralizedB, Xu2020LearningST] and can be viewed as a weight-decay term that keeps the parameters close to a Stiefel manifold [tagare2011notes]. This approach constrains orthogonality among filters within one layer, leading to smaller correlations among learned features and implicitly reducing filter redundancy. However, there are special cases where the Gram matrix cannot be close to the identity, which implies that the matrix W is overcomplete [Thomas2018LearningCT]. Similarly, other works explored orthogonal weight initialization [Sung2018LearningTC], mutual coherence with the isometry property [Bansal2018CanWG], and penalizing off-diagonal elements [Brock2019LargeSG] to improve kernel orthogonality.
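As a point of reference, the generic soft orthogonality penalty described above can be written in a few lines of PyTorch; the sketch below (function name and coefficient are ours) illustrates the baseline Gram-matrix penalty, not the DBT-based regularizer introduced later:

```python
import torch

def soft_orthogonality_penalty(weight, coeff=1e-4):
    """Generic soft orthogonality penalty: coeff * ||W W^T - I||_F^2 on the
    flattened filters. Convolutional kernels of shape (out_c, in_c, k, k) are
    reshaped to a matrix with one row per filter before forming the Gram matrix."""
    W = weight.reshape(weight.shape[0], -1)
    gram = W @ W.t()
    eye = torch.eye(gram.shape[0], device=W.device, dtype=W.dtype)
    return coeff * torch.sum((gram - eye) ** 2)
```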

In general, the orthogonality of the flattened kernel matrix alone is not sufficient to make a linear convolutional layer orthogonal among its filters. Due to these shortcomings, we apply the improved regularization technique used in [Wang2020OrthogonalCN, Le2011ICAWR]. We adopt the DBT matrix 𝒦 derived from a filter K, while keeping the reshaped input x and output y intact. The matrix-vector multiplication y = 𝒦x enforces the orthogonality of 𝒦, as shown in Figure 1 and Figure 3.

Augmentation. Data augmentation has become a well-established technique for most image classifiers and deep networks, as it provides an efficient strategy that significantly mitigates a model's vulnerability to overfitting. In contrast, data augmentation still has room for expansion and adaptation in few-shot classification and other derivatives of meta-learning in general. Existing works [Taylor2018ImprovingDL, Kang2017PatchShuffleR, Takahashi2020DataAU] apply basic data augmentation strategies such as random crops, horizontal flips and color jitter as the staple method for most meta-learning applications. However, these techniques have plateaued in performance with little room for significant improvement [Ren2018MetaLearningFS, Ni2021DataAF]. Other works add random noise to labels to alleviate overfitting [Rajendran2020MetaLearningRM], while some techniques rotate all the images in a class and treat the newly rotated class as distinct from its parent class. Recent works [Dabouei2020SuperMixST, Shorten2019ASO, Ni2021DataAF, Gidaris2018DynamicFV, Qiao2018FewShotIR] record better performance when augmentation strategies are injected within the meta-learning pipeline.

In our work, we explored the benefits of including augmentation strategies along the pipeline of a DBT regularized few-shot classifier. We identified how different augmentation approaches could affect a few-shot classifier when placed strategically along the classifier pipeline. At the core of our findings, we observed that the classifier is more sensitive to query data than support data.

Toeplitz matrix applications. Kimitei et al. [Kimitei2011AlgorithmsFT] used Toeplitz matrices with Tikhonov regularization [Natterer1984ErrorBF] as a mathematical approach to restoring blurred images, exploring their techniques on image restoration, enhancement, compression and recognition. In [Hansen2004DeconvolutionAR], the authors presented modern computational methods for treating linear deconvolution problems and showed how to exploit the Toeplitz structure to derive efficient numerical deconvolution algorithms. In compressive sensing applications [Su2014AnIT], Toeplitz-like matrices allow the entire signal to be efficiently acquired and reconstructed from relatively few measurements, compared to previous compressive sensing frameworks where a random measurement matrix is employed.

Figure 3: A doubly block-Toeplitz (DBT) matrix 𝒦 derived from the kernel tensor K.

3 Background

We consider a meta-learning scenario for an N-shot, K-way classification problem, where the collection of training and testing tasks can be represented as 𝒯 = {(𝒟_i^train, 𝒟_i^test)}_i. Each meta-training task is divided into 𝒟^train and 𝒟^test, and the collection of such tasks is called the meta-training set [Oreshkin2018TADAMTD, Rusu2019MetaLearningWL, Ye2018LearningEA, Lee2019MetaLearningWD, Li2019FindingTF, Motiian2017FewShotAD]. The sets 𝒟^train and 𝒟^test contain a small number of samples drawn from the same distribution. We implement a DBT-based learner to train the model for a given input feature x*, where (*) denotes the train or test split. We then map train and test examples into a DBT-structured embedding space f_θ(x*).

The objective of our model becomes:

θ* = argmin_θ ℒ(𝒟^train; θ) + ℛ(θ),        (1)

where θ represents the parameters of the embedding model, ℒ is the loss function and ℛ is the regularization described in Section 4.2. At the end of meta-training, the performance of the model is evaluated on a set of tasks called the meta-testing set. The final evaluation over the meta-testing set is:

𝔼_{(𝒟_j^train, 𝒟_j^test) ∼ 𝒯^test} [ Acc(𝒟_j^test; f_θ) ].        (2)

The goal of meta-learning is to learn a transferable, efficient embedding model that generalizes to new tasks. As described in Section 4, we deviate from popular techniques [Vinyals2016MatchingNF, Snell2017PrototypicalNF, Sung2018LearningTC, Finn2017ModelAgnosticMF] that train classifiers whose convolutional blocks carry some form of hard orthogonality constraint [Ni2021DataAF]. Our strategy imposes a low displacement rank, DBT-based soft orthogonality constraint on the classifier network to produce more efficient embeddings for the base learner. The final embedding model is given as:

θ* = argmin_θ Σ_i ℒ^ce(𝒯_i; θ) + ℛ(θ),        (3)

where 𝒯_i is a task drawn from 𝒯^train and ℒ^ce denotes the cross-entropy loss between predictions and ground-truth labels.
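As a concrete illustration of this episodic setup, the following sketch (our own, with placeholder names and sizes, not the authors' code) samples one N-shot, K-way task by drawing disjoint support and query examples for K randomly chosen classes:

```python
import random
from collections import defaultdict

def sample_episode(dataset, k_way=5, n_shot=1, n_query=15):
    """Sample one N-shot, K-way episode: n_shot support and n_query query examples
    for each of k_way randomly chosen classes. `dataset` is any iterable of
    (image, label) pairs with at least n_shot + n_query examples per class."""
    by_class = defaultdict(list)
    for image, label in dataset:
        by_class[label].append(image)

    classes = random.sample(list(by_class.keys()), k_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        picks = random.sample(by_class[cls], n_shot + n_query)
        support += [(img, episode_label) for img in picks[:n_shot]]
        query += [(img, episode_label) for img in picks[n_shot:]]
    return support, query
```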

Figure 4: The network depicts a DBT-regularized few-shot learner. The network embeddings are regularized based on the DBT structured matrix. The dotted box of the CNN block illustrates the inner translation between the convolution-layer embeddings and the more efficient DBT-based embeddings y = 𝒦x described above. Algorithm 1 gives a logical representation of the training process.

3.1 Doubly block-Toeplitz (DBT) regularization

The feature interaction between two weight vectors u and v within the layers of a few-shot classifier involves a convolution operation, which can be represented simply as u * v, such that if u has length n and v has length m, then u * v has length n + m − 1. Unfortunately, this computation involves O(nm) operations, which is not suitable for fast linear algebraic computation and the intra-class parameter sharing that is critical for few-shot learning. For such computations, if we consider a single convolution layer with input tensor X ∈ ℝ^{C×H×W} and kernel K ∈ ℝ^{M×C×k×k}, the convolution output tensor is expressed as Y = K * X, where Y ∈ ℝ^{M×H'×W'}; we replace the convolution operator Conv(K, X) with K * X for simplicity. M, H, W and C are the number of kernels and the height, width and number of channels of the input tensor, respectively, while k represents the kernel size and H', W' are the height and width of the output tensor, respectively.

In line with our goal of improving computational complexity and enhancing feature representation, we adapt a DBT matrix construction by exploiting the linearity of the convolution operation. The convolution expression Conv(K, X) is converted into a faster DBT matrix-vector representation given as:

y = 𝒦x,        (4)

where 𝒦 is the DBT matrix and x and y represent the flattened input and output tensors, respectively. This simple rearrangement establishes the foundation for adapting the DBT regularizer in our few-shot classifier network. 𝒦 is structured and of low displacement rank [Thomas2018LearningCT]; this representation reduces the storage requirement to O(n) parameters and accelerates the matrix-vector multiplication to O(n log n) time. Section 1, Figure 1 of the supplementary material shows the hierarchy of storage cost and operation count for matrix-vector multiplications. This DBT formulation stabilizes the spectrum of the newly derived DBT matrix 𝒦. In Sections 1.1 and 1.4 of the supplementary material, we discuss the overall benefit of the DBT model.
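The equivalence behind Equation 4 can be checked numerically. The sketch below is our own PyTorch illustration: rather than materializing the sparse DBT matrix, it uses unfold (im2col) to express the same convolution as a single matrix product between the reshaped kernel and the flattened input patches:

```python
import torch
import torch.nn.functional as F

# Conv(K, X) rewritten as a matrix product, mirroring y = Kx in Equation 4.
M, C, k = 8, 3, 3                      # number of kernels, input channels, kernel size
H = W = 16
X = torch.randn(1, C, H, W)
K = torch.randn(M, C, k, k)

# Reference convolution ("same" padding, stride 1).
y_conv = F.conv2d(X, K, padding=k // 2)

# Matrix form: each column of `patches` is a flattened receptive field of X,
# each row of `K_mat` a flattened kernel; their product is the convolution output.
patches = F.unfold(X, kernel_size=k, padding=k // 2)   # (1, C*k*k, H*W)
K_mat = K.reshape(M, -1)                               # (M, C*k*k)
y_mat = (K_mat @ patches).reshape(1, M, H, W)

assert torch.allclose(y_conv, y_mat, atol=1e-5)
```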

Algorithm 1: Ortho-Shot

procedure Orthogonal-Regularizer(K)
    Build the DBT (Toeplitz) matrix 𝒦 from the kernel K
    Compute the self-convolution Conv(K, K)                        ▷ Equation 6
    return ℛ ← ‖Conv(K, K) − I_r0‖²_F                              ▷ DBT-based output

procedure Few-Shot-Learning
    if Train then
        for each task 𝒯_i in 𝒯^train do
            Forward pass through the DBT-regularized model f_θ     ▷ DBT model
            Compute the cross-entropy loss ℒ^ce(𝒯_i; θ)
            Update θ with ℒ^ce + λ · Orthogonal-Regularizer(K)     ▷ orthogonal regularization
    if Test then
        for each query set Q in 𝒯^test do
            Evaluate f_θ on Q
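A minimal PyTorch-style rendering of the training and evaluation loop in Algorithm 1 is sketched below. It assumes a task loader, a standard nn.Module embedding network, and an orthogonality penalty callable such as the one sketched in Section 4.1; all names and hyperparameters are placeholders rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def train_ortho_shot(model, task_loader, optimizer, orth_penalty, reg_coeff=1e-2, epochs=10):
    """Episodic training sketch: cross-entropy on each sampled task plus an
    orthogonality penalty summed over all convolutional kernels (Equation 1)."""
    model.train()
    conv_kernels = [m.weight for m in model.modules() if isinstance(m, torch.nn.Conv2d)]
    for _ in range(epochs):
        for images, labels in task_loader:              # one few-shot task per iteration
            logits = model(images)
            loss = F.cross_entropy(logits, labels)
            loss = loss + reg_coeff * sum(orth_penalty(k) for k in conv_kernels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

@torch.no_grad()
def evaluate(model, query_loader):
    """Meta-testing sketch: accuracy of the trained embedding model on query data."""
    model.eval()
    correct = total = 0
    for images, labels in query_loader:
        correct += (model(images).argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return correct / total
```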

4 The proposed method

We present an efficient low displacement rank (LDR) regularization strategy, termed Ortho-Shot, that imposes orthogonal regularization on the convolutional layers of a few-shot classifier based on the doubly block-Toeplitz (DBT) matrix structure [Wang2020OrthogonalCN, Huang2018OrthogonalWN]. Our technique, described in Section 4.1, deviates from popular methods that train classifiers with convolutional blocks under some form of hard orthogonality constraint. We also adapt a set of augmentation strategies based on the support, query and task datasets to boost overall model performance. In general, our approach enhances model generalization and intra-class feature embeddings while minimizing overfitting for a few-shot classifier. To further describe our approach, we consider the single convolutional layer case. We extract feature embeddings from the intermediate convolutional layers of the few-shot classifier and flatten them to a vector x. The weight tensor K of our model is also converted to a doubly block-Toeplitz (DBT) matrix 𝒦 derived from the kernel tensor, as shown in Figure 3. With this matrix structure, we are able to apply a stronger orthogonality constraint, as described by the Lemma in Section 4.2. In Figure 4, we show a fully regularized setup for a single CNN block. The network embeddings are regularized based on the DBT structure, and the losses from the respective layers are summed into the total loss. We show promising results for our technique, as illustrated by the CAM plots in Figure 5.

4.1 Convolutional orthogonality

A DBT kernel matrix 𝒦 can be applied in both the rectangular and the square case, where the kernel matrix dimensions are rectangular or square. In the rectangular case, a uniform spectrum calls for row-orthogonal convolution, while the square case requires column-orthogonal convolution. In theory, the DBT kernel matrix is highly structured and sparse [Le2011ICAWR]; as a result, an equivalent representation is required to regularize the spectrum of 𝒦 to be uniform [Wang2020OrthogonalCN, Huang2018OrthogonalWN]. We give the conditions for both row and column orthogonality and propose an equivalent representation in this section.

Row orthogonality case. A row of the matrix 𝒦 corresponds to a filter at a particular spatial location flattened to a vector. The row orthogonality condition is given as:

𝒦𝒦ᵀ = I.        (5)

Through Equation 4, this is equivalent to the following self-convolution:

Conv(K, K) = I_r0,        (6)

where I_r0 is a tensor with an identity matrix at the centre and zero entries elsewhere.
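Equation 6 translates directly into a penalty on the self-convolution of the kernel. The sketch below follows the orthogonal-convolution formulation of [Wang2020OrthogonalCN]; it is our own minimal rendering, with the function name and defaults chosen for illustration:

```python
import torch
import torch.nn.functional as F

def dbt_orth_penalty(kernel, stride=1):
    """Row-orthogonality penalty ||Conv(K, K) - I_r0||_F^2 from Equation 6:
    the kernel convolved with itself should equal a tensor that is an identity
    matrix at the spatial centre and zero elsewhere."""
    out_c, _, k, _ = kernel.shape
    self_conv = F.conv2d(kernel, kernel, stride=stride, padding=k - 1)   # (out_c, out_c, p, p)
    target = torch.zeros_like(self_conv)
    centre = self_conv.shape[-1] // 2
    target[:, :, centre, centre] = torch.eye(out_c, device=kernel.device, dtype=kernel.dtype)
    return torch.sum((self_conv - target) ** 2)
```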

Figure 5: Illustration of CAM plots. (a) The second row shows CAM plots for single classes; the red squares highlight regions of interest clearly captured by the model. (b) More complex scenarios involving multiple classes; classes are clearly separated from non-classes of interest. (c) All the objects separated by bounding boxes are clearly localized, as indicated by the CAM plot.

Column orthogonality case. Let E denote an input tensor that is all zeros except for a single entry at one input channel and spatial location, and let e be the flattened vector derived from E. A column vector of 𝒦 is obtained by multiplying 𝒦 by the column vector e. Similar to the row orthogonality condition,

Conv(Kᵀ, Kᵀ) = I_c0,        (7)

where Kᵀ is the input-output transposed K, and I_c0 is all zeros except for the centre entries, which form an identity matrix. Figure 2 illustrates the DBT matrix structure of our model.

4.2 Row-column orthogonality equivalence

To develop an equivalent representation for row and column orthogonality, we build on Lemma 1, which states that minimizing the column orthogonality cost and minimizing the row orthogonality cost are equivalent [Le2011ICAWR] due to a property of the Frobenius norm.

Lemma 1: The row orthogonality cost ‖Conv(K, K) − I_r0‖²_F is equal to the column orthogonality cost ‖Conv(Kᵀ, Kᵀ) − I_c0‖²_F up to an additive constant. This implies that convolution orthogonality can be regularized independently of the shape of 𝒦 (square or rectangular), given as:

ℛ_DBT = λ ‖Conv(K, K) − I_r0‖²_F,        (8)

where ℛ_DBT is the DBT-based orthogonal regularization term, which depends only on Equation 6 and replaces the regularization term ℛ in Equation 1.

5 Experimental setup and analysis

Our experiments were conducted on the miniImageNet, CIFAR-FS, Stanford Dogs and Stanford Cars datasets. We used the R2-D2 base learner [Bertinetto2019MetalearningWD] with the "ResNet-12" and "64-64-64-64" backbones for the different few-shot learning modes in our work. Data augmentation strategies were also analysed to determine the best combination for a DBT-regularized model. The complete details of the entire setup are given in Section 1.2 of the supplementary material.

5.1 Data augmentation strategy

Motivated by the impact of applying a diverse augmentation strategy to meta-learners, we established three distinct augmentation approaches, support, query and task augmentation, that contribute to the overall classifier performance and are aimed at minimizing overfitting. Our empirical analysis confirms that support augmentation increases the amount of fine-tuning data, while query augmentation improves evaluation performance during training of the classifier. Similarly, task augmentation is used to increase the number of classes per task while training. We adapted several augmentation techniques: CutMix [Yun2019CutMixRS], where image patches are cut and pasted among training images and the ground-truth labels are mixed proportionally to the area of the patches; MixUp [Zhang2018mixupBE], which generates convex combinations of pairs of examples and their labels and proved effective for the support and query augmentation strategies; and Self-Mix [Seo2021SelfAugmentationGD], in which an image region is substituted by other regions of the same image, a dropout-like effect that improves few-shot generalization overall. In addition, we implemented standard data augmentation techniques: randomly erasing patches from the images (Random Erase), horizontally flipping the images (Horizontal Flip), rotating the images by specified angles (Rotation), and Color Jitter, where we randomly change the brightness, contrast and saturation of the images. To boost the performance of our augmentation strategy, we combine different augmentation techniques using the MaxUp approach proposed in [Gong2020MaxUpAS]. The rationale behind MaxUp is to minimize the training loss by performing parameter updates on the augmented copy that maximizes the loss, in a min-max optimization manner. The MaxUp expression is given as:

min_θ 𝔼_𝒯 [ max_{1 ≤ i ≤ m} ℒ(f_θ; 𝒯_i^aug) ],        (9)

where θ represents the model parameters, f_θ is the base model, ℒ is the loss function, and 𝒯_i^aug is the i-th augmented copy of a task drawn from both the support and query data, 𝒮 and 𝒬, respectively.
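The min-max idea in Equation 9 can be sketched as follows: draw m augmented copies of each batch, keep the per-sample losses, and back-propagate only the worst case. This is our own illustrative rendering (the augment callable and m = 4 are placeholders), not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def maxup_loss(model, augment, images, labels, m=4):
    """MaxUp sketch: minimize the worst-case loss over m random augmentations
    of the batch (Equation 9). `augment` is a placeholder callable that returns
    an augmented copy of (images, labels), e.g. random flips or rotations."""
    per_copy_losses = []
    for _ in range(m):
        aug_images, aug_labels = augment(images, labels)
        loss = F.cross_entropy(model(aug_images), aug_labels, reduction='none')
        per_copy_losses.append(loss)                      # per-sample losses, shape (batch,)
    worst_case = torch.stack(per_copy_losses, dim=0).max(dim=0).values
    return worst_case.mean()
```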

5.2 Augmentation performance

In this section, we investigate the performance of a few-shot classifier under different augmentation strategies. We examine three test cases that check the training performance when data is sampled from the support, query and task data, respectively. Our approach is similar to the techniques adopted in [Ni2021DataAF, Kye2020TransductiveFL, Seo2021SelfAugmentationGD], which examine the impact of augmentation on a diverse set of data combinations.

Case 1: We trained the model with an equal number of support and query samples, as indicated in Table 1, to establish a baseline performance. We use this setting to compare the impact of either data pool (support or query) when one of them is reduced.

Case 2: We trained the classifier by randomly sampling 5 and 10 unique samples per class from the support data while using the entire query data pool. With this approach, we reduced the influence of the support data in order to examine the impact of a diverse pool of query data on the classifier. Our findings, reflected in Table 1, show accuracy drops of only about 2%, a clear indication that augmentation of the query data plays the more significant role in overall model performance. In contrast, when we reduced the amount of query data while maintaining the initially set cap on support data, we recorded a decline in accuracy.

Case 3: To evaluate the impact of task augmentation, we used the CIFAR-FS data and initially allocated 10 distinct classes for 5-way classification tasks (252 possible combinations) before training, while the support and query sets were kept equal at 500 samples each. We observed a decline in performance; however, as we increased the amount of task data, a significant improvement was recorded, which confirms that task augmentation is crucial in few-shot learning.

In summary, we broke down the few-shot learning process to determine the influence of support, query and task augmentation, respectively. Our findings confirm that our baseline learner is most sensitive to query data [Ni2021DataAF]. In addition, task augmentation provides a significant gain (about 2%) that should not be overlooked.

Support       Query   Task   1-shot          5-shot
500           500     full   71.41 ± 0.21    86.01 ± 0.08
100           500     full   70.11 ± 0.01    83.00 ± 0.03
10            500     full   70.72 ± 0.01    81.41 ± 0.32
500           300     full   69.41 ± 0.11    72.41 ± 0.08
500           100     full   59.00 ± 0.21    70.41 ± 0.08
5 (random)    500     full   61.01 ± 0.11    80.41 ± 0.08
10 (random)   500     full   63.01 ± 0.30    81.24 ± 0.02

Table 1: Few-shot classification accuracy (%) using the R2-D2 base learner with a ResNet-12 backbone on the CIFAR-FS dataset. The Support and Query columns give the number of samples per class for the support and query data, and the Task column gives the total number of tasks available.
Figure 6: Accuracy results for training and validation with the R2-D2 base learner [Bertinetto2019MetalearningWD] and a DBT-regularized ResNet-12 backbone on the CIFAR-FS dataset. (Top left) Baseline model and (top right) augmentation "Aug" with MaxUp. The MaxUp augmentation strategy narrows the generalization gap and reduces overfitting. (Bottom left) 1-shot classification and (bottom right) 5-shot classification for query data augmentation.
Figure 7: Accuracy plots for different datasets compared to the baseline model. Augmentation techniques were applied to the Task "T" and Support "S" datasets. Overall, accuracy for 5-shot is maintained at 85-88%, while for 1-shot a range of 65-68% is recorded.

Augmentation             1-shot          5-shot
CutMix(Q)                76.01 ± 0.21    87.14 ± 0.08
 + CutMix(S)             75.11 ± 0.31    85.30 ± 0.14
 + Horizontal Flip(S)    76.32 ± 0.11    87.01 ± 0.23
 + Rotation(T)           75.33 ± 0.25    87.68 ± 0.03
SelfMix(Q)               76.04 ± 0.21    86.81 ± 0.08
 + CutMix(S)             76.19 ± 0.29    86.35 ± 0.16
 + Horizontal Flip(S)    75.27 ± 0.32    86.88 ± 0.03
 + Rotation(T)           75.61 ± 0.22    87.40 ± 0.18
MixUp(Q)                 72.14 ± 0.01    82.81 ± 0.08
 + CutMix(S)             71.03 ± 0.29    85.15 ± 0.11
 + Horizontal Flip(S)    72.27 ± 0.10    83.08 ± 0.01
 + Rotation(T)           74.10 ± 0.11    85.10 ± 0.22

Table 2: Few-shot classification accuracy (%) using the R2-D2 base learner with a ResNet-12 backbone on the CIFAR-FS dataset. Support (S), Query (Q) and Task (T) data are used in the different augmentation strategies.

DBT-model           Backbone       CIFAR-FS 5-way                   miniImageNet 5-way
                                   1-shot          5-shot           1-shot          5-shot
Baseline (No Aug)   ResNet-12      70.26 ± 0.61    83.12 ± 0.53     55.03 ± 0.40    74.06 ± 0.24
CutMix(Q)           ResNet-12      71.46 ± 0.24    84.32 ± 0.73     57.36 ± 0.24    74.46 ± 0.11
CutMix(Q) + M       ResNet-12      72.00 ± 0.01    86.20 ± 0.61     58.13 ± 0.25    75.69 ± 0.74
SelfMix(S) + R      ResNet-12      62.56 ± 0.54    79.82 ± 0.33     50.38 ± 0.63    71.44 ± 0.08
SelfMix(S) + M      ResNet-12      63.51 ± 0.78    80.20 ± 0.66     57.31 ± 0.89    72.69 ± 0.70
CutMix(S) + HF      64-64-64-64    60.56 ± 0.29    85.32 ± 0.73     62.26 ± 0.63    79.28 ± 0.63
CutMix(S) + M       64-64-64-64    63.42 ± 0.17    86.33 ± 0.66     63.31 ± 0.89    80.69 ± 0.54
SelfMix(Q) + HF     64-64-64-64    75.56 ± 0.84    84.32 ± 0.73     66.31 ± 0.89    82.69 ± 0.74
SelfMix(Q) + M      64-64-64-64    76.42 ± 0.38    86.10 ± 0.36     67.39 ± 0.34    83.44 ± 0.24

Table 3: Comparison to prior work on miniImageNet and CIFAR-FS. Few-shot classification accuracy (%) using the R2-D2 base learner with "ResNet-12" and "64-64-64-64" backbones on the CIFAR-FS and miniImageNet datasets. We applied Rotation (R) to the CutMix mode and Horizontal Flip (HF) to the SelfMix augmentation mode. "Q" denotes query data, "S" denotes support data and "M" denotes MaxUp.

Model                                             Stanford Dogs 5-way            Stanford Cars 5-way            CIFAR-FS 5-way
                                                  1-shot         5-shot          1-shot         5-shot          1-shot         5-shot
Matching Networks [Vinyals2016MatchingNF]         35.80 ± 0.99   47.50 ± 1.03    34.80 ± 0.98   44.70 ± 1.03    61.16 ± 0.89   72.86 ± 0.70
MAML [Finn2017ModelAgnosticMF]                    44.81 ± 0.34   58.68 ± 0.31    47.22 ± 0.39   61.21 ± 0.28    55.92 ± 0.95   72.09 ± 0.76
Relation Nets [Sung2018LearningTC]                43.33 ± 0.42   55.23 ± 0.41    47.67 ± 0.47   60.59 ± 0.40    62.45 ± 0.98   76.11 ± 0.69
Prototypical Networks [Snell2017PrototypicalNF]   37.59 ± 1.00   48.19 ± 1.03    40.90 ± 1.01   52.93 ± 1.03    51.31 ± 0.91   70.77 ± 0.69
DN4 [Li2019RevisitingLD]                          45.41 ± 0.76   63.51 ± 0.62    59.84 ± 0.80   88.65 ± 0.44    52.79 ± 0.86   81.45 ± 0.70
PABN [Huang2019LowRankPA]                         45.65 ± 0.71   61.24 ± 0.62    54.44 ± 0.71   67.36 ± 0.61    63.56 ± 0.79   75.35 ± 0.58
MATANet [Chen2020MultiscaleAT]                    55.63 ± 0.88   70.29 ± 0.62    73.15 ± 0.88   91.89 ± 0.45    67.33 ± 0.84   83.92 ± 0.63
GNN [Satorras2018FewShotLW]                       46.38 ± 0.78   62.27 ± 0.95    55.85 ± 0.97   71.25 ± 0.89    51.83 ± 0.48   63.69 ± 0.94
Rfs [Tian2020RethinkingFI]                        55.64 ± 0.28   62.02 ± 0.63    79.64 ± 0.44   69.74 ± 0.72    83.41 ± 0.55   83.50 ± 0.11
Rfs-distill [Tian2020RethinkingFI]                56.01 ± 0.48   64.82 ± 0.60    82.14 ± 0.43   71.52 ± 0.69    86.03 ± 0.49   84.10 ± 0.28
DBT-baseline                                      56.06 ± 0.03   71.00 ± 0.25    73.49 ± 0.01   92.02 ± 0.33    74.41 ± 0.50   84.21 ± 0.65
 + CutMix(Q) + R                                  56.36 ± 0.64   71.39 ± 0.04    73.69 ± 0.51   93.00 ± 0.15    74.81 ± 0.37   86.01 ± 0.67
 + SelfMix(Q) + HF                                56.86 ± 0.64   72.19 ± 0.78    74.21 ± 0.01   93.30 ± 0.35    75.01 ± 0.15   87.01 ± 0.74
 + MaxUp                                          57.06 ± 0.63   73.15 ± 0.22    75.34 ± 0.41   94.38 ± 0.25    76.41 ± 0.25   87.68 ± 0.24

Table 4: Experimental results comparing our method with prior work on the Stanford Dogs, Stanford Cars and CIFAR-FS datasets. Average few-shot classification accuracy (%) with 95% confidence intervals. For the embedding, a 4-layer convolutional network with the respective number of filters in each layer is employed.

5.3 Augmentation modes

This section builds on the findings of Section 5.1, where we established three core data augmentation cases: support, query and task augmentation. Similar to [Ni2021DataAF, Gong2020MaxUpAS, Dabouei2020SuperMixST], we used the CutMix, SelfMix, MixUp, Random Crop and Horizontal Flip augmentation methods on the support, query and task datasets, respectively. We identified the best augmentation combinations for a few-shot learner and, from these findings, picked the best strategy to determine which mode of augmentation suits a DBT-regularized few-shot learner. First, we used the R2-D2 base learner [Bertinetto2019MetalearningWD] and the CIFAR-FS dataset to evaluate the augmentation performance on support, query and task augmentation, as shown in Table 1. Our findings show that the pair of CutMix and SelfMix augmentations produces the best results, with over 2.5% improvement in accuracy [Ni2021DataAF]; other approaches lag behind by about 3% for both the 1-shot and 5-shot cases. Secondly, since the CutMix and SelfMix methods stand out as the best augmentation approaches for our setup, we used them as bases for combining augmentations across the three data cases, support, query and task, as shown in Table 2. Model performance improved significantly, with the best case occurring when CutMix on query data is combined with SelfMix on support data.

Figure 8: Model accuracy plots for the CIFAR-FS and miniImageNet datasets on CNN and DBT model baselines, with augmentation "Aug" and without augmentation, for the 5-shot and 1-shot cases.

5.4 DBT-regularization with data augmentation

As discussed in Section 1, DBT-based regularization improves model generalization and intra-class feature expressiveness. Data augmentation, on the other hand, creates sufficient data diversity, which helps to mitigate overfitting. In this section, we highlight the collective benefits of combining a DBT-based regularizer with augmentation strategies for few-shot learning, using different datasets.

Accuracy results with different datasets: We set up a testing scheme in which we evaluate our method over four runs, similar to the protocol applied in [Tian2020RethinkingFI], and report the mean accuracy over the runs. The experiments are conducted on the Stanford Dogs, Stanford Cars, miniImageNet and CIFAR-FS datasets, as shown in Table 4 and Table 3. Our accuracy results for 5-shot were maintained at 80-88%, while 1-shot stayed in the range of 65-68%, as shown in Figure 7 and Figure 8. Our baseline integrated with the DBT-based regularizer ("DBT-baseline") performs about 2% better than the state of the art even without data augmentation. Applying the CutMix and SelfMix augmentations to the query ("Q") and support ("S") datasets shows significant further improvement. Rotation ("R") and Horizontal Flip ("HF") are integrated into the CutMix and SelfMix data augmentation modes, respectively, as indicated in Table 3.

Improvement with MaxUp augmentation: We evaluate the performance of our model with the MaxUp approach for both 1-shot and 5-shot classification, using an experimental setting similar to that described in [Ni2021DataAF] at different augmentation pool sizes. Figure 6 and Table 4 depict the impact of MaxUp combined with the augmentation strategies, denoted generally as "Aug", for both training and validation data. We also show results for the baseline model without augmentation (DBT-baseline) and with CutMix and MaxUp augmentation under different query-data schemes. We observe from Figure 6 (top right) that the generalization gap shrinks considerably, and by implication overfitting is minimized, when the MaxUp strategy is adopted. MaxUp also adds an extra boost of about 2.3% in performance on average.

Comparison with different methods: We compared our results against several methods [Satorras2018FewShotLW, Chen2020MultiscaleAT, Huang2019LowRankPA, Li2019RevisitingLD], as shown in Table 4. We observe that [Tian2020RethinkingFI] is closest to ours, but we outperform their approach significantly in the 5-shot cases, by over 6% on average. We also record better performance than GNN [Satorras2018FewShotLW] and MATANet [Chen2020MultiscaleAT] in both the 5-way 1-shot and 5-way 5-shot settings: improvements of about 3.3%, 4.2% and 3.16% on Stanford Dogs, Stanford Cars and CIFAR-FS, respectively, for the 5-way 1-shot task, while for the 5-way 5-shot task our method achieves gains of about 4.7%, 2.1% and 4.9%, respectively. Clearly, the MaxUp boost is significant in almost all cases.

6 Conclusion

We proposed Ortho-Shot, a structured doubly block-Toeplitz (DBT) matrix based model that imposes orthogonal regularization on the filters of the convolutional layers. Our approach is aimed at maintaining the stability of activations, preserving gradient norms, and enhancing the feature transferability of deep networks. We also broke down the pipeline of a few-shot learner and, based on our findings, established three augmentation strategies that help minimize overfitting and increase data diversity. Our findings and empirical results confirm that a DBT-regularized model is beneficial to few-shot classification and meta-learning in general.

References

7 VC dimension and sample complexity

The Vapnik–Chervonenkis (VC) dimension is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a set of functions. In our setting, we focus on neural networks in which all the weight matrices are of low displacement rank (LDR), such as Toeplitz-like, Hankel-like, Vandermonde-like, and Cauchy-like matrices.
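For reference, the displacement rank referred to above is defined through the Sylvester displacement operator (a standard definition, not restated in this paper): for fixed operator matrices A and B,

∇_{A,B}(M) = A M − M B,

and M is said to have low displacement rank if rank(∇_{A,B}(M)) ≤ r for a small r. Toeplitz-like matrices correspond to choosing A and B as unit shift (f-circulant) matrices, with r ≤ 2 for an exact Toeplitz matrix.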

7.1 Bounding VC dimension

Theorem 1. For an input x and parameters θ, let f(x; θ) denote the output of the network, and let ℱ be the class of functions {f(·; θ)}. Let W_i be the number of parameters up to layer i, i.e., the total number of parameters in layers 1, 2, …, i. We define the effective depth as:

(10)

Then the total number of computation units is given as:

(11)

In line with the works of [Bartlett1998AlmostLV, Harvey2017NearlytightVB, Warren1968LowerBF, Anthony1999NeuralNL], if k = 1, corresponding to piece-wise linear networks, it can be shown that:

(12)

Lemma 1. Let p_1, …, p_m be polynomials of degree at most d in n variables; then we define:

(13)

i.e., if K is the number of possible sign vectors given by the polynomials, then K is bounded as above. For a fixed input x, the parameter space is partitioned so that, on each region of the partition, the network output is a fixed polynomial of the parameters θ.

Hence, we have:

(14)

Hence, from Lemma 1, we can show by recursive construction that there is a partition of the parameter space such that, on each region, the network output for input x is a fixed polynomial of θ [thomas2019learning], which collectively gives:

(15)

Combining the size of the partition with the previous bound, we get:

(16)

Taking the logarithm and applying Jensen's inequality, we obtain:

(17)

We bound this quantity using the bound on the degree, since the degree of an LDR matrix parameterization is at most:

(18)

where c is a constant; thus

(19)

To bound the VC dimension: if VCdim(sign ℱ) = N, there exist N data points such that the output of the model can realize all 2^N sign patterns on them [Vapnik2000TheNO]. The bound on K then implies:

(20)

This completes the proof. Since the number of parameters of an LDR network (e.g., a doubly block-Toeplitz based network) is roughly the square root of the number of parameters of a network with unstructured layers, the sample complexity of an LDR network is much smaller than that of unstructured networks (e.g., CNNs), which is beneficial for deep networks.

Figure 9: The caption for each class shows the storage cost and operation count for matrix-vector multiplication. Our proposed Toeplitz-like construction has the lowest rank compared to circulant matrices, standard convolutional filters, and orthogonal polynomial transforms.

7.2 Space and time complexity.

The proposed DBT model has a matrix-vector multiplication time complexity of O(n log n), and its small number of parameters also allows the network to perform better with a limited amount of training data, which is crucial for few-shot learning [Li2017LowRankDE]. We also ran tests on the model backbone with two NVIDIA GeForce GTX 1080 Ti GPUs and a batch size of 64; Table 1 reflects the accuracy performance, and we see an overall improvement of about 4%. Similar to [Christiani2019FastLH], the network used in this test consists of 4 convolutional layers, 1 fully-connected layer and one softmax layer, with rectified linear units (ReLU) as the activation functions. Images were cropped to 24×24 and augmented with horizontal flips, rotation, and scaling transformations. We use an initial learning rate of 0.0001 and train for 800-400-100 epochs with their respective default weight decay. Our efficient DBT-based approach obtains a test error of 6.61%, compared to 5.26% obtained by the conventional CNN model. At the same time, the DBT-based network is 4x more space efficient and 1.2x more time efficient than the conventional CNN-based model.

7.3 Optimization setup.

An SGD optimizer with a momentum of 0.9 and weight decay was used for our setup. The learning rate was initialized at 0.0001 with a decay factor of 0.1 for all datasets. We trained for 100 epochs on miniImageNet, 200 epochs on CIFAR-FS, and 150 epochs each on Stanford Dogs and Stanford Cars.
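A minimal PyTorch equivalent of this optimization setup might look as follows; the weight-decay value and the decay milestones are illustrative defaults, since the exact figures are not recoverable from the text:

```python
import torch

def build_optimizer(model, epochs, weight_decay=5e-4):
    """SGD with momentum 0.9 and an initial learning rate of 1e-4, decayed by 0.1.
    The weight-decay value and milestone epochs are placeholder defaults,
    not figures taken from the paper."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                                momentum=0.9, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[int(0.6 * epochs), int(0.8 * epochs)], gamma=0.1)
    return optimizer, scheduler
```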

7.4 Architecture

The works of [Oreshkin2018TADAMTD, Lee2019MetaLearningWD, Tian2020RethinkingFI] used a ResNet-12 as the backbone for their models. We used a similar structure but replaced the convolutional layers with doubly block-Toeplitz matrices. The network consists of 4 residual blocks with 3 x 3 kernels; a 2 x 2 max-pooling layer is applied after each of the first 3 blocks, and a global average-pooling layer is placed on top of the fourth block to generate feature embeddings. Similar to [Thomas2018LearningCT], we used spectral regularization and changed the number of filters from (64, 128, 256, 512) to (64, 160, 320, 640).

Usefulness of DBT regularization. The DBT matrix belongs to a class of structured matrices whose layers can be applied in O(n log n) time, compared to unstructured linear (convolutional) layers that require about O(n²) time [Li2017LowRankDE]. The generic term structured matrix refers to an n x n matrix that can be described in fewer than n² parameters and supports fast operations whose results have at most double the displacement rank, which is far simpler for computation [Li2017LowRankDE]. Hence, if ℱ denotes a class of neural networks comprising DBT layers, W total parameters and piece-wise linear activations, we can measure the complexity, expressive power, richness, or flexibility of ℱ via a measure referred to as the Vapnik–Chervonenkis (VC) dimension [Vapnik2000TheNO, Bartlett2003VapnikChervonenkisDO].

For a simple classification problem of the form sign(f(x; θ)), the VC dimension of the class ℱ with L layers and W total parameters is expressed as:

VCdim(sign ℱ) = O(L W log W),        (21)

which matches the standard bound for unconstrained weight matrices [Bartlett1998AlmostLV, Harvey2017NearlytightVB, thomas2019learning].