1 Introduction
The performance of convolutional neural network (CNN) models largely depends on training with a large number of labelled instances, typically thousands per class, that span a wide spectrum of visual variations
[Krizhevsky2012ImageNetCW]. The cost of manually labelling these data through human annotation, as well as the scarcity of data that captures the complete diversity of a specific class, significantly limits the potential of current vision models. The human visual system (HVS), however, can identify new classes from only a few labelled examples [Kietzmann2019RecurrenceIR, Nayebi2018TaskDrivenCR]. This unique trait of the HVS reveals the need to explore new paradigms that learn to generalize to novel classes from a limited amount of labelled data per class. Recently, significant progress has been made towards better solutions using ideas from meta-learning [Oreshkin2018TADAMTD, Rusu2019MetaLearningWL, Ye2018LearningEA, Lee2019MetaLearningWD, Li2019FindingTF, Motiian2017FewShotAD]. Empirically, it has been observed that the convolutional filters learned in deeper layers are highly correlated and redundant [Wang2020OrthogonalCN]
, thereby resulting in unstable training and vanishing gradients. These shortcomings of convolutional neural networks are even more damaging in few-shot classification due to the small data size. The potential pitfalls of such convolutional layers include under-utilization of model capacity, overfitting, vanishing and exploding gradients
[Glorot2010UnderstandingTD, Bengio1994LearningLD], growth in saddle points [Dauphin2014IdentifyingAA] and shifts in feature statistics [Ioffe2015BatchNA], which collectively affect model generalization. The doubly block-Toeplitz (DBT) matrix [Gray2005ToeplitzAC] belongs to a class of low displacement rank (LDR) matrix constructions [Zhao2017TheoreticalPF] that guarantee model and computational complexity reduction in neural networks, achieved by regularizing the weight matrices of network layers. The storage requirement of such a DBT-regularized network is reduced from O(n^2) to O(n) parameters, and the cost of a matrix-vector product can be reduced from O(n^2) to O(n log n), due to the fast matrix-vector multiplication property of LDR structured matrices, as shown in Figure 2. It is also well established [Li2017LowRankDE, Thomas2018LearningCT] that when filters are learned to be as orthogonal as possible, model capacity is better utilized, which in turn improves feature expressiveness and intra-class feature representation [Araujo2020OnLR, Vapnik2000TheNO, thomas2019learning, Aghdaie2021AttentionAW].
Our goal is to present an effective baseline model that harnesses good learned representations for few-shot classification tasks and performs better than, or on par with, current few-shot algorithms [Wang2016LearningTL, Vinyals2016MatchingNF, Triantafillou2017FewShotLT, Snell2017PrototypicalNF, Sung2018LearningTC, Oreshkin2018TADAMTD, Rusu2019MetaLearningWL, Motiian2017FewShotAD]. In a nutshell, we tackle the limitations of few-shot learning by imposing orthogonal regularization on the baseline model, a simpler yet effective approach compared to techniques previously used in [Wang2018LowShotLF, Gidaris2018DynamicFV, Qi2018LowShotLW]. We also incorporate data augmentation strategies that significantly improve data diversity and overall model performance.
1.1 Contributions:

We adopt an efficient orthogonal regularization technique, based on the doubly block-Toeplitz (DBT) matrix structure, on the convolutional layers of the few-shot classifier, which enhances model generalization and intra-class feature embedding.

We break down the pipeline of a few-shot learner and, based on our findings, establish three augmentation strategies, namely support augmentation, query augmentation and task augmentation, that help minimize overfitting.

We show with compelling results that combining a DBT-based regularizer with a robust augmentation strategy improves few-shot learning performance by an average of 5%.
2 Related work
Orthogonal regularization. In convolutional networks, orthogonal weights have been used to stabilize layer-wise distributions and make optimization more efficient. In [Bansal2018CanWG, Mishkin2016AllYN]
the authors introduced orthogonal weight initialization, motivated by the norm-preserving property of an orthogonal matrix. However, it was shown that the orthogonality and isometry properties do not necessarily persist throughout training
[Bansal2018CanWG] if the convolutional layers are not properly regularized. Other works [Jia2017ImprovingTO, Ozay2016OptimizationOS, Huang2018OrthogonalWN] considered Stiefel manifold-based hard constraints on weights [tagare2011notes], but the performance they reported on VGG networks [Simonyan2015VeryDC] was not as promising. These methods [Jia2017ImprovingTO, Ozay2016OptimizationOS, Huang2018OrthogonalWN] are associated with hard orthogonality constraints and, in most cases, have to repeatedly perform singular value decomposition (SVD) during training, which is computationally expensive on GPUs. A recent line of work adopted soft orthogonality
[Balestriero2018MadMA, Balestriero2018AST, Bansal2018CanWG, Xie2017AllYN], where the Gram matrix of the weight matrix W is required to be close to the identity, given as λ||W^T W − I||_F^2, where λ is the Frobenius norm-based regularization coefficient. This is a more efficient approach than the hard orthogonality assumption [Jia2017ImprovingTO, Ozay2016OptimizationOS, Huang2018OrthogonalWN, Harandi2016GeneralizedB, Xu2020LearningST] and can be viewed as a different weight decay term that keeps the parameters close to a Stiefel manifold [tagare2011notes]. This approach constrains orthogonality among the filters within one layer, leading to smaller correlations among learned features and implicitly reducing filter redundancy. However, there are special cases where the Gram matrix cannot be close to the identity, which implies that the matrix W is overcomplete [Thomas2018LearningCT]. Similarly, other works explored orthogonal weight initialization [Sung2018LearningTC], mutual coherence with the isometry property [Bansal2018CanWG], and penalizing off-diagonal elements [Brock2019LargeSG] towards improving kernel orthogonality. In general, the orthogonality of W alone is not sufficient to make a linear convolutional layer orthogonal among its filters. Due to these shortcomings, we apply the improved regularization technique used in [Wang2020OrthogonalCN, Le2011ICAWR]. We adopt the DBT matrix, derived from a filter, while keeping the reshaped input and output intact. The resulting matrix-vector formulation allows the orthogonality of the DBT matrix to be enforced, as shown in Figure 1 and Figure 3.
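As a concrete illustration of the soft-orthogonality penalty discussed above, the following NumPy sketch computes λ||W^T W − I||_F^2 for a weight matrix; the coefficient value and matrix shapes are arbitrary assumptions for demonstration:

```python
import numpy as np

def soft_orthogonality_penalty(W, lam=1e-4):
    """Frobenius-norm penalty lam * ||W^T W - I||_F^2, which pushes the
    columns of W toward an orthonormal set (soft orthogonality)."""
    n = W.shape[1]
    gram = W.T @ W
    return lam * np.sum((gram - np.eye(n)) ** 2)

# A random Gaussian matrix incurs a large penalty; a matrix with
# orthonormal columns incurs (numerically) zero penalty.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 16))
Q, _ = np.linalg.qr(W)  # orthonormal columns
print(soft_orthogonality_penalty(W) > soft_orthogonality_penalty(Q))  # True
```

In practice this scalar would be added to the task loss and differentiated alongside it; here it only serves to show how the Gram-matrix deviation from identity is measured.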
Augmentation. Data augmentation has become a well-established technique for most image classifiers and deep networks, as it provides an efficient strategy that significantly mitigates a model's vulnerability to overfitting. However, data augmentation still has room for expansion and adaptation in few-shot classification and other derivatives of meta-learning in general. Existing works [Taylor2018ImprovingDL, Kang2017PatchShuffleR, Takahashi2020DataAU] apply basic data augmentation strategies like random crops, horizontal flips and color jitter as the staple method for most meta-learning applications. However, these techniques have plateaued in performance with little room for significant improvement [Ren2018MetaLearningFS, Ni2021DataAF]. Other works add random noise to labels to alleviate overfitting [Rajendran2020MetaLearningRM], while some techniques rotate all the images in a class and treat the newly rotated class as distinct from its parent class. Recent works [Dabouei2020SuperMixST, Shorten2019ASO, Ni2021DataAF, Gidaris2018DynamicFV, Qiao2018FewShotIR] record better performance when augmentation strategies are injected within the meta-learning pipeline.
In our work, we explored the benefits of including augmentation strategies along the pipeline of a DBT regularized fewshot classifier. We identified how different augmentation approaches could affect a fewshot classifier when placed strategically along the classifier pipeline. At the core of our findings, we observed that the classifier is more sensitive to query data than support data.
Toeplitz matrix applications. Kimitei et al. [Kimitei2011AlgorithmsFT] used Toeplitz matrices with Tikhonov regularization [Natterer1984ErrorBF] as a mathematical approach to restoring blurred images, exploring their techniques on image restoration, enhancement, compression and recognition. In [Hansen2004DeconvolutionAR], the authors presented modern computational methods for treating linear deconvolution problems and showed how to exploit the Toeplitz structure to derive efficient numerical deconvolution algorithms. In compressive sensing applications [Su2014AnIT], Toeplitz-like matrices allow the entire signal to be efficiently acquired and reconstructed from relatively few measurements, compared to previous compressive sensing frameworks where a random measurement matrix is employed.
3 Background
We consider a meta-learning scenario for an N-shot, K-way classification problem, where each task T_i consists of a training set D_i^train and a testing set D_i^test; the collection of such tasks is called the meta-training set [Oreshkin2018TADAMTD, Rusu2019MetaLearningWL, Ye2018LearningEA, Lee2019MetaLearningWD, Li2019FindingTF, Motiian2017FewShotAD]. The sets D_i^train and D_i^test each contain a small number of samples drawn from the same distribution. We implement a DBT-based learner to train the model for a given input feature x^(*), where (*) denotes the train or test set. We then map train and test examples into a DBT-structured embedding space.
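For readers unfamiliar with the episodic setup, the following sketch shows how an N-shot, K-way task (a support set D_i^train and query set D_i^test) might be sampled; the dataset layout and function name are hypothetical, for illustration only:

```python
import random

def sample_episode(dataset, k_way=5, n_shot=1, n_query=15, seed=None):
    """Sample one N-shot, K-way episode: a support set (D_train) and a
    query set (D_test) drawn from the same K classes.
    `dataset` maps class label -> list of examples (assumed layout)."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), k_way)
    support, query = [], []
    for label in classes:
        pool = rng.sample(dataset[label], n_shot + n_query)
        support += [(x, label) for x in pool[:n_shot]]
        query += [(x, label) for x in pool[n_shot:]]
    return support, query

# Toy dataset: 10 classes, 30 examples each.
toy = {c: [f"img_{c}_{i}" for i in range(30)] for c in range(10)}
s, q = sample_episode(toy, k_way=5, n_shot=1, n_query=15, seed=0)
print(len(s), len(q))  # 5 75
```

A meta-training run would repeatedly draw such episodes, fit on the support set, and evaluate on the query set.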
The objective of our model becomes:
(1)   θ* = argmin_θ Σ_i L(D_i^train; θ) + R(θ)
where θ represents the parameters of the embedding model, L is the loss function and R is the regularization term described in Section 4.2. At the end of meta-training, the performance of the model is evaluated on a set of held-out tasks called the meta-testing set. The final evaluation over the test set is:
(2)   E [ L(D^test; θ*) ]
The goal of meta-learning is to learn a transferable, efficient embedding model that generalizes to new tasks. As described in Section 4, we deviate from popular techniques [Vinyals2016MatchingNF, Snell2017PrototypicalNF, Sung2018LearningTC, Finn2017ModelAgnosticMF] that train classifiers built from convolutional blocks under some form of hard orthogonality constraint [Ni2021DataAF]. Our strategy imposes a better, low displacement rank, DBT-based soft orthogonality constraint on the classifier network to produce more efficient embeddings for the base learner. The final embedding model is given as:
(3)   φ* = argmin_φ Σ_i L^ce(T_i; φ) + R(φ)
where T_i is a task drawn from the meta-training set and L^ce denotes the cross-entropy loss between predictions and ground-truth labels.
3.1 Doubly block-Toeplitz (DBT) regularization
The feature interaction between two weight vectors u and v within the layers of a few-shot classifier involves a convolution operation, which can simply be represented as w = u * v, such that if u has length m and v has length n, then w has length m + n − 1. Unfortunately, this computation involves O(mn) operations, which is not suitable for fast linear algebraic computations and the intra-class parameter sharing that is critical for few-shot learning. For such computations, consider a single convolution layer with input tensor X and kernel K; the convolution's output tensor is expressed as Y = K * X, where we replace the convolution operator Conv(K, X) with * for simplicity. M, H, W and C are the number of kernels, and the height, width and number of channels of the input tensor, respectively, while k is the kernel size and H', W' are the height and width of the output tensor.
In line with our goal of improving computational complexity and enhancing feature representation, we adapt a DBT matrix construction by utilizing the linearity of the convolution operation. The convolution expression Conv(K, X) is converted into a faster DBT matrix-vector representation given as:
(4)   y = T x
This simple rearrangement establishes the foundation for adapting the DBT regularizer in our few-shot classifier network, where T is the DBT matrix and x, y represent the flattened input and output tensors, respectively. T is structured and of low displacement rank [Thomas2018LearningCT]; this representation reduces the storage requirement to O(n) parameters and accelerates the matrix-vector multiplication time to O(n log n). Section 1: Figure 1 in the supplementary material shows the hierarchy of storage costs and operation counts for matrix-vector multiplications. This DBT formulation stabilizes the spectrum of the newly derived DBT-based matrix T. In Sections 1.1 and 1.4 of the supplementary material, we reflect the overall benefit of the DBT model.
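The DBT rearrangement in Equation 4 can be verified numerically. The sketch below (single channel, stride 1, 'valid' padding, all assumptions for illustration) builds the doubly block-Toeplitz matrix T for a small kernel and checks that T times the flattened input reproduces the convolution output:

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 'valid' 2-D cross-correlation: O((H*W)*(kh*kw)) operations."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def dbt_matrix(k, H, W):
    """Doubly block-Toeplitz matrix T such that T @ x.ravel() equals
    conv2d_valid(x, k).ravel() for any H x W input x."""
    kh, kw = k.shape
    Ho, Wo = H - kh + 1, W - kw + 1
    T = np.zeros((Ho * Wo, H * W))
    for i in range(Ho):
        for j in range(Wo):
            row = np.zeros((H, W))
            row[i:i + kh, j:j + kw] = k  # kernel placed at output location (i, j)
            T[i * Wo + j] = row.ravel()
    return T

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 6))
k = rng.standard_normal((3, 3))
T = dbt_matrix(k, 6, 6)
assert np.allclose(T @ x.ravel(), conv2d_valid(x, k).ravel())
```

The dense T here is only for checking equivalence; the point of the DBT structure is that T never needs to be materialized, since it is fully determined by the k x k kernel.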
4 The proposed method
We present an efficient low displacement rank (LDR) regularization strategy, termed OrthoShot, that imposes orthogonal regularization on the convolutional layers of a few-shot classifier based on the doubly block-Toeplitz (DBT) matrix structure [Wang2020OrthogonalCN, Huang2018OrthogonalWN]. Our technique, as reflected in Section 4.1, deviates from popular methods that train classifiers built from convolutional blocks under some form of hard orthogonality constraint. We also adapt a set of augmentation strategies based on the support, query and task datasets to boost overall model performance. In general, our approach enhances model generalization and intra-class feature embeddings, and also minimizes overfitting for a few-shot classifier. To further describe our approach, we consider the case of a single convolutional layer. We extract feature embeddings from the intermediate convolutional layers of the few-shot classifier and flatten them to a vector x. The weight tensor of our model is converted to a doubly block-Toeplitz (DBT) matrix T derived from the kernel tensor K, as shown in Figure 3. With this matrix structure, we are able to apply a better orthogonality constraint, as described by the Lemma in Section 4.2. In Figure 4, we show a fully regularized setup for a single CNN block. The network embeddings are regularized based on the DBT structure, and the losses from the respective layers are summed. We show promising results for our technique, as described by the CAM plots in Figure 5.
4.1 Convolutional orthogonality
A DBT kernel matrix T can be either rectangular or square, depending on the kernel dimensions. In the rectangular case, a uniform spectrum calls for row-orthogonal convolution, while the square case requires column-orthogonal convolution. In theory, the DBT kernel matrix is highly structured and sparse [Le2011ICAWR]; as a result, an equivalent representation is required to regularize the spectrum of T to be uniform [Wang2020OrthogonalCN, Huang2018OrthogonalWN]. We give the cases for both row and column orthogonality and propose an equivalent representation in this section.
Row orthogonality case. A row of the matrix T corresponds to a filter at a particular spatial location, flattened to a vector. The row orthogonality condition is given as:
(5)   T T^T = I
This results in an equivalent form of Equation 5 as the following self-convolution:
(6)   Conv(K, K) = I_r0
where I_r0
is a tensor with an identity matrix at the centre and zero entries elsewhere.
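To make the self-convolution condition concrete, the following NumPy sketch computes Conv(K, K) for a small kernel tensor and evaluates the squared deviation from the identity target I_r0; the tensor layout and shift convention are assumptions for illustration:

```python
import numpy as np

def self_convolution(K):
    """Full self cross-correlation Conv(K, K): output [M, M, 2k-1, 2k-1],
    where entry (i, j, p, q) correlates filter i with filter j at the
    spatial shift (p - (k-1), q - (k-1))."""
    M, C, k, _ = K.shape
    P = np.zeros((M, M, 2 * k - 1, 2 * k - 1))
    for dy in range(-(k - 1), k):
        for dx in range(-(k - 1), k):
            # Overlapping windows of filter i and filter j under shift (dy, dx).
            ys, xs = slice(max(0, dy), min(k, k + dy)), slice(max(0, dx), min(k, k + dx))
            ys2, xs2 = slice(max(0, -dy), min(k, k - dy)), slice(max(0, -dx), min(k, k - dx))
            A = K[:, :, ys, xs].reshape(M, -1)
            B = K[:, :, ys2, xs2].reshape(M, -1)
            P[:, :, dy + k - 1, dx + k - 1] = A @ B.T
    return P

def orth_loss(K):
    """||Conv(K, K) - I_r0||_F^2: identity at the centre shift, zero elsewhere."""
    M, _, k, _ = K.shape
    target = np.zeros((M, M, 2 * k - 1, 2 * k - 1))
    target[:, :, k - 1, k - 1] = np.eye(M)
    return np.sum((self_convolution(K) - target) ** 2)

rng = np.random.default_rng(0)
print(orth_loss(rng.standard_normal((4, 3, 3, 3))) > 0)  # True: random filters are far from orthogonal
```

Note that at zero shift the self-convolution reduces to the Gram matrix of the flattened filters, which is why this condition strictly generalizes plain kernel orthogonality.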
Column orthogonality case. Let X_0 denote an input tensor that is all zeros except for a single entry at one input channel and spatial location, and let x_0 be the flattened vector derived from X_0. A column vector of T is obtained by multiplying T with the column vector x_0. Similar to the row orthogonality case,
(7)   Conv(K^T, K^T) = I_c0
where K^T is the input-output transposed kernel and I_c0 has all zeros except for the centre entries, which form an identity matrix. Figure 2 illustrates the DBT matrix structure of our model.
4.2 Row-column orthogonality equivalence
To develop an equivalent representation for row and column orthogonality, we build on Lemma 1, which states that minimizing the column orthogonality cost and minimizing the row orthogonality cost are equivalent [Le2011ICAWR], due to a property of the Frobenius norm.
Lemma 1: The row orthogonality cost ||T T^T − I||_F^2 is equivalent to the column orthogonality cost ||T^T T − I||_F^2 up to a constant c. This implies that convolution orthogonality can be regularized independently of the shape of T (square or rectangular), given as:
(8)   R(K) = λ ||Conv(K, K) − I_r0||_F^2
where R(K) is the DBT-based orthogonal regularization term, which depends only on Equation 6 and replaces the R term in Equation 1.
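Lemma 1 is easy to check numerically: for any m x n matrix, the difference between the row and column orthogonality costs is exactly the constant m − n, independent of the matrix entries, so minimizing either cost is equivalent. A quick NumPy check (shapes arbitrary):

```python
import numpy as np

# ||K K^T - I_m||_F^2 - ||K^T K - I_n||_F^2 = m - n for any K in R^{m x n},
# since tr((K K^T)^2) = tr((K^T K)^2) and tr(K K^T) = tr(K^T K).
rng = np.random.default_rng(0)
m, n = 4, 7
for _ in range(5):
    K = rng.standard_normal((m, n))
    row_cost = np.sum((K @ K.T - np.eye(m)) ** 2)
    col_cost = np.sum((K.T @ K - np.eye(n)) ** 2)
    assert np.isclose(row_cost - col_cost, m - n)
print("difference is the constant m - n =", m - n)
```

This is why the regularizer in Equation 8 can be applied regardless of whether the DBT matrix is rectangular or square.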
5 Experimental setup and analysis
Our experiments were conducted on the miniImageNet, CIFAR-FS, Stanford Dogs and Stanford Cars datasets, respectively. We used the R2D2 base learner
[Bertinetto2019MetalearningWD], with the "ResNet12" and "64-64-64-64" backbones for the different few-shot learning modes used in our work. Data augmentation strategies were also analysed to determine the best combination for a DBT-regularized model. The complete details of the entire setup are given in Section 1.2 of the supplementary material.
5.1 Data augmentation strategy
Motivated by the impact of applying a diverse augmentation strategy to meta-learners, we established three unique augmentation approaches, support, query and task augmentation, that contribute to the overall classifier performance and are aimed at minimising overfitting. Our empirical analysis confirms that support augmentation increases the amount of fine-tuning data, while query augmentation improves evaluation performance during training. Similarly, task augmentation is used to increase the number of classes per task while training. We adapted several augmentation techniques, such as CutMix [Yun2019CutMixRS], where image patches are cut and pasted among training images and the ground-truth labels are mixed proportionally to the area of the patches; Mixup [Zhang2018mixupBE], a technique that generates convex combinations of pairs of examples and their labels, which proved to be effective for the support and query augmentation strategies; and SelfMix [Seo2021SelfAugmentationGD], in which a region of an image is substituted into other regions of the same image. This dropout-like effect improves few-shot learning generalization overall. In addition, we implemented standard data augmentation techniques: randomly erasing patches from the images (Random Erase), horizontally flipping the images (Horizontal Flip), rotating the images at specified angles (Rotation) and Color Jitter, where we randomly change the brightness, contrast and saturation of the images. To boost the performance of our augmentation strategy, we combine different augmentation techniques using the MaxUp approach proposed in [Gong2020MaxUpAS]. The rationale behind MaxUp is to minimize the training loss by performing parameter updates on the augmented copy that maximizes the loss, in a min-max optimization manner. The MaxUp expression is given as:
(9)   θ* = argmin_θ E [ max_{1≤i≤m} L(f(T_i; θ)) ]
where θ represents the model parameters, f is the base model, L is the loss function and T_i is an augmented task drawn from the support and query data, S and Q, respectively.
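The min-max update above can be sketched on a toy one-parameter model; the model, augmentation, copy count and learning rate below are illustrative assumptions, not the paper's actual training setup:

```python
import numpy as np

def maxup_step(theta, x, y, augment, loss, grad, m=4, lr=0.1, rng=None):
    """One MaxUp update: draw m augmented copies of x, evaluate the loss
    on each, and take a gradient step only on the worst (max-loss) copy."""
    rng = rng or np.random.default_rng()
    copies = [augment(x, rng) for _ in range(m)]
    worst = max(copies, key=lambda xa: loss(theta, xa, y))
    return theta - lr * grad(theta, worst, y)

# Toy 1-D least-squares model y ~ theta * x; augmentation jitters the input.
loss = lambda th, x, y: (th * x - y) ** 2
grad = lambda th, x, y: 2 * (th * x - y) * x
augment = lambda x, rng: x + rng.normal(scale=0.1)

theta, rng = 0.0, np.random.default_rng(0)
for _ in range(200):
    theta = maxup_step(theta, 1.0, 2.0, augment, loss, grad, rng=rng)
# theta is driven toward the least-squares solution (~2.0) while always
# fitting the hardest augmented copy at each step
```

In the paper's setting the augmented copies would be CutMix/SelfMix variants of a support-query task rather than scalar jitters, but the update structure is the same.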
5.2 Augmentation performance
In this section, we investigate the performance of a few-shot classifier under different augmentation strategies, via three test cases that check the training performance when data is sampled from the support, query and task data, respectively. Our approach is similar to techniques adopted by [Ni2021DataAF, Kye2020TransductiveFL, Seo2021SelfAugmentationGD] that examine the impact of augmentation on a diverse set of data combinations.
Case 1: We trained the model with an equal number of support and query data, as indicated in Table 1, so as to establish a baseline performance for the model. We use this strategy to compare the impact of either data pool (support or query) when one of the augmented pairs is reduced.
Case 2: We initiated training of the classifier by randomly sampling 5 and 10 unique samples per class of the support data while using the entire query data pool. Using this approach, we reduced the influence of the support data in order to examine the impact of the diverse pool of query data on the classifier. Our findings, reflected in Table 1, show accuracy changes of only about 2%, a clear indication that augmentation of the query data plays a more significant role in overall model performance. In contrast, when we reduced the amount of query data while maintaining the initially set cap on support data, we recorded a decline in accuracy.
Case 3: To evaluate the impact of task augmentation, we used the CIFAR-FS data to initially allocate 10 distinct 5-way classification tasks (252 combinations) before training, while the support and query datasets were each maintained at 500. We observed a decline in performance. However, as we increased the amount of task data, significant improvement was recorded, which confirms that task augmentation is crucial in few-shot learning.
In summary, we broke down the few-shot learning process to determine the influence of support, query and task augmentation, respectively. Our findings confirm that our baseline learner is most sensitive to query data [Ni2021DataAF]. In addition, task augmentation provided a significant gain (about 2%) that cannot be overlooked.
5.3 Augmentation modes
This section builds on the findings of Section 5.1, where we established three core data augmentation cases: support, query and task data augmentation. Similar to [Ni2021DataAF, Gong2020MaxUpAS, Dabouei2020SuperMixST], we used the CutMix, SelfMix, MixUp, Random Crop and Horizontal Flip augmentation methods on the support, query and task datasets, respectively. We identified the best augmentation combinations for a few-shot learner and, from our findings, picked the best strategy to determine which mode of augmentation suits a DBT-regularized few-shot learner. To start with, we used the R2D2 base learner [Bertinetto2019MetalearningWD] and the CIFAR-FS dataset to evaluate augmentation performance on support, query and task augmentations, as shown in Table 1. Our findings show that the pair of CutMix and SelfMix augmentation produces the best results, with over 2.5% accuracy improvement [Ni2021DataAF]; other approaches lag behind by about 3% for both the 1-shot and 5-shot cases. Secondly, since CutMix and SelfMix stand out as the best augmentation approaches for our setup, we used them as bases for combining augmentations over the three data cases (support, query and task, respectively), as shown in Table 2. Model performance improved significantly, with the best case occurring when CutMix (query) is combined with SelfMix (support).
5.4 DBTregularization with data augmentation
As discussed in Section 1, DBT-based regularization improves model generalization and intra-class feature expressiveness. Data augmentation, on the other hand, creates sufficient data diversity, which helps mitigate overfitting. In this section, we highlight the collective benefits of combining a DBT-based regularizer with augmentation strategies for few-shot learning, using different datasets.
Accuracy results with different datasets: In this section, we set up a testing scheme where we evaluate our method over four runs, similar to the technique applied in [Tian2020RethinkingFI]. We computed the mean accuracy over every run; the experiments were conducted on the Stanford Dogs, Stanford Cars, miniImageNet and CIFAR-FS datasets, respectively, as shown in Table 4 and Table 3. Our accuracy results were maintained at 80-88% for 5-shot and in the range of 65-68% for 1-shot, as shown in Figure 7 and Figure 8. Our baseline integrated with the DBT-based regularizer, the "DBT-baseline" model, performs about 2% better than the state of the art without data augmentation. Applying the CutMix and SelfMix augmentation on the query ("Q") and support ("S") datasets shows significant improvement. Rotation ("R") and Horizontal Flip ("HF") are integrated into the CutMix and SelfMix data augmentation modes, respectively, as indicated in Table 3.
Improvement with MaxUp augmentation: In this section, we evaluate the performance of our model with the MaxUp approach for both 1-shot and 5-shot classification. We use a similar experimental setting to that described in [Ni2021DataAF], at different augmentation pool sizes. Figure 6 and Table 4 depict the impact of MaxUp with the augmentation strategies, denoted generally as "Aug", for both training and validation data. We also show results for the baseline model without augmentation (DBT-baseline), and with CutMix and MaxUp augmentation for different query data schemes. We observe from Figure 6 (top right) that the generalization gap shrinks considerably and, by implication, overfitting is minimized when the MaxUp strategy is adopted. MaxUp also adds an extra boost of about 2.3% in performance on average.
Comparison with different methods: We compared our results against different methods [Satorras2018FewShotLW, Chen2020MultiscaleAT, Huang2019LowRankPA, Li2019RevisitingLD], as shown in Table 4. We observed that [Tian2020RethinkingFI] is closest to ours, but we outperform their approach significantly in the 5-shot cases by over 6% on average. We recorded better performance than GNN [Satorras2018FewShotLW] and MATANet [Chen2020ANM] in both the 5-way 1-shot and 5-way 5-shot settings: improvements of about 3.3%, 4.2% and 3.16% on Stanford Dogs, Stanford Cars and CIFAR-FS, respectively, for the 5-way 1-shot task, while for the 5-way 5-shot task our method achieved about 4.7%, 2.1% and 4.9% overall. Clearly, the MaxUp boost is significant in almost all cases.
6 Conclusion
We proposed a structured doubly block-Toeplitz (DBT) matrix-based model, termed OrthoShot, that imposes orthogonal regularization on the filters of the convolutional layers. Our approach is aimed at maintaining the stability of activations, preserving gradient norms, and enhancing the feature transferability of deep networks. We also broke down the pipeline of a few-shot learner and, based on our findings, established three augmentation strategies that help minimize overfitting and increase data diversity. Our findings and empirical results confirm that a DBT-regularized model is beneficial to few-shot classification and meta-learning in general.
References
7 VC dimension and sample complexity
The Vapnik-Chervonenkis (VC) dimension is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a set of functions. In our setting, we focus on neural networks in which all weight matrices are of low displacement rank (LDR), such as Toeplitz-like, Hankel-like, Vandermonde-like and Cauchy-like matrices.
7.1 Bounding VC dimension
Theorem 1: For input x and parameter vector θ, let f(x; θ) denote the output of the network, and let F be the class of functions {f(·; θ)}. Let W_i be the number of parameters up to layer i, i.e., the total number of parameters in layers 1, 2, …, i. We define the depth-effective path as:
(10) 
Then the total number of computational units is given as:
(11) 
in line with the works of [Bartlett1998AlmostLV, Harvey2017NearlytightVB, Warren1968LowerBF, Anthony1999NeuralNL]. If k = 1, corresponding to piecewise-linear networks, it can be shown that:
(12) 
Lemma 1. Let p_1, …, p_m be polynomials of degree at most d in n variables; then we define:
(13) 
i.e., if K is the number of possible sign vectors given by the polynomials, then Equation 13 bounds K. To partition the parameter space for a fixed input, the output on each region of the partition is a fixed polynomial of the parameters.
Hence, we have:
(14) 
Hence, from Lemma 1, we can show by recursive construction that there is a partition of the parameter space such that the bound holds on each region [thomas2019learning]. The network output for a given input is a fixed polynomial of the parameters, which collectively gives:
(15) 
with the size of and equation (6) we get:
(16) 
We can take the logarithm and apply Jensen's inequality, defining:
(17)  
We bound this quantity using the bound on the displacement rank; since the degree of an LDR matrix is at most:
(18)  
where c is a constant; thus:
(19) 
To bound the VC dimension: if VCdim(sign(F)) = N, there exist N data points such that the output of the model can realize all sign patterns on them [Vapnik2000TheNO]. The bound on K then implies:
(20)  
This completes the proof. Since the number of parameters of an LDR network (e.g., a doubly block-Toeplitz-based network) is around the square root of the number of parameters of a network with unstructured layers, the sample complexity of an LDR network is much smaller than that of unstructured networks (e.g., plain CNNs), which is beneficial for deep networks.
7.2 Space and time complexity.
The proposed DBT model has a time complexity of O(n log n), and the small number of parameters also makes the network perform better with a limited amount of training data, which is crucial for few-shot learning [Li2017LowRankDE]. We also ran tests on the model backbone with two NVIDIA GeForce GTX 1080 Ti GPUs and a batch size of 64. Table 1 reflects the accuracy performance, and we see an overall model improvement of about 4%. Similar to [Christiani2019FastLH], the network used in our test consists of 4 convolutional layers, 1 fully-connected layer and one softmax layer. Rectified linear units (ReLU) are used as the activation units. Images were cropped to 24x24 and augmented with horizontal flips, rotation, and scaling transformations. We used an initial learning rate of 0.0001 and trained for 800, 400 and 100 epochs with the respective default weight decays. Our efficient DBT-based approach obtains a test error of 6.61%, compared to 5.26% obtained by the conventional CNN model. At the same time, the DBT-based network is 4x more space-efficient and 1.2x more time-efficient than the conventional CNN-based model.
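The space/time advantage quoted above rests on the classical fact that an n x n Toeplitz matrix is defined by 2n−1 parameters and can be applied to a vector in O(n log n) time via a circulant embedding and the FFT, rather than the O(n^2) dense product. A small NumPy sketch (sizes arbitrary):

```python
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    """Multiply the n x n Toeplitz matrix with first column c and first
    row r (r[0] == c[0]) by x in O(n log n), by embedding it in a
    2n-circulant and using the FFT."""
    n = len(x)
    # First column of the circulant embedding: [c ; 0 ; r reversed (minus r[0])]
    col = np.concatenate([c, [0.0], r[:0:-1]])
    y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return y[:n].real

rng = np.random.default_rng(0)
n = 8
c, r = rng.standard_normal(n), rng.standard_normal(n)
r[0] = c[0]  # Toeplitz: the diagonal element is shared
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)]
              for i in range(n)])
x = rng.standard_normal(n)
assert np.allclose(T @ x, toeplitz_matvec_fft(c, r, x))
```

The same FFT trick extends block-wise to doubly block-Toeplitz matrices, which is the source of the fast matrix-vector multiplication exploited by the DBT layers.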
7.3 Optimization setup.
An SGD optimizer with a momentum of 0.9 and weight decay was used for our setup. We used a learning rate initialized at 0.0001, with a decay factor of 0.1 applied for all datasets. We trained for 100 epochs on miniImageNet, 200 epochs on CIFAR-FS, and 150 epochs each on Stanford Dogs and Stanford Cars.
7.4 Architecture
The works of [Oreshkin2018TADAMTD, Lee2019MetaLearningWD, Tian2020RethinkingFI]
used a ResNet-12 backbone for their models; we use a similar structure but replace the convolutional layers with doubly block-Toeplitz matrices. The network consists of 4 residual blocks with 3x3 kernels. A 2x2 max-pooling layer is applied after each of the first 3 blocks, and a global average-pooling layer on top of the fourth block generates the feature embeddings. Similar to
[Thomas2018LearningCT], we used spectral regularization and changed the number of filters from (64, 128, 256, 512) to (64, 160, 320, 640).
Usefulness of DBT regularization. The DBT matrix belongs to a class of structured matrices whose layers can be applied in about O(n log n) time, compared to unstructured convolutional layers, which are implemented in about O(n^2) time [Li2017LowRankDE]. The generic term structured matrix refers to an n x n matrix that can be described in fewer than n^2 parameters and is capable of fast operation with at most double the displacement rank, which is far simpler for computations [Li2017LowRankDE]. Hence, if F denotes a class of neural networks comprising DBT layers, W total parameters and piecewise-linear activations, we can measure the complexity, expressive power, richness, or flexibility of F via the Vapnik-Chervonenkis (VC) dimension [Vapnik2000TheNO, Bartlett2003VapnikChervonenkisDO].
For a simple classification problem of the form sign(f(x; θ)), the VC dimension of the class F is expressed as:
(21)   VCdim(sign(F)) = O(L W log W)
which matches the standard bound for unconstrained weight matrices [Bartlett1998AlmostLV, Harvey2017NearlytightVB, thomas2019learning].