1 Introduction
Deep learning models have achieved breakthroughs in classification tasks, setting state-of-the-art results in fields such as speech recognition (Chiu et al., 2018), machine translation (Vaswani et al., 2017), and computer vision (Huang et al., 2017). These breakthroughs have allowed industrial companies to adopt and integrate DNNs into nearly every segment of the technology industry, from personal assistants (Sadeh et al., 2019a) and search engines (Sadeh et al., 2019b) to critical applications such as self-driving cars (Do et al., 2018) and healthcare (Granovsky et al., 2018).

In the image classification task, the most common training approach is as follows: first, a convolutional neural network (CNN) is used to extract a representative vector, denoted here as the image representation vector (also known as the feature vector). Then, at the classification layer, this vector is projected onto a set of weight vectors of the different target classes to create the class scores, as depicted in Fig. 1. Last, a softmax function is applied to normalize the class scores. During training, the parameters of both the CNN and the classification layer are updated to minimize the cross-entropy loss. We refer to this procedure as the dot-product maximization approach, since such training ends up maximizing the dot-product between the image representation vector and the target class vector.

Recently, it was demonstrated that despite the excellent performance of the dot-product maximization approach, it does not necessarily encourage discriminative learning of features, nor does it enforce intra-class compactness and inter-class separability (Liu et al., 2016; Wang et al., 2017; Liu et al., 2017). Intra-class compactness indicates how closely image representations from the same class relate to each other, whereas inter-class separability indicates how far apart image representations from different classes are.
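The dot-product maximization pipeline described above can be sketched in a few lines; this is a minimal sketch, with illustrative sizes and a random vector standing in for the CNN encoder output.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(f, W, y):
    """Dot-product maximization: project the image representation f onto the
    class vectors (columns of W) to get the logits, then apply softmax + NLL."""
    logits = W.T @ f                    # one score per class
    return -np.log(softmax(logits)[y])

rng = np.random.default_rng(0)
feat_dim, num_classes = 64, 10          # illustrative sizes
f = rng.normal(size=feat_dim)           # stand-in for the CNN output
W = rng.normal(size=(feat_dim, num_classes))
loss = cross_entropy_loss(f, W, y=3)    # during training, both the CNN and W are updated
```

In the actual training procedure both the encoder parameters and `W` receive gradients from this loss; here the forward pass alone is shown.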
Several works have proposed different approaches to address these caveats (Liu et al., 2016, 2017; Wang et al., 2017, 2018b, 2018a). One of the most effective yet most straightforward solutions is NormFace (Wang et al., 2017), which maximizes the cosine-similarity between vectors by normalizing both the image and class vectors. However, the authors found that when maximizing the cosine-similarity directly, the models fail to converge, and hypothesized that the cause is the bounded range of the logits vector. To allow convergence, the authors introduced a scaling factor that multiplies the logits vector. This approach has been widely adopted (Wang et al., 2018b; Wojke and Bewley, 2018; Deng et al., 2019; Wang et al., 2018a; Fan et al., 2019). Here we refer to it as the cosine-similarity maximization approach.

This paper focuses on redesigning the classification layer and on its role when kept fixed during training. We show that the visual similarity between classes is implicitly captured by the class vectors when they are learned by maximizing either the dot-product or the cosine-similarity between the image representation vector and the class vectors, and that the class vectors of visually similar categories end up close in angle in the space. We investigate the effect of excluding the class vectors from training and simply drawing them randomly, distributed over a hypersphere. We demonstrate that this process, which eliminates the visual similarities from the classification layer, boosts accuracy and improves inter-class separability (using either dot-product maximization or cosine-similarity maximization). Moreover, we show that fixing the class representation vectors can solve the convergence issues that arise in some cases (under the cosine-similarity maximization approach), and can further increase intra-class compactness. Last, we show that generalization to the learned concepts and robustness to noise are both unaffected by ignoring the visual similarities encoded in the class vectors.

Recent work by Hoffer et al. (2018) suggested fixing the classification layer to gain computational and memory efficiencies. The authors showed that the performance of models with a fixed classification layer is on par with, or slightly below (up to 0.5% in absolute accuracy), that of models with a non-fixed classification layer, while allowing a substantial reduction in the number of learned parameters. However, the authors compared dot-product maximization models with a non-fixed classification layer against cosine-similarity maximization models with a fixed classification layer and an integrated scaling factor. Such a comparison might not isolate the benefit of fixing the classification layer, since the dot-product maximization is linear with respect to the image representation while the cosine-similarity maximization is not. In contrast, we compare fixed against non-fixed dot-product maximization models, as well as fixed against non-fixed cosine-maximization models, and show that fixing the classification layer can improve accuracy by up to 4% in absolute terms. Moreover, while cosine-maximization models were proposed to improve intra-class compactness, we reveal that integrating a scaling factor that multiplies the logits decreases intra-class compactness. We demonstrate that by fixing the classification layer in cosine-maximization models, the models converge and achieve high performance without the scaling factor, and significantly improve their intra-class compactness.
The outline of this paper is as follows. In Sections 2 and 3, we formulate dot-product and cosine-similarity maximization models, respectively, and analyze the effects of fixing the class vectors. In Section 4, we describe the training procedure, compare the learning dynamics, and assess the generalization and robustness to corruptions of the evaluated models. We conclude the paper in Section 5.
2 Fixed Dot-Product Maximization
Assume an image classification task with $K$ possible classes. Denote the training set of $N$ examples by $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the $i$-th instance and $y_i$ is the corresponding class such that $y_i \in \{1, \dots, K\}$. In image classification, a dot-product maximization model consists of two parts. The first is the image encoder, denoted $f_\theta$, which is responsible for representing the input image as a $d$-dimensional vector, $f_\theta(x) \in \mathbb{R}^d$, where $\theta$ is a set of learnable parameters. The second part of the model is the classification layer, which is composed of learnable parameters denoted $W \in \mathbb{R}^{d \times K}$. The matrix $W$ can be viewed as $K$ vectors, $w_1, \dots, w_K$, where each vector $w_k$ can be considered the representation vector associated with the $k$-th class. For simplicity, we omit the bias terms and assume they can be included in $W$.
A consideration taken when designing the classification layer is the operation applied between the matrix $W$ and the image representation vector $f_\theta(x)$. Most commonly, a dot-product operation is used, and the resulting vector is referred to as the logits vector. For training the models, a softmax operation is applied over the logits vector, and the result is given to a cross-entropy loss which should be minimized. That is,

$$\mathcal{L} = -\log \frac{e^{\,w_{y_i}^{\top} f_\theta(x_i)}}{\sum_{k=1}^{K} e^{\,w_k^{\top} f_\theta(x_i)}} = -\log \frac{e^{\,\lVert w_{y_i}\rVert\,\lVert f_\theta(x_i)\rVert \cos\alpha_{y_i}}}{\sum_{k=1}^{K} e^{\,\lVert w_k\rVert\,\lVert f_\theta(x_i)\rVert \cos\alpha_k}} \qquad (1)$$

The equality holds since $w^{\top} f = \lVert w\rVert\,\lVert f\rVert \cos\alpha$, where $\alpha$ is the angle between the vectors $w$ and $f$.
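This identity is easy to verify numerically; the vectors below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=128)                # a class vector
f = rng.normal(size=128)                # an image representation vector

dot = w @ f
cos_alpha = dot / (np.linalg.norm(w) * np.linalg.norm(f))
# w^T f  ==  ||w|| * ||f|| * cos(alpha)
reconstructed = np.linalg.norm(w) * np.linalg.norm(f) * cos_alpha
```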
We trained three dot-product maximization models with different well-known CNN architectures over four datasets, varying in image size and number of classes, as described in detail in Section 4.1. Since these models optimize the dot-product between the image vector and its corresponding learnable class vector, we refer to them as non-fixed dot-product maximization models.
Inspecting the matrix $W$ of the trained models reveals that visually similar classes have their corresponding class vectors close in space. On the left panel of Fig. 2, we plot the cosine-similarity between the class vectors learned by the non-fixed model trained on the STL-10 dataset. It can be seen that the vectors representing vehicles are relatively close to each other, and far away from vectors representing animals. Furthermore, when we inspect the class vectors of non-fixed models trained on CIFAR-100 (100 classes) and Tiny ImageNet (200 classes), we find even larger similarities between vectors due to the high visual similarity between classes such as boy and girl, or apple and orange. By placing the vectors of visually similar classes close to each other, the inter-class separability is decreased. Moreover, we find a strong Spearman correlation between the distance of class vectors and the number of misclassified examples. On the right panel of Fig. 2, we plot the cosine-similarity between two class vectors, $w_i$ and $w_j$, against the number of examples from category $i$ that were wrongly classified as category $j$. As shown in the figure, as the class vectors get closer in space, the number of misclassifications increases. In STL-10, CIFAR-10, CIFAR-100, and Tiny ImageNet, we find correlations of 0.82, 0.77, 0.61, and 0.79, respectively (note that all possible class pairs were considered in the computation of the correlation). These findings reveal that as two class vectors get closer in space, the confusion between the two corresponding classes increases.
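The correlation analysis can be reproduced in outline as follows; this is a sketch with synthetic stand-ins for the measured pairwise similarities and confusion counts, and the rank-based Spearman computation assumes no ties.

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (no tie handling)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Synthetic stand-ins: one entry per ordered class pair (i, j).
pair_cosine = np.array([0.9, 0.7, 0.5, 0.2, -0.1, -0.3])  # cos(w_i, w_j)
confusions = np.array([40, 25, 18, 6, 3, 1])              # i classified as j
rho = spearman(pair_cosine, confusions)
```

With real measurements, `pair_cosine` would hold the cosine-similarity of each class-vector pair and `confusions` the corresponding counts from the test-set confusion matrix.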
Table 1: Classification accuracy of fixed vs. non-fixed dot-product maximization models.

Dataset  Classes  PreActResNet18 (Fixed / Non-Fixed)  ResNet18 (Fixed / Non-Fixed)  MobileNetV2 (Fixed / Non-Fixed)
STL-10 (96x96)  10  79.7% / 76.6%  82.5% / 78.1%  81.0% / 77.2%
CIFAR-10 (32x32)  10  94.1% / 94.3%  94.2% / 93.4%  93.5% / 93.1%
CIFAR-100 (32x32)  100  75.2% / 75.3%  75.9% / 74.9%  74.4% / 73.7%
Tiny ImageNet (64x64)  200  59.1% / 55.4%  60.4% / 58.9%  59.4% / 57.3%
We examined whether the models benefit from the high angular similarity between the vectors. We trained the same models, but instead of learning the class vectors, we drew them randomly, normalized them ($\lVert w_k \rVert = 1$), and kept them fixed during training. We refer to these models as the fixed dot-product maximization models. Since the target vectors are initialized randomly, the cosine-similarity between vectors is low even for visually similar classes; see the middle panel of Fig. 2. Notice that with the class vectors and bias term fixed during training, the model can minimize the loss in Eq. 1 only by optimizing the vector $f_\theta(x)$. The prediction is thus influenced mainly by the angle between $f_\theta(x)$ and the fixed $w_y$, since the magnitude of $f_\theta(x)$ multiplies the scores of all classes and the magnitude of each class vector is equal and set to 1. The model is therefore forced to optimize the angle of the image vector towards its randomized class vector.
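Drawing and freezing the class vectors can be sketched as follows; in high dimension, randomly drawn unit vectors are nearly orthogonal, so even visually similar classes start with low cosine-similarity (the sizes are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, num_classes = 512, 100        # illustrative sizes

# Draw class vectors at random, normalize each to unit L2 norm, then freeze them.
W = rng.normal(size=(feat_dim, num_classes))
W /= np.linalg.norm(W, axis=0, keepdims=True)

# Pairwise cosine-similarities between the fixed class vectors.
sims = W.T @ W
off_diag = sims[~np.eye(num_classes, dtype=bool)]
max_abs_sim = float(np.abs(off_diag).max())   # small: vectors are spread apart
```

In a training loop, `W` would simply be excluded from the optimizer's parameter list.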
Table 1 compares the classification accuracy of models with a fixed and a non-fixed classification layer. The results suggest that learning the matrix $W$ during training is not necessarily beneficial, and might reduce accuracy when the number of classes is high or when the classes are visually close. Additionally, we empirically found that models with fixed class vectors can be trained with a higher learning rate; due to space limitations, we report these results in the appendix (Tables 7, 8, and 9). By randomly drawing the class vectors, we ignore possible visual similarities between classes and force the models to minimize the loss by increasing the inter-class separability and encoding images from visually similar classes into vectors far apart in space; see Fig. 3.
3 Fixed Cosine-Similarity Maximization
Recently, cosine-similarity maximization models were proposed by Wang et al. (2017) for the face verification task. The authors maximized the cosine-similarity, rather than the dot-product, between the image vector and its corresponding class vector. That is,

$$\mathcal{L} = -\log \frac{e^{\cos\alpha_{y_i}}}{\sum_{k=1}^{K} e^{\cos\alpha_k}}, \qquad \cos\alpha_k = \frac{w_k^{\top} f_\theta(x_i)}{\lVert w_k\rVert\,\lVert f_\theta(x_i)\rVert} \qquad (2)$$
Comparing the right-hand side of Eq. 2 with Eq. 1 shows that the cosine-similarity maximization model simply requires normalizing $f_\theta(x_i)$ and each of the class representation vectors $w_k$, by dividing them by their norm during the forward pass. The main motivation for this reformulation is the ability to learn more discriminative features in face verification by encouraging intra-class compactness and enlarging the inter-class separability. The authors showed that dot-product maximization models learn a radial feature distribution; thus, the inter-class separability and intra-class compactness are not optimal (for more details, see the discussion in Wang et al. (2017)). However, the authors found that cosine-similarity maximization models as given in Eq. 2 fail to converge, and added a scaling factor $S$ that multiplies the logits vector as follows:

$$\mathcal{L} = -\log \frac{e^{S \cdot \cos\alpha_{y_i}}}{\sum_{k=1}^{K} e^{S \cdot \cos\alpha_k}} \qquad (3)$$
This reformulation achieves improved results on the face verification task, and many recent variants also integrate the scaling factor for convergence when optimizing the cosine-similarity (Wang et al., 2018b; Wojke and Bewley, 2018; Wang et al., 2018a; Shalev et al., 2018; Deng et al., 2019; Fan et al., 2019).
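The scaled cosine-similarity logits of Eq. 3 can be sketched as follows (dimensions are illustrative); with $S = 1$ this reduces to Eq. 2.

```python
import numpy as np

def cosine_logits(f, W, S=1.0):
    """Normalize the feature and each class vector, then scale the cosines by S."""
    f_hat = f / np.linalg.norm(f)
    W_hat = W / np.linalg.norm(W, axis=0, keepdims=True)
    return S * (W_hat.T @ f_hat)        # entry k equals S * cos(alpha_k)

rng = np.random.default_rng(2)
f = rng.normal(size=64)
W = rng.normal(size=(64, 10))
logits_s1 = cosine_logits(f, W, S=1.0)    # bounded in [-1, 1]
logits_s20 = cosine_logits(f, W, S=20.0)  # same directions, larger gaps
```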
According to Wang et al. (2017), cosine-similarity maximization models fail to converge when $S = 1$ due to the low range of the logits vector, where each cell is bounded in $[-1, 1]$. This low range prevents the predicted probabilities from getting close to 1 during training; as a result, the distribution over target classes remains close to uniform, and the loss is trapped at a very high value on the training set. Intuitively, this may sound like a reasonable explanation for why directly maximizing the cosine-similarity ($S = 1$) fails to converge. Note that even if an example is correctly classified and well separated, in the best scenario it will achieve a cosine-similarity of 1 with its ground-truth class vector, while for all other classes the cosine-similarity would be $-1$. Thus, for a classification task with $K$ classes, the predicted probability for this example would be:

$$P(y_i \mid x_i) = \frac{e^{1}}{e^{1} + (K - 1)\,e^{-1}} \qquad (4)$$

Notice that if the number of classes is $K = 200$, the predicted probability of this correctly classified example would be at most 0.035, and cannot be further optimized towards 1. As a result, the loss function yields a high value for a correctly classified example, even if its image vector is placed precisely in the same direction as its ground-truth class vector.
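The bound is easy to check numerically: with $K = 200$ classes and $S = 1$, even a perfectly placed example cannot reach a predicted probability above roughly 0.035.

```python
import math

def best_case_prob(K, S=1.0):
    """Best-case softmax probability of the true class: cosine-similarity of 1
    with the ground-truth vector and -1 with every other class vector (Eq. 4)."""
    return math.exp(S) / (math.exp(S) + (K - 1) * math.exp(-S))

p = best_case_prob(K=200, S=1.0)        # roughly 0.035
```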
As in the previous section, we trained the same models over the same datasets, but instead of optimizing the dot-product, we optimized the cosine-similarity by normalizing $f_\theta(x_i)$ and $w_k$ in the forward pass. We denote these models as non-fixed cosine-similarity maximization models. Additionally, we trained the same cosine-similarity maximization models with fixed random class vectors, denoting these models as fixed cosine-similarity maximization models. In all models (fixed and non-fixed) we set $S = 1$ to directly maximize the cosine-similarity; results are shown in Table 2.

Surprisingly, we reveal that the low range of the logits vector is not what prevents cosine-similarity maximization models from converging. As can be seen in the table, fixed cosine-maximization models achieve accuracies higher by up to 53% (absolute) compared to non-fixed models. Moreover, fixed cosine-maximization models with $S = 1$ can also outperform dot-product maximization models. This finding demonstrates that even though the logits are bounded in $[-1, 1]$, the models can still learn high-quality representations and decision boundaries.
Table 2: Classification accuracy of fixed vs. non-fixed cosine-similarity maximization models with $S = 1$.

Dataset  Classes  PreActResNet18 (Fixed / Non-Fixed)  ResNet18 (Fixed / Non-Fixed)  MobileNetV2 (Fixed / Non-Fixed)
STL-10 (96x96)  10  83.7% / 58.9%  83.2% / 51.6%  77.1% / 54.1%
CIFAR-10 (32x32)  10  93.3% / 80.1%  93.2% / 70.4%  92.2% / 57.6%
CIFAR-100 (32x32)  100  74.6% / 29.9%  74.4% / 27.1%  72.6% / 19.7%
Tiny ImageNet (64x64)  200  50.4% / 19.9%  47.6% / 16.7%  41.6% / 13.2%
We further investigated the effect of $S$ and trained for comparison the same fixed and non-fixed models, this time using grid-search for the best performing $S$ value. As can be seen in Table 3, increasing the scaling factor allows non-fixed models to achieve higher accuracies over all datasets. Yet, considering model accuracy, there is still no benefit to learning the class representation vectors over randomly drawing them and fixing them during training.
To better understand what prevents non-fixed cosine-maximization models from converging when $S = 1$, we compared these models with the same models trained with the optimal scalar $S$. For each model, we measured the distances between its learned class vectors, and compared these distances to demonstrate the effect of $S$ on them. Interestingly, we found that as $S$ increases, the cosine-similarity between the class vectors decreases, meaning that with a larger $S$ the class vectors are pushed further apart from each other. Compare, for example, the left and middle panels of Fig. 4, which show the cosine-similarity between the class vectors of models trained on STL-10 with $S = 1$ and with the optimal $S$, respectively.

On the right panel of Fig. 4, we plot the number of misclassifications as a function of the cosine-similarity between the class vectors of the non-fixed cosine-maximization model trained on STL-10 with $S = 1$. It can be seen that the confusion between classes is high when the angular distance between them is low. As in the previous section, we observed strong correlations between the closeness of the class vectors and the number of misclassifications. We found correlations of 0.85, 0.87, 0.81, and 0.83 in models trained on STL-10, CIFAR-10, CIFAR-100, and Tiny ImageNet, respectively.
Table 3: Classification accuracy of fixed vs. non-fixed cosine-similarity maximization models with the best performing $S$ found by grid-search.

Dataset  Classes  PreActResNet18 (Fixed / Non-Fixed)  ResNet18 (Fixed / Non-Fixed)  MobileNetV2 (Fixed / Non-Fixed)
STL-10 (96x96)  10  83.8% / 79.0%  83.9% / 79.4%  82.4% / 81.5%
CIFAR-10 (32x32)  10  93.5% / 93.5%  93.3% / 93.0%  92.6% / 92.8%
CIFAR-100 (32x32)  100  74.6% / 74.8%  74.6% / 73.4%  73.7% / 72.6%
Tiny ImageNet (64x64)  200  53.8% / 54.5%  54.3% / 55.6%  53.9% / 53.5%
By integrating the scaling factor into Eq. 4 we get

$$P(y_i \mid x_i) = \frac{e^{S}}{e^{S} + (K - 1)\,e^{-S}} \qquad (5)$$

Note that by increasing $S$, the predicted probability in Eq. 5 increases. This holds even when the cosine-similarity between $f_\theta(x_i)$ and $w_{y_i}$ is less than 1. When $S$ is set to a large value, the gaps between the logits increase, and the predicted probability after the softmax gets closer to 1. As a result, the model is discouraged from optimizing the cosine-similarity between the image representation and its ground-truth class vector towards 1, since the loss is already close to 0. In Table 4, we show that as we increase $S$, the cosine-similarity between the image vectors and their predicted class vectors decreases.
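This saturation effect can be illustrated directly; in the sketch below the values ($K = 10$ classes, a true-class cosine of 0.6, other-class cosines of 0) are hypothetical, chosen to show that a large $S$ pushes the predicted probability to nearly 1 even though the cosine-similarity is far from 1.

```python
import math

def predicted_prob(cos_true, cos_other, K, S):
    """Softmax probability of the true class under scaled cosine logits."""
    num = math.exp(S * cos_true)
    return num / (num + (K - 1) * math.exp(S * cos_other))

p_s1 = predicted_prob(0.6, 0.0, K=10, S=1.0)    # far from 1: pressure to improve remains
p_s40 = predicted_prob(0.6, 0.0, K=10, S=40.0)  # saturated: the loss is already near 0
```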
These observations can explain why non-fixed models with $S = 1$ fail to converge. By setting $S$ to a large scalar, the image vectors are spread around their class vectors with a larger degree, preventing the class vectors from getting close to each other. As a result, the inter-class separability increases and the misclassification rate between visually similar classes decreases. In contrast, setting $S = 1$ allows the models to place the class vectors of visually similar classes closer in space, leading to a high number of misclassifications. However, a disadvantage of increasing $S$ to a large value is that the intra-class compactness is violated, since image vectors from the same class are spread out and encoded relatively far from each other; see Fig. 5.
Fixed cosine-maximization models successfully converge with $S = 1$ since the class vectors are initially far apart in space. By randomly drawing the class vectors, the models are required to encode images from visually similar classes into vectors that are far apart; therefore, the inter-class separability is high. Additionally, the intra-class compactness improves since, with $S$ set to 1, the models are encouraged to maximize the cosine-similarity towards 1 and place image vectors from the same class close to their class vector. We validated this empirically by measuring the average cosine-similarity between image vectors and their predicted classes' vectors in fixed cosine-maximization models with $S = 1$. We obtained an average cosine-similarity of roughly 0.95 in all experiments, meaning that images from the same class were encoded compactly near their class vectors.
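The compactness measurement can be sketched as follows; the embeddings below are synthetic stand-ins generated near their class vectors, so the resulting average is high by construction.

```python
import numpy as np

def mean_cos_to_predicted(F, W):
    """Average cosine-similarity between each image vector (rows of F) and the
    class vector selected by the model (argmax over cosine logits)."""
    F_hat = F / np.linalg.norm(F, axis=1, keepdims=True)
    W_hat = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos = F_hat @ W_hat                         # (num_images, num_classes)
    pred = cos.argmax(axis=1)
    return float(cos[np.arange(len(F)), pred].mean())

rng = np.random.default_rng(3)
W = rng.normal(size=(128, 10))
W /= np.linalg.norm(W, axis=0, keepdims=True)   # fixed unit-norm class vectors
labels = rng.integers(0, 10, size=200)
F = W[:, labels].T + 0.02 * rng.normal(size=(200, 128))  # compact embeddings
compactness = mean_cos_to_predicted(F, W)
```

With real models, `F` would hold the encoder outputs for the test images.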
In conclusion, although non-fixed cosine-similarity maximization models were proposed to remedy the caveats of dot-product maximization by improving inter-class separability and intra-class compactness, their performance is significantly lower without the integration of a scaling factor that multiplies the logits vector. Integrating the scaling factor and setting it to $S > 1$ decreases intra-class compactness and introduces a trade-off between accuracy and intra-class compactness. By fixing the class vectors, cosine-similarity maximization models can have both high performance and improved intra-class compactness. This means that the multiple previous works (Wang et al., 2018b; Wojke and Bewley, 2018; Deng et al., 2019; Wang et al., 2018a; Fan et al., 2019) that adopted the cosine-maximization method and integrated a scaling factor for convergence might benefit from improved results by fixing the class vectors.
Table 4: Average cosine-similarity between the image vectors and their predicted classes' vectors for different values of $S$.

S  CIFAR-10  CIFAR-100  STL-10  Tiny ImageNet
1  0.99  0.99  0.97  0.98
20  0.88  0.63  0.77  0.62
40  0.36  0.53  0.21  0.37
4 Generalization and Robustness to Corruptions
In this section, we explore the generalization of the evaluated models to the learned concepts and measure their robustness to image corruptions. We do not aim to set state-of-the-art results, but rather to validate that fixing the class vectors of a model keeps its generalization ability and robustness to corruptions competitive.
4.1 Training Procedure
To evaluate the impact of ignoring the visual similarities in the classification layer, we evaluated the models on CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), STL-10 (Coates et al., 2011), and Tiny ImageNet (https://tiny-imagenet.herokuapp.com/), containing 10, 100, 10, and 200 classes, respectively. For each dataset, we trained ResNet18 (He et al., 2016a), PreActResNet18 (He et al., 2016b), and MobileNetV2 (Sandler et al., 2018) models with fixed and non-fixed class vectors. All models were trained using stochastic gradient descent with momentum. We used the standard normalization and data augmentation techniques. Due to space limitations, the values of the hyperparameters used for training the models can be found in our code repository. We normalized the randomly drawn, fixed class representation vectors by dividing them by their $L_2$ norm. All reported results are averaged over 3 runs.

4.2 Generalization
To measure how well the models generalize to the learned concepts, we evaluated them on images containing objects from the same target classes as their training dataset. For evaluating the models trained on STL-10 and CIFAR-100, we manually collected 2000 and 6000 images, respectively, from the publicly available Open Images V4 dataset (Krasin et al., 2017). For CIFAR-10 we used the CIFAR-10.1 dataset (Recht et al., 2018). All collected sets contain an equal number of images per class. We omitted models trained on Tiny ImageNet from this evaluation since we were not able to collect images for all classes appearing in this set. Table 5 summarizes the results for all the models. The results suggest that excluding the class representation vectors from training does not decrease generalization to the learned concepts.
Table 5: Accuracy on the generalization sets for fixed and non-fixed models.

Training set  Evaluation set  Dot-product (Fixed / Non-Fixed)  Cosine-similarity (Fixed / Non-Fixed)
STL-10  STL-10 Gen  53.7% / 50.9%  54.1% / 50.4%
CIFAR-10  CIFAR-10.1  87.1% / 86.9%  85.7% / 85.9%
CIFAR-100  CIFAR-100 Gen  34.8% / 32.9%  35.1% / 35.4%
4.3 Robustness to corruptions
Next, we verified that excluding the class vectors from training does not decrease the models' robustness to image corruptions. To this end, we applied three types of algorithmically generated corruptions to the test sets and evaluated the accuracy of the models on the corrupted sets. The corruptions we applied are impulse (salt-and-pepper) noise, JPEG compression, and defocus blur. Corruptions were generated using imgaug (Jung, 2018) and are available in our repository. The results, shown in Table 6, suggest that randomly drawn fixed class vectors allow models to remain highly robust to image corruptions.
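For reference, the impulse (salt-and-pepper) corruption applied by imgaug can be approximated in a few lines of NumPy; the corruption fraction below is an arbitrary illustrative choice.

```python
import numpy as np

def salt_and_pepper(img, frac=0.05, seed=0):
    """Set a random `frac` of the pixels to pure white (salt) or black (pepper)."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    corrupt = rng.random(img.shape[:2]) < frac   # which pixels to corrupt
    salt = rng.random(img.shape[:2]) < 0.5       # white or black, 50/50
    out[corrupt & salt] = 255
    out[corrupt & ~salt] = 0
    return out

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # a toy RGB image
noisy = salt_and_pepper(img, frac=0.05)
```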
Table 6: Accuracy on the corrupted test sets for fixed and non-fixed models.

Corruption type  Test set  Dot-product (Fixed / Non-fixed)  Cosine-similarity (Fixed / Non-fixed)
Salt-and-pepper  STL-10  52.9% / 49.4%  57.2% / 44.9%
Salt-and-pepper  CIFAR-10  52.6% / 49.5%  49.8% / 52.9%
Salt-and-pepper  CIFAR-100  36.1% / 36.6%  41.3% / 40.9%
Salt-and-pepper  Tiny ImageNet  30.8% / 26.1%  28.6% / 27.9%
JPEG compression  STL-10  78.1% / 75.8%  76.1% / 73.3%
JPEG compression  CIFAR-10  71.8% / 71.5%  71.4% / 72.4%
JPEG compression  CIFAR-100  43.8% / 42.3%  44.4% / 43.3%
JPEG compression  Tiny ImageNet  38.8% / 33.0%  35.1% / 34.4%
Defocus blur  STL-10  37.7% / 35.0%  39.4% / 36.1%
Defocus blur  CIFAR-10  41.3% / 41.1%  41.9% / 41.1%
Defocus blur  CIFAR-100  24.8% / 24.4%  22.4% / 23.1%
Defocus blur  Tiny ImageNet  16.4% / 13.4%  12.9% / 12.4%
5 Conclusion
In this paper, we proposed randomly drawing the parameters of the classification layer and excluding them from training. We showed that this can improve the inter-class separability, intra-class compactness, and overall accuracy of the model when maximizing either the dot-product or the cosine-similarity between the image representation and the class vectors. We analyzed the cause that prevents non-fixed cosine-maximization models from converging, and presented the generalization abilities of models with fixed and non-fixed classification layers.
References
Chiu et al. (2018). State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4774-4778.

Coates et al. (2011). An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215-223.

Deng et al. (2019). ArcFace: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690-4699.

Do et al. (2018). Real-time self-driving car navigation using deep neural network. In 2018 4th International Conference on Green Technology and Sustainable Development (GTSD), pp. 7-12.

Fan et al. (2019). SphereReID: deep hypersphere manifold embedding for person re-identification. Journal of Visual Communication and Image Representation 60, pp. 51-58.

Granovsky et al. (2018). Actigraphy-based sleep/wake pattern detection using convolutional neural networks. arXiv preprint arXiv:1802.07945.

He et al. (2016a). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778.

He et al. (2016b). Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630-645.

Hoffer et al. (2018). Fix your classifier: the marginal value of training the last weight layer. arXiv preprint arXiv:1801.04540.

Huang et al. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708.

Jung (2018). imgaug. https://github.com/aleju/imgaug. [Online; accessed 30-Oct-2018].

Krasin et al. (2017). OpenImages: a public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2(3), pp. 2-3.

Krizhevsky et al. (2009). Learning multiple layers of features from tiny images.

Liu et al. (2017). SphereFace: deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 212-220.

Liu et al. (2016). Large-margin softmax loss for convolutional neural networks. In ICML, Vol. 2, pp. 7.

Recht et al. (2018). Do CIFAR-10 classifiers generalize to CIFAR-10? arXiv preprint arXiv:1806.00451.

Sadeh et al. (2019a). Generating diverse and informative natural language fashion feedback. arXiv preprint arXiv:1906.06619.

Sadeh et al. (2019b). Joint visual-textual embedding for multimodal style search. arXiv preprint arXiv:1906.06620.

Sandler et al. (2018). MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520.

Shalev et al. (2018). Out-of-distribution detection using multiple semantic label representations. In Advances in Neural Information Processing Systems, pp. 7375-7385.

Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008.

Wang et al. (2018a). Additive margin softmax for face verification. IEEE Signal Processing Letters 25(7), pp. 926-930.

Wang et al. (2017). NormFace: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM International Conference on Multimedia, pp. 1041-1049.

Wang et al. (2018b). CosFace: large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265-5274.

Wojke and Bewley (2018). Deep cosine metric learning for person re-identification. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 748-756.
Appendix A: Various learning rate initializations
Table 7: Accuracy for various initial learning rates ("NaN" marks runs that diverged).

Dataset  LR=0.1 (Fixed / Non-Fixed)  LR=0.01 (Fixed / Non-Fixed)  LR=0.001 (Fixed / Non-Fixed)  LR=0.0001 (Fixed / Non-Fixed)
STL-10 (96x96)  79.4% / 76.6%  79.7% / 75.9%  78.2% / 76.5%  70.6% / 68.9%
CIFAR-100 (32x32)  72.2% / NaN  75.2% / 75.3%  74.4% / 72.5%  - / -
Tiny ImageNet (64x64)  59.1% / NaN  55.6% / 55.4%  52.9% / 52.1%  - / -
Table 8: Accuracy for various initial learning rates.

Dataset  LR=0.1 (Fixed / Non-Fixed)  LR=0.01 (Fixed / Non-Fixed)  LR=0.001 (Fixed / Non-Fixed)  LR=0.0001 (Fixed / Non-Fixed)
STL-10 (96x96)  82.5% / 75.9%  81.2% / 78.1%  79.1% / 76.5%  77.9% / 75.8%
CIFAR-100 (32x32)  74.8% / 73.1%  75.9% / 74.4%  74.5% / 72.6%  - / -
Tiny ImageNet (64x64)  60.1% / 59.0%  58.2% / 57.1%  54.4% / 53.9%  - / -
Table 9: Accuracy for various initial learning rates.

Dataset  LR=0.1 (Fixed / Non-Fixed)  LR=0.01 (Fixed / Non-Fixed)  LR=0.001 (Fixed / Non-Fixed)  LR=0.0001 (Fixed / Non-Fixed)
STL-10 (96x96)  80.8% / 75.9%  81.0% / 76.1%  74.4% / 74.4%  73.1% / 73.9%
CIFAR-100 (32x32)  67.5% / 64.2%  75.1% / 73.8%  70.8% / 70.4%  - / -
Tiny ImageNet (64x64)  56.8% / 55.1%  59.3% / 57.1%  51.2% / 49.9%  - / -