Hyperspherically Regularized Networks for BYOL Improves Feature Uniformity and Separability

04/29/2021 ∙ by Aiden Durrant, et al. ∙ University of Aberdeen

Bootstrap Your Own Latent (BYOL) introduced an approach to self-supervised learning avoiding the contrastive paradigm and subsequently removing the computational burden of negative sampling. However, feature representations under this paradigm are poorly distributed on the surface of the unit-hypersphere representation space compared to contrastive methods. This work empirically demonstrates that feature diversity enforced by contrastive losses is beneficial when employed in BYOL, and as such, provides greater inter-class feature separability. Therefore to achieve a more uniform distribution of features, we advocate the minimization of hyperspherical energy (i.e. maximization of entropy) in BYOL network weights. We show that directly optimizing a measure of uniformity alongside the standard loss, or regularizing the networks of the BYOL architecture to minimize the hyperspherical energy of neurons can produce more uniformly distributed and better performing representations for downstream tasks.

1 Introduction

Unsupervised visual representational learning methods [Chen et al.2020a, Chen et al.2020b, Caron et al.2020, Grill et al.2020] have recently demonstrated performance on downstream tasks that continues to narrow the gap to supervised pre-training. This success is largely attributed to contrastive methods, in which representations of views (augmentations of an input image) are encouraged to be similar if they originate from the same image, and dissimilar if they do not [Oord et al.2018, He et al.2020]. The study of contrastive losses has shown that this repulsion effect between dissimilar views matches the distribution of features in representation space to a distribution of high entropy [Chen and Li2020], in other words, encouraging uniformity of representations in representation space [Wang and Isola2020]. This balancing of attraction and repulsion is the mechanism that allows contrastive methods to learn similar semantic features whilst avoiding collapse in representation space.

Figure 1: Visual depiction of the regularization of neurons $\hat{w}_i$ towards a minimum hyperspherical energy configuration on the unit hypersphere $\mathcal{S}^{m-1}$ [Liu et al.2018, Lin et al.2020].

More recently, alternative approaches have explored self-supervised learning that avoids the inherent computational difficulties imposed by contrastive methods' reliance on negative samples [Caron et al.2020]. One method in particular, Bootstrap Your Own Latent (BYOL) [Grill et al.2020], tasks an online network with predicting the representations of a target network, where each network is given a different view of the same image. This approach does away with negative views (views originating from different images), and with the negative term of contrastive losses. The theoretical understanding of how such networks avoid the seemingly inevitable collapsed equilibria, given no explicit mechanism analogous to the negative term of contrastive losses, is still being investigated [Richemond et al.2020].

Intrigued by this property and inspired by [Wang and Isola2020], we empirically observe that BYOL fails to distribute its representations as uniformly in $\ell_2$-normalized space (i.e. on the surface of a unit hypersphere) as its contrastive counterparts. As such, we ask: can BYOL benefit from mechanisms that introduce the feature uniformity found in contrastive methods? Furthermore, we investigate an alternative to the uniformity constraint posed by [Wang and Isola2020], aiming to retain the avoidance of negative sampling advocated in BYOL. We propose to utilize minimum hyperspherical energy (MHE) network regularization [Liu et al.2018] to enforce neuron (i.e. kernel) uniformity whilst being independent of, and therefore robust to, smaller batch sizes.

Our contributions are summarized as follows: i) we empirically show that BYOL distributes its features poorly in representation space compared to contrastive counterparts, and that distribution constraints like those in contrastive losses benefit feature representations in BYOL; ii) we propose to hyperspherically regularize the network to improve the diversity of neurons and subsequently achieve a greater diversity of representations, improving feature separability and subsequent performance on downstream tasks; iii) as a consequence, hyperspherically regularized BYOL networks maintain the benefit of avoiding contrastive negative terms, resulting in reduced performance drops at smaller batch sizes.

2 Related Work

2.1 Unsupervised Representational Learning

The recent popularity of discriminative unsupervised representational learning, specifically contrastive methods, has sparked keen interest in the theoretical understanding of its underpinnings, emerging from performance rivaling that of supervised methodologies [Chen et al.2020a, He et al.2020, Tian et al.2019]. Fundamentally, contrastive methods aim to minimize the distance in representation space between the representations of two views of the same image ('positive pair'), whilst maximizing the distance between views from different images ('negative pair') [Chopra et al.2005]. This ensures that the semantically relevant features encoded by representations of positive pairs are similar, whilst those of negative pairs are dissimilar.

Why these methods perform so well has only recently begun to be understood; notably, [Wang and Isola2020] prove that optimizing the contrastive loss under a unit $\ell_2$-norm constraint (restricting representation space to a unit hypersphere) is equivalent to optimizing a metric of alignment (distance between positive pairs) and uniformity (all feature vectors should be roughly uniformly distributed on the unit hypersphere). Additionally, [Chen and Li2020] extend this work, proposing a generic form of the contrastive loss and identifying the same relation of uniformity to the pairwise potential of a Gaussian kernel, matching representations to a prior distribution (of high entropy).

Lately, alternatives to contrastive methods [Caron et al.2020] have been proposed that alleviate some of the computational drawbacks associated with contrastive losses, primarily the necessity of large numbers of negative pairs, which generally requires increased batch sizes [Chen et al.2020a] or memory banks [He et al.2020]. Bootstrap Your Own Latent (BYOL) avoided the use of negative pairs via an 'online'/'target' network approach akin to Mean Teachers [Tarvainen and Valpola2017], where the 'online' network, together with an additional 'predictor' network, aims to predict the representations of a slowly updated 'target' copy of the 'online' network. However, it is not clear how these networks avoid collapsed representations; it has been hypothesized that Batch Normalization (BN) was the critical mechanism preventing collapse in BYOL [Fetterman and Albrecht2020], yet this hypothesis was refuted by showing that batch-independent normalization schemes still achieve comparable performance [Richemond et al.2020].

2.2 Minimal Hyperspherical Energy and Diversity Regularization

Many unsupervised representational methods learn representations constrained to lie on the surface of a unit hypersphere via an $\ell_2$-norm constraint, which leads to desirable traits [Xu and Durrett2018]. As aforementioned, [Wang and Isola2020, Caron et al.2020] prove that the negative term (repulsion of negative views) in the contrastive loss is equivalent to the minimization of the hyperspherical energy of representations. The minimization of hyperspherical energy, the Thomson problem [Thomson1904], is a well-studied problem in physics: finding the configuration of electrons with minimal electrostatic potential energy. This problem has also found a place in providing diversity regularization of neurons [Liu et al.2018, Lin et al.2020], avoiding undesired representation redundancy. Our work, however, investigates whether these regularization methodologies, by introducing greater feature diversity, can promote more uniformly distributed feature representations in BYOL.

3 Uniform Distribution of Features

3.1 Contrastive Learning

We begin by defining the contrastive loss in the notation style of [Wang and Isola2020], considering the popular case in which an encoder $f$ is trained and feature vectors are $\ell_2$-normalized onto the unit hypersphere $\mathcal{S}^{m-1}$:

$$\mathcal{L}_{\text{contrastive}}(f; \tau, M) \triangleq \mathbb{E}_{\substack{(x,y) \sim p_{\text{pos}} \\ \{x_i^-\}_{i=1}^{M} \overset{\text{iid}}{\sim} p_{\text{data}}}} \left[ -\log \frac{e^{f(x)^\top f(y)/\tau}}{e^{f(x)^\top f(y)/\tau} + \sum_{i=1}^{M} e^{f(x_i^-)^\top f(y)/\tau}} \right] \qquad (1)$$

where $p_{\text{data}}$ is the distribution of data, $p_{\text{pos}}$ is the distribution over positive pairs (augmentations $x$, $y$ of an image), $\tau > 0$ is a temperature hyperparameter, and $M$ is a fixed number of negative samples, e.g. $M = 2(N-1)$ in [Chen et al.2020a] where $N$ is the batch size. Additionally, under the $\ell_2$-norm constraint, $\|f(\cdot)\|_2 = 1$.
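For concreteness, the following is a minimal PyTorch sketch of this $\ell_2$-normalized contrastive objective; the function name, default temperature, and the (B, M, D) layout of the negatives are our own illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_x, z_y, z_neg, tau=0.5):
    """Eq. 1-style contrastive loss.

    z_x, z_y: (B, D) embeddings of a positive pair; z_neg: (B, M, D) negatives.
    All embeddings are l2-normalized onto the unit hypersphere first.
    """
    z_x = F.normalize(z_x, dim=-1)
    z_y = F.normalize(z_y, dim=-1)
    z_neg = F.normalize(z_neg, dim=-1)

    pos = (z_x * z_y).sum(dim=-1) / tau                 # (B,)
    neg = torch.einsum('bmd,bd->bm', z_neg, z_y) / tau  # (B, M)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)  # (B, 1 + M)
    # negative log-probability assigned to the positive pair
    return -F.log_softmax(logits, dim=1)[:, 0].mean()
```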

3.2 The Link to Uniformity

From the loss in Eq. 1, [Wang and Isola2020] formally show that, when the number of negative samples $M$ is sufficiently large, optimizing the contrastive loss is equivalent to directly optimizing a metric of alignment (which encourages positive-pair representations to be consistent) and of uniformity (which encourages negative pairs to be dissimilar by uniformly distributing representations). The alignment loss is defined as:

$$\mathcal{L}_{\text{align}}(f; \alpha) \triangleq \mathbb{E}_{(x,y) \sim p_{\text{pos}}}\left[ \| f(x) - f(y) \|_2^{\alpha} \right], \quad \alpha > 0 \qquad (2)$$
(a) Random Initialization
(b) Contrastive Learning
(c) BYOL
(d) BYOL + Uni
(e) BYOL + MHE Reg
Figure 2: Representations of the CIFAR-10 validation set on the unit hypersphere $\mathcal{S}^1$. Uniformity is analyzed by plotting the feature distribution with Gaussian Kernel Density Estimation (KDE) in $\mathbb{R}^2$ and the corresponding angle of each point on $\mathcal{S}^1$ using von Mises-Fisher KDE [Wang and Isola2020].

which is simply the expected distance between positive pairs. The uniformity loss is given by:

$$\mathcal{L}_{\text{uniform}}(f; t) \triangleq \log \mathbb{E}_{(x,y) \overset{\text{iid}}{\sim} p_{\text{data}}}\left[ e^{-t \| f(x) - f(y) \|_2^2} \right], \quad t > 0 \qquad (3)$$

where in practice the parameter $t$ is empirically set to $t = 2$.
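Both metrics admit a compact PyTorch form; the sketch below follows the formulation of [Wang and Isola2020], with `alpha` and `t` defaulting to the values used there, and assumes the inputs are already $\ell_2$-normalized.

```python
import torch

def align_loss(x, y, alpha=2):
    # Eq. 2: expected distance between normalized positive-pair features.
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    # Eq. 3: log of the mean pairwise Gaussian potential between features.
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```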

3.3 BYOL and its Uniformity on the Hypersphere

As previously mentioned, BYOL proposes an alternative to the contrastive paradigm in which two networks, an 'online' network $f_\theta$ and a 'target' network $f_\xi$, are each given a different view ($v$ and $v'$ respectively) of the same image $x$, with the online network tasked to predict the representations of the target network. The target network is identical in architecture to the online network, yet it is not directly optimized in training; rather it is parameterized as an exponential moving average of the online parameters $\theta$, given a target decay rate $\tau \in [0, 1]$, updated after each training step,

$$\xi \leftarrow \tau \xi + (1 - \tau)\theta. \qquad (4)$$

The prediction is not made directly via the online network $f_\theta$; rather a predictor network $q_\theta$ is employed, which is independent of the target network. Furthermore, for notational simplicity the projector network is omitted from the notation, writing $z_\theta \triangleq f_\theta(v)$ and $z'_\xi \triangleq f_\xi(v')$. The BYOL loss is defined as:

$$\mathcal{L}_{\theta,\xi} \triangleq \left\| \overline{q_\theta(z_\theta)} - \overline{z'_\xi} \right\|_2^2 = 2 - 2 \cdot \frac{\langle q_\theta(z_\theta), z'_\xi \rangle}{\left\| q_\theta(z_\theta) \right\|_2 \cdot \left\| z'_\xi \right\|_2} \qquad (5)$$

where $\overline{u} \triangleq u / \|u\|_2$ denotes $\ell_2$-normalization. It can be seen that Eq. 5 is equivalent to the alignment loss in Eq. 2, yet how BYOL succeeds despite the omission of any uniformity measure, and hence the exclusion of explicit diversity constraints, is still an open research question [Richemond et al.2020].
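A minimal PyTorch sketch of the BYOL objective (Eq. 5) and the EMA update (Eq. 4) is given below; the function names are ours, and the `detach` on the target projection reflects the stop-gradient used in BYOL.

```python
import torch
import torch.nn.functional as F

def byol_loss(p_online, z_target):
    """Eq. 5: normalized MSE between the online prediction q(z) and the
    target projection z', equal to 2 - 2 * cosine similarity."""
    p = F.normalize(p_online, dim=-1)
    z = F.normalize(z_target.detach(), dim=-1)  # stop gradient through the target
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.99):
    """Eq. 4: xi <- tau * xi + (1 - tau) * theta, applied after each training step."""
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(tau).add_(p_o, alpha=1 - tau)
```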

The current conjecture presented in [Grill et al.2020] hypothesizes that BYOL acts as a form of dynamical system in which the target parameter updates are not in the direction of $\nabla_\xi \mathcal{L}_{\theta,\xi}$, and as such there is no loss for which BYOL's dynamics correspond to gradient descent jointly over $(\theta, \xi)$. Furthermore, the predictor is key; assuming BYOL has an optimal predictor,

$$q^\star \triangleq \arg\min_{q} \mathbb{E}\left[ \left\| q(z_\theta) - z'_\xi \right\|_2^2 \right], \quad \text{i.e.} \quad q^\star(z_\theta) = \mathbb{E}\left[ z'_\xi \mid z_\theta \right], \qquad (6)$$

where $z_\theta = f_\theta(v)$ and $z'_\xi = f_\xi(v')$. Therefore, with an optimal predictor, [Grill et al.2020] show that $\nabla_\theta \mathbb{E}\left[ \| q^\star(z_\theta) - z'_\xi \|_2^2 \right] = \nabla_\theta \mathbb{E}\left[ \sum_i \mathrm{Var}\left( z'_{\xi,i} \mid z_\theta \right) \right]$, and, as such, for any constant $c$ and random variables $z_\theta$ and $z'_\xi$, $\mathrm{Var}(z'_\xi \mid z_\theta) \leq \mathrm{Var}(z'_\xi \mid c)$. Informally, as $\xi$ does not train on this gradient, the only way to minimize the conditional variance is to increase the information captured in $z_\theta$; the predictor therefore increases the variability of representations in the online network and prevents collapsed representations.

Given this intuition, we aim to explore how effectively this predictor manages to distribute features in representation space without enforced constraints on their distribution. Following the procedure described in [Wang and Isola2020], we train a modified AlexNet [Krizhevsky et al.2017] encoder under the BYOL procedure, visualizing the representations of the CIFAR-10 [Krizhevsky et al.2009] validation set on the hypersphere $\mathcal{S}^1$ (i.e. two-dimensional outputs). Fig. 2 depicts the distribution of the validation-set features, with Fig. 2c visualizing the distribution under the BYOL procedure. It is clear that, in comparison to contrastive methods (Fig. 2b), the feature representations of the BYOL encoder are less uniformly distributed.

This observation leads to the question: can BYOL benefit from mechanisms that introduce the feature uniformity found in contrastive methods? More specifically, can the uniformity loss, Eq. 3, benefit the representations learned under BYOL? This question has been partly explored in [Shi et al.2020] and indirectly via the exploration of negative samples in [Grill et al.2020], yet both consider representations produced by the projector with negatives computed from the target network. Given the above intuition of BYOL's behavior, we instead minimize Eq. 3 on the online projections $z_\theta$ only, independently of the predictor $q_\theta$. The intuition behind this procedure is to enforce a uniform distribution of the features output by the online network, akin to contrastive methods, whilst maintaining the role of the predictor in enforcing variation via the maximization of information in the uniformly distributed online network. The loss proposed is therefore

$$\mathcal{L}^{\text{uni}}_{\theta,\xi} \triangleq \mathcal{L}_{\theta,\xi} + \lambda_{\text{uni}} \, \mathcal{L}_{\text{uniform}}(z_\theta), \qquad (7)$$

where $\mathcal{L}_{\theta,\xi}$ is the BYOL loss of Eq. 5 and $\lambda_{\text{uni}}$ is a hyperparameter controlling the influence of the uniformity metric.
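Under the assumptions above, Eq. 7 amounts to a one-line combination of the earlier sketches (`byol_loss` and `uniform_loss`), applying the uniformity term to the $\ell_2$-normalized online projections only; `lam_uni` plays the role of $\lambda_{\text{uni}}$ (set to 0.125 in Section 4).

```python
import torch.nn.functional as F

def byol_uniformity_loss(p_online, z_target, z_online, lam_uni=0.125):
    """Eq. 7: BYOL loss (Eq. 5) plus the weighted uniformity term (Eq. 3)
    applied to the l2-normalized online projections z_theta only."""
    z = F.normalize(z_online, dim=-1)
    return byol_loss(p_online, z_target) + lam_uni * uniform_loss(z)
```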

It can be observed from Fig. 2d that the addition of $\mathcal{L}_{\text{uniform}}$ during BYOL training does in fact improve the uniformity of the representations produced by the encoder, distributing them akin to contrastive learning (Fig. 2b); this confirms the expected behavior. Although the addition of the contrastive uniformity loss yields greater uniformity of representations and improves their quality on smaller datasets, it still relies on the number of negative samples, $M$, being sufficiently large. BYOL aimed to remove this computational necessity; introducing the contrastive uniformity metric therefore contradicts one of the advantages of BYOL, namely the lack of negative samples. Further analysis of the quality of the representations captured and their robustness to smaller batch sizes is given in the following sections.

3.4 MHE Regularization

Recent work has explored the necessity of batch normalization [Fetterman and Albrecht2020], with subsequent findings that the initialization and regularization of weights provided by batch normalization are key to BYOL's success [Richemond et al.2020]. Following these hypotheses, we propose the use of hyperspherical regularization [Liu et al.2018] alongside batch normalization, explicitly regularizing the network to reduce the hyperspherical energy of its neurons (depicted in Fig. 1), further improving the diversity of weights in the network and consequently the diversity of representations.

[Liu et al.2018] argue that the power of neural representations can be characterized by the hyperspherical energy of the network's neurons (i.e. kernels), and as such, minimal hyperspherical energy configurations can induce better diversity and therefore improved feature separability. The hyperspherical energy of $N$ neurons projected onto the unit hypersphere, $\hat{w}_1, \dots, \hat{w}_N \in \mathcal{S}^{m-1}$, is defined as:

$$E_s(\hat{w}_1, \dots, \hat{w}_N) \triangleq \sum_{i=1}^{N} \sum_{j=1,\, j \neq i}^{N} f_s\!\left( \| \hat{w}_i - \hat{w}_j \| \right), \qquad (8)$$

where $\hat{w}_i = w_i / \|w_i\|$ is the $i$-th neuron weight projected onto $\mathcal{S}^{m-1}$, and $f_s(\cdot)$ is a decreasing real-valued function, chosen to be the Riesz $s$-kernel $f_s(z) = z^{-s}$, $s > 0$ [Liu et al.2018]. We therefore aim to minimize the energy in Eq. 8 by manipulating the orientation of the neurons, solving $\min_{\hat{w}_1,\dots,\hat{w}_N} E_s$. When $s = 0$, the logarithmic energy minimization problem is undertaken, essentially maximizing the product of Euclidean distances between neurons, which in our case relates to the angles between neurons:

$$E_0(\hat{w}_1, \dots, \hat{w}_N) = \sum_{i=1}^{N} \sum_{j=1,\, j \neq i}^{N} \log\!\left( \| \hat{w}_i - \hat{w}_j \|^{-1} \right). \qquad (9)$$
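A sketch of the energy of Eqs. 8-9 for a weight matrix whose rows are treated as neurons is given below; averaging over neuron pairs (rather than summing) and the clamp for numerical stability are normalization choices on our part.

```python
import torch
import torch.nn.functional as F

def hyperspherical_energy(weight, s=2, eps=1e-8):
    """Eqs. 8-9: Riesz s-kernel energy of the row-normalized neurons,
    averaged over all ordered neuron pairs."""
    w = F.normalize(weight.view(weight.size(0), -1), dim=1)  # project onto the hypersphere
    dist = torch.cdist(w, w)                                 # pairwise Euclidean distances
    mask = ~torch.eye(w.size(0), dtype=torch.bool, device=w.device)
    d = dist[mask].clamp_min(eps)                            # drop self-distances
    if s == 0:
        return torch.log(1.0 / d).mean()                     # logarithmic energy (Eq. 9)
    return d.pow(-s).mean()                                  # Riesz s-kernel, s > 0
```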

As an explicit regularization method, we optimize for the joint objective function:

$$\mathcal{L}^{\text{MHE}}_{\theta,\xi} \triangleq \mathcal{L}_{\theta,\xi} + \lambda_{\text{MHE}} \sum_{l=1}^{L} E_s^{(l)}, \qquad (10)$$

where $\lambda_{\text{MHE}}$ is a hyperparameter controlling the weighting of our regularization and $L$ is the number of regularized layers in the online network $f_\theta$ and/or predictor $q_\theta$. A further variant is also considered in this work, simply extending the hyperspherical energy based on Euclidean distance in Eq. 8 to consider geodesic distance on the unit hypersphere. We define this extension in Appendix A. For more details and proofs regarding MHE regularization we refer to [Liu et al.2018].
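The joint objective of Eq. 10 can then be assembled by accumulating this energy over the regularized layers; restricting the traversal to `nn.Linear` modules mirrors the configuration described in Section 3.5 and is otherwise our assumption, as are the network names in the commented usage.

```python
import torch.nn as nn

def mhe_regularizer(module, s=2):
    """Average the hyperspherical energy (Eq. 8) over the linear layers of a module."""
    energies = [hyperspherical_energy(m.weight, s=s)
                for m in module.modules() if isinstance(m, nn.Linear)]
    return sum(energies) / max(len(energies), 1)

# Joint objective of Eq. 10 (network names are illustrative):
# loss = byol_loss(p_online, z_target) \
#        + lam_mhe * (mhe_regularizer(projector) + mhe_regularizer(predictor))
```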

Following the same visualization methodology as above, we empirically confirm our hypothesis that improving the diversity of weights within the network results in more diversely distributed representations. Fig. 2e shows a significant improvement in representation uniformity compared to the baseline in Fig. 2c. Furthermore, the results throughout also show how weight diversity, in conjunction with batch normalization, is beneficial in improving representation quality, the uniformity of the representation distribution, and consequently feature separability, all with no added dependence on batch size, maintaining the desirable properties of BYOL. Analysis of the regularization and of robustness to batch size is given in Sections 5.1 and 5.2.

3.5 Implementation Details

Image Augmentations

To correspond with the BYOL procedure, we employ the same image augmentations as described in [Chen et al.2020a, Grill et al.2020]. For the ImageNet ILSVRC-2012 dataset [He et al.2016] we apply the following augmentations in order: random crop and resize to 224×224; random horizontal flip; color distortion (a random sequence of brightness, contrast, saturation, and hue augmentations); random grayscale conversion; Gaussian blur; and color solarization. For the CIFAR-10 and CIFAR-100 datasets, we follow the same procedure, resizing to 32×32 and omitting the Gaussian blur and solarization as described in [Chen et al.2020a]. Further details on the datasets are given in Appendix B.
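An illustrative torchvision pipeline for one ImageNet training view is sketched below; the application probabilities follow Appendix D, while the jitter strengths, solarization probability and threshold, and blur kernel size are our assumptions, and BYOL applies some of these transforms asymmetrically across the two views.

```python
from torchvision import transforms as T

# One ImageNet training view (probabilities follow Appendix D; strengths assumed).
imagenet_view = T.Compose([
    T.RandomResizedCrop(224, interpolation=T.InterpolationMode.BICUBIC),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),   # jitter_d=0.5 scaling assumed
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
    T.RandomSolarize(threshold=128, p=0.2),                      # operates on PIL images
    T.ToTensor(),
])
```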

Architecture

Our experimentation primarily focuses on two convolutional residual network [He et al.2016] configurations for our encoders $f_\theta$ and $f_\xi$: ResNet-18 and ResNet-50, with 18 and 50 layers respectively. Following the procedure described in BYOL [Grill et al.2020], derived from SimCLR [Chen et al.2020a], the networks replace the standard linear output layer with a Multi-Layer Perceptron (MLP) projector, projecting the output of the final average pooling layer to a smaller space. The MLP is a two-layer network, the first layer outputting 4096 dimensions and the second outputting 256 dimensions; only the first layer is followed by batch normalization and Rectified Linear Units (ReLU). For CIFAR we adjust the first layers (i.e. the 'stem') of the ResNet architecture, reducing the kernel size of the first convolutional layer from 7 to 3 and its stride from 2 to 1, and removing the max-pooling operation, to accommodate the reduced image size. When reporting on MHE regularization, unless stated otherwise, the regularization is applied to all linear layers in the projector and the predictor $q_\theta$.
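A minimal sketch of this projection/prediction head is given below; the class name is ours.

```python
import torch.nn as nn

class MLPHead(nn.Module):
    """Two-layer projection/prediction head: in_dim -> 4096 -> 256,
    with BN + ReLU after the first layer only."""
    def __init__(self, in_dim, hidden_dim=4096, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)
```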

Method Top-1 (%) Top-5 (%)
SimCLR [Chen et al.2020a] 67.9 88.5
BYOL [Grill et al.2020] 72.5 90.8
BYOL (repro) 71.9 89.2
BYOL-MHE (ours) 72.4 89.9
Table 1: Linear Evaluation on ImageNet, ResNet-50 encoder trained for 300 epochs.

Method CIFAR-10 CIFAR-100
SimCLR */(repro) 94.0 / 93.81 – / 70.98
BYOL (repro) 94.46 72.10
BYOL + $\mathcal{L}_{\text{uniform}}$ (ours) 94.84 72.62
BYOL-MHE (ours) 94.78 72.56
Table 2: Top-1 (%) accuracies of linear evaluation on the CIFAR-10 and CIFAR-100 datasets with a ResNet-50 encoder trained for 1000 epochs. * results from [Chen et al.2020a].

Optimization

For all unsupervised training we use the LARS optimizer [You et al.2017], excluding both batch normalization and bias parameters, with cosine learning rate decay [Loshchilov and Hutter2016] and the learning rate linearly scaled with the batch size (LearningRate = base learning rate × BatchSize/256) [Goyal et al.2017]. The EMA parameter $\tau$ is increased during training from $\tau_{\text{base}}$ to 1 with $\tau = 1 - (1 - \tau_{\text{base}}) \cdot (\cos(\pi k / K) + 1)/2$, where $k$ is the current iteration and $K$ is the total number of training iterations. For ImageNet we train the ResNet-50 for 300 epochs with $\tau_{\text{base}} = 0.99$, a batch size of 512, a base learning rate of 0.3, and a weight decay of $10^{-6}$. We train on 8 Nvidia RTX-2080ti GPUs for approximately 8 days. For CIFAR, when training for 1000 epochs we increase the learning rate and use a batch size of 1024; for 300-epoch training these settings are adjusted further. Further hyperparameter details are given in Appendix D.
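The EMA decay schedule and learning-rate scaling can be written directly from the formulas above; the default values follow Appendix D and [Goyal et al.2017].

```python
import math

def ema_tau(step, total_steps, tau_base=0.99):
    """Cosine schedule increasing tau from tau_base to 1 over training."""
    return 1 - (1 - tau_base) * (math.cos(math.pi * step / total_steps) + 1) / 2

def scaled_lr(base_lr=0.3, batch_size=512):
    """Linear learning-rate scaling with batch size."""
    return base_lr * batch_size / 256
```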

4 Linear Evaluation

We evaluate the quality of the representations learned by the BYOL variants after self-supervised pre-training via linear evaluation, under the same procedure described in [Chen et al.2020a, Grill et al.2020]: a linear classifier is trained on top of the frozen pre-trained online encoder $f_\theta$. The BYOL + $\mathcal{L}_{\text{uniform}}$ networks have the weight hyperparameter $\lambda_{\text{uni}}$ empirically set to 0.125, and the BYOL-MHE networks set $\lambda_{\text{MHE}} = 10$. We apply angular MHE with power a2, which we empirically found to perform best under this network configuration (Appendix C).
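A sketch of this linear-evaluation protocol is given below; the optimizer values follow Appendix D, while the momentum term and the omission of a learning-rate schedule are simplifications of ours.

```python
import torch
import torch.nn as nn

def linear_evaluation(encoder, feat_dim, num_classes, train_loader,
                      epochs=80, lr=0.2, weight_decay=0.0):
    """Freeze the pre-trained online encoder and train a linear classifier on top."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False

    classifier = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(classifier.parameters(), lr=lr,
                          momentum=0.9, weight_decay=weight_decay)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = encoder(images)          # frozen features
            loss = criterion(classifier(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return classifier
```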

First, we report the top-1 and top-5 accuracies (%) for the ImageNet ILSVRC-2012 test set in Tab. 1, trained with a standard ResNet-50 for 300 epochs; we show our reproduction (repro) alongside our MHE-regularized variant. We report 72.4% top-1 accuracy with the inclusion of MHE regularization, a 0.4% improvement over our standard BYOL reproduction.

In the case of the CIFAR-10 and CIFAR-100 datasets we see a more substantial improvement, reporting top-1 accuracies of 94.78% and 72.56% for CIFAR-10 and CIFAR-100 respectively. This improvement on CIFAR-10 is comparable to the improvement found between SimCLR and BYOL, a substantial move towards the supervised baseline of 95.1% reported in [Chen et al.2020a]. For the explicit uniformity constraint, BYOL + $\mathcal{L}_{\text{uniform}}$, we see on average a 0.1% improvement over the MHE-regularized variant. This improvement is expected, given the more explicit nature of the uniformity constraint, which directly optimizes the representations rather than regularizing them implicitly as MHE does, and the observed near-uniform distribution depicted in Fig. 2d. The overall improvement, even for smaller models and on smaller datasets, shows how the diversity introduced between representations (visually depicted in Fig. 2) can improve feature separability and representation quality.

(a) No BN + No MHE (b) BN + No MHE (c) No BN + MHE (d) BN + MHE
Figure 3: Uniformity of representations on $\mathcal{S}^1$ under different regularization configurations, plotted with Gaussian KDE.

5 Ablations

To further analyze the behavior of our BYOL variants, we explore the impact of various hyperparameter and network configurations on the CIFAR-10 dataset. We follow the same training procedure described in Section 3.5, training a ResNet-18 encoder for 300 epochs on two Nvidia V100 GPUs. As in [Grill et al.2020], we report the average performance per configuration over three seeds.

BN MHE Accuracy (%)
✗ ✗ 28.76
✓ ✗ 90.74
✗ ✓ 45.72
✓ ✓ 91.22
Table 3: Linear evaluation on CIFAR-10 under different regularization configurations; when MHE is selected, all sub-networks are regularized.
Layer
Encoder -
Projector - -
Predictor - - - -
MHE Accuracy (%)
MHE (a2) 91.64 91.36 91.46 91.38 91.10 90.96 91.22
BYOL + $\mathcal{L}_{\text{uniform}}$ 91.48
Table 4: Linear evaluation on CIFAR-10 given different network configurations of MHE regularization; '✓' denotes that MHE regularization has been applied to that sub-network and '-' that it has not. The encoder is ResNet-18 trained for 300 epochs; the BYOL (repro) baseline is 90.74.

5.1 MHE Regularization and Batch Normalization

The intuition behind the link between uniformity of representations and diversity regularization is that, given more uniformly distributed neurons, more diverse features can be captured by the network, resulting in a more diverse distribution of learned semantic features. To further test this hypothesis, we provide representations of the CIFAR-10 validation set on a unit hypersphere, trained with a modified AlexNet encoder, plotting the feature distribution with Gaussian KDE for different configurations of MHE regularization and batch normalization. Fig. 3 shows the distribution of representations and Tab. 3 reports the results of linear evaluation with a ResNet-18 encoder. These results empirically show how, without batch normalization, the network fails to learn, resulting in collapsed representations, coinciding with [Richemond et al.2020]. Interestingly, MHE alone produces feature representations that are distributed far more uniformly (Fig. 3c) than with batch normalization alone; however, its linear evaluation performance suffers compared to batch normalization, although it is a significant improvement over random initialization.

Figure 4: Reduction in CIFAR-10 linear evaluation top-1 accuracy (%) when decreasing the batch size, for the proposed variants and the BYOL reproduction.

5.2 Batch Size

Alongside the performance improvements found in BYOL, one key advantage is its robustness to smaller batch sizes; this comes from the avoidance of negative pairs sampled from within the batch, as required in end-to-end contrastive models. Therefore, with our addition of $\mathcal{L}_{\text{uniform}}$ (Eq. 3), which is derived from the contrastive loss, we expect this robustness to degrade. We empirically test performance under different batch sizes by averaging gradients over $n$ consecutive steps before updating the online and target network parameters, where $n$ is the factor of batch-size reduction from the baseline [Grill et al.2020].
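A sketch of this gradient-averaging procedure, reusing the `byol_loss` and `ema_update` helpers from Section 3.3, is given below; the loader yielding paired views and the placement of the EMA update are our assumptions.

```python
import torch

def train_with_accumulation(online_net, predictor, target_net, optimizer,
                            loader, n_accum, tau=0.99):
    """Average gradients over n_accum consecutive smaller batches before each
    parameter update, emulating a batch size reduced by a factor of n_accum."""
    optimizer.zero_grad()
    for step, (view1, view2) in enumerate(loader, start=1):
        p_online = predictor(online_net(view1))
        with torch.no_grad():
            z_target = target_net(view2)
        loss = byol_loss(p_online, z_target) / n_accum   # scale so gradients average
        loss.backward()
        if step % n_accum == 0:
            optimizer.step()
            optimizer.zero_grad()
            ema_update(target_net, online_net, tau)      # target follows each update
```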

We observe in Fig. 4 that the introduction of the uniformity loss does in fact reduce robustness when the batch size is decreased. From a baseline of 91.48%, we see a 9.06% drop at the smallest batch size for BYOL + $\mathcal{L}_{\text{uniform}}$, compared to BYOL's 5.74%. This expected result confirms our reasoning in seeking alternative mechanisms to enforce uniformity of feature representations. For MHE regularization, we observe very little deviation in performance compared to the standard BYOL architecture, given the regularization's independence of batch size (as all models employ batch normalization, the remaining drop in performance can be partly attributed to its computation of batch statistics).

5.3 MHE Regularization Parameterization

To investigate how varying the hyperparameters of the MHE regularization affects performance, we report results for different network configurations in Tab. 4. Ablations over the weight of the regularization and its power are additionally given in Appendix C.

We report in Tab. 4 the linear evaluation performance under varying configurations of MHE regularization applied to individual sub-networks. Across all configurations we see an increase in performance, showing that the improved weight diversity, and subsequent representation diversity, improves the quality of the representations learned. Additionally, despite our earlier reasoning that it is not preferable to directly enforce uniformity at the predictor of the BYOL architecture, based on the intuition of BYOL's behavior [Grill et al.2020], we do not see any degradation in performance when MHE is applied at the predictor level. We conjecture that the improved diversity of features helps the online network capture more varied representations.

6 Conclusion

We empirically show that uniformity constraints like those in contrastive losses can be beneficial in BYOL, where negative samples are absent. To maintain the computational benefits proposed by BYOL, we investigate the use of regularization methods that minimize the hyperspherical energy between network neurons. We show that this paradigm of regularization implicitly improves the uniformity of the representation distribution learned by the encoder, leading to improved results over the baseline in all experiments whilst remaining robust to changes in batch size. We believe further performance improvements can be made by tuning hyperparameters. Yet how fully collapsed equilibria are avoided in the presence of MHE regularization is still to be understood, as is the full understanding of BYOL's mechanism. From this work we suggest that regularization methods play a key role in unsupervised representation learning, establishing a hypothesis for future investigations.

References

Appendix A Method

A.1 Angular MHE Regularization

The hyperspherical energy defined in Eq. 8 is based on the Euclidean distance between points on the hypersphere; however, [Liu et al.2018] propose an alternative distance measure. This alternative is a simple extension defining the hyperspherical energy based on geodesic distance, replacing the Euclidean distance $\|\hat{w}_i - \hat{w}_j\|$ with the angle $\arccos(\hat{w}_i^\top \hat{w}_j)$. The main difference lies in the optimization dynamics, as reported in [Liu et al.2018]. The extension, known as angular MHE, is defined as:

$$E_{s,a}(\hat{w}_1, \dots, \hat{w}_N) \triangleq \sum_{i=1}^{N} \sum_{j=1,\, j \neq i}^{N} f_s\!\left( \arccos(\hat{w}_i^\top \hat{w}_j) \right). \qquad (11)$$
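A sketch of this angular variant, replacing the pairwise Euclidean distances of the earlier energy sketch with arc lengths, is given below; the clamping is added by us for numerical stability.

```python
import torch
import torch.nn.functional as F

def angular_hyperspherical_energy(weight, s=2, eps=1e-7):
    """Eq. 11: Riesz s-kernel energy over pairwise geodesic (angular) distances."""
    w = F.normalize(weight.view(weight.size(0), -1), dim=1)
    cos = (w @ w.t()).clamp(-1 + eps, 1 - eps)
    theta = torch.acos(cos)                               # pairwise angles in [0, pi]
    mask = ~torch.eye(w.size(0), dtype=torch.bool, device=w.device)
    d = theta[mask].clamp_min(eps)
    if s == 0:
        return torch.log(1.0 / d).mean()
    return d.pow(-s).mean()
```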

Appendix B Dataset Processing

To enable the most direct comparison with BYOL, we use identical dataset processing and augmentation procedures [Grill et al.2020].

B.1 Dataset Split

When performing self-supervised pre-training we utilize manually created validation sets to select appropriate hyperparameters. Given that both the ImageNet ILSVRC-2012 dataset [He et al.2016] and the CIFAR-10/100 datasets do not contain validation splits (we hold out the official validation set for use as the test set), we manually generate a subset of the training split for validation:

  • ImageNet, we take the last 10009 images of the official TensorFlow ImageNet split as in [Grill et al.2020].

  • CIFAR-10, we take 500 random samples per class from the train set for validation.

  • CIFAR-100, we take 50 random samples per class from the train set for validation.

For all datasets we report the Top-1 accuracy (%), which is the proportion of correctly classified examples. For ImageNet we also report the Top-5 accuracy (%), the proportion of examples for which the correct label is within the top 5 predictions (the 5 predictions with the highest probabilities).

B.2 Augmentations

The augmentation procedure is key to the success of self-supervised learning; therefore, to compare our performance against BYOL, we employ the same image augmentations reported in [Grill et al.2020, Chen et al.2020a], applied sequentially in the following order:

  • random cropping: a random patch of the image is selected, with an area uniformly sampled between 8% and 100% of that of the original image, and an aspect ratio logarithmically sampled between 3/4 and 4/3. This patch is then resized to the target size of 224×224 for ImageNet or 32×32 for CIFAR using bicubic interpolation;

  • optional left-right flip;

  • color jittering: the brightness, contrast, saturation and hue of the image are shifted by a uniformly random offset applied on all the pixels of the same image. The order in which these shifts are performed is randomly selected for each patch;

  • grayscale: an optional conversion to grayscale. When applied, the output intensity for a pixel (r, g, b) corresponds to its luma component, computed as 0.299r + 0.587g + 0.114b;

  • Gaussian blurring: for a 224×224 image, a square Gaussian kernel of size 23×23 is used, with a standard deviation uniformly sampled over [0.1, 2.0];

  • solarization: an optional color transformation $x \mapsto x \cdot \mathbb{1}_{x < 0.5} + (1 - x) \cdot \mathbb{1}_{x \geq 0.5}$ for pixels with values in [0, 1].

In the case of CIFAR datasets we omit the Gaussian blur from the augmentation procedure as done in [Chen et al.2020a].

At evaluation, we simplify the augmentations, first applying a center crop: for ImageNet, images are resized to 256 pixels along the shorter side using bicubic resampling, after which a 224×224 center crop is applied; for CIFAR, images are center cropped and then resized to 32×32.

Following both the training and evaluation augmentations the transformed images are normalized per color channel by subtracting the average color and dividing by the standard deviation. The average color and standard deviations are computed per dataset.

Appendix C Ablation: MHE Regularizer

We briefly report the linear evaluation results of the hyperparameter search on the CIFAR-10 dataset, trained for 300 epochs with a ResNet-18 encoder and linearly evaluated for 80 epochs with the encoder weights frozen. The average of three runs with three seeds is reported, as with all ablation studies. All other hyperparameters remain identical in both of the following experiments.

C.1 Regularizer Weight

Tab. 5 reports the linear evaluation results of the BYOL procedure pre-trained with differing weightings $\lambda_{\text{MHE}}$ of the MHE regularization in the loss. We maintain the same hyperparameter range stated in [Liu et al.2018], from 0.001 to 100. We report very little difference in performance over all values of $\lambda_{\text{MHE}}$, with performance diminishing back to the BYOL baseline for the smallest weights given their small contribution. Based on these empirical results we set $\lambda_{\text{MHE}} = 10$ for all MHE regularization.

Weight Accuracy (%)
0.001 90.74
0.01 91.12
1 91.20
10 91.34
100 91.14
Table 5: Linear evaluation on CIFAR-10 with a ResNet-18 encoder trained for 300 epochs, for varying MHE regularization weight $\lambda_{\text{MHE}}$.

C.2 Regularizer Power

For the powers $s$ of the Riesz kernel, we again observe very little variation in performance between settings, at worst seeing a reasonable improvement of 0.4% over the baseline. For angular MHE we find that power a2 performs substantially better on average. Although more computationally expensive, we opt to use the angular MHE with power a2 for all experimentation. The outlier performance reported at a1 was primarily due to one poor result, and as such we report the half difference between best and worst.

Method Power Accuracy (%)
BYOL (repro) - 90.74
BYOL-MHE 0 91.12
1 91.24
2 91.34
BYOL-aMHE a0 91.14
a1 90.94
a2 91.64
Table 6: Linear Evaluation on CIFAR10 with ResNet-18 Encoder trained for 300 epochs

C.3 Batch Size Accuracies

Method Batch Size Accuracy (%)
BYOL (repro) 1024 90.74
512 89.10
256 87.50
128 85.00
BYOL + $\mathcal{L}_{\text{uniform}}$ 1024 91.48
512 90.22
256 86.84
128 82.42
BYOL-MHE 1024 91.64
512 90.02
256 88.10
128 85.76
Table 7: Corresponding CIFAR-10 top-1 accuracy (%) for different batch sizes.

Appendix D Hyperparameters

# Augmentation Probability
jitter_d=0.5
jitter_p=0.8
blur_sigma=[0.1,2.0]
blur_p=0.5
grey_p=0.2
# Model
model=resnet50
h_units=4096
o_units=256
norm_layer=nn.BatchNorm2d
# Training
max_epochs=300
warmup_epochs=10
batch_size=512
# Optim
optimiser=lars
learning_rate=0.3
weight_decay=1e-06
# BYOL
tau=0.99
uniform=False
uni_val=0.1
# MHE Reg
proj_reg=mhe
proj_pow=a2
pred_reg=mhe
pred_pow=a2
reg_weight=10
# Fine Tune
ft_epochs=80
ft_batch_size=100
ft_learning_rate=0.2
ft_weight_decay=0.0
ft_optimiser=sgd