VIbCReg: Variance-Invariance-better-Covariance Regularization for Self-Supervised Learning on Time Series

09/02/2021 · by Daesoo Lee, et al.

Self-supervised learning for image representations has recently had many breakthroughs with respect to linear evaluation and fine-tuning evaluation. These approaches rely on both cleverly crafted loss functions and training setups to avoid the feature collapse problem. In this paper, we improve on the recently proposed VICReg method, which introduced a loss function that does not rely on specialized training loops to converge to useful representations. Our method improves on the covariance term proposed in VICReg, and in addition we augment the head of the architecture with an IterNorm layer that greatly accelerates convergence of the model. Our method achieves superior performance on linear evaluation and fine-tuning evaluation on a subset of the UCR time series classification archive and the PTB-XL ECG dataset.


1 Introduction

In the last year, representation learning has had great success within computer vision, improving on the SOTA for fine-tuned models and achieving close-to-SOTA results on linear evaluation of the learned representations [14, 4, 12, 11, 5, 35, 1], among many others. The main idea in these papers is to train a high-capacity neural network using a self-supervised learning (SSL) loss that produces representations of images that are useful for downstream tasks such as image classification and segmentation. The recent mainstream SSL frameworks can be divided into two main categories: 1) contrastive learning methods and 2) non-contrastive learning methods. Representative contrastive learning methods such as MoCo [12] and SimCLR [3] use positive and negative pairs: they learn representations by pulling the representations of the positive pairs together and pushing those of the negative pairs apart. However, these methods require a large number of negative pairs per positive pair to learn representations effectively. To eliminate the need for negative pairs, non-contrastive learning methods such as BYOL [11], SimSiam [5], Barlow Twins [35], and VICReg [1] have been proposed. Since the non-contrastive learning methods use positive pairs only, their architectures can be simplified, and they have also been able to outperform the existing contrastive learning methods.

To further improve the quality of learned representations, feature whitening and feature decorrelation have been a main idea behind some recent improvements [10, 35, 15, 1]. Initial SSL frameworks such as SimCLR suffer from a problem called feature collapse when there are not enough negative pairs, where collapse means that the features of the representations degenerate to constants. The collapse occurs because a similarity metric remains high even if all the features converge to constants, which is the reason negative pairs are used to prevent it. The collapse has been partially resolved by using a momentum encoder [11], an asymmetric framework with a predictor, and stop-gradient [11, 5], which have popularized the non-contrastive learning methods. However, one of the latest SSL frameworks, VICReg, shows that none of these is needed and that effective representation learning is possible with the simplest Siamese architecture, without collapse, via feature decorrelation. This idea first appeared in W-MSE [10], where the feature components are decorrelated from each other by whitening. Later, Barlow Twins encodes the feature decorrelation by reducing the off-diagonal terms of its cross-correlation matrix to zero. Hua et al. [15] encode the feature decorrelation by introducing Decorrelated Batch Normalization (DBN) [17] and Shuffled-DBN. VICReg encodes the feature decorrelation by introducing variance and covariance regularization terms in its loss function, in addition to a similarity loss.

The mainstream SSL frameworks have been developed in the computer vision domain, and SSL for time series problems has seen much less attention despite its evident abundance in industrial, financial, and societal applications. The Makridakis competitions [24, 23] give examples of important econometric forecasting challenges. The UCR Archive [6] is a collection of datasets where classification is of importance, and in the industrial setting, sensor data and Internet of Things (IoT) data are examples where proper machine learning tools for time series modeling are important. Publicly-available ECG datasets such as CinC [28] and PTB-XL [32] consist of 12-lead ECG signals labeled with heart disease diagnoses; classification is therefore of importance for the ECG datasets as well.

In this paper, we investigate several existing mainstream SSL frameworks on several time series datasets, and propose an SSL framework, VIbCReg (Variance-Invariance-better-Covariance Regularization), which is inspired by the feature decorrelation methods from [10, 15, 1] and can be viewed as an upgraded version of VICReg with better covariance regularization. The existing SSL frameworks investigated in this paper are CPC, SimSiam, Barlow Twins, and VICReg. In addition, we also investigate the incorporation of VIbCReg with a recent work, Mean Shift (MSF) [20], in an attempt to better optimize the invariance term. Mean Shift extends BYOL by pulling one view toward its k nearest neighbors instead of the one-to-one pulling. All the frameworks are reproduced and the source code will be made available.

Since the mainstream SSL frameworks from computer vision have been developed with a focus on classification, we investigate the SSL frameworks on a time series classification task. In this paper, we evaluate the performance of VIbCReg and its competitors on the following datasets: 1) the ten datasets of the UCR archive containing the largest number of time series, 2) the five datasets of the UCR archive with the largest number of labels among datasets with more than 900 samples, and 3) the PTB-XL dataset.

One recent study regarding SSL on 12-lead ECG datasets [25] investigated SSL frameworks such as SimCLR, BYOL, SwAV [2], and a modified CPC. SSL was conducted on a combination of several publicly-available 12-lead ECG datasets and assessed by linear evaluation and fine-tuning evaluation, and the modified CPC proposed by the authors performed the best. However, the modified CPC removes the strides in the convolutional layers; its architecture is therefore not generic and not suitable for general time series. Also, a large portion of the performance gain of the modified CPC comes not from its architecture but from a two-step fine-tuning technique in which a classification head is first trained with the encoder frozen for an initial number of epochs, after which the classification head and the encoder are trained jointly. Although our paper also addresses SSL on a 12-lead ECG dataset, its main focus differs from [25] in the sense that ours lies on the comparative performance of SSL frameworks on time series data, while Mehari and Strodthoff [25] focus on the highest possible classification performance on the ECG dataset by utilizing SSL.

In the following sections, related works and the proposed methods are introduced, followed by experimental results. An overview of comparative linear evaluation results on the UCR datasets is presented in Fig. 1.

Figure 1: Comparative linear evaluation results on the UCR datasets. rand init denotes using a randomly-initialized frozen encoder and supervised denotes an encoder trained in a supervised-learning manner. vibcreg+msf denotes VIbCReg+Mean Shift (MSF). VIbCReg(+MSF) not only outperforms all the other frameworks, but even outperforms supervised on some datasets. Note that VIbCReg+MSF is termed VbIbCReg.

2 Related Works

Figure 2: Comparison of Siamese architecture-based SSL frameworks. Siamese denotes the Siamese architecture [19] and the others are SSL frameworks. The encoder includes all layers that can be shared between both branches. The dashed lines indicate the gradient propagation flow; hence, the lack of a dashed line denotes stop-gradient.

2.1 Contrastive Learning Methods

2.1.1 Siamese Architecture-based SSL Frameworks

A framework of a Siamese neural network [7] is illustrated in Fig. 2. The Siamese neural network has two branches with a shared encoder on each side, and its similarity metric is computed on the two representations from the two encoders. Its main purpose is to learn representations such that two representations are similar if the two input images belong to the same class and dissimilar if they belong to different classes. One of the main applications of the Siamese neural network is one-shot and few-shot learning for image verification [7, 19], where a representation of a given image is obtained by the encoder of a trained Siamese neural network and compared with the representations of existing images via a similarity metric. It should be noted that the Siamese neural network is trained in a supervised manner on a labeled dataset. The process of learning representations by training an encoder is termed representation learning.

Although representation learning can be conducted using the Siamese neural network, its capability is somewhat limited by the fact that it relies on supervised learning, which requires a labeled dataset. To eliminate the need for a labeled dataset and still conduct effective representation learning, SSL, which trains an encoder on unlabeled datasets, has become very popular. Some of the representative contrastive learning methods based on the Siamese architecture are MoCo and SimCLR. Both frameworks are illustrated in Fig. 2.

MoCo Contrastive learning methods require a large number of negative pairs per positive pair, and MoCo maintains a large number of negative pairs by keeping a queue, where the current mini-batch of representations is enqueued and the oldest mini-batch in the queue is dequeued. A similarity loss is computed on representations from the encoder and the momentum encoder, a dissimilarity loss is computed on representations from the encoder and the queue, and the two are combined into a contrastive loss function called InfoNCE [31]. The momentum encoder is a moving average of the encoder and maintains consistency of the representations.
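As a concrete sketch of this contrastive objective, the following NumPy function computes InfoNCE; this is our illustration, not MoCo's implementation, and `q`, `k_pos`, and `queue` are hypothetical stand-ins for the query batch, its positive keys from the momentum encoder, and the queued negatives.

```python
import numpy as np

def info_nce(q, k_pos, queue, tau=0.07):
    """InfoNCE: pull each query toward its positive key, push it away from queue negatives."""
    # l2-normalize all representations so dot products are cosine similarities
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k_pos = k_pos / np.linalg.norm(k_pos, axis=1, keepdims=True)
    queue = queue / np.linalg.norm(queue, axis=1, keepdims=True)
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)  # (N, 1) positive logits
    l_neg = q @ queue.T                               # (N, K) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / tau
    # cross-entropy with the positive key always at index 0
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[:, 0].mean()
```

The temperature `tau=0.07` follows the value commonly used in the MoCo paper; the loss is small when queries match their positive keys and large otherwise.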

SimCLR It greatly simplifies the contrastive learning framework. Its major components are 1) a projector after a ResNet backbone encoder [13] and 2) InfoNCE computed on a large mini-batch. The projector is a small neural network that maps representations to the space where the contrastive loss is applied. By leaving the contrastive task to the projector, the encoder is trained to output representations of better quality.

2.1.2 Contrastive Predictive Coding

CPC It is quite different from the Siamese-based frameworks in the sense that its key insight is to learn useful representations by predicting the future in latent space with an autoregressive model. CPC uses a convolutional neural network (CNN)-based encoder and a GRU-based autoregressive model [8]. The encoder compresses high-dimensional data into a much more compact latent embedding space in which conditional predictions are easier to model, and the autoregressive model makes predictions many steps into the future after processing the representations produced by the encoder.

2.2 Non-Contrastive Learning Methods

The representative non-contrastive learning methods are BYOL and SimSiam. Both frameworks are illustrated in Fig. 2.

BYOL It gained great popularity as the first proposed framework that does not require any negative pairs. Before BYOL, a large mini-batch size [3] or some kind of memory bank [34, 12] was needed to keep a large number of negative pairs. Grill et al. [11] hypothesized that BYOL avoids collapse without negative pairs due to the combination of 1) the addition of a predictor to an encoder, forming an asymmetric architecture, and 2) the use of the momentum encoder.

SimSiam It can be viewed as a simplified version of BYOL that removes the momentum encoder. Chen and He [5] empirically showed that the stop-gradient is critical to prevent collapse.

2.2.1 Feature Decorrelation Considered

The frameworks that improved the representation learning using the idea of the feature decorrelation are W-MSE, Barlow Twins, and a Shuffled-DBN-based framework. They are illustrated in Fig. 2.

W-MSE Its core idea is to whiten feature components of representations so that the feature components are decorrelated from each other, which can eliminate feature redundancy. The whitening process used in W-MSE is based on a Cholesky decomposition [9] proposed by Siarohin et al. [29].

Barlow Twins It encodes a similarity loss (invariance term) and a feature decorrelation loss (redundancy reduction term) via the cross-correlation matrix. The cross-correlation matrix is computed as the matrix multiplication of the two batch-normalized projection matrices, (1/n) Z^A⊤ Z^B, where Z^A, Z^B ∈ ℝ^{n×d} and n and d denote batch size and feature size, respectively. Representation learning is then conducted by optimizing the cross-correlation matrix toward the identity matrix, where optimization of the diagonal terms corresponds to the similarity loss and optimization of the off-diagonal terms corresponds to the feature decorrelation loss.
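For concreteness, the Barlow Twins objective can be sketched in NumPy as follows. This is our illustration, not the official implementation; the default off-diagonal weight `lambd` follows the value reported in the Barlow Twins paper.

```python
import numpy as np

def barlow_twins_loss(za, zb, lambd=5e-3):
    """za, zb: (n, d) batches of projections from the two branches."""
    n, d = za.shape
    # standardize each feature over the batch
    za = (za - za.mean(0)) / za.std(0)
    zb = (zb - zb.mean(0)) / zb.std(0)
    c = za.T @ zb / n  # (d, d) cross-correlation matrix
    on_diag = ((np.diag(c) - 1) ** 2).sum()              # invariance term
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # redundancy reduction term
    return on_diag + lambd * off_diag
```

Driving `c` toward the identity makes the two views' features perfectly correlated (diagonal) while decorrelating different feature components (off-diagonal).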

DBN-based framework Hua et al. [15] categorize feature collapse into two types: 1) complete collapse, caused by constant feature components, and 2) dimensional collapse, caused by redundant feature components. They point out that previous works had mentioned and addressed only the complete collapse, not the dimensional collapse. Their main idea is to use DBN to normalize the output of the encoder-projector, which provides a feature decorrelation effect, and the feature decorrelation prevents the dimensional collapse. To further decorrelate the features, Shuffled-DBN is proposed, in which the order of the feature components is randomly permuted before DBN and the output is permuted back to the original feature-wise order.

2.2.2 Feature Decorrelation and Feature-component Expressiveness Considered

VICReg VICReg encodes Feature Decorrelation (FD) and Feature-component Expressiveness (FcE) in its loss function in addition to a similarity loss, where FD and FcE are termed the covariance term and the variance term in the VICReg paper, respectively. A high FcE indicates that the output values of a feature component have a high variance, and vice versa. Hence, a (near-)zero FcE indicates complete collapse. Its strength lies in its simplicity: VICReg uses the simplest Siamese form and requires neither stop-gradient, an asymmetric architecture, nor a whitening/normalization layer. Despite this simplicity, its performance is very competitive with the other latest SSL frameworks.

2.3 No-striding CPC

Mehari and Strodthoff [25] use a no-striding CPC, a modified version of the original CPC with the strides removed from the convolutional layers. Its performance on the ECG datasets was verified to be better than that of SimCLR, BYOL, and SwAV, which is somewhat surprising given that SimCLR, BYOL, and SwAV outperform CPC by a clear margin in computer vision [1]. This indicates that CPC is competitively effective at learning representations of time series data. However, since the no-striding CPC uses no strides, one feature component in a representation covers only a very short span of the original data. The no-striding CPC is therefore not generic for processing general time series; specifically, it is not suitable when long chunks of time series need to be processed. Since SimCLR, BYOL, and SwAV were already investigated on the 12-lead ECG dataset in [25], we decided to investigate more recent SSL frameworks such as SimSiam, Barlow Twins, and VICReg, under the assumption that the more recent frameworks should perform better since they are algorithmically improved versions of the previous ones. Also, the original implementation of CPC is investigated to put the focus on comparative performance of SSL frameworks, to find out which framework has a higher potential for robust time series representation learning.

It should be noted that the evaluation of CPC is conducted only on the PTB-XL, since its input window length is already suggested there (i.e., 2.5 s) [25]. Because CPC's input window length depends on architectural properties (i.e., the number of striding convolutional layers and the stride sizes), the architecture of the CPC encoder would need to be changed for every UCR dataset. Moreover, CPC could not even be used on short-length datasets such as Crop, since it needs time series long enough for the multi-step predictions.

2.4 Mean Shift

Mean Shift (MSF) is a recent SSL framework that can be viewed as an extended version of BYOL. In BYOL, two positive views are pulled together to optimize the invariance term (i.e., the similarity loss). MSF improves on this by pulling one view toward its k nearest neighbors, where the neighbors are fetched from a memory bank. With this better invariance term, MSF is able to outperform BYOL. We adopt MSF into our VIbCReg to derive a better invariance term.

3 Proposed Method

Figure 3: Comparison between VICReg and VIbCReg, where the difference is highlighted in red. As for VIbCReg, given a batch of input data I, two batches of different views X and X' are produced and then encoded into representations Y and Y'. The representations are further processed into projections Z and Z' via the projector and an iterative normalization (IterNorm) layer. Then, the similarity between Z and Z' is maximized, the variance of each feature along the batch dimension is maximized, and the feature components of Z and Z' are progressively decorrelated from each other.

VIbCReg (Variance-Invariance-better-Covariance Regularization) is inspired by and concurrent with a recent SSL framework, VICReg (Variance-Invariance-Covariance Regularization), and can be viewed as VICReg with better covariance regularization. It should be noted that the variance, invariance, and covariance terms in VICReg correspond to what we call the feature component expressiveness (FcE), similarity loss, and feature decorrelation (FD) terms in VIbCReg, respectively. With the better covariance regularization, VIbCReg outperforms VICReg in terms of learning speed, linear evaluation, and semi-supervised training. VIbCReg is motivated by the following observations: 1) FD has been one of the key components for improving the quality of learned representations [10, 35, 17, 1]; 2) Hua et al. [15] showed that applying DBN and Shuffled-DBN to the output of an encoder-projector is very effective for faster and further FD; 3) addressing FcE in addition to FD is shown to be effective [1]; 4) the scale of VICReg's FD loss (covariance term) varies with the feature size, and its range is quite wide due to the summation over the covariance matrix. Hence, we reasoned that the scale of VICReg's FD loss could be made consistent and small so that the weight parameter for the FD loss would not have to be re-tuned whenever the feature size changes.

VIbCReg is illustrated in Fig. 3 in comparison to VICReg. In Fig. 3, two random transformations t and t' are sampled from a distribution T to produce two different views X and X' of each input. The views are encoded by the encoder into representations Y and Y'. The representations are further processed by the projector and an iterative normalization (IterNorm) layer [16] into projections Z and Z'. The loss is computed at the projection level on Z and Z'.

We describe here the loss function of VIbCReg, which consists of a similarity loss, an FcE loss, and an FD loss. It should be noted that the similarity loss and the FcE loss are defined the same as in VICReg. The input data is processed in batches, and we denote the two batches of projections by Z = [z_1, ..., z_n]^T ∈ ℝ^{n×d} and Z' = [z'_1, ..., z'_n]^T ∈ ℝ^{n×d}, where n and d denote batch size and feature size, respectively. The similarity loss (invariance term) is defined in Eq. (1). The FcE loss (variance term) is defined in Eq. (2), where z^j denotes the vector of the j-th feature component over the batch, S denotes the regularized standard deviation estimator, γ is a target value for the standard deviation (fixed to 1 in our experiments), and ε is a small scalar (i.e., 0.0001) preventing numerical instabilities.

s(Z, Z') = \frac{1}{n} \sum_{i=1}^{n} \| z_i - z'_i \|_2^2    (1)
v(Z) = \frac{1}{d} \sum_{j=1}^{d} \max\left(0, \gamma - S(z^j, \epsilon)\right), \quad S(x, \epsilon) = \sqrt{\mathrm{Var}(x) + \epsilon}    (2)
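A NumPy sketch of these two terms, following our reading of the definitions above (the function names are ours, not from the reference code):

```python
import numpy as np

def similarity_loss(z1, z2):
    """Invariance term: mean squared Euclidean distance between paired projections."""
    return ((z1 - z2) ** 2).sum(axis=1).mean()

def fce_loss(z, gamma=1.0, eps=1e-4):
    """FcE (variance) term: hinge on the per-feature standard deviation over the batch."""
    std = np.sqrt(z.var(axis=0) + eps)
    return np.maximum(0.0, gamma - std).mean()
```

The hinge in `fce_loss` only penalizes features whose batch standard deviation falls below the target γ, which is exactly what pushes the model away from complete collapse.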

The FD loss is the only component of VIbCReg's loss that differs from the corresponding covariance term in the loss of VICReg; we therefore present a comparison between the two. The covariance matrix of Z in VICReg is defined in Eq. (3), and the normalized covariance matrix in VIbCReg is defined in Eq. (4), where the ℓ2-norm is taken along the batch dimension. The FD loss of VICReg is then defined in Eq. (5) and that of VIbCReg in Eq. (6). Eq. (4) constrains the matrix entries to range from -1 to 1, and Eq. (6) takes the mean over the covariance matrix by dividing the summation by the number of matrix elements. Hence, Eq. (4) and Eq. (6) keep the scale and range of the FD loss small and consistent.

C(Z) = \frac{1}{n-1} \sum_{i=1}^{n} (z_i - \bar{z})(z_i - \bar{z})^T, \quad \bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_i    (3)
C_{\mathrm{norm}}(Z) = \tilde{Z}^T \tilde{Z}, \quad \tilde{Z}_{\cdot j} = \frac{Z_{\cdot j} - \bar{z}_j}{\| Z_{\cdot j} - \bar{z}_j \|_2}    (4)
c_{\mathrm{VICReg}}(Z) = \frac{1}{d} \sum_{i \neq j} \left[ C(Z) \right]_{i,j}^2    (5)
c_{\mathrm{VIbCReg}}(Z) = \frac{1}{d^2} \sum_{i \neq j} \left[ C_{\mathrm{norm}}(Z) \right]_{i,j}^2    (6)
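To make the difference concrete, here is a NumPy sketch of both FD losses under the definitions above (an illustration with our function names, not the reference implementation):

```python
import numpy as np

def fd_loss_vicreg(z):
    """VICReg covariance term: squared off-diagonals of the covariance matrix, / d."""
    n, d = z.shape
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / (n - 1)
    off_sum = (cov ** 2).sum() - (np.diag(cov) ** 2).sum()
    return off_sum / d

def fd_loss_vibcreg(z):
    """VIbCReg FD term: squared off-diagonals of the normalized covariance, / d^2."""
    n, d = z.shape
    zc = z - z.mean(axis=0)
    zn = zc / np.linalg.norm(zc, axis=0, keepdims=True)  # l2-normalize along the batch dim
    c = zn.T @ zn                                        # entries lie in [-1, 1]
    off_sum = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return off_sum / d ** 2
```

Because of the normalization, the VIbCReg term is invariant to the scale of the projections and bounded in [0, 1], whereas the VICReg term grows with both the feature scale and the feature size.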

The overall loss function is a weighted average of the similarity loss, FcE loss, and FD loss:

\mathcal{L} = \lambda\, s(Z, Z') + \mu \left\{ v(Z) + v(Z') \right\} + \nu \left\{ c(Z) + c(Z') \right\}    (7)

where λ, μ, and ν are hyper-parameters controlling the importance of each term in the loss. In our experiments, λ and μ are both set to 25, following the VICReg paper. ν for VICReg is set to 1 as in the VICReg paper, and ν for VIbCReg is empirically set to 200. It should be noted that the performance of VIbCReg is quite consistent with respect to the choice of ν thanks to the use of the normalized covariance matrix, Eq. (4).

Another key component in VIbCReg is IterNorm. Hua et al. [15] showed that applying DBN to the output of an encoder-projector could improve representation learning, emphasizing that the whitening process helps the learning process, which corresponds to a main argument of the DBN paper [17]. To further improve on DBN, IterNorm was proposed [16]; it was verified to be more efficient at whitening than DBN by employing Newton's iteration to approximate a whitening matrix. Therefore, we employ IterNorm instead of DBN, and IterNorm is applied to the output of the encoder-projector as shown in Fig. 3. Feature decorrelation is then strongly enforced by two factors: 1) IterNorm and 2) optimization of the FD loss. IterNorm has two hyperparameters, the iteration number and the group size, which are set to 5 and 64, respectively, as recommended in the IterNorm paper [16]. A pseudocode for VIbCReg is presented in Appendix A.
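For illustration, the core of the whitening computed by Newton's iteration can be sketched in NumPy as follows. This is a simplified, single-group version without the running statistics and grouping of the actual IterNorm layer [16]; the function name and arguments are ours.

```python
import numpy as np

def iter_norm(x, n_iter=5, eps=1e-5):
    """Approximately whiten x of shape (n, d) via Newton's iteration (IterNorm-style)."""
    n, d = x.shape
    xc = x - x.mean(axis=0)
    sigma = xc.T @ xc / n + eps * np.eye(d)  # covariance matrix
    tr = np.trace(sigma)
    sigma_n = sigma / tr                     # trace-normalize so the iteration converges
    p = np.eye(d)
    for _ in range(n_iter):
        # Newton's iteration: p converges to sigma_n^{-1/2}
        p = 0.5 * (3.0 * p - p @ p @ p @ sigma_n)
    whitening = p / np.sqrt(tr)              # approximates sigma^{-1/2}
    return xc @ whitening
```

With enough iterations the output's covariance approaches the identity, i.e., the feature components are decorrelated and unit-variance, which is exactly the effect VIbCReg wants at the projector output.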

Relation to VICReg To summarize, VIbCReg can be viewed as VICReg with the normalized covariance matrix (instead of the covariance matrix) and IterNorm.

VIbCReg+MSF (VbIbCReg) We also investigate the incorporation of VIbCReg with MSF, which is proposed to derive a better invariance term; hence, the combination is termed VbIbCReg (Variance-better-Invariance-better-Covariance Regularization). The incorporation is fairly simple: VIbCReg's similarity loss is changed to Eq. (8), where the ℓ2-norm is taken along the feature dimension. For VICReg and VIbCReg, an MSE loss results in better representation learning than applying an ℓ2-norm to the projections first, as shown in [1]. However, for MSF, optimization of the invariance term is conducted in the ℓ2-normalized embedding space [20]. Therefore, to mediate between the two, we adopt the hybrid similarity loss function of Eq. (8). Finally, the loss function of VIbCReg+MSF is defined in Eq. (9), where L_MSF denotes the MSF similarity loss and η its weight hyperparameter. Note that L_MSF is computed from the similarity between one sample and its k nearest neighbors, where the neighbors are fetched from a memory bank. In our experiments, η is set to 5 to slightly add the MSF effect onto VIbCReg. MSF has three hyperparameters: 1) the memory bank size, 2) the number of neighbors k, and 3) the momentum of the target encoder. The memory bank size is set such that it can roughly cover an entire training dataset, and k is set to 2 since k=2 is shown to be effective enough in the MSF paper and our training datasets are relatively much smaller than ImageNet. The momentum of the target encoder is set to 0.99 as in the MSF paper. The architectural details of VbIbCReg are specified in Appendix B.

s_{\mathrm{hybrid}}(Z, Z') = \frac{1}{n} \sum_{i=1}^{n} \left\| \frac{z_i}{\|z_i\|_2} - \frac{z'_i}{\|z'_i\|_2} \right\|_2^2    (8)
\mathcal{L}_{\mathrm{VbIbCReg}} = \lambda\, s_{\mathrm{hybrid}}(Z, Z') + \mu \left\{ v(Z) + v(Z') \right\} + \nu \left\{ c(Z) + c(Z') \right\} + \eta\, \mathcal{L}_{\mathrm{MSF}}    (9)
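The hybrid similarity term can be sketched in NumPy as follows (our reading of Eq. (8); the function name is ours):

```python
import numpy as np

def hybrid_similarity_loss(z1, z2):
    """MSE between projections that are first l2-normalized along the feature dimension."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    return ((z1 - z2) ** 2).sum(axis=1).mean()
```

Since the projections are normalized first, this loss depends only on the directions of the projections, matching MSF's normalized embedding space, while keeping the MSE form preferred by VICReg/VIbCReg.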

4 Experimental Evaluation

Datasets In our experiments, two data sources are used: 1) the UCR archive, from which we select the 10 largest datasets together with the 5 datasets with more than 900 samples having the largest numbers of classes, and 2) the PTB-XL (12-lead ECG) dataset. Each UCR dataset is split into 80% (training set) and 20% (test set) by a stratified split. Model pre-training by SSL is conducted and evaluated on each dataset separately; the UCR datasets are therefore independent of one another in our experiments. The PTB-XL dataset comes with 71 labels and its evaluation task is framed as a multi-label classification task. The dataset is organized into ten stratified, label-balanced folds, of which the first eight are used as the training set, the ninth as the validation set, and the tenth as the test set. As for preprocessing, the UCR datasets are preprocessed by z-normalization followed by arcsinh, while the PTB-XL needs no preprocessing since it is provided already preprocessed. Summaries of the UCR datasets and the PTB-XL are presented in Table 1 and Table 2, respectively.

Dataset name #Samples #Classes Length
Crop 24000 24 46
ElectricDevices 16637 7 96
StarLightCurves 9236 3 1024
Wafer 7164 2 152
ECG5000 5000 5 140
TwoPatterns 5000 4 128
FordA 4921 2 500
UWaveGestureLibraryAll 4478 8 945
FordB 4446 2 500
ChlorineConcentration 4307 3 166
ShapesAll 1200 60 512
FiftyWords 905 50 270
NonInvasiveFetalECGThorax1 3765 42 750
Phoneme 2110 39 1024
WordSynonyms 905 25 270
Table 1: Summary of the UCR datasets. The first 10 datasets are the 10 largest datasets from the UCR archive, and the following 5 are the datasets with the largest numbers of classes among the UCR datasets with more than 900 samples.
Dataset name #Samples #Classes #Patients Sampling frequency Length
PTB-XL 21837 71 18885 100Hz 10s
Table 2: Summary of PTB-XL. Note that its step length is 1000 (i.e., 10 s × 100 Hz).

Training For the optimizer, AdamW [22] (lr=0.001, weight decay=0.00001, batch size=256) is used for 200 training epochs. For the learning rate scheduler, cosine learning rate decay is used [21]. The data augmentation methods used during training are Random Crop, Random Amplitude Resize, and Random Vertical Shift. The details of the augmentation methods are presented in Appendix C. Note that data augmentation is not used for CPC, following the original implementation and the previous study [31, 25].

As for the deep learning library and GPU, PyTorch [26] is used for building and training the models on a single GPU (GTX 1080 Ti).

4.1 Architecture

Encoder We follow the convention of previous SSL papers by using ResNet [11, 5, 35, 1] as an encoder. However, since our datasets are much smaller than those typically used for SSL in computer vision, we use a light-weight 1D ResNet inspired by [33]. The detailed architecture is illustrated in Appendix D. It is used as the encoder for SimSiam, Barlow Twins, VICReg, VIbCReg, and VbIbCReg.

VIbCReg Its projector is structured as (Linear-BN-ReLU)-(Linear-BN-ReLU)-(Linear-IterNorm), where Linear, BN, and ReLU denote a linear layer, batch normalization [18], and a rectified linear unit, respectively. The dimension of the inner and output layers of the projector is set to 4096.
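A PyTorch sketch of this projector head is shown below. Since IterNorm is not part of core PyTorch, `nn.BatchNorm1d(affine=False)` is used here as a runnable stand-in for the final layer; the actual model uses IterNorm [16]. The function name and default sizes are ours.

```python
import torch
import torch.nn as nn

def make_projector(in_dim, hidden_dim=4096, out_dim=4096):
    """(Linear-BN-ReLU)-(Linear-BN-ReLU)-(Linear-IterNorm) projector head sketch.

    NOTE: the last BatchNorm1d(affine=False) is only a stand-in so this sketch runs;
    the real VIbCReg projector ends with an IterNorm layer [16].
    """
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
        nn.BatchNorm1d(out_dim, affine=False),  # stand-in for the IterNorm layer
    )
```

The encoder output (e.g., the 1D ResNet's pooled features) is fed to this head, and the VIbCReg loss is computed on its 4096-dimensional output.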

Competing SSL Frameworks The competing SSL frameworks are SimSiam, Barlow Twins, VICReg, and CPC. In our experiments, these frameworks are implemented as closely as possible to the original implementations. The only major difference is the encoder's input dimension: instead of 2-dimensional image input, they receive 1-dimensional time series input. Unless specified otherwise, the architectures follow the original implementations.

SimSiam Except for the encoder, all the architectural settings are the same as in the original SimSiam paper [5].

Barlow Twins In the original implementation, the size of the linear layers in the projector is set to 8192; in our experiments, it is set to 4096. Barlow Twins has a hyperparameter λ weighting the redundancy reduction term, set to 5×10⁻³ as in the original implementation.

VICReg In the original implementation, the size of the linear layers in the projector is set to 8192; in our experiments, it is set to 4096. Its hyperparameters λ, μ, and ν are set as in the original implementation (25, 25, and 1, respectively).

CPC It uses a downsampling CNN model as an encoder. In the original implementation, the CNN model consists of five convolutional layers with strides [5, 4, 2, 2, 2], filter sizes [10, 8, 4, 4, 4], and 512 hidden units with ReLU activations, yielding one feature vector for every 10 ms of speech on a 16 kHz PCM audio waveform. Since the time series data from the PTB-XL has a much lower sampling frequency (i.e., 100 Hz), the downsampling rate should be smaller. The CNN model in our experiments consists of four convolutional layers with strides [4, 3, 3, 2], filter sizes [8, 6, 6, 4], and 512 hidden units with ReLU activations, yielding one feature vector with a receptive field of 1.96 s on the 100 Hz ECG time series. Its maximum prediction step is set to 4.
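The quoted temporal resolutions follow directly from the strides and filter sizes. A small helper (pure Python, our own) computes the receptive field and hop (output step size) of a strided 1D conv stack:

```python
def conv_stack_footprint(filter_sizes, strides):
    """Return (receptive_field, hop) in input samples for a stack of strided 1D conv layers."""
    rf, hop = 1, 1
    for k, s in zip(filter_sizes, strides):
        rf += (k - 1) * hop  # each layer widens the receptive field by (k-1) input hops
        hop *= s             # and multiplies the output step size by its stride
    return rf, hop

# Original CPC encoder on 16 kHz audio: one feature every 160 samples = 10 ms
print(conv_stack_footprint([10, 8, 4, 4, 4], [5, 4, 2, 2, 2]))  # (465, 160)

# Our encoder on 100 Hz ECG: receptive field of 196 samples = 1.96 s
print(conv_stack_footprint([8, 6, 6, 4], [4, 3, 3, 2]))         # (196, 72)
```

This confirms both numbers in the text: 160 samples / 16 kHz = 10 ms per feature vector for the original encoder, and 196 samples / 100 Hz = 1.96 s receptive field for ours.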

4.2 Linear Evaluation

We follow the linear evaluation protocols from the computer vision domain [11, 5, 35, 1, 20]. Given the pre-trained encoder, we train a supervised linear classifier on the frozen features from the encoder. For the Siamese-based frameworks, the features come from the ResNet's global average pooling (GAP) layer; for CPC, they come from its downsampling CNN model followed by a GAP layer. The linear classifier is trained with AdamW (lr=0.001, batch size=256, weight decay=0.00001) for 50 epochs, with the learning rate reduced by cosine learning rate decay. The data augmentation methods used are Random Amplitude Resize and Random Vertical Shift. Experimental results for the linear evaluation are presented in Table 3 (UCR datasets) and Table 4 (PTB-XL).

Dataset Name Rand Init SimSiam Barlow Twins VICReg VIbCReg VbIbCReg Supervised
Crop 49.6(0.1) 56.0(0.8) 63.7(0.7) 66.2(9.5) 71.0(0.7) 71.7(0.3) 80.1(0.4)
ElectricDevices 51.2(0.7) 53.2(4.3) 64.1(1.1) 73.6(0.1) 87.1(0.3) 84.0(0.4) 87.0(0.2)
StarLightCurves 76.7(2.3) 71.3(8.5) 88.7(4.2) 97.5(0.1) 97.8(0.1) 97.9(0.1) 98.3(0.1)
Wafer 89.4(0.0) 98.4(0.4) 95.9(0.4) 98.8(0.2) 99.5(0.1) 99.6(0.1) 99.9(0.0)
ECG5000 72.9(11.5) 83.1(5.5) 90.9(1.2) 92.8(0.0) 95.4(0.1) 95.5(0.1) 95.8(0.1)
TwoPatterns 42.8(1.6) 37.9(4.4) 87.2(6.5) 81.2(0.6) 99.3(0.2) 99.6(0.2) 100.0(0.0)
FordA 54.5(0.9) 83.0(4.1) 74.5(4.3) 79.0(0.3) 95.5(0.3) 95.4(0.3) 93.3(0.3)
UWaveGestureLibraryAll 47.2(1.3) 30.3(6.8) 51.3(5.6) 57.5(0.8) 90.9(0.4) 90.4(0.3) 96.2(0.4)
FordB 65.7(1.1) 60.8(5.4) 76.1(1.6) 85.4(0.3) 94.0(0.3) 93.8(0.3) 92.4(0.5)
ChlorineConcentration 53.6(0.0) 55.7(0.0) 55.5(0.3) 55.5(0.1) 65.2(0.7) 64.0(0.8) 100.0(0.0)
ShapesAll 7.9(2.5) 14.1(3.1) 39.2(3.6) 31.2(2.4) 85.7(0.8) 86.9(1.0) 91.2(1.0)
FiftyWords 13.6(1.7) 15.8(2.1) 25.4(1.7) 26.3(0.9) 50.2(1.1) 50.1(0.8) 77.2(1.2)
NonInvasiveFetalECGThorax1 5.0(0.8) 20.0(3.6) 21.4(8.8) 37.6(0.7) 58.5(0.6) 72.8(0.6) 94.5(0.3)
Phoneme 11.1(0.0) 18.1(0.6) 19.9(1.5) 21.4(0.2) 42.8(0.5) 42.6(0.2) 47.8(1.0)
WordSynonyms 22.1(0.9) 24.4(0.4) 29.0(1.3) 23.8(0.3) 46.7(0.7) 47.6(0.8) 73.7(1.8)
Table 3: Linear evaluation on the UCR datasets. The results are obtained over 5 runs with different random seeds for the stratified split. It is noticeable that the proposed frameworks outperform the other frameworks with significant margins, and they even outperform Supervised on some datasets. A major difference between VIbCReg and VbIbCReg can be found on the 5 datasets with the 5 highest number of classes (i.e., from ShapesAll to WordSynonyms). The values within the parentheses are standard deviations. Note that Rand Init denotes a randomly-initialized frozen encoder with a linear classifier on the top and Supervised denotes a trainable encoder-linear classifier trained in a supervised manner.
PTB-XL Rand Init CPC SimSiam Barlow Twins VICReg VIbCReg VbIbCReg Supervised
Macro AUC 0.4634(137) 0.8268(20) 0.5899(356) 0.7524(224) 0.8316(11) 0.8478(31) 0.8444(32) 0.9311(18)
Valid. loss 0.1078(2) 0.0799(3) 0.1025(9) 0.0906(33) 0.0791(2) 0.0750(2) 0.0747(2) 0.0629(2)
Test loss 0.1071(2) 0.0806(1) 0.1013(10) 0.0890(26) 0.0788(3) 0.0760(2) 0.0758(2) 0.0612(3)
Table 4: Linear evaluation on the PTB-XL. The results are obtained over 5 runs. For the PTB-XL, a macro AUC score is commonly used as a linear evaluation score [30, 25]. Additionally, the lowest validation and test losses are presented to cover an aspect of the evaluation that the macro AUC cannot cover. The validation loss is computed on the 9th fold in the same setting as the training, and the test loss is computed on the 10th fold in the same setting as the training except that no augmentation is applied on input data. Note that a binary cross entropy loss for multiple classes is used (i.e., BCEWithLogitsLoss in PyTorch). The standard deviations within the parentheses are multiplied by 10000 (e.g., a standard deviation of 0.0002 is shown as 2).

4.3 Fine-tuning Evaluation on a Small Dataset

Fine-tuning evaluation on subsets of the training dataset is conducted as an ablation study. We follow the ablation study protocols from the computer vision domain [11, 5, 35, 1, 20]. Given the pre-trained encoder, we fine-tune the encoder and train the linear classifier jointly. The AdamW optimizer is used with (encoder lr=0.0001, classifier lr=0.001, batch size=256, weight decay=0.001, training epochs=100); that is, the encoder and the linear classifier use different learning rates. During the fine-tuning, the BN statistics are allowed to update. The data augmentation methods used are Random Amplitude Resize and Random Vertical Shift. Experimental results for the fine-tuning evaluation on the UCR datasets and the PTB-XL are presented in Table 5 and Table 6, respectively.
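The two learning rates can be realized with AdamW parameter groups; a minimal sketch (the function name and signature are ours, not the paper's):

```python
import torch
import torch.nn as nn

def make_finetune_optimizer(encoder: nn.Module, classifier: nn.Module,
                            lr_enc=1e-4, lr_clf=1e-3, weight_decay=1e-3):
    """One AdamW with two parameter groups, so the pre-trained encoder
    gets a smaller learning rate (1e-4) than the freshly initialized
    linear classifier (1e-3), as in the fine-tuning protocol above."""
    return torch.optim.AdamW(
        [{"params": encoder.parameters(), "lr": lr_enc},
         {"params": classifier.parameters(), "lr": lr_clf}],
        weight_decay=weight_decay)
```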

Dataset name SimSiam Barlow Twins VICReg VIbCReg VbIbCReg Supervised
Fine-tuning evaluation on 5% of the training dataset
Crop 61.6(0.6) 60.5(0.5) 61.6(2.4) 62.4(0.3) 63.9(0.8) 62.6(0.7)
ElectricDevices 71.4(2.8) 69.2(1.5) 73.1(0.3) 84.7(0.4) 82.7(0.2) 69.0(0.9)
StarLightCurves 85.3(0.1) 93.0(4.3) 98.1(0.1) 98.1(0.1) 98.2(0.0) 97.8(0.2)
Wafer 99.3(0.1) 98.8(0.1) 99.4(0.1) 99.1(0.2) 99.3(0.1) 99.3(0.2)
ECG5000 91.4(1.2) 91.8(0.8) 94.0(0.5) 94.1(0.6) 94.3(0.4) 94.1(0.4)
TwoPatterns 49.3(8.5) 96.2(2.2) 98.6(0.4) 99.2(0.5) 99.4(0.2) 88.0(4.2)
FordA 91.1(1.0) 80.6(3.5) 90.1(0.6) 94.4(0.6) 94.3(0.4) 89.8(1.0)
UWaveGestureLibraryAll 42.9(11.3) 55.1(4.6) 77.1(2.2) 83.4(2.3) 83.7(1.5) 78.5(1.1)
FordB 74.3(5.3) 82.4(2.1) 90.0(0.4) 92.2(0.7) 92.3(0.8) 87.4(1.9)
ChlorineConcentration 57.4(0.7) 55.9(0.5) 56.7(1.0) 61.7(1.7) 61.8(2.0) 65.6(1.9)
NonInvasiveFetalECGThorax1 33.4(2.5) 21.2(6.1) 44.2(2.6) 45.6(1.9) 55.9(1.9) 66.9(3.8)
Fine-tuning evaluation on 10% of the training dataset
Crop 66.2(0.6) 65.3(0.4) 66.4(1.9) 66.3(0.4) 68.0(0.6) 67.0(0.7)
ElectricDevices 75.1(1.9) 73.8(0.8) 75.9(0.5) 86.5(0.6) 85.4(0.5) 73.4(0.7)
StarLightCurves 91.8(5.3) 97.4(0.3) 98.2(0.1) 98.2(0.1) 98.2(0.0) 98.0(0.2)
Wafer 99.6(0.1) 99.4(0.1) 99.6(0.1) 99.5(0.1) 99.5(0.1) 99.6(0.1)
ECG5000 93.2(0.3) 93.1(0.5) 94.6(0.7) 95.0(0.5) 95.1(0.4) 94.4(0.7)
TwoPatterns 93.0(5.0) 99.5(0.5) 99.7(0.1) 99.9(0.1) 99.9(0.1) 99.9(0.1)
FordA 92.4(0.1) 87.5(1.4) 92.8(0.3) 94.8(0.2) 94.8(0.1) 91.7(0.3)
UWaveGestureLibraryAll 58.1(14.0) 68.5(3.2) 84.9(1.0) 91.2(1.2) 91.1(1.2) 84.7(0.9)
FordB 89.3(0.6) 89.3(1.9) 91.3(0.6) 93.0(0.5) 92.5(0.3) 89.1(0.3)
ChlorineConcentration 63.4(0.7) 56.2(0.3) 60.6(1.3) 72.2(2.6) 71.5(2.3) 76.7(2.2)
ShapesAll 21.8(6.2) 35.4(3.4) 38.2(2.5) 67.8(3.1) 68.4(2.2) 50.2(4.0)
NonInvasiveFetalECGThorax1 38.1(5.6) 24.4(3.7) 45.7(1.9) 61.8(1.6) 70.3(2.4) 77.4(4.1)
WordSynonyms 29.3(1.1) 34.4(2.5) 40.3(1.4) 45.4(2.6) 48.8(3.0) 36.2(2.4)
Fine-tuning evaluation on 20% of the training dataset
Crop 70.4(0.5) 70.2(0.4) 70.4(1.3) 70.6(0.5) 71.8(0.2) 71.1(0.6)
ElectricDevices 80.4(0.9) 79.7(0.5) 79.0(0.2) 88.1(0.3) 87.5(0.2) 78.5(0.6)
StarLightCurves 98.0(0.3) 98.0(0.2) 98.4(0.1) 98.3(0.1) 98.3(0.0) 98.2(0.1)
Wafer 99.7(0.1) 99.6(0.1) 99.6(0.1) 99.6(0.1) 99.6(0.1) 99.7(0.1)
ECG5000 94.3(0.7) 94.4(0.2) 95.3(0.2) 95.6(0.3) 95.7(0.4) 95.3(0.2)
TwoPatterns 99.9(0.1) 99.9(0.2) 100.0(0.0) 100.0(0.0) 100.0(0.0) 100.0(0.0)
FordA 93.2(0.5) 91.2(0.5) 93.2(0.4) 95.0(0.4) 95.2(0.5) 92.2(0.6)
UWaveGestureLibraryAll 78.0(8.6) 85.5(1.4) 89.2(0.4) 94.5(0.7) 94.1(0.6) 89.7(1.0)
FordB 91.0(0.5) 91.8(0.4) 92.1(0.5) 93.7(0.5) 93.4(0.4) 90.8(0.3)
ChlorineConcentration 85.4(1.1) 56.7(0.7) 81.4(0.9) 88.4(0.9) 88.7(1.1) 93.2(1.1)
ShapesAll 27.6(11.2) 48.8(3.5) 53.6(0.6) 79.6(1.9) 80.2(2.5) 66.6(3.6)
FiftyWords 31.7(1.3) 47.1(1.9) 46.7(2.7) 56.4(1.7) 56.0(0.8) 49.0(1.9)
NonInvasiveFetalECGThorax1 72.9(3.2) 49.8(2.1) 79.7(1.1) 83.0(0.8) 85.1(0.4) 87.4(0.7)
Phoneme 27.5(2.0) 27.8(1.0) 34.5(1.1) 41.5(1.4) 40.9(1.6) 32.8(1.6)
WordSynonyms 32.7(4.9) 42.4(2.7) 47.7(3.3) 55.9(2.9) 55.9(1.9) 47.8(3.2)
Table 5: Fine-tuning evaluation on subsets of the UCR datasets. The results are obtained over 5 runs with different random seeds for the stratified split. In the 5% and 10% settings, results on some datasets are missing because the split training subsets are too small.
PTB-XL CPC SimSiam Barlow Twins VICReg VIbCReg VbIbCReg Supervised
Macro AUC 0.8313(30) 0.8179(120) 0.8325(39) 0.8692(25) 0.8514(26) 0.8506(27) 0.8643(12)
Valid. loss 0.0768(3) 0.0845(20) 0.0802(6) 0.0754(3) 0.0750(2) 0.0743(1) 0.0765(2)
Test loss 0.0776(3) 0.0831(14) 0.0792(11) 0.0749(4) 0.0764(2) 0.0757(2) 0.0749(3)
Table 6: Fine-tuning evaluation on a subset of the PTB-XL, where the subset used for the fine-tuning is the 1st fold. That is, 12.5% of the training dataset. The results are obtained over 5 runs.

4.4 Faster Representation Learning by VIbCReg

On top of its outstanding performance, VIbCReg has another strength: faster representation learning, which makes it more appealing than its competitors. The speed of representation learning is presented via kNN classification accuracy for the UCR datasets [5] in Fig. 4 and via kNN macro F1 score [27] for the PTB-XL in Fig. 5, along with the FD and FcE metrics. The FD and FcE metrics measure feature decorrelation and feature component expressiveness, respectively. A low FD metric indicates high feature decorrelation (i.e., features are well decorrelated) and vice versa; a low FcE metric indicates feature collapse and vice versa. Note that VIbCReg and VbIbCReg are trained to have an FcE metric of 1. Details of both metrics are specified in Appendix E. To keep this section neat, the FD and FcE metrics corresponding to Figs. 4-5 are presented in Appendix F.1.
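The kNN evaluation itself can be sketched with scikit-learn, which the paper cites for this purpose [27]. A minimal version, assuming pre-extracted frozen features (array shapes and the helper's name are ours):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_score(train_feats, train_labels, test_feats, test_labels, k=5):
    """kNN probe sketch: fit a k-NN classifier on frozen training
    features and score it on held-out features, as used here to track
    representation quality over the pre-training epochs."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_feats, train_labels)
    return knn.score(test_feats, test_labels)
```

In practice this probe is cheap enough to run every few epochs, which is how the curves in Figs. 4-5 are obtained.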

To provide further analysis with respect to the faster representation learning, linear evaluation results with the pretrained encoders of 10 epochs and 100 epochs on the UCR datasets are presented in Fig. 6.

Figure 4: 5-kNN classification accuracy on the UCR datasets during the representation learning. VIbCReg and VbIbCReg show the fastest convergence and the highest kNN accuracy on most of the datasets.
Figure 5: kNN macro F1 score on the PTB-XL dataset during the representation learning. Similar to the kNN accuracy on the UCR datasets, VIbCReg and VbIbCReg achieve higher scores with the fastest convergence.
Figure 6: Linear evaluation results with the pre-trained encoders at 10 epochs and 100 epochs on the UCR datasets. VIbCReg and VbIbCReg show much smaller gaps between ep:10 and ep:100 on many datasets. This fast learning can significantly reduce the computational cost of pre-training.

4.5 Between VICReg and VIbCReg

As mentioned earlier in Section 3, VIbCReg can be viewed as VICReg with the normalized covariance matrix (NCM) and IterNorm. In this section, frameworks between VICReg and VIbCReg are investigated as shown in Table 7. The linear evaluation results of the four frameworks on the UCR datasets are presented in Fig. 7. The corresponding kNN accuracy graphs are also presented in Fig. 8 to show the learning progress, and the corresponding FD and FcE metrics graphs are presented in Appendix F.2.
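For readers unfamiliar with IterNorm [16], the following is a hedged, minimal illustration of the whitening it performs: the inverse square root of the feature covariance is approximated by Newton iterations instead of an explicit eigendecomposition. The running statistics, affine parameters, and channel grouping of the full method are omitted; this is a sketch, not the implementation used in the experiments.

```python
import torch

def iternorm_whiten(x: torch.Tensor, T: int = 5, eps: float = 1e-5):
    """Whiten (batch, dim) activations with Newton's iteration in the
    spirit of IterNorm. Trace-normalizing the covariance guarantees the
    iteration converges; T iterations approximate Sigma^{-1/2}."""
    n, d = x.shape
    xc = x - x.mean(dim=0, keepdim=True)         # center over the batch
    sigma = xc.T @ xc / n + eps * torch.eye(d)   # feature covariance
    tr = sigma.diagonal().sum()
    sigma_n = sigma / tr                         # trace-normalize
    p = torch.eye(d)
    for _ in range(T):                           # Newton iteration
        p = 0.5 * (3.0 * p - p @ p @ p @ sigma_n)
    whitening = p / tr.sqrt()                    # ~ Sigma^{-1/2}
    return xc @ whitening
```

After whitening, the empirical covariance of the output is close to the identity, which is exactly the decorrelated, non-collapsed feature state that the FD and FcE metrics track.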

Frameworks    NCM  IterNorm  Notation
VICReg        -    -         VICReg
-             o    -         VICReg+NCM
-             -    o         VICReg+IterN
-             o    o         VIbCReg
Table 7: Frameworks between VICReg and VIbCReg: VICReg+NCM and VICReg+IterN. NCM denotes the normalized covariance matrix; o marks that the component is used.
Figure 7: Comparative linear evaluation of the frameworks between VICReg and VIbCReg. Adding either NCM (ncm) or IterNorm (itern) improves the performance, and VIbCReg shows the most consistently high performance among the four frameworks.
Figure 8: Comparative kNN accuracy of the frameworks between VICReg and VIbCReg. On most of the UCR datasets, VIbCReg and VICReg+IterN outperform the others, and VIbCReg learns faster and performs better than VICReg+IterN on several datasets.

4.6 Sensitivity Test w.r.t Weight for the FD Loss

The normalized covariance matrix is proposed to reduce the effort of tuning the weight hyperparameter for the feature decorrelation (FD) loss term. Without the normalization, the scale of the FD loss computed from the covariance matrix can be large and vary widely, which makes the tuning process harder. To show that tuning is easier with the normalized covariance matrix (i.e., performance is relatively consistent with respect to the weight), sensitivity test results with respect to the weight are presented in Fig. 9.
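The intuition can be demonstrated numerically (this illustration is ours, not taken from the paper): entries of the column-normalized covariance are cosine similarities bounded in [-1, 1] regardless of feature amplitude, whereas raw covariance entries grow quadratically with it, so the FD loss scale stays stable under NCM.

```python
import torch
import torch.nn.functional as F

def cov_offdiag_scale(z: torch.Tensor):
    """Return the largest absolute off-diagonal entry of (a) the raw
    covariance and (b) the column-normalized covariance of z (N x F).
    The former scales with feature amplitude; the latter is bounded."""
    zc = z - z.mean(dim=0)
    n = z.shape[0]
    raw_cov = zc.T @ zc / (n - 1)
    zn = F.normalize(zc, p=2, dim=0)   # unit l2-norm per feature column
    norm_cov = zn.T @ zn               # entries are cosine similarities
    off = ~torch.eye(z.shape[1], dtype=torch.bool)
    return raw_cov[off].abs().max(), norm_cov[off].abs().max()
```

Scaling the features by 100x multiplies the raw off-diagonal entries by 10000x while leaving the normalized ones unchanged, which is why a single FD weight works across datasets.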

Figure 9: Linear evaluation of the sensitivity test with respect to the FD loss weight. The default weights for VICReg and VIbCReg are 1 and 200, respectively. The variants are set to 5/10%, 50%, 200%, and 500% of the default values. The performance gaps between different weights are much smaller for VIbCReg than for VICReg in general, which makes the tuning process much easier.

5 Discussion

Mean Shift (MSF) in VbIbCReg VbIbCReg is proposed by adding the better invariance term of MSF on top of VIbCReg in an attempt to improve the performance even further. The idea of adding MSF itself is simple, but it comes with additional complexity introduced by the memory bank and the moving encoder. These not only add implementation complexity but also increase GPU memory usage and computational time. In the MSF paper [20], the authors state that GPU memory is not an issue if the feature size is set to 512. However, the Barlow Twins paper [35] shows that a large feature size matters for maximizing performance; they increased the feature size to 16384 in their experiments. A dilemma then arises: a large feature size is better, yet it takes too much GPU memory. Therefore, we settled on 4192 for our experiments. If VbIbCReg is used, one can experiment to find an optimal feature size given this trade-off.

VbIbCReg vs. VIbCReg

We expected that VbIbCReg would perform better than VIbCReg since it has an advantage in optimizing the invariance term. Yet, VbIbCReg does not show any significant improvement over VIbCReg. We suspect the dataset size and the number of classes in a dataset are responsible. In the MSF paper, the framework is trained on ImageNet, which has a large number of both samples and classes, whereas in our experiments the frameworks are trained on each of the relatively much smaller datasets. It may be that MSF reveals its advantage once trained on a large dataset with many classes. Therefore, experiments on a dataset combining multiple datasets from the UCR archive will be conducted in our future study to see if MSF benefits the performance.

Usability of VIbCReg The framework most similar to VIbCReg is VICReg, which was proposed in the computer vision domain. Despite its significant simplicity, VICReg showed very competitive performance. VIbCReg is not so different from VICReg, as VIbCReg is VICReg + IterNorm + the normalized covariance matrix. Conceptually speaking, VIbCReg is VICReg with better feature decorrelation. In that sense, there is no component in VIbCReg that is specific to time series. Hence, we are confident that VIbCReg would also show competitive results in the computer vision domain.

6 Conclusion

In this paper we have introduced VIbCReg, an improved version of VICReg. By using a normalized covariance loss, we significantly improve the performance over all tested frameworks for linear and fine-tuned time series classification. By using IterNorm in the last layer of the projector, we also significantly reduce the training time of the model.

While our application was on time series classification, we suspect that VIbCReg can improve on representation learning in computer vision as well.

Acknowledgements

We would like to thank the Norwegian Research Council for funding the Machine Learning for Irregular Time Series (ML4ITS) project. This funding directly supported this research.

References

  • [1] A. Bardes, J. Ponce, and Y. LeCun (2021) VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. In arXiv, External Links: Link Cited by: §1, §1, §1, §2.3, §3, §3, §4.1, §4.2, §4.3.
  • [2] M. Caron, P. Goyal, I. Misra, P. Bojanowski, J. Mairal, and A. Joulin (2020) Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. External Links: ISSN 23318422 Cited by: §1.
  • [3] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. External Links: ISSN 23318422 Cited by: §1, §2.2.
  • [4] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: §1.
  • [5] X. Chen and K. He (2020) Exploring simple siamese representation learning. External Links: ISSN 23318422 Cited by: §1, §1, §2.2, §4.1, §4.1, §4.2, §4.3, §4.4.
  • [6] Y. Chen, E. Keogh, B. Hu, N. Begum, A. Bagnall, A. Mueen, and G. Batista (2015-07) The ucr time series classification archive. Note: Cited by: §1.
  • [7] S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In

    Proceedings - 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005

    ,
    Vol. I, pp. 539–546. External Links: Link, ISBN 0769523722, Document Cited by: §2.1.1.
  • [8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014-12)

    Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

    .
    In arXiv, External Links: Link Cited by: §2.1.2.
  • [9] D. Dereniowski and M. Kubale (2003) Cholesky factorization of matrices in parallel and ranking of graphs. In International Conference on Parallel Processing and Applied Mathematics, Vol. 3019, pp. 985–992. External Links: ISBN 3540219463, Document, ISSN 16113349 Cited by: §2.2.1.
  • [10] A. Ermolov, A. Siarohin, E. Sangineto, and N. Sebe (2020) Whitening for Self-Supervised Representation Learning. External Links: Link, ISSN 23318422 Cited by: §1, §1, §3.
  • [11] J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. Cited by: §1, §1, §2.2, §4.1, §4.2, §4.3.
  • [12] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §1, §2.2.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2016-Decem, pp. 770–778. External Links: Link, ISBN 9781467388504, Document, ISSN 10636919 Cited by: Appendix D, §2.1.1.
  • [14] O. J. Hénaff, A. Razavi, C. Doersch, S. M. Ali Eslami, and A. Van Den Oord (2019) Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, External Links: ISSN 23318422 Cited by: §1.
  • [15] T. Hua, W. Wang, Z. Xue, Y. Wang, S. Ren, and H. Zhao (2021) On Feature Decorrelation in Self-Supervised Learning. In arXiv, External Links: Link Cited by: §1, §1, §2.2.1, §3, §3.
  • [16] L. Huang, Y. Zhou, F. Zhu, L. Liu, and L. Shao (2019) Iterative normalization: Beyond standardization towards efficient whitening. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2019-June, pp. 4869–4878. External Links: ISBN 9781728132938, Document, ISSN 10636919 Cited by: §3, §3.
  • [17] L. Huangi, D. Yang, B. Lang, and J. Deng (2018) Decorrelated Batch Normalization. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 791–800. External Links: ISBN 9781538664209, Document, ISSN 10636919 Cited by: §1, §3, §3.
  • [18] S. Ioffe and C. Szegedy (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In 32nd International Conference on Machine Learning, ICML 2015, Vol. 1, pp. 448–456. External Links: ISBN 9781510810587 Cited by: §4.1.
  • [19] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese Neural Networks for One-Shot Image Recognition. In ICML - Deep Learning Workshop, Vol. 7, pp. 956–963. External Links: ISBN 9788578110796, ISSN 19454589 Cited by: Figure 2, §2.1.1.
  • [20] S. A. Koohpayegani, A. Tejankar, and H. Pirsiavash (2021) Mean Shift for Self-Supervised Learning. In arXiv, External Links: Link Cited by: §1, §3, §4.2, §4.3, §5.
  • [21] I. Loshchilov and F. Hutter (2017)

    SGDR: Stochastic gradient descent with warm restarts

    .
    In 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, External Links: Link Cited by: §4.
  • [22] I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, External Links: Link Cited by: §4.
  • [23] S. Makridakis, E. Spiliotis, and V. Assimakopoulos (2020) The m5 accuracy competition: results, findings and conclusions. International Journal of Forecasting. Cited by: §1.
  • [24] S. Makridakis, E. Spiliotis, and V. Assimakopoulos (2018) The m4 competition: results, findings, conclusion and way forward. International Journal of Forecasting 34 (4), pp. 802–808. Cited by: §1.
  • [25] T. Mehari and N. Strodthoff (2021-03) Self-supervised representation learning from 12-lead ECG data. arXiv. External Links: Link Cited by: §C.1, §1, §2.3, §2.3, Table 4, §4.
  • [26] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §4.
  • [27] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §4.4.
  • [28] E. A. Perez Alday, A. Gu, A. J. Shah, C. Robichaux, A. K. I. Wong, C. Liu, F. Liu, A. B. Rad, A. Elola, S. Seyedi, Q. Li, A. Sharma, G. D. Clifford, and M. A. Reyna (2020) Classification of 12-lead ECGs: The PhysioNet/Computing in Cardiology Challenge 2020. Physiological Measurement 41 (12), pp. 124003. External Links: Link, Document, ISSN 13616579 Cited by: §1.
  • [29] A. Siarohin, E. Sangineto, and N. Sebe (2019) Whitening and coloring batch transform for GANS. In 7th International Conference on Learning Representations, ICLR 2019, External Links: Link Cited by: §2.2.1.
  • [30] N. Strodthoff, P. Wagner, T. Schaeffter, and W. Samek (2020) Deep learning for ECG analysis: Benchmarks and insights from PTB-XL. Vol. 2020. External Links: Document, ISSN 23318422 Cited by: Table 4.
  • [31] A. Van Den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. External Links: ISSN 23318422 Cited by: §2.1.1, §4.
  • [32] P. Wagner, N. Strodthoff, R. D. Bousseljot, D. Kreiseler, F. I. Lunze, W. Samek, and T. Schaeffter (2020-12) PTB-XL, a large publicly available electrocardiography dataset. Scientific Data 7 (1), pp. 1–15. External Links: Link, Document, ISSN 20524463 Cited by: §1.
  • [33] F. Wang (2018) Multi-Scale-1D-ResNet. GitHub. Note: https://github.com/geekfeiw/Multi-Scale-1D-ResNet Cited by: §4.1.
  • [34] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018-12) Unsupervised Feature Learning via Non-parametric Instance Discrimination. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. External Links: ISBN 9781538664209, Document, ISSN 10636919 Cited by: §2.2.
  • [35] J. Zbontar, L. Jing, I. Misra, Y. Lecun, and S. Deny (2021) Barlow Twins: Self-Supervised Learning via Redundancy Reduction. In arXiv, External Links: Link Cited by: §1, §1, §3, §4.1, §4.2, §4.3, §5.

Appendix A Pseudocode for VIbCReg

Algorithm 1 PyTorch-style pseudocode for VIbCReg
# f: encoder network
# lambda_, mu, nu: coefficients of the invariance, variance, and covariance
#                  losses
# N: batch size
# F: feature size (= dimension of the representations)
#
# mse_loss: mean squared error loss function
# off_diagonal: off-diagonal elements of a matrix
# relu: ReLU activation function
# normalize: torch.nn.functional.normalize(..)

for x in loader:  # load a batch with N samples
    # two randomly augmented versions of x
    x_a, x_b = augment(x)

    # compute representations
    z_a = f(x_a)  # N x F
    z_b = f(x_b)  # N x F

    # invariance loss
    sim_loss = mse_loss(z_a, z_b)

    # variance loss
    std_z_a = torch.sqrt(z_a.var(dim=0) + 1e-4)
    std_z_b = torch.sqrt(z_b.var(dim=0) + 1e-4)
    std_loss = torch.mean(relu(1 - std_z_a))
    std_loss = std_loss + torch.mean(relu(1 - std_z_b))

    # covariance loss (normalized covariance matrix)
    z_a = z_a - z_a.mean(dim=0)
    z_b = z_b - z_b.mean(dim=0)
    norm_z_a = normalize(z_a, p=2, dim=0)
    norm_z_b = normalize(z_b, p=2, dim=0)
    norm_cov_z_a = (norm_z_a.T @ norm_z_a)
    norm_cov_z_b = (norm_z_b.T @ norm_z_b)
    norm_cov_loss = (off_diagonal(norm_cov_z_a)**2).mean() \
                    + (off_diagonal(norm_cov_z_b)**2).mean()

    # loss
    loss = lambda_ * sim_loss + mu * std_loss + nu * norm_cov_loss

    # optimization step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Table 8: Pseudocode for VIbCReg. We mostly follow the same notations from the VICReg paper.

Appendix B Architecture of VbIbCReg

The original implementation of MSF consists of a backbone (i.e., ResNet), a projector, and a predictor. Its projector is arranged as (Linear(4096)-BN-ReLU)-(Linear(512)) and its predictor as (Linear(4096)-BN-ReLU)-(Linear(512)), where the value in parentheses denotes the hidden layer size of a linear layer. Compared with VIbCReg, the main difference lies in the use of the predictor. As shown in Table 9, the presence of the predictor does not provide much performance improvement in our experiments; therefore, the predictor is not used. In these experiments, the predictor is implemented as in the MSF paper, and the output from the backbone-projector is treated as the query. It should be noted that the projector in VbIbCReg is the same as VIbCReg's projector.
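The MSF-style head described above can be sketched as follows; the input dimension (512, matching a ResNet backbone's GAP output) is an assumption, and this block only illustrates the architecture being compared, not VbIbCReg's own projector.

```python
import torch.nn as nn

def msf_head(in_dim: int = 512, hidden: int = 4096,
             out_dim: int = 512) -> nn.Sequential:
    """Sketch of the MSF projector/predictor layout:
    (Linear(4096)-BN-ReLU)-(Linear(512)). The same shape serves as
    both projector and predictor in the original MSF implementation."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden),
        nn.BatchNorm1d(hidden),
        nn.ReLU(),
        nn.Linear(hidden, out_dim))
```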

Dataset name w/o predictor with predictor
Crop 72.3 73.1
ElectricDevices 84.8 86.2
StarLightCurves 98.1 97.7
Wafer 99.5 99.5
ECG5000 95.3 95.3
TwoPatterns 99.3 99.3
FordA 95.4 95.3
UWaveGestureLibraryAll 90.3 90.2
FordB 93.9 93.8
ChlorineConcentration 63.6 66.6
ShapesAll 86.7 85.8
FiftyWords 50.8 50.3
NonInvasiveFetalECGThorax1 72.5 63.6
Phoneme 41.2 43.6
WordSynonyms 49.2 48.1
Table 9: Linear evaluation with respect to the use of the predictor for VbIbCReg. The values in the table denote the accuracy. The evaluation was conducted on one random seed.

Appendix C Data Augmentation Methods

C.1 Methods

In our experiments, three data augmentation methods are used: Random Crop, Random Amplitude Resize, and Random Vertical Shift.

Random Crop is similar to its counterpart in computer vision; the only difference is that the crop is conducted on 1D time series data in our experiments. Its hyperparameter is the crop size, which is set to half the length of the input time series for most of the UCR datasets. A summary of the crop size for each UCR dataset is presented in Table 10. Note that the crop size for the PTB-XL is 2.5s, as previously suggested in [25].
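A minimal sketch of Random Crop for a 1D series, assuming a (channels, time) tensor (the function name is ours):

```python
import torch

def random_crop_1d(x: torch.Tensor, crop_size: int) -> torch.Tensor:
    """Take `crop_size` consecutive time steps starting at a uniformly
    random position, the 1D analogue of a random image crop."""
    t = x.shape[-1]
    start = torch.randint(0, t - crop_size + 1, (1,)).item()
    return x[..., start:start + crop_size]
```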

Dataset name Length Use half length for crop Crop size
Crop 46 x 46
ElectricDevices 96 o 48
StarLightCurves 1024 o 512
Wafer 152 x 152
ECG5000 140 o 70
TwoPatterns 128 o 64
FordA 500 o 250
UWaveGestureLibraryAll 945 o 473
FordB 500 o 250
ChlorineConcentration 166 x 166
ShapesAll 512 o 256
FiftyWords 270 o 135
NonInvasiveFetalECGThorax1 750 x 750
Phoneme 1024 o 512
WordSynonyms 270 o 135
Table 10: Summary of the crop size for each UCR dataset.

Random Amplitude Resize randomly resizes the overall amplitude of the input time series. It is expressed as Eq. (10), where a random multiplier is applied to the input time series and a hyperparameter controls the multiplier's range. In our experiments, the hyperparameter is set to 0.3 unless specified otherwise.

(10)

Random Vertical Shift randomly shifts the input time series in the vertical direction. It is expressed as Eq. (11), where the shift magnitude is scaled by the standard deviation of the input time series before any augmentation and a hyperparameter determines the magnitude of the vertical shift. In our experiments, the hyperparameter is set to 0.5 unless specified otherwise.

(11)
(12)
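Since the bodies of Eqs. (10)-(12) are not reproduced in this copy, the following sketches are assumptions consistent with the surrounding descriptions, not the paper's exact formulas: a uniform multiplier r ~ U(1 - beta, 1 + beta) for the amplitude resize (beta = 0.3), and a uniform offset scaled by c times the series' standard deviation for the vertical shift (c = 0.5).

```python
import torch

def random_amplitude_resize(x: torch.Tensor, beta: float = 0.3):
    """Assumed form: multiply the whole series by r ~ U(1-beta, 1+beta)."""
    r = 1.0 + beta * (2 * torch.rand(1) - 1)
    return r * x

def random_vertical_shift(x: torch.Tensor, c: float = 0.5):
    """Assumed form: add a constant offset drawn from
    U(-c*std(x), c*std(x)), where std(x) is over the raw series."""
    shift = c * x.std() * (2 * torch.rand(1) - 1)
    return x + shift
```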

Appendix D Encoder

Figure 10: The light-weighted ResNet is illustrated compared to ResNet18.

The light-weighted ResNet presented in Fig. 10 is used in our experiments. The illustration is in the same format as in the original ResNet paper [13].

Appendix E FD and FcE Metrics

The FD and FcE metrics measure feature decorrelation and feature component expressiveness, respectively. They are used to track the FD and FcE status of the learned features during SSL for representation learning. The FD and FcE metrics are defined in Eq. (14) and Eq. (15), respectively, where Z is the output from the projector and F is the feature size of Z, and the diagonal terms of the matrix are ignored. In Eq. (13), the l2-norm is taken along the batch dimension. In Eq. (15), the standard deviation is computed along the batch dimension.

(13)
(14)
(15)
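Since Eqs. (13)-(15) are not reproduced in this copy, the exact reductions below are assumptions consistent with the description: FD is taken as the mean absolute off-diagonal entry of the column-normalized covariance of the projector output, and FcE as the mean per-feature standard deviation over the batch.

```python
import torch
import torch.nn.functional as F

def fd_fce_metrics(z: torch.Tensor):
    """Sketch of the FD and FcE metrics for projector output z (N x F).
    FD near 0 means features are well decorrelated; FcE near 0 signals
    feature collapse (VIbCReg trains toward FcE = 1)."""
    n, f = z.shape
    zc = z - z.mean(dim=0)
    zn = F.normalize(zc, p=2, dim=0)   # l2-normalize along the batch dim
    c = zn.T @ zn                      # normalized covariance (F x F)
    off = ~torch.eye(f, dtype=torch.bool)
    fd = c[off].abs().mean()           # ignore the diagonal terms
    fce = z.std(dim=0).mean()          # per-feature std over the batch
    return fd.item(), fce.item()
```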

Appendix F Additional Materials for Some Sections

f.1 Faster Representation Learning by VIbCReg

The FD and FcE metrics that correspond to Fig. 4 (i.e., 5-kNN classification accuracy on the UCR datasets) are presented in Figs. 11-12, respectively. The FD and FcE metrics that correspond to Fig. 5 (i.e., kNN macro F1 score on the PTB-XL dataset during the representation learning) are presented in Fig. 13.

Figure 11: FD metrics on the UCR datasets during the representation learning. The metrics of VIbCReg and VbIbCReg converge to near-zero early on, ensuring feature decorrelation throughout the entire training.
Figure 12: FcE metrics on the UCR datasets during the representation learning. Note that the y-axis is log-scaled. The metrics of VIbCReg and VbIbCReg converge to 1 fairly quickly, ensuring feature component expressiveness throughout the entire training.
Figure 13: FD and FcE metrics on the PTB-XL during the representation learning. Similar to the FD and FcE metrics on the UCR datasets, VIbCReg and VbIbCReg show the fastest convergence to the optimal values. An exponential moving average with a window size of 10 is applied to the results here.

f.2 Between VICReg and VIbCReg

FD metrics and FcE metrics that correspond to Fig. 8 (i.e., Comparative kNN accuracy of the frameworks between VICReg and VIbCReg) are presented in Fig. 14 and Fig. 15, respectively.

Figure 14: FD metrics of the frameworks between VICReg and VIbCReg on the UCR datasets. ncm and itern denote the normalized covariance matrix and IterNorm, respectively. VIbCReg shows the fastest convergence with the lowest FD metrics on most of the datasets. Both NCM and IterN improve the feature decorrelation, with IterN bringing the larger improvement, and using both NCM and IterN leads to the highest feature decorrelation. Note that VIbCReg is VICReg+NCM+IterN.
Figure 15: FcE metrics of the frameworks between VICReg and VIbCReg on the UCR datasets. ncm and itern denote the normalized covariance matrix and IterNorm, respectively. The use of NCM results in faster convergence of the feature component expressiveness.