Evaluating Contrastive Learning on Wearable Timeseries for Downstream Clinical Outcomes

by Kevalee Shah, et al.
University of Cambridge

Vast quantities of person-generated health data are collected from wearables, but annotating these data so that they can be fed to machine learning models is impractical. This paper discusses how self-supervised approaches that use contrastive losses, such as SimCLR and BYOL, previously applied to the vision domain, can be applied to high-dimensional health signals for downstream classification of various diseases spanning sleep, heart, and metabolic conditions. To this end, we adapt the data augmentation step and the overall architecture to suit the temporal nature of the data (wearable traces) and evaluate on 5 downstream tasks, comparing against other state-of-the-art methods including supervised learning and an adversarial unsupervised representation learning method. We show that SimCLR outperforms the adversarial method and a fully-supervised method in the majority of the downstream evaluation tasks, and that all self-supervised methods outperform the fully-supervised methods. This work provides a comprehensive benchmark for contrastive methods applied to the wearable time-series domain, showing the promise of task-agnostic representations for downstream clinical outcomes.








1 Introduction

The current bottleneck for supervised learning on data from wearables is the lack of labelled datasets. Self-supervised methods can instead extract features from unlabelled datasets, and these features can then be leveraged in downstream disease classification tasks. In this way, user-generated data can be used to predict user-specific diseases and conditions, which would aid healthcare through early diagnosis: diagnosing conditions early can lead to a better understanding of the prognosis and can lessen the burden on the healthcare system. Here, we adapt and apply two self-supervised methods, SimCLR and BYOL, to learn features from a wearable activity dataset, and then use downstream classification tasks of different medical conditions to evaluate the quality of the learned representations.

2 Related Work

Self-Supervised Frameworks.

SimCLR and BYOL are both frameworks that originated in the image domain. SimCLR is a ‘simple framework for contrastive self-supervised learning of visual representations’ [chen2020simple] that uses data augmentations to create positive and negative samples for training with a contrastive loss, yielding high-quality learned representations. Transformations produce different views of the same image; the representations of these positive pairs are attracted, while the representations of views from different images are repelled. In the image domain, the transformations correspond to functions that crop, rotate, resize, colour distort and add noise to the images.

‘Bootstrap Your Own Latent’ (BYOL) [Grill2020] presents a simpler approach to self-supervision that does not require negative samples for the loss function, a requirement that has often been the downfall of SimCLR [arora2019theoretical]. It uses two neural networks working in tandem to generate representations. In the context of image classification, both of these methods have been shown to perform on par with or better than the current state-of-the-art supervised learning methods.

Health Signals.

Recent work has applied these self-supervised methods to time-series data instead of image data. For example, SimCLR has shown improvements over supervised and other unsupervised learning methods in Human Activity Recognition (HAR) [DBLP:journals/corr/abs-2011-11542], emotion recognition [sarkar2020self] and ECG signals [mohsenvand2020contrastive]. Within the scope of wearable activity data, an adversarial unsupervised representation learning method called activity2vec has been used to learn and summarise activity time-series data [aggarwal2019adversarial]. This paper will be referred to as the Adversarial paper henceforth. Their method learns distributed representations for activity signals that span a time segment in a subject-invariant manner. Their work shows that representations from self-supervised methods can outperform fully-supervised methods when using linear classifiers for disorder prediction tasks.

3 Data Pre-processing

To benchmark all models, the selected dataset is the Hispanic Community Health Study (HCHS) from the National Sleep Research Resource (NSRR) [zhang2018national]. Participants of Latino origin had their activity data measured using a Philips Actiwatch Spectrum wristwatch (https://www.usa.philips.com/healthcare/product/HC1046964/actiwatch-spectrum-activity-monitor) for 7 consecutive days. The time-series for each participant was sampled at a fixed interval, and metrics such as mean activity count, sleep or awake state and light levels were recorded. A visualisation of the data is shown in Figure C.3. Clinical annotations denote the insomnia, sleep apnea, diabetes, hypertension and metabolic syndrome status of each participant. Disease prevalence statistics are shown in Table C.1. The pre-processing steps shown in Figure C.1 were carried out in order to prepare the data for the self-supervised training and downstream classification tasks.

4 Methodology

We adapt SimCLR and BYOL to suit the time-series nature of the HCHS dataset.

4.1 SimCLR


Figure 4.1: Components of the SimCLR and BYOL frameworks.

The SimCLR framework, originally proposed for image data, is able to efficiently learn useful representations without requiring specialised architectures or a memory bank. The framework is made up of four major components, shown in Figure 4.1. Specific details of each component can be found in Appendix A.

Data Augmentation.

This step transforms any given input into two correlated views of the same piece of data. The transformations provided by [DBLP:journals/corr/abs-2011-11542] are used for time-series data and include: addition of Gaussian noise, scaling, negation, time reversal, column shuffle, time segment shuffle and time warp. Any combination of transformations can be used, and SimCLR has been shown to be quite sensitive to the order and type of transformations selected.
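As a rough illustration of the transformations listed above, the following sketch implements most of them for a window stored as a list of time steps, each holding one value per channel. The function names, default parameters and the list-based data layout are assumptions made for this example, not the implementation used in the paper, and time warp is omitted for brevity.

```python
import random

# A window is a list of time steps; each time step is a list of channel values.
# All names and default values below are illustrative assumptions.

def add_gaussian_noise(window, sigma=0.05):
    # Add independent zero-mean Gaussian noise to every value.
    return [[v + random.gauss(0.0, sigma) for v in step] for step in window]

def scale(window, sigma=0.1):
    # Draw one scale factor per channel and apply it across all time steps.
    factors = [random.gauss(1.0, sigma) for _ in window[0]]
    return [[v * f for v, f in zip(step, factors)] for step in window]

def negate(window):
    # Flip the sign of every value.
    return [[-v for v in step] for step in window]

def time_reverse(window):
    # Reverse the order of the time steps.
    return [list(step) for step in reversed(window)]

def channel_shuffle(window):
    # Apply the same random permutation of channels to every time step.
    order = list(range(len(window[0])))
    random.shuffle(order)
    return [[step[i] for i in order] for step in window]

def segment_shuffle(window, n_segments=4):
    # Cut the window into contiguous segments and permute their order.
    size = max(1, len(window) // n_segments)
    segments = [window[i:i + size] for i in range(0, len(window), size)]
    random.shuffle(segments)
    return [list(step) for seg in segments for step in seg]

def two_views(window, transforms):
    # Run the same chain of transformations twice to get two correlated views.
    views = []
    for _ in range(2):
        view = window
        for transform in transforms:
            view = transform(view)
        views.append(view)
    return views
```

Calling `two_views(window, [add_gaussian_noise, scale, negate])` would give one positive pair; because the order of the list is the order of application, the sensitivity to ordering mentioned above can be explored by permuting it.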

Neural Network Base Encoder.

The base encoder is responsible for taking the augmented data as input and returning embeddings. In the original SimCLR paper, a pretrained ResNet50 architecture was utilised as the base encoder [he2015deep]. However, for health signals, no such pretrained models exist, and therefore we use an encoder that consists of three 1D convolution layers with dropout.

Neural Network Projection Head.

A multilayer perceptron head with three hidden layers is used to map the representations to projections. This additional head introduces a non-linear transformation that helps improve the quality of the learned representation.

Contrastive Loss Function.

The NT-Xent loss is used, together with a LARS optimiser and a cosine decay learning rate schedule.
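For reference, the NT-Xent (normalised temperature-scaled cross-entropy) loss for a positive pair, as defined in the original SimCLR work, can be written as below. The symbols are taken from that definition rather than from this text: $z_i$ and $z_j$ are the projections of the two views of one sample, $N$ is the number of samples in the batch (so $2N$ views), $\mathrm{sim}$ is cosine similarity and $\tau$ is a temperature hyperparameter.

```latex
\ell_{i,j} = -\log
  \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)},
\qquad
\mathrm{sim}(u, v) = \frac{u^\top v}{\lVert u \rVert \, \lVert v \rVert}
```

The batch loss averages $\ell_{i,j}$ and $\ell_{j,i}$ over all positive pairs.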

4.2 BYOL

Bootstrap Your Own Latent (BYOL) is a self-supervised framework that uses contrastive learning without negative samples. We use two copies of an encoder network, called the online and target networks, to obtain representation pairs, and minimize the contrastive loss between them. Figure 4.1 shows the overall architecture of BYOL. Specific details of each step can be found in Appendix B.


Data Augmentation

The augmentation of data is the first step of the BYOL algorithm. A sequence of transformations is applied to the input data to obtain two different views. The transformations chosen were noise addition, scaling and signal negation.

Encoder and Projector

Since no pretrained network exists for wearable time-series, as ResNet50 does for images, we first train an autoencoder on the HCHS train dataset, without the labels, and then use the encoder part of the network as the BYOL encoder (shown in Figure 4.1). As per the BYOL framework, an MLP head is attached to the encoder, which outputs the projection representation.


Predictor

The predictor is the part of the framework that differentiates the online network from the target network. It is simply a linear layer that maps one vector to another vector in the same space, whose dimension is the projection dimension. The aim of the predictor is to map the projection of the online network to the projection of the target network.

Training Step

The training algorithm is illustrated in Figure C.2. Each input sample is transformed to get two different views, which are each passed through the online network and the target network. The loss is calculated as the sum of two terms: the mean-squared error between the online prediction of the first view and the target projection of the second view, and the mean-squared error between the online prediction of the second view and the target projection of the first view. Once the loss has been calculated, the gradients are backpropagated through the online network only, and the online parameters are adjusted. The final part of the training step is to update the target parameters, which are an exponential moving average of the online parameters.
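The two pieces of arithmetic in this training step, the symmetric loss and the moving-average update of the target network, can be sketched in a few lines. This is a minimal illustration on plain Python lists, assuming that parameters and projections are flat vectors; the function names are hypothetical, and the original BYOL formulation additionally L2-normalises the vectors before taking the squared error, which is left out here.

```python
def mse(a, b):
    # Mean-squared error between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def byol_loss(online_pred_v1, target_proj_v2, online_pred_v2, target_proj_v1):
    # Symmetric loss: the online prediction of each view is matched to the
    # target projection of the other view, and the two errors are summed.
    return mse(online_pred_v1, target_proj_v2) + mse(online_pred_v2, target_proj_v1)

def ema_update(target_params, online_params, decay):
    # The target parameters track an exponential moving average of the
    # online parameters; no gradient flows through the target network.
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_params, online_params)]
```

With a decay close to 1 the target network changes slowly, which gives the online network a stable regression target between updates.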

5 Results

Training and Evaluation Protocol.

We split the labelled HCHS dataset into train, validation and test sets. For the self-supervised training of the contrastive models, SimCLR and BYOL, we use data from the train set without the labels. The resulting learned representations are evaluated on five downstream linear classification tasks using a logistic regression classifier, and we report the mean F1-scores with confidence intervals over 10 runs. The classifier is trained on representations corresponding to the labelled train set, and then validated and tested on the entirety of the HCHS labelled validation and test sets.

Table 5.1 layout: one row per method; for each HCHS task (Sleep Apnea, Diabetes, Insomnia, Hypertension, Metabolic Syndrome) an F1-macro and an F1-micro column.
Table 5.1: The best results achieved for each of the 5 downstream tasks with SimCLR and BYOL, with day2vec and task-specific CNN results shown for comparison. The mean and confidence intervals of 10 runs are reported, apart from the results taken from day2vec, for which only the mean is reported [aggarwal2019adversarial].


Table 5.1 shows the final results achieved for five downstream classification tasks using the SimCLR and BYOL methods. The transformations used for the SimCLR framework were negation, time segment permutation, time reversal, channel shuffle and random scaling, whereas for BYOL the transformations used were Gaussian noise addition, scaling and negation. Both frameworks used the same batch size and window size. In accordance with what was reported in the original SimCLR paper, larger batch sizes gave better performance than smaller batch sizes. For a baseline supervised learning comparison, we collate results from the Adversarial paper and our own implementation of a task-specific CNN, showing the highest score for each disease in the table, as well as results of day2vec, which embeds time-series at the level of a day span [aggarwal2019adversarial]. The F1-macro and F1-micro scores are reported as the evaluation metrics.

The overall trend is that SimCLR and BYOL outperform the other techniques in the majority of the downstream classification tasks, with the exception of the F1-macro score for the classification of diabetes. SimCLR and BYOL both outperform the fully supervised method and the day2vec method for sleep apnea, and likewise for metabolic syndrome. The classification of metabolic syndrome is a novel downstream evaluation task that has not been studied in any of the papers cited here, and the results are promising, especially with SimCLR. For hypertension, SimCLR performs best on the macro metric and BYOL on the micro metric. The impact of the evaluation metric used is an area for further work.

The classification tasks for insomnia and diabetes share a common trend: BYOL performs significantly better than the task-specific CNN and day2vec on the micro metric, but day2vec works better on the macro metric. This could be due to the class imbalance for those diseases in the dataset, and to the fact that the Adversarial paper optimises for the F1-macro metric, which would explain the better result seen there. However, this still shows the advantage self-supervised methods have over fully supervised methods for extracting meaning from time-series data, in particular for health.
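To make the effect of class imbalance on the two metrics concrete, the sketch below computes macro- and micro-averaged F1 from scratch. The helper name `f1_scores` and the toy labels are made up for this illustration and are not results from the study; the 92/8 split only loosely mirrors the kind of imbalance listed in Table C.1.

```python
def f1_scores(y_true, y_pred):
    # Count true positives, false positives and false negatives per class.
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over classes first, so every sample counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

# Toy example: 92 negatives, 8 positives, and a classifier that always says "no".
y_true = ["no"] * 92 + ["yes"] * 8
y_pred = ["no"] * 100
macro, micro = f1_scores(y_true, y_pred)
```

In this toy case the micro score is 0.92 while the macro score is roughly 0.48, because the minority class contributes an F1 of zero to the macro average. A method can therefore rank differently under the two metrics when the classes are imbalanced.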


This work serves as a benchmark for the performance of contrastive losses applied to the wearable time-series domain. The conclusion to be drawn from these results, and from other related work, is that contrastive learning extracts meaningful structure from time-series data, and that the resulting representations can boost the performance of linear classification when compared to fully supervised methods. We observe that SimCLR performed better than both BYOL and day2vec in the majority of the downstream tasks, but it is also clear that different diseases have different characteristics, so different techniques can be expected to suit different conditions.

6 Future Directions

We can develop this work further by applying SimCLR and BYOL to other time-series datasets collected through wearables and evaluating the representations on a wider range of downstream classification tasks. Since different techniques result in different representations of the data, for each disease there would be an optimal representation for its classification. Particularly for health care, we can envision a system that monitors the health signals collected from each user’s wearables and phone. A set of contrastive techniques would be used to generate a range of representations, and the best representation for each disease being evaluated would be used for the classification. The advantage of such a system is that we use a combination of techniques to get the best possible accuracy, and through the utilisation of these signals we can monitor continuously and ideally identify the onset of any disease early on so that the best treatment can be carried out in time.

This work is partially supported by Nokia Bell Labs through their donation for the Centre of Mobile, Wearable Systems and Augmented Intelligence to the University of Cambridge. CI.T is additionally supported by the Doris Zimmern HKU-Cambridge Hughes Hall Scholarship and from the Higher Education Fund of the Government of Macao SAR, China. D.S is supported by the Embiricos Trust Scholarship of Jesus College Cambridge, and EPSRC through Grant DTP (EP/N509620/1). The authors declare that they have no conflict of interest with respect to the publication of this work.


Appendix A SimCLR Details

Neural Network Base Encoder

Each of the three convolutional layers has its own number of feature maps and kernel size, and a single stride length is used. ReLU is used as the non-linear activation in all layers apart from the output. After the final dropout layer, a global max pooling layer is added in order to aggregate the high-level features.
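The building blocks named here (1D convolution, ReLU and global max pooling) can be illustrated with a small pure-Python sketch. The helper names, the data layout and the valid-padding choice are assumptions made for the example; dropout is omitted because it only acts during training, and the layer sizes of the actual encoder are not reproduced.

```python
def conv1d(x, kernels, stride=1):
    # x: list of time steps, each a list of input-channel values.
    # kernels: list of filters, each shaped [kernel_size][in_channels].
    # Returns one output value per filter at each valid window position.
    k = len(kernels[0])
    out = []
    for start in range(0, len(x) - k + 1, stride):
        step = []
        for filt in kernels:
            total = 0.0
            for dt in range(k):
                for c, w in enumerate(filt[dt]):
                    total += w * x[start + dt][c]
            step.append(total)
        out.append(step)
    return out

def relu(x):
    # Element-wise rectified linear activation.
    return [[max(0.0, v) for v in step] for step in x]

def global_max_pool(x):
    # Keep the maximum over time for each feature map.
    return [max(step[c] for step in x) for c in range(len(x[0]))]

# One input channel, two filters of kernel size 2.
signal = [[1.0], [2.0], [3.0], [4.0]]
filters = [[[1.0], [1.0]], [[-1.0], [0.0]]]
features = global_max_pool(relu(conv1d(signal, filters)))
```

Global max pooling collapses the time axis, so the output has one value per feature map regardless of the window length.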

Contrastive Loss Function

In the training algorithm, a batch of samples is taken and two transformed views are created from each sample, so each batch contains twice as many data points as input samples. The contrastive loss function treats the two views that come from the same input as a positive pair, and pairs each view with all the other samples in the batch as negative pairs.
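A minimal sketch of how such a batch-level contrastive loss can be computed is given below, using the NT-Xent form with cosine similarity. It assumes the projections are ordered so that indices 2k and 2k+1 hold the two views of sample k; that ordering, the function names and the default temperature are assumptions for illustration, and zero-length vectors are not handled.

```python
import math

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nt_xent(z, temperature=0.5):
    # z holds the projections of all views in the batch, ordered so that
    # z[2k] and z[2k + 1] are the two views of input sample k.
    n = len(z)
    total = 0.0
    for i in range(n):
        j = i ^ 1  # index of the positive partner of view i
        logits = [cosine(z[i], z[k]) / temperature for k in range(n)]
        # Every other view in the batch appears in the denominator,
        # so the remaining views act as negatives.
        denom = sum(math.exp(logits[k]) for k in range(n) if k != i)
        total += math.log(denom) - logits[j]
    return total / n

aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

Here `nt_xent(aligned)` is smaller than `nt_xent(misaligned)`, since in the first batch each view is closest to its own partner.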

Appendix B BYOL Details


Data Augmentation

The noise addition transformation draws the noise from a Gaussian distribution. Scaling is applied to the input data using scale factors drawn from a normal distribution.

Encoder and Projector

A simple autoencoder is used, consisting of eight fully connected layers with ReLU and Sigmoid activations between them. The MLP head comprises a fully connected layer that maps the representation vector to a hidden vector, followed by a 1D BatchNorm layer, a ReLU activation, and a final fully connected layer that outputs the projection.

Training Step

As with the SimCLR implementation, the LARS optimizer with a cosine decay learning rate schedule is used, with a base learning rate that is scaled with the batch size. The online parameters are adjusted using an Adam optimiser in the backpropagation stage. The target decay rate controls the exponential-moving-average update of the target parameters.
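The learning rate schedule described here can be sketched as follows. The linear scaling rule with a reference batch size of 256 follows common practice in the original SimCLR and BYOL work and is an assumption, since the exact scaling formula is not given in this text; the function names and the example values are likewise illustrative.

```python
import math

def scaled_base_lr(reference_lr, batch_size, reference_batch=256):
    # Linear scaling rule (assumed): the base rate grows with the batch size.
    return reference_lr * batch_size / reference_batch

def cosine_decay(step, total_steps, base_lr, final_lr=0.0):
    # Anneal from base_lr to final_lr along half a cosine period.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# Example with made-up values: a reference rate of 0.2 and a batch size of 512.
base = scaled_base_lr(0.2, 512)
schedule = [cosine_decay(s, 100, base) for s in range(101)]
```

The schedule starts at the scaled base rate, passes through half of it at the midpoint, and reaches the final rate at the last step.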

Appendix C

Disease              Classification                Prevalence
Diabetes             Diabetic                      18.1%
                     Pre-Diabetic                  35.0%
                     Non-Diabetic                  46.9%
Sleep Apnea          No                            91.8%
                     Yes                           8.25%
Insomnia             Moderate to Severe            17.7%
                     Subthreshold                  22.5%
                     Not Clinically Significant    59.8%
Hypertension         No                            74.9%
                     Yes                           25.1%
Metabolic Syndrome   No                            66.3%
                     Yes                           33.7%

Table C.1: Disease and Prevalence of the HCHS Dataset


Figure C.1: Pipeline for data pre-processing


Figure C.2: BYOL training step


Figure C.3: Plot showing the activity level, light levels and sleep or wake status of Participant 5270581 from the HCHS dataset