
Decoupling Local and Global Representations of Time Series

by Sana Tonekaboni, et al.

Real-world time series data are often generated from several sources of variation. Learning representations that capture the factors contributing to this variability enables a better understanding of the data via its underlying generative process and improves performance on downstream machine learning tasks. This paper proposes a novel generative approach for learning representations for the global and local factors of variation in time series. The local representation of each sample models non-stationarity over time with a stochastic process prior, and the global representation of the sample encodes the time-independent characteristics. To encourage decoupling between the representations, we introduce counterfactual regularization that minimizes the mutual information between the two variables. In experiments, we demonstrate successful recovery of the true local and global variability factors on simulated data, and show that representations learned using our method yield superior performance on downstream tasks on real-world datasets. We believe that the proposed way of defining representations is beneficial for data modelling and yields better insights into the complexity of real-world data.





1 Introduction

Figure 1: Overview of our method for learning global and local representations. The proposed method learns the distribution of the local representations of windows of samples over time using the local encoder. The global encoder models the posterior of the global representation, and the decoder generates the time series using samples from the posteriors of the local and global representations.

Learning high-quality representations of complex data such as time series helps with better understanding the underlying generative process of the data (ghahramani2015probabilistic) and has a substantial impact on the performance of downstream machine learning (ML) tasks (bengio2013representation). Condensing information into expressive lower-dimensional representations can improve interpretability (bai2018interpretable) and lead to better generalization and domain adaptation (chen2012marginalized). One challenge for downstream ML tasks on time-series data is that collecting labels can be expensive, as most real-world tasks require human expert domain knowledge. In addition, the underlying state of the data may evolve over time, making the data more difficult for humans to interpret. Unsupervised representation learning is of tremendous value in such cases, providing a powerful tool to uncover the underlying state and summarize complex data into informative, general-purpose representations.

Observational time series are often generated from different underlying factors of variation that can be identified through the representations. Some of these factors represent the global attributes of a sample and are unique to each individual, while others describe the variation of the underlying state over time. For instance, consider a medical diagnosis application: the observed physiological signals of a patient are a product of the individual's attributes, such as gender, age and pre-existing medical conditions, and of the medical treatment received over time. These hidden factors can influence the time series in different ways. Uncovering and decoupling these factors is a fundamental challenge in representation learning that leads to substantial improvements in data understanding as well as in the performance of ML models that use the learned representations.

In this paper, we introduce an unsupervised representation learning method for time series that decouples the global and local properties of the signal into separate representations. Our method is based on generative modelling and assumes that each time series sample is generated from two underlying factors: a global and a local representation. The global representation is unique to each sample and captures its static, time-independent characteristics. The local representations encode the dynamic underlying state of windows of the time series sample as it evolves over time, taking into account the non-stationarity of the samples. We use variational approximation to model the posteriors of the two sets of representations, and model the temporal dynamics using a prior sampled from a Gaussian Process (GP). To ensure that the two sets of representations are decoupled and model distinct data characteristics, we introduce a counterfactual regularization that minimizes the mutual information between them. Fig. 1 gives an overview of our framework. Decoupling local and global representations using our proposed method provides the following important benefits:

  1. Exploiting global patterns in the data and coupling them with local variations serves as an inductive bias and improves downstream tasks ranging from time series forecasting to classification.

  2. By disentangling the factors of variation, we can represent the underlying information more efficiently, encoding the information necessary for target tasks into more compact representations.

  3. Knowing the various factors of variation in the generation of a time series helps identify and disentangle the underlying explanatory factors, resulting in more interpretable representations.

  4. Having this information also allows either representation to be used flexibly for downstream tasks, depending on which property is appropriate for a specific use case. For instance, global representations can help identify subgroups with similar properties.

  5. Using our approach, we can generate counterfactual instances of data with controlled factors of variation, utilizing the global and local representations. This enables generating different manifestations of events like disease progression in individuals with different characteristics.

2 Related Work

The performance of many ML models relies heavily on the quality of the data representations (Bengio+chapter2007). This applies to all data types, but it is vital for complex ones such as time series that can be high-dimensional, high-frequency and non-stationary (yang200610; langkvist2014review). Due to the difficulties in labelling time-series data, unsupervised approaches are often preferred in such settings. They include different categories of methods that are based on reconstruction (yuan2019wave2vec; fortuin2018som; fortuin2020gp; chorowski2019unsupervised), clustering (ma2019learning; lei2019similarity), contrastive objectives (oord2018representation; franceschi2019unsupervised; hyvarinen2019nonlinear; tonekaboni2020unsupervised; hyvarinen2016unsupervised), and others.

While all the methods above have been shown to successfully encode the informative parts of the signal in a low dimensional representation, improving the interpretability of the representations is still an active area of research. An effort in this direction is disentangling the dimensions of the encoding, which enables representing different factors of variation in independent dimensions. Earlier approaches learn disentangled representations using supervised data (yang2015weakly; tenenbaum2000separating), while more recent methods provide unsupervised solutions to tackle this problem using adversarial training (chen2016infogan; kim2018disentangling; kumar2018variational) or regularization of the data distribution (dezfouli2019disentangled). However, associating these factors with interpretable notions in the data domain remains challenging. A different line of work focuses on decoupling global and local representations of samples into separate representations (or dimensions of representation). This idea has been explored for visual data to separate the factors of variation associated with the labels from other sources of variability. mathieu2016disentangling use a conditional generative model with adversarial regularization, while ma2020decoupling learn decoupled representations of global and local information of images relying on empirical characteristics of VAE and flow models.

For time series, VAE-based methods have been used to disentangle dynamic and static factors of representations for video and speech data. FHVAE (hsu2017unsupervised) uses a hierarchical VAE to learn sequence-dependent variables (corresponding to the speaker factor) and sequence-independent variables (corresponding to the linguistic factors) in modeling speech, but it does not explicitly encourage disentanglement. DSVAE (yingzhen2018disentangled) formalizes the idea of disentangled representations by explicitly factorizing the posterior to model the static and dynamic factors. Other methods augment DSVAE by explicitly enforcing disentanglement in the objective function. S3VAE (zhu2020s3vae) introduces additional loss terms that encourage invariance of the learnt representations by permuting frames and leveraging external supervision for the motion labels, along with a regularization that minimizes the mutual information between the two sets of representations. Similar regularizations are used in C-DSVAE (bai2021contrastively) via contrastive estimation of mutual information. Note that all the above methods focus on the generated samples rather than on the quality or interpretability of the representations; the dynamic local representations are therefore not designed to summarize the information of time series samples over time. Other efforts for disentangling global and local representations are designed to improve specific downstream modeling tasks. For instance, sen2019think leverage both local and global patterns to improve forecasting based on matrix factorization techniques. Similar ideas have been introduced to further improve forecasting performance (wang2019deep; nguyen2021temporal). schulam2015framework learn population and sub-population parameters to model individualized disease trajectories.

3 Method

In this section, we present the notations used throughout this paper, followed by the problem definition and description of the method.

3.1 Notation

Let $X \in \mathbb{R}^{d \times T}$ be a multivariate time series sample, with $d$ input features and $T$ measurements over time. Each time series sample is generated from two latent variables, $z_g$ and $Z_l$. The global representation $z_g$ is a vector of size $d_g$ that represents the global properties of the sample. The local representation of the sample, $Z_l = [z_l^{(1)}, \dots, z_l^{(M)}]$, is composed of a set of vector representations of non-overlapping windows of the time series. Each $z_l^{(i)}$, a vector of size $d_l$, is the representation of a window of length $W$ and encodes information for all features within that window. The $M$ windows split the sample into consecutive parts, as shown in Fig. 1. The sizes of the global and local representations are determined by $d_g$ and $d_l$, respectively. In order to handle missing measurements, each sample has a mask channel indicating which data points are measured and which ones are missing. Irregularly-sampled time series can also be converted into regularly-sampled signals, with the mask channel indicating the observed measurements. For simplicity, we drop the sample index in the rest of the paper.
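As a minimal sketch of this windowing with a mask channel (the function name, window length, and toy values here are ours, not the paper's):

```python
import numpy as np

def to_windows(x, mask, window_len):
    """Split a (d, T) multivariate series and its observation mask into
    M = T // window_len non-overlapping windows of shape (M, d, window_len).
    Trailing measurements that do not fill a window are dropped."""
    d, T = x.shape
    M = T // window_len
    xw = x[:, :M * window_len].reshape(d, M, window_len).transpose(1, 0, 2)
    mw = mask[:, :M * window_len].reshape(d, M, window_len).transpose(1, 0, 2)
    return xw, mw

# toy example: 3 features, 20 time steps, windows of length 5
x = np.arange(60, dtype=float).reshape(3, 20)
m = np.ones_like(x)            # 1 = observed, 0 = missing
xw, mw = to_windows(x, m, window_len=5)
```

Each `xw[i]` then plays the role of one window fed to the local encoder, with `mw[i]` marking which entries contribute to the likelihood.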

Our assumed probabilistic data generation mechanism is shown in Fig. 2. We model the conditional likelihood distribution of the data as $p(X \mid Z_l, z_g) = \prod_{i=1}^{M} p(x^{(i)} \mid z_l^{(i)}, z_g)$, where $x^{(i)}$ in Fig. 2 represents a window of the time series and $z_l^{(i)}$ is the local representation for that window. The local representations of the windows can change over time as the underlying state of the time series changes. The dependencies among these local representations are modeled using a prior sampled from a Gaussian Process. The global representation $z_g$ is the same for all windows within a sample, and its prior is modelled as a standard Gaussian distribution.

Figure 2: Graphical model of the generative process. Each window $x^{(i)}$ of the time series is generated from the global representation $z_g$ of the sample and the local representation $z_l^{(i)}$ of the window. The sample $X$ is composed of the series of consecutive windows, and the set of all local representations is sampled from a Gaussian Process prior.

3.2 Modeling Distributions

As part of the learning algorithm, we use variational approximations to model the following three distributions:

  1. The conditional likelihood distribution of the time series sample conditioned on the local and global representations, $p(X \mid Z_l, z_g)$. We approximate this using a Decoder model.

  2. The posterior over the local representations, $q(Z_l \mid X)$. We approximate this using the Local Encoder. This encoder slices the time series into consecutive windows and approximates the joint distribution of all the local representations over time.

  3. The posterior distribution of the global representation, $q(z_g \mid X)$. The Global Encoder approximates the parameters of this conditional distribution. The encoder can take the entire time series, or any part of it, as input for estimating the global representation. This is particularly useful in the presence of missing data, as it allows using the part of the signal with fewer missing observations. To ensure robustness and to encourage the global representation to be constant throughout a sample, the encoder is trained on random sub-windows of the sample.
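To make the roles of the three components concrete, here is a toy, non-learned sketch with stand-in averaging maps in place of neural networks; all shapes, names, and values are illustrative, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a sample has d=2 features over M=4 windows of length W=5;
# local codes are dim_l=3 per window, the global code is dim_g=2.
d, M, W, dim_l, dim_g = 2, 4, 5, 3, 2

def local_encoder(x_windows):
    """q(Z_l | X): one diagonal-Gaussian posterior per window."""
    feats = x_windows.mean(axis=(1, 2))                 # (M,) summary per window
    mu = np.tile(feats[:, None], (1, dim_l))            # (M, dim_l)
    sigma = np.full((M, dim_l), 0.1)
    return mu, sigma

def global_encoder(x_subwindow):
    """q(z_g | X): may consume any sub-window of the sample."""
    s = x_subwindow.mean()
    return np.full(dim_g, s), np.full(dim_g, 0.1)       # mu, sigma

def decoder(z_l, z_g):
    """p(X | Z_l, z_g): maps latents back to (M, d, W) windows."""
    base = z_l.mean(axis=1) + z_g.mean()                # (M,)
    return np.tile(base[:, None, None], (1, d, W))

x = rng.standard_normal((M, d, W))
mu_l, sig_l = local_encoder(x)
mu_g, sig_g = global_encoder(x[:2])     # global code from only part of the sample
z_l = mu_l + sig_l * rng.standard_normal(mu_l.shape)    # reparameterized samples
z_g = mu_g + sig_g * rng.standard_normal(mu_g.shape)
x_hat = decoder(z_l, z_g)
```

The reparameterized sampling and the global encoder's ability to operate on a sub-window mirror the structure described above; in the actual method these maps are neural networks trained with the objective of Section 3.2.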

The local representation should model the temporal behaviour and the underlying time-varying states. The local encoder learns the representation of consecutive windows independently; however, we cannot assume that these representations are i.i.d. casale2018gaussian show that for video data, accounting for the temporal covariance between the representations yields more general and informative encodings. Similar to the ideas presented in casale2018gaussian and fortuin2020gp, we impose temporal dependencies between the local representations using a prior sampled from a Gaussian Process (GP). A GP is a non-parametric Bayesian method well-suited for modelling temporal data that enables robust modelling even in the presence of uncertainty (roberts2013gaussian). We model each dimension of the local representations (indexed by $j$) independently over time, using separate GPs with different kernel functions. The intuition behind this choice is to decompose the latent representations into dimensions with distinct temporal behaviours characterized by the covariance structure. For example, a dimension with a periodic kernel can model the seasonality of the underlying state of the signal. This is an important property, since the local representations should model the non-stationarity of the samples. The local encoder approximates the posterior of each dimension using a multivariate Gaussian distribution, as shown in Eq. 1, where $z_l^{j} = (z_{l,1}^{j}, \dots, z_{l,M}^{j})$ collects dimension $j$ of the local representations over the $M$ windows:

$$ q\big(z_l^{j} \mid X\big) = \mathcal{N}(\mu_j, \Sigma_j), \qquad j \in \{1, \dots, d_l\} \tag{1} $$
Following fortuin2020gp, the precision matrix $\Lambda_j = \Sigma_j^{-1}$ is parameterized as a product of bidiagonal matrices (Eq. 2), where $B_j$ is an upper triangular band matrix; this construction guarantees the positive definiteness and symmetry of $\Lambda_j$:

$$ \Lambda_j = \Sigma_j^{-1} = B_j^{\top} B_j \tag{2} $$

With this sparse parameterization of the precision matrix, sampling from the posterior becomes more efficient (mallik2001inverse), while the estimated covariance can still be dense and model long-range dependencies in time.
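The parameterization above can be checked numerically. A small sketch (the size and band values below are illustrative, not learned parameters):

```python
import numpy as np

# Precision of a T-step latent posterior as Lambda = B^T B, with B upper
# bidiagonal (main diagonal + first superdiagonal).
T = 5
b_diag = np.ones(T)            # positive diagonal keeps B full rank
b_off = np.full(T - 1, 0.5)
B = np.diag(b_diag) + np.diag(b_off, k=1)

Lam = B.T @ B                  # precision: symmetric positive definite
Sigma = np.linalg.inv(Lam)     # covariance: generally dense
eigs = np.linalg.eigvalsh(Lam)
```

Even though `B` has only `2T - 1` nonzero entries, the implied covariance couples distant time steps, which is the long-range-dependency property noted above.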
With these parametric approximations of the distributions, we train the models using the objective in Eq. 3:

$$ \mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q(Z_l, z_g \mid X)}\big[\log p(X \mid Z_l, z_g)\big] - D_{\mathrm{KL}}\big(q(z_g \mid X) \,\Vert\, p(z_g)\big) - D_{\mathrm{KL}}\big(q(Z_l \mid X) \,\Vert\, p(Z_l)\big) \tag{3} $$
The log-likelihood term encourages the generated signals to be realistic, and the divergence terms minimize the distance between the estimated posteriors and their priors. As mentioned, the prior over the global representation is assumed to be a standard Gaussian, $p(z_g) = \mathcal{N}(0, I)$, and the prior over each dimension of the local representations is a zero-mean GP defined over time, $z_l^{j} \sim \mathcal{GP}\big(0, k_j(t, t')\big)$. We assume different kernel functions and parameters (e.g., the length scale) for different dimensions of the latent representations in order to model dynamics at multiple time scales. Our framework is compatible with many kernel structures, including but not limited to RBF, Cauchy, and periodic kernels. A list of the kernel functions used in our experiments is presented in Appendix 6.2. Note that the negative log-likelihood is only evaluated on the observed measurements, to account for missing values.
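For illustration, possible forms of such kernels evaluated over window indices (hyperparameter names and values here are ours; the paper's exact kernel settings are in its appendix):

```python
import numpy as np

def rbf(t1, t2, length_scale=1.0):
    """Squared-exponential kernel: smooth, locally correlated dynamics."""
    d = t1[:, None] - t2[None, :]
    return np.exp(-d**2 / (2 * length_scale**2))

def cauchy(t1, t2, length_scale=1.0):
    """Cauchy kernel: heavier tails, longer-range correlation."""
    d = t1[:, None] - t2[None, :]
    return 1.0 / (1.0 + (d / length_scale) ** 2)

def periodic(t1, t2, length_scale=1.0, period=4.0):
    """Periodic kernel: points one period apart are fully correlated."""
    d = np.abs(t1[:, None] - t2[None, :])
    return np.exp(-2 * np.sin(np.pi * d / period) ** 2 / length_scale**2)

t = np.arange(8.0)
K = periodic(t, t)   # prior covariance for one latent dimension over 8 windows
```

A latent dimension with the periodic prior would capture seasonal structure, while an RBF or Cauchy dimension captures smooth drifts, matching the decomposition intuition above.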

Eq. 3, however, does not guarantee all the properties we expect from the representations. Nothing prevents all information from flowing through the local representations $Z_l$, which have a higher encoding capacity. This means that the model can easily converge to a solution where all information about the global behaviour of the signal is encoded in the local representations. As a result, $z_g$ would become random noise ignored by the decoder, and the local representations would no longer capture the underlying states independently of sample-level variability. To address this, we introduce a counterfactual regularization term in the loss, described in the next section.

3.3 Counterfactual Regularization

The issue of information flowing through only one set of representations can be prevented if labels are available for the global sources of variation (reed2015deep; klys2018learning). In practice, however, such labels are rarely available, and the underlying factors are often unknown. We propose a counterfactual regularization that encourages $z_g$ to be informative and global behaviours to be encoded only in the global representation. This means that varying the local representation should not change the global identity of the time series, and vice versa. For each sample $X$ during training, we generate a counterfactual sample $\tilde{X}$ from the local representations $Z_l$ of $X$ and a random global representation $\tilde{z}_g$ sampled from the prior, as illustrated in Fig. 3.

Figure 3: The counterfactual sample generation process for the regularization term.

Ideally, this generated counterfactual sample exhibits none of the global properties of $X$. If the two representations are independent, $\tilde{X}$ should not contain any information about $z_g$; therefore, $z_g$ should have low likelihood under the estimated posterior distribution of the global representation conditioned on the counterfactual sample $\tilde{X}$. Using the global encoder, we estimate this posterior distribution, $q(z_g \mid \tilde{X})$, and encourage the likelihood ratio of $z_g$ to $\tilde{z}_g$ under it to be low. Our proposed regularization term is as follows:

$$ \mathcal{L}_{\mathrm{reg}} = \mathbb{E}\left[\log \frac{q(z_g \mid \tilde{X})}{q(\tilde{z}_g \mid \tilde{X})}\right] \tag{4} $$
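As a toy numerical sketch of this likelihood-ratio term (the encoder posterior here is faked as a diagonal Gaussian centred near the prior draw, which is what a well-trained model should produce; all names and values are ours):

```python
import numpy as np

def gauss_logpdf(z, mu, sigma):
    """Log density of a diagonal Gaussian, summed over dimensions."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (z - mu) ** 2 / (2 * sigma**2))

rng = np.random.default_rng(1)
z_g = np.array([1.0, -2.0])          # global code of the original sample
z_tilde = rng.standard_normal(2)     # random global code drawn from the prior

# Stand-in for q(z_g | x_tilde), the global-encoder posterior on the
# counterfactual sample x_tilde = Dec(Z_l, z_tilde).
mu, sigma = z_tilde, np.full(2, 0.5)

# Regularizer: log-likelihood ratio of z_g to z_tilde under that posterior.
# Driving it down means x_tilde carries no trace of the original global code.
reg = gauss_logpdf(z_g, mu, sigma) - gauss_logpdf(z_tilde, mu, sigma)
```

When the counterfactual sample reflects only `z_tilde`, the original `z_g` sits in the posterior's tail and the ratio is negative, which is the behaviour the regularizer rewards.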
As a result, the final objective becomes:

Counterfactual regularization and disentanglement.

An essential role of the counterfactual regularization is that it implicitly encourages independence. One way to achieve independence between the global and local variables is to minimize the mutual information between them. Following moyer2018invariant, the mutual information between the two sets of representations can be decomposed as:

$$ I(z_g; Z_l) = \underbrace{H(z_g) - H(z_g \mid X)}_{I(z_g;\, X)} + H(z_g \mid X) - H(z_g \mid Z_l) \tag{6} $$

The first two terms measure the information captured by the global representation $z_g$, which is already accounted for in the variational autoencoder objective (Eq. 3). Minimizing $I(z_g; Z_l)$ can therefore be done by minimizing the remaining terms, which require the conditional distribution $p(z_g \mid Z_l)$. As we do not have access to this distribution, existing works (cheng2020club) use an additional network to construct a variational approximation. Instead of introducing an additional network to approximate $p(z_g \mid Z_l)$, which increases training complexity and computation time, we reuse the global encoder from the counterfactual regularization for this approximation, as follows. Since $\tilde{z}_g$ is sampled from the prior distribution independently of $X$, we have $q(z_g \mid \tilde{X}) \approx p(z_g \mid Z_l)$ if the decoder preserves the information in $Z_l$. This implies that minimizing (4) implicitly minimizes $I(z_g; Z_l)$ and encourages decoupling between the representations.

4 Experiments

                                     ICU Mortality Prediction          Average Daily Rain Estimation
Model                  Dimensions    AUPRC            AUROC            Mean Absolute Error
Our method             8 steps + 8   0.365 ± 0.092    0.752 ± 0.011    1.824 ± 0.001
Our method - no reg.   8 steps + 8   0.238 ± 0.026    0.672 ± 0.010    1.825 ± 0.001
GP-VAE                 8 steps       0.266 ± 0.034    0.662 ± 0.036    1.824 ± 0.001
GP-VAE                 16 steps      0.282 ± 0.086    0.699 ± 0.018    1.826 ± 0.001
VAE                    8 steps       0.157 ± 0.053    0.564 ± 0.044    1.831 ± 0.005
VAE                    16 steps      0.118 ± 0.001    0.491 ± 0.037    1.840 ± 0.012
C-DSVAE                8 steps + 8   0.158 ± 0.005    0.565 ± 0.007    1.806 ± 0.012
Supervised             NA            0.446 ± 0.036    0.802 ± 0.043    0.079 ± 0.001
Table 1: Performance of different representation learning methods on downstream predictive tasks.
                                     Air Quality      Physionet
Model                  Dimensions    Accuracy         Accuracy
Our method             8             57.93 ± 3.53     46.98 ± 3.04
Our method - no reg.   8             38.35 ± 2.67     32.54 ± 0.00
GP-VAE                 8             36.73 ± 1.40     42.47 ± 2.02
GP-VAE                 16            33.57 ± 1.50     44.67 ± 0.50
VAE                    8             27.17 ± 0.03     34.71 ± 0.23
VAE                    16            31.20 ± 0.33     35.92 ± 0.38
C-DSVAE                8             47.07 ± 1.20     32.54 ± 0.00
Supervised             NA            62.43 ± 0.54     62.00 ± 2.10
Table 2: Performance of different representation learning methods on identifying similar subgroups of data.

Evaluating unsupervised representation learning is challenging due to the lack of well-defined labels for the underlying representations and factors of variation. However, the generalizability and informativeness of the representations can be assessed on different downstream tasks. We present a number of experiments evaluating our method against the following benchmarks: (i) a Variational Auto-Encoder (VAE) as a standard unsupervised representation learning framework; (ii) GP-VAE (fortuin2020gp); (iii) C-DSVAE (bai2021contrastively), which separates dynamic and static representations (our implementation of this method for time series omits the augmentations proposed for the video setting, because such transformations, e.g. cropping and color distortion, are not defined for time series); and (iv) a model trained with supervision for the task.

All baselines are trained to learn representations for consecutive windows in the time series sample. For consistency and to allow comparison of performance results, the encoder and decoder architectures are kept the same across baselines. One of the benefits of decoupling local and global representations is that we can condense the representation into fewer dimensions, especially since the global representation is shared throughout the sample. We have chosen different encoding sizes for the baselines to better reflect this advantage. The performance of all models is compared across multiple evaluation tests using two different time series datasets:

  1. Physionet ICU Dataset (goldberger2000physiobank): A medical time series dataset of records from 12,000 adult ICU stays. The temporal measurements consist of physiological signals and various lab measurements. Each recording includes general descriptors of the patient (age, type of ICU admission, etc.) as well as labels indicating in-hospital mortality. We use these descriptors as proxies for the global properties of the signals.

  2. UCI Beijing Multi-site Air Quality Dataset (zhang2017cautionary): This dataset includes hourly measurements of multiple air pollutants from 12 nationally-controlled monitoring sites, collected over four years. The measurements are matched with meteorological data from the nearest weather station. We partition the data such that each time series sample is the pollutant reading for part of a particular month of the year.

To fully demonstrate the capability of our method, we focus on time series data with non-stationarity. Unlike most classification tasks, where the time series is windowed to represent samples of different classes, here we are interested in long time series whose behaviour might change over time. More information on the experiment datasets, cohort selection, and processing is provided in Appendix 6.1.

4.1 Improving Downstream Prediction Tasks

General representations should capture the important information in the data and can therefore be leveraged for downstream prediction tasks by training simple predictors on them. This approach is commonly used for evaluating the quality of representations (oord2018representation; franceschi2019unsupervised; fortuin2020gp). For the Physionet dataset, we consider the mortality prediction task. As a supervised baseline, we train an end-to-end model that directly uses the time series measurements to predict the risk of in-hospital mortality. For all other baselines, a simple Recurrent Neural Network (RNN) is trained to predict the risk using the representations over time. For the Air Quality dataset, the task is to estimate the average daily rain. Similarly, a simple RNN model is trained for this predictor using the representations. For our approach and C-DSVAE, where the global and local representations are encoded separately, the global representations are concatenated with the output of the RNN model to estimate the downstream target. Appendix 6.2 provides more details on the architecture of the models used in our experiments.

Table 1 shows the performance of all models on the prediction tasks. For better comparison, we include baselines with different representation dimensionalities. The results show that our method outperforms the others on the ICU mortality prediction task, even with fewer representation dimensions, and comes second to C-DSVAE on daily rain estimation. For Physionet, our performance is even close to that of a fully-supervised model. GP-VAE performs better than the regular VAE, as it properly models the correlation between representations of samples over time. As we increase the dimensionality of the representations, this model improves, albeit with higher complexity and less interpretable encodings. By decoupling the representations, our method achieves superior performance with smaller dimensionality. Lastly, the results of our method without regularization show that the counterfactual regularization substantially improves performance.

4.2 Subgroup Identification

Figure 4: t-SNE visualization of the representations of the Air Quality dataset. Each data point is the global representation of a time series sample, and the color indicates the month of the year of that measurement.

In many cases, we are interested in identifying or clustering samples with similar global properties, invariant to the other factors that can influence the time series trajectory. The global representations provide the information that allows us to identify such subgroups of the data. This is an important application: as an example of the benefits, earlier work shows that in many applications, performing cluster-specific modelling or prediction can improve the overall performance of ML models (giordano2019coherent; bouveyron2011model). This experiment evaluates how well our global representations identify clusters of similar samples. For the Physionet dataset, we choose the ICU unit type as a proxy label for identifying subgroups, because patients admitted to the same type of unit share some underlying similarities. For the Air Quality dataset, we are interested in identifying the month of the year to which each recording belongs.

                                     Air Quality                        Physionet
Model                  Dimensions    NLL              MSE               NLL              MSE
Our method             8 steps + 8   1.445 ± 0.052    0.609 ± 0.026     2.609 ± 0.032    1.183 ± 0.016
Our method - no reg.   8 steps + 8   3.866 ± 0.344    0.906 ± 0.040     3.250 ± 0.145    1.323 ± 0.048
GP-VAE                 8 steps       1.908 ± 0.114    0.743 ± 0.048     2.626 ± 0.063    1.197 ± 0.031
GP-VAE                 16 steps      3.080 ± 0.281    0.941 ± 0.065     2.833 ± 0.069    1.295 ± 0.035
Table 3: Performance of different representation learning methods used for time series forecasting.

We use a simple MLP classifier to predict the subgroup of each sample from the learned representations. Our method and C-DSVAE encode global representations separately, and these are what we use for training. For the other baselines, we randomly select the representation of one window for this task, since those representations encode both the local and global properties of the sample. Table 2 summarizes the classification performance of all baselines. Our results support the claim that the global representations capture the global characteristics of each sample: using only these representations, we can identify samples with similar characteristics better than the other baselines, regardless of changes over time. Fig. 4 also shows the 2-dimensional projection of the global representations of the Air Quality dataset learned using our method. We observe a clear distinction between the global properties of measurements from the warmer months of the year and those from winter and fall. Even within these categories, the representations cluster the months together.

4.3 More Accurate Forecasting

Forecasting high-dimensional time series is another important application in many domains, such as retail and finance. Prior work observes that exploiting global patterns and coupling them with local calibration helps prediction performance on many datasets (sen2019think). Our proposed method, designed in a similar vein for representation learning, is therefore well-suited to improving forecasting performance. More importantly, since we model the local representations over time using a GP, the conditional distribution over future local representations can be estimated by conditioning on the observed historical representations (williams2006gaussian), as shown in Eq. 7, where $z^{j}_{1:t}$ are the observed local representations for dimension $j$ and $t{+}1{:}T$ corresponds to the future time steps:

$$ p\big(z^{j}_{t+1:T} \mid z^{j}_{1:t}\big) = \mathcal{N}\big(K_{*}^{\top} K^{-1} z^{j}_{1:t},\; K_{**} - K_{*}^{\top} K^{-1} K_{*}\big) \tag{7} $$

Here $K$ is the kernel matrix over the observed time steps, $K_{*}$ the cross-covariance between observed and future steps, and $K_{**}$ the kernel matrix over the future steps.
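This conditioning is standard GP regression. A minimal sketch, with an assumed RBF kernel and illustrative variable names and values:

```python
import numpy as np

def rbf(t1, t2, length_scale=2.0):
    d = t1[:, None] - t2[None, :]
    return np.exp(-d**2 / (2 * length_scale**2))

# Condition a zero-mean GP on observed latent values to forecast future ones.
t_obs = np.arange(6.0)           # observed window indices
t_new = np.array([6.0, 7.0])     # future window indices
z_obs = np.sin(t_obs)            # toy trajectory of one latent dimension

jitter = 1e-6 * np.eye(len(t_obs))      # numerical stabilizer
K = rbf(t_obs, t_obs) + jitter          # covariance over observed steps
K_star = rbf(t_new, t_obs)              # cross-covariance (future x observed)
K_ss = rbf(t_new, t_new)                # covariance over future steps

mean = K_star @ np.linalg.solve(K, z_obs)
cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
```

The predictive variance on the diagonal of `cov` stays below the prior variance, reflecting the information gained from the observed history; in our method this mean trajectory of future local representations is passed through the decoder to produce the forecast.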
We evaluate the performance of our method and the GP-VAE baselines on time series forecasting, as shown by the results in Table 3. The objective is to predict two windows, equivalent to 28 observations (two days of measurements) for the Air Quality data and eight measurements for Physionet, using the full history of the signal. While deep learning models are prone to overfitting with increasing forecast horizon, especially in non-stationary settings, our probabilistic method performs well in such scenarios by estimating the expected trajectory of the signal using predictions for the future local representations. Additional plots showing the forecasting performance are provided in the Appendix.


4.4 Learning Disentangled Representations

Figure 5: Examples of counterfactual signal generation on the simulated dataset. The top left plot shows the original time series in orange. Each row shows samples generated with the same underlying local representations as the original signal and a gradually varying global representation. Each column shows samples generated with a different local representation for each of those global values.

Many applications require generative models capable of synthesizing new instances where certain key factors of variation are held constant and others varied (mathieu2016disentangling). For example, in medical diagnosis, one may wish to model a treatment trajectory for different individuals. Decoupling the global representations from the local variations enables such control over the generative process. We curated a simulated dataset to test this functionality, since evaluating generated counterfactual samples is difficult on real-world data without knowledge of the ground truth. The dataset consists of time series samples composed of seasonality, trend, and noise. There are four types of signals in the dataset, with two global classes that determine the trend and intercept, and two local classes that determine the frequency of the seasonal component. More information on this dataset is provided in Appendix 6.1.

Using our approach, we learn one-dimensional local and global representations for each sample. We then generate new samples while controlling the local or global behaviour and observe the effect of these changes on the generated time series. Fig. 5 shows an example of this experiment. On the top left, we show a sample from the dataset. Over the rows, we generate samples by gradually changing the global representation while keeping the local behaviour the same. This changes the trend and intercept of the signal while maintaining the frequency, as expected given our knowledge of the generative process of this data. Next, we keep the global representation constant for each sample and generate samples by changing the local representation across the columns. The change in local representation manifests as a change in frequency, confirming our hypothesis. Note that some of the generated samples (Fig. 5, middle row) correspond to configurations never seen by the model, yet their properties are correctly inferred. This experiment demonstrates that we can learn the true underlying factors of the data and that our generative approach can synthesize realistic samples by varying individual factors of variation.
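The counterfactual generation described above amounts to swapping one factor while holding the other fixed. A minimal sketch, where `generate` and `toy_decoder` are hypothetical stand-ins for the trained decoder (the model in the paper is a neural network, not this linear toy):

```python
import numpy as np

def generate(decoder, z_global, z_locals):
    """Decode a full time series from one global representation and a
    sequence of per-window local representations."""
    return np.concatenate([decoder(z_global, z_l) for z_l in z_locals])

# toy decoder, linear in both factors: the global value sets the level
# (trend/intercept), the local value scales a seasonal window
def toy_decoder(z_g, z_l, window=10):
    return z_g + z_l * np.sin(np.linspace(0.0, np.pi, window))

z_locals = [0.5, 1.0, 1.5]                                    # held fixed
original       = generate(toy_decoder, z_global=0.0, z_locals=z_locals)
counterfactual = generate(toy_decoder, z_global=2.0, z_locals=z_locals)
# only the level changes; the seasonal shape set by the local factors is kept
```

Swapping `z_global` while reusing the same `z_locals` is exactly the row-wise manipulation in Fig. 5; varying `z_locals` with a fixed `z_global` gives the column-wise one.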

5 Conclusion

This paper introduces a generative approach for learning global and local representations of time series. We demonstrate that decoupling these representations improves downstream performance and efficiency and yields a better understanding of the underlying generative factors of the data. Decoupling the underlying factors of variation in time series brings us one step closer to understanding the generative process of complex temporal data. As a future direction, we would like to investigate how to associate the representations with concrete notions in the data space for improved interpretability, and to explore potential applications of counterfactual generation to real-world data.


6 Supplementary Material

6.1 Datasets

This section provides additional details about each of the datasets used in our experiments.

6.1.1 Physionet Dataset

This dataset was made available as part of the PhysioNet Computing in Cardiology Challenge 2012, with the objective of predicting mortality in ICU patients. The data consist of records from 12,000 ICU stays. All patients were adults who were admitted for a wide variety of reasons to cardiac, medical, surgical, and trauma ICUs. For cohort selection, out of the full set of features, which includes multiple lab measurements over time and physiological signals collected at the bedside, we selected those with less than 60 percent missing observations over the entire dataset. The length of each sample is restricted to between 40 and 80 measurements. Some of the time series processing steps are borrowed from johnson2012patient (without any feature extraction for the signals).

For all the experiments, we only use the time series measures for training the unsupervised representation learning frameworks, without the static information about each patient. We use the general descriptors as proxy labels for evaluating global representations throughout the experiments.
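The cohort-selection rules above (features under 60 percent missing, stays of 40 to 80 measurements) can be sketched as follows; the array layout and helper names are assumptions, not the paper's actual preprocessing code:

```python
import numpy as np

def select_features(X, max_missing=0.6):
    """Keep only features missing in fewer than 60% of entries.
    X: (n_samples, n_timesteps, n_features), with NaN marking missing values."""
    missing_rate = np.mean(np.isnan(X), axis=(0, 1))
    return X[:, :, missing_rate < max_missing]

def filter_by_length(stays, min_len=40, max_len=80):
    """Keep only ICU stays with between 40 and 80 measurements."""
    return [s for s in stays if min_len <= len(s) <= max_len]
```

Computing the missing rate over the whole dataset (rather than per stay) matches the description of dropping features by their dataset-wide missingness.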

6.1.2 Air Quality Dataset

The Beijing multi-site Air Quality dataset includes hourly air pollutant data from 12 nationally-controlled air-quality monitoring sites, collected from the Beijing Municipal Environmental Monitoring Center. The meteorological data at each air-quality site are matched with the nearest weather station of the China Meteorological Administration. The time period spans March 1st, 2013 to February 28th, 2017. We create our dataset by dividing this long time series into samples from the different stations and months of the year. For our experiments, we use the pollutant measurements as our time series, the daily rain amount as a proxy for local behaviour, and the station and month of each sample as the global characteristics.
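Dividing the record into one sample per station and month can be sketched with a pandas groupby; the column names (`station`, `datetime`, `pm25`) and the `make_samples` helper are illustrative assumptions:

```python
import pandas as pd

def make_samples(df):
    """Split a long hourly record into one sample per (station, month)."""
    out = df.assign(month=df["datetime"].dt.to_period("M"))
    return {key: g.sort_values("datetime").drop(columns="month")
            for key, g in out.groupby(["station", "month"])}

# toy example: two stations with hourly data over two months
idx = pd.date_range("2013-03-01", "2013-04-30 23:00", freq="h")
toy = pd.concat([pd.DataFrame({"station": s, "datetime": idx, "pm25": 0.0})
                 for s in ["A", "B"]], ignore_index=True)
samples = make_samples(toy)   # 2 stations x 2 months = 4 samples
```

Each dictionary value is one training sample; the station and month key is what serves as the global proxy label.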

6.1.3 Simulated Dataset

We created the simulated dataset to assess the quality of the generated counterfactual samples, since the ground-truth generative process of the data is known to us. The dataset consists of 500 samples, each with 100 measurements over time. Each sample is a one-dimensional time series that can be decomposed as follows:

x(t) = α·t + c + sin(ω·t) + ε(t)

where ε(t) is Gaussian noise, the trend slope α and intercept c are determined by the global class of the sample, and the seasonal frequency ω is determined by the local class. There are two possible classes for each of the local and global representations, resulting in four different types of time series samples in the cohort. Table 4 describes the underlying generative process of each sample based on its global and local class.

Global class = 1 Global class = 2
Local class = 1
Local class = 2
Table 4: Parameters of the simulated dataset
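A sketch of this generative process (trend plus seasonality plus Gaussian noise, with the global class setting slope and intercept and the local class setting the frequency); all numeric class parameters below are illustrative assumptions, not the values from Table 4:

```python
import numpy as np

def simulate_sample(global_cls, local_cls, t_len=100, noise_std=0.1, rng=None):
    """One sample = trend + seasonality + noise. The global class sets the
    trend slope and intercept, the local class sets the seasonal frequency
    (all numeric values below are assumed, not the paper's)."""
    rng = rng or np.random.default_rng()
    t = np.arange(t_len)
    slope, intercept = {1: (0.01, 0.0), 2: (-0.01, 1.0)}[global_cls]  # assumed
    freq = {1: 0.05, 2: 0.2}[local_cls]                               # assumed
    return (slope * t + intercept + np.sin(2 * np.pi * freq * t)
            + rng.normal(0.0, noise_std, size=t_len))

rng = np.random.default_rng(0)
data = np.stack([simulate_sample(int(rng.integers(1, 3)), int(rng.integers(1, 3)),
                                 rng=rng)
                 for _ in range(500)])   # 500 samples of length 100
```

Because the two class variables are independent, the four combinations give the four signal types described above.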

6.2 Experiment Setup

For reproducibility purposes, the implementation code for all the experiments is made available. We have also provided additional information about each experiment in this section.

6.2.1 Training our model

For training our method on the different datasets, we chose the hyper-parameters reported in Table 5. The window and representation sizes are design choices that depend on the properties of the data. The remaining parameters are selected based on model performance on the validation set.

Parameter Physionet Air Quality Simulation
1.0 0.5 2.0
0.1 0.1 0.1
Window size () 4 24 10
Global representation size () 8 8 1
Local representation size () 8 8 1
Optimizer Adam Adam Adam
Learning rate 0.001 0.001 0.01
Prior Kernels RBF, Cauchy RBF, Cauchy RBF
Prior Kernel scales 2, 1, 0.5, 0.25 2, 1, 0.5, 0.25 1
Table 5: List of the selected parameters for training our method on the different datasets

6.2.2 Downstream task

The downstream task test evaluates the generalizability and usability of the learned representations. All baseline methods learn a representation for each window of the time series. In addition to the representations over time, our method learns a single representation vector for the global variable. For the mortality prediction task, we use a single-layer RNN followed by a fully connected layer to estimate the risk of mortality. To integrate the global representation, our method concatenates the final hidden state of the RNN with the global representation vector. The designated task for the Air Quality dataset is the prediction of the average daily rain. Each local representation encodes a window of 24 measurements, equivalent to a day. A 2-layer MLP uses the local representations, concatenated with the global representation, to predict this value. A major challenge with rain prediction is that it is a highly imbalanced regression problem: on most days there is no rain, and on the remaining days the amount is highly variable. To mitigate this, we train with a weighted mean absolute error loss and report the mean squared error as the performance metric to ensure consistency among results.
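One simple form such a weighted mean absolute error loss could take is to up-weight the rare rainy days; the specific weighting scheme and weight value below are assumptions, not necessarily the paper's setting:

```python
import numpy as np

def weighted_mae(y_true, y_pred, rain_weight=5.0):
    """Mean absolute error that up-weights the rare rainy days
    (the weighting scheme and weight value are assumptions)."""
    w = np.where(y_true > 0, rain_weight, 1.0)
    return float(np.sum(w * np.abs(y_true - y_pred)) / np.sum(w))

loss = weighted_mae(np.array([0.0, 3.0]), np.array([0.0, 0.0]))
```

Without the weights, a model that always predicts zero rain would achieve a deceptively low loss on this imbalanced target.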

6.2.3 Subgroup classification

For this test, the goal is to identify similar subgroups using the global representations. For baselines where the two representations are not separated, we randomly select one of the local representations over time to use for the prediction; the global properties are expected to be encoded in that same representation vector in these baselines. Using a two-layer MLP on the learned representations, we classify samples into the subgroups defined by our proxy labels (type of ICU for Physionet and month of the year for the Air Quality dataset).

6.3 Supplementary Plots

Figure 6: Risk of mortality estimation over time, based on the local and global representations (As part of the downstream task test for Physionet data).
Figure 7: Exploratory analysis of local representations in Physionet data. The bottom heatmap shows the 8-dimensional local representation of windows over time (x-axis). The middle bar shows the priors over the different dimensions, and the top plot shows the indicator of mechanical ventilation. In the representation patterns, we can see signals indicating whether a patient is ventilated or not. Note that the ventilation information is never provided as an input to any of the encoders.
Figure 8: Reconstruction of a Physionet sample with a high missing rate
Figure 9: Forecasting pollutant measurements for the Air Quality dataset. The measurements before the vertical red line are observed (shown by the green line), and the forecasted measurements for two windows of the time series are shown with the blue line, with the green plot demonstrating the expected trajectory. The shaded regions indicate one standard deviation around the estimated distribution.