1 Introduction
Learning high-quality representations for complex data such as time series helps us better understand the underlying generative process of the data (ghahramani2015probabilistic) and has a substantial impact on the performance of downstream machine learning (ML) tasks (bengio2013representation). Condensing information into expressive lower-dimensional representations can improve interpretability (bai2018interpretable) and lead to better generalization and domain adaptation (chen2012marginalized). One challenge for downstream ML tasks on time-series data is that collecting labels can be expensive, as most real-world tasks require human expert domain knowledge. Moreover, the underlying state of the data may evolve over time, making the data harder for humans to interpret. Unsupervised representation learning is of immense value in such cases, providing a powerful tool to uncover the underlying state and to summarize complex data into informative, general-purpose representations.
Observational time series are often generated from different underlying factors of variation that can be identified through the representations. Some of these factors capture the global attributes of a sample and are unique to each individual, while others describe the variation of the underlying state over time. For instance, consider a medical diagnosis application: the observed physiological signals of a patient are a product of the individual's attributes, such as gender, age and pre-existing medical conditions, and of the medical treatment they received over time. These hidden factors can influence the time series in different ways. Uncovering and decoupling these factors is a fundamental challenge in representation learning that leads to substantial improvement in data understanding as well as in the performance of ML models that use the learned representations.
In this paper, we introduce an unsupervised representation learning method for time series that decouples the global and local properties of the signal into separate representations. Our method is based on generative modelling and assumes that each time series sample is generated from two underlying factors: a global and a local representation. The global representation is unique to each sample and captures its static, time-independent characteristics. The local representations encode the dynamic underlying state of windows of the time series as they evolve over time, accounting for the non-stationarity of the samples. We use variational approximation to model the posterior of the two sets of representations, and model the temporal dynamics using a prior sampled from a Gaussian Process (GP). To ensure that the two sets of representations are decoupled and model distinct data characteristics, we introduce a counterfactual regularization that disentangles the representations by minimizing their mutual information. Fig. 1 provides an overview of our framework. Decoupling local and global representations using our proposed method provides the following important benefits:

Exploiting global patterns in the data and coupling them with local variations serves as an inductive bias and improves downstream tasks, from time-series forecasting to classification.

By disentangling the factors of variation, we can represent the underlying information more efficiently, encoding what is necessary for target tasks in more compact representations.

Knowing the various factors involved in the generation of a time series helps identify and disentangle the underlying explanatory factors, resulting in more interpretable representations.

Having this information also allows either representation to be used flexibly for downstream tasks, depending on which property is appropriate for a specific use case. For instance, global representations can help identify subgroups with similar properties.

Using our approach, we can generate counterfactual instances of the data with controlled factors of variation by combining different global and local representations. This enables generating different manifestations of events, such as disease progression, in individuals with different characteristics.
2 Related Work
The performance of many ML models relies heavily on the quality of the data representations (Bengio+chapter2007). This applies to all data types, but it is especially vital for complex ones such as time series, which can be high-dimensional, high-frequency, and non-stationary (yang200610; langkvist2014review). Due to the difficulty of labelling time-series data, unsupervised approaches are often preferred in such settings. They include methods based on reconstruction (yuan2019wave2vec; fortuin2018som; fortuin2020gp; chorowski2019unsupervised), clustering (ma2019learning; lei2019similarity), contrastive objectives (oord2018representation; franceschi2019unsupervised; hyvarinen2019nonlinear; tonekaboni2020unsupervised; hyvarinen2016unsupervised), and others.
While the methods above successfully encode the informative parts of the signal in a low-dimensional representation, improving the interpretability of the representations remains an active area of research. One effort in this direction is disentangling the dimensions of the encoding, which enables representing different factors of variation in independent dimensions. Earlier approaches learn disentangled representations using supervised data (yang2015weakly; tenenbaum2000separating), while more recent methods provide unsupervised solutions using adversarial training (chen2016infogan; kim2018disentangling; kumar2018variational) or regularization of the data distribution (dezfouli2019disentangled). However, associating these factors with interpretable notions in the data domain remains challenging. A different line of work focuses on decoupling the global and local properties of samples into separate representations (or dimensions of the representation). This idea has been explored for visual data to separate the factors of variation associated with the labels from other sources of variability: mathieu2016disentangling use a conditional generative model with adversarial regularization, while ma2020decoupling learn decoupled representations of the global and local information of images, relying on empirical characteristics of VAE and flow models.
For time series, VAE-based methods have been used to disentangle the dynamic and static factors of representations for video and speech data. FHVAE (hsu2017unsupervised) uses a hierarchical VAE to learn sequence-dependent variables (corresponding to the speaker) and sequence-independent variables (corresponding to the linguistic factors) when modeling speech, but it does not explicitly encourage disentanglement. DSVAE (yingzhen2018disentangled) formalizes the idea of disentangled representation by explicitly factorizing the posterior into static and dynamic factors. Other methods augment DSVAE by explicitly enforcing disentanglement in the objective function. S3VAE (zhu2020s3vae) introduces additional loss terms that encourage invariance of the learnt representations by permuting frames and leveraging external supervision from motion labels; an additional regularization minimizes the mutual information between the two sets of representations. Similar regularizations are used in CDSVAE (bai2021contrastively) with a contrastive estimation of mutual information. Note that all the above methods focus on the generated samples rather than the quality or interpretability of the representations. As a result, the dynamic local representations are not designed to summarize the information of the time series over time. Other efforts for disentangling global and local representations are designed to improve specific downstream modeling tasks. For instance,
sen2019think leverage both local and global patterns to improve forecasting based on matrix factorization techniques. Similar ideas have been introduced to further improve forecasting performance (wang2019deep; nguyen2021temporal). schulam2015framework learn population and subpopulation parameters to model individualized disease trajectories.
3 Method
In this section, we present the notations used throughout this paper, followed by the problem definition and description of the method.
3.1 Notation
Let $X \in \mathbb{R}^{D \times T}$ be a multivariate time series sample, with $D$ input features and $T$ measurements over time. Each time series sample is generated from two latent variables, $z_g$ and $Z_l$. The global representation $z_g$ is a vector of size $d_g$ that represents the global properties of the sample. The local representation of the sample, $Z_l = \{z_l^1, z_l^2, \dots, z_l^M\}$, is composed of a set of vector representations of non-overlapping windows of the time series. Each $z_l^i$ is the representation of a window of length $\delta$ that encodes information for all features within that window. The $M$ windows split the sample into consecutive parts, as shown in Fig. 1. The sizes of the global and local representations are determined by $d_g$ and $d_l$, respectively. To handle missing measurements, each sample has a mask channel indicating which data points are measured and which are missing. Irregularly-sampled time series can also be converted into regularly-sampled signals, with the mask channel indicating the measurements. For simplicity, we drop the sample index in the rest of the paper.

Our assumed probabilistic data generation mechanism is shown in Fig. 2. We model the conditional likelihood distribution of the data as $p(X \mid z_g, Z_l)$. In Fig. 2, $x^i$ represents a window of the time series, and $z_l^i$ is the local representation of that window. The local representations of the windows can change over time as the underlying state of the time series changes. The dependencies among these local representations are modeled using a prior sampled from a Gaussian Process. The global representation $z_g$ is the same for all windows within a sample, and its prior is modelled as a standard Gaussian distribution.
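To make the generative assumption concrete, the sketch below samples a global representation from a standard Gaussian prior, samples each local latent dimension over the windows from a GP prior, and decodes the sample window by window. All sizes, the RBF kernel, and the linear "decoder" are our own illustrative choices, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

D, T, delta = 3, 40, 10        # features, time steps, window length (illustrative)
M = T // delta                 # number of non-overlapping windows
d_g, d_l = 4, 2                # global / local representation sizes

# Global representation: one draw per sample from the standard Gaussian prior.
z_g = rng.standard_normal(d_g)

# Local representations: each latent dimension evolves over the M windows
# under its own GP prior; an RBF kernel is one possible covariance choice.
t = np.arange(M, dtype=float)

def rbf(t, scale):
    return np.exp(-((t[:, None] - t[None, :]) ** 2) / (2 * scale ** 2))

Z_l = np.stack(
    [rng.multivariate_normal(np.zeros(M), rbf(t, s) + 1e-6 * np.eye(M))
     for s in (1.0, 2.0)],     # one kernel scale per latent dimension
    axis=1)                    # shape (M, d_l)

# Stand-in linear "decoder": every window is generated from (z_g, z_l^i).
W_g = rng.standard_normal((D * delta, d_g))
W_l = rng.standard_normal((D * delta, d_l))
X = np.concatenate(
    [(W_g @ z_g + W_l @ Z_l[i]).reshape(D, delta) for i in range(M)], axis=1)
```

Note how $z_g$ enters every window identically while each $z_l^i$ only affects its own window, which is the core of the assumed generative process.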
3.2 Modeling Distributions
As part of the learning algorithm, we use variational approximations to model the following three distributions:

The conditional likelihood distribution of the time series sample given the local and global representations, $p(X \mid z_g, Z_l)$. We approximate this using a decoder model.

The posterior over the local representations, $q(Z_l \mid X)$. We approximate this using the local encoder, which slices the time series into consecutive windows and approximates the joint distribution of all the local representations over time.

The posterior distribution of the global representation, $q(z_g \mid X)$. The global encoder approximates the parameters of this conditional distribution. The encoder can take the entire time series, or any part of it, as input for estimating the global representation. This is particularly useful in the presence of missing data, as it allows using the part of the signal with fewer missing observations. To ensure robustness and to encourage the global representation to be constant throughout a sample, the encoder is trained on random sub-windows of the sample.
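The sub-window training strategy for the global encoder can be sketched as follows. Here a per-feature mean over the chosen window stands in for the neural encoder, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def encode_global(x, window_len, encoder, rng):
    """Encode a randomly chosen sub-window of the sample; a good global
    encoder should be (approximately) invariant to the window choice."""
    start = rng.integers(0, x.shape[1] - window_len + 1)
    return encoder(x[:, start:start + window_len])

# Toy "encoder": per-feature mean over the chosen sub-window.
mean_encoder = lambda w: w.mean(axis=1)

x = rng.standard_normal((3, 50)) + 5.0   # sample with a strong global offset
z1 = encode_global(x, 20, mean_encoder, rng)
z2 = encode_global(x, 20, mean_encoder, rng)
```

Two different random sub-windows yield similar estimates of the sample's global offset, which is exactly the invariance the training scheme encourages.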
The local representations should model the temporal behaviour and the underlying time-varying states. The local encoder learns the representations of consecutive windows independently; however, we cannot assume that these representations are independent of one another. casale2018gaussian show that for video data, accounting for temporal covariance between the representations yields more general and informative encodings. Similar to the ideas presented in casale2018gaussian and fortuin2020gp, we impose temporal dependencies between the local representations using a prior sampled from a Gaussian Process (GP). A GP is a non-parametric Bayesian method well-suited for modelling temporal data that allows robust modelling even in the presence of uncertainty (roberts2013gaussian). We model each dimension of the local representations (indexed by $j$) independently over time, using separate GPs with different kernel functions. The intuition behind this choice is to decompose the latent representations into dimensions with unique temporal behaviours characterized by the covariance structure. For example, a dimension with a periodic kernel can model the seasonality of the underlying state of the signal. This is an important property, since the local representations should model the non-stationarity of the samples. The local encoder approximates the posterior of each dimension with a multivariate Gaussian distribution, as shown in Eq. 1.
$q(z_{l,j}^{1:M} \mid X) = \mathcal{N}(\mu_j, \Sigma_j), \quad j \in \{1, \dots, d_l\}$    (1)
Following fortuin2020gp, the precision matrix $\Sigma_j^{-1}$ is parameterized as a product of bidiagonal matrices (Eq. 2), where $B_j$ is an upper triangular band matrix; this construction guarantees positive definiteness and symmetry of $\Sigma_j^{-1}$. With this sparse estimation of the precision matrix, sampling from the posterior becomes more efficient (mallik2001inverse), while the estimated covariance can still be dense and model long-range dependencies in time.
$\Sigma_j^{-1} = B_j^\top B_j$    (2)
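A small numeric sketch of this parameterization (sizes and values illustrative): an upper-triangular bidiagonal $B$ with a positive main diagonal yields a sparse tridiagonal precision matrix that is symmetric positive definite by construction, while its inverse, the covariance, is dense:

```python
import numpy as np

rng = np.random.default_rng(2)
M = 5  # length of the latent sequence (number of windows)

# B is an upper-triangular band (bidiagonal) matrix: a positive main
# diagonal plus one super-diagonal keeps B full rank.
B = np.zeros((M, M))
B[np.arange(M), np.arange(M)] = np.exp(rng.standard_normal(M))
B[np.arange(M - 1), np.arange(1, M)] = rng.standard_normal(M - 1)

precision = B.T @ B              # symmetric PD, tridiagonal (sparse)
cov = np.linalg.inv(precision)   # the implied covariance is dense
```

The sparsity of the precision makes sampling cheap, yet the dense covariance still correlates windows that are far apart in time, which is the point made in the text.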
With the parametric approximations for the distributions, we can use the objective in Eq. 3 to train the models.
$\mathcal{L}_{\mathrm{ELBO}} = -\,\mathbb{E}_{q(z_g \mid X)\, q(Z_l \mid X)}\left[\log p(X \mid z_g, Z_l)\right] + D_{\mathrm{KL}}\left(q(z_g \mid X) \,\|\, p(z_g)\right) + D_{\mathrm{KL}}\left(q(Z_l \mid X) \,\|\, p(Z_l)\right)$    (3)
The log-likelihood term ensures the generated signals are realistic, and the divergence terms minimize the distance between the estimated distributions and their priors. As mentioned, the prior over the global representation is assumed to be a standard Gaussian, $p(z_g) = \mathcal{N}(0, I)$, and the prior over each dimension of the local representations is a zero-mean GP prior defined over time. We assume different kernel functions and parameters (e.g. the length scale) for different dimensions of the latent representations in order to model dynamics at multiple time scales. Our framework is compatible with many kernel structures, including but not limited to RBF, Cauchy, and periodic kernels. A list of the kernel functions used in our experiments is presented in Appendix 6.2. Note that the negative log-likelihood is only estimated for the observed measurements, to account for missing values.
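Both divergence terms have closed forms for Gaussian distributions. The sketch below (toy posterior values; the RBF kernel is one example choice) computes the KL of a local latent dimension against its GP prior, and of a diagonal-Gaussian global posterior against the $\mathcal{N}(0, I)$ prior:

```python
import numpy as np

def kl_mvn(mu0, cov0, mu1, cov1):
    """Closed-form KL( N(mu0, cov0) || N(mu1, cov1) )."""
    k = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff - k
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

M = 4
t = np.arange(M, dtype=float)
# Zero-mean GP prior over one latent dimension; RBF kernel as an example.
K = np.exp(-((t[:, None] - t[None, :]) ** 2) / 2.0) + 1e-6 * np.eye(M)

# Toy posterior for that dimension, as the local encoder might output it.
mu_q, cov_q = np.array([0.1, 0.2, 0.1, 0.0]), 0.5 * np.eye(M)
kl_local = kl_mvn(mu_q, cov_q, np.zeros(M), K)

# Diagonal-Gaussian posterior against the N(0, I) prior on z_g.
mu_g, var_g = np.array([0.3, -0.2]), np.array([0.8, 1.2])
kl_global = 0.5 * np.sum(var_g + mu_g ** 2 - 1.0 - np.log(var_g))
```

Summed with the negative log-likelihood of the observed measurements, these two terms give the objective of Eq. 3.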
Eq. 3, however, does not guarantee all the properties we expect from the representations. Nothing prevents all information from flowing through the local representations $Z_l$, which have a higher encoding capacity. This means the model can easily converge to a solution where all information about the global behaviour of the signal is encoded in the local representations. As a result, $z_g$ would become random noise ignored by the decoder, and the local representations would no longer capture the underlying states independently of sample variability. To address this, we introduce the counterfactual regularization term of the loss, described in the next section.
3.3 Counterfactual Regularization
The issue of information flowing through only one set of representations can be prevented if labels are available for the global sources of variation (reed2015deep; klys2018learning). In practice, however, such labels are rarely available, and the underlying factors are often unknown. We propose a counterfactual regularization that encourages $z_g$ to be informative and the global behaviours to be encoded only in the global representation. This means that variation in the local representation should not change the global identity of the time series, and vice versa. For each sample $X$ during training, we generate a counterfactual sample $\tilde{X}$ with the local representations of $X$ ($Z_l$) and a random global representation sampled from the prior ($\tilde{z}_g$), as explained in Fig. 3.
Ideally, this generated counterfactual sample would carry no signs of the global properties of $X$. If the two representations are independent, $\tilde{X}$ should not contain any information about $z_g$; therefore, $z_g$ should have low likelihood under the estimated posterior distribution of the global representation conditioned on the counterfactual sample $\tilde{X}$. Using the global encoder, we estimate this posterior distribution ($q(z_g \mid \tilde{X})$) and encourage the likelihood ratio of $z_g$ to $\tilde{z}_g$ to be low. Our proposed regularization term is as follows:
$\mathcal{L}_{\mathrm{CF}} = \log \dfrac{q(z_g \mid \tilde{X})}{q(\tilde{z}_g \mid \tilde{X})}$    (4)
As a result, the final objective becomes:
$\mathcal{L} = \mathcal{L}_{\mathrm{ELBO}} + \lambda \, \mathcal{L}_{\mathrm{CF}}$    (5)
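The mechanics of the regularizer can be sketched as follows. This toy example uses a linear decoder and a Gaussian global "encoder" with fixed variance as stand-ins for the networks in the paper (all names, shapes, and values are illustrative): it builds a counterfactual sample from the original local representation and a prior-sampled global one, then evaluates the log-likelihood ratio:

```python
import numpy as np

rng = np.random.default_rng(3)
d_g, d_l, D = 2, 2, 4

# Toy linear decoder and Gaussian global encoder; stand-ins for the
# networks in the paper (shapes and values illustrative only).
W_g = rng.standard_normal((D, d_g))
W_l = rng.standard_normal((D, d_l))
decode = lambda zg, zl: W_g @ zg + W_l @ zl

E = np.linalg.pinv(W_g)     # mean map of the toy global "encoder"
sigma = 0.5                 # fixed posterior std

def log_q(z, x):            # log-density of q(z_g | x), up to a constant
    return -0.5 * np.sum((z - E @ x) ** 2) / sigma ** 2

z_g = rng.standard_normal(d_g)        # global rep of the original sample
z_l = rng.standard_normal(d_l)        # its local rep (one window, for brevity)
z_g_tilde = rng.standard_normal(d_g)  # random draw from the global prior

x_tilde = decode(z_g_tilde, z_l)      # counterfactual sample

# Log-likelihood ratio of the original z_g vs. the z_g_tilde that actually
# generated the counterfactual, under the encoder posterior q(. | x_tilde).
penalty = log_q(z_g, x_tilde) - log_q(z_g_tilde, x_tilde)
```

During training this penalty is driven down, so the counterfactual reveals nothing about the original sample's global identity.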
Counterfactual regularization and disentanglement.
An essential role of the counterfactual regularization is encouraging independence implicitly. One way to achieve independence between the global and local variables is to minimize the mutual information between them. Following moyer2018invariant, the mutual information between the two sets of representations can be decomposed as:

$I(z_g; Z_l) = H(z_g) - H(z_g \mid X) - I(z_g; X \mid Z_l)$    (6)

The first two terms measure the information captured by the global representation $z_g$, which is also considered in the variational autoencoder objective (Eq. 3). Minimizing $I(z_g; Z_l)$ can therefore be done by minimizing $\mathbb{E}[\log q(z_g \mid Z_l)]$. As we do not have access to the distribution $q(z_g \mid Z_l)$, existing works (cheng2020club) use an additional network to construct a variational approximation. Instead of introducing an additional network to approximate $q(z_g \mid Z_l)$, which increases training complexity and computation time, we reuse the global encoder from the counterfactual regularization for this approximation, as follows. Since $\tilde{z}_g$ is sampled from the prior distribution independently of $Z_l$, we have $q(z_g \mid \tilde{X}) \approx q(z_g \mid Z_l)$ if the decoder preserves the information. This implies that minimizing (4) implicitly minimizes $I(z_g; Z_l)$ and encourages decoupling between the representations.

4 Experiments
Table 1: Downstream prediction performance (mean ± std).

Model                 Dimensions   ICU Mortality (AUPRC)  ICU Mortality (AUROC)  Daily Rain (MAE)
Our method            8 steps + 8  0.365 ± 0.092          0.752 ± 0.011          1.824 ± 0.001
Our method (no reg.)  8 steps + 8  0.238 ± 0.026          0.672 ± 0.010          1.825 ± 0.001
GPVAE                 8 steps      0.266 ± 0.034          0.662 ± 0.036          1.824 ± 0.001
GPVAE                 16 steps     0.282 ± 0.086          0.699 ± 0.018          1.826 ± 0.001
VAE                   8 steps      0.157 ± 0.053          0.564 ± 0.044          1.831 ± 0.005
VAE                   16 steps     0.118 ± 0.001          0.491 ± 0.037          1.840 ± 0.012
CDSVAE                8 steps + 8  0.158 ± 0.005          0.565 ± 0.007          1.806 ± 0.012
Supervised            NA           0.446 ± 0.036          0.802 ± 0.043          0.079 ± 0.001
Table 2: Subgroup classification accuracy (%).

Model                 Dimensions  Air Quality   Physionet
Our method            8           57.93 ± 3.53  46.98 ± 3.04
Our method (no reg.)  8           38.35 ± 2.67  32.54 ± 0.00
GPVAE                 8           36.73 ± 1.40  42.47 ± 2.02
GPVAE                 16          33.57 ± 1.50  44.67 ± 0.50
VAE                   8           27.17 ± 0.03  34.71 ± 0.23
VAE                   16          31.20 ± 0.33  35.92 ± 0.38
CDSVAE                8           47.07 ± 1.20  32.54 ± 0.00
Supervised            NA          62.43 ± 0.54  62.00 ± 2.10
Evaluating unsupervised representation learning is challenging due to the lack of well-defined labels for the underlying representations and factors of variation. However, the generalizability and informativeness of the representations can be assessed on different downstream tasks. We present a number of experiments to evaluate the performance of our method against the following benchmarks: (i) a Variational Auto-Encoder (VAE) as a standard unsupervised representation learning framework, (ii) GPVAE (fortuin2020gp), (iii) CDSVAE (bai2021contrastively), which separates dynamic and static representations (our implementation for time series omits the augmentations proposed for the video setting, because such transformations, e.g. cropping and color distortion, are not defined for time series), and (iv) a model trained in a supervised fashion for each task.
All baselines are trained to learn representations for consecutive windows of the time series sample. For consistency, and to allow comparison of performance results, the encoder and decoder architectures are kept the same across baselines. One benefit of decoupling local and global representations is that we can condense the representation into fewer dimensions, especially since the global representation is shared throughout the sample. We have chosen different encoding sizes for our baselines to better reflect this advantage. The performance of all models is compared across multiple evaluation tests using two time series datasets:


Physionet ICU Dataset (goldberger2000physiobank): A medical time-series dataset of records from 12,000 adult ICU stays. The temporal measurements consist of physiological signals and various lab measurements. For each recording, there are general descriptors of the patient (age, type of ICU admission, etc.) as well as labels indicating in-hospital mortality. We use these descriptors as proxies for the global properties of the signals.

UCI Beijing Multi-site Air Quality Dataset (zhang2017cautionary): This dataset includes hourly measurements of multiple air pollutants from 12 nationally-controlled monitoring sites, collected over four years. The measurements are matched with meteorological data from the nearest weather station. We partition the data such that each time series sample is the pollutant reading for part of a particular month of the year.
To fully demonstrate the capability of our method, we focus on non-stationary time series. Unlike most classification tasks, where the time series is windowed into samples of different classes, here we are interested in long time series whose behaviour may change over time. More information on the experiment datasets, cohort selection, and processing is provided in Appendix 6.1.
4.1 Improving Downstream Prediction Tasks
General representations should capture the important information in the data and can therefore be leveraged for downstream prediction tasks by training simple predictors on top of them. This approach is commonly used to evaluate the quality of representations (oord2018representation; franceschi2019unsupervised; fortuin2020gp). For the Physionet dataset, we consider the mortality prediction task. As a supervised baseline, we train an end-to-end model that directly uses the time series measurements to predict the risk of in-hospital mortality. For all other baselines, a simple Recurrent Neural Network (RNN) is trained to predict the risk from the representations over time. For the Air Quality dataset, the task is to estimate the average daily rain. Similarly, a simple RNN model is trained as the predictor on top of the representations. For our approach and CDSVAE, where the global and local representations are encoded separately, the global representation is concatenated with the output of the RNN to estimate the downstream target. Appendix 6.2 provides more details on the architectures used in our experiments.

Table 1 shows the performance of all models on the prediction tasks. For better comparison, we include baselines with different representation dimensionalities. The results show that our method outperforms the others on the ICU mortality prediction task, even with fewer representation dimensions, and comes second to CDSVAE for daily rain estimation. On Physionet, we even perform close to a fully-supervised model. GPVAE performs better than the regular VAE, as it properly models the correlation between representations over time. As the dimensionality of the representations increases, this model improves, albeit with higher complexity and less interpretable encodings. By decoupling the representations, our method achieves superior performance with smaller dimensionality. Lastly, the results of our method without regularization demonstrate that the counterfactual regularization substantially improves performance.
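The predictor described above can be sketched as a single-layer RNN over the local representations whose final hidden state is concatenated with the global representation before the output layer. The parameters below are untrained and illustrative; only the representation sizes follow the experiments:

```python
import numpy as np

rng = np.random.default_rng(4)
d_l, d_g, h, M = 8, 8, 16, 10   # representation sizes as in the experiments

# Untrained parameters for a single-layer RNN predictor (illustrative).
W_x = 0.1 * rng.standard_normal((h, d_l))
W_h = 0.1 * rng.standard_normal((h, h))
w_out = 0.1 * rng.standard_normal(h + d_g)

def predict_risk(Z_l, z_g):
    """Run an RNN over the local representations, concatenate the global
    representation with the final hidden state, and map to a risk score."""
    s = np.zeros(h)
    for z in Z_l:                        # Z_l has shape (M, d_l)
        s = np.tanh(W_x @ z + W_h @ s)
    logit = w_out @ np.concatenate([s, z_g])
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid: risk in (0, 1)

risk = predict_risk(rng.standard_normal((M, d_l)), rng.standard_normal(d_g))
```

The design point is that the RNN only has to track local dynamics, since the static information arrives separately through the concatenated global vector.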
4.2 Subgroup Identification
In many cases, we are interested in identifying or clustering samples with similar global properties, invariant to the other factors that influence the time series trajectory. The global representations provide the information that allows us to identify such subgroups of the data. This is an important application; as an example of its benefits, earlier work shows that cluster-specific modelling or prediction can improve the overall performance of ML models in many settings (giordano2019coherent; bouveyron2011model). This experiment evaluates how well our global representations identify clusters of similar samples. For the Physionet dataset, we choose the ICU unit type as a proxy label for identifying subgroups, because patients admitted to different units share some underlying similarities. For the Air Quality dataset, we are interested in identifying the month of the year each recording belongs to.
Table 3: Forecasting performance.

Model                 Dimensions   Air Quality (NLL)  Air Quality (MSE)  Physionet (NLL)  Physionet (MSE)
Our method            8 steps + 8  1.445 ± 0.052      0.609 ± 0.026      2.609 ± 0.032    1.183 ± 0.016
Our method (no reg.)  8 steps + 8  3.866 ± 0.344      0.906 ± 0.040      3.250 ± 0.145    1.323 ± 0.048
GPVAE                 8 steps      1.908 ± 0.114      0.743 ± 0.048      2.626 ± 0.063    1.197 ± 0.031
GPVAE                 16 steps     3.080 ± 0.281      0.941 ± 0.065      2.833 ± 0.069    1.295 ± 0.035
We use a simple MLP classifier to predict the subgroup of each sample from the learned representations. Our method and CDSVAE define the global representations separately, and these are what we use for training. For the other baselines, we randomly select the representation of one window for this task, since those representations encode both the local and global properties of the sample. Table 2 summarizes the classification performance of all baselines. Our results support the claim that the global representations capture the global characteristics of each sample: using only these representations, we can identify samples with similar characteristics better than the other baselines, regardless of changes over time. Fig. 4 shows the 2-dimensional projection of the global representations of the Air Quality dataset learned by our method. We observe a clear distinction between the global properties of measurements from the warmer months of the year and those from around winter and fall. Even within these categories, the representations tend to cluster the months together.

4.3 More Accurate Forecasting
Forecasting high-dimensional time series is another important application in many domains, such as retail and finance. Prior work observes that exploiting global patterns and coupling them with local calibration helps prediction performance on many datasets (sen2019think). Our proposed method, designed in a similar vein for representation learning, is therefore well-suited to improve forecasting. More importantly, since we model the local representations over time using a GP, the conditional distribution over future local representations can be estimated by conditioning on the observed historical representations (williams2006gaussian), as shown in Eq. 7, where $z_l^{1:M}$ are the observed local representations and $z_l^{M+1:M+k}$ correspond to the future time steps.
$p\big(z_{l,j}^{M+1:M+k} \mid z_{l,j}^{1:M}\big) = \mathcal{N}\big(K_{*} K^{-1} z_{l,j}^{1:M},\; K_{**} - K_{*} K^{-1} K_{*}^\top\big)$    (7)

where $K$ is the prior covariance over the observed time steps, $K_{*}$ the cross-covariance between future and observed steps, and $K_{**}$ the prior covariance over the future steps.
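This is the standard zero-mean GP conditional; a minimal sketch for one latent dimension follows (the RBF kernel, horizon, and toy observed values are our assumptions):

```python
import numpy as np

def rbf(a, b, scale=2.0):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * scale ** 2))

M, k = 8, 2                               # observed windows, forecast horizon
t_obs = np.arange(M, dtype=float)
t_new = np.arange(M, M + k, dtype=float)

z_obs = np.sin(0.5 * t_obs)               # observed local reps (toy values)

K = rbf(t_obs, t_obs) + 1e-6 * np.eye(M)  # prior covariance, with jitter
K_star = rbf(t_new, t_obs)                # cross-covariance, shape (k, M)
K_ss = rbf(t_new, t_new)                  # prior covariance of future steps

# Zero-mean GP conditional for the future local representations.
K_inv = np.linalg.inv(K)
mean_f = K_star @ K_inv @ z_obs
cov_f = K_ss - K_star @ K_inv @ K_star.T
```

Decoding `mean_f` (together with the sample's global representation) then yields the forecast trajectory, with `cov_f` quantifying its uncertainty.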
We evaluate the forecasting performance of our method and the GPVAE baselines, with results shown in Table 3. The objective is to predict two windows ahead, equivalent to 28 observations (2 days' measurements) for the Air Quality data and eight measurements for Physionet, using the full history of the signal. While deep learning models are prone to overfitting as the forecast horizon increases, especially in non-stationary settings, our probabilistic method performs well in such scenarios by estimating the expected trajectory of the signal from the predicted future local representations. Additional plots of the forecasting performance are provided in Appendix 6.3.

4.4 Learning Disentangled Representations
Many applications require generative models capable of synthesizing new instances where certain key factors of variation are held constant while others are varied (mathieu2016disentangling). For example, in medical diagnosis, one may wish to model a treatment trajectory on different individuals. Decoupling the global representation from the local variations enables such control over the generative process. We have curated a simulated dataset to test this functionality, as evaluating generated counterfactual samples is difficult on real-world data without knowledge of the ground truth. This dataset consists of time series samples composed of seasonality, trend, and noise. There are four types of signals in the dataset, with two global classes that determine the trend and intercept, and two local classes that determine the frequency of the seasonal component. More information on this dataset is provided in Appendix 6.1.
Using our approach, we learn one-dimensional local and global representations for each sample. We then generate new samples while controlling the local or global behaviour and observe the effect on the generated time series. Fig. 5 shows an example of this experiment. On the top left, we show a sample from the dataset. Over multiple rows, we generate samples by gradually changing the global representation while keeping the local behaviour fixed; this changes the trend and intercept of the signal while maintaining the frequency, as expected given our knowledge of the generative process. We then keep the global representation fixed for each sample and generate samples by changing the local representation (second column). The change in local representations manifests as a change in frequency, confirming our hypothesis. Note that some of the generated samples (Fig. 5, middle row) were never seen by the model, yet their properties are correctly inferred. This experiment shows that our method can recover the true underlying factors of the data, and that our generative approach can synthesize realistic samples by varying specific underlying factors of variation.
5 Conclusion
This paper introduces a generative approach for learning global and local representations of time series. We demonstrate the benefits of decoupling these representations for improving downstream performance and efficiency, and for better understanding the underlying generative factors of the data. Decoupling the underlying factors of variation in time series brings us one step closer to understanding the generative process of complex temporal data. As future work, we would like to investigate how to associate the representations with concrete notions in the data space for improved interpretability, and to explore potential applications of counterfactual generation to real-world data.
References
6 Supplementary Material
6.1 Datasets
This section provides additional details about each of the datasets used in our experiments.
6.1.1 Physionet Dataset
This dataset was made available as part of the PhysioNet Computing in Cardiology Challenge 2012 (https://physionet.org/content/challenge2012/1.0.0/), with the objective of predicting mortality in ICU patients. The data consists of records from 12,000 ICU stays. All patients were adults admitted, for a wide variety of reasons, to cardiac, medical, surgical, and trauma ICUs. For cohort selection, out of the entire set of features, which includes multiple lab measurements over time and physiological signals collected at the bedside, we selected those with less than 60 percent missing observations over the entire dataset. The length of each sample is restricted to between 40 and 80 measurements. Some of the time series processing steps are borrowed from johnson2012patient (https://github.com/alistairewj/challenge2012), without any feature extraction for the signals.
For all the experiments, we only use the time series measures for training the unsupervised representation learning frameworks, without the static information about each patient. We use the general descriptors as proxy labels for evaluating global representations throughout the experiments.
6.1.2 Air Quality Dataset
The Beijing Multi-site Air Quality dataset (https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data) includes hourly air pollutant data from 12 nationally-controlled air-quality monitoring sites, collected by the Beijing Municipal Environmental Monitoring Center. The meteorological data for each air-quality site are matched with the nearest weather station of the China Meteorological Administration. The time period is from March 1st, 2013 to February 28th, 2017. We create our dataset by dividing this time series into samples from different stations and different months of the year. For our experiments, we use the pollutant measurements as the time series, the daily rain amount as a proxy for local behaviour, and the station and month of each sample as the global characteristics.
6.1.3 Simulated Dataset
We created the simulated dataset to assess the quality of counterfactual generated samples, since the ground-truth generative process of the data is known to us. The dataset consists of 500 samples, each with 100 measurements over time. Each sample is a one-dimensional time series that can be decomposed as:

$x_t = c + \alpha\, t + \beta \sin(2\pi f t) + \eta_t$    (8)

where $\eta_t$ is Gaussian noise, and the remaining parameters are determined by the global and local representation classes of the sample. There are two possible classes for each of the local and global representations, resulting in four different types of time series samples in the cohort. Table 4 describes the underlying generative process of each sample type based on the global and local class.
Table 4: Generative process for each combination of the global and local class.

                    Local class = 1    Local class = 2
Global class = 1
Global class = 2
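A sketch of how such a dataset could be generated follows. The source specifies only the class structure (global classes set the trend and intercept, local classes set the seasonal frequency); the specific intercepts, slopes, frequencies, and noise scale below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
T = 100

# Hypothetical class parameters: two global classes (intercept, slope)
# and two local classes (seasonal frequency); the exact values here are
# our own illustrative choices, not the paper's.
GLOBAL = {1: (0.0, 0.02), 2: (2.0, -0.02)}   # class -> (intercept, slope)
LOCAL = {1: 0.05, 2: 0.15}                   # class -> seasonal frequency

def simulate(g, l, noise=0.1):
    t = np.arange(T, dtype=float)
    c, a = GLOBAL[g]
    return (c + a * t + np.sin(2 * np.pi * LOCAL[l] * t)
            + noise * rng.standard_normal(T))

# Four sample types from the two global and two local classes.
samples = {(g, l): simulate(g, l) for g in (1, 2) for l in (1, 2)}
```

Because the class labels are known by construction, generated counterfactuals can be checked directly against the intended trend, intercept, and frequency.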
6.2 Experiment Setup
For reproducibility purposes, the implementation code for all the experiments is made available. We have also provided additional information about each experiment in this section.
6.2.1 Training our model
To train our method on the different datasets, we chose the hyperparameters reported in Table 5. The window and representation sizes are design choices that depend on the properties of the data. The rest of the parameters are selected based on model performance on the validation set.
Table 5: Hyperparameters used for training our method.

Parameter                   Physionet        Air Quality      Simulation
—                           1.0              0.5              2.0
—                           0.1              0.1              0.1
Window size                 4                24               10
Global representation size  8                8                1
Local representation size   8                8                1
Optimizer                   Adam             Adam             Adam
Learning rate               0.001            0.001            0.01
Prior kernels               RBF, Cauchy      RBF, Cauchy      RBF
Prior kernel scales         2, 1, 0.5, 0.25  2, 1, 0.5, 0.25  1
6.2.2 Downstream task
The downstream task test evaluates the generalizability and usability of the learned representations. All baseline methods learn a representation for each window of the time series. In addition to these representations over time, our method learns a single representation vector for the global variable. For the mortality prediction task, we use a single-layer RNN followed by a fully connected layer to estimate the risk of mortality. To integrate the global representation, our method concatenates the final hidden state of the RNN with the global representation vector. The designated task for the Air Quality dataset is the prediction of average daily rain. Each local representation encodes a window of 24 samples, equivalent to a day of measurements. A 2-layer MLP uses the local representations, concatenated with the global representation, to predict this value. A key challenge with rain prediction is that it is a highly imbalanced regression problem: on most days there is no rain, and on the rest the amount is highly variable. To mitigate this, we use a weighted mean absolute error loss for training, and report the mean absolute error as the performance metric to ensure consistency among results.
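A minimal sketch of a weighted mean absolute error of the kind described. The weighting scheme and the `zero_weight` value are our assumptions, not the paper's exact loss:

```python
import numpy as np

def weighted_mae(y_true, y_pred, zero_weight=0.2):
    """MAE that down-weights the dominant zero-rain days so that the rare
    rainy days drive the loss; the weighting itself is illustrative."""
    w = np.where(y_true > 0, 1.0, zero_weight)
    return np.sum(w * np.abs(y_true - y_pred)) / np.sum(w)

y_true = np.array([0.0, 0.0, 0.0, 5.0])
y_pred = np.array([1.0, 1.0, 1.0, 1.0])
loss = weighted_mae(y_true, y_pred)   # (0.2*3*1 + 1.0*4) / (0.2*3 + 1.0)
```

Under a plain MAE the three dry days would dominate; the weights shift the emphasis toward the single rainy day, which is the imbalance the text describes.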
6.2.3 Subgroup classification
For this test, the goal is to identify similar subgroups using the global representations. For baselines where the two representations are not separated, we randomly select one of the local representations over time to use for the prediction; in these baselines, the global properties are expected to be encoded in the same representation vectors. Using a two-layer MLP and the learned representations, we classify samples into the subgroups defined by our proxy labels (type of ICU for Physionet and month of the year for the Air Quality dataset).