Leveraging Time Irreversibility with Order-Contrastive Pre-training

11/04/2021
by Monica Agrawal, et al.

Label-scarce, high-dimensional domains such as healthcare present a challenge for modern machine learning techniques. To overcome the difficulties posed by a lack of labeled data, we explore an "order-contrastive" method for self-supervised pre-training on longitudinal data. We sample pairs of time segments, switch the order for half of them, and train a model to predict whether a given pair is in the correct order. Intuitively, the ordering task allows the model to attend to the least time-reversible features (for example, features that indicate progression of a chronic disease). The same features are often useful for downstream tasks of interest. To quantify this, we study a simple theoretical setting where we prove a finite-sample guarantee for the downstream error of a representation learned with order-contrastive pre-training. Empirically, in synthetic and longitudinal healthcare settings, we demonstrate the effectiveness of order-contrastive pre-training in the small-data regime over supervised learning and other self-supervised pre-training baselines. Our results indicate that pre-training methods designed for particular classes of distributions and downstream tasks can improve the performance of self-supervised learning.

1 Introduction

The advent of electronic health records has led to an explosion in longitudinal health data. This data can power comparative effectiveness studies, provide clinical decision support, enable retrospective research over real-world outcomes, and inform clinical trial design. However, longitudinal health data is often complex, unstructured, and high-dimensional, and thus remains largely untapped. Typically, limited labeled data is available for downstream tasks of interest, and labels can be prohibitively expensive to obtain: long records are tedious to synthesize, comprehension requires domain expertise, and patient privacy regulations limit data-sharing across institutions (Bleackley and Kim, 2013; Xia and Yetisgen-Yildiz, 2012). Fortunately, given the large amount of unlabeled data, self-supervision is a promising avenue.

In self-supervision, models are first pre-trained to optimize an objective over unlabeled data, with the goal of learning representations that capture important semantic structure about the input data modality. For example, in masked language modeling for text, the model is trained to predict the identity of randomly masked tokens. Performing well at this objective should require a representation of sentence syntax and semantics. Once pre-trained, self-supervised representations can be used for downstream supervised tasks.

Figure 1: Depiction of the data-generation and learning process for order-contrastive pre-training. For each trajectory, a pair of consecutive windows is sampled uniformly at random, flipped with probability 0.5, and presented to the model. The model is trained to predict whether the presented pair of windows is in the correct ($y = 1$) or incorrect ($y = 0$) order.

However, despite the success of self-supervision across domains, the development of new self-supervised objectives has been a largely heuristic endeavor.

Why does pre-training improve performance on downstream tasks? Whether this happens depends on both the self-supervised objective and the downstream task itself. But the assumptions linking the self-supervised objective to the downstream tasks of interest are rarely, if ever, made explicit.

In this work, we design a self-supervised objective with a particular class of data distributions and downstream tasks in mind. We aim to make explicit the type of distributions and downstream tasks on which we expect this method to work. We are interested primarily in the types of time-series that arise in longitudinal health data, such as long sequences of clinical notes, insurance claims data, biomarker measurements, or combinations thereof. A key differentiating feature of these data is that a given trajectory can change quickly. For example, a patient may develop a new symptom between subsequent healthcare visits, or a certain biomarker value (e.g., blood pressure) may dramatically increase.

Additionally, these changes largely tend to be irreversible with respect to time. For example, once the word “metastasis” appears in a clinical note, nearly all subsequent notes tend to comment on the state of that metastasis (so the word “metastasis” appears in those notes as well). To train a good representation for certain downstream problems (e.g., “what is the patient’s current disease stage?”), the self-supervised objective should attend to these changes, rather than suppress them.

These properties make such data distributions unsuitable for several existing self-supervised objectives. For example, Franceschi et al. (2019) train a model so that the representation of each time segment is more similar to those of its subsegments than the representation of a randomly chosen segment from another trajectory. Similar techniques have been used to learn image representations from video: two subsequent video frames are likely to contain the same objects (Mobahi et al., 2009; Goroshin et al., 2015). These approaches are all similar to the idea of slow feature analysis (Wiskott and Sejnowski, 2002) for extracting representations of an input signal that change slowly over time. Representation learning techniques based on the ideas of slow feature analysis are appropriate for some downstream tasks and time-series data types, such as the ones studied in the works above, but not, we argue, for data where the time-irreversible features are highly useful for downstream classification.

In this work, we introduce a self-supervised objective called order-contrastive pre-training (OCP). For each trajectory in the input data, we sample random pairs of time segments, switch the order for half of them, and train a model to predict whether a given pair is in the correct order (positives) or in the incorrect order (negatives).

This procedure is shown in Figure 1. OCP is very similar to an existing technique known as permutation-contrastive learning, or PCL (Hyvärinen and Morioka, 2017). PCL was also designed to take advantage of temporal dependence between features of the input signal to learn useful representations. The key difference between these two objectives is in the sampling of the negatives. Where the negatives in OCP are incorrectly-ordered window pairs, the negatives in PCL are random window pairs from the same trajectory, and could be in the correct order. In their simplest forms, the positive samples for the two methods are identical: pairs of consecutive windows in the correct order.
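As a concrete illustration, the following minimal Python sketch implements the sampling process of Figure 1 (an illustration only, not the code used in our experiments; the function and variable names are ours):

```python
import random

def sample_ocp_pair(trajectory, rng=random):
    """Sample one order-contrastive training example from a trajectory.

    trajectory: a list of per-time-step feature vectors (or note encodings).
    Returns ((first, second), label), where label = 1 if the pair is
    presented in the correct temporal order and 0 otherwise.
    """
    assert len(trajectory) >= 2, "need at least two time steps"
    # Pick a pair of consecutive windows uniformly at random.
    t = rng.randrange(len(trajectory) - 1)
    first, second = trajectory[t], trajectory[t + 1]
    # Flip the pair with probability 0.5; the flip indicator is the label.
    if rng.random() < 0.5:
        return (first, second), 1  # correct order
    return (second, first), 0      # incorrect order

# Example: two short trajectories of 1-dimensional binary "features".
trajectories = [[[0], [1], [1], [1]], [[0], [0], [1]]]
pairs = [sample_ocp_pair(traj) for traj in trajectories for _ in range(4)]
```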

Intuitively, the same time-irreversible features that are useful for the OCP and PCL objectives should also be useful for downstream prediction tasks. To formalize and quantify this, we study a class of data distributions motivated by the preceding discussion. When the representation belongs to a simple hypothesis class (effectively, when the representation is a feature selector), we prove a finite-sample bound on the downstream error of a representation learned using OCP. Although this setting is much simpler than those that appear in similar work (it involves linear, rather than nonlinear, representations of the input data), we show that this model still admits interesting behavior. In particular, we give an example of a data distribution in this setup where OCP and PCL provably learn different representations. Additionally, this model indicates that even when two methods have the same performance with infinite unlabeled data, there is an unlabeled-sample-complexity benefit to using a “clean” distribution of negatives, which matches well with prior work on other contrastive learning algorithms (Chuang et al., 2020).

We supplement this motivating theoretical study with experiments on real-world time-series data. Our results indicate that for the types of data and tasks discussed above, both OCP and PCL representations can enjoy better downstream prediction performance than those trained using existing self-supervised baselines. Moreover, complementing our theoretical results, we show a real-world scenario where OCP outperforms PCL in the low labeled-data regime despite the seemingly minor difference between the two objectives. Given that OCP and PCL only differ slightly in their negative sampling, these results give further theoretical and empirical evidence for the importance of the negative sampling details in contrastive learning, complementing several recent works (Chuang et al., 2020; Robinson et al., 2021; Liu et al., 2021).

2 Order-pretraining algorithm

We suppose each data point is a time series $x = (x_1, \ldots, x_T)$, where $T$ is the number of sample points and may vary across trajectories. We also suppose the samples take values in some common set $\mathcal{X}$. Let a window $w$ be an element of $\{1, \ldots, T\}$, and let $x_w$ be the corresponding element of $x$. (For simplicity, we only consider windows of size 1. Our results straightforwardly generalize to windows of arbitrary size, where a window is a subinterval of $\{1, \ldots, T\}$ and $x_w$ is the corresponding subsequence.)

Given a trajectory $x$, we use the following generative process to sample a data point for our contrastive task. First, a label $y$ is chosen uniformly at random from $\{0, 1\}$. Next, random windows $w_1$ and $w_2$ are chosen (in a manner explained below, depending on $y$). The segments $x_{w_1}$ and $x_{w_2}$ corresponding to the windows are combined into a tuple $z = (x_{w_1}, x_{w_2})$. The pair $(z, y)$ is then a sample for the contrastive task. A model $f = g \circ \phi$, given by a composition of a classifier $g \in \mathcal{G}$ and a representation $\phi \in \Phi$, is trained to predict $y$ from $z$:

$$\min_{g \in \mathcal{G},\, \phi \in \Phi}\ \mathbb{E}\big[\ell\big(g(\phi(x_{w_1}), \phi(x_{w_2})),\, y\big)\big] \qquad (1)$$

Here $f(z) = g(\phi(x_{w_1}), \phi(x_{w_2}))$. That is, $f$ first computes the representation $\phi(x_w)$ for each window, then uses a classifier $g$ to predict whether the tuple is in the correct order. The representation $\phi$ can then be re-used on a downstream task. The remaining design choice is to specify the process for sampling windows.
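As a minimal sketch of how objective (1) can be instantiated (again an illustration with names of our own choosing, not the experimental code), one can fix a simple featurizer $\phi$ and fit a logistic-regression classifier $g$ on the concatenated window representations; in practice $\phi$ is learned jointly with $g$, but the structure of the prediction problem is the same:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def phi(window):
    """Representation of a single window; here simply the raw feature vector."""
    return np.asarray(window, dtype=float)

def fit_order_classifier(pairs):
    """Fit g to predict the order label y from (phi(x_w1), phi(x_w2))."""
    X = np.stack([np.concatenate([phi(a), phi(b)]) for (a, b), _ in pairs])
    y = np.array([label for _, label in pairs])
    g = LogisticRegression(max_iter=1000)
    g.fit(X, y)
    return g

# `pairs` as produced by sample_ocp_pair above, e.g.:
# order_classifier = fit_order_classifier(pairs)
```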

Order-contrastive pre-training.

A simple choice for sampling random windows $w_1, w_2$ is to sample a random consecutive pair $(w, w+1)$ in the correct order when $y = 1$, and in the incorrect order (i.e., as $(w+1, w)$) when $y = 0$. We refer to the optimization problem (1) with this choice of sampling as order-contrastive pre-training (OCP). This can easily be generalized to non-consecutive window pairs. The pretraining task (1) is thus to contrast windows in the correct order with windows in the incorrect order.

Permutation-contrastive learning.

Another simple choice is to again sample a consecutive pair in the correct order when $y = 1$, but to sample a random window pair (in random order) when $y = 0$. This is the data generation process for permutation-contrastive learning (Hyvärinen and Morioka, 2017). Note that the only differences between OCP and PCL are that in PCL, (i) the negative samples ($y = 0$) need not be consecutive, and (ii) some negatives are in the correct order. The distributions of positive samples are identical. We refer to the procedure (1) with this sampling as permutation-contrastive learning (PCL). This exactly matches the contrastive sample distribution in Hyvärinen and Morioka (2017, equations (10)-(11)). Here, the pretraining task is to contrast consecutive windows in the correct order versus random window pairs.
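For contrast, a PCL-style sampler (again a hypothetical sketch) differs from the one above only in how negatives are drawn: a random, possibly non-consecutive and possibly correctly ordered pair from the same trajectory.

```python
import random

def sample_pcl_pair(trajectory, rng=random):
    """PCL sampling: positives are consecutive pairs in the correct order;
    negatives are random window pairs from the same trajectory."""
    assert len(trajectory) >= 2, "need at least two time steps"
    if rng.random() < 0.5:
        # Positive: a consecutive pair in the correct order (same as OCP).
        t = rng.randrange(len(trajectory) - 1)
        return (trajectory[t], trajectory[t + 1]), 1
    # Negative: two distinct random windows in a random order; the pair may be
    # non-consecutive, and may even happen to be in the correct order.
    i, j = rng.sample(range(len(trajectory)), 2)
    return (trajectory[i], trajectory[j]), 0
```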

Comparison.

These two sampling methods seem very similar—they only differ slightly in the distribution of negatives (i.e., conditioned on $y = 0$). However, we show theoretically and empirically in the following sections that they can learn very different representations when used in (1), and they can have different unlabeled sample complexities even if they eventually find the same representation. This gives further evidence of the importance of negative sampling for contrastive learning methods (see, e.g., Chuang et al. (2020)). We give a finite-sample bound for the downstream classification performance of a representation learned using OCP in a simple setup motivated by time series data and predictive tasks in healthcare.

3 Finite-sample guarantee for time-irreversible features

In this section, we study a class of distributions motivated by applications to time-series data in healthcare. We assume for simplicity that each $x_t \in \{0, 1\}^D$, i.e., each time point is a vector of $D$ binary features. We identify a simple set of three assumptions for which we can prove a finite sample guarantee for the set of feature-selector hypotheses.

Assumptions (informal):
  1. There exists a set $S$ of time-irreversible features. Formally, for all $j \in S$ and all $t$, $\Pr[x_{t,j} = 1,\ x_{t+1,j} = 0] = 0$ (once a feature in $S$ turns on, it stays on).

  2. When the features in $S$ are not changing, the other features are time-reversible. More formally, for all $t$ and all $a, b \in \{0,1\}^D$, if $a_S = b_S$, then $\Pr[x_t = a,\ x_{t+1} = b] = \Pr[x_t = b,\ x_{t+1} = a]$.

  3. There are no “redundant” features in $S$.

The precise statements of these assumptions are located in Appendix A (Assumptions A.1-A.3). Intuitively, the first two assumptions guarantee that the feature set $S$ is an optimal choice of representation for the OCP pretraining objective (when $\Phi$ is the class of feature-selector representations), and the third (more technical) assumption guarantees that this optimum is unique.

When these assumptions are satisfied and the features $S$ are the correct features for downstream classification, we prove a finite-sample bound for a model pretrained using OCP. The bound only depends on the VC dimension of the downstream hypothesis class $\mathcal{H}$ (which acts on the selected features), rather than on the complexity of the full composite class $\mathcal{H} \circ \Phi$ over all $D$ input features.

Like some results in the nonlinear ICA literature (e.g., Hyvärinen and Morioka (2017)), our results only apply in the regime where there is enough unlabeled data to identify the “correct” representation. It is then immediate that only $\mathrm{VCdim}(\mathcal{H})$ factors into the labeled-data dependence. However, our model also allows us to give upper bounds on the amount of unlabeled data required to reach that regime. This allows us to more rigorously study other aspects of contrastive pretraining, such as the role of bias in the negative distribution, which has been shown to affect the performance of other contrastive learning algorithms (Chuang et al., 2020).

We now give a simple example of a class of distributions satisfying these assumptions, grounded in our running application of health time-series data. We prove in Theorem A.1 that this example satisfies assumptions A.1-A.3. Despite its simplicity, our findings suggest that this model allows for several interesting phenomena that also occur in practice, which could make it useful for further study of contrastive learning methods on time-series.

3.1 Extraction example

A common task in clinical informatics is to extract, for each time $t$, the patient’s structured disease stage, which enables downstream clinical research (Kehl et al., 2019, 2020). Each time point $x_t$ could be an encoding of the clinical note from a patient’s visit at time $t$. Let $y_t$ be the observed label for time point $t$. The end goal is to train a model $h$ over a representation $\phi$ to minimize the downstream risk:

$$R(h \circ \phi) = \mathbb{E}\big[\ell\big(h(\phi(x_t)),\, y_t\big)\big].$$

Here we make a prediction for every time point, and the expectation is over the time index $t$ as well as the trajectory $x$.

Model.

For each set of feature indices $A \subseteq \{1, \ldots, D\}$, we denote by $x_{t,A}$ the random variable corresponding to the indices in $A$ at time $t$. Suppose the set of feature indices is partitioned into three types of features:

  • A set $S$ of time-irreversible features. We also assume that each $j \in S$ has a nonzero probability of activating on its own, without the other features in $S$; that is, for each $j \in S$ there exists a time $t$ such that, with positive probability, $x_{t,j}$ turns on while the other features in $S$ do not change. This ensures that Assumption A.3 is satisfied. Such features include the onset/progression of chronic conditions and markers of aging (Pierson et al., 2019); for example, the appearance of the word “metastasis” in a clinical note.

  • Noisy versions $N$ of $S$: each $k \in N$ has a parent feature $j \in S$ of which it is a noisy copy; at every time $t$, the distribution of $x_{t,k}$ depends only on the current value of its parent $x_{t,j}$, and $x_{t,k}$ is conditionally independent of the other variables (for all times) given its parent variable. For example, the presence of certain interactions with the health system—such as deciding to attend physical therapy—may be a noisy reflection of the patient’s true disease state, which is captured by $S$.

  • Background, reversible features $B$: features that are time-reversible, i.e., whose joint distribution over consecutive time points is unchanged when the two time points are swapped. Consider, for example, common words such as “and”, “chart”, etc., in a clinical note, whose presence or absence gives no order information.

Note that we do not make any independence assumptions between the features in this example other than the ones mentioned above.
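A minimal generator for this class of distributions might look as follows (our own sketch; the dimensions, activation rate, noise rate, and trajectory length are arbitrary illustrative choices, not values used in the paper):

```python
import numpy as np

def sample_trajectory(T=10, n_driver=4, n_noisy=4, n_background=8, eps=0.2,
                      seed=None):
    """Sample one binary trajectory with the three feature types above.

    Columns 0..n_driver-1                : time-irreversible driver features S
    Columns n_driver..n_driver+n_noisy-1 : noisy copies of the first n_noisy drivers
    Remaining columns                    : i.i.d. background (reversible) features B
    """
    rng = np.random.default_rng(seed)
    x = np.zeros((T, n_driver + n_noisy + n_background), dtype=int)
    driver = np.zeros(n_driver, dtype=int)
    for t in range(T):
        # Each inactive driver turns on independently and then stays on.
        driver = np.maximum(driver, (rng.random(n_driver) < 0.15).astype(int))
        x[t, :n_driver] = driver
        # Noisy copies: flip the parent driver's current value with prob. eps.
        flips = rng.random(n_noisy) < eps
        x[t, n_driver:n_driver + n_noisy] = np.where(
            flips, 1 - driver[:n_noisy], driver[:n_noisy])
        # Background features: i.i.d. over time, hence time-reversible.
        x[t, n_driver + n_noisy:] = (rng.random(n_background) < 0.5).astype(int)
    return x
```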

We prove in Theorem A.1 that this example satisfies Assumptions A.1-A.3. We now provide a simple finite-sample bound for OCP when Assumptions A.1-A.3 are satisfied, the downstream task depends only on the features in $S$, and the representation is a feature selector.

3.2 Finite sample bound

Suppose we observe a large set of $m$ unlabeled data points drawn independently from the marginal distribution of $x$, and a much smaller set of $n$ labeled data points drawn independently from the joint distribution of $x_t$ and the downstream label $y_t$ (we now use $y$ to denote the downstream label rather than the pre-training label).

Let the representation hypothesis class be $\Phi = \{\phi_A : A \subseteq \{1, \ldots, D\}\}$, where $\phi_A$ is the projection onto the indices in $A$. That is, our representation will select features to be used downstream, and we identify each set of indices $A$ with the mapping $\phi_A$ given by projection onto those indices. Let $\mathcal{H}$ be the downstream hypothesis class, with each $h \in \mathcal{H}$ acting on the selected features. (Footnote 2: Assume $\mathcal{H}$ is closed under permutations of the input dimensions. This ensures that we only need to identify the features belonging to $S$, and don’t need to put them in a particular order to do well at downstream prediction. The set of linear hypotheses has this property.) The mapping $h \circ \phi_A$ first selects the input features represented by $A$, then passes the values of these features through $h$.

Our end goal is to design a learning algorithm $\mathcal{A}$ with a downstream excess-risk bound. If $\hat{h} \circ \hat{\phi}$ is the hypothesis output by $\mathcal{A}$, we want an upper bound on the excess risk $R(\hat{h} \circ \hat{\phi}) - \min_{h \in \mathcal{H},\, \phi \in \Phi} R(h \circ \phi)$. If we let $\mathcal{A}$ be empirical risk minimization (ERM) over $\mathcal{H} \circ \Phi$ on the small labeled sample (i.e., the method directly optimizing the downstream objective over $\mathcal{H} \circ \Phi$ without pre-training), (Footnote 3: Assume for simplicity that for each trajectory $x$, a single time $t$ is chosen uniformly at random and $(x_t, y_t)$ is passed to the learner, so the learner sees $n$ i.i.d. samples. A more detailed treatment would handle the dependence between multiple time points to get bounds that improve with the number of time points per trajectory when possible (e.g., Mohri and Rostamizadeh, 2010).) a standard result (e.g., Shalev-Shwartz and Ben-David (2014, Theorem 6.8)) implies:

$$R(\hat{h} \circ \hat{\phi}) - \min_{h,\, \phi} R(h \circ \phi) \le O\left(\sqrt{\frac{\mathrm{VCdim}(\mathcal{H} \circ \Phi) + \log(1/\delta)}{n}}\right) \qquad (2)$$

with high probability over the sampling of the data. On the other hand, let $\mathcal{A}'$ be the algorithm that first uses unlabeled data to pre-train a representation $\hat{\phi}$ by minimizing (1), then minimizes the downstream risk over $\mathcal{H} \circ \hat{\phi}$ (i.e., a 2-phase ERM learner). The following theorem states that under Assumptions A.1-A.3, we can give a more parsimonious upper bound on the excess risk.

Theorem 3.1 (informal).

Assume that Assumptions A.1-A.3 hold and that the Bayes-optimal downstream classifier depends only on the features $x_S$, i.e., that the optimal downstream classifier is a function of the features in $S$. Let $\hat{h} \circ \hat{\phi}$ be the hypothesis output by the pre-train/finetune learner using OCP for pre-training. Then for the unlabeled sample size $m$ large enough, $\hat{h} \circ \hat{\phi}$ satisfies:

$$R(\hat{h} \circ \hat{\phi}) - \min_{h,\, \phi} R(h \circ \phi) \le O\left(\sqrt{\frac{\mathrm{VCdim}(\mathcal{H}) + \log(1/\delta)}{n}}\right) \qquad (3)$$

with high probability over the sampling of the data.

The pretrain + finetune bound (3) has a better dependence on the labeled dataset size $n$ than the downstream ERM learner bound (2). The large unlabeled dataset allows for the learning of a good representation $\hat{\phi}$ without using any labeled data. Even in this simple feature-selection setting, bound (3) may be much tighter than the direct-downstream bound when $\mathcal{H}$ is a fairly complex hypothesis class and $|S| \ll D$. Even for linear $\mathcal{H}$, the complexity of the composite class $\mathcal{H} \circ \Phi$ grows with the ambient dimension $D$ (see, e.g., Abramovich and Grinshtein (2018) for sparse logistic regression), so (3) can even save over (2) in this case. In fact, we show in Section 4 that OCP can improve the performance of sparse linear models in a real-world low-labeled-data setting (compared to direct downstream prediction without pre-training).
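To make the comparison concrete, here is a back-of-the-envelope version of the gap (our own illustration using standard VC arguments; constants and log factors are omitted, and the notation follows the reconstruction above). Direct ERM over $\mathcal{H} \circ \Phi$ implicitly searches over all $\binom{D}{|S|}$ feature subsets, so

$$\mathrm{VCdim}(\mathcal{H} \circ \Phi) \lesssim \mathrm{VCdim}(\mathcal{H}) + |S| \log\frac{D}{|S|},$$

and bound (2) scales as $\sqrt{(\mathrm{VCdim}(\mathcal{H}) + |S|\log(D/|S|))/n}$, whereas after OCP pre-training the feature subset is fixed by the unlabeled data and bound (3) scales as $\sqrt{\mathrm{VCdim}(\mathcal{H})/n}$. When $D$ is large (e.g., a bag-of-words vocabulary) and $|S|$ is small, the $|S|\log(D/|S|)$ term dominates; this is the part of the labeled-sample complexity that the unlabeled data pays for.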

In this section we gave a simple example of a class of distributions, together with an assumption linking the distribution to the downstream task (the time-irreversible features are the most useful ones for downstream classification) for which we can prove that OCP gives a more parsimonious bound on the labeled sample complexity. While the example in Section 3.1 seems straightforward, we show now that it still admits interesting behavior. In particular, there are data distributions that satisfy Assumptions A.1-A.3, but where PCL and OCP learn different representations.

PCL versus OCP: different infinite-data optima.

There are examples of the model from Section 3.1 where PCL and OCP learn provably different representations even with infinite unlabeled samples, despite the minor difference in their sampling schemes. Intuitively, the existence of a periodic feature (such as a procedure always performed at a particular time of day) is strongly predictive of whether two samples are consecutive, but need not be predictive of whether a pair of consecutive samples is in the correct order. Concretely, consider a feature $p$ that alternates between 0 and 1 at consecutive time steps (with a random initial phase), so that $x_{t,p} \neq x_{t+1,p}$ for every $t$. Inclusion of this feature doesn’t violate Assumptions A.1-A.3—indeed, $p$ would qualify as a “background” feature under our model—so Theorem 3.1 guarantees that OCP finds the correct representation. However, in PCL, every non-consecutive sample is a negative, and only non-consecutive pairs can have $x_{w_1,p} = x_{w_2,p}$, so $p$ is helpful for the PCL objective. We treat this example more formally in Appendix A, but our synthetic results in Section 4 also show that a background periodic feature can affect the PCL representation.
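The following small simulation illustrates the point (our own sketch with arbitrary parameters): an alternating background feature carries no information for the OCP task, but is predictive for the PCL task, since only non-consecutive pairs can agree on it.

```python
import random

rng = random.Random(0)
T = 20  # trajectory length

def periodic_feature(phase):
    """A single alternating background feature with a random initial phase."""
    return [(t + phase) % 2 for t in range(T)]

ocp_agree = {0: 0, 1: 0}  # label -> number of pairs whose periodic values agree
pcl_agree = {0: 0, 1: 0}
n_trials = 20000
for _ in range(n_trials):
    traj = periodic_feature(rng.randrange(2))
    t = rng.randrange(T - 1)
    # OCP sample: consecutive pair, flipped with probability 1/2.
    y = rng.randrange(2)
    ocp_agree[y] += traj[t] == traj[t + 1]
    # PCL sample: positive = consecutive pair, negative = random pair.
    if rng.randrange(2):
        pcl_agree[1] += traj[t] == traj[t + 1]
    else:
        i, j = rng.sample(range(T), 2)
        pcl_agree[0] += traj[i] == traj[j]

# The periodic values of consecutive windows always disagree, so ocp_agree is
# zero for both labels: agreement carries no information about the order.
# Under PCL, positives never agree while roughly half of the negatives do,
# so the periodic feature helps separate positives from negatives.
print(ocp_agree, pcl_agree)
```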

“Debiased” negatives.

PCL has some negatives that are actually in the correct order. Prior work on contrastive learning has called this “bias” in the negative distribution (Chuang et al., 2020). What is the role of this “bias”? Does it affect the learned representations? Does it affect the amount of unlabeled data required to find a good representation? For distributions satisfying Assumptions A.1-A.3, and when $\Phi$ is the class of feature selectors, we answer the last two questions in the negative and in the positive, respectively.

In particular, consider the analogue of OCP that, when $y = 0$ ($y$ here is the pre-training label used in OCP, not the downstream label), instead of always presenting the sampled consecutive pair in the incorrect order, simply presents a random consecutive pair in a random order. We refer to this as OCP-biased, since some of the negatives are actually in the correct order. However, the following theorem shows that the estimator obtained by minimizing this objective is not biased in a statistical sense:

Theorem 3.2 (informal).

When Assumptions A.1-A.3 are satisfied, $S$ is also the unique optimal representation for OCP-biased.

However, the bias does affect the bound on the unlabeled sample complexity required to obtain a good representation:

Proposition 1 (informal).

The upper bound on the unlabeled sample complexity required for OCP-biased to identify $S$ is worse than the corresponding upper bound for OCP.

While this proposition only compares upper bounds, our synthetic experiments (see Figure 2) indicate that OCP is more sample-efficient than OCP-biased. We prove these results in Appendix A.

4 Experiments

4.1 Synthetic data

Figure 2: Synthetic experiments admitting different behavior across pre-training setups. (Top): PCL is never able to recover all four true features in $S$; (Bottom): PCL recovers the true representation, but requires a higher sample complexity than OCP.

We demonstrate the importance of negative sampling on two synthetic datasets from the model in Section 3. Each distribution contains the time-irreversible features $S$ alongside a number of noisy features. We generate pre-training datasets of different sizes (50 to 16,000) and sample pairs from each dataset according to OCP, PCL, and OCP-biased. We then conduct an L0 regression over the sample pairs and analyze how many variables in $S$ were correctly recovered. The top panel of Figure 2 shows a distribution where PCL does not recover $S$ even in the infinite-data limit—this distribution includes a periodic background feature that is selected by PCL. The bottom panel of Figure 2 shows a distribution where PCL is able to recover all of $S$, but requires a larger sample complexity than OCP. In both cases, OCP and OCP-biased find the same representation, but the former has a better dependence on the amount of unlabeled data. We provide the details and explanations for these experiments in Appendix B.
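A sketch of this recovery check, reusing the samplers and generator above, is given below (our own approximation: we substitute an L1 penalty for the L0 regression used in the experiments, and represent a pair by the difference of its two windows; these are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def recovered_driver_features(pairs, n_driver, C=0.1):
    """Fit a sparse linear order classifier on (x_w1 - x_w2) and return which
    of the first n_driver (time-irreversible) features get nonzero weight."""
    X = np.stack([np.asarray(a, float) - np.asarray(b, float)
                  for (a, b), _ in pairs])
    y = np.array([label for _, label in pairs])
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, y)
    support = np.flatnonzero(np.abs(clf.coef_[0]) > 1e-6)
    return set(support) & set(range(n_driver))

# Compare samplers at a fixed unlabeled-data budget, e.g.:
# trajectories = [sample_trajectory(seed=s) for s in range(2000)]
# ocp_pairs = [sample_ocp_pair(x.tolist()) for x in trajectories]
# pcl_pairs = [sample_pcl_pair(x.tolist()) for x in trajectories]
# print(len(recovered_driver_features(ocp_pairs, 4)),
#       len(recovered_driver_features(pcl_pairs, 4)))
```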

4.2 Real-world data

We show OCP yields significant improvements in the low-label regime on extraction from clinical notes.

Progression dataset.

We utilize a dataset of fully de-identified clinical notes from a single institution. This research was reviewed by our institution’s Committee on the Use of Humans as Experimental Subjects and determined to be IRB-exempt. The dataset contains data for 82,839 patients with cancer, with a median of 12 radiology notes each. Each radiology note focuses on one body area (e.g., chest CT scan). In addition, we have a subset of 135 patients with progressive lung cancer with 1095 labeled radiology notes. Each note was labeled post-hoc by an oncologist as ‘indicating progression’ (19%), ‘not indicating progression’ (79.5%), or ‘ambiguous’ (1.5%).

Fraction of training data
Available features 1 1/2 1/4 1/8 1/16
OCP subset 0.864 0.860 0.847 0.808 0.786
All features 0.856 0.851 0.818 0.723 0.726
Most common 0.767 0.767 0.728 0.687 0.658
Random subset 0.740 0.747 0.727 0.639 0.634
(a) Mean note-level AUC of regularized logistic regression over different dataset sizes. Averaged over the 5 folds, performance was optimal for each dataset size when restricted to the features with nonzero coefficients recovered by OCP.

(b) Example features that OCP selected (top) or excluded (bottom) for downstream prediction.
Figure 3: Linear representation space experiment to validate assumptions of our model apply to real-world data. Quantitatively, we find downstream wins from restricting the model feature space to those found useful for the order-contrastive task. Qualitatively, the features important for the order pre-training are the same we would expect to be useful for the downstream extraction task.
Experimental setup.

We investigate extraction of these binary progression labels from the Impression section of the note. The labeled data was split via 5-fold cross-validation: each fold contained sets of sizes 64% (train), 16% (validation), and 20% (test); for a given fold, no patient’s examples were ever split between sets. On each fold, we used the test set to benchmark models trained using different amounts of the labeled training data: from just 5 training patients to all of the training patients. We excluded patients with downstream labels from pretraining. For the contrastive pre-training schemes, a pretraining window pair was sampled once per unique body area (e.g., chest, brain) that was scanned at least twice, capped at five locations per patient. This resulted in 158,000 samples for pretraining.

Pre-training for feature selection.

We first leverage a linear model to evaluate our modeling assumptions from Section 3. We compare downstream progression extraction performance of (i) a vanilla logistic regression model and (ii) a logistic regression model only using the features selected by OCP. We test on all five folds for five training dataset fractions.

For each experiment, our dataset is featurized using the unigrams and bigrams that occur in at least 5% of the labeled training set. These are vectorized using the term frequency-inverse document frequency (TF-IDF) weighting scheme, via scikit-learn (Pedregosa et al., 2011, BSD 3-clause license). We conduct feature selection as an optional intermediate step preceding progression extraction. For OCP, we train an L1-regularized model over the 158,000 pre-training pairs of consecutive radiology notes. The regularization constant was set so that only a small number of features received nonzero weights. In addition to the OCP-derived features, we select the 50 most common features, and 5 random subsets of 50 features, to serve as comparisons.
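A minimal version of this feature-selection pipeline could look like the sketch below (our own illustration; the pair representation as a difference of TF-IDF vectors, the vectorizer settings, and the regularization strength are assumptions rather than the exact configuration used in our experiments):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def ocp_feature_selection(note_pairs, train_texts, min_df=0.05, C=0.05):
    """Select features via an L1-regularized order classifier.

    note_pairs : list of ((first_text, second_text), order_label) pairs
                 produced by OCP sampling over consecutive radiology notes.
    train_texts: labeled training texts used to fit the vocabulary.
    """
    vec = TfidfVectorizer(ngram_range=(1, 2), min_df=min_df)
    vec.fit(train_texts)
    # Represent each pair by the difference of its TF-IDF vectors.
    X = np.vstack([(vec.transform([a]) - vec.transform([b])).toarray()
                   for (a, b), _ in note_pairs])
    y = np.array([label for _, label in note_pairs])
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, y)
    selected = np.flatnonzero(np.abs(clf.coef_[0]) > 1e-6)
    vocab = vec.get_feature_names_out()
    return [vocab[i] for i in selected]
```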

We train scikit-learn logistic regression models for downstream progression extraction over each feature set; further details are in Appendix C. Results can be seen in Figure 3a. Even with a simple bag-of-words representation, feature selection with OCP outperforms directly training a tuned logistic regression model on the available labeled data (“All features”), especially for small dataset sizes. A paired t-test finds that the model with OCP-selected features is significantly better than the direct-downstream model on a sixteenth of the data. Note that selecting the most common features or a random subset of features does not perform comparably, showing that OCP does not improve performance by simply reducing the feature dimension in a redundant space.

We manually examined the OCP-selected features and their coefficients (Figure 3b). The selected features (e.g., “increased”, “decreased”) strongly indicate disease progression, while those discarded (e.g., “discussed”) largely seem to be noise. Of the nonzero coefficients, 76% have a positive weight; this indicates that the pre-training model focuses mostly on features that have been turned on to conduct the ordering task, fitting with our motivating theoretical setting.

Fraction of training data
AUC diff. (OCP win %)    1            1/2          1/4          1/8          1/16
OCP AUC                  0.87 ± .03   0.86 ± .04   0.84 ± .04   0.82 ± .03   0.81 ± .03
OCP − BERT               0.08 (93%)   0.12 (100%)  0.12 (100%)  0.18 (100%)  0.22 (100%)
OCP − FT LM              0.03 (80%)   0.04 (82%)   0.04 (82%)   0.08 (93%)   0.10 (89%)
OCP − Pt-Contrastive     0.03 (86%)   0.03 (77%)   0.05 (91%)   0.09 (91%)   0.12 (97%)
OCP − PCL                0.00 (53%)   0.00 (46%)   0.03 (64%)   0.03 (76%)   0.06 (87%)
Table 1: Performance of deep methods on cancer progression extraction. The first row contains the mean AUC of OCP ± its std. dev. The following rows contain the mean AUC advantage of OCP over each comparison method, and the percentage of the time OCP outperforms that method, across the 3 seeds and 5 folds.
Nonlinear representations.

We now study the use of OCP for pre-training nonlinear representations. We compare performance of a BERT model pre-trained using OCP to several other self-supervision methods. We investigate the BERT base model and the BERT base model after it is pre-trained using: (i) FT LM: fine-tuned masked language modeling over an equivalent number of impressions, (ii) Pt-Contrastive: a patient-level contrastive objective (identical positive sampling to OCP and PCL, but each negative is a random note of the same note type from a different patient, similar to Diamant et al. (2021)), (iii) PCL: contrastive pre-training with PCL sampling (each negative is a random pair of notes of the same type from the same patient), (iv) OCP: contrastive pre-training with OCP sampling (each negative is a pair of notes of the same type in the incorrect order). All pre-training is conducted over three seeds, and all three contrastive objectives were trained with the same number of pairs (158,000). Implementation for language modeling and contrastive pre-training came from Wolf et al. (2020, Apache-2.0 License) with full details in Appendix C. After model pre-training/fine-tuning, the self-supervised representation layers were frozen, and a single L2-regularized linear layer was added on top. The goal of freezing was to isolate the effect of pre-training to understand representation quality, due to the instability of training BERT on small downstream tasks (Zhang et al., 2020).
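The frozen-representation evaluation can be sketched as follows (our own illustration using the Hugging Face transformers API; the checkpoint name, [CLS] pooling, and variable names are assumptions, not the exact setup used in our experiments):

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")  # or an OCP-pretrained checkpoint
encoder.eval()

@torch.no_grad()
def embed(texts, batch_size=16):
    """Frozen [CLS] embeddings for a list of note impressions."""
    reps = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True,
                          truncation=True, max_length=512, return_tensors="pt")
        out = encoder(**batch)
        reps.append(out.last_hidden_state[:, 0, :].cpu().numpy())  # [CLS] token
    return np.concatenate(reps)

# Downstream: a single L2-regularized linear layer on the frozen representation,
# e.g. (train_texts, train_labels, test_texts are hypothetical variables):
# head = LogisticRegression(penalty="l2", max_iter=1000)
# head.fit(embed(train_texts), train_labels)
# test_scores = head.predict_proba(embed(test_texts))[:, 1]
```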

Results can be seen in Table 1. The top row shows that OCP has only a modest drop in performance even when trained on the data from just 5 patients. Since correlations exist in AUC across the 5 folds, 3 seeds, and 5 dataset sizes, standard statistical comparison testing is inappropriate. Instead, we present the mean increase in AUC from OCP, as well as the percentage of the time OCP outperformed the comparisons. Unsurprisingly, BERT alone (trained on non-clinical text) does not perform well out-of-the-box; fine-tuning with language modeling improves performance, but still suffers in the low-data regime. Among the contrastive objectives, the cross-patient objective is the weakest, which may follow from its pre-training task being the easiest (82% accuracy on validation): it could rely on features that differed between patients, instead of being forced to focus on the temporal features that differed within a patient’s timeline. PCL performs equivalently to OCP at large dataset sizes, but at the smaller dataset sizes, it loses to OCP a large majority of the time.

5 Related work

Order pre-training.

Others have found order-based self-supervision useful for more complex time-series data, but without theoretical study. For example, learning the order of frames within a video yields representations useful for downstream activity classification (Fernando et al., 2015; Misra et al., 2016; Lee et al., 2017; Wei et al., 2018). Most similar to our work, Hyvärinen and Morioka (2017) introduced permutation-contrastive learning (PCL) and proved nonlinear identifiability for representations learned using PCL in an ICA setting. That is, they gave distributional conditions where PCL provably recovers the “correct” nonlinear representation of the input given infinite unlabeled data. Our theoretical and empirical results indicate that there can be nontrivial differences between OCP and PCL’s downstream performance. Deeper understanding of what data distributions and downstream tasks are “right” for PCL versus OCP (and for other contrastive sampling methods) is an interesting direction for future theoretical study.

Pre-training for medical time-series.

Several other pre-training objectives have been explored on clinical time-series data. A contrastive learning setup similar to our patient-contrastive baseline has shown promising results on electrocardiogram signals (Diamant et al., 2021; Kiyasseh et al., 2020). Banville et al. (2021) studied a contrastive objective for electroencephalography signals, in which windows of a signal are judged to be similar if they occur within a certain time gap, and dissimilar if they are far apart in time. Intuitively, this objective is well-suited to data where the true representation “changes slowly” with time, as with Franceschi et al. (2019) (discussed in Section 1). Other objectives include auto-encoding (Fox et al., 2019) and masked prediction over text and tabular data, with mixed results (Steinberg et al., 2020; Huang et al., 2019; Yoon et al., 2020; McDermott et al., 2021). Multi-task pre-training provides improvements, but unlike our work, it relies on additional labeled data from closely related downstream tasks (McDermott et al., 2021).

Self-supervision theory.

Like our work, Arora et al. (2019); Liu et al. (2021) and Tosh et al. (2020, 2021) give downstream finite-sample error bounds for representations learned using particular contrastive learning objectives. Our motivating theoretical setting and proof techniques are simpler than the ones considered in these works, but we show that our setup in Section 3 is (i) complex enough to allow for some of the same nontrivial behavior observed for contrastive methods in practice (Sections 3.2, 4.1) and (ii) of practical relevance (Section 4.2).

6 Limitations and Conclusion

We have shown both theoretically and empirically that order-contrastive pre-training is an effective self-supervised method for certain types of time-series data and downstream tasks. On real-world longitudinal health data, we find that representations from OCP significantly outperform others, including the similar PCL, in the small data regime. However, OCP is not always suitable. For example, cases of temporal leakage (e.g., the date in a note) can lead to weak OCP (and PCL) representations downstream, since they provide a shortcut during pre-training. While dates are straightforward to censor, more complex global nonstationarities irrelevant to downstream tasks would present a challenge to these methods.

Our theoretical setup and results in Section 3 also serve to highlight that contrastive pre-training methods can be very sensitive to the precise sampling details, and provide a simple model for studying these details that is still complex enough to capture some empirical phenomena. This suggests that obtaining broader theoretical guidelines for selecting a contrastive distribution is an interesting direction for future work.

References

  • F. Abramovich and V. Grinshtein (2018) High-dimensional classification by sparse logistic regression. IEEE Transactions on Information Theory 65 (5), pp. 3068–3079. Cited by: §3.2.
  • S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and N. Saunshi (2019) A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229. Cited by: §5.
  • H. Banville, O. Chehab, A. Hyvärinen, D. Engemann, and A. Gramfort (2021) Uncovering the structure of clinical eeg signals with self-supervised learning. Journal of Neural Engineering 18 (4), pp. 046020. Cited by: §5.
  • J. Bleackley and S. Y. R. Kim (2013) The merit and agony of retrospective chart reviews: a medical student’s perspective. British Columbia Medical Journal 55, pp. 374–375. Cited by: §1.
  • C. Chuang, J. Robinson, Y. Lin, A. Torralba, and S. Jegelka (2020) Debiased contrastive learning. In NeurIPS, Cited by: Appendix A, §1, §1, §2, §3.2, §3.
  • N. Diamant, E. Reinertsen, S. Song, A. Aguirre, C. Stultz, and P. Batra (2021) Patient contrastive learning: a performant, expressive, and practical approach to ecg modeling. arXiv preprint arXiv:2104.04569. Cited by: §4.2, §5.
  • B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars (2015) Modeling video evolution for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5378–5387. Cited by: §5.
  • I. Fox, H. Rubin-Falcone, and J. Wiens (2019) Learning through limited self-supervision: improving time-series classification without additional data via auxiliary tasks. Cited by: §5.
  • J. Franceschi, A. Dieuleveut, and M. Jaggi (2019) Unsupervised scalable representation learning for multivariate time series. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §1, §5.
  • R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun (2015) Unsupervised learning of spatiotemporally coherent metrics. In Proceedings of the IEEE international conference on computer vision, pp. 4086–4093. Cited by: §1.
  • K. Huang, J. Altosaar, and R. Ranganath (2019) Clinicalbert: modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342. Cited by: §5.
  • A. Hyvärinen and H. Morioka (2017) Nonlinear ica of temporally dependent stationary sources. In Artificial Intelligence and Statistics, pp. 460–469. Cited by: §1, §2, §3, §5.
  • K. L. Kehl, H. Elmarakeby, M. Nishino, E. M. Van Allen, E. M. Lepisto, M. J. Hassett, B. E. Johnson, and D. Schrag (2019) Assessment of deep natural language processing in ascertaining oncologic outcomes from radiology reports. JAMA Oncology 5 (10), pp. 1421–1429. Cited by: §3.1.
  • K. L. Kehl, W. Xu, E. Lepisto, H. Elmarakeby, M. J. Hassett, E. M. Van Allen, B. E. Johnson, and D. Schrag (2020) Natural language processing to ascertain cancer outcomes from medical oncologist notes. JCO Clinical Cancer Informatics 4, pp. 680–690. Cited by: §3.1.
  • D. Kiyasseh, T. Zhu, and D. A. Clifton (2020) Clocs: contrastive learning of cardiac signals. arXiv preprint arXiv:2005.13249. Cited by: §5.
  • H. Lee, J. Huang, M. Singh, and M. Yang (2017) Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 667–676. Cited by: §5.
  • B. Liu, P. Ravikumar, and A. Risteski (2021) Contrastive learning of strong-mixing continuous-time stochastic processes. In International Conference on Artificial Intelligence and Statistics, pp. 3151–3159. Cited by: §1, §5.
  • M. McDermott, B. Nestor, E. Kim, W. Zhang, A. Goldenberg, P. Szolovits, and M. Ghassemi (2021) A comprehensive EHR timeseries pre-training benchmark. In Proceedings of the Conference on Health, Inference, and Learning, CHIL ’21, New York, NY, USA, pp. 257–278. Cited by: §5.
  • I. Misra, C. L. Zitnick, and M. Hebert (2016) Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pp. 527–544. Cited by: §5.
  • H. Mobahi, R. Collobert, and J. Weston (2009) Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 737–744. Cited by: §1.
  • M. Mohri, A. Rostamizadeh, and A. Talwalkar (2018) Foundations of machine learning. MIT press. Cited by: §A.2.
  • M. Mohri and A. Rostamizadeh (2010) Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research 11 (2). Cited by: footnote 3.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §C.1, §4.2.
  • E. Pierson, P. W. Koh, T. Hashimoto, D. Koller, J. Leskovec, N. Eriksson, and P. Liang (2019) Inferring multidimensional rates of aging from cross-sectional data. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 97–107. Cited by: 1st item.
  • J. Robinson, C. Chuang, S. Sra, and S. Jegelka (2021) Contrastive learning with hard negative samples. ICLR. Cited by: §1.
  • S. Shalev-Shwartz and S. Ben-David (2014) Understanding machine learning: from theory to algorithms. Cambridge university press. Cited by: §A.2, §3.2.
  • E. Steinberg, K. Jung, J. A. Fries, C. K. Corbin, S. R. Pfohl, and N. H. Shah (2020) Language models are an effective patient representation learning technique for electronic health record data. arXiv preprint arXiv:2001.05295. Cited by: §5.
  • C. Tosh, A. Krishnamurthy, and D. Hsu (2020) Contrastive estimation reveals topic posterior information to linear models. arXiv preprint arXiv:2003.02234. Cited by: §5.
  • C. Tosh, A. Krishnamurthy, and D. Hsu (2021) Contrastive learning, multi-view redundancy, and linear models. In Algorithmic Learning Theory, pp. 1179–1206. Cited by: §5.
  • D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman (2018) Learning and using the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8052–8060. Cited by: §5.
  • L. Wiskott and T. J. Sejnowski (2002) Slow feature analysis: unsupervised learning of invariances. Neural computation 14 (4), pp. 715–770. Cited by: §1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. Cited by: §C.2, §4.2.
  • F. Xia and M. Yetisgen-Yildiz (2012) Clinical corpus annotation: challenges and strategies. In Proceedings of the Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM’2012) in conjunction with the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, Cited by: §1.
  • J. Yoon, Y. Zhang, J. Jordon, and M. van der Schaar (2020) VIME: extending the success of self- and semi-supervised learning to tabular domain. Advances in Neural Information Processing Systems 33. Cited by: §5.
  • T. Zhang, F. Wu, A. Katiyar, K. Q. Weinberger, and Y. Artzi (2020) Revisiting few-sample bert fine-tuning. arXiv preprint arXiv:2006.05987. Cited by: §4.2.

Appendix A Theory details

In this section, we provide proofs for the simple model introduced in Section 3. In particular, we prove the finite-sample bound for OCP by first proving that the driver features $S$ are the unique optimal representation, and then applying standard uniform convergence arguments to obtain the bound (3).

We also show how so-called bias (Chuang et al., 2020) in the distribution of negative samples can affect OCP. In particular, suppose that instead of choosing consecutive windows in the correct order as positives and consecutive windows in the incorrect order as negatives, we chose correctly-ordered consecutive pairs as positives and random consecutive pairs (in random order) as negatives. This is analogous to PCL’s negative sampling, where some of the negatives are still in the correct order; the difference with PCL is just that all negatives are still consecutive. For the model in Section 3, we prove that (i) in the infinite-data limit, this “biased” version recovers the same representation $S$ (so, in fact, the estimator obtained by minimizing this objective is not biased in a statistical sense) and (ii) the bound on the unlabeled sample complexity required to find a good representation is worse for this biased version of OCP than for the unbiased version. This gives more theoretical evidence for the value of de-biasing the negative distribution in contrastive learning, where possible: even if it doesn’t change the representation learned with infinite unlabeled data, de-biasing the contrastive distribution can improve unlabeled sample complexity.

a.1 Assumptions and example class of distributions

Here we provide simple assumptions under which a set of features $S$ is the unique optimal solution to (1) for the model from Section 3. For simplicity, in this section we only consider consecutive windows and the case where $T$ (the sequence length) is the same for all sequences, but all of our results generalize (with suitable modifications to these assumptions) to the case where the positive and negative distributions over windows are symmetric up to ordering of the elements (correct versus incorrect), and to the case with a distribution over $T$. (Footnote 4: For example, if the distribution of any window pair only depends on the fact that the trajectory is still active, and not on the actual length, then the assumptions remain the same.)

Assumption A.1.

For all $j \in S$ and all $t$, $\Pr[x_{t,j} = 1,\ x_{t+1,j} = 0] = 0$.

Assumption A.2.

For all $a, b \in \{0,1\}^D$ with $a_S = b_S$, and for all $t$, $\Pr[x_t = a,\ x_{t+1} = b] = \Pr[x_t = b,\ x_{t+1} = a]$.

Assumption A.3.

For every index set $A$ with $S \setminus A \neq \emptyset$, there exist a time $t$ and a pair of values for $(x_t, x_{t+1})$ such that a correctly ordered window pair and an incorrectly ordered window pair that agree on the coordinates in $A$ both occur with positive probability; that is, a representation that drops a feature of $S$ cannot separate all positive samples from all negative samples.

Now we prove that the example from Section 3 satisfies these assumptions.

Theorem A.1.

Partition the indices $\{1, \ldots, D\}$ into three sets $S$, $N$, $B$ satisfying the following assumptions:

  • Time-irreversible features $S$ satisfying Assumption A.1. We also assume that each $j \in S$ has a nonzero probability of activating on its own, without the other features in $S$; that is, for each $j \in S$ there exists a time $t$ such that, with positive probability, $x_{t,j}$ turns on while the other features in $S$ do not change.

  • Noisy versions $N$ of $S$: each $k \in N$ has a driver $j \in S$ of which it is a noisy copy; at every time $t$, the distribution of $x_{t,k}$ depends only on the current value of its driver $x_{t,j}$, and $x_{t,k}$ is conditionally independent of the other variables (for all times) given its driver.

  • Background, reversible features $B$: features that are time-reversible, i.e., whose joint distribution over consecutive time points is unchanged when the two time points are swapped.

This class of distributions satisfies Assumptions A.1-A.3.

Proof.

Assumption A.1 is satisfied by definition. For Assumption A.2, fix with and . Then we have:

Using the conditional independence assumption for the features in , the right-hand-side factors to:

Because , trivially this is equal to:

The reversibility of the features in (and summing over the full joint distribution) imply that the above is equal to:

Finally, note that because , and are identically distributed by the definition of features in , so:

Combining with the previous equation and simplifying, we obtain

which is Assumption A.2.

Finally, for Assumption A.3, fix with , and fix some . We assumed that each has some probability of activating on its own, so there exists with . In particular, this implies that there exist with , , and . By summing over variables in this joint distribution, that implies , so Now we need to reverse and for the indices to account for the other term in the of Assumption A.3. For each , we can take while maintaining and , since we assumed that all values of . We know that , so we will try to rearrange to make use of this fact by attempting to switch for .

We have:

because . Let be the set of indices in corresponding to noisy indicators of feature . Because we chose so that , all features in are identically distributed at times and . Therefore, we can switch and for some of the indices in :

By the definition of the features in , we can also switch and for . The only remaining terms to switch are and . Because we took for all without loss of generality, and , , we have:

Hence, we can finally combine:

So we have shown for these that:

But we know the RHS is positive because had . Therefore,

So we have shown that both and are strictly positive, which gives Assumption A.3. ∎

a.2 Optimal representations and finite-sample guarantee

This first lemma shows that the feature set is an optimal representation for the OCP objective.

Lemma A.1.

Let $g^*$ be the Bayes-optimal classifier for the OCP objective. There exists a classifier $g_S$, depending only on the $S$-coordinates of $x_{w_1}$ and $x_{w_2}$, achieving the same OCP error as $g^*$.

Proof.

First we compute the error of the Bayes-optimal classifier. For two windows $x_{w_1}$ and $x_{w_2}$, write $x_{w_1} \preceq_S x_{w_2}$ if the set of $S$-features active in $x_{w_1}$ is contained in the set of $S$-features active in $x_{w_2}$. Assumption A.1 implies the following: given a presented pair of windows, if the set of $S$-variables active in the first window is properly contained in the set active in the second, the pair must be in the correct order. Likewise, if the active $S$-variables in the second window are properly contained in those of the first, the pair must be in the wrong order. Assumption A.1 also implies that the only possible cases are proper containment in either direction and equality of the active $S$-variables. For the last case, we have:

Assumption A.2 implies that the two terms in the denominator are equal when