## 1. Introduction

In *rich-data* environments with strong observation capabilities, data often come with rich representations that encompass multiple channels of features.
For example, multiple leads of electrocardiogram (ECG) signals, which are used for diagnosing heart diseases, are measured in hospital intensive care units (ICU), and each lead is considered a feature channel. The availability of such rich-data environments has thus sparked strong interest in applying deep learning (DL) to predictive health analytics, as DL models built on data with multi-channel features have demonstrated promising results in healthcare (Xiao et al., 2018). However, in many practical scenarios, such rich data are private and not accessible due to privacy concerns. Thus, we often have to develop DL models on lower-quality data comprising fewer feature channels, which were collected from *poor-data* environments with limited observation capabilities (e.g., home monitoring devices that provide only a single feature channel). Inevitably, the performance of state-of-the-art DL models, which are fueled by the abundance and richness of data, becomes much less impressive in such poor-data environments (Salehinejad et al., 2018).

To alleviate this issue, we hypothesize that learning patterns consolidated by DL models trained in one environment often encode information that can be transferred to related environments. For example, a heart disease detection model trained on *rich-data* from 12 ECG channels in a hospital will likely carry pertinent information that can help improve a similar model trained on *poor-data* from a single ECG channel collected by a wearable device, due to the correlation between their data. Motivated by this intuition, we further postulate that, given access to a prior model trained on *rich-data*, the performance of a DL model built on related *poor-data* can be improved if we can extract transferable information from the *rich* model and infuse it into the *poor* model. This is related to deep transfer learning and knowledge distillation but with a new setup that has not been addressed before, as elaborated in Section 2 below.

In this work, we propose a knowledge infusion framework, named CHEER, to address the aforementioned challenges. In particular, CHEER aims to effectively transfer domain-invariant knowledge consolidated from a *rich* model with high-quality data demand to a *poor* model with low data demand and model complexity, which is more suitable for deployment in *poor-data* settings. We also demonstrate empirically that CHEER helps bridge the performance gap between DL models applied in *rich*- and *poor*-data settings. Specifically, we have made the following key contributions:

1. We develop a transferable representation that summarizes the *rich model* and then infuses the summarized knowledge effectively into the *poor model* (Section 3.2). The representation can be applied to a wide range of existing DL models.

2. We perform theoretical analysis to demonstrate the efficiency of the knowledge infusion mechanism of CHEER. Our theoretical results show that under practical learning configurations and mild assumptions, the *poor model*’s prediction will agree with that of the *rich model* with high probability (Section 4).

3. Finally, we also conduct extensive empirical studies to demonstrate the efficiency of CHEER on several healthcare datasets. Our results show that CHEER outperformed the second best approach (knowledge distillation) and the baseline without knowledge infusion by % and %, respectively, in terms of macro-F1 score and demonstrated more robust performance (Section 5).

## 2. Related Works

Deep Transfer Learning:
Most existing deep transfer learning methods transfer knowledge across domains while assuming the target and source models have equivalent modeling and/or data representation capacities. For example, deep domain adaptation has focused mainly on learning domain-invariant representations between very specific domains (e.g., image data) (Glorot et al., 2011; Chen et al., 2012; Ganin et al., 2016; Zhou et al., 2016; Bousmalis et al., 2016; Long et al., 2015; Huang et al., 2018; Rozantsev et al., 2019; Xu et al., 2018). Furthermore, this can only be achieved by training both models jointly on source and target domain data.

More recently, another type of deep transfer learning (Zagoruyko and Komodakis, 2017) has been developed to transfer only the attention mechanism (Bahdanau et al., 2015) from a complex to a shallow neural network to boost its performance. Both source and target models, however, need to be jointly trained on the same dataset. In our setting, source and target datasets are not available at the same time, and the target model often has to adopt representations with significantly less modeling capacity to be compatible with the *poor-data* domain with weak observation capabilities.

Knowledge Distillation: Knowledge distillation (Hinton et al., 2015) or mimic learning (Ba and Caruana, 2014) aims to transfer the predictive power from a high-capacity but expensive DL model to a simpler model, such as a shallow neural network, for ease of deployment (Sau and Balasubramanian, 2016; Radosavovic et al., 2017; Yim et al., 2017; Lopez-Paz et al., 2015). This can usually be achieved by training simple models on soft labels learned from high-capacity models, which, however, assumes that both models operate on the same domain and have access to the same data or at least datasets of similar quality. In our setting, we only have access to low-quality data with *poor* feature representation, and an additional set of limited *paired* data that include both *rich* and *poor* representations (e.g., high-quality ICU data and lower-quality health-monitoring information from personal devices) of the same object.

Domain Adaptation: There also exists another body of non-deep-learning transfer paradigms that are often referred to as domain adaptation. These, however, often include methods that not only assume access to domain-specific (Wei et al., 2019; Wang et al., 2019; Luo et al., 2018; Peng et al., 2018; Tang et al., 2018) and/or model-specific knowledge of the domains being adapted (Pan et al., 2009, 2011; Yao et al., 2019; Jiang et al., 2019; Li et al., 2018; Luo et al., 2019; Segev et al., 2017; Xu et al., 2017), but are also not applicable to deep learning models (Ghifary et al., 2017; Wu et al., 2017) with arbitrary architecture as addressed in our work.

In particular, our method does not impose any specific assumption on the data domain and the deep learning model of interest. We recognize that our method is only demonstrated on deep models (with arbitrary architecture) in this research, but our formulation can be straightforwardly extended to non-deep models as well. We omit such detail in the current manuscript to keep the focus on deep models, which are of greater interest in the healthcare context due to their expressive representation in modeling multi-channel data.

## 3. The CHEER Method

| Notation | Definition |
|---|---|
| | rich data; rich model |
| | poor data; poor model |
| | paired data |
| | domain-specific features |
| | feature scoring functions |
| | feature aggregation component |

Table 1. Summary of notations.

### 3.1. Data and Problem Definition

Rich and Poor Datasets. Let and denote the *rich* and *poor* datasets, respectively. The subscript indexes the -th data point (e.g., the -th patient in healthcare applications), which contains input feature vector or and output target or of the rich or poor datasets. The *rich* and *poor* input features and are - and -dimensional vectors with , respectively. The output targets, and , are categorical variables. The input features of these datasets (i.e., and ) are non-overlapping as they are assumed to be collected from different channels of data (i.e., different data modalities). In the remainder of this paper, we will use data channel and data modality interchangeably.

For example, the rich data can be the physiological data from ICU (e.g., vital signs, continuous blood pressure and electrocardiography) or temporal event sequences such as electronic health records with discrete medical codes, while the poor data are collected from personal wearable devices. The target can be the mortality status of those patients, the onset of heart diseases, etc. Note that these raw data are not necessarily plain feature vectors. They can be arbitrary rich features such as time series, images and text data. We will present one detailed implementation using time series data in Section 3.3.

Input Features. We (implicitly) assume that the raw data of interest comprises (say, or ) multiple sensory channels, each of which can be represented by or embedded into a particular feature signal (i.e., one feature per channel). We embed these channels jointly rather than separately to capture their latent correlation. This results in an embedded feature vector of size or (per data point), respectively. In a different practice, a single channel may be encoded by multiple latent features and our method will still be applicable. In this paper, however, we will assume one embedded feature per channel to remain close to the standard setting of our healthcare scenario, which is detailed below.

Paired Dataset. To leverage both rich and poor datasets, we need a small amount of paired data to learn the relationships between them, which is denoted as . Note that the paired dataset contains both *rich* and *poor* input features, i.e. and , of the same subjects (hence, sharing the same target ).

Concretely, this means a concatenated input of the paired dataset has features where the first features are collected from rich channels (with highly accurate observation capability) while the remaining features are collected from poor channels (with significantly noisier observations). We note that our method and analysis also apply to settings where . In such cases, and (though the number of data points for which is accessible as paired data is much less than the number of those with accessible ). To avoid confusion, however, we will proceed with the implicit assumption that there is no feature overlap between poor and rich datasets in the remainder of this paper.

For example, the paired dataset may comprise rich data from ICU () and poor data from wearable sensors (), which are extracted from the same patient . The paired dataset often contains far fewer data points (i.e., patients) than the rich and poor datasets themselves, and cannot be used alone to train a high-quality prediction model.

Problem Definition. Given (1) a poor dataset collected from a particular patient cohort of interest, (2) a paired dataset collected from a limited sample of patients, and (3) a *rich* model which was pre-trained on private (rich) data of the same patient cohort, we are interested in learning a model using both , and , which can perform better than a vanilla model generated using only or .

Challenges. This requires the ability to transfer the learned knowledge from to improve the prediction quality of . This is however a highly non-trivial task because (a) only generates meaningful prediction if we can provide input from rich data channels, (b) its training data is private and cannot be accessed to enable knowledge distillation and/or domain adaptation, and (c) the paired data is limited and cannot be used alone to build an accurate prediction model.

Solution Sketch. Combining these sources of information coherently to generate a useful prediction model on the patient cohort of interest is therefore a challenging task which has not been investigated before. To address this challenge, the idea is to align both rich and poor models using a transferable representation described in Section 3.2. This representation in turn helps infuse knowledge from the rich model into the poor model, thus improving its performance. The overall structure of CHEER is shown in Figure 1. The notations are summarized in Table 1.

### 3.2. Learning Transferable Rich Model

In our knowledge infusion task, the *rich* model is assumed to be trained in advance using the rich dataset . The rich dataset is, however, not accessible and we only have access to the *rich* model. The knowledge infusion task aims to consolidate the knowledge acquired by the rich model and infuse it into a simpler model (i.e., the *poor* model).

Transferable Representation. To characterize a DL model, we first describe the building blocks and then discuss how they would interact to generate the final prediction scores. In particular, let , and denote the building blocks, namely *Feature Extraction*, *Feature Scoring* and *Feature Aggregation*, respectively.
Intuitively, the *Feature Extraction* first transforms raw input feature into a vector of high-level features , whose importance is then scored by the *Feature Scoring* function . The high-level features are combined first via a linear transformation that focuses the model’s attention on important features. The results are translated into a vector of final predictive probabilities via the *Feature Aggregation* function , which implements a non-linear transformation. Mathematically, the above workflow can be succinctly characterized using the following conditional probability distributions:

(1) |

We will describe these building blocks in more detail next.

Feature Extraction. Dealing with complex input data such as time series, images and text, it is common to derive more effective features instead of directly using the raw input . The extracted features are denoted as where is a -dimensional feature vector extracted by the -th feature extractor from the raw input . Each feature extractor is applied to a separate segment of the time series input (defined later in Section 3.3). To avoid cluttering the notations, we shorten as .

Feature Scoring.
Since the extracted features are of various importance to each subject, they are combined via weights specific to each subject. More formally, the extracted features of the *rich model* are combined using subject-specific weight vector .

Essentially, each weight component maps from the raw input feature to the importance score of its extracted feature. For each dimension , is parameterized by a set of parameters , which are learned using the *rich* dataset.

Feature Aggregation. The *feature aggregation* implements a nonlinear transformation (e.g., a feed-forward layer) that maps the combined features into final predictive scores. The input to this component is the linearly combined feature and the output is a vector of logistic inputs,

(2) |

which is subsequently passed through the softmax function to compute the predictive probability for each candidate label,

(3) |
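To make the interplay of the three building blocks concrete, the following is a minimal sketch in plain Python. The toy extractors, the constant scoring functions, and the weight matrix `W` are hypothetical stand-ins chosen for illustration, not the components used by CHEER; only the order of operations (extract, score, combine linearly, aggregate, apply softmax) follows the description above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x, extractors, scorers, aggregate):
    """Extract features, score them, combine them linearly, then aggregate."""
    features = [q(x) for q in extractors]   # one feature vector per extractor
    scores = [a(x) for a in scorers]        # one importance score per extractor
    dim = len(features[0])
    combined = [sum(s * f[j] for s, f in zip(scores, features)) for j in range(dim)]
    return softmax(aggregate(combined))

def segment_extractor(start, stop):
    """Hypothetical extractor: (mean, max) of one segment of the raw input."""
    def extract(x):
        seg = x[start:stop]
        return [sum(seg) / len(seg), max(seg)]
    return extract

# Hypothetical components (assumptions for illustration only).
extractors = [segment_extractor(0, 2), segment_extractor(2, 4)]
scorers = [lambda x: 0.25, lambda x: 0.75]
W = [[1.0, -1.0], [-1.0, 1.0]]              # toy aggregation weights, 2 classes

def aggregate(h):
    """Toy dense layer mapping the combined feature to one logit per class."""
    return [sum(w * v for w, v in zip(row, h)) for row in W]

probs = predict([1.0, 3.0, 2.0, 6.0], extractors, scorers, aggregate)
```

In this sketch the scoring functions ignore their input; in the representation above they depend on the raw input, so that the weights are specific to each subject.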

### 3.3. A DNN Implementation of Rich Model

This section describes an instantiation of the aforementioned abstract building blocks using a popular DNN architecture with self-attention mechanism (Lin et al., 2017) for modeling multivariate time series (Choi et al., 2016):

Raw Features. Raw data from rich data environment often consist of multivariate time series such as physiological signals collected from hospital or temporal event sequences such as electronic health records (EHR) with discrete medical codes. In the following, we consider the raw feature input as continuous monitoring data (e.g., blood pressure measures) for illustration purpose.

Feature Extraction. To handle such continuous time series, we extract a set of domain-specific features using CNN and RNN models. More specifically, we split the raw time series into non-overlapping segments of equal length. That is, where and such that with denotes the number of features of the rich data. Then, we apply stacked 1-D convolutional neural networks () with mean pooling () on each segment, i.e.

(4) |

where , and denotes the number of filters of the CNN components of the rich model. After that, we place a recurrent neural network () across the output segments of the previous CNN and Pooling layers:

(5) |

The output segments of the RNN layer are then concatenated to generate the feature matrix,

(6) |

which correspond to our domain-specific feature extractors where , as defined previously in our transferable representation (Section 3.2).
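The segmentation step can be sketched as follows. This is only an illustration of the splitting and pooling shapes: the convolutional and recurrent layers are omitted, the function names are hypothetical, and rejecting series whose length is not a multiple of the number of segments is an assumption of the sketch. The example values (48 time steps, 6 channels, 6 segments) mirror the MIMIC-III configuration reported in Section 5.1.

```python
def split_segments(series, n_seg):
    """Split a series (list of time steps) into n_seg equal, non-overlapping segments."""
    if len(series) % n_seg != 0:
        # Assumption of this sketch: require an exact split rather than padding.
        raise ValueError("series length must be a multiple of n_seg")
    seg_len = len(series) // n_seg
    return [series[i * seg_len:(i + 1) * seg_len] for i in range(n_seg)]

def mean_pool(segment):
    """Average each channel over the time steps of one segment."""
    n_channels = len(segment[0])
    return [sum(step[c] for step in segment) / len(segment) for c in range(n_channels)]

# Toy input: 48 time steps, 6 channels per step.
series = [[float(t + c) for c in range(6)] for t in range(48)]
segments = split_segments(series, n_seg=6)
pooled = [mean_pool(seg) for seg in segments]   # 6 segments x 6 channels
```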

Feature Scoring. The concatenated features are then fed to the self-attention component to generate a vector of importance scores for the output components, i.e. . For more details on how to construct this component, see (Chorowski et al., 2015; Hermann et al., 2015; Ba et al., 2014) and (Lin et al., 2017). The result corresponds to the feature scoring functions where . Here, we use the notation to denote the -th component of vector .

Feature Aggregation. The extracted features are combined using the above feature scoring functions, which yields . The combined features are subsequently passed through a linear layer with densely connected hidden units (),

(7) |

where with denotes the number of class labels and denotes the parametric weights of the dense layers. Then, the output of the dense layer is transformed into a probability distribution over class labels via the following softmax activation functions parameterized with softmax temperatures :

The entire process corresponds to the feature aggregation function parameterized by .

### 3.4. Knowledge Infusion for Poor Model

To infuse the above knowledge extracted from the *rich* model to the *poor* model, we adopt the same transferable representation for the *poor* model as follows:

where , and are the *poor* model’s domain-specific feature extractors, feature scoring functions and feature aggregation functions, which are similar in format to those of the *rich* model. Infusing knowledge from the *rich* model to the *poor* model can then be boiled down to matching these components between them. This process can be decomposed into two steps:

Behavior Infusion. As mentioned above, each scoring function is defined by a weight vector . The collection of these weight vectors thus defines the *poor model*’s *learning behaviors* (i.e., its feature scoring mechanism).

Given the input components of the subjects included in the paired dataset and the *rich model*’s scoring outputs at those subjects, we can construct an auxiliary dataset to learn the corresponding behavior of the *poor model* so that its scoring mechanism is similar to that of the *rich model*. That is, we want to learn a mapping from a *poor* data point to the importance score assigned to its latent feature by the *rich model*. Formally, this can be cast as the optimization task given by Eq. 8:

(8) |

For example, if we parameterize and choose , then Eq. 8 reduces to a linear regression task, which can be solved analytically. Alternatively, by choosing , Eq. 8 reduces to a maximum a posteriori (MAP) inference task with normal prior imposed on , which is also analytically solvable.

Incorporating more sophisticated, non-linear parameterization for (e.g., deep neural network with varying structures) is also possible, but Eq. 8 can then only be optimized approximately via numerical methods (see Section 3.2). Eq. 8 can be solved via standard gradient descent. The complexity of deriving the solution thus depends on the number of iterations and the cost of computing the gradient of , which depends on the parameterization of but is usually where . As such, the cost of computing the gradient of the objective function with respect to a particular is O(kw). As there are iterations, the cost of solving for the optimal is . Lastly, since we are doing this for values of , the total complexity would be .
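As a rough illustration of behavior infusion for a single scoring function, the sketch below fits a linear scorer by gradient descent. It assumes a squared-error loss, a linear parameterization, and an optional L2 penalty `lam` (the choice that corresponds to a normal prior); the function name, the toy paired inputs, and the rich-model scores are all hypothetical.

```python
def fit_scorer(xs, targets, lam=0.0, lr=0.1, n_iter=2000):
    """Gradient descent on mean squared error plus lam * ||w||^2 for a linear scorer."""
    dim = len(xs[0])
    n = len(xs)
    w = [0.0] * dim
    for _ in range(n_iter):
        grad = [2.0 * lam * wj for wj in w]            # gradient of the L2 penalty
        for x, a in zip(xs, targets):
            err = sum(wj * xj for wj, xj in zip(w, x)) - a
            for j in range(dim):
                grad[j] += 2.0 * err * x[j] / n        # gradient of the squared error
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical paired data: poor inputs and the rich model's importance scores.
poor_inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
rich_scores = [0.5, -0.25, 0.25, 0.75]                 # generated by w = [0.5, -0.25]

w_plain = fit_scorer(poor_inputs, rich_scores)            # plain linear regression
w_ridge = fit_scorer(poor_inputs, rich_scores, lam=1.0)   # with the L2 penalty
```

Each iteration of this sketch touches every paired data point once per weight, so its cost grows with the number of paired points times the input dimension, and the fit is repeated once per scoring function.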

Target Infusion. Given the *poor model*’s learned behaviors (which were fitted to those of the *rich model* via solving Eq. 8), we now want to optimize the *poor model*’s feature aggregation and feature extraction components so that its predictions will (a) fit those of the *rich model* on paired data ; and also (b) fit the ground truth provided by the poor data .
Formally, this can be achieved by solving the following optimization task:

(9) |

To understand the above, note that the first term tries to fit *poor model* to *rich model* in the context of the paired dataset while the second term tries to adjust the *poor model*’s fitted behavior and target in a local context of its *poor data* . This allows the second term to act as a filter that downplays distilled patterns which are irrelevant in the poor data context. Again, Eq. 9 can be solved depending on how we parameterize the aforementioned components .

For example, can be set as a linear feed-forward layer with densely connected hidden units, which are activated by a softmax function. Again, Eq. 9 can be solved via standard gradient descent. The cost of computing the gradient depends linearly on the total number of neurons in the parameterization of and for the poor model. In particular, the gradient computation complexity for one iteration is . For iterations, the total cost would be .

Both steps of behavior infusion and target infusion are succinctly summarized in Algorithm 1.
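The two-term structure of the target infusion objective can be sketched as below. This is an illustration under stated assumptions rather than Eq. 9 itself: both terms are taken to be cross-entropy losses with equal weight, and `poor_predict`, the toy inputs, and the rich-model probabilities are hypothetical.

```python
import math

def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))

def one_hot(label, n_classes):
    return [1.0 if c == label else 0.0 for c in range(n_classes)]

def target_infusion_loss(poor_predict, paired_inputs, rich_probs,
                         poor_inputs, poor_labels, n_classes):
    """Imitation term on paired data plus supervised term on poor data."""
    imitation = sum(cross_entropy(r, poor_predict(x))
                    for x, r in zip(paired_inputs, rich_probs))
    supervised = sum(cross_entropy(one_hot(y, n_classes), poor_predict(x))
                     for x, y in zip(poor_inputs, poor_labels))
    return imitation + supervised

# Hypothetical data: poor-channel inputs, rich-model outputs, ground-truth labels.
paired_inputs = [[1.0], [-1.0]]
rich_probs = [[0.9, 0.1], [0.2, 0.8]]
poor_inputs = [[2.0], [-0.5], [0.3]]
poor_labels = [0, 1, 0]

def uniform(x):
    return [0.5, 0.5]

def informed(x):
    return [0.8, 0.2] if x[0] > 0 else [0.2, 0.8]

loss_uniform = target_infusion_loss(uniform, paired_inputs, rich_probs,
                                    poor_inputs, poor_labels, 2)
loss_informed = target_infusion_loss(informed, paired_inputs, rich_probs,
                                     poor_inputs, poor_labels, 2)
```

A poor model that tracks both the rich model's outputs and the ground-truth labels attains a lower value of this objective than an uninformative one, which is what gradient descent on the aggregation and extraction parameters would exploit.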

## 4. Theoretical Analysis

In this section, we provide theoretical analysis for CHEER. Our goal is to show that under certain practical assumptions and with respect to a random instance drawn from an arbitrary data distribution , the prediction of the resulting *poor model* will agree with that of the *rich model*, , with high probability, thus demonstrating the accuracy of our knowledge infusion algorithm in Section 3.4.

High-Level Ideas. To achieve this, our strategy is to first bound the *expected* target fitting loss (see Definition ) on a random instance of the *poor model* with respect to its optimized scoring function , feature extraction and feature aggregation components via solving Eq. 8 and Eq. 9 in Section 3.4 (see Lemma ).

We can then characterize the sufficient condition on the target fitting loss (see Definition ) with respect to a particular instance for the *poor model* to agree with the *rich model* on their predictions of and , respectively (see Lemma ). The probability that this sufficient condition happens can then be bounded in terms of the bound on the *expected* target fitting loss in Lemma (see Theorem ), which in turn characterizes how likely the *poor model* will agree with the *rich model* on the prediction of a random data instance. To proceed, we put forward the following assumptions and definitions:

Definition . Let denote an arbitrary parameterization of the *poor model*. The *particular* target fitting loss of the *poor model* with respect to a data instance is

(10) | |||||

where denotes the number of classes, and denotes the probability scores assigned to candidate class by the *rich* and *poor* models, respectively.

Definition . Let be defined as in Definition . The *expected* target fitting loss of the *poor model* with respect to the parameterization is defined below,

(11) |

where the expectation is over the unknown data distribution .

Definition . Let and . The robustness constant of the *rich model* is defined below,

(12) |

That is, if the probability scores of the model are perturbed additively within , its prediction outcome will not change.
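As an illustration of this idea, and not necessarily the constant defined in Eq. 12, the sketch below uses a hypothetical quantity: half the gap between the two largest probability scores. Any additive perturbation whose magnitude stays strictly below that value in every coordinate cannot change the arg-max prediction.

```python
def half_top_gap(probs):
    """Half the gap between the largest and second-largest scores (hypothetical margin)."""
    top, second = sorted(probs, reverse=True)[:2]
    return (top - second) / 2.0

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

scores = [0.6, 0.3, 0.1]
margin = half_top_gap(scores)

# Worst-case perturbation just inside the margin: push the top score down
# and every other score up by slightly less than the margin.
delta = margin - 0.01
perturbed = [scores[0] - delta] + [s + delta for s in scores[1:]]
```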

Assumption . The paired data points of are assumed to be distributed independently and identically from .

Assumption . The hard-label predictions and of the *poor* and *rich* models are unique.

Given the above, we are now ready to state our first result:

Lemma . Let and denote the optimal parameterization of the *poor model* that yields the minimum *expected* target fitting loss (see Definition ) and the optimal solution found by minimizing the objective functions in Eq. 8 and Eq. 9, respectively. Let , and denote the number of classes in our predictive task. If then,

(13) |

Proof. We first note that by definition in Eq. (10), for all , . Then, let us define the *empirical* target fitting loss as

(14) |

where can be treated as independently and identically distributed random variables in . Then, by Definition , it also follows that . Thus, by Hoeffding's inequality:

(15) |

Then, for an arbitrary , setting and solving for yields . Thus, for , with probability at least , holds simultaneously for all . When that happens with probability at least , we have:

(16) | |||||

That is, , which completes our proof for Lemma . Note that the above 2nd inequality follows from the definition of , which implies .

This result implies the *expected* target fitting loss incurred by our knowledge infusion algorithm in Section 3.4 can be made arbitrarily close (with high confidence) to the optimal *expected* target fitting loss with a sufficiently large paired dataset .
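To give a feel for how the required size of the paired dataset scales, the sketch below evaluates the standard two-sided Hoeffding bound for the mean of independent variables confined to an interval of width `value_range`. The function name is hypothetical, and the bound is the generic textbook one for a single fixed parameterization, so it should be read as an order-of-magnitude guide rather than the exact condition of the lemma.

```python
import math

def hoeffding_sample_size(eps, delta, value_range=1.0):
    """Smallest n with 2 * exp(-2 * n * eps^2 / value_range^2) <= delta."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))

def hoeffding_failure_prob(n, eps, value_range=1.0):
    """Two-sided Hoeffding bound on P(|empirical mean - expectation| >= eps)."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2 / value_range ** 2)

n_needed = hoeffding_sample_size(eps=0.1, delta=0.05)
```

The required size grows only logarithmically as the failure probability shrinks, but quadratically as the tolerated deviation shrinks.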

Lemma . Let and as defined in Lemma . If the corresponding *particular* target fitting loss (see Definition ) , then both *poor* and *rich* models agree on their predictions for and , respectively. That is, and are the same.

Proof. Let and be defined as in the statement of Lemma . We have:

(17) | |||||

To understand Eq. (17), note that the first inequality follows from the definition of . The second inequality follows from the fact that , which implies and hence, or . The third inequality follows from the definitions of (see Definition ) and . Finally, the last inequality follows from the definition of and that , which also implies .

Eq. (17) thus implies and hence, . Since the hard-label prediction is unique (see Assumption ), this means and hence, by definitions of and , the *poor* and *rich* models yield the same prediction. This completes our proof for Lemma .

Intuitively, Lemma specifies the sufficient condition under which the *poor model* will yield the same hard-label prediction on a particular data instance as the *rich model*. Thus, if we know how likely this sufficient condition will happen, we will also know how likely the *poor model* will imitate the *rich model* successfully on a random data instance. This intuition is the key result of our theoretical analysis and is formalized below:

Theorem . Let and denote a random instance drawn from . Let denote the size of the paired dataset , which was used to fit the learning behaviors of the *poor model* to that of the *rich model*, and let denote the event that both models agree on their predictions of . If , then with probability at least ,

(18) |

Proof. Since implies , it follows that

(19) |

Then, by Markov inequality, we have

(20) |

Subtracting both sides of Eq. (20) from a unit probability yields

(21) |

where the last equality follows because , which follows immediately from Definitions - and Assumption . Thus, plugging Eq. (21) into Eq. (19) yields

(22) |

Applying Lemma , we know that with probability , . Thus, plugging this into Eq. (22) yields

(23) |

That is, by union bound, with probability at least , the *poor model* yields the same prediction as that of the *rich model*. This completes our proof for Theorem .

This immediately implies will happen with probability at least . The chance for the *poor model* to yield the same prediction as the *rich model* on an arbitrary instance (i.e., knowledge infusion succeeds) is therefore at least .
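The Markov step used in the proof can be illustrated numerically. In the sketch below, the loss values and the threshold are hypothetical; the code only checks that, for a non-negative loss, the fraction of instances whose loss reaches the threshold never exceeds the mean loss divided by the threshold, which in turn gives a lower bound on the fraction below the threshold.

```python
def markov_tail_bound(expected_value, threshold):
    """Markov's inequality: P(L >= threshold) <= E[L] / threshold for L >= 0."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return min(1.0, expected_value / threshold)

def agreement_lower_bound(expected_loss, threshold):
    """Lower bound on P(L < threshold), the event that guarantees agreement."""
    return 1.0 - markov_tail_bound(expected_loss, threshold)

# Hypothetical per-instance target fitting losses and agreement threshold.
losses = [0.01, 0.02, 0.05, 0.10, 0.40]
threshold = 0.3
mean_loss = sum(losses) / len(losses)
empirical_tail = sum(1 for v in losses if v >= threshold) / len(losses)
tail_bound = markov_tail_bound(mean_loss, threshold)
agree_bound = agreement_lower_bound(mean_loss, threshold)
```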

## 5. Experiments

### 5.1. Experimental Settings

Datasets. We use the following datasets in our evaluation.

A. MIMIC-III Critical Care Database (MIMIC-III; https://mimic.physionet.org/) is collected from more than ICU patients at the Beth Israel Deaconess Medical Center (BIDMC) from June 2001 to October 2012 (Johnson et al., 2016). We collect a subset of patients who have one of the following (most frequent) diseases in their main diagnosis: (1) acute myocardial infarction, (2) chronic ischemic heart disease, (3) heart failure, (4) intracerebral hemorrhage, (5) specified procedures complications, (6) lung diseases, (7) endocardium diseases, and (8) septicaemia. The task is disease diagnosis classification (i.e., predicting which of 8 diseases the patient has) based on features collected from 6 data channels: vital sign time series including Heart Rate (HR), Respiratory Rate (RR), Blood Pressure mean (BPm), Blood Pressure systolic (BPs), Blood Pressure diastolic (BPd) and Blood Oxygen Saturation (SpO2). We randomly divided the data into training (), validation () and testing () sets.

B. PTB Diagnostic ECG Database (PTBDB; https://physionet.org/physiobank/database/ptbdb/) is a 15-channel 1000 Hz ECG time series including 12 conventional leads and 3 Frank leads (Goldberger et al., 2000; Bousseljot et al., 1995) collected from both healthy controls and cases of heart diseases, which amounts to a total number of 549 records. The given task is to classify ECG into one of the following categories: (1) myocardial infarction, (2) healthy control, (3) heart failure, (4) bundle branch block, (5) dysrhythmia, and (6) hypertrophy. We down-sampled the data to 200 Hz and pre-processed it following the "frame-by-frame" method (Reiss and Stricker, 2012) with sliding windows of 10-second duration and 5-second stepping between adjacent windows.

C. NEDC TUH EEG Artifact Corpus (EEG; https://www.isip.piconepress.com/projects/tuh_eeg/html/overview.shtml) is a 22-channel 500 Hz sensor time series collected from over 30,000 EEGs spanning the years from 2002 to present (Obeid and Picone, 2016). The task is to classify 5 types of EEG events including (1) eye movements (EYEM), (2) chewing (CHEW), (3) shivering (SHIV), (4) electrode pop, electrode static, and lead artifacts (ELPP), and (5) muscle artifacts (MUSC). We randomly divided the data into training (), validation () and testing () sets by records.
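The PTBDB windowing described above (200 Hz after down-sampling, 10-second windows, 5-second step) can be sketched as follows. The function name is hypothetical, and dropping a trailing partial window is an assumption of this sketch rather than a detail stated in the text.

```python
def sliding_windows(signal, sample_rate_hz, window_s, step_s):
    """Cut a 1-D signal into fixed-length windows with a fixed stride."""
    win = int(window_s * sample_rate_hz)     # samples per window
    step = int(step_s * sample_rate_hz)      # samples between window starts
    # Assumption: a trailing partial window is discarded.
    return [signal[s:s + win] for s in range(0, len(signal) - win + 1, step)]

# Toy single-channel recording: 30 seconds at 200 Hz.
signal = list(range(200 * 30))
windows = sliding_windows(signal, sample_rate_hz=200, window_s=10, step_s=5)
```

With these settings each window holds 2,000 samples and adjacent windows overlap by half, so a 30-second recording yields five windows.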

The statistics of the above datasets, as well as the architectures of the *rich* and *poor* models on each dataset are summarized in the tables below.

| | MIMIC-III | PTBDB | EEG |
|---|---|---|---|
| # subjects | 9,488 | 549 | 213 |
| # classes | 8 | 6 | 5 |
| # features | 6 | 15 | 22 |
| Average length | 48 | 108,596 | 13,007 |
| Sample frequency | 1 per hour | 1,000 Hz | 500 Hz |

| Layer | Type | Hyper-parameters | Activation |
|---|---|---|---|
| 1 | Split | n_seg=6 | |
| 2 | Convolution1D | n_filter=64, kernel_size=4, stride=1 | ReLU |
| 3 | Convolution1D | n_filter=64, kernel_size=4, stride=1 | ReLU |
| 4 | AveragePooling1D | | |
| 5 | LSTM | hidden_units=64 | ReLU |
| 6 | PositionAttention | | |
| 7 | Dense | hidden_units=n_classes | Linear |
| 8 | Softmax | | |

Architecture of the *rich* model in MIMIC-III, which includes a total of 51.6k parameters.

| Layer | Type | Hyper-parameters | Activation |
|---|---|---|---|
| 1 | Split | n_seg=6 | |
| 2 | Convolution1D | n_filter=16, kernel_size=4, stride=1 | ReLU |
| 3 | Convolution1D | n_filter=16, kernel_size=4, stride=1 | ReLU |
| 4 | AveragePooling1D | | |
| 5 | LSTM | hidden_units=16 | ReLU |
| 6 | PositionAttention | | |
| 7 | Dense | hidden_units=n_classes | Linear |
| 8 | Softmax | | |

Architecture of the *poor* model used by CHEER, Direct, KD and AT for knowledge infusion in MIMIC-III, which includes a total of 3.5k parameters.

| Layer | Type | Hyper-parameters | Activation |
|---|---|---|---|
| 1 | Split | n_seg=10 | |
| 2 | Convolution1D | n_filter=128, kernel_size=16, stride=2 | ReLU |
| 3 | Convolution1D | n_filter=128, kernel_size=16, stride=2 | ReLU |
| 4 | Convolution1D | n_filter=128, kernel_size=16, stride=2 | ReLU |
| 5 | AveragePooling1D | | |
| 6 | LSTM | hidden_units=128 | ReLU |
| 7 | PositionAttention | | |
| 8 | Dense | hidden_units=n_classes | Linear |
| 9 | Softmax | | |

Architecture of the *rich* model in PTBDB, which includes a total of 688.8k parameters.

| Layer | Type | Hyper-parameters | Activation |
|---|---|---|---|
| 1 | Split | n_seg=10 | |
| 2 | Convolution1D | n_filter=32, kernel_size=16, stride=2 | ReLU |
| 3 | Convolution1D | n_filter=32, kernel_size=16, stride=2 | ReLU |
| 4 | Convolution1D | n_filter=32, kernel_size=16, stride=2 | ReLU |
| 5 | AveragePooling1D | | |
| 6 | LSTM | hidden_units=32 | ReLU |
| 7 | PositionAttention | | |
| 8 | Dense | hidden_units=n_classes | Linear |
| 9 | Softmax | | |

Architecture of the *poor* model used by CHEER, Direct, KD and AT for knowledge infusion in PTBDB, which includes a total of 45.0k parameters.

| Layer | Type | Hyper-parameters | Activation |
|---|---|---|---|
| 1 | Split | n_seg=5 | |
| 2 | Convolution1D | n_filter=128, kernel_size=8, stride=2 | ReLU |
| 3 | Convolution1D | n_filter=128, kernel_size=8, stride=2 | ReLU |
| 4 | Convolution1D | n_filter=128, kernel_size=8, stride=2 | ReLU |
| 5 | AveragePooling1D | | |
| 6 | LSTM | hidden_units=128 | ReLU |
| 7 | PositionAttention | | |
| 8 | Dense | hidden_units=n_classes | Linear |
| 9 | Softmax | | |

Table 7: Architecture of the *rich* model in EEG, which includes a total of 417.4k parameters.

| Layer | Type | Hyper-parameters | Activation |
|---|---|---|---|
| 1 | Split | n_seg=5 | |
| 2 | Convolution1D | n_filter=32, kernel_size=8, stride=2 | ReLU |
| 3 | Convolution1D | n_filter=32, kernel_size=8, stride=2 | ReLU |
| 4 | Convolution1D | n_filter=32, kernel_size=8, stride=2 | ReLU |
| 5 | AveragePooling1D | | |
| 6 | LSTM | hidden_units=32 | ReLU |
| 7 | PositionAttention | | |
| 8 | Dense | hidden_units=n_classes | Linear |
| 9 | Softmax | | |

Table 8: Architecture of the *poor* model used by CHEER, Direct, KD and AT for knowledge infusion in EEG, which includes a total of 51.6k parameters.

Baselines. We compare CHEER against the following baselines:

Direct: In all experiments, we train a neural network model parameterized with CHEER directly on the poor dataset without knowledge infusion from the *rich model*. The resulting model can be used to produce a lower bound of predictive performance on each dataset.

Knowledge Distillation (KD) (Hinton et al., 2015): KD transfers predictive power from a teacher model to a student model via soft labels produced by the teacher. In our experiments, all KD models have complexity similar to that of the infused model generated by CHEER. The degree of label softness (i.e., the temperature parameter of the softmax activation function) in KD is set to 5.

Attention Transfer (AT) (Zagoruyko and Komodakis, 2017): AT enhances shallow neural networks by leveraging the attention mechanism (Bahdanau et al., 2015) to learn attention behavior similar to that of a full-fledged deep neural network (DNN). In our experiments, we first train a DNN with an attention component, which can be parameterized by CHEER. The trained attention component of the DNN is then transferred to that of a shallow neural network in the poor-data environment via activation-based attention transfer with normalization.

Heterogeneous Domain Adaptation (HDA) (Yao et al., 2019): The Maximum Mean Discrepancy (MMD) loss (Gretton et al., 2006) has been used successfully in domain adaptation (e.g., Long et al., 2015). However, one drawback is that these works only consider homogeneous settings in which the source and target domains share the same feature space or use the same neural network architecture. To mitigate this limitation, HDA (Yao et al., 2019) proposes a modification of the soft MMD loss to handle heterogeneity between the source and target domains.
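To make the temperature setting of the KD baseline concrete, the following is a minimal NumPy sketch of how soft labels are commonly produced by dividing logits by a temperature before the softmax. The function name and the teacher logits are hypothetical and only illustrate the effect of setting the temperature to 5; this is not the implementation used in the experiments.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=5.0):
    """Soften class probabilities by dividing the logits by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

# Hypothetical teacher logits for one instance with three classes.
teacher_logits = np.array([[4.0, 1.0, -2.0]])
hard = softmax_with_temperature(teacher_logits, temperature=1.0)  # standard softmax
soft = softmax_with_temperature(teacher_logits, temperature=5.0)  # KD-style soft labels
print(hard.round(3), soft.round(3))
```

A higher temperature flattens the distribution while keeping the class ranking, so the student also sees how the teacher scores the non-predicted classes.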

Performance Metrics.

The prediction performance of the tested methods is compared using the areas under the Precision-Recall (PR-AUC) and Receiver Operating Characteristic (ROC-AUC) curves, as well as the accuracy and F1 score, which are often used in multi-class classification to evaluate prediction quality. In particular, accuracy is the ratio between the number of correctly classified instances and the total number of test instances. The F1 score is the harmonic mean of precision (the proportion of true positive cases among the predicted positive cases) and recall (the proportion of positive cases that are correctly identified), where a threshold determines whether a predicted probability of being positive is large enough (i.e., larger than the threshold) to assign a positive label to the case being considered. We then use the average of the F1 scores evaluated for each label (i.e., the macro-F1 score) to summarize the predictive performance of each tested method across all classes. The ROC-AUC and PR-AUC scores are computed directly from the predicted probabilities and ground truths. The ROC-AUC is the area under the curve traced by the points of true positive rate (TPR) and false positive rate (FPR) at various threshold settings. Likewise, the PR-AUC is the area under the curve traced by the points of (precision, recall) at various threshold settings. In our experiments, we report the average PR-AUC and ROC-AUC since all three tasks are multi-class classification.
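The metric definitions above can be illustrated with a short NumPy sketch. It assumes one-hot ground-truth labels, a hypothetical threshold of 0.5 for the F1 score, and macro-averaging of per-class one-vs-rest ROC-AUC values; these choices and all function names are assumptions made for illustration, not details confirmed by the text.

```python
import numpy as np

def accuracy(y_true, y_prob):
    """Fraction of instances whose highest-probability class is the true class."""
    return float(np.mean(np.argmax(y_prob, axis=1) == np.argmax(y_true, axis=1)))

def macro_f1(y_true, y_prob, threshold=0.5):
    """Average of per-class F1 scores after thresholding the probabilities."""
    y_pred = (y_prob > threshold).astype(int)
    f1_scores = []
    for c in range(y_true.shape[1]):
        tp = np.sum((y_pred[:, c] == 1) & (y_true[:, c] == 1))
        fp = np.sum((y_pred[:, c] == 1) & (y_true[:, c] == 0))
        fn = np.sum((y_pred[:, c] == 0) & (y_true[:, c] == 1))
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom > 0 else 0.0)
    return float(np.mean(f1_scores))

def macro_roc_auc(y_true, y_prob):
    """Average one-vs-rest ROC-AUC, computed as the probability that a positive
    instance is scored above a negative one (ties count as one half).
    Assumes every class has at least one positive and one negative instance."""
    aucs = []
    for c in range(y_true.shape[1]):
        pos = y_prob[y_true[:, c] == 1, c]
        neg = y_prob[y_true[:, c] == 0, c]
        diff = pos[:, None] - neg[None, :]
        aucs.append(np.mean((diff > 0) + 0.5 * (diff == 0)))
    return float(np.mean(aucs))

# Hypothetical predictions for four instances and three classes.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
y_prob = np.array([[0.7, 0.2, 0.1], [0.2, 0.6, 0.2],
                   [0.3, 0.3, 0.4], [0.4, 0.5, 0.1]])
print(accuracy(y_true, y_prob), macro_f1(y_true, y_prob), macro_roc_auc(y_true, y_prob))
```

The pairwise ROC-AUC formulation is equivalent to the area under the TPR/FPR curve and avoids an explicit sweep over thresholds, at the cost of quadratic memory in the number of instances.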

Training Details.

For each method, the reported results (mean performance and its empirical standard deviation) are computed over 20 independent runs. For each run, we randomly split the entire dataset into training (80%), validation (10%) and test (10%) sets. All models are built using the training and validation sets and then evaluated on the test set. We use the Adam optimizer (Kingma and Ba, 2014) to train each model, with the default learning rate set to 0.001. The number of training epochs for each model is set to 200, and an early stopping criterion is invoked if the performance does not improve for 20 epochs. All models are implemented in Keras with a TensorFlow backend and tested on a system equipped with 64GB RAM, 12 Intel Core i7-6850K 3.60GHz CPUs and an Nvidia GeForce GTX 1080. For fair comparison, we use the same model architecture and hyper-parameter settings for Direct, KD, AT, HDA and CHEER. For the rich dataset, we use the entire dataset with the entire set of data features. For the poor dataset, we vary the size of the paired dataset and the number of features to analyze the effect of knowledge infusion in different data settings, as shown in Section 5.3. The default maximum amount of paired data is set to 50% of the entire dataset, and the default number of data features used in the poor dataset is set to half of the entire set of data features. In Section 5.2, to compare the knowledge infusion performance of the tested methods, we use the default settings for all models (including CHEER and other baselines).

### 5.2. Performance Comparison
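The evaluation protocol from the Training Details (a random 80%/10%/10% split per run, at most 200 epochs, and early stopping after 20 epochs without improvement) can be sketched in NumPy as follows. The helper names, the choice of a validation score where higher is better, and the synthetic validation curve are assumptions for illustration; the actual models are trained in Keras with the Adam optimizer.

```python
import numpy as np

def split_indices(n_samples, rng, train_frac=0.8, val_frac=0.1):
    """Randomly partition sample indices into training, validation and test sets."""
    order = rng.permutation(n_samples)
    n_train = int(train_frac * n_samples)
    n_val = int(val_frac * n_samples)
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]

def epochs_until_stop(val_scores, max_epochs=200, patience=20):
    """Return the number of epochs run before early stopping is triggered.

    val_scores holds one validation score per epoch (higher is better)."""
    best, since_best = -np.inf, 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if score > best:
            best, since_best = score, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return min(len(val_scores), max_epochs)

rng = np.random.default_rng(0)
train_idx, val_idx, test_idx = split_indices(1000, rng)

# Hypothetical validation curve: improves for 30 epochs, then plateaus.
val_curve = [min(epoch, 30) / 30 for epoch in range(1, 201)]
stopped_at = epochs_until_stop(val_curve)
print(len(train_idx), len(val_idx), len(test_idx), stopped_at)
```

Repeating the split with a fresh permutation for each of the 20 runs and collecting the test scores gives the mean and standard deviation reported for each method.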

Results on the MIMIC-III, PTBDB and EEG datasets are reported in Table 9, Table 10 and Table 11, respectively. In this experiment, we set the size of the paired dataset to 50% of the size of the *rich data*, and set the number of features used in the poor-data environment to 3, 7 and 11 for MIMIC-III, PTBDB and EEG, respectively. On all datasets, the infused model generated by CHEER consistently achieves the best predictive performance among the tested methods, which demonstrates the advantage of our knowledge infusion framework over existing transfer methods such as KD and AT.

Notably, in terms of the macro-F1 score on the MIMIC-III dataset, CHEER reaches 0.207, compared with 0.196 for KD, 0.167 for AT, 0.157 for HDA and 0.141 for Direct (see Table 9). The infused model generated by CHEER also recovers about 82% of the performance of the *rich* model on PTBDB in terms of the macro-F1 score (i.e., 0.299 versus 0.366, see Table 10) while adopting an architecture that is roughly 15 times smaller than the rich model’s (45.0k versus 688.8k parameters, see Tables 5 and 6). We have also performed a significance test to validate the improvement of CHEER over the baselines; the results are reported in Table 12.
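As a worked example of the arithmetic behind such comparisons, the sketch below derives relative macro-F1 gains from the mean values in Table 9, the fraction of the *rich* model’s macro-F1 retained on PTBDB (Table 10), and the parameter-count ratio from Tables 5 and 6. Treating improvement as relative (rather than absolute) gain is an assumption, and the variable names are illustrative.

```python
# Mean macro-F1 scores on MIMIC-III, copied from Table 9.
macro_f1_mimic = {"Direct": 0.141, "KD": 0.196, "AT": 0.167, "HDA": 0.157, "CHEER": 0.207}

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# Relative gain of CHEER over each baseline (assumed definition of improvement).
gains = {
    method: relative_gain(macro_f1_mimic["CHEER"], score)
    for method, score in macro_f1_mimic.items()
    if method != "CHEER"
}

# PTBDB: macro-F1 of CHEER (0.299) versus the rich model (0.366), from Table 10.
retained = 100.0 * 0.299 / 0.366

# PTBDB: parameter counts of the rich (688.8k) and poor (45.0k) models.
size_ratio = 688.8 / 45.0

for method, gain in gains.items():
    print(f"CHEER vs {method}: {gain:.2f}% relative macro-F1 gain")
print(f"retained: {retained:.2f}%, size ratio: {size_ratio:.2f}x")
```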

Furthermore, it can also be observed that the performance variance of the infused model generated by CHEER (as reflected in the reported standard deviation) is the lowest among all tested methods, which suggests that CHEER’s knowledge infusion is more robust. Our investigation in Section 5.3 further shows that this is the result of CHEER being able to perform both target and behavior infusion. This helps the infused model generated by CHEER achieve better and more stable performance than those of KD, HDA and AT, which match either the prediction target or the reasoning behavior of the *rich* and *poor* models (but not both). This consequently leads to their less robust performance, with wide fluctuations across data settings, as demonstrated next in Section 5.3.

| Method | ROC-AUC | PR-AUC | Accuracy | Macro-F1 |
|---|---|---|---|---|
| Direct | 0.622 ± 0.062 | 0.208 ± 0.044 | 0.821 ± 0.012 | 0.141 ± 0.057 |
| KD | 0.686 ± 0.043 | 0.257 ± 0.029 | 0.833 ± 0.012 | 0.196 ± 0.049 |
| AT | 0.645 ± 0.064 | 0.225 ± 0.044 | 0.826 ± 0.013 | 0.167 ± 0.057 |
| HDA | 0.655 ± 0.034 | 0.225 ± 0.029 | 0.824 ± 0.011 | 0.157 ± 0.038 |
| CHEER | 0.697 ± 0.024 | 0.266 ± 0.023 | 0.835 ± 0.010 | 0.207 ± 0.030 |
| Rich Model | 0.759 ± 0.014 | 0.341 ± 0.024 | 0.852 ± 0.007 | 0.295 ± 0.027 |

Table 9: Performance comparison on the MIMIC-III dataset.

| Method | ROC-AUC | PR-AUC | Accuracy | Macro-F1 |
|---|---|---|---|---|
| Direct | 0.686 ± 0.114 | 0.404 ± 0.088 | 0.920 ± 0.015 | 0.275 ± 0.057 |
| KD | 0.714 ± 0.096 | 0.439 ± 0.093 | 0.925 ± 0.016 | 0.295 ± 0.043 |
| AT | 0.703 ± 0.117 | 0.402 ± 0.078 | 0.921 ± 0.016 | 0.283 ± 0.056 |
| HDA | 0.685 ± 0.113 | 0.430 ± 0.080 | 0.924 ± 0.011 | 0.299 ± 0.051 |
| CHEER | 0.724 ± 0.103 | 0.441 ± 0.080 | 0.927 ± 0.017 | 0.299 ± 0.052 |
| Rich Model | 0.732 ± 0.110 | 0.483 ± 0.101 | 0.930 ± 0.017 | 0.366 ± 0.071 |

Table 10: Performance comparison on the PTBDB dataset.

| Method | ROC-AUC | PR-AUC | Accuracy | Macro-F1 |
|---|---|---|---|---|
| Direct | 0.797 ± 0.064 | 0.506 ± 0.083 | 0.888 ± 0.015 | 0.425 ± 0.078 |
| KD | 0.772 ± 0.083 | 0.512 ± 0.082 | 0.888 ± 0.021 | 0.445 ± 0.097 |
| AT | 0.793 ± 0.071 | 0.502 ± 0.082 | 0.884 ± 0.012 | 0.417 ± 0.062 |
| HDA | 0.805 ± 0.050 | 0.523 ± 0.073 | 0.884 ± 0.019 | 0.455 ± 0.073 |
| CHEER | 0.808 ± 0.066 | 0.535 ± 0.061 | 0.895 ± 0.016 | 0.460 ± 0.076 |
| Rich Model | 0.854 ± 0.069 | 0.657 ± 0.077 | 0.922 ± 0.014 | 0.595 ± 0.070 |

Table 11: Performance comparison on the EEG dataset.

| Comparison | MIMIC-III | PTBDB | EEG |
|---|---|---|---|
| with Direct | 0.0000 (s = 01%) | 0.0154 (s = 05%) | 0.1720 (s = 20%) |
| with KD | 0.0450 (s = 05%) | 0.1874 (s = 20%) | 0.0124 (s = 05%) |
| with AT | 0.0007 (s = 01%) | 0.0421 (s = 05%) | 0.1821 (s = 20%) |
| with HDA | 0.0000 (s = 01%) | 0.0741 (s = 10%) | 0.4823 (s = 20%) |

Table 12: Significance test of the improvement of CHEER over each baseline.

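Figure 3: ROC-AUC of the tested methods as the ratio between the size of the paired dataset and that of the *rich* dataset varies: (a) MIMIC-III, (b) PTBDB, (c) EEG.

Figure 4: ROC-AUC of the tested methods as the number of features of the *poor* dataset varies: (a) MIMIC-III, (b) PTBDB, (c) EEG.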

### 5.3. Analyzing Knowledge Infusion Effect in Different Data Settings

To further analyze the advantages of CHEER’s knowledge infusion over those of the existing works (e.g., KD and AT), we perform additional experiments to examine how the variations in (1) sizes of the paired dataset and (2) the number of features of the *poor* dataset will affect the infused model’s performance. The results are shown in Fig. 3 and Fig. 4, respectively. In particular, Fig. 3 shows how the ROC-AUC of the infused model generated by each tested method varies when we increase the ratio between the size of the paired dataset and that of the *rich* data. Fig. 4, on the other hand, shows how the infused model’s ROC-AUC varies when we increase the number of features of the *poor* dataset. In both settings, the reported performance of all methods is averaged over independent runs.

Varying Paired Data. Fig. 3 shows that (a) CHEER outperforms all baselines with varying sizes of the paired data and (b) direct learning on *poor data* yields significantly worse performance across all settings. Both observations are consistent with our earlier findings on the superior knowledge infusion performance of CHEER.
The infused models generated by KD, HDA and AT all perform consistently worse than that of CHEER by a substantial margin across all datasets. Their performance also fluctuates over a much wider range (especially on EEG data) than that of CHEER when we vary the size of the paired dataset. This shows that CHEER’s knowledge infusion is more data-efficient and robust under different data settings.

On another note, we also notice that as the amount of paired data (relative to the *rich* data) increases over part of the tested range, there is a performance drop for all tested methods with attention transfer (i.e., CHEER and AT) on MIMIC-III but not on PTBDB and EEG. This is, however, not surprising since, unlike PTBDB and EEG, MIMIC-III comprises more heterogeneous types of signals and its data distribution is also more imbalanced, which affects attention learning and causes similar performance drop patterns among methods with attention transfer such as CHEER and AT.

Varying The Number of Features. Fig. 4 shows how the prediction performance of the infused models generated by tested methods changes as we vary the number of features in *poor data*. In particular, it can be observed that the performance of CHEER’s infused model on all datasets increases steadily as we increase the number of input features observed by the *poor model*, which is expected.

On the other hand, it is perhaps surprising that as the number of features increases, the performance of KD, HDA, AT and Direct fluctuates more widely on the PTBDB and EEG datasets, in contrast to what we observe for CHEER. This is, however, not unexpected: since informativeness differs across features, utilizing and combining them effectively requires an accurate feature weighting/scoring mechanism. This is not available to Direct, KD, HDA and AT because (a) Direct completely lacks knowledge infusion from the *rich model*, (b) KD and HDA only perform target transfer from the *rich* to the *poor* model and ignore the weighting/scoring mechanism, and (c) AT only transfers the scoring mechanism to the *poor* model (i.e., attention transfer) but not the feature aggregation mechanism, which is also necessary to combine the weighted features correctly. In contrast, CHEER transfers both the weighting/scoring (via behavior infusion) and feature aggregation (via target infusion) mechanisms; it thus performs more robustly and produces steady gains (without radical fluctuations) in performance as the number of features increases. This supports our earlier observation regarding the lowest performance variance achieved by the infused model of CHEER, and also suggests that CHEER’s knowledge infusion scheme is more robust than those of KD, HDA and AT.

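| Dataset | Highest MI Features | Lowest MI Features |
|---|---|---|
| MIMIC-III | 0.672 ± 0.044 | 0.657 ± 0.012 |
| PTBDB | 0.646 ± 0.133 | 0.639 ± 0.115 |
| EEG | 0.815 ± 0.064 | 0.807 ± 0.042 |

Table 13: Performance of CHEER when the *poor* data uses the features with the highest versus the lowest mutual information (MI) with the class label.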

Finally, to demonstrate how the performance of CHEER varies with different choices of feature sets for poor data, we computed the mutual information between each feature and the class label, and then ranked them in decreasing order. The performance of CHEER on all datasets is then reported in two cases, which include (a) features with highest mutual information, and (b) features with lowest mutual information. In particular, the reported results (see Table 13) show that a feature set with low mutual information to the class label will induce worse transfer performance and conversely, a feature set (with the same number of features) with high mutual information will likely improve the transfer performance.
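The feature ranking described above can be sketched with a simple histogram-based estimate of mutual information. This is a minimal illustration that assumes each feature has been reduced to one scalar per instance and uses 10 equal-width bins; the estimator, the binning and the function names are assumptions, since the text does not specify how mutual information was computed.

```python
import numpy as np

def mutual_information(feature, labels, n_bins=10):
    """Estimate I(feature; label) in nats from a binned joint histogram."""
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    binned = np.digitize(feature, edges[1:-1])  # bin index in [0, n_bins - 1]
    classes, label_idx = np.unique(labels, return_inverse=True)
    joint = np.zeros((n_bins, len(classes)))
    np.add.at(joint, (binned, label_idx), 1.0)  # count co-occurrences
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal over feature bins
    py = joint.sum(axis=0, keepdims=True)  # marginal over classes
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / (px @ py)[nonzero])))

def rank_features_by_mi(X, y, n_bins=10):
    """Return feature indices in decreasing order of estimated MI, plus the scores."""
    scores = np.array([mutual_information(X[:, j], y, n_bins) for j in range(X.shape[1])])
    return np.argsort(-scores), scores

# Synthetic check: feature 0 depends on the label, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = np.column_stack([y + 0.1 * rng.normal(size=500), rng.normal(size=500)])
order, scores = rank_features_by_mi(X, y)
print(order, scores.round(3))
```

The highest-MI and lowest-MI feature sets then correspond to the first and last entries of `order`, respectively.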

To further inspect the effects of the modalities used in CHEER, we also computed the averaged entropy of each modality across all classes and ranked the modalities in decreasing order for each dataset. Then, we selected a small number of top-ranked, middle-ranked and bottom-ranked features from the entire set of modalities. These are marked as Top, Middle and Bottom, respectively, in Table 14.

The number of selected features for each rank is 2, 4 and 5 for MIMIC-III, PTBDB and EEG, respectively. Finally, we report the ROC-AUC scores achieved by the corresponding infused models generated by CHEER for each of those feature settings in Table 14. It can be observed from this table that the ROC-AUC of the infused model degrades consistently across all datasets when we change the features of the *poor* data from those in Top to Middle and then to Bottom. This verifies our earlier statement that informativeness differs across data features.
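The Top, Middle and Bottom selection can be illustrated with a small sketch that takes one score per feature and returns three groups of equal size. The scores below are hypothetical, and centring the Middle group in the ranking is an assumption, since the text does not state exactly which middle-ranked features were chosen.

```python
import numpy as np

def select_rank_groups(scores, k):
    """Split feature indices into Top, Middle and Bottom groups of size k."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # decreasing score
    mid_start = (len(order) - k) // 2  # centre the middle group (assumption)
    return {
        "Top": order[:k].tolist(),
        "Middle": order[mid_start:mid_start + k].tolist(),
        "Bottom": order[-k:].tolist(),
    }

# Hypothetical per-feature scores for six features, with two features per group.
scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.2]
groups = select_rank_groups(scores, k=2)
print(groups)
```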

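| Features | MIMIC-III | PTBDB | EEG |
|---|---|---|---|
| Top | 0.688 ± 0.010 | 0.710 ± 0.131 | 0.839 ± 0.044 |
| Middle | 0.676 ± 0.014 | 0.682 ± 0.132 | 0.788 ± 0.065 |
| Bottom | 0.664 | | |

Table 14: ROC-AUC of the infused models generated by CHEER with top-ranked, middle-ranked and bottom-ranked features.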
