Surgical Site Infection (SSI) is one of the most common types of nosocomial infection [lewis2013] and represents up to 30% of hospital-acquired infections [magill2012prevalence]. Studies have shown that SSI both increases the risk of readmission [shah2017evaluation] and prolongs the postoperative stay by up to two weeks, thereby also increasing the cost per patient [whitehouse2002]. Hence, being able to detect infections is of utmost importance both for the patients and for the healthcare system.
Blood sample measurements represent a fundamental source of information for predicting a patient's risk of developing SSI. In some studies, blood tests have been analyzed jointly with other electronic health record data to detect the presence of SSI [soguero2016support, hu2017strategies]. However, because blood samples are collected frequently, impose a low burden on patients, and reliably describe a patient's health status, other studies have successfully focused on the analysis of blood tests alone for predicting SSI [soguero2015data, angiolini2016].
Blood samples are usually collected from each patient at given intervals both before and after surgery. The data consist of several indicators measured over time and, owing to the important temporal relationships among the measurements, are naturally represented as multivariate time series (MTS). An effective machine learning framework for modeling and analyzing MTS is the Recurrent Neural Network (RNN).
RNNs are a special class of neural networks, characterized by internal self-connections, which are capable of modeling sequential data of variable length [DBLP:journals/corr/BianchiMKRJ17]. Thanks to their recurrent nature, RNNs capture the temporal dependencies in the MTS to perform prediction or classification. At each time step, the RNN output depends on past inputs and previously computed states. This allows the network to develop a memory of previous events, which is implicitly encoded in its internal state. Owing to these properties, RNNs have proven powerful in a number of different healthcare applications [pmlr-v56-Choi16, GULER2005506].
Although a vanilla RNN, or Elman RNN (ERNN), can in principle learn very complex relationships, optimal training is often difficult to achieve, and the network frequently performs poorly on unseen data and fails to capture long-term dependencies. A more sophisticated architecture, the Gated Recurrent Unit (GRU) [cho2014learning], implements recurrent units that adaptively capture dependencies at different time scales, and it has been shown to outperform other architectures on several tasks [DBLP:journals/corr/ChungGCB14].
Most clinical data, including blood sample measurements, are corrupted by the presence of missing values. Indeed, for each patient some measurements may not be registered and, at some times, data might not be collected at all. Most machine learning models, including RNNs, are not designed to deal with missing data, whose presence often complicates training and deteriorates performance [García-Laencina2010]. A commonly used approach is to replace missing data with imputation methods [AZPH:AZPH464], trying to introduce as little bias as possible.
The Gated Recurrent Unit with Decay (GRUD) [DBLP:journals/corr/ChePCSL16] is a recently proposed RNN, specifically designed to handle MTS with missing data and to leverage the missingness patterns to achieve better prediction results. GRUD takes as input two representations of the missing patterns: a binary mask that informs the model which inputs are observed or missing, and the time intervals that encode the input observation patterns.
In this work, we study the problem of identifying surgical site infection relying only on blood measurements that contain many missing values. We evaluate the classification performance of three types of RNN-based classifiers in discriminating between MTS from infected and non-infected patients. Specifically, we compare ERNN and GRU, where missing data are imputed, and GRUD, which handles missing data without having to resort to imputation.
Let us consider a dataset of multivariate time series $X \in \mathbb{R}^{V \times T}$, with $V$ variables of the same length $T$. Since a time series $X$ may contain missing entries, according to the procedure in [DBLP:journals/corr/ChePCSL16] we associate to $X$ a binary mask $M \in \{0,1\}^{V \times T}$, whose element $m_t^v = 0$ if $x_t^v$ is missing, and $m_t^v = 1$ otherwise.
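As a concrete illustration, the binary mask can be built directly from a series in which missing entries are marked as NaN. The following is a minimal NumPy sketch; the toy data are invented:

```python
import numpy as np

# Toy MTS with V = 3 variables over T = 4 time steps; NaN marks a missing entry.
X = np.array([[1.0, np.nan, 3.0, 4.0],
              [np.nan, 2.0, np.nan, 1.0],
              [0.5, 0.5, 0.5, np.nan]])

# Binary mask M: 0 where an entry is missing, 1 where it is observed.
M = (~np.isnan(X)).astype(int)
```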
II-A Approaches for handling missing data
To replace missing values in the input data, we consider three baseline imputation techniques [AZPH:AZPH464].
Zero imputation: the missing values in each time series are replaced with $0$. The main drawback of this imputation is the introduction of a strong bias in the data.
Last value carried forward: for each variable $v$ in $X$, the missing values are replaced by the last value observed for $v$. The main problem with this method is the assumption that there is no change from one observed value to the next.
Mean substitution: for each variable $v$ in $X$, the missing values are replaced by the mean value of $v$ across all the time series. The means are computed only over observed values, i.e., those associated with a "1" in the corresponding mask $M$.
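The three baseline strategies can be sketched in a few lines of NumPy. This is an illustrative implementation, not the one used in the experiments; for simplicity, the mean here is computed per variable within a single series, whereas the mean substitution described above is taken across all the time series:

```python
import numpy as np

def impute(X, method):
    """Fill NaN entries of an MTS of shape (V, T) with one of the
    three baseline strategies."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    if method == "zero":
        X[miss] = 0.0
    elif method == "locf":  # last value carried forward
        for v in range(X.shape[0]):
            last = 0.0  # fallback when a series starts with a gap
            for t in range(X.shape[1]):
                if np.isnan(X[v, t]):
                    X[v, t] = last
                else:
                    last = X[v, t]
    elif method == "mean":
        # Mean over observed entries only (per variable of this series).
        X = np.where(miss, np.nanmean(X, axis=1, keepdims=True), X)
    return X

X = np.array([[1.0, np.nan, 3.0],
              [np.nan, 2.0, 4.0]])
X_zero = impute(X, "zero")
X_locf = impute(X, "locf")
X_mean = impute(X, "mean")
```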
II-B Elman RNN
The state update in an ERNN is governed by the difference equation
$$h_t = \sigma(U h_{t-1} + W x_t + b),$$
where $U$ and $W$ are the recurrent and input weights respectively, $b$ is a bias vector, and $\sigma$ is the activation function, usually implemented by a tanh. The network output is computed as $\hat{y} = g(W_o h_T + b_o)$, where $h_T$ is the last hidden state of the RNN, produced once the whole MTS is processed, $W_o$ and $b_o$ are the output weights and bias respectively, and $g$ is a softmax function. The parameters $U$, $W$, $b$, $W_o$, and $b_o$ are trained with gradient descent so that $\hat{y}$ matches a desired output $y$.
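The forward pass described above can be sketched as follows. This is a minimal NumPy illustration with toy dimensions; the small random initialization and all sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, C, T = 3, 5, 2, 6  # toy sizes: variables, hidden units, classes, steps

U   = rng.standard_normal((H, H)) * 0.1  # recurrent weights
W   = rng.standard_normal((H, V)) * 0.1  # input weights
b   = np.zeros(H)                        # state bias
W_o = rng.standard_normal((C, H)) * 0.1  # output weights
b_o = np.zeros(C)                        # output bias

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ernn_forward(X):
    """Process an MTS of shape (V, T) and return class probabilities
    computed from the last hidden state."""
    h = np.zeros(H)
    for t in range(X.shape[1]):
        h = np.tanh(U @ h + W @ X[:, t] + b)  # state update
    return softmax(W_o @ h + b_o)

p = ernn_forward(rng.standard_normal((V, T)))
```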
II-C Gated Recurrent Unit
The Gated Recurrent Unit (GRU) [cho2014learning] is a gated architecture that can store information for longer periods of time than ERNNs. While the ERNN neuron implements a single squashing nonlinearity, the GRU has a more elaborate processing unit, called a cell, which is composed of different nonlinear components interacting with each other in a particular way. The internal state of a cell is modified by the network only through linear interactions. This permits information to backpropagate smoothly across time, with a consequent enhancement of the memory capacity of the cell. A schema of the GRU cell is depicted in Fig. 1.
A GRU protects and controls the information in the cell through two gates. The update gate $z_t$ controls how much the current content of the cell should be updated with the new candidate state. The reset gate $r_t$, if closed (value near 0), can effectively reset the memory of the cell and make the unit act as if the next processed input were the first in the sequence. The activation of each gate depends on the current external input and the previous state of the GRU cells. The state equations of the GRU are the following:
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r),$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z),$$
$$\tilde{h}_t = \phi(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h),$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t.$$
Here, $\phi$ and $\sigma$ are nonlinear functions, usually implemented as the hyperbolic tangent and the logistic function, respectively. The parameters are the rectangular matrices $W_r$, $W_z$, $W_h$, the square matrices $U_r$, $U_z$, $U_h$, and the bias vectors $b_r$, $b_z$, $b_h$. To control the behavior of each gate, these parameters are trained with gradient descent to solve the target task.
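A single GRU state update can be sketched numerically as follows. This is a minimal NumPy illustration with toy dimensions and an arbitrary small random initialization; parameter names follow standard GRU notation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, p):
    """One GRU update. p = (W_r, U_r, b_r, W_z, U_z, b_z, W_h, U_h, b_h)."""
    W_r, U_r, b_r, W_z, U_z, b_z, W_h, U_h, b_h = p
    r = sigmoid(W_r @ x + U_r @ h + b_r)             # reset gate
    z = sigmoid(W_z @ x + U_z @ h + b_z)             # update gate
    h_cand = np.tanh(W_h @ x + U_h @ (r * h) + b_h)  # candidate state
    return (1 - z) * h + z * h_cand                  # interpolated new state

rng = np.random.default_rng(1)
V, H = 3, 4  # toy sizes: input variables and hidden units
shapes = [(H, V), (H, H), (H,)] * 3
p = tuple(rng.standard_normal(s) * 0.1 for s in shapes)

h = gru_step(np.ones(V), np.zeros(H), p)
```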
II-D Gated Recurrent Unit with Decay
In the GRUD cell, the standard GRU architecture is modified to implement a decay mechanism for the input variables and the hidden states, according to the missing values in the input. Such decays capture two different properties that characterize health care data. First, the value of a missing variable tends to be close to some default value if its last observation occurred far in the past [vodovotz2013systems]. Second, the influence of the last observed input variables diminishes over time when the following values are missing [ZHOU2007183].
Besides the mask $M$, used to track missing values in each MTS $X$, GRUD maintains the time interval since each variable was last observed in a matrix $\Delta \in \mathbb{R}^{V \times T}$. Specifically, an element $\delta_t^v$ of $\Delta$ is defined as
$$\delta_t^v = \begin{cases} s_t - s_{t-1} + \delta_{t-1}^v, & t > 1, \; m_{t-1}^v = 0, \\ s_t - s_{t-1}, & t > 1, \; m_{t-1}^v = 1, \\ 0, & t = 1, \end{cases}$$
where $s_t$ are the time stamps relative to each measurement. A vector of decay rates is defined as
$$\gamma_t = \exp\{-\max(0, W_\gamma \delta_t + b_\gamma)\},$$
where $W_\gamma$ and $b_\gamma$ are trained on data along with the other parameters.
GRUD employs two different decays. First, $\gamma_{x_t}$ decays the input over time toward its empirical mean:
$$\hat{x}_t^v = m_t^v x_t^v + (1 - m_t^v)\left(\gamma_{x_t}^v x_{t'}^v + (1 - \gamma_{x_t}^v)\,\tilde{x}^v\right),$$
where $x_{t'}^v$ is the last value observed for variable $v$ and $\tilde{x}^v$ is its empirical mean. Second, $\gamma_{h_t}$ decays the extracted features before computing the next hidden state:
$$\hat{h}_{t-1} = \gamma_{h_t} \odot h_{t-1}.$$
The state update equations for GRUD are
$$r_t = \sigma(W_r \hat{x}_t + U_r \hat{h}_{t-1} + V_r m_t + b_r),$$
$$z_t = \sigma(W_z \hat{x}_t + U_z \hat{h}_{t-1} + V_z m_t + b_z),$$
$$\tilde{h}_t = \phi(W_h \hat{x}_t + U_h (r_t \odot \hat{h}_{t-1}) + V_h m_t + b_h),$$
$$h_t = (1 - z_t) \odot \hat{h}_{t-1} + z_t \odot \tilde{h}_t.$$
Contrary to GRU and ERNN, in GRUD it is not necessary to apply imputation to the input data, and the model can be trained end-to-end in the presence of missing values.
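The two ingredients specific to GRUD, the time-interval matrix and the input decay, can be sketched as follows. This is a minimal NumPy illustration under the definitions above; the toy data, the mask, and the scalar decay value passed to `input_decay` are invented for the example:

```python
import numpy as np

def time_since_last_obs(M, s):
    """delta[v, t]: time elapsed since variable v was last observed
    before step t (0 at the first step), given timestamps s."""
    V, T = M.shape
    delta = np.zeros((V, T))
    for t in range(1, T):
        gap = s[t] - s[t - 1]
        # If v was observed at t-1 the interval restarts, otherwise it grows.
        delta[:, t] = np.where(M[:, t - 1] == 1, gap, gap + delta[:, t - 1])
    return delta

def input_decay(x, m, x_last, x_mean, gamma_x):
    """Decayed input: observed values pass through unchanged; missing
    values slide from the last observation toward the empirical mean."""
    return m * x + (1 - m) * (gamma_x * x_last + (1 - gamma_x) * x_mean)

# One variable observed at steps 0 and 3, with unit-spaced timestamps.
M = np.array([[1, 0, 0, 1]])
s = np.array([0.0, 1.0, 2.0, 3.0])
delta = time_since_last_obs(M, s)
```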
II-E Loss function
In all three RNNs, the weights are trained using the same loss function, implementing a binary cross-entropy combined with an $L_2$ regularization term. Due to the class imbalance in the dataset, we implemented a weighting scheme that penalizes mistakes on the minority class by an amount proportional to how under-represented it is. In particular, errors relative to class $c$ are weighted by a term $w_c = 1 - N_c/N$, where $N_c$ is the number of training samples of class $c$ and $N$ is the size of the training set. In this way, classification errors on the class with fewer elements contribute more than errors on the other class. The resulting loss function is
$$L = -\sum_{i=1}^{N} w_{y_i}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right] + \lambda \|W\|_2^2,$$
where $y_i$ and $\hat{y}_i$ are the true and predicted class respectively, $\|W\|_2^2$ is the squared $L_2$ norm of all network weights (biases excluded), and $\lambda$ weights the regularization strength.
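The weighted loss can be sketched as follows, assuming the weighting $w_c = 1 - N_c/N$ described in the text. The toy labels and predictions are invented, and the example only illustrates that an equally wrong prediction costs more on the minority class:

```python
import numpy as np

def weighted_bce(y_true, y_pred, w0, w1, lam=0.0, weights=()):
    """Class-weighted binary cross-entropy plus an optional L2 penalty
    on the network weights (biases excluded)."""
    w = np.where(y_true == 1, w1, w0)
    eps = 1e-12  # numerical guard for log(0)
    bce = -np.sum(w * (y_true * np.log(y_pred + eps)
                       + (1 - y_true) * np.log(1 - y_pred + eps)))
    return bce + lam * sum(np.sum(p ** 2) for p in weights)

# Class weights from training-set frequencies: w_c = 1 - N_c / N.
y = np.array([0, 0, 0, 1])  # toy imbalanced labels
N = len(y)
w0, w1 = 1 - 3 / N, 1 - 1 / N  # 0.25 and 0.75

# The same prediction error is penalized more on the minority class.
loss_minority = weighted_bce(np.array([1]), np.array([0.1]), w0, w1)
loss_majority = weighted_bce(np.array([0]), np.array([0.9]), w0, w1)
```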
The purpose of the current study is to discriminate, with RNN-based classifiers, between MTS of blood samples from patients with and without surgical site infection. The blood samples are continuous variables measured over time and are represented as MTS. In our analysis, we discretized time, letting each time interval be one day. Ten different blood tests were extracted over the 20 days following surgery, namely, alanine aminotransferase, albumin, alkaline phosphatase, creatinine, CRP, hemoglobin, leukocytes, potassium, sodium, and thrombocytes.
The dataset consists of patients who underwent a gastrointestinal surgical procedure at UNN in the years 2004-2012. To extract the cohort for this study, we considered both the International Classification of Diseases and the NOMESCO Classification of Surgical Procedures codes related to severe postoperative complications. A patient without any of these codes was considered a control; otherwise, a case. We removed from the cohort patients with fewer than two measurements during the postoperative window. We ended up with a total of 232 infected patients (cases) and 651 non-infected patients (controls).
20% of the dataset was used for validation. The remaining part was randomly split into a training set (60%) and a test set. This procedure was repeated 10 times, each time using a new random initialization of the parameters in the RNNs. To measure performance, we used the F1-score and the area under the ROC curve (AUC), which are more suitable performance measures in the presence of imbalanced data [LOPEZ2013113].
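Both performance measures can be computed without any external library. The following is an illustrative NumPy sketch; the F1-score follows its standard definition, and the AUC is computed via the rank (Mann-Whitney U) formulation, with ties in the scores left unhandled:

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1-score for binary labels in {0, 1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def auc_score(y_true, scores):
    """AUC via the rank (Mann-Whitney U) formulation; ties are not handled."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n1 = np.sum(y_true == 1)
    n0 = len(y_true) - n1
    return (ranks[y_true == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)

auc_val = auc_score(np.array([0, 0, 1, 1]), np.array([0.1, 0.4, 0.35, 0.8]))
f1_val = f1_score(np.array([0, 1, 1, 1]), np.array([0, 1, 1, 0]))
```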
III-A Network configuration
In the experiments, we used identical network architectures and only switched the internal processing units between ERNN, GRU, and GRUD. More specifically, we used a network with a single layer of hidden units. On the output layer we applied dropout, and we set the regularization parameter $\lambda$. To train the model parameters, we used mini-batches of size 40 and Adam as the optimization algorithm. Each network was trained for a fixed number of epochs, with the data shuffled at each epoch. The models used for testing are those yielding the best F1-score on the validation set.
In Table I we report the mean classification results and standard errors obtained by an RNN classifier configured with either ERNN, GRU, or GRUD, on the validation set during training and on the final classification of the test set once training is over. When using ERNN and GRU, missing values in the inputs are filled using mean substitution (-m), zero imputation (-z), or last value carried forward (-l).
Model | AUC (val) | F1 (val) | AUC (test) | F1 (test)
As we can see, the best classification results in terms of F1-score and AUC are achieved by GRUD and by GRU configured with mean imputation, which attain similar performance in both validation and testing. Interestingly, GRUD handles missing values as well as the standard GRU cell does, while providing the advantage of end-to-end training, without requiring imputation procedures to be applied in advance. On the other hand, the ERNN configurations perform worse than the other architectures. This is somewhat expected, since the gating mechanisms in GRU and GRUD provide more flexibility and computational capability.
To analyze the quality of the representations learned by GRUD compared to ERNN, we performed principal component analysis (PCA) on the final hidden states of the networks. The classification outcome depends heavily on those states, since they are the input to the last softmax layer, which produces the class assignment. Each last state is therefore a high-level static representation of the sequential input learned by the network. In Fig. 3, the representations of the MTS in the test set are mapped to two dimensions. As we can see, GRUD separates the two classes well, while in the case of ERNN the test elements largely overlap.
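The two-dimensional projection of the final hidden states can be obtained as follows. This is a minimal sketch via SVD of the centered state matrix; the random matrix stands in for a stack of final hidden states, purely for illustration:

```python
import numpy as np

def pca_2d(H):
    """Project the rows of H (n_samples, n_features) onto the first two
    principal components, via SVD of the centered matrix."""
    Hc = H - H.mean(axis=0)
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Hc @ Vt[:2].T

rng = np.random.default_rng(2)
states = rng.standard_normal((50, 8))  # stand-in for 50 final hidden states
proj = pca_2d(states)
```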
We observe that, in contrast to GRU, the performance of ERNN is heavily affected by the choice of imputation method. Indeed, each technique introduces a different kind of bias in the data, and the optimal choice depends on the task at hand. Using the wrong imputation may complicate training. While this can be an issue for the weaker ERNN, the higher computational capability of GRU allows it to handle stronger biases well.
In this work we focused on the classification of blood sample data from patients with surgical site infections. The data are represented as multivariate time series and are characterized by a large amount of missing values. To classify the data, we used three different RNNs, configured with either ERNN, GRU, or GRUD. While GRUD can process MTS with missing values directly, ERNN and GRU require imputation to replace them. In the experiments, we observed that GRUD and GRU with imputation achieve better performance than ERNN in classifying the MTS. We also noticed that different imputations yield substantial variation in the ERNN classification results, while the performance of GRU is more stable. Since selecting the best imputation method is often difficult and requires expertise in the data domain, such critical sensitivity can represent an issue. Therefore, the stability provided by GRU, and by GRUD, which does not require imputation at all, is an important advantage in many practical applications.