I Introduction
Surgical Site Infection (SSI) is one of the most common types of nosocomial infection [lewis2013] and represents up to 30% of hospital-acquired infections [magill2012prevalence]. Studies have shown that SSI both increases the risk of readmission [shah2017evaluation] and prolongs the postoperative stay by up to two weeks, thereby also increasing the cost per patient [whitehouse2002]. Hence, being able to detect infections is of utmost importance both for the patients and for the healthcare system.
Blood sample measurements represent a fundamental source of information for predicting the risk of SSI in a given patient. In some studies, blood tests have been analyzed jointly with other electronic health record data to detect the presence of SSI [soguero2016support, hu2017strategies]. However, because blood samples are recorded frequently, impose a low burden on the patients, and describe the health status of a patient reliably, other studies have successfully focused on the analysis of blood tests alone for predicting SSI [soguero2015data, angiolini2016].
Blood samples are usually collected from each patient at given intervals, both before and after surgery. The data are series of several indicators measured on a patient over time and, due to the presence of important temporal relationships among the measurements, can be naturally represented as multivariate time series. An effective machine learning framework for modeling and analyzing multivariate time series (MTS) is the Recurrent Neural Network (RNN).
RNNs are a special class of neural networks characterized by internal self-connections, which make them capable of modeling sequential data of variable length [DBLP:journals/corr/BianchiMKRJ17]. Thanks to their recurrent nature, an RNN captures temporal dependencies in the MTS to perform prediction or classification. At each time step, the RNN output depends on past inputs and previously computed states. This allows the network to develop a memory of previous events, which is implicitly encoded in its internal state. Thanks to these properties, RNNs have proven powerful in a number of different healthcare applications [pmlrv56Choi16, GULER2005506].
Although a vanilla RNN, or Elman RNN (ERNN), can in principle learn very complex relationships, optimal training is often difficult to achieve: the network frequently performs poorly on unseen data and fails to capture long-term dependencies. A more sophisticated architecture, the Gated Recurrent Unit (GRU) [cho2014learning], implements recurrent units that adaptively capture dependencies at different time scales, and it has been shown to outperform other architectures on several tasks [DBLP:journals/corr/ChungGCB14].
Most clinical data, including blood sample measurements, are corrupted by missing values. Indeed, for each patient some measurements may not be registered and, at some time steps, data might not be collected at all. Most machine learning models, including RNNs, are not designed to deal with missing data, whose presence often complicates training and deteriorates performance [GarcíaLaencina2010]. A commonly used approach is to replace missing data with imputation methods [AZPH:AZPH464], trying to introduce as little bias as possible.
The Gated Recurrent Unit with Decay (GRU-D) [DBLP:journals/corr/ChePCSL16] is a recently proposed RNN, specifically designed to handle MTS with missing data and to leverage the missingness patterns to achieve better prediction results. GRU-D takes as input two representations of the missing patterns: a mask that informs the model of which inputs are observed or missing, and time intervals that encapsulate the input observation patterns.
In this work, we study the problem of identifying surgical site infection by relying only on blood measurements that contain many missing values. We evaluate the classification performance of three types of RNN-based classifiers in discriminating between MTS from infected and non-infected patients. Specifically, we compare ERNN and GRU, where missing data are imputed, with GRU-D, which handles missing data without resorting to imputation.
II Methods
Let us consider a dataset of multivariate time series, each with $V$ variables and the same length $T$. Since a time series $X$ may contain missing entries, following the procedure in [DBLP:journals/corr/ChePCSL16] we associate to $X$ a binary mask $M$, whose element $m_t^v = 0$ if $x_t^v$ is missing and $m_t^v = 1$ otherwise.
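As a concrete illustration of this construction, the following NumPy sketch (not from the original work; it assumes missing entries are encoded as NaN) builds the binary mask from a toy series:

```python
import numpy as np

# Toy MTS with T = 4 time steps and V = 2 variables; NaN marks missing.
X = np.array([[1.0, np.nan],
              [2.0, 5.0],
              [np.nan, np.nan],
              [4.0, 7.0]])

# Binary mask: 1 where a value was observed, 0 where it is missing.
M = (~np.isnan(X)).astype(int)
print(M)
# [[1 0]
#  [1 1]
#  [0 0]
#  [1 1]]
```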
II-A Approaches for handling missing data
To replace missing values in the input data, we consider three baseline imputation techniques [AZPH:AZPH464].

Zero imputation: the missing values in each time series are replaced with zeros. The main drawback of this imputation is the introduction of a strong bias in the data.

Last value carried forward: for each variable $v$ in $X$, the missing values are replaced by the last observed value of $v$. The main problem with this method is the assumption that there is no change from one observed value to the next.

Mean substitution: for each variable $v$ in $X$, the missing values are replaced by the mean value of $v$ across all the time series. Mean values are computed only over observed entries, i.e., those associated with a "1" in the corresponding mask $M$. As its main drawback, this method can lead to underestimates of the variance.
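The three baselines above can be sketched as follows (a minimal NumPy illustration, assuming NaN encodes a missing entry; the `impute` helper is hypothetical, not from the original work):

```python
import numpy as np

def impute(X, method):
    """Replace NaNs in a (T, V) series with zeros, the last observed
    value (LOCF), or the per-variable mean over observed entries."""
    X = X.copy()
    miss = np.isnan(X)
    if method == "zero":
        X[miss] = 0.0
    elif method == "locf":  # last value carried forward
        for v in range(X.shape[1]):
            last = 0.0  # fall back to zero before the first observation
            for t in range(X.shape[0]):
                if np.isnan(X[t, v]):
                    X[t, v] = last
                else:
                    last = X[t, v]
    elif method == "mean":
        means = np.nanmean(X, axis=0)  # means over observed values only
        X[miss] = np.take(means, np.where(miss)[1])
    return X

X = np.array([[1.0, np.nan],
              [np.nan, 4.0],
              [3.0, 8.0]])
print(impute(X, "mean"))
# [[1. 6.]
#  [2. 4.]
#  [3. 8.]]
```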
II-B Elman RNN
The state update in an ERNN is governed by the difference equation $h_t = f(W_h h_{t-1} + W_i x_t + b)$, where $W_h$ and $W_i$ are the recurrent and input weights respectively, $b$ is a bias vector, and $f$ is the activation function, usually implemented as a tanh. The network output is computed as $y = g(W_o h_T + b_o)$, where $h_T$ is the last hidden state of the RNN, produced once the whole MTS has been processed, $W_o$ and $b_o$ are the output weights and bias respectively, and $g$ is a softmax function. The parameters $W_h$, $W_i$, $b$, $W_o$, and $b_o$ are trained with gradient descent so that the output $y$ matches a desired output $\bar{y}$.

II-C Gated Recurrent Unit
The Gated Recurrent Unit (GRU) [cho2014learning] is a gated architecture that can store information for longer periods of time than an ERNN. While the ERNN neuron implements a single squashing nonlinearity, the GRU has a more elaborate processing unit called a cell, which is composed of different nonlinear components interacting with each other in a particular way. The internal state of a cell is modified by the network only through linear interactions. This permits information to backpropagate smoothly across time, with a consequent enhancement of the memory capacity of the cell. A schema of the GRU cell is depicted in Fig. 1.

A GRU protects and controls the information in the cell through two gates. The update gate controls how much the current content of the cell should be updated with the new candidate state. The reset gate, if closed (value near 0), can effectively reset the memory of the cell and make the unit act as if the next processed input were the first in the sequence. The activation of each gate depends on the current external input, the previous state of the GRU cells, and their output. The state equations of the GRU are the following:

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$\tilde{h}_t = \phi(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
Here, $\phi$ and $\sigma$ are nonlinear functions, usually implemented as the hyperbolic tangent and the logistic sigmoid, respectively. The parameters are the rectangular matrices $W_z$, $W_r$, $W_h$, the square matrices $U_z$, $U_r$, $U_h$, and the bias vectors $b_z$, $b_r$, $b_h$. To control the behavior of each gate, these parameters are trained with gradient descent to solve a target task.
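As an illustration, a minimal NumPy sketch of a single GRU state update following the equations above (the weight names mirror the text; the random toy initialization and input series are only for demonstration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, p):
    """One GRU update: update gate z, reset gate r, candidate state,
    then a convex combination of the old state and the candidate."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    return (1.0 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
V, H = 3, 4  # input variables, hidden units
p = {"Wz": rng.normal(size=(H, V)), "Uz": rng.normal(size=(H, H)), "bz": np.zeros(H),
     "Wr": rng.normal(size=(H, V)), "Ur": rng.normal(size=(H, H)), "br": np.zeros(H),
     "Wh": rng.normal(size=(H, V)), "Uh": rng.normal(size=(H, H)), "bh": np.zeros(H)}

h = np.zeros(H)
for x in rng.normal(size=(6, V)):  # process a toy series of 6 steps
    h = gru_step(x, h, p)
print(h.shape)  # (4,)
```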
II-D Gated Recurrent Unit with Decay
In the GRU-D cell, the standard GRU architecture is modified to implement a decay mechanism for the input variables and the hidden states, driven by the missing values in the input. These decays capture two properties that characterize healthcare data. First, the value of a missing variable tends to be close to some default value if its last observation occurred far in the past [vodovotz2013systems]. Second, the influence of the last observed input variables diminishes over time when the subsequent values are missing [ZHOU2007183].
Besides the mask $M$, used to track missing values in each MTS $X$, GRU-D maintains the time interval since each variable was last observed in a matrix $\Delta$. Specifically, an element $\delta_t^v$ of $\Delta$ is defined as

$$\delta_t^v = \begin{cases} s_t - s_{t-1} + \delta_{t-1}^v, & t > 1,\ m_{t-1}^v = 0 \\ s_t - s_{t-1}, & t > 1,\ m_{t-1}^v = 1 \\ 0, & t = 1 \end{cases}$$

where $s_t$ are the time stamps relative to each measurement. A vector of decay rates is defined as

$$\gamma_t = \exp\{-\max(0, W_\gamma \delta_t + b_\gamma)\} \quad (1)$$
where $W_\gamma$ and $b_\gamma$ are trained on data along with the other parameters. GRU-D employs two different decays. First, $\gamma_{x_t}$ decays the input over time toward its empirical mean:

$$\hat{x}_t^v = m_t^v x_t^v + (1 - m_t^v)\left(\gamma_{x_t}^v x_{t'}^v + (1 - \gamma_{x_t}^v)\,\tilde{x}^v\right) \quad (2)$$

where $x_{t'}^v$ is the last observed value of variable $v$ and $\tilde{x}^v$ is its empirical mean. Second, $\gamma_{h_t}$ decays the extracted features before computing the next hidden state:

$$\hat{h}_{t-1} = \gamma_{h_t} \odot h_{t-1} \quad (3)$$
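The interval matrix and the decay rates of Eq. (1) can be sketched in NumPy as follows (an illustration, not the authors' implementation; the decay weights here are identity matrices for readability, and the input decay is often constrained to act independently per variable, a detail omitted in this sketch):

```python
import numpy as np

def time_intervals(M, s):
    """Interval matrix Delta from a (T, V) mask M (1 = observed)
    and a vector s of time stamps."""
    T, V = M.shape
    D = np.zeros((T, V))
    for t in range(1, T):
        for v in range(V):
            # Restart from the last step if it was observed, else accumulate.
            D[t, v] = s[t] - s[t - 1] + (0.0 if M[t - 1, v] else D[t - 1, v])
    return D

def decay(D, Wg, bg):
    """Decay rates gamma_t = exp(-max(0, Wg @ delta_t + bg)), one row per step."""
    return np.exp(-np.maximum(0.0, D @ Wg.T + bg))

M = np.array([[1, 1],
              [0, 1],
              [0, 0],
              [1, 0]])
s = np.array([0.0, 1.0, 2.0, 3.0])  # daily time stamps
D = time_intervals(M, s)
G = decay(D, np.eye(2), np.zeros(2))  # identity weights for illustration
print(D)
# [[0. 0.]
#  [1. 1.]
#  [2. 1.]
#  [3. 2.]]
```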
The state update equations for GRU-D are

$$z_t = \sigma(W_z \hat{x}_t + U_z \hat{h}_{t-1} + V_z m_t + b_z)$$
$$r_t = \sigma(W_r \hat{x}_t + U_r \hat{h}_{t-1} + V_r m_t + b_r)$$
$$\tilde{h}_t = \phi(W_h \hat{x}_t + U_h (r_t \odot \hat{h}_{t-1}) + V_h m_t + b_h)$$
$$h_t = (1 - z_t) \odot \hat{h}_{t-1} + z_t \odot \tilde{h}_t$$

where $\hat{x}_t$ and $\hat{h}_{t-1}$ are updated according to Eq. 2 and Eq. 3 respectively, while $V_z$, $V_r$, and $V_h$ are additional trainable weights applied to the mask values in $M$. A schema of the GRU-D architecture is depicted in Fig. 2.
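Putting the pieces together, a single GRU-D update can be sketched as follows (a simplified NumPy illustration, not the authors' implementation; the helper name, weight names, and toy inputs are hypothetical):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_d_step(x, x_last, x_mean, m, gx, gh, h, p):
    """One GRU-D update. x: current input (placeholders where missing),
    x_last: last observed value per variable, x_mean: empirical means,
    m: mask at time t (1 = observed), gx/gh: input and state decay rates."""
    # Eq. (2): decay missing inputs from the last observation toward the mean.
    x_hat = m * x + (1 - m) * (gx * x_last + (1 - gx) * x_mean)
    # Eq. (3): decay the previous hidden state.
    h_hat = gh * h
    # Standard GRU gating, with extra weights V* applied to the mask.
    z = sigmoid(p["Wz"] @ x_hat + p["Uz"] @ h_hat + p["Vz"] @ m + p["bz"])
    r = sigmoid(p["Wr"] @ x_hat + p["Ur"] @ h_hat + p["Vr"] @ m + p["br"])
    h_tilde = np.tanh(p["Wh"] @ x_hat + p["Uh"] @ (r * h_hat)
                      + p["Vh"] @ m + p["bh"])
    return (1 - z) * h_hat + z * h_tilde

rng = np.random.default_rng(0)
V, H = 3, 4
p = {k + g: rng.normal(size=(H, {"W": V, "U": H, "V": V}[k]))
     for k in ("W", "U", "V") for g in ("z", "r", "h")}
p.update({"b" + g: np.zeros(H) for g in ("z", "r", "h")})

h = gru_d_step(x=np.array([0.5, 0.0, -0.2]),
               x_last=np.array([0.5, 1.0, -0.2]),
               x_mean=np.zeros(V),
               m=np.array([1.0, 0.0, 1.0]),     # second variable missing
               gx=np.exp(-np.array([0.0, 2.0, 0.0])),
               gh=np.exp(-0.5 * np.ones(H)),
               h=np.zeros(H),
               p=p)
print(h.shape)  # (4,)
```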
In contrast to GRU and ERNN, GRU-D does not require imputation of the input data: the model can be trained end-to-end in the presence of missing values.
II-E Loss function
In all three RNNs, the weights are trained using the same loss function: binary cross-entropy combined with an $L_2$ regularization term. Due to the class imbalance in the dataset, we implemented a weighting scheme that penalizes mistakes on the minority class by an amount proportional to how underrepresented it is. In particular, errors relative to class $c$ are weighted by a term $w_c = (N - N_c)/N$, where $N_c$ is the number of training samples of class $c$ and $N$ the size of the training set. In this way, classification errors on the class with fewer elements contribute more than errors on the other class. The resulting loss function is

$$L = -\sum_{i} w_{y_i}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right] + \lambda \lVert W \rVert_2^2$$

where $y_i$ and $\hat{y}_i$ are the true and predicted class respectively, $\lVert W \rVert_2$ is the $L_2$ norm of all network weights (biases excluded), and $\lambda$ weights the regularization strength.
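A sketch of this loss in NumPy (the exact weighting form $w_c = (N - N_c)/N$ is an assumption consistent with the description above, and `weighted_bce` is a hypothetical helper, not the authors' code):

```python
import numpy as np

def weighted_bce(y_true, y_pred, lam=0.0, weights=()):
    """Class-weighted binary cross-entropy with an optional L2 penalty.
    Errors on class c are scaled by w_c = (N - N_c) / N, so the minority
    class contributes more to the loss (assumed weighting scheme)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 1e-7, 1 - 1e-7)
    n, n1 = len(y_true), y_true.sum()
    w = np.where(y_true == 1, (n - n1) / n, n1 / n)
    ce = -np.sum(w * (y_true * np.log(y_pred)
                      + (1 - y_true) * np.log(1 - y_pred)))
    l2 = lam * sum(np.sum(W ** 2) for W in weights)
    return ce + l2

# A mistake on the rare positive class costs more than one on a negative.
loss_pos_err = weighted_bce([1, 0, 0, 0], [0.1, 0.1, 0.1, 0.1])
loss_neg_err = weighted_bce([1, 0, 0, 0], [0.9, 0.9, 0.1, 0.1])
print(loss_pos_err > loss_neg_err)  # True
```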
III Experiments
The purpose of the current study is to discriminate, with RNN-based classifiers, between MTS of blood samples from patients with and without surgical site infection. The blood samples are continuous variables over time and are represented as MTS. In our analysis, we discretized time and let each time interval be one day. Ten different blood tests were extracted over 20 days after surgery, namely alanine aminotransferase, albumin, alkaline phosphatase, creatinine, CRP, hemoglobin, leukocytes, potassium, sodium, and thrombocytes.
The dataset consists of patients who underwent a gastrointestinal surgical procedure at UNN in the years 2004-2012. To extract the cohort for this study, we considered both the International Classification of Diseases and the NOMESCO Classification of Surgical Procedures codes related to severe postoperative complications. A patient who did not have any of these codes was considered a control; otherwise, a case. We removed from the cohort patients with fewer than two measurements during the postoperative window. We ended up with a total of 232 infected patients (cases) and 651 non-infected patients (controls).
20% of the dataset was used for validation. The remaining part was randomly split into a training set (60% of the data) and a test set. This procedure was repeated 10 times, each time with a new random initialization of the parameters in the RNNs. To measure performance we used the F1-score and the area under the ROC curve (AUC), which are more suitable performance measures in the presence of imbalanced data [LOPEZ2013113].
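The evaluation protocol can be sketched as follows (a simplified scikit-learn illustration: synthetic data and a logistic regression stand in for the blood-sample MTS and the RNNs, and the separate validation split is omitted):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 300 samples, 10 features, a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

aucs, f1s = [], []
for seed in range(10):  # 10 repetitions, each with a fresh random split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    clf = LogisticRegression().fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    f1s.append(f1_score(y_te, clf.predict(X_te)))

# Report mean scores over the repetitions.
print(round(float(np.mean(aucs)), 3), round(float(np.mean(f1s)), 3))
```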
III-A Network configuration
In the experiments, we used identical network architectures and only switched the internal processing units to ERNN, GRU, or GRU-D. More specifically, we used a network with a single layer of hidden units. On the output layer we applied dropout, and we set the regularization parameter $\lambda$. To train the model parameters we used mini-batches of size 40 and Adam as the optimization algorithm. Each network was trained for a fixed number of epochs, with the data shuffled at each epoch. The models used for testing are the ones yielding the best F1-score on the validation set.

III-B Results
In Table I we report the mean classification results and standard errors obtained by an RNN classifier configured with either ERNN, GRU, or GRU-D, both on the validation set during training and on the final classification of the test set once training is over. When using ERNN and GRU, missing input values are filled using mean substitution (m), zero imputation (z), or last value carried forward (l).
Model | AUC (val) | F1 (val) | AUC (test) | F1 (test)
ERNN-m
ERNN-l
ERNN-z
GRU-m
GRU-l
GRU-z
GRU-D
As we can see, the best classification results in terms of F1-score and AUC are achieved by GRU-D and by GRU configured with mean imputation, which reach similar performance in both validation and testing. Interestingly, GRU-D handles missing values as well as the standard GRU cell while providing the advantage of end-to-end training, without requiring imputation procedures to be applied in advance. On the other hand, the ERNN configurations perform worse than the other architectures. This is somewhat expected, since the gating mechanisms in GRU and GRU-D provide more flexibility and computational capability.
To analyze the quality of the representations learned by GRU-D compared to ERNN, we performed principal component analysis (PCA) on the final hidden states of the networks. The classification outcome heavily depends on those states, since they are the input to the last softmax layer, which produces the class assignment. Each last state is therefore the high-level static representation of the sequential input learned by the network. In Fig. 3, the representations of the MTS in the test set are mapped to two dimensions. As we can see, GRU-D separates the two classes well, while in the case of ERNN the test elements largely overlap.

We observe that, in contrast to GRU, the performance of ERNN is heavily affected by the choice of imputation method. Indeed, each technique introduces a different kind of bias in the data, and the optimal choice depends on the task at hand. Using the wrong imputation may complicate training. While this can be an issue for the weaker ERNN, the higher computational capability of GRU makes it possible to handle stronger biases well.
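The PCA projection of the final hidden states can be sketched as follows (a scikit-learn illustration in which synthetic clusters stand in for the networks' actual learned representations):

```python
import numpy as np
from sklearn.decomposition import PCA

# Final hidden states, one per test sequence (synthetic stand-ins here):
# well-separated clusters mimic the GRU-D case, where classes split cleanly.
rng = np.random.default_rng(0)
states_cases = rng.normal(loc=2.0, size=(50, 16))      # "infected"
states_controls = rng.normal(loc=-2.0, size=(80, 16))  # "non-infected"
states = np.vstack([states_cases, states_controls])

# Project onto the first two principal components for visual inspection.
coords = PCA(n_components=2).fit_transform(states)
print(coords.shape)  # (130, 2)
```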
IV Conclusions
In this work we focused on the classification of blood sample data from patients with surgical site infections. The data are represented as multivariate time series and are characterized by a large amount of missing values. To classify the data, we used three different RNNs, configured with either ERNN, GRU, or GRU-D. While GRU-D can process MTS with missing values directly, ERNN and GRU require imputation to replace missing values. In the experiments, we observed that GRU-D and GRU with imputation achieve better performance than ERNN in classifying the MTS. We also noticed that different imputations yield a substantial variation in the ERNN classification results, while the performance of GRU is more stable. Since selecting the best imputation method is often difficult and requires expertise in the data domain, such sensitivity may represent an issue. Therefore, the stability provided by GRU, and by GRU-D, which does not require imputation at all, is an important advantage in many practical applications.