1 Introduction
Sepsis is a life-threatening condition that arises when the body’s response to infection causes injury to its own tissues and organs. Early prediction of sepsis onset is important because it allows physicians to start preventive treatment early. However, sepsis prediction is a difficult task because the risk factors are complex: they include the patient’s age, a weakened immune system, complications (e.g. cancer, diabetes), and conditions such as trauma or burns. With the widespread availability of electronic health records (EHR), well-designed predictive models that make effective use of sequential clinical data can be expected to improve sepsis prediction performance.
Early sepsis prediction is challenging because patients’ sequential data in EHR contains temporal interactions among multiple clinical events [Mio2017deep, qian2017topic]. These interactions include event co-occurrence within a short period (e.g. two related symptoms occurring together) and temporal dependency over a long timescale (e.g. a vital sign rising abnormally several hours after a certain drug injection). One possible solution is to apply deep sequential models, such as LSTM [hochreiter1997long] or Transformer [vaswani2017attention], directly to the clinical event sequence. However, capturing temporal interactions in a long event sequence is hard for a traditional LSTM, because the length of clinical sequences exceeds what an LSTM can model effectively.
Rather than directly applying the LSTM model to the event sequences, some works design hierarchical neural networks to model the long sequence [che2018hierarchical]. For example, aggregating the events in a short period into a single vector shortens the original long sequence [liu2019learning]. However, the information from every kind of clinical event is mixed in the aggregation vector, so temporal interactions among these events are hard to capture.
To address these issues, our proposed model first aggregates heterogeneous clinical events within a short period and then captures temporal interactions among the aggregated representations with an LSTM. The Heterogeneous Event Aggregation module not only shortens the clinical event sequence but also retains temporal interactions of both categorical and numerical features of clinical events in the multiple heads of the aggregation representation. Keeping clinical information separated in different heads makes it easier to capture temporal interactions among events. Experiments on the PhysioNet/Computing in Cardiology Challenge 2019 show that our proposed model is effective and efficient compared to traditional methods. The contributions of this work are summarised as follows:

We propose a model to capture temporal interactions among multiple types of clinical event streams from EHR data for early sepsis prediction.

The proposed heterogeneous event aggregation module can reduce the length of long clinical event sequences and retain their temporal interactions.

Our proposed model achieves good prediction performance and time efficiency.
2 Dataset and Preprocessing
2.1 Dataset
The EHR data provided publicly for this challenge is sourced from two separate ICUs, containing 20000 and 20643 records respectively. Each record consists of hourly clinical data for a specific patient. Each row represents a single hour’s data with 40 variables and an additional label indicating whether the patient will develop sepsis within 6 hours. There are 2932 sepsis patients in total, a positive sample proportion of 7.21%. Sepsis and normal patients are each divided into train and test sets at the same ratio. 5-fold cross validation is performed over the train set.
2.2 Data Preprocessing
In this competition, our goal is to make an early sepsis detection within 6 hours for every time point without a causal model. For each patient’s record, we use a fixed-length sliding window with a 1-hour step to sample fixed-length records (zero filling if the available history is shorter than the window).
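The windowing step can be illustrated with a minimal NumPy sketch. It assumes that each window ends at the current hour and that zero filling is applied on the left (before the first recorded hour); the text above does not state the padding side, and the function name `sliding_windows` is hypothetical.

```python
import numpy as np

def sliding_windows(record: np.ndarray, window: int = 24) -> np.ndarray:
    """Return one fixed-length window per hour of `record`.

    record: array of shape (T, F), one row per hour.
    Output: array of shape (T, window, F). Window t ends at hour t and is
    left-padded with zeros when fewer than `window` hours are available
    (the padding side is an assumption of this sketch).
    """
    n_hours, n_features = record.shape
    padded = np.vstack([np.zeros((window - 1, n_features)), record])
    # Step of 1 hour: window t covers padded rows t .. t + window - 1.
    return np.stack([padded[t:t + window] for t in range(n_hours)])

# Toy record: 4 hours, 3 variables.
record = np.arange(12, dtype=float).reshape(4, 3)
windows = sliding_windows(record, window=3)
print(windows.shape)  # (4, 3, 3)
```

Each hour of a patient's stay therefore yields one training sample of identical shape, regardless of how long the patient has been in the ICU.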
The 40 columns of each record contain 37 numerical variables and 3 binary variables. At the preprocessing stage, we relabel the values of the 3 two-category variables with integers from 0 to 6 (6 clinical categories and one empty category for NaN). We then move the categorical variables to the last columns. To help the deep model converge, z-score normalization is applied to each numerical variable.
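A minimal sketch of these two preprocessing steps follows. The concrete label assignment (0 for NaN, then two consecutive labels per binary variable) and the choice to ignore NaN entries when computing statistics are assumptions of this sketch; the text only fixes the label range 0 to 6 and the use of z-score normalization. All function names are hypothetical.

```python
import numpy as np

def zscore_fit(train: np.ndarray):
    """Per-column mean and standard deviation, ignoring NaN entries."""
    mean = np.nanmean(train, axis=0)
    std = np.nanstd(train, axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    return mean, std

def zscore_apply(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Normalize each column; NaN entries stay NaN so they can be masked later."""
    return (x - mean) / std

def relabel_binary(columns: np.ndarray) -> np.ndarray:
    """Map binary columns (values 0, 1 or NaN) to integer labels.

    Column j with value v becomes 1 + 2*j + v; NaN becomes 0 (empty category).
    With 3 columns this yields labels 0..6.
    """
    offsets = 1 + 2 * np.arange(columns.shape[1])
    labels = np.where(np.isnan(columns), 0, np.nan_to_num(columns) + offsets)
    return labels.astype(int)

numeric = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
mean, std = zscore_fit(numeric)
normalized = zscore_apply(numeric, mean, std)

cats = np.array([[0.0, 1.0, np.nan], [1.0, 0.0, 1.0]])
labels = relabel_binary(cats)
print(labels)  # [[1 4 0] [2 3 6]]
```

The statistics should be fitted on the training split only and then reused for the test split, so that no information from the test set leaks into normalization.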
3 Proposed Model
In this section, we design the Heterogeneous Event Aggregation (HEA) module to capture the interaction information among heterogeneous clinical events effectively. The motivations of HEA are as follows: (1) modeling the interaction of both categorical and numerical heterogeneous clinical events from their embeddings; (2) grouping events into multiple heads that reflect different aspects; (3) shortening the clinical event sequence.
After features are extracted by HEA, the temporal dependency is captured by a bidirectional LSTM. Finally, the final outputs of the two directions are summed and forwarded to a single dense layer with sigmoid activation to produce the detection.
The proposed architecture is shown in Figure 1. Given a sequential clinical record, our objective is to generate an early prediction of sepsis for its last time step. The Heterogeneous Event Aggregation (HEA) module is composed of two parts, Heterogeneous Events Embedding and Attentional Multi-head Aggregation, which are explained below.
3.1 Heterogeneous Events Embedding
Given the sequential clinical data, the first step of our model is to generate an embedding that can be used to capture the interaction representation among the heterogeneous clinical events [liu2018learning]. Each time step is a 40-dimensional vector whose first 37 columns are numerical variables and whose last 3 columns are categorical variables. A randomly initialized numerical event vector book, a categorical event lookup table and a value vector table are generated. The embedding for each time step is then generated as:
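[Equations (1)-(4), defining the numerical embedding, the categorical embedding, their combination, and the key matrix, are not recoverable from the extracted text.]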
The Mask function sets an event embedding to zero if the corresponding variable is at its default (missing) value, and d is the embedding dimension. The quantities defined in equations (1)-(4) are the combined embedding of numerical and categorical events, the numerical embedding, the categorical embedding, and the key matrix of both numerical and categorical events.
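Because the formulas themselves are not available here, the following NumPy sketch shows only one plausible reading of the embedding step, not the authors' exact equations. It assumes that a numerical event is embedded by scaling a per-variable vector with its normalized value, that a categorical event is embedded by table lookup, and that label 0 marks a missing categorical value. The key matrix is omitted, and the names `num_book`, `cat_table` and `embed_time_step` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_NUM, N_CAT_LABELS, D = 37, 7, 16  # 37 numerical variables, labels 0..6, embedding size d

num_book = rng.normal(size=(N_NUM, D))          # one vector per numerical variable
cat_table = rng.normal(size=(N_CAT_LABELS, D))  # one vector per categorical label

def embed_time_step(x_num: np.ndarray, x_cat: np.ndarray):
    """Embed one time step.

    x_num: (37,) normalized values, NaN where the variable is missing.
    x_cat: (3,) integer labels, 0 where the variable is missing.
    Returns the combined embedding E of shape (40, d) and a boolean
    observation mask of shape (40,).
    """
    num_mask = ~np.isnan(x_num)
    # Assumption: scale each variable's vector by its observed value.
    e_num = np.nan_to_num(x_num)[:, None] * num_book
    e_num = e_num * num_mask[:, None]            # Mask: zero out missing events
    cat_mask = x_cat != 0
    e_cat = cat_table[x_cat] * cat_mask[:, None]  # lookup, then Mask
    return np.vstack([e_num, e_cat]), np.concatenate([num_mask, cat_mask])

x_num = rng.normal(size=N_NUM)
x_num[5] = np.nan                # one missing numerical variable
x_cat = np.array([1, 0, 6])      # second categorical variable missing
E, observed = embed_time_step(x_num, x_cat)
print(E.shape)  # (40, 16)
```

Under this reading, every clinical event keeps its own row in E, so later stages can still distinguish event types instead of receiving one mixed vector.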
3.2 Attentional Events Aggregation
It is difficult to extract information from long sequential data for two reasons: (1) within a long sequential record, the interactions among heterogeneous events can be complex, so it is difficult to capture a dynamic representation of the interacting events; (2) the total dimension of the heterogeneous event embeddings can be very large, which makes it hard for the sequential model to capture the temporal representation.
To capture the dynamic heterogeneous event representation effectively, we propose attentional multi-head aggregation. Given the event embedding of a time step, M randomly initialized mask vectors are generated. Different masks are used to capture different aspects of the event interactions at that time step with an attention-based mechanism. Finally, all heads are concatenated to produce the final aggregation representation. The details of the aggregation are as follows:
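[Equations (5)-(7), defining each aggregation head, the concatenation of all heads, and the attention weights, are not recoverable from the extracted text.]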
Here, each head is one aggregation representation, and all heads are concatenated into the final representation. The attention mechanism of each head obtains the aggregation proportion of each event by computing the dot product between the events and the mask. Each head can capture the event information it attends to through its own transform matrices and mask vector m. An example of two-head aggregation is shown in Figure 2.
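The aggregation described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the exact formulation: it assumes two transform matrices per head (here called `W_k` and `W_v`), a softmax over the dot-product scores, and it uses the event embedding E for both scoring and aggregation instead of a separate key matrix.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()          # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def aggregate(E, masks, W_k, W_v):
    """Attentional multi-head aggregation for one time step.

    E:     (n_events, d) event embeddings.
    masks: (M, d) one mask vector per head.
    W_k, W_v: (M, d, d) per-head transform matrices (assumed shapes).
    Returns the concatenation of all heads, shape (M * d,).
    """
    heads = []
    for m, wk, wv in zip(masks, W_k, W_v):
        scores = (E @ wk) @ m           # dot product between each event and the mask
        alpha = softmax(scores)         # aggregation proportion of each event
        heads.append(alpha @ (E @ wv))  # weighted sum of the transformed events
    return np.concatenate(heads)

rng = np.random.default_rng(0)
n_events, d, M = 40, 16, 8
E = rng.normal(size=(n_events, d))
masks = rng.normal(size=(M, d))
W_k = rng.normal(size=(M, d, d))
W_v = rng.normal(size=(M, d, d))

A_t = aggregate(E, masks, W_k, W_v)
print(A_t.shape)  # (128,)
```

In this sketch the 40 events of a time step are reduced to M head vectors, which is how aggregation shortens the input while each head can still weight events differently.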
3.3 Sequential Model and Prediction
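For each time step, we obtain an aggregation representation. Over all time steps of the input window, a temporal event aggregation representation A is captured. Because LSTM has been used successfully on sequential data, we pass A to a one-layer bidirectional LSTM module. We sum the last-time-step outputs of the forward and backward units and obtain the logits through a single dense layer with sigmoid activation. Since our objective is binary classification, we use the cross-entropy loss:
\mathcal{L} = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right] (8)
where y \in \{0, 1\} is the sepsis label and \hat{y} is the sigmoid output.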
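The prediction step and the loss can be illustrated with a short NumPy sketch. It takes the last forward and backward LSTM outputs as given arrays (the LSTM itself is not implemented here), and the names `predict` and `binary_cross_entropy`, the hidden size, and the clipping constant are assumptions of this sketch.

```python
import numpy as np

def predict(h_fwd: np.ndarray, h_bwd: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Sum the last forward/backward outputs, apply a dense layer and a sigmoid.

    h_fwd, h_bwd: (batch, hidden) last-time-step outputs of the two directions.
    w: (hidden,) dense-layer weights; b: scalar bias.
    """
    logit = (h_fwd + h_bwd) @ w + b
    return 1.0 / (1.0 + np.exp(-logit))

def binary_cross_entropy(y_true: np.ndarray, y_prob: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross entropy; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))

rng = np.random.default_rng(0)
h_fwd = rng.normal(size=(2, 32))
h_bwd = rng.normal(size=(2, 32))
probs = predict(h_fwd, h_bwd, w=np.zeros(32), b=0.0)  # zero weights give 0.5
loss = binary_cross_entropy(np.array([1.0, 0.0]), probs)
print(round(loss, 4))  # 0.6931
```

With untrained (zero) weights the model outputs 0.5 for every sample, so the loss equals ln 2, which is a convenient sanity check before training.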
4 Experiment Results
4.1 Implementation Details
The provided data is divided into train and test sets at a ratio of 7 to 3. For a further experiment, we divide the data into train and held-out sets at a ratio of 9 to 1 and then perform 5-fold cross validation. To evaluate our proposed module, we compare against an MLP prediction model and against Transformer and LSTM models with dense layer embedding, and we vary the number of heads to obtain different variants of our model. The results show that our proposed model is efficient and clearly improves performance. We keep the window length L at 24 (one day) in all experiments.
Multi-head aggregation: We keep the embedding dimension d at 16 and set the number of heads to 1, 8 and 16, giving 3 different aggregation modules.
We use the Area Under the Receiver Operating Characteristic curve (AUC) and the Area Under the Precision-Recall Curve (APC) as our evaluation metrics. In addition, the utility score function defined in the CinC2019 challenge is used as an extra metric.
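For reference, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is a minimal illustration of that definition, not the evaluation code used in the experiments; the function name `roc_auc` is hypothetical, and the pairwise matrix makes it suitable only for small inputs.

```python
import numpy as np

def roc_auc(y_true, scores) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]   # every positive score minus every negative score
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

In the toy example, three of the four positive-negative pairs are ordered correctly, which gives 0.75.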
4.2 Baseline
MLP: Without considering temporal information, a multilayer perceptron can be used to model the raw record directly.
Dense Layer Embedding: Given a record x sampled at one time step, the dense representation layer uses a single dense layer with an activation function to generate the representation.
Transformer / LSTM: The conventional sequential models, Transformer and LSTM, use a single dense layer embedding to capture the event representation.
4.3 Result
We first conduct an experiment with different models over the train and test sets. The results on the test set are shown in Table 1. Note that all metrics in both Table 1 and Table 2 are based on the locally partitioned test set from the public dataset.
The results show that the heterogeneous event aggregation modules clearly improve the metrics. We then conduct a further experiment with 5-fold cross validation. The results on the held-out set are shown in Table 2.
In addition, our proposed model is highly efficient, because it only needs to compute the attention scores between events and multiple heads. On a GeForce GTX 1080 10G, our proposed model with 16 heads takes 10 minutes per epoch to train over 10240000 training samples.
In the PhysioNet/Computing in Cardiology Challenge 2019, we obtained utility scores of 0.402, 0.386 and 0.169 on test sets A, B and C, with an overall utility score of 0.321, ranking 13th out of 78 teams.
5 Conclusion
We proposed an attention-based sequential representation model for early sepsis prediction from clinical data. The model includes two main parts: extraction of clinical event interactions with heterogeneous event aggregation, and capture of temporal interactions with an LSTM. Experiments in the PhysioNet/Computing in Cardiology Challenge 2019 show that the heterogeneous event aggregation module shortens the clinical event sequence for better temporal dependency modeling, and that storing the aggregation representation separately in different heads retains the temporal interactions of events.
Acknowledgments
This paper is partially supported by the National Key Research and Development Program of China under Grant No. 2018AAA0101900, the Beijing Municipal Commission of Science and Technology under Grant No. Z181100008918005, and the National Natural Science Foundation of China (NSFC Grant No. 61772039 and No. 91646202).