1 Introduction
An irregularly (or unevenly) sampled time series is a sequence of samples with irregular time intervals between observations. This class of data adds a time-sparsity factor when the intervals between observations are large. Most machine learning methods have no notion of time; they consider only observation order, which makes it harder to learn the time dependencies found in time series problems. To address this, recent work proposes models able to handle such irregularity [11, 1, 3, 19], but they often rely on complex mechanisms to represent irregularity or to impute missing data.
In this paper, we introduce a novel way to represent time as a dense vector representation that improves the expressiveness of irregularly sampled data; we call it Time Embeddings (TEs). The proposed method is based on sinusoidal functions discretized at the observation times to create a continuous representation of time. TEs can make a model capable of estimating time intervals between observations, and they do so without adding any trainable parameters.
We evaluate the method with a publicly available real-world dataset of irregularly sampled electronic health records called MIMIC-III [9]. The tests cover two tasks: a classification task (in-hospital mortality prediction) and a regression task (length of stay).
To evaluate the impact of time representation on data expressiveness we used LSTM and Self-Attentive LSTM models. Both are common RNN models reported to achieve strong performance in several time series classification problems, and specifically with the MIMIC-III dataset [11, 19, 1, 22]. We also evaluated simpler models such as linear and logistic regression and a shallow Multi-Layer Perceptron. All models were evaluated with and without TEs to assess possible improvements.
2 Related Work
This work focuses on how a machine learning method can learn representations from irregularly sampled data. Irregularity is found in many different areas, such as electronic health records [21], climate science [18], ecology [4], and astronomy [17].
Some works treat irregularity as a missing-data problem: the time axis is discretized into fixed non-overlapping intervals, and intervals with no observations are then said to contain missing values. This approach was taken by [14] and [11]. Lipton [11] showed that binary missingness indicators and observation time deltas improve Recurrent Neural Network based models more than imputation does, even with the sparsity of binary masks.
Despite this improvement, an issue with these methods is that they miss the potential informativeness of the observation times themselves [12]. More recently, [19] introduced a neural network model that learns to interpolate missing data and avoids time discretization by turning an irregularly sampled time series into a regular one.
[1] also proposed a method to improve discretization through data augmentation based on temporal clustering. Another approach is to build models capable of dealing with irregularity directly. The work of [3] describes a GRU (Gated Recurrent Unit) model called GRU-D. It takes binary missingness indicators and observation time deltas as inputs and incorporates them into the GRU gates to control a decay rate for missing data.
[2] proposed a similar method using LSTM cell states to refine the decay concept. The concept proposed in this work is closest to [11] and [3], which also use an additional input to describe observation time deltas. Instead of time intervals, however, we propose a way to describe the exact moment of observation using continuous cyclic functions. This makes it possible to compute the time between any two irregular observations with a linear operation, without a cumulative sum over all intermediate data, while also avoiding fixed-length time discretization and interpolation noise. Another difference is that Time Embeddings are dense representations that avoid the unnecessary sparsity of missingness masks.
3 Methodology
3.1 Positional Embeddings
The concept of Positional Embeddings (PE) was first introduced in [5], where the authors used vectors to represent word positions in a sentence. It was initially proposed to improve the ability of Convolutional Neural Networks to handle temporal data: since a CNN does not consider order, the PE introduces a numerical representation of order into the embedding latent space.
The Transformer [20] brought back the PE to improve a neural network based only on attention modules. The model has the same difficulty modeling order, as it contains no recurrence; the authors propose a set of sinusoidal functions discretized at each relative input position.
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))    (1)
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))    (2)
In the equations above, pos indicates the position, i the dimension index, and d_model is the dimension of the original embedding space. Each dimension thus corresponds to a sinusoid, and the model is able to learn relative positions since, as the authors argue, "for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos)" [20]. The total dimension of the positional embedding is d_model. The wavelengths form a geometric progression from 2π to 10000·2π. The largest wavelength defines the maximum number of inputs: if a position exceeds it, the embeddings start to become redundant.
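As a concrete illustration, the sinusoidal positional embedding of Equations (1) and (2) can be computed in a few lines of NumPy (a minimal sketch; the function name and the default d_model of 32 are our own choices):

```python
import numpy as np

def positional_embedding(positions, d_model=32, base=10000.0):
    """Sinusoidal positional embeddings as in the Transformer [20].

    positions: 1-D sequence of integer positions.
    Returns an array of shape (len(positions), d_model) where even
    dimensions hold the sin components and odd dimensions the cos ones.
    """
    positions = np.asarray(positions, dtype=np.float64)
    i = np.arange(d_model // 2)                   # pair index
    freqs = 1.0 / base ** (2 * i / d_model)       # geometric progression of frequencies
    angles = positions[:, None] * freqs[None, :]  # (n_positions, d_model / 2)
    pe = np.empty((len(positions), d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

Each row pairs a sine and a cosine per frequency, which is what makes the relative-offset property quoted above hold: a fixed offset corresponds to a fixed rotation of every pair.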
3.2 Time Embeddings
Inspired by the Transformer position representation, we propose a positional embedding for irregular positions. Just as [20] discretizes sinusoidal functions at integer positions, it is possible to discretize them at irregular hour times or dates. Applying these time descriptors to an irregularly sampled series makes the data itself representative of time.
TE(t, 2i) = sin(t / t_max^(2i / d_TE))    (3)
TE(t, 2i+1) = cos(t / t_max^(2i / d_TE))    (4)
To do this, we redefine the equations over irregular timestamps. Instead of a position indicator there is a continuous variable t. The dimension of the TEs (d_TE) can be parameterized, and t_max defines the maximum time that can be represented.
The relation between maximum time and TE dimension can be a limiting factor: as the maximum time increases, the distance between TEs becomes smaller. To mitigate this, it is possible to increase the TE dimensionality or set a reasonable maximum time.
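A minimal sketch of the TE computation follows (the function name and defaults are illustrative; the 48-hour t_max matches the setting later used in the experiments):

```python
import numpy as np

def time_embedding(t, d_te=32, t_max=48.0):
    """Time Embeddings for continuous timestamps t (e.g. hours since admission).

    Same construction as the sinusoidal positional embedding, but the
    discrete position index is replaced by the continuous time t and the
    base by the maximum representable time t_max.
    """
    t = np.atleast_1d(np.asarray(t, dtype=np.float64))
    i = np.arange(d_te // 2)
    freqs = 1.0 / t_max ** (2 * i / d_te)   # one frequency per sin/cos pair
    angles = t[:, None] * freqs[None, :]
    te = np.empty((len(t), d_te))
    te[:, 0::2] = np.sin(angles)
    te[:, 1::2] = np.cos(angles)
    return te
```

Because t is continuous, nothing forces the observations onto a grid: each irregular timestamp gets its own embedding.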
The main advantages of using TEs can be summarized as follows:

- They do not require any optimizable parameter, making them a model-free choice for dealing with irregularity.

- The time delta between two TEs can be computed linearly, potentially improving the recognition of long-term dependencies.

- All TEs have the same norm, avoiding the large values that time-delta descriptors can produce when intervals between observations are big.
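The last two properties can be checked numerically. The sketch below (d = 8, t_max = 48, and the timestamps are illustrative values) shows that every TE has the same norm, and that the time delta between two observations can be read off a single sin/cos pair, provided the delta is shorter than that pair's wavelength (lower-frequency pairs cover longer offsets):

```python
import numpy as np

def te(t, d=8, t_max=48.0):
    # minimal Time-Embedding helper for this demo (interleaved sin/cos pairs)
    i = np.arange(d // 2)
    freqs = 1.0 / t_max ** (2 * i / d)
    a = t * freqs
    out = np.empty(d)
    out[0::2], out[1::2] = np.sin(a), np.cos(a)
    return out

e1, e2 = te(3.0), te(8.0)

# 1) every TE has the same norm: each (sin, cos) pair contributes 1,
#    so ||TE|| = sqrt(d / 2) regardless of t
print(np.linalg.norm(e1), np.linalg.norm(e2))   # both ≈ 2.0 = sqrt(8 / 2)

# 2) the time delta is recoverable linearly from one frequency pair:
#    the first pair rotates at angular speed freqs[0] = 1 rad/hour
a1 = np.arctan2(e1[0], e1[1])
a2 = np.arctan2(e2[0], e2[1])
delta = (a2 - a1) % (2 * np.pi)
print(delta)   # ≈ 5.0 (= 8.0 - 3.0 hours)
```

Note that no cumulative sum over intermediate observations is needed: the delta comes from the two embeddings alone.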
4 Experiments
We evaluate the proposed method on two benchmark tasks, in-hospital mortality and length-of-stay prediction, both with the publicly available MIMIC-III dataset [9]. The following sections briefly describe the data acquisition and preprocessing used, followed by the test results and discussion.
4.1 Dataset and training details
To assess the method's performance we used the MIMIC-III benchmark dataset, following the benchmark defined by [7, 8]. With the available code we extracted sequences from the first hours of in-hospital stays and split them into training and test sets, yielding datasets for the in-hospital mortality after 48 hours task and for length of stay after 24 hours.
The dataset contains 18 real-valued variables and five categorical ones. We normalized the real-valued variables to zero mean and unit variance; categorical variables are represented with one-hot encoding. For the length-of-stay task we also converted labels from hours to days to avoid large outputs, converting them back to hours to report results.
To make the dataset even more irregular, we randomly removed part of the observed test data. This artificially creates bigger time gaps, allowing us to re-evaluate the models under increased irregularity.
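A possible implementation of this stress test is sketched below (a hypothetical helper; the exact subsampling procedure is not specified here):

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_observations(series, keep_frac):
    """Randomly keep a fraction of the observed rows of an irregular series.

    series: array of shape (n_obs, n_features), one row per observation.
    Removing rows enlarges the gaps between consecutive timestamps,
    which is what increases the irregularity of the test data.
    """
    n = len(series)
    keep = rng.choice(n, size=max(1, int(n * keep_frac)), replace=False)
    return series[np.sort(keep)]   # preserve temporal order

x = np.arange(20).reshape(10, 2)   # toy series: 10 observations, 2 features
print(drop_observations(x, 0.5).shape)   # (5, 2)
```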
All models were trained with PyTorch [15] on a P100, with a batch size of 100 and the AdamW optimizer [13] with AMSGrad [16]. We performed five-fold cross-validation with 10 runs on each fold. The model with the best validation performance (AUC for in-hospital mortality and Mean Absolute Error for length of stay) was selected to compose the average performance on the test set. We report the mean and standard error of the evaluation measures on the test set.
4.2 Baselines and tests
As a baseline we primarily compared TEs with binary masking plus time-interval indicators, reported to perform well with RNNs in [11]. The comparison used a regular LSTM [6] and a Self-Attentive LSTM [10], since RNNs are reported to achieve the best results on the evaluated tasks [11, 19, 1].
TEs were tested with a dimension (d_TE) of 32 and a maximum time of 48 hours. Since binary masking and concatenated TEs increase the input dimension, we adjusted the LSTM hidden size to keep the number of parameters close across models, as described in Table 1.
All neural models are followed by a 3-layer Multi-Layer Perceptron (MLP) with 32, 32, and 16 neurons. The last layer is a two-neuron softmax for in-hospital mortality and a single linear neuron with ReLU for length of stay.
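A sketch of this shared prediction head in PyTorch (the input size of 32 is an assumption; in practice it equals the hidden size of the preceding model, which varies per Table 1):

```python
import torch
import torch.nn as nn

def prediction_head(task, in_features=32):
    """3-layer MLP head (32, 32, 16 neurons) shared by all models.

    task: "mortality" for the two-class softmax output,
    anything else for the single-neuron length-of-stay regressor.
    """
    layers = [nn.Linear(in_features, 32), nn.ReLU(),
              nn.Linear(32, 32), nn.ReLU(),
              nn.Linear(32, 16), nn.ReLU()]
    if task == "mortality":
        # two-neuron softmax output for in-hospital mortality
        layers += [nn.Linear(16, 2), nn.Softmax(dim=-1)]
    else:
        # single linear neuron with ReLU for (non-negative) length of stay
        layers += [nn.Linear(16, 1), nn.ReLU()]
    return nn.Sequential(*layers)

probs = prediction_head("mortality")(torch.randn(4, 32))
print(probs.shape)   # torch.Size([4, 2])
```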
Table 1: Evaluated models, LSTM hidden size (h), and number of parameters.

Model                                    | Alias           | h  | Nº of params
Vanilla LSTM                             | LSTM            | 34 | 15.5k
LSTM with Binary Masking                 | BM + LSTM       | 22 | 15.5k
LSTM with TE concatenated                | catTE + LSTM    | 26 | 15.5k
Self-Attentive LSTM                      | SA-LSTM         | 32 | 15.5k
Self-Attentive LSTM with Binary Masking  | BM + SA-LSTM    | 22 | 15.5k
Self-Attentive LSTM with TE added        | addTE + SA-LSTM | 32 | 15.5k
The self-attention was implemented as introduced in [10], with the only difference of using unidirectional LSTMs. The attention size (d_a) was 32 and the number of attentions (r) was 8. We also used the penalization term from [10].
Since the MIMIC-III data are composed of multivariate series, we assume that Time Embeddings (TEs) should not be combined with the inputs directly. We therefore use TEs in two different ways: as additional inputs, replacing the missing mask (concatenation), and as a latent-space transformation, adding the TEs to the RNN output hidden states.
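The two ways of injecting TEs can be sketched as follows (illustrative PyTorch code: the feature count is arbitrary, the hidden sizes follow Table 1, and sin/cos halves are concatenated rather than interleaved for brevity):

```python
import torch
import torch.nn as nn

def sinusoid_te(t, d_te=32, t_max=48.0):
    # Time Embeddings for a batch of timestamps t of shape (batch, seq_len)
    i = torch.arange(d_te // 2, dtype=torch.float32)
    freqs = 1.0 / t_max ** (2 * i / d_te)
    ang = t.unsqueeze(-1) * freqs              # (batch, seq_len, d_te / 2)
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

n_feat = 28                                    # illustrative feature count
lstm_cat = nn.LSTM(n_feat + 32, 26, batch_first=True)  # catTE: TE joins the input
lstm_add = nn.LSTM(n_feat, 32, batch_first=True)       # addTE: TE joins hidden state

x = torch.randn(4, 10, n_feat)                 # (batch, seq, features)
t = torch.cumsum(torch.rand(4, 10), dim=1)     # irregular timestamps in hours
te = sinusoid_te(t)

# catTE: concatenate TEs to the observations before the RNN
out_cat, _ = lstm_cat(torch.cat([x, te], dim=-1))

# addTE: add TEs to the RNN hidden states (requires hidden size == d_te)
h, _ = lstm_add(x)
out_add = h + te
print(out_cat.shape, out_add.shape)   # torch.Size([4, 10, 26]) torch.Size([4, 10, 32])
```

The addTE variant is what keeps the input dimensionality unchanged, which is the property highlighted in the conclusions for high-dimensional data.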
To also have a baseline of non-recurrent models and assess the TE effect on them, we tested a four-layer MLP and linear/logistic regression (linear for the length-of-stay task, logistic for in-hospital mortality).
4.3 Results
The in-hospital mortality results show that self-attention seems to deteriorate the vanilla LSTM's performance, but with TEs added it improves enough to surpass it, achieving our best average result.
In the length-of-stay task TEs achieved better results, especially with the bigger gaps of the reduced-data test. TEs improved the LSTM's average error but slightly worsened its explained variance, where binary masking performed better.
Figures 3 and 4 show the models' performance when we randomly remove observed data, keeping from 100% down to 10% of it. In the length-of-stay task, the LSTM with concatenated TEs has an overall smaller absolute error than the vanilla LSTM, being surpassed only by the binary mask. For in-hospital mortality we see similar performance between the TE SA-LSTM and the LSTM with binary masking.
The non-recurrent models show that TEs do not rely on recurrence: they improved both the linear/logistic regression and the MLP.
Model               | MAE            | RMSE             | EV            | AUC-ROC       | AP
Lin.R/Log.R         | 65.829 (2.001) | 133.178 (2.777)  | 0.040 (0.034) | 0.783 (0.003) | 0.357
BM + Lin.R/Log.R    | 68.343 (2.453) | 127.579 (2.044)  | 0.104 (0.029) | 0.804 (0.003) | 0.369
catTE + Lin.R/Log.R | 69.585 (0.870) | 129.484 (1.220)  | 0.077 (0.017) | 0.794 (0.005) | 0.359
MLP                 | 64.983 (1.695) | 137.949 (14.440) | 0.053 (0.236) | 0.807 (0.002) | 0.374
BM + MLP            | 63.181 (3.120) | 125.888 (0.781)  | 0.133 (0.007) | 0.804 (0.004) | 0.365
catTE + MLP         | 64.154 (1.734) | 129.259 (2.115)  | 0.090 (0.019) | 0.805 (0.003) | 0.355
LSTM                | 62.597 (0.957) | 122.284 (0.266)  | 0.174 (0.004) | 0.855 (0.002) | 0.481
BM + LSTM           | 61.666 (0.702) | 121.899 (0.442)  | 0.180 (0.006) | 0.854 (0.004) | 0.485
catTE + LSTM        | 62.354 (0.686) | 123.253 (0.155)  | 0.164 (0.002) | 0.846 (0.004) | 0.476
SA-LSTM             | 63.353 (0.715) | 122.976 (0.215)  | 0.167 (0.003) | 0.851 (0.005) | 0.448
BM + SA-LSTM        | 61.955 (0.460) | 121.667 (0.089)  | 0.185 (0.001) | 0.854 (0.006) | 0.454
addTE + SA-LSTM     | 62.591 (0.321) | 122.325 (0.475)  | 0.173 (0.006) | 0.856 (0.003) | 0.482

Values reported as mean (standard deviation). MAE, RMSE, and EV refer to the 24h length-of-stay task; AUC-ROC and AP to 48h in-hospital mortality.
5 Conclusions
This paper proposed a novel method to represent clock times or dates as dense vectors to improve learning from irregularly sampled time series. It was applied in two different ways and evaluated on two tasks from the MIMIC-III dataset. Our method showed improvement with most models tested, including recurrent neural networks and classic machine learning methods.
Despite being outperformed by binary masking in some tests, we believe TEs remain a viable option, especially for very irregular time series and high-dimensional data, where TEs can be applied by addition without increasing the input dimensionality.
6 Future Work
We see a promising future for the proposed method. We expect to extend it to other types of irregular time-continuous data, and to evaluate how TEs can improve recent models proposed for irregularly sampled time series, such as GRU-D [3], interpolation networks [19], and Temporal-Clustering Regularization [1]. The code for the reported TE results will be made publicly available.
References
 [1] (2019) Temporal-clustering invariance in irregular healthcare time series. arXiv preprint arXiv:1904.12206.
 [2] Phased-LSTM based predictive model for longitudinal EHR data with missing values.
 [3] (2018) Recurrent neural networks for multivariate time series with missing values. Scientific Reports 8 (1), pp. 6085.
 [4] (2004) Population time series: process variability, observation errors, missing values, lags, and hidden states. Ecology 85 (11), pp. 3140–3150.
 [5] (2017) Convolutional sequence to sequence learning. CoRR abs/1705.03122.
 [6] (1999) Learning to forget: continual prediction with LSTM.
 [7] (2017) Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771.
 [8] (2019) Multitask learning and benchmarking with clinical time series data. Scientific Data 6 (1), pp. 96.
 [9] (2016) MIMIC-III, a freely accessible critical care database. Scientific Data 3, pp. 160035.
 [10] (2017) A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
 [11] (2016) Modeling missing data in clinical time series with RNNs. arXiv preprint arXiv:1606.04130.
 [12] (2019) Statistical analysis with missing data. Vol. 793, John Wiley & Sons.
 [13] (2019) Decoupled weight decay regularization. In International Conference on Learning Representations.
 [14] (2012) Unsupervised pattern discovery in electronic health care data using probabilistic clustering models. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, pp. 389–398.
 [15] (2017) Automatic differentiation in PyTorch. In NIPS-W.
 [16] (2019) On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237.
 [17] (1982) Studies in astronomical time series analysis. II. Statistical aspects of spectral analysis of unevenly spaced data. The Astrophysical Journal 263, pp. 835–853.
 [18] (1997) SPECTRUM: spectral analysis of unevenly spaced paleoclimatic time series. Computers & Geosciences 23 (9), pp. 929–945.
 [19] (2018) Interpolation-prediction networks for irregularly sampled time series.
 [20] (2017) Attention is all you need. CoRR abs/1706.03762.
 [21] (2018) Mining electronic health records (EHRs): a survey. ACM Computing Surveys (CSUR) 50 (6), pp. 85.
 [22] (2018) Patient2Vec: a personalized interpretable deep representation of the longitudinal electronic health record. IEEE Access 6, pp. 65333–65346.