. For instance, convolutional neural networks (CNNs) have had some success in detecting and classifying radiological abnormalities on chest x-rays, a particularly complex task [2, 21, 15, 12]. The majority of these studies have been designed for cross-sectional analyses, viewing a single image in isolation, and disregard the fact that a patient may have had previous medical imaging examinations for which the radiological reports are also available. It is standard practice for radiologists to take clinical history into account and to add context to their report by comparing with previous imaging. Some abnormalities will be long-standing, but others may change over time, with varying clinical relevance. Often in elderly patients or those with a history of smoking, the baseline x-ray appearances, i.e. when that patient is ‘well’, can still be abnormal. If individual films are viewed in isolation, it can be challenging to tell with certainty whether there are acute findings. If previous imaging is available, it is possible to determine whether there has been interval change, for example, acute consolidation (indicating infection). As with humans, it is expected that a neural network can learn from previous patient-specific information, in this case all prior chest radiographs for that patient and their corresponding reports.
The motivation for this work is to assess the potential of recurrent neural networks (RNNs) for the real-time detection of radiological abnormalities when modelling the entire series of past exams that are available for any given patient. In particular, we set out to explore the performance of Long Short-Term Memory (LSTM) networks [8, 10], which have lately become the method of choice in sequential modelling, especially when used in combination with CNNs for visual feature extraction [6, 20]. The technical challenge faced in our context is that sequential medical exams are event-based observations. As such, they are collected at times of clinical need, i.e. they are not equally spaced, and the number of historical exams available for each patient can vary greatly. Fig. 1 shows four longitudinal chest x-rays acquired on the same patient over a certain period of time. This figure also illustrates other challenges faced when modelling this type of longitudinal data: the images may be acquired using different x-ray devices (resulting in different image quality, e.g. resolution and brightness), there may be differences in patient positioning (e.g. supine, erect, rotated, degree of inspiration), differences in projection (postero-anterior and antero-posterior), and not all images are equally centred (e.g. there can be rotations and translations).
As LSTMs are typically applied on regularly-sampled data [9, 16, 17], they are ill-suited to work with irregular time gaps between consecutive observations, as previously noted [13, 3]. This is a particularly important limitation in our context as certain radiological abnormalities tend to be observed for longer periods of time whereas others are short-lived. In this article we demonstrate that an architecture combining a CNN with a simple modification of the standard LSTM is able to handle irregularly-sampled data and learn the temporal dynamics of certain visual features resulting in improved pattern detection. Using both simulated and real x-ray datasets, we demonstrate that this capability yields improved image classification performance over an LSTM baseline.
2 Motivating dataset and problem formulation
The dataset used in this study was collected from the historical archives of the PACS (Picture Archiving and Communication System) at Guy’s and St. Thomas’ NHS Foundation Trust, in London, during the period from January 2005 to March 2016. The dataset has been previously used for the detection of lung nodules [14] and for multi-label metric learning [1]. It consists of chest radiographs representative of an adult population and acquired using 40 different x-ray systems. Each associated radiological report was parsed using a natural language processing system for the automated extraction of radiological labels [14, 5]. For this study, we extracted a subset of patients having a history of at least two exams, which resulted in images (with used for training and for testing). Each image was scaled to a standard format of pixels. The resulting dataset has an average of examinations per patient with an average of days between consecutive exams per patient.
In what follows, each individual sequence of longitudinal chest x-rays along with its associated vector of radiological labels is denoted as, where is the patient index and is the time index. Typical chest x-ray datasets are characterised by relatively few examinations per patient (e.g. is around 4-5) and highly-irregular sampling rates. Our task is to predict the vector of image labels given the entire history of exams up to time plus the current image, i.e. .
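As a minimal sketch of this formulation, the snippet below represents one patient's history as an ordered list of exams and derives the time lapse (in days) between consecutive exams. The names `Exam` and `time_lapses`, the record layout, the example dates and the convention of assigning a lapse of 0 to the first exam are illustrative assumptions, not part of the original dataset or code.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence


@dataclass
class Exam:
    """One chest x-ray exam (hypothetical record layout)."""
    acquired: date            # acquisition date of the image
    labels: Sequence[int]     # binary vector of radiological labels


def time_lapses(history: List[Exam]) -> List[int]:
    """Days elapsed between each exam and the previous one.

    The first exam has no predecessor, so its lapse is set to 0
    (an assumption made for this sketch).
    """
    if not history:
        return []
    ordered = sorted(history, key=lambda e: e.acquired)
    lapses = [0]
    for prev, curr in zip(ordered, ordered[1:]):
        lapses.append((curr.acquired - prev.acquired).days)
    return lapses


# Made-up history with three exams and four binary labels per exam.
history = [
    Exam(date(2010, 1, 5), [0, 0, 0, 0]),
    Exam(date(2010, 1, 19), [0, 1, 0, 0]),
    Exam(date(2011, 6, 2), [1, 0, 0, 0]),
]
lapses = time_lapses(history)
```

The irregular gaps in `lapses` (two weeks, then well over a year) are the kind of event-based spacing that the model has to cope with.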
3 Time-modulated LSTM
LSTMs are a particular type of RNN able to classify, process and predict time series [10, 8]. The internal state of an LSTM (a.k.a. the cell state or memory) gives the architecture its ability to ‘remember’. A standard LSTM contains memory blocks, and blocks contain memory cells. A typical memory block is made of three main components: an input gate controlling the flow of input activations into the memory cell, an output gate controlling the output flow of cell activations, and a forget gate for scaling the internal state of the cell. The forget gate modulates how much information is used from the internal state of the previous time-step. However, standard LSTMs are ill-suited for our task, where the time between consecutive exams is variable, because they have no mechanism for explicitly modelling the arrival time of each observation. In fact, it has been shown that LSTMs, and more generally RNNs, underperform with irregularly sampled data or time series with missing values [13, 4]. Previous attempts to adapt LSTMs for use with irregularly sampled datapoints have mostly focused on speeding up the convergence of the algorithm in settings with high-resolution sampled data [13] or on discounting short-term memory [3].
To address these issues, we introduce two simple modifications of the standard LSTM architecture, called time-modulated LSTM (tLSTM), both making explicit use of the time indexes associated with the inputs. In the proposed architecture, all the images for a given patient are initially processed by a CNN architecture, which extracts a set of imaging features, denoted by , at each time step. The LSTM takes as inputs , i.e. the radiological labels describing the images acquired at the previous time-step, the current image features, , and the time lapse between and , which we denote as . For the last image in the sequence, the LSTM predicts the image labels, , called . Figure 2 provides a high-level overview of this model and the equations below define the tLSTM unit:
Here, defines the internal state at time-step , while , and refer to the forget, input and output gates at time-step , respectively. These are all computed as linear combinations of the vectors and the scalar , and then transformed by a sigmoid function,. The matrices denoted by contain learnable weights indexed by two letters (e.g. contains the weights of the forget gate for labels , and so on). At time , we initialise (an array of zeros) and . The time lapses, , linearly modulate the information inside the internal cell state as well as the output, forget and input gates.
A different variation of the previous model (tLSTMv2) uses the time lapse only to modulate the internal state, . In this case, each actively contributes to updating directly and, implicitly, to estimating the label vector, i.e.
The form of the other updating equations, i.e. and , is similar to those in Eq. (1), without the elements.
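To make the description of the two variants more concrete, the following NumPy sketch implements one step of a time-modulated unit under stated assumptions. The exact update equations are not reproduced here, so the functional form (sigmoid gates over the image features, the previous labels and the scalar time lapse, with a tanh cell update), the names `init_params` and `tlstm_step`, the weight keys such as `"fl"`, and the zero initial label vector are all illustrative choices rather than the authors' definition. The output layer that maps the unit output to predicted labels is omitted.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_feat, n_lab, n_hid, rng):
    """Random weights; W[g + s] maps source s to gate g ('fl': labels -> forget gate)."""
    sizes = {"x": n_feat, "l": n_lab, "d": 1}   # features, labels, time lapse
    W = {g + s: rng.normal(0.0, 0.1, (n_hid, n))
         for g in "fioc" for s, n in sizes.items()}
    b = {g: np.zeros(n_hid) for g in "fioc"}
    return W, b


def tlstm_step(x, l_prev, delta, c_prev, W, b, gates_use_time=True):
    """One step of a time-modulated LSTM unit (assumed form).

    gates_use_time=True : the time lapse enters all gates and the cell update.
    gates_use_time=False: the time lapse enters only the cell update (v2-like).
    """
    d = np.array([delta], dtype=float)

    def lin(g, use_time):
        z = W[g + "x"] @ x + W[g + "l"] @ l_prev + b[g]
        return z + W[g + "d"] @ d if use_time else z

    f = sigmoid(lin("f", gates_use_time))          # forget gate
    i = sigmoid(lin("i", gates_use_time))          # input gate
    o = sigmoid(lin("o", gates_use_time))          # output gate
    c = f * c_prev + i * np.tanh(lin("c", True))   # internal state
    h = o * np.tanh(c)                             # unit output
    return h, c


rng = np.random.default_rng(0)
W, b = init_params(n_feat=8, n_lab=4, n_hid=5, rng=rng)
c = np.zeros(5)        # internal state initialised to zeros
l_prev = np.zeros(4)   # assumption: no labels known before the first exam
for delta in [0.0, 14.0, 499.0]:
    x = rng.normal(size=8)   # stand-in for CNN image features
    h, c = tlstm_step(x, l_prev, delta, c, W, b)
```

Setting `gates_use_time=False` removes the time-lapse term from the three gates and keeps it only in the internal-state update, which is one way to read the second variant.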
4 Simulated data
In order to better assess the potential advantages introduced by the time-modulated LSTM in settings where observations are event-driven and the underlying patterns to be detected are time-varying, we generated simulated data as an alternative to the real chest x-ray dataset of Section 2. Simulating images enables us to precisely control the sampling frequency at which the relevant visual patterns appear and disappear over time as well as the signal-to-noise ratio. For this study, we simulated a population of image sequences of varying lengths. Within a sequence, each image consisted of a noisy background image containing one or more randomly placed digits drawn from the set . We simulated three kinds of patterns inspired by the radiological patterns seen in real medical images: (i) rare patterns consisting of digits appearing with low probability; (ii) common patterns consisting of rapidly appearing and resolving digits; (iii) persistent patterns, consisting of digits observed for extended periods of time. In analogy to medical images, each digit in our simulation represents a radiological abnormality to be detected, hence multiple (and possibly overlapping) digits are allowed to coexist within an image. The time lapse was modelled as a uniform random variable taking value in the interval. An example of simulated images can be found in the Supplementary Material.
5 Experimental results
In our experiments with the real x-ray dataset, the CNN component in our architecture consists of a pre-trained Inception v3 [18] without the classification layer. The imaging features (an array elements) from the CNN are used as inputs for the LSTM component along with the image labels. We considered four possible radiological labels: cardiomegaly, consolidation, pleural effusion and hiatus hernia. The performance of the time-modulated LSTM models is assessed by the PPV (Positive Predictive Value) and NPV (Negative Predictive Value) along with the F-score, i.e. the harmonic mean of precision and recall.
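For reference, the three metrics can be computed from the confusion counts of a single label as in the short sketch below. The function name `ppv_npv_fscore`, the convention of returning 0.0 for an empty denominator, and the example counts are assumptions made for illustration; the counts are not results from this study.

```python
def ppv_npv_fscore(tp, fp, tn, fn):
    """PPV, NPV and F-score for one binary label from confusion counts."""
    ppv = tp / (tp + fp) if tp + fp else 0.0       # precision
    npv = tn / (tn + fn) if tn + fn else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0    # sensitivity
    # F-score: harmonic mean of precision (PPV) and recall.
    f = 2 * ppv * recall / (ppv + recall) if ppv + recall else 0.0
    return ppv, npv, f


# Made-up counts: 40 true positives, 10 false positives,
# 930 true negatives, 20 false negatives.
ppv, npv, f = ppv_npv_fscore(tp=40, fp=10, tn=930, fn=20)
```

With multiple labels, one common choice is to compute these per label and report each separately; whether the reported figures are per label or aggregated is not specified here.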
We compared the performance of four models: the baseline CNN classifier (Inception v3) that only uses each current image to predict the labels, but does not exploit the historical exams for a given patient, and three variations of the architecture illustrated in Figure 2: one using the standard LSTM and the two versions of the time-modulated LSTM model introduced in Section 3. Both tLSTM versions introduced noticeable performance improvements; see Table 1. In particular, tLSTMv1 yields an increase of % in F-score over the baseline and % over a standard LSTM. Moreover, tLSTMv1 achieves a % improvement in PPV over the baseline. Overall, tLSTM achieves improved performance over the standard LSTM due to its ability to handle irregularly sampled data.
For the simulated dataset, we used a pre-trained AlexNet [11] as a feature extractor in combination with three versions of the LSTM for modelling sequences of images. A full table with results can be found in the Supplementary Material. We purposely introduced a sufficiently high level of noise in the visual patterns so as to make the classification problem with individual images particularly difficult; accordingly, the single-image classifier did not achieve acceptable classification results. Likewise, the architecture using a standard LSTM did not introduce significant improvements due to the irregularly sampled observations. On the other hand, larger classification improvements were achieved using the time-modulated LSTM units, as those were able to decode the sequential patterns by explicitly taking into account the time gaps between consecutive observations.
Our experimental results suggest that the modified LSTM architectures, combined with CNNs, are suitable for modelling sequences of event-based imaging observations. By explicitly modelling the individual time lapses between consecutive events, these architectures are able to better capture the evolution of visual patterns over time, which has a boosting effect on the classification performance. The full potential of these models is best demonstrated using simulated datasets, where we have control over the exact nature of the temporal patterns and the image labels are perfectly known. In real radiological datasets, there are often errors in some of the image labels due to typographical errors, interpretive errors, ambiguous language and, in some cases, long-standing findings not being mentioned. This can cause problems both in CNN training and testing. Despite these challenges, we have demonstrated that improved classification results can also be achieved by the time-modulated LSTM components on a large chest x-ray dataset. This provides empirical evidence that a patient’s imaging history can be used to improve automated radiological reporting. In future work, we plan more extensive testing of a system trained end-to-end on a much larger number of radiological classes. The code with the networks used for our experiments can be found online: https://github.com/WMGDataScience/tLSTM.
- Annarumma and Montana  M. Annarumma and G. Montana. Deep metric learning for multi-labelled radiographs. In 33rd Annual ACM SAC ’18, pages 34–37. ACM, 2018.
- Bar et al.  Y. Bar, I. Diamant, L. Wolf, S. Lieberman, E. Konen, and H. Greenspan. Chest pathology detection using deep learning with non-medical training. 2015 IEEE 12th International Symposium on Biomedical Imaging (ISBI), 2015:294–297, 07 2015.
- Baytas et al.  I M Baytas, C Xiao, X Zhang, F Wang, A K Jain, and J Zhou. Patient subtyping via time-aware LSTM networks. In 23rd ACM SIGKDD, 2017.
- Che et al.  Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu. Recurrent neural networks for multivariate time series with missing values. Scientific reports, 2018.
- Cornegruta et al.  S. Cornegruta, R. Bakewell, S. Withey, and G. Montana. Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks. 7th Workshop on Health Text Mining and Information Analysis, 2016.
- Donahue et al.  J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. ArXiv e-prints, Nov 2014.
- Esteva et al.  A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542:115–118, Jan 2017.
- Gers et al.  F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12:2451–2471, 1999.
- Graves et al.  A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. CoRR, 2013.
- Hochreiter and Schmidhuber  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Krizhevsky et al.  A. Krizhevsky, I. Sutskever, and G. E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
- Litjens et al.  G. Litjens, T. Kooi, B. Ehteshami Bejnordi, A. Arindra Adiyoso Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez. A Survey on Deep Learning in Medical Image Analysis. ArXiv e-prints, February 2017.
- Neil et al.  D. Neil, M. Pfeiffer, and S.-C. Liu. Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences. ArXiv e-prints, October 2016.
- Pesce et al.  E. Pesce, P.-P. Ypsilantis, S. Withey, R. Bakewell, V. Goh, and G. Montana. Learning to detect chest radiographs containing lung nodules using visual attention networks. ArXiv e-prints, December 2017.
- Rajpurkar et al.  P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, M. P. Lungren, and A. Y. Ng. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. ArXiv e-prints, November 2017.
- Shi et al.  X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 802–810. Curran Associates, Inc., 2015.
- Srivastava et al.  N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. CoRR, abs/1502.04681, 2015.
- Szegedy et al.  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- Gulshan et al.  V. Gulshan, L. Peng, M. Coram, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22):2402–2410, 2016.
- Vinyals et al.  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition, 2015 IEEE Conference on, pages 3156–3164. IEEE, 2015.
- Wang et al.  X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. ArXiv e-prints, May 2017.
7 Appendix: Artificial data simulation
In this section we describe the data simulation procedure. Each data point consists of a variable number of simulated images. The length of each sequence of images is allowed to vary from a minimum of to a maximum of , with an average of images per sequence. All the simulated images consist of a fixed grey background, random noise (a Gaussian blur) applied to this background, and simulated digits from a set .
At each time step, , we draw a random integer to represent the time elapsed between two consecutive images, and . Initially, at state , the image contains no digits. The digits allowed to be sampled at depend on the current state, , and the particular value of . Table 2 defines all the digits allowed to be seen at as a function of and . Each one of the allowed digits is then sampled with fixed probability. The digit is placed at a random location and its rotation angle is also randomly chosen.
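The sampling step described above can be sketched as follows. The transition table itself is not reproduced in this sketch, so `toy_rule` is a purely hypothetical stand-in for it; the function name `simulate_step`, the appearance probability, the image size and the range of time lapses are likewise assumed values chosen only for illustration.

```python
import random


def simulate_step(current, delta, allowed_next, p=0.5, size=64, rng=random):
    """Sample the digits shown in the next image.

    allowed_next(current, delta) returns the digits that may appear; it is a
    hypothetical stand-in for the transition table used in the simulation.
    Returns a list of (digit, x, y, angle) placements.
    """
    placements = []
    for digit in allowed_next(current, delta):
        if rng.random() < p:                  # fixed appearance probability
            x = rng.randrange(size)           # random location
            y = rng.randrange(size)
            angle = rng.uniform(0.0, 360.0)   # random rotation in degrees
            placements.append((digit, x, y, angle))
    return placements


def toy_rule(current, delta):
    # Hypothetical rule: digit 9 may always appear; digit 3 only after short gaps.
    return [9, 3] if delta <= 3 else [9]


rng = random.Random(0)
state = frozenset()            # the initial image contains no digits
delta = rng.randint(1, 10)     # assumed range of integer time lapses
placed = simulate_step(state, delta, toy_rule, rng=rng)
state = frozenset(d for d, *_ in placed)   # digits present in the new image
```

Rendering the placed digits onto the noisy background is left out, since it depends on the digit images used.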
According to this procedure, different digits behave differently: the digit "9" is independent of the other labels, whilst all the others depend on each other. Some labels (e.g. "6" or "9") can persist over longer periods of time; some digits are rare, i.e. have a low probability of appearing (e.g. "0"), while others are more frequent, i.e. have higher probabilities (e.g. "3"). These scenarios somewhat mimic rare and common abnormalities. Figure 3 provides an example of a typical sequence, in this case with only 4 simulated images, each one having one or two digits.
Using this procedure, we simulated independent training and testing datasets, which we used to compare our models with the standard LSTM and with AlexNet, as explained in the paper. The empirical results obtained on the test dataset can be found in Table 3.
Table 2: State transitions used to simulate sequences of images with time-varying visual patterns (columns: current state (), next state ()). Numbers in cells are the needed to take a given path between and , e.g. ‘7:10’ means , ‘1:3,5:9’ means or , and so on.
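The cell notation used in the caption can be read programmatically, as in the short sketch below. It assumes that each `low:high` range is inclusive at both ends, which the caption does not state explicitly; the helper names `parse_ranges` and `lapse_allows` are hypothetical.

```python
def parse_ranges(cell):
    """Parse a cell such as '1:3,5:9' into a list of (low, high) pairs."""
    ranges = []
    for part in cell.split(","):
        low, high = part.split(":")
        ranges.append((int(low), int(high)))
    return ranges


def lapse_allows(cell, delta):
    """True if the time lapse falls in any range of the cell (ends assumed inclusive)."""
    return any(low <= delta <= high for low, high in parse_ranges(cell))


pairs = parse_ranges("1:3,5:9")
in_gap = lapse_allows("1:3,5:9", 4)      # 4 lies between the two ranges
at_edge = lapse_allows("7:10", 7)        # lower end of a single range
```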