Comparing statistical methods to predict leptospirosis incidence using hydro-climatic covariables

by   Maria Jose Llop, et al.

Leptospirosis, the infectious disease caused by the spirochete bacteria Leptospira interrogans, constitutes an important public health problem all over the world. In Argentina, some regions present climatic and geographic characteristics that favor the habitat of the Leptospira bacteria, whose survival strongly depends on climatic factors. For this reason, regional public health systems should include, as a main factor, the incidence of the disease in order to improve the prediction of potential outbreaks, helping to stop or delay the transmission of the disease. The classic methods used to perform this kind of prediction are based on autoregressive time series tools which, as is well known, perform poorly when the data do not meet their requirements. Recently, several nonparametric methods have been introduced to deal with those problems. In this work, we compare a semiparametric method, called Semi-Functional Partial Linear Regression (SFPLR), with the classic ARIMA and a newer alternative, ARIMAX, in order to select the best predictive tool for the incidence of leptospirosis in the Argentinian Litoral region. In particular, SFPLR and ARIMAX are methods that allow the use of (hydrometeorological) covariables, which could improve the prediction of outbreaks of leptospirosis.




1 Introduction

Leptospirosis is a public health problem all over the world, particularly in tropical and subtropical areas. It is a zoonosis caused by the spirochete bacteria Leptospira interrogans (Cespedes05). Leptospira has been found in virtually all mammalian species examined. Humans most commonly become infected through occupational, recreational, or domestic contact with the urine of host animals, either directly or via contaminated water or soil (Adler10; Trueba04). Infectious vector-borne diseases, particularly leptospirosis, are climate-sensitive (WHOWMO12; Lopez18; Lopez19). Therefore, extreme climate events enhanced by climate change increase the problems associated with people's health (Cerda08; Sanchez09). In this context, both the average rainfall and the intensity and frequency of extreme precipitation events increased between 1951 and 2010 in northeastern Argentina (Barr09; Penalba10). This region has important rivers such as the Paraná and the Uruguay; as a consequence, the heaviest rainfall caused significant flooding in the last decade (Antico14; Lopez18), and this trend has continued to rise in recent years (Lovino18b). These extreme events influence the epidemiology of infectious diseases in the region (Lopez18; Lopez19). For these reasons, the response of health systems, besides the treatment of individual cases, should include an estimate of the number of cases of the disease. This estimate would improve the response of health systems during potential outbreaks, cutting off or delaying the transmission of the disease.


The modeling of leptospirosis in relation to hydroclimatic variables has been studied both deterministically (Triampo07; Zaman10; Zaman12; Holt06) and statistically (Chadsuthi12; Torres08). For example, Chadsuthi12 found that rainfall is correlated with cases of leptospirosis in both of the studied regions, while temperature is correlated with the disease in only one of those regions. Models can be useful tools to show the trend of the variable of interest (in this case, the incidence of leptospirosis) by fitting the recorded data, or to predict the next seasonal peak with higher accuracy.

From the statistical point of view, the methods commonly used to predict cases of epidemiological diseases are the well-known autoregressive models. The simplest of these is the autoregressive (AR) model of order $p$, initially developed by Yule27, which models the response variable linearly in terms of its own previous values and a stochastic term. The moving-average (MA) model, introduced later by Slutzky37, proposes that the response variable depends linearly on the current and various past values of the stochastic term. The autoregressive moving-average (ARMA) model (Wold38) combines both: the AR term models the response as a linear combination of its own past values, and the MA term models the error as a linear combination of previous values of the stochastic error. The autoregressive integrated moving average (ARIMA) model (Box94) is a generalization of ARMA to nonstationary series, that is, those whose mean and variance are not constant in time. The integrated part refers to an initial differencing step, which can be applied in order to eliminate the nonstationarity of the series. Some applications of this method to epidemiological time series can be found, for instance, in Promprou06; Liu11; Countin07. Although this model seems to be efficient, it only uses the variable of interest, without taking into account the additional information that some predictor variables may carry. As we mentioned before, when predicting outbreaks of leptospirosis in the Argentinian Litoral region, it is important to consider hydrometeorological covariables, since they can be indicators of outbreaks and could improve the prediction. In this direction, the ARIMAX model (see, for instance, Kongcharoen13) is an extension of ARIMA that proposes the use of covariables. Some applications of this method can be found in Chadsuthi12; Yogarajah13; avellaneda12.

Although autoregressive predictive models are simple and easy to understand and analyze, they are difficult to apply to real data since, in these situations, the data rarely meet the (linear) requirements of these methods. For instance, the large number of null values and the discrete nature of leptospirosis incidence data series make it difficult to perform the analysis with the usual time series methods. In this direction, nonparametric and semiparametric methods are a preferable alternative to autoregressive models. Semiparametric prediction of time series has been studied, among others, in Gyorfietal89, Bosq97, Bosq98 and Fanetal03. Some reviews can also be found in Wolfgangetal08. More recently, Biauetal10 studied the problem of sequential nonparametric prediction and proved some consistency results under mild conditions. In Shangetal11, the authors introduced a nonparametric prediction method with dynamic updating and showed its performance using monthly Niño indices from January 1950 to December 2008.

In this paper we consider the method introduced by Aneirosetal08, who proposed to treat the time series readings for each year as a function, thus obtaining a sample of functional data. More precisely, the idea behind this method is to split the observed long time series into short-time trajectories, all of the same length (for instance, a year), resulting in a sample of curves that can then be adapted to the functional data framework.

This means that, although the data are recorded only on a discrete grid of time points, by adapting them to the functional data framework we can assume a continuous underlying curve for each year.
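The splitting step can be illustrated with a minimal sketch. It assumes monthly data and yearly curves of 12 points; the function name `split_into_curves` and the counts below are hypothetical and are not taken from the paper's data or implementation.

```python
def split_into_curves(series, curve_length=12):
    """Cut a long series into consecutive, non-overlapping curves of equal length.

    Trailing observations that do not fill a complete curve are dropped.
    """
    if curve_length <= 0:
        raise ValueError("curve_length must be positive")
    n_curves = len(series) // curve_length
    return [series[i * curve_length:(i + 1) * curve_length]
            for i in range(n_curves)]


# Hypothetical monthly case counts for three years (illustrative values only).
monthly_counts = [0, 1, 3, 5, 2, 0, 0, 0, 1, 0, 0, 2,
                  1, 0, 4, 6, 3, 1, 0, 0, 0, 0, 1, 0,
                  0, 2, 2, 7, 1, 0, 0, 1, 0, 0, 0, 1]

curves = split_into_curves(monthly_counts)
for year_index, curve in enumerate(curves, start=1):
    print(year_index, curve)
```

Each element of `curves` plays the role of one discretized yearly trajectory. Dropping an incomplete final year is a choice made for this sketch only.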

The model by Aneirosetal08 is called Semi-Functional Partial Linear Regression (SFPLR). It is an additive model with two terms: one models the (temporal) response variable nonparametrically, and the other adds the information contained in the covariates by combining them linearly.

As the authors mention, the SFPLR method can have several advantages. Cutting the observed time series into a sample of curves and incorporating one single past trajectory, rather than many single past values, solves the problem of choosing the number of past values to be used in the construction of autoregressive prediction methods. In addition, in the context of infectious diseases with seasonal cycles depending on the climate, as is the case of leptospirosis, this approach uses the measured value of the (hydroclimatic) covariables only in the month of interest, which considerably improves the prediction. Another advantage of SFPLR is that it is not necessary to predict a future value of the covariables since, to perform the prediction, the method uses only recorded past values of the variable of interest and of the covariables.

In this work we compare the forecasting performance of the ARIMA, ARIMAX and SFPLR methods when applied to historical leptospirosis data using, in addition, hydrometeorological information. The main goal of this kind of study is to select the best method to predict the incidence of epidemiological diseases in the northeast of our country. In particular, we are interested in finding the most suitable tool to predict outbreaks of leptospirosis that can be used by the regional public health systems.

The rest of the paper is organized as follows: in Section 2 we describe the three prediction methods used in the applications, the source of data is presented in Section 3, Section 4 is devoted to presenting the results obtained when applying the prediction methods to leptospirosis data and finally, in Section 5 we present the conclusions of the work.

2 Methods

In this section we present the models that characterize the three prediction methods, ARIMA, ARIMAX and SFPLR, that we applied to the leptospirosis incidence data collected in the Litoral area of our country. All the numerical implementation was performed in the statistical software R (R).

2.1 ARIMA model

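Following Box94, we define the ARMA(p,q) model for a stationary time series $z_t$ as

$$\phi(B) z_t = \theta(B) a_t, \qquad (1)$$

where $B$ is the backward shift operator, which verifies $B z_t = z_{t-1}$, $\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p$ is the autoregressive operator, $\theta(B) = 1 - \theta_1 B - \dots - \theta_q B^q$ is the moving average operator and $a_t$ is a white noise process with $E(a_t) = 0$ and $\mathrm{Var}(a_t) = \sigma_a^2$.

Recall that a series is (strictly) stationary if, for any set of times $t_1, \dots, t_m$ and any lag $k$, the joint distribution of the observations $z_{t_1}, \dots, z_{t_m}$ and $z_{t_1+k}, \dots, z_{t_m+k}$ is the same. For a linear stochastic process, this is equivalent to saying that the roots of $\phi(B) = 0$ have modulus (absolute value) greater than one, that is, they lie outside the unit circle (equivalently, the roots of the characteristic equation written in terms of $B^{-1}$ have modulus less than one).

In the case of nonstationary processes, that is, when the autoregressive operator has unit roots, the process needs to be differenced $d$ times to become stationary and, in this case, the autoregressive operator can be written as $\phi(B)(1-B)^d$, with $\phi(B)$ stationary. Equation (1) then becomes

$$\phi(B)(1-B)^d z_t = \theta(B) a_t. \qquad (2)$$

Observe that the model described above is the ARIMA(p,d,q) model and corresponds to assuming that the difference of order $d$, $w_t = (1-B)^d z_t$, can be represented by the ARMA(p,q) model $\phi(B) w_t = \theta(B) a_t$.

The notation used for this kind of model is ARIMA(p,d,q), where $p$, $d$ and $q$ were described above and indicate the number of autoregressive terms, the number of nonseasonal differences needed for stationarity and the number of lagged forecast errors in the prediction equation, respectively.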

To adjust the ARIMA model, the following steps must be performed:

  • Verify the stationarity of the data using the Dickey-Fuller test. If the data are nonstationary, use a Box-Cox transformation to stabilize the variance and take $d$ differences between the observations to stabilize the mean.

  • Using autocorrelation and partial autocorrelation graphs, identify the orders $p$ and $q$ such that the series can be modeled as an ARMA(p,q) model.
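Two ingredients of these steps, differencing and the sample autocorrelation function, can be sketched in a few lines. This is only an illustration under simple assumptions: the helper names `difference` and `sample_acf` are hypothetical, the series is made up, and the Dickey-Fuller test, the Box-Cox transformation and the partial autocorrelation are deliberately left out.

```python
def difference(series, d=1):
    """Apply the differencing operator (1 - B) to the series d times."""
    out = list(series)
    for _ in range(d):
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out


def sample_acf(series, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(v * v for v in dev)
    if c0 == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]


# Hypothetical series with a linear trend plus an alternating component.
trended = [2 * t + (1 if t % 2 else -1) for t in range(24)]

once_differenced = difference(trended, d=1)  # removes the linear trend
acf_values = sample_acf(once_differenced, max_lag=3)
print([round(r, 3) for r in acf_values])
```

In practice the number of differences and the orders would be chosen with the tests and graphs listed above; the sketch only shows what the two operations compute.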

2.2 ARIMAX model

As mentioned before, the ARIMAX model is an extension of the ARIMA model which allows the use of covariables. The expression that represents this model is

$$\phi(B) z_t = \nu(B) x_t + \theta(B) a_t,$$

where, analogously to the ARIMA model, $B$ is the backward shift operator, which verifies $B z_t = z_{t-1}$, $\phi(B)$ is the stationary autoregressive operator, $x_t$ represents the covariables, $\nu(B)$ is a transfer function representing the effect of $x_t$ on $z_t$, $\theta(B)$ is the moving average operator and $a_t$ is the white noise term.

2.3 SFPLR model

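Following Aneirosetal08, we assume that the long time series $\{Z_t\}$ has been observed at $N$ equispaced time points (that is, $t = 1, \dots, N$) and we cut it (without loss of generality) into $n$ short-time curves $\chi_i$, $i = 1, \dots, n$, of length $\tau$. As a consequence, we get a sample $\{(X_i, \chi_i, Y_i)\}_{i=1}^{n}$ where, for any curve $\chi_i$, the vector $X_i = (X_{i1}, \dots, X_{ip})^T$ represents the numerical, independent, covariables. Let $Y_i$ be some characteristic of the period of time that we want to study, for instance the leptospirosis incidence in a specific month. Then, the SFPLR model is given by

$$Y_i = X_i^T \beta + m(\chi_i) + \varepsilon_i, \qquad i = 1, \dots, n,$$

where $\beta = (\beta_1, \dots, \beta_p)^T$ is a vector of unknown real parameters, $m$ is an unknown (smooth) real function and $\varepsilon_i$ are identically distributed random errors with $E(\varepsilon_i \mid X_i, \chi_i) = 0$. The estimators (see kersmooth) are given by

$$\hat{\beta}_h = (\tilde{X}_h^T \tilde{X}_h)^{-1} \tilde{X}_h^T \tilde{Y}_h,$$

where $\tilde{X}_h = (I - W_h) X$ and $\tilde{Y}_h = (I - W_h) Y$, with $X = (X_1, \dots, X_n)^T$, $Y = (Y_1, \dots, Y_n)^T$ and $W_h = (w_{n,h}(\chi_i, \chi_j))_{i,j}$, and

$$\hat{m}_h(\chi) = \sum_{i=1}^{n} w_{n,h}(\chi, \chi_i) (Y_i - X_i^T \hat{\beta}_h),$$

where

$$w_{n,h}(\chi, \chi_i) = \frac{K(d(\chi, \chi_i)/h)}{\sum_{j=1}^{n} K(d(\chi, \chi_j)/h)}$$

are Nadaraya-Watson type kernel weights, $K$ is a kernel function, $d$ is a semi-metric between curves and the smoothing parameter $h$ is chosen by cross-validation.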
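A compact sketch may help to see how the two estimators interact. It is not the authors' implementation: it assumes a single scalar covariate (so the linear part reduces to one coefficient), a Euclidean distance between discretized curves, a Gaussian kernel and a fixed bandwidth `h` instead of cross-validation. All names and data are hypothetical.

```python
import math


def l2_distance(curve_a, curve_b):
    """Euclidean distance between two curves observed on the same grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(curve_a, curve_b)))


def nw_weights(curve, sample_curves, h):
    """Nadaraya-Watson weights of `curve` with respect to each sample curve."""
    kernel = [math.exp(-0.5 * (l2_distance(curve, c) / h) ** 2)
              for c in sample_curves]
    total = sum(kernel)
    return [k / total for k in kernel]


def fit_sfplr(x, curves, y, h):
    """Fit Y = x * beta + m(curve) + error with one scalar covariate x."""
    n = len(y)
    weights = [nw_weights(curves[i], curves, h) for i in range(n)]
    # Remove from x and y the part explained by kernel smoothing on the curves.
    x_tilde = [x[i] - sum(weights[i][j] * x[j] for j in range(n)) for i in range(n)]
    y_tilde = [y[i] - sum(weights[i][j] * y[j] for j in range(n)) for i in range(n)]
    # Least squares on the residualized variables (scalar case).
    beta = (sum(a * b for a, b in zip(x_tilde, y_tilde))
            / sum(a * a for a in x_tilde))

    def predict(x_new, curve_new):
        w = nw_weights(curve_new, curves, h)
        m_hat = sum(w[j] * (y[j] - x[j] * beta) for j in range(n))
        return x_new * beta + m_hat

    return beta, predict


# Hypothetical sample: five short curves, one covariate, response y = 2x + 1.
curves = [[0, 1, 2], [1, 1, 3], [0, 2, 2], [2, 1, 4], [1, 3, 2]]
x = [0.5, 1.0, 0.2, 1.7, 0.9]
y = [2 * xi + 1 for xi in x]

beta_hat, predict = fit_sfplr(x, curves, y, h=1.5)
print(round(beta_hat, 6), round(predict(1.5, curves[0]), 6))
```

Because the toy response is exactly linear in the covariate with a constant nonparametric part, the sketch recovers the coefficient 2 up to rounding. With several covariates the scalar division would be replaced by solving the normal equations.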

3 Data

Leptospirosis incidence has been notified in Santa Fe and Entre Ríos provinces since 2009, the year in which the Sistema Nacional de Vigilancia Epidemiológica por Laboratorios de Argentina (SIVILA) was implemented. The confirmed cases of leptospirosis were requested from the Dirección de Promoción y Prevención de la Salud, Ministerio de Salud of the province of Santa Fe, and the División Epidemiológica, Ministerio de Salud of the province of Entre Ríos. Therefore, the period of analysis is 2009-2018. The total number of confirmed leptospirosis cases in this period was 810: 496 in Santa Fe and 314 in Entre Ríos.

The selected covariables are those identified in Lopez19 as the main hydroclimatic indicators that can influence the occurrence of leptospirosis outbreaks in northeastern Argentina. The hydroclimatic datasets identified in Lopez19 include monthly total precipitation, monthly river hydrometric levels and the Oceanic Niño Index (ONI, NOAA/NWS/CPC). Precipitation data were provided by the Servicio Meteorológico Nacional (SMN) and the Instituto Nacional de Tecnología Agropecuaria (INTA). The meteorological stations are Sauce Viejo Aero, Rosario Aero and Paraná. Hydrometric data were provided by the Instituto Nacional del Agua (INA), while data on the hydrometric river evacuation level (the river level from which people are evacuated to safe non-floodable areas) were provided by the Prefectura Naval Argentina (PNA). The ONI characterizes the years and months under El Niño, La Niña or neutral conditions; it is the 3-month running mean of the sea surface temperature (SST) anomaly for the Niño 3.4 region. Finally, data about population indicators were retrieved from the 2010 Census of the Instituto Nacional de Estadística y Censos (INDEC).

4 Empirical Results and Discussion

To explore the observed leptospirosis data in the area of study, in Figure 1 we plot the long-time (left) and short-time (right) incidence series.

Figure 1: Left: long-time series of monthly leptospirosis incidence in the area of study. Right: short-time series of functional data.

Since the climatic conditions can differ within the area of study, we perform the statistical analysis for each region separately, that is, for each of the three selected cities. To do so, for each region we split the whole sample in two: the training and the testing samples. With the former, we estimate all the parameters involved in each model, thus obtaining the estimated predictive rules. This sample contains the leptospirosis incidence and the hydroclimatic covariables (monthly total precipitation, monthly river hydrometric levels and the ONI) from January 2009 to December 2017. The time series of the hydroclimatic covariables are plotted in Figure 2.

Figure 2: Hydroclimatic covariables

The testing sample corresponds to the leptospirosis incidence from January 2018 to December 2018 (12 months). With this sample we measure the predictive power of the three methods: we evaluate the estimated rule on the testing sample in order to get the estimated leptospirosis incidence ($\hat{y}_t$, $t = 1, \dots, 12$) and compare it with the true values ($y_t$, $t = 1, \dots, 12$). To compare the methods we used two criteria, which are described below.

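The Nash–Sutcliffe efficiency (NSE) criterion is

$$\mathrm{NSE} = 1 - \frac{\sum_{t=1}^{T} (y_t - \hat{y}_t)^2}{\sum_{t=1}^{T} (y_t - \bar{y})^2},$$

where $y_t$ and $\hat{y}_t$ are the observed and predicted incidence in month $t$ of the testing sample, $\bar{y}$ is the mean of the observed values and $T = 12$. Observe that the Nash–Sutcliffe criterion can range from $-\infty$ to $1$: $\mathrm{NSE} = 1$ corresponds to a perfect prediction, $\mathrm{NSE} = 0$ indicates that the method performs as well as the mean of the data and $\mathrm{NSE} < 0$ occurs when the observed mean is a better predictor than the model.

The second criterion is the Root Mean Square Error (RMSE),

$$\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} (y_t - \hat{y}_t)^2}.$$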
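Both criteria can be computed directly from the observed and predicted values. The following sketch uses the standard definitions of the Nash–Sutcliffe efficiency and the root mean square error; the function names and the numbers are illustrative and are not the paper's data.

```python
import math


def nash_sutcliffe(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - SSE / (sum of squares around the observed mean)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        # Happens when every observed value is identical, e.g. a year with no cases.
        raise ValueError("NSE is undefined when the observed values are constant")
    return 1 - sse / spread


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sse / len(observed))


# Hypothetical observed values and two candidate predictions.
observed = [1, 2, 3, 4]
mean_prediction = [2.5, 2.5, 2.5, 2.5]   # always predict the observed mean
shifted_prediction = [2, 3, 4, 5]        # every prediction is off by one

print(nash_sutcliffe(observed, mean_prediction), rmse(observed, shifted_prediction))
```

Predicting the observed mean gives an efficiency of zero, which is the reference point against which negative values are read.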

Table 1 presents the results; the best results are reported in bold.

Method    NSE     RMSE

Santa Fe
ARIMA     -5.00   0.91
ARIMAX    -9.20   1.19
SFPLR     -1.40   0.58

ARIMA     -2.38   0.87
ARIMAX    -1.63   0.76
SFPLR     -0.50   0.58

ARIMA     -0.63   0.82
ARIMAX    -0.83   0.87
SFPLR     -0.63   0.82

Table 1: NSE and RMSE for each region.

5 Conclusions

In this work we compared the forecasting performance of the ARIMA, ARIMAX and SFPLR methods when applied to leptospirosis data collected in different regions, using, in addition, hydrometeorological information. We found that, in general, SFPLR provides better forecasting performance than ARIMA and ARIMAX, except in the region of Gualeguaychú, where there were almost no cases of leptospirosis, that is, the data are almost all zeros.

It is worth noting that, although for this kind of data (incidence) the good performance of SFPLR is not very visible, in datasets that are continuous and not so sparse (that is, with fewer zeros), like the Ozone dataset presented in Aneirosetal08, SFPLR performs considerably better than ARIMA and ARIMAX (see Table 2).

Table 1 shows the values of both criteria obtained for 2018 in each of the three cities for the ARIMA, ARIMAX and SFPLR implementations. In general, SFPLR provides better forecasting performance than ARIMA and ARIMAX in the three studied cities, except in the Rosario case, where similar values are obtained for SFPLR and ARIMA.

SFPLR is able to predict future leptospirosis occurrence relatively accurately when hydroclimatic variables are considered. However, leptospirosis is also determined by social factors, such as human activities, the age and sex of individuals and health system prophylaxis, that might increase the probability of infection (citations). Some of these factors could be considered in future studies.

From this first comparative analysis, it can be concluded that these types of models constitute a set of suitable tools to predict outbreaks of leptospirosis in northeastern Argentina and could be used by the regional public health systems.

Method    NSE     RMSE
ARIMA     -0.05   36.65
ARIMAX     0.85   13.73
SFPLR      0.93    9.70
Table 2: NSE and RMSE for the Ozone data set.