1 Introduction
Leptospirosis is a public health problem all over the world, particularly in tropical and subtropical areas. It is a zoonosis caused by the spirochete bacterium Leptospira interrogans (Cespedes05). Leptospira has been found in virtually all mammalian species examined. Humans most commonly become infected through occupational, recreational, or domestic contact with the urine of host animals, either directly or via contaminated water or soil (Adler10; Trueba04). Infectious vector-borne diseases, particularly leptospirosis, are climate-sensitive (WHOWMO12; Lopez18; Lopez19). Therefore, extreme climate events enhanced by climate change aggravate the health problems of the population (Cerda08; Sanchez09). In this context, both the average rainfall and the intensity and frequency of extreme precipitation events increased between 1951 and 2010 in northeastern Argentina (Barr09; Penalba10). This region has important rivers, such as the Paraná and the Uruguay; as a consequence, the heaviest rainfall caused significant flooding in the last decade (Antico14; Lopez18), and this trend has continued to rise in recent years (Lovino18b). These extreme events influence the epidemiology of infectious diseases in the region (Lopez18; Lopez19). For these reasons, health system responses should include, besides the treatment of individual cases, an estimate of the expected number of cases of the disease. This estimate would improve the response of health systems during potential outbreaks by cutting off or delaying the transmission of the disease (Canals10).
The modeling of leptospirosis in relation to hydroclimatic variables has been studied using deterministic (Triampo07; Zaman10; Zaman12; Holt06) and statistical (Chadsuthi12; Torres08) approaches. For example, Chadsuthi12 found that rainfall is correlated with cases of leptospirosis in both of the studied regions, while temperature is correlated with the disease in only one of those regions. Models can be useful tools to show the trend of the variable of interest (in this case, the incidence of leptospirosis) by fitting the recorded data, or to predict the next seasonal peak with higher accuracy.
From the statistical point of view, the methods most commonly used to predict cases of epidemic diseases are the well-known autoregressive models. The simplest of these is the autoregressive (AR) model of order $p$, initially developed by Yule27, which models the response variable linearly in terms of its own previous values and a stochastic term. The moving-average (MA) model, introduced later by Slutzky37, proposes that the response variable depends linearly on the current and various past values of the stochastic term. The autoregressive moving-average (ARMA) model (Wold38) combines both: the AR term models the response as a linear combination of its own past values, and the MA term models the error as a linear combination of previous values of the stochastic error. The autoregressive integrated moving average (ARIMA) model (Box94) is a generalization of ARMA to nonstationary series, that is, those whose mean and variance are not constant in time. The integrated part refers to an initial differencing step, which can be applied in order to eliminate the nonstationarity of the series. Some applications of this method to epidemiological time series can be found, for instance, in Promprou06; Liu11; Countin07. Although this model seems to be efficient, it only uses the variable of interest, without taking into account the additional information that predictor variables may provide. As mentioned before, when predicting outbreaks of leptospirosis in the Argentinian Litoral region, it is important to consider hydrometeorological covariables, since they can be indicators of outbreaks and could improve the prediction. In this direction, the ARIMAX model (see, for instance, Kongcharoen13) is an extension of ARIMA that allows the use of covariables. Some applications of this method can be found in Chadsuthi12; Yogarajah13; avellaneda12.
Although autoregressive predictive models are simple and easy to understand and analyze, they are difficult to apply to real data, since the data rarely meet the (linear) requirements of these methods. For instance, the large number of zero values and the discrete nature of leptospirosis incidence series make it difficult to perform the analysis with the usual time series methods. In this direction, nonparametric and semiparametric methods are a preferable alternative to autoregressive models. Semiparametric prediction of time series has been studied in, among others, Gyorfietal89, Bosq97, Bosq98 and Fanetal03. Some reviews can also be found in Wolfgangetal08. More recently, Biauetal10 studied the problem of sequential nonparametric prediction and proved some consistency results under mild conditions. In Shangetal11, the authors introduced a nonparametric prediction method with dynamic updating and showed its performance using monthly Niño indices from January 1950 to December 2008.
In this paper we consider the method introduced by Aneirosetal08, who proposed to treat the time series readings for each year as a function, thus obtaining a sample of functional data. More precisely, the idea behind this method is to split the observed long time series into short-time trajectories, all of the same length (for instance, a year), resulting in a sample of curves that can then be handled within the functional data framework.
That is, although the data are recorded only on a discrete grid of time points, by adapting them to the functional data framework we can assume a continuous underlying curve for each year.
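The splitting step described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the helper name `split_into_curves` and the toy monthly counts are hypothetical, and the sketch is not the authors' implementation (the computations in the paper were carried out in R).

```python
def split_into_curves(series, tau):
    """Cut a long series of length N = n * tau into n consecutive curves of length tau.

    Assumption for this sketch: the series length is an exact multiple of tau
    (for monthly data split by year, tau = 12).
    """
    if tau == 0 or len(series) % tau != 0:
        raise ValueError("series length must be a positive multiple of tau")
    return [list(series[start:start + tau]) for start in range(0, len(series), tau)]


# Hypothetical example: two years of monthly case counts (made-up numbers).
monthly_cases = [0, 1, 3, 4, 2, 0, 0, 0, 0, 1, 0, 0,
                 1, 2, 5, 3, 1, 0, 0, 0, 0, 0, 1, 0]
yearly_curves = split_into_curves(monthly_cases, tau=12)
```

Each element of `yearly_curves` plays the role of one discretized curve, observed on a grid of 12 monthly points.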
The model by Aneirosetal08 is called Semi-Functional Partial Linear Regression (SFPLR). It is an additive model with two terms: one models nonparametrically the effect of the past trajectory of the (temporal) response variable, and the other adds the information contained in the covariates through a linear combination of them.
As the authors mention, the SFPLR method has several advantages. Cutting the observed time series into a sample of curves and incorporating one single past trajectory, rather than many single past values, solves the problem of choosing the number of past values to be used in the construction of autoregressive prediction methods. In addition, in the context of infectious diseases with seasonal cycles that depend on the climate, as is the case of leptospirosis, this approach uses the measured value of the (hydroclimatic) covariables only in the month of interest, which improves the prediction considerably. Another advantage of SFPLR is that it is not necessary to predict future values of the covariables since, to perform the prediction, the method uses only recorded past values of the variable of interest and of the covariables.
In this work we compare the forecasting performance of the ARIMA, ARIMAX and SFPLR methods when applied to historical leptospirosis data together with hydrometeorological information. The main goal of this kind of study is to select the best method to predict the incidence of epidemic diseases in the northeast of our country. In particular, we are interested in finding the most suitable tool to predict outbreaks of leptospirosis that can be used by the regional public health systems.
The rest of the paper is organized as follows. In Section 2 we describe the three prediction methods used in the applications. The source of data is presented in Section 3. Section 4 is devoted to the results obtained when applying the prediction methods to leptospirosis data. Finally, in Section 5 we present the conclusions of the work.
2 Methods
In this section we present the models that characterize the three prediction methods, ARIMA, ARIMAX and SFPLR, that we applied to the leptospirosis incidence data collected in the Litoral area of our country. All the numerical implementation was performed in the statistical software R (R).
2.1 ARIMA model
Following Box94, we define the ARMA($p$,$q$) model for a stationary time series $\{z_t\}$ as
(1) $\phi(B)\, z_t = \theta(B)\, a_t,$
where $B$ is the backward shift operator, which verifies $B z_t = z_{t-1}$, $\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p$ is the autoregressive operator, $\theta(B) = 1 - \theta_1 B - \dots - \theta_q B^q$ is the moving average operator and $\{a_t\}$ is a white noise process with $E(a_t) = 0$ and $\mathrm{Var}(a_t) = \sigma_a^2$.
Recall that a series is (strictly) stationary if, for any set of times $t_1, \dots, t_m$ and any lag $k$, the joint distributions of the observations $z_{t_1}, \dots, z_{t_m}$ and $z_{t_1+k}, \dots, z_{t_m+k}$ are the same. For a linear stochastic process such as (1), stationarity requires that the roots of $\phi(B) = 0$ lie outside the unit circle, that is, have a modulus (absolute value) greater than one.
In the case of nonstationary processes, that is, when the autoregressive operator has $d$ unit roots, the process needs to be differenced $d$ times to become stationary and the autoregressive operator can be written as $\phi(B)(1-B)^d$, with $\phi(B)$ stationary. In this case, Equation (1) becomes
(2) $\phi(B)\,(1-B)^d z_t = \theta(B)\, a_t.$
Observe that this is the ARIMA($p$,$d$,$q$) model, which corresponds to assuming that the difference of order $d$, $w_t = (1-B)^d z_t$, can be represented by the ARMA($p$,$q$) model $\phi(B)\, w_t = \theta(B)\, a_t$.
In the notation ARIMA($p$,$d$,$q$), the values $p$, $d$ and $q$ indicate the number of autoregressive terms, the number of nonseasonal differences needed for stationarity and the number of lagged forecast errors in the prediction equation, respectively.
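As a worked illustration of this notation, consider the case of one autoregressive term, one difference and one moving average term. The symbols below ($z_t$ for the series, $a_t$ for the white noise, $B$ for the backward shift operator with $B z_t = z_{t-1}$, and the coefficients $\phi_1$, $\theta_1$) and the sign convention of Box and Jenkins are assumptions of this sketch; some software packages use the opposite sign for the moving average coefficients.

```latex
\begin{align*}
(1 - \phi_1 B)(1 - B)\, z_t &= (1 - \theta_1 B)\, a_t \\
\bigl(1 - (1 + \phi_1) B + \phi_1 B^2\bigr)\, z_t &= a_t - \theta_1 a_{t-1} \\
z_t &= (1 + \phi_1)\, z_{t-1} - \phi_1\, z_{t-2} + a_t - \theta_1\, a_{t-1}
\end{align*}
```

The last line shows that, in this case, the current value depends on the two previous values of the series, on the current shock and on the previous shock.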
To adjust the ARIMA model, the following steps must be performed:

Verify the stationarity of the data using the well-known Dickey-Fuller test. If the data are nonstationary, use a Box-Cox transformation to stabilize the variance and take $d$ differences between the observations to stabilize the mean.

Using autocorrelation and partial autocorrelation graphs, identify the orders $p$ and $q$ such that the series can be modeled as an ARMA($p$,$q$) model.
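The differencing and autocorrelation computations behind these steps can be sketched in plain Python. This is a minimal illustration, not the procedure used in the paper (which relied on R): the function names `difference` and `sample_acf` are hypothetical, and the Dickey-Fuller test and the Box-Cox transformation are deliberately left out.

```python
def difference(series, d=1):
    """Apply the differencing operator (1 - B) to the series d times."""
    out = list(series)
    for _ in range(d):
        out = [current - previous for previous, current in zip(out, out[1:])]
    return out


def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag.

    Assumption for this sketch: the series is not constant, so the
    lag-0 sum of squares is nonzero.
    """
    n = len(series)
    mean = sum(series) / n
    dev = [value - mean for value in series]
    c0 = sum(v * v for v in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]


# A quadratic trend becomes constant after two differences.
twice_differenced = difference([1, 4, 9, 16, 25], d=2)
acf_values = sample_acf([1, 2, 3, 4], max_lag=1)
```

In practice, the autocorrelations of the differenced series would be inspected together with the partial autocorrelations to choose the orders.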
2.2 ARIMAX model
As mentioned before, the ARIMAX model is an extension of the ARIMA model that allows the use of covariables. The expression that represents this model is
$\phi(B)\, z_t = \nu(B)\, x_t + \theta(B)\, a_t,$
where, analogously to the ARIMA model, $B$ is the backward shift operator, which verifies $B z_t = z_{t-1}$, $\phi(B)$ is the stationary autoregressive operator, $x_t$ represents the covariables, $\nu(B)$ is a transfer function representing the effect of $x_t$ on $z_t$ and $\theta(B)$ is the moving average operator.
2.3 SFPLR model
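Following Aneirosetal08, we assume that the long time series $\{Z_t\}$ has been observed at $N$ equispaced time points (that is, $Z_1, \dots, Z_N$) and we cut it (without loss of generality, $N = n\tau$) into $n$ short-time curves $\chi_i$, $i = 1, \dots, n$, of length $\tau$. As a consequence, we get a sample $\{(\chi_i, X_i, Y_i)\}_{i=1}^{n}$ where, for any curve $\chi_i$, the vector $X_i = (X_{i1}, \dots, X_{ip})^T$ represents the numerical, independent covariables. Let $Y_i$ be some characteristic of the period of time that we want to study, for instance the leptospirosis incidence in a specific month. Then, the SFPLR model is given by
$Y_i = X_i^T \beta + m(\chi_i) + \varepsilon_i, \quad i = 1, \dots, n,$
where $\beta = (\beta_1, \dots, \beta_p)^T$ is a vector of unknown real parameters, $m$ is an unknown (smooth) real function and $\varepsilon_i$ are identically distributed random errors with $E(\varepsilon_i \mid X_i, \chi_i) = 0$. The estimators (see kersmooth) are given by
$\hat{\beta}_h = (\tilde{X}_h^T \tilde{X}_h)^{-1} \tilde{X}_h^T \tilde{Y}_h,$
where $\tilde{X}_h = (I - W_h) X$ and $\tilde{Y}_h = (I - W_h) Y$, and
$\hat{m}_h(\chi) = \sum_{i=1}^{n} w_h(\chi, \chi_i)\,(Y_i - X_i^T \hat{\beta}_h),$
where $W_h = (w_h(\chi_i, \chi_j))_{i,j}$ and $w_h(\chi, \chi_i) = K(d(\chi, \chi_i)/h) / \sum_{j=1}^{n} K(d(\chi, \chi_j)/h)$ are Nadaraya-Watson type kernel weights, with kernel $K$ and semimetric $d$, and the smoothing parameter $h$ is chosen by cross-validation.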
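To make the two estimation steps concrete, the following Python sketch computes the SFPLR estimators in a deliberately simplified setting. Everything specific here is an assumption of the sketch and not the authors' implementation: a single scalar covariate (so the linear part reduces to one coefficient), a Gaussian kernel, the Euclidean distance between discretized curves, a bandwidth `h` fixed by hand instead of chosen by cross-validation, and the hypothetical function names.

```python
import math


def l2_distance(curve_a, curve_b):
    """Euclidean distance between two curves observed on the same grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(curve_a, curve_b)))


def nw_weights(curve, curves, h):
    """Nadaraya-Watson type weights of `curve` with respect to each sample curve."""
    kernel = [math.exp(-0.5 * (l2_distance(curve, c) / h) ** 2) for c in curves]
    total = sum(kernel)
    return [k / total for k in kernel]


def fit_sfplr(curves, x, y, h):
    """Estimate the linear coefficient for one scalar covariate.

    Both x and y are first corrected by their kernel-smoothed versions,
    then the coefficient is obtained by least squares on the residuals.
    """
    n = len(y)
    weights = [nw_weights(c, curves, h) for c in curves]
    x_tilde = [x[i] - sum(weights[i][j] * x[j] for j in range(n)) for i in range(n)]
    y_tilde = [y[i] - sum(weights[i][j] * y[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(x_tilde, y_tilde)) / sum(a * a for a in x_tilde)


def predict_sfplr(new_curve, new_x, curves, x, y, beta, h):
    """Prediction: linear part plus kernel-smoothed partial residuals."""
    w = nw_weights(new_curve, curves, h)
    m_hat = sum(wi * (yi - xi * beta) for wi, xi, yi in zip(w, x, y))
    return new_x * beta + m_hat


# Made-up toy data: the response is exactly 2 * x + 7, so the nonparametric
# component is constant and the coefficient should be recovered as 2.
toy_curves = [[0, 1], [1, 2], [2, 3], [3, 5]]
toy_x = [1.0, 3.0, 2.0, 5.0]
toy_y = [2 * xi + 7 for xi in toy_x]
toy_beta = fit_sfplr(toy_curves, toy_x, toy_y, h=1.0)
toy_pred = predict_sfplr([1, 1], 4.0, toy_curves, toy_x, toy_y, toy_beta, h=1.0)
```

With several covariates, the scalar ratio in `fit_sfplr` would be replaced by the solution of the corresponding least squares system.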
3 Data
Leptospirosis cases have been notified in the provinces of Santa Fe and Entre Ríos since 2009, the year in which the Sistema Nacional de Vigilancia Epidemiológica por Laboratorios de Argentina (SIVILA) was implemented. The confirmed cases of leptospirosis were requested from the Dirección de Promoción y Prevención de la Salud, Ministerio de Salud of the province of Santa Fe, and from the División Epidemiológica, Ministerio de Salud of the province of Entre Ríos. Therefore, the period of analysis is 2009-2018. The total number of confirmed leptospirosis cases in this period was 810: 496 in Santa Fe and 314 in Entre Ríos.
The selected covariables are those identified in Lopez19 as the main hydroclimatic indicators that can influence the occurrence of leptospirosis outbreaks in northeastern Argentina. The hydroclimatic datasets identified in Lopez19 include monthly total precipitation, monthly river hydrometric levels and the Oceanic Niño Index (ONI, NOAA/NWS/CPC). Precipitation data were provided by the Servicio Meteorológico Nacional (SMN) and the Instituto Nacional de Tecnología Agropecuaria (INTA).
The meteorological stations include: Sauce Viejo Aero, Rosario Aero and Paraná.
Hydrometric data were provided by the Instituto Nacional del Agua (INA), while data on the hydrometric river evacuation level (the river level above which people are evacuated to safe, non-floodable areas) were provided by the Prefectura Naval Argentina (PNA). The ONI characterizes the years and months under El Niño, La Niña or neutral conditions. It is the 3-month running mean of the sea surface temperature (SST) anomaly for the Niño 3.4 region (https://ggweather.com/enso/oni.htm).
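The running mean that defines the index can be illustrated with a short Python sketch. It only shows the averaging step: the function name `running_mean` and the anomaly values are hypothetical, and the official index is computed by NOAA from its own SST anomaly series.

```python
def running_mean(values, window=3):
    """Mean of each block of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Made-up monthly SST anomalies (in degrees Celsius), for illustration only.
monthly_anomalies = [1, 2, 3, 4]
three_month_means = running_mean(monthly_anomalies)
```

A series of `k` monthly anomalies yields `k - 2` three-month means, one for each overlapping block of three consecutive months.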
Finally, data about population indicators were retrieved from the 2010 Instituto Nacional de Estadística y Censos (INDEC) Census.
4 Empirical Results and Discussion
To explore the observed leptospirosis data in the area of study, in Figure 1 we plot the long-time (left) and short-time (right) incidence series.
However, since the climatic conditions can differ within the area of study, we perform the statistical analysis for each region separately. Therefore, the statistical analysis was performed in the three selected cities. To do that, for each region, we split the whole sample in two: the training and the testing samples. With the former, we estimate all the parameters involved in each model, so that we obtain the estimated predictive rules. This sample contains the leptospirosis incidence and the hydroclimatic covariables (monthly total precipitation, monthly river hydrometric levels and the ONI) from January 2009 to December 2017. The time series of the hydroclimatic covariables are plotted in Figure 2.
The testing sample corresponds to the leptospirosis incidence from January 2018 to December 2018 (12 months). With this sample we measure the predictive power of the three methods: we evaluate the estimated rule on the testing sample in order to get the estimated leptospirosis incidence ($\hat{Y}_i$, $i = 1, \dots, 12$) and compare it with the true values ($Y_i$, $i = 1, \dots, 12$). To compare the methods we used two criteria, which are described below.
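The Nash–Sutcliffe efficiency (NSE) criterion is
$\mathrm{NSE} = 1 - \dfrac{\sum_{i=1}^{m} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{m} (Y_i - \bar{Y})^2},$
where $Y_i$ and $\hat{Y}_i$ are the observed and predicted values, $\bar{Y}$ is the mean of the observed values and $m$ is the number of months in the testing sample.
Observe that the Nash–Sutcliffe criterion can range from $-\infty$ to $1$: $\mathrm{NSE} = 1$ corresponds to a perfect prediction, $\mathrm{NSE} = 0$ indicates that the method performs as well as the mean of the data, and a negative value occurs when the observed mean is a better predictor than the model.
The second criterion is the Root Mean Square Error (RMSE),
$\mathrm{RMSE} = \sqrt{\dfrac{1}{m} \sum_{i=1}^{m} (Y_i - \hat{Y}_i)^2}.$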
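Both criteria are straightforward to compute. The following Python sketch uses the usual definitions of the Nash–Sutcliffe efficiency and of the root mean square error; the function names and the toy values are hypothetical, and the sketch assumes that the observed values are not all equal (otherwise the NSE denominator is zero).

```python
import math


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the model's squared
    errors to the squared deviations of the observations from their mean."""
    mean_obs = sum(observed) / len(observed)
    model_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    mean_error = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - model_error / mean_error


def rmse(observed, predicted):
    """Root mean square error between observations and predictions."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Toy checks of the interpretation given in the text.
perfect = nse([1, 2, 3], [1, 2, 3])        # predictions equal to observations
mean_only = nse([1, 2, 3], [2, 2, 2])      # predicting the observed mean
worse_than_mean = nse([1, 2, 3], [3, 2, 1])
```

Here `perfect` equals 1, `mean_only` equals 0 and `worse_than_mean` is negative, matching the three cases described above.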
In Table 1 we present the results for each city; higher NSE and lower RMSE indicate better predictions.

Santa Fe
Method    NSE      RMSE
ARIMA     -5.00    0.91
ARIMAX    -9.20    1.19
SFPLR     -1.40    0.58

Paraná
Method    NSE      RMSE
ARIMA     -2.38    0.87
ARIMAX    -1.63    0.76
SFPLR     -0.50    0.58

Rosario
Method    NSE      RMSE
ARIMA     -0.63    0.82
ARIMAX    -0.83    0.87
SFPLR     -0.63    0.82
5 Conclusions
In this work we compared the forecasting performance of the ARIMA, ARIMAX and SFPLR methods when applied to leptospirosis data collected in different regions, using, in addition, hydrometeorological information. We found that, in general, SFPLR provides better forecasting performance than ARIMA and ARIMAX, except in the region of Gualeguaychú, where there were almost no cases of leptospirosis, that is, the data are almost all zeros.
It is worth noting that, although for this kind of data (incidence) the good performance of SFPLR is not very visible, in datasets that are continuous and less sparse (that is, with fewer zeros), like the Ozone dataset presented in Aneirosetal08, SFPLR performs considerably better than ARIMA and ARIMAX (see Table 2).
Table 1 shows the values of both criteria obtained for 2018 in each of the three cities for the ARIMA, ARIMAX and SFPLR implementations.
In general, SFPLR provides better forecasting performance than ARIMA and ARIMAX in the three studied cities, except in Rosario, where similar values are obtained for SFPLR and ARIMA.
SFPLR is able to predict future leptospirosis occurrence relatively accurately when hydroclimatic variables are considered. However, leptospirosis is also determined by social factors, such as human activities, the age and sex of individuals and health system prophylaxis, that might increase the probability of infection (citations). Some of these factors could be considered in future studies.
From this first comparative analysis, it can be concluded that these types of models are suitable tools to predict outbreaks of leptospirosis in northeastern Argentina and could be used by the regional public health systems.
Table 2
Method    NSE     RMSE
ARIMA     0.05    36.65
ARIMAX    0.85    13.73
SFPLR     0.93    9.70