Airborne-pollen allergy is prevalent, affecting up to 40% of the total population worldwide [d2007allergenic, sanchez2002use]. Long-term, customized forecasting of pollen allergy provides individuals with guidance for travel and medication planning [prank2013operational, arizmendi1993time, ranzi2003forecasting]. For pharmaceutical companies, the demand for pollen allergy medication, together with sales and operations planning for supply chain management, further necessitates forecasting the start/end date of the allergy season [singh2006supply, jaipuria2014improved, barlas2011demand].
Fig. 1 shows the pollen concentration for the years 2004 to 2008. The start and end dates of the allergy season vary significantly from year to year with no explicit trend. For example, the allergy seasons in 2005 and 2007 exhibit almost no overlap, and the allergy season in 2004 is roughly twice as long as that in any other year.
For a more rigorous definition of the allergy season tailored for each patient, we introduce the concentration threshold which is the minimum customized concentration requirement of pollen for a typical day in the allergy season, and the number threshold which is the minimum number of typical days within a week (7 consecutive days) in the allergy season. The start date of allergy season is defined as the day, during the following week of which the number of typical days (pollen concentration ) is at least . Both and can be customized according to different allergy sensitivity level of individuals. For example, if and
, the standard deviations of the start date, the end date and the length of the allergy season from 2003 to 2019 are 19.9 days, 41.4 days and 47.2 days, respectively.
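The season definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `season_start`, the parameter names, and the choices that the 7-day window begins on the candidate day itself and that a "typical day" means a concentration at or above the threshold are all assumptions.

```python
from typing import Optional, Sequence


def season_start(conc: Sequence[float], conc_threshold: float,
                 count_threshold: int, window: int = 7) -> Optional[int]:
    """Return the 0-based index of the first day whose `window`-day span
    (assumed to start on that day) contains at least `count_threshold`
    days with concentration >= `conc_threshold`; None if no day qualifies."""
    for day in range(len(conc) - window + 1):
        # Count the "typical days" inside the window starting at `day`.
        typical = sum(1 for c in conc[day:day + window] if c >= conc_threshold)
        if typical >= count_threshold:
            return day
    return None


# Synthetic daily concentrations, for illustration only.
daily = [0, 2, 0, 12, 15, 3, 20, 11, 30, 25, 1, 0]
start = season_start(daily, conc_threshold=10, count_threshold=4)
```

Raising either threshold models a less sensitive individual and generally moves the detected start date later.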
Univariate forecasting methodologies, such as Exponential Smoothing, ARIMA and T-BATS, can achieve the expected performance on time series data with a strong signal of level, trend and seasonality [gardner1985exponential, hillmer1982arima, hassani2017forecasting]. In addition to the classic univariate models, network and neural network architectures, such as the convolutional neural network (CNN) and the recurrent neural network (RNN), further extend the scope of application [bai2019mental, chavez2019identify, chen2011time, oreshkin2019n, liang2019solid, fish2019dynamic, liang2018dynamic].
However, univariate forecasting methodologies underperform in the context of cyclic intermittency, in particular when predicting the start date and the end date of the allergy season [stock1998comparison]. In airborne-pollen allergy season prediction, the concentration of pollen, which is primarily produced by plants, depends closely on local environmental conditions such as weather and geography, so meteorological information needs to be integrated into the forecasting methodology [samal2019time, meese1984comparison]. Univariate models, such as Holt-Winters exponential smoothing, ARIMA and the Facebook Prophet model, do not in their univariate form integrate other time-varying covariates, in particular weather information such as precipitation, temperature and wind [fildes1984choice, de2011forecasting].
In this paper, we propose a multi-variate triple-regression algorithm to predict the airborne-pollen allergy season in the long term, i.e. the start date and end date of the season. The triggering pollen concentration, and hence the definition of the allergy season, can be customized as described above. The proposed algorithm leverages the inferential signal from other covariates to make accurate long-term predictions, and uses a novel three-stage modelling approach to improve forecasting performance. In particular, in addition to the historical pollen concentrations, we take into consideration 11 other covariates, including the history of temperature, wind and precipitation. The prediction results from the earlier stage(s) are used in the later regressions to further improve the accuracy and reduce the uncertainty of the prediction.
The proposed algorithm comprises a three-stage regression for predicting the start/end date of the allergy season, preceded by a pre-processing step for feature extraction. The data pipeline and the regression algorithm are outlined in Fig. 2. In the pre-processing step, a total of 30 selected time-series features are extracted by applying a 14-day rolling window to each of the 12 original time-series vectors, including the pollen concentration history, temperature history, wind history, precipitation history, etc. The feature matrices of the individual time series are combined into an ensemble feature matrix, which is then fed into the three-stage regression. In Stage 1, a Gradient Boosting model is trained to predict the start/end date on a training set that is based on the feature matrix. The ground truth for the start/end date of the allergy season is determined following the definition above, once appropriate pollen concentration and number thresholds have been selected. Stage 1 yields a vector of predictions. In Stage 2, we select another training set, based on the feature matrix and the predicted start/end dates from Stage 1, to train a second Gradient Boosting model that predicts the uncertainty of the Stage 1 predictions, yielding a vector of uncertainties. In Stage 3, we perform a weighted linear regression on the Stage 1 predictions, based on the linear constraint on start/end date predictions made on consecutive days ahead of the allergy season. The weights are assigned according to the uncertainty predicted in Stage 2. The final predicted start/end date of the allergy season is obtained from this weighted linear regression.
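The pre-processing step can be illustrated with a small standard-library sketch. The source does not list the 30 features extracted per series, so the five statistics below (and the names `rolling_features` and `ensemble_features`) are placeholders chosen purely for illustration. All input series are assumed to have the same length and daily resolution.

```python
from statistics import mean, pstdev
from typing import Dict, List, Mapping, Sequence


def rolling_features(series: Sequence[float], window: int = 14) -> List[Dict[str, float]]:
    """One feature row per day, computed over the trailing `window` days."""
    rows = []
    for end in range(window, len(series) + 1):
        chunk = series[end - window:end]
        rows.append({
            "mean": mean(chunk),
            "std": pstdev(chunk),
            "min": min(chunk),
            "max": max(chunk),
            "last_minus_first": chunk[-1] - chunk[0],  # crude trend proxy
        })
    return rows


def ensemble_features(series_by_name: Mapping[str, Sequence[float]],
                      window: int = 14) -> List[Dict[str, float]]:
    """Join the per-series feature rows column-wise into one row per day."""
    per_series = {name: rolling_features(s, window) for name, s in series_by_name.items()}
    n_rows = min(len(rows) for rows in per_series.values())
    ensemble = []
    for i in range(n_rows):
        row = {}
        for name, rows in per_series.items():
            for key, value in rows[i].items():
                row[f"{name}_{key}"] = value  # e.g. "pollen_mean"
        ensemble.append(row)
    return ensemble


# Synthetic inputs, for illustration only.
features = ensemble_features({"pollen": list(range(20)), "temp": [1.0] * 20})
```

Each row of `features` would then serve as one input sample for the Stage 1 and Stage 2 regressors.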
Although one can opt to terminate the algorithm at Stage 1, when the predictions are already available, we emphasize that the three-stage regression can guarantee a smaller uncertainty than the one-stage regression alone, provided that enough Stage 1 predictions are used. The proof is given for a simplified problem as follows.
We may assume that each prediction in the first-stage regression
follows a Gaussian distribution:
where the variance is assumed to be constant , and is theoretically constrained by a linear relationship given that the predictions are made on consecutive days . To set up the problem of a multi-linear regression in general, we may consider independent variables and one dependent variable . Suppose we have observations,
where are the coefficients of the -th independent variable . Our goal is to minimize the sum of the weighted squared residuals (errors) . Thus, the cost function is:
where the weights are given by the inverse of the uncertainty from the regression results in Stage 2, . No analytical solution can be obtained for a set of random weights. For tractability, we focus on a simplified scenario with uniform weights, in which the linear regression results can be expressed as:
is the expectation of random variable, is the covariance of random variable and , and is the variance of random variable .
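For orientation, the textbook forms of the quantities discussed above are sketched below. The notation is assumed, since the original symbols are not recoverable here: $y_i$ is the dependent variable, $x_{ij}$ the $j$-th independent variable of observation $i$, $\beta_j$ the coefficients, $w_i$ the weights, $N$ the number of observations and $p$ the number of independent variables. The closed form is the standard ordinary-least-squares result for uniform weights and a single independent variable ($p = 1$).

```latex
% Linear model and weighted least-squares cost (assumed notation)
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i, \qquad i = 1, \dots, N

J(\beta_0, \dots, \beta_p) = \sum_{i=1}^{N} w_i \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2

% Uniform weights, one independent variable (p = 1)
\hat{\beta}_1 = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}, \qquad
\hat{\beta}_0 = \mathrm{E}[y] - \hat{\beta}_1 \, \mathrm{E}[x]
```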
In the context of predicting the start/end date of the allergy season, the only independent variable in the weighted linear regression is the date at which the prediction is made. Thus, we only have two non-zero coefficients, and , to be learned from the regression. The variance of the predicted coefficient can be approximated by [james2013introduction]:
It can be shown that is related to the uncertainty from the Gradient Boosting regression in Stage 1 through the following formula:
The final prediction of the start/end date is given by the -intercept of the line from the weighted linear regression:
Thus, the standard deviation of is approximated by
Utilizing the third-stage regression, we aim for a reduced variance, i.e. . In other words, the uncertainty should be reduced from the one-stage regression result. It is obvious that a minimum number of days where the predictions are made in Stage 1 is required to guarantee the reduced uncertainty. We can calculate by first defining the threshold function:
then setting to 1, indicating that the uncertainty does not change after the weighted linear regression.
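The existence of a minimum number of prediction days can be illustrated numerically under a simplified, assumed setting that is not the paper's threshold function: uniform weights, predictions on `n_days` consecutive days, independent errors of equal variance, and a final estimate taken as the fitted line evaluated `horizon` days after the last prediction day. In that setting, the standard simple-linear-regression result gives the variance of the fitted value relative to a single prediction as $1/n + (x_0 - \bar{x})^2 / S_{xx}$. The function names are hypothetical.

```python
from typing import Optional


def variance_ratio(n_days: int, horizon: float) -> float:
    """Variance of the fitted line at the target day, divided by the variance
    of a single Stage 1 prediction (uniform weights, consecutive days)."""
    if n_days < 2:
        raise ValueError("need at least two prediction days to fit a line")
    # Sum of squared deviations from the mean for n consecutive integers.
    s_xx = n_days * (n_days ** 2 - 1) / 12.0
    # Distance from the mean prediction day to the target day.
    offset = (n_days - 1) / 2.0 + horizon
    return 1.0 / n_days + offset ** 2 / s_xx


def min_days_for_reduction(horizon: float, max_days: int = 365) -> Optional[int]:
    """Smallest number of consecutive days for which the ratio drops below 1."""
    for n_days in range(2, max_days + 1):
        if variance_ratio(n_days, horizon) < 1.0:
            return n_days
    return None
```

In this sketch, the further the target lies beyond the prediction days, the more predictions are needed before the regression reduces the uncertainty, which mirrors the qualitative behaviour described for the threshold function.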
III Results and Discussion
To guarantee a reduced uncertainty in the three-stage regression compared with the one-stage regression, we obtain the minimum number of days, , used for making predictions. In Fig. 3, we plot the threshold function as a function of the coefficient with varying . The shaded area in Fig. 3 indicates the regime where the three-stage regression has a reduced uncertainty.
Therefore, the corresponding minimum number of days for a specific value can be obtained from the threshold function which satisfies and has the smallest among all threshold functions. As increases, the value of the threshold function increases, reflecting that a larger minimum number of days is required.
We apply the data pipeline and the three-stage regression algorithm to the dataset described in Section II to predict the start date of the allergy season. The ground truth of the start date of the allergy season is set by the thresholds, and , i.e., for the seven consecutive days after the start day, the number of days on which the pollen concentration is greater than is at least .
In Fig. 4, we show the mean predicted in Stage 1, and the corresponding error bar, which is the standard deviation predicted in Stage 2, as functions of the -th day of the year 2005. The models to predict and are trained using the datasets from 2003 and 2004, respectively. The blue lines, drawn with decreasing transparency, represent the weighted linear regression results as the number of predictions used increases, where the inverse of the uncertainty serves as the weights. The green circle located at the -intercept represents the final prediction made in Stage 3. The inset of Fig. 4 shows that the prediction converges to Day 54 as the number of days taken into account increases, while the actual start date is Day 51 for 2005. We also performed backtesting for 2006, 2007 and 2008, and a mean absolute error of 4.7 days was achieved using the triple-regression algorithm.
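The repeated weighted fits described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the Stage 1 target is assumed to be the number of days remaining until the season start, so that the fitted line crosses zero at the predicted start day (read off here as `-b0 / b1`); the weights are the inverse of the Stage 2 uncertainty, as stated in the text; and the function names and data are hypothetical.

```python
from typing import List, Sequence, Tuple


def weighted_linear_fit(x: Sequence[float], y: Sequence[float],
                        w: Sequence[float]) -> Tuple[float, float]:
    """Minimise sum_i w_i * (y_i - b0 - b1 * x_i)**2 and return (b0, b1)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


def running_estimates(days: Sequence[float], days_left: Sequence[float],
                      sigma: Sequence[float], min_points: int = 2) -> List[float]:
    """Refit with the first k predictions (k = min_points .. len(days)) and
    return the zero crossing of each fitted line (assumes a non-zero slope)."""
    weights = [1.0 / s for s in sigma]  # inverse of the predicted uncertainty
    estimates = []
    for k in range(min_points, len(days) + 1):
        b0, b1 = weighted_linear_fit(days[:k], days_left[:k], weights[:k])
        estimates.append(-b0 / b1)
    return estimates


# Synthetic, noise-free example: every prediction points at day 50.
days = [10, 11, 12, 13, 14]
days_left = [40, 39, 38, 37, 36]
sigma = [5.0, 4.0, 4.0, 3.0, 2.0]
estimates = running_estimates(days, days_left, sigma)
```

With noisy Stage 1 predictions, the successive entries of `estimates` would fluctuate at first and settle as more days are included, which is the convergence behaviour shown in the inset of Fig. 4.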
The intercept of -axis, coefficient , indicates the approximate prediction of the start/end date.
The airborne-pollen allergy season exhibits significant variations in terms of the start/end date and the length of the allergy season. Univariate models fail to extract its seasonality or trend and fail to integrate other covariates such as the temperature and precipitation.
In our proposed triple-regression algorithm, (a) the pollen allergy season can be customized based on each individual's allergy sensitivity level, (b) the predictions are based on the historical pollen concentration together with 11 other covariates, and (c) most importantly, the uncertainty of the prediction in Stage 3 is reduced, provided that the minimum number of predictions from Stage 1 is met. The final prediction in Stage 3 converges to a mean value as the number of predictions obtained from Stage 1 increases. The weighted linear regression further improves the accuracy by integrating the uncertainty predicted in Stage 2. In our backtesting, a mean absolute error of 4.7 days was achieved using the triple-regression forecasting algorithm. We conclude that this algorithm could be useful in both generic and long-term forecasting problems.