1 Introduction
In recent years the Centers for Disease Control and Prevention (CDC) has organized FluSight influenza forecasting competitions (Reich et al., 2019b), which have “pioneered infectious disease forecasting in a formal way” (Viboud and Vespignani, 2019). The competitions are distinguished by an elaborate technical infrastructure that allows a large number of participating teams to submit weekly forecasts for several quantities in real time. All targets are based on a measure called weighted influenza-like illness (wILI), which describes the proportion of outpatient visits due to influenza-like symptoms. Specifically, the targets are (see Reich et al. 2019b, Fig. 1B): (a) the wILI values one to four weeks ahead of the last available observation, classified into bins of width 0.1%; (b) the week of the season onset; (c) the peak week; (d) the peak intensity, i.e. the wILI value in the peak week. All targets are discrete, and participants submit their forecasts in the form of distributions assigning a probability to each possible outcome. Forecasts are evaluated based on the multibin logarithmic score (Centers for Disease Control and Prevention, 2018), a nonstandard variant of the logarithmic score (Gneiting and Raftery, 2007). Following its use in the FluSight competitions, this score has also been adopted in numerous scientific works (e.g. Akhmetzhanov et al. 2019, Ben-Nun et al. 2019, Brooks et al. 2018, Farrow et al. 2017, Kandula et al. 2018, 2019, Kandula and Shaman 2019, McGowan et al. 2019, Osthus et al. 2019a, Osthus et al. 2019b, Reich et al. 2019b, Zimmer et al. 2018), even though it is not proper. Propriety is generally considered an important requirement for scoring rules (Gneiting and Raftery, 2007; Held et al., 2017), as it encourages honesty of forecasters. The goal of this note is to explore the practical implications that the use of an improper scoring rule may have, elaborating on a previously published letter (Bracher, 2019).
2 Proper scoring rules
The evaluation of probabilistic forecasts requires comparison of a predictive distribution $F$ for a quantity $Y$ to a single observed outcome $y_{\text{obs}}$. Various scoring rules have been suggested to this end, and there is no single “best” approach. However, it is agreed that propriety is a desirable property of a score (Gneiting and Raftery, 2007). A score is called proper if its expectation is maximized by the true distribution of $Y$, and strictly proper if this maximum is unique. Intuitively speaking, the highest expected score should be achieved by a forecast which is based on a perfect understanding of the system to be predicted.
Another implication of propriety is that it encourages honesty of forecasters. To understand this, assume a forecaster whose belief about $Y$ is given by the distribution $F$. The forecaster is asked to issue a predictive distribution for $Y$, and will receive a reward depending on the agreement between this forecast and the outcome $y_{\text{obs}}$. This is measured using a score $S$, the expectation of which the forecaster consequently aims to maximize. She will do so based on her true belief $F$, but she is not obliged to actually issue $F$ as her forecast to be evaluated. If there is a different predictive distribution $G$ so that
$$\mathbb{E}_F[S(G, Y)] > \mathbb{E}_F[S(F, Y)], \tag{1}$$
i.e. $G$ has a higher expected score than $F$ if $F$ is actually true, the forecaster can issue $G$ instead. Such strategies are called hedging, and “it is generally accepted that it is undesirable to use a score for which hedging can improve the score or its expected value” (Jolliffe, 2008, p.25). To discourage hedging and the reporting of forecasts which differ from forecasters’ actual beliefs, the score should be constructed such that there can be no pair of $F$ and $G$ for which (1) holds. This is exactly the definition of a proper score.
3 The log score and the multibin log score
A widely used score is the logarithmic or log score. It is defined as (Gneiting and Raftery, 2007)
$$\mathrm{logS}(F, y_{\text{obs}}) = \log f(y_{\text{obs}}),$$
where $f$ is the density or probability mass function of the forecast distribution $F$. For a categorical $Y$ with ordered levels $1, \dots, K$ and probabilities $p_1, \dots, p_K$ as in the FluSight competitions this becomes
$$\mathrm{logS}(F, y_{\text{obs}}) = \log p_{y_{\text{obs}}}.$$
This score has many desirable properties (Gneiting et al., 2007); most notably, it is strictly proper.
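To see why the log score is strictly proper in the categorical case, one can compare the expected score of an arbitrary forecast with that of the true distribution. The short derivation below is a sketch under assumed notation: $F$ denotes the true distribution with probabilities $p_k$, and $G$ denotes any other forecast with probabilities $q_k$. The final inequality is Gibbs' inequality, i.e. the non-negativity of the Kullback-Leibler divergence.

```latex
\begin{align*}
\mathbb{E}_F[\mathrm{logS}(G, Y)] - \mathbb{E}_F[\mathrm{logS}(F, Y)]
  &= \sum_{k=1}^{K} p_k \log q_k - \sum_{k=1}^{K} p_k \log p_k \\
  &= -\sum_{k=1}^{K} p_k \log \frac{p_k}{q_k}
   = -\mathrm{KL}(F \,\|\, G) \;\le\; 0 .
\end{align*}
```

Equality holds only if $q_k = p_k$ for all $k$, so under the log score no forecast other than the true belief attains the maximal expected score.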
In the FluSight competitions a modified log score is used in which not only the probability mass assigned to the observed outcome $y_{\text{obs}}$ is counted, but also the mass assigned to the $d$ neighbouring values on either side. The resulting multibin log score (Centers for Disease Control and Prevention, 2018) is defined as
$$\mathrm{MBlogS}(F, y_{\text{obs}}) = \log\left(\sum_{i=-d}^{d} p_{y_{\text{obs}}+i}\right), \tag{2}$$
where $p_k = 0$ for $k < 1$ and $k > K$. The FluSight organizers chose $d = 5$ for the wILI forecasts (i.e. forecasts within a range of 0.5 percentage points are considered accurate) and $d = 1$ week for the onset and peak timing. The multibin log score has been argued to measure “accuracy of practical significance” (Reich et al. 2019b, p.3153) while showing little sensitivity to retrospective corrections of wILI values (McGowan et al. 2019). It has therefore been favoured over the regular log score. For both scores larger values are better, and overall results are obtained by averaging over all forecasts issued by a team.
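The two scores can be written down in a few lines of code. The following is a minimal sketch, not the implementation used by the FluSight organizers; the function names `log_score` and `multibin_log_score` and the example forecast are hypothetical and chosen only for illustration. Categories are numbered from 1 to K, and probability mass outside this range is treated as 0.

```python
import math


def log_score(probs, y_obs):
    """Regular log score: log of the probability assigned to the observed category.

    Categories are numbered 1..K as in the text, so probs[0] belongs to category 1.
    """
    p = probs[y_obs - 1]
    return math.log(p) if p > 0 else -math.inf


def multibin_log_score(probs, y_obs, d):
    """Multibin log score: log of the probability mass within d categories of y_obs.

    Categories outside 1..K are treated as having probability 0.
    """
    K = len(probs)
    lo = max(1, y_obs - d)   # first category inside the window
    hi = min(K, y_obs + d)   # last category inside the window
    mass = sum(probs[lo - 1:hi])
    return math.log(mass) if mass > 0 else -math.inf


# Hypothetical peak-week forecast over weeks 1..7 (illustrative values only).
forecast = [0.0, 0.1, 0.2, 0.4, 0.2, 0.1, 0.0]
print(log_score(forecast, 4))              # log(0.4)
print(multibin_log_score(forecast, 4, 1))  # log(0.2 + 0.4 + 0.2)
```

Because the window always contains the observed category, the multibin score is never smaller than the regular log score for the same forecast and outcome.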
4 A hedging strategy for the multibin log score
As also mentioned by Reich et al. (2019b), the multibin log score is improper. We now show how a forecaster can apply hedging to improve the expected score under her true belief $F$. In the following, assume that $F$ assigns probability 0 to the $d$ extreme categories on either end of the support, i.e.
$$p_1 = \dots = p_d = p_{K-d+1} = \dots = p_K = 0, \tag{3}$$
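where $K > 2d$. This rids us of extra considerations on these categories and can always be achieved by adding categories to the support. Then denote by $\tilde{F}$ a distribution with the same support as $F$ and
$$\tilde{p}_j = \frac{1}{2d+1} \sum_{i=-d}^{d} p_{j+i}, \quad j = 1, \dots, K, \tag{4}$$
where again $p_k = 0$ for $k < 1$ and $k > K$. This represents a “blurred” version of $F$, where we always redistribute the probability mass for one outcome equally between itself and the $d$ neighbouring ones on either side (i.e. the $\tilde{p}_j$ are “moving averages” of the $p_j$). Condition (3) ensures that $\sum_{j=1}^{K} \tilde{p}_j = 1$, so that $\tilde{F}$ is a well-defined distribution. The multibin log score of $F$ can now be expressed through the regular log score of $\tilde{F}$, as
$$\mathrm{MBlogS}(F, y_{\text{obs}}) = \mathrm{logS}(\tilde{F}, y_{\text{obs}}) + \log(2d+1).$$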
Applying the MBlogS is thus essentially the same as applying the regular log score, but after “blurring” the predictive distribution as in equation (4). As the regular log score is proper, a forecaster then has an incentive to issue a sharper forecast $G$ so that the corresponding blurred distribution $\tilde{G}$ (with probabilities $\tilde{q}_j$ derived from $q_j$ in analogy to (4)) is as close as possible to $F$. If there is a $G$ so that $\tilde{G}$ corresponds exactly to $F$, it can be found via recursive computations. Otherwise an optimal $G$ is obtained by numerically maximizing $\mathbb{E}_F[\mathrm{logS}(\tilde{G}, Y)]$ with respect to $q_1, \dots, q_K$ while ensuring that $G$ remains a valid probability distribution. This is the same as minimizing the Kullback-Leibler divergence
$$\mathrm{KL}(F \,\|\, \tilde{G}) = \sum_{j=1}^{K} p_j \log \frac{p_j}{\tilde{q}_j}$$
(Joyce, 2011) of $F$ and $\tilde{G}$. The optimum is not necessarily unique, but in general $G \neq F$ holds, and $G$ implies less variability than $F$.
Intuitively speaking, a forecaster is incentivized to issue a sharper, more “risky” forecast because the MBlogS does not sanction a low probability assigned to the observed value as long as the neighbouring weeks or bins received enough probability mass. Indeed, the optimized forecast $G$ will often cover outcomes with a high probability under $F$ exclusively by assigning probability mass to their neighbours. We illustrate this using some example forecasts of the peak timing (i.e. $d = 1$), visualized in Figure 1:

Example 1: Assume we are sure that the peak of the season will occur between weeks 3 and 5, more precisely $F$ is given by . The expected MBlogS under $F$ when reporting $F$ is . If, however, we report $G$ with $q_4 = 1$, i.e. claim to be sure that the peak occurs in week 4, we can expect a score of $\log(1) = 0$. In fact, $G$ will score at least as well as $F$ for all three outcomes we consider possible, and better for two of them.

Example 2: Our true belief is given by . We thus believe that the peak will occur around week 4, but even weeks 2 and 6 are considered possible. In this case we can find so that corresponds exactly to . is given by . We should thus claim the peak to definitely occur in weeks 3, 4 or 5. The corresponding expectations for the multibin score under are and .

Example 3: Now assume to be , which is similar to the previous example, but with more probability assigned to weeks 2 and 6. Now is given by . The optimized forecast distribution is thus bimodal and the peak is claimed to occur either in week 3 or 5. The expected scores are and .

Example 4: Lastly assume to be . We thus consider it likely that the peak occurs in week 2, but it may also occur later. In this setting there is no so that corresponds exactly to , but numerical optimization returns . To get the highest expected score we should thus shift the mode of our predictive distribution and claim to be almost certain that the peak occurs in week 3. The expected scores are and .
The patterns observed here also occur in many other settings. The optimized forecasts $G$ are sharper than the respective $F$. Moreover, the mode often gets shifted by up to $d$ weeks, and one or several additional local modes can occur.
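The hedging strategy can be sketched in code as follows. This is only an illustration under stated assumptions: the function names (`blur`, `expected_mblogs`, `hedge`) are hypothetical, and the optimizer is a simple multiplicative, EM-type update chosen here because it needs no external libraries. The text only speaks of recursive computations and numerical maximization, so the method actually used for the reported results may differ. The belief in the usage example (uniform over weeks 3 to 5) is likewise made up for illustration.

```python
import math


def blur(probs, d):
    """Moving average of width 2d+1 as in equation (4), with zero mass outside 1..K."""
    K = len(probs)
    return [sum(probs[max(0, j - d):min(K, j + d + 1)]) / (2 * d + 1)
            for j in range(K)]


def expected_mblogs(belief, forecast, d):
    """Expected multibin log score of `forecast` if outcomes follow `belief`."""
    K = len(belief)
    total = 0.0
    for j in range(K):
        if belief[j] > 0:
            mass = sum(forecast[max(0, j - d):min(K, j + d + 1)])
            if mass == 0:
                return -math.inf
            total += belief[j] * math.log(mass)
    return total


def hedge(belief, d, n_iter=500):
    """Search for a forecast with high expected MBlogS under `belief`.

    Assumption: uses an EM-type multiplicative update started from the uniform
    distribution. Each step keeps the forecast on the probability simplex and
    does not decrease the expected multibin log score.
    """
    K = len(belief)
    q = [1.0 / K] * K
    for _ in range(n_iter):
        # Mass that the current forecast places within d categories of each outcome.
        window = [sum(q[max(0, j - d):min(K, j + d + 1)]) for j in range(K)]
        q = [q[i] * sum(belief[j] / window[j]
                        for j in range(max(0, i - d), min(K, i + d + 1))
                        if belief[j] > 0)
             for i in range(K)]
    return q


# Hypothetical belief about the peak week over weeks 1..7 (illustrative values only).
belief = [0.0, 0.0, 1 / 3, 1 / 3, 1 / 3, 0.0, 0.0]
hedged = hedge(belief, d=1)
print([round(x, 4) for x in hedged])
print(expected_mblogs(belief, belief, 1), expected_mblogs(belief, hedged, 1))
```

For this illustrative belief the update concentrates the forecast on week 4, and the expected MBlogS of the hedged forecast exceeds that of the honest one. The helper `blur` can be used to check the relation between the two scores numerically: the log of a blurred probability plus $\log(2d+1)$ equals the multibin log score of the unblurred forecast.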
5 Application to FluSight forecasts
To demonstrate its practical relevance we apply the hedging strategy from the previous section to real forecasts from the FluSight competitions. These are publicly available at https://github.com/FluSightNetwork/cdc-flusight-ensemble/. Specifically, we consider national-level forecasts from the 2016/17 season submitted by the Los Alamos National Laboratory (LANL) team. Their dynamic Bayesian forecasting method has shown remarkably good performance over several years (Osthus et al., 2019b). We follow the same evaluation procedure as in Reich et al. (2019b), where average scores are only computed from the relevant parts of the season (e.g. forecasts of the onset week are ignored once it is clear that the onset has occurred; see p.8 in Reich et al. 2019b).
For all forecasts $F$ of the seven targets (one- to four-week-ahead wILI, onset and peak timing, peak intensity) we obtained optimized versions $G$ with the respective value of $d$. For illustration Figure 2 shows forecasts of the onset timing issued in calendar weeks 49 and 50, 2016. As in Example 4, the optimized forecast in week 49 has its mode shifted by one week. In both cases the optimized $G$ are visibly sharper and are multimodal, even though the corresponding $F$ are unimodal. Averaged over the 2016/17 season, the optimized forecasts indeed yield a higher and thus improved MBlogS for the onset timing ($-0.33$ for the optimized vs. $-0.39$ for the original forecasts).
Figure 3 shows the same for one-week-ahead forecasts of wILI, i.e. now we use $d = 5$. The optimized $G$ leave gaps between the values to which they assign positive probabilities. These forecast distributions with many spikes are unlikely to be useful to public health experts. Nonetheless, averaged over the course of the season, the optimized forecasts outperform the original ones ($-0.19$ vs. $-0.30$). Indeed, as shown in Table 1, such improvements are also achieved for the remaining five targets. This illustrates that the hedging strategy enabled by the improper MBlogS can lead to non-negligible improvements of average scores in practice.
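Table 1: Average MBlogS of the original and optimized LANL forecasts (national level, 2016/17 season). Larger values are better.

| | 1 wk | 2 wk | 3 wk | 4 wk | onset week | peak week | peak intensity |
|---|---|---|---|---|---|---|---|
| original forecasts | −0.30 | −0.81 | −0.85 | −0.89 | −0.39 | −0.48 | −0.62 |
| optimized forecasts | −0.19 | −0.75 | −0.78 | −0.84 | −0.33 | −0.43 | −0.59 |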

Note that the numbers given here are not directly comparable to the ones in Reich et al. (2019a), Fig. 1. We focus on the season 2016/17 and the national level, while Reich et al. average over different seasons and geographical resolutions.
6 Discussion
We showed that the multibin log score used in the CDC FluSight competitions incentivizes hedging, i.e. tuning forecast distributions in a specific way before submission. While we strongly doubt that participants have consciously tried to game the score, it is possible that this happens unintentionally. In forecasting, cross-validation methods to optimize forecasts for a given evaluation metric are common. Such methods could lead to hedging without the authors being aware of it. As in previous work (Held et al., 2017) we therefore advocate the use of proper scoring rules, e.g. the regular log score, to evaluate epidemic forecasts. Measures which are easier to interpret could be reported as a supplement to facilitate communication with public health experts.
Data and code availability: The data used in this note are available from the FluSight Collaboration at https://github.com/FluSightNetwork/cdc-flusight-ensemble/. R code to reproduce the presented results is available at https://github.com/jbracher/multibin.
Acknowledgements: I would like to thank T. Gneiting for helpful discussions and the FluSight Collaboration for making its forecasts publicly available. I also thank N.G. Reich for a very interesting discussion on the points raised in my letter.
References
 Akhmetzhanov et al. (2019) Akhmetzhanov, A. R., Lee, H., Jung, S., Kayano, T., Yuan, B., and Nishiura, H. (2019). Analyzing and forecasting the Ebola incidence in North Kivu, the Democratic Republic of the Congo from 2018–19 in real time. Epidemics, 27:123–131.
 Ben-Nun et al. (2019) Ben-Nun, M., Riley, P., Turtle, J., Bacon, D. P., and Riley, S. (2019). Forecasting national and regional influenza-like illness for the USA. PLOS Computational Biology, 15(5):1–20.
 Bracher (2019) Bracher, J. (2019). On the multibin logarithmic score used in the FluSight competitions. Proceedings of the National Academy of Sciences, in press, Sep 2019, https://doi.org/10.1073/pnas.1912147116.
 Brooks et al. (2018) Brooks, L. C., Farrow, D. C., Hyun, S., Tibshirani, R. J., and Rosenfeld, R. (2018). Nonmechanistic forecasts of seasonal influenza with iterative one-week-ahead distributions. PLOS Computational Biology, 14(6):1–29.
 Centers for Disease Control and Prevention (2018) Centers for Disease Control and Prevention (2018). Preliminary Guidelines for the 2018–19 Influenza Forecasting Challenge. Accessible online at https://predict.cdc.gov/api/v1/attachments/flusight%202018%E2%80%932019/flu_challenge_201819_tentativefinal_9.18.18.docx, retrieved on 23 April 2019.
 Farrow et al. (2017) Farrow, D. C., Brooks, L. C., Hyun, S., Tibshirani, R. J., Burke, D. S., and Rosenfeld, R. (2017). A human judgment approach to epidemiological forecasting. PLOS Computational Biology, 13(3):1–19.
 Gneiting et al. (2007) Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268.

 Gneiting and Raftery (2007) Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.
 Held et al. (2017) Held, L., Meyer, S., and Bracher, J. (2017). Probabilistic forecasting in infectious disease epidemiology: the 13th Armitage lecture. Statistics in Medicine, 36(22):3443–3460.
 Jolliffe (2008) Jolliffe, I. T. (2008). The impenetrable hedge: a note on propriety, equitability and consistency. Meteorological Applications, 15(1):25–29.
 Joyce (2011) Joyce, J. (2011). Kullback-Leibler divergence. In Lovric, M., editor, International Encyclopedia of Statistical Science, pages 720–722. Springer, Berlin.
 Kandula et al. (2019) Kandula, S., Pei, S., and Shaman, J. (2019). Improved forecasts of influenza-associated hospitalization rates with Google search trends. Journal of The Royal Society Interface, 16(155):20190080.
 Kandula and Shaman (2019) Kandula, S. and Shaman, J. (2019). Near-term forecasts of influenza-like illness: An evaluation of autoregressive time series approaches. Epidemics, 27:41–51.
 Kandula et al. (2018) Kandula, S., Yamana, T., Pei, S., Yang, W., Morita, H., and Shaman, J. (2018). Evaluation of mechanistic and statistical methods in forecasting influenza-like illness. Journal of The Royal Society Interface, 15(144):20180174.
 McGowan et al. (2019) McGowan, C., Biggerstaff, M., Johansson, M., Apfeldorf, K., Ben-Nun, M., Brooks, L., Convertino, M., Erraguntla, M., Farrow, D., Freeze, J., Ghosh, S., Hyun, S., Kandula, S., Lega, J., Liu, Y., Michaud, N., Morita, H., Niemi, J., Ramakrishnan, N., Ray, E., Reich, N., Riley, P., Shaman, J., Tibshirani, R., Vespignani, A., Zhang, Q., Reed, C., and The Influenza Forecasting Working Group (2019). Collaborative efforts to forecast seasonal influenza in the United States, 2015–2016. Scientific Reports, Article Nr. 683.
 Osthus et al. (2019a) Osthus, D., Daughton, A. R., and Priedhorsky, R. (2019a). Even a good influenza forecasting model can benefit from internet-based nowcasts, but those benefits are limited. PLOS Computational Biology, 15(2):1–19.
 Osthus et al. (2019b) Osthus, D., Gattiker, J., Priedhorsky, R., and Del Valle, S. Y. (2019b). Dynamic Bayesian influenza forecasting in the United States with hierarchical discrepancy (with discussion). Bayesian Analysis, 14(1):261–312.
 Reich et al. (2019a) Reich, N., Osthus, D., Ray, E., Yamana, T., Biggerstaff, M., Johansson, M., Rosenfeld, R., and Shaman, J. (2019a). Scoring probabilistic forecasts to maximize public health interpretability. Proceedings of the National Academy of Sciences, in press, Sep 2019, https://doi.org/10.1073/pnas.1912694116.
 Reich et al. (2019b) Reich, N. G., Brooks, L. C., Fox, S. J., Kandula, S., McGowan, C. J., Moore, E., Osthus, D., Ray, E. L., Tushar, A., Yamana, T. K., Biggerstaff, M., Johansson, M. A., Rosenfeld, R., and Shaman, J. (2019b). A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States. Proceedings of the National Academy of Sciences, 116(8):3146–3154.
 Viboud and Vespignani (2019) Viboud, C. and Vespignani, A. (2019). The future of influenza forecasts. Proceedings of the National Academy of Sciences, 116(8):2802–2804.
 Zimmer et al. (2018) Zimmer, C., Leuba, S. I., Yaesoubi, R., and Cohen, T. (2018). Use of daily internet search query data improves real-time projections of influenza epidemics. Journal of The Royal Society Interface, 15(147):20180220.