An extended note on the multibin logarithmic score used in the FluSight competitions

by Johannes Bracher

In recent years the Centers for Disease Control and Prevention (CDC) have organized FluSight influenza forecasting competitions. To evaluate the participants' forecasts a multibin logarithmic score has been created, which is a non-standard variant of the established logarithmic score. Unlike the original log score, the multibin version is not proper and may thus encourage dishonest forecasting. We explore the practical consequences this may have, using forecasts from the 2016/17 FluSight competition for illustration.




1 Introduction

In recent years the Centers for Disease Control and Prevention (CDC) have organized FluSight influenza forecasting competitions (Reich et al., 2019b), which have “pioneered infectious disease forecasting in a formal way” (Viboud and Vespignani, 2019). The competitions distinguish themselves by an elaborate technical infrastructure which allows a large number of participating teams to submit weekly forecasts for several quantities in real time. All targets are based on a measure called weighted influenza-like illness (wILI), which describes the proportion of outpatient visits due to influenza-like symptoms. Specifically, the targets are (see Reich et al. 2019b, Fig. 1B): (a) the wILI values one to four weeks ahead of the last available observation, classified into bins of width 0.1%; (b) the week of the season onset; (c) the peak week; (d) the peak intensity, i.e. the wILI value in the peak week. All targets are discrete, and participants submit their forecasts in the form of distributions assigning a probability to each possible outcome. Forecasts are evaluated based on the multibin logarithmic score (Centers for Disease Control and Prevention, 2018), a non-standard variant of the logarithmic score (Gneiting and Raftery, 2007). Following its use in the FluSight competitions, this score has also been adopted in numerous scientific works (e.g. Akhmetzhanov et al. 2019, Ben-Nun et al. 2019, Brooks et al. 2018, Farrow et al. 2017, Kandula et al. 2018, 2019, Kandula and Shaman 2019, McGowan et al. 2019, Osthus et al. 2019a, Osthus et al. 2019b, Reich et al. 2019b, Zimmer et al. 2018), even though it is not proper. Propriety is generally considered an important requirement for scoring rules (Gneiting and Raftery, 2007; Held et al., 2017), as it encourages honesty of forecasters. The goal of this note is to explore the practical implications the use of an improper scoring rule may have, expanding on a previously published letter (Bracher, 2019).

2 Proper scoring rules

The evaluation of probabilistic forecasts requires comparison of a predictive distribution $F$ for a quantity $Y$ to a single observed outcome $y_{obs}$. Various scoring rules have been suggested to this end, and there is no single “best” approach. However, it is agreed that propriety is a desirable property of a score (Gneiting and Raftery, 2007). A score is called proper if its expectation is maximized by the true distribution of $Y$, and strictly proper if this maximum is unique. Intuitively speaking, the highest expected score should be achieved by a forecast which is based on a perfect understanding of the system to be predicted.

Another implication of propriety is that it encourages honesty of forecasters. To understand this, assume a forecaster whose belief about $Y$ is given by the distribution $F$. The forecaster is asked to issue a predictive distribution for $Y$, and will receive a reward depending on the agreement between this forecast and the outcome $y_{obs}$. This is measured using a score $S$, the expectation of which the forecaster consequently aims to maximize. She will do so based on her true belief $F$, but she is not obliged to actually issue $F$ as her forecast to be evaluated. If there is a different predictive distribution $G$ so that

\[ \mathbb{E}_F[S(G, Y)] > \mathbb{E}_F[S(F, Y)], \tag{1} \]

i.e. $G$ has a higher expected score than $F$ if $F$ is actually true, the forecaster can issue $G$ instead. Such strategies are called hedging, and “it is generally accepted that it is undesirable to use a score for which hedging can improve the score or its expected value” (Jolliffe, 2008, p.25). To discourage hedging and the reporting of forecasts which differ from forecasters’ actual beliefs, the score should be constructed such that there can be no pair of $F$ and $G$ for which (1) holds. This is exactly the definition of a proper score.

3 The log score and the multibin log score

A widely used score is the logarithmic or log score. It is defined as (Gneiting and Raftery, 2007)

\[ \text{logS}(F, y_{obs}) = \log f(y_{obs}), \]

where $f$ is the density or probability mass function of the forecast distribution $F$. For a categorical $Y$ with ordered levels $1, \dots, K$ and probabilities $p_1, \dots, p_K$ as in the FluSight competitions this becomes

\[ \text{logS}(F, y_{obs}) = \log p_{y_{obs}}. \]

This score has many desirable properties (Gneiting et al., 2007); notably, it is strictly proper.
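Strict propriety of the log score can be checked numerically in a small case. The following Python sketch uses an illustrative three-category belief (the numbers are an assumption made for this example, not FluSight data) and evaluates the expected log score of every candidate forecast on a coarse grid of the probability simplex. The helper name `expected_log_score` is hypothetical.

```python
import math

def expected_log_score(belief, forecast):
    """Expected log score of `forecast` when outcomes are drawn from `belief`."""
    total = 0.0
    for p, q in zip(belief, forecast):
        if p > 0:
            if q <= 0:
                return float("-inf")  # a possible outcome was given probability 0
            total += p * math.log(q)
    return total

# Illustrative belief over three ordered categories (assumed values).
belief = (0.2, 0.5, 0.3)

# Candidate forecasts: all distributions whose probabilities are multiples of 1/20.
n = 20
candidates = [
    (a / n, b / n, (n - a - b) / n)
    for a in range(n + 1)
    for b in range(n + 1 - a)
]

best = max(candidates, key=lambda g: expected_log_score(belief, g))
honest = expected_log_score(belief, belief)

print("best forecast on the grid:", best)
print("expected score of the honest forecast:", round(honest, 4))
```

On this grid the maximizer coincides with the belief itself, so no candidate forecast satisfies inequality (1): under the log score there is nothing to gain from reporting anything other than the true belief.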

In the FluSight competitions a modified log score is used in which not only the probability mass assigned to the observed outcome $y_{obs}$, but also that assigned to the $d$ neighbouring values on either side is counted. The resulting multibin log score (Centers for Disease Control and Prevention, 2018) is defined as

\[ \text{MBlogS}(F, y_{obs}) = \log\left( \sum_{j=-d}^{d} p_{y_{obs}+j} \right), \tag{2} \]

where $p_i = 0$ for $i < 1$ and $i > K$. The FluSight organizers chose $d = 5$ for the wILI forecasts (i.e. forecasts within a range of 0.5 percentage points are considered accurate) and $d = 1$ week for the onset and peak timing. The multibin log score has been argued to measure “accuracy of practical significance” (Reich et al. 2019b, p.3153) while showing little sensitivity to retrospective corrections of wILI values (McGowan et al. 2019). It has therefore been favoured over the regular log score. For both scores larger values are better and overall results are obtained by averaging over all forecasts issued by a team.
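The two scores can be written down in a few lines. The Python sketch below is a minimal illustration, not the scoring code used by the FluSight organizers: the function names and the example forecast are assumptions, categories are addressed by 0-based position, and bins outside the support contribute probability 0.

```python
import math

def log_score(probs, y_obs):
    """Regular log score: log of the probability assigned to the observed category."""
    p = probs[y_obs]
    return math.log(p) if p > 0 else float("-inf")

def multibin_log_score(probs, y_obs, d):
    """Multibin log score: log of the mass within d categories of the observed one."""
    lo = max(0, y_obs - d)
    hi = min(len(probs), y_obs + d + 1)
    mass = sum(probs[lo:hi])
    return math.log(mass) if mass > 0 else float("-inf")

# Assumed forecast over seven ordered categories (e.g. weeks 1 to 7).
forecast = [0.0, 0.1, 0.2, 0.4, 0.2, 0.1, 0.0]
y_obs = 2  # third category observed

print(log_score(forecast, y_obs))                # log(0.2)
print(multibin_log_score(forecast, y_obs, d=1))  # log(0.1 + 0.2 + 0.4)
```

Because the multibin version sums over a window that contains the observed category, it can never be smaller than the regular log score of the same forecast.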

4 A hedging strategy for the multibin log score

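As also mentioned by Reich et al. (2019b), the multibin log score is improper. We now show how a forecaster can apply hedging to improve the expected score under her true belief $F$. For the following assume that $F$ assigns probability 0 to the $d$ extreme categories on either end of the support, i.e.

\[ p_1 = \dots = p_d = p_{K-d+1} = \dots = p_K = 0, \tag{3} \]

where, as before, $p_i$ is the probability $F$ assigns to category $i$. This rids us of extra considerations on these categories and can always be achieved by adding categories to the support. Then denote by $\tilde{F}$ a distribution with the same support as $F$ and probabilities

\[ \tilde{p}_i = \frac{1}{2d+1} \sum_{j=-d}^{d} p_{i+j}, \quad i = 1, \dots, K, \tag{4} \]

where again $p_i = 0$ for $i < 1$ and $i > K$. This represents a “blurred” version of $F$, where we always re-distribute the probability mass for one outcome equally between itself and the $d$ neighbouring ones on either side (i.e. the $\tilde{p}_i$ are “moving averages” of the $p_i$). Condition (3) ensures that $\sum_{i=1}^{K} \tilde{p}_i = 1$, so that $\tilde{F}$ is a well-defined distribution. The multibin log score of $F$ can now be expressed through the regular log score of $\tilde{F}$, as

\[ \text{MBlogS}(F, y_{obs}) = \log\left( \sum_{j=-d}^{d} p_{y_{obs}+j} \right) = \log\left( (2d+1)\, \tilde{p}_{y_{obs}} \right) = \text{logS}(\tilde{F}, y_{obs}) + \log(2d+1). \tag{5} \]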
Applying the MBlogS is thus essentially the same as applying the regular log score, but after “blurring” the predictive distribution as in equation (4). As the regular log score is proper, a forecaster then has an incentive to issue a sharper forecast $G$ so that the corresponding blurred distribution $\tilde{G}$ (with probabilities $\tilde{q}_i$ derived from the probabilities $q_i$ of $G$ in analogy to (4)) is as close as possible to $F$. If there is a $G$ so that $\tilde{G}$ corresponds exactly to $F$, it can be found via recursive computations. Otherwise an optimal $G$ is obtained by numerically maximizing $\mathbb{E}_F[\text{logS}(\tilde{G}, Y)]$ with respect to $q_1, \dots, q_K$, subject to $G$ being a valid probability distribution. This is the same as minimizing the Kullback-Leibler divergence

\[ D_{KL}(F \,\|\, \tilde{G}) = \sum_{i=1}^{K} p_i \log \frac{p_i}{\tilde{q}_i} \]

(Joyce, 2011) of $F$ and $\tilde{G}$. The optimum is not necessarily unique, but in general $G$ implies less variability than $F$.
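The hedging strategy can be sketched in a few lines of Python. The text does not specify which numerical optimizer is used, so the sketch below swaps in a simple EM-style multiplicative update (the standard update for mixture weights, here with one uniform window per category); it is one possible choice, not necessarily the author's. All function names are hypothetical, and the belief used for illustration (weeks 1 to 7 with equal probability on weeks 3, 4 and 5, and one neighbouring week counted on either side) is an assumption.

```python
import math

def window_mass(probs, i, d):
    """Mass within d bins of bin i (0-based); bins outside the support count as 0."""
    return sum(probs[max(0, i - d):min(len(probs), i + d + 1)])

def blur(probs, d):
    """Blurred distribution as in equation (4): moving average over 2d + 1 bins."""
    return [window_mass(probs, i, d) / (2 * d + 1) for i in range(len(probs))]

def expected_mblogs(belief, forecast, d):
    """Expected multibin log score of `forecast` when outcomes follow `belief`."""
    total = 0.0
    for i, p in enumerate(belief):
        if p > 0:
            mass = window_mass(forecast, i, d)
            if mass <= 0:
                return float("-inf")
            total += p * math.log(mass)
    return total

def optimise_forecast(belief, d, n_iter=500):
    """EM-style updates that increase the expected MBlogS under `belief`."""
    K = len(belief)
    q = [1.0 / K] * K  # start from a uniform forecast
    for _ in range(n_iter):
        # believed probability of bin i relative to the mass the forecast puts near it
        ratio = [belief[i] / window_mass(q, i, d) if belief[i] > 0 else 0.0
                 for i in range(K)]
        # rescale each q_j by the summed ratios of the bins its window covers
        q = [q[j] * window_mass(ratio, j, d) for j in range(K)]
    return q

# Assumed belief: equal probability on weeks 3, 4 and 5 (positions 2, 3, 4).
belief = [0, 0, 1 / 3, 1 / 3, 1 / 3, 0, 0]
d = 1
optimised = optimise_forecast(belief, d)

print("optimised forecast:", [round(q, 3) for q in optimised])
print("its blurred version:", [round(b, 3) for b in blur(optimised, d)])
print("expected MBlogS, honest:", expected_mblogs(belief, belief, d))
print("expected MBlogS, optimised:", expected_mblogs(belief, optimised, d))
```

In this setting the optimized forecast concentrates essentially all of its mass on week 4, its blurred version reproduces the belief, and its expected MBlogS is higher than that of the honest forecast. The multiplicative update keeps the forecast non-negative and summing to one without explicit constraints, which is why it was chosen here over a general-purpose optimizer.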

Intuitively speaking, a forecaster is incentivized to issue a sharper, more “risky” forecast because the MBlogS does not penalize a low probability assigned to the observed value as long as the neighbouring weeks or bins received enough probability mass. Indeed, the optimized forecast $G$ will often cover outcomes with a high probability under $F$ exclusively by assigning probability mass to their neighbours. We illustrate this using some example forecasts of the peak timing (i.e. $d = 1$), visualized in Figure 1:

Example 1: Assume we are sure that the peak of the season will occur between weeks 3 and 5, more precisely $F$ is given by $p_3 = p_4 = p_5 = 1/3$. The expected MBlogS under $F$ when reporting $F$ is $\frac{2}{3} \log \frac{2}{3} \approx -0.27$. If, however, we report $G$ with $q_4 = 1$, i.e. claim to be sure that the peak occurs in week 4, we can expect a score of $0$. In fact, $G$ will score at least as well as $F$ for all three outcomes we consider possible, and better for two of them.

Example 2: Our true belief $F$ is that the peak will occur around week 4, but even weeks 2 and 6 are considered possible. In this case we can find $G$ so that $\tilde{G}$ corresponds exactly to $F$. This $G$ assigns positive probability only to weeks 3, 4 and 5, so we should claim the peak to definitely occur in one of these weeks. Under $F$, the expected multibin score of $G$ exceeds that of $F$.

Example 3: Now assume $F$ to be similar to the previous example, but with more probability assigned to weeks 2 and 6. The optimized forecast distribution $G$ is now bimodal and the peak is claimed to occur either in week 3 or 5. Under $F$, the expected multibin score of $G$ again exceeds that of $F$.

Example 4: Lastly assume $F$ to be such that we consider it likely that the peak occurs in week 2, but it may also occur later. In this setting there is no $G$ so that $\tilde{G}$ corresponds exactly to $F$, and $G$ is obtained by numerical optimization. To get the highest expected score we should shift the mode of our predictive distribution and claim to be almost certain that the peak occurs in week 3. Under $F$, the expected multibin score of $G$ again exceeds that of $F$.

The patterns observed here also occur in many other settings. The optimized forecasts $G$ are sharper than the respective $F$. Moreover, the mode often gets shifted by up to $d$ weeks, and one or several additional local modes can occur.

Figure 1: Examples 1–4: Original forecasts $F$, optimized versions $G$ and the respective blurred distributions $\tilde{F}$ and $\tilde{G}$. Note that $F$ and $\tilde{G}$ are identical in Examples 1–3, but not 4. Expected scores are computed under $F$.

5 Application to FluSight forecasts

To stress its practical relevance we apply the hedging strategy from the previous section to real forecasts from the FluSight competitions. These are publicly available from the FluSight Collaboration. Specifically, we consider national-level forecasts from the 2016/17 season submitted by the Los Alamos National Laboratory (LANL) team. Their dynamic Bayesian forecasting method has shown remarkably good performance over several years (Osthus et al., 2019b). We follow the same evaluation procedure as in Reich et al. (2019b), where average scores are only computed from the relevant parts of the season (e.g. forecasts of the onset week are ignored once it is clear that the onset has occurred; see p.8 in Reich et al. 2019b).

For all forecasts of the seven targets (one to four-week-ahead wILI, onset and peak timing, peak intensity) we obtained optimized versions $G$ with the respective value of $d$. For illustration, Figure 2 shows forecasts of the onset timing issued in calendar weeks 49 and 50, 2016. As in Example 4, the optimized forecast in week 49 has its mode shifted by one week. In both cases the optimized $G$ are visibly sharper and multimodal, even though the corresponding $F$ are unimodal. Averaged over the 2016/17 season, the optimized forecasts indeed yield a higher and thus improved MBlogS for the onset timing (-0.33 vs. -0.39).

Figure 2: Forecasts $F$ for the onset week, submitted by the LANL team in weeks 49–50, 2016, and optimized versions $G$. Diamonds mark the true onset week. Expected scores are computed under $F$.

Figure 3 shows the same for one-week-ahead forecasts of wILI, i.e. now we use $d = 5$. The optimized $G$ leave gaps between the values to which they assign positive probabilities. These forecast distributions with many spikes are unlikely to be useful to public health experts. Nonetheless, averaged over the course of the season, the optimized forecasts outperform the original ones (-0.19 vs. -0.30). Indeed, as shown in Table 1, such improvements are also achieved for the remaining five targets. This illustrates that the hedging strategy enabled by the improper MBlogS can lead to non-negligible improvements of average scores in practice.

Figure 3: Forecasts $F$ for wILI (one week ahead), submitted by the LANL team in weeks 6–7, 2017, and optimized versions $G$. Diamonds mark the observed wILI values. Expected scores are computed under $F$.
                      1 wk    2 wk    3 wk    4 wk    onset week    peak week    peak intensity
original forecasts    -0.30   -0.81   -0.85   -0.89   -0.39         -0.48        -0.62
optimized forecasts   -0.19   -0.75   -0.78   -0.84   -0.33         -0.43        -0.59

Table 1: Average multibin log scores for different targets at the national level over the course of the 2016/17 season: original forecasts as issued by the LANL team and optimized forecasts (with $d = 1$ for onset and peak timing, $d = 5$ otherwise).

Note that the numbers given here are not directly comparable to the ones in Reich et al. (2019a), Fig. 1. We focus on the season 2016/17 and the national level, while Reich et al. average over different seasons and geographical resolutions.
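To put the differences in Table 1 on a more tangible scale, note that exponentiating an average MBlogS gives the geometric mean of the probability mass a team placed within $d$ bins of the observed values. The following Python sketch applies this to the table entries; the variable names are chosen for illustration only.

```python
import math

# Average multibin log scores copied from Table 1.
targets = ["1 wk", "2 wk", "3 wk", "4 wk", "onset week", "peak week", "peak intensity"]
original = [-0.30, -0.81, -0.85, -0.89, -0.39, -0.48, -0.62]
optimized = [-0.19, -0.75, -0.78, -0.84, -0.33, -0.43, -0.59]

for name, before, after in zip(targets, original, optimized):
    gain = after - before
    # exp(mean log mass) is the geometric mean of the mass near the observed value
    print(f"{name:15s} gain {gain:+.2f}   "
          f"geometric mean mass {math.exp(before):.2f} -> {math.exp(after):.2f}")
```

The gain is positive for all seven targets, in line with the statement that the optimized forecasts outperform the original ones throughout.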

6 Discussion

We showed that the multibin log score used in the CDC FluSight competitions incentivizes hedging, i.e. tuning forecast distributions in a specific way before submission. While we strongly doubt that participants have consciously tried to game the score, it is possible that this happens unintentionally. In forecasting, cross-validation methods to optimize forecasts for a given evaluation metric are common. Such methods could lead to hedging of the score without the authors being aware. As in previous work (Held et al., 2017) we therefore advocate the use of proper scoring rules to evaluate epidemic forecasts, e.g. the regular log score. Measures which are easier to interpret could be reported as a supplement to facilitate communication with public health experts.
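The concern about unintentional hedging can be illustrated with a toy tuning exercise. In the Python sketch below, a forecaster whose untuned forecast equals the true distribution (a discretized bell curve, chosen purely for illustration) selects a single sharpness exponent `tau` by maximizing the expected MBlogS, which stands in for a long-run cross-validated average score. All names and parameter values are assumptions made for this example.

```python
import math

def expected_mblogs(truth, forecast, d):
    """Expected multibin log score of `forecast` when outcomes follow `truth`."""
    K = len(truth)
    total = 0.0
    for i, p in enumerate(truth):
        mass = sum(forecast[max(0, i - d):min(K, i + d + 1)])
        total += p * math.log(mass)
    return total

def sharpen(probs, tau):
    """Raise probabilities to the power tau and renormalise (tau > 1 sharpens)."""
    powered = [p ** tau for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

# Assumed truth: bell-shaped distribution over 21 bins, one neighbouring bin counted.
K, centre, sigma, d = 21, 10, 1.5, 1
weights = [math.exp(-0.5 * ((i - centre) / sigma) ** 2) for i in range(K)]
norm = sum(weights)
truth = [w / norm for w in weights]

# "Tuning": pick the exponent with the best expected MBlogS on a grid from 0.5 to 3.0.
taus = [round(0.5 + 0.1 * k, 1) for k in range(26)]
scores = {tau: expected_mblogs(truth, sharpen(truth, tau), d) for tau in taus}
best_tau = max(scores, key=scores.get)
honest_score = expected_mblogs(truth, truth, d)

print("selected exponent:", best_tau)
print("expected MBlogS, untuned vs. tuned:", honest_score, scores[best_tau])
```

In this setup the selected exponent is larger than 1, i.e. the tuned forecast is sharper than the true distribution although no deliberate gaming took place. Repeating the exercise with the regular log score would leave the exponent at 1, since that score is proper.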

Data and code availability: The data used in this note are available from the FluSight Collaboration. R code to reproduce the presented results is available online.

Acknowledgements: I would like to thank T. Gneiting for helpful discussions and the FluSight Collaboration for making its forecasts publicly available. I also thank N.G. Reich for a very interesting discussion on the points raised in my letter.

References


  • Akhmetzhanov et al. (2019) Akhmetzhanov, A. R., Lee, H., Jung, S., Kayano, T., Yuan, B., and Nishiura, H. (2019). Analyzing and forecasting the Ebola incidence in North Kivu, the Democratic Republic of the Congo from 2018–19 in real time. Epidemics, 27:123 – 131.
  • Ben-Nun et al. (2019) Ben-Nun, M., Riley, P., Turtle, J., Bacon, D. P., and Riley, S. (2019). Forecasting national and regional influenza-like illness for the USA. PLOS Computational Biology, 15(5):1–20.
  • Bracher (2019) Bracher, J. (2019). On the multibin logarithmic score used in the FluSight competitions. Proceedings of the National Academy of Sciences, in press, Sep 2019.
  • Brooks et al. (2018) Brooks, L. C., Farrow, D. C., Hyun, S., Tibshirani, R. J., and Rosenfeld, R. (2018). Nonmechanistic forecasts of seasonal influenza with iterative one-week-ahead distributions. PLOS Computational Biology, 14(6):1–29.
  • Centers for Disease Control and Prevention (2018) Centers for Disease Control and Prevention (2018). Preliminary Guidelines for the 2018-19 Influenza Forecasting Challenge. Accessible online, retrieved on 23 April 2019.
  • Farrow et al. (2017) Farrow, D. C., Brooks, L. C., Hyun, S., Tibshirani, R. J., Burke, D. S., and Rosenfeld, R. (2017). A human judgment approach to epidemiological forecasting. PLOS Computational Biology, 13(3):1–19.
  • Gneiting et al. (2007) Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268.
  • Gneiting and Raftery (2007) Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.
  • Held et al. (2017) Held, L., Meyer, S., and Bracher, J. (2017). Probabilistic forecasting in infectious disease epidemiology: the 13th Armitage lecture. Statistics in Medicine, 36(22):3443–3460.
  • Jolliffe (2008) Jolliffe, I. T. (2008). The impenetrable hedge: a note on propriety, equitability and consistency. Meteorological Applications, 15(1):25–29.
  • Joyce (2011) Joyce, J. (2011). Kullback-Leibler divergence. In Lovric, M., editor, International Encyclopedia of Statistical Science, pages 720–722. Springer, Berlin.
  • Kandula et al. (2019) Kandula, S., Pei, S., and Shaman, J. (2019). Improved forecasts of influenza-associated hospitalization rates with Google search trends. Journal of The Royal Society Interface, 16(155):20190080.
  • Kandula and Shaman (2019) Kandula, S. and Shaman, J. (2019). Near-term forecasts of influenza-like illness: An evaluation of autoregressive time series approaches. Epidemics, 27:41–51.
  • Kandula et al. (2018) Kandula, S., Yamana, T., Pei, S., Yang, W., Morita, H., and Shaman, J. (2018). Evaluation of mechanistic and statistical methods in forecasting influenza-like illness. Journal of The Royal Society Interface, 15(144):20180174.
  • McGowan et al. (2019) McGowan, C., Biggerstaff, M., Johansson, M., Apfeldorf, K., Ben-Nun, M., Brooks, L., Convertino, M., Erraguntla, M., Farrow, D., Freeze, J., Ghosh, S., Hyun, S., Kandula, S., Lega, J., Liu, Y., Michaud, N., Morita, H., Niemi, J., Ramakrishnan, N., Ray, E., Reich, N., Riley, P., Shaman, J., Tibshirani, R., Vespignani, A., Zhang, Q., Reed, C., and The Influenza Forecasting Working Group (2019). Collaborative efforts to forecast seasonal influenza in the United States, 2015–2016. Scientific Reports, Article Nr. 683.
  • Osthus et al. (2019a) Osthus, D., Daughton, A. R., and Priedhorsky, R. (2019a). Even a good influenza forecasting model can benefit from internet-based nowcasts, but those benefits are limited. PLOS Computational Biology, 15(2):1–19.
  • Osthus et al. (2019b) Osthus, D., Gattiker, J., Priedhorsky, R., and Del Valle, S. Y. (2019b). Dynamic Bayesian influenza forecasting in the United States with hierarchical discrepancy (with discussion). Bayesian Analysis, 14(1):261–312.
  • Reich et al. (2019a) Reich, N., Osthus, D., Ray, E., Yamana, T., Biggerstaff, M., Johansson, M., Rosenfeld, R., and Shaman, J. (2019a). Scoring probabilistic forecasts to maximize public health interpretability. Proceedings of the National Academy of Sciences, in press, Sep 2019.
  • Reich et al. (2019b) Reich, N. G., Brooks, L. C., Fox, S. J., Kandula, S., McGowan, C. J., Moore, E., Osthus, D., Ray, E. L., Tushar, A., Yamana, T. K., Biggerstaff, M., Johansson, M. A., Rosenfeld, R., and Shaman, J. (2019b). A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States. Proceedings of the National Academy of Sciences, 116(8):3146–3154.
  • Viboud and Vespignani (2019) Viboud, C. and Vespignani, A. (2019). The future of influenza forecasts. Proceedings of the National Academy of Sciences, 116(8):2802–2804.
  • Zimmer et al. (2018) Zimmer, C., Leuba, S. I., Yaesoubi, R., and Cohen, T. (2018). Use of daily internet search query data improves real-time projections of influenza epidemics. Journal of The Royal Society Interface, 15(147):20180220.