1 Introduction
Forecasting problems frequently have natural hierarchies. For example, looking at employment levels in the United States, statelevel figures must sum to nationallevel, and industrylevel must sum to overall employment. In Figure 1, through could be counties that sum to the states and , which sum to the national employment .
Forecasting each time series independently leads to several undesirable outcomes. It throws away information on the nested structure of the problem, which can result in less accurate forecasts and unreliably estimated confidence intervals. Researchers estimating many time series simultaneously test multiple hypotheses about the nested forecasts, requiring them to adjust confidence intervals, diluting the statistical power of models. Further, discrepancies make decisionmaking more difficult for policymakers, who may not know how to translate a case where the childlevel estimates add up to more than the parentlevel forecast into good policy.
When forecasting time series with a known hierarchical structure, the current standard for achieving coherence in a given hierarchy is by (Hyndman et al., 2011; Wickramasuriya et al., 2018), who propose an approach to combine independent forecasts for each time series to obtain a weighted sum that minimizes the reconciliation error. Given a set of forecasts and a hierarchy , Hyndman’s “optimal reconciliation” (often referred to as HTS for “hierarchical time series”) estimates a matrix such that , where
are the reconciled forecasts. They use the forecasts and estimated variances of each time series to create simple prediction intervals that ignore the hierarchical structure.
(Wickramasuriya et al., 2018) derive an estimator for based on the trace of the covariance matrix of the forecast errors.We propose a different approach, leveraging techniques from the machine learning literature in selfsupervised learning (SSL), which utilizes the structure of the problem to improve predictions when limited labeled data is available. In our case, the forecaster has access to labeled data from the training period alongside unlabeled data from the forecast period by relying on structural properties of the data. This approach has been used successfully in fields such as computer vision. In
(Rasmus et al., 2015), the authors propose that the classifications of perturbed (e.g., rotated) versions of the same image should be similar to one another, even if the original is not labeled and thus we do not know whether they are correct.Compared to HTS, our method results in more accurate forecasts (as measured by outofsample mean squared errors in simulations) that are closely reconciled. Because all forecasts are estimated simultaneously, our estimated confidence intervals account for uncertainty arising from reconciliation error, and avoid expost adjustments for multiple hypothesis testing.
When the child time series are heterogeneous, and the researcher has access to highquality covariates that predict the parent and child time series well, disaggregating a single parent time series into multiple child time series and forecasting the entire hierarchy at once can lead to significant improvements in test accuracy, as shown in section 3.1. Randomly or arbitrarily splitting the parent into smaller time series will not improve test errors, and may lead to randomly divergent forecasts. For example, splitting a sample of companies alphabetically will be less useful than splitting them by size, location, or industry, for which we can find suitable theorydriven covariates.
This method is applicable to pure forecasting (with access only to past observations of the target time series and covariates) as well as to counterfactual forecasting (with access to contemporaneous covariate time series). In fact, this method can be used in any case with a maximum likelihood objective. Here we focus on counterfactual forecasting using a synthetic control method, initially developed by (Abadie et al., 2010). They construct a counterfactual forecast for a US state with a tobacco tax increase by constructing an artificial control group using unaffected states as covariates in a pretreatment window and creating a weighted sum. In the original paper, the synthetic control was limited to a nonnegative weighted sum of other covariates (also called donors) and sparsity between the donors was enforced by choosing coefficients that were on the simplex (i.e., they sum to one). These restrictions were lifted in subsequent work (Doudchenko & Imbens, 2016), allowing for more flexible relationships between target and donors. Once an artificial control group has been built, the effect of the intervention is defined as the difference between the control group and the treated group after the intervention. In the absence of an intervention, the synthetic control forecast should closely match the observed values.
(Brodersen et al., 2015) proposed an extension to the synthetic control framework by using Bayesian structural time series models (Scott & Varian, 2013) to form the synthetic control group. By leveraging the flexibility of structural time series models, they can incorporate components into their counterfactual such as seasonality, covariates, and trend terms. There are many varieties of synthetic control methods, including generalized synthetic controls(Xu, 2017), any of which can be estimated and reconciled using our proposed loss function.
2 Theory
Notation: We denote target time series as for . For all , we write the forecast for as .
We now describe the theoretical basis for our estimations, beginning with the objective function we are interested in minimizing for the simple case of a hierarchy of time series, with children that sum to a single parent. (Additional levels of the hierarchy are estimated and reconciled in the same way. Note that we do not require the hierarchy to be structured as a tree – we can optimize the model as long as the hierarchy can be described using a DAG.^{1}^{1}1For example, using overlapping product families that must simultaneously reconcile – ‘popular categories for kids’ and ‘popular categories for teens’ may both contain apparel and school supplies. Our method forecasts and reconciles all categories and groupings simultaneously.) The objective function is:
(1) 
where is a loss function, are the observations in the training period, are forecasts in the test period of the th target and is the reconciliation penalty size. Without loss of generality, we will assume that and that is a parametric forecast.
Theorem 1.
Proof: See Appendix Section A.
If , forecasts for each target become independent and are simply equal to the minimizers of the forecasting loss function. For any , each forecast will be adjusted to ensure . Note that the reconciliation penalty only applies for , because, for , minimizing the forecasting loss already implicitly minimizes the reconciliation loss, as the ground truth data satisfies the hierarchy by construction.
This approach is applicable to any method of predicting . If is a differentiable parametric function , we can optimize objective (1) using gradient based methods. In Appendix B we provide an efficient algorithm to solve (1).
We can view the reconciliation penalty as penalizing the forecaster if the time series start deviating from their natural hierarchy, and we can also justify it rigorously. Ideally, we would like to find parameters that minimize test error for , but we do not have access to for . However, the reconciliation penalty serves as a lower bound for the sum of test errors. Namely, since , the CauchySchwarz inequality implies
Intuitively, having perfect reconciliation is a necessary but not sufficient condition for a perfect forecast, therefore, if a hierarchical forecast does not perfectly reconcile, we can use that information to quantify the minimum amount by which the forecast is off.
2.1 Bayesian modeling
We now show that we can cast reconciliation in a Bayesian formulation, and use it to obtain confidence intervals that automatically incorporate uncertainty from the reconciliation step. We focus our attention on the case of counterfactual forecasting, where we have highly predictive covariates
available, and thus use a simple linear model for the likelihood. For a review of Bayesian statistics, including in time series, see
(Geweke, 2005).We forecast each time series using a potentially disjoint set of covariates , and estimate a joint likelihood for all observations:
(3)  
where LKJ prior is standard choice of matrix distribution for covariance matrix from (Lewandowski et al., 2009), is a parameter that controls how uniform the nondiagonal terms are, and is a standard halfCauchy prior on the diagonal elements of the covariance.
For the reconciliation term, note that, by definition, the difference between jointly Gaussian random variables is itself Gaussian, and therefore, for a simple hierarchy with a single parent
we have:(4) 
with prior:
In this formulation, the prior on the reconciliation variance plays the same role as the hyperparameter in (1). If we choose a large value for , we are expressing our prior belief that forecasts should reconcile closely.
An alternative approach to estimating is to observe that every is a random variable, and therefore, the linear combination is random as well. For , its variance by definition is equal to
(5) 
Therefore, when we have the full covariance matrix of the forecast errors , we can derive analytically. One property of this approach is that by specifying an additional likelihood for each reconciliation step, we avoid unrealistically narrow confidence intervals: if the model has learned a very accurate forecast for every target variable, the variance in the likelihood of the reconciliation term will necessarily be small as well, as it is a linear function of the forecast variances. This means that the model cannot overfit to the training set and predict unreasonably tight confidence intervals, as those predictions will give a large penalty for nonreconciliation. By jointly estimating the reconciliation and forecast likelihoods, we force the model to come to terms with what it does not know, which translates into decreased confidence in the forecast.
3 Applications
3.1 Synthetic data
Synthetic control models are typically used to estimate the causal impact of a treatment, but there need not be an intervention. Here, we use synthetic data with no intervention, but that includes predictive contemporaneous covariates.
no_hierarchy  no_hierarchy + hts  full_hierarchy_lam_1  full_hierarchy_lam_10  

test  mse  y_1  5.78 12.7  5.78 12.8  1.62 2.56  1.61 2.72 
y_2  3.02 9.36  3.02 9.44  0.894 1.59  0.89 1.39  
y_3  2.79 5.02  2.78 5.01  1.03 2.13  0.956 1.85  
y_4  1.29 4.1  1.3 4.0  0.496 0.874  0.462 0.692  
y_5  1.06 2.1  1.07 2.09  0.42 0.673  0.425 0.703  
y_6  1.1 1.92  1.1 1.94  0.501 1.17  0.452 0.826  
y_7  1.09 1.8  1.1 1.78  0.445 0.638  0.422 0.609  
rec  total  0.0392 0.0668  4.62e32 1e31  3.28e05 1.66e05  1.5e05 6.81e06  
train  mse  y_1  0.00575 0.00518  *  0.0212 0.0205  0.0265 0.0253 
y_2  0.00414 0.00408  0.0108 0.0104  0.0134 0.0122  
y_3  0.00414 0.00397  0.0109 0.0105  0.0135 0.0128  
y_4  0.00311 0.00318  0.00562 0.00543  0.00684 0.00632  
y_5  0.00311 0.00305  0.00563 0.00543  0.0068 0.00627  
y_6  0.00311 0.00307  0.00562 0.0054  0.00685 0.0063  
y_7  0.0031 0.00304  0.00567 0.00551  0.00696 0.00657  
rec  total  0.000195 0.000118  0.000311 0.000141  0.000817 0.000346 
To assess the extent to which our reconciliation approach works for nonlinear models, we consider an artificial dataset with ground truth available, with covariates and target variables, as in the hierarchy from Figure 1. In this setting, we show that reconciliation serves as an effective regularizer, improving forecasts even compared to HTS.
Each covariate time series is drawn from a Gaussian process prior using the Celerite library (ForemanMackey et al., 2017) and each leaf target variable
is defined as a linear transformation of both
and plus random noise. For nonleaf target variables, we sum up the children using the hierarchy shown in Figure 1. Formally, we take a kernel from Celerite and sampleWe then use a multilayer perceptron (one hidden layer with 100 units and ReLU activations) to forecast
usingas an input. The neural network gets the first 1000 time steps as a training set, and we then report the MSE (averaged over 1000 independent experiments) on train, test, and reconciliation in table
3.1.We trained the neural networks without a reconciliation penalty (no_hierarchy), and with two different values of , 1 and 10 (full_hierarchy_lam_1 and full_ hierarchy_lam_10). We also used HTS (conjugate gradient method) to reconcile the forecasts from the no_hierarchy) baseline and report those values in the no_hierarchy + hts column.
Reconciliation serves as a regularizer: by adding the reconciliation penalty (full_hierarchy_lam_1 and full_hierarchy_lam_10 columns), we decrease train set performance (4x increase in train MSE), but improve test set performance, reducing MSE by up to 3.5x compared to the no_hierarchy baseline in the parent node . Although the reconciliation penalty decreases test set error for all nodes, the reduction is greater on nodes higher in the hierarchy (, , and ). The decrease in test error is far greater than the decrease in reconciliation error; we believe this is because, by jointly training with a reconciliation loss, we give the model an inductive bias toward simpler solutions with lower test error. In contrast, HTS adjusts the forecasts as little as possible after training is complete in order to achieve reconciliation expost, so it can at most reduce the forecast error by the amount of reconciliation error, which is much smaller. This is why HTS reduces the reconciliation error to approximately zero, but with little improvement in test forecast accuracy.
4 Conclusion
This paper focuses on counterfactual forecasting, however, the reconciliation loss can be applied to any maximum likelihood objective with hierarchical data, including multitarget regression and pure forecasting. When estimating hierarchical times series using e.g. ARIMA, we can add a reconciliation term to the maximum likelihood objective and automatically obtain forecasts that are nearly reconciled and have more sensible uncertainty estimates.
Although we only prove lower bounds on the error term, we see that applying our reconciliation loss to synthetic data, where we have access to real ground truth, forecasts with the reconciliation loss term have higher accuracy compared to a strong HTS baseline. In future work, we are planning a more systematic investigation into how reconciliation affects the error as well as the inductive bias of the model.
References
 Abadie et al. (2010) Abadie, A., Diamond, A., and Hainmueller, J. Synthetic control methods for comparative case studies: Estimating the effect of california’s tobacco control program. Journal of the American statistical Association, 105(490):493–505, 2010.
 Brodersen et al. (2015) Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., and Scott, S. L. Inferring causal impact using bayesian structural timeseries models. Annals of Applied Statistics, 9:247–274, 2015.
 Doudchenko & Imbens (2016) Doudchenko, N. and Imbens, G. W. Balancing, regression, differenceindifferences and synthetic control methods: A synthesis. Technical report, National Bureau of Economic Research, 2016.
 ForemanMackey et al. (2017) ForemanMackey, D., Agol, E., Ambikasaran, S., and Angus, R. Fast and scalable gaussian process modeling with applications to astronomical time series. The Astronomical Journal, 154(6):220, 2017.
 Geweke (2005) Geweke, J. Contemporary Bayesian Econometrics and Statistics. Wiley, 2005.
 Hyndman et al. (2011) Hyndman, R. J., Ahmed, R. A., Athanasopoulos, G., and Shang, H. L. Optimal combination forecasts for hierarchical time series. Computational Statistics & Data Analysis, 55(9):2579–2589, 2011.

Lewandowski et al. (2009)
Lewandowski, D., Kurowicka, D., and Joe, H.
Generating random correlation matrices based on vines and extended
onion method.
Journal of multivariate analysis
, 100(9):1989–2001, 2009.  Mishchenko & Richtárik (2018) Mishchenko, K. and Richtárik, P. A stochastic penalty model for convex and nonconvex optimization with big constraints. arXiv preprint arXiv:1810.13387, 2018.
 Rasmus et al. (2015) Rasmus, A., Berglund, M., Honkala, M., Valpola, H., and Raiko, T. Semisupervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554, 2015.
 Scott & Varian (2013) Scott, S. L. and Varian, H. R. Predicting the present with bayesian structural time series. Available at SSRN 2304426, 2013.
 Wickramasuriya et al. (2018) Wickramasuriya, S. L., Athanasopoulos, G., and Hyndman, R. J. Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization. Journal of the American Statistical Association, 0(0):1–16, 2018. doi: 10.1080/01621459.2018.1448825. URL https://doi.org/10.1080/01621459.2018.1448825.
 Xu (2017) Xu, Y. Generalized synthetic control method: Causal inference with interactive fixed effects models. Political Analysis, 25:57–76, 2017. doi: 10.1017/pan.2016.2.
Appendix A Proof of theorem 1
The analysis below is motivated by that of constrained optimization with penalties (Mishchenko & Richtárik, 2018).
Proof.
The objective (1) can be also seen as a penalty reformulation of the harder problem:
(6)  
subject to 
Let be a solution of (6), and denote by the objective function in (1). We plug into (1) and by the definition of we have
Moreover, for we get by definition of that . Noting that
we deduce . Since is a feasible solution of (6), we additionally get . Therefore,
Dividing both sides by finishes the proof. ∎
Appendix B Optimization algorithm
Problems (1) and (6) can be solved efficiently using gradient methods. If the model used to produce is simple, we can use projected gradient methods for (6). Its relaxed version (1), however, is easier since we can compute the derivatives of the penalty term. Furthermore, it allows for using more efficient coordinate descent optimizers such as Algorithm 1.
The main feature of randomized coordinate descent applied to (1) is that it automatically randomizes data sampling as well. Indeed, if we are adjusting the parameters used to produce , , all terms that use for any and are not updated and, thus, we do not have to look up the related data. This means that the update is times more efficient in how many coordinates it updates and, additionally, up to times more efficient in how much data it processes. This makes overall training much faster.