Abstract
This paper discusses the prediction of hierarchical time series, where each upper-level time series is calculated by summing appropriate lower-level time series. Forecasts for such hierarchical time series should be coherent, meaning that the forecast for an upper-level time series equals the sum of forecasts for the corresponding lower-level time series. Previous methods for making coherent forecasts consist of two phases: first computing base (incoherent) forecasts and then reconciling those forecasts based on their inherent hierarchical structure. With the aim of improving time series predictions, we propose a structured regularization method that completes both phases simultaneously. The proposed method is based on a prediction model for bottom-level time series and uses a structured regularization term to incorporate upper-level forecasts into the prediction model. We also develop a backpropagation algorithm specialized for applying our method to artificial neural networks for time series prediction. Experimental results using synthetic and real-world datasets demonstrate the superiority of our method in terms of prediction accuracy and computational efficiency.
Introduction
Multivariate time series data often have a hierarchical (tree) structure in which each upper-level time series is calculated by summing appropriate lower-level time series. For instance, numbers of tourists are usually counted on a regional basis, such as sites, cities, regions, or countries [1]. Similarly, many companies require regionally aggregated forecasts to support resource allocation decisions [28]. Product demand is often analyzed by category to reduce the overall forecasting burden [14].
Forecasts for such hierarchical time series should be coherent, meaning that the forecast for an upper-level time series equals the sum of forecasts for the corresponding lower-level time series [3, 44]. Smoothing methods such as the moving average and exponential smoothing are widely used in both academia and industry for time series prediction [10, 20]. Although these methods provide coherent forecasts for hierarchical time series, their accuracy is low, especially for rapidly changing time series.
Another common approach for making coherent forecasts is the use of bottom-up and top-down methods [14, 24, 33, 45]. These methods first develop base forecasts by separately predicting each time series and then reconcile those base forecasts based on their inherent hierarchical structure. The bottom-up method calculates base forecasts for bottom-level time series and then aggregates them for upper-level time series. In contrast, the top-down method calculates base forecasts only for a root (total) time series and then disaggregates them according to historical proportions of lower-level time series. Park and Nassar [35] considered a hierarchical Bayesian dynamic proportions model for the top-down method to sequentially disaggregate upper-level forecasts. The middle-out method calculates base forecasts for intermediate-level time series and then applies the bottom-up and top-down methods to make upper- and lower-level forecasts, respectively. However, the bottom-up method often accumulates prediction errors as the time series level rises, and the top-down method cannot exploit detailed information about lower-level time series. Notably, when base forecasts are unbiased, only the bottom-up method gives unbiased forecasts [19].
Hyndman et al. [19] proposed a linear regression approach that optimally revises base forecasts for use with the bottom-up method. This forecast reconciliation method worked well for predicting tourism demand [1] and monthly inflation [11], and the approach can be extended to hierarchical and grouped time series [21]. van Erven and Cugliari [42] devised a game-theoretically optimal reconciliation method. Regularized regression models have also been employed to deal with high-dimensional time series [2, 4]. Wickramasuriya et al. [44] devised a sophisticated method for optimal forecast reconciliation through trace minimization, and their experimental results showed that this trace minimization method performed very well on synthetic and real-world datasets. Note, however, that all of these forecast reconciliation methods consist of two phases: first computing base forecasts and then reconciling those forecasts based on a hierarchical structure. The aim of this study was to produce better time series predictions by completing these two phases simultaneously.

Structured regularization uses inherent structural relations among explanatory variables to construct a statistical model [17, 23, 49]. Various regularization methods have been proposed for multivariate time series [34, 38], hierarchical explanatory variables [8, 27, 31, 37], and artificial neural networks [43]. Prediction of multivariate time series is related to multitask learning, which shares useful information among related tasks to enhance prediction performance for all tasks [12, 48]. Tailored regularization methods have been developed for multitask learning [13, 22] and applied to artificial neural networks [36]. To the best of our knowledge, however, no prior studies have applied structured regularization methods to the prediction of hierarchical time series.
In this study, we aimed to develop a structured regularization method that takes full advantage of the hierarchical structure for better time series predictions. Our method is based on a prediction model for bottom-level time series and uses a structured regularization term to incorporate upper-level forecasts into the prediction model. This study particularly focused on applying our method to artificial neural networks, which have been used effectively in time series prediction [15, 18, 26, 30, 46, 47]. We developed a backpropagation algorithm specialized for our structured regularization model based on artificial neural networks. Experiments applying our method to synthetic and real-world datasets demonstrated its superiority in terms of prediction accuracy and computational efficiency.
Methods
This section briefly reviews forecasts for hierarchical time series and forecast reconciliation methods. It then presents our structured regularization model and its application to artificial neural networks. This section also describes a backpropagation algorithm for artificial neural networks with structured regularization.
Forecasts for hierarchical time series
We address prediction of multivariate time series in which each series corresponds to a node of a hierarchical (tree) structure. Let $y_{i,t}$ be an observation of node $i \in N$ at time $t \in T$, where $N$ is the set of nodes and $T$ is the set of timepoints. For simplicity, we focus on two-level hierarchical structures. Fig. 1 shows an example of a two-level hierarchical structure with $|N|$ nodes, where $|\cdot|$ denotes the number of set elements. The nodes are classified as
$$N = \{1\} \cup M \cup B,$$
where node 1 is the root (level-zero) node, and $M$ and $B$ are the sets of mid-level (level-one) and bottom-level (level-two) nodes, respectively. The associated time series are characterized by the aggregation constraint
$$y_{1,t} = \sum_{i \in M} y_{i,t} = \sum_{j \in B} y_{j,t}, \qquad y_{i,t} = \sum_{j \in C(i)} y_{j,t} \quad (i \in M,\ t \in T), \tag{1}$$
where $C(i) \subseteq B$ denotes the set of child nodes of node $i \in M$. Each upper-level time series is thus calculated by summing the corresponding lower-level time series.
A hierarchical structure is represented by the structure matrix $A \in \{0,1\}^{(|N|-|B|)\times|B|}$, whose $(i,j)$ entry is 1 if bottom-level node $j$ is a descendant of upper-level node $i$ and 0 otherwise. We define the summing matrix as
$$S := \begin{pmatrix} A \\ I_{|B|} \end{pmatrix},$$
where $I_{|B|}$ is the identity matrix of size $|B|$.

Let $\boldsymbol{y}_t := (y_{i,t})_{i \in N}$ be a column vector comprising the observations of all nodes at time $t$. Similarly, for a node subset $V \subseteq N$, we define $\boldsymbol{y}^{(V)}_t := (y_{i,t})_{i \in V}$ as the observation vector of nodes $V$ at time $t$. The aggregation constraint (1) is then expressed as
$$\boldsymbol{y}^{(N\setminus B)}_t = A\,\boldsymbol{y}^{(B)}_t \quad (t \in T) \tag{2}$$
and, equivalently, as
$$\boldsymbol{y}_t = S\,\boldsymbol{y}^{(B)}_t \quad (t \in T). \tag{3}$$

Let $\hat{\boldsymbol{y}}_t := (\hat{y}_{i,t})_{i \in N}$ be a column vector comprising base forecasts at time $t$. Note that the base forecasts are calculated separately for each node $i \in N$, so they do not satisfy the aggregation constraint (2). For a node subset $V \subseteq N$, we define $\hat{\boldsymbol{y}}^{(V)}_t := (\hat{y}_{i,t})_{i \in V}$ at time $t$. Such base forecasts can be converted into coherent forecasts satisfying the aggregation constraint (2) by using a reconciliation matrix $P \in \mathbb{R}^{|B|\times|N|}$. Specifically, we develop bottom-level forecasts $P\hat{\boldsymbol{y}}_t$ and use the aggregation constraint (3) to obtain coherent forecasts $\tilde{\boldsymbol{y}}_t$, as
$$\tilde{\boldsymbol{y}}_t := SP\hat{\boldsymbol{y}}_t \quad (t \in T). \tag{4}$$
A typical example of a reconciliation matrix is
$$P = \begin{pmatrix} O & I_{|B|} \end{pmatrix},$$
where $O$ is a zero matrix of size $|B| \times (|N|-|B|)$. This leads to the bottom-up method
$$\tilde{\boldsymbol{y}}_t = S\hat{\boldsymbol{y}}^{(B)}_t \quad (t \in T). \tag{5}$$
Another example is
$$P = \begin{pmatrix} \boldsymbol{p} & O \end{pmatrix},$$
where $\boldsymbol{p} \in \mathbb{R}^{|B|}$ is a column vector comprising historical proportions of the bottom-level time series, and $O$ is a zero matrix of size $|B| \times (|N|-1)$. This results in the top-down method
$$\tilde{\boldsymbol{y}}_t = S\boldsymbol{p}\,\hat{y}_{1,t} \quad (t \in T).$$
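To make the matrix algebra concrete, the following R sketch builds the structure and summing matrices for a small assumed two-level hierarchy (root node 1; mid-level nodes 2 and 3; bottom-level nodes 4-7) and applies the bottom-up and top-down reconciliations (4)-(5); the node counts and all numerical values here are illustrative, not taken from the paper:

```r
## Structure matrix A: rows are upper-level nodes, columns are bottom-level nodes.
A <- rbind(c(1, 1, 1, 1),   # root = sum of all bottom-level nodes
           c(1, 1, 0, 0),   # node 2 = nodes 4 + 5
           c(0, 0, 1, 1))   # node 3 = nodes 6 + 7
S <- rbind(A, diag(4))      # summing matrix S = [A; I_|B|]

y_hat <- c(10, 6, 5, 3, 2, 3, 1)           # illustrative incoherent base forecasts, nodes 1..7

## Bottom-up method (5): P = [O  I] keeps the bottom-level base forecasts.
P_bu <- cbind(matrix(0, 4, 3), diag(4))
y_bu <- S %*% P_bu %*% y_hat               # coherent forecasts via Eq (4)

## Top-down method: P = [p  O] disaggregates the root forecast by proportions p.
p <- c(0.3, 0.2, 0.3, 0.2)
P_td <- cbind(p, matrix(0, 4, 6))
y_td <- S %*% P_td %*% y_hat
```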
Forecast reconciliation methods
Hyndman et al. [19] introduced the following linear regression model based on the aggregation constraint (3):
$$\hat{\boldsymbol{y}}_t = S\boldsymbol{\beta}_t + \boldsymbol{\varepsilon}_t \quad (t \in T),$$
where $\boldsymbol{\beta}_t \in \mathbb{R}^{|B|}$ is a column vector of bottom-level estimates, and $\boldsymbol{\varepsilon}_t \in \mathbb{R}^{|N|}$ is a column vector of errors having zero mean and covariance matrix $\Sigma_t$. The bottom-up method (5) with $\hat{\boldsymbol{y}}^{(B)}_t = \hat{\boldsymbol{\beta}}_t$ is then used to make coherent forecasts. If the base forecasts are unbiased and the covariance matrix $\Sigma_t$ is known, generalized least-squares estimation yields the minimum-variance unbiased estimate of $\boldsymbol{\beta}_t$. However, the covariance matrix $\Sigma_t$ is non-identifiable and therefore impossible to estimate [44].

In contrast, Wickramasuriya et al. [44] focused on the differences between observations and coherent forecasts (4),
$$\boldsymbol{y}_t - \tilde{\boldsymbol{y}}_t = \boldsymbol{y}_t - SP\hat{\boldsymbol{y}}_t \quad (t \in T).$$
The associated covariance matrix is derived as
$$V_t := \mathrm{Var}[\boldsymbol{y}_t - SP\hat{\boldsymbol{y}}_t] = SPW_tP^{\top}S^{\top} \quad (t \in T), \tag{6}$$
where $W_t := \mathrm{Var}[\boldsymbol{y}_t - \hat{\boldsymbol{y}}_t]$ is the covariance matrix of the base forecast errors. The trace of the covariance matrix (6) is minimized subject to the unbiasedness condition $SPS = S$. This yields the optimal reconciliation matrix
$$P = (S^{\top}W_t^{-1}S)^{-1}S^{\top}W_t^{-1},$$
and coherent forecasts (4) are given by
$$\tilde{\boldsymbol{y}}_t = S(S^{\top}W_t^{-1}S)^{-1}S^{\top}W_t^{-1}\hat{\boldsymbol{y}}_t \quad (t \in T). \tag{7}$$
See Wickramasuriya et al. [44] for full details.
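Continuing the sketch above, MinT reconciliation (7) can be written in a few lines of R, reusing S and y_hat; here the base-forecast error covariance $W_t$ is replaced by an identity matrix purely for illustration (estimating $W_t$ is the subject of [44]):

```r
W <- diag(7)                                # illustrative stand-in for W_t
W_inv <- solve(W)
P_mint <- solve(t(S) %*% W_inv %*% S) %*% t(S) %*% W_inv
y_mint <- S %*% P_mint %*% y_hat            # trace-minimizing coherent forecasts (7)
```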
Note, however, that in these forecast reconciliation methods, base forecasts are first determined regardless of the underlying hierarchical structure, and then those forecasts are corrected based on the hierarchical structure. In contrast, our proposal is a structured regularization model that directly computes high-quality forecasts based on the hierarchical structure.
Structured regularization model
We consider a prediction model for the bottom-level time series. Its vector of predictive values is denoted by $f(\boldsymbol{x}_t;\boldsymbol{\theta}) \in \mathbb{R}^{|B|}$, where $\boldsymbol{x}_t$ is a vector of explanatory variables and $\boldsymbol{\theta}$ is a set of model parameters. As an example, the first-order vector autoregressive model is represented as
$$f(\boldsymbol{x}_t;\boldsymbol{\theta}) = \Theta\,\boldsymbol{y}^{(B)}_{t-1},$$
where $\boldsymbol{x}_t = \boldsymbol{y}^{(B)}_{t-1}$ and $\boldsymbol{\theta} = \{\Theta\}$ with coefficient matrix $\Theta \in \mathbb{R}^{|B|\times|B|}$.
The residual sum of squares for the bottom-level time series is given by
$$\sum_{t \in T}\left\|\boldsymbol{y}^{(B)}_t - f(\boldsymbol{x}_t;\boldsymbol{\theta})\right\|^2. \tag{8}$$
We also introduce a structured regularization term that quantifies the error for the upper-level time series based on the hierarchical structure. Let $\Lambda := \mathrm{diag}(\boldsymbol{\lambda})$ be a diagonal matrix of regularization parameters, where $\boldsymbol{\lambda} = (\lambda_i)_{i \in N\setminus B} \ge \boldsymbol{0}$ is a vector of its diagonal entries. Then, we construct a structured regularization term based on the aggregation constraint (2) as
$$\sum_{t \in T}\left\|\Lambda\left(\boldsymbol{y}^{(N\setminus B)}_t - A f(\boldsymbol{x}_t;\boldsymbol{\theta})\right)\right\|^2. \tag{9}$$
Minimizing this term aids in correcting bottom-level forecasts, thus improving the upper-level forecasts.
Adding the regularization term (9) to the residual sum of squares (8) yields the objective function to be minimized. Consequently, our structured regularization model is posed as
$$\min_{\boldsymbol{\theta}}\ \sum_{t \in T}\left(\left\|\boldsymbol{y}^{(B)}_t - f(\boldsymbol{x}_t;\boldsymbol{\theta})\right\|^2 + \left\|\Lambda\left(\boldsymbol{y}^{(N\setminus B)}_t - A f(\boldsymbol{x}_t;\boldsymbol{\theta})\right)\right\|^2\right). \tag{10}$$
Here, the matrix $\Lambda$ adjusts the trade-off between minimizing the error term (8) for the bottom-level time series and minimizing the error term (9) for the upper-level time series. In the experiments section, we set its diagonal entries as
$$\lambda_i = \begin{cases}\lambda_{\mathrm{root}} & (i = 1),\\ \lambda_{\mathrm{mid}} & (i \in M),\end{cases} \tag{11}$$
where $\lambda_{\mathrm{root}} \ge 0$ and $\lambda_{\mathrm{mid}} \ge 0$ are regularization parameters for the root and mid-level time series, respectively.
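As a concrete instance of (10), the R sketch below evaluates the structured-regularization objective for an assumed first-order vector autoregressive bottom-level model, reusing the matrix A from the earlier sketch; Theta, lambda_root, and lambda_mid are illustrative assumptions, not values from the paper:

```r
lambda_root <- 0.4; lambda_mid <- 2.4                # illustrative parameter values
Lambda <- diag(c(lambda_root, rep(lambda_mid, 2)))   # Eq (11): root plus two mid-level nodes

sr_objective <- function(Theta, Y_B, Y_U) {
  ## Y_B: |B| x m matrix of bottom-level series; Y_U: matching upper-level series.
  total <- 0
  for (t in 2:ncol(Y_B)) {
    f_t   <- Theta %*% Y_B[, t - 1]                  # VAR(1) bottom-level prediction
    err_B <- Y_B[, t] - f_t                          # residual term (8)
    err_U <- Lambda %*% (Y_U[, t] - A %*% f_t)       # structured regularization term (9)
    total <- total + sum(err_B^2) + sum(err_U^2)
  }
  total                                              # objective value (10)
}
```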
Application to artificial neural networks
This study focused on applying our structured regularization model (10) to artificial neural networks for time series prediction; see Bishop [9] and Goodfellow et al. [16] for general descriptions of artificial neural networks. For simplicity, we consider a two-layer neural network like the one shown in Fig. 2, where the input vector is defined as a concatenation of lagged bottom-level observations, e.g.,
$$\boldsymbol{x}_t := \begin{pmatrix}\boldsymbol{y}^{(B)}_{t-1}\\ \boldsymbol{y}^{(B)}_{t-2}\end{pmatrix}.$$
First, we calculate the vector $\boldsymbol{u} \in \mathbb{R}^{H}$ as the weighted sum of the input entries,
$$\boldsymbol{u} = W^{(1)}\boldsymbol{x}_t, \tag{12}$$
where $W^{(1)}$ is a weight matrix to be estimated, and $H$ is the number of hidden units. This vector is transferred from the input units to the hidden units, as shown in Fig. 2.
Next, we generate the vector $\boldsymbol{z} \in \mathbb{R}^{H}$ by applying a nonlinear activation function $\sigma$ componentwise, as
$$\boldsymbol{z} = \sigma(\boldsymbol{u}).$$
Typical examples of activation functions include the logistic sigmoid function
$$\sigma(u) = \frac{1}{1 + \exp(-u)} \tag{13}$$
and the rectified linear function
$$\sigma(u) = \max\{0, u\}.$$
The vector $\boldsymbol{z}$ is transferred from the hidden units to the output units, as shown in Fig. 2.
Finally, we calculate the vector $\boldsymbol{v} \in \mathbb{R}^{|B|}$ as the weighted sum of the outputs from the hidden units,
$$\boldsymbol{v} = W^{(2)}\boldsymbol{z}, \tag{14}$$
where $W^{(2)}$ is a weight matrix to be estimated.
This process is summarized as
$$f(\boldsymbol{x}_t;\boldsymbol{\theta}) = W^{(2)}\sigma(W^{(1)}\boldsymbol{x}_t), \tag{15}$$
where the set of model parameters is $\boldsymbol{\theta} = \{W^{(1)}, W^{(2)}\}$. This neural network outputs $f(\boldsymbol{x}_t;\boldsymbol{\theta}) = \boldsymbol{v}$ as a vector of predictive values.
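A direct R transcription of the forward pass (12)-(15) might look as follows; the helper names are ours:

```r
sigmoid <- function(u) 1 / (1 + exp(-u))   # logistic activation (13)

nn_forward <- function(W1, W2, x) {
  u <- W1 %*% x        # Eq (12): weighted sums entering the hidden units
  z <- sigmoid(u)      # hidden-unit activations
  v <- W2 %*% z        # Eq (14): weighted sums entering the output units
  list(u = u, z = z, v = v)                # v is f(x; theta) in Eq (15)
}
```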
Backpropagation algorithm
We develop a backpropagation algorithm specialized for training artificial neural networks in our structured regularization model (10); see Bishop [9] and Goodfellow et al. [16] for overviews of backpropagation algorithms. Our algorithm sequentially minimizes the following error function for each time $t \in T$:
$$E_t(\boldsymbol{\theta}) := \left\|\boldsymbol{y}^{(B)}_t - f(\boldsymbol{x}_t;\boldsymbol{\theta})\right\|^2 + \left\|\Lambda\left(\boldsymbol{y}^{(N\setminus B)}_t - A f(\boldsymbol{x}_t;\boldsymbol{\theta})\right)\right\|^2. \tag{16}$$
We first define the vectors $\boldsymbol{\delta}^{(1)}$ and $\boldsymbol{\delta}^{(2)}$, which consist of the partial derivatives of the error function (16) with respect to the intermediate variables (12) and (14), as follows:
$$\boldsymbol{\delta}^{(1)} := \frac{\partial E_t}{\partial \boldsymbol{u}}, \qquad \boldsymbol{\delta}^{(2)} := \frac{\partial E_t}{\partial \boldsymbol{v}}.$$
From Eqs. (12) and (14), the partial derivatives of the error function (16) with respect to the weight matrices can be calculated as
$$\frac{\partial E_t}{\partial W^{(1)}} = \boldsymbol{\delta}^{(1)}\boldsymbol{x}_t^{\top}, \tag{17}$$
$$\frac{\partial E_t}{\partial W^{(2)}} = \boldsymbol{\delta}^{(2)}\boldsymbol{z}^{\top}. \tag{18}$$
Algorithm 1 summarizes our backpropagation algorithm.
 Step 0 (Initialization):
Let $\eta > 0$ be a step size and $\varepsilon > 0$ be a threshold for convergence. Choose initial values of the weight matrices $W^{(1)}$ and $W^{(2)}$, and set $E^{\ast} \leftarrow \infty$ as an incumbent value of the objective function (10).
 Step 1 (Backpropagation):
For each $t \in T$, compute the forward pass (15) and then the gradients (17) and (18) of the error function (16).
 Step 2 (Gradient Descent):
Update the weight parameter values as
$$W^{(1)} \leftarrow W^{(1)} - \eta\frac{\partial E_t}{\partial W^{(1)}}, \qquad W^{(2)} \leftarrow W^{(2)} - \eta\frac{\partial E_t}{\partial W^{(2)}}.$$
 Step 3 (Termination Condition):
Let $E$ be the current value of the objective function (10). If $|E^{\ast} - E| < \varepsilon$, terminate the algorithm with $\boldsymbol{\theta} = \{W^{(1)}, W^{(2)}\}$. Otherwise, set $E^{\ast} \leftarrow E$ and return to Step 1.
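The following R sketch implements Algorithm 1 end to end under the definitions above, reusing the sigmoid helper from the forward-pass sketch. The expressions for $\boldsymbol{\delta}^{(1)}$ and $\boldsymbol{\delta}^{(2)}$ come from differentiating the error function (16) with the logistic sigmoid (13); the step size, threshold, and epoch cap are illustrative, and this is our reconstruction rather than the authors' code:

```r
backprop_sr <- function(X, Y_B, Y_U, A, Lambda, H,
                        eta = 1e-3, eps = 1e-6, max_epoch = 1000) {
  ## X: d x m matrix of input vectors; Y_B, Y_U: bottom- and upper-level series.
  d <- nrow(X); nB <- nrow(Y_B)
  W1 <- matrix(rnorm(H * d), H, d)          # Step 0: random initial weights
  W2 <- matrix(rnorm(nB * H), nB, H)
  E_best <- Inf
  for (epoch in seq_len(max_epoch)) {
    E <- 0
    for (t in seq_len(ncol(X))) {
      u <- W1 %*% X[, t]; z <- sigmoid(u); v <- W2 %*% z  # forward pass (15)
      rB <- Y_B[, t] - v                                  # bottom-level residual
      rU <- Y_U[, t] - A %*% v                            # upper-level residual
      E  <- E + sum(rB^2) + sum((Lambda %*% rU)^2)        # error function (16)
      ## Step 1: backpropagate through Eq (16)
      d2 <- -2 * rB - 2 * t(A) %*% (crossprod(Lambda) %*% rU)  # delta^(2) = dE/dv
      d1 <- (z * (1 - z)) * (t(W2) %*% d2)                     # delta^(1), sigmoid derivative
      ## Step 2: gradient descent with Eqs (17) and (18)
      W1 <- W1 - eta * d1 %*% t(X[, t])
      W2 <- W2 - eta * d2 %*% t(z)
    }
    if (abs(E_best - E) < eps) break        # Step 3: termination condition
    E_best <- E
  }
  list(W1 = W1, W2 = W2)
}
```

For the experiments below, one would call this with X built by stacking $\boldsymbol{y}^{(B)}_{t-1}$ and $\boldsymbol{y}^{(B)}_{t-2}$, the structure matrix of Fig. 3, and $H$ set to twice the number of input units.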
Experimental results and discussion
The experimental results reported in this section evaluate the effectiveness of our structured regularization model when applied to artificial neural networks. These experiments focused on the two-level hierarchical structure shown in Fig. 3, where node 1 is the root, $M = \{2, 3, 4\}$, and $B = \{5, 6, \ldots, 13\}$.
Performance evaluation methodology
To evaluate out-of-sample prediction performance, we considered training and test periods of the time series data, where the training period was used to train prediction models, and the test period was used to compute prediction errors of the trained models. We calculated the root-mean-squared error (RMSE) for each node $i \in N$ during the test period $T_{\mathrm{test}}$ as
$$\mathrm{RMSE}_i := \sqrt{\frac{1}{|T_{\mathrm{test}}|}\sum_{t \in T_{\mathrm{test}}}\left(y_{i,t} - \tilde{y}_{i,t}\right)^2}.$$
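In R, this per-node criterion is one line; the function name is ours:

```r
rmse <- function(y, y_tilde) sqrt(mean((y - y_tilde)^2))  # RMSE over the test period
```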
We compared the performance of the following methods for time series prediction.
 MA($\ell$):
moving average of the previous $\ell$ values,
 ES($\alpha$):
exponential smoothing with smoothing parameter $\alpha$,
 NN+BU:
bottom-up method (5) using artificial neural networks for base forecasts,
 NN+MinT:
MinT method (7) using artificial neural networks for base forecasts,
 NN+SR($\lambda_{\mathrm{root}}$, $\lambda_{\mathrm{mid}}$):
our structured regularization model (10) applied to artificial neural networks, with the regularization parameters set as in Eq (11).
Here, we determined the parameter values $\ell$ and $\alpha$ that minimized RMSE in the training period. During the training period, we tuned the regularization parameters $\lambda_{\mathrm{root}}$ and $\lambda_{\mathrm{mid}}$ through holdout validation [5].
We adopted two-layer artificial neural networks (Fig. 2), using the previous two values $\boldsymbol{y}^{(B)}_{t-1}$ and $\boldsymbol{y}^{(B)}_{t-2}$ to compute $f(\boldsymbol{x}_t;\boldsymbol{\theta})$. Following prior studies [25, 32], we set the number of hidden units to twice the number of input units (i.e., $H = 2 \cdot 2|B|$). Bias parameters were added to the hidden and output units.
We implemented the backpropagation algorithm (Algorithm 1) in the R programming language, with fixed values of the step size $\eta$ and the convergence threshold $\varepsilon$. We employed the logistic sigmoid function (13) as the activation function. The algorithm was repeated 30 times, randomly generating initial values for the parameters $\boldsymbol{\theta} = \{W^{(1)}, W^{(2)}\}$ from a standard normal distribution. The following sections show average RMSE values with 95% confidence intervals over the 30 trials.
Synthetic datasets
We generated common factors to express correlations among the time series. Denote by $\mathcal{N}(\mu, \sigma^2)$ a normal distribution with mean $\mu$ and variance $\sigma^2$. For the common factors, we used the first-order autoregressive models
$$z_{i,t} = a_i z_{i,t-1} + \varepsilon_{i,t}, \quad \varepsilon_{i,t} \sim \mathcal{N}(0, s_i^2) \quad (i \in \{1\} \cup M,\ t \in T),$$
where $a_i$ is an autoregressive coefficient, and $s_i$ is the standard deviation of white noise for the $i$th common factor. Note that $z_{1,t}$ reflects the overall trend, and $z_{i,t}$ for $i \in M$ reflect the mid-level trends.

Bottom-level time series were produced by combining the overall trend, mid-level trends, and autocorrelation. We denote the parent (mid-level) node of node $j \in B$ as $p(j) \in M$. For the bottom-level time series, we used the first-order autoregressive models
$$y_{j,t} = a_j y_{j,t-1} + b_j z_{1,t} + c_j z_{p(j),t} + \varepsilon_{j,t}, \quad \varepsilon_{j,t} \sim \mathcal{N}(0, s_j^2) \quad (j \in B,\ t \in T),$$
where $b_j$ and $c_j$ respectively indicate the effects of the common factors $z_{1,t}$ and $z_{p(j),t}$ on the $j$th time series. After that, we generated the upper-level time series ($y_{i,t}$ for $i \in \{1\} \cup M$) according to the aggregation constraint (2).
We prepared three synthetic datasets: NgtvC, WeakC, and PstvC. Table 1 lists the parameter values used to generate these datasets. Time series are negatively correlated in the NgtvC dataset, weakly correlated in the WeakC dataset, and positively correlated in the PstvC dataset. Each dataset consists of time series data at 100 timepoints; the first 70 and last 30 timepoints were used as the training and test periods, respectively. We standardized each time series according to the mean and variance over the training period.
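For reference, a compact R sketch of this generating process for the 13-node hierarchy of Fig. 3 follows; the loadings b and c below are a single illustrative choice, not the Table 1 settings:

```r
set.seed(1)
m <- 100; a <- 0.3; s <- 0.3               # AR coefficient and noise standard deviation
parent <- rep(2:4, each = 3)               # parent mid-level node of bottom nodes 5..13

z <- matrix(0, 4, m)                       # common factors for nodes 1..4
for (t in 2:m) z[, t] <- a * z[, t - 1] + rnorm(4, sd = s)

b <- rep(1.0, 9); c <- rep(1.0, 9)         # loadings on overall and mid-level factors
Y_B <- matrix(0, 9, m)                     # bottom-level series (nodes 5..13)
for (t in 2:m)
  Y_B[, t] <- a * Y_B[, t - 1] + b * z[1, t] + c * z[parent, t] + rnorm(9, sd = s)

A13 <- rbind(rep(1, 9),                    # structure matrix of the 13-node hierarchy
             kronecker(diag(3), t(rep(1, 3))))
Y_U <- A13 %*% Y_B                         # upper-level series via constraint (2)
```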
Table 1. Parameter values used to generate the synthetic datasets ($a_i$: autoregressive coefficient; $s_i$: white-noise standard deviation; $b_i$, $c_i$: effects of the overall and mid-level common factors).

Node  $a_i$  $s_i$  NgtvC $b_i$  $c_i$  WeakC $b_i$  $c_i$  PstvC $b_i$  $c_i$
1  0.3  0.3  —  —  —  —  —  —
2  0.3  0.3  —  —  —  —  —  —
3  0.3  0.3  —  —  —  —  —  —
4  0.3  0.3  —  —  —  —  —  —
5  0.3  0.3  0.1  1.0  0.1  0.1  1.0  1.0
6  0.3  0.3  0.1  1.0  0.1  0.1  1.0  1.0
7  0.3  0.3  1.0  0.1  0.1  0.1  1.0  1.0
8  0.3  0.3  0.1  1.0  0.1  0.1  1.0  1.0
9  0.3  0.3  0.1  1.0  0.1  0.1  1.0  1.0
10  0.3  0.3  1.0  0.1  0.1  0.1  1.0  1.0
11  0.3  0.3  0.1  1.0  0.1  0.1  1.0  1.0
12  0.3  0.3  0.1  1.0  0.1  0.1  1.0  1.0
13  0.3  0.3  1.0  0.1  0.1  0.1  1.0  1.0
Results for synthetic datasets
Tables 2–4 show the out-of-sample RMSE values provided by each method for each node in the NgtvC, WeakC, and PstvC datasets. In the tables, the rows labeled "Mid-level" and "Bottom-level" show the average RMSE values over the mid- and bottom-level nodes, respectively, with the smallest RMSE value for each node indicated in bold.
For the NgtvC dataset (Table 2), our structured regularization method NN+SR clearly outperformed the other methods, except for the RMSE of the root node. For the WeakC dataset (Table 3), our method was slightly inferior to the exponential smoothing method, but the differences were very small. For the PstvC dataset (Table 4), the MinT method gave the best prediction performance, and our method attained the second-best average RMSE. These results show that our structured regularization method delivered good prediction performance for all three synthetic datasets. Our method was especially effective when the time series were strongly correlated, as in the NgtvC and PstvC datasets.
We next focus on the tuned values of the structured regularization parameters. Only for the PstvC dataset did our method NN+SR($\lambda_{\mathrm{root}}$, $\lambda_{\mathrm{mid}}$) adopt $\lambda_{\mathrm{root}} > 0$, and it performed significantly better than the bottom-up method in terms of the RMSE of the root node. Additionally, our method employed $\lambda_{\mathrm{mid}} > 0$ for all three datasets and outperformed the bottom-up method for mid-level RMSEs. These results show an association between the regularization weights and the prediction accuracy at each time series level. Our method adjusts the regularization parameters to fit the data characteristics, thereby achieving better prediction performance.
Table 2. Out-of-sample RMSE values for each node in the NgtvC dataset: MA(12), ES(0.20), NN+BU, NN+MinT, and NN+SR(0.0, 2.1), reported for the root node, mid-level nodes 2–4 (with a mid-level average), bottom-level nodes 5–13 (with a bottom-level average), and the overall average.
Table 3. Out-of-sample RMSE values for each node in the WeakC dataset: MA(12), ES(0.00), NN+BU, NN+MinT, and NN+SR(0.0, 1.2), reported in the same layout as Table 2.
Table 4. Out-of-sample RMSE values for each node in the PstvC dataset: MA(1), ES(0.89), NN+BU, NN+MinT, and NN+SR(0.4, 2.4), reported in the same layout as Table 2.
Fig. 4 shows the out-of-sample RMSE values as a function of the epoch (number of iterations) of the backpropagation algorithm for the synthetic datasets. RMSEs decreased faster for our structured regularization method NN+SR than for the bottom-up method NN+BU. The convergence performance of the two methods differed greatly, especially for the PstvC dataset and the upper-level time series. Consequently, our structured regularization method improved both the prediction accuracy and the convergence speed of the backpropagation algorithm. This suggests that our method will deliver good prediction performance even if the backpropagation algorithm is terminated in the middle of computation.
Fig. 4. Out-of-sample RMSEs versus training epoch. Panels (a)–(c): root; (d)–(f): mid-level; (g)–(i): bottom-level; (j)–(l): average; the columns correspond to the NgtvC, WeakC, and PstvC datasets.
Fig. 5 shows the out-of-sample relative RMSE values provided by our structured regularization method NN+SR($\lambda_{\mathrm{root}}$, $\lambda_{\mathrm{mid}}$) for the synthetic datasets. This figure shows how regularization at each time series level affects the prediction performance. Note that RMSE values were normalized such that the RMSE for $\lambda_{\mathrm{root}} = \lambda_{\mathrm{mid}} = 0$ was zero in each trial, so the corresponding regularization is effective if this relative RMSE value is negative. Relative RMSE values are represented as a function of the regularization parameter value $\lambda$ in Fig. 5.
NN+SR($\lambda$, 0) and NN+SR(0, $\lambda$) performed regularization only for the root time series ($\lambda_{\mathrm{root}} = \lambda$, $\lambda_{\mathrm{mid}} = 0$) and only for the mid-level time series ($\lambda_{\mathrm{root}} = 0$, $\lambda_{\mathrm{mid}} = \lambda$), respectively. In contrast, NN+SR($\lambda$, $\lambda$) performed regularization for both using the same regularization parameter value ($\lambda_{\mathrm{root}} = \lambda_{\mathrm{mid}} = \lambda$). RMSEs were consistently reduced in the NgtvC dataset. NN+SR($\lambda$, 0) attained relatively small RMSE values only for the root time series, whereas NN+SR(0, $\lambda$) delivered the smallest RMSE values for the other time series. NN+SR($\lambda$, $\lambda$) had RMSE values intermediate between those of NN+SR($\lambda$, 0) and NN+SR(0, $\lambda$). These results suggest that regularization for the mid-level time series greatly impacted prediction performance.
Fig. 5. Relative out-of-sample RMSEs as a function of the regularization parameter $\lambda$. Panels (a)–(c): root; (d)–(f): mid-level; (g)–(i): bottom-level; (j)–(l): average; the columns correspond to the NgtvC, WeakC, and PstvC datasets.
Realworld datasets
We downloaded historical data describing unemployment rates in Japan from e-Stat, a portal site for official Japanese statistics (https://www.e-stat.go.jp/en). Using these data, we prepared three real-world datasets for Japanese regions: Tohoku, Chubu, and Kansai. Table 5 lists the prefectures forming the resulting two-level hierarchical structure (Fig. 3).
We used quarterly statistics (model-based estimates) of unemployment rates during 90 time periods from January 1997 to June 2019, taking the first 60 and last 30 time periods as the training and test periods, respectively. We used the stl function in the R stats package to remove seasonal and trend components. Each time series was standardized according to the mean and variance over the training period.
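A minimal sketch of this preprocessing in R, assuming y holds one prefecture's quarterly series; the variable names are ours:

```r
y_ts  <- ts(y, start = c(1997, 1), frequency = 4)       # quarterly series
fit   <- stl(y_ts, s.window = "periodic")               # seasonal-trend decomposition
y_rem <- fit$time.series[, "remainder"]                 # seasonal and trend components removed
y_std <- (y_rem - mean(y_rem[1:60])) / sd(y_rem[1:60])  # standardize on the training period
```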
Table 5. Prefectures assigned to the bottom-level nodes of each real-world dataset.

Node  Tohoku  Chubu  Kansai
5  Aomori  Niigata  Mie
6  Iwate  Toyama  Shiga
7  Miyagi  Ishikawa  Kyoto
8  Akita  Fukui  Osaka
9  Yamagata  Yamanashi  Hyogo
10  Fukushima  Nagano  Nara
11  Ibaraki  Gifu  Wakayama
12  Tochigi  Shizuoka  Tottori
13  Gunma  Mie  Okayama
Results for realworld datasets
Tables 6–8 list the out-of-sample RMSE values provided by each method for each node in the Tohoku, Chubu, and Kansai datasets. For the Tohoku dataset (Table 6), our structured regularization method NN+SR substantially outperformed the other methods. For the Chubu dataset (Table 7), our method attained average RMSEs on par with those of the exponential smoothing and bottom-up methods, whereas the MinT method showed by far the worst performance. For the Kansai dataset (Table 8), our method greatly exceeded the prediction performance of the other methods. These results demonstrate that our structured regularization method achieved superior performance for the three real-world datasets.
Table 6. Out-of-sample RMSE values for each node in the Tohoku dataset: MA(20), ES(0.12), NN+BU, NN+MinT, and NN+SR(0.4, 1.5), reported in the same layout as Table 2.
Table 7. Out-of-sample RMSE values for each node in the Chubu dataset: MA(16), ES(0.03), NN+BU, NN+MinT, and NN+SR(0.4, 0.6), reported in the same layout as Table 2.