1 Introduction
Time series forecasting has proven to be important in helping people manage resources and make decisions Lim and Zohren (2020). For example, probabilistic forecasting of product demand and supply in retail Chen et al. (2019), or the forecasting of loans Abu-Mostafa and Atiya (1996) in a financial institution, can help people plan inventory or financing to maximize profit. Most time series of interest are macroscopic time series, e.g., the sales of an online retail platform, the loans of a financial institution, or the number of infections caused by a pandemic disease in a state, that are comprised of microscopic time series, e.g., the sales of a merchant on the online retail platform, the loans of a single customer of the financial institution, or the number of infections in a certain region. That is, the observed macroscopic time series are just the aggregation, or sum, of microscopic time series.
Although various time series forecasting models, e.g., State Space Models (SSMs) Durbin and Koopman (2012), Autoregressive (AR) models Asteriou and Hall (2011), or deep neural networks Benidis et al. (2020), have been widely studied for decades, all of them model time series without considering the connections between the macroscopic time series of interest and the underlying time series on the microscopic level. In this paper, we study the question of whether the forecasting of macroscopic time series can be improved by leveraging the underlying microscopic time series, and the answer is yes. Basically, though accurately modeling each microscopic time series could be challenging due to large variations, we show that by carefully grouping microscopic time series into clusters, i.e., clustered time series, and using canonical approaches to model each of the clusters, we can achieve promising results by simply summing over the forecasting results of each cluster.
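This aggregation relationship can be sketched numerically; the sizes, values, and cluster labels below are hypothetical, made up purely for illustration:

```python
import numpy as np

# Hypothetical toy setup: 6 microscopic series of length 8, with an
# (assumed known) assignment into 2 clusters.
rng = np.random.default_rng(0)
micro = rng.normal(size=(6, 8))          # rows: microscopic time series
labels = np.array([0, 0, 0, 1, 1, 1])    # cluster label of each row

macro = micro.sum(axis=0)                # the observed macroscopic series
clustered = np.stack([micro[labels == k].sum(axis=0) for k in (0, 1)])

# The clustered series partition the macroscopic one exactly, so summing
# per-cluster forecasts is a valid way to forecast the macroscopic series:
assert np.allclose(clustered.sum(axis=0), macro)
```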
To be more specific, first, we assume that the microscopic time series are generated from a probabilistic mixture model McLachlan and Basford (1988) with $K$ components. Each microscopic time series is generated by first selecting a component with a prior (a discrete distribution), then generating the microscopic observation from the probabilistic distribution parameterized by the corresponding component. We show that if we can identify the ground-truth components of the mixture, and the ground-truth assignment of each microscopic observation, independent modeling of the time series data from each component could be improved due to lower variance, further benefiting the estimation of the macroscopic time series that are of interest. Second, inspired by recent successes of Seq2seq models Vaswani et al. (2017); Cho et al. (2014); Du et al. (2018)
based on deep neural networks, e.g., variants of recurrent neural networks (RNNs)
Hewamalage et al. (2021); Yu et al. (2017); Maddix et al. (2018), convolutional neural networks (CNNs)
Bai et al. (2018); Hao et al. (2020), and Transformers Li et al. (2019); Wu et al. (2020), we propose Mixture of Seq2seq (MixSeq), a mixture model for time series, where the components come from a family of Seq2seq models parameterized by different parameters. Third, we conduct synthetic experiments to demonstrate the superiority of our approach, and extensive experiments on real-world data to show its power compared with canonical approaches.
Our contributions. We summarize our contributions as twofold. (1) We show that by transforming the original macroscopic time series via clustering, the expected variance of each clustered time series can be reduced, thus improving the accuracy and robustness of the estimation of macroscopic time series. (2) We propose MixSeq, an end-to-end mixture model with each component coming from a family of Seq2seq models. Our empirical results based on MixSeq show its superiority compared with canonical approaches.
2 Background
In this section, we first give a formal problem definition. We then review the background of this work and discuss related works.
Problem definition. Let us consider a macroscopic time series $x_{1:t} = (x_1, \dots, x_t)$, where $x_t$ denotes the value of the time series at time $t$. We aim to predict the next $\tau$ time steps, i.e., $x_{t+1:t+\tau}$. We are interested in the following conditional distribution
$$p(x_{t+1:t+\tau} \mid x_{1:t}), \tag{1}$$
where $x_{1:t}$ represents the values in the interval $[1, t]$. To study the above problem, we assume that the macroscopic time series is comprised of $N$ microscopic time series, i.e., $x_t = \sum_{i=1}^{N} x_{i,t}$, where $x_{i,t}$ denotes the value of the $i$-th microscopic time series at time $t$. We aim to group the microscopic time series into $K$ clustered time series, where the $k$-th clustered time series is $s_{k,t} = \sum_{i: z_i = k} x_{i,t}$ given the label assignment $z_i$ of the $i$-th microscopic time series. This is based on our results in Section 3, which show that macroscopic time series forecasting can be improved with optimal clustering. Hence, instead of directly modeling Eq. (1), we study the clustering of microscopic time series in Section 4, and model the conditional distribution of each clustered time series with canonical approaches.
2.1 Seq2seq: encoder-decoder architectures for time series
An encoder-decoder based neural network models the conditional distribution in Eq. (1) as a distribution from the exponential family, e.g., a Gaussian, Gamma, or Binomial distribution, with sufficient statistics generated from a neural network. The encoder feeds $x_{1:t}$ into a neural architecture, e.g., RNNs, CNNs, or self-attentions, to generate the representation of the historical time series, denoted as $h$; a decoder then yields the result $x_{t+1}$. After $\tau$ iterations in an autoregressive style, it finally generates the whole time series to be predicted.
To instantiate the above Seq2seq architecture, we denote $c_{1:t}$ as covariates that are known a priori, e.g., dates. We denote $y_t = x_t \,\|\, c_t$, where we use $\|$ for concatenation. The encoder generates the representation of $y_{1:t}$ via Transformer Vaswani et al. (2017); Li et al. (2019) as follows. We first transform $y_{1:t}$ by some function, e.g., causal convolution Li et al. (2019), into $Y$. Transformer then iterates the following self-attention layer $L$ times:
$$Q = YW^Q, \quad K = YW^K, \quad V = YW^V, \qquad Y \leftarrow \mathrm{MLP}\Big(\mathrm{softmax}\Big(\frac{QK^\top}{\sqrt{d}} + M\Big)V\Big). \tag{2}$$
That is, we first transform $Y$ (we ignore the subscript for simplicity when the context is clear) into query, key, and value matrices, i.e., $Q$, $K$, and $V$ respectively, where $W^Q$, $W^K$, $W^V$ in each layer are learnable parameters. Then we apply scaled inner-product attention, where $M$ is a mask matrix that filters out rightward attention by setting all upper-triangular elements to $-\infty$. We denote $\mathrm{MLP}$ as a multilayer perceptron function. Afterwards, we can generate the representation $h$ for $y_{1:t}$ via $h = \mathrm{DeepSet}(Y)$, where we denote $\mathrm{DeepSet}$ as a deep set function Zaheer et al. (2017) that operates on the rows of $Y$. We denote the feedforward function that generates $Y$ as Eq. (2). Given $h$, the decoder generates the sufficient statistics and finally yields $x_{t+1}$ from a distribution in the exponential family.
2.2 Related works
Time series forecasting has been studied for decades. We summarize works related to time series forecasting into two categories. First, many models come from the family of autoregressive integrated moving average (ARIMA) models Box and Jenkins (1968); Asteriou and Hall (2011), where "AR" indicates that the evolving variable of interest is regressed on its own lagged values, "MA" indicates that the regression error is a linear combination of error terms, and "I" indicates that the data values have been replaced with the differences between their values and the previous values to handle non-stationarity Pemberton (1990). State Space Models (SSMs) Durbin and Koopman (2012) use a state transition function to model the transfer of states and generate observations via an observation function. These statistical approaches typically model time series independently, and most of them only utilize values from history but ignore covariates that are important signals for forecasting. Second, with the rapid development of deep neural networks, many neural networks have been studied for the modeling of time series Benidis et al. (2020); Lim and Zohren (2020). The most successful neural networks are based on encoder-decoder architectures Vaswani et al. (2017); Cho et al. (2014); Du et al. (2018); Sutskever et al. (2014); Bahdanau et al. (2015); Dama and Sinoquet (2021); Lim and Zohren (2020); Ma et al. (2019), namely Seq2seq. Basically, various Seq2seq models based on RNNs Hewamalage et al. (2021); Salinas et al. (2020); Wen et al. (2017); Lai et al. (2018); Yu et al. (2017); Maddix et al. (2018), CNNs Bai et al. (2018); Hao et al. (2020), and Transformers (self-attentions) Li et al. (2019); Wu et al. (2020) have been proposed to model the nonlinearity of time series.
Whether statistical models or deep neural networks, these works mainly focus on the forecasting of single or multivariate time series, but ignore the auxiliary information that the time series could be made up of microscopic data.
Time series clustering
is another topic for exploratory analysis of time series. We summarize the literature into three categories, i.e., study of distance functions, generative models, and feature extraction for time series.
First, dynamic time warping Petitjean et al. (2011), similarity metrics that measure temporal dynamics Yang and Leskovec (2011), and specific measures for the shape Paparrizos and Gravano (2015) of time series have been proposed to adapt to various time series characteristics, e.g., scaling and distortion. Typically these distance functions are manually defined and cannot generalize to more general settings. Second, generative model based approaches assume that the observed time series are generated by an underlying model, such as a hidden Markov model Oates et al. (1999) or a mixture of ARMA models Xiong and Yeung (2004). Third, early studies on feature extraction for time series are based on component analysis Guo et al. (2008) and kernels, e.g., u-shapelets Zakaria et al. (2012). With the development of deep neural networks, several encoder-decoder architectures Madiraju et al. (2018); Ma et al. (2019) have been proposed to learn better representations of time series for clustering. However, the main purpose of works in this line is to conduct exploratory analysis of time series, while their usage for time series forecasting has never been studied. That is, these works define various metrics to evaluate the goodness of clustering results, but how to learn the optimal clustering for time series forecasting remains an open question.
3 Microscopic time series under mixture model
We analyze the variance of mixture models, and further verify our results with simple toy examples.
3.1 Analyses on the variance of mixture model
In this part, we analyze the variance of probabilistic mixture models. A mixture model McLachlan and Basford (1988) is a probabilistic model for representing the presence of subpopulations within an overall population. A mixture model typically consists of a prior that represents the probability over subpopulations, and $K$ components, each of which defines the probability distribution of the corresponding subpopulation. Formally, we can write
$$p(x) = \sum_{k=1}^{K} \pi_k\, p_k(x), \tag{3}$$
where $p(x)$ denotes the mixture distribution, $\pi_k$ denotes the prior probability of the $k$-th subpopulation, and $p_k(x)$ represents the distribution corresponding to the $k$-th component.
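The generative process of Eq. (3) can be sketched as follows, assuming Gaussian components with made-up priors and moments:

```python
import numpy as np

# Minimal sketch of Eq. (3): draw a component z from the prior pi, then draw
# the observation from that component (Gaussian components, for illustration).
rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])                 # prior over K = 3 subpopulations
means = np.array([-5.0, 0.0, 5.0])             # per-component means (made up)
stds = np.array([1.0, 1.0, 1.0])               # per-component std devs

z = rng.choice(len(pi), size=10_000, p=pi)     # latent component assignments
x = rng.normal(means[z], stds[z])              # samples from the mixture

# Empirical mixture mean matches sum_k pi_k * mu_k:
assert abs(x.mean() - (pi * means).sum()) < 0.2
```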
Proposition 1.
Assume a mixture model with probability density function $p(x) = \sum_{k=1}^{K} \pi_k\, p_k(x)$ and corresponding components $p_k$ with constants $\pi_k$ ($\pi$ lies in a simplex). In condition that $p$ and the $p_k$ have first and second moments, i.e., mean $\mu$ and variance $\sigma^2$ for $p$, and $\mu_k$ and $\sigma_k^2$ for the components $p_k$, we have:
$$\sum_{k=1}^{K} \pi_k \sigma_k^2 \le \sigma^2. \tag{4}$$
Proof sketch. We use the fact that $\sigma^2 = \sum_{k} \pi_k (\sigma_k^2 + \mu_k^2) - \mu^2$ with $\mu = \sum_{k} \pi_k \mu_k$. By using Jensen's inequality on the square function, $\big(\sum_k \pi_k \mu_k\big)^2 \le \sum_k \pi_k \mu_k^2$, we immediately yield the result. See detailed proofs in the supplementary.
This proposition states that, if we have limited data samples (often the case in reality) and we know the ground-truth data generative process a priori, i.e., the exact generative process of each sample from its corresponding component, the expected variance conditioned on the ground-truth data assignment is no larger than the variance of the mixture. Based on the assumption that microscopic data are independent, the variance of the aggregation of clustered data should be no larger than that of the aggregation of all microscopic data, i.e., the macroscopic data. So modeling clustered data from separate components could be more accurate and robust compared with modeling the macroscopic data directly. This result motivates us to forecast macroscopic time series by clustering the underlying microscopic time series. Essentially, we transform the original macroscopic time series into clusters with lower variances using a clustering approach, apply any time series model to forecast each clustered time series, and then sum over all the results from those clusters to yield the forecast of the macroscopic time series. We demonstrate this result with toy examples next.
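A quick numeric check of Proposition 1, via the law of total variance and made-up mixture parameters:

```python
import numpy as np

# Law of total variance: Var(x) = E[Var(x|z)] + Var(E[x|z]) >= E[Var(x|z)],
# so the expected within-component variance never exceeds the mixture variance.
pi = np.array([0.3, 0.7])        # prior (made up)
mu = np.array([0.0, 4.0])        # component means (made up)
sigma2 = np.array([1.0, 2.0])    # component variances (made up)

mixture_mean = (pi * mu).sum()
mixture_var = (pi * (sigma2 + mu**2)).sum() - mixture_mean**2
expected_within_var = (pi * sigma2).sum()

# Proposition 1: sum_k pi_k * sigma_k^2 <= sigma^2
assert expected_within_var <= mixture_var
```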
3.2 Demonstration with toy examples
We demonstrate the effectiveness of forecasting macroscopic time series by aggregating the forecasting results from clustered time series.
Simulation setting. We generate microscopic time series from a mixture model, such as a Gaussian process (GP) Roberts et al. (2013) or ARMA Box et al. (2015), with 3 or 5 components. We generate an equal number of time series for each component, and sum all the time series as the macroscopic time series. We obtain the clustered time series by simply summing the microscopic time series from the same component. Our purpose is to compare the performance between forecasting directly on the macroscopic time series (macro results) and summing the forecasting results of the clustered time series (clustered results). We use a rolling-window approach for training and validation over the last time steps (i.e., at each time step, we train the model using the time series before the current time point, and validate using the following values). We fit the data with either GP or ARMA depending on the generative model. We describe the detailed simulation parameters of the mixture models in the supplementary.
Simulation results. Table 1 shows the results measured by symmetric mean absolute percentage error (SMAPE; details are in the supplementary). No matter whether the time series are generated by a mixture of GPs or a mixture of ARMA models, the clustered results are superior to the macro results. In other words, if we knew the ground-truth component of each microscopic time series, modeling the clustered data aggregated from time series of the same component would give better results than directly modeling the macroscopic time series.
GP time series data  ARMA time series data  
3 clusters  5 clusters  3 clusters  5 clusters  
macro results  0.0263  0.0242  0.5870  0.5940 
clustered results  0.0210  0.0198  0.3590  0.3840 
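The SMAPE metric reported above can be computed as follows; this is one common definition of SMAPE, while the paper's exact variant is specified in its supplementary:

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error (one common definition):
    mean of |y - yhat| / ((|y| + |yhat|) / 2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(np.mean(np.abs(y_true - y_pred) / denom))

assert smape([100, 200], [100, 200]) == 0.0      # perfect forecast
assert abs(smape([100], [50]) - 2 / 3) < 1e-12   # 50 / ((100 + 50) / 2)
```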
4 MixSeq: a mixture model for time series
Based on the analysis in Section 3, we assume microscopic time series follow a mixture distribution, and propose a mixture model, MixSeq, to cluster microscopic time series.
4.1 Our model
Our model assumes that each microscopic time series follows a mixture probabilistic distribution. To forecast $x_{t+1:t+\tau}$ given $x_{1:t}$, our approach is to first partition the microscopic time series into $K$ clusters via MixSeq. Since the distribution is applicable to all microscopic time series, we ignore the subscript $i$ and the time interval for simplicity. We study the following generative probability of $x$:
$$p(x) = \sum_{z=1}^{K} p(z)\, p(x \mid z; \theta_z), \tag{5}$$
where $z$ is the discrete latent variable, $K$ is the number of components in the mixture model, $p(z)$ is the prior of the cluster indexed by $z$, and $p(x \mid z; \theta_z)$ is the probability of the time series being generated by the corresponding component governed by parameters $\theta_z$. Note that we have $K$ sets of parameters $\{\theta_z\}$ in the mixture model.
We use ConvTrans introduced in Li et al. (2019) as our backbone to model the conditional $p(x \mid z; \theta_z)$. To instantiate, we first generate $Y$ via causal convolution Li et al. (2019), then generate the self-attention outputs given $Y$, and finally generate the representation $h$ as in Section 2.1, with ReLU activations. Afterwards, we decode $h$ to form a specific distribution from an exponential family. In particular, we use a Gaussian distribution, whose mean and variance are generated by the following transformations:
$$\mu = w_\mu^\top h + b_\mu, \qquad \sigma = \log\big(1 + \exp(w_\sigma^\top h + b_\sigma)\big), \tag{6}$$
where $w_\mu, w_\sigma$ are parameters, and $b_\mu, b_\sigma$ are biases.
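The decoder head of Eq. (6) might be sketched as below; the softplus transform for the positive scale parameter is an assumption consistent with common probabilistic forecasting heads, and all shapes and weights here are made up:

```python
import numpy as np

def gaussian_head(h, w_mu, b_mu, w_sigma, b_sigma):
    """Project the representation h to Gaussian sufficient statistics.
    Softplus keeps the scale strictly positive."""
    mu = h @ w_mu + b_mu
    sigma = np.log1p(np.exp(h @ w_sigma + b_sigma))  # softplus > 0
    return mu, sigma

rng = np.random.default_rng(0)
h = rng.normal(size=(3, 8))                  # 3 made-up representations
w_mu, w_sigma = rng.normal(size=(8,)), rng.normal(size=(8,))
mu, sigma = gaussian_head(h, w_mu, 0.0, w_sigma, 0.0)

assert np.all(sigma > 0)                     # valid Gaussian scale
```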
4.2 Posterior inference and learning algorithms
We aim to learn the parameters and efficiently infer the posterior distribution of $z$ in Eq. (5). However, it is intractable to maximize the marginal likelihood after taking the logarithm, i.e., $\log p(x) = \log \sum_{z} p(z)\, p(x \mid z; \theta_z)$, due to the involvement of the logarithm of a sum. To tackle this non-convex problem, we resort to the stochastic auto-encoding variational Bayesian (AEVB) algorithm Kingma and Welling (2013). For a single microscopic time series, the variational lower bound Kingma and Welling (2013) on the marginal likelihood is as below:
$$\log p(x) \ge \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z; \theta_z)\big] - \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big), \tag{7}$$
where $q(z \mid x)$ is the approximate posterior of the latent variable $z$ given the time series $x$. The benefit of using AEVB is that we can treat $q(z \mid x)$ as an encoder modeled by a neural network. Hence, we reuse ConvTrans Li et al. (2019) as our backbone, and model $q(z \mid x)$ as:
$$q(z \mid x) = \mathrm{softmax}\big(W^\top \mathrm{DeepSet}(Y) + b\big), \tag{8}$$
where $\mathrm{DeepSet}$ is the deep set function with its own parameters, and $W$ and $b$ are parameters that project the encoding to $K$ dimensions. After the softmax operator, we derive a posterior distribution that lies in a simplex of $K$ dimensions. Note that we use distinct functions with different parameters to model $p(x \mid z; \theta_z)$ and $q(z \mid x)$ respectively. We assign each microscopic time series to the cluster with the highest posterior probability in our experiments.
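The posterior encoder of Eq. (8) reduces to a linear projection plus softmax over the K clusters; a minimal sketch with made-up encodings:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
K, d = 3, 16
encodings = rng.normal(size=(5, d))   # one (made-up) encoding per series
W, b = rng.normal(size=(d, K)), np.zeros(K)

q = softmax(encodings @ W + b)        # approximate posterior q(z | x)
labels = q.argmax(axis=-1)            # hard cluster assignment

assert np.allclose(q.sum(axis=-1), 1.0)   # each row lies on the K-simplex
```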
Mode collapsing. We find that directly optimizing the lower bound in Eq. (7) suffers from the mode collapsing problem. That is, the encoder tends to assign all microscopic time series to one cluster, and does not effectively distinguish the data as expected, implying $I(x; z) = 0$ (where $I$ denotes mutual information). To address this mode collapsing problem, we add the mutual information $I(x; z)$ to the lower bound in Eq. (7), which encourages the latent variable to extract discriminative information from different time series Zhao et al. (2018). Then, we have
$$\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z; \theta_z)\big] - \mathrm{KL}\big(\bar{q}(z)\,\|\,p(z)\big), \tag{9}$$
where $\bar{q}(z)$ is the average of the approximate posteriors over all microscopic data. We approximate this term using a minibatch $\mathcal{B}$ of samples, i.e., $\bar{q}(z) \approx \frac{1}{|\mathcal{B}|} \sum_{x \in \mathcal{B}} q(z \mid x)$.
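The minibatch average posterior, and the mutual-information intuition behind the fix for mode collapsing, can be sketched as follows (the estimate here, the entropy of the average posterior minus the mean posterior entropy, is one standard MI estimator; the posteriors are made up):

```python
import numpy as np

def entropy(p, axis=-1):
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=axis)

# Posteriors q(z | x_i) for a made-up minibatch of 3 series, K = 2 clusters.
q = np.array([[0.9, 0.1],
              [0.1, 0.9],
              [0.8, 0.2]])
q_bar = q.mean(axis=0)                      # minibatch estimate of q_bar(z)
mi = entropy(q_bar) - entropy(q, axis=-1).mean()

# A collapsed encoder (all mass on one cluster for every series) gives I = 0,
# which is exactly the degenerate solution the added MI term penalizes:
collapsed = np.array([[1.0, 0.0]] * 3)
mi_collapsed = entropy(collapsed.mean(axis=0)) - entropy(collapsed).mean()

assert mi > 0 and abs(mi_collapsed) < 1e-9
```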
Annealing tricks. For long time series, the reconstruction loss and the KL divergence in Eq. (9) are out of proportion. In this situation, the KL divergence has little effect on the optimization objective. So we finally derive the following objective to maximize:
$$\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z; \theta_z)\big] - \beta\, \mathrm{KL}\big(\bar{q}(z)\,\|\,p(z)\big), \tag{10}$$
where $\beta$ is the trade-off hyperparameter. We use an annealing strategy to dynamically adjust $\beta$ in the training process, with a parameter controlling the rate of descent. Meanwhile, we also apply norm regularizers on the Seq2seq parameters with a penalty weight.
ARMA synthetic data  DeepAR synthetic data
2 clusters  3 clusters  2 clusters  3 clusters  
MixARMA  0.9982(0.0001)  0.9509(0.1080)  0.7995(0.2734)  0.7687(0.0226) 
MixSeq  0.9915(0.0024)  0.9540(0.0974)  0.9986(0.0003)  0.8460(0.0774) 
MixSeq-infer  0.9929(0.0027)  0.9544(0.0975)  0.9982(0.0006)  0.8460(0.0775)
Table 2: Mean and standard deviation (SD, in brackets) of the Rand Index (RI, the higher the better) for clustering on synthetic data generated by ARMA and DeepAR. MixSeq-infer represents inferring the cluster of new data generated by the different models after training MixSeq. On ARMA data, MixSeq and MixARMA have comparable performance; on DeepAR data, MixARMA degrades significantly, which shows the effectiveness of MixSeq.
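The annealing trick of Section 4.2 can be sketched as follows; the exponential form and rate below are assumptions for illustration, since the paper's exact schedule is specified only in its experimental settings:

```python
import math

def beta_schedule(epoch, rate=0.1):
    """Hypothetical annealing schedule for the KL weight beta in Eq. (10):
    beta starts at 1 and decays toward 0 as training proceeds, so the
    reconstruction term gradually dominates for long time series."""
    return math.exp(-rate * epoch)

betas = [beta_schedule(e) for e in range(0, 50, 10)]
assert betas[0] == 1.0
assert all(a > b for a, b in zip(betas, betas[1:]))  # strictly decreasing
```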
5 Experimental results
We conduct extensive experiments to show the advantage of MixSeq. We evaluate the clustering performance of MixSeq on synthetic data, present the results of macroscopic time series forecasting on realworld data, and analyze the sensitivity of the cluster number of MixSeq.
5.1 Synthetic datasets
To demonstrate MixSeq's capability of clustering microscopic time series that follow various probabilistic mixture distributions, we conduct clustering experiments on synthetic data with ground truth. We generate two kinds of synthetic time series, by ARMA Box et al. (2015) and DeepAR Salinas et al. (2020) respectively. For each model, we experiment with different numbers of clusters (2 and 3), generated with components governed by different parameters.
Experiment setting. To generate data from ARMA, we use ARMA(2, 0) models whose parameters differ across the three components. The synthetic time series for the mixture of 2 components are generated using the first two components, and those for the mixture of 3 components are generated using all three components. To generate data from DeepAR, we use a DeepAR model with one LSTM layer. Since it is difficult to randomly initialize the parameters of DeepAR, we train a base model on the real-world Wiki dataset Tran et al. (2021) (discussed in Section 5.2). To build the other two DeepAR components, we respectively add random disturbances to the parameters of the base model. For each cluster, we generate time series with randomly initialized sequences.
We use a 1-layer causal convolution Transformer (ConvTrans Li et al. (2019)) as our backbone model in MixSeq. We use the following parameters unless otherwise stated: the number of multi-heads, the kernel size and number of kernels for causal convolution, the dropout rate, the penalty weight on the norm regularizer (1e-5), the learning rate (1e-4), the batch size, and the number of epochs. Furthermore, $\beta$ in MixSeq is annealed with a schedule that decays with the current epoch and is updated at certain epochs. For comparison, we employ MixARMA Xiong and Yeung (2004), a mixture of ARMA(2, 0) models optimized by the EM algorithm Blei et al. (2017), as our baseline. Both methods are evaluated using the Rand Index (RI) Rand (1971) (more details in the supplementary).
Experiment results. We show the clustering performance of MixSeq and MixARMA on the synthetic data in Table 2. The results are averaged over 5 trials. For the synthetic data from ARMA, both MixSeq and MixARMA perform very well. However, for the synthetic data from DeepAR, MixARMA degrades significantly while MixSeq achieves much better performance. This suggests that MixSeq can capture the complex nonlinear characteristics of the time series generated by DeepAR, which MixARMA fails to do. Furthermore, we also generate new time series with the corresponding ARMA and DeepAR models, and infer their clusters with the trained MixSeq model. The performance is comparable with the training performance, which demonstrates that MixSeq actually captures the generation mode of the time series.
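The Rand Index used for evaluation can be computed with a minimal pairwise implementation:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand Index: fraction of sample pairs on which two clusterings agree,
    i.e., both place the pair in the same cluster, or both place it apart."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# RI is invariant to relabeling of clusters:
assert rand_index([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0
# A poor clustering agrees on few pairs:
assert abs(rand_index([0, 0, 1, 1], [0, 1, 0, 1]) - 1 / 3) < 1e-12
```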
5.2 Realworld datasets
We further evaluate the effectiveness of our model on the macroscopic time series forecasting task. We compare MixSeq with existing clustering methods and state-of-the-art time series forecasting approaches on several real-world datasets. Specifically, for each dataset, the goal is to forecast the macroscopic time series aggregated from all microscopic data. We cluster the microscopic time series into groups, and aggregate the time series in each group to form the clustered time series. Then, we train the forecasting models on the clustered time series separately, and obtain predictions for each clustered time series. Finally, the estimate of the macroscopic time series is obtained by aggregating all the predictions of the clustered time series.
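The cluster-then-forecast-then-aggregate pipeline described above can be sketched as follows; the persistence forecaster is a made-up placeholder standing in for any of the paper's forecasting models, and all data here are synthetic:

```python
import numpy as np

def persistence_forecast(series, horizon):
    """Placeholder forecaster: repeat the last observed value."""
    return np.repeat(series[-1], horizon)

def forecast_macro(micro, labels, horizon, forecaster=persistence_forecast):
    """Aggregate microscopic series per cluster, forecast each clustered
    series, and sum the cluster-level forecasts into a macroscopic forecast."""
    clustered = [micro[labels == k].sum(axis=0) for k in np.unique(labels)]
    return sum(forecaster(c, horizon) for c in clustered)

rng = np.random.default_rng(0)
micro = rng.normal(size=(10, 30))            # 10 made-up microscopic series
labels = rng.integers(0, 3, size=10)         # made-up cluster assignments
pred = forecast_macro(micro, labels, horizon=7)

assert pred.shape == (7,)                    # one value per forecast step
```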
We report results on three real-world datasets: Rossmann^{3}^{3}3https://www.kaggle.com/c/rossmannstoresales, M5^{4}^{4}4https://www.kaggle.com/c/m5forecastingaccuracy and Wiki Tran et al. (2021). The Rossmann dataset consists of historical sales data of Rossmann stores recorded every day. Similarly, the M5 dataset consists of microscopic time series as the daily sales of different products in ten Walmart stores in the USA. The Wiki dataset contains microscopic time series representing the number of daily views of different Wikipedia articles. The dataset summary is shown in Table 3, together with the data splits.
dataset  # microscopic series  length (days)  train interval  test interval
Rossmann  1115  942  2013-01-01 to 2014-12-31  2015-01-01 to 2015-07-31
M5  30490  1941  2011-01-29 to 2016-01-01  2016-01-01 to 2016-06-19
Wiki  309765  1827  2015-07-01 to 2019-12-31  2020-01-01 to 2020-06-30
Macro  DTCR  MixARMA  MixSeq  
DeepAR  0.1904/0.0869  0.2292/0.1432  0.1981/0.1300  0.1857/0.0987  
Rossmann  TCN  0.1866/0.1005  0.2023/0.1633  0.1861/0.1160  0.1728/0.0997 
ConvTrans  0.1861/0.0822  0.2077/0.0930  0.1866/0.0854  0.1847/0.0813  
DeepAR  0.0548/0.0289  0.0787/0.0627  0.0624/0.0582  0.0582/0.0445  
M5  TCN  0.0790/0.0635  0.0847/0.0805  0.0762/0.0789  0.0694/0.0508 
ConvTrans  0.0553/0.0260  0.0514/0.0260  0.0497/0.0257  0.0460/0.0238  
DeepAR  0.0958/0.0962  0.1073/0.1336  0.0974/0.1070  0.0939/0.0901  
Wiki  TCN  0.0966/0.1064  0.1237/0.1480  0.0963/0.1218  0.0886/0.0980 
ConvTrans  0.0968/0.0589  0.1029/0.0531  0.0961/0.0594  0.0901/0.0516 
Experiment setting. We summarize the clustering strategies for macroscopic time series forecasting as follows. (1) "DTCR" Ma et al. (2019) is the deep temporal clustering representation method, which integrates temporal reconstruction, a K-means objective, and an auxiliary classification task into a single Seq2seq model. (2) "MixARMA" Xiong and Yeung (2004) is the mixture of ARMA models that uses ARMA to capture the characteristics of microscopic time series. (3) "MixSeq" is our model with a 1-layer causal convolution Transformer Li et al. (2019). (4) We also report the results of directly building the forecasting model on the macroscopic time series without leveraging the microscopic data, named "Macro".
For time series forecasting, we implement five methods combined with each clustering strategy: ARMA Box et al. (2015), Prophet Taylor and Letham (2018), DeepAR Salinas et al. (2020), TCN Bai et al. (2018), and ConvTrans Li et al. (2019). ARMA and Prophet give pointwise predictions of the time series, while DeepAR, TCN, and ConvTrans are neural network based methods for probabilistic forecasting with a Gaussian distribution. We use the rolling-window strategy on the test interval, and compare the different methods in terms of long-term forecasting performance. The data of the last two months in the train interval are used as validation data to find the optimal model.
We perform a grid search over the following hyperparameters of the clustering and forecasting algorithms: the number of clusters, the learning rate, the penalty weight on the norm regularizers, and the dropout rate. The model with the best validation performance is applied to obtain the results on the test interval. Meanwhile, we fix the batch size, and set the number of training epochs separately for Rossmann, M5, and Wiki. For DTCR, we use the same setting as Ma et al. (2019).
Regarding the time series forecasting models, we apply the default settings of ARMA and Prophet provided by their Python packages. The architectures of DeepAR, TCN, and ConvTrans are as follows: we set the number of layers and hidden units for DeepAR, the number of multi-heads and the kernel size for ConvTrans, and the kernel size and dilations for TCN. We also use the same batch size and number of epochs for all forecasting methods.
Experiment results. Following Li et al. (2019); Rangapuram et al. (2018); Tran et al. (2021), we evaluate the experimental methods using SMAPE and the quantile loss^{5}^{5}5Detailed definition is in the supplementary.. The SMAPE results of all combinations of clustering and forecasting methods are given in Table 5. Table 4 shows the quantile loss for DeepAR, TCN, and ConvTrans, which give probabilistic forecasts. All results are averaged over repeated trials, and the best performance is highlighted in bold. We observe that MixSeq is superior to the other three methods, suggesting that clustering microscopic time series with our model improves the estimation of macroscopic time series. Meanwhile, Macro and MixARMA have comparable performance and are better than DTCR, which further demonstrates that only proper clustering methods are conducive to macroscopic time series forecasting.
Macro  DTCR  MixARMA  MixSeq
ARMA  0.2739(0.0002)  0.2735(0.0106)  0.2736(0.0013)  0.2733(0.0012)  
Prophet  0.1904(0.0007)  0.1738(0.0137)  0.1743(0.0037)  0.1743(0.0026)  
Rossmann  DeepAR  0.1026(0.0081)  0.1626(0.0117)  0.1143(0.0088)  0.0975(0.0013) 
TCN  0.1085(0.0155)  0.1353(0.0254)  0.1427(0.0180)  0.1027(0.0075)  
ConvTrans  0.1028(0.0091)  0.1731(0.0225)  0.1022(0.0041)  0.0961(0.0019)  
ARMA  0.0540(0.0001)  0.0544(0.0018)  0.0541(0.0003)  0.0543(0.0001)  
Prophet  0.0271(0.0003)  0.0271(0.0003)  0.0269(0.0002)  0.0267(0.0002)  
M5  DeepAR  0.0278(0.0034)  0.0410(0.0046)  0.0319(0.0063)  0.0298(0.0029) 
TCN  0.0412(0.0075)  0.0447(0.0044)  0.0395(0.0094)  0.0358(0.0014)  
ConvTrans  0.0274(0.0048)  0.0253(0.0020)  0.0245(0.0024)  0.0227(0.0006)  
ARMA  0.0362(0.0001)  0.0363(0.0006)  0.0364(0.0005)  0.0362(0.0002)  
Prophet  0.0413(0.0001)  0.0423(0.0008)  0.0434(0.0003)  0.0420(0.0005)  
Wiki  DeepAR  0.0481(0.0008)  0.0552(0.0015)  0.0489(0.0006)  0.0470(0.0002) 
TCN  0.0494(0.0076)  0.0654(0.0022)  0.0491(0.0015)  0.0446(0.0023)  
ConvTrans  0.0471(0.0029)  0.0497(0.0012)  0.0466(0.0001)  0.0440(0.0010) 
5.3 Sensitivity analysis of cluster number
The cluster number is a critical hyperparameter of MixSeq. To analyze its effect on the forecasting of macroscopic time series, we conduct experiments on both synthetic and real-world data. The results demonstrate the importance of setting a proper number of clusters, for which we suggest a binary search. Details are as follows.
Synthetic data. Following the experimental setting in Section 5.1, we generate microscopic time series from three differently parameterized ARMA models; that is, the ground-truth number of clusters is 3. The aggregation of all samples is the macroscopic time series, which is the forecasting target of interest. Then, we compare the forecasting performance between the method that directly forecasts the macroscopic time series (denoted as Macro) and our method with different cluster numbers (2, 3, and 5, denoted as MixSeq_2, MixSeq_3, and MixSeq_5 respectively). We fix the forecasting method as ARMA, and apply the rolling-window approach for T+10 forecasting over the last time steps. MixSeq_3, which is set with the ground-truth cluster number, shows the best average SMAPE over repeated trials, while MixSeq_2 and MixSeq_5 degrade, though they are still better than Macro. This result shows the importance of setting a proper number of clusters.
Real-world data. We evaluate on the three real-world datasets by varying the cluster number of MixSeq while keeping the other parameters fixed, exploring one range of cluster numbers for the Rossmann and M5 datasets and a different range for the Wiki dataset. The architecture and training parameters of MixSeq are the same as in Section 5.2, except for the dropout rate, the penalty weight on the norm regularizer (5e-5), and the learning rate (1e-4). Meanwhile, we also fix the time series forecasting method as the causal convolution Transformer (ConvTrans).
Figure 1 reports the macroscopic time series forecasting performance (SMAPE and quantile loss on the test interval) of MixSeq with different cluster numbers on the three real-world datasets. The horizontal dashed lines are the results of directly building the ConvTrans model on the macroscopic time series without leveraging the microscopic data (named "Macro" in Section 5.2). It is obvious that each dataset has its own suitable number of clusters, and our method is relatively sensitive to the cluster number, especially on datasets with fewer microscopic time series, such as Rossmann. As with the accurate modeling of each microscopic time series, a larger cluster number in MixSeq also brings larger variance to macroscopic time series forecasting, which degrades the performance of our method.
6 Conclusion
In this paper, we study the problem of whether macroscopic time series forecasting can be improved by leveraging microscopic time series. Under a mild mixture model assumption, we show that appropriately clustering microscopic time series into groups is conducive to the forecasting of macroscopic time series. We propose MixSeq to cluster microscopic time series, where all the components come from a family of Seq2seq models with different parameters. We also propose an efficient stochastic auto-encoding variational Bayesian algorithm for posterior inference and learning in MixSeq. Our experiments on both synthetic and real-world data suggest that MixSeq can capture the characteristics of time series in different groups and improve the forecasting performance of macroscopic time series.
Acknowledgments
This work is supported by Ant Group.
References
 Abu-Mostafa and Atiya (1996) Y. S. Abu-Mostafa and A. F. Atiya. Introduction to financial forecasting. Applied Intelligence, 6(3):205–213, 1996.
 Asteriou and Hall (2011) D. Asteriou and S. G. Hall. ARIMA models and the Box–Jenkins methodology. Applied Econometrics, 2(2):265–286, 2011.
 Bahdanau et al. (2015) D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (ICLR), 2015.
 Bai et al. (2018) S. Bai, J. Z. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
 Benidis et al. (2020) K. Benidis, S. S. Rangapuram, V. Flunkert, B. Wang, D. Maddix, C. Turkmen, J. Gasthaus, M. Bohlke-Schneider, D. Salinas, L. Stella, et al. Neural forecasting: Introduction and literature overview. arXiv preprint arXiv:2004.10240, 2020.
 Blei et al. (2017) D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
 Box and Jenkins (1968) G. E. Box and G. M. Jenkins. Some recent advances in forecasting and control. Journal of the Royal Statistical Society: Series C (Applied Statistics), 17(2):91–109, 1968.
 Box et al. (2015) G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
 Chen et al. (2019) C. Chen, Z. Liu, J. Zhou, X. Li, Y. Qi, Y. Jiao, and X. Zhong. How much can a retailer sell? Sales forecasting on Tmall. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 204–216. Springer, 2019.
 Cho et al. (2014) K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
 Dama and Sinoquet (2021) F. Dama and C. Sinoquet. Analysis and modeling to forecast in time series: a systematic review. arXiv preprint arXiv:2104.00164, 2021.
 Du et al. (2018) S. Du, T. Li, and S.-J. Horng. Time series forecasting using sequence-to-sequence deep learning framework. In 2018 9th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), pages 171–176. IEEE, 2018.
 Durbin and Koopman (2012) J. Durbin and S. J. Koopman. Time series analysis by state space methods. Oxford University Press, 2012.
 Guo et al. (2008) C. Guo, H. Jia, and N. Zhang. Time series clustering based on ica for stock data analysis. In 2008 4th International Conference on Wireless Communications, Networking and Mobile Computing, pages 1–4. IEEE, 2008.
 Hao et al. (2020) H. Hao, Y. Wang, Y. Xia, J. Zhao, and F. Shen. Temporal convolutional attentionbased network for sequence modeling. arXiv preprint arXiv:2002.12530, 2020.
 Hewamalage et al. (2021) H. Hewamalage, C. Bergmeir, and K. Bandara. Recurrent neural networks for time series forecasting: Current status and future directions. International Journal of Forecasting, 37(1):388–427, 2021.
 Kingma and Welling (2013) D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Lai et al. (2018) G. Lai, W.-C. Chang, Y. Yang, and H. Liu. Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 95–104, 2018.
 Li et al. (2019) S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, and X. Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. arXiv preprint arXiv:1907.00235, 2019.
 Lim and Zohren (2020) B. Lim and S. Zohren. Time series forecasting with deep learning: A survey. arXiv preprint arXiv:2004.13408, 2020.
 Ma et al. (2019) Q. Ma, J. Zheng, S. Li, and G. W. Cottrell. Learning representations for time series clustering. Advances in neural information processing systems, 32:3781–3791, 2019.
 Maddix et al. (2018) D. C. Maddix, Y. Wang, and A. Smola. Deep factors with gaussian processes for forecasting. arXiv preprint arXiv:1812.00098, 2018.
 Madiraju et al. (2018) N. S. Madiraju, S. M. Sadat, D. Fisher, and H. Karimabadi. Deep temporal clustering: Fully unsupervised learning of time-domain features. arXiv preprint arXiv:1802.01059, 2018.
 McLachlan and Basford (1988) G. J. McLachlan and K. E. Basford. Mixture models: Inference and applications to clustering, volume 38. M. Dekker New York, 1988.
 Oates et al. (1999) T. Oates, L. Firoiu, and P. R. Cohen. Clustering time series with hidden markov models and dynamic time warping. In Proceedings of the IJCAI-99 workshop on neural, symbolic and reinforcement learning methods for sequence learning, pages 17–21. Citeseer, 1999.
 Paparrizos and Gravano (2015) J. Paparrizos and L. Gravano. k-Shape: Efficient and accurate clustering of time series. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1855–1870, 2015.
 Pemberton (1990) J. Pemberton. Non-linear and non-stationary time series analysis. 1990.
 Petitjean et al. (2011) F. Petitjean, A. Ketterlin, and P. Gançarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern recognition, 44(3):678–693, 2011.
 Rand (1971) W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336):846–850, 1971.
 Rangapuram et al. (2018) S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski. Deep state space models for time series forecasting. Advances in neural information processing systems, 31:7785–7794, 2018.
 Roberts et al. (2013) S. Roberts, M. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain. Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20110550, 2013.
 Salinas et al. (2020) D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
 Sutskever et al. (2014) I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 2014.
 Taylor and Letham (2018) S. J. Taylor and B. Letham. Forecasting at scale. The American Statistician, 72(1):37–45, 2018.
 Tran et al. (2021) A. Tran, A. Mathews, C. S. Ong, and L. Xie. Radflow: A recurrent, aggregated, and decomposable model for networks of time series. arXiv preprint arXiv:2102.07289, 2021.
 Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
 Wen et al. (2017) R. Wen, K. Torkkola, B. Narayanaswamy, and D. Madeka. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.
 Wu et al. (2020) N. Wu, B. Green, X. Ben, and S. O’Banion. Deep transformer models for time series forecasting: The influenza prevalence case. arXiv preprint arXiv:2001.08317, 2020.
 Xiong and Yeung (2004) Y. Xiong and D.-Y. Yeung. Time series clustering with ARMA mixtures. Pattern Recognition, 37(8):1675–1689, 2004.
 Yang and Leskovec (2011) J. Yang and J. Leskovec. Patterns of temporal variation in online media. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 177–186, 2011.
 Yu et al. (2017) R. Yu, S. Zheng, A. Anandkumar, and Y. Yue. Long-term forecasting using tensor-train RNNs. arXiv, 2017.
 Zaheer et al. (2017) M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola. Deep sets. arXiv preprint arXiv:1703.06114, 2017.
 Zakaria et al. (2012) J. Zakaria, A. Mueen, and E. Keogh. Clustering time series using unsupervised-shapelets. In 2012 IEEE 12th International Conference on Data Mining, pages 785–794. IEEE, 2012.
 Zhao et al. (2018) T. Zhao, K. Lee, and M. Eskenazi. Unsupervised discrete sentence representation learning for interpretable neural dialog generation. arXiv preprint arXiv:1804.08069, 2018.
Appendix A Proofs
Proposition 1.
Assuming the mixture model with probability density function $p(x) = \sum_{k=1}^{K} \pi_k p_k(x)$ and corresponding components $p_k(x)$ with constants $\pi_k$ (the $\pi_k$ lie in a simplex), we have $\mathbb{E}[x] = \sum_{k=1}^{K} \pi_k \mathbb{E}_{p_k}[x]$. On condition that $p(x)$ and the $p_k(x)$ have first and second moments, i.e., $\mu = \mathbb{E}[x]$ and $\sigma^2 = \operatorname{Var}(x)$ for $p(x)$, and $\mu_k = \mathbb{E}_{p_k}[x]$ and $\sigma_k^2 = \operatorname{Var}_{p_k}(x)$ for components $p_k(x)$, we have:
(11) $\sigma^2 = \sum_{k=1}^{K} \pi_k \sigma_k^2 + \Big( \sum_{k=1}^{K} \pi_k \mu_k^2 - \mu^2 \Big) \ge \sum_{k=1}^{K} \pi_k \sigma_k^2.$
Proof.
We prove the result based on the fact that, for any moment $m$, we have
(12) $\mathbb{E}[x^m] = \sum_{k=1}^{K} \pi_k \mathbb{E}_{p_k}[x^m].$
We then derive the variance of the mixture as
(13) $\sigma^2 = \mathbb{E}[x^2] - \mu^2 = \sum_{k=1}^{K} \pi_k \big( \sigma_k^2 + \mu_k^2 \big) - \mu^2 = \sum_{k=1}^{K} \pi_k \sigma_k^2 + \Big( \sum_{k=1}^{K} \pi_k \mu_k^2 - \mu^2 \Big).$
Since the squared function is convex, by Jensen's inequality we immediately have $\sum_{k=1}^{K} \pi_k \mu_k^2 \ge \big( \sum_{k=1}^{K} \pi_k \mu_k \big)^2 = \mu^2$. ∎
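As a quick numerical sanity check of Eq. (11), with arbitrary illustrative parameters (not values from the paper), the moment-based variance of a Gaussian mixture can be compared against the weighted sum of component variances:

```python
import numpy as np

# Illustrative mixture of K = 3 Gaussian components (placeholder parameters).
pi = np.array([0.2, 0.5, 0.3])      # mixture weights, lying in the simplex
mu = np.array([-1.0, 0.0, 2.0])     # component means
sigma = np.array([0.5, 1.0, 0.8])   # component standard deviations

# Mixture mean and variance via the moment identities in Eq. (12)/(13).
mix_mean = np.sum(pi * mu)
mix_var = np.sum(pi * (sigma**2 + mu**2)) - mix_mean**2

# Jensen bound of Eq. (11): mixture variance >= weighted component variances.
lower_bound = np.sum(pi * sigma**2)
print(round(mix_var, 3), round(lower_bound, 3))  # 1.982 0.742
assert mix_var >= lower_bound
```

The gap between the two printed numbers is exactly the between-component term $\sum_k \pi_k \mu_k^2 - \mu^2$, which vanishes when all component means coincide.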
Appendix B Complexity analysis and running time
Complexity analysis of MixSeq. The complexity of MixSeq depends on the number of clusters and three network modules: convolution, multi-head self-attention, and the MLP. For a time series , where and are the length and dimension of the time series respectively, the FLOPs (floating point operations) of the convolution is , where and are the size and number of convolution kernels. The FLOPs of multi-head self-attention is , where is the number of heads. The FLOPs of the MLP is . Finally, the FLOPs of MixSeq is . Since , and are usually smaller than , the time complexity can be simplified to , which is similar to Transformer. Time series data are usually recorded by day or hour; even three years of daily data contain only about one thousand values, so our method is capable of handling them. Furthermore, existing methods for accelerating the computation of self-attention can also be used in MixSeq.
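The key point is the quadratic cost of self-attention in the sequence length n; a rough back-of-envelope estimate (a hypothetical helper, not the paper's exact FLOP accounting) shows why series of about one thousand points are unproblematic:

```python
def self_attention_flops(n: int, d: int) -> int:
    """Rough multiply-add count of one self-attention layer on a length-n,
    dimension-d sequence: Q/K/V projections (3*n*d*d) plus the QK^T score
    matrix and the score-times-V aggregation (2*n*n*d)."""
    return 3 * n * d * d + 2 * n * n * d

print(self_attention_flops(1000, 64))  # 140288000
# Doubling the sequence length roughly quadruples the attention cost.
print(self_attention_flops(2000, 64) / self_attention_flops(1000, 64))
```

At n ≈ 1000 this is on the order of 1e8 multiply-adds per layer, well within the budget of a single modern GPU, which supports the claim that daily or hourly series over a few years pose no computational problem.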
Running time of macroscopic time series forecasting. The overall running time of forecasting macroscopic data based on MixSeq comprises two steps. (1) The first step is clustering with MixSeq. Figure 2 shows the convergence of MixSeq over time (seconds) compared with competing approaches to time series clustering. Our approach takes seconds to converge on the dataset containing microscopic time series, while MixARMA and DTCR take and seconds to converge respectively. The convergence rate of our method is no worse than that of the existing neural-network-based approach, i.e., DTCR. (2) The second step is to forecast the clustered time series with any proper forecasting model. The time complexity is linear w.r.t. the number of clustered time series, and we can always accelerate this step by using more workers in parallel.
Appendix C Evaluation metrics
C.1 Rand index
Given the labels as a clustering ground truth, the Rand index (RI) measures the clustering accuracy between the ground truth and predicted clusters, defined as
$\mathrm{RI} = \frac{a + b}{\binom{n}{2}},$
where $n$ is the total number of samples, $a$ is the number of sample pairs that are in the same cluster with the same label, and $b$ is the number of sample pairs that are in different clusters with different labels.
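A direct pairwise implementation of the Rand index (O(n^2) over sample pairs) can be sketched as:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two partitions agree."""
    a = b = n_pairs = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        a += same_true and same_pred              # same cluster, same label
        b += (not same_true) and (not same_pred)  # different cluster, different label
        n_pairs += 1
    return (a + b) / n_pairs

# RI is invariant to relabeling of the clusters.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Note that RI compares partitions, not label names, so any bijective renaming of the predicted cluster ids leaves the score unchanged.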
C.2 Symmetric mean absolute percentage error
Symmetric mean absolute percentage error (SMAPE) is an accuracy measure for time series forecasting based on percentage (or relative) errors. It is defined as
$\mathrm{SMAPE} = \frac{1}{h} \sum_{t=1}^{h} \frac{|\hat{y}_t - y_t|}{(|y_t| + |\hat{y}_t|)/2},$
where $y_t$ is the actual value and $\hat{y}_t$ is the predicted value for time $t$, and $h$ is the horizon of time series forecasting.
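SMAPE in its common unscaled form (some variants multiply by 100%) is a few lines of NumPy:

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE over a forecast horizon; lower is better, bounded by 2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_pred - y_true) / ((np.abs(y_true) + np.abs(y_pred)) / 2))

print(round(smape([100, 200], [110, 180]), 4))  # 0.1003
```

Because the denominator is the mean of the actual and predicted magnitudes, each term is bounded by 2, which makes SMAPE comparable across series of very different scales.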
C.3 Quantile loss
In the experiments, we evaluate different methods with the rolling window strategy. The target values of the macroscopic time series for each dataset are given as $\{y_{i,t}\}$, where $y_{i,t}$ is the $i$-th testing sample of the macroscopic time series and $t$ is the lead time after the forecast start point. For a given quantile $\rho$, we denote the predicted $\rho$-quantile for $y_{i,t}$ as $\hat{y}^{\rho}_{i,t}$. To obtain such a quantile prediction from the estimation of clustered time series, a set of predicted samples of each clustered time series is first drawn. Then the realizations are summed, and the samples of these sums represent the estimated distribution for $y_{i,t}$. Finally, we take the quantile from this empirical distribution.
The quantile loss is then defined as
$L_{\rho}(y, \hat{y}^{\rho}) = \rho \, (y - \hat{y}^{\rho}) \, \mathbb{I}[y > \hat{y}^{\rho}] + (1 - \rho) \, (\hat{y}^{\rho} - y) \, \mathbb{I}[y \le \hat{y}^{\rho}],$
where $\mathbb{I}[\cdot]$ is an indicator function.
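This is the standard pinball loss; a minimal implementation (the example inputs are hypothetical, not data from the paper):

```python
import numpy as np

def quantile_loss(y_true, y_pred, rho):
    """Pinball loss at quantile rho, averaged over all forecast points."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.where(diff > 0, rho * diff, (rho - 1) * diff))

# At rho = 0.9, under-prediction (y > y_hat) is penalized 9x more heavily
# than over-prediction of the same magnitude.
print(quantile_loss([10.0], [8.0], 0.9))   # 1.8
print(quantile_loss([10.0], [12.0], 0.9))  # ~0.2
```

The asymmetric penalty is what pushes the minimizer of the expected loss to the ρ-quantile of the target distribution rather than its mean.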
Appendix D Experiments
D.1 Environment setting
We conduct the experiments on an internal cluster with core CPU, G RAM and P100 GPU. MixSeq, together with the neural-network-based time series forecasting methods, is implemented with TensorFlow 1.14.0.
D.2 Simulation parameters of toy examples
We use the TimeSynth (https://github.com/TimeSynth/TimeSynth) Python package to generate the simulated time series data. For GP time series, we use RBF as the kernel function. The lengthscale and variance are , and for the mixture model of 3 GPs. Then, we add time series generated by GPs with and for the samples from the mixture of 5 GPs. Similarly, we use ARMA(2, 0) to generate the ARMA time series. The parameters of the first three components of the mixture of ARMA are , and respectively. The parameters of the other two components are and . The initial values of the ARMA series are sampled from .
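For reference, GP sample paths with an RBF kernel (the kind TimeSynth produces) can be drawn via a Cholesky factorization of the kernel matrix; the lengthscale and variance below are placeholders, since the exact values were elided above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gp_rbf(n_points, lengthscale, variance):
    """Draw one sample path from a zero-mean GP with an RBF kernel."""
    t = np.linspace(0.0, 10.0, n_points)
    sq_dist = (t[:, None] - t[None, :]) ** 2
    K = variance * np.exp(-sq_dist / (2.0 * lengthscale**2))
    # Small diagonal jitter keeps the Cholesky factorization numerically stable.
    L = np.linalg.cholesky(K + 1e-6 * np.eye(n_points))
    return L @ rng.normal(size=n_points)

path = sample_gp_rbf(100, lengthscale=1.0, variance=0.5)
print(path.shape)  # (100,)
```

Shorter lengthscales yield rougher paths and larger variances scale their amplitude, which is how the mixture components are made distinguishable.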
Appendix E Societal impacts
We study the problem of whether macroscopic time series forecasting can be improved by leveraging microscopic time series, and propose MixSeq, which clusters microscopic time series to improve the forecasting of macroscopic time series. This work will be especially useful for financial institutions and e-commerce platforms, e.g., for loan forecasting, balance forecasting, and Gross Merchandise Volume (GMV) forecasting. Such forecasting can support business decisions, for example controlling the risk of a financial institution or informing lending to merchants on an e-commerce platform. Misuse of microscopic data could possibly lead to privacy issues; as such, protecting microscopic data with privacy-preserving techniques is important.