MixSeq: Connecting Macroscopic Time Series Forecasting with Microscopic Time Series Data

by   Zhibo Zhu, et al.

Time series forecasting is widely used in business intelligence, e.g., forecast stock market price, sales, and help the analysis of data trend. Most time series of interest are macroscopic time series that are aggregated from microscopic data. However, instead of directly modeling the macroscopic time series, rare literature studied the forecasting of macroscopic time series by leveraging data on the microscopic level. In this paper, we assume that the microscopic time series follow some unknown mixture probabilistic distributions. We theoretically show that as we identify the ground truth latent mixture components, the estimation of time series from each component could be improved because of lower variance, thus benefitting the estimation of macroscopic time series as well. Inspired by the power of Seq2seq and its variants on the modeling of time series data, we propose Mixture of Seq2seq (MixSeq), an end2end mixture model to cluster microscopic time series, where all the components come from a family of Seq2seq models parameterized by different parameters. Extensive experiments on both synthetic and real-world data show the superiority of our approach.



There are no comments yet.


page 1

page 2

page 3

page 4


Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case

In this paper, we present a new approach to time series forecasting. Tim...

GRATIS: GeneRAting TIme Series with diverse and controllable characteristics

The explosion of time series data in recent years has brought a flourish...

Time Series Prediction : Predicting Stock Price

Time series forecasting is widely used in a multitude of domains. In thi...

Orbit: Probabilistic Forecast with Exponential Smoothing

Time series forecasting is an active research topic in academia as well ...

tsBNgen: A Python Library to Generate Time Series Data from an Arbitrary Dynamic Bayesian Network Structure

Synthetic data is widely used in various domains. This is because many m...

Protecting Time Series Data with Minimal Forecast Loss

Forecasting could be negatively impacted due to anonymization requiremen...

Visual Forecasting of Time Series with Image-to-Image Regression

Time series forecasting is essential for agents to make decisions in man...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Time series forecasting has proven to be important to help people manage resources and make decisions Lim and Zohren (2020). For example, probabilistic forecasting of product demand and supply in retails Chen et al. (2019), or the forecasting of loans Abu-Mostafa and Atiya (1996) in a financial institution can help people do inventory or financing planning to maximize the profit. Most time series of interest are macroscopic time series, e.g., the sales of an online retail platform, the loans of a financial institution, or the number of infections caused by some pandemic diseases in a state, that are comprised of microscopic time series, e.g., the sales of a merchant in the online retail, the loans from a customer given the financial institution, or the number of infections in a certain region. That is, the observed macroscopic time series are just the aggregation or sum of microscopic time series.

Although various time series forecasting models, e.g., State Space Models (SSMs) Durbin and Koopman (2012), Autoregressive (AR) models Asteriou and Hall (2011)

, or deep neural networks 

Benidis et al. (2020), have been widely studied for decades, all of them study the modeling of time series without considering the connections between macroscopic time series of interest and the underlying time series on the microscopic level.

In this paper, we study the question whether the forecasting of macroscopic time series can be improved by leveraging the underlying microscopic time series, and the answer is yes. Basically, though accurately modeling each microscopic time series could be challenging due to large variations, we show that by carefully clustering microscopic time series into clusters, i.e., clustered time series, and using canonical approaches to model each of clusters, finally we can achieve promising results by simply summing over the forecasting results of each cluster.

To be more specific, first, we assume that the microscopic time series are generated from a probabilistic mixture model McLachlan and Basford (1988) where there exist components. The generation of each microscopic time series is by first selecting a component from with a prior (a Discrete distribution), then generating the microscopic observation from a probabilistic distribution parameterized by the corresponding component . We show that as we can identify the ground truth components of the mixture, and the ground truth assignment of each microscopic observation, independent modeling of time series data from each component could be improved due to lower variance, and further benefitting the estimation of macroscopic time series that are of interest. Second, inspired by recent successes of Seq2seq models Vaswani et al. (2017); Cho et al. (2014); Du et al. (2018)

based on deep neural networks, e.g., variants of recurrent neural networks (RNNs) 

Hewamalage et al. (2021); Yu et al. (2017); Maddix et al. (2018)

, convolutional neural networks (CNNs) 

Bai et al. (2018); Hao et al. (2020), and Transformers Li et al. (2019); Wu et al. (2020), we propose Mixture of Seq2seq (MixSeq), a mixture model for time series, where the components come from a family of Seq2seq models parameterized by different parameters. Third, we conduct synthetic experiments to demonstrate the superiority of our approach, and extensive experiments on real-world data to show the power of our approach compared with canonical approaches.

Our contributions. We summarize our contributions in two-fold. (1) We show that by transforming the original macroscopic time series via clustering, the expected variance of each clustered time series could be optimized, thus improving the accuracy and robustness for the estimation of macroscopic time series. (2) We propose MixSeq which is an end2end mixture model with each component coming from a family of Seq2seq models. Our empirical results based on MixSeq show the superiority compared with canonical approaches.

2 Background

In this section, we first give a formal problem definition. We then review the bases related to this work, and have a discussion of related works.

Problem definition. Let us assume a macroscopic time series , and denotes the value of time series at time . We aim to predict the next time steps, i.e., . We are interested in the following conditional distribution


where represents in interval . To study the above problem, we assume that the macroscopic time series is comprised of microscopic time series, i.e., where denotes the value of the -th microscopic time series at time . We aim to cluster the microscopic time series into clustered time series , where given the label assignment of the -th microscopic time series . This is based on our results in Section 3 that the macroscopic time series forecasting can be improved with optimal clustering. Hence, instead of directly modeling , we study the clustering of microscopic time series in Section 4, and model the conditional distribution of clustered time series with canonical approaches.

2.1 Seq2seq: encoder-decoder architectures for time series

An encoder-decoder based neural network models the conditional distribution Eq. (1

) as a distribution from the exponential families, e.g., Gaussian, Gamma or Binomial distributions, with sufficient statistics generated from a neural network. The encoder feeds

into a neural architecture, e.g., RNNs, CNNs or self-attentions, to generate the representation of historical time series, denoted as , then we use a decoder to yield the result . After iterations in an autoregressive style, it finally generates the whole time series to be predicted.

To instantiate above Seq2seq architecture, we denote , where , as covariates that are known a priori, e.g., dates. We denote where we use for concatenation. The encoder generates the representation of via Transformer Vaswani et al. (2017); Li et al. (2019) as follows. We first transform by some functions , e.g., causal convolution Li et al. (2019) to . Transformer then iterates the following self-attention layer times:


That is, we first transform 111We ignore the subscript for simplicity in condition that the context is of clarity. into query, key, and value matrices, i.e., , , and respectively, where in each layer are learnable parameters. Then we do scaled inner product attention to yield where is a mask matrix to filter out rightward attention by setting all upper triangular elements to . We denote

as a multi-layer perceptron function. Afterwards, we can generate the representation

for via where we denote as a deep set function Zaheer et al. (2017) that operates on rows of , i.e., . We denote the feedforward function to generate as , i.e., Eq (2).

Given , the decoder generates the sufficient statistics and finally yields from a distribution in the exponential family.

2.2 Related works

Time series forecasting has been studied for decades. We summarize works related to time series forecasting into two categories. First, many models come from the family of autoregressive integrated moving average (ARIMA) Box and Jenkins (1968); Asteriou and Hall (2011), where AR indicates that the evolving variable of interest is regressed on its own lagged values, the MA indicates that the regression error is actually a linear combination of error terms, and the “I” indicates that the data values have been replaced with the difference between their values and the previous values to handle non-stationary Pemberton (1990). The State Space Models (SSM) Durbin and Koopman (2012) aim to use state transition function to model the transfer of states and generate observations via a observation function. These statistical approaches typically model time series independently, and most of them only utilize values from history but ignore covariates that are important signals for forecasting. Second, as rapid development of deep neural networks, people started studying many neural networks for the modeling of time series Benidis et al. (2020); Lim and Zohren (2020). Most successful neural networks are based on the encoder-decoder architectures Vaswani et al. (2017); Cho et al. (2014); Du et al. (2018); Sutskever et al. (2014); Bahdanau et al. (2015); Dama and Sinoquet (2021); Lim and Zohren (2020); Ma et al. (2019), namely Seq2seq. Basically, various Seq2seq models based on RNNs Hewamalage et al. (2021); Salinas et al. (2020); Wen et al. (2017); Lai et al. (2018); Yu et al. (2017); Maddix et al. (2018), CNNs Bai et al. (2018); Hao et al. (2020), and Transformers (self-attentions) Li et al. (2019); Wu et al. (2020) are proposed to model the non-linearity for time series.

No matter models studied in statistics or deep neural networks, these works mainly focus on the forecasting of single or multivariate time series, but ignore the auxiliary information that the time series could be made up of microscopic data.

Time series clustering

is another topic for exploratory analysis of time series. We summarize the literature into three categories, i.e., study of distance functions, generative models, and feature extraction for time series.

First, Dynamic time wrapping Petitjean et al. (2011), similarity metric that measures temporal dynamics Yang and Leskovec (2011), and specific measures for the shape Paparrizos and Gravano (2015) of time series are proposed to adapt to various time series characteristics, e.g., scaling and distortion. Typically these distance functions are mostly manually defined and cannot generalize to more general settings. Second

, generative model based approaches assume that the observed time series is generated by an underlying model, such as hidden markov model 

Oates et al. (1999) or mixture of ARMA Xiong and Yeung (2004). Third, early studies on feature extraction of time series are based on component analysis Guo et al. (2008), and kernels, e.g., u-shapelet Zakaria et al. (2012). As the development of deep neural networks, several encoder-decoder architectures Madiraju et al. (2018); Ma et al. (2019) are proposed to learn better representations of time series for clustering.

However, the main purpose of works in this line is to conduct exploratory analysis of time series, while their usage for time series forecasting has never been studied. That is, these works define various metrics to evaluate the goodness of the clustering results, but how to learn the optimal clustering for time series forecasting remains an open question.

3 Microscopic time series under mixture model

We analyze the variance of mixture models, and further verify our results with simple toy examples.

3.1 Analyses on the variance of mixture model

In this part, we analyze the variance of probabilistic mixture models. A mixture model McLachlan and Basford (1988)

is a probabilistic model for representing the presence of subpopulations within an overall population. Mixture model typically consists of a prior that represents the probability over subpopulations, and components, each of which defines the probability distribution of the corresponding subpopulation. Formally, we can write


where denotes the mixture distribution, denotes the prior over subpopulations, and represents the distribution corresponding to the -th component.

Proposition 1.

Assuming the mixture model with probability density function

, and corrsponding components with constants ( lie in a simplex), we have . In condition that and

have first and second moments, i.e.,

and for , and and for components , we have:


We use the fact that . By using Jensen’s Inequality on , we immediately yield the result. See detailed proofs in supplementary.

This proposition states that, if we have limited data samples (always the truth in reality) and in case we know the ground truth data generative process a priori, i.e., the exact generative process of each sample from its corresponding component, the variance on expectation conditioned on the ground truth data assignment should be no larger than the variance of the mixture. Based on the assumption that microscopic data are independent, the variance of the aggregation of clustered data should be at least no larger than the aggregation of all microscopic data, i.e., the macroscopic data. So the modeling of clustered data from separate components could possibly be more accurate and robust compared with the modeling of macroscopic data. This result motivates us to forecast macroscopic time series by clustering the underlying microscopic time series. Essentially, we transform the original macroscopic time series data to clusters with lower variances using a clustering approach, then followed by any time series models to forecast each clustered time series. After that, we sum over all the results from those clusters so as to yield the forecasting of macroscopic time series. We demonstrate this result with toy examples next.

3.2 Demonstration with toy examples

We demonstrate the effectiveness of forecasting macroscopic time series by aggregating the forecasting results from clustered time series.

Simulation setting. We generate microscopic time series from a mixture model, such as Gaussian process (GP) Roberts et al. (2013) or ARMA Box et al. (2015) with or components. We generate time series for each component, and yield or microscopic time series in total. We sum all the time series as the macroscopic time series. We get clustered time series by simply summing microscopic time series from the same component. Our purpose is to compare the performance between forecasting results directly on macroscopic time series (macro results) and sum of forecasting results of clustered time series (clustered results). We set the length of time series as , and use rolling window approach for training and validating our results in the last time steps (i.e., at each time step, we train the model using the time series before current time point, and validate using the following values). We fit the data with either GP or ARMA depending on the generative model. We describe the detailed simulation parameters of mixture models in supplementary.

Simulation results. Table 1 shows the results measured by symmetric mean absolute percentage error (SMAPE)222Details are in supplementary. It is obvious that no matter time series generated by mixture of GP or mixure of ARMA, the clustered results are superior to macro results. In other words, if we knew the ground truth component of each microscopic time series, modeling on clustered data aggregated by time series from the same component would have better results compared with directly modeling the macroscopic time series.

GP time series data ARMA time series data
3 clusters 5 clusters 3 clusters 5 clusters
macro results 0.0263 0.0242 0.5870 0.5940
clustered results 0.0210 0.0198 0.3590 0.3840
Table 1: We run the experiments times, and show the average results (SMAPE) of macro results and clustered results with ground truth clusters. Lower is better.

4 MixSeq: a mixture model for time series

Based on the analysis in Section 3, we assume microscopic time series follow a mixture distribution, and propose a mixture model, MixSeq, to cluster microscopic time series.

4.1 Our model

Our model assumes that each of microscopic time series follows a mixture probabilistic distribution. To forecast given , our approach is to first partition into clusters via MixSeq. Since the distribution is applicable to all microscopic time series, we ignore the subscript and time interval for simplicity. We study the following generative probability of :


where is the discrete latent variable, is the number of components in the mixture model, is the prior of cluster indexed by , and is the probability of time series generated by the corresponding component governed by parameter . Note that we have parameters in the mixture model.

We use ConvTrans introduced in Li et al. (2019) as our backbone to model the conditional . To instantiate, we model the conditional by first generating via causal convolution Li et al. (2019). We then generate given . Finally we generate the representation for as , where and

as ReLU activation function. Afterwards, we decode

to form the specific distribution from an exponential family. In particular, we use Gaussian distribution

, where the mean and variance can be generated by following transformations,


where are parameters, and are biases.

4.2 Posterior inference and learning algorithms

We aim to learn the parameter and efficiently infer the posterior distribution of in Eq. (5). However, it is intractable to maximize the marginal likelihood after taking logarithm, i.e., , since of the involvement of logarithm of sum. To tackle this non-convex problem, we resort to stochastic auto-encoding variational Bayesian algorithm (AEVB) Kingma and Welling (2013). Regarding single microscopic time series, the variational lower bound (LB) Kingma and Welling (2013) on the marginal likelihood is as below.


where is the approximated posterior of the latent variable given time series . The benefit of using AEVB is that we can treat as an encoder modeled by a neural network. Hence, we reuse the ConvTrans Li et al. (2019) as our backbone, and model as:


where we denote , with parameter is the deep set function, and as parameters to project the encoding to dimension. After the softmax operator, we derive the posterior distribution that lies in a simplex of dimension. Note that we use distinct ’s, ’s and ’s with different parameters to model and respectively. We assign each microscopic to cluster in our experiments.

Mode collapsing. We find that directly optimizing the lower bound in Eq. (7) suffers from the mode collapsing problem. That is, the encoder tends to assign all microscopic time series to one cluster, and does not effectively distinguish the data as expected, thus implying ( for mutual information). In order to address the above mode collapsing problem, we add to the lower bound in Eq. (7) which expects that the latent variable can extract discriminative information from different time series Zhao et al. (2018). Then, we have


where is an average of approximated posteriors over all microscopic data. We approximate this term by using a mini-batch of samples, i.e., .

Annealing tricks. Regarding long-length time series, the reconstruction loss and KL divergence in Eq. (9) are out of proportion. In this situation, the KL divergence has few effects on the optimization objective. So we finally derive the following objective to maximize:



is the trade-off hyperparameter. We use the following annealing strategy

to dynamically adjust in the training process, where is the parameter controlling the rate of descent. Meanwhile, we also involve the -norm regularizers on Seq2seq’s parameters with hyperparameter .

ARMA synthetic data DeepAR synthetic data
2 clusters 3 clusters 2 clusters 3 clusters
MixARMA 0.9982(0.0001) 0.9509(0.1080) 0.7995(0.2734) 0.7687(0.0226)
MixSeq 0.9915(0.0024) 0.9540(0.0974) 0.9986(0.0003) 0.8460(0.0774)
MixSeq-infer 0.9929(0.0027) 0.9544(0.0975) 0.9982(0.0006) 0.8460(0.0775)
Table 2:

Mean and standard deviation (SD, in bracket) of Rand Index (RI, the higher the better) by clustering on synthetic data generated by ARMA and DeepAR. MixSeq-infer represents that we infer the cluster of new data generated by different models after training MixSeq. On ARMA data, MixSeq and MixARMA have comparable performance; on DeepAR data, MixARMA degrades significantly which shows the effectiveness of MixSeq.

5 Experimental results

We conduct extensive experiments to show the advantage of MixSeq. We evaluate the clustering performance of MixSeq on synthetic data, present the results of macroscopic time series forecasting on real-world data, and analyze the sensitivity of the cluster number of MixSeq.

5.1 Synthetic datasets

To demonstrate MixSeq’s capability of clustering microscopic time series that follow various probabilistic mixture distributions, we conduct clustering experiments on synthetic data with ground truth. We generate two kinds of synthetic time series by ARMA Box et al. (2015) and DeepAR Salinas et al. (2020) respectively. For each model, we experiment with different number of clusters (2 and 3) generated with components governed by different parameters.

Experiment setting. To generate data from ARMA, we use ARMA(2, 0) and with . We set parameters for three components as , , and respectively. The synthetic time series from a mixture of components are generated using the first two components. The synthetic time series from a mixture of components are generated using all components. To generate data from DeepAR, we use the DeepAR model with one LSTM layer, and the hidden number of units is . Since it is difficult to randomly initialize the parameters of DeepAR, we train a base model on the real-world Wiki dataset Tran et al. (2021) (discussed in section 5.2). To build the other two DeepAR components, we respectively add random disturbance to the parameters of the base model. For each cluster, we generate time series with random initialized sequences, and set the length of time series as .

We use 1-layer causal convolution Transformer (ConvTrans Li et al. (2019)) as our backbone model in MixSeq. We use the following parameters unless otherwise stated. We set the number of multi-heads as , kernel size as , the number of kernel for causal convolution , dropout rate as , the penalty weight on the -norm regularizer as 1e-5, and . Meanwhile, we set the prior as . For the training parameters, we set the learning rate as 1e-4, batch size as

and epochs as

. Furthermore, the in MixSeq is annealed using the schedule , where denotes the current epoch, and is updated in the -th epochs. For comparison, we employ MixARMA Xiong and Yeung (2004), a mixture of ARMA(2, 0) model optimized by EM algorithm Blei et al. (2017), as our baseline. Both methods are evaluated using Rand Index (RI) Rand (1971) (more details in supplementary).

Experiment results. We show the clustering performance of MixSeq and MixARMA on the synthetic data in Table 2. The results are given by the average of 5 trials. Regarding the synthetic data from ARMA, both MixSeq and MixARMA perform very well. However, for the synthetic data from DeepAR, MixARMA degrades significantly while MixSeq achieves much better performance. This suggests that MixSeq can capture the complex nonlinear characteristics of time series generated by DeepAR when MixARMA fails to do so. Furthermore, we also generate new time series by the corresponding ARMA and DeepAR models, and infer their clusters with the trained MixSeq model. The performance is comparable with the training performance, which demonstrates that MixSeq actually captures the generation mode of time series.

5.2 Real-world datasets

We further evaluate the effectiveness of our model on the macroscopic time series forecasting task. We compare MixSeq with existing clustering methods and state-of-the-art time series forecasting approaches on several real-world datasets. Specifically, for each dataset, the goal is to forecast the macroscopic time series aggregated by all microscopic data. We cluster microscopic time series into groups, and aggregate the time series in each group to form the clustered time series. Then, we train the forecasting models on the clustered time series separately, and give predictions of each clustered time series. Finally, the estimation of macroscopic time series is obtained by aggregating all the predictions of clustered time series.

We report results on three real-world datasets, including Rossmann333https://www.kaggle.com/c/rossmann-store-sales, M5444https://www.kaggle.com/c/m5-forecasting-accuracy and Wiki Tran et al. (2021). The Rossmann dataset consists of historical sales data of Rossmann stores recorded every day. Similarly, the M5 dataset consists of microscopic time series as the daily sales of different products in ten Walmart stores in USA. The Wiki dataset contains microscopic time series representing the number of daily views of different Wikipedia articles. The dataset summary is shown in Table 3, together with the setting of data splits.

# microscopic
time series
length of
time series
train interval test internal
Rossmann 1115 942 20130101-20141231 20150101-20150731
M5 30490 1941 20110129-20160101 20160101-20160619
Wiki 309765 1827 20150701-20191231 20200101-20200630
Table 3: Real-world dataset summary.
Macro DTCR MixARMA MixSeq
DeepAR 0.1904/0.0869 0.2292/0.1432 0.1981/0.1300 0.1857/0.0987
Rossmann TCN 0.1866/0.1005 0.2023/0.1633 0.1861/0.1160 0.1728/0.0997
ConvTrans 0.1861/0.0822 0.2077/0.0930 0.1866/0.0854 0.1847/0.0813
DeepAR 0.0548/0.0289 0.0787/0.0627 0.0624/0.0582 0.0582/0.0445
M5 TCN 0.0790/0.0635 0.0847/0.0805 0.0762/0.0789 0.0694/0.0508
ConvTrans 0.0553/0.0260 0.0514/0.0260 0.0497/0.0257 0.0460/0.0238
DeepAR 0.0958/0.0962 0.1073/0.1336 0.0974/0.1070 0.0939/0.0901
Wiki TCN 0.0966/0.1064 0.1237/0.1480 0.0963/0.1218 0.0886/0.0980
ConvTrans 0.0968/0.0589 0.1029/0.0531 0.0961/0.0594 0.0901/0.0516
Table 4: Comparisons on the microscopic time series clustering methods for macroscopic time series forecasting combined with three network-based forecasting methods: testing /-loss on three real-world datasets. Lower is better.

Experiment setting. We summarize the clustering strategies for macroscopic time series forecasting as follows. (1) “DTCR” Ma et al. (2019)

is the deep temporal clustering representation method which integrates the temporal reconstruction, K-means objective and auxiliary classification task into a single Seq2seq model.

(2) “MixARMA” Xiong and Yeung (2004) is the mixture of ARMA model that uses ARMA to capture the characteristics of microscopic time series. (3) “MixSeq” is our model with 1-layer causal convolution Transformer Li et al. (2019). (4) We also report the results that we directly build forecasting model on the macroscopic time series without leveraging the microscopic data, named as “Macro”.

For time series forecasting, we implement five methods combined with each clustering strategy, including ARMA Box et al. (2015), Prophet Taylor and Letham (2018), DeepAR Salinas et al. (2020), TCN Bai et al. (2018), and ConvTrans Li et al. (2019). ARMA and Prophet give the prediction of point-wise value for time series, while DeepAR, TCN and ConvTrans are methods based on neural network for probabilistic forecasting with Gaussian distribution. We use the rolling window strategy on the test interval, and compare different methods in terms of the long-term forecasting performance for days. The data of last two months in train interval are used as validation data to find the optimal model.

We do grid search for the following hyperparameters in clustering and forecasting algorithms, i.e., the number of clusters , the learning rate , the penalty weight on the -norm regularizers , and the dropout rate . The model with best validation performance is applied for obtaining the results on test interval. Meanwhile, we set batch size as , and the number of training epochs as for Rossmann, for M5 and for Wiki. For DTCR, we use the same setting as Ma et al. (2019).

Regarding time series forecasting models, we apply the default setting to ARMA and Prophet provided by the Python packages. The architectures of DeepAR, TCN and ConvTrans are as follows. The number of layers and hidden units are and for DeepAR. The number of multi-heads and kernel size are and for ConvTrans. The kernel size is for TCN with dilations in . We also set batch size as and the number of epochs as for all forecasting methods.

Experiment results. Following Li et al. (2019); Rangapuram et al. (2018); Tran et al. (2021), we evaluate the experimental methods using SMAPE and

-quantile loss

555Detailed definition is in supplementary. with . The SMAPE results of all combination of clustering and forecasting methods are given in Table 5. Table 4 shows the -loss for DeepAR, TCN and ConvTrans which give probabilistic forecasts. All results are run in trials. The best performance is highlighted by bold character. We observe that MixSeq is superior to other three methods, suggesting that clustering microscopic time series by our model is able to improve the estimation of macroscopic time series. Meanwhile, Macro and MixARMA have comparable performance and are better than DTCR, which further demonstrates the effectiveness of our method, i.e., only proper clustering methods are conductive to macroscopic time series forecasting.

Macro DTCR MixARMA MixSeq
ARMA 0.2739(0.0002) 0.2735(0.0106) 0.2736(0.0013) 0.2733(0.0012)
Prophet 0.1904(0.0007) 0.1738(0.0137) 0.1743(0.0037) 0.1743(0.0026)
Rossmann DeepAR 0.1026(0.0081) 0.1626(0.0117) 0.1143(0.0088) 0.0975(0.0013)
TCN 0.1085(0.0155) 0.1353(0.0254) 0.1427(0.0180) 0.1027(0.0075)
ConvTrans 0.1028(0.0091) 0.1731(0.0225) 0.1022(0.0041) 0.0961(0.0019)
ARMA 0.0540(0.0001) 0.0544(0.0018) 0.0541(0.0003) 0.0543(0.0001)
Prophet 0.0271(0.0003) 0.0271(0.0003) 0.0269(0.0002) 0.0267(0.0002)
M5 DeepAR 0.0278(0.0034) 0.0410(0.0046) 0.0319(0.0063) 0.0298(0.0029)
TCN 0.0412(0.0075) 0.0447(0.0044) 0.0395(0.0094) 0.0358(0.0014)
ConvTrans 0.0274(0.0048) 0.0253(0.0020) 0.0245(0.0024) 0.0227(0.0006)
ARMA 0.0362(0.0001) 0.0363(0.0006) 0.0364(0.0005) 0.0362(0.0002)
Prophet 0.0413(0.0001) 0.0423(0.0008) 0.0434(0.0003) 0.0420(0.0005)
Wiki DeepAR 0.0481(0.0008) 0.0552(0.0015) 0.0489(0.0006) 0.0470(0.0002)
TCN 0.0494(0.0076) 0.0654(0.0022) 0.0491(0.0015) 0.0446(0.0023)
ConvTrans 0.0471(0.0029) 0.0497(0.0012) 0.0466(0.0001) 0.0440(0.0010)
Table 5: Comparisons on the microscopic time series clustering methods for macroscopic time series forecasting combined with five forecasting methods: testing SMAPE on three real-world datasets.

5.3 Sensitivity analysis of cluster number

Figure 1: The macroscopic time series forecasting performance based on MixSeq with different cluster number on three real-world datasets. The time series forecasting method is fixed as causal convolution Transformer. Top three figures show the -loss and bottom three figures show the -loss.

The cluster number is a critical hyperparameter of MixSeq. To analyze its effect on the forecasting of macroscopic time series, we conduct experiments on both synthetic data and real-world data. The results state the importance of setting a proper number of clusters. We suggest do binary search on this critical hyperparameter. Details are as follows.

Synthetic data. Following the experimental setting in section 5.1, we generate microscopic time series from different parameterized ARMA respectively. That is, the ground truth number of clusters is and there are microscopic samples in total. The aggregation of all samples is the macroscopic time series which is the forecasting target of interest. Then, we compare the forecasting performance between the method that directly forecasts macroscopic time series (denoted as Macro) and our method with different cluster numbers (including , , , denoted as MixSeq_2, MixSeq_3 and MixSeq_5 respectively). We fix the forecasting method as ARMA, and apply the rolling window approach for T+10 forecasting in the last time steps. The average SMAPE of trials are , , and for Macro, MixSeq_2, MixSeq_3 and MixSeq_5 respectively. MixSeq_3 being set with ground truth cluster number shows the best performance, while MixSeq_2 and MixSeq_5 would degenerate though still better than Macro. This result shows the importance of setting a proper number of clusters.

Real-world data. We do the evaluation on three real-world datasets by varying the cluster number of MixSeq while maintaining the other parameters fixed. For Rossmann and M5 datasets, we set the cluster number , while we explore the cluster number on Wiki dataset. The architecture and training parameters of MixSeq are same to section 5.2, except that we set the dropout rate as , the penalty weight on the -norm regularizer as 5e-5, and the learning rate as 1e-4. Meanwhile, we also fix the time series forecasting method as causal convolution Transformer (ConvTrans).

Figure 1 reports the macroscopic time series forecasting performance (testing on and loss) based on MixSeq with different cluster number on three real-world datasets. The horizontal dashed lines are the results with that directly building ConvTrans model on the macroscopic time series without leveraging the microscopic data (named as “Macro” in section 5.2). It is obvious that each dataset has its own suitable number of clusters, and our method is relatively sensitive to , especially on the dataset with less microscopic time series, such as Rossmann. Similar to the accurately modeling of each microscopic time series, the larger cluster number of MixSeq also brings large variance to macroscopic time series forecasting, which degrades the performance of our method.

6 Conclusion

In this paper, we study the problem that whether macroscopic time series forecasting can be improved by leveraging microscopic time series. Under mild assumption of mixture models, we show that appropriately clustering microscopic time series into groups is conductive to the forecasting of macroscopic time series. We propose MixSeq to cluster microscopic time series, where all the components come from a family of Seq2seq models parameterized with different parameters. We also propose an efficient stochastic auto-encoding variational Bayesian algorithm for the posterior inference and learning for MixSeq. Our experiments on both synthetic and real-world data suggest that MixSeq can capture the characteristics of time series in different groups and improve the forecasting performance of macroscopic time series.

This work is supported by Ant Group.


Appendix A Proofs

Proposition 1.

Assuming the mixture model with probability density function , and corrsponding components with constants ( lie in a simplex), we have . In condition that and have first and second moments, i.e., and for , and and for components , we have:


We prove the result based on the fact that we have for any moment that


We then derive the variance of mixture as


Since the squared function is convex, by Jensen’s Inequality we immediately have . ∎

Appendix B Complexity analysis and running time

Complexity analysis of MixSeq. The complexity of MixSeq depends on the number of clusters and three network architectures, including convolution, multi-head self-attention and MLP. For time series , where and are the length and dimension of time series respectively, the FLOPs (floating point operations) of convolution is , where and are the size and number of convolution kernel. The FLOPs of multi-head self-attention is , where is the number of multi-heads. The FLOPs of MLP is . Finally, the FLOPs of MixSeq is . Since , and are usually smaller than , so the time complexity can be simplified as which is similar to Transformer. Time series data is always recorded by day or hour. There are only one thousand values even for the data recorded in three years, so our method is capable for dealing with them. Furthermore, some existing methods can also be used in MixSeq to accelerate the computation of self-attention.

Running time of macroscopic time series forecasting. The overall running time of forecasting macroscopic data based on MixSeq is comprised of two steps. (1) The first step is to do clustering with MixSeq. Figure 2 shows the convergence of MixSeq over time (seconds) compared with comparison approaches to time series clustering. Our approach takes seconds to convergence on the dataset containing microscopic time series, while MixARMA and DTCR take and seconds to convergence respectively. The convergence rate of our method is not worse than existing neural network based approach, i.e., DTCR. (2) The second step is to forecast clustered time series with any proper forecasting model. The time complexity is in linear w.r.t the number of clustered time series. We can always accelerate this step by using more workers in parallel.

Figure 2: The convergence of different clustering methods over time (seconds) on Rossmann dataset. The optimization objective of MixARMA is maximized by EM algorithm, and the optimization objective of MixSeq and DTCR are minimized by gradient descent with Adam.

Appendix C Evaluation metrics

c.1 Rand index

Given the labels as a clustering ground truth, Rand index (RI) measures the clustering accuracy between ground truth and predicted clusters, defined as

where is the total number of samples, is the number of sample pairs that are in the same cluster with same label, and is the number of sample pairs that in different clusters with different labels.

c.2 Symmetric mean absolute percentage error

Symmetric mean absolute percentage error (SMAPE) is an accuracy measure for time series forecasting based on percentage (or relative) errors. is defined as

where is the actual value and is the predicted value for time , and is the horizon of time series forecasting.

c.3 -quantile loss

In the experiments, we evaluate different methods by the rolling window strategy. The target value of macroscopic time series for each dataset is given as , where is the -th testing sample of macroscopic time series and is the lead time after the forecast start point. For a given quantile , we denote the predicted -quantile for as . To obtain such a quantile prediction from the estimation of clustered time series, a set of predicted samples of each clustered time series is first sampled. Then each realization is summed and the samples of these sums represent the estimated distribution for . Finally, we can take the -quantile from the empirical distribution.

The -quantile loss is then defined as

where is an indicator function.

Appendix D Experiments

d.1 Enviroment setting

We conduct the experiments on an internal cluster with -core CPU, G RAM and

P100 GPU. Meanwhile, MixSeq, together with the time series forecasting methods based on neural network, are implemented with tensorflow 1.14.0.

d.2 Simulation parameters of toy examples

We use the TimeSynth666https://github.com/TimeSynth/TimeSynth python package to generate simulation time series data. For GP time series, we use RBF as the kernel function. The lengthscale and variance are , and for the mixture model of 3 GPs. Then, we add time series generated by GP with and for the samples from mixture of 5 GPs. Similarly, we use ARMA(2, 0) to generate ARMA time series. The parameters of the first three components of the mixture of ARMA are , and respectively. The parameters of another two components are and . The initial values of ARMA are sampled from .

Appendix E Societal impacts

We study the problem that whether macroscopic time series forecasting can be improved by leveraging microscopic time series, and finally propose MixSeq to cluster microscopic time series to improve the forecasting of macroscopic time series. This work will be especially useful for financial institutions and e-commercial platforms, e.g., loan forecasting, balance forecasting, and Gross Merchandise Volume (GMV) forecasting. The forecasting can help business decisions like controlling the risk of each financial institution, and help the lending to merchant in an e-commerce platform. Misuse of microscopic data could possibly lead to privacy issues. As such, protecting microscopic data with privacy-preserving techniques should be important.