1 Introduction
Time series forecasting is a key ingredient in the automation and optimization of business processes. In retail, decisions about which products to stock, when to (re)order them, and where to store them depend on forecasts of future demand in different regions; in (cloud) computing, the estimated future usage of services and infrastructure components guides capacity planning; regional forecasts of energy consumption are used to plan and optimize the generation of power; and workforce scheduling in warehouses and factories depends on forecasts of the future workload.
The prevalent forecasting methods in statistics and econometrics have been developed in the context of forecasting individual or small groups of time series. The core of these methods is formed by comparatively simple (often linear) models, which require manual feature engineering and model design by domain experts to achieve good performance (Harvey, 1990). Recently, there has been a paradigm shift from these modelbased methods to fullyautomated datadriven approaches. This shift can be attributed to the availability of large and diverse time series datasets in a wide variety of fields, e.g. energy consumption of households, server load in a data center, online user behavior, and demand for all products that a large retailer offers. These large datasets make it possible and necessary to learn models from data without significant manual work (Böse et al., 2017).
A collection of time series can exhibit various dependency relationships between the individual time series that can be leveraged in forecasting. These include: (1) local covariate relationships (e.g. the price and demand for a product, which tend to be (negatively) correlated), (2) indirect relationships through shared latent causes (e.g. demand for multiple products increasing because an advertising campaign is driving traffic to the site), (3) subtle dependencies through smoothness, temporal dynamics, and noise characteristics of time series that are measurements of similar underlying phenomena (e.g. product sales time series tend to be similar to each other, but different from energy consumption time series). The data in practical forecasting problems typically has all of these forms of dependencies. Making use of this data from related time series allows more complex and potentially more accurate models to be fitted without overfitting.
Classical time series models have been extended to address the above dependencies of types (1) and (2) by allowing exogenous variables (e.g. the ARIMAX model and control inputs in linear dynamical systems), and employing multivariate time series models that impose a certain covariance structure (dynamic factor models), respectively. Neural networkbased models have been recently shown to excel in extracting complex patterns from large datasets of related time series by exploiting similar smoothness and temporal dynamics, and common responses to exogenous input, i.e. dependencies of type (1) and (3) (Flunkert et al., 2017; Wen et al., 2017; Mukherjee et al., 2018; Gasthaus et al., 2019). These models struggle in producing calibrated uncertainty estimates. They can also be sampleinefficient, and cannot handle type (2) dependencies. See (Faloutsos et al., 2018) for a recent survey on traditional and modern methods for forecasting.
The two main challenges that arise in the fullyautomated datadriven approaches are: how to build statistical models that are able to borrow statistical strength and effectively learn to forecast from large and diverse data sources exhibiting all forms of dependencies, and how to combine the data efficiency and uncertainty characterization of classical time series models with the expressive power of deep neural networks. In this paper, we propose a family of models that efficiently (in terms of sample complexity) and effectively (in terms of computational complexity) addresses these aforementioned challenges.
1.1 Background
Classical time series models, such as general StateSpace Models (SSMs), including ARIMA and exponential smoothing, excel at modeling the complex dynamics of individual time series of sufficiently long history. For Gaussian StateSpace Models, these methods are computationally efficient, e.g. via a Kalman filter, and provide uncertainty estimates. Uncertainty estimates are critical for optimal downstream decision making. Gaussian Processes
(Rasmussen & Williams, 2006; Seeger, 2004) (GPs) are another family of the models that have been applied to time series forecasting (Girard et al., 2003; BrahimBelhouari & Bermak, 2004). These methods are local, that is, they learn one model per time series. As a consequence, they cannot effectively extract information across multiple time series. Finally, these classical methods struggle with coldstart problems, where more time series are added or removed over time.Mixed effect models (Crawley, 2012) consist of two kinds of effects: fixed (global) effects that describe the whole population, and random (local) effects that capture the idiosyncratic of individuals or subgroups. A similar mixed approach is used in Hierarchical Bayesian (Gelman et al., 2013) methods, which combine global and local models to jointly model a population of related statistical problems. In (Ahmed et al., 2012; Low et al., 2011), other combined local and global models are detailed.
Dynamic factor models (DFMs) have been studied in econometrics for decades to model the coevolvement of multiple time series (Geweke, 1977; Stock & Watson, 2011; Forni et al., 2000; Pan & Yao, 2008)
. DFMs can be thought as an extension of principal component analysis in the temporal setting. All the time series are assumed to be driven by a small number of dynamic (latent) factors. Similar to other models in classical statistics, theories and techniques are developed with assuming that the data is normally distributed and stationary. Desired theoretical properties are often lost when generalizing to other likelihoods. Closely related are the matrix factorization (MF) techniques
(Yu et al., 2016; Xie et al., 2017; Hooi et al., 2019)and tensor factorization
(Araujo et al., 2019), which have been applied to the time series matrix with temporal regularization to ensure the regularity of the latent time series. These methods are not probabilistic in nature, and cannot provide uncertainty estimation for nonGaussian observations.1.2 Main Contributions
In this paper, we propose a novel globallocal method, Deep Factor Models with Random Effects. It is based on a global DNN backbone and local probabilistic graphical models for computational efficiency. The globallocal structure extracts complex nonlinear patterns globally while capturing individual random effects for each time series locally.
The main idea of our approach is to represent each time series, or its latent function, as a combination of a global time series and a corresponding local model. The global part is given by a linear combination of a set of deep dynamic factors, where the loading is temporally determined by attentions. The local model is stochastic. Typical local choices include white noise processes, linear dynamical systems (LDS) or Gaussian processes (GPs). The stochastic local component allows for the uncertainty to propagate forward in time. Our contributions are as follows: i) Provide a unique characterization of exchangeable time series (Section
2); ii) Propose a novel globallocal framework for time series forecasting, based on i), that systematically marries deep neural networks and probabilistic models (Section 3); iii) Develop an efficient and scalable inference algorithm for nonGaussian likelihoods that is generally applicable to any normally distributed probabilistic models, such as SSMs and GPs (Section 3). As a byproduct, we obtain new approximate inference methods for SSMs/GPs with nonGaussian likelihoods; iv) Show how stateoftheart time series forecasting methods can be subsumed in the proposed framework (Section 4); v) Demonstrate the accuracy and data efficiency of our approach through scientific experiments (Section 5).2 Exchangeable Series
In this section, we formulate a general model for exchangeable time series. A distribution over objects is exchangeable, if the probability of the objects is invariant under any permutation. Exchangeable time series are a common occurrence. For instance, user purchase behavior is exchangeable, since there is no specific reason to assign a particular coordinate to a particular user. Other practical examples include sales statistics over similar products, prices of securities on the stock market and the use of electricity.
2.1 Characterization
Let , where denotes the exchangeable time series, denotes the domain of observations and denotes the length of the time series.^{1}^{1}1Without loss of generality and to avoid notational clutter, we omit the extension to time series beginning at different points of time. Our approach is general and covers these cases as well. We denote individual observations at some time as . We assume that we observe at discrete time steps to have a proper time series rather than a marked point process.
Theorem 1.
Let be a distribution over exchangeable time series over with length , where . Then admits the form
In other words, decomposes into a global time series and local times series , which are conditionally independent given the latent series .
Proof.
It follows from de Finetti’s theorem (Diaconis, 1977; Diaconis & Freedman, 1980) that
(1) 
Since are time series, we can decompose
in the causal direction using the chain rule as
Substituting this into the de Finetti factorization in Eqn. (1) gives
Lastly, we can decompose , such that contains a sufficient statistic of with respect to . This holds trivially by setting , but defeats the purpose of the subsequent models. Using the chain rule on and substituting the result in proves the claim. ∎
Theorem 2.
For treewise exchangeable time series, that is time series that can be grouped hierarchically into exchangeable sets, there exists a corresponding set of hierarchical latent variable models.
The proof is analogous to that of Theorem 1, and follows from a hierarchical extension of de Finetti’s theorem (Austin & Panchenko, 2014). This decomposition is useful when dealing with product hierarchies. For instance, the sales events within the category of iPhone charger cables and the category of electronics cables may be exchangeable.
2.2 Practical Considerations
We now review some common design decisions used in modeling time series. The first is to replace the decomposition by a tractable, approximate statistic of the past, such that . Here, typically assumes the form of a latent variable model via
. Popular choices for realvalued random variables are SSMs and GPs.
The second is to assume that the global variable is drawn from some . The inference in this model is costly, since it requires constant interaction, via Sequential Monte Carlo, variational inference or a similar procedure between local and global variables at prediction time. One way to reduce these expensive calculations is to incorporate past local observations explicitly. While this somewhat negates the simplicity of Theorem 1, it yields significantly higher accuracy for a limited computational budget,
Lastly, the time series often comes with observed covariates, such as a user’s location or a detailed description of an item being sold. We add these covariates to the time series signal to obtain the following model:
(2)  
Even though this model is rarely used in its full generality, Eqn. (2) is relevant because it is by some measure the most general model to consider, based on the de Finetti factorization in Theorem 1.
2.3 Special Cases
The globallocal structure has been used previously in a number of special contexts (Xu et al., 2009; Ahmadi et al., 2011; Hwang et al., 2016; Choi et al., 2011). For instance, in Temporal LDA (Ahmed et al., 2012) we assume that we have a common fixed DirichletMultinomial distribution capturing the distribution of tokens per topic, and a timevariant set of latent random variables capturing the changes in user preferences. This is a special case of the above model, where the global time series does not depend on time, but is stationary instead.
A more closely related case is the Neural Survival Recommender model of (Jing & Smola, 2017). This models the temporal dynamics of return times of a user to an app via survival analysis. In particular, it uses a LSTM for the global dynamics and LSTMs for the local survival probabilities. In this form, it falls into the category of models described by Theorem 1. Unlike the models we propose in this paper, it does not capture local uncertainty accurately. It also primarily deals with point processes rather than proper time series, and the inference algorithm differs quite significantly.
3 Deep Factor Models with Random Effects
Motivated by the characterization of exchangeable time series, in this section, we propose a general framework for globallocal forecasting models, called Deep Factor Models with Random Effects, that follows the structure given by the decomposition in Theorem 1. We describe the family of the methods, show three concrete instantiations (DFRNN, DFLDS, and DFGP), and derive the general inference and learning algorithm. Further models that can be obtained within the same framework, and additional details about the design choices, are described in Appendix A.
We are given a set of time series, with the time series consisting of tuples where are the input covariates, and is the corresponding observation at time . Given a forecast horizon , our goal is to calculate the joint predictive distribution of future observations,
i.e. the joint distribution over future observations given all covariates (including future ones) and past observations.
3.1 Generative Model
Our key modeling assumption is that each time series is governed by a fixed global (nonrandom) and a random component, whose prior is specified by a generative model . In particular, we assume the following generative process:
(3)  
(4)  
(5)  
The observation model can be any parametric distribution, such as Gaussian, Poisson or Negative Binomial. All the functions take features as input, and we define , the embedding
name  description  local  likelihood (gaussian case) 
DFRNN  Zeromean Gaussian noise process given by rnn  
DFLDS  Statespace models  (cf. Eqn. (6))  given by Kalman Filter 
DFGP  Zeromean Gaussian Process  (cf. Eqn. (7)) 
3.1.1 Global effects (common patterns)
The global effects are given by linear combinations of latent global deep factors modeled by RNNs. These deep factors can be thought of as dynamic principal components or eigen time series that drive the underlying dynamics of all the time series. As mentioned in Section 2.2, we restrict the global effects to be deterministic to avoid costly inference at the global level that depends on all time series.
The novel formulation of the fixed effects from the RNN in Eqn. (3) has advantages in comparison to a standard RNN forecaster. Figure 2 compares the generalization errors and average running times of using Eqn. (3) with the loss and a standard RNN forecaster with the same loss and a comparable number of parameters on a realword dataset electricity (Dheeru & Karra Taniskidou, 2017)
. Our fixed effect formulation shows significant data and computational efficiency improvement. The proposed model has less variance in comparison to the standard structure. Detailed empirical explorations of the proposed structures can be found in Section
5.1.3.1.2 Random effects (local fluctuations)
The random effects in Eqn. (4) can be chosen to be any classical probabilistic time series model . To efficiently compute their marginal likelihood , should be chosen to satisfy the normal distributed observation assumption. Table 1 summarizes the three models we consider for the local random effects models .
The first local model, DFRNN, is defined as , where is given by a noise RNN that takes input feature . The noise process becomes correlated with the covariance function implicitly defined by the RNN, resulting in a simple deep generative model.
The second local model, DFLDS, is a part of a special and robust family of SSMs, Innovation StateSpace Models (ISSMs) (Hyndman et al., 2008; Seeger et al., 2016). This gives the following generative model:
(6) 
The latent state contains information about the level, trend, and seasonality patterns. It evolves by a deterministic transition matrix and a random innovation The structure of the transition matrix and innovation strength determines which kinds of time series patterns the latent state encodes (cf. Appendix A.2 for the concrete choice of ISSM).
The third local model, DFGP, is defined as the Gaussian Process,
(7) 
where
denotes the kernel function. In this model, with each time series has its own set of GP hyperparameters to be learned.
3.2 Inference and Learning
Given a set of time series, our goal is to jointly estimate , the parameters in the global RNNs, the embeddings and the hyperparameters in the local models. We use maximum likelihood estimation, where Computing the marginal likelihood may require doing inference over the latent variables. The general learning algorithm is summarized in Algorithm 1.
3.2.1 Gaussian Likelihood
When the observation model is Gaussian, , the marginal likelihood can be easily computed for all three models. Evaluating the marginal likelihood for DFRNN is straightforward (see Table 1).
For DFLDS and DFGP, the Gaussian noise can be absorbed into the local model, yielding where , instead of coming from the noiseless LDS and GP, is generated by the noisy versions. More precisely, for DFLDS, and , and the marginal likelihood is obtained with a Kalman filter. In DFGP, it amounts to adding to Eqn (7), where is the Dirac delta function. The marginal likelihood becomes the standard GP marginal likelihood, which is the multivariate normal with mean and covariance matrix , where and
is the identity matrix of suitable size.
3.2.2 NonGaussian Likelihood
When the likelihood is not Gaussian, the exact marginal likelihood is intractable. We use variational inference, and optimize a variational lower bound of the marginal likelihood :
(8) 
where is the latent function values, and is the latent states in the local probabilistic models ^{2}^{2}2For cleaner notation, we omit the time series index . Optimizing this stochastic variational lower bound for the joint model over all time series is computationally expensive.
We propose a new method that leverages the structure of the local probabilistic model to enable fast inference at the pertime series level. This enables parallelism and efficient inference that scales up to a large collection (millions) of time series. Motivated by (Fraccaro et al., 2017), we choose the following structural approximation where the second term matches the exact conditional posterior with the random effect probabilistic model . With this form, given , the inference becomes the canonical inference problem with , from Section 3.2.1. The first term is given by another neural network parameterized by , called a recognition network in the variational AutoEncoder (VAE) framework (Kingma & Welling, 2014; Rezende et al., 2014). After massaging the equations, the stochastic variational lower bound in Eqn. (8) becomes
(9)  
(10) 
with for sampled from the recognition network. The first and third terms in Eqn. (10) are straightforward to compute. For the second term, we drop the sample index to obtain the marginal likelihood under the normally distributed random effect models. This term is computed in the same manner as in Section 3.2.1, with substituted by . When the likelihood is Gaussian, the latent function values are equal to and we arrive at from Eqn. (9).
4 Related Work and Discussions
Effectively combining probabilistic graphical models and deep learning approaches has been an active research area. Several approaches have been proposed for marrying RNN with SSMs through either one or both of the following: (i) extending the Gaussian emission to complex likelihood models; (ii) making the transition equation nonlinear via a multilayer perceptron (MLP) or interlacing SSM with transition matrices temporally specified by RNNs. Deep Markov Models (DMMs), proposed by
(Krishnan et al., 2017, 2015), keep the Gaussian transition dynamics with mean and covariance matrix parameterized by MLPs. Stochastic RNNs (SRNNs) (Fraccaro et al., 2016) explicitly incorporate the deterministic dynamics from RNNs that do not depend on latent variables by interlacing them with a SSM. Chung et al. (2015) first proposed Variational RNNs (VRNNs), which is another way to make the transition equation nonlinear, by cutting ties between the latent states and associating them with deterministic states. This makes the state transition nonlinearly determined by the RNN. VRNNs are also used in Latent LSTM Allocation (LLA) (Zaheer et al., 2017) and StateSpace LSTM (SSL) (Zheng et al., 2017). These models require expensive inference at the global level through a recognition network, with is in stark contrast with Deep Factors, where the structural assumption of the variational approximation decomposes the inference problem to local probabilistic inference that is easily parallelizable and global standard RNN learning (cf. Section 3.2).In Fraccaro et al. (2017) and the recent Deep State Models (Rangapuram et al., 2018), the linear Gaussian transition structure is kept intact, so that the highly efficient Kalman filter/smoother is readily applicable. Our model differs from the former in that we avoid sampling the latent states in the ELBO, and eliminate the variance coming from the MonteCarlo estimate of the second integral in Eqn. (10). Deep State is designed for a Gaussian likelihood with time varying SSM parameters per time series. In contrast, with time invariant local SSMs and flexible global effects, our model DFLDS offers a parsimonious representation that can handle nonGaussian observations.
Deep Gaussian Processes (Damianou & N.D., 2013) have attracted growing interests in recent years. Inspired by GPLVM structure (Lawrence, 2004), Deep GPs stacks GPs on top of latent variables, resulting in more expressive mappings. Our framework provides an alternative approach to utilize GPs more efficiently.
Due to its flexibility of interpolating between purely local and purely global models, there are a variety of common methods that can be subsumed in our proposed model framework. Deep Factor with one factor and no random effects, accompanied with autoregressive inputs, reduces to DeepAR
(Flunkert et al., 2017). One difference in our formulation is that the scale of each time series is automatically estimated rather than prespecified as it is in DeepAR. Changing the emission probability to Gaussian Mixtures results in ARMDN (Mukherjee et al., 2018). SequencetoSequence models for forecasting (Wen et al., 2017) are another family of models that are a part of our framework. These methods make predictions discriminatively rather than generatively. By dropping the random effects, using GPs as the prior and removing the restriction of RNNs on the global factors, we recover the semiparametric latent factor model (Teh et al., 2005). By dropping the global effects, we arrive at the standard SSM or GP. Our newly developed general variational inference algorithm applies to both of these methods and other normally distributed local stochastic models (cf. subsubsection 3.2.2). While we have built upon existing works, to the best of our knowledge, Deep Factors provide the first model framework that incorporate SSMs and GPs with DNNs in a systematic manner.5 Experiments
We conduct experiments with synthetic and realworld data to provide evidence for the practical effectiveness of our approach. We use a p3.8xlarge SageMaker instance in all our experiments. Our algorithms are implemented in MXNet Gluon (Chen et al., 2015), and make extensive use of its Linear Algebra library (Seeger et al., 2017; Zhenwen et al., 2018). Further experiments with GPs as the local model are detailed in (Maddix et al., 2018).
5.1 Model Understanding and Exploration
The first set of experiments compares our proposed structure in Eqn. (3) with no probabilistic component to the standard RNN structure on the electricity dataset. For each time series , we have its embedding and two time features , day of the week and hour of the day. Given an RNN Cell (LSTM, GRU, etc.), the RNN Forecaster predicts the time series values where . Deep Factor generates the point forecast as . The RNN cell for RNN Forecaster has an output dimension of 1 while that of DeepFactor is of dimension 10. The resulting number of parameters of both structures are roughly the same.
Figure 2 demonstrates that the proposed structure significantly improves the data efficiency while having less variance. The runtime of Deep Factor scales better than that of the standard RNN structure. By using the concatenation of , the standard RNN structure operates on the outer product space of and (of dimension ), while Deep Factor computes the intrinsic structure (of dimension 12 and separately).
Next, we focus on the DFLDS model, and investigate (i) the effect of the recognition network for nonGaussian observations with the purely local part () (variational LDS cf. Appendix A.2), and (ii) the recovery of the global factors in the presence of Gaussian and nonGaussian noise. We generate data according to the following model, which is adapted from Example 24.3 in (Barber, 2012)
. The twodimensional latent vector
is rotated at each iteration, and then projected to produce a scalar observation,where , , is the identity matrix, and
is the standard basis vector. The true observations are generated by a Poisson distribution,
This could be used to model seasonal products, where most of the sales happen when an item is in season, e.g. snowboards normally sell shortly before or during winters, and less so afterwards. Figure 3 shows the reconstructed intensity function , as well as corresponding forecasts for each choice of recognition network. Visual inspections reveal that RNNs are superior over MLPs as recognition networks. This is expected because the time series are sequential. We also test the ability of our algorithm to recover the underlying global factors. Our experiments show that even with the Poisson noise model, we are able to identify the true latent factors in the sense of distances of the subspaces spanned by the global factors (cf. Appendix B.1).ds  h  p50ql  p90ql  
DA  P  MR  DF  DA  P  MR  DF  
E  72  0.216 0.054  0.149  0.204 0.042  0.101 0.006  0.182 0.077  0.103  0.088 0.008  0.049 0.004 
24  0.272 0.078  0.124  0.185 0.042  0.112 0.012  0.100 0.013  0.091  0.083 0.008  0.059 0.013  
N  72  0.468 0.043  0.397  0.418 0.031  0.212 0.044  0.175 0.005  0.178  0.267 0.038  0.104 0.028 
24  0.390 0.042  0.328  0.358 0.029  0.239 0.037  0.167 0.005  0.168  0.248 0.088  0.139 0.035  
T  72  0.337 0.026  0.457  0.383 0.040  0.190 0.048  0.184 0.051  0.207  0.226 0.017  0.127 0.026 
24  0.296 0.021  0.380  0.331 0.011  0.225 0.050  0.149 0.011  0.191  0.154 0.020  0.159 0.059  
U  72  0.417 0.011  0.441  0.730 0.031  0.344 0.033  0.296 0.017  0.239  0.577 0.059  0.190 0.013 
24  0.353 0.009  0.468  0.879 0.156  0.425 0.063  0.238 0.009  0.239  0.489 0.069  0.238 0.026 
5.2 Empirical Studies
In this subsection, we test how our model performs on several realworld and publicly available datasets: electricity (E) and traffic (T) from the UCI data set (Dheeru & Karra Taniskidou, 2017; Yu et al., 2016), nyc taxi (N) (Taxi & Commission, 2015) and uber (U) (Flowers, 2015) (cf. Appendix B.2).
In the experiments, we choose the DFRNN (DF) model with a Gaussian likelihood. To assess the performance of our algorithm, we compare with DeepAR (DA), a stateofart RNNbased probabilistic forecasting algorithm on the publicly available AWS SageMaker (Januschowski et al., 2018), MQRNN (MR
), a sequence model that generates conditional predictive quantiles
(Wen et al., 2017), and Prophet (P), a Bayesian structural time series model (Taylor & Letham, 2017). The Deep Factor model has 10 global factors with a LSTM cell of 1layer and 50 hidden units. The noise LSTM has 1layer and 5 hidden units. For a fair comparison with DeepAR, we use a comparable number of model parameters, that is, an embedding size of 10 with 1layer and 50 hidden LSTM units. The student likelihood in DeepAR is chosen for its robust performance. The same model structure is chosen for MQRNN, and the decoder MLP has a single hidden layer of 20 units. We use the adam optimization method with the default parameters in Gluon to train the DFRNN and MQRNN. We use the default training parameters for DeepAR.We use the quantile loss to evaluate the probabilistic forecasts. For a given quantile , a target value and quantile prediction , the quantile loss is defined as
We use a normalized sum of quantile losses, to compute the quantile losses for a given span across all time series. We include results for which we abbreviate as the P50QL (mean absolute percentage error (MAPE)) and P90QL, respectively.
For all the datasets, we limit the training length to only one week of time series (168 observations per time series). This represents a relevant scenario that occurs frequently in demand forecasting, where products often have only limited historical sales data. We average the Deep Factor, MQRNN and DeepAR results over ten trials. We use one trial for Prophet, since classical methods are typically less variable than neuralnetwork based models. Figure 4 illustrates the performances of the different algorithms in terms of the MAPE (P50 quantile loss) for the 72 hour forecast horizon. Table 2 shows the full results, and that our model outperforms the others in terms of accuracy and variability in most of the cases. For DeepAR, using SageMaker’s HPO, our preliminary results (cf. Appendix B.2) show that with a larger model, it achieves a performance that is onpar with our method, which has much less parameters. In addition, the sequencetosequence structure of DeepAR and MQRNN limits their ability to react flexibly to changing forecasting scenarios, e.g. during ondemand forecasting, or interactive scenarios. For a forecasting scenario with a longer prediction horizon than during training horizon, DeepAR needs to be retrained to reflect the changes in the decoder length. Similarly, they cannot generate forecasts that are longer than the length of the training time series, for example, the case in Figure 2. Our method has no difficulty performing this task and has greater data efficiency.
6 Conclusion
We propose a novel globallocal framework for forecasting a collection of related time series, accompanied with a result that uniquely characterizes exchangeable time series. Our main contribution is a general, powerful and practically relevant modeling framework that scales, and obtains stateoftheart performance. Future work includes comparing variational dropout (Gal & Ghahramani, 2016) or Deep Ensemble (Lakshminarayanan et al., 2017) of nonprobabilistic DNN models (e.g., RNNForecaster (cf. 5.1)) for uncertainty.
References

Ahmadi et al. (2011)
Ahmadi, B., Kersting, K., and Sanner, S.
Multievidence lifted message passing, with application to pagerank
and the kalman filter.
In
TwentySecond International Joint Conference on Artificial Intelligence
, 2011.  Ahmed et al. (2012) Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., and Smola, A. J. Scalable inference in latent variable models. In Proceedings of the fifth ACM international conference on Web search and data mining, pp. 123–132. ACM, 2012.
 Araujo et al. (2019) Araujo, M., Ribeiro, P., Song, H. A., and Faloutsos, C. Tensorcast: forecasting and mining with coupled tensors. Knowledge and Information Systems, 59(3):497–522, 2019.
 Austin & Panchenko (2014) Austin, T. and Panchenko, D. A hierarchical version of the de finetti and aldoushoover representations. Probability Theory and Related Fields, 159(34):809–823, 2014.
 Barber (2012) Barber, D. Bayesian reasoning and machine learning. Cambridge University Press, 2012.
 Böse et al. (2017) Böse, J.H., Flunkert, V., Gasthaus, J., Januschowski, T., Lange, D., Salinas, D., Schelter, S., Seeger, M., and Wang, Y. Probabilistic demand forecasting at scale. Proceedings of the VLDB Endowment, 10(12):1694–1705, 2017.
 BrahimBelhouari & Bermak (2004) BrahimBelhouari, S. and Bermak, A. Gaussian process for nonstationary time series prediction. Computational Statistics & Data Analysis, 47(4):705–712, 2004.
 Chen et al. (2015) Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
 Choi et al. (2011) Choi, J., GuzmanRivera, A., and Amir, E. Lifted relational kalman filtering. In TwentySecond International Joint Conference on Artificial Intelligence, 2011.
 Chung et al. (2015) Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pp. 2980–2988, 2015.
 Crawley (2012) Crawley, M. J. Mixedeffects models. The R Book, Second Edition, pp. 681–714, 2012.
 Damianou & N.D. (2013) Damianou, A. and N.D., L. Deep gaussian processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 207–215, 2013.
 Dheeru & Karra Taniskidou (2017) Dheeru, D. and Karra Taniskidou, E. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2017. University of California, Irvine, School of Information and Computer Sciences.
 Diaconis (1977) Diaconis, P. Finite forms of de finetti’s theorem on exchangeability. Synthese, 36:271–281, 1977.
 Diaconis & Freedman (1980) Diaconis, P. and Freedman, D. Finite exchangeable sequences. The Annals of Probability, 8:745–764, 1980.
 Faloutsos et al. (2018) Faloutsos, C., Gasthaus, J., Januschowski, T., and Wang, Y. Forecasting big time series: old and new. Proceedings of the VLDB Endowment, 11(12):2102–2105, 2018.
 Flowers (2015) Flowers, A. Uber TLC foil response. https://github.com/fivethirtyeight/ubertlcfoilresponse/blob/master/ubertripdata/uberrawdataapr14.csv, 2015.
 Flunkert et al. (2017) Flunkert, V., Salinas, D., and Gasthaus, J. Deepar: Probabilistic forecasting with autoregressive recurrent networks. arXiv preprint arXiv:1704.04110, 2017.
 Forni et al. (2000) Forni, M., Hallin, M., Lippi, M., and Reichlin, L. The generalized dynamicfactor model: Identification and estimation. Review of Economics and statistics, 82(4):540–554, 2000.
 Fraccaro et al. (2016) Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. Sequential neural models with stochastic layers. In Advances in neural information processing systems, pp. 2199–2207, 2016.

Fraccaro et al. (2017)
Fraccaro, M., Kamronn, S., Paquet, U., and Winther, O.
A disentangled recognition and nonlinear dynamics model for unsupervised learning.
In Advances in Neural Information Processing Systems, pp. 3604–3613, 2017.  Gal & Ghahramani (2016) Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059, 2016.
 Gasthaus et al. (2019) Gasthaus, J., Benidis, K., Wang, Y., Rangapuram, S. S., Salinas, D., Flunkert, V., and Januschowski, T. Probabilistic forecasting with spline quantile function rnns. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1901–1910, 2019.
 Gelman et al. (2013) Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. Bayesian data analysis. CRC press, 2013.
 Geweke (1977) Geweke, J. The dynamic factor analysis of economic time series. Latent variables in socioeconomic models, 1977.
 Girard et al. (2003) Girard, A., Rasmussen, C. E., Candela, J. Q., and MurraySmith, R. Gaussian process priors with uncertain inputs application to multiplestep ahead time series forecasting. In Advances in neural information processing systems, pp. 545–552, 2003.
 Harvey (1990) Harvey, A. C. Forecasting, structural time series models and the Kalman filter. Cambridge university press, 1990.
 Hooi et al. (2019) Hooi, B., Shin, K., Liu, S., and Faloutsos, C. Smf: Driftaware matrix factorization with seasonal patterns. In Proceedings of the 2019 SIAM International Conference on Data Mining, pp. 621–629. SIAM, 2019.
 Hwang et al. (2016) Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In International Conference on Machine Learning, pp. 3030–3039, 2016.
 Hyndman et al. (2008) Hyndman, R., Koehler, A. B., Ord, J. K., and Snyder, R. D. Forecasting with exponential smoothing: the state space approach. Springer Science & Business Media, 2008.
 Januschowski et al. (2018) Januschowski, T., Arpin, D., Salinas, D., Flunkert, V., Gasthaus, J., Stella, L., and Vazquez, P. Now available in amazon sagemaker: Deepar algorithm for more accurate time series forecasting. https://aws.amazon.com/blogs/machinelearning/nowavailableinamazonsagemakerdeeparalgorithmformoreaccuratetimeseriesforecasting/, 2018.
 Jing & Smola (2017) Jing, H. and Smola, A. J. Neural survival recommender. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 515–524. ACM, 2017.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Autoencoding variational bayes. ICLR, 2014.
 Krishnan et al. (2015) Krishnan, R. G., Shalit, U., and Sontag, D. Deep kalman filters. arXiv preprint arXiv:1511.05121, 2015.
 Krishnan et al. (2017) Krishnan, R. G., Shalit, U., and Sontag, D. Structured inference networks for nonlinear state space models. In AAAI, pp. 2101–2109, 2017.
 Lakshminarayanan et al. (2017) Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413, 2017.

Lawrence (2004)
Lawrence, N. D.
Gaussian process latent variable models for visualisation of high dimensional data.
In Advances in neural information processing systems, pp. 329–336, 2004.  Low et al. (2011) Low, Y., Agarwal, D., and Smola, A. J. Multiple domain user personalization. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 123–131. ACM, 2011.
 Maddix et al. (2018) Maddix, D. C., Wang, Y., and Smola, A. Deep factors with gaussian processes for forecasting. NeurIPS workshop on Bayesian Deep Learning, 2018.
 Mukherjee et al. (2018) Mukherjee, S., Shankar, D., Ghosh, A., Tathawadekar, N., Kompalli, P., Sarawagi, S., and Chaudhury, K. Armdn: Associative and recurrent mixture density networks for eretail demand forecasting. arXiv preprint arXiv:1803.03800, 2018.
 Pan & Yao (2008) Pan, J. and Yao, Q. Modelling multiple time series via common factors. Biometrika, 95(2):365–379, 2008.
 Rangapuram et al. (2018) Rangapuram, S. S., Seeger, M., Gasthaus, J., Stella, L., Wang, Y., and Januschowski, T. Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems, 2018.
 Rasmussen & Williams (2006) Rasmussen, C. E. and Williams, C. K. Gaussian process for machine learning. MIT press, 2006.

Rezende et al. (2014)
Rezende, D. J., Mohamed, S., and Wierstra, D.
Stochastic backpropagation and approximate inference in deep generative models.
In International Conference on Machine Learning, pp. 1278–1286, 2014.  Seeger (2004) Seeger, M. Gaussian processes for machine learning. International journal of neural systems, 14(02):69–106, 2004.
 Seeger et al. (2017) Seeger, M., Hetzel, A., Dai, Z., Meissner, E., and Lawrence, N. D. Autodifferentiating linear algebra. arXiv preprint arXiv:1710.08717, 2017.
 Seeger et al. (2016) Seeger, M. W., Salinas, D., and Flunkert, V. Bayesian intermittent demand forecasting for large inventories. In Advances in Neural Information Processing Systems, pp. 4646–4654, 2016.
 Stock & Watson (2011) Stock, J. H. and Watson, M. Dynamic factor models. Oxford handbook on economic forecasting, 2011.
 Taxi & Commission (2015) Taxi, N. and Commission, L. TLC trip record data. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml, 2015.
 Taylor & Letham (2017) Taylor, S. J. and Letham, B. Forecasting at scale. The American Statistician, 2017.
 Teh et al. (2005) Teh, Y., Seeger, M., and Jordan, M. Semiparametric latent factor models. In AISTATS 2005Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, pp. 333–340, 2005.
 Wen et al. (2017) Wen, R., Torkkola, K., and Narayanaswamy, B. A multihorizon quantile recurrent forecaster. NIPS Workshop on Time Series, 2017.
 Xie et al. (2017) Xie, C., Tank, A., GreavesTunnell, A., and Fox, E. A unified framework for long range and cold start forecasting of seasonal profiles in time series. arXiv preprint arXiv:1710.08473, 2017.
 Xu et al. (2009) Xu, Z., Kersting, K., and Tresp, V. Multirelational learning with gaussian processes. In TwentyFirst International Joint Conference on Artificial Intelligence, 2009.
 Yu et al. (2016) Yu, H.F., Rao, N., and Dhillon, I. S. Temporal regularized matrix factorization for highdimensional time series prediction. In Advances in neural information processing systems, pp. 847–855, 2016.
 Zaheer et al. (2017) Zaheer, M., Ahmed, A., and Smola, A. J. Latent lstm allocation: Joint clustering and nonlinear dynamic modeling of sequence data. In International Conference on Machine Learning, pp. 3967–3976, 2017.
 Zheng et al. (2017) Zheng, X., Zaheer, M., Ahmed, A., Wang, Y., Xing, E. P., and Smola, A. J. State space lstm models with particle mcmc inference. arXiv preprint arXiv:1711.11179, 2017.
 Zhenwen et al. (2018) Zhenwen, D., Meissner, E., and Lawrence, N. D. Mxfusion: A modular deep probabilistic programming library. NIPS 2018 Workshop MLOSS, 2018.
Acknowledgments
We would like to thank the reviews, Valentin Flunkert, David Salinas, Michael Bohlkeschneider, Edo Liberty and Bing Xiang for their feedbacks.
Appendix A Deep Factor Models with Random Effects
In this section, we provide additional details on the models discussed in Section 3.1. In what follows, we overload the notation and use to denote the vector function consisting of the global factors.
name  description  local  likelihood (gaussian case) 
DFRNN  Zeromean Gaussian noise process given by rnn  
DFLDS  Statespace models  (cf. Eqn. (6))  given by Kalman Filter 
DFGP  Zeromean Gaussian Process  (cf. Eqn. (7)) 
a.1 DFRNN: Gaussian noise process as the local model
The local model in DFRNN is zeromean i.i.d. Gaussian, where the variance is generated by a noise RNN process. The generative model is given as:
Although the noise is i.i.d., there is correlated noise from the latent RNN process. For the nonGaussian likelihood, the latent innovation terms in the latent function requires inference. To do so, we use Algorithm 1. The variational lower bound in Eqn. (10) is straightforward to calculate, since the marginal likelihood is the loglikelihood of a multivariate Gaussian with mean and a covariance matrix with on the diagonal entries. DFRNN can be seen as a type of deep Gaussian latent variable model.
a.2 DFLDS: LDS as the local model
The generative model for DFLDS is given as:
We consider the leveltrend ISSM model with damping,
where control the damping effect, and the parameters and contain the innovation strength for the level and trend (slope), respectively. The initial latent state
is assumed to follow an isotropic Gaussian distribution. We choose a simple leveltrend ISSM because we expect the global factors with attention to explain the majority of the time series’ dynamics, such as trend and seasonality. This avoids having to hand design the latent states of the ISSM.
As a special case, we recover SSM with nonGaussian likelihood, and our inference algorithm gives a new approach to perform inference on such models, which we call Variational LDS.
Variational LDS: DFLDS with no global factors
Variational LDS is a statespace model (SSM) with an arbitrary likelihood as its emission function. In this subsection, we supply additional details of the inference of the variational LDS using the synthetic example in Section 5.1. The data is generated as,
Our observations are generated as:
resulting in the popular Poisson LDS.
Our goal is perform inference over the latent variables, , where and , given the observations . As in Section 3.2.2, we use a variational autoencoder (VAE) with structural approximation, and a variational posterior to approximate the true posterior as As in DFLDS, the first term is generated by the recognition network, and the second term chosen to match the conditional posterior. For the training procedure, we resort to Algorithm 1 with no global factors, i.e., is the sample generated by the recognition network.
There are different neural networks structures to choose from for the variational encoder . We choose the bidirectional LSTM since it considers information from both the past and the future. This is similar to the backwards message in the Kalman smoother. Similar preference can also be found in (Krishnan et al., 2015), although their model structure differs. The recognition network is only used in the training phase to better approximate the posterior distribution of the latent function values, and so there is no information leak for the desired forecasting case.
a.3 DFGP: Gaussian Process as the local model
With DFGP, the random effects is assumed to be drawn from a zeromean Gaussian process.
Simiarily with Variational LDS, with the proposed inference algorithm, we get a new approach for doing approximate inference for GP with nonGaussian likelihood. We leave the comparison with classical approaches (Rasmussen & Williams, 2006) such as Laplace approximation, Expectation Propagation, Variational Inference to future work.
Appendix B Detailed Experimental Analysis
In this appendix, we provide further results in both the synthetic and empirical cases.
b.1 Synthetic Experiments
The local model and observations are generated using the variationalLDS example in Appendix A.2. We want to test the ability of the algorithm to recover the underlying latent factors. To do so, we add global factors that are generated from Fourier series of different orders, to the synthetic dataset described in Section 5.1
. For each time series, the coefficients of the linear combination are sampled from a uniform distribution.
In the presence of Gaussian LDS noise ( observed), Figure B.1 shows that the algorithm does not precisely recover the global factors. The distance between the subspaces spanned by the true and estimated factors reveals that they are reasonably close. Even with Poisson observations in Figure B.2, the method is still able to roughly recover the factors.
b.2 Empirical Studies
Table B.1 summarizes the datasets that are used to test our Deep Factor models.
name  support  granularity  no. ts  comment 
electricity  hourly  370  electricity consumption of different customers  
traffic  hourly  963  occupancy rate of SF bay area freeways  
taxi  hourly  140  No. of taxi pick ups in different blocks of NYC  
uber  hourly  114  No. of Uber pick ups in different blocks of NYC 
We first compare our two methods, DFRNN and DFLDS with leveltrend ISSM, on the electricity dataset with forecast horizon 24. Table B.2 shows the averaged results over 50 trials. We see that both methods have similar accuracy with DFRNN being slightly preferrable. Due to its simplicity and computational advantages, we choose DFRNN as the Deep Factor model to compare against the other baseline methods. Regarding DFGP, although it shows a good performance in our perliminary results, still extra exploration is needed for which we left to future work.
method  P50QL  P90QL  RMSE 
DFRNN  0.144 0.019  0.101 0.044  888.285 278.798 
DFLDS  0.153 0.016  0.135 0.052  1012.551 240.118 
We provide the result of one trial for Prophet in Table 2 since its classical methods have less variability between experiments and are more computationally expensive. The acronyms DA, P, MR, DF in these tables stand for DeepAR, Prophet, MQRNN and DeepFactor, respectively. Figure B.5 illustrates example forecasts from our model.
In our experiments, we do not tune the hyperparameters of the methods. With tuning and a much bigger model, DeepAR would achieve higher accuracy in our preliminary experiments. For example, with the SageMaker HPO, we achieve the DeepAR’s best performance of 0.118 MAPE and 0.042 P90 Loss on the electricity
dataset for horizon 72. The hyperparameters are selected to be a twolayer LSTM each with 40 units, a learning rate of 5e3, 2000 epochs and an early stopping criteria of 200 epochs. As we can see, this metric is on par with our results in Table
2 and DF achieves such accuracy with much less number of parameters.ds  h  RMSE  
DA  P  MR  DF  
E  72  1194.421 295.603  902.724  1416.992 510.705  770.437 68.151 
24  2100.927 535.733  783.598  1250.233 509.159  786.667 144.459  
N  72  108.963 15.905  97.706  106.588 9.988  53.523 13.096 
24  98.545 13.561  91.573  102.721 9.640  63.059 11.876  
T  72  0.027 0.001  0.032  0.029 0.003  0.019 0.002 
24  0.025 0.001  0.028  0.028 0.001  0.021 0.003  
U  72  9.218 0.486  7.488  14.373 0.997  6.393 1.185 
24  6.043 0.295  6.948  16.585 3.452  8.044 2.307 
Comments
There are no comments yet.