The Automatic Bayesian Covariance Discovery (ABCD) system which is so-called automatic statistician system, is proposed (Lloyd et al., 2014) with the aim of automating the process of statistical modeling. It focuses on regression problems. Specifically, it takes time series as input, searches and then produces a learned Gaussian process model which has interpretable properties (smoothness, periodicity, changepoints) and is summarized with a report. With the same purpose, (Hwang, Tong, and Choi, 2016) proposed a relational approach to handle multiple data. However, such kinds of modeling usually require a body of work and efforts to make build a new model. One of the advantages of probabilistic programming which helps in this situation is the ease of creating generative models with several lines of code (Kulkarni et al., 2015). In order to facilitate the process of building new ABCD-based models, we propose a method generating Stan probabilistic programming from ABCD results. Time series are stored in the compact representation of encoded ABCD probabilistic programmings which potentially allow construct a complex model from heterogeneous models.
This paper is organized as follows. We briefly introduce the background of Gaussian Processes (GP) , the ABCD system and its relational version for multiple data. Then, we present our main contribution on how a probabilistic programming is generated from time series data; Next is the experiment section; Finally, we conclude our work.
2.1 Automatic Statistician System
Automatic Bayesian Covariance Discovery (ABCD) explains data without requiring expert input for regression problems. Here the approach is to use Gaussian processes to model regression functions.
Let us take a brief overview about Gaussian Processes (GPs). GPs are distributions over functions such that any finite set of function evaluations,
form a multivariate Gaussian distribution(Rasmussen and Williams, 2006). It is specified by a mean function and a covariance kernel function
(f(x), f(x’)). Evaluations of the two functions on a finite set of points correspond to the mean vector and the covariance matrix for the multivariate Gaussian distribution, likeand . When a function or evaluations of the function are drawn from a *gp specified by its mean function and covariance kernel function ,
Covariance kernel function plays a crucial role in a GP, as it conveys our assumptions about the function. Duvenaud et al. (2013) proposed a compositional kernel learning method. It constructs and find richer kernel which are composed of server base kernels and operation. In theory, any positive definite kernels are closed under addition and multiplication.
Five base kernels are used for making compositional kernels. Each kernel encodes different characteristics of functions, which further enables the generalization of structure and inference given new data. Base Kernels Encoding Function White Noise (WN) Uncorrelated noise Constant (C) Constant functions Linear (LIN) Linear functions Squared Exponential (SE) Smooth functions Periodic (PER) Periodic functions
The first operation is addition which sums multiple kernel functions and makes a new kernel function.
ABCD assumes zero mean for *gp, since marginalizing over an unknown mean function can be equivalently expressed as a zero-mean *gp with a new kernel (Lloyd et al., 2014). Under this assumption, the following multiplication operation is also applicable.
The third operation is the change-point (CP) operation. Given two kernel functions, and , the new kernel function is represented as follows:
is a sigmoidal function which lies between 0 and 1, andis the change-point. The change-point operation divides function domain (i.e., time) into two sides and applies different kernel function on each side.
Finally, the change-window (CW) operation applies the CP operation twice with two different change points and . Given two sigmoidal functions and where , the new function will be , which applies the function to the window . A composite kernel expression after the change-window operation will be as follows:
ABCD searches a composite kernel based on the search grammar. The search grammar specifies how to develop the current kernel expression by applying the operations with the base kernels. The following rules are examples of typical search grammar:
where represents any kernel subexpression, and are base kernels.
Given data and a maximum search depth, the algorithm gives a compositional kernel . Starting from the WN
kernel, the algorithm expands the kernel expression based on the search grammar, optimizes hyperparameters for the expanded kernels, evaluates those kernels given the data and selects the best one among them. This procedure repeats. The next iterative procedure starts with the best composite kernel selected in the previous iteration. The conjugate gradient method is used when optimizing hyperparameters. Bayesian Information Criterion (BIC) is used for the model evaluation. The BIC of modelwith number of free parameters and data with number of data points is:
The iteration continues until the specified maximum search depth is reached. During the iteration, the algorithm keeps the best model for the output.
2.2 Relational Automatic Statistic System
Hwang, Tong, and Choi (2016) has developed a relational version to deal with multiple sequences of data. The assumption is that sequences are related to each other. Relational Automatic Bayesian Covariance Discovery (Relational ABCD) finds a shared structure across multiple sequences.
Relational ABCD considers two methods: Relational Kernel Learning, and Semi-Relational Kernel Learning (see Figure 1). The former is that the sequences share the same kernel structure which represents high-level properties like periodic, trends but differ by additive scale factors and multiplicative scale factors . The latter relaxes the assumption among sequences by considering two parts of structure including a shared structure and an individualized structure . Comparing to ABCD qualitatively and quantitatively, it showed improvements in term of capturing general information across time-series as well as extrapolation performance.
2.3 Probabilistic Programming
which enables users to write a Bayesian inference model. Stan development team provides friendly Stan’s APIs for several languages (R, Python, Matlab), allowing access to many type of users. The fundamental concept used in Stan is the No U-Turn (NUTS) sampler(Homan and Gelman, 2014) which builds a tree of possible samples by randomly simulating Hamiltonian dynamics both forwards and backwards in time until the combined trajectory turns back on itself.
A sample Stan code is showed in Figure 2
which considers estimating the chance of success parameter for a Bernoulli distribution based on a sequence of observer binary outcomes. Here the assumption is thatbinary data y,..., y[N] is independent and identically distributed, with success probability theta. The model makes the prior assumption beta(1,1) on theta. Then, the data is fitted in the loop for (n in 1:N)
The followings are concentrated on how a Stan program is structured and executed.
Data block Data block declares the data to fit the model. In the Figure 2, the data block declares an integer value N which is the number of observed data. The array y has size N, containing information of success outcomes. It is able to set constraints of data with an upper bound upper and a lower bound lower.
Transformed data block A transformed data block is used to define a new auxiliary variables which is computed from data, containing mediate information. Figure 2 does not include the transformed data block.
Parameter block In Figure 2, the program has only one parameter theta defined in the parameter block. We also can set constraints on parameters.
Transformed parameter block The transformed parameter block defines transforms of parameters for a model. It is optional, and does not appear in Figure 2.
Model block The model block defines the log probability function on the parameter space. In the program, by providing the prior on theta, the likelihood is evaluated.
Generated quantities block The (optional) generated quantities allows values that depend on parameters and data. It may be used to calculate predictive inferences. It can carry out forward simulation for predictive posterior checks.
3 Automatic Generation of Probabilistic Programming
Generating probabilistic programming from ABCD and/or relational ABCD is a crucial component in the system we aim to build (see Figure 3). Prior to this component, one can choose either ABCD or relational ABCD based on whether the preference is for a single time series or for global information in multiple time series. Both ABCD and relational ABCD play as producers which output compositional kernels from data (Step 1 and 2). Step 4 and 5 are an example application. We will discuss what is inside Step 3 in this section.
From now on, we use the notation StanABCD to indicate the Stan probabilistic programs generated automatically from ABCD.
3.1 Base kernels
An ABCD’s result is represented by a compositional structure which is a sum of products of base kernels. This summation is the outcome of simplifications: the multiplication of two SE kernel produces another SE with different parameter values. The product of WN and any stationary kernel including C, PER, WN, SE results a new WN kernel. Multiplying C with any kernel does not change the kernel but changes the scale parameter of that kernel (Lloyd et al., 2014). Hence, let be a set of all possible kernel expressions, written as
where is the sigmoid function, is in
For example, a learned compositional kernel is described as
Here, the square exponential kernel and linear kernel are written respectively as
The kernel (2) can be understood linguistically as ’a smooth function with linearly (LIN) increasing amplitude’.
Another real-world example is a chosen currency exchange data set (see Figure 5). The data set contains exchange value of Indonesian Rupiad from 2015-06-28 to 2015-12-30 acquired from Yahoo Finance (Yahoo Inc., 2016). Carried experiments on this data set, relational ABCD found the best compositional kernel which is shortly written as
This kernel is well-explained for several currency exchanges sharing common financial behaviors. A qualitative result shows that the changewindow (CW) kernel occurs around mid September 2015 which reflects big financial events (FED’s announcement about policy changes in interest rates, China’s foreign exchange reserves falls) (Hwang, Tong, and Choi, 2016) . We take this compostional kernel as a typical example for demonstration purpose.
Given a data set and a learned kernel, we are interested in encoding them into a Stan program. In order to do that, we first prepare built-in base kernels in Stan version. A base kernel is written as a Stan function which takes data, and hyperparameters as input and returns a matrix. The matrix has elements reflecting how similar (correlated) the data points in data set are. Each base kernel have a specific number of hyperparameters itself. For the SE kernel case, it has two hyperparameters: a scale factor , and a lengthscale ; then we build a Stan code as showed in Figure 4 (The detail implementations of other kernels are in the Appendix).
Next, we will discuss how a StanABCD organize. StanABCDs share common conventional blocks (as in previous section) but the compositional kernel. We briefly describe what is required in each blocks.
3.1.1 Data block
In general, StanABCD contains training data points and test data points . Abiding by the Stan convention, StanABCD declares data as following:
Here, we provide the information of training data through N1 (number of training data points) and vectors x1, y1. Similarly, test data is specified by the number of test data points N2 and a vector x2.
For example, we analyze Indonesian Rupiah exchange data with the period from July 2015 to December 2015 as shown in Figure 5. This data set consists of 132 data points in which we take the first 120 data points as training data, and the next 12 data points as test data. We have N1 = 120, x1 be the days in training data set, y1 be the exchange value, N2 = 12, and x2 be the days in test data set. We want to predict the exchange value on x2.
3.1.2 Parameter block and transformed parameters block
These blocks consist of all necessary parameters to construct the model. With the purpose of sampling data on test points, StanABCD has the parameter block containing a parameter z as an array with length N2 equal to the number of test data points. z
follows unit normal distribution to increase the sampling performance which is discussed later in the transformed parameters block. If we useStanABCD not only as a sampler, the parameter block should be customized by adding more parameters for our desired models.
What the transformed parameters block does is to make computation for the posterior distribution of GPs. Based on GP prior, the joint distribution of training outputand test output is represented as
Assume we already know the structure of compositional kernel from ABCD. From this compositional kernel structure, is a matrix evaluated at all pairs of training and test points, where is the number of training data points, is the number of test data points. Analogously, we compute , , and . Note that . Now, using the conditioning Gaussians, it follows that
The posterior distribution is analytically tractable. This makes easier to represent on a Stan programming because it supports most of distributions. Here, we only need to specify a mean and a covariance matrix , in order to declare a multivariate normal distribution in the next blocks.
Taking a consideration about the efficiency of implementation, the Cholesky decomposition (which is available as a built-in function in Stan) of is pre-computed. Let us denote the Cholesky decompositon of be . Stan only perform its sampling method to produce a unit normal distribution
Then, we transform into to obtain our target distribution
Here is the sample code for this block:
The above variables Sigma, Omega, K correspond respectively to the terms , .
We still leave the question how to build a compositional kernel from base kernels. Basically, the compositional kernel is the sum of product of base kernels. To represent the compositional kernel in Stan, the summation is used as a matrix sum operation and the multiplication between base kernels is replaced by the Hadamard product. Here is an example
3.1.3 Model block
Stan allows us to quickly design a Bayesian hierarchical model. Utilizing the mean and covariance computed in the previous block helps us declare a normal distribution which plays a role as the first level in the multiple levels of hierarchical model. However, we set this aside and only illustrate the case that we sample the posterior distribution on test data points. From (5), we declare a Gaussian distribution to serve the sampling purpose in the generated quantities block.
3.1.4 Generated quantities block
For purpose of generating sample extrapolation value, the normal distribution declared in the model block will be called one time.
In order to get the sample of test output y2, the above Stan
code performs the linear transformation on sample values generated fromz () as we explained in the transformed parameters block.
4.0.1 Data set
Beside the Indonesian Rupiah exchange data mentioned in previous section, we select a airline data set to perform experiments on. The data set describes monthly international airline passenger numbers for the period between January 1949 and December 1960 (George E. P. Box (2013)). The number of passengers was periodic with a typical period 1 year. The total number of passengers per year increased monotonically. ABCD captures this information well, and explain the data set by a compositional kernel: LIN + SE PER LIN + SE + WN LIN.
4.0.2 Sampling data
We provide a complete sample Stan code in the appendix. We want to get extrapolation sample values on test points (from January 1961) of airline data set. During the experiment, we use Python to retrieve data set then pass to Stan compiler through PyStan (Stan Development Team, 2016). Figure 6 and 7 show that StanABCD provides similar results as ABCD in the view of extrapolation performance. Our generating method guarantees a reliable way to perform one-to-one mapping from ABCD result into a probabilistic programming.
We propose a beautiful blend between the automatic statistician and probabilistic programming. As the result, it opens a broad direction to explore on encoded Stan programs because of their potentiality to make further inference or perform statistical relational learning. On the other hand, StanABCD provides a promising way to accelerate the learning kernel in ABCD framework which requires an exhaustive search procedure. A database of StanABCDs is one of the possible solutions.
This work is supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) grant funded by the Ministry of Science, ICT & Future Planning (MSIP) (NRF- 2014R1A1A1002662) and the NRF grant funded by the MSIP (NRF-2014M2A8A2074096).
- Carpenter (2015) Carpenter, B. 2015. Stan: A probabilistic programming language. Journal of Statistical Software.
- De Raedt, Kimmig, and Toivonen (2007) De Raedt, L.; Kimmig, A.; and Toivonen, H. 2007. Problog: A probabilistic prolog and its application in link discovery. In Proceedings of the 20th International Joint Conference on Artifical Intelligence, IJCAI’07, 2468–2473.
- Duvenaud et al. (2013) Duvenaud, D. K.; Lloyd, J. R.; Grosse, R. B.; Tenenbaum, J. B.; and Ghahramani, Z. 2013. Structure discovery in nonparametric regression through compositional kernel search. In Proceedings of the 30th International Conference on Machine Learning (ICML), 1166–1174.
- Freer, Roy, and Tenenbaum (2012) Freer, C. E.; Roy, D. M.; and Tenenbaum, J. B. 2012. Towards common-sense reasoning via conditional simulation: legacies of Turing in Artificial Intelligence. In Downey, R., ed., Turing’s Legacy, ASL Lecture Notes in Logic. Cambridge University Press.
- George E. P. Box (2013) George E. P. Box, Gwilym M. Jenkins, G. C. 2013. Time Series Analysis, Forecasting and Control.
- Goodman et al. (2012) Goodman, N. D.; Mansinghka, V. K.; Roy, D. M.; Bonawitz, K.; and Tenenbaum, J. B. 2012. Church: a language for generative models. CoRR abs/1206.3255.
- Homan and Gelman (2014) Homan, M. D., and Gelman, A. 2014. The no-u-turn sampler: Adaptively setting path lengths in hamiltonian monte carlo. J. Mach. Learn. Res. 15(1):1593–1623.
- Hwang, Tong, and Choi (2016) Hwang, Y.; Tong, A.; and Choi, J. 2016. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016.
- Kulkarni et al. (2015) Kulkarni, T. D.; Kohli, P.; Tenenbaum, J. B.; and Mansinghka, V. 2015. Picture: A probabilistic programming language for scene perception. In .
- Lake, Salakhutdinov, and Tenenbaum (2015) Lake, B. M.; Salakhutdinov, R.; and Tenenbaum, J. B. 2015. Human-level concept learning through probabilistic program induction. Science 350(6266):1332–1338.
- Lebeltel et al. (2004) Lebeltel, O.; Bessière, P.; Diard, J.; and Mazer, E. 2004. Bayesian robot programming. Advanced Robotics 16(1):49–79.
- Lloyd et al. (2014) Lloyd, J. R.; Duvenaud, D. K.; Grosse, R. B.; Tenenbaum, J. B.; and Ghahramani, Z. 2014. Automatic construction and natural-language description of nonparametric regression models. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI), 1242–1250.
Lunn et al. (2000)
Lunn, D. J.; Thomas, A.; Best, N.; and Spiegelhalter, D.
– A Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing 10(4):325–337.
- Rasmussen and Williams (2006) Rasmussen, C. E., and Williams, C. K. I. 2006. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
- Stan Development Team (2016) Stan Development Team. 2016. Pystan: the python interface to stan. http://mc-stan.org.
- Yahoo Inc. (2016) Yahoo Inc. 2016. Yahoo finance - business finance, stock market, quotes, news. http://finance.yahoo.com/. Accessed: 2016-02-05.
.1 Base kernels
.1.1 White noise kernel
A white noise kernel is written as
We implement this kernel in
.1.2 Constant kernel
A constant kernel is written as
.1.3 Linear kernel
A linear kernel is written as
.1.4 Squared exponential kernel
A squared exponential is defined as follows:
.1.5 Periodic kernel
A periodic kernel is defined as
where is the modified Bessel function of the first kind of order zero
.1.6 Changepoint operator
A changepoint operator on kernels and is defined as