A Sparse and Adaptive Prior for Time-Dependent Model Parameters

10/09/2013 ∙ by Dani Yogatama, et al. ∙ Carnegie Mellon University

We consider the scenario where the parameters of a probabilistic model are expected to vary over time. We construct a novel prior distribution that promotes sparsity and adapts the strength of correlation between parameters at successive timesteps, based on the data. We derive approximate variational inference procedures for learning and prediction with this prior. We test the approach on two tasks: forecasting financial quantities from relevant text, and modeling language contingent on time-varying financial measurements.




1 Introduction

When learning from streams of data to make predictions in the future, how should we handle the timestamp associated with each instance? Ignoring timestamps and assuming data are i.i.d. is scalable but risks distracting a model with irrelevant “ancient history.” On the other hand, using only the most recent portion of the data risks overfitting to current trends and missing important time-insensitive effects. In this paper, we seek a general approach to learning model parameters that are overall sparse, but that adapt to variation in how different effects change over time.

Our approach is a prior over parameters of an exponential family (e.g., coefficients in linear or logistic regression). We assume that parameter values shift at each timestep, with correlation between adjacent timesteps captured using a multivariate normal distribution whose precision matrix is restricted to a tridiagonal structure. We (approximately) marginalize the (co)variance parameters of this normal distribution using a Jeffreys prior, resulting in a model that allows smooth variation over time while encouraging overall sparsity in the parameters. (The parameters themselves are not given a fully Bayesian treatment.)

We demonstrate the usefulness of our model on two tasks, showing gains over alternative approaches. The first is a text regression problem in which an economic variable (volatility of returns) is forecast from financial reports (Kogan et al., 2009). The second forecasts text by constructing a language model that conditions on highly time-dependent economic variables.

Notation is given in §2. Our prior distribution is presented in §3. We draw connections to related work in §4. §5 presents our inference algorithm and §6 our experimental results.

2 Notation

We assume data of the form $\{(x_n, y_n)\}_{n=1}^{N}$, where each $x_n$ includes a timestamp denoted $t(x_n) \in \{1, \ldots, T\}$. (In this work we assume timestamps are discretized.) The aim is to learn a predictor that maps input $x_{N+1}$, assumed to be at timestep $T$, to output $y_{N+1}$. In the probabilistic setting we adopt here, the prediction is MAP inference over r.v. $Y$ given $X = x$ and a model parameterized by $\beta \in \mathbb{R}^{I}$. Learning is parameter estimation to solve:

$$\hat{\beta} = \arg\max_{\beta} \; \log p(\beta) + \sum_{n=1}^{N} \log p(y_n \mid x_n, \beta) \qquad (1)$$

The focus of the paper is on the prior distribution $p(\beta)$. Throughout, we will denote the task-specific log-likelihood (second term) by $L(\beta)$ and assume a generalized linear model such that a feature vector function $f$ maps inputs $x$ into $\mathbb{R}^{I}$ and $\beta^{\top} f(x)$ is “linked” to the distribution over $Y$ using, e.g., a logit or identity. We will refer to elements of $f$ as “features” and to $\beta$ as “coefficients.” We assume discrete timesteps.

3 Time-Series Prior

Our time-series prior draws inspiration from the probabilistic interpretation of the sparsity-inducing lasso (Tibshirani, 1996) and group lasso (Yuan & Lin, 2007). In non-overlapping group lasso, features are divided into groups, and the coefficients within each group $m$ are drawn according to:

  1. Variance $\sigma_m^2 \sim$ an exponential distribution. (The exponential distribution can be replaced by the (improper) Jeffreys prior, although then the familiar Laplace distribution interpretation no longer holds (Figueiredo, 2002).)

  2. $\beta_m \sim \mathrm{Normal}(0, \sigma_m^2 I)$.

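We seek a prior that lets each coefficient vary smoothly over time. A high-level intuition of our prior is that we create copies of $\beta$, one at each timestep: $\langle \beta^{(1)}, \beta^{(2)}, \ldots, \beta^{(T)} \rangle$. For each feature $i$, let the sequence $\langle \beta_i^{(1)}, \beta_i^{(2)}, \ldots, \beta_i^{(T)} \rangle$ form a group, denoted $\beta_i$. Group lasso does not view coefficients in a group as explicitly correlated; they are independent given the variance parameter. Given the sequential structure of $\beta_i$, we replace the covariance matrix $\sigma^2 I$ to capture autocorrelation. Specifically, we assume the vector $\beta_i$ is drawn from a multivariate normal distribution with mean zero and a $T \times T$ precision matrix $\Lambda$ with the following tridiagonal form (we suppress the subscript $i$ for this discussion; each feature $i$ has its own $\Lambda_i$):

$$\Lambda = \frac{1}{\lambda} A = \frac{1}{\lambda} \begin{bmatrix} 1 & \alpha & 0 & 0 & \cdots \\ \alpha & 1 & \alpha & 0 & \cdots \\ 0 & \alpha & 1 & \alpha & \cdots \\ 0 & 0 & \alpha & 1 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix} \qquad (2)$$

$\lambda \geq 0$ is a scalar multiplier whose role is to control sparsity in the coefficients, while $\alpha$ dictates the degree of correlation between coefficients in adjacent timesteps (autocorrelation). Importantly, $\alpha$ and $\lambda$ (and hence $A$ and $\Lambda$) are allowed to be different for each group $i$.

We need to ensure that $A$ is positive definite. Fortunately, it is easy to show that for $\alpha \in (-0.5, 0.5)$, the resulting $A$ is positive definite.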

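Proof sketch.

Since $A$ is a symmetric matrix, it suffices to verify that each of its leading principal minors is strictly positive (Sylvester's criterion). The leading principal submatrices of $A$ are themselves uniform tridiagonal symmetric matrices, and the determinant of a uniform tridiagonal $N \times N$ matrix can be written as $\prod_{n=1}^{N} \left\{ 1 + 2\alpha \cos\left( \frac{n\pi}{N+1} \right) \right\}$ (see, e.g., Volpi (2003) for the proof). Since $\cos(\cdot) \in [-1, 1]$, if $\alpha \in (-0.5, 0.5)$, every factor is positive and so is the determinant. Therefore, $A$ is always p.d. for $\alpha \in (-0.5, 0.5)$. ∎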
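The positive-definiteness argument can be checked numerically. The sketch below is illustrative only: it assumes the matrix has ones on the diagonal and a single value `alpha` on the two adjacent diagonals, and the function names are hypothetical. It compares the cosine product formula with the standard three-term determinant recurrence for tridiagonal matrices and tests the leading principal minors.

```python
import math

def tridiag_det(n, alpha):
    """Determinant of the n x n matrix with 1 on the diagonal and alpha on the
    adjacent diagonals, via the recurrence D_k = D_{k-1} - alpha^2 * D_{k-2}."""
    d_prev, d_curr = 1.0, 1.0  # D_0 = 1 and D_1 = 1
    for _ in range(2, n + 1):
        d_prev, d_curr = d_curr, d_curr - alpha * alpha * d_prev
    return d_curr

def product_formula(n, alpha):
    """Closed form: product over k = 1..n of (1 + 2*alpha*cos(k*pi/(n+1)))."""
    return math.prod(1.0 + 2.0 * alpha * math.cos(k * math.pi / (n + 1))
                     for k in range(1, n + 1))

def leading_minors_positive(T, alpha):
    """Sylvester's criterion: every leading principal minor is > 0."""
    return all(tridiag_det(n, alpha) > 0.0 for n in range(1, T + 1))
```

For `alpha` inside (-0.5, 0.5) the check succeeds for any size tried here, while a value such as 0.6 already fails once the matrix is large enough.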

3.1 Generative Model

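Our generative model for the group of coefficients $\beta_i$ is given by:

  1. $\lambda_i \sim$ an improper Jeffreys prior ($p(\lambda) \propto \lambda^{-1}$).

  2. $\alpha_i \sim$ a truncated exponential prior with parameter $\tau$. This distribution forces $\alpha_i$ to fall in $(-C, 0]$, so that $A_i$ is p.d. and autocorrelations are always positive:

$$p(\alpha \mid \tau) = \frac{\tau \exp(-\tau(\alpha + C))}{1 - \exp(-\tau C)} \, \mathbb{1}\{-C < \alpha \leq 0\} \qquad (3)$$

    We fix $C = \frac{1}{2} - 10^{-5}$.

  3. $\beta_i \sim \mathrm{Normal}(0, \Lambda_i^{-1})$, with the precision matrix $\Lambda_i$ as defined in Eq. 2.

During estimation of $\beta$, each $\lambda_i$ and $\alpha_i$ are marginalized, giving a sparse and adaptive estimate for $\beta$.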
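To make the generative story concrete, here is a minimal sampling sketch. It assumes the truncated exponential has density proportional to exp(-tau * (alpha + C)) on (-C, 0] and that the precision matrix is A / lam with A tridiagonal (ones on the diagonal, `alpha` next to it). Because the Jeffreys prior is improper and cannot be sampled, `lam` is supplied by the caller. All function names are hypothetical.

```python
import math
import random

C = 0.5 - 1e-5  # truncation constant stated in the text

def sample_alpha(tau, rng):
    """Inverse-CDF draw from the truncated exponential on (-C, 0]."""
    u = 1.0 - rng.random()  # uniform on (0, 1]
    shifted = -math.log(1.0 - u * (1.0 - math.exp(-tau * C))) / tau
    return min(0.0, shifted - C)  # clamp guards against rounding above 0

def sample_group(T, alpha, lam, rng):
    """Draw beta_i ~ Normal(0, lam * A^{-1}) in O(T) time.

    A = L L^T with L lower bidiagonal (diagonal d, subdiagonal e); solving
    L^T x = z for standard normal z gives x with covariance A^{-1}."""
    d, e = [1.0], []
    for _ in range(1, T):
        e.append(alpha / d[-1])
        d.append(math.sqrt(1.0 - e[-1] ** 2))
    z = [rng.gauss(0.0, 1.0) for _ in range(T)]
    x = [0.0] * T
    x[T - 1] = z[T - 1] / d[T - 1]
    for t in range(T - 2, -1, -1):  # back substitution
        x[t] = (z[t] - e[t] * x[t + 1]) / d[t]
    return [math.sqrt(lam) * v for v in x]
```

A call such as `sample_group(10, sample_alpha(2.0, rng), 0.5, rng)` with `rng = random.Random(0)` returns one coefficient trajectory of length 10; values of `alpha` closer to -C give smoother trajectories.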

3.2 Scalability

Our design choice of the precision matrix $\Lambda_i$ is driven by scalability concerns. Instead of using, e.g., a random draw from a Wishart distribution, we specify the precision matrix to have a tridiagonal structure. This induces dependencies between coefficients in adjacent timesteps (first-order dependencies) and allows the prior to scale to fine-grained timesteps more efficiently. Let $N$ denote the number of training instances, $I$ the number of base features, and $T$ the number of timesteps. In a single pass of our variational algorithm (discussed in §5), the cost of the prior grows linearly in $T$ for each base feature, in both runtime and space, rather than quadratically as it would if each $\Lambda_i$ were drawn from a Wishart distribution. This can make a big difference for applications with large numbers of features ($I$). Additionally, we choose the off-diagonal entries to be uniform, so we only need one $\alpha_i$ for each base feature. This design choice restricts the expressive power of the prior but still permits flexibility in adapting to trends for different coefficients, as we will see. The prior encourages sparsity at the group level, essentially performing feature selection: some feature coefficients $\beta_i$ may be driven to zero across all timesteps, while others will be allowed to vary over time, with an expectation of smooth changes.

Note that this model introduces only one hyperparameter, $\tau$, since we marginalize $\alpha$ and $\lambda$.
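The scalability argument rests on the tridiagonal structure: both the quadratic form and the log-determinant of the precision matrix can be evaluated in time linear in the number of timesteps, without ever forming a dense matrix. The sketch below illustrates this for one group of coefficients; it assumes the precision matrix is A / lam with ones on the diagonal of A and `alpha` on the adjacent diagonals, and the function names are hypothetical.

```python
import math

def quad_form(beta, alpha):
    """beta^T A beta in O(T), touching only diagonal and adjacent entries."""
    diag = sum(b * b for b in beta)
    off = sum(beta[t] * beta[t + 1] for t in range(len(beta) - 1))
    return diag + 2.0 * alpha * off

def log_det_A(T, alpha):
    """log det A via the cosine product formula for uniform tridiagonal matrices."""
    return sum(math.log(1.0 + 2.0 * alpha * math.cos(t * math.pi / (T + 1)))
               for t in range(1, T + 1))

def log_prior_group(beta, alpha, lam):
    """log Normal(beta; 0, (A / lam)^{-1}) for one feature's coefficients."""
    T = len(beta)
    return 0.5 * (log_det_A(T, alpha)
                  - T * math.log(lam)
                  - T * math.log(2.0 * math.pi)
                  - quad_form(beta, alpha) / lam)
```

Storing one `alpha` and one `lam` per feature, plus the coefficients themselves, keeps memory proportional to the number of coefficients.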

4 Related Work

Our model is related to autoregressive integrated moving average approaches to time-series data (Box et al., 2008), but we never have access to direct observations of the time-series. Instead, we observe data ($x$ and $y$) assumed to have been sampled using time-series-generated variables as coefficients ($\beta$). During learning, we therefore use probabilistic inference to reason about the variables at all timesteps together. In §5, we describe a scalable variational inference algorithm for inferring coefficients at all timesteps, enabling prediction of future data and inspection of trends.

We follow Yogatama et al. (2011) in creating time-specific copies of the base coefficients, so that $\beta = \langle \beta^{(1)}, \ldots, \beta^{(T)} \rangle$. As a prior over $\beta$, they used a multivariate Gaussian imposing non-zero covariance between each $\beta_i^{(t)}$ and its time-adjacent copies $\beta_i^{(t-1)}$ and $\beta_i^{(t+1)}$. The strength of that covariance was set for each base feature by a global hyperparameter, which was tuned on held-out development data along with the global variance hyperparameter. Yogatama et al.’s model can be obtained from ours by fixing the same $\lambda$ and $\alpha$ for all features $i$. Our approach differs in that (i) we marginalize the hyperparameters, (ii) we allow each coefficient its own autocorrelation, and (iii) we encourage sparsity.

There are many related Bayesian approaches for time-varying model parameters (Belmonte et al., 2012; Nakajima & West, 2012; Caron et al., 2012), as well as work on time-varying signal estimation (Angelosante & Giannakis, 2009; Angelosante et al., 2009; Charles & Rozell, 2012). Each provides a different probabilistic interpretation of parameter generation. Our model has a distinctive generative story in that correlations between parameters of successive timesteps are encoded in a precision matrix. Additionally, unlike these fully Bayesian approaches that infer full posterior distributions, we only obtain posterior mode estimates of coefficients, which has computational advantages at prediction time (e.g., straightforward MAP inference and sparsity) and aids the interpretability of $\beta$.

As noted, our grouping together of each feature’s instantiations at all timesteps, and seeking sparsity, bears clear similarity to group lasso (Yuan & Lin, 2007), which encourages whole groups of coefficients to collectively go to zero. A probabilistic interpretation for lasso as a two level exponential-normal distribution that generalizes to (non-overlapping) group lasso was introduced by Figueiredo (2002). He also showed that the exponential distribution prior can be replaced with an improper Jeffreys prior for a parameter-free model, a step we follow as well.

Our model is also related to the fused lasso (Tibshirani et al., 2005), which penalizes a loss function by the $\ell_1$-norm of the coefficients and their differences. Our prior has a clearer probabilistic interpretation and adapts the degree of autocorrelation for each coefficient, based on the data.

Zhang & Yeung (2010) proposed a regularization method using a matrix-variate normal distribution prior to model task relationships in multitask learning. If we consider timesteps as tasks, the technique resembles our regularizer. Their method jointly optimizes the covariance matrix with the feature coefficients; we choose a Bayesian treatment and encode our prior belief in the (inverse) covariance matrix, while still allowing the learned feature coefficients to modify the matrix through posterior inference. As a result, our method allows different base features to have different matrices.

5 Learning and Inference

We marginalize $\lambda$ and $\alpha$ and obtain a maximum a posteriori estimate for $\beta$, which includes a coefficient for each base feature $i$ at each timestep $t$. Specifically, we seek to maximize:

$$L(\beta) + \sum_{i=1}^{I} \log \int \! \int p(\beta_i \mid \alpha_i, \lambda_i) \, p(\alpha_i \mid \tau) \, p(\lambda_i) \, d\alpha_i \, d\lambda_i$$

Exact inference in this model is intractable. We use mean-field variational inference to derive a lower bound on the above objective. We then apply a standard optimization technique to jointly optimize the variational parameters and the coefficients $\beta$.

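We introduce fully factored variational distributions for each $\lambda_i$ and $\alpha_i$. For $\lambda_i$, we use a Gamma distribution with parameters $\langle a_i, b_i \rangle$ as our variational distribution:

$$q_i(\lambda_i \mid a_i, b_i) = \frac{\lambda_i^{a_i - 1} \exp(-\lambda_i / b_i)}{b_i^{a_i} \, \Gamma(a_i)}$$

Therefore, we have $E_q[\lambda_i] = a_i b_i$, $E_q[\lambda_i^{-1}] = ((a_i - 1) b_i)^{-1}$, and $E_q[\log \lambda_i] = \Psi(a_i) + \log b_i$ ($\Psi$ is the digamma function).

For $\alpha_i$, we choose the form of our variational distribution to be the same truncated exponential distribution as its prior, with parameter $\kappa_i$, denoting this distribution $q_i(\alpha_i \mid \kappa_i)$. We have

$$E_q[\alpha_i] = \frac{1}{\kappa_i} - \frac{C}{1 - \exp(-\kappa_i C)}$$

We let $q$ denote the set of all variational distributions over $\lambda$ and $\alpha$.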
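The expectations needed under the variational distributions have closed forms. The sketch below is a sanity check under two stated assumptions: the Gamma distribution is parameterized by shape `a` and scale `b` (which requires `a > 1` for the inverse moment), and the truncated exponential has density proportional to exp(-kappa * (alpha + C)) on (-C, 0]. The digamma routine is a small stand-in written for this example, not a library call, and all names are hypothetical.

```python
import math
import random

C = 0.5 - 1e-5  # truncation constant stated in the text

def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def gamma_expectations(a, b):
    """(E[lam], E[1/lam], E[log lam]) for Gamma(shape=a, scale=b), a > 1."""
    return a * b, 1.0 / ((a - 1.0) * b), digamma(a) + math.log(b)

def expected_alpha(kappa, c=C):
    """Mean of the truncated exponential on (-c, 0] with parameter kappa."""
    return 1.0 / kappa - c / (1.0 - math.exp(-kappa * c))

def monte_carlo_gamma(a, b, n, seed=0):
    """Empirical versions of the three Gamma expectations, for comparison."""
    rng = random.Random(seed)
    draws = [rng.gammavariate(a, b) for _ in range(n)]
    return (sum(draws) / n,
            sum(1.0 / v for v in draws) / n,
            sum(math.log(v) for v in draws) / n)
```

Comparing `gamma_expectations(5.0, 0.5)` with `monte_carlo_gamma(5.0, 0.5, 100000)` gives a quick check that the parameterization in use matches the formulas.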

The variational bound $B$ that we seek to maximize is given in Figure 1. Our learning algorithm involves optimizing $B$ with respect to variational parameters $a$, $b$, and $\kappa$, and the coefficients $\beta$. We employ the L-BFGS quasi-Newton method (Liu & Nocedal, 1989), for which we need to compute the gradient of $B$. We turn next to each part of this gradient.

Figure 1: The variational bound on Equation 1 that is maximized to learn . The boxed expression is further bounded by using Jensen’s inequality, giving a new lower bound we denote by .

5.1 Coefficients

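For $\beta$, the first derivative of the bound $B$ with respect to time-specific coefficient $\beta_i^{(t)}$ (for an interior timestep $1 < t < T$; at the two boundaries the missing neighbor is dropped) is:

$$\frac{\partial B}{\partial \beta_i^{(t)}} = \frac{\partial L}{\partial \beta_i^{(t)}} - E_q[\lambda_i^{-1}] \left( \beta_i^{(t)} + E_q[\alpha_i] \left( \beta_i^{(t-1)} + \beta_i^{(t+1)} \right) \right)$$

We can interpret the first derivative as including a penalty scaled by $E_q[\lambda_i^{-1}]$. We rewrite this penalty as:

$$(1 + 2 E_q[\alpha_i]) \, \beta_i^{(t)} - E_q[\alpha_i] \left( \beta_i^{(t)} - \beta_i^{(t-1)} \right) - E_q[\alpha_i] \left( \beta_i^{(t)} - \beta_i^{(t+1)} \right)$$

This form makes it clear that the penalty depends on $\beta_i^{(t-1)}$ and $\beta_i^{(t+1)}$, penalizing the difference between $\beta_i^{(t)}$ and these time-adjacent coefficients proportional to $-E_q[\alpha_i] \geq 0$.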
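The two ways of writing the penalty can be checked against each other numerically. The sketch below assumes the penalty on a coefficient is the corresponding entry of the expected precision matrix applied to the coefficient vector, i.e. E[1/lam] times (beta_t + E[alpha] * (beta_{t-1} + beta_{t+1})); the function names and the example numbers are hypothetical.

```python
def prior_penalty(beta, t, e_alpha, e_inv_lam):
    """Penalty on beta[t]: E[1/lam] * (A beta)[t], with A built from E[alpha].
    A missing neighbor at either boundary contributes nothing."""
    left = beta[t - 1] if t > 0 else 0.0
    right = beta[t + 1] if t + 1 < len(beta) else 0.0
    return e_inv_lam * (beta[t] + e_alpha * (left + right))

def prior_penalty_rewritten(beta, t, e_alpha, e_inv_lam):
    """Same quantity for interior t, written with explicit differences
    between beta[t] and its two time-adjacent neighbors."""
    b, left, right = beta[t], beta[t - 1], beta[t + 1]
    return e_inv_lam * ((1.0 + 2.0 * e_alpha) * b
                        - e_alpha * (b - left)
                        - e_alpha * (b - right))
```

With a negative `e_alpha`, the difference terms carry a positive weight, so a coefficient that departs from its neighbors is pulled back toward them, while the leading term shrinks the coefficient itself toward zero.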

The form bears strong similarity to the first derivative of the time-series (log-)prior introduced in Yogatama et al. (2011), which depends on fixed, global hyperparameters analogous to our $\lambda$ and $\alpha$. Because our approach does not require us to specify scalars playing the roles of “$\lambda$” and “$\alpha$” in advance, it is possible for each feature to have its own autocorrelation. Obtaining the same effect in their model would require careful tuning of one pair of hyperparameters per feature, which is not practical.

It also has some similarities to the fused lasso penalty (Tibshirani et al., 2005), which is intended to encourage sparsity in the differences between feature coefficients across timesteps. Our prior, on the other hand, encourages smoothness in the differences, with additional sparsity at the feature level.

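5.2 Variational Parameters for $\lambda$ and $\alpha$

Recall that the variational distribution for $\lambda_i$ is a Gamma distribution with parameters $a_i$ and $b_i$.

Precision matrix scalar $\lambda$.

The first derivative for variational parameters $a_i$ and $b_i$ is easy to compute:

$$\frac{\partial B}{\partial a_i} = 1 - \left( \frac{T}{2} + a_i \right) \Psi_1(a_i) + \frac{\beta_i^{\top} E_q[A_i] \, \beta_i}{2 b_i (a_i - 1)^2}, \qquad \frac{\partial B}{\partial b_i} = -\frac{T}{2 b_i} + \frac{\beta_i^{\top} E_q[A_i] \, \beta_i}{2 (a_i - 1) b_i^2}$$

where $\Psi_1$ is the trigamma function. We can solve for $b_i$ in closed form given the other free variables:

$$b_i = \frac{\beta_i^{\top} E_q[A_i] \, \beta_i}{T (a_i - 1)}$$

We therefore treat $b_i$ as a function of $a_i$, $\kappa_i$, and $\beta_i$ in optimization.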

Off-diagonal entries $\alpha$.

First, notice that using Jensen’s inequality: due to the fact that is a convex function. Furthermore, for a uniform symmetric tridiagonal matrix like , the log determinant can be computed in closed form as follows (Volpi, 2003):

We therefore maximize a lower bound on , making use of the above to calculate first derivatives with respect to :

The partial derivatives , , and are easy to compute. We omit them for space.
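As an illustration of why these derivatives are cheap, the closed-form log-determinant of a uniform tridiagonal matrix is a sum of T logarithms, and differentiating it with respect to the off-diagonal value is just as simple. The sketch below assumes ones on the diagonal and a single off-diagonal value `alpha`; it does not show how this term is chained through the variational parameter, and the function names are hypothetical.

```python
import math

def log_det_A(T, alpha):
    """log det A = sum over t = 1..T of log(1 + 2*alpha*cos(t*pi/(T+1)))."""
    return sum(math.log(1.0 + 2.0 * alpha * math.cos(t * math.pi / (T + 1)))
               for t in range(1, T + 1))

def dlog_det_A_dalpha(T, alpha):
    """Derivative of log det A with respect to alpha, term by term."""
    total = 0.0
    for t in range(1, T + 1):
        c = math.cos(t * math.pi / (T + 1))
        total += 2.0 * c / (1.0 + 2.0 * alpha * c)
    return total
```

A central finite difference of `log_det_A` is a convenient way to confirm the analytic derivative before handing gradients to an optimizer such as L-BFGS.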

5.3 Implementation Details

A well-known property of numerical optimizers like the one we use (L-BFGS; Liu & Nocedal (1989)) is the failure to reach optimal values exactly at zero. Although theoretically strongly sparse, our prior only produces weak sparsity in practice. Future work might consider a more principled proximal-gradient algorithm to obtain strong sparsity (Bach et al., 2011; Liu & Ye, 2010; Duchi & Singer, 2009).

If we expect feature coefficients at specific timesteps to be sparse as well, it is straightforward to incorporate additional terms in the objective function that encode this prior belief (analogous to an extension from group lasso to sparse group lasso). For the tasks we consider in our experiments, we found that it does not substantially improve the overall performance. Therefore, we keep the simpler bound given in Figure 1.

6 Experiments

We report two sets of experiments, one with a continuous $y$, the other a language modeling application where $y$ is text. Each timestep in our experiments is one year.

6.1 Baselines

On both tasks, we compare our approach to a range of baselines. Since this is a forecasting task, at each test year, we only used training examples that come from earlier years. Our baselines vary in how they use this earlier data and in how they regularize.

  • ridge-one: ridge regression (Hoerl & Kennard, 1970), trained on only examples from the year prior to the test data (e.g., for the 2002 task, train on examples from 2001)

  • ridge-all: ridge regression trained on the full set of past examples (e.g., for the 2002 task, train on examples from 1996–2001)

  • ridge-ts: the non-adaptive time-series ridge model of Yogatama et al. (2011)

  • lasso-one: lasso regression (Tibshirani, 1996), trained on only examples from the year prior to the test data. (Brendan O’Connor (personal communication) has established the superiority of the lasso to the support vector regression method of Kogan et al. (2009) on this dataset; lasso is a strong baseline for this problem.)

  • lasso-all: lasso regression trained on the full set of past examples

In all cases, we tuned hyperparameters on development data. Note that, of the above baselines, only ridge-ts replicates the coefficients at different timesteps (i.e., $I \times T$ parameters); the others have only time-insensitive coefficients.

The model with our prior always uses all training examples that are available up to the test year (this is equivalent to a sliding window of size infinity). Like ridge-ts, our model trusts more recent data more, allowing coefficients farther in the past to drift farther away from those most relevant for prediction at time $T$. Our model, however, adapts the “drift” of each coefficient separately rather than setting a global hyperparameter.

6.2 Forecasting Risk from Text

In the first experiment, we apply our prior to a forecasting task. We consider the task of predicting volatility of stock returns from financial reports of publicly-traded companies, similar to Kogan et al. (2009).

year        # examples  ridge-one  ridge-all  ridge-ts  lasso-one  lasso-all  our model
2002 (dev)       2,845      0.182      0.176     0.171      0.165      0.156      0.158
2003             3,611      0.185      0.173     0.171      0.164      0.176      0.164
2004             3,558      0.125      0.137     0.129      0.116      0.119      0.113
2005             3,474      0.135      0.133     0.136      0.124      0.124      0.122
overall         13,488      0.155      0.154     0.151      0.141      0.143      0.139
Table 1: MSE on the 10-K dataset (various test sets). The first test year (2002) was used as our development data. Our model uses the sparse adaptive prior described in §3. The overall differences between our model and all competing models are statistically significant (Wilcoxon signed-rank test, ).

In finance, volatility refers to a measure of variation in a quantity over time; for stock returns, it is measured using the standard deviation during a fixed period (here, one year). Volatility is used as a measure of financial risk. Consider a linear regression model for predicting the log volatility of a stock from a set of features $f$ (see §6.2.1 for a complete description of our features). (Similar to Kogan et al. (2009), and as is standard practice in finance, we perform a log transformation, since log volatilities are typically close to normally distributed.) We can interpret a linear regression model probabilistically as drawing $y$ from a normal distribution with $\beta^{\top} f(x)$ as the mean of the normal. Therefore, in this experiment, up to additive and multiplicative constants: $L(\beta) = -\sum_{n=1}^{N} \left( y_n - {\beta^{(t(x_n))}}^{\top} f(x_n) \right)^2$.

We apply the time-series prior to the feature coefficients $\beta$. When making a prediction for the test data, we use $\beta^{(T)}$, the set of feature coefficients for the last timestep in the training data.

6.2.1 Dataset

We used a collection of Securities and Exchange Commission-mandated annual reports from 10,492 publicly traded companies in the U.S. There are 27,159 reports over a period of ten years from 1996–2005 in the corpus. These reports are known as “Form 10-K.” (See Kogan et al. (2009) for a complete description of the dataset; it is available at http://www.ark.cs.cmu.edu/10K.)

For the feature set, we downcased and tokenized the texts and selected the 101st–10,101st most frequent words as binary features. The feature set was kept the same across experiments for all models. It is widely known in the financial community that the past history of volatility of stock returns is a good indicator of the future volatility. Therefore, we also included the log volatility of the stocks twelve months prior to the report as a feature. Our response variable $y$ is the log volatility of stock returns over a period of twelve months after the report is published.

6.2.2 Results

The first test year (i.e., 2002) was used as our development data for tuning the hyperparameter $\tau$. We initialized all the feature coefficients with the coefficients obtained by training a lasso regression on the last year of the training data (lasso-one). Table 1 provides a summary of experimental results. We report the results in mean squared error on the test set: $\frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2$, where $y_n$ is the true response for instance $n$ and $\hat{y}_n$ is the predicted response.

Our model consistently outperformed ridge variants, including the one with a time-series penalty (Yogatama et al., 2011). It also outperformed the lasso variants without any time-series penalty, on average and in three out of four test sets apiece.

One of the major challenges in working with time-series data is to choose the right window size, in which the data is still relevant to current predictions. Our model automates this process with a Bayesian treatment of the strength of each feature coefficient’s autocorrelation. The results indicate that our model was able to learn when to trust a longer history of training data, and when to trust a shorter history of training data, demonstrating the adaptiveness of our prior. Figure 2 shows the distribution of the expected values of the autocorrelation parameters under the variational distributions for 10,002 features, learned by our model from the last run (test year 2005).

In future work, an empirical Bayesian treatment of the hyperprior $p(\alpha \mid \tau)$, fitting it to improve the variational bound, might lead to further improvements.

Figure 2: The distribution of expected values of the autocorrelation parameters under the variational distributions for 10,002 features used in our experiments (10,000 unigram features, the previous year log volatility feature, and a bias feature).

6.3 Text Modeling in Context

In the second experiment, we consider a hard task of modeling a collection of texts over time conditioned on economic measurements. The goal is to predict the probability of words appearing in a document, based on the “state of the world” at the time the document was authored. Given a set of macroeconomic variables in the U.S. (e.g., unemployment rate, inflation rate, average housing prices, etc.), we want to predict what kind of texts will be produced at a specific timestep. These documents can be written by either the government or publicly-traded companies directly or indirectly affected by the current economic situation.

6.3.1 Model

Our text model is a sparse additive generative model (SAGE; Eisenstein et al. (2011)). In SAGE, there is a background lexical distribution that is perturbed additively in the log-space. When the effects are due to a (sole) feature $f_j$, the probability of a word $w$ is:

$$p(w \mid m, \beta_j, x) = \frac{\exp\left( m_w + \beta_{j,w} f_j(x) \right)}{\sum_{v \in V} \exp\left( m_v + \beta_{j,v} f_j(x) \right)}$$

where $V$ is the vocabulary, $m$ (always observed) is the vector of background log-frequencies of words in the corpus, $f_j$ (observed) is the feature derived from the context $x$, and $\beta_j$ is the feature-specific deviation.

Notice that the formulation above is easily extended to multiple effects with coefficients $\beta$. In our experiment, we have 117 effects (features), each with its own $\beta_j$. The first 50 correspond to U.S. states, plus an additional feature for the entire U.S., and they are observed for each text since each text is associated with a known set of states (discussed below). We assume that texts that are generated in different states have distinct characteristics; for each state, we have a binary indicator feature. The other 66 features depend on observed macroeconomic variables at each timestep (e.g., unemployment rate, inflation rate, house price index, etc.). Given an economic state of the world, we hypothesize that there are certain words that are more likely to be used, and each economic variable has its own (sparse) deviation from the background word frequencies. The generative story for a word at timestep $t$ associated with (observed) features $f$ is:

  • Given observed real-world variables $x$, draw word $w$ from a multinomial distribution $p(w \mid m^{(t)}, \beta^{(t)}, x)$.

Our $L(\beta)$ is simply the negative log-loss function commonly used in multiclass logistic regression: $L(\beta) = \sum_{n} \log p(w_n \mid m^{(t(x_n))}, \beta^{(t(x_n))}, x_n)$. We apply our time-series prior from §3 to $\beta$. The background vector $m^{(t)}$ is fixed to be the log frequencies of words at timestep $t$. For a single feature, coefficients over time for different classes (words) are assumed to be generated from the same prior.
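The word distribution with several additive effects can be sketched as a softmax over background log-frequencies plus feature-weighted deviations. This is an illustrative sketch, not the authors' implementation: it assumes one deviation vector per feature for a single timestep, uses a log-sum-exp for numerical stability, and all names are hypothetical.

```python
import math

def word_log_probs(m, beta, features):
    """log p(w) for every word w in the vocabulary.

    m        : background log-frequencies, one entry per word
    beta     : beta[j][w] is the deviation of feature j for word w
    features : observed feature values f_j(x), one per feature
    The unnormalized score of w is m[w] + sum_j features[j] * beta[j][w]."""
    scores = [m_w + sum(f_j * beta_j[w] for f_j, beta_j in zip(features, beta))
              for w, m_w in enumerate(m)]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_z for s in scores]

def negative_log_likelihood(word_ids, m, beta, features):
    """Negative log-likelihood of a sequence of word ids under the model."""
    log_p = word_log_probs(m, beta, features)
    return -sum(log_p[w] for w in word_ids)
```

When every deviation is zero the model reduces to the background distribution, which is what makes sparse deviations interpretable: a nonzero entry marks a word whose frequency moves with that feature.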

6.3.2 Dataset

There is a great deal of text that is produced to describe current macroeconomic events. We conjecture that the connection between the economy and the text will have temporal dependencies (e.g., the amount of discussion about housing or oil prices might vary over time). We use three sources of text commentary on the economy. The first is a subset of the 10-K reports we used in §6.2. We selected the 10-K reports of 200 companies chosen randomly from the top quintile of size (measured by beginning-of-sample market capitalization). This gives us a manageable sample of the largest U.S. companies. Each report is associated with the state in which the company’s head office is located. Our next two data sources come from the Federal Reserve System, the primary body responsible for monetary policy in the U.S. (For an overview of the Federal Reserve System, see the Federal Reserve’s “Purpose and Functions” document at http://www.federalreserve.gov/pf/pf.htm.) The Federal Open Market Committee (FOMC) meets roughly eight times per year to discuss economic conditions and set monetary policy. Prior to each meeting, each of the twelve regional banks writes an informal “anecdotal” description of economic activity in its region as well as a national summary. This “Beige Book” is akin to a blog of economic activity released prior to each meeting. Each FOMC meeting also produces a transcript of the discussion. For our experiments here, we focus on text from 1996–2006. (All the text is freely available at http://www.federalreserve.gov. The Beige Book is released to the public prior to each meeting. The transcripts are released five years after the meetings.) As a result, we have 2,075 documents in the final corpus, consisting of 842 documents of the 10-K reports, 89 documents of the FOMC meeting transcripts, and 1,144 documents of the Beige Book summaries.

We use the 501st–5,501st most frequent words in the dataset. We associated the FOMC meeting transcripts with all states. The “Beige Book” texts were produced by the Federal Reserve Banks. There are twelve Federal Reserve Banks in the United States, each serving a collection of states. We associated texts from a Federal Reserve Bank with the states that it serves.

year        # tokens  ridge-one  ridge-all  ridge-ts  lasso-one  lasso-all  our model
2003 (dev)       1.1      2,736      2,765     2,735      2,736      2,765      2,735
2004             1.5      2,975      3,004     2,975      2,975      3,004      2,974
2005             1.9      2,999      3,027     2,997      2,998      3,027      2,997
2006             2.3      2,916      2,922     2,913      2,912      2,922      2,912
overall          6.8     11,626     11,718    11,619     11,620     11,718     11,618
Table 2: Negative log-likelihood of the documents on various test sets (lower is better). The first test year (2003) was used as our development data. Our model uses the sparse adaptive prior in §3.

Quantitative U.S. macroeconomic data was obtained from the Federal Reserve Bank of St. Louis data repository (“FRED”). We used standard measures of economic activity focusing on output (GDP), employment, and specific markets (e.g., housing). (For growing output series, like GDP, we calculate growth rates as log differences.) We use equity market returns for the U.S. market as a whole and various industry and characteristic portfolios. (Returns are monthly, excess of the risk-free rate, and continuously compounded. The data are from CRSP and are available for these portfolios at http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html.) They are used as features $f$ in our model; in addition to state indicator variables, there are 66 macroeconomic variables in total.

We compare our model to the baselines in §6.1. The lasso variants are analogous to the original formulation of SAGE (Eisenstein et al., 2011), except that our model directly conditions on macroeconomic variables instead of a Dirichlet-multinomial compound.

6.3.3 Results

We score models by computing the negative log-likelihood on the test dataset (out-of-vocabulary items are ignored). We initialized all the feature coefficients with the coefficients obtained by training a lasso regression on the last year of the training data (lasso-one). The first test year (i.e., 2003) was used as our development data for tuning the hyperparameter $\tau$. Table 2 shows the results for the six models we compared. Similar to the forecasting experiments, at each test year, we trained only on documents from earlier years. When we collapsed all the training data and ignored the temporal dimension (ridge-all and lasso-all), the background log-frequencies were computed using the entire training data, which differ from the background log-frequencies for only the last timestep of the training data. Our model matched or outperformed all ridge and lasso variants, including the one with a time-series penalty (Yogatama et al., 2011), on every test year and achieved the lowest overall negative log-likelihood on unseen data.

In addition to improving predictive accuracy, the prior also allows us to discover trends in the feature coefficients and gain insight. We manually examined the model from the last run (test year 2006). Examples of temporal trends learned by our model are shown in Figure 3. The plot illustrates feature coefficients for words that contain the string employ. For comparison, we also included the U.S. unemployment rate in percent (which was used as one of the features $f$), scaled to fit into the plot. We can see that there is a correlation between feature coefficients for the word unemployment and the actual unemployment rate. On the other hand, the correlations are less evident for other words.

Figure 3: Temporal trends learned by our model for the words that contain employ in our dataset, as well as the actual unemployment rate (scaled for ease of reading). The $y$-axis denotes coefficients and the $x$-axis is years. See the text for explanation.

7 Conclusions

We presented a time-series prior for the parameters of probabilistic models; it produces sparse models and adapts the strength of temporal effects on each coefficient separately, based on the data, without an explosion in the number of hyperparameters. We showed how to do inference under this prior using variational approximations. We evaluated the prior for the task of forecasting volatility of stock returns from financial reports, and demonstrated that it outperforms other competing models. We also evaluated the prior for the task of modeling a collection of texts over time, i.e., predicting the probability of words given some observed real-world variables. We showed that the prior achieved state-of-the-art results as well.


The authors thank several anonymous reviewers for helpful feedback on earlier drafts of this paper. This research was supported in part by a Google research award to the second and third authors. This research was supported in part by the Intelligence Advanced Research Projects Activity via Department of Interior National Business Center contract number D12PC00347. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.


  • Angelosante & Giannakis (2009) Angelosante, Daniele and Giannakis, Georgios B. RLS-weighted lasso for adaptive estimation of sparse signals. In Proc. of ICASSP, 2009.
  • Angelosante et al. (2009) Angelosante, Daniele, Giannakis, Georgios B., and Grossi, Emanuele. Compressed sensing of time-varying signals. In Proc. of ICDSP, 2009.
  • Bach et al. (2011) Bach, Francis, Jenatton, Rodolphe, Mairal, Julien, and Obozinski, Guillaume. Convex optimization with sparsity inducing norms. MIT Press, 2011.
  • Belmonte et al. (2012) Belmonte, Miguel A. G., Koop, Gary, and Korobilis, Dimitris. Hierarchical shrinkage in time-varying parameter models, 2012. Working paper.
  • Box et al. (2008) Box, George E. P., Jenkins, Gwilym M., and Reinsel, Gregory C. Time Series Analysis: Forecasting and Control. Wiley Series in Probability and Statistics, 2008.
  • Caron et al. (2012) Caron, François, Bornn, Luke, and Doucet, Arnaud. Sparsity-promoting Bayesian dynamic linear models, 2012. arXiv 1203.0106.
  • Charles & Rozell (2012) Charles, Adam S. and Rozell, Christopher J. Re-weighted dynamic filtering for time-varying sparse signal estimation, 2012. arXiv 1208.0325.
  • Duchi & Singer (2009) Duchi, John and Singer, Yoram. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873–2908, 2009.
  • Eisenstein et al. (2011) Eisenstein, Jacob, Ahmed, Amr, and Xing, Eric P. Sparse additive generative models of text. In Proc. of ICML, 2011.
  • Figueiredo (2002) Figueiredo, Mario A. T. Adaptive sparseness using Jeffreys’ prior. In Proc. of NIPS, 2002.
  • Hoerl & Kennard (1970) Hoerl, Arthur E. and Kennard, Robert W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
  • Kogan et al. (2009) Kogan, Shimon, Levin, Dimitry, Routledge, Bryan R., Sagi, Jacob S., and Smith, Noah A. Predicting risk from financial reports with regression. In Proc. of NAACL, 2009.
  • Liu & Nocedal (1989) Liu, Dong C. and Nocedal, Jorge. On the limited memory BFGS method for large scale optimization. Mathematical Programming B, 45(3):503–528, 1989.
  • Liu & Ye (2010) Liu, Jun and Ye, Jieping. Moreau-Yosida regularization for grouped tree structure learning. In Proc. of NIPS, 2010.
  • Nakajima & West (2012) Nakajima, Jouchi and West, Mike. Bayesian analysis of latent threshold dynamic models. Journal of Business and Economic Statistics, 2012.
  • Tibshirani (1996) Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of Royal Statistical Society B, 58(1):267–288, 1996.
  • Tibshirani et al. (2005) Tibshirani, Robert, Saunders, Michael, Rosset, Saharon, Zhu, Ji, and Knight, Keith. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society B, 67(1):91–108, 2005.
  • Volpi (2003) Volpi, Leonardo. Eigenvalues and eigenvectors of tridiagonal uniform matrices. 2003.
  • Yogatama et al. (2011) Yogatama, Dani, Heilman, Michael, O’Connor, Brendan, Dyer, Chris, Routledge, Bryan R., and Smith, Noah A. Predicting a scientific community’s response to an article. In Proc. of EMNLP, 2011.
  • Yuan & Lin (2007) Yuan, Ming and Lin, Yi. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2007.
  • Zhang & Yeung (2010) Zhang, Yu and Yeung, Dit-Yan. A convex formulation for learning task relationships in multi-task learning. In Proc. of UAI, 2010.