1 Introduction
When learning from streams of data to make predictions in the future, how should we handle the timestamp associated with each instance? Ignoring timestamps and assuming data are i.i.d. is scalable but risks distracting a model with irrelevant “ancient history.” On the other hand, using only the most recent portion of the data risks overfitting to current trends and missing important time-insensitive effects. In this paper, we seek a general approach to learning model parameters that are overall sparse, but that adapt to variation in how different effects change over time.
Our approach is a prior over parameters of an exponential family (e.g., coefficients in linear or logistic regression). We assume that parameter values shift at each timestep, with correlation between adjacent timesteps captured using a multivariate normal distribution whose precision matrix is restricted to a tridiagonal structure. We (approximately) marginalize the (co)variance parameters of this normal distribution using a Jeffreys prior, resulting in a model that allows smooth variation over time while encouraging overall sparsity in the parameters. (The parameters themselves are not given a fully Bayesian treatment.)
We demonstrate the usefulness of our model on two tasks, showing gains over alternative approaches. The first is a text regression problem in which an economic variable (volatility of returns) is forecast from financial reports (Kogan et al., 2009). The second forecasts text by constructing a language model that conditions on highly time-dependent economic variables.
2 Notation
We assume data of the form D = {(x_i, y_i)}_{i=1}^N, where each instance includes a timestamp denoted t_i.^1 The aim is to learn a predictor that maps input x, assumed to be at timestep t, to output y. In the probabilistic setting we adopt here, the prediction is MAP inference over the r.v. Y given x and a model parameterized by coefficients β. Learning is parameter estimation to solve:

^1 In this work we assume timestamps are discretized.
(1)  β̂ = argmax_β  log p(β) + Σ_{i=1}^{N} log p(y_i | x_i; β)
The focus of the paper is on the prior distribution p(β). Throughout, we will denote the task-specific log-likelihood (the second term above) by L(β) and assume a generalized linear model such that a feature vector function f maps inputs x into R^J, with β^⊤f(x) “linked” to the distribution over y using, e.g., a logit or identity link. We will refer to elements of f(x) as “features” and to elements of β as “coefficients.” We assume discrete timesteps.

3 Time-Series Prior
Our time-series prior draws inspiration from the probabilistic interpretation of the sparsity-inducing lasso (Tibshirani, 1996) and group lasso (Yuan & Lin, 2007). In non-overlapping group lasso, features are divided into groups, and the coefficients within each group g are drawn according to:

β_g | σ_g² ~ Normal(0, σ_g² I).

Variance σ_g² is shared within the group and is itself drawn from an exponential distribution;^2 marginalizing it yields the group-lasso penalty.

^2 The exponential distribution can be replaced by the (improper) Jeffreys prior, although then the familiar Laplace distribution interpretation no longer holds (Figueiredo, 2002).
We seek a prior that lets each coefficient vary smoothly over time. A high-level intuition of our prior is that we create copies of the coefficients, one at each timestep: β_1, …, β_T. For each feature j, let the sequence ⟨β_{j,1}, …, β_{j,T}⟩ form a group, denoted β_j. Group lasso does not view coefficients in a group as explicitly correlated; they are independent given the variance parameter. Given the sequential structure of β_j, we replace the covariance matrix to capture autocorrelation. Specifically, we assume the vector β_j is drawn from a multivariate normal distribution with mean zero and a precision matrix Λ with the following tridiagonal form:^3

^3 We suppress the subscript j for this discussion; each feature has its own Λ_j.
(2)  Λ = λ ·
         ⎡  1   −α    0   ⋯    0 ⎤
         ⎢ −α    1   −α   ⋯    0 ⎥
         ⎢  0   −α    1   ⋯    0 ⎥
         ⎢  ⋮              ⋱  −α ⎥
         ⎣  0    0    ⋯   −α   1 ⎦
Here λ is a scalar multiplier whose role is to control sparsity in the coefficients, while α dictates the degree of correlation between coefficients in adjacent timesteps (autocorrelation). Importantly, λ and α (and hence Λ) are allowed to be different for each group j; we write λ_j and α_j when the distinction matters.
We need to ensure that Λ is positive definite. Fortunately, it is easy to show that for α ∈ (−1/2, 1/2), the resulting Λ is positive definite.
Proof sketch.
Since Λ is a symmetric matrix, we verify that each of its leading principal minors has a strictly positive determinant. The leading principal minors of Λ are themselves uniform tridiagonal symmetric matrices, and the determinant of an n × n uniform tridiagonal matrix with unit diagonal and off-diagonal −α can be written as Π_{k=1}^{n} (1 − 2α cos(kπ/(n+1))) (see, e.g., Volpi (2003) for the proof). Since |cos(kπ/(n+1))| < 1, if |α| < 1/2 every factor, and hence the determinant, is positive. Therefore, Λ is always p.d. for α ∈ (−1/2, 1/2). ∎
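As a sanity check (ours, not from the paper), the sketch below verifies the closed-form determinant and the positive-definiteness claim numerically. The matrix and the eigenvalue formula 1 − 2α cos(kπ/(T+1)) follow the uniform tridiagonal structure above; function names are illustrative.

```python
import numpy as np

def uniform_tridiag(T, alpha):
    # T x T matrix with 1 on the diagonal and -alpha on the off-diagonals
    return np.eye(T) - alpha * (np.eye(T, k=1) + np.eye(T, k=-1))

def det_closed_form(T, alpha):
    # Product of the eigenvalues of a uniform tridiagonal matrix:
    # 1 - 2 * alpha * cos(k * pi / (T + 1)), for k = 1..T
    k = np.arange(1, T + 1)
    return np.prod(1.0 - 2.0 * alpha * np.cos(k * np.pi / (T + 1)))

T, alpha = 6, 0.4
A = uniform_tridiag(T, alpha)
assert abs(np.linalg.det(A) - det_closed_form(T, alpha)) < 1e-10
# Every eigenvalue exceeds 1 - 2*|alpha| > 0, so A is p.d. for |alpha| < 1/2
assert np.all(np.linalg.eigvalsh(A) > 1.0 - 2.0 * abs(alpha) - 1e-10)
```

The same check passes for any |α| < 1/2, and fails (a non-positive eigenvalue appears) once |α| reaches 1/2 and T grows.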
3.1 Generative Model
Our generative model for the group of coefficients β_j is given by:

- Draw λ_j from an improper Jeffreys prior (p(λ_j) ∝ 1/λ_j).

- Draw α_j from a truncated exponential prior with parameter η. This distribution forces α_j to fall in (0, 1/2), so that Λ_j is p.d. and autocorrelations are always positive:

(3)  p(α_j; η) = η exp(−η α_j) / (1 − exp(−η/2)),  for α_j ∈ (0, 1/2).

We fix η in advance.

- Draw β_j ~ Normal(0, Λ_j^{−1}), with the precision matrix Λ_j as defined in Eq. 2.
During estimation of β, each λ_j and α_j is marginalized, giving a sparse and adaptive estimate for β.
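The generative story can be simulated directly. The sketch below is ours, not the paper's code: the fixed value of λ is illustrative (the Jeffreys prior on λ is improper and cannot be sampled), α is drawn by inverse-CDF sampling from the truncated exponential, and β is then drawn from the resulting zero-mean Gaussian.

```python
import numpy as np

def sample_alpha(eta, rng):
    # Inverse-CDF sampling from the exponential(eta) truncated to (0, 1/2)
    u = rng.uniform()
    return -np.log(1.0 - u * (1.0 - np.exp(-eta / 2.0))) / eta

def sample_beta(T, lam, alpha, rng):
    # beta ~ N(0, Lambda^{-1}), with tridiagonal precision Lambda = lam * A(alpha)
    A = np.eye(T) - alpha * (np.eye(T, k=1) + np.eye(T, k=-1))
    cov = np.linalg.inv(lam * A)
    return rng.multivariate_normal(np.zeros(T), cov)

rng = np.random.default_rng(0)
alpha = sample_alpha(eta=1.0, rng=rng)          # always lands in (0, 0.5)
beta = sample_beta(T=20, lam=4.0, alpha=alpha, rng=rng)
assert 0.0 < alpha < 0.5
assert beta.shape == (20,)
```

Larger α yields visibly smoother trajectories for β over the T timesteps; larger λ shrinks the whole trajectory toward zero.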
3.2 Scalability
Our design choice for the precision matrix is driven by scalability concerns. Instead of using, e.g., a random draw from a Wishart distribution, we specify the precision matrix to have a tridiagonal structure. This induces dependencies between coefficients in adjacent timesteps (first-order dependencies) and allows the prior to scale to fine-grained timesteps more efficiently. Let N denote the number of training instances, J the number of base features, and T the number of timesteps. A single pass of our variational algorithm (discussed in §5) has runtime and space requirements for the prior terms that are linear in T, i.e., O(JT), instead of at least quadratic in T for both if each Λ_j were drawn from a Wishart distribution. This can make a big difference for applications with large numbers of features. Additionally, we choose the off-diagonal entries to be uniform, so we only need one α_j for each base feature. This design choice restricts the expressive power of the prior but still permits flexibility in adapting to trends for different coefficients, as we will see. The prior encourages sparsity at the group level, essentially performing feature selection: some feature coefficients β_j may be driven to zero across all timesteps, while others will be allowed to vary over time, with an expectation of smooth changes.

4 Related Work
Our model is related to autoregressive integrated moving average approaches to time-series data (Box et al., 2008), but we never have access to direct observations of the time series. Instead, we observe data (x and y) assumed to have been sampled using time-series-generated variables as coefficients (β). During learning, we therefore use probabilistic inference to reason about the variables at all timesteps together. In §5, we describe a scalable variational inference algorithm for inferring coefficients at all timesteps, enabling prediction of future data and inspection of trends.
We follow Yogatama et al. (2011) in creating time-specific copies of the base coefficients, so that β_{j,t} is the coefficient for feature j at timestep t. As a prior over β, they used a multivariate Gaussian imposing nonzero covariance between each β_{j,t} and its time-adjacent copies β_{j,t−1} and β_{j,t+1}. The strength of that covariance was set for each base feature by a global hyperparameter, which was tuned on held-out development data along with the global variance hyperparameter. Yogatama et al.’s model can be obtained from ours by fixing the same λ and α for all features j. Our approach differs in that (i) we marginalize the hyperparameters, (ii) we allow each coefficient its own autocorrelation, and (iii) we encourage sparsity.
There are many related Bayesian approaches for time-varying model parameters (Belmonte et al., 2012; Nakajima & West, 2012; Caron et al., 2012), as well as work on time-varying signal estimation (Angelosante & Giannakis, 2009; Angelosante et al., 2009; Charles & Rozell, 2012). Each provides a different probabilistic interpretation of parameter generation. Our model has a distinctive generative story in that correlations between parameters of successive timesteps are encoded in a precision matrix. Additionally, unlike these fully Bayesian approaches that infer full posterior distributions, we only obtain posterior mode estimates of the coefficients, which has computational advantages at prediction time (e.g., straightforward MAP inference and sparsity) and aids the interpretability of β.
As noted, our grouping together of each feature’s instantiations at all timesteps, and seeking sparsity, bears clear similarity to group lasso (Yuan & Lin, 2007), which encourages whole groups of coefficients to collectively go to zero. A probabilistic interpretation of lasso as a two-level exponential-normal distribution that generalizes to (non-overlapping) group lasso was introduced by Figueiredo (2002). He also showed that the exponential distribution prior can be replaced with an improper Jeffreys prior for a parameter-free model, a step we follow as well.
Our model is also related to the fused lasso (Tibshirani et al., 2005), which penalizes a loss function by the ℓ1 norm of the coefficients and of their successive differences. Our prior has a clearer probabilistic interpretation and adapts the degree of autocorrelation for each coefficient based on the data.

Zhang & Yeung (2010) proposed a regularization method using a matrix-variate normal distribution prior to model task relationships in multitask learning. If we consider timesteps as tasks, the technique resembles our regularizer. Their method jointly optimizes the covariance matrix with the feature coefficients; we choose a Bayesian treatment and encode our prior beliefs in the (inverse) covariance matrix, while still allowing the learned feature coefficients to modify the matrix through posterior inference. As a result, our method allows different base features to have different matrices.
5 Learning and Inference
We marginalize λ and α and obtain a maximum a posteriori estimate for β, which includes a coefficient β_{j,t} for each base feature j at each timestep t. Specifically, we seek to maximize:

(4)  log ∫∫ p(β | λ, α) p(λ) p(α) dλ dα + Σ_{i=1}^{N} log p(y_i | x_i; β)
Exact inference in this model is intractable. We use mean-field variational inference to derive a lower bound on the above log-likelihood function. We then apply a standard optimization technique to jointly optimize the variational parameters and the coefficients β.
We introduce fully factored variational distributions for each λ_j and α_j. For λ_j, we use a Gamma distribution with shape a_j and rate b_j as our variational distribution: q(λ_j) = Gamma(λ_j; a_j, b_j). Therefore, we have E_q[λ_j] = a_j / b_j, Var_q[λ_j] = a_j / b_j², and E_q[log λ_j] = ψ(a_j) − log b_j (ψ is the digamma function).
For α_j, we choose the form of our variational distribution to be the same truncated exponential distribution as its prior, with free parameter μ_j; we denote this distribution q(α_j). We have

(5)  E_q[α_j] = 1/μ_j − (1/2) · exp(−μ_j/2) / (1 − exp(−μ_j/2))
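As a check on the closed forms above (our derivation, verified numerically rather than reproduced from the paper), the Gamma moments and the truncated-exponential mean can be compared against quadrature; the parameter values are arbitrary illustrations.

```python
import numpy as np
from scipy.special import digamma
from scipy.integrate import quad

a, b = 2.5, 1.7          # Gamma shape and rate (variational parameters)
mu = 3.0                 # truncated-exponential variational parameter

# Gamma moments used in the variational bound
E_lam = a / b
E_log_lam = digamma(a) - np.log(b)

# Truncated-exponential mean on (0, 1/2): closed form vs. numerical quadrature
Z = 1.0 - np.exp(-mu / 2.0)                       # normalizer
E_alpha = 1.0 / mu - 0.5 * np.exp(-mu / 2.0) / Z  # Eq. (5)
num, _ = quad(lambda x: x * mu * np.exp(-mu * x) / Z, 0.0, 0.5)
assert abs(E_alpha - num) < 1e-7
```

As μ → 0 the variational distribution approaches uniform on (0, 1/2) and E_q[α] → 1/4; large μ pushes E_q[α] toward 0 (weak autocorrelation).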
We let Q denote the set of all variational distributions over λ and α.
The variational bound that we seek to maximize is given in Figure 1. Our learning algorithm involves optimizing the bound with respect to the variational parameters a_j, b_j, and μ_j, and the coefficients β. We employ the L-BFGS quasi-Newton method (Liu & Nocedal, 1989), for which we need to compute the gradient of the bound. We turn next to each part of this gradient.
5.1 Coefficients
For the coefficients, the first derivative of the bound with respect to the time-specific coefficient β_{j,t} (shown for 1 < t < T; boundary timesteps have a single neighbor) is:

(6)  ∂L(β)/∂β_{j,t} − E_q[λ_j] (β_{j,t} − E_q[α_j] (β_{j,t−1} + β_{j,t+1}))
We can interpret the first derivative as including a penalty scaled by E_q[λ_j]. We rewrite this penalty as:

E_q[λ_j] E_q[α_j] ((β_{j,t} − β_{j,t−1}) + (β_{j,t} − β_{j,t+1})) + E_q[λ_j] (1 − 2 E_q[α_j]) β_{j,t}

This form makes it clear that the penalty depends on β_{j,t−1} and β_{j,t+1}, penalizing the difference between β_{j,t} and these time-adjacent coefficients in proportion to E_q[α_j], while shrinking β_{j,t} toward zero in proportion to (1 − 2 E_q[α_j]).
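The algebraic equivalence of the two forms of the penalty can be confirmed directly. The sketch below (ours; toy values) evaluates both at interior timesteps:

```python
import numpy as np

def penalty_direct(beta, t, E_lam, E_alpha):
    # E[lam] * (beta_t - E[alpha] * (beta_{t-1} + beta_{t+1}))
    return E_lam * (beta[t] - E_alpha * (beta[t - 1] + beta[t + 1]))

def penalty_rewritten(beta, t, E_lam, E_alpha):
    # Smoothness term toward the two neighbors, plus shrinkage toward zero
    smooth = E_lam * E_alpha * ((beta[t] - beta[t - 1]) + (beta[t] - beta[t + 1]))
    shrink = E_lam * (1.0 - 2.0 * E_alpha) * beta[t]
    return smooth + shrink

beta = np.array([0.5, -1.2, 0.8, 0.3])
for t in (1, 2):
    assert abs(penalty_direct(beta, t, 2.0, 0.3)
               - penalty_rewritten(beta, t, 2.0, 0.3)) < 1e-12
```

Note how the shrinkage weight (1 − 2E[α]) vanishes as E[α] → 1/2: strongly autocorrelated coefficients are smoothed rather than shrunk.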
The form bears strong similarity to the first derivative of the time-series (log-)prior introduced by Yogatama et al. (2011), which depends on fixed, global hyperparameters analogous to our λ and α. Because our approach does not require us to specify scalars playing the roles of λ and α in advance, it is possible for each feature to have its own autocorrelation. Obtaining the same effect in their model would require careful tuning of many hyperparameters, which is not practical.
It also has some similarities to the fused lasso penalty (Tibshirani et al., 2005), which is intended to encourage sparsity in the differences between feature coefficients across timesteps. Our prior, on the other hand, encourages smoothness in the differences, with additional sparsity at the feature level.
5.2 Variational Parameters for λ and α
Recall that the variational distribution for λ_j is a Gamma distribution with shape a_j and rate b_j.
Precision matrix scalar λ_j.
The first derivatives with respect to the variational parameters a_j and b_j are easy to compute:
(7) 
where ψ′ is the trigamma function. We can solve for b_j in closed form given the other free variables:
(8)  b_j = (a_j / T) (β_j^⊤ β_j − 2 E_q[α_j] Σ_{t=1}^{T−1} β_{j,t} β_{j,t+1})
We therefore treat b_j as a function of a_j, μ_j, and β_j in optimization.
Off-diagonal entries α_j.
First, notice that Jensen’s inequality lets us bound the expectation of the log-determinant term, exploiting the convexity of the function involved. Furthermore, for a uniform symmetric tridiagonal matrix like Λ, the log determinant can be computed in closed form (Volpi, 2003):

log det Λ = T log λ + Σ_{k=1}^{T} log(1 − 2α cos(kπ / (T+1)))

We therefore maximize a lower bound on the variational bound, making use of the above when calculating first derivatives with respect to μ_j.
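The closed-form log determinant is easy to validate against a dense computation. The following sketch (ours; values are illustrative) checks it for a small Λ:

```python
import numpy as np

def logdet_closed_form(T, lam, alpha):
    # T * log(lam) plus the log-eigenvalues of the uniform tridiagonal factor
    k = np.arange(1, T + 1)
    return T * np.log(lam) + np.sum(np.log(1.0 - 2.0 * alpha * np.cos(k * np.pi / (T + 1))))

T, lam, alpha = 8, 2.5, 0.45
Lam = lam * (np.eye(T) - alpha * (np.eye(T, k=1) + np.eye(T, k=-1)))
sign, logdet = np.linalg.slogdet(Lam)
assert sign == 1.0                                     # positive definite
assert abs(logdet - logdet_closed_form(T, lam, alpha)) < 1e-10
```

The closed form costs O(T) per feature, versus O(T³) for a dense factorization, which is what makes the tridiagonal restriction pay off at fine-grained timesteps.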
The remaining partial derivatives are easy to compute; we omit them for space.
5.3 Implementation Details
A well-known property of numerical optimizers like the one we use (L-BFGS; Liu & Nocedal, 1989) is the failure to reach optimal values exactly at zero. Although theoretically strongly sparse, our prior only produces weak sparsity in practice. Future work might consider a more principled proximal-gradient algorithm to obtain strong sparsity (Bach et al., 2011; Liu & Ye, 2010; Duchi & Singer, 2009).
If we expect feature coefficients at specific timesteps to be sparse as well, it is straightforward to incorporate additional terms in the objective function that encode this prior belief (analogous to an extension from group lasso to sparse group lasso). For the tasks we consider in our experiments, we found that this did not substantially improve overall performance. Therefore, we keep the simpler bound given in Figure 1.
6 Experiments
We report two sets of experiments: one with a continuous response y, the other a language modeling application in which the prediction target is text. Each timestep in our experiments is one year.
6.1 Baselines
On both tasks, we compare our approach to a range of baselines. Since this is a forecasting task, at each test year, we only used training examples that come from earlier years. Our baselines vary in how they use this earlier data and in how they regularize.
- ridge-one: ridge regression (Hoerl & Kennard, 1970), trained on only examples from the year prior to the test data (e.g., for the 2002 task, train on examples from 2001)
- ridge-all: ridge regression trained on the full set of past examples (e.g., for the 2002 task, train on examples from 1996–2001)
- ridge-ts: the non-adaptive time-series ridge model of Yogatama et al. (2011)
- lasso-one: lasso regression (Tibshirani, 1996), trained on only examples from the year prior to the test data^4
- lasso-all: lasso regression trained on the full set of past examples

^4 Brendan O’Connor (personal communication) has established the superiority of the lasso to the support vector regression method of Kogan et al. (2009) on this dataset; lasso is a strong baseline for this problem.
In all cases, we tuned hyperparameters on development data. Note that, of the above baselines, only ridge-ts replicates the coefficients at different timesteps (one coefficient per feature per timestep); the others have only time-insensitive coefficients.
The model with our prior always uses all training examples available up to the test year (equivalent to a sliding window of infinite size). Like ridge-ts, our model trusts more recent data more, allowing coefficients farther in the past to drift farther from those most relevant for prediction at test time. Our model, however, adapts the “drift” of each coefficient separately rather than setting a global hyperparameter.
6.2 Forecasting Risk from Text
In the first experiment, we apply our prior to a forecasting task. We consider the task of predicting volatility of stock returns from financial reports of publicly traded companies, similar to Kogan et al. (2009).
year       | # examples | ridge-one | ridge-all | ridge-ts | lasso-one | lasso-all | our model
2002 (dev) | 2,845      | 0.182     | 0.176     | 0.171    | 0.165     | 0.156     | 0.158
2003       | 3,611      | 0.185     | 0.173     | 0.171    | 0.164     | 0.176     | 0.164
2004       | 3,558      | 0.125     | 0.137     | 0.129    | 0.116     | 0.119     | 0.113
2005       | 3,474      | 0.135     | 0.133     | 0.136    | 0.124     | 0.124     | 0.122
overall    | 13,488     | 0.155     | 0.154     | 0.151    | 0.141     | 0.143     | 0.139

Table 1. Mean squared error on the test sets.
In finance, volatility refers to a measure of variation in a quantity over time; for stock returns, it is measured using the standard deviation of returns during a fixed period (here, one year). Volatility is used as a measure of financial risk. Consider a linear regression model for predicting the log volatility^5 of a stock from a set of features (see §6.2.1 for a complete description of our features). We can interpret a linear regression model probabilistically as drawing y from a normal distribution with β^⊤f(x) as its mean; in this experiment, the task log-likelihood is therefore Gaussian. We apply the time-series prior to the feature coefficients β. When making a prediction for the test data, we use β_T, the set of feature coefficients for the last timestep in the training data.

^5 Similar to Kogan et al. (2009), and as is standard practice in finance, we perform a log transformation, since log volatilities are typically close to normally distributed.
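A minimal sketch (ours; shapes and names are illustrative) of how prediction uses only the last timestep’s coefficients β_T:

```python
import numpy as np

def forecast_log_volatility(x, betas):
    # betas: T x J array of per-timestep coefficients learned on past years;
    # forecasts for the next year use the last timestep's coefficients, beta_T
    return x @ betas[-1]

rng = np.random.default_rng(2)
betas = 0.1 * rng.normal(size=(6, 4))    # 6 training years, 4 features
x = np.array([1.0, 0.0, 1.0, 0.5])       # binary word features + lagged log volatility
pred = forecast_log_volatility(x, betas)
assert np.ndim(pred) == 0                # a single scalar log-volatility forecast
```

Because the prediction is a normal mean, exponentiating the forecast recovers a point estimate on the original volatility scale.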
6.2.1 Dataset
We used a collection of Securities and Exchange Commission-mandated annual reports from 10,492 publicly traded companies in the U.S. There are 27,159 reports in the corpus, covering the ten-year period 1996–2005. These reports are known as “Form 10-K.”^6

^6 See Kogan et al. (2009) for a complete description of the dataset; it is available at http://www.ark.cs.cmu.edu/10K.
For the feature set, we downcased and tokenized the texts and selected the 101st–10,101st most frequent words as binary features. The feature set was kept the same across experiments for all models. It is widely known in the financial community that the past history of volatility of stock returns is a good indicator of future volatility. Therefore, we also included the log volatility of the stock over the twelve months prior to the report as a feature. Our response variable y is the log volatility of stock returns over the twelve months after the report is published.

6.2.2 Results
The first test year (2002) was used as our development data for hyperparameter tuning. We initialized all the feature coefficients with the coefficients from training a lasso regression on the last year of the training data (lasso-one). Table 1 provides a summary of experimental results. We report mean squared error on the test set: (1/N) Σ_i (y_i − ŷ_i)², where y_i is the true response for instance i and ŷ_i is the predicted response.
Our model consistently outperformed ridge variants, including the one with a time-series penalty (Yogatama et al., 2011). It also outperformed the lasso variants without any time-series penalty, on average and in three out of four test sets apiece.
One of the major challenges in working with time-series data is choosing the right window size, within which the data are still relevant to current predictions. Our model automates this process with a Bayesian treatment of the strength of each feature coefficient’s autocorrelation. The results indicate that our model was able to learn when to trust a longer history of training data and when to trust a shorter one, demonstrating the adaptiveness of our prior. Figure 2 shows the distribution of the expected values of the autocorrelation parameters under the variational distributions for 10,002 features, learned by our model from the last run (test year 2005).
In future work, an empirical Bayesian treatment of the hyperprior parameter η, fitting it to improve the variational bound, might lead to further improvements.

6.3 Text Modeling in Context
In the second experiment, we consider the harder task of modeling a collection of texts over time, conditioned on economic measurements. The goal is to predict the probability of words appearing in a document, based on the “state of the world” at the time the document was authored. Given a set of macroeconomic variables in the U.S. (e.g., unemployment rate, inflation rate, average housing prices), we want to predict what kind of text will be produced at a specific timestep. These documents can be written by either the government or publicly traded companies directly or indirectly affected by the current economic situation.
6.3.1 Model
Our text model is a sparse additive generative model (SAGE; Eisenstein et al., 2011). In SAGE, there is a background lexical distribution that is perturbed additively in log-space. When the effects are due to a (sole) feature k, the probability of a word w is:

p(w | k) = exp(m_w + β_{k,w}) / Σ_{w′ ∈ V} exp(m_{w′} + β_{k,w′})

where V is the vocabulary, m (always observed) is the vector of background log-frequencies of words in the corpus, k (observed) is the feature derived from the context, and β_k is the feature-specific deviation.
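The SAGE word distribution is a softmax of background log-frequencies plus the active deviations. A minimal sketch (ours; toy numbers, not the paper’s features):

```python
import numpy as np

def sage_word_probs(m, deviations, x):
    # m: background log-frequencies (length V)
    # deviations: K x V matrix of sparse feature-specific deviations (beta)
    # x: length-K vector of observed feature values
    logits = m + x @ deviations
    logits -= logits.max()        # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

V, K = 5, 2
m = np.log(np.array([0.4, 0.3, 0.1, 0.1, 0.1]))
deviations = np.zeros((K, V))
deviations[0, 2] = 1.0            # feature 0 boosts word 2
p0 = sage_word_probs(m, deviations, np.zeros(K))
p1 = sage_word_probs(m, deviations, np.array([1.0, 0.0]))
assert abs(p0.sum() - 1.0) < 1e-12 and abs(p1.sum() - 1.0) < 1e-12
assert p1[2] > p0[2]              # an active effect raises word 2's probability
```

With all deviations zero, the model reduces exactly to the background distribution; sparsity in the deviations therefore directly controls how far each effect moves the model from the background.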
Notice that the formulation above is easily extended to multiple effects with coefficients β_1, …, β_K. In our experiment, we have 117 effects (features), each with its own deviation vector. The first 51 correspond to the 50 U.S. states plus an additional feature for the entire U.S., and they are observed for each text, since each text is associated with a known set of states (discussed below). We assume that texts generated in different states have distinct characteristics; for each state, we have a binary indicator feature. The other 66 features depend on observed macroeconomic variables at each timestep (e.g., unemployment rate, inflation rate, house price index). Given an economic state of the world, we hypothesize that certain words are more likely to be used, and each economic variable has its own (sparse) deviation from the background word frequencies. The generative story for a word w at timestep t, associated with (observed) features x, is:
Given the observed real-world variables x, draw word w from a multinomial distribution with p(w | x) ∝ exp(m_w + Σ_k x_k β_{k,w,t}).
Our task loss is simply the negative log-loss commonly used in multiclass logistic regression: the negative log-probability of the observed words. We apply our time-series prior from §3 to the deviations. The background m is fixed to the log frequencies of words at timestep t. For a single feature, coefficients over time for different classes (words) are assumed to be generated from the same prior.
6.3.2 Dataset
There is a great deal of text produced to describe current macroeconomic events. We conjecture that the connection between the economy and the text will have temporal dependencies (e.g., the amount of discussion about housing or oil prices might vary over time). We use three sources of text commentary on the economy. The first is a subset of the 10-K reports we used in §6.2. We selected the 10-K reports of 200 companies chosen randomly from the top quintile of size (measured by beginning-of-sample market capitalization). This gives us a manageable sample of the largest U.S. companies. Each report is associated with the state in which the company’s head office is located. Our next two data sources come from the Federal Reserve System, the primary body responsible for monetary policy in the U.S.^7 The Federal Open Market Committee (FOMC) meets roughly eight times per year to discuss economic conditions and set monetary policy. Prior to each meeting, each of the twelve regional banks writes an informal “anecdotal” description of economic activity in its region, as well as a national summary. This “Beige Book” is akin to a blog of economic activity released prior to each meeting. Each FOMC meeting also produces a transcript of the discussion. For our experiments here, we focus on text from 1996–2006.^8 The Beige Book is released to the public prior to each meeting; the transcripts are released five years after the meetings. As a result, we have 2,075 documents in the final corpus: 842 10-K reports, 89 FOMC meeting transcripts, and 1,144 Beige Book summaries.

^7 For an overview of the Federal Reserve System, see the Federal Reserve’s “Purpose and Functions” document at http://www.federalreserve.gov/pf/pf.htm.
^8 All the text is freely available at http://www.federalreserve.gov.
We use the 501st–5,501st most frequent words in the dataset. We associated the FOMC meeting transcripts with all states. The “Beige Book” texts were produced by the Federal Reserve Banks. There are twelve Federal Reserve Banks in the United States, each serving a collection of states. We associated texts from a Federal Reserve Bank with the states that it serves.
year       | # tokens | ridge-one | ridge-all | ridge-ts | lasso-one | lasso-all | our model
2003 (dev) | 1.1      | 2,736     | 2,765     | 2,735    | 2,736     | 2,765     | 2,735
2004       | 1.5      | 2,975     | 3,004     | 2,975    | 2,975     | 3,004     | 2,974
2005       | 1.9      | 2,999     | 3,027     | 2,997    | 2,998     | 3,027     | 2,997
2006       | 2.3      | 2,916     | 2,922     | 2,913    | 2,912     | 2,922     | 2,912
overall    | 6.8      | 11,626    | 11,718    | 11,619   | 11,620    | 11,718    | 11,618

Table 2. Negative log-likelihood on the test sets.
Quantitative U.S. macroeconomic data were obtained from the Federal Reserve Bank of St. Louis data repository (“FRED”). We used standard measures of economic activity focusing on output (GDP), employment, and specific markets (e.g., housing).^9 We use equity market returns for the U.S. market as a whole and various industry and characteristic portfolios.^10 They are used as features in our model; in addition to the state indicator variables, there are 66 macroeconomic variables in total.

^9 For growing output series, like GDP, we calculate growth rates as log differences.
^10 Returns are monthly, in excess of the risk-free rate, and continuously compounded. The data are from CRSP and are available for these portfolios at http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html.
6.3.3 Results
We score models by computing the negative log-likelihood of the test dataset: the negative sum of log-probabilities of observed words.^11 We initialized all the feature coefficients with the coefficients obtained by training a lasso regression on the last year of the training data (lasso-one). The first test year (2003) was used as our development data for hyperparameter tuning. Table 2 shows the results for the six models we compared. Similar to the forecasting experiments, at each test year, we trained only on documents from earlier years. When we collapsed all the training data and ignored the temporal dimension (ridge-all and lasso-all), the background log-frequencies were computed using the entire training data, which differ from the background log-frequencies for only the last timestep of the training data. Our model outperformed all ridge and lasso variants, including the one with a time-series penalty (Yogatama et al., 2011), in terms of negative log-likelihood on unseen data.

^11 Out-of-vocabulary items are ignored.
In addition to improving predictive accuracy, the prior also allows us to discover trends in the feature coefficients and gain insight. We manually examined the model from the last run (test year 2006). Examples of temporal trends learned by our model are shown in Figure 3. The plot illustrates feature coefficients for words that contain the string employ. For comparison, we also included the U.S. unemployment rate (which was used as one of the features), scaled to fit the plot. We can see a correlation between the feature coefficients for the word unemployment and the actual unemployment rate. The correlations are less evident for other words.
7 Conclusions
We presented a time-series prior for the parameters of probabilistic models; it produces sparse models and adapts the strength of temporal effects on each coefficient separately, based on the data, without an explosion in the number of hyperparameters. We showed how to perform inference under this prior using variational approximations. We evaluated the prior on the task of forecasting volatility of stock returns from financial reports and demonstrated that it outperforms competing models. We also evaluated the prior on the task of modeling a collection of texts over time, i.e., predicting the probability of words given observed real-world variables, and showed that it achieves state-of-the-art results there as well.
Acknowledgments
The authors thank several anonymous reviewers for helpful feedback on earlier drafts of this paper. This research was supported in part by a Google research award to the second and third authors. This research was supported in part by the Intelligence Advanced Research Projects Activity via Department of Interior National Business Center contract number D12PC00347. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.
References
 Angelosante & Giannakis (2009) Angelosante, Daniele and Giannakis, Georgios B. RLS-weighted lasso for adaptive estimation of sparse signals. In Proc. of ICASSP, 2009.
 Angelosante et al. (2009) Angelosante, Daniele, Giannakis, Georgios B., and Grossi, Emanuele. Compressed sensing of time-varying signals. In Proc. of ICDSP, 2009.
 Bach et al. (2011) Bach, Francis, Jenatton, Rodolphe, Mairal, Julien, and Obozinski, Guillaume. Convex optimization with sparsity-inducing norms. MIT Press, 2011.
 Belmonte et al. (2012) Belmonte, Miguel A. G., Koop, Gary, and Korobilis, Dimitris. Hierarchical shrinkage in time-varying parameter models, 2012. Working paper.
 Box et al. (2008) Box, George E. P., Jenkins, Gwilym M., and Reinsel, Gregory C. Time Series Analysis: Forecasting and Control. Wiley Series in Probability and Statistics, 2008.
 Caron et al. (2012) Caron, François, Bornn, Luke, and Doucet, Arnaud. Sparsity-promoting Bayesian dynamic linear models, 2012. arXiv:1203.0106.
 Charles & Rozell (2012) Charles, Adam S. and Rozell, Christopher J. Re-weighted dynamic filtering for time-varying sparse signal estimation, 2012. arXiv:1208.0325.
 Duchi & Singer (2009) Duchi, John and Singer, Yoram. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873–2908, 2009.
 Eisenstein et al. (2011) Eisenstein, Jacob, Ahmed, Amr, and Xing, Eric P. Sparse additive generative models of text. In Proc. of ICML, 2011.
 Figueiredo (2002) Figueiredo, Mario A. T. Adaptive sparseness using Jeffreys’ prior. In Proc. of NIPS, 2002.
 Hoerl & Kennard (1970) Hoerl, Arthur E. and Kennard, Robert W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
 Kogan et al. (2009) Kogan, Shimon, Levin, Dimitry, Routledge, Bryan R., Sagi, Jacob S., and Smith, Noah A. Predicting risk from financial reports with regression. In Proc. of NAACL, 2009.
 Liu & Nocedal (1989) Liu, Dong C. and Nocedal, Jorge. On the limited memory BFGS method for large scale optimization. Mathematical Programming B, 45(3):503–528, 1989.
 Liu & Ye (2010) Liu, Jun and Ye, Jieping. Moreau-Yosida regularization for grouped tree structure learning. In Proc. of NIPS, 2010.
 Nakajima & West (2012) Nakajima, Jouchi and West, Mike. Bayesian analysis of latent threshold dynamic models. Journal of Business and Economic Statistics, 2012.
 Tibshirani (1996) Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of Royal Statistical Society B, 58(1):267–288, 1996.
 Tibshirani et al. (2005) Tibshirani, Robert, Saunders, Michael, Rosset, Saharon, Zhu, Ji, and Knight, Keith. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society B, 67(1):91–108, 2005.
 Volpi (2003) Volpi, Leonardo. Eigenvalues and eigenvectors of tridiagonal uniform matrices. 2003.
 Yogatama et al. (2011) Yogatama, Dani, Heilman, Michael, O’Connor, Brendan, Dyer, Chris, Routledge, Bryan R., and Smith, Noah A. Predicting a scientific community’s response to an article. In Proc. of EMNLP, 2011.
 Yuan & Lin (2007) Yuan, Ming and Lin, Yi. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2007.
 Zhang & Yeung (2010) Zhang, Yu and Yeung, Dit-Yan. A convex formulation for learning task relationships in multi-task learning. In Proc. of UAI, 2010.