Strongly Hierarchical Factorization Machines and ANOVA Kernel Regression

12/25/2017 ∙ by Ruocheng Guo, et al. ∙ Arizona State University

High-order parametric models that include terms for feature interactions are applied to various data mining tasks where the ground truth depends on interactions of features. However, with sparse data, the high-dimensional parameters for feature interactions often face three issues: expensive computation, difficulty in parameter estimation and lack of structure. Previous work has proposed approaches that partially resolve the three issues. In particular, models with factorized parameters (e.g. Factorization Machines) and sparse learning algorithms (e.g. FTRL-Proximal) can tackle the first two issues but fail to address the third. To give the parameters structure, constraints or complicated regularization terms are applied so that hierarchical structures can be imposed. However, these methods make the optimization problem more challenging. In this work, we propose Strongly Hierarchical Factorization Machines and ANOVA kernel regression, where all three issues can be addressed without making the optimization problem more difficult. Experimental results show the proposed models significantly outperform the state-of-the-art in two data mining tasks: cold-start user response time prediction and stock volatility prediction.




1 Introduction

In the area of data mining, there exist many high-order parametric models which explicitly incorporate terms for modeling interactions between features. In applications such as prediction of users’ behavior in social media [21, 7, 27], movie ratings [20] and stock return volatility [3], co-occurrence of features can be crucial in deciding the ground truth labels. For example, the observation that a user retweeted the microblog "SIAM SDM deadlines are approaching." just a moment after it had been posted can result from the co-occurrence of the word "SDM" in the tweet and the phrase "data mining researcher" in the user’s profile.

In the case of models including feature interactions, the problem is to learn a function that maps features and their interactions to a scalar such that a predefined loss function is minimized. In the most straightforward high-order models (i.e. Polynomial Regression), each feature interaction is modeled by an independent parameter. This leads to the problem of high-dimensional parameters. In sparse and high-dimensional settings, where the number of nonzero elements in each feature vector is much smaller than its dimension, there are mainly three issues:

1. Expensive computation: the number of parameters increases exponentially with the order number.
2. Difficulty in estimating high-dimensional parameters with sparse data: for example, given a pair of features $(x_i, x_j)$, reliable estimation of the parameter for their interaction requires enough samples with $x_i x_j \neq 0$, which is not the case in sparse data.
3. Lack of structure between parameters: it is hard to justify models where the interaction $x_i x_j$ plays an important role in prediction but neither $x_i$ nor $x_j$ does.

To address the first two issues, one idea is to learn sparse models. Approaches such as Lasso [24] and Elastic net [29] apply regularization terms (e.g. the $\ell_1$ norm of the parameters) that lead to sparsity. Recently, follow-the-regularized-leader algorithms such as RDA [25] and FTRL-Proximal [13] have been shown to be effective in producing sparsity for generalized linear models. Another idea is to develop novel models that can handle interaction effects with low-dimensional parameters. Models such as Factorization Machines (FMs) and ANOVA kernel regression [20, 3] resolve the first two issues by modeling interaction effects with low-rank factorized parameters.

As shown in [11, 2, 4, 26], hierarchical structures between main and interaction effects contribute to model effectiveness and to the selection of important features and interactions. To address the third issue, Bien et al. [2] defined strong and weak hierarchy. Strong (weak) hierarchy requires that an interaction effect can have a non-zero weight only if both (at least one) of the corresponding main effects do. They also proposed the Weak Hierarchical Lasso, where constraints are added to the optimization problem to guarantee weak hierarchy. Then, in [10], Li et al. mentioned a structured sparsity approach which can impose strong hierarchy on their model with a complicated regularization term. However, both methods restrict two types of operations: (i) applying other regularization terms or constraints for a specific purpose (e.g. domain prior knowledge), and (ii) using efficient algorithms that are only applicable to certain types of loss functions (e.g. convex functions) or regularizers. These facts motivate us to propose our models. We show that strong hierarchy can be imposed by adding a context dimension to FMs and ANOVA kernel regression. Thus, we can address the three issues simultaneously and leave the optimization problem without extra constraints or complicated regularization terms. We list our contributions below:

  • We propose Strongly Hierarchical FMs and ANOVA kernel regression where the three issues mentioned above are addressed without making the optimization problem more difficult.

  • We show that predictions with these models can be made in time linear in the data dimension (or in the average number of nonzero features for sparse data) and in the number of latent dimensions.

  • We derive an FTRL-Proximal algorithm for the proposed models. Analysis shows that the time and space complexity of this algorithm is linear in the feature dimension and the number of latent factors.

  • Experimental results show our proposed models can significantly outperform the state-of-the-art ones. Moreover, these models can achieve high sparsity without significant loss of performance.

The remainder of the paper is organized as follows: we give a brief review of background knowledge in Section 2. In Section 3, we introduce the proposed models and some of their properties. We also derive and analyze an efficient FTRL-Proximal algorithm for them. Then, we describe the experimental setup and results in Section 4. Finally, we summarize related work and conclude this paper in Sections 5 and 6, respectively.

2 Background

In this section, we start with the preliminaries. Next, we describe hierarchical structures amongst parameters. Finally, we introduce FMs [20] and ANOVA kernel regression [3].

2.1 Preliminaries

Bold uppercase letters (e.g. $\mathbf{P}$), bold lowercase letters (e.g. $\mathbf{x}$) and lowercase letters (e.g. $x$) denote matrices, vectors and scalars, respectively. Subscripts refer to rows and columns of matrices or to elements of matrices and vectors. For example, $\mathbf{P}_{i,:}$ and $\mathbf{P}_{:,s}$ denote the $i$th row and the $s$th column of matrix $\mathbf{P}$, respectively. Superscripts denote the order number of kernels (e.g. $\mathcal{A}^2$) or the iteration number (e.g. $\mathbf{x}^t$).

2.2 Factorization Machines and ANOVA kernel regression

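Previous work [8, 11, 19] has shown that augmenting the feature vector with interaction effects can significantly improve the performance of models in various data mining tasks such as user response prediction and microblog retrieval. One of the simplest models that take interaction effects into account is polynomial regression (PR). With second-order PR, predictions are made as follows:

$$\hat{y}_{PR}(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + \sum_{i=1}^{d} \sum_{j=i+1}^{d} W_{ij}\, x_i x_j \tag{1}$$

where $\mathbf{w} \in \mathbb{R}^{d}$, $\mathbf{W} \in \mathbb{R}^{d \times d}$ and $d$ is the dimension of data. The number of parameters can be $O(d^m)$ for $m$th-order PR or SVM with polynomial kernel. Such models do not scale well to high-dimensional data. Moreover, because of data sparsity, a significant number of parameters are fitted by only very few samples. Following [2], in this paper we refer to linear terms (e.g. $w_i x_i$) as main effects and second-order terms (e.g. $W_{ij} x_i x_j$) as interaction effects. In [20], Rendle proposed FMs, which factorize the parameters for feature interactions as $W_{ij} = \langle \mathbf{P}_{i,:}, \mathbf{P}_{j,:} \rangle$, where $\mathbf{P} \in \mathbb{R}^{d \times k}$, $\mathbf{P}_{i,:}$ is its $i$th row and the positive integer $k$ is the number of latent factors. FM is defined as:

$$\hat{y}_{FM}(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle \mathbf{P}_{i,:}, \mathbf{P}_{j,:} \rangle\, x_i x_j \tag{2}$$

Following [3], ANOVA kernel regression is defined as:

$$\hat{y}_{\mathcal{A}^m}(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + \sum_{s=1}^{k} \lambda_s\, \mathcal{A}^m(\mathbf{P}_{:,s}, \mathbf{x}) \tag{3}$$

where $\mathbf{P}_{:,s}$ is the $s$th column of $\mathbf{P}$ and $\mathcal{A}^m$ is the $m$th-order ANOVA kernel proposed in [23]. Given vectors $\mathbf{p}$ and $\mathbf{x}$, the ANOVA kernel is formally defined as:

$$\mathcal{A}^m(\mathbf{p}, \mathbf{x}) = \sum_{j_m > \dots > j_1} \prod_{t=1}^{m} p_{j_t} x_{j_t} \tag{4}$$

We denote the second-order ANOVA kernel regression model by the symbol $\mathcal{A}^2$. In [3], it is shown that FM is a special case of ANOVA kernel regression with $m = 2$ and $\boldsymbol{\lambda} = \mathbf{1}$.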
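The relationship between the second-order ANOVA kernel and FMs can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the function names are hypothetical, `P` is assumed to be a list of $d$ rows with $k$ entries each, and the linear-time form relies on the standard identity that the sum over pairs equals half of the squared sum minus the sum of squares.

```python
from itertools import combinations


def anova2_naive(p, x):
    """Second-order ANOVA kernel: sum over pairs i < j of p_i x_i p_j x_j (O(d^2))."""
    return sum(p[i] * x[i] * p[j] * x[j] for i, j in combinations(range(len(x)), 2))


def anova2_fast(p, x):
    """Same value in O(d): 0.5 * ((sum_i p_i x_i)^2 - sum_i (p_i x_i)^2)."""
    terms = [pi * xi for pi, xi in zip(p, x)]
    total = sum(terms)
    return 0.5 * (total * total - sum(t * t for t in terms))


def anova_regression(w, lam, P, x, kernel=anova2_fast):
    """w^T x + sum_s lam[s] * A^2(P[:, s], x), with P given as d rows of length k."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    inter = sum(
        lam[s] * kernel([row[s] for row in P], x) for s in range(len(lam))
    )
    return linear + inter


def fm_predict(w, P, x):
    """FM as the special case of second-order ANOVA regression with all lam[s] = 1."""
    return anova_regression(w, [1.0] * len(P[0]), P, x)
```

With `P = [[1, 2], [3, 4], [5, 6]]`, `w = [0.5, 0.5, 0.5]` and `x = [1, 0, 2]`, only the pair of features 1 and 3 is active, so `fm_predict` returns the linear part 1.5 plus the interaction $\langle \mathbf{P}_{1,:}, \mathbf{P}_{3,:} \rangle x_1 x_3 = 34$.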

2.3 Hierarchical Structures for Parameters

Bien et al. defined strong and weak hierarchy in [2]. Using notations from (1), they are defined as:

Strong hierarchy: $W_{ij} \neq 0 \Rightarrow w_i \neq 0$ and $w_j \neq 0$.

Weak hierarchy: $W_{ij} \neq 0 \Rightarrow w_i \neq 0$ or $w_j \neq 0$.

Here, we illustrate the intuition behind these two constraints with the example given in the introduction. When predicting when a given user would retweet a particular tweet, it is difficult to justify a model which states that the co-occurrence of the phrase "data mining researcher" in the user profile and the abbreviation "SDM" in the tweet text is crucial but ignores the main effect of either of them. In [2], it is shown that weak hierarchy can be imposed by adding constraints to the optimization problem with $\ell_1$ regularization. On the other hand, in [10], Li et al. mentioned a method to impose strong hierarchy on FMs with a structured sparsity regularization term [16]. With either method, the optimization problem becomes more challenging. This is the main motivation for us to propose our models.
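The two definitions can be made concrete with a small checker. This is an illustrative sketch only: the function name and the `tol` parameter are hypothetical, `w` is the vector of main-effect weights and `W` a square matrix of interaction weights of which only the upper triangle is read.

```python
def satisfies_hierarchy(w, W, strong=True, tol=0.0):
    """Return True if every non-zero interaction W[i][j] (i < j) is backed by
    non-zero main effects: both of them (strong) or at least one (weak)."""
    d = len(w)
    for i in range(d):
        for j in range(i + 1, d):
            if abs(W[i][j]) > tol:
                mains = (abs(w[i]) > tol, abs(w[j]) > tol)
                ok = all(mains) if strong else any(mains)
                if not ok:
                    return False
    return True
```

For example, with `w = [1.0, 0.0, 2.0]` an interaction between features 1 and 2 satisfies weak but not strong hierarchy, because the second main effect is zero.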

3 Strong Hierarchy with Context Dimension

In this section, we begin with derivation of our proposed model where strong hierarchy is imposed by incorporating the context dimension into FM and ANOVA kernel regression. Then we demonstrate efficient computation can be done with these models. Finally, we derive and analyze the per-coordinate FTRL-Proximal algorithm for our proposed models. Since models such as FM can be extended to high-order, without loss of generality, we focus on second-order models in this paper.

3.1 The Proposed Models

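Here, we propose Strongly Hierarchical ANOVA kernel regression (SH) and its special case, Strongly Hierarchical Factorization Machines (SHFMs). First, we show how the strong hierarchy of these two models is guaranteed without extra constraints or regularization terms. In the proposed models, to create a hierarchical structure, the parameters for main and interaction effects are connected by considering main effects as interactions between features and a constant context feature ($x_0 = 1$). Then, main effects become $w_i = \sum_{s=1}^{k} \lambda_s P_{is} P_{0s}$ (which equals $\langle \mathbf{P}_{i,:}, \mathbf{P}_{0,:} \rangle$ when $\boldsymbol{\lambda} = \mathbf{1}$), where $\mathbf{P}_{0,:}$ is the context latent factor. Moreover, we can merge the main effects into interactions by concatenating the context dimension ($x_0$, $\mathbf{P}_{0,:}$) to features and parameters, respectively: $\tilde{\mathbf{x}} = [x_0, \mathbf{x}]$ and $\tilde{\mathbf{P}} = [\mathbf{P}_{0,:}; \mathbf{P}]$. Finally, we formulate SH as:

$$\hat{y}(\mathbf{x}) = b + \sum_{s=1}^{k} \lambda_s\, \mathcal{A}^2(\tilde{\mathbf{P}}_{:,s}, \tilde{\mathbf{x}}) \tag{5}$$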
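In [3], the authors proved that fitting $\boldsymbol{\lambda}$ can be helpful when the order number is even. Therefore, we consider both cases: the parameter vector $\boldsymbol{\lambda}$ takes the constant value $\mathbf{1}$ (SHFMs), or $\boldsymbol{\lambda}$ is estimated as parameters (SH).

Strong hierarchy is guaranteed in the parameters of SHFMs under two assumptions. For a pair of features $(x_i, x_j)$, given $W_{ij} = \langle \tilde{\mathbf{P}}_{i,:}, \tilde{\mathbf{P}}_{j,:} \rangle \neq 0$, we can infer that $\tilde{\mathbf{P}}_{i,:} \neq \mathbf{0}$ and $\tilde{\mathbf{P}}_{j,:} \neq \mathbf{0}$. Further, we can conclude $w_i \neq 0$ ($w_j \neq 0$) by $w_i = \langle \tilde{\mathbf{P}}_{i,:}, \tilde{\mathbf{P}}_{0,:} \rangle$ ($w_j = \langle \tilde{\mathbf{P}}_{j,:}, \tilde{\mathbf{P}}_{0,:} \rangle$) and the two assumptions. Therefore, strong hierarchy is guaranteed by $W_{ij} \neq 0 \Rightarrow w_i \neq 0$ and $w_j \neq 0$. This means that the interaction effect between $x_i$ and $x_j$ will be included in the model only if both of their main effects are. We justify the second assumption of Proposition 5 by experiments in Section 4.

Next, we state Proposition 3.1 about the time complexity of making a prediction with SH (5). It follows the conclusion from [20, 3] that the time complexity of making a prediction with FMs and $\mathcal{A}^2$ can be reduced from $O(kd^2)$ to $O(kd)$.

Proposition 3.1. The time complexity of making a prediction with SH is $O(kd)$.

The proof of Proposition 3.1 can be found in the Appendix. Experimental results showing the linear time complexity of the proposed models can be found in Section 4. It is worth noting that if $\tilde{x}_i = 0$, no computation is needed for the $i$th dimension, as all terms w.r.t. $\tilde{x}_i$ would be $0$. Therefore, when the data is sparse, we can write the right-hand side of (5) as:

$$b + \frac{1}{2} \sum_{s=1}^{k} \lambda_s \left[ \Big( \sum_{i \in \mathcal{I}} \tilde{P}_{is} \tilde{x}_i \Big)^2 - \sum_{i \in \mathcal{I}} \big( \tilde{P}_{is} \tilde{x}_i \big)^2 \right] \tag{6}$$

where $\mathcal{I} = \{ i \mid \tilde{x}_i \neq 0 \}$ is the index set of nonzero entries of the concatenated feature vector $\tilde{\mathbf{x}}$. In this way, the time complexity can be reduced to $O(k|\mathcal{I}|)$, where $|\cdot|$ denotes the cardinality of a set.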
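The sparse prediction rule can be sketched as follows. This is an assumption-laden illustration rather than the authors' code: the function names are hypothetical, `P_tilde` is assumed to be a list of $d+1$ rows of length $k$ whose row 0 holds the context latent factor, and a sample is assumed to be a dict mapping 1-based feature indices to values.

```python
def sh_predict(b, lam, P_tilde, x_sparse):
    """Prediction with a context dimension, touching only nonzero features.

    Cost is O(k * |I|), where I is the set of nonzero indices plus the context index 0.
    """
    active = {0: 1.0}  # constant context feature x_0 = 1
    active.update({i: v for i, v in x_sparse.items() if v != 0})
    y = b
    for s, lam_s in enumerate(lam):
        terms = [P_tilde[i][s] * v for i, v in active.items()]
        total = sum(terms)
        # second-order ANOVA kernel restricted to the active indices
        y += lam_s * 0.5 * (total * total - sum(t * t for t in terms))
    return y


def implied_main_effect(lam, P_tilde, i):
    """Main effect of feature i induced by its interaction with the context row."""
    return sum(l * P_tilde[i][s] * P_tilde[0][s] for s, l in enumerate(lam))
```

Because the main effect of feature $i$ is produced by the same row $\tilde{\mathbf{P}}_{i,:}$ that produces its interactions, zeroing that row removes the feature from the model entirely, which is the mechanism behind the hierarchy argument above.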

3.2 Learning SHFMs and SH

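Because SHFMs is a special case of SH, we focus on learning SH. We only discuss how to learn $\tilde{\mathbf{P}}$, as estimating the bias $b$ and the weight vector $\boldsymbol{\lambda}$ over the latent dimensions is trivial compared to $\tilde{\mathbf{P}}$. For the proposed models, given any loss function $\mathcal{L}$ convex in the predicted label $\hat{y}$, we show that it is also convex along each element of $\tilde{\mathbf{P}}$. This is done by demonstrating that the model equation of our proposed models is affine in each row of the factorized parameter matrix (Proposition 3.2).

Theorem 3.2. The loss function $\mathcal{L}$ is convex in each element of the factorized parameter matrix $\tilde{\mathbf{P}}$, assuming $\mathcal{L}$ is convex in $\hat{y}$.

We start with Proposition 3.2, which enables us to prove Theorem 3.2 later.

Proposition 3.2. $\hat{y}$ is an affine function of $\tilde{\mathbf{P}}_{i,:}$.

Considering $\tilde{\mathbf{P}}_{i,:}$ as the variable, we analyze the interaction effects $\sum_{s=1}^{k} \lambda_s \mathcal{A}^2(\tilde{\mathbf{P}}_{:,s}, \tilde{\mathbf{x}})$. In [3], Blondel et al. concluded that multi-linearity is a key property of the ANOVA kernel, which can be written as:

$$\mathcal{A}^m(\mathbf{p}, \mathbf{x}) = \mathcal{A}^m(\mathbf{p}_{\neg j}, \mathbf{x}_{\neg j}) + p_j x_j\, \mathcal{A}^{m-1}(\mathbf{p}_{\neg j}, \mathbf{x}_{\neg j})$$

where $\mathbf{p}_{\neg j}$ and $\mathbf{x}_{\neg j}$ denote $\mathbf{p}$ and $\mathbf{x}$ with their $j$th elements removed. With this property, by only considering $\tilde{\mathbf{P}}_{i,:}$ (the $i$th row of the factorized parameter matrix for interaction effects) as a variable while treating the other rows as constants, we analyze the second term on the right-hand side of (5):

$$\sum_{s=1}^{k} \lambda_s \mathcal{A}^2(\tilde{\mathbf{P}}_{:,s}, \tilde{\mathbf{x}}) = \sum_{s=1}^{k} \lambda_s \mathcal{A}^2(\tilde{\mathbf{P}}_{\neg i,s}, \tilde{\mathbf{x}}_{\neg i}) + \sum_{s=1}^{k} \lambda_s \tilde{x}_i\, \mathcal{A}^1(\tilde{\mathbf{P}}_{\neg i,s}, \tilde{\mathbf{x}}_{\neg i})\, \tilde{P}_{is}$$

where $\mathcal{A}^2(\tilde{\mathbf{P}}_{\neg i,s}, \tilde{\mathbf{x}}_{\neg i})$ and $\mathcal{A}^1(\tilde{\mathbf{P}}_{\neg i,s}, \tilde{\mathbf{x}}_{\neg i})$ are constants. Therefore, the first sum and the coefficients $\lambda_s \tilde{x}_i \mathcal{A}^1(\tilde{\mathbf{P}}_{\neg i,s}, \tilde{\mathbf{x}}_{\neg i})$ are both constants, and $\hat{y}$ is an affine function of $\tilde{\mathbf{P}}_{i,:}$. This completes the proof of Proposition 3.2.

As assumed, the loss function $\mathcal{L}$ is convex in $\hat{y}$ (e.g. mean squared error, sigmoid cross entropy etc.). Then, according to Proposition 3.2, we know that $\hat{y}$ is affine in each row of the factorized parameter matrix, and thus also affine in each element ($\tilde{P}_{is}$). Hence, the loss function is a composition of a convex function with an affine function in every $\tilde{P}_{is}$, which implies that $\mathcal{L}$ is convex in each element of the factorized parameter matrix $\tilde{\mathbf{P}}$. This completes the proof. With Theorem 3.2, we conclude that the loss function can be optimized efficiently with per-coordinate algorithms. Here, we derive Algorithm 1 to estimate $\tilde{\mathbf{P}}$ for SH based on the FTRL-Proximal algorithm [13, 15].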
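The affine property can be sanity-checked numerically with a second difference, which vanishes for any affine function of one variable. The sketch below is illustrative only: the helper names are hypothetical, `x` is assumed to be a dense feature vector that already includes the context entry `x[0] = 1`, and `P` holds one row per entry of `x`.

```python
def predict(b, lam, P, x):
    """b + sum_s lam[s] * A^2(P[:, s], x) for a dense, context-augmented x."""
    y = b
    for s, lam_s in enumerate(lam):
        terms = [P[i][s] * x[i] for i in range(len(x))]
        total = sum(terms)
        y += lam_s * 0.5 * (total * total - sum(t * t for t in terms))
    return y


def along_coordinate(b, lam, P, x, i, s):
    """Return the prediction as a function of the single parameter P[i][s]."""
    def f(v):
        Q = [row[:] for row in P]  # copy so the caller's matrix is untouched
        Q[i][s] = v
        return predict(b, lam, Q, x)
    return f


def second_difference(f, a, h=1.0):
    """f(a + h) - 2 f(a) + f(a - h): zero for affine f, non-negative for convex f."""
    return f(a + h) - 2.0 * f(a) + f(a - h)
```

Applying `second_difference` to the prediction along any single coordinate gives zero up to rounding, while applying it to a squared loss of that prediction gives a non-negative value, in line with the per-coordinate convexity argument.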

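FTRL-Proximal for SH. We first derive the per-coordinate FTRL-Proximal algorithm with $\ell_1$ and $\ell_2$ regularization for SH. We write $\lambda_{\ell_1}$ and $\lambda_{\ell_2}$ for the two regularization scales to distinguish them from the kernel weights $\lambda_s$. When the $t$th sample is received by the model, the algorithm performs the following implicit update:

$$\tilde{P}^{t+1}_{is} = \arg\min_{P} \left( g^{1:t}_{is} P + \frac{1}{2} \sum_{\tau=1}^{t} \sigma^{\tau}_{is} \big( P - \tilde{P}^{\tau}_{is} \big)^2 + \lambda_{\ell_1} |P| + \frac{\lambda_{\ell_2}}{2} P^2 \right) \tag{7}$$

where $g^{1:t}_{is} = \sum_{\tau=1}^{t} g^{\tau}_{is}$, $\sum_{\tau=1}^{t} \sigma^{\tau}_{is} = 1 / \eta^{t}_{is}$, and $\eta^{t}_{is}$ is the non-increasing learning rate; we describe how to compute them later in this section. Here, we state and prove Theorem 7. Theorem 7 is an important property of the FTRL-Proximal algorithm that shows its efficiency.

Theorem 7. In the unconstrained minimization problem (7), each factorized parameter $\tilde{P}_{is}$ has a closed-form solution.

To solve Eq. 7 for each $\tilde{P}_{is}$ in closed form, we expand the squares, drop the terms that do not depend on $P$, and reformulate it as:

$$\tilde{P}^{t+1}_{is} = \arg\min_{P} \left( z^{t}_{is} P + \frac{1}{2} \Big( \frac{1}{\eta^{t}_{is}} + \lambda_{\ell_2} \Big) P^2 + \lambda_{\ell_1} |P| \right) \tag{8}$$

where $z^{t}_{is} = g^{1:t}_{is} - \sum_{\tau=1}^{t} \sigma^{\tau}_{is} \tilde{P}^{\tau}_{is}$ and $\eta^{t}_{is}$ is a hyper-parameter, namely the per-coordinate learning rate. So the right-hand side of Eq. 8 matches the form of the soft-thresholding operator [5]:

$$S_{\lambda}(a) = \arg\min_{u} \left( \frac{1}{2} (u - a)^2 + \lambda |u| \right) = \operatorname{sgn}(a) \max(|a| - \lambda, 0)$$

With $a = -z^{t}_{is} / (1/\eta^{t}_{is} + \lambda_{\ell_2})$, $\lambda = \lambda_{\ell_1} / (1/\eta^{t}_{is} + \lambda_{\ell_2})$, and the fact that $1/\eta^{t}_{is} + \lambda_{\ell_2} > 0$, we have:

$$\tilde{P}^{t+1}_{is} = \begin{cases} 0 & \text{if } |z^{t}_{is}| \le \lambda_{\ell_1} \\ -\Big( \dfrac{1}{\eta^{t}_{is}} + \lambda_{\ell_2} \Big)^{-1} \big( z^{t}_{is} - \operatorname{sgn}(z^{t}_{is}) \lambda_{\ell_1} \big) & \text{otherwise} \end{cases} \tag{9}$$

where $\operatorname{sgn}(z) = 1$ for $z > 0$ and $\operatorname{sgn}(z) = -1$ for $z < 0$. With (9), the proof is completed. Using the chain rule, the gradient is computed as:

$$g^{t}_{is} = \frac{\partial \mathcal{L}}{\partial \hat{y}^{t}} \frac{\partial \hat{y}^{t}}{\partial \tilde{P}_{is}} = \frac{\partial \mathcal{L}}{\partial \hat{y}^{t}}\, \lambda_s \tilde{x}^{t}_i \Big( \sum_{j \in \mathcal{I}} \tilde{P}^{t}_{js} \tilde{x}^{t}_j - \tilde{P}^{t}_{is} \tilde{x}^{t}_i \Big) \tag{10}$$

where $\mathcal{I} = \{ i \mid \tilde{x}^{t}_i \neq 0 \}$. As shown in (9), $\eta^{t}_{is}$ plays the role of a per-coordinate learning rate which controls the magnitude of $\tilde{P}^{t+1}_{is}$. Following [15], with positive hyper-parameters $\alpha$ and $\beta$, we set it as $\eta^{t}_{is} = \alpha / \big( \beta + \sqrt{\sum_{\tau=1}^{t} (g^{\tau}_{is})^2} \big)$. Therefore, the same initial learning rate $\alpha / \beta$ is used for each coordinate. According to (10), $g^{t}_{is} = 0$ when $\tilde{x}^{t}_i = 0$, which means that if the $i$th feature does not occur, then the gradient of each element in $\tilde{\mathbf{P}}_{i,:}$ is zero. Then, with the sum of squared gradients in the denominator of $\eta^{t}_{is}$, the more often the $i$th feature is found in the training samples, the smaller $\eta^{t}_{is}$ is likely to be. According to [10], this imposes Frequency Adaptive Regularization (FAR) on the learning process. FAR refers to applying a larger learning rate to parameters corresponding to infrequent features. The FAR method has been shown to be effective in improving a model’s generalization error. To explain FAR with our running example: if a model is already familiar with the retweet time pattern of users who describe themselves as "data mining researcher" or of tweets related to "SDM", further observations with these two features do not demand drastic changes.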
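The closed-form update and the per-coordinate learning rate can be written as three scalar helpers. This is a sketch following the standard FTRL-Proximal recipe of [15] rather than the authors' code: the function names are hypothetical, `z` is the accumulated adjusted gradient of one coordinate, `n` is its accumulated squared gradient, and `l1`, `l2` stand for the two regularization scales.

```python
import math


def learning_rate(alpha, beta, n):
    """Per-coordinate rate alpha / (beta + sqrt(n)); equals alpha / beta when n = 0."""
    return alpha / (beta + math.sqrt(n))


def ftrl_weight(z, n, alpha, beta, l1, l2):
    """Closed-form minimizer: soft-threshold z by l1, then rescale."""
    if abs(z) <= l1:
        return 0.0  # the L1 term dominates, so the coordinate is exactly zero
    sign = 1.0 if z > 0 else -1.0
    return -(z - sign * l1) / (1.0 / learning_rate(alpha, beta, n) + l2)


def ftrl_accumulate(z, n, g, w, alpha):
    """Fold a new gradient g (taken at weight w) into the accumulators z and n."""
    sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / alpha
    return z + g - sigma * w, n + g * g
```

A coordinate whose feature never occurs keeps `n = 0` and therefore the largest learning rate, while frequently updated coordinates see their rate shrink as `n` grows.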

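Input: $\alpha$, $\beta$, $\lambda_{\ell_1}$, $\lambda_{\ell_2}$
Init: $\mathbf{Z}$, $\mathbf{N}$
1:  for $t = 1$ to $T$ do
2:     Receive the sample $\mathbf{x}^t$, concatenate it as $\tilde{\mathbf{x}}^t$, let $\mathcal{I} = \{ i \mid \tilde{x}^t_i \neq 0 \}$
3:     for $i \in \mathcal{I}$ do
4:         for $s = 1$ to $k$ do
5:            Compute $\tilde{P}^t_{is}$ by (9)
6:         end for
7:     end for
8:     Compute prediction $\hat{y}^t$ by (5) with $\tilde{\mathbf{P}}^t$
9:     Observe label $y^t$
10:     for $s = 1$ to $k$ do
11:         Compute $\sum_{j \in \mathcal{I}} \tilde{P}^t_{js} \tilde{x}^t_j$
12:         for $i \in \mathcal{I}$ do
13:            Compute $g^t_{is}$ by (10)
14:            $\sigma^t_{is} = \frac{1}{\alpha} \big( \sqrt{n_{is} + (g^t_{is})^2} - \sqrt{n_{is}} \big)$
15:            $z_{is} \leftarrow z_{is} + g^t_{is} - \sigma^t_{is} \tilde{P}^t_{is}$
16:            $n_{is} \leftarrow n_{is} + (g^t_{is})^2$
17:         end for
18:     end for
19:  end for
Algorithm 1 FTRL-Proximal for SH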
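One iteration of the loop in Algorithm 1 can be sketched in plain Python. Several choices here are assumptions and not taken from the source: the function name is hypothetical, the loss is fixed to a squared error on a single sample, the accumulator updates follow the standard FTRL-Proximal recipe of [15], and `Z` and `N` are lists of $d+1$ rows of length $k$ with row 0 reserved for the context dimension.

```python
import math


def ftrl_sh_step(Z, N, x_sparse, y, lam, b, alpha, beta, l1, l2):
    """Run one per-coordinate update in place on Z and N; return the prediction."""
    k = len(lam)
    active = {0: 1.0}  # context feature
    active.update({i: v for i, v in x_sparse.items() if v != 0})

    # Closed-form weights, computed only for rows of active features.
    P = {}
    for i in active:
        row = []
        for s in range(k):
            z, n = Z[i][s], N[i][s]
            if abs(z) <= l1:
                row.append(0.0)
            else:
                sign = 1.0 if z > 0 else -1.0
                row.append(-(z - sign * l1) / ((beta + math.sqrt(n)) / alpha + l2))
        P[i] = row

    # Per-factor sums are computed once and reused for prediction and gradients.
    sums = [sum(P[i][s] * v for i, v in active.items()) for s in range(k)]
    y_hat = b
    for s in range(k):
        squares = sum((P[i][s] * v) ** 2 for i, v in active.items())
        y_hat += lam[s] * 0.5 * (sums[s] ** 2 - squares)

    dloss = 2.0 * (y_hat - y)  # derivative of the assumed loss (y_hat - y)^2
    for s in range(k):
        for i, v in active.items():
            g = dloss * lam[s] * v * (sums[s] - P[i][s] * v)
            sigma = (math.sqrt(N[i][s] + g * g) - math.sqrt(N[i][s])) / alpha
            Z[i][s] += g - sigma * P[i][s]
            N[i][s] += g * g
    return y_hat
```

Rows of features that are absent from the sample are never read or written, which is what keeps the per-iteration cost proportional to the number of nonzero features. Note also that if every entry of `Z` starts within the threshold `l1`, all weights and therefore all gradients are zero, so in this sketch `Z` needs a nonzero initialization for learning to start.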

Analysis of Algorithm 1. Here we state the complexity of Algorithm 1 in the following proposition: the time and space complexity of Algorithm 1 are $O(k|\mathcal{I}|T)$ and $O(kd)$, respectively, where $\mathcal{I}$ is the set of nonzero feature indices of the current sample. First, we discuss the time complexity. For each iteration $t$, $k|\mathcal{I}|$ elements of $\tilde{\mathbf{P}}$ are first computed in line 5 by Eq. 9 in $O(k|\mathcal{I}|)$: given $z_{is}$ and $n_{is}$, it takes $O(1)$ to compute each $\tilde{P}_{is}$. Then, evaluating (5) in line 8 also takes $O(k|\mathcal{I}|)$, as shown in (6). The algorithm computes $\sum_{j \in \mathcal{I}} \tilde{P}_{js} \tilde{x}_j$ for each $s$ in line 11, which overall takes $O(k|\mathcal{I}|)$. By doing this, we avoid repeating the computation of this sum in the inner loop for each $i$. Finally, similar to line 5, each line from line 13 to 16 takes $O(1)$ for all $i$ and $s$. With this analysis, we conclude that the time complexity of Algorithm 1 is $O(k|\mathcal{I}|T)$ with $T$ iterations. On the other hand, in terms of space complexity, the two matrices $\mathbf{Z}$ and $\mathbf{N}$ need to be stored in memory. Besides them, each iteration implicitly also requires storing the rows of $\tilde{\mathbf{P}}$ indexed by $\mathcal{I}$. So, the space complexity of Algorithm 1 is $O(kd)$. It is worth mentioning that the space complexity of Algorithm 1 cannot be reduced to $O(k|\mathcal{I}|)$, because elements in both $\mathbf{Z}$ and $\mathbf{N}$ accumulate the impact of gradients from all previous iterations, and the set $\mathcal{I}$ can be different for each iteration.

4 Experiments

In this section, we start with a description of the datasets used to evaluate the proposed models. Then, we describe our experimental setup. Finally, we report experimental results on two aspects: effectiveness and sensitivity.

4.1 Datasets

In Table 1, we summarize statistics of datasets with three quantities: the dimensionality of data, the number of training and testing samples.

Dataset Dimension Training Testing
WDYR (cold-start) 24,025 78,738 9,186
E2006-tfidf 150,360 16,087 3,308
Table 1: Statistics of datasets

WDYR. We collect the When Do You Retweet (WDYR) dataset and share a subset of it. The WDYR dataset is a collection of retweets posted from June to November 2016 and the related user profiles. The task is to predict how much time it takes for a certain user to retweet a particular original tweet. In this dataset, each sample represents a retweet labeled by $\Delta t = t_r - t_o$, where $t_r$ and $t_o$ are the times at which the retweet and the original tweet were posted, respectively. For each user, we only consider her earliest retweet of a given tweet. We categorize $\Delta t$ (in seconds) into five classes. A retweet can be considered as a result of interactions between the user and the original tweet. Thus, we concatenate the user profile features and those of the original tweet. The user profile includes: user id, user description, create time, favorite count, followers count, friends count and tweet count. Tweet attributes are: original tweet id, original tweet time ($t_o$) and tweet text. We treat each attribute as a field and apply one-hot encoding to each field (see Appendix). The training set consists of retweets of the earliest original tweets. Thus, the task can be interpreted as predicting $\Delta t$ for unseen original tweets whose retweets appear only in the testing set. To the best of our knowledge, this is the first study of the cold-start problem w.r.t. information diffusion.
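Field-wise one-hot encoding of concatenated user and tweet attributes can be sketched as follows. Everything here is illustrative: the function names and the example field names are hypothetical, each sample is assumed to be a dict from field name to a categorical value, and index 0 is left free on the assumption that it is reserved for a context dimension.

```python
def build_field_vocab(samples):
    """Assign one feature index to every (field, value) pair seen in training."""
    vocab = {}
    for sample in samples:
        for field, value in sample.items():
            # indices start at 1; index 0 is kept free for the context feature
            vocab.setdefault((field, value), len(vocab) + 1)
    return vocab


def encode(sample, vocab):
    """Sparse one-hot vector as {index: 1.0}; unseen (field, value) pairs are dropped."""
    return {vocab[(f, v)]: 1.0 for f, v in sample.items() if (f, v) in vocab}
```

In a cold-start setting, the id of an unseen original tweet has no entry in the vocabulary, so the encoded test sample carries only the user-side features and whatever tweet attributes were already observed in training.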

E2006-tfidf. E2006-tfidf [9] is a subset of the 10-K corpus of reports from thousands of publicly traded companies in the United States. The target is to predict the logarithm of stock return volatility (log-volatility), which is often used in the finance industry to measure risk. As log-volatility takes continuous values, this is a regression task. Features comprise tf-idf of unigrams and the volatility in the past 12 months.

4.2 Experimental Setup

In the experiments, models are trained with mini-batches in each iteration, and we then evaluate the performance on the complete testing set. Generalized linear models are trained with the FTRL-Proximal algorithm for linear models [13, 15]. Our proposed models are trained with Algorithm 1. FMs and $\mathcal{A}^2$ are trained with a variant of Algorithm 1 without the context dimension. For hyper-parameters, grid search is carried out. The domain of the grid search for each hyper-parameter is shown in the Appendix.

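Loss functions. In the experiments for multi-class classification, the loss function we use is the softmax cross-entropy: $\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y}) = -\sum_{c=1}^{C} y_c \log \big( \exp(\hat{y}_c) / \sum_{c'=1}^{C} \exp(\hat{y}_{c'}) \big)$, where $\hat{y}_c$ is the output of a model for the $c$th class, the vector $\mathbf{y}$ is the one-hot encoding of the label and $C$ is the number of classes. For regression tasks, we choose mean squared error (MSE) as the loss function: $\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$.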
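Both losses are short enough to write out. The sketch below assumes raw per-class model outputs and an integer class index for the label (equivalent to a one-hot vector); the function names are hypothetical, and the maximum is subtracted before exponentiation purely for numerical stability.

```python
import math


def softmax_cross_entropy(scores, label):
    """-log softmax(scores)[label], computed with the log-sum-exp trick."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(v - m) for v in scores))
    return log_norm - scores[label]


def mean_squared_error(preds, targets):
    """Average of squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
```

With five classes and identical scores, the cross-entropy equals $\log 5$ regardless of the label, which is a convenient baseline when reading training-loss values for a five-class task.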

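Evaluation metrics. Micro-F1 and macro-F1 scores are used to evaluate model performance on multi-class classification tasks. They are defined as the harmonic mean of the micro-averaged and of the macro-averaged precision and recall, respectively. Formally:

$$P_{micro} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} (TP_c + FP_c)}, \quad R_{micro} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} (TP_c + FN_c)}, \quad \text{micro-F1} = \frac{2\, P_{micro} R_{micro}}{P_{micro} + R_{micro}}$$

$$P_{macro} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c}, \quad R_{macro} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c}, \quad \text{macro-F1} = \frac{2\, P_{macro} R_{macro}}{P_{macro} + R_{macro}}$$

where $TP_c$, $FN_c$ and $FP_c$ refer to the true positives, false negatives and false positives for class $c$, and $C$ denotes the number of classes. For regression, we apply root mean squared error (RMSE) and mean absolute error (MAE) to measure the difference between prediction and ground truth.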
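The two scores, read as harmonic means of averaged precision and recall, can be computed from per-class counts. This sketch uses hypothetical names, assumes labels are integers in `range(num_classes)`, and treats a ratio with a zero denominator as 0, which is a convention chosen here rather than one stated in the text.

```python
def f1_scores(y_true, y_pred, num_classes):
    """Return (micro_f1, macro_f1) from integer label lists."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def safe_div(a, b):
        return a / b if b else 0.0

    def harmonic_mean(p, r):
        return safe_div(2.0 * p * r, p + r)

    micro_p = safe_div(sum(tp), sum(tp) + sum(fp))
    micro_r = safe_div(sum(tp), sum(tp) + sum(fn))
    macro_p = sum(safe_div(tp[c], tp[c] + fp[c]) for c in range(num_classes)) / num_classes
    macro_r = sum(safe_div(tp[c], tp[c] + fn[c]) for c in range(num_classes)) / num_classes
    return harmonic_mean(micro_p, micro_r), harmonic_mean(macro_p, macro_r)
```

Note that this macro-F1 is the harmonic mean of macro-averaged precision and recall, which generally differs from the average of per-class F1 scores that some libraries report under the same name.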

Baseline models. In our experiments, we compare our proposed models, trained with Algorithm 1, against generalized linear models (logistic regression for classification and linear regression for regression), FMs [20] and $\mathcal{A}^2$ [3] on real-world datasets.

4.3 Effectiveness Analysis

From this point on, experimental results are discussed. We start with the effectiveness of each model in the prediction tasks for the two datasets mentioned in Section 4.1. While we show the performance of models over multiple epochs, we only compare the best performance of each model, which might not occur in the same epoch.

WDYR. To evaluate models on the WDYR dataset, we use the hyper-parameter values selected by grid search. Each model is trained for 10 epochs with batch size set to 16. As shown in Figure 1(a), for macro-F1 scores, SHFMs and SH outperform FMs and $\mathcal{A}^2$, respectively. Figure 1(b) shows that SH is better than $\mathcal{A}^2$ w.r.t. micro-F1 score. These improvements are not trivial, as the factorized baseline itself outperforms logistic regression only by a narrow margin in micro-F1 and macro-F1. Even though FMs can achieve performance comparable to that of SHFMs measured by micro-F1, we can conclude that our proposed models outperform their corresponding baselines for the task of cold-start prediction on WDYR.

E2006-tfidf. For the E2006-tfidf dataset, we again use the hyper-parameter values selected by grid search. The training phase lasts 20 epochs for each model in a mini-batch style with batch size 64. As shown in Fig. 1(c), in terms of RMSE, SHFMs and SH outperform FMs and $\mathcal{A}^2$, respectively. Similarly, measured by MAE (see Fig. 1(d)), SHFMs result in less error than FMs and SH leads to less error than $\mathcal{A}^2$. In brief, for the E2006-tfidf dataset, the experimental results show that our proposed hierarchical models outperform their counterparts with unstructured parameters. For the task of predicting stock return volatility, SHFMs again achieve the best generalization error. Similar to the results in [3], Figure 1 shows that fixing $\boldsymbol{\lambda}$ leads to better predictions on the testing sets than fitting it as parameters. The explanation for this could be that fitting the low-dimensional vector $\boldsymbol{\lambda}$ increases the chance of overfitting.

Figure 1: Prediction performance for the two tasks. Panels: (a) Macro-F1 (WDYR), (b) Micro-F1 (WDYR), (c) RMSE (E2006), (d) MAE (E2006).

4.4 Sensitivity Analysis

Experiments are carried out to understand how hyper-parameters affect the proposed models. Experimental results show that our models trained with Algorithm 1 can perform well even with high sparsity in the study of the $\ell_1$ regularization scale. In addition, the run time analysis of models with different numbers of latent dimensions $k$ provides justification for Proposition 3.1 and 1. To save space, we may report the experimental results for only one of our proposed models on one of the datasets, because the experiments for the other model or dataset yield similar results.

The scale of $\ell_1$ regularization ($\lambda_{\ell_1}$). The scale of $\ell_1$ regularization ($\lambda_{\ell_1}$) in Algorithm 1 controls the trade-off between minimizing training loss and maximizing model sparsity. As shown in (9), given a certain coordinate $(i, s)$ and a sample $\mathbf{x}^t$, if $|z^t_{is}|$ is smaller than $\lambda_{\ell_1}$, the corresponding parameter becomes zero after the update. Therefore, as the value of $\lambda_{\ell_1}$ increases, the proposed models become sparser. To study the influence of $\lambda_{\ell_1}$, we train our proposed models with various values of $\lambda_{\ell_1}$ (see Appendix). We set the other hyper-parameters as described in 4.3. From the results shown in Table 2, we find that SHFMs and SH trained with Algorithm 1 can be both sparse and effective. In detail, for the WDYR dataset, SHFMs and SH achieve the best testing error with sparsity equal to 0.631 and 0.704, respectively. For the E2006 dataset, although the models that achieve the best RMSE and MAE on the testing set are dense, the very sparse SHFMs (0.999 sparsity) and SH (0.937 sparsity) do not cause a significant increase in RMSE or MAE on testing samples.

The number of latent dimensions ($k$). The value of $k$ can affect the proposed models in terms of both effectiveness and efficiency. To study the impact of $k$ on the proposed models, we fix the other parameters as mentioned in 4.3 and carry out a series of experiments with different values of $k$. Training loss, testing errors and GPU time per epoch are measured for each value of $k$. Each epoch includes training with all training samples and testing on both training and testing sets. Intuitively, as the number of latent dimensions increases, the set of functions that can be represented by both SHFMs and SH becomes richer. Thus, the minimal training loss that our proposed models can reach within a certain number of epochs becomes smaller. But this also leads to more expensive computation and may cause overfitting. The experimental results in Fig. 2 show that with a larger number of latent dimensions, the proposed models deliver smaller training loss, testing RMSE and MAE. However, a larger value of $k$ can also cause overfitting. For the larger values of $k$, while the training loss is still dropping, the testing error (measured by RMSE and MAE) starts to increase after several epochs for both proposed models. In Table 3, we show that the relationship between $k$ and GPU time per epoch is close to linear. This supports Proposition 3.1 and 1.

In addition, we monitor whether the assumptions of Proposition 5 are held for each value of and in grid search (see Appendix) and find the probability of cases with either or is negligible compared to that of which can reach more than 0.999 when is large. Therefore, these observations provide justification for our assumptions.

Figure 2: The impact of the number of latent dimensions ($k$) on training loss, RMSE and MAE of SHFMs and SH for the E2006-tfidf dataset.
(a) WDYR dataset
Model   Sparsity   Loss    Mic-F1   Mac-F1
SHFMs   0.630      0.331   0.792    0.776
SHFMs   0.631      0.337   0.801    0.787
SHFMs   0.854      0.839   0.758    0.728
SHFMs   0.999      1.550   0.593    0.500
SH      0.631      0.604   0.793    0.777
SH      0.704      0.651   0.798    0.781
SH      0.999      1.550   0.593    0.500
SH      0.999      1.550   0.593    0.500

(b) E2006 dataset
Model   Sparsity   Loss      RMSE    MAE
SHFMs   0.003      1590.85   0.384   0.466
SHFMs   0.018      1591.93   0.385   0.466
SHFMs   0.137      1591.69   0.385   0.466
SHFMs   0.582      1596.32   0.386   0.467
SHFMs   0.999      1632.87   0.386   0.470
SHFMs   0.999      7738.07   2.561   1.509
SH      0.002      1619.3    0.386   0.467
SH      0.023      1619.3    0.387   0.467
SH      0.192      1620.12   0.387   0.467
SH      0.675      1627.47   0.388   0.468
SH      0.937      1650.46   0.390   0.470
SH      0.995      1732.65   0.423   0.493
Table 2: Sparsity and performance with various scales of $\ell_1$ regularization
WDYR E2006 WDYR E2006
49.58 5.23 50.53 5.42
76.51 7.13 67.82 7.28
142.49 16.97 130.50 18.09
337.44 38.87 342.54 39.54
Table 3: GPU time per epoch for different values of $k$

5 Related Work

Factorization Machines and ANOVA kernel regression. In [20], Rendle proposed FMs and showed that it can outperform SVM and PITF [22] in various tasks. Variants of FMs have been shown to be effective and efficient in click-through rate prediction [8, 10, 18], recommendation systems [17, 12] and microblog retrieval [19]. Recently, Blondel et al. [3] showed that FMs belong to the class of ANOVA kernel regression.

Hierarchical structures in parameters. In [4], Choi et al. extended Lasso [24] with element-wise regularization for imposing strong hierarchy. In [2], Bien et al. justified weak hierarchy with the assumption that a feature is not any more special than its linear transformation. They also proposed to guarantee weak hierarchy with constraints, but this method is limited to $\ell_1$ regularization. Similarly, Zhong et al. [26] proposed constraints for strong hierarchy and solvers for several types of regularization. In [10], Li et al. treated strong hierarchy in FMs as an instance of structured sparsity [16] and mentioned a method which adds a complicated regularization term. However, solving optimization problems with these constraints or complicated regularization terms is challenging.

Online sparse learning. In [28], Zinkevich et al. showed that online gradient descent can be considered as a special case of mirror descent [1], which states the closed-form update implicitly as an optimization problem. Following this, for learning sparse models on large-scale data, algorithms such as COMID [6] and the FTRL class of algorithms were proposed. In [14, 13], FTRL-Proximal is claimed to be the most efficient algorithm amongst them in terms of producing sparsity.

6 Conclusion

In this work, we propose SHFMs and SH, where strong hierarchy is imposed without extra constraints or complicated regularization terms. An FTRL-Proximal algorithm is also derived for learning these models. Analysis and experiments show that training the models with the algorithm, or making a prediction with them, takes only linear time and space. Evaluated on two data mining tasks, the proposed models outperform FMs, $\mathcal{A}^2$ and the generalized linear models. Furthermore, our models can reach high sparsity without significant loss of performance when trained by the algorithm we derive. For future work, we plan to apply these models to other challenging data mining tasks such as recommendation systems and heterogeneous data mining.


  • [1] A. Beck and M. Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, Operations Research Letters, 31 (2003), pp. 167–175.
  • [2] J. Bien, J. Taylor, and R. Tibshirani, A lasso for hierarchical interactions, Annals of statistics, 41 (2013), p. 1111.
  • [3] M. Blondel, M. Ishihata, A. Fujino, and N. Ueda, Polynomial networks and factorization machines: New insights and efficient training algorithms, in Proceedings of The 33rd International Conference on Machine Learning, 2016, pp. 850–858.
  • [4] N. H. Choi, W. Li, and J. Zhu, Variable selection with the strong heredity constraint and its oracle property, Journal of the American Statistical Association, 105 (2010), pp. 354–364.
  • [5] D. L. Donoho and I. M. Johnstone, Adapting to unknown smoothness via wavelet shrinkage, Journal of the american statistical association, 90 (1995), pp. 1200–1224.
  • [6] J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research, 12 (2011), pp. 2121–2159.
  • [7] L. Hong, A. S. Doumith, and B. D. Davison, Co-factorization machines: modeling user interests and predicting individual decisions in twitter, in Proceedings of the sixth ACM international conference on Web search and data mining, ACM, 2013, pp. 557–566.
  • [8] Y. Juan, Y. Zhuang, W.-S. Chin, and C.-J. Lin, Field-aware factorization machines for ctr prediction, in Proceedings of the 10th ACM Conference on Recommender Systems, ACM, 2016, pp. 43–50.
  • [9] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting risk from financial reports with regression, in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2009, pp. 272–280.
  • [10] M. Li, Z. Liu, A. J. Smola, and Y.-X. Wang, Difacto: Distributed factorization machines, in Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, ACM, 2016, pp. 377–386.
  • [11] Y. Liu, J. Wang, and J. Ye, An efficient algorithm for weak hierarchical lasso, ACM Transactions on Knowledge Discovery from Data (TKDD), 10 (2016), p. 32.
  • [12] B. Loni, Y. Shi, M. Larson, and A. Hanjalic, Cross-domain collaborative filtering with factorization machines, in European Conference on Information Retrieval, Springer, 2014, pp. 656–661.
  • [13] B. McMahan, Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization, in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 525–533.
  • [14] H. B. McMahan, A unified view of regularized dual averaging and mirror descent with implicit updates, arXiv preprint arXiv:1009.3240, (2010).
  • [15] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al., Ad click prediction: a view from the trenches, in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2013, pp. 1222–1230.
  • [16] S. Negahban, B. Yu, M. J. Wainwright, and P. K. Ravikumar, A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers, in Advances in Neural Information Processing Systems, 2009, pp. 1348–1356.
  • [17] T. V. Nguyen, A. Karatzoglou, and L. Baltrunas, Gaussian process factorization machines for context-aware recommendations, in Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, ACM, 2014, pp. 63–72.
  • [18] Z. Pan, E. Chen, Q. Liu, T. Xu, H. Ma, and H. Lin, Sparse factorization machines for click-through rate prediction, in Data Mining (ICDM), 2016 IEEE 16th International Conference on, IEEE, 2016, pp. 400–409.
  • [19] R. Qiang, F. Liang, and J. Yang, Exploiting ranking factorization machines for microblog retrieval, in Proceedings of the 22nd ACM international conference on information & knowledge management, ACM, 2013, pp. 1783–1788.
  • [20] S. Rendle, Factorization machines, in Data Mining (ICDM), 2010 IEEE 10th International Conference on, IEEE, 2010, pp. 995–1000.
  • [21] S. Rendle, Social network and click-through prediction with factorization machines, in KDD-Cup Workshop, 2012.
  • [22] S. Rendle and L. Schmidt-Thieme, Pairwise interaction tensor factorization for personalized tag recommendation, in Proceedings of the third ACM international conference on Web search and data mining, ACM, 2010, pp. 81–90.
  • [23] M. Stitson, A. Gammerman, V. Vapnik, V. Vovk, C. Watkins, and J. Weston, Support vector regression with ANOVA decomposition kernels, Advances in kernel methods—Support vector learning, (1999), pp. 285–292.
  • [24] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B (Methodological), (1996), pp. 267–288.
  • [25] L. Xiao, Dual averaging methods for regularized stochastic learning and online optimization, Journal of Machine Learning Research, 11 (2010), pp. 2543–2596.
  • [26] L. W. Zhong and J. T. Kwok, Efficient learning for models with DAG-structured parameter constraints, in Data Mining (ICDM), 2013 IEEE 13th International Conference on, IEEE, 2013, pp. 897–906.
  • [27] Y. Zhu, X. Wang, E. Zhong, N. N. Liu, H. Li, and Q. Yang, Discovering spammers in social networks, in Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
  • [28] M. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, (2003).
  • [29] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67 (2005), pp. 301–320.