Multi-output Polynomial Networks and Factorization Machines

Factorization machines and polynomial networks are supervised polynomial models based on an efficient low-rank decomposition. We extend these models to the multi-output setting, i.e., for learning vector-valued functions, with application to multi-class or multi-task problems. We cast this as the problem of learning a 3-way tensor whose slices share a common basis and propose a convex formulation of that problem. We then develop an efficient conditional gradient algorithm and prove its global convergence, despite the fact that it involves a non-convex basis selection step. On classification tasks, we show that our algorithm achieves excellent accuracy with much sparser models than existing methods. On recommendation system tasks, we show how to combine our algorithm with a reduction from ordinal regression to multi-output classification and show that the resulting algorithm outperforms simple baselines in terms of ranking accuracy.

Authors

• 24 publications
• 22 publications
• 4 publications
• 13 publications
• Polynomial Networks and Factorization Machines: New Insights and Efficient Training Algorithms

Polynomial networks and factorization machines are two recently-proposed...
07/29/2016 ∙ by Mathieu Blondel, et al. ∙ 0

• Global Convergence of Gradient Descent for Asymmetric Low-Rank Matrix Factorization

We study the asymmetric low-rank factorization problem: min_𝐔∈ℝ^m ×...
06/27/2021 ∙ by Tian Ye, et al. ∙ 16

• Convex Factorization Machine for Regression

We propose the convex factorization machine (CFM), which is a convex var...
07/04/2015 ∙ by Makoto Yamada, et al. ∙ 0

• Rank-Based Multi-task Learning for Fair Regression

In this work, we develop a novel fairness learning approach for multi-ta...
09/23/2020 ∙ by Chen Zhao, et al. ∙ 0

Determinantal point processes (DPPs) have received significant attention...
05/24/2018 ∙ by Romain Warlop, et al. ∙ 0

• Supervised Learning with Similarity Functions

We address the problem of general supervised learning when data can only...
10/22/2012 ∙ by Purushottam Kar, et al. ∙ 0

• RaFM: Rank-Aware Factorization Machines

Factorization machines (FM) are a popular model class to learn pairwise ...
05/18/2019 ∙ by Xiaoshuang Chen, et al. ∙ 0

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Interactions between features play an important role in many classification and regression tasks. Classically, such interactions have been leveraged either explicitly, by mapping features to their products (as in polynomial regression), or implicitly, through the use of the kernel trick. While fast linear model solvers have been engineered for the explicit approach low_poly ; coffin , they are typically limited to small numbers of features or low-order feature interactions, due to the fact that the number of parameters that they need to learn scales as , where is the number of features and is the order of interactions considered. Models kernelized with the polynomial kernel do not suffer from this problem; however, the cost of storing and evaluating these models grows linearly with the number of training instances, a problem sometimes referred to as the curse of kernelization budgetpegasos .

Factorization machines (FMs) fm

are a more recent approach that can use pairwise feature interactions efficiently even in very high-dimensional data. The key idea of FMs is to model the weights of feature interactions using a

low-rank matrix. Not only this idea offers clear benefits in terms of model compression compared to the aforementioned approaches, it has also proved instrumental in modeling interactions between categorical

variables, converted to binary features via a one-hot encoding. Such binary features are usually so sparse that many interactions are never observed in the training set, preventing classical approaches from capturing their relative importance. By imposing a low rank on the feature interaction weight matrix, FMs encourage shared parameters between interactions, allowing to estimate their weights even if they never occurred in the training set. This property has been used in recommender systems to model interactions between user variables and item variables, and is the basis of several industrial successes of FMs

fm_param_server ; ffm .

Originally motivated as neural networks with a polynomial activation (instead of the classical sigmoidal or rectifier activations), polynomial networks (PNs)

livni have been shown to be intimately related to FMs and to only subtly differ in the non-linearity they use fm_icml . PNs achieve better performance than rectifier networks on pedestrian detection livni and on dependency parsing dependency_parsing , and outperform kernel approximations such as the Nyström method fm_icml . However, existing PN and FM works have been limited to single-output models, i.e., they are designed to learn scalar-valued functions, which restricts them to regression or binary classification problems.

Our contributions. In this paper, we generalize FMs and PNs to multi-output models, i.e., for learning vector-valued functions, with application to multi-class or multi-task problems.

1) We cast learning multi-output FMs and PNs as learning a 3-way tensor, whose slices share a common basis (each slice corresponds to one output). To obtain a convex formulation of that problem, we propose to cast it as learning an infinite-dimensional but row-wise sparse matrix. This can be achieved by using group-sparsity inducing penalties. (§3)

2) To solve the obtained optimization problem, we develop a variant of the conditional gradient (a.k.a. Frank-Wolfe) algorithm dunn ; revisiting_fw , which repeats the following two steps: i) select a new basis vector to add to the model and ii) refit the model over the current basis vectors. (§4) We prove the global convergence of this algorithm (Theorem 1), despite the fact that the basis selection step is non-convex and more challenging in the shared basis setting. (§5)

3) On multi-class classification tasks, we show that our algorithm achieves comparable accuracy to kernel SVMs but with much more compressed models than the Nyström method. On recommender system tasks, where kernelized models cannot be used (since they do not generalize to unseen user-item pairs), we demonstrate how our algorithm can be combined with a reduction from ordinal regression to multi-output classification and show that the resulting algorithm outperforms single-output PNs and FMs both in terms of root mean squared error (RMSE) and ranking accuracy, as measured by nDCG (normalized discounted cumulative gain) scores. (§6)

2 Background and related work

Notation. We denote the set by . Given a vector , we denote its elements by . Given a matrix , we denote its rows by and its columns by . We denote the norm of by and its norm by . The number of non-zero rows of is denoted by .

Factorization machines (FMs). Given an input vector , FMs predict a scalar output by

 ^yFM\coloneqqwTx+∑i

where contains feature weights and is a low-rank matrix that contains pairwise feature interaction weights. To obtain a low-rank , fm originally proposed to use a change of variable , where (with a rank parameter) and to learn instead. Noting that this quadratic model results in a non-convex problem in , convex_fm ; cfm_yamada proposed to convexify the problem by learning directly but to encourage low rank using a nuclear norm on . For learning, convex_fm proposed a conditional gradient like approach with global convergence guarantees.

Polynomial networks (PNs).

PNs are a recently-proposed form of neural network where the usual activation function is replaced with a squared activation. Formally, PNs predict a scalar output by

 ^yPN\coloneqqwTx+vTσ(Hx)=wTx+k∑r=1vr σ(hTrx), (2)

where (evaluated element-wise) is the squared activation, is the output layer vector, is the hidden layer matrix and is the number of hidden units. Because the r.h.s term can be rewritten as if we set , we see that PNs are clearly a slight variation of FMs and that learning can be recast as learning a low-rank matrix . Based on this observation, livni proposed to use GECO geco , a greedy algorithm for convex optimization with a low-rank constraint, similar to the conditional gradient algorithm. pn_global proposed a learning algorithm for PNs with global optimality guarantees but their theory imposes non-negativity on the network parameters and they need one distinct hyper-parameter per hidden unit to avoid trivial models. Other low-rank polynomial models were recently introduced in sl_tensor_net ; exp_machines but using a tensor network (a.k.a. tensor train) instead of the canonical polyadic (CP) decomposition.

3 A convex formulation of multi-output PNs and FMs

In this section, we generalize PNs and FMs to multi-output problems. For the sake of concreteness, we focus on PNs for multi-class classification. The extension to FMs is straightforward and simply requires to replace by , as noted in fm_icml .

The predictions of multi-class PNs can be naturally defined as , where is the number of classes, and is low-rank. Following fm_icml , we can model the linear term directly in the quadratic term if we augment all data points with an extra feature of value 1, i.e., . We will therefore simply assume henceforth. Our main proposal in this paper is to decompose using a shared basis:

 Wc=HTdiag(v:,c)H=∑kr=1vr,chrhTr∀c∈[m], (3)

where, in neural network terminology, can be interpreted as a hidden layer matrix and as an output layer matrix. Compared to the naive approach of decomposing each as , this reduces the number of parameters from to .

While a nuclear norm could be used to promote a low rank on each , similarly as in convex_fm ; cfm_yamada , this is clearly not sufficient to impose a shared basis. A naive approach would be to use non-orthogonal joint diagonalization as a post-processing. However, because this is a non-convex problem for which no globally convergent algorithm is known beyond_cca , this would result in a loss of accuracy. Our key idea is to cast the problem of learning a multi-output PN as that of learning an infinite but row-wise sparse matrix. Without loss of generality, we assume that basis vectors (hidden units) lie in the unit ball. We therefore denote the set of basis vectors by Let us denote this infinite matrix by (we use a discrete notation for simplicity). We can then write

 (4)

denotes the weights of basis across all classes (outputs). In this formulation, we have and sharing a common basis (hidden units) amounts to encouraging the rows of , , to be either dense or entirely sparse. This can be naturally achieved using group-sparsity inducing penalties. Intuitively, in (3) can be thought as restricted to its row support. Define the training set by and . We then propose to solve the convex problem

 minΩ(U)≤τF(U)\coloneqqn∑i=1ℓ(yi,o(xi;U)), (5)

where

is a smooth and convex multi-class loss function (cf. Appendix

A for three common examples), is a sparsity-inducing penalty and is a hyper-parameter. In this paper, we focus on the (lasso), (group lasso) and penalties for , cf. Table 1. However, as we shall see, solving (5) is more challenging with the and penalties than with the penalty. Although our formulation is based on an infinite view, we next show that has finite row support.

Proposition 1

Finite row support of for multi-output PNs and FMs

Let be an optimal solution of (5), where is one of the penalties in Table 1. Then,
. If , we can tighten this bound to .

Proof is in Appendix B.1. It is open whether we can tighten this result when or .

4 A conditional gradient algorithm with approximate basis vector selection

At first glance, learning with an infinite number of basis vectors seems impossible. In this section, we show how the well-known conditional gradient algorithm dunn ; revisiting_fw combined with group-sparsity inducing penalties naturally leads to a greedy algorithm that selects and adds basis vectors that are useful across all outputs. On every iteration, the conditional gradient algorithm performs updates of the form , where is a step size and is obtained by solving a linear approximation of the objective around the current iterate :

 Δ⋆∈argminΩ(Δ)≤τ⟨Δ,∇F(U(t))⟩=τ⋅argmaxΩ(Δ)≤1⟨Δ,−∇F(U(t))⟩. (6)

Let us denote the negative gradient by for short. Its elements are defined by

 gh,c=−n∑i=1σ(hTxi)∇ℓ(yi,o(xi;U))c, (7)

where is the gradient of w.r.t. (cf. Appendix A

). For ReLu activations, solving (

6) is known to be NP-hard convex_nn_bach . Here, we focus on quadratic activations, for which we will be able to provide approximation guarantees. Plugging the expression of , we get

 gh,c=−hTΓch where Γc\coloneqqXTDcX (PN) or Γc\coloneqq12(XTDcX−Dcn∑i=1diag(xi)2) (FM) (8)

and is a diagonal matrix such that . Let us recall the definition of the dual norm of : . By comparing this equation to (6), we see that is the argument that achieves the maximum in the dual norm , up to a constant factor . It is easy to verify that any element in the subdifferential of , which we denote by , achieves that maximum, i.e., .

Basis selection. As shown in Table 1, elements of (subgradients) are matrices with a single non-zero row indexed by , where is an optimal basis (hidden unit) selected by

 h⋆∈argmaxh∈H∥gh∥p, (9)

and where when , when and when . We call (9) a basis vector selection criterion. Although this selection criterion was derived from the linearization of the objective, it is fairly natural: it chooses the basis vector with largest “violation”, as measured by the norm of the negative gradient row .

Multiplicative approximations. The key challenge in solving (6) or equivalently (9) arises from the fact that has infinitely many rows . We therefore cast basis vector selection as a continuous optimization problem w.r.t. . Surprisingly, although the entire objective (5) is convex, (9) is not. Instead of the exact maximum, we will therefore only require to find a that satisfies

 Ω(^Δ)≤τand⟨^Δ,G⟩≥ν⟨Δ⋆,G⟩, (10)

where is a multiplicative approximation (higher is better). It is easy to verify that this is equivalent to replacing the optimal by an approximate that satisfies .

Sparse case. When , we need to solve

 maxh∈H∥gh∥∞=maxh∈Hmaxc∈[m]|hTΓch|=maxc∈[m]maxh∈H|hTΓch|. (11)

It is well known that the optimal solution of

is the dominant eigenvector of

. Therefore, we simply need to find the dominant eigenvector of each and select as the

with largest singular value

. Using the power method, we can find an that satisfies

 |hTcΓchc|≥(1−ϵ)maxh∈H|hTΓch|, (12)

for some tolerance parameter . The procedure takes time, where is the number of non-zero elements in geco . Taking the maximum w.r.t. on both sides of (12) leads to , where . However, using does not encourage selecting an that is useful for all outputs. In fact, when , our approach is equivalent to imposing independent nuclear norms on .

Group-sparse cases. When or , we need to solve

 maxh∈H∥gh∥22=maxh∈Hf2(h)\coloneqqm∑c=1(hTΓch)2ormaxh∈H∥gh∥1=maxh∈Hf1(h)\coloneqqm∑c=1|hTΓch|, (13)

respectively. Unlike the -constrained case, we are clearly selecting a basis vector with largest violation across all outputs. However, we are now faced with a more difficult non-convex optimization problem. Our strategy is to first choose an initialization which guarantees a certain multiplicative approximation , then refine the solution using a monotonically non-increasing iterative procedure.

Initialization. We simply choose as the approximate solution of the case, i.e., we have

 ∥gh(0)∥∞≥(1−ϵ)maxh∈H∥gh∥∞. (14)

Now, using and , this immediately implies

 ∥gh(0)∥p≥νmaxh∈H∥gh∥p, (15)

with if and if .

Refining the solution. We now apply another instance of the conditional gradient algorithm to solve the subproblem itself, leading to the following iterates:

 h(t+1)=(1−ηt)h(t)+ηt∇fp(h(t))∥∇fp(h(t))∥2, (16)

where . Following (bertsekas, , Section 2.2.2), if we use the Armijo rule to select , every limit point of the sequence is a stationary point of . In practice, we observe that is almost always selected. Note that when and (i.e., single-output case), our refining algorithm recovers the power method. Generalized power methods were also studied for structured matrix factorization journee_generalized ; luss_generalized , but with different objectives and constraints. Since the conditional gradient algorithm assumes a differentiable function, in the case , we replace the absolute function with the Huber function if , otherwise.

Corrective refitting step. After iterations, contains at most non-zero rows. We can therefore always store as (the output layer matrix) and (the basis vectors / hidden units added so far). In order to improve accuracy, on iteration , we can then refit the objective . We consider two kinds of corrective steps, a convex one that minimizes w.r.t. and an optional non-convex one that minimizes w.r.t. both and . Refitting allows to remove previously-added bad basis vectors, thanks to the use of sparsity-inducing penalties. Similar refitting procedures are commonly used in matching pursuit (omp, ). The entire procedure is summarized in Algorithm 1 and implementation details are given in Appendix D.

5 Analysis of Algorithm 1

The main difficulty in analyzing the convergence of Algorithm 1 stems from the fact that we cannot solve the basis vector selection subproblem globally when or . Therefore, we need to develop an analysis that can cope with the multiplicative approximation . Multiplicative approximations were also considered in block_fw but the condition they require is too stringent (cf. Appendix B.2 for a detailed discussion). The next theorem guarantees the number of iterations needed to output a multi-output network that achieves as small objective value as an optimal solution of (5).

Theorem 1

Convergence of Algorithm 1

Assume is smooth with constant . Let be the output after iterations of Algorithm 1 run with constraint parameter . Then, .

In livni , single-output PNs were trained using GECO geco , a greedy algorithm with similar guarantees. However, GECO is limited to learning infinite vectors (not matrices) and it does not constrain its iterates like we do. Hence GECO cannot remove bad basis vectors. The proof of Theorem 1 and a detailed comparison with GECO are given in Appendix B.2. Finally, we note that the infinite dimensional view is also key to convex neural networks convex_nn_bengio ; convex_nn_bach . However, to our knowledge, we are the first to give an explicit multiplicative approximation guarantee for a non-linear multi-output network.

6 Experimental results

6.1 Experimental setup

Datasets. For our multi-class experiments, we use four publicly-available datasets: segment (7 classes), vowel (11 classes), satimage (6 classes) and letter (26 classes) datasets . Quadratic models substantially improve over linear models on these datasets. For our recommendation system experiments, we use the MovieLens 100k and 1M datasets movielens . See Appendix E for complete details.

Model validation. The greedy nature of Algorithm 1 allows us to easily interleave training with model validation. Concretely, we use an outer loop (embarrassingly parallel) for iterating over the range of possible regularization parameters, and an inner loop (Algorithm 1, sequential) for increasing the number of basis vectors. Throughout our experiments, we use 50% of the data for training, 25% for validation, and 25% for evaluation. Unless otherwise specified, we use a multi-class logistic loss.

6.2 Method comparison for the basis vector (hidden unit) selection subproblem

As we mentioned previously, the linearized subproblem (basis vector selection) for the and constrained cases involves a significantly more challenging non-convex optimization problem. In this section, we compare different methods for obtaining an approximate solution to (9). We focus on the case, since we have a method for computing the true global solution , albeit with exponential complexity in (cf. Appendix C). This allows us to report the empirically observed multiplicative approximation factor .

Compared methods. We compare init + refine (proposed), random init + refine, init (without refine), random init and best data: where .

Results. We report in Figure 2. init + refine achieves nearly the global maximum on both datasets and outperforms random init + refine, showing the effectiveness of the proposed initialization and that the iterative update (16) can get stuck in a bad local minimum if initialized badly. On the other hand, init + refine outperforms init alone, showing the importance of iteratively refining the solution. Best data

, a heuristic similar to that of approximate kernel SVMs

lasvm , is not competitive.

6.3 Sparsity-inducing penalty comparison

In this section, we compare the , and penalties for the choice of , when varying the maximum number of basis vectors (hidden units). Figure 3

indicates test set accuracy when using output layer refitting. We also include linear logistic regression, kernel SVMs and the Nyström method as baselines. For the latter two, we use the quadratic kernel

. Hyper-parameters are chosen so as to maximize validation set accuracy.
Results. On the vowel (11 classes) and letter (26 classes) datasets, and penalties outperform norm starting from 20 and 75 hidden units, respectively. On satimage (6 classes) and segment (7 classes), we observed that the three penalties are mostly similar (not shown). We hypothesize that and penalties make a bigger difference when the number of classes is large. Multi-output PNs substantially outperform the Nyström method with comparable number of basis vectors (hidden units). Multi-output PNs reach the same test accuracy as kernel SVMs with very few basis vectors on vowel and satimage but appear to require at least 100 basis vectors to reach good performance on letter. This is not surprising, since kernel SVMs require 3,208 support vectors on letter, as indicated in Table 2 below.

6.4 Multi-class benchmark comparison

Compared methods. We compare the proposed conditional gradient algorithm with output layer refitting only and with both output and hidden layer refitting; projected gradient descent (FISTA) with random initialization; linear and kernelized models; one-vs-rest PNs (i.e., fit one PN per class). We focus on PNs rather than FMs since they are known to work better on classification tasks fm_icml .

Results are included in Table 2. From these results, we can make the following observations and conclusions. When using output-layer refitting on vowel and letter (two datasets with more than 10 classes), group-sparsity inducing penalties lead to better test accuracy. This is to be expected, since these penalties select basis vectors that are useful across all classes. When using full hidden layer and output layer refitting, catches up with and on the vowel and letter datasets. Intuitively, the basis vector selection becomes less important if we make more effort at every iteration by refitting the basis vectors themselves. However, on vowel, is still substantially better than (89.57 vs. 87.83).

Compared to projected gradient descent with random initialization, our algorithm (for both output and full refitting) is better on (), () and () of the datasets. In addition, with our algorithm, the best model (chosen against the validation set) is substantially sparser. Multi-output PNs substantially outperform OvR PNs. This is to be expected, since multi-output PNs learn to share basis vectors across different classes.

6.5 Recommender system experiments using ordinal regression

A straightforward way to implement recommender systems consists in training a single-output model to regress ratings from one-hot encoded user and item indices fm . Instead of a single-output PN or FM, we propose to use ordinal McRank, a reduction from ordinal regression to multi-output binary classification, which is known to achieve good nDCG (normalized discounted cumulative gain) scores mcrank

. This reduction involves training a probabilistic binary classifier for each of the

relevance levels (for instance, in the MovieLens datasets). The expected relevance of (e.g. the concatenation of the one-hot encoded user and item indices) is then computed by

 ^y=m∑c=1c p(y=c∣x)=m∑c=1c[p(y≤c∣x)−p(y≤c−1∣x)], (17)

where we use the convention . Thus, all we need to do to use ordinal McRank is to train a probabilistic binary classifier for all .

Our key proposal is to use a multi-output model to learn all classifiers simultaneously, i.e., in a multi-task fashion. Let be a vector representing a user-item pair with corresponding rating , for . We form a matrix such that if and otherwise, and solve

 minΩ(U)≤τn∑i=1m∑c=1ℓ(yi,c,∑h∈HσANOVA(h,xi)uh,c), (18)

where

is set to the binary logistic loss, in order to be able to produce probabilities. After running Algorithm 1 on that objective for

iterations, we obtain and . Because is shared across all outputs, the only small overhead of using the ordinal McRank reduction, compared to a single-output regression model, therefore comes from learning instead of .

In this experiment, we focus on multi-output factorization machines (FMs), since FMs usually work better than PNs for one-hot encoded data (fm_icml, ). We show in Figure 4 the RMSE and nDCG (truncated at 1 and 5) achieved when varying (the maximum number of basis vectors / hidden units).

Results. When combined with the ordinal McRank reduction, we found that and –constrained multi-output FMs substantially outperform single-output FMs and PNs on both RMSE and nDCG measures. For instance, on MovieLens 100k and 1M, –constrained multi-output FMs achieve an nDCG@1 of 0.75 and 0.76, respectively, while single-output FMs only achieve 0.71 and 0.75. Similar trends are observed with nDCG@5. We believe that this reduction is more robust to ranking performance measures such as nDCG thanks to its modelling of the expected relevance.

7 Conclusion and future directions

We defined the problem of learning multi-output PNs and FMs as that of learning a 3-way tensor whose slices share a common basis. To obtain a convex optimization objective, we reformulated that problem as that of learning an infinite but row-wise sparse matrix. To learn that matrix, we developed a conditional gradient algorithm with corrective refitting, and were able to provide convergence guarantees, despite the non-convexity of the basis vector (hidden unit) selection step.

Although not considered in this paper, our algorithm and its analysis can be modified to make use of stochastic gradients. An open question remains whether a conditional gradient algorithm with provable guarantees can be developed for training deep polynomial networks or factorization machines. Such deep models could potentially represent high-degree polynomials with few basis vectors. However, this would require the introduction of a new functional analysis framework.

References

• (1) F. Bach.

Breaking the curse of dimensionality with convex neural networks.

JMLR, 2017.
• (2) Y. Bengio, N. Le Roux, P. Vincent, O. Delalleau, and P. Marcotte. Convex neural networks. In NIPS, 2005.
• (3) D. P. Bertsekas. Nonlinear programming. Athena Scientific Belmont, 1999.
• (4) M. Blondel, A. Fujino, and N. Ueda. Convex factorization machines. In ECML/PKDD, 2015.
• (5) M. Blondel, M. Ishihata, A. Fujino, and N. Ueda. Polynomial networks and factorization machines: New insights and efficient training algorithms. In ICML, 2016.
• (6) M. Blondel, K. Seki, and K. Uehara. Block coordinate descent algorithms for large-scale sparse multiclass classification. Machine Learning, 93(1):31–52, 2013.
• (7) A. Bordes, S. Ertekin, J. Weston, and L. Bottou.

Fast kernel classifiers with online and active learning.

JMLR, 6(Sep):1579–1619, 2005.
• (8) V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
• (9) Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Training and testing low-degree polynomial data mappings via linear svm. Journal of Machine Learning Research, 11:1471–1490, 2010.
• (10) D. Chen and C. D. Manning. A fast and accurate dependency parser using neural networks. In EMNLP, 2014.
• (11) J. C. Dunn and S. A. Harshbarger. Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62(2):432–444, 1978.
• (12) R.-E. Fan and C.-J. Lin.
• (13) A. Gautier, Q. N. Nguyen, and M. Hein. Globally optimal training of generalized polynomial neural networks with nonlinear spectral methods. In NIPS, 2016.
• (14) GroupLens.
• (15) M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.
• (16) M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre.

Generalized power method for sparse principal component analysis.

Journal of Machine Learning Research, 11:517–553, 2010.
• (17) Y. Juan, Y. Zhuang, W.-S. Chin, and C.-J. Lin. Field-aware factorization machines for CTR prediction. In ACM Recsys, 2016.
• (18) S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, 2012.
• (19) P. Li, C. J. Burges, and Q. Wu.

McRank: Learning to rank using multiple classification and gradient boosting.

In NIPS, 2007.
• (20) R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks. In NIPS, 2014.
• (21) R. Luss and M. Teboulle. Conditional gradient algorithms for rank-one matrix approximations with a sparsity constraint. SIAM Review, 55(1):65–98, 2013.
• (22) S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.
• (23) A. Novikov, M. Trofimov, and I. Oseledets. Exponential machines. arXiv preprint arXiv:1605.03795, 2016.
• (24) A. Podosinnikova, F. Bach, and S. Lacoste-Julien.

Beyond CCA: Moment matching for multi-view models.

In ICML, 2016.
• (25) S. Rendle. Factorization machines. In ICDM, 2010.
• (26) S. Shalev-Shwartz, A. Gonen, and O. Shamir. Large-scale convex minimization with a low-rank constraint. In ICML, 2011.
• (27) S. Shalev-Shwartz, Y. Wexler, and A. Shashua. ShareBoost: Efficient multiclass learning with feature sharing. In NIPS, 2011.
• (28) S. Sonnenburg and V. Franc. Coffin: A computational framework for linear SVMs. In ICML, 2010.
• (29) E. Stoudenmire and D. J. Schwab. Supervised learning with tensor networks. In NIPS, 2016.
• (30) Z. Wang, K. Crammer, and S. Vucetic. Multi-class Pegasos on a budget. In ICML, 2010.
• (31) M. Yamada, W. Lian, A. Goyal, J. Chen, K. Wimalawarne, S. A. Khan, S. Kaski, H. M. Mamitsuka, and Y. Chang. Convex factorization machine for toxicogenomics prediction. In KDD, 2017.
• (32) E. Zhong, Y. Shi, N. Liu, and S. Rajan. Scaling factorization machines with parameter server. In CIKM, 2016.

Appendix A Convex multi-class loss functions

The gradient w.r.t. , denoted , can be computed by

 ∇ℓ(y,o)=∑c≠yρc(y,o)(ec−ey), (19)

where is a vector whose element is 1 and other elements are 0. For the smoothed multi-class hinge loss and the multi-class squared hinge loss, see [27] and [6], respectively.

Appendix B Proofs

b.1 Finite support of an optimal solution (Proposition 1)

General case. We first state a result that holds for arbitrary activation function (sigmoid, ReLu, etc…). The main idea is to use the fact that the penalties considered in Table 1 are atomic [8]. Then, we can equivalently optimize (5) over the convex hull of a set of atoms and invoke Carathéodory’s theorem for convex hulls.

Let be an -dimensional vector whose element is . Let us define the sets

 (20)

where we define the set as follows:

• case:

• case:

• case: .

Then (5) is equivalent to

 minU∈R|H|×m n∑i=1ℓ(yi,∑h∈Hϕh(X)i uh) s.t.Ω(U)≤τ (21) = minU∈R|H|×m n∑i=1ℓ(yi,∑h∈Hϕh(X)i uh) s.t.U∈τ⋅conv(A) = minO∈Rn×m n∑i=1ℓ(yi,oi) s.t.O∈τ⋅conv(B),

where is the convex hull of the set . The matrices and are related to each other by

 U=∑h∈H∑v∈Vθh,vehvTandO=∑h∈H∑v∈Vθh,vϕh(X)vT, (22)

for some such that and . By Carathéodory’s theorem for convex hulls, there exists with at most non-zero elements. Because elements of are matrices with a single non-zero row, contains at most non-zero rows (hidden units).

Case of constraint and squared activation. When , given s.t. , the output can be written as

 ∑h∈Hσ(hTx)uh,c=∑h∈H(hTx)2uh,c=xT(∑h∈Huh,chhT)x\eqqcolonxTWcx. (23)

Following [5, Lemma 10], the nuclear norm of a symmetric matrix can be defined by

 ∥M∥∗=minλ∈Rd,P∈Rd×dd∑j=1|λj| ∥pj∥22%s.t.M=d∑j=1λjpjpTj (24)

and the minimum is attained by the eigendecomposition and .

Therefore, we can always compute the eigendecomposition of each

and use the eigenvectors as hidden units and the eigenvalues as output layer weights. Moreover, this solution is feasible, since eigenvectors belong to

and since the norm of all eigenvalues is minimized. Since a matrix can have at most eigenvalues, we can conclude that has at most elements. Combined with the previous result, has at most non-zero rows (hidden units).

For the and penalties, we cannot make this argument, since applying the eigendecomposition might increase the penalty value and therefore make the solution infeasible.

b.2 Convergence analysis (Theorem 1)

In this section, we include a convergence analysis of the conditional gradient algorithm with multiplicative approximation in the linear minimization oracle. The proof follows mostly from [15] with a trick inspired from [1] to handle multiplicative approximations. Finally, we also include a detailed comparison with the analysis of GECO [26] and Block-FW [18].

We focus on constrained optimization problems of the form

 minx∈Df(x), (25)

where is convex and -smooth w.r.t. and .

Curvature and smoothness constants. The convergence analysis depends on the following standard curvature constant

 Cf,D\coloneqqsupx,s∈Dγ∈[0,1]y=x+γ(s−x)2γ2(f(y)−f(x)−⟨y−x,∇f(x)⟩). (26)

Intuitively, this is a measure of non-linearity of : the maximum deviation between and its linear approximations over . The assumption of bounded is closely related to a smoothness assumption on . Following [15, Lemma 7], for any choice of norm , can be upper-bounded by the smoothness constant as

 Cf,D≤diamΩ(D)2β. (27)

Using , we obtain

 diamΩ(D)=supx,y∈DΩ(x−y)≤supx,y∈DΩ(x)+Ω(y)≤2τ (28)

and therefore

 Cf,D≤4τ2β. (29)

Linear duality gap. Following [15], we define the linear duality gap

 gD(x)\coloneqqmaxs∈D⟨x−s,∇f(x)⟩. (30)

Since is convex and differentiable, we have that

 f(s)≥f(x)+⟨s−x,∇f(x)⟩. (31)

Let us define the primal error

 hD(x)\coloneqqf(x)−minx∈Df(x). (32)

Minimizing (31) w.r.t. on both sides we obtain

 gD(x)≥hD(x). (33)

Hence can be used as a certificate of optimality about .

Bounding progress. Let be the current iterate and be our update. The definition of implies

 f(y)≤f(x)+γ⟨s−x,∇f(x)⟩+γ22Cf,D. (34)

We now use that is obtained by an exact linear minimization oracle (LMO)

 s=argmins∈D⟨s,∇f(x)⟩ (35)

and therefore . Combined with , we obtain

 f(y)≤f(x)−γhD(x)+γ22Cf,D. (36)

Subtracting on both sides, we finally get

 hD(y)≤(1−γ)hD(x)+γ22Cf,D. (37)

Primal convergence. Since we use a fully-corrective variant of the conditional gradient method, our algorithm enjoys a convergence rate at least as good as the variant with fixed step size. Following [15, Theorem 1] and using (29), for every , the iterates satisfy

 f(x(t))−minx∈Df(x)≤2Cf,Dt+2≤8τ2βt+2. (38)

Thus, we can obtain an -accurate solution if we run the algorithm for iterations.

Linear minimization with multiplicative approximation. We now extend the analysis to the case of approximate linear minimization. Given , we assume that an approximate LMO outputs a certain such that

 ⟨−s,∇f(x)⟩≥νmaxs′∈D⟨−s′,∇f(x)⟩, (39)

for some multiplicative factor (higher is more accurate). Since and are in , we have like before

 f(y)≤f(x)+γ⟨s−x,∇f(x)⟩+γ22Cf,D. (40)

Following the same trick as [1, Appendix B], we now absorb the multiplicative factor in the constraint

 ⟨−s,∇f(x)⟩≥maxs′∈D′⟨−s′,∇f(x)⟩, (41)

where we defined (i.e., the ball is shrunk by a factor ). We therefore obtain . Similarly as before, this implies that

 f(y)≤f(x)−γhD′(x)+γ22Cf,D. (42)

Subtracting on both sides, we get

 hD′(y)≤(1−γ)hD′(x)+γ22Cf,D. (43)

We thus get that iterate satisfies and

 f(x(t))≤minx∈D′f(x)+8τ2βt+2. (44)

We can therefore obtain an such that if we run our algorithm for iterations with constraint parameter and multiplicative factor . Put differently, we can obtain an