1 Introduction
Interactions between features play an important role in many classification and regression tasks. Classically, such interactions have been leveraged either explicitly, by mapping features to their products (as in polynomial regression), or implicitly, through the use of the kernel trick. While fast linear model solvers have been engineered for the explicit approach low_poly ; coffin , they are typically limited to small numbers of features or loworder feature interactions, due to the fact that the number of parameters that they need to learn scales as , where is the number of features and is the order of interactions considered. Models kernelized with the polynomial kernel do not suffer from this problem; however, the cost of storing and evaluating these models grows linearly with the number of training instances, a problem sometimes referred to as the curse of kernelization budgetpegasos .
Factorization machines (FMs) fm
are a more recent approach that can use pairwise feature interactions efficiently even in very highdimensional data. The key idea of FMs is to model the weights of feature interactions using a
lowrank matrix. Not only this idea offers clear benefits in terms of model compression compared to the aforementioned approaches, it has also proved instrumental in modeling interactions between categoricalvariables, converted to binary features via a onehot encoding. Such binary features are usually so sparse that many interactions are never observed in the training set, preventing classical approaches from capturing their relative importance. By imposing a low rank on the feature interaction weight matrix, FMs encourage shared parameters between interactions, allowing to estimate their weights even if they never occurred in the training set. This property has been used in recommender systems to model interactions between user variables and item variables, and is the basis of several industrial successes of FMs
fm_param_server ; ffm .Originally motivated as neural networks with a polynomial activation (instead of the classical sigmoidal or rectifier activations), polynomial networks (PNs)
livni have been shown to be intimately related to FMs and to only subtly differ in the nonlinearity they use fm_icml . PNs achieve better performance than rectifier networks on pedestrian detection livni and on dependency parsing dependency_parsing , and outperform kernel approximations such as the Nyström method fm_icml . However, existing PN and FM works have been limited to singleoutput models, i.e., they are designed to learn scalarvalued functions, which restricts them to regression or binary classification problems.Our contributions. In this paper, we generalize FMs and PNs to multioutput models, i.e., for learning vectorvalued functions, with application to multiclass or multitask problems.
1) We cast learning multioutput FMs and PNs as learning a 3way tensor, whose slices share a common basis (each slice corresponds to one output). To obtain a convex formulation of that problem, we propose to cast it as learning an infinitedimensional but rowwise sparse matrix. This can be achieved by using groupsparsity inducing penalties. (§3)
2) To solve the obtained optimization problem, we develop a variant of the conditional gradient (a.k.a. FrankWolfe) algorithm dunn ; revisiting_fw , which repeats the following two steps: i) select a new basis vector to add to the model and ii) refit the model over the current basis vectors. (§4) We prove the global convergence of this algorithm (Theorem 1), despite the fact that the basis selection step is nonconvex and more challenging in the shared basis setting. (§5)
3) On multiclass classification tasks, we show that our algorithm achieves comparable accuracy to kernel SVMs but with much more compressed models than the Nyström method. On recommender system tasks, where kernelized models cannot be used (since they do not generalize to unseen useritem pairs), we demonstrate how our algorithm can be combined with a reduction from ordinal regression to multioutput classification and show that the resulting algorithm outperforms singleoutput PNs and FMs both in terms of root mean squared error (RMSE) and ranking accuracy, as measured by nDCG (normalized discounted cumulative gain) scores. (§6)
2 Background and related work
Notation. We denote the set by . Given a vector , we denote its elements by . Given a matrix , we denote its rows by and its columns by . We denote the norm of by and its norm by . The number of nonzero rows of is denoted by .
Factorization machines (FMs). Given an input vector , FMs predict a scalar output by
(1) 
where contains feature weights and is a lowrank matrix that contains pairwise feature interaction weights. To obtain a lowrank , fm originally proposed to use a change of variable , where (with a rank parameter) and to learn instead. Noting that this quadratic model results in a nonconvex problem in , convex_fm ; cfm_yamada proposed to convexify the problem by learning directly but to encourage low rank using a nuclear norm on . For learning, convex_fm proposed a conditional gradient like approach with global convergence guarantees.
Polynomial networks (PNs).
PNs are a recentlyproposed form of neural network where the usual activation function is replaced with a squared activation. Formally, PNs predict a scalar output by
(2) 
where (evaluated elementwise) is the squared activation, is the output layer vector, is the hidden layer matrix and is the number of hidden units. Because the r.h.s term can be rewritten as if we set , we see that PNs are clearly a slight variation of FMs and that learning can be recast as learning a lowrank matrix . Based on this observation, livni proposed to use GECO geco , a greedy algorithm for convex optimization with a lowrank constraint, similar to the conditional gradient algorithm. pn_global proposed a learning algorithm for PNs with global optimality guarantees but their theory imposes nonnegativity on the network parameters and they need one distinct hyperparameter per hidden unit to avoid trivial models. Other lowrank polynomial models were recently introduced in sl_tensor_net ; exp_machines but using a tensor network (a.k.a. tensor train) instead of the canonical polyadic (CP) decomposition.
3 A convex formulation of multioutput PNs and FMs
In this section, we generalize PNs and FMs to multioutput problems. For the sake of concreteness, we focus on PNs for multiclass classification. The extension to FMs is straightforward and simply requires to replace by , as noted in fm_icml .
The predictions of multiclass PNs can be naturally defined as , where is the number of classes, and is lowrank. Following fm_icml , we can model the linear term directly in the quadratic term if we augment all data points with an extra feature of value 1, i.e., . We will therefore simply assume henceforth. Our main proposal in this paper is to decompose using a shared basis:
(3) 
where, in neural network terminology, can be interpreted as a hidden layer matrix and as an output layer matrix. Compared to the naive approach of decomposing each as , this reduces the number of parameters from to .
While a nuclear norm could be used to promote a low rank on each , similarly as in convex_fm ; cfm_yamada , this is clearly not sufficient to impose a shared basis. A naive approach would be to use nonorthogonal joint diagonalization as a postprocessing. However, because this is a nonconvex problem for which no globally convergent algorithm is known beyond_cca , this would result in a loss of accuracy. Our key idea is to cast the problem of learning a multioutput PN as that of learning an infinite but rowwise sparse matrix. Without loss of generality, we assume that basis vectors (hidden units) lie in the unit ball. We therefore denote the set of basis vectors by Let us denote this infinite matrix by (we use a discrete notation for simplicity). We can then write
(4) 
denotes the weights of basis across all classes (outputs). In this formulation, we have and sharing a common basis (hidden units) amounts to encouraging the rows of , , to be either dense or entirely sparse. This can be naturally achieved using groupsparsity inducing penalties. Intuitively, in (3) can be thought as restricted to its row support. Define the training set by and . We then propose to solve the convex problem
(5) 
where
is a smooth and convex multiclass loss function (cf. Appendix
A for three common examples), is a sparsityinducing penalty and is a hyperparameter. In this paper, we focus on the (lasso), (group lasso) and penalties for , cf. Table 1. However, as we shall see, solving (5) is more challenging with the and penalties than with the penalty. Although our formulation is based on an infinite view, we next show that has finite row support.Proposition 1
Finite row support of for multioutput PNs and FMs
Proof is in Appendix B.1. It is open whether we can tighten this result when or .
Subproblem  

(lasso)  
(group lasso) 


4 A conditional gradient algorithm with approximate basis vector selection
At first glance, learning with an infinite number of basis vectors seems impossible. In this section, we show how the wellknown conditional gradient algorithm dunn ; revisiting_fw combined with groupsparsity inducing penalties naturally leads to a greedy algorithm that selects and adds basis vectors that are useful across all outputs. On every iteration, the conditional gradient algorithm performs updates of the form , where is a step size and is obtained by solving a linear approximation of the objective around the current iterate :
(6) 
Let us denote the negative gradient by for short. Its elements are defined by
(7) 
where is the gradient of w.r.t. (cf. Appendix A
). For ReLu activations, solving (
6) is known to be NPhard convex_nn_bach . Here, we focus on quadratic activations, for which we will be able to provide approximation guarantees. Plugging the expression of , we get(8) 
and is a diagonal matrix such that . Let us recall the definition of the dual norm of : . By comparing this equation to (6), we see that is the argument that achieves the maximum in the dual norm , up to a constant factor . It is easy to verify that any element in the subdifferential of , which we denote by , achieves that maximum, i.e., .
Basis selection. As shown in Table 1, elements of (subgradients) are matrices with a single nonzero row indexed by , where is an optimal basis (hidden unit) selected by
(9) 
and where when , when and when . We call (9) a basis vector selection criterion. Although this selection criterion was derived from the linearization of the objective, it is fairly natural: it chooses the basis vector with largest “violation”, as measured by the norm of the negative gradient row .
Multiplicative approximations. The key challenge in solving (6) or equivalently (9) arises from the fact that has infinitely many rows . We therefore cast basis vector selection as a continuous optimization problem w.r.t. . Surprisingly, although the entire objective (5) is convex, (9) is not. Instead of the exact maximum, we will therefore only require to find a that satisfies
(10) 
where is a multiplicative approximation (higher is better). It is easy to verify that this is equivalent to replacing the optimal by an approximate that satisfies .
Sparse case. When , we need to solve
(11) 
It is well known that the optimal solution of
is the dominant eigenvector of
. Therefore, we simply need to find the dominant eigenvector of each and select as thewith largest singular value
. Using the power method, we can find an that satisfies(12) 
for some tolerance parameter . The procedure takes time, where is the number of nonzero elements in geco . Taking the maximum w.r.t. on both sides of (12) leads to , where . However, using does not encourage selecting an that is useful for all outputs. In fact, when , our approach is equivalent to imposing independent nuclear norms on .
Groupsparse cases. When or , we need to solve
(13) 
respectively. Unlike the constrained case, we are clearly selecting a basis vector with largest violation across all outputs. However, we are now faced with a more difficult nonconvex optimization problem. Our strategy is to first choose an initialization which guarantees a certain multiplicative approximation , then refine the solution using a monotonically nonincreasing iterative procedure.
Initialization. We simply choose as the approximate solution of the case, i.e., we have
(14) 
Now, using and , this immediately implies
(15) 
with if and if .
Refining the solution. We now apply another instance of the conditional gradient algorithm to solve the subproblem itself, leading to the following iterates:
(16) 
where . Following (bertsekas, , Section 2.2.2), if we use the Armijo rule to select , every limit point of the sequence is a stationary point of . In practice, we observe that is almost always selected. Note that when and (i.e., singleoutput case), our refining algorithm recovers the power method. Generalized power methods were also studied for structured matrix factorization journee_generalized ; luss_generalized , but with different objectives and constraints. Since the conditional gradient algorithm assumes a differentiable function, in the case , we replace the absolute function with the Huber function if , otherwise.
Corrective refitting step. After iterations, contains at most nonzero rows. We can therefore always store as (the output layer matrix) and (the basis vectors / hidden units added so far). In order to improve accuracy, on iteration , we can then refit the objective . We consider two kinds of corrective steps, a convex one that minimizes w.r.t. and an optional nonconvex one that minimizes w.r.t. both and . Refitting allows to remove previouslyadded bad basis vectors, thanks to the use of sparsityinducing penalties. Similar refitting procedures are commonly used in matching pursuit (omp, ). The entire procedure is summarized in Algorithm 1 and implementation details are given in Appendix D.
5 Analysis of Algorithm 1
The main difficulty in analyzing the convergence of Algorithm 1 stems from the fact that we cannot solve the basis vector selection subproblem globally when or . Therefore, we need to develop an analysis that can cope with the multiplicative approximation . Multiplicative approximations were also considered in block_fw but the condition they require is too stringent (cf. Appendix B.2 for a detailed discussion). The next theorem guarantees the number of iterations needed to output a multioutput network that achieves as small objective value as an optimal solution of (5).
Theorem 1
Convergence of Algorithm 1
Assume is smooth with constant . Let be the output
after iterations of Algorithm 1 run with
constraint parameter . Then,
.
In livni , singleoutput PNs were trained using GECO geco , a greedy algorithm with similar guarantees. However, GECO is limited to learning infinite vectors (not matrices) and it does not constrain its iterates like we do. Hence GECO cannot remove bad basis vectors. The proof of Theorem 1 and a detailed comparison with GECO are given in Appendix B.2. Finally, we note that the infinite dimensional view is also key to convex neural networks convex_nn_bengio ; convex_nn_bach . However, to our knowledge, we are the first to give an explicit multiplicative approximation guarantee for a nonlinear multioutput network.
6 Experimental results
6.1 Experimental setup
Datasets. For our multiclass experiments, we use four publiclyavailable datasets: segment (7 classes), vowel (11 classes), satimage (6 classes) and letter (26 classes) datasets . Quadratic models substantially improve over linear models on these datasets. For our recommendation system experiments, we use the MovieLens 100k and 1M datasets movielens . See Appendix E for complete details.
Model validation. The greedy nature of Algorithm 1 allows us to easily interleave training with model validation. Concretely, we use an outer loop (embarrassingly parallel) for iterating over the range of possible regularization parameters, and an inner loop (Algorithm 1, sequential) for increasing the number of basis vectors. Throughout our experiments, we use 50% of the data for training, 25% for validation, and 25% for evaluation. Unless otherwise specified, we use a multiclass logistic loss.
6.2 Method comparison for the basis vector (hidden unit) selection subproblem
As we mentioned previously, the linearized subproblem (basis vector selection) for the and constrained cases involves a significantly more challenging nonconvex optimization problem. In this section, we compare different methods for obtaining an approximate solution to (9). We focus on the case, since we have a method for computing the true global solution , albeit with exponential complexity in (cf. Appendix C). This allows us to report the empirically observed multiplicative approximation factor .
Compared methods. We compare init + refine (proposed), random init + refine, init (without refine), random init and best data: where .
Results. We report in Figure 2. init + refine achieves nearly the global maximum on both datasets and outperforms random init + refine, showing the effectiveness of the proposed initialization and that the iterative update (16) can get stuck in a bad local minimum if initialized badly. On the other hand, init + refine outperforms init alone, showing the importance of iteratively refining the solution. Best data
, a heuristic similar to that of approximate kernel SVMs
lasvm , is not competitive.6.3 Sparsityinducing penalty comparison
In this section, we compare the , and penalties for the choice of , when varying the maximum number of basis vectors (hidden units). Figure 3
indicates test set accuracy when using output layer refitting. We also include linear logistic regression, kernel SVMs and the Nyström method as baselines. For the latter two, we use the quadratic kernel
. Hyperparameters are chosen so as to maximize validation set accuracy.Results. On the vowel (11 classes) and letter (26 classes) datasets, and penalties outperform norm starting from 20 and 75 hidden units, respectively. On satimage (6 classes) and segment (7 classes), we observed that the three penalties are mostly similar (not shown). We hypothesize that and penalties make a bigger difference when the number of classes is large. Multioutput PNs substantially outperform the Nyström method with comparable number of basis vectors (hidden units). Multioutput PNs reach the same test accuracy as kernel SVMs with very few basis vectors on vowel and satimage but appear to require at least 100 basis vectors to reach good performance on letter. This is not surprising, since kernel SVMs require 3,208 support vectors on letter, as indicated in Table 2 below.
6.4 Multiclass benchmark comparison
Compared methods. We compare
the proposed conditional gradient algorithm with output layer refitting only and
with both output and hidden layer refitting; projected gradient descent (FISTA)
with random initialization; linear and kernelized models; onevsrest PNs (i.e.,
fit one PN per class). We focus on PNs rather than FMs since they are known to
work better on classification tasks fm_icml .
Results are included in Table 2. From these results, we can make the following observations and conclusions. When using outputlayer refitting on vowel and letter (two datasets with more than 10 classes), groupsparsity inducing penalties lead to better test accuracy. This is to be expected, since these penalties select basis vectors that are useful across all classes. When using full hidden layer and output layer refitting, catches up with and on the vowel and letter datasets. Intuitively, the basis vector selection becomes less important if we make more effort at every iteration by refitting the basis vectors themselves. However, on vowel, is still substantially better than (89.57 vs. 87.83).
Compared to projected gradient descent with random initialization, our algorithm (for both output and full refitting) is better on (), () and () of the datasets. In addition, with our algorithm, the best model (chosen against the validation set) is substantially sparser. Multioutput PNs substantially outperform OvR PNs. This is to be expected, since multioutput PNs learn to share basis vectors across different classes.
6.5 Recommender system experiments using ordinal regression
A straightforward way to implement recommender systems consists in training a singleoutput model to regress ratings from onehot encoded user and item indices fm . Instead of a singleoutput PN or FM, we propose to use ordinal McRank, a reduction from ordinal regression to multioutput binary classification, which is known to achieve good nDCG (normalized discounted cumulative gain) scores mcrank
. This reduction involves training a probabilistic binary classifier for each of the
relevance levels (for instance, in the MovieLens datasets). The expected relevance of (e.g. the concatenation of the onehot encoded user and item indices) is then computed by(17) 
where we use the convention . Thus, all we need to do to use ordinal McRank is to train a probabilistic binary classifier for all .
Our key proposal is to use a multioutput model to learn all classifiers simultaneously, i.e., in a multitask fashion. Let be a vector representing a useritem pair with corresponding rating , for . We form a matrix such that if and otherwise, and solve
(18) 
where
is set to the binary logistic loss, in order to be able to produce probabilities. After running Algorithm 1 on that objective for
iterations, we obtain and . Because is shared across all outputs, the only small overhead of using the ordinal McRank reduction, compared to a singleoutput regression model, therefore comes from learning instead of .In this experiment, we focus on multioutput factorization machines (FMs), since FMs usually work better than PNs for onehot encoded data (fm_icml, ). We show in Figure 4 the RMSE and nDCG (truncated at 1 and 5) achieved when varying (the maximum number of basis vectors / hidden units).
Results. When combined with the ordinal McRank reduction, we found that and –constrained multioutput FMs substantially outperform singleoutput FMs and PNs on both RMSE and nDCG measures. For instance, on MovieLens 100k and 1M, –constrained multioutput FMs achieve an nDCG@1 of 0.75 and 0.76, respectively, while singleoutput FMs only achieve 0.71 and 0.75. Similar trends are observed with nDCG@5. We believe that this reduction is more robust to ranking performance measures such as nDCG thanks to its modelling of the expected relevance.
7 Conclusion and future directions
We defined the problem of learning multioutput PNs and FMs as that of learning a 3way tensor whose slices share a common basis. To obtain a convex optimization objective, we reformulated that problem as that of learning an infinite but rowwise sparse matrix. To learn that matrix, we developed a conditional gradient algorithm with corrective refitting, and were able to provide convergence guarantees, despite the nonconvexity of the basis vector (hidden unit) selection step.
Although not considered in this paper, our algorithm and its analysis can be modified to make use of stochastic gradients. An open question remains whether a conditional gradient algorithm with provable guarantees can be developed for training deep polynomial networks or factorization machines. Such deep models could potentially represent highdegree polynomials with few basis vectors. However, this would require the introduction of a new functional analysis framework.
References

(1)
F. Bach.
Breaking the curse of dimensionality with convex neural networks.
JMLR, 2017.  (2) Y. Bengio, N. Le Roux, P. Vincent, O. Delalleau, and P. Marcotte. Convex neural networks. In NIPS, 2005.
 (3) D. P. Bertsekas. Nonlinear programming. Athena Scientific Belmont, 1999.
 (4) M. Blondel, A. Fujino, and N. Ueda. Convex factorization machines. In ECML/PKDD, 2015.
 (5) M. Blondel, M. Ishihata, A. Fujino, and N. Ueda. Polynomial networks and factorization machines: New insights and efficient training algorithms. In ICML, 2016.
 (6) M. Blondel, K. Seki, and K. Uehara. Block coordinate descent algorithms for largescale sparse multiclass classification. Machine Learning, 93(1):31–52, 2013.

(7)
A. Bordes, S. Ertekin, J. Weston, and L. Bottou.
Fast kernel classifiers with online and active learning.
JMLR, 6(Sep):1579–1619, 2005.  (8) V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
 (9) Y.W. Chang, C.J. Hsieh, K.W. Chang, M. Ringgaard, and C.J. Lin. Training and testing lowdegree polynomial data mappings via linear svm. Journal of Machine Learning Research, 11:1471–1490, 2010.
 (10) D. Chen and C. D. Manning. A fast and accurate dependency parser using neural networks. In EMNLP, 2014.
 (11) J. C. Dunn and S. A. Harshbarger. Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62(2):432–444, 1978.
 (12) R.E. Fan and C.J. Lin. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, 2011.
 (13) A. Gautier, Q. N. Nguyen, and M. Hein. Globally optimal training of generalized polynomial neural networks with nonlinear spectral methods. In NIPS, 2016.
 (14) GroupLens. http://grouplens.org/datasets/movielens/, 1998.
 (15) M. Jaggi. Revisiting FrankWolfe: Projectionfree sparse convex optimization. In ICML, 2013.

(16)
M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre.
Generalized power method for sparse principal component analysis.
Journal of Machine Learning Research, 11:517–553, 2010.  (17) Y. Juan, Y. Zhuang, W.S. Chin, and C.J. Lin. Fieldaware factorization machines for CTR prediction. In ACM Recsys, 2016.
 (18) S. LacosteJulien, M. Jaggi, M. Schmidt, and P. Pletscher. Blockcoordinate FrankWolfe optimization for structural SVMs. In ICML, 2012.

(19)
P. Li, C. J. Burges, and Q. Wu.
McRank: Learning to rank using multiple classification and gradient boosting.
In NIPS, 2007.  (20) R. Livni, S. ShalevShwartz, and O. Shamir. On the computational efficiency of training neural networks. In NIPS, 2014.
 (21) R. Luss and M. Teboulle. Conditional gradient algorithms for rankone matrix approximations with a sparsity constraint. SIAM Review, 55(1):65–98, 2013.
 (22) S. G. Mallat and Z. Zhang. Matching pursuits with timefrequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.
 (23) A. Novikov, M. Trofimov, and I. Oseledets. Exponential machines. arXiv preprint arXiv:1605.03795, 2016.

(24)
A. Podosinnikova, F. Bach, and S. LacosteJulien.
Beyond CCA: Moment matching for multiview models.
In ICML, 2016.  (25) S. Rendle. Factorization machines. In ICDM, 2010.
 (26) S. ShalevShwartz, A. Gonen, and O. Shamir. Largescale convex minimization with a lowrank constraint. In ICML, 2011.
 (27) S. ShalevShwartz, Y. Wexler, and A. Shashua. ShareBoost: Efficient multiclass learning with feature sharing. In NIPS, 2011.
 (28) S. Sonnenburg and V. Franc. Coffin: A computational framework for linear SVMs. In ICML, 2010.
 (29) E. Stoudenmire and D. J. Schwab. Supervised learning with tensor networks. In NIPS, 2016.
 (30) Z. Wang, K. Crammer, and S. Vucetic. Multiclass Pegasos on a budget. In ICML, 2010.
 (31) M. Yamada, W. Lian, A. Goyal, J. Chen, K. Wimalawarne, S. A. Khan, S. Kaski, H. M. Mamitsuka, and Y. Chang. Convex factorization machine for toxicogenomics prediction. In KDD, 2017.
 (32) E. Zhong, Y. Shi, N. Liu, and S. Rajan. Scaling factorization machines with parameter server. In CIKM, 2016.
Appendix A Convex multiclass loss functions
Loss  

Multiclass logistic  
Smoothed multiclass hinge 

Multiclass squared hinge 


Appendix B Proofs
b.1 Finite support of an optimal solution (Proposition 1)
General case. We first state a result that holds for arbitrary activation function (sigmoid, ReLu, etc…). The main idea is to use the fact that the penalties considered in Table 1 are atomic [8]. Then, we can equivalently optimize (5) over the convex hull of a set of atoms and invoke Carathéodory’s theorem for convex hulls.
Let be an dimensional vector whose element is . Let us define the sets
(20) 
where we define the set as follows:

case:

case:

case: .
Then (5) is equivalent to
(21)  
where is the convex hull of the set . The matrices and are related to each other by
(22) 
for some such that and . By Carathéodory’s theorem for convex hulls, there exists with at most nonzero elements. Because elements of are matrices with a single nonzero row, contains at most nonzero rows (hidden units).
Case of constraint and squared activation. When , given s.t. , the output can be written as
(23) 
Following [5, Lemma 10], the nuclear norm of a symmetric matrix can be defined by
(24) 
and the minimum is attained by the eigendecomposition and .
Therefore, we can always compute the eigendecomposition of each
and use the eigenvectors as hidden units and the eigenvalues as output layer weights. Moreover, this solution is feasible, since eigenvectors belong to
and since the norm of all eigenvalues is minimized. Since a matrix can have at most eigenvalues, we can conclude that has at most elements. Combined with the previous result, has at most nonzero rows (hidden units).For the and penalties, we cannot make this argument, since applying the eigendecomposition might increase the penalty value and therefore make the solution infeasible.
b.2 Convergence analysis (Theorem 1)
In this section, we include a convergence analysis of the conditional gradient algorithm with multiplicative approximation in the linear minimization oracle. The proof follows mostly from [15] with a trick inspired from [1] to handle multiplicative approximations. Finally, we also include a detailed comparison with the analysis of GECO [26] and BlockFW [18].
We focus on constrained optimization problems of the form
(25) 
where is convex and smooth w.r.t. and .
Curvature and smoothness constants. The convergence analysis depends on the following standard curvature constant
(26) 
Intuitively, this is a measure of nonlinearity of : the maximum deviation between and its linear approximations over . The assumption of bounded is closely related to a smoothness assumption on . Following [15, Lemma 7], for any choice of norm , can be upperbounded by the smoothness constant as
(27) 
Using , we obtain
(28) 
and therefore
(29) 
Linear duality gap. Following [15], we define the linear duality gap
(30) 
Since is convex and differentiable, we have that
(31) 
Let us define the primal error
(32) 
Minimizing (31) w.r.t. on both sides we obtain
(33) 
Hence can be used as a certificate of optimality about .
Bounding progress. Let be the current iterate and be our update. The definition of implies
(34) 
We now use that is obtained by an exact linear minimization oracle (LMO)
(35) 
and therefore . Combined with , we obtain
(36) 
Subtracting on both sides, we finally get
(37) 
Primal convergence. Since we use a fullycorrective variant of the conditional gradient method, our algorithm enjoys a convergence rate at least as good as the variant with fixed step size. Following [15, Theorem 1] and using (29), for every , the iterates satisfy
(38) 
Thus, we can obtain an accurate solution if we run the algorithm for iterations.
Linear minimization with multiplicative approximation. We now extend the analysis to the case of approximate linear minimization. Given , we assume that an approximate LMO outputs a certain such that
(39) 
for some multiplicative factor (higher is more accurate). Since and are in , we have like before
(40) 
Following the same trick as [1, Appendix B], we now absorb the multiplicative factor in the constraint
(41) 
where we defined (i.e., the ball is shrunk by a factor ). We therefore obtain . Similarly as before, this implies that
(42) 
Subtracting on both sides, we get
(43) 
We thus get that iterate satisfies and
(44) 
We can therefore obtain an such that if we run our algorithm for iterations with constraint parameter and multiplicative factor . Put differently, we can obtain an such that