Models involving factorization or decomposition are ubiquitous across a wide variety of technical fields and application areas. As a simple example relevant to machine learning, various forms of matrix factorization are used in classical dimensionality reduction techniques such as Principal Component Analysis (PCA) and in more recent methods like non-negative matrix factorization or dictionary learning (Lee and Seung, 1999; Aharon et al., 2006; Mairal et al., 2010). In a typical matrix factorization problem, we might seek to find matrices $(U, V)$ such that the product $UV^T$ closely approximates a given data matrix $Y$ while at the same time requiring that $U$ and $V$ satisfy certain properties (e.g., non-negativity, sparseness, etc.). This naturally leads to an optimization problem of the form
$$\min_{U,V}\ \ell(Y, UV^T) + \lambda\,\Theta(U,V) \tag{1}$$
where $\ell$ is some function that measures how closely $Y$ is approximated by $UV^T$ and $\Theta$ is a regularization function to enforce the desired properties in $U$ and $V$. Unfortunately, aside from a few special cases (e.g., PCA), the vast majority of matrix factorization models suffer from the significant disadvantage that the associated optimization problems are non-convex and very challenging to solve. For example, in (1) even if we choose $\ell(Y, X)$ to be jointly convex in $(Y, X)$ and $\Theta(U,V)$ to be a convex function in $(U,V)$, the optimization problem is still typically a non-convex problem in $(U,V)$ due to the composition with the bilinear form $X = UV^T$.
Given this challenge, a common approach is to relax the non-convex factorization problem into a problem which is convex on the product of the factorized matrices, $X = UV^T$. As a concrete example, in low-rank matrix factorization, one might be interested in solving a problem of the form
$$\min_{U,V}\ \ell(Y, UV^T) \quad \text{s.t.} \quad \mathrm{rank}(UV^T) \le r \tag{2}$$
where the rank constraint can be easily enforced by limiting the number of columns in the $U$ and $V$ matrices to be less than or equal to $r$. However, aside from a few special choices of $\ell$, solving (2) is an NP-hard problem in general. Instead, one can relax (2) into a fully convex problem by using a convex regularization that promotes low-rank solutions, such as the nuclear norm $\|X\|_*$, and then solve
$$\min_{X}\ \ell(Y, X) + \lambda \|X\|_* \tag{3}$$
via a singular value decomposition. Unfortunately, however, while the nuclear norm provides a nice convex relaxation for low-rank matrix factorization problems, nuclear norm relaxation does not capture the full generality of problems such as (1), as it does not necessarily ensure that $X$ can be 'efficiently' factorized as $X = UV^T$ for some pair $(U,V)$ which has the desired properties encouraged by $\Theta$ (sparseness, non-negativity, etc.), nor does it provide a means to find the desired factors. To address these issues, in this paper we consider the task of solving non-convex optimization problems directly in the factorized space and use ideas inspired by the convex relaxation of matrix factorizations as a means to analyze the non-convex factorization problem. Our framework includes problems such as (1) as a special case but also applies much more broadly to a wide range of non-convex optimization problems, several of which we describe below.
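As a minimal sketch of how a nuclear norm relaxation can be solved with a singular value decomposition, the block below treats one special case only: it assumes the loss is the squared Frobenius error, $\ell(Y,X) = \tfrac{1}{2}\|Y - X\|_F^2$, which the text does not specify. Under that assumption the minimizer is obtained by soft-thresholding the singular values of $Y$. The function name `svt` and the test data are illustrative.

```python
import numpy as np

def svt(Y, lam):
    """Minimizer of 0.5*||Y - X||_F^2 + lam*||X||_* (assumed squared loss).

    The solution keeps the singular vectors of Y and shrinks each
    singular value toward zero by lam.
    """
    A, s, Bt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)  # soft-threshold the spectrum
    return (A * s_shrunk) @ Bt           # scale columns of A, then recombine

rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))  # rank-3 data
X = svt(Y, lam=1.0)
```

The result is low rank, but nothing in this procedure controls other properties of the factors (sparsity, non-negativity), which is the limitation the paragraph above describes.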
1.1 Generalized Factorization
More generally, tensor factorization models provide a natural extension to matrix factorization and have been employed in a wide variety of applications (Cichocki et al., 2009; Kolda and Bader, 2009). The resulting optimization problem is similar to matrix factorization, with the difference that we now consider more general factorizations which decompose a multidimensional tensor $X$ into a set of $K$ different factors $(X^1, \ldots, X^K)$, where each factor is also possibly a multidimensional tensor. These factors are then combined via an arbitrary multilinear mapping $\Phi(X^1, \ldots, X^K) = X$; i.e., $\Phi$ is a linear function in each $X^i$ term if the other $X^j$, $j \neq i$, terms are held constant. This model then typically gives optimization problems of the form
$$\min_{X^1,\ldots,X^K}\ \ell(Y, \Phi(X^1,\ldots,X^K)) + \lambda\,\Theta(X^1,\ldots,X^K) \tag{4}$$
where again $\ell$ might measure how closely $Y$ is approximated by the tensor $X$ and $\Theta$ encourages the factors $(X^1, \ldots, X^K)$ to satisfy certain requirements. Clearly, (4) is a generalization of (1) by taking $(X^1, X^2) = (U, V)$ and $\Phi(U,V) = UV^T$, and similar to matrix factorization, the optimization problem given by (4) will typically be non-convex regardless of the choice of $\Theta$ and $\ell$ functions due to the multilinear mapping $\Phi$.
While the tensor factorization framework is very general with regard to the dimensionalities of the data and the factors, a tensor factorization usually implies the assumption that the mapping $\Phi$ from the factorized space to the output space (the codomain of $\Phi$) is multilinear. However, if we consider more general mappings from the factorized space into the output space (i.e., mappings $\Phi$ which are not restricted to be multilinear) then we can capture a much broader array of models in the 'factorized model' family. For example, in deep neural network training the output of the network is typically generated by performing an alternating series of a linear function followed by a non-linear function. More concretely, if one is given training data consisting of $N$ data points of $d$ dimensional data, $V \in \mathbb{R}^{N \times d}$, and an associated vector of desired outputs $Y \in \mathbb{R}^N$, the goal then is to find a set of network parameters $(X^1, \ldots, X^K)$ by solving an optimization problem of the form (4) using a mapping
$$\Phi(X^1,\ldots,X^K) = \psi_K(\cdots \psi_2(\psi_1(V X^1) X^2) \cdots X^K) \tag{5}$$
where each $X^i$ factor is an appropriately sized matrix and the $\psi_i(\cdot)$ functions apply some form of non-linearity after each matrix multiplication, e.g., a sigmoid function, rectification, or max-pooling. Note that although here we have shown the linear operations to be simple matrix multiplications for notational simplicity, this is easily generalized to other linear operators (e.g., in a convolutional network each linear operator could be a set of convolutions with a group of various kernels with parameters contained in the $X^i$ variables).
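The alternating linear/non-linear structure described above can be sketched in a few lines. The names `network_mapping`, `relu`, and `identity`, the layer sizes, and the choice of a linear final layer are all illustrative assumptions, not taken from the paper.

```python
import numpy as np

def relu(Z):
    """Element-wise linear rectification."""
    return np.maximum(Z, 0.0)

def network_mapping(V, factors, nonlinearities):
    """Apply psi_K(... psi_2(psi_1(V X^1) X^2) ... X^K).

    V: (N, d) data matrix; factors: list of weight matrices X^1..X^K;
    nonlinearities: list of element-wise functions psi_1..psi_K.
    """
    out = V
    for X_k, psi_k in zip(factors, nonlinearities):
        out = psi_k(out @ X_k)  # linear map followed by a non-linearity
    return out

rng = np.random.default_rng(0)
V = rng.standard_normal((8, 5))                      # N=8 points, d=5
factors = [rng.standard_normal((5, 4)),
           rng.standard_normal((4, 3)),
           rng.standard_normal((3, 1))]
identity = lambda Z: Z                               # linear output layer
outputs = network_mapping(V, factors, [relu, relu, identity])
```

With ReLU and linear layers, scaling every factor by a constant $\alpha \ge 0$ scales the output by $\alpha^K$, which is the positive homogeneity property the paper relies on later.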
1.2 Paper Contributions
Our primary contribution is to extend ideas from convex matrix factorization and present a general framework which allows for a wide variety of factorization problems to be analyzed within a convex formulation. Specifically, using this convex framework we are able to show that local minima of the non-convex factorization problem achieve the global minimum if they satisfy a simple condition. Further, we also show that if the factorization is done with factorized variables of sufficient size, then from any initialization it is always possible to reach a global minimizer using purely local descent search strategies.
Two concepts are key to our analysis framework: 1) the size of the factorized elements is not constrained, but instead fit to the data through regularization (for example, the number of columns in $U$ and $V$ is allowed to change in matrix factorization), and 2) we require that the mapping from the factorized elements to the final output, $\Phi$, satisfies a positive homogeneity property. Interestingly, the deep learning field has increasingly moved to using non-linearities such as Rectified Linear Units (ReLU) and Max-Pooling, both of which satisfy the positive homogeneity property, and it has been noted empirically that both the speed of training the neural network and the overall performance of the network are increased significantly when ReLU non-linearities are used instead of the more traditional hyperbolic tangent or sigmoid non-linearities (Dahl et al., 2013; Maas et al., 2013; Krizhevsky et al., 2012; Zeiler et al., 2013). We suggest that our framework provides a partial theoretical explanation of this phenomenon and also offers directions of future research which might be beneficial in improving the performance of multilayer neural networks.
2 Prior Work
Despite the significant empirical success and wide ranging applications of the models discussed above (and many others not discussed), as we have mentioned, the vast majority of the above techniques suffer from the significant disadvantage that the associated optimization problems are non-convex and very challenging to solve. As a result, the numerical optimization algorithms often used to solve factorization problems, including (but certainly not limited to) alternating minimization, gradient descent, stochastic gradient descent, block coordinate descent, back-propagation, and quasi-Newton methods, are typically only guaranteed to converge to a critical point or local minimum of the objective function (Mairal et al., 2010; Rumelhart et al., 1988; Ngiam et al., 2011; Wright and Nocedal, 1999; Xu and Yin, 2013). The nuclear norm relaxation of low-rank matrix factorization discussed above provides a means to solve factorization problems with regularization promoting low-rank solutions [Footnote 1: Similar convex relaxation techniques have also been proposed for low-rank tensor factorizations, but in the case of tensors finding a final factorization from a low-rank tensor can still be a challenging problem (Tomioka et al., 2010; Gandy et al., 2011).], but it fails to capture the full generality of problems such as (1) as it does not allow one to find factors, $(U,V)$, with the desired properties encouraged by $\Theta$ (sparseness, non-negativity, etc.). To address this issue, several studies have explored a more general convex relaxation via the matrix norm given by
$$\|X\|_{u,v} \equiv \inf_{r \in \mathbb{N}_+}\ \inf_{U,V\,:\,UV^T = X}\ \sum_{i=1}^r \|U_i\|_u \|V_i\|_v \tag{6}$$
where $(U_i, V_i)$ denotes the $i$'th columns of $U$ and $V$, $\|\cdot\|_u$ and $\|\cdot\|_v$ are arbitrary vector norms, and the number of columns ($r$) in the $U$ and $V$ matrices is allowed to be variable (Bach et al., 2008; Bach, 2013; Haeffele et al., 2014). The norm in (6) has appeared under multiple names in the literature, including the projective tensor norm, decomposition norm, and atomic norm, and by replacing the column norms in (6) with gauge functions the formulation can be generalized to incorporate additional regularization on $(U,V)$, such as non-negativity, while still being a convex function of $X$ (Bach, 2013). Further, it is worth noting that for particular choices of the $\|\cdot\|_u$ and $\|\cdot\|_v$ vector norms, $\|X\|_{u,v}$ reverts to several well known matrix norms and thus provides a generalization of many commonly used regularizers. Notably, when the vector norms are both $\ell_2$ norms, $\|X\|_{2,2} = \|X\|_*$, and the form in (6) is the well known variational definition of the nuclear norm.
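The variational definition of the nuclear norm mentioned above can be checked numerically. The sketch below builds a factorization from the SVD, $U = A\,\mathrm{diag}(\sqrt{s})$ and $V = B\,\mathrm{diag}(\sqrt{s})$, whose column-norm cost equals the sum of singular values, and compares it with a deliberately naive factorization $X = X I^T$. The helper name `factor_cost` is illustrative.

```python
import numpy as np

def factor_cost(U, V):
    """Sum over columns of ||U_i||_2 * ||V_i||_2, the objective inside (6)."""
    return float(np.sum(np.linalg.norm(U, axis=0) * np.linalg.norm(V, axis=0)))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))

A, s, Bt = np.linalg.svd(X, full_matrices=False)
nuclear_norm = float(s.sum())

# Balanced factorization from the SVD: attains the infimum.
U = A * np.sqrt(s)
V = Bt.T * np.sqrt(s)

# A valid but suboptimal factorization: X = X @ I^T.
naive_cost = factor_cost(X, np.eye(4))
```

Any factorization of $X$ gives an upper bound on the norm, so `naive_cost` can only be greater than or equal to `nuclear_norm`.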
The $\|\cdot\|_{u,v}$ norm has the appealing property that by an appropriate choice of vector norms $\|\cdot\|_u$ and $\|\cdot\|_v$ (or more generally gauge functions), one can promote desired properties in the factorized matrices $(U,V)$ while still working with a problem which is convex w.r.t. the product $X = UV^T$. Based on this concept, several studies have explored optimization problems over factorized matrices of the form
$$\min_{U,V}\ \ell(Y, UV^T) + \lambda \|UV^T\|_{u,v} \tag{7}$$
Even though the problem is still non-convex w.r.t. the factorized matrices $(U,V)$, it can be shown using ideas from Burer and Monteiro (2005) on factorized semidefinite programming that, subject to a few general conditions, local minima of (7) will be global minima (Bach et al., 2008; Haeffele et al., 2014), which can significantly reduce the dimensionality of some large scale optimization problems. Unfortunately, aside from a few special cases, the norm defined by (6) (and related regularization functions such as those discussed by Bach (2013)) cannot be evaluated efficiently, much less optimized over, due to the complicated and non-convex nature of the definition. As a result, in practice one is often forced to replace (7) by the closely related problem
$$\min_{U,V}\ \ell(Y, UV^T) + \lambda \sum_{i=1}^r \|U_i\|_u \|V_i\|_v \tag{8}$$
However, (7) and (8) are not equivalent problems, due to the fact that solutions to (7) include any factorization $(U,V)$ such that their product equals the optimal solution, $UV^T = X_{opt}$, while in (8) one is specifically searching for a factorization $(U,V)$ that achieves the infimum in (6); in brief, solutions to (8) will be solutions to (7), but the converse is not true. As a consequence, results guaranteeing that local minima of the form (7) will be global minima cannot be applied to the formulation in (8), which is typically more useful in practice. Here we focus our analysis on the more commonly used family of problems, such as (8), and show that similar guarantees can be provided regarding the global optimality of local minima. Additionally, we show that these ideas can be significantly extended to a very wide range of non-convex models and regularization functions, with applications such as tensor factorization and certain forms of neural network training being additional special cases of our framework.
In the context of neural networks, Bengio et al. (2005) showed that for neural networks with a single hidden layer, if the number of neurons in the hidden layer is not fixed, but instead fit to the data through a sparsity inducing regularization, then the process of training a globally optimal neural network is analogous to selecting a finite number of hidden units from the infinite dimensional space of all possible hidden units and taking a weighted summation of these units to produce the output. Further, these ideas have very recently been used to analyze the generalization performance of such networks (Bach, 2014). Here, our results take a similar approach and extend these ideas to certain forms of multi-layer neural networks. Additionally, our framework provides sufficient conditions on the network architecture to guarantee that from any initialization a globally optimal solution can be found by performing purely local descent on the network weights.
Before we present our main results, we first describe our notation system and recall a few definitions.
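Our formulation is fairly general with regard to the dimensionality of the data and factorized variables. As a result, to simplify the notation, we will use capital letters as a shorthand for a set of dimensions, and individual dimensions will be denoted with lower case letters. For example, $X \in \mathbb{R}^{d_1 \times \cdots \times d_N} \equiv X \in \mathbb{R}^{D}$ for $D = d_1 \times \cdots \times d_N$; we also denote the cardinality of $D$ as $\mathrm{card}(D) = \prod_{i=1}^N d_i$. Similarly, $X \in \mathbb{R}^{d_1 \times \cdots \times d_N \times r} \equiv X \in \mathbb{R}^{D \times r}$ for $D = d_1 \times \cdots \times d_N$ and $r \in \mathbb{N}$.
Given an element from a tensor space, we will use a subscript to denote a slice of the tensor along the last dimension. For example, given a matrix $X \in \mathbb{R}^{d_1 \times r}$, then $X_i$, $i \in \{1, \ldots, r\}$, denotes the $i$'th column of $X$. Similarly, given a cube $X \in \mathbb{R}^{d_1 \times d_2 \times r}$ then $X_i$, $i \in \{1, \ldots, r\}$, denotes the $i$'th slice along the third dimension. Further, given two tensors with matching dimensions except for the last dimension, $X \in \mathbb{R}^{D \times r_x}$ and $Z \in \mathbb{R}^{D \times r_z}$, we will use $[X\ Z] \in \mathbb{R}^{D \times (r_x + r_z)}$ to denote the concatenation of the two tensors along the last dimension.
We denote the dot product between two elements from a tensor space as $\langle X, Z \rangle = \mathrm{vec}(X)^T \mathrm{vec}(Z)$, where $\mathrm{vec}(\cdot)$ denotes flattening the tensor into a vector. For a function $\theta(x)$, we denote its image as $\mathrm{Im}(\theta)$ and its Fenchel dual as $\theta^*(x) \equiv \sup_z \langle x, z \rangle - \theta(z)$. The gradient of a differentiable function $\theta(x)$ is denoted $\nabla \theta(x)$, and the subgradient of a convex (but possibly non-differentiable) function $\theta(x)$ is denoted $\partial \theta(x)$. For a differentiable function with multiple variables $\theta(x^1, \ldots, x^K)$, we will use $\nabla_{x^i} \theta(x^1, \ldots, x^K)$ to denote the portion of the gradient corresponding to $x^i$. The space of non-negative real numbers is denoted $\mathbb{R}_+$, and the space of positive integers is denoted $\mathbb{N}_+$.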
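We now make/recall a few general definitions and well known facts which will be used in our analysis. A size-$r$ set of $K$ factors $(X^1, \ldots, X^K)_r$ is defined to be a set of $K$ tensors where the final dimension of each tensor is equal to $r$. This is to be interpreted as $(X^1, \ldots, X^K)_r \in \mathbb{R}^{D^1 \times r} \times \cdots \times \mathbb{R}^{D^K \times r}$. The indicator function of a set $\mathcal{C}$ is defined as
$$\delta_{\mathcal{C}}(x) = \begin{cases} 0 & x \in \mathcal{C} \\ \infty & x \notin \mathcal{C} \end{cases} \tag{9}$$
A function $\theta : \mathbb{R}^{D^1} \times \cdots \times \mathbb{R}^{D^N} \to \mathbb{R}^{D}$ is positively homogeneous with degree $p$ if $\theta(\alpha x^1, \ldots, \alpha x^N) = \alpha^p\, \theta(x^1, \ldots, x^N)\ \forall \alpha \ge 0$. Note that this definition also implies that $\theta(0, \ldots, 0) = 0$ for $p \neq 0$.
A function $\theta : \mathbb{R}^{D^1} \times \cdots \times \mathbb{R}^{D^N} \to \mathbb{R}_+$ is positive semidefinite if $\theta(0, \ldots, 0) = 0$ and $\theta(x^1, \ldots, x^N) \ge 0\ \forall (x^1, \ldots, x^N)$. The one-sided directional derivative of a function $\theta(x)$ at a point $x$ in the direction $z$ is denoted $d\theta(x)(z)$ and defined as $d\theta(x)(z) \equiv \lim_{\epsilon \searrow 0} \epsilon^{-1}(\theta(x + \epsilon z) - \theta(x))$. Also, recall that for a differentiable function $\theta(x)$, $d\theta(x)(z) = \langle \nabla \theta(x), z \rangle$.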
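The positive homogeneity definition above is easy to probe numerically. The following sketch samples random points and scalings and checks $f(\alpha x) = \alpha^p f(x)$; the helper name `is_pos_homogeneous` and the chosen test functions are illustrative. A passing check is only numerical evidence, not a proof, whereas a single failure does disprove the property.

```python
import numpy as np

def is_pos_homogeneous(f, p, dim, trials=100, seed=0):
    """Randomized check of f(alpha * x) == alpha**p * f(x) for alpha >= 0."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x = rng.standard_normal(dim)
        alpha = rng.uniform(0.0, 5.0)
        if not np.allclose(f(alpha * x), alpha**p * f(x)):
            return False
    return True

relu = lambda x: np.maximum(x, 0.0)         # degree 1
max_pool = lambda x: np.max(x)              # degree 1
sq_norm = lambda x: x @ x                   # degree 2
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))  # not positively homogeneous

results = {
    "relu": is_pos_homogeneous(relu, 1, 4),
    "max_pool": is_pos_homogeneous(max_pool, 1, 4),
    "sq_norm": is_pos_homogeneous(sq_norm, 2, 4),
    "sigmoid": is_pos_homogeneous(sigmoid, 1, 4),
}
```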
4 Problem Formulation
Returning to the motivating example from the introduction (4), we now define the family of mapping functions from the factors into the output space and the family of regularization functions on the factors ($\Phi$ and $\Theta$, respectively) which we will study in our framework.
4.1 Factorization Mappings
In this paper, we consider mappings $\Phi$ which are based on a sum of what we refer to as an elemental mapping. Specifically, if we are given a size-$r$ set of $K$ factors $(X^1, \ldots, X^K)_r$, the elemental mapping $\phi$ takes a slice along the last dimension from each tensor in the set of factors and maps it into the output space. We then define the full mapping to be the sum of these elemental mappings along each of the $r$ slices in the set of factors. The only requirement we impose on the elemental mapping is that it must be positively homogeneous. More formally, an elemental mapping, $\phi : \mathbb{R}^{D^1} \times \cdots \times \mathbb{R}^{D^K} \to \mathbb{R}^{D}$, is any mapping which is positively homogeneous with degree $p \neq 0$. The $r$-element factorization mapping $\Phi_r : \mathbb{R}^{D^1 \times r} \times \cdots \times \mathbb{R}^{D^K \times r} \to \mathbb{R}^{D}$ is defined as
$$\Phi_r(X^1, \ldots, X^K) = \sum_{i=1}^r \phi(X_i^1, \ldots, X_i^K) \tag{10}$$
As we do not place any restrictions on the elemental mapping, $\phi$, beyond the requirement that it must be positively homogeneous, there is a wide range of problems that can be captured by a mapping of the form (10). Several example problems which can be placed in this framework include:
Matrix Factorization: The elemental mapping,
$$\phi(u, v) = uv^T \tag{11}$$
is positively homogeneous with degree 2 and $\Phi_r(U, V) = \sum_{i=1}^r U_i V_i^T = UV^T$ is simply matrix multiplication for matrices with $r$ columns.
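To make the matrix factorization case concrete, the sketch below writes the elemental mapping as an outer product of one column from each factor and confirms that summing over the columns reproduces the ordinary matrix product. The names `phi` and `Phi` mirror the elemental and full mappings; the shapes are arbitrary.

```python
import numpy as np

def phi(u, v):
    """Elemental mapping for matrix factorization: a rank-one outer product."""
    return np.outer(u, v)

def Phi(U, V):
    """Sum of elemental mappings over matching columns of U and V."""
    return sum(phi(U[:, i], V[:, i]) for i in range(U.shape[1]))

rng = np.random.default_rng(0)
U = rng.standard_normal((4, 3))
V = rng.standard_normal((5, 3))
X = Phi(U, V)
```

Scaling both arguments of `phi` by $\alpha$ scales the output by $\alpha^2$, which is the degree-2 positive homogeneity stated in the text.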
Tensor Decomposition - CANDECOMP/PARAFAC (CP): Slightly more generally, the elemental mapping
$$\phi(x^1, \ldots, x^K) = x^1 \otimes \cdots \otimes x^K \tag{12}$$
(where $\otimes$ denotes the tensor outer product) results in $\Phi_r(X^1, \ldots, X^K)$ being the mapping used in the rank-$r$ CANDECOMP/PARAFAC (CP) tensor decomposition model (Kolda and Bader, 2009). Further, instead of choosing $\phi$ to be a simple outer product, we can also generalize this to be any multilinear function of the factors $(X_i^1, \ldots, X_i^K)$. [Footnote 2: We note that more general tensor decompositions, such as the general form of the Tucker decomposition, do not explicitly fit inside the framework we describe here; however, by using similar arguments to the ones we develop here, it is possible to show analogous results to those we derive in this paper for more general tensor decompositions, which we do not show for clarity of presentation.]
Neural Networks with Rectified Linear Units (ReLU): Let $\psi^+(x) \equiv \max\{x, 0\}$ be the linear rectification function, which is applied element-wise to a tensor $x$ of arbitrary dimension. Then if we are given a matrix of training data $V \in \mathbb{R}^{N \times d_1}$, the elemental mapping
$$\phi(x^1, x^2) = \psi^+(V x^1)(x^2)^T \tag{13}$$
results in a mapping $\Phi_r(X^1, X^2) = \psi^+(V X^1)(X^2)^T$, which can be interpreted as producing the $d_2$ outputs of a 3 layer neural network with $r$ hidden units in response to the input of $N$ data points of $d_1$ dimensional data, $V$. The hidden units have a ReLU non-linearity; the other units are linear; and the matrices $(X^1, X^2) \in \mathbb{R}^{d_1 \times r} \times \mathbb{R}^{d_2 \times r}$ contain the connection weights from the input-to-hidden and hidden-to-output layers, respectively.
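The single-hidden-layer ReLU example can be sketched the same way. The code assumes the elemental mapping takes one input-to-hidden weight vector and one hidden-to-output weight vector (one hidden unit), applies the rectification to the hidden pre-activation, and forms an outer product; summing over hidden units gives the full network output. Names such as `phi_relu`, `Phi_relu`, and the dimensions are illustrative.

```python
import numpy as np

def relu(Z):
    return np.maximum(Z, 0.0)

def phi_relu(V, x1, x2):
    """Contribution of one hidden unit: relu(V x1) x2^T, shape (N, d2)."""
    return np.outer(relu(V @ x1), x2)

def Phi_relu(V, X1, X2):
    """Sum of hidden-unit contributions over the r columns of X1 and X2."""
    return sum(phi_relu(V, X1[:, i], X2[:, i]) for i in range(X1.shape[1]))

rng = np.random.default_rng(0)
N, d1, d2, r = 7, 5, 2, 4
V = rng.standard_normal((N, d1))    # training data
X1 = rng.standard_normal((d1, r))   # input-to-hidden weights
X2 = rng.standard_normal((d2, r))   # hidden-to-output weights
out = Phi_relu(V, X1, X2)
```

The sum over units equals the usual forward pass `relu(V @ X1) @ X2.T`, and scaling both weight vectors of a unit by $\alpha \ge 0$ scales its contribution by $\alpha^2$.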
By utilizing more complicated definitions of $\phi$, it is possible to consider a broad range of neural network architectures. As a simple example of networks with multiple hidden layers, an elemental mapping such as
$$\phi(x^1, x^2, x^3, x^4) = \psi^+(\psi^+(\psi^+(V x^1) x^2) x^3)(x^4)^T \tag{14}$$
gives a mapping $\Phi_r(X^1, X^2, X^3, X^4)$ which is the output of a 5 layer neural network in response to the inputs in the $V$ matrix with ReLU non-linearities on all of the hidden layer units. In this case, the network has the architecture that there are $r$ parallel 4 layer fully-connected subnetworks, with each subnetwork having the same number of units in each layer as defined by the dimensions of the factor slices. The subnetworks are all then fed into a fully connected linear layer to produce the output.
More generally still, since any positively homogeneous transformation is a potential elemental mapping, by an appropriate definition of $\phi$, one can describe neural networks with very general architectures, provided the non-linearities in the network are compatible with positive homogeneity. Note that max-pooling and rectification are both positively homogeneous and thus fall within our framework. For example, the well-known ImageNet network from Krizhevsky et al. (2012), which consists of a series of convolutional layers, linear-rectification, max-pooling layers, response normalization layers, and fully connected layers, can be described by taking $r = 1$ and defining $\phi$ to be the entire transformation of the network (with the removal of the response normalization layers, which are not positively homogeneous). Note, however, that our results will rely on $r$ potentially changing size or being initialized to be sufficiently large, which limits the applicability of our results to current state-of-the-art network architectures (see discussion).
Here we have provided a few examples of common factorization mappings that can be cast in form (10), but certainly there is a wide variety of other problems for which our framework is relevant. Additionally, while all of the mappings described above are positively homogeneous with degree equal to the degree of the factorization ($K$), this is not a requirement; $p \neq 0$ is sufficient. For example, non-linearities such as a rectification followed by raising each element to a non-zero power are positively homogeneous but of a possibly different degree. What will turn out to be essential, however, is that we require $p$ to match the degree of positive homogeneity used to regularize the factors, which we will discuss in the next section.
4.2 Factorization Regularization
Inspired by the ideas from structured convex matrix factorization, instead of trying to analyze the optimization over a size-$r$ set of $K$ factors $(X^1, \ldots, X^K)_r$ for a fixed $r$, we instead consider the optimization problem where $r$ is possibly allowed to vary and is adapted to the data through regularization. To do so, we will define a regularization function similar to the $\|\cdot\|_{u,v}$ norm discussed in matrix factorization which is convex with respect to the output tensor but which still allows for regularization to be placed on the factors. Similar to our definition in (10), we will begin by first defining an elemental regularization function $\theta$ which takes as input slices of the factorized tensors along the last dimension and returns a non-negative number. The requirements we place on $\theta$ are that it must be positively homogeneous and positive semidefinite. Formally, we define an elemental regularization function $\theta : \mathbb{R}^{D^1} \times \cdots \times \mathbb{R}^{D^K} \to \mathbb{R}_+ \cup \infty$ to be any function which is positive semidefinite and positively homogeneous.
Again, due to the generality of the framework, there are a wide variety of possible elemental regularization functions. We highlight two positive semidefinite, positively homogeneous functions which are commonly used and note that functions can be composed with summations, multiplications, and raising to non-zero powers to change the degree of positive homogeneity and combine various functions.
Norms: Any norm $\|x\|$ is positively homogeneous with degree 1. Note that because we make no requirement of convexity on $\theta$, this framework can also include functions such as the $\ell_q$ pseudo-norms for $q \in (0, 1)$.
Conic Indicators: The indicator function $\delta_{\mathcal{C}}(x)$ of any conic set $\mathcal{C}$ is positively homogeneous for all degrees. Recall that a conic set, $\mathcal{C}$, is simply any set such that if $x \in \mathcal{C}$ then $\alpha x \in \mathcal{C}\ \forall \alpha \ge 0$. A few popular conic sets which can be of interest include the non-negative orthant $\mathbb{R}^D_+$, the kernel of a linear operator $\{x : Ax = 0\}$, inequality constraints for a linear operator $\{x : Ax \ge 0\}$, and the set of positive semidefinite matrices. Constraints on the non-zero support of $x$ are also typically conic sets. For example, the set $\{x : \|x\|_0 \le n\}$ is a conic set, where $\|x\|_0$ is simply the number of non-zero elements in $x$ and $n$ is a positive integer. More abstractly, conic sets can also be used to enforce invariances w.r.t. positively homogeneous transformations. For example, given two positively homogeneous functions $\theta(x), \theta'(x)$ with equal degrees of positive homogeneity, the sets $\{x : \theta(x) = \theta'(x)\}$ and $\{x : \theta(x) \ge \theta'(x)\}$ are also conic sets.
A few typical formulations of a $\theta$ which are positively homogeneous with degree $K$ might include:
$$\prod_{i=1}^K \|x^i\|_{(i)} \tag{15}$$
$$\frac{1}{K} \sum_{i=1}^K \|x^i\|_{(i)}^K \tag{16}$$
$$\prod_{i=1}^K \left( \|x^i\|_{(i)} + \delta_{\mathcal{C}_i}(x^i) \right) \tag{17}$$
where all of the norms, $\|\cdot\|_{(i)}$, are arbitrary. Forms (15) and (16) can be shown to be equivalent, in the sense that they give rise to the same regularization function on the output, for all of the example mappings we have discussed here, and by an appropriate choice of norm can induce various properties in the factorized elements (such as sparsity), while form (17) is similar but additionally constrains each factor to be an element of a conic set $\mathcal{C}_i$ (see Bach et al., 2008; Bach, 2013; Haeffele et al., 2014, for examples from matrix factorization).
To define our regularization function on the output tensor, $X$, it will be necessary that the elemental regularization function, $\theta$, and the elemental mapping, $\phi$, satisfy a few properties to be considered 'compatible' for the definition of our regularization function. Specifically, we will require the following definition. Given an elemental mapping $\phi$ and an elemental regularization function $\theta$, we will say that $(\phi, \theta)$ are a nondegenerate pair if 1) $\theta$ and $\phi$ are both positively homogeneous with degree $p$, for some $p \neq 0$, and 2) $\forall X \in \mathrm{Im}(\phi) \setminus 0$, $\exists \mu \in (0, \infty]$ and $(\tilde z^1, \ldots, \tilde z^K)$ such that $\phi(\tilde z^1, \ldots, \tilde z^K) = X$, $\theta(\tilde z^1, \ldots, \tilde z^K) = \mu$, and $\theta(z^1, \ldots, z^K) \ge \mu$ for all $(z^1, \ldots, z^K)$ such that $\phi(z^1, \ldots, z^K) = X$. [Footnote 3: Property 1 from the definition of a nondegenerate pair will be critical to our formulation. Several of our results can be shown without Property 2, but Property 2 is almost always satisfied for most interesting choices of $(\phi, \theta)$ and is designed to avoid 'pathological' functions. For example, in matrix factorization with $\phi(u, v) = uv^T$, taking $\theta(u, v) = \|u\|^2 + \delta_{\mathcal{C}}(v)$ for any arbitrary norm and conic set satisfies Property 1 but not Property 2, as we can always reduce the value of $\theta$ by scaling $u$ by a constant $\alpha \in (0, 1)$ and scaling $v$ by $\alpha^{-1}$ without changing the value of $\phi$.]
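From this, we now define our main regularization function: Given an elemental mapping $\phi$ and an elemental regularization function $\theta$ such that $(\phi, \theta)$ are a nondegenerate pair, we define the factorization regularization function, $\Omega_{\phi,\theta} : \mathbb{R}^D \to \mathbb{R}_+ \cup \infty$, to be
$$\Omega_{\phi,\theta}(X) \equiv \inf_{r \in \mathbb{N}_+}\ \inf_{(X^1, \ldots, X^K)_r}\ \sum_{i=1}^r \theta(X_i^1, \ldots, X_i^K) \quad \text{s.t.} \quad \Phi_r(X^1, \ldots, X^K) = X \tag{18}$$
with the additional condition that $\Omega_{\phi,\theta}(X) = \infty$ if $X \notin \bigcup_r \mathrm{Im}(\Phi_r)$.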
We will show that $\Omega_{\phi,\theta}$ is a convex function of $X$ and that in general the infimum in (18) can always be achieved with a finitely sized factorization (i.e., $r$ does not need to approach $\infty$). [Footnote 4: In particular, the largest $r$ needs to be is $\mathrm{card}(D)$, and we note that $\mathrm{card}(D)$ is a worst case upper bound on the size of the factorization. In certain cases the bound can be shown to be lower. As an example, $\Omega_{\phi,\theta}(X) = \|X\|_*$ when $\phi(u, v) = uv^T$ and $\theta(u, v) = \|u\|_2 \|v\|_2$. In this case the infimum can be achieved with $r \le \mathrm{rank}(X)$.] While $\Omega_{\phi,\theta}$ suffers from many of the practical issues associated with the matrix norm $\|\cdot\|_{u,v}$ discussed earlier (namely that in general it cannot be evaluated in polynomial time due to the complicated definition), because $\Omega_{\phi,\theta}$ is a convex function on $X$, this allows us to use $\Omega_{\phi,\theta}$ purely as an analysis tool to derive results for a more tractable factorized formulation.
4.3 Problem Definition
To build our analysis, we will start by defining the convex (but typically non-tractable) problem, given by
$$\min_{X, Q}\ F(X, Q) \equiv \ell(Y, X, Q) + \lambda\, \Omega_{\phi,\theta}(X) + H(Q) \tag{19}$$
Here $X$ is the output of the factorization mapping $X = \Phi(X^1, \ldots, X^K)$ as we have been discussing, and the $Q$ term is an optional additional set of non-factorized variables which can be helpful in modeling some problems (for example, to add intercept terms or to model outliers in the data). For our analysis we will assume the following:
1. $\ell(Y, X, Q)$ is once differentiable and jointly convex in $(X, Q)$.
2. $H(Q)$ is convex (but possibly non-differentiable).
3. The minimum of $F(X, Q)$ exists.
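As noted above, it is typically impractical to optimize over functions involving $\Omega_{\phi,\theta}(X)$, and, even if one were given an optimal solution to (19), $X_{opt}$, one would still need to solve the problem given in (18) to recover the desired factors. Therefore, we use (19) merely as an analysis tool and instead tailor our results to the non-convex optimization problem given by
$$\min_{(X^1, \ldots, X^K)_r,\ Q}\ f_r(X^1, \ldots, X^K, Q) \equiv \ell(Y, \Phi_r(X^1, \ldots, X^K), Q) + \lambda \sum_{i=1}^r \theta(X_i^1, \ldots, X_i^K) + H(Q) \tag{20}$$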
We will show in the next section that any local minimum of (20) is a global minimum if it satisfies the condition that one slice from each of the factorized tensors is all zero. Further, we will also show that if $r$ is taken to be large enough then from any initialization we can always find a global minimum of (20) by doing an optimization based purely on local descent.
5 Main Analysis
We begin our analysis by first showing a few simple properties and lemmas relevant to our framework.
5.1 Preliminary Results
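First, from the definition of $\Phi_r$ it is easy to verify that if $\phi$ is positively homogeneous with degree $p$, then $\Phi_r$ is also positively homogeneous with degree $p$ and satisfies the following proposition: Given a size-$r_x$ set of $K$ factors, $(X^1, \ldots, X^K)_{r_x}$, and a size-$r_z$ set of $K$ factors, $(Z^1, \ldots, Z^K)_{r_z}$, then
$$\Phi_{(r_x + r_z)}([X^1\ Z^1], \ldots, [X^K\ Z^K]) = \Phi_{r_x}(X^1, \ldots, X^K) + \Phi_{r_z}(Z^1, \ldots, Z^K) \tag{21}$$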
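where recall, $[X\ Z]$ denotes the concatenation of $X$ and $Z$ along the final dimension of the tensor. Further, $\Omega_{\phi,\theta}(X)$ satisfies the following proposition: The function $\Omega_{\phi,\theta} : \mathbb{R}^D \to \mathbb{R}_+ \cup \infty$ as defined in (18) has the properties
1. $\Omega_{\phi,\theta}(0) = 0$ and $\Omega_{\phi,\theta}(X) > 0\ \forall X \neq 0$.
2. $\Omega_{\phi,\theta}$ is positively homogeneous with degree 1.
3. $\Omega_{\phi,\theta}(X + Z) \le \Omega_{\phi,\theta}(X) + \Omega_{\phi,\theta}(Z)\ \forall (X, Z)$.
4. $\Omega_{\phi,\theta}$ is convex w.r.t. $X$.
5. The infimum in (18) can be achieved with $r \le \mathrm{card}(D)\ \forall X$ s.t. $\Omega_{\phi,\theta}(X) < \infty$.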
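The concatenation proposition stated just before this list can be illustrated for the matrix factorization mapping. The sketch below assumes $\Phi_r(U,V) = UV^T$ and checks that concatenating two sets of factors along the last dimension adds their outputs, and that appending an all-zero slice to each factor leaves the output unchanged. Variable names and shapes are illustrative.

```python
import numpy as np

def Phi(U, V):
    """Matrix factorization mapping: sum of column outer products, U V^T."""
    return U @ V.T

rng = np.random.default_rng(0)
Ux, Vx = rng.standard_normal((4, 2)), rng.standard_normal((3, 2))  # size-2 factors
Uz, Vz = rng.standard_normal((4, 3)), rng.standard_normal((3, 3))  # size-3 factors

# Concatenate along the last dimension (columns), giving size-5 factors.
concat_out = Phi(np.hstack([Ux, Uz]), np.hstack([Vx, Vz]))
sum_out = Phi(Ux, Vx) + Phi(Uz, Vz)

# Appending a zero slice to every factor does not change the output.
padded_out = Phi(np.hstack([Ux, np.zeros((4, 1))]),
                 np.hstack([Vx, np.zeros((3, 1))]))
```

The zero-slice case is worth noting because the optimality condition discussed earlier asks for one slice of each factorized tensor to be all zero.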
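1) By definition and the fact that $\theta$ is positive semidefinite, we always have $\Omega_{\phi,\theta}(X) \ge 0\ \forall X$. Trivially, $\Omega_{\phi,\theta}(0) = 0$ since we can always take $(X^1, \ldots, X^K) = (0, \ldots, 0)$ to achieve the infimum. For $X \neq 0$, because $(\phi, \theta)$ is a non-degenerate pair then $\sum_{i=1}^r \theta(X_i^1, \ldots, X_i^K) > 0$ for any $(X^1, \ldots, X^K)_r$ s.t. $\Phi_r(X^1, \ldots, X^K) = X$ and $r$ finite. Property 5) shows that the infimum can be achieved with $r$ finite, completing the result.
2) For all $\alpha \ge 0$ and any $(X^1, \ldots, X^K)_r$ such that $X = \Phi_r(X^1, \ldots, X^K)$, note that from positive homogeneity $\Phi_r(\alpha^{1/p} X^1, \ldots, \alpha^{1/p} X^K) = \alpha X$ and $\sum_{i=1}^r \theta(\alpha^{1/p} X_i^1, \ldots, \alpha^{1/p} X_i^K) = \alpha \sum_{i=1}^r \theta(X_i^1, \ldots, X_i^K)$. Applying this fact to the definition of $\Omega_{\phi,\theta}$ gives that $\Omega_{\phi,\theta}(\alpha X) = \alpha\, \Omega_{\phi,\theta}(X)$.
3) If either $\Omega_{\phi,\theta}(X) = \infty$ or $\Omega_{\phi,\theta}(Z) = \infty$ then the inequality is trivially satisfied. Considering any $(X, Z)$ pair such that $\Omega_{\phi,\theta}$ is finite for both $X$ and $Z$, for any $\epsilon > 0$ let $(X^1, \ldots, X^K)_{r_x}$ be an $\epsilon$ optimal factorization of $X$. Specifically, $\Phi_{r_x}(X^1, \ldots, X^K) = X$ and $\sum_{i=1}^{r_x} \theta(X_i^1, \ldots, X_i^K) \le \Omega_{\phi,\theta}(X) + \epsilon$. Similarly, let $(Z^1, \ldots, Z^K)_{r_z}$ be an $\epsilon$ optimal factorization of $Z$. From Proposition 5.1 we have $\Phi_{(r_x + r_z)}([X^1\ Z^1], \ldots, [X^K\ Z^K]) = X + Z$, so $\Omega_{\phi,\theta}(X + Z) \le \sum_{i=1}^{r_x} \theta(X_i^1, \ldots, X_i^K) + \sum_{j=1}^{r_z} \theta(Z_j^1, \ldots, Z_j^K) \le \Omega_{\phi,\theta}(X) + \Omega_{\phi,\theta}(Z) + 2\epsilon$. Letting $\epsilon$ tend to 0 completes the result.
4) Convexity is given by the combination of properties 2 and 3. Further, note that properties 2 and 3 also show that $\{X : \Omega_{\phi,\theta}(X) < \infty\}$ is a convex set.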
5) Let be defined as
Note that because is a nondegenerate pair, for any non-zero there exists such that is on the boundary of , so and its convex hull are compact sets.
Further, note that contains the origin by definition of and , so as a result, is equivalent to a gauge function on the convex hull of
Since the infimum w.r.t. is linear and constrained to a compact set, it must be achieved. Therefore, there must exist , , and such that and .
This, combined with positive homogeneity, completes the result as we can take , which gives
and shows that a factorization of size- which achieves the infimum must exist.
We next derive the Fenchel dual of , which will provide a useful characterization of the subgradient of .
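The Fenchel dual of $\Omega_{\phi,\theta}(X)$ is given by
$$\Omega_{\phi,\theta}^*(W) = \begin{cases} 0 & \Omega_{\phi,\theta}^{\circ}(W) \le 1 \\ \infty & \text{otherwise} \end{cases} \tag{26}$$
where
$$\Omega_{\phi,\theta}^{\circ}(W) \equiv \sup_{(z^1, \ldots, z^K)} \langle W, \phi(z^1, \ldots, z^K) \rangle \quad \text{s.t.} \quad \theta(z^1, \ldots, z^K) \le 1 \tag{27}$$
Recall, $\Omega_{\phi,\theta}^*(W) \equiv \sup_Z \langle W, Z \rangle - \Omega_{\phi,\theta}(Z)$, so for $Z$ to approach the supremum we must have $Z \in \bigcup_r \mathrm{Im}(\Phi_r)$. As a result, the problem is equivalent to
$$\Omega_{\phi,\theta}^*(W) = \sup_{r \in \mathbb{N}_+}\ \sup_{(Z^1, \ldots, Z^K)_r}\ \sum_{i=1}^r \left[ \langle W, \phi(Z_i^1, \ldots, Z_i^K) \rangle - \theta(Z_i^1, \ldots, Z_i^K) \right] \tag{28}$$
If $\Omega_{\phi,\theta}^{\circ}(W) \le 1$ then all the terms in the summation of (28) will be non-positive, so taking $(Z^1, \ldots, Z^K) = (0, \ldots, 0)$ will achieve the supremum. Conversely, if $\Omega_{\phi,\theta}^{\circ}(W) > 1$, then $\exists (z^1, \ldots, z^K)$ such that $\langle W, \phi(z^1, \ldots, z^K) \rangle > \theta(z^1, \ldots, z^K)$. This result, combined with the positive homogeneity of $\phi$ and $\theta$, gives that (28) is unbounded by considering $(\alpha z^1, \ldots, \alpha z^K)$ as $\alpha \to \infty$.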
We briefly note that the optimization problem associated with (26) is typically referred to as the polar problem and is a generalization of the concept of a dual norm. In practice solving the polar can still be very challenging and is often the limiting factor in applying our results in practice (see Bach, 2013; Zhang et al., 2013, for further information).
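For one special case the polar problem has a closed form, which gives a feel for what it computes. The sketch below assumes matrix factorization with $\phi(u,v) = uv^T$ and $\theta(u,v) = \|u\|_2\|v\|_2$, and assumes the polar takes the form $\sup \langle W, \phi(u,v)\rangle$ subject to $\theta(u,v) \le 1$. Under these assumptions the supremum of $u^T W v$ is the largest singular value of $W$, attained at the top singular vector pair. The random sampling is only an illustrative sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))

# Closed form for this special case: the spectral norm of W.
A, s, Bt = np.linalg.svd(W)
polar_value = s[0]
u_star, v_star = A[:, 0], Bt[0, :]   # unit-norm maximizers

def random_unit(n):
    """Draw a random vector with unit Euclidean norm (feasible point)."""
    x = rng.standard_normal(n)
    return x / np.linalg.norm(x)

# No feasible pair should exceed the closed-form value.
sampled_max = max(random_unit(6) @ W @ random_unit(4) for _ in range(2000))
```

For general choices of elemental mapping and regularizer no such closed form is available, which is consistent with the remark above that the polar is often the limiting factor in practice.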
With the above derivation of the Fenchel dual, we now recall that if then the subgradient of can be characterized by . This forms the basis for the following lemma which will be used in our main results Given a factorization and a regularization function , then the following conditions are equivalent:
is an optimal factorization of ; i.e.,
such that and ,
such that and ,
Further, any which satisfies condition 2 or 3 satisfies both conditions 2 and 3 and . 2 $\Leftrightarrow$ 3) Condition 3 trivially implies condition 2 from the definition of . For the opposite direction, because we have . Taking the sum over , we can only achieve equality in condition 2 if we have equality in condition 3. This also shows that any which satisfies condition 2 or 3 must also satisfy the other condition.
We next show that if satisfies conditions 2/3 then . First, from condition 2/3 and the definition of , we have . Thus, recall that because is convex and finite at , we have with equality iff . Now, by contradiction assume satisfies conditions 2/3 but