1 Introduction
In recent years, many methods have been developed to improve matrix completion using side information [1, 2, 3]. By including features linked to the rows and/or columns of a matrix, also called "side information", these methods achieve better performance when estimating the missing entries.
In this work, we generalize this idea to nonnegative matrix factorization (NMF, [4]). Although more difficult to identify, NMF often performs better empirically, and provides factors that are more interpretable in an appropriate context. We propose an NMF method that takes side information into account. Given some observations of a matrix, this method jointly estimates the nonnegative factors of the matrix and regression models of these factors on the side information. This allows us to improve the matrix recovery performance of NMF. Moreover, using the regression models, we can predict the value of interest for previously unseen rows and columns. We develop this method in the general matrix recovery context, where linear measurements are observed instead of matrix entries.
This choice is especially motivated by applications in electricity consumption. We are interested in estimating and predicting the electricity load from temporal aggregates. In the context of load balancing in the power market, electric transmission system operators (TSO) are typically legally bound to estimate the electricity consumption and production at a small temporal scale (half-hourly or hourly), for market participants within their perimeter, i.e. utility providers, traders, large consumers, groups of consumers, etc. [5]. Most traditional electricity meters do not provide information at such a fine temporal scale. Although smart meters can record consumption locally every minute, the usage of such data can be extremely constrained for TSOs, because of the high cost of data transmission and processing and/or privacy issues. Nowadays, TSOs often use regulator-approved proportional consumption profiles to estimate market participants' load. In a previous work by the authors [5], we proposed to solve the estimation problem by NMF using temporal aggregate data.
Using the method developed in this article, we combine temporal aggregate data with features that are known to be correlated with electricity consumption, such as the temperature, the day of the week, or the type of client. This not only improves the performance of load estimation, but also allows us to predict future load for users in the dataset, and to estimate and predict the consumption of previously unseen users. In electrical power networks, load prediction for new periods is useful for balancing supply and demand on the network, and prediction for new individuals is useful for network planning.
In the rest of this section, we introduce the general framework of this method and the related literature. In Section 2, we deduce a sufficient condition on the side information for the NMF to be unique. In Section 3, we present HALSX, an algorithm which solves the NMF with side information problem and which we prove to converge to stationary points. In Section 4, we present experimental results, applying HALSX to simulated and real datasets, both for the electricity consumption application and for a standard task in collaborative filtering.
1.1 General model definition
We are interested in reconstructing a nonnegative matrix $V^* \in \mathbb{R}_+^{T \times N}$ from $D$ linear measurements,
(1) $a = \mathcal{A}(V^*) \in \mathbb{R}^D,$
where $\mathcal{A} : \mathbb{R}^{T \times N} \to \mathbb{R}^D$ is a linear operator. Formally, $\mathcal{A}$ can be represented by $A_1$, …, $A_D$, design matrices of dimension $T \times N$, and each linear measurement can be represented by
(2) $a_d = \langle A_d, V^* \rangle = \operatorname{tr}(A_d^\top V^*), \quad 1 \le d \le D.$
The design matrices $A_1$, …, $A_D$ are called masks.
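As a concrete illustration, the measurement model above (each observation is the Frobenius inner product of a mask with the target matrix) can be sketched in NumPy; all names here (`measure`, `masks`) are this sketch's own, not the paper's:

```python
import numpy as np

def measure(V, masks):
    """Linear measurement operator: the d-th observation is the Frobenius
    inner product <A_d, V> = trace(A_d.T @ V)."""
    return np.array([np.sum(A_d * V) for A_d in masks])

# Complete-observation masks for a 2 x 3 matrix: one canonical mask per entry
T, N = 2, 3
V = np.arange(6, dtype=float).reshape(T, N)
masks = []
for t in range(T):
    for i in range(N):
        A = np.zeros((T, N))
        A[t, i] = 1.0
        masks.append(A)

a = measure(V, masks)   # recovers every entry of V, in row-major order
```

With these canonical masks, the measurement vector is simply the matrix read out entry by entry; more general masks mix several entries into each observation.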
Moreover, we suppose that the matrix of interest, $V^*$, stems from a generative low-rank nonnegative model, in the following sense:

- The matrix $V^*$ is of nonnegative rank $k$, with $k \le \min(T, N)$. This means $k$ is the smallest number such that we can find two nonnegative matrices $W \in \mathbb{R}_+^{T \times k}$ and $H \in \mathbb{R}_+^{k \times N}$ satisfying $V^* = WH$. Note that this implies that $V^*$ is of rank at most $k$, and therefore is of low rank.

- There are row features $X \in \mathbb{R}^{T \times d_1}$ and column features $Y \in \mathbb{R}^{N \times d_2}$ connected to each row and column of $V^*$. We note by $x_t$ the $t$-th row of $X$, and by $y_i$ the $i$-th row of $Y$. There are two link functions $f : \mathbb{R}^{d_1} \to \mathbb{R}^k$ and $g : \mathbb{R}^{d_2} \to \mathbb{R}^k$, so that
$$V^* = (f(X))_+ \, ((g(Y))_+)^\top,$$
where $f(X) \in \mathbb{R}^{T \times k}$ is the matrix obtained by stacking the row vectors $f(x_t)$, for $1 \le t \le T$ (idem for $g(Y)$), and $(\cdot)_+$ is the ramp function, which corresponds to the thresholding operation at 0 applied entrywise to any matrix or vector.
In this general setting, the features $X$ and $Y$, the measurement operator $\mathcal{A}$, and the measurements $a$ are observed. The objective is to estimate the true matrix $V^*$ as well as the factor matrices $W$ and $H$, by estimating the link functions $f$ and $g$.
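A toy instance of this generative model (linear link functions composed with the ramp function) can be sketched as follows; every variable name and dimension is this sketch's own choice:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, k = 8, 5, 2        # rows, columns, nonnegative rank (toy sizes)
d1, d2 = 3, 4            # row / column feature dimensions

X = rng.normal(size=(T, d1))    # row features
Y = rng.normal(size=(N, d2))    # column features

F = rng.normal(size=(d1, k))    # coefficients of a linear row link function
G = rng.normal(size=(d2, k))    # coefficients of a linear column link function

W = np.maximum(X @ F, 0.0)      # ramp-thresholded row factor
H = np.maximum(Y @ G, 0.0)      # ramp-thresholded column factor
V = W @ H.T                     # nonnegative matrix, rank at most k
```

The ramp thresholding guarantees nonnegative factors even though the link functions themselves take values of either sign.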
To obtain a solution to this matrix recovery problem, we minimize the quadratic error of the matrix factorization. In Section 3, we will propose an algorithm for the following optimization problem:
(3) $\min_{V, \, f \in \mathcal{F}, \, g \in \mathcal{G}} \; \| V - (f(X))_+ ((g(Y))_+)^\top \|_F^2$
s.t. $\mathcal{A}(V) = a,$
where $\mathcal{F}$ and $\mathcal{G}$ are functional spaces in which the row and column link functions are to be searched.
By specializing $\mathcal{A}$, $X$, and $Y$, or restricting the search spaces $\mathcal{F}$ and $\mathcal{G}$, this general model includes a number of interesting applications, old and new.
The masks $A_1$, …, $A_D$

- Complete observation: $D = TN$, and the masks are $A_d = e_t e_i^\top$ for $1 \le t \le T$ and $1 \le i \le N$, where $e_t$ denotes the $t$-th canonical vector. This means every entry of $V^*$ is observed.

- Matrix completion: the set of masks is a subset of the complete observation masks, with $D < TN$.

- Matrix sensing: the design matrices $A_1$, …, $A_D$ are random matrices, sampled from a certain probability distribution. Typically, the probability distribution needs to verify certain conditions, so that with large probability, $\mathcal{A}$ verifies the Restricted Isometry Property [6].

- Rank-one projections [7, 8]: the design matrices are random rank-one matrices, that is $A_d = u_d v_d^\top$, where $u_d$ and $v_d$ are respectively random vectors of dimension $T$ and $N$. The main advantage of this setting is that much less memory is needed to store the masks, since we can store the vectors $u_d$ and $v_d$ ($T + N$ entries) instead of $A_d$ ($TN$ entries). In [7, 8], theoretical properties are proved for the case where $u_d$ and $v_d$ are vectors with independent Gaussian entries and/or drawn uniformly from the vectors of the canonical basis.

- Temporal aggregate measurements: in this case, the matrix $V^*$ is composed of $N$ time series over $T$ periods, one per individual, and each measurement is a temporal aggregate of the time series of one individual. The design matrices are defined by $A_d = \sum_{t = t_d}^{t_d + h_d - 1} e_t e_{i_d}^\top$, where $i_d$ is the individual concerned by the $d$-th measurement, $t_d$ the first period covered by the measurement, and $h_d$ the number of periods covered.
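The temporal aggregate masks just described can be built and applied as follows; the function and argument names are illustrative, and the last two lines exploit the fact that such a mask is itself rank-one (indicator of the covered periods times the canonical vector of the individual):

```python
import numpy as np

def aggregate_mask(T, N, individual, first_period, n_periods):
    """Temporal-aggregate design matrix: selects n_periods consecutive
    periods of one individual's time series."""
    A = np.zeros((T, N))
    A[first_period:first_period + n_periods, individual] = 1.0
    return A

T, N = 48, 3                     # e.g. 48 half-hourly periods, 3 meters
V = np.ones((T, N))              # toy consumption matrix
A = aggregate_mask(T, N, individual=1, first_period=10, n_periods=4)
measurement = np.sum(A * V)      # aggregate consumption over 4 periods

# Rank-one structure: the same measurement without forming A explicitly
u = np.zeros(T); u[10:14] = 1.0  # periods covered
v = np.zeros(N); v[1] = 1.0      # individual concerned
assert np.isclose(measurement, u @ V @ v)
```

This is the memory advantage mentioned for rank-one projections: only the two vectors need to be stored, not the full mask.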
The features $X$ and $Y$

- Individual features: $X = I_T$ and $Y = I_N$, the identity matrices. Basically, no side information is available: the row individuals and the column individuals are each distinct.

- General numeric features: $X \in \mathbb{R}^{T \times d_1}$ and $Y \in \mathbb{R}^{N \times d_2}$. This includes all numeric features.

- Features generated from a kernel: certain information about the row and column individuals may not come in the form of a numeric vector. For example, if the row individuals are vertices of a graph, their connections to each other are interesting information for the problem, but difficult to encode as real vectors. In this case, features can be generated through a transformation, or by defining a kernel function.
The link functions $f$ and $g$

- Linear: $f(x) = F^\top x$ and $g(y) = G^\top y$. In this case, we need to estimate the coefficient matrices $F \in \mathbb{R}^{d_1 \times k}$ and $G \in \mathbb{R}^{d_2 \times k}$ to fit the model. With identity matrices as row and column features, this case reduces to the traditional matrix factorization model with $W = (F)_+$ and $H = ((G)_+)^\top$. When the features are generated from a kernel function, even a linear link function permits a nonlinear relationship between the features and the factor matrices.

- General regression models: when the relationship between the features and the variable of interest is not linear, any off-the-shelf regression method can be plugged in to search for a nonlinear link function.
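To make the plug-in idea concrete, here is a minimal sketch (all names this sketch's own) where ordinary least squares plays the role of the row link function, and its fitted values are ramp-thresholded as in the model; any other multi-output regressor could be substituted for the `lstsq` call:

```python
import numpy as np

rng = np.random.default_rng(2)
T, k, d1 = 20, 2, 3
X = rng.normal(size=(T, d1))                 # row features
W_target = np.abs(rng.normal(size=(T, k)))   # current row-factor target

# Any multi-output regressor can play the role of the row link function;
# here ordinary least squares stands in (linear in its parameters).
F, *_ = np.linalg.lstsq(X, W_target, rcond=None)

W = np.maximum(X @ F, 0.0)   # fitted values, ramp-thresholded
```

The regression step ignores nonnegativity; only the thresholded fitted values enter the factor matrix.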
The choice of the optimization problem (3)
Notice that with individual row and column features, linear link functions, and complete observations, (3) becomes
(4) $\min_{F, G} \; \| V^* - (F)_+ ((G)_+)^\top \|_F^2.$
This is equivalent to the classical NMF problem,
(5) $\min_{W, H} \; \| V^* - W H \|_F^2$
s.t. $W \ge 0, \; H \ge 0,$
in the sense that for any solution $(F, G)$ to (4), $((F)_+, ((G)_+)^\top)$ is a solution to (5).
A more immediate generalization of (5) to include exogenous variables would be of the form
(6) $\min_{V, \, f \in \mathcal{F}, \, g \in \mathcal{G}} \; \| V - f(X) \, g(Y)^\top \|_F^2$
s.t. $\mathcal{A}(V) = a, \quad f(X) \ge 0, \quad g(Y) \ge 0.$
Solving (6) would involve identifying the subsets of $\mathcal{F}$ and $\mathcal{G}$ that only produce nonnegative values on the row and column features, which could be difficult.
Table 1: Taxonomy of matrix factorization models with side information, by mask, link function, and features.

| Mask \ Link function: Features | Linear: Identity | Linear: General numeric | Linear: Kernel | Other regression: General numeric |
|---|---|---|---|---|
| Identity | Matrix factorization | Reduced-rank regression [9, 10, 11] | Multiple kernel learning [12] | Nonparametric RRR [13] |
| Matrix completion | Matrix completion [14] | IMC [15, 1, 16, 17] | GRMF [18, 2, 3] | |
| Rank-one projections | [7, 8] | | | |
| Temporal aggregates | [5] | | | |
| General masks | Matrix recovery [19, 6] | | | |
By using (3), we actually shift the search space $\mathcal{F} \times \mathcal{G}$ to $(\mathcal{F})_+ \times (\mathcal{G})_+$, which consists of composing all functions of $\mathcal{F}$ and $\mathcal{G}$ with the ramp function (thresholding at 0). In a word, (3) is equivalent to
(7) $\min_{V, \, f^+ \in (\mathcal{F})_+, \, g^+ \in (\mathcal{G})_+} \; \| V - f^+(X) \, (g^+(Y))^\top \|_F^2$
s.t. $\mathcal{A}(V) = a.$
Problem (7) also helps us to reason about the identifiability of (3). In a sense, (3) is not well identified: two distinct elements of $\mathcal{F} \times \mathcal{G}$ have the same value of the objective function if they only differ in their negative parts. In fact, this does not affect the interpretation of the model, because these distinct elements correspond to the same element of $(\mathcal{F})_+ \times (\mathcal{G})_+$. Since we only use the positive parts of the functions, both in recovery and in prediction, this becomes a parameterization choice which has no consequence on the applications.
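The point that only positive parts matter can be checked numerically; in this toy sketch (all values invented), two link-function outputs that differ only where they are negative yield exactly the same factorization error:

```python
import numpy as np

# Two candidate row link-function outputs differing only where negative
Z1 = np.array([[1.0, -2.0], [0.5, -0.1]])
Z2 = np.array([[1.0, -7.0], [0.5, -9.0]])

H = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]])  # toy column factor
V = np.ones((2, 3))                                # toy target matrix

err1 = np.linalg.norm(V - np.maximum(Z1, 0) @ H)
err2 = np.linalg.norm(V - np.maximum(Z2, 0) @ H)
# err1 == err2: the objective only sees the thresholded (positive) parts
```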
As a comparison, we also propose an algorithm for the following optimization problem in Section 3:
(8) $\min_{f \in \mathcal{F}, \, g \in \mathcal{G}} \; \big\| \mathcal{A}\big( (f(X))_+ ((g(Y))_+)^\top \big) - a \big\|_2^2.$
Instead of minimizing the low-rank approximation error for a matrix that matches the data as a linear matrix equation, as in (3), (8) minimizes the sampling error of an exactly low-rank matrix. Both have been studied in the literature. For example, objective functions similar to (3) have been considered in [6], and ones similar to (8) in [20]. We will see in both Sections 3 and 4 that (3) achieves better performance than (8).
1.2 Prior works
Table 1 shows a taxonomy of matrix factorization models with side information, by the mask, the link function and the features used as side information. A number of problems in this taxonomy have been addressed in the established literature.
There is an abundant literature that studies the general matrix factorization problem under various measurement operators, when no additional information is provided (see [21, 14, 6, 8] for various masks considered, and [22] for a recent global convergence result with RIP measurements). The NMF with general linear measurements is studied in various applications [20, 23, 5].
On the other hand, with complete observations, the multiple regression problem taking into account the low-rank structure of the (multidimensional) variable of interest is known as reduced-rank regression. This approach was developed early on (see [9] for a review). Recent developments on rank selection [10], adaptive estimation procedures [24], and nonparametric link functions [13] often draw a parallel between reduced-rank regression and the matrix completion problem. However, measurement operators other than complete observations or matrix completion are rarely considered in this community.
Building on theoretical bounds for matrix completion, the authors of [1, 16, 17] showed that by providing side information (the feature matrices), the number of measurements needed for exact matrix completion can be reduced. Moreover, the number of measurements necessary for successful matrix completion can be quantified by measuring the quality of the side information [17].
Collaborative filtering with side information has received much attention from practitioners and academic researchers alike, for its huge impact in e-commerce and web applications [15, 1, 16, 17, 2, 3]. One of the first methods for including side information in collaborative filtering systems (matrix completion masks) was proposed by [18]. The authors generalized collaborative filtering into an operator estimation problem. This method allows more general feature spaces than a numeric matrix, by applying a kernel function to the side information. [3] proposed choosing the kernel function based on the goal of the application. [12] applied the kernel-based collaborative filtering framework to electricity price forecasting. Their kernel choice is determined by multiple kernel learning methods.
To the best of our knowledge, matrix factorization (nonnegative or not) with side information from general linear measurements has rarely been considered, nor have general nonlinear link functions other than those obtained from kernel features. This article aims at proposing a general approach which fills this gap.
2 Identifiability of nonnegative matrix factorization with side information
Matrix factorization is not a well-identified problem: given one pair of factors $(W, H)$ with $V = WH$, any invertible matrix $B \in \mathbb{R}^{k \times k}$ produces another pair of factors, $(WB, B^{-1}H)$, with $V = (WB)(B^{-1}H)$. In order to address this identifiability problem, one has to introduce extra constraints on the factors. When the nonnegativity constraint is imposed on $W$ and $H$, however, it has been shown that sometimes the only invertible matrices $B$ that verify $WB \ge 0$ and $B^{-1}H \ge 0$ are compositions of a permutation matrix and a diagonal matrix with strictly positive diagonal entries. A nonnegative matrix factorization is said to be "identified" if the factors are unique up to permutation and scaling. Identifiability conditions for NMF are a hard problem, because NMF identifiability turns out to be equivalent to conditions that are computationally difficult to check. In this section, we review some known necessary and sufficient conditions for NMF identifiability in the literature, and develop a sufficient condition for NMF identifiability in the context of linear numeric features.
In order to simplify the theoretical analysis, we focus on the complete observation case in this section (every entry of $V^*$ is observed). Without loss of generality, we derive the sufficient condition for row features. That is, we will derive conditions on the row features and the factors, so that the nonnegative matrix factorization is unique. A generalization to column features is easily obtained. In this section, we assume that in addition to being of nonnegative rank $k$, the matrix $V^*$ is also exactly of rank $k$.
2.1 Identifiability of NMF
The authors of [4] and [25] proposed two necessary and sufficient conditions for the factorization to be unique. Both conditions use the following geometric interpretation of NMF introduced by [4].
Since $V = WH$, the columns of $V$ are conical combinations of the columns of $W$. Formally, $\mathrm{cone}(W)$, the conical hull of the columns of $W$, is a polyhedral cone contained in the first orthant of $\mathbb{R}^T$. As $V$ is of rank $k$, the rank of $W$ is also $k$. This implies that the extreme rays (also called generators) of $\mathrm{cone}(W)$ are exactly the columns of $W$, which are linearly independent. $\mathrm{cone}(W)$ is therefore

- a simplicial cone with $k$ generators,

- contained in $\mathbb{R}_+^T$,

- containing all columns of $V$.

Conversely, if we take any cone $\mathcal{C}$ verifying these three conditions, and define a matrix $W'$ whose columns are the generators of $\mathcal{C}$, there will be a nonnegative matrix $H'$ so that $V = W'H'$. The uniqueness of the NMF is therefore equivalent to the uniqueness of simplicial cones with $k$ generators contained in the first orthant of $\mathbb{R}^T$ and containing all columns of $V$.
In [25], an equivalent geometric interpretation in is given in the following theorem:
Theorem 1.
[25] A dimensional NMF of a rank nonnegative matrix is unique if and only if the nonnegative orthant is the only simplicial cone with extreme rays satisfying
Despite the apparent simplicity of the theorem, the necessary and sufficient conditions are very difficult to check. Based on the theorem above, several sufficient conditions have been proposed. The most widely used condition is called the separability condition. Before introducing this condition (in its least restrictive version presented by [25]), we need the following two definitions.
Definition 1 (Separability).
Suppose $r \le m$. A nonnegative matrix $M \in \mathbb{R}_+^{m \times r}$ is said to be separable if there is an $m$-by-$m$ permutation matrix $P$ which verifies
$$P M = \begin{pmatrix} D \\ M' \end{pmatrix},$$
where $D$ is an $r$-by-$r$ diagonal matrix with only strictly positive coefficients on the diagonal and zeros everywhere else, and the $(m-r)$-by-$r$ matrix $M'$ is a collection of the other rows of $M$.
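Definition 1 says that, up to a row permutation, a separable matrix contains a positively scaled copy of the identity among its rows. A small sketch of this check (the function name and test matrices are this sketch's own):

```python
import numpy as np

def is_separable(M, tol=1e-12):
    """Check separability of a nonnegative m x r matrix: up to a row
    permutation, M must contain, for every column j, a row that is
    strictly positive in column j and zero elsewhere."""
    m, r = M.shape
    found = set()
    for row in M:
        support = np.flatnonzero(row > tol)
        if support.size == 1:
            found.add(int(support[0]))
    return len(found) == r

M_sep = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])   # separable
M_not = np.array([[2.0, 1.0], [1.0, 3.0], [1.0, 1.0]])   # not separable
```

Rather than searching over permutation matrices, the check scans the rows once: each single-support row supplies one diagonal entry of the block $D$.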
Definition 2 (Strongly Boundary Closeness).
A nonnegative matrix is said to be strongly boundary close if the following conditions are satisfied.

is boundary close: for all , there is a row in which satisfies ;

There is a permutation of such that for all , there are rows in which satisfy

for all ;

the square matrix is of full rank .

Strongly boundary closeness demands, modulo a permutation in , that for each , there are rows of that have the following form,
(9)  
These row vectors, , all have 0 on the th element, and its lower square matrix of is of full rank. There are therefore enough linearly independent points on each dimensional facet , which shows that is somewhat maximal in .
The following was proved in [25]:
Theorem 2.
[25] If is strongly boundary close, then the only simplicial cone with generators in containing is . Moreover, if is separable, then is the unique NMF of up to permutation and scaling.
2.2 Identifiability with side information
The NMF with linear row features is said to be unique if, for all matrix pairs verifying the factorization, the factors coincide up to permutation of columns and scaling.
For a given fullrank matrix , consider the following two sets of matrices:
Theorem 3.
If , and , and is separable, then the factorization is unique.
Proof.
Notice that for , the nonnegative matrix is strongly boundary close. The factorization is therefore unique. The model identifiability follows immediately, since is of full rank. ∎
Example of that verifies
For this theorem to have practical consequences, one needs to find appropriate row features so that .
Here we provide a family of matrices so that .
With a fixed , suppose that has columns, and at least rows, with the first rows defined as the following:

the first row and column have 0 on the first entry and positive entries elsewhere;

for , has strictly positive entries on the first columns, from Row to Row , and zero entries everywhere else. These rows are linearly independent.
Then we have , because the following by matrix is in this set:

for , has consecutive strictly positive entries on the th column, between Row and Row .
The following matrices instantiate the case of :
If , for any invertible matrix , .
3 HALSX algorithm
In this section, we propose Hierarchical Alternating Least Squares with eXogeneous variables (HALSX), a general algorithm to estimate the nonnegative matrix factorization with side information from linear measurements, by solving (3). It is an extension of a popular NMF algorithm, Hierarchical Alternating Least Squares (HALS; see [26, 27]).
Before discussing HALSX, we first present a result on the local convergence of Gauss-Seidel algorithms. This result guarantees that any legitimate limit points generated by HALSX are critical points of (3).
While presenting specific methods to estimate link functions, we only discuss row features, as the generalization to column features is immediate.
3.1 Relaxation of the convexity assumption for the convergence of the Gauss-Seidel algorithm
To show the local convergence of the HALSX algorithm, we first extend a classical result concerning the block nonlinear Gauss-Seidel algorithm [28, Proposition 4].
Consider the minimization problem,
(10) $\min_{\theta} \; \phi(\theta)$
s.t. $\theta \in \Theta = \Theta_1 \times \cdots \times \Theta_m,$
where $\phi$ is a continuously differentiable real-valued function, and the feasible set $\Theta$ is the Cartesian product of closed, nonempty and convex subsets $\Theta_i \subseteq \mathbb{R}^{n_i}$, for $1 \le i \le m$, with $n = \sum_{i=1}^m n_i$. Suppose that the global minimum is reached at a point in $\Theta$. The block Gauss-Seidel algorithm is defined as Algorithm 1.
We now formally define the notion of componentwise quasiconvexity.
Definition 3.
Let $i \in \{1, \dots, m\}$. The function $\phi$ is quasiconvex with respect to the $i$-th component on $\Theta$ if for every $\theta \in \Theta$ and $\eta_i \in \Theta_i$, we have
$$\phi(\theta_1, \dots, t\theta_i + (1-t)\eta_i, \dots, \theta_m) \le \max\{\phi(\theta), \, \phi(\theta_1, \dots, \eta_i, \dots, \theta_m)\},$$
for all $t \in [0, 1]$. $\phi$ is said to be strictly quasiconvex with respect to the $i$-th component if, with the additional assumption that $\theta_i \ne \eta_i$, we have
$$\phi(\theta_1, \dots, t\theta_i + (1-t)\eta_i, \dots, \theta_m) < \max\{\phi(\theta), \, \phi(\theta_1, \dots, \eta_i, \dots, \theta_m)\},$$
for all $t \in (0, 1)$.
It has been shown that if $\phi$ is strictly quasiconvex with respect to the first $m-2$ blocks of components on $\Theta$, then every limit point produced by the Gauss-Seidel algorithm is a critical point [28].
This result is not directly applicable to the HALS algorithm. Typically, if one column of a factor matrix is identically zero, the loss function is completely flat with respect to the corresponding column of the other factor matrix. Therefore the loss function is not strictly quasiconvex. In order to avoid this scenario, [27] suggests thresholding at a small positive number instead of at 0 when updating each column of the factor matrices. In fact, the convexity assumption of [28] can be slightly relaxed to apply directly to HALS, as demonstrated by the following proposition.
Theorem 4.
Suppose that the function $\phi$ is quasiconvex with respect to each component on $\Theta$. Suppose that some limit points of the sequence verify that $\phi$ is strictly quasiconvex with respect to each component on the corresponding product set. Then every such limit point is a critical point of Problem (10).
Compared to the result of [28], this shows that strict quasiconvexity with respect to one block does not have to hold universally over the feasible regions of the other blocks. It only needs to hold at the limit point.
This theorem can be established following the proof of Proposition 5 of [28], using the following lemma.
Lemma 5.
Suppose that the function is quasiconvex with respect to on , for some . Suppose that some limit points of verify that is strictly quasiconvex with respect to on . Let be a sequence of vectors defined as follows:
Then, if , we have . That is .
Proof.
(The proof of the lemma is based on [29].)
Suppose on the contrary that does not converge to 0. Define . Restricting to a subsequence, we can obtain that Define . Notice that is of unit norm, and . Since is on the unit sphere, it has a converging subsequence. By restricting to a subsequence again, we could suppose that converges to .
For all , we have , which implies is on the segment . This segment has strictly positive dimension in the subspace corresponding to .
By the definition of , , for all , and for all . In particular,
By quasiconvexity of on ,
Taking the limit when converges to on both equalities, we obtain
In other words, , , which contradicts the strict quasiconvexity of on . ∎
3.2 HALSX algorithm
To solve (3), we propose HALSX (Algorithm 2). When complete observations are available, the feature matrices are identity matrices, and only linear functions are allowed as link functions, Algorithm 2 is equivalent to HALS [26].
From Theorem 4, one deduces that every full-rank limit point produced by the popular HALS algorithm is a critical point.
In this algorithm, at each elementary update step, we first look for a link function which minimizes the quadratic error, without concerning ourselves with its nonnegativity. The evaluation of the obtained minimizer function is then thresholded at 0 to update the factors.
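In the special case of identity features and linear link functions, this elementary update reduces to the classical HALS column update: an unconstrained least-squares step for one column, followed by thresholding at 0. A sketch of one such sweep (assuming $V \approx WH$ with $W$ of size $T \times k$ and $H$ of size $k \times N$; not the paper's exact pseudocode):

```python
import numpy as np

def hals_update_W(V, W, H):
    """One sweep of HALS column updates on W for V ~ W @ H: each column
    takes its unconstrained least-squares minimizer, thresholded at 0."""
    HHt = H @ H.T
    VHt = V @ H.T
    for j in range(W.shape[1]):
        if HHt[j, j] > 0:
            W[:, j] = np.maximum(
                W[:, j] + (VHt[:, j] - W @ HHt[:, j]) / HHt[j, j], 0.0)
    return W

rng = np.random.default_rng(3)
V = np.abs(rng.normal(size=(6, 2))) @ np.abs(rng.normal(size=(2, 4)))
H = np.abs(rng.normal(size=(2, 4)))
W = np.abs(rng.normal(size=(6, 2)))
err_before = np.linalg.norm(V - W @ H)
W = hals_update_W(V, W, H)
err_after = np.linalg.norm(V - W @ H)   # never larger than err_before
```

Each column update is an exact coordinate minimization followed by projection onto the nonnegative orthant, so the objective is non-increasing along the sweep.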
To show the convergence of the HALSX algorithm, we need to ensure that for some functional spaces $\mathcal{F}$ and $\mathcal{G}$, such an update solves the corresponding subproblem of (3). To do this, we will use the following proposition:
Proposition 6.
Suppose that , are not identically equal to zero, and , with , is a convex differentiable function. Suppose
If , the Jacobian matrix of at , is of rank , then is also a solution to
(11) 
Proof.
Take , not identically equal to zero. We will note by the loss function, so that for all . The function is convex.
Problem (11), which can be rewritten as
is also convex. The subgradient of the composition function at is simply obtained by multiplying , the Jacobian matrix of at , to each element of , or . Therefore , is a minimizer of (11), if and only if .
Since
is a minimizer of a smooth convex problem,
This means , because is of full rank. Consequently
It has been shown in NMF literature (for example [27, Theorem 2]) that,
This is equivalent to
or
We therefore conclude with . ∎
In many regression methods, even when a nonlinear transformation is applied to the data, the regression function is linear in its parameters. A non-exhaustive list of such methods includes linear regression