Nonnegative matrix factorization with side information for time series recovery and prediction

09/19/2017
by Jiali Mei, et al.

Motivated by the reconstruction and prediction of electricity consumption, we extend Nonnegative Matrix Factorization (NMF) to take into account side information (column or row features). We consider general linear measurement settings and propose a framework which models non-linear relationships between the features and the response variables. We extend previous theoretical results to obtain a sufficient condition for the identifiability of the NMF in this setting. Based on the classical Hierarchical Alternating Least Squares (HALS) algorithm, we propose a new algorithm (HALSX, or Hierarchical Alternating Least Squares with eXogeneous variables) which estimates the factorization model. The algorithm is validated on both simulated and real electricity consumption datasets, as well as on a recommendation dataset, to show its performance in matrix recovery and in prediction for new rows and columns.

1 Introduction

In recent years, a large number of methods have been developed to improve matrix completion using side information [1, 2, 3]. By including features linked to the rows and/or columns of a matrix, also called “side information”, these methods achieve better performance at estimating the missing entries.

In this work, we generalize this idea to nonnegative matrix factorization (NMF, [4]). Although more difficult to identify, NMF often has better empirical performance and provides factors that are more interpretable in an appropriate context. We propose an NMF method that takes side information into account. Given some observations of a matrix, this method jointly estimates the nonnegative factors of the matrix and regression models of these factors on the side information. This allows us to improve the matrix recovery performance of NMF. Moreover, using the regression models, we can predict the value of interest for previously unseen rows and columns. We develop this method in the general matrix recovery context, where linear measurements are observed instead of matrix entries.

This choice is especially motivated by applications in electricity consumption. We are interested in estimating and predicting the electricity load from temporal aggregates. In the context of load balancing in the power market, electricity transmission system operators (TSOs) are typically legally bound to estimate the electricity consumption and production at a small temporal scale (half-hourly or hourly) for market participants within their perimeter, i.e. utility providers, traders, large consumers, groups of consumers, etc. [5]. Most traditional electricity meters do not provide information at such a fine temporal scale. Although smart meters can record consumption locally every minute, the usage of such data can be severely constrained for TSOs, because of the high cost of data transmission and processing and/or privacy issues. Nowadays, TSOs often use regulator-approved proportional consumption profiles to estimate market participants’ load. In a previous work by the authors [5], we proposed to solve the estimation problem by NMF using temporal aggregate data.

Using the method developed in this article, we combine temporal aggregate data with features that are known to be correlated with electricity consumption, such as the temperature, the day of the week, or the type of client. This not only improves the performance of load estimation, but also allows us to predict future load for users in the dataset, and to estimate and predict the consumption of previously unseen users. In electrical power networks, load prediction for new periods is useful for balancing supply and demand on the network, and prediction for new individuals is useful for network planning.

In the rest of this section, we introduce the general framework of this method and the related literature. In Section 2, we deduce a sufficient condition on the side information for the NMF to be unique. In Section 3, we present HALSX, an algorithm which solves the NMF with side information problem and which we prove converges to stationary points. In Section 4, we present experimental results, applying HALSX to simulated and real datasets, both for the electricity consumption application and for a standard task in collaborative filtering.

1.1 General model definition

We are interested in reconstructing a nonnegative matrix , from  linear measurements,

(1)

where  is a linear operator. Formally,  can be represented by , …,  design matrices of dimension , and each linear measurement can be represented by

(2)

The design matrices , …,  are called masks.

Moreover, we suppose that the matrix of interest, , stems from a generative low-rank nonnegative model, in the following sense:

  1. The matrix  is of nonnegative rank , with . That is,  is the smallest number such that we can find two nonnegative matrices  and  satisfying

    Note that this implies that  is of rank at most , and therefore is of low rank.

  2. There are row features  and column features  connected to each row and column of . We denote by  the -th row of , and by  the -th row of . There are two link functions  and  such that

    where  is the matrix obtained by stacking the row vectors , for  (and similarly for ), and  is the ramp function, corresponding to entrywise thresholding at 0 of a matrix or vector.

In this general setting, the features  and , the measurement operator , and the measurements  are observed. The objective is to estimate the true matrix  as well as the factor matrices  and , by estimating the link functions  and .
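To make this generative setting concrete, the following minimal numpy sketch simulates a matrix following the model. It is our illustration, not the authors' code; all names (features X and Y, link parameters A_r and A_c, rank k) are hypothetical, and linear link functions are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, d_row, d_col, k = 50, 40, 6, 4, 3

# Hypothetical row and column features (one row per individual).
X = rng.normal(size=(n_rows, d_row))
Y = rng.normal(size=(n_cols, d_col))

# Parameters of assumed linear link functions.
A_r = rng.normal(size=(d_row, k))
A_c = rng.normal(size=(d_col, k))

# Nonnegative factors: link functions evaluated on the features, thresholded at 0 (ramp).
W = np.maximum(X @ A_r, 0.0)          # (n_rows, k)
H = np.maximum(Y @ A_c, 0.0)          # (n_cols, k)

# Matrix of interest: low-rank and entrywise nonnegative by construction.
V = W @ H.T                           # (n_rows, n_cols)
```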

To obtain a solution to this matrix recovery problem, we minimize the quadratic error of the matrix factorization. In Section 3, we will propose an algorithm for the following optimization problem:

(3)
s.t.

where  and  are some functional spaces in which the row and column link functions are to be searched.
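The display for problem (3) was lost in extraction. Based on the surrounding description (and the comparison with problem (8) below), one plausible reading, in notation we introduce here ($a$ for the measurement vector, $\mathcal{A}$ for the measurement operator, $\Theta$ for the ramp function), is the following; this is a hedged reconstruction, not the paper's exact display:

```latex
\begin{align}
\min_{V,\; f_r \in \mathcal{F}_r,\; f_c \in \mathcal{F}_c}\;
  & \bigl\lVert V - \Theta\bigl(f_r(X)\bigr)\,\Theta\bigl(f_c(Y)\bigr)^{\top} \bigr\rVert_F^2 \tag{3}\\
\text{s.t.}\;\; & \mathcal{A}(V) = a . \nonumber
\end{align}
```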

By specializing , and , or restricting the search space of  and , this general model includes a number of interesting applications, old and new.

The masks , …, 

  • Complete observation, where  is the -th canonical vector. This means every entry of  is observed.

  • Matrix completion: the set of masks is a subset of complete observation masks, with .

  • Matrix sensing: the design matrices  are random matrices, sampled from a certain probability distribution. Typically, the probability distribution needs to verify certain conditions so that, with large probability,  verifies the Restricted Isometry Property [6].

  • Rank-one projections [7, 8]: the design matrices are random rank-one matrices, that is , where  and  are respectively random vectors of dimension  and . The main advantage of this setting is that much less memory is needed to store the masks, since we can store the vectors  and  instead of the full mask matrices. In [7, 8], theoretical properties are proved for the case where  and  are vectors with independent Gaussian entries and/or drawn uniformly from the vectors of the canonical basis.

  • Temporal aggregate measurements: in this case, the matrix is composed of  time series concerning  individuals, and each measurement is a temporal aggregate of one individual's time series. The design matrices are defined as , where  is the individual concerned by the -th measurement,  the first period covered by the measurement, and  the number of periods covered; see the sketch below.
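The sketch below (ours; index conventions such as rows-as-periods and columns-as-individuals are assumptions, not the paper's notation) builds temporal-aggregate masks and applies the corresponding linear measurement operator:

```python
import numpy as np

def temporal_aggregate_masks(n_periods, n_individuals, measures):
    """Build the design matrices ("masks") for temporal aggregate measurements.

    Each element of `measures` is a tuple (individual, first_period, n_covered):
    the measurement sums one individual's series over consecutive periods.
    """
    masks = []
    for individual, first_period, n_covered in measures:
        E = np.zeros((n_periods, n_individuals))
        E[first_period:first_period + n_covered, individual] = 1.0
        masks.append(E)
    return masks

def apply_measurement_operator(V, masks):
    # Each linear measurement is the entrywise inner product between a mask and V.
    return np.array([np.sum(E * V) for E in masks])
```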

The features  and 

  • Individual features: essentially, no side information is available; each row individual and each column individual is simply treated as distinct.

  • General numeric features and . This includes all numeric features.

  • Features generated from a kernel: some information about the row and column individuals may not come in the form of a numeric vector. For example, if the row individuals are vertices of a graph, their connections to one another are relevant to the problem but difficult to encode as real vectors. In this case, features can be generated through a transformation, or by defining a kernel function, as in the sketch below.
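For instance, when raw numeric descriptors are available, an RBF kernel Gram matrix can serve as the row feature matrix; this is an illustrative sketch under our own assumptions (function name and bandwidth parameter are ours):

```python
import numpy as np

def rbf_kernel_features(raw, gamma=1.0):
    """Map raw descriptors (one row per individual) to kernel features: the i-th
    feature vector is row i of the Gram matrix K[i, j] = exp(-gamma * ||raw_i - raw_j||^2)."""
    sq_dists = np.sum((raw[:, None, :] - raw[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)   # (n_individuals, n_individuals)
```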

The link functions  and 

  • Linear,  and . In this case, we need to estimate  and  to fit the model. With identity matrices as row and column features, this case reduces to the traditional matrix factorization model with

    When the features are generated from a kernel function, even a linear link function permits a non-linear relationship between the features and the factor matrices.

  • General regression models: when the relationship between the features and the variable of interest is not linear, any off-the-shelf regression method can be plugged in to search for a non-linear link function, as in the sketch below.
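As an example of such a plug-in, the sketch below fits one factor column on the row features with scikit-learn's RandomForestRegressor and thresholds its prediction at 0; the helper name and interface are ours, chosen only to illustrate the idea.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_column_link(X, target_col):
    """Fit a non-linear link function for one factor column and return its
    nonnegative evaluation on the row features (thresholding at 0)."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X, target_col)
    return np.maximum(model.predict(X), 0.0), model
```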

The choice of the optimization problem (3)

Notice that with individual row and column features, linear link functions and complete observations, (3) becomes

(4)

This is equivalent to the classical NMF problem,

(5)
s.t.

in the sense that for any solution  to (4),  is a solution to (5).
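For reference, the display of the classical NMF problem (5) was lost in extraction; in standard notation (introduced here), it reads:

```latex
\begin{equation}
\min_{W \ge 0,\; H \ge 0}\; \bigl\lVert V - W H^{\top} \bigr\rVert_F^2 . \tag{5}
\end{equation}
```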

A more immediate generalization of (5) to include exogenous variables would be in the form

(6)
s.t.

Solving (6) would involve identifying the subsets of  and  that produce only nonnegative values on the row and column features, which could be difficult.

Mask \ (Link function, Features)  | Linear, Identity        | Linear, General numeric features    | Linear, Kernel features       | Other regression methods, General numeric features
Identity (complete observation)   | Matrix factorization    | Reduced-rank regression [9, 10, 11] | Multiple kernel learning [12] | Nonparametric RRR [13]
Matrix completion                 | Matrix completion [14]  | IMC [15, 1, 16, 17]                 | GRMF [18, 2, 3]               |
Rank-one projections [7, 8]       |                         |                                     |                               |
Temporal aggregates [5]           |                         |                                     |                               |
General masks                     | Matrix recovery [19, 6] |                                     |                               |
Table 1: Classification of matrix factorization with side information by the mask, the link function, and the features included as side information, with some problems previously addressed in the literature.

By using (3), we actually shift the search spaces  and  to  and , which consist of the compositions of all functions of  and  with the ramp function (thresholding at 0). In other words, (3) is equivalent to

(7)
s.t.

Problem (7) also helps us reason about the identifiability of (3). In a sense, (3) is not well identified: two distinct elements of  have the same objective value if they differ only on their negative parts. In fact, this does not affect the interpretation of the model, because these distinct elements correspond to the same element of . Since we only use the positive parts of the functions, both in recovery and in prediction, this is a parameterization choice with no consequence for the applications.

As a comparison, we also propose an algorithm for the following optimization problem in Section 3:

(8)

Instead of minimizing the low-rank approximation error of a matrix that matches the data as a linear matrix equation, as in (3), (8) minimizes the sampling error of an exactly low-rank matrix. Both have been studied in the literature: for example, objective functions similar to (3) have been considered in [6], and ones similar to (8) in [20]. We will see in Sections 3 and 4 that (3) performs better than (8).

1.2 Prior works

Table 1 shows a taxonomy of matrix factorization models with side information, organized by the mask, the link function, and the features used as side information. A number of problems in this taxonomy have been addressed in the literature.

There is abundant literature studying the general matrix factorization problem under various measurement operators when no additional information is provided (see [21, 14, 6, 8] for various masks considered, and [22] for a recent global convergence result with RIP measurements). NMF with general linear measurements has been studied in various applications [20, 23, 5].

On the other hand, with complete observations, the multiple regression problem taking into account the low-rank structure of the (multi-dimensional) variable of interest is known as reduced-rank regression. This approach was developed very early (see [9] for a review). Recent developments on rank selection [10], adaptive estimation procedures [24], and non-parametric link functions [13] often draw a parallel between reduced-rank regression and the matrix completion problem. However, measurement operators other than complete observations or matrix completion are rarely considered in this community.

Building on theoretical bounds for matrix completion, the authors of [1, 16, 17] showed that, by providing side information (the matrix ), the number of measurements needed for exact matrix completion can be reduced. Moreover, the number of measurements necessary for successful matrix completion can be quantified by measuring the quality of the side information [17].

Collaborative filtering with side information has received much attention from practitioners and academic researchers alike, for its huge impact in e-commerce and web applications [15, 1, 16, 17, 2, 3]. One of the first methods for including side information in collaborative filtering systems (matrix completion masks) was proposed by [18]. The authors generalized collaborative filtering into an operator estimation problem. This method allows more general feature spaces than a numerical matrix, by applying a kernel function to the side information. [3] proposed choosing the kernel function based on the goal of the application. [12] applied the kernel-based collaborative filtering framework to electricity price forecasting, with the kernel chosen by multiple kernel learning methods.

To the best of our knowledge, matrix factorization (nonnegative or not) with side information from general linear measurements has rarely been considered, nor have general non-linear link functions beyond features obtained from kernels. This article aims to propose a general approach that fills this gap.

2 Identifiability of nonnegative matrix factorization with side information

Matrix factorization is not a well-identified problem: for one pair of factors , with , any invertible matrix  produces another pair of factors, , with . In order to address this identifiability problem, one has to introduce extra constraints on the factors.

When the nonnegativity constraint is imposed on  and , however, it has been shown that in some cases the only invertible matrices that verify  and  are compositions of a permutation matrix and a diagonal matrix with strictly positive diagonal elements. A nonnegative matrix factorization is said to be “identified” if the factors are unique up to permutation and scaling. Identifiability conditions for NMF are a hard problem, because NMF identifiability turns out to be equivalent to conditions that are computationally difficult to check. In this section, we review some known necessary and sufficient conditions for NMF identifiability in the literature, and develop a sufficient condition for NMF identifiability in the context of linear numerical features.

In order to simplify the theoretical analysis, we focus in this section on the complete observation case (every entry of  is observed). Without loss of generality, we derive the sufficient condition for row features. That is, we derive conditions on  and  so that the nonnegative matrix factorization , with , is unique. A generalization to column features is easily obtained. In this section, we assume that, in addition to being of nonnegative rank , the matrix  is also exactly of rank .

2.1 Identifiability of NMF

The authors of [4] and [25] proposed two necessary and sufficient conditions for the factorization to be unique. Both conditions use the following geometric interpretation of NMF introduced by [4].

Since , the columns of  are conical combinations of the columns in . Formally, , the conical hull of the columns of , is a polyhedral cone contained in the first orthant of . As  is of rank , the rank of  is also . This implies that the extreme rays (also called generators) of  are exactly the columns of , which are linearly independent. is therefore

  • a simplicial cone of  generators,

  • contained in ,

  • containing all columns of .

Conversely, if we take any cone  verifying these three conditions and define a matrix  whose columns are the  generators of , there is a nonnegative matrix  such that . The uniqueness of the NMF is therefore equivalent to the uniqueness of the simplicial cone with  generators contained in the first orthant of  and containing all columns of .

In [25], an equivalent geometric interpretation in  is given in the following theorem:

Theorem 1.

([25]) The -dimensional NMF  of a rank- nonnegative matrix  is unique if and only if the nonnegative orthant  is the only simplicial cone  with  extreme rays satisfying

Despite the apparent simplicity of the theorem, the necessary and sufficient conditions are very difficult to check. Based on the theorem above, several sufficient conditions have been proposed. The most widely used condition is called the separability condition. Before introducing this condition (in its least restrictive version presented by [25]), we need the following two definitions.

Definition 1 (Separability).

Suppose . A nonnegative matrix  is said to be separable if there is a -by- permutation matrix  which verifies

where  is a -by- diagonal matrix with only strictly positive coefficients on the diagonal and zeros everywhere else, and the -by- matrix  is a collection of the other  rows of .
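For intuition, here is a small numerical example (ours, not from the paper): a nonnegative matrix is separable when, up to a row permutation, its first k rows form a diagonal matrix with strictly positive diagonal entries.

```python
import numpy as np

k = 3
D = np.diag([2.0, 0.5, 1.5])                  # k-by-k block with strictly positive diagonal
W_rest = np.array([[1.0, 0.2, 0.0],
                   [0.3, 0.7, 1.1],
                   [0.0, 0.4, 0.9]])          # remaining rows: any nonnegative values
W = np.vstack([D, W_rest])                    # separable: its first k rows are scaled canonical vectors

rng = np.random.default_rng(0)
W_permuted = W[rng.permutation(W.shape[0])]   # still separable (the definition allows a row permutation)
```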

Definition 2 (Strongly Boundary Closeness).

A nonnegative matrix  is said to be strongly boundary close if the following conditions are satisfied.

  1. is boundary close: for all , there is a row  in  which satisfies ;

  2. There is a permutation of  such that for all , there are  rows  in  which satisfy

    1. for all ;

    2. the square matrix  is of full rank .

Strong boundary closeness demands, modulo a permutation in , that for each , there are  rows  of  of the following form,

(9)

These row vectors all have 0 in the -th element, and their lower square matrix is of full rank. There are therefore enough linearly independent points on each -dimensional facet , which shows that  is in some sense maximal in .

The following was proved in [25]:

Theorem 2.

([25]) If  is strongly boundary close, then the only simplicial cone with  generators in  containing  is . Moreover, if  is separable, then  is the unique NMF of  up to permutation and scaling.

2.2 Identifiability with side information

The NMF with linear row features, , is said to be unique if, for all matrix pairs  that verify

we have   up to permutation of columns and scaling.

For a given full-rank matrix , consider the following two sets of matrices:

Theorem 3.

If , and , and  is separable, then the factorization  is unique.

Proof.

Notice that for , the nonnegative matrix  is strongly boundary close. The factorization  is therefore unique. The model identifiability follows immediately, since  is of full rank. ∎

Example of  that verifies 

For this theorem to have practical consequences, one needs to find appropriate row features so that .

Here we provide a family of matrices  so that .

With a fixed , suppose that  has  columns, and at least  rows, with the first  rows defined as the following:

  • the first row and column have 0 on the first entry and positive entries elsewhere;

  • for ,   has strictly positive entries on the first  columns, from Row  to Row , and zero entries everywhere else. These  rows are linearly independent.

Then we have , because the following -by- matrix  is in this set:

  • for  has  consecutive strictly positive entries on the -th column, between Row  and Row .

The following matrices instantiate the case of :

If , for any invertible matrix ,  .

3 HALSX algorithm

In this section, we propose Hierarchical Alternating Least Squares with eXogeneous variables (HALSX), a general algorithm to estimate the nonnegative matrix factorization model with side information from linear measurements, by solving (3). It is an extension of a popular NMF algorithm: Hierarchical Alternating Least Squares (HALS; see [26, 27]).

Before discussing HALSX, we first present a result on the local convergence of Gauss-Seidel algorithms. This result guarantees that any legitimate limit point generated by HALSX is a critical point of (3).

When presenting specific methods to estimate the link functions, we only discuss row features, as the generalization to column features is immediate.

3.1 Relaxation of convexity assumption for the convergence of Gauss-Seidel algorithm

To show the local convergence of the HALSX algorithm, we first extend a classical result concerning the block nonlinear Gauss-Seidel algorithm [28, Proposition 4].

Consider the minimization problem,

(10)
s.t.

where  is a continuously differentiable real-valued function, and the feasible set  is the Cartesian product of closed, nonempty and convex subsets , for , with . Suppose that the global minimum is reached at a point in . The -block Gauss-Seidel algorithm is defined as Algorithm 1.

  Initialize 
  while Stopping criterion is not satisfied do
     for  do
        Calculate 
     end for
     Set 
      
  end while
Algorithm 1 Gauss-Seidel algorithm
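A schematic Python rendering of the m-block Gauss-Seidel loop (our sketch: the exact block minimizer is supplied as a callback and a simple change-based stopping criterion is assumed):

```python
import numpy as np

def gauss_seidel(block_minimizer, x_blocks, n_iter=100, tol=1e-8):
    """Generic block Gauss-Seidel: cyclically replace each block by the minimizer
    of the objective over that block, the other blocks being held fixed."""
    for _ in range(n_iter):
        previous = [b.copy() for b in x_blocks]
        for i in range(len(x_blocks)):
            x_blocks[i] = block_minimizer(i, x_blocks)
        change = sum(np.linalg.norm(b - p) for b, p in zip(x_blocks, previous))
        if change < tol:       # stopping criterion (one possible choice)
            break
    return x_blocks
```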

We now formally define the notion of component-wise quasi-convexity.

Definition 3.

Let . The function  is quasi-convex with respect to the -th component on  if for every  and , we have

for all . is said to be strictly quasi-convex with respect to the -th component, if with the additional assumption that , we have

for all .
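The two displays of Definition 3 did not survive extraction. A standard reconstruction (our assumption, consistent with the component-wise notion used in [28]) is:

```latex
% Quasi-convexity of f w.r.t. the i-th block on X: for all x, y in X that differ
% only in their i-th block, and all t in [0, 1],
\begin{equation*}
f\bigl(x_1,\dots,\, t\,x_i + (1-t)\,y_i,\,\dots, x_m\bigr)
  \;\le\; \max\bigl\{ f(x),\, f(x_1,\dots,y_i,\dots,x_m) \bigr\} .
\end{equation*}
% Strict quasi-convexity: if in addition f(x) \neq f(x_1,\dots,y_i,\dots,x_m),
% the inequality above is strict for all t in (0, 1).
```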

It has been shown that if  is strictly quasi-convex with respect to the first  blocks of components on , then any limit point produced by a Gauss-Seidel algorithm is a critical point [28].

This result is not directly applicable to the HALS algorithm. Typically, if , the  column of , is identically zero, the loss function is completely flat with respect to , the -th column of . Therefore the loss function is not strictly quasi-convex. In order to avoid this scenario, [27] suggests thresholding at a small positive number  instead of at 0 when updating each column of the factor matrices.

In fact, the convexity assumption of [28] can be slightly relaxed so as to apply directly to HALS, as demonstrated by the following result.

Theorem 4.

Suppose that the function  is quasi-convex with respect to  on , for . Suppose that some limit points  of the sequence  verify that  is strictly quasi-convex with respect to  on the product set , for . Then every such limit point is a critical point of Problem (10).

Compared to the result of [28], this shows that strict quasi-convexity with respect to one block does not have to hold universally over the feasible regions of the other blocks; it only needs to hold at the limit point.

This theorem can be established following the proof of Proposition 5 of [28], using the following lemma.

Lemma 5.

Suppose that the function  is quasi-convex with respect to  on , for some . Suppose that some limit points  of  verify that  is strictly quasi-convex with respect to  on . Let  be a sequence of vectors defined as follows:

Then, if  , we have . That is .

Proof.

(The proof of the lemma is based on [29].)

Suppose on the contrary that  does not converge to 0. Define . Restricting to a subsequence, we can obtain that . Define . Notice that  is of unit norm, and . Since  is on the unit sphere, it has a converging subsequence. By restricting to a subsequence again, we may suppose that  converges to .

For all , we have , which implies  is on the segment . This segment has strictly positive dimension in the subspace corresponding to .

By the definition of , for all , and for all . In particular,

By quasi-convexity of  on ,

Taking the limit as  converges to  in both relations, we obtain

In other words, , which contradicts the strict quasi-convexity of  on . ∎

3.2 HALSX algorithm

To solve (3), we propose HALSX (Algorithm 2). When complete observations are available, the feature matrices are identity matrices, and only linear functions are allowed as link functions, Algorithm 2 reduces to HALS [26].

0:  Measurement operator , measurements , features  and , functional spaces  and  in which to search the link functions, and . 
  Initialize 
  while Stopping criterion is not satisfied do
      
      
     for  do
         
        Calculate 
         
         
     end for
     for  do
         
        Calculate 
         
         
     end for
      
  end while
  return   ,  
Algorithm 2 Hierarchical Alternating Least Squares with eXogeneous variables for NMF (HALSX)
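To make the elementary update concrete, here is a minimal Python sketch of HALSX in its simplest configuration (complete observations, linear link functions). It is our illustrative rendering of Algorithm 2, not the authors' implementation; it uses the small positive threshold suggested by [27], and all names are ours.

```python
import numpy as np

def halsx_linear(V, X, Y, k, n_iter=200, eps=1e-8, ridge=1e-10):
    """Minimal HALSX-style sketch: complete observations, linear link functions.

    V : (m, n) nonnegative data matrix; X : (m, p) row features; Y : (n, q) column
    features.  Each factor column is updated by least-squares regression of the
    HALS residual target on the features, followed by thresholding at a small
    positive eps.  Illustrative only.
    """
    rng = np.random.default_rng(0)
    W = np.abs(rng.normal(size=(V.shape[0], k)))
    H = np.abs(rng.normal(size=(V.shape[1], k)))

    def update_factor(F, G, M, Z):
        # Update every column of F (link fitted on features Z) so that M ~ F @ G.T.
        GtG = G.T @ G
        MtG = M @ G
        ZtZ_inv = np.linalg.inv(Z.T @ Z + ridge * np.eye(Z.shape[1]))
        for j in range(k):
            # HALS closed-form target for column j (residual projected on the j-th column of G).
            target = (MtG[:, j] - F @ GtG[:, j] + F[:, j] * GtG[j, j]) / max(GtG[j, j], eps)
            beta = ZtZ_inv @ (Z.T @ target)        # linear link fitted by least squares
            F[:, j] = np.maximum(Z @ beta, eps)    # ramp function plus small threshold
        return F

    for _ in range(n_iter):
        W = update_factor(W, H, V, X)
        H = update_factor(H, W, V.T, Y)
    return W, H
```

Each column update fits the link by least squares against the usual HALS residual target and then applies the thresholding; replacing the least-squares fit by any other regression method recovers the general scheme described above.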

From Theorem 4, one deduces that every full-rank limit point produced by the popular HALS algorithm is a critical point.

In this algorithm, at each elementary update step, we first look for a link function which minimizes the quadratic error, without concerning ourselves with its nonnegativity. The evaluation of the obtained minimizer is then thresholded at  to update the factors.

To show the convergence of the HALSX algorithm, we need to ensure that, for appropriate functional spaces  and , such an update solves the corresponding subproblem of (3). To do this, we use the following proposition:

Proposition 6.

Suppose that  are not identically equal to zero, and , with , is a convex differentiable function. Suppose

If , the Jacobian matrix of  at , is of rank , then is also a solution to

(11)
Proof.

Take  not identically equal to zero. We denote by  the loss function, so that  for all . The function  is convex.

Problem (11), which can be rewritten as

is also convex. The subgradient of the composite function  at  is simply obtained by multiplying , the Jacobian matrix of  at , with each element of , i.e. . Therefore  is a minimizer of (11) if and only if .

Since

is a minimizer of a smooth convex problem,

This means , because  is of full rank. Consequently

It has been shown in the NMF literature (for example, [27, Theorem 2]) that

This is equivalent to

or

We therefore conclude with . ∎

In many regression methods, even when a non-linear transformation is applied to the data, the regression function is linear in its parameters. A non-exhaustive list of such methods includes linear regression