Collaborative Filtering and Multi-Label Classification with Matrix Factorization

Machine learning techniques for Recommendation Systems (RS) and classification have become a prime focus of research to tackle the problem of information overload. RS are software tools that aim at making informed decisions about the services that a user may like. Classification, on the other hand, deals with the categorization of a data object into one of several predefined classes. In the multi-label classification problem, unlike the traditional multi-class classification setting, each instance can be simultaneously associated with a subset of labels. The focus of this thesis is on the development of novel techniques for collaborative filtering and multi-label classification. We propose a novel method of constructing a hierarchical bi-level maximum margin matrix factorization to handle matrix completion of an ordinal rating matrix. Taking a cue from the alternative formulation of support vector machines, we derive a novel loss function by considering proximity as an alternative criterion to margin maximization in the matrix factorization framework. We extend the concept of matrix factorization to yet another important problem of machine learning, namely multi-label classification, which deals with the classification of data with multiple labels. We propose a novel piecewise-linear embedding method with a low-rank constraint on the parametrization to capture nonlinear intrinsic relationships that exist in the original feature and label space. We also study the embedding of labels together with group information, with the objective of building an efficient multi-label classifier. We assume the existence of a low-dimensional space onto which the feature vectors and label vectors can be embedded. We ensure that labels belonging to the same group share the same sparsity pattern in their low-rank representations.

1.1 Contributions of the Thesis

The major contributions of the thesis are as follows. In the maximum margin matrix factorization scheme (a variant of the basic matrix factorization method), a rating matrix with multiple discrete values is handled by specially extending the hinge loss function to suit multiple levels. We view this process as analogous to extending a two-class classifier to a unified multi-class classifier. Alternatively, a multi-class classifier can be built by arranging multiple two-class classifiers in a hierarchical manner. We investigate this aspect for collaborative filtering and propose a novel method of constructing a hierarchical bi-level maximum margin matrix factorization to handle matrix completion of an ordinal rating matrix [67].

We observe that there could be several possible alternative criteria to formulate the factorization problem of discrete ordinal rating matrix, other than the maximum margin criterion. Taking a cue from the alternative formulation of support vector machines, a novel loss function is derived by considering proximity as an alternative criterion instead of margin maximization criterion for matrix factorization framework [69].

We extend the concept of matrix factorization to yet another important problem of machine learning, namely multi-label classification, which deals with the classification of data with multiple labels. We propose a novel piecewise-linear embedding method with a low-rank constraint on the parametrization to capture nonlinear intrinsic relationships that exist in the original feature and label space [66].

We study the embedding of labels together with the group information with an objective to build an efficient multi-label classifier. We assume the existence of a low-dimensional space onto which the feature vectors and label vectors can be embedded. We ensure that labels belonging to the same group share the same sparsity pattern in their low-rank representations.

1.2 Structure of the Thesis

The thesis is organized as follows. In Chapter 2, we begin with an introduction to matrix factorization and the associated optimization problem formulation. Thereafter we discuss some common loss functions used in matrix factorization to measure the deviation between the observed data and the corresponding approximation. We also discuss several forms of norm regularization, which is needed to avoid overfitting in matrix factorization models. In the later part of the chapter we discuss at length the application of matrix factorization techniques in collaborative filtering and multi-label classification.

Chapter 3 starts with a discussion of bi-level maximum margin matrix factorization (MMMF), which we subsequently use in our proposed algorithm. We carry out a deep investigation of the well-known maximum margin matrix factorization technique for a discrete ordinal rating matrix. This investigation led us to propose a novel and efficient algorithm called HMF (Hierarchical Matrix Factorization) for constructing a hierarchical bi-level maximum margin matrix factorization method to handle matrix completion of an ordinal rating matrix. The advantages of HMF over other matrix factorization based collaborative filtering methods are demonstrated by detailed experimental analysis at the end of the chapter.

Chapter 4 introduces a novel method termed PMMMF (Proximal Maximum Margin Matrix Factorization) for the factorization of a matrix with discrete ordinal ratings. Our work is motivated by the notion of Proximal SVMs (PSVMs) [77, 31] for binary classification, where two parallel planes are generated, one for each class, unlike the standard SVMs [121, 10, 87]. Taking a cue from this, we introduce a new loss function based on the proximity criterion instead of the margin maximization criterion in the context of matrix factorization. We validate our hypothesis by conducting experiments on real and synthetic datasets.

Chapter 5 extends the concept of matrix factorization to yet another important problem in machine learning, namely multi-label classification. We view matrix factorization as a kind of low-dimensional embedding of the data, which can be practically relevant when a matrix is viewed as a transformation of data from one space to another. At the beginning of the chapter we briefly discuss the traditional approach to multi-label classification and establish a bridge between multi-label classification and matrix factorization. We present a novel multi-label classification method, called MLC-HMF (Multi-label Classification using Hierarchical Embedding), which learns a piecewise-linear embedding with a low-rank constraint on the parametrization to capture nonlinear intrinsic relationships that exist in the original feature and label space. Extensive comparative studies that validate the effectiveness of the proposed method against state-of-the-art multi-label learning approaches are discussed in the experimental section.

In Chapter 6, we study the embedding of labels together with group information, with the objective of building an efficient multi-label classifier. We assume the existence of a low-dimensional space onto which the feature vectors and label vectors can be embedded. In order to achieve this, we address three sub-problems, namely: (1) identification of groups of labels; (2) embedding of label vectors into a low-rank space so that the sparsity characteristic of individual groups remains invariant; and (3) determining a linear mapping that embeds the feature vectors onto the same set of points, as in stage 2, in the low-rank space. At the end, we perform a comparative analysis which demonstrates the superiority of our proposed method over state-of-the-art algorithms for multi-label learning.

We conclude the thesis with a discussion of future directions in the final chapter.

1.3 Publications of the Thesis

1. Vikas Kumar, Arun K Pujari, Sandeep Kumar Sahu, Venkateswara Rao Kagita, Vineet Padmanabhan. "Collaborative Filtering Using Multiple Binary Maximum Margin Matrix Factorizations." Information Sciences 380 (2017): 1-11.

2. Vikas Kumar, Arun K Pujari, Sandeep Kumar Sahu, Venkateswara Rao Kagita, Vineet Padmanabhan. "Proximal Maximum Margin Matrix Factorization for Collaborative Filtering." Pattern Recognition Letters 86 (2017): 62-67.

3. Vikas Kumar, Arun K Pujari, Vineet Padmanabhan, Sandeep Kumar Sahu, Venkateswara Rao Kagita. "Multi-label Classification Using Hierarchical Embedding." Expert Systems with Applications 91 (2018): 263-269.

4. Vikas Kumar, Arun K Pujari, Vineet Padmanabhan, Venkateswara Rao Kagita. "Group Preserving Label Embedding for Multi-Label Classification." Pattern Recognition, Under Review.

2.1 Matrix Factorization

Matrix factorization is a key technique employed in many real-world applications, wherein the objective is to select factor matrices $U$ and $V$ such that, among all the matrices that fit the given data, the one with the lowest rank is preferred. In most applications, the task is not just to compute any factorization, but also to enforce additional constraints on the factors $U$ and $V$. The matrix factorization problem can be formally defined as follows:

Definition 1 (Matrix Factorization).

Given a data matrix $Y$, matrix factorization aims at determining two matrices $U$ and $V$ such that $Y \approx UV^T$, where the inner dimension (the number of columns shared by $U$ and $V$) is called the numerical rank of the matrix. The numerical rank is much smaller than the number of rows and the number of columns of $Y$, and hence factorization allows the matrix to be stored inexpensively. The product $UV^T$ is called a low-rank approximation of $Y$.

A common formulation for this problem is a regularized loss problem which can be given as follows.

\[
\min_{U,V} \; \ell(Y, U, V) + \lambda R(U, V) \tag{2.1}
\]

where $Y$ is the data matrix, $\hat{Y} = UV^T$ is its low-rank approximation, $\ell(\cdot)$ is a loss function that measures how well $\hat{Y}$ approximates $Y$, and $R(\cdot)$ is a regularization function that promotes various desired properties in $\hat{Y}$ (low-rank, sparsity, group-sparsity, etc.). When $\ell$ and $R$ are convex functions of $\hat{Y}$ (equivalently, of $U$ and $V$), the above problem can be solved efficiently. The loss function can further be decomposed into the sum of pairwise discrepancies between the observed entries and their corresponding predictions, that is, $\ell(Y, U, V) = \sum_{(i,j)} \ell(y_{ij}, U_i, V_j)$, where the sum runs over the observed entries, $y_{ij}$ is the $(i,j)$th entry of the matrix $Y$, $U_i$, the $i$th row of $U$, represents the latent feature vector of the $i$th user, and $V_j$, the $j$th row of $V$, represents the latent feature vector of the $j$th item. There have been numerous proposals for factorizing a matrix, and these differ, by and large, in how they define the loss function and the regularization function. The structure of the latent factors depends on the type of loss (cost) function and the constraints imposed on the factor matrices. In the following subsections, we present a brief literature review of the major loss functions and regularizations.

2.2 Loss Function

The loss function in an MF model measures the discrepancy between an observed entry and the corresponding prediction. Based on the type of observed data, loss functions in MF can be roughly grouped into three categories: (1) loss functions for binary data; (2) loss functions for discrete ordinal data; and (3) loss functions for real-valued data.

2.2.1 Binary Loss Function:

For a binary data matrix whose entries are restricted to take only two values, $y_{ij} \in \{-1, +1\}$, we review some of the important loss functions such as the zero-one loss, hinge loss, smooth hinge loss, modified least squares and logistic loss [101] in the following subsections.

Zero-One Loss: The zero-one loss is a standard loss function that measures the penalty for an incorrect prediction and takes only two values, zero or one. For a given observed entry $y_{ij}$ and the corresponding prediction $\hat{y}_{ij}$, the zero-one loss is defined as

\[
\ell(z) =
\begin{cases}
0 & \text{if } z \ge 0; \\
1 & \text{otherwise,}
\end{cases}
\tag{2.2}
\]

where $z = y_{ij}\hat{y}_{ij}$.

Hinge Loss: There are two major drawbacks with the zero-one loss: (1) an objective function involving the zero-one loss is difficult to optimize because the function is non-convex; (2) it is insensitive to the magnitude of the prediction, whereas in general, when the entries are binary, the magnitude of the score represents the confidence in the prediction. The hinge loss is a convex upper bound on the zero-one error [34] and is sensitive to the magnitude of the prediction. For classification with large confidence (margin), the hinge loss is the most preferred loss function and is defined as

\[
\ell(z) =
\begin{cases}
0 & \text{if } z \ge 1; \\
1 - z & \text{otherwise.}
\end{cases}
\tag{2.3}
\]

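Smooth Hinge Loss: The hinge loss is non-differentiable at $z = 1$ and very sensitive to outliers [100]. An alternative to the hinge loss, called the smooth hinge, is proposed in [101] and is defined as

\[
\ell(z) =
\begin{cases}
0 & \text{if } z \ge 1; \\
\frac{1}{2}(1 - z)^2 & \text{if } 0 < z < 1; \\
\frac{1}{2} - z & \text{otherwise.}
\end{cases}
\tag{2.4}
\]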

The smooth hinge shares many properties with the hinge loss and is insensitive to outliers. A detailed discussion of the hinge and smooth hinge losses is given in a later chapter.

Modified Square Loss: In [153], the hinge function is replaced by a smoother quadratic function to make the derivative smooth. The modified square loss is defined as

\[
\ell(z) =
\begin{cases}
0 & \text{if } z \ge 1; \\
(1 - z)^2 & \text{otherwise.}
\end{cases}
\tag{2.5}
\]

Logistic Loss: The logistic loss is a strictly convex function which enjoys properties similar to those of the hinge loss [86]. The logistic loss is defined as

\[
\ell(z) = \log\bigl(1 + \exp(-z)\bigr). \tag{2.6}
\]
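The five binary losses reviewed above can be compared side by side with a short sketch. It assumes that the argument is the margin $z = y_{ij}\hat{y}_{ij}$ with $y_{ij} \in \{-1, +1\}$, and that the smooth hinge takes its standard three-piece form; the function names are illustrative only and do not come from the thesis.

```python
import math


def zero_one(z):
    """Zero-one loss: 0 for a non-negative margin, 1 otherwise."""
    return 0.0 if z >= 0 else 1.0


def hinge(z):
    """Hinge loss: zero once the margin reaches 1, linear below it."""
    return max(0.0, 1.0 - z)


def smooth_hinge(z):
    """Smooth hinge (assumed standard form): quadratic on (0, 1), linear for z <= 0."""
    if z >= 1:
        return 0.0
    if z > 0:
        return 0.5 * (1.0 - z) ** 2
    return 0.5 - z


def modified_square(z):
    """Modified square loss: squared shortfall from a margin of 1."""
    return 0.0 if z >= 1 else (1.0 - z) ** 2


def logistic(z):
    """Logistic loss log(1 + exp(-z)); exp may overflow for very negative z."""
    return math.log1p(math.exp(-z))


if __name__ == "__main__":
    for z in (-1.0, 0.5, 2.0):
        print(z, zero_one(z), hinge(z), smooth_hinge(z), modified_square(z), logistic(z))
```

Evaluating the functions at a negative margin makes the difference in growth visible: the hinge and smooth hinge grow linearly, while the modified square loss grows quadratically.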

2.2.2 Discrete Ordinal Loss Function

For a discrete ordinal matrix whose entries are no longer just like and dislike but can take any value from a discrete range, that is, $y_{ij} \in \{1, 2, \ldots, R\}$, the binary-class loss function needs to be extended to a multi-class loss function. There are two major approaches for extending a loss function for binary classification to multi-class classification. The first approach is to solve the multi-class problem directly by modifying the binary-class objective function, adding a constraint to it for every class, as suggested in [101, 143]. The second approach is to decompose the multi-class classification problem into a series of independent binary classification problems, as given in [11].

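To extend a binary loss function to the multi-class setting, most approaches define a set of $R-1$ threshold values $\theta_{i,1} \le \theta_{i,2} \le \cdots \le \theta_{i,R-1}$ for each user $i$, such that the real line is divided into $R$ regions. The region defined by the thresholds $\theta_{i,r-1}$ and $\theta_{i,r}$ corresponds to rating $r$ [101]. For simplicity of notation, we assume $\theta_{i,0} = -\infty$ and $\theta_{i,R} = +\infty$. There are different approaches for constructing a loss function based on a set of threshold values, such as immediate-threshold and all-threshold [101]. For each observed entry and corresponding prediction pair $(y_{ij}, U_i V_j^T)$, the immediate-threshold approach calculates the loss as the sum of the violations of the two thresholds that bound the region of the observed rating:

\[
\ell(y_{ij}, U_i V_j^T) = \ell\bigl(U_i V_j^T - \theta_{i,y_{ij}-1}\bigr) + \ell\bigl(\theta_{i,y_{ij}} - U_i V_j^T\bigr). \tag{2.7}
\]

The all-threshold loss, on the other hand, is calculated as the sum of the losses over all thresholds, which is the cost of crossing multiple rating boundaries, and is defined as

\[
\ell(y_{ij}, U_i V_j^T) = \sum_{r=1}^{y_{ij}-1} \ell\bigl(U_i V_j^T - \theta_{i,r}\bigr) + \sum_{r=y_{ij}}^{R-1} \ell\bigl(\theta_{i,r} - U_i V_j^T\bigr). \tag{2.8}
\]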
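The difference between the two threshold-based constructions in Eqs. (2.7) and (2.8) is easiest to see on a small example. The sketch below is illustrative only: it assumes the hinge loss as the underlying binary loss, a single user whose $R-1$ finite thresholds are passed as a Python list, and it skips the infinite boundary thresholds, whose hinge terms are zero. The function names are hypothetical.

```python
def hinge(z):
    """Hinge loss used as the underlying binary loss (an assumption)."""
    return max(0.0, 1.0 - z)


def immediate_threshold_loss(x, y, theta, loss=hinge):
    """Eq. (2.7): penalize only the two thresholds bounding rating y.

    x     : real-valued prediction U_i V_j^T
    y     : observed rating in 1..R
    theta : list of the R-1 finite thresholds [theta_1, ..., theta_{R-1}]
    """
    num_ratings = len(theta) + 1
    total = 0.0
    if y > 1:                       # lower threshold theta_{y-1} is finite
        total += loss(x - theta[y - 2])
    if y < num_ratings:             # upper threshold theta_y is finite
        total += loss(theta[y - 1] - x)
    return total


def all_threshold_loss(x, y, theta, loss=hinge):
    """Eq. (2.8): penalize every threshold that the prediction is on the wrong side of."""
    total = 0.0
    for r in range(1, len(theta) + 1):
        t = theta[r - 1]
        # thresholds below the rating should lie below x, the rest above x
        total += loss(x - t) if r < y else loss(t - x)
    return total


if __name__ == "__main__":
    theta = [1.0, 2.0, 3.0, 4.0]    # R = 5 ratings
    print(immediate_threshold_loss(4.5, 1, theta))
    print(all_threshold_loss(4.5, 1, theta))
```

For a prediction far from the region of the observed rating, the all-threshold loss accumulates a penalty for each boundary crossed, whereas the immediate-threshold loss only sees the nearest boundary.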

2.2.3 Real-valued Loss Function

For a real-valued data matrix whose entries are in the real-valued domain $\mathbb{R}$, we review some of the important loss functions such as the squared loss [93], KL-divergence [73], $\beta$-divergence [17] and Itakura-Saito divergence [29]. These are given in the following subsections.

Squared Loss: The squared loss is the most common loss function used for real-valued prediction. The penalty for a prediction error is calculated as the squared distance between the observed entry and the corresponding prediction. The squared loss is defined as

\[
\ell(y_{ij}, \hat{y}_{ij}) = (y_{ij} - \hat{y}_{ij})^2 \tag{2.9}
\]

where $\hat{y}_{ij} = U_i V_j^T$.

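KL-Divergence Loss: The KL-divergence loss, also known as the I-divergence loss, measures the information lost when $\hat{y}_{ij}$ is used as an approximation to $y_{ij}$. The KL-divergence loss is defined as

\[
\ell(y_{ij}, \hat{y}_{ij}) = y_{ij} \ln\frac{y_{ij}}{\hat{y}_{ij}} - y_{ij} + \hat{y}_{ij}. \tag{2.10}
\]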

Itakura-Saito Divergence: The Itakura-Saito divergence [29] is obtained from maximum likelihood (ML) estimation. The loss function is defined as

\[
\ell(y_{ij}, \hat{y}_{ij}) = \frac{y_{ij}}{\hat{y}_{ij}} - \ln\frac{y_{ij}}{\hat{y}_{ij}} - 1. \tag{2.11}
\]

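$\beta$-Divergence Loss: Cichocki et al. [17] proposed a generalized family of loss functions called the $\beta$-divergence, defined as

\[
\ell(y_{ij}, \hat{y}_{ij}) =
\begin{cases}
\dfrac{y_{ij}^{\beta}}{\beta(\beta-1)} + \dfrac{\hat{y}_{ij}^{\beta}}{\beta} - \dfrac{y_{ij}\,\hat{y}_{ij}^{\beta-1}}{\beta-1} & \text{if } \beta \in \mathbb{R} \setminus \{0, 1\}; \\[2ex]
y_{ij} \ln\dfrac{y_{ij}}{\hat{y}_{ij}} - y_{ij} + \hat{y}_{ij} & \text{if } \beta = 1; \\[2ex]
\dfrac{y_{ij}}{\hat{y}_{ij}} - \ln\dfrac{y_{ij}}{\hat{y}_{ij}} - 1 & \text{if } \beta = 0,
\end{cases}
\tag{2.12}
\]

where $\beta$ is a generalization parameter. In the limiting cases $\beta \to 0$ and $\beta \to 1$, the $\beta$-divergence corresponds to the Itakura-Saito divergence and the Kullback-Leibler divergence, respectively. The squared loss is obtained, up to a factor of $\frac{1}{2}$, as the special case $\beta = 2$.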
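The special cases of the $\beta$-divergence can be checked numerically. The following sketch is illustrative and assumes strictly positive scalars $y_{ij}$ and $\hat{y}_{ij}$, the standard form of the Itakura-Saito divergence for $\beta = 0$, and the Kullback-Leibler (I-divergence) form for $\beta = 1$; the function name is hypothetical.

```python
import math


def beta_divergence(y, y_hat, beta):
    """Beta-divergence between positive scalars y and y_hat."""
    if beta == 0:
        # Itakura-Saito divergence (standard form, assumed here)
        ratio = y / y_hat
        return ratio - math.log(ratio) - 1.0
    if beta == 1:
        # Kullback-Leibler (I-divergence) case
        return y * math.log(y / y_hat) - y + y_hat
    # general case, beta not in {0, 1}
    return (y ** beta / (beta * (beta - 1.0))
            + y_hat ** beta / beta
            - y * y_hat ** (beta - 1.0) / (beta - 1.0))


if __name__ == "__main__":
    y, y_hat = 3.0, 2.0
    print(beta_divergence(y, y_hat, 2.0), 0.5 * (y - y_hat) ** 2)
    print(beta_divergence(y, y_hat, 1e-6), beta_divergence(y, y_hat, 0))
    print(beta_divergence(y, y_hat, 1 + 1e-6), beta_divergence(y, y_hat, 1))
```

Each printed pair should agree closely: $\beta = 2$ gives half the squared loss, and values of $\beta$ near 0 and 1 approach the Itakura-Saito and Kullback-Leibler cases.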

2.3 Regularization

Most machine learning algorithms suffer from the problem of overfitting, where the model fits the training data too well but has poor generalization capability on new data. Thus, a proper technique should be adopted to avoid overfitting the training data. In the MF community, to keep the model from fitting only a particular dataset, many researchers have tried different regularization terms along with some loss function [63, 94, 113, 65]. Regularization is used not only to prevent overfitting but also to achieve different structural representations of the latent factors. Several methods based on norm regularization have been proposed in the literature, including the $\ell_1$, $\ell_2$ and $\ell_{2,1}$ norms.

$\ell_1$ Norm: The $\ell_1$ norm, also known as the sparsity-inducing norm, is used to produce sparse latent factors and thus avoids overfitting by retaining only the useful factors. In effect, this implies that most units take values close to zero while only a few take significantly non-zero values. For a given matrix $A$, the $\ell_1$ norm is defined as

\[
\|A\|_1 = \sum_{ij} |a_{ij}|. \tag{2.13}
\]

$\ell_2$ Norm: The most popular and widely investigated regularization term used in MF models is the $\ell_2$ norm, which is also known as the Frobenius norm. For a given matrix $A$, the $\ell_2$ norm is defined as

\[
\|A\|_F = \sqrt{\sum_{ij} a_{ij}^2} \tag{2.14}
\]

where $\|A\|_F$ is the Frobenius norm of the matrix. Minimizing an objective function containing the $\ell_2$ norm as the regularization term gives two benefits: (1) it avoids overfitting by penalizing large latent factor values; (2) approximating the target matrix with low-rank factor matrices is a typical non-convex optimization problem, and the $\ell_2$ norm has also been suggested as a convex surrogate for the rank in control applications [28, 100].

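$\ell_{2,1}$ Norm: The $\ell_{2,1}$ norm, also known as the group sparsity norm, is used to induce a sparse representation at the level of groups. For a given matrix $A$ with $n$ rows and $m$ columns, the $\ell_{2,1}$ norm is defined as

\[
\|A\|_{2,1} = \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{m} a_{ij}^2}. \tag{2.15}
\]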
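The three norms can be computed directly from their definitions. The sketch below is a plain-Python illustration that represents a matrix as a list of rows; the function names are hypothetical. The $\ell_{2,1}$ norm here treats each row as one group, as in Eq. (2.15).

```python
import math


def l1_norm(A):
    """Sum of absolute values of all entries, Eq. (2.13)."""
    return sum(abs(a) for row in A for a in row)


def frobenius_norm(A):
    """Square root of the sum of squared entries, Eq. (2.14)."""
    return math.sqrt(sum(a * a for row in A for a in row))


def l21_norm(A):
    """Sum over rows of the Euclidean norm of each row, Eq. (2.15)."""
    return sum(math.sqrt(sum(a * a for a in row)) for row in A)


if __name__ == "__main__":
    A = [[3.0, 4.0],
         [0.0, 0.0],
         [0.0, -5.0]]
    print(l1_norm(A), frobenius_norm(A), l21_norm(A))
```

A row that is entirely zero contributes nothing to the $\ell_{2,1}$ norm, which is why penalizing it tends to switch off whole groups rather than individual entries.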

2.4 Collaborative filtering with Matrix Factorization

In collaborative filtering, the goal is to infer a user's preferences for items based on that user's previously given preferences and a large collection of preferences of other users. Given a partially observed user-item rating matrix, with one row per user and one column per item, the goal is to predict the unobserved preferences of users for items. The collaborative filtering problem can be formally defined as follows:

Definition 2 (Collaborative Filtering).

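Let $Y$ be a user-item rating matrix of size $N \times M$ and $\Omega$ be the set of observed entries. For each $(i,j) \in \Omega$, the entry $y_{ij} \in \{1, 2, \ldots, R\}$ defines the preference of the $i$th user for the $j$th item. For each $(i,j) \notin \Omega$, $y_{ij} = 0$ indicates that the preference of the $i$th user for the $j$th item is not available (unsampled entry). Given a partially observed rating matrix $Y$, the goal is to predict $y_{ij}$ for $(i,j) \notin \Omega$.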

Matrix factorization is a key technique employed for the completion of a user-item rating matrix, wherein the objective is to learn low-rank (or low-norm) latent factors $U$ (for users) and $V$ (for items) so as to simultaneously approximate the observed entries under some loss measure and predict the unobserved entries. There are various ways of doing so. It is shown in [71] that, to approximate $Y$, the entries in the factor matrices $U$ and $V$ need to be non-negative so that only additive combinations of factors are allowed. The basic idea is to learn the factor matrices $U$ and $V$ in such a way that the sum of squared distances between the observed entries and the corresponding predictions is minimized. The optimization problem is formulated as

\[
\min_{U \ge 0,\, V \ge 0} J(U, V) = \sum_{(i,j) \in \Omega} \bigl(y_{ij} - U_i V_j^T\bigr)^2. \tag{2.16}
\]
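As a rough illustration of the objective in Eq. (2.16), the sketch below minimizes the squared error over the observed entries with stochastic gradient steps, clipping the factors at zero after every update to keep them non-negative. This projected-gradient scheme is an assumption made for illustration and is not the update rule of [71]; the function names, the learning rate and the toy matrix are all hypothetical.

```python
import random


def objective(Y, observed, U, V):
    """Sum of squared errors over the observed entries, as in Eq. (2.16)."""
    total = 0.0
    for i, j in observed:
        pred = sum(u * v for u, v in zip(U[i], V[j]))
        total += (Y[i][j] - pred) ** 2
    return total


def factorize(Y, observed, d=2, lr=0.01, epochs=2000, seed=0):
    """Projected stochastic gradient descent for non-negative factors U and V."""
    rng = random.Random(seed)
    n, m = len(Y), len(Y[0])
    U = [[rng.random() for _ in range(d)] for _ in range(n)]
    V = [[rng.random() for _ in range(d)] for _ in range(m)]
    for _ in range(epochs):
        for i, j in observed:
            pred = sum(u * v for u, v in zip(U[i], V[j]))
            err = Y[i][j] - pred
            for k in range(d):
                u, v = U[i][k], V[j][k]
                # gradient step on the squared error, then projection onto [0, inf)
                U[i][k] = max(0.0, u + lr * 2.0 * err * v)
                V[j][k] = max(0.0, v + lr * 2.0 * err * u)
    return U, V


if __name__ == "__main__":
    # 0 marks an unobserved entry in this toy rating matrix
    Y = [[5, 3, 0],
         [4, 0, 1],
         [1, 1, 5]]
    observed = [(i, j) for i in range(3) for j in range(3) if Y[i][j] != 0]
    U, V = factorize(Y, observed)
    print(objective(Y, observed, U, V))
```

The unobserved entries never enter the objective; their predictions are read off afterwards as the dot product of the corresponding rows of $U$ and $V$.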

Singular value decomposition (SVD) is used in [105] to learn the factor matrices $U$ and $V$. The key technical challenge when SVD is applied to sparse matrices is that it suffers from severe overfitting. When SVD factorization is applied to sparse data, the error function needs to be modified so as to consider only the observed ratings by setting the non-observed entries to zero. This minor modification results in a non-convex optimization problem. Instead of minimizing the rank of a matrix, maximum margin matrix factorization (MMMF) [110], proposed by Srebro et al., aims at minimizing the Frobenius norms of $U$ and $V$, resulting in convex optimization problems. It is shown that MMMF can be formulated as a semi-definite programming (SDP) problem and solved using standard SDP solvers. However, current SDP solvers can only handle MMMF problems on matrices of dimensionality up to a few hundred. Hence, a direct gradient-based optimization method for MMMF is proposed in [100] to make fast collaborative prediction. A detailed discussion of MMMF is given in Chapter 3. To further improve the performance of MMMF, in [23] MMMF is cast using ensemble methods, which include bagging and random weight seeding. MMMF was further extended in [129] by introducing offset terms, item-dependent regularization and a graph kernel on the recommender graph. In [135], a nonparametric Bayesian-style MMMF was proposed that utilizes nonparametric techniques to resolve the unknown number of latent factors in the MMMF model [129, 100, 23]. A probabilistic interpretation of the MMMF model through data augmentation was presented in [136].

The proposal of MMMF hinges heavily on extended hinge loss function. Research on different loss functions and their extension to handle multiple classes has not attracted much attention of researchers though there are some important proposals [83]. MMMF has become a very popular research topic since its publication and several extensions have been proposed [23, 129, 135, 136]. There has also been some research on matrix factorization on binary or bi-level preferences [122]. But many view binary preference as a special case of matrix factorization with discrete ratings. In [155] the rating matrix is decomposed hierarchically by grouping similar users and items together, and each sub-matrix is factorized locally. To the best of our knowledge, there is no research on hierarchical MMMF.

2.5 Multi-label Classification with Matrix Factorization

In machine learning and statistics, the classification problem is concerned with the assignment of a class (category) to a data object (instance) from a given set of discrete classes; for example, classifying a document into one of several known categories such as sports, crime, business or politics. In a traditional classification problem, data objects are represented in the form of feature vectors, each associated with a unique class label from a set of disjoint class labels. Depending on the total number of disjoint classes, a learning task is categorized as binary classification (when there are two classes) or multi-class classification (when there are more than two) [108]. However, in many real-world classification tasks, a data object can be simultaneously associated with more than one class. For example, a document can simultaneously belong to more than one class, such as politics and business. The objective of multi-label classification (MLC) is to build a classifier that can automatically tag an example with the most relevant subset of labels. This problem can be seen as a generalization of single-label classification, where an instance is associated with a unique class label from a set of disjoint labels. The multi-label classification problem can be formally defined as follows:

Definition 3 (Multi-label Classification).

Given training examples in the form of a pair of a feature matrix $X$ and a label matrix $Y$, where each example $x_i$ is a row of $X$ and its associated label vector $y_i$ is the corresponding row of $Y$. The entry at the $j$th coordinate of the vector $y_i$ indicates whether label $j$ is present in data point $x_i$. The task of multi-label classification is to learn a parametrization that maps each instance (or feature vector) to a set of labels (a label vector).

MLC is a major research area in the machine learning community and finds application in several domains such as computer vision [12, 9], data mining [118, 106] and text classification [150, 106]. Due to the exponential size of the output space, exploiting intrinsic information in the feature and label spaces has been the major thrust of research in recent years, and the use of parametrization and embedding techniques has been the prime focus in MLC. The embedding-based approach assumes that there exists a low-dimensional space onto which the given set of feature vectors and/or label vectors can be embedded. The embedding strategies can be grouped into two categories, namely (1) feature space embedding (FE) and (2) label space embedding (LE). Feature space embedding aims to design a projection function which can map an instance in the original feature space to the label space. The label space embedding approach, on the other hand, transforms the label vectors to an embedded space via linear or local non-linear embeddings, followed by learning the association between the feature vectors and the embedded label space for classification purposes. With a proper decoding process that maps the projected data back to the original label space, the task of multi-label prediction is achieved [41, 97, 112]. We present a brief review of the FE and LE approaches for multi-label classification. A detailed discussion is given in Chapter 5.

Given training examples in the form of a pair of a feature matrix $X$ and a label matrix $Y$, where each example $x_i$ is a row of $X$ and its associated label vector $y_i$ is the corresponding row of $Y$, the goal of FE is to learn a transformation matrix which maps instances from the feature space to the label space. This approach requires a number of parameters equal to the product of the feature and label dimensions to model the classification problem, which is expensive when both dimensions are large [139]. In [139], a generic empirical risk minimization (ERM) framework is used with a low-rank constraint on the linear parametrization $UV^T$, where $U$ and $V$ are the low-rank factors. The problem can be restated as follows.

\[
\min_{U,V} \; \ell\bigl(Y, XUV^T\bigr) + \frac{\lambda}{2}\bigl(\|U\|_F^2 + \|V\|_F^2\bigr) \tag{2.17}
\]

where $\|\cdot\|_F$ is the Frobenius norm. The formulation in Eq. (2.17) can capture the intrinsic information of both the feature and label spaces. It can also be seen as a joint learning framework in which dimensionality reduction and multi-label classification are performed simultaneously [51, 140].
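To make the saving from the low-rank parametrization concrete, the sketch below counts parameters for a full transformation matrix versus the factored form, and evaluates the objective of Eq. (2.17) on a tiny example. It assumes the squared loss for $\ell$, which Eq. (2.17) leaves unspecified, and all function names and numbers are hypothetical.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def squared_frobenius(A):
    return sum(a * a for row in A for a in row)


def full_params(num_features, num_labels):
    """Parameters of an unconstrained feature-to-label transformation."""
    return num_features * num_labels


def low_rank_params(num_features, num_labels, rank):
    """Parameters of the factored form U (features x rank), V (labels x rank)."""
    return rank * (num_features + num_labels)


def objective(X, Y, U, V, lam):
    """Eq. (2.17) with the squared loss standing in for the generic loss."""
    pred = matmul(matmul(X, U), transpose(V))
    loss = sum((y - p) ** 2 for y_row, p_row in zip(Y, pred)
               for y, p in zip(y_row, p_row))
    return loss + 0.5 * lam * (squared_frobenius(U) + squared_frobenius(V))


if __name__ == "__main__":
    print(full_params(1000, 500), low_rank_params(1000, 500, 10))
    X = [[1.0, 0.0], [0.0, 1.0]]              # two instances, two features
    Y = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]    # three labels
    U = [[1.0], [2.0]]                        # rank-1 factors
    V = [[1.0], [0.0], [1.0]]
    print(objective(X, Y, U, V, lam=2.0))
```

With 1000 features, 500 labels and rank 10, the factored form needs 15,000 parameters instead of 500,000.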

The matrix factorization (MF) based approach for LE aims at determining two matrices $U$ and $V$ whose product approximates the label matrix $Y$ [5, 134]. One factor can be viewed as the basis matrix, while the other can be treated as the coefficient matrix, and a common formulation is the following optimization problem.

\[
\min_{U,V} \; \ell(Y, U, V) + \lambda R(U, V) \tag{2.18}
\]

where $\ell(\cdot)$ is a loss function that measures how well the product of $U$ and $V$ approximates $Y$, $R(\cdot)$ is a regularization function that promotes various desired properties in $U$ and $V$ (sparsity, group-sparsity, etc.) and the constant $\lambda \ge 0$ is the regularization parameter which controls the extent of regularization. In [79], an MF-based approach is used to learn the label encoding and decoding matrices simultaneously. The problem is formulated as follows.

\[
\min_{U,V} \; \|Y - UV\|_F^2 + \alpha \Psi(X, U) \tag{2.19}
\]

where $U$ is the code matrix, $V$ is the decoding matrix, $\Psi(X, U)$ is used to make $U$ feature-aware by considering correlations between $X$ and $U$ as side information, and the constant $\alpha \ge 0$ is the trade-off parameter.

3.1 Introduction

In this chapter, we describe the proposed method, termed as HMF (Hierarchical Matrix Factorization). HMF is a novel method for constructing a hierarchical bi-level maximum margin matrix factorization to handle matrix completion of ordinal rating matrix. The proposed method draws motivation from research on multi-class classification. There are two major approaches of extending two-class classifiers to multi-class classifiers. The first approach explicitly reformulates the classifier, resulting in a unified multiclass optimization problem (embedded technique). The second approach (combinational technique) is to decompose a multiclass problem into multiple, independently trained, binary classification problems and to combine them appropriately so as to form a multiclass classifier. In maximum margin matrix factorization (MMMF) [100], the authors adopted embedded approach by extending bi-level hinge loss to multi-level cases.

Combinational techniques have been very popular and successful, as they exhibit some relative advantages over embedded techniques, and this prompts us to ask whether some sort of combinational approach can be employed in the context of MMMF. There are several combinational techniques, such as One-Against-All (OAA) [40, 102, 131] and One-Against-One (OAO) [22, 40]. HMF falls into the category of the OAA approach, with a difference. The OAA approach to classification employs one binary classifier for each class against all other classes. In the present context, each 'class' corresponds to an ordinal rank. Interestingly, since the ranks are ordered, the OAA strategy here takes advantage of the ordering of the 'classes'. The traditional OAA strategy would use one bi-level matrix factorization for each rank $r$, with rank $r$ as one label and all other ranks as the other label. Here, instead, we employ for each rank $r$ a bi-level matrix factorization in which all ranks below $r$ form one label and all ranks above $r$ form the other.
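The ordered one-against-all split described above can be sketched as a preprocessing step that turns one ordinal rating matrix into several bi-level matrices. The code below is only an illustration under stated assumptions: ratings are integers in $1, \ldots, R$, unobserved entries are stored as 0, labels are encoded as $-1$ and $+1$, and a cutoff $q$ sends ratings up to and including $q$ to $-1$. The exact cutoff convention and the function names are hypothetical.

```python
def binarize(Y, q):
    """Bi-level view of Y for cutoff q: ratings <= q become -1, ratings > q become +1.

    Unobserved entries (stored as 0) are left untouched.
    """
    return [[0 if y == 0 else (-1 if y <= q else 1) for y in row] for row in Y]


def binary_problems(Y, num_ratings):
    """One bi-level matrix per cutoff q = 1, ..., R-1."""
    return [binarize(Y, q) for q in range(1, num_ratings)]


if __name__ == "__main__":
    Y = [[1, 3, 0],
         [5, 2, 4]]
    for q, B in enumerate(binary_problems(Y, num_ratings=5), start=1):
        print(q, B)
```

Each resulting matrix can then be completed by a separate bi-level factorization, so that $R$ rating levels give rise to $R-1$ binary problems.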

MMMF and HMF, like other matrix factorization methods, use the latent factor model approach. The latent factor model is concerned with learning, from limited observations, a latent factor vector $U_i$ for each user and a latent factor vector $V_j$ for each item, such that the dot product of $U_i$ and $V_j$ gives the rating of user $i$ for item $j$. In MMMF, besides learning the latent factors, the set of thresholds also needs to be learned. It is assumed that there are $R-1$ thresholds $\theta_{i,r}$ for each user $i$ ($R$ is the number of ordinal values, or number of ranks). Thus, the rating of user $i$ for item $j$ is decided by comparing the dot product of $U_i$ and $V_j$ with each $\theta_{i,r}$. The fundamental assumption of MMMF is therefore that the latent factor vectors determine the properties of users and items, and the threshold values capture the characteristics of the ratings. HMF differs from MMMF on this principle. The underlying principle of HMF is that the latent factors of users and items will be different if the users' like/dislike cutoff thresholds are different. The latent factors differ according to the situation. For instance, the latent factors when ranks above one cutoff are identified as likes are different from those when ranks above another cutoff are identified as likes. Thus, if we have $R$ ratings, there will be $R-1$ pairs of latent factors $(U, V)$, one pair for each cutoff. Unlike the process of learning a single pair of latent factors $U$ and $V$, HMF learns several $U$'s and $V$'s in this process without any additional computational overhead. The process is shown to be a more accurate matrix completion process. There has not been any attempt in this direction, and we believe that our present study will open up new areas of research in the future.

The rest of the chapter is organized as follows. In Section 3.2, we describe the bi-level MMMF which we use subsequently in our algorithm. Section 3.3 summarizes the well-known existing MMMF method. We introduce our proposed method of Hierarchical Matrix Factorization (HMF) in Section 3.4. The advantages of HMF over other matrix factorization based collaborative filtering methods are given by detailed experimental analysis in Section 3.6. Section 3.7 concludes the chapter.

3.2 Bi-level MMMF

In this section we describe Maximum Margin Matrix Factorization for a bi-level rating matrix. The matrix completion of bi-level matrix is concerned with the following problem.

Problem P(Y): Let $Y$ be a partially observed user-item rating matrix and $\Omega$ the set of observed entries. For each $(i,j) \in \Omega$, the entry $y_{ij} \in \{-1, +1\}$ defines the preference of the $i$th user for the $j$th item, with $+1$ for likes and $-1$ for dislikes. For each $(i,j) \notin \Omega$, the preference of the $i$th user for the $j$th item is not available. Given a partially observed rating matrix $Y$, the goal is to infer $y_{ij}$ for $(i,j) \notin \Omega$.

Matrix factorization is one of the major techniques employed for matrix completion problems. Accordingly, the above problem can be rephrased using latent factors. Given a partially observed rating matrix $Y$ and the observed preference set $\Omega$, matrix factorization aims at determining two low-rank (or low-norm) matrices $U$ and $V$ such that $Y \approx UV^T$. The row vectors $U_i$ and $V_j$ are the low-dimensional representations of the users and the items, respectively. A common formulation of P(Y) is to find $U$ and $V$ as the solution of the following optimization problem.

$$\min_{U,V} J(U,V) = \sum_{(i,j)\in\Omega} \ell(y_{ij}, U_i, V_j) + \lambda R(U,V) \tag{3.1}$$

where $\ell(\cdot)$ is a loss function that measures how well $U_iV_j^T$ approximates $y_{ij}$, and $R(U,V)$ is a regularization function. The idea of this formulation is to alleviate the problem of outliers through a robust loss function and, at the same time, to avoid overfitting and make the objective smooth through the regularization function.

A number of approaches can be used to optimize the objective function (3.1). Gradient descent and its variants start with random $U$ and $V$ and iteratively update $U$ and $V$ using Equations (3.2) and (3.3), respectively.

$$U_{ip}^{t+1} = U_{ip}^{t} - c\,\frac{\partial J}{\partial U_{ip}^{t}} \tag{3.2}$$

$$V_{jq}^{t+1} = V_{jq}^{t} - c\,\frac{\partial J}{\partial V_{jq}^{t}} \tag{3.3}$$

where $c$ is the step length parameter and the superscripts $t$ and $t+1$ indicate the current and updated values, respectively.

The step-wise description of the process is given as Pseudo-code in Algorithm 1.

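Once $U$ and $V$ are computed by Algorithm 1, the matrix completion process is accomplished from the factor matrices as follows.

$$\hat{y}_{ij} = \begin{cases} -1, & \text{if } (i,j)\notin\Omega \,\wedge\, U_iV_j^T < \theta;\\ +1, & \text{if } (i,j)\notin\Omega \,\wedge\, U_iV_j^T \geq \theta;\\ y_{ij}, & \text{if } (i,j)\in\Omega, \end{cases} \tag{3.4}$$

where $\theta$ is the user-specified threshold value.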
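The completion rule in Equation (3.4) can be sketched in a few lines of NumPy. This is only an illustration: the function name `complete_bilevel`, the convention that 0 marks a missing entry, and the toy matrices are assumptions and are not taken from the thesis's tables.

```python
import numpy as np

def complete_bilevel(Y, observed, U, V, theta=0.0):
    """Fill unobserved entries of a +/-1 matrix following Eq. (3.4)."""
    scores = U @ V.T                              # real-valued predictions U_i V_j^T
    predicted = np.where(scores >= theta, 1, -1)  # threshold at theta
    return np.where(observed, Y, predicted)       # observed entries are kept as-is

# Hypothetical toy data; 0 is assumed to mark a missing entry.
Y = np.array([[1, 0, -1],
              [0, -1, 1]])
observed = Y != 0
U = np.array([[1.0, 0.5], [-0.5, 1.0]])                # one row per user
V = np.array([[1.0, 0.0], [-1.0, -1.0], [0.0, 1.0]])   # one row per item
Y_hat = complete_bilevel(Y, observed, U, V)
print(Y_hat)
```

Only the two missing cells are filled from the sign of $U_iV_j^T - \theta$; every observed rating passes through unchanged.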

The latent factor $V_j$ of each item $j$ can be viewed as a point in the latent space, and the latent factor $U_i$ of user $i$ can be viewed as a decision hyperplane in this space. The objective of bi-level matrix factorization is to learn the embeddings of these points and hyperplanes in the latent space such that each hyperplane (corresponding to a user) separates (or, equivalently, classifies) the items by the likes and dislikes of that user. Let us consider the following partially observed bi-level matrix (Table 3.1) for illustration. The unobserved entries are marked explicitly in the table.

Let us assume that the following $U$ and $V$ are the latent factor matrices, with a two-dimensional latent space.

The same is depicted graphically in Figure 3.1. are depicted as points. The hyperplanes with threshold are depicted as lines. An arrow is shown to indicate the side of each line. In other words, if a point falls this side, the preference is and the other side preference is . Entry is predicted based on the position of with respect to . falls to the positive side of line corresponding to and hence, the entry is predicted as . The latent factor and are obtained by a learning process making use of the generic algorithm (Algorithm 1) with the observed entries as the training set. The objective of this learning process is to minimize the loss due to discrepancy between computed entries and the observed entries. In the example (Figure 3.1) point and are in the side of and in the negative side of . These points are located at different distance.

There are many matrix factorization algorithms for bi-level matrices that adopt the generic procedure described in Algorithm 1. These algorithms differ in the specification of the loss function and the regularization function. We adopt here a maximum-margin formulation as the guiding principle.

Loss functions in matrix factorization models are used to measure the discrepancy between the observed entry and the corresponding prediction. Furthermore, especially when predicting discrete values such as ratings, loss functions other than the sum-squared loss are often more appropriate [100]. The trade-off between generalization loss and empirical loss has been a central concern since the advent of the support vector machine (SVM). The maximum-margin approach aims at providing higher generalization ability and avoiding overfitting. In this context, the hinge loss is the most widely preferred loss function and is defined as follows.

$$h(z) = \begin{cases} 0, & \text{if } z \geq 1;\\ 1 - z, & \text{otherwise}, \end{cases} \tag{3.5}$$

where $z = y_{ij}(U_iV_j^T)$. The hinge loss is illustrated in Figure 3.2.

The following real-valued prediction matrix is obtained from the latent factor matrices $U$ and $V$ (Table 3.4) corresponding to the matrix $Y$ (Table 3.1).

The hinge loss values corresponding to the observed entries in $Y$ (Table 3.1) and the real-valued predictions (Table 3.5) are shown in Table 3.6.

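The hinge loss is non-differentiable at $z = 1$ and is very sensitive to outliers, as mentioned in [100]. Therefore, an alternative called the smooth hinge loss is proposed in [101] and is defined as follows.

$$h(z) = \begin{cases} 0, & \text{if } z \geq 1;\\ \frac{1}{2}(1 - z)^2, & \text{if } 0 < z < 1;\\ \frac{1}{2} - z, & \text{otherwise}. \end{cases} \tag{3.6}$$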

The smooth hinge loss is illustrated in Figure 3.3.
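For readers who want to evaluate the two losses numerically, the following NumPy sketch implements the hinge loss of Equation (3.5) and the standard smooth hinge (zero for $z \geq 1$, quadratic on $0 < z < 1$, linear with slope $-1$ for $z \leq 0$). The function names are illustrative choices, not names used in the thesis.

```python
import numpy as np

def hinge(z):
    """Hinge loss: 0 for z >= 1, and 1 - z otherwise."""
    z = np.asarray(z, dtype=float)
    return np.maximum(0.0, 1.0 - z)

def smooth_hinge(z):
    """Smooth hinge: 0 for z >= 1, (1 - z)^2 / 2 on (0, 1), 1/2 - z for z <= 0."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 1.0, 0.0,
                    np.where(z > 0.0, 0.5 * (1.0 - z) ** 2, 0.5 - z))

# Compare the two losses on a few margins.
z = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
print(hinge(z))         # hinge values
print(smooth_hinge(z))  # smooth hinge values
```

The two functions agree for $z \geq 1$ and differ by exactly $\frac{1}{2}$ for $z \leq 0$, so they have the same slope on the misclassified side while the smooth variant removes the kink at $z = 1$.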

The smooth hinge loss values corresponding to the observed entries in $Y$ (Table 3.1) and the real-valued predictions (Table 3.5) are shown in Table 3.7.

Figure 3.2 and 3.3 show the loss function values for the hinge and smooth hinge loss, respectively. It can be seen that the smooth hinge loss shares important similarities to the hinge loss and has a smooth transition between a zero slope and a constant negative slope. Table 3.6 and  3.7 show the hinge loss and smooth hinge loss function values corresponding to the observed entries in (Table 3.1). It can also be seen that the smooth hinge is less sensitive to the outliers as compared to the hinge loss function. For example, the rating given by user for item is positive and the same reflect in the embedding (Figure 3.1). The loss incurred by hinge and smooth hinge are and , respectively. Even though the point is embedded with margin the hinge loss function gives more reward to the model for the increase in objective value.

We reformulate the problem P(Y) for the bi-level rating matrix as the following optimization problem.

$$\min_{U,V} J(U,V) = \sum_{(i,j)\in\Omega} h\big(y_{ij}(U_iV_j^T)\big) + \frac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right) \tag{3.7}$$

where $\|\cdot\|_F$ is the Frobenius norm as defined in an earlier chapter, $\lambda$ is the regularization parameter and $h(\cdot)$ is the smooth hinge loss function as defined previously.

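The gradients of the variables to be optimized are determined as follows. The gradient with respect to each element of $U$ is calculated as

$$\begin{aligned} \frac{\partial J}{\partial U_{ip}} &= \sum_{j|(i,j)\in\Omega} \frac{\partial h\big(y_{ij}(U_iV_j^T)\big)}{\partial U_{ip}} + \frac{\lambda}{2}\left(\frac{\partial \|U\|_F^2}{\partial U_{ip}} + \frac{\partial \|V\|_F^2}{\partial U_{ip}}\right)\\ &= \sum_{j|(i,j)\in\Omega} y_{ij}\, h'\big(y_{ij}(U_iV_j^T)\big)\, \frac{\partial (U_iV_j^T)}{\partial U_{ip}} + \lambda U_{ip}\\ &= \sum_{j|(i,j)\in\Omega} y_{ij}\, h'\big(y_{ij}(U_iV_j^T)\big)\, V_{jp} + \lambda U_{ip} \end{aligned} \tag{3.8}$$

Similarly, the gradient with respect to each element of $V$ is calculated as follows.

$$\frac{\partial J}{\partial V_{jq}} = \sum_{i|(i,j)\in\Omega} y_{ij}\, h'\big(y_{ij}(U_iV_j^T)\big)\, U_{iq} + \lambda V_{jq} \tag{3.9}$$

where $h'(z)$ is defined as follows.

$$h'(z) = \begin{cases} 0, & \text{if } z \geq 1;\\ z - 1, & \text{if } 0 < z < 1;\\ -1, & \text{otherwise}. \end{cases} \tag{3.10}$$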

Algorithm 2 outlines the main flow of the bi-level maximum margin matrix factorization (BMMMF).
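A minimal NumPy sketch of batch gradient descent on objective (3.7), using the gradients (3.8) and (3.9), could look as follows. It is not the thesis's Algorithm 2: the function name `bmmmf`, the initialization scale, the fixed step size, the iteration count, the convention that 0 marks a missing entry, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np

def smooth_hinge(z):
    return np.where(z >= 1.0, 0.0,
                    np.where(z > 0.0, 0.5 * (1.0 - z) ** 2, 0.5 - z))

def smooth_hinge_grad(z):
    # Derivative of the smooth hinge: 0, z - 1, or -1 depending on the region.
    return np.where(z >= 1.0, 0.0, np.where(z > 0.0, z - 1.0, -1.0))

def bmmmf(Y, observed, d=2, lam=0.1, step=0.05, iters=500, seed=0):
    """Batch gradient descent on Eq. (3.7); returns U, V and the objective trace."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((n, d))   # assumed small random initialization
    V = 0.1 * rng.standard_normal((m, d))
    history = []
    for _ in range(iters):
        Z = Y * (U @ V.T)                    # margins y_ij * U_i V_j^T
        J = smooth_hinge(Z)[observed].sum() \
            + 0.5 * lam * ((U ** 2).sum() + (V ** 2).sum())
        history.append(float(J))
        # y_ij * h'(z_ij) on observed cells, zero elsewhere
        G = np.where(observed, Y * smooth_hinge_grad(Z), 0.0)
        grad_U = G @ V + lam * U             # Eq. (3.8)
        grad_V = G.T @ U + lam * V           # Eq. (3.9)
        U, V = U - step * grad_U, V - step * grad_V
    return U, V, history

# Hypothetical 3x3 bi-level matrix; 0 is assumed to mark a missing entry.
Y = np.array([[1.0, -1.0, 0.0],
              [1.0, 0.0, -1.0],
              [0.0, -1.0, 1.0]])
observed = Y != 0
U, V, history = bmmmf(Y, observed)
print(history[0], history[-1])
```

Both factor matrices are updated simultaneously from gradients evaluated at the current iterate, which matches the update rules (3.2) and (3.3); an alternating or stochastic variant would be an equally reasonable design.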

We also show the behaviour of $J$ (Equation 3.7). We plot the value of $J$ obtained in every iteration for the same $Y$ (Table 3.1), starting from different initial points. It can be seen from Figure 3.4 that $J$ converges asymptotically.

3.3 Multi-level MMMF

As discussed in the previous chapter, MMMF [110] and, subsequently, Fast MMMF [100] were proposed primarily for collaborative filtering with an ordinal rating matrix, where user preferences are not in the form of like/dislike but are values in a discrete range. The matrix completion of an ordinal rating matrix is concerned with the following problem.

Problem P(Y): Let $Y$ be a partially observed user-item rating matrix and $\Omega$ the set of observed entries. For each $(i,j) \in \Omega$, the entry $y_{ij} \in \{1, 2, \ldots, R\}$ defines the preference of the $i$th user for the $j$th item. For each $(i,j) \notin \Omega$, the preference of the $i$th user for the $j$th item is not available. Given a partially observed rating matrix $Y$, the goal is to predict $y_{ij}$ for $(i,j) \notin \Omega$.

Need for multiple thresholds: Unlike the bi-level problem of Section 3.2, the ordinal problem has a domain of $y_{ij}$ with more than two values. When the domain has only two values, the ordinal problem is equivalent to the bi-level problem. Continuing our discussion of the geometrical interpretation of the bi-level problem, the likes and dislikes of user $i$ lie on opposite sides of the hyperplane defined by $U_i$. When the same concept is extended to the ordinal problem, it is necessary to define threshold values such that the region between two parallel hyperplanes, defined by the same $U_i$ with different thresholds $\theta_{i,r-1}$ and $\theta_{i,r}$, corresponds to the rating $r$. Thus, in the ordinal problem, in addition to learning the latent factor matrices $U$ and $V$, the thresholds $\theta_{i,r}$ also need to be learned. There may be a debate on the number of thresholds needed, but following the original proposal of MMMF [100], we use $R-1$ thresholds for each user, so the total number of thresholds is $R-1$ times the number of users.

There are different approaches for constructing a loss function based on a set of thresholds, such as immediate-threshold and all-threshold [101]. For each observed entry and corresponding prediction pair $(y_{ij}, U_iV_j^T)$, the immediate-threshold approach calculates the loss as the sum of the immediate-threshold violations, which is $\ell(U_iV_j^T - \theta_{i,y_{ij}-1}) + \ell(\theta_{i,y_{ij}} - U_iV_j^T)$. The all-threshold loss, on the other hand, is the sum of the losses over all thresholds, that is, the cost of crossing multiple rating boundaries, and is defined as follows.

$$\ell(U_iV_j^T, y_{ij}) = \sum_{r=1}^{y_{ij}-1} \ell(U_iV_j^T - \theta_{i,r}) + \sum_{r=y_{ij}}^{R-1} \ell(\theta_{i,r} - U_iV_j^T) \tag{3.11}$$

Using the thresholds ’s, MMMF extended the hinge loss function meant for binary classification to ordinal settings. The difference between immediate-threshold and all-threshold hinge is illustrated with the help of the following example. Let us consider the partially observed ordinal rating matrix with , the learnt factor matrices , and the set of thresholds ’s for each user (Table 3.12).

The following real-valued prediction matrix is obtained from the above latent factor matrices and corresponding to the matrix .

One can see that the entry $y_{13}$ is observed as $4$ and the corresponding real-valued prediction $U_1V_3^T$ is $0.37$. When the immediate-threshold hinge is the loss measure, the overall loss is calculated as follows.

$$\begin{aligned} h(y_{13}, U_1V_3^T) &= h(U_1V_3^T - \theta_{1,3}) + h(\theta_{1,4} - U_1V_3^T)\\ &= h(0.37 - 0.51) + h(1.21 - 0.37)\\ &= h(-0.14) + h(0.84)\\ &\approx 0.65 \end{aligned}$$

where $h(\cdot)$ is the smooth hinge loss function as defined previously. For the same example, the overall loss with the all-threshold hinge function is calculated as follows.

$$\begin{aligned} h(y_{13}, U_1V_3^T) &= h(U_1V_3^T - \theta_{1,1}) + h(U_1V_3^T - \theta_{1,2}) + h(U_1V_3^T - \theta_{1,3}) + h(\theta_{1,4} - U_1V_3^T)\\ &= h(0.37 + 0.61) + h(0.37 + 0.18) + h(0.37 - 0.51) + h(1.21 - 0.37)\\ &= h(0.98) + h(0.55) + h(-0.14) + h(0.84)\\ &\approx 0.75 \end{aligned}$$
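The two worked calculations can be checked numerically. The sketch below uses the prediction $0.37$, the observed rating $4$, and the four thresholds $-0.61, -0.18, 0.51, 1.21$ that appear in the calculations above, and assumes the standard smooth hinge; the function names and the 0-based storage of the thresholds are illustrative assumptions.

```python
import numpy as np

def smooth_hinge(z):
    z = np.asarray(z, dtype=float)
    return np.where(z >= 1.0, 0.0,
                    np.where(z > 0.0, 0.5 * (1.0 - z) ** 2, 0.5 - z))

def immediate_threshold_loss(x, y, theta):
    """Penalize only the two thresholds bounding rating y; theta[r-1] stores theta_r."""
    loss = 0.0
    if y > 1:                       # lower boundary theta_{y-1} exists
        loss += smooth_hinge(x - theta[y - 2])
    if y <= len(theta):             # upper boundary theta_y exists
        loss += smooth_hinge(theta[y - 1] - x)
    return float(loss)

def all_threshold_loss(x, y, theta):
    """Eq. (3.11): penalize every threshold, with the sign set by its side of y."""
    r = np.arange(1, len(theta) + 1)
    T = np.where(r >= y, 1.0, -1.0)
    return float(smooth_hinge(T * (theta - x)).sum())

theta_1 = np.array([-0.61, -0.18, 0.51, 1.21])   # thresholds of user 1
x, y = 0.37, 4                                    # prediction and observed rating
imm = immediate_threshold_loss(x, y, theta_1)
full = all_threshold_loss(x, y, theta_1)
print(round(imm, 2), round(full, 2))
```

The unrounded values are $0.6528$ and $0.75425$, so the extra $0.10$ in the all-threshold loss comes almost entirely from the margin violation at $\theta_{1,2}$.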

Continuing our discussion of the geometrical interpretation, the immediate-threshold loss tries to embed a point $V_j$ rated $r$ by user $i$ into the region between the parallel hyperplanes defined by $(U_i, \theta_{i,r-1})$ and $(U_i, \theta_{i,r})$, which means that $U_iV_j^T > \theta_{i,r-1}$ and $U_iV_j^T < \theta_{i,r}$. The all-threshold hinge function not only tries to embed the points rated $r$ into this region but also considers the position of the points with respect to the other hyperplanes. It is desirable that every point rated $r$ by user $i$ should satisfy $U_iV_j^T > \theta_{i,s}$ for $s < r$ and $U_iV_j^T < \theta_{i,s}$ for $s \geq r$.

In MMMF [100], each hyperplane acts as a maximum-margin separator which is ensured by considering smooth hinge as the loss function (all-threshold hinge). The resulting optimization problem for P(Y) is

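$$\min_{U,V,\theta} J(U,V,\theta) = \sum_{(i,j)\in\Omega}\left(\sum_{r=1}^{y_{ij}-1} h(U_iV_j^T - \theta_{i,r}) + \sum_{r=y_{ij}}^{R-1} h(\theta_{i,r} - U_iV_j^T)\right) + \frac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right) \tag{3.12}$$

where $\|\cdot\|_F$ is the Frobenius norm, $\lambda$ is the regularization parameter, $\Omega$ is the set of observed entries, $h(\cdot)$ is the smooth hinge loss as defined previously and $\theta_{i,r}$ is the threshold for rank $r$ of user $i$. The equation given above can be rewritten as follows.

$$\min_{U,V,\theta} J(U,V,\theta) = \sum_{r=1}^{R-1}\sum_{(i,j)\in\Omega} h\big(T_{ij}^{r}(\theta_{i,r} - U_iV_j^T)\big) + \frac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right) \tag{3.13}$$

where $T_{ij}^{r}$ is defined as

$$T_{ij}^{r} = \begin{cases} +1, & \text{if } r \geq y_{ij};\\ -1, & \text{if } r < y_{ij}. \end{cases}$$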
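The equivalence of the two forms (3.12) and (3.13) can be verified numerically on random data. The sketch below is illustrative only: the function names, the array layout (`theta[i, r-1]` holding $\theta_{i,r}$), and the randomly generated inputs are assumptions, and the smooth hinge is taken in its standard form.

```python
import numpy as np

def smooth_hinge(z):
    z = np.asarray(z, dtype=float)
    return np.where(z >= 1.0, 0.0,
                    np.where(z > 0.0, 0.5 * (1.0 - z) ** 2, 0.5 - z))

def objective_3_12(Y, observed, U, V, theta, lam):
    """Direct form: explicit sums below and at-or-above each observed rating."""
    X = U @ V.T
    R = theta.shape[1] + 1
    total = 0.0
    for i, j in zip(*np.nonzero(observed)):
        y = int(Y[i, j])
        for r in range(1, y):                 # thresholds below the rating
            total += float(smooth_hinge(X[i, j] - theta[i, r - 1]))
        for r in range(y, R):                 # thresholds at or above the rating
            total += float(smooth_hinge(theta[i, r - 1] - X[i, j]))
    return total + 0.5 * lam * ((U ** 2).sum() + (V ** 2).sum())

def objective_3_13(Y, observed, U, V, theta, lam):
    """Sign-matrix form: T[i, j, r-1] = +1 if r >= y_ij, else -1."""
    X = U @ V.T
    r = np.arange(1, theta.shape[1] + 1)
    T = np.where(r[None, None, :] >= Y[:, :, None], 1.0, -1.0)
    margins = T * (theta[:, None, :] - X[:, :, None])
    loss = smooth_hinge(margins).sum(axis=2)
    return float(loss[observed].sum()
                 + 0.5 * lam * ((U ** 2).sum() + (V ** 2).sum()))

# Hypothetical random instance with R = 5 rating levels.
rng = np.random.default_rng(0)
n, m, d, R = 4, 5, 2, 5
Y = rng.integers(1, R + 1, size=(n, m))
observed = rng.random((n, m)) < 0.6
U, V = rng.standard_normal((n, d)), rng.standard_normal((m, d))
theta = np.sort(rng.standard_normal((n, R - 1)), axis=1)
a = objective_3_12(Y, observed, U, V, theta, lam=0.1)
b = objective_3_13(Y, observed, U, V, theta, lam=0.1)
print(a, b)
```

The sign-matrix form removes the rating-dependent loop bounds, which is what makes the vectorized evaluation in `objective_3_13` possible.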

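The gradients of the variables to be optimized are determined as follows. The gradient with respect to each element of $U$ is calculated as follows.

$$\begin{aligned} \frac{\partial J}{\partial U_{ip}} &= \sum_{r=1}^{R-1}\sum_{j|(i,j)\in\Omega} \frac{\partial h\big(T_{ij}^{r}(\theta_{i,r} - U_iV_j^T)\big)}{\partial U_{ip}} + \frac{\lambda}{2}\left(\frac{\partial \|U\|_F^2}{\partial U_{ip}} + \frac{\partial \|V\|_F^2}{\partial U_{ip}}\right)\\ &= \sum_{r=1}^{R-1}\sum_{j|(i,j)\in\Omega} T_{ij}^{r}\, h'\big(T_{ij}^{r}(\theta_{i,r} - U_iV_j^T)\big)\, \frac{\partial (\theta_{i,r} - U_iV_j^T)}{\partial U_{ip}} + \lambda U_{ip}\\ &= \lambda U_{ip} - \sum_{r=1}^{R-1}\sum_{j|(i,j)\in\Omega} T_{ij}^{r}\, h'\big(T_{ij}^{r}(\theta_{i,r} - U_iV_j^T)\big)\, V_{jp} \end{aligned} \tag{3.14}$$

where $h'(\cdot)$ is the same as defined previously. Similarly, the gradient with respect to each element of $V$ is calculated as

$$\frac{\partial J}{\partial V_{jq}} = \lambda V_{jq} - \sum_{r=1}^{R-1}\sum_{i|(i,j)\in\Omega} T_{ij}^{r}\, h'\big(T_{ij}^{r}(\theta_{i,r} - U_iV_j^T)\big)\, U_{iq} \tag{3.15}$$

and the gradient with respect to each $\theta_{i,r}$ is determined as follows.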

 ∂J∂θi,r =∑j|(i,j)∈ΩTrijh′(Trij(θi,r−UiVTj))∂(θi,r−UiVTj)∂θi,r =∑j|(i,j)∈ΩTrijh′(Trij(