1.1 Contributions of the Thesis
The major contributions of the thesis are as follows. In the maximum margin matrix factorization scheme (a variant of the basic matrix factorization method), a ratings matrix with multiple discrete values is handled by specially extending the hinge loss function to suit multiple levels. We view this process as analogous to extending a two-class classifier to a unified multiclass classifier. Alternatively, a multiclass classifier can be built by arranging multiple two-class classifiers in a hierarchical manner. We investigate this aspect for collaborative filtering and propose a novel method of constructing a hierarchical bilevel maximum margin matrix factorization to handle matrix completion of an ordinal rating matrix [67].
We observe that there could be several alternative criteria, other than the maximum margin criterion, for formulating the factorization problem of a discrete ordinal rating matrix. Taking a cue from the alternative formulation of support vector machines, we derive a novel loss function by considering proximity, instead of margin maximization, as the criterion for the matrix factorization framework [69].
We extend the concept of matrix factorization to yet another important problem of machine learning, namely multilabel classification, which deals with the classification of data with multiple labels. We propose a novel piecewise-linear embedding method with a low-rank constraint on the parametrization to capture nonlinear intrinsic relationships that exist in the original feature and label space [66].
We study the embedding of labels together with the group information with the objective of building an efficient multilabel classifier. We assume the existence of a low-dimensional space onto which the feature vectors and label vectors can be embedded. We ensure that labels belonging to the same group share the same sparsity pattern in their low-rank representations.
1.2 Structure of the Thesis
The thesis is organized as follows. In Chapter 2, we start with an introductory discussion on matrix factorization and the associated optimization problem formulation. Thereafter we discuss some common loss functions used in matrix factorization to measure the deviation between the observed data and the corresponding approximation. We also discuss several forms of norm regularization, which is needed to avoid overfitting in matrix factorization models. In the later part of the chapter we discuss at length the application of matrix factorization techniques in collaborative filtering and multilabel classification.
Chapter 3 starts with a discussion on bilevel maximum margin matrix factorization (MMMF), which we subsequently use in our proposed algorithm. We carry out a deep investigation of the well-known maximum margin matrix factorization technique for a discrete ordinal rating matrix. This investigation led us to propose a novel and efficient algorithm called HMF (Hierarchical Matrix Factorization) for constructing a hierarchical bilevel maximum margin matrix factorization method to handle matrix completion of an ordinal rating matrix. The advantages of HMF over other matrix factorization based collaborative filtering methods are shown by detailed experimental analysis at the end of the chapter.
Chapter 4 introduces a novel method termed PMMMF (Proximal Maximum Margin Matrix Factorization) for factorization of a matrix with discrete ordinal ratings. Our work is motivated by the notion of Proximal SVMs (PSVMs) [77, 31] for binary classification, where two parallel planes are generated, one for each class, unlike the standard SVMs [121, 10, 87]. Taking the cue from here, we introduce a new loss function based on the proximity criterion instead of the margin maximization criterion in the context of matrix factorization. We validate our hypothesis by conducting experiments on real and synthetic datasets.
Chapter 5 extends the concept of matrix factorization to yet another important problem in machine learning, namely multilabel classification. We view matrix factorization as a kind of low-dimensional embedding of the data, which can be practically relevant when a matrix is viewed as a transformation of data from one space to the other. At the beginning of the chapter we briefly discuss the traditional approach to multilabel classification and establish a bridge between multilabel classification and matrix factorization. We present a novel multilabel classification method, called MLC-HMF (Multilabel Classification using Hierarchical Embedding), which learns a piecewise-linear embedding with a low-rank constraint on the parametrization to capture nonlinear intrinsic relationships that exist in the original feature and label space. Extensive comparative studies that validate the effectiveness of the proposed method against state-of-the-art multilabel learning approaches are discussed in the experimental section.
In Chapter 6, we study the embedding of labels together with group information with the objective of building an efficient multilabel classifier. We assume the existence of a low-dimensional space onto which the feature vectors and label vectors can be embedded. In order to achieve this, we address three subproblems, namely: (1) identification of groups of labels; (2) embedding of label vectors into a low-rank space so that the sparsity characteristic of individual groups remains invariant; and (3) determining a linear mapping that embeds the feature vectors onto the same set of points, as in stage 2, in the low-rank space. At the end, we perform a comparative analysis which demonstrates the superiority of our proposed method over state-of-the-art algorithms for multilabel learning.
We conclude the thesis with a discussion on future directions in Chapter 7.
1.3 Publications of the Thesis

Vikas Kumar, Arun K Pujari, Sandeep Kumar Sahu, Venkateswara Rao Kagita, Vineet Padmanabhan. "Collaborative Filtering Using Multiple Binary Maximum Margin Matrix Factorizations." Information Sciences 380 (2017): 1-11.

Vikas Kumar, Arun K Pujari, Sandeep Kumar Sahu, Venkateswara Rao Kagita, Vineet Padmanabhan. "Proximal Maximum Margin Matrix Factorization for Collaborative Filtering." Pattern Recognition Letters 86 (2017): 62-67.

Vikas Kumar, Arun K Pujari, Vineet Padmanabhan, Sandeep Kumar Sahu, Venkateswara Rao Kagita. "Multilabel Classification Using Hierarchical Embedding." Expert Systems with Applications 91 (2018): 263-269.

Vikas Kumar, Arun K Pujari, Vineet Padmanabhan, Venkateswara Rao Kagita. "Group Preserving Label Embedding for Multi-Label Classification." Pattern Recognition, Under Review.
2.1 Matrix Factorization
Matrix factorization is a key technique employed in many real-world applications, wherein the objective is to select factor matrices $U$ and $V$ among all the matrices that fit the given data, with the lowest-rank ones being preferred. In most applications, the task is not just to compute any factorization, but also to enforce additional constraints on the factors $U$ and $V$. The matrix factorization problem can be formally defined as follows:
Definition 1 (Matrix Factorization).
Given a data matrix $Y \in \mathbb{R}^{N \times M}$, matrix factorization aims at determining two matrices $U \in \mathbb{R}^{N \times d}$ and $V \in \mathbb{R}^{M \times d}$ such that $Y \approx UV^T$, where the inner dimension $d$ is called the numerical rank of the matrix. The numerical rank is much smaller than $N$ and $M$, and hence, factorization allows the matrix to be stored inexpensively. $X = UV^T$ is called a low-rank approximation of $Y$.
A common formulation for this problem is a regularized loss problem which can be given as follows.
(2.1) $\min_{X} \; \ell(Y, X) + \lambda R(X)$
where $Y$ is the data matrix, $X$ is its low-rank approximation, $\ell$ is a loss function that measures how well $X$ approximates $Y$, $R$ is a regularization function that promotes various desired properties in $X$ (low rank, sparsity, group sparsity, etc.), and $\lambda$ is the regularization parameter. When $\ell$ and $R$ are convex functions of $X$ (equivalently, of $U$ and $V$), the above problem can be solved efficiently. The loss function can further be decomposed into the sum of pairwise discrepancies between the observed entries and their corresponding predictions, that is, $\ell(Y, X) = \sum_{(i,j) \in \Omega} \ell(y_{ij}, U_i V_j^T)$, where $\Omega$ is the set of observed entries, $y_{ij}$ is the $ij$th entry of the matrix $Y$, $U_i$, the $i$th row of $U$, represents the latent feature vector of the $i$th user, and $V_j$, the $j$th row of $V$, represents the latent feature vector of the $j$th item. There have been numerous proposals for factorizing a matrix, and these differ, by and large, in how they define the loss function and the regularization function. The structure of the latent factors depends on the types of loss (cost) functions and the constraints imposed on the factor matrices. In the following subsections, we present a brief literature review of the major loss functions and regularizations.
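The regularized-loss formulation above can be made concrete with a small sketch. The code below is an illustration only, assuming a squared loss over the observed entries and a squared Frobenius penalty on the factors; the function name `regularized_objective` and the argument names are hypothetical and not taken from the thesis.

```python
def regularized_objective(Y, U, V, observed, lam):
    """Loss over observed entries plus a weighted regularizer on the factors.

    Assumptions (illustrative): squared loss, squared Frobenius penalty.
    Y is a list of rows, U is N x d, V is M x d, observed is a list of (i, j).
    """
    loss = 0.0
    for i, j in observed:
        # Prediction x_ij is the dot product of user row U_i and item row V_j.
        x_ij = sum(u * v for u, v in zip(U[i], V[j]))
        loss += (Y[i][j] - x_ij) ** 2
    # Squared Frobenius norms of both factor matrices.
    reg = sum(u * u for row in U for u in row) + sum(v * v for row in V for v in row)
    return loss + lam * reg


Y = [[1.0, 0.0], [0.0, 2.0]]
U = [[1.0], [2.0]]
V = [[1.0], [1.0]]
value = regularized_objective(Y, U, V, observed=[(0, 0), (1, 1)], lam=0.5)
```

In this toy case both observed entries are fitted exactly, so the objective value comes entirely from the regularization term.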
2.2 Loss Function
The loss function in an MF model is used to measure the discrepancy between the observed entry and the corresponding prediction. Based on the type of observed data, the loss functions in MF can be roughly grouped into three categories: (1) loss functions for binary data; (2) loss functions for discrete ordinal data; and (3) loss functions for real-valued data.
2.2.1 Binary Loss Function:
For a binary data matrix where the entries are restricted to take only two values, $y_{ij} \in \{-1, +1\}$, we review some of the important loss functions such as zero-one loss, hinge loss, smooth hinge loss, modified least square and logistic loss [101] in the following subsections.
Zero-One Loss: The zero-one loss is a standard loss function to measure the penalty for an incorrect prediction, and it takes only two values, zero or one. For a given observed entry $y_{ij}$ and the corresponding prediction $x_{ij}$, zero-one loss is defined as
(2.2) $\ell(y_{ij}, x_{ij}) = \begin{cases} 0 & \text{if } y_{ij} x_{ij} > 0 \\ 1 & \text{otherwise} \end{cases}$
where $x_{ij} = U_i V_j^T$.
Hinge Loss: There are two major drawbacks with zero-one loss: (1) an objective function involving zero-one loss is difficult to optimize as the function is nonconvex; (2) it is insensitive to the magnitude of the prediction, whereas in general, when the entries are binary, the magnitude of the score represents the confidence in our prediction. The hinge loss is a convex upper bound on zero-one error [34] and is sensitive to the magnitude of the prediction. In the case of classification with large confidence (margin), hinge loss is the most preferred loss function and is defined as
(2.3) $\ell(y_{ij}, x_{ij}) = \max(0, 1 - y_{ij} x_{ij})$
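Smooth Hinge Loss: Hinge loss is nondifferentiable at $y_{ij} x_{ij} = 1$ and very sensitive to outliers [100]. An alternative to hinge loss, called smooth hinge, is proposed in [101] and is defined, with $z = y_{ij} x_{ij}$, as
(2.4) $\ell(y_{ij}, x_{ij}) = \begin{cases} \frac{1}{2} - z & \text{if } z \le 0 \\ \frac{1}{2}(1 - z)^2 & \text{if } 0 < z < 1 \\ 0 & \text{if } z \ge 1 \end{cases}$
Smooth Hinge shares many properties with hinge loss and is insensitive to outliers. A detailed discussion about hinge and smooth hinge loss is given in Chapter 3.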
Modified Square Loss: In [153], the hinge function is replaced by a smoother quadratic function to make the derivative smooth. The modified square loss is defined as
(2.5) $\ell(y_{ij}, x_{ij}) = \max(0, 1 - y_{ij} x_{ij})^2$
Logistic Loss: The logistic loss function is a strictly convex function which enjoys properties similar to those of the hinge loss [86]. The logistic loss is defined as
(2.6) $\ell(y_{ij}, x_{ij}) = \log\left(1 + \exp(-y_{ij} x_{ij})\right)$
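To compare the binary losses reviewed above side by side, the following sketch evaluates each of them as a function of the margin $z = y_{ij} x_{ij}$. It assumes the usual textbook forms of these losses; the function names are illustrative and not taken from the thesis.

```python
import math


def zero_one(z):
    # 0 when the prediction has the same sign as the label, 1 otherwise.
    return 0.0 if z > 0 else 1.0


def hinge(z):
    # Convex upper bound on the zero-one loss; kink at z = 1.
    return max(0.0, 1.0 - z)


def smooth_hinge(z):
    # Linear for z <= 0, quadratic on (0, 1), zero for z >= 1.
    if z >= 1.0:
        return 0.0
    if z <= 0.0:
        return 0.5 - z
    return 0.5 * (1.0 - z) ** 2


def modified_square(z):
    # Squared hinge: differentiable everywhere.
    return max(0.0, 1.0 - z) ** 2


def logistic(z):
    # log(1 + exp(-z)), computed with log1p for accuracy near zero.
    return math.log1p(math.exp(-z))


margins = [-1.0, 0.0, 0.5, 1.0, 2.0]
table = {f.__name__: [f(z) for z in margins]
         for f in (zero_one, hinge, smooth_hinge, modified_square, logistic)}
```

All four surrogate losses decrease as the margin grows, while only the zero-one loss ignores how far the prediction is from the decision boundary.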
2.2.2 Discrete Ordinal Loss Function
For a discrete ordinal matrix, where entries are no longer just like and dislike but can be any value from a discrete range, that is, $y_{ij} \in \{1, 2, \dots, R\}$, there is a need to extend the binary class loss function to a multiclass loss function. There are two major approaches for extending a loss function for binary class classification to multiclass classification. The first approach is to directly solve a multiclass problem by modifying the binary class objective function, adding a constraint to it for every class, as suggested in [101, 143]. The second approach is to decompose the multiclass classification problem into a series of independent binary class classification problems, as given in [11].
To extend a binary loss function to the multiclass setting, most of the approaches define a set of threshold values $\theta_1 \le \theta_2 \le \dots \le \theta_{R-1}$ such that the real line is divided into $R$ regions. The region defined by thresholds $\theta_{r-1}$ and $\theta_r$ corresponds to rating $r$ [101]. For simplicity of notation, we assume $\theta_0 = -\infty$ and $\theta_R = +\infty$. There are different approaches for constructing a loss function based on a set of threshold values, such as immediate-threshold and all-threshold [101]. For each observed entry and corresponding prediction pair $(y_{ij}, x_{ij})$, the immediate-threshold based approach calculates the loss as the sum of immediate-threshold violations, where $h(\cdot)$ denotes a binary margin loss such as the hinge loss.
(2.7) $\ell(y_{ij}, x_{ij}) = h(x_{ij} - \theta_{y_{ij}-1}) + h(\theta_{y_{ij}} - x_{ij})$
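On the other hand, the all-threshold loss is calculated as the sum of the losses over all thresholds, which is the cost of crossing multiple rating boundaries. With $h(\cdot)$ denoting a binary margin loss such as the hinge loss, and $\theta_1, \dots, \theta_{R-1}$ the thresholds, it is defined as
(2.8) $\ell(y_{ij}, x_{ij}) = \sum_{r=1}^{y_{ij}-1} h(x_{ij} - \theta_r) + \sum_{r=y_{ij}}^{R-1} h(\theta_r - x_{ij})$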
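The difference between the two threshold-based constructions is easiest to see numerically. The sketch below assumes ratings in $\{1, \dots, R\}$, a list of $R-1$ increasing thresholds, and the hinge loss as the underlying binary loss; the function names and the sample thresholds are illustrative assumptions.

```python
def hinge(z):
    return max(0.0, 1.0 - z)


def immediate_threshold_loss(y, x, thetas, h=hinge):
    """Penalize violations of the two thresholds that bound rating y.

    thetas[r - 1] holds theta_r for r = 1..R-1; theta_0 and theta_R are
    treated as -inf and +inf, so their terms vanish.
    """
    R = len(thetas) + 1
    loss = 0.0
    if y > 1:
        loss += h(x - thetas[y - 2])   # lower boundary theta_{y-1}
    if y < R:
        loss += h(thetas[y - 1] - x)   # upper boundary theta_y
    return loss


def all_threshold_loss(y, x, thetas, h=hinge):
    """Penalize violations of every threshold, not only the adjacent ones."""
    loss = 0.0
    for r, theta in enumerate(thetas, start=1):
        # Thresholds below the rating should lie under x, the rest above x.
        loss += h(x - theta) if r < y else h(theta - x)
    return loss


thetas = [1.0, 2.0, 3.0, 4.0]          # R = 5 ratings (assumed values)
near_miss = (immediate_threshold_loss(3, 2.5, thetas),
             all_threshold_loss(3, 2.5, thetas))
far_miss = (immediate_threshold_loss(1, 3.5, thetas),
            all_threshold_loss(1, 3.5, thetas))
```

When the prediction lies inside the correct region the two losses agree, but a prediction several regions away from the true rating is penalized once per crossed boundary by the all-threshold loss.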
2.2.3 Real-valued Loss Function
For a real-valued data matrix, where the entries are in the real-valued domain $\mathbb{R}$, we review some of the important loss functions such as square loss [93], KL-divergence [73], $\beta$-divergence [17] and Itakura-Saito divergence [29] loss. These are given in the following subsections.
Squared Loss: The square loss is the most common loss function used for real-valued prediction. The penalty for an inaccurate prediction is calculated as the squared distance between the observed entry and the corresponding prediction. The square loss is defined as
(2.9) $\ell(y_{ij}, x_{ij}) = (y_{ij} - x_{ij})^2$
where $x_{ij} = U_i V_j^T$.
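Kullback-Leibler Divergence Loss: The KL-divergence loss, also known as I-divergence loss, is the measure of information loss when $x_{ij}$ is used as an approximation to $y_{ij}$. The KL-divergence loss is defined as
(2.10) $\ell(y_{ij}, x_{ij}) = y_{ij} \ln \frac{y_{ij}}{x_{ij}} - y_{ij} + x_{ij}$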
Itakura-Saito Divergence: The Itakura-Saito divergence [29] is obtained from maximum likelihood (ML) estimation. The loss function is defined as
(2.11) $\ell(y_{ij}, x_{ij}) = \frac{y_{ij}}{x_{ij}} - \ln \frac{y_{ij}}{x_{ij}} - 1$
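$\beta$-Divergence Loss: Cichocki et al. [17] proposed a generalized family of loss functions called $\beta$-divergence, defined as
(2.12) $\ell(y_{ij}, x_{ij}) = y_{ij} \, \frac{y_{ij}^{\beta} - x_{ij}^{\beta}}{\beta} - \frac{y_{ij}^{\beta+1} - x_{ij}^{\beta+1}}{\beta + 1}$
where $\beta$ is a generalization parameter. In the limiting cases $\beta \to -1$ and $\beta \to 0$, the $\beta$-divergence corresponds to the Itakura-Saito divergence and the Kullback-Leibler divergence, respectively. The squared loss can be obtained, up to a constant factor, as a special case at $\beta = 1$.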
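The relationship between the generalized divergence and its special cases can be checked numerically. The sketch below assumes one common parametrization of the $\beta$-divergence, in which $\beta = 1$ gives half the squared error and the limits $\beta \to 0$ and $\beta \to -1$ give the KL and Itakura-Saito divergences; the thesis may use a different but equivalent parametrization, and the function names are illustrative.

```python
import math


def kl_divergence(y, x):
    # Generalized KL (I-divergence) for positive scalars.
    return y * math.log(y / x) - y + x


def itakura_saito(y, x):
    return y / x - math.log(y / x) - 1.0


def beta_divergence(y, x, beta):
    """Beta-divergence for positive scalars y (observed) and x (predicted)."""
    if beta == 0:
        return kl_divergence(y, x)      # limiting case beta -> 0
    if beta == -1:
        return itakura_saito(y, x)      # limiting case beta -> -1
    return (y * (y ** beta - x ** beta) / beta
            - (y ** (beta + 1) - x ** (beta + 1)) / (beta + 1))


y, x = 3.0, 2.0
at_one = beta_divergence(y, x, 1.0)            # half the squared error
near_zero = beta_divergence(y, x, 1e-6)        # close to KL
near_minus_one = beta_divergence(y, x, -1.0 + 1e-6)  # close to Itakura-Saito
```

Evaluating the general formula slightly away from the two singular values of $\beta$ shows that it approaches the closed-form limits smoothly.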
2.3 Regularization
Most machine learning algorithms suffer from the problem of overfitting, where the model fits the training data too well but has poor generalization capability for new data. Thus, a proper technique should be adopted to avoid overfitting of the training data. In the MF community, to avoid a model that fits only a particular dataset, many researchers have tried different regularization terms along with some loss function [63, 94, 113, 65]. Regularization is used not only to prevent overfitting but also to achieve different structural representations of the latent factors. Several methods based on norm regularization have been proposed in the literature, including the $\ell_1$, $\ell_2$ and $\ell_{2,1}$ norms.
$\ell_1$ Norm: The $\ell_1$ norm, also known as the sparsity-inducing norm, is used to produce sparse latent factors and thus avoids overfitting by retaining only the useful factors. In effect, this implies that most units take values close to zero while only a few take significantly nonzero values. For a given matrix $X$, the $\ell_1$ norm is defined as
(2.13) $\|X\|_1 = \sum_{i} \sum_{j} |x_{ij}|$
$\ell_2$ Norm: The most popular and widely investigated regularization term used in MF models is the $\ell_2$ norm, which is also known as the Frobenius norm. For a given matrix $X$, the $\ell_2$ norm is defined as
(2.14) $\|X\|_F = \sqrt{\sum_{i} \sum_{j} x_{ij}^2}$
where $\|\cdot\|_F$ is the Frobenius norm of a matrix. Minimizing an objective function containing the $\ell_2$ norm as the regularization term gives two benefits: (1) it avoids overfitting by penalizing large latent factor values; (2) approximating the target matrix with low-rank factor matrices is a typical nonconvex optimization problem, and Frobenius-norm regularization of the factor matrices corresponds to the trace norm of their product, which has been suggested as a convex surrogate for the rank in control applications [28, 100].
$\ell_{2,1}$ Norm: The $\ell_{2,1}$ norm, also known as the group sparsity norm, is used to induce a sparse representation at the level of groups. For a given matrix $X$, the $\ell_{2,1}$ norm is defined as
(2.15) $\|X\|_{2,1} = \sum_{i} \sqrt{\sum_{j} x_{ij}^2}$
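A short numerical sketch helps contrast the three regularizers. It assumes the entrywise $\ell_1$ norm, the Frobenius norm, and an $\ell_{2,1}$ norm that treats each row as one group; grouping by rows is an assumption made here for illustration, as are the function names.

```python
import math


def l1_norm(X):
    # Sum of absolute values of all entries; favors entrywise sparsity.
    return sum(abs(x) for row in X for x in row)


def frobenius_norm(X):
    # Square root of the sum of squared entries; penalizes large values.
    return math.sqrt(sum(x * x for row in X for x in row))


def l21_norm(X):
    # Sum of Euclidean norms of the rows; favors entire rows being zero.
    return sum(math.sqrt(sum(x * x for x in row)) for row in X)


X = [[3.0, -4.0],
     [0.0, 0.0]]
norms = (l1_norm(X), frobenius_norm(X), l21_norm(X))
```

For a matrix with a single nonzero row the Frobenius and $\ell_{2,1}$ norms coincide; they differ once several rows are active, because the $\ell_{2,1}$ norm adds row norms without squaring them.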
2.4 Collaborative filtering with Matrix Factorization
In collaborative filtering, the goal is to infer a user's preferences for items based on his/her previously given preferences and a large collection of preferences of other users. Given a partially observed user-item rating matrix with $N$ users and $M$ items, the goal is to predict the unobserved preferences of users for items. The collaborative filtering problem can be formally defined as follows:
Definition 2 (Collaborative Filtering).
Let $Y$ be an $N \times M$ user-item rating matrix and $\Omega$ be the set of observed entries. For each $(i,j) \in \Omega$, the entry $y_{ij}$ defines the preference of the $i$th user for the $j$th item. For each $(i,j) \notin \Omega$, $y_{ij} = 0$ indicates that the preference of the $i$th user for the $j$th item is not available (unsampled entry). Given a partially observed rating matrix $Y$, the goal is to predict $y_{ij}$ for $(i,j) \notin \Omega$.
Matrix factorization is a key technique employed for completion of the user-item rating matrix, wherein the objective is to learn low-rank (or low-norm) latent factors $U$ (for users) and $V$ (for items) so as to simultaneously approximate the observed entries under some loss measure and predict the unobserved entries. There are various ways of doing so. In [71], to approximate $Y$, the entries in the factor matrices $U$ and $V$ are required to be nonnegative so that only additive combinations of factors are allowed. The basic idea is to learn factor matrices $U$ and $V$ in such a way that the sum of squared distances between the observed entries and the corresponding predictions is minimized. The optimization problem is formulated as
(2.16) $\min_{U \ge 0, \, V \ge 0} \; \sum_{(i,j) \in \Omega} \left( y_{ij} - U_i V_j^T \right)^2$
Singular value decomposition (SVD) is used in [105] to learn the factor matrices $U$ and $V$. The key technical challenge when SVD is applied to sparse matrices is that it suffers from severe overfitting. When SVD factorization is applied to sparse data, the error function needs to be modified so as to consider only the observed ratings by setting the nonobserved entries to zero. This minor modification results in a nonconvex optimization problem. Instead of minimizing the rank of a matrix, maximum margin matrix factorization (MMMF) [110], proposed by Srebro et al., aims at minimizing the Frobenius norms of $U$ and $V$, resulting in convex optimization problems. It is shown that MMMF can be formulated as a semidefinite programming (SDP) problem and solved using standard SDP solvers. However, current SDP solvers can only handle MMMF problems on matrices of dimensionality up to a few hundred. Hence, a direct gradient-based optimization method for MMMF is proposed in [100] to make fast collaborative prediction. A detailed discussion of MMMF is given in Chapter 3. To further improve the performance of MMMF, in [23] MMMF is cast using ensemble methods, which include bagging and random weight seeding. MMMF was further extended in [129] by introducing offset terms, item-dependent regularization and a graph kernel on the recommender graph. In [135], a nonparametric Bayesian-style MMMF was proposed that utilizes nonparametric techniques to resolve the unknown number of latent factors in the MMMF model [129, 100, 23]. A probabilistic interpretation of the MMMF model through data augmentation was presented in [136].
The proposal of MMMF hinges heavily on the extended hinge loss function. Research on different loss functions and their extension to handle multiple classes has not attracted much attention from researchers, though there are some important proposals [83]. MMMF has become a very popular research topic since its publication and several extensions have been proposed [23, 129, 135, 136]. There has also been some research on matrix factorization for binary or bilevel preferences [122], but many view binary preference as a special case of matrix factorization with discrete ratings. In [155] the rating matrix is decomposed hierarchically by grouping similar users and items together, and each submatrix is factorized locally. To the best of our knowledge, there is no research on hierarchical MMMF.
2.5 Multilabel Classification with Matrix Factorization
In machine learning and statistics, the classification problem is concerned with the assignment of a class (category) to a data object (instance) from a given set of discrete classes. An example is classifying a document into one of several known categories such as sports, crime, business, politics etc. In a traditional classification problem, data objects are represented in the form of feature vectors, each associated with a unique class label from a set of disjoint class labels $\mathcal{L}$, $|\mathcal{L}| > 1$. Depending on the total number of disjoint classes in $\mathcal{L}$, a learning task is categorized as binary classification (when $|\mathcal{L}| = 2$) or multiclass classification (when $|\mathcal{L}| > 2$) [108]. However, in many real-world classification tasks, a data object can be simultaneously associated with more than one class in $\mathcal{L}$. For example, a document can simultaneously belong to more than one class, such as politics and business. The objective of multilabel classification (MLC) is to build a classifier that can automatically tag an example with the most relevant subset of labels. This problem can be seen as a generalization of single-label classification, where an instance is associated with a unique class label from a set of disjoint labels $\mathcal{L}$. The multilabel classification problem can be formally defined as follows:
Definition 3 (Multilabel Classification).
Given $N$ training examples in the form of a pair of feature matrix $X$ and label matrix $Y$, where each example $x_i$, $1 \le i \le N$, is a row of $X$ and its associated label vector $y_i$ is the corresponding row of $Y$. The entry at the $j$th coordinate of vector $y_i$ indicates the presence of label $j$ in data point $x_i$. The task of multilabel classification is to learn a parametrization that maps each instance (or, a feature vector) to a set of labels (a label vector).
MLC is a major research area in the machine learning community and finds application in several domains such as computer vision [12, 9], data mining [118, 106] and text classification [150, 106]. Due to the exponential size of the output space, exploiting intrinsic information in the feature and the label space has been the major thrust of research in recent years, and the use of parametrization and embedding techniques has been the prime focus in MLC. The embedding-based approach assumes that there exists a low-dimensional space onto which the given set of feature vectors and/or label vectors can be embedded. The embedding strategies can be grouped into two categories, namely: (1) feature space embedding (FE); and (2) label space embedding (LE). Feature space embedding aims to design a projection function which can map an instance in the original feature space to the label space. On the other hand, the label space embedding approach transforms the label vectors to an embedded space via linear or local nonlinear embeddings, followed by the association between feature vectors and the embedded label space for classification purposes. With a proper decoding process that maps the projected data back to the original label space, the task of multilabel prediction is achieved [41, 97, 112]. We present a brief review of the FE and LE approaches for multilabel classification. A detailed discussion is given in Chapter 5.
Given $N$ training examples in the form of a pair of feature matrix $X \in \mathbb{R}^{N \times D}$ and label matrix $Y$ of size $N \times L$, where each example $x_i$, $1 \le i \le N$, is a row of $X$ and its associated label vector $y_i$ is the corresponding row of $Y$, the goal of FE is to learn a transformation matrix $W \in \mathbb{R}^{D \times L}$ which maps instances from the feature space to the label space. This approach requires $D \times L$ parameters to model the classification problem, which will be expensive when $D$ and $L$ are large [139]. In [139] a generic empirical risk minimization (ERM) framework is used with a low-rank constraint on the linear parametrization $W = UV^T$, where $U \in \mathbb{R}^{D \times k}$ and $V \in \mathbb{R}^{L \times k}$ are of rank $k$. The problem can be restated as follows.
(2.17) $\min_{U, V} \; \ell(Y, XUV^T) + \frac{\lambda}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right)$
where $\|\cdot\|_F$ is the Frobenius norm and $\lambda$ is the regularization parameter. The formulation in Eq. (2.17) can capture the intrinsic information of both the feature and the label space. It can also be seen as a joint learning framework in which dimensionality reduction and multilabel classification are performed simultaneously [51, 140].
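The saving obtained from a low-rank linear parametrization can be illustrated with a small sketch. It assumes a $D \times L$ transformation factored into a $D \times k$ matrix and an $L \times k$ matrix; the names `U`, `V`, `low_rank_scores` and the sample sizes are hypothetical choices for illustration and are not taken from [139].

```python
def parameter_counts(D, L, k):
    """Number of parameters for a full D x L map versus a rank-k factorization."""
    return D * L, k * (D + L)


def low_rank_scores(x, U, V):
    """Label scores x U V^T computed without forming the D x L product.

    x is a feature vector of length D, U is D x k, V is L x k.
    """
    k = len(U[0])
    # Project the instance into the k-dimensional embedded space.
    z = [sum(x[d] * U[d][p] for d in range(len(x))) for p in range(k)]
    # Score each label by the dot product with its k-dimensional representation.
    return [sum(z[p] * v[p] for p in range(k)) for v in V]


full, factored = parameter_counts(D=10000, L=1000, k=50)
scores = low_rank_scores([1.0, 2.0], U=[[1.0], [1.0]], V=[[2.0], [3.0], [4.0]])
```

With these assumed sizes the factored model needs roughly one twentieth of the parameters of the full map, and prediction costs two small products instead of one large one.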
The matrix factorization (MF) based approach for LE aims at determining two matrices $U$ and $V$ such that $Y \approx UV$ [5, 134]. The matrix $U$ can be viewed as the basis matrix, while the matrix $V$ can be treated as the coefficient matrix, and a common formulation is the following optimization problem.
(2.18) $\min_{U, V} \; \ell(Y, UV) + \lambda R(U, V)$
where $\ell$ is a loss function that measures how well $UV$ approximates $Y$, $R$ is a regularization function that promotes various desired properties in $U$ and $V$ (sparsity, group sparsity, etc.), and the constant $\lambda \ge 0$ is the regularization parameter which controls the extent of regularization. In [79], an MF-based approach is used to learn the label encoding and decoding matrices simultaneously. The problem is formulated as follows.
(2.19) 
where is the code matrix, is the decoding matrix, is used to make featureaware by considering correlations between and as side information and the constant is the tradeoff parameter.
3.1 Introduction
In this chapter, we describe the proposed method, termed HMF (Hierarchical Matrix Factorization). HMF is a novel method for constructing a hierarchical bilevel maximum margin matrix factorization to handle matrix completion of an ordinal rating matrix. The proposed method draws motivation from research on multiclass classification. There are two major approaches to extending two-class classifiers to multiclass classifiers. The first approach explicitly reformulates the classifier, resulting in a unified multiclass optimization problem (embedded technique). The second approach (combinational technique) is to decompose a multiclass problem into multiple, independently trained, binary classification problems and to combine them appropriately so as to form a multiclass classifier. In maximum margin matrix factorization (MMMF) [100], the authors adopted the embedded approach by extending the bilevel hinge loss to multilevel cases.
Combinational techniques have been very popular and successful as they exhibit some relative advantages over embedded techniques, and this prompts us to ask whether some sort of combinational approach can be employed in the context of MMMF. There are several combinational techniques, such as One-Against-All (OAA) [40, 102, 131] and One-Against-One (OAO) [22, 40]. HMF falls into the category of the OAA approach, with a difference. The OAA approach to classification employs one binary classifier for each class against all other classes. In the present context, 'class' corresponds to an ordinal rank. Interestingly, since the ranks are ordered, in the present case the OAA strategy is used by taking advantage of the ordering of the 'classes'. The traditional OAA strategy would use one bilevel matrix factorization for rank $r$ as one label (say, $-1$) and all other ranks as the other label (say, $+1$). Here, instead, we employ for each rank $r$ all ranks below $r$ as $-1$ and all ranks above $r$ as $+1$.
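The ordered decomposition described above can be sketched as a simple relabeling step. The code below is an illustration under stated assumptions: unobserved entries are coded as 0, ratings at or below the cut rank are mapped to $-1$ and ratings above it to $+1$ (how the cut rank itself is assigned is an assumption here), and the function names are hypothetical.

```python
def binarize_at(Y, r):
    """Turn an ordinal rating matrix into a bilevel matrix for cut rank r.

    Assumption: ratings <= r become -1, ratings > r become +1, and
    0 (unobserved) stays 0.
    """
    return [[0 if y == 0 else (-1 if y <= r else 1) for y in row] for row in Y]


def ordered_decomposition(Y, R):
    """One bilevel matrix per cut rank r = 1..R-1."""
    return [binarize_at(Y, r) for r in range(1, R)]


Y = [[1, 3, 0],
     [5, 2, 4]]
bilevel = ordered_decomposition(Y, R=5)
```

Each of the resulting matrices can then be handed to a bilevel factorization, so that an ordinal problem with five ratings gives rise to four binary ones.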
MMMF and HMF, like other matrix factorization methods, use the latent factor model approach. The latent factor model is concerned with learning, from limited observations, a latent factor vector $U_i$ for each user $i$ and a latent factor vector $V_j$ for each item $j$, such that the dot product of $U_i$ and $V_j$ gives the rating of user $i$ for item $j$. In MMMF, besides learning the latent factors, the set of thresholds also needs to be learned. It is assumed that there are $R-1$ thresholds $\theta_{i,r}$ for each user $i$ ($R$ is the number of ordinal values or number of ranks). Thus, the rating of user $i$ for item $j$ is decided by comparing the dot product of $U_i$ and $V_j$ with each $\theta_{i,r}$. The fundamental assumption of MMMF is therefore that the latent factor vectors determine the properties of users and items, and the threshold values capture the characteristics of the ratings. HMF differs from MMMF on this principle. The underlying principle of HMF is that the latent factors of users and items will be different if the users' like/dislike cutoff thresholds are different. The latent factors differ according to the situation. For instance, the latent factors obtained when ranks above one cutoff are identified as likes are different from those obtained when ranks above another cutoff are identified as likes. Thus, if we have $R$ ratings, there will be $R-1$ pairs of latent factors $(U^q, V^q)$. Unlike the process of learning a single pair of latent factors, $U$ and $V$, HMF learns several $U^q$'s and $V^q$'s in this process without any additional computational overhead. The process is shown to be a more accurate matrix completion process. There has not been any attempt in this direction, and we believe that our present study will open up new areas of research in the future.
The rest of the chapter is organized as follows. In Section 3.2, we describe the bilevel MMMF which we use subsequently in our algorithm. Section 3.3 summarizes the well-known existing MMMF method. We introduce our proposed method of Hierarchical Matrix Factorization (HMF) in Section 3.4. The advantages of HMF over other matrix factorization based collaborative filtering methods are shown by detailed experimental analysis in Section 3.6. Section 3.7 concludes the chapter.
3.2 Bilevel MMMF
In this section we describe Maximum Margin Matrix Factorization for a bilevel rating matrix. The matrix completion of bilevel matrix is concerned with the following problem.
Problem P(Y): Let $Y$ be an $N \times M$ partially observed user-item rating matrix and $\Omega$ be the set of observed entries. For each $(i,j) \in \Omega$, the entry $y_{ij} \in \{-1, +1\}$ defines the preference of the $i$th user for the $j$th item, with $+1$ for likes and $-1$ for dislikes. For each $(i,j) \notin \Omega$, the entry $y_{ij} = 0$ indicates that the preference of the $i$th user for the $j$th item is not available. Given a partially observed rating matrix $Y$, the goal is to infer $y_{ij}$ for $(i,j) \notin \Omega$.
Matrix factorization is one of the major techniques employed for any matrix completion problem. In this line, the above problem can be rephrased using latent factors. Given a partially observed rating matrix $Y$ and the observed preference set $\Omega$, matrix factorization aims at determining two low-rank (or, low-norm) matrices $U \in \mathbb{R}^{N \times d}$ and $V \in \mathbb{R}^{M \times d}$ such that $Y \approx UV^T$. The row vectors $U_i$, $1 \le i \le N$, and $V_j$, $1 \le j \le M$, are the $d$-dimensional representations of the users and the items, respectively. A common formulation of P(Y) is to find $U$ and $V$ as the solution of the following optimization problem.
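(3.1) $\min_{U, V} \; J(U, V) = \sum_{(i,j) \in \Omega} \ell(y_{ij}, U_i V_j^T) + \lambda R(U, V)$
where $\ell$ is a loss function that measures how well $U_i V_j^T$ approximates $y_{ij}$, $R$ is a regularization function, and $\lambda$ is the regularization parameter. The idea of the above formulation is to alleviate the problem of outliers through a robust loss function and, at the same time, to avoid overfitting and to make the optimization function smooth through the use of the regularization function.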
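A number of approaches can be used to optimize the objective function in Eq. (3.1). The gradient descent method and its variants start with random $U$ and $V$ and iteratively update $U$ and $V$ using Equations (3.2) and (3.3), respectively.
(3.2) $U^{t+1} = U^{t} - c \, \frac{\partial J}{\partial U^{t}}$
(3.3) $V^{t+1} = V^{t} - c \, \frac{\partial J}{\partial V^{t}}$
where $c$ is the step length parameter and the suffixes $t$ and $(t+1)$ indicate current values and updated values.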
The stepwise description of the process is given as Pseudocode in Algorithm 1.
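The generic procedure can be sketched in a few lines of code. The version below is not a reproduction of Algorithm 1; it is a minimal sketch that assumes the smooth hinge loss on the margin $y_{ij} U_i V_j^T$, a squared Frobenius regularizer weighted by `lam / 2`, a fixed step length, and a fixed number of iterations. All function and parameter names are illustrative.

```python
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def smooth_hinge(z):
    if z >= 1.0:
        return 0.0
    if z <= 0.0:
        return 0.5 - z
    return 0.5 * (1.0 - z) ** 2


def smooth_hinge_grad(z):
    # Derivative of the smooth hinge with respect to the margin z.
    if z >= 1.0:
        return 0.0
    if z <= 0.0:
        return -1.0
    return z - 1.0


def objective(Y, U, V, lam):
    # Loss over observed entries (nonzero y) plus (lam / 2) * squared norms.
    loss = sum(smooth_hinge(y * dot(U[i], V[j]))
               for i, row in enumerate(Y) for j, y in enumerate(row) if y != 0)
    reg = sum(u * u for r in U for u in r) + sum(v * v for r in V for v in r)
    return loss + 0.5 * lam * reg


def gradient_descent(Y, d=2, lam=0.1, step=0.05, iters=300, seed=0):
    rng = random.Random(seed)
    N, M = len(Y), len(Y[0])
    # Start from small random factors.
    U = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(N)]
    V = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(M)]
    for _ in range(iters):
        # Gradient of the regularizer, then add the loss terms entry by entry.
        gU = [[lam * u for u in r] for r in U]
        gV = [[lam * v for v in r] for r in V]
        for i in range(N):
            for j in range(M):
                y = Y[i][j]
                if y == 0:
                    continue  # unobserved entry
                g = y * smooth_hinge_grad(y * dot(U[i], V[j]))
                for p in range(d):
                    gU[i][p] += g * V[j][p]
                    gV[j][p] += g * U[i][p]
        # Simultaneous update of both factors with a fixed step length.
        U = [[u - step * g for u, g in zip(ru, rg)] for ru, rg in zip(U, gU)]
        V = [[v - step * g for v, g in zip(rv, rg)] for rv, rg in zip(V, gV)]
    return U, V


Y = [[1, -1, 0], [-1, 1, 1], [1, 0, -1]]
U, V = gradient_descent(Y)
```

A fixed iteration count is used here for simplicity; a stopping rule based on the change in the objective value would be a natural alternative.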
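Once $U$ and $V$ are computed by Algorithm 1, the matrix completion process is accomplished from the factor matrices as follows.
(3.4) $\hat{y}_{ij} = \begin{cases} +1 & \text{if } U_i V_j^T \ge \theta \\ -1 & \text{otherwise} \end{cases}$
where $\theta$ is the user-specified threshold value.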
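The completion step amounts to thresholding the real-valued predictions. The sketch below assumes that a score at or above the threshold is mapped to $+1$ and anything below it to $-1$; the treatment of a score exactly equal to the threshold, the default threshold of 0, and the function name are assumptions made for illustration.

```python
def complete_matrix(U, V, theta=0.0):
    """Predict every entry of the bilevel matrix from the factor matrices.

    U is N x d (users), V is M x d (items); theta is a user-specified
    threshold applied to the dot product of a user row and an item row.
    """
    return [[1 if sum(u * v for u, v in zip(U_i, V_j)) >= theta else -1
             for V_j in V]
            for U_i in U]


U = [[1.0, 0.0], [0.0, -1.0]]
V = [[2.0, 1.0], [-1.0, 3.0]]
Y_hat = complete_matrix(U, V)
```

Raising the threshold makes the completed matrix more conservative about predicting likes, since fewer scores clear the cutoff.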
The latent factor $V_j$ of each item $j$ can be viewed as a point in $d$-dimensional space and the latent factor $U_i$ of user $i$ can be viewed as a decision hyperplane in this space. The objective of bilevel matrix factorization is to learn the embeddings of these points and hyperplanes in $\mathbb{R}^d$ such that each hyperplane (corresponding to a user) separates (or, equivalently, classifies) the items by the likes and dislikes of a user. Let us consider the following partially observed bilevel matrix $Y$ (Table 3.1) for illustration. The unobserved entries are indicated by $0$ entries.
0  |  1  |  0  |  0  |  1  |  0  |  1 
1  0  1  1  1  0  1 
0  1  1  0  0  1  1 
1  1  1  1  1  1  0 
1  0  1  1  1  0  0 
0  1  0  1  0  1  1 
1  1  0  1  0  0  0 
Let us assume that the following and are the latent factor matrices with .
The same is depicted graphically in Figure 3.1. are depicted as points. The hyperplanes with threshold are depicted as lines. An arrow is shown to indicate the side of each line. In other words, if a point falls this side, the preference is and the other side preference is . Entry is predicted based on the position of with respect to . falls to the positive side of line corresponding to and hence, the entry is predicted as . The latent factor and are obtained by a learning process making use of the generic algorithm (Algorithm 1) with the observed entries as the training set. The objective of this learning process is to minimize the loss due to discrepancy between computed entries and the observed entries. In the example (Figure 3.1) point and are in the side of and in the negative side of . These points are located at different distance.
There are many matrix factorization algorithms for bilevel matrices that adopt the generic algorithm described in Algorithm 1. These algorithms differ in the specification of the loss function and the regularization function. We adopt here a maximum-margin formulation as the guiding principle.
Loss functions in matrix factorization models are used to measure the discrepancy between the observed entry and the corresponding prediction. When predicting discrete values such as ratings, loss functions other than the sum-squared loss are often more appropriate [100]. The trade-off between generalization loss and empirical loss has been a prevailing concern since the advent of the support vector machine (SVM). The maximum-margin approach aims at providing higher generalization ability and avoiding overfitting. In this context, the hinge loss function is the most preferred loss function and is defined as follows.
$$h(z) = \max(0,\, 1 - z) \tag{3.5}$$
where $z = y_{ij}\,(U_i V_j^T)$. The hinge loss is illustrated in Figure 3.2.
The following real-valued prediction matrix (Table 3.5) is obtained from the latent factor matrices $U$ and $V$ (Table 3.4) corresponding to the matrix $Y$ (Table 3.1).
0.24  0.95  0.46  0.14  0.84  0.13  0.81 
0.65  1.46  0.33  0.55  1.06  0.09  1.23 
1.12  0.92  0.74  1.21  0.11  0.81  0.71 
1.02  0.42  1.02  1.16  0.35  0.91  0.27 
0.96  1.06  0.41  0.99  0.39  0.58  0.85 
0.47  0.80  1.27  0.68  1.23  0.81  0.75 
0.61  0.56  1.26  0.81  1.08  0.87  0.55 
The hinge loss function values corresponding to the observed entries in $Y$ (Table 3.1) and the real-valued predictions (Table 3.5) are shown in Table 3.6.
0  0.05  0  0  0.16  0  0 
0.35  0  0.67  0.45  0  0  0.35 
0  0.08  0.26  0  0  0.19  0 
0  0.58  0  0  0.65  0.09  0 
0.04  0  0.59  0.01  0.61  0  0.04 
0  0.20  0  0.32  0  0.19  0 
0.39  0.44  0  0.19  0  0  0.39 
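The hinge loss is non-differentiable at $z = 1$ and is very sensitive to outliers, as mentioned in [100]. Therefore, an alternative called the smooth hinge loss is proposed in [101] and can be defined as
$$h(z) = \begin{cases} 0 & \text{if } z \ge 1, \\ \frac{1}{2}(1 - z)^2 & \text{if } 0 < z < 1, \\ \frac{1}{2} - z & \text{if } z \le 0. \end{cases} \tag{3.6}$$
The smooth hinge loss is illustrated in Figure 3.3.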
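The two loss functions can be compared numerically with a small sketch. It assumes the standard forms $\max(0, 1-z)$ for the hinge loss and the piecewise quadratic form ($0$ for $z \ge 1$, $\frac{1}{2}(1-z)^2$ for $0 < z < 1$, $\frac{1}{2} - z$ for $z \le 0$) for the smooth hinge loss. The function names are illustrative and are not taken from the thesis.

```python
def hinge(z):
    """Hinge loss, assumed here as max(0, 1 - z)."""
    return max(0.0, 1.0 - z)


def smooth_hinge(z):
    """Smooth hinge loss: zero beyond the margin, quadratic inside it, linear for z <= 0."""
    if z >= 1.0:
        return 0.0
    if z <= 0.0:
        return 0.5 - z
    return 0.5 * (1.0 - z) ** 2


def smooth_hinge_grad(z):
    """Derivative of the smooth hinge; it is continuous, unlike that of the hinge."""
    if z >= 1.0:
        return 0.0
    if z <= 0.0:
        return -1.0
    return z - 1.0


# Illustrative margins z = y * prediction (values chosen for this sketch).
for z in (0.95, 0.65, 0.42):
    print(f"z={z:.2f}  hinge={hinge(z):.2f}  smooth_hinge={smooth_hinge(z):.2f}")
```

For margins inside the interval $(0, 1)$ the smooth hinge is always the smaller of the two, since $\frac{1}{2}(1-z)^2 < 1 - z$ there.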
The smooth hinge loss function values corresponding to the observed entries in $Y$ (Table 3.1) and the real-valued predictions (Table 3.5) are shown in Table 3.7.
0  0  0  0  0.01  0  0 
0.06  0  0.23  0.10  0  0  0.06 
0  0  0.03  0  0  0.02  0 
0  0.17  0  0  0.21  0  0 
0  0  0.17  0  0.19  0  0 
0  0.02  0  0.05  0  0.02  0 
0.08  0.10  0  0.02  0  0  0.08 
Figures 3.2 and 3.3 show the loss function values for the hinge and smooth hinge loss, respectively. It can be seen that the smooth hinge loss shares important similarities with the hinge loss and has a smooth transition between a zero slope and a constant negative slope. Tables 3.6 and 3.7 show the hinge loss and smooth hinge loss function values corresponding to the observed entries in $Y$ (Table 3.1). It can also be seen that the smooth hinge is less sensitive to outliers than the hinge loss function. For example, when the rating given by a user for an item is positive and the same is reflected in the embedding (Figure 3.1), but with a margin smaller than one, the loss incurred by the hinge is larger than the loss incurred by the smooth hinge (compare the corresponding entries of Tables 3.6 and 3.7). Even though the point is embedded on the correct side, the hinge loss function contributes more to the increase in the objective value.
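We reformulate the P(Y) problem for the bilevel rating matrix as the following optimization problem.
$$\min_{U,V}\; J(U,V) = \sum_{(i,j)\in\Omega} h\big(y_{ij}\,(U_i V_j^T)\big) + \frac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right) \tag{3.7}$$
where $\|\cdot\|_F$ is the Frobenius norm as defined earlier in the thesis, $\lambda > 0$ is the regularization parameter, $\Omega$ is the set of observed entries and $h(\cdot)$ is the smooth hinge loss function as defined previously.
The gradients of the variables to be optimized are determined as follows. The gradient with respect to each element of $U$ is calculated as
$$\frac{\partial J}{\partial U_{ip}} = \lambda U_{ip} + \sum_{j \mid (i,j)\in\Omega} y_{ij}\, h'\big(y_{ij}\,(U_i V_j^T)\big)\, V_{jp} \tag{3.8}$$
Similarly, the gradient with respect to each element of $V$ is calculated as follows.
$$\frac{\partial J}{\partial V_{jq}} = \lambda V_{jq} + \sum_{i \mid (i,j)\in\Omega} y_{ij}\, h'\big(y_{ij}\,(U_i V_j^T)\big)\, U_{iq} \tag{3.9}$$
where $h'(z)$ is defined as follows.
$$h'(z) = \begin{cases} 0 & \text{if } z \ge 1, \\ z - 1 & \text{if } 0 < z < 1, \\ -1 & \text{if } z \le 0. \end{cases} \tag{3.10}$$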
Algorithm 2 outlines the main flow of the bilevel maximum margin matrix factorization (BMMMF).
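The overall flow can be illustrated with a minimal batch gradient descent sketch. It is not the thesis's Algorithm 2. It assumes that the observed entries of $Y$ are $+1$ or $-1$ with $0$ marking unobserved entries, that the objective is the sum of smooth hinge losses over observed entries plus $\frac{\lambda}{2}(\|U\|_F^2 + \|V\|_F^2)$, and that prediction thresholds $U_i V_j^T$ at a user-specified $\theta$. All function names, the toy matrix, and the hyperparameter values are hypothetical.

```python
import random


def smooth_hinge(z):
    if z >= 1.0:
        return 0.0
    if z <= 0.0:
        return 0.5 - z
    return 0.5 * (1.0 - z) ** 2


def smooth_hinge_grad(z):
    if z >= 1.0:
        return 0.0
    if z <= 0.0:
        return -1.0
    return z - 1.0


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(Y, U, V, lam):
    """Smooth hinge loss over observed entries plus Frobenius regularization."""
    loss = sum(smooth_hinge(Y[i][j] * dot(U[i], V[j]))
               for i in range(len(Y)) for j in range(len(Y[0])) if Y[i][j] != 0)
    reg = sum(a * a for row in U + V for a in row)
    return loss + 0.5 * lam * reg


def train_bmmmf(Y, d=2, lam=0.1, step=0.05, iters=500, seed=0):
    """Batch gradient descent on U (users x d) and V (items x d)."""
    rng = random.Random(seed)
    n, m = len(Y), len(Y[0])
    U = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(n)]
    V = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(m)]
    observed = [(i, j) for i in range(n) for j in range(m) if Y[i][j] != 0]
    for _ in range(iters):
        # Start each gradient with the regularization term lam * U and lam * V.
        grad_U = [[lam * a for a in row] for row in U]
        grad_V = [[lam * a for a in row] for row in V]
        for i, j in observed:
            g = Y[i][j] * smooth_hinge_grad(Y[i][j] * dot(U[i], V[j]))
            for p in range(d):
                grad_U[i][p] += g * V[j][p]
                grad_V[j][p] += g * U[i][p]
        for i in range(n):
            for p in range(d):
                U[i][p] -= step * grad_U[i][p]
        for j in range(m):
            for p in range(d):
                V[j][p] -= step * grad_V[j][p]
    return U, V


def predict(U, V, theta=0.0):
    """Complete the matrix by thresholding U_i . V_j at theta."""
    return [[1 if dot(u, v) >= theta else -1 for v in V] for u in U]


# Hypothetical toy matrix: +1 like, -1 dislike, 0 unobserved.
Y = [[1, -1, 0, 1], [-1, 1, 1, 0], [1, 0, -1, 1]]
U0, V0 = train_bmmmf(Y, iters=0)   # random initialization only
U, V = train_bmmmf(Y)
print(objective(Y, U0, V0, 0.1), objective(Y, U, V, 0.1))
print(predict(U, V))
```

The sketch uses a fixed step length and a fixed number of iterations for simplicity; a convergence test on the objective value would be a natural stopping rule.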
3.3 Multilevel MMMF
As discussed in the previous chapter, MMMF [110] and, subsequently, Fast MMMF [100] were proposed primarily for collaborative filtering with an ordinal rating matrix, where user preferences are not in the form of like/dislike but are values in a discrete range. The matrix completion of an ordinal rating matrix is concerned with the following problem.
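Problem P(Y): Let $Y$ be an $N \times M$ partially observed user-item rating matrix and let $\Omega$ be the set of observed entries. For each $(i,j) \in \Omega$, the entry $y_{ij} \in \{1, 2, \ldots, R\}$ defines the preference of the $i$th user for the $j$th item. For each $(i,j) \notin \Omega$, $y_{ij} = 0$ indicates that the preference of the $i$th user for the $j$th item is not available. Given a partially observed rating matrix $Y$, the goal is to predict $y_{ij}$ for $(i,j) \notin \Omega$.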
Need for multiple thresholds: Unlike the bilevel problem, P(Y) has a domain of $y_{ij}$ with more than two values. When the domain has two values, P(Y) is equivalent to the bilevel problem. Continuing our discussion on the geometrical interpretation of the bilevel problem, the likes and dislikes of user $i$ are separated onto the two sides of the hyperplane defined by $U_i$. When the same concept is extended to P(Y), it is necessary to define threshold values $\theta_{ir}$ such that the region between two parallel hyperplanes defined by the same $U_i$ with different thresholds $\theta_{i,r-1}$ and $\theta_{ir}$ corresponds to the rating $r$. Thus, in P(Y), in addition to learning the latent factor matrices $U$ and $V$, it is also necessary to learn the thresholds $\theta$'s. There may be a debate on the number of $\theta$'s needed, but following the original proposal of MMMF [100], we use $R-1$ thresholds for each user and hence there are $N(R-1)$ thresholds.
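The role of the thresholds can be illustrated with a short sketch that maps a real-valued prediction for one user to a rating. It assumes $R-1$ thresholds sorted in ascending order, with the region between consecutive thresholds corresponding to one rating. The function name, the threshold values, and the convention that a prediction exactly on a threshold falls into the higher region are assumptions of this sketch.

```python
from bisect import bisect_right


def rating_from_score(x, thetas):
    """Return a rating in 1..R for score x, given R-1 ascending thresholds."""
    # The number of thresholds at or below x decides the region the score falls into.
    return bisect_right(thetas, x) + 1


# Hypothetical thresholds of one user for R = 5 rating levels.
thetas = [-1.5, -0.5, 0.5, 1.5]
for x in (-2.0, 0.2, 2.0):
    print(x, rating_from_score(x, thetas))
```

With two rating levels and a single threshold, the function reduces to the like/dislike decision of the bilevel case.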
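There are different approaches for constructing a loss function based on a set of thresholds, such as immediate-threshold and all-threshold [101]. For each observed entry and the corresponding prediction pair $(y_{ij}, x_{ij})$, where $x_{ij} = U_i V_j^T$, the immediate-threshold based approach calculates the loss as the sum of the immediate-threshold violations, which is $h(x_{ij} - \theta_{i,y_{ij}-1}) + h(\theta_{i,y_{ij}} - x_{ij})$, with the term for a non-existent threshold omitted when $y_{ij} = 1$ or $y_{ij} = R$. On the other hand, the all-threshold loss is calculated as the sum of the losses for all thresholds, which is the cost of crossing multiple rating boundaries, and is defined as follows.
$$\sum_{r=1}^{y_{ij}-1} h(x_{ij} - \theta_{ir}) + \sum_{r=y_{ij}}^{R-1} h(\theta_{ir} - x_{ij}) \tag{3.11}$$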
Using the thresholds $\theta$'s, MMMF extended the hinge loss function meant for binary classification to ordinal settings. The difference between immediate-threshold and all-threshold hinge is illustrated with the help of the following example. Let us consider the partially observed ordinal rating matrix $Y$ with $R$ rating levels, the learnt factor matrices $U$ and $V$, and the set of thresholds $\theta$'s for each user (Table 3.12).
The following real-valued prediction matrix is obtained from the above latent factor matrices $U$ and $V$ corresponding to the matrix $Y$.
Consider an entry $y_{ij}$ observed as rating $r$ with corresponding real-valued prediction $x_{ij} = U_i V_j^T$. When immediate-threshold hinge is the loss measure, the loss for this entry is calculated as $h(x_{ij} - \theta_{i,r-1}) + h(\theta_{ir} - x_{ij})$,
where $h(\cdot)$ is the smooth hinge loss function as defined previously. For the same entry, the loss with the all-threshold hinge function is calculated as $\sum_{s=1}^{r-1} h(x_{ij} - \theta_{is}) + \sum_{s=r}^{R-1} h(\theta_{is} - x_{ij})$.
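Continuing our discussion on the geometrical interpretation, the immediate-threshold loss tries to embed the point $V_j$ rated as $r$ by user $i$ into the region defined by the parallel hyperplanes $(U_i, \theta_{i,r-1})$ and $(U_i, \theta_{ir})$, which basically means that $U_i V_j^T > \theta_{i,r-1}$ and $U_i V_j^T < \theta_{ir}$. The all-threshold hinge function not only tries to embed the points rated as $r$ into the region defined by $(U_i, \theta_{i,r-1})$ and $(U_i, \theta_{ir})$ but also considers the position of the points with respect to the other hyperplanes. It is also desirable that every point $V_j$ rated as $r$ by user $i$ satisfies the condition $U_i V_j^T > \theta_{is}$ for $s < r$ and $U_i V_j^T < \theta_{is}$ for $s \ge r$.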
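The difference between the two constructions can be made concrete with a small sketch for a single observed entry. It assumes the smooth hinge as the underlying loss and $R-1$ ascending thresholds for one user, and drops the term for a non-existent threshold when the rating is $1$ or $R$. The function names and the numerical values are hypothetical.

```python
def smooth_hinge(z):
    if z >= 1.0:
        return 0.0
    if z <= 0.0:
        return 0.5 - z
    return 0.5 * (1.0 - z) ** 2


def immediate_threshold_loss(x, y, thetas):
    """Penalize violations of only the two thresholds bounding rating y."""
    R = len(thetas) + 1
    loss = 0.0
    if y > 1:
        loss += smooth_hinge(x - thetas[y - 2])   # lower boundary theta_{y-1}
    if y < R:
        loss += smooth_hinge(thetas[y - 1] - x)   # upper boundary theta_y
    return loss


def all_threshold_loss(x, y, thetas):
    """Penalize violations of every threshold, each with the appropriate sign."""
    loss = 0.0
    for r, theta in enumerate(thetas, start=1):
        t = 1.0 if r >= y else -1.0   # +1: x should lie below theta_r; -1: above
        loss += smooth_hinge(t * (theta - x))
    return loss


# Hypothetical user with R = 5 levels; an item rated 2 is predicted far too high.
thetas = [-1.5, -0.5, 0.5, 1.5]
x, y = 1.0, 2
print(immediate_threshold_loss(x, y, thetas))
print(all_threshold_loss(x, y, thetas))
```

In this example the prediction crosses two rating boundaries, so the all-threshold loss exceeds the immediate-threshold loss, which sees only the nearest boundary.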
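In MMMF [100], each hyperplane acts as a maximum-margin separator, which is ensured by considering the smooth hinge as the loss function (all-threshold hinge). The resulting optimization problem for P(Y) is
$$\min_{U,V,\theta}\; J(U,V,\theta) = \sum_{(i,j)\in\Omega} \left[ \sum_{r=1}^{y_{ij}-1} h\big(U_i V_j^T - \theta_{ir}\big) + \sum_{r=y_{ij}}^{R-1} h\big(\theta_{ir} - U_i V_j^T\big) \right] + \frac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right) \tag{3.12}$$
where $\|\cdot\|_F$ is the Frobenius norm, $\lambda > 0$ is the regularization parameter, $\Omega$ is the set of observed entries, $h(\cdot)$ is the smooth hinge loss as defined previously and $\theta_{ir}$ is the threshold for rank $r$ of user $i$. The equation given above can be rewritten as follows.
$$\min_{U,V,\theta}\; J(U,V,\theta) = \sum_{(i,j)\in\Omega} \sum_{r=1}^{R-1} h\big(T_{ij}^{r}\,(\theta_{ir} - U_i V_j^T)\big) + \frac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right) \tag{3.13}$$
where $T_{ij}^{r}$ is defined as
$$T_{ij}^{r} = \begin{cases} +1 & \text{if } r \ge y_{ij}, \\ -1 & \text{if } r < y_{ij}. \end{cases}$$
The gradients of the variables to be optimized are determined as follows. The gradient with respect to each element of $U$ is calculated as follows.
$$\frac{\partial J}{\partial U_{ip}} = \lambda U_{ip} - \sum_{r=1}^{R-1} \sum_{j \mid (i,j)\in\Omega} T_{ij}^{r}\, h'\big(T_{ij}^{r}\,(\theta_{ir} - U_i V_j^T)\big)\, V_{jp} \tag{3.14}$$
where $h'(\cdot)$ is the same as defined previously. Similarly, the gradient with respect to each element of $V$ is calculated as
$$\frac{\partial J}{\partial V_{jq}} = \lambda V_{jq} - \sum_{r=1}^{R-1} \sum_{i \mid (i,j)\in\Omega} T_{ij}^{r}\, h'\big(T_{ij}^{r}\,(\theta_{ir} - U_i V_j^T)\big)\, U_{iq} \tag{3.15}$$
and the gradient with respect to each $\theta_{ir}$ is determined as follows.
$$\frac{\partial J}{\partial \theta_{ir}} = \sum_{j \mid (i,j)\in\Omega} T_{ij}^{r}\, h'\big(T_{ij}^{r}\,(\theta_{ir} - U_i V_j^T)\big)$$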