Transduction with Matrix Completion Using Smoothed Rank Function

05/19/2018 ∙ by Ashkan Esmaeili, et al. ∙ Sharif Accelerator ∙ Stanford University

In this paper, we propose two new algorithms for the transduction with Matrix Completion (MC) problem. The joint MC and prediction tasks are addressed simultaneously to enhance accuracy, i.e., the label matrix is concatenated to the data matrix to form a stacked matrix. Assuming the data matrix is of low rank, we propose new recommendation methods by posing the problem as a constrained minimization of the Smoothed Rank Function (SRF). We provide convergence analysis for the proposed algorithms. The simulations are conducted on real datasets in two different scenarios of randomly missing pattern, with and without block loss. The results confirm that the accuracy of our proposed methods exceeds that of state-of-the-art methods by up to 10% at low observation rates in the scenario without block loss. In the scenario with block loss, our accuracy is comparable to that of state-of-the-art methods, while the complexity of the proposed algorithms is reduced by up to 4 times.




1 Introduction

Prediction using labels, known as supervised learning, is tackled in many papers in the literature, and several efficient approaches have been introduced to this end Castelli2018 . In many real-world applications, missing information or random sampling is an inseparable part of the problem marvasti2012nonuniform ; marvasti2017wideband . One of the most common approaches to address unobserved or quantized attributes is to use Matrix Completion (MC) methods candes2009exact ; SmoothedBabaieZadeh ; 8281111 . Combining the two aforementioned concepts, the classification, prediction, or multi-label learning tasks are considered in the missing information scenario. This problem can generally be addressed either directly or indirectly. The indirect approach addresses the imputation and prediction tasks separately farhangfar2008impact ; liu2016adaptive , while the direct approaches introduce a unified platform where both tasks are conducted simultaneously liu2018svm ; shang2013semi ; kiasari2017novel .
The direct Transduction with MC task, introduced in goldberg2010transduction , not only addresses the multi-label problem but also imputes the unobserved entries simultaneously in a single framework. In their proposed method MC-1, the label and data matrices are concatenated to form a larger stacked matrix. Then, they minimize the penalized nuclear norm of the stacked matrix, assuming the low-rank property holds for the data matrix and, consequently, for the stacked matrix (in linear models). In their model, the nuclear norm is used as a convex surrogate of the rank function. In xu2013speedup , the authors suggest an algorithm which is more robust than the one proposed in goldberg2010transduction and which also achieves higher accuracy in terms of the Average Precision (AP) measure.
In this paper, we introduce a novel direct method to impute the labels and missing data together. To this end, we pose a new optimization model that approximates the rank of the stacked matrix with a smooth function. The Smoothed Rank Function (SRF) concept leveraged in our model makes the objective differentiable SmoothedBabaieZadeh . Thus, we can use the Projected Gradient (PG) and Spectral Projected Gradient (SPG) methods birgin2000nonmonotone , which are more robust and faster than subgradient-based methods derived from penalized nuclear norm cost functions. It is worth noting that the problem model we introduce is different from a simple MC task, since the hard labels impose additional nonaffine constraints. We introduce two new algorithms based on PG and SPG. Our simulation results show that our methods outperform state-of-the-art methods in both accuracy and complexity in most cases. Detailed simulation analysis is provided in Section 5. We also provide convergence analysis for our proposed algorithms.

Other authors have used the concatenation concept for different purposes, such as image classification, and have leveraged semi-supervised transduction with MC for tagging and classifying images wang2013learning , where the authors propose a novel hashing approach for tag completion and prediction. In luo2015multiview and lin2013image , the applications of this model to social image tagging and image classification are investigated. In doi:10.1093/bioinformatics/btu269 , a matrix-completion method called Inductive Matrix Completion is applied to the problem of predicting gene-disease associations; it combines multiple types of evidence (features) for diseases and genes to learn latent factors that explain the observed gene-disease associations. In alameda2015analyzing , the authors use the ADMM technique to optimize the augmented Lagrangian function for MC. The matrix in their work is a concatenated matrix based on an idea similar to the one introduced in goldberg2010transduction . Their purpose is to carry out head and body pose estimation, which could be considered one of the applications of wearable devices. In wu2016constrained , the authors use ADMM to optimize a class of submodular cost functions in order to deal with missing information and class imbalance in multi-label learning simultaneously. In fan2014distant , two noise-tolerant optimization models, DRMC-b and DRMC-1, are introduced for the distantly supervised relation extraction task.
The rest of the paper is organized as follows: In Section 2, the problem formulation is provided. In Section 3, we review the smoothed rank function approximation and explain the motivation of the new objective function taken into account. We also include our proposed algorithms in this section. Next, in Section 4, we analyze the convergence of our proposed algorithms. We illustrate the performance of our method and compare it to several state-of-the-art methods in Section 5. Finally, we conclude the paper in Section 6.

2 Problem Formulation


Let $x_1, \dots, x_n \in \mathbb{R}^d$ be feature vectors associated with $n$ items. These vectors are combined together in a row-wise fashion to create a feature matrix, $X \in \mathbb{R}^{n \times d}$. Let $y_1, \dots, y_n$ be classification label vectors of size $t$. These vectors are combined together to create the label matrix, $Y \in \mathbb{R}^{n \times t}$. In the missing data scenario, some of the entries in $X$ and $Y$ are observed and the others are Missing Completely at Random (MCAR) little2014statistical . Let $\Omega_X$ and $\Omega_Y$ denote the sets of observed entries in $X$ and $Y$, respectively. If a specific feature relating to an item is not reported, or in other words is not observed in these matrices, then it is reported as NaN (in older literature on missing data, NA (Not Assigned) was used to denote the missing entries). Thus, the entries of the matrix $Y$ are reported as $+1$ or $-1$ for classified labels and NaN for missing labels.

Our goal is to predict the missing labels in $Y$ as well as to impute the missing features in $X$. To solve this generally ill-posed problem, we assume that $X$ and $Y$ are jointly produced by an underlying low-rank matrix goldberg2010transduction . We assume $X^0 \in \mathbb{R}^{n \times d}$, with rows $x_i^0$, is the low-rank pre-feature matrix. Let $y_i^0 \in \mathbb{R}^t$ denote the soft labels associated with $x_i^0$. By the assumption, $y_i^0$ is produced as $y_i^0 = W x_i^0 + b$, where $W \in \mathbb{R}^{t \times d}$ is the weight matrix and $b \in \mathbb{R}^t$ is the bias vector. The hard labels $y_i$ are generated from the soft labels via some function (in general, the sign function or the logistic function is used). Let $Y^0 \in \mathbb{R}^{n \times t}$ be the soft label matrix. Since $Y^0 = X^0 W^T + \mathbf{1} b^T = [X^0, \mathbf{1}]\,[W, b]^T$, the columns of the soft label matrix are linear combinations of the columns in $[X^0, \mathbf{1}]$, where $\mathbf{1}$ is the all-1 vector. Thus, $\mathrm{rank}(Y^0) \le \mathrm{rank}(X^0) + 1$. It is assumed that $X^0$ is low-rank; therefore $[X^0, Y^0]$ is also low-rank, because $\mathrm{rank}([X^0, Y^0]) \le \mathrm{rank}(X^0) + 1$. Let $Z = [Z_X, Z_Y] \in \mathbb{R}^{n \times (d+t)}$ be the stacked matrix, with feature block $Z_X$ and label block $Z_Y$. Our goal is to recover this stacked matrix, in which the unknown labels are also imputed as built-in parts of a global matrix. The recovered stacked matrix should be consistent with the observed data. Additionally, $Z$ is desired to be of low rank. Thus, the following constrained optimization problem is obtained:

$$\min_{Z} \ \mathrm{rank}(Z) \quad \text{subject to} \quad \mathrm{sign}([Z_Y]_{ij}) = y_{ij}, \ (i,j) \in \Omega_Y; \qquad [Z_X]_{ij} = x_{ij}, \ (i,j) \in \Omega_X. \tag{1}$$

In our proposed problem model, we substitute the rank function with our proposed smoothed function, and do not relax the hard constraints. Further elaborations on our model and algorithms are provided in the subsequent section.
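The low-rank argument for the stacked matrix can be checked numerically. The following is a minimal sketch, assuming items are stored as rows and the stacked matrix is formed as `[X0, Y0]`; the sizes `n`, `d`, `t`, `r` and all variable names are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t, r = 50, 20, 5, 3  # items, features, labels, rank (illustrative sizes)

# Low-rank pre-feature matrix X0 (n x d) of rank r.
X0 = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))

# Linear soft labels: row i is y0_i = W x0_i + b.
W = rng.standard_normal((t, d))
b = rng.standard_normal(t)
Y0 = X0 @ W.T + np.ones((n, 1)) * b  # n x t

# Stacked matrix, and hard labels generated with the sign function.
Z = np.hstack([X0, Y0])
Y = np.sign(Y0)

rank_X0 = np.linalg.matrix_rank(X0)
rank_Z = np.linalg.matrix_rank(Z)
print(rank_X0, rank_Z)  # rank_Z is at most rank_X0 + 1
```

Appending the label block raises the rank by at most one, which is what makes a single rank objective over the stacked matrix reasonable.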

3 The Proposed Algorithm

The idea behind our algorithm is to approximate the rank function with a smooth function and then improve this approximation by tuning the smoothing function. This concept is introduced in SmoothedBabaieZadeh to solve the MC problem. In general, the rank function is not differentiable, and gradient methods cannot be efficiently applied to problems containing the rank function. Instead, we use a smooth differentiable function to approximate the rank function, which allows us to use gradient methods to optimize the smooth function. Then, we update and tune the parameter of the smooth function to improve the accuracy of our approximation. Let $\sigma(Z) = (\sigma_1(Z), \dots, \sigma_q(Z))^T$ denote the vector containing all of the singular values of the $n_1 \times n_2$ matrix $Z$, where $q = \min(n_1, n_2)$. We have $\mathrm{rank}(Z) = \|\sigma(Z)\|_0$. Also, we have $\mathrm{rank}(Z) = q - \sum_{i=1}^{q} \delta_K(\sigma_i(Z))$, where $\delta_K$ is the Kronecker delta function,

$$\delta_K(x) = \begin{cases} 1, & x = 0, \\ 0, & \text{otherwise.} \end{cases}$$


Next, we seek a class of appropriate functions approximating the rank function. The following definition introduces this class of functions SmoothedBabaieZadeh :

Definition 1.

Qualified Rank Approximation (QRA). A function $f: \mathbb{R} \to \mathbb{R}$ is called a qualified rank approximation if

  1. $f$ is symmetric and analytic,

  2. $f(0) = 1$,

  3. $f$ is concave in a neighborhood of $0$,

  4. $\lim_{|x| \to \infty} f(x) = 0$.

Further, we define $f_\delta(x) = f(x/\delta)$ for $\delta > 0$. Many functions may be found that satisfy the QRA conditions. Throughout this paper, we consider $f(x) = \exp(-x^2/2)$, i.e., $f_\delta(x) = \exp(-x^2/2\delta^2)$, which satisfies the QRA conditions. It can be observed that $f_\delta$ converges in a pointwise fashion to the Kronecker delta function as $\delta \to 0$.

Assume $f$ is a QRA function. Thus, we have

$$\mathrm{rank}(Z) = q - \lim_{\delta \to 0} \sum_{i=1}^{q} f_\delta(\sigma_i(Z)),$$

where $q$ is the number of singular values of $Z$. Now, we define

$$F_\delta(Z) = \sum_{i=1}^{q} f_\delta(\sigma_i(Z)). \tag{4}$$

For small $\delta$, $q - F_\delta(Z)$ is an approximation of the rank function. Instead of minimizing the rank function, we maximize $F_\delta$ over the feasible set, which gives us an approximation of the solution of problem (1). We then shrink $\delta$ and solve the new optimization problem, using the previous solution as a warm start. After iterating this procedure, we obtain a sequence of matrices $\{Z_{\delta_k}\}$, where each term is obtained by optimizing $F_{\delta_k}$ for a fixed $\delta_k$, using the previous solution as the warm start. Since consecutive values of $\delta$ are close to each other and $F_\delta$ is continuous, we expect $Z_{\delta_k}$ and $Z_{\delta_{k+1}}$ to be close to each other w.r.t. the Frobenius norm. On the other hand, we improve the accuracy in each step by shrinking $\delta$, which leads to a better approximation of the rank function. Thus, we expect $Z_{\delta_k}$ to converge to the solution of problem (1) as $\delta_k \to 0$. We show analytically in Section 4 that this sequence of matrices converges to the solution of problem (1). In the rest of this section, we describe the algorithm completely.
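How the smoothed quantity approaches the rank as the smoothing parameter shrinks can be seen in a small experiment. This is a minimal sketch, assuming the Gaussian choice $f_\delta(x) = \exp(-x^2/2\delta^2)$; the function name `smoothed_rank`, the test matrix and the list of `delta` values are illustrative.

```python
import numpy as np

def smoothed_rank(Z, delta):
    """Return q - F_delta(Z), where F_delta(Z) = sum_i exp(-sigma_i^2 / (2 delta^2))."""
    sigma = np.linalg.svd(Z, compute_uv=False)
    return sigma.size - np.sum(np.exp(-sigma**2 / (2.0 * delta**2)))

rng = np.random.default_rng(1)
Z = rng.standard_normal((30, 4)) @ rng.standard_normal((4, 20))  # rank-4 matrix

# Evaluate the approximation for a decreasing sequence of delta values.
approximations = [smoothed_rank(Z, delta) for delta in (100.0, 10.0, 1.0, 1e-2)]
print(approximations)
```

For large `delta` the value is far below the true rank, and it increases toward 4 as `delta` decreases. This is the reason for solving a sequence of problems with shrinking `delta` rather than a single one.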

3.1 Constrained Optimization of the Rank Approximation

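As explained before, for some fixed $\delta$, we solve the following problem, which is obtained by substituting the rank by $q - F_\delta$ in problem (1):

$$\max_{Z} \ F_\delta(Z) \quad \text{subject to} \quad \mathrm{sign}([Z_Y]_{ij}) = y_{ij}, \ (i,j) \in \Omega_Y; \qquad [Z_X]_{ij} = x_{ij}, \ (i,j) \in \Omega_X, \tag{5}$$

where $Z_X$ and $Z_Y$ denote the feature and label blocks of $Z$.

Let $\mathcal{F}$ denote the feasible region in problem (5), and let $Z \in \mathcal{F}$. Consider first an entry $[Z_Y]_{ij}$ of the label block. If $(i,j) \notin \Omega_Y$, we do not have any constraint on $[Z_Y]_{ij}$, i.e., $-\infty < [Z_Y]_{ij} < \infty$. Otherwise, regarding the label constraints, we have $[Z_Y]_{ij} \ge 0$ or $[Z_Y]_{ij} \le 0$ (which can be interpreted as $0 \le [Z_Y]_{ij} < \infty$ or $-\infty < [Z_Y]_{ij} \le 0$). For an entry $[Z_X]_{ij}$ of the feature block, if $(i,j) \notin \Omega_X$, then $[Z_X]_{ij}$ can take any value; otherwise, $[Z_X]_{ij} = x_{ij}$, i.e., $x_{ij} \le [Z_X]_{ij} \le x_{ij}$. Therefore, every entry $z$ of $Z$ has lower and upper bounds of the form $l \le z \le u$, which means $Z$ lies inside a box. Therefore, $\mathcal{F}$ is a convex set. However, problem (5) is generally not a concave maximization problem, since $F_\delta$ is not concave. Recalling the third property of QRA, by choosing an appropriate value for $\delta$, we can convert problem (5) to a locally concave problem and solve it using robust methods. Thus, we assume the values of $\delta$ are chosen appropriately and problem (5) is locally concave. We use the PG technique bertsekas to solve this problem. To this end, we calculate the gradient of $F_\delta$ w.r.t. the matrix $Z$. The gradient is provided by Theorem 1 as follows: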

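Theorem 1.

(SmoothedBabaieZadeh, Thm. 1) Suppose that $F: \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}$ is represented as $F(X) = h(\sigma(X))$, where $X \in \mathbb{R}^{n_1 \times n_2}$ with the Singular Value Decomposition (SVD) $X = U \,\mathrm{diag}(\sigma_1, \dots, \sigma_q)\, V^T$, $\sigma(X)$ contains the singular values of the matrix $X$, $q = \min(n_1, n_2)$, and $h: \mathbb{R}^q \to \mathbb{R}$ is absolutely symmetric and differentiable. Then the gradient of $F$ at $X$ is

$$\frac{\partial F(X)}{\partial X} = U \,\mathrm{diag}(\theta)\, V^T,$$

where $\theta = \left. \frac{\partial h(y)}{\partial y} \right|_{y = \sigma(X)}$.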

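Recalling (4), we have $F_\delta(Z) = h(\sigma(Z))$ with $h(y) = \sum_{i=1}^{q} f_\delta(y_i)$, and since $f_\delta$ is an even differentiable function, $h$ is absolutely symmetric and differentiable. Also, by the definition, for the Gaussian choice $f_\delta(x) = \exp(-x^2/2\delta^2)$,

$$\frac{\partial h(y)}{\partial y_i} = f_\delta'(y_i) = -\frac{y_i}{\delta^2} \exp\!\left(-\frac{y_i^2}{2\delta^2}\right).$$

Denoting $G$ as the direction of movement in the gradient ascent step, we have

$$G = \frac{\partial F_\delta(Z)}{\partial Z} = U \,\mathrm{diag}\!\left(-\frac{\sigma_1}{\delta^2} e^{-\sigma_1^2/2\delta^2}, \dots, -\frac{\sigma_q}{\delta^2} e^{-\sigma_q^2/2\delta^2}\right) V^T, \tag{9}$$

where $Z = U \,\mathrm{diag}(\sigma_1, \dots, \sigma_q)\, V^T$ is the SVD of $Z$.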
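The SVD-based gradient formula can be sanity-checked against finite differences. The sketch below assumes the Gaussian smoothing function $f_\delta(x) = \exp(-x^2/2\delta^2)$; the helper names `F_delta` and `grad_F_delta`, the matrix size and the value of `delta` are illustrative.

```python
import numpy as np

def F_delta(Z, delta):
    """Smoothed objective: sum of exp(-sigma_i^2 / (2 delta^2)) over singular values."""
    sigma = np.linalg.svd(Z, compute_uv=False)
    return np.sum(np.exp(-sigma**2 / (2.0 * delta**2)))

def grad_F_delta(Z, delta):
    """Gradient U diag(theta) V^T with theta_i = f_delta'(sigma_i)."""
    U, sigma, Vt = np.linalg.svd(Z, full_matrices=False)
    theta = -(sigma / delta**2) * np.exp(-sigma**2 / (2.0 * delta**2))
    return (U * theta) @ Vt  # scales the columns of U by theta

rng = np.random.default_rng(2)
Z = rng.standard_normal((6, 4))
delta = 2.0
G = grad_F_delta(Z, delta)

# Central finite differences, one matrix entry at a time.
eps = 1e-6
G_fd = np.zeros_like(Z)
for i in range(Z.shape[0]):
    for j in range(Z.shape[1]):
        E = np.zeros_like(Z)
        E[i, j] = eps
        G_fd[i, j] = (F_delta(Z + E, delta) - F_delta(Z - E, delta)) / (2 * eps)

max_error = np.max(np.abs(G - G_fd))
print(max_error)
```

The two gradients agree up to finite-difference error, so one thin SVD per iteration is enough to obtain the ascent direction.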


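In the next step, we must project the point obtained by moving in the direction of the gradient onto the feasible region of the problem. Projection onto the feasible region $\mathcal{F}$, which is a box, can be described as $P_{\mathcal{F}}(Z) = \arg\min_{Z' \in \mathcal{F}} \|Z - Z'\|_F$. Specifically, this projection acts entrywise and can be described as in (10):

$$[P_{\mathcal{F}}(Z)]_{ij} = \begin{cases} x_{ij}, & \text{feature entry observed in } \Omega_X, \\ \max\{z_{ij}, 0\}, & \text{label entry observed in } \Omega_Y \text{ with label } +1, \\ \min\{z_{ij}, 0\}, & \text{label entry observed in } \Omega_Y \text{ with label } -1, \\ z_{ij}, & \text{otherwise.} \end{cases} \tag{10}$$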
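The entrywise projection onto the box can be written in a few lines. This is a minimal sketch, assuming the stacked matrix stores the feature block first and the label block second, with labels in {-1, +1} and boolean masks marking observed entries; the function name `project_box` and the toy values are illustrative.

```python
import numpy as np

def project_box(Z, X, Y, mask_X, mask_Y):
    """Project Z = [Z_X, Z_Y] onto the feasible box.

    Observed features are reset to their values, observed labels keep their
    sign (entries with the wrong sign are clipped to 0), and all other
    entries are left unchanged.
    """
    d = X.shape[1]
    P = Z.copy()
    ZX, ZY = P[:, :d], P[:, d:]  # views into P
    ZX[mask_X] = X[mask_X]
    pos = mask_Y & (Y > 0)
    neg = mask_Y & (Y < 0)
    ZY[pos] = np.maximum(ZY[pos], 0.0)
    ZY[neg] = np.minimum(ZY[neg], 0.0)
    return P

# Toy example with 2 items, 2 features and 2 labels.
Z = np.array([[0.5, -1.0, -0.3, 0.7],
              [2.0, 0.1, 0.4, -0.2]])
X = np.array([[1.0, 0.0], [0.0, 3.0]])
mask_X = np.array([[True, False], [False, True]])
Y = np.array([[1.0, -1.0], [-1.0, 0.0]])  # 0 marks a missing label here
mask_Y = np.array([[True, True], [True, False]])

P = project_box(Z, X, Y, mask_X, mask_Y)
print(P)
```

Because the feasible set is a box, the projection costs only one pass over the entries and is idempotent: projecting `P` again leaves it unchanged.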


Now, we have described all of the components of PG. The solution of problem (5) is obtained by iterating the PG procedure until convergence is reached. In each iteration, $Z$ is updated as

$$Z \leftarrow P_{\mathcal{F}}(Z + \mu G),$$

where $G$ is defined in (9) and $\mu$ is the gradient ascent step size. This step size can be chosen via cross-validation; we discuss the choice of the step size in Section 5. Algorithm 1 presents the PG procedure for solving the optimization problem (5).

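Algorithm 1 Transductive Imputation of Matrix using Smoothed Rank Function (PG based version) TIM-SRF1
Input:
    Partially observed features matrix $X$.
    Partially observed hard labels matrix $Y$.
    The PG step size $\mu$.
    The decay factor $c$.
Output:
    The estimated features matrix $\hat{X}$.
    The estimated soft labels matrix $\hat{Y}$.
procedure TIM-SRF1
    Initialize $Z$ and $\delta$
    while not converged do
        while not converged do
            Compute the SVD $Z = U \,\mathrm{diag}(\sigma_1, \dots, \sigma_q)\, V^T$
            Compute $G$ as in (9)
            $Z \leftarrow P_{\mathcal{F}}(Z + \mu G)$
        end while
        $\delta \leftarrow c\,\delta$
    end while
    return $\hat{X}$, $\hat{Y}$ (the feature and label blocks of $Z$)
end procedure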
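The overall PG procedure can be sketched end to end as follows. This is not the authors' implementation: fixed iteration counts replace the convergence tests, the initial `delta`, the default `mu` and `c`, and the scaling of the step by `delta**2` are all assumptions made for this illustration, and the toy data are synthetic.

```python
import numpy as np

def grad_F(Z, delta):
    """Gradient of F_delta(Z) = sum_i exp(-sigma_i^2 / (2 delta^2)) via the SVD."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    theta = -(s / delta**2) * np.exp(-s**2 / (2.0 * delta**2))
    return (U * theta) @ Vt

def project(Z, X, Y, mask_X, mask_Y):
    """Clip Z onto the box: observed features fixed, observed label signs kept."""
    d = X.shape[1]
    P = Z.copy()
    P[:, :d][mask_X] = X[mask_X]
    ZY = P[:, d:]  # view into P
    pos, neg = mask_Y & (Y > 0), mask_Y & (Y < 0)
    ZY[pos] = np.maximum(ZY[pos], 0.0)
    ZY[neg] = np.minimum(ZY[neg], 0.0)
    return P

def tim_srf_pg(X, Y, mask_X, mask_Y, mu=1.0, c=0.5, n_outer=8, n_inner=30):
    """Projected gradient ascent on F_delta with a geometrically shrinking delta."""
    n, d = X.shape
    Z = project(np.zeros((n, d + Y.shape[1])), X, Y, mask_X, mask_Y)
    delta = 2.0 * np.linalg.svd(Z, compute_uv=False)[0]  # assumed initial value
    for _ in range(n_outer):        # outer loop: shrink delta
        for _ in range(n_inner):    # inner loop: PG steps for a fixed delta
            step = mu * delta**2 * grad_F(Z, delta)  # delta^2 scaling is an assumption
            Z = project(Z + step, X, Y, mask_X, mask_Y)
        delta *= c
    return Z[:, :d], Z[:, d:]

# Synthetic toy problem: rank-2 features, linear labels, 70% of entries observed.
rng = np.random.default_rng(0)
n, d, t = 40, 10, 3
X = rng.standard_normal((n, 2)) @ rng.standard_normal((2, d))
Y = np.sign(X @ rng.standard_normal((d, t)) + rng.standard_normal(t))
mask_X = rng.random((n, d)) < 0.7
mask_Y = rng.random((n, t)) < 0.7

X_hat, Y_hat = tim_srf_pg(X, Y, mask_X, mask_Y)
```

Since every iteration ends with a projection, the returned blocks always agree with the observed features and with the signs of the observed labels. The unobserved entries are filled in by the low-rank objective.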

In order to enhance the robustness and convergence rate of the proposed algorithm, we have also used a quasi-Newton approach. We leverage the SPG method introduced in birgin2000nonmonotone in Algorithm 2, where $G$ is defined as in (9).

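Algorithm 2 Transductive Imputation of Matrix using Smoothed Rank Function (SPG based version) TIM-SRF2
Input:
    Sets of observed entries $\Omega_X$, $\Omega_Y$.
    Partially observed features matrix $X$.
    Partially observed hard labels matrix $Y$.
    The decay factor $c$.
    The maximum step size $\alpha_{\max}$.
    The minimum step size $\alpha_{\min}$.
    The sufficient decrease parameter $\gamma$.
    The memory size $M$.
Output:
    The estimated features matrix $\hat{X}$.
    The estimated soft labels matrix $\hat{Y}$.
procedure TIM-SRF2
    Initialize $Z$, $\delta$, and the step size $\alpha \in [\alpha_{\min}, \alpha_{\max}]$
    while not converged do
        while not converged do
            $D \leftarrow P_{\mathcal{F}}(Z + \alpha G) - Z$, $\lambda \leftarrow 1$
            Task 1:
            $Z^{+} \leftarrow Z + \lambda D$
            if $F_\delta(Z^{+}) \ge \min\{F_\delta \text{ at the last } M \text{ iterates}\} + \gamma \lambda \langle D, G \rangle$ then
                $S \leftarrow Z^{+} - Z$, $R \leftarrow G - G^{+}$ ($G^{+}$ is $G$ evaluated at $Z^{+}$), $Z \leftarrow Z^{+}$
                goto Task 2
            else
                Reduce $\lambda$
                goto Task 1
            end if
            Task 2:
            $b \leftarrow \langle S, R \rangle$
            if $b \le 0$ then
                $\alpha \leftarrow \alpha_{\max}$
            else
                $\alpha \leftarrow \min\{\alpha_{\max}, \max\{\alpha_{\min}, \langle S, S \rangle / b\}\}$
            end if
        end while
        $\delta \leftarrow c\,\delta$
    end while
    return $\hat{X}$, $\hat{Y}$ (the feature and label blocks of $Z$)
end procedure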
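The two ingredients of SPG, the nonmonotone line search and the spectral (Barzilai-Borwein) step size, can be illustrated on a generic box-constrained maximization. The sketch below follows the general scheme of birgin2000nonmonotone adapted to maximization; the function name `spg_maximize`, the halving rule for the line-search parameter, the default parameter values and the quadratic test objective are assumptions made for this illustration.

```python
import numpy as np

def spg_maximize(F, grad, project, Z, alpha_min=1e-10, alpha_max=1e10,
                 gamma=1e-4, M=5, n_iter=100, tol=1e-10):
    """Spectral projected gradient ascent with a nonmonotone line search."""
    alpha = 1.0
    G = grad(Z)
    history = [F(Z)]
    for _ in range(n_iter):
        D = project(Z + alpha * G) - Z          # projected ascent direction
        if np.linalg.norm(D) <= tol:
            break
        reference = min(history[-M:])           # worst of the last M objective values
        slope = np.sum(D * G)
        lam = 1.0
        for _ in range(50):                     # Task 1: backtracking
            Z_new = Z + lam * D
            F_new = F(Z_new)
            if F_new >= reference + gamma * lam * slope:
                break
            lam *= 0.5                          # assumed reduction rule
        G_new = grad(Z_new)
        S, R = Z_new - Z, G - G_new
        b = np.sum(S * R)                       # Task 2: spectral step size
        if b <= 0:
            alpha = alpha_max
        else:
            alpha = min(alpha_max, max(alpha_min, np.sum(S * S) / b))
        Z, G = Z_new, G_new
        history.append(F_new)
    return Z

# Concave test problem: maximize -||Z - A||_F^2 over the box [0, 1].
A = np.array([[1.5, -0.3], [0.4, 0.9]])
Z_star = spg_maximize(
    F=lambda Z: -np.sum((Z - A) ** 2),
    grad=lambda Z: -2.0 * (Z - A),
    project=lambda Z: np.clip(Z, 0.0, 1.0),
    Z=np.zeros_like(A),
)
print(Z_star)
```

On this test problem the maximizer is the entrywise clipping of `A` to the box, and the iteration reaches it in a few steps. Replacing `F`, `grad` and `project` with the smoothed rank objective, its SVD-based gradient and the label/feature box gives the inner loop of an SPG-style variant.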

4 Convergence Analysis

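In this section, we investigate the convergence of the algorithms proposed in the previous section. We start by finding reasonable conditions under which the solution of problem (1) is unique. Unlike the problem in SmoothedBabaieZadeh , where all the constraints are affine, the first constraint in problem (1) is nonaffine. We define a secondary problem as in (12) by considering only the affine constraints:

$$\min_{Z} \ \mathrm{rank}(Z) \quad \text{subject to} \quad [Z_X]_{ij} = x_{ij}, \ (i,j) \in \Omega_X, \tag{12}$$

where $Z_X$ denotes the feature block of $Z$. Let $\mathcal{F}'$ denote the feasible region of problem (12). Define $\mathcal{P}: \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^{n_1 \times n_2}$, with $n_1 \times n_2$ the size of $Z$, as

$$[\mathcal{P}(Z)]_{ij} = \begin{cases} z_{ij}, & \text{if } z_{ij} \text{ is a feature entry observed in } \Omega_X, \\ 0, & \text{otherwise.} \end{cases}$$

It can be verified that $\mathcal{P}$ is a linear operator. Also, we define the linear operator $\mathrm{vec}: \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^{n_1 n_2}$ as an operator which gets a matrix and vectorizes it. Finally, we define $\mathcal{A}: \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^{n_1 n_2}$ as

$$\mathcal{A}(Z) = \mathrm{vec}(\mathcal{P}(Z)).$$

This operator is linear, as it is a composition of two linear operators. We can rewrite problem (12) as

$$\min_{Z} \ \mathrm{rank}(Z) \quad \text{subject to} \quad \mathcal{A}(Z) = b,$$

where $b$ represents the constraint constants. Now, consider the following definition.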

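Definition 2.

Spherical Section Property SSP . The spherical section constant of a linear operator $\mathcal{A}$ is defined as

$$\Delta(\mathcal{A}) = \min_{Z \in \mathrm{null}(\mathcal{A}) \setminus \{0\}} \frac{\|Z\|_*^2}{\|Z\|_F^2}.$$

Further, $\mathcal{A}$ is said to have the $\Delta$-spherical section property if $\Delta(\mathcal{A}) \ge \Delta$.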
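The ratio inside the spherical section constant is easy to evaluate for a given matrix, which helps to build intuition for the definition. The sketch below only computes the ratio of the squared nuclear norm to the squared Frobenius norm for two example matrices; it does not compute the minimum over a null space, and the name `nuclear_frobenius_ratio` is illustrative.

```python
import numpy as np

def nuclear_frobenius_ratio(Z):
    """Squared nuclear norm divided by squared Frobenius norm of Z."""
    s = np.linalg.svd(Z, compute_uv=False)
    return np.sum(s) ** 2 / np.sum(s ** 2)

rank_one = np.outer([1.0, 2.0, 3.0], [1.0, -1.0])  # a rank-1 matrix
identity = np.eye(4)                                # rank 4, equal singular values

ratio_rank_one = nuclear_frobenius_ratio(rank_one)
ratio_identity = nuclear_frobenius_ratio(identity)
print(ratio_rank_one, ratio_identity)
```

The ratio always lies between 1 and the rank of the matrix: it equals 1 for rank-one matrices and equals the rank when all nonzero singular values are equal. A large spherical section constant therefore means that the null space of the operator contains no low-rank matrices.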

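It has been proven in mohimani2009fast that if all entries of the matrix representation of $\mathcal{A}$ are independently and identically distributed from a zero-mean, unit-variance Gaussian distribution, then $\mathcal{A}$ has the $\Delta$-spherical section property with high probability under some reasonable conditions.

We add two assumptions to our problem.
Assumption 1: $\mathcal{A}$ has the $\Delta$-spherical section property.
Assumption 2: There exists $Z_0 \in \mathcal{F}$, where $\mathcal{F}$ is the feasible region of problem (1), such that $\mathrm{rank}(Z_0) < \Delta/2$.
We have $\mathcal{F} \subseteq \mathcal{F}'$ because we have ignored the first constraint in problem (1) to obtain $\mathcal{F}'$. Thus $Z_0 \in \mathcal{F}'$. Recalling part (a) of Theorem 2.1 of SSP , for all $Z \in \mathcal{F}' \setminus \{Z_0\}$ we have $\mathrm{rank}(Z) > \mathrm{rank}(Z_0)$. Therefore, for all $Z \in \mathcal{F} \setminus \{Z_0\}$ we have $\mathrm{rank}(Z) > \mathrm{rank}(Z_0)$, and this proves uniqueness of the global solution of problem (1).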

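Let $Z_\delta$ denote the global solution of problem (5) for some fixed $\delta$. Our next goal is to show that $\lim_{\delta \to 0} Z_\delta = Z_0$. This is done in the following theorem.

Theorem 2.

Assume $\mathcal{A}$ has the $\Delta$-spherical section property, $f$ is a QRA, $Z_0$ is defined as in Assumption 2, and $\mathcal{F}$, $\mathcal{F}'$, and $F_\delta$ are defined as before. If $Z_\delta$ represents the maximizer of $F_\delta$ over $\mathcal{F}$, then

$$\lim_{\delta \to 0} Z_\delta = Z_0.$$

We have

$$F_\delta(Z_\delta) \ge F_\delta(Z_0) \ge q - r_0, \tag{18}$$

where $r_0 = \mathrm{rank}(Z_0)$ and $q$ is the number of singular values of $Z_0$. The first inequality is correct since $Z_\delta$ is the maximizer of $F_\delta$ over $\mathcal{F}$. The second inequality is correct since $\mathrm{rank}(Z_0) = r_0$ and therefore $Z_0$ has $q - r_0$ zero singular values. Considering the definition of $f_\delta$, we have $f_\delta(0) = 1$, and recalling (4), we have $F_\delta(Z_0) \ge q - r_0$.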
Taking Lemmas 3 and 4 of SmoothedBabaieZadeh into account, (18) leads to the bound (19) on the distance between $Z_\delta$ and $Z_0$. As $\delta \to 0$, this bound converges to 0 and, consequently, $Z_\delta \to Z_0$.

5 Simulation Results

In this section, we provide simulations to compare our proposed algorithms to state-of-the-art ones on three well-known real datasets. Several studies have been conducted to address the transduction with MC task. We describe the datasets and the methods considered in our simulations in the following two subsections.

5.1 Datasets

  • Yeast: This biological dataset is studied for the yeast gene functional classification task by Elisseeff and Weston in elisseeff2002kernel . This dataset consists of instances, features, and labels. The instance-feature matrix is a relatively large, skinny matrix, which leads to better MC accuracy.

  • CAL500: This dataset provides a collection of semantic information about music turnbull2008semantic . It includes songs (instances), features, and labels. In this dataset, the ratio of the number of labels to the number of features is large. Therefore, concatenating the labels and the data matrix becomes significantly profitable in this scenario. In other words, working on the data matrix independently in a separate phase ignores numerous labels, although these labels can be extremely helpful in imputation and prediction.

  • Music Emotions: This dataset is used to discover the emotions present in different pieces of songs. It contains songs (instances) and features. The labels represent the emotions described in trohidis2008multi by Trohidis, et al.

5.2 Methods Investigated in the Simulations

We consider the following methods in our simulations, as they are regarded as state-of-the-art methods in the literature.

  • MC-1: Goldberg, et al., formulated the problem for the first time in goldberg2010transduction , leveraging the low-rank assumption for the underlying matrix. A modified fixed-point continuation method was employed to tackle the multi-label transduction with MC task, and they achieved noticeable accuracy results.

  • Maxide: This method is introduced by Xu, et al., in xu2013speedup . Their proposed method, called Maxide, uses side information for MC. One of the applications stated in xu2013speedup is multi-label learning. They devised a method that is efficient in terms of computational runtime and that also enhances the accuracy in their own simulation setting, which is discussed in 5.3.2 among our simulation settings.

  • SRF+SVM: In this method, direct imputation by concatenation of the labels and the data is not employed. In farhangfar2008impact , indirect approaches are studied in different cases. Taking a similar approach, indirect (two-phase) prediction is carried out by initial MC on the data followed by SVM. The MC approach we use for this method is the algorithm introduced in SmoothedBabaieZadeh . We intentionally use this approach since the smoothed rank function is the basis of the SRF MC method, which keeps it compatible with the direction of this paper. The purpose of providing simulations for this method is mainly to compare the direct imputation and the two-phase approaches on diverse datasets.

  • TIM-SRF: TIM-SRF is our proposed method. We have provided two algorithms for the implementation of TIM-SRF. In TIM-SRF1, we use the projected gradient method for minimizing the smoothed rank function under the given constraints. In Algorithm 1, $\mu$ is the gradient ascent step size and $c$ is the decay factor; both are selected using cross-validation. Next, we leverage a quasi-Newton based approach in TIM-SRF2 for the same constrained optimization problem, not only to reduce the computational runtime but also to enhance the accuracy in certain cases. In 5.3.1 and 5.3.2, we illustrate the superiority of our methods in terms of accuracy, as well as the additional advantage of reduced complexity in specific cases. In TIM-SRF2, the parameters $\alpha_{\max}$ and $\alpha_{\min}$ are the maximum and minimum thresholds of the step size; one of them is fixed and the other is chosen using cross-validation. $M$ is the memory size, which is fixed in our simulations for the sake of reducing the computational runtime. $\gamma$ is the sufficient decrease parameter in the backtracking algorithm, which is set to a typical value in our simulations.

5.3 Missing Scenarios

Two main sets of simulations are considered, each representing a different missing pattern. The results of these two scenarios are provided in Tables 1 and 2, respectively. We discuss the simulation results in two subsections. The evaluation of our proposed methods and the other algorithms is based on the area under the curve (AUC). The computational runtime is measured in seconds on an Intel(R) Core(TM) i7-2600K CPU @ 3.40 GHz system.
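For reference, the AUC of a set of soft-label scores against hard labels can be computed directly from its pairwise-ranking interpretation. The sketch below is a generic implementation, assuming labels in {-1, +1}; how the AUC is aggregated over labels and trials in the reported tables is not specified here, and the function name `auc` is illustrative.

```python
import numpy as np

def auc(scores, labels):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as one half.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels > 0]
    neg = scores[labels <= 0]
    diff = pos[:, None] - neg[None, :]  # all positive-negative score differences
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, -1, -1])
partial = auc([0.9, 0.4, 0.6, 0.1], [1, 1, -1, -1])
print(perfect, partial)
```

In the second call, three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75. Because the measure depends only on the ordering of the scores, the recovered soft labels can be evaluated without choosing a decision threshold.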

5.3.1 Random Missing Pattern

First, we assume the missing entries are uniformly selected from the concatenated data. This setting is considered in goldberg2010transduction , where the sampling method on the labels is completely at random. The results of simulations for this scenario are reported in Table 1 for four observation percentage values, referred to below in the order in which they are listed in the table. We provide detailed analyses of the results as follows. On the Music Emotions data, TIM-SRF2 outperforms the other methods in terms of both accuracy and computational runtime. TIM-SRF1 performs close to TIM-SRF2, with slight inferiority, and is second in terms of AUC except for the last observation rate, where the MC-1 method is second best by a slight margin. On the CAL500 dataset, the best accuracy for the first two observation rates belongs to TIM-SRF2. For the remaining rates, TIM-SRF1 outperforms the other methods. TIM-SRF1 also has the lowest runtime in the CAL500 case. On the Yeast dataset, TIM-SRF2 outperforms the other methods for the first three observation rates. TIM-SRF1 achieves the best accuracy for the last rate, while the best runtime is achieved by the Maxide algorithm. It is worth noting that TIM-SRF2 is faster than TIM-SRF1 on the Music Emotions and Yeast datasets.

5.3.2 Random Missing Pattern + Block loss on Labels

In this scenario, in addition to the random missing mask, 10% of the labels are chosen as a whole block which is entirely missing, i.e., ten percent of the instances do not have any assigned labels and are therefore considered as the test part. Again, the same observation percentage values are considered in this scenario. It is worth noting that the random label rows which are selected to be omitted could be merged together and considered as a whole block loss. On the Music Emotions dataset, the Maxide method outperforms the other methods except for the last observation rate in Table 2, where SRF+SVM shows the best performance. The lowest runtime belongs to TIM-SRF2. The accuracy of TIM-SRF2 is close to that of Maxide, and both TIM-SRF methods are more accurate than MC-1. On the CAL500 dataset, the Maxide algorithm achieves the highest accuracy. The second best accuracy goes to TIM-SRF2. In terms of runtime, TIM-SRF1 and TIM-SRF2 are the fastest methods of all. On the Yeast dataset, the SRF+SVM method has the highest accuracy. This observation can be explained as follows: since there is a block loss in the labels in this scenario, the methods which concatenate the two matrices may not perform well, because the block loss adversely affects their performance. However, the SRF+SVM method performs the initial completion phase on the data matrix only and is therefore more effective in completion, since the block loss is not taken into account. The second phase is the SVM, which is used for the prediction. SVM is computationally complex and, as a result, the runtime of this method is far larger than that of the other methods, although the accuracy is improved. The other methods show superior performance when the labels are not subject to block loss. The second best method for the first observation rate is Maxide. For the rest of the rates, TIM-SRF2 has the second best accuracy. In terms of complexity, Maxide ranks second.
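The two missing patterns can be generated with boolean masks. This is a minimal sketch, assuming instances are stored as rows, entries are kept independently with probability `omega`, and a fraction `block_fraction` of the instances lose all of their labels; the function name `make_masks` and its parameters are illustrative and not taken from the paper.

```python
import numpy as np

def make_masks(n, d, t, omega, block_fraction=0.0, seed=0):
    """Return observation masks for features and labels, plus the test rows.

    Each entry is observed independently with probability omega. In addition,
    all labels of a random subset of rows are hidden (block loss on labels).
    """
    rng = np.random.default_rng(seed)
    mask_X = rng.random((n, d)) < omega
    mask_Y = rng.random((n, t)) < omega
    n_block = int(round(block_fraction * n))
    test_rows = np.sort(rng.permutation(n)[:n_block])
    mask_Y[test_rows, :] = False
    return mask_X, mask_Y, test_rows

# Scenario without block loss, and scenario with 10% of the label rows removed.
mask_X1, mask_Y1, rows1 = make_masks(100, 20, 5, omega=0.6)
mask_X2, mask_Y2, rows2 = make_masks(100, 20, 5, omega=0.6, block_fraction=0.1)
print(rows1.size, rows2.size)
```

The rows returned in `rows2` play the role of the test instances: none of their labels is observed, so their predicted labels rely entirely on the recovered features and on the relation learned from the remaining rows.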

| Dataset | Method | AUC(%) | time(s) | AUC(%) | time(s) | AUC(%) | time(s) | AUC(%) | time(s) |
|---|---|---|---|---|---|---|---|---|---|
| Music Emotions | TIM-SRF2 | 87.4 (1.03) | 0.32 | 82.8 (0.8) | 0.27 | 76.0 (1.1) | 0.27 | 63.1 (1.6) | 0.25 |
| | TIM-SRF1 | 86.5 (1.0) | 0.73 | 78.4 (2.7) | 0.73 | 73.6 (1.5) | 0.69 | 61.8 (1.8) | 0.61 |
| | MC-1 | 80.2 (2.4) | 0.36 | 76.3 (1.6) | 0.39 | 72.6 (0.8) | 0.35 | 62.3 (1.9) | 0.31 |
| | Maxide | 76.0 (2.0) | 2.9 | 71.1 (1.5) | 2.29 | 65.2 (1.4) | 1.64 | 56.4 (1.4) | 0.84 |
| | SRF+SVM | 70.0 (1.8) | 8.84 | 67.0 (1.3) | 9.20 | 63.6 (1.6) | 9.0 | 58.8 (1.5) | 8.25 |
| Yeast | TIM-SRF2 | 95.3 (0.2) | 1.18 | 90.8 (0.2) | 1.2 | 85.2 (0.3) | 1.22 | 74.7 (0.5) | 1.26 |
| | TIM-SRF1 | 94.8 (0.2) | 1.50 | 90.0 (0.2) | 1.55 | 84.5 (0.3) | 2.00 | 75.1 (0.4) | 2.01 |
| | MC-1 | 92.1 (0.2) | 1.62 | 88.5 (0.2) | 1.69 | 83.8 (0.3) | 1.73 | 73.6 (0.4) | 1.66 |
| | Maxide | 64.9 (0.8) | 0.07 | 63.3 (0.5) | 0.05 | 60.8 (0.6) | 0.03 | 57.4 (0.6) | 0.02 |
| | SRF+SVM | 72.8 (0.6) | 700.2 | 71.4 (0.4) | 711.8 | 69.6 (0.6) | 689.6 | 67.9 (0.6) | 686.0 |
| CAL500 | TIM-SRF2 | 90.4 (0.2) | 1.33 | 87.8 (0.2) | 1.36 | 82.9 (0.4) | 1.38 | 72.7 (0.7) | 1.37 |
| | TIM-SRF1 | 87.6 (0.3) | 0.34 | 85.9 (0.4) | 0.35 | 83.2 (0.2) | 0.36 | 77.6 (0.2) | 0.36 |
| | MC-1 | 89.8 (0.3) | 1.89 | 85.5 (0.2) | 1.88 | 78.6 (0.3) | 1.84 | 68.1 (0.4) | 1.76 |
| | Maxide | 78.4 (0.4) | 14.54 | 76.4 (0.4) | 11.33 | 74.1 (0.3) | 8.28 | 71.3 (0.5) | 5.26 |
| | SRF+SVM | 59.7 (0.5) | 13.95 | 50.7 (0.5) | 12.4 | 59.5 (0.3) | 10.31 | 59.4 (0.6) | 7.97 |

Table 1: Simulation results for the scenario of Section 5.3.1 in terms of AUC and simulation time. Each AUC/time column pair corresponds to one observation rate.
| Dataset | Method | AUC(%) | time(s) | AUC(%) | time(s) | AUC(%) | time(s) | AUC(%) | time(s) |
|---|---|---|---|---|---|---|---|---|---|
| Music Emotions | TIM-SRF2 | 72.2 (3.6) | 0.25 | 65.8 (4.1) | 0.25 | 61.9 (2.3) | 0.24 | 55.1 (3.1) | 0.24 |
| | TIM-SRF1 | 72.0 (3.8) | 0.64 | 65.8 (3.6) | 0.62 | 61.9 (2.0) | 0.62 | 55.5 (3.6) | 0.61 |
| | MC-1 | 65.1 (3.9) | 0.34 | 60.0 (3.4) | 0.31 | 58.0 (2.3) | 0.30 | 54.5 (3.2) | 0.29 |
| | Maxide | 76.0 (2.3) | 2.22 | 70.0 (3.8) | 1.91 | 63.9 (2.7) | 1.66 | 55.6 (5.2) | 1.13 |
| | SRF+SVM | 71.3 (2.5) | 7.8 | 67.4 (4.4) | 7.70 | 63.0 (3.0) | 7.61 | 59.4 (2.6) | 7.68 |
| Yeast | TIM-SRF2 | 63.3 (1.3) | 0.84 | 62.4 (2.2) | 0.86 | 61.1 (2.3) | 0.83 | 58.0 (1.2) | 0.86 |
| | TIM-SRF1 | 62.3 (0.7) | 1.62 | 61.3 (1.6) | 2.10 | 59.7 (1.4) | 1.46 | 56.3 (0.9) | 1.44 |
| | MC-1 | 61.9 (0.7) | 1.78 | 61.1 (1.6) | 1.77 | 59.4 (1.4) | 1.74 | 56.3 (0.9) | 1.70 |
| | Maxide | 63.6 (1.1) | 0.07 | 61.9 (2.2) | 0.05 | 60.2 (1.9) | 0.03 | 56.4 (1.5) | 0.01 |
| | SRF+SVM | 71.9 (0.9) | 695.6 | 71.1 (1.5) | 694.7 | 70.5 (1.2) | 692.3 | 68.0 (0.9) | 691.0 |
| CAL500 | TIM-SRF2 | 75.2 (1.4) | 1.24 | 71.6 (2.2) | 1.24 | 69.9 (1.0) | 1.22 | 66.4 (1.0) | 1.22 |
| | TIM-SRF1 | 73.9 (1.4) | 1.12 | 68.4 (2.0) | 1.11 | 66.9 (0.8) | 1.10 | 65.4 (1.2) | 1.12 |
| | MC-1 | 67.5 (0.6) | 2.05 | 61.0 (1.4) | 2.00 | 58.3 (1.4) | 1.96 | 54.9 (0.9) | 1.89 |
| | Maxide | 77.4 (0.8) | 13.29 | 75.2 (0.8) | 10.26 | 73.3 (0.8) | 7.35 | 70.5 (0.9) | 4.35 |
| | SRF+SVM | 59.9 (0.5) | 14.47 | 59.3 (0.5) | 12.67 | 59.2 (0.7) | 10.5 | 58.9 (0.5) | 8.14 |

Table 2: Simulation results for the scenario of Section 5.3.2 in terms of AUC and simulation time. Each AUC/time column pair corresponds to one observation rate.

6 Conclusion

In this paper, the general problem of semi-supervised multi-label learning is addressed. We take advantage of concatenating the label and feature matrices to enhance the accuracy of imputation. We have proposed a new optimization model based on the Smoothed Rank Function (SRF) approximation. Two novel algorithms (TIM-SRF1 and TIM-SRF2) are proposed using the Projected Gradient (PG) and Spectral Projected Gradient (SPG) methods. These methods are employed to reduce the complexity, as they are computationally efficient. We have provided convergence analysis for our algorithms as well.
Our simulation results reveal the robustness and superiority of our proposed algorithms in prediction accuracy in various settings. We have run simulations on real datasets in two main scenarios:

  • Random Missing Pattern

  • Random Missing Pattern + block loss on Labels

Low observation rates are common in practical settings. Our simulations in the first scenario illustrate that the proposed algorithms improve on the accuracy of state-of-the-art methods by up to 10% in such cases. Moreover, for higher observation rates, the AUC is also enhanced on average. The computational runtime of TIM-SRF2 is also considerably lower than that of several other methods in the first scenario. In the second scenario, in spite of a slightly lower AUC in comparison to Maxide, TIM-SRF1 and TIM-SRF2 outperform Maxide in terms of complexity in some cases.


This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.



References

  • (1) M. Castelli, L. Vanneschi, Á. R. Largo, Supervised learning: Classification, in: Reference Module in Life Sciences, Elsevier, 2018.
  • (2) F. Marvasti, Nonuniform sampling: theory and practice, Springer Science & Business Media, 2012.
  • (3) F. Marvasti, M. Mashhadi, Wideband analog to digital conversion by random or level crossing sampling, US Patent 9,729,160 (Aug. 8 2017).
  • (4) E. J. Candès, B. Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics 9 (6) (2009) 717.
  • (5) M. Malek-Mohammadi, M. Babaie-Zadeh, A. Amini, C. Jutten, Recovery of low-rank matrices under affine constraints via a smoothed rank function, IEEE Transactions on Signal Processing 62 (4) (2014) 981–992.
  • (6) M. B. Mashhadi, S. Gazor, N. Rahnavard, F. Marvasti, Feedback acquisition and reconstruction of spectrum-sparse signals by predictive level comparisons, IEEE Signal Processing Letters 25 (4) (2018) 496–500. doi:10.1109/LSP.2018.2801836.
  • (7) A. Farhangfar, L. Kurgan, J. Dy, Impact of imputation of missing values on classification error for discrete data, Pattern Recognition 41 (12) (2008) 3692–3705.

  • (8) Z.-g. Liu, Q. Pan, J. Dezert, A. Martin, Adaptive imputation of missing values for incomplete pattern classification, Pattern Recognition 52 (2016) 85–95.
  • (9) Y. Liu, K. Wen, Q. Gao, X. Gao, F. Nie, Svm based multi-label learning with missing labels for image annotation, Pattern Recognition 78 (2018) 307–317.
  • (10) F. Shang, L. Jiao, Y. Liu, H. Tong, Semi-supervised learning with nuclear norm regularization, Pattern Recognition 46 (8) (2013) 2323–2336.
  • (11) M. A. Kiasari, G.-J. Jang, M. Lee, Novel iterative approach using generative and discriminative models for classification with missing features, Neurocomputing 225 (2017) 23–30.
  • (12) A. Goldberg, B. Recht, J. Xu, R. Nowak, X. Zhu, Transduction with matrix completion: Three birds with one stone, in: Advances in neural information processing systems, 2010, pp. 757–765.
  • (13) M. Xu, R. Jin, Z.-H. Zhou, Speedup matrix completion with side information: Application to multi-label learning, in: Advances in Neural Information Processing Systems, 2013, pp. 2301–2309.
  • (14) E. G. Birgin, J. M. Martínez, M. Raydan, Nonmonotone spectral projected gradient methods on convex sets, SIAM Journal on Optimization 10 (4) (2000) 1196–1211.
  • (15) Q. Wang, L. Ruan, Z. Zhang, L. Si, Learning compact hashing codes for efficient tag completion and prediction, in: Proceedings of the 22nd ACM international conference on Information & Knowledge Management, ACM, 2013, pp. 1789–1794.
  • (16) Y. Luo, T. Liu, D. Tao, C. Xu, Multiview matrix completion for multilabel image classification, IEEE Transactions on Image Processing 24 (8) (2015) 2355–2368.
  • (17) Z. Lin, G. Ding, M. Hu, J. Wang, X. Ye, Image tag completion via image-specific and tag-specific linear sparse reconstructions, in: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, IEEE, 2013, pp. 1618–1625.
  • (18) N. Natarajan, I. S. Dhillon, Inductive matrix completion for predicting gene-disease associations, Bioinformatics 30 (12) (2014) i60–i68. doi:10.1093/bioinformatics/btu269.
  • (19) X. Alameda-Pineda, Y. Yan, E. Ricci, O. Lanz, N. Sebe, Analyzing free-standing conversational groups: A multimodal approach, in: Proceedings of the 23rd ACM international conference on Multimedia, ACM, 2015, pp. 5–14.
  • (20) B. Wu, S. Lyu, B. Ghanem, Constrained submodular minimization for missing labels and class imbalance in multi-label learning., in: AAAI, 2016, pp. 2229–2236.
  • (21) M. Fan, D. Zhao, Q. Zhou, Z. Liu, T. F. Zheng, E. Y. Chang, Distant supervision for relation extraction with matrix completion, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, 2014, pp. 839–849.
  • (22) R. J. Little, D. B. Rubin, Statistical analysis with missing data, Vol. 333, John Wiley & Sons, 2014.
  • (23) D. P. Bertsekas, Nonlinear programming, Athena scientific Belmont, 1999.
  • (24) K. Dvijotham, M. Fazel, A nullspace analysis of the nuclear norm heuristic for rank minimization, in: Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, IEEE, 2010, pp. 3586–3589.

  • (25) H. Mohimani, M. Babaie-Zadeh, C. Jutten, A fast approach for overcomplete sparse decomposition based on smoothed l0 norm, IEEE Transactions on Signal Processing 57 (1) (2009) 289–301.
  • (26) A. Elisseeff, J. Weston, A kernel method for multi-labelled classification, in: Advances in neural information processing systems, 2002, pp. 681–687.
  • (27) D. Turnbull, L. Barrington, D. Torres, G. Lanckriet, Semantic annotation and retrieval of music and sound effects, IEEE Transactions on Audio, Speech, and Language Processing 16 (2) (2008) 467–476.
  • (28) K. Trohidis, G. Tsoumakas, G. Kalliris, I. P. Vlahavas, Multi-label classification of music into emotions., in: ISMIR, Vol. 8, 2008, pp. 325–330.