LASS: a simple assignment model with Laplacian smoothing

05/23/2014, by Miguel Á. Carreira-Perpiñán et al.

We consider the problem of learning soft assignments of N items to K categories given two sources of information: an item-category similarity matrix, which encourages items to be assigned to categories they are similar to (and to not be assigned to categories they are dissimilar to), and an item-item similarity matrix, which encourages similar items to have similar assignments. We propose a simple quadratic programming model that captures this intuition. We give necessary conditions for its solution to be unique, define an out-of-sample mapping, and derive a simple, effective training algorithm based on the alternating direction method of multipliers. The model predicts reasonable assignments from even a few similarity values, and can be seen as a generalization of semisupervised learning. It is particularly useful when items naturally belong to multiple categories, as for example when annotating documents with keywords or pictures with tags, with partially tagged items, or when the categories have complex interrelations (e.g. hierarchical) that are unknown.


1 Introduction

A major success in machine learning in recent years has been the development of semisupervised learning (SSL) (Chapelle et al., 2006), where we are given labels for only a few of the training points. Many SSL approaches rely on a neighborhood graph constructed on the training data (labeled and unlabeled), typically weighted with similarity values. The Laplacian of this graph is used to construct a quadratic nonnegative function that measures the agreement of possible labelings with the graph structure, and minimizing it given the existing labels has the effect of propagating them over the graph. Laplacian-based formulations are conceptually simple, computationally efficient (since the Laplacian is usually sparse), have a solid foundation in graph theory and linear algebra (Chung, 1997; Doyle and Snell, 1984), and most importantly work very well in practice. The graph Laplacian has been widely exploited in machine learning, computer vision and graphics, and other areas: as mentioned, in semisupervised learning, manifold regularization and graph priors (Zhu et al., 2003; Belkin et al., 2006; Zhou et al., 2004) for regression, classification and applications such as supervised image segmentation (Grady, 2006), where one solves a Laplacian-based linear system; in spectral clustering (Shi and Malik, 2000), possibly with constraints (Lu and Carreira-Perpiñán, 2008), and spectral dimensionality reduction (Belkin and Niyogi, 2003) and probabilistic spectral dimensionality reduction (Carreira-Perpiñán and Lu, 2007), where one uses eigenvectors of the Laplacian; in clustering, manifold denoising and surface smoothing (Carreira-Perpiñán, 2006; Wang and Carreira-Perpiñán, 2010; Taubin, 1995), where one iterates products of the data with the Laplacian; etc.

We concern ourselves with assignment problems in a semisupervised learning setting, where we have items and categories and we want to find soft assignments of items to categories given some information. This information often takes the form of partial tags or annotations, e.g. for pictures in websites such as Flickr, blog entries, etc. Let us consider a specific example where the items are documents (e.g. papers submitted to this conference) and the categories are keywords. Any given paper will likely be associated to a larger or smaller extent with many keywords, but most authors will tag their papers with only a few of them, usually the most distinctive (although, as we know, there may be other reasons). Thus, few papers will be tagged as “computer science” or “machine learning” because those keywords are perceived as redundant given, say, “semisupervised learning”. However, considered in a larger context (e.g. to include biology papers), such keywords would be valuable. Besides, categories may have various correlations that are unknown to us but that affect the assignments. For example, a hierarchical structure implies that “machine learning” belongs to “computer science” (although it does to “applied maths” to some extent as well). In general, we consider categories as sets having various intersection, inclusion and exclusion relations. Section 6 illustrates this in an example. Finally, it is sometimes practical to tag an item as not associated with a certain category, e.g. “this paper is not about regression” or “this patient does not have fever”, particularly if this helps to make it distinctive. In summary, in this type of application, it is impractical for an item to be fully labeled over all categories, but it is natural for it to be associated or disassociated with a few categories. This can be coded with item-category similarity values that are positive or negative, respectively, with the magnitude indicating the degree of association, and zero meaning indifference or ignorance. We call this source of information, which is specific for each item irrespective of other items, the wisdom of the expert.

We also consider another practical source of information. Usually it is easy to construct a similarity of a given item to other items, at least to its nearest neighbors. For example, with documents or images, this could be based on a bag-of-words representation. We would expect similar items to have similar assignment vectors, and this can be captured with an item-item similarity matrix and its graph Laplacian. We call this source of information, which is about an item in the context of other items, the wisdom of the crowd.

As an example of the interaction of these two sources of information, imagine a setting where the items are conference papers and the categories are authors. We know that author A1 writes papers about regression (usually with author A2) or bioinformatics (usually with author A3). We are sent for review a paper that is about regression (it contains many words about regression) and that, we are tipped off, is by A1. Can we guess its coauthors, i.e., predict its assignments to all existing authors? Based on the crowd wisdom alone, many authors could have written the paper (those who write regression papers). Based on the expert wisdom alone, A2 or A3 may have cowritten the paper (since each of them has coauthored other papers with A1). Given both wisdoms, we might expect a high assignment for A2 (and A1) and low for everybody else.

In this paper, we propose a simple model that captures this intuition as a quadratic program. We give some properties of the solution, define an out-of-sample mapping, derive a training algorithm, and illustrate the model with document and image datasets.

A shorter version of this work appears in a conference paper (Carreira-Perpiñán and Wang, 2014).

2 The Laplacian assignment (LASS) model

We consider the following assignment problem. We have N items and K categories, and we want to determine soft assignments z_nk of each item n = 1,…,N to each category k = 1,…,K, where z_nk ≥ 0 and ∑_{k=1}^K z_nk = 1 for each n. We are given two similarity matrices, suitably defined, and typically sparse: an item-item similarity matrix W = (w_nm), which is an N×N matrix of affinities between each pair of items n and m; and an item-category similarity matrix G = (g_nk), which is an N×K matrix of affinities between each pair of item n and category k (negative affinities, i.e., dissimilarities, are allowed in G).

We want to assign items to categories optimally as follows:

(1a)    min_Z   λ tr(Z^T L Z) − tr(G^T Z)
(1b)    s.t.    Z 1_K = 1_N
(1c)            Z ≥ 0

where λ ≥ 0, 1_K (resp. 1_N) is a vector of K (resp. N) ones, and L is the N×N graph Laplacian matrix, obtained as L = D − W, where D = diag(W 1_N) is the degree matrix of the weighted graph defined by W. The problem is a quadratic program (QP) over an N×K matrix Z, i.e., NK variables altogether¹, where z_k, the kth column of Z, contains the assignments of each item to category k. We will call problems of the type (1) Laplacian assignment problems (LASS). Minimizing objective (1a) encourages items to be assigned to categories with which they have high similarity (the linear term in G), while encouraging similar items to have similar assignments (the Laplacian term in Z), since

    tr(Z^T L Z) = (1/2) ∑_{n,m=1}^N w_nm ‖z_n − z_m‖²,

where z_n is the nth row of Z, i.e., the assignments for item n. Although we could absorb λ inside G, we will find it more convenient to fix the scale of each similarity to, say, the interval [−1, 1] (where ±1 mean maximum (dis)similarity and 0 ignorance), and then let λ control the strength of the Laplacian term.

¹ We will use a boldface vector z_n to mean the nth row of Z (as a column vector), and a boldface vector z_k to mean the kth column of Z (likewise for g_n or g_k). The context and the index (n or k) will determine which is the case. This mild notation abuse will simplify the explanations.

The objective function (1a) is separable over categories as

    λ tr(Z^T L Z) − tr(G^T Z) = ∑_{k=1}^K ( λ z_k^T L z_k − g_k^T z_k ),

and the constraints (1b)–(1c) are separable over items as z_n^T 1_K = 1, z_n ≥ 0, for n = 1,…,N. Thus, the problem (1) is not separable, since all the NK assignments are coupled with a certain structure.
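As a concrete illustration of the quantities just defined, the following Matlab sketch builds the graph Laplacian from an item-item affinity matrix and evaluates the objective (1a); the variables W, G, Z and lambda are assumed (hypothetical) inputs, and this is not part of the training algorithm of section 3.

d = sum(W,2);                                 % weighted degrees of the graph
L = spdiags(d,0,numel(d),numel(d)) - W;       % graph Laplacian L = D - W
f = lambda*trace(Z'*L*Z) - trace(G'*Z);       % objective (1a)
% the Laplacian term equals (1/2)*sum_{nm} w_nm*||z_n - z_m||^2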

2.1 Extreme values of λ

We can determine the solution(s) for the following extreme values of λ:

  • If λ = 0, then the LASS problem is a linear program (LP) and separates over each item n = 1,…,N. The solution is z_nk = 1 for k = argmax_{k'} g_nk' and z_nk = 0 otherwise, i.e., each item is assigned to its most similar category. This tells us what the linear term can do by itself. (If the maximum is achieved for more than one category for a given point, then there is an infinite number of “mixed” solutions that correspond to giving any assignment value to those categories, and zero for the rest.)

  • If G = 0 (or, equivalently, λ = ∞), then the LASS problem is a quadratic program with an infinite number of solutions, of the form z_n = c for each n = 1,…,N, where c is any point in the probability simplex, i.e., all items have the same assignments. This tells us what the Laplacian term can do by itself.

  • If λ → ∞, i.e., for very large λ but still having the linear term, the behavior actually differs from that of G = 0. In the generic case, we expect a unique solution close to z_nk = 1 for k = argmax_{k'} ∑_{n=1}^N g_nk' and z_nk = 0 otherwise, i.e., all items are assigned to the same category, the one having maximum total similarity over all items. (Again, if this maximum is achieved by more than one category, there is an infinite number of solutions corresponding to mixed assignments.) Indeed, if λ is very large, the Laplacian term dominates and we have z_n ≈ z_m for every pair of items, approximately. Then the LASS problem becomes the following LP over a single assignment vector z:

        max_z   ( ∑_{n=1}^N g_n )^T z
        s.t.    z^T 1_K = 1,   z ≥ 0,

    whose solution allocates all the assignment mass to the category with largest total similarity ∑_{n=1}^N g_nk.

With intermediate λ, more interesting solutions appear (particularly when the similarity matrices are sparse), where the item-category similarities are propagated to all points through the item-item similarities.
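For concreteness, here is a minimal Matlab sketch of the two extreme solutions described above (N, K and G as before; purely illustrative):

[~,kmax] = max(G,[],2);                       % most similar category per item (lambda = 0)
Z0 = full(sparse((1:N)',kmax,1,N,K));         % one-hot assignments, one per item
[~,kall] = max(sum(G,1));                     % category with largest total similarity
Zinf = zeros(N,K); Zinf(:,kall) = 1;          % all items assigned to it (lambda -> infinity)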

2.2 Existence and unicity of the solution

The LASS problem is a convex QP, so general results of convex optimization tell us that all minima are global minima. However, since the Hessian of the objective function is positive semidefinite, there can be multiple minima. The following theorem characterizes the solutions, and its corollary gives a sufficient condition for the minimum to be unique.

Theorem 2.1.

Assume the graph Laplacian L corresponds to a connected graph and let Z be a solution (minimizer) of the LASS problem (1), with Lagrange multipliers M = (μ_nk) for the constraints Z ≥ 0 (see section 2.4). Then, any other solution has the form Z + 1_N δ^T, where δ ∈ R^K satisfies the conditions:

(2)    δ^T 1_K = 0,      Z + 1_N δ^T ≥ 0,      μ_nk (z_nk + δ_k) = 0  for all n, k.

In particular, for each category k for which μ_nk > 0 for some item n, we have δ_k = 0.

Proof.

Call f(Z) = λ tr(Z^T L Z) − tr(G^T Z) the objective function of the LASS problem. Since f is continuous and the feasible set of the problem is bounded and closed in R^{NK}, f achieves a minimum value in the feasible set, hence at least one solution exists, which makes the theorem statement well defined. We call this solution Z. Now let us show that f(Z') ≥ f(Z) for any other feasible point Z', with Δ = Z' − Z. Simple algebra shows that

(3)    f(Z') = f(Z) + tr( (2λLZ − G)^T Δ ) + λ tr( Δ^T L Δ ).

The last term is nonnegative because L is positive semidefinite. The penultimate term is also nonnegative. To see this, write the KKT conditions for the LASS problem with Lagrange multipliers ν ∈ R^N and M = (μ_nk) ∈ R^{N×K} for the equality and inequality constraints, respectively:

    2λLZ − G − ν 1_K^T − M = 0,    Z 1_K = 1_N,    Z ≥ 0,    M ≥ 0,    M ∘ Z = 0.

Thus, since Z' is feasible, Z' 1_K = 1_N and Z' ≥ 0, and μ_nk = 0 for (n,k) ∉ A, where A = {(n,k): z_nk = 0} (i.e., the active inequalities). Then, from the first KKT equation, we have for the penultimate term:

    tr( (2λLZ − G)^T Δ ) = tr( (ν 1_K^T + M)^T Δ ) = ν^T Δ 1_K + tr(M^T Δ) = ∑_{n,k} μ_nk z'_nk ≥ 0,

because Δ 1_K = Z' 1_K − Z 1_K = 0 for all feasible pairs, μ_nk = 0 for the inactive inequalities, and μ_nk z'_nk ≥ 0 for the active inequalities. Hence, the last two terms in (3) are nonnegative, so f(Z') ≥ f(Z) and Z is a global minimizer.

Now assume that f(Z') = f(Z). Then the last two terms in (3) must both be zero. Recall that if the graph Laplacian L corresponds to a connected graph, it has a single null eigenvalue with an eigenvector consisting of all ones. From

    λ tr( Δ^T L Δ ) = 0

it follows that Δ = 1_N δ^T for some δ ∈ R^K. Since Z' is feasible, Z' 1_K = 1_N implies δ^T 1_K = 0, and Z' ≥ 0 implies Z + 1_N δ^T ≥ 0. From the penultimate term, ∑_{n,k} μ_nk z'_nk = 0.

Finally, from μ_nk ≥ 0 and z'_nk ≥ 0 it follows that μ_nk z'_nk = μ_nk (z_nk + δ_k) = 0 for each n and k, so if μ_nk > 0 for some n (i.e., if any of the inequalities involving category k are active with a positive multiplier), then δ_k = 0. ∎

Corollary 2.2.

Assume the graph Laplacian L corresponds to a connected graph and let Z be a solution of the LASS problem (1) with multipliers M for the constraints Z ≥ 0. If, for each category k = 1,…,K, there is at least one item n with μ_nk > 0, then the solution is unique.

Proof.

We have, by assumption, some item n with μ_nk > 0 for each k = 1,…,K. From theorem 2.1, any other solution must have δ_k = 0 for k = 1,…,K, and hence 1_N δ^T = 0. Hence any other solution equals Z. ∎

Remark 2.3.

If the graph Laplacian L corresponds to a graph with multiple connected components, then the LASS problem separates into a problem for each component, and the previous theorem holds in each component. Computationally, it is also more efficient to solve each problem separately.
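A possible Matlab sketch of this remark is the following, which finds the connected components of the graph of W from its sparsity pattern (via dmperm) and solves a separate LASS problem in each; W, G, lambda, rho and the lass routine of section 3.8 are assumed.

[p,~,r] = dmperm(sparse(W) + speye(size(W,1)));  % blocks of p given by r = connected components
for c = 1:numel(r)-1
  idx = p(r(c):r(c+1)-1);                        % items in component c
  % Zc = lass(L(idx,idx),lambda,G(idx,:),rho,zeros(numel(idx),K),zeros(numel(idx),K),maxit,tol);
end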

Remark 2.4.

The set (2) of solutions to a LASS problem is a convex polytope.

Remark 2.5.

The condition of corollary 2.2 means that each category has at least one item with a zero assignment to it. In practice, we can expect this condition to hold, and therefore the solution to be unique, if the categories are sufficiently distinctive and λ is small enough. Equivalently, nonunique solutions arise if some categories are a catch-all for the entire dataset. Theoretically, this should always be possible if λ is large enough, particularly if there are many categories. However, in practice we have never observed nonunique solutions, because for large λ the algorithm is attracted towards a solution where one category dominates, so the vector δ can only have one negative component, which makes (2) impossible unless δ takes a special value, such as δ = 0. Thus, the symmetric situation where all assignments are possible does not seem to occur in practice.

Remark 2.6.

Practically, one can always make the solution unique by replacing L with L + εI, where ε is a small positive value, since this makes the objective strongly convex. (This has the equivalent meaning of adding a penalty λε‖Z‖² to the objective, which has the effect of biasing the assignment vector of each item towards the simplex barycenter, i.e., uniform assignments.) However, as noted before, nonunique solutions appear to be rare with practical data, so this does not seem necessary.

2.3 Particular cases

Theorem 2.7.

Assume the graph Laplacian L corresponds to a connected graph and let Z be a solution of the LASS problem (1). Then:

  1. If g_n = 0 (item n carries no item-category similarity information), then z_n = (1/d_n) ∑_{m=1}^N w_nm z_m, the weighted average of its neighbors' assignments, where d_n is the degree of item n.

  2. If g_k ≤ 0 (no item has a positive similarity with category k), then z_k = 0.

Proof.

Both statements follow from substituting the values in the KKT conditions (4). Conditions (4b)–(4e) are trivially satisfied, so we prove condition (4a) only. For statement 1, we can write row of  as , where is row of  and is a vector with entries , and we write all row vectors as column vectors. Then we can write row (in column form) of condition (4a) as:

For statement 2, we can write column of condition (4a) as . ∎

Remark 2.8.

The meaning of th. 2.7 is as follows. (1) An item for which no similarity to any category is given (i.e., no expert information) receives as assignment the average of its neighbors' assignments. This corresponds to the SSL prediction. (2) A category for which no item has a positive similarity receives no assignments.

2.4 Lagrange multipliers for a solution

Given a feasible point Z in parameter space, we may want to test whether it is a solution of the LASS problem. For a QP, the KKT conditions are necessary and sufficient for a solution (Nocedal and Wright, 2006). For our problem, and written in matrix form, the KKT conditions are:

(4a)    2λLZ − G − ν 1_K^T − M = 0
(4b)    Z 1_K = 1_N
(4c)    Z ≥ 0
(4d)    M ≥ 0
(4e)    M ∘ Z = 0

where ∘ means elementwise product, and ν ∈ R^N and M ∈ R^{N×K} are the Lagrange multipliers associated with the point Z for the equality and inequality constraints, respectively. We need to compute ν and M. Given Z, the KKT system (4) has 2NK linear equations for the N(K+1) unknowns in ν and M, and its solution is unique if Z is feasible, as we will see. To obtain it, we multiply (4a) times 1_K on the right and obtain ν as a function of M:

(5)    ν = −(1/K) (G + M) 1_K.

Substituting it in (4a) gives, together with (4e):

where . This is a linear system of equations for the unknowns in . It separates over each row of , , in a system of the form

where ,  and  correspond to the th row of ,  and , respectively, written as -dimensional column vectors (we omit the index ). This is a linear system of equations for unknowns, which we solve by multiplying on the left times the transpose of the coefficient matrix:

Finally, using the matrix inversion lemma

we obtain the solution for :

(6)

whose transpose gives row of . Note the formula is well defined because and since for and for at least one , since  is feasible.

For the case , one can verify that the above formulas simplify as follows (again, we give ,  and for item but omitting the index ):

where  represents the th row of , and .

3 A simple, efficient algorithm to solve the QP

It is possible to solve problem (1) in different ways, but one must be careful in designing an effective algorithm because the number of variables and the number of constraints grow proportionally to the number of data points, and can then be very large. We describe here one algorithm that is very simple, has guaranteed convergence without line searches, and takes advantage of the structure of the problem and the sparsity of L. It is based on the alternating direction method of multipliers (ADMM), combined with a direct linear solver using the Schur complement and caching the Cholesky factorization of the resulting coefficient matrix.

3.1 QP solution using ADMM

We briefly review how to solve a QP using the alternating direction method of multipliers (ADMM), following (Boyd et al., 2011). Consider the QP

(7)    min_x   (1/2) x^T P x + q^T x
(8)    s.t.    A x = b,   x ≥ 0

over x ∈ R^n, where P is positive (semi)definite. In ADMM, we introduce new variables z ∈ R^n so that we can replace the inequalities with an indicator function g(z), which is zero in the nonnegative orthant and +∞ otherwise. Then we write the problem as

(9)     min_{x,z}   f(x) + g(z)
(10)    s.t.        x = z

where

    f(x) = (1/2) x^T P x + q^T x,   with domain {x: Ax = b},

is the original objective with its domain restricted to the equality constraint. The augmented Lagrangian is

(11)    L_ρ(x, z, y) = f(x) + g(z) + y^T (x − z) + (ρ/2) ‖x − z‖²

and the ADMM iteration has the form:

    x ← argmin_x L_ρ(x, z, y)
    z ← argmin_z L_ρ(x, z, y)
    y ← y + ρ (x − z)

where y ∈ R^n is the dual variable (the Lagrange multiplier estimates for the constraint x = z), and the updates are applied in order and modify the variables immediately. Here, we use the scaled form of the ADMM iteration, which is simpler. It is obtained by combining the linear and quadratic terms in x − z and using a scaled dual variable u = y/ρ:

    x ← argmin_x ( f(x) + (ρ/2) ‖x − z + u‖² )
    z ← argmin_z ( g(z) + (ρ/2) ‖x − z + u‖² )
    u ← u + x − z.

Since g is the indicator function for the nonnegative orthant, the solution of the z-update is simply to threshold each entry of x + u by taking its nonnegative part. Finally, the ADMM iteration is:

(12)    x ← argmin_{x: Ax = b} ( (1/2) x^T P x + q^T x + (ρ/2) ‖x − z + u‖² )
(13)    z ← max(x + u, 0)
(14)    u ← u + x − z

where the updates are applied in order and modify the variables immediately, max(·, 0) applies elementwise, and ‖·‖ is the Euclidean norm. The penalty parameter ρ > 0 is set by the user, and ρu provides the Lagrange multiplier estimates for the inequalities. The x-update is an equality-constrained QP with KKT conditions

(15)    (P + ρI) x + A^T ν = −q + ρ (z − u),      A x = b.

Solving this linear system gives the optimal x and ν (the Lagrange multipliers for the equality constraint). The ADMM iteration consists of very simple updates to the relevant variables, but its success crucially relies on being able to solve the x-update efficiently. Given the structure of our problem, it is convenient to use a direct solution using the Schur complement, that is:

(16a)    ν = ( A (P + ρI)^{-1} A^T )^{-1} ( A (P + ρI)^{-1} (−q + ρ(z − u)) − b )
(16b)    x = (P + ρI)^{-1} ( −q + ρ(z − u) − A^T ν )

Eq. (16a) results from left-multiplying the first equation in (15) by A(P + ρI)^{-1} and using the second equation in (15) to eliminate x. Eq. (16b) results from substituting ν back in the first equation in (15) and solving it for x.

Thus, the ADMM iteration consists of solving a linear system for the primal variables x, applying a thresholding to get z, and an addition to get u. Convergence of the ADMM iteration (12) to the global minimum of problem (7) in value and to a feasible point is guaranteed for any ρ > 0.
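As an illustration of the generic iteration just reviewed, the following Matlab sketch implements (12)–(14) with the x-update solved through the KKT system (15); P, q, A, b, rho and maxit are assumed inputs, and this is a generic sketch rather than the specialized LASS solver of section 3.2.

function [x,z,u] = admm_qp(P,q,A,b,rho,maxit)
n = size(P,1); m = size(A,1);
KKT = [P + rho*speye(n), A'; A, sparse(m,m)];    % coefficient matrix of (15), constant across iterations
z = zeros(n,1); u = zeros(n,1);
for it = 1:maxit
  sol = KKT \ [-q + rho*(z - u); b];             % x-update (12): equality-constrained QP
  x = sol(1:n);
  z = max(x + u, 0);                             % z-update (13): nonnegative thresholding
  u = u + x - z;                                 % scaled dual update (14)
end
end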

3.2 Application to our QP

We now write the ADMM updates for our QP (1), where we identify

    x ≡ vec(Z),    P ≡ 2λ (I_K ⊗ L),    q ≡ −vec(G),    Ax = b ≡ Z 1_K = 1_N  (so A = 1_K^T ⊗ I_N, b = 1_N),

where vec(·) concatenates the columns of its argument into a single column vector. Given the structure of these matrices, the solution of the KKT system (15) by using the Schur complement (16) simplifies considerably. The basic reasons are that (1) the matrix P is block-diagonal with K identical copies of the (scaled) graph Laplacian 2λL, which is itself usually sparse; and (2) the especially simple form of the equality constraint matrix A. Thus, even though the x-update involves solving a large linear system, it is equivalent to solving K systems of N equations each, where the coefficient matrix is the same for each system and besides is constant across iterations and sparse, equal to 2λL + ρI_N. In turn, these linear systems may be solved efficiently in one of the two following ways: (1) preferably, by caching the Cholesky factorization of this matrix (using a good permutation to reduce fill-in), if it does not add so much fill that it cannot be stored; or (2) by using an iterative linear solver such as conjugate gradients, initialized with a warm start, preconditioned, and exiting it before convergence, so as to carry out faster, inexact x-updates.

The final algorithm is as follows, with its variables written in matrix form. The input are the affinity matrices W and G, from which we construct the graph Laplacian L. We then choose ρ > 0 and set

    C = 2λL + ρI_N,    R^T R = C    (sparse Cholesky factorization, with a fill-reducing permutation).

The Cholesky factor R is used to solve the linear system in (17b). We then iterate, in order, the following updates until convergence:

(17a)    ν = (1/K) ( ρ (Y − U) 1_K + G 1_K − ρ 1_N )
(17b)    Z = (2λL + ρI_N)^{-1} ( ρ (Y − U) + G − ν 1_K^T )
(17c)    Y = max(Z + U, 0)
(17d)    U = U + Z − Y

where Z ∈ R^{N×K} are the primal variables, Y ∈ R^{N×K} the auxiliary variables, U ∈ R^{N×K} the scaled Lagrange multiplier estimates for the constraint Z = Y, and ν ∈ R^N the Lagrange multipliers for the equality constraints. The solution of the linear system in the Z-update may be obtained by using two triangular backsolves if using the Cholesky factor of 2λL + ρI_N, or using an iterative method such as conjugate gradients if the Cholesky factor is not available.

The iteration (17) is very simple to implement. It requires no line searches and has only one user parameter, the penalty parameter. The algorithm converges for any positive value of the penalty parameter, but this value does affect the convergence rate.

3.3 Remarks

Theorem 3.1.

At each iterate in the algorithm updates (17), Z 1_K = 1_N, Y ≥ 0, U ≤ 0, and Y ∘ U = 0.

Proof.

For Z 1_K = 1_N, substituting eq. (17a) into (17b):

    Z 1_K = (2λL + ρI_N)^{-1} ( ρ (Y − U) 1_K + G 1_K − Kν ) = (2λL + ρI_N)^{-1} ρ 1_N = 1_N,

where the last step results from L 1_N = 0. For U ≤ 0, from eqs. (17c)–(17d) we have that U ← (Z + U) − max(Z + U, 0) = min(Z + U, 0) ≤ 0. For Y ≥ 0, it follows from the max(·, 0) in (17c). Finally, Y ∘ U = 0 follows from Y = max(Z + U, 0) and U = min(Z + U, 0). ∎

Theorem 3.2.

Upon convergence of algorithm (17), Z is a solution with Lagrange multipliers −ν (for the equality constraints) and M = −ρU (for the inequalities).

Proof.

Let us compare the KKT conditions (4) with the algorithm updates (17) upon convergence, i.e., at a fixed point of the update equations. From th. 3.1 we know that Z 1_K = 1_N and M = −ρU ≥ 0, which are KKT conditions (4b) and (4d). From eq. (17d) we must have Z = Y, so from eq. (17c) we have Z = max(Z + U, 0) ≥ 0, which is KKT condition (4c). From eqs. (17c)–(17d) we have Y ∘ U = 0 and Z = Y, therefore M ∘ Z = −ρ U ∘ Z = 0, which is KKT condition (4e). Finally, from eq. (17b) we have:

    (2λL + ρI_N) Z = ρ (Y − U) + G − ν 1_K^T   ⟹   2λLZ − G + ν 1_K^T − (−ρU) = 0   (using Z = Y),

which matches KKT condition (4a) with multipliers −ν and −ρU. The change of sign in the multipliers between the algorithm and the KKT conditions is due to the sign choice in the Lagrangian (adding in eq. (11), subtracting in (4)). ∎

Remark 3.3.

In practice, the algorithm is stopped before convergence, and Z, −ν and −ρU are estimates for a solution and its Lagrange multipliers, respectively. The estimate Z may not be feasible; in particular, the values z_nk need not be in [0, 1], since this is only guaranteed upon convergence. If needed, a feasible point may be obtained by projecting each row of Z onto the simplex (see section 4).

Remark 3.4.

If G = 0, one solution is given by Z = (1/K) 1_N 1_K^T, for which the Lagrange multipliers are ν = 0 and M = 0; thus the inequality constraints are inactive and the equality constraints are weakly active. Indeed, that Z value is also a solution of the unconstrained problem.

3.4 Computational complexity

Each step in (17) is O(NK) except for the linear system solution in (17b). If L is sparse, using the Cholesky factor makes this step O(NK) as well, and adds a one-time setup cost of computing the Cholesky factor (which is also linear in N for sufficiently sparse matrices). Thus, each iteration of the algorithm is cheap. In practice, for good values of ρ, the algorithm quickly approaches the solution in the first iterations and then converges slowly, as is known with ADMM algorithms in general. However, since each iteration is so cheap, we can run a large number of them if high accuracy is needed. As a sample runtime, for a problem with items and categories (i.e., Z has NK parameters) and using a nearest-neighbor graph, the Cholesky factorization takes  s and each iteration takes  s on a PC.

For large-scale problems, the slow convergence becomes more problematic, and it is possible that the Cholesky factor may create too much fill even with a good preordering. One can use instead an iterative linear solver, such as preconditioned conjugate gradients. Scaling up the training is a topic for future research.
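For illustration, here is a sketch of option (1) above: cache a sparse Cholesky factorization of the Z-update matrix once and solve (17b) with two triangular backsolves per column. The variable names (lambda, rho, N, K, Y, U, nu, G) are assumed, and the permutation handling follows Matlab's sparse chol.

C = 2*lambda*L + rho*speye(N);        % coefficient matrix of (17b), constant across iterations
[R,flag,P] = chol(C);                 % R'*R = P'*C*P; flag = 0 since C is positive definite (rho > 0)
B = rho*(Y - U) + G - nu*ones(1,K);   % right-hand side of (17b)
Z = P*(R \ (R' \ (P'*B)));            % two triangular backsolves per column of B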

3.5 Initialization

If the LASS problem is itself a subproblem in a larger problem (as in the Laplacian K-modes clustering algorithm; Wang and Carreira-Perpiñán, 2014b), one should warm-start the iteration of eq. (17) from the values of Y and U in the previous outer-loop iteration. Otherwise, we can simply initialize Y = U = 0, which (substituting in eqs. (17a)–(17b)) gives

    Z = (2λL + ρI_N)^{-1} Ḡ + (1/K) 1_N 1_K^T

(where Ḡ = G (I_K − (1/K) 1_K 1_K^T) is the matrix G with centered rows, and (1/K) 1_N 1_K^T is the simplex barycenter). This initialization is closely related to the projection on the simplex of the unconstrained optimum of the LASS problem, as we show next. Consider first the unconstrained minimization

    min_Z   λ tr(Z^T L Z) − tr(G^T Z).

This problem is in fact unbounded unless G^T 1_N = 0, because taking Z = α 1_N e_k^T for any category k and scaling α appropriately, since L 1_N = 0, we have an objective value of −α 1_N^T g_k, which can be made arbitrarily negative. We could still try to define a Z from the gradient equation 2λLZ = G, but this involves a linear system on L, whose computational cost defeats the purpose of the initialization. Instead, we can consider the unconstrained minimization

    min_Z   λ tr(Z^T L Z) − tr(G^T Z) + (ρ/2) ‖Z‖²

for ρ > 0, which is strongly convex and has a unique minimum Z* = (2λL + ρI_N)^{-1} G, which we can compute cheaply if we reuse the Cholesky factor of 2λL + ρI_N. Now, we can write the initialization (for Y = U = 0) in terms of Z* as Z = Z* + (1/K) ( 1_N − (2λL + ρI_N)^{-1} G 1_K ) 1_K^T, which means that each row vector of Z is translated along the direction 1_K. Since this direction is orthogonal to the simplex, both Z and Z* have the same projection on it.

Finally, note that if ρ is large, then Z ≈ (1/K) 1_N 1_K^T and Z* ≈ 0, both of which project onto the simplex barycenter, independently of the problem data.

3.6 Stopping criterion

We stop when ‖Z − Z_old‖_max < tol, i.e., when the change in absolute terms in Z in the last iteration falls below a set tolerance tol. Using an absolute criterion here is equivalent to using a relative one, since the entries of Z are (approximately) within [0, 1]. Since our iterations are so cheap, evaluating ‖Z − Z_old‖_max takes a runtime comparable to that of the updates in (17) (except for the Z-update, possibly), so testing the stopping criterion only every few iterations saves some runtime.

Another possible stopping criterion is to test whether the KKT conditions (4) are satisfied up to a given tolerance, using the Lagrange multipliers' estimates provided by the algorithm at each iterate. Each iterate always satisfies (4b) and (4d), so we only need to check (4a), (4c) and (4e) (if the iterate is interior to the inequalities it will also satisfy (4c) and (4e)). Still, it is faster to check for changes in Z.

Since the iterates Z in the algorithm need not be feasible, they may be slightly infeasible once the stopping criterion is satisfied. If desired, a feasible Z can be obtained by projecting each assignment vector onto the simplex (see section 4).

3.7 Optimal penalty parameter

The speed at which ADMM converges depends on the quadratic penalty parameter ρ (Boyd et al., 2011). We illustrate this with the “2 moons” dataset in fig. 1 (built on a nearest-neighbor graph), where we set positive similarity values for one point in each cluster, resulting in each cluster being assigned to a different category, as expected. The problem has NK parameters; the run shown took 11 s. Little work exists on how to select ρ so as to achieve fastest convergence. Recently, for QPs, Ghadimi et al. (2013) suggest a choice of ρ based on the smallest (nonzero) and largest eigenvalues of the Laplacian. In fig. 1 we show the relative error vs the number of iterations for different ρ. Asymptotically, the convergence is linear. While, in the long term, values of ρ close to this suggestion work best, in the short term, smaller values are able to achieve an acceptably low relative error in just a few iterations, so an adaptive ρ would be best overall.

Figure 1: Convergence speed of ADMM for different ρ. Left: a 2-moons dataset for which only two points are provided with G affinities (above) and the predicted assignments (below). Right: error (in log scale) vs iterations for different ρ.

3.8 Matlab code

The following Matlab code implements the algorithm, assuming a direct solution of the Z-update linear system.

function [Z,Y,U,nu] = lass(L,l,G,r,Y,U,maxit,tol)
% L: graph Laplacian, l: lambda, G: item-category similarities (N x K), r: rho,
% Y,U: initial auxiliary variables and scaled multipliers (e.g. zeros(N,K)).
[N,K] = size(G); LI = 2*l*L+r*speye(N,N); h = (-sum(G,2)+r)/K; Zold = zeros(N,K);
for i=1:maxit
  nu = (r/K)*sum(Y-U,2) - h;                  % nu-update (17a)
  Z = LI \ bsxfun(@minus,r*(Y-U)+G,nu);       % Z-update (17b)
  Y = max(Z+U,0);                             % Y-update (17c)
  U = U + Z - Y;                              % U-update (17d)
  if max(abs(Z(:)-Zold(:))) < tol break; end; Zold = Z;
end
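As a usage sketch of this routine, with a small synthetic dataset and illustrative parameter values (the graph construction below, a symmetrized 10-nearest-neighbor Gaussian affinity, is an assumption, not prescribed by the model):

N = 200; K = 3;
X = rand(N,2);                                            % toy 2D items
D2 = bsxfun(@plus,sum(X.^2,2),sum(X.^2,2)') - 2*(X*X');   % squared Euclidean distances
[~,idx] = sort(D2,2); s = 0.1; W = zeros(N);
for n=1:N, W(n,idx(n,2:11)) = exp(-D2(n,idx(n,2:11))/(2*s^2)); end
W = max(W,W');                                            % symmetric 10-nn affinity graph
L = spdiags(sum(W,2),0,N,N) - sparse(W);                  % graph Laplacian
G = zeros(N,K); G(1,1) = 1; G(2,2) = 1; G(3,3) = 1;       % expert similarities for 3 items
[Z,Y,U,nu] = lass(L,1,G,1,zeros(N,K),zeros(N,K),1000,1e-5);   % lambda = rho = 1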

4 Out-of-sample mapping

Having trained the system, that is, having found the optimal assignments Z for the N training set items, we are given a new, test item (for example, a new point in feature space), along with its item-item and item-category similarities w = (w_1,…,w_N)^T and g = (g_1,…,g_K)^T, respectively, and we wish to find its assignments to each category. We follow the reasoning of Carreira-Perpiñán and Lu (2007) to derive an out-of-sample mapping. While one could train again the whole system augmented with the new item, this would be very time-consuming, and the assignments of all points would change (although very slightly). A more practical and still natural way to define an out-of-sample mapping is to solve a problem of the form (1) with a dataset consisting of the original training set augmented with the new item, but keeping Z fixed to the values obtained during training. Hence, the only free parameter is the assignment vector z for the new item. After dropping constant terms, the optimization problem (1) reduces to the following quadratic program over K variables:

(18a)    min_z   λ ∑_{n=1}^N w_n ‖z − z_n‖² − g^T z
(18b)    s.t.    z^T 1_K = 1
(18c)            z ≥ 0

where γ = 1 / (2λ ∑_{n=1}^N w_n) and

    z̄ = ( ∑_{n=1}^N w_n z_n ) / ( ∑_{n=1}^N w_n )

is a weighted average of the training points' assignments, and so z̄ + γ g is itself an average between this and the item-category affinities. Thus, the solution is the Euclidean projection of the K-dimensional vector

    z̄ + γ g

onto the probability simplex. This can be efficiently computed, in a finite number of steps, with a simple algorithm (Duchi et al., 2008; Wang and Carreira-Perpiñán, 2014a). Computationally, assuming W is sparse, the most expensive step is finding the neighbors to construct w. With large N, one should use some form of hashing (Shakhnarovich et al., 2006) to retrieve approximate neighbors quickly.
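A minimal Matlab sketch of this projection step is given below (the standard finite algorithm from the cited references); the final commented line applies it to the vector derived above for the out-of-sample item, so its exact form inherits the assumptions of that reconstruction (w: affinities to the N training items, g: affinities to the K categories).

function x = SimplexProj(y)
% Euclidean projection of a K-dimensional vector y onto the probability simplex.
y = y(:); K = numel(y);
u = sort(y,'descend'); css = cumsum(u);
jmax = find(u + (1 - css)./(1:K)' > 0, 1, 'last');   % number of nonzero components
x = max(y + (1 - css(jmax))/jmax, 0);
end
% z = SimplexProj( (w*Z)'/sum(w) + g/(2*lambda*sum(w)) );   % out-of-sample assignment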

The out-of-sample prediction for a point in the training set does not generally equal the assignment it received during training (although it does not differ much from it either). That is, applying the out-of-sample mapping to a training item x_n need not return exactly z_n. This is also true of semisupervised learning, and it simply reflects the fact that the out-of-sample mapping smoothes, rather than interpolates, the training data.

Given a solution Z of the LASS training problem, the out-of-sample mapping is uniquely defined, because the problem (18) is strongly convex. However, as described in section 2.2, in particular settings the solution of the LASS training problem may not be unique, and a natural question is: what is the relation between the out-of-sample mappings for two different solutions? From th. 2.1, the solutions have the form Z + 1_N δ^T, where Z is any particular solution and δ satisfies δ^T 1_K = 0, Z + 1_N δ^T ≥ 0 and the complementarity condition in (2). Then the out-of-sample mapping for a δ-solution is the projection onto the simplex of z̄ + δ + γ g, where the projection of z̄ + γ g gives the out-of-sample mapping for the base solution Z. If δ were parallel to the vector 1_K then the out-of-sample mappings for different solutions would actually coincide, but in fact δ^T 1_K = 0, so the out-of-sample mappings for different solutions correspond to sliding along the simplex by the vector δ (which must respect the remaining conditions above, of course).

As a function of γ, the out-of-sample mapping takes the following extreme values:

  • If γ → ∞ (or λ → 0), then z_k = 1 for k = argmax_{k'} g_{k'} and z_k = 0 otherwise, i.e., the item is assigned to its most similar category (or any mixture thereof in case of ties).

  • If γ = 0 (or λ → ∞), then z = z̄, independently of g. This corresponds to the SSL out-of-sample mapping.

In between these, the out-of-sample mapping as a function of γ is a piecewise linear path in the simplex, which represents the tradeoff between the crowd (z̄) and expert (g) wisdoms. This path is quite different from the simple average of z̄ and g (which need not even be feasible), and may produce exact 0s or 1s for some entries.

The LASS out-of-sample mapping offers an extra degree of flexibility to the user, which may be used on a case-by-case basis for each test item. The user has the prerogative to set γ to favor more or less the expert vs the crowd opinion, and in fact to explore the entire continuum for γ ∈ [0, ∞). The user can also explore what-if scenarios by changing g itself, given the vector z̄ (e.g. what would the assignment vector look like if we think that the test item belongs to one category but not to another?). These computations are all relatively efficient because the bottleneck, which is the computation of z̄, is done once only.

Note that the out-of-sample mapping is nonlinear and nonparametric, and it maps an input item (given its affinity information) onto a valid assignment vector in the probability simplex. Hence, LASS can also be considered as learning nonparametric conditional distributions over the categories, given partial supervision.

5 Related work

5.1 Semisupervised learning with a Laplacian penalty (SSL)

In semisupervised learning (SSL) with a Laplacian penalty (Zhu et al., 2003), the basic idea is that we are given an affinity matrix W and corresponding graph Laplacian L on N items, and the labels for a subset of the items. Then, the labels for the remaining, unlabeled items are such that they minimize the Laplacian penalty, or equivalently they are the smoothest function on the graph that satisfies the given labels (“harmonic” function). Call Z_u of N_u × K and Z_l of N_l × K the matrices of labels for the unlabeled and labeled items, respectively, where N_u + N_l = N, and partition W and L accordingly into blocks (e.g. L_uu, L_ul). To obtain Z_u we minimize over Z_u, with fixed Z_l:

(19)    min_{Z_u}  tr(Z^T L Z)    ⟹    L_uu Z_u = W_ul Z_l,   i.e.,   Z_u = L_uu^{-1} W_ul Z_l.

Thus, computationally the solution involves a sparse linear system of N_u equations. An out-of-sample mapping for a new test item with affinity vector w = (w_1,…,w_N)^T wrt the training set can be derived by SSL again, taking the full matrix Z of all N trained labels (given and predicted) as fixed and the new item's label as the free one. This gives a closed-form expression

(20)    z = ( ∑_{n=1}^N w_n z_n ) / ( ∑_{n=1}^N w_n ),

which is the average of the labels of the new item's neighbors, making clear the smoothing behavior of the Laplacian. SSL with a Laplacian penalty is very effective in problems where there are very few labels, i.e., N_l ≪ N_u, but the graph structure is highly predictive of each item's labels. Essentially, the given labels are propagated throughout the graph.
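The following Matlab sketch computes this SSL prediction; W is the N×N affinity matrix, u and l are index vectors for the unlabeled and labeled items, Zl holds the given labels, and w is the affinity (row) vector of a test item, all assumed inputs.

Lg = spdiags(sum(W,2),0,size(W,1),size(W,1)) - W;   % graph Laplacian
Zu = Lg(u,u) \ (W(u,l)*Zl);                         % harmonic solution: L_uu*Z_u = W_ul*Z_l
Z = zeros(size(W,1),size(Zl,2)); Z(u,:) = Zu; Z(l,:) = Zl;
znew = (w*Z)/sum(w);                                % out-of-sample label as in (20)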

In our setting, the labels are the item-category assignments Z, and we have the following result.

Theorem 5.1.

In problem (19), if Z_l 1_K = 1_{N_l} and Z_l ≥ 0, then Z_u 1_K = 1_{N_u} and the entries of Z_u lie in [0, 1].

Proof.

Since Z_u = L_uu^{-1} W_ul Z_l we have

    Z_u 1_K = L_uu^{-1} W_ul Z_l 1_K = L_uu^{-1} W_ul 1_{N_l} = L_uu^{-1} L_uu 1_{N_u} = 1_{N_u},

using W_ul 1_{N_l} = (D_u − W_uu) 1_{N_u} = L_uu 1_{N_u}. Hence Z_u 1_K = 1_{N_u}. That the entries of Z_u lie in [0, 1] follows from the maximum principle for harmonic functions (Doyle and Snell, 1984): each of the unknowns must lie between the minimum and maximum label values, i.e., in [0, 1]. (Strictly, they will lie in (0, 1) or be all equal to a constant.) ∎

Thus, in the special case where the given labels are valid assignments (nonnegative with unit sum), the predicted labels will also be valid assignments, and we need not subject the problem explicitly to simplex constraints, which simplifies it computationally. This occurs in the standard semisupervised classification setting where each item belongs to only one category and we use the