1 Introduction
A major success in machine learning in recent years has been the development of semisupervised learning (SSL) (Chapelle et al., 2006), where we are given labels for only a few of the training points. Many SSL approaches rely on a neighborhood graph constructed on the training data (labeled and unlabeled), typically weighted with similarity values. The Laplacian of this graph is used to construct a quadratic nonnegative function that measures the agreement of possible labelings with the graph structure, and minimizing it given the existing labels has the effect of propagating them over the graph. Laplacian-based formulations are conceptually simple, computationally efficient (since the Laplacian is usually sparse), have a solid foundation in graph theory and linear algebra (Chung, 1997; Doyle and Snell, 1984), and most importantly work very well in practice. The graph Laplacian has been widely exploited in machine learning, computer vision and graphics, and other areas: as mentioned, in semisupervised learning, manifold regularization and graph priors (Zhu et al., 2003; Belkin et al., 2006; Zhou et al., 2004) for regression, classification and applications such as supervised image segmentation (Grady, 2006), where one solves a Laplacian-based linear system; in spectral clustering (Shi and Malik, 2000), possibly with constraints (Lu and Carreira-Perpiñán, 2008), and spectral dimensionality reduction (Belkin and Niyogi, 2003) and probabilistic spectral dimensionality reduction (Carreira-Perpiñán and Lu, 2007), where one uses eigenvectors of the Laplacian; in clustering, manifold denoising and surface smoothing (Carreira-Perpiñán, 2006; Wang and Carreira-Perpiñán, 2010; Taubin, 1995), where one iterates products of the data with the Laplacian; etc.
We concern ourselves with assignment problems in a semisupervised learning setting, where we have items and categories and we want to find soft assignments of items to categories given some information. This information often takes the form of partial tags or annotations, e.g. for pictures in websites such as Flickr, blog entries, etc. Let us consider a specific example where the items are documents (e.g. papers submitted to this conference) and the categories are keywords. Any given paper will likely be associated to a larger or smaller extent with many keywords, but most authors will tag their papers with only a few of them, usually the most distinctive (although, as we know, there may be other reasons). Thus, few papers will be tagged as “computer science” or “machine learning” because those keywords are perceived as redundant given, say, “semisupervised learning”. However, considered in a larger context (e.g. to include biology papers), such keywords would be valuable. Besides, categories may have various correlations that are unknown to us but that affect the assignments. For example, a hierarchical structure implies that “machine learning” belongs to “computer science” (although it does to “applied maths” to some extent as well). In general, we consider categories as sets having various intersection, inclusion and exclusion relations. Section 6 illustrates this in an example. Finally, it is sometimes practical to tag an item as not associated with a certain category, e.g. “this paper is not about regression” or “this patient does not have fever”, particularly if this helps to make it distinctive. In summary, in this type of application, it is impractical for an item to be fully labeled over all categories, but it is natural for it to be associated or disassociated with a few categories. This can be coded with item-category similarity values that are positive or negative, respectively, with the magnitude indicating the degree of association, and zero meaning indifference or ignorance. We call this source of information, which is specific for each item irrespective of other items, the wisdom of the expert.
We also consider another practical source of information. Usually it is easy to construct a similarity of a given item to other items, at least its nearest neighbors. For example, with documents or images, this could be based on a bag-of-words representation. We would expect similar items to have similar assignment vectors, and this can be captured with an item-item similarity matrix and its graph Laplacian. We call this source of information, which is about an item in the context of other items, the wisdom of the crowd.
As an example of the interaction of these two sources of information, imagine the following situation where the items are conference papers and the categories are authors. We know that author A1 writes papers about regression (usually with author A2) or bioinformatics (usually with author A3). We are sent for review a paper that is about regression (it contains many words about regression) and, we are tipped, by A1. Can we guess its coauthors, i.e., predict its assignments to all existing authors? Based on the crowd wisdom alone, many authors could have written the paper (those who write regression papers). Based on the expert wisdom alone, A2 or A3 may have co-written the paper (since each of them has coauthored other papers with A1). Given both wisdoms, we might expect a high assignment for A2 (and A1) and low for everybody else.
In this paper, we propose a simple model that captures this intuition as a quadratic program. We give some properties of the solution, define an out-of-sample mapping, derive a training algorithm, and illustrate the model with document and image datasets.
A shorter version of this work appears in a conference paper (Carreira-Perpiñán and Wang, 2014).
2 The Laplacian assignment (LASS) model
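We consider the following assignment problem. We have $N$ items and $K$ categories, and we want to determine soft assignments $z_{nk}$ of each item $n$ to each category $k$, where $z_{nk} \in [0,1]$ and $\sum_{k=1}^K z_{nk} = 1$ for each $n = 1,\dots,N$. We are given two similarity matrices, suitably defined, and typically sparse: an item-item similarity matrix $\mathbf{W}$, which is an $N \times N$ matrix of affinities $w_{nm} \ge 0$ between each pair of items $n$ and $m$; and an item-category similarity matrix $\mathbf{G}$, which is an $N \times K$ matrix of affinities $g_{nk} \in \mathbb{R}$ between each pair of item $n$ and category $k$ (negative affinities, i.e., dissimilarities, are allowed in $\mathbf{G}$).
We want to assign items to categories optimally as follows:
$$\min_{\mathbf{Z}} \ \lambda\,\mathrm{tr}(\mathbf{Z}^T \mathbf{L} \mathbf{Z}) - \mathrm{tr}(\mathbf{G}^T \mathbf{Z}) \tag{1a}$$
$$\text{s.t.} \quad \mathbf{Z}\mathbf{1}_K = \mathbf{1}_N \tag{1b}$$
$$\mathbf{Z} \ge \mathbf{0} \tag{1c}$$
where $\lambda > 0$, $\mathbf{1}_K$ is a vector of $K$ ones, and $\mathbf{L}$ is the $N \times N$ graph Laplacian matrix, obtained as $\mathbf{L} = \mathbf{D} - \mathbf{W}$, where $\mathbf{D} = \mathrm{diag}\big(\sum_{m=1}^N w_{nm}\big)$ is the degree matrix of the weighted graph defined by $\mathbf{W}$. The problem is a quadratic program (QP) over an $N \times K$ matrix $\mathbf{Z} = (\mathbf{z}_1,\dots,\mathbf{z}_K)$, i.e., $NK$ variables altogether (see footnote 1), where $\mathbf{z}_k$, the $k$th column of $\mathbf{Z}$, contains the assignments of each item to category $k$. We will call problems of the type (1) Laplacian assignment problems (LASS). Minimizing objective (1a) encourages items to be assigned to categories with which they have high similarity (the linear term in $\mathbf{G}$), while encouraging similar items to have similar assignments (the Laplacian term in $\mathbf{L}$), since
$$\mathrm{tr}(\mathbf{Z}^T \mathbf{L} \mathbf{Z}) = \frac{1}{2} \sum_{n,m=1}^N w_{nm} \|\mathbf{z}_n - \mathbf{z}_m\|^2$$
where $\mathbf{z}_n$ is the $n$th row of $\mathbf{Z}$, i.e., the assignments for item $n$. Although we could absorb $\lambda$ inside $\mathbf{G}$, we will find it more convenient to fix the scale of each similarity $g_{nk}$ to, say, the interval $[-1,1]$ (where $\pm 1$ mean maximum (dis)similarity and $0$ ignorance), and then let $\lambda$ control the strength of the Laplacian term.
Footnote 1: We will use a boldface vector $\mathbf{z}_n$ or $\mathbf{z}_m$ to mean the $n$th or $m$th row of $\mathbf{Z}$ (as a column vector), and a boldface vector $\mathbf{z}_k$ to mean the $k$th column of $\mathbf{Z}$ (likewise for $\mathbf{g}_n$ or $\mathbf{g}_k$). The context and the index ($n$ or $k$) will determine which is the case. This mild notation abuse will simplify the explanations.
The objective function (1a) is separable over categories as
$$\sum_{k=1}^K \big( \lambda\, \mathbf{z}_k^T \mathbf{L} \mathbf{z}_k - \mathbf{g}_k^T \mathbf{z}_k \big)$$
and the constraints (1b)–(1c) are separable over items as $\mathbf{z}_n^T \mathbf{1}_K = 1$, $\mathbf{z}_n \ge \mathbf{0}$, for $n = 1,\dots,N$. Thus, the problem (1) as a whole is not separable, since all the assignments are coupled with a certain structure.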
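The identity $\mathrm{tr}(\mathbf{Z}^T\mathbf{L}\mathbf{Z}) = \frac{1}{2}\sum_{n,m} w_{nm}\|\mathbf{z}_n - \mathbf{z}_m\|^2$ behind the Laplacian term can be checked numerically. The following is a minimal Python sketch, not the authors' code: the function names and the toy matrices are illustrative assumptions, and dense lists are used only because the example is tiny.

```python
def laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric affinity matrix W."""
    N = len(W)
    return [[(sum(W[n]) if n == m else 0.0) - W[n][m] for m in range(N)]
            for n in range(N)]

def laplacian_term(Z, W):
    """tr(Z^T L Z), computed directly from L."""
    L = laplacian(W)
    N, K = len(Z), len(Z[0])
    return sum(Z[n][k] * L[n][m] * Z[m][k]
               for n in range(N) for m in range(N) for k in range(K))

def pairwise_term(Z, W):
    """0.5 * sum_{n,m} w_nm ||z_n - z_m||^2, where z_n is row n of Z."""
    N, K = len(Z), len(Z[0])
    return 0.5 * sum(W[n][m] * sum((Z[n][k] - Z[m][k]) ** 2 for k in range(K))
                     for n in range(N) for m in range(N))

def lass_objective(Z, W, G, lam):
    """Objective (1a): lam * tr(Z^T L Z) - tr(G^T Z)."""
    linear = sum(G[n][k] * Z[n][k] for n in range(len(Z)) for k in range(len(Z[0])))
    return lam * laplacian_term(Z, W) - linear

# Toy data (assumed for illustration): a 3-item chain graph and 2 categories.
W_toy = [[0.0, 1.0, 0.0], [1.0, 0.0, 2.0], [0.0, 2.0, 0.0]]
Z_toy = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
G_toy = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
```

Both `laplacian_term(Z_toy, W_toy)` and `pairwise_term(Z_toy, W_toy)` evaluate the same quantity, which is what makes the quadratic term a smoothness penalty over neighboring items.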
2.1 Extreme values of $\lambda$
We can determine the solution(s) for the following extreme values of $\lambda$:

If $\lambda = 0$, then the LASS problem is a linear program (LP) and separates over each item $n = 1,\dots,N$. The solution is $z_{nk} = \delta(k - k_{\max}(n))$ where $k_{\max}(n) = \arg\max\{g_{nk},\ k = 1,\dots,K\}$, i.e., each item is assigned to its most similar category. This tells us what the linear term can do by itself. (If the maximum is achieved for more than one category for a given point, then there is an infinite number of “mixed” solutions that correspond to giving any assignment value to those categories, and zero for the rest.)

If $\lambda = \infty$ or equivalently $\mathbf{G} = \mathbf{0}$, then the LASS problem is a quadratic program with an infinite number of solutions of the form $\mathbf{z}_n = \mathbf{z}_m$ for each $n, m = 1,\dots,N$, i.e., all items have the same assignments. This tells us what the Laplacian term can do by itself.

If $\lambda \to \infty$, i.e., for very large $\lambda$ but still having the linear term, the behavior actually differs from that of $\lambda = \infty$. In the generic case, we expect a unique solution close to $z_{nk} = \delta(k - k_{\max})$ for $n = 1,\dots,N$, where $k_{\max} = \arg\max\{\mathbf{1}_N^T \mathbf{g}_k,\ k = 1,\dots,K\}$, i.e., all items are assigned to the same category, the one having maximum total similarity over all items. (Again, if $\mathbf{G}^T \mathbf{1}_N$ has more than one maximum value, there is an infinite number of solutions corresponding to mixed assignments.) Indeed, if $\lambda$ is very large, the Laplacian term dominates and we have that $\mathbf{z}_n \approx \mathbf{z}_m = \mathbf{z}$ for every pair of items, approximately. Then the LASS problem becomes the following LP
$$\min_{\mathbf{z}} \ -\mathbf{z}^T (\mathbf{G}^T \mathbf{1}_N) \quad \text{s.t.} \quad \mathbf{z}^T \mathbf{1}_K = 1,\ \mathbf{z} \ge \mathbf{0}$$
whose solution allocates all the assignment mass to the category with largest total similarity $\sum_{n=1}^N g_{nk}$.
With intermediate $\lambda$, more interesting solutions appear (particularly when the similarity matrices are sparse), where the item-category similarities are propagated to all points through the item-item similarities.
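The two limiting solutions described above are simple enough to write down directly. The sketch below is an illustration in Python under stated assumptions: the function names are hypothetical, ties are broken by taking the first maximizer (one of the infinitely many valid solutions), and the large-$\lambda$ case returns the limiting assignment for a connected graph rather than the exact solution at any finite $\lambda$.

```python
def lass_lambda_zero(G):
    """lambda = 0: each item goes to its most similar category (row-wise argmax of G)."""
    Z = []
    for g in G:
        k_max = max(range(len(g)), key=lambda j: g[j])  # first maximizer on ties
        Z.append([1.0 if j == k_max else 0.0 for j in range(len(g))])
    return Z

def lass_lambda_large(G):
    """lambda -> infinity: all items go to the category with largest total similarity."""
    K = len(G[0])
    totals = [sum(g[k] for g in G) for k in range(K)]   # entries of G^T 1_N
    k_max = max(range(K), key=lambda j: totals[j])
    return [[1.0 if j == k_max else 0.0 for j in range(K)] for _ in G]

# Toy item-category similarities (assumed for illustration): 3 items, 3 categories.
G_toy = [[1.0, 0.0, -1.0],
         [0.0, 0.5, 0.0],
         [0.2, 0.9, 0.0]]
```

On `G_toy`, the first function assigns item 0 to category 0 and items 1 and 2 to category 1, while the second assigns every item to category 1, whose column sum is largest.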
2.2 Existence and unicity of the solution
The LASS problem is a convex QP, so general results of convex optimization tell us that all minima are global minima. However, since the Hessian of the objective function is positive semidefinite, there can be multiple minima. The following theorem characterizes the solutions, and its corollary gives a sufficient condition for the minimum to be unique.
Theorem 2.1.
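Assume the graph Laplacian $\mathbf{L}$ corresponds to a connected graph and let $\mathbf{Z}^* \in \mathbb{R}^{N \times K}$ be a solution (minimizer) of the LASS problem (1). Then, any other solution has the form $\mathbf{Z}^* + \mathbf{1}_N \mathbf{p}^T$ where $\mathbf{p} \in \mathbb{R}^K$ satisfies the conditions:
$$\mathbf{p}^T \mathbf{1}_K = 0, \qquad \mathbf{p}^T (\mathbf{G}^T \mathbf{1}_N) = 0, \qquad \mathbf{Z}^* + \mathbf{1}_N \mathbf{p}^T \ge \mathbf{0}. \tag{2}$$
In particular, for each $k = 1,\dots,K$ for which $z^*_{nk} = 0$ for some $n$, we have $p_k \ge 0$.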
Proof.
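Call $E(\mathbf{Z})$ the objective function of the LASS problem. Since $E$ is continuous and the feasible set of the problem is bounded and closed in $\mathbb{R}^{N \times K}$, $E$ achieves a minimum value in the feasible set, hence at least one solution exists, which makes the theorem statement well defined. We call this solution $\mathbf{Z}^*$. Now let us show that $E(\mathbf{Z}) \ge E(\mathbf{Z}^*)$ for any other feasible point $\mathbf{Z} = \mathbf{Z}^* + \boldsymbol{\Delta}$, with $\boldsymbol{\Delta} \neq \mathbf{0}$. Simple algebra shows that
$$E(\mathbf{Z}^* + \boldsymbol{\Delta}) = E(\mathbf{Z}^*) + \mathrm{tr}\big((2\lambda \mathbf{L}\mathbf{Z}^* - \mathbf{G})^T \boldsymbol{\Delta}\big) + \lambda\,\mathrm{tr}(\boldsymbol{\Delta}^T \mathbf{L} \boldsymbol{\Delta}). \tag{3}$$
The last term is nonnegative because $\mathbf{L}$ is positive semidefinite. The penultimate term is also nonnegative. To see this, write the KKT conditions for the LASS problem with Lagrange multipliers $\boldsymbol{\pi} \in \mathbb{R}^N$ and $\mathbf{M} \in \mathbb{R}^{N \times K}$ for the equality and inequality constraints, respectively:
$$2\lambda \mathbf{L}\mathbf{Z}^* - \mathbf{G} - \boldsymbol{\pi}\mathbf{1}_K^T - \mathbf{M} = \mathbf{0}, \quad \mathbf{Z}^*\mathbf{1}_K = \mathbf{1}_N, \quad \mathbf{Z}^* \ge \mathbf{0}, \quad \mathbf{M} \ge \mathbf{0}, \quad \mathbf{M} \circ \mathbf{Z}^* = \mathbf{0}.$$
Thus, since $\mathbf{Z}^* + \boldsymbol{\Delta}$ is feasible, $\boldsymbol{\Delta}\mathbf{1}_K = \mathbf{0}$, and $\Delta_{nk} \ge 0$ where $z^*_{nk} = 0$ (i.e., the active inequalities). Then, from the first KKT equation, we have for the penultimate term:
$$\mathrm{tr}\big((2\lambda \mathbf{L}\mathbf{Z}^* - \mathbf{G})^T \boldsymbol{\Delta}\big) = \boldsymbol{\pi}^T \boldsymbol{\Delta}\mathbf{1}_K + \mathrm{tr}(\mathbf{M}^T \boldsymbol{\Delta}) = \sum_{n,k} m_{nk} \Delta_{nk} \ge 0$$
because $m_{nk} \ge 0$ for all $n, k$, $m_{nk} = 0$ for the inactive inequalities, and $\Delta_{nk} \ge 0$ for the active inequalities. Hence, the last two terms in (3) are nonnegative, so $E(\mathbf{Z}) \ge E(\mathbf{Z}^*)$ and $\mathbf{Z}^*$ is a global minimizer.
Now assume that $E(\mathbf{Z}) = E(\mathbf{Z}^*)$. Then the last two terms in (3) must both be zero. Recall that if the graph Laplacian $\mathbf{L}$ corresponds to a connected graph, it has a single null eigenvalue with an eigenvector consisting of all ones. From $\mathrm{tr}(\boldsymbol{\Delta}^T \mathbf{L} \boldsymbol{\Delta}) = 0$ it follows that $\boldsymbol{\Delta} = \mathbf{1}_N \mathbf{p}^T$ for some $\mathbf{p} \in \mathbb{R}^K$. Since $\mathbf{Z}$ is feasible, $\mathbf{p}^T \mathbf{1}_K = 0$, and $\mathbf{Z}^* + \mathbf{1}_N \mathbf{p}^T \ge \mathbf{0}$. From the penultimate term (using $\mathbf{L}\mathbf{1}_N = \mathbf{0}$), $\mathbf{p}^T (\mathbf{G}^T \mathbf{1}_N) = 0$. Finally, from $\mathbf{Z}^* + \mathbf{1}_N \mathbf{p}^T \ge \mathbf{0}$ it follows that $p_k \ge -z^*_{nk}$ for each $n$ and $k$, so if $z^*_{nk} = 0$ for some $n$ (i.e., if any of the inequalities involving $p_k$ are active), then $p_k \ge 0$. ∎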
Corollary 2.2.
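Assume the graph Laplacian $\mathbf{L}$ corresponds to a connected graph and let $\mathbf{Z}^*$ be a solution of the LASS problem (1). If $\max_k (\min_n z^*_{nk}) = 0$ then the solution is unique.
Proof.
We have $\min_n z^*_{nk} = 0$ for each $k$ (since $\mathbf{Z}^* \ge \mathbf{0}$), so for each $k = 1,\dots,K$, there is an item $n$ with $z^*_{nk} = 0$. From theorem 2.1, any other solution must have $p_k \ge 0$ for $k = 1,\dots,K$, and $\mathbf{p}^T \mathbf{1}_K = 0$. Hence $\mathbf{p} = \mathbf{0}$. ∎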
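The sufficient condition of the corollary (every category has at least one item with a zero assignment) is easy to test on a computed solution. The following Python sketch is only an illustration: the function name and the tolerance `tol` are assumptions, and the check says nothing about uniqueness when it returns `False`, since the condition is sufficient but not necessary and presumes a connected graph.

```python
def satisfies_uniqueness_condition(Z, tol=1e-9):
    """True if max_k min_n z_nk is (numerically) zero, i.e., every column of the
    nonnegative assignment matrix Z has at least one entry within tol of 0."""
    K = len(Z[0])
    return all(min(row[k] for row in Z) <= tol for k in range(K))

# Toy assignment matrices (assumed for illustration).
Z_distinct = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # each category has a zero entry
Z_catchall = [[0.5, 0.5], [0.6, 0.4], [0.7, 0.3]]   # no zero entries at all
```

In practice a numerical solver returns assignments that are only approximately zero, which is why a small tolerance is used instead of an exact comparison.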
Remark 2.3.
If the graph Laplacian corresponds to a graph with multiple connected components, then the LASS problem separates into a problem for each component, and the previous theorem holds in each component. Computationally, it is also more efficient to solve each problem separately.
Remark 2.4.
The set (2) of solutions to a LASS problem is a convex polytope.
Remark 2.5.
The condition of corollary 2.2 means that each category has at least one item with a zero assignment to it. In practice, we can expect this condition to hold, and therefore the solution to be unique, if the categories are sufficiently distinctive and is small enough. Equivalently, nonunique solutions arise if some categories are a catchall for the entire dataset. Theoretically, this should always be possible if is large enough, particularly if there are many categories. However, in practice we have never observed nonunique solutions, because for large the algorithm is attracted towards a solution where one category dominates, so the vector can only have one negative component, which makes (2) impossible unless takes a special value, such as . Thus, the symmetric situation where all assignments are possible for does not seem to occur in practice.
Remark 2.6.
Practically, one can always make the solution unique by replacing $\mathbf{L}$ with $\mathbf{L} + \epsilon \mathbf{I}_N$ where $\epsilon > 0$ is a small value, since this makes the objective strongly convex. (This has the equivalent meaning of adding a penalty $\lambda \epsilon \|\mathbf{Z}\|_F^2$ to it, which has the effect of biasing the assignment vector of each item towards the simplex barycenter, i.e., uniform assignments.) However, as noted before, nonunique solutions appear to be rare with practical data, so this does not seem necessary.
2.3 Particular cases
Theorem 2.7.
Assume the graph Laplacian corresponds to a connected graph and let be a solution of the LASS problem (1). Then:

If then , and .

If then , and .
Proof.
Both statements follow from substituting the values in the KKT conditions (4). Conditions (4b)–(4e) are trivially satisfied, so we prove condition (4a) only. For statement 1, we can write row of as , where is row of and is a vector with entries , and we write all row vectors as column vectors. Then we can write row (in column form) of condition (4a) as:
For statement 2, we can write column of condition (4a) as . ∎
Remark 2.8.
The meaning of th. 2.7 is as follows. (1) An item for which no similarity to any category is given (i.e., no expert information) receives as assignment the average of its neighbors. This corresponds to the SSL prediction. (2) A category for which no item has a positive similarity receives no assignments.
2.4 Lagrange multipliers for a solution
Given a feasible point in parameter space, we may want to test whether it is a solution of the LASS problem. For a QP, the KKT conditions are necessary and sufficient for a solution (Nocedal and Wright, 2006). For our problem, and written in matrix form, the KKT conditions are:
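$$2\lambda \mathbf{L}\mathbf{Z} - \mathbf{G} - \boldsymbol{\pi}\mathbf{1}_K^T - \mathbf{M} = \mathbf{0} \tag{4a}$$
$$\mathbf{Z}\mathbf{1}_K = \mathbf{1}_N \tag{4b}$$
$$\mathbf{Z} \ge \mathbf{0} \tag{4c}$$
$$\mathbf{M} \ge \mathbf{0} \tag{4d}$$
$$\mathbf{M} \circ \mathbf{Z} = \mathbf{0} \tag{4e}$$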
where means elementwise product, and and are the Lagrange multipliers associated with the point for the equality and inequality constraints, respectively. We need to compute and . Given , the KKT system (4) has linear equations for unknowns, and its solution is unique if is feasible, as we will see. To obtain it, we multiply (4a) times on the right and obtain as a function of :
(5) 
Substituting it in (4a) gives, together with (4e):
where . This is a linear system of equations for the unknowns in . It separates over each row of , , in a system of the form
where , and correspond to the th row of , and , respectively, written as dimensional column vectors (we omit the index ). This is a linear system of equations for unknowns, which we solve by multiplying on the left times the transpose of the coefficient matrix:
Finally, using the matrix inversion lemma
we obtain the solution for :
(6) 
whose transpose gives row of . Note the formula is well defined because and since for and for at least one , since is feasible.
For the case , one can verify that the above formulas simplify as follows (again, we give , and for item but omitting the index ):
where represents the th row of , and .
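To test a candidate solution together with candidate multipliers, one can evaluate the residual of each KKT condition. The Python sketch below assumes the sign convention in which the multipliers are subtracted in the Lagrangian, so that stationarity reads $2\lambda\mathbf{L}\mathbf{Z} - \mathbf{G} - \boldsymbol{\pi}\mathbf{1}_K^T - \mathbf{M} = \mathbf{0}$ with $\boldsymbol{\pi} \in \mathbb{R}^N$ for the equalities and $\mathbf{M} \in \mathbb{R}^{N\times K}$ for the inequalities. The function name, the returned keys and the toy problem are illustrative assumptions, not part of the original method.

```python
def kkt_residuals(W, G, Z, pi, M, lam):
    """Largest violation of each KKT condition for the LASS problem (dense, small N)."""
    N, K = len(G), len(G[0])
    deg = [sum(row) for row in W]
    # (L Z)_{nk} with L = D - W
    LZ = [[deg[n] * Z[n][k] - sum(W[n][m] * Z[m][k] for m in range(N))
           for k in range(K)] for n in range(N)]
    idx = [(n, k) for n in range(N) for k in range(K)]
    return {
        "stationarity": max(abs(2 * lam * LZ[n][k] - G[n][k] - pi[n] - M[n][k])
                            for n, k in idx),
        "row_sums": max(abs(sum(Z[n]) - 1.0) for n in range(N)),
        "Z_nonneg": max(max(-Z[n][k], 0.0) for n, k in idx),
        "M_nonneg": max(max(-M[n][k], 0.0) for n, k in idx),
        "complementarity": max(abs(M[n][k] * Z[n][k]) for n, k in idx),
    }

# Toy problem (assumed): 4-item chain with a weak middle edge, 2 categories,
# one labeled item at each end. The candidate below was worked out by hand.
W_toy = [[0, 1, 0, 0], [1, 0, 0.1, 0], [0, 0.1, 0, 1], [0, 0, 1, 0]]
G_toy = [[1, 0], [0, 0], [0, 0], [0, 1]]
Z_toy = [[1, 0], [11 / 12, 1 / 12], [1 / 12, 11 / 12], [0, 1]]
pi_toy = [-5 / 6, 0, 0, -5 / 6]
M_toy = [[0, 2 / 3], [0, 0], [0, 0], [2 / 3, 0]]
```

All five residuals vanish (up to rounding) for the toy candidate with $\lambda = 1$, which certifies it as a solution of that small problem because the KKT conditions are necessary and sufficient for a convex QP.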
3 A simple, efficient algorithm to solve the QP
It is possible to solve problem (1) in different ways, but one must be careful in designing an effective algorithm because the number of variables and the number of constraints grow proportionally to the number of data points, and can then be very large. We describe here one algorithm that is very simple, has guaranteed convergence without line searches, and takes advantage of the structure of the problem and the sparsity of $\mathbf{L}$. It is based on the alternating direction method of multipliers (ADMM), combined with a direct linear solver using the Schur complement and caching the Cholesky factorization of $2\lambda\mathbf{L} + \rho\mathbf{I}$.
3.1 QP solution using ADMM
We briefly review how to solve a QP using the alternating direction method of multipliers (ADMM), following (Boyd et al., 2011). Consider the QP
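$$\min_{\mathbf{x}} \ \tfrac{1}{2} \mathbf{x}^T \mathbf{P} \mathbf{x} + \mathbf{q}^T \mathbf{x} \tag{7}$$
$$\text{s.t.} \quad \mathbf{A}\mathbf{x} = \mathbf{b}, \quad \mathbf{x} \ge \mathbf{0} \tag{8}$$
over $\mathbf{x} \in \mathbb{R}^D$, where $\mathbf{P}$ is positive (semi)definite. In ADMM, we introduce new variables $\mathbf{z} \in \mathbb{R}^D$ so that we replace the inequalities with an indicator function $g(\mathbf{z})$, which is zero in the nonnegative orthant $\mathbf{z} \ge \mathbf{0}$ and $\infty$ otherwise. Then we write the problem as
$$\min_{\mathbf{x}, \mathbf{z}} \ f(\mathbf{x}) + g(\mathbf{z}) \tag{9}$$
$$\text{s.t.} \quad \mathbf{x} = \mathbf{z} \tag{10}$$
where
$$f(\mathbf{x}) = \tfrac{1}{2} \mathbf{x}^T \mathbf{P} \mathbf{x} + \mathbf{q}^T \mathbf{x}, \qquad \mathrm{dom}\, f = \{\mathbf{x} \in \mathbb{R}^D:\ \mathbf{A}\mathbf{x} = \mathbf{b}\}$$
is the original objective with its domain restricted to the equality constraint. The augmented Lagrangian is
$$\mathcal{L}_\rho(\mathbf{x}, \mathbf{z}, \boldsymbol{\zeta}) = f(\mathbf{x}) + g(\mathbf{z}) + \boldsymbol{\zeta}^T (\mathbf{x} - \mathbf{z}) + \frac{\rho}{2} \|\mathbf{x} - \mathbf{z}\|^2 \tag{11}$$
and the ADMM iteration has the form:
$$\mathbf{x} \leftarrow \arg\min_{\mathbf{x}} \mathcal{L}_\rho(\mathbf{x}, \mathbf{z}, \boldsymbol{\zeta}), \qquad \mathbf{z} \leftarrow \arg\min_{\mathbf{z}} \mathcal{L}_\rho(\mathbf{x}, \mathbf{z}, \boldsymbol{\zeta}), \qquad \boldsymbol{\zeta} \leftarrow \boldsymbol{\zeta} + \rho (\mathbf{x} - \mathbf{z})$$
where $\boldsymbol{\zeta}$ is the dual variable (the Lagrange multiplier estimates for the constraint $\mathbf{x} = \mathbf{z}$), and the updates are applied in order and modify the variables immediately. Here, we use the scaled form of the ADMM iteration, which is simpler. It is obtained by combining the linear and quadratic terms in $\mathbf{x} - \mathbf{z}$ and using a scaled dual variable $\mathbf{u} = \boldsymbol{\zeta}/\rho$:
$$\mathcal{L}_\rho(\mathbf{x}, \mathbf{z}, \mathbf{u}) = f(\mathbf{x}) + g(\mathbf{z}) + \frac{\rho}{2} \|\mathbf{x} - \mathbf{z} + \mathbf{u}\|^2 - \frac{\rho}{2} \|\mathbf{u}\|^2.$$
Since $g$ is the indicator function for the nonnegative orthant, the solution of the $\mathbf{z}$ update is simply to threshold each entry in $\mathbf{x} + \mathbf{u}$ by taking its nonnegative part. Finally, the ADMM iteration is:
$$\mathbf{x} \leftarrow \arg\min_{\mathbf{x}} \Big( f(\mathbf{x}) + \frac{\rho}{2} \|\mathbf{x} - \mathbf{z} + \mathbf{u}\|^2 \Big) \tag{12}$$
$$\mathbf{z} \leftarrow (\mathbf{x} + \mathbf{u})_+ \tag{13}$$
$$\mathbf{u} \leftarrow \mathbf{u} + \mathbf{x} - \mathbf{z} \tag{14}$$
where the updates are applied in order and modify the variables immediately, $(\cdot)_+ = \max(\cdot, 0)$ applies elementwise, and $\|\cdot\|$ is the Euclidean norm. The penalty parameter $\rho > 0$ is set by the user, and $\boldsymbol{\zeta} = \rho\mathbf{u}$ are the Lagrange multiplier estimates for the inequalities. The $\mathbf{x}$ update is an equality-constrained QP with KKT conditions
$$\begin{pmatrix} \mathbf{P} + \rho\mathbf{I} & \mathbf{A}^T \\ \mathbf{A} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{x} \\ \boldsymbol{\nu} \end{pmatrix} = \begin{pmatrix} -\mathbf{q} + \rho(\mathbf{z} - \mathbf{u}) \\ \mathbf{b} \end{pmatrix}. \tag{15}$$
Solving this linear system gives the optimal $\mathbf{x}$ and $\boldsymbol{\nu}$ (the Lagrange multipliers for the equality constraint). The ADMM iteration consists of very simple updates to the relevant variables, but its success crucially relies on being able to solve the $\mathbf{x}$ update efficiently. Given the structure of our problem, it is convenient to use a direct solution using the Schur complement, that is:
$$\big( \mathbf{A} (\mathbf{P} + \rho\mathbf{I})^{-1} \mathbf{A}^T \big) \boldsymbol{\nu} = \mathbf{A} (\mathbf{P} + \rho\mathbf{I})^{-1} \big( \rho(\mathbf{z} - \mathbf{u}) - \mathbf{q} \big) - \mathbf{b} \tag{16a}$$
$$(\mathbf{P} + \rho\mathbf{I})\, \mathbf{x} = -\mathbf{A}^T \boldsymbol{\nu} - \mathbf{q} + \rho(\mathbf{z} - \mathbf{u}). \tag{16b}$$
Eq. (16a) results from left-multiplying the first equation in (15) by $\mathbf{A}(\mathbf{P} + \rho\mathbf{I})^{-1}$ and using the second equation in (15) to eliminate $\mathbf{x}$. Eq. (16b) results from substituting $\boldsymbol{\nu}$ back in the first equation in (15) and solving it for $\mathbf{x}$.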
3.2 Application to our QP
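We now write the ADMM updates for our QP (1), where we identify:
$$\mathbf{x} = \mathrm{vec}(\mathbf{Z}), \quad \mathbf{P} = 2\lambda\, \mathrm{diag}(\mathbf{L}, \dots, \mathbf{L}), \quad \mathbf{q} = -\mathrm{vec}(\mathbf{G}), \quad \mathbf{A} = \mathbf{1}_K^T \otimes \mathbf{I}_N, \quad \mathbf{b} = \mathbf{1}_N$$
where $\mathrm{vec}(\cdot)$ concatenates the columns of its argument into a single column vector. Given the structure of these matrices, the solution of the KKT system (15) by using the Schur complement (16) simplifies considerably. The basic reasons are that (1) the matrix $\mathbf{P}$ is block-diagonal with $K$ identical copies of the graph Laplacian $\mathbf{L}$, which is itself usually sparse; and (2) the especially simple form of the equality constraint matrix $\mathbf{A}$. Thus, even though the $\mathbf{Z}$ update involves solving a large linear system of $NK$ equations, it is equivalent to solving $K$ systems of $N$ equations where the coefficient matrix is the same for each system and besides is constant and sparse, equal to $2\lambda\mathbf{L} + \rho\mathbf{I}$. In turn, these linear systems may be solved efficiently in one of the two following ways: (1) preferably, by caching the Cholesky factorization of this matrix (using a good permutation to reduce fill-in), if it does not add so much fill that it cannot be stored; or (2) by using an iterative linear solver such as conjugate gradients, initialized with a warm start, preconditioned, and exiting it before convergence, so as to carry out faster, inexact updates.
The final algorithm is as follows, with its variables written in matrix form. The inputs are the affinity matrices $\mathbf{W}$ and $\mathbf{G}$, from which we construct the graph Laplacian $\mathbf{L}$. We then choose $\rho > 0$ and set
$$\mathbf{h} = \frac{1}{K} (-\mathbf{G}\mathbf{1}_K + \rho \mathbf{1}_N), \qquad \mathbf{R}\mathbf{R}^T = 2\lambda\mathbf{L} + \rho\mathbf{I}.$$
The Cholesky factor $\mathbf{R}$ is used to solve linear system (17b). We then iterate, in order, the following updates until convergence:
$$\boldsymbol{\nu} \leftarrow \frac{\rho}{K} (\mathbf{Y} - \mathbf{U}) \mathbf{1}_K - \mathbf{h} \tag{17a}$$
$$\mathbf{Z} \leftarrow (2\lambda\mathbf{L} + \rho\mathbf{I})^{-1} \big( \rho(\mathbf{Y} - \mathbf{U}) + \mathbf{G} - \boldsymbol{\nu}\mathbf{1}_K^T \big) \tag{17b}$$
$$\mathbf{Y} \leftarrow (\mathbf{Z} + \mathbf{U})_+ \tag{17c}$$
$$\mathbf{U} \leftarrow \mathbf{U} + \mathbf{Z} - \mathbf{Y} \tag{17d}$$
where $\mathbf{Z}$ are the primal variables, $\mathbf{Y}$ the auxiliary variables, $\mathbf{U}$ the Lagrange multiplier estimates for $\mathbf{Y}$, and $\boldsymbol{\nu}$ the Lagrange multipliers for the equality constraints. The solution for the linear system in the $\mathbf{Z}$ update may be obtained by using two triangular backsolves if using the Cholesky factor of $2\lambda\mathbf{L} + \rho\mathbf{I}$, or using an iterative method such as conjugate gradients if the Cholesky factor is not available.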
The iteration (17) is very simple to implement. It requires no line searches and has only one user parameter, the penalty parameter. The algorithm converges for any positive value of the penalty parameter, but this value does affect the convergence rate.
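For readers who want to experiment with iteration (17) outside Matlab, here is a minimal pure-Python sketch on dense lists. It is an illustration under stated assumptions rather than the authors' implementation: the names `solve` and `lass_admm` are hypothetical, the linear system is solved by plain Gaussian elimination instead of a cached sparse Cholesky factor (reasonable only for toy sizes), and the iteration starts from $\mathbf{Y} = \mathbf{U} = \mathbf{0}$.

```python
def solve(A, B):
    """Solve A X = B (A: n x n, B: n x k) by Gaussian elimination with partial pivoting."""
    n, k = len(A), len(B[0])
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + k):
                M[r][j] -= f * M[c][j]
    X = [[0.0] * k for _ in range(n)]
    for i in reversed(range(n)):
        for j in range(k):
            s = M[i][n + j] - sum(M[i][m] * X[m][j] for m in range(i + 1, n))
            X[i][j] = s / M[i][i]
    return X

def lass_admm(W, G, lam, rho=1.0, maxit=2000, tol=1e-10):
    """ADMM updates (17a)-(17d) for the LASS problem; returns Z, Y, U, nu."""
    N, K = len(G), len(G[0])
    deg = [sum(row) for row in W]
    # Constant coefficient matrix 2*lam*L + rho*I, with L = D - W
    A = [[2 * lam * ((deg[n] if n == m else 0.0) - W[n][m]) + (rho if n == m else 0.0)
          for m in range(N)] for n in range(N)]
    h = [(-sum(G[n]) + rho) / K for n in range(N)]
    Y = [[0.0] * K for _ in range(N)]
    U = [[0.0] * K for _ in range(N)]
    Z_old = [[0.0] * K for _ in range(N)]
    for _ in range(maxit):
        nu = [(rho / K) * sum(Y[n][k] - U[n][k] for k in range(K)) - h[n]
              for n in range(N)]                                          # (17a)
        B = [[rho * (Y[n][k] - U[n][k]) + G[n][k] - nu[n] for k in range(K)]
             for n in range(N)]
        Z = solve(A, B)                                                   # (17b)
        Y = [[max(Z[n][k] + U[n][k], 0.0) for k in range(K)] for n in range(N)]  # (17c)
        U = [[U[n][k] + Z[n][k] - Y[n][k] for k in range(K)] for n in range(N)]  # (17d)
        if max(abs(Z[n][k] - Z_old[n][k]) for n in range(N) for k in range(K)) < tol:
            break
        Z_old = Z
    return Z, Y, U, nu

# Toy problem (assumed): two tightly linked pairs joined by a weak edge; one item
# in each pair has a positive similarity to its own category.
W_toy = [[0, 1, 0, 0], [1, 0, 0.1, 0], [0, 0.1, 0, 1], [0, 0, 1, 0]]
G_toy = [[1, 0], [0, 0], [0, 0], [0, 1]]
```

Calling `lass_admm(W_toy, G_toy, lam=1.0)` propagates each label to its tightly linked neighbor, and every row of the returned `Z` sums to one at every iterate, as the equality-constrained update guarantees. A practical implementation would factorize the coefficient matrix once and reuse the factor in every iteration.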
3.3 Remarks
Theorem 3.1.
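At each iterate in the algorithm updates (17), $\mathbf{Z}\mathbf{1}_K = \mathbf{1}_N$, $\mathbf{Y} \ge \mathbf{0}$, $\mathbf{U} \le \mathbf{0}$ and $\mathbf{Y} \circ \mathbf{U} = \mathbf{0}$.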
Proof.
Theorem 3.2.
Upon convergence of algorithm (17), $\mathbf{Z}$ is a solution with Lagrange multipliers $\boldsymbol{\pi} = -\boldsymbol{\nu}$ and $\mathbf{M} = -\rho\mathbf{U}$.
Proof.
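Let us compare the KKT conditions (4) with the algorithm updates (17) upon convergence, i.e., at a fixed point of the update equations. From th. 3.1 we know that $\mathbf{Z}\mathbf{1}_K = \mathbf{1}_N$ and $\mathbf{U} \le \mathbf{0}$, which are KKT conditions (4b) and (4d). From eq. (17d) we must have $\mathbf{Z} = \mathbf{Y}$, so from eq. (17c) we have $\mathbf{Z} \ge \mathbf{0}$, which is KKT condition (4c). From eqs. (17c)–(17d) we have $\mathbf{Y} = (\mathbf{Y} + \mathbf{U})_+$ and $\mathbf{Z} = \mathbf{Y}$, therefore $\mathbf{Z} \circ \mathbf{U} = \mathbf{0}$, which is KKT condition (4e). Finally, from eq. (17b) we have:
$$(2\lambda\mathbf{L} + \rho\mathbf{I})\, \mathbf{Z} = \rho(\mathbf{Z} - \mathbf{U}) + \mathbf{G} - \boldsymbol{\nu}\mathbf{1}_K^T \ \Longrightarrow\ 2\lambda\mathbf{L}\mathbf{Z} - \mathbf{G} + \boldsymbol{\nu}\mathbf{1}_K^T + \rho\mathbf{U} = \mathbf{0}$$
which matches KKT condition (4a). The change of sign in the multipliers between the algorithm and the KKT conditions is due to the sign choice in the Lagrangian (adding in eq. (11), subtracting in (4)). ∎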
Remark 3.3.
In practice, the algorithm is stopped before convergence, and $\mathbf{Z}$, $-\boldsymbol{\nu}$ and $-\rho\mathbf{U}$ are estimates for a solution and its Lagrange multipliers, respectively. The estimate $\mathbf{Z}$ may not be feasible, in particular the values $z_{nk}$ need not be in $[0,1]$, since this is only guaranteed upon convergence. If needed, a feasible point may be obtained by projecting each row of $\mathbf{Z}$ onto the simplex (see section 4).
Remark 3.4.
If $\mathbf{G} = \mathbf{0}$ (or $\lambda = \infty$) one solution is given by $z_{nk} = 1/K$, for which the Lagrange multipliers are $\boldsymbol{\pi} = \mathbf{0}$ and $\mathbf{M} = \mathbf{0}$, thus the inequality constraints are inactive and the equality constraints are weakly active. Indeed, that value is also a solution of the unconstrained problem.
3.4 Computational complexity
Each step in (17) is except for the linear system solution in (17b). If is sparse, using the Cholesky factor makes this step as well, and adds a onetime setup cost of computing the Cholesky factor (which is also linear in with sufficiently sparse matrices). Thus, each iteration of the algorithm is cheap. In practice, for good values of , the algorithm quickly approaches the solution in the first iterations and then converges slowly, as is known with ADMM algorithms in general. However, since each iteration is so cheap, we can run a large number of them if high accuracy is needed. As a sample runtime, for a problem with items and categories (i.e., has parameters) and using a nearestneighbor graph, the Cholesky factorization takes s and each iteration takes s in a PC.
For large-scale problems, the slow convergence becomes more problematic, and it is possible that the Cholesky factor may create too much fill even with a good preordering. One can use instead an iterative linear solver, such as preconditioned conjugate gradients. Scaling up the training is a topic for future research.
3.5 Initialization
If the LASS problem is itself a subproblem in a larger problem (as in the Laplacian modes clustering algorithm; Wang and CarreiraPerpiñán, 2014b), one should warmstart the iteration of eq. (17) from the values of and in the previous outerloop iteration. Otherwise, we can simply initialize , which (substituting in eqs. (17a)–(17b)) gives (where is the matrix with centered rows, and is the simplex barycenter). This initialization is closely related to the projection on the simplex of the unconstrained optimum of the LASS problem, as we show next. Consider first the unconstrained minimization
This problem is in fact unbounded unless , because taking for any , , since , we have , which can be made arbitrarily negative. We could still try to define a from the gradient , but this involves a linear system on , whose computational cost defeats the purpose of the initialization. Instead, we can consider the unconstrained minimization
for , which is strongly convex and has a unique minimum , which we can compute cheaply if we reuse the Cholesky factor for . Now, we can write the initialization (for ) in terms of as , which means that each row vector of is translated along the direction . Since this direction is orthogonal to the simplex, both and have the same projection on it.
Finally, note that if is large, then and , both of which project onto the simplex barycenter, independently of the problem data.
3.6 Stopping criterion
We stop when tol, i.e., when the change in absolute terms in in the last iterations falls below a set tolerance tol (e.g. ). Using an absolute criterion here is equivalent to using a relative one, since , . Since our iterations are so cheap, evaluating takes a runtime comparable to that of the updates in (17) (except for the update, possibly), so testing the stopping criterion every – iterations saves around % runtime.
Another possible stopping criterion is to test whether the KKT conditions (4) are satisfied up to a given tolerance, using the Lagrange multipliers’ estimates provided by the algorithm at each iterate. Each iterate always satisfies (4b) and (4d), so we only need to check (4a), (4c) and (4e) (if the iterate is interior to the inequalities it will also satisfy (4c) and (4e)). Still, it is faster to check for changes in .
Since the iterates in the algorithm need not be feasible, they may be slightly infeasible once the stopping criterion is satisfied. If desired, a feasible can be obtained by projecting each assignment vector onto the simplex (see section 4).
3.7 Optimal penalty parameter
The speed at which ADMM converges depends on the quadratic penalty parameter (Boyd et al., 2011). We illustrate this with the “2 moons” dataset in fig. 1 ( points, categories, nearestneighbor graph, ), where we set positive similarity values for one point in each cluster, resulting in each cluster being assigned to a different category, as expected. The problem has parameters and we ran iterations, which took 11 s. Little work exists on how to select so as to achieve fastest convergence. Recently, for QPs, Ghadimi et al. (2013) suggest to use where and are the smallest (nonzero) and largest eigenvalue of the Laplacian. In fig. 1, , and we show the relative error vs number of iterations for different (initial , with relative error ). Asymptotically, the convergence is linear; a model gives for . While, in the long term, values close to work best, in the short term, smaller values are able to achieve an acceptably low relative error () in just a few iterations, so an adaptive would be best overall.


3.8 Matlab code
The following Matlab code implements the algorithm, assuming a direct solution of the update linear system.
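function [Z,Y,U,nu] = lass(L,l,G,r,Y,U,maxit,tol)
% L: graph Laplacian (NxN, sparse); l: lambda; G: item-category similarities (NxK);
% r: penalty parameter rho; Y,U: initial auxiliary variables and scaled multipliers.
[N,K] = size(G);
LI = 2*l*L + r*speye(N,N);              % constant coefficient matrix of the Z update
h = (-sum(G,2) + r)/K;
Zold = zeros(N,K);
for i=1:maxit
  nu = (r/K)*sum(Y-U,2) - h;            % (17a)
  Z = LI \ bsxfun(@minus,r*(Y-U)+G,nu); % (17b)
  Y = max(Z+U,0);                       % (17c)
  U = U + Z - Y;                        % (17d)
  if max(abs(Z(:)-Zold(:))) < tol, break; end
  Zold = Z;
end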
4 Outofsample mapping
Having trained the system, that is, having found the optimal assignments $\mathbf{Z}$ for the training set items, we are given a new, test item $\mathbf{x}$ (for example, a new point $\mathbf{x} \in \mathbb{R}^D$), along with its item-item and item-category similarities $\mathbf{w} = (w_n)$, $n = 1,\dots,N$ and $\mathbf{g} = (g_k)$, $k = 1,\dots,K$, respectively, and we wish to find its assignment $\mathbf{z}(\mathbf{x})$ to each category. We follow the reasoning of Carreira-Perpiñán and Lu (2007) to derive an out-of-sample mapping. While one could train again the whole system augmented with $\mathbf{x}$, this would be very time-consuming, and the assignments of all points would change (although very slightly). A more practical and still natural way to define an out-of-sample mapping is to solve a problem of the form (1) with a dataset consisting of the original training set augmented with $\mathbf{x}$, but keeping $\mathbf{Z}$ fixed to the values obtained during training. Hence, the only free parameter is the assignment vector $\mathbf{z}$ for the new point $\mathbf{x}$. After dropping constant terms, the optimization problem (1) reduces to the following quadratic program over $K$ variables:
$$\min_{\mathbf{z}} \ \tfrac{1}{2} \|\mathbf{z} - (\bar{\mathbf{z}} + \gamma \mathbf{g})\|^2 \tag{18a}$$
$$\text{s.t.} \quad \mathbf{z}^T \mathbf{1}_K = 1 \tag{18b}$$
$$\mathbf{z} \ge \mathbf{0} \tag{18c}$$
where $\gamma = 1 / (2\lambda\, \mathbf{1}_N^T \mathbf{w})$ and $\bar{\mathbf{z}} = \mathbf{Z}^T \mathbf{w} / (\mathbf{1}_N^T \mathbf{w}) = \sum_{n=1}^N \frac{w_n}{\sum_{n'=1}^N w_{n'}} \mathbf{z}_n$ is a weighted average of the training points’ assignments, and so $\bar{\mathbf{z}} + \gamma\mathbf{g}$ is itself an average between this and the item-category affinities. Thus, the solution is the Euclidean projection $\Pi(\bar{\mathbf{z}} + \gamma\mathbf{g})$ of the $K$-dimensional vector $\bar{\mathbf{z}} + \gamma\mathbf{g}$ onto the probability simplex. This can be efficiently computed, in a finite number of steps, with a simple $\mathcal{O}(K \log K)$ algorithm (Duchi et al., 2008; Wang and Carreira-Perpiñán, 2014a). Computationally, assuming $\mathbf{w}$ is sparse, the most expensive step is finding the neighbors to construct $\mathbf{w}$. With large $N$, one should use some form of hashing (Shakhnarovich et al., 2006) to retrieve approximate neighbors quickly.
The out-of-sample prediction for a point in the training set does not generally equal the value it received during training (although it does not differ much from it either). That is, $\mathbf{z}(\mathbf{x}_n) \neq \mathbf{z}_n$, where $\mathbf{z}(\mathbf{x}_n)$ uses the training similarities of item $n$ (row $n$ of $\mathbf{W}$ and of $\mathbf{G}$), for $n = 1,\dots,N$. This is also true of semisupervised learning, and it simply reflects the fact that the out-of-sample mapping smoothes, rather than interpolates, the training data.
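The out-of-sample mapping, i.e., projecting $\bar{\mathbf{z}} + \gamma\mathbf{g}$ onto the probability simplex with $\gamma = 1/(2\lambda\,\mathbf{1}_N^T\mathbf{w})$, can be sketched in a few lines of Python. This is an illustration under assumptions: the function names are hypothetical, the projection is the standard sort-based algorithm in the style of Duchi et al. (2008), and `w` is assumed to have a positive sum.

```python
def project_simplex(v):
    """Euclidean projection of v onto {z : z >= 0, sum(z) = 1} (sort-based, O(K log K))."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:          # holds for a prefix of j; keep the last threshold
            theta = t
    return [max(x - theta, 0.0) for x in v]

def out_of_sample(Z, w, g, lam):
    """Assignment of a new item with item-item similarities w (length N) and
    item-category similarities g (length K), keeping the trained Z fixed."""
    total = sum(w)                                   # 1_N^T w, assumed > 0
    K = len(g)
    z_bar = [sum(w[n] * Z[n][k] for n in range(len(Z))) / total for k in range(K)]
    gamma = 1.0 / (2.0 * lam * total)
    return project_simplex([z_bar[k] + gamma * g[k] for k in range(K)])

# Toy data (assumed): two training items with one-hot assignments.
Z_toy = [[1.0, 0.0], [0.0, 1.0]]
w_toy = [3.0, 1.0]
```

With no expert information (`g = [0, 0]`) the result is the neighbor average `[0.75, 0.25]`; adding a positive similarity to the second category with a small $\lambda$ pulls the assignment towards it, and sweeping $\lambda$ only requires recomputing the projection, since `z_bar` does not depend on it.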
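Given a solution of the LASS training problem, the out-of-sample mapping is uniquely defined, because the problem (18) is strongly convex. However, as described in section 2.2, in particular settings the solution of the LASS training problem may not be unique, and a natural question is: what is the relation between the out-of-sample mappings for two different solutions? From th. 2.1, the solutions have the form $\mathbf{Z}^* + \mathbf{1}_N \mathbf{p}^T$ where $\mathbf{Z}^*$ is any particular solution and $\mathbf{p}$ satisfies $\mathbf{p}^T \mathbf{1}_K = 0$, $\mathbf{p}^T (\mathbf{G}^T \mathbf{1}_N) = 0$ and $\mathbf{Z}^* + \mathbf{1}_N \mathbf{p}^T \ge \mathbf{0}$. Then the out-of-sample mapping for a solution has the form
$$\mathbf{z}(\mathbf{x}) = \Pi(\bar{\mathbf{z}}^* + \gamma\mathbf{g} + \mathbf{p})$$
where $\Pi(\bar{\mathbf{z}}^* + \gamma\mathbf{g})$ is the out-of-sample mapping for the base solution $\mathbf{Z}^*$. If $\mathbf{p}$ were parallel to the vector $\mathbf{1}_K$ then the out-of-sample mappings for different solutions would actually coincide, but in fact $\mathbf{p}^T \mathbf{1}_K = 0$, so the out-of-sample mappings for different solutions correspond to sliding along the simplex by vector $\mathbf{p}$ (which must respect the remaining conditions above, of course).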
As a function of $\lambda$, the out-of-sample mapping takes the following extreme values:

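If $\lambda = 0$ or $\mathbf{w} = \mathbf{0}$, $z_k(\mathbf{x}) = \delta(k - k_{\max})$ where $k_{\max} = \arg\max\{g_k,\ k = 1,\dots,K\}$, i.e., the item is assigned to its most similar category (or any mixture thereof in case of ties).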

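If $\lambda = \infty$ or $\mathbf{g} = \mathbf{0}$, $\mathbf{z}(\mathbf{x}) = \bar{\mathbf{z}}$, independently of $\mathbf{g}$ or of $\lambda$, respectively. This corresponds to the SSL out-of-sample mapping.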
In between these, the out-of-sample mapping as a function of $\lambda$ is a piecewise linear path in the simplex, which represents the tradeoff between the crowd ($\mathbf{w}$) and expert ($\mathbf{g}$) wisdoms. This path is quite different from the simple average of $\bar{\mathbf{z}}$ and $\mathbf{g}$ (which need not even be feasible), and may produce exact 0s or 1s for some entries.
The LASS out-of-sample mapping offers an extra degree of flexibility to the user, which may be used on a case-by-case basis for each test item. The user has the prerogative to set $\lambda$ to favor more or less the expert vs the crowd opinion, and in fact to explore the entire continuum for $\lambda \in [0, \infty]$. The user can also explore what-if scenarios by changing $\mathbf{g}$ itself, given the vector $\bar{\mathbf{z}}$ (e.g. what would the assignment vector look like if we think that test item $\mathbf{x}$ belongs to category $k$ but not to category $k'$?). These computations are all relatively efficient because the bottleneck, which is the computation of $\bar{\mathbf{z}}$, is done once only.
Note that the out-of-sample mapping is nonlinear and nonparametric, and it maps an input $\mathbf{x}$ (given its affinity information) onto a valid assignment vector in the probability simplex. Hence, LASS can also be considered as learning nonparametric conditional distributions over the categories, given partial supervision.
5 Related work
5.1 Semisupervised learning with a Laplacian penalty (SSL)
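In semisupervised learning (SSL) with a Laplacian penalty (Zhu et al., 2003), the basic idea is that we are given an affinity matrix $\mathbf{W}$ and corresponding graph Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{W}$ on $N$ items, and the labels for a subset of the items. Then, the labels for the remaining, unlabeled items are such that they minimize the Laplacian penalty, or equivalently they are the smoothest function on the graph that satisfies the given labels (“harmonic” function). Call $\mathbf{Z}_u$ of $N_u \times K$ and $\mathbf{Z}_l$ of $N_l \times K$ the matrices of labels for the unlabeled and labeled items, respectively, where $N = N_l + N_u$, and $\mathbf{Z}^T = (\mathbf{Z}_u^T \ \mathbf{Z}_l^T)$. To obtain $\mathbf{Z}_u$ we minimize $\mathrm{tr}(\mathbf{Z}^T \mathbf{L} \mathbf{Z})$ over $\mathbf{Z}_u$, with fixed $\mathbf{Z}_l$:
$$\min_{\mathbf{Z}_u} \ \mathrm{tr}\left( \begin{pmatrix} \mathbf{Z}_u \\ \mathbf{Z}_l \end{pmatrix}^T \begin{pmatrix} \mathbf{L}_u & \mathbf{L}_{ul} \\ \mathbf{L}_{ul}^T & \mathbf{L}_l \end{pmatrix} \begin{pmatrix} \mathbf{Z}_u \\ \mathbf{Z}_l \end{pmatrix} \right) \ \Longrightarrow\ \mathbf{Z}_u = -\mathbf{L}_u^{-1} \mathbf{L}_{ul} \mathbf{Z}_l. \tag{19}$$
Thus, computationally the solution involves a sparse linear system of $N_u \times N_u$. An out-of-sample mapping for a new test item $\mathbf{x}$ with affinity vector $\mathbf{w}$ with respect to the training set can be derived by SSL again, taking $\mathbf{Z}_l$ of $N \times K$ as all the trained labels (given and predicted) and $\mathbf{z}$ as the free label. This gives a closed-form expression
$$\mathbf{z}(\mathbf{x}) = \frac{\mathbf{Z}^T \mathbf{w}}{\mathbf{1}_N^T \mathbf{w}} = \sum_{n=1}^N \frac{w_n}{\sum_{n'=1}^N w_{n'}} \mathbf{z}_n \tag{20}$$
which is the average of the labels of $\mathbf{x}$’s neighbors, making clear the smoothing behavior of the Laplacian. SSL with a Laplacian penalty is very effective in problems where there are very few labels, i.e., $N_l \ll N_u$, but the graph structure is highly predictive of each item’s labels. Essentially, the given labels are propagated throughout the graph.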
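In our setting, the labels are the item-category assignments $\mathbf{Z}$, and we have the following result.
Theorem 5.1.
In problem (19), if $\mathbf{Z}_l \mathbf{1}_K = \mathbf{1}_{N_l}$ and $\mathbf{Z}_l \ge \mathbf{0}$ then $\mathbf{Z}_u \mathbf{1}_K = \mathbf{1}_{N_u}$ and $\mathbf{Z}_u \ge \mathbf{0}$.
Proof.
Since $\mathbf{L}\mathbf{1}_N = \mathbf{0}$ we have $\mathbf{L}_u \mathbf{1}_{N_u} + \mathbf{L}_{ul} \mathbf{1}_{N_l} = \mathbf{0}$, so
$$\mathbf{Z}_u \mathbf{1}_K = -\mathbf{L}_u^{-1} \mathbf{L}_{ul} \mathbf{Z}_l \mathbf{1}_K = -\mathbf{L}_u^{-1} \mathbf{L}_{ul} \mathbf{1}_{N_l} = \mathbf{L}_u^{-1} \mathbf{L}_u \mathbf{1}_{N_u}.$$
Hence $\mathbf{Z}_u \mathbf{1}_K = \mathbf{1}_{N_u}$. That $\mathbf{Z}_u \ge \mathbf{0}$ follows from the maximum principle for harmonic functions (Doyle and Snell, 1984): each of the unknowns must lie between the minimum and maximum label values, i.e., in $[0,1]$. (Strictly, they will lie in $(0,1)$ or be all equal to a constant.) ∎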
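The harmonic SSL solution $\mathbf{Z}_u = -\mathbf{L}_u^{-1}\mathbf{L}_{ul}\mathbf{Z}_l$ and the statement of theorem 5.1 can be illustrated on a toy graph. The Python sketch below is an assumption-laden illustration: the names `solve` and `harmonic_labels` are hypothetical, the blocks of the Laplacian are passed in explicitly, and a dense Gaussian elimination stands in for the sparse solver one would use in practice.

```python
def solve(A, B):
    """Solve A X = B (A: n x n, B: n x k) by Gaussian elimination with partial pivoting."""
    n, k = len(A), len(B[0])
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + k):
                M[r][j] -= f * M[c][j]
    X = [[0.0] * k for _ in range(n)]
    for i in reversed(range(n)):
        for j in range(k):
            s = M[i][n + j] - sum(M[i][m] * X[m][j] for m in range(i + 1, n))
            X[i][j] = s / M[i][i]
    return X

def harmonic_labels(L_u, L_ul, Z_l):
    """Labels of the unlabeled items: solve L_u Z_u = -L_ul Z_l."""
    K = len(Z_l[0])
    rhs = [[-sum(L_ul[i][j] * Z_l[j][k] for j in range(len(Z_l))) for k in range(K)]
           for i in range(len(L_ul))]
    return solve(L_u, rhs)

# Toy chain (assumed): labeled a - unlabeled 0 - unlabeled 1 - labeled b, unit weights.
L_u_toy = [[2.0, -1.0], [-1.0, 2.0]]      # Laplacian block on the unlabeled items
L_ul_toy = [[-1.0, 0.0], [0.0, -1.0]]     # unlabeled-to-labeled block
Z_l_toy = [[1.0, 0.0], [0.0, 1.0]]        # valid one-hot assignments for a and b
```

On this chain the unlabeled items receive `[2/3, 1/3]` and `[1/3, 2/3]`: nonnegative rows with unit sum, as the theorem predicts when the given labels are themselves valid assignments.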
Thus, in the special case where the given labels are valid assignments (nonnegative with unit sum), the predicted labels will also be valid assignments, and we need not subject the problem explicitly to simplex constraints, which simplifies it computationally. This occurs in the standard semisupervised classification setting where each item belongs to only one category and we use the vectors to implement a of coding (e.g. as used for supervised clustering in Grady, 2006). However, in general SSL does not produce valid assignments, e.g. if the given labels are not valid assignments, or in other widely used variations of SSL, such as using class mass normalization (Zhu et al., 2003), or using the normalized graph Laplacian instead of the unnormalized one, or using label penalties (Zhou et al., 2004). In the latter case (also similar to the “dongle” variation of SSL; Zhu et al., 2003), one minimizes the Laplacian penalty plus a term equal to the squared distance of the labeled points (considered free parameters as well) to the labels provided. Thus, this penalizes the labeled points from deviating from their intended labels, rather than forcing them to equal them. This was exten