1 Introduction
1.1 The Central Idea
Semi-supervised learning has attracted the attention of many researchers because of the importance of combining unlabeled data with labeled data. In many applications the number of unlabeled data points is so large that it is only feasible to label a small subset of them by hand. The problem of effectively utilizing correlation information in unlabeled data to automatically extend the labels to all the data points is therefore an important research question in machine learning. This paper concerns the issue of how to address uncertainty quantification in such classification methods. In doing so we bring together a variety of themes from the mathematical sciences, including optimization, PDEs, probability and statistics. We will show that a variety of different methods, arising in very distinct communities, can all be formulated around a common objective function

J(u) = ½⟨u, Pu⟩ + Φ(u)

for a real-valued function u on the nodes of a graph representing the data points. The matrix P is proportional to a graph Laplacian derived from the unlabeled data and the function Φ involves the labeled data. The sign of u is used for classification. Minimizing this objective function is one approach to such a classification. However, if uncertainty is modeled then it is natural to consider instead the probability distribution with density proportional to exp(−J(u)); this will be derived below via Bayesian formulations of the problem. We emphasize that the variety of probabilistic models considered in this paper arise from different assumptions concerning the structure of the data. Our objective is to introduce a unified framework within which to propose and evaluate algorithms to sample this distribution or to minimize J. We will focus on Markov chain Monte Carlo (MCMC) methods for sampling and on gradient descent for minimization. The objective of the paper is not to assess the validity of the assumptions leading to the different models; this is an important modeling question best addressed through understanding of specific classes of data arising in applications.
1.2 Literature Review
There are a large number of possible approaches to the semi-supervised learning problem, developed in both the statistics and machine learning communities. The review [55] provides an excellent overview. In this paper we will concentrate exclusively on graph-based methods. These have the attractive feature of reducing potentially high dimensional unlabeled data to a real-valued function on the edges of a graph, quantifying affinity between the nodes; each node of the graph is identified with an unlabeled data point.
Early graph-based approaches were combinatorial. Blum et al. posed the binary semi-supervised classification problem as a Markov random field (MRF) over the discrete state space of binary labels, the MAP estimation of which can be solved using a graph-cut algorithm in polynomial time [55]. In general, inference for multi-label discrete MRFs is intractable [19]. However, several approximate algorithms exist for the multi-label case [13, 12, 32], and have been applied to many imaging tasks [14, 5, 30].
A different line of work is based on using the affinity function on the edges to define a real-valued function u on the nodes of the graph, a substantial compression of the high dimensional unlabelled data at each node. The Dirichlet energy ⟨u, Lu⟩, with L proportional to the graph Laplacian formed from the affinities on the edges, plays a central role. A key conceptual issue in the graph-based approach is then to connect the labels, which are discrete, to this real-valued function. Strategies to link the discrete and continuous data then constitute different modeling assumptions. The line of work initiated in [56] makes the assumption that the labels are also real-valued and take the values ±1, linking them directly to the real-valued function on the nodes of the graph; this may be thought of as a continuum relaxation of the discrete state space MRF in [11]. The basic method is to minimize ⟨u, Lu⟩ subject to the hard constraint that u agrees with the label values; alternatively this constraint may be relaxed to a soft additional penalty term added to ⟨u, Lu⟩. These methods are a form of kriging, or Gaussian process regression [49, 50], on a graph. A Bayesian interpretation of the approach in [56] is given in [57], with further developments in [28]. The Laplacian-based approach has since been generalized in [54, 4, 45, 43, 31]; in particular this line of work developed to study the transductive problem of assigning predictions to data points off the graph. A formal framework for graph-based regularization, using ⟨u, Lu⟩, can be found in [3, 41].
We also mention related methodologies such as the support vector machine (SVM) [10] and robust convex minimization methods [1, 2], which may be based around minimization of ⟨u, Lu⟩ with an additional soft penalty term; however, since these do not have a Bayesian interpretation, we do not consider them here. Other forms of regularization have been considered, such as graph wavelet regularization [40, 23]. The underlying assumption in much of the work described in the previous paragraph is that the labels are real-valued. An arguably more natural modelling assumption is that there is a link function, such as the sign function, connecting a real-valued function on the graph nodes with the labels via thresholding. This way of thinking underlies the probit approach [50] and the Bayesian level set method [27, 20], both of which we will study in this paper. Lying between the approaches initiated by [56] and those based on thresholding are the methods based on optimization over real-valued variables which are penalized for taking values far from ±1; this idea was introduced in the work of Bertozzi et al. [7, 46]. It is based on a Ginzburg-Landau relaxation of the discrete total variation (TV) functional, which coincides with the graph-cut energy. This was generalized to multi-class classification in [22]. Following this line of work, several new algorithms were developed for semi-supervised and unsupervised classification problems on weighted graphs [26, 33, 36].
The Bayesian way of thinking is foundational in artificial intelligence and machine learning research
[10]. However, whilst the book [50] conducts a number of thorough uncertainty quantification studies for a variety of learning problems using Gaussian process priors, most of the papers studying graph-based learning referred to above primarily use the Bayesian approach to learn hyperparameters in an optimization context, and do not consider uncertainty quantification. Thus the careful study of, and development of algorithms for, uncertainty quantification in classification remains largely open. There are a wide range of methodologies employed in the field of uncertainty quantification, and the reader may consult the books [42, 44, 52] and the recent article [38] for details and further references. Underlying all of these methods is a Bayesian methodology which we adopt, and which is attractive both for its clarity with respect to modelling assumptions and as a basis for application of a range of computational tools. Nonetheless it is important to be aware of the limitations of this approach, in particular with regard to its robustness with respect to the specification of the model, and especially the prior distribution on the unknown of interest [37].

1.3 Our Contribution
In this paper, we focus exclusively on the problem of binary semi-supervised classification; however the methodology and conclusions will extend beyond this setting, and to multi-class classification in particular. Our focus is on a presentation which puts uncertainty quantification at the heart of the problem formulation, and we make four primary contributions:

we define a number of different Bayesian formulations of the graph-based semi-supervised learning problem and we connect them to one another, to binary classification methods and to a variety of PDE-inspired approaches to classification; in so doing we provide a single framework for a variety of methods which have arisen in distinct communities and we open up a number of new avenues of study for the problem area;

we highlight the pCN MCMC method for posterior sampling which, based on analogies with its use for PDE-based inverse problems [18], has the potential to sample the posterior distribution in a number of steps which is independent of the number of graph nodes;

we introduce approximations exploiting the empirical properties of the spectrum of the graph Laplacian, generalizing methods used in the optimization context in [7], allowing for computations at each MCMC step which scale well with respect to the number of graph nodes;

we demonstrate, by means of numerical experiments on a range of problems, both the feasibility, and value, of Bayesian uncertainty quantification in semi-supervised graph-based learning, using the algorithmic ideas introduced in this paper.
1.4 Overview and Notation
The paper is organized as follows. In section 2, we give some background material needed for problem specification. In section 3 we formulate the three Bayesian models used for the classification tasks. Section 4 introduces the MCMC and optimization algorithms that we use. In section 5, we present and discuss results of numerical experiments to illustrate our findings; these are based on four examples of increasing size: the house voting records from 1984 (as used in [7]), the tunable two moons data set [16], the MNIST digit database [29] and the hyperspectral gas plume imaging problem [15]. We conclude in section 6. To aid the reader, we give here an overview of notation used throughout the paper.

Z: the set of nodes of the graph, with cardinality N;

Z': the set of nodes where labels are observed, with cardinality J;

{x_j}_{j∈Z}: the feature vectors;

u: the latent variable characterizing the nodes, with u(j) denoting the evaluation of u at node j;

S: the thresholding function;

S_ε: relaxation of S using gradient flow in a double-well potential;

l(j): the label value at each node j, with l = S(u);

y(j): the label data, observed for j ∈ Z';

v: a real-valued relaxation of the label variable l;

W: the weight matrix of the graph; L: the resulting normalized graph Laplacian;

P: the precision matrix and C: the covariance matrix, both found from L;

{(q_k, λ_k)}: the eigenpairs of L;

U: the orthogonal complement of the null space of the graph Laplacian L, given by span{q_2, …, q_N};

GL: the Ginzburg-Landau functional;

μ_0, ν_0: prior probability measures;

the measures denoted μ typically take argument u and are real-valued; the measures denoted ν take argument l on label space, or argument v on a real-valued relaxation of label space.
2 Problem Specification
In subsection 2.1 we formulate semi-supervised learning as a problem on a graph. Subsection 2.2 defines the relevant properties of the graph Laplacian, and in subsection 2.3 these properties are used to construct a Gaussian probability distribution; in section 3 this Gaussian will be used to define our prior information about the classification problem. In subsection 2.4 we discuss thresholding, which provides a link between the real-valued prior information and the label data provided for the semi-supervised learning task; in section 3 this will be used to define our likelihood.
2.1 Semi-Supervised Learning on a Graph
We are given a set of feature vectors X = {x_j}_{j∈Z} for each j ∈ Z := {1, …, N}. For each j ∈ Z the feature vector x_j is an element of ℝ^d, so that X ∈ ℝ^{d×N}. Graph learning starts from the construction of an undirected graph G with vertices Z and edge weights {w_{ij}} computed from the feature set X. The weight w_{ij} will depend only on x_i and x_j, and will decrease monotonically with some measure of distance between x_i and x_j; the weights thus encode affinities between nodes of the graph. Although a very important modelling problem, we do not discuss how to choose these weights in this paper. For graph semi-supervised learning, we are also given a partial set of (possibly noisy) labels y = {y(j)}_{j∈Z'}, where Z' ⊂ Z has size J ≤ N. The task is to infer the labels for all nodes in Z, using the weighted graph G and also the set of noisily observed labels y. In the Bayesian formulation which we adopt, the feature set X, and hence the graph G, is viewed as prior information, describing correlations amongst the nodes of the graph, and we combine this with a likelihood based on the noisily observed labels y, to obtain a posterior distribution on the labelling of all nodes. Various Bayesian formulations, which differ in the specification of the observation model and/or the prior, are described in section 3. In the remainder of this section we give the background needed to understand all of these formulations, thereby touching on the graph Laplacian itself, its link to Gaussian probability distributions and, via thresholding, to non-Gaussian probability distributions and to the Ginzburg-Landau functional. An important point to appreciate is that building our priors from Gaussians confers considerable computational advantages for large graphs; for this reason the non-Gaussian priors will be built from Gaussians via change of measure or push-forward under a nonlinear map.
2.2 The Graph Laplacian
The graph Laplacian is central to many graph-learning algorithms. There are a number of variants used in the literature; see [7, 47] for a discussion. We will work with the normalized Laplacian, defined from the weight matrix W = {w_{ij}} as follows. We define the diagonal matrix D with entries d_{ii} = Σ_{j∈Z} w_{ij}. If we assume that the graph G is connected, then d_{ii} > 0 for all nodes i ∈ Z. We can then define the normalized graph Laplacian¹ as

(1) L = I − D^{−1/2} W D^{−1/2},

and the graph Dirichlet energy as ½⟨u, Lu⟩. Then

(2) ⟨u, Lu⟩ = ½ Σ_{i,j} w_{ij} ( u_i/√d_{ii} − u_j/√d_{jj} )².

Thus, similarly to the classical Dirichlet energy, this quadratic form penalizes nodes from having different function values, with the penalty weighted with respect to the similarity weights from W. Furthermore the identity (2) shows that L is positive semi-definite. Indeed the vector of ones 𝟙 is in the null space of D − W by construction, and hence L has a zero eigenvalue with corresponding eigenvector D^{1/2}𝟙. We let {(q_k, λ_k)}_{k=1}^N denote the eigenpairs of the matrix L, so that²

(3) 0 = λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_N ≤ 2.

The eigenvector corresponding to λ_1 = 0 is q_1 ∝ D^{1/2}𝟙, and λ_2 > 0, assuming a fully connected graph. Then L = QΛQ^T, where Q has columns {q_k}_{k=1}^N and Λ is a diagonal matrix with entries {λ_k}_{k=1}^N. Using these eigenpairs the graph Dirichlet energy can be written as

(4) ⟨u, Lu⟩ = Σ_{k=1}^N λ_k ⟨u, q_k⟩²;

this is analogous to decomposing the classical Dirichlet energy using Fourier analysis.

¹ In the majority of the paper the only property of L that we use is that it is symmetric positive semi-definite. We could therefore use other graph Laplacians, such as the unnormalized choice D − W, in most of the paper. The only exception is the spectral approximation sampling algorithm introduced later; that particular algorithm exploits empirical properties of the symmetrized graph Laplacian. Note, though, that the choice of which graph Laplacian to use can make a significant difference – see [7], and Figure 2.1 therein. To make our exposition more concise we confine our presentation to the graph Laplacian (1).

² For the normalized graph Laplacian, the upper bound may be found in [17, Lemma 1.7, Chapter 1]; but with further structure on the weights many spectra saturate (see Appendix), a fact we will exploit later.
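To make the construction concrete, the definition (1) and the identity (2) can be checked numerically. The following is a minimal sketch (the weight matrix is an arbitrary toy example of ours, not taken from the paper):

```python
import numpy as np

# Toy symmetric weight matrix of a connected 4-node graph (arbitrary example).
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 1.],
              [0., 1., 1., 0.]])

d = W.sum(axis=1)                                  # degrees d_ii = sum_j w_ij
L = np.eye(len(d)) - W / np.sqrt(np.outer(d, d))   # (1): L = I - D^{-1/2} W D^{-1/2}

u = np.array([1.0, -2.0, 0.5, 3.0])                # arbitrary function on the nodes

# Left-hand side of (2): the Dirichlet quadratic form <u, Lu>.
quad = u @ L @ u

# Right-hand side of (2): weighted sum of squared normalized differences.
s = u / np.sqrt(d)
rhs = 0.5 * np.sum(W * (s[:, None] - s[None, :]) ** 2)
```

The two quantities agree, and one can also verify directly that D^{1/2}𝟙 lies in the null space of L, consistent with (3).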
2.3 Gaussian Measure
We now show how to build a Gaussian distribution with negative log density proportional to the graph Dirichlet energy ½⟨u, Lu⟩. Such a Gaussian prefers functions that have larger components on the first few eigenvectors of the graph Laplacian, where the eigenvalues of L are smaller. The corresponding eigenvectors carry rich geometric information about the weighted graph. For example, the second eigenvector of L is the Fiedler vector and solves a relaxed normalized min-cut problem [47, 25]. The Gaussian distribution thereby connects geometric intuition embedded within the graph Laplacian to a natural probabilistic picture. To make this connection concrete we define the diagonal matrix Σ with entries defined by the vector (0, λ_2^{−1}, …, λ_N^{−1}), and define the positive semi-definite covariance matrix C = cQΣQ^T; the choice of the scaling constant c will be discussed below. We let μ_0 := N(0, C). Note that the covariance matrix is that of a Gaussian with variance proportional to λ_k^{−1} in direction q_k, thereby leading to structures which are more likely to favour the Fiedler vector (k = 2), and lower values of k in general, than higher values of k. The fact that the first eigenvalue of C is zero ensures that any draw from μ_0 changes sign, because it will be orthogonal to q_1, whose entries are all positive.³ To make this intuition explicit we recall the Karhunen-Loève expansion, which constructs a sample u from the Gaussian μ_0 according to the random sum

(5) u = Σ_{k=2}^N √(c/λ_k) z_k q_k,

where the z_k are i.i.d. N(0, 1). Equation (3) thus implies that E‖u‖² = c Σ_{k=2}^N λ_k^{−1}. We choose the constant of proportionality c as a rescaling which enforces the property E‖u‖² = N; in words, the per-node variance is one, on average over the nodes. Note that, using the orthogonality of the q_k, the coefficients u_k := ⟨u, q_k⟩ of a sample are independent and satisfy

(6) u_k = √(c/λ_k) z_k, k = 2, …, N.

We reiterate that the support of the measure μ_0 is the space U := span{q_2, …, q_N}, and that, on this space, the probability density function is proportional to exp(−(1/2c)⟨u, Lu⟩), so that the precision matrix of the Gaussian is P = c^{−1}L. In what follows the sign of u will be related to the classification; since all the entries of q_1 are positive, working on the space U ensures a sign change in u, and hence a nontrivial classification.

³ Other treatments of the first eigenvalue are possible and may be useful, but for simplicity of exposition we do not consider them in this paper.
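The Karhunen-Loève construction (5), together with the per-node variance scaling, can be sketched as follows (a minimal illustration on a toy graph; the function and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy normalized Laplacian of a connected 4-node graph.
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 1.],
              [0., 1., 1., 0.]])
d = W.sum(axis=1)
L = np.eye(len(d)) - W / np.sqrt(np.outer(d, d))

lam, Q = np.linalg.eigh(L)   # eigenpairs; lam[0] is (numerically) zero
N = len(lam)

# Scaling c chosen so that the average per-node variance E||u||^2 / N is one.
c = N / np.sum(1.0 / lam[1:])

def sample_prior(rng):
    """Draw u ~ mu_0 via the KL sum (5): u = sum_{k>=2} sqrt(c/lam_k) z_k q_k."""
    z = rng.standard_normal(N - 1)
    return Q[:, 1:] @ (np.sqrt(c / lam[1:]) * z)

u = sample_prior(rng)
```

Each draw is orthogonal to q_1 and therefore changes sign, as discussed above.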
2.4 Thresholding and Non-Gaussian Probability Measure
For the models considered in this paper, the label space of the problem is discrete, while the latent variable through which we will capture the correlations amongst the nodes of the graph, encoded in the feature vectors, is real-valued. We describe thresholding, and a relaxation of thresholding, to address the need to connect these two differing sources of information about the problem. In what follows the latent variable u (appearing in the probit and Bayesian level set methods) is thresholded to obtain the label variable l. The variable v (appearing in the Ginzburg-Landau method) is a real-valued relaxation of the label variable l. The variable u will be endowed with a Gaussian probability distribution. From this the variable l (which lives on a discrete space) and the variable v (which is real-valued, but concentrates near the discrete space supporting l) will be endowed with non-Gaussian probability distributions.
Define the (signum) function S: ℝ → {−1, 1} by

S(u) = 1 for u ≥ 0, S(u) = −1 for u < 0,

extended componentwise to act on vectors. This will be used to connect the latent variable u with the label variable l. The function S may be relaxed by defining S_ε(u) := v(1), where v solves a gradient flow in a double-well potential with wells at ±1, initialized at u. This will be used, indirectly, to connect the latent variable u with the real-valued relaxation of the label variable, v. Note that S_ε(u) → S(u), pointwise, as ε → 0, on ℝ∖{0}. This reflects the fact that the gradient flow minimizes the double-well potential, asymptotically as ε → 0, whenever started on ℝ∖{0}.
We have introduced a Gaussian measure μ_0 on the latent variable u, which lies in U; we now want to introduce two ways of constructing non-Gaussian measures on the label space, or on real-valued relaxations of label space, building on the measure μ_0. The first is to consider the push-forward of the measure μ_0 under the map S, denoted S♯μ_0. When applied to a sequence this gives

S(u) = ( S(u(1)), …, S(u(N)) ),

recalling that N is the cardinality of Z. The definition is readily extended to components of u defined only on subsets of Z. Thus S♯μ_0 is a measure on the label space {−1, 1}^N. The second approach is to work with a change of measure from the Gaussian μ_0 in such a way that the probability mass on U concentrates close to the label space {−1, 1}^N. We may achieve this by defining the measure ν_0 via its Radon-Nikodym derivative

(7) (dν_0/dμ_0)(v) ∝ exp( −(1/ε) Σ_{j∈Z} Ψ(v(j)) ),

where Ψ(v) := ¼(v² − 1)² is a double-well potential with minima at ±1. We name ν_0 the Ginzburg-Landau measure, since the negative log density function of ν_0 is the graph Ginzburg-Landau functional

(8) GL(v) := (1/2c)⟨v, Lv⟩ + (1/ε) Σ_{j∈Z} Ψ(v(j)).

The Ginzburg-Landau distribution defined by ν_0 can be interpreted as a non-convex relaxation of the discrete MRF model [55], in contrast to the convex relaxation which is the Gaussian field [56]. Since the double-well has minima at the label values ±1, the probability mass of ν_0 is concentrated near the modes ±1, and ε controls this concentration effect.
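As an illustration, the Ginzburg-Landau functional (8) can be written down directly. The following is a sketch (the 4-cycle graph is our own toy example; `eps` and `c` follow the notation above):

```python
import numpy as np

def Psi(v):
    """Double-well potential with minima at the label values +/-1."""
    return 0.25 * (v**2 - 1.0)**2

def ginzburg_landau(v, L, c=1.0, eps=0.1):
    """GL(v) = (1/(2c)) <v, Lv> + (1/eps) sum_j Psi(v(j)), as in (8)."""
    return 0.5 / c * (v @ L @ v) + np.sum(Psi(v)) / eps

# Toy example: normalized Laplacian of a 4-cycle (constant degree 2).
W = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
d = W.sum(axis=1)
L = np.eye(4) - W / np.sqrt(np.outer(d, d))

v_const = np.ones(4)                      # constant +/-1 labelling, no cut edges
v_mixed = np.array([1., 1., -1., -1.])    # labelling with two cut edges
```

A constant ±1 labelling on this constant-degree graph has GL = 0, while a labelling that cuts edges pays a positive Dirichlet penalty, illustrating why probability mass under ν_0 concentrates on smooth, nearly binary configurations.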
3 Bayesian Formulation
In this section we formulate three different Bayesian models for the semisupervised learning problem. The three models all combine the ideas described in the previous section to define three distinct posterior distributions. It is important to realize that these different models will give different answers to the same questions about uncertainty quantification. The choice of which Bayesian model to use is related to the data itself, and making this choice is beyond the scope of this paper. Currently the choice must be addressed on a case by case basis, as is done when choosing an optimization method for classification. Nonetheless we will demonstrate that the shared structure of the three models means that a common algorithmic framework can be adopted and we will make some conclusions about the relative costs of applying this framework to the three models.
We denote the latent variable by u, with u(j) its value at node j, the thresholded value of u by l = S(u), which is interpreted as the label assignment at each node, and the noisy observations of the binary labels by y(j), j ∈ Z'. The variable v will be used to denote the real-valued relaxation of l used for the Ginzburg-Landau model. Recall Bayes' formula, which transforms a prior density p(u) on a random variable u into a posterior density p(u|y) on the conditional random variable u|y:

p(u|y) ∝ p(y|u) p(u).

We will now apply this formula to condition our graph latent variable u, whose thresholded values correspond to labels, on the noisy label data y given at Z'. As prior on u we will always use μ_0; we will describe two different likelihoods. We will also apply the formula to condition the relaxed label variable v on the same label data y, via the formula

p(v|y) ∝ p(y|v) p(v).

We will use as prior on v the non-Gaussian ν_0. For the probit and level-set models we now explicitly state the prior density, the likelihood function, and the posterior density; in the Ginzburg-Landau case v will replace u, and we will define the corresponding densities for v and v|y. Prior and posterior probability measures associated with the letter μ are on the latent variable u; measures associated with the letter ν are on the label space, or a real-valued relaxation of the label space.

3.1 Probit
The probit method is designed for classification and is described in [50]. In that context Gaussian process priors are used and, unlike the graph Laplacian construction used here, they do not depend on the unlabeled data. Combining Gaussian process priors and graph Laplacian priors was suggested and studied in [4, 41, 31]. A recent fully Bayesian treatment of the methodology using unweighted graph Laplacians may be found in the paper [24]. In detail, our model is as follows.
Prior. We take as prior on u the Gaussian μ_0 = N(0, C); the prior density is thus proportional to exp(−½⟨u, Pu⟩) on U.

Likelihood. For any j ∈ Z',

y(j) = S( u(j) + η_j ),

with the η_j drawn i.i.d. from N(0, γ²). We let F_γ denote the cumulative distribution function (cdf) of N(0, γ²), and note that then ℙ( y(j) = 1 | u(j) ) = F_γ( u(j) ); similarly ℙ( y(j) = −1 | u(j) ) = F_γ( −u(j) ).

Posterior. Bayes' Theorem gives posterior μ with probability density function (pdf) proportional to exp( −Φ(u) − ½⟨u, Pu⟩ ), where

Φ(u) := − Σ_{j∈Z'} log F_γ( y(j) u(j) ).

We let ν denote the push-forward under S of μ.

MAP Estimator. This is the minimizer of the negative of the log posterior. Thus we minimize the following objective function over U:

J(u) = ½⟨u, Pu⟩ − Σ_{j∈Z'} log F_γ( y(j) u(j) ).

This is a convex function, a fact which is well-known in related contexts, but which we state and prove in Proposition 1, Section B of the appendix, for the sake of completeness. In view of the close relationship between this problem and the level-set formulation described next, for which there are no minimizers, we expect that minimization may not be entirely straightforward in the small-noise limit. This is manifested in the presence of near-flat regions in the probit log likelihood function when γ is small.
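For concreteness, the probit objective J(u) can be sketched as follows (our own illustration; we evaluate F_γ(t) as the standard normal cdf at t/γ, using `scipy.special.log_ndtr` for a numerically stable log-cdf):

```python
import numpy as np
from scipy.special import log_ndtr

def probit_objective(u, P, y, labeled, gamma=0.1):
    """J(u) = 0.5 <u, Pu> - sum_{j in Z'} log F_gamma(y(j) u(j))."""
    quad = 0.5 * (u @ P @ u)
    fit = -np.sum(log_ndtr(y[labeled] * u[labeled] / gamma))
    return quad + fit

# Tiny example: 3 nodes, nodes 0 and 1 labeled; P chosen as the identity here
# purely for illustration (in the paper P is proportional to the Laplacian).
P = np.eye(3)
y = np.array([1.0, -1.0, 0.0])
labeled = np.array([0, 1])
u_agree = np.array([0.5, -0.5, 0.0])   # signs agree with the labels
u_flip = -u_agree                      # signs disagree with the labels
```

Configurations whose signs agree with the labels incur a smaller data-fit penalty, and the log-cdf form remains finite even deep in the tail (the near-flat regions referred to above).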
Our variant on the probit methodology differs from that in [24] in several ways: (i) our prior Gaussian is scaled to have per-node variance one, whilst in [24] the per-node variance is a hyperparameter to be determined; (ii) our prior is supported on U, whilst in [24] the prior precision is found by shifting L and taking a possibly fractional power of the resulting matrix, resulting in support on the whole of ℝ^N; (iii) we allow for a scale parameter γ in the observational noise, whilst in [24] this parameter is fixed.
3.2 Level-Set
This method is designed for problems considerably more general than classification on a graph [27]. For the current application, this model is exactly the same as probit except for the order in which the noise and the thresholding function are applied in the definition of the data. Thus we again take as prior for u the Gaussian μ_0 = N(0, C). Then we have:

Likelihood. For any j ∈ Z',

y(j) = S( u(j) ) + η_j,

with the η_j drawn i.i.d. from N(0, γ²). Then ℙ( y(j) | u(j) ) ∝ exp( −| y(j) − S(u(j)) |² / 2γ² ).

Posterior. Bayes' Theorem gives posterior μ with pdf proportional to exp( −Φ(u) − ½⟨u, Pu⟩ ), where

Φ(u) := (1/2γ²) Σ_{j∈Z'} | y(j) − S(u(j)) |².

We let ν denote the push-forward under S of μ.

MAP Estimator Functional. The negative of the log posterior is, in this case, given by

J(u) = ½⟨u, Pu⟩ + (1/2γ²) Σ_{j∈Z'} | y(j) − S(u(j)) |².

However, unlike the probit model, the Bayesian level-set method has no MAP estimator – the infimum of J is not attained. This may be seen by noting that, if the infimum were attained at any nonzero point u*, then cu* would reduce the objective function for any c ∈ (0, 1), since S(cu*) = S(u*) while the quadratic term strictly decreases; however the point u = 0 does not attain the infimum. This argument is detailed in [27] for a closely related PDE-based model, and the proof is easily adapted.
3.3 Ginzburg-Landau
For this model, we take as prior the Ginzburg-Landau measure ν_0 defined by (7), and employ a Gaussian likelihood for the observed labels. This construction gives a Bayesian posterior whose MAP estimator is the minimizer of the objective function introduced and studied in [7].

Prior. We define the prior on v to be the Ginzburg-Landau measure ν_0 given by (7), with density proportional to exp( −GL(v) ).

Likelihood. For any j ∈ Z',

y(j) = v(j) + η_j,

with the η_j drawn i.i.d. from N(0, γ²). Then ℙ( y(j) | v(j) ) ∝ exp( −| y(j) − v(j) |² / 2γ² ).

Posterior. Recalling that ν_0 has density proportional to exp( −GL(v) ), we see that Bayes' Theorem gives posterior ν with pdf proportional to

exp( −GL(v) − (1/2γ²) Σ_{j∈Z'} | y(j) − v(j) |² ).

MAP Estimator. This is the minimizer of the negative of the log posterior. Thus we minimize the following objective function over U:

J(v) = GL(v) + (1/2γ²) Σ_{j∈Z'} | y(j) − v(j) |².

This objective function was introduced in [7] as a relaxation of the min-cut problem, penalized by data; the relationship to min-cut was studied rigorously in [46]. The minimization problem for J is non-convex and has multiple minimizers, reflecting the combinatorial character of the min-cut problem of which it is a relaxation.
3.4 Uncertainty Quantification for Graph-Based Learning
In Figure 1 we plot the component of the negative log likelihood at a labelled node j, as a function of the latent variable u(j), with the data y(j) fixed, for the probit and Bayesian level-set models. The log likelihood for the Ginzburg-Landau formulation is not directly comparable, as it is a function of the relaxed label variable v(j), with respect to which it is quadratic with minimum at the data point y(j).
The probit and Bayesian level-set models lead to posterior distributions μ (with different subscripts) in latent variable space, and push-forwards under S, denoted ν (also with different subscripts), in label space. The Ginzburg-Landau formulation leads to a measure ν in (relaxed) label space. Uncertainty quantification for semi-supervised learning is concerned with completely characterizing these posterior distributions. In practice this may be achieved by sampling using MCMC methods. In this paper we will study four measures of uncertainty:

we will study the empirical pdfs of the latent and label variables at certain nodes;

we will study the posterior mean of the label variables at certain nodes;

we will study the posterior variance of the label variables averaged over all nodes;

we will use the posterior mean or variance to order nodes into those whose classifications are most uncertain and those which are most certain.
For the probit and Bayesian level-set models we interpret the thresholded variable l = S(u) as the binary label assignments corresponding to a real-valued configuration u; for Ginzburg-Landau we may simply take l = v, as the model is posed on (relaxed) label space. The node-wise posterior mean of l can be used as a useful confidence score of the class assignment of each node. The node-wise posterior mean is defined as

(9) s_l(j) := E[ l(j) ],

with respect to any of the posterior measures ν in label space. Note that for probit and Bayesian level set l(j) is a binary random variable taking values in {−1, 1}, and we have s_l(j) = 2ℙ( l(j) = 1 ) − 1. In this case, if s_l(j) is close to ±1, then the classification of node j is nearly certain. Furthermore Var( l(j) ) = 1 − s_l(j)².
Later we will find it useful to consider the variance averaged over all nodes, and hence define

(10) Var(l) := (1/N) Σ_{j∈Z} ( 1 − s_l(j)² ).

Note that the maximum value obtained by Var(l) is 1. This maximum value is attained under the Gaussian prior that we use in this paper, for which s_l ≡ 0 by symmetry. The deviation from this maximum under the posterior is a measure of the information content of the labelled data. Note, however, that the prior does contain information about classifications, in the form of correlations between vertices; this is not captured in (10).
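Given MCMC samples of the latent variable, the quantities (9) and (10) are straightforward to estimate empirically. A sketch, under our own naming:

```python
import numpy as np

def label_statistics(u_samples):
    """Estimate the node-wise posterior mean s_l(j) = E[l(j)] of l = S(u),
    and the node-averaged variance Var(l) = (1/N) sum_j (1 - s_l(j)^2),
    from an array of latent-variable samples of shape (n_samples, N)."""
    l = np.where(u_samples >= 0, 1.0, -1.0)   # thresholding S, componentwise
    s_l = l.mean(axis=0)                      # node-wise posterior mean (9)
    var_l = float(np.mean(1.0 - s_l**2))      # node-averaged variance (10)
    return s_l, var_l
```

Ordering the nodes by 1 − s_l(j)² then separates the most uncertain classifications from the most certain ones, the fourth measure of uncertainty listed above.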
4 Algorithms
From Section 3, we see that for all of the models considered, the posterior distribution has density proportional to exp( −J(u) ), with

J(u) = ½⟨u, Pu⟩ + Φ(u)

for some function Φ, different for each of the three models (acknowledging that in the Ginzburg-Landau case the independent variable is v, a real-valued relaxation of label space, whereas for the other models it is an underlying latent variable u which may be thresholded by S into label space). Furthermore, the MAP estimator is the minimizer of J. Note that Φ is differentiable for the Ginzburg-Landau and probit models, but not for the level-set model. We introduce algorithms for both sampling (MCMC) and MAP estimation (optimization) that apply in this general framework. The sampler we employ does not use information about the gradient of Φ; the MAP estimation algorithm does, but is only employed on the Ginzburg-Landau and probit models. Both sampling and optimization algorithms use spectral properties of the precision matrix P, which is proportional to the graph Laplacian L.
4.1 MCMC
Broadly speaking there are two strong competitors as samplers for this problem: Metropolis-Hastings based methods, and Gibbs based samplers. In this paper we focus entirely on Metropolis-Hastings methods, as they may be used on all three models considered here. In order to induce scalability with respect to the size of the graph we use the preconditioned Crank-Nicolson (pCN) method described in [18], introduced in the context of diffusions by Beskos et al. in [9], and by Neal in the context of machine learning [35]. The method is also robust with respect to the small noise limit in which the label data is perfect. The pCN based approach is compared with Gibbs-like methods for probit, to which they both apply, in [6]; both large data sets and small noise limits are considered.
The standard random walk Metropolis (RWM) algorithm suffers from the fact that the optimal proposal variance or step size scales inversely with the dimension of the state space [39], which is the graph size in this case. The pCN method is designed so that the proposal variance required to obtain a given acceptance probability scales independently of the dimension of the state space (here the number of graph nodes N), hence in practice giving faster convergence of the MCMC when compared with RWM [8]. We restate the pCN method as Algorithm 1, and then follow with variants on it in Algorithms 2 and 3. In all three algorithms β is the key parameter which determines the efficiency of the MCMC method: small β leads to high acceptance probability but small moves; large β leads to low acceptance probability and large moves. Somewhere between these extremes is an optimal choice of β which minimizes the asymptotic variance of the algorithm when applied to compute a given expectation.
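A single pCN step can be sketched as follows (our own minimal implementation of the proposal/accept step; `sample_prior` is assumed to return a draw from μ_0 and `Phi` is the negative log likelihood):

```python
import numpy as np

def pcn_step(u, Phi, sample_prior, beta, rng):
    """One preconditioned Crank-Nicolson step.

    Proposal: u' = sqrt(1 - beta^2) u + beta * xi, with xi ~ mu_0.
    Accept with probability min(1, exp(Phi(u) - Phi(u'))); the prior
    terms cancel, so only the likelihood enters the accept ratio.
    """
    xi = sample_prior()
    u_new = np.sqrt(1.0 - beta**2) * u + beta * xi
    if np.log(rng.uniform()) < Phi(u) - Phi(u_new):
        return u_new, True   # accepted
    return u, False          # rejected
```

With Phi ≡ 0 every proposal is accepted and the chain preserves μ_0 exactly; this prior-preserving property is what gives pCN its dimension-independent behaviour.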
4.2 Spectral Projection
For graphs with a large number of nodes N, it is prohibitively costly to directly sample from the distribution μ_0, since doing so involves knowledge of a complete eigendecomposition of L, in order to employ (11). A method that is frequently used in classification tasks is to restrict the support of μ_0 to the eigenspace spanned by the first ℓ eigenvectors of L, those with the smallest eigenvalues (hence the largest prior variance), and this idea may be used to approximate the pCN method; this leads to a low-rank approximation. In particular we approximate samples from μ_0 by

(12) u = Σ_{k=2}^{ℓ} √(c/λ_k) z_k q_k,

where the z_k are i.i.d. centred unit Gaussians and the equality is in law; this is (6) truncated after k = ℓ. This is a sample from N(0, C_ℓ), where C_ℓ = cQΣ_ℓQ^T and the diagonal entries of Σ_ℓ are set to zero after the first ℓ. In practice, to implement this algorithm, it is only necessary to compute the first ℓ eigenvectors of the graph Laplacian L. This gives Algorithm 2.
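A sketch of the truncated sampler (12), computing only the first ℓ eigenpairs with `scipy.sparse.linalg.eigsh` (our own illustration; for very large graphs one would use a sparse weight matrix throughout, and possibly a Lanczos or Nyström variant):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

def truncated_prior_sample(L, ell, c, rng):
    """Approximate draw from mu_0 via the KL sum (12), truncated at k = ell."""
    lam, Q = eigsh(L, k=ell, which='SA')   # the ell smallest eigenvalues of L
    z = rng.standard_normal(ell - 1)
    # Skip k = 1 (the zero eigenvalue); use the modes k = 2..ell only.
    return Q[:, 1:] @ (np.sqrt(c / lam[1:]) * z)

# Toy example: normalized Laplacian of a 6-cycle, stored sparsely.
N = 6
W = np.zeros((N, N))
for i in range(N):
    W[i, (i + 1) % N] = W[(i + 1) % N, i] = 1.0
d = W.sum(axis=1)
L = csr_matrix(np.eye(N) - W / np.sqrt(np.outer(d, d)))

u = truncated_prior_sample(L, ell=3, c=1.0, rng=np.random.default_rng(0))
```

On this constant-degree graph q_1 is constant, so each truncated sample has components summing to zero, i.e. it remains orthogonal to the null direction.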
4.3 Spectral Approximation
Spectral projection often leads to good classification results, but may lead to reduced posterior variance and a posterior distribution that is overly smooth on the graph domain. We propose an improvement on the method that preserves the variability of the posterior distribution but still only involves calculating the first ℓ eigenvectors of L. It is based on the empirical observation that in many applications the spectrum of L saturates and satisfies λ_j ≈ λ̄ for j > ℓ, for some constant λ̄. Such behaviour may be observed in b), c) and d) of Figure 2; note in particular the saturation in the hyperspectral case. We assume such behaviour in deriving the low-rank approximation used in this subsection. (See the Appendix for a detailed discussion of the graph Laplacian spectrum.) We define Λ̄ by overwriting the diagonal entries λ_j of Λ, for j from ℓ+1 to N, with λ̄. We then set C̄ to be the covariance obtained by using Λ̄ in place of Λ, and generate samples from N(0, C̄) (which are approximate samples from N(0, C)) by setting

(13) u = Σ_{j=1}^{ℓ} γ_j z_j q_j + γ̄ Σ_{j=ℓ+1}^{N} z_j q_j,

where the γ_j are given by (6), γ̄ is the common value obtained by replacing λ_j with λ̄ for j > ℓ, the z_j are i.i.d. centred unit Gaussians, and the equality is in law. Importantly, samples according to (13) can be computed very efficiently. In particular, there is no need to compute q_j for j > ℓ: the tail term can be computed by first taking a sample z ∼ N(0, I_N), and then projecting onto Span{q_{ℓ+1}, …, q_N}. Moreover, this projection can be computed using only {q_j}_{j=1}^{ℓ}, since the vectors q_{ℓ+1}, …, q_N span the orthogonal complement of Span{q_1, …, q_ℓ}. Concretely, writing Q_ℓ for the matrix with columns q_1, …, q_ℓ, we have

Σ_{j=ℓ+1}^{N} z_j q_j = (I − Q_ℓ Q_ℓᵀ) z,

where z ∼ N(0, I_N) and equality is in law. Hence the samples can be computed by

(14) u = Σ_{j=1}^{ℓ} γ_j z_j q_j + γ̄ (I − Q_ℓ Q_ℓᵀ) z.

The vector z is a sample from N(0, I_N); this results in Algorithm 3. Under the stated empirical properties of the graph Laplacian spectrum, we expect this to be a better approximation of the prior covariance structure than the spectral projection of the previous subsection.
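The sampler (14) can be sketched as follows (Python/numpy; `var_head` holds the prior variances of the first ℓ modes and `var_tail` the common tail variance implied by the saturated spectrum; the names are ours):

```python
import numpy as np

def sample_spectral_approx(Q_ell, var_head, var_tail, seed=0):
    """Sample as in (14): exact variances on the first ell modes plus a
    shared variance on the orthogonal complement, using only the matrix
    Q_ell = [q_1 ... q_ell] of retained eigenvectors."""
    rng = np.random.default_rng(seed)
    n, ell = Q_ell.shape
    z1 = rng.standard_normal(ell)
    z2 = rng.standard_normal(n)
    head = Q_ell @ (np.sqrt(var_head) * z1)
    # (I - Q_ell Q_ell^T) z2 projects z2 onto span{q_{ell+1}, ..., q_N}
    tail = z2 - Q_ell @ (Q_ell.T @ z2)
    return head + np.sqrt(var_tail) * tail
```

The cost per sample is O(Nℓ), and no eigenvector beyond the first ℓ is ever computed.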
4.4 MAP Estimation: Optimization
Recall that the objective function for MAP estimation has the form J(u) = ½⟨u, P u⟩ + Φ(u), where P is the precision matrix (proportional to the graph Laplacian), Φ involves the labelled data, and u is supported on the (possibly truncated) eigenspace introduced above. For Ginzburg–Landau and probit, the function Φ is smooth, and we can use a standard projected gradient method for the optimization. Since P is typically ill-conditioned, it is preferable to use a semi-implicit discretization, as suggested in [7], treating the stiff quadratic term implicitly; convergence to a stationary point can then be shown under a graph-independent learning rate. Furthermore, the discretization can be performed in terms of the eigenbasis {q_j}, which allows us to apply spectral projection easily when only a truncated set of eigenvectors is available. We state the algorithm in terms of the (possibly truncated) eigenbasis below, with Q_ℓ the matrix whose columns are q_1, …, q_ℓ, so that the gradient of Φ is projected onto the span of the retained eigenvectors.
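A minimal sketch of the semi-implicit iteration in the eigenbasis (Python; the `grad_phi` interface, returning ∇Φ in eigen-coordinates, is our simplification, not the paper's code):

```python
import numpy as np

def map_semi_implicit(a0, p_eigs, grad_phi, h=0.1, n_iter=500):
    """Semi-implicit gradient descent for J(u) = 0.5 <u, P u> + Phi(u),
    written in the (possibly truncated) eigenbasis of the precision P.

    The stiff quadratic term is treated implicitly, so the stepsize h can
    be chosen independently of the conditioning of P:
        a_{k+1} = (a_k - h * grad_phi(a_k)) / (1 + h * p_eigs)  (componentwise)
    """
    a = np.array(a0, dtype=float)
    for _ in range(n_iter):
        a = (a - h * grad_phi(a)) / (1.0 + h * p_eigs)
    return a
```

For a quadratic misfit Φ(a) = ½|a − b|², the fixed point solves (P + I)a = b, i.e. a* = b/(1 + p) componentwise, which gives a quick correctness check; note the contraction holds even when the entries of `p_eigs` differ by orders of magnitude.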
5 Numerical Experiments
In this section we conduct a series of numerical experiments on four different data sets that are representative of the field of graph semi-supervised learning. There are three main purposes for the experiments. First, we perform uncertainty quantification, as explained in subsection 3.4. Second, we study the spectral projection and approximation variants of pCN sampling, as these scale well to massive graphs. Finally, we make some observations about the cost and practical implementation details of these methods, for the different Bayesian models we adopt; these will help guide the reader in making choices about which algorithm to use. We present the results for MAP estimation in Section B of the Appendix, alongside the proof of convexity of the probit MAP estimator.
The quality of the graph constructed from the feature vectors is central to the performance of any graph learning algorithm. In the experiments below we follow the graph construction procedures used in the previous papers [7, 26, 33], which applied graph semi-supervised learning to all of the datasets that we consider in this paper. Moreover, we have verified that for all the reported experiments below, the graph parameters are in a range such that spectral clustering [47] (an unsupervised learning method) gives a reasonable performance. The methods we employ lead to refinements over spectral clustering (improved classification) and, of course, to uncertainty quantification (which spectral clustering does not address).
5.1 Data Sets
We introduce the data sets and describe the graph construction for each data set. In all cases we numerically construct the weight matrix W, and then the graph Laplacian L.^5

^5 The weight matrix is symmetric in theory; in practice we find that symmetrizing via the map W ↦ (W + Wᵀ)/2 is helpful.
5.1.1 Two Moons
The two moons artificial data set is constructed to give noisy data which lie near a nonlinear low-dimensional manifold embedded in a high-dimensional space [16]. The data set is constructed by sampling data points uniformly from two semicircles, embedding the data in ℝ^d, and adding Gaussian noise with standard deviation σ. The graph size is N and each feature vector has length d. We conduct a variety of experiments with different labelled data sizes J, and in particular study the variation of the results with J. The default value of J, when not varied, is a small percentage of N, with the labelled points chosen at random. We take each data point as a node on the graph, and construct a fully connected graph using the self-tuning weights of Zelnik-Manor and Perona [53], with K = 10. Specifically, let x_i, x_j be the coordinates of the data points at nodes i and j. Then the weight between nodes i and j is defined by

(15) w_{ij} = exp( −‖x_i − x_j‖² / (2 τ(i) τ(j)) ),

where τ(i) is the distance of the K-th closest point to the node x_i.
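A small numpy sketch of this construction (function name ours); note that the fully connected self-tuning weights of (15) need only the pairwise distances and the K-th neighbour distance per node:

```python
import numpy as np

def self_tuning_weights(X, K=10):
    """Fully connected weight matrix (15): w_ij = exp(-|x_i - x_j|^2 /
    (2 tau(i) tau(j))), with tau(i) the distance from x_i to its K-th
    closest point (the Zelnik-Manor & Perona self-tuning scaling)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    tau = np.sort(D, axis=1)[:, K]      # K-th closest; index 0 is the point itself
    W = np.exp(-D**2 / (2.0 * np.outer(tau, tau)))
    np.fill_diagonal(W, 0.0)
    return 0.5 * (W + W.T)              # guard against floating-point asymmetry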
5.1.2 House Voting Records from 1984
This data set contains the voting records of 435 members of the U.S. House of Representatives; for details see [7] and the references therein. The votes were recorded in 1984, during the second session of the 98th United States Congress. The votes for each individual are vectorized by mapping a yes vote to 1, a no vote to −1, and an abstention/no-show to 0. The data set contains 16 votes that are believed to be well correlated with partisanship, and we use only these votes as feature vectors for constructing the graph. Thus the graph size is N = 435, and the feature vectors have length 16. The goal is to predict the party affiliation of each individual, given a small number of known affiliations (labels). We pick a few Democrats and a few Republicans at random to use as the observed class labels, corresponding to a small percentage of fidelity (i.e. labelled) points. We construct a fully connected graph with weights given by (15), with τ(i) = 1 for all nodes i.
5.1.3 MNIST
The MNIST database consists of 70,000 images of size 28 × 28 pixels containing the handwritten digits 0 through 9; see [29] for details. Since in this paper we focus on binary classification, we only consider pairs of digits. To speed up calculations, we randomly subsample images from each digit to form a smaller graph; we use this subsampled graph for all our experiments except in subsection 5.4, where we use the full data set for one digit pair to benchmark computational cost. The nodes of the graph are the images, and as feature vectors we project the images onto the leading 50 principal components given by PCA; thus the feature vector at each node has length 50. For each pair of digits considered we construct a K-nearest-neighbour graph: the weight w_{ij} is nonzero if and only if one of x_i or x_j is among the K nearest neighbours of the other, and the nonzero weights are set using (15). We choose four digit pairs, exhibiting increasing levels of difficulty for classification. This is demonstrated in the associated figure panels (a)–(d), where we visualize the data sets by projecting onto the second and third eigenvectors of the graph Laplacian: each node i is mapped to the point (q_2(i), q_3(i)).
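The K-nearest-neighbour sparsification described above can be sketched as follows (numpy; for brevity a single scalar scale `tau` replaces the per-node self-tuning scaling of (15), so this is an illustration of the sparsity pattern rather than the exact weights):

```python
import numpy as np

def knn_graph(X, K, tau=1.0):
    """Weights are nonzero iff one of i, j is among the K nearest
    neighbours of the other; nonzero entries are Gaussian in distance."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nbrs = np.argsort(D, axis=1)[:, 1:K + 1]        # K nearest, excluding self
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), K), nbrs.ravel()] = True
    mask |= mask.T                                  # symmetrize the neighbour relation
    W = np.where(mask, np.exp(-D**2 / (2.0 * tau**2)), 0.0)
    np.fill_diagonal(W, 0.0)
    return W
```

The `mask |= mask.T` line implements the "one of x_i or x_j is a neighbour of the other" rule, which guarantees a symmetric weight matrix.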
5.1.4 Hyperspectral Image
The hyperspectral data set analysed for this project was provided by the Applied Physics Laboratory at Johns Hopkins University; see [15] for details. It consists of a series of video sequences recording the release of chemical plumes, taken at the Dugway Proving Ground. Each layer in the spectral dimension depicts a particular frequency, with equally spaced channels; the feature vector at each pixel therefore has length equal to the number of spectral channels. We select a number of frames from the video sequence as the input data, and consider each spatial pixel as a node on the graph; the graph size N is thus the number of pixels per frame times the number of frames selected. Note that the time-ordering of the data is ignored. The classification problem is to classify pixels that represent the chemical plumes against pixels that are the background.
We construct a fully connected graph with weights based on the cosine distance

d(x_i, x_j) = 1 − ⟨x_i, x_j⟩ / (‖x_i‖ ‖x_j‖).

This distance is small for vectors that point in the same direction, and is insensitive to their magnitude. We consider the normalized Laplacian defined in (1). Because it is computationally prohibitive to compute eigenvectors of a Laplacian of this size, we apply the Nyström extension [51, 21] to obtain an approximation to the true eigenvectors and eigenvalues; see [7] for details pertinent to the setup here. We emphasize that each pixel in the frames is a node on the graph and that, in particular, pixels across the time frames are also connected. Since we have no ground-truth labels for this data set, we generate known labels by taking the segmentation results from spectral clustering as ground truth. The default number of labelled points, when not varied, is a small percentage of N, with the labels chosen at random. We only plot results for the last frames of the video sequence, in order to ensure that the information in the figures is not overwhelming.
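One plausible numpy realization of the cosine-based weighting (function name ours; we take the weight to be the cosine similarity itself, i.e. one minus the cosine distance, so that similar pixels get large weights):

```python
import numpy as np

def cosine_similarity_weights(X):
    """Weight matrix from cosine similarity: w_ij is close to 1 when the
    feature vectors x_i, x_j point in the same direction (cosine distance
    1 - w_ij is small), and is unchanged if either vector is rescaled."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    W = Xn @ Xn.T
    np.fill_diagonal(W, 0.0)
    return W
```

Insensitivity to magnitude is the point of this choice for hyperspectral pixels: two pixels with the same spectral signature but different illumination receive the same weight.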
5.2 Uncertainty Quantification
In this subsection we demonstrate both the feasibility, and the value, of uncertainty quantification in graph classification methods. We employ the probit and Bayesian level-set models for most of the experiments in this subsection; we also employ the Ginzburg–Landau model, but since this can be slow to converge, due to the presence of local minima, it is only demonstrated on the voting records data set. The pCN method is used for sampling on the various data sets, to demonstrate properties and interpretations of the posterior. In all experiments, statistics on the label are computed under the posterior measure pushed forward onto label space.
5.2.1 Posterior Mean as Confidence Scores
We construct the graph from the MNIST data set following subsection 5.1. The noise variance is set to a small value, and a fixed percentage of fidelity points is chosen randomly from each class. The probit posterior is used to compute the posterior mean (9). In Figure 4 we demonstrate that nodes with scores close to the binary ground-truth labels look visually more uniform than nodes with scores far from those labels. This shows that the posterior mean contains useful information which differentiates between outliers and inliers, in a way that aligns with human perception. The scores are computed as follows: let u^{(1)}, …, u^{(M)} be a set of samples of the posterior measure obtained from the pCN algorithm. The probability that node i has label 1 is approximated by the empirical fraction of samples with u^{(k)}(i) > 0, for each node i; the score of node i is then the corresponding posterior mean of its label.
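In code, the Monte Carlo estimate reads (numpy; `samples` stacks the pCN draws row-wise, and the function name is ours):

```python
import numpy as np

def label_confidence(samples):
    """Estimate, for each node, the posterior probability that its label
    is +1 from MCMC samples of the latent variable u (rows = samples,
    columns = nodes), together with the posterior mean label."""
    prob_plus = (samples > 0).mean(axis=0)   # fraction of samples with u(i) > 0
    mean_label = 2.0 * prob_plus - 1.0       # pushforward posterior mean in [-1, 1]
    return prob_plus, mean_label
```

Nodes whose mean label sits near ±1 are classified with high confidence; values near 0 flag the outliers discussed above.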
5.2.2 Posterior Variance as Uncertainty Measure
In this set of experiments, we show that the posterior distribution of the label variable captures the uncertainty of the classification problem. We use the posterior variance of the label, averaged over all nodes, as a measure of the model variance; specifically, formula (10). We study the behaviour of this quantity as we vary the level of uncertainty within certain inputs to the problem. We demonstrate empirically that the posterior variance is approximately monotonic with respect to variations in the levels of uncertainty in the input data, as it should be; thus the posterior variance contains useful information about the classification. We select quantities that reflect the separability of the classes in the feature space.
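The node-averaged posterior label variance of formula (10) can be estimated from the same MCMC samples; a sketch (numpy, names ours):

```python
import numpy as np

def mean_label_variance(samples):
    """Empirical posterior variance of the thresholded label at each node,
    averaged over all nodes (a Monte Carlo estimate of formula (10))."""
    labels = np.where(samples > 0, 1.0, -1.0)   # threshold u to labels +/-1
    return labels.var(axis=0).mean()
```

A node whose label flips across samples contributes variance up to 1; a node labelled consistently contributes 0, so the average directly measures how unsettled the classification is.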
Figure 5 plots the posterior variance against the standard deviation σ of the noise appearing in the feature vectors for the two moons data set; the points generated on the two semicircles thus overlap more as σ increases. We employ a sequence of posterior computations, using probit and Bayesian level-set, for an increasing sequence of values of σ. A fixed percentage of the nodes is chosen at random to have the ground-truth labels as observed data. Within both models, the observational noise parameter γ is fixed at a small value. A large number of samples is taken, and the proposal stepsize β is fixed. We see that the mean posterior variance increases with σ, as is intuitively reasonable. Furthermore, because γ is small, probit and Bayesian level-set are very similar models, and this is reflected in the similar quantitative values for the uncertainty.
A similar experiment studies the posterior label variance as a function of the pair of digits classified within the MNIST data set. We choose a fixed percentage of the nodes as labelled data, and fix γ and the number of samples as before.