1 Introduction
Semi-supervised learning (SSL) has become a topic of significant recent interest in applied machine learning, where per-class distributions are difficult to separate automatically due to limited sampling and/or limitations of the underlying mathematical model. Several applications, including content-based retrieval
[51], email classification [24], gene function prediction [34], and natural language processing
[40, 26], benefit from the availability of user-defined, application-specific knowledge in the presence of large amounts of complex unlabeled data, where labeled observations are often limited and expensive to acquire. In general, SSL algorithms fall into two broad categories: classification and clustering. Semi-supervised classification aims to improve on supervised classification when small amounts of labeled data are available alongside large amounts of unlabeled data [53, 7]. For example, in semi-supervised email classification, one may wish to classify a constantly growing stream of email messages as spam or non-spam, given a limited number of messages classified by the user
[24]. On the other hand, semi-supervised clustering (SSC), also known as constrained clustering [6], aims to improve on unsupervised clustering when user-provided information about the relationships within a small subset of the observations becomes available. Such relations indicate that particular data points belong to the same or different classes. For example, a language-specific grammar is helpful in cognitive science when individuals attempt to learn a foreign language efficiently; the grammar's rules for prepositions can be viewed as user-defined knowledge that improves the ability to learn the new language. To highlight the role of user-defined relationships for learning an application-specific data distribution, we consider the example in Figure 1
(a), which shows a maximum-likelihood estimate of a Gaussian mixture that is well supported by the data. However, an application may benefit from another good (but not optimal w.r.t. likelihood) solution, as in Figure
1(b), which is consistent with the data but cannot be obtained without some information in addition to the raw data points. With only a limited amount of labeled data and a large amount of unlabeled data, it can be difficult to guide the learning algorithm in the application-specific direction [53, 10, 29, 50], because the performance of a generative model depends on the ratio of labeled to unlabeled data. In contrast, previous work has shown that SSC achieves the estimate in Figure 1(b), given the observed data and a small number of user-defined relationships that guide the parameter estimation process toward a model [6] that is informed not only by the data, but also by this small amount of user input. This paper addresses the problem of incorporating such user-specified relations into a clustering problem in an effective, general, and reliable manner. Clustering data using a generative framework has some useful, important properties, including compact representations, parameter estimates for subsequent statistical analysis, and the ability to induce classifications of unseen data [54]. For estimating the parameters of generative models, the expectation-maximization (EM) algorithm [13] is particularly effective. The EM formulation is guaranteed to give maximum-likelihood (ML) estimates in the unimodal case and local maxima in likelihood otherwise. Therefore, EM formulations of parameter estimation that properly account for user input in the context of SSC are of interest and one of the contributions of this paper.
A flexible and efficient way to incorporate user input into SSC is in the form of relations between observed data points, which define statistical relationships among observations (rather than the explicit labeling used in classification). A typical example would be for a user to examine a small subset of data and decide that some pairs of points should be in different classes, referred to as a cannot-link relation, and that other pairs of data points should be in the same class, i.e., must-link. Using these basic primitives, one may build up more complex relationships among sets of points. The concept of pairwise links was first applied to centroid-based clustering approaches, for instance, in the form of constrained K-means [43], where each observation is assigned to the nearest cluster in a manner that avoids violating constraints.
Although some progress has been made in developing mechanisms for incorporating this type of user input into clustering algorithms, the need remains for a systematic, general framework that generalizes from a limited amount of user knowledge. Most state-of-the-art techniques propose adding hard constraints [38], under which data points that violate the constraints do not contribute (i.e., all pairwise constraints must be satisfied), or soft penalties [30], which penalize the clustering results based on the number of violated constraints. Both hard constraints and soft penalties can lead to a lack of generality and to suboptimal solutions. For instance, in constrained K-means, introducing constraints by merely assigning a relatively small number of points to appropriate centroids does not ensure that the models (centroids) adequately respond to this user input.
In this paper, we propose a novel, generative approach for clustering with pairwise relations that incorporates these relations into the estimation process in a precise manner. The parameters are estimated by optimizing the data likelihood under the assumption
that individual data points are either independent samples (as in the unsupervised case) or that they have a nontrivial joint distribution determined by user input. The proposed model explicitly incorporates the pairwise relationship as a property of the generative model, which guides the parameter estimation process to reflect user preferences and to estimate the global structure of the underlying distribution. Moreover, the proposed model is expressed as a probability distribution that can take virtually any form. The results in this paper demonstrate that this principled strategy pays off, and that it outperforms the state-of-the-art on real-world datasets with significantly less user input.
2 Related Work
Semi-supervised clustering methods typically fall into one of two categories [6]: distance-based methods and constraint-based methods. The distance-based approaches combine conventional clustering algorithms with distance metrics that are designed to satisfy the information given by user input [47, 4, 45, 8]. The metrics effectively embed the points into spaces where the distances between constrained points are either larger or smaller, to reflect the user-specified relationships. On the other hand, constraint-based algorithms incorporate the pairwise constraints into the clustering objective function, to either enforce the constraints or penalize their violation. For example, Wagstaff et al. proposed the constrained K-means algorithm, which enforces user input as hard constraints in a nonprobabilistic manner within the step that assigns points to classes [43]. Basu et al. proposed a probabilistic framework based on a hidden Markov random field, with ad hoc soft penalties, which integrates metric learning with the constrained K-means approach and is optimized by an EM-like algorithm [5]. This work can also be applied in a kernel feature space as in [23]
. Allab and Benabdeslem adapted topological clustering to pairwise constraints using a self-organizing map in a deterministic manner
[2]. Semi-supervised clustering methods based on generative, parametric clustering approaches have also been augmented to accommodate user input. Lu and Leen proposed a penalized clustering algorithm using Gaussian mixture models (GMM) that incorporates the pairwise constraints directly as a prior distribution over the latent variables, resulting in a computationally challenging evaluation of the posterior
[30]. Such a penalization-based formulation results in a model with no clear generative interpretation and a stochastic expectation step that requires Gibbs sampling. Shental et al. proposed a GMM with equivalence constraints, which model data points as coming from either the same or different sources. However, for the cannot-link case, they used a Markov network to describe the dependence between a pair of latent variables and sought the optimal parameters by gradient ascent [38]. Their results showed that the cannot-link relationship was unable to impact the final parameter estimate (i.e., such a relation was ineffective). Further, they imposed user input as hard constraints, so data points that violate the constraints did not contribute to the parameter estimation process. A similar approach, in [25], proposed to treat the constraint as an additional random variable, which increases the complexity of the optimization process. Further, their approach handled only
must-link relations. In this paper, we propose a novel solution for incorporating user-defined data relationships into clustering problems, so that cannot-link and must-link relations can be included in a unified framework and computed efficiently using an EM algorithm with very modest computational demands. Moreover, the proposed formulation is general in that it can 1) accommodate any kind of relation that can be expressed as a joint probability and 2) incorporate, in principle, any probability distribution (generative model). For GMMs, however, this formulation results in a particularly attractive algorithm that entails a closed-form solution for the means and covariances and a relatively inexpensive, iterative, constrained, nonlinear optimization for the mixing parameters. Recently, EM-like algorithms for SSL (and clustering in particular) have received significant attention in natural language processing [20, 31]. Graca et al. proposed an EM approach with a posterior constraint that incorporates the expected values of specially designed auxiliary functions of the latent variables to influence the posterior distribution to favor user input [20]. Because this approach lacks a probabilistic interpretation, the expectation step is not influenced by user input, and the results are not optimal.
Unlike generative approaches, graph-based methods group the data points according to similarity and do not necessarily assume an underlying distribution. Graph-based, semi-supervised clustering methods have been demonstrated to be promising when user input is available [52, 44, 49]. However, graph-based methods are not ideal classifiers when a new data point is presented, due to their transductive nature, i.e., they cannot learn a general rule from the specific training data [16, 54]. To classify a new data point without rebuilding the graph, one practical solution is to build a separate inductive model (e.g., K-means or a GMM) on top of the output of the graph-based method; user input would then need to be incorporated into this new model.
The work in this paper is distinct from the aforementioned works in the following aspects:

We present a fully generative approach, rather than a heuristic approach of imposing hard constraints or adding ad hoc penalties.

The proposed generative model reflects user preferences while maintaining a probabilistic interpretation, which allows it to be generalized to take advantage of alternative density models or optimization algorithms.

The proposed model cleanly handles the must-link and cannot-link cases in a unified framework and demonstrates that solutions using must-link and cannot-link relations together or independently are tractable and effective.

The statistical interpretation of pairwise relationships, rather than hard pairwise constraints, allows the model estimate to converge to a distribution that follows user preferences with less domain knowledge.

In the proposed algorithm, the parameter estimation is very similar to a standard EM in terms of ease of implementation and efficiency.
3 Clustering With Pairwise Relationships
The proposed model incorporates user input in the form of relations between pairs of points that are in the same class (must-link) or different classes (cannot-link). The must-link and cannot-link relationships are a natural and practical choice, since the user can guide the clustering without having a specific preconceived notion of the classes. These pairwise relationships are typically not sufficiently dense or complete to build a full discriminative model, and yet they may be helpful in discovering the underlying structure of the unlabeled data. Data points that have no user input are assumed to be independent, random samples. The pairwise relationships give rise to an associated generative model with a joint distribution that reflects the nature of the user input.
The parameters are estimated in an ML formulation through an EM algorithm that discovers the global structure of the underlying distribution while reflecting the user-defined relations. Unlike previous works that include user input in a specific model (e.g., a GMM) through either hard constraints [38] or soft penalties [30], in this work we propose an ML estimation based on a generative model, without ad hoc penalties.
3.1 Generative Models: Unsupervised Scenario
In this section, we first introduce generative models for the unsupervised scenario. Suppose the unconstrained generative model consists of $K$ classes, and let $\mathcal{X}_U = \{x_1, \ldots, x_N\}$ denote the observed dataset without user input. The dataset is associated with a latent set $\mathcal{Z}_U = \{z_1, \ldots, z_N\}$, where $z_n \in \{0,1\}^K$ with $z_{nk} = 1$ if and only if the corresponding data point $x_n$ was generated from the $k$th class, subject to $\sum_{k=1}^{K} z_{nk} = 1$. Therefore, we can obtain a soft label for a data point by estimating $p(z_{nk} = 1 \mid x_n)$. The probability that a data point $x_n$ is generated from a generative model with parameters $\Theta = \{\pi_k, \theta_k\}_{k=1}^{K}$ is
$$p(x_n \mid \Theta) = \sum_{k=1}^{K} \pi_k \, p(x_n \mid \theta_k) \qquad (1)$$
The likelihood of the observed data points and their latent variables, governed by the model parameters, is
$$p(\mathcal{X}_U, \mathcal{Z}_U \mid \Theta) = \prod_{k=1}^{K} \prod_{n : z_{nk} = 1} p(x_n, z_{nk} = 1 \mid \Theta) \qquad (2)$$
$$= \prod_{k=1}^{K} \prod_{n : z_{nk} = 1} \pi_k \, p(x_n \mid \theta_k) \qquad (3)$$
where the product term in equation (2) is restricted to the data points generated from the $k$th class. The joint probability in equation (3) is expressed, using Bayes' rule, in terms of the conditional probability $p(x_n \mid \theta_k)$ and the $k$th class prior probability $\pi_k$. In the rest of the formulation, to simplify the presentation, we write $p(x_n \mid \theta_k)$ for the $k$th class-conditional density.

3.2 Generative Model With Pairwise Relationships
The definition of a pairwise relation in the proposed generative model is similar to that in the unsupervised case, yet such relations are propagated to the level of the latent variables. In particular, $\mathcal{M}$ denotes a set of must-link relations, where each pair $(x_i, x_j) \in \mathcal{M}$ was generated from the same class; hence, the pair shares a single latent variable $z_{ij}$. The same logic applies to the cannot-link relations, where $\mathcal{C}$ denotes a set of cannot-link relations encoding that $x_i$ and $x_j$ were generated from distinct classes; therefore, $z_i \neq z_j$. Including $\mathcal{M}$ and $\mathcal{C}$, the data points are now expanded to the union of the unsupervised, must-link, and cannot-link sets. Thus, the modified complete-data likelihood function that reflects user input is (refer to Figure 2 for the graphical representation)
$$p(\mathcal{X}, \mathcal{Z} \mid \Theta) = p(\mathcal{X}_U, \mathcal{Z}_U \mid \Theta) \, p(\mathcal{X}_M, \mathcal{Z}_M \mid \Theta) \, p(\mathcal{X}_C, \mathcal{Z}_C \mid \Theta) \qquad (4)$$
$p(\mathcal{X}_M, \mathcal{Z}_M \mid \Theta)$ and $p(\mathcal{X}_C, \mathcal{Z}_C \mid \Theta)$ are the likelihoods of the pairwise data points. The likelihood of the set of all pairs of must-link data points is, therefore,
$$p(\mathcal{X}_M, \mathcal{Z}_M \mid \Theta) = \prod_{(x_i, x_j) \in \mathcal{M}} \prod_{k=1}^{K} \left[ \pi_k \, p(x_i \mid \theta_k) \, p(x_j \mid \theta_k) \right]^{z_{ijk}} \qquad (5)$$
The likelihood of the cannot-link data points explicitly reflects the fact that they are drawn from distinct classes. Therefore, the joint probability of the labeling vectors $z_i$ and $z_j$ for all $(x_i, x_j) \in \mathcal{C}$ is as follows:
$$p(z_{ik} = 1, z_{jl} = 1) \propto \pi_k \, \pi_l \, (1 - \delta_{kl}) \qquad (6)$$
$$= \frac{\pi_k \, \pi_l \, (1 - \delta_{kl})}{\sum_{k'} \sum_{l' \neq k'} \pi_{k'} \, \pi_{l'}} \qquad (7)$$
$$= \frac{\pi_k \, \pi_l \, (1 - \delta_{kl})}{1 - \sum_{m=1}^{K} \pi_m^2} \qquad (8)$$
The proposed joint distribution reflects the cannot-link constraints by assigning zero joint probability to $z_i$ and $z_j$ being generated from the same class, and it takes into account the effect of this relation on the normalization term of the joint distribution. As such, the cannot-link relations contribute to the posterior distribution as follows:
$$p(z_{ik} = 1, z_{jl} = 1 \mid x_i, x_j, \Theta) = \frac{\pi_k \, \pi_l \, (1 - \delta_{kl}) \, p(x_i \mid \theta_k) \, p(x_j \mid \theta_l)}{\sum_{k'} \sum_{l' \neq k'} \pi_{k'} \, \pi_{l'} \, p(x_i \mid \theta_{k'}) \, p(x_j \mid \theta_{l'})} \qquad (9)$$
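As a concrete illustration of how the cannot-link posterior in equation (9) can be evaluated, the following sketch (the helper names are ours, and Gaussian class-conditional densities are assumed) builds the joint posterior table over class pairs, zeroes the same-class diagonal, and renormalizes over the off-diagonal entries:

```python
import numpy as np

def gauss_pdf(x, mean, cov):
    """Multivariate normal density (a small helper; scipy.stats would also work)."""
    d = len(mean)
    diff = np.asarray(x) - np.asarray(mean)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm)

def cannot_link_posterior(x_i, x_j, weights, means, covs):
    """Joint posterior over class assignments (k, l) of a cannot-link pair:
    proportional to pi_k p(x_i|k) * pi_l p(x_j|l) for k != l, with the
    diagonal (same-class assignments) forced to zero, as in eq. (9)."""
    K = len(weights)
    p_i = np.array([weights[k] * gauss_pdf(x_i, means[k], covs[k]) for k in range(K)])
    p_j = np.array([weights[k] * gauss_pdf(x_j, means[k], covs[k]) for k in range(K)])
    q = np.outer(p_i, p_j)
    np.fill_diagonal(q, 0.0)   # cannot-link: same-class joint probability is zero
    return q / q.sum()         # renormalize over the off-diagonal entries
```

The diagonal zeroing followed by renormalization is exactly the role played by the $1 - \sum_m \pi_m^2$-style normalization term discussed above.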
3.3 Expectation Maximization With Pairwise Relationships
Given the joint distribution in equation (4), the objective is to maximize the log-likelihood function with respect to the parameters $\Theta$ of the generative process in a manner that discovers the global structure of the underlying distribution and reflects user input. This objective can be achieved using an EM algorithm.
3.3.1 E-Step
In the E-step, we estimate the posterior of the latent variables using the current parameter values $\Theta^{\text{old}}$:
$$Q(\Theta, \Theta^{\text{old}}) = \mathbb{E}_{\mathcal{Z} \mid \mathcal{X}, \Theta^{\text{old}}} \left[ \log p(\mathcal{X}, \mathcal{Z} \mid \Theta) \right] \qquad (10)$$
Unsupervised term: Taking the expectation of $\log p(\mathcal{X}_U, \mathcal{Z}_U \mid \Theta)$ with respect to the posterior distribution of $\mathcal{Z}_U$, and bearing in mind that the latent variable $z_{nk}$ is a binary variable,
$$\mathbb{E}[z_{nk}] = \gamma(z_{nk}) = \frac{\pi_k \, p(x_n \mid \theta_k)}{\sum_{k'} \pi_{k'} \, p(x_n \mid \theta_{k'})} \qquad (11)$$
Must-link term: Taking the expectation of $\log p(\mathcal{X}_M, \mathcal{Z}_M \mid \Theta)$ with respect to the must-link posterior distribution of $z_{ij}$ results in
$$\mathbb{E}[z_{ijk}] = \gamma(z_{ijk}) = \frac{\pi_k \, p(x_i \mid \theta_k) \, p(x_j \mid \theta_k)}{\sum_{k'} \pi_{k'} \, p(x_i \mid \theta_{k'}) \, p(x_j \mid \theta_{k'})} \qquad (12)$$
Cannot-link term: Because the proposed model does not allow $x_i$ and $x_j$ to be from the same class, the expectation of equation (8) in the case that both have the same class assignment vanishes, which can be shown using Jensen's inequality as follows:
$$\mathbb{E}[z_{ik} z_{jk}] \log \left( \pi_k^2 \, p(x_i \mid \theta_k) \, p(x_j \mid \theta_k) \right) = 0 \qquad (13)$$
Hence, we can set the same-class terms to zero in equation (8). The expectation of the cannot-link term with respect to the posterior in equation (9) is
$$\mathbb{E}[z_{ik} z_{jl}] = \gamma(z_{ik}, z_{jl}) = p(z_{ik} = 1, z_{jl} = 1 \mid x_i, x_j, \Theta^{\text{old}}) \qquad (14)$$
In a like manner, we can write down the expectation of the remaining terms.
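The E-step quantities above can be sketched as follows. This is a minimal illustration (the function name is ours, and Gaussian class-conditional densities are assumed): standard responsibilities for independent points, plus a single shared responsibility vector for each must-link pair, which shares one latent variable:

```python
import numpy as np

def e_step(X, weights, means, covs, must_links=()):
    """E-step sketch: per-point responsibilities for independent samples,
    and one shared responsibility vector per must-link pair."""
    def gauss_pdf(x, mean, cov):
        d = len(mean)
        diff = x - mean
        return float(np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)
                     / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov)))

    K = len(weights)
    # Unsupervised points: gamma[n, k] proportional to pi_k p(x_n | theta_k).
    gamma = np.array([[weights[k] * gauss_pdf(x, means[k], covs[k])
                       for k in range(K)] for x in X])
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Must-link pairs: shared gamma_pair[k] proportional to
    # pi_k p(x_i | theta_k) p(x_j | theta_k), since the pair shares one latent.
    pair_gamma = []
    for i, j in must_links:
        g = np.array([weights[k] * gauss_pdf(X[i], means[k], covs[k])
                      * gauss_pdf(X[j], means[k], covs[k]) for k in range(K)])
        pair_gamma.append(g / g.sum())
    return gamma, pair_gamma
```

The cannot-link pairs would analogously receive a joint posterior table with a zeroed diagonal, as sketched after equation (9).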
3.3.2 M-Step
In the M-step, we update the parameters $\Theta$ by maximizing the expected complete-data log-likelihood from the E-step while holding fixed the posterior distributions estimated there:
$$\Theta^{\text{new}} = \arg\max_{\Theta} \, Q(\Theta, \Theta^{\text{old}}) \qquad (15)$$
Different density models result in different update mechanisms for the respective model parameters. In the next subsection, we elaborate on an example of the proposed model to illustrate the idea of the Mstep for the case of Gaussian mixture models.
3.4 Gaussian Mixture Model With Pairwise Relationships
Consider employing a single distribution (e.g., a Gaussian distribution) for each class probability $p(x_n \mid \theta_k)$. The proposed model, therefore, becomes a Gaussian mixture model (GMM) with pairwise relationships. The parameters of the GMM are $\Theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$, such that $\pi_k$ is the mixing parameter for the class proportion, subject to $0 \le \pi_k \le 1$ and $\sum_k \pi_k = 1$; $\mu_k$ is the mean parameter, and $\Sigma_k$ is the covariance associated with the $k$th class. By taking the derivative of the expected complete-data log-likelihood with respect to $\mu_k$ and $\Sigma_k$, we can get
$$\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n} \gamma_{nk} \, x_n \qquad (16)$$
$$\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n} \gamma_{nk} \, (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^{\top} \qquad (17)$$
$$= S_k + (\bar{x}_k - \mu_k^{\text{new}})(\bar{x}_k - \mu_k^{\text{new}})^{\top} \qquad (18)$$
where $N_k = \sum_n \gamma_{nk}$, $\bar{x}_k = \frac{1}{N_k} \sum_n \gamma_{nk} \, x_n$, and the sample covariance $S_k = \frac{1}{N_k} \sum_n \gamma_{nk} \, (x_n - \bar{x}_k)(x_n - \bar{x}_k)^{\top}$; the sums run over all data points with their aggregated (unsupervised, must-link, and cannot-link) responsibilities $\gamma_{nk}$.
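A minimal sketch of the closed-form mean and covariance updates, assuming the responsibilities already aggregate the unsupervised and pairwise posteriors (the helper name is illustrative):

```python
import numpy as np

def m_step_gaussians(X, gamma):
    """Closed-form M-step for means and covariances given responsibilities
    gamma (N x K). Linked pairs contribute through their aggregated
    responsibilities in the same weighted-average form."""
    Nk = gamma.sum(axis=0)                       # effective class counts
    means = (gamma.T @ X) / Nk[:, None]          # responsibility-weighted means
    covs = []
    for k in range(gamma.shape[1]):
        diff = X - means[k]
        # Responsibility-weighted outer products, normalized by Nk.
        covs.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return means, covs
```

With hard (0/1) responsibilities this reduces to per-cluster sample means and covariances, which is a convenient sanity check.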
Estimating the mixing parameters $\pi_k$, on the other hand, entails the following constrained nonlinear optimization, which can be solved using sequential quadratic programming with Newton-Raphson steps [15, 1]. Let $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)^{\top}$ denote the vector of mixing parameters. Given the current estimates of the mean vectors and covariance matrices, the new estimate of the mixing parameters can be obtained from the optimization problem in (19),
$$\boldsymbol{\pi}^{\text{new}} = \arg\max_{\boldsymbol{\pi}} \; \sum_{k=1}^{K} N_k \log \pi_k - |\mathcal{C}| \log \Big( 1 - \sum_{m=1}^{K} \pi_m^2 \Big) \quad \text{s.t.} \;\; \sum_{k=1}^{K} \pi_k = 1, \;\; \pi_k \ge 0 \qquad (19)$$
where the initialization can be obtained from the closed-form solution that results from discarding the nonlinear part, i.e., ignoring the normalization term $1 - \sum_m \pi_m^2$. The energy function is convex, and we have found that this iterative algorithm typically converges in three to five iterations and does not represent a significant computational burden.
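The paper solves this constrained problem with sequential quadratic programming and Newton-Raphson steps; as an off-the-shelf stand-in, the sketch below uses SciPy's SLSQP solver on an objective of the general shape described above (the exact objective follows the paper's E-step expectations; this one is illustrative, as is the function name):

```python
import numpy as np
from scipy.optimize import minimize

def update_mixing(Nk, n_cannot_links):
    """Illustrative mixing-parameter update: maximize
    sum_k Nk[k] * log(pi_k) - n_cannot_links * log(1 - sum_k pi_k^2)
    over the probability simplex."""
    K = len(Nk)

    def neg_obj(pi):
        return -(np.dot(Nk, np.log(pi))
                 - n_cannot_links * np.log(1.0 - np.sum(pi ** 2)))

    # Closed-form initialization: ignore the nonlinear normalization term.
    pi0 = np.asarray(Nk, dtype=float) / np.sum(Nk)
    res = minimize(neg_obj, pi0, method="SLSQP",
                   bounds=[(1e-6, 1.0 - 1e-6)] * K,
                   constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}])
    return res.x
```

Dropping the nonlinear term recovers the familiar closed-form update $\pi_k = N_k / \sum_m N_m$, which is exactly the initialization described above.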
3.4.1 Multiple Mixture Clusters Per Class
In order to group data that lie on a subspace (e.g., a manifold structure) more faithfully, multiple clusters per class have been widely used in unsupervised clustering by representing the density model in a hierarchical structure [9, 46, 42, 18, 33, 22, 41]. Because of its natural representation of data, the hierarchical structure can be built using either a top-down or a bottom-up approach: the first decomposes one cluster into several small clusters, whereas the second starts by grouping several clusters into one. The multiple-cluster-per-class strategy has also been proposed for the setting where both labeled and unlabeled data are available [35, 28, 37, 21, 48, 12, 11, 17]. However, previous works indicated that the labeled data are unable to impact the final parameter estimate if the initial model assumption is incorrect [10, 29, 50, 39]. Moreover, it is not clear how to apply these previous works to pairwise links instead of labeled data.
In this section, we propose to use a generative mixture of Gaussian distributions for each class probability $p(x_n \mid \theta_k)$. In this form, we use multiple clusters to model one class, which accommodates data on a manifold structure. Therefore, in addition to the latent variable set $\mathcal{Z}$, the data are also associated with a latent variable set $\mathcal{Y}$, where $y_{nkm} = 1$ if and only if the corresponding data point was generated from the $m$th cluster in the $k$th class, subject to $\sum_{m=1}^{M_k} y_{nkm} = 1$; $M_k$ is the number of clusters in the $k$th class. The parameters of the generative mixture model are $\Theta = \{\pi_k, \theta_k\}_{k=1}^{K}$, where $\pi_k$ is the mixing parameter for the class proportion, as in section 3.4. The parameters of the $k$th class are $\theta_k = \{\lambda_{km}, \mu_{km}, \Sigma_{km}\}_{m=1}^{M_k}$, such that $\lambda_{km}$ is the mixing parameter for the cluster proportion, subject to $\sum_m \lambda_{km} = 1$; $\mu_{km}$ is the mean parameter, and $\Sigma_{km}$ is the covariance associated with the $m$th cluster in the $k$th class. The probability that an unsupervised data point is generated from the generative mixture model given the parameters is
$$p(x_n \mid \Theta) = \sum_{k=1}^{K} \pi_k \, p(x_n \mid \theta_k) \qquad (20)$$
where
$$p(x_n \mid \theta_k) = \sum_{m=1}^{M_k} \lambda_{km} \, \mathcal{N}(x_n ; \mu_{km}, \Sigma_{km}) \qquad (21)$$
and $\mathcal{N}(\cdot)$ is the Gaussian distribution. The definition in equation (21) can be used for the class-conditional densities in equations (5) and (9). In the E-step, the posterior of the latent variables in $\mathcal{Z}$ can be estimated by marginalizing over $\mathcal{Y}$ directly. In the M-step, we update the parameters by maximizing the expected complete-data log-likelihood, which is similar to the GMM case in section 3.4 (see Appendix A for details). Last, if $M_k = 1$, we have $\lambda_{k1} = 1$ and equation (20) becomes the GMM, i.e., one cluster/single Gaussian distribution per class.
4 Experiment
In this section, we demonstrate the effectiveness of the proposed generative model on a synthetic dataset as well as on well-known datasets, where the number of links can be significantly reduced compared to the state-of-the-art.
4.1 Experimental Settings
To illustrate the method, we start with the two cases of the class-conditional density: a mixture of Gaussians ($M_k > 1$) and a single Gaussian distribution ($M_k = 1$). To initialize the model parameters, we first randomly select the mean vectors by K-means++ [3], which is similar to the Gonzalez algorithm [19] without being completely greedy. Afterward, we assign every observed data point to its nearest initial mean, from which initial covariance matrices for each class are computed. We initially assume equally probable classes, with the mixing parameters set to $1/K$. When $M_k > 1$ (i.e., multiple clusters per class), we initialize the parameters of the $m$th cluster in the $k$th class using the aforementioned strategy, but only on the data points that have been assigned to the $k$th class after the above initialization. To mimic user preferences and assess the performance of the proposed model as a function of the number of available relations, pairwise relations are created by randomly selecting pairs of observed data points and using the knowledge of the distributions. If the points are assigned to the same cluster based on their ground-truth labeling, we move them to the must-link set; otherwise, to the cannot-link set. We perform 100 trials for all experiments. Each trial is constructed by a random initialization of the model parameters and random pairwise relations.
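The simulated-user protocol described above can be sketched as follows (an illustrative helper, not the authors' code): random pairs are drawn and routed to the must-link or cannot-link set according to the ground-truth labels.

```python
import random

def sample_relations(labels, n_pairs, seed=0):
    """Simulate user input: draw random pairs of points and, using the
    ground-truth labels, place each pair in the must-link set (same label)
    or the cannot-link set (different labels)."""
    rng = random.Random(seed)
    must, cannot = [], []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(labels)), 2)
        (must if labels[i] == labels[j] else cannot).append((i, j))
    return must, cannot
```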
We compare the proposed model, the generative model with pairwise relations (GMPR), to the unconstrained GMM, unconstrained spectral clustering (SC), and four other state-of-the-art algorithms: 1) GMM-EC: GMM with equivalence constraints [38]; 2) EM-PC: EM with posterior constraints [20] (it is worth mentioning that EM-PC works only with cannot-links); 3) SSKK: constrained kernel K-means [23]; and 4) CSC: flexible constrained spectral clustering [44] (https://github.com/gnaixgnaw/CSP). For SC, SSKK, and CSC, the similarity matrix is computed by the RBF kernel, whose parameter is set to the average squared distance between all pairs of data points. We use purity [32] for performance evaluation, which is a scalar value ranging from 0 to 1, where 1 is best. Purity is computed as follows: each class is assigned the most frequent ground-truth label; then, purity is measured by counting the number of correctly assigned observed data points in every ground-truth class and dividing by the total number of observed data points. The assignment is made according to the highest probability under the posterior distribution.
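The purity measure described above can be computed as in this short sketch (an illustrative implementation):

```python
from collections import Counter

def purity(assignments, truth):
    """Purity: each predicted cluster votes for its most frequent
    ground-truth label; return the fraction of points covered by those
    majority labels."""
    clusters = {}
    for a, t in zip(assignments, truth):
        clusters.setdefault(a, []).append(t)
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in clusters.values())
    return majority / len(truth)
```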
4.2 Results: Single Gaussian Distribution ($M_k = 1$)
In this section, we demonstrate the performance of the proposed model using a single Gaussian distribution on standard binary and multiclass problems.
4.2.1 Synthetic Data
We start by evaluating the performance of GMPR with a single Gaussian distribution per class on synthetic data. We generate a two-cluster toy example to mimic the example in Figure 1, which is motivated by [53]. The correct decision boundary should be the horizontal line along the x-axis. Figure 3(a) shows the generated data with the initial means. Figure 3(b) shows the clustering result obtained from an unconstrained GMM. Figure 3(c) shows that the proposed GMPR can learn the desired model with only two must-link and two cannot-link relations. Figure 3(d) shows that the proposed GMPR can learn the desired model with only two must-links. Figure 3(e) shows that the proposed GMPR can learn the desired model with only two cannot-links. This experiment illustrates the advantage of the proposed method, which performs well with only must-links or only cannot-links. This advantage distinguishes the proposed model from previous works [38, 25].
4.2.2 UCI Repository and Handwritten Digits
In this section, we report performance on three real datasets: 1) the Haberman's survival dataset (https://archive.ics.uci.edu/ml/datasets.html) contains 306 instances, 3 attributes, and 2 classes; 2) the MNIST database (http://yann.lecun.com/exdb/mnist/) contains images of handwritten digits; we used the test dataset, which contains 10000 examples, 784 attributes, and 10 classes [27]; and 3) the Thyroid dataset (http://www.raetschlab.org/Members/raetsch/benchmark) contains 215 instances, 5 attributes, and 2 classes.
We demonstrate the performance of GMPR on two binary clustering tasks, Haberman and Thyroid, and two multiclass problems, digits 1, 2, 3 and 4, 5, 6, 7. For ease of visualization, we work with only the leading two principal components of MNIST, obtained by principal component analysis (PCA). Figure 5 shows the two-dimensional inputs, color-coded by class label. Figure 4 shows that GMPR significantly outperforms GMM-EC on all datasets, regardless of the number of available links. Moreover, Figure 6 shows that GMPR performs well even when only must-links are available. Compared to EM-PC, which uses only cannot-links, Figure 7 shows that the performance of GMPR is always greater than or comparable to that of EM-PC. Figure 7 also shows that the performance of EM-PC decreases as the number of classes increases. The cannot-link relations in GMPR, on the other hand, contribute to the model whether the problem is binary or multiclass. Notice that all the experiments indicate that GMPR has a lower variance over 100 random initializations, which implies that GMPR is stable regardless of the number of available pairwise links.

4.3 Results: Mixture of Gaussians ($M_k > 1$)
In this section, we demonstrate the performance of the proposed model using a mixture of Gaussians on the datasets that have local manifold structure.
4.3.1 Synthetic Data: Two Moons Dataset
Data points in two moons lie on a moon-like manifold structure (Figure 8(a)), which allows us to show the advantage of using a mixture of Gaussians instead of a single Gaussian distribution per class. Figure 8(a) shows the data with initial means for the GMM and for GMPR using a single Gaussian. Figure 8(b) shows the data with initial means for GMPR using a mixture of Gaussians. Figure 8(c) shows the clustering result obtained from the unconstrained GMM, in which three points were assigned to the wrong class; it also shows that the performance of the GMM relies on the parameter initialization. Figure 8(d) shows that GMPR with a single cluster per class tries to learn the manifold structure via two must-link and two cannot-link relations; however, two points are still assigned to the incorrect class. Figure 8(e) shows that GMPR can trace the manifold structure using the same links as in (d) but with two clusters per class. This experiment illustrates the advantage of the proposed model with a mixture of distributions, which traces the local data structure with each individual cluster and describes the global data structure with the mixture of clusters.
4.3.2 Coil 20
In this section, we report performance on the COIL 20 dataset (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php), which contains images of 20 objects; each object was placed on a turntable and rotated 360 degrees so as to be captured in different poses by a fixed camera (Figure 9). The COIL 20 dataset contains 1440 instances and 1024 attributes. We set the number of clusters per class by cross-validation. Previous studies have shown that the intrinsic dimension of many high-dimensional real-world datasets is often quite small [36, 14]; therefore, each image is first projected onto a low-dimensional subspace (d = 10, 15, and 20). Figure 9 shows that GMPR provides higher purity values than SSKK and CSC with fewer links, regardless of the subspace dimension. In these experiments, we found that the proposed model can outperform the graph-based methods with fewer links.
4.4 Results: Sensitivity to Number of Clusters Per Class
Lastly, we examine the performance of the proposed model for different values of $M_k$. First, we use the same MNIST subsets as in section 4.2.2. In Figure 5(a), digit 1 clearly lies on a moon-like structure. Accordingly, Figure 10(a) shows that performance with larger values of $M_k$ is better once the number of links exceeds 64. However, in Figure 5(b), we observe hardly any manifold structure for digits 4, 5, 6, and 7. This observation also applies to the results in Figure 10(b): the performances for the different values of $M_k$ are very similar to each other, i.e., increasing the value of $M_k$ does not help. However, we also note that increasing the number of clusters does not hurt the performance of the model and might even enhance it, depending on the dataset.
Appendix A. Mixture of Distributions
Likelihood: Must-link Relationships
The likelihood of the must-link data points is
(22) 
Likelihood: Cannot-link Relationships
The likelihood of the cannot-link data points is
(23)  
(24) 
and
(25) 
E-Step:
Unsupervised Scenario
The expectation is
(26)  
and
(27) 
Must-link Scenario
The expectation is
(28) 
and
(29) 
where
(30) 
Cannot-link Scenario
The expectation is
(31) 
and
(32) 
where
(33) 
M-Step
The mean and covariance of the $m$th cluster in the $k$th class are
(34) 
(35) 
where
(36) 
and
(37) 
Because the mixing parameters for the clusters must sum to one, they can be determined using a Lagrange multiplier.
(38) 
where the last term involves the Lagrange multiplier. Taking the derivative of equation (38) with respect to the cluster mixing parameter,
(39) 
By taking the derivative of equation (38) with respect to the Lagrange multiplier and setting it equal to zero, we recover the sum-to-one constraint, which can be used to eliminate the multiplier in equation (39). The mixing parameter for the $m$th cluster in the $k$th mixture is then given by
(40) 
Lastly, estimating the mixing parameters $\pi_k$ for the class mixture is the same as in equation (19).
References
 [1] Abramowitz, M., Stegun, I.A.: Handbook of mathematical functions: with formulas, graphs, and mathematical tables, vol. 55. Courier Corporation (1964)
 [2] Allab, K., Benabdeslem, K.: Constraint selection for semi-supervised topological clustering. In: Machine Learning and Knowledge Discovery in Databases, pp. 28–43. Springer (2011)
 [3] Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics (2007)
 [4] Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D.: Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research 6(6), 937–965 (2005)
 [5] Basu, S., Bilenko, M., Mooney, R.J.: A probabilistic framework for semi-supervised clustering. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 59–68. ACM (2004)
 [6] Basu, S., Davidson, I., Wagstaff, K.: Constrained clustering: Advances in algorithms, theory, and applications. CRC Press (2008)
 [7] Chapelle, O., Schölkopf, B., Zien, A., et al.: Semisupervised learning, vol. 2. MIT press Cambridge (2006)
 [8] Cohn, D., Caruana, R., McCallum, A.: Semisupervised clustering with user feedback. Constrained Clustering: Advances in Algorithms, Theory, and Applications 4(1), 17–32 (2003)

[9]
Coviello, E., Lanckriet, G.R., Chan, A.B.: The variational hierarchical em algorithm for clustering hidden markov models.
In: Advances in neural information processing systems, pp. 404–412 (2012)  [10] Cozman, F.G., Cohen, I., Cirelo, M.C., et al.: Semisupervised learning of mixture models. In: international conference on Machine learning, pp. 99–106 (2003)

[11]
Dara, R., Kremer, S.C., Stacey, D., et al.: Clustering unlabeled data with soms
improves classification of labeled realworld data.
In: Neural Networks, 2002. IJCNN’02. Proceedings of the 2002 International Joint Conference on, vol. 3, pp. 2237–2242. IEEE (2002)

[12]
Demiriz, A., Bennett, K.P., Embrechts, M.J.: Semisupervised clustering using genetic algorithms.
Artificial neural networks in engineering (ANNIE99) pp. 809–814 (1999)  [13] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological) pp. 1–38 (1977)
 [14] Felsberg, M., Kalkan, S., Krüger, N.: Continuous dimensionality characterization of image structures. Image and Vision Computing 27(6), 628–636 (2009)
 [15] Fletcher, R.: Practical methods of optimization. John Wiley & Sons (2013)

[16]
Gammerman, A., Vovk, V., Vapnik, V.: Learning by transduction.
In: Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pp. 148–155. Morgan Kaufmann Publishers Inc. (1998)
 [17] Goldberg, A.B., Zhu, X., Singh, A., Xu, Z., Nowak, R.: Multimanifold semisupervised learning (2009)

[18]
Goldberger, J., Roweis, S.T.: Hierarchical clustering of a mixture model.
In: Advances in Neural Information Processing Systems, pp. 505–512 (2004)  [19] Gonzalez, T.F.: Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38, 293–306 (1985)
 [20] Graca, J., Ganchev, K., Taskar, B.: Expectation maximization and posterior constraints. In: Advances in neural information processing systems (2007)
 [21] He, X., Cai, D., Shao, Y., Bao, H., Han, J.: Laplacian regularized gaussian mixture model for data clustering. Knowledge and Data Engineering, IEEE Transactions on 23(9), 1406–1418 (2011)
 [22] Jordan, M.I., Jacobs, R.A.: Hierarchical mixtures of experts and the em algorithm. Neural computation 6(2), 181–214 (1994)
 [23] Kulis, B., Basu, S., Dhillon, I., Mooney, R.: Semisupervised graph clustering: a kernel approach. Machine Learning 74(1), 1–22 (2009)
 [24] Kyriakopoulou, A., Kalamboukis, T.: The impact of semisupervised clustering on text classification. In: Proceedings of the 17th Panhellenic Conference on Informatics, pp. 180–187. ACM (2013)
 [25] Law, M.H., Topchy, A.P., Jain, A.K.: Modelbased clustering with probabilistic constraints. In: SDM, pp. 641–645. SIAM (2005)

[26]
Le Nguyen, M., Shimazu, A.: A semi supervised learning model for mapping
sentences to logical forms with ambiguous supervision.
Data & Knowledge Engineering
90, 1–12 (2014)  [27] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradientbased learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
 [28] Liu, J., Cai, D., He, X.: Gaussian mixture model with local consistency. In: AAAI, vol. 10, pp. 512–517. Citeseer (2010)

[29]
Loog, M.: Semisupervised linear discriminant analysis through momentconstraint parameter estimation.
Pattern Recognition Letters 37, 24–31 (2014)  [30] Lu, Z., Leen, T.K.: Semisupervised learning with penalized probabilistic clustering. In: Advances in neural information processing systems, pp. 849–856 (2004)
 [31] Mann, G.S., McCallum, A.: Generalized expectation criteria for semisupervised learning with weakly labeled data. The Journal of Machine Learning Research 11, 955–984 (2010)
 [32] Manning, C.D., Raghavan, P., Schütze, H.: Introduction to information retrieval, vol. 1. Cambridge university press Cambridge (2008)
 [33] Meila, M., Jordan, M.I.: Learning with mixtures of trees. The Journal of Machine Learning Research 1, 1–48 (2001)
 [34] Nguyen, T.P., Ho, T.B.: Detecting disease genes based on semisupervised learning and protein–protein interaction networks. Artificial intelligence in medicine 54(1), 63–71 (2012)
 [35] Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using em. Machine learning 39(23), 103–134 (2000)
 [36] Raginsky, M., Lazebnik, S.: Estimation of intrinsic dimensionality using highrate vector quantization. In: Advances in neural information processing systems, pp. 1105–1112 (2005)
 [37]