In several domains obtaining class annotations is expensive while at the same time unlabelled data are abundant. While most semi-supervised approaches enforce restrictive assumptions on the data distribution, recent work has managed to learn semi-supervised models in a non-restrictive regime. However, so far such approaches have only been proposed for linear models. In this work, we introduce semi-supervised parameter learning for Sum-Product Networks (SPNs). SPNs are deep probabilistic models admitting inference in time linear in the number of network edges. Our approach has several advantages, as it (1) allows generative and discriminative semi-supervised learning, (2) guarantees that adding unlabelled data can increase, but not degrade, the performance (safe), and (3) is computationally efficient and does not enforce restrictive assumptions on the data distribution. We show on a variety of data sets that safe semi-supervised learning with SPNs is competitive compared to state-of-the-art approaches and can lead to a better generative and discriminative objective value than a purely supervised approach.
In several domains, unlabelled observations are abundant and cheap to acquire, while obtaining class labels is expensive and sometimes infeasible for large amounts of data. In such cases, semi-supervised learning can be used to exploit large amounts of unlabelled data in addition to labelled data. Examples include text [30] or image data [17, 27, 22], which are ubiquitous online, but also biological (genomics, proteomics, gene expression) data [31] and speech [28].
One of the challenges facing most semi-supervised learning approaches is scalability: many methods scale quadratically or even cubically with data set size, or require restrictive assumptions such as low dimensionality or sparsity [32, 17]. In fact, if the data violate the assumptions enforced by a learner, the use of additional unlabelled data can even degrade classification performance.
Several approaches for semi-supervised learning have been proposed, including self-training, Transductive Support Vector Machines (TSVM) [5], and graph-based methods. We refer to [32, 14] for comprehensive reviews of the state-of-the-art. As pointed out by [17], self-training is error-prone (it can reinforce poor predictions), and TSVM as well as graph-based methods are difficult to scale. In addition, TSVM and its recent extensions [19] require that the decision boundary lie in a low-density region, yielding suboptimal accuracy if this assumption is not met. Each of these methods can lead to decreased accuracy when unlabelled data are added. To overcome these limitations, [21] recently proposed a probabilistic formulation for safe semi-supervised learning of generative linear models.

In the family of deep probabilistic models, Sum-Product Networks (SPNs) [26] have recently gained popularity due to their efficiency (linear-time inference), generality (they subsume existing approaches such as latent tree models and mixtures), and performance on various tasks including computer vision [26, 10], action recognition [2], speech [24], and language modelling [4].

For probabilistic models, including SPNs, semi-supervised learning with generative models is natural. Data points are assigned to whichever class $y$ maximizes $p(\mathbf{x} \mid y)$, with $p(\mathbf{x} \mid y)$ being a generative model for the data in class $y$. Subsequently, the labelled data points can be used to learn the model. Unfortunately, adding unlabelled data can significantly degrade classification accuracy instead of improving it [6].
In this paper, we introduce semi-supervised parameter learning for SPNs that is safe, scalable and non-restrictive. Safe means that adding unlabelled data can increase, but not degrade, model performance. The training time scales linearly with the number of added data points, and apart from the structure of the underlying SPN, no assumptions are made regarding the data distribution. Unlike other semi-supervised methods, the presented approach does not need low-density or clustering assumptions [3]. In addition to safety, we show competitive results when compared with state-of-the-art approaches in Section 4.
The structure of the paper is as follows: Section 2 introduces the notation used throughout the paper, describes recent approaches for parameter learning in SPNs, and introduces contrastive pessimistic likelihood estimation for safe semi-supervised learning of generative models. In Section 3 we propose safe semi-supervised learning for SPNs, give derivations for generative and discriminative parameter learning, and present the algorithm MCP-SPN for training safe semi-supervised SPNs. Experiments are presented in Section 4, showing that safe semi-supervised SPNs are able to escape from degenerate supervised solutions, generally outperform purely supervised learning, and achieve competitive performance on a variety of data sets. Section 5 concludes the paper and gives future prospects.

We use capital letters to denote random variables (RVs) and denote a set of RVs as $\mathbf{X}$. Moreover, we denote a realisation of an RV $X$ using lowercase letters and indicate a realisation of $\mathbf{X}$ using bold lowercase letters, e.g. $\mathbf{x}$. We denote the set of labelled observations as $\mathcal{D}_L = \{(\mathbf{x}_l, \mathbf{y}_l)\}_{l=1}^{L}$ and the set of unlabelled observations as $\mathcal{D}_U = \{\mathbf{x}_u\}_{u=1}^{U}$, where the $\mathbf{x}$ are the features and the $\mathbf{y}$ the labels in one-hot encoding. Additionally, we use $q_u$ to denote the soft labels for the unlabelled observations. We generally write $p(\mathbf{x})$ instead of $p(\mathbf{X} = \mathbf{x})$ and $p(y)$ instead of $p(Y = y)$. For readability, we will refer to the value of an SPN using calligraphic notation, $\mathcal{S}(\mathbf{x})$, and write $\mathcal{S}_i(\mathbf{x})$ for the value of the $i$-th node in an SPN.

SPNs are a deep probabilistic architecture which captures expressive variable interactions while guaranteeing exact computation of marginals in linear time. SPNs have their foundation in the network polynomials for efficient inference in Bayesian networks introduced by [7]. Poon and Domingos [26] generalized the idea and introduced SPNs over random variables (RVs) with finitely many states.

(Sum-Product Network [26]) A sum-product network (SPN) over variables $X_1, \dots, X_N$ is a rooted directed acyclic graph whose leaves are the indicators $x_1, \dots, x_N$ and $\bar{x}_1, \dots, \bar{x}_N$ and whose internal nodes are sums and products. Each edge $(i, j)$ emanating from a sum node $i$ has a non-negative weight $w_{ij}$. The value of a product node is the product of the values of its children. The value of a sum node $i$ is $\mathcal{S}_i = \sum_{j \in \mathbf{ch}(i)} w_{ij} \mathcal{S}_j$, where $\mathbf{ch}(i)$ denotes the children of $i$ and $\mathcal{S}_j$ is the value of node $j$. The value of an SPN is the value of its root.
SPNs can be generalized by replacing the leaf node indicators with arbitrary input distributions [25]. Thus, we consider SPNs with arbitrary leaf node distributions throughout the paper.
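To make the definition concrete, the following minimal Python sketch (illustrative, not from the paper) evaluates a toy SPN bottom-up; the paired indicator leaves follow the definition above, while the structure and weights are invented for the example:

```python
class Leaf:
    """Indicator leaf: returns 1.0 when its condition on x holds, else 0.0."""
    def __init__(self, fn):
        self.fn = fn
    def value(self, x):
        return self.fn(x)

class Product:
    def __init__(self, children):
        self.children = children
    def value(self, x):
        v = 1.0
        for c in self.children:
            v *= c.value(x)   # product node: product of child values
        return v

class Sum:
    def __init__(self, weighted_children):
        self.weighted_children = weighted_children  # (weight, child) pairs
    def value(self, x):
        # sum node: weighted sum of child values
        return sum(w * c.value(x) for w, c in self.weighted_children)

# Toy SPN over two binary variables: a mixture of two product nodes.
x1  = Leaf(lambda x: 1.0 if x[0] == 1 else 0.0)  # indicator [X1 = 1]
nx1 = Leaf(lambda x: 1.0 if x[0] == 0 else 0.0)  # indicator [X1 = 0]
x2  = Leaf(lambda x: 1.0 if x[1] == 1 else 0.0)
nx2 = Leaf(lambda x: 1.0 if x[1] == 0 else 0.0)

root = Sum([(0.7, Product([x1, x2])),
            (0.3, Product([nx1, nx2]))])

p11 = root.value((1, 1))   # 0.7
p00 = root.value((0, 0))   # 0.3
p10 = root.value((1, 0))   # 0.0
```

A single upward pass computes the value of every node, which is what makes inference linear in the number of edges.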
The parameters of an SPN can be learned efficiently using Expectation Maximisation (EM) [26, 23]. We use the formulation of [23], where the updates for the weights of the $k$-th sum node are defined as:
$$\beta_{k,j} = \sum_{n} \frac{w_{k,j}}{\mathcal{S}(\mathbf{x}_n)} \frac{\partial \mathcal{S}(\mathbf{x}_n)}{\partial \mathcal{S}_k} \, \mathcal{S}_j(\mathbf{x}_n) \qquad (1)$$

$$w_{k,j} \leftarrow \frac{\beta_{k,j}}{\sum_{j' \in \mathbf{ch}(k)} \beta_{k,j'}} \qquad (2)$$
Furthermore, the parameter update for an exponential family leaf node $L$ with scope $X$ and parameters $\theta_L$ is given by the expected sufficient statistics and can be computed as:
$$\gamma_L(\mathbf{x}_n) = \frac{1}{\mathcal{S}(\mathbf{x}_n)} \frac{\partial \mathcal{S}(\mathbf{x}_n)}{\partial \mathcal{S}_L} \, \mathcal{S}_L(\mathbf{x}_n) \qquad (3)$$

$$\theta_L \leftarrow \frac{\sum_{n} \gamma_L(\mathbf{x}_n) \, T(x_n)}{\sum_{n} \gamma_L(\mathbf{x}_n)} \qquad (4)$$
where $T(\cdot)$ denotes the sufficient statistics. We assume complete evidence for the RVs and refer to [23] for a derivation of the updates with partial evidence.
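As a hedged illustration of the weight update above, the sketch below runs one EM step on the simplest possible SPN, a single sum node over two Gaussian leaves (i.e. a mixture); for the root, $\partial \mathcal{S} / \partial \mathcal{S}_{\mathrm{root}} = 1$, so the per-sample terms reduce to the familiar mixture responsibilities. Data, weights and components are illustrative:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_weight_update(data, weights, components):
    # beta_j accumulates w_j * S_j(x) / S(x) over the data
    # (the root's own derivative dS/dS_root is 1)
    beta = [0.0] * len(weights)
    for x in data:
        child_vals = [gauss_pdf(x, mu, sigma) for mu, sigma in components]
        s = sum(w * v for w, v in zip(weights, child_vals))   # S(x)
        for j, (w, v) in enumerate(zip(weights, child_vals)):
            beta[j] += w * v / s
    total = sum(beta)
    return [b / total for b in beta]   # renormalise: w_j <- beta_j / sum_j' beta_j'

data = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8, 2.1]
new_w = em_weight_update(data, [0.5, 0.5], [(-2.0, 0.5), (2.0, 0.5)])
# weights move toward the empirical component proportions (about 3/7 and 4/7)
```

With the components this well separated, the responsibilities are nearly hard assignments, so one step already recovers the data proportions.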
The parameters of a discriminative SPN can be learned by optimising the conditional log likelihood using backpropagation [10]. The set of variables of a discriminative SPN is divided into query variables $\mathbf{Y}$, hidden variables $\mathbf{H}$ and observed RVs $\mathbf{X}$. Therefore, the value of a discriminative SPN is denoted as $\mathcal{S}(\mathbf{y}, \mathbf{h} \mid \mathbf{x})$. Furthermore, the conditional probability $P(\mathbf{y} \mid \mathbf{x})$ is estimated by setting all indicator functions of the hidden variables to $1$ and computing

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{\mathcal{S}(\mathbf{y}, \mathbf{h} = \mathbf{1} \mid \mathbf{x})}{\sum_{\mathbf{y}'} \mathcal{S}(\mathbf{y}', \mathbf{h} = \mathbf{1} \mid \mathbf{x})} \qquad (5)$$
where setting the indicators of the hidden variables to one allows the gradients of the conditional log likelihood to be computed in a single upward pass. For the sake of readability, we omit the hidden variables when their indicators are set to one and write $\mathcal{S}(\mathbf{y} \mid \mathbf{x})$ for the value of a discriminative SPN instead.
Given a network structure, one can train a discriminative SPN by gradient ascent using the partial derivatives of the SPN with respect to the parameters of the network. The partial derivatives of the weights take the form
$$\frac{\partial \mathcal{S}}{\partial w_{ij}} = \frac{\partial \mathcal{S}}{\partial \mathcal{S}_i} \, \mathcal{S}_j \qquad (6)$$
where $\partial \mathcal{S} / \partial \mathcal{S}_i$ is computed using backpropagation. By setting the gradient of the root node to $\partial \mathcal{S} / \partial \mathcal{S}_{\mathrm{root}} = 1$, the gradients of the subsequent nodes are computed in top-down order. At sum nodes the gradient is propagated to child $j$ using $w_{ij} \, \partial \mathcal{S} / \partial \mathcal{S}_i$, and at product nodes using $(\partial \mathcal{S} / \partial \mathcal{S}_i) \prod_{k \in \mathbf{ch}(i) \setminus \{j\}} \mathcal{S}_k$. As indicated, the gradient at a node is accumulated over its parents' gradients. We refer to [10] for further details on the derivation of the gradients and of the hard gradient updates.
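The top-down pass just described can be sketched as follows; the dictionary-based node representation and the toy network are illustrative, and for brevity the sketch recurses instead of traversing nodes in topological order (sufficient here because only a leaf is shared):

```python
def evaluate(node):
    # upward pass: cache each internal node's value in node["val"]
    if node["type"] == "leaf":
        return node["val"]
    if node["type"] == "prod":
        v = 1.0
        for c in node["children"]:
            v *= evaluate(c)
        node["val"] = v
        return v
    v = sum(w * evaluate(c) for w, c in node["children"])
    node["val"] = v
    return v

def backprop(node, grad, grads):
    # accumulate dS/dS_node over all parents; the root is seeded with 1
    grads[id(node)] = grads.get(id(node), 0.0) + grad
    if node["type"] == "sum":
        for w, child in node["children"]:
            backprop(child, grad * w, grads)        # sum node: pass g * w_ij
    elif node["type"] == "prod":
        for j, child in enumerate(node["children"]):
            others = 1.0
            for k, sib in enumerate(node["children"]):
                if k != j:
                    others *= sib["val"]            # product of the siblings
            backprop(child, grad * others, grads)

l1 = {"type": "leaf", "val": 0.8}
l2 = {"type": "leaf", "val": 0.3}
p = {"type": "prod", "children": [l1, l2]}
root = {"type": "sum", "children": [(0.6, p), (0.4, l1)]}

s = evaluate(root)        # 0.6 * 0.8 * 0.3 + 0.4 * 0.8 = 0.464
grads = {}
backprop(root, 1.0, grads)
# weight derivative (dS/dS_i) * S_j for the edge root -> p:
dw = grads[id(root)] * p["val"]   # 0.24
```

Note how the shared leaf `l1` accumulates gradient from both of its parents, matching the accumulation described above.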
As in network polynomials for Bayesian networks [7], partial derivatives of any parameter in an SPN can be calculated using the chain rule, leading to a straightforward computation of the parameter updates for the leaf node distributions, i.e.
$$\frac{\partial \mathcal{S}}{\partial \theta_L} = \frac{\partial \mathcal{S}}{\partial \mathcal{S}_L} \frac{\partial \mathcal{S}_L}{\partial \theta_L} \qquad (7)$$

$$\theta_L \leftarrow \theta_L + \eta \, \frac{\partial \mathcal{S}}{\partial \theta_L} \qquad (8)$$
In the case of univariate Gaussian leaf distributions, the updates are computed by taking the partial derivatives with respect to the mean and the variance of the distribution.
Most semi-supervised learning approaches require strong assumptions, e.g. the low-density assumption for TSVM, and can lead to decreased performance with an increasing number of unlabelled samples if these assumptions are violated. Loog [21] proposed Contrastive Pessimistic Likelihood Estimation (CPLE) in order to facilitate performance guarantees while relying only on the assumptions of an underlying generative model.
CPLE maintains soft labels (hypotheses) for each unlabelled data point and assigns them pessimistically, using a training objective that maximizes the log likelihood on the data but minimizes the improvement provided by the unlabelled data. CPLE therefore yields a safe semi-supervised objective.
Model parameters under CPLE are estimated according to:
$$\theta^{*} = \operatorname*{arg\,max}_{\theta} \; \min_{\mathbf{q}} \; \mathcal{L}(\theta \mid \mathcal{D}_L, \mathcal{D}_U, \mathbf{q}) - \mathcal{L}(\theta^{+} \mid \mathcal{D}_L, \mathcal{D}_U, \mathbf{q}) \qquad (9)$$
where $\mathbf{q}$ denotes the soft labels for every unlabelled data point and $\theta^{+}$ denotes the parameters of a purely supervised model derived only from $\mathcal{D}_L$. The introduction of soft labels respects the fact that classes may be overlapping. In the case of $K$ unique class labels, each soft label vector $q_u$ is an element of the simplex $\Delta^{K-1}$.
Since the trained classifier assumes the worst-case improvement, its performance cannot degrade when adding unlabelled data. Loog [21] constrains CPLE to generative models and provides a concrete solution for a simple linear classifier based on linear discriminant analysis. In the following, we define a contrastive pessimistic objective for generative and discriminative SPNs, yielding a safe semi-supervised learning procedure with linear computational complexity which relies only on the assumptions intrinsic to the given network structure.

Given an SPN, we can find the optimal parameters for generative safe semi-supervised learning using the CPLE objective defined in Equation (9). For clarity, we always use the plus operator to indicate parameters of the purely supervised solution, e.g. weights $w^{+}$, and indicate parameters of the safe semi-supervised solution using an asterisk. Due to the conservative choice of $\mathbf{q}$ by minimizing the improvement over the supervised result, and since we can always take $\theta^{*} = \theta^{+}$ in the worst case, this objective is guaranteed to lead to a safe solution. More formally, as shown in Loog [21], it is guaranteed that
$$\mathcal{L}(\theta^{*} \mid \mathcal{D}_L, \mathcal{D}_U, \mathbf{q}^{*}) \;\geq\; \mathcal{L}(\theta^{+} \mid \mathcal{D}_L, \mathcal{D}_U, \mathbf{q}^{*}) \qquad (10)$$
Therefore, if log likelihoods are used in the CPLE objective, the safe semi-supervised solution has at least the same log likelihood given $\mathcal{D}_L$ and $\mathcal{D}_U$ as the purely supervised solution.
In the following we derive the Expectation Maximisation (EM) updates for the generative safe semi-supervised SPN. To this end, let
$$\mathcal{L}_L(\theta \mid \mathcal{D}_L) = \sum_{(\mathbf{x}_l, y_l) \in \mathcal{D}_L} \sum_{k=1}^{K} \mathbb{1}[y_l = k] \log \mathcal{S}(\mathbf{x}_l, y = k) \qquad (11)$$
be the log likelihood of a semi-supervised SPN for labelled observations $\mathcal{D}_L$. We denote by $\mathbb{1}[\cdot]$ the indicator for class $k$, which is one if its argument is true and zero otherwise. Furthermore, let
$$\mathcal{L}_U(\theta \mid \mathcal{D}_U, \mathbf{q}) = \sum_{\mathbf{x}_u \in \mathcal{D}_U} \sum_{k=1}^{K} q_{u,k} \log \mathcal{S}(\mathbf{x}_u, y = k) \qquad (12)$$
be the log likelihood of a semi-supervised SPN for unlabelled observations $\mathcal{D}_U$, with $q_{u,k}$ being the soft label of observation $\mathbf{x}_u$ for class $k$. Note that $\sum_{k} q_{u,k} = 1$ for all unlabelled observations, as each soft label vector is an element of the simplex. We can therefore define the generative log likelihood function of a semi-supervised SPN as the sum of the log likelihoods given the labelled and the unlabelled data. Formally, we define the generative log likelihood of a semi-supervised SPN as
$$\mathcal{L}(\theta \mid \mathcal{D}_L, \mathcal{D}_U, \mathbf{q}) = \mathcal{L}_L(\theta \mid \mathcal{D}_L) + \mathcal{L}_U(\theta \mid \mathcal{D}_U, \mathbf{q}) \qquad (13)$$
which allows for a straightforward derivation of the EM updates. The updates of the weights of sum node $k$ can be computed as in Eq. (2) using the following $\beta_{k,j}$, i.e.
$$\beta_{k,j} = \sum_{(\mathbf{x}_l, y_l) \in \mathcal{D}_L} \frac{w_{k,j}}{\mathcal{S}(\mathbf{x}_l, y_l)} \frac{\partial \mathcal{S}(\mathbf{x}_l, y_l)}{\partial \mathcal{S}_k} \mathcal{S}_j(\mathbf{x}_l, y_l) \;+\; \sum_{\mathbf{x}_u \in \mathcal{D}_U} \sum_{c=1}^{K} q_{u,c} \frac{w_{k,j}}{\mathcal{S}(\mathbf{x}_u, c)} \frac{\partial \mathcal{S}(\mathbf{x}_u, c)}{\partial \mathcal{S}_k} \mathcal{S}_j(\mathbf{x}_u, c) \qquad (14)$$
where we omitted the parametrisation of the network for better readability. Furthermore, we can update the parameters of an exponential family leaf node $L$ with scope $X$ using the expected sufficient statistics as
$$\gamma_L(\mathbf{x}_l, y_l) = \frac{1}{\mathcal{S}(\mathbf{x}_l, y_l)} \frac{\partial \mathcal{S}(\mathbf{x}_l, y_l)}{\partial \mathcal{S}_L} \, \mathcal{S}_L(\mathbf{x}_l) \qquad (15)$$

$$\gamma_L(\mathbf{x}_u, c) = \frac{1}{\mathcal{S}(\mathbf{x}_u, c)} \frac{\partial \mathcal{S}(\mathbf{x}_u, c)}{\partial \mathcal{S}_L} \, \mathcal{S}_L(\mathbf{x}_u) \qquad (16)$$

$$\theta_L \leftarrow \frac{\sum_{(\mathbf{x}_l, y_l) \in \mathcal{D}_L} \gamma_L(\mathbf{x}_l, y_l) \, T(x_l) + \sum_{\mathbf{x}_u \in \mathcal{D}_U} \sum_{c=1}^{K} q_{u,c} \, \gamma_L(\mathbf{x}_u, c) \, T(x_u)}{\sum_{(\mathbf{x}_l, y_l) \in \mathcal{D}_L} \gamma_L(\mathbf{x}_l, y_l) + \sum_{\mathbf{x}_u \in \mathcal{D}_U} \sum_{c=1}^{K} q_{u,c} \, \gamma_L(\mathbf{x}_u, c)} \qquad (17)$$
where we assume complete evidence for the RVs $\mathbf{X}$ and $Y$.
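Before turning to the soft label updates, a small sketch of evaluating the generative semi-supervised log likelihood defined above: the labelled term uses the observed class, while the unlabelled term weights each class by its soft label. The SPN joint $\mathcal{S}(\mathbf{x}, y = k)$ is replaced here by a stand-in pair of class-conditional Gaussians; all names and numbers are illustrative:

```python
import math

# Illustrative stand-in for the SPN joint value S(x, y = k): one Gaussian
# per class with means -1 and +1 and equal class priors (not from the paper).
def joint(x, k):
    mus = [-1.0, 1.0]
    return 0.5 * math.exp(-0.5 * (x - mus[k]) ** 2) / math.sqrt(2 * math.pi)

def ssl_log_likelihood(labelled, unlabelled, soft_labels):
    # labelled term: log joint of the observed class
    ll = sum(math.log(joint(x, y)) for x, y in labelled)
    # unlabelled term: class log joints weighted by the soft labels q_uk
    for x, q in zip(unlabelled, soft_labels):
        ll += sum(q_k * math.log(joint(x, k)) for k, q_k in enumerate(q))
    return ll

labelled = [(-1.1, 0), (0.9, 1)]
unlabelled = [-0.8, 1.2]
soft = [[0.9, 0.1], [0.2, 0.8]]
ll = ssl_log_likelihood(labelled, unlabelled, soft)
```

Because the objective is linear in the soft labels, moving mass toward the class with the higher log joint raises the value, which is exactly what the pessimistic step below counteracts.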
Subsequently, the soft label $q_{u,k}$ for class $k$ of an unlabelled sample $\mathbf{x}_u$ is updated pessimistically with gradient descent using the partial derivative of the contrastive objective, which is defined as
$$\frac{\partial}{\partial q_{u,k}} \Big[ \mathcal{L}(\theta^{*} \mid \mathcal{D}_L, \mathcal{D}_U, \mathbf{q}) - \mathcal{L}(\theta^{+} \mid \mathcal{D}_L, \mathcal{D}_U, \mathbf{q}) \Big] \qquad (18)$$

$$= \log \mathcal{S}_{\theta^{*}}(\mathbf{x}_u, y = k) - \log \mathcal{S}_{\theta^{+}}(\mathbf{x}_u, y = k) \qquad (19)$$
Note that after each gradient update it is necessary to ensure that the soft labels for the unlabelled data points are on the simplex. For this purpose, the soft labels are projected back to the simplex using the approach by Duchi et al. [8].
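The projection step can be sketched with the sort-based algorithm of Duchi et al. [8]; this is a standard formulation, shown here for illustration:

```python
def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex
    (sort-based O(K log K) algorithm of Duchi et al.)."""
    u = sorted(v, reverse=True)
    css = 0.0                      # running cumulative sum of sorted entries
    rho, rho_css = 0, u[0]
    for i, ui in enumerate(u):
        css += ui
        if ui - (css - 1.0) / (i + 1) > 0.0:
            rho, rho_css = i, css  # largest index where the threshold holds
    theta = (rho_css - 1.0) / (rho + 1)
    return [max(x - theta, 0.0) for x in v]

# A gradient step can push the soft labels off the simplex; project back:
q = [0.5, 0.9, -0.1]
q_proj = project_to_simplex(q)     # [0.3, 0.7, 0.0]
```

Points already on the simplex are left unchanged, so the projection only corrects infeasible updates.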
In the semi-supervised regime, conditional likelihoods are a more natural way of learning SPNs for classification tasks than generative objectives. Formally, the model parameters for discriminative safe semi-supervised SPNs are estimated according to
$$\theta^{*} = \operatorname*{arg\,max}_{\theta} \; \min_{\mathbf{q}} \; \mathcal{CL}(\theta \mid \mathcal{D}_L, \mathcal{D}_U, \mathbf{q}) - \mathcal{CL}(\theta^{+} \mid \mathcal{D}_L, \mathcal{D}_U, \mathbf{q}) \qquad (20)$$
where we intentionally use $\mathcal{CL}$ to indicate the use of the conditional log likelihood. Extending the formulation for discriminative SPNs allows us to define a discriminative learning approach for safe semi-supervised SPNs, i.e.,
$$\mathcal{CL}(\theta \mid \mathcal{D}_L, \mathcal{D}_U, \mathbf{q}) = \mathcal{CL}_L(\theta \mid \mathcal{D}_L) + \mathcal{CL}_U(\theta \mid \mathcal{D}_U, \mathbf{q}) \qquad (21)$$
where the conditional log likelihoods for labelled and unlabelled data, respectively, are given as
$$\mathcal{CL}_L(\theta \mid \mathcal{D}_L) = \sum_{(\mathbf{x}_l, y_l) \in \mathcal{D}_L} \log \mathcal{S}(y_l \mid \mathbf{x}_l) \qquad (22)$$

$$\mathcal{CL}_U(\theta \mid \mathcal{D}_U, \mathbf{q}) = \sum_{\mathbf{x}_u \in \mathcal{D}_U} \sum_{k=1}^{K} q_{u,k} \log \mathcal{S}(y = k \mid \mathbf{x}_u) \qquad (23)$$
The partial derivatives with respect to the weights of the discriminative semi-supervised SPN therefore become
$$\frac{\partial \, \mathcal{CL}}{\partial w_{ij}} = \sum_{(\mathbf{x}_l, y_l) \in \mathcal{D}_L} \frac{\partial \log \mathcal{S}(y_l \mid \mathbf{x}_l)}{\partial w_{ij}} + \sum_{\mathbf{x}_u \in \mathcal{D}_U} \sum_{k=1}^{K} q_{u,k} \frac{\partial \log \mathcal{S}(y = k \mid \mathbf{x}_u)}{\partial w_{ij}} \qquad (24)$$
Similarly, we can derive the partial derivatives with respect to the leaf node parameters by applying the chain rule, leading to the following parameter updates
(25)  
(26)  
(27) 
To pessimistically update the soft labels, one can use gradient descent on the partial derivatives, similarly to the generative objective in Eq. (19).
The algorithm Maximum Contrastive Pessimistic SPN (MCP-SPN) for learning safe semi-supervised SPNs is illustrated in Algorithm 1 and consists of the following adversarial steps: (1) optimising the safe semi-supervised solution for the given soft labels by maximising a generative or discriminative objective, and (2) minimising the improvement of the semi-supervised solution over the purely supervised solution by adjusting the soft labels pessimistically. As an SPN is a multilinear function of its model parameters, we can apply the generalisation of the minimax theorem for multilinear functions [16] and interchange the maximisation and the minimisation in our algorithm.

Depending on the choice of the objective, the MCP-SPN procedure first finds a purely supervised solution by maximising the chosen objective with respect to the labelled data only. Secondly, we initialise all soft labels of the unlabelled data either using an optimistic approach or using random draws from a Dirichlet distribution. In the case of a generative objective, the purely supervised solution can degenerate to a point-mass estimator. It is therefore useful for generative SPNs to initialise the soft labels using random draws instead of starting from an optimistic labelling. After initialising all soft labels, the MCP-SPN procedure finds a safe semi-supervised solution by alternating between the two adversarial steps. The function call projectOnSimplex refers to the approach in [8], which we use to project the soft label assignments back onto the simplex (other approaches for this task could also be used). Note that we found it useful to decrease the learning rate of the pessimistic soft label adjustment over time. In our experiments we therefore used a simple decay function; if necessary, more advanced approaches can be used instead. The source code for safe semi-supervised learning of SPNs is available online at https://github.com/trappmartin/SSLSPN_UAI2017.
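The alternation can be illustrated end-to-end on a toy problem. In the sketch below the SPN is replaced by the simplest generative model (one unit-variance Gaussian per class, with closed-form maximisation), a 1/t schedule stands in for the decay function mentioned above, and a clip-and-renormalise step stands in for the simplex projection of [8]; everything here is illustrative rather than the paper's actual algorithm:

```python
import math, random

def log_n(x, mu):
    # log density of a unit-variance Gaussian
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)

def fit_means(labelled, unlabelled, q, K):
    # closed-form maximisation of the soft-labelled log likelihood
    mus = []
    for k in range(K):
        num = sum(x for x, y in labelled if y == k)
        den = sum(1.0 for _, y in labelled if y == k)
        for x, qu in zip(unlabelled, q):
            num += qu[k] * x
            den += qu[k]
        mus.append(num / den)
    return mus

def mcp_toy(labelled, unlabelled, K=2, iters=20, lr0=0.5, seed=0):
    rng = random.Random(seed)
    mus_sup = fit_means(labelled, [], [], K)      # purely supervised solution
    # Dirichlet(1,...,1)-style random initialisation of the soft labels
    q = []
    for _ in unlabelled:
        g = [rng.gammavariate(1.0, 1.0) for _ in range(K)]
        q.append([v / sum(g) for v in g])
    mus = list(mus_sup)
    for t in range(1, iters + 1):
        # (1) maximise the semi-supervised objective for the current q
        mus = fit_means(labelled, unlabelled, q, K)
        # (2) pessimistic step: descend the contrastive improvement w.r.t. q
        lr = lr0 / t                              # simple decaying step size
        for u, x in enumerate(unlabelled):
            grad = [log_n(x, mus[k]) - log_n(x, mus_sup[k]) for k in range(K)]
            raw = [max(q[u][k] - lr * grad[k], 0.0) for k in range(K)]
            s = sum(raw)
            q[u] = [v / s for v in raw] if s > 0 else [1.0 / K] * K
    return mus, mus_sup, q

mus, mus_sup, q = mcp_toy([(-2.0, 0), (2.0, 1)], [-2.2, -1.8, 2.1, 1.9])
```

The pessimistic step shifts soft-label mass toward the labellings under which the semi-supervised model gains the least over the supervised one, which is the essence of the contrastive objective.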
We analysed the performance of the safe semisupervised learning approach qualitatively on synthetic data using the generative objective, and quantitatively on various data sets using both objectives.
In addition to the synthetic two moons data set [15], we used various well-known data sets from the UCI repository [20] to evaluate the performance of the safe semi-supervised parameter learning approaches. We preprocessed the data in the following way: (1) we removed features with zero variance, and (2) we applied z-score normalisation. To ensure broad applicability of the approaches, we selected data sets which originate from a variety of domains and cover a wide range of sample sizes and dimensionalities. Details on the selected data sets are shown in Table 1, where the last column lists the number of labelled samples used in all experiments. Note that the number of labelled samples per data set is calculated as in [21].

To consistently learn SPN structures for all experiments, we extended the well-known learnSPN [11] algorithm to Gaussian distributed data, similarly to [29]. Additionally, we added a layer that conditions on the class labels, resulting in structures that are suitable for supervised and semi-supervised learning [10]. As learnSPN produces large SPN structures, which might lead to overfitting, we used a two-step procedure for regularising the resulting network. First, we estimate and apply a pruning depth for the network; second, we remove degenerate leaf distributions. We further ensured throughout all regularisation steps that the resulting SPN is complete and decomposable.
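The two preprocessing steps can be sketched with the Python standard library (the toy data are illustrative):

```python
import statistics

def preprocess(rows):
    # (1) drop zero-variance features, (2) z-score the remaining columns
    cols = [c for c in zip(*rows) if statistics.pvariance(c) > 0.0]
    z_cols = []
    for c in cols:
        mu, sd = statistics.mean(c), statistics.pstdev(c)
        z_cols.append([(x - mu) / sd for x in c])
    return [list(r) for r in zip(*z_cols)]

data = [[1.0, 5.0, 0.0],
        [2.0, 5.0, 2.0],
        [3.0, 5.0, 4.0]]
clean = preprocess(data)   # the constant second column is removed
```

After preprocessing, every remaining feature has zero mean and unit (population) standard deviation.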
Due to the non-linearity, flexibility and complexity of SPNs with arbitrary leaf distributions, learning a safe semi-supervised objective for such networks, without enforcing prior assumptions on the data distribution, is much more difficult than for linear models such as Linear Discriminant Analysis (LDA) [21]. Therefore, we analysed the behaviour of safe semi-supervised SPNs qualitatively on the synthetic two moons data set [15]. Figure 1(a) shows the purely supervised solution for a small subset of labelled observations and the solution found by a generative safe semi-supervised SPN over time. For reference, the oracle solution, which has access to the labels of all observations, is depicted in Figure 1(b).
Data Set    N    D   K   # labelled
BUPA        345  6   2   14
Fertility   100  9   2   20
Haberman    306  3   2   8
ILPD        583  10  2   22
Ionosphere  351  34  2   70
Iris        150  4   3   11
Parkinsons  197  23  2   48
WDBC        569  32  2   66
Wine        178  13  3   29
The purely supervised SPN clearly overfits the few labelled examples and degenerates almost completely to a kernel density estimator. The safe semi-supervised parameter learning approach is initialised using soft labels drawn from a Dirichlet distribution, to allow the model to escape from this local optimum. As shown in Figure 1(c), the generative safe semi-supervised approach is able to find a reasonable solution after only three iterations, even with a random initialisation of the soft labels. The model converges after only 20 iterations to a stable solution, without enforcing restrictive assumptions on the data distribution.

We constructed truncated network structures using learnSPN [11]. The truncation levels were estimated using the Akaike information criterion [1]. After the structure construction, we initialised all soft labels using random draws from a Dirichlet distribution with equal concentration parameters for all classes.
Furthermore, we lower-bounded the variance of the leaf distributions by the $p$-th percentile of the nearest-neighbour distances of all data points, where we selected the smallest percentile $p$ such that the constructed lower bound is above zero. Imposing a lower bound on the variances of the leaf node distributions in this way prevents the univariate Gaussian distributions from degenerating, with minimal influence on the model's expressiveness.
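Such a bound can be sketched in one dimension as follows (the percentile search and the data are illustrative; the paper applies the bound to the leaf distributions of a multivariate model):

```python
def variance_lower_bound(points):
    # nearest-neighbour distance of every point (1-D for illustration)
    nn = [min(abs(x - y) for j, y in enumerate(points) if j != i)
          for i, x in enumerate(points)]
    nn.sort()
    # smallest percentile whose value is strictly above zero
    for p in range(1, 101):
        idx = max(0, int(len(nn) * p / 100) - 1)
        if nn[idx] > 0.0:
            return nn[idx]
    return nn[-1]

pts = [0.0, 0.0, 1.0, 1.5, 3.0]    # duplicate points give zero NN distances
bound = variance_lower_bound(pts)  # first strictly positive percentile: 0.5
```

Duplicated observations yield zero nearest-neighbour distances, which is why the smallest strictly positive percentile is used rather than a fixed one.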
We analysed our approach for generative semi-supervised learning of SPNs by: (1) splitting each data set into a training (80%) and a test (20%) set, and (2) drawing labelled samples from each training set in a stratified way, as proposed in [21]. We used an additional labelled validation set for early stopping. In addition to the labelled samples, we used all remaining observations in the training set as unlabelled examples.
We compare the performance of the safe semi-supervised learning (SSL) approach against the purely supervised solution, an oracle solution and the solution found by the recently introduced inductive approach (MCPLDA) [21]. All models were evaluated on the test set. The resulting average log likelihood values are estimated over 100 independent runs. Table 2 lists the average log likelihoods and the standard errors of all approaches. Note that the guarantee of the CPLE holds on the training set, including the unlabelled observations. We expect, however, the performance of the SSL approach on the test set to be better than or similar to the purely supervised learner.
In most cases we could indeed find an improvement of the safe semi-supervised approach over the purely supervised solution. In the cases of Parkinsons, WDBC and Wine, the purely supervised learner already finds solutions which are close to the oracle solution. This might be due to the relatively simple geometric properties of those data sets. In this situation, our SSL approach converged to solutions which are close to the purely supervised solution. In some cases, e.g. BUPA, Fertility, Haberman and ILPD, we could even find an improvement upon the oracle solution, or near-oracle performance. Furthermore, safe semi-supervised SPNs outperform MCPLDA on almost all data sets in terms of the log likelihood, with the one exception being the Iris data set. Moreover, our approach generally reaches very stable results and achieves estimated standard errors lower than those of the supervised and the MCPLDA solutions.
[Table 2: average test log likelihoods (with standard errors) of the Supervised, SSL, Oracle and MCPLDA solutions on the BUPA, Fertility, Haberman, ILPD, Ionosphere, Iris, Parkinsons, WDBC and Wine data sets; the numeric values were not preserved in this extraction.]
Below, we assess the classification performance of discriminative safe semi-supervised learning, as optimising a discriminative objective is more natural for classification tasks.
Similar to the quantitative evaluation of the generative approach, we constructed truncated structures for all experiments. To avoid overfitting we used early truncation of the model, estimated according to the performance on the validation set. We further initialised all soft labels using optimistic predictions from the purely supervised model. To obtain training and test sets, we followed the same approach as described for the generative experiments. Similar to the generative evaluation, the randomly drawn labelled subset is obtained from the training set and the performance of each algorithm is estimated over 100 independent trials.
We compared the performance of our discriminative approach against the purely supervised solution, the oracle solution and the following state-of-the-art approaches: Transductive SVM (TSVM) [5], Minimum Entropy Regularization (MER) [13] and the recently published Implicitly Constrained Least Squares (ICLS) [18]. To assess the performance of a classification method, we computed the $F_1$ score for binary classification tasks. In the case of multi-class data sets, we used the macro-averaged $F_1$ score. To compute multi-class predictions for approaches designed only for binary classification, we used the one-vs-rest approach. The average $F_1$ scores as well as the standard errors of all approaches are shown in Table 3.
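The macro-averaged score computation can be sketched as follows (standard definition, shown for illustration): the binary score is computed one class versus the rest and then averaged over the classes.

```python
def f1_binary(y_true, y_pred, positive):
    # one-vs-rest F1 for a single class treated as the positive label
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if 2 * tp + fp + fn == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

def macro_f1(y_true, y_pred):
    # unweighted mean of the per-class one-vs-rest F1 scores
    classes = sorted(set(y_true))
    return sum(f1_binary(y_true, y_pred, c) for c in classes) / len(classes)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)   # mean of per-class F1: (0.5 + 0.8 + 2/3) / 3
```

Because every class contributes equally regardless of its frequency, the macro average is sensitive to poor performance on minority classes, which matters on the imbalanced data sets discussed below.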
The safe semi-supervised parameter learning approach achieves competitive results on almost all data sets. In general, our approach produces reasonable results and does not degenerate if certain assumptions are not met. Moreover, in several cases our discriminative approach achieves test scores which are comparable to those of the oracle solution, e.g. for Haberman and Wine. The lowest performance of our approach was found on the Fertility data set. Note that the scores on Fertility, Haberman and ILPD are generally very low, as these are imbalanced or skewed data sets.
In general, the proposed safe semi-supervised learning for SPNs is a powerful adversarial approach which scales linearly in the number of samples and is non-restrictive. Even though we achieved competitive results on data sets where low-density assumptions are met, e.g. Wine, further improvements may be achieved by trading off optimism and pessimism. One way of approaching this issue would be to add a weighting scheme to the CPLE formulation.
Even though optimising the conditional log likelihood inside the CPLE objective provides a reasonable criterion for classification tasks, this approach is not guaranteed to improve the classification performance of the learner. It is therefore possible that better classification performance can be achieved by using a multi-class squared-hinge loss, which was recently used in a related model [12].
In this paper, we introduced the first approach for semi-supervised parameter learning of Sum-Product Networks (SPNs). We presented generative and discriminative safe semi-supervised learning procedures which guarantee that adding unlabelled data can increase, but not degrade, the performance of the learner on the training set. Furthermore, our approach exploits the tractability of SPNs and scales linearly in the number of data points and model parameters. In contrast to other semi-supervised learners, the proposed approach is non-restrictive and does not need prior assumptions on the data distribution. Our approach is broadly applicable and constitutes a generic safe semi-supervised learning procedure for all models which leverage the sum-product theorem [9], and therefore provides a semi-supervised learning procedure beyond SPNs.
[Table 3: average $F_1$ scores (with standard errors) of the Supervised, SSL, Oracle, TSVM, ICLS and MER approaches on the BUPA, Fertility, Haberman, ILPD, Ionosphere, Iris, Parkinsons, PID, WDBC and Wine data sets; the numeric values were not preserved in this extraction.]
We investigated the performance of our approach quantitatively and qualitatively. In the qualitative analysis we found that the generative safe semi-supervised parameter learning approach is able to find a reasonable solution after only a few iterations and is able to escape from degenerate supervised solutions. We further compared the performance of safe semi-supervised parameter learning for SPNs against state-of-the-art approaches. The proposed safe semi-supervised learning for SPNs achieves competitive performance compared to state-of-the-art approaches and outperformed supervised SPNs in the majority of cases. Even though our approach is non-restrictive and does not need prior assumptions on the data distribution, safe semi-supervised SPNs can utilise low-density regions if the structure of the network reflects the geometric properties of the data distribution. However, as such assumptions are not enforced in the learning procedure, our safe semi-supervised learner is still capable of finding decision boundaries which cross high-density regions.
Future research directions include: interleaving network structure learning with semi-supervised parameter learning, extensions to other learning objectives, investigating possibilities for trading off optimism and pessimism in the objective, dealing with covariate shift, and analysing instability in safe semi-supervised SPNs in comparison with GANs. Furthermore, we plan to apply our safe semi-supervised learning approach to high-dimensional classification problems from medicine, genetics and other domains.
This research is partially funded by the Austrian Science Fund (FWF): P 27530 and P 27803-N15.