In several domains, obtaining class annotations is expensive while unlabelled data are abundant. While most semi-supervised approaches enforce restrictive assumptions on the data distribution, recent work has managed to learn semi-supervised models in a non-restrictive regime. However, so far such approaches have only been proposed for linear models. In this work, we introduce semi-supervised parameter learning for Sum-Product Networks (SPNs). SPNs are deep probabilistic models admitting inference in linear time in the number of network edges. Our approach has several advantages, as it (1) allows generative and discriminative semi-supervised learning, (2) guarantees that adding unlabelled data can increase, but not degrade, the performance (safe), and (3) is computationally efficient and does not enforce restrictive assumptions on the data distribution. We show on a variety of data sets that safe semi-supervised learning with SPNs is competitive compared to state-of-the-art methods and can lead to a better generative and discriminative objective value than a purely supervised approach.
In several domains, unlabelled observations are abundant and cheap to acquire, while obtaining class labels is expensive and sometimes infeasible for large amounts of data. In such cases, semi-supervised learning can be used to exploit large amounts of unlabelled data in addition to labelled data. Examples include text or image data [17, 27, 22], which are ubiquitous online, but also biological (genomics, proteomics, gene expression) data and speech.
One of the challenges facing most semi-supervised learning approaches is scalability: many methods scale quadratically or even cubically with data set size, or require restrictive assumptions such as low dimensionality or sparsity [32, 17]. In fact, if the data violate the assumptions enforced by a learner, the use of additional unlabelled data can even degrade classification performance.
Several approaches for semi-supervised learning have been proposed, including self-training, Transductive Support Vector Machines (TSVM), and graph-based methods. We refer to [32, 14] for comprehensive reviews of the state-of-the-art. As pointed out in prior work, self-training is error-prone (it can reinforce poor predictions), and TSVM as well as graph-based methods are difficult to scale. In addition, TSVM and its recent extensions require that the decision boundary lie in a low-density region, yielding sub-optimal accuracy if this condition is not met. Each of these methods can lead to decreased accuracy when adding unlabelled data. To overcome these limitations, a probabilistic formulation for safe semi-supervised learning of generative linear models was recently proposed.
In the family of deep probabilistic models, Sum-Product Networks (SPNs) have recently gained popularity, due to their efficiency (linear-time inference), their generality (they subsume existing approaches such as latent tree models and mixtures), and their performance on various tasks including computer vision [26, 10], action recognition, speech, and language modelling.
For probabilistic models, including SPNs, semi-supervised learning with generative models is natural. Data points are assigned to whichever class $c$ maximises $p(\mathbf{x} \mid c)$, with $p(\mathbf{x} \mid c)$ being a generative model for the data in class $c$. Subsequently, the labelled data points can be used to learn the model. Unfortunately, adding unlabelled data can significantly degrade classification accuracy instead of improving it.
In this paper, we introduce semi-supervised parameter learning for SPNs that is safe, scalable and non-restrictive. Safe means that adding unlabelled data can increase, but not degrade, model performance. The training time scales linearly with the number of added data points and, apart from the structure of the underlying SPN, no assumptions are made regarding the data distribution. Unlike other semi-supervised methods, the presented approach does not need low-density or clustering assumptions. In addition to safety, we show competitive results when compared with state-of-the-art approaches in Section 4.
The structure of the paper is as follows: Section 2
introduces the notation used throughout the paper, describes recent approaches for parameter learning in SPNs and introduces contrastive pessimistic likelihood estimation for safe semi-supervised learning of generative models. In Section 3 we propose safe semi-supervised learning for SPNs, give derivations for generative and discriminative parameter learning and present the algorithm MCP-SPN for training safe semi-supervised SPNs. Experiments are presented in Section 4, showing that safe semi-supervised SPNs are able to escape from degenerate supervised solutions, generally outperform purely supervised learning and achieve competitive performance on a variety of data sets. Section 5 concludes the paper and gives future prospects.
We use capital letters to denote random variables (RVs) and bold capital letters to denote sets of RVs. Moreover, we denote a realisation of an RV using lower-case letters and a realisation of a set of RVs using bold lower-case letters. We denote the set of labelled observations and the set of unlabelled observations separately, where $\mathbf{x}$ are the features and $\mathbf{y}$ the labels in one-hot encoding. Additionally, we use $q$ to denote soft labels for the unlabelled observations. For readability, we refer to the value of an SPN using calligraphic notation, $\mathcal{S}$, and write $\mathcal{S}_i$ for the value of the $i$th node in an SPN.
SPNs are a deep probabilistic architecture which captures expressive variable interactions while guaranteeing exact computation of marginals in linear time. SPNs have their foundation in network polynomials, introduced by Darwiche for efficient inference in Bayesian networks. Poon and Domingos generalised the idea and introduced SPNs over random variables (RVs) with finitely many states.
(Sum-Product Network) A sum-product network (SPN) over Boolean variables $X_1, \ldots, X_d$ is a rooted directed acyclic graph whose leaves are the indicators $x_1, \ldots, x_d$ and $\bar{x}_1, \ldots, \bar{x}_d$ and whose internal nodes are sums and products. Each edge $(i, j)$ emanating from a sum node $i$ has a non-negative weight $w_{ij}$. The value of a product node is the product of the values of its children. The value of a sum node $i$ is $\mathcal{S}_i = \sum_{j \in \mathrm{ch}(i)} w_{ij} \mathcal{S}_j$, where $\mathrm{ch}(i)$ are the children of $i$ and $\mathcal{S}_j$ is the value of node $j$. The value $\mathcal{S}$ of an SPN is the value of its root.
SPNs can be generalised by replacing the leaf node indicators with arbitrary input distributions. Thus, we consider SPNs with arbitrary leaf node distributions throughout the paper.
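To make the recursive definition concrete, the following minimal sketch (our own illustrative code, not from the paper) evaluates a tiny SPN with univariate Gaussian leaves bottom-up; the class names `Leaf`, `Product` and `Sum` are hypothetical. Each node is visited once, so evaluation is linear in the number of edges:

```python
import math

class Leaf:
    """Univariate Gaussian leaf over one variable index."""
    def __init__(self, var, mean, std):
        self.var, self.mean, self.std = var, mean, std
    def value(self, x):
        z = (x[self.var] - self.mean) / self.std
        return math.exp(-0.5 * z * z) / (self.std * math.sqrt(2 * math.pi))

class Product:
    def __init__(self, children):
        self.children = children
    def value(self, x):
        v = 1.0
        for c in self.children:
            v *= c.value(x)  # product of child values
        return v

class Sum:
    def __init__(self, children, weights):
        assert abs(sum(weights) - 1.0) < 1e-9  # normalised weights
        self.children, self.weights = children, weights
    def value(self, x):
        return sum(w * c.value(x) for w, c in zip(self.weights, self.children))

# A tiny complete and decomposable SPN over two variables:
# root = 0.3 * N(x0;0,1)N(x1;0,1) + 0.7 * N(x0;2,1)N(x1;2,1)
spn = Sum(
    [Product([Leaf(0, 0.0, 1.0), Leaf(1, 0.0, 1.0)]),
     Product([Leaf(0, 2.0, 1.0), Leaf(1, 2.0, 1.0)])],
    [0.3, 0.7])

density = spn.value([0.0, 0.0])
```

This SPN is simply a two-component Gaussian mixture; deeper SPNs reuse sub-networks, which is where the efficiency gains over flat mixtures come from.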
The parameters of an SPN can be learned efficiently using Expectation Maximisation (EM) [26, 23]. We use a formulation in which the update for the weights of the $i$th sum node is defined as:
$$w_{ij} \propto \sum_n w_{ij} \, \frac{\partial \mathcal{S}(\mathbf{x}^{(n)})}{\partial \mathcal{S}_i} \, \frac{\mathcal{S}_j(\mathbf{x}^{(n)})}{\mathcal{S}(\mathbf{x}^{(n)})},$$
normalised such that the weights of node $i$ sum to one.
Furthermore, the parameter update for an exponential family leaf node $k$ with scope $X_k$ and parameter $\theta_k$ is given by the expected sufficient statistics and can be computed as:
$$\theta_k \leftarrow \frac{\sum_n \gamma_k(\mathbf{x}^{(n)}) \, T(x_k^{(n)})}{\sum_n \gamma_k(\mathbf{x}^{(n)})}, \qquad \gamma_k(\mathbf{x}) = \frac{\partial \mathcal{S}(\mathbf{x})}{\partial \mathcal{S}_k} \frac{\mathcal{S}_k(\mathbf{x})}{\mathcal{S}(\mathbf{x})},$$
where $T(\cdot)$ denotes the sufficient statistics. We assume complete evidence for the RVs and refer to the literature for a derivation of the updates with partial evidence.
The parameters of a discriminative SPN can be learned by optimising the conditional log likelihood using back-propagation. The set of variables of a discriminative SPN is divided into query variables $Y$, hidden variables $H$ and observed RVs $X$. Therefore, the value of a discriminative SPN is denoted as $\mathcal{S}(Y, H \mid X)$. Furthermore, the conditional probability is estimated by setting all indicator functions of the hidden variables to one and computing
$$P(Y \mid X) = \frac{\mathcal{S}(Y, \mathbf{1} \mid X)}{\mathcal{S}(\mathbf{1}, \mathbf{1} \mid X)},$$
where setting the indicators of the hidden variables to one allows the gradients of the conditional log likelihood to be computed in a single upward pass. For the sake of readability, we omit the hidden variables if their indicators are set to one and write $\mathcal{S}(Y \mid X)$ for the value of a discriminative SPN instead.
Given a network structure, one can train a discriminative SPN by gradient ascent using the partial derivatives of the SPN with respect to the parameters of the network. The partial derivatives with respect to the weights take the form
$$\frac{\partial \log \mathcal{S}(y \mid \mathbf{x})}{\partial w_{ij}} = \frac{1}{\mathcal{S}(y \mid \mathbf{x})} \frac{\partial \mathcal{S}(y \mid \mathbf{x})}{\partial \mathcal{S}_i} \mathcal{S}_j,$$
where $\partial \mathcal{S} / \partial \mathcal{S}_i$ is computed using back-propagation. By setting the gradient of the root node to $\partial \mathcal{S} / \partial \mathcal{S}_{\mathrm{root}} = 1$, the gradients of the subsequent nodes are computed in a top-down order. At sum nodes the gradient is propagated to the children using $\partial \mathcal{S} / \partial \mathcal{S}_j \mathrel{+}= w_{ij} \, \partial \mathcal{S} / \partial \mathcal{S}_i$ and at product nodes using $\partial \mathcal{S} / \partial \mathcal{S}_j \mathrel{+}= (\partial \mathcal{S} / \partial \mathcal{S}_i) \prod_{k \neq j} \mathcal{S}_k$. As indicated, the gradient at a node is accumulated based on its parents' gradients. We refer to the literature for further details on the derivation of the gradients and of hard gradient updates.
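The top-down rules can be sketched as follows (our own illustrative code; the dict-based node representation and function names are assumptions, not the paper's implementation). Nodes are stored in a bottom-up topological order, so one forward sweep computes values and one reverse sweep accumulates dS/dS_i:

```python
import math

# Nodes: {'kind': 'leaf'|'sum'|'prod', ...}, listed children-before-parents.
def forward(nodes, x):
    val = {}
    for i, n in enumerate(nodes):
        if n['kind'] == 'leaf':  # unit-variance Gaussian leaf (illustrative)
            z = x[n['var']] - n['mean']
            val[i] = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        elif n['kind'] == 'prod':
            v = 1.0
            for c in n['ch']:
                v *= val[c]
            val[i] = v
        else:  # sum node
            val[i] = sum(w * val[c] for w, c in zip(n['w'], n['ch']))
    return val

def backward(nodes, val):
    grad = {i: 0.0 for i in range(len(nodes))}
    grad[len(nodes) - 1] = 1.0            # dS/dS_root = 1
    for i in reversed(range(len(nodes))):
        n = nodes[i]
        if n['kind'] == 'sum':            # dS/dS_j += w_ij * dS/dS_i
            for w, c in zip(n['w'], n['ch']):
                grad[c] += w * grad[i]
        elif n['kind'] == 'prod':         # dS/dS_j += dS/dS_i * prod_{k!=j} S_k
            for c in n['ch']:
                grad[c] += grad[i] * (val[i] / val[c] if val[c] != 0 else 0.0)
    return grad

nodes = [
    {'kind': 'leaf', 'var': 0, 'mean': 0.0},         # 0
    {'kind': 'leaf', 'var': 0, 'mean': 2.0},         # 1
    {'kind': 'sum', 'ch': [0, 1], 'w': [0.4, 0.6]},  # 2 (root)
]
val = forward(nodes, [0.0])
grad = backward(nodes, val)
dS_dw0 = grad[2] * val[0]   # weight gradient: dS/dw_ij = dS/dS_i * S_j
```

Both sweeps touch each edge once, so gradient computation is also linear in the number of edges.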
As in network polynomials for Bayesian networks, partial derivatives of any parameter in an SPN can be calculated using the chain rule, leading to a straightforward computation of parameter updates for the leaf node distributions, i.e.
$$\frac{\partial \mathcal{S}}{\partial \theta_k} = \frac{\partial \mathcal{S}}{\partial \mathcal{S}_k} \frac{\partial \mathcal{S}_k}{\partial \theta_k}.$$
Most semi-supervised learning approaches require strong assumptions, e.g. the low-density assumption for TSVM, and can lead to decreased performance with an increasing number of unlabelled data samples if these assumptions are violated. Loog has proposed Contrastive Pessimistic Likelihood Estimation (CPLE) in order to provide performance guarantees while relying only on the assumptions of an underlying generative model.
CPLE maintains soft labels (hypotheses) for each unlabelled data point and assigns them pessimistically, using a training objective that maximises the log likelihood on the data but minimises the improvement provided by the unlabelled data. Therefore, CPLE yields a safe semi-supervised objective.
Model parameters under CPLE are estimated according to:
$$\theta^{\ast} = \arg\max_\theta \, \min_{q} \; \mathcal{L}(\theta \mid \mathcal{D}_l, \mathcal{D}_u, q) - \mathcal{L}(\theta^{+} \mid \mathcal{D}_l, \mathcal{D}_u, q),$$
where $q$ denotes the soft labels for the unlabelled data points and $\theta^{+}$ denotes the parameters of a purely supervised model derived only on the labelled data. The introduction of soft labels respects the fact that classes may overlap. In the case of unique class labels, each soft label vector is an element of the probability simplex.
Since the trained classifier assumes the worst-case improvement, its performance cannot degrade when adding unlabelled data. Loog constrains CPLE to generative models and provides a concrete solution for a simple linear classifier based on linear discriminant analysis. In the following, we define a contrastive pessimistic objective for generative and discriminative SPNs, yielding a safe semi-supervised learning procedure with linear computational complexity which relies only on the assumptions intrinsic to the given network structure.
Given an SPN, we can find the optimal parameters for generative safe semi-supervised learning using the CPLE objective defined in Equation (9). For clarity, we always use the plus operator to indicate parameters of the purely supervised solution, e.g. weights $w^{+}$, and indicate parameters of the safe semi-supervised solution using an asterisk, e.g. $w^{\ast}$. Due to the conservative choice of the soft labels, which minimise the improvement over the supervised result, and since we can always take $\theta^{\ast} = \theta^{+}$ in the worst case, this objective is guaranteed to lead to a safe solution. More formally, as shown by Loog, it is guaranteed that
$$\mathcal{L}(\theta^{\ast} \mid \mathcal{D}_l, \mathcal{D}_u, q) \geq \mathcal{L}(\theta^{+} \mid \mathcal{D}_l, \mathcal{D}_u, q).$$
Therefore, if log likelihoods are used in the CPLE objective, the safe semi-supervised solution has at least the same log likelihood given the labelled and unlabelled data as the purely supervised solution.
In the following we derive the Expectation Maximisation (EM) updates for the generative safe semi-supervised SPN. Therefore, let
$$\mathcal{L}(\theta \mid \mathcal{D}_l) = \sum_{n \in \mathcal{D}_l} \sum_{c} \mathbb{1}[y_n = c] \log \mathcal{S}(\mathbf{x}_n, c)$$
be the log likelihood of a semi-supervised SPN for labelled observations, where $\mathbb{1}[y_n = c]$ is the indicator for class $c$, which is one if $y_n = c$ is true and zero otherwise. Furthermore, let
$$\mathcal{L}(\theta, q \mid \mathcal{D}_u) = \sum_{m \in \mathcal{D}_u} \sum_{c} q_{mc} \log \mathcal{S}(\mathbf{x}_m, c)$$
be the log likelihood of a semi-supervised SPN for unlabelled observations, with $q$ being the soft labels of the data. Note that $\sum_c q_{mc} = 1$ for all unlabelled observations, as each soft label vector is an element of the simplex. We can therefore define the generative log likelihood function of a semi-supervised SPN as the sum of the log likelihood given the labelled data and the unlabelled data. Formally, we define the generative log likelihood of a semi-supervised SPN as
$$\mathcal{L}(\theta, q \mid \mathcal{D}_l, \mathcal{D}_u) = \mathcal{L}(\theta \mid \mathcal{D}_l) + \mathcal{L}(\theta, q \mid \mathcal{D}_u),$$
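Assuming a function that returns the joint log value log S(x, c) for a given model, the combined objective can be sketched as follows (our own illustrative code; the toy two-Gaussian model stands in for an SPN):

```python
import math

def semisup_loglik(loglik, labelled, unlabelled, q):
    """Generative semi-supervised log likelihood: the sum over labelled pairs
    of log S(x, y), plus the soft-label-weighted sum over classes of
    log S(x, c) for each unlabelled point."""
    ll = sum(loglik(x, y) for x, y in labelled)
    for x, qs in zip(unlabelled, q):
        assert abs(sum(qs) - 1.0) < 1e-9   # soft labels lie on the simplex
        ll += sum(qc * loglik(x, c) for c, qc in enumerate(qs))
    return ll

# Toy stand-in model: two unit-variance Gaussian class-conditionals, equal priors.
means = [0.0, 2.0]
def loglik(x, c):
    return -0.5 * (x - means[c]) ** 2 - 0.5 * math.log(2 * math.pi) + math.log(0.5)

ll = semisup_loglik(loglik, [(0.1, 0), (1.9, 1)], [1.0], [[0.5, 0.5]])
```

The labelled term reduces to the ordinary log likelihood when all soft labels are one-hot, which is why the two terms can share a single implementation.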
which allows for a straightforward derivation of the EM updates. The updates of the weights of sum node $i$ can be computed as in Eq. (2) using the following expected counts, i.e.
$$n_{ij} = \sum_{n \in \mathcal{D}_l} w_{ij} \frac{\partial \mathcal{S}(\mathbf{x}_n, y_n)}{\partial \mathcal{S}_i} \frac{\mathcal{S}_j(\mathbf{x}_n, y_n)}{\mathcal{S}(\mathbf{x}_n, y_n)} + \sum_{m \in \mathcal{D}_u} \sum_c q_{mc} \, w_{ij} \frac{\partial \mathcal{S}(\mathbf{x}_m, c)}{\partial \mathcal{S}_i} \frac{\mathcal{S}_j(\mathbf{x}_m, c)}{\mathcal{S}(\mathbf{x}_m, c)},$$
where we omitted the parametrisation of the network for better readability. Furthermore, we can update the parameters of an exponential family leaf node with scope $X_k$ using the expected sufficient statistics, accumulated over the labelled data and, weighted by the soft labels, over the unlabelled data, where we assume complete evidence for the RVs.
Subsequently, the soft label $q_{mc}$ for class $c$ of an unlabelled sample $\mathbf{x}_m$ is updated pessimistically with gradient descent using the partial derivative of the contrastive objective, which is defined as
$$\frac{\partial}{\partial q_{mc}} = \log \mathcal{S}(\mathbf{x}_m, c \mid \theta^{\ast}) - \log \mathcal{S}(\mathbf{x}_m, c \mid \theta^{+}).$$
Note that after each gradient update it is necessary to ensure that the soft labels for the unlabelled data points lie on the simplex. For this purpose, the soft labels are projected back to the simplex using the approach by Duchi et al.
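The sort-based Euclidean projection onto the probability simplex can be sketched as follows (our own implementation of the published algorithm; function name is ours):

```python
def project_on_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1}.
    Sort-based O(K log K) variant: find the threshold theta such that
    w_i = max(v_i - theta, 0) sums to one."""
    u = sorted(v, reverse=True)
    cssv, theta = 0.0, 0.0
    for k, uk in enumerate(u, start=1):
        cssv += uk                      # cumulative sum of the k largest entries
        t = (cssv - 1.0) / k
        if uk - t > 0:                  # condition holds exactly for k <= rho
            theta = t
    return [max(vi - theta, 0.0) for vi in v]

w = project_on_simplex([0.9, 0.6])      # a soft label pushed off the simplex
```

After every pessimistic gradient step, each soft-label vector is passed through this projection so that it remains a valid class distribution.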
Conditional likelihoods, instead of generative objectives, are a more natural way of learning SPNs for classification tasks in the semi-supervised regime. Formally, the model parameters for discriminative safe semi-supervised SPNs are estimated according to
$$\theta^{\ast} = \arg\max_\theta \, \min_{q} \; \mathcal{CL}(\theta \mid \mathcal{D}_l, \mathcal{D}_u, q) - \mathcal{CL}(\theta^{+} \mid \mathcal{D}_l, \mathcal{D}_u, q),$$
where we intentionally write $\mathcal{CL}$ to indicate the use of the conditional log likelihood. Extending the formulation for discriminative SPNs allows us to define a discriminative learning approach for safe semi-supervised SPNs, i.e.,
where the conditional likelihoods for labelled and unlabelled data, respectively, are given as
$$\mathcal{CL}(\theta \mid \mathcal{D}_l) = \sum_{n \in \mathcal{D}_l} \log \mathcal{S}(y_n \mid \mathbf{x}_n), \qquad \mathcal{CL}(\theta, q \mid \mathcal{D}_u) = \sum_{m \in \mathcal{D}_u} \sum_c q_{mc} \log \mathcal{S}(c \mid \mathbf{x}_m).$$
The partial derivatives with respect to the weights of the discriminative semi-supervised SPN therefore become
$$\frac{\partial \mathcal{CL}}{\partial w_{ij}} = \sum_{n \in \mathcal{D}_l} \frac{\partial \log \mathcal{S}(y_n \mid \mathbf{x}_n)}{\partial w_{ij}} + \sum_{m \in \mathcal{D}_u} \sum_c q_{mc} \frac{\partial \log \mathcal{S}(c \mid \mathbf{x}_m)}{\partial w_{ij}}.$$
Similarly, we can derive the partial derivatives with respect to the leaf node parameters by applying the chain rule, leading to the following parameter updates
$$\frac{\partial \mathcal{CL}}{\partial \theta_k} = \sum_{n \in \mathcal{D}_l} \frac{\partial \log \mathcal{S}(y_n \mid \mathbf{x}_n)}{\partial \mathcal{S}_k} \frac{\partial \mathcal{S}_k}{\partial \theta_k} + \sum_{m \in \mathcal{D}_u} \sum_c q_{mc} \frac{\partial \log \mathcal{S}(c \mid \mathbf{x}_m)}{\partial \mathcal{S}_k} \frac{\partial \mathcal{S}_k}{\partial \theta_k}.$$
To pessimistically update the soft labels, one can use gradient descent on the partial derivatives, similarly to the generative objective in Eq. (19).
The algorithm Maximum Contrastive Pessimistic SPN (MCP-SPN) for learning safe semi-supervised SPNs is illustrated in Algorithm 1 and consists of the following adversarial steps: (1) optimising the safe semi-supervised solution on the given soft labels by maximising a generative or discriminative objective; (2) minimising the improvement of the semi-supervised solution over the purely supervised solution by adjusting the soft labels pessimistically. As an SPN is a multi-linear function of the model parameters, we can apply the generalisation of the min-max theorem for multi-linear functions and interchange the maximisation and the minimisation in our algorithm.
Depending on the choice of the objective, the MCP-SPN procedure first finds a purely supervised solution by only maximising the chosen objective with respect to the labelled data. Secondly, we initialise all soft labels of the unlabelled data either using an optimistic approach or using random draws from a Dirichlet distribution. In the case of a generative objective, the purely supervised solution can degenerate to a point mass estimator. It is therefore useful for generative SPNs to initialise the soft labels using random draws instead of starting from an optimistic labelling. After initialising all soft labels, the MCP-SPN procedure finds a safe semi-supervised solution by alternating between the two adversarial steps. The function call projectOnSimplex refers to the simplex projection of Duchi et al., which we use to project the soft label assignments back to the simplex (other approaches for this task could also be used). Note that we found it useful to decrease the learning rate of the pessimistic soft-label adjustment over time. In our experiments we therefore used a simple decay function; if necessary, more advanced schedules can be used instead. The source code for safe semi-supervised learning of SPNs is available online at https://github.com/trappmartin/SSLSPN_UAI2017.
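The alternating min-max loop can be illustrated on a toy problem (our own sketch, not the paper's algorithm: two unit-variance Gaussian class-conditionals stand in for an SPN, a crude clip-and-renormalise step replaces the exact simplex projection, and all names are ours):

```python
import math, random

def fit_means(xl, yl, xu, q):
    """Maximisation step: soft-count mean updates for two Gaussian classes."""
    mus = []
    for c in (0, 1):
        num = (sum(x for x, y in zip(xl, yl) if y == c)
               + sum(qi[c] * x for qi, x in zip(q, xu)))
        den = sum(1 for y in yl if y == c) + sum(qi[c] for qi in q)
        mus.append(num / den)
    return mus

def ll(x, mu):  # log N(x; mu, 1) up to the constant term
    return -0.5 * (x - mu) ** 2

xl, yl = [-2.1, -1.9, 1.8, 2.2], [0, 0, 1, 1]   # labelled data
xu = [-2.0, -1.8, 2.1, 1.9, 0.1]                # unlabelled data
sup = fit_means(xl, yl, [], [])                 # purely supervised solution

random.seed(0)
q = [[r, 1 - r] for r in (random.random() for _ in xu)]  # random soft labels
for it in range(50):
    semi = fit_means(xl, yl, xu, q)             # (1) maximise on current q
    lr = 0.5 / (1 + it)                         # simple decay schedule
    for i, x in enumerate(xu):                  # (2) pessimistic q update
        for c in (0, 1):
            g = ll(x, semi[c]) - ll(x, sup[c])  # d(contrastive ll)/dq_ic
            q[i][c] -= lr * g
        s = [max(qc, 0.0) for qc in q[i]]       # crude two-class projection
        tot = sum(s)
        q[i] = [qc / tot for qc in s] if tot > 0 else [0.5, 0.5]
```

In the actual MCP-SPN procedure, step (1) is the generative EM or discriminative gradient update of the SPN and step (2) uses the exact Duchi projection.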
We analysed the performance of the safe semi-supervised learning approach qualitatively on synthetic data using the generative objective, and quantitatively on various data sets using both objectives.
We compared against state-of-the-art semi-supervised parameter learning approaches on a collection of benchmark data sets. We pre-processed the data in the following way: (1) we removed features with zero variance; (2) we applied z-score normalisation. To ensure broad applicability of the approaches, we selected data sets which originate from a variety of domains and cover a wide range of sample sizes and dimensions. Details on the selected data sets are shown in Table 1, where the last column lists the number of labelled samples used in all experiments. Note that the number of labelled samples per data set is calculated as in previous work.
To consistently learn SPN structures for all experiments, we extended the well-known learnSPN algorithm to Gaussian-distributed data, similarly to previous work. Additionally, we added a layer that conditions on the class labels, resulting in structures that are suitable for supervised and semi-supervised learning. As learnSPN produces large SPN structures, which might lead to over-fitting, we used a two-step procedure for regularising the resulting network. First, we estimate and apply a pruning depth for the network; second, we remove degenerate leaf distributions. We further ensured throughout all regularisation steps that the resulting SPN is complete and decomposable.
Due to the non-linearity, flexibility and complexity of SPNs with arbitrary leaf distributions, learning a safe semi-supervised objective for such networks, without enforcing prior assumptions on the data distribution, is much more difficult than for linear models such as Linear Discriminant Analysis (LDA). Therefore, we analysed the behaviour of safe semi-supervised SPNs qualitatively on the synthetic two-moons data set. Figure 1(a) shows the purely supervised solution for a small subset of labelled observations and the solution found by a generative safe semi-supervised SPN over time. For reference, the oracle solution, which knows the labels of all observations, is depicted in Figure 1(b).
The purely supervised SPN clearly over-fits the few labelled examples and has degenerated almost completely to a kernel density estimator. The safe semi-supervised parameter learning approach is initialised using soft labels drawn from a Dirichlet distribution, to allow the model to escape from the local optimum. As shown in Figure 1(c), the generative safe semi-supervised approach is able to find a reasonable solution after only three iterations, even with a random initialisation of the soft labels. The model converges after only 20 iterations to a stable solution, without enforcing restrictive assumptions on the data distribution.
We constructed truncated network structures using learnSPN. The truncation levels were estimated using the Akaike information criterion. After the structure construction, we initialised all soft labels using random draws from a Dirichlet distribution with equal concentration parameters for all classes.
Furthermore, we lower-bounded the variance of the leaf distributions using a percentile of the nearest-neighbour distances of all data points in the training set, selecting the smallest percentile such that the constructed lower bound is above zero. Imposing a lower bound on the variances of the leaf node distributions in this way prevents the univariate Gaussian distributions from degenerating, with minimal influence on the model's expressiveness.
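The bound can be sketched as follows (our own one-dimensional illustration; the exact percentile grid and the mapping from nearest-neighbour distance to a variance bound via squaring are our assumptions):

```python
def nn_distances(xs):
    """Distance from each point to its nearest other point (1-D, O(n^2))."""
    return [min(abs(a - b) for j, b in enumerate(xs) if j != i)
            for i, a in enumerate(xs)]

def variance_lower_bound(xs, percentiles=range(1, 101)):
    """Smallest percentile of NN distances that gives a strictly positive
    bound; duplicated points yield zero distances, which are skipped."""
    d = sorted(nn_distances(xs))
    for p in percentiles:
        idx = max(0, int(len(d) * p / 100) - 1)
        bound = d[idx] ** 2          # distance -> variance scale (assumption)
        if bound > 0:
            return bound
    return 0.0

xs = [0.0, 0.0, 0.1, 1.0, 2.5]       # note the duplicated point
bound = variance_lower_bound(xs)
sigma2 = max(bound, 1e-8)            # clip a degenerate leaf variance
```

Because duplicated training points produce zero nearest-neighbour distances, scanning percentiles upwards until the bound is positive mirrors the selection rule described above.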
We analysed our approach for generative semi-supervised learning of SPNs by: (1) splitting each data set into a training set (80%) and a test set (20%); (2) drawing labelled samples stratified from each training set as proposed in prior work. We used an additional labelled validation set for early stopping. In addition to the labelled samples, we used all remaining observations in the training set as unlabelled examples.
We compare the performance of the safe semi-supervised learning (SSL) approach against the purely supervised solution, an oracle solution and the solution found by the recently introduced inductive approach MCPLDA. All models were evaluated on the test set. The resulting average log likelihood values are estimated over 100 independent runs. Table 2
lists the average log likelihoods and the standard errors of all approaches. Note that the guarantee of the CPLE holds on the training set, including the unlabelled observations. We expect, however, the performance of the SSL approach on the test set to be better than or similar to that of the purely supervised learner.
In most cases we could indeed find an improvement of the safe semi-supervised approach over the purely supervised solution. In the cases of Parkinsons, WDBC and Wine, the purely supervised learner already finds solutions which are close to the oracle solution. This might be due to the relatively simple geometric properties of those data sets. In this situation, our SSL approach converged to solutions which are close to the purely supervised solution. In some cases, e.g. BUPA, Fertility, Haberman and ILPD, we found an improvement upon the oracle solution or near-oracle performance. Furthermore, safe semi-supervised SPNs outperform MCPLDA on almost all data sets in terms of the log likelihood, one exception being the Iris data set. Moreover, our approach generally reaches very stable results and achieves estimated standard errors lower than those of the supervised and MCPLDA solutions.
We assess the classification performance of discriminative safe semi-supervised learning below, as optimising a discriminative objective is a more natural choice for classification tasks.
Similar to the quantitative evaluation of the generative approach, we constructed truncated structures for all experiments. To avoid over-fitting, we used early truncation of the model, estimated according to the performance on the validation set. We further initialised all soft labels using optimistic predictions from the purely supervised model. To obtain training and test sets, we followed the same approach as described for the generative experiments. As in the generative evaluation, the randomly drawn labelled subset is obtained from the training set and the performance of each algorithm is estimated over 100 independent trials.
We compared the performance of our discriminative approach against the purely supervised solution, the oracle solution and the following state-of-the-art approaches: Transductive SVM (TSVM), Minimum Entropy Regularization (MER) and the recently published Implicitly Constrained Least Squares (ICLS). To assess the performance of a classification method, we computed the F1 score for binary classification tasks. For multi-class data sets, we used the macro-averaged F1 score. To compute multi-class predictions for approaches designed only for binary classification, we used the one-vs-rest approach. The average F1 scores as well as the standard errors of all approaches are shown in Table 3.
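For reference, the macro-averaged F1 score over per-class one-vs-rest scores can be computed as follows (standard definition, our own implementation):

```python
def f1_binary(y_true, y_pred, cls):
    """One-vs-rest F1 for a single class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores; every class counts equally,
    which is why skewed data sets tend to yield low macro F1."""
    classes = sorted(set(y_true))
    return sum(f1_binary(y_true, y_pred, c) for c in classes) / len(classes)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
```

Because the macro average weights all classes equally regardless of their frequency, it penalises poor minority-class performance, which matters for the imbalanced data sets discussed below.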
The safe semi-supervised parameter learning approach achieves competitive results on almost all data sets. In general, our approach produces reasonable results and does not degenerate if certain assumptions are not met. Moreover, in several cases our discriminative approach achieves test scores comparable to those of the oracle solution, e.g. for Haberman and Wine. Our approach performed worst on the Fertility data set. Note that the F1 scores on Fertility, Haberman and ILPD are generally very low, as these are imbalanced or skewed data sets.
In general, the proposed safe semi-supervised learning for SPNs is a powerful adversarial approach which scales linearly in the number of samples and is non-restrictive. Even though we achieved competitive results on data sets where low-density assumptions are met, e.g. Wine, further improvements may be achieved by trading off optimism and pessimism. One way of approaching this would be to add a weighting scheme to the CPLE formulation.
Even though optimising the conditional log likelihood inside the CPLE objective provides a reasonable criterion for classification tasks, this approach is not guaranteed to improve the classification performance of the learner. It is therefore possible that better classification performance can be achieved by using a multi-class squared-hinge loss, as recently used in a related model.
In this paper, we introduced the first approach for semi-supervised parameter learning with Sum-Product Networks (SPNs). We presented generative and discriminative safe semi-supervised learning procedures which guarantee that adding unlabelled data can increase, but not degrade, the performance of the learner on the training set. Furthermore, our approach exploits the tractability of SPNs and scales linearly in the number of data points and model parameters. In contrast to other semi-supervised learners, the proposed approach is non-restrictive and does not need prior assumptions on the data distribution. The approach allows broad applicability and is a generic safe semi-supervised learning procedure for all models which leverage the sum-product theorem, and therefore provides a semi-supervised learning procedure beyond SPNs.
We investigated the performance of our approach quantitatively and qualitatively. In the qualitative analysis we found that the generative safe semi-supervised parameter learning approach is able to find a reasonable solution after only a few iterations and is able to escape from degenerate supervised solutions. We further compared the performance of safe semi-supervised parameter learning for SPNs against state-of-the-art approaches. The proposed safe semi-supervised learning for SPNs achieves competitive performance compared to state-of-the-art approaches, and outperformed supervised SPNs in the majority of cases. Even though our approach is non-restrictive and does not need prior assumptions on the data distribution, safe semi-supervised SPNs can utilise low-density regions if the structure of the network reflects geometric properties of the data distribution. However, as such assumptions are not enforced in the learning procedure, our safe semi-supervised learner is still capable of finding decision boundaries which cross high-density regions.
Future research directions include: interleaving network structure learning with semi-supervised parameter learning, extensions to other learning objectives, investigating possibilities for trading off optimism and pessimism in the objective, dealing with covariate shift, and analysing instability in safe semi-supervised SPNs and its relation to GANs. Furthermore, we plan to apply our safe semi-supervised learning approach to high-dimensional classification problems from medicine, genetics and other domains.
This research is partially funded by the Austrian Science Fund (FWF): P 27530 and P 27803-N15.
Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), pages 1314–1321, 2012.
Adaptive computation and machine learning. MIT Press, 2006.
Deep neural network features and semi-supervised training for low resource speech recognition. In Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6704–6708, 2013.