Deep neural networks (DNNs) have achieved great success in various domains, mainly through supervised learning, which requires large collections of labeled data. However, in many domains collecting labeled data is difficult and expensive, while unlabeled data are often easily available. This has motivated the community to develop semi-supervised learning (SSL), which aims to leverage both labeled and unlabeled data for training. A plethora of SSL methods designed for deep neural networks have emerged [1, 2, 3, 4, 5, 6], spanning domains such as image classification and natural language labeling.
The key to designing SSL methods is how to effectively exploit the information contained in the unlabeled data, which can provide regularization for finding good classifiers. A Bayesian view of regularization is priors, which reflect our prior knowledge regarding the model. Recent SSL methods with DNNs can be distinguished by the priors they adopt and, roughly speaking, can be divided into two classes (we mainly discuss SSL methods using DNNs; for a general discussion of SSL, see ): those based on generative models and those based on discriminative models. In this paper, we refer to these two classes as generative SSL and discriminative SSL respectively. A popular prior used by discriminative SSL is that the outputs of the discriminative classifier are smooth with respect to local and random perturbations of the inputs. Examples include virtual adversarial training (VAT) and a number of recently developed consistency-regularization based methods [2, 3, 4] and contrastive learning based methods. However, these SSL methods heavily rely on domain-specific data augmentations, which are tuned intensively for images, leading to impressive performance in some image domains but less success in domains where such augmentations are less effective (e.g., medical images and text). For instance, random input perturbations are difficult to apply to discrete data like text.
Generative SSL methods involve unsupervised learning over unlabeled data based on generative models. As argued in , for observation $x$ and label $y$, learning $p(x)$ can be thought of as providing a kind of generic prior for learning $p(y|x)$. Representations that are useful for $p(x)$ tend to be useful when learning $p(y|x)$, allowing the sharing of statistical strength between unsupervised and supervised learning. In this sense, generative SSL methods are more appealing from the perspective of being domain-agnostic, since they do not inherently require data augmentations and generally can be applied to a wider range of domains.
Remarkably, there exist two different manners for the generative SSL approach: joint-training and pre-training. In the first manner, often referred to as joint-training, a joint model $p(x, y)$ is defined. When we have the label $y$, we maximize $\log p(x, y)$ (the supervised objective), and when the label is unobserved, we marginalize it out and maximize $\log p(x)$ (the unsupervised objective). Semi-supervised learning over a mix of labeled and unlabeled data is thus formulated as maximizing the (weighted) sum of $\log p(x, y)$ and $\log p(x)$. In the pre-training manner of SSL, we perform unsupervised representation learning on unlabeled data, followed by supervised training (called fine-tuning) on labeled data. This manner of pre-training followed by fine-tuning has seen increasing application in natural language processing.
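The joint-training objective above can be sketched in a few lines (a minimal illustration with hypothetical per-example log-likelihood values standing in for actual model outputs):

```python
import numpy as np

def joint_training_objective(logp_xy_labeled, logp_x_unlabeled, weight=1.0):
    """Joint-training SSL: maximize the (weighted) sum of the supervised
    term log p(x, y) on labeled data and the unsupervised term log p(x)
    on unlabeled data; returned here as a loss to minimize."""
    return -(np.mean(logp_xy_labeled) + weight * np.mean(logp_x_unlabeled))

# Hypothetical per-example log-likelihoods from some model
labeled = np.array([-2.3, -1.9])           # log p(x, y) on labeled examples
unlabeled = np.array([-4.1, -3.8, -4.0])   # log p(x) on unlabeled examples
loss = joint_training_objective(labeled, unlabeled, weight=0.5)
```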
This paper focuses on pushing forward domain-agnostic semi-supervised learning, particularly via energy-based generative models. Recently, energy-based models (EBMs) [10, 11, 12, 13] have achieved promising results for generative modeling. Joint-training via EBMs for SSL has been explored with very encouraging results [11, 14, 15], which show state-of-the-art SSL performance across different data modalities (images, natural language, and protein structure prediction and year prediction from the UCI dataset repository) and in different data settings (fixed-dimensional and sequence data). However, pre-training via EBMs for SSL has not been studied, and it is interesting to compare joint-training and pre-training when both are based on EBMs. In this paper, we make two contributions. First, we explore pre-training via EBMs for SSL and compare it to joint-training. Second, as suggested in , a suite of experiments is conducted (code will be released for reproduction upon acceptance of this paper), in which we vary both the amount of labeled and unlabeled data to give a realistic whole picture of the performance of EBM based SSL methods.
2 Related work
Discriminative and generative SSL. Semi-supervised learning is a heavily studied problem, ranging from classic self-training (also known as pseudo-labeling) and graph based methods to various recent SSL methods for DNNs. In general, recent DNN based SSL methods can be distinguished by the prior they adopt for representation learning from unlabeled data. Discriminative SSL works by discriminating between different augmentations of a given unlabeled sample, as in the recent FixMatch and SimCLR methods. These methods rely on a rich set of domain-specific data augmentations, e.g., RandAugment. Although there are some efforts to use data-independent model noise, e.g., dropout, domain-specific data augmentations remain indispensable.
Recent progress in learning with deep generative models has stimulated generative SSL research, which usually involves blending unsupervised and supervised learning. These methods make fewer domain-specific assumptions and tend to be domain-agnostic. Performance comparisons between generative and discriminative SSL methods are mixed. It is found that consistency based discriminative SSL methods often outperform generative SSL methods in the image domain. However, in the text domain, generative SSL methods, such as those based on pre-trained word vectors, are more successful and widely used.
EBM based generative SSL. Recently, it was shown in [14] that joint-training via EBMs outperforms VAT on tabular data from the UCI dataset repository, beyond the image domain. Further, joint-training via EBMs has been extended to modeling sequences, and consistently outperforms conditional random fields (CRFs) (the supervised baseline) and self-training (the classic semi-supervised baseline) on natural language labeling tasks such as POS tagging, chunking and NER.
On the other hand, pre-training received attention in the early days of training DNNs and has recently become widely used in natural language processing tasks. Pre-training via EBMs is conceptually feasible, but remains unexplored. Despite the encouraging results obtained by joint-training of EBMs, it is not clear whether pre-training via EBMs is also competitive for domain-agnostic SSL.
3 Semi-supervised Learning via EBMs
An EBM defines a probability distribution over $x$ in the form

$$p_\theta(x) = \frac{\exp(U_\theta(x))}{Z(\theta)}, \quad (1)$$

where $\mathcal{X}$ denotes the space of all possible values of $x$, $Z(\theta) = \int_{\mathcal{X}} \exp(U_\theta(x)) dx$ is the normalizing constant, and $U_\theta(x)$ is called the potential function, which assigns a scalar value to each configuration of $x$ in $\mathcal{X}$ and can be very flexibly defined (e.g., through DNNs of different architectures). For different applications, $x$ could be discrete or continuous, and could be fixed-dimensional or trans-dimensional (i.e., sequences of varying lengths). For example, images are fixed-dimensional continuous data (i.e., $x \in \mathbb{R}^d$), and natural language sentences are sequences of discrete tokens (i.e., $x \in \mathcal{V}^+$, where $\mathcal{V}$ is the vocabulary of tokens).
Training EBMs is challenging, because the gradient in maximizing the data log-likelihood for an observed $x$ involves an expectation w.r.t. the model distribution $p_\theta$, as shown below:

$$\nabla_\theta \log p_\theta(x) = \nabla_\theta U_\theta(x) - \mathbb{E}_{x' \sim p_\theta}\left[\nabla_\theta U_\theta(x')\right]. \quad (2)$$
Considerable progress has been made recently in successfully training large-scale EBMs parameterized by DNNs [11, 12, 15, 23] for different types of data from various domains, which lays the foundation for using EBMs, as a unified framework, to achieve domain-agnostic SSL.
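The positive-phase-minus-negative-phase structure of this gradient can be verified numerically on a toy discrete EBM, small enough that the normalizing constant is computed by exact enumeration (an illustrative sketch; real EBMs need MCMC or NCE precisely because such enumeration is infeasible):

```python
import numpy as np
from itertools import product

# Toy discrete EBM: p(x) ∝ exp(theta · x) over x in {0,1}^d.
d = 3
rng = np.random.default_rng(0)
theta = rng.normal(size=d)
X = np.array(list(product([0, 1], repeat=d)), dtype=float)  # all 2^d configs

def log_Z(theta):
    return np.log(np.exp(X @ theta).sum())

def log_p(x, theta):
    return x @ theta - log_Z(theta)

# Analytic gradient: grad U(x) - E_p[grad U(x')] = x - E_p[x'] for this model
probs = np.exp(X @ theta - log_Z(theta))
x_obs = np.array([1.0, 0.0, 1.0])
grad_analytic = x_obs - probs @ X

# Finite-difference check of the same gradient
eps = 1e-6
grad_numeric = np.array([
    (log_p(x_obs, theta + eps * np.eye(d)[i])
     - log_p(x_obs, theta - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
assert np.allclose(grad_analytic, grad_numeric, atol=1e-5)
```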
For training EBMs on continuous data such as images, the inclusive approach, as detailed in , has been shown to yield superior results in unsupervised and semi-supervised training, by introducing inclusive-divergence minimized auxiliary generators and utilizing stochastic gradient sampling (such as stochastic gradient Langevin dynamics, SGLD) to approximate the model expectation in Eq. (2).
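A minimal sketch of SGLD (the update below is the standard unadjusted Langevin step; the auxiliary-generator machinery of the inclusive approach is omitted):

```python
import numpy as np

def sgld_sample(grad_log_p, x0, step=0.1, n_steps=500, rng=None):
    """Unadjusted Langevin step:
    x <- x + (step/2) * grad log p(x) + sqrt(step) * N(0, I)."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + 0.5 * step * grad_log_p(x) + np.sqrt(step) * rng.normal(size=x.shape)
    return x

# Toy check: for a standard Gaussian target, grad log p(x) = -x, so the
# chain should mix to roughly zero mean and unit variance.
rng = np.random.default_rng(1)
samples = np.array([sgld_sample(lambda x: -x, [5.0], rng=rng) for _ in range(200)])
```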
For training EBMs on discrete data such as text, noise-contrastive estimation (NCE) has been employed to avoid directly approximating the model expectation in Eq. (2), and has achieved superior results in unsupervised and semi-supervised training, with the use of dynamic noise distributions to improve the training efficiency of NCE.
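A sketch of the basic NCE objective for an unnormalized model (the helper names below are illustrative, and the dynamic noise distribution of the cited work is not modeled):

```python
import numpy as np

def nce_loss(log_p_model, log_p_noise, x_data, x_noise, nu):
    """NCE: learn an (unnormalized) model by discriminating data samples
    from nu-times-as-many noise samples; log_p_model is assumed to include
    the model's learned normalizing term."""
    G_data = log_p_model(x_data) - log_p_noise(x_data)    # log-odds on data
    G_noise = log_p_model(x_noise) - log_p_noise(x_noise) # log-odds on noise
    loss_data = np.log1p(nu * np.exp(-G_data)).mean()        # -log P(data|x)
    loss_noise = nu * np.log1p(np.exp(G_noise) / nu).mean()  # -nu*log P(noise|x)
    return loss_data + loss_noise

# Sanity check: if the model equals the noise distribution, the classifier
# is at chance, and for nu = 1 the loss is exactly 2*log(2).
rng = np.random.default_rng(0)
log_std_normal = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)
x_d, x_n = rng.normal(size=100), rng.normal(size=100)
loss = nce_loss(log_std_normal, log_std_normal, x_d, x_n, nu=1)
assert np.isclose(loss, 2 * np.log(2))
```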
3.2 Pre-training via EBMs for SSL
Pre-training via EBMs for SSL consists of two stages. The first stage pre-trains an EBM on unlabeled data. (This is also known as unsupervised pre-training, which is different from supervised pre-training, i.e., transfer learning using a pre-trained classifier; it is shown in  that the success of supervised transfer learning heavily depends on how closely related the two datasets are.) This is followed by a fine-tuning stage, where we can easily use the pre-trained EBM to initialize a discriminative model and further train it over labeled data.
Consider pre-training of an EBM for semi-supervised image classification, which essentially involves estimating $p_\theta(x)$ as defined in Eq. (1) from unlabeled images. For the potential function $U_\theta(x)$, we can use a multi-layer feed-forward neural network which, in the final layer, calculates a scalar via a linear layer, $U_\theta(x) = w^T h(x)$. Here $h(x)$ denotes the activation from the last hidden layer and $w$ the weight vector in the final linear layer. For simplicity, we omit the biases in describing linear layers throughout the paper.
In fine-tuning, we discard $w$ and feed $h(x)$ into an added linear output layer, followed by a softmax, to predict $p(y|x)$, where the weights of the new linear output layer are the trainable parameters and $y$ is the class label.
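The head swap at fine-tuning time can be sketched as follows (shapes and variable names are illustrative, not the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
H, K = 64, 10                            # hidden size, number of classes
w_pretrain = rng.normal(size=H)          # scalar-potential head: U(x) = w^T h(x)
W_new = rng.normal(size=(K, H)) * 0.01   # fresh classification head for fine-tuning

h = rng.normal(size=H)                   # h(x): features from the pre-trained body
potential = w_pretrain @ h               # used only during EBM pre-training
logits = W_new @ h                       # fine-tuning: discard w, predict K classes
probs = np.exp(logits - np.logaddexp.reduce(logits))  # softmax over classes
assert np.isclose(probs.sum(), 1.0)
```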
The above procedure can be similarly applied to pre-training of an EBM for semi-supervised natural language labeling (e.g., POS tagging). In pre-training, we basically estimate an EBM-based language model from an unlabeled text corpus. Neural networks with different architectures can be used to implement the potential function $U_\theta(x)$. With abuse of notation, here $x = (x_1, \dots, x_l)$ denotes a token sequence of length $l$, with $x_i \in \mathcal{V}$. We use the bidirectional LSTM based potential function in  as follows:
$$U_\theta(x) = \sum_{i=1}^{l} e_{x_i}^T \left(\overrightarrow{h}_i + \overleftarrow{h}_i\right),$$

where $e_{x_i}$, $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are of the same dimension, denoting the output embedding vector and the last hidden vectors of the forward and backward LSTMs respectively at position $i$.
In fine-tuning, we add a CRF, as the discriminative model, on top of the extracted representations to do sequence labeling, i.e., to predict a sequence of labels $y = (y_1, \dots, y_l)$ with one label per token, where $y_i$ denotes the label at position $i$. Specifically, we concatenate $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ and add a linear output layer to define the node potentials, and add a matrix $A$ to define the edge potentials, as in recent neural CRFs [25, 26]. The parameters to be fine-tuned are the weights in the linear output layer and $A$.
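The node-plus-edge potentials define a linear-chain CRF whose log-partition function is computed by the forward algorithm; a log-space sketch, checked against brute-force enumeration on a tiny example:

```python
import numpy as np
from itertools import product

def crf_log_Z(node, edge):
    """Forward algorithm for a linear-chain CRF.
    node: (T, K) node potentials (log-space); edge: (K, K) edge potentials."""
    alpha = node[0]
    for t in range(1, len(node)):
        # logsumexp over the previous label at each step
        m = alpha[:, None] + edge + node[t][None, :]
        alpha = np.logaddexp.reduce(m, axis=0)
    return np.logaddexp.reduce(alpha)

# Brute-force check on a tiny problem (T positions, K labels)
T, K = 4, 3
rng = np.random.default_rng(0)
node, edge = rng.normal(size=(T, K)), rng.normal(size=(K, K))
brute = np.logaddexp.reduce([
    sum(node[t, y[t]] for t in range(T))
    + sum(edge[y[t - 1], y[t]] for t in range(1, T))
    for y in product(range(K), repeat=T)
])
assert np.isclose(crf_log_Z(node, edge), brute)
```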
3.3 Joint-training via EBMs for SSL
The above pre-training via EBMs for SSL considers the modeling of only the observation $x$, without the label $y$. Joint-training refers to the joint modeling of $x$ and $y$:

$$p_\theta(x, y) = \frac{\exp(U_\theta(x, y))}{Z(\theta)}. \quad (4)$$
Then, it can be easily seen that the conditional density implied by the joint density in Eq. (4) is:

$$p_\theta(y|x) = \frac{\exp(U_\theta(x, y))}{\sum_{y'} \exp(U_\theta(x, y'))}.$$
The implied marginal density is $p_\theta(x) = \exp(U_\theta(x))/Z(\theta)$, where, with abuse of notation, $U_\theta(x) = \log \sum_{y} \exp(U_\theta(x, y))$. The key to EBM based joint-training for SSL is to choose a suitable $U_\theta(x, y)$ such that both $\log p_\theta(x, y)$ and $\log p_\theta(x)$ can be tractably optimized.
In joint-training of an EBM for semi-supervised image classification, we consider a neural network $\Phi_\theta(x)$, which accepts the image $x$ and outputs a vector whose size equals the number of class labels $K$. Then we define $U_\theta(x, y) = \Phi_\theta(x)[y]$, where $[y]$ denotes the $y$-th element of a vector. With the above potential definition, it can be easily seen that the implied conditional density $p_\theta(y|x)$ is exactly a standard softmax based $K$-class classifier, using the logits calculated by the neural network from the input $x$, and we do not need to calculate $Z(\theta)$ for classification. Therefore, we can conduct SSL over a mix of labeled and unlabeled data by maximizing the (weighted) sum of $\log p_\theta(x, y)$ and $\log p_\theta(x)$, where both optimizations are tractable as detailed in .
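The cancellation of $Z(\theta)$ in the conditional can be seen concretely: with $U_\theta(x, y)$ taken as the $y$-th logit, the conditional is an ordinary softmax, and the marginal potential is the logsumexp of the logits (the logit values below are illustrative):

```python
import numpy as np

def softmax(v):
    v = v - v.max()  # shift for numerical stability
    e = np.exp(v)
    return e / e.sum()

# With U(x, y) = logits[y], p(y|x) = exp(U(x,y)) / sum_y' exp(U(x,y'))
# is a plain softmax over the logits: Z(theta) cancels.
logits = np.array([2.0, -1.0, 0.5])   # Phi_theta(x), one entry per class
p_y_given_x = softmax(logits)

# And the marginal potential U(x) is the logsumexp of the logits.
u_x = np.logaddexp.reduce(logits)
assert np.isclose(p_y_given_x.sum(), 1.0)
assert np.allclose(np.exp(logits - u_x), p_y_given_x)
```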
The above procedure can be similarly applied to joint-training of an EBM for semi-supervised natural language labeling, with token sequence $x = (x_1, \dots, x_l)$ and label sequence $y = (y_1, \dots, y_l)$. We consider a neural network $\Phi_\theta(x)$ and define

$$U_\theta(x, y) = \sum_{i=1}^{l} \left( \Phi_\theta(x)[i, y_i] + A[y_{i-1}, y_i] \right),$$

where $[i, y_i]$ denotes the element of a matrix indexed by position $i$ and label $y_i$, and $A[y_{i-1}, y_i]$ models the edge potential for adjacent labels. With the above potential definition, it can be easily seen that the conditional density implied by the joint density in Eq. (4) is exactly a CRF with node potentials $\Phi_\theta(x)[i, y_i]$ and edge potentials $A[y_{i-1}, y_i]$, and the implied marginal density $p_\theta(x)$ is exactly a trans-dimensional random field (TRF) language model [10, 27, 28]. Training of both models is tractable, as detailed in [15, 23].
4 Experiments
SSL experiments are conducted on standard benchmark datasets in different domains, including the CIFAR-10 and SVHN datasets for image classification and the POS, chunking and NER datasets [29, 15] for natural language labeling. We use the standard data splits for training and testing. When we vary the amount of labeled and unlabeled data for training, we select varying proportions (e.g., 10%, 100%) of labels from the original full set of labeled data. Throughout the paper, the amount of labels is thus described in terms of proportions. "100% labeled" means 50,000 and 73,257 images for CIFAR-10 and SVHN, and 56K, 7.4K and 14K sentences for POS, chunking and NER, respectively.
4.1 SSL for Image Classification
|Ladder network ||20.40 ± 0.47|
|Results below this line cannot be directly compared to those above.|
|VAT small ||14.87|
|Temporal Ensembling ||12.16 ± 0.31|
|Mean Teacher ||12.31 ± 0.28|
Table 2: SSL for image classification over CIFAR-10 with 4,000 labels (error rates in %). The upper/lower blocks show generative/discriminative SSL methods respectively. The means and standard deviations are calculated over ten independent runs with randomly sampled labels.
First, we experiment with CIFAR-10 and compare different generative SSL methods. As in previous work, we randomly sample 4,000 labeled images for training. The remaining images are treated as unlabeled. We use the network architectures and hyper-parameter settings in . It can be seen from Table 2 that semi-supervised EBMs, especially the joint-training EBMs, produce strong results on par with state-of-the-art generative SSL methods (as discussed in , Bad-GANs could hardly be classified as a generative SSL method). Furthermore, joint-training EBMs outperform pre-training+fine-tuning EBMs by a large margin on this task. Note that some discriminative SSL methods, as listed in the lower block of Table 2, also produce superior results but heavily utilize domain-specific data augmentations, and thus are not directly comparable to generative SSL methods.
Second, we experiment with CIFAR-10 and SVHN, and examine the effects of varying the amount of labels. We sample varying proportions of labels as labeled training data and use the remaining data as unlabeled training data (i.e., we do not add external unlabeled data). From the plot of error rates w.r.t. labeling proportions in Fig. 1, we can see how many labels can be saved by using joint-training EBMs. The joint-training EBMs obtain 11.14% on CIFAR-10 and 3.95% on SVHN using only 50% of the labels, which is marginally better than the 11.49% and 4.04% obtained by the supervised baseline using 100% of the labels. This indicates that we can remove 50% of the labels without losing performance on these two tasks. Additionally, it is interesting to observe that, in the case of using 100% of the labels, the joint-training EBMs outperform the supervised baseline with 13.9% and 14.6% reductions in error rate. This is because the generative loss $\log p_\theta(x)$ provides regularization for the pure discriminative loss, as discussed in .
4.2 SSL for Natural Language Labeling
In this experiment, we evaluate different methods for natural language labeling, through three tasks: POS tagging, chunking and NER. The following benchmark datasets are used: PTB POS tagging, CoNLL-2000 chunking and CoNLL-2003 English NER, as in [26, 6, 29, 15]. We sample varying proportions of labels as labeled training data and use the Google one-billion-word dataset as the large pool of unlabeled sentences. In , joint-training EBM based experiments were conducted using labeling proportions of 10% and 100% with "U/L" (the ratio between the amounts of unlabeled and labeled data) of 50. In this paper, a larger-scale set of experiments is conducted, covering labeling proportions of 2%, 10% and 100% with "U/L" of 50, 250 and 500 for the three tasks, making 27 settings in total. We use the network architectures in . After some empirical search, we fix the hyper-parameters (tuned separately for different methods), which are then used for all 27 settings.
The main observations are as follows. 1) The joint-training EBMs outperform the supervised baseline in 25 out of the 27 settings. Since we perform one run for each setting, this may indicate 2 outliers. 2) For a fixed labeling size (as given by the labeling proportion), increasing "U/L" makes joint-training EBMs perform better, with one outlier. 3) The effect of increasing the labeling size with a fixed "U/L" on the improvement of the joint-training EBMs over the supervised baseline is mixed. For POS/chunking/NER, the largest improvements are achieved with 2%/10%/100% labeled, respectively. It seems that the working point where an SSL method brings the largest improvement over the supervised baseline is task dependent. If the working point is indicated by the performance of the supervised baseline, then the SSL method brings the largest effect when the performance of the supervised baseline is moderate, neither too low nor already high. 4) Joint-training EBMs outperform pre-training EBMs in 23 out of the 27 settings, marginally but nearly consistently. A possible intuition is that pre-training is not aware of the labels of interest and is thus weakened for representation learning. 5) It seems that the degree of improvement of the joint-training EBMs over the pre-training EBMs is not affected by the labeling size or "U/L".
Natural language labeling results. The evaluation metric is accuracy for POS and F1 for chunking and NER. "Labeled" denotes the amount of labels in terms of the proportion w.r.t. the full set of labels. "U/L" denotes the ratio between the amounts of unlabeled and labeled data. "U/L=0" denotes the supervised baseline. "pre." and "joint" denote the results by pre-training+fine-tuning EBMs and joint-training EBMs, respectively.
5 Conclusion
This paper focuses on pushing forward domain-agnostic semi-supervised learning, particularly via energy-based generative models, and makes two contributions. First, we explore pre-training via EBMs for SSL and compare it to joint-training. Second, a suite of experiments is conducted over the domains of image classification and natural language labeling to give a realistic whole picture of the performance of EBM based SSL methods. It is found that joint-training EBMs outperform pre-training EBMs marginally but nearly consistently. We hope the results presented here make a useful step towards developing domain-agnostic SSL methods.
-  Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii, “Virtual adversarial training: a regularization method for supervised and semi-supervised learning,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 8, pp. 1979–1993, 2018.
-  Samuli Laine and Timo Aila, “Temporal ensembling for semi-supervised learning,” in ICLR, 2017.
-  Antti Tarvainen and Harri Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," in NIPS, 2017.
-  Kihyuk Sohn, David Berthelot, Chun-Liang Li, et al., "FixMatch: Simplifying semi-supervised learning with consistency and confidence," arXiv:2001.07685, 2020.
-  Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, “A simple framework for contrastive learning of visual representations,” arXiv:2002.05709, 2020.
-  Kevin Clark, Minh-Thang Luong, Christopher D Manning, and Quoc Le, “Semi-supervised sequence modeling with cross-view training,” in EMNLP, 2018.
-  Xiaojin Zhu, “Semi-supervised learning literature survey,” Technical report, University of Wisconsin-Madison, 2006.
-  Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le, “RandAugment: Practical automated data augmentation with a reduced search space,” in CVPR, 2020.
-  Yoshua Bengio, Aaron Courville, and Pascal Vincent, “Representation learning: A review and new perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
-  Bin Wang, Zhijian Ou, and Zhiqiang Tan, “Learning trans-dimensional random fields with applications to language modeling,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 876–890, 2017.
-  Yunfu Song and Zhijian Ou, “Learning neural random fields with inclusive auxiliary generators,” arXiv:1806.00271, 2018.
-  Yilun Du and Igor Mordatch, “Implicit generation and generalization in energy-based models,” arXiv:1903.08689, 2019.
-  Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu, “A theory of generative convnet,” in ICML, 2016.
-  Stephen Zhao, Jörn-Henrik Jacobsen, and Will Grathwohl, “Joint energy-based models for semi-supervised classification,” in ICML Workshop on Uncertainty and Robustness in Deep Learning, 2020.
-  Yunfu Song, Zhijian Ou, Zitao Liu, and Songfan Yang, “Upgrading CRFs to JRFs and its benefits to sequence modeling and labeling,” in ICASSP, 2020.
-  Avital Oliver, Augustus Odena, Colin Raffel, Ekin D Cubuk, and Ian J Goodfellow, “Realistic evaluation of semi-supervised learning algorithms,” in ICLR, 2018.
-  H Scudder, "Probability of error of some adaptive pattern-recognition machines," IEEE Transactions on Information Theory, vol. 11, no. 3, pp. 363–371, 1965.
-  Dong-Hyun Lee, “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in ICML Workshop on challenges in representation learning, 2013.
-  Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in ICML, 2003.
-  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, 2014.
-  Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang, “A tutorial on energy-based learning,” Predicting structured data, vol. 1, no. 0, 2006.
-  Daphne Koller and Nir Friedman, Probabilistic graphical models: principles and techniques, MIT press, 2009.
-  Bin Wang and Zhijian Ou, “Improved training of neural trans-dimensional random field language models with dynamic noise-contrastive estimation,” in SLT, 2018.
-  Michael Gutmann and Aapo Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in AISTATS, 2010.
-  Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer, "Neural architectures for named entity recognition," in NAACL-HLT, 2016.
-  Xuezhe Ma and Eduard Hovy, “End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF,” in ACL, 2016.
-  Bin Wang and Zhijian Ou, “Language modeling with neural trans-dimensional random fields,” in ASRU, 2017.
-  Bin Wang and Zhijian Ou, “Learning neural trans-dimensional random field language models with noise-contrastive estimation,” in ICASSP, 2018.
-  Kai Hu, Zhijian Ou, Min Hu, and Junlan Feng, “Neural CRF transducers for sequence labeling,” in ICASSP, 2019.
-  Jost Tobias Springenberg, "Unsupervised and semi-supervised learning with categorical generative adversarial networks," in ICML, 2016.
-  Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund, and Tapani Raiko, “Semi-supervised learning with ladder networks,” in NIPS, 2015.
-  Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, “Improved techniques for training GANs,” in NIPS, 2016.
-  Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Ruslan R Salakhutdinov, “Good semi-supervised learning that requires a bad GAN,” in NIPS, 2017.
-  Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng, “Sobolev GAN,” in ICLR, 2018.
-  Andrew Y Ng and Michael I Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes," in NIPS, 2002.
-  Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson, “One billion word benchmark for measuring progress in statistical language modeling,” in INTERSPEECH, 2014.
-  Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D Cubuk, and Quoc V Le, “Rethinking pre-training and self-training,” arXiv:2006.06882, 2020.