1 Introduction
In supervised classification, probabilistic classification is an approach that assigns a class label to an input sample by estimating the posterior probability. This approach is primarily categorized into two types of models: discriminative models and generative models. The former optimize the posterior distribution directly on a training set, whereas the latter find the class conditional distribution and class prior and subsequently derive the posterior distribution using Bayes’ rule.
Discriminative and generative models are mutually related (Lasserre et al., 2006; Minka, 2005). According to Lasserre et al. (2006), the only difference between these models is their statistical parameter constraints. Therefore, given a certain generative model, we can derive a corresponding discriminative model. For example, the discriminative model corresponding to a unimodal Gaussian distribution is logistic regression (see Appendix A for derivation). Several discriminative models corresponding to the Gaussian mixture model (GMM) have been proposed (Axelrod et al., 2006; Bahl et al., 1996; Klautau et al., 2003; Tsai & Chang, 2002; Tsuji et al., 1999; Tüske et al., 2015; Wang, 2007). They exhibit a more flexible fitting capability than the generative GMM and have been applied successfully in fields such as speech recognition (Axelrod et al., 2006; Tüske et al., 2015; Wang, 2007).
A problem to address in mixture models such as the GMM is the determination of the number of components. Classically, Akaike’s information criterion and the Bayesian information criterion have been used; nevertheless, they incur a considerable computational cost because a likelihood must be calculated for every candidate number of components. For the generative GMM, methods that optimize the number of components during learning exist (Crouse et al., 2011; Štepánová & Vavrečka, 2018). However, for a discriminative GMM, a method to optimize the number of components simultaneously during learning has not been clearly formulated.
In this paper, we propose a novel GMM with two important properties, sparsity and discriminability, which we name the sparse discriminative Gaussian mixture (SDGM). In the SDGM, a GMM-based discriminative model is trained by sparse Bayesian learning. This learning algorithm improves the generalization capability by obtaining a sparse solution and determines the number of components automatically by removing redundant components. Furthermore, the SDGM can be embedded into neural networks (NNs) such as convolutional NNs and trained in an end-to-end manner with an NN. To the best of the authors’ knowledge, there is no GMM that has both sparsity and discriminability.
The contributions of this study are as follows:

We propose a novel sparse classifier based on a discriminative GMM. The proposed SDGM has both sparsity and discriminability, and determines the number of components automatically. The SDGM can be considered a theoretical extension of the discriminative GMM and the relevance vector machine (RVM) (Tipping, 2001).
This study attempts to connect the fields of probabilistic models and NNs. From the equivalence of a discriminative model based on a Gaussian distribution to a fully connected layer, we demonstrate that the SDGM can be used as a module of a deep NN. We also show that the SDGM can outperform the fully connected layer with a softmax function via end-to-end learning with an NN on image recognition tasks.
2 Sparse Discriminative Gaussian Mixture (SDGM)
An SDGM takes a continuous variable as its input and outputs the posterior probability of each class. An SDGM acquires a sparse structure by removing redundant components via sparse Bayesian learning.
Figure 1 shows how the SDGM is trained by removing unnecessary components while keeping discriminability. The two-class training data are from Ripley’s synthetic data (Ripley, 2006), where a Gaussian mixture model with two components is used for generating the data of each class. In this training, we set the initial number of components to three for each class. As the training progresses, one of the components for each class gradually becomes small and is removed.
2.1 Model formulation
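The posterior probability for each class $c$ given an input $\mathbf{x} \in \mathbb{R}^{D}$ is calculated as follows:
(1) $P(c \mid \mathbf{x}) = \sum_{m=1}^{M_c} P(c, m \mid \mathbf{x})$
(2) $P(c, m \mid \mathbf{x}) = \dfrac{\pi_{cm} \exp[\mathbf{w}_{cm}^{\top} \boldsymbol{\phi}]}{\sum_{c'=1}^{C} \sum_{m'=1}^{M_{c'}} \pi_{c'm'} \exp[\mathbf{w}_{c'm'}^{\top} \boldsymbol{\phi}]}$
(3) $\boldsymbol{\phi} = [1, \mathbf{x}^{\top}, x_1^2, x_1 x_2, \ldots, x_1 x_D, x_2^2, x_2 x_3, \ldots, x_D^2]^{\top}$
where $M_c$ is the number of components for class $c$ and $\pi_{cm}$ is the mixture weight that is equivalent to the prior of each component $P(c, m)$. It should be noted that we use $\mathbf{w}_{cm} \in \mathbb{R}^{H}$, which is the weight vector representing the $m$th Gaussian component of class $c$. The dimension of $\mathbf{w}_{cm}$, i.e., $H$, is the same as that of $\boldsymbol{\phi}$; namely, $H = 1 + D(D+3)/2$.
Derivation. Utilizing a Gaussian distribution as the conditional distribution of $\mathbf{x}$ given $c$ and $m$, $P(\mathbf{x} \mid c, m)$, the posterior probability of $c$ given $\mathbf{x}$, $P(c \mid \mathbf{x})$, is calculated as follows:
(4) $P(c \mid \mathbf{x}) = \dfrac{\sum_{m=1}^{M_c} P(c, m)\, P(\mathbf{x} \mid c, m)}{\sum_{c'=1}^{C} \sum_{m'=1}^{M_{c'}} P(c', m')\, P(\mathbf{x} \mid c', m')}$
(5) $P(\mathbf{x} \mid c, m) = \dfrac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}_{cm}|^{1/2}} \exp\left[-\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_{cm})^{\top} \boldsymbol{\Sigma}_{cm}^{-1} (\mathbf{x} - \boldsymbol{\mu}_{cm})\right]$
where $\boldsymbol{\mu}_{cm} \in \mathbb{R}^{D}$ and $\boldsymbol{\Sigma}_{cm} \in \mathbb{R}^{D \times D}$ are the mean vector and the covariance matrix for component $m$ in class $c$. Since the calculation inside the exponential function in (5) is a quadratic form, the conditional distributions can be transformed as follows:
(6) $P(\mathbf{x} \mid c, m) = \exp[\mathbf{w}_{cm}^{\top} \boldsymbol{\phi}]$
where
(7) $\mathbf{w}_{cm} = \Big[ -\dfrac{D}{2} \ln 2\pi - \dfrac{1}{2} \ln |\boldsymbol{\Sigma}_{cm}| - \dfrac{1}{2} \sum_{i=1}^{D} \sum_{j=1}^{D} s_{cmij}\, \mu_{cmi}\, \mu_{cmj},\ \sum_{i=1}^{D} s_{cmi1}\, \mu_{cmi}, \ldots, \sum_{i=1}^{D} s_{cmiD}\, \mu_{cmi},\ -\dfrac{1}{2} s_{cm11}, -s_{cm12}, \ldots, -s_{cm1D}, -\dfrac{1}{2} s_{cm22}, \ldots, -\dfrac{1}{2} s_{cmDD} \Big]^{\top}$
Here, $s_{cmij}$ is the $(i, j)$th element of $\boldsymbol{\Sigma}_{cm}^{-1}$.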
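As an illustration of the derivation above, the following sketch assumes a quadratic feature vector (a constant, the input elements, and the products of all pairs of input elements) and builds the weight vector of one Gaussian component from its mean and covariance. It then checks numerically that the exponential of the inner product reproduces the Gaussian density. The function names, the example values, and the restriction to two-dimensional inputs are assumptions made for brevity, not part of the original formulation.

```python
import math


def phi(x):
    """Quadratic feature vector: [1, x_1..x_D, x_i * x_j for all i <= j]."""
    d = len(x)
    quad = [x[i] * x[j] for i in range(d) for j in range(i, d)]
    return [1.0] + list(x) + quad


def gaussian_to_weight(mu, cov):
    """Weight vector w with exp(w . phi(x)) == N(x | mu, cov); 2-D inputs only."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    s = [[d / det, -b / det], [-c / det, a / det]]  # inverse covariance
    dim = 2
    quad_mu = sum(s[i][j] * mu[i] * mu[j] for i in range(dim) for j in range(dim))
    bias = -0.5 * dim * math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad_mu
    linear = [sum(s[i][j] * mu[i] for i in range(dim)) for j in range(dim)]
    # Diagonal terms carry -1/2; each off-diagonal pair appears once, so it carries -1.
    quad = [(-0.5 if i == j else -1.0) * s[i][j]
            for i in range(dim) for j in range(i, dim)]
    return [bias] + linear + quad


def gaussian_pdf(x, mu, cov):
    """Direct evaluation of the 2-D Gaussian density, used only as a check."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx = [x[0] - mu[0], x[1] - mu[1]]
    maha = (d * dx[0] ** 2 - (b + c) * dx[0] * dx[1] + a * dx[1] ** 2) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def class_posterior(x, weights, mixture_weights):
    """P(c | x) from per-class lists of component weights and mixture weights."""
    f = phi(x)
    joint = [[p * math.exp(dot(w, f)) for w, p in zip(ws, ps)]
             for ws, ps in zip(weights, mixture_weights)]
    total = sum(sum(row) for row in joint)
    return [sum(row) / total for row in joint]


x = [0.3, -1.2]
mu, cov = [0.5, -0.5], [[1.0, 0.3], [0.3, 0.5]]
w = gaussian_to_weight(mu, cov)
density_from_weights = math.exp(dot(w, phi(x)))
density_direct = gaussian_pdf(x, mu, cov)
```

In this sketch the two density values agree up to floating-point error, and `class_posterior` sums the per-component terms within each class before normalizing, which mirrors how the class posterior is obtained from the component posteriors.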
2.2 Learning algorithm
Algorithm 1 shows the training of the SDGM.
In this algorithm, the optimal weight is obtained as the maximum a posteriori solution. We can obtain a sparse solution by optimizing the prior distribution set to each weight simultaneously with the weight optimization.
A set of training data and target values is given. The target is coded in a one-hot form: the element for a class is 1 if the sample belongs to that class and 0 otherwise. A binary random variable is introduced, which is 1 when a sample from a given class belongs to a given component and 0 otherwise. This variable is required for the optimization of the mixture weights. We also collect the binary variables and the mixture weights into vectors. As the prior distribution of the weights, we employ a Gaussian distribution with a mean of zero. Using a different precision parameter (inverse of the variance) for each weight, the joint probability of all the weights is represented as follows:
(8)
where the weights and the precision parameters are each collected into a vector. During learning, we update not only the weights but also the precision parameters. If a precision parameter tends to infinity, the prior (8) is 0 for any nonzero value of the corresponding weight; hence, a sparse solution is obtained by optimizing the precision parameters.
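Using these variables, the expectation of the log-likelihood function over the binary latent variables is defined as follows:
(9)
where the target matrix has the one-hot targets as its elements and the training data matrix contains one training sample in each row. The variable on the right-hand side is the expectation of the binary latent variable, i.e., the posterior probability that a sample belongs to a component given its class; it can be calculated as the ratio of the joint posterior of the class and the component to the posterior of the class.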
The posterior probability of the weight vector is described as follows:
(10) 
The optimal weight is obtained as the point where (10) is maximized. The denominator of the right-hand side in (10) is called the evidence term, and we maximize it with respect to the precision parameters. However, this maximization problem cannot be solved analytically; therefore, we introduce the Laplace approximation through the following procedure.
With the precision parameters fixed, we obtain the mode of the posterior distribution of the weights. The solution is given by the point where the following equation is maximized:
(11)  
where . We obtain the mode of (11) via Newton’s method. The gradient and Hessian required for this estimation can be calculated as follows:
(12) 
(13) 
Each element of and is calculated as follows:
(14) 
(15) 
where is a variable that takes 1 if both and , 0 otherwise. Hence, the posterior distribution of can be approximated by a Gaussian distribution with a mean of and a covariance matrix of , where
(16) 
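The Laplace approximation step can be illustrated with a deliberately small stand-in: a two-class logistic model with a single scalar weight and a zero-mean Gaussian prior with precision `alpha`. This is an assumed toy setting, not the SDGM itself, and all names and data values are hypothetical. Newton's method finds the mode of the log-posterior, and the negative inverse of the second derivative at the mode gives the variance of the approximating Gaussian.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_and_hessian(w, xs, ts, alpha):
    """First and second derivatives of the log-posterior with respect to w."""
    ps = [sigmoid(w * x) for x in xs]
    # Log-likelihood gradient minus the gradient of the prior term 0.5 * alpha * w^2.
    grad = sum((t - p) * x for x, t, p in zip(xs, ts, ps)) - alpha * w
    hess = -sum(p * (1.0 - p) * x * x for x, p in zip(xs, ps)) - alpha
    return grad, hess


def laplace_approximation(xs, ts, alpha, iters=25):
    """Return the posterior mode and the variance of the Gaussian approximation."""
    w = 0.0
    for _ in range(iters):
        grad, hess = grad_and_hessian(w, xs, ts, alpha)
        w -= grad / hess  # Newton step toward the posterior mode
    _, hess = grad_and_hessian(w, xs, ts, alpha)
    return w, -1.0 / hess


# Toy one-dimensional data that is not linearly separable.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ts = [0, 0, 1, 0, 1, 1]
w_hat, variance = laplace_approximation(xs, ts, alpha=1.0)
```

Because the log-posterior is strictly concave here, the second derivative is always negative and the resulting variance is positive; in the multivariate case the scalar division becomes a linear solve with the Hessian matrix.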
Because the evidence term can be represented using the normalization term of this Gaussian distribution, we obtain the following updating rule by calculating its derivative with respect to .
(17) 
where is the orthogonal component of . The mixture weight can be estimated using as follows:
(18) 
where is the number of training samples belonging to class . As described above, we obtain a sparse solution by alternately repeating the update of the hyperparameters, as described in (17) and (18), and the posterior distribution estimation of the weights using the Laplace approximation. During the procedure, a component is eliminated if its mixture weight becomes 0 or all the weights corresponding to the component become 0.
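The alternating hyperparameter updates and the pruning rule can be sketched as follows. Two assumptions are made and should be read as such: the precision update uses the re-estimation formula of the RVM (Tipping, 2001) as a stand-in, and the mixture weight is taken to be the average responsibility over the samples of a class. The function names and the tolerance are hypothetical.

```python
import math


def responsibilities(joint_posteriors):
    """Normalize the joint class-component posteriors within the sample's class."""
    total = sum(joint_posteriors)
    return [p / total for p in joint_posteriors]


def update_alpha(alpha, w_hat, variance):
    """RVM-style re-estimation (assumed form): (1 - alpha * variance) / w_hat^2."""
    if w_hat == 0.0:
        return math.inf  # an exactly zero weight keeps an infinitely sharp prior
    return (1.0 - alpha * variance) / (w_hat ** 2)


def update_mixture_weights(resp):
    """Average responsibility per component; resp[n][m] covers one class."""
    n_samples = len(resp)
    n_components = len(resp[0])
    return [sum(r[m] for r in resp) / n_samples for m in range(n_components)]


def surviving_components(mixture_weights, weights, tol=1e-8):
    """Indices of components whose mixture weight and weights are not all zero."""
    return [m for m, p in enumerate(mixture_weights)
            if p > tol and any(abs(w) > tol for w in weights[m])]


# Four samples of one class, three initial components.
resp = [[0.7, 0.3, 0.0], [0.6, 0.4, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]
pi = update_mixture_weights(resp)
component_weights = [[0.5, -1.0], [0.0, 0.0], [0.2, 0.1]]
kept = surviving_components(pi, component_weights)
```

In this toy run the third component is dropped because its mixture weight is zero, and the second is dropped because all of its weights are zero, so only the first component survives.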
3 Experiments
3.1 Evaluation of characteristics using synthetic data
To evaluate the characteristics of the SDGM, we conducted classification experiments using synthetic data. The dataset comprises two classes. The data were sampled from a Gaussian mixture model with eight components for each class. The numbers of training data and test data were 320 and 1,600, respectively. The scatter plot of this dataset is shown in Figure 2.
In the evaluation, we calculated the error rates for the training data and the test data, the number of components after training, the number of nonzero weights after training, and the weight reduction ratio (the proportion of the initial weights that were reduced to zero), while varying the number of initial components.
Figure 2 displays the changes in the learned class boundaries according to the number of initial components.
When the number of components is small, such as that shown in Figure 2(a), the decision boundary is simple; therefore, the classification performance is insufficient. However, as the number of components increases, the decision boundary fits the actual class boundaries. It is noteworthy that the SDGM learns the GMM as a discriminative model instead of a generative model; an appropriate decision boundary was obtained even when the number of components in the model was less than the actual number (e.g., Figure 2(c)).
Figure 3 shows the evaluation results of the characteristics.
Figures 3(a), (b), (c), and (d) show the recognition error rate, number of components after training, number of nonzero weights after training, and weight reduction ratio, respectively. The horizontal axis shows the number of initial components in all the graphs.
In Figure 3(a), the recognition error rates for the training data and test data are almost the same when the number of components is small, and they decrease as the number of initial components increases from 2 to 6. This implies that the representation capability was insufficient when the number of components was small, and that the classifier could not accurately separate the classes. Meanwhile, the training and test error rates were both flat when the number of initial components exceeded eight, although the test error rate was slightly higher than the training error rate. In general, the training error decreases and the test error increases when the complexity of the classifier is increased. However, the SDGM suppresses the increase in complexity using sparse Bayesian learning, thereby preventing overfitting.
In Figure 3(b), the number of components after training corresponds to the number of initial components until the number of initial components reaches eight. When the number of initial components exceeds ten, the number of components after training tends to be reduced. In particular, eight components are removed when the number of initial components is 20. These results indicate that the SDGM can remove unnecessary components.
From the results in Figure 3(c), we confirm that the number of nonzero weights after training increases as the number of initial components increases. This implies that the complexity of the trained model depends on the number of initial components, and that the minimum number of components is not always obtained.
Meanwhile, in Figure 3(d), the weight reduction ratio increases as the number of initial components increases. This result suggests that the larger the number of initial weights, the more weights were removed. Moreover, the weight reduction ratio is greater than 99 % in every case. These results indicate that the SDGM can prevent overfitting by obtaining high sparsity and can remove unnecessary components.
3.2 Comparative study using benchmark data
To evaluate the capability of the SDGM quantitatively, we conducted a classification experiment using benchmark datasets. The datasets used in this experiment were Ripley’s synthetic data (Ripley, 2006) (Ripley hereinafter) and four datasets cited from Rätsch et al. (2001): Banana, Waveform, Titanic, and Breast Cancer. Ripley is a synthetic dataset that is generated from a two-dimensional Gaussian mixture model, and 250 and 1,000 samples are provided for training and test, respectively. The number of classes is two, and each class comprises two components. The remaining four datasets are all two-class datasets, which differ in data size and dimensionality. Since they contain 100 training/test splits, we repeated the experiments 100 times and then calculated average statistics.
For comparison, we used three classifiers that can obtain a sparse solution: a linear logistic regression (LR) with an $\ell_1$ constraint, a support vector machine (SVM) (Cortes & Vapnik, 1995), and a relevance vector machine (RVM) (Tipping, 2001). In the evaluation, we compared the recognition error rates for discriminability and the number of nonzero weights for sparsity on the test data. The results of the SVM and RVM were cited from Tipping (2001). As an ablation study, we also tested our SDGM without sparse learning by omitting the update of the precision parameters of the weight prior. As a summary, the statistics were normalized by those of the SDGM and their overall mean is reported.
Table 1 shows the recognition error rates and number of nonzero weights for each method.
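Table 1, error rate (%)

| Dataset | SDGM w/ sparse | SDGM w/o sparse | LR | SVM | RVM |
|---|---|---|---|---|---|
| Ripley | 9.1 | 9.9 | 11.4 | 10.6 | 9.3 |
| Banana | 10.6 | 10.8 | 47.0 | 10.9 | 10.8 |
| Waveform | 10.1 | 9.5 | 13.5 | 10.3 | 10.9 |
| Titanic | 22.7 | 23.3 | 22.7 | 22.1 | 23.0 |
| Breast Cancer | 29.4 | 35.1 | 27.5 | 26.9 | 29.9 |
| Normalized mean | 1.00 | 1.05 | 1.79 | 1.02 | 1.03 |

Table 1, number of nonzero weights

| Dataset | SDGM w/ sparse | SDGM w/o sparse | LR | SVM | RVM |
|---|---|---|---|---|---|
| Ripley | 6 | 1255 | 2 | 38 | 4 |
| Banana | 11.1 | 2005 | 2 | 135.2 | 11.4 |
| Waveform | 11.0 | 2005 | 20.73 | 146.4 | 14.6 |
| Titanic | 74.5 | 755 | 2.98 | 93.7 | 65.3 |
| Breast Cancer | 15.73 | 1005 | 8.88 | 116.7 | 6.3 |
| Normalized mean | 1.00 | 129.35 | 0.60 | 8.11 | 0.86 |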
The results in Table 1 show that the SDGM achieved an accuracy equivalent to or greater than that of the SVM and RVM on average. The SDGM is developed based on a Gaussian mixture model and is particularly effective for data where a Gaussian distribution can be assumed, such as the Ripley dataset. Regarding the number of nonzero weights, the LR understandably showed the smallest number since it is a linear model. Among the remaining nonlinear classifiers, the SDGM achieved a relatively small number of nonzero weights thanks to its sparse Bayesian learning. These results indicate that the SDGM achieves generalization capability and sparsity simultaneously.
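The normalized mean row of Table 1 can be reproduced from the per-dataset values. The short sketch below does this for the error rates, assuming that each method's value is divided by the value of the SDGM with sparse learning on the same dataset and that the ratios are then averaged over the five datasets; the dictionary layout and function name are illustrative choices.

```python
# Error rates (%) copied from Table 1.
error_rates = {
    "Ripley": {"SDGM": 9.1, "SDGM w/o sparse": 9.9, "LR": 11.4, "SVM": 10.6, "RVM": 9.3},
    "Banana": {"SDGM": 10.6, "SDGM w/o sparse": 10.8, "LR": 47.0, "SVM": 10.9, "RVM": 10.8},
    "Waveform": {"SDGM": 10.1, "SDGM w/o sparse": 9.5, "LR": 13.5, "SVM": 10.3, "RVM": 10.9},
    "Titanic": {"SDGM": 22.7, "SDGM w/o sparse": 23.3, "LR": 22.7, "SVM": 22.1, "RVM": 23.0},
    "Breast Cancer": {"SDGM": 29.4, "SDGM w/o sparse": 35.1, "LR": 27.5, "SVM": 26.9, "RVM": 29.9},
}


def normalized_mean(table, method, reference="SDGM"):
    """Mean over datasets of the method's value divided by the reference value."""
    ratios = [row[method] / row[reference] for row in table.values()]
    return sum(ratios) / len(ratios)


methods = ["SDGM", "SDGM w/o sparse", "LR", "SVM", "RVM"]
summary = {m: round(normalized_mean(error_rates, m), 2) for m in methods}
```

Under this reading, `summary` matches the error-rate entries of the normalized mean row (1.00, 1.05, 1.79, 1.02, 1.03). Applying the same function to the rounded nonzero-weight counts gives values close to, but not always identical with, the reported ones, presumably because the table entries are themselves rounded averages.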
3.3 Image classification
In this experiment, the SDGM is embedded into a deep neural network. Since the SDGM is differentiable with respect to the weights, it can be embedded into a deep NN as a module and trained in an end-to-end manner. In particular, the SDGM plays the same role as the softmax function since the SDGM calculates the posterior probability of each class given an input vector. We can show that a fully connected layer with the softmax is equivalent to the discriminative model based on a single Gaussian distribution for each class by applying a simple transformation (see Appendix A), whereas the SDGM is based on the Gaussian mixture model.
To verify the difference between them, we conducted image classification experiments. Using a CNN with a softmax function as a baseline, we evaluated the capability of SDGM by replacing softmax with the SDGM.
3.3.1 Datasets and experimental setups
We used the following datasets and experimental settings in this experiment.
MNIST: This dataset includes 10 classes of handwritten binary digit images of size 28 × 28 (LeCun et al., 1998). We used 60,000 images as training data and 10,000 images as testing data. As a feature extractor, we used a simple CNN that consists of five convolutional layers with four max pooling layers between them and a fully connected layer. To visualize the learned CNN features, we first set the output dimension of the fully connected layer of the baseline CNN to two. Furthermore, we tested by increasing the output dimension of the fully connected layer from two to ten.
Fashion-MNIST: Fashion-MNIST (Xiao et al., 2017) includes 10 classes of binary fashion images with a size of 28 × 28. It includes 60,000 images for training data and 10,000 images for testing data. We used the same CNN as in MNIST with 10 as the output dimension.
CIFAR-10: CIFAR-10 (Krizhevsky & Hinton, 2009) is a labeled subset of the 80 Million Tiny Images dataset. This dataset consists of 60,000 32 × 32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. For CIFAR-10, we trained DenseNet (Huang et al., 2017) with a depth of 40 and a growth rate of 12.
For each dataset, the network was trained with a batch size of 64 for 100 epochs with a learning rate of 0.01. We used weight decay and the Nesterov optimization algorithm (Sutskever et al., 2013) with a momentum of 0.9. The network weights were initialized using the Glorot uniform initializer (Glorot & Bengio, 2010).
3.3.2 Results
Figure 4 shows the two-dimensional feature embeddings on the MNIST dataset. Different feature embeddings were acquired for each method. When softmax was used, the features spread in a fan shape and some parts of the distribution overlapped around the origin. However, when the SDGM was used, the distribution for each class exhibited an elliptical shape and margins appeared between the class distributions. This is because the SDGM is based on a Gaussian mixture model and functions to push the samples into a Gaussian shape.
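| Method | MNIST (output dimension 2) | MNIST (output dimension 10) | Fashion-MNIST | CIFAR-10 |
|---|---|---|---|---|
| Softmax | 3.19 | 1.01 | 8.78 | 11.07 |
| SDGM | 2.43 | 0.72 | 8.30 | 10.05 |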
4 Related Work and Position of This Study
Figure 5 illustrates the relationship of our study with other studies.
This study primarily consists of three factors: the discriminative model, the Gaussian mixture model, and sparse Bayesian learning. This study is the first to combine these three factors and expands the body of knowledge in these fields.
From the perspective of the sparse Bayesian classifier, the RVM (Tipping, 2001) is the most important related study. An RVM combines logistic regression and sparse Bayesian learning. Since logistic regression is equivalent to the discriminative model of a unimodal Gaussian model, the SDGM can be considered an extended RVM using a GMM. Furthermore, from the perspective of the probabilistic model, the SDGM can be considered an extended discriminative GMM (Klautau et al., 2003) using sparse Bayesian learning, and an extended sparse GMM (Gaiffas & Michel, 2014) using the discriminative model.
Sparse methods have often been used in machine learning. The three primary merits of sparse learning are improvements in generalization capability, memory reduction, and interpretability. Several attempts have been made to adapt sparse learning to deep NNs. Graham (2014) proposed a spatially-sparse convolutional neural network. Liu et al. (2015) proposed a sparse convolutional neural network. Additionally, sparse Bayesian learning has been applied in many fields. For example, an application to EEG classification has been reported (Zhang et al., 2017).
5 Conclusion
In this paper, we proposed a sparse classifier based on a GMM, which is named the SDGM. In the SDGM, a GMM-based discriminative model was trained by sparse Bayesian learning. This learning algorithm improved the generalization capability by obtaining a sparse solution and automatically determined the number of components by removing redundant components. The SDGM could be embedded into NNs such as convolutional NNs and could be trained in an end-to-end manner.
In the experiments, we demonstrated that the SDGM could reduce the number of weights via sparse Bayesian learning, thereby improving its generalization capability. The comparison using benchmark datasets suggested that the SDGM achieves accuracy equivalent to or better than that of conventional sparse classifiers on average. We also demonstrated that the SDGM outperformed the fully connected layer with the softmax function when it was used as the last layer of a deep NN.
One of the limitations of this study is that sparse Bayesian learning was applied only when the SDGM was trained standalone. In future work, we will develop a sparse learning algorithm for a whole deep NN structure including the feature extraction part. This will improve the ability of the CNN for larger data classification.
References
 Axelrod et al. (2006) Scott Axelrod, Vaibhava Goel, Ramesh Gopinath, Peder Olsen, and Karthik Visweswariah. Discriminative estimation of subspace constrained Gaussian mixture models for speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(1):172–189, 2006.
 Bahl et al. (1996) Lalit R Bahl, Mukund Padmanabhan, David Nahamoo, and PS Gopalakrishnan. Discriminative training of Gaussian mixture models for large vocabulary speech recognition systems. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings (ICASSP), volume 2, pp. 613–616, 1996.
 Cortes & Vapnik (1995) Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
 Crouse et al. (2011) David F Crouse, Peter Willett, Krishna Pattipati, and Lennart Svensson. A look at Gaussian mixture reduction algorithms. In Proceedings of the 14th International Conference on Information Fusion, pp. 1–8, 2011.
 Gaiffas & Michel (2014) Stephane Gaiffas and Bertrand Michel. Sparse Bayesian unsupervised learning. arXiv preprint arXiv:1401.8017, 2014.
 Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249–256, 2010.
 Graham (2014) Benjamin Graham. Spatially-sparse convolutional neural networks. arXiv preprint arXiv:1409.6070, 2014.

 Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pp. 3, 2017.
 Klautau et al. (2003) Aldebaro Klautau, Nikola Jevtic, and Alon Orlitsky. Discriminative Gaussian mixture models: A comparison with kernel classifiers. In Proceedings of the 20th International Conference on Machine Learning (ICML), pp. 353–360, 2003.
 Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 Lasserre et al. (2006) Julia A. Lasserre, Christopher M. Bishop, and Thomas P. Minka. Principled hybrids of generative and discriminative models. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 87–94, 2006.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Liu et al. (2015) Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 806–814, 2015.
 Minka (2005) Tom Minka. Discriminative models, not discriminative training. Technical Report MSR-TR-2005-144, Microsoft Research, 2005.
 Rätsch et al. (2001) Gunnar Rätsch, Takashi Onoda, and KR Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.
 Ripley (2006) Brian D Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 2006.
 Štepánová & Vavrečka (2018) Karla Štepánová and Michal Vavrečka. Estimating number of components in Gaussian mixture model using combination of greedy and merging algorithm. Pattern Analysis and Applications, 21(1):181–192, 2018.

 Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning (ICML), volume 28, pp. 1139–1147, 2013.
 Tipping (2001) Michael E Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1(Jun):211–244, 2001.
 Tsai & Chang (2002) Wuei-He Tsai and Wen-Whei Chang. Discriminative training of Gaussian mixture bigram models with application to Chinese dialect identification. Speech Communication, 36(3-4):317–326, 2002.
 Tsuji et al. (1999) Toshio Tsuji, Osamu Fukuda, Hiroyuki Ichinobe, and Makoto Kaneko. A log-linearized Gaussian mixture network and its application to EEG pattern classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 29(1):60–72, 1999.

 Tüske et al. (2015) Zoltán Tüske, Muhammad Ali Tahir, Ralf Schlüter, and Hermann Ney. Integrating Gaussian mixtures into deep neural networks: Softmax layer with hidden variables. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4285–4289, 2015.
 Wang (2007) Jue Wang. Discriminative Gaussian mixtures for interactive image segmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 1, pp. I–601, 2007.
 Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
 Zhang et al. (2017) Yu Zhang, Yu Wang, Jing Jin, and Xingyu Wang. Sparse Bayesian learning for obtaining sparsity of EEG frequency bands based feature vectors in motor imagery classification. International Journal of Neural Systems, 27(2):1650032, 2017.
Appendix A Appendix
We explain that a fully connected layer with the softmax function, or logistic regression, can be regarded as a discriminative model based on a Gaussian distribution by transforming the equations. Let us consider a case in which the class-conditional probability is a Gaussian distribution. In this case, we can omit the component index from Equations (4)–(7).
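The transformation can be sketched as follows, under the assumption that each class has a single Gaussian component whose density is written as the exponential of an inner product between a class weight vector $\mathbf{w}_c$ and a feature vector $\boldsymbol{\phi}$ built from the input. The symbols $\mathbf{w}_c$, $\boldsymbol{\phi}$, and $P(c)$ are notation assumed here for illustration.

```latex
\begin{align}
P(c \mid \mathbf{x})
  &= \frac{P(c)\, P(\mathbf{x} \mid c)}{\sum_{c'} P(c')\, P(\mathbf{x} \mid c')} \\
  &= \frac{P(c) \exp\left[\mathbf{w}_c^{\top} \boldsymbol{\phi}\right]}
          {\sum_{c'} P(c') \exp\left[\mathbf{w}_{c'}^{\top} \boldsymbol{\phi}\right]} \\
  &= \frac{\exp\left[\mathbf{w}_c^{\top} \boldsymbol{\phi} + \ln P(c)\right]}
          {\sum_{c'} \exp\left[\mathbf{w}_{c'}^{\top} \boldsymbol{\phi} + \ln P(c')\right]}.
\end{align}
```

Since the first element of $\boldsymbol{\phi}$ can be taken as a constant, $\ln P(c)$ can be absorbed into the corresponding element of $\mathbf{w}_c$. The result is a softmax function applied to a linear map of $\boldsymbol{\phi}$, which has the same form as a fully connected layer followed by softmax. If all classes additionally share one covariance matrix, the quadratic terms cancel between the numerator and the denominator, leaving a map that is linear in $\mathbf{x}$ itself.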