Semi-supervised Multi-view Variational Autoencoder (semiMVAE)
In emotion recognition, it is difficult to recognize a person's emotional states using just a single modality. In addition, the annotation of physiological emotional data is particularly expensive. These two aspects make building an effective emotion recognition model challenging. In this paper, we first build a multi-view deep generative model to simulate the generative process of multi-modality emotional data. By imposing a mixture of Gaussians assumption on the posterior approximation of the latent variables, our model can learn a shared deep representation from multiple modalities. To solve the labeled-data-scarcity problem, we further extend our multi-view model to the semi-supervised learning scenario by casting the semi-supervised classification problem as a specialized missing data imputation task. Our semi-supervised multi-view deep generative framework can leverage both labeled and unlabeled data from multiple modalities, and the weight factor for each modality can be learned automatically. Compared with previous emotion recognition methods, our method is more robust and flexible. Experiments conducted on two real multi-modal emotion datasets demonstrate the superiority of our framework over a number of competitors.
With the development of human-computer interaction, emotion recognition has become increasingly important. Since human emotion involves many nonverbal cues, various modalities ranging from facial expressions, voice, Electroencephalogram (EEG) and eye movements to other physiological signals can serve as indicators of emotional states [Calvo and D'Mello 2010]. In real-world applications, it is difficult to recognize emotional states using just a single modality, because signals from different modalities represent different aspects of emotion and provide complementary information. Recent studies show that integrating multiple modalities can significantly boost emotion recognition accuracy [Verma and Tiwary 2014, Pang et al. 2015, Lu et al. 2015, Liu et al. 2016, Soleymani et al. 2016, Zhang et al. 2016].
The most successful approach to fusing information from multiple modalities is based on deep multi-view representation learning [Ngiam et al. 2011, Andrew et al. 2013, Srivastava and Salakhutdinov 2014, Wang et al. 2015, Chandar et al. 2016]. For example, [Pang et al. 2015] proposed to learn a joint density model for emotion analysis with a multi-modal Deep Boltzmann Machine (DBM) [Srivastava and Salakhutdinov 2014], which models the joint distribution over visual, auditory, and textual features.
[Liu et al. 2016] proposed a multi-modal emotion recognition method using multi-modal Deep Autoencoders (DAE) [Ngiam et al. 2011], in which joint representations of EEG and eye movement signals were extracted. Nevertheless, these deep multi-modal emotion recognition methods still have limitations, e.g., their performance depends on the amount of labeled data.

With modern sensor equipment, we can easily collect massive physiological signals, which are closely related to people's emotional states. Despite the convenience of data acquisition, the data labeling procedure requires substantial manual effort. Therefore, in most cases only a small set of labeled samples is available, while the majority of the dataset is left unlabeled. Traditional emotion recognition approaches utilize only the limited amount of labeled data, which may result in severe overfitting. The most attractive way to deal with this issue is Semi-supervised Learning (SSL), which builds a more robust model by exploiting both labeled and unlabeled data [Schels et al. 2014, Jia et al. 2014, Zhang et al. 2016]. Though multi-modal approaches have been widely implemented for emotion recognition, very few of them explore SSL simultaneously. To the best of our knowledge, only [Zhang et al. 2016] proposed an enhanced multi-modal co-training algorithm for semi-supervised emotion recognition, but its shallow structure makes it hard to capture the high-level correlation between different modalities.
Amongst existing SSL approaches, the most competitive ones are based on deep generative models, which employ Deep Neural Networks (DNNs) to learn discriminative features and cast the semi-supervised classification problem as a specialized missing data imputation task.
[Kingma et al. 2014] and [Maaløe et al. 2016] have shown that deep generative models and approximate Bayesian inference, exploiting recent advances in scalable variational methods [Kingma and Welling 2014, Rezende et al. 2014], can provide state-of-the-art performance for semi-supervised classification. Though the Variational Autoencoder (VAE) framework [Kingma and Welling 2014] has shown great advantages in SSL, its potential merits remain under-explored. For example, until recently, there was no successful multi-view extension of it. The main difficulty lies in its inherent assumption that the posterior approximation should be conditioned on the data point, which is natural for single-view data but becomes problematic in the multi-view case.

In this paper, we propose a novel semi-supervised multi-view deep generative framework for multi-modal emotion recognition. Our framework combines the advantages of deep multi-view representation learning and Bayesian modeling, so it has sufficient flexibility and robustness in learning the joint features and the classifier. Our main contributions can be summarized as follows.
We propose a multi-view extension for VAE by imposing a mixture of Gaussians assumption on the posterior approximation of the latent variables. For multi-view learning, this is critical for fully exploiting the information from multiple views.
We introduce a semi-supervised multi-modal emotion recognition framework based on multi-view VAE. Our framework can leverage both labeled and unlabeled samples from multiple modalities and the weight factor for each modality can be learned automatically, which is critical for building a robust emotion recognition system.
We demonstrate the superiority of our framework and provide insightful observations on two real multi-modal emotion datasets.
The VAE framework has recently been introduced as a robust model for latent feature learning [Kingma and Welling 2014, Rezende et al. 2014]. However, the single-view architecture of VAE cannot effectively deal with multi-view data. In this section, we first build a multi-view VAE, which can learn a shared deep representation from multi-view data. We then extend it to the semi-supervised scenario. Assume we are faced with multi-view data that appears as pairs ({x_v}_{v=1}^{V}, y), with observation x_v from the v-th view and the corresponding class label y.
We assume a latent variable z can generate the multi-view features x_1, …, x_V. Specifically, we assume z generates x_v for any v ∈ {1, …, V}, with the following generative model (cf. Fig. 1a):
p_\theta(x_1, \ldots, x_V, z) = p(z) \prod_{v=1}^{V} p_{\theta_v}(x_v \mid z), \qquad p(z) = \mathcal{N}(z \mid 0, I), \qquad (1)
where p_{\theta_v}(x_v \mid z) is a suitable likelihood function (e.g., a Gaussian for continuous observations or a Bernoulli for binary observations), which is formed by a non-linear transformation of the latent variable z. This non-linear transformation is essential to allow higher moments of the data to be captured by the density model, and we choose these non-linear functions to be DNNs, referred to as the generative networks, with parameters θ = {θ_1, …, θ_V}. Note that the likelihoods for different data views are assumed to be independent of each other, with different nonlinear transformations.

The Bayesian Canonical Correlation Analysis (CCA) model [Klami et al. 2013] can be seen as a special case of our model, where linear shallow transformations were used to generate each data view and only two views were considered. [Wang et al. 2016] used a similar deep nonlinear generative process to construct a deep Bayesian CCA model, but during inference they construct the variational posterior approximation from just one view and ignore the other. Such a choice is convenient for inference and computation, but only finds suboptimal solutions, as it does not fully exploit the data. As shown in the following, we assume the variational approximation to the posterior of the latent variables to be a mixture of Gaussians, utilizing information from multiple views.
Typically, both the prior p(z) and the approximate posterior q_φ(z | x) are assumed to be Gaussian distributions [Kingma and Welling 2014, Rezende et al. 2014] in order to maintain mathematical and computational tractability. Although this assumption has led to favorable results on several tasks, it is clearly restrictive and often unrealistic. Specifically, the choice of a Gaussian distribution for p(z) and q_φ(z | x) imposes a strong uni-modal structure assumption on the latent space. However, for data distributions that are strongly multi-modal, the uni-modal Gaussian assumption inhibits the model's ability to extract and represent important structure in the data. To improve the flexibility of the model, one option is to impose a mixture of Gaussians assumption on the prior p(z). However, this risks creating separate "islands" of discontinuous manifolds that may break the meaningfulness of the representation in the latent space.

To learn more powerful and expressive models, in particular models with multi-modal latent variable structures for multi-modal emotion recognition applications, we seek a mixture of Gaussians for the posterior approximation, while preserving p(z) as a standard Gaussian. Thus (cf. Fig. 1b),
q_\phi(z \mid x_1, \ldots, x_V) = \sum_{v=1}^{V} \lambda_v \, \mathcal{N}\big(z \mid \mu_{\phi_v}(x_v), \operatorname{diag}(\sigma^2_{\phi_v}(x_v))\big), \qquad (2)
where the mean μ_{φ_v}(x_v) and the covariance diag(σ²_{φ_v}(x_v)) are nonlinear functions of the observation x_v, with variational parameters φ_v. As in our generative model, we choose these nonlinear functions to be DNNs, referred to as the inference networks. λ_v is the non-negative normalized weight factor for the v-th view, i.e., λ_v ≥ 0 and Σ_{v=1}^{V} λ_v = 1. By conditioning the posterior approximation on the data point, we avoid the need for per-data-point variational parameters and instead only require a set of global variational parameters φ. Note that our mixture-of-Gaussians assumption on the variational approximation distinguishes our method from all existing ones using the auto-encoding variational framework [Kingma and Welling 2014, Wang et al. 2016, Burda et al. 2016, Kingma et al. 2016, Serban et al. 2016, Maaløe et al. 2016]. For multi-view learning, this is critical for fully exploiting the information from multiple views.
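To make the mixture-of-Gaussians posterior concrete, here is a minimal sketch of how a sample could be drawn from Eq. (2) given the outputs of the per-view inference networks. The shapes, the softmax parameterization of the weights λ_v, and all variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture_posterior(mus, log_vars, weight_logits):
    """Draw one z from q(z | x_1..x_V) = sum_v lambda_v N(z | mu_v, diag(sigma_v^2)).

    mus, log_vars : per-view inference network outputs, shape (V, d_z)
    weight_logits : unconstrained scores mapped to normalized weights lambda_v
                    with a softmax (one convenient way to enforce the constraints).
    """
    lambdas = np.exp(weight_logits - weight_logits.max())
    lambdas /= lambdas.sum()                       # lambda_v >= 0 and sum to 1
    v = rng.choice(len(lambdas), p=lambdas)        # pick a component (i.e., a view)
    sigma = np.exp(0.5 * log_vars[v])
    return mus[v] + sigma * rng.standard_normal(mus.shape[1])

# toy example with V = 2 views and a 3-dimensional latent space
mus = np.array([[0.0, 0.1, -0.2], [0.3, 0.0, 0.1]])
log_vars = np.zeros((2, 3))
print(sample_mixture_posterior(mus, log_vars, weight_logits=np.array([0.5, 0.0])))
```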
[Figure 1: (a) the multi-view generative model; (b) the inference model with the mixture-of-Gaussians posterior approximation.]
In semi-supervised classification, only a subset of the samples have corresponding class labels, and we focus on using the multi-view VAE to build a model (semiMVAE) that learns a classifier from both labeled and unlabeled multi-view data. Since the emotional data are continuous, we choose Gaussian likelihoods. The generative model is then defined as (cf. Fig. 2a):
p(y) = \mathrm{Cat}(y \mid \boldsymbol{\pi}), \qquad p(z) = \mathcal{N}(z \mid 0, I), \qquad p_{\theta_v}(x_v \mid y, z) = \mathcal{N}\big(x_v \mid \mu_{\theta_v}(y, z), \operatorname{diag}(\sigma^2_{\theta_v}(y, z))\big), \qquad (3)
where Cat(y | π) denotes the categorical distribution, y is treated as a latent variable for the unlabeled data points, and the mean μ_{θ_v}(y, z) and variance σ²_{θ_v}(y, z) are nonlinear functions of y and z, with parameters θ_v. The inference model is defined as (cf. Fig. 2b):

q_\phi(z \mid x_1, \ldots, x_V, y) = \sum_{v=1}^{V} \lambda_v \, \mathcal{N}\big(z \mid \mu_{\phi_v}(x_v, y), \operatorname{diag}(\sigma^2_{\phi_v}(x_v, y))\big), \qquad q_\phi(y \mid x_1, \ldots, x_V) = \mathrm{Cat}\big(y \mid \pi_\phi(x_1, \ldots, x_V)\big), \qquad (4)
where q_φ(z | x_1, …, x_V, y) is assumed to be a mixture of Gaussians to combine the information from multiple data views. Intuitively, q_φ(z | x_1, …, x_V, y), p_θ(x_1, …, x_V | y, z) and q_φ(y | x_1, …, x_V) correspond to the encoder, the decoder and the classifier, respectively.
For brevity, we hereafter omit the explicit dependencies on x_v, y and z for the moment variables mentioned above. In principle, μ_{φ_v}, σ_{φ_v}, μ_{θ_v}, σ_{θ_v} and π_φ can be implemented by various DNN models, e.g., Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs).
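As an illustration of how these components could be realized with MLPs, the sketch below builds one Gaussian inference network per view, one Gaussian generative network per view, and a classifier network. The layer sizes, activations, and the label-concatenation scheme are illustrative assumptions and not the exact architecture used in the paper.

```python
import torch.nn as nn

class GaussianMLP(nn.Module):
    """MLP that outputs the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, out_dim)
        self.logvar = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.logvar(h)

# illustrative sizes: two views (e.g., 310-d EEG and 33-d eye features), 30-d latent, 3 classes
x_dims, z_dim, n_classes = [310, 33], 30, 3
encoders = nn.ModuleList([GaussianMLP(d + n_classes, 50, z_dim) for d in x_dims])  # mu_phi_v, sigma_phi_v
decoders = nn.ModuleList([GaussianMLP(z_dim + n_classes, 50, d) for d in x_dims])  # mu_theta_v, sigma_theta_v
classifier = nn.Sequential(nn.Linear(sum(x_dims), 50), nn.ReLU(),
                           nn.Linear(50, n_classes), nn.Softmax(dim=-1))            # pi_phi
```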
[Figure 2: (a) the semi-supervised multi-view generative model; (b) the corresponding inference model.]
The variational lower bound on the marginal likelihood for a single labeled data point is
\log p_\theta(x_1, \ldots, x_V, y) \ge \mathbb{E}_{q_\phi(z \mid x_1, \ldots, x_V, y)}\Big[\textstyle\sum_{v=1}^{V} \log p_{\theta_v}(x_v \mid y, z) + \log p(y) + \log p(z)\Big] + \mathcal{H}\big[q_\phi(z \mid x_1, \ldots, x_V, y)\big] = -\mathcal{L}(x_1, \ldots, x_V, y), \qquad (5)
where H[·] denotes the Shannon entropy. Note that the entropy of the Gaussian mixture q_φ(z | x_1, …, x_V, y) is hard to compute analytically, so we use Jensen's inequality to derive a tractable lower bound on it.
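For intuition, one standard Jensen-type lower bound on the entropy of a Gaussian mixture replaces the intractable term with a closed-form expression involving pairwise Gaussian densities N(μ_v | μ_u, Σ_v + Σ_u). The sketch below evaluates this bound for diagonal covariances; it is offered only as an illustration of such a bound, since the exact form used in the paper is not reproduced here.

```python
import numpy as np

def gaussian_density(x, mean, var):
    """Density of N(x | mean, diag(var)) for 1-D arrays x, mean, var."""
    return np.exp(-0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var)))

def mixture_entropy_lower_bound(lambdas, mus, variances):
    """Jensen lower bound on H[sum_v lambda_v N(mu_v, diag(var_v))]:
    H >= -sum_v lambda_v * log( sum_u lambda_u * N(mu_v | mu_u, diag(var_v + var_u)) )."""
    bound = 0.0
    for v in range(len(lambdas)):
        inner = sum(lambdas[u] * gaussian_density(mus[v], mus[u], variances[v] + variances[u])
                    for u in range(len(lambdas)))
        bound -= lambdas[v] * np.log(inner)
    return bound

# toy check: two well-separated unit-variance components in 1-D
print(mixture_entropy_lower_bound(np.array([0.5, 0.5]),
                                  np.array([[0.0], [5.0]]),
                                  np.array([[1.0], [1.0]])))
```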
For unlabeled data, we further use the variational distribution q_φ(y | x_1, …, x_V) to impute the missing label y:
\log p_\theta(x_1, \ldots, x_V) \ge \sum_{y} q_\phi(y \mid x_1, \ldots, x_V)\big(-\mathcal{L}(x_1, \ldots, x_V, y)\big) + \mathcal{H}\big[q_\phi(y \mid x_1, \ldots, x_V)\big] = -\mathcal{U}(x_1, \ldots, x_V), \qquad (6)
with q_φ(y | x_1, …, x_V) as defined in Eq. (4). The objective function for the entire dataset is now:
\mathcal{J} = \sum_{(x_1, \ldots, x_V, y) \in \mathcal{D}_l} \mathcal{L}(x_1, \ldots, x_V, y) + \sum_{(x_1, \ldots, x_V) \in \mathcal{D}_u} \mathcal{U}(x_1, \ldots, x_V), \qquad (7)
where D_l and D_u are the labeled and unlabeled datasets, respectively. The classification accuracy can be improved by introducing an explicit classification loss for labeled data. The extended objective function is:
\mathcal{J}^{\alpha} = \mathcal{J} + \alpha \cdot \mathbb{E}_{(x_1, \ldots, x_V, y) \in \mathcal{D}_l}\big[-\log q_\phi(y \mid x_1, \ldots, x_V)\big], \qquad (8)
where the hyper-parameter α balances generative and discriminative learning. We set α = ξ · (N_l + N_u), where ξ is a scaling constant, and N_l and N_u are the numbers of labeled and unlabeled data points in one minibatch, respectively. Note that the classifier q_φ(y | x_1, …, x_V) is also used at test time for the prediction of unseen data.
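As a toy illustration of how Eq. (8) combines the labeled bounds, the unlabeled bounds and the classification loss on a minibatch, consider the sketch below. The function name and inputs are hypothetical, and it assumes the setting α = ξ(N_l + N_u) described above.

```python
import numpy as np

def semi_supervised_objective(L_labeled, U_unlabeled, q_y_true, xi=0.1):
    """Minibatch objective of Eq. (8).

    L_labeled   : per-sample labeled bounds   L(x, y), Eq. (5)
    U_unlabeled : per-sample unlabeled bounds U(x),    Eq. (6)
    q_y_true    : classifier probability q_phi(y | x) of the true label (labeled samples only)
    xi          : scaling constant; alpha = xi * (N_l + N_u) is assumed here
    """
    n_l, n_u = len(L_labeled), len(U_unlabeled)
    alpha = xi * (n_l + n_u)
    classification_loss = -np.mean(np.log(np.asarray(q_y_true) + 1e-12))
    return np.sum(L_labeled) + np.sum(U_unlabeled) + alpha * classification_loss

# toy usage with made-up numbers
print(semi_supervised_objective(np.array([120.3, 98.7]), np.array([150.2, 160.8, 155.1]),
                                q_y_true=[0.7, 0.9], xi=0.1))
```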
Eq. (8) provides a unified objective function for optimizing the parameters of the encoder, decoder and classifier networks. This optimization can be done jointly, without resort to the variational EM algorithm, using the stochastic backpropagation technique [Kingma and Welling 2014, Rezende et al. 2014].

The reparameterization trick is a vital component of the algorithm, because it allows us to easily take derivatives of E_{q_φ(z | x_1, …, x_V, y)}[f(z)] with respect to the variational parameters φ. However, the use of a mixture of Gaussians for the variational distribution makes the application of the reparameterization trick challenging. It can be shown that, for any function f(z), this expectation can be rewritten, using the location-scale transformation for the Gaussian distribution, as:
\mathbb{E}_{q_\phi(z \mid x_1, \ldots, x_V, y)}[f(z)] = \sum_{v=1}^{V} \lambda_v \, \mathbb{E}_{\mathcal{N}(\epsilon \mid 0, I)}\big[f(\mu_v + \sigma_v \odot \epsilon)\big], \qquad (9)
where μ_v = μ_{φ_v}(x_v, y), σ_v = σ_{φ_v}(x_v, y), and ⊙ denotes the element-wise product.
While the expectations on the right-hand side of Eq. (9) still cannot be solved analytically, their gradients w.r.t. θ, φ_v and λ_v can be efficiently estimated using the following Monte-Carlo estimators:
\nabla_{\theta} \, \mathbb{E}_{q_\phi(z \mid x_1, \ldots, x_V, y)}[f(z)] \simeq \sum_{v=1}^{V} \lambda_v \frac{1}{K} \sum_{k=1}^{K} \nabla_{\theta} f\big(\mu_v + \sigma_v \odot \epsilon^{(k)}\big), \qquad (10)

\nabla_{\phi_v} \, \mathbb{E}_{q_\phi(z \mid x_1, \ldots, x_V, y)}[f(z)] \simeq \lambda_v \frac{1}{K} \sum_{k=1}^{K} \nabla_{\phi_v} f\big(\mu_v + \sigma_v \odot \epsilon^{(k)}\big), \qquad (11)

\nabla_{\lambda_v} \, \mathbb{E}_{q_\phi(z \mid x_1, \ldots, x_V, y)}[f(z)] \simeq \frac{1}{K} \sum_{k=1}^{K} f\big(\mu_v + \sigma_v \odot \epsilon^{(k)}\big), \qquad (12)
where f is evaluated at z^{(v,k)} = μ_v + σ_v ⊙ ε^{(k)} with ε^{(k)} ∼ N(0, I). In practice, it suffices to use a small number of samples K (e.g., K = 1) and then estimate the gradients using minibatches of data points. We use the same random numbers ε^{(k)} for all estimators to obtain lower variance. The gradients of the remaining terms are omitted here, since they can be derived straightforwardly by using the traditional reparameterization trick [Kingma et al. 2014].
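The following sketch illustrates the estimator of Eq. (9): shared standard-normal noise is transformed per view by the location-scale mapping, and the per-view averages are combined with the weights λ_v. The integrand, shapes and sample count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixture_expectation(f, mus, sigmas, lambdas, n_samples=1):
    """Monte-Carlo estimate of E_{q(z)}[f(z)] for the mixture posterior, as in Eq. (9).

    mus, sigmas : per-view means / standard deviations, shape (V, d_z) -- illustrative shapes
    lambdas     : normalized view weights, shape (V,)
    """
    V, d_z = mus.shape
    eps = rng.standard_normal((n_samples, d_z))          # shared noise across views (lower variance)
    est = 0.0
    for v in range(V):
        z = mus[v] + sigmas[v] * eps                     # location-scale transform for view v
        est += lambdas[v] * np.mean([f(z_k) for z_k in z])
    return est

# toy usage: E[||z||^2] under a 2-view mixture
mus = np.array([[0.0, 0.0], [1.0, -1.0]])
sigmas = np.ones((2, 2))
lambdas = np.array([0.6, 0.4])
print(mixture_expectation(lambda z: np.sum(z ** 2), mus, sigmas, lambdas, n_samples=1000))
```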
The gradients of the loss for semiMVAE (Eq. (8)) can then be computed by a direct application of the chain rule and the estimators presented above. During optimization, we can use the estimated gradients in conjunction with standard stochastic-gradient-based optimization methods such as SGD, RMSprop or Adam [Kingma and Ba 2014]. Overall, the model can be trained with the reparameterization trick for backpropagation through the mixture-of-Gaussians latent variables.

In this section, we present extensive experimental results to demonstrate the effectiveness of the proposed semi-supervised multi-view framework for emotion recognition.
Data description Two multi-modal emotion datasets, the SEED dataset (http://bcmi.sjtu.edu.cn/%7Eseed/index.html) [Lu et al. 2015] and the DEAP dataset (http://www.eecs.qmul.ac.uk/mmv/datasets/deap/download.html) [Koelstra et al. 2012], were used in our experiments.
The SEED dataset contains EEG and eye movement signals from 15 subjects recorded while they watched 15 movie clips, each lasting about 4 minutes. The EEG signals were recorded from 62 channels, and the eye movement signals contained information about blinks, saccades, fixations and so on. In our experiment, we used the data from 9 subjects across 3 sessions, 27 data files in total. For each data file, the data from movie clips 1-9 were used as the training set, the data from movie clips 10-12 as the validation set, and the rest (13-15) as the testing set.
The DEAP dataset contains EEG and peripheral physiological signals of 32 participants, recorded while they watched 40 one-minute music videos. The EEG signals were recorded from 32 channels, whereas the peripheral physiological signals were recorded from 8 channels. The participants rated each music video on scales from 1 to 9 in terms of valence, arousal and other dimensions. In our experiment, the valence-arousal space was divided into four quadrants according to the ratings, using a threshold of 5 and leading to four classes of data. Considering the fuzzy boundary of emotions and the variations in participants' ratings possibly associated with individual differences in rating scale, we discarded the samples whose arousal and valence ratings lie between 3 and 6. The dataset was randomly divided into 10 folds, with 8 folds for training, one fold for validation and the last fold for testing. The testing set is relatively small because some graph-based semi-supervised baselines cannot handle large datasets.
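As a concrete reading of this labeling rule, the sketch below maps a (valence, arousal) rating pair to one of the four quadrant classes and drops fuzzy-boundary samples. The class encoding and the interpretation that a sample is discarded only when both ratings fall strictly between 3 and 6 are our assumptions.

```python
def deap_label(valence, arousal, threshold=5.0, fuzzy=(3.0, 6.0)):
    """Map DEAP valence/arousal ratings (1-9) to one of four quadrant classes.

    Returns None for samples in the fuzzy band, which are discarded.
    Class encoding (0=LVLA, 1=HVLA, 2=LVHA, 3=HVHA) is an illustrative choice.
    """
    lo, hi = fuzzy
    if lo < valence < hi and lo < arousal < hi:
        return None                                   # ambiguous sample, discarded
    return 2 * int(arousal >= threshold) + int(valence >= threshold)

print(deap_label(7.5, 2.0))   # high valence, low arousal -> class 1
print(deap_label(4.5, 5.5))   # both ratings in (3, 6)    -> None (discarded)
```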
Feature selection For the SEED dataset, [Lu et al. 2015] extracted Differential Entropy (DE) features and 33 eye movement features from the EEG and eye movement signals, and we used these features in our experiments. For the DEAP dataset, we extracted DE features from the EEG and peripheral physiological signals. The DE features can be calculated in four frequency bands: theta (4-8 Hz), alpha (8-14 Hz), beta (14-31 Hz), and gamma (31-45 Hz), and we used the features from all bands (a sketch of the band-wise DE computation is given after Table 1). The details of the data used in our experiments are summarized in Table 1.
Datasets | #Instances | #Features | #Training | #Validation | #Testing | #Classes
---|---|---|---|---|---|---
SEED | 22734 | 310 (EEG), 33 (Eye) | 13473 | 4725 | 4536 | 3
DEAP | 21042 | 128 (EEG), 32 (Phy.) | 16834 | 2104 | 2104 | 4
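For reference, under a Gaussian assumption the differential entropy of a band-filtered signal segment is 0.5·ln(2πeσ²). The sketch below computes the four band-wise DE features for a single channel; the sampling rate, filter type and order are illustrative placeholders rather than the preprocessing actually used for SEED or DEAP.

```python
import numpy as np
from scipy.signal import butter, filtfilt

BANDS = {"theta": (4, 8), "alpha": (8, 14), "beta": (14, 31), "gamma": (31, 45)}

def differential_entropy_features(signal, fs=200.0, order=4):
    """Band-wise DE features of a 1-D signal segment: DE = 0.5 * ln(2*pi*e*var)."""
    feats = {}
    for name, (low, high) in BANDS.items():
        b, a = butter(order, [low, high], btype="bandpass", fs=fs)
        filtered = filtfilt(b, a, signal)
        feats[name] = 0.5 * np.log(2 * np.pi * np.e * np.var(filtered))
    return feats

# toy usage on random noise standing in for one EEG channel segment
print(differential_entropy_features(np.random.default_rng(0).standard_normal(2000)))
```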
Compared methods We compared our semiMVAE with a broad range of solutions, including supervised learning, transductive and inductive semi-supervised learning. We briefly summarize the various baselines in the following.
MAE: the multi-view extension of deep autoencoders, which can be used to extract joint representations from multiple modalities [Ngiam et al. 2011].
DCCA: the full deep neural network extension of Canonical Correlation Analysis (CCA). DCCA learns deep nonlinear mappings of two views that are maximally correlated [Andrew et al. 2013].
DCCAE: a deep multi-view representation learning model which combines the advantages of DCCA and deep autoencoders. In particular, DCCAE consists of two autoencoders and optimizes the combination of the canonical correlation between the learned bottleneck representations and the reconstruction errors of the autoencoders [Wang et al. 2015].
AMMSS: a graph-based multi-view semi-supervised classification algorithm, which can integrate heterogeneous features from both labeled and unlabeled data [Cai et al. 2013].
AMGL: a recent auto-weighted multiple graph learning framework, which can be applied to the multi-view semi-supervised classification task [Nie et al. 2016].
semiVAE: a single-view semi-supervised deep generative model proposed in [Kingma et al. 2014]. We evaluate semiVAE's performance for each modality and for the concatenation of all modalities, respectively.
For MAE, DCCA and DCCAE, we used Support Vector Machines (SVM; http://www.csie.ntu.edu.tw/%7Ecjlin/liblinear/) and the transductive SVM (TSVM; http://svmlight.joachims.org/) for supervised learning and transductive semi-supervised learning, respectively.

Parameter setting For semiMVAE, we used multi-layer perceptrons as the inference and generative networks. On both datasets, we set the structures of the inference and generative networks for each view to '100-50-30' and '30-50-100', respectively. We used the Adam optimizer [Kingma and Ba 2014] with a fixed learning rate in training. The scaling constant ξ was selected from {0.1, 0.5, 1} throughout the experiments. The weight factor for each view was initialized to 1/V, where V is the number of views. For MAE, DCCA and DCCAE, we used the same setups (network structure, learning rate, etc.) as for semiMVAE. For AMMSS, we tuned the parameters as suggested in [Cai et al. 2013]. For AMGL and semiVAE, we used their default settings.

To simulate the semi-supervised learning scenario, on both datasets we randomly labeled different proportions of samples in the training set and left the remaining training samples unlabeled. For transductive semi-supervised learning, we trained models on the dataset consisting of the testing data and the labeled data belonging to the training set. For inductive semi-supervised learning, we trained models on the entire training set consisting of the labeled and unlabeled data. For supervised learning, we trained models on the labeled data belonging to the training set and tested their performance on the testing set. Table 2 presents the classification accuracies of all methods on the SEED and DEAP datasets, with the proportion of labeled samples in the training set varying from 1% to 3%. Several observations can be drawn as follows.
SEED data | Algorithms | 1% labeled | 2% labeled | 3% labeled
---|---|---|---|---
Supervised learning | MAE+SVM | .814±.031 | .896±.024 | .925±.024
 | DCCA+SVM | .809±.035 | .891±.035 | .923±.028
 | DCCAE+SVM | .819±.036 | .893±.034 | .923±.027
Transductive semi-supervised learning | AMMSS | .731±.055 | .839±.036 | .912±.018
 | AMGL | .711±.047 | .817±.023 | .886±.028
 | MAE+TSVM | .818±.035 | .910±.025 | .931±.026
 | DCCA+TSVM | .811±.031 | .903±.024 | .928±.021
 | DCCAE+TSVM | .823±.040 | .907±.027 | .929±.023
 | semiMVAE | .861±.037 | .931±.020 | .960±.021
Inductive semi-supervised learning | semiVAE (Eye) | .753±.024 | .849±.055 | .899±.049
 | semiVAE (EEG) | .768±.041 | .861±.040 | .919±.026
 | semiVAE (Concat.) | .803±.035 | .876±.043 | .926±.044
 | semiMVAE | .880±.033 | .955±.020 | .968±.015
DEAP data | Algorithms | 1% labeled | 2% labeled | 3% labeled
---|---|---|---|---
Supervised learning | MAE+SVM | .353±.027 | .387±.014 | .411±.016
 | DCCA+SVM | .359±.016 | .400±.014 | .416±.018
 | DCCAE+SVM | .361±.023 | .403±.017 | .419±.013
Transductive semi-supervised learning | AMMSS | .303±.029 | .353±.024 | .386±.014
 | AMGL | .291±.027 | .341±.021 | .367±.019
 | MAE+TSVM | .376±.025 | .403±.031 | .417±.026
 | DCCA+TSVM | .379±.021 | .408±.024 | .421±.017
 | DCCAE+TSVM | .384±.022 | .412±.027 | .425±.021
 | semiMVAE | .424±.020 | .441±.013 | .456±.013
Inductive semi-supervised learning | semiVAE (Phy.) | .366±.024 | .389±.048 | .402±.034
 | semiVAE (EEG) | .374±.019 | .397±.013 | .407±.016
 | semiVAE (Concat.) | .383±.019 | .404±.016 | .416±.012
 | semiMVAE | .421±.019 | .439±.025 | .451±.022
First, the average accuracy of semiMVAE significantly surpasses the baselines in all cases. Second, by examining semiMVAE against supervised learning approaches trained on very limited labeled data, we find that semiMVAE always outperforms them. This encouraging result shows that semiMVAE can effectively leverage the useful information in unlabeled data. Third, the multi-view semi-supervised algorithms AMMSS and AMGL perform worst in all cases. We attribute this to the fact that the graph-based shallow models AMMSS and AMGL cannot extract deep features from the original data. Fourth, the performance of the three TSVM-based semi-supervised methods is moderate. Although MAE+TSVM, DCCA+TSVM and DCCAE+TSVM can also integrate multi-modality information from unlabeled samples, their two-stage learning cannot obtain globally optimal model parameters. Finally, compared with the single-view semi-supervised method semiVAE, our multi-view method is more effective in integrating multiple modalities.
[Figures 3 and 4: average accuracy of semiMVAE on the SEED and DEAP datasets with varying proportions of labeled and unlabeled samples in the training set.]
The proportions of labeled and unlabeled samples in the training set affect the performance of semi-supervised models. Figs. 3 and 4 show how semiMVAE's average accuracy on both datasets changes with different proportions of labeled and unlabeled samples in the training set. We observe that both labeled and unlabeled samples can effectively boost the classification accuracy of semiMVAE.
Instead of treating each modality equally, our semiMVAE can weight each modality and perform classification simultaneously. Fig. 5a shows the weight factors learned by inductive semiMVAE on the SEED and DEAP datasets (with a fixed proportion of labeled data). We observe that the EEG modality has the highest weight on both datasets, which is consistent with the single-modality performance of semiVAE shown in Table 2 and with the results of previous work [Lu et al. 2015].
[Figure 5: (a) the learned weight factors for each modality on SEED and DEAP; (b) the performance of inductive semiMVAE for different values of the scaling constant ξ.]
The scaling constant ξ controls the weight of discriminative learning in semiMVAE. Fig. 5b shows the performance of inductive semiMVAE with different ξ values (with a fixed proportion of labeled data). We find that choosing ξ from {0.1, 0.5, 1} gives good results.
This paper proposes a semi-supervised multi-view deep generative framework for emotion recognition, which can leverage both labeled and unlabeled data from multiple modalities. The framework has two key components: 1) the multi-view VAE fully integrates the information from multiple modalities, and 2) semi-supervised learning overcomes the labeled-data-scarcity problem. Experimental results on two real multi-modal emotion datasets demonstrate the effectiveness of our approach.
A novel semi-supervised deep learning framework for affective state recognition on EEG signals. In International Conference on Bioinformatics and Bioengineering (BIBE), pages 30–37. IEEE, 2014.
Journal of Machine Learning Research, 14(1):965–1003, 2013.