1. Introduction
With the development of human-computer interaction (HCI), emotion recognition has become increasingly important. Since human emotion involves many nonverbal cues, various modalities, ranging from facial expressions, body gestures and voice to physiological signals, can be used as indicators of emotional states (Calvo and D’Mello, 2010; Poria et al., 2017). In real-world applications, it is difficult to recognize a person’s emotional state from a single modality, because signals from different modalities represent different aspects of emotion and provide complementary information. Recent studies show that integrating multiple modalities can significantly boost emotion recognition accuracy (Lu et al., 2015; Ranganathan et al., 2016; Soleymani et al., 2016). The most successful approach to fusing information from multiple modalities is based on deep multi-view representation learning (Ngiam et al., 2011; Srivastava and Salakhutdinov, 2014; Wang et al., 2015). For example, Pang et al. (2015) proposed a joint density model for emotion analysis with a multimodal deep Boltzmann machine (DBM) (Srivastava and Salakhutdinov, 2014). This multimodal DBM models the joint distribution over visual, auditory, and textual features. Liu et al. (2016) proposed a multimodal emotion recognition method using multimodal autoencoders (MAE) (Ngiam et al., 2011), in which joint representations of electroencephalogram (EEG) and eye movement signals were extracted. Nevertheless, these deep multimodal emotion recognition methods still have limitations: their performance depends on the amount of labeled data, and they cannot handle incomplete data.
Using modern sensing equipment, we can easily collect massive emotion-related data from multiple modalities. However, labeling the data requires substantial manual effort. Therefore, in most cases only a small set of labeled samples is available, while the majority of the dataset is left unlabeled. In addition to the challenge of insufficient labeled data, one must often address the incomplete-data problem, i.e., not all modalities are available for every data point. Incomplete data can arise for various reasons. For example, an unforeseeable sensor malfunction may prevent a signal from being recorded, leaving data points with one or more missing modalities. Traditional multimodal emotion recognition approaches (Pang et al., 2015; Lu et al., 2015; Liu et al., 2016) use only the limited amount of labeled data, which may result in severe overfitting. Also, most of them neglect the missing-modality issue, which greatly limits their applicability in real-world scenarios. The most attractive way to deal with these issues is semi-supervised learning (SSL) with incomplete data. SSL can improve a model’s generalization ability by exploiting labeled and unlabeled data simultaneously (Schels et al., 2014; Jia et al., 2014; Zhang et al., 2016), and learning from incomplete data helps ensure the robustness of the emotion recognition system (Wagner et al., 2011).
In this paper, we show that the problems mentioned above can be resolved under a unified multi-view deep generative framework. To model the statistical relationships of multi-modality emotional data, a shared latent variable is transformed by different modality-specific generative networks into different data views (modalities). Instead of treating each view equally, we impose a non-uniformly weighted Gaussian mixture assumption on the posterior approximation of the shared latent variables. This is critical for inferring the joint latent representation and the weight factor of each view from multiple modalities. During optimization, a second lower bound to the variational lower bound is derived to address the intractable entropy of a mixture of Gaussians. To leverage the contextual information in the unlabeled data to augment the limited labeled data, we then extend our multi-view framework to the SSL scenario. This is achieved by casting the semi-supervised classification problem as a specialized missing data imputation task. Specifically, we treat the unknown labels as latent variables and estimate them within a multi-view auto-encoding variational Bayes framework. We further extend the proposed SSL algorithm to the incomplete-data case by introducing latent variables for the missing views. Besides the unknown labels, the missing views are also integrated out so that the marginal likelihood is maximized with respect to the model parameters. In this way, our SSL algorithm can utilize all available data: both labeled and unlabeled, as well as both complete and incomplete. Since the category information and the uncertainty of the missing view are taken into account in the training process, our SSL algorithm is more powerful than traditional missing view imputation methods (Quanz and Huan, 2012; Wang et al., 2015; Hastie et al., 2015; Chandar et al., 2016). Finally, we demonstrate the superiority of our framework and provide insightful observations on two real multimodal emotion datasets.
2. Related Work
Multimodal approaches have been widely applied to emotion recognition (Pang et al., 2015; Lu et al., 2015; Soleymani et al., 2016; Liu et al., 2016; Ranganathan et al., 2016; Tzirakis et al., 2017; Zheng et al., 2018). For example, Ranganathan et al. (2016) used a multimodal deep belief network (DBN) to extract features from face, body gesture, voice and physiological signals for emotion classification. Lu et al. (2015) classified the combination of EEG and eye movement signals into three affective states. However, very few of these works explored SSL. To the best of our knowledge, only Zhang et al. (2016) proposed an enhanced multimodal co-training algorithm for semi-supervised emotion recognition, but its shallow structure makes it hard to capture the high-level correlation between different modalities. In addition, most prior work in this field assumes that all modalities are available at all times (Zhang et al., 2016; Zheng et al., 2018), which is not realistic in practical environments. In contrast to the above methods, our framework naturally allows us to perform multimodal emotion recognition in SSL and incomplete-data situations.
The variational autoencoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014) is one of the most popular deep generative models (DGMs). The VAE has shown great advantages in semi-supervised classification (Kingma et al., 2014; Maaløe et al., 2016). For example, Kingma et al. (2014) proposed a semi-supervised VAE (M2) by modeling the joint distribution over data and labels. Maaløe et al. (2016) proposed the auxiliary DGMs (ADGM and SDGM) by introducing auxiliary variables, which improve the variational approximation. However, these models cannot effectively deal with multi-view data, especially in the incomplete-view case. Our proposed semi-supervised multi-view DGMs distinguish our method from all existing ones using the VAE framework (Wang et al., 2016; Burda et al., 2016; Kingma et al., 2016; Serban et al., 2016; Maaløe et al., 2016).
The incomplete-data problem is often circumvented via imputation methods (Williams et al., 2007; Amini et al., 2009; Wang et al., 2010; Xu et al., 2015; Zhang et al., 2018). Common imputation schemes include matrix completion (Keshavan et al., 2009; Hazan et al., 2015; Hastie et al., 2015) and autoencoder-based methods (Wang et al., 2015; Chandar et al., 2016; Luan et al., 2017; Shang et al., 2017). Matrix completion methods, such as SoftImpute-ALS (Hastie et al., 2015), focus on imputing the missing entries of a partially observed matrix based on the assumption that the completed matrix has a low-rank structure. Matrix completion methods often assume that data are missing at random (MAR), which might not be optimal for our problem, where modalities are missing in contiguous blocks. On the other hand, autoencoder-based methods, such as DCCAE (Wang et al., 2015) and CorrNet (Chandar et al., 2016), exploit the connections between views, enabling the incomplete view to be restored with the help of the complete view. Besides the low-rank structure of the data matrix and the connections between views, category information is also important for missing view imputation, even though category labels may be only partially observed. So far, very few algorithms (Yu et al., 2011; Quanz and Huan, 2012) can estimate the missing view under the SSL scenario. Although CoNet (Quanz and Huan, 2012) utilized deep neural networks (DNNs) to predict the missing view based on existing views and partially observed labels, its feedforward structure could not integrate multiple views effectively in classification. Additionally, most previous works treat the missing data as fixed values and hence ignore the uncertainty of the missing data. Unlike them, our SiMVAE essentially performs infinite imputations by integrating out the missing data.
3. Methodology
In this section, we first develop a multiview variational autoencoder (MVAE) model for fusing multimodality emotional data. Based on MVAE, we further build a semisupervised emotion recognition algorithm. Finally, we develop a more robust semisupervised algorithm to address the incomplete multimodality emotional data. For simplicity we restrict further discussion to the case of two views, though all the proposed methods can be extended to more than two views. Assume we are faced with multiview data that appears as pairs , with observation from the th view and the corresponding class label .
3.1. Multi-view Variational Autoencoder
3.1.1. DNN-parameterized Likelihoods
We assume that multiple data views (modalities) are generated independently from a shared latent space with multiple viewspecific generative networks. Specifically, we assume a shared latent variable generates with the following generative model (cf. Figure 1a):
(1) 
where
is a suitable likelihood function (e.g. a Gaussian for continuous observation or Bernoulli for binary observation), which is formed by a nonlinear transformation of the latent variable
. This nonlinear transformation is essential to allow for higher moments of the data to be captured by the density model, and we choose these nonlinear functions to be DNNs, referred to as the generative networks, with parameters
. Note that the likelihoods for different views are assumed to be independent of each other, with potentially different DNN types for different modalities.
3.1.2. Gaussian Prior and Gaussian Mixture Posterior
In the vanilla VAE (Kingma and Welling, 2014; Rezende et al., 2014), which can only handle single-view data, both the prior and the approximate posterior
are assumed to be Gaussian distributions in order to maintain mathematical and computational tractability. Although this assumption has led to favorable results on several tasks, it is clearly restrictive and often unrealistic. Specifically, the choice of a Gaussian distribution for
and imposes a strong unimodal structure assumption on the latent space. However, for data distributions that are strongly multimodal, the unimodal Gaussian assumption inhibits the model’s ability to extract and represent important structure in the data. One way to improve the flexibility of the model is to impose a Mixture of Gaussians (MoG) assumption on . However, this risks creating separate “islands” of discontinuous manifolds that may break the meaningfulness of the representation in the latent space.
To learn more powerful and expressive models (in particular, models with multimodal latent variable structures for multimodal emotion recognition applications) we seek a MoG for , while preserving as a standard Gaussian. Thus the prior distribution and the inference model (cf. Figure 1b) are defined as: ,
(2) 
where the mean and the covariance are nonlinear functions of the observation , with variational parameter . As in our generative model, we choose these nonlinear functions to be DNNs, referred to as the inference networks. is the nonnegative normalized weight factor for the th view, i.e., and . Note that, Gershman et al. (Gershman et al., 2012) proposed a nonparametric variational inference method by simply assuming the variational distribution to be a uniformly weighted Gaussian mixture. However, treating each component equally will lose flexibility in fusing multiple data views. Instead of treating each view equally, our nonuniformly weighted Gaussian mixture assumption can weight each view automatically in subsequent emotion recognition tasks, which is useful to identify the importance of each view.
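The non-uniformly weighted Gaussian mixture posterior described above can be illustrated with a small numerical sketch. Everything named here is an assumption made for illustration: the view weights are produced by a softmax over free scores (one simple way to keep them nonnegative and normalized), the covariances are diagonal, and the encoder outputs are hard-coded numbers rather than the outputs of trained inference networks.

```python
import math

def softmax(scores):
    """Map unconstrained scores to nonnegative weights that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def diag_gaussian_logpdf(z, mean, var):
    """Log-density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (zi - m) ** 2 / v)
        for zi, m, v in zip(z, mean, var)
    )

def mixture_logpdf(z, means, variances, weights):
    """Log-density of the weighted mixture, computed with log-sum-exp."""
    terms = [
        math.log(w) + diag_gaussian_logpdf(z, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def mixture_mean(means, weights):
    """Mean of the mixture: the weight-averaged component means."""
    dim = len(means[0])
    return [sum(w * m[d] for w, m in zip(weights, means)) for d in range(dim)]

# Two views with hypothetical encoder outputs (illustrative numbers only).
weights = softmax([1.0, 0.0])          # non-uniform view weights
means = [[0.5, -1.0], [1.5, 0.0]]      # one mean vector per view
variances = [[1.0, 1.0], [0.5, 2.0]]   # one diagonal covariance per view
fused = mixture_mean(means, weights)
```

With uniform weights this reduces to the equally weighted mixture; letting the weights differ is what allows one view to dominate the fused representation.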
3.2. Semi-supervised Multimodal Emotion Recognition
Although many supervised emotion recognition algorithms exist (see (Poria et al., 2017) for a thorough literature review), very few semi-supervised algorithms have been proposed to improve the recognition performance by utilizing both labeled and unlabeled data. Here we extend MVAE by introducing a conditional probabilistic distribution for the unknown labels to obtain a semi-supervised multi-view classification algorithm.
3.2.1. Generative model
Since the emotional data is continuous, we choose the Gaussian likelihoods. Then our generative model (cf. Figure 1c) is defined as :
(3)  
where denotes the categorical distribution, is treated as a latent variable for the unlabeled data points, and the mean
and variance
are nonlinear functions of and , with parameter .
3.2.2. Inference model
The inference model (cf. Figure 1d) is defined as :
(4)  
where is the introduced conditional distribution for , and is assumed to be a mixture of Gaussians to combine the information from multiple data views and the label. Intuitively, , and correspond to the encoder, decoder and classifier, respectively. For brevity, we omit the explicit dependencies on , and for the moment variables mentioned above hereafter. In principle, , , , and
can be implemented by various DNN models, e.g., multilayer perceptrons and convolutional neural networks.
3.2.3. Objective function
In the semi-supervised setting, there are two lower bounds for the labeled and unlabeled cases, respectively. The variational lower bound on the marginal likelihood for a single labeled data point is
(5) 
where . Note that the Shannon entropy is hard to compute analytically, and we have used Jensen’s inequality to derive a lower bound on it (see Supplementary Material Section A for details). For an unlabeled data point, the variational lower bound on the marginal likelihood is given by:
(6)  
with .
Therefore, the objective function for the entire dataset is:
(7) 
where and denote labeled and unlabeled dataset, respectively. The classification accuracy can be improved by introducing an explicit classification loss for labeled data, and the extended objective function is now:
(8) 
where is a weight parameter between generative and discriminative learning. We set , where is a scaling constant, and and are the numbers of labeled and unlabeled data points in one minibatch, respectively. Note that, the classifier is also used at test phase for the prediction of unseen data. Eq. (8) provides a unified objective function for optimizing the parameters of encoder, decoder and classifier networks.
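To make the entropy step above concrete, the sketch below evaluates one standard Jensen-type lower bound on the entropy of a Gaussian mixture. It is an illustration under stated assumptions, not necessarily the exact expression derived in Supplementary Material Section A: it assumes diagonal covariances and uses the identity that the expectation of one Gaussian density under another is a Gaussian density evaluated at the first mean, with the two covariances added. All function names are hypothetical.

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def entropy_lower_bound(means, variances, weights):
    """Jensen bound: H[q] >= -sum_i w_i log sum_j w_j N(mu_i; mu_j, S_i + S_j)."""
    bound = 0.0
    for w_i, m_i, v_i in zip(weights, means, variances):
        inner = 0.0
        for w_j, m_j, v_j in zip(weights, means, variances):
            summed = [a + b for a, b in zip(v_i, v_j)]  # covariances add
            inner += w_j * math.exp(diag_gaussian_logpdf(m_i, m_j, summed))
        bound -= w_i * math.log(inner)
    return bound

def gaussian_entropy(var):
    """Exact entropy of a single diagonal Gaussian, for comparison."""
    return 0.5 * sum(math.log(2.0 * math.pi * math.e * v) for v in var)

# Sanity check with a single unit-variance component in two dimensions.
bound = entropy_lower_bound([[0.0, 0.0]], [[1.0, 1.0]], [1.0])
exact = gaussian_entropy([1.0, 1.0])
```

For a single component the bound is strictly below the exact entropy, which shows that it is a genuine lower bound rather than an identity. In exchange, it is available in closed form for any number of components.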
3.2.4. Parameter optimization
Parameter optimization can be done jointly by using the stochastic backpropagation technique (Kingma and Welling, 2014; Rezende et al., 2014). The reparameterization trick (Kingma and Welling, 2014; Kingma et al., 2014) is a vital component, because it allows us to take the derivative of w.r.t. the variational parameters . However, the use of a Gaussian mixture for the variational posterior distribution makes it infeasible to apply the reparameterization trick directly. It can be shown that, for any , can be rewritten, using the location-scale transformation for the Gaussian distribution, as:
(9)  
where and . While the expectations on the right-hand side still cannot be solved analytically, their gradients w.r.t. , and can be efficiently estimated using the Monte Carlo method (see Supplementary Material Section B for details). The gradients of the objective function (Eq. (8)) can then be computed by using the chain rule and the derived Monte Carlo estimators.
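The location-scale rewriting described above can be sketched numerically. The idea is that an expectation under a Gaussian mixture splits into a weighted sum of per-component expectations, each written in terms of standard normal noise. The code below is a minimal illustration with diagonal covariances. The function name, the sample count and the fixed seed are assumptions, and no automatic differentiation is performed.

```python
import random

def mixture_expectation(f, means, stds, weights, n_samples=1000, seed=0):
    """Monte Carlo estimate of E_q[f(z)] for q = sum_v w_v N(mean_v, diag(std_v^2)).

    Each component is sampled as z = mean + std * eps with eps ~ N(0, I), so z
    is a deterministic function of (mean, std) once eps is drawn. The weights
    enter only as explicit multipliers of the per-component averages.
    """
    rng = random.Random(seed)
    dim = len(means[0])
    total = 0.0
    for w, mean, std in zip(weights, means, stds):
        acc = 0.0
        for _ in range(n_samples):
            eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            z = [m + s * e for m, s, e in zip(mean, std, eps)]
            acc += f(z)
        total += w * acc / n_samples
    return total

# Example: expected first coordinate under a two-component mixture.
estimate = mixture_expectation(
    lambda z: z[0],
    means=[[1.0, 0.0], [3.0, 0.0]],
    stds=[[0.5, 0.5], [0.5, 0.5]],
    weights=[0.25, 0.75],
)
```

Because the noise is drawn independently of the means, scales and weights, a framework with automatic differentiation could differentiate such an estimate with respect to all three.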
3.3. Handling Incomplete Data
In the above discussion it is assumed that all modalities are available for every data point. In practice, however, many samples have incomplete modalities (i.e., one or more missing modalities) (Wagner et al., 2011). In light of this, we further develop a semi-supervised incomplete multi-view classification algorithm (SiMVAE). For simplicity, we assume only one view (either or ) is incomplete, though our model can be easily extended to more sophisticated cases. We partition each data point into an observed view and a missing view (i.e., ).
3.3.1. Generative model
In this setting, only a subset of the samples have complete views and corresponding labels. We regard both the unknown label and the missing view as latent variables. Then our generative model (cf. Figure 1e) is defined as :
(10)  
where and are DNNs with parameters and , respectively. and are defined as in Eq. (3).
3.3.2. Inference model
As multimodality emotional data are collected from the same subject, there must be some underlying relationships between modalities, though they focus on different information. Given the observed modality, the estimation of missing modality is feasible if we capture the relationships between modalities. Therefore, the inference model (cf. Figure 1f) is defined as , with
(11) 
where is a DNN with parameter . and are defined as in Eq. (4). Intuitively, we formulate the missing view imputation as a conditional distribution estimation task (conditioned on the observed view). Compared with existing single imputation methods (Chandar et al., 2016; Shang et al., 2017; Luan et al., 2017), our model essentially performs infinite imputations and hence takes the uncertainty of the missing data into account. To obtain a single imputation of rather than the full conditional distribution one can evaluate .
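The difference between a single imputation and repeated draws from the conditional distribution can be illustrated as follows. This is only a sketch under stated assumptions: a linear map stands in for the imputation network, the conditional Gaussian has a diagonal covariance that does not depend on the observed view, and all names and numbers are hypothetical.

```python
import math
import random

def conditional_gaussian(x_obs, weight, bias, log_var):
    """Hypothetical stand-in for the imputation network.

    Returns the mean and standard deviation of a diagonal Gaussian over the
    missing view given the observed view. A linear map replaces the DNN.
    """
    mean = [sum(w * x for w, x in zip(row, x_obs)) + b
            for row, b in zip(weight, bias)]
    std = [math.exp(0.5 * lv) for lv in log_var]
    return mean, std

def single_imputation(x_obs, weight, bias, log_var):
    """Point estimate of the missing view: the conditional mean."""
    mean, _ = conditional_gaussian(x_obs, weight, bias, log_var)
    return mean

def sampled_imputations(x_obs, weight, bias, log_var, n_draws, seed=0):
    """Several plausible completions, reflecting imputation uncertainty."""
    rng = random.Random(seed)
    mean, std = conditional_gaussian(x_obs, weight, bias, log_var)
    return [[rng.gauss(m, s) for m, s in zip(mean, std)]
            for _ in range(n_draws)]

# Toy example: a 3-dimensional observed view, a 2-dimensional missing view.
x_obs = [1.0, 2.0, 3.0]
weight = [[0.5, 0.0, 0.0], [0.0, 0.0, 1.0]]
bias = [0.0, -1.0]
log_var = [0.0, -2.0]
point = single_imputation(x_obs, weight, bias, log_var)
draws = sampled_imputations(x_obs, weight, bias, log_var, n_draws=2000)
```

Averaging a quantity over many draws approximates integrating the missing view out, whereas the point estimate discards the spread of the conditional distribution.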
3.3.3. Objective function
In the semi-supervised incomplete multi-view setting, there are four lower bounds for the labeled-complete, labeled-incomplete, unlabeled-complete and unlabeled-incomplete cases, respectively.
Similar to Eq. (5), the variational lower bound on the marginal likelihood for a single labeled-complete data point is
(12) 
where . In the labeled-incomplete context, the variational lower bound on the marginal likelihood for a single data point can be given by:
(13)  
The solution to is analytical since the conditional distribution is assumed to be a Gaussian (cf. Eq. (11)). For an unlabeled-complete data point, the variational lower bound on the marginal likelihood can be obtained by
(14) 
For the unlabeled-incomplete case, the variational lower bound on the marginal likelihood can be given by:
(15) 
Comparing to Eq. (3.3.3) we see that aside from the explicit conditional distribution for unknown label we have added a conditional distribution for missing view .
The objective function for all available data points is now:
(16) 
Model performance can be improved by introducing explicit imputation loss and classification loss for complete data and labeled data, respectively. Therefore, the final objective function is
(17) 
where and are weight parameters, and . We set and , where and are scaling constants, and , , and are the numbers of complete, incomplete, labeled and unlabeled data points in one minibatch, respectively. Note that the explicit classification loss (i.e., the last term in Eq. (17)) allows SiMVAE to use the partially observed category information to assist the generation of given , which is more effective than the unsupervised imputation algorithms (Wang et al., 2015; Chandar et al., 2016). Similarly, Eq. (17) can be optimized by using the stochastic backpropagation technique (Kingma and Welling, 2014; Rezende et al., 2014).
In principle, our SiMVAE can also handle multiple missing views simultaneously. The formulas are omitted here since they can be derived straightforwardly by using multiple distinct conditional density functions .
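As a small illustration of the bookkeeping behind the objective above, each data point in a minibatch falls into exactly one of the four cases, depending on whether its label and its second view are present. The sketch below only assigns and counts the cases. The dictionary keys `label` and `x_m` and the use of `None` for absent values are assumptions made for this example.

```python
from collections import Counter

def data_case(label, missing_view):
    """Name which of the four lower bounds applies to one data point."""
    labeled = "unlabeled" if label is None else "labeled"
    complete = "incomplete" if missing_view is None else "complete"
    return labeled + "-" + complete

# Hypothetical minibatch: None marks an unknown label or an absent view.
minibatch = [
    {"label": 2, "x_m": [0.1, 0.3]},
    {"label": None, "x_m": [0.4, 0.2]},
    {"label": 1, "x_m": None},
    {"label": None, "x_m": None},
    {"label": None, "x_m": [0.0, 0.9]},
]
counts = Counter(data_case(s["label"], s["x_m"]) for s in minibatch)
```

The per-minibatch counts of complete, incomplete, labeled and unlabeled points that the weight parameters depend on can be read off from the same grouping.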
3.3.4. Connections to auxiliary deep generative models
Maaløe et al. (2016) proposed auxiliary DGMs (ADGM and SDGM) by defining the inference model as , where is the auxiliary variable introduced to make the variational distribution more expressive, and . If is a totally unobservable variable in Figures 1e and 1f, then, similar to SDGM, SiMVAE becomes a two-layered stochastic model. Since the generative process is conditioned on the auxiliary variable, the two-layered stochastic model is more flexible than ADGM (Maaløe et al., 2016). Standard ADGM and SDGM cannot handle incomplete multi-view data. We endow them with this ability by forcing the inferred auxiliary variable close to on the set of complete data. For example, we can obtain the objective function of SDGM by introducing an additional imputation loss to SDGM:
(18) 
where is a regularization parameter, and denotes the set of complete data. can be found in (Maaløe et al., 2016). Intuitively, SDGM not only enjoys the advantages of SDGM (in terms of flexibility, convergence and performance), but also captures the relationships between views via the auxiliary inference model . However, SDGM sets a single Gaussian in the variational distribution , which may restrict its ability in multimodality fusion.
4. Experiments
We conduct experiments on two multimodal emotion datasets to demonstrate the effectiveness of the proposed framework.
4.1. Datasets
SEED: The SEED dataset (Zheng and Lu, 2015) contains electroencephalogram (EEG) and eye movement (Eye) signals recorded from 9 subjects while they watched 15 movie clips, each lasting about 4 minutes. The EEG signals were recorded from 62 channels, and the Eye signals contain information about blinks, saccades, fixations and so on. We used the EEG and Eye data from 9 subjects across 3 sessions, for a total of 27 data files. For each data file, data from movie clips 1-9 were used as the training set, data from clips 10-12 as the validation set, and the rest (clips 13-15) as the testing set.
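The clip-based split described above (first nine clips for training, the next three for validation, the last three for testing) can be written as a small helper. The function name and the 1-based clip numbering are assumptions made for illustration.

```python
def seed_split(clip_index):
    """Assign a SEED movie clip (numbered 1 to 15) to a data split."""
    if 1 <= clip_index <= 9:
        return "train"
    if 10 <= clip_index <= 12:
        return "validation"
    if 13 <= clip_index <= 15:
        return "test"
    raise ValueError("clip_index must be between 1 and 15")

# Group all 15 clips of one data file by split.
splits = {}
for clip in range(1, 16):
    splits.setdefault(seed_split(clip), []).append(clip)
```

Splitting by clip rather than by sample keeps all segments of the same clip in the same subset.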
DEAP: The DEAP dataset (Koelstra et al., 2012) contains EEG and peripheral physiological signals (PPS) recorded from 32 subjects while they watched 40 one-minute music videos. The EEG signals were recorded from 32 channels, whereas the PPS were recorded from 8 channels. The participants rated each music video on a scale from 1 to 9 in terms of valence, arousal and so on. In our experiment, the valence-arousal space was divided into four quadrants according to the ratings. The threshold we used was 5, leading to four classes of data. Considering that the variation in participants’ ratings may be associated with individual differences in the use of the rating scale, we discarded the samples whose ratings of arousal and valence are between 4 and 6. The dataset was randomly divided into 10 folds, with 8 folds used for training, one fold for validation and the last fold for testing.
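The quadrant labeling described above can be sketched as follows. Several details are assumptions, because the text does not fix them: the class indices, the handling of ratings exactly equal to the threshold, the use of strict inequalities for the discarded band, and whether a sample is dropped when both ratings fall in the band (the literal reading, used as the default here) or when either does.

```python
def quadrant_label(valence, arousal, threshold=5.0, low=4.0, high=6.0,
                   require_both=True):
    """Return a class index in {0, 1, 2, 3}, or None for a discarded sample.

    Assumed encoding: 2 * (valence is high) + (arousal is high).
    """
    v_ambiguous = low < valence < high
    a_ambiguous = low < arousal < high
    if require_both:
        discard = v_ambiguous and a_ambiguous
    else:
        discard = v_ambiguous or a_ambiguous
    if discard:
        return None
    return 2 * int(valence > threshold) + int(arousal > threshold)

ratings = [(7.0, 8.0), (2.0, 8.0), (8.0, 2.5), (1.0, 3.0), (5.0, 5.5)]
labels = [quadrant_label(v, a) for v, a in ratings]
```

Removing samples near the threshold trades dataset size for less ambiguous class boundaries.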
For SEED, we used the extracted differential entropy (DE) features and eye movement features (blinks, saccades, fixations and so on) (Lu et al., 2015). For DEAP, following (Lu et al., 2015), we split the time series data into one-second non-overlapping segments, where each segment is treated as an instance. Then we extracted the DE features from the EEG and PPS data instances. The DE features can be calculated in four frequency bands: theta (4-8 Hz), alpha (8-14 Hz), beta (14-31 Hz), and gamma (31-45 Hz), and we used the features from all bands. The details of the data used in our experiments are summarized in Table 1.
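The segmentation and feature extraction described above can be sketched under a common simplifying assumption: if a band-limited signal segment is treated as Gaussian, its differential entropy is 0.5 * log(2 * pi * e * variance). The code below assumes the signal has already been band-pass filtered for one band and one channel. The function names and the toy sampling rate are hypothetical.

```python
import math

def one_second_segments(signal, sampling_rate):
    """Split a signal into non-overlapping one-second segments.

    A trailing remainder shorter than one second is dropped.
    """
    n_segments = len(signal) // sampling_rate
    return [signal[i * sampling_rate:(i + 1) * sampling_rate]
            for i in range(n_segments)]

def differential_entropy(segment):
    """DE of a segment under a Gaussian assumption (needs nonzero variance)."""
    mean = sum(segment) / len(segment)
    var = sum((x - mean) ** 2 for x in segment) / len(segment)
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

# Toy signal sampled at an assumed rate of 4 Hz: 10 samples give 2 segments.
signal = [1.0, -1.0, 1.0, -1.0, 2.0, -2.0, 2.0, -2.0, 0.5, 0.1]
segments = one_second_segments(signal, sampling_rate=4)
features = [differential_entropy(s) for s in segments]
```

Computing one such value per channel and per band gives one feature vector per segment, which is consistent with the 128-dimensional EEG features (32 channels times 4 bands) and 32-dimensional PPS features (8 channels times 4 bands) listed for DEAP in Table 1.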
dataset  #samples  modalities (#dim.)  #training  #validation  #test  #classes
SEED  22734  EEG(310), Eye(33)  13473  4725  4536  3
DEAP  21042  EEG(128), PPS(32)  16834  2104  2104  4
4.2. Semi-supervised Classification with Multi-Modality Emotional Data
4.2.1. Experimental setting
To simulate the SSL scenario, on both datasets, we randomly labeled different proportions of samples in the training set and left the remaining training samples unlabeled. For transductive SSL, we trained models on the dataset consisting of the testing data and the labeled data from the training set. For inductive SSL, we trained models on the entire training set, consisting of the labeled and unlabeled data. For supervised learning, we trained models on the labeled data from the training set and tested their performance on the testing set. We compared our SMVAE with a broad range of solutions, including MAE (Ngiam et al., 2011), DCCA (Andrew et al., 2013), DCCAE (Wang et al., 2015), AMMSS (Cai et al., 2013), AMGL (Nie et al., 2016), M2 (Kingma et al., 2014) and SDGM (Maaløe et al., 2016). For SMVAE, we used multilayer perceptrons as the inference and generative networks. On both datasets, we set the hidden architectures of the inference and generative networks for each view as ‘100-50-30’ and ‘30-50-100’, respectively, where ‘30’ is the dimension of the latent variables. We used the Adam optimizer (Kingma and Ba, 2014) with a learning rate in training. The scaling constant was selected from {0.1, 0.5, 1}. For MAE, DCCA and DCCAE, we used the same setups (network structure, learning rate, etc.) as for our SMVAE. Furthermore, we used support vector machines (SVM) and transductive SVM (TSVM) for supervised learning and transductive SSL, respectively. For AMGL, M2 and SDGM we used their default settings, and we evaluated M2’s performance on each modality and on the concatenation of all modalities.
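The label-masking step used to simulate the SSL scenario can be sketched as follows. The function name, the use of `None` for unlabeled samples and the fixed seed are assumptions. The sampling is uniform over the training set, so class balance among the labeled samples is not enforced here.

```python
import random

def mask_labels(labels, labeled_fraction, seed=0):
    """Keep a random fraction of labels and replace the rest with None."""
    rng = random.Random(seed)
    n_labeled = round(labeled_fraction * len(labels))
    keep = set(rng.sample(range(len(labels)), n_labeled))
    return [y if i in keep else None for i, y in enumerate(labels)]

# Toy training set with three classes, masked at the three labeled proportions.
train_labels = [i % 3 for i in range(1000)]
masked = {frac: mask_labels(train_labels, frac) for frac in (0.01, 0.02, 0.03)}
```

Using a different seed per run would give the independent random labelings over which results can be averaged.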
SEED data  Algorithms  1% labeled  2% labeled  3% labeled
Supervised learning  MAE+SVM (Ngiam et al., 2011)  .814 ± .031  .896 ± .024  .925 ± .024
Supervised learning  DCCA+SVM (Andrew et al., 2013)  .809 ± .035  .891 ± .035  .923 ± .028
Supervised learning  DCCAE+SVM (Wang et al., 2015)  .819 ± .036  .893 ± .034  .923 ± .027
Transductive SSL  AMMSS (Cai et al., 2013)  .731 ± .055  .839 ± .036  .912 ± .018
Transductive SSL  AMGL (Nie et al., 2016)  .711 ± .047  .817 ± .023  .886 ± .028
Transductive SSL  MAE+TSVM (Ngiam et al., 2011)  .818 ± .035  .910 ± .025  .931 ± .026
Transductive SSL  DCCA+TSVM (Andrew et al., 2013)  .811 ± .031  .903 ± .024  .928 ± .021
Transductive SSL  DCCAE+TSVM (Wang et al., 2015)  .823 ± .040  .907 ± .027  .929 ± .023
Transductive SSL  SMVAE  .861 ± .037  .931 ± .020  .960 ± .021
Inductive SSL  M2 (Eye) (Kingma et al., 2014)  .753 ± .024  .849 ± .055  .899 ± .049
Inductive SSL  M2 (EEG) (Kingma et al., 2014)  .768 ± .041  .861 ± .040  .919 ± .026
Inductive SSL  M2 (Concat.) (Kingma et al., 2014)  .803 ± .035  .876 ± .043  .926 ± .044
Inductive SSL  SDGM (Concat.) (Maaløe et al., 2016)  .819 ± .034  .893 ± .042  .932 ± .041
Inductive SSL  SMVAE  .880 ± .033  .955 ± .020  .968 ± .015

DEAP data  Algorithms  1% labeled  2% labeled  3% labeled
Supervised learning  MAE+SVM (Ngiam et al., 2011)  .353 ± .027  .387 ± .014  .411 ± .016
Supervised learning  DCCA+SVM (Andrew et al., 2013)  .359 ± .016  .400 ± .014  .416 ± .018
Supervised learning  DCCAE+SVM (Wang et al., 2015)  .361 ± .023  .403 ± .017  .419 ± .013
Transductive SSL  AMMSS (Cai et al., 2013)  .303 ± .029  .353 ± .024  .386 ± .014
Transductive SSL  AMGL (Nie et al., 2016)  .291 ± .027  .341 ± .021  .367 ± .019
Transductive SSL  MAE+TSVM (Ngiam et al., 2011)  .376 ± .025  .403 ± .031  .417 ± .026
Transductive SSL  DCCA+TSVM (Andrew et al., 2013)  .379 ± .021  .408 ± .024  .421 ± .017
Transductive SSL  DCCAE+TSVM (Wang et al., 2015)  .384 ± .022  .412 ± .027  .425 ± .021
Transductive SSL  SMVAE  .424 ± .020  .441 ± .013  .456 ± .013
Inductive SSL  M2 (PPS) (Kingma et al., 2014)  .366 ± .024  .389 ± .048  .402 ± .034
Inductive SSL  M2 (EEG) (Kingma et al., 2014)  .374 ± .019  .397 ± .013  .407 ± .016
Inductive SSL  M2 (Concat.) (Kingma et al., 2014)  .383 ± .019  .404 ± .016  .416 ± .015
Inductive SSL  SDGM (Concat.) (Maaløe et al., 2016)  .389 ± .019  .411 ± .017  .423 ± .015
Inductive SSL  SMVAE  .421 ± .017  .439 ± .015  .451 ± .013
4.2.2. Classification accuracy with very few labels
Table 2 presents the classification accuracies of all methods on the SEED and DEAP datasets. The proportion of labeled samples in the training set varies from 1% to 3%. Results (mean ± std) were averaged over 20 independent runs. Several observations can be drawn. First, the average accuracy of SMVAE surpasses the baselines in all cases. Second, comparing SMVAE with the supervised learning approaches trained on very limited labeled data, we find that SMVAE always outperforms them. This encouraging result shows that SMVAE can effectively leverage the useful information in unlabeled data. Third, the multi-view semi-supervised algorithms AMMSS and AMGL perform worst in all cases. We attribute this to the fact that the graph-based shallow models AMMSS and AMGL cannot extract deep features from the original data. Fourth, the performance of the three TSVM-based semi-supervised methods is moderate. Finally, compared with the single-view methods M2 and SDGM, our multi-view method is more effective in integrating multiple modalities.
4.2.3. Flexibility and stability
The proportion of unlabeled samples in the training set will affect the performance of semisupervised models. Figure 2a shows the changes of inductive SMVAE’s average accuracy on SEED with different proportions of unlabeled samples in the training set. We can observe that the unlabeled samples can effectively boost the classification accuracy of SMVAE. Instead of treating each modality equally, SMVAE can weight each modality and perform classification simultaneously. Figure 2b shows the learned weight factors by inductive SMVAE on both datasets ( labeled). From it, we can observe that EEG modality has the highest weight on both datasets, which is consistent with single modality’s performance of M2 shown in Table 2 and the results in previous work (Lu et al., 2015). The scaling constant controls the weight of discriminative learning in SMVAE. Figure 2c shows the performance of inductive SMVAE with different values ( labeled). From it, we can find that the scaling constant can be chosen from {0.1, 0.5, 1}, where SMVAE achieves good results.
4.3. Semi-supervised Learning with Incomplete Multi-Modality Data
4.3.1. Experimental setting
To simulate the incomplete data setting, we randomly selected a fraction of instances (from both labeled and unlabeled training data) to be unpaired examples, i.e., they are described by only one modality, while the remaining ones appear in both modalities. We varied the fraction of missing data from 10% to 90% with an interval of 20%, with no missing data in the validation and testing sets. In our experiment, we assumed that the Eye modality of SEED and the PPS modality of DEAP are incomplete.
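The missing-modality simulation described above can be sketched as follows. The function name, the representation of a sample as a pair of views, and the use of `None` for the removed view are assumptions made for illustration.

```python
import random

# Fractions of incomplete training instances: 10% to 90% in steps of 20%.
MISSING_FRACTIONS = [0.1, 0.3, 0.5, 0.7, 0.9]

def drop_second_view(pairs, missing_fraction, seed=0):
    """Remove the second view from a random fraction of training pairs.

    The first view (e.g. EEG) is always kept. Only the second view
    (e.g. Eye or PPS) can become missing.
    """
    rng = random.Random(seed)
    n_missing = round(missing_fraction * len(pairs))
    missing = set(rng.sample(range(len(pairs)), n_missing))
    return [(x1, None if i in missing else x2)
            for i, (x1, x2) in enumerate(pairs)]

# Toy training set of 100 paired instances.
train_pairs = [([float(i)], [float(-i)]) for i in range(100)]
simulated = {f: drop_second_view(train_pairs, f) for f in MISSING_FRACTIONS}
```

Only the training data would be passed through such a function, since the validation and testing sets contain no missing data.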
There are two main solutions for semi-supervised classification of incomplete multi-view data. One is to first complete the missing view in an unsupervised way, and then conduct semi-supervised classification. The other is to integrate missing view imputation and semi-supervised classification into an end-to-end learning framework. We compared our (inductive) SiMVAE algorithm with both. Specifically, we compared SiMVAE with SoftImpute-ALS (Hastie et al., 2015), DCCAE (Wang et al., 2015), CorrNet (Chandar et al., 2016), CoNet (Quanz and Huan, 2012) and SDGM (a variant of SDGM (Maaløe et al., 2016), cf. Section 3.3.4). For SoftImpute-ALS, DCCAE and CorrNet, we first estimated the missing modalities by using the authors’ implementation, and then conducted semi-supervised classification by using our (inductive) SMVAE algorithm. For CoNet and SDGM, we conducted missing modality imputation and semi-supervised classification simultaneously based on our own implementations. Additionally, we compared SiMVAE with the following two baselines: 1) SiMVAE with complete data (FullData, i.e., no missing modality for any training instance), which can be regarded as an upper bound of SiMVAE; 2) SiMVAE with only paired data (PartialData, i.e., we simply discard the incomplete samples in the training process), which can be regarded as a lower bound of SiMVAE. These two bounds define the potential range of SiMVAE’s performance. For SiMVAE, both and were selected from {0.1, 0.5, 1}. For SDGM, we selected the regularization parameter from .
4.3.2. Semi-supervised classification
The performance of our SiMVAE and the compared methods is shown in Figure 3, where each point on every curve is an average over 20 independent trials. Figure 3 shows that SiMVAE consistently outperforms the compared methods. Compared with the two-stage methods (SoftImpute-ALS, DCCAE and CorrNet), the advantage of SiMVAE is significant, especially when there are sufficient labeled data (3%). This is because SiMVAE can make good use of the available category information to generate more informative modalities, which in turn improves classification performance, whereas the two-stage methods cannot reach the globally optimal results. Also, SiMVAE shows an obvious advantage over the semi-supervised methods CoNet and SDGM. This may be because CoNet and SDGM are not designed to integrate multiple modalities. Moreover, SiMVAE remains successful even when a high percentage of samples are incomplete. Specifically, SiMVAE with about 50% incomplete samples achieves results comparable to the fully complete case (FullData). With fractions lower than that, we observe that SiMVAE roughly reaches FullData’s performance, especially when the labeled data are sufficient. Finally, SiMVAE’s performance is closer to FullData than to PartialData, which indicates the effectiveness of SiMVAE in learning from incomplete data.
4.3.3. Missing modality imputation
Since the quality of the recovered missing modalities directly affects the classification results, we also evaluated the performance of missing modality imputation for all methods. For SiMVAE and SDGM, we obtained a single imputation of the missing modality by evaluating its conditional mean. We used the Normalized Mean Squared Error (NMSE) to measure the relative distance between the original and the recovered modalities: $\mathrm{NMSE} = \|X - \hat{X}\|_F / \|X\|_F$, where $X$ and $\hat{X}$ are the original and the recovered data matrices, respectively, and $\|\cdot\|_F$ denotes the Frobenius norm. Figure 4 shows the experimental results.
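As a small worked example of this metric, the sketch below computes NMSE for two toy matrices. It assumes NMSE is the ratio of the Frobenius norm of the residual to the Frobenius norm of the original matrix (some definitions square both norms); the function names are illustrative only.

```python
import math

def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(value * value for row in matrix for value in row))

def nmse(original, recovered):
    """Relative distance between original and recovered matrices.

    Assumption: NMSE = ||X - X_hat||_F / ||X||_F (unsquared norms).
    """
    diff = [[o - r for o, r in zip(o_row, r_row)]
            for o_row, r_row in zip(original, recovered)]
    return frobenius_norm(diff) / frobenius_norm(original)

# Toy matrices: ||original||_F = 5 and the residual has norm 3.
original = [[3.0, 0.0], [0.0, 4.0]]
recovered = [[3.0, 0.0], [0.0, 1.0]]
score = nmse(original, recovered)
```

A perfect reconstruction gives 0, and larger values indicate a recovered matrix that lies further from the original relative to its overall magnitude.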
Figure 4 shows that as the fraction of missing data increases, the relative distance between the original modalities and the recovered modalities increases. Further, the semisupervised imputation methods (SiMVAE, CoNet and SDGM) consistently outperform the unsupervised imputation methods (SoftImputeALS, DCCAE and CorrNet), and increasing the number of labeled training data improves the imputation performance of the semisupervised methods. This demonstrates that category information plays an important role in missing modality imputation. SoftImputeALS shows the worst performance, which suggests that matrix completion methods are not well suited to missing modality imputation. CoNet and SDGM obtain imputation errors comparable to those of SiMVAE. This indicates that their moderate classification performance in Figure 3 may be caused by their limited ability to fuse modalities. Except for SiMVAE and SDGM, the other methods ignore the uncertainty of the missing view, which also limits their imputation performance. To compare the imputation performance more intuitively, we visualize the original and recovered data matrices in Figure 5 (on SEED, 3% labeled and 10% missing Eye). The visualization shows that SiMVAE recovered more individual characteristics of the original data matrix than the other methods.
4.3.4. Sensitivity analysis
Figure 6 shows the classification accuracies of inductive SiMVAE with different values of the two scaling constants on both datasets (at a fixed labeled fraction and 10% missing data). The results show that SiMVAE is not very sensitive to the values of these constants. In the experiments, we chose the best values from {0.1, 0.5, 1}.
5. Conclusion
We have proposed a novel semisupervised multiview deep generative framework for multimodal emotion recognition with incomplete data. Under our framework, each modality of the emotional data is treated as one view, and the importance of each modality is inferred automatically by learning a non-uniformly weighted Gaussian mixture posterior approximation for the shared latent variable. The labeled-data-scarcity problem is naturally addressed within our framework by casting the semisupervised classification problem as a specialized missing data imputation task. The incomplete-data problem is circumvented by treating the missing views as latent variables and integrating them out. Compared with previous emotion recognition methods, our method is more robust and flexible. Experimental results confirmed the superiority of our framework over many state-of-the-art competitors.
Acknowledgements.
This work was supported by National Natural Science Foundation of China (No. 91520202, 61602449), Beijing Municipal Science & Technology Commission (Z181100008918010), Youth Innovation Promotion Association CAS and Strategic Priority Research Program of CAS.
References
 Amini et al. (2009) Massih Reza Amini, Nicolas Usunier, and Cyril Goutte. 2009. Learning from Multiple Partially Observed Views – an Application to Multilingual Text Categorization. NIPS (2009), 28–36.
 Andrew et al. (2013) Galen Andrew, Raman Arora, Jeff A Bilmes, and Karen Livescu. 2013. Deep canonical correlation analysis. In ICML. 1247–1255.
 Burda et al. (2016) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. 2016. Importance Weighted Autoencoders. In ICLR.
 Cai et al. (2013) Xiao Cai, Feiping Nie, Weidong Cai, and Heng Huang. 2013. Heterogeneous Image Features Integration via Multimodal Semisupervised Learning Model. In ICCV. 1737–1744.
 Calvo and D’Mello (2010) Rafael A Calvo and Sidney D’Mello. 2010. Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on Affective Computing 1, 1 (2010), 18–37.
 Chandar et al. (2016) Sarath Chandar, Mitesh M Khapra, Hugo Larochelle, and Balaraman Ravindran. 2016. Correlational neural networks. Neural computation 28, 2 (2016), 257–285.
 Gershman et al. (2012) Samuel Gershman, Matt Hoffman, and David Blei. 2012. Nonparametric variational inference. In ICML.

 Hastie et al. (2015) Trevor Hastie, Rahul Mazumder, Jason D. Lee, and Reza Zadeh. 2015. Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research 16, 1 (2015), 3367–3402.
 Hazan et al. (2015) Elad Hazan, Roi Livni, and Yishay Mansour. 2015. Classification with Low Rank and Missing Data. In ICML. 257–266.

 Jia et al. (2014) Xiaowei Jia, Kang Li, Xiaoyi Li, and Aidong Zhang. 2014. A novel semisupervised deep learning framework for affective state recognition on EEG signals. In International Conference on Bioinformatics and Bioengineering (BIBE). IEEE, 30–37.
 Keshavan et al. (2009) Raghunandan H. Keshavan, Sewoong Oh, and Andrea Montanari. 2009. Matrix completion from a few entries. IEEE Transactions on Information Theory 56, 6 (2009), 2980–2998.
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
 Kingma et al. (2014) Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semisupervised learning with deep generative models. In NIPS. 3581–3589.
 Kingma et al. (2016) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. Improving Variational Inference with Inverse Autoregressive Flow. In NIPS. 4743–4751.
 Kingma and Welling (2014) Diederik P Kingma and Max Welling. 2014. AutoEncoding Variational Bayes. In ICLR.
 Koelstra et al. (2012) Sander Koelstra, Christian Muhl, Mohammad Soleymani, JongSeok Lee, Ashkan Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras. 2012. Deap: A database for emotion analysis; using physiological signals. IEEE Transactions on Affective Computing 3, 1 (2012), 18–31.
 Liu et al. (2016) Wei Liu, Wei Long Zheng, and Bao Liang Lu. 2016. Emotion Recognition Using Multimodal Deep Learning. In International Conference on Neural Information Processing. 521–529.
 Lu et al. (2015) Yifei Lu, WeiLong Zheng, Binbin Li, and BaoLiang Lu. 2015. Combining Eye Movements and EEG to Enhance Emotion Recognition. In IJCAI. 1170–1176.
 Luan et al. (2017) Tran Luan, Xiaoming Liu, Jiayu Zhou, and Rong Jin. 2017. Missing Modalities Imputation via Cascaded Residual Autoencoder. In CVPR. 4971–4980.
 Maaløe et al. (2016) Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. 2016. Auxiliary deep generative models. In ICML. 1445–1453.
 Ngiam et al. (2011) Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal deep learning. In ICML. 689–696.
 Nie et al. (2016) Feiping Nie, Jing Li, Xuelong Li, et al. 2016. ParameterFree AutoWeighted Multiple Graph Learning: A Framework for Multiview Clustering and SemiSupervised Classification. In IJCAI. 1881–1887.
 Pang et al. (2015) Lei Pang, Shiai Zhu, and ChongWah Ngo. 2015. Deep multimodal learning for affective analysis and retrieval. IEEE Transactions on Multimedia 17, 11 (2015), 2008–2020.
 Poria et al. (2017) Soujanya Poria, Erik Cambria, Rajiv Bajpai, and Amir Hussain. 2017. A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion 37 (2017), 98–125.
 Quanz and Huan (2012) Brian Quanz and Jun Huan. 2012. CoNet: feature generation for multiview semisupervised learning with partially observed views. In CIKM. 1273–1282.

 Ranganathan et al. (2016) Hiranmayi Ranganathan, Shayok Chakraborty, and Sethuraman Panchanathan. 2016. Multimodal emotion recognition using deep learning architectures. In Applications of Computer Vision. 1–9.
 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In NIPS. 1278–1286.
 Schels et al. (2014) Martin Schels, Markus Kächele, Michael Glodek, David Hrabal, Steffen Walter, and Friedhelm Schwenker. 2014. Using unlabeled data to improve classification of emotional states in human computer interaction. Journal on Multimodal User Interfaces 8, 1 (2014), 5–16.
 Serban et al. (2016) Iulian V Serban, Alexander G Ororbia II, Joelle Pineau, and Aaron Courville. 2016. Multimodal Variational EncoderDecoders. arXiv preprint arXiv:1612.00377 (2016).
 Shang et al. (2017) Chao Shang, Aaron Palmer, Jiangwen Sun, Ko Shin Chen, Jin Lu, and Jinbo Bi. 2017. VIGAN: Missing view imputation with generative adversarial networks. In IEEE International Conference on Big Data. 766–775.
 Soleymani et al. (2016) Mohammad Soleymani, Sadjad AsghariEsfeden, Yun Fu, and Maja Pantic. 2016. Analysis of EEG signals and facial expressions for continuous emotion detection. IEEE Transactions on Affective Computing 7, 1 (2016), 17–28.
 Sønderby et al. (2016) Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. 2016. Ladder variational autoencoders. In NIPS. 3738–3746.
 Srivastava and Salakhutdinov (2014) Nitish Srivastava and Ruslan Salakhutdinov. 2014. Multimodal Learning with Deep Boltzmann Machines. Journal of Machine Learning Research 15 (2014), 2949–2980.
 Tzirakis et al. (2017) Panagiotis Tzirakis, George Trigeorgis, Mihalis A Nicolaou, Björn W Schuller, and Stefanos Zafeiriou. 2017. Endtoend multimodal emotion recognition using deep neural networks. IEEE Journal of Selected Topics in Signal Processing 11, 8 (2017), 1301–1309.
 Wagner et al. (2011) Johannes Wagner, Elisabeth Andre, Florian Lingenfelser, and Jonghwa Kim. 2011. Exploring fusion methods for multimodal emotion recognition with missing data. IEEE Transactions on Affective Computing 2, 4 (2011), 206–218.
 Wang et al. (2010) C. Wang, X. Liao, L Carin, and D. B. Dunson. 2010. Classification with Incomplete Data Using Dirichlet Process Priors. Journal of Machine Learning Research 11, 18 (2010), 3269.
 Wang et al. (2015) Weiran Wang, Raman Arora, Karen Livescu, and Jeff A Bilmes. 2015. On Deep MultiView Representation Learning. In ICML. 1083–1092.
 Wang et al. (2016) Weiran Wang, Xinchen Yan, Honglak Lee, and Karen Livescu. 2016. Deep Variational Canonical Correlation Analysis. arXiv: 1610.03454 (2016).
 Williams et al. (2007) D Williams, X. Liao, Y. Xue, L Carin, and B Krishnapuram. 2007. On classification with incomplete data. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 3 (2007), 427.
 Xu et al. (2015) C. Xu, D. Tao, and C. Xu. 2015. Multiview Learning with Incomplete Views. IEEE Transactions on Image Processing 24, 12 (2015), 5812–5825.
 Yu et al. (2011) Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, and R. Bharat Rao. 2011. Bayesian CoTraining. Journal of Machine Learning Research 12, 3 (2011), 2649–2680.
 Zhang et al. (2018) Lei Zhang, Yao Zhao, Zhenfeng Zhu, Dinggang Shen, and Shuiwang Ji. 2018. MultiView Missing Data Completion. IEEE Transactions on Knowledge and Data Engineering 30, 7 (2018), 1296–1309.
 Zhang et al. (2016) Zixing Zhang, Fabien Ringeval, Bin Dong, Eduardo Coutinho, Erik Marchi, and Björn Schuller. 2016. Enhanced semisupervised learning for multimodal emotion recognition. In ICASSP. IEEE, 5185–5189.
 Zheng et al. (2018) WeiLong Zheng, Wei Liu, Yifei Lu, BaoLiang Lu, and Andrzej Cichocki. 2018. EmotionMeter: A Multimodal Framework for Recognizing Human Emotions. IEEE Transactions on Cybernetics (2018), 1–13.
 Zheng and Lu (2015) WeiLong Zheng and BaoLiang Lu. 2015. Investigating Critical Frequency Bands and Channels for EEGbased Emotion Recognition with Deep Neural Networks. IEEE Transactions on Autonomous Mental Development 7, 3 (2015), 162–175.
Section A
The Shannon entropy is hard to compute analytically. In general, there is no closed-form expression for the entropy of a Mixture of Gaussians (MoG). Here we lower-bound the entropy of the MoG using Jensen’s inequality:
where we have used the fact that the convolution of two Gaussians is another Gaussian, and .
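For reference, the standard form of such a Jensen bound can be sketched as follows. The notation here is an assumption and is not taken from the text above: $q(z) = \sum_v \lambda_v \mathcal{N}(z; \mu_v, \Sigma_v)$ denotes a generic Gaussian mixture with weights $\lambda_v$, means $\mu_v$ and covariances $\Sigma_v$.

```latex
\begin{aligned}
\mathbb{H}[q(z)]
  &= -\sum_{v} \lambda_v \int \mathcal{N}(z; \mu_v, \Sigma_v)\, \log q(z)\, dz \\
  &\ge -\sum_{v} \lambda_v \log \int \mathcal{N}(z; \mu_v, \Sigma_v)\, q(z)\, dz \\
  &= -\sum_{v} \lambda_v \log \sum_{l} \lambda_l\,
     \mathcal{N}(\mu_v; \mu_l, \Sigma_v + \Sigma_l).
\end{aligned}
```

The inequality applies Jensen's inequality to the concave logarithm inside each component expectation, and the last line uses the Gaussian convolution identity $\int \mathcal{N}(z; \mu_v, \Sigma_v)\, \mathcal{N}(z; \mu_l, \Sigma_l)\, dz = \mathcal{N}(\mu_v; \mu_l, \Sigma_v + \Sigma_l)$.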
Section B
can be rewritten, using the location-scale transformation for the Gaussian distribution, as:
where and . While the expectations on the right-hand side of the above equation still cannot be solved analytically, their gradients w.r.t. , and can be efficiently estimated using the following Monte Carlo estimators
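To illustrate the general idea of estimating gradients through a location-scale transformation, the following is a minimal sketch for a scalar Gaussian. It is not the estimator of this framework: the test function $f(z) = z^2$, the function name `reparam_gradients` and the sample size are assumptions chosen so that the exact gradients are known.

```python
import random

def reparam_gradients(mu, sigma, f_prime, n_samples=100_000, seed=0):
    """Monte Carlo estimates of d/dmu and d/dsigma of E[f(z)], z ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise that does not depend on mu or sigma
        z = mu + sigma * eps        # location-scale transformation
        slope = f_prime(z)
        grad_mu += slope            # chain rule with dz/dmu = 1
        grad_sigma += slope * eps   # chain rule with dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples

# Toy check with f(z) = z^2, where E[f(z)] = mu^2 + sigma^2,
# so the exact gradients are 2 * mu and 2 * sigma.
g_mu, g_sigma = reparam_gradients(mu=1.0, sigma=0.5, f_prime=lambda z: 2.0 * z)
```

Because the randomness is moved into `eps`, the sampled value `z` is a deterministic, differentiable function of `mu` and `sigma`, which is what allows the gradient to be taken inside the expectation and averaged over samples.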