M2P2: Multimodal Persuasion Prediction using Adaptive Fusion

06/03/2020 ∙ by Chongyang Bai, et al. ∙ Dartmouth College

Identifying persuasive speakers in an adversarial environment is a critical task. In a national election, politicians would like to have persuasive speakers campaign on their behalf. When a company faces adverse publicity, they would like to engage persuasive advocates for their position in the presence of adversaries who are critical of them. Debates represent a common platform for these forms of adversarial persuasion. This paper solves two problems: the Debate Outcome Prediction (DOP) problem predicts who wins a debate while the Intensity of Persuasion Prediction (IPP) problem predicts the change in the number of votes before and after a speaker speaks. Though DOP has been previously studied, we are the first to study IPP. Past studies on DOP fail to leverage two important aspects of multimodal data: 1) multiple modalities are often semantically aligned, and 2) different modalities may provide diverse information for prediction. Our M2P2 (Multimodal Persuasion Prediction) framework is the first to use multimodal (acoustic, visual, language) data to solve the IPP problem. To leverage the alignment of different modalities while maintaining the diversity of the cues they provide, M2P2 devises a novel adaptive fusion learning framework which fuses embeddings obtained from two modules – an alignment module that extracts shared information between modalities and a heterogeneity module that learns the weights of different modalities with guidance from three separately trained unimodal reference models. We test M2P2 on the popular IQ2US dataset designed for DOP. We also introduce a new dataset called QPS (from Qipashuo, a popular Chinese debate TV show) for IPP. M2P2 significantly outperforms 3 recent baselines on both datasets. Our code and QPS dataset can be found at http://snap.stanford.edu/m2p2/.


1 Introduction

Controversial topics (e.g. foreign policy, immigration, national debt, privacy issues) engender much debate amongst academics, businesses, and politicians. Speakers who are persuasive often win such debates. Given videos of discussions between two participants, the goal of this paper is to provide a fully automated system to solve two persuasion related problems. The Debate Outcome Prediction problem (DOP) tries to determine which of two teams “wins” a debate. Suppose the two teams are $A$ and $B$, suppose $a_0, b_0$ denote the number of supporters for $A$'s and $B$'s positions respectively before the debate, and suppose $a_1, b_1$ denote the same after the debate. Hence, team $A$'s gain is $\Delta_A = a_1 - a_0$ and team $B$'s gain is $\Delta_B = b_1 - b_0$. In the DOP problem, we say that team $A$ (resp. team $B$) wins the debate if $\Delta_A > \Delta_B$ (resp. $\Delta_B > \Delta_A$). We say a speaker is a winner if s/he belongs to the winning team. The Intensity of Persuasion Prediction problem (IPP) tries to predict the increase (or decrease) in the number of votes of each speaker (as opposed to a team). We use the same notation as before, but now consider individual speakers rather than teams: the intensity of a speaker's persuasiveness is the change in the number of votes for his/her position from just before to just after his/her speech. It is clear that both these problems are important. In a business meeting, it might be important to win (DOP), but in other situations, peeling away support for an opponent might be important (IPP). The more support a speaker can peel away from the opponent, the more persuasive s/he is.
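To make the two prediction targets concrete, here is a minimal sketch with purely hypothetical vote counts (the numbers are illustrative and do not come from either dataset):

```python
# Toy illustration of DOP vs. IPP with hypothetical vote counts.
a0, b0 = 30, 70          # supporters of teams A and B before the debate
a1, b1 = 40, 60          # supporters of teams A and B after the debate

# DOP: the team that gains more supporters wins the debate.
delta_A, delta_B = a1 - a0, b1 - b0
winner = "A" if delta_A > delta_B else "B"            # here: A (+10 vs. -10)

# IPP: a speaker's intensity of persuasion is the change in the votes for
# his/her position from just before to just after his/her speech.
votes_before_turn, votes_after_turn = 30, 38          # hypothetical single turn
intensity = votes_after_turn - votes_before_turn      # +8 votes gained
```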

Solving DOP and IPP using video data alone can pose many challenges. In this paper, we test our M2P2 algorithm against two datasets: the IQ2US dataset (https://www.intelligencesquaredus.org) from a popular US debate TV show, and the QPS dataset from the popular Chinese TV show Qipashuo (https://www.imdb.com/title/tt4397792/). Real-world videos such as these come with three broad properties: (i) as we can see in Figure 2(b), the detected language can be very noisy, and this must be accounted for; (ii) as we can see from Figure 2(a), there can be considerable noise in the video modality as well, for instance a man’s face might be shown in the video while a woman is speaking, and these kinds of audio-video mismatches must be addressed; (iii) in other cases, as shown in Figure 1, the modalities might be nicely aligned: the audio, language, and video modalities are all correct and the speaker’s speech and visual signals agree. The problem of identifying these types of mismatches poses a major challenge in building a single model to predict both DOP and IPP.

Though we are not the first to take on the DOP problem, we are the first to solve IPP. DOP has been addressed by [5, 24, 29], who use multimodal sequence data to predict who will win a debate. However, these efforts do not address all three challenges described above. To the best of our knowledge, there is no existing dataset that addresses the IPP problem and there are no algorithms to solve it. In this paper, we develop a novel algorithm called M2P2 and show that M2P2 improves upon past solutions to DOP by 2%–3.7% accuracy (statistically significant with a $p$-value below 0.05) and beats adaptations of past DOP approaches to the IPP case by over 25% in MSE (statistically significant). On a sample debate from the QPS dataset, M2P2's interim predictions of the number of votes at different points during the debate closely track the ground truth.

Figure 1: In multimodal content, the modalities are semantically aligned. This example shows a case where the visual modality (facial expressions) and the language modality (the content of the speech) are closely aligned.
(a) There are cases where the visual modality is noisy, while the language modality is clean. In 4 consecutive frames when the woman is speaking, the face of a man appears (see frames 2 and 3) and the man’s face is incorrectly assumed to be the woman’s. The language modality, however, is correct.
(b) There are cases where the language modality can be noisy, while the visual modality is clean. We use Baidu’s OCR API to extract the Chinese transcripts from the video frames. In the video frame (the right side of the figure), the transcripts extracted by the OCR system (the left side) are incorrect due to the milk ads shown.
Figure 2: Individual modalities can be noisy. Here we show examples where the visual or the language modality are wrong. M2P2 learns to down-weight the noisy modalities.

Our contributions. When all three modalities (audio, video, language) agree, then that “common” information must be correctly captured by a predictive model. In this case, we say that the modalities are aligned. However, there can be cases where some modalities suggest one thing, while the other(s) suggest something different. In this case, we say the modalities are heterogeneous. Our solution, M2P2, captures both aspects and also learns how to weight the two aspects in order to maximize prediction accuracy. M2P2 first leverages the Transformer encoder structure [34] to project the three modalities into three latent spaces. To combine the information from the latent spaces, the model devises two major modules: alignment and heterogeneity.

The alignment module learns to highlight the shared, aligned information across modalities. It enforces an alignment loss in the loss function as a regularization term during training. This ensures that there is relatively little discrepancy between the latent embeddings of different modalities when they are aligned.

The heterogeneity module first learns the weights of modality-specific information and applies weighted fusion to harden the model against noisy modalities (cf. Figure 2). M2P2 uses a novel interactive training procedure to learn the weights from three separately trained reference models, each corresponding to a single modality. Intuitively, a modality with smaller unimodal loss should be assigned a higher weight in the multimodal model. Finally, the outputs of both modules are combined with the debate meta-data for persuasion prediction.

We evaluate M2P2 on the IQ2US and QPS datasets. IQ2US was first used by [5] to evaluate the DOP problem. The IQ2US dataset only has the final debate outcomes, without any labels about how persuasive each speaker is during the debate. Hence, IQ2US cannot be used to evaluate IPP. To this end, we created a new dataset QPS, based on an extremely popular Chinese entertainment debate TV show called Qipashuo. In QPS, the audience provides real-time votes before and after each speaker in order to gauge how persuasive the speaker is. QPS therefore provides a direct measure of each speaker’s persuasiveness for training and evaluation. We use the IQ2US dataset for the DOP problem and the QPS dataset for the IPP problem. M2P2 outperforms baselines based on three recent papers [5, 24, 29] which were originally designed to predict debate outcomes (or other related problem scenarios). We also conduct ablation studies and visualize our results to show the effectiveness of the different novel components in M2P2.

Figure 3: M2P2 architecture. First, audio, face and language sequences are extracted from a video clip and fed to three separate modules to get primary input embeddings. Second, each of these embeddings is fed to a Transformer encoder [34] followed by a max pooling layer, which yields the latent embeddings $z_a, z_v, z_l$. Third, the latent embeddings are fed to the alignment and heterogeneity modules to generate the embeddings $z_{\mathrm{align}}$ and $z_{\mathrm{het}}$. Last, we concatenate $z_{\mathrm{align}}$, $z_{\mathrm{het}}$ and the debate meta-data, and feed the result to an MLP for persuasiveness prediction. The latent embeddings interact with two procedures alternately: (i) optimize the alignment loss $\mathcal{L}_{\mathrm{align}}$ and the persuasiveness loss $\mathcal{L}_{\mathrm{pers}}$, and (ii) learn the modality weights $w_a, w_v, w_l$ through 3 reference models.

2 Related Work

Unimodal persuasion prediction. There has been some work on using a single modality for predicting persuasion. [39, 27, 36, 13] explored the linguistic modality by studying style, context, semantic features and argument-level dynamics in English transcripts to solve DOP. For the visual modality, Joo et al. [19] defined nine visual intents related to persuasion (e.g. dominance, trustworthiness) and trained SVMs on hand-crafted features to predict these intents as well as persuasion. Huang et al. [16] improved these results by fine-tuning pre-trained CNNs. For the acoustic modality, [28] used MFCC features and an LSTM to solve DOP.

Multimodal persuasion prediction. Brilman et al. [5] solved DOP by extracting facial emotions, voice pitch and word category related features and then training separate SVMs for each modality. The overall prediction for DOP was obtained through a majority vote by the three models. Nojavanasghari et al. [24] solved DOP by first building a Multi-Layer Perceptron (MLP) for each modality, then concatenating the predicted probabilities and sending them as input to yet another MLP. Because both methods use simple aggregate feature values (e.g. mean, median), they ignore the dynamics of features over time. As a result, these two approaches do not work well with short video clips and do not leverage temporal dynamics. To address this problem, Santos et al. [29] used an LSTM to take each time-step into account, but their feature-level multimodal fusion considers all modalities to be equally important — thus ignoring the noise, heterogeneity, and alignment properties.

M2P2 is the first to address the Intensity of Persuasion Prediction problem (IPP). Moreover, M2P2 captures temporal dynamics via a multi-headed attention mechanism that: (i) learns the importance of different modalities at different times in long video sequences, and (ii) thus learns better representations of the multiple modalities. In addition, M2P2 is the first to capture both alignment and heterogeneity — hence addressing noise. With these innovations, M2P2 performs well on both IPP and DOP.

General Multimodal Learning. A body of multimodal learning methods defines constraints between modalities in a latent space to capture their inter-relationships. Andrew et al. [2] extended Canonical Correlation Analysis with deep neural networks to maximize inter-modal correlations. Such correlation constraints have since been used in sentiment classification [10], emotion recognition [1] and semantic-visual embedding [11]. In addition to capturing the shared relationship, [25, 30, 35] tried to extract the individual component of each modality through low-rank estimation. [18, 10] trained auto-encoders to reconstruct a modality from itself and another modality. While these efforts provide important insights for creating multimodal embeddings, they do not show how to combine the learned embeddings for accurate prediction.

A second body of work explores architectures for fusing embeddings from modalities. Zadeh et al. [38] introduced bimodal and trimodal tensors via cross products to express inter-modal features. As cross products significantly increase the dimensionality of the feature space, [20, 4, 6] introduced bilinear pooling techniques to learn compact representations. Although these methods explicitly model inter-modal relationships, they introduce additional features that require larger networks to be learned for subsequent prediction tasks. In contrast, attention-based fusion [22, 15] learns a weighted sum of multimodal embeddings taking the prediction task into account; however, such methods require large amounts of data to learn the optimal attention weights. In order to capture long-term dependencies, M2P2 uses the Transformer encoder [34, 33] to learn latent embeddings for the modalities. On one hand, inspired by the first class of work, M2P2 uses a shared projector and enforces high correlation among the encoded embeddings. On the other hand, M2P2 computes a weighted concatenation of the latent unimodal embeddings, where the weights are guided by the persuasiveness loss of each embedding through interactive training. These two innovations lead to a compact embedding that can be learned from a small dataset.

3 The M2P2 Framework

Figure 3 shows an overview of our M2P2 architecture with a brief description of its major components. Note that the key novelties of this paper are the two novel modules (i.e., the alignment module and the heterogeneity module, shaded in yellow in Figure 3) that constitute the adaptive fusion framework (Section 3.3). Our proposed adaptive fusion framework has the potential to be broadly utilized in other multimodal learning tasks; we leave that exploration for future work.

3.1 Generating Primary Input Embeddings

Given a video clip, we represent the acoustic, visual and language input respectively as $A \in \mathbb{R}^{T_a}$, $V \in \mathbb{R}^{T_v \times H \times W \times C}$ and $L \in \mathbb{R}^{T_l \times |\mathcal{V}|}$, where $T_a, T_v, T_l$ are respectively the lengths of the audio signal, face sequence, and word sequence, $H$, $W$ and $C$ are the height, width and the number of channels of each image, and $|\mathcal{V}|$ is the length of our dictionary of words. In addition, we also use two debate meta-data features: the number of votes before a speech and the length of the speech. We generically denote the debate meta-data as a vector $u \in \mathbb{R}^{k}$, where $k = 2$.

We first extract features from the three modalities, then add a fully-connected (FC) layer for each modality to obtain low-dimensional primary input embeddings. The generated primary input embeddings are depicted as multi-dimensional bars (as a symbol of vector sequences) in Figure 3. Here we describe the detailed feature extraction components.

Feature extraction from the acoustic modality. For each audio clip, we use Covarep [7] to extract MFCCs (excluding the energy-related 0th coefficient), glottal source parameters, pitch-related features, and features computed using the Summation of Residual Harmonics method [9]. These features capture human voice characteristics from different perspectives and have all been shown to be relevant to emotions [12]. These 73-dimensional features are averaged over every half second.

Feature extraction from the visual modality. Since the speakers in both datasets can be highly dynamic and occluded, we capture only their faces, as Brilman et al. [5] did, to reduce noisy input. The details of face detection and recognition are in Section 4. Given each facial image, we use the VGG19 architecture [31] pre-trained on the Facial Emotion Recognition dataset FER2013 (https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/overview) and extract the 512-dimensional output before the last FC layer as the face features.

Feature extraction from the language modality. We use the Jieba Chinese text segmentation library (https://github.com/fxsjy/jieba) to segment Chinese sentences (utterances) into words. We use the Tencent Chinese embedding corpus [32] to extract 200-dimensional word embeddings. In the case of English, we extract 64-dimensional GloVe word embeddings [26] trained on all transcripts from the IQ2US debates.
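As a small illustration of this pre-processing step, the sketch below uses the jieba package to segment an utterance; the embedding lookup table is only a placeholder for the Tencent corpus and is not shown:

```python
# Minimal sketch of the Chinese language pre-processing, assuming the jieba
# package is installed; `tencent_embedding` is a hypothetical lookup table.
import jieba

utterance = "今天天气很好"                    # one utterance from a transcript
words = jieba.lcut(utterance)                 # segment the utterance into words
# vectors = [tencent_embedding[w] for w in words]   # 200-d vector per word
```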

All features are passed to a learnable FC+ReLU layer which converts the initial features into primary input embeddings. The primary input embeddings thus obtained for the three modalities are respectively $X_a \in \mathbb{R}^{d \times T_a}$, $X_v \in \mathbb{R}^{d \times T_v}$ and $X_l \in \mathbb{R}^{d \times T_l}$, where $d$ is the row-dimension of the primary input embeddings, which is the same across the different modalities, and $T_a, T_v, T_l$ denote the sequence lengths of the modalities. Note that in our primary input embeddings, the timestamps of the acoustic, visual, and language modality respectively represent a short time window, a frame, and a word.

3.2 Generating Compact Latent Embeddings of Modalities with Transformers

To get a compact representation of the primary input embeddings for each modality, we aggregate the sequence of features into a single representation vector using one Transformer encoder per modality. Transformer encoders have been shown to outperform many other deep architectures, including RNNs, GRUs, and LSTMs, in many sequential data processing tasks in computer vision [37] and natural language processing [8]. The multi-head self-attention mechanism of the Transformer better captures long-term temporal dynamics [34].

With the Transformer encoder, the primary input embedding of each modality $m \in \{a, v, l\}$ is respectively transformed into a representation as:

    $H_m = \mathrm{TransformerEncoder}_m(X_m)$    (1)

where $H_m \in \mathbb{R}^{d_h \times T_m}$, and $d_h$ is the dimension of the latent space after the Transformer encoder.

To convert arbitrary-length time sequences into standardized latent embedding vectors $z_a, z_v, z_l$, we additionally use a max pooling layer:

    $z_m = \mathrm{MaxPool}(H_m) \in \mathbb{R}^{d_h}$    (2)

$z_m$ intuitively captures the maximum activation over the time sequence along each dimension of $H_m$.
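A minimal sketch of this encoder-plus-pooling step is shown below, assuming PyTorch and the dimensions reported later in Section 5.1 (one layer, 4 heads, model dimension 16); the variable names are ours:

```python
# Sketch of Section 3.2 for one modality m: Transformer encoder + max pooling.
import torch
import torch.nn as nn

d_model, n_heads = 16, 4
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=d_model)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)

# Primary input embedding X_m: (sequence length T_m, batch size, d_model).
X_m = torch.randn(50, 8, d_model)

H_m = encoder(X_m)              # Eq. (1): contextualized sequence, same shape
z_m = H_m.max(dim=0).values     # Eq. (2): max pooling over time -> (batch, d_model)
```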

3.3 Balancing Shared and Heterogeneous Information with Adaptive Fusion

As mentioned earlier, there are two conflicting aspects of multimodal data. First, data from different modalities within the same time frames may sometimes be highly aligned (i.e., have shared information). Second, different modalities may sometimes contain diverse cues which may not be equally important for prediction. To balance the aligned and heterogeneous multimodal information, we propose a novel adaptive fusion framework consisting of two key modules: an alignment module and a heterogeneity module (shaded in yellow in Figure 3).

3.3.1 Alignment Module

To extract information shared across different modalities, we first use a shared multi-layer perceptron (MLP) to project the latent embedding of each modality into the same latent space:

    $s_m = \mathrm{MLP}_{\mathrm{shared}}(z_m), \quad m \in \{a, v, l\}$    (3)

Here, $s_m \in \mathbb{R}^{d_s}$, where $d_s$ is the dimension of the shared projection space. $\mathrm{MLP}_{\mathrm{shared}}$ is shown as three rounded grey boxes in Figure 3.

Inspired by existing multimodal representation learning work [2, 10], we use three cosine loss terms across the modalities to measure the alignment of the modalities in the shared projection space:

    $\mathcal{L}_{\mathrm{align}} = \sum_{(i,j) \in \{(a,v),(a,l),(v,l)\}} \big(1 - \cos(s_i, s_j)\big)$    (4)

During training, the alignment loss $\mathcal{L}_{\mathrm{align}}$ will be added to the overall prediction loss function as a regularization term to penalize lack of alignment between the 3 modalities in the projected space.

After the shared MLP layer, the regularized embeddings are in the same latent space. We apply mean pooling to average the three embeddings:

    $z_{\mathrm{align}} = \tfrac{1}{3}(s_a + s_v + s_l)$    (5)

$z_{\mathrm{align}}$ now contains the shared information from all modalities.
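The sketch below illustrates the alignment module (Equations (3)-(5)) in PyTorch; the exact form of the cosine loss is our reconstruction, and the FC16+ReLU shared MLP follows the configuration described in Section 5.1:

```python
# Sketch of the alignment module: shared projection, cosine alignment loss,
# and mean pooling of the projected embeddings.
import torch.nn as nn
import torch.nn.functional as F

shared_mlp = nn.Sequential(nn.Linear(16, 16), nn.ReLU())   # shared across modalities

def alignment_module(z_a, z_v, z_l):
    s_a, s_v, s_l = shared_mlp(z_a), shared_mlp(z_v), shared_mlp(z_l)   # Eq. (3)
    # Eq. (4): one cosine-distance term per modality pair.
    pairs = [(s_a, s_v), (s_a, s_l), (s_v, s_l)]
    L_align = sum(1.0 - F.cosine_similarity(x, y, dim=-1).mean() for x, y in pairs)
    z_align = (s_a + s_v + s_l) / 3.0                                   # Eq. (5)
    return z_align, L_align
```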

3.3.2 Heterogeneity Module

Another key observation discussed in Section 1 is that different modalities may contain diverse information, and therefore make unequal contributions to the final prediction of persuasiveness (e.g., due to the noisy data from certain modalities as shown in Figure 2). We therefore propose a novel heterogeneity module which utilizes an interactive training procedure (Algorithm 1) to learn weights for different modalities.

Intuitively, the importance of each modality should be inversely proportional to the “error” caused by the modality. To estimate this error term, we create three unimodal MLP reference models (represented as dashed arrows and rounded grey boxes at the central bottom of Figure 3), parameterized by $\phi_a, \phi_v, \phi_l$ for the acoustic, visual, and language modalities respectively. Each unimodal MLP takes the compact latent embedding $z_m$ generated by the Transformer encoder as input and generates a unimodal prediction $\hat{y}_m$ for each modality $m \in \{a, v, l\}$:

    $\hat{y}_m = \mathrm{MLP}_{\phi_m}(z_m)$    (6)

We use $\mathcal{D}_{\mathrm{val}}$ to denote the validation set, $y_i$ the labels, and $\hat{y}_{m,i}$ the predictions made by the unimodal reference model for modality $m$. The reference models (the $\phi_m$'s) are updated using the following Mean Squared Error (MSE) loss alone:

    $\ell_m = \frac{1}{|\mathcal{D}_{\mathrm{val}}|} \sum_{i \in \mathcal{D}_{\mathrm{val}}} (y_i - \hat{y}_{m,i})^2$    (7)

After several epochs of training the $\phi_m$'s, we obtain a converged MSE loss for each reference model. We then use the updated reference models to estimate the prediction errors via $\ell_m$. $\ell_m$ is used to guide the weights $w_m$ of the latent embeddings ($z_a, z_v, z_l$) to be concatenated in the heterogeneity module:

    $z_{\mathrm{het}} = [\, w_a z_a \,;\; w_v z_v \,;\; w_l z_l \,]$    (8)

The weights $w_a, w_v, w_l$ are scalars incrementally updated over epochs:

    $w_m \leftarrow (1 - \eta)\, w_m + \eta\, \tilde{w}_m$    (9)

where $\eta$ controls the rate of update, and $\tilde{w}_m$ is obtained using the following softmax function of the reference model validation losses:

    $\tilde{w}_m = \frac{\exp(-\beta\, \ell_m)}{\sum_{m' \in \{a,v,l\}} \exp(-\beta\, \ell_{m'})}$    (10)

$\beta$ is a scaling factor. Since $\sum_{m} \tilde{w}_m = 1$, combining this with Equation (9), it is guaranteed that $\sum_{m} w_m = 1$.
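The weight update of Equations (9)-(10) can be sketched as follows; it is a minimal, self-contained illustration in which eta and beta stand for the update rate and the scaling factor:

```python
# Sketch of the heterogeneity module's weight update and weighted fusion.
import math
import torch

def update_weights(w, val_losses, eta=0.1, beta=1.0):
    """w and val_losses are dicts keyed by modality in {'a', 'v', 'l'}."""
    # Eq. (10): softmax over negatively scaled validation losses, so the
    # modality with the smallest unimodal loss gets the largest target weight.
    exps = {m: math.exp(-beta * val_losses[m]) for m in w}
    total = sum(exps.values())
    w_tilde = {m: e / total for m, e in exps.items()}
    # Eq. (9): incremental update; sum(w) stays 1 if it starts at 1.
    return {m: (1 - eta) * w[m] + eta * w_tilde[m] for m in w}

def heterogeneity_fusion(z, w):
    # Eq. (8): weighted concatenation of the latent unimodal embeddings.
    return torch.cat([w['a'] * z['a'], w['v'] * z['v'], w['l'] * z['l']], dim=-1)
```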

3.3.3 Adaptive Fusion with Interactive Training

The representations obtained from the alignment module ($z_{\mathrm{align}}$) and the heterogeneity module ($z_{\mathrm{het}}$) are then concatenated together with the debate meta-data $u$ and fed into a final MLP layer to make the final prediction $\hat{y}$:

    $\hat{y} = \mathrm{MLP}_{\theta}\big([\, z_{\mathrm{align}} \,;\; z_{\mathrm{het}} \,;\; u \,]\big)$    (11)

where $\theta$ is the set of parameters of the M2P2 model excluding the reference model parameters $\phi_a, \phi_v, \phi_l$.

To train the M2P2 model, we have two loss terms: the novel alignment loss $\mathcal{L}_{\mathrm{align}}$, and a persuasiveness loss term $\mathcal{L}_{\mathrm{pers}}$. In the case of the IPP problem, $\mathcal{L}_{\mathrm{pers}}$ is the MSE loss. In the case of DOP, we use the cross-entropy loss for the binary classification. The total loss function is a weighted combination:

    $\mathcal{L} = \mathcal{L}_{\mathrm{pers}} + \alpha\, \mathcal{L}_{\mathrm{align}}$    (12)

where $\alpha$ is a weight factor.

The entire training proceeds in a master-slave manner, as shown in Algorithm 1. In each epoch of the master procedure, we use the total loss function in Equation (12) to update the parameters $\theta$ of the main M2P2 components. The weights of the 3 modalities are obtained using the reference models $\phi_a, \phi_v, \phi_l$, whose losses are updated in the slave procedure. In each epoch of the slave procedure, we take the latent embeddings $z_a, z_v, z_l$ from the master procedure as input and update the reference models with the loss function in Equation (7). We then obtain the weights of the different modalities used in the heterogeneity module.

Input: Training dataset $\mathcal{D}_{\mathrm{train}}$, validation dataset $\mathcal{D}_{\mathrm{val}}$; numbers of epochs $N$ and $n$
Output: Multi-modality model $\theta$, modality weights $(w_a, w_v, w_l)$
Initialize three unimodal reference models $\phi_a, \phi_v, \phi_l$; initialize the weights $(w_a, w_v, w_l)$
% Master Procedure Start
for epoch = 1, …, N do
      Update $\theta$ with the loss function in Equation (12)
      Get latent embeddings $z_a, z_v, z_l$
      % Slave Procedure Start
      for epoch = 1, …, n do
            Update $\phi_a, \phi_v, \phi_l$ with the loss function in Equation (7)
      end for
      % Slave Procedure End
      Get reference model losses $\ell_a, \ell_v, \ell_l$
      Update modality importance weights $(w_a, w_v, w_l)$ using Equations (9)-(10)
end for
% Master Procedure End
return $\theta$, $(w_a, w_v, w_l)$
Algorithm 1 M2P2 interactive training procedure

4 Datasets

We describe our two datasets below.

4.1 QPS Dataset

We created the QPS dataset by collecting videos from the popular Chinese TV debate show Qipashuo (an example can be found at https://youtu.be/P5ehhs0hpFI). In each episode of the TV show, 100 audience members initially vote ‘for’ or ‘against’ a given debate topic. Debaters from the ‘for’ and ‘against’ teams speak alternately, and the audience can change their votes at any time. In general, there are 6–10 speech turns. Final votes are turned in after the last speaker. The winner is the team which has more votes at the end than at the beginning. For example, if the initial and final ‘for’ vs. ‘against’ votes are 30:70 and 40:60, respectively, then the ‘for’ team wins because they increased their votes from 30 to 40 (even though they still have fewer votes than the ‘against’ team). In total, we collected videos of 21 Qipashuo episodes with 205 speaking clips spanning a total of 582 minutes.

We extracted the transcripts from the video frames using Baidu’s OCR API (https://ai.baidu.com/tech/ocr). We sample 2 frames per second and binarize the images with a threshold of 0.6. We cluster the binarized images into buckets such that any two binarized images in the same bucket are identical on 90% or more of their pixels, and we then randomly select one image to represent each cluster. This helps reduce noise (e.g. from advertisements displayed on the image). Finally, the surviving binary images are fed into the OCR API to get accurate transcripts.
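A minimal sketch of this de-duplication step is given below, assuming NumPy and grayscale frames scaled to [0, 1]; the 0.6 threshold and the 90% pixel-identity criterion come from the text above, while the function names and the greedy bucketing are illustrative:

```python
import numpy as np

def binarize(gray_frame, threshold=0.6):
    """gray_frame: 2-D array with values in [0, 1]."""
    return (gray_frame > threshold).astype(np.uint8)

def same_bucket(img1, img2, min_agreement=0.9):
    """Two binarized frames share a bucket if they agree on >= 90% of pixels."""
    return (img1 == img2).mean() >= min_agreement

def deduplicate(frames):
    buckets = []
    for frame in frames:
        b = binarize(frame)
        for bucket in buckets:
            if same_bucket(b, bucket[0]):
                bucket.append(b)
                break
        else:
            buckets.append([b])
    # One randomly chosen representative per bucket is sent to the OCR API.
    return [bucket[np.random.randint(len(bucket))] for bucket in buckets]
```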

If we take each speaking clip as a train/test instance, there would be a total of only 205 data points. This paucity of data poses a huge challenge for machine learning. We therefore segment each speaking clip into sub-clips of 50 utterances each, according to the transcript extracted above. Note that 50 is the smallest number of utterances in any speaking clip of our dataset. Moreover, note that these “sub-clips” of 50 utterances yield a temporal sequence whose temporal dynamics can be important. Labels are shared across segments extracted from the same clip. This trick yields 2297 segments, which are used as train/test instances in our evaluation.

As the speakers are highly dynamic and often occluded, we only use speakers’ faces as the visual input. We extract 2 frames per second from the videos and use Dlib (http://dlib.net) for face detection and recognition. The recognition is based on one pre-annotated profile for each speaker and is only needed for training. To further reduce false positives (i.e., extracting the faces of non-speakers), we first use the model from [3] to remove faces in the frame that do not belong to the current speaker, and then use the method from [23] for face tracking.

4.2 IQ2US

We also evaluate M2P2 on the benchmark IQ2US TV debate dataset used by [5, 28, 39, 27, 36]. This dataset was originally collected by [5]. The audience can only vote at the beginning and at the end of the debate, and the winner is determined in the same way as in Qipashuo. Note that we cannot use the same set of videos as [5], since they were interested in predicting the result of the whole debate, which doesn’t require the transcripts to be aligned within shorter clips. Of the 100 episodes we collected, only 58 had transcripts that were correctly aligned with the visual modality at the minute level. Finally, we get 852 one-minute single-speaker clip instances from the 58 episodes — 428 of them belong to the winning side. As transcripts are available in the IQ2US data, no pre-processing is required for the language modality in this dataset. For the visual modality, we use the same procedures as in the QPS dataset to extract the face image sequences of the speakers. Since there are no intermediate votes in IQ2US, we only predict the debate outcome (i.e. whether a single-speaker clip instance belongs to the winning team).

Fold 1 2 3 4 5 6 7 8 9 10 Average
Brilman et al. [5] 0.009 0.011 0.016 0.017 0.030 0.018 0.020 0.012 0.013 0.018 0.016
Nojavanasghari et al. [24] 0.007 0.015 0.019 0.011 0.027 0.014 0.020 0.012 0.020 0.015 0.016
Santos et al. [29] 0.025 0.019 0.018 0.019 0.018 0.017 0.029 0.016 0.024 0.018 0.020
M2P2 (proposed method) 0.006 0.010 0.015 0.015 0.020 0.015 0.012 0.009 0.009 0.013 0.012
MSE decrease (%) 14.2 9.1 6.3 -36.4 -11.1 -7.1 40.0 25.0 30.8 13.3 25.0
Table 1: MSE for each test fold of different approaches to solving the Intensity of Persuasion Prediction problem on the QPS dataset. The last row shows the MSE decrease percentage of M2P2 compared to the best baseline in each fold. On average, M2P2 achieves a lower MSE than the baselines by at least 25%. Results are statistically significant. Note that the vote scores we predict range from 0 to 1.
Method Accuracy
Brilman et al. [5] 0.614
Nojavanasghari et al. [24] 0.615
Santos et al. [29] 0.598
M2P2 (proposed method) 0.635
Table 2: Prediction accuracy for Debate Outcome Prediction on the IQ2US dataset. Our M2P2 is 2%–3.7% better than the baselines. Results are statistically significant.

5 Experimental Evaluations

Our experiments assess the performance of M2P2 on the DOP and IPP tasks. Specifically:

  1. (IPP) We predict the number of votes after a speech by a debater — this is done on the QPS dataset;

  2. (DOP) We predict whether a clip in which a debater is speaking is part of the winning team of the debate — this is done on the IQ2US dataset.

In addition, we also conduct an ablation study that assesses the contributions of different components of M2P2. Finally, we assess the importance of different modalities as well as time frames using the QPS dataset.

5.1 Experimental Settings

QPS uses a 10-fold rolling-window prediction. Specifically, we construct 10 sequences of consecutive episodes of the show. For instance, if $E_1, \dots, E_{21}$ represent the set of all QPS episodes, then one sequence would be $E_1, \dots, E_k$, another would be $E_2, \dots, E_{k+1}$, and so on. For any such sequence, we set the last episode as the test episode (i.e. the episode on which we make predictions). We learn a model from the earlier episodes in the sequence and identify the best parameters for our model by using the remaining episodes (excluding the test episode) as the validation set. As the same subject can occur in multiple episodes of QPS, in order to avoid information leakage from training to test data, we never train a model on later episodes in order to predict an earlier one.
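The rolling-window protocol can be sketched as follows; the window length and the train/validation split inside each window are illustrative assumptions rather than the exact configuration used in the paper:

```python
# Sketch of the 10-fold rolling-window splits over 21 QPS episodes.
episodes = [f"E{i}" for i in range(1, 22)]    # E1 ... E21
window = 12                                    # assumed window length

folds = []
for start in range(10):                        # 10 rolling windows
    seq = episodes[start:start + window]
    folds.append({
        "train": seq[:-2],     # earlier episodes: model training
        "val":   seq[-2:-1],   # episode(s) before the test episode: model selection
        "test":  seq[-1:],     # last episode of the window: prediction target
    })

# No fold ever trains on an episode that airs after its test episode.
```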

For IQ2US, 10-fold cross validation is used since a debater can only appear in one episode. The initial vote score and speaking length features are normalized to $[0, 1]$.

Denote by FCn a fully-connected layer that outputs n-dimensional vectors. The MLPs in the reference models and in the final multimodal prediction model are all configured as FC16+ReLU, FC8+ReLU, and FC1+Sigmoid. The shared MLP in the alignment module is FC16+ReLU. M2P2 uses Batch Normalization [17] right after each of the FC layers for the input embeddings, and uses dropout [14] with rate 0.4 after all FC16 layers. For the Transformer encoder, we use a single layer with 4 heads, where the input, hidden, and output dimensions are all 16. We use the Adam [21] optimizer with weight decay. The numbers of epochs $N$ and $n$ in Algorithm 1, the learning rate, the alignment loss weight $\alpha$, the update rate $\eta$, and the scaling factor $\beta$ are finalized by grid search, using the values that yield the best results on the validation sets.
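For concreteness, the sketch below shows one way to express the FC16+ReLU, FC8+ReLU, FC1+Sigmoid configuration in PyTorch; the input dimension is a placeholder, and the dropout placement follows the description above:

```python
import torch.nn as nn

def prediction_mlp(in_dim, p_drop=0.4):
    """MLP used for the reference models and the final multimodal predictor."""
    return nn.Sequential(
        nn.Linear(in_dim, 16), nn.ReLU(), nn.Dropout(p_drop),   # FC16 + ReLU (+ dropout)
        nn.Linear(16, 8), nn.ReLU(),                            # FC8 + ReLU
        nn.Linear(8, 1), nn.Sigmoid(),                          # FC1 + Sigmoid
    )
```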

5.2 Comparison with Baselines

We compare both tasks with the following multimodal persuasion prediction baselines: SVM + majority vote [5], deep multimodal fusion [24], and LSTM [29].

In the case of the IPP problem, we adapt the first baseline by modifying it to use an SVM regressor (rather than an SVM classifier) followed by late fusion. For the other two baselines, we use the MSE loss. For fairness, we also allow the baselines to use the two debate meta-data features. The results comparing M2P2 on IPP and DOP with past approaches are shown in Tables 1 and 2, respectively.

IPP Problem. Table 1 shows the MSE obtained by each approach in each fold, along with the average, on the QPS dataset. Note that the vote scores we predict are normalized to lie in the $[0,1]$ interval. The last line of Table 1 shows the MSE decrease percentage, which is defined as $1 - \mathrm{MSE}(\text{M2P2}) / \mathrm{MSE}(\text{best baseline})$. For instance, from the first column of Table 1, we see that the percentage decrease is 14.2%, i.e., M2P2 reduces the MSE by about 14% compared to the best of the baselines. In the case of IPP, we see that on average M2P2 yields a 25% decrease of MSE compared with the best baseline, which is statistically significant via a Student t-test. Moreover, M2P2 is more robust and performs better than all baselines in 7 out of 10 folds.

DOP Problem. Table 2 shows the average prediction accuracy over 10 folds on the DOP problem w.r.t. the IQ2US dataset. It is clear that M2P2 achieves 2%–3.7% higher average accuracy than the baselines, and the improvement is statistically significant. These results make M2P2 the best performing system for both the IPP and the DOP problems.

Method MSE
M2P2 without alignment loss 0.018
M2P2 without reference models 0.015
M2P2-LSTM 0.032
M2P2-Acoustic 0.017
M2P2-Visual 0.019
M2P2-Language 0.016
M2P2 0.012
Table 3: Ablation study results. All improvements are statistically significant.

5.3 Ablation Study

To measure the contributions of the different components of M2P2, we create four methods, each with one component removed from M2P2:

  • M2P2 without the alignment loss.

  • M2P2 without reference models. The latent embeddings are concatenated with equal weights (i.e., $w_a = w_v = w_l$).

  • M2P2-LSTM. The Transformer encoder and max pooling layer are replaced by a 1-layer LSTM.

  • M2P2-unimodal. We input a single modality without alignment loss and latent embedding concatenation. That is, the latent embedding is directly concatenated with the debate meta-feature and fed to the final MLP.

IPP Problem. Table 3 shows the average MSE obtained on the QPS dataset for M2P2 and the ablated variants above. First, according to rows 1 and 2 and the last row, we find that if M2P2 does not use the alignment loss or the reference models, the MSE increases from 0.012 to 0.018 and 0.015 respectively. These differences are statistically significant and hence show the power of both proposed adaptive fusion modules from Section 3.3. Second, we observe the power of the multi-head attention Transformer encoder in handling long sequences, as the M2P2-LSTM variant achieves the worst MSE of all methods. Third, we observe from rows 4-6 that the language modality is the most important for the prediction task, while the acoustic and visual modalities are less important. This observation is consistent with the modality concatenation weights shown in the following subsection.

5.4 Visualization of Prediction

In this experiment, we show (1) the importance of the modalities through their learned weights (cf. Equation (8)), and (2) examples of the learned temporal attention weights from different modalities.

Modality weights. We report the modality weights in the heterogeneity module of the trained M2P2 across all folds of the QPS dataset. Figure 4 shows box plots for the three modalities. The language modality is the most important and the most robust over all folds, with a median weight of 0.42, while the acoustic and visual modalities have lower median weights.

Temporal attention weights. We visualize the temporal attention weights of two sample sequences from the visual (Figure 5) and language (Figure 6) modalities. For each timestamp, we compute its attention weight by averaging, over all heads and all query timestamps, the attention directed towards it. In Figure 5 (top), the man’s face is not detected correctly in frames 3 and 6; M2P2 assigns near-zero attention weights to both frames, suggesting that these frames should be ignored. Moreover, the happy expression in frame 2 gets a high attention weight. The woman below gets high attention weights when she actively talks to someone (frames 2, 4, 5). In Figure 6, we notice that meaningful keywords like ‘wear’, ‘shackle’, ‘passive’, and ‘hold’ also get high attention weights. Therefore, M2P2 captures meaningful long-range temporal dynamics with the help of the Transformer encoder.
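A minimal sketch of this aggregation is shown below, assuming an attention tensor of shape (heads, T, T) in which entry (h, i, j) is head h's attention from query position i to key position j:

```python
import torch

def per_timestep_attention(attn):
    # Average over all heads and all query positions "towards" each timestamp j.
    return attn.mean(dim=(0, 1))                            # -> vector of length T

attn = torch.softmax(torch.randn(4, 50, 50), dim=-1)        # toy example: 4 heads, T = 50
weights = per_timestep_attention(attn)                       # one weight per timestamp
```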

Figure 4: Modality weights in the heterogeneity module.
Figure 5: Temporal attention of visual modality – color coded as blue. Darker color implies higher attention weight.
Figure 6: Temporal attention of language modality – color coded as red. Darker color implies higher attention weight. The original Chinese transcripts are translated to English.

6 Conclusion

In this paper, we have solved two problems. First, we provide a solution to the Debate Outcome Prediction (DOP) problem that improves on past work by 2%–3.7%. Though these numbers are not huge, they are statistically significant. Second, we are the first to pose and solve the Intensity of Persuasion Prediction (IPP) problem. We show that we are able to beat baselines built on top of past solutions to IPP by 25% on average. Our proposed M2P2 framework leverages both the common and modality-specific information contained in multimodal sequence data (audio, video, language), while learning to focus attention on the meaningful part of the data. Moreover, our newly created QPS dataset provides a valuable new asset for future research — it will be released upon publication of this paper.

However, there is ample scope for future work. First, we do not provide any theoretical guarantees on the convergence of modality weights. Second, more scalable methods to capture cross-modal interaction would be very valuable.

It is important to note that the adaptive fusion technique in M2P2 can be generalized to other multimodal sequence prediction problems such as video question answering and video sentiment analysis. We leave this exploration for future work. In other future work, we plan to conduct semantic-level studies to gain knowledge of the persuasive attributes (e.g. are high pitch, positive sentiment, attractive faces more persuasive?). One can also explore richer primary input modality embeddings (e.g. body pose, context-related word embeddings).

References

  • [1] Gustavo Aguilar, Viktor Rozgic, Weiran Wang, and Chao Wang. Multimodal and multi-view models for emotion recognition. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 991–1002, Florence, Italy, July 2019. Association for Computational Linguistics.
  • [2] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In International conference on machine learning, pages 1247–1255, 2013.
  • [3] Chongyang Bai, Srijan Kumar, Jure Leskovec, Miriam Metzger, Jay F. Nunamaker, and V. S. Subrahmanian. Predicting the visual focus of attention in multi-person discussion videos. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 4504–4510. International Joint Conferences on Artificial Intelligence Organization, 7 2019.
  • [4] Hedi Ben-Younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. Mutan: Multimodal tucker fusion for visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2612–2620, 2017.
  • [5] Maarten Brilman and Stefan Scherer. A multimodal predictive model of successful debaters or how i learned to sway votes. In Proceedings of the 23rd ACM international conference on Multimedia, pages 149–158. ACM, 2015.
  • [6] Remi Cadene, Hedi Ben-Younes, Matthieu Cord, and Nicolas Thome. Murel: Multimodal relational reasoning for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1989–1998, 2019.
  • [7] Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. Covarep—a collaborative voice analysis repository for speech technologies. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 960–964. IEEE, 2014.
  • [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [9] Thomas Drugman and Abeer Alwan. Joint robust voicing detection and pitch estimation based on residual harmonics. In Twelfth Annual Conference of the International Speech Communication Association, 2011.
  • [10] Sri Harsha Dumpala, Imran Sheikh, Rupayan Chakraborty, and Sunil Kumar Kopparapu. Audio-visual fusion for sentiment classification using cross-modal autoencoder. NIPS, 2019.
  • [11] Martin Engilberge, Louis Chevallier, Patrick Pérez, and Matthieu Cord. Finding beans in burgers: Deep semantic-visual embedding with localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3984–3993, 2018.
  • [12] Sayan Ghosh, Eugene Laksana, Louis-Philippe Morency, and Stefan Scherer. Representation learning for speech emotion recognition. In Interspeech, pages 3603–3607, 2016.
  • [13] Ivan Habernal and Iryna Gurevych. Which argument is more convincing? analyzing and predicting convincingness of web arguments using bidirectional lstm. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1589–1599, 2016.
  • [14] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  • [15] Chiori Hori, Takaaki Hori, Teng-Yok Lee, Ziming Zhang, Bret Harsham, John R Hershey, Tim K Marks, and Kazuhiko Sumi. Attention-based multimodal fusion for video description. In Proceedings of the IEEE international conference on computer vision, pages 4193–4202, 2017.
  • [16] Xinyue Huang and Adriana Kovashka. Inferring visual persuasion via body language, setting, and deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 73–79, 2016.
  • [17] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [18] Dae Ung Jo, ByeongJu Lee, Jongwon Choi, Haanju Yoo, and Jin Young Choi. Cross-modal variational auto-encoder with distributed latent spaces and associators. arXiv preprint arXiv:1905.12867, 2019.
  • [19] Jungseock Joo, Weixin Li, Francis F Steen, and Song-Chun Zhu. Visual persuasion: Inferring communicative intents of images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 216–223, 2014.
  • [20] Jin-Hwa Kim, Kyoung Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard product for low-rank bilinear pooling. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
  • [21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [22] Xiang Long, Chuang Gan, Gerard De Melo, Xiao Liu, Yandong Li, Fu Li, and Shilei Wen. Multimodal keyless attention fusion for video classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [23] M. J. Marin-Jimenez, V. Kalogeiton, P. Medina-Suarez, and A. Zisserman. LAEO-Net: revisiting people Looking At Each Other in videos. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [24] Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrušaitis, and Louis-Philippe Morency. Deep multimodal fusion for persuasiveness prediction. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 284–288. ACM, 2016.
  • [25] Yannis Panagakis, Mihalis A Nicolaou, Stefanos Zafeiriou, and Maja Pantic. Robust correlated and individual component analysis. IEEE transactions on pattern analysis and machine intelligence, 38(8):1665–1678, 2015.
  • [26] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • [27] Peter Potash and Anna Rumshisky. Towards debate automation: a recurrent model for predicting debate winners. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2465–2475, 2017.
  • [28] Pedro Bispo Santos, Lisa Beinborn, and Iryna Gurevych. A domain-agnostic approach for opinion prediction on speech. In Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES), pages 163–172, 2016.
  • [29] Pedro Bispo Santos and Iryna Gurevych. Multimodal prediction of the audience’s impression in political debates. In Proceedings of the 20th International Conference on Multimodal Interaction: Adjunct, page 6. ACM, 2018.
  • [30] Amir Shahroudy, Tian-Tsong Ng, Yihong Gong, and Gang Wang. Deep multimodal feature analysis for action recognition in RGB+D videos. IEEE transactions on pattern analysis and machine intelligence, 40(5):1045–1058, 2017.
  • [31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [32] Yan Song, Shuming Shi, Jing Li, and Haisong Zhang. Directional skip-gram: Explicitly distinguishing left and right context for word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 175–180, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
  • [33] Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Florence, Italy, 7 2019. Association for Computational Linguistics.
  • [34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [35] Sunny Verma, Chen Wang, Liming Zhu, and Wei Liu. Deepcu: integrating both common and unique latent information for multimodal sentiment analysis. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 3627–3634. AAAI Press, 2019.
  • [36] Lu Wang, Nick Beauchamp, Sarah Shugars, and Kechen Qin. Winning on the merits: The joint effects of content and style on debate outcomes. Transactions of the Association for Computational Linguistics, 5:219–232, 2017.
  • [37] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
  • [38] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1103–1114, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics.
  • [39] Justine Zhang, Ravi Kumar, Sujith Ravi, and Cristian Danescu-Niculescu-Mizil. Conversational flow in oxford-style debates. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 136–141, 2016.