Learning Representations from Imperfect Time Series Data via Tensor Rank Regularization

07/01/2019 ∙ by Paul Pu Liang, et al. ∙ Carnegie Mellon University

There has been an increased interest in multimodal language processing including multimodal dialog, question answering, sentiment analysis, and speech recognition. However, naturally occurring multimodal data is often imperfect as a result of imperfect modalities, missing entries or noise corruption. To address these concerns, we present a regularization method based on tensor rank minimization. Our method is based on the observation that high-dimensional multimodal time series data often exhibit correlations across time and modalities which leads to low-rank tensor representations. However, the presence of noise or incomplete values breaks these correlations and results in tensor representations of higher rank. We design a model to learn such tensor representations and effectively regularize their rank. Experiments on multimodal language data show that our model achieves good results across various levels of imperfection.


1 Introduction

Analyzing multimodal language sequences spans various fields including multimodal dialog (Das et al., 2017; Rudnicky, 2005), question answering (Antol et al., 2015; Tapaswi et al., 2015; Das et al., 2018), sentiment analysis (Morency et al., 2011), and speech recognition (Palaskar et al., 2018). Generally, these multimodal sequences contain heterogeneous sources of information across the language, visual, and acoustic modalities. For example, when instructing robots, these machines have to comprehend our verbal instructions and interpret our nonverbal behaviors while grounding these inputs in their visual sensors (Schmerling et al., 2017; Iba et al., 2005). Likewise, comprehending human intentions requires integrating human language, speech, facial behaviors, and body postures (Mihalcea, 2012; Rossiter, 2011). However, while additional modalities are often required for improved performance, they also bring the challenge of imperfect data, where data might be 1) incomplete due to mismatched modalities or sensor failure, or 2) corrupted with random or structured noise. As a result, an important research question involves learning robust representations from imperfect multimodal data.

Figure 1: Clean multimodal time series data (in shades of green) exhibits correlations across time and across modalities, leading to redundancy in low rank tensor representations. On the other hand, the presence of imperfect entries (in gray, blue, and red) breaks these correlations and leads to higher rank tensors. In these scenarios, we use tensor rank regularization to learn tensors that more accurately represent the true correlations and latent structures in multimodal data.

Recent research in both unimodal and multimodal learning has investigated the use of tensors for representation learning (Anandkumar et al., 2014). Given representations from $M$ modalities, the order-$M$ outer product tensor is a natural representation for all possible interactions between the modality dimensions (Liu et al., 2018). In this paper, we propose a model called the Temporal Tensor Fusion Network (T2FN) that builds tensor representations from multimodal time series data. T2FN learns a tensor representation that captures multimodal interactions across time. A key observation is that tensors built from clean data tend to be low-rank, since high-dimensional real-world data is often generated from lower-dimensional latent structures (Lakshmanan et al., 2015). Furthermore, clean multimodal time series data exhibits correlations across time and across modalities (Yang et al., 2017; Hidaka and Yu, 2010). This leads to redundancy in these overparametrized tensors, which explains their low rank (Figure 1). On the other hand, the presence of noise or incomplete values breaks these natural correlations and leads to higher rank tensor representations. As a result, we can use tensor rank minimization to learn tensors that more accurately represent the true correlations and latent structures in multimodal data, thereby alleviating imperfection in the input. With these insights, we show how to integrate tensor rank minimization as a simple regularizer for training in the presence of imperfect data. As compared to previous work on imperfect data (Sohn et al., 2014; Srivastava and Salakhutdinov, 2014; Pham et al., 2019), our model does not need to know which of the entries or modalities are imperfect beforehand. Our model combines the strength of temporal non-linear transformations of multimodal data with a simple regularization technique on tensor structures. We perform experiments on multimodal video data consisting of humans expressing their opinions using a combination of language and nonverbal behaviors. Our results back up our intuition that imperfect data increases tensor rank. Finally, we show that our model achieves good results across various levels of imperfection.

2 Related Work

Tensor Methods: Tensor representations have been used for learning discriminative representations in unimodal and multimodal tasks. Tensors are powerful because they can capture important higher order interactions across time, feature dimensions, and multiple modalities (Kossaifi et al., 2017). For unimodal tasks, tensors have been used for part-of-speech tagging (Srikumar and Manning, 2014), dependency parsing (Lei et al., 2014), word segmentation (Pei et al., 2014), question answering (Qiu and Huang, 2015), and machine translation (Setiawan et al., 2015). For multimodal tasks, Huang et al. (2017) used tensor products between images and text features for image captioning. A similar approach was proposed to learn representations across text, visual, and acoustic features to infer speaker sentiment (Liu et al., 2018; Zadeh et al., 2017). Other applications include multimodal machine translation (Delbrouck and Dupont, 2017), audio-visual speech recognition (Zhang et al., 2017), and video semantic analysis (Wu et al., 2009; Gao et al., 2009).

Imperfect Data: In order to account for imperfect data, several works have proposed generative approaches for multimodal data (Sohn et al., 2014; Srivastava and Salakhutdinov, 2014). Recently, neural models such as cascaded residual autoencoders (Tran et al., 2017), deep adversarial learning (Cai et al., 2018), or translation-based learning (Pham et al., 2019) have also been proposed. However, these methods often require knowing which of the entries or modalities are imperfect beforehand. While there has been some work on using low-rank tensor representations for imperfect data (Chang et al., 2017; Fan et al., 2017; Chen et al., 2017; Long et al., 2018; Nimishakavi et al., 2018), our approach is the first to integrate rank minimization with neural networks for multimodal language data, thereby combining the strength of non-linear transformations with the mathematical foundations of tensor structures.

3 Proposed Method

In this section, we present our method for learning representations from imperfect human language across the language, visual, and acoustic modalities. In §3.1, we discuss some background on tensor ranks. We outline our method for learning tensor representations via a model called Temporal Tensor Fusion Network (T2FN) in §3.2. In §3.3, we investigate the relationship between tensor rank and imperfect data. Finally, in §3.4, we show how to regularize our model using tensor rank minimization.

We use lowercase letters $x$ to denote scalars, boldface lowercase letters $\mathbf{x}$ to denote vectors, and boldface capital letters $\mathbf{X}$ to denote matrices. Tensors, which we denote by calligraphic letters $\mathcal{X}$, are generalizations of matrices to multidimensional arrays. An order-$M$ tensor has $M$ dimensions, $\mathcal{X} \in \mathbb{R}^{d_1 \times \cdots \times d_M}$. We use $\otimes$ to denote the outer product between vectors.

3.1 Background: Tensor Rank

The rank of a tensor measures how many vectors are required to reconstruct the tensor. Simple tensors that can be represented as outer products of a few vectors have low rank, while complex tensors have higher rank. To be more precise, we define the rank of a tensor using Canonical Polyadic (CP) decomposition (Carroll and Chang, 1970). For an order-$M$ tensor $\mathcal{X} \in \mathbb{R}^{d_1 \times \cdots \times d_M}$, there exists an exact decomposition into vectors $\mathbf{w}_r^{(m)} \in \mathbb{R}^{d_m}$:

$$\mathcal{X} = \sum_{r=1}^{R} \mathbf{w}_r^{(1)} \otimes \mathbf{w}_r^{(2)} \otimes \cdots \otimes \mathbf{w}_r^{(M)}. \tag{1}$$

The minimal $R$ for which this exact decomposition holds is called the rank of the tensor. The vectors $\{\mathbf{w}_r^{(m)}\}$ are called the rank decomposition factors of $\mathcal{X}$.
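To make equation (1) concrete, the following is a minimal NumPy sketch (the helper name and dimensions are ours, chosen purely for illustration) that builds an order-3 tensor of rank at most $R$ by summing $R$ vector outer products:

```python
import numpy as np

def random_low_rank_tensor(d1, d2, d3, R, seed=0):
    """Build an order-3 tensor as a sum of R vector outer products (equation (1))."""
    rng = np.random.default_rng(seed)
    X = np.zeros((d1, d2, d3))
    for _ in range(R):
        w1, w2, w3 = rng.normal(size=d1), rng.normal(size=d2), rng.normal(size=d3)
        # outer product of three vectors: w1 (x) w2 (x) w3
        X += np.einsum('i,j,k->ijk', w1, w2, w3)
    return X

X = random_low_rank_tensor(8, 8, 8, R=3)
print(X.shape)  # (8, 8, 8); rank at most 3 by construction
```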

Figure 2: The Temporal Tensor Fusion Network (T2FN) creates a tensor $\mathcal{M}$ from temporal data. The rank of $\mathcal{M}$ increases with imperfection in the data, so we regularize our model by minimizing an upper bound on the rank of $\mathcal{M}$.

3.2 Multimodal Tensor Representations

Our model for creating tensor representations is called the Temporal Tensor Fusion Network (T2FN), which extends the Tensor Fusion Network (TFN) of Zadeh et al. (2017) to include a temporal component. We show that T2FN increases the capacity of TFN to capture high-rank tensor representations, which in turn leads to improved prediction performance. More importantly, our knowledge about tensor rank properties allows us to regularize our model effectively for imperfect data.

We begin with time series data from the language, visual, and acoustic modalities, denoted as $\mathbf{x}_\ell$, $\mathbf{x}_v$, and $\mathbf{x}_a$ respectively. We first use Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) to encode the temporal information from each modality, resulting in sequences of hidden representations $\mathbf{h}_\ell$, $\mathbf{h}_v$, and $\mathbf{h}_a$. Similar to prior work which found tensor representations to capture higher-order interactions from multimodal data (Liu et al., 2018; Zadeh et al., 2017; Fukui et al., 2016), we form a tensor via outer products of the individual representations through time (as shown in Figure 2):

$$\mathcal{M} = \sum_{t=1}^{T} \begin{bmatrix} \mathbf{h}_\ell^t \\ 1 \end{bmatrix} \otimes \begin{bmatrix} \mathbf{h}_v^t \\ 1 \end{bmatrix} \otimes \begin{bmatrix} \mathbf{h}_a^t \\ 1 \end{bmatrix}, \tag{2}$$

where we append a 1 to each hidden representation so that unimodal, bimodal, and trimodal interactions are all captured, as described in Zadeh et al. (2017). $\mathcal{M}$ is our multimodal representation, which can then be used to predict the label using a fully connected layer. Observe how the construction of $\mathcal{M}$ closely resembles equation (1) as a sum of vector outer products. As compared to TFN, which uses a single outer product to obtain a multimodal tensor of rank one, T2FN creates a tensor of higher rank (upper bounded by $T$, the number of time steps). As a result, the notion of rank naturally emerges when we reason about the properties of $\mathcal{M}$.
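For illustration, here is a minimal PyTorch sketch of the construction in equation (2). This is not the authors' released implementation; the class name, hidden sizes, and the flatten-plus-linear prediction head are our assumptions:

```python
import torch
import torch.nn as nn

class T2FNSketch(nn.Module):
    """Minimal sketch of the Temporal Tensor Fusion Network (equation (2))."""
    def __init__(self, d_l, d_v, d_a, d_h, num_classes=1):
        super().__init__()
        self.lstm_l = nn.LSTM(d_l, d_h, batch_first=True)
        self.lstm_v = nn.LSTM(d_v, d_h, batch_first=True)
        self.lstm_a = nn.LSTM(d_a, d_h, batch_first=True)
        self.fc = nn.Linear((d_h + 1) ** 3, num_classes)

    def forward(self, x_l, x_v, x_a):
        h_l, _ = self.lstm_l(x_l)   # (batch, T, d_h)
        h_v, _ = self.lstm_v(x_v)
        h_a, _ = self.lstm_a(x_a)
        ones = torch.ones(*h_l.shape[:2], 1, device=h_l.device)
        # append a 1 so unimodal, bimodal, and trimodal interactions are captured
        h_l = torch.cat([h_l, ones], dim=-1)
        h_v = torch.cat([h_v, ones], dim=-1)
        h_a = torch.cat([h_a, ones], dim=-1)
        # sum over time of outer products: (batch, d_h+1, d_h+1, d_h+1)
        M = torch.einsum('bti,btj,btk->bijk', h_l, h_v, h_a)
        return self.fc(M.flatten(start_dim=1)), M
```

The returned tensor M is used both for prediction and for the rank regularizer described in §3.4.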

3.3 How Does Imperfection Affect Rank?

We first state several observations about the rank of the multimodal representation $\mathcal{M}$:

1) The rank of $\mathcal{M}$ is maximized when data entries are sampled from i.i.d. noise (e.g., Gaussian distributions). This is because this setting leads to no redundancy at all between the feature dimensions across time steps.

2) Clean real-world data is often generated from lower-dimensional latent structures (Lakshmanan et al., 2015). Furthermore, multimodal time series data exhibits correlations across time and across modalities (Yang et al., 2017; Hidaka and Yu, 2010). This redundancy leads to low-rank tensor representations.

3) If the data is imperfect, the presence of noise or incomplete values breaks these natural correlations and leads to higher rank tensor representations.

These intuitions are also backed up by several experimental results which are presented in §4.2.

3.4 Tensor Rank Regularization

Given our intuitions above, it is natural to augment the discriminative objective function with a term that minimizes the rank of $\mathcal{M}$. In practice, the rank of an order-$M$ tensor is relaxed via its nuclear norm $\|\mathcal{M}\|_*$, which is defined as (Friedland and Lim, 2014)

$$\|\mathcal{M}\|_* = \min \left\{ \sum_{r=1}^{R} |\lambda_r| \ : \ \mathcal{M} = \sum_{r=1}^{R} \lambda_r \, \mathbf{u}_r^{(1)} \otimes \cdots \otimes \mathbf{u}_r^{(M)}, \ \|\mathbf{u}_r^{(m)}\|_2 = 1, \ R \in \mathbb{N} \right\}. \tag{3}$$

When $M = 2$, this reduces to the matrix nuclear norm (the sum of singular values). However, computing the rank of a tensor or its nuclear norm is NP-hard for tensors of order $\geq 3$ (Friedland and Lim, 2014). Fortunately, there exist efficiently computable upper bounds on the nuclear norm, and minimizing these upper bounds would also minimize the nuclear norm $\|\mathcal{M}\|_*$. We choose the upper bound presented in Hu (2014), which bounds the nuclear norm by the tensor Frobenius norm scaled by the tensor dimensions:

$$\|\mathcal{M}\|_* \ \leq \ \sqrt{\frac{\prod_{m=1}^{M} d_m}{\max_m d_m}} \ \|\mathcal{M}\|_F, \tag{4}$$

where the Frobenius norm $\|\mathcal{M}\|_F$ is the square root of the sum of squared entries of $\mathcal{M}$, and is easily computable and convex. Since $\|\mathcal{M}\|_F$ is easily computable and convex, including this term adds negligible computational cost to the model. We use this upper bound as a surrogate for the nuclear norm in our objective function, which is therefore a weighted combination of the prediction loss and the tensor rank regularizer in equation (4).
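As a sketch of how this surrogate enters training (the regularization weight and the binary cross-entropy task loss are our assumptions, not values specified here), the per-batch objective could be written as:

```python
import torch
import torch.nn.functional as F

def t2fn_loss(logits, labels, M, reg_weight=0.01):
    """Prediction loss plus the Frobenius-norm surrogate for tensor rank (eq. (4)).

    M: multimodal tensor of shape (batch, d1, d2, d3) from the model's forward pass.
    The dimension-dependent constant in eq. (4) is fixed for a given architecture,
    so minimizing ||M||_F minimizes the upper bound on the nuclear norm.
    """
    task_loss = F.binary_cross_entropy_with_logits(logits.squeeze(-1), labels.float())
    rank_reg = M.flatten(start_dim=1).norm(dim=1).mean()  # mean ||M||_F over the batch
    return task_loss + reg_weight * rank_reg
```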

4 Experiments

Figure 3: Effect of imperfect data on tensor rank and on CMU-MOSI test accuracy. (a) CP decomposition error of $\mathcal{M}$ under random and structured dropping of features: imperfect data leads to an increase in decomposition error and an increase in (approximate) tensor rank. (b) Sentiment classification accuracy under random drop (dropping entries randomly with probability $p$); T2FN with rank regularization (green) performs well. (c) Sentiment classification accuracy under structured drop (dropping entire time steps randomly with probability $p$); T2FN with rank regularization (green) performs well.

Our experiments are designed with two research questions in mind: 1) What is the effect of various levels of imperfect data on tensor rank in T2FN? 2) Does T2FN with rank regularization perform well on prediction with imperfect data? We answer these questions in §4.2 and §4.3 respectively.

4.1 Datasets

We experiment with real video data consisting of humans expressing their opinions using a combination of language and nonverbal behaviors. We use the CMU-MOSI dataset, which contains 2199 video segments annotated for sentiment in the range [-3, +3] (Zadeh et al., 2016). CMU-MOSI and related multimodal language datasets have been studied in the NLP community (Gu et al., 2018; Liu et al., 2018; Liang et al., 2018) in fully supervised settings, but not from the perspective of supervised learning with imperfect data. We use 52 videos for training, 10 for validation, and 31 for testing. GloVe word embeddings (Pennington et al., 2014), Facet (iMotions, 2017), and COVAREP (Degottex et al., 2014) features are extracted for the language, visual, and acoustic modalities respectively. Forced alignment is performed using P2FA (Yuan and Liberman, 2008) to align visual and acoustic features to each word, resulting in a multimodal sequence. Our data splits, features, alignment, and preprocessing steps are consistent with prior work on the CMU-MOSI dataset (Liu et al., 2018).

4.2 Rank Analysis

We first study the effect of imperfect data on the rank of the tensor $\mathcal{M}$. We introduce the following types of noise, parametrized by $p$; higher noise levels imply more imperfection: 1) clean: no imperfection, 2) random drop: each entry is dropped independently with probability $p$, and 3) structured drop: independently for each modality, each time step is chosen with probability $p$, and if a time step is chosen, all feature dimensions at that time step are dropped. For all imperfect settings, features are dropped during both training and testing.
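A minimal sketch of these two corruption schemes, assuming each modality's features are stored as a (T, d) array (the function names and dimensions are ours):

```python
import numpy as np

def random_drop(x, p, rng):
    """Drop each entry of a (T, d) feature matrix independently with probability p."""
    mask = rng.random(x.shape) >= p
    return x * mask

def structured_drop(x, p, rng):
    """Drop all feature dimensions of entire time steps, each chosen with probability p."""
    keep_steps = rng.random(x.shape[0]) >= p
    return x * keep_steps[:, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 74))            # e.g. 20 time steps of 74-dimensional features
x_rand = random_drop(x, p=0.3, rng=rng)
x_struct = structured_drop(x, p=0.3, rng=rng)
```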

We would like to show how the tensor ranks vary under different imperfection settings. However, as mentioned above, determining the exact rank of a tensor is an NP-hard problem (Friedland and Lim, 2014). In order to analyze the effect of imperfections on tensor rank, we instead perform CP decomposition on the tensor representations at different candidate ranks $r$ and measure the reconstruction error $\epsilon_r$,

$$\epsilon_r = \frac{\big\| \hat{\mathcal{M}}_r - \mathcal{M} \big\|_F}{\| \mathcal{M} \|_F}, \tag{5}$$

where $\hat{\mathcal{M}}_r$ is the best rank-$r$ CP approximation of $\mathcal{M}$ from equation (1). Given the true rank $r^*$, $\epsilon_r$ will be high at ranks $r < r^*$, while $\epsilon_r$ will be approximately zero at ranks $r \geq r^*$ (for example, a rank-$r^*$ tensor would display a large reconstruction error under CP decomposition at rank $r < r^*$, but almost zero error under CP decomposition at rank $r \geq r^*$). By analyzing the effect of $r$ on $\epsilon_r$, we are then able to derive a surrogate for the true rank $r^*$.
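This rank probe can be sketched with TensorLy's parafac as below; the normalization, rank grid, and iteration count are our choices, and exact API details may vary across TensorLy versions:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def cp_reconstruction_errors(M, ranks):
    """Approximate-rank probe: normalized CP reconstruction error of M at each candidate rank."""
    errors = []
    for r in ranks:
        cp = parafac(tl.tensor(M), rank=r, n_iter_max=200)
        M_hat = tl.cp_to_tensor(cp)
        errors.append(float(np.linalg.norm(M - M_hat) / np.linalg.norm(M)))
    return errors

# example: a rank-3 tensor shows a sharp error drop once the candidate rank reaches 3
M = sum(np.einsum('i,j,k->ijk', *[np.random.randn(8) for _ in range(3)]) for _ in range(3))
print(cp_reconstruction_errors(M, ranks=[1, 2, 3, 4, 5]))
```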

Using this approach, we experimented on CMU-MOSI and the results are shown in Figure 3(a). We observe that imperfection leads to an increase in (approximate) tensor rank as measured by reconstruction error (the graph shifts outwards and to the right), supporting our hypothesis that imperfect data increases tensor rank (§3.3).

4.3 Prediction Results

Our next experiment tests the ability of our model to learn robust representations despite data imperfections. We use the tensor $\mathcal{M}$ for prediction and report binary classification accuracy on the CMU-MOSI test set. We compare to several baselines: Early Fusion (EF)-LSTM, Late Fusion (LF)-LSTM, TFN, and T2FN without rank regularization. These results are shown in Figure 3(b) for random drop and Figure 3(c) for structured drop. T2FN with rank regularization maintains good performance despite imperfections in the data. We also observe that our model's improvement is more significant in random drop settings, which result in a higher tensor rank than structured drop settings (from Figure 3(a)). This supports our hypothesis that our model learns robust representations when imperfections that increase tensor rank are introduced. On the other hand, the existing baselines suffer in the presence of imperfect data.

5 Discussion and Future Work

We acknowledge that there are other alternative methods to upper bound the true rank of a tensor (Alexeev et al., 2011; Atkinson and Lloyd, 1980; Ballico, 2014). From a theoretical perspective, there exists a trade-off between the cost of computation and the tightness of approximation. In addition, the tensor rank can (far) exceed the maximum dimension, and a low-rank approximation for tensors may not even exist (de Silva and Lim, 2008). While our tensor rank regularization method seems to work well empirically, there is definitely room for a more thorough theoretical analysis of constructing and regularizing tensor representations for multimodal learning.

6 Conclusion

This paper presented a regularization method based on tensor rank minimization. We observe that clean multimodal sequences often exhibit correlations across time and modalities which leads to low-rank tensors, while the presence of imperfect data breaks these correlations and results in tensors of higher rank. We designed a model, the Temporal Tensor Fusion Network, to learn such tensor representations and effectively regularize their rank. Experiments on multimodal language data show that our model achieves good results across various levels of imperfections. We hope to inspire future work on regularizing tensor representations of multimodal data for robust prediction in the presence of imperfect data.

Acknowledgements

PPL, ZL, and LM are partially supported by the National Science Foundation (Award #1750439 and #1722822) and Samsung. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of Samsung and NSF, and no official endorsement should be inferred. YHT and RS are supported in part by DARPA HR00111990016, AFRL FA8750-18-C-0014, NSF IIS1763562, Apple, and Google focused award. QZ is supported by JSPS KAKENHI (Grant No. 17K00326). We also acknowledge NVIDIA’s GPU support and the anonymous reviewers for their constructive comments.

References