S3VAE: Self-Supervised Sequential VAE for Representation Disentanglement and Data Generation

by Yizhe Zhu et al.
Rutgers University

We propose a sequential variational autoencoder to learn disentangled representations of sequential data (e.g., video and audio) under self-supervision. Specifically, we exploit the benefits of some readily accessible supervisory signals from the input data itself or from off-the-shelf functional models, and accordingly design auxiliary tasks for our model to utilize these signals. With the supervision of these signals, our model can easily disentangle the representation of an input sequence into static factors and dynamic factors (i.e., time-invariant and time-varying parts). Comprehensive experiments across video and audio verify the effectiveness of our model on representation disentanglement and generation of sequential data, and demonstrate that our model with self-supervision performs comparably to, if not better than, a fully supervised model trained with ground truth labels, and outperforms state-of-the-art unsupervised models by a large margin.




1 Introduction

Representation learning is one of the essential research problems in machine learning and computer vision. Real-world sensory data such as videos, images, and audio are often high-dimensional. Representation learning aims to map these data into a low-dimensional space where semantically meaningful information can be more easily extracted for downstream tasks such as classification and detection. Recent years have witnessed rising interest in disentangled representation learning, which tries to separate the underlying factors of observed data variations such that each factor exclusively interprets one type of semantic attribute of the sensory data. The representation of sequential data is expected to be disentangled into time-varying factors and time-invariant factors. For video data, the identity of a moving object in a video is regarded as a time-invariant factor, and the motion in each frame is considered a time-varying one [37]. For speech data, the representations of the timbre of speakers and the linguistic contents are expected to be disentangled [27]. There are several benefits to learning disentangled representations. First, models that produce disentangled representations are more explainable. Second, disentangled representations make it easier and more efficient to manipulate data generation, which has potential applications in the entertainment industry, training data synthesis [55, 56], and several downstream tasks [32, 19, 18, 45, 57].

Figure 1: Self-supervision and regularizations enforce the latent variable of our sequential VAE to be disentangled into a static representation z_f and a dynamic representation z_{1:T}.

Despite the vast amount of work [24, 33, 6, 15, 16, 7, 30] on disentangled representations of static data (mainly image data), fewer works [27, 37, 23, 48] have explored representation disentanglement for sequential data generation. Among unsupervised models, FHVAE [27] and DSVAE [37] elaborately designed model architectures and factorized latent variables into static and dynamic parts. These models may handle simple data forms such as synthetic animation data well, but fail when dealing with realistic ones, as we will show later. Besides, as pointed out in [38], unsupervised representation disentanglement is impossible without inductive biases. Without any supervision, the performance of disentanglement can hardly be guaranteed and depends greatly on the random seed and the dimensionality of the latent vectors set in the models. On the other hand, several works [23, 48] resort to utilizing label information or attribute annotations as strong supervision for disentanglement. For instance, VideoVAE [23] leveraged holistic attributes to constrain latent variables. Nevertheless, costly data annotation is essential for these models and prevents them from being deployed in most real-world applications, in which a tremendous amount of unlabeled data is available.

To alleviate the drawbacks of both the unsupervised and supervised models discussed above, this work tackles representation disentanglement for sequential data generation using self-supervision. In self-supervised learning, various readily obtainable supervisory signals have been explored for representation learning of images and videos, employing auxiliary data such as the ambient sounds in videos [42, 3], the egomotion of cameras [1, 29], and the geometry cue in 3D movies [20], as well as off-the-shelf functional models for visual tracking [53] and optical flow [44, 52]. However, how self-supervised learning benefits representation disentanglement of sequential data has barely been explored.

In this paper, we propose a sequential variational autoencoder (VAE), a recurrent version of the VAE, for sequence generation. In the latent space, the representation is disentangled into time-invariant and time-varying factors. We address representation disentanglement by exploring intrinsic supervisory signals, which can be readily obtained from both the data itself and off-the-shelf methods, and accordingly design a series of auxiliary tasks. Specifically, on one hand, to exclude dynamic information from the time-invariant variable, we exploit the temporal order of the sequential data and expect the time-invariant variable of a temporally shuffled sequence to be close to, if not the same as, that of the original sequence. On the other hand, the time-varying variable is expected to contain the dynamic information in different modalities. For video data, we train it to predict the location of the largest motion in every frame, which can be readily inferred from optical flow. For audio data, the volume of each segment is leveraged as an intrinsic supervisory signal. To further encourage representation disentanglement, the mutual information between the static and dynamic variables is minimized as an extra regularization.

To the best of our knowledge, this paper is the first work to explicitly use auxiliary supervision to improve the representation disentanglement for sequential data. Extensive experiments on representation disentanglement and sequence data generation demonstrate that, with these multiple freely accessible supervisions, our model dramatically outperforms unsupervised learning-based methods and even performs better than fully-supervised learning-based methods in several cases.

2 Related Work

Disentangled Sequential Data Generation  With the success of deep generative models, recent works [24, 33, 6, 7, 30] resort to variational autoencoders (VAEs) [35] and generative adversarial networks (GANs) [22] to learn disentangled representations, with regularizations designed accordingly. β-VAE [24] imposed a heavier penalty on the KL divergence term for better disentanglement. Follow-up research [33, 6] derived a Total Correlation (TC) term from the KL term and highlighted it as the key factor in disentangled representation learning. In InfoGAN [7], the disentanglement of a latent code c is achieved by maximizing a lower bound on the mutual information between c and the generated sample.

Several works involving disentangled representation have been proposed for video prediction. Villegas et al. [50] and Denton et al. [12] designed two networks to encode pose and content separately at each timestep. Unlike video prediction, video generation from priors, which we perform in this work, is a much harder task since no frame is available for appearance and motion modeling in the generation phase.

To handle video generation, VAEs have been extended to recurrent versions [17, 4, 10]. However, these models do not explicitly consider static and dynamic representation disentanglement and fail to perform manipulable data generation. More recently, several works have proposed VAEs with factorized latent variables. FHVAE [27] presented a factorized hierarchical graphical model that imposes sequence-dependent priors and sequence-independent priors on different sets of latent variables in the context of speech data, but did not take advantage of a sequential prior. Combining the merits of recurrent VAEs and FHVAE, DSVAE [37] disentangles latent factors by factorizing them into time-invariant and time-dependent parts and applies an LSTM sequential prior to keep better sequential consistency for sequence generation. Despite their elaborately designed, complex architectures, these models only perform decently on representation disentanglement of simple data; the disentanglement performance degrades rapidly as the complexity of the data increases. In contrast, our work explores both model and regularization designs for representation disentanglement and sequential data generation. Our model fully factorizes the latent variables into time-invariant and time-varying parts, and both the posterior and the prior of the time-varying variable are modeled by LSTMs for dynamic consistency. Auxiliary tasks with readily accessible supervisory signals are designed to regularize and encourage representation disentanglement.

Figure 2: The framework of our proposed model in the context of video data. Each frame of a video is fed into an encoder to produce a sequence of visual features, which is then passed through an LSTM module to obtain the posterior of the dynamic latent variables z_{1:T} and the posterior of the static latent variable z_f. The static and dynamic representations are sampled from the corresponding posteriors, concatenated, and fed into a decoder to generate the reconstructed sequence. Three regularizers are imposed on the dynamic and static latent variables to encourage representation disentanglement.

Self-Supervised Learning  The concept of self-supervised learning traces back to the autoencoder [25], which uses the input itself as supervision to learn a representation. The denoising autoencoder [51] makes the learned representations robust to noise and partial corruption of the input pattern by adding noise to the input. Recent years have witnessed booming interest in self-supervised learning. The sources of supervisory signals can be roughly categorized into three classes. (a) Intrinsic labels: Doersch et al. [13] explored the use of spatial context in images, and Noroozi et al. [41] trained a model to solve Jigsaw puzzles as a pretext task. Several works [54, 36] showed that colorizing a gray-scale photograph can serve as a powerful pretext task for visual understanding. Temporal information in videos is another readily accessible supervisory signal: [39] trained a model to determine whether a sequence of frames from a video is in the correct temporal order, and [31] made a model learn to arrange permuted 3D spatiotemporal crops. (b) Auxiliary data: Agrawal et al. [1] and Jayaraman et al. [29] exploited the freely available knowledge of camera motion as a supervisory signal for feature learning. Ambient sounds in videos [42, 3] have been used as supervisory signals for learning visual models, and the geometry cue in 3D movies [20] has been utilized for visual representation learning. (c) Off-the-shelf tools: Wang et al. [53] leveraged the visual consistency of objects tracked by a visual tracker across video clips. [44] used segments obtained by motion-based segmentation on optical flow as pseudo ground truth for single-frame object segmentation. Instead of learning visual features as in the aforementioned methods, this work aims to achieve static and dynamic representation disentanglement for sequential data such as video and speech. To this end, we leverage supervisory signals from intrinsic labels to regularize the static representation and from off-the-shelf tools to regularize the dynamic representation.

3 Sequential VAE Model

We start by introducing notation and the problem definition. A dataset D = {X^(i)}_{i=1}^N consists of N i.i.d. sequences, where X ≡ x_{1:T} = (x_1, ..., x_T) denotes a sequence of T observed variables, such as a video of T frames or an audio of T segments. We propose a sequential variational autoencoder model in which the sequence x_{1:T} is assumed to be generated from a latent variable z, and z is factorized into two disentangled parts: the time-invariant (or static) variable z_f and the time-varying (or dynamic) variables z_{1:T}.

Priors  The prior of z_f is defined as a standard Gaussian distribution: p(z_f) = N(0, I). The time-varying latent variables z_{1:T} follow a sequential prior p(z_t | z_{<t}) = N(μ_t, diag(σ_t²)), where [μ_t, σ_t] = φ_prior(z_{<t}) are the parameters of the prior distribution conditioned on all previous time-varying latent variables z_{<t}. The model φ_prior can be parameterized as a recurrent network, such as an LSTM [26] or GRU [9], whose hidden state is updated temporally. The prior of z can be factorized as:

p(z) = p(z_f) ∏_{t=1}^T p(z_t | z_{<t}).
Generation  The generating distribution at time step t is conditioned on z_f and z_t: p(x_t | z_f, z_t) = N(μ_{x,t}, diag(σ_{x,t}²)), where [μ_{x,t}, σ_{x,t}] = φ_dec(z_f, z_t) and the decoder φ_dec can be a highly flexible function such as a deconvolutional neural network.

The complete generative model can be formalized by the factorization:

p(x_{1:T}, z_f, z_{1:T}) = p(z_f) ∏_{t=1}^T p(z_t | z_{<t}) p(x_t | z_f, z_t).
Inference  Our sequential VAE uses variational inference to approximate the posterior distributions:

q(z_f | x_{1:T}) = N(μ_f, diag(σ_f²)),   q(z_t | x_{≤t}) = N(μ_t, diag(σ_t²)),

where [μ_f, σ_f] = ψ_f(x_{1:T}) and [μ_t, σ_t] = ψ_dyn(x_{≤t}). The static variable z_f is conditioned on the whole sequence, while the dynamic variable z_t is inferred by a recurrent encoder and conditioned only on the frames up to the current time step. Our inference model is factorized as:

q(z_f, z_{1:T} | x_{1:T}) = q(z_f | x_{1:T}) ∏_{t=1}^T q(z_t | x_{≤t}).
Learning  The objective function of the sequential VAE is a timestep-wise negative variational lower bound:

L_VAE = E_q [ −Σ_{t=1}^T log p(x_t | z_f, z_t) ] + KL( q(z_f | x_{1:T}) ‖ p(z_f) ) + Σ_{t=1}^T KL( q(z_t | x_{≤t}) ‖ p(z_t | z_{<t}) ).
The schematic representation of our model is shown in Figure 2. Note that DSVAE also proposes a sequential VAE with disentangled representations, but it either infers each z_t independently from the single frame x_t, without considering the continuity of the dynamic variables, and thus may generate inconsistent motion, or assumes that the variational posterior of z_{1:T} depends on z_f, implying that the variables are still implicitly entangled. In contrast, we model both the prior and the posterior of z_{1:T} with recurrent models independently, resulting in consistent dynamic information in synthetic sequences, and ensure full disentanglement of z_f and z_{1:T} through the posterior factorization.

4 Self-Supervision and Regularization

Without any supervision, there is no guarantee that the time-invariant representation z_f and the time-varying representation z_{1:T} are disentangled. In this section, we introduce a series of auxiliary tasks on the different types of representations as regularizations of our sequential VAE to achieve disentanglement, leveraging readily accessible supervisory signals.

4.1 Static Consistency Constraint

To encourage the time-invariant representation z_f to exclude any dynamic information, we expect z_f to change little when the dynamic information varies dramatically. To this end, we shuffle the temporal order of the frames to form a shuffled sequence. Ideally, the static factors of the original sequence and the shuffled sequence should be very close, if not equal, to one another. However, directly minimizing the distance between these two static factors leads to trivial solutions, e.g., the static factors of all sequences converging to the same value and carrying no meaningful information. Thus, we randomly sample another sequence to provide a negative sample of the static factor. With this triple of static factors, we introduce a triplet loss:

L_SCC = max( D(z_f, z_f⁺) − D(z_f, z_f⁻) + m, 0 ),

where z_f, z_f⁺ and z_f⁻ are the static factors of the anchor sequence, the shuffled sequence (the positive data), and another randomly sampled video (the negative data), D(·, ·) denotes the Euclidean distance, and m is the margin, set to 1. This regularization makes z_f preserve meaningful static information to a certain degree while excluding dynamic information.
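The static consistency constraint is easy to sketch directly; the helper name `static_consistency_loss` and the toy 2-D vectors below are ours, not from the paper:

```python
import numpy as np

def static_consistency_loss(z_f, z_f_pos, z_f_neg, margin=1.0):
    """Triplet loss max(D(z_f, z_pos) - D(z_f, z_neg) + m, 0)."""
    d_pos = np.linalg.norm(z_f - z_f_pos)
    d_neg = np.linalg.norm(z_f - z_f_neg)
    return max(d_pos - d_neg + margin, 0.0)

# Shuffling frames should barely change z_f, so d_pos is small and the
# loss pushes d_neg beyond the margin.
anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.1])   # z_f of the shuffled sequence
negative = np.array([-1.0, 0.0])  # z_f of another, randomly sampled video
print(static_consistency_loss(anchor, positive, negative))  # 0.0
```

With the positive already closer than the negative by more than the margin, the loss is zero; swapping the roles makes it positive, which is what drives the encoder toward shuffle-invariant z_f.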

Figure 3: The pseudo label generation for video datasets. (a) The left image is the input frame and the right image is the corresponding optical flow map that is split by a grid. (b) Three distances are used as the dynamic signals for the face dataset.

4.2 Dynamic Factor Prediction

To encourage the dynamic representation z_{1:T} to carry adequate and correct time-dependent information at each time step, we exploit dynamics-related signals from off-the-shelf tools for different types of sequential data and accordingly design auxiliary tasks as regularizations imposed on z_{1:T}. The loss is L_DFP = ℓ(ψ_D(z_{1:T}), y), where ℓ is either a cross-entropy loss or a mean squared error loss according to the designed auxiliary task, ψ_D is a network for dynamic factor prediction, and y contains the supervisory signals.

Video Data  The dynamic representation can be learned by forcing it to predict the dynamic factors of videos. Motivated by this, we expect the locations of the largest motion regions to be accurately predicted based on z_{1:T}. To this end, the optical flow maps of a video are first obtained with the commonly used FlowNet2 [28] and then split into patches by a grid, as shown in Figure 3.a. We compute the average motion magnitude for every patch and use the indices of the patches with the top-k largest values as the pseudo label for prediction. For this task, ψ_D is implemented with two fully-connected layers and a softmax layer.
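The pseudo-label construction can be sketched as follows, assuming a precomputed per-pixel flow map; the grid size, top-k value, and function name `motion_pseudo_labels` are illustrative choices of ours:

```python
import numpy as np

def motion_pseudo_labels(flow, grid=4, top_k=2):
    """Indices of the top-k grid patches by mean optical-flow magnitude.

    flow: (H, W, 2) array of per-pixel (dx, dy) displacements.
    Patch index is row-major over the grid: row_block * grid + col_block.
    """
    H, W, _ = flow.shape
    mag = np.linalg.norm(flow, axis=-1)            # (H, W) motion magnitude
    ph, pw = H // grid, W // grid
    means = (mag[: ph * grid, : pw * grid]
             .reshape(grid, ph, grid, pw)
             .mean(axis=(1, 3)))                   # (grid, grid) patch means
    return np.argsort(means.ravel())[::-1][:top_k]  # largest first

flow = np.zeros((16, 16, 2))
flow[0:4, 4:8] = 3.0      # strong motion in patch (0, 1) -> index 1
flow[12:16, 12:16] = 1.0  # weaker motion in patch (3, 3) -> index 15
print(motion_pseudo_labels(flow, grid=4, top_k=2))  # [ 1 15]
```

These indices would then serve as the classification target for the softmax head on top of z_{1:T}.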

Apart from optical flow, some freely obtainable data-specific signals can be exploited. For a human face dataset, the landmarks of each frame can be readily detected and used as supervision for the dynamic factors. We obtain the landmarks from an off-the-shelf landmark detector [14]. To keep our model concise and efficient, we only leverage the distance between the upper and lower eyelids of each eye, together with the distance between the upper and lower lips, in each frame as the dynamic signal, as shown in Figure 3.b. Here, ψ_D consists of two fully-connected layers that regress the three distances. We observe that our model can easily capture dynamic motions under this simple supervision.

Audio Data  For the audio dataset, we consider the volume as a time-dependent factor and accordingly design an auxiliary task in which z_t is forced to predict whether the speech is silent in each segment. The pseudo label is readily obtained by setting a magnitude threshold on the volume of each speech segment. Here, ψ_D consists of two fully-connected layers and performs a binary classification.
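A minimal sketch of this pseudo-labeling, assuming RMS energy as the volume measure and an arbitrary threshold (both assumptions of ours, not the paper's exact choices):

```python
import numpy as np

def silence_labels(segments, threshold=0.05):
    """Binary pseudo labels per segment: 1 if RMS volume exceeds threshold."""
    rms = np.sqrt(np.mean(np.square(segments), axis=-1))
    return (rms > threshold).astype(int)

t = np.linspace(0, 1, 200)
loud = 0.5 * np.sin(2 * np.pi * 10 * t)     # voiced segment
quiet = 0.001 * np.sin(2 * np.pi * 10 * t)  # near-silence
print(silence_labels(np.stack([loud, quiet])))  # [1 0]
```

Each label would be the binary classification target for ψ_D applied to the corresponding z_t.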

4.3 Mutual Information Regularization

Forcing the time-varying variables to predict dynamic factors guarantees that z_{1:T} contains adequate dynamic information, but it does not guarantee that z_{1:T} excludes static information. Therefore, we introduce the mutual information between the static and dynamic variables as a regularizer L_MI = I(z_f; z_{1:T}). Mutual information measures the mutual dependence between two variables; by minimizing I(z_f; z_{1:T}), we encourage the information in the two variables to be mutually exclusive. It is formally defined as the KL divergence between the joint distribution and the product of the marginal distributions:

I(z_f; z_{1:T}) = KL( q(z_f, z_{1:T}) ‖ q(z_f) q(z_{1:T}) ) = E_{q(z_f, z_{1:T})} [ log q(z_f, z_{1:T}) − log q(z_f) − log q(z_{1:T}) ].

The expectations can be estimated with the minibatch weighted sampling estimator:

E_{q(z)}[ log q(z) ] ≈ (1/M) Σ_{i=1}^M log (1/(N M)) Σ_{j=1}^M q( z(x_i) | x_j ),

for z = z_f or z_{1:T}, where N and M are the data size and the minibatch size, respectively.
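A numpy sketch of the minibatch-weighted-sampling estimate of E_q[log q(z)] for diagonal-Gaussian posteriors; the function name `log_q_z_mws` and the toy sizes are ours:

```python
import numpy as np

def log_q_z_mws(z, mu, log_var, data_size):
    """Minibatch-weighted-sampling estimate of E_q[log q(z)].

    z: (M, d) samples with z_i ~ q(z | x_i); mu, log_var: (M, d) posterior
    parameters for the minibatch; data_size: N.
    """
    M = z.shape[0]
    # log q(z_i | x_j) for all pairs (i, j), diagonal Gaussians: (M, M)
    diff = z[:, None, :] - mu[None, :, :]
    log_pdf = -0.5 * (diff ** 2 / np.exp(log_var)[None]
                      + log_var[None] + np.log(2 * np.pi)).sum(-1)
    # log q(z_i) ~= logsumexp_j log q(z_i | x_j) - log(N * M)
    m = log_pdf.max(axis=1, keepdims=True)
    log_qz = (m.squeeze(1)
              + np.log(np.exp(log_pdf - m).sum(axis=1))
              - np.log(data_size * M))
    return log_qz.mean()

rng = np.random.default_rng(0)
mu, log_var = rng.standard_normal((8, 2)), np.zeros((8, 2))
z = mu + rng.standard_normal((8, 2))
print(log_q_z_mws(z, mu, log_var, data_size=1000))
```

The same estimator applied to the joint and to each marginal yields the three terms of the mutual information above.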

4.4 Objective Function

Overall, our objective is the sequential VAE loss combined with the series of self-supervisions and regularizations:

L = L_VAE + λ1 L_SCC + λ2 L_DFP + λ3 L_MI,

where λ1, λ2 and λ3 are balancing factors.
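The terms combine additively; a schematic numpy version, using the standard closed-form KL for diagonal Gaussians (the λ values and function names below are placeholders of ours, not the paper's settings):

```python
import numpy as np

def kl_diag_gauss(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def total_objective(recon, kl_f, kl_z, l_scc, l_dfp, l_mi,
                    lam1=1.0, lam2=1.0, lam3=1.0):
    """L = L_VAE + lam1*L_SCC + lam2*L_DFP + lam3*L_MI,
    with L_VAE = reconstruction + KL(z_f) + KL(z_{1:T})."""
    return recon + kl_f + kl_z + lam1 * l_scc + lam2 * l_dfp + lam3 * l_mi

# Sanity check: a posterior equal to the standard-normal prior has zero KL.
print(kl_diag_gauss(np.zeros(4), np.zeros(4)))  # 0.0
```

In practice each term would be computed on a minibatch and the λ's tuned per dataset.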

5 Experiments

To comprehensively validate the effectiveness of our proposed model with self-supervision, we conduct experiments on three video datasets and one audio dataset. With these four datasets, we cover different modalities from video to audio. In the video domain, a large range of motions is covered, from large character motions (e.g., walking, stretching) to subtle facial expressions (e.g., smiling, disgust).

5.1 Experiments on Video Data

We present an in-depth evaluation of two problems, tested on three different datasets and employing a large variety of metrics.

5.1.1 Datasets

Stochastic Moving MNIST, introduced by [11], consists of sequences of frames in which two digits from the MNIST dataset move in random directions. We randomly generate 6000 sequences, 5000 of which are used for training and the rest for testing.

Sprite [37] contains sequences of animated cartoon characters with 9 action categories: walking, casting spells, and slashing, each under three viewing angles. The appearance of a character is fully controlled by four attributes, i.e., the color of skin, tops, pants, and hair. Each attribute category has 6 possible variants, resulting in 1296 unique characters in total, 1000 of which are used for training and the rest for testing. Each sequence contains 8 frames.

MUG Facial Expression [2] consists of videos of actors performing different facial expressions: anger, fear, disgust, happiness, sadness, and surprise. Each video consists of 50 to 160 frames. As suggested in MoCoGAN [49], we crop the face regions, resize the videos, and randomly sample a fixed-length clip from each video. Part of the dataset is used for training and the rest for testing.

Figure 4: Representation swapping on the SMMNIST, Sprite and MUG datasets. In each panel, we show two real videos as well as the videos generated by swapping z_f and z_{1:T}, from our model and two competing models, DSVAE and MonkeyNet. Each column is supposed to share the same motion.

5.1.2 Representation Swapping

We first perform representation swapping and compare our method with DSVAE, a disentangled VAE model, as well as MonkeyNet [47], a state-of-the-art deformable video generation model. Suppose two real videos are given, one providing the motion information and the other the appearance information. Our method and DSVAE perform video generation based on the dynamic variable z_{1:T} inferred from the motion video and the static variable z_f inferred from the appearance video. For MonkeyNet, the videos are generated by deforming the first frame of the appearance video based on the motion in the motion video. The synthetic videos are expected to preserve the appearance of the appearance video and the motion of the motion video. Qualitative comparisons on the three datasets are shown in Figure 4.

For SMMNIST, the videos generated by our model preserve the identity of the digits while consistently mimicking the motion of the provided video. In contrast, DSVAE can hardly preserve the identity; for instance, it mistakenly changes the digit “9” to “6”. We observe that MonkeyNet can hardly handle cases with multiple objects such as SMMNIST, because such cases violate its implicit assumption that only one object moves in the video.

For Sprite, DSVAE generates blurry videos when the characters in the two input videos face opposite directions, indicating that it fails to encode the direction information in the dynamic variable. Conversely, thanks to the guidance from optical flow, our model generates videos with the appearance of the character in the appearance video and the action and direction of the character in the motion video. The characters in the videos generated by MonkeyNet fail to follow the pose and action of the motion video, and many artifacts appear; e.g., an arm-like blob appears on the back of the character in the left panel.

For MUG, the videos generated by DSVAE hardly preserve both the appearance of the appearance video and the facial expression of the motion video. For example, the person on the right has mixed appearance characteristics, indicating that z_f and z_{1:T} are entangled. Due to its deformation-based generation scheme, MonkeyNet fails to handle cases where the faces in the two videos are not well aligned; for instance, forcing the man with a smile to show fear results in an unnatural expression. In contrast, our model disentangles z_f and z_{1:T}, as supported by the realistic expressions on the different faces in the generated videos.

Figure 5: Unconditional video generation on MUG. The upper and lower panels show the qualitative results of our model and MoCoGAN, respectively.
Figure 6: Randomly sampled frames for each dataset.
Methods  SMMNIST: Acc / IS / H(y|x) / H(y)  Sprite: Acc / IS / H(y|x) / H(y)  MUG: Acc / IS / H(y|x) / H(y)
MoCoGAN 74.55% 4.078 0.194 0.191 92.89% 8.461 0.090 2.192 63.12% 4.332 0.183 1.721
DSVAE 88.19% 6.210 0.185 2.011 90.73% 8.384 0.072 2.192 54.29% 3.608 0.374 1.657
baseline 90.12% 6.543 0.167 2.052 91.42% 8.312 0.071 2.190 53.83% 3.736 0.347 1.717
full model 95.09% 7.072 0.150 2.106 99.49% 8.637 0.041 2.197 70.51% 5.136 0.135 1.760
baseline-sv* 92.18% 6.845 0.156 2.057 98.91% 8.741 0.028 2.196 72.32% 5.006 0.129 1.740
Table 1: Quantitative performance comparison on the SMMNIST, Sprite and MUG datasets; each dataset reports Acc, IS, H(y|x) and H(y), in that order. High values are expected for Acc, IS and H(y), while for H(y|x) lower values are better. The results of our model trained with supervision from ground truth labels (baseline-sv*) are shown as a reference.
Figure 7: Controlled video generation. (a) Video generation controlled by fixing the static variable z_f and randomly sampling dynamic variables z_{1:T} from the prior. All sequences share the same identity but show different motions for each dataset. (b) Video generation with changed facial expressions: the expression is changed from smile to surprise and from surprise to disgust in the two sequences, respectively, by transferring the dynamic variables.

5.1.3 Video Generation

Quantitative Results  We compute the quantitative metrics of our model with and without self-supervision and regularization, denoted as full model and baseline, as well as two competing methods: DSVAE and MoCoGAN. All these methods are comparable, as none use ground truth labels to aid representation disentanglement. Besides, the results of our baseline with full supervision from human annotations (baseline-sv) are also provided as a reference.

To demonstrate the ability of a model to disentangle representations, we use the classification accuracy Acc [37], which measures the ability of a model to preserve a specific attribute when generating a video given the corresponding representation or label. To measure how diverse and realistic the generated videos are, three metrics are used: Inception Score (IS) [46], Intra-Entropy H(y|x) [23] and Inter-Entropy H(y) [23]. All metrics utilize a classifier pretrained on the real videos and ground truth attributes. See the Appendix for the detailed definitions of the metrics. The results are shown in Table 1.
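Under the common definitions of these entropy metrics (our reading, not a quote of [23]), Intra-Entropy is the average entropy of the classifier posterior per generated sample (lower means more confident, realistic samples) and Inter-Entropy is the entropy of the averaged posterior (higher means more diverse samples); a small numpy sketch:

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=-1)

def intra_inter_entropy(probs):
    """probs: (num_videos, num_classes) classifier posteriors on generated data.

    Intra-Entropy H(y|x): mean per-sample entropy (lower = more realistic).
    Inter-Entropy H(y): entropy of the mean posterior (higher = more diverse).
    """
    return entropy(probs).mean(), entropy(probs.mean(axis=0))

# Four videos, each classified confidently as a different class:
confident_diverse = np.eye(4)
h_intra, h_inter = intra_inter_entropy(confident_diverse)
print(h_intra, h_inter)  # ~0 and ~log(4) = 1.386...
```

This matches the directions reported in Table 1: low H(y|x) and high H(y) are better.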


For representation disentanglement on SMMNIST, we generate videos with z_f inferred from a real video and z_{1:T} randomly sampled from the prior, and then check with the pretrained classifier whether the synthetic video contains the same digits as the real video. For MUG, we evaluate the ability of a model to preserve the facial expression by fixing z_{1:T} and randomly sampling z_f from the prior. For Sprite, since ground truth for both action and appearance attributes is available, we evaluate the ability to preserve both the static and dynamic representations and report the average scores. It is evident that our full model consistently outperforms all competing methods. For SMMNIST, we observe that MoCoGAN has a poor ability to generate the correct digits given labels, while our full model generates the digits correctly, as reflected by its high Acc. Note that our full model achieves 99.49% Acc on Sprite, indicating that z_f and z_{1:T} are greatly disentangled. Besides, the full model significantly boosts the performance of the baseline; especially on MUG, which contains more realistic data, the accuracy jumps from 53.83% to 70.51%, illustrating the crucial role of our self-supervision and regularization. For video generation, the full model consistently shows superior performance on IS, H(y|x) and H(y). In particular, on MUG, the full model outperforms the runner-up MoCoGAN by 0.804 on IS, demonstrating the high quality of the videos generated by our model. Note that our baseline also compares favorably to DSVAE, illustrating the superiority of the designed sequential VAE model.

It is worth noting that our model with self-supervision (full model) outperforms baseline-sv on SMMNIST in terms of representation disentanglement. A possible reason is that, with ground truth labels, baseline-sv only encourages z_f to contain identity information but fails to exclude that information from z_{1:T}, resulting in confusion in digit recognition when various dynamic variables are used. For MUG, baseline-sv performs better at preserving motion. We conjecture this is because the strong supervision of expression labels forces z_{1:T} to encode the dynamic information, while z_f is not encouraged to encode temporal dynamics, so varying z_f does not affect the motion much.

Qualitative results  We first demonstrate the ability of our model to manipulate video generation in Figure 7. By fixing z_f and sampling z_{1:T}, our model can generate videos in which the same object performs various motions, as shown in Figure 7.a. Even within one video, the facial expression can be transferred by controlling z_{1:T}, as shown in Figure 7.b.

We also evaluate the appearance diversity of the generated objects. Figure 6 shows frames our model generates with randomly sampled z_f. The realistic and diverse appearances of the objects validate our model's capability for high-quality video generation.

Besides, we compare our model with MoCoGAN on unconditional video generation on the MUG dataset, as shown in Figure 5. The videos are generated with sampled z_f and z_{1:T}. MoCoGAN generates videos with many artifacts, such as unrealistic eyes and an inconsistent mouth in the third video. Conversely, our model generates more realistic human faces with consistent, highly coherent expressions.

5.1.4 Ablation Studies

In this section, we present an ablation study to empirically measure the impact of each regularization on the performance of our model. The variant without a given regularization is denoted w/o L_SCC, w/o L_DFP, or w/o L_MI. In Table 2, we report the quantitative evaluation. We note that w/o L_SCC performs worse than the full model, which illustrates the significance of the static consistency constraint in disentangling z_f from z_{1:T}. w/o L_DFP degrades the performance on Acc considerably, indicating that L_DFP as a regularization is crucial for preserving the action information in the dynamic vector. Besides, after removing the mutual information regularization, w/o L_MI again shows inferior performance to the full model. A possible explanation is that L_MI encourages z_{1:T} to exclude appearance information, so that the appearance information comes only from z_f. The qualitative results shown in Figure 8 confirm this analysis. We generate a video with the z_f of the woman and the z_{1:T} of the man in the first row using different variants of our model. Without L_MI, some characteristics of the man are still preserved, such as the beard, confirming that appearance information partially remains in z_{1:T}. In the results of w/o L_SCC, the beard is even more evident, indicating that the static and dynamic variables are still entangled. On the other hand, without L_DFP, the woman in the generated video cannot mimic the action of the man well, which indicates the dynamic variable does not encode the motion information properly. Finally, the person in the video generated without any of the three regularizations neither preserves the appearance of the woman nor follows the expression of the man, illustrating that representation disentanglement without any supervision remains a hard task.

w/o L_SCC 61.45% 4.850 0.201 1.734
w/o L_DFP 58.32% 4.423 0.284 1.721
w/o L_MI 66.07% 4.874 0.175 1.749
Full model 70.51% 5.136 0.135 1.760
Table 2: Ablation study of disentanglement on MUG (columns: Acc, IS, H(y|x), H(y)).
Figure 8: Ablation study on MUG dataset. The first frame of the appearance video and the motion video are shown in the first row.

5.2 Experiments on Audio Data

To demonstrate the general applicability of our model to sequential data, we conduct experiments on audio data, where the time-invariant and time-varying factors are the timbre of a speaker and the linguistic content of a speech, respectively. The dataset we use is TIMIT, a corpus of phonemically and lexically transcribed speech from American English speakers of different sexes and dialects [21], containing 6300 recordings of read speech. We split the dataset into training and testing subsets with a ratio of 5:1. As in [27], each utterance is represented as a sequence of 80-dimensional Mel-scale filter bank features.

We quantitatively compare our model with FHVAE and DSVAE on the speaker verification task based on either the static or the dynamic variable, measured by the Equal Error Rate (EER) [8]. Note that we expect the speaker to be correctly verified from the static variable, as it encodes the timbre of the speaker, and verification to be near chance from the dynamic variable, as it ideally encodes only the linguistic content. The results are shown in Table 3. Our model outperforms competing methods in both cases. In particular, when verification is based on the dynamic variable, our model more than doubles the error rate of the baseline, indicating that it largely eliminates timbre information from the dynamic representation.

model   feature   dim   EER
FHVAE   static    16    5.06%
DSVAE   static    64    4.82%
DSVAE   dynamic   64    18.89%
Ours    static    64    4.80%
Ours    dynamic   64    40.12%
Table 3: Performance comparison on speaker verification. A small error is better for the static feature, and a large error is expected for the dynamic feature.

6 Conclusion

We propose a self-supervised sequential VAE that learns disentangled time-invariant and time-varying representations of sequential data. We show that, with readily accessible supervisory signals from the data itself and off-the-shelf tools, our model can achieve performance comparable to fully supervised models that require costly human annotations. The disentangling ability of our model is qualitatively and quantitatively verified on four datasets across the video and audio domains. The appealing results on a variety of tasks illustrate that leveraging self-supervision is a promising direction for representation disentanglement and sequential data generation. In the future, we plan to extend our model to high-resolution video generation, video prediction, and image-to-video generation.


  • [1] Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision, pages 37–45, 2015.
  • [2] N. Aifanti, C. Papachristou, and A. Delopoulos. The mug facial expression database. In 11th International Workshop on Image Analysis for Multimedia Interactive Services WIAMIS 10, pages 1–4, April 2010.
  • [3] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision, pages 609–617, 2017.
  • [4] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
  • [5] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • [6] Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pages 2610–2620, 2018.
  • [7] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
  • [8] Mohamed Chenafa, Dan Istrate, Valeriu Vrabie, and Michel Herbin. Biometric system based on voice recognition using multiclassifiers. In European Workshop on Biometrics and Identity Management, pages 206–215. Springer, 2008.
  • [9] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • [10] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015.
  • [11] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In 35th International Conference on Machine Learning (ICML), 2018.
  • [12] Emily L Denton et al. Unsupervised learning of disentangled representations from video. In Advances in neural information processing systems, pages 4414–4423, 2017.
  • [13] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
  • [14] Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Style aggregated network for facial landmark detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 379–388, 2018.
  • [15] Babak Esmaeili, Hao Wu, Sarthak Jain, Alican Bozkurt, Narayanaswamy Siddharth, Brooks Paige, Dana H Brooks, Jennifer Dy, and Jan-Willem van de Meent. Structured disentangled representations. arXiv preprint arXiv:1804.02086, 2018.
  • [16] Babak Esmaeili, Hao Wu, Sarthak Jain, N Siddharth, Brooks Paige, and Jan-Willem Van de Meent. Hierarchical disentangled representations. stat, 1050:12, 2018.
  • [17] Otto Fabius and Joost R van Amersfoort. Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581, 2014.
  • [18] Deng-Ping Fan, Ming-Ming Cheng, Jiang-Jiang Liu, Shang-Hua Gao, Qibin Hou, and Ali Borji. Salient objects in clutter: Bringing salient object detection to the foreground. In European Conference on Computer Vision (ECCV). Springer, 2018.
  • [19] Deng-Ping Fan, Wenguan Wang, Ming-Ming Cheng, and Jianbing Shen. Shifting more attention to video salient object detection. In IEEE CVPR, 2019.
  • [20] Chuang Gan, Boqing Gong, Kun Liu, Hao Su, and Leonidas J Guibas. Geometry guided convolutional neural networks for self-supervised video representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5589–5597, 2018.
  • [21] John S Garofolo, L F Lamel, W M Fisher, Jonathan G Fiscus, D S Pallett, and Nancy L Dahlgren. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. Technical report, 1993.
  • [22] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [23] Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, and Leonid Sigal. Probabilistic video generation using holistic attribute control. In Proceedings of the European Conference on Computer Vision (ECCV), pages 452–467, 2018.
  • [24] Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew M Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
  • [25] Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length and helmholtz free energy. In Advances in neural information processing systems, pages 3–10, 1994.
  • [26] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [27] Wei-Ning Hsu, Yu Zhang, and James Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in neural information processing systems, pages 1878–1889, 2017.
  • [28] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2462–2470, 2017.
  • [29] Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to ego-motion. In Proceedings of the IEEE International Conference on Computer Vision, pages 1413–1421, 2015.
  • [30] Insu Jeon, Wonkwang Lee, and Gunhee Kim. Ib-gan: Disentangled representation learning with information bottleneck gan. 2018.
  • [31] Longlong Jing and Yingli Tian. Self-supervised visual feature learning with deep neural networks: A survey. arXiv preprint arXiv:1902.06162, 2019.
  • [32] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [33] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
  • [34] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [35] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [36] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6874–6883, 2017.
  • [37] Yingzhen Li and Stephan Mandt. Disentangled sequential autoencoder. In International Conference on Machine Learning (ICML), 2018.
  • [38] Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In ICML, 2019.
  • [39] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527–544. Springer, 2016.
  • [40] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.
  • [41] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
  • [42] Andrew Owens, Jiajun Wu, Josh H McDermott, William T Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In European conference on computer vision, pages 801–816. Springer, 2016.
  • [43] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
  • [44] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2701–2710, 2017.
  • [45] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [46] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
  • [47] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. Animating arbitrary objects via deep motion transfer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [48] Ximeng Sun, Huijuan Xu, and Kate Saenko. A two-stream variational adversarial network for video generation. arXiv preprint arXiv:1812.01037, 2018.
  • [49] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018.
  • [50] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017.
  • [51] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
  • [52] Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, and Wei Liu. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [53] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2015.
  • [54] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European conference on computer vision, pages 649–666. Springer, 2016.
  • [55] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1004–1013, 2018.
  • [56] Yizhe Zhu, Jianwen Xie, Bingchen Liu, and Ahmed Elgammal. Learning feature-to-feature translator by alternating back-propagation for generative zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2019.
  • [57] Yizhe Zhu, Jianwen Xie, Zhiqiang Tang, Xi Peng, and Ahmed Elgammal. Semantic-guided multi-attention localization for zero-shot learning. In Thirty-third Conference on Neural Information Processing Systems (NeurIPS), Dec 2019.

Appendix A Minibatch Weighted Sampling

Minibatch Weighted Sampling (MWS) is an estimator of the aggregate posterior introduced by [6]. Let $N$ be the size of the dataset and $M$ be the size of a minibatch. The entropy of the aggregate posterior $q(z)$ can then be estimated from a minibatch $\{x_1, \dots, x_M\}$:

$$\mathbb{E}_{q(z)}[\log q(z)] \approx \frac{1}{M}\sum_{i=1}^{M}\left[\log \frac{1}{NM}\sum_{j=1}^{M} q\big(z(x_i)\mid x_j\big)\right],$$

where $z(x_i)$ is a sample from $q(z\mid x_i)$. The reader can refer to [6] for details.

In our model, the posterior of the latent variables factorizes as $q(z_f, z_{1:T}\mid x_{1:T}) = q(z_f\mid x_{1:T})\, q(z_{1:T}\mid x_{1:T})$. Thus the entropy of the joint aggregate posterior can be estimated as follows.

Lemma A.1.

Given a dataset of $N$ samples with distribution $p(x)$ and a minibatch $B_M = \{x_1, \dots, x_M\}$ of samples drawn i.i.d. from $p(x)$, and assuming the posterior of the latent variables factorizes as $q(z_f, z_{1:T}\mid x) = q(z_f\mid x)\, q(z_{1:T}\mid x)$, a lower bound of $\mathbb{E}_{q(z_f, z_{1:T})}[\log q(z_f, z_{1:T})]$ is:

$$\mathbb{E}_{q(z_f, z_{1:T})}\big[\log q(z_f, z_{1:T})\big] \;\geq\; \mathbb{E}_{q(z_f, z_{1:T}\mid x)\,p(x)\,r(B_M\mid x)}\left[\log \frac{1}{NM}\sum_{j=1}^{M} q(z_f\mid x_j)\, q(z_{1:T}\mid x_j)\right],$$

where $r(B_M\mid x)$ denotes the probability of a sampled minibatch where one of the elements is fixed to be $x$ and the rest are sampled i.i.d. from $p(x)$.

Proof. For any sampled batch instance $B_M$, $p(B_M) = \prod_{i=1}^{M} p(x_i)$, and when one of the elements is fixed to be $x$, $r(B_M\mid x) = p(B_M)/p(x)$.

The inequality is due to $r(B_M\mid x)$ having a support that is a subset of that of $p(B_M)$. ∎

Following Lemma A.1, given a minibatch of samples $\{x_1, \dots, x_M\}$, we can estimate the lower bound as:

$$\mathbb{E}_{q(z_f, z_{1:T})}\big[\log q(z_f, z_{1:T})\big] \approx \frac{1}{M}\sum_{i=1}^{M}\left[\log \frac{1}{NM}\sum_{j=1}^{M} q\big(z_f(x_i)\mid x_j\big)\, q\big(z_{1:T}(x_i)\mid x_j\big)\right],$$

where $z_f(x_i)$ is a sample from $q(z_f\mid x_i)$, and $z_{1:T}(x_i)$ is a sample from $q(z_{1:T}\mid x_i)$.
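The estimator above can be sketched in a few lines of numpy. The code below is an illustrative implementation for diagonal-Gaussian posteriors (the function names and argument layout are ours, not from the paper's code): it evaluates each sampled code under every posterior in the minibatch and applies the $\frac{1}{NM}$ weighting.

```python
import numpy as np

def log_q_gaussian(z, mu, sigma):
    """Log-density of a diagonal Gaussian q(z | x_j) evaluated at z."""
    return -0.5 * np.sum(((z - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2), axis=-1)

def mws_log_qz(z_batch, mu_batch, sigma_batch, dataset_size):
    """Minibatch Weighted Sampling estimate of E_q(z)[log q(z)].

    z_batch[i] is a sample from q(z | x_i); (mu_batch[j], sigma_batch[j])
    parameterize q(z | x_j). Returns a scalar estimate averaged over the batch.
    """
    M = z_batch.shape[0]
    # log q(z_i | x_j) for every pair (i, j), via broadcasting to shape (M, M)
    log_qz_ij = log_q_gaussian(z_batch[:, None, :], mu_batch[None, :, :],
                               sigma_batch[None, :, :])
    # log [ (1/(N*M)) * sum_j q(z_i | x_j) ], then average over i
    log_qz = np.log(np.sum(np.exp(log_qz_ij), axis=1)) - np.log(dataset_size * M)
    return float(np.mean(log_qz))
```

For the factorized posterior of Lemma A.1, the pairwise term would be the product $q(z_f(x_i)\mid x_j)\,q(z_{1:T}(x_i)\mid x_j)$, i.e., the sum of the two log-densities before exponentiating.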

Appendix B Derivation of Objective Function

We show the derivation of the objective function in Eq. 5. The observation model is defined as:

$$p(x_{1:T}, z_f, z_{1:T}) = p(z_f)\prod_{t=1}^{T} p(z_t\mid z_{<t})\, p(x_t\mid z_f, z_t).$$

To avoid the intractable integration over $z_f$ and $z_{1:T}$, variational inference introduces a posterior approximation $q(z_f, z_{1:T}\mid x_{1:T})$. A variational lower bound of $\log p(x_{1:T})$ is:

$$\log p(x_{1:T}) \geq \mathbb{E}_{q(z_f, z_{1:T}\mid x_{1:T})}\big[\log p(x_{1:T}\mid z_f, z_{1:T})\big] - \mathrm{KL}\big(q(z_f\mid x_{1:T})\,\|\,p(z_f)\big) - \mathrm{KL}\big(q(z_{1:T}\mid x_{1:T})\,\|\,p(z_{1:T})\big),$$

where the second line follows from the first by plugging in Eq. 2 and Eq. 4.
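Since all the distributions involved are diagonal Gaussians, the two KL terms of the bound have closed forms. The following numpy sketch (function names and the flattened argument layout are our own, for illustration only) shows how the lower bound can be assembled from precomputed posterior and prior parameters:

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0,
                        axis=-1)

def sequential_elbo(recon_log_lik, mu_f, var_f, mu_t, var_t, prior_mu_t, prior_var_t):
    """Illustrative one-sequence ELBO of the form above.

    recon_log_lik         : scalar, sum_t log p(x_t | z_f, z_t) (precomputed)
    (mu_f, var_f)         : posterior over the static variable; its prior is N(0, I)
    (mu_t, var_t)         : posteriors over dynamic variables, shape (T, d)
    (prior_mu_t, prior_var_t) : LSTM-parameterized priors p(z_t | z_{<t}), shape (T, d)
    """
    kl_f = kl_diag_gaussians(mu_f, var_f, np.zeros_like(mu_f), np.ones_like(var_f))
    kl_t = np.sum(kl_diag_gaussians(mu_t, var_t, prior_mu_t, prior_var_t))
    return recon_log_lik - kl_f - kl_t
```

When the posteriors exactly match their priors, both KL terms vanish and the bound reduces to the reconstruction log-likelihood, which is a convenient sanity check.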

Appendix C Metrics Definition

Similar to [48], we introduce the definitions of classification accuracy, Inception Score, inter-entropy, and intra-entropy for the case where ground-truth labels cannot be directly provided to the model. Let $x'$ be a video generated either from the representation of a real video $x$ with label $y$, or directly conditioned on the label (e.g., MoCoGAN). We use a classifier pretrained to predict the labels of real videos.

  • Classification Accuracy (Acc) measures the percentage of agreement between the labels predicted for the generated video $x'$ and for the given real video $x$. Higher classification accuracy indicates that the generated video is more recognizable and the corresponding representation is better disentangled from the others.

  • Inception Score measures the KL divergence between the conditional label distribution $p(y\mid x')$ and the marginal distribution $p(y)$: $\mathrm{IS} = \exp\big(\mathbb{E}_{x'}\,\mathrm{KL}(p(y\mid x')\,\|\,p(y))\big)$.

  • Inter-Entropy $H(y)$ is the entropy of the marginal distribution $p(y)$:

$$H(y) = -\sum_{y} p(y)\log p(y), \quad \text{where } p(y) = \frac{1}{N}\sum_{i=1}^{N} p(y\mid x'_i).$$

    Higher $H(y)$ means the model generates more diverse results.

  • Intra-Entropy $H(y\mid x')$ is the entropy of the conditional class distribution $p(y\mid x')$:

$$H(y\mid x') = -\sum_{y} p(y\mid x')\log p(y\mid x').$$

    Lower $H(y\mid x')$ indicates the generated video is more realistic.
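The three distribution-based quantities above can be computed directly from the classifier's output distributions over a set of generated videos. A small numpy sketch (function names are ours):

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy along the given axis, clipping to avoid log(0)."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=axis)

def generation_metrics(cond_probs):
    """cond_probs[i] = classifier distribution p(y | x'_i) for generated video i."""
    marginal = cond_probs.mean(axis=0)            # p(y)
    inter_entropy = entropy(marginal)             # H(y): higher = more diverse
    intra_entropy = entropy(cond_probs).mean()    # mean H(y|x'): lower = more realistic
    # Inception Score: exp( E_i KL( p(y|x'_i) || p(y) ) )
    kl = np.sum(cond_probs * (np.log(np.clip(cond_probs, 1e-12, 1.0))
                              - np.log(marginal)), axis=1)
    inception_score = np.exp(kl.mean())
    return inception_score, inter_entropy, intra_entropy
```

For instance, two generated videos classified with perfect, distinct one-hot confidence give the maximal Inception Score for two classes, maximal inter-entropy, and near-zero intra-entropy.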

For the speaker verification task on the audio dataset TIMIT, we use the Equal Error Rate (EER). The decision threshold is tuned until the false acceptance rate equals the false rejection rate; this common value is reported as the EER.
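As a sketch, the EER can be computed by sweeping the threshold over the sorted similarity scores and taking the rate at the point where the two error curves cross (here, the midpoint at the nearest crossing). This is an illustrative implementation, not the exact evaluation code used in the experiments:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER for a verification task: labels are 1 for same-speaker pairs.

    Sweeps the accept threshold over the scores (high score = accept) and
    returns the midpoint of FAR and FRR at the nearest crossing point.
    """
    order = np.argsort(scores)[::-1]              # sort scores descending
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # accepting the top-k scores, for k = 0 .. n
    tp = np.concatenate([[0], np.cumsum(labels)])
    fp = np.concatenate([[0], np.cumsum(1 - labels)])
    far = fp / n_neg                              # false acceptance rate
    frr = 1.0 - tp / n_pos                        # false rejection rate
    k = np.argmin(np.abs(far - frr))              # nearest crossing of the two curves
    return float((far[k] + frr[k]) / 2.0)
```

With perfectly separated scores the EER is 0; with scores that rank every pair wrongly it is 1.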

Appendix D Details on Architecture and Training

We implement our model using PyTorch [43] and train it with the Adam optimizer [34]. The batch size is set to 16, and the model is trained for 1000 epochs on each dataset on a GTX 1080 Ti GPU.

The detailed architecture of the encoder and decoder of our S3VAE is summarized in Table 4. The 128-d visual feature from the frame encoder is fed into an LSTM with one hidden layer (256 units), which outputs the mean and variance parameters of the multivariate Gaussian distribution of the static variable. The output of this LSTM is fed into another LSTM with one hidden layer (256 units) to produce the mean and variance parameters of the multivariate Gaussian distribution of the dynamic variable at each time step. Besides, we adopt a trainable LSTM to parameterize the prior of the dynamic variable.

The dimensionality of the latent variables is set separately for the SMMNIST, Sprite, and MUG datasets. The balancing parameters of the three regularization terms are set to 1000, 100, and 1, respectively, for all datasets.

Encoder                                        | Decoder
Input: 64×64 RGB image                         | Input: z
4×4 conv (sd 2, pd 1, ch 64), BN, lReLU(0.2)   | 4×4 convTrans (sd 1, pd 0, ch 512), BN, ReLU, upsample
4×4 conv (sd 2, pd 1, ch 128), BN, lReLU(0.2)  | 3×3 conv (sd 1, pd 1, ch 256), BN, ReLU, upsample
4×4 conv (sd 2, pd 1, ch 256), BN, lReLU(0.2)  | 3×3 conv (sd 1, pd 1, ch 128), BN, ReLU, upsample
4×4 conv (sd 2, pd 1, ch 512), BN, lReLU(0.2)  | 3×3 conv (sd 1, pd 1, ch 128), BN, ReLU, upsample
4×4 conv (sd 1, pd 0, ch 128), BN, Tanh        | 3×3 conv (sd 1, pd 1, ch 64), BN, ReLU
                                               | 1×1 conv (sd 1, pd 0, ch c)

Table 4: Frame encoder and decoder of S3VAE for the SMMNIST, Sprite, and MUG datasets. sd denotes stride; pd, padding; ch, channel; lReLU, leakyReLU. c is the number of image channels, which is 1 for SMMNIST and 3 for the Sprite and MUG datasets.
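As a sanity check on the encoder in Table 4, the standard convolution size formula confirms that the four stride-2 convolutions reduce the 64×64 input to 4×4, and the final 4×4 valid convolution reduces it to a single spatial position:

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution: floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# Encoder of Table 4: four 4x4 stride-2 pad-1 convs, then a 4x4 stride-1 pad-0 conv.
size = 64
for _ in range(4):
    size = conv_out(size, kernel=4, stride=2, pad=1)   # 64 -> 32 -> 16 -> 8 -> 4
size = conv_out(size, kernel=4, stride=1, pad=0)       # 4 -> 1: a single 128-d feature
```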

Appendix E Representation swapping on audio data

We now show qualitative results of representation swapping on audio data in Figure 9. Each heatmap shows the Mel-scale filter bank features of 200 ms in the frequency domain, where the x-axis is temporal with 20 steps and the y-axis represents frequency. As marked by the black rectangle, 24 examples are generated by combining four static variables extracted from the samples in the first column with six dynamic variables extracted from the samples in the first row.

As can be observed, in each column the linguistic phonetic-level content, reflected by the formants along the temporal axis, stays almost the same. The timbre, on the other hand, is reflected in the harmonics of the heatmap, which correspond to the horizontal light stripes; in each row, the harmonics remain consistent, indicating that the timbre of the speaker is preserved. Overall, the results demonstrate the ability of our model to disentangle representations of audio data.

Figure 9: Representation swapping. Each heatmap shows the Mel-scale filter bank features of 200 ms in the frequency domain, where the x-axis is temporal with 20 steps and the y-axis represents frequency. The first row shows the real data from which the dynamic variable, encoding linguistic content, is extracted, while the first column shows the real data from which the static variable, encoding timbre, is extracted. Each of the remaining sequences is generated from the axis-corresponding static and dynamic variables.

Appendix F More Qualitative Results

We report additional qualitative results on representation swapping in Figures 10, 12, and 14. These qualitative results further illustrate the ability of our method to disentangle the static and dynamic representations. As can be seen, the generated videos follow the motion of the motion-providing video while preserving the appearance of the appearance-providing video.

Besides, to validate the effectiveness of our method for manipulating video generation, we show qualitative results of video generation with a fixed representation. Specifically, videos are first generated by fixing the static representation and sampling the dynamic representation. As shown in Figures 11(a), 13(a), and 15(a), the generated videos have the same appearance but perform various motions.

Then videos are generated by fixing the dynamic representation and sampling the static representation. As shown in Figures 11(b), 13(b), and 15(b), the generated videos have various appearances but perform the same motion.

Figure 10: Qualitative results for representation swapping on the SMMNIST dataset. In each panel, the first row is the video that provides the dynamic representation, and the first image of the second row is one frame of the video that provides the static representation. The video generated from these two representations is shown in the second row.
(a) Videos generated by fixing the static representation and sampling the dynamic representation. Each row shows one generated video sequence. All videos show the same digits, which move in various directions in different videos.
(b) Videos generated by fixing the dynamic representation and sampling the static representation. Each row shows one generated video sequence. Different videos show various digits, which perform the same motion.
Figure 11: Manipulating video generation on SMMNIST dataset.
Figure 12: Qualitative results for representation swapping on the Sprite dataset. In each panel, the first row is the video that provides the dynamic representation, and the first image of the second row is one frame of the video that provides the static representation. The video generated from these two representations is shown in the second row.
(a) Videos generated by fixing the static representation and sampling the dynamic representation. Each row shows one generated video sequence. All videos show the same character, which performs various actions in various directions.
(b) Videos generated by fixing the dynamic representation and sampling the static representation. Different videos show various characters, which perform the same motion in the same direction.
Figure 13: Manipulating video generation on Sprite dataset.
Figure 14: Qualitative results for representation swapping on the MUG dataset. In each panel, the first row is the video that provides the dynamic representation, and the first image of the second row is one frame of the video that provides the static representation. The video generated from these two representations is shown in the second row.
(a) Videos generated by fixing the static representation and sampling the dynamic representation. Each row shows one generated video sequence. All videos show the same woman, who performs various expressions.
(b) Videos generated by fixing the dynamic representation and sampling the static representation. Each row shows one generated video sequence. Different videos show different persons, who perform the expression of surprise.
Figure 15: Manipulating video generation on MUG dataset.