1 Introduction
Representation learning is one of the essential research problems in machine learning and computer vision [5]. Real-world sensory data such as videos, images, and audio are often high-dimensional. Representation learning aims to map these data into a low-dimensional space to make it easier to extract semantically meaningful information for downstream tasks such as classification and detection. Recent years have witnessed rising interest in disentangled representation learning, which tries to separate the underlying factors of variation in observed data such that each factor exclusively accounts for one type of semantic attribute of the sensory data. The representation of sequential data is expected to be disentangled into time-varying factors and time-invariant factors. For video data, the identity of a moving object is regarded as a time-invariant factor, and the motion in each frame is considered a time-varying one [37]. For speech data, the representations of the timbre of speakers and the linguistic contents are expected to be disentangled [27]. There are several benefits of learning disentangled representations. First, models that produce disentangled representations are more explainable. Second, disentangled representations make it easier and more efficient to manipulate data generation, which has potential applications in the entertainment industry, training data synthesis [55, 56] and several downstream tasks [32, 19, 18, 45, 57].
Despite the large body of work [24, 33, 6, 15, 16, 7, 30] on disentangled representations of static data (mainly image data), fewer works [27, 37, 23, 48] have explored representation disentanglement for sequential data generation. Among unsupervised models, FHVAE [27] and DSVAE [37] use elaborately designed model architectures and factorize latent variables into static and dynamic parts. These models may handle simple data such as synthetic animation well but fail on realistic data, as we will show later. Besides, as pointed out in [38], unsupervised representation disentanglement is impossible without inductive biases. Without any supervision, the performance of disentanglement can hardly be guaranteed and depends greatly on the random seed and the dimensionality of the latent vectors set in the models. On the other hand, several works [23, 48] resort to label information or attribute annotation as strong supervision for disentanglement. For instance, VideoVAE [23] leveraged holistic attributes to constrain latent variables. Nevertheless, these models require costly data annotation, which prevents them from being deployed in most real-world applications, where a tremendous amount of unlabeled data is available.
To alleviate the drawbacks of both unsupervised and supervised models discussed above, this work tackles representation disentanglement for sequential data generation using self-supervision. In self-supervised learning, various readily obtainable supervisory signals have been explored for representation learning of images and videos, employing auxiliary data such as the ambient sounds in videos [42, 3], the egomotion of cameras [1, 29], the geometry cue in 3D movies [20], and off-the-shelf functional models for visual tracking [53] and optical flow [44, 52]. However, how self-supervised learning benefits representation disentanglement of sequential data has barely been explored.
In this paper, we propose a sequential variational autoencoder (VAE), a recurrent version of the VAE, for sequence generation. In the latent space, the representation is disentangled into time-invariant and time-varying factors. We address representation disentanglement by exploring intrinsic supervision signals, which can be readily obtained from both the data itself and off-the-shelf methods, and accordingly design a series of auxiliary tasks. Specifically, on the one hand, to exclude dynamic information from the time-invariant variable, we exploit the temporal order of the sequential data and expect the time-invariant variable of the temporally shuffled data to be close to, if not the same as, that of the original data. On the other hand, the time-varying variable is expected to contain dynamic information in different modalities. For video data, we let it predict the location of the largest motion in every frame, which can be readily inferred from optical flow. For audio data, the volume in each segment is leveraged as an intrinsic label and supervisory signal. To further encourage representation disentanglement, the mutual information between the static and dynamic variables is minimized as an extra regularization.
To the best of our knowledge, this paper is the first work to explicitly use auxiliary supervision to improve representation disentanglement for sequential data. Extensive experiments on representation disentanglement and sequence data generation demonstrate that, with these multiple freely accessible supervisions, our model dramatically outperforms unsupervised learning-based methods and even performs better than fully-supervised learning-based methods in several cases.
2 Related Work
Disentangled Sequential Data Generation. With the success of deep generative models, recent works [24, 33, 6, 7, 30] resort to variational autoencoders (VAEs) [35] and generative adversarial networks (GANs) [22] to learn a disentangled representation, with regularizations designed accordingly. β-VAE [24] imposed a heavier penalty on the KL divergence term for better disentanglement learning. Follow-up research [33, 6] derived a Total Correlation (TC) term from the KL term and highlighted it as the key factor in disentangled representation learning. In InfoGAN [7], the disentanglement of a latent code $c$ is achieved by maximizing a lower bound on the mutual information between $c$ and the generated sample.
Several works involving disentangled representation have been proposed for video prediction. Villegas et al. [50] and Denton et al. [12] designed two networks to encode pose and content separately at each timestep. Unlike video prediction, video generation from priors, which we perform in this work, is a much harder task since no frame is available for appearance and motion modeling in the generation phase.
To handle video generation, VAEs are extended to a recurrent version [17, 4, 10]. However, these models do not explicitly consider static and dynamic representation disentanglement and fail to perform manipulable data generation. More recently, several works have proposed VAEs with factorized latent variables. FHVAE [27] presented a factorized hierarchical graphical model that imposes sequence-dependent priors and sequence-independent priors on different sets of latent variables in the context of speech data, but did not take advantage of a sequential prior. Combining the merits of recurrent VAEs and FHVAE, DSVAE [37] disentangles latent factors by factorizing them into time-invariant and time-dependent parts and applies an LSTM sequential prior to maintain better sequential consistency for sequence generation. Despite their elaborately designed, complex architectures, these models perform decently only on representation disentanglement of simple data; the disentanglement performance degrades rapidly as the complexity of the data increases. In contrast, our work explores both model and regularization designs for representation disentanglement and sequential data generation. Our model fully factorizes the latent variables into time-invariant and time-varying parts, and both the posterior and the prior of the time-varying variable are modeled by an LSTM for dynamic consistency. The auxiliary tasks with readily accessible supervisory signals are designed to regularize and encourage representation disentanglement.
Self-Supervised Learning. The concept of self-supervised learning traces back to the autoencoder [25], which uses the input itself as supervision to learn the representation. The denoising autoencoder [51] makes the learned representations robust to noise and partial corruption of the input pattern by adding noise to the input. Recent years have witnessed booming interest in self-supervised learning. The sources of supervisory signals can be roughly categorized into three classes. (a) Intrinsic labels: Doersch et al. [13] explored the use of spatial context in images, and Noroozi et al. [41] trained a model to solve jigsaw puzzles as a pretext task. Several works [54, 36] showed that colorizing a grayscale photograph can be used as a powerful pretext task for visual understanding. Temporal information of video is another readily accessible supervisory signal: [39] trained a model to determine whether a sequence of frames from a video is in the correct temporal order, and [31] made the model learn to arrange permuted 3D spatiotemporal crops. (b) Auxiliary data: Agrawal et al. [1] and Jayaraman et al. [29] exploited the freely available knowledge of camera motion as a supervisory signal for feature learning. Ambient sounds in videos [42, 3] are used as a supervisory signal for learning visual models. The geometry cue in 3D movies [20] is utilized for visual representation learning. (c) Off-the-shelf tools: Wang et al. [53] leveraged the visual consistency of objects from a visual tracker in video clips. [44] used segments obtained by motion-based segmentation based on optical flow as pseudo ground truth for single-frame object segmentation.
Instead of learning visual features as in the aforementioned methods, this work aims to achieve static and dynamic representation disentanglement for sequential data such as video and speech. To this end, we leverage supervisory signals from intrinsic labels to regularize the static representation and off-the-shelf tools to regularize the dynamic representation.
3 Sequential VAE Model
We start by introducing some notation and the problem definition. $\mathcal{D} = \{X^i\}_{i=1}^{N}$ is given as a dataset that consists of $N$ i.i.d. sequences, where $X \equiv x_{1:T} = (x_1, x_2, \ldots, x_T)$ denotes a sequence of $T$ observed variables, such as a video of $T$ frames or an audio clip of $T$ segments. We propose a sequential variational autoencoder model, where the sequence is assumed to be generated from a latent variable $z$, and $z$ is factorized into two disentangled variables: the time-invariant (or static) variable $z_f$ and the time-varying (or dynamic) variables $z_{1:T}$.
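Priors. The prior of $z_f$ is defined as a standard Gaussian distribution: $z_f \sim \mathcal{N}(0, I)$. The time-varying latent variables $z_{1:T}$ follow a sequential prior $z_t \mid z_{<t} \sim \mathcal{N}(\mu_t, \mathrm{diag}(\sigma_t^2))$, where $[\mu_t, \sigma_t] = \phi_R^{prior}(z_{<t})$, and $\mu_t$, $\sigma_t$ are the parameters of the prior distribution conditioned on all previous time-varying latent variables $z_{<t}$. The model $\phi_R^{prior}$ can be parameterized as a recurrent network, such as an LSTM [26] or GRU [9], where the hidden state is updated temporally. The prior of $z$ can be factorized as:
$$p(z) = p(z_f)\, p(z_{1:T}) = p(z_f) \prod_{t=1}^{T} p(z_t \mid z_{<t}) \tag{1}$$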
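Generation. The generating distribution of time step $t$ is conditioned on $z_f$ and $z_t$: $x_t \mid z_f, z_t \sim \mathcal{N}(\mu_{x,t}, \mathrm{diag}(\sigma_{x,t}^2))$, where $[\mu_{x,t}, \sigma_{x,t}] = \phi^{Decoder}(z_f, z_t)$ and the decoder $\phi^{Decoder}$ can be a highly flexible function such as a deconvolutional neural network [40].
The complete generative model can be formalized by the factorization:
$$p(x_{1:T}, z_{1:T}, z_f) = p(z_f) \prod_{t=1}^{T} p(x_t \mid z_f, z_t)\, p(z_t \mid z_{<t}) \tag{2}$$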
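The factorization of the generative model suggests ancestral sampling: draw the static variable once, then at each time step draw the dynamic variable from the recurrent prior and decode a frame. The following is a minimal sketch under stated assumptions, not the authors' implementation: `toy_prior_step` and `toy_decoder` are hypothetical stand-ins for the recurrent prior network and the decoder, and only the decoder mean is returned.

```python
import math
import random


def sample_gaussian(mu, sigma, rng):
    """Draw one sample from a diagonal Gaussian N(mu, diag(sigma^2))."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def toy_prior_step(hidden, z_prev):
    """Hypothetical stand-in for an LSTM/GRU prior cell.

    Updates the hidden state from the previous dynamic variable and
    returns the Gaussian parameters of the prior over the next one.
    """
    hidden = [math.tanh(0.5 * h + 0.5 * z) for h, z in zip(hidden, z_prev)]
    mu = [0.9 * h for h in hidden]
    sigma = [0.1 + 0.05 * abs(h) for h in hidden]
    return hidden, mu, sigma


def toy_decoder(z_f, z_t):
    """Hypothetical decoder: mean of the frame given static and dynamic codes."""
    return [math.tanh(v) for v in z_f + z_t]


def generate_sequence(T, dim_f, dim_t, seed=0):
    """Ancestral sampling: z_f once, then z_t and x_t for t = 1..T."""
    rng = random.Random(seed)
    z_f = sample_gaussian([0.0] * dim_f, [1.0] * dim_f, rng)  # standard Gaussian prior
    hidden, z_prev = [0.0] * dim_t, [0.0] * dim_t
    z_seq, frames = [], []
    for _ in range(T):
        hidden, mu, sigma = toy_prior_step(hidden, z_prev)
        z_t = sample_gaussian(mu, sigma, rng)   # prior conditioned on earlier z via hidden
        frames.append(toy_decoder(z_f, z_t))    # frame depends on the pair (z_f, z_t)
        z_seq.append(z_t)
        z_prev = z_t
    return z_f, z_seq, frames
```

Because the static code is sampled once and shared by every frame, changing only the dynamic samples changes the motion while the content that depends on the static code stays fixed.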
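Inference. Our sequential VAE uses variational inference to approximate the posterior distributions:
$$z_f \sim \mathcal{N}(\mu_f, \mathrm{diag}(\sigma_f^2)), \quad z_t \sim \mathcal{N}(\mu_t, \mathrm{diag}(\sigma_t^2)), \tag{3}$$
where $[\mu_f, \sigma_f] = \psi_f^{Encoder}(x_{1:T})$ and $[\mu_t, \sigma_t] = \psi_R^{Encoder}(x_{\le t})$. The static variable $z_f$ is conditioned on the whole sequence, while the dynamic variable is inferred by a recurrent encoder and conditioned only on the previous frames. Our inference model is factorized as:
$$q(z_{1:T}, z_f \mid x_{1:T}) = q(z_f \mid x_{1:T}) \prod_{t=1}^{T} q(z_t \mid x_{\le t}) \tag{4}$$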
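Learning. The objective function of the sequential VAE is a time step-wise negative variational lower bound:
$$\mathcal{L}_{VAE} = \mathbb{E}_{q(z_{1:T}, z_f \mid x_{1:T})}\Big[-\sum_{t=1}^{T} \log p(x_t \mid z_f, z_t)\Big] + \mathrm{KL}\big(q(z_f \mid x_{1:T}) \,\|\, p(z_f)\big) + \sum_{t=1}^{T} \mathrm{KL}\big(q(z_t \mid x_{\le t}) \,\|\, p(z_t \mid z_{<t})\big) \tag{5}$$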
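Since the priors and posteriors in this model are diagonal Gaussians, the KL terms of a negative variational lower bound have a closed form. The sketch below is illustrative only and rests on assumptions: the reconstruction term is passed in as precomputed per-step negative log-likelihoods from a single posterior sample, and the function names are hypothetical rather than taken from the paper.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) )."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total


def negative_elbo(recon_nll_per_step, q_f, q_steps, p_steps):
    """Single-sample estimate of a time step-wise negative lower bound.

    recon_nll_per_step: -log p(x_t | z_f, z_t) for each step (assumed precomputed).
    q_f: (mu, sigma) of the static posterior; its prior is a standard Gaussian.
    q_steps / p_steps: per-step (mu, sigma) of the dynamic posterior / prior.
    """
    mu_f, sigma_f = q_f
    kl_static = kl_diag_gaussians(mu_f, sigma_f, [0.0] * len(mu_f), [1.0] * len(mu_f))
    kl_dynamic = sum(
        kl_diag_gaussians(mq, sq, mp, sp)
        for (mq, sq), (mp, sp) in zip(q_steps, p_steps)
    )
    return sum(recon_nll_per_step) + kl_static + kl_dynamic
```

When posterior and prior coincide at every step, only the reconstruction term remains, which is a quick sanity check for an implementation.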
The schematic representation of our model is shown in Figure 2. Note that DSVAE also proposes a sequential VAE with disentangled representation, but it either independently infers $z_t$ based only on the frame of each time step, without considering the continuity of the dynamic variables, and thus may generate inconsistent motion, or assumes that the variational posterior of $z_t$ depends on $z_f$, implying that the variables are still implicitly entangled. In contrast, we model both the prior and the posterior of $z_t$ by recurrent models independently, resulting in consistent dynamic information in synthetic sequences, and ensure full disentanglement of $z_f$ and $z_t$ by posterior factorization.
4 Self-Supervision and Regularization
Without any supervision, there is no guarantee that the time-invariant representation and the time-varying representation are disentangled. In this section, we introduce a series of auxiliary tasks on the different types of representations as the regularization of our sequential VAE to achieve the disentanglement, where readily accessible supervisory signals are leveraged.
4.1 Static Consistency Constraint
To encourage the time-invariant representation $z_f$ to exclude any dynamic information, we expect that $z_f$ changes little when the dynamic information varies dramatically. To this end, we shuffle the temporal order of frames to form a shuffled sequence. Ideally, the static factors of the original sequence and the shuffled sequence should be very close, if not equal, to one another. However, directly minimizing the distance between these two static factors leads to trivial solutions, e.g., the static factors of all sequences converge to the same value and do not contain any meaningful information. Thus, we randomly sample another sequence as the negative sample of the static factor. With a triple of static factors, we introduce a triplet loss as follows:
$$\mathcal{L}_{SCC}(z_f, z_f^{pos}, z_f^{neg}) = \max\big(D(z_f, z_f^{pos}) - D(z_f, z_f^{neg}) + m,\ 0\big), \tag{6}$$
where $z_f$, $z_f^{pos}$ and $z_f^{neg}$ are the static factors of the anchor sequence, the shuffled sequence as the positive data, and another randomly sampled video as the negative data, $D(\cdot, \cdot)$ denotes the Euclidean distance and $m$ is the margin, set to 1. This regularization makes $z_f$ preserve meaningful static information to a certain degree while excluding dynamic information.
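The triplet construction described above (anchor, temporally shuffled positive, randomly drawn negative) can be sketched as follows. This is a minimal illustration, assuming the static factors have already been produced by some encoder; the helper names are hypothetical, and the margin default of 1 follows the text.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shuffle_frames(frames, rng):
    """Return a temporally shuffled copy of a sequence (the positive sample)."""
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled


def static_consistency_loss(z_anchor, z_pos, z_neg, margin=1.0):
    """Triplet loss on static factors: pull the shuffled sequence's factor
    towards the anchor and push another sequence's factor at least `margin` away."""
    return max(euclidean(z_anchor, z_pos) - euclidean(z_anchor, z_neg) + margin, 0.0)
```

The negative term is what rules out the collapsed solution: if every sequence mapped to the same static factor, both distances would be zero and the loss would stay at the margin instead of reaching zero.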
4.2 Dynamic Factor Prediction
To encourage the dynamic representation $z_t$ to carry adequate and correct time-dependent information at each time step, we exploit dynamic-information-related signals from off-the-shelf tools for different types of sequential data and accordingly design the auxiliary tasks as the regularization imposed on $z_t$. We have the loss $\mathcal{L}_{DFP} = \mathcal{L}(\phi_a(z_t), y)$, where $\mathcal{L}(\cdot, \cdot)$ can be either the cross-entropy loss or the mean squared error loss according to the designed auxiliary task, $\phi_a(\cdot)$ is a network for dynamic factor prediction and $y$ contains the supervisory signals.
Video Data. The dynamic representation can be learned by forcing it to predict the dynamic factors of videos. Motivated by this, we expect that the location of the largest motion regions can be accurately predicted based on $z_t$. To this end, the optical flow maps of the video are first obtained by the commonly used functional model FlowNet2 [28] and then split into patches by a grid, as shown in Figure 3.a. We compute the average motion magnitude for every patch and use the indices of the patches with the top-k largest values as the pseudo label for prediction. For this task, $\phi_a(\cdot)$ is implemented with two fully-connected layers and a softmax layer.
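The pseudo-label construction for video (grid the flow map, average the motion magnitude per patch, keep the top-k patch indices) can be sketched as below. The grid size and `k` are assumptions chosen for illustration, and the input is assumed to be a precomputed per-pixel flow-magnitude map rather than the output of any particular optical flow tool.

```python
def motion_pseudo_label(flow_magnitude, grid=3, k=1):
    """Indices of the k grid patches with the largest mean flow magnitude.

    flow_magnitude: H x W nested list of per-pixel magnitudes, with H, W >= grid.
    Patches are indexed row by row, so index = patch_row * grid + patch_col.
    """
    height, width = len(flow_magnitude), len(flow_magnitude[0])
    patch_means = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            values = [flow_magnitude[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            patch_means.append(sum(values) / len(values))
    # Rank patch indices by mean magnitude, largest first.
    ranked = sorted(range(len(patch_means)), key=lambda i: patch_means[i], reverse=True)
    return ranked[:k]
```

For example, on a 6 x 6 map whose only motion is in the bottom-right corner, a 3 x 3 grid yields patch index 8 as the label.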
Apart from the optical flow, some freely obtainable data-specific signals can be exploited. For a human face dataset, the landmarks of each frame can be readily detected and used as supervision for the dynamic factors. We obtain the landmarks from an off-the-shelf landmark detector [14]. To keep our model concise and efficient, we only leverage the distance between the upper and lower eyelids as well as the distance between the upper and lower lips in each frame as the dynamic signal, as shown in Figure 3.b. Here, $\phi_a(\cdot)$ consists of two fully-connected layers that regress the three distances. We observe that our model can easily capture dynamic motions under this simple supervision.
Audio Data. For the audio dataset, we consider the volume as the time-dependent factor and accordingly design an auxiliary task, where $z_t$ is forced to predict whether the speech is silent or not in each segment. The pseudo label is readily obtained by setting a magnitude threshold on the volume of each speech segment. $\phi_a(\cdot)$ consists of two fully-connected layers and performs a binary classification.
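The silence pseudo label amounts to thresholding a per-segment volume. The sketch below assumes raw waveform samples per segment and uses root-mean-square amplitude as the volume measure; both the volume definition and the threshold value are illustrative assumptions, since the text only specifies that a magnitude threshold is applied.

```python
import math


def rms_volume(samples):
    """Root-mean-square amplitude of one segment (an assumed volume measure)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def silence_pseudo_labels(segments, threshold):
    """Binary label per segment: 1 if the segment is silent, 0 otherwise."""
    return [1 if rms_volume(segment) < threshold else 0 for segment in segments]
```

These labels then serve as the targets of a binary classifier applied to the dynamic variable of each segment.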
4.3 Mutual Information Regularization
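Forcing the time-varying variable to predict dynamic factors can guarantee that $z_t$ contains adequate dynamic information, but it fails to guarantee that $z_t$ excludes the static information. Therefore, we introduce the mutual information between the static and dynamic variables as a regularizer $\mathcal{L}_{MI}$. The mutual information is a measure of the mutual dependence between two variables. By minimizing $\mathcal{L}_{MI}$, we encourage the information in these two variables to be mutually exclusive. The mutual information is formally defined as the KL divergence between the joint distribution and the product of the marginal distributions of the variables. We have
$$\mathcal{L}_{MI}(z_f, z_{1:T}) = \sum_{t=1}^{T} \mathrm{KL}\big(q(z_f, z_t) \,\|\, q(z_f)\, q(z_t)\big) = \sum_{t=1}^{T} \big[H(z_f) + H(z_t) - H(z_f, z_t)\big], \tag{7}$$
where $H(\cdot) = -\mathbb{E}_{q(z_f, z_t)}[\log q(\cdot)]$. The expectation can be estimated by the minibatch weighted sampling estimator [6],
$$\mathbb{E}_{q(z)}[\log q(z)] \approx \frac{1}{M} \sum_{i=1}^{M} \Big[\log \sum_{j=1}^{M} q\big(z(x^i) \mid x^j\big) - \log(NM)\Big], \tag{8}$$
for $z = z_f$, $z_t$ or $(z_f, z_t)$, where $N$ and $M$ are the data size and the minibatch size, respectively.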
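A minibatch-weighted-sampling estimate of the mutual information for one time step can be sketched as follows. It is illustrative only and assumes diagonal Gaussian posteriors that factorize over the static and dynamic variables, so the joint log-density is the sum of the two; the function names and argument layout are hypothetical.

```python
import math


def log_normal_diag(z, mu, sigma):
    """Log-density of z under N(mu, diag(sigma^2))."""
    return sum(
        -0.5 * math.log(2 * math.pi) - math.log(s) - 0.5 * ((v - m) / s) ** 2
        for v, m, s in zip(z, mu, sigma)
    )


def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def mws_entropy(log_q, data_size):
    """Entropy estimate from log_q[i][j] = log q(z(x_i) | x_j) over a minibatch."""
    m = len(log_q)
    mean_log_q = sum(logsumexp(row) - math.log(data_size * m) for row in log_q) / m
    return -mean_log_q


def mutual_information_step(z_f, z_t, post_f, post_t, data_size):
    """H(z_f) + H(z_t) - H(z_f, z_t) for one time step.

    z_f, z_t: sampled codes, one per minibatch element.
    post_f, post_t: (mu, sigma) of the posteriors, one pair per minibatch element.
    """
    m = len(z_f)
    log_qf = [[log_normal_diag(z_f[i], *post_f[j]) for j in range(m)] for i in range(m)]
    log_qt = [[log_normal_diag(z_t[i], *post_t[j]) for j in range(m)] for i in range(m)]
    log_joint = [[log_qf[i][j] + log_qt[i][j] for j in range(m)] for i in range(m)]
    return (
        mws_entropy(log_qf, data_size)
        + mws_entropy(log_qt, data_size)
        - mws_entropy(log_joint, data_size)
    )
```

Summing `mutual_information_step` over the time steps gives the regularizer. Note that the three `log(NM)` terms combine into a single additive constant, so they shift the value of the estimate without affecting its gradient.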
4.4 Objective Function
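Overall, our objective can be considered as the sequential VAE loss with a series of self-supervision and regularization terms:
$$\mathcal{L} = \mathcal{L}_{VAE} + \lambda_1 \mathcal{L}_{SCC} + \lambda_2 \mathcal{L}_{DFP} + \lambda_3 \mathcal{L}_{MI}, \tag{9}$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are balancing factors.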
5 Experiments
To comprehensively validate the effectiveness of our proposed model with self-supervision, we conduct experiments on three video datasets and one audio dataset. With these four datasets, we cover different modalities from video to audio. In the video domain, a large range of motions is covered, from large character motions (e.g., walking, stretching) to subtle facial expressions (e.g., smiling, disgust).
5.1 Experiments on Video Data
We present an in-depth evaluation on two problems, tested on three different datasets and employing a large variety of metrics.
5.1.1 Datasets
Stochastic Moving MNIST (SMMNIST) is introduced by [11] and consists of sequences of frames in which two digits from the MNIST dataset move in random directions. We randomly generate 6000 sequences, 5000 of which are used for training and the rest for testing.
Sprite [37] contains sequences of animated cartoon characters with 9 action categories: walking, casting spells and slashing, each with three viewing angles. The appearance of the characters is fully controlled by four attributes, i.e., the color of skin, tops, pants, and hair. Each attribute category contains 6 possible variants, resulting in $6^4 = 1296$ unique characters in total, 1000 of which are used for training and the rest for testing. Each sequence contains 8 frames.
MUG Facial Expression [2] consists of videos of actors performing different facial expressions: anger, fear, disgust, happiness, sadness, and surprise. Each video is composed of 50 to 160 frames. As suggested in MoCoGAN [49], we crop the face regions, resize the videos, and randomly sample a clip of frames from each video. Part of the dataset is used for training and the rest for testing.
5.1.2 Representation Swapping
We first perform representation swapping and compare our method with DSVAE, a disentangled VAE model, as well as MonkeyNet [47], a state-of-the-art deformable video generation model. Suppose two real videos are given for motion information and appearance information, denoted as $V_m$ and $V_a$. Our method and DSVAE perform video generation based on the $z_f$ from $V_a$ and the $z_{1:T}$ from $V_m$. For MonkeyNet, the videos are generated by deforming the first frame of $V_a$ based on the motion in $V_m$. The synthetic videos are expected to preserve the appearance in $V_a$ and the motion in $V_m$. The qualitative comparisons on the three datasets are shown in Figure 4.
For SMMNIST, the videos generated by our model preserve the identity of the digits while consistently mimicking the motion of the provided video. However, DSVAE can hardly preserve the identity. For instance, it mistakenly changes the digit “9” to “6”. We observe that MonkeyNet can hardly handle cases with multiple objects like SMMNIST, because such cases do not meet its implicit assumption that only one object moves in the video.
For Sprite, DSVAE generates blurry videos when the characters in $V_a$ and $V_m$ face opposite directions, indicating that it fails to encode the direction information in the dynamic variable. Conversely, our model generates videos with the appearance of the character in $V_a$ and the same action and direction as the character in $V_m$, due to the guidance from optical flow. The characters in the videos generated by MonkeyNet fail to follow the pose and action in $V_m$, and many artifacts appear. For example, an arm-like blob appears on the back of the character in the left panel.
For MUG, the video generated by DSVAE can hardly preserve both the appearance in $V_a$ and the facial expression in $V_m$. For example, the person on the right has mixed appearance characteristics, indicating that $z_f$ and $z_t$ are entangled. Due to its deformation-based generation scheme, MonkeyNet fails to handle the case where the faces in the two videos are not well aligned. For instance, forcing the smiling man to express fear results in an unnatural expression. On the contrary, our model disentangles $z_f$ and $z_t$, as supported by the realistic expressions on different faces in the generated videos.
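Table 1: Quantitative results on SMMNIST, Sprite, and MUG.

| Methods | SMMNIST Acc | SMMNIST IS | SMMNIST Intra-Entropy | SMMNIST Inter-Entropy | Sprite Acc | Sprite IS | Sprite Intra-Entropy | Sprite Inter-Entropy | MUG Acc | MUG IS | MUG Intra-Entropy | MUG Inter-Entropy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MoCoGAN | 74.55% | 4.078 | 0.194 | 0.191 | 92.89% | 8.461 | 0.090 | 2.192 | 63.12% | 4.332 | 0.183 | 1.721 |
| DSVAE | 88.19% | 6.210 | 0.185 | 2.011 | 90.73% | 8.384 | 0.072 | 2.192 | 54.29% | 3.608 | 0.374 | 1.657 |
| baseline | 90.12% | 6.543 | 0.167 | 2.052 | 91.42% | 8.312 | 0.071 | 2.190 | 53.83% | 3.736 | 0.347 | 1.717 |
| full model | 95.09% | 7.072 | 0.150 | 2.106 | 99.49% | 8.637 | 0.041 | 2.197 | 70.51% | 5.136 | 0.135 | 1.760 |
| baseline-sv* | 92.18% | 6.845 | 0.156 | 2.057 | 98.91% | 8.741 | 0.028 | 2.196 | 72.32% | 5.006 | 0.129 | 1.740 |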
5.1.3 Video Generation
Quantitative Results. We compute the quantitative metrics of our model with and without self-supervision and regularization, denoted as full model and baseline, as well as two competing methods: DSVAE and MoCoGAN. All these methods are comparable as no ground truth labels are used to benefit representation disentanglement. In addition, the results of our baseline with full supervision from human annotation, baseline-sv, are provided as a reference.
To demonstrate the ability of a model on representation disentanglement, we use the classification accuracy Acc [37], which measures the ability of a model to preserve specific attributes when generating a video given the corresponding representation or label. To measure how diverse and realistic the videos generated by a model are, three metrics are used: Inception Score (IS) [46], Intra-Entropy $H(y|x)$ [23] and Inter-Entropy $H(y)$ [23]. All metrics utilize a classifier pretrained on the real videos and ground truth attributes. See the Appendix for the detailed definitions of the metrics. The results are shown in Table 1.
For representation disentanglement on SMMNIST, we consider generating videos with a given $z_f$ inferred from a real video and $z_{1:T}$ randomly sampled from the prior $p(z_{1:T})$. We then check with the pretrained classifier whether the synthetic video contains the same digits as the real video. For MUG, we evaluate the ability of a model to preserve the facial expression by fixing $z_{1:T}$ and randomly sampling $z_f$ from the prior $p(z_f)$. For Sprite, since the ground truth of both actions and appearance attributes is available, we evaluate the ability to preserve both static and dynamic representations, and report the average scores. Our full model consistently outperforms all competing methods. For SMMNIST, we observe that MoCoGAN has a poor ability to generate the digits with given labels correctly, while our full model generates the correct digits, as reflected by its high Acc. Note that our full model achieves 99.49% Acc on Sprite, indicating that $z_f$ and $z_{1:T}$ are well disentangled. Besides, the full model significantly improves on the baseline, especially on MUG, which contains more realistic data: Acc rises from 53.83% to 70.51%, which illustrates the crucial role of our self-supervision and regularization. For video generation, the full model consistently shows superior performance on IS, $H(y|x)$ and $H(y)$. On MUG in particular, the full model outperforms the runner-up MoCoGAN by 18.6% on IS (5.136 vs. 4.332), demonstrating the high quality of the videos generated by our model. Note that our baseline also compares favorably to DSVAE, illustrating the superiority of the designed sequential VAE model.
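The three generation metrics named above can all be computed from the class distributions that a pretrained classifier assigns to generated videos. The sketch below uses the standard definitions of these quantities, which is an assumption here because the paper defers its exact definitions to the Appendix: Intra-Entropy is the mean entropy of the per-video distributions, Inter-Entropy is the entropy of their average, and the Inception Score is the exponential of their difference.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(v * math.log(v) for v in p if v > 0)


def generation_metrics(probs):
    """Return (inception_score, intra_entropy, inter_entropy).

    probs: one classifier distribution p(y|x) per generated video.
    """
    n = len(probs)
    marginal = [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]
    intra = sum(entropy(p) for p in probs) / n   # H(y|x): low when each video is recognizable
    inter = entropy(marginal)                    # H(y): high when the videos are diverse
    inception = math.exp(inter - intra)          # exp of the mean KL(p(y|x) || p(y))
    return inception, intra, inter
```

With two generated videos classified confidently into two different classes, the score reaches its maximum of 2 for a two-class problem; with uninformative predictions it drops to 1.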
It is worth noting that our self-supervised full model outperforms baseline-sv on representation disentanglement on SMMNIST. A possible reason is that, with the ground truth labels, baseline-sv only encourages $z_f$ to contain identity information but fails to exclude this information from $z_{1:T}$, resulting in confusion in digit recognition when using various dynamic variables. For MUG, baseline-sv performs better at preserving motion. We conjecture that this is because the strong supervision of expression labels forces $z_{1:T}$ to encode the dynamic information, so $z_f$ does not favor encoding temporal dynamic information and varying $z_f$ does not affect the motion much.
Qualitative Results. We first demonstrate the ability of our model to manipulate video generation in Figure 7. By fixing $z_f$ and sampling $z_{1:T}$, our model can generate videos of the same object performing various motions, as shown in Figure 7.a. Even within one video, the facial expression can be changed by controlling $z_{1:T}$, as shown in Figure 7.b.
We also evaluate the appearance diversity of the objects generated by our model. Figure 6 shows the frames our model generates with sampled $z_f$. The realistic and diverse appearances of the objects validate our model’s capability of high-quality video generation.
Besides, we compare our model with MoCoGAN on unconditional video generation on the MUG dataset, as shown in Figure 5. The videos are generated with sampled $z_f$ and $z_{1:T}$. MoCoGAN generates videos with many artifacts, such as unrealistic eyes and an inconsistent mouth in the third video. Conversely, our model generates more realistic human faces with consistent, highly coherent expressions.
5.1.4 Ablation Studies
In this section, we present an ablation study to empirically measure the impact of each regularization on the performance of our model. The variant without a certain regularization $\mathcal{L}_X$ is denoted as No $\mathcal{L}_X$. In Table 2, we report the quantitative evaluation. We note that No $\mathcal{L}_{SCC}$ performs worse than the full model. This illustrates the significance of the static consistency constraint in making $z_f$ disentangled from $z_{1:T}$. No $\mathcal{L}_{DFP}$ degrades the performance on Acc considerably, indicating that $\mathcal{L}_{DFP}$ as a regularization is crucial to preserve the action information in the dynamic vector. Besides, after removing the mutual information regularization, No $\mathcal{L}_{MI}$ again shows inferior performance to the full model. A possible explanation is that our $\mathcal{L}_{MI}$ encourages $z_{1:T}$ to exclude the appearance information; thus the appearance information comes only from $z_f$. The qualitative results shown in Figure 8 confirm this analysis. We generate a video with the $z_f$ of the woman and the $z_{1:T}$ of the man in the first row using different variants of our model. Without $\mathcal{L}_{MI}$, some characteristics of the man are still preserved, such as the beard, confirming that the appearance information partially remains in $z_{1:T}$. In the results of No $\mathcal{L}_{SCC}$, the beard is more evident, indicating that the static and dynamic variables are still entangled. On the other hand, without $\mathcal{L}_{DFP}$, the woman in the generated video cannot mimic the action of the man well, which indicates that the dynamic variable does not encode the motion information properly. Finally, the person in the video generated by the baseline neither preserves the appearance of the woman nor follows the expression of the man. This illustrates that representation disentanglement without any supervision remains a hard task.
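Table 2: Ablation study on MUG.

| Methods | Acc | IS | Intra-Entropy | Inter-Entropy |
|---|---|---|---|---|
| No $\mathcal{L}_{SCC}$ | 61.45% | 4.850 | 0.201 | 1.734 |
| No $\mathcal{L}_{DFP}$ | 58.32% | 4.423 | 0.284 | 1.721 |
| No $\mathcal{L}_{MI}$ | 66.07% | 4.874 | 0.175 | 1.749 |
| Full model | 70.51% | 5.136 | 0.135 | 1.760 |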
5.2 Experiments on Audio Data
To demonstrate the general applicability of our model to sequential data, we conduct experiments on audio data, where the time-invariant and time-varying factors are the timbre of a speaker and the linguistic content of the speech, respectively. The dataset we use is TIMIT, a corpus of phonemically and lexically transcribed speech of American English speakers of different sexes and dialects [21], which contains 6300 recordings of read speech. We split the dataset into training and testing subsets with a ratio of 5:1. As in [27], all the speech is represented as a sequence of 80-dimensional Mel-scale filter bank features.
We quantitatively compare our model with FHVAE and DSVAE on the speaker verification task based on either $z_f$ or $z_{1:T}$, measured by the Equal Error Rate (EER) [8]. Note that we expect the speaker to be correctly verified with $z_f$, as it encodes the timbre of speakers, and verification with $z_{1:T}$ to be close to random guessing, as it ideally encodes only the linguistic content. The results are shown in Table 3. Our model outperforms competing methods in both cases. In particular, when based on $z_{1:T}$, our model doubles the score of the baseline, indicating that our model significantly eliminates the timbre information in $z_{1:T}$.
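Table 3: Speaker verification on TIMIT, measured by EER.

| model | feature | dim | EER |
|---|---|---|---|
| FHVAE | | 16 | 5.06% |
| DSVAE | $z_f$ | 64 | 4.82% |
| DSVAE | $z_{1:T}$ | 64 | 18.89% |
| Ours | $z_f$ | 64 | 4.80% |
| Ours | $z_{1:T}$ | 64 | 40.12% |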
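The Equal Error Rate is the operating point at which the false acceptance rate and the false rejection rate coincide. A minimal sketch of how it can be computed from verification scores follows; it assumes that a higher score means "more likely the same speaker" and scans the observed scores as candidate thresholds, which is a common approximation rather than the evaluation code used in the paper.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER from same-speaker and different-speaker trial scores.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), None
    for threshold in thresholds:
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        # False acceptance: impostor trials scored at or above the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

Under this measure, a feature that identifies speakers well gives an EER near 0%, while a feature that carries no speaker information gives an EER near 50%, which is why a high EER is the desired outcome for the dynamic variable.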
6 Conclusion
We propose a self-supervised sequential VAE, which learns disentangled time-invariant and time-varying representations for sequential data. We show that, with readily accessible supervisory signals from the data itself and off-the-shelf tools, our model can achieve performance comparable to fully supervised models that require costly human annotations. The disentangling ability of our model is qualitatively and quantitatively verified on four datasets across the video and audio domains. The appealing results on a variety of tasks illustrate that leveraging self-supervision is a promising direction for representation disentanglement and sequential data generation. In the future, we plan to extend our model to high-resolution video generation, video prediction and image-to-video generation.
References
 [1] Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision, pages 37–45, 2015.
 [2] N. Aifanti, C. Papachristou, and A. Delopoulos. The MUG facial expression database. In 11th International Workshop on Image Analysis for Multimedia Interactive Services WIAMIS 10, pages 1–4, April 2010.
 [3] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision, pages 609–617, 2017.
 [4] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
 [5] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
 [6] Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pages 2610–2620, 2018.
 [7] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
 [8] Mohamed Chenafa, Dan Istrate, Valeriu Vrabie, and Michel Herbin. Biometric system based on voice recognition using multi-classifiers. In European Workshop on Biometrics and Identity Management, pages 206–215. Springer, 2008.
 [9] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 [10] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015.
 [11] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In 35th International Conference on Machine Learning (ICML), 2018.
 [12] Emily L Denton et al. Unsupervised learning of disentangled representations from video. In Advances in neural information processing systems, pages 4414–4423, 2017.
 [13] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.

 [14] Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Style aggregated network for facial landmark detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 379–388, 2018.
 [15] Babak Esmaeili, Hao Wu, Sarthak Jain, Alican Bozkurt, Narayanaswamy Siddharth, Brooks Paige, Dana H Brooks, Jennifer Dy, and Jan-Willem van de Meent. Structured disentangled representations. arXiv preprint arXiv:1804.02086, 2018.
 [16] Babak Esmaeili, Hao Wu, Sarthak Jain, N Siddharth, Brooks Paige, and Jan-Willem Van de Meent. Hierarchical disentangled representations. stat, 1050:12, 2018.
 [17] Otto Fabius and Joost R van Amersfoort. Variational recurrent autoencoders. arXiv preprint arXiv:1412.6581, 2014.
 [18] Deng-Ping Fan, Ming-Ming Cheng, Jiang-Jiang Liu, Shang-Hua Gao, Qibin Hou, and Ali Borji. Salient objects in clutter: Bringing salient object detection to the foreground. In European Conference on Computer Vision (ECCV). Springer, 2018.
 [19] Deng-Ping Fan, Wenguan Wang, Ming-Ming Cheng, and Jianbing Shen. Shifting more attention to video salient object detection. In IEEE CVPR, 2019.

 [20] Chuang Gan, Boqing Gong, Kun Liu, Hao Su, and Leonidas J Guibas. Geometry guided convolutional neural networks for self-supervised video representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5589–5597, 2018.
 [21] John S Garofolo, L F Lamel, W M Fisher, Jonathan G Fiscus, D S Pallett, and Nancy L Dahlgren. Darpa timit acoustic-phonetic continuous speech corpus cd-rom TIMIT. Technical report, 1993.
 [22] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [23] Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, and Leonid Sigal. Probabilistic video generation using holistic attribute control. In Proceedings of the European Conference on Computer Vision (ECCV), pages 452–467, 2018.
 [24] Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew M Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
 [25] Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length and helmholtz free energy. In Advances in neural information processing systems, pages 3–10, 1994.
 [26] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 [27] Wei-Ning Hsu, Yu Zhang, and James Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in neural information processing systems, pages 1878–1889, 2017.
 [28] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2462–2470, 2017.
 [29] Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to egomotion. In Proceedings of the IEEE International Conference on Computer Vision, pages 1413–1421, 2015.
 [30] Insu Jeon, Wonkwang Lee, and Gunhee Kim. Ibgan: Disentangled representation learning with information bottleneck gan. 2018.
 [31] Longlong Jing and Yingli Tian. Self-supervised visual feature learning with deep neural networks: A survey. arXiv preprint arXiv:1902.06162, 2019.
 [32] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
 [33] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
 [34] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [35] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [36] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6874–6883, 2017.
 [37] Yingzhen Li and Stephan Mandt. Disentangled sequential autoencoder. In International Conference on Machine Learning (ICML), 2018.
 [38] Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In ICML, 2019.
 [39] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527–544. Springer, 2016.
 [40] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.
 [41] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
 [42] Andrew Owens, Jiajun Wu, Josh H McDermott, William T Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In European conference on computer vision, pages 801–816. Springer, 2016.

 [43] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
 [44] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2701–2710, 2017.
 [45] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
 [46] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
 [47] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. Animating arbitrary objects via deep motion transfer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [48] Ximeng Sun, Huijuan Xu, and Kate Saenko. A two-stream variational adversarial network for video generation. arXiv preprint arXiv:1812.01037, 2018.
 [49] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018.
 [50] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017.
 [51] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
 [52] Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, and Wei Liu. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [53] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2015.
 [54] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European conference on computer vision, pages 649–666. Springer, 2016.
 [55] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1004–1013, 2018.
 [56] Yizhe Zhu, Jianwen Xie, Bingchen Liu, and Ahmed Elgammal. Learning feature-to-feature translator by alternating back-propagation for generative zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2019.
 [57] Yizhe Zhu, Jianwen Xie, Zhiqiang Tang, Xi Peng, and Ahmed Elgammal. Semantic-guided multi-attention localization for zero-shot learning. In Thirty-third Conference on Neural Information Processing Systems (NeurIPS), Dec 2019.
Appendix A Minibatch Weighted Sampling
Minibatch Weighted Sampling (MWS) is an estimator of the posterior introduced by [6]. Let N be the size of the dataset and M be the size of a minibatch; the entropy of the posterior distribution can then be estimated from a minibatch:
(10) 
Readers can refer to [6] for details.
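As a rough illustration, the sketch below computes a minibatch entropy estimate in the form published in [6], writing N for the dataset size and M for the minibatch size. The function name `mws_entropy` and the input layout (`log_q[i][j]` holding log q(z_i | x_j), with z_i sampled from the posterior of x_i) are assumptions made for this example and need not match the notation of Eq. 10.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def mws_entropy(log_q, dataset_size):
    """Minibatch-weighted-sampling estimate of H(z) = -E_q(z)[log q(z)].

    log_q[i][j] is log q(z_i | x_j), where z_i was sampled from q(z | x_i)
    and x_1..x_M form the minibatch (layout assumed for this sketch).
    """
    batch_size = len(log_q)
    log_nm = math.log(dataset_size * batch_size)
    # log q(z_i) is approximated by log sum_j q(z_i | x_j) - log(N * M)
    log_qz = [log_sum_exp(row) - log_nm for row in log_q]
    return -sum(log_qz) / batch_size

# Toy input: M = 2, N = 4, and every density equal to 1 (log-density 0).
toy = [[0.0, 0.0], [0.0, 0.0]]
print(mws_entropy(toy, dataset_size=4))
```

For the toy input, each row gives log(2) - log(8), so the estimate equals log 4. In practice the log-densities would come from evaluating the encoder's Gaussian posterior for every pair of samples in the minibatch.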
In our model, the posterior of the latent variable can be factorized as . Thus the entropy of the joint distribution can be estimated:
(11) 
Lemma A.1.
Given a dataset of N samples with a distribution and a minibatch of samples drawn i.i.d. from , and assume the posterior of the latent variable can be factorized as: , the lower bound of , or , is :
where
denotes the probability of a sampled minibatch where one of the elements is fixed to be
and the rest are sampled i.i.d. from .
Proof.
For any sampled batch instance , , and when one of the elements is fixed to be , .
The inequality is due to having a support that is a subset of that of . ∎
Following Lemma A.1, when provided with a minibatch of samples , we can estimate the lower bound as:
(12) 
where is a sample from , and is a sample from .
Appendix B Derivation of Objective Function
We show the derivation of the objective function in Eq. 5. The observation model is defined as:
(13) 
To avoid the intractable integration over and , variational inference introduces a posterior approximation . A variational lower bound of log is:
(14) 
We obtain line 2 from line 1 by plugging in Eq. 2 and Eq. 4.
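The step from the marginal likelihood to a lower bound is the usual Jensen's inequality argument, sketched generically below. The symbols are placeholders chosen for this illustration (x for the observed sequence, z_s and z_d for the static and dynamic latent variables, q for the approximate posterior) and may differ from the notation of Eqs. 13 and 14; the model-specific factorizations of the prior and posterior are not expanded here.

```latex
\begin{aligned}
\log p(x)
  &= \log \iint p(x, z_s, z_d)\, dz_s\, dz_d \\
  &= \log \mathbb{E}_{q(z_s, z_d \mid x)}\!\left[\frac{p(x, z_s, z_d)}{q(z_s, z_d \mid x)}\right] \\
  &\ge \mathbb{E}_{q(z_s, z_d \mid x)}\!\left[\log p(x \mid z_s, z_d)\right]
     - \mathrm{KL}\!\left(q(z_s, z_d \mid x)\,\|\,p(z_s, z_d)\right).
\end{aligned}
```

The inequality follows from the concavity of the logarithm. Substituting a factorized prior and posterior into the KL term then splits it into separate static and dynamic parts.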
Appendix C Metrics Definition
Similar to [48], we define classification accuracy, Inception Score, Inter-Entropy, and Intra-Entropy for the setting where ground-truth labels cannot be provided directly to the model. Let be the generated video based on the representation of a real video with the label or directly conditioned on the label (e.g., MoCoGAN). We have a classifier that is pretrained to predict the labels of real videos.

Classification Accuracy () measures the percentage of generated videos whose predicted label agrees with that of the given real video . Higher classification accuracy indicates that the generated video is more recognizable and that the corresponding representation is better disentangled from the other representation.

Inception Score measures the KL divergence between the conditional label distribution and the marginal distribution .
(15) 
Inter-Entropy is the entropy of the marginal distribution :
(16)
where . A higher value means the model generates more diverse results.

Intra-Entropy is the entropy of the conditional class distribution .
(17)
A lower value indicates that the generated video is more realistic.
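The metrics above can be sketched from the classifier outputs alone. In the example below, `probs[i]` is assumed to be the predicted class distribution p(y|x) for the i-th generated video and `labels[i]` the label of the corresponding real video. The function names, the averaging of the conditional entropy over videos, and the exponentiated form of the Inception Score (the common definition from [46]) are assumptions of this sketch, not a restatement of Eqs. 15 to 17.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def marginal(probs):
    """Average the per-video conditionals p(y|x) to obtain the marginal p(y)."""
    n = len(probs)
    return [sum(row[k] for row in probs) / n for k in range(len(probs[0]))]

def accuracy(probs, labels):
    """Fraction of generated videos whose predicted label equals the given label."""
    hits = sum(1 for row, y in zip(probs, labels) if row.index(max(row)) == y)
    return hits / len(labels)

def inter_entropy(probs):
    """Entropy of the marginal label distribution (higher = more diverse)."""
    return entropy(marginal(probs))

def intra_entropy(probs):
    """Mean entropy of the per-video conditionals (lower = more confident)."""
    return sum(entropy(row) for row in probs) / len(probs)

def inception_score(probs):
    """exp of the mean KL(p(y|x) || p(y)) over the generated videos."""
    p_y = marginal(probs)
    kl = [sum(p * math.log(p / q) for p, q in zip(row, p_y) if p > 0.0)
          for row in probs]
    return math.exp(sum(kl) / len(kl))

# Two generated videos, each classified with full confidence into a different class.
probs = [[1.0, 0.0], [0.0, 1.0]]
print(accuracy(probs, [0, 1]), inter_entropy(probs), intra_entropy(probs))
```

With these definitions, the logarithm of the Inception Score equals the Inter-Entropy minus the Intra-Entropy, which is why diverse and confident predictions both raise the score.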
For the speaker verification task on the audio dataset TIMIT, we use the Equal Error Rate (EER) as the metric. The decision threshold is tuned so that the false acceptance rate equals the false rejection rate; this common value is the EER.
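A minimal sketch of how an EER can be approximated from verification scores is given below. The function name, the convention that a trial is accepted when its score is at least the threshold, and the toy scores are all assumptions for illustration; with finitely many trials the two error rates may not cross exactly, so the sketch returns their average at the closest threshold.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by scanning every observed score as a threshold.

    A trial is accepted when its score is >= the threshold (assumed convention).
    Returns (eer, threshold) at the threshold where FAR and FRR are closest.
    """
    best = None
    for t in sorted(set(genuine_scores) | set(impostor_scores)):
        # False acceptance rate: impostor trials that pass the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: genuine trials that fall below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]

# Toy similarity scores for same-speaker and different-speaker trials.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, impostor)
print(eer, threshold)
```

For the toy scores, one impostor trial is accepted and one genuine trial is rejected at threshold 0.6, so both error rates are 0.25.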
Appendix D Details on Architecture and Training
We implement our model using PyTorch [43] and use the Adam optimizer [34] with and . The learning rate is set to
and the batch size is set to 16. Our model is trained for 1000 epochs on each dataset on a GTX 1080 Ti GPU.
The detailed architecture of the frame encoder and decoder of our S3VAE is summarized in Table 4. The 128-d visual feature from the frame encoder is fed into an LSTM with one hidden layer (256-d), and the LSTM outputs the parameters and for the multivariate Gaussian distribution of the static variable of dimension . The output of the LSTM is fed into another LSTM with one hidden layer (256-d) to produce the parameters and for the multivariate Gaussian distribution of the dynamic variable of dimension for each time step. Besides, we adopt a trainable LSTM to parameterize the prior of the dynamic variable, .
The dimensionality of latent variables is set to for SMMNIST, Sprite, MUG, respectively. The balancing parameters and are set to 1000, 100, 1, respectively, for all datasets.
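The spatial sizes implied by the frame encoder and decoder layers listed in Table 4 can be checked with the standard convolution output-size formulas. The sketch below is only a sanity check under two assumptions that the text does not state: the decoder input is treated as a 1x1 spatial map, and each "upsample" doubles the resolution.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride, padding):
    """Output size of a transposed convolution (no output padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel

# Encoder: (kernel, stride, padding) of each conv layer, as listed in Table 4.
encoder_layers = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]
encoder_sizes = [64]
for k, s, p in encoder_layers:
    encoder_sizes.append(conv_out(encoder_sizes[-1], k, s, p))
print(encoder_sizes)  # [64, 32, 16, 8, 4, 1]

# Decoder: transposed conv, four (upsample, 3x3 conv) stages, then a 1x1 conv.
size = conv_transpose_out(1, kernel=4, stride=1, padding=0)  # 1x1 input assumed
decoder_sizes = [size]
for _ in range(4):
    # Upsampling by a factor of 2 is an assumption; the 3x3 conv keeps the size.
    size = conv_out(size * 2, kernel=3, stride=1, padding=1)
    decoder_sizes.append(size)
size = conv_out(size, kernel=1, stride=1, padding=0)
decoder_sizes.append(size)
print(decoder_sizes)  # [4, 8, 16, 32, 64, 64]
```

Under these assumptions the encoder reduces a 64x64 frame to a 1x1 map with 128 channels, matching the 128-d visual feature, and the decoder returns to 64x64.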
Encoder                         | Decoder
Input 64x64 RGB image           | Input z
4x4 conv(sd 2, pd 1, ch 64)     | 4x4 convTrans(sd 1, pd 0, ch 512)
BN, lReLU(0.2)                  | BN, ReLU, upsample
4x4 conv(sd 2, pd 1, ch 128)    | 3x3 conv(sd 1, pd 1, ch 256)
BN, lReLU(0.2)                  | BN, ReLU, upsample
4x4 conv(sd 2, pd 1, ch 256)    | 3x3 conv(sd 1, pd 1, ch 128)
BN, lReLU(0.2)                  | BN, ReLU, upsample
4x4 conv(sd 2, pd 1, ch 512)    | 3x3 conv(sd 1, pd 1, ch 128)
BN, lReLU(0.2)                  | BN, ReLU, upsample
4x4 conv(sd 1, pd 0, ch 128)    | 3x3 conv(sd 1, pd 1, ch 64)
BN, Tanh                        | BN, ReLU
                                | 1x1 conv(sd 1, pd 0, ch )
                                | sigmoid
Table 4: Frame encoder and decoder of S3VAE for the SMMNIST, Sprite, and MUG datasets. Here sd denotes stride; pd, padding; ch, channels; lReLU, leaky ReLU. The channel count of the final 1x1 conv is the number of image channels, which is 1 for SMMNIST and 3 for the Sprite and MUG datasets.
Appendix E Representation swapping on audio data
Figure 9 shows qualitative results of representation swapping on audio data. Each heatmap shows the mel-scale filter bank features of a 200 ms segment in the frequency domain, where the x-axis is temporal with 20 steps and the y-axis represents frequency. As marked by the black rectangle, 24 examples are generated by combining four static variables extracted from the samples in the first column with six dynamic variables extracted from the samples in the first row.
As can be observed, in each column the phonetic-level linguistic content, reflected by the formants along the temporal axis, is kept almost the same. The timbre, on the other hand, is reflected by the harmonics in the heatmap, which correspond to horizontal light stripes. In each row, the harmonics of the heatmaps remain consistent, indicating that the timbre of the speaker is preserved. Overall, the results demonstrate the ability of our model to disentangle the representation of audio data.
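The construction of such a swapping grid can be sketched as follows. Everything here is a hypothetical stand-in: strings replace the latent codes, and `decode` replaces the trained decoder that would map a static code and a dynamic code to a generated sequence.

```python
def swap_grid(static_codes, dynamic_codes, decode):
    """Decode every (static, dynamic) pair.

    Each row shares one static code and each column shares one dynamic code.
    """
    return [[decode(s, d) for d in dynamic_codes] for s in static_codes]

# Hypothetical stand-ins for four static and six dynamic latent codes.
static_codes = ["speaker_%d" % i for i in range(4)]
dynamic_codes = ["content_%d" % j for j in range(6)]

# Placeholder decoder that simply records which codes were combined.
grid = swap_grid(static_codes, dynamic_codes, decode=lambda s, d: (s, d))
count = sum(len(row) for row in grid)
print(count)  # 24
```

Reading the grid row by row keeps the static code fixed, and reading it column by column keeps the dynamic code fixed, which is the layout described for the heatmaps.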
Appendix F More Qualitative Results
We report additional qualitative results on representation swapping in Figures 10, 12 and 14. These qualitative results further illustrate the ability of our method to disentangle the static and dynamic representations. As can be seen, the generated videos follow the motion of while preserving the appearance of the .
Besides, to validate the effectiveness of our method for manipulating video generation, we show qualitative results of video generation with a fixed representation. Specifically, videos are first generated by fixing the static representation and sampling the dynamic representation . As shown in 10(a), 12(a) and 14(a), the generated videos have the same appearance but perform various motions.
Videos are then generated by fixing the dynamic representation and sampling the static representation . As shown in 10(b), 12(b) and 14(b), the generated videos have various appearances but perform the same motions.