Generative Models for Low Rank Video Representation and Reconstruction
Finding compact representations of videos is an essential component in almost every problem related to video processing or understanding. In this paper, we propose a generative model to learn compact latent codes that can efficiently represent and reconstruct a video sequence from its missing or undersampled measurements. We use a generative network that is trained to map a compact code into an image. We first demonstrate that if a video sequence belongs to the range of the pretrained generative network, then we can recover it by estimating the underlying compact latent codes. Then we demonstrate that even if the video sequence does not belong to the range of a pretrained network, we can still recover the true video sequence by jointly updating the latent codes and the weights of the generative network. To avoid overfitting in our model, we regularize the recovery problem by imposing low-rank and similarity constraints on the latent codes of the neighboring frames in the video sequence. We use our methods to recover a variety of videos from compressive measurements at different compression rates. We also demonstrate that we can generate missing frames in a video sequence by interpolating the latent codes of the observed frames in the low-dimensional space.
Deep generative networks, such as autoencoders, generative adversarial networks (GANs), and variational autoencoders (VAEs), are now commonly used in almost every machine learning and computer vision task [12, 24, 16, 35]. One key idea in these generative networks is that they can learn to transform a low-dimensional feature vector (or latent code) into realistic images and videos. The range of the generated images is expected to be close to the true underlying distribution of training images. Once these networks are properly trained (which remains a nontrivial task), they can generate remarkable images in the trained categories of natural scenes.

In this paper, we propose to use a deep generative model for compact representation and reconstruction of videos from a small number of linear measurements. We assume that a generative network trained on some class of images is available, which we represent as
$$x = G_\gamma(z) \equiv g_{\gamma_L} \circ g_{\gamma_{L-1}} \circ \cdots \circ g_{\gamma_1}(z). \tag{1}$$
$G_\gamma(\cdot)$ denotes the overall function for the deep network with $L$ layers that maps a low-dimensional (latent) code $z \in \mathbb{R}^k$ into an image $x \in \mathbb{R}^n$, and $\gamma = \{\gamma_1, \ldots, \gamma_L\}$ represents all the weight parameters of the deep network. $G_\gamma(\cdot)$ as given in (1) can be viewed as a cascade of $L$ functions $g_{\gamma_l}$ for $l = 1, \ldots, L$, each of which represents a mapping between the input and output of the respective layer. An illustration of such a generator is shown in Figure 1. Suppose we are given a sequence of measurements for $t = 1, \ldots, T$ as
$$y_t = A_t x_t + e_t, \tag{2}$$
where $x_t$ denotes the $t^{\text{th}}$ frame in the unknown video sequence, $y_t$ denotes its observed measurements, $A_t$ denotes the respective measurement operator, and $e_t$ denotes noise or error in the measurements. Our goal is to recover the video sequence ($x_t$) from the available measurements ($y_t$). The recovery problem becomes especially challenging as the number of measurements (in $y_t$) becomes very small compared to the number of unknowns (in $x_t$). To ensure quality reconstruction in such settings, we need a compact (low-dimensional) representation of the unknown signal. Thus, we use the given generative model to represent the video sequence as $x_t = G_\gamma(z_t)$ and the observed measurements as $y_t = A_t G_\gamma(z_t)$.
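The measurement model in (2) can be simulated with a few lines of NumPy. The following is a minimal sketch, assuming vectorized frames and one independent Gaussian measurement matrix per frame; the function name `gaussian_measurements`, the 1/sqrt(m) scaling, and the toy sizes are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def gaussian_measurements(frames, m, rng, noise_std=0.0):
    """Return per-frame matrices A_t and measurements y_t = A_t x_t + e_t.

    frames: array of shape (T, n), one vectorized frame per row.
    """
    T, n = frames.shape
    # One independent m-by-n Gaussian matrix per frame
    # (the 1/sqrt(m) scaling is an assumption made for this sketch).
    A = rng.normal(scale=1.0 / np.sqrt(m), size=(T, m, n))
    e = noise_std * rng.normal(size=(T, m))
    y = np.einsum("tmn,tn->tm", A, frames) + e
    return A, y

rng = np.random.default_rng(0)
frames = rng.uniform(-1.0, 1.0, size=(5, 64))   # T = 5 frames with n = 64 pixels
A, y = gaussian_measurements(frames, m=8, rng=rng)
```

With m much smaller than n, each frame is observed through far fewer numbers than it has pixels, which is the regime where a compact latent representation becomes necessary.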
We first demonstrate that if a video sequence belongs to the range of the network $G_\gamma$, then we can reconstruct it by optimizing directly over the latent codes $z_t$. Then we demonstrate that even if a video sequence lies outside the range of the given network $G_\gamma$, we can still reconstruct it by jointly optimizing over the network weights $\gamma$ and the latent codes $z_t$. To exploit similarities among the frames in a video sequence, we also include low-rank and similarity constraints on the latent codes. We note that the pretrained network we used in our experiments is highly overparameterized; therefore, low-rank and similarity constraints help in regularizing the network and finding a good solution, presumably near the initial weights.
Video signals have natural redundancies along spatial and temporal dimensions that can be exploited to learn their compact representations. Such compact representations can then be used for compression, denoising, restoration, and other processing/transmission tasks. Historically, video representation schemes have relied on handcrafted blocks that include motion estimation/compensation and sparsifying transforms such as the discrete cosine transform (DCT) and wavelets [36, 8, 32, 22]. Recent progress in data-driven representation methods offers new opportunities to develop improved schemes for compact representation of videos [21, 25, 19].
Compressive sensing refers to a broad class of problems in which we aim to recover a signal from a small number of measurements [5, 10, 6]. The canonical compressive sensing problem in (2) is inherently underdetermined, and we need to use some prior knowledge about the signal structure. Classical signal priors exploit sparse and low-rank structures in images and videos for their reconstruction [11, 1, 37, 28, 38].
Deep generative models offer a new framework for compact representation of images and videos. A generative model can be viewed as a function that maps a given input (or latent) code into an image. For compact representation of images, we seek a generative model that can generate a variety of images with high fidelity using a very low-dimensional latent code. Recently, a number of generative models have been proposed to learn a latent representation of an image with respect to a generator [20, 39, 9]. The learning process usually involves gradient descent to estimate the best representation of the latent code, where the gradients with respect to the latent code representation are backpropagated to the pixel space [3].

In recent years, generative networks have been extensively used for learning good representations for images and videos. Generative adversarial networks (GANs) and variational autoencoders (VAEs) [12, 17, 15, 2] learn a function that maps vectors drawn from a certain distribution in a low-dimensional space into images in a high-dimensional space. An attractive feature of VAEs [17] and GANs [12] is their ability to transform feature vectors to generate a variety of images from a different set of desired distributions. Our technical approach bears some similarities with recent work on image generation and manipulation via conditional GANs and VAEs [7, 13, 29]. For example, we can create new images with the same content but different articulations by changing the input latent codes [7, 23]. In [3], the authors presented a framework for jointly optimizing latent code and network parameters while training a standalone generator network. Furthermore, linear arithmetic operations in the latent space of generators can lead to meaningful image transformations. In our paper, we apply similar principles to generate different frames in a video sequence while jointly optimizing latent codes and generator parameters, but ensuring that the latent codes belong to a small subspace (even a line, as we show in Figure 6).
In this paper, we use a generative model as a prior for video signal representation and reconstruction. Our generative model and optimization are inspired by recent work on using generative models for compressive sensing in [4, 34, 14, 27, 33]. Recently, [4] showed that a trained deep generative network can be used as a prior for image reconstruction from compressive measurements; the reconstruction problem involves optimization over the latent code of the generator. In a related work, [33] observed that an untrained convolutional generative model can also be used as a prior for solving inverse problems such as inpainting and denoising because of its tendency to generate natural images; the reconstruction problem involves optimization of the generator network weights. Inspired by these observations, a number of methods have been proposed for solving compressive sensing problems by optimizing the generator network weights while keeping the latent code fixed at a random value [14, 34]. Because these methods allow the generator parameters to change, the generator can reconstruct a wide range of images. However, because the latent codes are initialized randomly and stay fixed, these methods do not yield representative latent codes for the images.
In our proposed method, we use the generative model in (1) to find a compact representation of videos in the form of latent codes $z_t$. To reconstruct a video sequence from the compressive measurements in (2), we either optimize over the latent codes $z_t$ alone or optimize over the network weights $\gamma$ and the latent codes $z_t$ in a joint manner. Since the frames in a video sequence exhibit rich redundancies in their representation, we hypothesize that if the generator function is continuous, then the similarity of the frames would translate into similarity in their corresponding latent codes. Based on this hypothesis, we impose similarity and low-rank constraints on the latent codes to represent the video sequence with an even more compact representation of the latent codes. An illustration of the differences between the types of representations is shown in Figure 2.
In this paper, we propose to use a low-rank generative prior for compact representation of a video sequence, which we then use to solve some video compressive sensing problems. The key contributions of this paper are as follows.
We first demonstrate that we can learn a compact representation of a video sequence in the form of low-rank latent codes for a deep generative network similar to the one depicted in Figure 1.
Consecutive frames in a video sequence share many similarities. To encode similarities among the reconstructed frames, we introduce low-rank and similarity constraints on the generator latent codes. This enables us to represent a video sequence with a very small number of parameters in the latent codes and reconstruct it from a very small number of measurements.
Latent code optimization can only reconstruct a video sequence that belongs to the range of the generator. We demonstrate that by jointly optimizing the latent codes with the network weights, we can expand the range of the generator and reconstruct images that the given initial generator fails on. We show that even though the network has a very large number of parameters, the joint optimization still converges to a good solution with similarity and low-rank constraints on the latent codes.
We show that, in some cases, the low-rank structure on the latent codes also provides a low-dimensional manifold that can be used to generate new frames that are similar to the given sequence.
Let us assume that $x_t$ for $t = 1, \ldots, T$ is a sequence of video frames that we want to reconstruct from the measurements $y_t$ as given in (2). The generative model as given in (1) maps a low-dimensional representation vector, $z_t$, to a high-dimensional image as $x_t = G_\gamma(z_t)$. Thus, our goal of video recovery is equivalent to solving the following optimization problem over the $z_t$:
$$\text{find } z_t \ \text{ such that } \ y_t = A_t G_\gamma(z_t), \quad t = 1, \ldots, T, \tag{3}$$
which can be viewed as a nonlinear system of equations.
In latent code optimization, we assume that the function $G_\gamma$ approximates the probability distribution of the set of natural images to which our target image belongs. Thus, we can restrict our search for the underlying video sequence, $x_t$, to the range of the generator. A similar problem has been studied in [4] for image compressive sensing.

Given a pretrained generator, $G_\gamma$, measurement sequence, $y_t$, and the measurement matrices, $A_t$, we can solve the following optimization problem to recover the low-dimensional latent codes, $z_t$, for our target video sequence, $x_t$:
$$\min_{z_1, \ldots, z_T} \ \sum_{t=1}^{T} \| y_t - A_t G_\gamma(z_t) \|_2^2. \tag{4}$$
Since we can backpropagate the gradient with respect to the $z_t$ through the generator, we can solve the problem in (4) using gradient descent. Although latent code optimization can solve the compressive sensing problem with high probability, it cannot solve the problem when the images do not belong to the range of the generator. Because there is a wide variety of images, it is difficult to represent them with a single generator or a few generators. In such scenarios, latent code optimization proves to be inadequate.
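To make the latent code optimization concrete, the sketch below runs gradient descent over a single latent code for a toy generator. It is a hedged illustration only: the one-layer generator `tanh(W z)`, the backtracking step-size rule, and all sizes are assumptions chosen so that the gradient can be written by hand; the paper uses a pretrained DCGAN-style network and backpropagation instead.

```python
import numpy as np

def generator(W, z):
    """Toy stand-in for the generator: one linear layer followed by tanh."""
    return np.tanh(W @ z)

def loss_and_grad(W, A, y, z):
    """Squared measurement error ||y - A G(z)||^2 and its gradient w.r.t. z."""
    h = np.tanh(W @ z)
    resid = A @ h - y
    loss = float(resid @ resid)
    # Chain rule: dL/dz = 2 W^T [(1 - h^2) * (A^T resid)]
    grad = 2.0 * W.T @ ((1.0 - h ** 2) * (A.T @ resid))
    return loss, grad

def recover_latent(W, A, y, k, steps=200):
    """Gradient descent over z; backtracking keeps the loss from increasing."""
    z = np.zeros(k)
    loss, grad = loss_and_grad(W, A, y, z)
    history = [loss]
    for _ in range(steps):
        step = 1.0
        while step > 1e-12:
            z_new = z - step * grad
            new_loss, new_grad = loss_and_grad(W, A, y, z_new)
            if new_loss < loss:
                z, loss, grad = z_new, new_loss, new_grad
                break
            step *= 0.5
        history.append(loss)
    return z, history

rng = np.random.default_rng(0)
n, k, m = 64, 4, 16                        # pixels, latent size, measurements
W = rng.normal(size=(n, k)) / np.sqrt(k)   # fixed "pretrained" weights (toy)
A = rng.normal(size=(m, n)) / np.sqrt(m)
z_true = rng.normal(size=k)
y = A @ generator(W, z_true)               # target lies in the generator range
z_hat, history = recover_latent(W, A, y, k)
```

Only the k entries of the latent code are updated here; the weights stay fixed, so the reconstruction can never leave the range of the toy generator.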
Any generator has a limited range within which it can generate images; the range of a generator presumably depends on the types of images used during training. To highlight this limitation, we performed an experiment in which we tried to generate a video sequence that is very different from the examples on which our generator was trained. This is not a compressive sensing experiment; we provide the original video sequences to the generator and find the best approximation of each sequence that the generator can produce. The results are shown in Figure 3 using two video sequences: Moving MNIST and Color Wheel. In both cases, network weights are initialized with the weights of a generator that was trained on a different dataset. The pretrained network used for the Moving MNIST example was trained on the standard MNIST dataset, which does not include any image with two digits. Therefore, the generator trained on MNIST fails on Moving MNIST if we only optimize over the latent code, because the Moving MNIST dataset consists of images with two digits. The joint optimization of latent code and generator parameters, however, can recover the entire Moving MNIST sequence with high quality. For Color Wheel, the original generator was trained on the CIFAR10 training set, which contains diverse categories of images. However, as we see in Figure 3, the generator still fails to produce quality images of the color wheel when only the latent code is updated. Joint optimization improves the reconstruction quality significantly.
The results presented in Figure 3 should not be surprising for the following reasons: We are providing a video sequence to a generator that has only $k$ degrees of freedom for each $z_t$; therefore, the range of sequences that can be generated by changing the $z_t$ is quite limited for a fixed $\gamma$. In contrast, if we let $\gamma$ change while we learn the $z_t$, then the network can potentially generate any image in $\mathbb{R}^n$ because we have a very large number of degrees of freedom. Note that in our generator, the number of parameters in $\gamma$ is significantly larger than the size of $x_t$ or $z_t$.
The surprising thing, however, is that we can also recover quality images by jointly optimizing the latent codes and network weights while solving the compressive sensing problem. In other words, we can overcome the range limitation of the generator by optimizing generator parameters alongside latent code to get a good reconstruction from compressive measurements as well as good representative latent codes for the video sequence even though the network is highly overparameterized. The resulting optimization problem can be written as
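$$\min_{z_1, \ldots, z_T;\, \gamma} \ \sum_{t=1}^{T} \| y_t - A_t G_\gamma(z_t) \|_2^2, \tag{5}$$
where the reconstructed video sequence can be generated using the estimated latent codes and generator weights as $\hat{x}_t = G_{\hat{\gamma}}(\hat{z}_t)$.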
This joint optimization of latent codes and generator parameters gives the optimization problem a lot of flexibility to generate a wide range of images. As the generator function is highly nonconvex, we initialize $\gamma$ with the pretrained set of weights. After every gradient descent update of the latent codes, $z_t$, we update the model parameters with stochastic gradient descent.
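The alternating update described above (latent codes first, then network weights) can be sketched on a toy problem. This is an illustrative sketch under stated assumptions, not the authors' implementation: the generator is a single `tanh(W z)` layer, full-batch gradient steps with backtracking replace SGD so that the toy loss is guaranteed not to increase, and every name and size is hypothetical.

```python
import numpy as np

def total_loss(W, Z, A, Y):
    """Sum_t ||y_t - A_t tanh(W z_t)||^2, with the z_t stored as columns of Z."""
    H = np.tanh(W @ Z)                        # n x T generated frames
    R = np.einsum("tmn,nt->tm", A, H) - Y     # T x m residuals
    return float(np.sum(R ** 2)), H, R

def gradients(W, Z, A, Y):
    """Gradients of the loss with respect to the latent codes and the weights."""
    _, H, R = total_loss(W, Z, A, Y)
    B = np.einsum("tmn,tm->nt", A, R) * (1.0 - H ** 2)   # back-projected residuals
    return 2.0 * W.T @ B, 2.0 * B @ Z.T

def descend(loss_fn, param, grad, loss):
    """One gradient step with backtracking so that the loss never increases."""
    step = 1.0
    while step > 1e-12:
        candidate = param - step * grad
        candidate_loss = loss_fn(candidate)
        if candidate_loss < loss:
            return candidate, candidate_loss
        step *= 0.5
    return param, loss

def joint_recover(W_init, A, Y, k, iters=50):
    """Alternate between updating the latent codes and the generator weights."""
    W, Z = W_init.copy(), np.zeros((k, Y.shape[0]))
    loss = total_loss(W, Z, A, Y)[0]
    history = [loss]
    for _ in range(iters):
        grad_Z, _ = gradients(W, Z, A, Y)
        Z, loss = descend(lambda P: total_loss(W, P, A, Y)[0], Z, grad_Z, loss)
        _, grad_W = gradients(W, Z, A, Y)
        W, loss = descend(lambda P: total_loss(P, Z, A, Y)[0], W, grad_W, loss)
        history.append(loss)
    return W, Z, history

rng = np.random.default_rng(1)
T, n, k, m = 6, 32, 3, 12
W_init = rng.normal(size=(n, k)) / np.sqrt(k)   # stands in for pretrained weights
A = rng.normal(size=(T, m, n)) / np.sqrt(m)
X = rng.uniform(-0.9, 0.9, size=(n, T))         # frames not produced by the toy generator
Y = np.einsum("tmn,nt->tm", A, X)
W_hat, Z_hat, history = joint_recover(W_init, A, Y, k)
```

Because the weights are free to move away from `W_init`, the toy generator can fit frames that it could not represent with the initial weights, which mirrors the range-expansion argument in the text.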
A generative prior gives us an opportunity to utilize the corresponding latent codes. The latent codes can be viewed as nonlinear, low-dimensional projections of the original images. In a video sequence, each frame has some similarities with the neighboring frames. Even though the similarity may seem very complex in the original dimension, it can become much simpler when we encode each image to a low-dimensional latent code. If the latent codes are long enough to encode the changes in the image domain, then they can also be used for applying similarity constraints on the image domain.
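We assume that if the images are similar to each other, then their corresponding latent codes must be similar too. To exploit this structure, we propose to solve the following optimization problem with similarity constraints:
$$\min_{z_1, \ldots, z_T;\, \gamma} \ \sum_{t=1}^{T} \| y_t - A_t G_\gamma(z_t) \|_2^2 + \sum_{t=1}^{T-1} \lambda_t \| z_t - z_{t+1} \|_2^2, \tag{6}$$
where the $\lambda_t$ are the weights that represent some measure of similarity between the $t^{\text{th}}$ and $(t+1)^{\text{th}}$ frames. Assuming the adjacent frames in a sequence are close to each other, we fix $\lambda_t = \lambda$ for all $t$ for simplicity.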
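A similarity term on neighboring latent codes is simple to compute. The sketch below assumes the penalty takes the form of a single weight `lam` times the sum of squared differences between consecutive latent codes; this form and the function names are assumptions made for illustration.

```python
import numpy as np

def similarity_penalty(Z, lam=1.0):
    """lam * sum_t ||z_{t+1} - z_t||^2 for latent codes stored as columns of Z."""
    D = np.diff(Z, axis=1)               # column t holds z_{t+1} - z_t
    return lam * float(np.sum(D ** 2))

def similarity_grad(Z, lam=1.0):
    """Gradient of the penalty with respect to every latent code."""
    D = np.diff(Z, axis=1)
    G = np.zeros_like(Z)
    G[:, :-1] -= 2.0 * lam * D           # a descent step moves z_t toward z_{t+1}
    G[:, 1:] += 2.0 * lam * D            # and z_{t+1} toward z_t
    return G

# Three 2-dimensional latent codes: (0, 0), (1, 0), (3, 0).
Z = np.array([[0.0, 1.0, 3.0],
              [0.0, 0.0, 0.0]])
penalty = similarity_penalty(Z)
grad = similarity_grad(Z)
```

For this example the consecutive differences have squared norms 1 and 4, so the penalty is 5 when `lam` is 1. In a full solver this gradient would simply be added to the gradient of the measurement loss.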
To further exploit the redundancies in a video sequence, we assume that the variations in the sequence of images are localized and that the sequence of latent codes can be represented in a much lower dimensional space compared to their ambient dimension. For each minibatch, we define a matrix $Z$ such that
$$Z = [z_1 \ z_2 \ \cdots \ z_T],$$
where $z_t$ is the latent code corresponding to the $t^{\text{th}}$ image of the sequence. To explore low-rank embedding, we solve the following constrained optimization:
$$\min_{z_1, \ldots, z_T;\, \gamma} \ \sum_{t=1}^{T} \| y_t - A_t G_\gamma(z_t) \|_2^2 \quad \text{s.t.} \quad \operatorname{rank}(Z) = r. \tag{7}$$
We implement this constraint by reconstructing the matrix $Z$ from its top $r$ singular vectors in each iteration. Thus the rank of the matrix $Z$ formed by a sequence of $T$ images becomes $r$, which implies that we can express each of the latent codes in terms of $r$ orthogonal basis vectors $u_1, \ldots, u_r$. For rank-$r$ embedding, we represent each latent code as a linear combination of the orthogonal basis vectors as
$$z_t = \sum_{j=1}^{r} \alpha_{tj} u_j, \tag{8}$$
where $\alpha_{tj}$ is the weight of the corresponding basis vector.
We can now represent a video sequence with $T$ frames with $r$ orthogonal codes. This offers an additional compression to our latent codes. We use the same idea to linearize the motion manifold in latent space.
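The rank projection described above, rebuilding the latent code matrix from its top singular vectors, can be sketched with a truncated SVD. The function name `project_rank` and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def project_rank(Z, r):
    """Best rank-r approximation of Z (k x T) via truncated SVD.

    Returns the projected matrix, the r orthonormal basis vectors (columns),
    and the r x T weights of each basis vector for every frame.
    """
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    basis = U[:, :r]
    coeffs = s[:r, None] * Vt[:r, :]
    return basis @ coeffs, basis, coeffs

rng = np.random.default_rng(0)
k, T, r = 8, 20, 3
Z = rng.normal(size=(k, T))              # one latent code per column
Z_r, basis, coeffs = project_rank(Z, r)

# Storage for the latent codes: k*T numbers at full rank versus
# k*r (basis) plus r*T (weights) after the projection.
full_storage = k * T
low_rank_storage = k * r + r * T
```

Applying this projection after each gradient update keeps the latent codes on a rank-r set, in the spirit of a projected gradient method.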
In this section, we describe our experimental setup.
Choice of generator: We follow the well-known DCGAN framework [23] for our generators, except that we do not use any batch-normalization layer because the gradient through a batch-normalization layer depends on the batch size and the distribution of the batch. As shown in Figure 1, in the DCGAN generator framework, we project the latent code, $z$, to a larger vector using a fully connected network and then reshape it so that it can serve as the input to the following deconvolutional layers. Instead of using pooling layers, the authors of the DCGAN framework [23] propose strided convolutions. All the intermediate deconvolution layers are followed by ReLU activations. The last deconvolution layer is followed by a Tanh activation function to generate the reconstructed image.

Initial generator training: We train our generators by jointly optimizing the generator parameters, $\gamma$, and the latent codes, $z$, using SGD optimization by following the procedure in [3]. In each iteration, we first update the generator parameters and then update the latent codes using SGD. We use a squared-loss function to train the generators. We keep the minibatch size fixed at 256. We use two different trained generators for our experiments: one for RGB images and another for grayscale images. The RGB image generator is trained on resized images from the CIFAR10 training dataset. We choose CIFAR10 because it has 10 different categories of images, which helps increase the range of the generator. The grayscale image generator is trained on resized images from the MNIST digit training dataset. We used the SGD optimizer for optimizing both the latent codes and the network weights. The network weights and the latent codes are updated with separate learning rates, one of which is set to 1.

Measurement matrix: We used three different measurement matrices in our experiments. We first experiment with original images (i.e., $A_t$ is an identity matrix) to test which sequences can be generated by latent code optimization and which ones require joint optimization of latent codes and network weights. Then we experiment with compressive measurements, for which we choose the entries of the $A_t$ independently from a Gaussian distribution. For a video sequence of $T$ frames, we generate $T$ independent measurement matrices. Then we experiment with missing pixels (also known as the image/video inpainting problem) to show that our algorithm works on other inverse problems as well. For experiments with missing pixels, we randomly dropped a fraction of the pixels from each frame.

Datasets: We test our hypothesis on five datasets, which include both synthetic and real video sequences. The first test set consists of 10 MNIST test digits. We rotate each digit by a fixed angle per frame for a total of 32 frames. The second test set includes 10 Moving MNIST test sequences [31]. Each test sequence has 20 frames. For the third test set, we generate a color wheel with 12 colors by dividing a circle into 12 equal slices. We rotate the color wheel by a fixed angle per frame for 64 frames. Finally, we experiment on different real video sequences from the publicly available KTH human action video dataset [26] and the UCF101 dataset [30]. We show the results on a person walking video from the KTH dataset in this paper because of its simplicity. We cropped the video in the temporal dimension to select 80 frames, which show only unidirectional movement. We also show results for an archery video sequence from the UCF101 dataset.
Performance metric: We measure the performance of our recovery algorithms in terms of the reconstruction PSNR. For a given image $x$ and its reconstruction $\hat{x}$, PSNR is defined as
$$\mathrm{PSNR}(x, \hat{x}) = 20 \log_{10} \left( \frac{\max - \min}{\sqrt{\mathrm{MSE}}} \right),$$
where max and min correspond to the maximal and minimal values the image can attain, respectively, and MSE is the mean squared error between $x$ and $\hat{x}$.
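A PSNR computation based on the attainable value range and the mean squared error can be sketched as follows. The default range of 2.0 assumes images scaled to [-1, 1], which matches a Tanh output layer; that default and the function name are assumptions made for illustration.

```python
import numpy as np

def psnr(x, x_hat, value_range=2.0):
    """PSNR in dB: 20 * log10((max - min) / sqrt(MSE)).

    value_range is max - min of the attainable pixel values
    (2.0 assumes images in [-1, 1]).
    """
    diff = np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")
    return float(20.0 * np.log10(value_range / np.sqrt(mse)))

x = np.zeros(16)
x_hat = np.full(16, 0.2)        # constant error of 0.2 at every pixel
value = psnr(x, x_hat)
```

In this example the root mean squared error is 0.2 and the range is 2, so the ratio is 10 and the PSNR is 20 dB.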
In our first set of experiments, we simply generate a given video sequence using our network by optimizing only over the latent codes and by optimizing jointly over the latent codes and network parameters. In other words, $A_t$ is an identity matrix in these experiments. A summary of our experimental results is presented in Table 1, which corresponds to the case when the original video sequence is used to estimate the latent codes that provide the best approximation of the sequence. We observe from Table 1 that adding similarity and low-rank constraints provides a small improvement in the image approximation performance. This might be because the frames are already slowly changing and we have enough measurements to approximate them. However, jointly optimizing both latent codes and network parameters provides a significant gain in the reconstruction PSNR.
Table 1: Reconstruction PSNR (dB) when approximating the original video sequences.
Dataset  Latent code optimization  Joint optimization
No constraint  Low rank  Similarity  No constraint  Low rank  Similarity
Rotating MNIST  25.82  25.73  26.81  33.75  33.78  33.9
Moving MNIST  18.55  16.99  18.51  31.17  31.16  31.15
Color Wheel  18.24  17.97  18.31  22.07  21.92  22.05
Archery  24.15  23.13  24.49  26.5  23.15  27.26
Person Walking  27.55  23.30  27.55  27.9  26.72  27.91
In our first experiment, we test latent code optimization with and without similarity and low-rank constraints. We show some example reconstructions for the inpainting problem with 90% missing pixels in Figure 4. The same similarity weight is chosen for both cases. For the low-rank constraints, the best-performing rank is selected separately for Rotating MNIST and Person Walking. We observe that, with very few measurements, the low-rank generator not only represents the video sequence with a smaller number of parameters in the latent codes (a fraction of the total number of frames for both Rotating MNIST and Person Walking) but also gives a boost in reconstruction performance.
We also performed a number of experiments for latent code optimization (with and without similarity and low-rank constraints) for different datasets and measurements. A summary of our experimental results is presented in Table 2. The results refer to experiments in which we estimate latent codes from the compressive measurements of the sequence. We observe that adding similarity or low-rank constraints in the compressive sensing problems shows significant improvement in the quality of reconstruction.
Table 2: Reconstruction PSNR (dB) from compressive Gaussian measurements and from frames with 80% missing pixels.
Dataset  Latent code optimization  Joint optimization
No constraint  Low rank  Similarity  No constraint  Low rank  Similarity
Experiments with compressive Gaussian measurements
Rotating MNIST  20.35  20.75 (r=5)  22.13  30.9  31 (r=5)  32.97
Moving MNIST  16.75  16.9 (r=12)  17.57  24.43  27.03 (r=4)  27.2
Color Wheel  16.95  17.96 (r=6)  17.09  21.92  23.71 (r=6)  21.8
Archery  21.58  23.54 (r=16)  23.15  25.82  26.9 (r=21)  25.83
Experiments with 80% missing pixels
Rotating MNIST  19.15  25.07 (r=4)  24.45  26.54  29.58 (r=3)  28.53
Moving MNIST  16.44  16.82 (r=9)  17.34  18.65  19.02 (r=9)  19.55
Color Wheel  16.54  17.85 (r=6)  16.75  18.46  19.96 (r=4)  18.88
Archery  23.15  23.8 (r=22)  23.32  23.6  23.81 (r=21)  23.57
Person Walking  25.34  26.1 (r=21)  25.9  25.8  26.17 (r=22)  25.96
As we discussed in connection with Figure 3, the joint optimization over $z_t$ and $\gamma$ can generate images that are very different from the images the network was trained on. Table 1 refers to similar experiments in which we are given the original video sequence and we want to estimate the latent codes and network weights that best approximate the given video sequence. We observe that joint optimization offers a significant performance boost compared to latent code optimization alone. As we discussed before, this is expected because we have many more degrees of freedom in the case of joint optimization than in the case of latent code optimization. The similarity or low-rank constraints do not provide a significant boost while approximating the video sequence.
Table 2 summarizes results for compressive measurements, where we are only given linear measurements of the video sequence and we want to estimate the latent codes and network weights that minimize the objectives in (5) or (7). We performed experiments on image inpainting and compressive sensing problems. For the image inpainting problem, we show reconstruction results for 80% missing pixels in Table 2. We also show results for different compressive measurements for different synthetic and real video sequences. We can observe from Table 2 that with low-rank constraints on the generator, we can not only represent the whole video sequence with very few latent codes but also get better reconstruction than in the full-rank cases. Similarity constraints on the latent codes also show improvement in reconstruction performance when the number of measurements is low.

Some examples of video sequences reconstructed from compressive measurements are presented in Figure 5. In each of the experiments, we compute Gaussian measurements of each frame in a sequence and then solve the optimization problems in (5) (this corresponds to the full-rank recovery) and (7) with $r = 4$ (this corresponds to the rank-4 recovery). We observe that low-rank constraints provide a small improvement in terms of the quality of reconstruction.
In this section, we present our preliminary experiments on linearizing the articulation manifold of a video sequence by imposing low-rank structure on the latent codes. In our experiment, we force our latent codes to lie on a straight line by expressing each $z_t$ in terms of two fixed vectors with scalar coefficients. We impose this rank-2 constraint by solving the problem in (7), but instead of approximating the $z_t$ using the top two singular vectors, we approximate them using their mean and first principal vector.
We further investigate the linearization of multiple video sequences while optimizing the same generator weights to generate those sequences. In this experiment, we form the matrix $Z$ by concatenating latent codes for multiple different sequences. Then we apply a rank-2 constraint on the entire matrix using the top two singular vectors. We simultaneously apply a linearity constraint on each sequence by imposing a rank-2 constraint on the latent codes for each sequence separately, using the mean and first principal vector as mentioned above.
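The mean-plus-first-principal-vector approximation used for linearization can be sketched as a projection of the latent codes onto a line. The function name `project_to_line` and the toy sizes below are assumptions made for illustration.

```python
import numpy as np

def project_to_line(Z):
    """Project the columns of Z (k x T) onto the line through their mean
    along the first principal direction."""
    mean = Z.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Z - mean, full_matrices=False)
    direction = U[:, :1]                      # first principal vector (k x 1)
    coords = direction.T @ (Z - mean)         # 1 x T scalar position on the line
    return mean + direction @ coords, mean, direction, coords

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 20))                  # 20 latent codes of size 8
Z_line, mean, direction, coords = project_to_line(Z)
```

After the projection, every latent code is described by one scalar in `coords`, so the ordering of the frames along the line can be read directly from those scalars.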
We plot the embedding of each $z_t$ in terms of two orthogonal basis vectors in Figures 5(a) and 5(b). We observe that a well-defined rotation in the image domain is translated into a line in the latent space. We also observe that as we increase the rotation angle, the corresponding embedding moves along a straight line in the direction of the first principal vector in increasing order. We plot the embedding of three sequences of Rotating MNIST in Figure 6(a). We observe that the rotations of different digits are translated into different lines in the 2D latent space. Furthermore, the latent codes for each of the sequences preserve their sequential order. However, in the case of Moving MNIST, even though we get perfect reconstruction with the line embedding, the order of the video sequence is not preserved in the embedded space. We did not impose any constraint in our optimization to preserve the order, but we expect that if the video sequence changes in such a manner that frames that are farther apart in time are also farther apart in content, then the order will be preserved. We leave this investigation for future work.
If the latent codes follow some sequential order, it is possible to generate intermediate images between frames. We test this idea using three Rotating MNIST sequences. Each sequence originally contained 20 frames, where, in each frame, the digit is rotated by a fixed angle from the previous frame. However, we set aside a contiguous block of frames and optimize the generator to approximate only the remaining frames. We perform joint optimization of $z_t$ and $\gamma$ using a rank-2 constraint on the latent codes and a linearization constraint on the latent codes of each sequence. When we examine the latent code representation for the approximated images, we observe that the latent codes follow sequential order but there is a significant gap between the latent codes of the frames on either side of the withheld block. We can observe this phenomenon in Figure 7(a). We then try to generate 1000 frames between frame 1 and frame 20 using linear interpolation between the corresponding latent codes. We keep the same network weights that give us the approximation of the original sequences. We can observe from Figure 7(b) that we can generate the missing frames in this way. Note that we can choose frames 1 and 20 as the two end points for linear interpolation because the entire sequence maintains its sequential order in the linear latent space representation. In cases where the sequence only maintains sequential order locally, we can select interpolation end points from a cluster of frames that maintains sequential order.
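Linear interpolation between two latent codes can be sketched as below. The function name `interpolate_latents` and the latent size are assumptions; in the setting described above, each interpolated code would then be passed through the optimized generator to produce a frame.

```python
import numpy as np

def interpolate_latents(z_start, z_end, num):
    """Return num latent codes evenly spaced on the segment from z_start to z_end."""
    w = np.linspace(0.0, 1.0, num)[:, None]       # interpolation weights in [0, 1]
    return (1.0 - w) * z_start + w * z_end        # shape (num, k)

rng = np.random.default_rng(0)
z_first = rng.normal(size=32)     # latent code of the first observed frame
z_last = rng.normal(size=32)      # latent code of the last observed frame
codes = interpolate_latents(z_first, z_last, num=5)
```

This only yields meaningful intermediate frames when the latent codes of the sequence preserve their order along the line between the two end points.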
We further experiment on a complex real-life motion using the spinning figures dataset from [18]. We selected a rotating bunny sequence and cropped only the bunny from the images. The bunny completes one rotation in 15 frames. We selected the first 10 frames from each of the 4 full rotations and kept similar rotations close to each other. We investigate whether this sequence maintains its sequential order in any latent space. We observe the representation of the sequence in latent space using a rank-3 constraint. We impose the constraint by selecting the mean and the first two principal vectors, so the latent codes are constrained to a 2D plane in the 3D space. We show the approximation of the bunny sequence using this constraint in Figure 8(b) and the corresponding latent space representation in Figure 8(a). We can observe from the latent space representation that the sequence maintains its sequential order in this representation.
We proposed a generative model for low-rank representation and reconstruction of video sequences. We presented experiments to demonstrate that video sequences can be reconstructed from compressive measurements by either optimizing over the latent codes or jointly optimizing over the latent codes and network weights. We observed that adding similarity and low-rank constraints in the optimization regularizes the recovery problems and improves the quality of reconstruction. We presented some preliminary experiments to show that low-rank embedding of latent codes with joint optimization can potentially be useful in linearizing articulation manifolds of the video sequence. An implementation of our algorithm with pretrained models is available here: https://github.com/CSIPlab/gmlr.
In all our experiments, we observed that joint optimization performs remarkably well for compressive measurements as well. Even though the number of measurements is extremely small compared to the number of parameters in $\gamma$, the solution almost always converges to a good sequence. We attribute this success to a good initialization of the network weights and hypothesize that a “good set of weights” is available near the initial set of weights in all these experiments. We intend to investigate a proof of the presence of good local minima around initialization in our future work.
Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
IEEE Transactions on Neural Networks and Learning Systems, pages 1–8, 2018.
Deep convolutional neural network for decompressed video enhancement. In Data Compression Conference (DCC), 2016, pages 617–617. IEEE, 2016.

We experiment on different video sequences from all six categories (Boxing, Handclapping, Handwaving, Jogging, Running, Walking) of the KTH video dataset. To reduce computational complexity, we selected part of each video as a batch. Table 3 includes the number of frames for our test videos. We experiment on image inpainting with 80% missing pixels. We experiment with both latent code optimization and joint optimization of latent codes and network weights. In Table 3, we report experimental results, and in Figure 10, we show some representative examples. Joint optimization of $z_t$ and $\gamma$ significantly outperforms latent code optimization because the video sequences are not from the range of the pretrained generator. Furthermore, when we apply the rank-2 linearization constraint on the latent codes, we observe performance similar to full-rank reconstruction for joint optimization.
Table 3: Reconstruction PSNR (dB) for KTH video sequences with 80% missing pixels.
Video  Frames  Latent code optimization  Joint optimization
Full rank  Rank=2 (linearized)  Full rank  Rank=2 (linearized)
Boxing  50  22.45  22.62  32.37  30.38
Handclapping  50  26.03  26.2  35.74  33.98
Handwaving  50  22.29  20.65  30.01  27.48
Jogging  30  23.82  18.58  26.4  24.01
Running  30  25.74  20.66  27.54  27.56
Walking  55  23.72  18.44  27.53  27.33
We further experiment with untrained generators, as in [33, 34]. We observe that if we initialize the network weights with pretrained network weights, the network converges faster even for images that were not used in training but fall under a similar distribution. In Figure 11, we show the reconstruction loss versus the number of iterations for a Rotating MNIST video and a Handclapping video. We show these results for the inpainting problem with 80% missing pixels. We can observe that for the Rotating MNIST video, random initialization shows false convergence before finally converging. For some datasets, such as Moving MNIST, it becomes difficult to converge when untrained network weights are used as initialization. So, we use the weights of a pretrained network as initialization.
We use two different generator networks for RGB image generation and grayscale image generation. In both generators, we use $4 \times 4$ filters in the deconvolutional layers. For the RGB image generator, a 256-dimensional latent code is projected and reshaped into $4 \times 4 \times 512$, whereas for the grayscale image generator, a 32-dimensional latent code is projected and reshaped into $4 \times 4 \times 256$. The number of kernels for each deconvolutional layer of the RGB image generator is 256, 128, 64, and 3, respectively. For the grayscale image generator, the number of kernels for each deconvolutional layer is 128, 64, 32, and 1, respectively. The number of parameters for each generator is shown in Table 4.
Table 4: Number of parameters in each layer of the two generators.
Layers  Number of parameters
RGB image generator  Grayscale image generator
Fully connected + reshape + ReLU  2,097,152  131,072
Deconv 1 + ReLU  2,097,152  524,288
Deconv 2 + ReLU  524,288  131,072
Deconv 3 + ReLU  131,072  32,768
Deconv 4 + Tanh  3,072  512
Total parameters  4,852,736  819,712
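The parameter counts in Table 4 can be checked with a short calculation. The sketch below assumes bias-free layers, 4 x 4 kernels, and a fully connected output that is reshaped to 4 x 4 x 512 (RGB) or 4 x 4 x 256 (grayscale); these assumptions are inferred from the counts themselves, and the function name is hypothetical.

```python
def generator_param_counts(latent_dim, base_shape, channels, kernel=4):
    """Per-layer weight counts for a bias-free fully connected layer followed
    by transposed-convolution layers with square kernels."""
    height, width, c_in = base_shape
    counts = [latent_dim * height * width * c_in]        # fully connected layer
    for c_out in channels:
        counts.append(kernel * kernel * c_in * c_out)    # one deconvolution layer
        c_in = c_out
    return counts

rgb_counts = generator_param_counts(256, (4, 4, 512), [256, 128, 64, 3])
gray_counts = generator_param_counts(32, (4, 4, 256), [128, 64, 32, 1])
rgb_total = sum(rgb_counts)
gray_total = sum(gray_counts)
```

Under these assumptions the per-layer counts and totals agree with the table: 4,852,736 weights for the RGB generator and 819,712 for the grayscale generator.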