Looking at a video sequence with one or more missing frames, how do we infer what happened in the missing portion? We have never seen the missing frame. Instead, we rely on our knowledge of the spatio-temporal context of the video to reason about the unknown content. This spatio-temporal context, drawn from the adjacent frames within the camera and the corresponding frames from other overlapping cameras, is key to solving an important problem in automated video analysis: frame reconstruction, the task of reconstructing one or more missing frames in videos. Frame reconstruction is critical in applications such as retrieving missing frames in surveillance videos, anomaly detection, data compression, video editing, video post-processing, animation, and spoofing. Although there have been works on frame reconstruction in a single-camera setting [24, 3, 10], to the best of our knowledge, ours is the first work to solve it in a multi-camera scenario.
Overview of Our Approach. We present an adversarial approach to learn a joint spatio-temporal representation of the missing frame in a multi-camera scenario. First, we learn the possible representations of the missing frame conditioned on the preceding and following frames within the camera, as well as on the corresponding frames in other overlapping cameras, using a conditional Generative Adversarial Network (cGAN) [15] similar to the one used in [9]. These representations are then merged using a weighted average, with the weights chosen as follows: representations learned from frames within the camera are given more weight when those frames are close to the missing frame, and representations learned from frames in other overlapping cameras are given more weight when the available intra-camera frames are far apart. An overview of our proposed framework is illustrated in Fig. 1. The main contributions of our work are:
- We tackle the novel problem of frame reconstruction in a multi-camera scenario.
- We perform extensive experiments on a challenging multi-camera video dataset to show the effectiveness of our method.
- We perform extensive experiments on a single-camera video dataset to provide a quantitative comparison of our proposed method with others in the literature.
2 Related Works
Our work is related to video inpainting, frame interpolation, video prediction, frame reconstruction, and generative adversarial networks. There are important differences between frame reconstruction and the problems of video inpainting and frame interpolation. In inpainting, some spatial information is available, since the missing portions are assumed to be localized to small spatio-temporal regions. Interpolation cannot reconstruct multiple missing frames, as it requires adjacent frames that are only a short time apart as inputs. In video prediction, the goal is to predict the most probable future frames from a sequence of past observations.
There are patch-based approaches, probabilistic model-based approaches, and methods handling background and foreground separately [18, 8] for video inpainting. For frame interpolation, there are approaches using a dense optical flow field, a phase-based method [14], deep learning approaches [17, 12, 28], and works on long-term interpolation [3, 10]. For video prediction, there are sequence-to-sequence learning-based approaches [20, 23], a predictive coding network [13], a convolutional LSTM, and a deep regression network. The recent state-of-the-art work on frame reconstruction within a single camera uses an LSTM-based interpolation network. However, to the best of our knowledge, there is no work performing frame reconstruction in a multi-camera scenario. This is important when the adjacent available frames within the camera are far apart and frames from other overlapping views can be useful. Recently, Generative Adversarial Networks [6] have become popular for solving challenging computer vision problems such as text-to-image synthesis [21] and frame interpolation [25]. The conditional GAN of [9] has shown outstanding performance in conditional transfer of pixel-level knowledge. In this work, we seek to leverage GANs for the multi-camera reconstruction problem.
We refer to the camera with the missing frames as the target camera and to the other cameras as the reference cameras.
3.1 Network Architecture
Like a general GAN, a conditional GAN has a generator and a discriminator. Our generator and discriminator have the same architectures as those used in [9]. We use the conditional GAN to learn a mapping between inter-camera or intra-camera frames. These frames share an underlying structure, i.e., they share some common low-level information that we want to transfer across the network. Previous image translation approaches used an encoder-decoder network [7], in which the input is downsampled through a number of layers until a bottleneck layer is reached and is then upsampled by the reverse process. We use a “U-Net”-based generator architecture, whose skip connections directly connect encoder layers to decoder layers and thereby overcome the bottleneck problem. The L1 loss efficiently captures the low-frequency components of images, but using only the L1 loss in the objective function for image mapping generates blurry results. We therefore use a combination of L1 loss and adversarial loss in the objective function, and aim for a discriminator that is efficient at modeling the high-frequency components of images. We use the PatchGAN [9] to focus on the structure of local image patches. The discriminator tries to differentiate between the generated and the actual missing frames at patch level and runs convolutionally across the image to generate an averaged output. In this way, the image is modeled as a Markov random field, assuming that pixels separated by more than one patch diameter are independent. The high-level network architectures of the generator and discriminator are shown in Fig. 2.
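The "patch diameter" above corresponds to the receptive field of a single discriminator output unit. The following is a minimal sketch of how that size can be computed for a stack of convolutions. The layer configuration is an assumption (the 70×70 PatchGAN layout described by Isola et al., with 4×4 kernels, three stride-2 layers, and two stride-1 layers), since the text above does not list the layers, and the function name is hypothetical.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit of a conv stack.

    `layers` is a list of (kernel_size, stride) pairs ordered from input to output.
    """
    size = 1
    # Walk backwards from the output: a field of `size` units at a layer's
    # output spans (size - 1) * stride + kernel units at that layer's input.
    for kernel, stride in reversed(layers):
        size = (size - 1) * stride + kernel
    return size


# Assumed PatchGAN configuration (not stated in the text above).
PATCHGAN_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

print(receptive_field(PATCHGAN_LAYERS))  # 70
```

Under this assumed configuration each discriminator output judges a 70×70 pixel patch, and the per-patch decisions are averaged over the image.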
3.2 Model Training and Inference
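In conditional GANs, a mapping is learned from an observed image $x$ and a random noise vector $z$ to an output image $y$, i.e., $G: \{x, z\} \rightarrow y$, where the generator $G$ learns to generate outputs that are close to real images and that the discriminator $D$ cannot distinguish from them. The discriminator $D$ learns to detect the fake outputs generated by $G$. The objective function of the conditional GAN is as follows:

$$G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G), \tag{3.2}$$

where $\lambda$ weights the two terms and

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(x, z))\right)\right], \qquad \mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\left[\lVert y - G(x, z) \rVert_{1}\right].$$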
Here, $\mathcal{L}_{L1}$ is the loss to reduce blurring. Let us assume that there are $N$ overlapping cameras available in a multi-camera scenario. The frame $f_{t}^{c}$, at time $t$ in the target camera $c$, is missing. First, we generate two representations of the missing frame from a past and a future frame within the camera using two separate conditional GANs: we generate $\hat{f}_{t}^{c,p}$ using the past frame $f_{t-n}^{c}$ and $\hat{f}_{t}^{c,q}$ using the future frame $f_{t+n}^{c}$. In our case, $n$ can be any arbitrary number based on availability. We generate $N-1$ different representations of the missing frame from the corresponding frames in the other reference cameras, i.e., we generate $\hat{f}_{t}^{c,k}$ from $f_{t}^{k}$, where $k \in \{1, \dots, N\}$ and $k \neq c$. In short, the network learns a mapping from the observed frames ($f_{t-n}^{c}$, $f_{t+n}^{c}$, and $f_{t}^{k}$) to the missing frame $f_{t}^{c}$. In accordance with (3.2), $f_{t-n}^{c}$, $f_{t+n}^{c}$, and $f_{t}^{k}$ are $x$ and $f_{t}^{c}$ is $y$. A training instance is shown in Fig. 3.
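The pairing of observed frames with the missing frame can be sketched as follows. This is only an illustration under stated assumptions: the function name `conditioning_pairs`, the dictionary layout of `videos`, and the `gap` parameter are hypothetical, and the frames are assumed to be already time-aligned across cameras.

```python
def conditioning_pairs(videos, target_cam, t, gap):
    """Return (source_label, x, y) training pairs for the missing frame t.

    `videos` maps a camera id to a list of frames (assumed time-aligned).
    `gap` is the distance from t to the nearest available past/future frame
    in the target camera.
    """
    target = videos[target_cam]
    y = target[t]  # ground-truth frame, available only at training time
    pairs = []
    if t - gap >= 0:
        pairs.append(("past", target[t - gap], y))      # intra-camera, past
    if t + gap < len(target):
        pairs.append(("future", target[t + gap], y))    # intra-camera, future
    for cam, frames in videos.items():
        if cam != target_cam and t < len(frames):
            pairs.append((f"cam{cam}", frames[t], y))   # inter-camera
    return pairs


# Toy example with strings standing in for image arrays.
videos = {
    1: ["a0", "a1", "a2", "a3", "a4"],
    2: ["b0", "b1", "b2", "b3", "b4"],
    3: ["c0", "c1", "c2", "c3", "c4"],
}
for label, x, y in conditioning_pairs(videos, target_cam=1, t=2, gap=1):
    print(label, x, "->", y)
```

Each pair would feed a separate conditional GAN, one per conditioning source, as described above.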
The generated frame tries to resemble the missing frame in terms of the $L_1$ loss while also fooling the discriminator. Following the standard GAN training procedure, we alternate between a gradient descent step on $D$ and one on $G$. The generator is trained to maximize $\log D(x, G(x, z))$. While optimizing $D$, we divide the objective function in (3.2) by a constant factor to slow down its learning rate relative to $G$. To optimize the network, we use minibatch stochastic gradient descent with an adaptive sub-gradient method (Adam) [11].
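The two losses used in the alternating updates can be sketched in plain Python. This is a minimal illustration, not the training code of this work: the function names are hypothetical, and the default values `scale=0.5` and `lam=100.0` are illustrative assumptions rather than values confirmed by the text. Discriminator outputs are treated as per-patch probabilities and frames as flat lists of pixel values.

```python
import math

EPS = 1e-12  # keeps log() finite when a probability is exactly 0 or 1


def l1_loss(generated, target):
    """Mean absolute pixel difference between two flattened frames."""
    return sum(abs(g - t) for g, t in zip(generated, target)) / len(target)


def discriminator_loss(d_real, d_fake, scale=0.5):
    """Cross-entropy on real and generated patches, scaled down to slow D.

    `scale` is an assumed constant, not a value taken from the text.
    """
    real_term = -sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return scale * (real_term + fake_term)


def generator_loss(d_fake, generated, target, lam=100.0):
    """Adversarial term that rewards a large log D(x, G(x, z)), plus weighted L1.

    `lam` is an assumed weighting, not a value taken from the text.
    """
    adversarial = -sum(math.log(p + EPS) for p in d_fake) / len(d_fake)
    return adversarial + lam * l1_loss(generated, target)
```

Minimizing `generator_loss` is equivalent to maximizing $\log D(x, G(x, z))$ while keeping the generated frame close to the target in the $L_1$ sense.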
During testing, we merge all the generated frames using a weighted average. The weights are chosen by maximizing the average PSNR on a small validation set. The closer the available frames in the target camera are to the missing frame, the more weight is given to the representations learned from them relative to those learned from the reference cameras. Note that, since the cameras only partially overlap, we incorporate the multi-view representations only when a person or object is present in the overlapping zone.
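The merging step can be illustrated with a small sketch. It makes simplifying assumptions that are not stated in the text: a single scalar weight `w_intra` balances the averaged intra-camera candidates against the averaged inter-camera candidates, the weight is found by a coarse grid search, and frames are flat lists of pixel values. All function names are hypothetical.

```python
import math


def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio (dB) between two flattened frames."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def merge(intra, inter, w_intra):
    """Blend the mean intra-camera and mean inter-camera candidates pixel-wise."""
    def mean(frames):
        return [sum(px) / len(frames) for px in zip(*frames)]
    a, b = mean(intra), mean(inter)
    return [w_intra * x + (1.0 - w_intra) * y for x, y in zip(a, b)]


def pick_weight(validation, steps=10):
    """Grid-search the intra-camera weight that maximizes mean validation PSNR.

    `validation` is a list of (intra_candidates, inter_candidates, truth).
    """
    best_w, best_score = 0.0, -math.inf
    for i in range(steps + 1):
        w = i / steps
        score = sum(psnr(merge(intra, inter, w), truth)
                    for intra, inter, truth in validation) / len(validation)
        if score > best_score:
            best_w, best_score = w, score
    return best_w
```

In line with the description above, such a search could be repeated for each gap length, so that the selected intra-camera weight shrinks as the available frames move further from the missing frame.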
4.1 Dataset and Preprocessing
Office Lobby Dataset. The Office Lobby Dataset is a multi-view summarization dataset whose video clips are captured by multiple cameras [5]. The cameras do not completely overlap, and the videos have different brightness levels across views. There is an approximate temporal offset between each pair of cameras. To approximately synchronize the inter-camera frames, these offset values were taken into account while extracting and aligning the frames from the different cameras.
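The offset-based alignment can be sketched as an index mapping between cameras. This is an illustration only: the function name, the frame rate, and the offset value in the example are hypothetical (the text above does not give them), and both cameras are assumed to record at the same frame rate.

```python
def aligned_index(target_index, fps, offset_seconds, num_reference_frames):
    """Map a target-camera frame index to the matching reference-camera index.

    `offset_seconds` is how much later the reference camera started recording
    than the target camera (negative if it started earlier). Returns None when
    the reference camera has no frame for that instant.
    """
    index = target_index - round(offset_seconds * fps)
    return index if 0 <= index < num_reference_frames else None


# Hypothetical numbers: 25 fps, reference camera started 2 s after the target.
print(aligned_index(100, fps=25, offset_seconds=2.0, num_reference_frames=1000))
```

Frames for which the mapping returns `None` have no counterpart in the reference camera and would be skipped when building inter-camera pairs.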
KTH Human Action Dataset. The KTH Human Action Dataset [22] consists of six types of human activities (boxing, handclapping, handwaving, jogging, running, and walking). These actions are performed by multiple subjects in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors with lighting variation.
Objective. The main objective of these experiments is to evaluate the quality of the reconstructed frames in a multi-camera scenario. We show how the overlapping cameras become increasingly important as the distance between the intra-camera frames and the missing frame grows.
The evaluation metrics we use are PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index). SSIM estimates how structurally close the reconstructed frame is to the original one. For both metrics, a higher value indicates better performance. There is no existing work on multi-view frame reconstruction to compare our method with. To show the effectiveness of our method in a single-camera scenario, we compare with a state-of-the-art reconstruction method.
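For reference, both metrics can be sketched in a few lines. This is a simplified illustration and not the evaluation code used here: frames are flat lists of 8-bit pixel values, and `global_ssim` (a hypothetical name) computes SSIM over the whole frame as a single window with the commonly used constants $C_1 = (0.01L)^2$ and $C_2 = (0.03L)^2$, whereas standard SSIM averages the same expression over local windows.

```python
import math


def psnr(reference, reconstructed, peak=255.0):
    """PSNR in dB: 10 * log10(peak^2 / MSE); infinite for identical frames."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def global_ssim(reference, reconstructed, peak=255.0):
    """Single-window SSIM computed from global means, variances and covariance."""
    n = len(reference)
    mu_r = sum(reference) / n
    mu_x = sum(reconstructed) / n
    var_r = sum((r - mu_r) ** 2 for r in reference) / n
    var_x = sum((x - mu_x) ** 2 for x in reconstructed) / n
    cov = sum((r - mu_r) * (x - mu_x)
              for r, x in zip(reference, reconstructed)) / n
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    return ((2 * mu_r * mu_x + c1) * (2 * cov + c2)) / (
        (mu_r ** 2 + mu_x ** 2 + c1) * (var_r + var_x + c2))


frame = [0, 50, 100, 150]
noisy = [2, 48, 103, 149]
print(psnr(frame, noisy), global_ssim(frame, noisy))
```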
Experimental Setup. We use the standard split for training and testing and use TensorFlow [1] to train our network on an NVIDIA Tesla K80 GPU.
Quantitative Evaluation. Table 1 shows our reconstruction results on the Office Lobby Dataset as the distance between the missing frame and the available intra-camera past and future frames increases (i.e., as multiple frames are missing). During testing, we consider different lengths (gaps) of missing frames, which are selected in a sliding-window manner. Table 2 compares our reconstruction results on the KTH Human Action Dataset with the LSTM-based method; we achieve PSNR and SSIM comparable to those reported for it.
| Method | PSNR | SSIM |
|---|---|---|
| LSTM-Based Method | 35.40 | 0.96 |
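The sliding-window selection of test gaps mentioned above can be sketched as a generator of index triples. The function name and the exact definition of `gap` (the number of consecutive missing frames between the nearest available past and future frames) are assumptions made for illustration.

```python
def sliding_gap_windows(num_frames, gap):
    """Yield (past, missing, future) index triples for runs of `gap` missing frames.

    `past` and `future` are the nearest available frames on either side of the
    missing run; the window slides one frame at a time across the clip.
    """
    for past in range(num_frames - gap - 1):
        future = past + gap + 1
        yield past, list(range(past + 1, future)), future


# A 6-frame clip with runs of 2 missing frames.
for window in sliding_gap_windows(6, 2):
    print(window)
```

Increasing `gap` moves the available intra-camera frames further from the missing ones, which is the regime in which the reference cameras are expected to matter most.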
Qualitative Evaluation. Some example reconstructed frames with the conditional input frames and the ground truth missing frames are shown in Fig. 4.
Ablation Study. Table 3 compares the PSNR achieved using only the intra-camera view with that achieved using multi-view reconstruction on the Office Lobby Dataset. This ablation justifies integrating multiple views, especially when the gap between the missing frame and the available intra-camera frames is large.
In this work, we proposed an adversarial learning framework for frame reconstruction in a multi-camera scenario when one or more frames are missing. We learned representations of the missing frame conditioned on the past and future frames within that camera, as well as on the corresponding frames in other overlapping cameras, using a conditional GAN, and merged them using a weighted average.
This work was partially supported by NSF grant 1544969 from the Cyber-Physical Systems program.
- [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
- [2] K. Chen and D. A. Lorenz. Image sequence interpolation using optimal control. Journal of Mathematical Imaging and Vision, 41(3):222–238, 2011.
- [3] X. Chen, W. Wang, and J. Wang. Long-term video interpolation with bidirectional predictive network. In IEEE Visual Communications and Image Processing (VCIP), pages 1–4, 2017.
- [4] M. Ebdelli, O. Le Meur, and C. Guillemot. Video inpainting with short-term windows: application to object removal and error concealment. IEEE Transactions on Image Processing, 24(10):3034–3047, 2015.
- [5] Y. Fu, Y. Guo, Y. Zhu, F. Liu, C. Song, and Z.-H. Zhou. Multi-view video summarization. IEEE Transactions on Multimedia, 12(7):717–729, 2010.
- [6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- [7] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
- [8] K.-L. Hung and S.-C. Lai. Exemplar-based video inpainting approach using temporal relationship of consecutive frames. In IEEE Int. Conf. on Awareness Science and Technology (iCAST), pages 373–378. IEEE, 2017.
- [9] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
- [10] H. Jiang, D. Sun, V. Jampani, M.-H. Yang, E. Learned-Miller, and J. Kautz. Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
- [11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [12] Z. Liu, R. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow. In Int. Conf. on Computer Vision (ICCV), volume 2, 2017.
- [13] W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016.
- [14] S. Meyer, O. Wang, H. Zimmer, M. Grosse, and A. Sorkine-Hornung. Phase-based frame interpolation for video. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1410–1418, 2015.
- [15] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
- [16] A. Newson, A. Almansa, M. Fradet, Y. Gousseau, and P. Pérez. Video inpainting of complex scenes. SIAM Journal on Imaging Sciences, 7(4):1993–2019, 2014.
- [17] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive separable convolution. In Int. Conf. on Computer Vision (ICCV), pages 261–270, 2017.
- [18] K. A. Patwardhan, G. Sapiro, and M. Bertalmío. Video inpainting under constrained camera motion. IEEE Transactions on Image Processing, 16(2):545–553, 2007.
- [19] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- [20] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
- [21] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
- [22] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In IEEE Int. Conf. on Pattern Recognition, volume 3, pages 32–36, 2004.
- [23] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In Int. Conf. on Machine Learning, pages 843–852, 2015.
- [24] X. Sun, R. Szeto, and J. J. Corso. A temporally-aware interpolation network for video frame inpainting. arXiv preprint arXiv:1803.07218, 2018.
- [25] J. van Amersfoort, W. Shi, A. Acosta, F. Massa, J. Totz, Z. Wang, and J. Caballero. Frame interpolation with multi-scale deep loss functions and generative adversarial networks. arXiv preprint arXiv:1711.06045, 2017.
- [26] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017.
- [27] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition, pages 98–106, 2016.
- [28] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In European Conf. on Computer Vision, pages 286–301. Springer, 2016.