Multi-View Frame Reconstruction with Conditional GAN

by Tahmida Mahmud, et al.

Multi-view frame reconstruction is an important problem, particularly when multiple frames are missing and the past and future frames within the camera are far apart from the missing ones. Realistic, coherent frames can still be reconstructed using corresponding frames from other overlapping cameras. We propose an adversarial approach to learn the spatio-temporal representation of the missing frame using a conditional Generative Adversarial Network (cGAN). The conditional input to each cGAN is the preceding or following frame within the camera or the corresponding frame in another overlapping camera, and the resulting representations are merged together using a weighted average. Representations learned from frames within the camera are given more weight than the ones learned from other cameras when they are close to the missing frames, and vice versa. Experiments on two challenging datasets demonstrate that our framework produces results comparable with the state-of-the-art reconstruction method in a single camera and achieves promising performance in a multi-camera scenario.




1 Introduction

Looking at a video sequence with one or more missing frames, how do we infer what happened in the missing portion? We have never seen the missing frame. Instead, we have knowledge of the spatio-temporal context of the video with which to reason about a potential unknown scenario. This spatio-temporal context, drawn from the adjacent frames within the camera and the corresponding frames from other overlapping cameras, is key to solving an important problem in automated video analysis: frame reconstruction, which is the task of reconstructing one or more missing frames in videos. Frame reconstruction is critical in applications like retrieving missing frames in surveillance videos, anomaly detection, data compression, video editing, video post-processing, animation, spoofing, and so on. Although there have been works on frame reconstruction in a single-camera setting [24, 3, 10], to the best of our knowledge, ours is the first work to solve it in a multi-camera scenario.

Figure 1: An example case from the Office Lobby Dataset [5] in which a frame is missing from the target camera. We want to generate the missing frame using four available frames: the corresponding frames from the two reference cameras, and the preceding and following frames from the target camera. Here, the gap between the missing frame and the available frames in the target camera can be any arbitrary number.

Overview of Our Approach. We present an adversarial approach to learn a joint spatio-temporal representation of the missing frame in a multi-camera scenario. First, we learn the possible representations of the missing frame conditioned on the preceding and following frames within the camera as well as on the corresponding frames in other overlapping cameras, using a conditional Generative Adversarial Network (cGAN) [15] similar to the one used in [9]. Then all of these representations are merged together using a weighted average, where the weights are chosen as follows: representations learned from frames within the camera are given more weight when they are close to the missing frame, and representations learned from frames in other overlapping cameras are given more weight when the available intra-camera frames are far apart. An overview of our proposed framework is illustrated in Fig. 1. The main contributions of our work are:

  1. We tackle a novel problem of frame reconstruction in a multi-camera scenario.

  2. We perform extensive experiments on a challenging multi-camera video dataset to show the effectiveness of our method.

  3. We perform extensive experiments on a single-camera video dataset to provide quantitative comparison of our proposed method with others in the literature.

2 Related Works

Our work is related to video inpainting, frame interpolation, video prediction, frame reconstruction, and generative adversarial networks. There are important differences between frame reconstruction and the problems of video inpainting or frame interpolation. Some spatial information is available in inpainting since the missing portions are assumed to be localized to small spatio-temporal regions. Interpolation cannot reconstruct multiple missing frames as it requires adjacent frames (within a maximum time separation, as noted in [24]) as inputs. In video prediction, the goal is to predict the most probable future frames from a sequence of past observations.

For video inpainting, there are patch-based approaches [16], probabilistic model-based approaches [4], and methods handling background and foreground separately [18, 8]. For frame interpolation, there are approaches using a dense optical flow field [2], a phase-based method [14], deep learning approaches [17, 12, 28], and works on long-term interpolation [3, 10]. For video prediction, there are sequence-to-sequence learning-based approaches [20, 23], a predictive coding network [13], a convolutional LSTM [26], and a deep regression network [27]. The recent state-of-the-art work on frame reconstruction within a single camera [24] uses an LSTM-based interpolation network. However, to the best of our knowledge, there is no work performing frame reconstruction in a multi-camera scenario. This is important when adjacent available frames within the camera are far apart and frames from other corresponding overlapping views can be useful. Recently, Generative Adversarial Networks [6] have become popular for solving challenging computer vision problems like text-to-image synthesis [21], frame interpolation [25], and so on. The work in [9] has shown outstanding performance in conditional transfer of pixel-level knowledge. In this work, we seek to leverage GANs for the multi-camera reconstruction problem.

Figure 2: Proposed architecture for the generator (top) and the discriminator (bottom) [9]. The pixel values in the output show how realistic that section of the unknown image is.

3 Methodology

We refer to the camera with the missing frames as the target camera and to the other cameras as the reference cameras.

3.1 Network Architecture

Similar to a general GAN, a conditional GAN has a generator and a discriminator. Both our generator and our discriminator have the same architectures as those used in [19]. We use the conditional GAN to learn a mapping between inter-camera or intra-camera frames. These frames share an underlying structure, i.e., they share some common low-level information which we want to transfer across the network. Previous image translation approaches used an encoder-decoder network [7], where the input is downsampled through a number of layers until a bottleneck layer is reached and then upsampled through a reverse process [9]. We use a "U-Net"-based architecture for the generator, adding skip connections that directly connect encoder layers to the corresponding decoder layers, so that low-level information can bypass the bottleneck. The L1 loss efficiently captures the low-frequency components of images, but using only the L1 loss in the objective function for image mapping generates blurry results. We therefore use a combination of the L1 loss and an adversarial loss in the objective function, and we aim to use a discriminator that is efficient in modeling the high-frequency components of images. We use the PatchGAN [9] to focus on the structure in local image patches. The discriminator tries to differentiate between the generated and the actual missing frames at patch level and runs convolutionally across the image to generate an averaged output. In this way, the image is modeled as a Markov random field, assuming that pixels separated by more than one patch diameter are independent. The high-level network architectures for the generator and discriminator are shown in Fig. 2.
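To make the patch-level idea concrete, the following is a toy illustration only, not the learned PatchGAN of [9]: it scores non-overlapping patches with a stand-in function and averages the scores into one output. The names `patch_scores` and `toy_realness`, the patch size, and the contrast-based scorer are all hypothetical; a real PatchGAN uses learned convolutional layers with overlapping receptive fields.

```python
import numpy as np

def patch_scores(img, patch, score_fn):
    """Score each non-overlapping patch of a 2-D image; returns a score map."""
    rows, cols = img.shape[0] // patch, img.shape[1] // patch
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            block = img[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            out[i, j] = score_fn(block)
    return out

def toy_realness(block):
    """Hypothetical stand-in scorer in [0, 1): more local contrast, higher score."""
    return 1.0 - np.exp(-block.std())

# A blurry (flat) frame versus a frame with local high-frequency detail.
flat = np.zeros((8, 8))
detailed = np.indices((8, 8)).sum(axis=0) % 2 * 1.0  # checkerboard

flat_output = patch_scores(flat, 4, toy_realness).mean()          # averaged output
detailed_output = patch_scores(detailed, 4, toy_realness).mean()  # averaged output
```

The point of the sketch is only the structure: decisions are made per patch, so the score depends on local detail, and the final output is the average over the patch-level score map.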

3.2 Model Training and Inference

In conditional GANs, a mapping is learned from an observed image $x$ and a random noise vector $z$ to an output image $y$, where the generator $G$ learns to generate outputs that the discriminator $D$ cannot distinguish from real images [9]. The discriminator $D$ learns to efficiently detect the fake outputs generated by $G$. The objective function of the conditional GAN is as follows:

$$G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G) \tag{3.2}$$

where

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big], \qquad \mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z)\rVert_{1}\big].$$

Here, $\mathcal{L}_{L1}$ is the loss to reduce blurring. Let us assume that there are multiple overlapping cameras available in a multi-camera scenario and that a frame is missing in the target camera. First, we generate two representations of the missing frame, one from a past frame and one from a future frame within the camera, using two separate conditional GANs. In our case, the gap between the missing frame and these past and future frames can be any arbitrary number based on availability. We also generate a representation of the missing frame from the corresponding frame in each of the reference cameras. In short, the network learns a mapping from the observed frames (the past and future frames in the target camera and the corresponding frames in the reference cameras) to the missing frame. In accordance with (3.2), each observed frame plays the role of $x$ and the missing frame is $y$. A training instance is shown in Fig. 3.
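As a rough numerical sketch of how an adversarial term and an L1 term combine in a conditional GAN objective of the kind used in [9], the snippet below evaluates both terms on toy arrays. It assumes the discriminator outputs are already available as probabilities; the function name `cgan_terms` and the weight `lam` are hypothetical, and the value of `lam` is a placeholder rather than a setting taken from this paper.

```python
import numpy as np

def cgan_terms(d_real, d_fake, target, generated, lam=1.0, eps=1e-12):
    """Return (adversarial term, L1 term, combined value) for one batch.

    d_real:    discriminator probabilities for (observed, real missing frame) pairs
    d_fake:    discriminator probabilities for (observed, generated frame) pairs
    target:    ground-truth missing frames
    generated: generator outputs
    lam:       weight of the L1 term (placeholder value, an assumption)
    """
    adversarial = np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))
    l1 = np.mean(np.abs(target - generated))
    return adversarial, l1, adversarial + lam * l1

# Toy batch: a fairly confident discriminator and a generator that is off by 0.5.
d_real = np.full(4, 0.9)
d_fake = np.full(4, 0.1)
target = np.zeros((4, 8, 8))
generated = np.full((4, 8, 8), 0.5)

adv, l1, total = cgan_terms(d_real, d_fake, target, generated)
```

The discriminator pushes the adversarial term up while the generator pushes the combined value down; the L1 term pulls the generated frame toward the ground truth even when the discriminator is already fooled.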

Figure 3: A training instance for the conditional GAN where the discriminator learns to classify between fake and real frames and the generator learns to fool the discriminator.

The generated frame tries to resemble the missing frame in terms of the L1 loss while also fooling the discriminator. Following [6], we alternate between a gradient descent step on the discriminator and one on the generator. Also, in accordance with [6], the generator is trained to maximize the log-probability that the discriminator assigns to the generated frame. We divide the objective function in (3.2) by a constant factor while optimizing the discriminator to slow down its learning rate relative to the generator. To optimize the network, we use minibatch stochastic gradient descent with an adaptive sub-gradient method (Adam) [11] and a fixed learning rate.

During testing, we merge all the generated frames using a weighted average. The weights are chosen by maximizing the average PSNR on a smaller validation set. The more adjacent the available frames are in the target camera, the more weight is given to the representations learned from them than those from the reference cameras. Please note that, since the cameras are partially overlapped, we incorporate the multi-view representation only when there is a person/object present in the overlapping zone.
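The merging step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text only says that weights are chosen by maximizing the average PSNR on a validation set, so the coarse grid search, the step size, and the names `psnr`, `merge`, and `pick_weights` are all hypothetical.

```python
import itertools
import numpy as np

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two frames."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def merge(frames, weights):
    """Weighted average of candidate frames; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()
    stack = np.stack([f.astype(np.float64) for f in frames])
    return np.tensordot(w, stack, axes=1)

def pick_weights(val_set, n_sources, step=0.25):
    """Grid-search weights (summing to 1) that maximize mean PSNR.

    val_set: list of (candidate_frames, ground_truth) pairs.
    The grid search itself is an assumption made for this sketch.
    """
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best_w, best_score = None, -np.inf
    for w in itertools.product(grid, repeat=n_sources):
        if abs(sum(w) - 1.0) > 1e-9:
            continue
        score = np.mean([psnr(gt, merge(cands, w)) for cands, gt in val_set])
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```

In practice one would run `pick_weights` separately for each gap size, which is one way to obtain the behavior described above: intra-camera representations dominate for small gaps and reference-camera representations gain weight as the gap grows.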

Figure 4: Two examples from the Office Lobby Dataset where Input 1, Input 2, Input 3, and Input 4 are the preceding and following frames of the target camera and the corresponding frames of the two reference cameras, respectively. As we increase the gap between the missing frame and the preceding and following frames, the frames of the two reference cameras become more important. For example, due to the large number of missing frames in the large-gap example, the woman in the red dress is not yet visible in Input 1 and her position is far away in Input 2. Still, a person wearing a red dress is visible in the correct position in the generated frame, which incorporates information from the other two cameras.

4 Experiments

4.1 Dataset and Preprocessing

Office Lobby Dataset. The Office Lobby Dataset is a multi-view summarization dataset whose video clips are captured by multiple cameras [5]. The cameras are not completely overlapping, and the videos have different brightness levels across views. There are approximate temporal offsets between the cameras. To approximately synchronize the inter-camera frames, these offset values were taken into account while extracting and aligning the frames from different cameras.

KTH Human Action Dataset. The KTH Human Action Dataset consists of six types of human activities (boxing, handclapping, handwaving, jogging, running, and walking). These actions are performed by multiple subjects in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors with lighting variation [22].

4.2 Results

Objective. The main objective of these experiments is to evaluate the quality of the reconstructed frames in a multi-camera scenario. We show how the overlapping cameras become more and more important as the distance between the intra-camera frames and the missing frame increases.

Performance Measure. The evaluation metrics we use are PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index). SSIM estimates how structurally close the reconstructed frame is to the original one. For both of these metrics, a higher value indicates better performance. There is no existing work on multi-view frame reconstruction to compare our method with. To show the effectiveness of our method in a single-camera scenario, we compare with a state-of-the-art reconstruction method [24].
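For reference, the standard definitions of the two metrics are given below. The notation is conventional and not taken from this paper: $I$ is the original frame, $\hat{I}$ the reconstructed frame, $N$ the number of pixels, $\mathrm{MAX}$ the peak pixel value (for example 255 for 8-bit frames), $\mu$ and $\sigma^2$ are local means and variances, $\sigma_{I\hat{I}}$ is the local covariance, and $c_1$, $c_2$ are small stabilizing constants.

```latex
\mathrm{MSE} = \frac{1}{N}\sum_{p=1}^{N}\bigl(I(p) - \hat{I}(p)\bigr)^{2},
\qquad
\mathrm{PSNR} = 10\,\log_{10}\!\left(\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}\right)\ \text{dB}

\mathrm{SSIM}(I, \hat{I}) =
\frac{\bigl(2\mu_{I}\mu_{\hat{I}} + c_{1}\bigr)\bigl(2\sigma_{I\hat{I}} + c_{2}\bigr)}
     {\bigl(\mu_{I}^{2} + \mu_{\hat{I}}^{2} + c_{1}\bigr)\bigl(\sigma_{I}^{2} + \sigma_{\hat{I}}^{2} + c_{2}\bigr)}
```

PSNR grows without bound as the mean squared error approaches zero, while SSIM is at most 1 and reaches 1 only for identical frames.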


Experimental Setup. We use the standard split for training and testing and use TensorFlow [1] to train our network on an NVIDIA Tesla K80 GPU.

Quantitative Evaluation. Table 1 shows our reconstruction results on the Office Lobby Dataset as the distance between the missing frame and the available intra-camera past and future frames increases (i.e., as multiple frames are missing). We consider different gap lengths while testing, and the test instances are selected in a sliding-window manner. Comparisons of our reconstruction results on the KTH Human Action Dataset are shown in Table 2. We achieve PSNR and SSIM comparable with those reported in [24].
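One plausible reading of the sliding-window selection is sketched below. It assumes that a gap of g means the nearest available intra-camera frames lie g frames before and g frames after the missing frame; this interpretation, the function name `sliding_triples`, and the `stride` parameter are assumptions for illustration, not details confirmed by the text.

```python
def sliding_triples(n_frames, gap, stride=1):
    """Enumerate (past, missing, future) frame indices for a given gap.

    Assumes the available frames sit exactly `gap` frames before and after
    the missing frame, and slides the window across the sequence.
    """
    return [(t - gap, t, t + gap) for t in range(gap, n_frames - gap, stride)]

# Test instances for a 10-frame clip with a gap of 3.
triples = sliding_triples(10, 3)
```

Larger gaps leave fewer valid windows per clip, so the number of test instances shrinks as the gap grows.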

  Gap         1      3      5      7      15     30
  PSNR (dB)   32.06  29.28  28.10  27.19  25.56  25.17
  SSIM        0.95   0.92   0.91   0.90   0.88   0.87
Table 1: Multi-view Reconstruction Performance for Office Lobby Dataset.

  Method                   PSNR (dB)  SSIM
  Proposed Method          35.03      0.93
  LSTM-Based Method [24]   35.40      0.96
Table 2: Single-view Reconstruction Performance Comparisons for KTH Human Action Dataset.

Qualitative Evaluation. Some example reconstructed frames with the conditional input frames and the ground truth missing frames are shown in Fig. 4.

Ablation Study. As an ablation study, Table 3 compares the PSNR achieved using only the intra-camera view of the target camera with the PSNR achieved using multi-view reconstruction on the Office Lobby Dataset. The results justify the integration of multiple views, especially when the gap between the missing frame and the available intra-camera frames is large.

  Gap                1      3      5      7      15     30
  Single PSNR (dB)   32.06  29.24  28.02  27.02  24.17  23.97
  Multi PSNR (dB)    32.06  29.28  28.10  27.19  25.56  25.17
Table 3: Ablation Study for Frame Reconstruction in Office Lobby Dataset considering Single-View vs. Multi-View.
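The per-gap benefit of the multi-view input can be read directly off Table 3. The short calculation below uses only the values printed in the table; the variable names are illustrative.

```python
gaps = [1, 3, 5, 7, 15, 30]
single = [32.06, 29.24, 28.02, 27.02, 24.17, 23.97]  # single-view PSNR (dB)
multi = [32.06, 29.28, 28.10, 27.19, 25.56, 25.17]   # multi-view PSNR (dB)

# PSNR gain of multi-view over single-view reconstruction, per gap.
gains = [round(m - s, 2) for m, s in zip(multi, single)]

for gap, gain in zip(gaps, gains):
    print(f"gap {gap:>2}: +{gain:.2f} dB")
```

The gain is zero at a gap of 1, stays below 0.2 dB up to a gap of 7, and exceeds 1 dB at gaps of 15 and 30, which matches the claim that the reference cameras matter most when the intra-camera frames are far from the missing frame.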

5 Conclusions

In this work, we proposed an adversarial learning framework for frame reconstruction in a multi-camera scenario when one or more frames are missing. We learned representations of the missing frame conditioned on the past and future frames within that camera as well as the corresponding frames in other overlapping cameras using a conditional GAN, and merged them together using a weighted average.

6 Acknowledgements

This work was partially supported by NSF grant 1544969 from the Cyber-Physical Systems program.