NeRV: Neural Representations for Videos

10/26/2021
by   Hao Chen, et al.
University of Maryland
Facebook

We propose a novel neural representation for videos (NeRV) which encodes videos in neural networks. Unlike conventional representations that treat videos as frame sequences, we represent videos as neural networks taking the frame index as input. Given a frame index, NeRV outputs the corresponding RGB image. Video encoding in NeRV is simply fitting a neural network to video frames, and the decoding process is a simple feedforward operation. As an image-wise implicit representation, NeRV outputs the whole image and shows great efficiency compared to pixel-wise implicit representations, improving the encoding speed by 25x to 70x and the decoding speed by 38x to 132x, while achieving better video quality. With such a representation, we can treat videos as neural networks, simplifying several video-related tasks. For example, conventional video compression methods are restricted by a long and complex pipeline, specifically designed for the task. In contrast, with NeRV, we can use any neural network compression method as a proxy for video compression, and achieve comparable performance to traditional frame-based video compression approaches (H.264, HEVC). Besides compression, we demonstrate the generalization of NeRV for video denoising. The source code and pre-trained model can be found at https://github.com/haochen-rye/NeRV.git.


1 Introduction

What is a video? Typically, a video captures a dynamic visual scene using a sequence of frames. A schematic interpretation of this is a curve in 2D space, where each point can be characterized with an $(x, y)$ pair representing the spatial state. If we have a model for all the $(x, y)$ pairs, then, given any $x$, we can easily find the corresponding $y$ state. Similarly, we can interpret a video as a recording of the visual world, where we can find a corresponding RGB state for every single timestamp. This leads to our main claim: can we represent a video as a function of time?

More formally, can we represent a video $V$ as $V = \{v_t\}_{t=1}^{T}$ with $v_t = f_\theta(t)$, i.e., a frame at timestamp $t$ is represented as a function $f$ parameterized by $\theta$. Given their remarkable representational capacity [21], we choose deep neural networks as the function $f_\theta$ in our work. Given these intuitions, we propose NeRV, a novel representation that represents videos as implicit functions and encodes them into neural networks. Specifically, with a fairly simple deep neural network design, NeRV can reconstruct the corresponding video frames with high quality, given the frame index. Once the video is encoded into a neural network, this network can be used as a proxy for the video, where we can directly extract all video information from the representation. Therefore, unlike traditional video representations which treat videos as sequences of frames, shown in Figure 1 (a), our proposed NeRV considers a video as a unified neural network with all information embedded within its architecture and parameters, shown in Figure 1 (b).

Figure 1: (a) Conventional video representation as frame sequences. (b) NeRV, representing a video as a neural network, which consists of multiple convolutional layers, taking the normalized frame index as input and outputting the corresponding RGB frame.
                    Explicit (frame-based)                               Implicit (unified)
                    Hand-crafted (e.g., HEVC [47])   Learning-based (e.g., DVC [31])   Pixel-wise (e.g., NeRF [33])   Image-wise (Ours)
Encoding speed      Fast                             Medium                            Very slow                      Slow
Decoding speed      Medium                           Slow                              Very slow                      Fast
Compression ratio   Medium                           High                              Low                            Medium
Table 1: Comparison of different video representations. Although explicit representations currently outperform implicit ones in encoding speed and compression ratio, NeRV shows a great advantage in decoding speed, and it outperforms pixel-wise implicit representations in all metrics.

As an image-wise implicit representation, NeRV shares many similarities with pixel-wise implicit visual representations [44, 48], which take spatio-temporal coordinates as inputs. The main differences between our work and pixel-wise implicit representations are the output space and the architecture design. Pixel-wise representations output the RGB value for each pixel, while NeRV outputs a whole image, as demonstrated in Figure 2. Given a video of size $T \times H \times W$, pixel-wise representations need to sample the video $T \times H \times W$ times, while NeRV only needs to sample it $T$ times. Considering the huge number of pixels, especially for high-resolution videos, NeRV shows a great advantage in both encoding time and decoding speed. The different output spaces also lead to different architecture designs: NeRV utilizes an MLP + ConvNets architecture to output an image, while pixel-wise representations use a simple MLP to output the RGB value of a pixel. The sampling efficiency of NeRV also simplifies the optimization problem, which leads to better reconstruction quality compared to pixel-wise representations.

We also demonstrate the flexibility of NeRV by exploring several applications it affords. Most notably, we examine the suitability of NeRV for video compression. Traditional video compression frameworks are quite involved: they specify key frames and inter frames, estimate the residual information, split video frames into blocks, apply a discrete cosine transform on the resulting image blocks, and so on. Such a long pipeline makes the decoding process very complex as well. In contrast, given a neural network that encodes a video in NeRV, we can simply cast the video compression task as a model compression problem, and trivially leverage any well-established or cutting-edge model compression algorithm to achieve good compression ratios. Specifically, we explore a three-step model compression pipeline: model pruning, model quantization, and weight encoding, and show the contribution of each step to the compression task. We conduct extensive experiments on popular video compression datasets, such as UVG [32], and show the applicability of model compression techniques on NeRV for video compression. We briefly compare different video representations in Table 1, where NeRV shows a great advantage in decoding speed.

Besides video compression, we also explore other applications of the NeRV representation for the video denoising task. Since NeRV is a learnt implicit function, we can demonstrate its robustness to noise and perturbations. Given a noisy video as input, NeRV generates a high-quality denoised output, without any additional operation, and even outperforms conventional denoising methods.

The contribution of this paper can be summarized into four parts:


  • We propose NeRV, a novel image-wise implicit representation for videos, which represents a video as a neural network, converting video encoding into model fitting and video decoding into a simple feedforward operation.

  • Compared to pixel-wise implicit representations, NeRV outputs the whole image and shows great efficiency, improving the encoding speed by 25x to 70x and the decoding speed by 38x to 132x, while achieving better video quality.

  • NeRV allows us to convert the video compression problem into a model compression problem, so that we can leverage standard model compression tools and reach comparable performance to conventional video compression methods, e.g., H.264 [58] and HEVC [47].

  • As a general representation for videos, NeRV also shows promising results on other tasks, e.g., video denoising. Without any special denoising design, NeRV outperforms traditional hand-crafted denoising algorithms (median filter, etc.) and ConvNets-based denoising methods.

2 Related Work

Implicit Neural Representation. Implicit neural representation is a novel way to parameterize a variety of signals. The key idea is to represent an object as a function approximated via a neural network, which maps a coordinate to its corresponding value (e.g., a pixel coordinate for an image to the RGB value of that pixel). It has been widely applied in many 3D vision tasks, such as 3D shapes [16, 15], 3D scenes [45, 25, 37, 6], and the appearance of 3D structure [33, 34, 35]. Compared to explicit 3D representations, such as voxels, point clouds, and meshes, continuous implicit neural representations can compactly encode high-resolution signals in a memory-efficient way. Most recently, [13] demonstrated the feasibility of using implicit neural representations for image compression tasks. Although it is not yet competitive with state-of-the-art compression methods, it shows promising and attractive properties. In previous methods, MLPs are often used to approximate the implicit neural representations, taking a spatial or spatio-temporal coordinate as input and outputting the signal at that single point (e.g., an RGB value or volume density). In contrast, our NeRV representation trains a purposefully designed neural network composed of MLPs and convolutional layers, which takes the frame index as input and directly outputs all the RGB values of that frame.

Video Compression. As a fundamental task of computer vision and image processing, visual data compression has been studied for several decades. Before the resurgence of deep networks, handcrafted image compression techniques, like JPEG [53] and JPEG2000 [46], were widely used. Building upon them, many traditional video compression algorithms, such as MPEG [28], H.264 [58], and HEVC [47], have achieved great success. These methods are generally based on transform coding, like the Discrete Cosine Transform (DCT) [2] or the wavelet transform [3], and are well-engineered and tuned to be fast and efficient. More recently, deep learning-based visual compression approaches have been gaining popularity. For video compression, the most common practice is to utilize neural networks for certain components while keeping the traditional video compression pipeline. For example, [8] proposed an effective image compression approach and generalized it to video compression by adding interpolation loop modules. Similarly, [59] converted the video compression problem into an image interpolation problem and proposed an interpolation network, resulting in competitive compression quality. Furthermore, [1] generalized optical flow to scale-space flow to better model uncertainty in compression. Later, [60] employed a temporal hierarchical structure and trained neural networks for most components, including key frame compression, motion estimation, motion compression, and residual compression. However, all of these works still follow the overall pipeline of traditional compression, arguably limiting their capabilities.

Model Compression. The goal of model compression is to simplify an original model by reducing the number of parameters while maintaining its accuracy. Current research on model compression can be divided into four groups: parameter pruning and quantization [51, 17, 18, 57, 23, 27]; low-rank factorization [40, 10, 24]; transferred and compact convolutional filters [9, 62, 42, 11]; and knowledge distillation [4, 20, 7, 38]. Our proposed NeRV enables us to reformulate the video compression problem as model compression, and to utilize standard model compression techniques. Specifically, we use model pruning and quantization to reduce the model size without significantly deteriorating performance.

3 Neural Representations for Videos

We first present the NeRV representation in Section 3.1, including the input embedding, the network architecture, and the loss objective. Then, we present model compression techniques on NeRV in Section 3.2 for video compression.

3.1 NeRV Architecture

Figure 2: (a) Pixel-wise implicit representation, which takes pixel coordinates as input and uses a simple MLP to output the pixel RGB value. (b) NeRV: image-wise implicit representation, which takes the frame index as input and uses an MLP + ConvNets architecture to output the whole image. (c) NeRV block architecture, which upscales the feature map by a given factor.

In NeRV, each video is represented by a function $f_\theta: \mathbb{R} \rightarrow \mathbb{R}^{H \times W \times 3}$, where the input is a frame index $t$ and the output is the corresponding RGB image $v_t \in \mathbb{R}^{H \times W \times 3}$. The encoding function is parameterized with a deep neural network $\theta$, $v_t = f_\theta(t)$. Therefore, video encoding is done by fitting a neural network $f_\theta$ to a given video, such that it can map each input timestamp to the corresponding RGB frame.

Input Embedding. Although deep neural networks can be used as universal function approximators [21], directly training the network on the raw input timestamp $t$ leads to poor results, which is also observed by [39, 33]. By mapping the inputs to a higher-dimensional embedding space, the neural network can better fit data with high-frequency variations. Specifically, in NeRV, we use Positional Encoding [33, 52, 48] as our embedding function

$\Gamma(t) = \left(\sin(b^{0}\pi t), \cos(b^{0}\pi t), \ldots, \sin(b^{l-1}\pi t), \cos(b^{l-1}\pi t)\right)$   (1)

where $b$ and $l$ are hyper-parameters of the network. Given an input timestamp $t$, normalized between $(0, 1]$, the output of the embedding function $\Gamma(t)$ is then fed to the following neural network.
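For concreteness, the embedding of Equation 1 can be written in a few lines of PyTorch. This is a minimal sketch; the values of b and l below are illustrative placeholders, not the paper's tuned defaults.

import math
import torch

def positional_encoding(t: torch.Tensor, b: float = 1.25, l: int = 80) -> torch.Tensor:
    """Map normalized frame indices t in (0, 1] to a 2*l-dimensional embedding (Eq. 1)."""
    freqs = b ** torch.arange(l, dtype=torch.float32)                 # b^0, b^1, ..., b^(l-1)
    angles = math.pi * t.reshape(-1, 1) * freqs                       # shape: (batch, l)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # shape: (batch, 2*l)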

Network Architecture. The NeRV architecture is illustrated in Figure 2 (b). NeRV takes the time embedding as input and outputs the corresponding RGB frame. Leveraging MLPs to directly output all pixel values of a frame would lead to a huge number of parameters, especially when the image resolution is large. Therefore, we stack multiple NeRV blocks after the MLP layers so that pixels at different locations can share convolutional kernels, leading to an efficient and effective network. Inspired by super-resolution networks, we design the NeRV block, illustrated in Figure 2 (c), adopting the PixelShuffle technique [43] as the upscaling method. Convolution and activation layers are also inserted to enhance expressibility. The detailed architecture can be found in the supplementary material.
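As an illustration of this design, a NeRV block can be sketched in PyTorch as a convolution whose output is rearranged by PixelShuffle and passed through an activation. The channel widths, kernel size, and GELU activation below are illustrative choices, not the exact configuration from the supplementary material.

import torch.nn as nn

class NeRVBlock(nn.Module):
    """Minimal NeRV-style block: Conv -> PixelShuffle upscaling -> activation."""
    def __init__(self, in_channels: int, out_channels: int, scale: int):
        super().__init__()
        # The conv produces scale^2 * out_channels maps, which PixelShuffle rearranges
        # into out_channels maps upscaled by `scale` along each spatial dimension.
        self.conv = nn.Conv2d(in_channels, out_channels * scale ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.shuffle(self.conv(x)))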

Loss Objective. For NeRV, we adopt a combination of L1 and SSIM loss as our loss function for network optimization, which calculates the loss over all pixel locations of the predicted image and the ground-truth image as follows:

$L = \frac{1}{T}\sum_{t=1}^{T} \alpha \lVert f_\theta(t) - v_t \rVert_1 + (1-\alpha)\left(1 - \mathrm{SSIM}(f_\theta(t), v_t)\right)$   (2)

where $T$ is the frame number, $f_\theta(t)$ the NeRV prediction, $v_t$ the ground-truth frame, and $\alpha$ a hyper-parameter balancing the weight of each loss component.
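A minimal sketch of this objective for a batch of predicted and ground-truth frames is given below; the ssim_fn argument and the value of alpha are assumptions (any differentiable SSIM implementation, e.g. the ssim function from the pytorch_msssim package, can be plugged in).

import torch
import torch.nn.functional as F

def nerv_loss(pred: torch.Tensor, target: torch.Tensor, ssim_fn, alpha: float = 0.7) -> torch.Tensor:
    """Weighted L1 + (1 - SSIM) loss of Eq. 2, averaged over the batch of frames."""
    l1 = F.l1_loss(pred, target)                    # mean absolute error over all pixels
    ssim_term = 1.0 - ssim_fn(pred, target)         # ssim_fn should return a value in [0, 1]
    return alpha * l1 + (1.0 - alpha) * ssim_term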

3.2 Model Compression

Figure 3: NeRV-based video compression pipeline.

In this section, we briefly revisit the model compression techniques used for video compression with NeRV. Our model compression consists of four standard sequential steps: video overfitting, model pruning, weight quantization, and weight encoding, as shown in Figure 3.

Model Pruning. Given a neural network fit on a video, we first use global unstructured pruning to reduce the model size. Based on the magnitude of the weight values, we set weights below a threshold to zero,

$\theta_i = \begin{cases} \theta_i, & \text{if } |\theta_i| \ge \theta_q \\ 0, & \text{otherwise} \end{cases}$   (3)

where $\theta_q$ is the $q$-th percentile value over all parameters in $\theta$. As is standard practice, we fine-tune the model after the pruning operation to regain its representation quality.
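Global magnitude-based pruning of this kind maps directly onto PyTorch's pruning utilities. The sketch below assumes the convolutional and linear layers hold the weights to prune, and the 40% sparsity is only an example value.

import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_nerv(model: nn.Module, sparsity: float = 0.4) -> nn.Module:
    """Globally zero out the smallest-magnitude weights across the whole model (Eq. 3)."""
    params_to_prune = [(m, "weight") for m in model.modules()
                       if isinstance(m, (nn.Conv2d, nn.Linear))]
    prune.global_unstructured(params_to_prune,
                              pruning_method=prune.L1Unstructured,
                              amount=sparsity)
    # Bake the pruning masks into the weights before fine-tuning and quantization.
    for module, name in params_to_prune:
        prune.remove(module, name)
    return model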

Model Quantization. After model pruning, we apply model quantization to all network parameters. Note that, different from many recent works [23, 5, 14, 55] that utilize quantization during training, NeRV is only quantized post-hoc (after the training process). Given a parameter tensor $\mu$,

$\mu_i = \mathrm{round}\!\left(\frac{\mu_i - \mu_{\min}}{\mathrm{scale}}\right), \quad \mathrm{scale} = \frac{\mu_{\max} - \mu_{\min}}{2^{\mathrm{bit}}}$   (4)

where 'round' maps a value to the closest integer, 'bit' is the bit length of the quantized model, $\mu_{\max}$ and $\mu_{\min}$ are the max and min values of the parameter tensor $\mu$, and 'scale' is the scaling factor. Through Equation 4, each parameter can be mapped to a 'bit'-length value. The overhead of storing 'scale' and $\mu_{\min}$ can be ignored given the large number of parameters in $\mu$, since only two extra scalars are needed per tensor.
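The per-tensor quantization of Equation 4 can be sketched as follows; the 8-bit setting is illustrative, and a matching dequantization step recovers an approximate float tensor for inference.

import torch

def quantize_tensor(mu: torch.Tensor, bit: int = 8):
    """Post-hoc linear quantization of a parameter tensor (Eq. 4)."""
    mu_min, mu_max = mu.min(), mu.max()
    scale = ((mu_max - mu_min) / (2 ** bit)).clamp(min=1e-12)   # scaling factor from Eq. 4
    q = torch.round((mu - mu_min) / scale)                      # integer code per parameter
    return q, scale, mu_min                                     # only two extra scalars per tensor

def dequantize_tensor(q: torch.Tensor, scale: torch.Tensor, mu_min: torch.Tensor) -> torch.Tensor:
    """Map integer codes back to approximate parameter values."""
    return q * scale + mu_min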

Entropy Encoding. Finally, we use entropy encoding to further compress the model size. By taking advantage of the frequency distribution of symbols, entropy encoding can represent the data with a more efficient code. Specifically, we employ Huffman Coding [22] after model quantization. Since Huffman Coding is lossless, a decent compression is guaranteed without any impact on reconstruction quality. Empirically, this further reduces the model size by around 10%.
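The idea can be sketched with a small Huffman routine over the quantized integer codes; this is an illustrative implementation of the lossless step, not necessarily the exact encoder used in the pipeline. The total compressed size is then roughly the sum of each symbol's frequency times its code length.

import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a sequence of quantized weight codes."""
    freq = Counter(symbols)
    if len(freq) == 1:                                   # degenerate case: one distinct symbol
        return {next(iter(freq)): 1}
    # Heap entries: (subtree frequency, tie-breaker, {symbol: depth within subtree}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**left, **right}.items()}   # every symbol one level deeper
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# Estimated payload in bits: sum(freq[s] * length for s, length in huffman_code_lengths(codes).items())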

4 Experiments

4.1 Datasets and Implementation Details

We perform experiments on the "Big Buck Bunny" sequence from scikit-video to compare our NeRV with pixel-wise implicit representations. To compare with state-of-the-art methods on the video compression task, we run experiments on the widely used UVG dataset [32], consisting of 7 videos with 3,900 frames in total.

In our experiments, we train the network using the Adam optimizer [26] with a learning rate of 5e-4. For the ablation study on UVG, we use a cosine annealing learning rate schedule [30], a batch size of 1, 150 training epochs, and 30 warmup epochs unless otherwise denoted. When comparing with state-of-the-art methods, we run the model for 1500 epochs with a batch size of 6. For experiments on "Big Buck Bunny", we train NeRV for 1200 epochs unless otherwise denoted. For the fine-tuning process after pruning, we use 50 epochs for both UVG and "Big Buck Bunny".
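A sketch of this optimization setup, assuming a simple per-epoch linear warmup followed by cosine annealing (the exact schedule implementation may differ), looks as follows.

import math
import torch

def make_optimizer_and_scheduler(model, lr=5e-4, total_epochs=150, warmup_epochs=30):
    """Adam with linear warmup then cosine annealing, stepped once per epoch."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                                   # linear warmup phase
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))           # cosine decay towards 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler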

For the NeRV architecture, there are 5 NeRV blocks, with up-scale factors 5, 3, 2, 2, 2 for 1080p videos and 5, 2, 2, 2, 2 for 720p videos. By changing the hidden dimension of the MLP and the channel dimension of the NeRV blocks, we can build NeRV models of different sizes. For the input embedding in Equation 1, we use fixed values of $b$ and $l$ as our default setting, and for the loss objective in Equation 2, $\alpha$ is set to a fixed value. We evaluate video quality with two metrics: PSNR and MS-SSIM [56]. Bits-per-pixel (BPP) is adopted to indicate the compression ratio. We implement our model in PyTorch [36] and train it in full precision (FP32). All experiments are run on an NVIDIA RTX 2080 Ti GPU. Please refer to the supplementary material for more experimental details, results, and visualizations (e.g., MCL-JCV [54] results).

4.2 Main Results

We compare NeRV with pixel-wise implicit representations on the "Big Buck Bunny" video. We take SIREN [44] and NeRF [33] as baselines, where SIREN [44] takes the original pixel coordinates as input and uses periodic (sine) activations, while NeRF [33] adds one positional embedding layer to encode the pixel coordinates and uses ReLU activations. Both baselines use a 3-layer perceptron, and we change the hidden dimension to build models of different sizes. For a fair comparison, we train them for 120 epochs to make the encoding time comparable, and we change the filter width to build NeRV models of comparable sizes, named NeRV-S, NeRV-M, and NeRV-L. In Table 2, NeRV outperforms them greatly in encoding speed, decoding quality, and decoding speed. Note that NeRV improves the training speed by 25x to 70x and speeds up the decoding FPS by 38x to 132x. We also conduct experiments with different training epochs in Table 3, which clearly show that longer training leads to a much better overfit of the video; we note that the final performance has not saturated and keeps improving with more training epochs.

Methods         Parameters   Training Speed   Encoding Time   PSNR    Decoding FPS
SIREN [44]      3.2M         -                -               31.39   1.4
NeRF [33]       3.2M         -                -               33.31   1.4
NeRV-S (ours)   3.2M         25               1               34.21   54.5
SIREN [44]      6.4M         -                -               31.37   0.8
NeRF [33]       6.4M         -                -               35.17   0.8
NeRV-M (ours)   6.3M         50               1               38.14   53.8
SIREN [44]      12.7M        -                -               25.06   0.4
NeRF [33]       12.7M        -                -               37.94   0.4
NeRV-L (ours)   12.5M        70               1               41.29   52.9
Table 2: Comparison with pixel-wise implicit representations. Training speed means time/epoch, while encoding time is the total training time.

Epoch   NeRV-S   NeRV-M   NeRV-L
300     32.21    36.05    39.75
600     33.56    37.47    40.84
1.2k    34.21    38.14    41.29
1.8k    34.33    38.32    41.68
2.4k    34.86    38.7     41.99
Table 3: PSNR vs. epochs. Since video encoding in NeRV is an overfitting process, the reconstructed video quality keeps increasing with more training epochs. NeRV-S/M/L denote models of different sizes.

4.3 Video Compression

Compression ablation. We first conduct an ablation study on the video "Big Buck Bunny". Figure 4 shows the results of different pruning ratios, where a model with 40% sparsity still reaches comparable performance to the full model. As for the model quantization step in Figure 5, an 8-bit model still retains the video quality of the original (32-bit) one. Figure 6 shows the full compression pipeline with NeRV. The compression performance is quite robust across NeRV models of different sizes, and each step shows a consistent contribution to the final results. Please note that we only explore these three common compression techniques here; we believe that other well-established and cutting-edge model compression algorithms can be applied to further improve performance on the video compression task, which we leave for future research.

Figure 4: Model pruning. Sparsity is the ratio of parameters pruned.
Figure 5: Model quantization. Bit is the bit length used to represent a parameter value.
Figure 6: Compression pipeline, showing how much each step contributes to the compression ratio.

Comparison with state-of-the-art methods. We then compare with state-of-the-art methods on the UVG dataset. First, we concatenate the 7 videos into one single video along the time dimension and train NeRV on all the frames from the different videos, which we found to be more beneficial than training a separate model for each video. After training the network, we apply model pruning, quantization, and weight encoding as described in Section 3.2. Figure 7 and Figure 8 show the rate-distortion curves. We compare with H.264 [58], HEVC [47], STAT-SSF-SP [61], HLVC [60], Scale-space [1], and Wu et al. [59]. H.264 and HEVC are run with the medium preset mode. As the first image-wise neural representation, NeRV generally achieves comparable performance with traditional video compression techniques and other learning-based video compression approaches. It is worth noting that when BPP is small, NeRV can match the performance of the state-of-the-art methods, showing its great potential at high compression ratios. When BPP becomes large, the performance gap is mostly due to a lack of full training caused by GPU resource limitations. As shown in Table 3, the decoded video quality keeps increasing with longer training. Figure 9 shows visualizations of decoded frames. At a similar memory budget, NeRV shows image details with better quality.

Figure 7: PSNR vs. BPP on UVG dataset.
Figure 8: MS-SSIM vs. BPP on UVG dataset.
Figure 9: Video compression visualization. At similar BPP, NeRV reconstructs videos with better details.

Decoding time. We compare with other methods on decoding time under a similar memory budget. Note that HEVC is run on CPU, while all other learning-based methods are run on a single GPU, including our NeRV. We speed up NeRV by running it in half precision (FP16). Due to its simple decoding process (a feedforward operation), NeRV shows a great advantage, even over the carefully optimized H.264. Further speedup can be expected by running the quantized model on specialized hardware. All the other video compression methods have two types of frames: key frames and interval frames. A key frame can be reconstructed from its encoded feature alone, while interval frame reconstruction also depends on the reconstructed key frames. Since most video frames are interval frames, their decoding needs to be done sequentially after the reconstruction of the respective key frames. On the contrary, NeRV can output frames at any random time index independently, thus making parallel decoding much simpler. This can be viewed as a distinct advantage over other methods.
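To make this random-access property concrete, the sketch below decodes an arbitrary set of frame indices in a single batched forward pass; the model is assumed to map a batch of normalized timestamps to RGB frames, with the index normalization following Section 3.1.

import torch

@torch.no_grad()
def decode_frames(model, frame_indices, num_frames, device="cuda"):
    """Reconstruct any subset of frames independently, in any order, in one batch."""
    t = (torch.tensor(frame_indices, dtype=torch.float32, device=device) + 1) / num_frames
    model = model.to(device).eval()
    return model(t)                       # e.g. a (len(frame_indices), 3, H, W) tensor

# Example: decode frames 10, 250 and 999 without reconstructing any other frame.
# frames = decode_frames(nerv_model, [10, 250, 999], num_frames=3900)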

4.4 Video Denoising

We apply several common noise patterns to the original video and train the model on the perturbed frames. During training, no masks or noise locations are provided to the model, i.e., the training target is the noisy frames and the model receives no extra signal indicating whether the input is noisy or not. Surprisingly, our model tends to avoid the influence of the noise and regularizes it implicitly, with little harm to the compression task, which can serve well for most partially distorted videos in practice.

The results are compared with some standard denoising methods, including Gaussian, uniform, and median filtering. These can be viewed as a denoising upper bound for any additional compression process. As listed in Table 5, the PSNR of the NeRV output is usually much higher than that of the noisy frames, even though NeRV is trained on the noisy target in a fully supervised manner, and it reaches an acceptable level for general denoising purposes. Specifically, median filtering has the best performance among the traditional denoising techniques, while NeRV outperforms it in most cases, or is at least comparable, without any extra denoising design in either the architecture or the training strategy.

Methods               FPS
Habibian et al. [6]   -
Wu et al. [59]        -
Rippel et al. [41]    1
DVC [31]              1.8
Liu et al. [29]       3
H.264 [58]            9.2
NeRV (FP32)           5.6
NeRV (FP16)           12.5
Table 4: Decoding speed with BPP 0.2 for 1080p videos.

Noise       White   Black   Salt & pepper   Random   Average
Baseline    27.85   28.29   27.95           30.95    28.74
Gaussian    30.27   30.14   30.23           30.99    30.41
Uniform     29.11   29.06   29.10           29.63    29.22
Median      33.89   33.84   33.87           33.89    33.87
Minimum     20.55   16.60   18.09           18.20    18.36
Maximum     16.16   20.26   17.69           17.83    17.99
NeRV        33.31   34.20   34.17           34.80    34.12
Table 5: PSNR results for video denoising. "Baseline" refers to the noisy frames before any denoising.

We also compare NeRV with another neural-network-based denoising method, Deep Image Prior (DIP) [50]. Although its main target is image denoising, NeRV outperforms it both qualitatively and quantitatively, as demonstrated in Figure 10. The main difference between them is that the denoising ability of DIP comes only from an architecture prior, while the denoising ability of NeRV comes from both an architecture prior and a data prior. DIP emphasizes that its image prior is captured only by the network structure of convolution operations, because it is fit to a single image. In contrast, the training data of NeRV contains many video frames that share a lot of visual content and consistency. As a result, for NeRV, the image prior is captured by both the network structure and the training data statistics. DIP also relies significantly on a good early-stopping strategy to prevent it from overfitting to the noise. Without a noise prior, it has to be used with a fixed-iteration setting, which does not easily generalize to arbitrary kinds of noise, as mentioned above. By contrast, NeRV can simply keep training, because the full set of consecutive video frames provides a strong regularization on image content over noise.

Figure 10: Denoising visualization. (c) and (e) are the denoising outputs of DIP [50]. The data generalization of NeRV leads to robust and better denoising performance, since all frames share the same representation, while DIP overfits one model to a single image.

4.5 Ablation Studies

Finally, we provide ablation studies on the UVG dataset. PSNR and MS-SSIM are adopted for evaluation of the reconstructed videos.
Input embedding. In Table 6, PE means positional encoding as in Equation 1, which greatly improves over the baseline; None means taking the frame index as input directly. Similar findings appear in [33]: without any input embedding, the model cannot learn high-frequency information, resulting in much lower performance.
Upscale layer. In Table 7, we show results of three different upscale methods, i.e., bilinear pooling, transposed convolution, and PixelShuffle [43]. With similar model sizes, PixelShuffle shows the best results. Please note that although transposed convolution [12] reaches comparable results, it greatly slows down training compared to PixelShuffle.
Normalization layer. In Table 8, we apply common normalization layers in the NeRV block. The default setup, without a normalization layer, reaches the best performance and runs slightly faster. We hypothesize that the normalization layer reduces the over-fitting capability of the neural network, which is contrary to our training objective.
Activation layer. Table 9 shows results for common activation layers. The GELU [19] activation function achieves the highest performance and is adopted as our default design.
Loss objective. We show the loss objective ablation in Table 10, with performance results for different combinations of L2, L1, and SSIM loss. Although adopting SSIM alone produces the highest MS-SSIM score, the combination of L1 and SSIM loss achieves the best trade-off between PSNR and MS-SSIM.

Table 6: Input embedding ablation. PE means positional encoding.
         PSNR    MS-SSIM
None     24.93   0.769
PE       37.26   0.970
Table 7: Upscale layer ablation
PSNR MS-SSIM
Bilinear Pooling 29.56 0.873
Transpose Conv 36.63 0.967
PixelShuffle 37.26 0.970
Table 8: Norm layer ablation
PSNR MS-SSIM
BatchNorm 36.71 0.971
InstanceNorm 35.5 0.963
None 37.26 0.970
Table 9: Activation function ablation
PSNR MS-SSIM
ReLU 35.89 0.963
Leaky ReLU 36.76 0.968
Swish 37.08 0.969
GELU 37.26 0.970
Table 10: Loss objective ablation over combinations of L2, L1, and SSIM loss (PSNR / MS-SSIM): 35.64 / 0.956; 35.77 / 0.959; 35.69 / 0.971; 35.95 / 0.960; 36.46 / 0.970; 37.26 / 0.970 (L1 + SSIM, our default).

5 Discussion

Conclusion. In this work, we present a novel neural representation for videos, NeRV, which encodes videos into neural networks. Our key insight is that by directly training a neural network to map a video frame index to the corresponding RGB image, we can use the weights of the model to represent the video, which is totally different from conventional representations that treat videos as consecutive frame sequences. With such a representation, we show that by simply applying general model compression techniques, NeRV can match the performance of traditional video compression approaches for the video compression task, without the need to design a long and complex pipeline. We also show that NeRV can outperform standard denoising methods. We hope that this paper can inspire further research on novel classes of methods for video representation.

Limitations and Future Work. There are some limitations to the proposed NeRV. First, to achieve comparable PSNR and MS-SSIM performance, the training time of our approach is longer than the encoding time of traditional video compression methods. Second, the architecture design of NeRV is not yet optimal; we believe more exploration of the neural architecture design can achieve higher performance. Finally, more advanced and cutting-edge model compression methods can be applied to NeRV to obtain higher compression ratios.

Acknowledgement. This project was partially funded by the DARPA SAIL-ON (W911NF2020009) program, an independent grant from Facebook AI, and an Amazon Research Award to AS.

References

  • [1] E. Agustsson, D. Minnen, N. Johnston, J. Balle, S. J. Hwang, and G. Toderici (2020) Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8503–8512. Cited by: §2, §4.3.
  • [2] N. Ahmed, T. Natarajan, and K. R. Rao (1974) Discrete cosine transform. IEEE transactions on Computers 100 (1), pp. 90–93. Cited by: §2.
  • [3] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies (1992) Image coding using wavelet transform. IEEE Transactions on image processing 1 (2), pp. 205–220. Cited by: §2.
  • [4] L. J. Ba and R. Caruana (2013) Do deep nets really need to be deep?. arXiv preprint arXiv:1312.6184. Cited by: §2.
  • [5] R. Banner, I. Hubara, E. Hoffer, and D. Soudry (2018) Scalable methods for 8-bit training of neural networks. arXiv preprint arXiv:1805.11046. Cited by: §3.2.
  • [6] R. Chabra, J. E. Lenssen, E. Ilg, T. Schmidt, J. Straub, S. Lovegrove, and R. Newcombe (2020) Deep local shapes: learning local sdf priors for detailed 3d reconstruction. In European Conference on Computer Vision, pp. 608–625. Cited by: §2, Table 5.
  • [7] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker (2017) Learning efficient object detection models with knowledge distillation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 742–751. Cited by: §2.
  • [8] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto (2019) Learning image and video compression through spatial-temporal energy compaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10071–10080. Cited by: §2.
  • [9] T. Cohen and M. Welling (2016) Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999. Cited by: §2.
  • [10] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. arXiv preprint arXiv:1404.0736. Cited by: §2.
  • [11] S. Dieleman, J. De Fauw, and K. Kavukcuoglu (2016) Exploiting cyclic symmetry in convolutional neural networks. In International conference on machine learning, pp. 1889–1898. Cited by: §2.
  • [12] V. Dumoulin and F. Visin (2016) A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285. Cited by: §4.5.
  • [13] E. Dupont, A. Goliński, M. Alizadeh, Y. W. Teh, and A. Doucet (2021) COIN: compression with implicit neural representations. arXiv preprint arXiv:2103.03123. Cited by: §2.
  • [14] F. Faghri, I. Tabrizian, I. Markov, D. Alistarh, D. Roy, and A. Ramezani-Kebrya (2020) Adaptive gradient quantization for data-parallel sgd. arXiv preprint arXiv:2010.12460. Cited by: §3.2.
  • [15] K. Genova, F. Cole, A. Sud, A. Sarna, and T. A. Funkhouser (2019) Deep structured implicit functions.. Cited by: §2.
  • [16] K. Genova, F. Cole, D. Vlasic, A. Sarna, W. T. Freeman, and T. Funkhouser (2019) Learning shape templates with structured implicit functions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7154–7164. Cited by: §2.
  • [17] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan (2015) Deep learning with limited numerical precision. In International conference on machine learning, pp. 1737–1746. Cited by: §2.
  • [18] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §2.
  • [19] D. Hendrycks and K. Gimpel (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415. Cited by: §4.5.
  • [20] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
  • [21] K. Hornik, M. Stinchcombe, and H. White (1989) Multilayer feedforward networks are universal approximators. Neural networks 2 (5), pp. 359–366. Cited by: §1, §3.1.
  • [22] D. A. Huffman (1952) A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40 (9), pp. 1098–1101. Cited by: §3.2.
  • [23] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §2, §3.2.
  • [24] M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866. Cited by: §2.
  • [25] C. Jiang, A. Sud, A. Makadia, J. Huang, M. Nießner, T. Funkhouser, et al. (2020) Local implicit grid representations for 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6001–6010. Cited by: §2.
  • [26] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [27] R. Krishnamoorthi (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342. Cited by: §2.
  • [28] D. Le Gall (1991) MPEG: a video compression standard for multimedia applications. Communications of the ACM 34 (4), pp. 46–58. Cited by: §2.
  • [29] J. Liu, S. Wang, W. Ma, M. Shah, R. Hu, P. Dhawan, and R. Urtasun (2020) Conditional entropy coding for efficient video compression. In ECCV, Cited by: Table 5.
  • [30] I. Loshchilov and F. Hutter (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §4.1.
  • [31] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao (2019) Dvc: an end-to-end deep video compression framework. In CVPR, Cited by: Table 1, Table 5.
  • [32] A. Mercat, M. Viitanen, and J. Vanne (2020) UVG dataset: 50/120fps 4k sequences for video codec analysis and development. In Proceedings of the 11th ACM Multimedia Systems Conference, pp. 297–302. Cited by: §1, §4.1.
  • [33] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) Nerf: representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pp. 405–421. Cited by: Table 1, §2, §3.1, §4.2, §4.5, Table 3.
  • [34] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger (2020) Differentiable volumetric rendering: learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3504–3515. Cited by: §2.
  • [35] M. Oechsle, L. Mescheder, M. Niemeyer, T. Strauss, and A. Geiger (2019) Texture fields: learning texture representations in function space. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4531–4540. Cited by: §2.
  • [36] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §4.1.
  • [37] S. Peng, M. Niemeyer, L. Mescheder, M. Pollefeys, and A. Geiger (2020) Convolutional occupancy networks. arXiv preprint arXiv:2003.04618 2. Cited by: §2.
  • [38] A. Polino, R. Pascanu, and D. Alistarh (2018) Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668. Cited by: §2.
  • [39] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville (2019) On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301–5310. Cited by: §3.1.
  • [40] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua (2013) Learning separable filters. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2754–2761. Cited by: §2.
  • [41] O. Rippel, S. Nair, C. Lew, S. Branson, A. G. Anderson, and L. Bourdev (2019) Learned video compression. In ICCV, Cited by: Table 5.
  • [42] W. Shang, K. Sohn, D. Almeida, and H. Lee (2016) Understanding and improving convolutional neural networks via concatenated rectified linear units. In International conference on machine learning, pp. 2217–2225. Cited by: §2.
  • [43] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1874–1883. Cited by: §3.1, §4.5.
  • [44] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein (2020) Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems 33. Cited by: §1, §4.2, Table 3.
  • [45] V. Sitzmann, M. Zollhöfer, and G. Wetzstein (2019) Scene representation networks: continuous 3d-structure-aware neural scene representations. arXiv preprint arXiv:1906.01618. Cited by: §2.
  • [46] A. Skodras, C. Christopoulos, and T. Ebrahimi (2001) The jpeg 2000 still image compression standard. IEEE Signal processing magazine 18 (5), pp. 36–58. Cited by: §2.
  • [47] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand (2012) Overview of the high efficiency video coding (hevc) standard. IEEE Transactions on circuits and systems for video technology 22 (12), pp. 1649–1668. Cited by: 3rd item, Table 1, §2, §4.3.
  • [48] M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng (2020) Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS. Cited by: §1, §3.1.
  • [49] S. Tomar (2006) Converting video formats with ffmpeg. Linux Journal 2006 (146), pp. 10. Cited by: §A.3.
  • [50] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2018) Deep image prior. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9446–9454. Cited by: Figure 10, §4.4.
  • [51] V. Vanhoucke, A. Senior, and M. Z. Mao (2011) Improving the speed of neural networks on cpus. Cited by: §2.
  • [52] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §3.1.
  • [53] G. K. Wallace (1992) The jpeg still picture compression standard. IEEE transactions on consumer electronics 38 (1), pp. xviii–xxxiv. Cited by: §2.
  • [54] H. Wang, W. Gan, S. Hu, J. Y. Lin, L. Jin, L. Song, P. Wang, I. Katsavounidis, A. Aaron, and C. J. Kuo (2016) MCL-jcv: a jnd-based h. 264/avc video quality assessment dataset. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 1509–1513. Cited by: §A.2, §4.1.
  • [55] N. Wang, J. Choi, D. Brand, C. Chen, and K. Gopalakrishnan (2018) Training deep neural networks with 8-bit floating point numbers. arXiv preprint arXiv:1812.08011. Cited by: §3.2.
  • [56] Z. Wang, E. P. Simoncelli, and A. C. Bovik (2003) Multiscale structural similarity for image quality assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems Computers, 2003, Vol. 2, pp. 1398–1402. Cited by: §4.1.
  • [57] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. arXiv preprint arXiv:1608.03665. Cited by: §2.
  • [58] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra (2003) Overview of the h. 264/avc video coding standard. IEEE Transactions on circuits and systems for video technology 13 (7), pp. 560–576. Cited by: 3rd item, §2, §4.3, Table 5.
  • [59] C. Wu, N. Singhal, and P. Krahenbuhl (2018) Video compression through image interpolation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 416–431. Cited by: §2, §4.3, Table 5.
  • [60] R. Yang, F. Mentzer, L. V. Gool, and R. Timofte (2020) Learning for video compression with hierarchical quality and recurrent enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6628–6637. Cited by: §2, §4.3.
  • [61] R. Yang, Y. Yang, J. Marino, and S. Mandt (2020) Hierarchical autoregressive modeling for neural video compression. arXiv preprint arXiv:2010.10258. Cited by: §4.3.
  • [62] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang (2016) Doubly convolutional neural networks. arXiv preprint arXiv:1610.09716. Cited by: §2.

Appendix A Appendix

A.1 NeRV Architecture

We provide the architecture details in Table 11. Given the timestamp index $t$, we first apply a 2-layer MLP to the output of the positional encoding layer, then we stack 5 NeRV blocks with upscale factors 5, 3, 2, 2, 2, respectively. In the UVG video compression experiments, we train models of different sizes by changing the two width hyper-parameters in Table 11 to (48, 384), (64, 512), (128, 512), (128, 768), (128, 1024), (192, 1536), and (256, 2048).

Layer   Module                Upscale Factor
0       Positional Encoding   -
1       MLP & Reshape         -
2       NeRV block            5
3       NeRV block            3
4       NeRV block            2
5       NeRV block            2
6       NeRV block            2
7       Head layer            -
Table 11: NeRV architecture (upscale factors shown for 1080p videos). Change the two width hyper-parameters to get models of different sizes.
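Putting Table 11 together with the NeRVBlock sketched in Section 3.1, a minimal end-to-end decoder might look like the following. The embedding size, MLP width, channel schedule, initial 9x16 feature map, and sigmoid output are all placeholder assumptions, chosen so that the five upscale factors (5, 3, 2, 2, 2) reach a 1920x1080 output.

import torch
import torch.nn as nn

class NeRV(nn.Module):
    """Sketch of the NeRV decoder: MLP -> reshape -> five NeRV blocks -> head layer."""
    def __init__(self, embed_dim=160, mlp_dim=512, base_channels=128,
                 base_size=(9, 16), scales=(5, 3, 2, 2, 2)):
        super().__init__()
        self.base_channels, self.base_size = base_channels, base_size
        self.mlp = nn.Sequential(                                   # layer 1: MLP & reshape
            nn.Linear(embed_dim, mlp_dim), nn.GELU(),
            nn.Linear(mlp_dim, base_channels * base_size[0] * base_size[1]), nn.GELU(),
        )
        blocks, channels = [], base_channels
        for s in scales:                                            # layers 2-6: NeRV blocks
            out_channels = max(channels // 2, 16)                   # placeholder channel schedule
            blocks.append(NeRVBlock(channels, out_channels, s))     # NeRVBlock from Section 3.1
            channels = out_channels
        self.blocks = nn.Sequential(*blocks)
        self.head = nn.Conv2d(channels, 3, kernel_size=3, padding=1)  # layer 7: RGB head

    def forward(self, t_embedding):                                 # layer 0 output (Eq. 1)
        x = self.mlp(t_embedding)
        x = x.view(-1, self.base_channels, *self.base_size)         # reshape to a feature map
        return torch.sigmoid(self.head(self.blocks(x)))             # frames in [0, 1]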

A.2 Results on the MCL-JCV dataset

We provide the experimental results for the video compression task on the MCL-JCV [54] dataset in Figure 11.

Figure 11: Rate-distortion plots (PSNR vs. BPP and MS-SSIM vs. BPP) on the MCL-JCV dataset.

A.3 Implementation Details of Baselines

Following prior works, we used ffmpeg [49] to produce the evaluation metrics for H.264 and HEVC.

First, we use the following command to extract frames from original YUV videos, as well as compressed videos to calculate metrics:

ffmpeg -i FILE.y4m FILE/f%05d.png

Then we use the following commands to compress videos with H.264 or HEVC codec under medium settings:

ffmpeg -i FILE/f%05d.png -c:v h264 -preset medium \
    -bf 0 -crf CRF FILE.EXT
ffmpeg -i FILE/f%05d.png -c:v hevc -preset medium \
    -x265-params bframes=0 -crf CRF FILE.EXT

where FILE is the filename, CRF is the Constant Rate Factor value, and EXT is the video container format extension.

A.4 Video Temporal Interpolation

We also explore NeRV for the video temporal interpolation task. Specifically, we train our model on a subset of frames sampled from one video, and then use the trained model to infer/predict unseen frames given an unseen, interpolated frame index. As we show in Figure 12, NeRV gives quite reasonable predictions for the unseen frames, with visual quality comparable to the adjacent seen frames.

Figure 12: Temporal interpolation results for video with small motion.

A.5 More Visualizations

We provide more qualitative visualization results in Figure 13 to compare our NeRV with H.265 for the video compression task. We test a smaller model on the "Bosphorus" video, and it also performs better than the H.265 codec at a similar BPP. The zoomed areas show that our model produces fewer artifacts and smoother output.

Figure 13: Video compression visualization on the "Bosphorus" video from the UVG dataset. The difference is calculated as the L1 error (absolute value, scaled to the same level for the same frame; darker means more different). The residual is much smaller for NeRV.

Broader Impact.

As the most popular media format nowadays, videos are generally viewed as sequences of frames. In contrast, our proposed NeRV is a novel way to represent videos as a function of time, parameterized by a neural network, which is more efficient and may be useful in many video-related tasks, such as video compression and video denoising. Hopefully, this can save bandwidth and speed up media streaming, enriching entertainment possibilities. Unfortunately, like many advances in deep learning for videos, this approach can be utilized for a variety of purposes beyond our control.