1 Introduction
What is a video? Typically, a video captures a dynamic visual scene using a sequence of frames. A schematic interpretation of this is a curve in 2D space, where each point can be characterized with an $(x, y)$ pair representing the spatial state. If we have a model for all $(x, y)$ pairs, then, given any $x$, we can easily find the corresponding $y$ state. Similarly, we can interpret a video as a recording of the visual world, where we can find a corresponding RGB state for every single timestamp. This leads to our main question: can we represent a video as a function of time?
More formally, can we represent a video $V$ as $V = \{v_t\}_{t=1}^{T}$, where $v_t = f_\theta(t)$, i.e., a frame at timestamp $t$ is represented as a function $f$ parameterized by $\theta$? Given their remarkable representational capacity [21], we choose deep neural networks as the function in our work. Given these intuitions, we propose NeRV, a novel representation that represents videos as implicit functions and encodes them into neural networks. Specifically, with a fairly simple deep neural network design, NeRV can reconstruct the corresponding video frames with high quality, given the frame index. Once the video is encoded into a neural network, this network can be used as a proxy for the video, from which we can directly extract all video information. Therefore, unlike traditional video representations that treat videos as sequences of frames, shown in Figure 1 (a), our proposed NeRV considers a video as a unified neural network with all information embedded within its architecture and parameters, shown in Figure 1 (b).
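Table 1: Comparison of different video representations.

| | Explicit (frame-based): hand-crafted (e.g., HEVC [47]) | Explicit (frame-based): learning-based (e.g., DVC [31]) | Implicit (unified): pixel-wise (e.g., NeRF [33]) | Implicit (unified): image-wise (ours) |
|---|---|---|---|---|
| Encoding speed | Fast | Medium | Very slow | Slow |
| Decoding speed | Medium | Slow | Very slow | Fast |
| Compression ratio | Medium | High | Low | Medium |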
As an image-wise implicit representation, NeRV shares many similarities with pixel-wise implicit visual representations [44, 48], which take spatial-temporal coordinates as inputs. The main differences between our work and pixel-wise implicit representations are the output space and the architecture design. Pixel-wise representations output the RGB value for each pixel, while NeRV outputs a whole image, as demonstrated in Figure 2. Given a video of size $T \times H \times W$, pixel-wise representations need to sample the video $T \cdot H \cdot W$ times, while NeRV only needs to sample it $T$ times. Considering the huge number of pixels, especially for high-resolution videos, NeRV shows a great advantage in both encoding time and decoding speed. The different output space also leads to different architecture designs: NeRV uses an MLP + ConvNets architecture to output an image, while a pixel-wise representation uses a simple MLP to output the RGB value of a single pixel. The sampling efficiency of NeRV also simplifies the optimization problem, which leads to better reconstruction quality compared to pixel-wise representations.
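As a rough illustration of the sampling argument, the sketch below counts how many network evaluations each kind of representation needs to decode a whole clip. The function name and the clip size are hypothetical and chosen only for the arithmetic; they are not taken from the experiments.

```python
def forward_passes(num_frames: int, height: int, width: int) -> dict:
    """Count the network evaluations needed to decode every frame of a video."""
    return {
        # Pixel-wise: one query per (x, y, t) coordinate.
        "pixel_wise": num_frames * height * width,
        # Image-wise: one query per frame index t.
        "image_wise": num_frames,
    }


# Hypothetical clip: 100 frames of 720 x 1280 pixels (illustrative values only).
counts = forward_passes(100, 720, 1280)
ratio = counts["pixel_wise"] // counts["image_wise"]
print(counts, ratio)
```

The ratio of query counts is not the wall-clock speedup, since a single image-wise forward pass is far more expensive than a single pixel-wise one; it only shows why the number of samples per video differs by a factor of the frame area.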
We also demonstrate the flexibility of NeRV by exploring several applications it affords. Most notably, we examine the suitability of NeRV for video compression. Traditional video compression frameworks are quite involved: they specify key frames and inter frames, estimate the residual information, partition the video frames into blocks, apply the discrete cosine transform to the resulting image blocks, and so on. Such a long pipeline makes the decoding process very complex as well. In contrast, given a neural network that encodes a video in NeRV, we can simply cast the video compression task as a model compression problem, and readily leverage any well-established or cutting-edge model compression algorithm to achieve good compression ratios. Specifically, we explore a three-step model compression pipeline: model pruning, model quantization, and weight encoding, and show the contribution of each step to the compression task. We conduct extensive experiments on popular video compression datasets, such as UVG [32], and show the applicability of model compression techniques on NeRV for video compression. We briefly compare different video representations in Table 1, where NeRV shows a great advantage in decoding speed.

Besides video compression, we also explore another application of the NeRV representation: video denoising. Since NeRV is a learnt implicit function, we can demonstrate its robustness to noise and perturbations. Given a noisy video as input, NeRV generates a high-quality denoised output without any additional operation, and even outperforms conventional denoising methods.
The contributions of this paper can be summarized in four parts:

We propose NeRV, a novel image-wise implicit representation for videos, which represents a video as a neural network, converting video encoding into model fitting and video decoding into a simple feedforward operation.

Compared to pixel-wise implicit representations, NeRV outputs the whole image and is far more efficient, improving the encoding speed by 25× to 70× and the decoding speed by 38× to 132×, while achieving better video quality.

As a general representation for videos, NeRV also shows promising results in other tasks, e.g., video denoising. Without any special denoising design, NeRV outperforms traditional hand-crafted denoising algorithms (median filter, etc.) and ConvNets-based denoising methods.
2 Related Work
Implicit Neural Representation. Implicit neural representation is a novel way to parameterize a variety of signals. The key idea is to represent an object as a function approximated via a neural network, which maps a coordinate to its corresponding value (e.g., the pixel coordinate of an image to the RGB value of that pixel). It has been widely applied in many 3D vision tasks, such as 3D shapes [16, 15], 3D scenes [45, 25, 37, 6], and the appearance of 3D structures [33, 34, 35]. Compared to explicit 3D representations, such as voxels, point clouds, and meshes, the continuous implicit neural representation can compactly encode high-resolution signals in a memory-efficient way. Most recently, [13] demonstrated the feasibility of using implicit neural representations for image compression tasks. Although it is not yet competitive with state-of-the-art compression methods, it shows promising and attractive properties. In previous methods, MLPs are often used to approximate the implicit neural representations, taking a spatial or spatio-temporal coordinate as input and outputting the signal at that single point (e.g., RGB value, volume density). In contrast, our NeRV representation trains a purposefully designed neural network composed of MLPs and convolution layers, which takes the frame index as input and directly outputs all the RGB values of that frame.
Video Compression. As a fundamental task of computer vision and image processing, visual data compression has been studied for several decades. Before the resurgence of deep networks, hand-crafted image compression techniques, like JPEG [53] and JPEG2000 [46], were widely used. Building upon them, many traditional video compression algorithms, such as MPEG [28], H.264 [58], and HEVC [47], have achieved great success. These methods are generally based on transform coding such as the Discrete Cosine Transform (DCT) [2] or the wavelet transform [3], and are well engineered and tuned to be fast and efficient. More recently, deep learning-based visual compression approaches have been gaining popularity. For video compression, the most common practice is to use neural networks for certain components while keeping the traditional video compression pipeline. For example, [8] proposed an effective image compression approach and generalized it to video compression by adding interpolation loop modules. Similarly, [59] converted the video compression problem into an image interpolation problem and proposed an interpolation network, resulting in competitive compression quality. Furthermore, [1] generalized optical flow to scale-space flow to better model uncertainty in compression. Later, [60] employed a temporal hierarchical structure and trained neural networks for most components, including key frame compression, motion estimation, motion compression, and residual compression. However, all of these works still follow the overall pipeline of traditional compression, arguably limiting their capabilities.

Model Compression. The goal of model compression is to simplify an original model by reducing the number of parameters while maintaining its accuracy. Current research on model compression can be divided into four groups: parameter pruning and quantization [51, 17, 18, 57, 23, 27]; low-rank factorization [40, 10, 24]; transferred and compact convolutional filters [9, 62, 42, 11]; and knowledge distillation [4, 20, 7, 38]. Our proposed NeRV enables us to reformulate the video compression problem as model compression and to use standard model compression techniques. Specifically, we use model pruning and quantization to reduce the model size without significantly deteriorating the performance.
3 Neural Representations for Videos
We first present the NeRV representation in Section 3.1, including the input embedding, the network architecture, and the loss objective. Then, we present model compression techniques on NeRV in Section 3.2 for video compression.
3.1 NeRV Architecture
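In NeRV, each video $V = \{v_t\}_{t=1}^{T} \in \mathbb{R}^{T \times H \times W \times 3}$ is represented by a function $f_\theta : \mathbb{R} \to \mathbb{R}^{H \times W \times 3}$, where the input is a frame index $t$ and the output is the corresponding RGB image $v_t \in \mathbb{R}^{H \times W \times 3}$. The encoding function is parameterized with a deep neural network $\theta$, $v_t = f_\theta(t)$. Therefore, video encoding is done by fitting a neural network $f_\theta$ to a given video, such that it can map each input timestamp to the corresponding RGB frame.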
Input Embedding. Although deep neural networks can be used as universal function approximators [21], directly training the network $f_\theta$ on the raw input timestamp $t$ gives poor results, as also observed by [39, 33]. By mapping the inputs to a high-dimensional embedding space, the neural network can better fit data with high-frequency variations. Specifically, in NeRV, we use Positional Encoding [33, 52, 48] as our embedding function

$$\Gamma(t) = \left(\sin(b^{0}\pi t), \cos(b^{0}\pi t), \dots, \sin(b^{l-1}\pi t), \cos(b^{l-1}\pi t)\right) \tag{1}$$

where $b$ and $l$ are hyperparameters of the network. Given an input timestamp $t$, normalized between $(0, 1]$, the output of the embedding function $\Gamma(\cdot)$ is then fed to the following neural network.
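The embedding can be sketched in a few lines of plain Python. This is a minimal sketch that assumes the sinusoidal form $(\sin(b^{i}\pi t), \cos(b^{i}\pi t))$ for $i = 0, \dots, l-1$, as used in the positional-encoding works cited above; the function name and the values of `b` and `l` in the usage lines are illustrative assumptions, not the settings used in the experiments.

```python
import math


def positional_encoding(t: float, b: float, l: int) -> list:
    """Embed a scalar timestamp t into 2*l sinusoidal features.

    Returns [sin(b^0 pi t), cos(b^0 pi t), ..., sin(b^(l-1) pi t), cos(b^(l-1) pi t)].
    """
    features = []
    for i in range(l):
        angle = (b ** i) * math.pi * t  # frequency grows geometrically with i
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


# Illustrative values only: b = 2 and l = 4 give an 8-dimensional embedding.
emb = positional_encoding(0.5, b=2.0, l=4)
print(len(emb))
```

Larger `l` adds higher-frequency components, which is what lets the downstream network distinguish neighbouring frame indices.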
Network Architecture. The NeRV architecture is illustrated in Figure 2 (b). NeRV takes the time embedding as input and outputs the corresponding RGB frame. Using MLPs to directly output all pixel values of a frame would require a huge number of parameters, especially when the image resolution is large. Therefore, we stack multiple NeRV blocks after the MLP layers so that pixels at different locations can share convolutional kernels, leading to an efficient and effective network. Inspired by super-resolution networks, we design the NeRV block, illustrated in Figure 2 (c), adopting the PixelShuffle technique [43] as the upscaling method. Convolution and activation layers are also inserted to enhance the expressiveness. The detailed architecture can be found in the supplementary material.

Loss Objective.
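For NeRV, we adopt a combination of L1 and SSIM loss as our loss function for network optimization, which calculates the loss over all pixel locations of the predicted image and the ground-truth image as follows:

$$L = \frac{1}{T}\sum_{t=1}^{T} \alpha \left\| f_\theta(t) - v_t \right\|_1 + (1 - \alpha)\left(1 - \text{SSIM}(f_\theta(t), v_t)\right) \tag{2}$$

where $T$ is the frame number, $f_\theta(t)$ the NeRV prediction, $v_t$ the frame ground truth, and $\alpha$ a hyperparameter that balances the weight of each loss component.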
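To make the weighting between the two terms concrete, the following sketch computes a loss of the form $\alpha \cdot \text{L1} + (1 - \alpha)(1 - \text{SSIM})$ for a single frame stored as a flat list of intensities in $[0, 1]$. It makes two simplifying assumptions that should not be read as the actual implementation: the L1 term is averaged over pixels, and SSIM is computed from global image statistics rather than over local windows. The function names and the value of `alpha` are hypothetical.

```python
def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM from whole-image statistics (no sliding window)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilising constants
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def l1_ssim_loss(pred, target, alpha):
    """alpha * mean absolute error + (1 - alpha) * (1 - SSIM) for one frame."""
    l1 = sum(abs(a - b) for a, b in zip(pred, target)) / len(pred)
    return alpha * l1 + (1 - alpha) * (1 - global_ssim(pred, target))


target = [0.0, 0.5, 1.0, 0.25]
loss_same = l1_ssim_loss(target, target, alpha=0.5)                # perfect prediction
loss_diff = l1_ssim_loss([1.0, 0.5, 0.0, 0.25], target, alpha=0.5)  # two pixels swapped
print(loss_same, loss_diff)
```

Averaging this per-frame value over all frames gives the video-level objective; a windowed SSIM would replace `global_ssim` in practice.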
3.2 Model Compression
In this section, we briefly revisit the model compression techniques used for video compression with NeRV. Our model compression pipeline consists of four standard sequential steps: video overfitting, model pruning, weight quantization, and weight encoding, as shown in Figure 3.
Model Pruning. Given a neural network fit on a video, we first use global unstructured pruning to reduce the model size. Based on the magnitude of the weight values, we set weights below a threshold to zero,

$$\theta_i = \begin{cases} \theta_i, & \text{if } |\theta_i| \geq \theta_q \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

where $\theta_q$ is the $q$-th percentile of the magnitudes of all parameters in $\theta$. As is common practice, we fine-tune the model after the pruning operation to regain the representation.
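A minimal sketch of global magnitude pruning is given below. It assumes all layer weights have already been flattened into one list (that is what makes the pruning "global"), and it selects the smallest-magnitude entries by sorting rather than by an explicit percentile threshold, which is equivalent when there are no ties. The function name and the example weights are illustrative.

```python
def global_magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


# Hypothetical flattened weights from all layers of a small model.
weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.2, -0.4, 0.6, 0.07, -0.15]
pruned = global_magnitude_prune(weights, sparsity=0.4)
print(pruned)
```

After pruning, the surviving weights would be fine-tuned with the zeros held fixed; that step is omitted here.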
Model Quantization. After model pruning, we apply model quantization to all network parameters. Note that, unlike many recent works [23, 5, 14, 55] that use quantization during training, NeRV is only quantized post-hoc (after the training process). Given a parameter tensor $\mu$,

$$\mu_i = \text{round}\left(\frac{\mu_i - \mu_{\min}}{\text{scale}}\right) \cdot \text{scale} + \mu_{\min}, \quad \text{scale} = \frac{\mu_{\max} - \mu_{\min}}{2^{\text{bit}}} \tag{4}$$

where ‘round’ rounds a value to the closest integer, ‘bit’ is the bit length of the quantized model, $\mu_{\max}$ and $\mu_{\min}$ are the max and min values of the parameter tensor $\mu$, and ‘scale’ is the scaling factor. Through Equation 4, each parameter can be mapped to a ‘bit’ length value. The overhead of storing ‘scale’ and $\mu_{\min}$ can be ignored given the large number of parameters in $\mu$, e.g., they account for only 0.005% in a small $3 \times 3$ Conv with 64 input channels and 64 output channels (37k parameters in total).
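The post-hoc quantization step can be sketched as uniform min-max quantization of one tensor. This is a sketch under stated assumptions: the scale is taken as (max − min) / 2^bit, and codes are clamped to 2^bit − 1 so that every code fits in `bit` bits; the clamping and the function names are choices of this example rather than details given in the text.

```python
def quantize(values, bit):
    """Map floats to integer codes in [0, 2**bit - 1]; returns (codes, scale, v_min)."""
    v_min, v_max = min(values), max(values)
    scale = (v_max - v_min) / (2 ** bit)
    if scale == 0.0:  # constant tensor: every code is 0
        return [0] * len(values), scale, v_min
    top = 2 ** bit - 1  # largest code that fits in `bit` bits
    codes = [min(round((v - v_min) / scale), top) for v in values]
    return codes, scale, v_min


def dequantize(codes, scale, v_min):
    """Recover approximate float values from integer codes."""
    return [c * scale + v_min for c in codes]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # hypothetical parameter tensor
codes, scale, v_min = quantize(weights, bit=8)
restored = dequantize(codes, scale, v_min)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, max_error)
```

Only the integer codes plus the two floats `scale` and `v_min` need to be stored per tensor, which is why the per-tensor overhead is negligible for large layers.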
Entropy Encoding. Finally, we use entropy encoding to further reduce the model size. By taking advantage of symbol frequencies, entropy encoding can represent the data with a more efficient code. Specifically, we apply Huffman Coding [22] after model quantization. Since Huffman Coding is lossless, this extra compression is guaranteed to have no impact on the reconstruction quality. Empirically, this further reduces the model size by around 10%.
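As an illustration of why entropy coding is lossless yet still saves space, here is a small, self-contained Huffman coder applied to a list of quantized weight codes. It is a textbook construction, not the specific encoder used in the experiments, and the symbol sequence is made up for the example.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code {symbol: bitstring} from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (count, unique tie-breaker, partial code table).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, table1 = heapq.heappop(heap)
        n2, _, table2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in table1.items()}
        merged.update({s: "1" + c for s, c in table2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(symbols, code):
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free, so the first match is the symbol
            out.append(inverse[current])
            current = ""
    return out


# Hypothetical quantized weights: code 0 dominates, e.g. after pruning.
symbols = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
code = huffman_code(symbols)
bits = encode(symbols, code)
print(len(bits), "bits instead of", 2 * len(symbols))
```

Frequent symbols get short codewords, so a pruned model with many zero weights benefits the most; decoding recovers the exact integer codes.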
4 Experiments
4.1 Datasets and Implementation Details
We perform experiments on the “Big Buck Bunny” sequence from scikit-video, which has 132 frames at 720p resolution, to compare NeRV with pixel-wise implicit representations. To compare with state-of-the-art methods on the video compression task, we run experiments on the widely used UVG dataset [32], consisting of 7 videos and 3900 frames in total at 1080p resolution.
In our experiments, we train the network using the Adam optimizer [26] with a learning rate of 5e-4. For the ablation study on UVG, we use a cosine annealing learning rate schedule [30], a batch size of 1, 150 training epochs, and 30 warmup epochs unless otherwise noted. When comparing with state-of-the-art methods, we train the model for 1500 epochs with a batch size of 6. For experiments on “Big Buck Bunny”, we train NeRV for 1200 epochs unless otherwise noted. For the fine-tuning process after pruning, we use 50 epochs for both UVG and “Big Buck Bunny”.
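The learning rate schedule can be sketched as follows. The text specifies cosine annealing with warmup epochs but not the exact warmup shape, so this sketch assumes a linear warmup up to the base rate followed by cosine decay towards zero; the function name is hypothetical, and the numbers in the usage line mirror the ablation setting (base rate 5e-4, 150 epochs, 30 warmup epochs).

```python
import math


def learning_rate(epoch, base_lr, total_epochs, warmup_epochs):
    """Linear warmup (assumed) followed by cosine annealing to zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e, 5e-4, 150, 30) for e in range(150)]
print(max(schedule), schedule[-1])
```

In a PyTorch training loop the same curve could be applied per epoch by assigning the returned value to each parameter group's `lr`.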
For the NeRV architecture, there are 5 NeRV blocks, with upscale factors 5, 3, 2, 2, 2 respectively for 1080p videos, and 5, 2, 2, 2, 2 respectively for 720p videos. By changing the hidden dimension of the MLP and the channel dimension of the NeRV blocks, we can build NeRV models of different sizes. For the input embedding in Equation 1, we use $b = 1.25$ and $l = 80$ as our default setting. For the loss objective in Equation 2, $\alpha$ is set to 0.7. We evaluate the video quality with two metrics: PSNR and MS-SSIM [56]. Bits-per-pixel (BPP) is adopted to indicate the compression ratio. We implement our model in PyTorch [36] and train it in full precision (FP32). All experiments are run on NVIDIA RTX 2080 Ti. Please refer to the supplementary material for more experimental details, results, and visualizations (e.g., MCL-JCV [54] results).

4.2 Main Results
We compare NeRV with pixel-wise implicit representations on the “Big Buck Bunny” video. We take SIREN [44] and NeRF [33] as the baselines, where SIREN [44] takes the original pixel coordinates as input and uses sine activations, while NeRF [33] adds one positional embedding layer to encode the pixel coordinates and uses ReLU activations. Both SIREN and NeRF use a 3-layer perceptron, and we change the hidden dimension to build models of different sizes. For a fair comparison, we train SIREN and NeRF for 120 epochs to make the encoding time comparable. We also change the filter width to build NeRV models of comparable sizes, named NeRV-S, NeRV-M, and NeRV-L. In Table 2, NeRV greatly outperforms them in encoding speed, decoding quality, and decoding speed. Note that NeRV improves the training speed by 25× to 70× and speeds up the decoding FPS by 38× to 132×. We also conduct experiments with different numbers of training epochs in Table 3, which clearly show that longer training leads to much better overfitting of the video; the final performance has not yet saturated and keeps improving as the model trains for more epochs.

| Methods | Parameters | Training Speed | Encoding Time | PSNR | Decoding FPS |
|---|---|---|---|---|---|
| SIREN [44] | 3.2M | | | 31.39 | 1.4 |
| NeRF [33] | 3.2M | | | 33.31 | 1.4 |
| NeRV-S (ours) | 3.2M | 25× | 1× | 34.21 | 54.5 |
| SIREN [44] | 6.4M | | | 31.37 | 0.8 |
| NeRF [33] | 6.4M | | | 35.17 | 0.8 |
| NeRV-M (ours) | 6.3M | 50× | 1× | 38.14 | 53.8 |
| SIREN [44] | 12.7M | | | 25.06 | 0.4 |
| NeRF [33] | 12.7M | | | 37.94 | 0.4 |
| NeRV-L (ours) | 12.5M | 70× | 1× | 41.29 | 52.9 |

| Epoch | NeRV-S | NeRV-M | NeRV-L |
|---|---|---|---|
| 300 | 32.21 | 36.05 | 39.75 |
| 600 | 33.56 | 37.47 | 40.84 |
| 1.2k | 34.21 | 38.14 | 41.29 |
| 1.8k | 34.33 | 38.32 | 41.68 |
| 2.4k | 34.86 | 38.7 | 41.99 |

4.3 Video Compression
Compression ablation. We first conduct an ablation study on the video “Big Buck Bunny”. Figure 4 shows the results for different pruning ratios, where a model with 40% sparsity still reaches performance comparable to the full model. As for the model quantization step in Figure 5, an 8-bit model still preserves the video quality of the original one (32-bit). Figure 6 shows the full compression pipeline with NeRV. The compression performance is quite robust across NeRV models of different sizes, and each step contributes consistently to our final results. Please note that we only explore these three common compression techniques here; we believe that other well-established and cutting-edge model compression algorithms can be applied to further improve the final performance on the video compression task, which is left for future research.
Comparison with state-of-the-art methods. We then compare with state-of-the-art methods on the UVG dataset. First, we concatenate the 7 videos into one single video along the time dimension and train NeRV on all the frames from the different videos, which we found to be more beneficial than training a separate model for each video. After training the network, we apply model pruning, quantization, and weight encoding as described in Section 3.2. Figure 7 and Figure 8 show the rate-distortion curves. We compare with H.264 [58], HEVC [47], STAT-SSF-SP [61], HLVC [60], Scale-space [1], and Wu et al. [59]. H.264 and HEVC are run with the medium preset mode. As the first image-wise neural representation, NeRV generally achieves performance comparable to traditional video compression techniques and other learning-based video compression approaches. It is worth noting that when BPP is small, NeRV can match the performance of the state-of-the-art method, showing its great potential for video compression at high compression ratios. When BPP becomes large, the performance gap is mostly due to the lack of full training caused by limited GPU resources. As shown in Table 3, the decoded video quality keeps increasing as the number of training epochs grows. Figure 9 shows visualizations of decoded frames. At a similar memory budget, NeRV shows image details with better quality.
Decoding time. We compare with other methods for decoding time under a similar memory budget. Note that HEVC is run on CPU, while all other learning-based methods are run on a single GPU, including our NeRV. We speed up NeRV by running it in half precision (FP16). Due to its simple decoding process (a feedforward operation), NeRV shows a great advantage, even over the carefully optimized H.264. Further speedup can be expected by running the quantized model on special hardware. All the other video compression methods have two types of frames: key and interval frames. A key frame can be reconstructed from its encoded feature only, while the reconstruction of an interval frame also depends on the reconstructed key frames. Since most video frames are interval frames, their decoding needs to be done sequentially after the reconstruction of the respective key frames. In contrast, NeRV can output the frame at any time index independently, making parallel decoding much simpler. This can be viewed as a distinct advantage over other methods.
4.4 Video Denoising
We apply several common noise patterns to the original video and train the model on the perturbed frames. During training, no masks or noise locations are provided to the model, i.e., the target of the model is the noisy frames, and the model has no extra signal indicating whether the input is noisy or not. Surprisingly, our model tends to avoid the influence of the noise and regularizes it implicitly, with little harm to the compression task, which can serve well for most partially distorted videos in practice.
The results are compared with some standard denoising methods, including Gaussian, uniform, and median filtering. These can be viewed as a denoising upper bound for any additional compression process. As listed in Table 5, the PSNR of the NeRV output is usually much higher than that of the noisy frames, although NeRV is trained on the noisy target in a fully supervised manner, and it reaches an acceptable level for general denoising purposes. Specifically, median filtering has the best performance among the traditional denoising techniques, while NeRV outperforms it in most cases, or is at least comparable, without any extra denoising design in either the architecture or the training strategy.
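For reference, the strongest traditional baseline above, median filtering, can be sketched in a few lines. This is a generic 3×3 median filter on a tiny grayscale image with two impulse-noise pixels; the window size, border handling, and test image are assumptions of this example and not the settings behind Table 5.

```python
from statistics import median


def median_filter_3x3(img):
    """Replace each pixel by the median of its 3x3 neighbourhood (cropped at borders)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            window = [img[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = median(window)
    return out


clean = [[0.5] * 5 for _ in range(5)]  # flat gray image
noisy = [row[:] for row in clean]
noisy[1][1] = 1.0                      # "salt" pixel
noisy[3][2] = 0.0                      # "pepper" pixel
restored = median_filter_3x3(noisy)
print(restored == clean)
```

The filter removes isolated outliers because they never form the majority of a window, which is why it does well on salt-and-pepper style noise.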

| Methods | FPS |
|---|---|
| Habibian et al. [6] | |
| Wu et al. [59] | |
| Rippel et al. [41] | 1 |
| DVC [31] | 1.8 |
| Liu et al. [29] | 3 |
| H.264 [58] | 9.2 |
| NeRV (FP32) | 5.6 |
| NeRV (FP16) | 12.5 |

| Noise | white | black | salt & pepper | random | Average |
|---|---|---|---|---|---|
| Baseline | 27.85 | 28.29 | 27.95 | 30.95 | 28.74 |
| Gaussian | 30.27 | 30.14 | 30.23 | 30.99 | 30.41 |
| Uniform | 29.11 | 29.06 | 29.10 | 29.63 | 29.22 |
| Median | 33.89 | 33.84 | 33.87 | 33.89 | 33.87 |
| Minimum | 20.55 | 16.60 | 18.09 | 18.20 | 18.36 |
| Maximum | 16.16 | 20.26 | 17.69 | 17.83 | 17.99 |
| NeRV | 33.31 | 34.20 | 34.17 | 34.80 | 34.12 |

We also compare NeRV with another neural-network-based denoising method, Deep Image Prior (DIP) [50]. Although the main target of DIP is image denoising, NeRV outperforms it both qualitatively and quantitatively, as demonstrated in Figure 10. The main difference between them is that the denoising ability of DIP only comes from the architecture prior, while the denoising ability of NeRV comes from both the architecture prior and the data prior. DIP emphasizes that its image prior is captured only by the network structure of convolution operations, because it is fed only a single image. In contrast, the training data of NeRV contain many video frames, which share much visual content and many consistencies. As a result, for NeRV the image prior is captured by both the network structure and the training data statistics. DIP relies heavily on a good early stopping strategy to prevent it from overfitting to the noise. Without a noise prior, it has to be used with a fixed number of iterations, which does not generalize easily to arbitrary kinds of noise such as those mentioned above. By contrast, NeRV handles this naturally by simply continuing to train, because the full set of consecutive video frames provides a strong regularization of image content over noise.
4.5 Ablation Studies
Finally, we provide ablation studies on the UVG dataset. PSNR and MS-SSIM are adopted for evaluation of the reconstructed videos.
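For readers less familiar with the first of these metrics, PSNR follows directly from the mean squared error between a reconstructed frame and its ground truth. The sketch below uses the standard definition for intensities in $[0, 1]$ on flat pixel lists; the function name and the toy inputs are illustrative.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by 0.1, so MSE is 0.01 and PSNR is about 20 dB.
value = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
print(round(value, 2))
```

Because the scale is logarithmic, each 10 dB gain corresponds to a tenfold reduction in mean squared error.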
Input embedding.
In Table 6, PE denotes positional encoding as in Equation 1, and None denotes taking the frame index as input directly. PE greatly improves over the None baseline. Similar findings are reported in [33]: without any input embedding, the model cannot learn high-frequency information, resulting in much lower performance.
Upscale layer.
In Table 7, we show results for three different upscale methods, i.e., Bilinear Pooling, Transpose Convolution, and PixelShuffle [43]. With similar model sizes, PixelShuffle shows the best results. Please note that although Transpose Convolution [12] reaches comparable results, it greatly slows down training compared to PixelShuffle.
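To make the best-performing upscale method concrete, the sketch below implements the PixelShuffle rearrangement on nested Python lists: a tensor of shape $(C r^2, H, W)$ is reorganized into $(C, rH, rW)$, trading channels for spatial resolution. It follows the channel ordering used by PyTorch's `torch.nn.PixelShuffle`; the pure-Python function is a teaching sketch, not the layer used in the experiments.

```python
def pixel_shuffle(x, r):
    """Rearrange x of shape (C*r*r, H, W) into shape (C, H*r, W*r)."""
    channels_in, h, w = len(x), len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r output cell
            for j in range(r):      # column offset inside each r x r output cell
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1x1 channels become one 2x2 image when r = 2.
x = [[[k]] for k in range(4)]
y = pixel_shuffle(x, 2)
print(y)
```

The operation has no learnable parameters; the preceding convolution produces the $C r^2$ channels, so upscaling costs only a memory rearrangement.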
Normalization layer.
In Table 8, we apply common normalization layers in the NeRV block. The default setup, without a normalization layer, reaches the best performance and runs slightly faster. We hypothesize that the normalization layer reduces the overfitting capability of the neural network, which runs counter to our training objective.
Activation layer.
Table 9 shows results for common activation layers. The GELU [19] activation function achieves the highest performance and is adopted as our default design.
Loss objective.
We show the loss objective ablation in Table 10, reporting results for different combinations of L2, L1, and SSIM losses. Although adopting SSIM alone produces the highest MS-SSIM score, the combination of L1 loss and SSIM loss achieves the best trade-off between PSNR and MS-SSIM.
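
| Upscale layer | PSNR | MS-SSIM |
|---|---|---|
| Bilinear Pooling | 29.56 | 0.873 |
| Transpose Conv | 36.63 | 0.967 |
| PixelShuffle | 37.26 | 0.970 |

| Normalization layer | PSNR | MS-SSIM |
|---|---|---|
| BatchNorm | 36.71 | 0.971 |
| InstanceNorm | 35.5 | 0.963 |
| None | 37.26 | 0.970 |

| Activation layer | PSNR | MS-SSIM |
|---|---|---|
| ReLU | 35.89 | 0.963 |
| Leaky ReLU | 36.76 | 0.968 |
| Swish | 37.08 | 0.969 |
| GELU | 37.26 | 0.970 |
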
5 Discussion
Conclusion. In this work, we present a novel neural representation for videos, NeRV, which encodes videos into neural networks. Our key insight is that by directly training a neural network to map a video frame index to the corresponding RGB image, we can use the weights of the model to represent the video, which is totally different from conventional representations that treat videos as consecutive frame sequences. With such a representation, we show that by simply applying general model compression techniques, NeRV can match the performance of traditional video compression approaches on the video compression task, without the need to design a long and complex pipeline. We also show that NeRV can outperform standard denoising methods. We hope that this paper can inspire further research into novel classes of methods for video representation.
Limitations and Future Work. There are some limitations to the proposed NeRV. First, to achieve comparable PSNR and MS-SSIM performance, the training time of our approach is longer than the encoding time of traditional video compression methods. Second, the architecture design of NeRV is not yet optimal; we believe more exploration of the neural architecture design can achieve higher performance. Finally, more advanced and cutting-edge model compression methods can be applied to NeRV to obtain higher compression ratios.
Acknowledgement. This project was partially funded by the DARPA SAIL-ON (W911NF2020009) program, an independent grant from Facebook AI, and an Amazon Research Award to AS.
References

[1]
(2020)
Scalespace flow for endtoend optimized video compression.
In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pp. 8503–8512. Cited by: §2, §4.3.  [2] (1974) Discrete cosine transform. IEEE transactions on Computers 100 (1), pp. 90–93. Cited by: §2.
 [3] (1992) Image coding using wavelet transform. IEEE Transactions on image processing 1 (2), pp. 205–220. Cited by: §2.
 [4] (2013) Do deep nets really need to be deep?. arXiv preprint arXiv:1312.6184. Cited by: §2.
 [5] (2018) Scalable methods for 8bit training of neural networks. arXiv preprint arXiv:1805.11046. Cited by: §3.2.
 [6] (2020) Deep local shapes: learning local sdf priors for detailed 3d reconstruction. In European Conference on Computer Vision, pp. 608–625. Cited by: §2, Table 5.
 [7] (2017) Learning efficient object detection models with knowledge distillation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 742–751. Cited by: §2.
 [8] (2019) Learning image and video compression through spatialtemporal energy compaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10071–10080. Cited by: §2.

[9]
(2016)
Group equivariant convolutional networks.
In
International conference on machine learning
, pp. 2990–2999. Cited by: §2.  [10] (2014) Exploiting linear structure within convolutional networks for efficient evaluation. arXiv preprint arXiv:1404.0736. Cited by: §2.

[11]
(2016)
Exploiting cyclic symmetry in convolutional neural networks
. In International conference on machine learning, pp. 1889–1898. Cited by: §2.  [12] (2016) A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285. Cited by: §4.5.
 [13] (2021) COIN: compression with implicit neural representations. arXiv preprint arXiv:2103.03123. Cited by: §2.
 [14] (2020) Adaptive gradient quantization for dataparallel sgd. arXiv preprint arXiv:2010.12460. Cited by: §3.2.
 [15] (2019) Deep structured implicit functions.. Cited by: §2.
 [16] (2019) Learning shape templates with structured implicit functions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7154–7164. Cited by: §2.
 [17] (2015) Deep learning with limited numerical precision. In International conference on machine learning, pp. 1737–1746. Cited by: §2.
 [18] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §2.
 [19] (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415. Cited by: §4.5.
 [20] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
 [21] (1989) Multilayer feedforward networks are universal approximators. Neural networks 2 (5), pp. 359–366. Cited by: §1, §3.1.
 [22] (1952) A method for the construction of minimumredundancy codes. Proceedings of the IRE 40 (9), pp. 1098–1101. Cited by: §3.2.
 [23] (2018) Quantization and training of neural networks for efficient integerarithmeticonly inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §2, §3.2.
 [24] (2014) Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866. Cited by: §2.
 [25] (2020) Local implicit grid representations for 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6001–6010. Cited by: §2.
 [26] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
 [27] (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342. Cited by: §2.
 [28] (1991) MPEG: a video compression standard for multimedia applications. Communications of the ACM 34 (4), pp. 46–58. Cited by: §2.
 [29] (2020) Conditional entropy coding for efficient video compression. In ECCV, Cited by: Table 5.

[30]
(2016)
Sgdr: stochastic gradient descent with warm restarts
. arXiv preprint arXiv:1608.03983. Cited by: §4.1.  [31] (2019) Dvc: an endtoend deep video compression framework. In CVPR, Cited by: Table 1, Table 5.
 [32] (2020) UVG dataset: 50/120fps 4k sequences for video codec analysis and development. In Proceedings of the 11th ACM Multimedia Systems Conference, pp. 297–302. Cited by: §1, §4.1.
 [33] (2020) Nerf: representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pp. 405–421. Cited by: Table 1, §2, §3.1, §4.2, §4.5, Table 3.
 [34] (2020) Differentiable volumetric rendering: learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3504–3515. Cited by: §2.
 [35] (2019) Texture fields: learning texture representations in function space. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4531–4540. Cited by: §2.
 [36] (2019) PyTorch: an imperative style, highperformance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'AlchéBuc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §4.1.
 [37] (2020) Convolutional occupancy networks. arXiv preprint arXiv:2003.04618 2. Cited by: §2.
 [38] (2018) Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668. Cited by: §2.
 [39] (2019) On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301–5310. Cited by: §3.1.
 [40] (2013) Learning separable filters. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2754–2761. Cited by: §2.
 [41] (2019) Learned video compression. In ICCV, Cited by: Table 5.

 [42] (2016) Understanding and improving convolutional neural networks via concatenated rectified linear units. In international conference on machine learning, pp. 2217–2225. Cited by: §2.
 [43] (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1874–1883. Cited by: §3.1, §4.5.
 [44] (2020) Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems 33. Cited by: §1, §4.2, Table 3.
 [45] (2019) Scene representation networks: continuous 3d-structure-aware neural scene representations. arXiv preprint arXiv:1906.01618. Cited by: §2.
 [46] (2001) The jpeg 2000 still image compression standard. IEEE Signal processing magazine 18 (5), pp. 36–58. Cited by: §2.
 [47] (2012) Overview of the high efficiency video coding (hevc) standard. IEEE Transactions on circuits and systems for video technology 22 (12), pp. 1649–1668. Cited by: 3rd item, Table 1, §2, §4.3.
 [48] (2020) Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS. Cited by: §1, §3.1.
 [49] (2006) Converting video formats with ffmpeg. Linux Journal 2006 (146), pp. 10. Cited by: §A.3.
 [50] (2018) Deep image prior. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9446–9454. Cited by: Figure 10, §4.4.
 [51] (2011) Improving the speed of neural networks on cpus. Cited by: §2.
 [52] (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §3.1.
 [53] (1992) The jpeg still picture compression standard. IEEE transactions on consumer electronics 38 (1), pp. xviii–xxxiv. Cited by: §2.
 [54] (2016) MCL-jcv: a jnd-based h. 264/avc video quality assessment dataset. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 1509–1513. Cited by: §A.2, §4.1.
 [55] (2018) Training deep neural networks with 8-bit floating point numbers. arXiv preprint arXiv:1812.08011. Cited by: §3.2.
 [56] (2003) Multiscale structural similarity for image quality assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2, pp. 1398–1402. Cited by: §4.1.
 [57] (2016) Learning structured sparsity in deep neural networks. arXiv preprint arXiv:1608.03665. Cited by: §2.
 [58] (2003) Overview of the h. 264/avc video coding standard. IEEE Transactions on circuits and systems for video technology 13 (7), pp. 560–576. Cited by: 3rd item, §2, §4.3, Table 5.
 [59] (2018) Video compression through image interpolation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 416–431. Cited by: §2, §4.3, Table 5.
 [60] (2020) Learning for video compression with hierarchical quality and recurrent enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6628–6637. Cited by: §2, §4.3.

 [61] (2020) Hierarchical autoregressive modeling for neural video compression. arXiv preprint arXiv:2010.10258. Cited by: §4.3.
 [62] (2016) Doubly convolutional neural networks. arXiv preprint arXiv:1610.09716. Cited by: §2.
Appendix A Appendix
A.1 NeRV Architecture
We provide the architecture details in Table 11. On a video, given the timestamp index , we first apply a 2-layer MLP to the output of the positional encoding layer, then stack 5 NeRV blocks with upscale factors 5, 3, 2, 2, 2 respectively. In the UVG experiments on the video compression task, we train models of different sizes by changing the value of to (48,384), (64,512), (128,512), (128,768), (128,1024), (192,1536), and (256,2048).
Layer  Modules  Upscale Factor
0  Positional Encoding  -
1  MLP & Reshape  -
2  NeRV block  5
3  NeRV block  3
4  NeRV block  2
5  NeRV block  2
6  NeRV block  2
7  Head layer  -
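As a quick sanity check on the upscale factors 5, 3, 2, 2, 2 given in the text, the sketch below multiplies them to get the total spatial upscaling and derives the feature-map size the MLP output would need to be reshaped to. The 1920×1080 target resolution and the helper name `initial_feature_size` are assumptions made for illustration; they are not taken from the architecture description.

```python
from math import prod

# Upscale factors of the five NeRV blocks, as listed in the text.
UPSCALE_FACTORS = (5, 3, 2, 2, 2)


def initial_feature_size(width, height, factors=UPSCALE_FACTORS):
    """Return the (width, height) of the feature map before the first block.

    Hypothetical helper: assumes every block scales both spatial
    dimensions by its factor, so the total scale is the product.
    """
    total = prod(factors)
    if width % total or height % total:
        raise ValueError(f"{width}x{height} is not divisible by {total}")
    return width // total, height // total


total_upscale = prod(UPSCALE_FACTORS)
print(total_upscale)                     # 120
print(initial_feature_size(1920, 1080))  # (16, 9), assuming a 1080p target
```

Under that assumption, the five blocks together upscale by 120×, so a 16×9 feature map grows to 1920×1080.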
A.2 Results on MCL-JCV dataset
A.3 Implementation Details of Baselines
Following prior works, we used ffmpeg [49] to produce the evaluation metrics for H.264 and HEVC.
First, we use the following command to extract frames from original YUV videos, as well as compressed videos to calculate metrics:
Then we use the following commands to compress videos with H.264 or HEVC codec under medium settings:
where FILE is the filename, CRF is the Constant Rate Factor value, and EXT is the video container format extension.
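For illustration only, the sketch below builds ffmpeg invocations of the kind described: one to dump frames as PNG files and one to encode them with a CRF value under the medium preset. It is not a reproduction of the commands used for the baselines. The `libx264`/`libx265` encoders, the `f%05d.png` frame pattern, the `.y4m` input, and the helper names are all assumptions, and any further options the original commands may have used are not reflected here.

```python
import shlex

# Assumed mapping from codec name to an ffmpeg encoder.
ENCODERS = {"h264": "libx264", "hevc": "libx265"}


def extract_frames_cmd(video_path, frame_dir):
    """Command that writes every frame of video_path as a numbered PNG."""
    return ["ffmpeg", "-i", video_path, f"{frame_dir}/f%05d.png"]


def compress_cmd(frame_dir, file_stem, codec, crf, ext):
    """Command that encodes the PNG sequence at the given CRF, medium preset."""
    return [
        "ffmpeg", "-i", f"{frame_dir}/f%05d.png",
        "-c:v", ENCODERS[codec],
        "-preset", "medium",
        "-crf", str(crf),
        f"{file_stem}.{ext}",
    ]


# Print the commands instead of running them, so no ffmpeg install is needed.
print(shlex.join(extract_frames_cmd("FILE.y4m", "FILE")))
print(shlex.join(compress_cmd("FILE", "FILE", "hevc", 28, "mp4")))
```

The lists can be passed to `subprocess.run(cmd, check=True)` to execute them; here FILE, the CRF value 28, and the `mp4` extension are placeholders.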
A.4 Video Temporal Interpolation
We also explore NeRV for the video temporal interpolation task. Specifically, we train our model on a subset of frames sampled from one video, and then use the trained model to predict unseen frames given an unseen interpolated frame index. As shown in Figure 12, NeRV gives quite reasonable predictions for the unseen frame, with visual quality comparable to that of the adjacent seen frames.
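The train/held-out setup described above could be sketched as follows. The sampling pattern (holding out every fifth frame) and the normalization of the frame index by the frame count are assumptions chosen for illustration; the text does not specify how the training subset is sampled.

```python
def split_frame_indices(num_frames, holdout_every=5):
    """Split frame indices into seen (training) and unseen (held-out) sets.

    Hypothetical scheme: the last frame of each group of `holdout_every`
    consecutive frames is held out; all other frames are used for training.
    """
    train, held_out = [], []
    for i in range(num_frames):
        if i % holdout_every == holdout_every - 1:
            held_out.append(i)
        else:
            train.append(i)
    return train, held_out


def normalized_timestamp(index, num_frames):
    """Map a frame index to a timestamp in [0, 1) (assumed normalization)."""
    return index / num_frames


train_idx, unseen_idx = split_frame_indices(10, holdout_every=5)
print(train_idx)   # [0, 1, 2, 3, 5, 6, 7, 8]
print(unseen_idx)  # [4, 9]
```

A model fit only on `train_idx` would then be queried at the timestamps of `unseen_idx`, and its outputs compared with the true held-out frames.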
A.5 More Visualizations
We provide more qualitative visualization results in Figure 13 to compare NeRV with H.265 on the video compression task. We test a smaller model on the “Bosphorus” video, and it also performs better than the H.265 codec at a similar BPP. The zoomed areas show that our model produces fewer artifacts and a smoother output.
Broader Impact.
As the most popular media format nowadays, videos are generally viewed as sequences of frames. In contrast, our proposed NeRV represents videos as a function of time, parameterized by a neural network, which is more efficient and might be used in many video-related tasks, such as video compression and video denoising. This could potentially save bandwidth and speed up media streaming, enriching entertainment possibilities. Unfortunately, like many advances in deep learning for videos, this approach can be utilized for a variety of purposes beyond our control.