During the last two decades, video has become the dominant form of communication in the digital society, leading to explosive growth where video content now accounts for more than 80% of global data traffic. The basic (lossy) video compression objective consists of transmitting as few bits as possible (i.e. minimizing rate) while representing the input sequence at a certain level of fidelity (i.e. distortion). Video is now consumed on heterogeneous devices ranging from TV sets to smartphones. Furthermore, real-time video conferencing has become a household technology, pervasive in work and educational environments. These practical scenarios impose additional constraints on the design of video codecs, such as dynamically controllable rate, low computational and memory footprint, and low latency. Together with the rate and distortion objectives, they constitute the more challenging problem of practical video compression.
In parallel, the deep learning revolution has motivated a new compression paradigm based on parametric encoders and decoders implemented as deep neural networks optimized with data. This compression approach has been applied successfully first to images [2016arXiv161101704B, 2018arXiv180201436B, 2018arXiv180902736M] and then to videos [DVC, HLVC]. This paradigm contrasts with the traditional hybrid video coding paradigm, based on block-based linear transforms and carefully engineered coding tools (e.g. H.264/AVC, H.265/HEVC). Focusing on improving rate-distortion performance, most neural image and video codecs are impractical, since they require heavy and complex networks. In contrast, practical aspects have always been carefully considered in the design of traditional codecs. Unlike previous works, our paper focuses chiefly on these practical constraints, proposing a lightweight and flexible design for practical neural video compression.
Our design is based on a slimmable autoencoder augmented with a slimmable temporal entropy model, and is motivated by two recent works. Based on the empirical observation that lower rates do not require the use of full capacity, Yang et al. proposed the slimmable compressive autoencoder (SlimCAE) architecture, where slimming becomes a flexible mechanism to both vary the rate-distortion tradeoff and control complexity. However, extending SlimCAE to video by including temporal prediction is not trivial, since most designs require additional modules to estimate and compensate motion (e.g. optical flow networks, motion compensation networks). Slimmable designs of such modules are not straightforward, nor is their potential interplay with other elements of the compression framework. Recently, Sun et al. [STEM] proposed the spatiotemporal entropy model (STEM), a motion-free framework where temporal prediction is performed directly in the entropy model, without any motion estimation or compensation. In our framework we adopt part of STEM's entropy model and propose a slimmable version, thus obtaining a fully slimmable codec.
In summary, this work contributes a novel slimmable video codec (SlimVC) designed to address practical challenges in the neural video compression paradigm via a simple slimming mechanism. Experiments show that our slimmable model can effectively exploit temporal redundancy without a significant drop in RD performance compared to that of independent models.
2 SlimCAE and STEM
2.1 Slimmable compressive autoencoder
Neural image codecs are typically implemented as compressive autoencoders (CAEs) [theis2017lossy, 2016arXiv161101704B], consisting of autoencoders augmented with quantization and entropy coding. The encoder $f_\theta$, parametrized by $\theta$, transforms the input image $x$ into a latent representation $y = f_\theta(x)$, which is then quantized as $\hat{y} = Q(y)$, and the entropy encoder maps it to the bitstream $b$. On the decoder side, $b$ is mapped back to the reconstructed latent representation $\hat{y}$, and the decoder $g_\phi$, parametrized by $\phi$, recovers the reconstructed image $\hat{x} = g_\phi(\hat{y})$. During training, quantization is replaced by a differentiable proxy (additive uniform noise, in our case), entropy coding is bypassed, and the rate is approximated by the entropy of the latent representation. This requires a model of the probability distribution $p_\nu(\hat{y})$, parametrized by $\nu$. This model, usually referred to as the entropy model, has been the source of many improvements in RD performance, e.g. by including hyperpriors [2018arXiv180201436B, 2018arXiv180902736M].
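This training-time relaxation can be illustrated with a toy NumPy sketch (the actual codec uses deep networks; the factorized Gaussian entropy model and all names below are illustrative assumptions, not the paper's implementation):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def quantize_train(y):
    """Differentiable proxy for rounding: add uniform noise in [-0.5, 0.5)."""
    return y + rng.uniform(-0.5, 0.5, size=y.shape)

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def rate_bits(y_tilde, mu=0.0, sigma=1.0):
    """Approximate rate: -log2 of the probability mass in each unit-width bin
    under a factorized Gaussian entropy model, summed over all elements."""
    bits = 0.0
    for v in y_tilde.ravel():
        p = gaussian_cdf(v + 0.5, mu, sigma) - gaussian_cdf(v - 0.5, mu, sigma)
        bits += -np.log2(max(p, 1e-12))
    return bits

y = rng.standard_normal((4, 4))  # stand-in for the encoder output f_theta(x)
y_tilde = quantize_train(y)      # noisy latent used at training time
R = rate_bits(y_tilde)           # differentiable rate estimate, in bits
```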
CAEs are typically trained by minimizing an RD objective
$$\min_{\theta,\phi,\nu}\; \mathbb{E}_{x\in\mathcal{X}}\left[ R(\hat{y}) + \lambda\, D(x,\hat{x}) \right],$$
where $\mathcal{X}$ is the set of training images and $\lambda$ controls the tradeoff between the rate $R(\hat{y})$ of the latent representations and the distortion $D(x,\hat{x})$ between input and reconstructed images, averaged over $\mathcal{X}$.
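Numerically, the objective is just a weighted sum per operating point; a minimal sketch (the rate, distortion, and $\lambda$ values below are made up for illustration):

```python
def rd_loss(rate, distortion, lam):
    """Rate-distortion objective: L = R + lambda * D."""
    return rate + lam * distortion

# a higher lambda penalizes distortion more, pushing the codec toward
# spending more bits for higher fidelity
low_rate_point = rd_loss(rate=0.2, distortion=0.004, lam=50.0)    # ~0.4
high_rate_point = rd_loss(rate=0.8, distortion=0.001, lam=800.0)  # ~1.6
```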
A slimmable compressive autoencoder (SlimCAE) is a CAE whose layers are slimmable: slimmable layers can discard part of their parameters while still performing a valid operation. This results in less expressiveness, but also a lower memory footprint and less computation. The SlimCAE contains $K$ sub-models, each of which is determined by a set of parameters $(\theta^{(k)},\phi^{(k)},\nu^{(k)})$, where $k \in \{1,\dots,K\}$. The parameters of the $(k{+}1)$-th sub-model are a superset of the parameters of the $k$-th sub-model. Finally, the $K$ sub-models are trained jointly using a joint loss
$$\min_{\{\theta^{(k)},\phi^{(k)},\nu^{(k)}\}}\; \sum_{k=1}^{K} \mathbb{E}_{x\in\mathcal{X}}\left[ R\big(\hat{y}^{(k)}\big) + \lambda_k\, D\big(x,\hat{x}^{(k)}\big) \right].$$
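The weight-sharing mechanism can be sketched with a toy fully-connected layer in NumPy (a stand-in for the slimmable convolutions; the class name and channel counts are illustrative, not the paper's):

```python
import numpy as np

class SlimmableDense:
    """Toy slimmable layer: a sub-model at a given width uses only the first
    width-fraction of input/output channels, so every smaller sub-model
    reuses a slice of the full model's weights (no duplicated parameters)."""

    def __init__(self, in_ch, out_ch, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.1 * rng.standard_normal((out_ch, in_ch))

    def forward(self, x, width=1.0):
        o = int(round(self.W.shape[0] * width))
        i = int(round(self.W.shape[1] * width))
        return self.W[:o, :i] @ x[:i]   # slimmed layers skip the rest

layer = SlimmableDense(8, 8)
x = np.ones(8)
y_full = layer.forward(x, width=1.0)  # full sub-model: 8 output channels
y_slim = layer.forward(x, width=0.5)  # slim sub-model: 4 output channels,
                                      # computed with a quarter of the MACs
```

Joint training then simply sums the RD losses obtained by running each of the $K$ widths through such shared layers.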
The authors showed that if the values $\lambda_k$ are determined properly for the specific sub-models, SlimCAE can achieve roughly the same RD performance as independent models, each optimized for a single fixed $\lambda$.
2.2 Spatiotemporal entropy model
Sun et al. [STEM] proposed a motion-free video compression method, observing that inter-frame redundancy can be exploited efficiently in the entropy module via a spatiotemporal entropy model (STEM) [STEM], without requiring motion estimation. In this model, the hyperencoder (HE) of the hyperprior receives the latent representations $y_t$ and $y_{t-1}$ of both the current frame and the previous one, allowing it to exploit temporal redundancy and reducing the rate of the side information received by the hyperdecoder (HD). In addition, only the residual latent is transmitted in the bitstream. In order to obtain more accurate distribution models, while further exploiting spatial and temporal redundancy, STEM includes a spatial prior module (SPM) and a temporal prior module (TPM), together with an entropy parameters module (EPM) that fuses their information and predicts the actual distribution parameters at time $t$.
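The intuition behind motion-free temporal prediction can be shown with a toy NumPy sketch (the real HE/HD/TPM/EPM are CNNs; the residual definition and scale prediction below are simplifications for illustration, not STEM's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_step(y_t, y_prev):
    """Code only the difference between consecutive latents: when frames are
    similar, the residual concentrates near zero and costs fewer bits."""
    r_t = y_t - y_prev          # residual latent (what gets transmitted)
    sigma = np.std(r_t) + 1e-6  # stand-in for the EPM's predicted scale
    return r_t, sigma

y_prev = rng.standard_normal(256)              # latent of frame t-1
y_t = y_prev + 0.1 * rng.standard_normal(256)  # temporally correlated latent
r_t, sigma = temporal_step(y_t, y_prev)
print((r_t ** 2).mean() < (y_t ** 2).mean())   # True: residual energy is
                                               # far below the raw latent's
```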
SPM is an autoregressive PixelCNN-like network that provides a relatively minor gain in RD performance at a significantly increased computational cost and, in particular, a two-orders-of-magnitude increase in latency (from tenths of a second to tens of seconds, both reported by [STEM] and verified in our implementation). For these practical reasons, we chose not to include SPM in our framework.
3 Fully slimmable framework
The proposed framework is shown in Fig. 1, where all trainable modules are designed to be slimmable (we use switchable GDNs), including both the feature autoencoder (i.e. SlimFE, SlimFD) and the entropy model (i.e. SlimHE, SlimHD, SlimTPM, SlimEPM). For simplicity, we assume uniform slimming, that is, the width (i.e. number of channels) of every slimmable layer is slimmed by the same factor (we use the same factors as in SlimCAE, i.e. [0.25, 0.375, 0.5, 0.75, 1]). Table 1 provides more details about the architecture of the slimmable modules. SlimVC is trained in two stages. First, we train it as an image-based SlimCAE with a hyperprior. We then discard the hyperprior and add the remaining modules of SlimVC (note that SlimCAE's hyperprior is image-based, while SlimHE and SlimHD have distinct architectures and are designed for pairs of frames). Finally, we fix SlimFE and SlimFD and train the remaining slimmable modules.
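Uniform slimming means a single width factor sets every layer's channel count. With an illustrative full width of 192 channels (an assumption for the example, not the paper's figure), the five sub-models would use:

```python
def channels_at(width, full_channels=192):
    """Uniform slimming: all slimmable layers keep the same channel fraction."""
    return int(round(width * full_channels))

widths = [0.25, 0.375, 0.5, 0.75, 1.0]
print([channels_at(w) for w in widths])  # [48, 72, 96, 144, 192]
```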
(Table 1 legend: Leaky ReLU.)
4 Experiments
4.1 Experimental settings
Datasets and training details
We use Open Images [openimages] and CLIC [CLIC] as training datasets during the first training stage, with random crops and a batch size of 16 crops. For the second stage, we use short sequences from the Vimeo-90k dataset [2017arXiv171109078X], in pixel crops and a batch size of 32 crops. The model has five RD operating points (corresponding to the widths [0.25, 0.375, 0.5, 0.75, 1] mentioned earlier). We use a learning rate of 5e-5 and mean squared error (MSE) as the distortion metric.
SlimVC (GOP=): the proposed approach after the second stage of training, with a group of pictures of size . SlimVC (intra-only): the codec resulting from the first stage, which does not exploit temporal redundancy. Independent VCs (GOP=): the same architecture as SlimVC but with a single width, so each RD point corresponds to a different model trained independently for that specific RD tradeoff. For comparison we also include H.264, STEM [STEM] and DVC [DVC]. Note that DVC is significantly more complex, and uses motion estimation and compensation with temporal prediction in the pixel domain.
We compressed the first 100 frames of the HEVC Class B sequences and the Ultra Video Group test sequences [UVG], with GOP sizes of 10 and 12 pictures, respectively. The RD performance of the different methods is shown in Fig. 2 (we include the RD curve of STEM from [STEM] for reference, but note that the architectures are not directly comparable: the implementation of STEM in [STEM] uses encoders and decoders with four convolutional layers while we use three, and their entropy model leverages an autoregressive context model and an SPM, neither of which is used in our case). The proposed SlimVC has an RD performance very close to that of independently trained VCs, showing the benefit of SlimVC in terms of providing variable rate with one single model. Comparing with SlimVC (intra-only), we can see that the slimmable temporal entropy model and the second training stage are effective in consistently reducing the rate at all RD points (the SlimVC curves are shifted towards the left). RD performance is comparable to that of H.264, and remains below that of DVC, which is significantly more complex and lacks the flexibility of SlimVC (see next section). Moreover, the design of SlimVC still has considerable room for improvement in RD performance.
4.2.1 Memory and computational efficiency
We measured the efficiency of SlimVC and the other baselines in terms of computational cost (in floating point operations, FLOPs) and memory footprint (in MB) when processing 1080p input sequences (i.e. 1920×1080×3). Table 2 shows that SlimVC requires significantly fewer computations than the other video baselines, especially at lower rates, where the slimmable design avoids most of the computation, leading to very significant speedups (up to 20x at low rates).
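The large speedups at low rates follow from simple arithmetic: a convolution's cost scales with the product of its input and output channel counts, so slimming both by a factor w cuts FLOPs roughly by w². A back-of-the-envelope check (layer sizes here are illustrative, not the actual architecture):

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulate count of a stride-1 k x k convolution."""
    return h * w * k * k * c_in * c_out

full = conv_macs(1080, 1920, 3, 192, 192)  # width factor 1.0
slim = conv_macs(1080, 1920, 3, 48, 48)    # width factor 0.25 on both sides
print(full / slim)  # 16.0 -> consistent with speedups of up to ~20x at low rates
```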
Fig. 3 shows the detailed memory footprint of SlimVC and its different modules for the different widths, confirming that SlimVC is a lightweight method whose memory footprint can be gracefully adjusted depending on the rate needs. In contrast to SlimCAE, where the feature encoder and decoder were the main bottlenecks in terms of memory and computation, in SlimVC the most critical modules in this regard are those related to entropy modeling. In particular, the temporal prior module is the heaviest module of the codec.
|Methods|Low rate|Medium rate|High rate|
|STEM w/o SPM|613|613|613|
|STEM w/o SPM|1479|1479|1479|
Motivated by practical limitations of current neural video codecs, we propose the slimmable video codec (SlimVC), a novel adaptive architecture based on slimmable modules that provides significant savings in memory and computational cost at low and medium rates, together with variable rate control using one single video model. While SlimCAE showed that slimmable codecs are a promising approach to practical neural image compression, SlimVC further advances this potential for the case of practical neural video compression.