I Introduction
Image compression has been a thriving research area for decades, driven by the storage and transmission requirements of the applications that underpin our modern digital life. Image compression also appears in the form of intra-frame coding for video compression [21]. The rapid advances in inter-frame prediction make efficient intra-frame coding increasingly important, because intra-coded frames often dominate the bit rate of a compressed video. It is therefore highly desirable to achieve even higher image compression efficiency.
The state-of-the-art image compression methods, e.g. BPG [4] and VVC intra coding [27], usually involve block-based intra prediction, block-based transform coding of residuals, and context-adaptive binary arithmetic coding. Over the years, tremendous research effort has been invested to better every component in a way that seeks higher compression efficiency at the expense of an acceptable complexity increase. These handcrafted codecs, although achieving a good balance between compression efficiency and complexity, lack the opportunity to optimize all the components jointly in a seamless, end-to-end manner.
The rise of deep learning has recently spurred a new wave of developments in image compression, with end-to-end learned systems attracting much attention. Among them, the variational autoencoder (VAE)-based methods [23, 11, 5, 6] have achieved compression performance very close to that of the latest VVC intra coding. Different from traditional handcrafted codecs, the VAE-based methods usually implement an image-level nonlinear transform that converts an input image into a compact set of latent features, the dimensions of which are much smaller than those of the image. Ever since the advent of the first VAE-based scheme [1], several improvements have been made to the expressiveness [5, 6, 20] of the autoencoder and the efficiency of entropy coding [23, 11, 5, 6, 3, 19, 12]. Up to now, the VAE-based methods have become the mainstream approach to end-to-end learned image compression.

However, one issue with most VAE-based schemes is that the autoencoder is generally lossy. There is no guarantee that its nonlinear transform can reconstruct the input image losslessly, even without quantizing the latent features of the image. This is unlike the traditional transforms, such as the Discrete Cosine Transform and the Wavelet Transform, which have the desirable property of perfect reconstruction and allow the codec to offer a wide range of quality levels by merely changing the quantization step size.
Recently, flow-based models [9, 22] have emerged as attractive alternatives. These models have the striking feature of realizing a bijective and invertible mapping between the input image and its latent features via reversible networks composed of affine coupling layers [17, 8]. This invertibility is utilized to develop lossless image compression in [10], while the affine coupling layers are used in place of the lossy autoencoder in [9, 22] to achieve both lossy and lossless (or nearly-lossless) compression with a single unified model. The reversible networks, however, are quite distinct from the commonly used autoencoders, making these two types of compression systems incompatible with each other.
In this paper, we propose a novel end-to-end image compression system, termed ANFIC, based on Augmented Normalizing Flows (ANF) [13]. ANF is a new type of flow model that operates on an augmented input space to offer greater transformation ability than ordinary flow models. Our scheme ANFIC is motivated by the fact that ANF is a generalization of VAE that stacks multiple VAEs as a flow model. In a sense, this allows ANFIC to extend any existing VAE-based compression system in a flow-based framework and to enjoy the benefits of both approaches. ANFIC is novel and unique in that (1) it is distinguished from flow-based compression by operating in an augmented input space, being able to leverage the representation power of any VAE-based image compression, and (2) it is more general than VAE-based compression by allowing the VAE to be stacked and/or extended hierarchically.
Extensive experimental results on the Kodak, Tecnick, and CLIC validation datasets show that ANFIC performs comparably to or better than the state-of-the-art end-to-end image compression methods in terms of PSNR-RGB. It performs close to VVC intra coding [27] over a wide range of quality levels, from low-rate compression up to nearly-lossless compression. In particular, ANFIC achieves state-of-the-art performance among the competing methods when extended with conditional convolutional layers [7] for variable-rate compression with a single model.
Our main contributions are threefold:

We propose ANFIC, the first work that leverages VAE-based image compression in a flow-based framework.

We offer extensive ablation studies to understand and visualize the inner workings of ANFIC.

Extensive experimental results show that ANFIC is competitive with state-of-the-art image compression methods, both VAE-based and flow-based, over a wide range of quality levels, and performs close to VVC intra coding.
The remainder of this paper is organized as follows: Section II reviews VAE-based image compression and the basics of ANF. Section III elaborates on the design of ANFIC. Section IV compares ANFIC with the state-of-the-art methods in terms of objective compression performance and subjective image quality. Section V presents our ablation studies. Finally, we provide concluding remarks in Section VI.
II Related Work
In this paper, we propose an ANF-based image compression scheme, which can be viewed as an extension of VAE-based image compression. Hence, this section focuses on recent developments in VAE-based image compression and introduces the fundamentals of ANF to ease the understanding of our scheme.
II-A VAE-based Image Compression
VAE-based image compression [23, 11, 5, 6, 1, 3, 19] is the most popular approach to end-to-end learned image compression. Its training framework includes three major components: the analysis transform, the prior distribution, and the synthesis transform, all implemented by neural networks.
The analysis transform encodes the raw image $x$ through an encoding distribution $q(y \mid x)$, with the latent representation $y$ uniformly quantized as $\hat{y}$. The quantized latent $\hat{y}$ is then entropy encoded into a bitstream using a learned prior $p(\hat{y})$ implemented by a network. Finally, the synthesis transform reconstructs an approximation of the input $x$ from $\hat{y}$ through a decoding distribution $p(x \mid \hat{y})$.
All the network parameters are trained end-to-end by minimizing

$\mathcal{L} = \mathbb{E}_{x} \big[ -\log p(x \mid \hat{y}) - \log p(\hat{y}) \big] \qquad (1)$
where the first term, denoted by $D$, aims to minimize the negative log-likelihood of $x$ (i.e. the distortion), and the second term, denoted by $R$, minimizes the rate needed for signaling $\hat{y}$. In particular, it is shown that minimizing Eq. (1) amounts to maximizing the evidence lower bound (ELBO) of a latent variable model [16], which is specified by $p(x \mid \hat{y})$ and $p(\hat{y})$, with the encoding distribution $q(\hat{y} \mid x)$ taking a uniform distribution that models the effect of uniform quantization. In a more general setting, a hyperparameter $\lambda$ is introduced to balance between $D$ and $R$, yielding $\mathcal{L} = \lambda D + R$.

Ballé et al. [1] were the first to introduce the aforementioned VAE framework, together with a learned factorized prior, to image compression. In entropy coding the image latents, they assume the prior distribution over $\hat{y}$ to be factorial and learn it with a neural network. Their analysis and synthesis transforms are composed of convolutional neural networks and generalized divisive normalization (GDN) layers, which originate from [2].

Ever since the advent of the VAE-based compression framework, several efforts have been made to advance its coding efficiency. In particular, some works [23, 11, 5, 6, 3, 19, 12]
improve the prior estimation for better entropy coding, while others [5, 6, 20] address the analysis and synthesis transforms (referred to collectively as the autoencoding transform). We briefly summarize these efforts as follows.

Enhanced Prior Estimation: The prior distribution crucially determines the number of bits (i.e. the rate) needed to signal the quantized image latents $\hat{y}$. Recognizing the suboptimality of the factorized prior $p(\hat{y})$, where feature samples in every channel of $\hat{y}$ are independently and identically distributed, Ballé et al. [3] propose the notion of a hyperprior to model every feature sample separately by a Gaussian distribution. To this end, additional side information $\hat{z}$ is extracted from the image latent and sent to the decoder, making the density estimation of $\hat{y}$ dependent on the input. The $\hat{y}$ and $\hat{z}$ together form the latent representation of the input. The hyperprior thus bears the interpretation of factorizing the joint distribution $p(\hat{y}, \hat{z})$ as $p(\hat{y} \mid \hat{z}) \, p(\hat{z})$, where $p(\hat{y} \mid \hat{z})$ and $p(\hat{z})$ are assumed to be Gaussian and factorial, respectively. Hu et al. [11, 12] extend the idea to include more than one layer of hyperprior, leading to a factorization with a multi-layer hyperprior. In addition to the use of a hyperprior, Minnen et al. [23], Lee et al. [19], Chen et al. [5], and Cheng et al. [6] incorporate an autoregressive prior realized by 2D [23, 6, 19] or 3D [5] masked convolution [25], in order to utilize causal contextual information for better density estimation. In particular, Cheng et al. [6] model $p(\hat{y} \mid \hat{z})$ with a Gaussian mixture distribution instead of a single Gaussian.

Enhanced Autoencoding Transform: The capacity of the autoencoding transform determines its expressiveness. Chen et al. [5] add residual blocks to the autoencoder along with several non-local attention modules (NLAM). NLAM is shown to facilitate spatial bit allocation among coding areas of varied texture complexity. Unlike most VAE-based systems, which operate at the image level, the block-based autoencoder in [20] divides the input image into non-overlapping macroblocks, each of which contains multiple sub-blocks coded sequentially using recurrent analysis and synthesis transforms. It has the striking feature of allowing a high degree of computational parallelism at the macroblock level. In general, most autoencoders are not guaranteed to reconstruct the input perfectly, even when no quantization is involved.
II-B Flow-based Image Compression
Recently, flow-based models [17, 8, 15] have emerged as an attractive alternative to VAE [16] and other autoencoders. They are characterized by a bijective mapping between the input and its latent representation, ensuring that the input can be perfectly reconstructed from its latent in the absence of quantization. Ma et al. [22] make an interesting attempt to introduce lifting-based coupling layers, a specialized implementation of the additive coupling layers [17, 8] often used to construct flow models, as the analysis and synthesis backbone. In particular, they split an input image, first row-wise and then column-wise, into latent subbands, the resulting decomposition being similar to a 2D wavelet transform. Helminger et al. [9] also use additive coupling layers, but with factor-out splitting to generate a multi-scale image representation, as shown in Fig. 1(a). Their work extends the notion of integer discrete flows for lossless compression [10] to lossy compression. In common, these works show the potential of flow-based models to offer a wide range of quality levels, from low-rate compression to nearly-lossless or even lossless compression.
II-C Augmented Normalizing Flows (ANF)
The ANF model [13] is an invertible latent variable model. It is composed of multiple autoencoding transforms, each of which comprises a pair of encoding and decoding transforms, as depicted in Fig. 2(a). Consider the example of ANF with one autoencoding transform (i.e. one-step ANF). It converts the input $x$, coupled with an independent noise $e_z$, into the latent representation $(y, z)$ with one pair of encoding and decoding transforms:

$z = s^{enc}_{\pi}(x) \odot e_z + m^{enc}_{\pi}(x) \qquad (2)$

$y = \big( x - m^{dec}_{\pi}(z) \big) \oslash s^{dec}_{\pi}(z) \qquad (3)$

where $\pi$ refers collectively to the network parameters of the encoding and decoding transforms, and $\odot$, $\oslash$ denote element-wise multiplication and division. Compared with ordinary flow models, ANF augments the input with an independent noise. It is shown in [13] that the augmented input space allows a smoother transformation to the required latent space.
Multi-step ANF and Hierarchical ANF: From Fig. 2(a) and according to Eqs. (2) and (3), the encoding or decoding transform implements an invertible affine coupling layer. Stacking pairs of these coupling layers leads to an invertible network, termed multi-step ANF, with much greater capacity than one-step ANF. Another way to increase the model capacity is to augment more noise inputs, resulting in hierarchical ANF (see Fig. 2(b)). Notably, these two approaches can be combined in a flexible way for even higher model capacity.
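The invertibility of such stacked coupling layers (here in their purely additive form) can be verified with a small numpy sketch. The two "networks" below are arbitrary stand-ins, since inversion holds regardless of what functions are used:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for the learned shift networks (hypothetical, not ANFIC's).
def m_enc(x, k): return np.tanh(x) + k   # encoder shift at step k
def m_dec(z, k): return np.sin(z) * k    # decoder shift at step k

def anf_encode(x, e_z, steps=2):
    z = e_z
    for k in range(1, steps + 1):
        z = z + m_enc(x, k)   # encoding transform updates the noise branch
        x = x - m_dec(z, k)   # decoding transform updates the image branch
    return x, z               # (residual branch, latent branch)

def anf_decode(y, z, steps=2):
    for k in range(steps, 0, -1):   # undo the steps in reverse order
        y = y + m_dec(z, k)
        z = z - m_enc(y, k)
    return y, z

x, e_z = rng.normal(size=4), rng.normal(size=4)
y, z = anf_encode(x, e_z)
x_rec, e_rec = anf_decode(y, z)
assert np.allclose(x_rec, x) and np.allclose(e_rec, e_z)
```

Each step only adds a function of the *other* branch, so it can always be subtracted back out, which is exactly what makes the stacked network bijective.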
Training ANF: Like ordinary flow models, ANF can be trained by maximizing the augmented joint likelihood $p(x, e_z)$:

$\max_{\pi} \; \mathbb{E}_{x, e_z} \Big[ \log p\big(G_{\pi}(x, e_z)\big) + \log \Big| \det \tfrac{\partial G_{\pi}(x, e_z)}{\partial (x, e_z)} \Big| \Big] \qquad (4)$

where $G_{\pi}$ is the alternate composition of the encoding and decoding transforms parameterized by $\pi$, and $p$ represents the specified or learned prior distribution over the latents $(y, z)$. It is shown in [13] that maximizing the augmented joint likelihood in ANF amounts to maximizing a lower bound on the marginal likelihood $p(x)$, with the gap attributed to the model's incapability of modeling $x$ independently of $e_z$.
VAE as One-step ANF: Notably, a VAE can be viewed as a one-step ANF by (1) letting $e_z$ be a Gaussian noise, (2) transforming $e_z$ into $z$ via reparameterizing the VAE's encoding distribution $q(z \mid x)$, and (3) normalizing $x$ into $y$ via the VAE's decoding distribution $p(x \mid z)$. The resulting $y$ then follows a standard Gaussian, and so does the aggregated distribution of $z$ from various inputs $x$. Maximizing Eq. (4) for such a one-step ANF is shown in [13] to be identical to maximizing the ELBO of the VAE [16].
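The reparameterization view above can be sketched in a few lines of numpy. The mean/scale functions are hypothetical stand-ins for the VAE's encoder and decoder networks; the point is only that sampling from $q(z \mid x)$ is literally the encoding transform of Eq. (2) applied to a Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical "networks" for a toy 1-D VAE (not a trained model).
def enc_mean(x):  return 0.5 * x
def enc_scale(x): return np.full_like(x, 0.3)
def dec_mean(z):  return 2.0 * z
def dec_scale(z): return np.full_like(z, 0.1)

x = rng.normal(size=10_000)
e_z = rng.normal(size=10_000)

# Encoding transform = reparameterized sampling from q(z|x) = N(mu, sigma^2).
z = enc_scale(x) * e_z + enc_mean(x)
# Decoding transform = normalizing x by the decoding distribution p(x|z).
y = (x - dec_mean(z)) / dec_scale(z)

# z aggregates samples of q(z|x); for this toy model its mean is near 0.
assert abs(z.mean()) < 0.05
assert y.shape == x.shape
```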
III Proposed Method
Inspired by the fact that most learned image compression is VAE-based and that VAE is equivalent to one-step ANF, we propose an ANF-based image compression framework, termed ANFIC. We first outline the ANFIC framework in Section III-A, with a focus on how to extend VAE-based image compression with a hyperprior by multi-step and hierarchical ANF. This is followed by discussions on the entropy coding of the latent representation and the modeling of the prior distribution in ANFIC (Section III-B), and the training objective (Section III-C).
To the best of our knowledge, ANFIC is the first work that combines VAE and flow models in a unified framework. It is distinguished from flow-based compression in that it operates on an augmented input space (see Fig. 1(b)), being able to leverage the representation power of any existing VAE-based image compression. Moreover, ANFIC is more general than the VAE-based scheme by allowing it to be stacked and/or extended hierarchically (see Fig. 2).
III-A ANFIC Framework
Fig. 3 describes the framework of ANFIC. From bottom to top, it stacks two autoencoding transforms (i.e. two-step ANF), with the top one extended further to the right to form a hierarchical ANF [13] that implements the hyperprior. More autoencoding transforms can be added straightforwardly to create a multi-step ANF. In particular, the encoding and decoding transforms of each autoencoding step follow Eqs. (2) and (3), except that we make them purely additive by removing $s^{enc}_{\pi}$ and $s^{dec}_{\pi}$ for better convergence, as with some other flow-based schemes [9, 22].
The autoencoding transform of the hyperprior, which assumes each sample in the latent representation $z_2$ to be Gaussian, is defined as

$\hat{h}_2 = \big\lfloor h^{enc}_{\pi}(z_2) \big\rceil \qquad (5)$

$\hat{z}_2 = \big\lfloor z_2 - \mu \big\rceil + \mu, \quad \text{with } (\mu, \sigma) = h^{dec}_{\pi}(\hat{h}_2) \qquad (6)$

where $\lfloor \cdot \rceil$ (depicted as Q in Fig. 3) denotes nearest-integer rounding for quantizing the residual between $z_2$ and the predicted mean $\mu$ of the Gaussian distribution derived from the hyperprior $\hat{h}_2$. This part implements the autoregressive hyperprior in [23], with $z_2$ denoting the image latents whose distributions are signaled as the side information $\hat{h}_2$.
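The mean-subtracted rounding of Eq. (6) can be sketched as follows, with hypothetical values; by construction, the quantization error never exceeds half a step:

```python
import numpy as np

def quantize_residual(z2, mu):
    """Eq. (6) sketch: round the residual to the predicted mean mu
    from the hyperprior, then add mu back."""
    return np.round(z2 - mu) + mu

z2 = np.array([1.7, -0.2, 3.1])   # toy latents (hypothetical)
mu = np.array([1.5, 0.1, 2.9])    # toy predicted means (hypothetical)
z2_hat = quantize_residual(z2, mu)

# The reconstruction error is bounded by half a quantization step.
assert np.all(np.abs(z2_hat - z2) <= 0.5)
```

Only the integer residual needs to be entropy coded; the mean is re-derived from the side information at the decoder.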
The encoding of ANFIC proceeds by passing the augmented input $(x, e_z, e_h)$ through the autoencoding and hyperprior transforms, i.e. $G_{\pi}(x, e_z, e_h) = (y_2, \hat{z}_2, \hat{h}_2)$, to obtain the latent representation. In particular, $x$ represents the input image, $e_z$ denotes the augmented Gaussian noise, and $e_h$ simulates the additive quantization noise of the hyperprior. To achieve (lossy) compression, we want $\hat{z}_2$ and $\hat{h}_2$ to capture most of the information about the input $x$ and regularize $y_2$ during training to approximate noughts. As such, only $\hat{z}_2$ and $\hat{h}_2$ are entropy coded into bitstreams. Note that due to the volume-preserving property of ANF (or any flow model), $y_2$ has the same dimensionality as the input $x$, while that of $\hat{z}_2$ and $\hat{h}_2$ is usually much smaller, depending on the design choice. This flexibility allows us to incorporate any existing VAE-based compression scheme as one specific realization of the autoencoding transform in ANFIC. For example, the encoder of any VAE-based compression scheme can be used to implement the shift network of the encoding transform in Eq. (2); likewise, its decoder can realize the shift network of the decoding transform in Eq. (3). Note that we have assumed the use of additive coupling layers.
To decode the input $x$, we apply the inverse mapping $G^{-1}_{\pi}$ to the quantized latents $(y_2, \hat{z}_2, \hat{h}_2)$, where $y_2$ is set to noughts. In ANFIC, there are two sources of distortion that cause the reconstruction to be lossy: the quantization errors in $\hat{z}_2$ and $\hat{h}_2$, and the error of setting $y_2$ to noughts during the inverse operation. Essentially, ANFIC is an ANF model, which is bijective and invertible. The errors between the encoding latents and their quantized versions thus introduce distortion into the reconstructed image, as shown in Fig. 4.
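A toy numpy round trip (with a single additive coupling step standing in for the full two-step ANF; the shift functions are hypothetical) makes the two behaviors explicit: exact inversion when the residual branch is kept, and a lossy reconstruction when it is replaced by noughts:

```python
import numpy as np

rng = np.random.default_rng(3)

def m_enc(x): return np.tanh(x)   # toy encoder shift
def m_dec(z): return 0.9 * z      # toy decoder shift

def encode(x, e_z):
    z = e_z + m_enc(x)
    y2 = x - m_dec(z)      # residual branch, regularized toward noughts
    return y2, z

def decode(y2, z):
    return y2 + m_dec(z)   # inverse of the residual-branch update

x, e_z = rng.normal(size=6), rng.normal(size=6)
y2, z = encode(x, e_z)

# Exact inversion when y2 is kept ...
assert np.allclose(decode(y2, z), x)
# ... and a lossy reconstruction when y2 is replaced by noughts:
x_lossy = decode(np.zeros_like(y2), z)
assert np.allclose(np.abs(x_lossy - x), np.abs(y2))
```

The reconstruction error equals exactly the discarded residual, which is why training drives that residual toward noughts.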
To mitigate the effect of quantization errors on the decoded image quality, we incorporate a quality enhancement network at the end of the reverse path, as illustrated in Fig. 4. This enhancement network is an integral part of ANFIC, which is constrained by the fact that the analysis and synthesis transforms must share the same autoencoding transforms (i.e. invertible coupling layers). This constraint makes it difficult to learn a synthesis transform that can effectively compensate for quantization errors while maintaining invertibility. The same observation was made in [22]. In this paper, we adopt the same lightweight enhancement network as [22].
III-B Prior Distribution
The prior distribution of ANFIC refers to the joint distribution of the latents $(y_2, \hat{z}_2, \hat{h}_2)$, which, as in VAE-based schemes, plays a crucial role in determining the rate needed to signal the image latents. Rather than manually specifying the prior distribution, we adopt a parametric approach and learn it, for the sake of balancing between rate and distortion. As noted previously, ANFIC has the latent $\hat{z}_2$ and the hyperprior $\hat{h}_2$ capture most of the information of the input $x$. We thus require the residual latent $y_2$ to follow a zero-mean Gaussian with a small variance and to be independent of $(\hat{z}_2, \hat{h}_2)$. That is, the prior factorizes as:

$p(y_2, \hat{z}_2, \hat{h}_2) = p(y_2) \, p(\hat{z}_2 \mid \hat{h}_2) \, p(\hat{h}_2) \qquad (7)$

with

$p(y_2) = \mathcal{N}(0, \sigma_y^2 I) \qquad (8)$

and the remaining terms, $p(\hat{z}_2 \mid \hat{h}_2)$ and $p(\hat{h}_2)$, learned from data by neural networks.
Similar to VAE-based schemes [3], we assume $p(\hat{h}_2)$ to be a non-parametric distribution and $p(\hat{z}_2 \mid \hat{h}_2)$ to be a conditional Gaussian. Recall that $\hat{z}_2$ and $\hat{h}_2$ are the quantized versions of the primary image latent $z_2$ and its hyperprior $h_2$ (see Eq. (5)), the latter being output by the encoding transform of the hyperprior (see Fig. 3). We follow the additive noise model for quantization. As a result, $\hat{z}_2$ and $\hat{h}_2$ follow distributions given by the convolution of their continuous counterparts with a uniform distribution. In symbols, $p(\hat{z}_2 \mid \hat{h}_2)$ and $p(\hat{h}_2)$ have the forms

$p(\hat{z}_2 \mid \hat{h}_2) = \mathcal{N}(\mu, \sigma^2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2}), \qquad p(\hat{h}_2) = p_{\psi}(h_2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2}) \qquad (9)$

where $*$ denotes convolution and $p_{\psi}$ is a learned distribution parameterized by $\psi$. Note that, unless otherwise specified, $p(y_2)$, $p(\hat{z}_2 \mid \hat{h}_2)$, and $p(\hat{h}_2)$ are all assumed to be factorial over the elements of $y_2$, $\hat{z}_2$, and $\hat{h}_2$, respectively.
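Under this additive-noise model, the likelihood of a quantized sample is simply the Gaussian mass falling in the unit bin around it, computable from CDF differences. A minimal sketch (the numeric values are arbitrary examples):

```python
import math

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def likelihood(z_hat, mu, sigma):
    """P(z_hat) for a Gaussian convolved with U(-1/2, 1/2):
    the Gaussian mass inside the unit bin centered at z_hat."""
    return gaussian_cdf(z_hat + 0.5, mu, sigma) - gaussian_cdf(z_hat - 0.5, mu, sigma)

# Bits needed to signal one quantized sample under this model:
p = likelihood(2.0, mu=1.8, sigma=0.5)
bits = -math.log2(p)
assert 0.0 < p < 1.0 and bits > 0.0
```

The tighter the hyperprior's predicted mean and scale fit the latent, the larger this bin mass and the fewer bits the arithmetic coder spends.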
Gaussian Mixture Extension: ANFIC is flexible in accommodating more sophisticated models of $p(\hat{z}_2 \mid \hat{h}_2)$, such as Gaussian mixture models. Unlike the single Gaussian model, the mixture model requires estimating the mixing probabilities $w_k$ for $K$ components, as well as the corresponding means $\mu_k$ and variances $\sigma_k^2$. All these parameters are functions of the hyperprior $\hat{h}_2$. In this case, the decoding transform (see Eq. (6)) is changed to $\hat{z}_2 = \lfloor z_2 \rceil$, namely an identity transform followed by the quantization of $z_2$. This change is necessary because, with the mixture model, the subtraction of a single predicted mean from $z_2$ is not feasible. In addition, $\hat{z}_2$ follows a distribution given by

$p(\hat{z}_2 \mid \hat{h}_2) = \Big( \sum_{k=1}^{K} w_k \, \mathcal{N}(\mu_k, \sigma_k^2) \Big) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2}) \qquad (10)$
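The mixture likelihood of Eq. (10) follows the same CDF-difference pattern, summing the per-component bin masses. The weights, means, and scales below are arbitrary examples, not learned values:

```python
import math

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def gmm_likelihood(z_hat, weights, means, sigmas):
    """Eq. (10) sketch: mixture of Gaussians convolved with U(-1/2, 1/2)."""
    return sum(
        w * (gaussian_cdf(z_hat + 0.5, m, s) - gaussian_cdf(z_hat - 0.5, m, s))
        for w, m, s in zip(weights, means, sigmas)
    )

# Three components, as in our default setting (weights must sum to 1).
p = gmm_likelihood(0.0, [0.5, 0.3, 0.2], [-0.1, 0.4, 2.0], [0.3, 0.6, 1.0])
assert 0.0 < p < 1.0
```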
III-C Training Objective
Training ANFIC can be achieved by minimizing the negative augmented log-likelihood, i.e. $-\log p(x, e_z, e_h)$. This leads to the following loss function:

$\mathcal{L} = \mathbb{E} \Big[ -\log p(y_2, \hat{z}_2, \hat{h}_2) - \log \Big| \det \tfrac{\partial G_{\pi}(x, e_z, e_h)}{\partial (x, e_z, e_h)} \Big| \Big] \qquad (11)$

where the Jacobian log-determinant generally prevents the collapse of the latent space. In our implementation, we replace it with a reconstruction loss $d(x, \hat{x})$, with the distortion metric being the mean squared error (MSE) or the multi-scale structural similarity index (MS-SSIM) [29]:

$\mathcal{L}(\theta) = \mathbb{E} \Big[ \lambda \cdot d(x, \hat{x}) + \beta \cdot \| y_2 \|_2^2 - \log p(\hat{z}_2 \mid \hat{h}_2) - \log p(\hat{h}_2) \Big] \qquad (12)$
where $\theta$ refers to the parameters of all the networks, including the quality enhancement network. Unlike the traditional weighted sum of rate and distortion, our training objective has the additional requirement that $y_2$ should approximate noughts. This drives $\hat{z}_2$ and $\hat{h}_2$ to encode most of the information about the input $x$, provided that the reconstructed image $\hat{x}$ approximates $x$ closely. In passing, we note that the reconstruction loss also prevents the latent space from collapsing; apparently, it would be difficult to recover the input if different $x$'s were all mapped to the same point in the latent space.
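A schematic (non-differentiable) numpy version of the objective in Eq. (12) can clarify how the terms combine. The likelihood arrays stand in for the outputs of the learned entropy models, and all numeric values are arbitrary examples:

```python
import numpy as np

rng = np.random.default_rng(4)

def rate_bits(likelihoods):
    """Rate estimate: negative log2-likelihood of the coded latents."""
    return -np.log2(likelihoods).sum()

def anfic_loss(x, x_hat, y2, p_z_hat, p_h_hat, lam, beta):
    """Eq. (12) sketch: distortion + residual regularizer + rate."""
    distortion = np.mean((x - x_hat) ** 2)   # MSE variant of d(x, x_hat)
    residual = np.mean(y2 ** 2)              # drives y2 toward noughts
    rate = rate_bits(p_z_hat) + rate_bits(p_h_hat)
    return lam * distortion + beta * residual + rate

x = rng.normal(size=16)
loss = anfic_loss(
    x, x + 0.01 * rng.normal(size=16), 0.001 * rng.normal(size=16),
    p_z_hat=rng.uniform(0.1, 0.9, size=32),   # toy entropy-model outputs
    p_h_hat=rng.uniform(0.1, 0.9, size=8),
    lam=0.01, beta=0.001,
)
assert loss > 0.0
```

In an actual training loop the same expression would be written with a differentiable framework so that gradients flow through all three terms.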
IV Experimental Results
This section evaluates the performance of ANFIC both objectively and subjectively. We first present the network architectures, training details, evaluation methodologies, and baseline methods in Section IV-A. Next, we compare the rate-distortion performance of ANFIC with several state-of-the-art methods on commonly used datasets in Section IV-B. Lastly, we evaluate the subjective quality of the reconstructed images in Section IV-C.
IV-A Settings and Implementation Details
Network Architectures: Our autoencoding transforms for feature extraction (the left branch in Fig. 5) and the hyperprior (the right branch in Fig. 5) share similar architectures to the VAE-based scheme in [23]. In addition, we use the same lightweight dequantization network as in [22] for the quality enhancement network. All the autoencoding transforms in our model have separate network weights. To keep the overall model size comparable to that of [23], we reduce the number of channels in every convolutional layer to 128. We adopt the autoregressive and Gaussian mixture model (Section III-B) for entropy coding in all the experiments, with the number of mixture components set empirically to 3, which is found to be most effective in [6].

Training: For training, we use the Vimeo-90k dataset [28]. It contains 91,701 training videos, each having 7 frames. In each training iteration, we randomly choose one frame from each video and randomly crop a patch from it. We adopt the Adam [14] optimizer with a batch size of 32. The learning rate is fixed during the first 3M iterations and then decayed for fine-tuning. The two hyperparameters $\lambda$ and $\beta$ (see Eq. (12)) are chosen jointly, with $\lambda$ taking one of several values for MSE optimization and a separate set of values for optimizing MS-SSIM. In particular, we first train our model for the highest rate point. It is then fine-tuned for a few epochs to obtain the models for the lower rate points.
Evaluation: We evaluate our model on the commonly used Kodak [18] and Tecnick [24] datasets, which include 24 and 40 uncompressed images, respectively. Additionally, we test our model on the CLIC validation datasets [26], which comprise two subsets: professional and mobile. The former has 41 higher-resolution images and the latter 61 images. To evaluate rate-distortion performance, we report rates in bits per pixel (bpp) and quality in PSNR-RGB and MS-SSIM. Moreover, we use BPG as the anchor in reporting BD-rates. Note that rate inflation relative to BPG is reflected by positive BD-rates, while rate saving is shown as negative BD-rates.
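For reference, a BD-rate between two rate-quality curves is computed in the usual Bjøntegaard fashion: fit log-rate as a cubic in quality, integrate the gap over the overlapping quality range, and convert the mean log difference to a percentage. The sketch below is a generic implementation with made-up sample points, not the exact tool used for Table I:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta-rate sketch (negative = rate saving vs anchor)."""
    lr_a, lr_t = np.log10(rate_anchor), np.log10(rate_test)
    ca = np.polynomial.polynomial.polyfit(psnr_anchor, lr_a, 3)
    ct = np.polynomial.polynomial.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))   # overlapping quality range
    hi = min(max(psnr_anchor), max(psnr_test))
    ia = np.polynomial.Polynomial(ca).integ()
    it = np.polynomial.Polynomial(ct).integ()
    avg_diff = ((it(hi) - it(lo)) - (ia(hi) - ia(lo))) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0

# A hypothetical codec using 20% less rate at every quality level:
r_a = np.array([0.2, 0.4, 0.8, 1.6])
q = np.array([30.0, 33.0, 36.0, 39.0])
saving = bd_rate(r_a, q, 0.8 * r_a, q)
assert abs(saving - (-20.0)) < 1e-3
```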
IV-B Rate-Distortion Performance
Fig. 6 compares the rate-distortion performance of the competing methods on the Kodak, Tecnick, and CLIC (professional and mobile combined) datasets, with the BD-rate numbers summarized in Table I. Following some prior works, the BD-rate figures for the CLIC professional dataset are reported separately in Table I.
In terms of PSNR-RGB, our method shows comparable performance to the state-of-the-art learned codecs, CVPR'20 [6] and TPAMI'20 [22], on the Kodak and CLIC datasets. Remarkably, it achieves the best performance among all the learned codecs on the Tecnick and CLIC datasets. It however falls slightly short of the VTM model on the Kodak, Tecnick, and CLIC datasets. In particular, ANFIC displays a tendency to perform worse at low rates. This may be attributed to the fact that additive coupling layers are susceptible to the accumulation and propagation of quantization errors (Fig. 4). It is important to note in Table I that ANFIC is inferior to VTM in BD-rate saving by a significant margin on the CLIC professional dataset. Careful examination of the dataset reveals that some images are extremely challenging and not typical of the images found in our training data. All the competing methods face the same issue, and increasing the diversity of training data is expected to help. Nevertheless, the superiority of ANFIC over BPG is apparent on all the datasets.
In terms of MS-SSIM, our method performs among the top two. It is slightly worse than the top performer, CVPR'20 [6], on the Kodak dataset, especially at low rates (see Fig. 6(b)), but is comparable to ICLR'19 [19], which achieves the best MS-SSIM performance on the CLIC dataset. It is worth mentioning that TPAMI'20 [22], a strong baseline when evaluated with PSNR-RGB, exhibits poor MS-SSIM results because the released model is optimized for MSE only. Also, as noted previously in other studies, the learned codecs outperform VTM and BPG considerably when trained and tested with MS-SSIM.
The model size comparison in Table I suggests that the rate-distortion benefits of ANFIC do not come at the expense of unreasonably large models. Its model size is between those of TPAMI'20 [22] and CVPR'20 [6], both of which show competitive rate-distortion performance.
Methods        |              BD-rate (%)               | Model Size
               | Kodak | Tecnick | CLIC  | CLIC Pro     |
ICLR'18 [3]    | 5.0   | 2.9     | 13.2  | -            | 12M
NIPS'18 [23]   | 3.0   | 15.6    | 0.3   | -            | 20M
ICLR'19 [19]   | 3.0   | 10.1    | 7.5   | 13.8         | 73M
TPAMI'20 [22]  | 16.3  | 22.6    | 18.1  | 20.3         | 18M
CVPR'20 [6]    | 14.5  | -       | -     | 25.3         | 27M
TPAMI'21 [12]  | 8.8   | 15.0    | 13.0  | -            | 73M
ANFIC (Ours)   | 15.3  | 26.5    | 18.5  | 24.5         | 23M
VTM [27]       | 17.9  | 29.7    | 22.6  | 31.3         | -
IV-C Subjective Quality Comparison
Figs. 7, 8, and 9 show a subjective quality comparison between ANFIC (ours), VVC, BPG, and TPAMI'20 [22] on three images from the Kodak dataset. It is seen that our MSE model achieves comparable subjective quality to VVC and TPAMI'20 [22]. As expected, ANFIC optimized for MSE tends to smooth highly-textured areas, while VVC and BPG generate clear blocking artifacts in Fig. 8. In particular, TPAMI'20 [22] suffers from geometric distortion, especially in the "door" area in Fig. 7, and produces artificial noisy dots on the "water surface" in Fig. 8. In contrast, our MS-SSIM model shows much better subjective quality, preserving most high-frequency details.
V Ablation Studies
In this section, we conduct ablation studies to understand the properties of ANFIC. First, we show how the ANF framework improves the VAE-based scheme by stacking its autoencoding transform (Section V-A). Second, we investigate the effect of the quality enhancement network on ANFIC and its VAE-based counterpart (Section V-B). Third, we analyze the inner workings of ANFIC by visualizing the output of each autoencoding transform in both the spatial and frequency domains (Section V-C). Fourth, we study the compression performance of ANFIC across low and high rates (Section V-D). Finally, we extend ANFIC to support variable-rate compression and compare its performance with the other baselines (Section V-E). Unless otherwise specified, the Kodak dataset is used for the ablation experiments.

V-A Number of Autoencoding Transforms
To see the rate-distortion benefits of stacking autoencoding transforms, we compare the VAE-based scheme [23] with ANFIC using a varied number of autoencoding transforms. It is important to note that the VAE-based scheme can be interpreted as one-step ANFIC (see Section III-A). For a fair comparison, the VAE-based scheme (which is modified from [23] by additionally including Gaussian mixture-based entropy coding and is termed NIPS'18+GMM) and ANFIC share the same autoencoding architecture, entropy coding scheme, and quality enhancement network. To keep the model size comparable, the channel number of every autoencoding transform in ANFIC is set to 128 (see Fig. 5), while that of the VAE-based counterpart is 192. This ensures that ANFIC with two autoencoding transforms (the main setting used throughout this paper) has a model size similar to the VAE-based one. Nevertheless, when the number of autoencoding transforms increases beyond two, the model size of ANFIC grows linearly.
From Fig. 10, it is seen that increasing the number of autoencoding transforms from one (VAE-based) to two (ours, two-step) improves the rate-distortion performance significantly. However, the gain diminishes sharply when the number goes beyond two. We thus choose two autoencoding transforms as our default setting.
V-B Effect of the Quality Enhancement Network
Fig. 11 shows the effect of the quality enhancement network (as a post-processing network) on the rate-distortion performance of ANFIC and the VAE-based scheme [23]. We observe that ANFIC benefits more from the quality enhancement network, which boosts the BD-rate saving of ANFIC considerably more than it does for the NIPS'18+GMM (VAE-based) scheme [23]. This suggests that ANFIC effectively separates image transformation and (quantization) error compensation into two orthogonal parts: the former is addressed by the invertible autoencoding transforms, while the latter relies on the quality enhancement network. The fact that the feature extraction and the image reconstruction in ANFIC must go through the same invertible coupling layers makes it difficult to learn autoencoding transforms that handle both image representation and error compensation well. This, however, is not the case with the NIPS'18+GMM (VAE-based) scheme, where the analysis and synthesis transforms do not share the same network. Usually, the synthesis transform can learn to compensate partially for quantization errors. As such, the gain from the quality enhancement network becomes limited when the synthesis network is already capable enough. This result is in line with an earlier finding reported in [22] that flow-based models benefit more from post-processing.
V-C Visualization of Autoencoding Transforms
Fig. 12 visualizes how our ANFIC model (see Fig. 5) transforms the input image step by step into a residual image, and what information is captured by the corresponding latent code in each step. Additionally, the corresponding signal spectra in the frequency domain are presented to understand the system response of every autoencoding transform. To better visualize the evolution of signals, we extend the architecture in Fig. 5 to a three-step ANFIC, with the final outputs being the third-step residual and latent (instead of the two-step outputs depicted in Fig. 5). Also presented in this figure are the decoder outputs of the autoencoding transforms (see Eq. (3) and Fig. 5), which reveal the information captured by the latent code in each step. As an example, the first autoencoding transform converts the image $x$ into the first latent code, which is then decoded to be subtracted from $x$. Hence, the decoder output stands for an estimate of $x$ derived from that latent.
From left to right in the top two rows, one can see that the high-frequency details of the input image are filtered out by successive autoencoding transforms, arriving at a residual image with little high-frequency information (see the subfigure in the top-right corner). As such, the autoencoding transforms in ANFIC act as low-pass filters, whose cutoff frequency decreases with the increasing transform step in the feature extraction process. Because the residual is discarded during the reconstruction process, the high-frequency details remaining in it are lost completely. Thus, ANFIC is lossy.
The decoder outputs of the autoencoding transforms further shed light on how the latent code is transformed step by step into a form suitable for compression. From left to right in the bottom two rows, we see that the output decoded from the first latent presents a rough estimate of the input $x$. Its spectrum looks similar to that of $x$, but is not exactly the same. We conjecture that it focuses more on approximating the high-frequency part of the input; the corroborating fact is that when it is subtracted from $x$, the resulting output has relatively less high-frequency information. This becomes even more obvious in the following autoencoding transform, where the output decoded from the second latent addresses primarily the remaining mid-frequency part; as a result, the output of the second transform becomes an even lower-frequency signal. In the end, the latent code that is compressed into the bitstream only needs to represent a low-pass filtered version of the original input, which is relatively easy to compress. The reconstruction process updates a zero image with these decoder outputs in reverse order, recovering the low-frequency, mid-frequency, and high-frequency details of the input step by step.
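Spectra of this kind can be produced with a centered log-magnitude FFT. The synthetic signal below is our own construction (not a figure from the paper); it shows how removing high-frequency detail empties the corresponding frequency bins:

```python
import numpy as np

def log_spectrum(img):
    """Centered log-magnitude spectrum of a 2-D signal."""
    f = np.fft.fftshift(np.fft.fft2(img))
    return np.log1p(np.abs(f))

# A 64x64 signal varying along the horizontal axis: a low-frequency
# component (2 cycles) plus a high-frequency detail (20 cycles).
x = np.broadcast_to(np.arange(64), (64, 64))
low = np.sin(2 * np.pi * 2 * x / 64)
high = 0.5 * np.sin(2 * np.pi * 20 * x / 64)

spec_full = log_spectrum(low + high)
spec_filtered = log_spectrum(low)   # after a "low-pass" transform step

# Energy at the 20-cycle bin disappears once the detail is removed.
center = 32
assert spec_full[center, center + 20] > spec_filtered[center, center + 20]
```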
V-D Compression Performance across Low and High Rates
This study investigates the compression performance of ANFIC over a wide range of bit rates. It is reported in [9, 22] that most VAE-based compression schemes suffer from a fundamental limitation of the autoencoder; that is, the reconstruction by the autoencoder is generally lossy, even without quantization. As a result, it is difficult for a VAE-based model to support efficient compression over a wide range of bit rates without changing the network architecture, for example, by adjusting the number of channels. ANFIC, although being a flow-based model, is lossy due to discarding the high-frequency information in the residual image (see Fig. 5) during reconstruction.
Fig. 13 compares ANFIC with two state-of-the-art VAE-based schemes over a wide range of bit rates. In particular, ANFIC has the same number of channels (i.e. 320) in the latent space as NIPS'18 [23], whereas CVPR'20 [6] has only 192 channels yet a larger model size. The network architectures of all the competing models are fixed, except that their network weights are trained separately for each rate point.
We see that ANFIC matches the performance of VTM closely from extremely low-rate compression up to nearly lossless compression, while the two VAE-based schemes tend to fall short of VTM and even BPG at high rates. The reasons why ANFIC is able to work well across low and high rates are twofold: (1) the ANF-based backbone is fully invertible, and (2) our training strategies, which require the final residual to approximate zeros in the feature extraction process and substitute exact zeros for it during reconstruction, force the image latent and its hyperprior to capture as much information about the input as possible (see Fig. 5).
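One plausible way to express such a training strategy, assuming the common rate plus lambda-weighted distortion objective form; the penalty weight `beta` and the exact formulation are our assumptions for illustration, not the paper's Eq. (12):

```python
import numpy as np

def rate_distortion_loss(x, x_hat, residual, rate_bits, lam, beta=0.01):
    """Hypothetical objective sketch: rate plus lambda-weighted distortion,
    with an extra term (weight beta is our assumption) pushing the final
    residual toward zero so the latents must absorb the input's content."""
    distortion = np.mean((x - x_hat) ** 2)     # MSE distortion
    residual_penalty = np.mean(residual ** 2)  # encourage residual -> 0
    return rate_bits + lam * distortion + beta * residual_penalty
```

With a zeroed residual substituted during reconstruction, any information the latents fail to capture shows up directly in the distortion term, which is what drives them to absorb as much of the input as possible.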
V-E Variable Rate Compression
Recognizing that ANFIC works well over a wide range of bit rates, we take one step further and adapt ANFIC to variable rate compression with a single model. To this end, we implement the notion of conditional convolution in [7], replacing every convolutional layer with one that is conditioned on the rate parameter (see Eq. (12)). The conditional convolution layer applies an affine transformation to every feature map, with the affine parameters derived from a network conditioned on the rate parameter. For the experiment, we train a single ANFIC model using 5 distinct values of the rate parameter. The training objective is an extension of Eq. (12), obtained by substituting each of these values into Eq. (12) and averaging over the resulting variants.
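A minimal sketch of such a conditional affine modulation, assuming a single linear map from the rate parameter to per-channel scale and shift; the actual layer in [7] derives the parameters from a small learned network, and all names here are hypothetical:

```python
import numpy as np

def conditional_affine(features, lam, w_scale, w_shift):
    """Hypothetical conditional layer in the spirit of [7]: derive a
    per-channel scale and shift from the rate parameter lam and apply
    them to the output feature maps of an ordinary convolution.
    features: (C, H, W); w_scale, w_shift: (C, 2) linear maps (assumed)."""
    cond = np.array([lam, 1.0])     # conditioning vector with a bias term
    scale = np.exp(w_scale @ cond)  # exp keeps the scale positive
    shift = w_shift @ cond
    return features * scale[:, None, None] + shift[:, None, None]
```

Because only the small conditioning maps depend on the rate parameter, one set of convolutional weights can serve every rate point, which is what makes single-model variable rate compression possible.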
Fig. 14 shows the rate-distortion comparison of the state-of-the-art variable rate models, including VVC, BPG, ICCV'19 [7], TPAMI'20 [22], and ANFIC (ours). Compared with our multi-model setting, our single-model setting performs comparably well, with only a slight difference in rate saving due to training variance. It also shows comparable performance to VTM across the 5 rate points, while significantly outperforming the other learning-based methods in single-model mode.
VI Conclusion
In this paper, we propose an ANF-based image compression system (ANFIC). It is motivated by the fact that the VAE, which forms the basis of most end-to-end learned image compression, is a special case of ANF and can be extended by ANF to offer greater expressiveness. ANFIC is the first work that introduces VAE-based compression into a flow-based framework, enjoying the benefits of both approaches. Experimental results show that ANFIC performs comparably to or better than the state-of-the-art learned image compression methods and is able to offer a wide range of quality levels without changing the network architecture. Furthermore, its variable rate version shows little performance degradation. Flow-based models are relatively new to learned image compression, and we believe there remains wide-open space for further research.
References
 [1] (2017) End-to-end optimized image compression. In International Conference on Learning Representations.
 [2] (2017) End-to-end Optimized Image Compression. In International Conference on Learning Representations.
 [3] (2018) Variational image compression with a scale hyperprior. In International Conference on Learning Representations.
 [4] BPG image format, https://bellard.org/bpg/.
 [5] (2021) End-to-End Learnt Image Compression via Non-Local Attention Optimization and Improved Context Modeling. IEEE Transactions on Image Processing 30, pp. 3179–3191.
 [6] (2020) Learned Image Compression with Discretized Gaussian Mixture Likelihoods and Attention Modules. In IEEE Conference on Computer Vision and Pattern Recognition.
 [7] (2019) Variable Rate Deep Image Compression With a Conditional Autoencoder. In IEEE/CVF International Conference on Computer Vision.
 [8] (2017) Density estimation using real NVP. In International Conference on Learning Representations.
 [9] (2021) Lossy Image Compression with Normalizing Flows. In International Conference on Learning Representations Workshop.
 [10] (2019) Integer Discrete Flows and Lossless Compression. In Advances in Neural Information Processing Systems.
 [11] (2020) Coarse-to-Fine Hyper-Prior Modeling for Learned Image Compression. In AAAI Conference on Artificial Intelligence.
 [12] (2021) Learning End-to-End Lossy Image Compression: A Benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence.
 [13] (2020) Augmented Normalizing Flows: Bridging the Gap Between Generative Flows and Latent Variable Models. CoRR.
 [14] (2015) Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.
 [15] (2018) Glow: Generative Flow with Invertible 1x1 Convolutions. In Advances in Neural Information Processing Systems.
 [16] (2014) Auto-Encoding Variational Bayes.
 [17] (2020) Normalizing Flows: An Introduction and Review of Current Methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.
 [18] Kodak lossless true color image suite (PhotoCD PCD0992), http://r0k.us/graphics/kodak/.
 [19] (2019) Context-adaptive Entropy Model for End-to-end Optimized Image Compression. In International Conference on Learning Representations.
 [20] (2020) A Spatial RNN Codec for End-to-End Image Compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
 [21] (2019) DVC: An End-to-End Deep Video Compression Framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
 [22] (2020) End-to-End Optimized Versatile Image Compression With Wavelet-Like Transform. IEEE Transactions on Pattern Analysis and Machine Intelligence.
 [23] (2018) Joint Autoregressive and Hierarchical Priors for Learned Image Compression. In Neural Information Processing Systems.
 [24] (2014) TESTIMAGES: a Large-scale Archive for Testing Visual Devices and Basic Image Processing Algorithms. In Smart Tools and Apps for Graphics - Eurographics Italian Chapter Conference.
 [25] (2016) . In Proceedings of International Conference on Machine Learning.
 [26] (2020) Workshop and Challenge on Learned Image Compression (CLIC2020).
 [27] VVC official test model VTM, https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/tree/VTM-9.0.
 [28] (2019) Video Enhancement with Task-Oriented Flow. International Journal of Computer Vision 127 (8), pp. 1106–1125.
 [29] (2003) Multi-scale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers.