ANFIC: Image Compression Using Augmented Normalizing Flows

This paper introduces an end-to-end learned image compression system, termed ANFIC, based on Augmented Normalizing Flows (ANF). ANF is a new type of flow model that stacks multiple variational autoencoders (VAEs) for greater model expressiveness. VAE-based image compression has gone mainstream, showing promising compression performance. Our work presents the first attempt to leverage VAE-based compression in a flow-based framework. ANFIC further advances compression efficiency by stacking and hierarchically extending multiple VAEs. The invertibility of ANF, together with our training strategies, enables ANFIC to support a wide range of quality levels without changing the encoding and decoding networks. Extensive experimental results show that, in terms of PSNR-RGB, ANFIC performs comparably to or better than the state-of-the-art learned image compression. Moreover, it performs close to VVC intra coding, from low-rate compression up to nearly-lossless compression. In particular, ANFIC achieves state-of-the-art performance when extended with conditional convolution for variable-rate compression with a single model.








I Introduction

Image compression has been a thriving research area for decades due to the storage and transmission requirements of the various applications that underpin our modern digital life. Image compression also appears in the form of intra-frame coding for video compression [21]. The rapid advances in inter-frame prediction make efficient intra-frame coding increasingly important, because intra-coded frames often dominate the bit rate of a compressed video. It is therefore highly desirable to achieve even higher image compression efficiency.

The state-of-the-art image compression methods, e.g. BPG [4] and VVC intra coding [27], usually involve block-based intra prediction, block-based transform coding of residuals, and context-adaptive binary arithmetic coding. Over the years, tremendous research effort has been invested in improving every component so as to achieve higher compression efficiency at the expense of an acceptable complexity increase. These hand-crafted codecs, although achieving a good balance between compression efficiency and complexity, lack the opportunity to optimize all the components jointly in a seamless, end-to-end manner.

The rise of deep learning has recently spurred a new wave of developments in image compression, with end-to-end learned systems attracting much attention. Among them, the variational autoencoder (VAE)-based methods [23, 11, 5, 6] have achieved compression performance very close to the latest VVC intra coding. Different from traditional hand-crafted codecs, the VAE-based methods usually implement an image-level non-linear transform that converts an input image into a compact set of latent features, the dimensions of which are much smaller than those of the image. Ever since the advent of the first VAE-based scheme [1], several improvements have been made to the expressiveness [5, 6, 20] of the autoencoder and the efficiency of entropy coding [23, 11, 5, 6, 3, 19, 12]. Up to now, the VAE-based methods have become the mainstream approach to end-to-end learned image compression.

However, one issue with most VAE-based schemes is that the autoencoder is generally lossy. There is no guarantee that its non-linear transform can reconstruct the input image losslessly even without quantizing the latent features of the image. This is unlike the traditional transforms, such as Discrete Cosine Transform and Wavelet Transform, which have the desirable property of perfect reconstruction and allow the codec to offer a wide range of quality levels by merely changing the quantization step size.

Recently, flow-based models [9, 22] emerged as attractive alternatives. These models realize a bijective, invertible mapping between the input image and its latent features via reversible networks composed of affine coupling layers [17, 8]. This invertibility is utilized to develop lossless image compression in [10], while the affine coupling layers are used in place of the lossy autoencoder in [9, 22] to achieve both lossy and lossless (or nearly-lossless) compression with a single unified model. The reversible networks, however, are quite distinct from the commonly used autoencoders, making these two types of compression systems incompatible with each other.
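The perfect-reconstruction property of coupling layers can be illustrated with a short numpy sketch. Here `m` is an arbitrary stand-in for the learned coupling network (our toy, not the architecture of any cited codec); invertibility comes from the coupling structure itself, not from `m`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical coupling network: any function of the conditioning half works.
W = rng.normal(size=(4, 4))
def m(x_a):
    return np.tanh(x_a @ W)  # arbitrary non-linear "prediction" network

def additive_coupling_forward(x):
    x_a, x_b = np.split(x, 2, axis=-1)   # split channels into two halves
    y_b = x_b + m(x_a)                   # transform one half conditioned on the other
    return np.concatenate([x_a, y_b], axis=-1)

def additive_coupling_inverse(y):
    y_a, y_b = np.split(y, 2, axis=-1)
    x_b = y_b - m(y_a)                   # exact inverse: subtract the same prediction
    return np.concatenate([y_a, x_b], axis=-1)

x = rng.normal(size=(3, 8))
y = additive_coupling_forward(x)
x_rec = additive_coupling_inverse(y)
print(np.allclose(x, x_rec))  # perfect reconstruction without quantization
```

Because the inverse only re-evaluates `m` on the untouched half, reconstruction is exact regardless of how complex `m` is.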

In this paper, we propose a novel end-to-end image compression system, termed ANFIC, based on Augmented Normalizing Flows (ANF) [13]. ANF is a new type of flow model that operates on an augmented input space to offer greater transformation ability than ordinary flow models. Our scheme ANFIC is motivated by the fact that ANF is a generalization of VAE, stacking multiple VAEs as a flow model. In a sense, this allows ANFIC to extend any existing VAE-based compression system in a flow-based framework and enjoy the benefits of both approaches. ANFIC is novel and unique in that (1) it is distinguished from flow-based compression by operating on an augmented input space, which enables it to leverage the representation power of any VAE-based image compression, and (2) it is more general than VAE-based compression by allowing VAEs to be stacked and/or extended hierarchically.

Extensive experimental results on the Kodak, Tecnick, and CLIC validation datasets show that ANFIC performs comparably to or better than the state-of-the-art end-to-end image compression in terms of PSNR-RGB. It performs close to VVC intra [27] over a wide range of quality levels, from low-rate compression up to nearly-lossless compression. In particular, ANFIC achieves state-of-the-art performance among the competing methods when extended with conditional convolutional layers [7] for variable-rate compression with a single model.

Our main contributions are three-fold:

  • We propose ANFIC as the first work that leverages VAE-based image compression in a flow-based framework.

  • We offer extensive ablation studies to understand and visualize the inner workings of ANFIC.

  • Extensive experimental results show that ANFIC is competitive with the state-of-the-art image compression, VAE-based and flow-based, over a wide range of quality levels and performs close to VVC intra coding.

The remainder of this paper is organized as follows: Section II reviews VAE-based image compression and the basics of ANF. Section III elaborates on the design of ANFIC. Section IV compares ANFIC with the state-of-the-art methods in terms of objective compression performance and subjective image quality. Section V presents our ablation studies. Finally, we provide concluding remarks in Section VI.

II Related Work

In this paper, we propose an ANF-based image compression scheme, which can be viewed as an extension of VAE-based image compression. Hence, this section focuses on the recent developments of VAE-based image compression and introduces the fundamentals of ANF to ease the understanding of our scheme.

II-A VAE-based Image Compression

VAE-based image compression [23, 11, 5, 6, 1, 3, 19] is the most popular approach to end-to-end learned image compression. Its training framework includes three major components: the analysis transform, the prior distribution, and the synthesis transform, all implemented by neural networks.

The analysis transform $g_a$ encodes the raw image $x$ through an encoding distribution, producing the latent representation $y = g_a(x)$, which is uniformly quantized as $\hat{y}$. The $\hat{y}$ is then entropy encoded into a bitstream using a learned prior $p(\hat{y})$ implemented by a network. Finally, the synthesis transform $g_s$ reconstructs an approximation $\hat{x}$ of the input from $\hat{y}$ by a decoding distribution $p(x|\hat{y})$.

All the network parameters are trained end-to-end by minimizing

$$\mathcal{L} = \mathbb{E}\big[-\log p(x\,|\,\hat{y})\big] + \mathbb{E}\big[-\log p(\hat{y})\big], \qquad (1)$$

where the first term, denoted by $D$, aims to minimize the negative log-likelihood of $x$ given $\hat{y}$ and the second term, denoted by $R$, minimizes the rate needed for signaling $\hat{y}$. In particular, it is shown that minimizing Eq. (1) amounts to maximizing the evidence lower bound (ELBO) of a latent variable model [16], which is specified by the encoding distribution $q(\hat{y}|x)$ and the decoding distribution $p(x|\hat{y})$, with $q(\hat{y}|x)$ taking a uniform distribution that models the effect of uniform quantization. In a more general setting, a hyper-parameter $\lambda$ is introduced to balance between $R$ and $D$, yielding $\mathcal{L} = R + \lambda D$.
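The rate-distortion trade-off of Eq. (1) can be sketched numerically with toy linear transforms standing in for the learned networks. Everything here is an illustrative assumption (a random linear "analysis" matrix, a unit Gaussian prior, additive uniform noise as the training-time quantization proxy), not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy analysis transform: 16-dim "image" -> 4-dim latent (an assumption).
A = rng.normal(size=(16, 4)) / 4.0

def rate_bits(y_noisy, sigma=1.0):
    # Rate term: negative log2-likelihood under a factorized Gaussian prior.
    nll = 0.5 * (y_noisy / sigma) ** 2 / np.log(2) + np.log2(sigma * np.sqrt(2 * np.pi))
    return float(nll.sum())

def rd_loss(x, lam=0.01):
    y = x @ A
    # Additive uniform noise models rounding during training.
    y_noisy = y + rng.uniform(-0.5, 0.5, size=y.shape)
    x_hat = y_noisy @ np.linalg.pinv(A)          # toy synthesis transform
    distortion = float(np.mean((x - x_hat) ** 2))
    return rate_bits(y_noisy) + lam * distortion  # R + lambda * D

x = rng.normal(size=(16,))
loss = rd_loss(x)
print(loss > 0.0)
```

Sweeping `lam` traces out the rate-distortion curve: a larger `lam` pushes the optimizer toward lower distortion at higher rate.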

Ballé et al. [1] were the first to introduce the aforementioned VAE framework, together with a learned factorized prior, to image compression. In entropy coding the image latents, they assume the prior distribution over $\hat{y}$ to be factorial and learn the distribution with a neural network. Their analysis and synthesis transforms are composed of convolutional layers and generalized divisive normalization (GDN) layers, which originate from earlier work on density modeling of natural images.
Ever since the advent of the VAE-based compression framework, several efforts have been made to advance its coding efficiency. In particular, some [23, 11, 5, 6, 3, 19, 12] improve the prior estimation for better entropy coding, while others [5, 6, 20] address the analysis and synthesis transforms (referred to collectively as the autoencoding transform). We summarize these efforts briefly as follows.

Enhanced Prior Estimation: The prior distribution crucially determines the number of bits (i.e. the rate) needed to signal the quantized image latents $\hat{y}$. Recognizing the suboptimality of the factorized prior, where the feature samples in every channel of $\hat{y}$ are independently and identically distributed, Ballé et al. [3] propose the notion of a hyperprior to model every feature sample separately by a Gaussian distribution. To this end, additional side information $\hat{z}$ is extracted from the image latent and sent to the decoder, making the density estimation of $\hat{y}$ dependent on the input $x$. The $\hat{y}$ and $\hat{z}$ together form the latent representation of the input. The hyperprior thus bears the interpretation of factorizing the joint distribution $p(\hat{y}, \hat{z})$ as $p(\hat{y}|\hat{z})\,p(\hat{z})$, where $p(\hat{y}|\hat{z})$ and $p(\hat{z})$ are assumed to be Gaussian and factorial, respectively. Hu et al. [11, 12] extend the idea to include more than one layer of hyperprior, leading to a factorization over a multi-layer hyperprior $\hat{z}_1, \ldots, \hat{z}_n$. In addition to the use of hyperprior, Minnen et al. [23], Lee et al. [19], Chen et al. [5], and Cheng et al. [6] incorporate an autoregressive prior by 2D [23, 6, 19] or 3D [5] masked convolution [25], in order to utilize causal contextual information for better density estimation. In particular, Cheng et al. [6] model $p(\hat{y}|\hat{z})$ with a Gaussian mixture distribution instead of a single Gaussian.
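The causal context used by such autoregressive priors is typically enforced with a masked convolution kernel, which zeroes out the "future" positions in raster-scan order. A minimal sketch of the mask construction (our own helper, not code from any cited work):

```python
import numpy as np

def masked_kernel(k=5, mask_type="A"):
    # Type-A masks zero the centre and all future positions (raster order),
    # so each latent is predicted only from already-decoded neighbours.
    # Type-B masks (used in deeper layers) keep the centre.
    m = np.ones((k, k))
    c = k // 2
    m[c, c + (mask_type == "B"):] = 0   # centre row: centre (type A) and all columns to the right
    m[c + 1:, :] = 0                    # all rows below the centre
    return m

print(masked_kernel(5, "A"))
```

Multiplying a convolution kernel element-wise by this mask before applying it guarantees that the predicted distribution of each sample depends only on previously decoded samples, which is what makes sequential arithmetic decoding possible.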

Enhanced Autoencoding Transform: The capacity of the autoencoding transform determines its expressiveness. Chen et al. [5] add residual blocks to the autoencoder along with several non-local attention modules (NLAM). NLAM is shown to facilitate spatial bit allocation among coding areas of varied texture complexity. Unlike most VAE-based systems, which operate at the image level, the block-based autoencoder in [20] divides the input image into non-overlapping macroblocks, each of which contains multiple sub-blocks coded sequentially using recurrent analysis and synthesis transforms. It has the striking feature of allowing a high degree of computational parallelism at the macroblock level. In general, however, most autoencoders are not guaranteed to reconstruct the input perfectly even when no quantization is involved.

II-B Flow-based Image Compression

Fig. 1: Flow-based image compression with (a) normalizing flows [9] and (b) augmented normalizing flows (ours).

Recently, flow-based models [17, 8, 15] have emerged as an attractive alternative to VAE [16] and other autoencoders. They are characterized by a bijective mapping between the input and its latent representation, ensuring that the input can be perfectly reconstructed from its latent in the absence of quantization. Ma et al. [22] make an interesting attempt to introduce lifting-based coupling layers, a specialized implementation of the additive coupling layers [17, 8] often used to construct flow models, as the analysis and synthesis backbone. In particular, they split an input image, first row-wise and then column-wise, into latent subbands, the resulting decomposition being similar to a 2D wavelet transform. Helminger et al. [9] also use additive coupling layers, but with factor-out splitting to generate a multi-scale image representation, as shown in Fig. 1(a). Their work extends the notion of integer discrete flows for lossless compression [10] to lossy compression. In common, these works show the potential of flow-based models to offer a wide range of quality levels, ranging from low-rate compression to nearly-lossless or even lossless compression.

Our work aims to leverage the developments of VAE-based schemes in a flow-based framework to enjoy the benefits of both (see Fig. 1(b)). For this purpose, we resort to augmented normalizing flows [13], the basics of which are presented next.

II-C Augmented Normalizing Flows (ANF)

Fig. 2: The architectures of ANF: (a) one-step ANF, composed of the encoding and the decoding transforms, and (b) one-step hierarchical ANF.

The ANF model [13] is an invertible latent variable model. It is composed of multiple autoencoding transforms, each of which comprises a pair of encoding and decoding transforms, as depicted in Fig. 2(a). Consider the example of ANF with one autoencoding transform (i.e. one-step ANF). It converts the input $x$, coupled with an independent noise $e_z$, into their latent representation $(y, z)$ with one pair of encoding and decoding transforms:

$$g^{\pi}_{enc}(x, e_z) = (x,\; s_{\pi}(x) \odot e_z + m_{\pi}(x)) = (x, z), \qquad (2)$$
$$g^{\pi}_{dec}(x, z) = ((x - \mu_{\pi}(z)) \oslash \sigma_{\pi}(z),\; z) = (y, z), \qquad (3)$$

where $\pi$ refers collectively to the network parameters of the encoding and decoding transforms, $\odot$ and $\oslash$ denote element-wise multiplication and division, and $s_{\pi}, m_{\pi}, \mu_{\pi}, \sigma_{\pi}$ are neural networks. Compared with ordinary flow models, ANF augments the input with an independent noise. It is shown in [13] that the augmented input space allows a smoother transformation to the required latent space.

Multi-step ANF and Hierarchical ANF: From Fig. 2(a) and according to Eqs. (2) and (3), the encoding or decoding transform implements an invertible affine coupling layer. Stacking pairs of these coupling layers leads to an invertible network, termed multi-step ANF, with much greater capacity than one-step ANF. Another way to increase the model capacity is to augment more noise inputs, as in hierarchical ANF (see Fig. 2(b)). These two approaches can also be combined flexibly for even higher model capacity.
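The stacking of autoencoding transforms can be sketched in numpy. Each step's `enc`/`dec` below is a random toy network (an assumption for illustration, using the additive form without scaling), yet the composition of any number of steps remains exactly invertible:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-step networks; in ANF these are learned encoders/decoders.
def make_step(seed):
    r = np.random.default_rng(seed)
    W1, W2 = r.normal(size=(4, 4)), r.normal(size=(4, 4))
    enc = lambda x: np.tanh(x @ W1)   # shift predicted for the noise channel
    dec = lambda z: np.tanh(z @ W2)   # shift predicted for the image channel
    return enc, dec

steps = [make_step(s) for s in range(3)]  # a 3-step ANF

def forward(x, e):
    for enc, dec in steps:            # each step: one additive autoencoding transform
        e = e + enc(x)                # encoding transform (additive form of Eq. (2))
        x = x - dec(e)                # decoding transform (additive form of Eq. (3))
    return x, e

def inverse(x, e):
    for enc, dec in reversed(steps):  # run the same couplings backwards
        x = x + dec(e)
        e = e - enc(x)
    return x, e

x, e = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
latents = forward(x, e)
x_rec, e_rec = inverse(*latents)
print(np.allclose(x, x_rec) and np.allclose(e, e_rec))
```

Adding more steps increases capacity without ever sacrificing invertibility, which is the property ANFIC builds on.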

Training ANF: Like ordinary flow models, ANF can be trained by maximizing the augmented joint likelihood, i.e. $p(x, e_z)$:

$$\max_{\pi} \; \mathbb{E}_{x, e_z} \Big[ \log p(G_{\pi}(x, e_z)) + \log \Big| \det \frac{\partial G_{\pi}(x, e_z)}{\partial (x, e_z)} \Big| \Big], \qquad (4)$$

where $G_{\pi}$ is the alternate composition of the encoding and decoding transforms and $p$ on the right-hand side represents the specified or learned prior distribution over the latents $(y, z)$. It is shown in [13] that maximizing the augmented joint likelihood in ANF amounts to maximizing a lower bound on the marginal likelihood $p(x)$, with the gap attributed to the model's incapability of modeling $e_z$ independently of $x$.

VAE as One-step ANF: Notably, VAE can be viewed as one-step ANF by (1) letting $e_z$ be a Gaussian noise, (2) transforming $e_z$ into $z$ via re-parameterizing the VAE's encoding distribution $q(z|x)$, and (3) normalizing $x$ as $y$ via the VAE's decoding distribution $p(x|z)$. The resulting $y$ then follows the prior, and so does the aggregated distribution of $z$ from various inputs $x$. Maximizing Eq. (4) for such a one-step ANF is shown in [13] to be identical to maximizing the ELBO of VAE [16].

III Proposed Method

Inspired by the fact that most learned image compression is VAE-based and that VAE is equivalent to one-step ANF, we propose an ANF-based image compression framework, termed ANFIC. We first outline the ANFIC framework in Section III-A, with a focus on how to extend VAE-based image compression with hyperprior by multi-step and hierarchical ANF. This is followed by discussions on the entropy coding of the latent representation and the modeling of the prior distribution in ANFIC (Section III-B), and the training objective (Section III-C).

To the best of our knowledge, ANFIC is the first work that combines VAE and flow models in a unified framework. It is distinguished from flow-based compression in that it operates on an augmented input space (see Fig. 1(b)), which enables it to leverage the representation power of any existing VAE-based image compression. Moreover, ANFIC is more general than the VAE-based scheme in allowing it to be stacked and/or extended hierarchically (see Fig. 2).

Fig. 3: The overall architecture of our proposed ANF-based image compression (ANFIC).

III-A ANFIC Framework

Fig. 3 describes the framework of ANFIC. From bottom to top, it stacks two autoencoding transforms (i.e. two-step ANF), with the top one extended further to the right to form a hierarchical ANF [13] that implements the hyperprior. More autoencoding transforms can be added straightforwardly to create a multi-step ANF. In particular, the encoding and decoding transforms of each autoencoding step follow Eqs. (2) and (3), except that we make them purely additive by removing the scaling networks $s_{\pi}$ and $\sigma_{\pi}$ for better convergence, as with some other flow-based schemes [9, 22].

The autoencoding transform of the hyperprior, which assumes each sample in the latent representation $y$ follows a Gaussian, is defined as

$$h_{enc}(y, e_h) = (y,\; h_a(y) + e_h) = (y, z), \qquad (5)$$
$$h_{dec}(y, \hat{z}) = (\mathrm{Q}(y - \mu(\hat{z})) + \mu(\hat{z}),\; \hat{z}) = (\hat{y}, \hat{z}), \qquad (6)$$

where $\mathrm{Q}$ (depicted as Q in Fig. 3) denotes the nearest-integer rounding for quantizing the residual between $y$ and the predicted mean $\mu(\hat{z})$ of the Gaussian distribution derived from the hyperprior $\hat{z}$, and $h_a$ is the hyperprior encoder. This part implements the autoregressive hyperprior in [23], with $\hat{y}$ denoting the image latents whose distributions are signaled as the side information $\hat{z}$.

The encoding of ANFIC proceeds by passing the augmented input $(x, e_z, e_h)$ through the autoencoding and hyperprior transforms to obtain the latent representation $(x_2, \hat{y}_2, \hat{z}_2)$. In particular, $x$ represents the input image, $e_z$ denotes the augmented Gaussian noise, and $e_h$ simulates the additive quantization noise of the hyperprior. To achieve (lossy) compression, we want $\hat{y}_2$ and $\hat{z}_2$ to capture most of the information about the input, and we regularize $x_2$ during training to approximate noughts. As such, only $\hat{y}_2$ and $\hat{z}_2$ are entropy coded into bitstreams. Note that due to the volume-preserving property of ANF (or any flow model), $x_2$ has the same dimensionality as the input $x$, while that of $\hat{y}_2$ and $\hat{z}_2$ is usually much smaller, depending on the design choice. This flexibility allows us to incorporate any existing VAE-based compression scheme as one specific realization of the autoencoding transform in ANFIC. For example, the encoder of any VAE-based compression can be used to implement $m_{\pi}$ for the encoding transform in Eq. (2); likewise, its decoder can realize $\mu_{\pi}$ for the decoding transform in Eq. (3). Note that we have assumed the use of additive coupling layers.
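This encode/decode flow can be sketched with a one-step additive coupling in numpy. The untrained random networks here are illustrative stand-ins (in a trained ANFIC, the residual latent is driven toward zero so dropping it costs little, which is not the case for this toy):

```python
import numpy as np

rng = np.random.default_rng(3)
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
enc = lambda v: np.tanh(v @ W1)   # toy stand-in for the encoding network
dec = lambda v: np.tanh(v @ W2)   # toy stand-in for the decoding network

def forward(x, e):
    z = e + enc(x)                # additive encoding transform
    x2 = x - dec(z)               # additive decoding transform (residual latent)
    return x2, z

def inverse(x2, z):
    x = x2 + dec(z)
    e = z - enc(x)
    return x, e

x = rng.normal(size=(2, 4))
x2, z = forward(x, np.zeros((2, 4)))     # encode; noise fixed to 0 for clarity

# Lossless check: inverting the exact latents recovers x perfectly.
x_exact, _ = inverse(x2, z)
print(np.allclose(x_exact, x))           # True

# Lossy path: quantize the transmitted latent and drop x2 (set to noughts),
# mirroring the two distortion sources described in the text.
z_hat = np.round(4 * z) / 4
x_rec, _ = inverse(np.zeros_like(x2), z_hat)
print(np.mean((x - x_rec) ** 2) > 0)
```

The contrast between the two reconstructions makes the point of the paragraph concrete: the mapping itself is exact, and all loss comes from quantization and from zeroing the residual latent.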

To decode the input $x$, we apply the inverse mapping to the quantized latents $(0, \hat{y}_2, \hat{z}_2)$, where $x_2$ is set to noughts. In ANFIC, there are two sources of distortion that cause the reconstruction to be lossy: the quantization error of $(\hat{y}_2, \hat{z}_2)$ and the error of setting $x_2$ to noughts during the inverse operation. Essentially, ANFIC is an ANF model, which is bijective and invertible; the errors between the encoding latents and their quantized versions therefore introduce distortion to the reconstructed image, as shown in Fig. 4.

To mitigate the effect of quantization errors on the decoded image quality, we incorporate a quality enhancement network at the end of the reverse path, as illustrated in Fig. 4. This enhancement network is an integral part of ANFIC, which is constrained by the fact that the analysis and the synthesis transforms must share the same autoencoding transforms (i.e. invertible coupling layers). This constraint makes it difficult to learn a synthesis transform that can effectively compensate for quantization errors while maintaining the invertibility. The same observation was made in [22]. In this paper, we adopt the same lightweight enhancement network as [22].

Fig. 4: Error propagation due to the quantization of the image latents. To alleviate propagation errors, we place a quality enhancement network at the end of the reverse path (the red dotted line).

III-B Prior Distribution

The prior distribution of ANFIC refers to the joint distribution of the latents $(x_2, \hat{y}_2, \hat{z}_2)$, which, as in VAE-based schemes, plays a crucial role in determining the rate needed to signal the image latents. Rather than manually specifying the prior distribution, we adopt a parametric approach to learn it, for the sake of balancing between rate and distortion. As noted previously, ANFIC has the latent $\hat{y}_2$ and the hyperprior $\hat{z}_2$ capture most of the information of the input $x$. We thus require the latent $x_2$ to follow a zero-mean Gaussian with a small variance and to be independent of $(\hat{y}_2, \hat{z}_2)$. That is, the joint distribution factorizes as

$$p(x_2, \hat{y}_2, \hat{z}_2) = p(x_2) \, p(\hat{y}_2 | \hat{z}_2) \, p(\hat{z}_2),$$

with $p(x_2)$ being the zero-mean Gaussian and the remaining terms, $p(\hat{y}_2 | \hat{z}_2)$ and $p(\hat{z}_2)$, learned from data by neural networks.

Similar to VAE-based schemes [3], we assume $p(\hat{z}_2)$ to be a non-parametric distribution and $p(\hat{y}_2 | \hat{z}_2)$ to be a conditional Gaussian. Recall that $\hat{y}_2$ and $\hat{z}_2$ are the quantized versions of the primary image latent and its hyperprior (see Eq. (5)). We follow the additive noise model for quantization. As a result, $\hat{y}_2$ and $\hat{z}_2$ follow distributions given by the convolution of their continuous densities with a unit uniform distribution:

$$p(\hat{y}_2 | \hat{z}_2) = \big( \mathcal{N}(\mu, \sigma^2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2}) \big)(\hat{y}_2), \qquad p(\hat{z}_2) = \big( p_{\psi} * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2}) \big)(\hat{z}_2),$$

where $*$ denotes convolution, $\mu, \sigma$ are predicted from the hyperprior, and $p_{\psi}$ is a learned distribution parameterized by $\psi$. Note that unless otherwise specified, these distributions are all assumed to be factorial over the elements of $\hat{y}_2$ and $\hat{z}_2$, respectively.
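Evaluating a Gaussian convolved with a unit uniform at an integer reduces to a difference of CDFs, which is how learned codecs commonly compute the coding probability of a quantized latent. A minimal sketch (function names are ours, not from any cited implementation):

```python
import math

def gaussian_cdf(v, mu, sigma):
    return 0.5 * (1 + math.erf((v - mu) / (sigma * math.sqrt(2))))

def quantized_likelihood(y_hat, mu, sigma):
    # Convolving N(mu, sigma^2) with U(-1/2, 1/2) and evaluating at an integer
    # y_hat equals the Gaussian mass of the interval [y_hat - 1/2, y_hat + 1/2].
    return gaussian_cdf(y_hat + 0.5, mu, sigma) - gaussian_cdf(y_hat - 0.5, mu, sigma)

p = quantized_likelihood(0.0, 0.0, 1.0)
print(round(p, 4))        # → 0.3829 (mass of N(0, 1) on [-0.5, 0.5])
bits = -math.log2(p)      # code length assigned by the arithmetic coder
print(bits > 0)
```

The tighter the predicted Gaussian fits the true latent, the larger this interval mass and the fewer bits the arithmetic coder spends.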

Gaussian Mixtures Extension: ANFIC is flexible in accommodating more sophisticated modeling of $p(\hat{y}_2 | \hat{z}_2)$, such as Gaussian mixture models. Unlike the single Gaussian model, the mixture model requires estimating the mixing probabilities $w_k$ of the $K$ components as well as the corresponding means $\mu_k$ and variances $\sigma_k^2$. All these parameters are functions of the hyperprior $\hat{z}_2$. In this case, the decoding transform (see Eq. (6)) is changed to an identity transform followed by the quantization of $y$. This change is necessary because, with the mixture model, the subtraction of a single predicted mean from $y$ is not feasible. In addition, $\hat{y}_2$ follows a distribution given by

$$p(\hat{y}_2 | \hat{z}_2) = \Big( \sum_{k=1}^{K} w_k \, \mathcal{N}(\mu_k, \sigma_k^2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2}) \Big)(\hat{y}_2).$$
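The mixture case extends the interval-mass computation to a weighted sum over components. A sketch with illustrative (not learned) parameters; in ANFIC these would all be predicted from the hyperprior:

```python
import math

def gaussian_cdf(v, mu, sigma):
    return 0.5 * (1 + math.erf((v - mu) / (sigma * math.sqrt(2))))

def gmm_quantized_likelihood(y_hat, weights, mus, sigmas):
    # Mixture of K Gaussians convolved with U(-1/2, 1/2): a weighted sum of
    # per-component interval masses around the integer y_hat.
    return sum(w * (gaussian_cdf(y_hat + 0.5, m, s) - gaussian_cdf(y_hat - 0.5, m, s))
               for w, m, s in zip(weights, mus, sigmas))

# K = 3 components, as in the paper's entropy model; parameters are illustrative.
p = gmm_quantized_likelihood(1.0, [0.5, 0.3, 0.2], [0.0, 1.0, 2.0], [1.0, 0.5, 1.5])
print(0.0 < p < 1.0)
```

Because the weights sum to one and the interval masses over all integers partition the real line, the resulting probabilities still form a valid discrete distribution for the arithmetic coder.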
III-C Training Objective

Training ANFIC can be achieved by minimizing the negative augmented log-likelihood, i.e. $-\log p(x, e_z, e_h)$. This leads to the following loss function:

$$\mathcal{L} = \mathbb{E}\big[ -\log p(x_2, \hat{y}_2, \hat{z}_2) \big] - \mathbb{E}\Big[ \log \Big| \det \frac{\partial G(x, e_z, e_h)}{\partial (x, e_z, e_h)} \Big| \Big], \qquad (11)$$

where the Jacobian log-determinant generally prevents the collapse of the latent space. In our implementation, we replace it with a reconstruction loss, with the distortion metric $d(\cdot, \cdot)$ being the mean-squared error (MSE) or the multi-scale structural similarity index (MS-SSIM) [29]:

$$\mathcal{L}(\phi) = \mathbb{E}\big[ -\log p(\hat{y}_2, \hat{z}_2) + \| x_2 \|^2 \big] + \lambda \cdot \mathbb{E}\big[ d(x, \hat{x}) \big], \qquad (12)$$

where $\phi$ refers to the parameters of all the networks, including the quality enhancement network. Unlike the traditional weighted sum of rate and distortion, our training objective has the additional requirement that $x_2$ should approximate noughts. This drives $(\hat{y}_2, \hat{z}_2)$ to encode most of the information about the input $x$, provided that the reconstructed image $\hat{x}$ approximates $x$ closely. In passing, we note that the reconstruction loss also prevents the latent space from collapsing; it would clearly be difficult to recover the input if different $x$'s were all mapped to the same point in the latent space.
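A compact sketch of an Eq. (12)-style objective; the function name, argument names, and the exact weighting of the residual regularizer are our assumptions for illustration, not the paper's released code:

```python
import numpy as np

def anfic_loss(x, x_hat, x2, bits_y, bits_z, lam=0.01):
    # Rate of the coded latents, lambda-weighted MSE distortion, and a
    # regularizer pushing the residual latent x2 toward noughts.
    rate = bits_y + bits_z
    distortion = float(np.mean((x - x_hat) ** 2))
    x2_reg = float(np.mean(x2 ** 2))
    return rate + lam * distortion + x2_reg

rng = np.random.default_rng(4)
x = rng.normal(size=(8,))
loss = anfic_loss(x, x + 0.1, 0.01 * rng.normal(size=(8,)),
                  bits_y=100.0, bits_z=10.0)
print(loss > 110.0)
```

The three terms correspond directly to the text: rate, distortion, and the requirement that the residual latent approximate noughts; removing the last term would let the flow hide information in the untransmitted channel.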

IV Experimental Results

This section evaluates the performance of ANFIC both objectively and subjectively. We first present the network architectures, training details, evaluation methodologies, and the baseline methods in Section IV-A. Next, we compare the rate-distortion performance of ANFIC with several state-of-the-art methods on commonly used datasets in Section IV-B. Lastly, we evaluate the subjective quality of the reconstructed images in Section IV-C.

Fig. 5: The network architecture of our proposed ANFIC. We adopt an autoregressive context model with a Gaussian mixture model for entropy coding. AC and mask denote arithmetic coding and masked convolution, respectively.

IV-A Settings and Implementation Details

Network Architectures: Our autoencoding transforms for feature extraction (the left branch in Fig. 5) and the hyperprior (the right branch in Fig. 5) share similar architectures to the VAE-based scheme in [23]. In addition, we use the same lightweight de-quantization network as in [22] for the quality enhancement network. All the autoencoding transforms in our model have separate network weights. To keep the overall model size comparable to that of [23], we reduce the number of channels in every convolutional layer to 128. We adopt the autoregressive context model with Gaussian mixtures (Section III-B) for entropy coding in all the experiments, with the number of mixture components set empirically to 3, which is found to be most effective in [6].

Training: For training, we use the vimeo-90k dataset [28]. It contains 91,701 training videos, each having 7 frames. In each training iteration, we randomly choose one frame from each video and randomly crop it. We adopt the Adam [14] optimizer with a batch size of 32. The learning rate is fixed during the first 3M iterations and then decayed for fine-tuning. The two hyper-parameters (see Eq. (12)) are chosen per rate point, with separate sets of values for MSE and MS-SSIM optimization. In particular, we first train our model for the highest rate point; it is then fine-tuned for a few epochs to obtain the models for the lower rate points.

Evaluation: We evaluate our model on the commonly used Kodak [18] and Tecnick [24] datasets, which include 24 and 40 uncompressed images, respectively. Additionally, we test our model on the CLIC validation datasets [26], which comprise two subsets: professional, with 41 higher-resolution images, and mobile, with 61 images. To evaluate the rate-distortion performance, we report rates in bits per pixel (bpp) and quality in PSNR-RGB and MS-SSIM. Moreover, we use BPG as the anchor in reporting BD-rates. Note that rate inflation relative to BPG is reflected by positive BD-rates, while rate saving is shown as negative BD-rates.
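The BD-rates reported here follow the Bjøntegaard metric. A compact numpy sketch of the standard computation (our own helper, using the common cubic-in-PSNR fit of log-rate):

```python
import numpy as np

def bd_rate(rates_anchor, psnr_anchor, rates_test, psnr_test):
    # Bjontegaard delta-rate: fit log-rate as a cubic in PSNR for each codec,
    # integrate over the overlapping quality range, and compare average rates.
    pa = np.polyfit(psnr_anchor, np.log(rates_anchor), 3)
    pt = np.polyfit(psnr_test, np.log(rates_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    return (np.exp((it - ia) / (hi - lo)) - 1) * 100  # % rate change vs. anchor

# Toy curves: the "test" codec needs 10% less rate at every quality point.
psnr = np.array([30.0, 32.0, 34.0, 36.0])
r_anchor = np.array([0.2, 0.4, 0.8, 1.6])
print(round(bd_rate(r_anchor, psnr, 0.9 * r_anchor, psnr), 1))  # → -10.0
```

A negative result means the test codec needs less rate than the anchor at equal quality, matching the sign convention used in Table I.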

Baselines: For comparison, the baseline methods include VTM-444 [27], BPG-444 [4], ICLR’18 [3], NIPS’18 [23], ICLR’19 [19], TPAMI’20 [22], CVPR’20 [6], TPAMI’21 [12], and TIP’21 [5]. It is worth noting that TPAMI’20 [22] is a flow-based model, while the other learned codecs are VAE-based.

IV-B Rate-Distortion Performance

Fig. 6: Rate-distortion performance evaluation on (a) Kodak (PSNR), (b) Kodak (MS-SSIM), (c) Tecnick (PSNR), (d) Tecnick (MS-SSIM), (e) CLIC (PSNR), and (f) CLIC (MS-SSIM). The numbers in parentheses are the BD-rates with BPG-444 as the anchor.

Fig. 6 compares the rate-distortion performance of the competing methods on the Kodak, Tecnick, and CLIC (professional and mobile combined) datasets, with the BD-rate numbers summarized in Table I. Following some prior works, the BD-rate figures for the CLIC professional dataset are reported separately in Table I.

In terms of PSNR-RGB, our method shows comparable performance to the state-of-the-art learned codecs, CVPR'20 [6] and TPAMI'20 [22], on the Kodak and CLIC datasets. Remarkably, it achieves the best performance among all the learned codecs on the Tecnick and CLIC datasets. It falls slightly short of the VTM model on the Kodak, Tecnick, and CLIC datasets. In particular, ANFIC displays a tendency to perform worse at low rates. This may be attributed to the fact that additive coupling layers are susceptible to the accumulation and propagation of quantization errors (Fig. 4). It is important to note in Table I that ANFIC is inferior to VTM in BD-rate saving by a significant margin on the CLIC Professional dataset. Careful examination of the dataset reveals that some images are extremely challenging and not typical of the images found in our training data; all the competing methods face the same issue. It is expected that increasing the diversity of the training data would help. Nevertheless, the superiority of ANFIC over BPG is apparent on all the datasets.

In terms of MS-SSIM, our method performs among the top two. It is slightly worse than the top performer, CVPR'20 [6], on the Kodak dataset, especially at low rates (see Fig. 6(b)), but is comparable to ICLR'19 [19], which achieves the best MS-SSIM performance on the CLIC dataset. It is worth mentioning that TPAMI'20 [22], a strong baseline when evaluated with PSNR-RGB, exhibits poor MS-SSIM results because the released model is optimized for MSE only. Also, as noted previously in other studies, the learned codecs outperform VTM and BPG considerably when trained and tested with MS-SSIM.

The model size comparison in Table I suggests that the rate-distortion benefits of ANFIC do not come at the expense of unreasonably large models. Its model size is between those of TPAMI'20 [22] and CVPR'20 [6], both of which show competitive rate-distortion performance.

Methods          Kodak   Tecnick   CLIC    CLIC Pro   Model Size
ICLR'18 [3]        5.0      2.9    13.2       -          12M
NIPS'18 [23]      -3.0    -15.6     0.3       -          20M
ICLR'19 [19]      -3.0    -10.1    -7.5    -13.8         73M
TPAMI'20 [22]    -16.3    -22.6   -18.1    -20.3         18M
CVPR'20 [6]      -14.5      -       -      -25.3         27M
TPAMI'21 [12]     -8.8    -15.0   -13.0       -          73M
ANFIC (Ours)     -15.3    -26.5   -18.5    -24.5         23M
VTM 444 [27]     -17.9    -29.7   -22.6    -31.3          -
TABLE I: Comparison of the BD-rate savings (%) and model sizes of the competing methods (optimized for MSE). The BD-rate savings are reported with BPG-444 serving as the anchor; more negative is better.

IV-C Subjective Quality Comparison

Fig. 7: Subjective quality comparison on an image from the Kodak dataset.
Fig. 8: Subjective quality comparison on an image from the Kodak dataset.
Fig. 9: Subjective quality comparison on an image from the Kodak dataset.

Figs. 7, 8, and 9 show subjective quality comparisons between ANFIC (ours), VVC, BPG, and TPAMI'20 [22] on three images from the Kodak dataset. It is seen that our MSE model achieves comparable subjective quality to VVC and TPAMI'20 [22]. As expected, ANFIC optimized for MSE tends to smooth highly-textured areas, while VVC and HEVC generate clear blocking artifacts in Fig. 8. In particular, TPAMI'20 [22] suffers from geometric distortion, especially in the "door" area in Fig. 7, and produces some artificial noisy dots on the "water surface" in Fig. 8. In contrast, our MS-SSIM model shows much better subjective quality, preserving most high-frequency details.

V Ablation Studies

In this section, we conduct ablation studies to understand ANFIC's properties. First, we show how the ANF framework improves the VAE-based scheme by stacking its autoencoding transform (Section V-A). Second, we investigate the effect of the quality enhancement network on ANFIC and its VAE-based counterpart (Section V-B). Third, we analyze the inner workings of ANFIC by visualizing the output of each autoencoding transform in both the spatial and frequency domains (Section V-C). Fourth, we study the compression performance of ANFIC across low and high rates (Section V-D). Finally, we extend ANFIC to support variable-rate compression and compare its performance with the other baselines (Section V-E). Unless otherwise specified, the Kodak dataset is used for the ablation experiments.

Fig. 10: Rate-distortion curves for different numbers of autoencoding transforms.
Fig. 11: Rate-distortion performance with and without the quality enhancement network.

V-A Number of Autoencoding Transforms

To see the rate-distortion benefits of stacking autoencoding transforms, we compare ANFIC, with a varied number of autoencoding transforms, against the VAE-based scheme [23]. It is important to note that the VAE-based scheme can be interpreted as one-step ANFIC (see Section III-A). For a fair comparison, the VAE-based scheme (which is modified from [23] by additionally including Gaussian mixture-based entropy coding and is termed NIPS’18 + GMM) and ANFIC share the same autoencoding architecture, entropy coding scheme, and quality enhancement network. To keep the model sizes comparable, the channel number of every autoencoding transform in ANFIC is set to 128 (see Fig. 5), while that of the VAE-based counterpart is 192. This ensures that ANFIC with two autoencoding transforms (the main setting used throughout this paper) has a model size similar to the VAE-based one. Nevertheless, when the number of autoencoding transforms increases beyond two, the model size of ANFIC grows linearly.

From Fig. 10, it is seen that increasing the number of autoencoding transforms from one layer (VAE-based) to two layers (Ours 2-step) improves the rate-distortion performance significantly. However, the gain diminishes sharply when the number goes beyond two. We thus choose two autoencoding transforms as our default setting.

V-B Effect of Quality Enhancement Network

Fig. 11 shows the effect of the quality enhancement network (as a post-processing network) on the rate-distortion performance of ANFIC and the VAE-based scheme [23].

We observe that ANFIC benefits more from the use of the quality enhancement network, which boosts the BD-rate saving of ANFIC more than it does that of the NIPS’18+GMM (VAE-based) scheme [23]. This suggests that ANFIC effectively separates image transformation and (quantization) error compensation into two orthogonal parts: the former is addressed by the invertible autoencoding transforms, while the latter relies on the quality enhancement network. The fact that the feature extraction and the image reconstruction in ANFIC have to go through the same invertible coupling layers makes it difficult to learn autoencoding transforms that handle both image representation and error compensation well. This, however, is not the case with the NIPS’18+GMM (VAE-based) scheme, where the analysis and synthesis transforms do not share the same network. Usually, the synthesis transform can learn to compensate partially for quantization errors. As such, the gain from the quality enhancement network becomes limited when the synthesis network is already capable enough. This result is in line with an earlier finding reported in [22] that flow-based models benefit more from post-processing.
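The compensation role of such a post-processing stage can be illustrated with a toy experiment: a linear post-filter fit by least squares, a crude stand-in for the learned enhancement network (the signal, quantizer, and filter size below are all hypothetical), reduces the error left behind by quantization:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "decoded" signal: a smooth ground-truth curve plus quantization error.
n = 512
truth = np.cumsum(rng.normal(size=n))
truth -= truth.mean()
decoded = np.round(truth / 2.0) * 2.0   # uniform quantizer, step size 2

# Hypothetical enhancement "network": a 9-tap linear filter fit by least
# squares to predict the ground truth from neighboring decoded samples.
taps = 9
X = np.stack([np.roll(decoded, s) for s in range(-(taps // 2), taps // 2 + 1)],
             axis=1)
w, *_ = np.linalg.lstsq(X, truth, rcond=None)
enhanced = X @ w

mse = lambda a, b: float(np.mean((a - b) ** 2))
```

A learned enhancement network plays the same role with a far more expressive nonlinear mapping.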

Fig. 12: Visualization of the autoencoding transform outputs and the decoder outputs of the autoencoding transforms in three-step ANFIC, where the average image intensity has been shifted to 128 for better viewing. The signal spectra in the frequency domain are plotted as heatmaps.

V-C Visualization of Autoencoding Transforms

Fig. 12 visualizes how our ANFIC model (see Fig. 5) transforms the input image step-by-step into a residual image and what information is captured by the corresponding latent code in each step. Additionally, the corresponding signal spectra in the frequency domain are presented to understand the system response of every autoencoding transform. To better visualize the evolution of signals, we extend the architecture in Fig. 5 to three-step ANFIC, with the final outputs being those of the third transform (instead of the second, as depicted in Fig. 5). Also presented in this figure are the decoder outputs of the autoencoding transforms (see Eq. (3) and Fig. 5), which reveal the information captured by the latent codes. As an example, the first autoencoding transform converts the image into a latent code, which is then decoded and subtracted from the input. Hence, the decoder output stands for an estimate of the input that is derived from the latent code.

From left to right in the top two rows, one can see that the high-frequency details of the input image are filtered out in successive autoencoding transforms, arriving at a residual image with little high-frequency information (see the sub-figure in the top-right corner). As such, the autoencoding transforms in ANFIC act as low-pass filters, whose cut-off frequency decreases with the increasing transform step in the feature extraction process. Because the residual image will be discarded during the reconstruction process, the remaining high-frequency details in it will be lost completely. Thus, ANFIC is lossy.

The decoder outputs of the autoencoding transforms further shed light on how the input is transformed step-by-step into a form suitable for compression. From left to right in the bottom two rows, we see that the output decoded from the first latent code presents a rough estimate of the input. Its spectrum looks similar to that of the input, but is not exactly the same. We conjecture that this first estimate focuses more on approximating the high-frequency part of the input. The corroborating fact is that when it is subtracted from the input, the resulting output has relatively less high-frequency information. This becomes even more obvious in the following autoencoding transform, whose decoder output addresses primarily the remaining mid-frequency part; as a result, the output of the second transform becomes an even lower-frequency signal. In the end, the latent code that will be compressed into the bitstream only needs to represent a low-pass filtered version of the original input, which is relatively easy to compress. The reconstruction process updates a zero image by those decoder outputs in reverse order, recovering the low-frequency, mid-frequency, and high-frequency details of the input step-by-step.
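This low-pass-then-subtract behavior can be mimicked with a toy one-dimensional version of the process, in which a moving-average filter stands in for the learned decode(encode(·)) round trip of each transform (the signal and filter widths below are illustrative, not from the paper):

```python
import numpy as np

def smooth(sig, k):
    """Moving-average low-pass filter: a crude stand-in for one
    decode(encode(.)) round trip of a learned autoencoding transform."""
    return np.convolve(sig, np.ones(k) / k, mode='same')

def hf_energy(sig, cutoff_bin=32):
    """Fraction of spectral energy above a given frequency bin."""
    spec = np.abs(np.fft.rfft(sig)) ** 2
    return spec[cutoff_bin:].sum() / spec.sum()

n = 256
t = np.arange(n)
x = (np.sin(2 * np.pi * 2 * t / n)             # low-frequency content
     + 0.5 * np.sin(2 * np.pi * 16 * t / n)    # mid-frequency content
     + 0.25 * np.sin(2 * np.pi * 64 * t / n))  # high-frequency content

residual, estimates = x, []
for k in (3, 9, 27):  # each step removes a band, as in Fig. 12
    est = residual - smooth(residual, k)  # "decoder output": mostly high freq.
    estimates.append(est)
    residual = residual - est             # what the next transform sees

# Reconstruction: start from zeros and add the decoder outputs in reverse
# order; whatever remains in the final residual is lost, i.e. coding is lossy.
recon = sum(reversed(estimates))
```

Running this, the high-frequency energy fraction of the residual shrinks with each step, while the kept estimates telescope back to the input minus the discarded residual, mirroring the spectra shown in Fig. 12.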

Fig. 13: Rate-distortion comparison between our ANFIC and VAE-based schemes across low and high rates.

V-D Compression Performance across Low and High Rates

This study investigates the compression performance of ANFIC over a wide range of bit rates. It is reported in [9, 22] that most VAE-based compression schemes suffer from the autoencoder limitation; that is, the reconstruction by the autoencoder is generally lossy, even without quantization. As a result, it is difficult for a VAE-based model to support efficient compression over a wide range of bit rates without changing the network architecture, for example, by adjusting the number of channels. ANFIC, although a flow-based model, is lossy because it discards the high-frequency information in the residual image (see Fig. 5) during reconstruction.

Fig. 13 compares ANFIC with two state-of-the-art VAE-based schemes over a wide range of bit rates. In particular, ANFIC has the same number of channels (i.e. 320 channels) in latent space as NIPS’18 [23], whereas CVPR’20 [6] has only 192 channels yet with a larger model size. The network architectures of all the competing models are fixed except that their network weights are trained separately for different rate points.

We see that ANFIC closely matches the performance of VTM from extremely low-rate compression up to nearly lossless compression, while the two VAE-based schemes tend to fall short of VTM and even BPG at high rates. The reasons why ANFIC is able to work well across low and high rates are two-fold: (1) the ANF-based backbone is fully invertible, and (2) our training strategies, which require the residual image to approximate noughts in the feature extraction process and substitute noughts exactly for it during reconstruction, force the image latent and its hyperprior to capture as much information about the input as possible (see Fig. 5).
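The first of these two reasons can be made concrete with a toy additive coupling pair: inversion is exact when the residual branch is kept, and zeroing the residual at decode time loses exactly the residual and nothing else (the matrices below are random placeholders for the learned networks, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy additive coupling step (cf. one autoencoding transform of ANF):
# W_enc / W_dec are random placeholders for the learned sub-networks.
W_enc = rng.normal(size=(4, 4))
W_dec = rng.normal(size=(4, 4))

def couple(x, z):
    z_new = z + x @ W_enc      # latent branch absorbs information from x
    x_new = x - z_new @ W_dec  # image branch becomes a residual
    return x_new, z_new

def uncouple(x_new, z_new):
    x = x_new + z_new @ W_dec  # exact inverses, applied in reverse order
    z = z_new - x @ W_enc
    return x, z

x = rng.normal(size=(1, 4))
x_res, z = couple(x, np.zeros((1, 4)))          # augmented input starts at zero

x_rec, _ = uncouple(x_res, z)                   # keeping the residual: exact
x_lossy, _ = uncouple(np.zeros_like(x_res), z)  # zeroing it: error == residual
```

Because the reconstruction error equals the discarded residual, training the residual toward noughts directly bounds the distortion, which is why the zero-substitution strategy matters.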

Fig. 14: Rate-distortion comparison between variable rate models. Multi-model: separate models for distinct rate points; Single-model: a single model for multiple rate points.

V-E Variable Rate Compression

Recognizing that ANFIC can work well over a wide range of bit rates, we take one step further and adapt ANFIC to variable rate compression with a single model. To this end, we implement the notion of conditional convolution in [7], replacing every convolutional layer with one that is conditional on the rate parameter (see Eq. (12)). The conditional convolution layer applies an affine transformation to every feature map, with the affine parameters derived from a network conditioned on the rate parameter. For the experiment, we train a single ANFIC model using 5 distinct values of the rate parameter. The training objective is an extension of Eq. (12) obtained by substituting the different rate parameters into Eq. (12) and averaging over the resulting variants.
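A minimal sketch of such a conditional layer, assuming a 1×1 convolution and a hypothetical one-weight conditioning network (the layer in [7] uses learned embeddings of the rate parameter; all names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
C_IN = C_OUT = 8
weight = rng.normal(size=(C_OUT, C_IN)) * 0.1  # shared 1x1 "convolution"

# Hypothetical conditioning network: maps the rate parameter to a
# per-channel scale and shift (the affine modulation described above).
U_scale = rng.normal(size=(1, C_OUT)) * 0.1
U_shift = rng.normal(size=(1, C_OUT)) * 0.1

def cond_conv(feat, lam):
    """feat: (pixels, C_IN) feature map; lam: scalar rate parameter."""
    emb = np.array([[np.log(lam)]])  # simple embedding of the rate parameter
    scale = np.exp(emb @ U_scale)    # exponent keeps the scale positive
    shift = emb @ U_shift
    return (feat @ weight.T) * scale + shift  # channel-wise affine modulation

feat = rng.normal(size=(16, C_IN))
low = cond_conv(feat, lam=0.001)   # low-rate operating point
high = cond_conv(feat, lam=0.1)    # high-rate operating point
```

The shared weights are reused at every rate point; only the cheap per-channel scale and shift change with the rate parameter, which is what allows a single model to cover multiple rates.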

Fig. 14 shows the rate-distortion comparison of the state-of-the-art variable rate models, including VVC, BPG, ICCV’19 [7], TPAMI’20 [22], and ANFIC (ours). Compared with our multi-model setting, our single-model setting performs comparably well, with slightly increased rate saving, which we attribute to training variance. It also shows comparable performance to VTM across the 5 rate points, and significantly outperforms the other learning-based methods in single-model mode.

VI Conclusion

In this paper, we propose an ANF-based image compression system (ANFIC). It is motivated by the fact that VAE, which forms the basis of most end-to-end learned image compression, is a special case of ANF and can be extended by ANF to offer greater expressiveness. ANFIC is the first work that introduces VAE-based compression into a flow-based framework, enjoying the benefits of both approaches. Experimental results show that ANFIC performs comparably to or better than the state-of-the-art learned image compression and is able to offer a wide range of quality levels without changing the network architecture. Furthermore, its variable rate version shows little performance degradation. Flow-based models are relatively new to learned image compression. We believe there remains wide-open space for further research.


  • [1] J. Ballé, V. Laparra, and E. P. Simoncelli (2017) End-to-end optimized image compression. In International Conference on Learning Representations, Cited by: §I, §II-A, §II-A.
  • [2] J. Ballé, V. Laparra, and E.P. Simoncelli (2017) End-to-end Optimized Image Compression. In International Conference on Learning Representations, Cited by: §II-A.
  • [3] J. Ballé, D. Minnen, S. Singh, S.J. Hwang, and N. Johnston (2018) Variational image compression with a scale hyperprior. In International Conference on Learning Representations, Cited by: §I, §II-A, §II-A, §II-A, §III-B, §IV-A, TABLE I.
  • [4] BPG image format, Cited by: §I, §IV-A.
  • [5] T. Chen, H. Liu, Z. Ma, Q. Shen, X. Cao, and Y. Wang (2021) End-to-End Learnt Image Compression via Non-Local Attention Optimization and Improved Context Modeling. IEEE Transactions on Image Processing 30, pp. 3179–3191. Cited by: §I, §II-A, §II-A, §II-A, §II-A, §IV-A.
  • [6] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto (2020) Learned Image Compression with Discretized Gaussian Mixture Likelihoods and Attention Modules. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §I, §II-A, §II-A, §II-A, §IV-A, §IV-A, §IV-B, §IV-B, §IV-B, TABLE I, §V-D.
  • [7] Y. Choi, M. El-Khamy, and J. Lee (2019) Variable Rate Deep Image Compression With a Conditional Autoencoder. In 2019 IEEE/CVF International Conference on Computer Vision, Vol. . External Links: Document Cited by: §I, §V-B, §V-E, §V-E.
  • [8] L. Dinh, J. Sohl-Dickstein, and S. Bengio (2017) Density estimation using real NVP. In International Conference on Learning Representations, External Links: Link Cited by: §I, §II-B.
  • [9] L. Helminger, A. Djelouah, M. Gross, and C. Schroers (2021) Lossy Image Compression with Normalizing Flows. In International Conference on Learning Representations Workshop, Cited by: §I, Fig. 1, §II-B, §III-A, §V-D.
  • [10] E. Hoogeboom, R. Peters, and M. Welling (2019) Integer Discrete Flows and Lossless Compression. In Advances in Neural Information Processing Systems, Cited by: §I, §II-B.
  • [11] Y. Hu, W. Yang, and J. Liu (2020) Coarse-to-Fine Hyper-Prior Modeling for Learned Image Compression. In AAAI Conference on Artificial Intelligence, Cited by: §I, §II-A, §II-A, §II-A.
  • [12] Y. Hu, W. Yang, Z. Ma, and J. Liu (2021) Learning End-to-End Lossy Image Compression: A Benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §I, §II-A, §II-A, §IV-A, TABLE I.
  • [13] C. Huang, L. Dinh, and A. C. Courville (2020) Augmented Normalizing Flows: Bridging the Gap Between Generative Flows and Latent Variable Models.. CoRR. Cited by: §I, §II-B, §II-C, §II-C, §II-C, §III-A.
  • [14] D. P. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, Cited by: §IV-A.
  • [15] D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, External Links: Link Cited by: §II-B.
  • [16] D. P. Kingma and M. Welling (2014) Auto-Encoding Variational Bayes. In International Conference on Learning Representations, Cited by: §II-A, §II-B, §II-C.
  • [17] I. Kobyzev, S. Prince, and M. Brubaker (2020) Normalizing Flows: An Introduction and Review of Current Methods. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §I, §II-B.
  • [18] Eastman Kodak Kodak lossless true color image suite (PhotoCD PCD0992). Cited by: §IV-A.
  • [19] J. Lee, S. Cho, and S.K. Beack (2019) Context-adaptive Entropy Model for End-to-end Optimized Image Compression. In International Conference on Learning Representations, Cited by: §I, §II-A, §II-A, §II-A, §IV-A, §IV-B, TABLE I.
  • [20] C. Lin, J. Yao, F. Chen, and L. Wang (2020) A Spatial RNN Codec for End-to-End Image Compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §I, §II-A, §II-A.
  • [21] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao (2019) DVC: An End-To-End Deep Video Compression Framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §I.
  • [22] H. Ma, D. Liu, N. Yan, H. Li, and F. Wu (2020) End-to-End Optimized Versatile Image Compression With Wavelet-Like Transform. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §I, §II-B, §III-A, §III-A, §IV-A, §IV-A, §IV-B, §IV-B, §IV-B, §IV-C, TABLE I, §V-B, §V-D, §V-E.
  • [23] D. Minnen, J. Ballé, and G. Toderici (2018) Joint Autoregressive and Hierarchical Priors for Learned Image Compression. In Neural Information Processing Systems, Cited by: §I, §II-A, §II-A, §II-A, §III-A, §IV-A, §IV-A, TABLE I, §V-A, §V-B, §V-D.
  • [24] A. Nicola and G. Andrea (2014) TESTIMAGES: a Large-scale Archive for Testing Visual Devices and Basic Image Processing Algorithms. In Smart Tools and Apps for Graphics - Eurographics Italian Chapter Conference, Cited by: §IV-A.
  • [25] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016) Pixel Recurrent Neural Networks. In Proceedings of International Conference on Machine Learning, Cited by: §II-A.
  • [26] G. Toderici, W. Shi, R. Timofte, L. Theis, J. Balle, E. Agustsson, N. Johnston, and F. Mentzer (2020) Workshop and Challenge on Learned Image Compression (CLIC2020). External Links: Link Cited by: §IV-A.
  • [27] VVC official test model VTM, Cited by: §I, §I, §IV-A, TABLE I.
  • [28] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman (2019) Video Enhancement with Task-Oriented Flow. International Journal of Computer Vision 127 (8), pp. 1106–1125. Cited by: §IV-A.
  • [29] Z. Wang, E. P. Simoncelli, and A. C. Bovik (2003) Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, Cited by: §III-C.