BINet: a binary inpainting network for patch-based image compression
Recent deep learning models outperform standard lossy image compression codecs. However, applying these models on a patch-by-patch basis requires that each image patch be encoded and decoded independently. The influence from adjacent patches is therefore lost, leading to block artefacts at low bitrates. We propose the Binary Inpainting Network (BINet), an autoencoder framework which incorporates binary inpainting to reinstate interdependencies between adjacent patches, for improved patch-based compression of still images. When decoding a patch, BINet additionally uses the binarised encodings from surrounding patches to guide its reconstruction. In contrast to sequential inpainting methods where patches are decoded based on previous reconstructions, BINet operates directly on the binary codes of surrounding patches without access to the original or reconstructed image data. Encoding and decoding can therefore be performed in parallel. We demonstrate that BINet improves the compression quality of a competitive deep image codec across a range of compression levels.
Over 60% of Internet byte content consists of still images Google Developers (2016). Efficient image compression is therefore essential in lowering transmission bandwidth and data storage costs. Lossy image compression is currently dominated by patch-based standard codecs such as JPEG Wallace (1991) and WebP Google Developers (2016). Patch-based encoding schemes are preferred to their full-resolution counterparts, as they are more memory efficient and are required by standard video codecs such as H.264/5 that rely on block motion estimation techniques Richardson and Vcodex (2010).
Careful engineering has enabled standard image codecs to perform well in most settings. But these codecs rely on arduous hand-tuned parameterisation, which can be particularly sensitive to settings outside of the domain for which they were designed. In contrast, deep neural networks are trained through loss-driven end-to-end optimisation, and deep image compression models have been shown to outperform standard image codecs Toderici et al. (2015); Ballé et al. (2016); Toderici et al. (2017); Theis et al. (2017); Rippel and Bourdev (2017); Agustsson et al. (2018); Johnston et al. (2018); Santurkar et al. (2018); Ballé et al. (2018); Mentzer et al. (2018); Li et al. (2018); Jooyoung Lee and Beack (2019). These deep approaches, although effective, are not optimised for patch-based encoding since they use the full image content to steer compression. Full image context is, unfortunately, not available for patch-based systems as each patch is encoded independently. Patch-based encoding is therefore avoided in deep compression models Toderici et al. (2015); Johnston et al. (2018), as it may result in block artefacts at low bitrates. To remedy this, we propose the Binary Inpainting Network (BINet) framework, which is inspired by research in image inpainting.
Image inpainting involves reconstructing a masked-out image region by using the surrounding pixels as context. It is often used as an error-correction strategy to restore patches lost during transmission. Traditional inpainting models, such as PixelCNN van den Oord et al. (2016), assume access to original pixel content; in Figure 1(a), the model would be asked to predict the shaded region in the middle, given the surrounding context as input. We extend this idea in order to perform patch-based image compression. When decoding a particular patch, BINet incorporates the compressed binary codes from adjacent image patches as well as the current patch to reinstate relationships between separately encoded regions. As depicted in Figure 1(b), BINet therefore exploits encoded binary information from a full-context region as well as the patch being inpainted in order to formulate its prediction of the inpainted region. The overall approach is illustrated in Figure 2: BINet encodes patches as discrete binary codes using a single encoder. The decoder then reconstructs a particular centre patch by incorporating the binary codes of surrounding patches. It therefore allows for parallel encoding and decoding of image patches aided by learned inpainting from a full binary context region.
In sequential compression techniques such as WebP Google Developers (2016), linear combinations of previously reconstructed outputs are used when decoding a particular patch. This is similar to sequential patch-based inpainting Baig et al. (2017), as illustrated in Figure 1(c), where previously decoded output from the model is treated as the context region and used to perform inpainting on the next patch. In contrast to these approaches, BINet decodes a particular patch, not based on previous patch reconstructions, but based directly on the binary encodings of the surrounding patches. Since it does not need to wait for surrounding patches to be decoded, BINet can decode all patches in parallel while still taking the full surrounding context into account.
BINet’s encoder and decoder are trained jointly through end-to-end optimisation. In contrast to Baig et al. (2017), where separate compression and inpainting networks are trained, BINet builds inpainting directly into its decoder architecture and does not require training an additional inpainting network. Our aim is to show that this approach allows spatial dependencies between patches to be re-instated from independently encoded patches, thereby advancing patch-based encoding in a neural compression model.
We proceed with a description of the BINet framework and the formulation of a loss function for learning binary encodings that exploit spatial redundancy between neighbouring image patches. BINet can be used with different types of encoder and decoder architectures, and in this work we specifically employ two competitive iterative decoding methods Toderici et al. (2015, 2017), namely additive reconstruction (AR) and one-shot reconstruction (OSR). We describe these specific instantiations of BINet in Section 2. To show the benefit of incorporating inpainting, the BINet models are compared to convolutional AR and OSR models without inpainting. Compression efficiency is evaluated quantitatively using the SSIM and PSNR image quality metrics. We show that BINet performs better than the conventional AR and OSR approaches over the complete range of compression levels considered (Section 4). On the standard Kodak dataset Kodak (1999), we show that the OSR variant of BINet consistently outperforms JPEG. Although it falls short of outperforming WebP, we show qualitatively that BINet produces smoother image reconstructions and is capable of more complex inpainting than the sequential decoding methods used by WebP. We have released a full implementation of BINet online at https://github.com/adnortje/binet.
BINet is a variation of a basic autoencoder Hinton and Salakhutdinov (2006). Figure 2 shows BINet’s encoding and decoding process. It accepts as input a set of image patches, indicated by (a) in the figure, that are reduced to low dimensional representations and binarised, as shown at (b). Binarisation is required for digitally storing and/or transmitting a compressed version of an image Richardson and Vcodex (2010). As in Toderici et al. (2015); Raiko et al. (2014), a stochastic binarisation function is used during training by adding uniform quantisation noise. This allows us to backpropagate gradients through the binarisation layer in the encoder by copying the gradients from the first decoder operation to the penultimate encoder layer. The decoder network at (c) is applied as a sliding window across the generated binary codes such that each image patch at (d) is decoded using both its own binary code and the codes of adjacent patches that fall within a specific grid region. Intuitively, because the encoder and decoder networks are trained jointly, the decoder learns to inpaint from binary codes within its context region whilst the encoder learns to produce more compact codes that promote the inpainting performed by the decoder. The same encoder network is applied to each individual image patch, meaning that encoding on multiple patches can be performed in parallel. In principle any model can be used as the encoder and decoder in Figure 2, which is why we refer to BINet as a framework.
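The forward pass of a stochastic binariser can be sketched as below. This is a minimal illustration of one common formulation (in the spirit of Toderici et al. (2015)), not necessarily the exact function used in BINet: it assumes encoder activations lie in [-1, 1] (for example after a tanh), and the function name is hypothetical. Because the expected code equals the input, gradients can be passed through the layer unchanged during training.

```python
import numpy as np

def stochastic_binarise(x, rng):
    """Map activations in [-1, 1] to codes in {-1, +1}, unbiased in expectation.

    Assumption of this sketch: P(b = +1) = (1 + x) / 2, so that E[b] = x.
    """
    x = np.asarray(x, dtype=np.float64)
    prob_plus = (1.0 + x) / 2.0
    return np.where(rng.random(x.shape) < prob_plus, 1.0, -1.0)

rng = np.random.default_rng(0)
activations = np.full(100_000, 0.5)      # every activation equals 0.5
codes = stochastic_binarise(activations, rng)
```

Averaged over many samples, the codes recover the underlying activation, which is what justifies treating the layer as an identity in the backward pass.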
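As depicted in Figure 2, the reconstruction of a patch from its compressed representation can be formulated as

$$\hat{\mathbf{p}}_c = D\big(E(\mathbf{p}_1), E(\mathbf{p}_2), \ldots, E(\mathbf{p}_K)\big), \tag{1}$$

where $E$ and $D$ represent the encoder and decoder mappings shown at (b) and (c), respectively. $\mathbf{p}_1, \ldots, \mathbf{p}_K$ represent the patches used as context for predicting the centre patch $\mathbf{p}_c$. The sliding window at the decoder can be implemented using unfold operations to maintain parallelisation, and takes the bits produced for $\mathbf{p}_1, \ldots, \mathbf{p}_K$ at (b) as context to make the prediction $\hat{\mathbf{p}}_c$. Note that the same encoder network is applied to each of the input image patches individually and in parallel. Edge regions of the binary codes are appropriately padded so that the spatial resolution of the input image is maintained. To learn how to inpaint, we use the $L_1$ loss:

$$\mathcal{L}_{\text{inp}} = \lVert \mathbf{p}_c - \hat{\mathbf{p}}_c \rVert_1, \tag{2}$$

where $\hat{\mathbf{p}}_c$ is equivalent to $D\big(E(\mathbf{p}_1), \ldots, E(\mathbf{p}_K)\big)$.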
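The sliding-window step can be illustrated with a small NumPy sketch that gathers, for every patch, the binary codes in its surrounding grid region. The function name, the 3 x 3 window and the zero padding at the edges are assumptions made for illustration; the released implementation may use a framework-specific unfold operation instead.

```python
import numpy as np

def gather_context(codes, radius=1):
    """Collect the codes in each patch's (2r+1) x (2r+1) neighbourhood.

    codes: array of shape (rows, cols, bits), one binary code per patch.
    Returns an array of shape (rows, cols, (2r+1)**2, bits).
    """
    rows, cols, bits = codes.shape
    # Zero-pad the grid edges so border patches also receive a full window
    # (the padding value is an assumption of this sketch).
    padded = np.pad(codes, ((radius, radius), (radius, radius), (0, 0)))
    window = 2 * radius + 1
    context = np.empty((rows, cols, window * window, bits), dtype=codes.dtype)
    for dy in range(window):
        for dx in range(window):
            context[:, :, dy * window + dx] = padded[dy:dy + rows, dx:dx + cols]
    return context

rng = np.random.default_rng(0)
codes = rng.choice([-1, 1], size=(4, 5, 8))   # 4 x 5 grid of 8-bit codes
context = gather_context(codes)               # shape (4, 5, 9, 8)
```

Every patch's context is built from the codes alone, so all windows can be handed to the decoder at once rather than in raster order.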
BINet can be used with different types of encoder and decoder networks, and here we consider two specific progressive architectures.
Additive reconstruction (AR) is widely used in traditional image codecs for variable bitrate encoding and progressive image enhancement Wallace (1991). Variable bitrate encoding entails assigning fewer bits to simpler image regions and vice versa, thereby reducing the overall bitrate on average. Progressive image compression involves encoding an image such that it can be reconstructed at various quality levels as bits are received by the decoder. Using AR, this is achieved by transmitting the difference (the residual) between successive compression iterations and the original image such that the decoder can enhance its reconstruction by adding subsequently received residuals Wallace (1991). The AR process is shown in Figure 3 and can be expressed mathematically as
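$$\mathbf{r}_t = \mathbf{r}_{t-1} - F_t(\mathbf{r}_{t-1}), \qquad \mathbf{r}_0 = \mathbf{p}. \tag{3}$$

Each autoencoder stage, $F_t$, attempts to reconstruct the residual error $\mathbf{r}_{t-1}$ from the previous stage, with $\mathbf{r}_0 = \mathbf{p}$ representing the original image Toderici et al. (2015). The reconstruction error $\mathbf{r}_t$ is then passed to the following network iteration, which attempts to reconstruct it. The final output image is obtained by summing the reconstructed residuals $F_t(\mathbf{r}_{t-1})$ produced across the network stages.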
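The additive scheme can be traced with a toy example in which each learned autoencoder stage is replaced by a coarse uniform quantiser. The quantiser, the step sizes and the function names are stand-ins chosen for illustration only; the point is the bookkeeping of residuals, not the quality of the stage itself.

```python
import numpy as np

def toy_stage(residual, step):
    """Stand-in for a learned autoencoder stage: coarse uniform quantisation."""
    return np.round(residual / step) * step

def additive_reconstruction(patch, steps):
    residual = patch.astype(np.float64)       # the first residual is the patch itself
    reconstruction = np.zeros_like(residual)
    errors = []
    for step in steps:
        estimate = toy_stage(residual, step)  # stage approximates the current residual
        reconstruction += estimate            # decoder adds each received estimate
        residual = residual - estimate        # what is left for the next stage
        errors.append(float(np.abs(residual).mean()))
    return reconstruction, residual, errors

rng = np.random.default_rng(0)
patch = rng.uniform(0.0, 1.0, size=(8, 8))
recon, final_residual, errors = additive_reconstruction(
    patch, steps=[0.5, 0.25, 0.125, 0.0625]
)
```

Stopping after fewer stages yields a coarser reconstruction from fewer bits, which is what makes the scheme progressive.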
One-shot reconstruction (OSR) is defined mathematically as follows Toderici et al. (2017):

$$\hat{\mathbf{p}}_t = F_t(\mathbf{r}_{t-1}), \qquad \mathbf{r}_t = \mathbf{p} - \hat{\mathbf{p}}_t, \qquad \mathbf{r}_0 = \mathbf{p}. \tag{4}$$

Each iteration, $F_t$, accepts the previously incurred residual error, $\mathbf{r}_{t-1}$, as input and uses it to reconstruct an improved quality approximation of the original image. OSR differs from AR in that the original image is reconstructed at each network stage as opposed to the previous stage’s residual. This is achieved by recurrent links that propagate encoder and decoder state information. The compression quality of the current iteration is thus influenced by relevant information from previous encodings and decodings that persist in the network’s memory. Figure 5 illustrates the Convolutional Gated Recurrent Unit (GRU) OSR system Toderici et al. (2017), denoted as ConvGRU-OSR.
Both AR and OSR can be used naturally with BINet. As baselines, we use the progressive ConvAR Toderici et al. (2015); Baig et al. (2017) and ConvGRU-OSR Toderici et al. (2017) networks shown in Figures 3 and 5, respectively. The reconstruction of an image patch for a single iteration of these models can be written as:
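$$\hat{\mathbf{p}} = D\big(E(\mathbf{p})\big). \tag{5}$$

Their patch reconstructions are therefore based on the encoding of a single input patch $\mathbf{p}$. In other words, they do not incorporate inpainting to aid compression. The training loss for both the ConvAR and ConvGRU-OSR baselines can be expressed as

$$\mathcal{L}_{\text{base}} = \sum_{t=1}^{N} \lVert \mathbf{r}_t \rVert_1, \tag{6}$$

where $\mathbf{r}_t$ is the residual remaining after iteration $t$ and $N$ is the number of iterations.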
The BINet framework is incorporated into ConvAR and ConvGRU-OSR by including learned binary inpainting at the first iteration, as shown in Figures 4 and 6. Later iterations encode the residual error incurred by this initial inpainting prediction. We only include inpainting at the first iteration, as intuitively this stage encodes details that contain the most spatial redundancy compared to later stages whose purpose is to encode finer and less correlated patch details. Future work may focus on ways of including binary inpainting at later network stages. Our goal is to show the benefit of this binary inpainting strategy.
The encoding process of BINetAR (Figure 4) and BINetOSR (Figure 6) can be expressed as in equations (3) and (4), where $\mathbf{r}_0$ again represents the original input image patch while $\mathcal{L}_{\text{inp}}$ is the initial iteration’s inpainting loss given in equation (2). An $N$-iteration implementation of BINet with either AR or OSR is trained to optimise the loss:

$$\mathcal{L}_{\text{BINet}} = \mathcal{L}_{\text{inp}} + \sum_{t=2}^{N} \lVert \mathbf{r}_t \rVert_1. \tag{7}$$
The models discussed in Section 2 are trained on the CLIC Compression Challenge Professional Dataset Freeman et al. (2018), which is pre-partitioned into training, validation and test sets. Each set contains a variety of professionally captured high resolution natural images, saved in lossless PNG format to prevent the learning of compression artefacts introduced by lossy codecs.
The loss functions in equations (6) and (7) are used to train $N$-iteration implementations of the baseline (ConvAR, ConvGRU-OSR) and BINet (BINetAR, BINetOSR) systems, respectively. All models are trained to encode and reconstruct randomly cropped image patches. Following the approach in Toderici et al. (2015), the networks are constrained such that each autoencoder stage contributes 0.125 bits per pixel (bpp) to the overall compression of an input image patch. During training, BINet encodes nine directly adjacent image patches independently and reconstructs the central patch region based on the binary codes produced for the nine patches. Training patches are randomly cropped from the images in the training set at every epoch while centre cropping is used on images in the validation set to ensure that the validation losses for the BINet and baseline models are directly comparable across epochs. Image patches used during training are batched into groups of 32 and normalised so that pixel values fall within a fixed range. Models are trained for 15 000 epochs and early stopping is employed based on the validation loss. (For the preliminary analyses in Section 4.1 we stop training at 5 000 epochs.) We use Adam optimisation Kingma and Ba (2014) with an initial learning rate of 0.0001. The learning rate is decayed by a factor of 2 at epochs 3 000, 10 000 and 14 000.
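The bit budget and learning-rate schedule described above reduce to simple arithmetic, sketched here. The 32-pixel patch side used in the example is a hypothetical value chosen for illustration, and the helper names are not taken from the released code; the schedule assumes the halving takes effect from the stated epoch onwards.

```python
def bits_per_iteration(patch_side, bpp_per_iteration=0.125):
    """Number of bits one stage may emit for a square patch."""
    return int(patch_side * patch_side * bpp_per_iteration)

def cumulative_bpp(iterations, bpp_per_iteration=0.125):
    """Total bitrate after a given number of stages."""
    return iterations * bpp_per_iteration

def learning_rate(epoch, base_lr=1e-4, milestones=(3000, 10000, 14000)):
    """Halve the learning rate at each milestone epoch that has been reached."""
    halvings = sum(epoch >= m for m in milestones)
    return base_lr / (2 ** halvings)

# Hypothetical 32 x 32 patch: 128 bits per stage, 2 bpp after 16 stages.
example_bits = bits_per_iteration(32)
example_rate = cumulative_bpp(16)
```

Under this constraint the bitrate is set purely by the number of stages that are run, so a 16-iteration model spans 0.125 to 2 bpp.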
Quantifying image quality in a way that aligns with the subjective nature of the human visual system is difficult. A subjective assessment using surveys on humans can be slow and prone to viewer bias, and may garner results that are not easily reproducible. Objective algorithms have therefore become the norm in assessing image compression models Richardson and Vcodex (2010); Wang et al. (2004). PSNR and SSIM measure the degree to which an image reconstruction corresponds to the original image. In both cases a higher score implies greater fidelity. SSIM falls within $[-1, 1]$ while PSNR (usually expressed in dB) can be any real value. We follow the procedure recommended in Wang et al. (2004) when calculating the SSIM of a compressed image: the final SSIM score is obtained by applying the SSIM index over smaller pixel regions in a convolutional manner on a per-channel basis, and averaging the results. We use the stabilising constants and Gaussian weighting process of Wang et al. (2004).
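For reference, PSNR follows directly from the mean squared error between the original and the reconstruction. The sketch below uses the standard definition; the `max_value` parameter is the peak pixel value and must match however the images are scaled (255 for 8-bit images is assumed as the default here).

```python
import numpy as np

def psnr(original, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the original."""
    original = np.asarray(original, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((original - reconstruction) ** 2)
    if mse == 0.0:
        return float("inf")                  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

reference = np.zeros((4, 4))
off_by_one = reference + 1.0                 # every pixel wrong by one grey level
score = psnr(reference, off_by_one)          # equals 20 * log10(255)
```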
For evaluation, each image is resized such that its dimensions are cleanly divisible by the chosen patch size. (We also ran tests on full unscaled images, and found that trends were exactly the same as when images are resized in this way, due to the models always compressing a fixed patch size irrespective of input image dimensions.) Images are then partitioned into patches and encoded, and quality scores are calculated on and averaged across the reassembled images. The performance of BINet is contrasted to that of the baseline systems at various bit depths in order to gauge the effectiveness of incorporating the proposed binary inpainting framework across different operating points. Additionally, we perform various preliminary analyses on validation data to further illustrate BINet’s capabilities.
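Partitioning an image into non-overlapping patches and reassembling it afterwards can be done with reshapes alone, as in the sketch below. The function names, the 64 x 96 image and the 32-pixel patch side are illustrative assumptions.

```python
import numpy as np

def to_patches(image, patch):
    """Split an (H, W, C) image into a (rows, cols, patch, patch, C) grid."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4)

def from_patches(patches):
    """Inverse of to_patches: stitch the grid back into one image."""
    rows, cols, patch, _, c = patches.shape
    return patches.transpose(0, 2, 1, 3, 4).reshape(rows * patch, cols * patch, c)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 96, 3), dtype=np.uint8)
patches = to_patches(image, 32)              # 2 x 3 grid of 32 x 32 patches
restored = from_patches(patches)
```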
We first perform a preliminary analysis on development data to better understand the properties of BINet and the benefit of binary inpainting as opposed to conventional sequential inpainting techniques. We then turn to quantitative analyses on test data where BINet is compared to the baseline neural compression models as well as standard image compression codecs.
In order to assess qualitatively whether inpainting of image patches from compressed binary codes is possible, a 1-iteration implementation of BINetAR (0.125 bpp) is trained to explicitly predict the pixel content of an unknown patch region located at the centre of its context grid. This version of BINet is purposefully altered such that it masks bits pertaining to the central patch region, i.e. the context region available to the decoder matches that of Figure 1(a). This forces the network to become fully reliant on the binary encodings of surrounding patches when predicting the central patch’s pixel content.
Figure 7 demonstrates the inpainting capabilities of this masked BINet, and indicates that it is able to predict a basis for an unknown patch using the compressed binary codes of its nearest neighbours. Figure 8 compares inpaintings from BINet (green border) and WebP (red border). The four main modes used by WebP to sequentially predict a patch region are included in the diagram and abbreviated as in Google Developers (2016). The modes either average (DC_PRED), directly copy (H_PRED, V_PRED), or linearly combine (TM_PRED) pixels from previously decoded patches. The figure shows that the inpaintings produced by BINet resemble the ground truth patches (black border) more closely than those of WebP.
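To make the comparison concrete, the four prediction modes named above can be written in a few lines. This is a simplified sketch of their usual VP8-style definitions, using only the row above and the column to the left of a block; the real codec applies them per block with additional edge handling that is omitted here, and the function name is hypothetical.

```python
import numpy as np

def predict_block(above, left, top_left, mode):
    """Predict an N x N block from previously decoded neighbouring pixels.

    above: the N pixels directly above the block.
    left: the N pixels directly to its left.
    top_left: the single pixel diagonally above and to the left.
    """
    above = np.asarray(above, dtype=np.int64)
    left = np.asarray(left, dtype=np.int64)
    n = above.size
    if mode == "DC_PRED":                    # rounded average of all neighbours
        return np.full((n, n), (above.sum() + left.sum() + n) // (2 * n))
    if mode == "V_PRED":                     # copy the row above downwards
        return np.tile(above, (n, 1))
    if mode == "H_PRED":                     # copy the left column across
        return np.tile(left[:, None], (1, n))
    if mode == "TM_PRED":                    # left + above - corner, clipped
        return np.clip(left[:, None] + above[None, :] - top_left, 0, 255)
    raise ValueError(f"unknown mode: {mode}")

above, left, corner = [10, 20, 30, 40], [10, 10, 10, 10], 10
dc = predict_block(above, left, corner, "DC_PRED")
tm = predict_block(above, left, corner, "TM_PRED")
```

All four modes are fixed functions of pixels that have already been decoded, whereas BINet learns its predictor and feeds it binary codes from every side of the patch.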
In this experiment we compare full-context binary inpainting to the sequential inpainting scheme proposed by Baig et al. (2017). A masked 1-iteration realisation of BINetAR is pitted against the Sequential Inpainting Network (SINet) in Figure 9. SINet consists of a pre-trained image compression model (ConvAR) coupled to an inpainting network (ConvAR decoder). SINet’s inpainting network is trained to sequentially predict the central patch from previously decoded patches such that its context region is like that of Figure 1(c). Table 1 compares the average SSIM and PSNR scores achieved by BINet and SINet on the validation set. BINet’s full-context binary inpainting mechanism leads to a 6% improvement in SSIM and an 11% increase in PSNR relative to the partial-context sequential inpainting performed by SINet. Figure 10 illustrates how BINet’s ability to harness pixel content from a full context region aids its inpainting ability. BINet (green border) correctly identifies that the lower right-hand corner of its inpainting should be white, whereas SINet (red border) is oblivious to this due to its limited context region. Importantly, BINet has a major additional benefit in that it can be parallelised, since reconstruction of a particular patch is not performed based on previously decoded patches but rather directly on the binary codes of all surrounding patches.
To determine if teaching a model to inpaint from binary codes aids its compression capabilities, 1-iteration (0.125 bpp) implementations of BINetAR and the baseline ConvAR are pitted against each other. Figure 11 demonstrates how BINetAR outperforms ConvAR quantitatively in terms of training and validation loss. Losses represent the mean error between the ground truth and predicted patches and are indicative of the quality of the model’s patch reconstructions. Figure 12 shows an assortment of images encoded by BINetAR and ConvAR. Note that in each case BINetAR produces images with a higher perceptual fidelity than ConvAR, according to the SSIM and PSNR scores achieved by its reconstructions. The images produced by BINetAR are qualitatively smoother than those of ConvAR at equally low bitrates, making BINetAR better suited for patch-based compression. The improved smoothness can be attributed to BINetAR’s decoder which learns to constrain a patch to match its surroundings. All the images used here are from the CLIC validation set Freeman et al. (2018).
We now turn to quantitative analyses on the CLIC test set Freeman et al. (2018). We train 16-iteration implementations of BINetAR and ConvAR to assess the effect of incorporating a single inpainting stage on the performance of an AR model. We first consider reconstruction of single patches (an intrinsic measure) and then consider the more realistic evaluation on full images.
We first assess the abilities of BINetAR and ConvAR to reconstruct single image patches centre-cropped from the test set. Each model is trained to compress patches, for an intrinsic evaluation of model performance. The resulting SSIM and PSNR rate-distortion curves are shown in Figures 13(a) and 13(b). The variable bit rate is achieved by varying the number of encoding iterations from 1 to 16. The figures indicate that at low bitrates, close to the inpainting layer, BINetAR gives a small but consistent improvement over ConvAR.
Dynamic bit assignment entails encoding different patch regions with varying bit allocations governed by a predetermined quality threshold such as PSNR. This aids compression as image regions are not necessarily equally complex. At low bitrates close to the inpainting layer BINetAR consistently produces patches of a higher quality than ConvAR. This means that if dynamic bit assignment were implemented, BINetAR would reach target quality thresholds after fewer encoding iterations (resulting in fewer bits) compared to ConvAR.
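The stopping rule behind dynamic bit assignment can be sketched as follows. The per-iteration PSNR values in the example are invented purely to exercise the function and are not results from this work; the function name is likewise hypothetical.

```python
def iterations_needed(psnr_per_iteration, target_psnr):
    """Return how many encoding iterations a patch needs to reach the target.

    psnr_per_iteration[i] is the patch quality after i + 1 iterations.
    If the target is never reached, all available iterations are used.
    """
    for iteration, quality in enumerate(psnr_per_iteration, start=1):
        if quality >= target_psnr:
            return iteration
    return len(psnr_per_iteration)

# Illustrative (made-up) quality curve for one patch.
example_curve = [24.0, 27.5, 30.1, 31.0]
stages = iterations_needed(example_curve, target_psnr=30.0)
bitrate = stages * 0.125                     # bpp spent on this patch
```

A model whose early iterations score higher crosses the threshold sooner, so it spends fewer bits on the same patch.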
The average PSNR and SSIM scores achieved by BINetAR and ConvAR on test images are compared in Table 2, for various bit allocations. Although improvements are small, BINetAR consistently outperforms ConvAR across all bit depths. If one compares the first iteration of the two models, BINetAR outperforms ConvAR by 8% in terms of SSIM and results in a 3% relative improvement in PSNR. This comparison between the first iteration of the models is important as binary inpainting is only incorporated at the first stage of the BINetAR model.
Table 2: Average PSNR and SSIM scores of BINetAR and ConvAR on test images at 0.125 bpp, 0.25 bpp and 0.5 bpp.
Sixteen-iteration implementations of BINetOSR and ConvGRU-OSR are trained to assess the effect of incorporating a single inpainting stage on the performance of an OSR model. Models are again evaluated on the CLIC test set Freeman et al. (2018).
We first assess BINetOSR’s and ConvGRU-OSR’s intrinsic capacity to reconstruct patches centre-cropped from the test data. The resulting areas under the PSNR and SSIM rate-distortion curves are given in Table 3. Note that a greater area is indicative of increased perceptual quality across all sixteen allocated bitrates. Table 3 shows that incorporating learned inpainting into just one iteration of the ConvGRU-OSR model effectively increases its area under the PSNR and SSIM rate-distortion curves.
Table 3: Area under the PSNR and SSIM rate-distortion curves for BINetOSR and ConvGRU-OSR.
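An area under a rate-distortion curve can be computed from sampled points with the trapezoidal rule, as sketched below. The trapezoidal rule is an assumption of this example, since the integration method is not stated here, and the function name is hypothetical.

```python
def area_under_curve(bitrates, scores):
    """Trapezoidal area under a quality curve sampled at increasing bitrates."""
    if len(bitrates) != len(scores):
        raise ValueError("bitrates and scores must have the same length")
    points = list(zip(bitrates, scores))
    return sum(
        (b1 - b0) * (s0 + s1) / 2.0
        for (b0, s0), (b1, s1) in zip(points, points[1:])
    )

# Sixteen operating points spaced 0.125 bpp apart, as in the 16-iteration models.
rates = [0.125 * i for i in range(1, 17)]
flat_area = area_under_curve(rates, [1.0] * 16)   # constant curve: width x height
```

Comparing areas summarises a model's quality over the whole bitrate range in a single number, at the cost of hiding where along the curve the gains occur.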
The PSNR and SSIM curves achieved by BINetOSR and ConvGRU-OSR on test images are shown in Figures 14(a) and 14(b). Unlike the AR model, inpainting gains are more pronounced at stages further from the inpainting layer, as recurrence allows BINetOSR to better propagate inpainting information to later decoding stages. This forces the first stage to learn an inpainting strategy that is beneficial to the system as a whole as opposed to BINetAR where improvements are concentrated around the inpainting layer. Again, while performance gains are small, they are consistent over the bit rates considered.
Up to now we focused on incorporating BINet into the ConvAR and ConvGRU-OSR approaches in order to investigate the effect of binary inpainting on patch-based compression in isolation. Here we compare BINet to standard image codecs. Evaluation is carried out on full images from the Kodak dataset Kodak (1999), resized to fixed dimensions. Figure 15 compares BINet’s SSIM performance to that of WebP and JPEG. BINetAR outperforms JPEG at low bitrates, but gradually worsens at bitrates produced by stages further from the inpainting layer. JPEG is patch-based, and Figure 16 indicates that block artefacts arising through the use of an independent patch-based encoding scheme can be suppressed by BINetAR for enhanced quality at low bit allocations.
We showed in Section 4 that BINet is capable of learning more complex inpainting predictions than WebP. Although the inclusion of a binary inpainting stage does consistently improve ConvGRU-OSR’s performance (Figure 14), BINetOSR still falls short of outperforming WebP (Figure 15). Using the same decoder module for inpainting and image patch reconstruction may result in a conflict between learning compression and the high quality inpainting shown in Section 4. One must also take into consideration that WebP and JPEG’s codes are further compressed by lossless entropy coding, whereas BINet’s codes are not.
Our aim here was not to achieve state-of-the-art performance, but rather to investigate whether binary inpainting improves patch-based image compression in a deep neural network model, and this was shown in Sections 4.2 and 4.3. Any model can be used as the basis in the BINet framework, and future work will consider incorporating BINet into the more powerful recurrent models of Johnston et al. (2018).
We introduced the Binary Inpainting Network (BINet), a novel framework that can be used to improve an existing system for patch-based image compression. Building on ideas from image inpainting as well as deep image compression, BINet is novel in two particular ways. Firstly, in contrast to work on inpainting, BINet incorporates explicit binarisation in an encoder module, which allows it to be used for compression. Secondly, in contrast to most deep compression models, BINet incorporates information from adjacent patches when decoding a particular patch. The result is a patch-based compression method which allows for parallelised inpainting from a full-context region without access to original image data. In quantitative evaluations, we showed that BINet yields small but consistent improvements over baselines without inpainting. Qualitatively we showed that BINet results in fewer block artefacts at shallow bitrates compared to standard image codecs, resulting in smoother image reconstructions.
Apart from incorporating BINet into more advanced neural architectures in future work, we aim to also explore alternative applications for binary inpainting such as binary error correction and patch-based video-frame interpolation.
Ballé et al. (2018). Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436.
Mentzer et al. (2018). Conditional probability models for deep image compression. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).
Toderici et al. (2015). Variable rate image compression with recurrent neural networks. In International Conference On Learning Representations (ICLR).
International Conference on Machine Learning (ICML).