BINet: a binary inpainting network for deep patch-based image compression

by André Nortje, et al.
Stellenbosch University

Recent deep learning models outperform standard lossy image compression codecs. However, applying these models on a patch-by-patch basis requires that each image patch be encoded and decoded independently. The influence from adjacent patches is therefore lost, leading to block artefacts at low bitrates. We propose the Binary Inpainting Network (BINet), an autoencoder framework which incorporates binary inpainting to reinstate interdependencies between adjacent patches, for improved patch-based compression of still images. When decoding a patch, BINet additionally uses the binarised encodings from surrounding patches to guide its reconstruction. In contrast to sequential inpainting methods where patches are decoded based on previous reconstructions, BINet operates directly on the binary codes of surrounding patches without access to the original or reconstructed image data. Encoding and decoding can therefore be performed in parallel. We demonstrate that BINet improves the compression quality of a competitive deep image codec across a range of compression levels.







1 Introduction

Over 60% of Internet byte content consists of still images Google Developers (2016). Efficient image compression is therefore essential in lowering transmission bandwidth and data storage costs. Lossy image compression is currently dominated by patch-based standard codecs such as JPEG Wallace (1991) and WebP Google Developers (2016). Patch-based encoding schemes are preferred to their full-resolution counterparts, as they are more memory efficient and are required by standard video codecs such as H.264/5 that rely on block motion estimation techniques Richardson and Vcodex (2010).

Careful engineering has enabled standard image codecs to perform well in most settings. But these codecs suffer from arduous hand-tuned parameterisation, which can be particularly sensitive to settings outside of the domain for which they were designed. In contrast, deep neural networks are trained through loss-driven end-to-end optimisation, and deep image compression models have been shown to outperform standard image codecs Toderici et al. (2015); Ballé et al. (2016); Toderici et al. (2017); Theis et al. (2017); Rippel and Bourdev (2017); Agustsson et al. (2018); Johnston et al. (2018); Santurkar et al. (2018); Ballé et al. (2018); Mentzer et al. (2018); Li et al. (2018); Jooyoung Lee and Beack (2019). These deep approaches, although effective, are not optimised for patch-based encoding since they use the full image content to steer compression. Full image context is, unfortunately, not available for patch-based systems as each patch is encoded independently. Patch-based encoding is therefore avoided in deep compression models Toderici et al. (2015); Johnston et al. (2018), as it may result in block artefacts at shallow bitrates. To remedy this, we propose the Binary Inpainting Network (BINet) framework, which is inspired by research in image inpainting.

Image inpainting involves reconstructing a masked-out image region by using the surrounding pixels as context. It is often used as an error-correction strategy to restore patches lost during transmission. Traditional inpainting models, such as PixelCNN van den Oord et al. (2016), assume access to original pixel content; in Figure 1(a), the model would be asked to predict the shaded region in the middle, given the surrounding context as input. We extend this idea in order to perform patch-based image compression. When decoding a particular patch, BINet incorporates the compressed binary codes from adjacent image patches as well as the current patch to reinstate relationships between separately encoded regions. As depicted in Figure 1(b), BINet therefore exploits encoded binary information from a full-context region as well as the patch being inpainted in order to formulate its prediction of the inpainted region. The overall approach is illustrated in Figure 2: BINet encodes patches as discrete binary codes using a single encoder. The decoder then reconstructs a particular centre patch by incorporating the binary codes of surrounding patches. It therefore allows for parallel encoding and decoding of image patches aided by learned inpainting from a full binary context region.

In sequential compression techniques such as WebP Google Developers (2016), linear combinations of previously reconstructed outputs are used when decoding a particular patch. This is similar to sequential patch-based inpainting Baig et al. (2017), as illustrated in Figure 1(c), where previously decoded output from the model is treated as the context region and used to perform inpainting on the next patch. In contrast to these approaches, BINet decodes a particular patch, not based on previous patch reconstructions, but based directly on the binary encodings of the surrounding patches. Since it does not need to wait for surrounding patches to be decoded, BINet can decode all patches in parallel while still taking the full surrounding context into account.

BINet’s encoder and decoder are trained jointly through end-to-end optimisation. In contrast to Baig et al. (2017), where separate compression and inpainting networks are trained, BINet builds inpainting directly into its decoder architecture and does not require training an additional inpainting network. Our aim is to show that this approach allows spatial dependencies between patches to be re-instated from independently encoded patches, thereby advancing patch-based encoding in a neural compression model.

We proceed with a description of the BINet framework and the formulation of a loss function for learning binary encodings that exploit spatial redundancy between neighbouring image patches. BINet can be used with different types of encoder and decoder architectures, and in this work we specifically employ two competitive iterative decoding methods Toderici et al. (2015, 2017), namely additive reconstruction (AR) and one-shot reconstruction (OSR). We describe these specific instantiations of BINet in Section 2. To show the benefit of incorporating inpainting, the BINet models are compared to convolutional AR and OSR models without inpainting. Compression efficiency is evaluated quantitatively using the SSIM and PSNR image quality metrics. We show that BINet performs better than the conventional AR and OSR approaches over the complete range of compression levels considered (Section 4). On the standard Kodak dataset Kodak (1999), we show that the OSR variant of BINet consistently outperforms JPEG. Although it falls short of outperforming WebP, we show qualitatively that BINet produces smoother image reconstructions and is capable of more complex inpainting than the sequential decoding methods used by WebP. We have released a full implementation of BINet online.

Figure 1: Context regions available to various inpainting models: (a) traditional, (b) BINet, (c) sequential. In traditional inpainting (a), a masked-out region is predicted based solely on context from surrounding regions. In BINet (b), the middle region is decoded based on the binary codes of the patch of interest as well as the binary codes from adjacent image patches. In sequential compression methods (c), patches are decoded in a pre-specified order; these reconstructed patches can then be used as the context region when decoding the next patch in the sequence.
Figure 2: The Binary Inpainting Network (BINet) framework. Compressed binary codes are illustrated here as two bits per patch.

2 Binary Inpainting Network (BINet)

2.1 Architectural Overview

BINet is a variation of a basic autoencoder Hinton and Salakhutdinov (2006). Figure 2 shows BINet’s encoding and decoding process. It accepts as input a set of image patches, indicated by (a) in the figure, that are reduced to low dimensional representations and binarised, as shown at (b). Binarisation is required for digitally storing and/or transmitting a compressed version of an image Richardson and Vcodex (2010). As in Toderici et al. (2015); Raiko et al. (2014), a stochastic binarisation function is used during training by adding uniform quantisation noise. This allows us to backpropagate gradients through the binarisation layer in the encoder by copying the gradients from the first decoder operation to the penultimate encoder layer. The decoder network at (c) is applied as a sliding window across the generated binary codes such that each image patch at (d) is decoded using both its own binary code and the codes of adjacent patches that fall within a specific grid region. Intuitively, because the encoder and decoder networks are trained jointly, the decoder learns to inpaint from binary codes within its context region whilst the encoder learns to produce more compact codes that promote the inpainting performed by the decoder. The same encoder network is applied to each individual image patch, meaning that encoding on multiple patches can be performed in parallel. In principle any model can be used as the encoder and decoder in Figure 2, which is why we refer to BINet as a framework.
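The binarisation step can be illustrated with a minimal sketch. It assumes the stochastic binariser of Toderici et al. (2015), in which an activation x in [-1, 1] is mapped to +1 with probability (1 + x)/2 and to -1 otherwise, so that the expected code equals x. The function names are illustrative and are not taken from the BINet implementation; the backward pass is shown only as a pass-through of the incoming gradient.

```python
import random

def binarise_stochastic(x, rng):
    """Training-time binariser (assumed form): returns +1 with probability (1 + x) / 2."""
    p_plus = (1.0 + x) / 2.0
    return 1 if rng.random() < p_plus else -1

def binarise_deterministic(x):
    """Inference-time thresholding: -1 for negative inputs, +1 otherwise."""
    return -1 if x < 0 else 1

def straight_through_grad(upstream_grad):
    """Backward pass: the gradient is copied through the binariser unchanged."""
    return upstream_grad

rng = random.Random(0)
samples = [binarise_stochastic(0.5, rng) for _ in range(20000)]
mean_code = sum(samples) / len(samples)  # close to 0.5 in expectation
```

Because the expected output equals the input, copying the gradient through the binariser is a reasonable approximation during training.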

As depicted in Figure 2, the reconstruction of a patch from its compressed representation can be formulated as

$$\hat{P}_5 = D\big(E(P_1), E(P_2), \dots, E(P_9)\big) \tag{1}$$

where $E$ and $D$ represent the encoder and decoder mappings shown at (b) and (c), respectively. $P_1, \dots, P_9$ represent the patches used as context for predicting the centre patch $P_5$. The sliding window at the decoder can be implemented using unfold operations to maintain parallelisation, and takes the bits produced for $P_1, \dots, P_9$ at (b) as context to make the prediction $\hat{P}_5$. Note that the same encoder network is applied to each of the input image patches individually and in parallel. Edge regions of the binary codes are appropriately padded so that the spatial resolution of the input image is maintained. To learn how to inpaint, we use the reconstruction loss

$$L_{\text{inp}} = \lVert \hat{P}_5 - P_5 \rVert \tag{2}$$

where $\hat{P}_5$ is equivalent to $D\big(E(P_1), \dots, E(P_9)\big)$.
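The sliding-window context gathering can be sketched in plain Python. This is an illustration of the idea rather than the unfold-based implementation mentioned above. The zero-valued padding code and the function name `gather_contexts` are assumptions made here; the text only states that edge regions are appropriately padded.

```python
def gather_contexts(codes, pad_code):
    """For every patch position, collect the 3x3 neighbourhood of binary codes.

    codes is a rows x cols grid (list of lists) holding one code per patch.
    Returns a grid of the same shape whose entries are lists of nine codes in
    row-major order, so index 4 is the patch's own code.
    """
    rows, cols = len(codes), len(codes[0])
    contexts = []
    for r in range(rows):
        row_contexts = []
        for c in range(cols):
            window = []
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < rows and 0 <= cc < cols
                    # Positions outside the grid receive the padding code.
                    window.append(codes[rr][cc] if inside else pad_code)
            row_contexts.append(window)
        contexts.append(row_contexts)
    return contexts

# Toy 2x3 grid of 2-bit codes, one code per patch.
codes = [[(0, 1), (1, 1), (1, 0)],
         [(0, 0), (1, 0), (0, 1)]]
contexts = gather_contexts(codes, pad_code=(0, 0))
```

Each entry of `contexts` depends only on the encoder outputs, so a decoder applied to these entries can process all patch positions independently of one another.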

2.2 Progressive BINet Architectures

BINet can be used with different types of encoder and decoder networks, and here we consider two specific progressive architectures.

2.2.1 Additive Reconstruction (AR)

Additive reconstruction (AR) is widely used in traditional image codecs for variable bitrate encoding and progressive image enhancement Wallace (1991). Variable bitrate encoding entails assigning fewer bits to simpler image regions and vice versa, thereby reducing the overall bitrate on average. Progressive image compression involves encoding an image such that it can be reconstructed at various quality levels as bits are received by the decoder. Using AR, this is achieved by transmitting the difference (the residual) between successive compression iterations and the original image such that the decoder can enhance its reconstruction by adding subsequently received residuals Wallace (1991). The AR process is shown in Figure 3 and can be expressed mathematically as

$$r_t = r_{t-1} - F_t(r_{t-1}), \qquad r_0 = x \tag{3}$$

Each autoencoder stage, $F_t$, attempts to reconstruct the residual error $r_{t-1}$ from the previous stage, with $r_0 = x$ representing the original image Toderici et al. (2015). The reconstruction error $r_t$ is then passed to the following network iteration, which attempts to reconstruct it. The final output image is obtained by summing the residual reconstructions $F_t(r_{t-1})$ produced across the network stages.
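The additive reconstruction loop can be sketched with a toy stage in place of a trained autoencoder. The uniform quantiser used for each stage is an assumption made purely for illustration. What the sketch shows is the control flow: each stage codes the residual left by the previous one, and the output is the running sum of the stage outputs.

```python
def make_stage(step):
    """Toy stand-in for one autoencoder stage: rounds each value to a multiple of step."""
    def stage(residual):
        return [step * round(v / step) for v in residual]
    return stage

def additive_reconstruction(patch, stages):
    """Run the stages in order; each one codes the residual left by the previous stage."""
    residual = list(patch)                  # the first stage sees the original patch
    reconstruction = [0.0] * len(patch)
    errors = []
    for stage in stages:
        estimate = stage(residual)          # this stage's reconstruction of the residual
        reconstruction = [a + b for a, b in zip(reconstruction, estimate)]
        residual = [r - e for r, e in zip(residual, estimate)]
        errors.append(max(abs(v) for v in residual))
    return reconstruction, errors

patch = [0.83, -0.41, 0.27, 0.05]
stages = [make_stage(0.5), make_stage(0.25), make_stage(0.125)]
reconstruction, errors = additive_reconstruction(patch, stages)
```

With these toy stages the largest remaining error cannot grow from one stage to the next, which mirrors the progressive enhancement described above.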

Figure 3: Two-iteration implementation of the Convolutional AR (ConvAR) baseline model.
Figure 4: Two-iteration implementation of BINet with additive reconstruction (BINetAR). Binary inpainting is integrated into the first iteration of the ConvAR model in Figure 3.

2.2.2 One-Shot Reconstruction (OSR)

One-shot reconstruction (OSR) is defined mathematically as follows Toderici et al. (2017):

$$r_t = r_0 - F_t(r_{t-1}), \qquad r_0 = x \tag{4}$$

Each iteration, $F_t$, accepts the previously incurred residual error, $r_{t-1}$, as input and uses it to reconstruct an improved quality approximation of the original image. OSR differs from AR in that the original image is reconstructed at each network stage as opposed to the previous stage’s residual. This is achieved by recurrent links that propagate encoder and decoder state information. The compression quality of the current iteration is thus influenced by relevant information from previous encodings and decodings that persist in the network’s memory. Figure 5 illustrates the Convolutional Gated Recurrent Unit (GRU) OSR system Toderici et al. (2017), denoted as ConvGRU-OSR.
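The structural difference from AR can be sketched as follows. The stage below is a toy stand-in: its internal state is simply a remembered running estimate, which is an assumption made for illustration and not how a convolutional GRU works internally. The point is that every iteration returns a full approximation of the patch, and the residual is always measured against the original.

```python
class RecurrentStage:
    """Toy stand-in for a recurrent stage: remembers its running estimate between calls."""

    def __init__(self, size):
        self.state = [0.0] * size  # persists across iterations, like recurrent memory

    def __call__(self, residual, step):
        # Refine the remembered estimate using a coarsely quantised residual.
        self.state = [s + step * round(r / step) for s, r in zip(self.state, residual)]
        return list(self.state)    # a full approximation of the patch, not a residual

def one_shot_reconstruction(patch, steps):
    """Return the full-patch estimate produced at every iteration."""
    stage = RecurrentStage(len(patch))
    residual = list(patch)
    outputs = []
    for step in steps:
        estimate = stage(residual, step)
        # The residual is always computed against the original patch.
        residual = [p - e for p, e in zip(patch, estimate)]
        outputs.append(estimate)
    return outputs

patch = [0.83, -0.41, 0.27, 0.05]
outputs = one_shot_reconstruction(patch, steps=[0.5, 0.25, 0.125])
```

Any single entry of `outputs` can be used directly as the decoded patch, whereas in AR the decoder has to sum the outputs of all stages received so far.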

Figure 5: Two-iteration implementation of the Convolutional GRU OSR (ConvGRU-OSR) baseline model.
Figure 6: Two-iteration implementation of BINet with one-shot reconstruction (BINetOSR). Binary inpainting is integrated into the first iteration of the ConvGRU-OSR model in Figure 5.

2.2.3 BINet with AR and OSR

Both AR and OSR can be used naturally with BINet. As baselines, we use the progressive ConvAR Toderici et al. (2015); Baig et al. (2017) and ConvGRU-OSR Toderici et al. (2017) networks shown in Figures 3 and 5, respectively. The reconstruction of an image patch for a single iteration of these models can be written as:

$$\hat{P} = D\big(E(P)\big) \tag{5}$$

Their patch reconstructions are therefore based on the encoding of a single input patch $P$. In other words, they do not incorporate inpainting to aid compression. The training loss for both the ConvAR and ConvGRU-OSR baselines can be expressed as

$$L = \sum_{t=1}^{N} \lVert r_t \rVert \tag{6}$$

where $r_t$ is the residual error after iteration $t$ and $N$ refers to the number of reconstruction iterations (Figures 3 and 5 show only two iterations, but typically more are used).

The BINet framework is incorporated into ConvAR and ConvGRU-OSR by including learned binary inpainting at the first iteration, as shown in Figures 4 and 6. Later iterations encode the residual error incurred by this initial inpainting prediction. We only include inpainting at the first iteration, as intuitively this stage encodes details that contain the most spatial redundancy compared to later stages whose purpose is to encode finer and less correlated patch details. (Future work may focus on ways of including binary inpainting at later network stages. Our goal is to show the benefit of this binary inpainting strategy.)

The encoding process of BINetAR (Figure 4) and BINetOSR (Figure 6) can be expressed as in equations (3) and (4), where $r_0$ again represents the original input image patch while $L_{\text{inp}}$ is the initial iteration’s inpainting loss given in equation (2). An $N$-iteration implementation of BINet with either AR or OSR is trained to optimise the loss:

$$L = L_{\text{inp}} + \sum_{t=2}^{N} \lVert r_t \rVert \tag{7}$$

3 Experimental Setup

3.1 Data and Training Procedure

The models discussed in Section 2 are trained on the CLIC Compression Challenge Professional Dataset Freeman et al. (2018), which is pre-partitioned into training, validation and test sets. Each set contains a variety of professionally captured high resolution natural images, saved in lossless PNG format to prevent the learning of compression artefacts introduced by lossy codecs.

The loss functions in equations (6) and (7) are used to train $N$-iteration implementations of the baseline (ConvAR, ConvGRU-OSR) and BINet (BINetAR, BINetOSR) systems, respectively. All models are trained to encode and reconstruct randomly cropped image patches. Following the approach in Toderici et al. (2015), the networks are constrained such that each autoencoder stage contributes 0.125 bits per pixel (bpp) to the overall compression of an input image patch. During training, BINet encodes nine directly adjacent image patches independently and reconstructs the central patch region based on the binary codes produced for the nine patches. Training patches are randomly cropped from the images in the training set at every epoch, while centre cropping is used on images in the validation set to ensure that the validation losses for the BINet and baseline models are directly comparable across epochs. Image patches used during training are batched into groups of 32 and normalised to a fixed range of pixel values. Models are trained for 15 000 epochs and early stopping is employed based on the validation loss (for the preliminary analyses in Section 4.1 we stop training at 5 000 epochs). We use Adam optimisation Kingma and Ba (2014) with an initial learning rate of 0.0001. The learning rate is decayed by a factor of 2 at epochs 3 000, 10 000 and 14 000.
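The learning-rate schedule and the bitrate arithmetic can be written out as a short sketch. It assumes that each decay takes effect from the stated epoch onwards. The patch side length passed to `bits_per_patch` below is a hypothetical value chosen only to make the arithmetic concrete; it is not a setting taken from the text.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(3000, 10000, 14000), factor=2.0):
    """Step schedule: divide the rate by factor once for every milestone already reached."""
    reached = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr / factor ** reached

def total_bpp(iterations, bpp_per_stage=0.125):
    """Bits per pixel after a given number of stages, each adding bpp_per_stage."""
    return iterations * bpp_per_stage

def bits_per_patch(iterations, patch_side, bpp_per_stage=0.125):
    """Total bits spent on one square patch (patch_side is a hypothetical parameter)."""
    return int(iterations * bpp_per_stage * patch_side * patch_side)

rates = [learning_rate(e) for e in (0, 3000, 10000, 14000)]
sixteen_stage_bpp = total_bpp(16)          # 16 stages at 0.125 bpp each
example_bits = bits_per_patch(1, 32)       # hypothetical 32x32 patch, one stage
```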

3.2 Evaluation Procedure

Quantifying image quality in a way that aligns with the subjective nature of the human visual system is difficult. A subjective assessment using surveys on humans can be slow and prone to viewer bias, and may garner results that are not easily reproducible. Objective algorithms have therefore become the norm in assessing image compression models Richardson and Vcodex (2010). We use two standard objective image evaluation metrics: Peak Signal to Noise Ratio (PSNR) and Structural SIMilarity index (SSIM) Wang et al. (2004). PSNR and SSIM measure the degree to which an image reconstruction corresponds to the original image. In both cases a higher score implies greater fidelity. SSIM falls within a bounded range while PSNR (usually expressed in dB) can be any real value. We follow the procedure recommended in Wang et al. (2004) when calculating the SSIM of a compressed image: the final SSIM score is obtained by applying the SSIM index over smaller pixel regions in a convolutional manner on a per-channel basis, and averaging the results. We use the constants and the Gaussian weighting process of Wang et al. (2004).
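Both metrics can be sketched on flat lists of pixel values. The SSIM function below is deliberately simplified: it evaluates the index over a single window with unweighted statistics, whereas the procedure described above slides a Gaussian-weighted window over the image and averages the results. The defaults k1 = 0.01 and k2 = 0.03 are the commonly used values from Wang et al. (2004) and are assumed here.

```python
import math

def psnr(original, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(original) != len(reconstruction):
        raise ValueError("inputs must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)

def ssim_single_window(x, y, max_value=255.0, k1=0.01, k2=0.03):
    """SSIM over one window using plain (unweighted) means, variances and covariance."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * max_value) ** 2, (k2 * max_value) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

original = [0, 255, 128, 64]
reconstruction = [0, 250, 130, 60]
score_psnr = psnr(original, reconstruction)
score_ssim = ssim_single_window(original, reconstruction)
```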

For evaluation, each image is resized to a fixed resolution such that evaluation image dimensions are cleanly divisible by the chosen patch size. (We also ran tests on full unscaled images, and found that trends were exactly the same as when images are resized in this way, due to the models always compressing a fixed patch size irrespective of input image dimensions.) Images are then partitioned into patches and encoded, and quality scores are calculated on and averaged across the reassembled images. The performance of BINet is contrasted to that of the baseline systems at various bit depths in order to gauge the effectiveness of incorporating the proposed binary inpainting framework across different operating points. Additionally, we perform various preliminary analyses on validation data to further illustrate BINet’s capabilities.
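The partition-and-reassemble step can be sketched on a single-channel image stored as a list of rows. The function names and the tiny 4x4 example are illustrative assumptions; the only requirement carried over from the text is that the image dimensions are divisible by the patch size.

```python
def partition(image, patch):
    """Split a 2-D image (list of rows) into non-overlapping patch x patch tiles, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles

def reassemble(tiles, height, width, patch):
    """Inverse of partition: place row-major tiles back into a height x width image."""
    image = [[None] * width for _ in range(height)]
    per_row = width // patch
    for index, tile in enumerate(tiles):
        top, left = (index // per_row) * patch, (index % per_row) * patch
        for dy, tile_row in enumerate(tile):
            image[top + dy][left:left + patch] = tile_row
    return image

image = [[y * 4 + x for x in range(4)] for y in range(4)]
tiles = partition(image, patch=2)
restored = reassemble(tiles, height=4, width=4, patch=2)
```

In an evaluation pipeline the tiles would be encoded and decoded between the two calls, with quality scores computed on the reassembled image.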

4 Experiments

We first perform a preliminary analysis on development data to better understand the properties of BINet and the benefit of binary inpainting as opposed to conventional sequential inpainting techniques. We then turn to quantitative analyses on test data where BINet is compared to the baseline neural compression models as well as standard image compression codecs.

4.1 Preliminary Analysis

4.1.1 Is Inpainting from Binary Codes Possible?

In order to assess qualitatively whether inpainting of image patches from compressed binary codes is possible, a 1-iteration implementation of BINetAR (0.125 bpp) is trained to explicitly predict the pixel content of an unknown patch region located at the centre of a pixel grid. This version of BINet is purposefully altered such that it masks bits pertaining to the central patch region, i.e. the context region available to the decoder matches that of Figure 1(a). This forces the network to become fully reliant on the binary encodings of surrounding patches when predicting the central patch’s pixel content.

Figure 7 demonstrates the inpainting capabilities of this masked BINet, and indicates that it is able to predict a basis for an unknown patch using the compressed binary codes of its nearest neighbours. Figure 8 compares inpaintings from BINet (green border) and WebP (red border). The four main modes used by WebP to sequentially predict a patch region are included in the diagram and abbreviated as in Google Developers (2016). The modes either average (DC_PRED), directly copy (H_PRED, V_PRED), or linearly combine (TM_PRED) pixels from previously decoded patches. The figure shows that the inpaintings produced by BINet resemble the ground truth patches (black border) more closely than those of WebP.
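The four WebP prediction modes named above can be sketched for a single square block. This is a simplified illustration that ignores edge handling and the block sizes used by the actual codec. `above` holds the row of pixels directly above the block, `left` the column directly to its left, and `top_left` the pixel diagonally above-left.

```python
def clip(value):
    """Clamp a prediction to the 8-bit pixel range."""
    return max(0, min(255, value))

def dc_pred(above, left):
    """DC_PRED: fill the block with the rounded mean of the neighbouring pixels."""
    count = len(above) + len(left)
    dc = (sum(above) + sum(left) + count // 2) // count
    return [[dc] * len(above) for _ in left]

def h_pred(above, left):
    """H_PRED: copy each left-neighbour pixel across its row."""
    return [[pixel] * len(above) for pixel in left]

def v_pred(above, left):
    """V_PRED: copy the row above down every row of the block."""
    return [list(above) for _ in left]

def tm_pred(above, left, top_left):
    """TM_PRED: linear combination left + above - top_left, clipped to the pixel range."""
    return [[clip(l + a - top_left) for a in above] for l in left]

above = [100, 110, 120, 130]
left = [90, 80, 70, 60]
block_dc = dc_pred(above, left)
block_tm = tm_pred(above, left, top_left=95)
```

All four modes are fixed functions of previously decoded pixels, in contrast to the learned prediction from binary codes performed by BINet.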

Figure 7: Inpaintings performed by masked BINet.
Figure 8: A comparison of the inpainting performed by masked BINet and WebP.
Figure 9: The Sequential Inpainting Network (SINet).

4.1.2 Is Full-Context Binary Inpainting Superior to Sequential Inpainting?

In this experiment we compare full-context binary inpainting to the sequential inpainting scheme proposed by Baig et al. (2017). A masked 1-iteration realisation of BINetAR is pitted against the Sequential Inpainting Network (SINet) in Figure 9. SINet consists of a pre-trained image compression model (ConvAR) coupled to an inpainting network (ConvAR decoder). SINet’s inpainting network is trained to sequentially predict the central patch from previously decoded patches such that its context region is like that of Figure 1(c). Table 1 compares the average SSIM and PSNR scores achieved by BINet and SINet on the validation set. BINet’s full-context binary inpainting mechanism leads to a 6% improvement in SSIM and an 11% increase in PSNR relative to the partial-context sequential inpainting performed by SINet. Figure 10 illustrates how BINet’s ability to harness pixel content from a full context region aids its inpainting ability. BINet (green border) correctly identifies that the lower right-hand corner of its inpainting should be white, whereas SINet (red border) is oblivious to this due to its limited context region. Importantly, BINet has a major additional benefit in that it can be parallelised, since reconstruction of a particular patch is not performed based on previously decoded patches but rather directly on the binary codes of all surrounding patches.

Figure 10: Comparison of inpainting performed by masked BINet and SINet given an artificial image patch.
Table 1: Averaged masked BINet and SINet SSIM and PSNR scores for image patch inpaintings

4.1.3 Does Inpainting Improve Compression using a Single Iteration?

To determine if teaching a model to inpaint from binary codes aids its compression capabilities, 1-iteration (0.125 bpp) implementations of BINetAR and the baseline ConvAR are pitted against each other. Figure 11 demonstrates how BINetAR outperforms ConvAR quantitatively in terms of training and validation loss. Losses represent the mean error between the ground truth and predicted patches and are indicative of the quality of the model’s patch reconstructions. Figure 12 shows an assortment of images encoded by BINetAR and ConvAR. Note that in each case BINetAR produces images with a higher perceptual fidelity than ConvAR, according to the SSIM and PSNR scores achieved by its reconstructions. The images produced by BINetAR are qualitatively smoother than those of ConvAR at equally low bitrates, making BINetAR better suited for patch-based compression. The improved smoothness can be attributed to BINetAR’s decoder which learns to constrain a patch to match its surroundings. All the images used here are from the CLIC validation set Freeman et al. (2018).

Figure 11: Training loss (a) and validation loss (b) for 1-iteration implementations of BINetAR and the ConvAR baseline.
Figure 12: Full image reconstructions for 1-iteration implementations of BINetAR and the ConvAR baseline.

4.2 Quantitative Analysis: BINetAR vs. ConvAR

We now turn to quantitative analyses on the CLIC test set Freeman et al. (2018). We train 16-iteration implementations of BINetAR and ConvAR to assess the effect of incorporating a single inpainting stage on the performance of an AR model. We first consider reconstruction of single patches (an intrinsic measure) and then consider the more realistic evaluation on full images.

4.2.1 Patch Reconstruction

We first assess the abilities of BINetAR and ConvAR to reconstruct single image patches centre-cropped from the test set. Each model is trained to compress patches, for an intrinsic evaluation of model performance. The resulting SSIM and PSNR rate-distortion curves are shown in Figures 13(a) and 13(b). The variable bit rate is achieved by varying the number of encoding iterations. The figures indicate that at low bitrates, close to the inpainting layer, BINetAR gives a small but consistent improvement over ConvAR.

Dynamic bit assignment entails encoding different patch regions with varying bit allocations governed by a predetermined quality threshold such as PSNR. This aids compression as image regions are not necessarily equally complex. At low bitrates close to the inpainting layer BINetAR consistently produces patches of a higher quality than ConvAR. This means that if dynamic bit assignment were implemented, BINetAR would reach target quality thresholds after fewer encoding iterations (resulting in fewer bits) compared to ConvAR.
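Dynamic bit assignment can be sketched as an early-stopping rule over encoding iterations. The PSNR values below are invented for illustration and are not measurements from any model in this paper. The sketch also assumes equally sized patches, so that the mean bitrate is the mean number of iterations multiplied by 0.125 bpp.

```python
BPP_PER_STAGE = 0.125  # each stage contributes 0.125 bpp (Section 3.1)

def iterations_needed(quality_per_iteration, threshold):
    """Return the 1-based iteration at which quality first reaches the threshold.

    If the threshold is never reached, all available iterations are used.
    """
    for count, quality in enumerate(quality_per_iteration, start=1):
        if quality >= threshold:
            return count
    return len(quality_per_iteration)

def allocate(patch_curves, threshold):
    """Return per-patch iteration counts and the resulting mean bitrate in bpp."""
    counts = [iterations_needed(curve, threshold) for curve in patch_curves]
    mean_bpp = BPP_PER_STAGE * sum(counts) / len(counts)
    return counts, mean_bpp

# Invented PSNR values (dB) after 1..4 iterations for a simple and a complex patch.
smooth_patch = [31.0, 33.5, 35.0, 36.0]
textured_patch = [24.0, 26.5, 29.0, 30.5]
counts, mean_bpp = allocate([smooth_patch, textured_patch], threshold=30.0)
```

A model whose early iterations reach the threshold sooner yields smaller counts and therefore a lower mean bitrate at the same target quality.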

Figure 13: BINetAR vs. ConvAR: (a) SSIM and (b) PSNR rate-distortion curves for image patches.

4.2.2 Full Image Reconstruction

The average PSNR and SSIM scores achieved by BINetAR and ConvAR on test images are compared in Table 2, for various bit allocations. Although improvements are small, BINetAR consistently outperforms ConvAR across all bit depths. If one compares the first iteration of the two models, BINetAR outperforms ConvAR by 8% in terms of SSIM and results in a 3% relative improvement in PSNR. This comparison between the first iteration of the models is important as binary inpainting is only incorporated at the first stage of the BINetAR model.

Table 2: BINetAR vs. ConvAR: SSIM and PSNR scores at bit-per-pixel (bpp) allocations of 0.125, 0.25 and 0.5 for images.

4.3 Quantitative Analysis: BINetOSR vs. ConvGRU-OSR

Sixteen-iteration implementations of BINetOSR and ConvGRU-OSR are trained to assess the effect of incorporating a single inpainting stage on the performance of an OSR model. Models are again evaluated on the CLIC test set Freeman et al. (2018).

4.3.1 Patch Reconstruction

We first assess BINetOSR’s and ConvGRU-OSR’s intrinsic capacity to reconstruct patches centre-cropped from the test data. The resulting areas under the PSNR and SSIM rate-distortion curves are given in Table 3. Note that a greater area is indicative of increased perceptual quality across all sixteen allocated bitrates. Table 3 shows that incorporating learned inpainting into just one iteration of the ConvGRU-OSR model effectively increases its area under the PSNR and SSIM rate-distortion curves.

Table 3: BINetOSR vs. ConvGRU-OSR: area under the curve, per model, for SSIM and PSNR rate-distortion, calculated on image patches.
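The area under a sampled rate-distortion curve can be computed as in the sketch below. The trapezoidal rule is an assumption made here, since the integration method is not stated in the text, and the flat curve is a placeholder rather than measured data.

```python
def area_under_curve(bitrates, scores):
    """Trapezoidal area under a rate-distortion curve sampled at increasing bitrates."""
    if len(bitrates) != len(scores) or len(bitrates) < 2:
        raise ValueError("need at least two matching rate/score samples")
    area = 0.0
    for i in range(1, len(bitrates)):
        width = bitrates[i] - bitrates[i - 1]
        area += width * (scores[i] + scores[i - 1]) / 2.0
    return area

bitrates = [0.125 * t for t in range(1, 17)]   # sixteen stages at 0.125 bpp each
flat_curve = [1.0] * len(bitrates)             # placeholder scores, not measured values
area = area_under_curve(bitrates, flat_curve)
```

A model whose curve lies above another at every bitrate necessarily has the larger area, which is why the area serves as a single summary of quality across all sixteen bitrates.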

4.3.2 Full Image Reconstruction

The PSNR and SSIM curves achieved by BINetOSR and ConvGRU-OSR on test images are shown in Figures 14(a) and 14(b). Unlike in the AR model, inpainting gains are more pronounced at stages further from the inpainting layer, as recurrence allows BINetOSR to better propagate inpainting information to later decoding stages. This forces the first stage to learn an inpainting strategy that is beneficial to the system as a whole, as opposed to BINetAR where improvements are concentrated around the inpainting layer. Again, while performance gains are small, they are consistent over the bit rates considered.

Figure 14: BINetOSR vs. ConvGRU-OSR: (a) PSNR and (b) SSIM rate-distortion curves for complete images.

4.4 Quantitative Analysis: BINet vs. Standard Codecs

Up to now we have focused on incorporating BINet into the ConvAR and ConvGRU-OSR approaches in order to investigate the effect of binary inpainting on patch-based compression in isolation. Here we compare BINet to standard image codecs. Evaluation is carried out on full images from the Kodak dataset Kodak (1999), resized to a fixed resolution. Figure 15 compares BINet’s SSIM performance to that of WebP and JPEG. BINetAR outperforms JPEG at low bitrates, but gradually worsens at bitrates produced by stages further from the inpainting layer. JPEG is patch-based, and Figure 16 indicates that block artefacts arising through the use of an independent patch-based encoding scheme can be suppressed by BINetAR for enhanced quality at shallow bit allocations.

We showed in Section 4.1.1 that BINet is capable of learning more complex inpainting predictions than WebP. Although the inclusion of a binary inpainting stage does consistently improve ConvGRU-OSR’s performance (Figure 14), BINetOSR still falls short of outperforming WebP (Figure 15). Using the same decoder module for inpainting and image patch reconstruction may result in a conflict between learning compression and the high quality inpainting shown in Section 4.1. One must also take into consideration that WebP and JPEG’s codes are further compressed by lossless entropy coding, whereas BINet’s codes are not.

Our aim here was not to achieve state-of-the-art performance, but rather to investigate whether binary inpainting improves patch-based image compression in a deep neural network model, and this was shown in Sections 4.2 and 4.3. Any model can be used as the basis in the BINet framework, and future work will consider incorporating BINet into the more powerful recurrent models of Johnston et al. (2018).

Figure 15: SSIM rate-distortion curves comparing BINet to standard image codecs on the Kodak dataset Kodak (1999).
Figure 16: BINetAR vs. ConvAR vs. JPEG: image reconstructions by (a) BINetAR, (b) ConvAR and (c) JPEG. Image is from the Kodak dataset Kodak (1999).

5 Conclusion and Future Research

We introduced the Binary Inpainting Network (BINet), a novel framework that can be used to improve an existing system for patch-based image compression. Building on ideas from image inpainting as well as deep image compression, BINet is novel in two particular ways. Firstly, in contrast to work on inpainting, BINet incorporates explicit binarisation in an encoder module, which allows it to be used for compression. Secondly, in contrast to most deep compression models, BINet incorporates information from adjacent patches when decoding a particular patch. The result is a patch-based compression method which allows for parallelised inpainting from a full-context region without access to original image data. In quantitative evaluations, we showed that BINet yields small but consistent improvements over baselines without inpainting. Qualitatively we showed that BINet results in fewer block artefacts at shallow bitrates compared to standard image codecs, resulting in smoother image reconstructions.

Apart from incorporating BINet into more advanced neural architectures in future work, we aim to also explore alternative applications for binary inpainting such as binary error correction and patch-based video-frame interpolation.


  • E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool (2018) Generative adversarial networks for extreme learned image compression. arXiv preprint arXiv:1804.02958. External Links: 1804.02958 Cited by: §1.
  • M. H. Baig, V. Koltun, and L. Torresani (2017) Learning to inpaint for image compression. In Advances in Neural Information Processing Systems (NIPS), External Links: 1709.08855 Cited by: §1, §1, §2.2.3, §4.1.2.
  • J. Ballé, V. Laparra, and E. P. Simoncelli (2016) End-To-End Optimized Image Compression. In International Conference On Learning Representations (ICLR), Cited by: §1.
  • J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston (2018) Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436. Cited by: §1.
  • W. T. Freeman, G. Toderici, and M. Covell (2018) CLIC Compression Challenge. Note: available at Cited by: §3.1, §4.1.3, §4.2, §4.3.
  • Google Developers (2016) Compression techniques WebP Google developers. Note: available at Cited by: §1, §1, §4.1.1.
  • G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. Science. Cited by: §2.1.
  • N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici (2018) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §4.4.
  • J. Lee, S. Cho, and S. Beack (2019) Context-adaptive entropy model for end-to-end optimized image compression. In International Conference On Learning Representations (ICLR), Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. External Links: 1412.6980 Cited by: §3.1.
  • Kodak (1999) Kodak lossless true color image suite (PhotoCD PCD0992). Note: available at Cited by: §1, Figure 15, Figure 16, §4.4.
  • M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang (2018) Learning convolutional networks for content-weighted image compression. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. V. Gool (2018) Conditional probability models for deep image compression. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • T. Raiko, M. Berglund, G. Alain, and L. Dinh (2014) Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989. Cited by: §2.1.
  • I. E. G. Richardson and Vcodex (2010) The H.264 advanced video compression standard. 2.0 edition, Wiley, Chichester, West Sussex. Cited by: §1, §2.1, §3.2.
  • O. Rippel and L. Bourdev (2017) Real-time adaptive image compression. arXiv preprint arXiv:1705.05823. Cited by: §1.
  • S. Santurkar, D. Budden, and N. Shavit (2018) Generative compression. In Picture Coding Symposium (PCS), Cited by: §1.
  • L. Theis, W. Shi, A. Cunningham, and F. Huszár (2017) Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395. Cited by: §1.
  • G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar (2015) Variable rate image compression with recurrent neural networks. In International Conference On Learning Representations (ICLR), Cited by: §1, §1, §2.1, §2.2.1, §2.2.3, §3.1.
  • G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell (2017) Full resolution image compression with recurrent neural networks. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §2.2.2, §2.2.3.
  • A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016) Pixel recurrent neural networks. In International Conference on Machine Learning (ICML), Cited by: §1.
  • G. K. Wallace (1991) The JPEG still picture compression standard. Communications of the Association for Computing Machinery (ACM). Cited by: §1, §2.2.1.
  • Z. Wang, A. C. Bovik, H. Rahim Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP) (4). Cited by: §3.2.