GIF is a widely used image file format. At its core, GIF represents an image by applying color quantization: each pixel of an image is indexed by the nearest color in a color palette. Finding an optimal color palette, which is equivalent to clustering, is an NP-hard problem. A commonly used algorithm for palette extraction is the median-cut algorithm, due to its low cost and relatively high quality. Better clustering algorithms such as k-means produce palettes of higher quality, but are much slower and have higher time complexity [7, 21]. Nearly all classical palette extraction methods involve an iterative procedure over all the image pixels, and are therefore inefficient.
Another issue with GIF encoding is the color banding artifacts introduced by color quantization, as shown in Figure 0(b). A popular method for suppressing banding artifacts is dithering, a technique that adds small perturbations to an input image to make its color-quantized version more visually appealing. Error diffusion is a classical dithering method, which diffuses quantization errors to nearby pixels [8, 11, 25]. While effective, error diffusion methods also introduce artifacts of their own, e.g., dotted noise artifacts, as shown in Figure 0(c). Fundamentally, these methods are not aware of the higher-level semantics in an image, and are therefore incapable of optimizing for higher-level perceptual metrics.
To this end, we propose a differentiable GIF encoding framework consisting of three novel neural networks: PaletteNet, DitherNet, and BandingNet. Our architecture encompasses the entire GIF pipeline, where each component is replaced by one of the networks above. Our motivation for designing a fully differentiable architecture is two-fold. First, a differentiable pipeline allows us to jointly optimize the entire model with respect to any loss function. Instead of hand-designing heuristics for artifact removal, our network can automatically learn strategies to produce more visually appealing images by optimizing for perceptual metrics. Second, our design implies that both color quantization and dithering involve only a single forward pass of our network, which is inherently more efficient. Therefore, our method has lower time complexity and is easier to parallelize than the iterative procedures in the traditional GIF pipeline.
Our contributions are the following:
- To the best of our knowledge, our method is the first fully differentiable GIF encoding pipeline. Also, our method is compatible with existing GIF decoders.
- We introduce PaletteNet, a novel method that extracts color palettes using a neural network. PaletteNet shows a higher peak signal-to-noise ratio (PSNR) than the traditional median-cut algorithm.
- We introduce DitherNet, a neural network-based approach to reducing quantization artifacts. DitherNet is designed to randomize errors more on areas that suffer from banding artifacts.
- We introduce BandingNet, a neural network for detecting banding artifacts which can also be used as a perceptual loss.
2 Previous Work
We categorize the related literature into four topics: dithering, palette selection, perceptual loss, and other related work.
2.1 Dithering and Error Diffusion
Dithering is a procedure to randomize quantization errors. Floyd-Steinberg error diffusion is a widely used dithering algorithm that sequentially distributes errors to neighboring pixels. Applying blue-noise characteristics to dithering algorithms improved perceptual quality [25, 34]. Kite et al. modeled error diffusion as a linear gain with additive noise, and also suggested ways to quantify edge sharpening and noise shaping. Adaptively changing the threshold in error diffusion reduces quantization artifacts.
Halftoning, or digital halftoning, is another type of error diffusion method, which represents images as patterned gray pixels. In halftoning, Pang et al. used the structural similarity index measure (SSIM) to improve halftoned image quality. Chang et al. reduced the computational cost of this approach by applying precomputed tables. Li and Mould alleviated contrast varying problems using contrast-aware masks. Recently, Fung and Chan suggested a method to optimize diffusion filters to provide blue-noise characteristics on multiscale halftoning, and Guo et al. proposed a tone-replacement method to reduce banding and noise artifacts.
2.2 Palette Selection
Color quantization involves grouping the pixels of an image into clusters. One of the most commonly used algorithms for GIF color quantization is the median-cut algorithm. Dekker proposed using Kohonen neural networks for predicting cluster centers. Other clustering techniques, such as k-means [6] and particle swarm methods, have also been applied to the problem of color quantization.
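To make the median-cut baseline concrete, the following is a minimal sketch of one common variant in plain Python. The function name `median_cut`, the rule of splitting the box with the widest channel range at its median, and the use of each box's mean as the palette color are illustrative assumptions, not the exact procedure of any particular GIF encoder.

```python
def median_cut(pixels, num_colors):
    """Return up to `num_colors` RGB palette entries using a median-cut variant.

    `pixels` is a list of (r, g, b) tuples. Names and splitting rule are
    illustrative assumptions, not taken from a specific encoder.
    """
    boxes = [list(pixels)]
    while len(boxes) < num_colors:
        # Find the splittable box whose widest color channel has the largest range.
        best = None
        for i, box in enumerate(boxes):
            if len(box) < 2:
                continue
            ranges = [max(p[c] for p in box) - min(p[c] for p in box)
                      for c in range(3)]
            widest = max(ranges)
            if best is None or widest > best[0]:
                best = (widest, i, ranges.index(widest))
        if best is None or best[0] == 0:
            break  # no box can be split any further
        _, i, channel = best
        # Sort the chosen box along its widest channel and cut at the median.
        box = sorted(boxes.pop(i), key=lambda p: p[channel])
        mid = len(box) // 2
        boxes.extend([box[:mid], box[mid:]])
    # Each palette color is the mean color of one box.
    return [tuple(sum(p[c] for p in box) / len(box) for c in range(3))
            for box in boxes]
```

Each split requires a pass over (and a sort of) the pixels in a box, which is the kind of iterative procedure over image pixels that a feed-forward palette predictor avoids.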
Another line of work focuses on making clustering algorithms differentiable. Jang et al. proposed efficient gradient estimators for the Gumbel-Softmax, which can be smoothly annealed to a categorical distribution. Oord et al. proposed VQ-VAE, which generates discrete encodings in the latent space. Agustsson et al. provided another differentiable quantization method through annealing. Peng et al. proposed a differentiable k-means network that reformulates and relaxes the k-means objective in a differentiable manner.
2.3 Perceptual Loss
In contrast to signal-based quantitative metrics, e.g., signal-to-noise ratio (SNR), PSNR, and SSIM, perceptual quality metrics aim to measure quality as perceived by humans. There have been efforts to evaluate perceptual quality using neural networks. Johnson et al. suggested using the feature layers of convolutional neural networks as a perceptual quality metric. Talebi and Milanfar proposed NIMA, a neural network-based perceptual image quality metric. Blau and Michaeli discussed the relationship between image distortion and perceptual quality, and provided comparisons of multiple perceptual qualities and distortions. The authors later extended their discussion to consider compression rate, distortion, and perceptual quality. Zhang et al. compared traditional image quality metrics such as SSIM or PSNR with deep network-based perceptual quality metrics, and discussed the effectiveness of perceptual metrics.
Banding is a common compression artifact caused by quantization. There are some existing works [2, 20, 37] on banding artifact detection, where the method proposed by Wang et al. achieved strong correlations with human subjective opinions. Lee et al. segmented the image into smooth and non-smooth regions, and computed various directional features, e.g., contrast and Sobel masks, and non-directional features, e.g., variance and entropy, for each pixel in the non-smooth region to identify it as a “banding boundary” or a “true edge”. Baugh et al. related the number of uniform segments to the visibility of banding. Their observation was that when the size of most of the uniform segments in an image fell below a certain area threshold (in pixels), there was no perceivable banding in the image. Wang et al. extracted banding edges from homogeneous segments, and defined a banding score based on the length and visibility of banding edges. The proposed banding metrics have a good correlation with subjective assessment.
2.4 Other Related Work
GIF2Video is a recent work that also involves the GIF format and deep learning. GIF2Video tackles the problem of artifact removal for both static and animated GIF images. This is complementary to our work, since GIF2Video is a post-processing method, whereas our method is a pre-processing method.
Our method also uses several well-known neural network architectures, such as ResNet, U-Net, and Inception. ResNet enabled deeper neural networks through skip connections and showed significant performance improvement on image classification tasks. Inception combined convolutions with different kernel sizes by merging their outputs, and achieved better image classification accuracy than humans. U-Net introduced an auto-encoder with skip connections, and achieved high-quality output images.
As shown in Figure 1(a), GIF encoding involves several steps: (1) color palette selection; (2) pixel value quantization given the color palette; (3) dithering; (4) re-applying pixel quantization; (5) Lempel-Ziv-Welch lossless data compression. The last step is lossless compression, and it does not affect the image quality. Thus we will focus on replacing the first four steps with neural networks to improve the image quality. To make a differentiable GIF encoding pipeline, we introduce two neural networks: 1) PaletteNet, predicting the color palette from a given input image and 2) DitherNet for reducing quantization artifacts. We also introduce soft projection to make the quantization step differentiable. To suppress banding artifacts, we introduce BandingNet as an additional perceptual loss. Figure 1(b) shows the overall architecture of the differentiable GIF encoding pipeline.
3.1 PaletteNet: Palette Prediction
The goal of PaletteNet is to predict a near-optimal palette with a feed-forward neural network, given an RGB image and the number of palettes. We emphasize here that at inference time, the output from PaletteNet is used directly as the palette, and no additional iterative optimization is applied. Therefore, PaletteNet in effect learns how to cluster image pixels in a feed-forward fashion.
We first state a few definitions and notations. Let be the input image, a color palette with number of palettes, be the palette prediction network. We define to be the quantized version of using a given palette , i.e.,
where is any indexing over the pixel space.
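The hard projection described above (each pixel is replaced by its nearest palette color) can be sketched as follows. This is a minimal illustration that assumes Euclidean distance in RGB space; the names `squared_distance` and `hard_project` are hypothetical and not taken from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two RGB tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_project(image, palette):
    """Map every pixel to its nearest palette color.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    Returns (indices, quantized): the palette index per pixel and the
    quantized image built from those indices.
    """
    indices = [
        [min(range(len(palette)),
             key=lambda j: squared_distance(px, palette[j]))
         for px in row]
        for row in image
    ]
    quantized = [[palette[k] for k in row] for row in indices]
    return indices, quantized
```

The `min` over palette entries is an argmin, which is why this operation has no useful gradient with respect to the palette choice for a given pixel.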
Given the definitions above, PaletteNet is trained in a self-supervised manner with a standard loss.
Next, we introduce the soft projection operation to make the quantized image also differentiable with respect to the network parameters of . Note that since the predicted palette is already differentiable with respect to , the non-differentiability of the projected image is due to the hard projection operation defined in Equation 1. Therefore, we define a soft projection which smooths the operation in Equation 1,
where , , and a temperature parameter controlling the amount of smoothing. Note that the soft projection is not required for the training of PaletteNet itself, but needed if we want to chain the quantized image as a differentiable component of some additional learning system, such as in the case of our GIF encoding pipeline in Figure 1(b).
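As an illustration of how a temperature can smooth the nearest-color choice, the sketch below replaces the argmin with softmax weights over negative distances. The exact weighting function is an assumption made for this example, as are the name `soft_project_pixel` and the use of Euclidean distance.

```python
import math


def soft_project_pixel(pixel, palette, temperature):
    """Blend palette colors with softmax weights over negative distances.

    Assumed form: w_j is proportional to exp(-d_j / temperature), where d_j
    is the Euclidean distance from `pixel` to palette color j.
    `temperature` must be positive.
    """
    dists = [math.sqrt(sum((x - y) ** 2 for x, y in zip(pixel, color)))
             for color in palette]
    # Shifting by the minimum distance keeps exp() stable without
    # changing the normalized weights.
    d_min = min(dists)
    weights = [math.exp(-(d - d_min) / temperature) for d in dists]
    total = sum(weights)
    return tuple(
        sum(w * color[ch] for w, color in zip(weights, palette)) / total
        for ch in range(3)
    )
```

With a small temperature the result approaches the nearest palette color (the hard projection), while a large temperature approaches the plain average of the palette.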
3.2 DitherNet: Reducing Artifacts
Dithering is an algorithm to randomize quantization errors into neighboring pixels in images to avoid banding artifacts. After the color quantization, we define the error image as
where is the original image, is a hard projection function that returns the nearest color in , and is any indexing over the pixel space. The traditional Floyd-Steinberg algorithm diffuses errors with fixed weights, and updates neighboring pixels in a sequential fashion.
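For reference, the sequential nature of Floyd-Steinberg error diffusion can be seen in the following minimal grayscale sketch. It uses the standard 7/16, 3/16, 5/16, 1/16 weights; the function name `floyd_steinberg` and the restriction to a single channel are simplifications chosen for this example.

```python
def floyd_steinberg(gray, levels):
    """Dither a 2D grayscale image to the given output `levels`.

    `gray` is a list of rows of numbers; `levels` is a list of allowed
    output values. Pixels are visited in raster order, and each pixel's
    quantization error is pushed to not-yet-visited neighbors.
    """
    height, width = len(gray), len(gray[0])
    work = [[float(v) for v in row] for row in gray]
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = min(levels, key=lambda v: abs(v - old))
            out[y][x] = new
            err = old - new
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16      # right
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16  # below-left
                work[y + 1][x] += err * 5 / 16          # below
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16  # below-right
    return out
```

Because each pixel depends on the errors diffused from earlier pixels, the loop cannot be trivially parallelized, in contrast to a single forward pass of a network.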
Different from Floyd-Steinberg, our DitherNet accepts an original image and its quantized image as an input, and directly predicts a randomized error image . The predicted error is then added to the original image to produce the final quantized image.
where is the soft projection and used for training to maintain differentiability. The hard projection () is used for inference.
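The inference path described above (predict an error image, add it to the original, then apply the hard projection) can be sketched as below. The callable `predict_error` is a hypothetical stand-in for a trained DitherNet, and the sign convention of `error_image` (original minus quantized) is an assumption.

```python
def nearest_color(pixel, palette):
    """Return the palette color closest to `pixel` (Euclidean distance)."""
    return min(palette,
               key=lambda c: sum((x - y) ** 2 for x, y in zip(pixel, c)))


def quantize(image, palette):
    """Hard projection of every pixel onto the palette."""
    return [[nearest_color(px, palette) for px in row] for row in image]


def error_image(image, palette):
    """Per-pixel quantization error, assumed here as original minus quantized."""
    quantized = quantize(image, palette)
    return [[tuple(a - b for a, b in zip(px, qpx))
             for px, qpx in zip(row, qrow)]
            for row, qrow in zip(image, quantized)]


def dither_with_predicted_error(image, palette, predict_error):
    """Add a predicted error image to the original, then hard-project.

    `predict_error(image, quantized)` is a placeholder for DitherNet and
    must return an image-shaped list of (r, g, b) perturbations.
    """
    quantized = quantize(image, palette)
    predicted = predict_error(image, quantized)
    perturbed = [[tuple(a + e for a, e in zip(px, err))
                  for px, err in zip(row, err_row)]
                 for row, err_row in zip(image, predicted)]
    return quantize(perturbed, palette)
```

During training, the final `quantize` call would be replaced by a soft projection so that gradients can reach the error predictor.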
For the network architecture, DitherNet uses a variation of U-Net  as a backbone network for creating error images. Training DitherNet is inherently a multi-loss problem. The final output has to remain close to the original, while achieving good visual quality after quantization. Ideally, DitherNet should be able to reduce the errors along the banding areas in a visually pleasing way. To this end, we define the dithering loss, , as a combination of image fidelity measures, perceptual losses, as well as a customized banding loss term to explicitly suppress banding artifacts. We define the loss as , where is the absolute difference operator and is the element index. The dither loss is defined as
where is the banding loss given by BandingNet (see Section 3.3), is a perceptual loss given by either NIMA  or VGG , and , , , and are weights of each loss. In Equation 6, preserves the similarity between input and final quantized images, is for preserving the sum of quantization errors, and is the banding loss. We will discuss the effect of each term in Section 4.2.
3.3 BandingNet: Banding Score
We propose a neural network that predicts the severity of banding artifacts in an image. We train the network by distilling from an existing non-differentiable banding metric. The output of this network is used as a loss to guide our DitherNet.
A straightforward way to train the model is to use the RGB image directly as input, and the banding scores obtained from that metric as ground truth. We first tried a classical CNN model as the banding predictor, and defined the loss as the mean absolute difference (MAD) between the predicted score and the ground truth. However, as shown in Figure 3 (red lines), such a naive approach is unstable during training and could not achieve a low MAD.
As noted in prior work on banding detection, banding artifacts tend to appear on the boundary between two smooth regions. We found that adding a map of likely banding artifact areas as input significantly improves the model. Here we propose a way to extract potential banding edges and form a new input map (Algorithm 1).
As shown in Figure 4, the extracted edge map highlights potential banding edges, e.g., banding on the background, and assigns relatively low weights to true edges such as eyes and hair. With this new input map, training converges faster than with the RGB map, and the MAD is also lower (banding scores lie in the range [0, 1]).
To use the BandingNet as a perceptual loss, we propose some additional modifications to the banding network proposed above. First, we augment the training data by adding pairs of GIF encoded images with and without Floyd-Steinberg dithering, and artificially lower the banding score loss for examples with Floyd-Steinberg dithering. This guides the loss along the Floyd-Steinberg direction, and can reduce adversarial artifacts compared to the un-adjusted BandingNet. Secondly, we apply our BandingNet on a multi-scale image pyramid in order to capture the banding artifacts at multiple scales of the image. Compared to a single-scale loss, the multi-scale BandingNet promotes randomizing errors on a larger spatial range instead of only near the banding edges. To define the multi-scale loss, we construct the image pyramid in Equation 7.
where is a level of image pyramid,
denotes image upscaling,denotes image downscaling, is a smoothing function, and is a scaling factor.
Let be the output of BandingNet. Our final banding loss is defined by
For training DitherNet, we use . The exact training parameters for our BandingNet loss can be found in the supplementary materials.
3.4 Overall Training
The overall loss for training our networks is given in Equation 9.
To stabilize training, we propose a three-stage method for training all of the networks. In the first stage, BandingNet and PaletteNet are first trained separately until convergence. BandingNet will be fixed and used only as a loss in the next two stages. In the second stage, we fix PaletteNet and only train DitherNet using Equation 9. In the final stage, we use a lower learning rate to fine-tune PaletteNet and DitherNet jointly. We found that the second stage is necessary to encourage DitherNet to form patterns that remove banding artifacts, whereas jointly training PaletteNet and DitherNet from the start will result in PaletteNet dominating the overall system, and a reduced number of colors in the output image.
We evaluate our methods on the CelebA dataset , where the models are trained and evaluated on an 80/20 random split of the dataset. For both training and evaluation, we first resize the images while preserving the aspect ratio so that the minimum of the image width and height is . The images are then center cropped to .
For PaletteNet, we use InceptionV2 as the backbone network, where the features from the last layer are globally pooled before passing to a fully connected layer with output dimensions. The output is then mapped to the range [-1, 1] by the tanh activation function. To train PaletteNet, we use the loss in Equation 2. The details of training parameters are discussed in the supplementary material.
We evaluate the average PSNR of the quantized image compared to the original, for both median-cut and PaletteNet. We also explored using PointNet , a popular architecture for processing point cloud input, as a comparison. The traditional method of palette extraction can be viewed as clustering a 3D point cloud, where each point corresponds to an RGB pixel in the image. The point cloud formed by the image pixels clearly retains rich geometric structures, and thus is highly applicable to PointNet. The details of PointNet are discussed in the supplementary material. Results for different values of are shown in Table 1.
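For reference, the PSNR between an original image and its quantized version can be computed as in the sketch below. It assumes 8-bit pixel values (peak value 255) and images stored as nested lists of RGB tuples; the function name `psnr` is chosen for this example.

```python
import math


def psnr(original, quantized, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized RGB images.

    Images are lists of rows of (r, g, b) tuples. Returns infinity when
    the images are identical.
    """
    squared_error = 0.0
    count = 0
    for row_o, row_q in zip(original, quantized):
        for pixel_o, pixel_q in zip(row_o, row_q):
            for a, b in zip(pixel_o, pixel_q):
                squared_error += (a - b) ** 2
                count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Averaging this value over the evaluation images gives the kind of per-palette-size comparison between median-cut and PaletteNet that is reported in Table 1.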
From Table 1, we see that PaletteNet with the InceptionV2 network outperforms median-cut across all values of , where the improvement is more pronounced for lower values of . We also note that the PointNet architecture performed worse than the InceptionV2 network and Median-cut.
We note here that DitherNet needs to be trained with relatively high weights on the various perceptual losses in order to produce perturbations that improve visual quality. However, raising these weights too much introduces adversarial artifacts that lower the perceptual loss but no longer produce examples of high visual quality. To reduce this effect, we augment the input images by changing saturation and hue, and also apply early stopping, terminating training at epoch 3. The details of training parameters are discussed in the supplementary material.
Table 2: PSNR and SSIM of DitherNet trained without banding loss and with banding loss.
Table 2 shows that training without banding loss provides better PSNR and SSIM. However, DitherNet trained with banding loss provides better perceptual quality as shown in Figure 6. In our experiment, our method shows better quality with VGG  and NIMA  perceptual losses as shown in Figure 7.
Quantitative metrics, such as PSNR or SSIM, are not proper metrics to compare dithering methods, since dithering is mainly used for improving perceptual quality and does not necessarily improve PSNR or SSIM over non-dithered images. Instead, to evaluate the visual quality of our algorithm compared to the standard GIF pipeline with Floyd-Steinberg error diffusion and median-cut palette, a pairwise comparison user study was conducted using Amazon’s Mechanical Turk. We randomly choose 200 images from the CelebA evaluation data, and produce quantized images from the standard GIF pipeline and our model. Raters are then asked to choose from a pair which image is of higher quality. For each image pair, we collect ratings from 10 independent users.
Table 3 shows the favorability of our method compared to the standard median-cut with Floyd-Steinberg dithering. Our method outperforms the baseline by a large margin when the number of palettes is low, and has comparable performance when the number of palettes is high. There are several causes of this discrepancy in favorability across palette sizes. First, the number of images with visible banding artifacts decreases as the number of palettes increases. On images without banding artifacts, the output of our method is almost identical to that of the standard GIF pipeline, since raters are often not sensitive to minute differences in PSNR. On images with banding artifacts, a key difference between low and high palette counts is the visibility of the dotted pattern artifacts. When the number of palette levels is low, the dotted patterns are much more visible in the image and are often rated unfavorably compared to the patterns from DitherNet. Another reason is that the performance gap between PaletteNet and median-cut shrinks as the number of palettes grows (see Table 1).
In this paper, we proposed the first fully differentiable GIF encoding pipeline by introducing DitherNet and PaletteNet. To further improve the encoding quality, we introduced BandingNet, which measures a banding artifact score. Our PaletteNet can predict high-quality palettes from input images. DitherNet is able to distribute errors and reduce banding artifacts using BandingNet as a loss. Our method can be extended in multiple directions as future work. For example, k-means based palette prediction and heuristic methods for dithering show higher visual quality than ours. We would also like to extend our current work to image reconstruction, static to dynamic GIF, and connecting with other differentiable image file formats.
References
[1] Soft-to-hard vector quantization for end-to-end learning compressible representations. In NeurIPS, pp. 1141–1151.
[2] (2014-11) Advanced video debanding. In CVMP, pp. 1–10.
[3] (2018-06) The perception-distortion tradeoff. In CVPR, pp. 6228–6237.
[4] (2019-06) Rethinking lossy compression: the rate-distortion-perception tradeoff. In ICML.
[5] (1990) Interactive computer graphics; functional, procedural, and device-level methods. 1st edition, Addison-Wesley Longman Publishing Co., Inc., USA.
[6] (2015-06) An effective real-time color quantization method based on divisive hierarchical clustering. Journal of Real-Time Image Processing 10, pp. 329–344.
[7] (2011-03) Improving the performance of k-means for color quantization. Image and Vision Computing 29 (4), pp. 260–271.
[8] (2009-12) Structure-aware error diffusion. ACM Transactions on Graphics 28 (5), pp. 1–8.
[9] (2001-01) Adaptive threshold modulation for error diffusion halftoning. IEEE Transactions on Image Processing 10 (1), pp. 104–116.
[10] (1994) Kohonen neural networks for optimal colour quantization. Network: Computation in Neural Systems 5 (3), pp. 351–367.
[11] (1976) An adaptive algorithm for spatial greyscale. Proceedings of the Society for Information Display 17 (2), pp. 75–77.
[12] (2013-01) Optimizing the error diffusion filter for blue noise halftoning with multiscale error diffusion. IEEE Transactions on Image Processing 22 (1), pp. 413–417.
[13] (2015-11) Tone-replacement error diffusion for multitoning. IEEE Transactions on Image Processing 24 (11), pp. 4312–4321.
[14] (2016-06) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[15] (1982-07) Color image quantization for frame buffer display. SIGGRAPH Computer Graphics 16 (3), pp. 297–307.
[16] (2016-11) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
[17] (2016-09) Perceptual losses for real-time style transfer and super-resolution. In ECCV, pp. 694–711.
[18] (2000-05) Modeling and quality assessment of halftoning by error diffusion. IEEE Transactions on Image Processing 9 (5), pp. 909–922.
[19] (2012) ImageNet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105.
[20] (2006-02) Two-stage false contour detection using directional contrast and its application to adaptive false contour reduction. IEEE Transactions on Consumer Electronics 52 (1), pp. 179–188.
[21] (2019-11) PNGQUANT. pngquant.org.
[22] (2010-06) Contrast-aware halftoning. Computer Graphics Forum 29 (2), pp. 273–280.
[23] (2015-12) Deep learning face attributes in the wild. In ICCV, pp. 3730–3738.
[24] (2006) Particle swarm optimization for pattern recognition and image processing. In Swarm Intelligence in Data Mining, pp. 125–151.
[25] (2001-08) A simple and efficient error-diffusion algorithm. In SIGGRAPH, pp. 567–572.
[26] (2008-08) Structure-aware halftoning. In SIGGRAPH, pp. 1–8.
[27] (2018-08) K-meansnet: when k-means meets differentiable programming. arXiv preprint arXiv:1808.07292.
[28] (2017-07) PointNet: deep learning on point sets for 3d classification and segmentation. In CVPR, pp. 652–660.
[29] (2015-11) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241.
[30] (1997-11) A comparison of clustering algorithms applied to color image quantization. Pattern Recognition Letters 18 (11), pp. 1379–1384.
[31] (2015-06) Going deeper with convolutions. In CVPR, pp. 1–9.
[32] Rethinking the inception architecture for computer vision. In CVPR, pp. 2818–2826.
[33] (2018-08) NIMA: neural image assessment. IEEE Transactions on Image Processing 27 (8), pp. 3998–4011.
[34] (1988-01) Dithering with blue noise. Proceedings of the IEEE 76 (1), pp. 56–79.
[35] (2017) Neural discrete representation learning. In NeurIPS, pp. 6306–6315.
[36] GIF2Video: color dequantization and temporal interpolation of gif images. In CVPR, pp. 1419–1428.
[37] (2016-09) A perceptual visibility metric for banding artifacts. In ICIP, pp. 2067–2071.
[38] The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pp. 586–595.