A multi-layer image representation using Regularized Residual Quantization: application to compression and denoising

07/07/2017 · by Sohrab Ferdowsi, et al.

A learning-based framework for representation of domain-specific images is proposed, where joint compression and denoising can be done using a VQ-based multi-layer network. While it learns to compress the images from a training set, the compression performance generalizes very well to images from a test set. Moreover, when fed with noisy versions of the test set, since it has priors from clean images, the network also efficiently denoises the test images during reconstruction. The proposed framework is a regularized version of Residual Quantization (RQ), where at each stage the quantization error from the previous stage is further quantized. Instead of learning codebooks with k-means, which over-trains for high-dimensional vectors, we show that merely generating the codewords from a random, but properly regularized distribution suffices to compress the images globally, without the need to resort to a patch-based division of images. The experiments are done on the CroppedYale-B set of facial images and the method is compared with the JPEG-2000 codec for compression and BM3D for denoising, showing promising results.

1 Introduction

Consider classical image processing tasks like image compression and denoising. While there exists a wealth of successful methods to address them, the specificity and the intricate optimization in their design hinder their application to more general tasks and setups. For example, suppose that instead of a single image, we are given a collection of similar-looking images. Can standard image compression codecs benefit from the shared redundancy to compress the images further? Such a setup is of great practical importance for the compression of facial or iris images in biometrics, of medical images, or for the compression and transmission of very large, but similar-looking images in remote sensing and astronomy. In these cases, the usage of generic codecs like JPEG-2000, whose basis vectors are not adapted to the statistics of the images, is known to be inefficient.

Take the case of facial images. In spite of the extensive literature on generic image compression, only a handful of learning-based algorithms have studied the compression of facial images. For example, [1] was an early attempt based on VQ. [2] learns dictionaries based on the K-SVD [3], while [4] uses a tree-based wavelet transform. [5] proposes a codec using the Iteration Tuned and Aligned Dictionary (ITAD). In spite of their high compression performance, the problem with most of these approaches is that they rely heavily on the alignment of images and are less likely to generalize once the imaging setup changes even slightly. Some of them require the detection of facial features (sometimes manually), alignment by geometrical transformation into some canonical form, and also a background-removal stage.

Similarly, for the image denoising task, only a few methods have benefited from external clean databases of similar images. For example, [6] reports improvements over the BM3D by using a targeted external database.

On the other hand, one can think of different tasks to be performed jointly. Can more favorable scenarios, like the availability of a collection of similar domain-specific images, help to compress and denoise images at the same time? As a practical scenario, consider an object identification system in which several exemplar images are taken with high-quality acquisition systems in the enrollment mode, while at query time only low-quality and noisy cameras are available. It is highly desirable to be able to jointly denoise and compress the acquisitions.

The rest of the paper is organized as follows. In section 2, a very brief overview of the general image representation formulation is given and several relevant cases are quickly reviewed. Section 3 begins with a review of a problem from rate-distortion theory, namely the reverse water-filling paradigm. This will be used as the core concept behind the proposed Regularized Residual Quantization (RRQ) introduced next. Section 4 presents experiments with the RRQ algorithm on the image compression and denoising tasks. Finally, section 5 concludes the paper.

2 Related work

Many methods for image representation and dictionary learning can be generalized in the inverse-problem formulation of Eq. 1, where $\mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_N]$ contains the data-points (e.g., image patches) $\mathbf{x}_i$'s in its columns, and the codebook and the codes are represented in matrix form as $\mathbf{C} = [\mathbf{c}_1, \cdots, \mathbf{c}_K]$ and $\mathbf{A} = [\boldsymbol{\alpha}_1, \cdots, \boldsymbol{\alpha}_N]$, respectively (notation: matrices are denoted by boldface capital letters, random variables and random vectors by capital letters, and vectors by boldface lower-case letters):

$$\min_{\mathbf{C},\,\mathbf{A}} \; \|\mathbf{X} - \mathbf{C}\mathbf{A}\|_F^2 \quad \text{s.t.} \quad \mathbf{C} \in \mathcal{C}, \; \mathbf{A} \in \mathcal{A}, \qquad (1)$$

where $\mathcal{C}$ and $\mathcal{A}$ are sets of constraints on the construction of the codebook and the codes, respectively.

Depending on $\mathcal{C}$ and $\mathcal{A}$, the problem of Eq. 1 can be treated in many different ways (see [7] and [8] for detailed reviews and discussions). For example, under the well-known sparsity constraint $\|\boldsymbol{\alpha}_i\|_0 \leq k$, or its relaxed $\ell_1$ version, the K-SVD algorithm [3] solves it for local minima in an iterative way.
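For illustration only, the following minimal Python sketch instantiates Eq. 1 under a sparsity constraint using scikit-learn's off-the-shelf dictionary learning as a stand-in for K-SVD; the data, the codebook size K, and the sparsity level k are assumed toy values, not the authors' setup.

# Toy instance of Eq. 1 with a sparsity constraint on the codes, solved with
# scikit-learn's dictionary learning (a stand-in for K-SVD). Rows are data points.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))      # 500 toy "patches" of dimension 64

K, k = 128, 5                           # codebook size and sparsity level (assumed)
dl = DictionaryLearning(n_components=K, transform_algorithm='omp',
                        transform_n_nonzero_coefs=k, max_iter=20, random_state=0)
A = dl.fit_transform(X)                 # codes: one k-sparse row per data point
C = dl.components_                      # codebook: one atom per row

X_hat = A @ C                           # reconstruction, i.e., CA of Eq. 1 (transposed layout)
print('relative error:', np.linalg.norm(X - X_hat) / np.linalg.norm(X))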

In this work, we follow the VQ-based interpretation of Eq. 1, where, as a general formulation, it is required that each code vector select exactly one codeword, i.e., $\boldsymbol{\alpha}_i \in \{0, 1\}^K$ with $\|\boldsymbol{\alpha}_i\|_0 = 1$.

This problem can be solved using the k-means algorithm. However, the lack of structure in this formulation leads to poor generalization performance. To address some of the issues with this simple formulation, Product Quantization (PQ) (e.g., [9, 10]) divides the vectors into several blocks and runs k-means on each of them independently. While PQ can achieve good rate-distortion performance under certain conditions, its lack of design flexibility and the fact that the system should be re-trained for every rate make PQ an unsuitable solution for image analysis.
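As a rough sketch of the PQ idea just described (not the paper's method), the block below splits each vector into blocks and learns an independent k-means codebook per block; all sizes are illustrative assumptions.

# Sketch of Product Quantization: split each vector into M blocks and run
# k-means independently on each block. Sizes are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 64))              # toy training vectors

M, K = 8, 16                                     # 8 blocks, 16 codewords per block (assumed)
codebooks, codes = [], []
for Xb in np.split(X, M, axis=1):                # each block has dimension 64 // M
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(Xb)
    codebooks.append(km.cluster_centers_)        # per-block codebook
    codes.append(km.labels_)                     # one index per vector and block

# Reconstruction: concatenate the selected codeword of each block.
X_hat = np.hstack([cb[idx] for cb, idx in zip(codebooks, codes)])
print('PQ distortion:', np.mean((X - X_hat) ** 2))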

As an alternative, RQ is a multi-layer approach in which each layer quantizes the residuals of the quantization at the previous layer. While it was extensively studied in the 80's and 90's for different tasks like image coding (e.g., refer to [11], [12] or [9]), its efficiency was limited for more modern applications: in practice, it was not possible to learn codewords for more than a couple of layers.
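A minimal sketch of this classical (non-regularized) RQ, with k-means codebooks learned on the residuals left by the previous layer, is given below; the number of layers and codewords are assumed toy values.

# Sketch of plain Residual Quantization: each layer runs k-means on the
# residual of the previous layer and refines the running reconstruction.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 64))              # toy training vectors
L, K = 4, 32                                     # layers and codewords per layer (assumed)

residual, X_hat = X.copy(), np.zeros_like(X)
for layer in range(L):
    km = KMeans(n_clusters=K, n_init=10, random_state=layer).fit(residual)
    quantized = km.cluster_centers_[km.labels_]  # nearest codeword per vector
    X_hat += quantized                           # refine the reconstruction
    residual -= quantized                        # residual passed to the next layer
    print(f'layer {layer}: distortion = {np.mean(residual ** 2):.4f}')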

In this work, we use an RQ-based approach for which we introduce a pre-processing step and an efficient regularization, making it possible to learn an arbitrary number of layers. Moreover, the introduced regularization makes it possible to go beyond image patches and work with the high-dimensional image directly. This brings an important advantage for tasks like image compression: since the global picture of the image is preserved in the high-dimensional representation, one does not have to encode the relation between similar patches after compression.

3 Proposed framework: RRQ

We first recall a concept from rate-distortion theory, namely the quantization of independent Gaussian sources. Although it is studied in a slightly different setup than practical quantization (e.g., it is asymptotic), it motivates the core idea behind the RRQ algorithm introduced later in this section.

3.1 Preliminaries: Quantization of independent sources

The trade-off between the compactness and the fidelity of the representation of a signal is classically treated in Shannon's rate-distortion theory [13] (refer to Ch. 10 of [14] for further details on the material of this subsection).

A special setup studied in this theory is the rate-distortion trade-off for $n$ independent Gaussian sources $X_i$ with different variances. Concretely, assume $X_i \sim \mathcal{N}(0, \sigma_i^2)$, for $1 \leq i \leq n$. Define the expected distortion between a random vector $X = [X_1, \cdots, X_n]^T$ and its estimate $\hat{X}$ as $D = \mathbb{E}[d(X, \hat{X})]$, where the distortion between two $n$-vectors $\mathbf{x}$ and $\hat{\mathbf{x}}$ is defined as $d(\mathbf{x}, \hat{\mathbf{x}}) = \sum_{i=1}^{n}(x_i - \hat{x}_i)^2$.

Here we ask the question: given a fixed total allowed distortion $D$, what is the optimal way to divide the distortion (or rate) between these sources such that the overall allocated rate (distortion) is minimized? This can be posed as:

$$\min_{D_1, \cdots, D_n} \; \sum_{i=1}^{n} \frac{1}{2}\log_2\frac{\sigma_i^2}{D_i} \quad \text{s.t.} \quad \sum_{i=1}^{n} D_i \leq D, \qquad (2)$$

where $D_i$ is the distortion of each source after rate-allocation. The solution to this convex problem is known as the reverse water-filling and is given as:

$$D_i = \begin{cases} \gamma, & \text{if } \sigma_i^2 \geq \gamma, \\ \sigma_i^2, & \text{if } \sigma_i^2 < \gamma, \end{cases} \qquad (3)$$

where $\gamma$ is a constant which should be chosen to guarantee that $\sum_{i=1}^{n} D_i = D$.

Denote by $\sigma_{C_i}^2$ the variance of the codewords used for the quantization of $X_i$. Due to the principle of orthogonality and the independence of dimensions, we have that $\sigma_i^2 = \sigma_{C_i}^2 + D_i$. Therefore, according to Eq. 3, the optimal assignment of the codeword variances is a soft-thresholding of $\sigma_i^2$ with $\gamma$:

$$\sigma_{C_i}^2 = \begin{cases} \sigma_i^2 - \gamma, & \text{if } \sigma_i^2 \geq \gamma, \\ 0, & \text{if } \sigma_i^2 < \gamma. \end{cases} \qquad (4)$$

This means that the optimal rate-allocation requires that the sources with variances less than $\gamma$ should not be assigned any rate at all. This, when used in codebook design, results in sparsity of the codewords, which we incorporate in the RRQ algorithm.
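The water level $\gamma$ of Eq. 3 and the soft-thresholded codeword variances of Eq. 4 can be computed numerically, e.g., by bisection on $\gamma$. The sketch below is a minimal illustration of that computation on assumed toy variances; it is not the authors' code.

# Reverse water-filling (Eq. 3) and codeword variances (Eq. 4): given the
# per-dimension variances and a total distortion budget D, find gamma by
# bisection so that sum_i min(gamma, sigma_i^2) = D.
import numpy as np

def reverse_waterfill(sigma2, D, iters=100):
    lo, hi = 0.0, float(sigma2.max())
    for _ in range(iters):
        gamma = 0.5 * (lo + hi)
        if np.minimum(gamma, sigma2).sum() > D:
            hi = gamma                            # too much distortion: lower the water level
        else:
            lo = gamma
    gamma = 0.5 * (lo + hi)
    D_i = np.minimum(gamma, sigma2)               # Eq. 3: per-source distortions
    var_C = np.maximum(sigma2 - gamma, 0.0)       # Eq. 4: soft-thresholded codeword variances
    return gamma, D_i, var_C

sigma2 = np.sort(np.random.default_rng(0).exponential(1.0, 64))[::-1]   # toy decaying variances
gamma, D_i, var_C = reverse_waterfill(sigma2, D=0.25 * sigma2.sum())
print(f'gamma = {gamma:.4f}, active dimensions: {(var_C > 0).sum()} / {sigma2.size}')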

Input: $\mathbf{I}_j$'s, $1 \leq j \leq N$: images in the train set.
Output: $\mathbf{X}$: matrix of decorrelated vectors; $\mathbf{R}^{(b)}$'s, $1 \leq b \leq B$: rotation matrices for the sub-bands.
1: for $j = 1, \cdots, N$ do
2:     $\tilde{\mathbf{I}}_j \leftarrow \text{2D-DCT}(\mathbf{I}_j)$
3:     $\tilde{\mathbf{x}}_j \leftarrow$ zig-zag vectorization of $\tilde{\mathbf{I}}_j$
4:     Divide $\tilde{\mathbf{x}}_j$ into $B$ equal sub-bands: $\tilde{\mathbf{x}}_j^{(1)}, \cdots, \tilde{\mathbf{x}}_j^{(B)}$
5:     for $b = 1, \cdots, B$ do
6:         Stack all $\tilde{\mathbf{x}}_j^{(b)}$'s to get $\tilde{\mathbf{X}}^{(b)}$
7:     end for
8: end for
9: for $b = 1, \cdots, B$ do
10:     Perform PCA on $\tilde{\mathbf{X}}^{(b)}$ (without dim. reduction) to get $\mathbf{X}^{(b)}$ and $\mathbf{R}^{(b)}$, the rotation matrix
11:     Concatenate the $\mathbf{X}^{(b)}$'s to get $\mathbf{X}$
12: end for
Algorithm 1 Pre-processing
Input: $\mathbf{X}$: de-correlated train set; $K_l$'s: number of codewords per layer; $L$: number of layers.
Output: multi-layer codebooks $\mathbf{C}^{(l)}$'s and index sets $\mathbf{A}^{(l)}$'s, with $1 \leq l \leq L$.
1: $\hat{\mathbf{X}}^{(0)} \leftarrow \mathbf{0}$
2: $\mathbf{E}^{(0)} \leftarrow \mathbf{X}$
3: $\boldsymbol{\sigma}^{2(0)} \leftarrow$ variance per dimension of $\mathbf{E}^{(0)}$
4: for $l = 1, \cdots, L$ do
5:     Find the optimal $\gamma^{*(l)}$ for the variances $\boldsymbol{\sigma}^{2(l-1)}$, given $K_l$
6:     for $i = 1, \cdots, n$ do
7:         $\sigma_{C_i}^{2(l)} \leftarrow \max\big(\sigma_i^{2(l-1)} - \gamma^{*(l)}, 0\big)$ (Eq. 4)
8:     end for
9:     $\Sigma_C^{(l)} \leftarrow \text{diag}\big(\sigma_{C_1}^{2(l)}, \cdots, \sigma_{C_n}^{2(l)}\big)$
10:     for $k = 1, \cdots, K_l$ do
11:         Generate $\mathbf{c}_k^{(l)}$ randomly from $\mathcal{N}(\mathbf{0}, \Sigma_C^{(l)})$
12:         Concatenate the $\mathbf{c}_k^{(l)}$'s to get $\mathbf{C}^{(l)}$
13:     end for
14:     for $j = 1, \cdots, N$ do
15:         $k^* \leftarrow \arg\min_k \|\mathbf{e}_j^{(l-1)} - \mathbf{c}_k^{(l)}\|_2$
16:         $\boldsymbol{\alpha}_j^{(l)} \leftarrow$ all-zero vector with $1$ at the $k^*$-th position
17:         Concatenate the $\boldsymbol{\alpha}_j^{(l)}$'s to get $\mathbf{A}^{(l)}$
18:     end for
19:     $\hat{\mathbf{X}}^{(l)} \leftarrow \hat{\mathbf{X}}^{(l-1)} + \mathbf{C}^{(l)}\mathbf{A}^{(l)}$
20:     $\mathbf{E}^{(l)} \leftarrow \mathbf{E}^{(l-1)} - \mathbf{C}^{(l)}\mathbf{A}^{(l)}$
21:     $\boldsymbol{\sigma}^{2(l)} \leftarrow$ variance per dimension of $\mathbf{E}^{(l)}$
22: end for
Algorithm 2 Regularized Residual Quantization

3.2 The RRQ algorithm

Inspired by the setup studied in section 3.1, we argue that, after a pre-processing stage, natural images can be globally represented as variance-decaying vectors with independent, or at least uncorrelated, dimensions. One might think of PCA as a simple way to achieve this. However, since the dimensionality of the entire vectorized image is high, apart from the large complexity incurred, there would be too many parameters of the covariance matrix to estimate. Therefore, a global PCA would likely over-fit to the training set and deviate largely on the test set. To overcome this issue, we propose the pre-processing of Algorithm 1.

After the PCA rotation matrices are learned from the training set, the same procedure is applied to images from the test set. In fact, this pre-processing is a more robust estimation of the global PCA: instead of the $n^2$ parameters of a direct PCA rotation matrix on the $n$-dimensional vectorized image, with the help of the 2D-DCT this pre-processing has only $B(\frac{n}{B})^2 = \frac{n^2}{B}$ parameters to estimate. This is an effective way to trade independence of dimensions for robustness between the train and test sets.
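A rough Python sketch of this pre-processing, following the structure of Algorithm 1, is given below; it assumes grayscale images of identical size, and the function and variable names are illustrative rather than the authors' implementation.

# Sketch of Algorithm 1: 2D-DCT, zig-zag vectorization, and an independent
# PCA rotation (without dimensionality reduction) per sub-band.
import numpy as np
from scipy.fft import dctn

def zigzag_order(h, w):
    # Zig-zag traversal of an h-by-w matrix, returned as flattened indices.
    order = []
    for s in range(h + w - 1):
        diag = [(i, s - i) for i in range(max(0, s - w + 1), min(h, s + 1))]
        if s % 2 == 0:
            diag.reverse()                        # alternate direction per anti-diagonal
        order += [i * w + j for i, j in diag]
    return np.array(order)

def preprocess(images, B=8):
    # images: (N, h, w) array of grayscale images; returns the matrix of
    # decorrelated vectors X and the B per-sub-band rotation matrices.
    N, h, w = images.shape
    order = zigzag_order(h, w)
    X_tilde = np.stack([dctn(im, norm='ortho').ravel()[order] for im in images])
    X_blocks, rotations = [], []
    for Xb in np.array_split(X_tilde, B, axis=1): # B (roughly) equal sub-bands
        Xb = Xb - Xb.mean(axis=0)                 # center before the PCA rotation
        _, R = np.linalg.eigh(Xb.T @ Xb / max(N - 1, 1))
        R = R[:, ::-1]                            # eigenvectors by decreasing variance
        rotations.append(R)
        X_blocks.append(Xb @ R)                   # rotated (decorrelated) sub-band
    return np.hstack(X_blocks), rotations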

The RRQ framework is introduced in Algorithm 2. For each of the $L$ layers, given the desired number of codewords $K_l$, after calculating the variances of the residuals, the algorithm first finds the optimal $\gamma^{*(l)}$ and the corresponding optimal variances of the codewords based on Eq. 4, and then randomly generates the codewords from these variances. Especially at the first layers, since the data has a strongly decaying variance profile, this makes the codewords very sparse, significantly reducing the complexity and storage cost of the codebooks. The algorithm continues by quantizing the residuals $\mathbf{E}^{(l-1)}$ with the generated codewords, updating the estimates $\hat{\mathbf{X}}^{(l)}$ and the new residuals $\mathbf{E}^{(l)}$, and finishes after the desired number of layers $L$.
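The sketch below illustrates one possible per-layer implementation of Algorithm 2, reusing the reverse_waterfill helper from the sketch in section 3.1; in particular, how $\gamma^*$ is tied to $K_l$ (here simply through a fixed fraction of the residual energy) is an assumption made for illustration, not the authors' exact rule.

# Sketch of the RRQ layers (Algorithm 2): soft-threshold the residual variances
# (Eq. 4), draw K_l random codewords from the resulting zero-mean Gaussian,
# and quantize each residual vector to its nearest codeword.
import numpy as np

def rrq(X, num_codewords, distortion_fraction=0.5, seed=0):
    # X: (N, n) decorrelated training vectors; num_codewords: list of K_l per layer.
    # reverse_waterfill() is the bisection helper from the section 3.1 sketch.
    rng = np.random.default_rng(seed)
    residual, X_hat = X.copy(), np.zeros_like(X)
    codebooks, indices = [], []
    for K_l in num_codewords:
        sigma2 = residual.var(axis=0)
        _, _, var_C = reverse_waterfill(sigma2, D=distortion_fraction * sigma2.sum())
        C = rng.standard_normal((K_l, X.shape[1])) * np.sqrt(var_C)   # random sparse codewords
        dists = ((residual[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        idx = dists.argmin(axis=1)                # nearest codeword per residual vector
        X_hat += C[idx]                           # refine the reconstruction
        residual -= C[idx]                        # residual passed to the next layer
        codebooks.append(C)
        indices.append(idx)
    return X_hat, codebooks, indices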

Figure 1: Average performances on the CroppedYale-B set: (a) D-R curve (normalized, log-scale), (b) image compression, (c) image denoising.

4 Experiments

We perform the two tasks of image compression and denoising of facial images. For image compression, we compare the performance of our proposed method with the JPEG and JPEG-2000 standards. For denoising, we compare with the BM3D. These are widely considered as baselines for comparison in the literature.

The CroppedYale-B set [15] is used, which contains 2408 images from 38 subjects. Each subject has between 57 and 64 acquisitions with extreme illumination changes. For each subject, we randomly choose half of the images for training and the rest for testing.

We choose two different $(L, K)$ value-pairs, where $L$ is the number of layers and $K$ is the number of codewords per layer. As described earlier, all codewords are generated randomly according to Eq. 4. Algorithm 1 is used for pre-processing with $B$ sub-bands. The resulting decorrelated vectors have the same dimension as the original (vectorized) images.

Figure 1(a) sketches the D-R curve for this set. It is seen that the gap between the training and the test sets for the proposed RRQ is very small, indicating the success of the algorithm in terms of generalization. The non-regularized RQ, on the other hand, while having much lower distortion on the train set, fails to compress the test set at the first several layers.

Fig. 1(b) shows the results of image compression. These results are averaged over 20 randomly chosen images from the test set. The advantage of the proposed method under this setup over the highly-optimized JPEG-2000 codec is significant, particularly at lower rates. It should be noted that we do not perform any entropy coding over the codebook indices. Further compression improvement can be achieved by entropy coding over the tree-like structure of the codebooks.

The results of image denoising for three different noise levels (gray values are normalized between 0 and 1), averaged over 20 randomly chosen test images, are depicted in Fig. 1(c). The network is trained on clean images and is exactly the same as the one used for compression. Test images are contaminated with noise and are given as the input to the network for reconstruction. When reconstructing the noisy image, the network uses the priors from the clean images on which it has been trained. These priors are automatically used in the reconstruction process, serving as an efficient denoising strategy that surpasses the prior-less BM3D, although only in highly noisy regimes.
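Purely as an illustration of the evaluation protocol (the reconstruction function below is a hypothetical stand-in for the trained multi-layer network, not an API of the paper), test images normalized to [0, 1] could be scored as follows.

# Denoising evaluation sketch: add Gaussian noise to a clean, [0, 1]-normalized
# image, reconstruct it with the already-trained model, and report PSNR.
# `rrq_reconstruct` is a hypothetical placeholder for that trained model.
import numpy as np

def psnr(clean, estimate, peak=1.0):
    mse = np.mean((clean - estimate) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def evaluate_denoising(clean_image, rrq_reconstruct, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    noisy = clean_image + rng.normal(0.0, noise_std, clean_image.shape)
    denoised = rrq_reconstruct(noisy)             # reconstruction uses the clean-image priors
    return psnr(clean_image, noisy), psnr(clean_image, denoised)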

As the network tries to reconstruct the noisy image in further detail, the noise statistics become more present in the reconstructed image, hence degrading the quality. Therefore, depending on the noise variance, the maximum PSNR is reached somewhere in the middle of the distortion-rate curve; noisier images reach their maximum at lower rates.

Fig. 2 illustrates the denoising quality for two image samples. It is interesting to notice that the BM3D, although producing a smooth image, fails to reconstruct the face contours since it lacks enough priors.

Figure 2: Samples of image denoising. Order of columns: original image, noisy (noise variance), BM3D (PSNR), and RRQ (PSNR).

5 Conclusions

A framework for multi-layer representation of images was proposed where, instead of local patch-based processing, a global high-dimensional vector representation of images is successively quantized at different levels of reconstruction fidelity. As an alternative to the classical RQ framework, which is based on k-means, the proposed RRQ, along with its pre-processing, randomly generates codewords from a regularized and learned distribution. Apart from the many potential advantages of having random codewords, this is shown to lead to efficient quantization with low train-test distortion gaps. The experimental results show promise for different practical scenarios, e.g., when the acquisition devices at the query phase are much noisier than the enrollment cameras. Future work will consider using the variance priors to further train the codewords, as well as entropy coding on the tree of indices for better rate-distortion performance.

References

  • [1] M. Elad, R. Goldenberg, and R. Kimmel, “Low bit-rate compression of facial images,” IEEE Transactions on Image Processing, vol. 16, no. 9, pp. 2379–2383, Sept 2007.
  • [2] Ori Bryt and Michael Elad, “Compression of facial images using the K-SVD algorithm,” Journal of Visual Communication and Image Representation, vol. 19, no. 4, pp. 270–282, 2008.
  • [3] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, Nov 2006.
  • [4] I. Ram, I. Cohen, and M. Elad, “Facial image compression using patch-ordering-based adaptive wavelet transform,” IEEE Signal Processing Letters, vol. 21, no. 10, pp. 1270–1274, Oct 2014.
  • [5] J. Zepeda, C. Guillemot, and E. Kijak, “Image compression using sparse representations and the iteration-tuned and aligned dictionary,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 5, pp. 1061–1073, Sept 2011.
  • [6] E. Luo, S. H. Chan, and T. Q. Nguyen, “Adaptive image denoising by targeted databases,” IEEE Transactions on Image Processing, vol. 24, no. 7, pp. 2167–2181, July 2015.
  • [7] R. Rubinstein, A. M. Bruckstein, and M. Elad, “Dictionaries for sparse representation modeling,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1045–1057, June 2010.
  • [8] Julien Mairal, Francis Bach, and Jean Ponce, “Sparse modeling for image and vision processing,” Foundations and Trends® in Computer Graphics and Vision, vol. 8, no. 2-3, pp. 85–283, 2014.
  • [9] Allen Gersho and Robert M. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, Norwell, MA, USA, 1991.
  • [10] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, Jan 2011.
  • [11] C. F. Barnes, S. A. Rizvi, and N. M. Nasrabadi, “Advances in residual vector quantization: a review,” IEEE Transactions on Image Processing, vol. 5, no. 2, pp. 226–262, Feb 1996.
  • [12] N. M. Nasrabadi and R. A. King, “Image coding using vector quantization: a review,” IEEE Transactions on Communications, vol. 36, no. 8, pp. 957–971, Aug 1988.
  • [13] Claude E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE Nat. Conv. Rec., vol. 4, pp. 142–163, 1959.
  • [14] T. Cover and J. Thomas, Elements of Information Theory 2nd Edition, Wiley-Interscience, 2 edition, 7 2006.
  • [15] A.S. Georghiades, P.N. Belhumeur, and D.J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Trans. Pattern Anal. Mach. Intelligence, vol. 23, no. 6, pp. 643–660, 2001.