1 Introduction
Sparse data approximation is a fundamental problem in many areas of signal processing and machine learning. For different tasks like multimedia compression, content identification, multiclass classification and representation learning, one aims at straightforward, concise and computationally feasible approximations.
The above requirements, while being conflicting in nature, have been formulated and extensively studied under different concepts and applications like rate-distortion theory, approximate nearest neighbor search, vector quantization, dictionary learning and supervised/unsupervised learning in different disciplines.
In this work, we try to address some of the issues considered in these topics by asking which data representation scheme is the most concise in terms of memory storage, the fastest in terms of computational complexity and the most accurate in terms of fidelity to the original data.
To this end, we propose a framework that could potentially be used in many different applications such as quality enhancement, denoising, inpainting, visual recognition and joint compression-encryption. Since the general idea behind this approach is present in several earlier works [1], [2], we unify them and treat the problem from a practically significant perspective along with an information-theoretic analysis.
In particular, for this work, we consider the problem of image compression. We show that the proposed framework, when adapted to a particular class of images (face images in our application), achieves a considerable gain in compression performance over the JPEG2000 codec in the very low bit-rate regime on images belonging to the same class.
Such a problem formulation is of great practical significance for applications where a large number of images of a similar nature, like facial/iris images in biometrics, medical images, remote sensing and astronomical images, are to be compressed and communicated. In this case, the usage of a generic codec whose basis vectors are not adapted to the statistics of the images is known to be inefficient, especially in the low-rate regime. Moreover, the overhead of storing a common trained codebook might be minor in comparison to the gain over millions of images.
The paper is organized as follows. In section 2 we briefly review the classic Shannon Rate-Distortion theory where the data are represented in one single layer. In section 3 we discuss the information-theoretic analysis of the multi-layer structure. Section 4 studies the behavior of i.i.d. sources of information under the multi-layer structure. Section 5 considers images as the data to be treated within this framework, where a short review of facial image compression in the literature is also provided. The experimental results for image compression are discussed in section 6. Finally, we conclude the paper in section 7.
2 Shannon Rate-Distortion Theory: Shallow Representation
The tradeoff between the concise representation of a source of information and its fidelity was theoretically treated and formulated by Shannon in [3]. In this analysis, for the joint description of the outcomes of a sequence of $n$ random variables, $X^n = (X_1, \ldots, X_n)$, the measure of compactness is the compression rate, defined as $R = \frac{1}{n}\log_2 M$ and measured in bits, when we store $M$ codewords in a codebook, each of which refers to a data point in the space of $X^n$. The codewords $\hat{X}^n(i)$ are generated from a distribution $p(\hat{x})$ and organized into a shallow codebook as shown in Fig. 1. Each codeword has an assigned index $i \in \{1, \ldots, M\}$. This codebook is shared between the encoder and the decoder. The sequence $\hat{X}^n(i)$ with index $i$ represents a compressed counterpart of $X^n$. It should be pointed out that the codebook is overcomplete, since $M = 2^{nR} \gg n$, i.e., the number of codewords is exponential in $n$. Moreover, the representation is sparse since only one codeword is used for the approximation of $X^n$.
This representation leads to a loss of quality that should be measured as a distortion between $X^n$ and $\hat{X}^n$. One widely used measure of distortion is the mean squared error (MSE), defined as $d(X^n, \hat{X}^n) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{X}_i)^2$.
The Shannon theory relates these two concepts by defining the rate-distortion function $R(D)$ and relating it to the mutual information between the sequence and its representation, hence paving the way for the calculation of this function for various sources.
More concretely, rate-distortion theory states that in order to guarantee that the expected distortion between $X^n$ and $\hat{X}^n$ is less than a threshold distortion value $D$, i.e., $\mathbb{E}[d(X^n, \hat{X}^n)] \le D$, the compression rate $R$ should be lower-bounded by the rate-distortion function $R(D)$. This lower bound is proven to be equal to:
$$R(D) = \min_{p(\hat{x}|x):\; \mathbb{E}[d(X,\hat{X})] \le D} I(X;\hat{X}). \qquad (1)$$
An important consequence of this theory is that, for memoryless sources of information emitting i.i.d. sequences, the distortion-rate function $D(R)$ (an alternative to the rate-distortion function) is upper-bounded by that of the Gaussian source with the same variance $\sigma^2$ under the MSE distortion measure:
$$D(R) \le \sigma^2\, 2^{-2R}. \qquad (2)$$
These bounds, suggested by Shannon's rate-distortion theory, are however proven to be achieved only in the asymptotic case where the block length $n \to \infty$. For a fixed rate $R$, any increase in the block length leads to an exponential increase in the number of representations, since $M = 2^{nR}$. This means that, in the data representation language where several data points are to be stored in memory and exhaustively matched when queries are presented, one has to deal with exponential complexity in both search and memory storage. Therefore, the current setup, while conceptually very important, is not appealing for many practical scenarios.
3 Information-Theoretic Analysis of Multi-Layer Representation
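Instead of the above single-layer (shallow) representation, where we have one shallow codebook $\mathcal{C}$ with $M = 2^{nR}$ codewords and $\hat{X}^n$ is the representation vector in $\mathcal{C}$, consider the case where we have $L$ codebooks $\mathcal{C}_1, \ldots, \mathcal{C}_L$, where the codebook $\mathcal{C}_l$ consists of $M_l = 2^{nR_l}$ codewords $\hat{X}_l^n(1), \ldots, \hat{X}_l^n(M_l)$. The number of codewords, $M_l$, or the corresponding rates $R_l$, will be specified later.
Suppose the final encoding or source approximation is done as $\hat{X}^n = f(\hat{X}_1^n, \ldots, \hat{X}_L^n)$. The rate-distortion function can then be calculated from equation (1). Since $\hat{X}^n$ is a function of the layer outputs, the mutual information in this case can be bounded as
$$I(X^n;\hat{X}^n) \le I(X^n;\hat{X}_1^n,\ldots,\hat{X}_L^n) = \sum_{l=1}^{L} I(X^n;\hat{X}_l^n \mid \hat{X}_1^n,\ldots,\hat{X}_{l-1}^n). \qquad (3)$$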
The important consequence of equation (3) is that, instead of a single high rate $R$, which requires exponential storage and computational complexity in the shallow representation (due to $M = 2^{nR}$), one can achieve a targeted $R$ with $L$ codebooks, each with $M_l = 2^{nR_l}$ codewords and a very low rate $R_l$, such that
$$R = \sum_{l=1}^{L} R_l,$$
or equivalently,
$$\frac{1}{n}\log_2 M = \sum_{l=1}^{L} \frac{1}{n}\log_2 M_l. \qquad (4)$$
Therefore, the exponential size of the shallow codebook required for high rates is achieved by multiplying smaller codebook sizes, i.e., the equivalent alphabet size will be:
$$M = \prod_{l=1}^{L} M_l = 2^{n\sum_{l=1}^{L} R_l}. \qquad (5)$$
Fig. 2 sketches the structure of the codebooks and the decoding in the multi-layer representation.
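To make the size argument of equations (4) and (5) concrete, the following minimal sketch compares the number of codewords that must be stored and searched in a shallow codebook with that of a multi-layer structure of the same total rate. The function names and the values of the block length, rate and number of layers are illustrative assumptions and are not taken from the experiments of this paper.

```python
from math import prod

def shallow_size(n, rate):
    """Codewords in one shallow codebook: M = 2^(n R)."""
    return 2 ** round(n * rate)

def multilayer_sizes(n, rates):
    """Per-layer sizes M_l = 2^(n R_l), stored total, and equivalent size."""
    per_layer = [2 ** round(n * r) for r in rates]
    return per_layer, sum(per_layer), prod(per_layer)

# Illustrative values (assumptions): block length 64, total rate 1 bit, 16 layers.
n, total_rate, num_layers = 64, 1.0, 16
rates = [total_rate / num_layers] * num_layers
per_layer, stored, equivalent = multilayer_sizes(n, rates)

# 16 codewords per layer, 256 stored in total, same equivalent size as 2^64.
print(per_layer[0], stored, equivalent == shallow_size(n, total_rate))
# prints: 16 256 True
```

With these values, the multi-layer structure stores 256 codewords while indexing the same number of representations as a shallow codebook of $2^{64}$ entries.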
3.1 Multi-Layer Additive Structure
Consider the special case where the reconstruction function $f(\cdot)$ is additive, i.e., $\hat{X}^n = \sum_{l=1}^{L} \hat{X}_l^n$. Given a realization $x^n$ of the source, the encoder in this case finds the Euclidean nearest neighbor of the sequence within the codewords of the first codebook, calculates the error of estimation, passes it to the next stage, and repeats the same procedure until the last stage. The overall error is then equivalent to the error in the last stage since, with $e_0 = x^n$ and $e_l = e_{l-1} - Q_l(e_{l-1})$, we have $x^n - \sum_{l=1}^{L} Q_l(e_{l-1}) = e_L$, where $Q_l$ denotes the Vector Quantizer for the $l$-th stage with distortion $D_l$. Moreover, for the Gaussian source we have that $D_l = D_{l-1}\, 2^{-2R_l}$, where $D_{l-1}$ stands for the distortion of the previous stage, and $D_0 = \sigma^2$.
The search and memory complexities will be $\mathcal{O}\big(\sum_{l=1}^{L} 2^{nR_l}\big)$ instead of $\mathcal{O}(2^{nR})$ in the shallow structure.
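Reducing the multi-stage encoding function to the addition operator, while being simple and intuitive, reduces the optimality, since in general $I\big(X^n; \sum_{l=1}^{L}\hat{X}_l^n\big) \le I(X^n; \hat{X}_1^n, \ldots, \hat{X}_L^n)$, and the left-hand side cannot be decomposed directly into conditional terms as in equation (3).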
This issue, while reducing optimality due to some information loss in the addition operations, keeps the great advantage of breaking the exponential complexity of one huge shallow structure into several codebooks of considerably smaller sizes. In addition, learning an exponentially large codebook in the shallow structure is infeasible in practice, since it also requires an exponential number of training samples. In contrast, the multi-layer structure can be easily trained for low rates $R_l$.
In section 4, we simulate the performance of this scheme for i.i.d. sources of information and show that this loss is not a limiting factor.
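The additive multi-stage procedure of section 3.1 can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names `encode` and `decode` and the toy codebooks are assumptions made for the example, and each codebook is assumed to be a NumPy array with one codeword per row.

```python
import numpy as np

def encode(x, codebooks):
    """Return one index per layer and the final residual e_L."""
    residual = np.asarray(x, dtype=float).copy()
    indices = []
    for codebook in codebooks:
        # Euclidean nearest neighbor of the current residual in this layer.
        distances = np.sum((codebook - residual) ** 2, axis=1)
        i = int(np.argmin(distances))
        indices.append(i)
        residual = residual - codebook[i]  # error passed to the next stage
    return indices, residual

def decode(indices, codebooks):
    """Additive reconstruction: sum of the selected codewords."""
    return np.sum([cb[i] for cb, i in zip(codebooks, indices)], axis=0)

# Toy example with two hand-made layers (illustrative values only).
codebooks = [np.array([[0.0, 0.0], [1.0, 1.0]]),
             np.array([[0.0, 0.0], [0.5, -0.5]])]
x = np.array([1.4, 0.6])
indices, final_error = encode(x, codebooks)
x_hat = decode(indices, codebooks)
```

In this toy case the overall reconstruction error `x - x_hat` coincides with the residual of the last stage, as stated in section 3.1, and each layer is searched separately, so the number of distance computations is the sum of the layer sizes rather than their product.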
4 Multi-Layer Representation of i.i.d. Sources for Synthetic Data
Consider the stationary ergodic source $X^n = (X_1, \ldots, X_n)$ with $X_i \sim \mathcal{N}(0, \sigma_i^2)$. The realizations of this source are to be represented with $L$ codebooks, each with $M_l$ codewords. The encoding and decoding are done as described in section 3.1.
For the encoding at each stage, using the stationarity and ergodicity assumptions on the source, and therefore the specific geometry imposed on the data distribution in $\mathbb{R}^n$ as $n$ grows large, we design the codewords of the different codebooks very efficiently using only random codewords that are properly normalized.
Suppose the case where $\sigma_i^2 = \sigma^2$ for all $i$, which means that the data are i.i.d. In this case, the data concentrate around a spherical shell with radius $\sqrt{n\sigma^2}$ as $n$ grows large.
For the first stage of encoding, suppose we want to compress the data to the rate $R_1$. The achievable distortion for this stage and large enough $n$ is given by equation (2) as $D_1 = \sigma^2\, 2^{-2R_1}$, which is achieved for optimal codebook design, in this case with random structure.
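Due to the optimality proved for this hypothetical case in terms of MSE distortion, one can conclude that the vector of estimation is orthogonal to its error (by the principle of orthogonality), i.e., $\langle \hat{x}_1^n,\, x^n - \hat{x}_1^n \rangle = 0$.
Therefore, from the law of cosines, $\|x^n\|^2 = \|\hat{x}_1^n\|^2 + \|x^n - \hat{x}_1^n\|^2$, and one can confirm that the variance of the codewords of the first codebook is:
$$\sigma_{C_1}^2 = \sigma^2 - D_1 = \sigma^2\left(1 - 2^{-2R_1}\right). \qquad (6)$$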
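Extending the argument to other layers, where the input to the $l$-th layer is the residual of the previous layer with variance $D_{l-1}$ (and $D_0 = \sigma^2$), it can be concluded that the variance of the codewords of the $l$-th layer is given as:
$$\sigma_{C_l}^2 = D_{l-1} - D_l = \sigma^2\, 2^{-2\sum_{j=1}^{l-1} R_j}\left(1 - 2^{-2R_l}\right). \qquad (7)$$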
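As a small numerical illustration of equations (6) and (7), the sketch below computes the codeword variance and the resulting distortion layer by layer, assuming that every stage attains the Gaussian distortion-rate function of equation (2). The function name `layer_schedule` and the example rates are assumptions made for this illustration.

```python
def layer_schedule(sigma2, rates):
    """Return a list of (codeword variance, distortion) pairs, one per layer."""
    schedule = []
    d_prev = sigma2                       # D_0 is the source variance
    for r in rates:
        d_l = d_prev * 2.0 ** (-2.0 * r)  # distortion after this layer
        schedule.append((d_prev - d_l, d_l))  # codeword variance, distortion
        d_prev = d_l
    return schedule

# Unit-variance source with two layers of 0.5 bit each (illustrative values).
schedule = layer_schedule(1.0, [0.5, 0.5])
print(schedule)
# prints: [(0.5, 0.5), (0.25, 0.25)]
```

The codeword variances of all layers and the final distortion add up to the source variance, which is the energy decomposition implied by the orthogonality argument.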
4.1 Design of Random Codebooks for Multi-Layer Representation
The use of random codebooks is appealing both in theory and for practical applications. Avoiding overfitting to the seen data in a machine learning setup, preservation of privacy and security in multimedia or medical data management and eliminating the computational cost of codebook design are among the advantages of this approach.
To this end, equations (6) and (7) should be taken into account in random codebook design. Fig. 3 shows the effect of the normalization of the codewords in a codebook on the achieved distortion, and also on the orthogonality between the approximation and its error, for different codebook variances.
As shown in the figure, the empirical optimum for the codebook variance is not far from the theoretical optimum, the difference being due to geometrical variations of the $n$-sphere for different values of $n$.
In fact, by proper normalization, as dictated by these equations, we show that we can get very close to the theoretical distortion-rate limit of equation (2) for moderate values of $n$.
Fig. 4 shows the achieved distortion for a synthesized i.i.d. source at different compression rates. We consider compression using two different sets of codebooks. The first is a randomly generated i.i.d. Gaussian codebook with variance $\sigma_{C_l}^2$ for the $l$-th layer, computed from equation (7) with $D_l$ taken as the average distortion of the $l$-th layer.
The second set consists of equiprobable binary codebooks with alphabet $\{-\sigma_{C_l}, +\sigma_{C_l}\}$, chosen to guarantee the same variance as the Gaussian case in each layer.
As seen from the figure, the achieved distortion-rate function, without the exponential complexity burdens of the shallow structure, closely approximates the behavior of the Shannon lower bound in equation (2). The gap to the theoretical limit is due both to the finite block length and to the information loss caused by the additive encoding, as explained in section 3.1.
Interestingly, the behavior of the two codebook design strategies is the same. This is due to the choice of very small rates at each stage. In fact, in an analogy with channel coding, the dual problem of rate-distortion theory, one can verify that the capacities of the Gaussian channel and the binary symmetric channel are very close at extremely small rates.
This fact is of great practical significance since, if the rate selection and normalization are done properly, one does not have to worry about matching the distribution of the codebooks to that of the source. Moreover, the memory storage of real-valued codewords can be reduced to that of binary values.
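The random codebook design discussed in this section can be sketched with a small simulation. This is only an illustration under stated assumptions, not the setup behind Fig. 4: the function name `simulate`, the block length, the number of samples, the number of layers and the codebook size are all chosen for the example, and the distortion of the previous layer is estimated empirically from the residuals.

```python
import numpy as np

def simulate(n=32, num_samples=200, num_layers=8, codewords_per_layer=16,
             sigma2=1.0, binary=False, seed=0):
    """Multi-layer coding of an i.i.d. Gaussian source with random codebooks."""
    rng = np.random.default_rng(seed)
    residual = rng.normal(0.0, np.sqrt(sigma2), size=(num_samples, n))
    rate_per_layer = np.log2(codewords_per_layer) / n
    distortions = []
    for _ in range(num_layers):
        d_prev = np.mean(residual ** 2)   # empirical distortion so far
        # Codeword variance from the normalization rule of equation (7).
        var_c = d_prev * (1.0 - 2.0 ** (-2.0 * rate_per_layer))
        shape = (codewords_per_layer, n)
        if binary:
            # Equiprobable binary codewords with the same variance.
            codebook = np.sqrt(var_c) * rng.choice([-1.0, 1.0], size=shape)
        else:
            codebook = rng.normal(0.0, np.sqrt(var_c), size=shape)
        # Squared distances between every residual and every codeword.
        diff = residual[:, None, :] - codebook[None, :, :]
        nearest = np.argmin(np.sum(diff ** 2, axis=2), axis=1)
        residual = residual - codebook[nearest]
        distortions.append(float(np.mean(residual ** 2)))
    total_rate = num_layers * rate_per_layer
    bound = sigma2 * 2.0 ** (-2.0 * total_rate)   # equation (2)
    return total_rate, distortions, bound

rate, d_gauss, bound = simulate(binary=False)
_, d_bin, _ = simulate(binary=True)
print(rate, bound, d_gauss[-1], d_bin[-1])
```

With such a short block length the empirical distortion stays above the limit of equation (2); the sketch is meant only to show the mechanics of the normalization and of the two codebook types.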
5 Facial Image Compression
Due to its practical significance, image compression has become a very mature field of both research and technology. Among the existing methods of image compression, JPEG2000 is reported to be among the best existing algorithms used in practice [4], with a very intricate structure to achieve a highly optimized tradeoff between compression ratio and performance. However, since it is a general purpose codec, for applications where compression of a large number of similar images is concerned, one could think of compression methods that exploit the extra redundancy present due to the similarities of the application images. Moreover, the JPEG2000 codec is not capable of providing very high compression ratios, while many applications would require images to be highly compressed and would compromise quality for description efficiency.
One important example for this scenario is the compression of facial images. They are available in large quantities in big databases of police departments, organizations and entities with many employees and users. Efficient compression of these images in terms of storage and computational complexity is very important, since it immediately frees resources and thus allows services to be provided to more users. Moreover, in some applications, the recognition informativeness of facial images is of more importance than quality and fine details.
Apart from the extensive literature on image compression, there have been several works on the compression of facial images. In [5], a facial compression scheme based on Vector Quantization was proposed, where a considerable performance improvement over JPEG2000 was reported at very low rates. However, this method needs detection of facial features (sometimes manually), alignment by geometrical transformation into a canonical form and also background removal, which makes it very sensitive to the required preprocessing. Within the same setup, an approach based on dictionary learning with the K-SVD algorithm was proposed in [6], where a special dictionary was learned for every block location of the image. In another work [7], facial image compression using the Redundant Tree-Based Wavelet Transform (RTBWT) was proposed with the same preprocessing and a filtering-based post-processing to improve the quality of images. In spite of their high performance in terms of PSNR, the problem with these approaches is that they rely heavily on the alignment of images and are less likely to generalize once the imaging setup is changed slightly.
Another scheme was proposed in [8], where the authors propose a codec using the Iteration Tuned and Aligned Dictionary (ITAD) to compress facial images, where dictionaries are tuned in every iteration of the pursuit algorithm used. A considerable compression performance gain is reported for a wider range of compression rates. However, the tree structure of the dictionaries requires considerable storage.
5.1 Multi-Layer Approximation of Images
We apply the above framework to compress facial images. Images from the training set are divided into non-overlapping blocks and then gathered in a database. Without any special preprocessing, the blocks are vectorized and fed to the simple k-means algorithm. The residual of quantization is fed to the next stages for further quantization. To avoid overfitting, $M_l$, the number of cluster centroids (codewords) at the $l$-th layer, is chosen such that the distortion of reconstruction of the test data is within a margin of the distortion of reconstruction of the training data.
The encoding part consists of assigning to each image block a sequence of $L$ indices, the $l$-th taking values from an alphabet of $M_l$ codewords. Therefore, the Bits Per Pixel (BPP) value for the image will be $\frac{1}{b^2}\sum_{l=1}^{L}\log_2 M_l$, where $b \times b$ is the block size. This value could be reduced by applying entropy coding to the indices, where a probability table could be trained from the training set for each of the stages.
The decoding part simply consists of table lookups to read the values of the corresponding entropy-decoded sequences of codewords for each block, followed by their addition. This process could be done online and sequentially once the required bits for each stage are received.
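The block-based training described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the plain Lloyd-style k-means written here in NumPy, the block size and the codebook sizes are all chosen for the example and are not the settings of the experiments; the selection of codebook sizes on test data and the entropy coding of the indices are omitted.

```python
import numpy as np

def extract_blocks(image, b):
    """Vectorize the non-overlapping b x b blocks of a 2-D image."""
    h, w = image.shape
    h, w = h - h % b, w - w % b           # drop incomplete border blocks
    blocks = image[:h, :w].reshape(h // b, b, w // b, b).swapaxes(1, 2)
    return blocks.reshape(-1, b * b).astype(float)

def assign(data, centroids):
    """Index of the nearest centroid for every row of data."""
    diff = data[:, None, :] - centroids[None, :, :]
    return np.argmin(np.sum(diff ** 2, axis=2), axis=1)

def kmeans(data, k, iters=20, seed=0):
    """Plain Lloyd iterations, a stand-in for any k-means routine."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(data, centroids)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:          # keep the old centroid if empty
                centroids[j] = members.mean(axis=0)
    return centroids

def train_layers(blocks, sizes):
    """Train one codebook per layer on the residual of the previous layer."""
    residual, codebooks = blocks.copy(), []
    for k in sizes:
        codebook = kmeans(residual, k)
        residual = residual - codebook[assign(residual, codebook)]
        codebooks.append(codebook)
    return codebooks, residual

def bits_per_pixel(sizes, b):
    """BPP without entropy coding: sum of log2 of layer sizes over b^2."""
    return float(sum(np.log2(k) for k in sizes)) / (b * b)

# Synthetic stand-in for a training image (illustrative values only).
image = np.random.default_rng(1).integers(0, 256, size=(32, 32))
blocks = extract_blocks(image, 8)
codebooks, residual = train_layers(blocks, sizes=[4, 4])
bpp = bits_per_pixel([4, 4], 8)
```

For two layers of four codewords each and 8 x 8 blocks, the rate is 4 bits per block, i.e., 0.0625 BPP before any entropy coding.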
6 Experimental Results for Image Compression
We used 2400 randomly chosen images (split into a training set and a test set) from the CroppedYaleB [9] database of cropped facial images with different lighting conditions. This is a difficult database for compression since the variation of lighting in the images is very significant and shadows can obscure different parts of the face in different images. Therefore, one cannot train highly specialized dictionaries for different locations. Moreover, unlike the databases used in the existing approaches, the background is completely removed from the faces and the algorithm cannot benefit from redundant areas common to all images.
We used several layers of global codebooks, with the number of codewords chosen separately for the first, second, third and fourth groups of five consecutive layers. As previously mentioned, the choice of these values should avoid overfitting. As can be understood from the values chosen for the number of codewords, and as also seen from Fig. 7, the latter stages have a less correlated structure and tend to overtrain more easily.
Fig. (a) compares the compression behavior of this experiment with that of the JPEG2000 codec in terms of PSNR (dB) for different values of Bits Per Pixel (BPP), and Fig. (b) shows the rate-distortion performance. As seen from this figure, the quality of the compressed facial images at very low rates is significantly superior to that of JPEG2000. We used a simple arithmetic coding for the indices.
7 Conclusions
We presented a multi-layer data representation approach and justified its efficiency in terms of data fidelity, memory storage and computational complexity with information-theoretic arguments. We then used this approach for the application of image compression when the images belong to a certain class, in our experiments facial images. We showed that a significant performance boost over the JPEG2000 codec was achieved in the very low rate regime. This was obtained with a simple structure in the direct pixel domain, which could still be improved in different ways in terms of the choice of codebook sizes, the entropy coding used or post-processing to reduce the blocking artifact.
Acknowledgments
The research has been partially supported by SNF grant 1200020146379 and by a grant from Switzerland through the Swiss Contribution to the enlarged European Union PSPB125/2010.
References

 [1] R. Venkataramanan, T. Sarkar, and S. Tatikonda, “Lossy compression via sparse linear regression: Computationally efficient encoding and decoding,” Information Theory, IEEE Transactions on, vol. 60, no. 6, pp. 3265–3278, June 2014.
 [2] A. Gersho and R. Gray, Vector Quantization and Signal Compression, Number 159 in The Kluwer International Series in Engineering and Computer Science. Kluwer, 1992.
 [3] Claude E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE Nat. Conv. Rec., vol. 4, pp. 142–163, 1959.
 [4] David Taubman and Michael Marcellin, JPEG2000 Image Compression Fundamentals, Standards and Practice (The Springer International Series in Engineering and Computer Science), Springer, softcover reprint of the original 1st ed. 2002 edition, June 2013.
 [5] M. Elad, R. Goldenberg, and R. Kimmel, “Low bit-rate compression of facial images,” Image Processing, IEEE Transactions on, vol. 16, no. 9, pp. 2379–2383, Sept 2007.
 [6] Ori Bryt and Michael Elad, “Compression of facial images using the K-SVD algorithm,” Journal of Visual Communication and Image Representation, vol. 19, no. 4, pp. 270–282, 2008.
 [7] I. Ram, I. Cohen, and M. Elad, “Facial image compression using patch-ordering-based adaptive wavelet transform,” Signal Processing Letters, IEEE, vol. 21, no. 10, pp. 1270–1274, Oct 2014.
 [8] J. Zepeda, C. Guillemot, and E. Kijak, “Image compression using sparse representations and the iteration-tuned and aligned dictionary,” Selected Topics in Signal Processing, IEEE Journal of, vol. 5, no. 5, pp. 1061–1073, Sept 2011.
 [9] Georghiades et al., “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Trans. Pattern Anal. Mach. Intelligence, 2001.