Parametric context adaptive Laplace distribution for multimedia compression

05/28/2019, by Jarek Duda, et al.

Data compression often subtracts a predictor and encodes the difference (residue) assuming a Laplace distribution, for example for images, video, audio, or numerical data. Performance depends strongly on a proper choice of the width (scale parameter) of this parametric distribution, and can be improved by optimizing it based on the local situation, i.e. the context. For example, the popular LOCO-I (JPEG-LS) lossless image compressor uses a 3-dimensional context quantized into 365 discrete possibilities treated independently. This article discusses inexpensive approaches to exploiting their dependencies by using ARCH-like context-dependent models for the parameters of the parametric distribution of the residue, also evolving in time in the adaptive case. For example, the tested 4- and 11-parameter models turned out to provide performance similar to the 365-parameter LOCO-I model on 48 tested images. Beside smaller headers, such a reduction of the number of parameters can lead to better generalization, and makes it practical to exploit higher dimensional contexts, for example using information from all 3 color channels, further pixels, or some additional region classifiers.


I Introduction

Many types of data statistically agree with specific parametric distributions, like the Gaussian distribution through the law of large numbers, or the Laplace distribution, popular in data compression as it agrees with the statistics of prediction errors (residues). Their parameters can often be estimated inexpensively, and storing them in a header is much cheaper than storing e.g. the entire probability distribution on some quantized set of represented values. Parametric distributions also smooth between the discretized possibilities, generalizing statistical trends emerging in a given type of data.

However, for example due to randomness alone, the statistics of real data usually deviate somewhat from such an idealization. Directly storing counted frequencies can exploit this difference, asymptotically gaining the Kullback-Leibler divergence in bits/value, at the cost of a larger header. Data compressors need to optimize this minimum description length [2] tradeoff between model size and the entropy it leads to.

In practice, instead of a single e.g. Laplace distribution to encode residues (prediction errors) for the entire image, we would like to make its parameters depend on the local situation - through context dependence as in Markov modelling, or adaptivity as for non-stationary time series.

The possibility to directly store all values fades away when increasing the dimension of the model - both due to the size growing exponentially with dimension and due to underrepresentation. Going to higher dimensions requires finding and exploiting some general behaviour, for example through parametrizations, as in the examples presented in Fig. 1.

Figure 1: Comparison of some discussed models for the 48 grayscale 8-bit 512x512 images presented in Fig. 2. Top left: first we need to predict the pixel value based on the current context: the 4 already decoded neighboring pixels (A, B, C, D). This prediction μ is used as the center of the Laplace distribution and should estimate the median, which minimizes the mean absolute distance. Hence the presented evaluation uses the average |x − μ| for 4 approaches: LOCO-I predictor (red), a simple average (green), least squares parameters for the combined images (orange), and least squares parameters chosen individually for each image (blue) - the last one gives the lowest residues, so it is used further. Top right: bits/pixel for encoding the residues using a centered Laplace distribution with width (scale parameter) b modeled in various ways. Red: LOCO-I model with 365 parameters corresponding to the quantized context (D−B, B−C, C−A). Green: a single b chosen individually (MLE) for each image. Orange: the 4-parameter model discussed here, written at the bottom left; blue: the 11-parameter model discussed later. Bottom: differences of these values for the two models. The evaluation assumes accurate entropy coding (AC/ANS) and neglects headers - including them would especially worsen the LOCO-I evaluation if all 365 parameters were stored.

LOCO-I [1] mixes both philosophies: it uses parametric probability distributions whose scale parameter (width of the Laplace distribution) depends on a 3-dimensional context quantized into 365 possibilities treated independently - neglecting their dependencies. Such an approach is useful for low-dimensional contexts; however, it becomes impractical for higher dimensional contexts: e.g. using information from all 3 color channels, pixels further than the nearest neighbors, or some region classifiers to gradually transition between e.g. models for smooth regions like sky and complex textures like a treetop. Finally, contexts of much higher dimension appear in multi-scale interleaved scanning as in the FLIF [3] compressor, progressively improving quality; essentially only parametric models can directly work with such high dimensional contexts.

This article discusses such parametric-parametric models: choosing the parameters of e.g. a Laplace distribution as a parametric function of the context, like a linear combination, or more generally e.g. a neural network. An example are the ARMA-ARCH [4] models popular in economics: choosing the squared width of a Gaussian distribution as a linear combination of recent squared residues, e.g. σ_t² = α_0 + Σ_i α_i ε_{t−i}².

These parameters can be universal, e.g. defaults for various types of classified regions, or optimized individually by the compressor and stored in the header. For the latter purpose we will focus on least squares estimation due to its low cost. The presented test results use such estimation; a costly additional optimization might slightly improve performance.

While we will mostly focus on such static models, assuming a constant joint distribution of (value, context), the mentioned alternative are adaptive models, assuming a non-stationary time series with an evolving joint distribution. This requires an additional cost to update the parameters of the model, for example performing a likelihood optimization step while processing each value. It has two advantages: the model can be learned from already decoded data even without a header, and it can flexibly adapt to the local behavior e.g. of an image.

In the literature there are also considered much more costly models, like using neural networks to predict the probability distribution of succeeding pixels ([5, 6]). In the discussed philosophy, instead of directly predicting the probability of each discrete value, we can use such neural networks to predict context-dependent parameters of some parametric distribution for the new pixel. Such a simplification should allow much smaller neural networks, bringing this approach closer to practical application in data compression.

Figure 2: Dataset of 48 grayscale 8-bit 512x512 images used in the tests. Source: http://decsai.ugr.es/cvg/CG/base.htm

II Parametric-parametric distributions

We would like to model the conditional probability distribution of the new value x based on some local multi-dimensional context c, in practice bounded e.g. to a cube like [0, 255]^4 here. In the LOCO-I image compressor this context consists of the 4 neighboring already decoded pixels (A, B, C, D as in Fig. 1). Both the value and the context are in fact discrete through some quantization, but it is useful to model them as real values - especially when wanting to exploit the continuity of their behavior.

Modelling general continuous conditional distributions is a difficult task, requiring techniques like quantile regression [7] or hierarchical correlation reconstruction [8]. However, the situation becomes much simpler if focusing on simple parametric distributions for the predicted distribution. Another standard simplification is separately modelling the center of the distribution with a predictor μ(c), and the remaining parameter(s) of the centered distribution of the residue, usually a single scale parameter b defining the width:

ρ(x | c) = ρ_{b(c)}(x − μ(c))    (1)

We will mainly focus on the Laplace distribution, standard for such applications, and on modeling its width (scale) parameter b:

ρ_{μ,b}(x) = (1/(2b)) · exp(−|x − μ| / b)    (2)

whose MLE parameters for a sample (x_1, …, x_n) are:

μ = median(x_1, …, x_n),    b = (1/n) Σ_i |x_i − μ|    (3)
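
As a minimal sketch of (3) in Python/NumPy (our illustration, not code from any compressor):

```python
import numpy as np

def laplace_mle(x):
    """MLE of Laplace parameters (eq. 3): mu = sample median,
    b = mean absolute deviation from that median."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    b = np.mean(np.abs(x - mu))
    return mu, b

# toy usage on synthetic residues
rng = np.random.default_rng(0)
residues = rng.laplace(loc=0.0, scale=5.0, size=10_000)
mu_hat, b_hat = laplace_mle(residues)
print(mu_hat, b_hat)   # close to 0 and 5
```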

LOCO-I has a fixed specialized predictor. It then chooses the width parameter as locally constant inside 365 regions of the quantized context, obtained by dividing each of its 3 dimensions into 9 ranges of nearly equal population. This way the estimation can be performed independently for each region, and finally e.g. the 365 parameters are stored in the header.

Quantization of the context neglects dependencies between these regions and is practical only for low-dimensional contexts - both due to the number of possibilities growing exponentially with dimension, and due to underrepresentation of many such contexts. To resolve this, we will focus here on parameterized models for these parameters.

Choosing the family of functions optimized for a given type of problem is a difficult question. Like ARCH, and unlike LOCO-I, we will focus on linear combinations of some chosen functions f_j, g_j of the context:

μ_α(c) = Σ_j α_j f_j(c)    (4)
b_β(c) = Σ_j β_j g_j(c)    (5)

The latter might need an additional safeguard, e.g. taking a maximum with some small positive constant, if positive values are required and some of the terms are negative. We can alternatively use more sophisticated models like neural networks.

II-A Context dependence

Choosing some family of functions, we can optimize the parameters (α, β) (or e.g. neural network parameters) for given values and contexts (x_i, c_i), for example by maximizing the likelihood (MLE):

(α, β) = argmax_{α,β} Σ_i ln ρ_{b_β(c_i)}(x_i − μ_α(c_i))    (6)

To simplify this optimization at the cost of suboptimality, we can split it into the predictor and the remaining scale model, as in Fig. 1.

This way we can first optimize the parameters α of the predictor, e.g. using some distance d:

α = argmin_α Σ_i d(x_i, μ_α(c_i))    (7)

For example, using the least squares distance d(x, μ) = (x − μ)² we are looking for a predictor of the expected value - appropriate e.g. for the Gaussian distribution (or for the polynomial coefficients in [8]). For the Laplace distribution it is more appropriate to use d(x, μ) = |x − μ|, giving a predictor of the median. However, except in heavy-tail cases, both optimizations give nearly the same predictor, so it is safe to use least squares optimization, which is computationally less expensive.

Having optimized the predictor, we can calculate the residues ε_i = x_i − μ(c_i) and separately optimize the scale parameters β using them. Especially for a scale parameter, the MLE estimator is often an average of some simple function of the values: for example b = average of |ε| for the Laplace distribution, the average of ε² for the Gaussian distribution, or generally involving the average of |ε|^κ for the exponential power distribution. An average is an estimator of an expected value, which allows for practical optimization of b_β using least squares (analogously e.g. for neural networks):

β = argmin_β Σ_i (|x_i − μ(c_i)| − b_β(c_i))²    (8)
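
A minimal sketch of (8), assuming a feature matrix of chosen context functions g_j and fitting the absolute residues with ordinary least squares (names and the clamp value are ours):

```python
import numpy as np

def fit_width_model(abs_residues, features):
    """Least squares fit of b(c) = sum_j beta_j * g_j(c) against |residues| (eq. 8);
    'features' is an n x k matrix with g_j(c_i) in column j (including a constant column)."""
    beta, *_ = np.linalg.lstsq(features, abs_residues, rcond=None)
    return beta

def predict_width(features, beta, b_min=1e-3):
    """Predicted Laplace widths, clamped from below so they stay positive
    (b_min is our placeholder safeguard)."""
    return np.maximum(features @ beta, b_min)
```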

Such parameters can be optimized for a dataset, for example for different regions using some segmentation, and then used as defaults. Alternatively, the compressor can optimize them individually, e.g. for a given image, and store the parameters in the header.

II-B Adaptivity

Instead of storing model parameters in the header, an alternative approach is to start from some default parameters and adapt them based on the processed data, also for better agreement with varying local statistics e.g. of an image. Such adaptation brings additional cost; dependence on the local situation can alternatively be realized by using some region classifier/segmentation with separate models for each class, or by using the outcome of such a local classifier as additional context - choosing the best tradeoff is a difficult question.

For adaptation we can treat the index as time t and use time-dependent parameters, starting from some e.g. default initial choice. For example, without context dependence, we could just replace the average with an exponential moving average: for the Laplace distribution and some learning rate η, b_{t+1} = (1 − η) b_t + η |x_t − μ_t|.

Generally, we can for example use a gradient descent step while processing each value, optimizing the parameters toward the local statistics: for the combined case using (6), or in split form:

α_{t+1} = α_t − η_α ∇_α d(x_t, μ_{α_t}(c_t)),    β_{t+1} = β_t − η_β ∇_β (|x_t − μ(c_t)| − b_{β_t}(c_t))²    (9)

where d is a distance as previously. For the combined case (6) the analogous gradient step (as ascent) optimizes the likelihood; the steps η define the adaptation rate.

Choosing the steps η is a difficult question: for example, they can be fixed after optimization on a dataset, or the compressor can test a few choices for a given e.g. image and finally use and store the best one found.
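
A minimal sketch of the simplest adaptive variant above: an exponential moving average of |residue| as the Laplace width; the learning rate and initial width below are placeholder assumptions:

```python
import numpy as np

def adaptive_laplace_widths(residues, eta=0.05, b0=8.0, b_min=1e-3):
    """Exponential moving average of |residue| as an adaptive Laplace width:
    b_{t+1} = (1 - eta) * b_t + eta * |residue_t|.
    Returns the width used at each position; the decoder can reproduce it,
    since it depends only on already decoded values."""
    b = b0
    widths = np.empty(len(residues))
    for t, r in enumerate(residues):
        widths[t] = b                               # width used to encode residue_t
        b = max((1.0 - eta) * b + eta * abs(r), b_min)
    return widths
```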

II-C Exponential power distribution

Data compression usually focuses on the Laplace distribution, but real data might have somewhat different statistics, especially heavier tails. It might be worth considering more general families, especially the exponential power distribution [9]:

ρ_{μ,σ,κ}(x) = κ / (2σ Γ(1/κ)) · exp(−(|x − μ| / σ)^κ)    (10)

It covers both the Laplace (κ = 1) and the Gaussian (κ = 2) distributions. Estimating κ is costly, but we can fix it based on a large dataset and e.g. segment type. Then the estimation of σ for a fixed κ is analogous, also for context dependence as in (8).

We can prepare entropy coding tables for such a fixed κ and some optimized discretized set of scale parameters σ.
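
As a small worked example under the density form (10), the MLE of the scale for a fixed shape κ reduces to a power mean of absolute residues (a sketch with our function name; κ = 1 recovers the Laplace estimate (3)):

```python
import numpy as np

def epd_scale_mle(residues, kappa):
    """MLE of the scale sigma of a centered exponential power distribution
    rho(x) ~ exp(-(|x|/sigma)^kappa) for a fixed shape kappa:
    sigma^kappa = (kappa / n) * sum_i |x_i|^kappa.
    kappa = 1 gives the Laplace width, kappa = 2 the Gaussian case."""
    r = np.abs(np.asarray(residues, dtype=float))
    return (kappa * np.mean(r ** kappa)) ** (1.0 / kappa)
```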

III Practical Laplace example and experiments

Let us now focus on the LOCO-I lossless image compression setting: the context consists of the 4 already decoded neighboring pixels A, B, C, D at correspondingly the (left, up, left-up, right-up) positions, as in the diagram in Fig. 1.

III-A Predictor

LOCO-I uses a fixed predictor:

μ = min(A, B)   if C ≥ max(A, B),
μ = max(A, B)   if C ≤ min(A, B),
μ = A + B − C   otherwise.    (11)

Simpler popular choices use e.g. a single neighbor or an average of neighbors. A standard way of designing such predictors is polynomial interpolation, as e.g. in the Lorenzo predictor [10]: fitting some polynomial to the known values and calculating its value at the predicted position, which yields a linear combination.

We can also directly optimize it for a dataset: for example, least squares optimization using the combined 48 images (Fig. 2) gives weights (rounded to 2 digits) which sum to 1.

Alternatively, the compressor can optimize these weights individually for each image and store them in the header - Fig. 1 contains a comparison of the various approaches using the |x − μ| distance, as we would like to estimate the median for the Laplace distribution. Such individual least squares optimization turns out to be superior in every case there (blue points); the LOCO-I predictor is for some images much worse than the remaining approaches.

The tested inexpensive least squares optimizer uses the 4 neighbor values (A, B, C, D) directly as the functions. We build the matrix M from them, M_i = (A_i, B_i, C_i, D_i), and the vector x of the corresponding pixel values. Then the optimal parameters are obtained using the pseudo-inverse:

α = (MᵀM)⁻¹ Mᵀ x    (12)
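
A minimal NumPy sketch of this estimation for one image (variable names are ours); np.linalg.lstsq solves the same normal equations as the pseudo-inverse in (12) in a numerically safer way:

```python
import numpy as np

def fit_predictor(pixels):
    """Per-image least squares predictor weights for mu = w_A*A + w_B*B + w_C*C + w_D*D,
    where (A, B, C, D) are the left, up, up-left and up-right neighbors."""
    img = np.asarray(pixels, dtype=float)
    x = img[1:, 1:-1].ravel()                      # predicted pixels (borders skipped)
    a = img[1:, :-2].ravel()                       # A: left
    b = img[:-1, 1:-1].ravel()                     # B: up
    c = img[:-1, :-2].ravel()                      # C: up-left
    d = img[:-1, 2:].ravel()                       # D: up-right
    M = np.stack([a, b, c, d], axis=1)
    w, *_ = np.linalg.lstsq(M, x, rcond=None)      # (M^T M)^-1 M^T x
    return w                                        # weights roughly summing to 1
```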

For the further tests, the residues from the individual least squares optimization for each image were used: ε_i = x_i − μ(c_i).

III-B Context-dependent scale parameter

Having the residues, LOCO-I would divide each of the 3 context dimensions (D−B, B−C, C−A) into 9 ranges of nearly equal population. Including symmetry, this leads to a division into (9³ + 1)/2 = 365 contexts. For each of them the scale parameter of the Laplace distribution is estimated independently.

Here we would like to model b as a linear combination (5) of some functions of the context. The choice of these functions is difficult and essentially affects compression ratios. They should contain "1" for the intercept term. Then, in analogy to LOCO-I, the considered 4-parameter model uses the following linear combination (for convenience enumerated from 0):

b = β_0 + β_1 |D − B|^p + β_2 |B − C|^p + β_3 |C − A|^p    (13)

Figure 3: Top: probability density of the width parameters b for all images, for LOCO-I and the discussed 4-parameter model, assuming the models are estimated and stored individually for each image. The three most characteristic images are marked by their numbers. Bottom left: such densities when combining all images into one - while the huge number of LOCO-I parameters can usually fit individual images better than the 4-parameter model, it generalizes worse: it is inferior when combining different types of patterns. Bottom right: penalty of using power-of-2 Golomb coding for various parameters. We can get an improvement by switching to arithmetic coding or asymmetric numeral systems; however, especially for LOCO-I this would require larger headers due to the needed better precision of b.

There is a freedom in choosing the above power p; empirically, the chosen power turned out to provide the best likelihood/compression ratio, corresponding well to the linear behavior of b. With this choice, all the coefficients turn out positive in experiments - we have some initial width, growing with increased gradients of the neighboring pixels. Hence there is no possibility of getting a negative b this way, which would make no sense.

Having chosen e.g. such functions, we build the matrix M from them, M_{ij} = f_j(c_i), and the residue vector r, r_i = |x_i − μ(c_i)|. Then we can use least squares optimization:

β = (MᵀM)⁻¹ Mᵀ r    (14)
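
A sketch of building M for such a model and solving (14); the feature functions here follow the LOCO-I gradients as an assumed reading of (13), with the exponent p left as a tunable placeholder:

```python
import numpy as np

def width_features(left, up, up_left, up_right, p=0.8):
    """Feature matrix for the assumed 4-parameter width model (13):
    b = beta_0 + beta_1*|D-B|^p + beta_2*|B-C|^p + beta_3*|C-A|^p,
    with A=left, B=up, C=up-left, D=up-right and a placeholder exponent p."""
    a, b, c, d = (np.asarray(v, dtype=float).ravel()
                  for v in (left, up, up_left, up_right))
    return np.stack([np.ones_like(a),
                     np.abs(d - b) ** p,
                     np.abs(b - c) ** p,
                     np.abs(c - a) ** p], axis=1)

def fit_width(abs_residues, M):
    """Least squares estimate (14): beta = (M^T M)^-1 M^T r with r_i = |x_i - mu(c_i)|."""
    beta, *_ = np.linalg.lstsq(M, np.asarray(abs_residues, dtype=float).ravel(),
                               rcond=None)
    return beta
```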

Figure 3 contains a comparison of the densities of the predicted scale parameters for individual images (top) for the LOCO-I approach and the above 4-parameter model - the latter is smoother, as we could expect, but generally they behave similarly. The bottom left of this figure contains the comparison for all images combined, with compression ratios showing the better generalization of these low-parameter models.

The second considered model, with 11 parameters, extends the above basis by the following 7 arbitrarily chosen functions: symmetric ones describing the intensity of the neighboring pixels, and ones evaluating the second derivative:

(15)

where again the powers were chosen empirically to get the best likelihood/compression ratio. In contrast to the 4-parameter model, this time we also get negative coefficients, which can lead to a negative predicted value. To prevent that, the predicted value was clamped from below before being finally used as the width of the Laplace distribution.

The used functions were chosen arbitrarily by manual optimization; some wider systematic search should improve performance. For example, in practical implementations the above power functions would rather be put into tables, which allows the use of much more complex functions, e.g. given by stored values on some quantized set of arguments. This would allow careful optimization of such tabled functions based on a large set of images.

III-C Entropy coding, penalty of Golomb coding

The Laplace distribution is continuous; to encode values from it we need to quantize it to an approximately geometric distribution, whose values are transformed into bits using some entropy coding.

LOCO-I uses power-of-2 Golomb coding: instead of a real scale parameter, it optimizes an integer parameter k; then for an unsigned value v, ⌊v/2^k⌋ is stored using unary coding, and v mod 2^k is stored directly as k bits. This way it requires ⌊v/2^k⌋ + 1 + k bits to store an unsigned v. Signed values are stored as their position in an interleaved order of the form 0, −1, 1, −2, 2, …
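
For concreteness, a small sketch of the resulting code length and a common signed-to-unsigned mapping (our helpers, not LOCO-I's exact implementation):

```python
def zigzag(e):
    """Map a signed residue to an unsigned index in the order 0, -1, 1, -2, 2, ..."""
    return -2 * e - 1 if e < 0 else 2 * e

def rice_code_length(v, k):
    """Bits used by power-of-2 Golomb coding with parameter k for unsigned v:
    unary part (v >> k) plus 1 terminator bit, plus k raw bits for v mod 2^k."""
    return (v >> k) + 1 + k

# example: residue -3 with k = 2 costs (5 >> 2) + 1 + 2 = 4 bits
print(rice_code_length(zigzag(-3), 2))
```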

Ideally, a symbol of probability p carries lg(1/p) bits of information, leading asymptotically to the Shannon entropy in bits/symbol. Power-of-two Golomb coding with the optimal parameter is worse by a few percent for the b values used here, as shown in Fig. 3. One reason is the sparse quantization of the parameter. More important, especially for small b, is that most of the probability goes to a single quantized value, which can correspond to less than 1 bit of informational content; in contrast, prefix codes like Golomb need to use at least 1 bit per symbol.

Replacing power-of-2 Golomb coding with an accurate entropy coder like arithmetic coding (AC) or asymmetric numeral systems (ANS), we can improve the compression ratio by a few percent. In this case we also need some quantization of the b parameter - we can have prepared entropy coding tables for some discretized space of possible parameters.

Figure 4: Top: conventional multi-scale interleaved scanning [5] (e.g. the FLIF compressor [3]): scanning over succeeding sub-lattices for progressive decoding, and most importantly to provide a better local context for later decoded pixels. Bottom: proposed Haar upsample scanning, which combines advantages of Haar wavelets [11] with exploitation of local context dependence. First (scan 0) we decode a low resolution image: averages over blocks, using the decoded neighboring block averages as the context. Then in each cycle (scans 1, 2, 3) we decode the 3 missing values (for grayscale, 9 for RGB) to double the resolution: e.g. horizontal differences in scan 1, then vertical differences in two positions in scans 2 and 3. After all such cycles we reach single-pixel blocks - the completely decoded image. The context of already decoded local information is high dimensional and of a different type for each scan and level. While this is a problem for LOCO-I-like context quantization, parametric models can easily handle it, for example using a predictor μ_s(c), where s denotes the type of scan - its parameters can be inexpensively e.g. MSE optimized and stored in the header. Some modification options are e.g. splitting values into higher and lower bits for separate scans [6], or using fractal-like (tame twindragon) blocks by modifying the translation vectors of a hexagonal block lattice [12].

III-D Multi-scale interleaving

In standard line-by-line scanning we have context only from half of the plane, only guessing what will happen on the not-yet-decoded side. It can be improved with multi-scale interleaving, showing gains e.g. in the FLIF [3] compressor: thanks to progressive decoding in multiple scans, we can use a lower resolution context from all directions, as visualized in Fig. 4.

However, we can see that the context information becomes much more complex here: high dimensional and varying with the scan number. Even after reducing it by some arbitrary averaging, it is still rather too large for context quantization approaches like in LOCO-I. The parametric approaches discussed here have no problem with the direct use of such high dimensional contexts, modelling parameters as e.g. a linear combination of a chosen family of functions, with parameters found e.g. by inexpensive least squares optimization and stored in the header. Alternatively, more complex models like neural networks can be used.

This figure also proposes a combination with Haar wavelets for hopefully improved performance - splitting decoding into cycles, each doubling the resolution and composed of a few scans, e.g. 3 for grayscale or 9 for 3 colors - each providing a single degree of freedom per block for the upscaling. Such a decomposition into e.g. 9 scans clearly leaves opportunities for optimization, starting with the choice of color transformation.

Assuming some scale invariance of images, similar models can be used for different cycles here; for example, we can treat the cycle number (defining scale) as an additional context parameter.

IV Conclusion and further work

Parametric models allow one to successfully exploit trends in behavior, also for context dependence and evolution of parametric distributions. Thanks to generalization, a few-parameter model can provide better performance than treating all possibilities as independent and neglecting the dependencies between them. When wanting to exploit higher dimensional contexts, e.g. for 3 colors, further pixels, region classifiers or multi-scale scanning, parametric models become a necessity, as the number of discretized possibilities would grow exponentially with dimension.

Very basic possibilities were presented and tested here, leaving many improvement opportunities, starting with the choice of contexts and functions, or using other parametric distributions like the exponential power distribution. The used least squares optimization is inexpensive enough to be applied by the compressor to individually optimize parameters for each image. Having chosen some general default parameters, we can also use better optimizers, e.g. minimizing absolute error for the Laplace median, or generally MLE.

Lossy image compressors face a different situation: they encode e.g. DCT transform coefficients, for which distribution parameters should also be chosen based on position - this should be included as a part of the context with some properly chosen functions.

As we can see in Fig. 3, there is a large spread in the behavior of the parameters; using individual models for separate images often gives an improvement. This suggests trying to segment the image into regions of similar behavior, or using a region classifier. Having such a segmentation mechanism optimized for a large dataset, with separate models for each segment, the segments could define default behavior, avoiding the need for separate model estimation and storage. It would be valuable to optimize such segmentation based on the used family of models. An alternative approach is using classifiers and treating their evaluation as part of the context, which would additionally allow continuous interpolation between classes.

Finally, while for low-cost reasons we focused on linear models for the parameters, better compression ratios at a larger computational cost should be achievable with more general models like neural networks. In the literature they are considered for directly predicting the probability distribution of discrete pixel values ([5, 6]). We could reduce the computational cost if, based on the context, we predict only the parameters of a parametric distribution instead, finally discretizing that distribution. For example a Laplace distribution for unimodal distributions, e.g. training the neural network to minimize the absolute error of the predicted center together with the error of the predicted width against the absolute residues. For multimodal distributions we can for example parameterize the density with a polynomial, and train to minimize the sum of squares of differences of the coefficients of orthonormal polynomials, as in [8].
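
As one natural training objective under this philosophy (a sketch under our assumptions; the exact pair of losses used is not recoverable from this text), the per-value Laplace negative log-likelihood with a predicted center and log-width:

```python
import numpy as np

def laplace_nll(x, mu_pred, log_b_pred):
    """Per-value negative log-likelihood of a Laplace distribution with predicted
    center mu_pred and predicted log-width log_b_pred (the log keeps the width positive).
    Minimizing the mean of this loss trains a model to output distribution
    parameters rather than per-value probabilities."""
    b = np.exp(log_b_pred)
    return np.log(2.0 * b) + np.abs(x - mu_pred) / b
```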

References