Many types of data statistically agree with specific parametric distributions: like the Gaussian distribution through the central limit theorem, or the Laplace distribution, popular in data compression as it agrees with the statistics of prediction errors (residues). Their parameters can often be inexpensively estimated, and storing them in a header is much less expensive than e.g. an entire probability distribution on some quantized set of represented values. Parametric distributions smooth between discretized possibilities, generalizing statistical trends emerging in a given type of data.
However, for example due to randomness alone, the statistics of real data usually deviate somewhat from such an idealization. Directly storing counted frequencies can exploit this difference, saving asymptotically the Kullback-Leibler divergence in bits/value - at the cost of a larger header. Data compressors need to optimize this minimum description length [2] tradeoff between model size and the entropy it leads to.
In practice, instead of a single Laplace distribution for encoding residues (prediction errors) of the entire image, we would like to make its parameters depend on the local situation - through context dependence as in Markov modelling, or adaptivity as for non-stationary time series.
The possibility of directly storing all values fades away when increasing the dimension of the model - both due to size growing exponentially with dimension and due to underrepresentation of many contexts. Going to higher dimensions requires finding and exploiting some general behaviour, for example through parametrizations, as in the examples presented in Fig. 1.
LOCO-I [1] mixes both philosophies: it uses parametric probability distributions, whose scale parameter (width of the Laplace distribution) depends on a 3-dimensional context quantized into 365 possibilities treated independently - neglecting dependencies between them. Such an approach is useful for low dimensional contexts; however, it becomes impractical for higher dimensional contexts: like using information from all 3 color channels, from pixels further than the nearest neighbors, or from some region classifiers to gradually transition between models, e.g. for smooth regions like sky and complex textures like a treetop. Finally, contexts of much higher dimension appear in multiscale interlaced scanning as in the FLIF [3] compressor, which progressively improves quality: essentially only parametric models can directly work on its high dimensional contexts.
This article discusses such parametric-parametric models: choosing parameters of e.g. a Laplace distribution as a parametric function of the context - through a linear combination, or more generally e.g. neural networks. An example are the ARMA-ARCH models popular in economics [4]: choosing the squared width of a Gaussian distribution as a linear combination of recent squared residues, e.g. σ_t² = α₀ + Σ_{i=1..q} α_i ε²_{t−i}.
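As a minimal numerical sketch of this ARCH idea (the coefficients and residues below are illustrative, not fitted):

```python
import numpy as np

# ARCH-style width prediction: squared width of a Gaussian as a
# linear combination of the two most recent squared residues.
alpha0, alpha1, alpha2 = 0.1, 0.5, 0.3      # hypothetical coefficients
residues = np.array([0.2, -0.5, 0.1, 0.8, -0.3])

# sigma_t^2 = alpha0 + alpha1 * eps_{t-1}^2 + alpha2 * eps_{t-2}^2
sigma2 = [alpha0 + alpha1 * residues[t - 1]**2 + alpha2 * residues[t - 2]**2
          for t in range(2, len(residues))]
```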
These parameters can be universal, e.g. defaults for various types of classified regions, or optimized individually by the compressor and stored in the header. For the latter purpose we will focus on least squares estimation due to its low cost. The presented test results use such estimation; a costly additional optimization might slightly improve performance.
While we will mostly focus on such static models, assuming a constant joint distribution of (value, context), the mentioned alternative are adaptive models, assuming a non-stationary time series with evolving joint distribution. This requires additional cost to update the parameters of the model, for example performing a likelihood optimization step while processing each value. It has two advantages: the model can be learned from already decoded data even without a header, and it can flexibly adapt to local behavior e.g. of an image.
The literature also considers much more costly models, like using neural networks to predict the probability distribution of succeeding pixels [5, 6]. In the discussed philosophy, instead of directly predicting the probability of each discrete value, we can use such neural networks to predict context dependent parameters of some parametric distribution for the new pixel. Such a simplification should allow the use of much smaller neural networks, bringing them closer to practical application in data compression.
II Parametric-parametric distributions
We would like to model the conditional probability distribution Pr(x|c) of the new value x, based on some local d-dimensional context c, in practice bounded e.g. to a cube. In the LOCO-I image compressor this context consists of the 4 neighboring already decoded pixels (as in Fig. 1). Both value and context are in fact discrete through some quantization, but it is useful to model them as real values - especially when wanting to exploit the continuity of their behavior.
Modelling general continuous conditional distributions is a difficult task - it requires techniques like quantile regression [7] or hierarchical correlation reconstruction [8]. However, the situation becomes much simpler if we focus on simple parametric distributions for the predicted distribution. Another standard simplification is to separately model the center of the distribution with a predictor μ = μ(c), and the remaining parameter(s) of the centered distribution for the residue, usually a single scale parameter defining width:

Pr(x|c) = ρ_{b(c)}(x − μ(c)).    (1)
We will mainly focus on the Laplace distribution, standard for such applications, and model its width parameter b:

ρ_{μ,b}(x) = exp(−|x − μ|/b) / (2b)    (2)

whose MLE parameters for a sample (x_i)_{i=1..n} are:

μ = median(x_1, …, x_n),    b = (1/n) Σ_i |x_i − μ|.    (3)
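These Laplace MLE formulas can be computed directly, for instance:

```python
import numpy as np

# Laplace MLE: mu is the sample median, b the mean absolute deviation from it.
x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
mu = np.median(x)                 # sample median
b = np.mean(np.abs(x - mu))       # (1+0+0+1+8)/5
```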
LOCO-I has a fixed specialized predictor. It then chooses the width parameter as locally constant inside 365 regions of quantized context: each of the three context dimensions is quantized into 9 ranges of nearly equal population. This way estimation can be performed independently for each region, and finally e.g. the 365 parameters are stored in the header.
Quantization of context neglects dependencies between these regions and is practical essentially only for low dimensional contexts - both due to the number of possibilities growing exponentially with dimension and due to underrepresentation of many such contexts. To resolve this, we will focus here on parametrized models for these parameters:

b = b(c)    (4)

for some parametric function b of the context.
Choosing b(c) and the family of functions optimized for a given type of problems is a difficult question. Like ARCH, and unlike LOCO-I, we will focus on using linear combinations of some chosen functions f_j:

b(c) = Σ_j β_j f_j(c).    (5)
The latter might need an additional safeguard, e.g. taking absolute value or clamping, if positive values are required and some of the terms are negative. We can alternatively use more sophisticated models like neural networks.
II-A Context dependence
Having chosen some ρ and family of functions f_j, we can optimize β (or e.g. neural network parameters) for given values x_i and contexts c_i, for example by maximizing likelihood (MLE):

argmax_β Σ_i ln ρ_{μ(c_i), b(c_i)}(x_i).    (6)
To simplify this optimization at the cost of suboptimality, we can split it into the predictor μ(c) and the remaining scale parameter(s), as in Fig. 1.
This way we can first optimize the parameters of the predictor, e.g. using some distance d:

argmin Σ_i d(x_i, μ(c_i))    (7)
for example, using the least squares distance d(x, y) = (x − y)² we are looking for a predictor of the expected value - appropriate e.g. for the Gaussian distribution (or polynomial coefficients in [8]). For the Laplace distribution it is more appropriate to use d(x, y) = |x − y| for a predictor of the median. However, unless the tails are heavy, both optimizations give nearly the same predictor, so it is safe to use least squares optimization, which is computationally less expensive.
Having optimized the predictor, we can calculate residues r_i = x_i − μ(c_i) and separately optimize the remaining parameters using them. Especially for the scale parameter, the MLE estimator is often an average over some simple function of the values: for example b = average of |r| for the Laplace distribution, σ² = average of r² for the Gaussian distribution, or generally σ^κ = κ · (average of |r|^κ) for the exponential power distribution. An average is an estimator of expected value, which allows for practical optimization of the parameters using least squares (analogously e.g. for neural networks):

argmin_β Σ_i (|x_i − μ(c_i)| − b(c_i))².    (8)
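This scale estimation can be sketched on synthetic data: assuming a single context feature g and a true width 0.5 + 2g (all names and values here are illustrative), least squares on the absolute residues recovers the coefficients:

```python
import numpy as np

# Least squares fit of a context-dependent Laplace width b(c) = beta . f(c),
# using that b equals the expected |residue|.
rng = np.random.default_rng(0)
n = 10000
g = rng.uniform(0, 1, n)                 # a single context feature, e.g. a local gradient
b_true = 0.5 + 2.0 * g                   # assumed "true" context-dependent width
r = rng.laplace(0.0, b_true)             # residues with that width

F = np.column_stack([np.ones(n), g])     # basis functions f_0 = 1, f_1 = g
beta, *_ = np.linalg.lstsq(F, np.abs(r), rcond=None)
# beta should approximate (0.5, 2.0)
```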
Such parameters can be optimized for a dataset, for example for different regions using some segmentation, and then used as defaults. Alternatively, the compressor can optimize them individually, e.g. for a given image, and store the parameters in the header.
II-B Adaptivity
Instead of storing model parameters in the header, an alternative approach is starting from some default parameters and adapting them based on the processed data, also achieving better agreement with varying local statistics, e.g. of an image. Such adaptation brings additional cost; dependence on the local situation can alternatively be realized by using some region classifier/segmentation with separate models for each class, or by using the outcome of such a local classifier as additional context - choosing the best tradeoff is a difficult question.
For adaptation we can treat the upper index as time and use time dependent parameters, starting from some e.g. default initial choice. For example, without context dependence, we could just replace the average with an exponential moving average for the Laplace distribution and some learning rate η:

b_{t+1} = (1 − η) b_t + η |x_t − μ|.
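A sketch of such exponential-moving-average adaptation (the initial width b0 and learning rate eta below are illustrative defaults):

```python
# Adaptive Laplace width: exponential moving average of |residue|,
# started from a default b0, moving toward the local mean |r|.
def adapt_width(residues, b0=1.0, eta=0.05):
    b = b0
    widths = []
    for r in residues:
        b = (1 - eta) * b + eta * abs(r)
        widths.append(b)
    return widths

ws = adapt_width([0.0, 2.0, -2.0, 4.0])
```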
Generally, we can for example use gradient descent while processing each value to optimize the parameters toward the local statistics: for the combined optimization using (6),

β_{t+1} = β_t + η ∇_β ln ρ_{μ(c_t), b(c_t)}(x_t)

or in the split form:

θ_{t+1} = θ_t − η_μ ∇_θ d(x_t, μ_θ(c_t)),    β_{t+1} = β_t − η_b ∇_β (|x_t − μ(c_t)| − b_β(c_t))²

where d is the distance as previously. For the former, gradient ascent optimizes likelihood; the η define the adaptation rates.
Choosing the steps η is a difficult question: for example, they can be fixed and optimized on a dataset, or the compressor can test a few choices for a given image and finally use and store the best one found.
II-C Exponential power distribution
Data compression usually focuses on the Laplace distribution, but real data might have somewhat different statistics, especially heavier tails. It might be worth considering more general families, especially the exponential power distribution [9]:

ρ_{σ,κ}(x) = κ / (2σ Γ(1/κ)) · exp(−|x/σ|^κ).
It covers both the Laplace (κ = 1) and Gaussian (κ = 2) distributions. Estimating κ is costly, but we can fix it based on a large dataset and e.g. segment type. Then estimation of σ is analogous, using σ^κ = κ · (average of |x − μ|^κ), also for context dependence as in (8).
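A sketch of this fixed-κ estimation, using the MLE σ^κ = κ · mean(|r|^κ) (note that with this parametrization, at κ = 2 the resulting σ is √2 times the usual Gaussian standard deviation):

```python
import numpy as np

# MLE of the exponential power scale for a fixed shape kappa.
# For kappa = 1 this reduces to the Laplace estimator b = mean(|r|).
def epd_scale(r, kappa):
    return (kappa * np.mean(np.abs(r) ** kappa)) ** (1.0 / kappa)

r = np.array([1.0, -2.0, 0.5, -0.5])
b_laplace = epd_scale(r, 1.0)     # mean(|r|)
sigma_gauss = epd_scale(r, 2.0)   # sqrt(2 * mean(r^2))
```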
We can prepare entropy coding tables for such a fixed κ and some optimized discretized set of scale parameters σ.
III Practical Laplace example and experiments
Let us now focus on the LOCO-I lossless image compression setting: the context consists of the 4 already decoded neighboring pixels A, B, C, D at correspondingly (left, up, left-up, right-up) positions, as in the diagram in Fig. 1.
III-A Predictor
LOCO-I uses a fixed predictor μ (for left A, up B, left-up C neighbors):

μ = min(A, B) if C ≥ max(A, B);    max(A, B) if C ≤ min(A, B);    A + B − C otherwise.
Simpler popular choices are e.g. μ = A or μ = (A + B)/2. A standard way of designing such predictors is polynomial interpolation, e.g. in the Lorenzo predictor [10]: fitting some polynomial to the known values and calculating its value at the predicted position, getting a linear combination such as μ = A + B − C.
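The fixed LOCO-I predictor above can be sketched as:

```python
# LOCO-I median (MED) predictor from left (A), up (B), left-up (C) neighbors:
# clamps the plane prediction A + B - C to the range [min(A,B), max(A,B)].
def med_predictor(A, B, C):
    if C >= max(A, B):
        return min(A, B)
    if C <= min(A, B):
        return max(A, B)
    return A + B - C

p1 = med_predictor(10, 20, 5)    # C below both neighbors -> max(A, B)
p2 = med_predictor(10, 20, 15)   # C in between -> A + B - C
```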
We can also directly optimize the predictor for a dataset. For example, least squares optimization using the combined 48 images (Fig. 2) gives a linear combination of the four neighbors (rounded to 2 digits, with weights summing to 1).
Alternatively, the compressor can optimize these weights individually for each image and store them in the header - Fig. 1 contains a comparison of various approaches using the |·| distance, as we would like to estimate the median for the Laplace distribution. Such individual least squares optimization turns out to be superior in all cases there (blue points); for some images the LOCO-I predictor is much worse than the remaining approaches.
For further tests, residues of individual least squares optimization for each image were used: r = x − μ(c).
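Such least squares optimization of predictor weights can be sketched on synthetic (context, value) pairs; the weights and noise scale here are illustrative, not the values from Fig. 2:

```python
import numpy as np

# Least squares fit of linear predictor weights over 4-neighbor contexts.
rng = np.random.default_rng(1)
ctx = rng.uniform(0, 255, size=(5000, 4))        # columns: A, B, C, D neighbors
w_true = np.array([0.45, 0.45, -0.15, 0.25])     # hypothetical weights summing to 1
x = ctx @ w_true + rng.laplace(0, 2.0, 5000)     # values = prediction + Laplace noise

w, *_ = np.linalg.lstsq(ctx, x, rcond=None)      # recovered weights ~ w_true
residues = x - ctx @ w                           # residues for further modelling
```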
III-B Context dependent scale parameter
Having the residues, LOCO-I would divide each of the three local gradients (D−B, B−C, C−A) into 9 ranges of nearly equal population. Including symmetry, it leads to division into (9³ + 1)/2 = 365 contexts. For each of them we independently estimate the scale parameter b of the Laplace distribution.
Here we would like to model b as a linear combination (5) of some functions of the context. The choice of these functions is difficult and essentially affects compression ratios. They should contain "1" for the intercept term. Then, in analogy to LOCO-I, the considered 4 parameter model uses the following linear combination (for convenience enumerated from 0):

b = β₀ + β₁ |D − B|^p + β₂ |B − C|^p + β₃ |C − A|^p.
There is a freedom in choosing the above power p; the value used here was chosen empirically to provide the best likelihood/compression ratio - agreeing with the observed nearly linear behavior. With this choice, all the coefficients turn out positive in experiments - we have some initial width β₀, growing with increased gradients in the neighboring pixels. Hence there is no possibility of getting a negative b this way, which would make no sense.
Having chosen such m functions (f_j), we build from them the matrix M_{ij} = f_j(c_i) and the vector of absolute residues r = (|x_i − μ(c_i)|)_i. Then we can use least squares optimization:

β = argmin_β ‖Mβ − r‖²,    i.e.    β = (MᵀM)⁻¹ Mᵀ r.
Figure 3 contains a comparison of the density of predicted scale parameters for individual images (top) for the LOCO-I approach and the above 4 parameter model - the latter is smoother, as we could expect, but generally they have similar behavior. The bottom left of this figure contains a comparison combining all images, with compression ratios showing better generalization of these low parameter models.
The second considered, 11 parameter model extends the above basis by 7 further arbitrarily chosen functions: symmetric ones describing the intensity of the neighboring pixels, and ones evaluating the second derivative.
Here again the powers were chosen empirically to get the best likelihood/compression ratio. In contrast to the 4 parameter model, this time we also get negative coefficients, which can lead to a negative predicted b. To prevent that, the prediction was finally clamped to positive values before being used as the width of the Laplace distribution.
The used functions were chosen by manual optimization; some wider systematic search should improve performance. For example, in practical implementations the above power functions would rather be put into tables, which allows the use of much more complex functions, e.g. given by stored values on some quantized set of arguments. This would allow careful optimization of such tabled functions based on a large set of images.
III-C Entropy coding, penalty of Golomb coding
The Laplace distribution is continuous; to encode values from it we need to quantize it to an approximately geometric distribution, whose values are transformed into bits using some entropy coding.
LOCO-I uses power-of-2 Golomb coding: instead of a real scale parameter, it optimizes an integer parameter k; then an unsigned value n is stored as ⌊n/2^k⌋ using unary coding, followed by the remaining k bits of n stored directly. This way it requires ⌊n/2^k⌋ + 1 + k bits to store unsigned n. Signed values are stored as their position in the order 0, −1, 1, −2, 2, …
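A minimal sketch of this coding scheme, with the sign mapping following the interleaved order above:

```python
# Power-of-2 Golomb (Rice) code length and interleaved sign mapping.
def zigzag(e):
    # signed -> unsigned in the order 0, -1, 1, -2, 2, ...
    return 2 * e if e >= 0 else -2 * e - 1

def rice_bits(n, k):
    # unary quotient (+1 stop bit) followed by k raw remainder bits
    return (n >> k) + 1 + k

bits = rice_bits(zigzag(-3), k=2)   # zigzag(-3) = 5; 5 >> 2 = 1; 1 + 1 + 2 bits
```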
Ideally, a symbol of probability p carries lg(1/p) bits of information, leading asymptotically to the Shannon entropy Σ_i p_i lg(1/p_i) bits/symbol. Optimal-parameter power-of-two Golomb coding is worse by a few percent for the b values used here, as shown in Fig. 3. One reason is the sparse quantization of the parameter. More important, especially for small b, is that most of the probability goes to a single quantized value, which can correspond to less than 1 bit of informational content. In contrast, prefix codes like Golomb need to use at least 1 bit per symbol.
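This sub-1-bit effect can be checked numerically: for a narrow (zigzag-mapped, approximately geometric) quantized Laplace, the entropy drops below 1 bit/symbol, while the expected Rice code length cannot:

```python
import numpy as np

# Shannon entropy of a geometric distribution vs expected Rice code length.
def geometric_probs(b, nmax=200):
    p = np.exp(-np.arange(nmax) / b)     # truncated geometric, ratio exp(-1/b)
    return p / p.sum()

b = 0.3                                   # narrow width: most mass on n = 0
p = geometric_probs(b)
H = float(-(p * np.log2(p)).sum())        # entropy in bits/symbol, can be < 1
k = 0                                     # Rice parameter for such small width
L = float(((np.arange(len(p)) >> k) + 1 + k) @ p)   # expected bits/symbol, >= 1
```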
Replacing power-of-2 Golomb coding with an accurate entropy coder like arithmetic coding (AC) or asymmetric numeral systems (ANS), we can improve the compression ratio by a few percent. In this case we also need some quantization of the b parameter - we can prepare entropy coding tables for some discretized space of possible parameters.
III-D Multi-scale interleaving
In standard line-by-line scanning we have context only from half of the plane, only guessing what will happen on the not-yet-decoded side. This can be improved with multi-scale interleaving, showing gains e.g. in the FLIF [3] compressor, where we can use lower resolution context from all directions thanks to progressive decoding in multiple scans, as visualized in Fig. 4.
However, we can see that the context information becomes much more complex here: high dimensional, varying with the scan number. Even after reducing it by some arbitrary averaging, it is still rather too large for context quantization approaches like in LOCO-I. The parametric approaches discussed here have no problem with direct use of such high dimensional contexts, modelling parameters as e.g. a linear combination of a chosen family of functions, with coefficients chosen e.g. by inexpensive least squares optimization and stored in the header. Alternatively, more complex models like neural networks can be used instead.
This figure also proposes a combination with Haar wavelets [11] for hopefully improved performance - splitting decoding into cycles, each improving resolution twice, and each composed of a few scans, e.g. 3 for grayscale or 9 for 3 colors - each providing a single degree of freedom per block for the upscaling. Such decomposition into e.g. 9 scans clearly leaves an opportunity for optimization, starting with the choice of color transformation.
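A single 2x2 Haar step, sketching the per-block degrees of freedom mentioned above (one average plus 3 details, matching 3 scans per cycle for grayscale):

```python
import numpy as np

# One 2x2 Haar step: a block is described by its average plus 3 detail
# coefficients (horizontal, vertical, diagonal), each recoverable per scan.
block = np.array([[10.0, 14.0],
                  [6.0, 2.0]])
avg = block.mean()
d_h = (block[:, 0].mean() - block[:, 1].mean()) / 2    # horizontal detail
d_v = (block[0, :].mean() - block[1, :].mean()) / 2    # vertical detail
d_d = (block[0, 0] - block[0, 1] - block[1, 0] + block[1, 1]) / 4  # diagonal

# perfect reconstruction of the top-left pixel from the 4 coefficients:
rec00 = avg + d_h + d_v + d_d
```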
Assuming some scale invariance of images, similar models can be used for different cycles here; for example, we can treat the cycle number (defining scale) as an additional parameter.
IV Conclusion and further work
Parametric models allow successful exploitation of trends in behavior, also for context dependence and evolution of parametric distributions. Thanks to generalization, a few parameter model can provide better performance than treating all possibilities as independent - neglecting dependencies between them. When wanting to exploit higher dimensional contexts, e.g. for 3 colors, further pixels, region classifiers or multi-scale scanning, parametric models become a necessity, as the number of discretized possibilities would grow exponentially with dimension.
Very basic possibilities were presented and tested here, leaving many improvement opportunities, starting with the choice of contexts and functions, or using other parametric distributions like the exponential power distribution. The used least squares optimization is inexpensive enough for the compressor to individually optimize parameters for each image. Alternatively, when choosing some general default parameters, we can use better optimizers, like the |·| distance for the Laplace median, or generally MLE.
Lossy image compressors face a different situation: coding e.g. DCT transform coefficients, where distribution parameters should also be chosen based on position - which should be included as a part of the context, with some properly chosen functions.
As we can see in Fig. 3, there is a large spread in the behavior of parameters; using individual models for separate images often gives an improvement. This suggests trying to segment the image into regions of similar behavior, or using a region classifier. Having such a segmentation mechanism optimized for a large dataset, with separate models for each segment, the segments could define default behavior, avoiding the need for separate model estimation and storage. It would be valuable to optimize such segmentation based on the used family of models. An alternative approach is using classifiers and treating their evaluation as part of the context, which would additionally allow continuous interpolation between classes.
Finally, while for low cost reasons we focused on linear models for the parameters, better compression ratios at larger computational cost should be achievable using more general models like neural networks. They are considered in the literature for directly predicting the probability distribution of discrete pixel values [5, 6]. We could reduce the computational cost if, based on the context, only the parameters of parametric distributions are predicted instead, finally discretizing the distribution. For example a Laplace distribution for unimodal distributions, e.g. training the neural network to minimize the sum of the |·| predictor distance and the squared error of the width prediction. For multimodal distributions we can for example parameterize the density with a polynomial, and train to minimize the sum of squares of differences of coefficients of orthonormal polynomials as in [8].
[1] M. J. Weinberger, G. Seroussi, and G. Sapiro, "The LOCO-I lossless image compression algorithm: Principles and standardization into JPEG-LS," IEEE Transactions on Image Processing, vol. 9, no. 8, pp. 1309–1324, 2000.
[2] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465–471, 1978.
[3] J. Sneyers and P. Wuille, "FLIF: Free lossless image format based on MANIAC compression," in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 66–70.
[4] T. C. Mills, Time Series Techniques for Economists. Cambridge University Press, 1991.
[5] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," arXiv preprint arXiv:1601.06759, 2016.
[6] J. Menick and N. Kalchbrenner, "Generating high fidelity images with subscale pixel networks and multidimensional upscaling," arXiv preprint arXiv:1812.01608, 2018.
[7] R. Koenker and K. F. Hallock, "Quantile regression," Journal of Economic Perspectives, vol. 15, no. 4, pp. 143–156, 2001.
[8] J. Duda and A. Szulc, "Credibility evaluation of income data with hierarchical correlation reconstruction," arXiv preprint arXiv:1812.08040, 2018.
[9] P. R. Tadikamalla, "Random sampling from the exponential power distribution," Journal of the American Statistical Association, vol. 75, no. 371, pp. 683–686, 1980.
[10] N. Fout and K.-L. Ma, "An adaptive prediction-based approach to lossless compression of floating-point volume data," IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 12, pp. 2295–2304, 2012.
[11] A. Haar, "Zur Theorie der orthogonalen Funktionensysteme," Mathematische Annalen, vol. 69, no. 3, pp. 331–371, 1910.
[12] J. Duda, "Fractal wavelets," 2014. [Online]. Available: https://github.com/JarekDuda/FractalWavelets