Besides lossy compression techniques like quantization, the compression ratio depends on statistical modelling: predicting conditional probability distributions of values based on context. Log-likelihoods of such models translate directly into savings in bits/value.
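As a reminder (the standard relation between log-likelihood and code length, assuming an idealized entropy coder): the cost of encoding a sequence $(x^t)_{t=1..n}$ under a model is approximately
$$\text{bits/value} \approx -\frac{1}{n}\sum_{t=1}^{n} \log_2 \Pr\!\left(x^t \,\big|\, \text{context}^t\right),$$
so increasing the mean $\log_2$-likelihood by $\Delta$ saves $\Delta$ bits/value.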
The Laplace distribution (geometric when discretized) has turned out to be a good universal approximation for the distribution of many types of values in data compression, like residues (errors from prediction) or AC coefficients of discrete cosine transforms (DCT). It has two parameters: center $\mu$ and width/scale parameter $b$. While context-dependent prediction of a value is often treated as an estimator of $\mu$, the width parameter $b$ is usually fixed. A rare example of predicting this width is LOCO-I/JPEG LS [1], which quantizes a 3-dimensional context into 365 bins - not exploiting dependencies between them, and limited to a low-dimensional context.
We will focus on an inexpensive general approach for predicting both centers $\mu$ and widths $b$ from the context: as just linear combinations of functions of the context, with parameters optimized automatically, e.g. by least-squares linear regression. It is computationally inexpensive; the parameters could e.g. be optimized for various region types, or even found by the encoder for a given image and stored in the header. While this sounds natural for the centers, it might be surprising that we can also predict the width parameter this way: by MSE prediction of absolute values (of residues), which could alternatively be done with more sophisticated models like neural networks.
This approach is applied here to image compression through upscaling: using a sequence of differences to increase resolution. It is used for example in FUIF and JPEG XL [3] as the "squeeze mode" of lossless image compression; however, they assume a fixed Laplace distribution. As summarized in Fig. 1, adding the discussed inexpensive context-dependent prediction can bring essential savings: on average 0.645 bits/difference for the most costly, last scan. The previous scans contain a much lower number of values (halved per level) and show lower average savings: correspondingly 0.296, 0.225, 0.201 bits/difference for the previous three scans. These simple models are insufficient for higher-level information, but can be helpful for filling in details of textures - and this type of information often dominates the bitstream.
Context dependence of symbol probability distributions is often exploited in the final symbol/bit sequence, e.g. in CABAC [4], popular especially in video compression. However, such a sequence loses spatial context information, which is crucial in image/video compression. The presented general approach can also be useful for exploiting context dependence in such situations, like modelling DCT coefficients using context, e.g. of already decoded coefficients in the current and neighboring blocks.
This main section first briefly discusses the "squeeze" upsampling approach, then the approaches to predict center and width - their deeper discussion can be found in [2].
II-A Upsampling through "squeeze"
The simplest upsampling scheme was used here, which can be seen as inspired by Haar wavelets [5]: first store the average over some square pixel region (or even the entire image), then successively provide information about differences of averages of two subregions, preferably of the same size, down to single-pixel regions.
As we operate on discrete, e.g. 8-bit values, it is convenient to maintain such a range of integer values during upscaling, which can be done e.g. using the "Squeeze" approach from Jon Sneyers' FUIF image compressor (https://github.com/cloudinary/fuif/blob/master/transform/squeeze.h). Specifically, for two neighboring higher-resolution integer values $a, b$, we use their average (integer, hence approximate) and difference:
$$\mathrm{avg} = \lfloor (a+b)/2 \rfloor, \qquad \mathrm{diff} = a - b,$$
which allow to uniquely determine $a, b$ as $a = \mathrm{avg} + \lceil \mathrm{diff}/2 \rceil$, hence $b = a - \mathrm{diff}$.
We can for example scan line by line as in Fig. 1 and alternately upscale in horizontal and vertical direction, based on the decoded sequence of differences.
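A minimal sketch of this one-dimensional squeeze step (function and variable names are illustrative, not taken from FUIF):

```python
import numpy as np

def squeeze_pair(a, b):
    """Forward step: integer average (rounded down) and difference."""
    avg = (a + b) // 2          # floor((a+b)/2), stays in the 8-bit range
    diff = a - b                # signed difference to be entropy coded
    return avg, diff

def unsqueeze_pair(avg, diff):
    """Inverse step: recover the two higher-resolution values."""
    a = avg + (diff + 1) // 2   # avg + ceil(diff/2)
    b = a - diff
    return a, b

# round-trip check on random 8-bit pairs
rng = np.random.default_rng(0)
for a, b in rng.integers(0, 256, size=(1000, 2)):
    assert unsqueeze_pair(*squeeze_pair(int(a), int(b))) == (int(a), int(b))
```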
Statistics of these differences turn out to agree well with the Laplace distribution:
$$\rho_{\mu,b}(x) = \frac{1}{2b}\,\exp\!\left(-\frac{|x-\mu|}{b}\right).$$
The question is how to choose its parameters: center $\mu$ and width/scale parameter $b$. The standard approach is to use fixed parameters. Their maximum likelihood estimation (MLE) for a sample $(x^t)_{t=1..n}$ is:
$$\hat\mu = \mathrm{median}\{x^t\}, \qquad \hat b = \frac{1}{n}\sum_{t=1}^{n} |x^t - \hat\mu|.$$
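For illustration, a direct implementation of these MLE formulas (the names and the code-length approximation are ours, not from any codec):

```python
import numpy as np

def laplace_mle(x):
    """MLE of Laplace center (median) and scale (mean absolute deviation)."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    b = np.mean(np.abs(x - mu))
    return mu, b

def static_bits_per_value(x):
    """Approximate bits/value of a fixed-parameter Laplace model:
    average negative log2-density (continuous approximation of code length)."""
    mu, b = laplace_mle(x)
    return np.mean(np.log2(2 * b) + np.abs(x - mu) / b * np.log2(np.e))

x = np.random.default_rng(1).laplace(loc=0.0, scale=3.0, size=10_000)
print(laplace_mle(x), static_bits_per_value(x))
```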
Let us discuss exploiting context dependence for a better choice of parameters for a given position, which can lead to surprisingly large improvements, as seen in Fig. 1. We can use the already decoded local context for this purpose, as in the example in this figure, where yellow capital letters denote values of the context as averages over the corresponding blocks.
II-B Predicting center from context
While we could consider more sophisticated predictors including neural networks, a basic family are linear predictors:
$$\tilde\mu = \sum_{j=1}^{d} \beta_j\, c_j$$
for context values $c = (c_1,\ldots,c_d)$ and parameters $\beta$.
A standard approach is finding these parameters from interpolation: fit a polynomial assuming some values at the context positions and take its value at the predicted position - obtaining a linear combination of context values.
A safer data-based approach is to directly optimize these parameters on data: obtaining a single set of parameters optimized over a larger dataset, or better, separate parameters for various region types (requiring e.g. a classifier). The parameters for the tests here were optimized for a given image, for example to be found by the encoder and stored in the header. The final solution should rather use some region classification with separate predictors, e.g. classified based only on the context.
For values $x^t$ and $d$-dimensional contexts $c^t = (c^t_1,\ldots,c^t_d)$ (alternatively some functions of the context), we can find parameters $\beta$ minimizing e.g.
$$\frac{1}{n}\sum_{t=1}^{n} \left| x^t - \sum_{j=1}^{d} \beta_j\, c^t_j \right| \qquad (6)$$
The MLE estimator of $\mu$ is the median. From quantile regression [6], the median can be predicted by minimizing the mean $L^1$ norm - the absolute value in (6). However, MSE optimization - using the squared norm instead - is computationally less expensive and gives comparable evaluation; as such optimization would rather have to be performed by the encoder in these applications, MSE optimization is used in the tests here.
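A minimal sketch of this MSE-optimized variant (the squared norm in place of the absolute value in (6)); the array names and the toy two-value context are illustrative:

```python
import numpy as np

def fit_center_predictor(C, x):
    """Least-squares fit of mu ~ C @ beta, where each row of C is a decoded
    context (optionally with a constant 1 appended) and x are the target values."""
    beta, *_ = np.linalg.lstsq(C, x, rcond=None)
    return beta

def predict_center(C, beta):
    return C @ beta

# toy example: 2D context (two already decoded neighbors) plus a constant term
rng = np.random.default_rng(2)
C = np.column_stack([rng.integers(0, 256, 5000),
                     rng.integers(0, 256, 5000),
                     np.ones(5000)])
x = 0.6 * C[:, 0] + 0.4 * C[:, 1] + rng.laplace(scale=2.0, size=5000)
beta = fit_center_predictor(C, x)
residues = x - predict_center(C, beta)   # passed on to the width predictor
```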
From experiments, the most crucial for predicting $\mu$ was the difference of context values suggesting the local gradient, which should be maintained between these positions, especially in smooth regions. Finally, the entire context from Fig. 1 was used in the tests, as it generally provided the best evaluation.
II-C Predicting width parameter from context
We can now subtract the predicted $\tilde\mu$ from the values - denote such a sequence of residues as $(r^t)$. For these differences from prediction we could choose a fixed $b$, e.g. the MLE: $b$ as the mean of $|r^t|$ - used for the blue dots in Fig. 1.
We can improve by also predicting $b$ from the current context. Again we could use more sophisticated models like neural networks; for simplicity, the tests here use a linear combination of functions of the context:
$$\tilde b = \sum_{j} \alpha_j\, f_j(c).$$
While for $\mu$ it is natural to directly use the values from the context in such linear combinations, here we would like to estimate the noise level, which should be related to local gradient sizes, e.g. absolute differences of neighboring positions - generally some functions $f_j$ of the context vectors.
To inexpensively optimize the parameters for a chosen set of functions $f_j$, recall that the MLE estimate of $b$ is the mean of $|r^t|$. Observing that the mean of a set of values is the position minimizing the mean squared distance from these values leads to the heuristic: least-squares prediction of absolute residues,
$$\arg\min_{\alpha}\ \frac{1}{n}\sum_{t=1}^{n} \left( |r^t| - \sum_{j} \alpha_j\, f_j(c^t) \right)^{2}.$$
This was used to obtain the improvement between the blue and green dots in Fig. 1, with the functions $f_j$ chosen as absolute differences of context values along the gradient in the already decoded direction. We need to be careful here to enforce $\tilde b > 0$, e.g. by ensuring all $\alpha_j \geq 0$. In the tests this was obtained by removing a context function whose coefficient came out negative and recalculating, until all coefficients were positive.
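A sketch of this heuristic, under illustrative naming and an assumed feature set (constant term plus absolute neighbor differences): least-squares prediction of absolute residues, dropping features with negative coefficients and refitting so the predicted width stays positive.

```python
import numpy as np

def width_features(C):
    """Nonnegative functions of the context: constant term and absolute
    differences of successive context values (a proxy for local gradient)."""
    diffs = np.abs(np.diff(C, axis=1))
    return np.column_stack([np.ones(len(C)), diffs])

def fit_width_predictor(F, abs_res):
    """Least-squares fit of |r| ~ F @ alpha; features whose coefficients
    come out negative are removed one by one and the fit is repeated."""
    active = list(range(F.shape[1]))
    while True:
        alpha, *_ = np.linalg.lstsq(F[:, active], abs_res, rcond=None)
        neg = [i for i, a in zip(active, alpha) if a < 0]
        if not neg:
            break
        active.remove(neg[0])
    full = np.zeros(F.shape[1])
    full[active] = alpha
    return full

# toy data: contexts and Laplace residues whose width grows with local gradient
rng = np.random.default_rng(3)
C = rng.integers(0, 256, size=(5000, 3)).astype(float)
grad = np.abs(np.diff(C, axis=1)).mean(axis=1)
residues = rng.laplace(scale=0.5 + 0.05 * grad)

F = width_features(C)
alpha = fit_width_predictor(F, np.abs(residues))
b_pred = np.maximum(F @ alpha, 1e-3)   # predicted width per position
```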
From the entropy coding perspective, AC/ANS coding tables should be prepared for some quantized set of widths $b$ - one of them is chosen by the width predictor, and the corresponding coding step is applied to the value shifted by the predicted center (and rounded).
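For instance (an illustrative sketch, not the table layout of any particular codec, with an assumed quantization grid): widths can be quantized logarithmically, and each table built from the discretized (two-sided geometric) Laplace probabilities $p_b(k) \propto e^{-|k|/b}$.

```python
import numpy as np

B_MIN, B_MAX, N_TABLES = 0.25, 64.0, 32            # assumed quantization grid

def width_to_table(b):
    """Map a predicted width to the index of the nearest log-quantized table."""
    i = np.log2(np.clip(b, B_MIN, B_MAX) / B_MIN)
    return int(round(i * (N_TABLES - 1) / np.log2(B_MAX / B_MIN)))

def table_probabilities(b, k_max=255):
    """Two-sided geometric probabilities p(k) = (1-t)/(1+t) * t^|k|, t = exp(-1/b),
    truncated to |k| <= k_max and renormalized - the input for an AC/ANS table."""
    t = np.exp(-1.0 / b)
    k = np.arange(-k_max, k_max + 1)
    p = (1 - t) / (1 + t) * t ** np.abs(k)
    return k, p / p.sum()

def code_length_bits(k, b):
    """Cost in bits of a rounded, center-shifted value k with the table for width b."""
    ks, p = table_probabilities(b)
    return -np.log2(p[np.searchsorted(ks, k)])
```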
III Conclusions and further work
This article presented an application of the general methodology from [2] to data compression with upsampling - providing a surprisingly large saving opportunity at low computational cost, which seems unexploited in current compressors.
This article only suggests basic tools, which can be improved e.g. with a better choice of context, or of functions of the context, especially for the width predictor. We can also use more sophisticated models like neural networks - preferably optimizing the $L^1$ distance from the values for the $\mu$ predictor, and the $L^2$ distance from the absolute residues for the $b$ predictor (also ensuring positivity). However, such split parameter prediction is an approximation; better compression ratios, at a larger computational cost, could be obtained e.g. by further optimization of parameters directly maximizing log-likelihood of the predicted conditional probability distributions.
Probably the most promising direction for further work is the data-based automatic choice of separate predictors for various region types, e.g. choosing one of several models based only on the current context, or maybe mixing predictions from various models.
Another direction is other families of distributions, especially the exponential power distribution [7] (which contains the Laplace distribution as a special case) - some initial tests provided further bits/difference improvements.
We can also improve the Laplace distribution model with further context-dependent models of density, e.g. as polynomials, like in [8]. Initial tests provided additional bits/difference improvements here, but these are relatively costly and large models; the question of their practicality will require further investigation.
Finally, other applications of the presented approach should also be explored, especially for DCT coefficients, e.g. based on already decoded neighboring coefficients.
[1] M. J. Weinberger, G. Seroussi, and G. Sapiro, "The LOCO-I lossless image compression algorithm: Principles and standardization into JPEG-LS," IEEE Transactions on Image Processing, vol. 9, no. 8, pp. 1309–1324, 2000.
[2] J. Duda, "Parametric context adaptive Laplace distribution for multimedia compression," arXiv preprint arXiv:1906.03238, 2019.
[3] J. Alakuijala, R. van Asseldonk, S. Boukortt, M. Bruse, Z. Szabadka, I.-M. Comsa, M. Firsching, T. Fischbacher, E. Kliuchnikov, S. Gomez et al., "JPEG XL next-generation image compression architecture and coding tools," in Applications of Digital Image Processing XLII, vol. 11137. International Society for Optics and Photonics, 2019, p. 111370K.
[4] D. Marpe, H. Schwarz, and T. Wiegand, "Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 620–636, 2003.
[5] A. Haar, "Zur Theorie der orthogonalen Funktionensysteme," Mathematische Annalen, vol. 69, no. 3, pp. 331–371, 1910.
[6] R. Koenker and K. F. Hallock, "Quantile regression," Journal of Economic Perspectives, vol. 15, no. 4, pp. 143–156, 2001.
[7] P. R. Tadikamalla, "Random sampling from the exponential power distribution," Journal of the American Statistical Association, vol. 75, no. 371, pp. 683–686, 1980.
[8] J. Duda, R. Syrek, and H. Gurgul, "Modelling bid-ask spread conditional distributions using hierarchical correlation reconstruction," arXiv preprint arXiv:1911.02361, 2019.