1 Introduction
An image is a two dimensional array of pixels with rows and columns (typically we will take ), where a pixel has a real value . (This naturally corresponds to a greyscale image, though the results extend in a straightforward way to color images, by applying them separately to each of the basic colors RGB.) It is common wisdom that in natural images nearby pixels tend to have similar values. One may refer to this property as saying that natural images are smooth. Several hypotheses can be made as to why natural images are smooth. For example:

Our physical world has the property that environments are smooth, and images merely reflect this physical reality.

Physical and technological constraints in generating images (for example, properties of lenses) tend to create smooth images, regardless of whether the environment is smooth or not.

There is a selection bias: the portions of the environment that we tend to depict in images are the smooth portions.
In order to test such hypotheses, it is desirable to compare them against a null hypothesis. One baseline for comparison is that of random arrays of pixels. However, we propose a different baseline for comparisons, that we shall refer to as
images (in distinction from natural images). We study a formal mathematical model of images that assumes that there are no technological constraints in depicting images of the environment, and assumes that there is no selection bias – any portion of the environment is equally likely to be depicted. We show that in our formal model, some level of smoothness of images is to be expected, regardless of any assumptions on the physical environment that is being depicted.
The key aspect that our model makes use of is that environments are depicted in various scales. For example, our eyes may focus on objects as small as a few centimeters in length (say, an insect), or sceneries spanning many kilometers (say, a distant mountain range). It is common wisdom that the smoothness of an object depends on the scale at which it is depicted. Consider for example a very large black and white checkerboard pattern. Viewed from a large distance, one pixel in the image will average the value of many checkerboard squares, and hence the image may be uniformly grey (very smooth). Viewed from a very short distance, every square may correspond to many pixels, and then nearby pixels will have the same value, so the image will be very smooth almost everywhere (except on the boundary between squares). However, at some intermediate scale, each square will occupy a small number of pixels (say one pixel, or four pixels), and then adjacent pixels will have very different values and the image will not be considered smooth.
In our study we present a formal model, and within this model we provide quantitative results regarding the effect that having multiple scales has on the typical smoothness of images. Our results imply that a nontrivial level of smoothness of images should be attributed to some universal mathematical principles that have nothing to do with the environment that is being depicted.
1.1 Related work
There is a vast body of work on natural image statistics (see [2], for example). Smoothness is a well observed aspect of these statistical properties. Moreover, natural images tend to have interesting and useful statistical properties that go well beyond smoothness (see [8], for example). A key aspect of our study is that environments are depicted in various scales. This same aspect appears in existing studies of natural images (see [7, 6, 1, 5], for example), though the focus of work in these references is different from ours: it relates to observed scale invariant properties of natural images, and to statistical models that attempt to explain these phenomena. Our current work does not deal with natural images, but rather with images in some abstract mathematical model. Our results can be contrasted against known results on natural images, but do not directly provide new information about natural images.
The techniques used in our proofs are of the form often used in image processing literature and practice. They are strongly related to a wavelet transform [3] with a Haar basis.
As our results deal with abstract notions of images rather than natural images, the mathematical principles that underlie them are applicable in other settings, and in fact similar principles were used in other settings. Specifically, Theorem 6 is a variation on a certain local repetition lemma proved in [4] in the context of sequential decision making, and Proposition 8 is based on an example given in [4] showing the tightness of the parameters in the local repetition lemma.
2 A formal model of images
An image is a two dimensional array of pixels with rows and columns. We shall sometimes omit the subscripts and simply use . A pixel has a real value . Numbering the pixels in by with and , two pixels and are adjacent if either and , or and . Borrowing standard graph theoretic terminology, we refer to a pair of adjacent pixels as an edge in the image, and we denote the set of edges in the image by . It is not difficult to verify that . We remark that the notation would also be used in order to denote the expectation operator, but the intended use of the notation (either as set of edges or as expectation) will be clear from the context.
The discrepancy of two pixels and is a measure of how different their value is. We consider two different ways of measuring discrepancy, linear discrepancy and quadratic discrepancy . The subscript of indicates the power to which is raised. is perhaps the more natural of these two measures, but it is mathematically more convenient to work with .
Definition 1
The local discrepancy of an image is the average discrepancy for pairs of adjacent pixels. It is denoted by and . The global discrepancy of an image is the average discrepancy over all pairs of pixels whether adjacent or not, including also pairs in which and are the same pixel. It is denoted by and . In cases where we do not wish to distinguish between linear and quadratic discrepancy, we shall use the notation and with no subscript.
The range of possible values of local and global discrepancy is as specified in the following Proposition.
Proposition 2
For every image the following hold: and .
Proof. Nonnegativity follows immediately from Definition 1.
holds because for every pixel , . In a checkerboard pattern with pixel values alternating between 0 and 1 the bound is attained. On the same pattern, the bound is attained (if is even). Convexity of the functions and implies that to maximize one needs for every , and one needs the number of 0-pixels to be equal to the number of 1-pixels. In this extreme case .
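Definition 1 and the checkerboard example in the proof of Proposition 2 can be illustrated with a minimal Python sketch. It assumes that pixel values lie in the range from 0 to 1, that linear discrepancy is the absolute difference of two values and quadratic discrepancy is its square, and that global discrepancy averages over all ordered pairs of pixels (including a pixel paired with itself). The function names are hypothetical and serve only this illustration.

```python
def discrepancy(u, v, power=1):
    """Discrepancy of two pixel values: |u - v| raised to `power`
    (power=1 is linear, power=2 is quadratic)."""
    return abs(u - v) ** power


def local_discrepancy(image, power=1):
    """Average discrepancy over all edges, i.e. pairs of horizontally or
    vertically adjacent pixels. Assumes the image has at least one edge."""
    rows, cols = len(image), len(image[0])
    total, edges = 0.0, 0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:  # horizontal neighbor
                total += discrepancy(image[i][j], image[i][j + 1], power)
                edges += 1
            if i + 1 < rows:  # vertical neighbor
                total += discrepancy(image[i][j], image[i + 1][j], power)
                edges += 1
    return total / edges


def global_discrepancy(image, power=1):
    """Average discrepancy over all ordered pairs of pixels,
    including pairs in which both pixels are the same."""
    values = [v for row in image for v in row]
    total = sum(discrepancy(u, v, power) for u in values for v in values)
    return total / len(values) ** 2


def checkerboard(n):
    """n-by-n array with values alternating between 0 and 1."""
    return [[(i + j) % 2 for j in range(n)] for i in range(n)]


board = checkerboard(4)
print(local_discrepancy(board, power=1), global_discrepancy(board, power=1))
print(local_discrepancy(board, power=2), global_discrepancy(board, power=2))
```

On the 0/1 checkerboard every edge joins a 0-pixel to a 1-pixel, and exactly half of the ordered pairs of pixels have different values, which is why this pattern serves as the extreme case in the proof.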
An image may be smooth in several different senses, and we shall explicitly distinguish between them. One sense of being smooth is that of having low local discrepancy. A consequence of this smoothness is that the image can be compressed: traversing all pixels via some connected path (e.g., row by row in a snakelike fashion), for every new pixel we encounter we already have some prior estimate on its value, based on the pixel preceding it. Another sense of being smooth is by having low global discrepancy. This is a stronger notion than low local discrepancy, due to the following proposition.
Proposition 3
For every by image , .
Proof. Pick a random pixel , and independently, a random pixel and a random neighbor of . By the triangle inequality, . Observe that exactly equals the expectation of , and nearly equals the expectation of (up to an term that is the result of boundary effects). Likewise, nearly equals the expectation of (up to an term that is the result of boundary effects). Hence averaging over all choices of the inequality is proved.
A simple modification to the proof above shows that . The proof of the stronger claim that is deferred to Section 4.1.
Yet another sense of being smooth is by having a high local correlation coefficient.
Definition 4
The local correlation of an image is , where is interpreted as being equal to 1. (Observe that if and only if .)
Observe that for an image in which the values of pixels are chosen as independent identically distributed (i.i.d.) random variables, one would expect . An LC value that significantly deviates from 1 is an indication that the image is not just a collection of random pixels, but rather that there are local correlations. High local correlation (LC values larger than 1) relates to the experience of putting together a jigsaw puzzle: it is a good heuristic to try to match together jigsaw pieces of roughly the same color, rather than just trying to match together random pieces. This is because local discrepancy is typically smaller than global discrepancy.
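The behavior of the local correlation on i.i.d. pixels can be checked numerically. The sketch below assumes that LC is the ratio of quadratic global discrepancy to quadratic local discrepancy (with 0/0 read as 1), which is how we read Definition 4 here; the helper names, the image size and the seed are choices made only for this illustration. The global discrepancy is computed through the identity that the average of the squared difference over all ordered pairs equals twice the variance of the pixel values.

```python
import random


def local_discrepancy_sq(image):
    """Average squared difference over horizontally/vertically adjacent pairs."""
    rows, cols = len(image), len(image[0])
    total, edges = 0.0, 0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:
                total += (image[i][j] - image[i][j + 1]) ** 2
                edges += 1
            if i + 1 < rows:
                total += (image[i][j] - image[i + 1][j]) ** 2
                edges += 1
    return total / edges


def global_discrepancy_sq(image):
    """Average squared difference over all ordered pairs of pixels,
    computed as 2 * (mean of squares - square of mean)."""
    values = [v for row in image for v in row]
    mean = sum(values) / len(values)
    mean_sq = sum(v * v for v in values) / len(values)
    return 2 * (mean_sq - mean * mean)


def local_correlation(image):
    """Assumed form of LC: global over local quadratic discrepancy.
    On a connected grid, zero local discrepancy forces all pixels to be
    equal, so the ratio is then 0/0, which is interpreted as 1."""
    ld = local_discrepancy_sq(image)
    if ld == 0:
        return 1.0
    return global_discrepancy_sq(image) / ld


n = 30
rng = random.Random(0)  # fixed seed, for reproducibility of the illustration
noise = [[rng.random() for _ in range(n)] for _ in range(n)]
gradient = [[i / (n - 1) for _ in range(n)] for i in range(n)]

print("i.i.d. pixels:", local_correlation(noise))
print("vertical gradient:", local_correlation(gradient))
```

The i.i.d. array should give a value close to 1, whereas the gradient, in which neighboring pixels are nearly equal while distant ones are not, gives a value far above 1.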
The main claim of this manuscript is that most images are smooth to a noticeable extent. However, the definitions that we gave so far point to the contrary. If an image is just an array of pixels, then a natural interpretation of the term most is to select the values of these pixels at random in an i.i.d. fashion, with each pixel value distributed uniformly in the range . This will give LC value of roughly 1 which we do not consider as smooth, and also the local discrepancy would not be low (one expects in this case, details omitted).
To be able to substantiate a claim of smoothness, we refine the definition of what an image is. This will lead to natural probability distributions over images that are different from the uniform one stated above, and with respect to these probability distributions most images will be smooth.
2.1 The probability distribution over images
An image, unlike an arbitrary array of pixels, is meant to be an image of “something”. That is, we assume that there is some underlying environment, and images depict portions of the environment. We shall not make any assumptions about the environment – it can be arbitrarily complex and random looking. However, we shall make one assumption about images, and this is that the portions of the environment that images depict can be of different sizes. We now present our model more formally.
There is an environment , which is an by grid of cells. For example, the environment can be a large geographical region (say, of size 100km by 100km), and a cell can be of size corresponding to the smallest unit realistically observable by optical means (say, of sidelength meters). In the case described above, . Each cell has intensity .
Recall that we defined an image to be an by two dimensional array composed of pixels. To simplify the rest of the presentation, we shall assume that . We require to be considerably smaller than . For example, for images with 4 megapixels, .
In terms of terminology, the terms grid, cell, intensity and will be associated with environments, whereas the terms array, pixel, value and will be associated with images.
There is a scale associated with an image, which is an integer in the range , where is some fixed integer satisfying . An image with scale describes an by portion of , where every pixel of the image corresponds to a square of by cells of . The value of a pixel is the average intensity of the cells that it represents, namely, . One may think of an image as an pixel by pixel photograph of some portion of , taken at a zoom level determined by . (This is not meant to be a model that incorporates all optical and technological constraints when describing what a photograph is, but merely a simple approximate model.) Pixels of highest resolution () in our model correspond to single cells in the environment. This convention simplifies the presentation without significantly affecting our results.
Now we describe our probability distribution that governs which portion of is contained in the image. This involves two aspects. One is the scale of the image: in the scale is an integer chosen uniformly at random in the range . The other aspect is the location of the image within . In the location is a cell chosen uniformly at random in the range , and the image extends over those cells with modulo and modulo . Observe that under this definition, an image that is close to the boundary of “wraps around” and continues at the other side of . Hence is treated as a torus rather than as a grid. This is done for technical reasons, so as not to complicate the analysis by boundary effects. It has very little influence on the end results, because a random image is unlikely to be at the boundary of , and even if it is, only out of its pixels are at the boundary of .
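The sampling procedure just described can be sketched in Python as follows. The sketch makes two assumptions that are ours rather than part of the text as it stands here: scales are numbered 0, ..., k-1, and a pixel at scale s covers a square of 2^s by 2^s cells (so that scale 0 pixels are single cells). The names `render_image` and `sample_image` are hypothetical.

```python
import random


def render_image(env, n, s, x, y):
    """Return the n-by-n image at scale s whose top-left cell is (x, y).

    `env` is a square grid of cell intensities, treated as a torus.
    Assumption: each pixel averages a block of 2**s by 2**s cells.
    """
    size = len(env)
    side = 2 ** s
    image = []
    for i in range(n):
        row = []
        for j in range(n):
            total = 0.0
            for a in range(side):
                for b in range(side):
                    # Indices wrap around: the environment is a torus.
                    total += env[(x + i * side + a) % size][(y + j * side + b) % size]
            row.append(total / (side * side))
        image.append(row)
    return image


def sample_image(env, n, k, rng=random):
    """Sample a scale and a location uniformly at random and render the image."""
    s = rng.randrange(k)          # assumed scale range 0, ..., k-1
    x = rng.randrange(len(env))
    y = rng.randrange(len(env))
    return s, (x, y), render_image(env, n, s, x, y)


# Example: an 8-by-8 checkerboard environment, 2-by-2 images, 3 scales.
env = [[(i + j) % 2 for j in range(8)] for i in range(8)]
scale, location, image = sample_image(env, n=2, k=3, rng=random.Random(0))
print(scale, location, image)
```

Treating indices modulo the side length of the environment is what implements the wrap-around, so no special handling of boundary images is needed.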
Observe that under the distribution , every cell of is equally likely to be part of an image. Each cell of belongs to at most one pixel in the image, but each pixel in the image in scale contains cells. Observe also that as described above is simply the uniform distribution over all possible images (portions of that satisfy the size constraints of images).
Definition 5
Given an by environment and integers satisfying , the average local discrepancy of the , denoted by , is the expected local discrepancy of an image sampled from according to distribution . Namely:
Analogously, the average global discrepancy of is
Observe that given and , the average local discrepancy is independent of , but global discrepancy does depend on .
3 Results
In this section, the terminology used is as defined in Section 2. In particular, is the number of scales, and we always assume that . For simplicity, we shall assume that is a power of 2. Throughout, all logarithms are in base 2. Subscripts of 1 or 2 following denote whether we are referring to linear or quadratic discrepancy.
Theorem 6
For every environment its average local discrepancy satisfies .
Proposition 7
When , there are environments for which the average global discrepancy satisfies , up to low order terms that tend to 0 as grows.
Let us contrast Theorem 6 with Proposition 7. Suppose that images can correspond to objects as small as one centimeter in the environment (say, a photo of an insect), up to objects as large as ten kilometers (say, a photo of a landscape). This gives different scales for images. Suppose that every image has 1024 by 1024 pixels. Then and . Proposition 7 shows that for a random image (sampled from ), the average global discrepancy might be as high as . Theorem 6 shows that the average local discrepancy is at most .
Theorem 6 concerns quadratic discrepancy and not linear discrepancy. Hence possibly , even though .
Proposition 8
There is some constant such that for every , there is an environment for which .
The theme of the next theorem is that unless local correlation (in the sense of Definition 4) is significant, then bounds apply not only to quadratic discrepancy, but also to linear discrepancy.
For an environment and , let denote the average local discrepancy taken only over images of scale , and let denote the average global discrepancy taken only over images of scale .
Theorem 9
For an environment and , suppose that for every , . Let be such that . Then . In particular, as grows, the upper bound on tends to .
4 Proofs
4.1 Some preliminary results
The following propositions collect some properties of discrepancy.
Proposition 10
For every image , . There are images with .
Proof. One can lower bound as a function of by the following procedure for sampling an adjacent pair of pixels. First, sample two pixels and uniformly at random (as done for computing ). Then follow a canonical path from to , first going along the row of until the column of is reached, and then along the column of until is reached. Thereafter, a random adjacent pair of pixels along this path is chosen. As the path is at most of length , the triangle inequality for distances implies that (the worst case is when the value of every two adjacent pixels along the path differs by ). The above procedure for sampling adjacent pixels distorts the uniform distribution over adjacent pixels, but only to a limited extent. A pair of adjacent pixels can increase its probability of being sampled (compared to the uniform probability) by at most a constant factor. Hence also with respect to the uniform distribution over pairs of adjacent pixels we must have . (The constants in this proof can be improved by a more careful analysis.)
An example of an image with and is the following: for every , all pixels in row have the same value .
For linear discrepancy, the bounds in Proposition 10 should be changed to (proof omitted). In any case, Proposition 10 shows that local discrepancy can be much smaller than global discrepancy. In contrast, global discrepancy cannot be much smaller than local discrepancy, as shown by Proposition 3. We now develop some machinery for proving Proposition 3 for the case of quadratic discrepancy.
Proposition 11
Given a 2 by 1 image composed only of two adjacent pixels, .
Proof. We prove the proposition for quadratic discrepancy. The proof for linear discrepancy is similar.
Let the two pixels be and . Then , whereas
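The computation behind Proposition 11 can be sketched as follows. The names $a$ and $b$ for the two pixel values and the notation $LD_2$, $GD_2$ are assumptions made here, and Definition 1 is read as averaging the global discrepancy over all four ordered pairs of pixels, including the two pairs in which both pixels coincide. The image has a single edge, so:

```latex
\begin{align*}
LD_2(I) &= (a-b)^2, \\
GD_2(I) &= \tfrac{1}{4}\Bigl[(a-a)^2 + (a-b)^2 + (b-a)^2 + (b-b)^2\Bigr]
         = \tfrac{1}{2}\,(a-b)^2 = \tfrac{1}{2}\,LD_2(I).
\end{align*}
```

Under the same reading, replacing each square by an absolute value gives the analogous relation for linear discrepancy.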
Given an image , an equipartition of partitions the set of its pixels into disjoint equal size subsets. The global discrepancy of an equipartition of , denoted by , is the average of the global discrepancies of its parts.
Lemma 12
For every image and every equipartition of , .
Proof. For convenience of notation, let denote here (and only here) the total number of pixels in the image , and suppose that the equipartition partitions the set of pixels into subsets, each with pixels. Number the pixels from 1 to , with each subset occupying consecutive numbers. Consider now two by symmetric matrices. (These matrices are so called Laplacian matrices of graphs associated with the way discrepancy is being computed.) Matrix has along its diagonal, and all other entries are . Matrix is a block matrix with blocks of size along the diagonal. Each block has along its diagonal, and elsewhere in the block. Outside the diagonal blocks, the matrix is all 0.
Let be the vector of values for the pixels of . We think of as a column vector, and is its transposed row vector. Then , and the average discrepancy of the partition is . Decompose into two components, , where is the all 1 vector, is the average value of , and is a vector orthogonal to . Observe that is an eigenvector of eigenvalue 0 both for and for . Hence and . Observe that all eigenvalues of , except for the unique 0 eigenvalue, have value . Hence (where is the norm of ). As for , it has eigenvalues of 0, and each block contributes eigenvalues of value . Hence . This establishes that , as desired.
We can now prove the quadratic discrepancy part of Proposition 3.
Proof. Suppose for simplicity that is even. Observe that the grid graph is nearly 4-regular. Add one edge to each row making that row into a cycle, and one edge to each column making the column into a cycle. Thus edges are added, but they form only a fraction of the total number of edges, explaining the error term in the statement of Proposition 3. Consider now 4 different partitions of the grid (which by now is a torus), each into parts: takes all even pairs in the rows, takes all odd pairs in the rows, takes all even pairs in the columns, takes all odd pairs in the columns. By Lemma 12, for every . Hence the average global discrepancy of a pair of adjacent pixels is at most . Proposition 11 then implies that .
4.2 Lower bounds on discrepancy
The following proposition shows that the bounds in Theorem 6 are best possible.
Proposition 13
For some environment the average local discrepancy satisfies .
Proof. The following attains . The cells of form a checkerboard pattern with alternating 0/1 values. In the scale the local discrepancy is 1 (regardless of the location of ), and in every other scale local discrepancy is 0. As , the proposition follows.
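The checkerboard construction can be verified by exhaustive enumeration on a small environment. The following Python sketch assumes, as our reading of the model, that scales are numbered 0, ..., k-1 and that a pixel at scale s covers 2^s by 2^s cells; the function names and the small parameters (an 8-by-8 environment, 2-by-2 images, 3 scales) are chosen only for illustration.

```python
def render_image(env, n, s, x, y):
    """n-by-n image at scale s with top-left cell (x, y) on a torus;
    each pixel averages a 2**s by 2**s block of cells (assumed convention)."""
    size, side = len(env), 2 ** s
    return [[sum(env[(x + i * side + a) % size][(y + j * side + b) % size]
                 for a in range(side) for b in range(side)) / side ** 2
             for j in range(n)]
            for i in range(n)]


def local_discrepancy(image, power=2):
    """Average of |difference| ** power over adjacent pairs of pixels."""
    rows, cols = len(image), len(image[0])
    total, edges = 0.0, 0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:
                total += abs(image[i][j] - image[i][j + 1]) ** power
                edges += 1
            if i + 1 < rows:
                total += abs(image[i][j] - image[i + 1][j]) ** power
                edges += 1
    return total / edges


def average_local_discrepancy(env, n, k, power=2):
    """Expected local discrepancy when scale and location are uniform,
    computed by enumerating every scale and every location."""
    size = len(env)
    total, count = 0.0, 0
    for s in range(k):
        for x in range(size):
            for y in range(size):
                total += local_discrepancy(render_image(env, n, s, x, y), power)
                count += 1
    return total / count


env = [[(i + j) % 2 for j in range(8)] for i in range(8)]  # checkerboard cells
k = 3
print(average_local_discrepancy(env, n=2, k=k), "versus 1/k =", 1 / k)
```

At scale 0 every edge joins a 0-cell to a 1-cell, while at every larger scale each pixel averages an even-sided block and hence has value exactly one half, so only one of the k scales contributes.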
We now prove Proposition 7 concerning global discrepancy.
Proof. Partition into megacells where a megacell is a by array of cells. Within a megacell, every cell has the same intensity. The megacells are arranged in a checkerboard pattern, with alternating 0/1 intensities.
The distribution selects a scale uniformly at random. Observe that already when , a random pixel has constant probability of being entirely contained in a megacell, and this probability tends to 1 at an exponential rate as decreases. Moreover, when is even, as long as , exactly half the cells (not megacells) contained in an image have intensity 1, and the other half has intensity 0. The combination of these two facts implies that roughly half the pixels of the image have value 1, and roughly half have value 0, giving . As this happens at roughly scales out of possible choices of scales, , as desired.
The bound in Proposition 7 is nearly best possible, though this will not be proved in this manuscript, because we only need the direction of the inequality that is stated in the proposition.
We now prove Proposition 8 concerning linear discrepancy.
Proof. We shall not try to optimize the constant in the following proof.
In our proof it will be convenient to allow the intensities of cells to be in the range , where for simplicity we assume that is an integer. Clearly, by scaling intensities can be adjusted to lie in the range , while losing a factor of in the value of .
Let the by environment (with a power of 2) be such that the intensity of a cell depends only on but not on . Specifically, the intensity of cell is computed as follows. Write in binary notation, but with replacing 0. Consider only the least significant bits in this notation. This gives some string . For cells that we refer to as balanced the intensity of the cell is simply the sum of bits in . However, there are cells that we refer to as extreme. Those are the cells for which for some the sum of the first bits in is either or . For these extreme cells their intensity is the value of the corresponding prefix (hence the maximum allowed absolute value for the intensity). By Kolmogorov’s inequality for partial sums of independent random variables, at most one quarter of the cells are extreme.
Consider now a random pixel at an arbitrary scale . For all the cells within it, the corresponding share the same prefix. Observe that when this prefix by itself is not extreme (namely, its sum of values never hits either or – this happens with probability at least ) then the value of the pixel (the average over all cells that it contains) is precisely the sum of values of the prefix. Of the four pixels adjacent to , one of them is adjacent to it horizontally and agrees with it on an prefix and differs on bit . The linear discrepancy between these two pixels is 2.
This implies that with probability at least the linear discrepancy is at least 2, which after scaling the intensities to lie in shows that .
4.3 Proofs of main theorems
Proof of Theorem 6.
Proof. In an image of scale , the side length of a pixel is cells. Using distribution , every scale is chosen with equal probability, and given a scale , every two adjacent pixels of size are equally likely to be in the image. We need to prove that the expectation of the discrepancy of two adjacent pixels (chosen at random from an image chosen from distribution ) is at most . It suffices to prove it for pairs of pixels adjacent horizontally, and by symmetry, the same proof will apply to pairs of pixels adjacent vertically. Hence for the rest of the proof, pixels are considered to be adjacent if and only if they are adjacent horizontally. We may envision a pair of adjacent pixels as a domino piece. We describe now a method of sampling uniformly at random a domino piece.
Consider a “window” of with columns and rows. This window is equivalent to a domino piece of scale . Subdivide each of its pixels of scale into four pixels of scale . These pixels are arranged as two domino pieces of scale . Continue subdividing recursively, where for every , every pixel of scale gives two domino pieces of scale . Hence in scale there are disjoint domino pieces. Now to sample a random domino piece, choose at random, choose a scale uniformly at random, and within choose a domino piece of scale uniformly at random.
To compute the discrepancy of a domino piece of scale , one needs first to average the value of its left pixel (by summing all cells and dividing by ) getting a value , to average the value of its right pixel getting a value , and compute .
Let denote the set of domino pieces of scale in . Let denote the weighted average local discrepancy (over horizontal pairs) in , where the weights are such that each scale is equally likely to be chosen. We have:
(1) 
The intensities of cells in form a function from cells of to . Denoting cells by , the average intensity will be denoted by , and the average of the squares of the intensities will be denoted by . We now represent the function in an orthonormal basis that is very much related to the Haar basis, though not identical to it. The number of basis vectors needs to be (matching the number of cells in ), but we shall specify only some of the basis vectors. The set of basis vectors that we specify will be referred to as the domino partial basis. One basis vector has value on all cells of . In addition, each domino piece in represents a basis vector as follows. Given the scale of the domino piece, in its left pixel (composed of cells), each cell has value , each cell in the right pixel has value , and the cells not covered by the domino piece have value 0. Hence the norm of every vector in the domino partial basis is 1, and every two vectors are orthogonal.
The inner product of with a basis vector that corresponds to a domino piece at scale is precisely . This is the coefficient of the function according to the basis vector corresponding to the domino piece. The square of this coefficient is . For , the squared value of the coefficient can readily be seen to be . The sum of squares of all coefficients is at most the square of the norm of (if we had a complete basis, they would be equal, by Parseval’s identity), and hence:
(2) 
Dividing both sides of Equation (2) by , we obtain
(3) 
Combining Equation (3) with (1) we obtain that . As the intensities are in the range , necessarily . The expression is maximized when , and then it evaluates to . Hence , as desired.
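The inequality at the heart of this proof can be checked numerically. The Python sketch below rests on our own reconstruction of the constants, which should be treated as an assumption: for a window with 2^(k-1) rows and 2^k columns, with scales 0, ..., k-1 and pixels of side 2^s, the sum over scales of the average quadratic discrepancy of the domino pieces is at most four times the variance of the cell intensities, so that the average over the k scales is at most 1/k when intensities lie between 0 and 1. All function names are hypothetical.

```python
import random


def pixel_mean(window, s, r, c):
    """Average intensity of the scale-s pixel in pixel-row r, pixel-column c."""
    side = 2 ** s
    cells = [window[i][j]
             for i in range(r * side, (r + 1) * side)
             for j in range(c * side, (c + 1) * side)]
    return sum(cells) / len(cells)


def domino_averages(window, k):
    """For each scale s = 0, ..., k-1, the average quadratic discrepancy
    over the disjoint horizontal domino pieces of that scale."""
    averages = []
    for s in range(k):
        side = 2 ** s
        pixel_rows = len(window) // side
        pixel_cols = len(window[0]) // side
        values = []
        for r in range(pixel_rows):
            for c in range(0, pixel_cols, 2):  # pair columns (0,1), (2,3), ...
                left = pixel_mean(window, s, r, c)
                right = pixel_mean(window, s, r, c + 1)
                values.append((left - right) ** 2)
        averages.append(sum(values) / len(values))
    return averages


def four_times_variance(window):
    """4 * (average of squared intensities - square of average intensity)."""
    cells = [v for row in window for v in row]
    mu = sum(cells) / len(cells)
    nu = sum(v * v for v in cells) / len(cells)
    return 4 * (nu - mu * mu)


k = 4
rng = random.Random(1)  # fixed seed, for reproducibility of the illustration
window = [[rng.random() for _ in range(2 ** k)] for _ in range(2 ** (k - 1))]
averages = domino_averages(window, k)
print(sum(averages) / k, "<=", four_times_variance(window) / k, "<=", 1 / k)
```

Because the domino vectors together with the constant vector form only a partial orthonormal basis, the first inequality is in general strict; on a 0/1 checkerboard window it holds with equality, matching the tightness example of Proposition 13.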
For the proof of Theorem 9 we use the following notation. Let for and for .
Lemma 14
For every , .
Proof. The proof of Lemma 14 is implicit in our proof of Theorem 6. Consider a random window of with rows and columns. It can be thought of as half an image at scale , and being half an image, Lemma 12 implies that the expectation over choice of random satisfies . The proof of Theorem 6 implies that , where is the average value of a pixel in and is the average squared value. As , the lemma follows.
We now prove Theorem 9.
Proof. Observe that convexity of the function implies that . Hence to prove Theorem 9 we shall bound the maximum possible value of . As for , this is the same as bounding the maximum possible value of . Relaxing the constraint that for , we get the following mathematical program.
Maximize subject to:

.

.

.
Constraint 2 is a consequence of Theorem 6. Constraint 3 is a consequence of Lemma 14 together with the premise of Theorem 9.
Consider a feasible (not necessarily optimal) solution to the above mathematical program of the form , for some . Then constraint 1 is necessarily satisfied. Constraint 2 is satisfied with equality because . As for Constraint 3, we require that . Dividing both sides by we get an upper bound on the maximum possible value of , implied by the inequality .
If is of the form , then in order to maximize we need to choose as large as possible. This follows because is increasing with .
Recall that needs to satisfy the constraint:
In particular, when tends to infinity, we have that . Under the solution the value of the objective function of the mathematical program has the following simple form:
It remains to show that the solution is not only feasible but also optimal. Hence fix and integer and let be the solution of . (The inequality is required in order to ensure the existence of such a .)
Consider an optimal solution , and for the sake of contradiction suppose that there is some (we take the smallest one) for which . We consider two cases.

. Let be largest such that . Constraint 3 implies that necessarily . Likewise, Constraint 3 implies that . The same argument can be repeated with replacing , and thereafter repeated indefinitely. By minimality of we have that . Hence we have that . This means that in the solution can increase (in fact, at least up to ) without violating any of the constraints, thus contradicting the optimality of .

. An argument analogous to Case 1 above implies that it cannot be that for every Constraint 3 is attained with equality, as then Constraint 2 will be violated. Let be the smallest index for which there is slackness in Constraint 3, and let be the amount of slackness. Denote and suppose that . In this case, modify the solution to a new solution in which is replaced by and is replaced by . One can easily verify that is feasible and gives a higher value than does for the objective function (due to concavity of ). This contradicts the assumed optimality of .
It remains to deal with the case that . Below we establish that in this case there is some other index such that Constraint 3 has slackness for , and moreover, . Then the above argument can be applied with replacing , completing the proof.
Observe that there are only finitely many indices with (because ). Let be the largest index such that . By our assumption that we have that . Clearly . We now show that Constraint 3 has slackness for . There are two cases to consider.

For all it holds that . In this case , because Constraint 3 (together with ) implies that the average value of for is less than . It follows that the inequalities implied by Constraint 3, one for and one for , overlap in some terms on the right-hand side. Moreover, for all the terms in which they differ, the right-hand side for has strictly higher value (every term at least ) than for (every term strictly smaller than ). Since , there must be slackness for .

For some it holds that . Let be the largest such index. Then repeat the argument above with the inequalities implied by Constraint 3, one for (instead of ) and one for .

5 Discussion
One may think of our work as distinguishing between three concepts.

An array of pixels.

An image as defined in our abstract model. It depicts a portion of an environment, and may do so in one of several scales. No assumptions are made on the nature of the environment.

A natural image. The environment depicted needs to adhere to physical realities of our world, and the selection process of images may be biased, based on the goals of the person taking these images.
The three main principles that underlie our probabilistic model of images are the following:

The model assumes nothing about the nature of the environment . As our results are positive (showing some level of smoothness), this aspect strengthens the applicability of our results.

There is no single scale in which a large fraction of the images are taken. If there was such a scale, then can be arranged to have large local discrepancy at this scale (e.g., a checkerboard pattern), and on average images would not be smooth.

The location of the image is chosen independently of the content of – for a given scale, there is no correlation between the smoothness of at a certain location and the probability that the image is taken at this location.
We showed that a key statistical property associated with natural images, that of smoothness, already manifests itself to some extent in the abstract model for images. Our study is quantitative, and our quantitative results uncover rather subtle and perhaps counterintuitive effects. Let us recap one of our conclusions. Arguably, noticeable local correlation in an image (namely, having quadratic local discrepancy that is small compared to the quadratic global discrepancy) is by itself an indication of smoothness. Theorem 9 (contrasted with Proposition 8) shows that the absence of local correlation (setting close to 1) leads to improved upper bounds on the expected linear local discrepancy of random images.
Given quantitative values of smoothness of natural images, our work may allow one to assess how much of this value should be attributed already to the abstract image model, and then only the residual smoothness needs to be explained by properties of the natural world.
Our results become more significant as the number of scales grows. In natural images, due to physical constraints of the real world, cannot grow indefinitely, and hence we attempted to present our results not only in an asymptotic sense (e.g., ), but also to provide explicit bounds on the leading constants involved. In particular, the premise of Theorem 9 was chosen in a way that would keep these constants small. The proof technique of Theorem 9 (using a linear program to upper bound the linear discrepancy) is versatile enough to extend to weaker premises, at the cost of resulting in higher leading constants in the upper bound.
Acknowledgements
The author thanks Ronen Basri, Anat Levin and Boaz Nadler for helpful discussions on natural image statistics.
References
[1] Luis Alvarez, Yann Gousseau, Jean-Michel Morel. The Size of Objects in Natural and Artificial Images. Advances in Imaging and Electron Physics, Volume 111, 1999, Pages 167–242.
[2] Aapo Hyvärinen, Jarmo Hurri, Patrik O. Hoyer. Natural Image Statistics: A probabilistic approach to early computational vision. Springer-Verlag, 2009.
[3] Ingrid Daubechies. Ten Lectures on Wavelets. SIAM, 1992.
[4] Uriel Feige, Tomer Koren, Moshe Tennenholtz. Chasing Ghosts: Competing with Stateful Policies. Manuscript, 2014.
[5] D. Mumford and B. Gidas. Stochastic models for generic images. Quarterly of Applied Mathematics, 54(1):85–111, 2001.
[6] D. Ruderman. Origins of scaling in natural images. Vision Res., Vol. 37, No. 23, pp. 3385–3398, 1997.
[7] Daniel L. Ruderman, William Bialek. Statistics of Natural Images: Scaling in the Woods. Physical Review Letters, 73(6), 814–817, 1994.
[8] Maria Zontak, Michal Irani. Internal statistics of a single natural image. CVPR 2011: 977–984.