1 Introduction
In order to reproduce the voice of a singer who can sing up to a “soprano C”, or at a frequency of $W$ Hz, Claude Shannon [31] proved that we need to sample her voice once every $1/(2W)$ seconds. He named this number the Nyquist sampling rate for a signal of band $W$ Hz, i.e., a signal with frequencies no higher than $W$ Hz, after Harry Nyquist, who had “pointed out the fundamental importance of the time interval $1/2W$ in connection with telegraphy.”
Shannon notes that this result was known in other forms to the mathematician J. M. Whittaker [34], but that it had not otherwise appeared explicitly in the literature of communication theory. The idea must have been in the air, since Nyquist [24], Bennett [4] (in the steady state case), and Gabor [16] had pointed out that approximately $2TW$ numbers are sufficient to capture a signal of band $W$ Hz that lasts for $T$ seconds.
Further in “Communication in the presence of noise” [31], published a year after his seminal “A mathematical theory of communication” [30], Shannon establishes a method to represent geometrically any communication system, and explores the utility of mapping a sequence of samples of a band limited signal into a high dimensional vector space. And it is here where he makes the most interesting of all remarks, on page 13: “[…] in the case of speech, the ear is insensitive to a certain amount of phase distortion. Messages differing only in the phases of their components […] sound the same. This may have the effect of reducing the number of essential dimensions in the message space.” In other words, even if the dimension of the ambient vector space where we embed a representation of a signal is very high, we may come up with an equivalence class for which member points will have similar information content as the original signal, as far as the end user is concerned; and that equivalence class, in turn, will induce a low dimensional manifold in the vector space where similar messages can be mapped.
These ideas make it natural to frame the theory of compressed sensing [10, 9, 13, 7, 20] in the context of sampling and information theories. To see this, observe that compressed sensing makes it possible to reconstruct a signal, under certain circumstances, with fewer measurements than the otherwise required number of samples dictated by the Nyquist sampling rate. Moreover, even when the reconstruction is not exact, the error will be small.
Specifically, compressed sensing deals with the problem of recovering a signal or message of interest $f$ with $N$ coordinates, which we assume can be represented as $f = \Psi x$ for an $N \times M$ matrix $\Psi$, with $N \le M$, from an incomplete set of linear measurements,
(1) $y = \Phi f = \Phi \Psi x$,
where $y$ is the vector of $m$ measurements, $x$ is the object to recover, and $\Phi$ is the $m \times N$ measurement matrix, with $m < N$ and $\Phi$ a full rank matrix. (If $N = M$, we are in the setting of transform coding, where $\Psi$ represents a unitary transform, for example; and if $N < M$, we can talk of a dictionary or a frame representation of $f$.) Given a measurement vector $y$, eq. 1 represents an underdetermined system of linear equations, with an infinite number of solutions. However, if $x$ has at most $k$ significant components, compared to the rest, we can recover it exactly, or very closely, by solving the constrained problem,
(2) $\min_{x} \|x\|_0$ subject to $\|y - A x\|_2 \le \epsilon$,
where $A = \Phi \Psi$, and $\epsilon \ge 0$. Here $\|x\|_0$ counts the number of nonzero entries of $x$. If $x^*$ is a solution to eq. 2, we then synthesize an approximate reconstruction of $f$ by using $f^* = \Psi x^*$.
Note that since $m < N$, we have used fewer measurements than the number of coordinates of $f$, in effect compressing the sensing, hence the name compressed sensing. In doing so we possibly beat the Nyquist sampling rate, going from a large dimensional message space, of dimension $N$, to a smaller dimensional measurement space, of dimension $m$, in a manner that hopefully captures the essence of the signal of interest, just as Shannon envisioned.
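To make the constrained problem of eq. 2 concrete, the following Python sketch searches by brute force for the sparsest vector consistent with a set of measurements. It is only an illustration under stated assumptions: the representation matrix is taken to be the identity, the $4 \times 8$ Vandermonde measurement matrix and the test vector are hypothetical choices, and the search is limited to supports of at most two entries. The names `sparsest_solution` and `least_squares` are ours and do not come from any library.

```python
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def least_squares(cols, y):
    """Least-squares fit of y on one or two columns via the normal equations."""
    if len(cols) == 1:
        a = cols[0]
        return [dot(a, y) / dot(a, a)]
    a, b = cols
    aa, bb, ab = dot(a, a), dot(b, b), dot(a, b)
    det = aa * bb - ab * ab  # nonzero as long as the two columns are not parallel
    ay, by = dot(a, y), dot(b, y)
    return [(ay * bb - ab * by) / det, (aa * by - ab * ay) / det]


def sparsest_solution(A, y, eps=1e-9, max_support=2):
    """Return a vector with the fewest nonzeros such that ||y - A x||_2 <= eps.

    Only supports of size 0, 1, ..., max_support (at most 2) are tried;
    None is returned if no such vector exists.
    """
    m, n = len(A), len(A[0])
    columns = [[A[i][j] for i in range(m)] for j in range(n)]
    if dot(y, y) ** 0.5 <= eps:
        return [0.0] * n
    for k in range(1, max_support + 1):
        for support in combinations(range(n), k):
            coef = least_squares([columns[j] for j in support], y)
            residual = [
                y[i] - sum(c * columns[j][i] for c, j in zip(coef, support))
                for i in range(m)
            ]
            if dot(residual, residual) ** 0.5 <= eps:
                x = [0.0] * n
                for c, j in zip(coef, support):
                    x[j] = c
                return x
    return None


# Hypothetical example: m = 4 measurements of a vector with N = 8 coordinates.
# Column t of A is (1, t, t^2, t^3), so any four columns are linearly independent.
A = [[float(t ** i) for t in range(1, 9)] for i in range(4)]
x_true = [0.0, -1.0, 0.0, 0.0, 0.0, 2.5, 0.0, 0.0]
y = [dot(row, x_true) for row in A]
x_hat = sparsest_solution(A, y)
print(x_hat)
```

The exhaustive search is exponential in the support size, which is why practical solvers such as OMP or BP are used instead; the sketch only shows that fewer measurements than coordinates can suffice when the unknown has few nonzero entries.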
For all of this to work, we need to make precise what the “significant components” of $x$ are, a notion that has traditionally translated into talk of sparsity. However, we show in example 1 that the commonly used notion of sparsity (the number of nonzero entries in a vector) is defective, and we propose instead in section 2 a refined notion of sparsity, $s$, that extends the traditional meaning of the word as used in the compressed sensing and sparse representation literatures. The definition of $s$ is based on the weak $\ell^p$ norm, which we define and study in section 2.1. The weak $\ell^p$ norm helps us define, for a given $p$, the sparsity function $s_p$ and the sparsity relation $<_{s_p}$, which induces a strict partial order on $\mathbb{R}^N$. We show that, for a given vector $x$, $s_p(x)$ is a convex function of $p$, and we use this fact to effectively compute $s(x)$, which we define as the sparsity of $x$. See section 2.2.
In section 2.3 we study unitary transformations and the sparsity $s$, which we use to define sparsifying transforms and their properties, formalizing well known energy shifting properties of unitary transforms commonly used in compression, for example. This leads in section 2.4 to the study of error analysis and sparsity when we truncate the signal representation of a vector under a sparsifying unitary transform. This is done in terms of the peak signal-to-noise ratio, or PSNR, for which we find a lower bound in terms of sparsity.
This error analysis and musings on information theoretical matters in appendix A motivate the definition of the sparsity index in section 2.5, which we use in the context of compressed sensing image reconstruction, by example of the single pixel camera, which is described in detail in section 3. In section 3.1 we provide background on the origin of the single pixel camera; in section 3.2 we provide a physical realization and mathematical modeling of a single pixel camera, and show how to obtain an image from it both in an inefficient way, section 3.2.1, and the compressed sensing way, section 3.2.2. In section 3.3 we show how to solve the single pixel camera compressed sensing problem with either the orthogonal matching pursuit algorithm (OMP), section 3.3.1, or the more efficient and better basis pursuit algorithm (BP), section 3.3.2, for which, in appendix B, we provide the specific methods that we use to implement it. The characteristics of OMP help us tie in the use of the sparsity index with the calculation of a lower bound of the PSNR of the various compressed sensing image reconstructions conducted in section 4 with BP, given that the solutions obtained with OMP and BP are close. Our results show that we can predict the quality of the reconstruction of images with very good accuracy without knowledge of the original, i.e., we show that we have in the sparsity index a reference-free tool to decide when to sample a given region at a higher rate to guarantee a minimum local PSNR reconstruction.
2 Sparsity
In this section we define the weak norm, go over some of its properties, and use it to redefine the notion of sparsity, which in common parlance refers to the counting of nonzero entries in a vector. We do this because we show with an example why the commonly used notion of sparsity is not fully satisfactory, and propose instead a new measure of sparsity that utilizes the weak norm, mentioned as a measure of sparsity in [6], and used in that capacity in, for example, [12] and [8]. We then derive some properties of this measure of sparsity.
2.1 The weak norm and its properties
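It is easy to see that given a vector $x = (x_1, \ldots, x_N) \in \mathbb{R}^N$, there exists a unique vector $\tilde{x} = (\tilde{x}_1, \ldots, \tilde{x}_N) \in \mathbb{R}^N$ satisfying the following two properties:

For all $i \in \{1, \ldots, N\}$, there is a $j = \sigma(i)$, with $\sigma$ a permutation of $\{1, \ldots, N\}$, such that $\tilde{x}_i = |x_j|$, and

For all $i \le j$ we have that $\tilde{x}_i \le \tilde{x}_j$.
These two properties naturally define the ordering operator $\operatorname{ord}$, which assigns to $x$ its corresponding $\tilde{x}$. We then write $\tilde{x} = \operatorname{ord}(x)$, and say that $\tilde{x}$ is the ordering of $x$.
Definition 1 (Weak $\ell^p$ norm)
Let $x \in \mathbb{R}^N$ and $p > 0$. We define the weak $\ell^p$ norm of vector $x$ as the number
$\|x\|_{w\ell^p} = \sup_{\epsilon > 0} \epsilon \, \nu(x, \epsilon)^{1/p}$,
where $\nu(x, \epsilon) = \#\{ i : |x_i| > \epsilon \}$.
We are interested in the weak $\ell^p$ norm because, for small values of $p$ and a given vector $x$, the quantity $\|x\|_{w\ell^p}^p$ can be used as a measure of sparsity of $x$. We elaborate on this later on. First, we address how to effectively compute $\|x\|_{w\ell^p}$.
Theorem 1
Given a vector $x \in \mathbb{R}^N$, and $p > 0$, we have that
(3) $\|x\|_{w\ell^p} = \left( \max_{1 \le i \le N} (N - i + 1) \, \tilde{x}_i^{\,p} \right)^{1/p}$,
where $\tilde{x} = \operatorname{ord}(x)$. We define the index $i^*$ as the smallest index where the right hand side of eq. 3 reaches its maximum.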
Proof 1
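The statement is trivially true for $x = 0$. Assume then that $x$ is a nonzero vector with corresponding ordering $\tilde{x}$. First observe that, for a given $\epsilon > 0$, the number $\nu(x, \epsilon)$ of entries in $x$ that are greater in absolute value than $\epsilon$ does not depend on the order in which we count them. Therefore, for a given $\epsilon$, we have that $\nu(x, \epsilon) = \nu(\tilde{x}, \epsilon)$.
Since $x \neq 0$, there is an integer $i$ such that $\tilde{x}_i > 0$. Let $i_0$ be the smallest of such integers. Consider the following partition of $(0, \infty)$: $P = \{ (0, \tilde{x}_{i_0}), [\tilde{x}_{i_0}, \tilde{x}_{i_0+1}), \ldots, [\tilde{x}_{N-1}, \tilde{x}_N), [\tilde{x}_N, \infty) \}$, where empty intervals are discarded. We compute the supremum of $\epsilon \, \nu(\tilde{x}, \epsilon)^{1/p}$ over each of the intervals defining $P$. For $\epsilon \in (0, \tilde{x}_{i_0})$, we have that $\nu(\tilde{x}, \epsilon) = N - i_0 + 1$, and since raising a number to the power $1/p$ is a monotonically increasing operation, we clearly have that the supremum is $(N - i_0 + 1)^{1/p} \, \tilde{x}_{i_0}$. Similarly, for $\epsilon \in [\tilde{x}_i, \tilde{x}_{i+1})$ with $i_0 \le i < N$, we have that $\nu(\tilde{x}, \epsilon) = N - i$, and the supremum is $(N - i)^{1/p} \, \tilde{x}_{i+1}$. Finally, for $\epsilon \in [\tilde{x}_N, \infty)$, we have that $\nu(\tilde{x}, \epsilon) = 0$, and therefore the supremum is $0$. The result follows from observing that the supremum of $\epsilon \, \nu(\tilde{x}, \epsilon)^{1/p}$ over $(0, \infty)$ is the maximum of the suprema over each of the intervals in $P$, and that the terms with $i < i_0$ in eq. 3 vanish.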
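As an illustration of how the sorted entries give the weak norm directly, the Python sketch below computes it in two ways for a small hypothetical vector: from the nondecreasing ordering of the absolute values, and from a crude grid approximation of the supremum. It assumes the standard definition of the weak $\ell^p$ norm, $\sup_{\epsilon > 0} \epsilon \cdot \#\{i : |x_i| > \epsilon\}^{1/p}$; the function names and the test vector are ours.

```python
def ordering(x):
    """Absolute values of the entries of x, sorted in nondecreasing order."""
    return sorted(abs(v) for v in x)


def weak_norm(x, p):
    """Weak l^p norm from the ordering, plus the smallest maximizing index (1-based)."""
    xt = ordering(x)
    n = len(xt)
    # Rectangle areas: width = number of entries from position i on, height = xt[i]^p.
    areas = [(n - i) * xt[i] ** p for i in range(n)]
    best = max(areas)
    i_star = areas.index(best) + 1
    return best ** (1.0 / p), i_star


def weak_norm_sup(x, p, steps=20000):
    """Grid approximation of sup over eps > 0 of eps * #{i : |x_i| > eps}^(1/p)."""
    top = max(abs(v) for v in x)
    best = 0.0
    for k in range(1, steps):
        eps = top * k / steps
        count = sum(1 for v in x if abs(v) > eps)
        best = max(best, eps * count ** (1.0 / p))
    return best


x = [3, -1, 0, 2, 0.5]  # hypothetical test vector
p = 0.5
norm, i_star = weak_norm(x, p)
approx = weak_norm_sup(x, p)
print(norm, i_star, approx)
```

The grid value approaches the sorted-entries value from below, since the supremum is not attained when the count uses a strict inequality.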
We state without proof the following properties of the weak norm, derived from theorem 1.
Theorem 2
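Let $x \in \mathbb{R}^N$, $\alpha \in \mathbb{R}$, and $p > 0$. Then

$\|x\|_{w\ell^p} \ge 0$.

$\|x\|_{w\ell^p} = 0$ if and only if $x = 0$.

$\|\alpha x\|_{w\ell^p} = |\alpha| \, \|x\|_{w\ell^p}$.

The weak $\ell^p$ norm does not satisfy the triangle inequality.

$\|x\|_{w\ell^p} \le \|x\|_p$, where $\|\cdot\|_p$ is the $\ell^p$ norm.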
Therefore, the weak $\ell^p$ norm is not a true norm, but almost: it is a quasi-norm. For simplicity, however, we will refer to it as a “norm”. We now explore two more properties of the weak $\ell^p$ norm that will be relevant later on.
From the result of theorem 1, we observe that the $p$-th power of the weak $\ell^p$ norm of a vector $x$ corresponds to the largest area of a rectangle of width $N - i + 1$ and height $\tilde{x}_i^{\,p}$, where $1 \le i \le N$. Recall that in theorem 1 we defined $i^*$ to be the smallest index for which this maximal area is achieved; we will use $i^*$ often. For a graphic representation of this concept, see figs. 1 and 2.
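Note that for any value of $p > 0$, for a given $i$, the width $N - i + 1$ of the rectangle does not depend on $p$; while $\tilde{x}_i^{\,p}$ tends to either 1 or 0, depending on whether $\tilde{x}_i > 0$ or $\tilde{x}_i = 0$, respectively, i.e., $\tilde{x}_i^{\,p}$ tends to the characteristic function $\chi_{(0,\infty)}(\tilde{x}_i)$ as $p$ goes to zero. Here $\chi_{(0,\infty)}(t) = 1$ if $t > 0$, and $\chi_{(0,\infty)}(t) = 0$ otherwise. It follows that $\lim_{p \to 0^+} \max_{1 \le i \le N} (N - i + 1) \, \tilde{x}_i^{\,p} = \max_{1 \le i \le N} (N - i + 1) \, \chi_{(0,\infty)}(\tilde{x}_i)$, which is the number of nonzero entries of $x$. We conclude from the previous two paragraphs that
(4) $\lim_{p \to 0^+} \|x\|_{w\ell^p}^p = \|x\|_0$.
Hence, in the case when $p \to 0$, the $p$-th power of the weak $\ell^p$ norm tends to the $\ell^0$ norm, which counts the nonzero entries of a vector, as defined in eq. 4. Note that the $\ell^0$ norm is not a norm either, since $\|\alpha x\|_0 \neq |\alpha| \, \|x\|_0$ for $x \neq 0$ when $|\alpha| \notin \{0, 1\}$, but it is commonly called a “norm” nonetheless.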
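The limit can be observed numerically. The Python sketch below evaluates the $p$-th power of the weak norm, computed from the sorted absolute values as $\max_i (N - i + 1)\,\tilde{x}_i^{\,p}$, for decreasing values of $p$ and compares it with the count of nonzero entries. The test vector and the function name `weak_norm_pow` are hypothetical choices made for illustration.

```python
def weak_norm_pow(x, p):
    """p-th power of the weak l^p norm, from the nondecreasing ordering of |x_i|."""
    xt = sorted(abs(v) for v in x)
    n = len(xt)
    return max((n - i) * xt[i] ** p for i in range(n))


x = [0.0, 0.0, 5.0, -0.01, 2.0]        # hypothetical vector with three nonzero entries
l0 = sum(1 for v in x if v != 0)       # count of nonzero entries

values = [(p, weak_norm_pow(x, p)) for p in (1.0, 0.1, 0.01, 1e-6)]
for p, value in values:
    print(p, value)
print("nonzero entries:", l0)
```

Note that the convergence need not be monotone in $p$: the small entry $-0.01$ only starts to be counted once $p$ is small enough for $0.01^p$ to be close to 1.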
However, the $\ell^0$ norm, whose count of nonzero entries is the usual definition of sparsity, lacks nuance in cases where a vector has relatively few entries that are considerably larger than the rest in absolute value, a circumstance which we would like to distinguish for reasons that will become clear later on. With this in mind, we propose a new definition and measure of sparsity next.
2.2 Defining and measuring sparsity
In common parlance, as we mentioned in section 2.1, we say that a vector is sparse if its norm is smaller than . In other words,
(5) 
As argued above, though, this measure of sparsity will not distinguish the following two vectors in as radically different:
Example 1
Consider and . From the norm point of view, they are both sparse, moreover, their norms are equal, yet, most of the entries of are 1, while most of the entries of are practically 0.
Clearly, the notion of sparsity defined by eq. 5 cannot distinguish the very different nature of these two vectors, and .
Note that in the example above, we deliberately chose both vectors to have approximately equal energy, if we define the energy of a vector as . With these observations in hand, we put forth the following definitions.
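Definition 2 (Sparsity $s_p$ and sparsity relation $<_{s_p}$)
Let $p > 0$. Consider the set $\mathbb{R}^N$, and define the binary relation $<_{s_p}$ as follows:
$<_{s_p} = \{ (x, y) \in \mathbb{R}^N \times \mathbb{R}^N : \|x\|_2 = \|y\|_2 \text{ and } \|x\|_{w\ell^p}^p < \|y\|_{w\ell^p}^p \}$.
We call $<_{s_p}$ the sparsity relation (for $\mathbb{R}^N$ of order $p$), and will write $x <_{s_p} y$ for simplicity whenever $(x, y) \in <_{s_p}$. If $x <_{s_p} y$, we say that $x$ is sparser than $y$. We say that $x$ has sparsity of order $p$ equal to $s_p(x) = \|x\|_{w\ell^p}^p$, or simply, that $x$ has $p$-sparsity $s_p(x)$.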
Theorem 3
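Let $p > 0$, then $(\mathbb{R}^N, <_{s_p})$ is a strict partially ordered set.
Proof 2
Let $p > 0$, and write $s_p(x) = \|x\|_{w\ell^p}^p$. For all $x \in \mathbb{R}^N$, we have that $<_{s_p}$ is trivially irreflexive, i.e., $x \not<_{s_p} x$, since $s_p(x) = s_p(x)$, hence $s_p(x) \not< s_p(x)$. Let $x, y, z \in \mathbb{R}^N$ and assume that $x <_{s_p} y$ and $y <_{s_p} z$. Then, by definition, we must have that $\|x\|_2 = \|y\|_2$ and $\|y\|_2 = \|z\|_2$, as well as $s_p(x) < s_p(y)$ and $s_p(y) < s_p(z)$. Since both $=$ and $<$ are transitive in $\mathbb{R}$, we have that $\|x\|_2 = \|z\|_2$ and $s_p(x) < s_p(z)$, and therefore $x <_{s_p} z$, i.e., $<_{s_p}$ is transitive.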
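The following Python sketch illustrates the sparsity relation on a few hypothetical vectors. It assumes, in line with the proof above, that one vector is sparser than another when both have the same energy (compared here through the Euclidean norm, up to a floating point tolerance) and the first has a strictly smaller $p$-th power of the weak norm. The helper names `s_p`, `energy`, and `sparser` are ours.

```python
import math


def s_p(x, p):
    """p-th power of the weak l^p norm, from the nondecreasing ordering of |x_i|."""
    xt = sorted(abs(v) for v in x)
    n = len(xt)
    return max((n - i) * xt[i] ** p for i in range(n))


def energy(x):
    """Euclidean norm, used here as the energy level of x (an assumption)."""
    return math.sqrt(sum(v * v for v in x))


def sparser(x, y, p, rel_tol=1e-12):
    """True when x and y have equal energy and x has strictly smaller p-sparsity."""
    same_energy = math.isclose(energy(x), energy(y), rel_tol=rel_tol)
    return same_energy and s_p(x, p) < s_p(y, p)


p = 0.5
x = [2.0, 0.0, 0.0, 0.0]                         # all energy in one entry
z = [math.sqrt(2.0), math.sqrt(2.0), 0.0, 0.0]   # energy split over two entries
y = [1.0, 1.0, 1.0, 1.0]                         # energy spread evenly
w = [1.0, 0.0, 0.0, 0.0]                         # different energy level

print(sparser(x, z, p), sparser(z, y, p), sparser(x, y, p))  # a chain x, z, y
print(sparser(x, x, p))   # irreflexivity
print(sparser(w, y, p))   # not comparable: energies differ
```

Vectors at different energy levels are never comparable, which is why the order is only partial.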
When we have a partially ordered set, e.g., $(\mathbb{R}^N, <_{s_p})$, we are usually interested in knowing if there are maximal or minimal elements in it with respect to its ordering. Assuming the Axiom of Choice in the form of Zorn’s lemma, which states that a partially ordered set in which every chain (i.e., every totally ordered subset) has an upper (lower) bound necessarily contains at least one maximal (minimal) element, we would then set out to find upper (lower) bounds in $\mathbb{R}^N$ for each energy level to conclude that there exist maximal (minimal) elements in $\mathbb{R}^N$ with respect to the partial order $<_{s_p}$. We leave the task of establishing the existence of maximal or minimal elements in $(\mathbb{R}^N, <_{s_p})$ for another occasion, since this departs from the focus of our endeavors.
Note that the proof of theorem 3 does not use anywhere that and is, in fact, true for any . However, given the aforementioned observations stemming from eq. 4 and eq. 5, it is clear that measuring sparsity with and comparing the sparsity of two vectors with the sparsity relation , are meaningful and sensible concepts only when . Therefore, going forward, we will assume that , unless otherwise noted.
Theorem 4 (Convexity of as a function of )
For all , the function that maps is a convex function. Moreover, if is the ordering of and is such that , then is strictly convex. (See theorem 1 for the definition of .)
Proof 3
Let , , , and . We have that, by theorem 1,
(6) 
where is the ordering of . If we prove that, for all ,
(7) 
combining eq. 3 and eq. 7, it follows that,
(8)  
proving that is convex. Hence, we proceed to prove eq. 7. Let , and define the functions and . We then have that,
That is, both functions coincide at values . Noting that the graph of is a line, and observing that , it follows that is convex, and conclude that for all . Setting and , we get that for ,
(9) 
proving that eq. 7 holds, as required to complete the first half of the proof. For the second half of the claim, simply note that if is such that , eqs. 3, 8 and 7 become strict inequalities, resulting then in strict convexity for .
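The convexity claim can be checked numerically. The Python sketch below tests the defining inequality of convexity on a grid of exponents and interpolation weights, assuming that the function in question is $p \mapsto \max_i (N - i + 1)\,\tilde{x}_i^{\,p}$, i.e., the $p$-th power of the weak norm written in terms of the ordering $\tilde{x}$. The grid, the tolerance, and the test vector are hypothetical choices; a passing check is consistent with the theorem but is of course not a proof.

```python
def s_p(x, p):
    """p-th power of the weak l^p norm, from the nondecreasing ordering of |x_i|."""
    xt = sorted(abs(v) for v in x)
    n = len(xt)
    return max((n - i) * xt[i] ** p for i in range(n))


def convex_on_grid(x, ps, lambdas, tol=1e-12):
    """Check s_r(x) <= lam * s_p(x) + (1 - lam) * s_q(x) with r = lam*p + (1-lam)*q."""
    for p in ps:
        for q in ps:
            for lam in lambdas:
                r = lam * p + (1.0 - lam) * q
                bound = lam * s_p(x, p) + (1.0 - lam) * s_p(x, q)
                if s_p(x, r) > bound + tol:
                    return False
    return True


x = [3, -1, 0, 2, 0.5]                      # hypothetical test vector
ps = [0.05 * k for k in range(1, 41)]       # exponents 0.05, 0.10, ..., 2.00
lambdas = [0.1 * k for k in range(11)]      # weights 0.0, 0.1, ..., 1.0
result = convex_on_grid(x, ps, lambdas)
print(result)
```

Each term $(N - i + 1)\,\tilde{x}_i^{\,p}$ is an exponential function of $p$, and a maximum of convex functions is convex, which is the structure the check exercises.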
Definition 3 (Sparsity )
We define the sparsity as the function that assigns to every vector the number
To check that is well defined, we simply need to prove that for every vector , . Let . Then, from theorem 2 and definition 2, we have that
hence the set is bounded and, therefore, the number exists and is unique, which means that is well defined. Moreover, in light of theorem 4, computing can be easily achieved by convex minimization techniques.
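As a sketch of the convex minimization mentioned above, the following Python code approximates the sparsity of a vector by golden-section search over the exponent. It rests on assumptions that should be stated plainly: the sparsity is taken to be the minimum over $p$ of $\max_i (N - i + 1)\,\tilde{x}_i^{\,p}$, the search interval is restricted to $[10^{-6}, 1]$, and the two test vectors are hypothetical. The function names are ours.

```python
import math


def s_p(x, p):
    """p-th power of the weak l^p norm, from the nondecreasing ordering of |x_i|."""
    xt = sorted(abs(v) for v in x)
    n = len(xt)
    return max((n - i) * xt[i] ** p for i in range(n))


def golden_min(f, a, b, tol=1e-10):
    """Golden-section search for a minimizer of a convex function f on [a, b]."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc <= fd:
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
    p = (a + b) / 2.0
    return p, f(p)


def sparsity(x, p_min=1e-6, p_max=1.0):
    """Approximate minimizer p* and minimum value of s_p(x) over [p_min, p_max]."""
    return golden_min(lambda p: s_p(x, p), p_min, p_max)


flat = [1.0, 1.0, 1.0, 1.0]            # hypothetical: all entries equal
peaked = [10.0, 0.001, 0.001, 0.001]   # hypothetical: one dominant entry

p_flat, s_flat = sparsity(flat)
p_peak, s_peak = sparsity(peaked)
print(s_flat, p_peak, s_peak)
```

For the peaked vector the objective is the maximum of a decreasing curve, $4 \cdot 0.001^p$, and an increasing one, $10^p$, so the minimum sits at their crossing, $p = \ln 4 / \ln 10^4$, where the value is $\sqrt{2}$. The flat vector gives the value 4 for every $p$, matching its count of nonzero entries.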
It is easy to see that has the following properties, which we state without proof.
Theorem 5
Let , and its ordering. Then,

.

If is such that , i.e., the ordering of is a vector in the diagonal of , then
where . Recall that, by definition of , we must have .

If is such that , then .
With this new definition of sparsity in hand, we revisit example 1 by computing and , recalling that and . This calculation requires from us to compute repeatedly, for which we refer the reader to theorem 1 on how to do it from now on.
We have that , and therefore , which, we note, is equal to . Now for , we have that . If we draw the graph of as a function of , we see that it is the union of two curves and , with and , where is the abscissa such that , readily seen as the minimum of over . It is easy to compute that , from which , faithfully reflecting the fact that most of the entries in are practically zero, except for one of them, which is distinctly nonzero. See fig. 3.
2.3 Unitary transforms and sparse representations
In this section we use our new definition of sparsity to explore unitary transforms and sparse representations stemming from them, which we define next.
Definition 4 (Sparsifying transform and sparse representation)
Let
be a unitary matrix, and consider the transform
that assigns to every vector the vector . We say that is a sparsifying transform for if and only if for every vector , we have thatIn this case we say that is a sparsifying matrix for , is a sparse representation of (under ), and admits ( as) a sparse representation (under ).
Note that the notion of a sparsifying transform is well defined since it applies to unitary matrices, which preserve energy, i.e., $\|Ux\|_2 = \|x\|_2$ for all $x$, and therefore $x$ and $Ux$ can be compared by the sparsity relation $<_{s_p}$; see definition 2.
Theorem 6 (Sparsity and energy distribution)
Let be a sparsifying matrix for , and let be a vector whose transform is . If and are the orderings of and , respectively, then there exists an integer such that and for all . Moreover,
(10) 
Proof 4
Let , and . Since is sparsifying for , we have, by definition 4, that , from which, by theorem 1,
hence . Therefore, the set . Let , be the largest integer in , from which it follows that,
(11) 
Now, since is a unitary matrix, we must have that , from which,
from which the inequalities in eq. 10 are easily derived.
Observe that theorem 6 tells us that the energy in a signal gets redistributed into potentially fewer coefficients of its sparse representation , when is a sparsifying matrix for . We can colloquially say that the energy got squeezed to the right in the ordering of the transform when compared to the ordering of the signal. See fig. 4, for example.
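A small numerical illustration of this energy redistribution, under assumptions of our own choosing: the Python sketch below applies a $4 \times 4$ orthonormal Haar matrix (a hypothetical choice, not a transform used in the text) to a nearly constant signal, and compares the energy and the $p$-th power of the weak norm before and after. The helper names are ours.

```python
import math


def s_p(x, p):
    """p-th power of the weak l^p norm, from the nondecreasing ordering of |x_i|."""
    xt = sorted(abs(v) for v in x)
    n = len(xt)
    return max((n - i) * xt[i] ** p for i in range(n))


def energy(x):
    return math.sqrt(sum(v * v for v in x))


def matvec(U, x):
    return [sum(u * v for u, v in zip(row, x)) for row in U]


r = 1.0 / math.sqrt(2.0)
# Orthonormal Haar matrix of size 4: its rows form an orthonormal basis.
U = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
    [r, -r, 0.0, 0.0],
    [0.0, 0.0, r, -r],
]

x = [4.0, 4.0, 4.0, 3.9]   # hypothetical, nearly constant signal
u = matvec(U, x)

print("energy:", energy(x), energy(u))
print("ordering of x:", sorted(abs(v) for v in x))
print("ordering of Ux:", sorted(abs(v) for v in u))
for p in (0.1, 0.5, 1.0):
    print(p, s_p(x, p), s_p(u, p))
```

In the ordering of the transform almost all of the energy sits in the last coefficient, while the energy of the original signal is spread evenly over its four entries; for each tested $p$ the transform has the smaller value.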
Theorem 7
Let be a sparsifying matrix for , and let be a vector whose transform is . Then . Moreover, if , then .
Proof 5
Let and . Since is a sparsifying matrix for , . But, by definition, , hence is a lower bound for . Therefore, . Hence, , where .
Now assume that . Definition 2 and eq. 4 imply that . By definition, this means that for all , there exists a such that for all , . Let , then there exists a such that for all we have that,
Since is a continuous function of , and is compact, there exists a such that , therefore . Since is a sparsifying matrix for , we have that . Hence, .
In a similar but opposite observation to what happens to the energy in view of theorem 6, here, the sparsity of the transform of a signal under a sparsifying matrix gets shifted to the left of the sparsity value of said signal.
In the proof of theorem 7, given a vector , we used the notation to talk about a value of for which reaches its minimum as a function of on , resulting in . If is such that , by theorem 4, is strictly convex as a function of , making unique in this case. When , from eq. 4, we can set , and think of as the unique value of for which . These results and observations can be summarized in the following theorem.
Theorem 8
Given a vector , if , then there is a unique such that . Recall that, , and that is the smallest integer such that .
A couple of remarks are in order. The rather technical condition in theorem 8, for a given vector , that
(12) 
is necessary for there to be a unique value such that , is not uncommon when is a random or semistructured vector. We don’t have a proof of this statement, but it is our empirical observation that all vectors that are the transform of some real life vector , such as a natural image, under a unitary matrix , satisfy the condition summarized in eq. 12. Moreover, even if the set of minimizers of