
Analysis of Decimation on Finite Frames with Sigma-Delta Quantization

In analog-to-digital (A/D) conversion of oversampled bandlimited functions, integrating samples in blocks before down-sampling, known as decimation, has been proven to greatly improve the efficiency of data storage while maintaining high accuracy. In particular, when coupled with the ΣΔ quantization scheme, the reconstruction error decays exponentially with respect to the bit-rate. Here, a similar result is proved for finite unitarily generated frames. Specifically, under certain constraints on the generator, decimation merely increases the reconstruction error estimate by a factor of π/2, independent of the block size. On the other hand, decimation makes efficient encoding of the samples possible, and thus the error decays exponentially with respect to the total bits used for data storage. Moreover, decimation on finite frames has a multiplicative structure that allows the process to be broken into successive iterations of decimation with smaller blocks, which opens up the possibility of parallel computation and signal transmission through multiple nodes.





1. Introduction

1.1. Background and Motivation

Analog-to-digital (A/D) conversion is a process where bandlimited signals, e.g., audio signals, are digitized for storage and transmission, which is feasible thanks to the classical sampling theorem. In particular, the theorem indicates that discrete sampling is sufficient to capture all features of a given bandlimited signal, provided that the sampling rate is higher than the Nyquist rate.

Given a function $f \in L^1(\mathbb{R})$, its Fourier transform $\hat{f}$ is defined as
$$\hat{f}(\xi) = \int_{\mathbb{R}} f(t)\, e^{-2\pi i \xi t}\, dt.$$
The Fourier transform can also be uniquely extended to $L^2(\mathbb{R})$ as a unitary transformation.

Definition 1.1.

Given $f \in L^2(\mathbb{R})$, $f$ is said to be bandlimited with bandwidth $\sigma$ if its Fourier transform is supported in $[-\sigma, \sigma]$.

An important component of A/D conversion is the following theorem:

Theorem 1.2 (Classical Sampling Theorem).

Given $f \in L^2(\mathbb{R})$ bandlimited with bandwidth $\sigma$, for any $g \in L^2(\mathbb{R})$ satisfying

  • $\hat{g} = 1$ on $[-\sigma, \sigma]$

  • $\hat{g}(\xi) = 0$ for $|\xi| \geq \frac{1}{2T}$,

with $0 < T \leq \frac{1}{2\sigma}$, one has
$$f(t) = T \sum_{n \in \mathbb{Z}} f(nT)\, g(t - nT),$$
where the convergence is both uniform on compact sets of $\mathbb{R}$ and in $L^2(\mathbb{R})$.

In particular, for $T = \frac{1}{2\sigma}$ and $g(t) = 2\sigma\, \mathrm{sinc}(2\sigma t)$, the following identity holds in $L^2(\mathbb{R})$:
$$f(t) = \sum_{n \in \mathbb{Z}} f\!\left(\tfrac{n}{2\sigma}\right) \mathrm{sinc}(2\sigma t - n).$$
However, the discrete nature of digital data storage makes it impossible to store the samples $f(nT)$ exactly. Instead, the quantized samples $q_n$, chosen from a pre-determined finite alphabet $\mathcal{A}$, are stored. This results in the following reconstructed signal:
$$\tilde{f}(t) = T \sum_{n \in \mathbb{Z}} q_n\, g(t - nT).$$
As for the choice of the quantized samples $q_n$, we shall discuss the following two schemes:

  • Pulse Code Modulation (PCM):

    Quantized samples are taken as the direct round-off of the current sample, i.e., $q_n = \arg\min_{q \in \mathcal{A}} |f(nT) - q|$.

  • ΣΔ Quantization:

    A sequence of auxiliary variables $(u_n)$ is introduced for this scheme. The pair $(q_n, u_n)$ is defined recursively as
    $$q_n = Q(u_{n-1} + f(nT)), \qquad u_n = u_{n-1} + f(nT) - q_n,$$
    where $Q$ denotes the round-off to the nearest element of $\mathcal{A}$.

ΣΔ quantization was introduced in 1963 [15] and is still widely used, due to its advantages over PCM. Specifically, ΣΔ quantization is robust against hardware imperfection [9], a decisive weakness of PCM. For ΣΔ quantization, and the more general noise shaping schemes to be explained below, the boundedness of the auxiliary variables $u_n$ turns out to be essential, as most analyses of quantization problems rely on it for error estimation. Schemes with bounded auxiliary variables are said to be stable.
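As a concrete illustration of stability, the following minimal sketch (our own, not from the paper) runs the first-order greedy ΣΔ recursion with the 1-bit alphabet {−1, +1}; a short induction shows that |u_n| ≤ 1 whenever u_0 = 0 and the input samples satisfy |y_n| ≤ 1:

```python
import numpy as np

def sigma_delta_1bit(y):
    """First-order greedy Sigma-Delta with the 1-bit alphabet {-1, +1}:
    q_n = sign(u_{n-1} + y_n), u_n = u_{n-1} + y_n - q_n, with u_0 = 0."""
    u, q, state = np.empty(len(y)), np.empty(len(y)), 0.0
    for n, yn in enumerate(y):
        w = state + yn
        q[n] = 1.0 if w >= 0 else -1.0   # 1-bit mid-rise quantizer
        state = w - q[n]                 # the auxiliary variable u_n
        u[n] = state
    return q, u

# Oversampled values of a signal with |y_n| <= 1.
t = np.arange(0, 1, 1 / 512)
y = 0.8 * np.sin(2 * np.pi * 3 * t)
q, u = sigma_delta_1bit(y)

# Stability: |u_n| <= 1 for all n, by induction on the recursion.
assert np.max(np.abs(u)) <= 1.0
```

The boundedness of the state variable is exactly what the error estimates below rely on; PCM carries no such state, which is why it lacks this robustness mechanism.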

Despite its merits over PCM, ΣΔ quantization merely produces linear error decay with respect to the bits used, as opposed to the exponential error decay produced by its counterpart PCM. Thus, it is desirable to generalize ΣΔ quantization for higher-order error decay.

Given $r \in \mathbb{N}$, one can consider an $r$-th order ΣΔ quantization scheme as investigated by Daubechies and DeVore:

Theorem 1.3 (Higher Order ΣΔ Quantization, [8]).

Consider the following stable $r$-th order ΣΔ quantization scheme
$$(\Delta^r u)_n = f(nT) - q_n,$$
where $(q_n)$ and $(u_n)$ are the quantized samples and auxiliary variables, respectively. Then, for all $t \in \mathbb{R}$,
$$|f(t) - \tilde{f}(t)| \lesssim_r T^r.$$
The existence of such schemes is also proven in the same paper. This improves the error decay rate from linear to arbitrary polynomial order while preserving the advantages of a first-order ΣΔ quantization scheme.
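The defining relation of an $r$-th order scheme can be verified numerically. The sketch below is only an illustration of the difference structure $(\Delta^2 u)_n = y_n - q_n$ for $r = 2$ with a naive greedy rule; guaranteeing stability for higher orders requires the more careful constructions of [8]:

```python
import numpy as np

def greedy_sd2(y, alphabet):
    """Greedy second-order Sigma-Delta: q_n = Q(2u_{n-1} - u_{n-2} + y_n) and
    u_n = 2u_{n-1} - u_{n-2} + y_n - q_n, with u_{-1} = u_{-2} = 0."""
    m = len(y)
    u = np.zeros(m + 2)                    # u[0], u[1] store u_{-2}, u_{-1}
    q = np.zeros(m)
    for n in range(m):
        w = 2 * u[n + 1] - u[n] + y[n]
        q[n] = alphabet[np.argmin(np.abs(alphabet - w))]   # nearest level
        u[n + 2] = w - q[n]
    return q, u[2:]

rng = np.random.default_rng(0)
y = rng.uniform(-0.3, 0.3, size=50)
A = 0.5 * (np.arange(-16, 16) + 0.5)       # a wide mid-rise alphabet
q, u = greedy_sd2(y, A)

# The second-order difference structure holds exactly by construction.
D = np.eye(50) - np.eye(50, k=-1)          # backward difference matrix
assert np.allclose(D @ D @ u, y - q)
```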

From here, a natural question arises: is it possible to generalize the ΣΔ quantization scheme further so that the reconstruction error decay matches the exponential decay of PCM? Two solutions have been proposed for this question. The first is to create new quantization schemes, known as noise shaping quantization schemes. A brief summary of their development is provided in Section 2.

The other possibility is to drastically enhance data storage efficiency while maintaining the same level of reconstruction accuracy, and signal decimation belongs in this category. The process is as follows: given the quantized samples $(q_n)$ of an $r$-th order ΣΔ quantization scheme, one integrates the samples in blocks (iterating the integration $r$ times for an $r$-th order scheme) and then down-samples, keeping only every $\rho$-th entry of the integrated sequence, where $\rho$ is the block size. The process where we convert the quantized samples $(q_n)$ to this down-sampled sequence is called signal decimation.

Decimation has long been known in the engineering community [5], where it was observed that decimation results in exponential error decay with respect to the bit-rate, although the observation remained a conjecture until 2015 [10], when Daubechies and Saab proved the following theorem:

Theorem 1.4 (Signal Decimation for Bandlimited Functions, [10]).

Given , , and , there exists a function such that


Moreover, the number of bits needed for each Nyquist interval is



From (3) and (4), we can see that the reconstruction error after decimation still decays polynomially with respect to the sampling frequency. As for data storage, the number of bits needed per Nyquist interval drops from linear to logarithmic in the oversampling rate. Thus, the reconstruction error decays exponentially with respect to the bits used.

Figure 1. Illustration of the first-order decimation scheme: taking averages of nearby samples before down-sampling. The effect on the reconstruction is illustrated in parentheses.

1.2. Results and Outlines

In this paper, we formulate and prove an extension of Theorem 1.4 to finite frames, namely Theorem 3.4. In particular, we propose a process called alternative decimation, and with this process we obtain an exponential error decay rate with respect to the bit-rate.

To provide the necessary background for the results, we include preliminaries on signal quantization theory for finite frames in Section 2. We first define ΣΔ quantization on finite frames in Section 2.1. Most of the subsequent generalizations of ΣΔ quantization are categorized as noise shaping schemes, whose formal definition is given in Section 2.2. We also introduce a specific class of finite frames, called unitarily generated frames, together with its applications in Section 2.3. Relevant literature is presented in Section 2.4.

In Section 3, we define alternative decimation and present our main results. Theorem 3.3 is a special case of Theorem 3.4, where we restrict ourselves to finite harmonic frames, a subclass of unitarily generated frames. The full generality of our result is given in Theorem 3.4 for unitarily generated frames and is generalized to the second order in Theorem 3.9. Moreover, the multiplicative structure of decimation is proven in Theorem 3.7, which enables us to perform decimation iteratively.

In Section 4, we derive the properties of alternative decimation needed for our proof. Then, we prove Theorems 3.3, 3.4, and 3.7 in Sections 5, 6, and 7, respectively. Finally, we demonstrate the generalization of alternative decimation to the second order, namely Theorem 3.9, in Section 8. Generalization to even higher orders is not possible under the current construction, and we present the main difficulty in Appendix A. Numerical experiments are given in Appendix B.

2. Preliminaries on Finite Frame Quantization

Signal quantization theory on finite frames is well motivated. Most data that are stored digitally are different from audio signals in nature: instead of having finite bandwidth, the objects of interest are more naturally represented in terms of finite frames. This prompts the development of quantization theory for finite frames.

2.1. ΣΔ Quantization on Finite Frames

Fix a Hilbert space $\mathcal{H}$ along with a set of vectors $\{f_i\}_{i \in I} \subset \mathcal{H}$. The vectors form a frame for $\mathcal{H}$ if there exist constants $0 < A \leq B < \infty$ such that for any $x \in \mathcal{H}$, the following inequality holds:
$$A \|x\|^2 \leq \sum_{i \in I} |\langle x, f_i \rangle|^2 \leq B \|x\|^2.$$

The concept of frames is a generalization of orthonormal bases in a vector space. Unlike bases, frames are usually over-complete: the vectors form a linearly dependent spanning set. Over-completeness is particularly useful in signal processing, as it can be exploited for noise reduction and is more robust against data corruption than an orthonormal basis.

Given a frame $\{f_i\}_{i \in I}$, the linear operator $F$ satisfying $(Fx)_i = \langle x, f_i \rangle$ is called the analysis operator, and its adjoint operator $F^*$, with $F^* c = \sum_{i \in I} c_i f_i$, is called the synthesis operator. Defining the Hermitian operator $S = F^* F$, we have the following reconstruction formulas: for any $x \in \mathcal{H}$,
$$x = S^{-1} F^* F x = \sum_{i \in I} \langle x, f_i \rangle\, S^{-1} f_i.$$

Suppose now that the Hilbert space is finite-dimensional and the frame consists of a finite number of vectors. Then this Hilbert space is isomorphic to a finite-dimensional Euclidean space, and the corresponding analysis operator is also finite-dimensional. Thus, we are able to consider the analysis operator as an $m \times k$ matrix $F$, where the rows of $F$ are $f_i^*$, the conjugate transposes of the frame vectors. The synthesis operator is $F^*$.

Under this framework, one considers the quantization $q$ of the sample $y = Fx$ and reconstructs $\tilde{x} = Gq$, where $G$ is a dual frame of $F$. The frame-theoretic greedy ΣΔ quantization is defined as follows: given a finite alphabet $\mathcal{A}$, consider the auxiliary variables $(u_i)_{i=0}^{m}$, where we shall set $u_0 = 0$. For $i = 1, \ldots, m$, calculate $q_i$ and $u_i$ as follows:
$$q_i = Q(u_{i-1} + y_i), \qquad u_i = u_{i-1} + y_i - q_i,$$
where $Q$ is defined in (2). In matrix form, we have
$$Du = y - q,$$
where $D$ is the backward difference matrix.

For an $r$-th order ΣΔ quantization, we have instead $D^r u = y - q$.

In practice, the alphabet is often chosen to be uniformly spaced and symmetric around the origin: given $L \in \mathbb{N}$ and $\delta > 0$, we define
$$\mathcal{A}_L^{\delta} = \left\{ \left(l - \tfrac{1}{2}\right)\delta : l = -L+1, \ldots, L \right\}.$$
For complex Euclidean spaces, we define $\mathcal{A}_{\mathbb{C}} = \mathcal{A}_L^{\delta} + i\,\mathcal{A}_L^{\delta}$. In both cases, the associated round-off $Q$ is called a mid-rise uniform quantizer. Throughout this paper we shall always deal with such alphabets.
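A minimal sketch (with our own helper names) of the mid-rise alphabet and the greedy frame-theoretic ΣΔ recursion; the final check confirms the matrix form $Du = y - q$, with $D$ the backward difference matrix:

```python
import numpy as np

def midrise_alphabet(L, delta):
    """2L-level mid-rise alphabet {(l - 1/2)*delta : l = -L+1, ..., L}."""
    return delta * (np.arange(-L, L) + 0.5)

def greedy_sigma_delta(y, alphabet):
    """First-order greedy Sigma-Delta:
    q_i = Q(u_{i-1} + y_i), u_i = u_{i-1} + y_i - q_i, with u_0 = 0."""
    m = len(y)
    u = np.zeros(m + 1)                    # u[0] is the initial state u_0 = 0
    q = np.zeros(m)
    for i in range(1, m + 1):
        w = u[i - 1] + y[i - 1]
        q[i - 1] = alphabet[np.argmin(np.abs(alphabet - w))]  # round-off Q
        u[i] = w - q[i - 1]
    return q, u[1:]

rng = np.random.default_rng(0)
y = rng.uniform(-1, 1, size=64)            # stands in for a sample y = Fx
A = midrise_alphabet(L=8, delta=0.25)
q, u = greedy_sigma_delta(y, A)

# Matrix form of the recursion: D u = y - q.
D = np.eye(64) - np.eye(64, k=-1)
assert np.allclose(D @ u, y - q)
# Stability: the quantizer never overloads here, so |u_i| <= delta / 2.
assert np.max(np.abs(u)) <= 0.125 + 1e-9
```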

2.2. Noise Shaping Scheme and the Choice of Dual Frames

In [7], it is pointed out that the reconstruction error of ΣΔ quantization for bandlimited functions is concentrated at high frequencies. Since audio signals have finite bandwidth, it is then possible to separate the signal from the error using low-pass filters. This discovery led to the introduction of noise shaping schemes, which aim to design the quantization so that the error has structure that is desirable for noise reduction.

In terms of finite frames, noise shaping schemes differ from the ΣΔ scheme in the following way:
$$Hu = y - q,$$
where $H$ is lower-triangular.

Now, given a frame $F$, a noise shaping scheme $H$, and a dual frame $G$ of $F$, i.e. $GF = I$, the reconstruction error of this problem is
$$\|x - \tilde{x}\| = \|G(y - q)\| = \|GHu\| \leq \|GH\|_{\infty \to 2}\, \|u\|_{\infty},$$
where $\|\cdot\|_{\infty \to 2}$ is the operator norm between $(\mathbb{C}^m, \|\cdot\|_{\infty})$ and $(\mathbb{C}^k, \|\cdot\|_2)$, i.e.,
$$\|A\|_{\infty \to 2} = \sup_{\|v\|_{\infty} \leq 1} \|Av\|_2.$$
The choice of the dual frame plays a role in the reconstruction error. For instance, [4] proved that the choice of the Sobolev dual gives the minimum 2-norm, where, given any matrix $E$, the canonical dual is defined as $E^{\dagger} = (E^*E)^{-1}E^*$. More generally, one can consider a $V$-dual, namely $G = (VF)^{\dagger}V$, provided that $VF$ is still a frame. With this terminology, decimation can be viewed as a special case of the $V$-dual, and conversely each $V$-dual can be associated with a corresponding post-processing of the quantized sample $q$.
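The canonical dual and the $V$-dual are easy to sketch numerically; the random complex frame and the 6 × 12 post-processing matrix $V$ below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 12, 4                               # m frame vectors in C^k (rows of F)
F = rng.standard_normal((m, k)) + 1j * rng.standard_normal((m, k))

def canonical_dual(E):
    """Canonical dual E† = (E*E)^{-1} E* of a full-rank analysis matrix E."""
    return np.linalg.inv(E.conj().T @ E) @ E.conj().T

x = rng.standard_normal(k) + 1j * rng.standard_normal(k)
y = F @ x                                  # the sample y = Fx

# Perfect reconstruction with the canonical dual: F† (Fx) = x.
assert np.allclose(canonical_dual(F) @ y, x)

# A V-dual: whenever VF is still a frame, G = (VF)† V satisfies G F = I.
V = rng.standard_normal((6, m))
G = canonical_dual(V @ F) @ V
assert np.allclose(G @ F, np.eye(k))
```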

2.3. Unitarily Generated Frames

The frame elements of unitarily generated frames are generated by a cyclic group: given a base vector $\varphi_0 \in \mathbb{C}^k$ and a Hermitian generator $\Omega \in \mathbb{C}^{k \times k}$, the frame elements are defined as
$$\varphi_n = e^{2\pi i n \Omega / N} \varphi_0, \qquad n = 1, \ldots, N.$$
As symmetry occurs naturally in many applications, it is not surprising that unitarily generated frames receive serious attention, and their applications in signal processing abound, [13, 12, 6, 7].

One particular application comes from dynamical sampling, which records the spatiotemporal samples of a signal of interest: one seeks to recover a signal from samples of its time evolutions taken at various times and locations, which aligns with frame reconstruction problems [1, 2]. In particular, Lu and Vetterli [16, 17] investigated the reconstruction from spatiotemporal samples for a diffusion process. They noted that one can compensate for under-sampled spatial information with sufficiently over-sampled temporal data. Unitarily generated frames represent the case where the evolution process is unitary and the spatial information is one-dimensional.

It should be noted that unitarily generated frames are group frames with the generator $e^{2\pi i \Omega/N}$ provided that $e^{2\pi i \Omega}$ is the identity, while harmonic frames are special cases of unitarily generated frames whose generator is a diagonal matrix with integer entries and whose base vector is constant.
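For a diagonal generator the frame elements can be written down explicitly. The sketch below uses our own convention $\varphi_n = e^{2\pi i n\Omega/N}\varphi_0$ with the harmonic base vector, and checks that distinct integer eigenvalues produce a tight frame:

```python
import numpy as np

def ugf(omega_eigs, phi0, N):
    """Unitarily generated frame for a diagonal generator Omega = diag(omega_eigs):
    row n is phi_n = exp(2*pi*1j*n*Omega/N) @ phi0, for n = 0, ..., N-1."""
    n = np.arange(N)[:, None]
    return np.exp(2j * np.pi * n * np.asarray(omega_eigs)[None, :] / N) * phi0

k, N = 4, 16
phi0 = np.ones(k) / np.sqrt(k)             # harmonic-frame base vector
F = ugf(np.arange(k), phi0, N)             # distinct integer eigenvalues 0..k-1

# With distinct integer eigenvalues (mod N), this harmonic frame is tight:
S = F.conj().T @ F                          # frame operator F*F
assert np.allclose(S, (N / k) * np.eye(k))
```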

2.4. Prior Works

2.4.1. Quantization for Bandlimited Functions

Despite its simple form and robustness, ΣΔ quantization only results in linear error decay with respect to the sampling period. It was later proven in [8] that a generalization of ΣΔ quantization, namely the $r$-th order ΣΔ quantization, exists for arbitrary $r \in \mathbb{N}$, and for such schemes the error decay is of polynomial order $r$. Leveraging the different constants for this family of quantization schemes, sub-exponential decay can also be achieved. A different family of quantization schemes was shown [14] to have exponential error decay with a small exponent. In [11], the exponent was improved.

2.4.2. Finite Frames

ΣΔ quantization can also be applied to finite frames. It is proven in [3] that for any family of frames with bounded frame variation, the reconstruction error decays linearly with respect to the oversampling rate $m/k$, where the frame is an $m \times k$ matrix. Regarding the choice of dual frames, [4] proposed the so-called Sobolev dual, which achieves the minimum induced matrix 2-norm for reconstruction. By carefully matching the dual frame and the quantization scheme, [7] proved that using the $\beta$-dual for random frames results in exponential decay with a near-optimal exponent, with high probability.

2.4.3. Decimation

In [5], using the assumption that the noise in ΣΔ quantization is random, along with numerical experiments, it was asserted that decimation greatly reduces the number of bits needed while maintaining the reconstruction accuracy. In [10], a rigorous proof was given to show that this assertion is indeed valid, and that the reduction in bits turns the linear decay into exponential decay with respect to the bit-rate.

2.4.4. Beta Dual of Distributed Noise Shaping

Chou and Güntürk [7, 6] proposed a distributed noise shaping quantization scheme with the beta dual. The definition of a beta dual is as follows:

Definition 2.1 (Beta Dual).

Let $F$ be an $m \times k$ frame and $p \mid m$. Recall that $G$ is a $V$-dual of $F$ if
$$G = (VF)^{\dagger} V,$$
where $V$ is such that $VF$ is still a frame.

Given $\beta > 1$, the $\beta$-dual has $V = V_{\beta}$, an $\frac{m}{p}$-by-$m$ block-diagonal matrix such that each block is the row vector $(\beta^{-1}, \beta^{-2}, \ldots, \beta^{-p})$.

In this case, the noise shaping scheme is $Hu = y - q$, where $H$ is an $m$-by-$m$ block-diagonal matrix in which each block is a $p$-by-$p$ matrix with unit diagonal entries and $-\beta$ as sub-diagonal entries. Under this setting, it is proven that the reconstruction error decays exponentially in $p$.

One may notice the similarity between the beta dual and decimation. Indeed, if one chooses $\beta = 1$ and normalizes $V$ by the block size, one obtains the same result as decimation, achieving linear error decay with respect to the oversampling rate and exponential decay with respect to the bit usage. Nonetheless, a generalization of the beta dual to higher-order error decay with respect to the oversampling rate is lacking, whereas decimation can be extended to the second order. In particular, the raw performance of second-order decimation is superior to that of the 1-dual under the same oversampling rate.

2.5. Notation

The following notation is used in this paper:

  • $x \in \mathbb{C}^k$: the signal of interest.

  • $F \in \mathbb{C}^{m \times k}$: a fixed frame.

  • $y = Fx$: the sample.

  • $p$: the block size of the decimation, satisfying $p \leq m$.

  • $\lfloor m/p \rfloor$: the greatest integer smaller than the ratio $m/p$.

  • $\mathcal{A}$: the quantization alphabet. $\mathcal{A}$ is said to have length $L$ if $\mathcal{A} = \mathcal{A}_L^{\delta}$ for some $\delta > 0$.

  • $q$: the quantized sample obtained from the greedy ΣΔ quantization defined in (5).

  • $u$: the auxiliary variable of ΣΔ quantization.

  • $G$: a dual frame to the frame $F$, i.e. $GF = I$.

  • $x - Gq$: the reconstruction error.

  • $B$: the total bits used to record the quantized sample.

  • $\Omega$: a Hermitian matrix with eigenvalues $\lambda_1, \ldots, \lambda_k$

    and corresponding orthonormal eigenvectors $v_1, \ldots, v_k$.

  • $\varphi_0$: a base vector in $\mathbb{C}^k$.

  • $\Phi$: the unitarily generated frame (UGF) with the generator $\Omega$ and the base vector $\varphi_0$.

3. Main Results

It will be shown that, for unitarily generated frames satisfying the conditions specified in Theorem 3.4, ΣΔ quantization coupled with alternative decimation still has a linear reconstruction error decay rate with respect to the oversampling ratio. As for data storage, decimation allows for highly efficient storage, making the error decay exponential with respect to the bits used.

For the rest of the paper, we shall also assume that our ΣΔ quantization scheme is stable, i.e., $\|u\|_{\infty}$ remains bounded as the dimension $m \to \infty$. We first start with results on harmonic frames.

Definition 3.1 (Frame variation).

Given a frame $\{f_i\}_{i=1}^{m}$, the frame variation is defined to be
$$\sigma(F) := \sum_{i=1}^{m-1} \|f_i - f_{i+1}\|.$$

Definition 3.2 (Alternative Decimation).

Given a fixed block size $p$, the $p$-alternative decimation operator is defined to be the composition of the following two operators:

  • the integration operator $I_p$ satisfying
    $$(I_p y)_n = \sum_{j=0}^{p-1} y_{n-j}.$$
    Here, the cyclic convention is adopted: for any $n$, the index $n$ is understood modulo $m$.

  • the sub-sampling operator $S_D$ satisfying
    $$(S_D z)_l = z_{lp},$$
    where $l = 1, \ldots, m/p$.
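One plausible realization of the $p$-alternative decimation operator, assuming the index convention of cyclic block sums followed by keeping every $p$-th entry (the exact convention is an assumption for illustration):

```python
import numpy as np

def alt_decimation(y, p):
    """p-alternative decimation (sketch): cyclically sum each window of p
    consecutive entries, then keep every p-th result."""
    N = len(y)
    assert N % p == 0
    # Integration: (I_p y)_n = y_n + y_{n-1} + ... + y_{n-p+1}, indices mod N.
    integrated = np.array([sum(y[(n - j) % N] for j in range(p))
                           for n in range(N)])
    # Sub-sampling: keep the entries at positions p-1, 2p-1, ..., N-1.
    return integrated[p - 1::p]

y = np.arange(12, dtype=float)
z = alt_decimation(y, 3)
# Each output entry is the sum of a disjoint block of 3 consecutive samples.
assert np.allclose(z, [0 + 1 + 2, 3 + 4 + 5, 6 + 7 + 8, 9 + 10 + 11])
```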

Theorem 3.3 (Special case: Decimation for harmonic frames).

Fix the frame with entries . Suppose are distinct integers in , then the following statements are true:

  • Signal reconstruction: The matrix is a frame.

  • Error estimate: The choice of the dual frame to gives reconstruction error


    where is the canonical dual of the down-sampled matrix

    Moreover, if , then


    In particular, the error decays linearly with respect to the oversampling rate .

  • Efficient data storage: Suppose the length of the quantization alphabet is , then the decimated samples can be encoded by a total of bits. Furthermore, suppose is fixed as , then as a function of total bits used, the reconstruction error is


    where .

    For , we have a better estimate


    where , independent of . The optimal exponent will be achieved in the case .

The more general result is as follows:

Theorem 3.4 (Decimation for Unitarily Generated Frames (UGF)).

Given , , , , and as the generator, base vector, eigenvalues, eigenvectors, and the corresponding UGF, respectively, suppose

  • ,

  • , and

  • ,

where , then the following statements are true:

  • Signal reconstruction: is a frame.

  • Error estimate: For the dual frame , the reconstruction error satisfies

  • Efficient data storage: Suppose the length of the quantization alphabet is , then the total bits used to record the quantized samples are bits. Furthermore, suppose is fixed as , then as a function of bits used at each entry, satisfies


    where , independent of .

Remark 3.5.

For both Theorem 3.3 and 3.4, if both the signal and the frame are real, then the total bits used will be bits, half the amount needed for the complex case.

Remark 3.6.

Harmonic frames are unitarily generated frames whose generator is a diagonal matrix with integer entries and whose base vector is constant, so Theorem 3.3 is truly a special case of Theorem 3.4. The estimate (16) is worse than the one given in (11) by a constant factor, due to the lack of knowledge of the actual distribution of the eigenvalues; this is the difference between Lemma 5.8 and Proposition 6.2.

One additional property of decimation is its multiplicative structure.

Theorem 3.7 (The Multiplicative Structure of Decimation Schemes).

Suppose the block size factors as $p = p_1 p_2$; then the $p$-decimation is equal to the successive iteration of a $p_1$-decimation followed by a $p_2$-decimation, in the sense that the corresponding canonical dual frames and reconstruction errors are the same.
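The multiplicative structure can be illustrated with plain block sums, where decimating with block size $p_1 p_2$ in one pass agrees with decimating by $p_1$ and then by $p_2$; the toy operator below is ours, not the paper's:

```python
import numpy as np

def decimate(y, p):
    """Block decimation sketch: sum disjoint blocks of p consecutive entries
    (the values kept by the p-alternative decimation at the sampled indices)."""
    return y.reshape(-1, p).sum(axis=1)

rng = np.random.default_rng(2)
y = rng.standard_normal(24)

# One (p1 * p2)-decimation equals a p1-decimation followed by a p2-decimation.
assert np.allclose(decimate(y, 6), decimate(decimate(y, 2), 3))
assert np.allclose(decimate(y, 6), decimate(decimate(y, 3), 2))
```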

Besides the first-order alternative decimation in Theorem 3.4, it is also possible to generalize the result to second-order decimation. For such a decimation process, the reconstruction error decays quadratically (as opposed to linearly in Theorem 3.4) with respect to the oversampling rate, and exponentially with respect to the bit usage.

Definition 3.8 (Higher Order Alternative Decimation).

Given a fixed block size $p$ and order $r$, the $(p, r)$-alternative decimation operator is defined to be $S_D \circ I_p^r$.

Naturally, a $(p, r)$-alternative decimation operator is compatible with an $r$-th order ΣΔ quantization. For $r = 2$, we have the following result:

Theorem 3.9 (Second Order Decimation for UGF).

Given , , , , and as the generator, base vector, eigenvalues, eigenvectors, and the corresponding UGF, respectively, suppose

  • ,

  • , and

  • ,

then the following statements are true:

  • Signal reconstruction: is a frame.

  • Error estimate: For the dual frame , the reconstruction error has quadratic error decay rate with respect to the oversampling rate :

  • Efficient data storage: Suppose the length of the quantization alphabet is , then the total bits used to record the quantized samples are bits. Furthermore, suppose is fixed as , then as a function of bits used at each entry, satisfies


    where , independent of .

Remark 3.10.

Note that we need to require the eigenvalues of the generator to be nonzero for the second-order decimation.

To better demonstrate the ideas in the proof, Theorem 3.3 will be proven separately in Section 5 even though it is essentially a special case of Theorem 3.4. Theorem 3.4 will be proven in Section 6, and Theorem 3.7 in Section 7. Finally, Theorem 3.9 is proven in Section 8.

4. Scaling Effect of the Decimation Operator on Difference Matrices

A key element in decimation for bandlimited functions is the fact that it preserves the difference structure that is a signature of ΣΔ quantization schemes. As such, it is important to ensure that the same effect occurs for frame quantization. The scaling effect of the decimation operator can be seen in the lemma below:

Lemma 4.1.

For a given , let denote the -dimensional backward difference matrix. For , one has

It is tempting to consider instead, where is the circulant matrix satisfying

Indeed, there is no difference between and , as . However, difficulties occur when we try to generalize the process. See Section 8 and Appendix A.

Remark 4.2.

Visually, both and contain a parallelogram. The one in has height and width while the one for has height and width . If we consider the circulant matrix defined below, we can see that , where is constant on the first rows and zero otherwise.

Remark 4.3.

Symmetric integrations are employed in A/D conversion, but they are not necessary for finite frames. We are thus able to go from summing over an odd number of entries to summing over any number of entries, and this proves to be beneficial, as the estimate is better in the case


The alternative decimation operator comes naturally given our needs, as it preserves the structure of a backward difference matrix.

Lemma 4.4.

satisfies where


and is the Kronecker delta.


Proof of Lemma 4.4: Note that




Now, compute


By definition,


Thus, splitting into , , and , we see that


as claimed. ∎


Proof of Lemma 4.1:

If ,

Now, note that, for ,

For , . ∎

As a frame-theoretic problem, there are two questions that should be answered in the following order:

  1. Given a signal and a frame, can we reconstruct the signal from its down-sampled data? That is, does the down-sampled system remain a frame?

  2. If it is indeed a frame, let $G$ be its canonical dual. What is the reconstruction error?

5. Decimation for Finite Harmonic Frames

5.1. The Scaling Effect of Decimation

Let be a harmonic frame. For any , we have the following lemma:

Lemma 5.1.

and satisfy


where is a diagonal matrix with entries


and is zero except for the -th column, having

In either case as .

Remark 5.2.

In (18), one observes that differs from an actual circulant matrix by a matrix with on every entry of the first rows and zero otherwise. Since , we can conclude that . Thus, it is possible to consider , which is a more natural formulation of decimation than the alternative decimation.


We start with the computation on . First, suppose . Then

For ,