In this paper, we present recursive algorithms for type-IV discrete cosine and sine transforms (DCT-IV and DST-IV) and modified discrete cosine transforms (MDCTs), of power-of-two sizes , that require fewer total real additions and multiplications (herein called flops) than previously published algorithms (with an asymptotic reduction of about 6%), without sacrificing numerical accuracy. This work, extending our previous results for small fixed , appears to be the first time in over 20 years that flop counts for the DCT-IV and MDCT have been reduced—although computation times are no longer generally determined by arithmetic counts , the question of the minimum number of flops remains of fundamental theoretical interest. Our fast algorithms are based on one of two recently published fast Fourier transform (FFT) algorithms [1, 3], which reduced the operation count for the discrete Fourier transform (DFT) of size to compared to the (previous best) split-radix algorithm’s [4, 5, 6, 7, 8]. Given the new FFT, we treat a DCT as an FFT of real-symmetric inputs and eliminate redundant operations to derive the new algorithm; in other work, we applied the same approach to derive improved algorithms for the type-II and type-III DCT and DST .
The algorithm for DCT-IV that we present has the same recursive structure as some previous DCT-IV algorithms, but the subtransforms are recursively rescaled in order to eliminate some of the multiplications. This approach reduces the flop count for the DCT-IV from the previous best of [10, 11, 12, 13, 14, 15, 16] to:
The first savings occur for , and are summarized in Table. I.
|previous DCT-IV||New algorithm|
In order to derive a DCT-IV algorithm from the new FFT algorithm, we simply consider the DCT-IV to be a special case of a DFT with real input of a certain symmetry, and discard the redundant operations. [This should not be confused with algorithms that employ an unmodified FFT combined with pre/post-processing steps to obtain the DCT.] This well-known technique [7, 17, 8, 18, 2, 1, 9] allows any improvements in the DFT to be immediately translated to the DCT-IV, is simple to derive, avoids cumbersome re-invention of the “same” algorithm for each new trigonometric transform, and (starting with a split-radix FFT) matches the best previous flop counts for every type of DCT and DST. The connection to a DFT of symmetric data can also be viewed as the basic reason why DCT flop counts had not improved for so long: as we review below, the old DCT flop counts can be derived from a split-radix algorithm , and the 1968 flop count of split radix was only recently improved upon [1, 3]. There have been many previously published DCT-IV algorithms derived by a variety of techniques, some achieving flops [10, 11, 12, 13, 14, 15, 16] and others obtaining larger or unreported flop counts [19, 20, 21, 22]. Furthermore, exactly the same flop count (1) is now obtained for the type-IV discrete sine transform (DST-IV), since a DST-IV can be obtained from a DCT-IV via flipping the sign of every other input (zero flops, since the sign changes can be absorbed by converting subsequent additions into subtractions or vice versa) and a permutation of the output (zero flops) . Also, in many practical circumstances the output can be scaled by an arbitrary factor (since any scaling can be absorbed into a subsequent computation); in this case, similar to the well-known savings for a scaled-output size-8 DCT-II in JPEG compression [23, 24, 1, 9], we show that an additional multiplications can be saved for a scaled-output (or scaled-input) DCT-IV.
Indeed, if we only wished to show that the asymptotic flop count for DCT-IV could be reduced to , we could simply apply known algorithms to express a DCT-IV in terms of a real-input DFT (e.g. by reducing it to the DCT-III  and thence to a real-input DFT ) to immediately apply the flop count for a real-input DFT from our previous paper . However, with FFT and DCT algorithms, there is great interest in obtaining not only the best possible asymptotic constant factor, but also the best possible exact count of arithmetic operations. Our result (1) is intended as a new upper bound on this (still unknown) minimum exact count, and therefore we have done our best with the terms as well as the asymptotic constant.
An important transform closely related to the DCT-IV is an MDCT, which takes inputs and produces outputs, and is designed to be applied to 50%-overlapped blocks of data [26, 27]. Such a “lapped” transform reduces artifacts from block boundaries and is widely used in audio compression . In fact, an MDCT is exactly equivalent to a DCT-IV of size , where the inputs have been preprocessed with additions/subtractions [29, 30, 31, 32]. This means that the flop count for an MDCT is at most that of a DCT-IV plus flops. Precisely this technique led to the best previous flop count for an MDCT, [29, 30, 31, 32]. (There have also been several MDCT algorithms published with larger or unreported flop counts [33, 34, 35, 36, 37, 38, 39, 40, 41, 42].) It also means that our improved DCT-IV immediately produces an improved MDCT, with a flop count of eq. (1) plus . Similarly for the inverse MDCT (IMDCT), which takes inputs to outputs and is equivalent to a DCT-IV of size plus negations (which should not, we argue, be counted in the flops because they can be absorbed by converting subsequent additions into subtractions).
In the following sections, we first briefly review the new FFT algorithm, previously described in detail . Then, we review how a DCT-IV may be expressed as a special case of a real DFT, and how the new DCT-IV algorithm may be derived by applying the new FFT algorithm and pruning the redundant operations. In doing so, we find it necessary to develop a algorithm for a DCT-III with scaled output. (Previously, we had derived a fast algorithm for a DCT-III based on the new FFT, but only for the case of scaled or unscaled input .) This DCT-III algorithm follows the same approach of eliminating redundant operations from our new scaled-output FFT (a subtransform of the new FFT) applied to appropriate real-symmetric inputs. We then analyze the flop counts for the DCT-III and DCT-IV algorithms. Finally, we show that this improved DCT-IV immediately leads to improved DST-IV, MDCT, and IMDCT algorithms. We close with some concluding remarks about future directions.
Ii Review of the new FFT
To obtain the new FFT, we used as our starting point a variation called the conjugate-pair FFT of the well-known split-radix algorithm. Here, we first review the conjugate-pair FFT, and then briefly summarize how this was modified to reduce the number of flops.
Ii-a Conjugate-pair FFT
The discrete Fourier transform of size is defined by
where is an th primitive root of unity and .
Starting with this equation, the decimation-in-time conjugate-pair FFT [43, 1], a variation on the well-known split-radix algorithm [4, 5, 6, 7], splits it into three smaller DFTs: one of size of the even-indexed inputs, and two of size :
[In contrast, the ordinary split-radix FFT uses for the third sum (a cyclic shift of ), with a corresponding multiplicative “twiddle” factor of .] This decomposition is repeated recursively until base cases of size or are reached. The number of flops required by this algorithm, after certain simplifications (common subexpression elimination and constant folding) and not counting data-independent operations like the computation of , is , identical to ordinary split radix [44, 45, 46, 1].
Ii-B New FFT
Based on the conjugate-pair split-radix FFT from section II-A, a new FFT algorithm with a reduced number of flops can be derived by scaling the subtransforms . We will not reproduce the derivation here, but will simply summarize the results. In particular, the original conjugate-pair split-radix algorithm is split into four mutually recursive algorithms, for , each of which has the same split-radix structure but computes a DFT scaled by a factor of (defined below), respectively. These algorithms are shown in Algorithm 1, in which the scaling factors are combined with the twiddle factors to reduce the total number of multiplications. In particular, all of the savings occur in , while is factorized into a special form to minimize the number of extra multiplications it requires. Here, although are presented in a compact form by a single subroutine , in practice they would have to be implemented as three separate subroutines in order to exploit the special cases of the multiplicative constants , as described in Ref. . For simplicity, we have omitted the base cases of the recursion ( and ) and have not eliminated common subexpressions.
The key aspect of these algorithms is the scale factor , where the subtransforms compute the DFT scaled by for . This scale factor is defined for by the following recurrence, where mod :
This definition has the properties: , , and . Also, and decays rather slowly with : is asymptotically . When these scale factors are combined with the twiddle factors , we obtain terms of the form
which is always a complex number of the form or and can therefore be multiplied with two fewer real multiplications than are required to multiply by . We denote the complex conjuate of by . The resulting flop count, for arbitrary complex data , is then reduced from for split radix to .
Iii Fast DCT-IV from new FFT
Various forms of discrete cosine transform have been defined, corresponding to different boundary conditions on the transform. The type-IV DCT is defined as a real, linear transformation by the formula:
for inputs and outputs . The transform can be made orthogonal (unitary) by multiplying with a normalization factor , but for our purposes the unnormalized form is more convenient (and has no effect on the number of operations). We will now derive an algorithm, starting from the new FFT of the previous section, to compute the DCT-IV in terms of a scaled DCT-III and DST-III. These type-III transforms are then treated in following section by a similar method, and lead to our new flop count for the DCT-IV.
In particular, we wish to emphasize in this paper that the DCT-IV (and, indeed, all types of DCT) can be viewed as special cases of the discrete Fourier transform (DFT) with real inputs of a certain symmetry. This viewpoint is fruitful because it means that any FFT algorithm for the DFT leads immediately to a corresponding fast algorithm for the DCT-IV simply by discarding the redundant operations, rather than rederiving a “new” algorithm from scratch. A similar viewpoint has been used to derive fast algorithms for the DCT-II [7, 17, 8, 18, 2, 1], as well as in automatic code-generation for the DCT-IV [2, 1], and has been observed to lead to the minimum known flop count starting from the best known DFT algorithm. Furthermore, because the algorithm is equivalent to an FFT algorithm with certain inputs, it should have the same floating-point error characteristics as that FFT—in this case, the underlying FFT algorithm is simply a rescaling of split radix , and therefore inherits the favorable mean error growth and error bounds of the Cooley-Tukey algorithm [47, 48, 49], unlike at least one other DCT-IV algorithm  that has been observed to display error growth .
Iii-a DCT-IV in terms of DFT
Recall that the discrete Fourier transform of size is defined by eq. (2). In order to derive from the DFT formula, one can use the identity to write:
where is a real-even sequence of length , defined as follows for :
Furthermore, the even-indexed inputs are zeros: for all . (The factors of will disappear in the end because they cancel equivalent factors in the subtransforms.)
This is illustrated by an example, for , in Fig. 1
. The original four inputs of the DCT-IV are shown as open dots, which are interleaved with zeros (black dots) and extended to an odd-even-odd (square dots) periodic (gray dots) sequence of lengthfor the corresponding DFT. Referring to eq. (7), the output of the DCT-IV is given by the first odd-index outputs of the corresponding DFT (with a scale factor of ). (The type-IV DCT is distinguished from the other types by the fact that it is even around the left boundary of the original data while it is odd around the right boundary, and the symmetry points fall halfway in between pairs of the original data points.) We will refer, below, to this figure in order to illustrate what happens when an FFT algorithm is applied to this real-symmetirc zero-interleaved data.
Iii-B DCT-IV from DCT/DST-III
For a DCT-IV of size , our strategy is to directly apply the new FFT algorithm to the equivalent DFT of size , and to discard the redundant operations from each stage. As it turns out, the sub-transforms after one step of this algorithm are actually scaled-output type-III DCTs and DSTs. This is closely related to a well-known algorithm to express a DCT-IV in terms of a half-size DCT-III and DST-III . In this section, we derive this reduction to a DCT-III for the new FFT algorithm, and then in Sec. IV we derive a new algorithm for the scaled-output DCT-III. The scaled output DST-III algorithm can be easily re-expressed in terms of the DCT-III, as is presented in Sec. IV-C. Here, we define the DCT-III and DST-III, respectively, by the (unnormalized) equations:
Starting with the DFT of length in eq. (2), the new split-radix FFT algorithm splits it into three smaller DFTs: of size , as well as the scaled transforms and of size (including a factor of 2 for convenience). These are combined via:
Here, where comes from the DCT-IV as in Sec. III-A, the even-indexed elements in are all zero, so . Furthermore, by the even symmetry of , we have (complex conjugate of ). Thus, we have . We will now show that is given by combining a DCT-III and a DST-III.
In order to calculate , we denote for simplicity the inputs of this subtransform by for . Since is the output of a real-input DFT of size , we have . Thus, for any , . However, there is an additional redundancy in this transform that we must exploit: by inspection of the construction of and by reference to Fig. 2, we see that the inputs are actually a real anti-periodic sequence of length (which becomes periodic when it is extended to length ). We must exploit this symmetry in order to avoid wasting operations.
In particular, by using the anti-periodic symmetry of , we can write the DFT of length as a single summation of length :
using the facts that and that . Then, if we take the real and imaginary parts of the third line above, we obtain precisely a DCT-III and a DST-III, respectively, with outputs scaled by . However, these sub-transforms are actually of size , because the symmetry means that the and terms merely add or subtract in the input:
for any . We can define two new sequences () and () of length to be the inputs of this DCT-III and DST-III, respectively:
With this definition of and , we can conclude from eqs. (13–14), and the definition of DCT-III and DST-III, that the real part of is exactly a scaled-output DCT-III of , while the imaginary part of is exactly a scaled-output DST-III of . (The scale factors of will disappear in the end: they combine with the 2 in to cancel the in the definition of .)
Thus, we have shown that the first half of the sequence can be found from a scaled-output DCT-III and DST-III of length . The second half of the sequence can be derived by the relation obtained earlier. Given , the output of the original DCT-IV, , can be obtained by the formula . This algorithm, in which the computation of has been folded into the computation of and , is presented in Algorithm 2.
In Algorithm 2, calculates the DCT-III of scaled by a factor of for , and will be presented in Sec. IV. Similarly, calculates the DST-III of scaled by a factor of for , and will be presented in Sec. IV-C in terms of .
If the scale factors are removed (set to 1) in Algorithm 2, we recover a decomposition of the DCT-IV in terms of an ordinary (unscaled) DCT-III and DST-III that was first described by Wang . This well-known algorithm yields a flop count exactly the same as previous results: . (Wang obtained a slightly larger count, apparently due to an error in adding his DCT-III and DST-III counts.) The introduction of the scaling factors in Algorithm 2 reduces the flop count by simplifying some of the multiplications in the scaled DCT-III/DST-III compared to their unscaled versions, as will be derived in Sec. IV. Note that, in Algorithm 2, multiplying by does not require any more operations than multiplying by , because the constant product can be precomputed. Let denote the number of flops saved in compared to the best-known unscaled DCT-III. We shall prove in Sec. IV-C that the same number of operations, , can be saved in . Thus, the total number of flops required by Alg. 2 will be . The formula for will be derived in Sec. IV, leading to the final DCT-IV flop count formula given by eq. (1).
Iv DCT-III from new FFT
The type-III DCT (for a convenient choice of normalization) is defined by eq. (10) for inputs and outputs . We will now follow a process similar to that in the previous section for the DCT-IV: we first express in terms of a larger DFT of length , then apply the new FFT algorithm of Sec. II-B, and finally discard the redundant operations to yield an efficient DCT-III algorithm. The resulting algorithm matches the DCT-III flop count of our previous publication , which improved upon classic algorithms by about 6% asymptotically. Unlike our previous DCT-III algorithm, however, the algorithm presented here also gives us an efficient scaled-output DCT-III, which saves operations over the classic DCT-III algorithms.
Iv-a DCT-III in terms of DFT
In order to derive from the DFT formula, one can use the identity to write:
where is a real-even sequence of length , defined as follows for :
with , , , and . (Notice that the definitions of and here are different from those in Sec. III-A.)
This is illustrated by an example, for , in Fig. 3. This figure is very similar to Fig. 1: both of them are even around the points and , and are odd around the points and . The difference from the DCT-IV is that these points of symmetry/anti-symmetry are now data points of the original sequence, and so the data is not interleaved with zeros as it was for the DCT-IV. We will refer, below, to this figure in order to illustrate what happens when an FFT algorithm is applied to this real-symmetirc data.
Iv-B New (scaled) DCT-III algorithm
In this subsection, we will apply the new FFT algorithm (Alg. 1) to the corresponding DFT for a DCT-III of size as obtained in Sec. IV-A. This process is similar to what we did in Sec. III-B. We will see that a DCT-III of size can be calculated by three subtransforms: a DCT-III of size , a DCT-III of size , and a DST-III of size . The resulting algorithm for the DCT-III will have the same recursive structure as in Alg. 1: four mutually recursive subroutines that compute the DCT-III with output scaled by different factors. For use in the DCT-IV algorithm from Sec. III-B, we will actually use only three of these subroutines, because we will only need a scaled-output DCT-III and not the original DCT-III.
When the new FFT algorithm is applied to the sequence of length defined by eqs. (18–19), we get three subtransforms: of the sequences , , and . The DFT of the sequence is equivalent to a size- DCT-III of the original even-indexed data , as can be seen from Fig. 3. The subtransforms of and have exactly the same properties as the corresponding subtransforms of the DCT-IV as described in Sec. III-B (except that the length of the subtransform here is instead of as in the DCT-IV case). That is, we denote the DFT of by , and this combines with the DFT of to yield a term in the output as before. And, as before, the inputs of are anti-periodic with length . In consequence, we can apply the result derived in Sec. III-B to conclude that these two subtransforms can be found from a DCT-III of size and a DST-III of size . The corresponding inputs of the DCT-III and DST-III, () and (), are defined as follows [compare to eqs. (15–16)]:
where for . (Again, the factors of 2 will cancel the factor of in the definition of .) Therefore, the real part of is the DCT-III of , while the imaginary part is the DST-III of . In summary, a DCT-III of size can be calculated by a DCT-III of size , a DCT-III of size , and a DST-III of size . Without the scaling factors , this is equivalent to a decomposition derived by Wang of a DCT-III of size into a DCT-III and a DCT-IV of size , in which the DCT-IV is then decomposed into a DCT-III and a DST-III of size .
The above discussion is independent of the scaling factor applied to the output of the transform. So, for the various scale factors in the different subroutines of Alg. 1, we simply scale the DCT-III and DST-III subtransforms in the same way to obtain similar savings in the multiplications (as quantified in the next section). This results in the new DCT-III algorithm presented in Algorithm 3. (The base cases, for and , are omitted for simplicity.)