Wavelet-based Scale Saliency

01/14/2013 ∙ by Anh Cat Le Ngo, et al. ∙ 0

Both pixel-based scale saliency (PSS) and basis project methods focus on multiscale analysis of data content and structure. Their theoretical relations and practical combination are previously discussed. However, no models have ever been proposed for calculating scale saliency on basis-projected descriptors since then. This paper extend those ideas into mathematical models and implement them in the wavelet-based scale saliency (WSS). While PSS uses pixel-value descriptors, WSS treats wavelet sub-bands as basis descriptors. The paper discusses different wavelet descriptors: discrete wavelet transform (DWT), wavelet packet transform (DWPT), quaternion wavelet transform (QWT) and best basis quaternion wavelet packet transform (QWPTBB). WSS saliency maps of different descriptors are generated and compared against other saliency methods by both quantitative and quanlitative methods. Quantitative results, ROC curves, AUC values and NSS values are collected from simulations on Bruce and Kootstra image databases with human eye-tracking data as ground-truth. Furthermore, qualitative visual results of saliency maps are analyzed and compared against each other as well as eye-tracking data inclusive in the databases.



There are no comments yet.


page 13

page 23

page 33

page 34

page 35

page 36

page 40

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

A few centuries ago, Neisser proposed a fundamental theory about the human visual attention system including pre-attentive and attentive stages Neisser (1967) in his psychology studies. However, his work was unknown to machine vision scientists until David Marr David Marr (1982)

, a neurologist, proposed a neurology-based computational model for Neisser’s theory. The computational model includes a feature extraction stage followed by perceptual grouping stage. Though the model was practically limited and rarely implemented, it inspires and provides framework for several later computational models. Among them, Itti model

Itti et al. (1998) holds significant influence and provides a standard in the research field. Itti models feature extraction as center-surrounds operator, a property of visual cortex; while, perceptual grouping and attentive region assessment are due to Proto-Object generation Walther (2006) and Winner-Take-All network Itti et al. (1998). After center-surrounds operations in multi-scale levels were proposed for construction of conspicuity and saliency maps by Itti et al, other theories like Graph-based Visual Saliency Harel et al. (2007), and Spectral Residual Saliency Hou and Zhang (2007) were brought in to produce more meaningful saliency maps Harel et al. (2007) as well as reduce the computational complexity Hou and Zhang (2007). These saliency models assume that human vision systems may behave like random-walk processes Harel et al. (2007) or follow statistical property of natural images Harel et al. (2007).

Without making such strong assumption, Kadir Kadir and Brady (2001) and Gilles S. Gilles. (1998) initiated information-based saliency map with their work on pixel-based scale saliency (PSS). Other information-related saliency research rapidly gained pace with Niel Bruce’s An Information Maximization (AIM) theory Bruce and Tsotsos (2006) and Danash Gao’s Discriminative Information Saliency (DIS) Gao et al. (2007). Furthermore, the information-based spatial-temporal framework (ENT) Anh Cat et al. (2012) Qiu et al. (2007) extends and fastens the models from still images to the dynamic video context.

Information-based saliency approaches all motivated from the assumption that human attention could be attracted to spatial location accompanied with highly informative content. From signal coding, compression and self-information theory, an event has more information when it appears to be structural and rare. Though based on similar concepts, each method has its own information estimation approach on different type of descriptors. Generally, those approaches can be characterized according to their choices of descriptors and calculation methods. For examples, PSS

Kadir and Brady (2001) and ENT Qiu et al. (2007) utilize pixel-value descriptors; meanwhile, AIM Bruce and Tsotsos (2006) and DIS Gao et al. (2007)

emphasizes on the alternative basis-projection descriptors, ICA bases and Wavelet bases consecutively. In accordance with information measurement, ENT and PSS employ the popular Shannon entropy estimated by histogram construction or Parzen kernel. AIM estimates self-information through neural-network training on patches of natural images. Decision-theory based DIS has its discriminative information from classifying descriptors into center or surrounds classes. Noteworthy, PSS is so far the only approach accumulating information of both descriptors and their structure. However, PSS employs pixel-value descriptors and isotropic circular sampling, which might hinder its performance in term of accuracy due to failure in extracting popular oriented features in natural images as well as speed due to the curse of dimensionality in information estimation.

The limitation of PSS sparked discussion for alternative solutions by Kadir et alKadir and Brady (2001) Kadir et al. (2003). Deploying basis-projected descriptors in place of pixel-based descriptors not only boosts practical performance of scale saliency but as well provides deeper theoretical understanding of scale saliency and data multi-scale structural information. Moreover, the extension would make scale saliency the first-ever method capable of using both pixel-value descriptors (PSS) and basis-projection descriptors. Wavelet elements are preferred as alternative basis in this paper; therefore, the proposed method is named Wavelet Scale Saliency (WSS). In order to clearly explain extension from pixel-based descriptors to basis-projection descriptor, we organize this paper in following sections. Section 2 gives overview about scale saliency and its main idea. The next section 3 explains the rationale behind usage of time-frequency domain instead of time-domain only for visual saliency; meanwhile, sections 4, 6 elaborates statistical distribution and correlation of time-frequency descriptors. As wavelet is chosen as time-frequency basis, section 5 gives background information about four types of wavelet transforms considered in this study: discrete wavelet transform (DWT), discrete best basis wavelet packet transform (DWPTBB) as well as two quaternion wavelet transforms QWT and QWPTBB. Accordingly, there are four time-frequency descriptors representing time-frequency domain slightly different from each other. Moreover, each descriptor depends on a particular morphological shape of its own mother wavelet. All details about properties of those descriptors are organized in section 6.1. Along with new descriptors, suitable mathematical models of feature-space and inter-scale saliency estimation are derived in section 7. Moreover, the mathematical derivation unveils strong relation between WSS with another state-of-the-art Bayesian Surprise Saliency (BSS) Baldi and Itti (2010). Beside theoretical evaluation, simulations on Neil Bruce image database Bruce and Tsotsos (2009) and Kootstra image database Kootstra et al. (2011) are carried out in order to compare quantitatively the proposed WSSs with different basis-projection descriptors, the original PSS and the ITT model. Furthermore, qualitative analysis on particular images provides better details about responses of the proposed methods with different types of scenes. It is possible that performance of saliency methods depends on image content. All results and discussion are detailed in the section 8. Finally, the conclusion 9 summarizes our main contributions and future research directions.

2 Scale Saliency

To get a hold of what exactly is a scale saliency; a few fundamental principles of original scales saliency are reviewed. Scale saliency utilizes maximum feature-space entropy weighted by its inter-scale dependency across scales as saliency values; furthermore, it argues that information measurement might be data-driven pivot for human visual attention. Its mathematical model is summarized as follows.


Feature-space saliency, () in the equation 2, is measured by its Shannon entropy of pixel-values descriptor at a specific scale or sampling window size for each image location . Shannon entropy is chosen since it satisfies fours over five criteria of multi-scale entropy filtering Starck and Murtagh (1999). The last criterion actually requires structural correlation from information estimation, which is apparently not considered in Shannon entropy. However, inter-scale saliency,

, actually fulfils this requirement, and it is estimated at every location by total variation of descriptors’ probability distribution function (

) across scales. Then, the scale at which most significant information should be found; it is actually the maximum point of the scale-entropy concave curve in the equation 4. Finally, the overall saliency is stated mathematically as the equation 1 in accordance with the definition of scale saliency. Lets apply the concept of scale saliency on a general form of signal , where is ideal noise free signal, is the measured signal with noise at specific location and scale . Assumed no dependencies between noise and the ideal signal, the estimated entropy is . Assumed that noise are scale-invariant, the equation 3 implies that inter-scale saliency measure is purely dependent on variation of useful signal and not affected by variation of noise . This briefly explains basic motivation behind scale saliency work; further mathematical analysis and experiments results can be found in Kadir and Brady (2001) Kadir et al. (2003).

The original scale saliency Kadir and Brady (2001) uses pixel-value descriptors which are simple, intuitive, and straight forward interpretation of image data. Moreover, its combination with circular sampling window provides isotropic information analysis, independent of any morphological shape inside sampled regions. Nevertheless, its drawbacks are be susceptible to noise, require high computational cost and cause significant bias in entropy estimation. So far, histogram construction and approximated Parzel kernel are two popular parameter methods for constructing pixel-value descriptors’ and estimating entropy. Entropy bias and speed performance in those mentioned methods greatly depend on manual tuning of histogram numbers of bins or Parzel size kernel; in addition, they as well restrict extension of scale saliency to higher dimensional data. Suau Suau and Escolano (2009) overcomes these problems by bypassing construction stage and estimating PSS by multivariate-data-adaptive information estimation technique Stowell and Plumbley (2009). In spite of its fast computation for multivariate data, the non-pdf approach hinders the inter-scale saliency process which directly depends on s 3. It is solved by adapting set-theory based elegant solutions of Kadir Kadir et al. (2003) for inter-scale saliency computation into kd-tree structure. However, the solution is not intuitively and mathematically related to the information-based frame-work. That motivates us develop (), a more coherent information-based scale saliency with sub-band energy descriptors, as solutions for all these short-comings of PSS.

3 Time-Scale-Frequency

A well-known computational model of visual attention is first mentioned in Koch and Ullman’s publication Koch and Ullman (1985). After that, several other models are proposed; however, they are usually over-complex and not biologically plausible. The disadvantages might be due to pixel representation utilized in many early visual attention algorithm. To overcome these problems systematically, Urban Urban et al. (2010) has investigated strong constraints to keep computational complexity within an acceptable range for possible real-time implementation. These constrains are drawn from evidences of psychological experiments which shows that images could be analyzed in psycho-visual channels at least in TV-viewing condition Watson (1987). In other words, visual data could be further analysed into channels and sub-bands instead of being used in raw pixel format. Furthermore, the channels can be effectively characterized by separated frequency bands and orientation ranges of wavelet analysis Senane et al. (1995). Lets assume visual active areas of brains can deploy some 9/7 Cohen-Daubechies-Feauveau (CDF) wavelet transform operators; then, it results in multi-scale pyramid composed of oriented contrast maps with limited frequency range and low-resolution image. For each level of wavelet decomposition, there are four channels: (i) sub-band 0 is approximated image after filtered with many low-pass blurring kernels; (ii) sub-band 1 extracts horizontal frequencies corresponding to vertical edges of images (iii) sub-band 2 contains frequencies and features along two diagonals of image frames. (iv) sub-band 3 prefers vertical frequencies mapping to horizontal features form images. Natural scenes are full of horizontal, vertical or two diagonals features; therefore, human visual perception seems to prefer those dominant features. Besides oriental constraints, visual acuity is another visually perceptual limit. Normally, human fovea could decompose and process details above its limit visual acuity (1.5-2 degrees of visual angle). It lasts in frequency range: 0.7-0.5 pixels per degree,or 0.33-0.25 cycles per degree. This range is nearly resembled by the last level low-resolution version of images in usual wavelet decomposition. Each decomposed level are generated by moving kernels with different window size to any image positions. Spatial frequency of other wavelet analysis levels, varying in accordance with analyzing depths, is shown the following table 1

Depth 0 1 2 3 4 5
Frequency range
(cycles per degree) 10.7-5.3 5.3-2.7 2.7-1.3 1.3-0.7 0.7-0.3 0.3-0.2
Table 1: Wavelet Levels vs Frequency Range

Spectral energy are usually employed as spectral signature for image collections or individual images. Urban et alUrban et al. (2010) analyses different sets of images belonging to four different semantic categories: coast, mountain, street and open-country. Interestingly, Fourier spectrum of each category possess distinguished shape and frequency range, significantly different from each other Urban et al. (2010). In other words, each general spectral profile and associated distribution histogram of image classes have unique energy distribution. This distribution is proportional to distance

from mean magnitude spectrum normalized by the standard deviation of the category.

where is average spectrum. Carefully observing the spectral profiles could give distinguishing clues for each semantic scene. For example, ”Coastal” scenes are dominated with horizontal features; therefore, its spectral profiles stretch along vertical axes. Furthermore, spectral profile of ”OpenCountry” categories is biased toward two upper and lower spectrum. Though almost similar to the ”OpenCountry” profile, spectrum of ”Street” images includes more types of features from artificial environment beside horizon-oriented details. Therefore, the diamond of image spectral profile of ”Street” becomes more significant horizontally. ”Mountain” categories with its random scenic details have isotropic spectrum while scenes of streets filled with artificial objects have spectrum stretched in both horizontal and vertical axis. From Urban’s research, spectral energy distribution seems to be important clues for visually perceptual system of human beings.

Beside image classification, the spectral distribution signature is as well useful in visual attention and early visual process. Such energy distribution becomes differentiable clues for features across scales. Spectral profiles of image feature at a particular scale would help differentiate itself from directly upper and lower scale. Lets do an imaginary experiments with a single square input signal defined as follows.

If is filtered by a kernel with kernel size ( 1-D kernel width ) and is much smaller than non-zero period of the given square signal , the response will be just two impulse function at .

For 2-D signal or image context, the above operation corresponds to a classic edge detection phenomena. Though edges and structures plays important roles in visual perception, their information does not sufficiently represent the whole natural scenes. Natural images are rich of other features like texture, flat regions, etc beside edges and corners. As mentioned before, image features can be interpreted in terms of energy distribution. For example, edges are the places of high energy concentration, homogeneous flat regions do not contain much energy while textures, hybrid of edges and flat regions, contains certain amount of energy . If only one window size is used in the analysis, significant responses only come from features or objects which happen to fit into that window size. The other useful features with inappropriate size in accordance with the filter could not be extracted. Therefore, a multi-scale’s approach is extremely necessary in order to identify suitable sizes of kernels or fuse features from different scales together. When mother wavelet is chosen as filtering kernels, window size becomes equivalent to frequency range in the table 1

, and choosing adaptive frequency ranges is important computation task for spatial feature extraction. Inspired by such fundamental query in computer vision, this paper tries to contribute a little insight about how spectral density distribution can characterize features at each scale and how the frequency range of processing can be appropriately chosen for multi-scale feature representation. From the multi-scale features and appropriate scale selection, we can develop computation saliency methods capable of highlighting salient features across scales by using spectral energy distribution.

4 Time-Scale-Energy

PSS estimates information from pixel values, time-domain descriptors by constructing normalized histogram of pixel values as probability distribution.

where is probability of descriptor, the ratio between number of pixels with descriptors and total image pixels N. Lets use square of pixel as weights

Normalized weighted-histogram , of signal can obviously be interpreted as energy density of descriptors

in time domain. By the isometric property of the Fourier transform, the

of energy density distribution can be expressed in frequency domain as well.

Or in joint time and frequency domain.


where is energy density in joint time-frequency representation. Pure time descriptors have perfect localization in time, no localization in frequency , and vice versa for frequency descriptors. Both extreme time or frequency descriptors make interpretation of constructed , and estimated information difficult to explain. Therefore, it is necessary to find a representation of which describes spectral density of local energy. For example, Short-Time Fourier Transform (STFT) is the first-known transform capable of generating spectrogram, a graphical representation of local signal energy in time-frequency plan.

STFT identifies spectral density as well as local energy density or information in a short-time period of the signals. However, it does not much benefit scale saliency unless scale parameters are actually considered as in signal description on phase-space. Fortunately, in recent years, alternative scale-based representation, called wavelet-transform (WT), has been widely addressed among signal processing community, and its fundamental idea is replacing the frequency shifting operation by a time (or frequency) scaling operation , a basic wavelet kernel. Consequently, the energy density in WT framework is formulated as follows.

WT coefficients, , averagely measure spectral density of frequency sub-bands, a short range of frequency, in a short period of time. Characteristics of the time-frequency window are specified by two main parameters, time-shift and scale . As derived from short-time spectral representation, a spectrogram, by utilizing scale operations, its energy density distribution is called scalogram. Given time-scale space, the total signal energy can be rewritten as follows.

and probability of time-scale descriptors can be specified with scale parameters


Generally, the wavelet transform can help generate frequency sub-band coefficients, square of which over total energy are density distribution of that sub-band in time-scale space.

5 Wavelet Transform

After discussing about usefulness of time-frequency-scale representation in the section 3 and its corresponding time-frequency energy distribution, we recognize that wavelet-representation would be ideal candidates for our investigation into energy density distribution and other statistical property across multiple scales of natural images. During the quite short history of wavelet analysis, this research fields have been very fruitful and there are several analysing techniques with wide range of characteristics. In this paper, only standard techniques such as discrete wavelet transform (DWT), discrete wavelet packet transform (DWPT), quaternion wavelet transform with best-basis(QWTBB), and quaternion wavelet packet transform with best-basis. are deployed as possible descriptors. In this subsection, we first look into discrete and real wavelet and wavelet packet transform with best basis (DWT,DWPTPP) for its theoretical background; then QWT and QWPTBB, quaternion versions of two prior wavelet transforms, are considered.

5.1 Discrete Wavelet Transform

Though wavelets were firstly introduced in the early 20th century by Alfred Harr, they are only developed rapidly much later. Only until recently, they have been widely employed in many computer vision problems such as image or video de-noising, enhancement, coding, and pattern classification Unser et al. (2011); Do and Vetterli (2005); Chan et al. (2008) . Signal analysis for frequency components can be achieved by Fourier transform (FT) but FT does not provide suitable tool for time-frequency analysis of images. Short-time Fourier Transform (STFT) is an extension from FT approach for analysing local frequency analysis at a short period of time Mallat (1999). Noteworthy that, STFT can be used for taking the spatial interval in 2-D signal instead of time period in 1-D type since there is no time dimension for still images. However, STFT utilizes fixed window kernels for every data blocks across input signals; this property make STFT less suitable for complex signal analysis, especially signals with strong semantic structures appearing across multiple scales. In other words, STFT only succeeds with signals whose features are embedded in fixed definite temporal or spatial regions or there is prior knowledge about a suitable size of window kernel for STFT processing. Without the above conditions, STFT would totally miss signal features. Theoretically, the disadvantage can be avoided by employing STFT with multiple kernel sizes; however, it raises up another issues such as what range of sizes would be chosen to optimally extract useful features with reasonably computational effort.

Problems of STFT in analyzing local frequency have motivated development of multi-scale wavelet techniques for better local frequency representation. Since limitations of STFT is due to fixed-size processing windows, wavelet analysis deploys multi-resolution filter-banks on input signals. As illustrated in figure 1, 1-D signals are decomposed into low-pass and high-pass components. In case of 2-D input signals, the filtered outputs are four sub-bands: low-low, high-low, low-high, and high-high in regards of processing orientation. Intuitively, 2-D signals analysis includes row-wise 1-D analysis followed by column-wise 1-D analysis or vice verse. With respect to processing direction, high-low sub-band tends to extract horizontal features, low-high sub-band prefers vertical features, high-high sub-band detects diagonal features, and low-low are approximated version of original signal by inverse dyadic scale. Lets assume that input signals are two-dimensional grey-scale image , and the scaled mother wavelets have following mathematical form for vertical, horizontal and diagonal sub-bands and a scaling function for low-resolution signals with . Then, we can represent any images as.



are scaling coefficients and wavelet coefficients from vertical, horizontal and diagonal sub-bands. The parameter S represents the lowest analyzing depth while s is higher decomposing levels in multiple scale-space framework. As mentioned before, 2-D DWT can be obtained by tensors products of 1-D DWT when the orginal image

is analyzed along two dimensions and separately. As a result, the scale function is approximated as and filter-banks of three directional sub-bands are . In the figure 1, discrete wavelet transforms are carried on the sample image in the left hand-side. On the right-hand side contains decomposed results by two levels with three distinctive sub-bands and a down-sampled version of the original image.

(a) Sample Image
(b) Wavelet Decomposition

Noteworthy that, the real-wavelet transform like DWT suffers from shift-variance, a small shift in the signal can greatly change magnitudes of wavelet coefficients around singularities. Furthermore, it has no phase to embed signal location information therefore aliasing effects would be introduced into recovery process. These issues need seriously considering whenever the discrete real wavelet transform is employed. Therefore, modelling statistical property of DWT coefficients’ magnitude across scales might request extra investigation with those draw-backs in mind. Further arguments and details about this matter of DWT descriptors will be discussed in the section


Figure 1: Wavelet (solid line) & Wavelet Packet Decomposition (solid and dash lines) for 1-D signal

5.2 Discrete Wavelet Packet Transform

Well-known DWT can be computed efficiently by an orthonormal FIR conjugate quadrature filter banks including analysis low-pass and high-pass filters ( denoted and respectively. The low-pass coefficients are decomposed recursively for a number of levels, and inverse discrete wavelet transform is calculated by an inverse filter bank. The extension from normal wavelet transform (DWT) to wavelet packets transform (DWPT) is straightforward by an additional step at each processing level. Instead of decomposing only low-pass coefficients in the Low-Low sub-band for 2D input signals, the transform performs decomposition of high-pass coefficients in Low-High, High-Low and High-High sub-bands as well. As a result, all coefficients of DWPT can be neatly arranged in a binary tree and addressed as follows.

where is a analysis depth in the tree, notes the deepest decomposed level, and i is the node index in this depth. With regards to other representations, wavelet packets have advantages in their adaptability to varying statistical structure. Unlike Fourier Transform with one fixed-size base or normal wavelet transform with a fixed number of bases, we may search the “best” orthonormal bases from dictionary of basis acquired after wavelet package decomposition. This idea is initially proposed by Coifman et alCoifman and Wickerhauser (1992) mainly for signal compression. Therefore, this “best” basis is the best in terms of compressing ratio which often desires sparsest representation. In other words, input signals can be characterized by few large coefficients. Supposed the whole best basis operation is denoted as which exhaustively goes through the whole binary tree to look for locations of a set basis with parameters such that the there is a minimum amount of uncertainty measured by Shannon entropy. More details can be found in Coifman’s works Coifman and Wickerhauser (1992), and can be summarized mathematically as follows.

Noteworthy that, sometime “brute-force-attack” every branch of the tree is not possible or feasible due to intensive requirement of computational power. Fortunately, there exist fast algorithms to implement the best basis for given signals. Then, the optimum time-frequency representation can be achieved by tilting the time-frequency plan in accordance with best-basis algorithms. Though the representation may be optimally sparsest in time-frequency domain, whether sparseness of features suitably matches performance of human visual attention is still a question to be answered. To rectify the matter, experiments have been carried out and performance comparison between the DWT case and DWPTBB case is reported in the section 8.

5.3 Quaternion Wavelet Transform

Like wavelet packet transform in the previous discussion, Quaternion Wavelet Transform (QWT) is extended and enhanced to eliminate shift-variance problems from real discrete wavelet transform (DWT). Though there are a few different definitions and implementations of Quaternion Wavelet Bulow and Sommer (2001), the QWT implementation in this paper is inspired by Chan’s research et alChan et al. (2008). Some backgrounds about complex wavelet transform (CWT) with its implementation dual-tree complex wavelet transform (DT-CWT) Kingsbury et alKingsbury (1998) need reviewing before the QWT can be explained and discussed. In discrete wavelet transform, 2-D DWT can be considered as concatenation of two consecutive 1-D DWTs. Though the same process does not exactly happen in 2-D DT-CWT or QWT straightforwardly. A similar concept is used for easily explaining how QWT can be achieved. It means 1-D DT-CWT will be elaborated first; then, we will discuss about 2-D QWT signals and how it may handle processing along different orientations and sub-bands.

Real DWT have well-known drawbacks in terms of shift-invariance and phases to encode coefficient locations. Kingsbury et alKingsbury (2001) reckons problems and proposes an dual-tree CWT as a specific solution. Rationale of the dual-tree approach is usage of complex numbers for wavelet coefficients which directly tackles one of two DWT’s dragging problems. Complex extension of wavelet transform makes phase extraction from wavelet coefficients possible since complex wavelet transforms have both real and imaginary values unlike real DWT with only one real value for each coefficient. Real and imaginary components of the dual-tree CWT are generated by two sets of wavelet and scaling functions and . Moreover, filter-banks and have to be independent and orthogonal as shown in the figure 2.

Figure 2: Illustration of DTCWT

The notations and are denoted for scaling and wavelet functions corresponding to filter-banks . In addition,and with denotes first set of DTCWT coefficients. Similar notations are used for the second set of scaling and wavelet functions and with filter banks and according coefficients and with . Wavelet functions and forms two binary trees in figure 2 and at leaves of each tree is the real and imaginary parts of a complex analytic wavelets.

Moreover, the imaginary wavelet is 1-D Hilbert Transform of the real-wavelet :

Any complex wavelet coefficient are formed by wavelet coefficients of two other real wavelet transforms; therefore, this combination generates a 2 x redundant tight frame. This redundancy in complex wavelet frames prevents non-oscillating magnitudes of coefficients around singularity points as well makes the transform near-shift invariant. Furthermore, there is no energy ( or little energy in practice ) in the negative region of frequency because of a relationship between two wavelet functions and .



Thus, the Fourier transform of complex wavelet transform has no energy in the negative frequency region. It makes DTCWT an analytic wavelet transform with analytic output signals. Due to this analyticity, the dual-tree wavelet transform has implicitly managed to include all information in the half positive plan of the frequency domain.

It is quite straight forward for 2-D DWT expansion from 1-D DWT, discussion in the previous subsection 5.1. However, it is unfortunately not easy for similar expansion from 1-D DTCWT to 2-D DTCWT transforms because Hilbert Transform (HT) and analytic signals need an theoretical extension for 2-D signals. Furthermore, there exists not only one but several definitions which define different zero-out regions ( negative frequency domain in 1D DT-CWT) , signal-power regions (positive frequency domain in 2D DT-CWT). In this paper, we only focus on Bulow definition Bulow and Sommer (2001) of analytic quaternion signals which combines both partial and total Hilbert transform (HT) . Partial HTs are done along either or directions only; meanwhile, total HT is carried out on both directions simultaneously. They are defined as following formula.

The are partially Hilbert transformed along and axis consequently, and is total HT; while denotes 2-D convolution. Each 2-D CWT basis is a complex analytic function, computationally equivalent to a product of two 1-D complex wavelet functions either along only one or both axis. Similar to expansion of discrete real wavelet, the diagonal sub-band wavelet is defined as . Other total and partial HT are products of coefficients from different sets of wavelet functions deployed in the 1-D CWT implementation.

To unify all different Hilbert Transform in a meaningful and compact representation, we can utilize quaternion algebra and treat as a real part and as three imaginary components Chan et al. (2008).

More details about theory behind QWT and its special characteristics such as its singular cases, three phases, and zero-out regions can be found in Chan etal ’s and Bulow ’s publications Chan et al. (2008); Bulow and Sommer (2001). Resting on form of the above quaternion wavelet transformation, we can organize four quadrant components of 2-D wavelet as a quaternion. Lets take a example of diagonal signals with following quadrant components.

We can have a diagonal quaternion wavelet functions for the diagonal sub-band mathematically defined as follows.

To compute the QWT coefficients, we can use proposals of a separable 2-D implementation Kingsbury (2001) of dual-tree filter-banks previously illustrated in the figure 2. At each filtering stage, both two-sets of wavelet filters and are independently applied to each dimension and of a 2-D image. For example, the filter-bank is applied along both axis; then, it yields the scaling coefficients and three diagonal,vertical and horizontal wavelet coefficients and respectively as shown in the figure 3.

Figure 3: Illustration of 2D dual-tree complex wavelet transform

Dual-tree implementation of two separated filter-banks for 1-D signal can be considered as four independent filter banks for 2-D signals according to all possible combinations of filter for one dimension . With these combinations of filters and corresponding wavelet functions and are generated four components of quaternion wavelet transform for horizontal, vertical, and diagonal sub-bands. Four different wavelet coefficients from these filter banks are arranged by quaternion algebra to obtain QWT coefficients. For example, a coefficient from diagonal wavelet sub-band of QWT can be written in terms of responses from independent filter-banks as follows.

So far, we have taken a diagonal sub-band as example for showing how QWT can be computed. The construction and properties for other two sub-bands are similar to what have been done for diagonal sub-bands. Except that the axis combinations results in a horizontal sub-band or for a vertical sub-band instead of a diagonal sub-band . In summary, QWT at each stage sports three quaternion sets corresponding to three sub-bands; each quaternion contains four wavelet functions. Therefore, there are 12 functions in total which can be easily seen as matrix of functions as follows.

Columns of the above matrix correspond to quaternion wavelet functions of the horizontal sub-band , the vertical sub-band , and diagonal sub-band from left-to-right respectively. The three according wavelet coefficients are , and and the operator means formation of quaternion number by coefficients along each column. Though quaternion wavelet coefficients possess rich phase information, our research currently focuses on magnitudes of each wavelet sub-bands. Therefore, magnitudes of horizontal, vertical sub-bands can be computed according to quaternion magnitude formula as follows.

While the final approximated version of input signals, which are not decomposed further by the transform, have its magnitude computed by quaternion algebra.

5.4 Quaternion Wavelet Packet Transform

To construct a packet form of QWT, each and every sub-band should be repeatedly decomposed by low-pass and high-pass filters . Bayram et alBayram (2008) has investigated into formation of wavelet packets for DT-CWT, an equivalent form of QWT. In order to get an analytic quaternion wavelet packet, the filter banks need to be chosen in a specific way such that the Hilbert transform relationship is preserved. In Bayram’s works Bayram (2008), the analytic wavelet transformed can be achieved if whatever filter-bank is used to decompose the first filter-bank of QWT should also be used for the second (dual ) filter-banks. Another important point about the extension to wavelet packet QWPT is the choice of the extension filters It has been found that the only necessary constrain to preserver the Hilbert transform property is forcing the usage of the same filter-pairs

in both filter-banks of QWT or DT-CWT . Therefore, any CQF pair of filter-banks with short support, frequency selectivity or possessing a number of vanish moments can be candidates for the extension filter. Noted that, the above criteria such as CQF pair of filters have been employed for extending a regular DWT. Like other derivatives of DT-CWT or QWT, the quaternion wavelet packet transform (QWPT) are approximately shift-invariant, which means the energy in each sub-band is approximately preserved if the input signals are shifted by a number of samples. Noteworthy that, there are other methods beside QWPT with shift-invariant property in wavelet pack decomposition. For example, by performing an exhaustive search over all shifted wavelet packet bases to find the “best basis” according to a certain cost function

Cohen et al. (1997), the orthonormal wavelet packet transform becomes shift-invariant in a sense that energy in each sub-band is invariant to transition of input signals . This (approximately) shift-invariance property becomes very useful and important in the search for a suitable energy descriptors. This shift-invariance property guarantees that DT-CWT, DT-CWPT or QWT, QWPT would have energy descriptors robust to certain amount of affine transformation in input signals.

Interactions of filtering both low and high components at each stage of DT-CWT introduces a complete structures of all possible sub-bands that can be generated by the filter-bank pair. Each tree forms a unique frequency profile of input signals. Among those countless numbers of possibilities, there exist a frequency decomposition being more sparse and compact than the others. It is called the best-basis in terms of representing the input signals with fewest wavelet coefficients. A fast algorithm for indicating such best basis has been reported in extension from DWT to DWPT Coifman and Wickerhauser (1992). In addition, it is previously mentioned in the section 5.2, the same strategy can be adopted for searching best-basis in QWT. In brief, the approach is looking for a path in a binary of decomposition to minimize a Shannon entropy cost function; more details can be found in the work of Coifman et alCoifman and Wickerhauser (1992). After “best basis” searching for QWPT decomposition, we can identify magnitudes and energy of coefficients at a specific location by a simple quaternion algebra. .

6 Wavelet Coefficients Correlation

The previous section 4

have discussed the potential of using energy density distribution of localized time-scale element or wavelet elements instead of pixel-value probability distribution. Only general 1-D signal is considered and these elements are assumed to be independent or at least linearly independent (uncorrelated); however, this assumption only works for random variables as input signals. Practically, except total noise, any meaningful signals often has specific structures persistent across multiple time-scale element in 1-D case. For 2-D signals like natural images, an additional orientation needs considering; in other words, their wavelet coeffcients are highly statistically related across scales, orientation, spaces. This phenomenon is systematically studied and confirmed in Azimifar

et alresearch Azimifar et al. (2011). The author has conducted an empirical study of joint wavelet statistics for texture and natural images to investigate correlation relationship between neighbouring coefficients. Examination of these dependencies helps propose appropriate models for such a transform-domain algorithm. Though Azimifar’s work Azimifar et al. (2011) only covers linear dependencies and just a squint on non-linear relations, its proposals are evaluated on a collection of 5000 real images. Therefore, we believe her conclusion in that study is generally true at least for natural images, the main researching objects. In brief, there exists a few elementary correlation relationships as follows.

  • The spatially-localized and sparse correlation structure has a clear persistence across scales.

  • Every coefficient exhibits correlations extending across multiple scales, with spatially near neighbors both within and across orientations.

  • A subband coefficients at the same spatial locations but from different orientations are not linearly correlated.

  • Within-subband, inter-subband, and inter-scale correlations are highly oriented and persistent across local neighbors of its parent.

The below figure 4 clearly illustrates all mentioned correlations, their preferences to locality as must be expected. This locality increases toward finer scales, which supports persistency property of wavelet coefficients. A single coefficient correlates with its parents as well as neighbors across orientations and scales.

Figure 4: Illustration of wavelet coefficients inter-band and intra-band correlation Azimifar et al. (2011)

Among several mentioned statistical dependencies, the most vital findings for our work are uncorrelated siblings coefficients across orientation and strong correlated coefficients across scales since it theoretically allows uncertainty and mutual information estimation of a 2D time-scale element or a wavelet sub-band energy descriptor. To elaborate this point, lets consider two adjacent scales and their corresponding coefficients of horizontal, vertical, diagonal orientation for 2-D signals . Due to non-correlation of sibling coefficients across orientation, it is possible to consider three wavelet coefficients as a multivariate variable with uncertainty estimation by energy density distribution across three orientations . For two adjacent sub-band , there are two multivariate variables and corresponding two entropy values and , and the mutual information between two variables due to inter-scale dependencies between correspondent wavelet coefficients are computed as follows.


From that basic observation about inter-scale and intra-scale wavelet coefficients of natural images is developed the core idea of our proposal. More details about sub-band energy descriptors and how to measure their uncertainty and mutual information will be clearly explained in following sections 6.1, 6.2

6.1 Interscale Subband Energy Descriptor

Interesting relationship between basis-project methods and scale saliency are repeatedly discussed in several publicationsKadir and Brady (2001), Kadir et al. (2003). Kadir Kadir and Brady (2001) actually discusses about behaviours of non-saliency and saliency regions in spectral and wavelet domain. A simple, flat, non-salient regions or images is sufficiently described by a single sub-band; meanwhile, complicated data and structure regions require more sub-bands descriptors. This directly introduces basis-projected sub-bands as potential alternative descriptors. Like pixel-value descriptors, real wavelet sub-bands must be treated as discrete variables due to its theoretical restriction, data-analysis uncertainties, . In other words, it is impossible for continuous wavelet sub-bands distribution at any specific location. Following available mathematical definition of PSS for discrete pixel descriptors, we sketch rough mathematical models of WSS with discrete sub-band descriptors, , in the equations 7, 8, 9, 10 whereof , are a element and set of sub-band descriptor consecutively.


However, a general concept of sub-band descriptor is not useful in actual computation; therefore, an appropriate numerical attribute of sub-bands need proposing instead. Lets consider 2-D discrete real-wavelet transform with three sub-bands vertical (), horizontal () and diagonal () sub-bands at each particular dyadic scale represented by three set of wavelet coefficients () accordingly in the equation 11. Equation 12 uses those coefficients to compute sub-band energy densities as descriptors () for Wavelet-domain Scale Saliency (WSS).


In the standard real discrete wavelet transform (DWT), there are fixed three analysed sub-bands for each dyadic sampling step. Supposedly the maximum level of wavelet decomposition is , the number of dyadic scales is with 3 sub-bands for each scale. With 4 or 5 as the usual number of decomposition levels, totally around 12 or 15 sub-bands descriptors are analysed for an image. This number of descriptors is significantly less than 255 pixel-value descriptors of PSS for any grey-scale image.

Besides wavelet transforms, different other types of basis projection techniques could also be utilized; for example, best basis wavelet packet analysis (DWPTBB). The full wavelet packet transform breaks signals into sub-bands with the same bandwidth at the maximum dyadic scale. It would not fit into the scale saliency concept which requires descriptors at different scales. Fortunately, the ”balanced” full wavelet packet tree usually over-describes image properties, and the description can be optimized by Best Basis () finding operation. The optimized wavelet packet tree often has projected basis across dyadic scales since some small image details are best described with a basis at finer resolution while other big details prefer another basis with coarser resolution. The DWPTBB coefficients are utilized for sub-band energy density , the proposed image descriptors, calculation in the equations 13, 14.


Comparing mathematical statements 11,12 and 13,14 for sub-bands descriptors of Discrete Wavelet Transform (DWT) and Discrete Wavelet Packet Transform Best Basis (DWPTBB) consecutively, we can see their fundamental differences. While DWT provides determinant basis-projection methods with pre-computed basis and fixed structure of sub-bands, DWPTBB adapts itself into each data set. Then, its number and structure of sub-bands are specified by Best Basis () finding operator Coifman and Wickerhauser (1992). It requires more operations; however, more faithful and adaptive descriptors can be achieved.

Both DWTBB and DWT are popular wavelet-transforms; however, they both depend on shift-variant real discrete wavelet transforms. It means that projection of coefficients not only depends on data but also its relative location on the scene.


As the fourth criteria for good information measurement of Starck et alStarck and Murtagh (1999) states that entropy must work in the same way regardless of descriptors’ locations. Both DWT and DWTBB projected descriptors do not satisfy that condition since usages of these descriptors might lead to different information estimation for identical data at two different locations. The shift-variance of real-wavelet transform can be avoided by complex wavelet transform design; for instances, recently developed dual-tree complex wavelet transform (DTCWT) Selesnick et al. (2005), Quaternion wavelet transform (QWT) Chan et al. (2008), or dual-tree complex wavelet packet transform with best-basis (DTCWTBB) Bayram (2008). General formula of complex coefficients and their corresponding sub-band energy density are summarized in the equations 17,18.


Dual-tree approaches use two different wavelet filter-banks ,, and they are designed to form analytical complex filter banks, . The magnitudes of projected-complex coefficients are proven to be shift-invariant; therefore, its derived energy density of the sub-bands is as well shift-invariant. Probably, the quaternion version of wavelet transform (QWT) and quaternion wavelet packet transform best basis (QWTBB) with shift-invariant property would provide better descriptors than their real counterparts according to five criteria of Starck Starck and Murtagh (1999).

6.2 Intra-scale Subband Energy Descriptor

As previously mentioned in the section 6, there is strong correlation or statistic linear dependence between wavelet coefficients in natural images. The first correlation, the inter-scale dependencies, has been discussed in the section 4,6, and modeled as sub-band descriptors in the previous section 6.1. Moreover, the relation have been widely and effectively employed in various tree-structured coding techniques such as SPIHT Buccigrossi and Simoncelli (1999). Besides inter-scale relationship, many authors Do and Vetterli (2005) have pointed out another strong correlation of intra-band coefficients existing across many different types of natural scenes . Minh Do and M Vetterli Do and Vetterli (2005)

successfully modelled coefficients of a wavelet sub-band with a simple explicit mathematical form, Generalized Gaussian Distribution (GGD). While statistical distribution of wavelet coefficients gets a lot of interest, several researches have proposed different mathematical models for analysing this statistical characteristic. However, few models of wavelet coefficients marginal density at a particular sub-band works better than GGD in terms of accuracy, approximation and simplicity. After such the distribution is widely observed in experimental data with natural images, theoretical analysis on the plausibility of modelling by the GGD distribution is defined as follows.


is the Gamma distribution. Here

dictates the scale parameter or variance of the distribution, and controls shapes. For example, GGD with is Gaussian distribution; it becomes Laplacian distribution with .

7 Information Measurement

In the previous section 6.1, four different wavelet transforms generate corresponding wavelet sub-band energy density descriptors. From those energy density, energy probability distribution function () at each scale can be computed as follows.

The above formula computes the probability of energy density at one location across different sub-bands, for (DWT) or (QWT) or , for (DWPTBB) and (QWPTBB) from the smallest scale, , to currently considered scales, . The first level uses the smallest sampling window size of wavelet atoms; therefore, it generates analysed coefficients with finest details. Then, the sampling window sizes are doubled after each level; they generate coarser analysed details. It is quite similar to PSS sampling operations except that scales are doubled rather than increased by a unit. Like of PSS descriptor, WSS descriptors are distributed with increasing scales of j, from level ( smallest wavelet atom ) to level ( currently biggest wavelet atom). From the equation 7, it is straightforward to compute feature-space entropy as follows whereof is shorted as .


Both entropy of PSS’s descriptors and the above entropy formula for the proposed descriptor only summarizes statistical property in local spatial regions since both considering window sizes in PSS and scale levels of wavelet decomposition are finite. Then, it lacks involvement of energy distribution in the whole image and it is confirmed that such distribution is vital for natural image and texture modeling Do and Vetterli (2005). As presented in the sub-section 6.2 is the Generalized Gaussian Distribution of coefficients magnitudes from a wavelet intra-band.


In order to combine both global and local characteristics into a single value, we propose cross-entropy between inter-band and intra-band distribution as an alternative formulation of the equation 19.


To distinguish between two modes of entropy computation, we names the local entropy by the equation 19 as ”observer” mode, and the cross-entropy involving both local and global statistics as ”searcher” mode. In later formula, when general entropy symbol without specific subscripts appears in any formulas, it means both modes are eligible for those equations. Those names also help to distinguish different parameters and simulation modes presented in the experimental sections 8.2.

The equation 19 computes feature-space entropy of sub-band energy descriptors for WSS as the equation 8 does for PSS. Half of scale saliency measure, feature-space entropy, has been figured out for sub-band energy density descriptors. The other half of the problem rests in computational details of inter-scale saliency; in other words, how the equation 9 should be interpreted with the proposed descriptors. In equation 8, the inter-scale saliency is measured as total variation in probability distribution of descriptors at two consecutive scales in which pixel-value descriptors () appear in both distributions, it complicates the problem. However, the situation is different for wavelet sub-band energy density descriptors since each sub-band in the current level is unique for this level only. It does not appear in other levels of analysis. This wonderful property simplifies out task in building sub-band probability distribution for different levels but makes the equation 3 inappropriate for sub-band features. Since it is unjustifiable to find total variation of two on two different set of descriptors, an alternative interpretation of inter-scale saliency need developing. Lets consider , of all sub-bands up to the current level, . When a new analysed sub-band, , is generated, this sub-band descriptor will modify the current into

. The distance between the prior model and the modified model can be measured by Kullback-Leibler divergence as follows.


Noteworthy, it is similar to Itti’s Bayesian Surprise Saliency (BSS) metric Baldi and Itti (2010), and the surprise model can be extended for multiple sub-bands descriptors or evidences in BSS. The equation 22 becomes mutual information between the current model and a set of new evidences. In other words, the expectation of surprise for adding new sub-bands into the current model is the mutual information between new sub-bands and the current model, shown in the equation 23.


Therefore, mutual information is chosen as inter-scale saliency for successive dyadic scales since it actually implies averaged ”bayesian surprise” Baldi and Itti (2010) saliency of sub-bands across scales. Furthermore, mutual information as inter-scale saliency measurement well emphasizes the structural coherence of data across scales. If there are useful structures such as edges or joints and they are consistent across consecutive scales, they will increase mutual information between two consecutive scales. Otherwise noises have no mutual information across scales as its self-information is zero, . It is remarkable that mutual information satisfies the fifth criterion of the good information estimation by Starck et alStarck and Murtagh (1999). The only remaining step is identifying how the mutual should be calculated in discrete cases. Following formula shows relation between mutual information and entropy.


The mutual information can be directly calculated as difference between separated () and joint () entropy estimation of the current energy descriptors (the current model) and the next-level sub-bands, the equation 24. While the entropy elements can be easily estimated by simple mathematical equations 25,26,27. The joint entropy can be reused as for the next level inter-scale saliency estimation because of the sub-band descriptors uniqueness. The scale saliency principles on wavelet-domain sub-band energy descriptors are summarized in the equation 28 as product of maximum feature-space saliency and inter-scale saliency, or product of mutual information between consecutive levels and maximum sub-band entropy.


The characteristic scale is chosen to maximize information of the model . Lets imagine the case prior scale contains only noise meanwhile later scales actually contain useful structures of images. With bias of Shannon entropy toward noise, the characteristic scale fails to enclose any useful structure. To overcome this drawback, we propose alternative approach, DIS to differentiate from the original strategy , in which is selected so as to maximize inter-scale saliency or average ”Bayesian surprise”. DIS principles can be summarized as follows.

Experiments with DIS and WSS are carried out and simulations results are detailed in the next section in order to confirm effectiveness of the proposed strategy.

8 Discussion & Results

The previous sections 6.1 and 7 have analysed theoretical advantages of WSS and its derivative DIS. In addition, the subsection 6.1 present four descriptors based on different wavelet transforms: DWT, QWT, DWPTBB, and QWPTBB. Accordingly, we have several derivatives for the proposed method according to specific choices of scale section mechanisms and sub-band descriptor. To evaluate them against other saliency approaches, they are compared with PSS Kadir and Brady (2001), and the de-facto ITT model Itti et al. (1998). The purpose of comparisons are not for claiming the best saliency method or racing toward the highest possible evaluating measurement; it just proves the rationale of the assumption that feature and structural complexity would be a good clues for human attention. The best evaluation measurement reported does not necessarily mean the best saliency maps since it much depends on choices of databases, elimination of experimental bias, performance of human test subjects, etc. Moreover, a standardized evaluation process in saliency map evaluation is far from being reached since several researchers choose different database and measurement methods or even create their own. In our research, we focus on the effectiveness of information measurement in visual attention; then, the most common processes and databases would be chosen to confirm generalization of the assumption.

In line of searches for informative clues , Bruce and Tsotsos Bruce and Tsotsos (2009) database is certainly among the popularly used stimuli. However, only Bruce’s database is certainly not enough due to limits in numbers and contents of stimuli. Then, Kootstra’s database Kootstra et al. (2011) are chosen for extra testing samples and ground-truths. Two database with over 200 samples with ground-truths provided by more than 50 human subjects would help to confirm the generalization of our proposed framework to a certain extent. Similarly, only common evaluation approaches are deployed in our studies, and they can be categorized into either quantitative or qualitative methods. Quantitative relations between different saliency methods and human visual performance are shown by appropriate statistical methods (AUC,NSS) with eye-tracking data as ground-truths. Meanwhile, the qualitative results, visual comparisons of different saliency maps, gives a glimpse about performance for each individual sample. It also specifies imaging contexts where saliency methods give reasonable solution as well as situations where saliency maps are unreasonable to human perception.

8.1 Databases of image stimuli

The ground-truth and data for basic evaluations of visual saliency performance is got from eye-tracking experiments. Specially in