I Introduction
Fusion of multisensor images has been explored during recent years and is still a very active research area [2]. A popular fusion problem in remote sensing consists of merging a high spatial resolution panchromatic (PAN) image and a low spatial resolution multispectral (MS) image. Many solutions have been proposed in the literature to solve this problem, known as pansharpening [2, 3, 4, 5].
More recently, hyperspectral (HS) imaging, acquiring a scene in several hundreds of contiguous spectral bands, has opened a new range of relevant applications such as target detection [6] and spectral unmixing [7]. However, while HS sensors provide abundant spectral information, their spatial resolution is generally more limited [8, 9]. To obtain images with good spectral and spatial resolutions, the remote sensing community has been devoting increasing research efforts to the problem of fusing HS with MS or PAN images [10, 11, 12, 13]. From an application point of view, this problem is also important as motivated by recent national programs, e.g., the Japanese next-generation spaceborne hyperspectral image suite (HISUI), which fuses coregistered MS and HS images acquired over the same scene under the same conditions [14].
The fusion of HS and MS images differs from traditional pansharpening since both spatial and spectral information is contained in multiband images. Therefore, many pansharpening methods, such as component substitution [15] and relative spectral contribution [16], are inapplicable or inefficient for the HS/MS fusion problem. Since the fusion problem is generally ill-posed, Bayesian inference offers a convenient way to regularize it by defining an appropriate generic prior for the scene of interest. Following this strategy, Gaussian or ℓ2-norm priors have been considered to build various estimators, in the image domain
[17, 18, 19] or in a transformed domain [20]. Recently, the fusion of HS and MS images based on spectral unmixing has been explored [21, 22, 23, 24]. Besides, sparse representations have received considerable interest in recent years, exploiting the self-similarity properties of natural images [25, 26, 27, 28]. Using this property, a sparse constraint has been proposed in [29, 30, 31]
to regularize various ill-posed super-resolution and/or fusion problems. The linear decomposition of an image using a few atoms of a redundant dictionary learned from this image (instead of a predefined dictionary, e.g., of wavelets) has recently been used for several low-level image processing tasks such as denoising
[32] and classification [33], demonstrating the ability of sparse representations to model natural images. Learning a dictionary from the image of interest is commonly referred to as dictionary learning (DL). Liu et al. recently proposed to solve the pansharpening problem based on DL [5]. DL has also been investigated to restore HS images [34]. More precisely, a Bayesian scheme was introduced in [34] to learn a dictionary from an HS image, which imposes a self-consistency of the dictionary by using Beta-Bernoulli processes. This Monte Carlo-based method provided interesting results at the price of a high computational complexity. Fusing multiple images using a sparse regularization based on the decomposition of these images into high and low frequency components was considered in [31]. However, the method developed in [31] required a training dataset to learn the dictionaries. The references mentioned above proposed to solve the corresponding sparse coding problem either by using greedy algorithms such as matching pursuit (MP) and orthogonal MP [35] or by relaxing the ℓ0 norm to the ℓ1 norm to take advantage of the least absolute shrinkage and selection operator (LASSO) [36]. In this paper, we propose to fuse HS and MS images within a constrained optimization framework, by incorporating a sparse regularization using dictionaries learned from the observed images. Knowing the trained dictionaries and the corresponding supports of the codes circumvents the difficulties inherent to the sparse coding step. The optimization problem can then be solved by alternately optimizing with respect to (w.r.t.) the projected target image and the sparse code. The optimization w.r.t. the image is achieved by the split augmented Lagrangian shrinkage algorithm (SALSA) [37], which is an instance of the alternating direction method of multipliers (ADMM).
By a suitable choice of variable splittings, SALSA makes it possible to decompose a huge non-diagonalizable quadratic problem into a sequence of convolutions and pixel-decoupled problems, which can be solved efficiently. The coding step is performed using a standard least-squares (LS) algorithm, which is possible because the supports have been fixed a priori.
The paper is organized as follows. Section II formulates the fusion problem within a constrained optimization framework. Section III presents the proposed sparse regularization and the method used to learn the dictionaries and the code support. The strategy investigated to solve the resulting optimization problem is detailed in Section IV. Simulation results are presented in Section V whereas conclusions are reported in Section VI.
II Problem Formulation
II-A Notations and observation model
In this paper, we consider the fusion of hyperspectral (HS) and multispectral (MS) images. The HS image is supposed to be a blurred and downsampled version of the target image whereas the MS image is a spectrally degraded version of the target image. Both images are contaminated by white Gaussian noise. Instead of resorting to the totally vectorized notations used in [17, 20, 18], the HS and MS images are reshaped band-by-band to build the matrices Y_H (of size m_λ × m) and Y_M (of size n_λ × n), respectively, where m_λ is the number of HS bands, n_λ is the number of MS bands, n is the number of pixels in each band of the MS image and m is the number of pixels in each band of the HS image. The resulting observation models associated with the HS and MS images can be written as follows [17, 38, 39]

(1)  Y_H = XBS + N_H,  Y_M = RX + N_M
where

X ∈ R^{m_λ × n} is the full resolution target image with m_λ bands and n pixels,

Y_H ∈ R^{m_λ × m} and Y_M ∈ R^{n_λ × n} are the observed HS and MS images, respectively,

B ∈ R^{n × n} is a cyclic convolution operator acting on the bands,

S ∈ R^{n × m} is a downsampling matrix (with downsampling factor denoted as d),

R ∈ R^{n_λ × m_λ} is the spectral response of the MS sensor,

N_H and N_M are the HS and MS noises.
Note that B is a sparse symmetric Toeplitz matrix for a symmetric convolution kernel, and m = n/d², where d is an integer standing for the downsampling factor. In this work, each column of the noise matrices N_H and N_M is assumed to be a band-dependent Gaussian noise vector, i.e., n_{H,i} ~ N(0_{m_λ}, Λ_H) and n_{M,i} ~ N(0_{n_λ}, Λ_M), where 0_a is the a × 1 vector of zeros, and Λ_H = diag(s²_{H,1}, …, s²_{H,m_λ}) and Λ_M = diag(s²_{M,1}, …, s²_{M,n_λ}) are diagonal covariance matrices. Note that the Gaussian noise assumption used in this paper is quite popular in image processing [40, 41, 42] as it facilitates the formulation of the likelihood and the associated optimization algorithms. By denoting the Frobenius norm as ‖·‖_F, the signal to noise ratios (SNRs) of the ith band of the two images (expressed in decibels) are defined as

SNR_{H,i} = 10 log₁₀( ‖(XBS)_i‖²_F / (m s²_{H,i}) ),  SNR_{M,i} = 10 log₁₀( ‖(RX)_i‖²_F / (n s²_{M,i}) ).
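As an illustration, the degradation model (1) can be sketched in NumPy; all sizes, the box-blur kernel, and the spectral response R below are hypothetical toy choices, not those of the experiments.

```python
import numpy as np

# Toy sketch of the observation model (1): Y_H = XBS + N_H, Y_M = RX + N_M.
rng = np.random.default_rng(0)
m_lambda, n_lambda = 8, 3        # numbers of HS and MS bands (toy values)
rows = cols = 16                 # spatial size of one band, n = rows * cols
n, d = rows * cols, 4            # d: downsampling factor, m = n / d**2
X = rng.random((m_lambda, n))    # target image, one band per row

# B: cyclic convolution acting on each band (here a 3x3 box blur, applied
# in the Fourier domain, which is what makes the later ADMM steps cheap).
kernel = np.zeros((rows, cols))
kernel[:3, :3] = 1.0 / 9.0
def blur(band):
    return np.real(np.fft.ifft2(np.fft.fft2(band.reshape(rows, cols))
                                * np.fft.fft2(kernel))).ravel()

# S: downsampling keeping one pixel out of d in each direction.
def downsample(band):
    return band.reshape(rows, cols)[::d, ::d].ravel()

# R: spectral response of the MS sensor (each MS band averages 3 HS bands).
R = np.zeros((n_lambda, m_lambda))
for k in range(n_lambda):
    R[k, 2 * k:2 * k + 3] = 1.0 / 3.0

sigma_h, sigma_m = 0.01, 0.01    # band-independent toy noise levels
Y_H = np.array([downsample(blur(x)) for x in X]) \
      + sigma_h * rng.standard_normal((m_lambda, n // d**2))
Y_M = R @ X + sigma_m * rng.standard_normal((n_lambda, n))
```

Here m = n/d² = 16 pixels per HS band, matching the dimension bookkeeping above.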
II-B Subspace learning
The unknown image is X = [x_1, …, x_n], where x_i is the m_λ × 1 vector corresponding to the ith spatial location (with i = 1, …, n). As the bands of the HS data are generally spectrally dependent, the HS vector x_i usually lives in a subspace whose dimension m̃_λ is much smaller than the number of bands m_λ [43, 44], i.e.,

(2)  x_i = H u_i

where u_i is the projection of the vector x_i onto the subspace spanned by the columns of H ∈ R^{m_λ × m̃_λ} (H is an orthogonal matrix, i.e., Hᵀ H = I_{m̃_λ}). Using the notation U = [u_1, …, u_n], we have X = HU. Moreover, U = Hᵀ X since H is an orthogonal matrix. In this case, the fusion problem (1) can be reformulated as estimating the unknown matrix U from the following observation equations

(3)  Y_H = HUBS + N_H,  Y_M = RHU + N_M.
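A PCA-based construction of the subspace matrix H (the approach adopted later in the experiments) can be sketched as follows; the sizes and the synthetic low-rank data are hypothetical.

```python
import numpy as np

# Sketch: learn H by diagonalizing the empirical covariance of the HS
# pixel vectors and keeping the leading eigenvectors (toy low-rank data).
rng = np.random.default_rng(4)
m_lambda, n_pix, m_sub = 20, 500, 5
basis = rng.standard_normal((m_lambda, m_sub))
Y = basis @ rng.standard_normal((m_sub, n_pix)) \
    + 1e-3 * rng.standard_normal((m_lambda, n_pix))   # approx. 5-dim data

Yc = Y - Y.mean(axis=1, keepdims=True)
cov = Yc @ Yc.T / n_pix                # empirical covariance matrix
eigval, eigvec = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]
H = eigvec[:, :m_sub]                  # subspace transform matrix
ratio = eigval[:m_sub].sum() / eigval.sum()   # variance captured
U = H.T @ Y                            # projected image, m_sub x n_pix
```

Because the columns of H are orthonormal eigenvectors, Hᵀ H is the identity, matching the orthogonality assumption above.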
The dimension of the subspace is generally much smaller than the number of HS bands, i.e., m̃_λ ≪ m_λ. As a consequence, inferring U in the subspace greatly decreases the computational burden of the fusion algorithm. Another motivation for working in the subspace associated with H is to bypass the possible matrix singularity caused by the spectral dependency of the HS data. Note that each column of the orthogonal matrix H can be interpreted as a basis vector of the subspace of interest. In this paper, the subspace transform matrix H has been determined from a principal component analysis (PCA) of the HS data (see step 7 of Algorithm 1).

III Proposed Fusion Rule for Multispectral and Hyperspectral Images
III-A Ill-posed inverse problem
As shown in (3), recovering the projected high-spectral and high-spatial resolution image U from the observations Y_H and Y_M is a linear inverse problem (LIP) [37]. In most single-image restoration problems (using either Y_H or Y_M), the inverse problem is ill-posed or under-constrained [29], which requires regularization or prior information (in Bayesian terminology). However, for multi-source image fusion, the inverse problem can be ill-posed or well-posed, depending on the dimension of the subspace and the number of spectral bands. If the matrix RH has full column rank and is well conditioned, which is seldom the case, the estimation of U according to (3) is an overdetermined problem instead of an underdetermined problem [45]. In this case, it is redundant to introduce regularizations. Conversely, if there are fewer MS bands than the subspace dimension (e.g., when the MS image degrades to a PAN image), the matrix RH cannot have full column rank, which means the fusion problem is an ill-posed LIP. In this paper, we focus on the underdetermined case. Note however that the overdetermined problem can be viewed as a special case with a regularization term set to zero. Another motivation for studying the underdetermined problem is that it includes an archetypal fusion task referred to as pansharpening [2].
Based on the model (3) and the noise assumption, the distributions of Y_H and Y_M are

(4)  Y_H | U ~ MN_{m_λ,m}(HUBS, Λ_H, I_m),  Y_M | U ~ MN_{n_λ,n}(RHU, Λ_M, I_n)

where MN represents the matrix normal distribution. The probability density function of a matrix normal distribution MN_{r,c}(M, Σ_r, Σ_c) is defined by

p(X | M, Σ_r, Σ_c) = exp( −(1/2) tr[ Σ_c⁻¹ (X − M)ᵀ Σ_r⁻¹ (X − M) ] ) / ( (2π)^{rc/2} |Σ_c|^{r/2} |Σ_r|^{c/2} )

where M is the mean, and Σ_r and Σ_c are two matrices denoted as the row and column covariance matrices.
According to Bayes' theorem and using the fact that the noises N_H and N_M are independent, the posterior distribution of U can be inferred as

(5)  p(U | Y_H, Y_M) ∝ p(Y_H | U) p(Y_M | U) p(U).
In this work, we compute the maximum a posteriori (MAP) estimator in an optimization framework to solve the fusion problem. Taking the negative logarithm of the posterior distribution, maximizing the posterior distribution is equivalent to solving the following minimization problem

(6)  min_U (1/2) ‖Λ_M^{−1/2}(Y_M − RHU)‖²_F + (1/2) ‖Λ_H^{−1/2}(Y_H − HUBS)‖²_F + λ φ(U)

where the first two terms are associated with the MS and HS images (data fidelity terms) and the last term is a penalty ensuring appropriate regularization. Note that λ is a parameter adjusting the importance of the regularization w.r.t. the data fidelity terms. It is also noteworthy that the MAP estimator is equivalent to the minimum mean square error (MMSE) estimator when φ(U) has a quadratic form, which is the case in our approach detailed below.
III-B Sparse Regularization
Based on the self-similarity property of natural images, modeling images with a sparse representation has been shown to be very effective in many signal processing applications [46]. Generally, an overcomplete dictionary whose columns are referred to as atoms is proposed as a basis for the image patches. In many applications, the dictionary is predefined and can be constructed from wavelets [47], curvelets [48] or discrete cosine transform (DCT) vectors [49]. However, these bases are not necessarily well matched to natural or remote sensing images since they do not adapt to the spatial nature of the observed images. As a consequence, learning the dictionary from the observed images instead of using predefined bases generally improves image representation [32] and is preferred in most scenarios. Therefore, an adaptive sparse image-dependent regularization is explored in this paper to solve the fusion problem of interest. The rationale behind adaptive sparse representations is to learn dictionaries from the data that yield sparse representations thereof. In this case, the atoms of the dictionary are tuned to the input images, leading to much better results than predefined dictionaries. More specifically, the goal of sparse regularization is to represent the patches of the target image as a weighted linear combination of a few elementary basis vectors, or atoms, chosen from a learned overcomplete dictionary. The sparse regularization investigated in this paper is defined as
(7)  φ(U) = (1/2) Σ_{i=1}^{m̃_λ} ‖U_i − P(D_i A_i)‖²_F

where

U_i ∈ R^{1 × n} is the ith band (or row) of U, with i = 1, …, m̃_λ,

P(·) : R^{n_p × n_pat} → R^{1 × n} is a linear operator that averages the overlapping patches of each band (note that the overlapping decomposition adopted here prevents block artifacts [50]),

D_i ∈ R^{n_p × n_at} is the overcomplete dictionary whose columns are basis elements of size n_p (corresponding to the size of a vectorized patch),

A_i ∈ R^{n_at × n_pat} is the ith band code (n_at is the number of atoms and n_pat is the number of patches associated with the ith band).
Note that there are m̃_λ bands U_i since the dimension of the hyperspectral subspace in which the observed vectors have been projected is m̃_λ. The operation decomposing each band into n_pat overlapping patches of size n_p is denoted as P*(·), which is the adjoint operation of P(·), i.e., P(P*(U_i)) = U_i.
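A minimal sketch of the patch decomposition P*(·) and the averaging operator P(·) for one band; the band and patch sizes below are hypothetical toy values.

```python
import numpy as np

# P*: decompose a band into overlapping patches (one vectorized patch per
# column); P: reassemble a band by averaging the overlapping patches.
rows = cols = 8
patch, step = 4, 2               # patch side and stride (stride < side)

def extract_patches(band):       # P*(.): band -> n_p x n_pat matrix
    img = band.reshape(rows, cols)
    out = []
    for r in range(0, rows - patch + 1, step):
        for c in range(0, cols - patch + 1, step):
            out.append(img[r:r + patch, c:c + patch].ravel())
    return np.array(out).T

def average_patches(P):          # P(.): patch matrix -> band (averaging)
    acc = np.zeros((rows, cols))
    cnt = np.zeros((rows, cols))
    j = 0
    for r in range(0, rows - patch + 1, step):
        for c in range(0, cols - patch + 1, step):
            acc[r:r + patch, c:c + patch] += P[:, j].reshape(patch, patch)
            cnt[r:r + patch, c:c + patch] += 1
            j += 1
    return (acc / cnt).ravel()

band = np.arange(rows * cols, dtype=float)
P_mat = extract_patches(band)    # 16 x 9 patch matrix for these toy sizes
```

Averaging unmodified patches reproduces the band exactly, illustrating the identity P(P*(U_i)) = U_i for this construction.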
III-C Dictionary learning step
The DL strategy advocated in this paper consists of learning the dictionaries D_i and an associated sparse code A_i for each band of a rough estimation of U using the observed HS and MS images. A rough estimation of U, referred to as Ũ, is constructed using the MS image Y_M and the HS image Y_H, following the strategy initially studied in [17]. A brief introduction of this method is given in the Appendix. Note that other estimation methods might also be used to compute a rough estimation of U (see step 1 in Algorithm 1). Then each band Ũ_i of Ũ is decomposed into n_pat overlapping patches of size n_p, forming a patch matrix P*(Ũ_i).
Many DL methods have been studied in the recent literature, based for instance on K-SVD [51], online dictionary learning (ODL) [27] or Bayesian learning [34]. In this study, we propose to learn the set of overcomplete dictionaries using ODL since it is effective from the computational point of view and has empirically been demonstrated to provide relevant representations. More specifically, the dictionary D_i associated with the ith band is trained by solving the following optimization problem (see step 3 in Algorithm 1)

(8)  D_i = arg min_{D, A} (1/2) ‖P*(Ũ_i) − DA‖²_F + μ ‖A‖_1.
Then, to provide a more compact representation, we propose to re-estimate the sparse code

(9)  Ã_i = arg min_A ‖P*(Ũ_i) − D_i A‖²_F  s.t.  ‖a_j‖_0 ≤ K, ∀j

where K is a given maximum number of atoms and a_j is the code of the jth patch of Ũ_i. This ℓ0-norm constrained regression problem can be addressed using greedy algorithms, e.g., orthogonal matching pursuit (OMP). Generally, the maximum number of atoms is set much smaller than the number of atoms in the dictionary, i.e., K ≪ n_at. The positions of the nonzero elements of the code Ã_i, namely the supports denoted Ω_i, are also identified (see steps 4 and 5 in Algorithm 1).
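The ℓ0-constrained coding step can be sketched with a bare-bones OMP; this is a simplified stand-in for library implementations, and the dictionary, sizes, and synthetic patch below are hypothetical.

```python
import numpy as np

# Greedy OMP sketch: pick at most K atoms, refitting by LS on the support.
def omp(D, p, K):
    support, residual = [], p.copy()
    coef = np.zeros(0)
    for _ in range(K):
        j = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], p, rcond=None)
        residual = p - D[:, support] @ coef
    a = np.zeros(D.shape[1])
    a[support] = coef
    return a, sorted(support)                       # code and support Omega

rng = np.random.default_rng(1)
D = rng.standard_normal((16, 64))
D /= np.linalg.norm(D, axis=0)                      # unit-norm atoms
a_true = np.zeros(64)
a_true[[3, 17]] = [1.5, -2.0]                       # 2-sparse ground truth
p = D @ a_true                                      # synthetic patch
a_hat, omega = omp(D, p, K=2)                       # at most K active atoms
```

The returned support `omega` plays the role of Ω_i above: it is fixed and reused during the subsequent joint estimation.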
III-D Including the sparse code into the estimation framework
Since the regularization term (7) exhibits separable terms w.r.t. each band U_i, it can be easily interpreted in a Bayesian framework as the joint prior distribution of the bands U_i (i = 1, …, m̃_λ) assumed to be a priori independent, where each marginal prior p(U_i) is a Gaussian distribution with mean P(D_i Ā_i). More formally, by denoting Ā = {Ā_1, …, Ā_{m̃_λ}}, the prior distribution for U associated with the regularization (7) can be written

(10)  p(U | Ā) ∝ Π_{i=1}^{m̃_λ} exp( −(1/2) ‖U_i − P(D_i Ā_i)‖²_F ).

In a standard approach, the hyperparameters D = {D_1, …, D_{m̃_λ}} and Ā can be fixed a priori, e.g., based on the dictionary learning step detailed in the previous section. However, this choice can drastically impact the accuracy of the representation and therefore the relevance of the regularization term. Inspired by the hierarchical models frequently encountered in Bayesian inference [52], we propose to add a second level in the Bayesian paradigm by fixing the dictionaries D and the set of supports Ω = {Ω_1, …, Ω_{m̃_λ}}, but including the code A within the estimation process. The associated joint prior can be written as follows, using Ω implicitly as a hyper-hyperparameter

(11)  p(U, A | D, Ω) = p(U | D, A) p(A | Ω)

where Ω is derived from Ā. Therefore, the regularization term (7) reduces to

(12)  φ(U, A) = (1/2) Σ_{i=1}^{m̃_λ} ‖U_i − P(D_i A_i)‖²_F,  with A_i(Ω̄_i) = 0

where A = {A_1, …, A_{m̃_λ}} and Ω̄_i denotes the complement of the support Ω_i. It is worth noting that i) the regularization term in (12) is still separable w.r.t. each band U_i, and ii) the optimization of (12) w.r.t. A reduces to an ℓ2-norm optimization task w.r.t. the nonzero elements of A, which can be solved easily. The hierarchical structure of the observed data, parameters and hyperparameters is summarized in Fig. 1.
Finally, substituting (12) into (6), the optimization problem to be solved can be expressed as follows

(13)  min_{U, A} (1/2) ‖Λ_M^{−1/2}(Y_M − RHU)‖²_F + (1/2) ‖Λ_H^{−1/2}(Y_H − HUBS)‖²_F + (λ/2) Σ_{i=1}^{m̃_λ} ‖U_i − P(D_i A_i)‖²_F  s.t.  A_i(Ω̄_i) = 0, i = 1, …, m̃_λ.

Note that the set of support constraints could have been removed. In this case, to ensure sparse representations of the U_i (i = 1, …, m̃_λ), sparsity constraints on the codes A_i, such as ℓ0-norm constraints or sparsity-promoting penalties, e.g., the ℓ1 norm, should have been included in the objective function (13). This would have resulted in a much more computationally intensive algorithm.
IV Alternate optimization
Once D, Ā and Ω have been learned from the HS and MS data, the problem (13) reduces to a standard constrained quadratic optimization problem w.r.t. U and A. However, this problem is difficult to solve due to its large dimension and to the fact that the involved linear operators cannot be easily diagonalized in a common domain. To cope with this difficulty, we propose an optimization technique that alternates between minimizations w.r.t. U and A, which is a simple version of a block coordinate descent algorithm.
The optimization w.r.t. U conditional on A (or equivalently on the patch means P(D_i A_i)) can be achieved efficiently with the alternating direction method of multipliers (ADMM) [53], whose convergence has been proved in the convex case. The optimization w.r.t. A with the support constraint A_i(Ω̄_i) = 0 conditional on U is a least squares (LS) regression problem for the nonzero elements of A, which can be solved easily. The resulting scheme, including the learning of D, Ā and Ω, is detailed in Algo. 1. The alternating ADMM and LS steps are detailed in what follows.
Note that the objective function is convex w.r.t. (U, A), although not strictly. In practice, a very simple way to ensure convergence is to add the quadratic terms ε‖U‖²_F + ε‖A‖²_F, with a very small ε > 0. In this case, the solution of Algorithm 1 is unique and the ADMM algorithm converges linearly [54]. In practice, we noticed that even when ε is zero, Algorithm 1 always converges to a unique point.
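The alternation between the image update and the support-constrained code update can be illustrated on a toy quadratic problem; the sizes, support, and λ below are hypothetical, and the u-step here has a trivial closed form (the actual image update uses SALSA/ADMM).

```python
import numpy as np

# Toy block coordinate descent: alternately minimize a quadratic in u and
# an LS problem in the code a, restricted to a fixed support omega.
rng = np.random.default_rng(2)
n, n_at, lam = 12, 30, 0.5
D = rng.standard_normal((n, n_at))
y = rng.standard_normal(n)
omega = [4, 9, 21]                    # support fixed a priori

u, a, obj = np.zeros(n), np.zeros(n_at), []
for _ in range(20):
    # u-step: argmin ||y - u||^2 + lam * ||u - D a||^2 (closed form here)
    u = (y + lam * D @ a) / (1.0 + lam)
    # a-step: LS on the nonzero entries indexed by omega only
    coef, *_ = np.linalg.lstsq(D[:, omega], u, rcond=None)
    a = np.zeros(n_at)
    a[omega] = coef
    obj.append(np.sum((y - u) ** 2) + lam * np.sum((u - D @ a) ** 2))
```

Since each step exactly minimizes the objective in one block of variables, the sequence `obj` is nonincreasing, mirroring the convergence behavior discussed above.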
IV-A ADMM Step
Recall that the function to be minimized w.r.t. U conditional on A (or equivalently on Ū, where the ith row of Ū is P(D_i A_i)) is

(14)  L(U) = (1/2) ‖Λ_M^{−1/2}(Y_M − RHU)‖²_F + (1/2) ‖Λ_H^{−1/2}(Y_H − HUBS)‖²_F + (λ/2) ‖U − Ū‖²_F.

By introducing the splittings V_1 = UB, V_2 = U and V_3 = U, and the respective scaled Lagrange multipliers G_1, G_2 and G_3, the augmented Lagrangian associated with the optimization of U can be written as

L(U, V_1, V_2, V_3, G_1, G_2, G_3) = (1/2) ‖Λ_H^{−1/2}(Y_H − H V_1 S)‖²_F + (μ/2) ‖UB − V_1 − G_1‖²_F + (1/2) ‖Λ_M^{−1/2}(Y_M − RH V_2)‖²_F + (μ/2) ‖U − V_2 − G_2‖²_F + (λ/2) ‖V_3 − Ū‖²_F + (μ/2) ‖U − V_3 − G_3‖²_F.
The updates of U, V_1, V_2, V_3, G_1, G_2 and G_3 are achieved with an optimization tool named split augmented Lagrangian shrinkage algorithm (SALSA) [55, 37], which is an instance of the ADMM algorithm with guaranteed convergence. The SALSA scheme is summarized in Algorithm 2. Note that the optimization w.r.t. U (step 5) can be efficiently solved in the Fourier domain.
IV-B Patch-wise Sparse Coding
The optimization w.r.t. A conditional on U can be formulated as

(15)  min_{A_i} ‖U_i − P(D_i A_i)‖²_F  s.t.  A_i(Ω̄_i) = 0,  i = 1, …, m̃_λ.

Since the operator P(·) is a linear mapping from patches to images and P(P*(U_i)) = U_i, the problem (15) can be rewritten as

(16)  min_{A_i} ‖P(P*(U_i) − D_i A_i)‖²_F  s.t.  A_i(Ω̄_i) = 0.

Furthermore, as the adjoint operator has the property P*(P(·)) ≈ c Id(·) (where c is a constant and Id is the identity operator), the suboptimal solution of problem (16) can be approximated by solving

(17)  min_{A_i} ‖P*(U_i) − D_i A_i‖²_F  s.t.  A_i(Ω̄_i) = 0.

Tackling the support constraint consists of only updating the nonzero elements of each column of A_i. Denote the jth vectorized column of P*(U_i) as p_{i,j}, the vector composed of the nonzero elements of the jth column of A_i as a_{i,j}, and the corresponding columns of D_i as D_{i,j}. Then the problems in (17) reduce to the subproblems

(18)  min_{a_{i,j}} ‖p_{i,j} − D_{i,j} a_{i,j}‖²_2,  j = 1, …, n_pat

whose solutions â_{i,j} = (D_{i,j}ᵀ D_{i,j})⁻¹ D_{i,j}ᵀ p_{i,j} can be explicitly computed in parallel. The corresponding patch estimate is p̂_{i,j} = D_{i,j} â_{i,j}. These patches are used to build D_i A_i (i.e., equivalently, P(D_i A_i)) required in the optimization w.r.t. U (see Section IV-A). Note that D_{i,j}(D_{i,j}ᵀ D_{i,j})⁻¹ D_{i,j}ᵀ is a projection operator, and hence is symmetric and idempotent. Note also that it needs to be calculated only once, given the learned dictionaries and associated supports.
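The per-patch LS solutions and the projection property can be checked numerically; the dictionary, supports, and sizes below are hypothetical.

```python
import numpy as np

# For each patch, the LS estimate on its support is an orthogonal
# projection of the patch: p_hat = D_j (D_j^T D_j)^{-1} D_j^T p.
rng = np.random.default_rng(3)
n_p, n_at = 16, 40
D = rng.standard_normal((n_p, n_at))
supports = [[0, 5, 7], [2, 3], [11, 20, 33, 38]]   # toy per-patch supports
patches = rng.standard_normal((n_p, len(supports)))

estimates = np.empty_like(patches)
for j, omega in enumerate(supports):
    Dj = D[:, omega]                               # active atoms only
    Pj = Dj @ np.linalg.inv(Dj.T @ Dj) @ Dj.T      # projection matrix
    estimates[:, j] = Pj @ patches[:, j]           # per-patch LS estimate
```

Each projection matrix is symmetric and idempotent, and since the supports are fixed, these matrices can be precomputed once, as noted above.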
IV-C Complexity Analysis

The SALSA algorithm has a per-iteration complexity dominated by FFT-based convolutions in each of the m̃_λ bands [37], multiplied by the number of SALSA iterations. The computational complexity of the patch-wise sparse coding is proportional to the number of patches n_pat and to the sizes of the per-patch supports. Conducting the fusion in a subspace of dimension m̃_λ instead of working in the initial space of dimension m_λ ≫ m̃_λ greatly decreases the complexity of both the SALSA and sparse coding steps.
V Simulation Results on Synthetic Data
This section studies the performance of the proposed sparse representation-based fusion algorithm. The reference image, considered here as the high-spatial and high-spectral resolution image, is an HS image acquired by the reflective optics system imaging spectrometer (ROSIS) optical sensor over the urban area of the University of Pavia, Italy. The flight was operated by the Deutsches Zentrum für Luft- und Raumfahrt (DLR, the German Aerospace Agency) in the framework of the HySens project, managed and sponsored by the European Union. The water vapor absorption bands have been removed from the initial set of spectral bands (covering the spectral range from 0.43 to 0.86 μm). This image has received a lot of attention in the remote sensing literature [56, 57, 58]. A composite color image, formed by selecting the red, green and blue bands of the reference image, is shown in the left of Fig. 2.
V-A Simulation Scenario
We propose to reconstruct the reference HS image from two degraded images. A high-spectral low-spatial resolution HS image has been constructed by applying an exponential degrading spatial filter on each band of the reference image and then downsampling in both the horizontal and vertical directions. In a second step, we have generated an MS image by filtering the reference image with the IKONOS-like reflectance spectral responses depicted in Fig. 3. The HS and MS images are both perturbed by zero-mean additive Gaussian noise. In our simulations, the noise levels differ between the first 127 bands and the remaining 50 bands of the HS image, whereas the SNR of the MS image is 30 dB for all bands. The noise-contaminated HS and MS images are depicted in the middle and right of Fig. 2 (the HS image has been interpolated for better visualization).
V-B Learning the Subspace, the Dictionaries and the Code Supports
V-B1 Subspace
To learn the transform matrix H, we propose to use the principal component analysis (PCA) as in [19]. Note that PCA is a classical dimensionality reduction technique used in hyperspectral imagery. The empirical covariance matrix Σ of the HS pixel vectors is diagonalized, leading to

(19)  Σ = V Λ Vᵀ

where V is a unitary matrix (Vᵀ V = I_{m_λ}) and Λ is a diagonal matrix whose diagonal elements are the ordered eigenvalues of Σ, denoted as λ_1 ≥ λ_2 ≥ … ≥ λ_{m_λ}. The top m̃_λ components are selected and the matrix H is then constructed as the eigenvectors associated with the m̃_λ
largest eigenvalues of Σ. As an illustration, the eigenvalues of the empirical covariance matrix for the Pavia image are displayed in Fig. 4. For this example, the few leading eigenvectors contain most of the information.

V-B2 Dictionaries
As explained before, the target high resolution image is assumed to live in a lower dimensional subspace. First, a rough estimation Ũ of the projected image is obtained with the method proposed in [17]. In a second step, the dictionaries D_i are learned from this rough estimation of the projected image using the ODL method.
As n_at > n_p, the dictionary is overcomplete. There is no unique rule to select the patch size n_p and the number of atoms n_at. However, two limiting cases can be identified:

The patch reduces to a single pixel, which means n_p = 1. In this case, sparsity does not need to be introduced since only one 1-D dictionary atom (which is a constant) is enough to represent any target patch.

The patch is as large as the whole image, which means only one atom is needed to represent the image. In this case, the atom is too “specialized” to describe any other image.
More generally, the smaller the patches, the more objects the atoms can approximate. However, too small patches cannot properly capture textures, edges, etc. With a larger patch size, a larger number of atoms is required to guarantee overcompleteness (which implies a larger computational cost). In general, the size of the patches is selected empirically. For the ODL algorithm used in this study, the patch size and the number of atoms have been fixed empirically. The learned dictionaries for the first three bands of Ũ are displayed in the left column of Fig. 5. This figure shows that the spatial properties of the target image have been captured by the atoms of the dictionaries.
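The paper uses ODL [27]; as a rough stand-in, a MOD-style alternation (a crude K-sparse coding step followed by an LS dictionary update) can be sketched as follows, with all sizes hypothetical.

```python
import numpy as np

# MOD-style DL sketch: alternate a crude K-sparse coding step with an LS
# dictionary update D = P A^T (A A^T)^+, then renormalize the atoms.
rng = np.random.default_rng(5)
n_p, n_at, n_pat, K = 16, 32, 200, 3
P = rng.standard_normal((n_p, n_pat))       # patch matrix of one band
D = rng.standard_normal((n_p, n_at))
D /= np.linalg.norm(D, axis=0)              # unit-norm initial atoms

for _ in range(10):
    # coding: for each patch keep the K most correlated atoms, LS refit
    idx = np.argsort(-np.abs(D.T @ P), axis=0)[:K]
    A = np.zeros((n_at, n_pat))
    for j in range(n_pat):
        omega = idx[:, j]
        coef, *_ = np.linalg.lstsq(D[:, omega], P[:, j], rcond=None)
        A[omega, j] = coef
    # dictionary update (LS fit), then renormalize nonzero atoms
    D = P @ A.T @ np.linalg.pinv(A @ A.T)
    norms = np.linalg.norm(D, axis=0)
    norms[norms == 0] = 1.0
    D /= norms
```

ODL processes patches in mini-batches rather than refitting the whole dictionary at once, which is what makes it computationally attractive for large patch sets.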
V-B3 Code Supports
Based on the dictionaries learned following the strategy presented in Section V-B2, re-estimation of the code is achieved by solving (9) with OMP. Note that the target sparsity K represents the maximum number of atoms used to represent one patch, which also determines the number of nonzero elements of A estimated jointly with the projected image U. If K is too large, the optimization w.r.t. U and A leads to overfitting, which means there are too many parameters to estimate while the sample size is too small. The training supports for the first three bands are displayed in the right column of Fig. 5. The number of rows is n_at, which represents the number of atoms in each dictionary D_i. The white dots in the jth column indicate which atoms are used for reconstructing the jth patch (j = 1, …, n_pat). The sparsity is clearly illustrated in this figure. Note that some atoms are frequently used whereas some others are not. The most popular atoms represent spatial details that are quite common in images. The other atoms represent details that are characteristic of specific patches.
V-C Fusion Quality Metrics
To evaluate the quality of the proposed fusion strategy, several image quality measures have been employed. Referring to [20], we propose to use RMSE, SAM, UIQI, ERGAS and DD as defined below.
V-C1 RMSE

The root mean square error (RMSE) is a similarity measure between the target image X and the fused image X̂, defined as

RMSE(X, X̂) = ( ‖X − X̂‖²_F / (n m_λ) )^{1/2}.

The smaller the RMSE, the better the fusion quality.
V-C2 SAM

The spectral angle mapper (SAM) measures the spectral distortion between the actual and estimated images. The SAM of two spectral vectors x_i and x̂_i is defined as

SAM(x_i, x̂_i) = arccos( ⟨x_i, x̂_i⟩ / (‖x_i‖₂ ‖x̂_i‖₂) ).

The overall SAM is finally obtained by averaging the SAMs computed for all image pixels. Note that the value of SAM is expressed in degrees and thus belongs to (−90°, 90°]. The smaller the absolute value of SAM, the less important the spectral distortion.
V-C3 UIQI

The universal image quality index (UIQI) was proposed in [59] for evaluating the similarity between two single-band images. It is related to the correlation, luminance distortion and contrast distortion of the estimated image w.r.t. the reference image. The UIQI between two single-band images a and â is defined as

UIQI(a, â) = 4 σ_{a â} μ_a μ_â / ( (σ²_a + σ²_â)(μ²_a + μ²_â) )

where μ_a, μ_â and σ²_a, σ²_â are the sample means and variances of a and â, and σ_{a â} is the sample covariance of (a, â). The range of UIQI is [−1, 1] and UIQI(a, â) = 1 when a = â. For multiband images, the overall UIQI can be computed by averaging the UIQI computed band-by-band.

V-C4 ERGAS
The relative dimensionless global error in synthesis (ERGAS) calculates the amount of spectral distortion in the image [60]. This measure of fusion quality is defined as

ERGAS = 100 · d_r · ( (1/m_λ) Σ_{i=1}^{m_λ} ( RMSE_i / μ_i )² )^{1/2}

where d_r is the ratio between the pixel sizes of the MS and HS images, μ_i is the mean of the ith band of the HS image, and m_λ is the number of HS bands. The smaller the ERGAS, the smaller the spectral distortion.
V-C5 DD

The degree of distortion (DD) between two images X and X̂ is defined as

DD(X, X̂) = (1/(n m_λ)) ‖vec(X) − vec(X̂)‖₁.

The smaller the DD, the better the fusion.
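The quality measures above can be sketched as follows; these are standard formulations, and the exact normalizations and the ERGAS ratio convention may differ slightly from those used here.

```python
import numpy as np

def rmse(X, Xh):
    # root mean square error between reference and fused images
    return np.sqrt(np.mean((X - Xh) ** 2))

def sam_deg(X, Xh):
    # average spectral angle (degrees); columns are pixel spectra
    num = np.sum(X * Xh, axis=0)
    den = np.linalg.norm(X, axis=0) * np.linalg.norm(Xh, axis=0)
    return np.degrees(np.mean(np.arccos(np.clip(num / den, -1.0, 1.0))))

def uiqi(a, b):
    # universal image quality index for two single-band images
    ma, mb, va, vb = a.mean(), b.mean(), a.var(), b.var()
    cov = np.mean((a - ma) * (b - mb))
    return 4 * cov * ma * mb / ((va + vb) * (ma ** 2 + mb ** 2))

def ergas(X, Xh, d_ratio):
    # d_ratio: ratio between the pixel sizes of the MS and HS images
    band_rmse = np.sqrt(np.mean((X - Xh) ** 2, axis=1))
    return 100.0 * d_ratio * np.sqrt(np.mean((band_rmse / X.mean(axis=1)) ** 2))

def dd(X, Xh):
    # degree of distortion: mean absolute difference
    return np.mean(np.abs(X - Xh))

X = np.random.default_rng(6).random((4, 100)) + 0.5   # toy multiband image
```

For identical images, RMSE, SAM, ERGAS and DD vanish and UIQI equals 1, which is a quick sanity check for any implementation.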
V-D Comparison with other fusion methods
This section compares the proposed fusion method with four other state-of-the-art fusion algorithms for MS and HS images [17, 20, 23, 19]. The parameters used for the proposed fusion algorithm have been specified as follows.

The regularization parameter μ used in the SALSA method weights the augmented Lagrangian terms. The selection of this parameter is still an open issue, even if there are some strategies to tune it to accelerate convergence [37]. According to the convergence theory [61], for any μ > 0, if minimizing (14) has a solution, say U*, then the sequence