Fusion of multi-sensor images has been explored during recent years and is still a very active research area . A popular fusion problem in remote sensing consists of merging a high spatial resolution panchromatic (PAN) image and a low spatial resolution multispectral (MS) image. Many solutions have been proposed in the literature to solve this problem, known as pansharpening [2, 3, 4, 5].
More recently, hyperspectral (HS) imaging acquiring a scene in several hundreds of contiguous spectral bands has opened a new range of relevant applications such as target detection  and spectral unmixing . However, while HS sensors provide abundant spectral information, their spatial resolution is generally more limited [8, 9]. To obtain images with good spectral and spatial resolutions, the remote sensing community has been devoting increasing research efforts to the problem of fusing HS with MS or PAN images [10, 11, 12, 13]. From an application point of view, this problem is also important as motivated by recent national programs, e.g., the Japanese next-generation space-borne hyperspectral image suite (HISUI), which fuses co-registered MS and HS images acquired over the same scene under the same conditions .
The fusion of HS and MS differs from traditional pansharpening since both spatial and spectral information is contained in multi-band images. Therefore, a lot of pansharpening methods, such as component substitution  and relative spectral contribution 
are inapplicable or inefficient for the HS/MS fusion problem. Since the fusion problem is generally ill-posed, Bayesian inference offers a convenient way to regularize the problem by defining an appropriate generic prior for the scene of interest. Following this strategy, Gaussian or
-norm priors have been considered to build various estimators, in the image domain[17, 18, 19] or in a transformed domain . Recently, the fusion of HS and MS images based on spectral unmixing has been explored [21, 22, 23, 24].
Besides, sparse representations have received a considerable interest in recent years exploiting the self-similarity properties of natural images [25, 26, 27, 28]. Using this property, a sparse constraint has been proposed in [29, 30, 31]
to regularize various ill-posed super-resolution and/or fusion problems. The linear decomposition of an image using a few atoms of a redundant dictionary learned from this image (instead of a predefined dictionary, e.g., of wavelets) has recently been used for several problems related to low-level image processing tasks such as denoising and classification , demonstrating the ability of sparse representations to model natural images. Learning a dictionary from the image of interest is commonly referred to as dictionary learning (DL). Liu et al. recently proposed to solve the pansharpening problem based on DL . DL has also been investigated to restore HS images . More precisely, a Bayesian scheme was introduced in  to learn a dictionary from an HS image, which imposes a self-consistency of the dictionary by using Beta-Bernoulli processes. This Monte Carlo-based method provided interesting results at the price of a high computational complexity. Fusing multiple images using a sparse regularization based on the decomposition of these images into high and low frequency components was considered in . However, this method developed in  required a training dataset to learn the dictionaries. The references mentioned before proposed to solve the corresponding sparse coding problem either by using greedy algorithms such as matching pursuit (MP) and orthogonal MP  or by relaxing the -norm to -norm to take advantage of the last absolute shrinkage and selection operator (LASSO) .
In this paper, we propose to fuse HS and MS images within a constrained optimization framework, by incorporating a sparse regularization using dictionaries learned from the observed images. Knowing the trained dictionaries and the corresponding supports of the codes circumvents the difficulties inherent to the sparse coding step. The optimization problem can then be solved by optimizing alternatively with respect to (w.r.t.) the projected target image and the sparse code. The optimization w.r.t. the image is achieved by the split augmented Lagrangian shrinkage algorithm (SALSA) , which is an instance of the alternating direction method of multipliers (ADMM). By a suitable choice of variable splittings, SALSA enables to decompose a huge non-diagonalizable quadratic problem into a sequence of convolutions and pixel decoupled problems, which can be solved efficiently. The coding step is performed using a standard least-square (LS) algorithm which is possible because the supports have been fixed a priori.
The paper is organized as follows. Section II formulates the fusion problem within a constrained optimization framework. Section III presents the proposed sparse regularization and the method used to learn the dictionaries and the code support. The strategy investigated to solve the resulting optimization problem is detailed in Section IV. Simulation results are presented in Section V whereas conclusions are reported in Section VI.
Ii Problem Formulation
Ii-a Notations and observation model
In this paper, we consider the fusion of hyperspectral (HS) and multispectral (MS) images. The HS image is supposed to be a blurred and down-sampled version of the target image whereas the MS image is a spectrally degraded version of the target image. Both images are contaminated by white Gaussian noises. Instead of resorting to the totally vectorized notations used in[17, 20, 18], the HS and MS images are reshaped band-by-band to build and matrices, respectively, where is the number of HS bands, is the number of MS bands, is the number of pixels in each band of the MS image and is the number of pixels in each band of the HS image. The resulting observation models associated with the HS and MS images can be written as follows [17, 38, 39]
is the full resolution target image with bands and pixels,
and are the observed HS and MS images, respectively,
is a cyclic convolution operator acting on the bands,
is a down-sampling matrix (with down-sampling factor denoted as ),
is the spectral response of the MS sensor,
and are the HS and MS noises.
Note that is a sparse symmetric Toeplitz matrix for a symmetric convolution kernel and , where is an integer standing for the downsampling factor. In this work, each column of the noise matrices and is assumed to be a band-dependent Gaussian noise vector, i.e., and where is the vector of zeros, and are diagonal matrices. Note that the Gaussian noise assumption used in this paper is quite popular in image processing [40, 41, 42] as it facilitates the formulation of the likelihood and the associated optimization algorithms. By denoting the Frobenius norm as , the signal to noise ratios (SNRs) of each band in the two images (expressed in decibels) are defined as
Ii-B Subspace learning
The unknown image is where is the vector corresponding to the th spatial location (with ). As the bands of the HS data are generally spectrally dependent, the HS vector usually lives in a subspace whose dimension is much smaller than the number of bands [43, 44], i.e.,
where is the projection of the vector onto the subspace spanned by the columns of (
is an orthogonal matrix). Using the notation , we have . Moreover, since is an orthogonal matrix. In this case, the fusion problem (1) can be reformulated as estimating the unknown matrix from the following observation equations
The dimension of the subspace is generally much smaller than the number of HS bands, i.e., . As a consequence, inferring in the subspace greatly decreases the computational burden of the fusion algorithm. Another motivation for working in the subspace associated with is to bypass the possible matrix singularity caused by the spectral dependency of the HS data. Note that each column of the orthogonal matrix can be interpreted as a basis of the subspace of interest. In this paper, the subspace transform matrix
has been determined from a principal component analysis (PCA) of the HS data(see step 7 of Algorithm 1).
Iii Proposed Fusion Rule for multispectral and hyperspectral images
Iii-a Ill-posed inverse problem
As shown in (3), recovering the projected high-spectral and high-spatial resolution image from the observations and is a linear inverse problem (LIP) . In most single-image restoration problem (using either or ), the inverse problem is ill-posed or under-constrained , which requires regularization or prior information (in Bayesian terminology). However, for multi-source image fusion, the inverse problem can be ill-posed or well-posed, depending on the dimension of the subspace and the number of spectral bands. If the matrix has full column rank and is well conditioned, which is seldom the case, the estimation of according to (3) is an over-determined problem instead of an under-determined problem . In this case, it is redundant to introduce regularizations. Conversely, if there are fewer MS bands than the subspace dimension (e.g., the MS image degrades to a PAN image), the matrix cannot have full column rank, which means the fusion problem is an ill-posed LIP. In this paper, we focus on the under-determined case. Note however that the over-determined problem can be viewed as a special case with a regularization term set to zero. Another motivation for studying the under-determined problem is that it includes an archetypal fusion task referred to as pansharpening .
Based on the model (3) and the noise assumption, the distributions of and are
whereis defined by
where is the mean, and are two matrices denoted as the row and column covariance matrices.
According to Bayes’ theorem and using the fact that the noisesand are independent, the posterior distribution of can be inferred as
In this work, we compute the maximum a posteriori (MAP) estimator in an optimization framework to solve the fusion problem. Taking the negative logarithm of the posterior distribution, maximizing the posterior distribution is equivalent to solving the following minimization problem
where the two first terms are associated with the MS and HS images (data fidelity terms) and the last term is a penalty ensuring appropriate regularization. Note that is a parameter adjusting the importance of regularization w.r.t. the data fidelity terms. It is also noteworthy that the MAP estimator is equivalent with the minimum mean square error (MMSE) estimator when has a quadratic form, which is the case in our approach detailed below.
Iii-B Sparse Regularization
Based on the self-similarity property of natural images, modeling images with a sparse representation has been shown to be very effective in many signal processing applications . Generally, an over-complete dictionary with columns referred to as atoms is proposed as a basis for the image patches. In many applications, the dictionary is predefined and can be constructed from wavelets , curvelets  or discrete cosine transform (DCT) vectors . However, these bases are not necessarily well matched to natural or remote sensing images since they do not necessarily adapt to the spatial nature of the observed images. As a consequence, learning the dictionary from the observed images instead of using predefined bases generally improves image representation , which is preferred in most scenarios. Therefore, an adaptive sparse image-dependent regularization is explored in this paper to solve the fusion problem of interest. The rationale with adaptive sparse representations is to learn dictionaries from the data yielding sparse representations thereof. In this case, the atoms of the dictionary are tuned to the input images, leading to much better results than predefined dictionaries. More specifically, the goal of sparse regularization is to represent the patches of the target image as a weighted linear combination of a few elementary basis vectors or atoms, chosen from a learned overcomplete dictionary. The sparse regularization investigated in this paper is defined as
is the th band (or row) of , with ,
is a linear operator that averages the overlapping patches111Note that the overlapping decomposition adopted here is to prevent the block artifacts . of each band,
is the overcomplete dictionary whose columns are basis elements of size (corresponding to the size of a patch),
is the th band code ( is the number of atoms and is the number of patches associated with the th band).
Note that there are vectors since the dimension of the hyperspectral subspace in which the observed vectors have been projected is . The operation decomposing each band into overlapping patches of size is denoted as , which is the adjoint operation of , i.e., .
Iii-C Dictionary learning step
The DL strategy advocated in this paper consists of learning the dictionaries and an associated sparse code for each band of a rough estimation of using the observed HS and MS images. A rough estimation of , referred as is constructed using the MS image and the HS image , following the strategy initially studied in . A brief introduction of this method is given in the Appendix. Note that other estimation methods might also be used to propose a rough estimation of (see step 1 in Algorithm 1). Then each band of is decomposed into overlapping patches of size forming a patch matrix .
Many DL methods have been studied in the recent literature. These methods are for instance based on K-SVD , online dictionary learning (ODL)  or Bayesian learning . In this study, we propose to learn the set of over-complete dictionaries using ODL since it is effective from the computational point of view and has empirically demonstrated to provide more relevant representations. More specifically, the dictionary associated with the band is trained by solving the following optimization problem (see step 3 in Algorithm 1).
Then, to provide a more compact representation, we propose to re-estimate the sparse code
where is a given maximum number of atoms, for each patch of . This -norm constrained regression problem can be addressed using greedy algorithms, e.g., orthogonal matching pursuit (OMP). Generally, the maximum number of atoms is set much smaller than the number of atoms in the dictionary, i.e., . The positions of the non-zero elements of the code , namely the supports denoted , are also identified (see steps 4 and 5 in Algorithm 1).
Iii-D Including the sparse code into the estimation framework
Since the regularization term (7) exhibits separable terms w.r.t. each image in band , it can be easily interpreted in a Bayesian framework as the joint prior distribution of the images () assumed to be a priori independent, where each marginal prior
is a Gaussian distribution with mean. More formally, by denoting , the prior distribution for associated with the regularization (7) can be written
In a standard approach, the hyperparametersand can be a priori fixed, e.g., based on the dictionary learning step detailed in the previous section. However, this choice can drastically impact the accuracy of the representation and therefore the relevance of the regularization term. Inspired by hierarchical models frequently encountered in Bayesian inference , we propose to add a second level in the Bayesian paradigm by fixing the dictionaries and the set of supports , but including the code within the estimation process. The associated joint prior can be written as follows, using implicitly as a hyper-hyperparameter:
where is derived from . Therefore, the regularization term (7) reduces to
where and . It is worthy to note that i) the regularization term in (12) is still separable w.r.t. each band , and ii) the optimization of (12) w.r.t. reduces to an -norm optimization task w.r.t. the non-zero elements in , which can be solved easily. The hierarchical structure of the observed data, parameters and hyperparameters is summarized in Fig. 1.
Note that the set of constraints could have been removed. In this case, to ensure sparse representations of (), sparse constraints on the codes (), such as or sparsity promoting penalties, e.g., should have been included into the object function (13). This would have resulted in a much more computationally intensive algorithm.
Iv Alternate optimization
Once , and have been learned from the HS and MS data, the problem (13) reduces to a standard constrained quadratic optimization problem w.r.t. and . However, this problem is difficult to solve due to its large dimension and due to the fact that the linear operators and cannot be easily diagonalized. To cope with this difficulty, we propose an optimization technique that alternates optimization w.r.t. and , which is a simple version of block coordinate descent algorithm.
The optimization w.r.t. conditional on (or equivalent on ) can be achieved efficiently with the alternating direction method of multipliers (ADMM) , whose convergence has been proved in the convex case. The optimization w.r.t. with the support constraint () conditional on is a least squares (LS) regression problem for the non-zero elements of , which can be solved easily. The resulting scheme including learning , and is detailed in Algo. 1. The alternating ADMM and LS steps are detailed in what follows.
Note that the objective function is convex w.r.t although no strictly. In practice, a very simple way to ensure convergence is to add the quadratic terms , with very small . In this case, the solution of Algorithm 1 is unique and the ADMM algorithm converges linearly . In practice, we notice that even when is zero, the solution of Algorithm 1 always converges to a unique point.
Iv-a ADMM Step
Recall that the function to be minimized w.r.t. conditional on (or ) is
By introducing the splittings , and and the respective scaled Lagrange multipliers and , the augmented Lagrangian associated with the optimization of can be written as
The updates of and are achieved with an optimization tool named split augmented Lagrangian shrinkage algorithm (SALSA) [55, 37], which is an instance of the ADMM algorithm with guaranteed convergence. The SALSA scheme is summarized in Algorithm 2. Note that the optimization w.r.t. to (step 5) can be efficiently solved in the Fourier domain.
Iv-B Patchwise Sparse Coding
The optimization w.r.t. conditional on can be formulated as
Since the operator is a linear mapping from patches to images and , the problem (15) can be rewritten as
Furthermore, as the adjoint operator has the property (where is a constant and is the identity operator), the sub-optimal solution of problem (16) can be approximated by solving
Tackling the support constraint consists of only updating the non-zero elements of each column of . Denote the th vectorized column of as , the vector composed of the non-zero elements of the th column of as , and the corresponding columns of as . Then the problems in (17) reduce to sub-problems
whose solutions can be explicitly computed in parallel. The corresponding patch estimate is , with . These patches are used to build (i.e., equivalently, ) required in the optimization w.r.t. (see section IV-A). Note that is a projection operator, and hence is symmetric () and idempotent (). Note also that needs to be calculated only once, given the learned dictionaries and associated supports.
Iv-C Complexity Analysis
The SALSA algorithm has a complexity of the order , where is the number of SALSA iterations. The computational complexity of the patchwise sparse coding is . Conducting the fusion in a subspace of dimension instead of working with the initial space of dimension greatly decreases the complexity of both SALSA and sparse coding steps.
V Simulation Results on Synthetic Data
This section studies the performance of the proposed sparse representation-based fusion algorithm. The reference image considered here as the high spectral and high spectral image is a HS image with spatial resolution of m acquired by the reflective optics system imaging spectrometer (ROSIS) optical sensor over the urban area of the University of Pavia, Italy. The flight was operated by the Deutsches Zentrum für Luft- und Raumfahrt (DLR, the German Aerospace Agency) in the framework of the HySens project, managed and sponsored by the European Union. This image was initially composed of bands that have been reduced to bands after removing the water vapor absorption bands (with spectral range from 0.43 to 0.86μm). It has received a lot of attention in the remote sensing literature [56, 57, 58]. A composite color image, formed by selecting the red, green and blue bands of the reference image is shown in the left of Fig. 2.
V-a Simulation Scenario
We propose to reconstruct the reference HS image from two lower resolved images. A high-spectral low-spatial resolution HS image has been constructed by applying a exponential degrading spatial filter on each band of the reference image and down-sampling every pixels in both horizontal and vertical directions. In a second step, we have generated a -band MS image by filtering the reference image with the IKONOS-like reflectance spectral responses depicted in Fig. 3. The HS and MS images are both perturbed by zero-mean additive Gaussian noises. Our simulations have been conducted with SNRdB for the first 127 bands and SNRdB for the remaining 50 bands of the HS image. For the MS image, SNR is 30dB for all bands. The noise-contaminated HS and MS images are depicted in the middle and right of Fig. 2
(the HS image has been interpolated for better visualization).
V-B Learning the Subspace, the Dictionaries and the Code Supports
To learn the transform matrix , we propose to use the principal component analysis (PCA) as in . Note that PCA is a classical dimensionality reduction technique used in hyperspectral imagery. The empirical covariance matrix of the HS pixel vectors is diagonalized leading to
where is an unitary matrix () and
is a diagonal matrix whose diagonal elements are the ordered eigenvalues ofdenoted as . The top components are selected and the matrix
is then constructed as the eigenvectors associated with thelargest eigenvalues of . As an illustration, the eigenvalues of the empirical covariance matrix for the Pavia image are displayed in Fig. 4. For this example, the eigenvectors contain of the information.
As explained before, the target high resolution image is assumed to live in the lower dimensional subspace. Firstly, a rough estimation of the projected image is obtained with the method proposed in . In a second step, dictionaries are learned from the rough estimation of the projected image using the ODL method.
As , the dictionary is over-complete. There is no unique rule to select the dictionary size and the number of atoms . However, two limiting cases can be identified:
The patch reduces to a single pixel, which means . In this case, the sparsity is not necessary to be introduced since only one D dictionary atom (which is a constant) is enough to represent any target patch.
The patch is as large as the whole image, which means only one atom is needed to represent the image. In this case, the atom is too “specialized” to describe any other image.
More generally, the smaller the patches, the more objects the atoms can approximate. However, too small patches are not efficient to properly capture the textures, edges, etc. With larger patch size, a larger number of atoms are required to guarantee the over-completeness (which requires larger computation cost). In general, the size of patches is selected empirically. For the ODL algorithm used in this study, the size has been fixed to and the number of atoms is . The learned dictionaries for the first three bands of are displayed in the left column of Fig. 5. This figure shows that the spatial properties of the target image have been captured by the atoms of the dictionaries.
V-B3 Code Supports
Based on the dictionaries learned following the strategy presented in Section V-B2, re-estimation of the code is achieved by solving (9) with OMP. Note that the target sparsity represents the maximum number of atoms used to represent one patch, which also determines the number of non-zeros elecments of estimated jointly with the projected image . If is too large, the optimization w.r.t. and leads to over-fitting, which means there are too many parameters to estimate while the sample size is too small. The training supports for the first three bands are displayed in the right column of Fig. 5. The number of rows is , which represents the number of atoms in each dictionary (). The white dots in the th column indicate which atoms are used for reconstructing the th patch (). The sparsity is clearly illustrated in this figure. Note that some atoms are frequently used whereas some others are not. The most popular atoms represent spatial details that are quite common in images. The other atoms represent details that are characteristics of specific patches.
V-C Fusion Quality Metrics
To evaluate the quality of the proposed fusion strategy, several image quality measures have been employed. Referring to , we propose to use RMSE, SAM, UIQI, ERGAS and DD as defined below.
The root mean square error (RMSE) is a similarity measure between the target image and fused image defined as
The smaller RMSE, the better the fusion quality.
The spectral angle mapper (SAM) measures the spectral distortion between the actual and estimated images. The SAM of two spectral vectors and is defined as
The overall SAM is finally obtained by averaging the SAMs computed from all image pixels. Note that the value of SAM is expressed in degrees and thus belongs to . The smaller the absolute value of SAM, the less important the spectral distortion.
The universal image quality index (UIQI) was proposed in  for evaluating the similarity between two single band images. It is related to the correlation, luminance distortion and contrast distortion of the estimated image w.r.t. the reference image. The UIQI between two single-band images and is defined as
are the sample means and variances ofand , and is the sample covariance of . The range of UIQI is and UIQI when . For multi-band images, the overall UIQI can be computed by averaging the UIQI computed band-by-band.
The relative dimensionless global error in synthesis (ERGAS) calculates the amount of spectral distortion in the image . This measure of fusion quality is defined as
where is the ratio between the pixel sizes of the MS and HS images, is the mean of the th band of the HS image, and is the number of HS bands. The smaller ERGAS, the smaller the spectral distortion.
The degree of distortion (DD) between two images and is defined as
The smaller DD, the better the fusion.
V-D Comparison with other fusion methods
This section compares the proposed fusion method with four other state-of-the-art fusion algorithms for MS and HS images [17, 20, 23, 19]. The parameters used for the proposed fusion algorithm have been specified as follows.
The regularization parameter used in the SALSA method is . The selection of this parameter is still an open issue even if there are some strategies to tune it to accelerate convergence . According to the convergence theory , for any , if minimizing (14) has a solution, say , then the sequence