I Introduction
Accurate land-use and land-cover classification plays an important role in many applications, such as urban planning and change detection. In the past few years, hyperspectral data have been widely explored for this task [12, 9, 15]. Compared to multispectral data, hyperspectral data contain richer spectral information, ranging from the visible spectrum to the infrared spectrum [16]. Such information, combined with the spatial information in hyperspectral data, can generally yield satisfactory classification results [6, 13]. However, urban and rural areas often contain many complex objects that are difficult to discriminate because they have similar spectral responses. Thanks to the development of remote sensing technologies, it is nowadays possible to measure different aspects of the same object on the Earth's surface [8]. Different from hyperspectral data, Light Detection And Ranging (LiDAR) data record the elevation information of objects, thus providing complementary information for hyperspectral data. For instance, if a building roof and a road are both made of concrete, it is very difficult to distinguish them using only hyperspectral data, since their spectral responses are similar. However, LiDAR data can accurately separate these two classes because they have different heights. Conversely, LiDAR data cannot differentiate between two roads made of different materials (e.g., asphalt and concrete) but having the same height. Therefore, fusing hyperspectral and LiDAR data is a promising scheme whose performance has already been validated in the literature for land-cover and land-use classification
[8, 4]. In order to take advantage of the complementary information between hyperspectral and LiDAR data, many methods have been proposed. One widely used class of methods is based on feature-level fusion. In [25], morphological extended attribute profiles (EAPs) were applied to the hyperspectral and LiDAR data, respectively. These profiles and the original spectral information of the hyperspectral data were stacked together for classification. However, directly stacking such high-dimensional features inevitably results in the well-known Hughes phenomenon, especially when only a relatively small number of training samples is available. To address this issue, principal component analysis (PCA) was employed to reduce the dimensionality. Similar to this work, many subspace-related models have been designed to fuse the extracted spectral, spatial, and elevation features [35, 20, 27, 26, 28]. For example, a graph embedding framework was proposed in [20], and a low-rank component analysis model was proposed in [27]. Different from them, Gu et al. attempted to use multiple-kernel learning [24] to combine heterogeneous features [10]. They constructed a kernel for each feature and then combined these kernels in a weighted-summation manner, where different weights represent the importance of different features for classification.

Besides feature-level fusion, decision-level fusion is another popular approach. In [19], spectral features, spatial features, elevation features, and their fused features were individually fed into a support vector machine (SVM) to generate four classifiers, and the final classification result was jointly determined by them. In [36], two fusion strategies, named hard decision fusion and soft decision fusion, were used to integrate the classification results from different data sources; their fusion weights were uniformly distributed. In [37], three different classifiers, including the maximum likelihood classifier, SVM, and multinomial logistic regression, were used to classify the extracted features, and the fusion weights for these classifiers were adaptively optimized by a differential evolution algorithm. Recently, a novel ensemble classifier based on random forests was proposed, in which a majority voting method produces the final classification result [29]. In summary, feature-level and decision-level fusion methods differ in the phase where the fusion happens, but both require powerful representations of hyperspectral and LiDAR data. To achieve this goal, one needs to spend considerable time designing appropriate feature extraction and feature selection methods, and these handcrafted features often require domain expertise and prior knowledge.
In recent years, deep learning has attracted more and more attention in the field of remote sensing
[33, 11]. In contrast to handcrafted features, deep learning can learn high-level semantic features from the data itself in an end-to-end manner [39]. Among various deep learning models, convolutional neural networks (CNNs) have gained the most attention and have been explored in various tasks. For example, in [3], a CNN was applied to object detection in remote sensing images. In [1], three CNN frameworks were proposed for hyperspectral image classification. In [21], Liu et al. used CNNs to learn multiscale deep features for remote sensing image scene classification. Due to this powerful feature learning ability, some researchers have recently attempted to use CNNs for hyperspectral and LiDAR data fusion. An early attempt appears in [23], which directly considered the LiDAR data as another spectral band of the hyperspectral data and fed the concatenated data into a CNN to learn features and perform classification. In [5], Ghamisi et al. combined a traditional feature extraction method with a CNN: they fed the fused features to a CNN to learn a higher-level representation and obtain a classification result. Similarly, Li et al. constructed three CNNs to learn spectral, spatial, and elevation features, respectively, and then used a composite kernel method to fuse them [18]. Different from these works, an end-to-end CNN fusion model was designed in [2], which embedded feature extraction, feature fusion, and classification into one framework. Specifically, the hyperspectral and LiDAR data were directly fed into their corresponding CNNs to extract features; these features were then concatenated, followed by a fully-connected layer to further fuse them. Based on this two-branch framework, Xu et al. also proposed a spectral-spatial CNN for hyperspectral data analysis and another spatial CNN for LiDAR data analysis [30].

It is well known that the performance of CNN-based models heavily depends on the number of available samples. However, in the field of hyperspectral and LiDAR data fusion, only a small number of training samples is often available. To address this issue, an unsupervised CNN model was proposed in [34] based on the famous encoder-decoder architecture [22]. Specifically, it first mapped the hyperspectral data into a hidden space via an encoding path and then reconstructed the LiDAR data with a decoding path. The hidden representation in the encoding path can then be considered a fused feature of the hyperspectral and LiDAR data. Nevertheless, some issues remain. For example, discarding the supervised information from labeled samples leads to a suboptimal feature representation, and another network must be designed to classify the learned representation, which increases the computational complexity. In this paper, we propose a supervised model to fuse hyperspectral and LiDAR data by designing an efficient and effective CNN framework. Similar to [2], we also use two CNNs, but with a more efficient representation. We use three convolutional layers with small kernels, two of which share parameters. Apart from the output layers, we do not use any fully-connected layers. The major contributions of this paper are summarized as follows.
In order to sufficiently fuse hyperspectral and LiDAR data, two coupled CNNs are designed. Compared to existing CNN-based fusion models, our model is more efficient and effective. The coupled convolutional layers reduce the number of parameters and, more importantly, guide the two CNNs to learn from each other, thus facilitating the subsequent feature fusion process.

In the fusion phase, we simultaneously use feature-level and decision-level fusion strategies. For the feature-level fusion, we propose summation and maximization fusion methods in addition to the widely adopted concatenation method. To enhance the discriminative ability of the learned features, we add two output layers to the two CNNs, respectively. The three output results are finally combined via a weighted summation method, whose weights are determined by the classification accuracy of each output on the training data.

We test the effectiveness of the proposed model on two data sets using the standard training and test sets. On the Houston data, we achieve an overall accuracy of 96.03%, which is the best result reported in the literature so far. On the Trento data, we also obtain very high performance (i.e., an overall accuracy of 99.12%).
The rest of this paper is organized as follows. Section II describes the details of the proposed model, including the coupled CNN framework, the data fusion model, and the network training and testing methods. The descriptions of data sets and experimental results are given in Section III. Finally, Section IV concludes this paper.
II Methodology
II-A Framework of the Proposed Model
As shown in Fig. 1, the proposed model mainly consists of two networks: an HS network for spectral-spatial feature learning and a LiDAR network for elevation feature learning. Each of them includes an input module, a feature learning module, and a fusion module. For the HS network, PCA is first used to reduce the redundant information of the original hyperspectral data, and then a small cube surrounding the given pixel is extracted. For the LiDAR network, we directly extract an image patch at the same spatial position as the hyperspectral cube. In the feature learning module, we use three convolutional layers, the last two of which share parameters. In the fusion module, we construct three classifiers: each CNN has an output layer, and their fused features are also fed into an output layer.
II-B Feature Learning via Coupled CNNs
Consider a hyperspectral image and a corresponding LiDAR image covering the same area on the Earth's surface. The two images share the same height and width, and the hyperspectral image additionally contains a number of spectral bands. Our goal is to sufficiently fuse the information from the two data sources to improve the classification performance. As in other classification tasks, feature representation is a critical step here. Due to the effects of multipath scattering and the heterogeneity of subpixel constituents, the hyperspectral data often exhibit nonlinear relationships between the captured spectral information and the corresponding material. This nonlinear characteristic is further magnified when LiDAR data are considered jointly [8]. It has been shown that CNNs are capable of extracting high-level features, which are usually invariant to the nonlinearities of hyperspectral [32, 31, 38] and LiDAR data [2, 14]. Inspired by these works, we design a coupled CNN framework to efficiently learn features from both data sources.
The detailed architecture of the coupled CNNs is shown in Fig. 2. First of all, PCA is used to extract the first principal components of the hyperspectral data to reduce the redundant spectral information. Then, for each pixel, a small cube and a small patch centered at it are chosen from the hyperspectral and LiDAR data, respectively. Following [2] and [34], the neighboring size can be empirically set to 11. After that, the cube and the patch are fed into three convolutional layers to learn features. For the first convolutional layer, we adopt two different convolution operators (the blue box and the orange box) to obtain initial representations of the two inputs, respectively. This convolutional layer is sequentially followed by a batch normalization (BN) layer to regularize and accelerate the training process, a rectified linear unit (ReLU) to learn a nonlinear representation, and a max-pooling layer to reduce the data variance and the computational complexity.
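The preprocessing described above (PCA reduction followed by neighborhood extraction) can be sketched as follows. This is a minimal illustration: the function names and the reflection padding at image borders are our own choices, not details specified in the paper.

```python
import numpy as np

def pca_reduce(hsi, d=20):
    """Project an (H, W, B) hyperspectral cube onto its first d principal
    components (the paper empirically uses d = 20; see Section III-D)."""
    H, W, B = hsi.shape
    X = hsi.reshape(-1, B).astype(np.float64)
    X -= X.mean(axis=0)                      # center each band
    cov = X.T @ X / (X.shape[0] - 1)         # B x B covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :d]            # d leading components
    return (X @ top).reshape(H, W, d)

def extract_patch(img, row, col, size=11):
    """Extract a size x size neighborhood centered at (row, col); borders
    are handled by reflection padding (an illustrative assumption)."""
    r = size // 2
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="reflect")
    return padded[row:row + size, col:col + size, :]
```

The same `extract_patch` can be applied to the single-band LiDAR raster (with a trailing channel axis) so that both branches receive co-registered 11x11 neighborhoods.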
For the second convolutional layer, we let the HS network and the LiDAR network share parameters. Such a coupling strategy has at least two benefits. First, it halves the number of parameters in this layer, which is very useful when only small numbers of training samples are available. Second, it makes the two networks learn from each other. Without weight sharing, the training parameters in each network would be optimized independently using their own loss functions. After adopting the coupling strategy, the gradients backpropagated to this layer are determined by the loss functions of both networks, which means that the information in one network directly affects the other. For the third convolutional layer, we also use the coupling strategy, which further improves the discriminative ability of the representation learned by the second convolutional layer. Again, these two convolutional layers are followed by BN, ReLU, and max-pooling operators. The kernel sizes and the numbers of kernels (i.e., 32, 64, and 128, sequentially) of each convolutional layer are shown on the left side under each data source, and the output size of each operator is shown on the right side. It is worth noting that all the convolutional layers use padding operators to make the output size the same as the input size.
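A minimal PyTorch sketch of the coupled feature-learning branches follows. The `CoupledCNN` class name, the 3x3 kernel size, and the input sizes (20 PCA bands, one LiDAR band, 11x11 patches) are illustrative assumptions; the key point is that the second and third convolution blocks are single modules reused by both branches, so their weights are shared.

```python
import torch
import torch.nn as nn

class CoupledCNN(nn.Module):
    """Two-branch CNN with a branch-specific first convolution and
    coupled (parameter-sharing) second and third convolutions, each
    followed by BN, ReLU, and max-pooling as described in the text."""

    def __init__(self, hs_bands=20, lidar_bands=1):
        super().__init__()
        def block(conv):
            return nn.Sequential(conv, nn.BatchNorm2d(conv.out_channels),
                                 nn.ReLU(), nn.MaxPool2d(2))
        # branch-specific first layers (32 kernels each)
        self.conv1_hs = block(nn.Conv2d(hs_bands, 32, 3, padding=1))
        self.conv1_lidar = block(nn.Conv2d(lidar_bands, 32, 3, padding=1))
        # coupled second (64 kernels) and third (128 kernels) layers:
        # the same modules are applied to both branches, so the weights
        # receive gradients from both loss terms
        self.conv2 = block(nn.Conv2d(32, 64, 3, padding=1))
        self.conv3 = block(nn.Conv2d(64, 128, 3, padding=1))

    def forward(self, x_hs, x_lidar):
        f_hs = self.conv3(self.conv2(self.conv1_hs(x_hs)))
        f_lidar = self.conv3(self.conv2(self.conv1_lidar(x_lidar)))
        # flatten to per-sample feature vectors
        return f_hs.flatten(1), f_lidar.flatten(1)
```

With 11x11 inputs, the three max-pooling steps shrink the spatial size to 1x1, so each branch outputs a 128-dimensional feature vector.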
II-C Hyperspectral and LiDAR Data Fusion
After obtaining the feature representations of the hyperspectral and LiDAR data, how to combine them becomes another important issue. Most existing deep learning models [2, 34, 30] choose to stack them together and use a few fully-connected layers to fuse them. However, fully-connected layers often contain large numbers of parameters, which increases the training difficulty when only a small number of training samples is available. To this end, we propose a novel combination strategy based on feature-level and decision-level fusions. Let $\mathbf{F}_H$ and $\mathbf{F}_L$ denote the learned features for the hyperspectral and LiDAR data, respectively. As shown in Fig. 3, we first combine $\mathbf{F}_H$ and $\mathbf{F}_L$ to generate a new feature representation. Then, we input these three features into output layers separately. Finally, all the output layers are integrated to produce the final result. The whole fusion process can be formulated as

$\mathbf{O} = \mathcal{D}\big(f_1(\mathcal{F}(\mathbf{F}_H, \mathbf{F}_L); \mathbf{W}_1),\, f_2(\mathbf{F}_H; \mathbf{W}_2),\, f_3(\mathbf{F}_L; \mathbf{W}_3);\, \mathbf{w}\big).$  (1)

In the above equation, $\mathbf{O} \in \mathbb{R}^{C}$, where $C$ is the number of classes to discriminate, represents the final output of the fusion module; $\mathcal{D}$ and $\mathcal{F}$ are the decision-level and feature-level fusions, respectively; $f_1$, $f_2$, and $f_3$ are three output layers connected to $\mathcal{F}(\mathbf{F}_H, \mathbf{F}_L)$, $\mathbf{F}_H$, and $\mathbf{F}_L$, respectively; $\mathbf{W}_1$, $\mathbf{W}_2$, and $\mathbf{W}_3$ denote the connection weights for $f_1$, $f_2$, and $f_3$, respectively; $\mathbf{w}$ corresponds to the fusion weights for $\mathcal{D}$.
For the feature-level fusion $\mathcal{F}$, we use summation and maximization methods in addition to the widely used concatenation method. The summation fusion aims to compute the sum of the two representations:

$\mathcal{F}_s(\mathbf{F}_H, \mathbf{F}_L) = \mathbf{F}_H + \mathbf{F}_L.$  (2)

Similarly, the maximization fusion performs an element-wise maximization:

$\mathcal{F}_m(\mathbf{F}_H, \mathbf{F}_L) = \max(\mathbf{F}_H, \mathbf{F}_L).$  (3)
Obviously, the performance of $\mathcal{F}$ depends on its inputs $\mathbf{F}_H$ and $\mathbf{F}_L$. Therefore, we add two output layers, $f_2$ and $f_3$, to supervise their learning processes. In the output phase, they can also help make decisions. The output value of $f_1$ can be derived as follows:

$\mathbf{O}_1 = \sigma\big(\mathbf{W}_1 \mathcal{F}(\mathbf{F}_H, \mathbf{F}_L) + \mathbf{b}_1\big),$  (4)

where $\sigma$ represents the softmax function and $\mathbf{b}_1$ is a bias term. Similar to Equation (4), we can also derive the output values $\mathbf{O}_2$ and $\mathbf{O}_3$ for $\mathbf{F}_H$ and $\mathbf{F}_L$, respectively. For the decision-level fusion $\mathcal{D}$, we adopt a weighted summation method:

$\mathbf{O} = \mathbf{w}_1 \odot \mathbf{O}_1 + \mathbf{w}_2 \odot \mathbf{O}_2 + \mathbf{w}_3 \odot \mathbf{O}_3,$  (5)

where $\odot$ is an element-wise product operator; $\mathbf{w}_1$, $\mathbf{w}_2$, and $\mathbf{w}_3$ are three column vectors of $\mathbf{w}$; and the $k$-th element of $\mathbf{w}_i$ depends on the $k$-th class accuracy acquired by the $i$-th output layer on the training data.
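The fusion module described by Equations (2), (3), and (5) can be sketched as below. The `FusionHead` name, the choice to store the fusion weights as module buffers, and the restriction to the summation and maximization variants (concatenation would change the input dimension of $f_1$) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Feature-level fusion (element-wise sum or max), three parallel
    softmax output layers, and class-wise weighted decision fusion.
    The weight vectors w1, w2, w3 are buffers meant to be filled with
    per-class training accuracies (Eq. (8)); they default to ones."""

    def __init__(self, feat_dim=128, n_classes=15, mode="sum"):
        super().__init__()
        self.mode = mode
        self.out_fused = nn.Linear(feat_dim, n_classes)   # f1
        self.out_hs = nn.Linear(feat_dim, n_classes)      # f2
        self.out_lidar = nn.Linear(feat_dim, n_classes)   # f3
        for name in ("w1", "w2", "w3"):
            self.register_buffer(name, torch.ones(n_classes))

    def forward(self, f_hs, f_lidar):
        if self.mode == "sum":
            fused = f_hs + f_lidar                 # Eq. (2)
        else:
            fused = torch.maximum(f_hs, f_lidar)   # Eq. (3)
        o1 = torch.softmax(self.out_fused(fused), dim=1)
        o2 = torch.softmax(self.out_hs(f_hs), dim=1)
        o3 = torch.softmax(self.out_lidar(f_lidar), dim=1)
        # decision-level fusion: class-wise weighted summation (Eq. (5))
        return self.w1 * o1 + self.w2 * o2 + self.w3 * o3, (o1, o2, o3)
```

The three individual outputs are returned alongside the fused prediction so that each can receive its own supervision during training.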
II-D Network Training and Testing
The whole network in Fig. 1 is trained in an end-to-end manner using a given training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $N$ represents the number of training samples and $y_i$ is the ground truth for the $i$-th sample. After a feed-forward process, we obtain three outputs for each sample. Their loss values can be computed by a cross-entropy loss function. For instance, the loss value between the first output and the ground truth can be formulated as

$\mathcal{L}_1 = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{C} \mathbb{I}(y_i = k) \log\big(\mathbf{O}_1^{(i)}[k]\big),$  (6)

where $\mathbf{O}_1^{(i)}[k]$ is the predicted probability of the $k$-th class for the $i$-th sample. Similarly, we can also derive $\mathcal{L}_2$ and $\mathcal{L}_3$ for the other two outputs. $\mathcal{L}_1$ is designed to supervise the learning process of the fused feature between the hyperspectral and LiDAR data, while $\mathcal{L}_2$ and $\mathcal{L}_3$ are responsible for the hyperspectral and LiDAR features, respectively. The final loss value $\mathcal{L}$ is represented as the combination of $\mathcal{L}_1$, $\mathcal{L}_2$, and $\mathcal{L}_3$:

$\mathcal{L} = \mathcal{L}_1 + \gamma_2 \mathcal{L}_2 + \gamma_3 \mathcal{L}_3,$  (7)

where $\gamma_2$ and $\gamma_3$ represent the weight parameters for $\mathcal{L}_2$ and $\mathcal{L}_3$, respectively. In the experiments, we empirically set both of them to 0.01 because this achieves satisfactory performance. Their effects on the classification performance are analyzed in Section III-D.
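A direct transcription of the combined objective in Equation (7) could look like the following. We assume the three outputs are raw logits, since PyTorch's `CrossEntropyLoss` applies log-softmax internally; `total_loss` and its argument names are hypothetical.

```python
import torch
import torch.nn as nn

def total_loss(o1, o2, o3, y, gamma2=0.01, gamma3=0.01):
    """Combined objective of Eq. (7): the loss on the fused output plus
    down-weighted losses on the two single-source outputs. The paper
    empirically sets both weights to 0.01."""
    ce = nn.CrossEntropyLoss()
    return ce(o1, y) + gamma2 * ce(o2, y) + gamma3 * ce(o3, y)
```

A single backward pass through this scalar propagates gradients from all three classifiers into the shared (coupled) convolutional layers.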
As in most CNN models, $\mathcal{L}$ can be optimized using a backpropagation algorithm. Note that $\mathcal{L}_2$ and $\mathcal{L}_3$ can also be considered regularization terms for $\mathcal{L}_1$, thus reducing the overfitting risk during the network training process. Once the network is trained, we can use it to predict the label of each test sample. Firstly, the fusion weight $\mathbf{w}$ is computed on the training set. Its elements can be derived as

$\mathbf{w}_i[k] = \frac{\sum_{j=1}^{N} \mathbb{I}(\hat{y}_j^{(i)} = k,\; y_j = k)}{\sum_{j=1}^{N} \mathbb{I}(y_j = k)},$  (8)

where $\mathbf{w}_i[k]$ is the $k$-th class accuracy of the $i$-th output, $\hat{y}_j^{(i)}$ is the label predicted by the $i$-th output layer for the $j$-th training sample, and $\mathbb{I}(\cdot)$ is an indicator function whose value equals 1 when the condition holds and 0 otherwise. Secondly, for each test sample, we obtain three output values $\mathbf{O}_1$, $\mathbf{O}_2$, and $\mathbf{O}_3$ via a feed-forward propagation. Finally, the output value $\mathbf{O}$ can be derived using Equation (5).
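The fusion-weight computation of Equation (8), i.e., the per-class training accuracy of one output layer, can be sketched in NumPy as follows; `fusion_weights` is a hypothetical name.

```python
import numpy as np

def fusion_weights(preds, labels, n_classes):
    """Per-class fusion weights of Eq. (8): the k-th element is the
    k-th class accuracy achieved by one output layer on the training
    set (fraction of class-k samples predicted as class k)."""
    w = np.zeros(n_classes)
    for k in range(n_classes):
        mask = labels == k
        if mask.any():
            w[k] = np.mean(preds[mask] == k)   # class-k accuracy
    return w
```

Calling this once per output layer yields the three weight vectors used in the weighted summation of Equation (5).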
Class No.  Class Name  Training  Test 

1  Healthy grass  198  1053 
2  Stressed grass  190  1064 
3  Synthetic grass  192  505 
4  Tree  188  1056 
5  Soil  186  1056 
6  Water  182  143 
7  Residential  196  1072 
8  Commercial  191  1053 
9  Road  193  1059 
10  Highway  191  1036 
11  Railway  181  1054 
12  Parking lot 1  192  1041 
13  Parking lot 2  184  285 
14  Tennis court  181  247 
15  Running track  187  473 
  Total  2832  12197 
III Experiments
III-A Data Description
Class No.  Class Name  Training  Test 

1  Apple trees  129  3905 
2  Buildings  125  2778 
3  Ground  105  374 
4  Wood  154  8969 
5  Vineyard  184  10317 
6  Roads  122  3252 
  Total  819  29595 
We test the effectiveness of our proposed model on two hyperspectral and LiDAR fusion data sets.
1) Houston Data: The first data set was acquired over the University of Houston campus and the neighboring urban area in June 2012 [4]. It consists of a hyperspectral image and LiDAR data, both covering the same area with a spatial resolution of 2.5 m. The number of spectral bands of the hyperspectral data is 144. Fig. 4 shows a pseudo-color image of the hyperspectral data, a grayscale image of the LiDAR data, and ground-truth maps of the training and test samples. As shown in the figure, there are 15 different classes. The detailed numbers of samples for each class are reported in Table I. It is worth noting that we use the standard sets of training and test samples, which makes our results fully comparable with several works such as [8] and [4].
2) Trento Data: The second data set was captured over a rural area in the south of Trento, Italy. The LiDAR data were acquired by the Optech ALTM 3100EA sensor, and the hyperspectral data were acquired by the AISA Eagle sensor with 63 spectral bands. The two images have the same size, and the spatial resolution is 1 m. Fig. 5 visualizes this data set, and Table II lists the numbers of samples in the 6 different classes. Again, we use the standard sets of training and test samples to conduct experiments.
III-B Experimental Setup
In order to validate the effectiveness of the proposed model, we comprehensively compare it with several different models. Specifically, we first select the HS network (i.e., CNNHS) and the LiDAR network (i.e., CNNLiDAR) in Fig. 1 as two baselines, and compare different fusion methods on both the Houston and Trento data. Then, we focus on the Houston data and compare our model with numerous state-of-the-art models.
All of the deep learning models are implemented in the PyTorch framework. To optimize them, we use the Adam algorithm. The batch size, the learning rate, and the number of training epochs are set to 64, 0.001, and 200, respectively. The experiments are run on a personal computer with an Intel Core i7-4790 3.60 GHz processor, 32 GB RAM, and a GTX TITAN X graphics card.
The classification performance of each model is evaluated by the overall accuracy (OA), the average accuracy (AA), the per-class accuracy, and the Kappa coefficient. OA is the ratio of the number of correctly classified pixels to the total number of pixels in the test set, AA is the average of the accuracies over all classes, and Kappa measures the percentage of agreement corrected by the agreement that would be expected purely by chance.
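The three metrics can be computed from a confusion matrix as sketched below; the function name is our own, and we assume every class appears at least once in the ground truth.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Compute OA, AA, and the Kappa coefficient from predictions."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    oa = np.trace(cm) / cm.sum()                   # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)       # per-class accuracies
    aa = per_class.mean()                          # average accuracy
    # expected chance agreement from the marginal distributions
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / cm.sum() ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```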
III-C Experimental Results
Class No.  CNNHS  CNNLiDAR  CNNFC  CNNFM  CNNFS  CNNDFC  CNNDFM  CNNDFS 

1  82.91  60.30  82.91  81.86  89.93  82.81  83.00  85.57 
2  99.91  24.34  99.81  99.44  98.21  100  99.81  99.81 
3  91.29  66.53  97.43  97.03  98.61  96.44  97.62  97.62 
4  95.93  88.73  99.43  99.05  99.05  98.96  99.91  99.43 
5  100  24.81  100  98.86  99.72  100  99.91  100 
6  93.71  25.87  96.50  100  100  100  100  95.80 
7  91.60  61.19  87.41  96.74  91.98  91.32  90.39  95.24 
8  87.18  84.33  91.17  92.69  96.30  92.40  95.54  96.39 
9  86.87  40.32  87.25  92.92  92.92  89.33  93.86  93.20 
10  97.59  53.86  98.75  84.94  88.51  99.71  96.04  98.84 
11  89.56  80.46  97.15  97.34  96.49  99.43  98.39  96.77 
12  91.16  29.30  96.25  92.22  86.65  92.51  93.18  92.60 
13  88.77  81.05  92.98  92.63  89.82  89.82  92.98  92.98 
14  89.07  52.63  93.52  100  99.60  88.26  95.95  99.19 
15  90.91  29.81  100  92.81  99.58  100  98.73  100 
OA  92.05  54.52  94.37  93.92  94.49  94.74  95.29  96.03 
AA  91.76  53.57  94.70  94.57  95.16  94.73  95.69  96.23 
Kappa  0.9136  0.5082  0.9389  0.9340  0.9402  0.9429  0.9488  0.9569 
Class No.  CNNHS  CNNLiDAR  CNNFC  CNNFM  CNNFS  CNNDFC  CNNDFM  CNNDFS 

1  99.85  99.92  98.49  96.72  99.15  98.44  99.69  99.64 
2  94.67  93.16  97.01  97.05  96.36  97.73  98.81  97.66 
3  82.09  60.43  92.51  95.99  93.05  88.50  94.39  92.25 
4  98.73  99.12  99.11  100  100  100  99.88  99.96 
5  99.73  95.63  100  100  99.96  100  100  99.90 
6  76.31  50.59  90.53  92.69  89.71  93.64  94.00  92.40 
OA  96.31  91.91  98.17  98.48  98.37  98.77  99.12  98.80 
AA  91.90  83.14  96.28  97.08  96.37  96.39  97.80  96.97 
Kappa  0.9505  0.8917  0.9754  0.9796  0.9782  0.9835  0.9881  0.9839 
Traditional models  
Model  MLR  GGF  SLRCA  OTVCA  ODFADE  EUGF  HyMCKs 
OA  92.05  94.00  91.30  92.45  93.50  95.11  90.33 
AA  92.87  93.79  91.95  92.68    94.57  91.14 
Kappa  0.9137  0.9350  0.9056  0.9181  0.9299  0.9447  0.8949 
CNNrelated models  
Model  DF  CNNGBFF  CNNCK  TCNN  PToPCNN  CNNDFM  CNNDFS 
OA  91.32  91.02  92.57  87.98  92.48  95.29  96.03 
AA  91.96  91.82  92.48  90.11  93.55  95.69  96.23 
Kappa  0.9057  0.9033  0.9193  0.8698  0.9187  0.9488  0.9569 
III-C1 Comparison with different fusion models
In addition to the two single-source models (i.e., CNNHS and CNNLiDAR), we also test the effectiveness of the feature-level fusion models, i.e., using the feature-level fusion only. The three feature-level fusion methods CNNFC, CNNFM, and CNNFS stand for the concatenation, maximization, and summation methods, respectively. Similarly, the three combined decision-level and feature-level fusion methods in Fig. 3 are abbreviated as CNNDFC, CNNDFM, and CNNDFS, respectively. Table III shows the detailed classification results of the eight models on the Houston data, from which several conclusions can be drawn. First, for the single-source models, CNNHS achieves significantly better results than CNNLiDAR in each class, indicating that the spectral-spatial information in the hyperspectral data is more discriminative than the elevation information in the LiDAR data. Second, all three feature-level fusion models (i.e., CNNFC, CNNFM, and CNNFS) obtain higher accuracies than the CNNHS model in most classes. This can be explained by the fact that the LiDAR data provide complementary information for the hyperspectral data, and by combining them in a proper way, the classification performance can be improved. Third, on top of the feature-level fusion models, further using the decision-level fusion (i.e., CNNDFC, CNNDFM, and CNNDFS) improves the performance again. Taking the summation fusion method as an example, the simultaneous use of feature-level and decision-level fusions increases the OA from 94.49% to 96.03%, which is the best result reported in the literature so far. Last but not least, compared to the widely used concatenation method, our proposed maximization and summation fusion methods achieve better OA, AA, and Kappa values. Besides the quantitative results, we also qualitatively analyze the performance of the different models. Fig. 6 shows their classification maps, where different colors represent different object classes. From Fig. 6(b), we can see that the CNNLiDAR model generates many outliers and misclassifies a lot of objects. In comparison, the other models obtain more homogeneous classification maps. However, some objects are slightly over-smoothed, because all of the models use small patches and cubes as inputs.
Similar to the Houston data, Table IV and Fig. 7 show the quantitative and qualitative results, respectively, on the Trento data. This data set has larger and more homogeneous objects to discriminate than the Houston data, so all of the models achieve relatively high performance (e.g., the OA values are larger than 90%). Specifically, CNNHS is better than CNNLiDAR, and the feature-level fusion methods improve the performance of CNNHS. More importantly, the simultaneous feature-level and decision-level fusion is more effective than using the feature-level fusion only. The best results appear when adopting the maximization fusion method.
III-C2 Comparison with state-of-the-art models
In the existing hyperspectral and LiDAR data fusion works, most models are tested on the Houston data. To highlight the superiority of the proposed models, we also compare them with state-of-the-art models, including 7 traditional models and 5 CNN-related models, using the standard training and test sets. The traditional models include the multiple feature learning model MLR in [17], the generalized graph-based fusion model GGF in [20], the sparse and low-rank component analysis model SLRCA in [27], the total variation component analysis model OTVCA in [26], the adaptive differential evolution based fusion model ODFADE in [37], the unsupervised graph fusion model EUGF in [29], and the composite kernel extreme learning machine model HyMCKs in [7]. The CNN-related models include the deep fusion model DF in [2], the CNN model combined with a graph-based feature fusion method CNNGBFF in [5], the three-stream CNN based composite kernel model CNNCK in [18], the two-branch CNN model TCNN in [30], and the patch-to-patch CNN model PToPCNN in [34].
Table V reports the detailed comparison results of the different models in terms of OA, AA, and Kappa coefficients. Note that all the results are directly cited from the original papers, because we were not able to reproduce them due to missing parameters or unavailable codes. For the traditional models, the best OA, AA, and Kappa values are 95.11%, 94.57%, and 0.9447, respectively, achieved by a recent work named EUGF [29]. For the CNN-related models, CNNCK [18] obtains the best OA and Kappa values, while PToPCNN [34] acquires the best AA. Compared to the EUGF model, both the CNNCK and PToPCNN models obtain inferior performance, which indicates that the existing CNN-related fusion models still leave room for improvement. Similar to the DF [2] and TCNN [30] models, our proposed models (i.e., CNNDFM and CNNDFS) can also be considered two-branch CNN models. However, the proposed models obtain significantly better results than them, and even than EUGF, which sufficiently demonstrates the effectiveness of the proposed model.
III-D Analysis on the Proposed Model
III-D1 Analysis on the reduced dimensionality
The proposed model has two hyperparameters to predefine. The first one is the reduced dimensionality of the hyperspectral data after PCA, and the second one is the neighboring size of the patches extracted from the hyperspectral and LiDAR data. To evaluate the effect of the reduced dimensionality, we fix the neighboring size and select the dimensionality from a candidate set of values. Since the fusion models share the same hyperparameter values as the single-source models (i.e., CNNHS and CNNLiDAR), we only demonstrate the results of the single-source models here. Fig. 8 shows the performance (i.e., OA) of CNNHS on the Houston (the blue line) and Trento (the red line) data. From this figure, we can observe that as the dimensionality increases, OA first increases and then tends to a stable state. Considering the computational complexity and the classification performance, the reduced dimensionality can be set to 20 for both data sets.
III-D2 Analysis on the neighboring size
Similar to the analysis of the reduced dimensionality, we can fix it and choose the neighboring size from a candidate set to evaluate its effect. Table VI reports the changes of the OA values at different sizes. When the size increases from 9 to 11 on the Houston data, the improvements of OA acquired by CNNHS and CNNLiDAR are more than 1 percent, but for the other sizes, these two models do not change significantly. For the Trento data, CNNHS is relatively stable when the size changes, but CNNLiDAR increases by more than 1 percent from 9 to 11 and decreases from 11 to 13. Based on the above analysis, 11 is a reasonable choice for CNNHS and CNNLiDAR on both data sets. This choice is consistent with the works in [2] and [34].
Houston Data  

Size  9  11  13  15  17  19 
CNNHS  90.88  92.05  91.49  91.41  91.87  92.06 
CNNLiDAR  52.45  54.52  54.44  54.59  54.29  54.51 
Trento Data  
Size  9  11  13  15  17  19 
CNNHS  96.02  96.43  96.39  96.17  95.97  95.53 
CNNLiDAR  90.80  91.91  90.29  90.70  91.40  90.57 
Houston Data  

Time  CNNHS  CNNLiDAR  CNNFC  CNNFM 
Train  43.68  38.04  71.57  70.85 
Test  1.24  1.18  1.30  1.27 
Time  CNNFS  CNNDFC  CNNDFM  CNNDFS 
Train  70.90  185.71  182.54  184.43 
Test  1.28  1.38  1.33  1.37 
Trento Data  

Time  CNNHS  CNNLiDAR  CNNFC  CNNFM 
Train  32.11  21.84  49.99  49.53 
Test  1.33  1.24  1.44  1.37 
Time  CNNFS  CNNDFC  CNNDFM  CNNDFS 
Train  49.62  118.65  116.43  117.29 
Test  1.43  1.66  1.62  1.65 
III-D3 Analysis on the coupling strategy
Benefiting from the coupling strategy, the number of parameters in the second and third convolutional layers is halved. Taking the CNNDFM and CNNDFS models as examples, on the Houston data, the total number of trainable parameters is 196128 without weight sharing, while this number is reduced to 103968 after adopting the coupling strategy; on the Trento data, the numbers of trainable parameters are 192672 and 100512 without and with weight sharing, respectively. In summary, the numbers of parameters in the CNNDFM and CNNDFS models are reduced by about 47% on both data sets when the coupling strategy is employed. Besides, we also test the effect of the coupling strategy on the classification performance. Fig. 9 illustrates the changes of OA before and after adopting the coupling strategy on the Houston data (left) and the Trento data (right). It indicates that the performance of CNNDFC, CNNDFM, and CNNDFS in terms of OA is slightly improved after adopting the coupling strategy.
III-D4 Analysis on the computation cost
To quantitatively analyze the computational cost of the different models, Table VII and Table VIII report their computation times on the Houston and Trento data, respectively. From these two tables, we can observe that the CNNHS and CNNLiDAR models take less training time than the fusion models, because they only need to process single-source data, without any interaction between different sources. In contrast, the proposed decision-level and feature-level fusion models cost much more training time than the single-source and feature-level fusion models. Nevertheless, once the networks are trained, their test efficiency is very high. In particular, it takes no more than 2 seconds to finish the test process, which is close to the time costs of the other models.
III-D5 Analysis on the weight parameters
The loss function of the proposed model in Equation (7) contains two weight hyper-parameters. In order to test their effects on the classification performance, we first fix one of them and vary the other over a candidate set; then, we set the first to its optimal value and vary the second over the same set. Fig. 10 shows the OAs obtained by the proposed CNN-DFS model on the Houston and Trento data under different values of the two parameters, represented by the pink and the blue lines, respectively. As either parameter increases, the OA first increases and then decreases on both data sets, and the highest OA appears at 0.01. Therefore, the optimal value for both hyper-parameters is 0.01.
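The one-at-a-time sweep described above can be sketched as follows. The `evaluate` function is a stand-in for training CNN-DFS and measuring OA, and the candidate set is an assumption for illustration, not necessarily the paper's exact grid.

```python
# Sketch of the one-at-a-time search over the two loss weights.
# `evaluate` is a placeholder for "train CNN-DFS, return OA"; its toy
# objective peaks at (0.01, 0.01), mimicking the reported trend.
import math

CANDIDATES = [0.001, 0.01, 0.1, 1.0]  # assumed candidate set

def evaluate(w1, w2):
    # Placeholder objective with a maximum at w1 = w2 = 0.01.
    return -((math.log10(w1) + 2) ** 2 + (math.log10(w2) + 2) ** 2)

# Step 1: fix the second weight, sweep the first.
best_w1 = max(CANDIDATES, key=lambda w: evaluate(w, CANDIDATES[0]))
# Step 2: fix the first weight at its optimum, sweep the second.
best_w2 = max(CANDIDATES, key=lambda w: evaluate(best_w1, w))
```

This greedy two-step search is far cheaper than a full grid search (it trains the model 2n times instead of n^2 times for n candidates), at the cost of assuming the two weights interact only weakly.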
IV Conclusions
This paper proposed a coupled CNN framework for hyperspectral and LiDAR data fusion. Small convolution kernels and parameter-sharing layers were designed to make the model more efficient and effective. In the fusion phase, we used feature-level and decision-level fusion strategies simultaneously. For the feature-level fusion, we proposed summation and maximization methods in addition to the widely used concatenation method. For the decision-level fusion, we proposed a weighted-summation method, whose weights depend on the performance of each output layer. To validate the effectiveness of the proposed model, we conducted several experiments on two data sets. The experimental results show that the proposed model achieves the best performance on the Houston data and very high performance on the Trento data. Additionally, we thoroughly evaluated the effects of different hyper-parameters on the classification performance, including the reduced dimensionality and the neighborhood size. In the future, more powerful neighborhood extraction methods need to be explored, because the current classification maps still suffer from over-smoothing.
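The fusion operations summarized above can be illustrated with a short NumPy sketch. The feature shapes, class count, and per-output accuracies below are illustrative placeholders, not the trained model's values.

```python
import numpy as np

rng = np.random.default_rng(0)
f_hs = rng.random((4, 64))     # hypothetical hyperspectral branch features
f_lidar = rng.random((4, 64))  # hypothetical LiDAR branch features

# Feature-level fusion: concatenation, summation, maximization.
f_concat = np.concatenate([f_hs, f_lidar], axis=1)  # shape (4, 128)
f_sum = f_hs + f_lidar                              # shape (4, 64)
f_max = np.maximum(f_hs, f_lidar)                   # shape (4, 64)

# Decision-level fusion: weighted summation of per-output class
# probabilities, with weights derived from each output layer's
# (hypothetical) validation performance and normalized to sum to 1.
p1 = rng.dirichlet(np.ones(6), size=4)  # probabilities from output 1
p2 = rng.dirichlet(np.ones(6), size=4)  # probabilities from output 2
acc = np.array([0.90, 0.80])            # assumed per-output accuracies
w = acc / acc.sum()                     # normalized fusion weights
p_fused = w[0] * p1 + w[1] * p2         # rows remain valid distributions
```

Because the weights are normalized, the fused outputs stay valid probability distributions, and the better-performing output layer contributes proportionally more to the final decision.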
References
 [1] (2016) Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing 54 (10), pp. 6232–6251. Cited by: §I.
 [2] (2017) Deep fusion of remote sensing data for accurate classification. IEEE Geoscience and Remote Sensing Letters 14 (8), pp. 1253–1257. Cited by: §I, §I, §IIB, §IIB, §IIC, §IIIC2, §IIIC2, §IIID2.
 [3] (2016) Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 54 (12), pp. 7405–7415. Cited by: §I.
 [4] (2014) Hyperspectral and LiDAR data fusion: outcome of the 2013 GRSS data fusion contest. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7 (6), pp. 2405–2418. Cited by: §I, §IIIA.
 [5] (2017) Hyperspectral and LiDAR data fusion using extinction profiles and deep convolutional neural network. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10 (6), pp. 3011–3024. Cited by: §I, §IIIC2.
 [6] (2018) New frontiers in spectral-spatial hyperspectral image classification: the latest advances based on mathematical morphology, Markov random fields, segmentation, sparse representation, and deep learning. IEEE Geoscience and Remote Sensing Magazine 6 (3), pp. 10–43. Cited by: §I.
 [7] (2019) Multisensor composite kernels based on extreme learning machines. IEEE Geoscience and Remote Sensing Letters 16 (2), pp. 196–200. Cited by: §IIIC2.
 [8] (2019) Multisource and multitemporal data fusion in remote sensing: a comprehensive review of the state of the art. IEEE Geoscience and Remote Sensing Magazine 7 (1), pp. 6–39. Cited by: §I, §IIB, §IIIA.
 [9] (2017) Advances in hyperspectral image and signal processing: a comprehensive overview of the state of the art. IEEE Geoscience and Remote Sensing Magazine 5 (4), pp. 37–78. Cited by: §I.
 [10] (2015) A novel MKL model of integrating LiDAR data and MSI for urban area classification. IEEE Transactions on Geoscience and Remote Sensing 53 (10), pp. 5312–5326. Cited by: §I.

 [11] (2019) Cascaded recurrent neural networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 57 (8), pp. 5384–5394. Cited by: §I.
 [12] (2015) Matrix-based discriminant subspace ensemble for hyperspectral image spatial–spectral feature fusion. IEEE Transactions on Geoscience and Remote Sensing 54 (2), pp. 783–794. Cited by: §I.
 [13] (2018) Recent advances on spectral–spatial hyperspectral image classification: an overview and new guidelines. IEEE Transactions on Geoscience and Remote Sensing 56 (3), pp. 1579–1597. Cited by: §I.
 [14] (2018) LiDAR data classification using spatial transformation and CNN. IEEE Geoscience and Remote Sensing Letters. Cited by: §IIB.

 [15] (2018) An augmented linear mixing model to address spectral variability for hyperspectral unmixing. IEEE Transactions on Image Processing 28 (4), pp. 1923–1938. Cited by: §I.
 [16] (2019) CoSpace: common subspace learning from hyperspectral-multispectral correspondences. IEEE Transactions on Geoscience and Remote Sensing. Cited by: §I.
 [17] (2015) Fusion of hyperspectral and LiDAR remote sensing data using multiple feature learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 8 (6), pp. 2971–2983. Cited by: §IIIC2.
 [18] (2018) Hyperspectral and LiDAR fusion using deep three-stream convolutional neural networks. Remote Sensing 10 (10), pp. 1649. Cited by: §I, §IIIC2.
 [19] (2014) Combining feature fusion and decision fusion for classification of hyperspectral and LiDAR data. In Geoscience and Remote Sensing Symposium (IGARSS), 2014 IEEE International, pp. 1241–1244. Cited by: §I.
 [20] (2015) Generalized graph-based fusion of hyperspectral and LiDAR data using morphological features. IEEE Geoscience and Remote Sensing Letters 12 (3), pp. 552–556. Cited by: §I, §IIIC2.
 [21] (2018) Learning multiscale deep features for high-resolution satellite image scene classification. IEEE Transactions on Geoscience and Remote Sensing 56 (1), pp. 117–126. Cited by: §I.

 [22] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §I.
 [23] (2016) Classification of pixel-level fused hyperspectral and LiDAR data using deep convolutional neural networks. In Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), 2016 8th Workshop on, pp. 1–5. Cited by: §I.
 [24] (2018) Multiple kernel learning for remote sensing image classification. IEEE Transactions on Geoscience and Remote Sensing 56 (3), pp. 1425–1443. Cited by: §I.
 [25] (2012) Classification of remote sensing optical and LiDAR data using extended attribute profiles. IEEE Journal of Selected Topics in Signal Processing 6 (7), pp. 856–865. Cited by: §I.
 [26] (2017) Hyperspectral and LiDAR fusion using extinction profiles and total variation component analysis. IEEE Transactions on Geoscience and Remote Sensing 55 (7), pp. 3997–4007. Cited by: §I, §IIIC2.
 [27] (2017) Fusion of hyperspectral and LiDAR data using sparse and low-rank component analysis. IEEE Transactions on Geoscience and Remote Sensing 55 (11), pp. 6354–6365. Cited by: §I, §IIIC2.
 [28] (2019) Hyperspectral feature extraction using sparse and smooth low-rank analysis. Remote Sensing 11 (2), pp. 121. Cited by: §I.
 [29] (2018) Fusion of hyperspectral and LiDAR data with a novel ensemble classifier. IEEE Geoscience and Remote Sensing Letters 15 (6), pp. 957–961. Cited by: §I, §IIIC2.
 [30] (2018) Multisource remote sensing data classification based on convolutional neural network. IEEE Transactions on Geoscience and Remote Sensing 56 (2), pp. 937–949. Cited by: §I, §IIC, §IIIC2, §IIIC2.
 [31] (2018) Spectral-spatial unified networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing (99), pp. 1–17. Cited by: §IIB.
 [32] (2017) Convolutional neural networks for hyperspectral image classification. Neurocomputing 219, pp. 88–98. Cited by: §IIB.
 [33] (2016) Deep learning for remote sensing data: a technical tutorial on the state of the art. IEEE Geoscience and Remote Sensing Magazine 4 (2), pp. 22–40. Cited by: §I.
 [34] (2018) Feature extraction for classification of hyperspectral and LiDAR data using patch-to-patch CNN. IEEE Transactions on Cybernetics. Cited by: §I, §IIB, §IIC, §IIIC2, §IIID2.
 [35] (2016) Multisource geospatial data fusion via local joint sparse representation. IEEE Transactions on Geoscience and Remote Sensing 54 (6), pp. 3265–3276. Cited by: §I.

 [36] (2015) Ensemble multiple kernel active learning for classification of multisource remote sensing data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 8 (2), pp. 845–858. Cited by: §I.
 [37] (2017) Optimal decision fusion for urban land-use/land-cover classification based on adaptive differential evolution using hyperspectral and LiDAR data. Remote Sensing 9 (8), pp. 868. Cited by: §I, §IIIC2.
 [38] (2018) Spectral–spatial residual network for hyperspectral image classification: a 3-D deep learning framework. IEEE Transactions on Geoscience and Remote Sensing 56 (2), pp. 847–858. Cited by: §IIB.
 [39] (2017) Deep learning in remote sensing: a comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5 (4), pp. 8–36. Cited by: §I.