Efficiently utilizing complex-valued PolSAR image data via a multi-task deep learning framework

03/24/2019 ∙ by Lamei Zhang, et al. ∙ Harbin Institute of Technology ∙ NetEase, Inc

Accompanied by the successful progress of deep representation learning, convolutional neural networks (CNNs) have been widely applied to improve the accuracy of polarimetric synthetic aperture radar (PolSAR) image classification. However, in most applications, the difference between PolSAR images and optical images is rarely considered. The design of most existing network structures is not tailored to the characteristics of PolSAR data, and the complex-valued data of PolSAR images are simply treated as real-valued data so as to fit the existing mainstream network pipelines and avoid complex-valued operations. These compromises prevent CNNs from realizing their full potential in PolSAR image classification tasks. In this paper, we focus on finding a better input form for PolSAR image data and on designing special CNN structures that are more compatible with PolSAR images. Exploiting the relationship between a complex number and its amplitude and phase, we extract the amplitude and phase of the complex-valued PolSAR data as input, which maintains the integrity of the original information while avoiding the still-immature complex-valued operations, and we propose a novel multi-task CNN framework adapted to this novel form of input. Furthermore, in order to better exploit the unique phase information in PolSAR data, depthwise separable convolutions are applied to the proposed multi-task CNN model. Experiments on three benchmark datasets not only prove that using amplitude and phase information as input does contribute to improved classification accuracy, but also verify the effectiveness of the proposed methods for amplitude and phase input.


1 Introduction

Polarimetric synthetic aperture radar (PolSAR), as one of the most advanced detectors in the field of remote sensing, can describe targets comprehensively in all weather and at all times. Because rich information about the Earth's surface can be captured by PolSAR, it has a wide range of applications in various fields, such as agriculture, fisheries, urban planning and environmental monitoring. As the basis of PolSAR image interpretation, the classification problem has always been a hot research topic. In recent years, with the maturing of pattern recognition methods, PolSAR image classification algorithms have made great progress. However, due to the complexity of the echo imaging process, effectively mining knowledge from the existing complex-valued PolSAR data remains an open question.

Different from traditional manual (Zou et al., 2017) or statistical learning based feature extraction methods (Turk & Pentland, 1991; Vapnik, 1995), deep learning (Lecun et al., 2015) uses a deep neural network model to find a latent representation of the original data, and has achieved the state of the art in most pattern recognition problems. There are two main reasons for the success of deep learning. The first is its high degree of flexibility, reflected in its ability to adapt to any form of input data, such as images (Lecun et al., 1989), natural language (Mikolov et al., 2013), audio (Graves et al., 2013) and video (Shuiwang et al., 2013). For different inputs, simply adjusting the network structure according to the input's characteristics can achieve good results, which avoids the difficulty of designing manual feature extraction methods. The second is its powerful capacity for feature selection. Deep neural networks can extract high-level features that cannot be obtained by traditional methods, and these more universal features naturally avoid the difficulty of designing multi-class classifiers. Such an end-to-end learning framework is clearly well suited to PolSAR image interpretation.

Deep learning based image processing models, represented by convolutional neural networks (CNNs) in computer vision, have been the focus of our attention since the ImageNet Large-Scale Visual Recognition Challenge 2012 (Krizhevsky et al., 2012). At present, CNN based algorithms are quite mature and can deal with various multi-class classification problems (He et al., 2015b; Gao et al., 2016). Some pioneering work on applying CNNs to PolSAR image processing has existed for some time. (Chen et al., 2016) applied CNNs to SAR target classification for the first time, which demonstrated the strong feature selection ability of CNNs and achieved excellent experimental results on the MSTAR dataset. A fully convolutional network based classification method combining deep and shallow features was proposed in (Yan et al., 2018). As another form of deep learning, autoencoders have also been applied to PolSAR image classification tasks (Hou et al., 2017; Zhang et al., 2016; De et al., 2018b). Some scholars seek higher classification accuracy by improving the CNN's network structure and parameters (Wang et al., 2018; Zhou et al., 2017; De et al., 2018a). In addition, CNNs have also demonstrated potential in SAR image denoising (Wang et al., 2017).

Although CNNs have been used in PolSAR image classification to some extent, few works consider the characteristics of PolSAR image data when designing classification algorithms. Unlike optical images, a PolSAR image describes the scattering properties of each pixel through the polarization scattering matrix. Each element of the matrix represents the backscattering coefficient produced by the terrain for polarized electromagnetic waves transmitted and received in different directions. The polarization scattering matrix, polarization coherence matrix and polarization covariance matrix are all complex-valued representations, and the corresponding amplitude and phase information can be derived from the complex-valued source data of each pixel. The amplitude information usually corresponds to the backscattering intensity of the target, which is strongly correlated with the gray-scale information obtained by visible-light imaging. The phase information corresponds to the distance between the sensor platform and the target, which is not available from other detectors. Moreover, there is a coupling relationship between the phase information obtained from different transmitting and receiving directions of the electromagnetic wave, so a PolSAR image can reflect the scattering characteristics of targets better than a single-channel SAR image. Because of this unique echo imaging system, PolSAR images contain richer terrain information than optical images. However, the rich information and distinctive nature of PolSAR images also make their interpretation difficult. How to properly adapt deep learning to PolSAR image data and improve classification accuracy is still an urgent problem.

From the authors' point of view, the main problem in applying deep learning to PolSAR image classification is that the difference between PolSAR images and optical images has not been deeply considered. CNNs originate from computer vision, whose background is optical imagery, so most CNNs only consider extracting features from single-amplitude information when designing their network structures. When applying CNNs to PolSAR images, most scholars follow this pattern of extracting information from a single angle, which wastes much of PolSAR's rich information. We believe that existing CNN models from computer vision should be used in a targeted way, rather than transplanted blindly.

In order to build a special network for PolSAR image classification, we think the following two key issues need to be considered: the form of the input data and the structure of the network model. Starting from the form of the input data, it is natural to consider constructing a complex-valued deep neural network for image classification, since PolSAR image data is complex-valued. However, complex-valued neural networks have been marginalized in the current research landscape: most remain at the conjecture stage and lack a sufficient theoretical basis. Therefore, most current applications directly split the complex-valued PolSAR data into real and imaginary parts as the network's input. In effect, this artificially treats complex-valued data as real-valued data and follows the pattern of CNNs that process optical images; the root cause is a deliberate avoidance of complex-valued operations that are difficult to implement. In this paper, we try to find a representation that preserves the information of the complex-valued data to the maximum extent while still avoiding complex-valued operations. The other key issue is the design of the network structure. According to the principles of deep representation learning, the structure of a deep neural network should be adjusted adaptively and pertinently for different forms of input data, which is often neglected in most applications. In addition, the latent coupling relationship among PolSAR phase channels can help identify targets better, so how to explore this latent relationship should also be considered. In this paper, we try to design a tailored convolutional network structure that not only fits the novel PolSAR input form, but can also excavate the latent relationships among the multiple phase channels.

There is also some work that takes the characteristics of PolSAR image data into account and makes an effort to design exclusive network structures for PolSAR image classification. Although the research is not yet mature, a complex-valued CNN framework was proposed to fit the complex-valued PolSAR data (Zhang et al., 2017c), which provides a basis for follow-up research. A 3D-convolution based CNN structure was proposed to better extract the connections between the channels of PolSAR images (Zhang et al., 2018). Besides, some scholars use feature-extracted PolSAR data as the input of CNNs (Chen & Tao, 2018; Xu et al., 2018).

Based on the above background, in this paper we look for better ways to boost PolSAR image classification using deep CNN based methods. In order to address the two key issues raised earlier, we first propose to use the amplitude and phase of the complex-valued PolSAR data as the input to deep neural networks. This fully retains the information of the complex-valued data while avoiding immature complex-valued operations, and changes the current practice of treating the phase of complex-valued data as if it were amplitude. Considering that PolSAR image data is characterized by two parts, amplitude and phase, we construct a multi-task CNN framework based on existing network design techniques to better accommodate this input and address the second issue. Furthermore, a depthwise separable convolution (Chollet, 2016) based multi-task CNN model is proposed to deeply explore the latent links in the phase information of PolSAR images. To the best of our knowledge, the work done in this paper has not been attempted before. Experimental results on several benchmark datasets demonstrate the validity of the proposed methods.

The rest of this paper is organized as follows: Background and some relevant technical literature are reviewed in Section 2. The proposed strategies and relevant analysis are listed in Section 3. In Section 4, experimental results are exhibited. Conclusion and possible future directions for further development are given in Section 5.

2 Background

As one of the representative algorithms of deep learning, CNNs have been widely used in the field of image processing, and CNN based algorithms have achieved the state of the art in a variety of image processing tasks, such as image classification (Szegedy et al., 2014; Gao et al., 2016), semantic segmentation (Shelhamer et al., 2014; Ronneberger et al., 2015; Badrinarayanan et al., 2017), instance segmentation (He et al., 2017), target detection (Ren et al., 2017; Redmon et al., 2015; Liu et al., 2016) and fine-grained recognition (Fu et al., 2017; Ning et al., 2014). CNNs have greatly changed the paradigm of traditional image processing methods and established an end-to-end learning framework. Given a designed model and objective function, the network automatically solves a complex non-convex optimization problem to find a mapping from the original data to the prediction; the design of the network model and the construction of the objective function embody the wisdom of the network designer. Unlike traditional machine learning algorithms (Cortes & Vapnik, 1995; Huang et al., 2006; Wright et al., 2009), CNNs have the ability to utilize massive data and extract high-level features, avoiding the difficulty of designing manual feature extractors and classifiers. Moreover, the advent of fast computing based on graphics processing units (GPUs) has greatly promoted the application of CNNs in engineering. For a long time, CNNs have followed the LeNet-style (Lecun et al., 1998) paradigm of a cascaded convolution-subsampling structure; its application to PolSAR image processing is shown in Fig. 1.

Fig. 1: A classical LeNet-style network structure of CNNs for PolSAR image processing. The complex data contained in all elements of the S, T or C matrix of the PolSAR image are divided into real and imaginary parts. Each real-valued element occupies one channel, and each complex-valued element is divided into two channels. Data from multiple channels are concatenated along the channel dimension as the input of the network. The network follows the LeNet-style structure (a cascade of convolution-pooling layers) used in optical image processing.

Because of its huge number of redundant parameters, the fully connected layer in the original structure has mostly been replaced by global average pooling. The usual choice at the output of the network is a softmax classifier that converts the network output into a probability distribution. The rectified linear unit (ReLU) nonlinear activation function (Nair & Hinton, 2010), batch normalization (BN) (Ioffe & Szegedy, 2015), skip connections (He et al., 2015b) and dropout (Srivastava et al., 2014) are commonly used tricks in network design to increase depth or generalization performance. In addition, some targeted improvements have been extensively studied. Relevant methods for initializing network parameters can be found in (Glorot & Bengio, 2010; He et al., 2015c). Some network models do not pursue deeper layers but wider ones, and have achieved good results (Szegedy et al., 2014). Scholars are also enthusiastic about improving the basic components of CNNs: convolution, pooling and activation functions. Variants of classical convolution have been studied to adapt to different tasks, such as 3D convolution (Zhang et al., 2018; Tran et al., 2014; Maturana & Scherer, 2015; Qiu et al., 2017), dilated convolution (Yu & Koltun, 2015), depthwise separable convolution (Chollet, 2016; Howard et al., 2017; Zhang et al., 2017b), network in network (Lin et al., 2013) and group convolution (Krizhevsky et al., 2012; Zhang et al., 2017a). In addition to max pooling and average pooling, some new pooling methods have also been developed (Estrach et al., 2014; Zeiler & Fergus, 2013; Kaiming et al., 2014). Alternative activation functions such as Leaky ReLU (Maas et al., 2013), PReLU (He et al., 2015a) and ELU (Xu et al., 2015) have been proposed.

In order to handle images with multiple data sources, multi-stream CNN structures have been studied in (Guo et al., 2016; Simonyan & Zisserman, 2014a; Feichtenhofer et al., 2016). The core idea of this kind of network structure is to use different branches of the network to process the different data types in multi-source data. This rests on the hypothesis that typical features of different data types may be difficult to extract within a single shared branch. In essence, it also increases the width of the network in order to achieve better performance. The simple two-stream network for video detection (Simonyan & Zisserman, 2014a) and the improved structure with a convolutional information fusion module (Guo et al., 2016; Feichtenhofer et al., 2016) can be seen in Fig. 2.

Fig. 2: Network structure of a two-stream CNN (a) and a two-stream fusion CNN (b). The structures of the two network models are similar; the difference lies in whether the two branches of the network fuse information or not.

With the growth of CNNs, a number of CNN based PolSAR image classification methods have been developed (Wang et al., 2018; Zhou et al., 2017; De et al., 2018a; Guo et al., 2017; Bi et al., 2018). PolSAR image classification is the basis for further interpretation of PolSAR images and extraction of hidden information. In this task, every pixel must be assigned a certain category; in the field of computer vision, this is a semantic segmentation problem. The mature solution to semantic segmentation is to use a fully convolutional network (FCN) (Shelhamer et al., 2014) that takes the whole image as input and outputs the classification result for the whole image. This avoids the loss of detail caused by image cutting and greatly reduces training and testing time. The foundation of FCN based models is manually labeling each pixel of the input image. However, the pixel-wise labeling that can be carried out on optical images is much less feasible for PolSAR images, because PolSAR images are very difficult to fully understand for people without professional knowledge, let alone to assign a clear label to each pixel. Therefore, at present, PolSAR image classification still follows the normal pattern of slicing the image and recognizing all the image patches. The flow chart of this task can be seen in Fig. 3.

Fig. 3: Application flowchart of CNNs for PolSAR image classification.

At present, most of the latest research on PolSAR image classification is based on deep learning, with CNN based methods in the majority. In (Chen & Tao, 2018; Xu et al., 2018; Zhou et al., 2017; De et al., 2018a; Liu et al., 2018), CNNs are applied to PolSAR image classification from different perspectives, and some improvements have been made. However, these works did not consider how to design a special network to fit the characteristics of PolSAR data, instead directly following network structures designed for optical image classification. Considering that PolSAR images carry unique phase information that is significantly different from optical imagery, it is wiser, in the spirit of deep representation learning, to modify the network structure to adapt to the data characteristics. Some methods (Chen & Tao, 2018; Xu et al., 2018) used manually extracted PolSAR features as network input; although this seeks a solution from the perspective of the data input, it contradicts the original intention of representation learning, namely learning a mapping from raw data to labels. Some basic work has also been done in the other direction: (Zhang et al., 2017c) constructed a special complex-valued convolutional network to adapt to the complex-valued PolSAR data, which is the ideal choice from a modeling perspective. But at present, the development of complex-valued neural networks is not mature, and many mainstream network design tricks have not been extended to the complex domain.

3 Contributions

In this section, we present the implementation details of the proposed CNN based PolSAR image classification algorithms. First, the original complex-valued PolSAR data is converted into amplitude and phase information. Then, a portion of the labeled samples is used to train the proposed network, and the trained model parameters are saved. All labeled samples are used to test the proposed convolutional network model. Finally, a full-image pixel-wise classification result is obtained.

3.1 Representation forms of PolSAR image

The polarization scattering matrix can fully characterize the electromagnetic scattering properties of different types of ground targets. The scattering matrix is defined as:

$$S = \begin{bmatrix} S_{HH} & S_{HV} \\ S_{VH} & S_{VV} \end{bmatrix} \qquad (1)$$

where $S_{pq}$ ($p, q \in \{H, V\}$) represents the backscattering coefficient for a polarized electromagnetic wave emitted with polarization $q$ and received with polarization $p$, and $H$ and $V$ represent horizontal and vertical polarization, respectively. According to the reciprocity theorem, the matrix satisfies $S_{HV} = S_{VH}$. In order to describe the scattering properties of targets more clearly, the $S$ matrix is usually vectorized, and the polarization coherence matrix $T$ or polarization covariance matrix $C$ containing all polarization information is obtained.

The polarization vector and coherence matrix based on the Pauli decomposition are expressed as (2) and (3):

$$\mathbf{k} = \frac{1}{\sqrt{2}}\left[\, S_{HH} + S_{VV},\; S_{HH} - S_{VV},\; 2S_{HV} \,\right]^{T} \qquad (2)$$

$$T = \left\langle \mathbf{k}\,\mathbf{k}^{H} \right\rangle \qquad (3)$$

Notice that the polarization coherence matrix $T$ is a Hermitian matrix: every element except the diagonal elements is a complex number. Generally, only the non-repeated elements of the matrix, that is, the upper triangular elements, are taken as input. Thus each pixel of a PolSAR image is described by three real numbers (the diagonal elements) and three complex numbers (the off-diagonal elements $T_{12}$, $T_{13}$, $T_{23}$). The usual practice is to split the real and imaginary parts of the three complex numbers, so that besides the three real diagonal values, each pixel yields three real parts and three imaginary parts. At this point, each pixel is described by nine real values, so when the $T$ matrix of a PolSAR image is used as input, the input generally has nine channels.
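As a concrete illustration of this conventional channel construction, the following NumPy sketch stacks the nine real-valued channels from a per-pixel coherence matrix. The (H, W, 3, 3) array layout and the function name are our assumptions, not a format prescribed by the paper.

```python
import numpy as np

def t_matrix_to_real_imag(T):
    """Conventional 9-channel real-valued input from a coherence matrix.

    T: complex array of shape (H, W, 3, 3), the per-pixel T matrix.
    Returns a real array of shape (H, W, 9): the three real diagonal
    elements plus the real and imaginary parts of the three
    upper-triangular elements T12, T13, T23.
    """
    channels = [
        T[..., 0, 0].real, T[..., 1, 1].real, T[..., 2, 2].real,  # diagonal (real)
        T[..., 0, 1].real, T[..., 0, 1].imag,                     # T12
        T[..., 0, 2].real, T[..., 0, 2].imag,                     # T13
        T[..., 1, 2].real, T[..., 1, 2].imag,                     # T23
    ]
    return np.stack(channels, axis=-1).astype(np.float32)
```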

However, the weakness of this practice is that it breaks the encapsulation of the complex-valued PolSAR data. From the previous analysis, the complex-valued data in the PolSAR image is split into real and imaginary parts, which are then treated as if they were no different from real-valued data. In fact, this rough equivalence exists only to accommodate the general pattern CNNs use when processing optical images, and the authors believe such a compromise limits the capability of CNNs. The direct solution to this problem would be to build a complex-valued network model; although some literature mentions complex-valued networks, their development is very preliminary, and most mainstream deep learning methods are formulated over the real domain and have not been extended to the complex domain. Notice, however, that a complex number $z = a + bi$ can be expressed in terms of its amplitude $A$ and phase $\theta$ as $z = A e^{i\theta}$, where $A$ and $\theta$ are real numbers representing the amplitude and phase of $z$. They can be calculated by the following formulas:

$$A = \sqrt{a^{2} + b^{2}} \qquad (4)$$

and

$$\theta = \arctan\!\left(\frac{b}{a}\right) \qquad (5)$$

After the above operation, the complex-valued data can be converted into its corresponding amplitude and phase data. Since the real-valued diagonal data of the coherence matrix $T$ already represents amplitude (intensity) information, we obtain in total six channels of amplitude information and three channels of phase information.
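The conversion just described can be sketched as follows; again, the (H, W, 3, 3) layout of the coherence matrix is an assumption, and np.abs and np.angle implement Eqs. (4) and (5).

```python
import numpy as np

def t_matrix_to_amplitude_phase(T):
    """Proposed input form: amplitude and phase instead of real/imaginary.

    T: complex array of shape (H, W, 3, 3).
    Returns (amplitude, phase): six amplitude channels (three diagonal
    intensities plus the moduli of T12, T13, T23, Eq. (4)) and three
    phase channels (the arguments of T12, T13, T23, Eq. (5)).
    """
    diag = np.stack([T[..., i, i].real for i in range(3)], axis=-1)
    upper = np.stack([T[..., 0, 1], T[..., 0, 2], T[..., 1, 2]], axis=-1)
    amplitude = np.concatenate([diag, np.abs(upper)], axis=-1)  # Eq. (4)
    phase = np.angle(upper)                                     # Eq. (5)
    return amplitude.astype(np.float32), phase.astype(np.float32)
```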

3.2 Multi-task convolutional network for PolSAR image classification

In order to maintain the integrity of the complex-valued PolSAR data, we transform the complex numbers into their amplitude and phase as the input of the deep neural network. Following the guidance of deep representation learning theory, the network structure should be adjusted to fit this change in the input data. Optical image data contains only amplitude information, so CNN models for optical images can only learn representations from a single angle. Notice that PolSAR images are not unique in having input that can be divided into multiple perspectives: for video action recognition, the general approach is to extract features from the source data along the spatial and temporal dimensions simultaneously, which is similar in spirit. Inspired by the two-stream network structures for edge detection (Guo et al., 2016) and for video recognition (Simonyan & Zisserman, 2014a; Feichtenhofer et al., 2016), we propose a multi-task CNN to better learn latent representations from PolSAR amplitude and phase data.

Fig. 4: General view of the proposed multi-task CNN structure. There are six channels of amplitude information and three channels of phase information in the PolSAR input. The two information sources are fed into two different branches of the network, which extract information separately. Then, while retaining the acquired information, the deep information is fused by a convolutional fusion layer. There are four classifiers in the network structure, corresponding to four different tasks: classifying with amplitude information, classifying with phase information, classifying with fused information, and classifying by synthesizing the above three kinds of information. Later, we improve the phase-information stream and the convolutional fusion layer according to the needs of PolSAR image classification.

It can be seen from Fig. 4 that the proposed model is similar to the two-stream CNN model to some degree, but it also has significant differences.

Architecture of the individual streams: The input of one branch is the six-channel amplitude data of the PolSAR image, and the input of the other is the three-channel phase data. A mature CNN basic structure, the VGG-style network (Simonyan & Zisserman, 2014b), is used in the design of the two individual streams. VGG-style networks replace the original single convolution layer with a convolution block composed of multiple cascaded convolution layers, as can be seen in Fig. 5.

Fig. 5: Basic component module of a classical CNN (a) and of VGG-Net (b). The difference is that the basic modules of VGG-style networks are convolution blocks composed of multiple convolution layers; continuous convolution operations are performed to extract deeper abstract features.

To extract latent representations from the amplitude and phase information respectively, each stream has six cascaded convolution layers, and every two consecutive convolution layers can be considered a convolution block. There is a sub-sampling layer between every two convolution blocks. The output of the last convolution layer is flattened, and the dimensions are then converted to the number of labels through two fully connected layers. Finally, a normalized probability output is obtained by a softmax classifier. Notice that if we consider this part of the network separately and take the average of the two softmax classifiers as the final output, it is no different from the traditional two-stream CNN used in video detection (Simonyan & Zisserman, 2014a). A sketch of one stream is given below.
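The following Keras sketch shows one such branch under stated assumptions: the patch size (12), the hidden width of the fully connected layer (128) and the helper name build_stream are illustrative, while the filter counts 32, 64, 64 follow the settings reported in Section 4.2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_stream(in_channels, num_classes, filters=(32, 64, 64), name="stream"):
    """One branch of the multi-task network: three VGG-style blocks of two
    convolutions each (six convolution layers in total), sub-sampling
    between blocks, then two fully connected layers and a softmax head."""
    x_in = layers.Input(shape=(12, 12, in_channels))  # patch size assumed
    x = x_in
    for i, f in enumerate(filters):
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
        if i < len(filters) - 1:            # sub-sampling between blocks
            x = layers.MaxPooling2D(2)(x)
    feat = x                                # feature maps kept for fusion
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    side_out = layers.Dense(num_classes, activation="softmax",
                            name=f"{name}_softmax")(x)
    return tf.keras.Model(x_in, [feat, side_out], name=name)
```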

Early fusion: The amplitude and phase information of a PolSAR image describe a ground target from different perspectives, and it would be unwise to treat the two separately; naturally, combining them yields more useful information. This issue has also been identified before, and a two-stream structure with information fusion was proposed for video recognition (Simonyan & Zisserman, 2014a). The most common choice is to concatenate the outputs of the two branches and fuse them through a convolutional layer to achieve multi-source information fusion (Hu et al., 2017). In this paper, we use a more carefully designed structure, based on recent research progress, to better fuse the information extracted from the two branches. We observe the following phenomenon: many networks have no fusion mechanism at all and simply average the outputs of their branches, yet still achieve reasonable results. This shows that the branch outputs are already relatively high-level features after passing through the cascaded convolution layers. Our idea is therefore to preserve these existing high-level features while extracting deeper features through information fusion. This idea of feature reuse can be implemented through a DenseNet-style structure, whose mathematical expression can be written as:

$$x_{\ell} = H_{\ell}\left(\left[x_{0}, x_{1}, \ldots, x_{\ell-1}\right]\right) \qquad (6)$$

where $x_{\ell}$ is the output of the $\ell$-th layer, $H_{\ell}(\cdot)$ is the operation of this layer, and $[\cdot]$ denotes channel-wise concatenation, as shown in Fig. 6.

Fig. 6: A four-layer dense block. The main difference is that each layer in a dense block takes all preceding feature maps as input. Such a structure protects the deep features that have already been acquired and promotes effective feature reuse.

A five-layer densely connected dense block is used to replace the traditional cascaded convolution structure to achieve better information fusion. After that, a global average pooling layer, a fully connected layer and a softmax layer follow to output the predicted label. A sketch of this fusion stage is given below.
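The sketch below implements Eq. (6) and the fusion head, assuming the build_stream helper above supplies the branch feature maps; the growth rate of 16 follows Section 4.2 and the BN-ReLU-Conv ordering follows the implementation details, while the helper names are ours.

```python
from tensorflow.keras import layers

def dense_fusion_block(x, num_layers=5, growth_rate=16):
    """Densely connected fusion block (Eq. (6)): each layer takes the
    concatenation of all preceding feature maps as input, so the
    high-level features already extracted by both streams are reused."""
    features = [x]
    for _ in range(num_layers):
        h = layers.Concatenate()(features) if len(features) > 1 else features[0]
        h = layers.BatchNormalization()(h)
        h = layers.Activation("relu")(h)
        h = layers.Conv2D(growth_rate, 3, padding="same")(h)
        features.append(h)
    return layers.Concatenate()(features)

def fusion_head(amp_feat, pha_feat, num_classes):
    """Early fusion head: concatenate the two streams' feature maps,
    fuse them densely, then global average pooling and a softmax."""
    x = layers.Concatenate()([amp_feat, pha_feat])
    x = dense_fusion_block(x)
    x = layers.GlobalAveragePooling2D()(x)
    return layers.Dense(num_classes, activation="softmax",
                        name="fusion_softmax")(x)
```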

Extra classifiers: The structure described above, which first separates and then merges, appears in much of the literature (Simonyan & Zisserman, 2014a; Feichtenhofer et al., 2016). From the perspective of further utilizing side outputs and multi-task learning, we make further improvements to this structure, and this part is the main difference between the proposed network and the two-stream structure. Using the side outputs of a network can prevent the gradient from vanishing, enhance detail preservation and improve the interpretability of features (Szegedy et al., 2014; Lee et al., 2014). Since the utilization of side outputs is usually accompanied by the addition of extra classifiers, it is naturally associated with the ideas of ensemble learning and joint decision making. At the same time, this structure is also widely used to implement multi-task learning (Redmon et al., 2015; Ren et al., 2017). From the previous introduction, the proposed framework has three side classifiers, corresponding to three tasks: extracting features from the amplitude information, extracting features from the phase information, and fusing the first two kinds of features. Without loss of generality, we denote the collection of shared network parameters as $W$ and suppose that there are $M$ side classifiers in total (in this paper $M = 3$), whose parameters are written as $w = \left(w^{(1)}, \ldots, w^{(M)}\right)$. Thus, the sum of the side objectives of the proposed structure can be written as:

$$\mathcal{L}_{\mathrm{side}}(W, w) = \sum_{m=1}^{M} \alpha_{m}\, \ell_{\mathrm{side}}^{(m)}\!\left(W, w^{(m)}\right) \qquad (7)$$

where $\ell_{\mathrm{side}}^{(m)}$ denotes the cross-entropy loss and $\alpha_{m}$ is the weight of the $m$-th side classifier. The reason we add extra classifiers is that, because of the scarcity of labeled PolSAR samples, it is difficult to optimize a deep network; adding three tasks with extra classifiers to the objective function provides complementary regularization and is beneficial to optimization when labeled samples are lacking. A sketch of how the side losses can be combined is given below.
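One way to realize Eq. (7) in Keras is to expose the three side outputs and weight their cross-entropy losses. The sketch assumes the build_stream and fusion_head helpers from the previous sketches and equal weights alpha_m = 1, which the paper does not specify.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_mcnn(num_classes, alpha=(1.0, 1.0, 1.0)):
    """Assemble the three side tasks and weight their losses as in Eq. (7).
    Assumes build_stream/fusion_head from the sketches above."""
    amp_stream = build_stream(6, num_classes, name="amplitude")
    pha_stream = build_stream(3, num_classes, name="phase")
    amp_in = layers.Input(shape=(12, 12, 6))   # six amplitude channels
    pha_in = layers.Input(shape=(12, 12, 3))   # three phase channels
    amp_feat, amp_out = amp_stream(amp_in)
    pha_feat, pha_out = pha_stream(pha_in)
    fus_out = fusion_head(amp_feat, pha_feat, num_classes)
    model = tf.keras.Model([amp_in, pha_in], [amp_out, pha_out, fus_out])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",   # the l_side^(m) of Eq. (7)
                  loss_weights=list(alpha))          # the alpha_m of Eq. (7)
    return model
```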

Advanced fusion: Within the proposed framework, early fusion layers mix together the information extracted from two branches at the feature level. Due to the addition of extra classifiers and the establishment of multi-task model, we can seek a deeper fusion at the level of classification results. This technique also exists in some semantic segmentation models to preserve the details of the image (Xie & Tu, 2015; Hou et al., 2016). The deep fusion operation we adopt can be divided into two stages. First, inspired by (Kim et al., 2016), a simple weighted fusion layer as shown in Fig. 7 is used to further mix together the previous three outputs in order to maintain better accuracy and reduce misclassification.

Fig. 7: The structure of weighted fusion module for three information sources.

In addition, we also add a classifier to this fusion layer, whose objective can be written as:

$$\mathcal{L}_{\mathrm{fuse}}(W, w, h) = \ell\!\left(Y,\; \sigma\!\Big(\sum_{m=1}^{M} h_{m}\, A_{\mathrm{side}}^{(m)}\Big)\right) \qquad (8)$$

where $Y$ denotes the true labels, $h_{m}$ is the $m$-th fusion weight, $A_{\mathrm{side}}^{(m)}$ is the activation of the $m$-th side output, $\sigma(\cdot)$ is the softmax function and $\ell$ is the cross-entropy loss. Based on the above, the objective function of the proposed network framework can be written as follows:

$$(W, w, h)^{*} = \arg\min\left(\mathcal{L}_{\mathrm{side}}(W, w) + \mathcal{L}_{\mathrm{fuse}}(W, w, h)\right) \qquad (9)$$

Finally, there are four classifiers in the proposed network structure. The network prediction is the average of all the classification results:

$$\hat{y} = \frac{1}{4}\left(\sum_{m=1}^{M} A_{\mathrm{side}}^{(m)} + A_{\mathrm{fuse}}\right) \qquad (10)$$

A sketch of this weighted fusion and averaging is given below.
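A minimal sketch of the weighted fusion layer of Eq. (8) and the averaging of Eq. (10); the all-ones weight initialization and the layer name are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

class WeightedFusion(layers.Layer):
    """Advanced fusion (Eq. (8)): a learnable weighted sum of the three
    side outputs, followed by its own softmax classifier."""
    def __init__(self, num_inputs=3, **kwargs):
        super().__init__(**kwargs)
        self.w = self.add_weight(name="fusion_weights", shape=(num_inputs,),
                                 initializer="ones", trainable=True)

    def call(self, side_outputs):
        stacked = tf.stack(side_outputs, axis=0)          # (3, batch, classes)
        weighted = tf.tensordot(self.w, stacked, axes=1)  # sum_m h_m * A^(m)
        return tf.nn.softmax(weighted)

def final_prediction(classifier_outputs):
    """Eq. (10): average the outputs of all four classifiers."""
    return tf.add_n(classifier_outputs) / float(len(classifier_outputs))
```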

Implementation Details: We follow some mainstream design details to improve performance. Before each convolution layer, a batch normalization layer and a ReLU layer are added; each convolution layer and fully connected layer is followed by a dropout layer, except for the last convolution layer in each convolution block (He et al., 2015b; Gao et al., 2016). Therefore, each layer in the proposed network becomes a cascaded combination BN-ReLU-Conv-Dropout. In order to speed up the optimization of the objective function and obtain a better approximate solution, we use the adaptive moment estimation algorithm (Adam) (Diederik & Jimmy, 2015) instead of classical stochastic gradient descent, as sketched below.
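The resulting per-layer cascade and optimizer choice might look as follows; the kernel size of 3 is an assumption, while the 0.8 discarding ratio is taken from Section 4.2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_unit(x, filters, drop_rate=0.8, kernel_size=3):
    """BN-ReLU-Conv-Dropout cascade used in place of a plain convolution.
    Per the implementation details, the last convolution of each block
    should skip the trailing Dropout."""
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    return layers.Dropout(drop_rate)(x)

optimizer = tf.keras.optimizers.Adam()  # Adam instead of classical SGD
```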

3.3 Deeply exploring the potential links within the phase information

Through the previous analysis, for the phase information in PolSAR images we should look for a better way to extract the latent connections. From the data point of view, the problem is how to extract features from data whose channels are correlated with one another. Notice that the classical convolution operation acts simultaneously on both the spatial and channel dimensions of the input, as shown in Fig. 8 (a). Formally, a group of $n$ classical convolutional filter banks takes input feature maps $x \in \mathbb{R}^{l \times l \times c}$, where $l$ is the side length of the image slice and $c$ denotes the number of channels, and outputs an $l \times l \times n$ feature map (convolution with zero padding). The number of kernel parameters for a single filter bank is $k \times k \times c$, where $k$ is the kernel size. The process of convolution can be expressed mathematically as

$$x_{j}^{(i)} = f\!\left(k_{j}^{(i)} * x^{(i-1)}\right) \qquad (11)$$

where $x^{(i-1)}$ is the input, $x_{j}^{(i)}$ and $k_{j}^{(i)}$ denote the output and the kernel matrix of the $j$-th filter bank of the $i$-th layer, $*$ denotes convolution, and $f(\cdot)$ represents the nonlinear activation function. More specifically, the value at spatial position $(p, q)$ can be expressed as:

$$x_{j}^{(i)}(p, q) = f\!\left(\sum_{t=1}^{c} \sum_{u,v} k_{j,t}^{(i)}(u, v)\; x_{t}^{(i-1)}(p+u,\, q+v)\right) \qquad (12)$$
Fig. 8: General comparison of classical convolution (a) and depthwise separable convolution (b). The sets of connections are color-coded in the figure to indicate different groups of convolutions. It can be seen that the classical convolution operation (a) uses the same kernel to filter both the spatial and channel dimensions of the input, while depthwise separable convolution (b) uses two different sets of kernels to operate on the spatial and channel dimensions, respectively.

Depthwise separable convolution is based on the hypothesis that, for data whose channels are closely related, separating the convolution operation over the spatial and channel dimensions may yield better results. A complete traditional convolution can thus be divided into two operations: depthwise convolution and pointwise convolution. For $c$-channel input feature maps, $n$ classical filter banks with $n \times k \times k \times c$ parameters are replaced by $c$ depthwise convolution kernels with $k \times k \times c$ parameters and $n$ pointwise convolution kernels with $n \times c$ parameters. The depthwise convolution uses $c$ kernels of size $k \times k$ to filter each channel of the input spatially, obtaining $c$ output feature maps; the pointwise convolution then uses $n$ kernels of size $1 \times 1 \times c$ to fuse the channel information of the depthwise output and obtains $n$ output feature maps. The difference between depthwise separable convolution and classical convolution can be seen in Fig. 8. To compare with (12), a mathematical expression of depthwise separable convolution is as follows:

$$\hat{x}_{t}(p, q) = f\!\left(\sum_{u,v} \hat{k}_{t}(u, v)\; x_{t}(p+u,\, q+v)\right) \qquad (13)$$

and

$$y_{j}(p, q) = f\!\left(\sum_{t=1}^{c} \tilde{k}_{j}(t)\; \hat{x}_{t}(p, q)\right) \qquad (14)$$

where $\hat{k}_{t}$ and $\hat{x}_{t}$ respectively denote the kernel matrix and output maps of the depthwise convolution, $\tilde{k}_{j}$ and $y_{j}$ denote those of the pointwise convolution, and $x$ represents the input feature maps.

To deeply explore the latent links within the phase information, we use a depthwise separable convolution based structure to replace the former phase stream. Since the amplitude information of a PolSAR image does not differ significantly from that of an optical image, the network structure of the amplitude stream is preserved. This replacement not only reduces network parameters, but also extracts the latent connections in the phase information more effectively, as the following sketch illustrates.
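Concretely, the swap can be expressed with Keras's SeparableConv2D, which chains the depthwise and pointwise operations of Eqs. (13)-(14); the filter count shown is illustrative, and the depth multiplier of 1 follows Section 4.2.

```python
from tensorflow.keras import layers

# DMCNN phase stream: each classical convolution in the phase branch,
#     layers.Conv2D(f, 3, padding="same", activation="relu")
# is swapped for a depthwise separable one (Eqs. (13)-(14)); the amplitude
# branch keeps classical convolutions.
separable = layers.SeparableConv2D(
    filters=64, kernel_size=3, padding="same",
    depth_multiplier=1, activation="relu")

# Parameter-count intuition for k = 3, c = 32 input channels, n = 64 filters:
#   classical:  3 * 3 * 32 * 64      = 18,432 weights
#   separable:  3 * 3 * 32 + 32 * 64 =  2,336 weights
```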

4 Experiments

In this section, we verify the effectiveness of the proposed PolSAR image classification framework. Some mature CNN based classification algorithms are used for comparison. The experimental environment is a PC with an Intel Core i7-7700 CPU, an Nvidia GTX-1060 GPU (6 GB memory) and 16 GB RAM. The TensorFlow deep learning framework (Abadi et al., 2016) is selected to minimize the difficulty of implementation.

4.1 Dataset description

We evaluate the proposed methods on three benchmark PolSAR image datasets: AIRSAR Flevoland, ESAR Oberpfaffenhofen and EMISAR Foulum, which are commonly used in PolSAR image classification tasks. Here we describe the details of these images.

AIRSAR Flevoland: As shown in Fig. 9 (a), an L-band, full polarimetric image of an agricultural region of the Netherlands was obtained by the NASA/Jet Propulsion Laboratory AIRSAR. The image is 750 × 1024 pixels with a spatial resolution of 6.6 m × 12.1 m. There are 15 kinds of ground-truth objects in Fig. 9 (a): building, stembeans, rapeseed, beet, bare soil, forest, potatoes, peas, lucerne, barley, grasses, water, wheat one, wheat two and wheat three. Because labeled samples of the category building are rare, we randomly select 15 of them and expand them to 80 samples using data augmentation; for every other category, we randomly select 300 samples as the training set.

Fig. 9: Pauli image of (a) AIRSAR Flevoland dataset, (b) ESAR Oberpfaffenhofen dataset, (c) EMISAR Foulum dataset

ESAR Oberpfaffenhofen: An L-band, full polarimetric image of Oberpfaffenhofen, Germany, with a scene size of 1200 × 1300, was obtained by the ESAR airborne platform. Its Pauli color-coded image can be seen in Fig. 9 (b). Apart from some unknown regions, each pixel in the map belongs to one of three categories: built-up areas, wood land and open areas. For each category, 600 labeled samples are randomly selected as the training set.

EMISAR Foulum: The last full polarimetric image used in this experiment is an L-band image taken by EMISAR over Foulum, Denmark. EMISAR is a fully polarimetric airborne SAR operating in the L and C bands, acquired and studied mainly by the Danish Center for Remote Sensing (DCRS). Fig. 9 (c) shows its Pauli RGB image. Five types of ground objects are marked in the image: water, rye, oats, winter wheat and coniferous. For each category, 200 labeled samples are randomly selected as the training set.

Dataset Training num Testing num
AIRSAR Flevoland 4280 66903
ESAR Oberpfaffenhofen 1800 113282
EMISAR Foulum 1000 59905
Table 1: Details of the three datasets used in the experiments.

4.2 Experimental settings

To validate the significance of the proposed PolSAR image classification framework, a classical CNN model (Zhou et al., 2017) and a VGG-style CNN model (Simonyan & Zisserman, 2014b) are chosen for comparison. To verify the validity of using amplitude and phase information of the PolSAR image instead of the real and imaginary parts as input, experiments are performed with the different inputs. For ease of description, the CNN and VGG-style models with real and imaginary parts as input are abbreviated CNN-v1 and VGG-v1, while the corresponding models using amplitude and phase as input are abbreviated CNN-v2 and VGG-v2. The proposed multi-task CNN model and the depthwise separable convolution based multi-task model both use amplitude and phase as input, and are denoted MCNN and DMCNN. During training and testing of all the networks, the PolSAR images are sliced into fixed-size patches with a fixed sliding-window stride, as sketched below.
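A sketch of this slicing step; since the paper's patch size and stride were not recovered from the text, the values below are placeholders.

```python
import numpy as np

def slice_patches(image, labels, patch=12, stride=1):
    """Slide a patch x patch window over the (H, W, C) feature image and
    keep the centre-pixel label of each patch. `patch` and `stride` are
    placeholder values; `labels` is assumed to be an integer map in which
    0 marks unlabeled pixels."""
    H, W, _ = image.shape
    xs, ys = [], []
    for r in range(0, H - patch + 1, stride):
        for c in range(0, W - patch + 1, stride):
            lab = labels[r + patch // 2, c + patch // 2]
            if lab > 0:
                xs.append(image[r:r + patch, c:c + patch])
                ys.append(lab - 1)
    return np.asarray(xs, np.float32), np.asarray(ys, np.int64)
```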

In the experiments, all convolution layers share the same kernel size, and the dropout rate is 0.5 for fully connected layers. The number of training epochs is 20 and the learning rate is 0.9. For the experiments on AIRSAR Flevoland and ESAR Oberpfaffenhofen, the discarding ratio of the dropout layers after the convolution layers is 0.8, and the numbers of convolution kernels are set to 32, 64 and 64 for the convolution blocks in each individual stream. For the densely connected fusion block, the growth rate is set to 16 and its first convolution layer outputs four times the growth rate in feature maps. For the experiment on EMISAR Foulum, no dropout is applied after the convolution layers and the above parameters are reduced accordingly: the numbers of convolution kernels are set to 12, 24 and 24 for the convolution blocks in each individual stream, and the parameters of the fusion block are changed to 12 and 2. The depth multiplier of the depthwise separable convolutions is set to 1.

To evaluate the performance of the algorithms mentioned in this paper, the average accuracy (AA), overall accuracy (OA), kappa coefficient (Kappa) and F1-score are chosen as criteria, which can be defined as follows (Xu et al., 2018):

$$\mathrm{AA} = \frac{1}{C} \sum_{i=1}^{C} \frac{N_{i}^{c}}{N_{i}} \qquad (15)$$

where $C$ is the number of classes, and $N_{i}^{c}$ and $N_{i}$ denote the number of correctly classified samples and the total number of samples of the $i$-th category, respectively.

$$\mathrm{OA} = \frac{1}{N} \sum_{i=1}^{C} M_{ii} \qquad (16)$$

where $N$ is the number of testing samples and $M$ denotes the classification confusion matrix. The F1-score of multi-class classification is calculated by computing the F1-score of each class against all other classes; the $i$-th category's F1-score can be obtained as follows:

$$F1_{i} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FN} + \mathrm{FP}} \qquad (17)$$

where TP represents the number of correctly classified positive samples, and FN and FP respectively represent the number of samples of the $i$-th class mistaken for other classes and the number of samples of other classes wrongly predicted as the $i$-th class. The final F1-score can be calculated by

$$F1 = \frac{1}{C} \sum_{i=1}^{C} F1_{i} \qquad (18)$$

A sketch of these criteria computed from a confusion matrix is given below.
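The four criteria can be computed from the confusion matrix as sketched here; this is our reading of Eqs. (15)-(18) plus the standard kappa definition, which the text cites but does not spell out.

```python
import numpy as np

def evaluate(conf):
    """AA, OA, kappa and F1 from a C x C confusion matrix `conf`
    (rows: true class, columns: predicted class)."""
    conf = conf.astype(np.float64)
    N = conf.sum()
    per_class = np.diag(conf) / conf.sum(axis=1)       # per-class accuracy
    AA = per_class.mean()                               # Eq. (15)
    OA = np.trace(conf) / N                             # Eq. (16)
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / N**2
    kappa = (OA - pe) / (1 - pe)                        # chance-corrected OA
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp                          # wrongly predicted as i
    fn = conf.sum(axis=1) - tp                          # class i mistaken for others
    f1 = (2 * tp / (2 * tp + fp + fn)).mean()           # Eqs. (17)-(18)
    return AA, OA, kappa, f1
```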

4.3 Results comparison

Based on the stated experimental settings, we conduct experiments on the benchmark datasets and obtain the following results and analysis.

  1. Result on AIRSAR Flevoland: As shown in Table 2 and Fig. 10, the proposed DMCNN and MCNN achieve the best and the second best results, respectively.

    Method C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 AA OA
    CNN-v1 100.00 84.62 99.91 100.00 100.00 99.96 96.22 99.47 99.39 93.75 99.78 93.95 97.14 97.43 93.72 97.02 96.13
    VGG-v1 100.00 94.87 100.00 100.00 98.81 100.00 99.18 98.12 100.00 100.00 97.37 75.78 90.66 94.51 99.91 96.61 93.61
    CNN-v2 100.00 96.14 100.00 100.00 98.47 99.93 99.20 99.53 100.00 97.58 67.27 98.22 87.55 89.49 91.49 94.99 92.73
    VGG-v2 100.00 98.65 100.00 100.00 65.60 99.96 99.79 96.17 100.00 100.00 99.90 98.92 98.37 98.83 97.12 96.89 96.52
    MCNN 100.00 100.00 100.00 100.00 100.00 95.34 99.77 100.00 100.00 100.00 98.15 98.96 92.69 99.13 99.92 98.93 98.09
    DMCNN 100.00 99.92 100.00 100.00 100.00 99.58 100.00 99.91 100.00 100.00 99.98 100.00 93.67 100.00 99.69 99.52 98.77
    Table 2: Comparison of experimental results on the AIRSAR Flevoland dataset. For convenience, C1 to C15 refer to the fifteen categories listed in Section 4.1: building, stembeans, rapeseed, beet, bare soil, forest, potatoes, peas, lucerne, barley, grasses, water, wheat one, wheat two and wheat three. AA means average accuracy and OA means overall accuracy.
    Fig. 10: Classification maps of different methods on AIRSAR Flevoland PolSAR image. (a) Ground truth. (b) CNN-v1. (c) VGG-v1. (d) CNN-v2. (e) VGG-v2. (f) MCNN. (g) DMCNN.

    This proves that the proposed methods can improve the accuracy of PolSAR image classification on this dataset. From the experimental results, it can be seen that the special networks designed according to the characteristics of PolSAR data perform better than the networks commonly used for optical images, which confirms the importance of designing special network structures for PolSAR image classification tasks. Further, the classification accuracy of the depthwise separable convolution based MCNN model is higher still. This shows that there is a latent correlation among the phase channels of PolSAR images, and that this correlation can be used to distinguish ground objects more accurately.

  2. Result on ESAR Oberpfaffenhofen: Table 3 shows the experimental results of each algorithm on the Oberpfaffenhofen dataset.

    Method C1 C2 C3 AA OA
    CNN-v1 82.09 91.17 99.09 90.78 89.71
    VGG-v1 63.98 99.87 97.77 87.21 84.85
    CNN-v2 90.44 96.79 96.87 95.03 94.26
    VGG-v2 85.32 99.17 99.48 94.66 93.69
    MCNN 92.99 99.02 99.98 97.33 96.86
    DMCNN 95.23 99.93 99.98 98.38 98.06
    Table 3: Comparison of experimental results on the ESAR Oberpfaffenhofen dataset. For convenience, C1 to C3 refer to the categories: built-up areas, wood land and open areas.
    Fig. 11: Classification maps of different methods on ESAR Oberpfaffenhofen PolSAR image. (a) Ground truth. (b) CNN-v1. (c) VGG-v1. (d) CNN-v2. (e) VGG-v2. (f) MCNN. (g) DMCNN.

    It can be seen that for the same model (CNN or VGG), the classification results using amplitude and phase information as input are better than those using the real and imaginary parts of the PolSAR data. The change of input form improves the classification accuracy of the same model for almost every category. This phenomenon shows that even when the network is still a traditional structure, the change of input form can retain more information. Besides, under the new input form, the accuracy of the proposed methods improves significantly over the traditional models, which is consistent with the theoretical guidance of representation learning. The experimental results are also presented as classification maps in Fig. 11, from which the same conclusions can be drawn as from the numerical results.

  3. Result on EMISAR Foulum: The comparative results of the experiments on the Foulum data can be seen in Table 4. The proposed methods still achieve good results. Unlike the previous experiments, for the CNN and VGG models widely used on optical images, the change of input form results in a decrease in accuracy. This shows that an existing network model is not suitable for all forms of input data, and that it is necessary to adapt the network structure to different inputs.

    Method C1 C2 C3 C4 C5 AA OA
    CNN-v1 85.16 89.82 100.00 99.37 99.64 94.80 94.68
    VGG-v1 78.60 98.75 100.00 100.00 99.45 95.36 93.51
    CNN-v2 83.68 66.09 96.05 99.69 99.99 89.10 92.27
    VGG-v2 70.85 69.79 96.50 98.28 99.89 87.06 88.91
    MCNN 98.98 99.92 100.00 100.00 100.00 99.78 99.71
    DMCNN 99.84 98.79 99.32 100.00 100.00 99.59 99.83
    Table 4: Comparison of experimental results on the EMISAR Foulum dataset. For convenience, C1 to C5 refer to the five categories: water, rye, oats, winter wheat and coniferous.

A summary of the above experimental results is given below:

  • Using the amplitude and phase information of the PolSAR image as the input of the deep neural network indeed promotes accuracy. We consider that, because of the significant difference between PolSAR and optical images, following the input pattern of optical images loses the unique information of the PolSAR image, whereas presenting the original information in the form of amplitude and phase preserves it better.

  • For the models using amplitude and phase input, the traditional CNN models (proposed for optical image classification) are not as effective as the proposed PolSAR-specific models. The reason is that traditional CNNs extract features only from one angle because of their optical-image background, whereas the proposed framework can treat the different types of information in the input data differently and pertinently. This also reflects the importance of adjusting the network structure according to the input data when applying deep representation learning methods to PolSAR image classification tasks.

  • There is latent information among the phase channels of a PolSAR image that is helpful for recognizing ground targets, and how to excavate this latent connection is a real problem to be considered during modeling. The DMCNN model, which takes this viewpoint into account, achieved the best results in the experiments. This shows that the depthwise separable convolution based DMCNN model does adapt better to input data whose channels have certain connections.

4.4 Contrast within the framework

In this subsection, we test the components of the proposed framework to prove the validity of the proposed models. The difference between the proposed PolSAR classification framework and traditional methods is significant, and is mainly embodied in the following aspects: the densely connected early feature-fusion layer, the addition of side classifiers, and the existence of the advanced fusion layer. To observe whether the combination of these tricks is effective, we establish the following models step by step and compare them with the proposed model. For convenience, the models are denoted M1 to M6 in order of appearance.

Two-stream model (M1): A two-stream CNN model (Simonyan & Zisserman, 2014a) dealing with the amplitude and phase data separately is constructed to verify the value of the feature fusion layer. The two branches of the network share parameters. The softmax classifier at the end of each branch gives the classification probability obtained from the amplitude and phase information respectively, and the average of the two probability distributions is taken as the final result.

Convolutional fusion (M2): Considering that it is not advisable to process related information separately, we add a common convolution layer after the two branches of M1 to fuse the information obtained from the two branches and extract higher-level features (Feichtenhofer et al., 2016). The two classifiers of M1 are removed and replaced by a single softmax classifier after the convolutional fusion layer.

Densely connected fusion (M3): In order to preserve the acquired high-level features during information fusion, we use a densely connected convolution block (Gao et al., 2016) for the information fusion of the two-stream network instead of a single classical convolution layer.

Adding extra classifiers (M4): Due to the deepening of the network structure, the two side classifiers of M1 are added back to the model to prevent gradient vanishing, and the outputs of the side classifiers are also included in the final decision (Lee et al., 2014; Szegedy et al., 2014).

Multi-task CNN (M5): This is the proposed MCNN model. Compared with M4, an advanced fusion layer commonly used in semantic segmentation is added to make better use of the existing information for comprehensive decision making.

Depthwise separable convolution based model (M6): Depthwise separable convolution is introduced to make full use of the unique phase information of the PolSAR image, as described extensively above.


Fig. 12: A hierarchical contrast between the components of the framework.

From the results in Fig. 12, it can be seen that the accuracy of models M1 to M6 increases steadily under the various criteria. More carefully, comparing the results of M1 to M3 shows that a model with a feature fusion layer has higher precision, and that densely connected feature fusion is stronger than classical convolutional fusion. The result of M4 is better than the previous three models, which shows that the addition of side classifiers benefits PolSAR image judgment. At the same time, M4 performs worse than M5, which shows that the addition of the advanced fusion layer improves classification accuracy. The comparison between M5 and M6 further confirms the earlier conclusion that the phase information can be better exploited to obtain more accurate classification results.

5 Conclusion

In this work, based on the need in representation learning for the network structure to adapt to the input data, we construct a special classification framework for PolSAR images. The construction of this framework is divided into two parts. First, the input form of the PolSAR image is changed to its amplitude and phase information instead of the real and imaginary parts of the complex-valued data. Second, for the amplitude and phase input, we propose a novel multi-task CNN structure to mine features better. This structure integrates the ideas of two-stream fusion networks and multi-task learning, and can treat the two different kinds of information in the input data differently. Further, in order to better mine the latent correlation within the unique phase information of PolSAR images, depthwise separable convolution is introduced into the multi-task CNN, and a depthwise separable convolution based multi-task CNN model is constructed to further improve the classification capacity. Experiments on benchmark datasets show that the proposed framework has clear advantages over the traditional input form and traditional optical image classification networks. As future directions, better input forms for PolSAR images, more carefully designed network structures and the application of low-shot learning to PolSAR are all issues we are considering.

Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (61401124, 61871158) and in part by Scientific Research Foundation for the Returned Overseas Scholars of Heilongjiang Province (LC2018029).

References

  • Abadi et al. (2016) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., & Zheng, X. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
  • Badrinarayanan et al. (2017) Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). Segnet: A deep convolutional encoder-decoder architecture for scene segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 2481–2495. doi:10.1109/TPAMI.2016.2644615.
  • Bi et al. (2018) Bi, H., Sun, J., & Xu, Z. (2018). A graph-based semisupervised deep learning model for PolSAR image classification. IEEE Transactions on Geoscience and Remote Sensing, PP, 1–17. doi:10.1109/TGRS.2018.2871504.
  • Chen & Tao (2018) Chen, S., & Tao, C. (2018). PolSAR image classification using polarimetric-feature-driven deep convolutional neural network. IEEE Geoscience and Remote Sensing Letters, 15, 627–631. doi:10.1109/LGRS.2018.2799877.
  • Chen et al. (2016) Chen, S., Wang, H., Feng, X., & Jin, Y. Q. (2016). Target classification using the deep convolutional networks for SAR images. IEEE Transactions on Geoscience and Remote Sensing, 54, 4806–4817. doi:10.1109/TGRS.2016.2551720.
  • Chollet (2016) Chollet, F. (2016). Xception: Deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition. doi:10.1109/CVPR.2017.195.
  • Cortes & Vapnik (1995) Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297. doi:10.1007/BF00994018.
  • De et al. (2018a) De, S., Bruzzone, L., Bhattacharya, A., Bovolo, F., & Chaudhuri, S. (2018a). A novel technique based on deep learning and a synthetic target database for classification of urban areas in PolSAR data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 11, 154–170. doi:10.1109/JSTARS.2017.2752282.
  • De et al. (2018b) De, S., Ratha, D., Ratha, D., Bhattacharya, A., & Chaudhuri, S. (2018b). Tensorization of multifrequency PolSAR data for classification using an autoencoder network. IEEE Geoscience and Remote Sensing Letters, 15, 542–546. doi:10.1109/LGRS.2018.2799875.
  • Diederik & Jimmy (2015) Diederik, K., & Jimmy, B. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.
  • Estrach et al. (2014) Estrach, J. B., Szlam, A., & LeCun, Y. (2014). Signal recovery from pooling representations. In Proceedings of the 31st International Conference on Machine Learning (pp. 307–315). volume 32.
  • Feichtenhofer et al. (2016) Feichtenhofer, C., Pinz, A., & Zisserman, A. (2016). Convolutional two-stream network fusion for video action recognition. In IEEE Conference on Computer Vision and Pattern Recognition. doi:10.1109/CVPR.2016.213.
  • Fu et al. (2017) Fu, J., Zheng, H., & Tao, M. (2017). Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In IEEE Conference on Computer Vision and Pattern Recognition. doi:10.1109/CVPR.2017.476.
  • Gao et al. (2016) Gao, H., Zhuang, L., & Weinberger, K. Q. (2016). Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition. doi:10.1109/CVPR.2017.243.
  • Glorot & Bengio (2010) Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics.
  • Graves et al. (2013) Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing. doi:10.1109/ICASSP.2013.6638947.
  • Guo et al. (2016) Guo, H., Wang, G., & Chen, X. (2016). Two-stream convolutional neural network for accurate RGB-D fingertip detection using depth and edge information. In IEEE International Conference on Image Processing. doi:10.1109/ICIP.2016.7532831.
  • Guo et al. (2017) Guo, H., Wu, D., & An, J. (2017). Discrimination of oil slicks and lookalikes in polarimetric SAR images using CNN. Sensors, 17, 1837–1856. doi:10.3390/s17081837.
  • He et al. (2017) He, K., Gkioxari, G., Dollar, P., & Girshick, R. (2017). Mask R-CNN. In IEEE International Conference on Computer Vision. doi:10.1109/ICCV.2017.322.
  • He et al. (2015a) He, K., Zhang, X., Ren, S., & Sun, J. (2015a). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision. doi:10.1109/ICCV.2015.123.
  • He et al. (2015b) He, K., Zhang, X., Ren, S., & Sun, J. (2015b). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition. doi:10.1109/CVPR.2016.90.
  • Hou et al. (2017) Hou, B., Kou, H., & Jiao, L. (2017). Classification of polarimetric SAR images using multilayer autoencoders and superpixels. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 9, 3072–3081. doi:10.1109/JSTARS.2016.2553104.
  • Hou et al. (2016) Hou, Q., Cheng, M. M., Hu, X. W., Borji, A., Tu, Z., & Torr, P. (2016). Deeply supervised salient object detection with short connections. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP, 1–1. doi:10.1109/TPAMI.2018.2815688.
  • Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
  • Hu et al. (2017) Hu, J., Mou, L., Schmitt, A., & Xiao, X. Z. (2017). FusioNet: A two-stream convolutional neural network for urban scene classification using PolSAR and hyperspectral data. In Joint Urban Remote Sensing Event. doi:10.1109/JURSE.2017.7924565.
  • Huang et al. (2006) Huang, G. B., Zhu, Q. Y., & Siew, C. K. (2006). Extreme learning machine: Theory and applications. Neurocomputing, 70, 489–501. doi:10.1016/j.neucom.2005.12.126.
  • Ioffe & Szegedy (2015) Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning.
  • Kaiming et al. (2014) Kaiming, H., Xiangyu, Z., Shaoqing, R., & Jian, S. (2014). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 1904–1916. doi:10.1007/978-3-319-10578-9_23.
  • Kim et al. (2016) Kim, H., Uh, Y., Ko, S., & Byun, H. (2016). Weighing classes and streams: Toward better methods for two-stream convolutional networks. Optical Engineering, 55, 053108. doi:10.1117/1.OE.55.5.053108.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. doi:10.1145/3065386.
  • Lecun et al. (2015) Lecun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444. doi:10.1038/nature14539.
  • Lecun et al. (1989) Lecun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551. doi:10.1162/neco.1989.1.4.541.
  • Lecun et al. (1998) Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278 – 2324. doi:10.1109/5.726791.
  • Lee et al. (2014) Lee, C. Y., Xie, S., Gallagher, P., Zhang, Z., & Tu, Z. (2014). Deeply-supervised nets. In Advances in Neural Information Processing Systems.
  • Lin et al. (2013) Lin, M., Chen, Q., & Yan, S. (2013). Network in network. arXiv preprint arXiv:1312.4400.
  • Liu et al. (2018) Liu, F., Jiao, L., Tang, X., Yang, S., Ma, W., & Hou, B. (2018). Local restricted convolutional neural network for change detection in polarimetric SAR images. IEEE Transactions on Neural Networks and Learning Systems, PP, 1–16. doi:10.1109/TNNLS.2018.2847309.
  • Liu et al. (2016) Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In European Conference on Computer Vision. doi:10.1007/978-3-319-46448-0_2.
  • Maas et al. (2013) Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning.
  • Maturana & Scherer (2015) Maturana, D., & Scherer, S. (2015). VoxNet: A 3D convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems. doi:10.1109/IROS.2015.7353481.
  • Mikolov et al. (2013) Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In International Conference on Learning Representations.
  • Nair & Hinton (2010) Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning.
  • Ning et al. (2014) Ning, Z., Donahue, J., Girshick, R., & Darrell, T. (2014). Part-based R-CNNs for fine-grained category detection. In European Conference on Computer Vision. doi:10.1007/978-3-319-10590-1_54.
  • Qiu et al. (2017) Qiu, Z., Yao, T., & Tao, M. (2017). Learning spatio-temporal representation with pseudo-3D residual networks. In IEEE International Conference on Computer Vision. doi:10.1109/ICCV.2017.590.
  • Redmon et al. (2015) Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2015). You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition. doi:10.1109/CVPR.2016.91.
  • Ren et al. (2017) Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 1137–1149. doi:10.1109/TPAMI.2016.2577031.
  • Ronneberger et al. (2015) Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention. doi:10.1007/978-3-319-24574-4_28.
  • Shelhamer et al. (2014) Shelhamer, E., Long, J., & Darrell, T. (2014). Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 640–651. doi:10.1109/TPAMI.2016.2572683.
  • Shuiwang et al. (2013) Shuiwang, J., Ming, Y., & Kai, Y. (2013). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 221–231. doi:10.1109/tpami.2012.59.
  • Simonyan & Zisserman (2014a) Simonyan, K., & Zisserman, A. (2014a). Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems.
  • Simonyan & Zisserman (2014b) Simonyan, K., & Zisserman, A. (2014b). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
  • Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.
  • Szegedy et al. (2014) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2014). Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition. doi:10.1109/CVPR.2015.7298594.
  • Tran et al. (2014) Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2014). Learning spatiotemporal features with 3D convolutional networks. In IEEE International Conference on Computer Vision. doi:10.1109/ICCV.2015.510.
  • Turk & Pentland (1991) Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71–86. doi:10.1162/jocn.1991.3.1.71.
  • Vapnik (1995) Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer New York. doi:10.1007/978-1-4757-2440-0.
  • Wang et al. (2018) Wang, L., Xu, X., Dong, H., Gui, R., & Pu, F. (2018). Multi-pixel simultaneous classification of PolSAR image using convolutional neural networks. Sensors, 18, 769–786. doi:10.3390/s18030769.
  • Wang et al. (2017) Wang, P., Zhang, H., & Patel, V. M. (2017). SAR image despeckling using a convolutional neural network. IEEE Signal Processing Letters, 24, 1763–1767. doi:10.1109/LSP.2017.2758203.
  • Wright et al. (2009) Wright, J., Yang, A. Y., Ganesh, A., Sastry, S. S., & Ma, Y. (2009). Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 210–227. doi:10.1109/TPAMI.2008.79.
  • Xie & Tu (2015) Xie, S., & Tu, Z. (2015). Holistically-nested edge detection. International Journal of Computer Vision, 125, 3–18.
  • Xu et al. (2015) Xu, B., Wang, N., Chen, T., & Li, M. (2015). Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853.
  • Xu et al. (2018) Xu, L., Jiao, L., Xu, T., Sun, Q., & Dan, Z. (2018). Polarimetric convolutional network for PolSAR image classification. IEEE Transactions on Geoscience and Remote Sensing, PP, 1–15. doi:10.1109/TGRS.2018.2879984.
  • Yan et al. (2018) Yan, W., Chu, H., Liu, X., & Liao, M. (2018). A hierarchical fully convolutional network integrated with sparse and low-rank subspace representations for PolSAR imagery classification. Remote Sensing, 10, 342–365. doi:10.3390/rs10020342.
  • Yu & Koltun (2015) Yu, F., & Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations.
  • Zeiler & Fergus (2013) Zeiler, M. D., & Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557.
  • Zhang et al. (2018) Zhang, L., Chen, Z., & Zou, B. (2018). Polarimetric SAR terrain classification using 3D convolutional neural network. In IEEE International Geoscience and Remote Sensing Symposium.
  • Zhang et al. (2016) Zhang, L., Ma, W., & Zhang, D. (2016). Stacked sparse autoencoder in PolSAR data classification using local spatial information. IEEE Geoscience and Remote Sensing Letters, 13, 1359–1363. doi:10.1109/LGRS.2016.2586109.
  • Zhang et al. (2017a) Zhang, T., Qi, G. J., Xiao, B., & Wang, J. (2017a). Interleaved group convolutions. In IEEE International Conference on Computer Vision. doi:10.1109/ICCV.2017.469.
  • Zhang et al. (2017b) Zhang, X., Zhou, X., Lin, M., & Sun, J. (2017b). ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083.
  • Zhang et al. (2017c) Zhang, Z., Wang, H., Xu, F., & Jin, Y. Q. (2017c). Complex-valued convolutional neural network and its application in polarimetric SAR image classification. IEEE Transactions on Geoscience and Remote Sensing, 55, 7177–7188. doi:10.1109/TGRS.2017.2743222.
  • Zhou et al. (2017) Zhou, Y., Wang, H., Xu, F., & Jin, Y. Q. (2017). Polarimetric SAR image classification using deep convolutional neural networks. IEEE Geoscience and Remote Sensing Letters, 13, 1935–1939. doi:10.1109/LGRS.2016.2618840.
  • Zou et al. (2017) Zou, B., Da, L., Zhang, L., & Moon, W. M. (2017). Independent and commutable target decomposition of PolSAR data using a mapping from SU(4) to SO(6). IEEE Transactions on Geoscience and Remote Sensing, 55, 3396–3407. doi:10.1109/TGRS.2017.2670261.