
PCLNet: A Practical Way for Unsupervised Deep PolSAR Representations and Few-Shot Classification

by   Lamei Zhang, et al.
Harbin Institute of Technology
NetEase, Inc

Deep learning and convolutional neural networks (CNNs) have made progress in polarimetric synthetic aperture radar (PolSAR) image classification over the past few years. However, a crucial issue has not been addressed: CNNs require abundant labeled samples, whereas human annotations of PolSAR images are scarce. It is well known that following the supervised learning paradigm may lead to overfitting of the training data, and the lack of supervision information for PolSAR images aggravates this problem, which greatly affects the generalization performance of CNN-based classifiers in large-scale applications. To handle this problem, this paper explores, for the first time, learning transferrable representations from unlabeled PolSAR data through convolutional architectures. Specifically, a PolSAR-tailored contrastive learning network (PCLNet) is proposed for unsupervised deep PolSAR representation learning and few-shot classification. Rather than directly reusing optical image processing methods, a diversity stimulation mechanism is constructed to narrow the application gap between optics and PolSAR. Beyond the conventional supervised methods, PCLNet develops an auxiliary pre-training phase based on the proxy objective of contrastive instance discrimination to learn useful representations from unlabeled PolSAR data. The acquired representations are transferred to the downstream task, i.e., few-shot PolSAR classification. Experiments on two widely used PolSAR benchmark datasets confirm the validity of PCLNet. In addition, this work may shed light on how to efficiently utilize the massive unlabeled PolSAR data to alleviate the heavy demand of CNN-based methods for human annotations.



I Introduction

Polarimetric synthetic aperture radar (PolSAR) image classification aims to assign a category to each pixel of the whole map. It has been a hot topic because of the powerful observation capacity of PolSAR systems. Many fields, such as agriculture [1], urban planning [2], geoscience [3], and environmental monitoring [4, 5], rely on the valuable information extracted by PolSAR classification. Therefore, the significance of a breakthrough in PolSAR classification is not limited to the task itself, but also lies in its broad application fields.

Deep learning, represented by convolutional neural networks (CNNs) [6], has made progress in many problems, e.g., optical [7, 8, 9], medical [10, 11, 12] and remote sensing [13, 14, 15] image recognition. Due to the impressive results achieved by CNNs, the mainstream feature extraction technique of PolSAR classification is currently shifting from unsupervised hand-crafted features with physical meanings [16, 17, 18, 19] to supervised deep ones obtained by neural networks. Zhou et al. first explored the application of CNNs in PolSAR image classification [20]. They constructed a four-layer convolutional architecture to process 6-D manually designed PolSAR representations, and the experiments showed breakthrough results. Recently, the nonlinear fitting ability of CNNs has attracted widespread attention, and various supervised CNN-based PolSAR studies are springing up. Some focused on finding suitable input information to boost the classification performance, such as manually selected [21] or auto-selected [22] polarimetric features, the source complex-valued PolSAR data [23] or improved versions of it [24, 25]. Besides, many studies concentrated on using advanced CNN models, such as fully convolutional [26], 3D convolution-based [27], sparse manifold-regularized [28], generative [29] and hyperparameter-optimized [30] architectures.

The recently developed supervised CNN-based methods have achieved promising results and improved PolSAR classification to some extent [31]. But this does not mean that unsupervised methods are no longer needed; on the contrary, they become more essential. The supervised machine learning paradigm implies that high recognition accuracy relies on a sufficiently large training set with human annotations [32], especially for deep CNNs with a large number of trainable parameters. The intrinsic reason may be that training on sparse labels tends to converge to a fragile and task-specific solution [33]. Although augmentation and regularization techniques [34, 35, 36] were explored, this requirement is still hard to meet even for optical images, which are easy to acquire and interpret, let alone for the more complex PolSAR systems. Insufficient supervision will cause the network to overfit the training data and thus generalize poorly in large-scale applications, which can be regarded as the most significant bottleneck hindering the usage of CNNs in PolSAR classification. Therefore, unsupervised CNNs, which combine the discrimination ability of CNNs with the feasibility of unsupervised methods for large-scale problems, are more desirable and meaningful than supervised ones.

This work falls in the area of unsupervised PolSAR representation learning [37]. Like supervised methods, unsupervised methods can be implemented by shallow models or deep neural networks. The former are widely used in the PolSAR area, including a variety of physical [38, 39] and statistical [40] features. The complexity of these methods is low, which brings fast running speed but also limits the performance. In contrast, neural networks for unsupervised learning are highly flexible and effective. The autoencoder is a representative technique, which explicitly defines a feature extraction mapping through the objective of image reconstruction [41]. The features learnt by encoders can be used through network fine-tuning [42, 43]. Although it is hard to evaluate the quality of such automated feature engineering, some recent studies showed that methods based on the reconstruction loss may struggle to learn high-level representations because they pay too much attention to pixel-level details.

Greatly inspired by its success in natural language processing [44], self-supervised learning (SSL) [45] provides a promising way for unsupervised representation learning, which follows the supervised learning paradigm, but the supervision is provided by the data itself. Therefore, free and abundant labeled samples are available for network training due to the automation of the pseudo-label generation process. Moreover, self-generated annotations can provide richer information, because man-made labels alone cannot indicate the potential connections between samples of related categories. Similar to some few-shot learning paradigms [46, 47, 48], SSL can be divided into two components, i.e., unsupervised pre-training and supervised fine-tuning. Designing effective pre-training methods to acquire transferrable representations is the key to the validity of SSL. Generally, the pre-training is performed with a proxy objective, called the pretext task, and the task of genuine interest (PolSAR classification in this paper) is called the downstream task. The construction of most pretext tasks is heuristic and predictive, such as predicting spatial correlations [49, 50] and colors [51]. Although they have achieved some results, these pretext tasks do not generalize well [52]. For example, it is meaningless to predict the spatiality for satellite images and the color for CT images. Recently, a flexible paradigm of SSL based on the pretext task of instance discrimination [53] and the InfoNCE loss function [54, 55, 56], i.e., contrastive learning (CL), has emerged and made a breakthrough [57]. The proposal of instance discrimination comes from the fact that the apparent similarity among semantic categories can be automatically discovered by neural networks. Therefore, the similarity among instances may also be captured, which can be seen as a high-level representation. The InfoNCE loss acts as a measure for maximizing the mutual information between instances [55].

Fig. 1: An intuitive comparison about the diversity of individuals between optical images and PolSAR images.

Considering the appealing properties of CL, we aim at combining it with PolSAR images for high-precision few-shot classification [58]. However, it must be pointed out that all existing CL methods are proposed for optical image processing. Although the generality is intrinsic, the application gap between optics and PolSAR cannot be ignored. The authors believe that CL should be transformed into PolSAR-tailored methods to obtain satisfactory results, rather than following the original blindly. We maintain that the key factor that affects the performance of CL in PolSAR representation learning is not the data modality, but diversity. The transferrable representations of CL are learnt by distinguishing the differences between individuals. In other words, if the samples are highly similar to each other, the performance will be greatly reduced. As shown in Fig. 1, for optical images, the difference is relatively large whether it is between different categories (inter-class) or between different instances of the same category (intra-class). But it is another story for PolSAR images. For a pair of optical and PolSAR images of the same size, the extents of the corresponding real scenes may differ by a factor of hundreds. This phenomenon is reflected in PolSAR data by the following two characteristics: fewer categories and lower intra-class diversity. It brings a great challenge to the optimization of CL, i.e., a large number of samples from the same category are inevitably selected during random sampling, and they are difficult to distinguish from each other.

Based on the above analysis and inspired by previous works, a PolSAR-tailored contrastive learning network (PCLNet) is proposed in this paper. The proposal effectively combines the unsupervised CL methods with PolSAR representation learning and classification. The main novelties and contributions can be summarized as follows:

  1. Unsupervised deep PolSAR representation learning and few-shot classification are explored with the help of CL for the first time. Specifically, an unsupervised pre-training method is designed to learn transferrable representations without human annotations. The acquired representations are transferred with very little supervision to achieve few-shot PolSAR classification. In this way, we construct a practical way to utilize the massive unlabeled PolSAR data and improve the applicability of CNN-based methods to large-scale problems.

  2. A novel diversity stimulation mechanism is proposed and combined with the CL method, which narrows the application gap between optics and PolSAR. The diversity of training samples is stimulated in two steps. First, unsupervised clustering based on the revised Wishart distance [19] is used to perform an overcomplete partition of the dataset and construct numerous categories. Then, a fully connected graph is constructed for each category, and the nodes with high affinities are removed. The diversified training data acquired by this dual-stimulating mechanism can be seen as the key factor that makes CL methods adapt to few-shot PolSAR classification.

  3. Experiments are carried out on two widely used PolSAR benchmark datasets. The experimental results demonstrate the validity of the proposed method for both few-shot and fully supervised PolSAR classification.

The rest of this paper is organized as follows: The proposed PCLNet is introduced in Section II. Experimental results and analyses are presented in Section III. Section IV concludes this work and gives possible future directions.

II Proposed Method

In this section, the proposed PCLNet for deep PolSAR representations and few-shot classification is presented. The proposal is a variant of CL, which is customized for PolSAR images. There are three steps to be implemented to train the PCLNet, as shown in Fig. 2.

Fig. 2: General flow chart of the training of PCLNet.

Since there is no benchmark for CL, we first need to construct a dataset for unsupervised learning. It is worth noting that manual labeling is unnecessary in this process, which supports the use of massive unlabeled PolSAR data. The acquired dataset is then used for unsupervised network pre-training. Finally, a new round of supervised training is performed on the basis of the result of unsupervised pre-training. In fact, the training of traditional CNN-based PolSAR classifiers can be regarded as doing the third step from scratch. The following is a detailed description of each step.

II-A Dataset Collection

The construction of the training dataset is the first problem to be solved. Random sampling is a natural choice used by almost all CL methods, i.e., a certain number of samples are randomly selected from a supervised benchmark dataset. However, it is another story for PolSAR images. First, the difference between the data obtained by various PolSAR systems is relatively large. Therefore, a generally applicable dataset construction method is more desirable than specific datasets. Moreover, the premise of using random sampling is that certain differences exist between individuals, which cannot be satisfied by PolSAR images, as illustrated in Fig. 1. To address these issues, a diversity stimulation mechanism is designed as a general means to obtain a dataset with a high degree of diversity for the training of CL. This is a dual mechanism, which is realized by successively stimulating the inter-class and intra-class diversities.

II-A1 Stimulation for Inter-Class Diversity

A widely used clustering method, i.e., the unsupervised Wishart classifier, is adopted to perform a preliminary overcomplete partition of PolSAR images. The Wishart classifier is based on central grouping techniques and inherits many attractive properties of the well-known K-means algorithm.

According to the basic operating principle of PolSAR [38], the complex Sinclair scattering matrix is usually utilized to represent the amplitude and phase information of the transmitted and received backscattered signals. In a dynamically changing environment, numerous distributed targets can be analyzed by the polarimetric coherency matrix $\mathbf{T}$, which follows the complex Wishart distribution:

$$\mathbf{T} = \frac{1}{L}\sum_{i=1}^{L} \mathbf{k}_i \mathbf{k}_i^{H} \qquad (1)$$

where $(\cdot)^{H}$ denotes the complex Hermitian transpose, $L$ is the number of looks and $\mathbf{k}_i$ is the Pauli scattering target vector. Based on the matrix Wishart distance, Lee et al. [18] introduced the unsupervised Wishart classifier to assign each pixel of coherency matrix $\mathbf{T}$ to a cluster prototype $\mathbf{V}_m$, $m = 1, \dots, M$, where $M$ is the number of clusters. For example, if a pixel belongs to class $m$, then

$$d(\mathbf{T}, \mathbf{V}_m) \le d(\mathbf{T}, \mathbf{V}_j), \quad \forall j \ne m \qquad (2)$$

Considering that the revised Wishart distance [19] satisfies the identity of indiscernibles and the symmetry condition, i.e., $d_{RW}(\mathbf{T}, \mathbf{T}) = 0$ and $d_{RW}(\mathbf{T}, \mathbf{V}_m) = d_{RW}(\mathbf{V}_m, \mathbf{T})$, it is used to measure the pair-wise distance between samples and cluster prototypes:

$$d_{RW}(\mathbf{T}, \mathbf{V}_m) = \frac{1}{2}\mathrm{Tr}\left(\mathbf{T}^{-1}\mathbf{V}_m + \mathbf{V}_m^{-1}\mathbf{T}\right) - q \qquad (3)$$

where $\mathrm{Tr}(\cdot)$ is the trace of a matrix and $q$ denotes the dimension of the coherency matrix.
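The revised Wishart distance is commonly written as half the trace of $\mathbf{A}^{-1}\mathbf{B} + \mathbf{B}^{-1}\mathbf{A}$ minus the matrix dimension. The following is a minimal NumPy sketch under that assumption; the function names and the random multilook generator are illustrative and not taken from the paper.

```python
import numpy as np

def revised_wishart_distance(t_a, t_b):
    """Assumed form: 0.5 * Tr(A^-1 B + B^-1 A) - q for Hermitian positive-definite A, B."""
    q = t_a.shape[0]
    # solve(A, B) evaluates A^-1 B without forming the inverse explicitly.
    term = np.linalg.solve(t_a, t_b) + np.linalg.solve(t_b, t_a)
    return 0.5 * np.trace(term).real - q

def random_coherency(rng, q=3, looks=8):
    """Hypothetical test helper: average of `looks` outer products k k^H."""
    k = rng.normal(size=(looks, q)) + 1j * rng.normal(size=(looks, q))
    return (k[:, :, None] * k[:, None, :].conj()).sum(axis=0) / looks

rng = np.random.default_rng(0)
a, b = random_coherency(rng), random_coherency(rng)
print(revised_wishart_distance(a, b), revised_wishart_distance(b, a))
```

The two printed values agree up to rounding, which reflects the symmetry property that motivates the choice of this distance.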

Fig. 3: Illustration of the unsupervised clustering for inter-class diversity stimulation. The unlabeled PolSAR data are overcompletely partitioned according to their similarities between each other without human annotation.

As shown in Fig. 3, unlabeled PolSAR image samples can be clustered by the revised Wishart distance based unsupervised Wishart classifier. In our setting, the result of clustering should be overcomplete, which means that the number of clusters is unrealistically large. The purpose is to constrain the diversity of clustering prototypes deliberately, so as to stimulate the inter-class diversity.
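The clustering step can be sketched as a K-means-style loop that swaps the Euclidean distance for the revised Wishart distance. This is only an illustration: the random prototype initialization, the mean-based prototype update and all names below are assumptions rather than the authors' implementation.

```python
import numpy as np

def rw_distance(t, v):
    # Assumed revised Wishart distance: 0.5 * Tr(T^-1 V + V^-1 T) - q.
    q = t.shape[-1]
    return 0.5 * np.trace(np.linalg.solve(t, v) + np.linalg.solve(v, t)).real - q

def wishart_cluster(samples, n_clusters, n_iter=10, seed=0):
    """samples: array of shape (n, q, q) holding coherency matrices."""
    rng = np.random.default_rng(seed)
    # Assumption: prototypes start from randomly chosen samples.
    idx = rng.choice(len(samples), size=n_clusters, replace=False)
    prototypes = samples[idx].copy()
    labels = np.zeros(len(samples), dtype=int)
    for _ in range(n_iter):
        dist = np.array([[rw_distance(t, v) for v in prototypes] for t in samples])
        labels = dist.argmin(axis=1)          # nearest-prototype assignment
        for m in range(n_clusters):
            members = samples[labels == m]
            if len(members) > 0:              # keep the old prototype if a cluster is empty
                prototypes[m] = members.mean(axis=0)
    return labels, prototypes

# Toy data: scaled identity matrices forming two obvious groups.
samples = np.stack([s * np.eye(3, dtype=complex) for s in (1.0, 1.1, 10.0, 11.0)])
labels, _ = wishart_cluster(samples, n_clusters=2)
print(labels)
```

For an overcomplete partition, `n_clusters` would simply be set much larger than the expected number of land-cover categories.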

II-A2 Stimulation for Intra-Class Diversity

Following the result of unsupervised clustering, the training dataset can be collected through intra-class screening, which maintains relatively large diversity between different instances in the same cluster. For the $m$th cluster with the samples $\{\mathbf{T}_i^{m}\}$, $i = 1, \dots, N_m$, an undirected fully connected graph $\mathcal{G}_m$ can be constructed according to the spectral graph theory [19]. Meanwhile, the pair-wise similarity of the $m$th graph can be represented by its edges $\mathcal{E}_m$. We employ affinity to measure the pair-wise similarity between two instances. Hence, we can flexibly collect the training dataset by cutting each graph.

The affinity is deduced by a revised Wishart distance based Gaussian kernel function in this work. Then the fully connected graph of the $m$th cluster can be obtained by calculating the affinity between every two instances, as follows:

$$a_{ij}^{m} = \exp\left(-\frac{d_{RW}^{2}(\mathbf{T}_i^{m}, \mathbf{T}_j^{m})}{2\sigma^{2}}\right) \qquad (4)$$

where $a_{ij}^{m}$ denotes the affinity between the instances $\mathbf{T}_i^{m}$ and $\mathbf{T}_j^{m}$ of the $m$th cluster, $d_{RW}(\cdot, \cdot)$ is the revised Wishart distance and $\sigma$ is the Gaussian kernel bandwidth. Obviously, the fully connected graph is expressed as a symmetric positive semidefinite matrix of size $N_m \times N_m$ whose diagonal elements satisfy $a_{ii}^{m} = 1$. Therefore, only the upper triangular elements need to be calculated.

Fig. 4: An example for intra-class diversity stimulation. We assume that there are five samples (left) in the cluster, and the number of remaining samples is set as three (right). The fully connected graph is visualized, where the circles in different colors represent samples and the value on each line represents the affinity. The edge with the highest affinity is marked in brownish red. The red dotted line indicates that the sample is directly removed.

As shown in Fig. 4, intra-class diversity stimulation of each cluster is implemented by cutting the nodes (samples) of the corresponding fully connected graph. First, all of the upper triangular elements in the fully connected graph are sorted, and the edge with the largest affinity is located. Next, one of the two nodes connected by this edge is directly removed. This process is carried out iteratively until the number of samples reaches a pre-defined threshold. Finally, the instances corresponding to the remaining nodes are collected to form the dataset for the training of CL. The process of collecting a dataset by the proposed diversity stimulation mechanism is outlined in Algorithm 1.

1:  Begin
2:  Prepare PolSAR image samples in the form of the polarimetric coherency matrix $\mathbf{T}$.
    Inter-class diversity stimulation
3:  Prepare the number of clusters $M$, the maximum number of iterations and the hyperparameter. Initialize the cluster prototypes $\mathbf{V}_m$, $m = 1, \dots, M$.
4:  for each iteration up to the maximum do
5:     Measure the revised Wishart distance between samples and cluster prototypes by (3).
6:     Assign each sample to the corresponding cluster by (2).
7:     Recalculate the cluster prototypes $\mathbf{V}_m$.
8:  end for
9:  return the clustered sample sets, one for each cluster $m = 1, \dots, M$.
    Intra-class diversity stimulation
10:  Prepare the threshold on the number of retained samples and the hyperparameter $\sigma$.
11:  for each cluster $m = 1, \dots, M$ do
12:     Construct the fully connected graph by (4) for all samples in the cluster.
13:     while the number of samples in the cluster exceeds the threshold do
14:        Locate the edge with the largest affinity and randomly remove one of the two samples it connects from the graph.
15:     end while
16:     Add the retained samples in the cluster to the dataset.
17:  end for
18:  return the dataset
Algorithm 1 Diversity Stimulation Mechanism for Dataset Collection
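The intra-class screening loop can be sketched in a few lines of NumPy. The Gaussian kernel on the squared distance, the random tie-breaking between the two end nodes and the function names are assumptions made for illustration; a plain absolute difference stands in for the revised Wishart distance in the toy data.

```python
import numpy as np

def gaussian_affinity(dist, sigma=1.0):
    # Assumed kernel form exp(-d^2 / (2 sigma^2)); the diagonal is 1 because d(i, i) = 0.
    return np.exp(-(dist ** 2) / (2.0 * sigma ** 2))

def screen_cluster(affinity, n_keep, seed=0):
    """Repeatedly drop one end node of the highest-affinity edge until n_keep nodes remain."""
    rng = np.random.default_rng(seed)
    alive = list(range(affinity.shape[0]))
    while len(alive) > max(n_keep, 1):
        sub = affinity[np.ix_(alive, alive)]
        rows, cols = np.triu_indices(len(alive), k=1)   # upper triangle only
        top = int(np.argmax(sub[rows, cols]))
        pair = (alive[rows[top]], alive[cols[top]])
        alive.remove(pair[int(rng.integers(2))])        # random choice between the two nodes
    return alive

# Toy cluster of five samples placed on a line.
x = np.array([0.0, 0.1, 5.0, 5.1, 10.0])
aff = gaussian_affinity(np.abs(x[:, None] - x[None, :]))
print(screen_cluster(aff, n_keep=3))
```

In this toy case the two near-duplicate pairs each lose one member, so the three retained samples are well separated.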

II-B Unsupervised Pre-training

Inspired by some related works [57, 52, 53, 56, 55, 54], a CL based unsupervised pre-training method is designed in this section. The novelty is that the training can be implemented without human annotation, which makes it possible to use massive unlabeled PolSAR images. Moreover, transferrable deep PolSAR representations can be acquired by the pre-trained network, which are the basis for achieving few-shot classification. The following points need to be considered in the construction of unsupervised pre-training: the pretext task and loss function, the architecture of the encoder, and its optimization.

II-B1 Pretext Task and Loss Function

Generally speaking, high-level representations work better when transferring to other tasks because they are more abstract than low-level ones. It has been shown that supervised training is inefficient and tends to converge to a fragile and task-specific solution [33, 60]. In other words, although the representations obtained by supervised CNNs are higher-order than hand-crafted ones, they are still not enough to achieve task transfer.

To address this issue, the objective of supervised learning should be improved and we need to construct the corresponding loss function. The training of supervised learning is based on class discrimination, so it is necessary to provide category information manually. In this work, instance discrimination [53] is adopted, which takes class-wise supervision to the extreme, i.e., treating each sample as a category. Therefore, the sample itself provides the supervision and human annotation is no longer needed. The validity of such a pretext task comes from the inference that realizing instance discrimination requires more generalized representations than class discrimination.

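Compared with the cross-entropy loss in supervised learning, in this work, InfoNCE [54, 52] is used as a contrastive loss function to implement instance discrimination by maximizing the mutual information [56, 55]. Consider $N$ PolSAR image samples whose representations are expressed as $\{\mathbf{z}_i\}$, $i = 1, \dots, N$. The InfoNCE loss for the $i$th training sample can be defined as:

$$\mathcal{L}_i = -\log \frac{\exp\left(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_i^{+})/\tau\right)}{\exp\left(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_i^{+})/\tau\right) + \sum_{j \ne i} \exp\left(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_j)/\tau\right)} \qquad (5)$$

where $\mathbf{z}_i^{+}$ denotes the vector which is related to $\mathbf{z}_i$, and $\mathbf{z}_j$ represents the other representations besides $\mathbf{z}_i$. $\mathrm{sim}(\cdot, \cdot)$ represents the cosine similarity [61] and $\tau$ is the temperature [53] hyperparameter that controls the uniformity of the information distribution. Treating $\mathbf{z}_i$ and $\mathbf{z}_i^{+}$ as a positive pair and $\mathbf{z}_i$ with the rest as negative pairs, (5) can be seen as a multi-class $N$-pair loss [62] which tries to classify $\mathbf{z}_i$ as $\mathbf{z}_i^{+}$.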
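As a rough illustration, the InfoNCE loss for a single sample can be computed as a softmax cross-entropy over cosine similarities in which the positive pair occupies the first slot. The temperature value of 0.07 and the function names are assumptions for this sketch, not values reported in the paper.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce(anchor, positive, negatives, tau=0.07):
    """Loss for one anchor: -log softmax probability of the positive among all candidates."""
    logits = np.array([cosine_sim(anchor, positive)]
                      + [cosine_sim(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                      # subtract the max for numerical stability
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_pos

anchor = np.array([1.0, 0.0])
negatives = [np.array([-1.0, 0.0]), np.array([0.0, 1.0])]
print(info_nce(anchor, np.array([1.0, 0.0]), negatives))   # well-aligned positive
print(info_nce(anchor, np.array([0.0, 1.0]), negatives))   # poorly aligned positive
```

The loss is close to zero when the positive is much more similar to the anchor than any negative, and it grows as the positive becomes harder to tell apart from the negatives.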

II-B2 Architecture of Encoder

A convolutional network, called encoder in this paper, is used to obtain the representation of input samples. It is considered to be the object of unsupervised pre-training. The encoder can be mainly divided into convolutional encoder and projection head. The former consists of four parts, including convolution, nonlinear activation, pooling, and global average pooling. An intuitive diagram of the architecture of encoder is shown in Fig. 5.

Fig. 5: Architecture of PCLNet encoder for PolSAR representations.

Specifically, the operation of convolution is defined as:

$$\mathbf{x}_j^{l} = \sum_{i=1}^{C} \mathbf{x}_i^{l-1} * \mathbf{k}_{ij}^{l} + b_j^{l} \qquad (6)$$

where $\mathbf{x}_i^{l-1}$ and $\mathbf{x}_j^{l}$ represent the $i$th input and the $j$th output feature maps of layer $l$, $\mathbf{k}_{ij}^{l}$ and $b_j^{l}$ denote the learnable kernel matrix and bias, $C$ means the number of input feature maps and $*$ denotes the convolution operator. To improve the nonlinear fitting ability, the rectified linear unit (ReLU) is employed as the activation function to avoid gradient vanishing, which is implemented by:

$$\mathrm{ReLU}(x) = \max(0, x) \qquad (7)$$

Pooling can be considered as a sub-sampling process which reduces the dimension of features. Moreover, it helps to make the features invariant to displacement, scaling, and other distortions in 2D maps. After several cascaded stages in the form of conv-ReLU-maxpool, global average pooling (GAP) is employed to down-sample by averaging each feature map over its height and width, so as to decrease computational complexity [63].
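One conv-ReLU-maxpool stage followed by GAP can be sketched with plain NumPy as below. The kernel size, pooling window and channel counts are illustrative assumptions, and the convolution is written as the cross-correlation that most deep learning frameworks use.

```python
import numpy as np

def conv2d(x, kernels, bias):
    """x: (C_in, H, W); kernels: (C_out, C_in, kH, kW); bias: (C_out,). Valid cross-correlation."""
    c_out, _, kh, kw = kernels.shape
    h_out, w_out = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((c_out, h_out, w_out))
    for j in range(c_out):
        for r in range(h_out):
            for c in range(w_out):
                # Sum over all input channels and the kernel window, then add the bias.
                out[j, r, c] = np.sum(x[:, r:r + kh, c:c + kw] * kernels[j]) + bias[j]
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    c, h, w = x.shape
    h2, w2 = h // size, w // size
    blocks = x[:, :h2 * size, :w2 * size].reshape(c, h2, size, w2, size)
    return blocks.max(axis=(2, 4))

def global_average_pool(x):
    return x.mean(axis=(1, 2))      # one scalar per feature map

x = np.ones((2, 6, 6))              # hypothetical 2-channel 6x6 patch
k = np.ones((3, 2, 3, 3))
print(global_average_pool(max_pool(relu(conv2d(x, k, np.zeros(3))))))
```

GAP removes the spatial dimensions entirely, so the output length equals the number of feature maps regardless of the patch size.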

Denote the aforementioned convolutional encoder as $f_{\theta}(\cdot)$, where $\theta$ means all the learnable parameters; a PolSAR image sample $\mathbf{x}$ can then be transformed into a representation $\mathbf{h} = f_{\theta}(\mathbf{x})$. Then a projection head is used to map the representation into the space where the contrastive loss is applied. Some recent studies [57, 52] showed that such a module can prevent the loss of information valid for the downstream task. A multilayer perceptron is adopted to construct the projection head. For the input sample $\mathbf{x}$, the output of the projection head can be written as:

$$\mathbf{z} = g_{\phi}(\mathbf{h}) = g_{\phi}(f_{\theta}(\mathbf{x})) \qquad (8)$$

where $g_{\phi}(\cdot)$ denotes the projection head and $\phi$ means its learnable parameters, i.e., the weight matrix $\mathbf{W}$ and bias vector $\mathbf{b}$ of the perceptron are the elements of $\phi$. We name the encoder composed of $f_{\theta}$ and $g_{\phi}$ the main encoder, which is used to obtain representations from input samples.

Although not shown in Fig. 5, the encoder of PCLNet is actually a two-stream architecture [64] and the two branches share the topology and hyperparameters. This means that there is an architecture that exists in parallel with the main encoder. This design is determined by the definition of the InfoNCE loss, whose calculation needs a pair of positive samples. To obtain the $\mathbf{z}_i^{+}$ of (5), a correlated view $\mathbf{x}^{+}$, called the positive sample, is generated through data augmentation of the input sample $\mathbf{x}$. Therefore, $\mathbf{x}$ and $\mathbf{x}^{+}$ form the ordinary positive pair. Then $\mathbf{x}^{+}$ is fed to the other branch, which is called the auxiliary encoder in this paper. Denoting the convolutional encoder and projection head of the auxiliary encoder as $f_{\theta'}$ and $g_{\phi'}$, the output of the auxiliary encoder can be written as:

$$\mathbf{z}^{+} = g_{\phi'}(f_{\theta'}(\mathbf{x}^{+})) \qquad (9)$$

The two branches of the encoder thus produce the positive pair $\mathbf{z}$ and $\mathbf{z}^{+}$ used to calculate the InfoNCE loss. There is only one positive sample to match the input sample, and all other ones are considered as negative samples. It is worth noting that the relationship between positive pairs is similar to that of samples and labels in supervised learning.

II-B3 Optimization of Encoder

Fig. 6: General flow chart of the optimization of encoder. Main encoder is the object of optimization, which is marked by a red dotted box. The InfoNCE loss supported by auxiliary encoder and memory bank is the means to update the main encoder. Back propagation of main encoder can be realized through the calculation of InfoNCE loss between positive and negative pairs. The optimization process gets rid of the need for human annotation.

The optimization is implemented by mini-batch stochastic gradient descent on the InfoNCE loss. The first thing to point out is that the optimization object is the learnable parameters of the main encoder, because only the representation it produces will be used for the downstream task transfer. Consider a mini-batch with $N$ training samples, where the dimension of each encoded sample is $D$; an illustration of the optimization can be seen in Fig. 6. The InfoNCE loss plays an important role in the optimization process. Moreover, as shown in (5), positive and negative pairs support the calculation of the InfoNCE loss.

Inspired by some related works [57, 52], in this paper, positive and negative pairs are obtained by auxiliary encoder and memory bank respectively. First, positive samples are generated by the augmentation of training samples. Then, they are input into the main and auxiliary encoders to obtain positive pairs. Finally, the encoded negative samples provided by memory bank are used to calculate the InfoNCE loss, so that the main encoder can be updated through back propagation. The construction of auxiliary encoder and the acquisition of positive pairs have been stated before. Next, we introduce the construction of memory bank and the update of auxiliary encoder.

On the premise of obtaining positive pairs, it is very important to traverse as many negative pairs as possible [55]. However, this leads to unacceptable computational complexity. To handle this problem, a memory bank is utilized to store the processed samples [53]. The memory bank can be seen as a dictionary filled with encoded negative samples. Therefore, the negative representations of (5) can be obtained by query instead of repeated calculation.

The construction of the memory bank is an important part of the optimization. It should be pointed out that the entries of the memory bank are not static, but vary on-the-fly. After each batch, representations obtained by the auxiliary encoder are stored in the memory bank. The capacity of the memory bank is $K$, which is a multiple of the mini-batch size $N$. When the storage of the memory bank reaches the upper limit, the latest enqueued representations replace the oldest ones to achieve dynamic updates. There are two main benefits of adopting such a dynamic memory bank. On the one hand, using the newest representations and removing the oldest ones promotes a consistent comparison [57] when calculating the loss function. On the other hand, the capacity of the memory bank is independent of the mini-batch size $N$. Therefore, the value of $K$ can be very large, which is helpful to the training.
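The first-in-first-out behaviour of such a memory bank maps naturally onto a bounded queue. The sketch below uses `collections.deque` from the Python standard library; the class name and its methods are hypothetical, and strings stand in for encoded representations.

```python
from collections import deque

class MemoryBank:
    """Fixed-capacity FIFO store for encoded negative samples."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest items when it is full.
        self.queue = deque(maxlen=capacity)

    def enqueue(self, batch):
        self.queue.extend(batch)

    def negatives(self):
        return list(self.queue)

    def __len__(self):
        return len(self.queue)

bank = MemoryBank(capacity=4)
for batch in (["a", "b"], ["c", "d"], ["e", "f"]):
    bank.enqueue(batch)
print(bank.negatives())   # the oldest batch has been replaced
```

Because the capacity is set independently of the batch size, the number of negatives seen per step can be much larger than the mini-batch itself.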

The update of the auxiliary encoder is also a crucial problem, which profoundly affects the optimization performance. In an extreme case, its update can be completely independent of the main encoder (two different encoders). Conversely, they can also be exactly the same (two identical encoders). However, both of these result in a rapidly changing auxiliary encoder. It has been shown that the encoded samples in the memory bank should maintain a certain consistency [53]. Therefore, a smoothly changing auxiliary encoder is needed. Inspired by [57], a momentum based method is employed to update the parameters of the auxiliary encoder:

$$\theta' \leftarrow \alpha\theta' + (1 - \alpha)\theta, \qquad \phi' \leftarrow \alpha\phi' + (1 - \alpha)\phi \qquad (10)$$

where $\theta$ and $\phi$ are the parameters of the main encoder, $\theta'$ and $\phi'$ are those of the auxiliary encoder, and $\alpha \in [0, 1)$ means the momentum coefficient. In this way, a relatively large momentum encourages the auxiliary encoder to update more smoothly and stably. Moreover, the auxiliary encoder can be excluded from back propagation and the computational complexity is more manageable.
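The momentum update is an exponential moving average of the main encoder's parameters. A minimal sketch follows, with parameters held as a flat list of floats and a default coefficient of 0.999 chosen only for illustration.

```python
def momentum_update(aux_params, main_params, momentum=0.999):
    """Return the new auxiliary parameters: momentum * aux + (1 - momentum) * main."""
    return [momentum * a + (1.0 - momentum) * m
            for a, m in zip(aux_params, main_params)]

aux = [0.0, 2.0]
main = [1.0, 2.0]
for _ in range(3):
    aux = momentum_update(aux, main, momentum=0.9)
print(aux)   # the first entry drifts slowly from 0 toward 1
```

A coefficient close to 1 makes the auxiliary encoder lag behind the main encoder, which keeps the representations stored in the memory bank consistent across steps.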

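With the support of the auxiliary encoder and the memory bank, the InfoNCE loss for one mini-batch can be written as:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_i^{+})/\tau\right)}{\exp\left(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_i^{+})/\tau\right) + \sum_{k=1}^{K} \exp\left(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_k^{-})/\tau\right)} \qquad (11)$$

where $\mathbf{z}_i$, the output of the main encoder, is the variable to be optimized, while the others, i.e., $\mathbf{z}_i^{+}$ from the auxiliary encoder and the negatives $\mathbf{z}_k^{-}$ from the memory bank, can be seen as constants during the back propagation.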

II-C Supervised Fine-tuning

Fig. 7: Illustration of supervised fine-tuning. The results of unsupervised pre-training are selectively used in the downstream task. Limited supervision is fed to train a new linear classifier, so as to achieve few-shot PolSAR classification.

Completely depending on unsupervised learning may not meet the accuracy requirements, and a moderate approach, i.e., unsupervised pre-training followed by supervised fine-tuning, is more acceptable. In this paper, fine-tuning generally follows the paradigm of supervised learning [20], with slight differences. It follows supervised learning in the dataset collection, the loss function definition and the optimization method. The difference is that the feature extraction layers of the network are not involved in the supervised training, and the number of training samples can be very small. The reason for this difference is that the representations learnt in unsupervised pre-training are transferrable, so that the dependence on complex paradigms and massive human annotations can be alleviated.

An illustration of the supervised fine-tuning we adopted is shown in Fig. 7. It can be seen that the pre-trained convolutional encoder $f_{\theta}$ of the main encoder is the foundation of few-shot classification. In the process of supervised fine-tuning, $f_{\theta}$ is used without any changes for the representation learning of labeled training samples. A trainable linear classifier, i.e., a fully-connected layer followed by softmax activation, is connected behind $f_{\theta}$. Limited supervision is sufficient for the training of the linear classifier due to its low complexity. In summary, the whole training process of PCLNet is shown in Algorithm 2.

Algorithm 2 Training Process of PCLNet
Begin
Prepare the contrastive learning dataset.
Unsupervised pre-training
Prepare the positive sample pairs, the max epoch, the number of steps, the momentum coefficient and the temperature. Initialize the learnable parameters of the main and auxiliary branches and the memory bank.
for each epoch up to the max epoch do
   for each mini-batch up to the number of steps do
      Encode the two samples of each positive pair with the main and auxiliary branches.
      for each encoded negative sample in the memory bank do
         Calculate the InfoNCE loss by (11).
      end for
      Optimize the parameters of the main branch by mini-batch SGD on the InfoNCE loss.
      Momentum update the parameters of the auxiliary branch by (10).
      Enqueue the current mini-batch and remove the oldest one to update the memory bank.
   end for
end for
return the optimal parameters of the main encoder.
Supervised fine-tuning
Prepare the labeled PolSAR training set and the max epoch. Freeze the optimal parameters of the main encoder. Initialize the learnable weight and bias of a linear classifier.
for each epoch up to the max epoch do
   Optimize the weight and bias with the training set by back propagation.
end for
return the optimal parameters of the main encoder and of the linear classifier.
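The memory bank update in the pre-training loop behaves like a first-in, first-out queue of encoded mini-batches. A minimal sketch, assuming the bank stores whole mini-batches and discards the oldest one when full, is given below. The class name `MemoryBank` and its methods are hypothetical, and short lists stand in for encoded feature vectors.

```python
from collections import deque


class MemoryBank:
    """FIFO store of encoded mini-batches that serve as negative samples."""

    def __init__(self, max_batches):
        # A deque with maxlen silently drops the oldest entry when a new one is appended.
        self._batches = deque(maxlen=max_batches)

    def enqueue(self, batch_keys):
        self._batches.append(list(batch_keys))

    def negatives(self):
        """All stored keys, oldest mini-batch first."""
        return [key for batch in self._batches for key in batch]


# With a bank length of 8192 and a mini-batch size of 512, the bank holds 16 mini-batches.
bank = MemoryBank(max_batches=8192 // 512)
for step in range(20):
    bank.enqueue([[float(step), 0.0]] * 512)
```

Decoupling the number of negatives from the mini-batch size in this way is what allows many negatives to be used without encoding them again at every step.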

III Experimental Results

III-A Datasets Description

We employ two widely-used PolSAR datasets in the experiments, i.e., AIRSAR Flevoland and ESAR Oberpfaffenhofen. Figs. 8-9 show their Pauli and ground truth maps respectively. Besides, Tables I-II show some details about the benchmark datasets.

III-A1 AIRSAR Flevoland

As shown in Fig. 8, an L-band, full polarimetric image of an agricultural region of the Netherlands is obtained through NASA/Jet Propulsion Laboratory AIRSAR. The size of this image is and the spatial resolution is . There are 15 kinds of ground objects, including buildings, stembeans, rapeseed, beet, bare soil, forest, potatoes, peas, lucerne, barley, grasses, water and three kinds of wheat. The number of labeled pixels in each category can be seen in Table I.

Fig. 8: AIRSAR Flevoland dataset and its color code. (a) Pauli RGB map. (b) Ground truth map.


AIRSAR Flevoland
Class code    Name           Reference data
1             Buildings      963
2             Rapeseed       17195
3             Beet           11516
4             Stembeans      6812
5             Peas           11394
6             Forest         20458
7             Lucerne        11411
8             Potatoes       19480
9             Bare soil      6116
10            Grass          8159
11            Barley         8046
12            Water          8824
13            Wheat one      16906
14            Wheat two      12728
15            Wheat three    24584
Total         -              184592

TABLE I: Number of Pixels in Each Category for AIRSAR Flevoland
Fig. 9: ESAR Oberpfaffenhofen dataset and its color code. (a) Pauli RGB map. (b) Ground truth map.

III-A2 ESAR Oberpfaffenhofen

An L-band, full polarimetric image of Oberpfaffenhofen, Germany, scene size, is obtained through the ESAR airborne platform. Its Pauli color-coded image and ground truth map can be seen in Fig. 9. Each pixel in the map is assigned to one of three categories, namely built-up areas, wood land and open areas, except for some unknown regions. The number of labeled pixels can be seen in Table II.


ESAR Oberpfaffenhofen
Class code    Name              Reference data
1             Built-up areas    310829
2             Wood land         263238
3             Open areas        733075
Total         -                 1307142

TABLE II: Number of Pixels in Each Category for ESAR Oberpfaffenhofen

III-B Experimental Setup

III-B1 Data Preparations

The original PolSAR images are represented by the polarimetric coherency matrix. In the diversity stimulation mechanism, the cluster numbers of the AIRSAR and ESAR datasets are set to 35 and 50, respectively. Then, we express the instance similarities as affinities computed with the Gaussian kernel function, and the bandwidth is set to 0.42. Finally, 600 instances are filtered out from each cluster as the training set. In the pretext task, the upper triangular elements of the coherency matrix are divided into their real and imaginary parts to describe each pixel of the PolSAR image. In the fine-tuning stage, the process is similar to some traditional methods: not only the pixels but also the surrounding image patches are cropped to generate the datasets. Then, the training sets with 300 samples, validation sets with 200 samples and testing sets with the remaining samples are obtained.
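The pixel description used in the pretext task can be illustrated with a short sketch. It assumes a 3x3 Hermitian coherency matrix, so the three diagonal entries are real and each of the three upper off-diagonal entries contributes a real and an imaginary part, giving nine real values per pixel. The Gaussian affinity shown is one common form of the kernel and is also an assumption; the function names `coherency_to_vector` and `gaussian_affinity` are hypothetical.

```python
import math


def coherency_to_vector(T):
    """Flatten the upper triangle of a 3x3 Hermitian matrix into 9 real values."""
    diagonal = [T[i][i].real for i in range(3)]  # imaginary parts are zero by Hermitian symmetry
    off_diagonal = []
    for i in range(3):
        for j in range(i + 1, 3):
            off_diagonal += [T[i][j].real, T[i][j].imag]
    return diagonal + off_diagonal


def gaussian_affinity(x, y, bandwidth=0.42):
    """Assumed kernel form: exp(-||x - y||^2 / (2 * bandwidth^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


# Toy coherency matrix for one pixel (values are illustrative only).
T = [
    [2.0 + 0.0j, 1.0 + 1.0j, 0.5 - 0.5j],
    [1.0 - 1.0j, 1.0 + 0.0j, 0.0 + 0.2j],
    [0.5 + 0.5j, 0.0 - 0.2j, 0.5 + 0.0j],
]
vector = coherency_to_vector(T)
```

An affinity of 1 means two instances are identical in this feature space, and the value decays towards 0 as they move apart.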

III-B2 Parameter Settings and Comparing Methods


Method       Wishart  SVM      RBF-SVM  MLP      CNN      CV-CNN   SF-CNN   TFL      MAML     PCLNet
Buildings    91.59    93.87    98.75    88.27    92.83    97.72    95.02    94.18    94.18    95.43
Rapeseed     73.70    72.17    73.72    83.12    70.89    64.90    84.48    59.16    81.00    87.11
Beet         93.67    86.42    86.28    84.43    88.00    89.12    67.83    71.28    97.35    96.41
Stembeans    89.68    90.24    92.84    93.94    94.38    98.08    98.80    96.21    89.74    96.59
Peas         91.45    83.20    83.31    83.02    97.14    95.73    94.33    94.28    82.30    97.96
Forest       85.51    83.14    79.44    89.26    97.67    98.09    97.44    87.69    96.82    98.15
Lucerne      78.11    87.63    86.55    85.97    81.78    90.75    97.26    97.37    84.48    95.22
Potatoes     92.74    77.85    79.30    83.40    96.67    89.19    89.53    93.13    86.96    96.00
Bare soil    70.85    92.71    95.24    93.84    75.20    92.81    99.71    61.59    92.02    92.81
Grass        24.86    58.39    61.10    60.46    83.52    94.72    53.27    17.17    94.25    85.98
Barley       99.14    89.47    87.81    96.07    82.12    70.40    95.20    69.94    84.41    93.84
Water        57.54    96.43    98.70    98.16    99.34    76.30    99.98    98.54    95.58    95.26
Wheat one    94.64    73.83    72.76    82.08    91.57    93.67    98.86    84.58    89.95    94.33
Wheat two    36.11    73.37    73.44    74.96    82.16    92.11    80.68    58.96    82.10    84.10
Wheat three  81.86    66.65    69.86    80.24    85.06    96.40    92.24    86.15    83.32    96.60
OA           78.81    78.77    79.29    84.10    88.03    89.28    89.81    79.23    88.09    93.96
AA           77.43    81.69    82.61    85.15    87.89    89.33    89.64    78.02    88.96    93.72
Kappa        77.53    77.64    78.17    83.11    87.19    88.50    89.06    77.96    87.27    93.47

TABLE III: Comparisons of Full-supervised Classification Results (%) for AIRSAR Flevoland Dataset.

At the beginning of pre-training, each training sample serves as one view, and the corresponding positive sample is generated by rotating it. The parameter settings of the main encoders and auxiliary encoders are identical, which is crucial for retaining consistency. The detailed configuration is displayed in Fig. 5. For the convolutional encoder, the convolution kernels are applied with stride 1, and the numbers of kernels in the three convolution layers are 16, 32 and 64, respectively. Max pooling is applied with stride 2. For the projection heads, the dimensions of the two fully-connected layers are 64 and 32, which means the extracted feature vectors are 32-dimensional here. In order to optimize the encoders, SGD is employed in our experiments. The weight decay is 0.0001 and the SGD momentum is 0.9. The mini-batch size is set to 512, the length of the memory bank is 8192, and the initial learning rate is 0.1. We train for 800 epochs, with the learning rate multiplied by 0.5 at 300 and 500 epochs. Besides, the momentum coefficient is 0.999 and the temperature is 0.4.
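The learning rate schedule described above is a simple step decay, which the following sketch makes explicit. It assumes zero-based epoch indices and that the decay takes effect from the milestone epoch onwards; the function name `learning_rate` and its parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(300, 500), gamma=0.5):
    """Step decay: multiply the base rate by gamma once for every milestone reached."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops


# Rates used over the 800 pre-training epochs under these assumptions.
schedule = [learning_rate(epoch) for epoch in range(800)]
```

Under these assumptions the rate is 0.1 for epochs 0 to 299, 0.05 for epochs 300 to 499 and 0.025 for the remaining epochs.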

In the fine-tuning stage, a linear classifier is connected to the global average pooling (GAP) layer, and its output dimension is equal to the number of categories. We train for 300 epochs with a learning rate of 0.01 and a mini-batch size of 32.

In order to evaluate the effectiveness of the proposed method, several supervised and semi-supervised classifiers are implemented and tested in the experiments. Specifically, three classical shallow models with hand-crafted features are chosen, including Wishart [65], linear support vector machine (SVM) and radial basis function kernel support vector machine (RBF-SVM) [66]. Four neural network-based methods, including MLP, CNN [20], CV-CNN [23] and polarimetric-feature-driven CNN (SF-CNN) [21], are chosen for comparison. Moreover, two representative few-shot learning methods, i.e., transfer learning and meta learning, are selected for comparison. In this paper, transfer learning is realized by an ImageNet pre-trained VGG-11 architecture [67], and meta learning is realized by model-agnostic meta-learning [46]. We denote them as TFL and MAML for convenience.

III-B3 Evaluation Criteria

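To evaluate the classification performance quantitatively, we choose three standard criteria, including overall accuracy (OA), average accuracy (AA) and kappa coefficient (Kappa). They can be defined as follows:

$$\mathrm{OA} = \frac{\sum_{i=1}^{C} M_i}{\sum_{i=1}^{C} N_i}, \qquad \mathrm{AA} = \frac{1}{C}\sum_{i=1}^{C}\frac{M_i}{N_i}$$

where $C$ denotes the number of categories, $M_i$ means the number of correctly classified samples of the $i$th category and $N_i$ denotes the number of labeled samples of the $i$th category.

$$\mathrm{Kappa} = \frac{\mathrm{OA} - P_e}{1 - P_e}, \qquad P_e = \frac{1}{N^2}\sum_{i=1}^{C} H_{i+}\,H_{+i}$$

where $N$ is the number of testing samples and $H$ denotes the classification confusion matrix, with $H_{i+}$ and $H_{+i}$ the sums of its $i$th row and $i$th column.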
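For readers who wish to reproduce these criteria, a minimal sketch that derives all three from a confusion matrix is given below. It assumes the usual definitions (OA as the fraction of correctly classified samples, AA as the mean per-class accuracy, and Kappa as chance-corrected agreement) and the convention that rows index the true class and columns the predicted class. The function name `classification_metrics` and the toy matrix are hypothetical.

```python
def classification_metrics(confusion):
    """Return (OA, AA, Kappa) from a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]

    oa = correct / total
    aa = sum(confusion[i][i] / row_sums[i] for i in range(n_classes)) / n_classes
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2  # chance agreement
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa


# Toy two-class example: 70 of 100 samples are classified correctly.
oa, aa, kappa = classification_metrics([[40, 10], [20, 30]])
```

In this toy case OA and AA are both 0.7, while Kappa is 0.4 because half of the agreement would be expected by chance.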

III-C Experimental Results

III-C1 Classification Results

Fig. 10: Full-supervised classification results of the whole map on AIRSAR Flevoland dataset with different methods. (a) Result of Wishart. (b) Result of SVM. (c) Result of RBF-SVM. (d) Result of MLP. (e) Result of CNN. (f) Result of CV-CNN. (g) Result of SF-CNN. (h) Result of TFL. (i) Result of MAML. (j) Result of PCLNet.


Method       Wishart  SVM      RBF-SVM  MLP      CNN      CV-CNN   SF-CNN   TFL      MAML     PCLNet
Buildings    69.26    77.57    94.29    79.44    89.20    79.02    87.75    89.72    97.72    94.39
Rapeseed     25.12    69.65    71.24    75.44    34.78    51.78    73.71    55.76    81.82    69.28
Beet         52.33    82.19    81.86    90.04    43.74    53.27    75.94    62.71    92.94    91.76
Stembeans    60.47    80.42    83.43    87.40    75.25    70.96    95.41    86.10    97.03    92.35
Peas         37.37    77.29    78.00    91.81    70.58    56.75    87.52    69.14    85.97    93.81
Forest       48.77    74.71    72.31    91.49    57.71    60.06    90.04    68.61    82.66    96.44
Lucerne      43.65    83.01    83.05    89.15    84.87    66.01    97.91    62.31    90.82    94.00
Potatoes     42.18    73.84    74.67    66.14    47.46    42.70    70.45    65.64    84.44    91.76
Bare soil    75.43    90.84    93.18    87.34    69.00    74.92    92.04    74.26    93.80    85.79
Grass        27.52    37.28    41.93    52.57    42.81    43.03    60.42    55.57    60.72    64.43
Barley       66.31    68.89    70.90    83.97    79.01    84.19    84.97    75.25    90.17    94.93
Water        58.39    92.52    97.05    76.56    77.69    72.98    76.87    85.51    99.29    85.97
Wheat one    41.19    66.99    64.02    47.37    62.84    71.53    54.54    65.11    68.25    89.07
Wheat two    35.69    61.40    63.98    65.26    52.33    36.18    28.06    56.25    68.75    83.09
Wheat three  22.69    54.25    55.74    35.52    59.37    66.28    67.08    63.41    87.95    90.12
OA           41.71    70.64    71.52    70.69    58.82    59.34    70.76    65.97    83.68    87.88
AA           47.09    72.72    75.04    74.63    63.11    61.98    72.18    69.02    85.49    87.81
Kappa        40.61    69.34    70.25    69.40    57.45    57.91    69.35    64.61    82.65    87.02

TABLE IV: Comparisons of Few-shot Classification Results (%) on AIRSAR Flevoland Dataset.

In the experiments, we use 20 and 300 training samples for each category to perform few-shot and full-supervised PolSAR image classification, respectively. Tables III-VI report the classification results on the two benchmark datasets with the aforementioned experimental settings, and the classification maps are shown in Figs. 10 and 12. The results differ between the two settings, but the trends are consistent across the datasets, so we describe the results from these two aspects. In the case of few-shot classification, the performance of the traditional Wishart classifier is not satisfactory. MLP, SVM and its extended version RBF-SVM show better results than CNN and CV-CNN. The performance of SF-CNN and MAML is slightly improved compared with the traditional CNN-based methods. In addition, the knowledge mined from optical images by TFL does not transfer well to PolSAR tasks. As the number of samples increases, the performance of CNN-based methods improves and surpasses other methods such as Wishart, SVM, MLP and MAML. The results of TFL are still not promising. In both cases, the proposed PCLNet shows the best generalization performance. The experimental results and classification maps on each dataset are analyzed in detail as follows.

Full-supervised classification results of the whole map on the Flevoland dataset are presented in Fig. 10. As mentioned above, 300 training samples for each category (a sampling rate of about 2.44%) are utilized in this experiment. From these results, we can observe how the different methods influence the overall performance. It can be noted that the proposed PCLNet best preserves the completeness of the terrains and obtains a more precise appearance. Moreover, it also reduces the cases in which buildings are wrongly assigned to the forest category, as well as errors at the boundaries between different terrain categories. These observations confirm the validity of the proposed PCLNet.

Fig. 11: Comparisons of involved methods on AIRSAR Flevoland dataset. (a) Result of few-shot classification. (b) Result of full-supervised classification.


Method          Wishart  SVM      RBF-SVM  MLP      CNN      CV-CNN   SF-CNN   TFL      MAML     PCLNet
Built-up areas  51.21    63.07    73.32    80.25    84.52    85.81    79.38    65.22    74.63    86.84
Wood land       72.20    83.97    88.90    91.27    92.94    93.34    86.20    86.08    93.94    95.38
Open areas      94.86    91.07    88.13    89.29    89.39    88.75    92.64    91.07    92.39    91.19
OA              79.92    82.98    84.76    87.54    88.95    88.97    88.19    83.92    89.03    92.50
AA              72.76    79.37    83.45    86.94    88.95    89.30    86.07    80.79    86.99    91.13
Kappa           70.13    75.03    77.87    73.74    76.72    81.80    82.14    76.34    83.23    87.36

TABLE V: Comparisons of Full-supervised Classification Results (%) for ESAR Oberpfaffenhofen Dataset.
Fig. 12: Full-supervised classification results of the whole map on ESAR Oberpfaffenhofen dataset with different methods. (a) Result of Wishart. (b) Result of SVM. (c) Result of RBF-SVM. (d) Result of MLP. (e) Result of CNN. (f) Result of CV-CNN. (g) Result of SF-CNN. (h) Result of TFL. (i) Result of MAML. (j) Result of PCLNet.

Quantitative comparisons are reported in Tables III-IV, in which the proposed method achieves the highest scores on the three criteria. In the full-supervised case, PCLNet improves the OA, AA, and Kappa of the real-valued CNN by 5.93%, 5.83%, and 6.28%, and those of the complex-valued CNN by 4.68%, 4.39%, and 4.97%. Furthermore, PCLNet with only 20 training samples for each category (a sampling rate of about 0.16%) also demonstrates promising effectiveness. As shown in Table IV, the best results obtained by our proposal reach 87.88% OA, 87.81% AA, and 87.02% Kappa, and these scores are comparable to those of CNN in full-supervised classification. Overfitting severely limits the performance of CNN. With limited training samples, the OA, AA, and Kappa of the proposal decrease by 6.08%, 5.91%, and 6.45%, whereas those of the real-valued CNN drop by 29.21%, 24.78%, and 29.74%. The accuracy of most categories decreases dramatically in some CNN-based methods with complex backbones. However, our proposed method can better maintain the classification performance.


Method          Wishart  SVM      RBF-SVM  MLP      CNN      CV-CNN   SF-CNN   TFL      MAML     PCLNet
Built-up areas  46.62    55.33    61.89    62.16    50.55    51.44    70.57    59.35    73.84    78.06
Wood land       70.82    92.71    85.85    88.33    90.00    90.85    71.73    69.94    82.63    90.72
Open areas      92.99    81.09    85.71    89.05    91.67    92.62    89.94    92.24    90.73    84.92
OA              77.50    77.30    80.07    82.51    81.56    81.40    81.67    79.93    85.08    86.54
AA              70.14    76.38    77.82    79.85    77.41    78.30    77.41    73.84    82.40    84.57
Kappa           67.06    69.02    71.92    63.14    61.13    69.41    73.41    70.70    76.00    77.65

TABLE VI: Comparisons of Few-shot Classification Results (%) on ESAR Oberpfaffenhofen Dataset.
Fig. 13: Comparisons of involved methods on ESAR Oberpfaffenhofen dataset. (a) Result of few-shot classification. (b) Result of full-supervised classification.

For a clearer comparison, Fig. 11 also shows the performance of few-shot classification and full-supervised classification on the AIRSAR Flevoland dataset. These results reveal that the high-level representations learnt by CL can effectively alleviate the greedy demands of CNN-based methods for massive annotations. At the same time, the shallow model in the downstream task is less prone to overfitting. All in all, these observations support the rationality of PCLNet.

Fig. 12 displays the full-supervised classification results of the whole map on the ESAR Oberpfaffenhofen dataset. We use 300 training samples for each category (sampling rate is about 0.69‰) in the fine-tuning stage, and there is no overlap between the training and the validation datasets. Compared with the other classifiers, the proposal better distinguishes the built-up areas from the wood land, and the open areas contain fewer noisy scatter points. In contrast, the other methods produce more errors, especially in the wood land. This is because very little supervision results in an inadequate quality of extracted features.
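The sampling rates quoted in this section follow directly from the labeled pixel counts in Tables I-II, as the small sketch below illustrates. It assumes the rate is defined as the total number of training samples divided by the total number of labeled pixels; the function name `sampling_rate` is hypothetical.

```python
def sampling_rate(samples_per_class, n_classes, labeled_pixels):
    """Fraction of labeled pixels that are used for training."""
    return samples_per_class * n_classes / labeled_pixels


# Labeled pixel totals taken from Tables I and II.
FLEVOLAND_PIXELS, FLEVOLAND_CLASSES = 184592, 15
OBERPFAFFENHOFEN_PIXELS, OBERPFAFFENHOFEN_CLASSES = 1307142, 3

full_rate = sampling_rate(300, OBERPFAFFENHOFEN_CLASSES, OBERPFAFFENHOFEN_PIXELS)
few_shot_rate = sampling_rate(20, OBERPFAFFENHOFEN_CLASSES, OBERPFAFFENHOFEN_PIXELS)
flevoland_full_rate = sampling_rate(300, FLEVOLAND_CLASSES, FLEVOLAND_PIXELS)
```

Multiplying the Oberpfaffenhofen rates by 1000 gives the per-mille values used in the text.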


Shot number 10 20 30 50 100 150 200 300
Method CNN Ours CNN Ours CNN Ours CNN Ours CNN Ours CNN Ours CNN Ours CNN Ours
Buildings 79.85 90.03 89.20 94.39 87.54 98.13 87.54 95.43 91.07 99.58 92.63 96.47 95.02 92.00 92.83 95.43
Rapeseed 24.01 47.00 34.78 69.28 42.83 66.60 40.13 78.99 41.70 84.45 62.94 79.34 70.81 84.84 70.89 87.11
Beet 53.69 90.74 43.74 91.76 66.80 92.44 61.15 92.80 68.04 92.84 74.07 94.14 86.01 95.37 88.00 96.41
Stembeans 69.20 93.82 75.25 92.35 86.26 96.39 80.46 97.24 96.05 97.94 79.71 98.34 96.17 98.34 94.38 96.59
Peas 37.54 95.70 70.58 93.81 61.65 95.77 70.12 96.32 91.44 96.53 86.12 97.70 89.62 97.31 97.14 97.96
Forest 65.94 96.85 57.71 96.44 83.60 97.92 81.94 96.80 96.35 97.96 86.82 98.13 97.05 97.52 97.67 98.15
Lucerne 71.37 92.22 84.87 94.00 87.14 91.84 81.40 93.89 88.02 95.67 91.30 92.42 88.92 94.84 81.78 95.22
Potatoes 53.78 93.17 47.46 91.76 62.48 94.00 72.48 95.24 74.67 95.51 92.71 94.43 93.54 96.12 96.67 96.00
Bare soil 71.40 89.14 69.00 85.79 81.74 88.57 77.86 87.84 87.39 92.81 68.05 92.85 87.77 92.53 75.20 92.81
Grass 37.50 58.23 42.81 64.43 45.37 73.76 58.00 78.02 59.73 74.27 77.18 82.00 62.57 85.93 83.52 85.98
Barley 78.24 89.70 79.01 94.93 87.91 92.75 88.35 90.94 85.89 94.37 77.50 94.54 89.30 94.23 82.12 93.84
Water 75.77 63.63 77.69 85.97 85.72 85.28 93.81 88.96 90.34 91.42 95.58 93.37 90.61 94.19 99.34 95.26
Wheat one 45.60 69.85 62.84 89.07 72.68 93.77 67.37 91.45 77.51 91.46 91.54 92.77 93.51 94.16 91.57 94.33
Wheat two 44.08 85.07 52.33 83.09 65.35 77.03 63.32 78.20 53.50 83.29 79.70 88.93 62.06 83.34 82.16 84.10
Wheat three 57.78 89.46 59.37 90.12 76.07 91.72 78.58 91.76 90.23 92.71 82.79 95.51 91.54 95.14 85.06 96.60
OA 54.24 82.80 58.82 87.88 70.78 88.83 71.55 90.33 78.20 91.91 82.79 92.61 86.60 93.30 88.03 93.96
AA 57.72 82.97 63.11 87.81 72.88 89.06 73.50 90.26 79.46 92.06 82.58 92.73 86.30 93.06 87.89 93.72
Kappa 52.86 81.69 57.45 87.02 69.42 88.02 70.18 89.61 76.93 91.29 81.71 92.04 85.65 92.77 87.19 93.47


TABLE VII: Comparisons of Classification Results (%) with Different Numbers of Training Samples on AIRSAR Flevoland Dataset.


Shot number 10 20 30 50 100 150 200 300
Method CNN Ours CNN Ours CNN Ours CNN Ours CNN Ours CNN Ours CNN Ours CNN Ours
Built-up areas 59.16 59.88 50.55 78.06 60.44 75.06 75.52 80.15 77.49 80.63 74.19 84.91 78.97 86.78 84.52 86.84
Wood land 83.30 88.25 90.00 90.72 86.85 95.03 86.61 91.68 94.02 92.56 93.18 94.99 93.77 94.75 92.94 95.38
Open areas 89.52 88.37 91.67 84.92 90.88 87.96 88.49 93.28 88.11 92.25 90.25 92.06 89.18 92.36 89.39 91.19
OA 81.05 81.53 81.56 86.54 82.83 88.86 85.03 89.26 86.77 89.66 87.02 92.00 87.68 92.37 88.95 92.50
AA 77.33 78.84 77.41 84.57 79.39 86.02 83.54 88.37 86.54 88.48 85.87 90.65 87.31 91.30 88.95 91.13
Kappa 60.07 69.64 61.13 77.65 63.81 81.18 68.45 82.14 72.13 82.75 72.65 86.55 74.03 87.18 76.72 87.36


TABLE VIII: Comparisons of Classification Results (%) with Different Numbers of Training Samples on ESAR Oberpfaffenhofen Dataset.

Tables V-VI summarize the experimental results of each method quantitatively, and Fig. 13 shows the performance comparisons of different methods on the ESAR Oberpfaffenhofen dataset. In full-supervised classification, the proposed PCLNet achieves accuracy increments of 3.55% OA, 2.18% AA, and 10.64% Kappa over the real-valued CNN, and 3.53% OA, 1.83% AA, and 5.56% Kappa over the complex-valued CNN. Higher scores are also obtained by the proposed CL paradigm than by the classic Wishart classifier and the MAML algorithm. In the case of few-shot classification, only 20 samples for each category (sampling rate is about 0.05‰) are used to construct the training sets. The accuracy on the testing sets reaches 86.54% OA, 84.57% AA, and 77.65% Kappa for PCLNet. As shown in Fig. 13, the OA and AA of PCLNet with 20 training samples for each category are very close to those of CNN with 300 samples for each category, and the Kappa even surpasses it. It can be seen that the advantages of the proposed PCLNet become more apparent as the number of training samples decreases. From a global point of view, all of these results support the robustness of our proposed method.

Fig. 14: Comparisons of the performance with different numbers of training samples between CNN and PCLNet on two benchmark datasets. The solid and dotted lines represent the results of CNN and PCLNet, respectively. (a) Result of Flevoland dataset. (b) Result of Oberpfaffenhofen dataset.
Fig. 15: T-SNE visualization of the representations learnt with different methods on AIRSAR Flevoland dataset. (a) Result of SVM. (b) Result of MLP. (c) Result of CNN. (d) Result of CV-CNN. (e) Result of SF-CNN. (f) Result of TFL. (g) Result of MAML. (h) Result of PCLNet. Each data point in the t-SNE scatter plots is colored according to its ground truth map.

III-C2 Impact of Input Shots

In order to investigate the influence of the number of training samples, we carry out comparative experiments between CNN and our proposed method with 10, 20, 30, 50, 100, 150, 200 and 300 training samples for each category. Tables VII-VIII report the comparative results on the two benchmarks, respectively, where the proposed PCLNet is denoted as Ours for short. Generally speaking, with different numbers of training samples, the proposal surpasses the CNN method in most categories. It is worth noting that the results for the water area on the Flevoland dataset and the open areas on the Oberpfaffenhofen dataset are sometimes reversed. This is because the difference between instances of the same category, such as water or some open areas, is not obvious. Even though we have adopted the diversity stimulation mechanism, this situation cannot be completely ruled out in the pretext task of instance discrimination. We assume that if the precision of the sensor were higher or the filter were stronger, this situation might be further reduced. Moreover, this phenomenon also indicates that CNN-based methods may not require many samples for this type of highly homogeneous terrain, because additional training samples may not provide much additional information.

The trend of the performance with different numbers of training samples for each category is shown in Fig. 14. It can be seen that our proposed method achieves the most accurate results on the two benchmark datasets. Moreover, on the Flevoland dataset, when we only use 10 samples for each category, the performance gap between the two methods is very large, namely 28.56%, 25.25%, and 28.83% in terms of OA, AA, and Kappa. CNN needs about 150 samples for each category to get close to the same level. On the Oberpfaffenhofen dataset, the significant improvement of Kappa also demonstrates the strong performance of the proposed method in the case of few-shot learning.

III-C3 Visualization of Features

PCLNet has presented excellent performance with limited training samples. The intrinsic reason is the mining of high-level representations in the pretext task of instance discrimination. In order to evaluate the quality of the extracted features, we apply t-SNE scatter plots [68] to perform two-dimensional visualization of the learnt representations. In this experiment, the MLP, TFL, CNN, CV-CNN, SF-CNN, and MAML algorithms utilize 20 samples for each category in the phase of training or fine-tuning. We visualize the activation responses of the last hidden layer in these neural networks. Besides, the hand-crafted features used in SVMs are also visualized, such as the Freeman-Durden three-component decomposition [39], the Yamaguchi four-component decomposition [16], and H/Ani/α [40]. For PCLNet, the transferrable representation embedded by the main encoder in Fig. 7 is visualized directly. It is extracted through unsupervised CL, and no labeled training sample is required. Moreover, we conduct these experiments on the two benchmark datasets separately. Different categories are marked in different colors, and the color-coding of the scatter points is consistent with the ground truth maps in Figs. 8-9. The final t-SNE visualization results are shown in Figs. 15-16.

Fig. 16: T-SNE visualization of the representations learnt with different methods on ESAR Oberpfaffenhofen dataset. (a) Result of SVM. (b) Result of MLP. (c) Result of CNN. (d) Result of CV-CNN. (e) Result of SF-CNN. (f) Result of TFL. (g) Result of MAML. (h) Result of PCLNet. Each data point in the t-SNE scatter plots is colored according to its ground truth map.

It can be clearly observed that the hand-crafted features used in SVMs are not adequate for distinguishing most categories. For the MLP algorithm, some features from the same category are scattered over various positions and form multiple disconnected regions, so the compactness is relatively weak. This poses great challenges for the design of an appropriate classifier. For CNN methods, the compactness improves slightly, but numerous points from different categories still overlap heavily. Some revised versions of CNN-based methods and the MAML algorithm improve the separability of each category. It is worth noting that SF-CNN slightly raises the generalization ability by mapping the hand-crafted features from the original space to a reasonable high-dimensional embedding space. In contrast, we observe that the CL pre-training of PCLNet provides more discriminative features and creates more compact and distinctive category-specific clusters. This shows that the proposal can embed the original feature vectors directly into a high-dimensional space, and gives the representations more desirable generalization capabilities for the downstream task of genuine interest.

The experimental results illustrated above exhibit some remarkable advantages of unsupervised deep PolSAR representation learning and offer some clear explanations for the excellent achievements in few-shot learning. Firstly, the pretext task of instance discrimination helps the feature extractors to capture more discriminative semantic cues, and encodes some latent information in the feature activations. These semantic cues are also what the classifier parameters ultimately learn to look for based on the ground truth categories. Secondly, the InfoNCE loss based on cosine similarity successfully assists the encoder in obtaining feature vectors with low intra-class variance [61]. From this perspective, all transferrable representations that correspond to the same category match compactly with the specific weight vectors of that category. This brings great convenience for the optimization of shallow models in the downstream tasks. Finally, the nonlinear projection head effectively avoids the loss of information induced by the InfoNCE loss. Hence, more discriminative representations can be sufficiently produced and maintained.

III-D Discussion

In the above experiments, the high-level transferrable representations captured by PCLNet present powerful generalization abilities. At the same time, the performance of the proposal has made a significant breakthrough in the few-shot PolSAR classification. Therefore, it is necessary to combine the theoretical basis and experimental simulation results to analyze and discuss the proposed method comprehensively.

First of all, a diversity stimulation mechanism is assembled flexibly to construct the CL datasets in the auxiliary pre-training phase. This component makes it possible to take full advantage of massive unlabeled PolSAR data and improve the correctness of negative sampling in instance discrimination. We maintain that the improvement of diversity is the key factor to unlock the critical bottleneck of the application gap between optics and PolSAR.

Secondly, high-level representations alleviate the greedy demands of CNNs for abundant human annotations. The t-SNE scatter plots show that the proposal creates more compact and separable clusters, indicating that the transferrable representations are learnt through discovering the distinction between multiple individuals. In fact, these results also confirm the central idea of our pretext task. By taking the class-wise supervision to the extreme and maximizing mutual information, the apparent similarity among semantic categories can be automatically discovered, which can be seen as high-level representations. It is interesting to note that, in the downstream tasks, although no augmentation or regularization techniques are explored, the proposed method can still achieve outstanding results in few-shot PolSAR classification. On the contrary, when we employ limited training samples to optimize deep CNN-based methods with a large number of trainable parameters from scratch, it is very likely that insufficient supervision causes the network to overfit the training data and degrades generalization in large-scale applications.

Last but not least, the evidence above suggests that unsupervised representation learning, which combines the discrimination ability of CNN-based methods with the scalability of unsupervised methods to large-scale problems, can be extended to more downstream tasks. In this paper, we reveal the considerable advantages of the proposed PCLNet in high-precision few-shot PolSAR image classification. However, the powerful capacity of unsupervised representation learning is not limited to a single specific task, but also extends to broad application fields. In a sense, unsupervised learning methods are very desirable and meaningful in real-world large-scale applications.

IV Conclusion

In this paper, a practical way for unsupervised PolSAR representation learning and few-shot classification is explored with the help of CL for the first time. To design a PolSAR-tailored CL method, a diversity stimulation mechanism is constructed to replace the random sampling of ordinary CL methods so as to improve the diversity of training data. This improvement can effectively narrow the application gap between optical and PolSAR images. After collecting the required training datasets, PCLNet includes two other parts, i.e., unsupervised pre-training based on the pretext task and supervised fine-tuning for the downstream task of genuine interest. In the unsupervised pre-training, the construction of the memory bank effectively reduces the computational complexity, and the momentum-based update of the auxiliary encoder significantly improves the consistency of the learning process. With the help of the transferrable representations acquired by pre-training, high-precision PolSAR classification can be achieved by training a linear classifier with very limited supervision. Numerous experiments are carried out on two widely used benchmark datasets. The experimental results exhibit the validity of PCLNet for both few-shot and full-supervised PolSAR classification compared with many popular methods.

We believe that more powerful pretext tasks have tremendous potential for improving PolSAR image classification. More importantly, the proposed framework opens the door to future research on unsupervised representation learning for large-scale PolSAR image interpretation. Hence, related PolSAR applications, such as fine-grained classification, semantic segmentation and object detection, are all issues that we will be interested in.


  • [1] G. Hong, S. Wang, J. Li, and J. Huang, “Fully polarimetric synthetic aperture radar (SAR) processing for crop type identification,” Photogramm. Eng. Remote Sens., vol. 81, no. 2, pp. 109–117, Feb. 2015.
  • [2] L. Pipia, X. Fabregas, A. Aguasca, and C. Lopez-Martinez, “Polarimetric temporal analysis of urban environments with a ground-based SAR,” IEEE Trans. Geosci. Remote Sens., vol. 51, no. 4, pp. 2343–2360, Oct. 2013.
  • [3] F. Ulaby and C. Elachi, “Radar polarimetry for geoscience applications,” Geocarto Int., vol. 5, no. 3, p. 38, Sep. 1990.
  • [4] R. Shirvany, M. Chabert, and J. Tourneret, “Ship and oil-spill detection using the degree of polarization in linear and hybrid/compact dual-pol SAR,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 5, no. 3, pp. 885–892, Feb. 2012.
  • [5] J. Karvonen, “Baltic sea ice concentration estimation based on C-band dual-polarized SAR data,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 9, pp. 5558–5566, Dec. 2014.
  • [6] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Comput., vol. 1, no. 4, pp. 541–551, Dec. 1989.
  • [7] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Lake Tahoe, CA, USA, Dec. 2012, pp. 1097–1105.
  • [8] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, Massachusetts, USA, Jun. 2015, pp. 1–9.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, Nevada, USA, Jun. 2016, pp. 770–778.
  • [10] O. Ronneberger, P. Fischer, and T. Brox. (May 2015) U-Net: Convolutional networks for biomedical image segmentation. [Online]. Available:
  • [11] H. Greenspan, B. van Ginneken, and R. M. Summers, “Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique,” IEEE Trans. Med. Imaging, vol. 35, no. 5, pp. 1153–1159, May 2016.
  • [12] H. Shin et al., “Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning,” IEEE Trans. Med. Imaging, vol. 35, no. 5, pp. 1285–1298, May 2016.
  • [13] L. Ma, Y. Liu, X. Zhang, Y. Ye, G. Yin, and B. Johnson, “Deep learning in remote sensing applications: A meta-analysis and review,” ISPRS J. Photogramm. Remote Sens., vol. 152, no. 6, pp. 166–177, Jun. 2019.
  • [14] X. Zhu et al., “Deep learning in remote sensing: A comprehensive review and list of resources,” IEEE Geosci. Remote Sens. Mag., vol. 5, no. 4, pp. 8–36, Dec. 2017.
  • [15] L. Zhang, L. Zhang, and B. Du, “Deep learning for remote sensing data: A technical tutorial on the state of the art,” IEEE Geosci. Remote Sens. Mag., vol. 4, no. 2, pp. 22–40, Jun. 2016.
  • [16] Y. Yamaguchi, A. Sato, W. Boerner, R. Sato, and H. Yamada, “Four-component scattering power decomposition with rotation of coherency matrix,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 6, pp. 2251–2258, Jun. 2011.
  • [17] W. Cameron and L. Leung, “Feature motivated polarization scattering matrix decomposition,” in Proc. IEEE Int. Radar Conf., May 1990, pp. 549–557.
  • [18] J. Lee, M. Grunes, E. Pottier, and L. Ferro-Famil, “Unsupervised terrain classification preserving polarimetric scattering characteristics,” IEEE Trans. Geosci. Remote Sens., vol. 42, no. 4, pp. 722–731, Apr. 2004.
  • [19] K. Ersahin, I. Cumming, and R. Ward, “Segmentation and classification of polarimetric SAR data using spectral graph partitioning,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 1, pp. 164–174, Jan. 2010.
  • [20] Y. Zhou, H. Wang, F. Xu, and Y. Jin, “Polarimetric SAR image classification using deep convolutional neural networks,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 12, pp. 1935–1939, Dec. 2016.
  • [21] S. Chen and C. Tao, “PolSAR image classification using polarimetric-feature-driven deep convolutional neural network,” IEEE Geosci. Remote Sens. Lett., vol. 15, no. 4, pp. 627–631, Apr. 2018.
  • [22] C. Yang, B. Hou, B. Ren, Y. Hu, and L. Jiao, “CNN-based polarimetric decomposition feature selection for PolSAR image classification,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 11, pp. 8796–8812, Nov. 2019.
  • [23] Z. Zhang, H. Wang, F. Xu, and Y. Jin, “Complex-valued convolutional neural network and its application in polarimetric SAR image classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 12, pp. 7177–7188, Dec. 2017.
  • [24] X. Liu, L. Jiao, X. Tang, Q. Sun, and D. Zhang, “Polarimetric convolutional network for PolSAR image classification,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 5, pp. 3040–3054, May 2019.
  • [25] L. Zhang, H. Dong, and B. Zou, “Efficiently utilizing complex-valued PolSAR image data via a multi-task deep learning framework,” ISPRS J. Photogramm. Remote Sens., vol. 157, no. 8, pp. 59–72, Sep. 2019.
  • [26] A. Mullissa, C. Persello, and A. Stein, “PolSARNet: A deep fully convolutional network for polarimetric SAR image classification,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 12, no. 12, pp. 5300–5309, Dec. 2019.
  • [27] X. Tan, M. Li, P. Zhang, Y. Wu, and W. Song, “Complex-valued 3-D convolutional neural network for PolSAR image classification,” IEEE Geosci. Remote Sens. Lett., pp. 1–5, 2019.
  • [28] H. Liu, F. Shang, S. Yang, M. Gong, T. Zhu, and L. Jiao, “Sparse manifold-regularized neural networks for polarimetric SAR terrain classification,” IEEE Trans. Neural Netw. Learn. Syst., pp. 1–10, 2019.
  • [29] Z. Wen, Q. Wu, Z. Liu, and Q. Pan, “Polar-spatial feature fusion learning with variational generative-discriminative network for PolSAR classification,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 11, pp. 8914–8927, Jul. 2019.
  • [30] H. Dong, B. Zou, L. Zhang, and S. Zhang, “Automatic design of CNNs via differentiable neural architecture search for PolSAR image classification,” IEEE Trans. Geosci. Remote Sens., pp. 1–14, 2020.
  • [31] H. Wang, F. Xu, and Y. Jin, “A review of PolSAR image classification: From polarimetry to deep learning,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Yokohama, Japan, Jul. 2019, pp. 3189–3192.
  • [32] M. Turk and A. Pentland, “Eigenfaces for recognition,” J. Cogn. Neurosci., vol. 3, no. 1, pp. 71–86, Jan. 1991.
  • [33] L. Jing and Y. Tian. (Feb. 2019) Self-supervised visual feature learning with deep neural networks: A survey. [Online]. Available:
  • [34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 56, pp. 1929–1958, Jun. 2014.
  • [35] V. Nair and G. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. 27th Int. Confer. Mach. Learn. (ICML), Haifa, Israel, Jun. 2010, pp. 807–814.
  • [36] R. Srivastava, K. Greff, and J. Schmidhuber. (Nov. 2015) Highway networks. [Online]. Available:
  • [37] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
  • [38] J. Lee, M. Grunes, and G. De Grandi, “Polarimetric SAR speckle filtering and its implication for classification,” IEEE Trans. Geosci. Remote Sens., vol. 37, no. 5, pp. 2363–2373, Sep. 1999.
  • [39] S. Freeman and S. Durden, “A three-component scattering model for polarimetric SAR data,” IEEE Trans. Geosci. Remote Sens., vol. 36, no. 3, pp. 963–973, May 1998.
  • [40] S. Cloude and E. Pottier, “An entropy based classification scheme for land applications of polarimetric SAR,” IEEE Trans. Geosci. Remote Sens., vol. 35, no. 1, pp. 68–78, Jan. 1997.
  • [41] G. Hinton and R. Zemel, “Autoencoders, minimum description length and Helmholtz free energy,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Denver, Colorado, USA, Mar. 1994, pp. 3–10.
  • [42] Y. Hu, J. Fan, and J. Wang, “Classification of PolSAR images based on adaptive nonlocal stacked sparse autoencoder,” IEEE Geosci. Remote Sens. Lett., vol. 15, no. 7, pp. 1050–1054, Jul. 2018.
  • [43] L. Zhang, W. Ma, and D. Zhang, “Stacked sparse autoencoder in PolSAR data classification using local spatial information,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 9, pp. 1359–1363, Sep. 2016.
  • [44] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conf. North Am. Assoc. Comput. Linguist. (NAACL), Minneapolis, USA, Jun. 2019, pp. 4171–4186.
  • [45] C. Doersch, A. Gupta, and A. Efros, “Unsupervised visual representation learning by context prediction,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Santiago, Chile, Dec. 2015, pp. 1422–1430.
  • [46] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. 34th Int. Confer. Mach. Learn. (ICML), Sydney, Australia, Aug. 2017, pp. 1126–1135.
  • [47] K. Hsu, S. Levine, and C. Finn. (Mar. 2019) Unsupervised learning via meta-learning. [Online]. Available:
  • [48] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Barcelona, Spain, Dec. 2016, pp. 3630–3638.
  • [49] S. Gidaris, P. Singh, and N. Komodakis. (Mar. 2018) Unsupervised representation learning by predicting image rotations. [Online]. Available:
  • [50] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Amsterdam, Netherlands, Oct. 2016, pp. 69–84.
  • [51] G. Larsson, M. Maire, and G. Shakhnarovich, “Colorization as a proxy task for visual understanding,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 840–849.
  • [52] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. (Feb. 2020) A simple framework for contrastive learning of visual representations. [Online]. Available:
  • [53] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Salt Lake City, UT, USA, Jun. 2018, pp. 3733–3742.
  • [54] A. Oord, Y. Li, and O. Vinyals. (Jan. 2019) Representation learning with contrastive predictive coding. [Online]. Available:
  • [55] R. Hjelm et al. (Feb. 2019) Learning deep representations by mutual information estimation and maximization. [Online]. Available:
  • [56] P. Bachman and R. Hjelm. (Jul. 2019) Learning representations by maximizing mutual information across views. [Online]. Available:
  • [57] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. (Nov. 2019) Momentum contrast for unsupervised visual representation learning. [Online]. Available:
  • [58] W. Chen, Y. Liu, Z. Kira, Y. Wang, and J. Huang. (Jan. 2020) A closer look at few-shot classification. [Online]. Available:
  • [59] J. Macqueen, “Some methods for classification and analysis of multivariate observations,” in Proc. 5th Berkeley Symp. Math. Statist. Prob., 1965.
  • [60] Z. Deng, H. Sun, S. Zhou, and J. Zhao, “Learning deep ship detector in SAR images from scratch,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 6, pp. 4021–4039, Jun. 2019.
  • [61] S. Gidaris and N. Komodakis, “Dynamic few-shot visual learning without forgetting,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Salt Lake City, UT, USA, Jun. 2018, pp. 4367–4375.
  • [62] K. Sohn, “Improved deep metric learning with multi-class N-pair loss objective,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Barcelona, Spain, Dec. 2016, pp. 1857–1865.
  • [63] M. Lin, Q. Chen, and S. Yan. (2013) Network in network. [Online]. Available:
  • [64] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, Nevada, USA, Jun. 2016, pp. 1933–1941.
  • [65] J. Lee, M. Grunes, and R. Kwok, “Classification of multi-look polarimetric SAR imagery based on complex Wishart distribution,” Int. J. Remote Sens., vol. 15, no. 11, pp. 2299–2311, Jul. 1994.
  • [66] C. Lardeux, P. Frison, C. Tison, J. Souyris, B. Stoll, B. Fruneau, and J. Rudant, “Support vector machine for multifrequency SAR polarimetric data classification,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 12, pp. 4143–4152, Dec. 2009.
  • [67] K. Simonyan and A. Zisserman. (Apr. 2015) Very deep convolutional networks for large-scale image recognition. [Online]. Available:
  • [68] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, no. 2605, pp. 2579–2605, Nov. 2008.