Pose Discrepancy Spatial Transformer Based Feature Disentangling for Partial Aspect Angles SAR Target Recognition

03/07/2021 ∙ by Zaidao Wen, et al.

This letter presents a novel framework termed DistSTN for the task of synthetic aperture radar (SAR) automatic target recognition (ATR). In contrast to the conventional SAR ATR algorithms, DistSTN considers a more challenging practical scenario for non-cooperative targets whose aspect angles for training are incomplete and limited in a partial range while those of testing samples are unlimited. To address this issue, instead of learning the pose invariant features, DistSTN newly involves an elaborated feature disentangling model to separate the learned pose factors of a SAR target from the identity ones so that they can independently control the representation process of the target image. To disentangle the explainable pose factors, we develop a pose discrepancy spatial transformer module in DistSTN to characterize the intrinsic transformation between the factors of two different targets with an explicit geometric model. Furthermore, DistSTN develops an amortized inference scheme that enables efficient feature extraction and recognition using an encoder-decoder mechanism. Experimental results with the moving and stationary target acquisition and recognition (MSTAR) benchmark demonstrate the effectiveness of our proposed approach. Compared with the other ATR algorithms, DistSTN can achieve higher recognition accuracy.




I Introduction

Automatic target recognition (ATR) is one of the ultimate goals in the field of synthetic aperture radar (SAR) and has attracted much attention in civil and military reconnaissance and surveillance for years [16, 13, 8]. In contrast to optical sensors, the special electromagnetic imaging mechanism of SAR makes the resulting image a reconstruction of the specular backscattering of the illuminated target [15]. Therefore, SAR actually “sees” certain physical structures of a target, whose scattering signatures in the resulting SAR image are highly sensitive to its pose. Crucially, the profile and the positions of the strong scattering points of the target vary considerably as the viewing angle of the SAR platform changes, which makes SAR ATR more challenging than general recognition tasks on optical sensor data.

To tackle this issue, an intuitive solution in conventional SAR ATR algorithms is to collect the distributions of strong scattering points of the target uniformly over the full ($0^\circ$ to $360^\circ$) range of aspect angles [18, 28], which are subsequently assembled into a feature template in the hope of recording all the scattering signatures in different poses. Then the identity of a query test target can be determined by pairwise class-template matching. However, this approach is not robust to disturbances such as speckle noise and motion ambiguity, which heavily degrade its practical recognition performance. To improve the robustness, several hand-crafted image feature extractors such as the scale-invariant feature transform (SIFT) focus on pose-invariant features that describe intrinsic visual signatures of the target. These visual features are normally more robust and achieve better generalization performance when paired with a suitable discriminative classifier such as the support vector machine (SVM) [22]. To further improve the discrimination and adaptivity of the features, many machine learning-based algorithms have gradually been applied to SAR ATR [26, 25, 7, 6, 24, 1, 3, 4, 5, 20, 27], among which the most notable model is the convolutional neural network (CNN). Distinct from the above feature engineering algorithms, the core idea of CNN-based algorithms for addressing the pose sensitivity issue is to fit a set of rotated target images and their class labels with a deep neural network, through which all intra-class rotated targets are mapped to the same class label. In this way, label-invariant features are expected to be obtained by discriminative learning regardless of their variation in aspect angle, and the recognition performance can be remarkably improved in terms of both accuracy and efficiency.

The success of the above learning models, especially CNNs, owes to the existence of a large number of training samples that cover sufficient target patterns, which ensures distribution consistency between the training and testing samples. However, in practical SAR ATR applications, it is intractable to collect sufficient samples of non-cooperative targets in different poses for training. In general, only a small number of targets in partial aspect angles can be acquired by reconnaissance. This leads to the so-called out-of-distribution (o.o.d) classification problem, in which the distributions of the training and testing samples are slightly different [10]. In this case, the recognition performance of the above SAR ATR algorithms decreases dramatically because a higher generalization ability is required [10]. An intuitive way to overcome this problem in the computer vision community would be to manually generate pseudo targets over the full range of angles for data augmentation, distribution completion, and alignment. We, however, empirically found that this trick is invalid for SAR data, which might be due to ambiguous artifacts caused by nonlinear pixel interpolation. The interpolated pixels do not correctly recover the actual physical scattering signature of the corresponding real target, so these generated pseudo samples do not explicitly provide more discriminative information than the original ones [27, 17, 19].

Motivated by the above practical o.o.d scenario, this paper develops a pose discrepancy spatial transformer based feature disentangling framework (DistSTN) for partial aspect angles SAR target recognition (PAA-ATR). Instead of learning pose-invariant features, DistSTN introduces an elaborate feature disentangling model to separate the learned pose factors of a SAR target from the identity ones, so that they can independently control the representation process of the target image for better generalization ability. To disentangle the explainable factors, a pose discrepancy spatial transformer module is developed in DistSTN. It aims to characterize the intrinsic transformation between the factors of two different targets with regularization induced by an explicit geometric model. Furthermore, DistSTN develops an amortized inference scheme that enables efficient feature extraction and recognition using an encoder-decoder mechanism. Experimental results on the MSTAR benchmark demonstrate that our framework achieves better recognition accuracy in the PAA-ATR task. The rest of this paper is organized as follows: Section II presents the main framework, Section III describes the experiments that validate its performance, and Section IV summarizes our work and suggests future directions.

II Framework Presentation

In this section, we first formulate the problem of PAA-ATR and analyze some insights that shed light on our solution. Next, we develop a spatial transformer-based feature disentanglement model to address this task. Finally, we propose an encoder-decoder architecture for amortized inference and model learning.

II-A Problem Formulation of PAA-ATR and Insight

Let $\{(\mathbf{x}_i^c, \mathbf{y}_i^c)\}_{i=1}^{N_c}$ be $N_c$ labeled SAR targets drawn from the $c$-th class, where $\mathbf{x}_i^c$ and $\mathbf{y}_i^c$ stand for the SAR target image and its identity label vector, respectively, and $C$ is the total number of target categories. The ultimate goal of ATR is to predict the label vector of a new query sample from these training data. General SAR ATR algorithms implicitly assume that the aspect angles of the training and testing samples are identically and uniformly distributed over the full range $[0^\circ, 360^\circ)$, which is intractable for non-cooperative targets. Instead, this paper considers a more difficult but practical task termed PAA-ATR. It assumes that the aspect angles of the training samples from at least one class are incomplete and limited to a partial sub-range of $[0^\circ, 360^\circ)$, while those of the testing samples span the full $[0^\circ, 360^\circ)$, yielding an o.o.d scenario.
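To make the PAA-ATR setting concrete, the following minimal Python sketch builds a training subset in which selected (non-cooperative) classes keep only samples within a partial aspect-angle window, while all other classes keep every angle. The class names, the angle window of 0 to 30 degrees, and the function names are hypothetical placeholders chosen for illustration; they are not taken from the paper.

```python
from typing import Iterable, List, Set, Tuple

Sample = Tuple[str, float]  # (class name, aspect angle in degrees)


def in_partial_range(angle_deg: float, lo: float, hi: float) -> bool:
    """Return True if the angle, wrapped to [0, 360), lies in [lo, hi]."""
    return lo <= angle_deg % 360.0 <= hi


def paa_training_subset(samples: Iterable[Sample], non_coop: Set[str],
                        lo: float, hi: float) -> List[Sample]:
    """Keep all samples of cooperative classes, but only the samples of
    non-cooperative classes whose aspect angle falls in [lo, hi]."""
    kept = []
    for cls, angle in samples:
        if cls not in non_coop or in_partial_range(angle, lo, hi):
            kept.append((cls, angle))
    return kept


# Hypothetical toy data: two classes, two aspect angles each.
samples = [("class_a", 10.0), ("class_a", 200.0),
           ("class_b", 25.0), ("class_b", 310.0)]
train = paa_training_subset(samples, non_coop={"class_a"}, lo=0.0, hi=30.0)
print(train)
```

The test set would simply be left unfiltered, so that query samples can appear at angles never observed for the non-cooperative classes.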

Due to the electromagnetic imaging mechanism of the SAR sensor, the scattering appearance of an illuminated target is highly sensitive to the relative pose between the target and the sensor. As a result, the difference between the training and testing samples in PAA-ATR is aggravated in comparison with the general ATR task, and the latent factors accounting for pose and identity are entangled in the image domain. To alleviate this challenge, conventional algorithms exploit physics-driven or geometry-guided methods to design rotation-invariant features by hand. Alternatively, a learning-based model such as a CNN trains a deep network that maps all intra-class targets with different poses to the same label vector in a supervised way. In this way, the intermediate features only account for the identity label and ignore the intra-class variances, and the model can achieve much better performance provided that the training and testing samples are consistently distributed. However, since the general CNN involves no explicit spatial geometric transformation, it can only achieve local spatial pose invariance through a deep hierarchy of max-pooling and convolutional layers. Consequently, its generalization ability for PAA-ATR is weak.

II-B The Framework for Feature Disentangling

According to the above analysis, the critical issue of the CNN-based model is its lack of rotation awareness: the model does not understand the physical and semantic concept of target rotation. It is therefore intractable for a CNN-based model to generalize the rotational patterns seen in training to the unseen ones in the testing phase of PAA-ATR. To tackle this issue, our core idea focuses on equivariant feature disentanglement instead of pursuing pose-invariant features for discrimination. In contrast to the general CNN model for discriminative learning, we develop a generative feature learning model containing a geometric transformation module, which aims to explicitly characterize the pose features and disentangle them from the identity ones. Through this separation, the discrimination of the remaining identity features is expected to be enhanced without being influenced by pose variation.

To this end, let $\mathbf{z}_p$ and $\mathbf{z}_y$ be the latent pose (relative to the radar) and identity factors, respectively. Considering a generative learning model, a SAR target image $\mathbf{x}$ is modeled and represented by $\mathbf{z}_p$ and $\mathbf{z}_y$ through a nonlinear parametric function $g_{\theta}$ as $\mathbf{x} = g_{\theta}(\mathbf{z}_p, \mathbf{z}_y)$, where $\theta$ contains the model parameters controlling the unknown intricate SAR imaging process. Based on this generative model, the task of feature disentanglement is the inverse problem of extracting $\mathbf{z}_p$ and $\mathbf{z}_y$ from $\mathbf{x}$, which can usually be addressed by a maximum a posteriori (MAP) estimator given by:

$$\min_{\mathbf{z}_p,\,\mathbf{z}_y}\; \mathcal{L}\big(\mathbf{x},\, g_{\theta}(\mathbf{z}_p, \mathbf{z}_y)\big) + \mathcal{R}_y(\mathbf{z}_y) + \mathcal{R}_p(\mathbf{z}_p) \tag{1}$$

where $\mathcal{R}_y$ and $\mathcal{R}_p$ are two elaborate regularization functions on $\mathbf{z}_y$ and $\mathbf{z}_p$ that encode our prior preference for disentangling, and $\mathcal{L}$ measures the representation error. Note that Eq. (1) originates from the idea of independent component analysis (ICA) for source separation [2]. Therefore, the critical issue in Eq. (1) is to design two regularization functions for effective disentanglement and to develop an efficient feature inference process to solve the optimization.

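For the first regularizer $\mathcal{R}_y$, it can be designed as the following task-induced function (2) with a supervised analysis prior [9], which forces the identity factors $\mathbf{z}_y$ to contain sufficient discriminative information for correct recognition:

$$\mathcal{R}_y(\mathbf{z}_y) = -\log p(\mathbf{y} \mid \mathbf{z}_y) = -\sum_{c=1}^{C} \mathbb{1}(y_c = 1)\, \log\, [f_{\phi}(\mathbf{z}_y)]_c \tag{2}$$

where $-\log p(\mathbf{y} \mid \mathbf{z}_y)$ is the negative log-likelihood function of the label vector $\mathbf{y}$ given $\mathbf{z}_y$ induced from the categorical distribution, $\mathbb{1}(\cdot)$ is the indicator function, $f_{\phi}(\mathbf{z}_y)$ is a simple affine transformation of $\mathbf{z}_y$ followed by a softmax function, and the subscript $c$ selects the value at the $c$-th index.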
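The task-induced identity regularizer described above (an affine map followed by a softmax, scored by the categorical negative log-likelihood) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the identity features are flattened to a vector, and the names `affine`, `softmax`, `identity_nll`, `W`, and `b` are hypothetical rather than taken from the paper.

```python
import math


def affine(W, b, z):
    """Affine map W z + b for a weight matrix W (list of rows) and bias b."""
    return [sum(w * v for w, v in zip(row, z)) + bi for row, bi in zip(W, b)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def identity_nll(W, b, z, label_index):
    """Negative log-likelihood of the true class under softmax(W z + b).

    With a one-hot label, the indicator in the sum keeps only the term of
    the true class, so the loss reduces to -log of that class probability.
    """
    probs = softmax(affine(W, b, z))
    return -math.log(probs[label_index])


# Toy example: two classes, zero features, so both classes are equally likely.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
loss = identity_nll(W, b, [0.0, 0.0], label_index=0)
print(loss)
```

In the toy example both class probabilities equal one half, so the loss is log 2; features that favor the true class drive the loss toward zero.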

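For the second regularizer $\mathcal{R}_p$, the initial aim is to characterize the factors accounting for the target pose relative to the sensor, including the azimuth angle, the depression angle, and other positional factors of the sensor. Nevertheless, it is intractable to label all of these factors explicitly and exactly for all training targets, so we cannot model the pose factors in a discriminative way as above. Instead, we propose a novel self-supervised task of target cross-transformation to model the pairwise pose discrepancy between two targets with an explicit geometric model. Formally, let $\mathbf{z}_p^i$ and $\mathbf{z}_p^j$ be the pose factors of two targets $\mathbf{x}^i$ and $\mathbf{x}^j$, respectively. If they indeed capture the entire pose information, there will be a geometrically explicit operator $\mathcal{T}_{\mathbf{A}_{ij}}$ with parameters $\mathbf{A}_{ij}$ warping $\mathbf{z}_p^i$ to $\mathbf{z}_p^j$ and vice versa. More importantly, the parameters $\mathbf{A}_{ij}$ will have a clear physical meaning: they measure the pose discrepancy between two targets without being influenced by the other shared sensor factors. According to the 2D rigid-body geometric transformation model [14], the warping function $\mathcal{T}_{\mathbf{A}_{ij}}$ is essentially an affine transformation of the 2D coordinates of the input feature maps, followed by the sampling and interpolation processes that are also exploited in the spatial transformer network (STN) [12]. If $\mathbf{A}_{ij}$ can be correctly estimated and assigned, $\mathcal{T}_{\mathbf{A}_{ij}}(\mathbf{z}_p^i)$ will be equal to $\mathbf{z}_p^j$. Therefore, $\mathbf{z}_p^i$ can be further exploited to represent $\mathbf{x}^j$ in conjunction with $\mathbf{A}_{ij}$ and the identity factors $\mathbf{z}_y^j$. In this sense, we can exploit a pose discrepancy aware network $h_{\psi}$ to estimate the parameters as $\mathbf{A}_{ij} = h_{\psi}(\mathbf{z}_p^i, \mathbf{z}_p^j)$, where $\psi$ contains its parameters. $h_{\psi}$ and $\mathcal{T}_{\mathbf{A}_{ij}}$ constitute the designed pose discrepancy STN illustrated in Fig. 1(b). According to this model, $\mathcal{R}_p$ is designed as (3) to measure the error of representing $\mathbf{x}^j$ with $\mathbf{z}_p^i$ and $\mathbf{z}_y^j$ without external supervised pose information:

$$\mathcal{R}_p(\mathbf{z}_p^i, \mathbf{z}_p^j) = \mathcal{L}\Big(\mathbf{x}^j,\; g_{\theta}\big(\mathcal{T}_{\mathbf{A}_{ij}}(\mathbf{z}_p^i),\, \mathbf{z}_y^j\big)\Big), \qquad \mathbf{A}_{ij} = h_{\psi}(\mathbf{z}_p^i, \mathbf{z}_p^j) \tag{3}$$

where $g_{\theta}$ and $\mathcal{L}$ denote the generative function and the representation error of model (1).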
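The warping step described above (an affine transformation of 2D coordinates followed by sampling and interpolation) can be illustrated with a small pure-Python sketch. It assumes a single-channel feature map stored as a list of rows, a 2x3 affine matrix acting on coordinates normalized to [-1, 1] as in the STN convention, bilinear interpolation, and zero padding outside the map. The function names are hypothetical, and this is not the authors' implementation.

```python
import math


def bilinear_sample(fmap, x, y):
    """Bilinearly sample fmap at real-valued pixel position (x, y).

    Positions outside the map contribute zero (zero padding)."""
    h, w = len(fmap), len(fmap[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= xx < w and 0 <= yy < h:
                weight = (1 - abs(x - xx)) * (1 - abs(y - yy))
                value += weight * fmap[yy][xx]
    return value


def affine_warp(fmap, A):
    """Warp fmap with a 2x3 affine matrix A on normalized coordinates.

    For every output pixel, A maps its normalized coordinates to the source
    location that is then sampled. Requires at least 2 rows and 2 columns."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            xt = 2 * c / (w - 1) - 1  # normalized output x in [-1, 1]
            yt = 2 * r / (h - 1) - 1  # normalized output y in [-1, 1]
            xs = A[0][0] * xt + A[0][1] * yt + A[0][2]
            ys = A[1][0] * xt + A[1][1] * yt + A[1][2]
            px = (xs + 1) * (w - 1) / 2  # back to pixel coordinates
            py = (ys + 1) * (h - 1) / 2
            out[r][c] = bilinear_sample(fmap, px, py)
    return out


fmap = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[1, 0, 0], [0, 1, 0]]
quarter_turn = [[0, -1, 0], [1, 0, 0]]  # a 90-degree rotation of the grid
print(affine_warp(fmap, identity))
print(affine_warp(fmap, quarter_turn))
```

Because the six entries of the matrix fully determine the warp, a small network that outputs six numbers is enough to parameterize it, which matches the final layer size of 6 reported for the pose discrepancy-aware network.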


It is worth noting that the main purpose of the pose discrepancy STN is not to generate a high-quality SAR image that simulates the special imaging mechanism, but to impose an explicit model-induced learning bias on the latent pose factors for disentanglement.
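The cross-reconstruction idea behind the pose regularizer can be sketched as a composition of three callables: estimate the discrepancy parameters from two pose features, warp the first pose feature, and decode it together with the identity feature of the second target. In the hedged Python sketch below, all three callables are toy stand-ins (a cyclic shift plays the role of the geometric warp and a scalar multiplication plays the role of the decoder); only the composition pattern is meant to mirror the text, and every name is hypothetical.

```python
def mean_absolute_error(a, b):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


def pose_discrepancy_loss(x_j, z_p_i, z_p_j, z_y_j,
                          estimate_params, warp, decode):
    """Error of representing target j from the warped pose of target i.

    estimate_params, warp and decode are supplied by the caller; in a real
    model they would be learned networks and a differentiable warp."""
    params = estimate_params(z_p_i, z_p_j)
    z_p_warped = warp(z_p_i, params)
    x_j_hat = decode(z_p_warped, z_y_j)
    return mean_absolute_error(x_j, x_j_hat)


# Toy stand-ins: pose is a one-hot position, identity is a scalar amplitude.
def toy_estimate(z_a, z_b):
    return (z_b.index(max(z_b)) - z_a.index(max(z_a))) % len(z_a)


def toy_warp(z, shift):
    return z[-shift:] + z[:-shift] if shift else list(z)


def toy_decode(z_p, z_y):
    return [z_y * v for v in z_p]


z_p_i, z_p_j, z_y_j = [1, 0, 0, 0], [0, 0, 1, 0], 3.0
x_j = [0.0, 0.0, 3.0, 0.0]
with_warp = pose_discrepancy_loss(x_j, z_p_i, z_p_j, z_y_j,
                                  toy_estimate, toy_warp, toy_decode)
without_warp = mean_absolute_error(x_j, toy_decode(z_p_i, z_y_j))
print(with_warp, without_warp)
```

In this toy case the loss vanishes only when the estimated shift aligns the two poses, which is the kind of learning bias the regularizer is meant to impose.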

(a) Overall architecture for model training
(b) Illustration of feature disentanglement via pose discrepancy STN
Fig. 1: The framework of the proposed feature disentangling model.

II-C Amortized Inference and Overall Architecture

In the above subsection, we developed a novel model (1) for disentangling identity and pose features. We elaborated two regularization functions (2) and (3) to inject the identity and pose information into the features $\mathbf{z}_y$ and $\mathbf{z}_p$, respectively. More specifically, Eq. (2) exploits a task-induced regularization on each $\mathbf{z}_y$ to make it more discriminative, while Eq. (3) involves a geometric transformation model to characterize the pose discrepancy between two targets. However, directly solving the inverse problem (1) with a general optimization algorithm is intractable and time-consuming. Inspired by recent amortized inference [2] and our previous research [23, 24], we utilize an encoder-decoder architecture for feature inference and parameter learning in an end-to-end learning pipeline. To this end, we design an encoder $E_{\omega}$ parameterized by a three-layer CNN in the hope of directly outputting the estimated feature maps as $(\mathbf{z}_p, \mathbf{z}_y) = E_{\omega}(\mathbf{x})$, where $\omega$ contains the parameters of $E_{\omega}$ to be learned. In this way, the identity and pose features in terms of model (1) can be obtained efficiently with low computational complexity. The pose discrepancy-aware network is a three-layer fully connected network whose hidden unit numbers are 60, 30, and 6, respectively. The overall architecture termed DistSTN is illustrated in Fig. 1(a), and the final optimization problem is summarized as:

$$\min_{\omega,\theta,\phi,\psi}\; \sum_{(i,j)} \Big[ \sum_{k\in\{i,j\}} \mathcal{L}\big(\mathbf{x}^k, g_{\theta}(\mathbf{z}_p^k, \mathbf{z}_y^k)\big) + \lambda_1 \sum_{k\in\{i,j\}} \mathcal{R}_y(\mathbf{z}_y^k) + \lambda_2\, \mathcal{R}_p(\mathbf{z}_p^i, \mathbf{z}_p^j) \Big], \qquad (\mathbf{z}_p^k, \mathbf{z}_y^k) = E_{\omega}(\mathbf{x}^k) \tag{4}$$

where $\lambda_1$ and $\lambda_2$ are two hyper-parameters for balance. As shown in Fig. 1(a), DistSTN is a double-input CNN that accepts two targets from arbitrary classes. It follows that we can generate at most $P_N^2 = N(N-1)$ target pairs for model learning, where $P$ is the permutation operator and $N$ counts the total number of training samples. In this regard, DistSTN is more appropriate for learning with limited training samples. In the testing phase, we can simply remove the target transformer module from DistSTN and feed the query sample into the encoder and target recognition module to output its identity label.
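The pair count mentioned above follows from counting ordered pairs of distinct training samples: with N samples there are N(N-1) of them. A short Python check, using a hypothetical toy size of N = 5, is given below.

```python
from itertools import permutations


def ordered_pairs(n_samples):
    """All ordered pairs (i, j) of distinct sample indices."""
    return list(permutations(range(n_samples), 2))


n = 5  # hypothetical toy training-set size
pairs = ordered_pairs(n)
print(len(pairs), n * (n - 1))
```

Order matters here because the pair (i, j) asks the model to reconstruct target j from the pose of target i, which is a different training signal from the pair (j, i).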

III Experiments

In this section, we carry out several experiments on the MSTAR database to validate the performance of the proposed DistSTN for PAA-ATR. The parameters of the networks are initialized in the default way without pre-training. We apply weight decay regularization to most of the network parameters. The optimizer of DistSTN is stochastic gradient descent (SGD) with a constant learning rate and momentum (see footnote 1). The two balancing hyper-parameters $\lambda_1$ and $\lambda_2$ are determined via cross-validation with grid search. The loss function $\mathcal{L}$ is chosen as the mean absolute error. We use early stopping to control the training procedure: the model achieving the best performance on a validation set is restored for testing. We conduct all the experiments five times on a workstation with a single RTX 2070-Super GPU using the TensorFlow 2.3 library.

Footnote 1: It is empirically found that using the Adam optimizer can obviously achieve a much higher recognition accuracy for most of the compared algorithms.
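The early-stopping rule described above (keep the model with the best validation performance and stop once it no longer improves) can be sketched as follows. The `patience` parameter and the scripted validation scores are hypothetical; the paper does not state how many non-improving epochs are tolerated.

```python
def early_stopping(val_scores, patience):
    """Scan per-epoch validation scores and stop after `patience`
    consecutive epochs without improvement.

    Returns (best_epoch, best_score, epochs_run), with epochs 0-indexed."""
    best_epoch, best_score, waited = -1, float("-inf"), 0
    epochs_run = 0
    for epoch, score in enumerate(val_scores):
        epochs_run = epoch + 1
        if score > best_score:
            # New best model: this is the checkpoint restored for testing.
            best_epoch, best_score, waited = epoch, score, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_score, epochs_run


# Hypothetical validation accuracies for five epochs.
result = early_stopping([0.50, 0.60, 0.58, 0.59, 0.61], patience=2)
print(result)
```

In a real training loop the scores would come from evaluating the model after each epoch, and the weights saved at `best_epoch` would be reloaded before testing.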

III-A Database and Comparison Algorithms Introduction

III-A1 Database

The MSTAR database was collected by the Sandia National Laboratory using a Twin Otter SAR sensor operating at X-band. It comprises about ten types of military ground target images taken at multiple depression angles and over densely sampled aspect angles. We crop all input amplitude images to the central target chip to remove the impact of the surrounding background clutter. Following the normal setting in other SAR ATR algorithms [3], the targets taken at depression angles of $17^\circ$ and $15^\circ$ are used for training and testing, respectively. In particular, to verify the performance of DistSTN on PAA-ATR, the detailed training and testing information that differs from the normal setting is summarized in Table I. The training set comprises cooperative and non-cooperative classes to simulate the practical situation. In order to keep the numbers of samples in the cooperative and non-cooperative classes balanced, only half of the cooperative samples in each class are used for training. For all testing samples, the aspect angles are unlimited. Considering the limitations of memory and time cost, we randomly shuffle all training samples twice to generate two training sets, from which the two inputs $\mathbf{x}^i$ and $\mathbf{x}^j$ are jointly sampled for model learning.

Aspect Angle Depression Angle
Training Coop.
Non-Coop. or
TABLE I: Information of Training and Testing Setting for PAA-ATR
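The pairing strategy described in Section III-A1 (shuffle the training set twice and draw the two network inputs jointly from the two shuffled copies) can be sketched in Python as below. The function name, the fixed seed, and the toy sample names are assumptions made for illustration only.

```python
import random


def make_training_pairs(samples, seed=0):
    """Shuffle the training set twice and zip the two copies into pairs.

    This yields one partner per sample (N pairs per pass) instead of
    enumerating all N(N-1) ordered pairs."""
    rng = random.Random(seed)
    first, second = list(samples), list(samples)
    rng.shuffle(first)
    rng.shuffle(second)
    return list(zip(first, second))


toy_samples = ["s0", "s1", "s2", "s3", "s4", "s5"]
toy_pairs = make_training_pairs(toy_samples)
print(toy_pairs)
```

Reshuffling with a different seed at every epoch would expose the model to new pairings over time while keeping the per-epoch cost linear in the number of samples.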

III-A2 Algorithms

We compare several existing SAR ATR algorithms to demonstrate the effectiveness and superiority of our proposal, including the aforementioned support vector machine (SVM) with a Gaussian kernel, SRC [25], and A-ConvNets [3]. Additionally, STN is an insertable module designed to handle rotations among inputs [12]. For comparison, the STN module is inserted into A-ConvNets, yielding A-ConvNets+STN. We also apply the data augmentation trick of generating pseudo targets to form a full aspect angle training set, which is used to train A-ConvNets and yields another variant termed A-ConvNets*. Finally, ResNet-50 and EfficientNet, two state-of-the-art architectures for image classification, are also compared [11, 21].

Fig. 2: Illustration of cross-reconstruction of DistSTN.

III-B Validation of Target Transformation

The most notable contribution of the proposed DistSTN is an equivariant feature disentanglement model that addresses the task of PAA-ATR. In order to capture the pose information of a SAR target in the learned pose feature maps $\mathbf{z}_p$, we develop a pose discrepancy STN in the hope of warping the pose features of one target into those of another target with an explainable geometric transformation model. If the obtained $\mathbf{z}_p$ indeed contains sufficient pose information via DistSTN, the transformed pose feature maps will be able to reconstruct the other target. We therefore design an experiment to validate the effectiveness of the proposed pose discrepancy STN according to its cross-reconstruction results, some of which are visualized in Fig. 2. The images in the first and second rows are the initial inputs $\mathbf{x}^i$ and $\mathbf{x}^j$. The third row illustrates the corresponding reconstruction results of $\mathbf{x}^j$ using its own pose features $\mathbf{z}_p^j$ and identity features $\mathbf{z}_y^j$. The last row depicts the corresponding cross-reconstruction results using the identity features $\mathbf{z}_y^j$ and the transformed pose features of $\mathbf{x}^i$. At first sight, the reconstruction results are very similar to the cross-reconstruction ones. Both can be considered denoised and smoothed versions of the original inputs with the same pose (the target orientation), although the pose discrepancy between $\mathbf{x}^i$ and $\mathbf{x}^j$ is obvious. Therefore, these results demonstrate the effectiveness of the proposed model for pose feature disentanglement.

III-C Validation of PAA-ATR

Methods           5 Non-Coop.   9 Non-Coop.   10 Non-Coop.
SVM               18.26         18.26         18.26
SRC               62.98         63.89         65.31
A-ConvNets        67.29         66.03         64.98
A-ConvNets*       65.35         64.03         63.75
A-ConvNets+STN    68.68         67.70         67.87
ResNet-50         64.98         65.42         66.86
EfficientNet      59.32         56.36         59.88
DistSTN           70.72         68.69         69.16
TABLE II: Recognition accuracy (%) of different algorithms

Finally, DistSTN is compared with the other algorithms on the PAA-ATR task. To validate the performance of discriminative feature disentanglement, different numbers of non-cooperative classes are considered, including 5 (half), 9 (only one cooperative class), and 10 (all classes non-cooperative). The comparison results are summarized in Table II. DistSTN achieves the highest accuracy among all compared algorithms in all three settings, which verifies the superior discrimination and generalization ability of the disentangled features. The accuracy obtained by SVM is particularly low, which implies that this method fails to cope with the unseen angles. A-ConvNets* trained with manually generated pseudo targets is worse than its original counterpart, A-ConvNets. This result reflects that a trick used in RGB image classification may be unsuitable for SAR ATR because of the different imaging mechanisms. From the results of A-ConvNets+STN, we can conclude that the STN module indeed improves the recognition performance of A-ConvNets by involving a self-rotational transformation of feature maps. However, due to different mechanisms and motivations, the generalization ability of STN is still weaker than that of our proposed DistSTN, and STN remains less effective at addressing such an o.o.d classification task.
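To quantify the comparison, the short Python snippet below computes, for each setting, the strongest baseline and the margin by which DistSTN exceeds it, using only the accuracies listed in Table II. The dictionary layout and variable names are illustrative choices, and the starred row is assumed to be the augmentation-trained variant described in the text.

```python
# Accuracies (%) from Table II for 5, 9 and 10 non-cooperative classes.
accuracy = {
    "SVM": (18.26, 18.26, 18.26),
    "SRC": (62.98, 63.89, 65.31),
    "A-ConvNets": (67.29, 66.03, 64.98),
    "A-ConvNets*": (65.35, 64.03, 63.75),
    "A-ConvNets+STN": (68.68, 67.70, 67.87),
    "ResNet-50": (64.98, 65.42, 66.86),
    "EfficientNet": (59.32, 56.36, 59.88),
    "DistSTN": (70.72, 68.69, 69.16),
}

baselines = {k: v for k, v in accuracy.items() if k != "DistSTN"}
best_baseline = []
margins = []
for s in range(3):
    name = max(baselines, key=lambda k: baselines[k][s])
    best_baseline.append(name)
    margins.append(round(accuracy["DistSTN"][s] - baselines[name][s], 2))

print(best_baseline)
print(margins)
```

With these numbers, A-ConvNets+STN is the strongest baseline in all three settings, and DistSTN exceeds it by roughly one to two percentage points.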

IV Conclusion

In this letter, we present an efficient framework termed DistSTN to address the challenging task of PAA-ATR for non-cooperative targets. Instead of pursuing the pose invariant features in the conventional algorithms, DistSTN newly exploits a feature disentangling strategy to separate the pose factors of a target from the identity ones so that they can independently control the representation process of the target. Experimental results demonstrate the superiority of our proposed model on PAA-ATR, which achieves higher recognition accuracy compared to the other ATR algorithms. Future research will consider the potential contribution of those cooperative training samples for knowledge transfer.


  • [1] X. Bai, R. Xue, L. Wang, and F. Zhou (2019) Sequence SAR image classification based on bidirectional convolution-recurrent network. IEEE Trans. Geosci. Remote Sens. 57 (11), pp. 9223–9235.
  • [2] A. J. Bell (1995) An information-maximisation approach to blind separation and blind deconvolution. Neural Computation 7.
  • [3] S. Chen, H. Wang, F. Xu, and Y. Jin (2016) Target classification using the deep convolutional networks for SAR images. IEEE Trans. Geosci. Remote Sens. 54 (8), pp. 4806–4817.
  • [4] S. Deng, L. Du, C. Li, J. Ding, and H. Liu (2017) SAR automatic target recognition based on Euclidean distance restricted autoencoder. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 10 (7), pp. 3323–3333.
  • [5] J. Ding, B. Chen, H. Liu, and M. Huang (2016) Convolutional neural network with data augmentation for SAR target recognition. IEEE Geosci. Remote Sens. Lett. 13 (3), pp. 364–368.
  • [6] G. Dong, G. Kuang, N. Wang, L. Zhao, and J. Lu (2015) SAR target recognition via joint sparse representation of monogenic signal. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 8 (7), pp. 3316–3328.
  • [7] G. Dong and G. Kuang (2016) SAR target recognition via sparse representation of monogenic signal on Grassmann manifolds. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 9 (3), pp. 1308–1319.
  • [8] K. El-Darymli, E. W. Gill, P. Mcguire, D. Power, and C. Moloney (2016) Automatic target recognition in synthetic aperture radar imagery: a state-of-the-art review. IEEE Access 4, pp. 6014–6058.
  • [9] M. Elad, P. Milanfar, and R. Rubinstein (2007) Analysis versus synthesis in signal priors. Inverse Problems 23 (3), pp. 1–5.
  • [10] R. Geirhos, J. Jacobsen, C. Michaelis, et al. (2020) Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, pp. 665–673.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, pp. 770–778.
  • [12] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28, pp. 2017–2025.
  • [13] E. R. Keydel, S. W. Lee, and J. T. Moore (1996) MSTAR extended operating conditions: a tutorial. Proc. SPIE.
  • [14] Y. Ma, S. Soatto, J. Kosecka, and S. S. Sastry (2004) An invitation to 3-D vision: from images to geometric models. Springer-Verlag New York.
  • [15] S. Niu, X. Qiu, B. Lei, C. Ding, and K. Fu (2020) Parameter extraction based on deep neural network for SAR target simulation. IEEE Trans. Geosci. Remote Sens. 58 (7), pp. 4901–4914.
  • [16] L. M. Novak, G. J. Owirka, W. S. Brower, and A. L. Weaver (1997) The automatic target-recognition system in SAIP. Lincoln Laboratory Journal 10 (2).
  • [17] J. Pei, Y. Huang, W. Huo, Y. Zhang, J. Yang, and T. Yeo (2018) SAR automatic target recognition based on multiview deep learning framework. IEEE Trans. Geosci. Remote Sens. 56 (4), pp. 2196–2210.
  • [18] L. C. Potter and R. L. Moses (1997) Attributed scattering centers for SAR ATR. IEEE Trans. Image Process. 6 (1), pp. 79–91.
  • [19] U. Schmidt and S. Roth (2012) Learning rotation-aware features: from invariant priors to equivariant descriptors. In IEEE Conf. Computer Vision and Pattern Recognition, pp. 2050–2057.
  • [20] Q. Song and F. Xu (2017) Zero-shot learning of SAR target feature space with deep generative neural networks. IEEE Geosci. Remote Sens. Lett. 14 (12), pp. 2245–2249.
  • [21] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, California.
  • [22] Y. Wang, P. Han, X. Lu, R. Wu, and J. Huang (2006) The performance comparison of AdaBoost and SVM applied to SAR ATR. In CIE Int. Conf. Radar, pp. 1–4.
  • [23] Z. Wen, B. Hou, and L. Jiao (2017) Discriminative nonlinear analysis operator learning: when cosparse model meets image classification. IEEE Trans. Image Process. 26 (7), pp. 3449–3462.
  • [24] Z. Wen, B. Hou, Q. Wu, and L. Jiao (2018) Discriminative feature learning for real-time SAR automatic target recognition with the nonlinear analysis cosparse model. IEEE Geosci. Remote Sens. Lett. 15 (7), pp. 1045–1049.
  • [25] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma (2009) Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 31 (2), pp. 210–227.
  • [26] H. Zhang, N. M. Nasrabadi, Y. Zhang, and T. S. Huang (2012) Multi-view automatic target recognition using joint sparse representation. IEEE Trans. Aerosp. Electron. Syst. 48 (3), pp. 2481–2497.
  • [27] F. Zhou, L. Wang, X. Bai, and Y. Hui (2018) SAR ATR of ground vehicles based on LM-BN-CNN. IEEE Trans. Geosci. Remote Sens. 56 (12), pp. 7282–7293.
  • [28] J. Zhou, S. Zhiguang, C. Xiao, and Q. Fu (2011) Automatic target recognition of SAR images based on global scattering center model. IEEE Trans. Geosci. Remote Sens. 49 (10), pp. 3713–3729.