I Introduction
Automatic target recognition (ATR) is one of the ultimate goals in the field of synthetic aperture radar (SAR) and has attracted much attention in civil and military reconnaissance and surveillance for years [16, 13, 8]. In contrast to optical sensors, the special electromagnetic imaging mechanism of SAR makes the resulting image a reconstruction of the specular backscattering of the illuminated target [15]. Therefore, SAR actually "sees" certain physical structures of a target, whose scattering signatures in the resulting SAR image are highly sensitive to its pose. Crucially, the profile and the positions of the strong scattering points of the target vary considerably as the viewing angle of the SAR platform changes, which makes SAR ATR more challenging than the general recognition task with optical sensor data.
To tackle this issue, an intuitive solution in conventional SAR ATR algorithms is to uniformly collect the distributions of strong scattering points of the target over the full (0°–360°) range of aspect angles [18, 28], which are subsequently assembled into a feature template in the hope of recording all the scattering signatures in different poses. The identity of a query test target can then be determined by pairwise class-template matching. However, this approach is not robust to disturbances such as speckle noise and motion ambiguity, which heavily degrade its practical recognition performance. To improve robustness, several handcrafted image feature extractors, such as the scale-invariant feature transform (SIFT), focus on pose-invariant features that describe intrinsic visual signatures of the target. These visual features are normally more robust and achieve better generalization when paired with a suitable discriminative classifier such as the support vector machine (SVM) [22]. To further improve the discrimination and adaptivity of the features, many machine learning-based algorithms have gradually been devoted to SAR ATR [26, 25, 7, 6, 24, 1, 3, 4, 5, 20, 27], among which the most notable model is the convolutional neural network (CNN). Distinct from the above feature engineering algorithms, the core idea of CNN-based algorithms for addressing pose sensitivity is to fit a set of rotating target images, together with their class labels, with a deep neural network through which all intra-class rotating targets are mapped to the same class label. In this way, label-invariant features are expected to be obtained through discriminative learning, regardless of their variance in aspect angle, and the recognition performance can be remarkably improved in terms of both accuracy and efficiency.
The success of the above learning models, especially CNNs, owes to the availability of a large number of training samples covering sufficient target patterns, which ensures distribution consistency between the training and testing samples. However, in practical SAR ATR applications, it is intractable to collect sufficient non-cooperative target samples in different poses for training. In general, only a small number of targets in partial aspect angles can be acquired by reconnaissance. This leads to the so-called out-of-distribution (o.o.d.) classification problem, in which the distributions of the training and testing samples differ [10]. In this case, the recognition performance of the above SAR ATR algorithms decreases dramatically due to the higher demand on generalization ability [10].
An intuitive way to overcome this problem in the computer vision community would be to manually generate pseudo targets in full aspect angles for data augmentation and for distribution completion and alignment. We, however, empirically found that this trick is invalid for SAR data, possibly because of ambiguous artifacts caused by nonlinear pixel interpolation: the interpolated pixels do not correctly recover the actual physical scattering signature of the corresponding real target, so the generated pseudo samples provide no more discriminative information than the original ones
[27, 17, 19]. Motivated by the above practical o.o.d. scenario, this paper develops a pose-discrepancy spatial-transformer-based feature disentangling framework (DistSTN) for partial-aspect-angle SAR target recognition (PAA-ATR). Instead of learning pose-invariant features, DistSTN introduces an elaborated feature disentangling model that separates the learned pose factors of a SAR target from the identity ones so that they can independently control the representation process of the target image, yielding better generalization ability. To disentangle these explainable factors, a pose discrepancy spatial transformer module is developed in DistSTN. It characterizes the intrinsic transformation between the factors of two different targets with a regularization induced by an explicit geometric model. Furthermore, DistSTN adopts an amortized inference scheme that enables efficient feature extraction and recognition using an encoder-decoder mechanism. Experimental results on the MSTAR benchmark demonstrate that our framework achieves better recognition accuracy on the PAA-ATR task. The rest of this paper is organized as follows: Section II proposes the main framework; Section III describes our experiments to validate performance; and Section IV summarizes our work and suggests future directions.
II Framework Presentation
In this section, we first formulate the problem of PAA-ATR and analyze some insights to shed light on our solution. Next, we develop a spatial-transformer-based feature disentanglement model to address this task. Finally, we propose an encoder-decoder architecture for amortized inference and model learning.
II-A Problem Formulation of PAA-ATR and Insight
Let {(x_n^c, y_n^c)} be labeled SAR targets drawn from the c-th class, where x_n^c and y_n^c stand for the SAR target image and its identity label vector, respectively, and C is the total number of target categories. The ultimate goal of ATR is to predict the label vector of a new query sample according to these training data. General SAR ATR algorithms implicitly assume that the aspect angles of the training and testing samples identically and uniformly reside in the range [0°, 360°), which is intractable for non-cooperative targets. Alternatively, this paper considers a more difficult but practical task termed PAA-ATR. It assumes that the aspect angles of the training samples from at least one class are incomplete and limited to a partial subrange of [0°, 360°), while those of the testing samples span the full [0°, 360°), yielding an o.o.d. scenario.

Due to the electromagnetic imaging mechanism of the SAR sensor, the scattering appearance of an illuminated target is severely sensitive to the relative pose between the target and the sensor. As a result, the difference between the training and testing samples in PAA-ATR is aggravated in comparison with the general ATR task: the latent factors accounting for pose and identity are entangled in the image domain. To alleviate this challenge, conventional algorithms exploit physics-driven or geometry-guided methods to design rotation-invariant features in a handcrafted way. Alternatively, learning-based models such as CNNs train a deep network that maps all intra-class targets with different poses to the same label vector in a supervised way. In this manner, the intermediate features account only for the identity label and ignore the intra-class variances, achieving much better performance, but only under the assumption of distribution consistency between the training and testing samples. Since a general CNN involves no explicit spatial geometric transformation, it can only achieve local spatial pose invariance through a deep hierarchy of max-pooling and convolution layers. Consequently, its generalization ability for PAA-ATR is weak.
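To make the o.o.d. setting concrete, the following is a minimal sketch (not the paper's actual data pipeline) of how a PAA-ATR style split can be constructed; the function name, the 120° cutoff, and the toy labels are illustrative assumptions, since the actual partial ranges are not specified here.

```python
import numpy as np

def partial_aspect_split(angles_deg, labels, noncoop_classes, train_range=(0.0, 120.0)):
    """Build a PAA-ATR style training index: non-cooperative classes keep only
    samples whose aspect angle lies in a partial range, while cooperative
    classes contribute all aspect angles; testing samples (handled elsewhere)
    span the full 0-360 degrees, creating the train/test o.o.d. gap."""
    angles = np.asarray(angles_deg) % 360.0
    labels = np.asarray(labels)
    lo, hi = train_range
    in_range = (angles >= lo) & (angles < hi)
    noncoop = np.isin(labels, list(noncoop_classes))
    keep = ~noncoop | in_range   # drop out-of-range non-cooperative samples
    return np.where(keep)[0]

# Toy example: class 1 is non-cooperative, so its 200-degree sample is dropped.
idx = partial_aspect_split([10.0, 200.0, 95.0], [0, 1, 1], {1}, (0.0, 120.0))
```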
II-B The Framework for Feature Disentangling
According to the above analysis, the critical issue of the CNN-based model is its lack of rotation awareness: the model does not understand the physical and semantic concept of target rotation. In this sense, it is intractable for a CNN-based model to generalize the training rotational patterns to unseen ones in the testing phase of PAA-ATR. To tackle this issue, our core idea focuses on equivariant feature disentanglement instead of pursuing pose-invariant features for discrimination. In contrast to the general CNN model for discriminative learning, we develop a generative feature learning model containing a geometric transformation module. It explicitly characterizes and disentangles the pose features from the identity ones. Through this separation, the discrimination of the remaining identity features is expected to be enhanced without interference.
To this end, let z_a and z_i be the latent pose (relative to the radar) and identity factors, respectively. Considering a generative learning model, a SAR target image x is modeled and represented by z_i and z_a through a nonlinear parametric function as x = f_θ(z_i, z_a), where θ contains the model parameters controlling the unknown, intricate SAR imaging process. Based on this generative model, the task of feature disentanglement is the inverse problem of extracting z_i and z_a from x, which can usually be addressed by a maximum a posteriori (MAP) estimator given by:

(ẑ_i, ẑ_a) = argmin_{z_i, z_a} ℓ(x, f_θ(z_i, z_a)) + λ_1 Ω_i(z_i) + λ_2 Ω_a(z_a),   (1)

where Ω_i and Ω_a are two elaborated regularization functions on z_i and z_a that encode our prior preference for disentangling, and ℓ(·, ·) measures the representation error. Note that Eq. (1) originates from the idea of independent component analysis (ICA) for source separation [2]. Therefore, the critical issues in Eq. (1) are to design the two regularization functions for effective disentanglement and to develop an efficient feature inference process to solve the optimization.

For the first regularizer Ω_i, it can be designed as the following task-induced function (2) with a supervised analysis prior [9], which forces z_i to contain sufficient discriminative information for correct recognition:

Ω_i(z_i) = -log p(y | z_i) = -Σ_{c=1}^{C} 1[y_c = 1] log softmax(W z_i + b)_c,   (2)

where -log p(y | z_i) is the negative log-likelihood of y given z_i induced from the categorical distribution, 1[·] is the indicator function, W z_i + b is a simple affine transformation of z_i followed by a softmax function, and the subscript c represents the corresponding value at the c-th index.
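Assuming the affine-plus-softmax form just described, the task-induced regularizer reduces to the standard softmax cross-entropy. A small numpy sketch (variable names are ours, not the paper's):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())          # shift for numerical stability
    return e / e.sum()

def omega_identity(z_i, y_onehot, W, b):
    """Task-induced regularizer of Eq. (2): negative log-likelihood of the
    label under a softmax over an affine transform of the identity feature."""
    p = softmax(W @ z_i + b)
    c = int(np.argmax(y_onehot))     # index where the indicator equals 1
    return -np.log(p[c])
```

With zero weights the softmax is uniform over C classes, so the loss equals log C, which is a quick sanity check on the implementation.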
For the second regularizer Ω_a, it initially attempts to characterize the factors accounting for the target pose relative to the sensor, including the azimuth angle, the depression angle, and other positional factors of the sensor. Nevertheless, it is intractable to label all these factors explicitly and exactly for all training targets, so we cannot exploit the above strategy to model Ω_a in a discriminative way. Alternatively, we propose a novel self-supervised task of target cross-transformation to model the pairwise pose discrepancy between two targets with an explicit geometric model. Formally, let z_a^(1) and z_a^(2) be the pose factors of x^(1) and x^(2), respectively. If they indeed capture the entire pose information, there will be a geometrically explicit operator T_Δ warping z_a^(1) to z_a^(2) and vice versa. More importantly, the parameters Δ will have a clear physical meaning, measuring the pose discrepancy between the two targets without being influenced by the other shared sensor factors. According to the 2D rigid-body geometric transformation model [14], the warping function T_Δ is essentially an affine transformation of the 2D coordinates of the input feature maps, followed by the sampling and interpolation processes also exploited in the spatial transformer network (STN) [12]. If Δ can be correctly estimated and assigned, T_Δ(z_a^(1)) will be equal to z_a^(2). Therefore, T_Δ(z_a^(1)) can be further exploited to represent x^(2) in conjunction with z_i^(2) and f_θ. In this sense, we can exploit a pose-discrepancy-aware network g_ω to estimate the parameters as Δ = g_ω(z_a^(1), z_a^(2)), where ω contains its parameters. g_ω and T_Δ constitute the designed pose discrepancy STN illustrated in Fig. 1(b). According to this model, Ω_a is designed as (3) to measure the error of representing x^(2) with z_i^(2) and T_Δ(z_a^(1)), without external supervised pose information:

Ω_a(z_a^(1), z_a^(2)) = ℓ(x^(2), f_θ(z_i^(2), T_Δ(z_a^(1)))).   (3)
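The warping described above, an affine transform of normalized 2D coordinates followed by bilinear sampling as in STN [12], can be sketched in numpy as follows. This is a single-channel illustration under our own variable names, not the paper's implementation:

```python
import numpy as np

def affine_warp(fmap, theta):
    """Map each output coordinate through a 2x3 affine matrix (normalized
    coordinates in [-1, 1]) and bilinearly sample the input feature map,
    as in the spatial transformer network."""
    H, W = fmap.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])   # homogeneous coords
    sx, sy = theta @ grid                                        # source coordinates
    px = (sx + 1) * (W - 1) / 2                                  # back to pixel indices
    py = (sy + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(px).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(py).astype(int), 0, H - 2)
    wx = np.clip(px - x0, 0, 1)                                  # bilinear weights
    wy = np.clip(py - y0, 0, 1)
    f = fmap
    out = ((1 - wy) * (1 - wx) * f[y0, x0] + (1 - wy) * wx * f[y0, x0 + 1]
           + wy * (1 - wx) * f[y0 + 1, x0] + wy * wx * f[y0 + 1, x0 + 1])
    return out.reshape(H, W)

# The identity affine matrix should reproduce the input feature map.
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
```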
It is worth noting that the main purpose of the pose discrepancy STN is not to generate a high-quality SAR image simulating the special SAR imaging mechanism, but to impose an explicit model-induced learning bias on the latent pose factors for disentanglement.
II-C Amortized Inference and Overall Architecture
In the above subsection, we developed a novel model (1) for identity and pose feature disentanglement and elaborated two regularization functions (2) and (3) to inject the identity and pose information into the features z_i and z_a, respectively. More specifically, Eq. (2) exploits a task-induced regularization on each z_i to make it more discriminative, while Eq. (3) involves a geometric transformation model to characterize the pose discrepancy between two targets. However, directly solving the inverse problem (1) with a general optimization algorithm is intractable and time-consuming. Inspired by recent amortized inference [2] and our previous research [23, 24], we utilize an encoder-decoder architecture for feature inference and parameter learning in an end-to-end pipeline. To this end, we design an encoder E_φ parameterized by a three-layer CNN to directly output the estimated feature maps as (z_i, z_a) = E_φ(x), where φ contains the parameters of E_φ to be learned. In this way, the identity and pose features of model (1) can be efficiently obtained with low computational complexity. The pose-discrepancy-aware network g_ω is a three-layer fully connected network whose hidden unit numbers are 60, 30, and 6, respectively. The overall architecture, termed DistSTN, is illustrated in Fig. 1(a), and the final optimization problem is summarized as:
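As a rough illustration of the stated pose-discrepancy-aware network (hidden sizes 60, 30, and 6), the sketch below builds a small fully connected network whose six outputs parameterize a 2×3 affine matrix. The input dimension, ReLU activations, and initialization scheme are assumptions, since they are not specified here:

```python
import numpy as np

def init_gomega(in_dim, rng):
    """Three-layer fully connected network with 60, 30 and 6 units, as stated
    for the pose-discrepancy-aware network; the final 6 outputs parameterize
    the 2x3 affine warp matrix."""
    sizes = [in_dim, 60, 30, 6]
    return [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def gomega(params, za1, za2):
    h = np.concatenate([za1, za2])       # pair of pose features as input
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)       # ReLU on hidden layers (assumed)
    return h.reshape(2, 3)               # parameters of the affine warp

rng = np.random.default_rng(0)
params = init_gomega(2 * 8, rng)         # e.g. 8-dim pose features (assumed)
delta = gomega(params, np.ones(8), np.zeros(8))
```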
min_{θ, φ, ω} Σ_{(x^(1), x^(2))} ℓ(x^(1), f_θ(E_φ(x^(1)))) + λ_1 Ω_i(z_i^(1)) + λ_2 Ω_a(z_a^(1), z_a^(2)),   (4)

where λ_1 and λ_2 are two hyperparameters for balance. As Fig. 1(a) shows, DistSTN is a double-input CNN that takes two targets from arbitrary classes. It follows that we can generate at most P(N, 2) = N(N - 1) target pairs for model learning, where P is the permutation operator and N counts the total number of training samples. In this regard, DistSTN is more appropriate for learning with limited training samples. In the testing phase, we simply remove the target transformer module from DistSTN and feed the query sample into the encoder-target recognition module to output its identity label.
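The pair-generation step can be sketched directly with ordered permutations, matching the P(N, 2) = N(N - 1) count:

```python
from itertools import permutations

def training_pairs(samples):
    """All ordered pairs of distinct training samples: P(N, 2) = N(N - 1)."""
    return list(permutations(samples, 2))

pairs = training_pairs(["a", "b", "c"])   # 3 * 2 = 6 ordered pairs
```

Because pairs are ordered, each sample serves both as the warp source x^(1) and the warp target x^(2), which is what makes the pair count quadratic in N and helps when training samples are limited.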
III Experiments
In this section, we carry out several experiments on the MSTAR database to validate the performance of the proposed DistSTN for PAA-ATR. The network parameters are initialized in a default way without pretraining. We exploit weight-decay regularization on the model parameters. The optimizer of DistSTN is chosen as stochastic gradient descent (SGD) with a constant learning rate and momentum.¹ The two hyperparameters λ_1 and λ_2 are determined via cross-validation with a grid search. The loss function ℓ is chosen as the mean absolute error. We exploit the early-stopping trick to control the training procedure: the model achieving the best performance on a validation set is restored for testing. We conduct all the experiments five times on a workstation with a single RTX 2070 Super GPU using the TensorFlow 2.3 library.

¹It is empirically found that using the Adam optimizer can achieve an obviously higher recognition accuracy for most of the compared algorithms.
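The mean-absolute-error loss and the early-stopping bookkeeping described above (track the best validation loss and restore that epoch's model for testing) can be sketched as follows; the patience value is an assumption:

```python
import numpy as np

def mae(x, x_hat):
    """Mean absolute error, the representation loss used for training."""
    return float(np.mean(np.abs(np.asarray(x) - np.asarray(x_hat))))

class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs
    and remember which epoch's weights should be restored for testing."""
    def __init__(self, patience=10):
        self.patience = patience
        self.best = np.inf
        self.best_epoch = -1
        self.wait = 0

    def step(self, epoch, val_loss):
        if val_loss < self.best:
            self.best, self.best_epoch, self.wait = val_loss, epoch, 0
        else:
            self.wait += 1
        return self.wait >= self.patience   # True -> stop training
```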
III-A Database and Comparison Algorithms
III-A1 Database
The MSTAR database was collected by Sandia National Laboratory using a Twin Otter SAR sensor operating at X-band. It comprises about ten types of military ground target images taken at multiple depression angles and at aspect angles sampled with an approximately uniform interval. We crop all input amplitude images to the central target chip to remove the impact of the surrounding background clutter. Following the normal setting of other SAR ATR algorithms [3], the targets taken at two different depression angles are used for training and testing, respectively. In particular, to verify the performance of DistSTN on PAA-ATR, the training and testing configuration, which differs from the normal setting, is summarized in Table I. The training set comprises cooperative and non-cooperative classes to simulate the practical situation. To keep the numbers of samples in the cooperative and non-cooperative classes balanced, only half of the cooperative samples in each class are used for training. For all testing samples, the aspect angles are unlimited. Considering the limitations of computation memory and time cost, we randomly shuffle all training samples twice to generate two training sets from which x^(1) and x^(2) are jointly sampled for model learning.
  Aspect Angle  Depression Angle
Training  Coop.
  Non-Coop.  or
Testing
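The central-chip preprocessing mentioned above can be sketched as a simple center crop; the chip size shown is an illustrative assumption, since the actual value was elided from this text:

```python
import numpy as np

def center_crop(img, size):
    """Crop the central `size` x `size` chip from an amplitude image so that
    the surrounding background clutter is discarded."""
    h, w = img.shape
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]

chip = center_crop(np.zeros((128, 128)), 64)   # e.g. a 64x64 chip (assumed size)
```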
III-A2 Algorithms
We compare several existing SAR ATR algorithms to demonstrate the effectiveness and superiority of our proposal, including the aforementioned support vector machine (SVM) with a Gaussian kernel, SRC [25], and AConvNets [3]. Additionally, STN is a plug-in module designed to handle rotations among inputs [12]; for comparison, the STN module is inserted into AConvNets, yielding AConvNets+STN. We also exploit the data augmentation trick of generating pseudo targets to form a full-aspect-angle training set; it is used to train AConvNets, yielding another variant termed AConvNets*. Finally, ResNet50 and EfficientNet, two state-of-the-art architectures for image classification, are also compared [11, 21].
III-B Validation of Target Transformation
For the proposed DistSTN, the most notable contribution is the equivariant feature disentanglement model addressing the task of PAA-ATR. To capture the pose information of a SAR target in the learned feature maps z_a, we develop a pose discrepancy STN in the hope of warping the pose features of one target into those of another with an explainable geometric transformation model. If the obtained z_a indeed contains sufficient pose information via DistSTN, the transformed pose feature maps will be able to reconstruct the paired target. We therefore design an experiment to validate the effectiveness of the proposed pose discrepancy STN according to its cross-reconstruction results, visualized in Fig. 2. The images in the first and second rows are the initial inputs x^(1) and x^(2). The third row illustrates the corresponding reconstruction results using each target's own features. The last row depicts the cross-reconstruction results using the identity features together with the transformed pose features. At first sight, we can see that the reconstruction results are very similar to the cross-reconstruction ones. Both can be considered denoised and smoothed versions of the original inputs with the same pose (target orientation), even though the pose discrepancy between x^(1) and x^(2) is obvious. Therefore, these results clearly demonstrate the effectiveness of the proposed model for pose feature disentanglement.
III-C Validation of PAA-ATR
Methods  5 Non-Coop.  9 Non-Coop.  10 Non-Coop.
SVM  18.26  18.26  18.26
SRC  62.98  63.89  65.31
AConvNets  67.29  66.03  64.98
AConvNets*  65.35  64.03  63.75
AConvNets+STN  68.68  67.70  67.87
ResNet50  64.98  65.42  66.86
EfficientNet  59.32  56.36  59.88
DistSTN  70.72  68.69  69.16
Finally, DistSTN is compared with the other algorithms on the PAA-ATR task. To validate the performance of discriminative feature disentanglement, different numbers of non-cooperative classes are considered: 5 (half of the classes), 9 (only one cooperative class), and 10 (all classes non-cooperative). The comparison results are summarized in Table II. DistSTN achieves the highest accuracy among all compared algorithms in all three settings, which clearly verifies the superior discrimination and generalization ability of the disentangled features. The accuracy obtained by SVM is particularly low, which implies that this method fails to cope with unseen angles. AConvNets trained with manually generated pseudo targets performs worse than its original counterpart, which reflects that this trick, common in RGB image classification, may be unsuitable for SAR ATR because of the different imaging mechanism. From the results of AConvNets+STN, we can conclude that the STN module can indeed improve the recognition performance of AConvNets by involving a self-rotational transformation of feature maps. However, due to its different mechanism and motivation, the generalization ability of STN is still weaker than that of our proposed DistSTN, and STN remains less efficient at addressing such an o.o.d. classification task.
IV Conclusion
In this letter, we presented an efficient framework termed DistSTN to address the challenging task of PAA-ATR for non-cooperative targets. Instead of pursuing pose-invariant features as in conventional algorithms, DistSTN exploits a feature disentangling strategy that separates the pose factors of a target from the identity ones so that they can independently control the representation process of the target. Experimental results demonstrate the superiority of our proposed model on PAA-ATR, achieving higher recognition accuracy than the other ATR algorithms. Future research will consider the potential contribution of cooperative training samples for knowledge transfer.
References
[1] (2019) Sequence SAR image classification based on bidirectional convolution-recurrent network. IEEE Trans. Geosci. Remote Sens. 57 (11), pp. 9223–9235.
[2] (1995) An information-maximisation approach to blind separation and blind deconvolution. Neural Computation 7.
[3] (2016) Target classification using the deep convolutional networks for SAR images. IEEE Trans. Geosci. Remote Sens. 54 (8), pp. 4806–4817.
[4] (2017) SAR automatic target recognition based on Euclidean distance restricted autoencoder. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 10 (7), pp. 3323–3333.
[5] (2016) Convolutional neural network with data augmentation for SAR target recognition. IEEE Geosci. Remote Sens. Lett. 13 (3), pp. 364–368.
[6] (2015) SAR target recognition via joint sparse representation of monogenic signal. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 8 (7), pp. 3316–3328.
[7] (2016) SAR target recognition via sparse representation of monogenic signal on Grassmann manifolds. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 9 (3), pp. 1308–1319.
[8] (2016) Automatic target recognition in synthetic aperture radar imagery: a state-of-the-art review. IEEE Access 4, pp. 6014–6058.
[9] (2007) Analysis versus synthesis in signal priors. Inverse Problems 23 (3), pp. 1–5.
[10] (2020) Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, pp. 665–673.
[11] (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, pp. 770–778.
[12] (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems, Vol. 28, pp. 2017–2025.
[13] (1996) MSTAR extended operating conditions: a tutorial. Proc. SPIE.
[14] (2004) An Invitation to 3-D Vision: From Images to Geometric Models. Springer-Verlag New York.
[15] (2020) Parameter extraction based on deep neural network for SAR target simulation. IEEE Trans. Geosci. Remote Sens. 58 (7), pp. 4901–4914.
[16] (1997) The automatic target-recognition system in SAIP. Lincoln Laboratory Journal 10 (2).
[17] (2018) SAR automatic target recognition based on multiview deep learning framework. IEEE Trans. Geosci. Remote Sens. 56 (4), pp. 2196–2210.
[18] (1997) Attributed scattering centers for SAR ATR. IEEE Trans. Image Process. 6 (1), pp. 79–91.
[19] (2012) Learning rotation-aware features: from invariant priors to equivariant descriptors. In IEEE Conf. Computer Vision and Pattern Recognition, pp. 2050–2057.
[20] (2017) Zero-shot learning of SAR target feature space with deep generative neural networks. IEEE Geosci. Remote Sens. Lett. 14 (12), pp. 2245–2249.
[21] (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, California.
[22] (2006) The performance comparison of AdaBoost and SVM applied to SAR ATR. In CIE Int. Conf. Radar, pp. 1–4.
[23] (2017) Discriminative nonlinear analysis operator learning: when co-sparse model meets image classification. IEEE Trans. Image Process. 26 (7), pp. 3449–3462.
[24] (2018) Discriminative feature learning for real-time SAR automatic target recognition with the nonlinear analysis co-sparse model. IEEE Geosci. Remote Sens. Lett. 15 (7), pp. 1045–1049.
[25] (2009) Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 31 (2), pp. 210–227.
[26] (2012) Multiview automatic target recognition using joint sparse representation. IEEE Trans. Aerosp. Electron. Syst. 48 (3), pp. 2481–2497.
[27] (2018) SAR ATR of ground vehicles based on LM-BN-CNN. IEEE Trans. Geosci. Remote Sens. 56 (12), pp. 7282–7293.
[28] (2011) Automatic target recognition of SAR images based on global scattering center model. IEEE Trans. Geosci. Remote Sens. 49 (10), pp. 3713–3729.