Dense Nested Attention Network for Infrared Small Target Detection

06/01/2021
by Boyang Li, et al.

Single-frame infrared small target (SIRST) detection aims at separating small targets from clutter backgrounds. With the advances of deep learning, CNN-based methods have yielded promising results in generic object detection due to their powerful modeling capability. However, existing CNN-based methods cannot be directly applied for infrared small targets since pooling layers in their networks could lead to the loss of targets in deep layers. To handle this problem, we propose a dense nested attention network (DNANet) in this paper. Specifically, we design a dense nested interactive module (DNIM) to achieve progressive interaction among high-level and low-level features. With the repeated interaction in DNIM, infrared small targets in deep layers can be maintained. Based on DNIM, we further propose a cascaded channel and spatial attention module (CSAM) to adaptively enhance multi-level features. With our DNANet, contextual information of small targets can be well incorporated and fully exploited by repeated fusion and enhancement. Moreover, we develop an infrared small target dataset (namely, NUDT-SIRST) and propose a set of evaluation metrics to conduct comprehensive performance evaluation. Experiments on both public and our self-developed datasets demonstrate the effectiveness of our method. Compared to other state-of-the-art methods, our method achieves better performance in terms of probability of detection (Pd), false-alarm rate (Fa), and intersection of union (IoU).



I Introduction

Single-frame infrared small target (SIRST) detection is widely used in many applications such as maritime surveillance [1-Maritime-Surveillance], early warning systems [2-early-warning], and precision guidance [3-anti-miss]. Compared to generic object detection, infrared small target detection has several unique characteristics: 1) Small: Due to the long imaging distance, infrared targets are generally small, ranging from one pixel to tens of pixels in the images. 2) Dim: Infrared targets usually have a low signal-to-clutter ratio (SCR) and are easily immersed in heavy noise and clutter backgrounds. 3) Shapeless: Infrared small targets have limited shape characteristics. 4) Changeable: The sizes and shapes of infrared targets vary considerably across scenarios.

Fig. 1: Visual results achieved by Top-Hat [4-tophat], IPI [10-IPI], RIPT [12-RIPT], and our network for different infrared small targets. The correctly detected target, false alarm, and miss detection areas are highlighted by red, yellow, and green dotted circles, respectively.

To detect infrared small targets, numerous traditional methods have been proposed, including filtering-based methods [4-tophat, 5-maxmedian], local-contrast-based methods [6-lcm, 7-Robust-lcm, 8-TLLCM, 9-WSLLCM], and low-rank-based methods [10-IPI, 11-NRAM, 12-RIPT, 13-PSTNN]. However, these traditional methods heavily rely on handcrafted features. When the characteristics of real scenes (e.g., target size, target shape, SCR, and clutter background) change dramatically, it is difficult to use handcrafted features and fixed hyper-parameters to handle such variations.

Different from traditional methods, CNN-based methods can learn features of infrared small targets in a data-driven manner. Liu et al. [18-five-layer] proposed the first CNN-based SIRST detection method. They designed a multi-layer perceptron (MLP) network with 5 layers for infrared small target detection. Then, McIntosh et al. [19-infrared-car] fine-tuned several existing generic object detection networks (e.g., Faster-RCNN [20-faster-rcnn] and Yolo-v3 [21-yolov3]) for infrared small target detection. Later, Dai et al. [22-ACM] proposed the first segmentation-based SIRST detection method. They designed an asymmetric contextual module (ACM) to replace the plain skip connection of Unet [15-Unet]. Although recent CNN-based methods have achieved state-of-the-art performance, most of them only fine-tune networks designed for generic objects. Since infrared small targets are much smaller than generic objects, directly applying these methods to SIRST detection can easily lead to the loss of small targets in deep layers.

Inspired by the success of nested structures in medical image segmentation [Mdu-net, DMPU-net, Unet_3+, 25-Unet++], we propose a dense nested attention network (namely, DNANet) to maintain small targets in deep layers. Specifically, we design a tri-directional dense nested interactive module (DNIM) with a cascaded channel and spatial attention module (CSAM) to achieve progressive feature interaction and adaptive feature enhancement. Within our DNIM, multiple nodes are imposed on the pathway between the encoder and decoder sub-networks. As shown in Fig. 2(b), all nodes are connected with each other to form a nested-shape network. Using DNIM, the middle nodes can receive features from their own layer and the two adjacent layers, leading to repeated multi-layer feature fusion at deep layers. Through repeated feature fusion and enhancement, our network can maintain the targets in deep layers. Meanwhile, contextual information of the maintained small targets can be well incorporated and fully exploited. In addition, we develop a novel infrared small target dataset (namely, the NUDT-SIRST dataset) to evaluate the performance of SIRST detection methods under different clutter backgrounds, target shapes, and target sizes. The contributions of this paper are summarized as follows.

  • We propose a DNANet to maintain small targets in deep layers. The contextual information of small targets can be well incorporated and fully exploited by repeated feature fusion and enhancement.

  • A dense nested interactive module and a channel-spatial attention module are proposed to achieve progressive feature fusion and adaptive feature enhancement.

  • We develop an infrared small target dataset (namely, NUDT-SIRST). To the best of our knowledge, it is the largest SIRST dataset, with numerous target shapes, rich target sizes, diverse clutter backgrounds, and ground truth annotations.

  • Experiments on both public datasets and our NUDT-SIRST dataset demonstrate the superior performance of our method. Compared to existing methods, our method is more robust to variations in clutter background, target size, and target shape (as shown in Fig. 1).

Fig. 2: The representation of small targets in deep CNN layers of (a) the U-shape network and (b) our nested U-shape network (DNANet).

The remainder of this paper is organized as follows: In Section II, we briefly review the related work. In Section III, we introduce the architecture of our DNANet and our self-developed dataset in detail. Section IV presents the experimental results. Section V gives the conclusion.

II Related Work

In this section, we briefly review the major works on single-frame infrared small target detection and existing SIRST datasets.

II-A Single-frame Infrared Small Target Detection

SIRST detection has been extensively investigated for decades. The traditional paradigm achieves SIRST detection by measuring the discontinuity between targets and backgrounds. Typical methods include filtering-based methods [TDLMS], local-contrast-based methods [Local_contrast_01, Local_contrast_02], and low-rank-based methods [low_rank_01, low_rank_02]. However, these traditional methods heavily rely on handcrafted features. When real scenes change dramatically (e.g., in SCR, clutter background, target shape, or target size), it is difficult for handcrafted features and fixed hyper-parameters to handle such variations. To address this problem, recent CNN-based methods learn trainable features in a data-driven manner. Thanks to the large quantity of data and the powerful model fitting capability of CNNs, these methods achieve better performance than traditional ones.

Existing CNN-based methods can be divided into detection-based methods and segmentation-based methods. Liu et al. [18-five-layer] first introduced a generic target detection framework for infrared small target detection. They designed a multi-layer perceptron (MLP) network with 5 layers. Then, McIntosh et al. [19-infrared-car] fine-tuned several generic target detection networks (e.g., Faster-RCNN [20-faster-rcnn] and Yolo-v3 [21-yolov3]) and used the optimized eigen-vectors as input to achieve improved performance.

Recently, segmentation-based methods have attracted increasing attention because they produce both pixel-level classification and localization outputs. Dai et al. [22-ACM] proposed the first segmentation-based network (i.e., ACM). They designed an asymmetric contextual module to aggregate features from shallow and deep layers. Then, Dai et al. [23-ALCNet] further improved ACM by introducing a dilated local contrast measure, in which a feature cyclic shift scheme achieves a trainable local contrast measure. Moreover, Wang et al. [24-ICCV19] decomposed the infrared target detection problem into two opposed sub-problems (i.e., miss detection and false alarm) and used a conditional generative adversarial network (CGAN) to achieve a trade-off between miss detection and false alarm for infrared small target detection.

Although performance has been continuously improved by recent networks, the loss of small targets in deep layers still remains. This problem ultimately results in poor robustness to dramatic scene changes (e.g., clutter backgrounds and targets with different SCR, shape, and size).

Fig. 3:

An illustration of the proposed dense nested attention network (DNANet). (a) Feature extraction module. Input images are first fed into the dense nested interactive module (DNIM) to achieve progressive feature fusion. Then, features from different semantic levels are adaptively enhanced by a channel and spatial attention module (CSAM). (b) Feature pyramid fusion module (FPFM). The enhanced features are upsampled and concatenated to achieve multi-layer output fusion. (c) Eight-connected neighborhood clustering algorithm. The segmentation map is clustered and the centroid of each target region is finally determined.

II-B Datasets for SIRST Detection

Open-source datasets for infrared small target detection are scarce, and most traditional methods are evaluated on in-house datasets. Only a few infrared small target datasets have been released together with CNN-based methods [24-ICCV19, 22-ACM]. Wang et al. [24-ICCV19] built the first large open SIRST dataset, which includes 10000 training images and 100 test images. However, many targets in this dataset do not meet the definition of the Society of Photo-Optical Instrumentation Engineers (SPIE) [SPIE] and show obvious synthesis traces with illogical annotations, which limits its applicability to SIRST detection. Dai et al. [22-ACM] built the first real SIRST dataset (NUAA-SIRST) with high-quality images and labels. However, it contains only 427 images (256 for training), which cannot well cover the dramatic scene changes in infrared small target detection. Moreover, these real infrared images are all manually labelled, with many inaccurately labeled pixels.

Although these open-source datasets have greatly promoted the development of SIRST detection, their limited data capacity, limited data variety, and imperfect annotations hinder further progress in this field. Synthesized data can be easily generated with higher variety and annotation quality at very low cost (in both time and money). Hence, we developed a new NUDT-SIRST dataset with numerous target shapes, rich target sizes, diverse clutter backgrounds, and accurate annotations. The effectiveness of our dataset is evaluated in Section IV.

III Methodology

In this section, we introduce our DNANet in detail. The overall architecture of the proposed method is shown in Fig. 3.

III-A Overall Architecture

As illustrated in Fig. 3, our DNANet takes a SIRST image as its input and sequentially performs feature extraction (Section III-B), feature pyramid fusion (Section III-C), and eight-connected neighborhood clustering (Section III-D).

Section III-B introduces our feature extraction module, including the dense nested interactive module (DNIM) and the channel and spatial attention module (CSAM). Input images are first preprocessed and fed into the DNIM backbone to extract multi-layer features. Then, the multi-layer features are repeatedly fused at the middle convolution nodes of the skip connections and gradually passed into the decoder sub-networks. Due to the semantic gap at the multi-layer feature fusion stage of DNIM, we use CSAM to adaptively enhance these multi-level features for better feature fusion. Section III-C presents the feature pyramid fusion module, in which the enhanced multi-layer features at each scale are upsampled to the same size, and shallow-layer features with rich spatial information are concatenated with deep-layer features with rich semantic information to generate robust feature maps. Section III-D elaborates the eight-connected neighborhood clustering module, which takes the feature maps as input and calculates the spatial location of each target centroid; these locations are used for comparison in Section IV. In Section III-E, we introduce our NUDT-SIRST dataset.

III-B The Feature Extraction Module

III-B1 Motivation

As shown in Fig. 4(a), the traditional U-shape structure [15-Unet] consists of an encoder, a decoder, and plain skip connections. The encoder enlarges the receptive field and extracts high-level information. The decoder recovers feature maps to the resolution of the input images. The plain skip connections act as bridges that pass low-level and high-level features from the encoder to the decoder sub-networks.

To achieve powerful contextual information modeling capability, a straightforward idea is to continuously increase the depth of the network. In this way, high-level information can be obtained and a larger receptive field can be achieved. However, infrared small targets differ significantly in size, ranging from one pixel (i.e., point targets) to tens of pixels (i.e., extended targets). As network depth increases, high-level information of extended targets is obtained, while point targets are easily lost after multiple pooling operations. Therefore, a special module is needed to extract high-level features while maintaining the representation of small targets in deep layers.
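The dilution of point targets under repeated down-sampling can be illustrated with a toy example (a NumPy sketch of ours, not from the paper; average pooling stands in for generic down-sampling): after a few stride-2 pooling stages, a one-pixel target has almost no response left, while an extended target keeps a usable one.

```python
import numpy as np

def avg_pool2x2(x):
    """2x2 average pooling with a stride of 2."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# 32x32 scene: a one-pixel "point target" and a 4x4 "extended target".
scene = np.zeros((32, 32))
scene[4, 4] = 1.0          # point target
scene[16:20, 16:20] = 1.0  # extended target

for _ in range(3):          # three stride-2 poolings (like three encoder stages)
    scene = avg_pool2x2(scene)

# After pooling to 4x4, the point target response has shrunk to 1/64,
# while the extended target cell still averages to 0.25.
print(scene[0, 0], scene.max())  # 0.015625 0.25
```

Max pooling would preserve the point target's peak value longer, but its spatial support still collapses to a single coarse cell, which is the loss the dense nested structure is designed to counteract.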

Fig. 4: An illustration of the U-shape structure and our dense nested structure. The insight comes from the multiple U-shape subnetwork stacking. The representation of small targets in the deep layers is maintained and the high-level information is extracted.

III-B2 The Dense Nested Interactive Module

As shown in Fig. 4(b), we stack multiple U-shape sub-networks together to build a dense nested structure. Since the optimal receptive field varies a lot for targets of different sizes, U-shape sub-networks with different depths are naturally suitable for targets of different sizes. Based on this idea, we impose multiple nodes on the pathway between the encoder and decoder sub-networks. All of these middle nodes are densely connected with each other to form a nested-shape network. As shown in Fig. 4(c) and (d), each node can receive features from its own layer and the adjacent layers, leading to repeated multi-layer feature fusion. As a result, the representation of small targets is maintained in the deep layers and better results can be achieved.

In this paper, we stack I layers of DNIM to form our feature extraction module. Without loss of generality, we take one DNIM layer as an example to introduce this structure, as shown in Fig. 4(c) and (d). Let L(i,j) denote the output of node (i,j), where i indexes the down-sampling layers along the encoder and j indexes the convolution layers of the dense block along the plain skip pathway. When j = 0, each node only receives features from the preceding node along the encoder. The stack of feature maps, represented by L(i,0), is computed as

(1)  L(i,0) = C(M(L(i-1,0))),

where C(·) denotes multiple cascaded convolution layers of the same convolution block and M(·) denotes max-pooling with a stride of 2. When j > 0, each node receives outputs from three directions, including the dense plain skip connection and the nested bi-directional interactive skip connections. The stack of feature maps, represented by L(i,j), is generated as

(2)  L(i,j) = C([L(i,0), ..., L(i,j-1), U(L(i+1,j-1)), M(L(i-1,j))]),

where U(·) denotes the up-sampling layer and [·] denotes the concatenation layer.
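One tri-directional node update can be sketched in PyTorch as follows (our illustrative sketch, not the paper's code; `conv_block`, the channel counts, and the bilinear up-sampling choice are assumptions): the node concatenates the dense same-level features, an up-sampled feature from the level below, and a down-sampled feature from the level above, then applies a convolution block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    # Stands in for the cascaded convolution layers C(.) of one block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def dnim_node(same_level, feat_below, feat_above, conv):
    """Tri-directional fusion: dense same-level features, up-sampled
    feature from level i+1, down-sampled feature from level i-1."""
    up = F.interpolate(feat_below, scale_factor=2, mode='bilinear',
                       align_corners=False)   # U(L(i+1, j-1))
    down = F.max_pool2d(feat_above, 2)        # M(L(i-1, j))
    fused = torch.cat(same_level + [up, down], dim=1)
    return conv(fused)

# Toy shapes: level i at 32x32, level i+1 at 16x16, level i-1 at 64x64.
same = [torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32)]
below = torch.randn(1, 16, 16, 16)
above = torch.randn(1, 16, 64, 64)
conv = conv_block(16 * 4, 16)   # 4 concatenated 16-channel inputs
out = dnim_node(same, below, above, conv)
print(out.shape)  # torch.Size([1, 16, 32, 32])
```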

III-B3 Channel and Spatial Attention Module

As shown in Fig. 5(b), CSAM is used for adaptive feature enhancement at the multi-layer feature fusion stage of DNIM.

Fig. 5: Channel and spatial attention module. CSAM is used to reduce the semantic gap at multi-layer feature fusion stage of DNIM.
Dataset | Image Type | Background Scenes | #Images | Label Type | Target Types
NUAA-SIRST (ACM) [22-ACM] | real | cloud, city, sea | 427 | manual coarse label | point / spot / extended
NUST-SIRST [24-ICCV19] | synthetic | cloud, city, river, road | 10000 | manual coarse label | point / spot / extended
CQU-SIRST (IPI) [10-IPI] | synthetic | cloud, city, sea | 1676 | ground truth | point / spot
NUDT-SIRST (ours) | synthetic | cloud, city, sea, field, highlight | 1327 | ground truth | point / spot / extended
TABLE I: Main characteristics of several popular SIRST datasets. Note that our NUDT-SIRST dataset contains common background scenes, various target types, and ground truth annotations.

The CSAM consists of two cascaded attention units. The feature maps L(i,j) from node (i,j) are sequentially processed by a 1D channel attention map Mc (of size C×1×1) and a 2D spatial attention map Ms (of size 1×H×W), as shown in Fig. 5(a). The channel attention process can be summarized as follows:

(3)  Mc(L) = σ(MLP(AvgPool(L)) + MLP(MaxPool(L))),
(4)  L' = Mc(L) ⊗ L,

where ⊗ denotes element-wise multiplication, σ denotes the sigmoid function, and C, H, and W denote the number of channels, height, and width of L, respectively. AvgPool(·) and MaxPool(·) denote average pooling and max pooling, respectively. The shared network MLP is a multi-layer perceptron with one hidden layer. Before multiplication, the attention map Mc is stretched to the size of L.

Similar to the channel attention process, the spatial attention process can be summarized as follows:

(5)  Ms(L') = σ(f7×7([AvgPool(L'); MaxPool(L')])),
(6)  L'' = Ms(L') ⊗ L',

where f7×7 represents a convolutional operation with a filter size of 7 × 7, and the pooling here is performed along the channel dimension. The attention map Ms is also stretched to the size of L' before multiplication.
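The two cascaded units above follow a CBAM-style formulation; a compact PyTorch sketch (ours, with an illustrative reduction ratio) is:

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Cascaded channel (Eqs. 3-4) and spatial (Eqs. 5-6) attention."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(   # shared MLP with one hidden layer
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # f^{7x7}

    def forward(self, x):
        b, c, h, w = x.shape
        # Channel attention: sigmoid(MLP(avg-pooled) + MLP(max-pooled)).
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        mc = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        x = x * mc                                   # Eq. (4)
        # Spatial attention over channel-pooled maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        ms = torch.sigmoid(self.conv(s))             # Eq. (5)
        return x * ms                                # Eq. (6)

att = ChannelSpatialAttention(16)
y = att(torch.randn(2, 16, 32, 32))
print(y.shape)  # torch.Size([2, 16, 32, 32])
```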

Fig. 6: Samples from our NUDT-SIRST dataset. Our dataset covers multiple real infrared backgrounds, various target types, rich poses, and ground truth labels. The arrows indicate different moving directions of the targets.

III-C The Feature Pyramid Fusion Module

After the feature extraction module, we develop a feature pyramid fusion module to aggregate the resultant multi-layer features. As shown in Fig. 3, we first upsample the multi-layer features to the same size as the finest-scale feature map. Then, the shallow-layer features with rich spatial and profile information and the deep-layer features with rich semantic information are concatenated to generate globally robust feature maps:

(7)  F_global = [U(L(0,I)), U(L(1,I-1)), ..., U(L(I,0))],

where U(·) upsamples each feature map to the finest resolution and [·] denotes concatenation.
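The fusion step amounts to upsample-then-concatenate; a minimal PyTorch sketch (ours; shapes and channel counts are illustrative):

```python
import torch
import torch.nn.functional as F

def feature_pyramid_fusion(features, size):
    """Upsample each multi-layer feature map to `size` and concatenate
    along the channel dimension."""
    up = [F.interpolate(f, size=size, mode='bilinear', align_corners=False)
          for f in features]
    return torch.cat(up, dim=1)

feats = [torch.randn(1, 8, 64, 64),   # shallow: rich spatial detail
         torch.randn(1, 8, 32, 32),
         torch.randn(1, 8, 16, 16)]   # deep: rich semantics
fused = feature_pyramid_fusion(feats, size=(64, 64))
print(fused.shape)  # torch.Size([1, 24, 64, 64])
```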

III-D The Eight-connected Neighborhood Clustering Module

After the feature pyramid fusion module, we introduce an eight-connected neighborhood clustering module [cluster] to cluster all pixels and calculate the centroid of each target. If any two pixels p_m and p_n in the feature maps have intersecting eight-neighborhoods (i.e., Eq. 8) and the same value (0 or 1) (i.e., Eq. 9), these two pixels are considered to be in the same connected area:

(8)  N8(p_m) ∩ N8(p_n) ≠ ∅,
(9)  F(p_m) = F(p_n),

where N8(·) denotes the eight-connected neighborhood of a pixel and F(·) denotes its binary value. Pixels in a connected area belong to the same target. Once all targets in the image are determined, the centroid of each target is calculated as the mean coordinate of its pixels.
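The clustering idea can be implemented with a simple breadth-first flood fill (our sketch, not the paper's code): group eight-connected foreground pixels into components and return each component's centroid.

```python
from collections import deque
import numpy as np

def cluster_8_connected(mask):
    """Label 8-connected foreground components; return one centroid per target."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    centroids = []
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not seen[y, x]:
                queue, pixels = deque([(y, x)]), []
                seen[y, x] = True
                while queue:
                    cy, cx = queue.popleft()
                    pixels.append((cy, cx))
                    for dy in (-1, 0, 1):          # scan the 8-neighborhood
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and mask[ny, nx] and not seen[ny, nx]):
                                seen[ny, nx] = True
                                queue.append((ny, nx))
                ys, xs = zip(*pixels)
                centroids.append((sum(ys) / len(ys), sum(xs) / len(xs)))
    return centroids

mask = np.zeros((8, 8), dtype=bool)
mask[1, 1] = mask[2, 2] = True   # diagonal pixels: one 8-connected target
mask[6, 6] = True                # a second, isolated target
print(cluster_8_connected(mask))  # [(1.5, 1.5), (6.0, 6.0)]
```

In practice a library routine such as `scipy.ndimage.label` with an all-ones 3×3 structuring element achieves the same labeling.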

III-E The NUDT-SIRST Dataset

Quality, quantity, and scene diversity of data significantly affect the performance of CNN-based methods. Existing CNN-based methods mainly use real infrared data with manual annotations. However, it is costly to collect a large-scale dataset with accurate pixel-level annotations, which hinders the further development of CNN-based methods. Inspired by solutions in other data-scarce fields (e.g., ship detection [17-Shipdetection]), we develop a large-scale infrared small target dataset (namely, the NUDT-SIRST dataset). Our NUDT-SIRST dataset enables performance evaluation of CNN-based methods under numerous target types, rich target sizes, and diverse clutter backgrounds.

Our NUDT-SIRST is compared with existing SIRST datasets in Table I. It has 5 main background scenes and covers various targets. All scenes in our dataset are rendered by synthesizing real infrared backgrounds with various virtual infrared targets. For each scene, all images have a resolution of 256 × 256. To render realistic infrared small targets, we adopt an adaptive target size function and apply a 5 × 5 Gaussian blur. We illustrate the 5 main background scenes (including city, field, highlight, sea, and cloud) in Fig. 6.
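The rendering pipeline can be approximated as follows (an illustrative NumPy sketch of ours; the paper's exact size function and blending are not reproduced here): paste a small Gaussian-shaped blob, standing in for a blurred virtual target, onto a real background crop.

```python
import numpy as np

def render_target(background, cy, cx, radius=2.0, intensity=0.9):
    """Paste a Gaussian-shaped small target onto a background image
    (pixel values in [0, 1]); the 2D Gaussian stands in for a
    blurred virtual target."""
    y, x = np.ogrid[:background.shape[0], :background.shape[1]]
    blob = intensity * np.exp(-((y - cy) ** 2 + (x - cx) ** 2)
                              / (2.0 * radius ** 2))
    return np.clip(background + blob, 0.0, 1.0)

rng = np.random.default_rng(0)
bg = 0.2 + 0.05 * rng.standard_normal((256, 256))  # stand-in for a real IR background
img = render_target(bg, cy=128, cx=64)
print(img.shape, img[128, 64] > img.mean())  # target pixel brighter than the scene
```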

IV Experiment

In this section, we first introduce our evaluation metrics and training protocol. Then, we compare our DNANet to several state-of-the-art SIRST detection methods. Finally, we present ablation studies to investigate our network.

IV-A Evaluation Metrics

Pioneering CNN-based works [22-ACM, 23-ALCNet, 24-ICCV19] mainly use pixel-level evaluation metrics, such as IoU, precision, and recall. These metrics mainly focus on target shape evaluation. However, infrared small targets generally lack distinctive shapes and textures. For a 3 × 3 small target, a single falsely predicted pixel causes an 11.1% decrease in IoU. Consequently, such pixel-level evaluation metrics alone are unsuitable for small targets. In fact, overall target localization is the most important criterion for SIRST detection. Therefore, we adopt the probability of detection (Pd) and false-alarm rate (Fa) to evaluate localization ability, and use IoU to evaluate shape description ability.

Method | NUDT-SIRST (Tr=50%) IoU / Pd / Fa | NUAA-SIRST (Tr=50%) IoU / Pd / Fa
Filtering based: Top-Hat [4-tophat] | 20.72 / 78.41 / 166.7 | 7.143 / 79.84 / 1012
Filtering based: Max-Median [5-maxmedian] | 4.197 / 58.41 / 36.89 | 4.172 / 69.20 / 55.33
Local contrast based: WSLCM [9-WSLLCM] | 2.283 / 56.82 / 1309 | 1.158 / 77.95 / 5446
Local contrast based: TLLCM [8-TLLCM] | 2.176 / 62.01 / 1608 | 1.029 / 79.09 / 5899
Low rank based: IPI [10-IPI] | 17.76 / 74.49 / 41.23 | 25.67 / 85.55 / 11.47
Low rank based: NRAM [11-NRAM] | 6.927 / 56.40 / 19.27 | 12.16 / 74.52 / 13.85
Low rank based: RIPT [12-RIPT] | 29.44 / 91.85 / 344.3 | 11.05 / 79.08 / 22.61
Low rank based: PSTNN [13-PSTNN] | 14.85 / 66.13 / 44.17 | 22.40 / 77.95 / 29.11
Low rank based: MSLSTIPT [3-anti-miss] | 8.342 / 47.40 / 888.1 | 10.30 / 82.13 / 1131
CNN based: MDvsFA-cGAN [24-ICCV19] | 75.14 / 90.47 / 25.34 | 60.30 / 89.35 / 56.35
CNN based: ACM [22-ACM] | 67.08 / 95.97 / 10.18 | 70.33 / 93.91 / 3.728
CNN based: ALCNet [23-ALCNet] | 81.40 / 96.51 / 9.261 | 73.33 / 96.57 / 30.47
DNANet-VGG10 (ours) | 85.23 / 96.95 / 6.782 | 74.96 / 97.34 / 26.73
DNANet-ResNet10 (ours) | 86.36 / 97.39 / 6.897 | 76.24 / 97.71 / 12.80
DNANet-ResNet18 (ours) | 87.09 / 98.73 / 4.223 | 77.47 / 98.48 / 2.353
DNANet-ResNet34 (ours) | 86.87 / 97.98 / 3.710 | 77.54 / 98.10 / 2.510
TABLE II: IoU (×10⁻²), Pd (×10⁻²), and Fa (×10⁻⁶) values achieved by different SOTA methods on the NUDT-SIRST and NUAA-SIRST datasets. For IoU and Pd, larger values indicate higher performance; for Fa, smaller values indicate higher performance. The best results are in red and the second best results are in blue.

IV-A1 Intersection over Union

Intersection over Union (IoU) is a pixel-level evaluation metric that evaluates the profile description ability of an algorithm. IoU is calculated as the ratio of the intersection area to the union area between the predictions and the labels:

(10)  IoU = A_inter / A_union,

where A_inter and A_union represent the intersection and union areas, respectively.
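Concretely (a small sketch of ours on binary masks):

```python
import numpy as np

def iou(pred, label):
    """Pixel-level IoU: intersection area over union area of binary masks."""
    inter = np.logical_and(pred, label).sum()
    union = np.logical_or(pred, label).sum()
    return inter / union if union else 0.0

pred = np.zeros((8, 8), dtype=bool); pred[2:5, 2:5] = True    # 3x3 prediction
label = np.zeros((8, 8), dtype=bool); label[3:6, 3:6] = True  # 3x3 label, shifted by 1
print(iou(pred, label))  # 4 / 14 ≈ 0.2857
```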

IV-A2 Probability of Detection

Probability of Detection (Pd) is a target-level evaluation metric. It measures the ratio of the number of correctly predicted targets to the number of all targets:

(11)  Pd = N_correct / N_all,

where N_correct and N_all represent the numbers of correctly predicted targets and all targets, respectively. A target is considered correctly predicted if its centroid deviation is less than a maximum allowed deviation, which we set to 3 pixels in this paper.

IV-A3 False-Alarm Rate

False-Alarm Rate (Fa) is another target-level evaluation metric. It measures the ratio of falsely predicted pixels to all image pixels:

(12)  Fa = P_false / P_all,

where P_false and P_all represent the numbers of falsely predicted pixels and all image pixels, respectively. Pixels are considered falsely predicted if the centroid deviation of their target is larger than the maximum allowed deviation of 3 pixels.
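The target-level matching behind Pd can be sketched as follows (our illustration with a simple greedy nearest-centroid match; the paper specifies only the 3-pixel deviation threshold, not the matching rule):

```python
import math

def probability_of_detection(pred_centroids, gt_centroids, max_dev=3.0):
    """Fraction of ground-truth targets whose nearest unmatched predicted
    centroid lies within `max_dev` pixels."""
    unmatched = list(pred_centroids)
    correct = 0
    for gy, gx in gt_centroids:
        best = min(unmatched,
                   key=lambda p: math.hypot(p[0] - gy, p[1] - gx),
                   default=None)
        if best and math.hypot(best[0] - gy, best[1] - gx) <= max_dev:
            correct += 1
            unmatched.remove(best)   # each prediction matches at most one target
    return correct / len(gt_centroids) if gt_centroids else 1.0

gt = [(10.0, 10.0), (50.0, 50.0)]
preds = [(11.0, 11.0), (80.0, 80.0)]   # one hit, one false alarm
print(probability_of_detection(preds, gt))  # 0.5
```

The pixels of the unmatched prediction at (80, 80) would then count toward Fa.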

IV-A4 Receiver Operating Characteristic

The receiver operating characteristic (ROC) curve describes the changing trend of the detection probability (Pd) under a varying false-alarm rate (Fa).

Fig. 7: Qualitative results achieved by different SIRST detection methods. For better visualization, the target area is enlarged in the top-right corner. The correctly detected target, false alarm, and miss detection areas are highlighted by red, yellow, and green dotted circles, respectively. Our DNANet achieves precise target localization and shape segmentation with a lower false alarm rate.

IV-B Implementation Details

As discussed in Section IV-E, we used the public NUAA-SIRST dataset [23-ALCNet] and our NUDT-SIRST dataset for both training and testing. All input images with different initial sizes were first resized to a resolution of 256 × 256. To accelerate network convergence, all images were normalized so that their values are centered at zero.

In this paper, we adopted a segmentation network as our baseline to generate a pixel-level segmentation map and then used a clustering algorithm to achieve target localization. The U-net paradigm with ResNet backbones [30_resnet] was chosen as our segmentation backbone, and the number of down-sampling layers I was set to 4. Our network was trained using the Soft-IoU loss function and optimized with the Adagrad method [27-Adagrad]. We initialized the weights and biases of our model using the Xavier method [28-xavier]. We set the batch size to 16 and the learning rate to 0.05. All models were implemented in PyTorch [29-pytorch] on a computer with an Intel i7-7700H @ 2.80 GHz CPU and an Nvidia GeForce 1080Ti GPU.
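A common formulation of the Soft-IoU loss can be sketched as follows (our version; the smoothing constant `eps` is our choice, not from the paper):

```python
import torch

def soft_iou_loss(logits, target, eps=1e-6):
    """Soft-IoU loss: 1 - (soft intersection) / (soft union)."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    union = (prob + target - prob * target).sum()
    return 1.0 - (inter + eps) / (union + eps)

# Perfect prediction drives the loss toward 0; a wrong one toward 1.
target = torch.zeros(1, 1, 8, 8)
target[..., 3:5, 3:5] = 1.0
good_logits = target * 40.0 - 20.0   # +20 on target pixels, -20 elsewhere
print(soft_iou_loss(good_logits, target).item() < 0.01)  # True
```

Compared with per-pixel cross-entropy, this loss is insensitive to the extreme foreground/background imbalance of SIRST images, since both terms are computed only over the (soft) target region.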

Fig. 8: 3D visualization results of different methods on 6 test images.

IV-C Comparison to the State-of-the-art Methods

To demonstrate the superiority of our method, we compare our DNANet to several state-of-the-art (SOTA) methods, including traditional methods (Top-Hat [4-tophat], Max-Median [5-maxmedian], WSLCM [9-WSLLCM], TLLCM [8-TLLCM], IPI [10-IPI], NRAM [11-NRAM], RIPT [12-RIPT], PSTNN [13-PSTNN], and MSLSTIPT [3-anti-miss]) and CNN-based methods (MDvsFA-cGAN [24-ICCV19], ACM [22-ACM], and ALCNet [23-ALCNet]), on the NUAA-SIRST and NUDT-SIRST datasets. For fair comparison, we retrained all the CNN-based methods on the same training datasets as our DNANet, using our own implementations. Most of these CNN-based methods were reimplemented in PyTorch and are released on our homepage: https://github.com/YeRen123455/Infrared-Small-Target-Detection.

IV-C1 Quantitative Results

For all compared algorithms, we first obtained their predictions and then performed noise suppression by setting a threshold to remove low-response areas. Specifically, the adaptive threshold for traditional methods was calculated by Eq. 13. For CNN-based methods, we followed their original papers and adopted their fixed thresholds (i.e., 0, 0, and 0.5 for ACM [22-ACM], ALCNet [23-ALCNet], and MDvsFA-cGAN [24-ICCV19], respectively). We kept all remaining parameters the same as in the original papers.

(13)

where the adaptive threshold is computed from the largest value, the standard deviation, and the average value of the output.

Quantitative results are shown in Table II. The improvements achieved by our DNANet over traditional methods are significant. That is because both NUDT-SIRST and NUAA-SIRST contain challenging images with different SCR, clutter backgrounds, target shapes, and target sizes. Our DNANet can learn discriminative features that are robust to scene variations. In contrast, traditional methods are usually designed for specific scenes (e.g., a specific target size and clutter background), and their manually selected parameters (e.g., the structure size in Top-Hat and the patch size in IPI) limit their generalization performance. Moreover, we also observe that the IoU improvements are notably higher than the improvements in Pd and Fa. That is because traditional methods mainly focus on the overall localization of the target instead of precise shape matching. This also validates our claim that relying on pixel-level evaluation metrics (such as IoU) alone introduces unfair comparison and leads to inaccurate conclusions.

As shown in Table II, the improvements achieved by DNANet over other CNN-based methods (i.e., MDvsFA-cGAN, ACM, and ALCNet) are also obvious. That is because we redesign a backbone network tailored for SIRST detection: the U-shape backbone with our dense nested interactive skip connections achieves progressive feature fusion and selectively enhances informative features in deep CNN layers. Consequently, intrinsic features of infrared small targets can be maintained and fully learned by the network. It is also worth noting that the improvements of our method on NUDT-SIRST are significantly higher than those on the NUAA-SIRST dataset. That is because our dataset contains more challenging scenes with various target sizes, types, and poses, and our channel and spatial attention module and feature pyramid fusion module help to learn discriminative features for better performance.

IV-C2 Qualitative Results

Qualitative results on the two datasets (i.e., NUDT-SIRST and NUAA-SIRST) are shown in Fig. 7 and Fig. 8. Compared with traditional methods, our method produces outputs with precise target localization and shape segmentation at a very low false alarm rate. In contrast, the traditional methods only perform well on point targets (e.g., image-3) and easily generate many false alarms in locally highlighted areas (e.g., image-4 and image-6). Moreover, as shown in Fig. 9, we divided our NUDT-SIRST dataset into a point targets subset, a spot targets subset, and an extended targets subset. As the ratio of spot and extended targets increases, traditional methods suffer a dramatic performance decrease while our DNANet maintains high accuracy. That is because the performance of traditional methods relies heavily on handcrafted features and cannot adapt to variations in target size. The CNN-based methods (i.e., MDvsFA-cGAN, ACM, and ALCNet) perform much better than traditional methods. However, due to the complicated scenes in our NUDT-SIRST, MDvsFA-cGAN produces many false alarm and miss detection areas (Fig. 8). Our DNANet is more robust to these scene changes. Moreover, our DNANet generates better shape segmentation than ALCNet. That is because our newly designed backbone adapts well to various clutter backgrounds, target shapes, and target sizes, and thus achieves better performance.

Fig. 9: ROC performance on (a) the point-target subset, (b) the point-target + spot-target subsets, and (c) all kinds of targets of NUDT-SIRST. As the ratio of spot and extended targets increases, the performance of traditional methods decreases dramatically. In contrast, the performance of our DNANet is stable.
Fig. 10: Visualization maps of DNANet and DNANet w/o DNIM. The output of DNANet is marked by a solid red frame. The feature maps from the deep layers of DNANet w/o DNIM lose the representation of small targets, which finally results in low response values and miss detection in the output layer.

IV-D Ablation Study

In this subsection, we compare our DNANet with several variants to investigate the potential benefits introduced by our network modules and design choices.

IV-D1 The Dense Nested Interactive Module (DNIM)

The dense nested interactive skip-connection module is used to interact with features at different scale levels to enlarge receptive fields while maintaining fine-grained features at the finest scale level. To demonstrate the effectiveness of our DNIM, we introduced three network variants and made their model sizes comparable for a fair comparison.

Fig. 11: DNANet and three variants of DNIM. (a) DNANet w/o DNIM. (b) DNANet-top-to-bottom. (c) DNANet-left-to-right. (d) DNANet; each color represents a different U-shape sub-network.
  • DNANet w/o DNIM: We replaced the dense nested interactive skip connection module with a regular plain skip connection module.

  • DNANet-left-to-right: As shown in Fig. 11(c), multiple U-shape subnetworks with different depths are stacked from left to right. Each node in the middle part of the network can receive features from its own and the lower layer.

  • DNANet-top-to-bottom: We stacked the U-shape subnetworks from top to bottom to generate DNANet-top-to-bottom, as shown in Fig. 11(b). Different from DNANet-left-to-right, this variant stacks U-shape subnetworks of three different depths, and only its core part uses tri-direction skip connections.
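To make the wiring behind these variants concrete, the toy sketch below builds a dense nested grid in which node (i, j) fuses every earlier same-level node, the up-sampled deeper node, and (tri-directionally) the down-sampled shallower node when one exists. This is our own numpy stand-in, assuming mean fusion and nearest-neighbour resampling in place of the learned convolutions; dropping the shallower input for every node would recover the connectivity described for DNANet-left-to-right.

```python
import numpy as np

def down(x):
    """2x average pooling: a stand-in for a strided-conv downsample."""
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def up(x):
    """2x nearest-neighbour upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse(*feats):
    """Stand-in for conv-based fusion: element-wise mean of same-shape maps."""
    return np.mean(feats, axis=0)

def dnim_forward(image, depth=3, width=3):
    """Toy dense nested grid. Node (i, j) fuses all previous same-level
    nodes, the up-sampled deeper node, and the down-sampled shallower
    node (tri-direction interaction) when i > 0."""
    grid = {(0, 0): image}
    for i in range(1, depth):                  # encoder column j = 0
        grid[(i, 0)] = down(grid[(i - 1, 0)])
    for j in range(1, width):
        for i in range(depth - j):
            inputs = [grid[(i, k)] for k in range(j)]   # same-level history
            inputs.append(up(grid[(i + 1, j - 1)]))     # from the deeper layer
            if i > 0:
                inputs.append(down(grid[(i - 1, j)]))   # from the shallower layer
            grid[(i, j)] = fuse(*inputs)
    return grid[(0, width - 1)]                # finest-scale output node
```

Because every node on the finest level keeps receiving re-fused features, a small target that survives at the finest scale is repeatedly reinforced rather than being pooled away.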

Fig. 12: Visualization maps of DNANet and DNANet w/o CSAM. The output of DNANet is circled by a solid red frame. The feature maps from the deep layers of DNANet respond strongly to informative cues, which finally results in precise profile segmentation in the output layer.

Table III shows the comparative results achieved by DNANet and its variants. It can be observed that DNANet w/o DNIM suffers decreases of 2.08% and 2.23% in IoU and Pd and an increase of 4.298 in Fa on the NUDT-SIRST dataset. Similar results are also observed on the NUAA-SIRST dataset. That is because DNIM progressively aggregates features at multiple scales to maintain the target information at the finest scale for better performance. The visualization maps shown in Fig. 10 also demonstrate the effectiveness of our DNIM: small targets are lost in the deep-layer feature maps of DNANet w/o DNIM (i.e., L(4,0), L(3,1)).

Model                  #Params(M)   NUDT-SIRST (IoU/Pd/Fa)   NUAA-SIRST (IoU/Pd/Fa)
DNANet w/o DNIM        4.71         85.01/96.50/8.521        75.12/97.34/12.05
DNANet-top-to-bottom   4.72         85.75/96.96/7.682        75.94/97.71/11.84
DNANet-left-to-right   4.71         85.89/97.29/4.649        76.59/98.10/11.05
DNANet-ResNet18        4.70         87.09/98.73/4.223        77.47/98.48/2.353
TABLE III: IoU, Pd, and Fa values achieved by the main variants of DNANet and DNIM on the NUDT-SIRST and NUAA-SIRST datasets. Top-to-bottom and left-to-right denote stacking U-shape sub-networks from different directions.

As shown in Table III, DNANet-left-to-right suffers decreases of 1.20% and 1.44% in IoU and Pd and an increase of 0.426 in Fa compared to DNANet on the NUDT-SIRST dataset. That is because each node in DNANet-left-to-right only interacts with the deeper layer instead of fully interacting with the shallow, its own, and deep layers. Shallow layers have rich localization and profile information, but this information is not fully incorporated at the skip-connection stage. Consequently, this variant has limited performance.

Compared with our DNANet, the variant DNANet-top-to-bottom suffers decreases of 1.34% and 1.77% in IoU and Pd and an increase of 3.459 in Fa on the NUDT-SIRST dataset. That is because only the core part of this variant adopts tri-direction skip connections while the remaining part still uses plain skip connections. Moreover, its tri-direction interactive area is relatively shallow, so high-level information cannot be fully exploited at shallow layers.

Model                 #Params(M)   NUDT-SIRST (IoU/Pd/Fa)   NUAA-SIRST (IoU/Pd/Fa)
DNANet w/o CSAM       4.70         85.90/96.62/5.738        75.81/96.19/22.12
DNANet w/o CSAM (⊕)   4.71         85.25/96.62/6.710        75.35/95.82/34.97
DNANet w/o CA         4.73         86.27/96.96/4.881        76.20/96.96/12.69
DNANet w/o SA         4.73         86.14/96.73/4.128        76.69/97.34/10.96
DNANet-ResNet18       4.70         87.09/98.73/4.223        77.47/98.48/2.353
TABLE IV: IoU, Pd, and Fa values achieved by the main variants of DNANet and CSAM on the NUDT-SIRST and NUAA-SIRST datasets. ⊕ denotes element-wise summation as the feature fusion method; CA and SA denote channel attention and spatial attention.

IV-D2 The Channel and Spatial Attention Module (CSAM)

The channel and spatial attention module is used for adaptive feature enhancement to achieve better feature fusion. To investigate the benefits introduced by this module, we compare our DNANet with four variants. To achieve a fair comparison (i.e., comparable model size), we increased the number of filters in all convolution layers of the four variants to make their model sizes slightly larger than that of DNANet.

  • DNANet w/o CSAM: We removed the channel and spatial attention module in this variant and directly concatenated multi-layer features for subsequent processing.

  • DNANet w/o CSAM (element-wise summation): We replaced CSAM with common element-wise summation in this variant to explore the effectiveness of CSAM. Specifically, we used a 1×1 convolution and up-sampling/down-sampling to make the features from different layers identical in size and channel number. Then, element-wise summation is used to achieve multi-layer feature fusion.

  • DNANet w/o channel attention: We removed the channel attention operation in this variant to evaluate its contribution.

  • DNANet w/o spatial attention: We canceled the spatial attention operation in this variant to investigate the benefit introduced by spatial attention.
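The cascade being ablated here is simple to state in code. The numpy sketch below is our illustrative stand-in, not the paper's implementation: a single weight matrix `w` replaces the learned channel MLP, and a summed average/max map replaces the spatial convolution. It applies a channel gate followed by a spatial gate to a (C, H, W) feature map; removing either call reproduces the w/o CA and w/o SA variants.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w):
    """Squeeze-and-excitation-style channel gate on a (C, H, W) tensor.
    `w` is a hypothetical (C, C) weight standing in for the learned MLP."""
    squeeze = feat.mean(axis=(1, 2))        # global average pool -> (C,)
    gate = sigmoid(w @ squeeze)             # per-channel weights in (0, 1)
    return feat * gate[:, None, None]

def spatial_attention(feat):
    """Spatial gate built from channel-wise average and max maps
    (a stand-in for the usual conv over the stacked maps)."""
    gate = sigmoid(feat.mean(axis=0) + feat.max(axis=0))
    return feat * gate[None, :, :]

def csam(feat, w):
    """Cascaded channel-then-spatial attention."""
    return spatial_attention(channel_attention(feat, w))
```

Both gates lie in (0, 1), so CSAM can only suppress uninformative responses, never amplify them; the enhancement comes from suppressing clutter relative to target cues.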

If CSAM is removed, the performance suffers decreases of 1.19% and 1.84% in IoU, decreases of 2.11% and 2.11% in Pd, and increases of 1.515 and 2.487 in Fa for DNANet w/o CSAM and DNANet w/o CSAM (element-wise summation) on the NUDT-SIRST dataset, respectively. Similar results are achieved on the NUAA-SIRST dataset. This clearly demonstrates the importance of the channel and spatial attention module. As shown in Fig. 12, with the help of CSAM, the feature maps from the deep layers of DNANet have high responses to informative cues, which finally results in precise shape segmentation.

As shown in Table IV, DNANet w/o channel attention suffers decreases of 0.82% and 1.77% in IoU and Pd and an increase of 0.658 in Fa compared to DNANet on the NUDT-SIRST dataset. That is because the channel attention unit in our DNANet can better exploit informative channels to enhance the representation capability of features.

If the spatial attention unit is removed, the performance suffers decreases of 0.95% and 2.00% in IoU and Pd and a change of 0.095 in Fa compared to DNANet on the NUDT-SIRST dataset. That is because infrared small targets are easily immersed in heavy clouds and noise, making it hard to distinguish these small and dim targets from the background. Spatial attention helps the network pay attention to local informative areas and thus produces better results.

Fig. 13: Samples of the input images, the public ground-truth masks [22-ACM] (manually labeled), and the output of our DNANet trained on the mixed dataset. Our method can even produce more precise segmentation results than the manually labeled ground-truth masks.

IV-E Potential of The Synthesized Dataset

In this section, we evaluate the potential of our synthesized dataset for real SIRST tasks. Specifically, we mixed real SIRST images from NUAA-SIRST and synthesized SIRST images from NUDT-SIRST at different ratios to train our DNANet. As more real SIRST images are included for training, the performance of our network gradually improves. Even in the extreme situation with only 42 real images, our DNANet still achieves better performance than ACM trained with 100% real SIRST images. That is because our synthesized dataset well covers the main challenges of infrared small target detection (i.e., different SCR, clutter background, target shape, and target size). Consequently, the huge cost of collecting real SIRST images can be reduced.
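A minimal sketch of this mixing procedure is given below. It is our own illustration with hypothetical function and variable names: it samples a fixed fraction of each dataset (truncated to whole images, so 10% of 427 real images is 42, matching the counts in Table V) and shuffles the combined list.

```python
import random

def build_mixed_split(real_paths, synth_paths, real_ratio, synth_ratio, seed=0):
    """Sample a mixed training list: a fraction of the real set plus a
    fraction of the synthesized set, then shuffle the union."""
    rng = random.Random(seed)
    n_real = int(len(real_paths) * real_ratio)     # e.g. int(427 * 0.1) == 42
    n_synth = int(len(synth_paths) * synth_ratio)  # e.g. int(1327 * 0.5) == 663
    mixed = rng.sample(real_paths, n_real) + rng.sample(synth_paths, n_synth)
    rng.shuffle(mixed)
    return mixed
```

Fixing the seed makes each ratio's split reproducible, so the rows of Table V differ only in the real-image fraction.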

Moreover, we compared the output of our network trained on the mixed dataset with the manually labeled masks of NUAA-SIRST in Fig. 13. It can be observed that the output of our network has more reasonable shape segmentation. That is because the synthesized SIRST images have absolutely precise labels. The network can learn the essence of infrared small targets from sufficiently well-labeled data, which finally contributes to improvement on real SIRST images. As a result, our network can generate better visual results than the ground-truth labels.

# Real images   # Synthesized images   Performance on real SIRST (IoU/Pd/Fa)
0% (0/427)      50% (663/1327)         61.69/88.97/30.26
10% (42/427)    50% (663/1327)         68.26/94.67/14.11
20% (85/427)    50% (663/1327)         72.38/95.63/8.932
30% (128/427)   50% (663/1327)         74.53/97.71/4.341
TABLE V: IoU, Pd, and Fa values achieved by DNANet on the real dataset. DNANet is trained on mixed datasets with different ratios of real images.

V Conclusion

In this paper, we propose DNANet to achieve SIRST detection. Different from existing CNN-based SIRST detection methods, we explicitly handle the problem of small targets being lost in deep layers by designing a new tri-direction dense nested interactive module with a cascaded channel and spatial attention module. The intrinsic information of small targets can be incorporated and fully exploited by repeated fusion and enhancement. Moreover, we develop an open SIRST dataset to evaluate the performance of infrared small target detection in challenging scenes. We also reorganized a set of evaluation metrics. Experiments on both our dataset and the public dataset have shown the superiority of our method over state-of-the-art methods.

References