Single-frame infrared small target (SIRST) detection aims at separating small targets from clutter backgrounds. With the advances of deep learning, CNN-based methods have yielded promising results in generic object detection due to their powerful modeling capability. However, existing CNN-based methods cannot be directly applied to infrared small targets, since the pooling layers in their networks could lead to the loss of targets in deep layers. To handle this problem, we propose a dense nested attention network (DNANet) in this paper. Specifically, we design a dense nested interactive module (DNIM) to achieve progressive interaction between high-level and low-level features. With the repeated interaction in DNIM, infrared small targets in deep layers can be maintained. Based on DNIM, we further propose a cascaded channel and spatial attention module (CSAM) to adaptively enhance multi-level features. With our DNANet, contextual information of small targets can be well incorporated and fully exploited by repeated fusion and enhancement. Moreover, we develop an infrared small target dataset (namely, NUDT-SIRST) and propose a set of evaluation metrics to conduct comprehensive performance evaluation. Experiments on both public and our self-developed datasets demonstrate the effectiveness of our method. Compared to other state-of-the-art methods, our method achieves better performance in terms of probability of detection (Pd), false-alarm rate (Fa), and intersection over union (IoU).
Single-frame infrared small target (SIRST) detection is widely used in many applications such as maritime surveillance [1-Maritime-Surveillance], early warning systems [2-early-warning], and precise guidance [3-anti-miss]. Compared to generic object detection, infrared small target detection has several unique characteristics: 1) Small: Due to the long imaging distance, infrared targets are generally small, ranging from one pixel to tens of pixels in the images. 2) Dim: Infrared targets usually have a low signal-to-clutter ratio (SCR) and are easily immersed in heavy noise and clutter backgrounds. 3) Shapeless: Infrared small targets have limited shape characteristics. 4) Changeable: The sizes and shapes of infrared targets vary a lot among different scenarios.
To detect infrared small targets, numerous traditional methods have been proposed, including filtering-based methods [4-tophat, 5-maxmedian], local-contrast-based methods [6-lcm, 7-Robust-lcm, 8-TLLCM, 9-WSLLCM], and low-rank-based methods [10-IPI, 11-NRAM, 12-RIPT, 13-PSTNN]. However, these traditional methods heavily rely on handcrafted features. When the characteristics of real scenes (e.g., target size, target shape, SCR, and clutter background) change dramatically, it is difficult to use handcrafted features and fixed hyper-parameters to handle such variations.
Different from traditional methods, CNN-based methods can learn features of infrared small targets in a data-driven manner. Liu et al. [18-five-layer] proposed the first CNN-based SIRST detection method. They designed a multi-layer perceptron (MLP) network with 5 layers for infrared small target detection. Then, McIntosh et al. [19-infrared-car] fine-tuned several existing generic object detection networks (e.g., Faster-RCNN [20-faster-rcnn] and Yolo-v3 [21-yolov3]) for infrared small target detection. Later, Dai et al. [22-ACM] proposed the first segmentation-based SIRST detection method. They designed an asymmetric contextual module (ACM) to replace the plain skip connection of Unet [15-Unet]. Although recent CNN-based methods have achieved state-of-the-art performance, most of them only fine-tune networks designed for generic objects. Since infrared small targets are much smaller than generic objects, directly applying these methods to SIRST detection can easily lead to the loss of small targets in deep layers.
Inspired by the success of nested structures in medical image segmentation [Mdu-net, DMPU-net, Unet_3+, 25-Unet++], we propose a dense nested attention network (namely, DNANet) to maintain small targets in deep layers. Specifically, we design a tri-directional dense nested interactive module (DNIM) with a cascaded channel and spatial attention module (CSAM) to achieve progressive feature interaction and adaptive feature enhancement. Within our DNIM, multiple nodes are imposed on the pathway between the encoder and decoder sub-networks. As shown in Fig. 2(b), all nodes are connected with each other to form a nested-shape network. Using DNIM, the middle nodes can receive features from their own and the two adjacent layers, leading to repeated multi-layer feature fusion at deep layers. Through repeated feature fusion and enhancement, our network can maintain the targets in deep layers. Meanwhile, contextual information of the maintained small targets can be well incorporated and fully exploited. In addition, we develop a novel infrared small target dataset (namely, the NUDT-SIRST dataset) to evaluate the performance of SIRST detection methods under different clutter backgrounds, target shapes, and target sizes. The contributions of this paper can be summarized as follows.
We propose a DNANet to maintain small targets in deep layers. The contextual information of small targets can be well incorporated and fully exploited by repeated feature fusion and enhancement.
A dense nested interactive module and a channel-spatial attention module are proposed to achieve progressive feature fusion and adaptive feature enhancement.
We develop an infrared small target dataset (namely, NUDT-SIRST). To the best of our knowledge, our dataset is the largest dataset with numerous categories of target shapes, rich target sizes, diverse clutter backgrounds, and ground truth annotations.
Experiments on both the public NUAA-SIRST dataset and our NUDT-SIRST dataset demonstrate the superior performance of our method. Compared to existing methods, our method is more robust to the variations of clutter background, target size, and target shape (as shown in Fig. 1).
In this section, we briefly review the major works on infrared small target detection and SIRST datasets.
SIRST detection has been extensively investigated for decades. The traditional paradigm achieves SIRST detection by measuring the discontinuity between targets and backgrounds. Typical methods include filtering-based methods [TDLMS], local contrast measure based methods [Local_contrast_01, Local_contrast_02], and low rank based methods [low_rank_01, low_rank_02]. However, these traditional methods heavily rely on handcrafted features. When real scenes change dramatically, such as in SCR, clutter background, target shape, and target size, it is difficult to use handcrafted features and fixed hyper-parameters to handle such variations. To address this problem, recent CNN-based methods learn trainable features in a data-driven manner. Thanks to the large quantity of data and the powerful model fitting capability of CNNs, these methods achieve better performance than traditional ones.
Existing CNN-based methods can be divided into detection-based methods and segmentation-based methods. Liu et al. [18-five-layer] first introduced a generic target detection framework for infrared small target detection. They designed a multi-layer perceptron (MLP) network with 5 layers. Then, McIntosh et al. [19-infrared-car] fine-tuned several generic target detection networks (e.g., Faster-RCNN [20-faster-rcnn] and Yolo-v3 [21-yolov3]) and used the optimized eigen-vectors as input to achieve improved performance.
Recently, segmentation-based methods have attracted increasing attention because they can produce both pixel-level classification and localization outputs. Dai et al. [22-ACM] proposed the first segmentation-based network (i.e., ACM). They designed an asymmetric contextual module to aggregate features from shallow layers and deep layers. Then, Dai et al. [23-ALCNet] further improved ACM by introducing a dilated local contrast measure, in which a feature cyclic shift scheme was designed to achieve a trainable local contrast measure. Moreover, Wang et al. [24-ICCV19] decomposed the infrared target detection problem into two opposed sub-problems (i.e., miss detection and false alarm) and used a conditional generative adversarial network (CGAN) to achieve a trade-off between miss detection and false alarm for infrared small target detection.
Although the performance is continuously improved by recent networks, the loss of small targets in deep layers still remains. This problem ultimately results in poor robustness to dramatic scene changes (e.g., varying clutter backgrounds and targets with different SCR, shape, and size).
Open-source datasets for infrared small target detection are scarce, and most traditional methods are evaluated on in-house datasets. Only a few infrared small target datasets have been released together with CNN-based methods [24-ICCV19, 22-ACM]. Wang et al. [24-ICCV19] built the first large open SIRST dataset, which includes 10000 training images and 100 test images. However, many targets in this dataset do not meet the definition of the Society of Photo-Optical Instrumentation Engineers (SPIE) [SPIE], and the images have obvious synthesized traces with illogical annotations. These problems may limit the applicability of this dataset to SIRST detection. Dai et al. [22-ACM] built the first real SIRST dataset with high-quality images and labels. However, the number of images in NUAA-SIRST is only 427 (256 for training), which cannot well cover the dramatic scene changes in infrared small target detection. Moreover, these real infrared data are all manually labeled, with many inaccurately labeled pixels.
Although these open-source datasets have greatly promoted the prosperity of SIRST detection, their limited data capacity, limited data variety, and poor annotation quality hinder the further development of this field. In contrast, synthesized data can be easily generated with higher variety and annotation quality at very low cost (i.e., time and money). Hence, we developed a new NUDT-SIRST dataset with numerous categories of targets, rich target sizes, diverse clutter backgrounds, and accurate annotations. The effectiveness of our dataset is evaluated in Section IV.
In this section, we introduce our DNANet in detail. The overall architecture of the proposed method is shown in Fig. 3.
As illustrated in Fig. 3, our DNANet takes a SIRST image as its input and sequentially performs feature extraction (Section III-B), feature pyramid fusion (Section III-C), and eight-connected neighborhood clustering (Section III-D).
Section III-B introduces our feature extraction module, including the dense nested interactive module (DNIM) and the channel-spatial attention module (CSAM). Input images are first preprocessed and fed into the backbone of DNIM to extract multi-layer features. Then, these multi-layer features are repeatedly fused at the middle convolution nodes of the skip connections and gradually passed into the decoder sub-networks. Due to the semantic gap at the multi-layer feature fusion stage of DNIM, we use CSAM to adaptively enhance the multi-level features and thus achieve better feature fusion. Section III-C presents the feature pyramid fusion module. Enhanced multi-layer features at each scale are upscaled to the same size. Next, the shallow-layer features with rich spatial information and the deep-layer features with high-level information are concatenated to generate robust feature maps. Section III-D elaborates the eight-connected neighborhood clustering module. Feature maps are fed into this module to calculate the spatial location of each target centroid, which is then used for comparison in Section IV. In Section III-E, we introduce our NUDT-SIRST dataset.
As shown in Fig. 4(a), the traditional U-shape structure [15-Unet] consists of an encoder, a decoder, and plain skip connections. The encoder enlarges the receptive field and extracts high-level information. The decoder recovers features to the same size as the input images. The plain skip connections act as bridges to pass low-level and high-level features from the encoder to the decoder sub-network.
To achieve a powerful contextual information modeling capability, a straightforward idea is to continuously increase the number of network layers. In this way, high-level information can be obtained and a larger receptive field can be achieved. However, infrared small targets differ significantly in size, ranging from one pixel (i.e., point targets) to tens of pixels (i.e., extended targets). As the number of layers increases, high-level information of extended targets is obtained, while point targets are easily lost after multiple pooling operations. Therefore, a special module is needed to extract high-level features while maintaining the representation of small targets in deep layers.
As shown in Fig. 4(b), we stack multiple U-shape sub-networks together to build a dense nested structure. Since the optimal receptive field for targets of different sizes varies a lot, U-shape sub-networks with different depths are naturally suitable for targets with different sizes. Based on this idea, we impose multiple nodes on the pathway between the encoder and decoder sub-networks. All of these middle nodes are densely connected with each other to form a nested-shape network. As shown in Fig. 4(c) and (d), each node can receive features from its own layer and the adjacent layers, leading to repeated multi-layer feature fusion. As a result, the representation of small targets is maintained in deep layers and better results can be achieved.
In this paper, we stack $I$ layers of DNIM to form our feature extraction module. Without loss of generality, we take one DNIM layer as an example to introduce this structure, as shown in Fig. 4(c) and (d). Let $L^{i,j}$ denote the output of node $(i,j)$, where $i$ indexes the down-sampling layer along the encoder and $j$ indexes the convolution layer of the dense block along the plain skip pathway. When $j=0$, each node only receives features from the dense plain skip connection. The stack of feature maps represented by $L^{i,0}$ is computed as

$$L^{i,0} = \mathcal{C}\big(\mathcal{D}(L^{i-1,0})\big),$$

where $\mathcal{C}(\cdot)$ denotes multiple cascaded convolution layers of the same convolution block and $\mathcal{D}(\cdot)$ denotes the down-sampling layer. When $j>0$, each node receives outputs from three directions, including the dense plain skip connection and the nested bi-directional interactive skip connections. The stack of feature maps represented by $L^{i,j}$ is generated as

$$L^{i,j} = \mathcal{C}\Big(\big[\,[L^{i,k}]_{k=0}^{j-1},\ \mathcal{U}(L^{i+1,j-1}),\ \mathcal{D}(L^{i-1,j})\,\big]\Big),$$

where $\mathcal{U}(\cdot)$ denotes the up-sampling layer and $[\cdot]$ denotes the concatenation layer.
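To make the node update concrete, the following PyTorch sketch implements one middle DNIM node. The block structure (conv-BN-ReLU), the pooling choice for down-sampling, and the interpolation mode are our assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DNIMNode(nn.Module):
    """One middle node L(i, j) of the dense nested interactive module.

    Illustrative sketch: the node concatenates (1) all previous features of
    its own layer (dense skip), (2) the up-sampled feature from the deeper
    layer, and (3) the down-sampled feature from the shallower layer, then
    applies a cascaded convolution block C(.).
    """

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(  # C(.): cascaded convolution block
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, same_layer_feats, deeper_feat, shallower_feat):
        h, w = same_layer_feats[0].shape[-2:]
        # U(.): up-sample the deeper-layer feature to this node's resolution
        up = F.interpolate(deeper_feat, size=(h, w), mode="bilinear",
                           align_corners=False)
        # D(.): down-sample the shallower-layer feature by pooling
        down = F.max_pool2d(shallower_feat, kernel_size=2)
        fused = torch.cat(list(same_layer_feats) + [up, down], dim=1)
        return self.conv(fused)
```

A node at a 16×16 scale with two same-layer predecessors would, for example, fuse four 16-channel maps into a 64-channel input before convolution.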
As shown in Fig. 5(b), CSAM is used for adaptive feature enhancement at the multi-layer feature fusion stage of DNIM.
| Datasets | Image Type | Background Scene | #Image | Label Type | Target Type | Public |
|---|---|---|---|---|---|---|
| NUAA-SIRST (ACM) [22-ACM] | real | Cloud, City, Sea | 427 | Manual Coarse Label | Point/Spot/Extended | |
| NUST-SIRST [24-ICCV19] | synthetic | Cloud, City, River, Road | 10000 | Manual Coarse Label | Point/Spot/Extended | |
The CSAM consists of two cascaded attention units. The feature maps $F$ from node $(i,j)$ are sequentially processed by a 1D channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and a 2D spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$, as shown in Fig. 5(a). The channel attention process can be summarized as follows:

$$F' = M_c(F) \otimes F,$$
$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big),$$

where $\otimes$ denotes element-wise multiplication, $\sigma$ denotes the sigmoid function, and $C$, $H$, and $W$ denote the number of channels, height, and width of $F$. $\mathrm{AvgPool}(\cdot)$ and $\mathrm{MaxPool}(\cdot)$ denote average pooling and max pooling, respectively. The shared network $\mathrm{MLP}$ is a multi-layer perceptron with one hidden layer. Before multiplication, the attention maps are stretched (broadcast) to the size of $F$.
Similar to the channel attention process, the spatial attention process can be summarized as follows:

$$F'' = M_s(F') \otimes F',$$
$$M_s(F') = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F');\ \mathrm{MaxPool}(F')])\big),$$

where $f^{7\times 7}$ represents a convolutional operation with a filter size of $7\times 7$. The attention maps are also stretched to the size of $F'$ before multiplication.
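As a concrete reading of the two cascaded attention units, the following PyTorch sketch follows the CBAM-style formulation; the reduction ratio, the use of global pooling for the channel branch, and the class name are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class CSAM(nn.Module):
    """Cascaded channel and spatial attention (CBAM-style sketch).

    Channel attention: M_c(F)  = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
    Spatial attention: M_s(F') = sigmoid(conv7x7([AvgPool(F'); MaxPool(F')]))
    """

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(  # shared MLP with one hidden layer
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        # ---- channel attention: squeeze spatial dims to C x 1 x 1 ----
        avg = x.mean(dim=(2, 3), keepdim=True)
        mx = torch.amax(x, dim=(2, 3), keepdim=True)
        mc = torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        x = x * mc                                  # F' = M_c(F) (x) F
        # ---- spatial attention: squeeze channel dim to 1 x H x W ----
        avg_s = x.mean(dim=1, keepdim=True)
        max_s = torch.amax(x, dim=1, keepdim=True)
        ms = torch.sigmoid(self.spatial(torch.cat([avg_s, max_s], dim=1)))
        return x * ms                               # F'' = M_s(F') (x) F'
```

Since both attention maps lie in (0, 1), the module can only suppress or preserve responses, never amplify them, which matches its role as an adaptive feature-selection step.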
After the feature extraction module, we develop a feature pyramid fusion module to aggregate the resultant multi-layer features. As shown in Fig. 3, we first upscale the multi-layer features to the same size. Then, the shallow-layer features with rich spatial and profile information and the deep-layer features with rich semantic information are concatenated to generate globally robust feature maps.
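The fusion step above can be sketched as follows; bilinear interpolation and the absence of a follow-up channel-mixing convolution are simplifying assumptions of this sketch.

```python
import torch
import torch.nn.functional as F


def feature_pyramid_fusion(feats):
    """Upscale multi-layer features to the finest resolution and concatenate.

    `feats` is ordered from shallow (largest map) to deep (smallest map).
    Illustrative sketch of the fusion step; in practice a 1x1 convolution
    would typically follow to mix the concatenated channels.
    """
    h, w = feats[0].shape[-2:]
    up = [F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
          for f in feats]
    return torch.cat(up, dim=1)
```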
After the feature pyramid fusion module, we introduce an eight-connected neighborhood clustering module [cluster] to cluster all pixels and calculate the centroid of each target. If any two pixels in the feature maps have intersecting eight-neighborhoods (Eq. 8) and the same value (0 or 1) (Eq. 9), these two pixels are considered to be in the same connected area. Pixels in a connected area belong to the same target. Once all targets in the image are determined, the centroid of each target is calculated from the coordinates of its pixels.
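A minimal NumPy sketch of eight-connected clustering and centroid computation is given below. The BFS implementation and function name are illustrative; a library call such as scipy.ndimage.label with a 3×3 structuring element performs the same labeling.

```python
import numpy as np
from collections import deque


def cluster_targets(mask):
    """Group foreground pixels of a binary mask into eight-connected
    components and return the centroid (row, col) of each component."""
    visited = np.zeros_like(mask, dtype=bool)
    centroids = []
    rows, cols = mask.shape
    for r0 in range(rows):
        for c0 in range(cols):
            if mask[r0, c0] and not visited[r0, c0]:
                queue, pixels = deque([(r0, c0)]), []
                visited[r0, c0] = True
                while queue:  # breadth-first flood fill
                    r, c = queue.popleft()
                    pixels.append((r, c))
                    for dr in (-1, 0, 1):       # eight neighbors
                        for dc in (-1, 0, 1):
                            rr, cc = r + dr, c + dc
                            if (0 <= rr < rows and 0 <= cc < cols
                                    and mask[rr, cc] and not visited[rr, cc]):
                                visited[rr, cc] = True
                                queue.append((rr, cc))
                ps = np.array(pixels, dtype=float)
                centroids.append(tuple(ps.mean(axis=0)))
    return centroids
```

Note that two pixels touching only diagonally (e.g., (0,0) and (1,1)) fall into the same component under eight-connectivity, which is exactly why small, partly fragmented targets are not split apart.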
Quality, quantity, and scene diversity of data significantly affect the performance of CNN-based methods. Existing CNN-based methods mainly use real infrared data with manual annotations. However, it is costly to collect a large-scale dataset with accurate pixel-level annotations. These issues hinder the further development of CNN-based methods. Inspired by the solutions in other data-scarcity fields (e.g., ship detection [17-Shipdetection]), we develop a large-scale infrared small target dataset (namely, the NUDT-SIRST dataset). Our NUDT-SIRST dataset enables performance evaluation of CNN-based methods under numerous categories of target type, rich target size, and diverse clutter backgrounds.
Our NUDT-SIRST is compared with existing SIRST datasets in Table I. It has 5 main background scenes and covers various targets. All scenes in our dataset are rendered by synthesizing real infrared backgrounds with various virtual infrared targets. For each scene, all images have a resolution of 256×256. To render realistic infrared small targets, we adopt an adaptive target size function and apply a 5×5 Gaussian blur. We illustrate the 5 main background scenes (including city, field, highlight, sea, and cloud) in Fig. 6.
In this section, we first introduce our evaluation metrics and training protocol. Then, we compare our DNANet to several state-of-the-art SIRST detection methods. Finally, we present ablation studies to investigate our network.
Pioneering CNN-based works [22-ACM, 23-ALCNet, 24-ICCV19] mainly use pixel-level evaluation metrics such as intersection over union (IoU), precision, and recall. These metrics mainly focus on target shape evaluation. However, infrared small targets generally lack shape and texture information. For a 3×3 small target, one falsely predicted pixel causes an 11.1% decrease in IoU. Consequently, these pixel-level evaluation metrics alone are unsuitable for small targets. In fact, overall target localization is the most important criterion for SIRST detection. Therefore, we adopt the probability of detection (Pd) and false-alarm rate (Fa) to evaluate localization ability and use IoU to evaluate shape description ability.
| Method Description | NUDT-SIRST (Tr=50%): IoU (%) / Pd (%) / Fa (×10⁻⁶) | NUAA-SIRST (Tr=50%): IoU (%) / Pd (%) / Fa (×10⁻⁶) |
|---|---|---|
| Filtering Based: Top-Hat [4-tophat] | 20.72 / 78.41 / 166.7 | 7.143 / 79.84 / 1012 |
| Filtering Based: Max-Median [5-maxmedian] | 4.197 / 58.41 / 36.89 | 4.172 / 69.20 / 55.33 |
| Local Contrast Based: WSLCM [9-WSLLCM] | 2.283 / 56.82 / 1309 | 1.158 / 77.95 / 5446 |
| Local Contrast Based: TLLCM [8-TLLCM] | 2.176 / 62.01 / 1608 | 1.029 / 79.09 / 5899 |
| Low Rank Based: IPI [10-IPI] | 17.76 / 74.49 / 41.23 | 25.67 / 85.55 / 11.47 |
| Low Rank Based: NRAM [11-NRAM] | 6.927 / 56.40 / 19.27 | 12.16 / 74.52 / 13.85 |
| Low Rank Based: RIPT [12-RIPT] | 29.44 / 91.85 / 344.3 | 11.05 / 79.08 / 22.61 |
| Low Rank Based: PSTNN [13-PSTNN] | 14.85 / 66.13 / 44.17 | 22.40 / 77.95 / 29.11 |
| Low Rank Based: MSLSTIPT [3-anti-miss] | 8.342 / 47.40 / 888.1 | 10.30 / 82.13 / 1131 |
| CNN Based: MDvsFA-cGAN [24-ICCV19] | 75.14 / 90.47 / 25.34 | 60.30 / 89.35 / 56.35 |
| CNN Based: ACM [22-ACM] | 67.08 / 95.97 / 10.18 | 70.33 / 93.91 / 3.728 |
| CNN Based: ALCNet [23-ALCNet] | 81.40 / 96.51 / 9.261 | 73.33 / 96.57 / 30.47 |
Intersection over Union (IoU) is a pixel-level evaluation metric that evaluates the profile description ability of an algorithm. It is calculated as the ratio of the intersection and union areas between the predictions and labels:

$$\mathrm{IoU} = \frac{A_{inter}}{A_{union}},$$

where $A_{inter}$ and $A_{union}$ represent the intersection and union areas, respectively.
Probability of Detection (Pd) is a target-level evaluation metric that measures the ratio of correctly predicted targets over all targets:

$$P_d = \frac{N_{correct}}{N_{all}},$$

where $N_{correct}$ and $N_{all}$ represent the numbers of correctly predicted targets and all targets, respectively. If the centroid deviation of a predicted target from the ground truth is less than the maximum allowed deviation, we consider the target correctly predicted. We set the maximum centroid deviation to 3 pixels in this paper.
False-Alarm Rate (Fa) is another target-level evaluation metric that measures the ratio of falsely predicted pixels over all image pixels:

$$F_a = \frac{P_{false}}{P_{all}},$$

where $P_{false}$ and $P_{all}$ represent the numbers of falsely predicted pixels and all image pixels, respectively. If the centroid deviation of a predicted target is larger than the maximum allowed deviation (3 pixels), its pixels are considered falsely predicted.
Receiver Operating Characteristic (ROC) curves are used to describe the changing trend of the probability of detection (Pd) under a varying false-alarm rate (Fa).
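The three metrics above can be computed as in the following sketch; the function names and the simple nearest-centroid matching rule are illustrative assumptions rather than the exact evaluation code.

```python
import numpy as np


def iou_score(pred_mask, gt_mask):
    """Pixel-level IoU between two binary masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union else 0.0


def pd_fa(pred_centroids, gt_centroids, false_pixels, total_pixels,
          max_dev=3.0):
    """Target-level Pd and Fa (illustrative sketch).

    A ground-truth target counts as detected if some predicted centroid
    lies within `max_dev` pixels of it; Fa is the ratio of falsely
    predicted pixels to all image pixels.
    """
    detected = 0
    for gt in gt_centroids:
        if any(np.hypot(gt[0] - p[0], gt[1] - p[1]) <= max_dev
               for p in pred_centroids):
            detected += 1
    pd = detected / max(len(gt_centroids), 1)
    fa = false_pixels / total_pixels
    return pd, fa
```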
As discussed in Section IV-E, we used the public NUAA-SIRST dataset [23-ALCNet] and our NUDT-SIRST dataset for both training and testing. All input images with different initial sizes were first resized to a resolution of 256×256. To accelerate network convergence, all images were normalized so that their values are centered at zero.
In this paper, we adopted a segmentation network as our baseline to generate a pixel-level segmentation map and then used a clustering algorithm to achieve target localization. The U-Net paradigm with ResNet [30_resnet] blocks was chosen as our segmentation backbone, and the number of down-sampling layers was set to 4. Our network was trained using the Soft-IoU loss function and optimized by the Adagrad method [27-Adagrad]. We initialized the weights and biases of our model using the Xavier method [28-xavier]. We set the batch size to 16 and the learning rate to 0.05. All models were implemented in PyTorch [29-pytorch] on a computer with an Intel i7-7700H @ 2.80 GHz CPU and an Nvidia GeForce 1080Ti GPU.
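The Soft-IoU loss mentioned above can be written as follows. This is one common formulation (sigmoid probabilities and a smoothed intersection-over-union ratio); the exact smoothing constant is an assumption of this sketch.

```python
import torch


def soft_iou_loss(logits, target, eps=1e-6):
    """Soft-IoU loss: 1 minus a differentiable IoU between the predicted
    probability map and the binary target mask.

    `logits` and `target` have shape (N, 1, H, W); `eps` (an assumed
    smoothing constant) avoids division by zero on empty masks.
    """
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(1, 2, 3))
    # soft union = |P| + |T| - |P ∩ T|, computed per sample
    union = (probs + target - probs * target).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()
```

Unlike per-pixel cross-entropy, this loss weights the few target pixels and the many background pixels through a single ratio, which suits the extreme foreground/background imbalance of SIRST images.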
To demonstrate the superiority of our method, we compare our DNANet with several state-of-the-art (SOTA) methods, including traditional methods (Top-Hat [4-tophat], Max-Median [5-maxmedian], WSLCM [9-WSLLCM], TLLCM [8-TLLCM], IPI [10-IPI], NRAM [11-NRAM], RIPT [12-RIPT], PSTNN [13-PSTNN], and MSLSTIPT [3-anti-miss]) and CNN-based methods (MDvsFA-cGAN [24-ICCV19], ACM [22-ACM], and ALCNet [23-ALCNet]) on the NUAA-SIRST and NUDT-SIRST datasets. For a fair comparison, we retrained all the CNN-based methods on the same training datasets as our DNANet, using our own implementations. Most of these open-source CNN-based methods were re-implemented in PyTorch and are released on our homepage: https://github.com/YeRen123455/Infrared-Small-Target-Detection.
For all the compared algorithms, we first obtained their predictions and then performed noise suppression by thresholding to remove low-response areas. Specifically, for traditional methods, an adaptive threshold was calculated by Eq. 13, which depends on the largest value, standard deviation, and average value of the output. For CNN-based methods, we followed their original papers and adopted their fixed thresholds (i.e., 0, 0, and 0.5 for ACM [22-ACM], ALCNet [23-ALCNet], and MDvsFA-cGAN [24-ICCV19], respectively). We kept all remaining parameters the same as in their original papers.
Quantitative results are shown in Table II. The improvements achieved by our DNANet over traditional methods are significant. That is because both NUDT-SIRST and NUAA-SIRST contain challenging images with different SCR, clutter backgrounds, target shapes, and target sizes, and our DNANet can learn discriminative features that are robust to such scene variations. In contrast, the traditional methods are usually designed for specific scenes (e.g., a specific target size and clutter background). Their manually selected parameters (e.g., the structure size in Top-Hat and the patch size in IPI) limit their generalization performance. Moreover, we also observe that the IoU improvements are obviously higher than the improvements of Pd and Fa. That is because the traditional methods mainly focus on the overall localization of the target instead of precise shape matching. This also validates our claim that using a pixel-level evaluation metric alone (such as IoU) introduces unfair comparison and leads to inaccurate conclusions.
As shown in Table II, the improvements achieved by DNANet over other CNN-based methods (i.e., MDvsFA-cGAN, ACM, and ALCNet) are obvious. That is because we redesign a new backbone network tailored for SIRST detection. The U-shape basic backbone with our dense nested interactive skip connection module can achieve progressive feature fusion and selectively enhance the informative features in deep CNN layers. Consequently, the intrinsic features of infrared small targets can be maintained and fully learned by the network. It is also worth noting that the improvements of our method on NUDT-SIRST are significantly higher than those on the NUAA-SIRST dataset. That is because our dataset contains more challenging scenes with various target sizes, types, and poses. Our channel and spatial attention module and feature pyramid fusion module help to learn discriminative features and thus achieve better performance.
Qualitative results on the two datasets (i.e., NUDT-SIRST and NUAA-SIRST) are shown in Fig. 7 and Fig. 8. Compared with traditional methods, our method produces outputs with precise target localization and shape segmentation under a very low false-alarm rate. In contrast, the traditional methods only perform well on point targets (e.g., image-3) and easily generate many false-alarm areas in local highlight regions (e.g., image-4 and image-6). Moreover, as shown in Fig. 9, we divided our NUDT-SIRST dataset into point target, spot target, and extended target subsets. With the increase of the ratio of spot and extended targets, traditional methods suffer a dramatic performance decrease while our DNANet maintains high accuracy. That is because the performance of traditional methods relies heavily on handcrafted features and cannot adapt to variations of target size. The CNN-based methods (i.e., MDvsFA-cGAN, ACM, and ALCNet) perform much better than traditional methods. However, due to the complicated scenes in our NUDT-SIRST, MDvsFA-cGAN produces many false-alarm and miss-detection areas (Fig. 8). Our DNANet is more robust to these scene changes. Moreover, our DNANet generates better shape segmentation than ALCNet. That is because our newly designed backbone can well adapt to various clutter backgrounds, target shapes, and target sizes, and thus achieves better performance.
In this subsection, we compare our DNANet with several variants to investigate the potential benefits introduced by our network modules and design choice.
The dense nested interactive skip-connection module is used to fuse features across different scale levels, enlarging receptive fields while maintaining fine-grained features at the finest scale level. To demonstrate the effectiveness of our DNIM, we introduced three network variants and made their model sizes comparable for fair comparison.
DNANet w/o DNIM: We replaced the dense nested interactive skip connection module with a regular plain skip connection module.
DNANet-left-to-right: As shown in Fig. 11(c), multiple U-shape subnetworks with different depths are stacked from left to right. Each node in the middle part of the network can receive features from its own and the lower layer.
DNANet-top-to-bottom: We stacked the U-shape subnetworks from top to bottom to generate DNANet-top-to-bottom, as shown in Fig. 11(b). Different from DNANet-left-to-right, this variant stacks U-shape subnetworks with three kinds of depth and only its core part uses tri-direction skip connection.
Table III shows the comparative results achieved by DNANet and its variants. It can be observed that DNANet w/o DNIM suffers decreases of 2.08% in IoU and 2.23% in Pd, and an increase of 4.298 in Fa on the NUDT-SIRST dataset. Similar results are observed on the NUAA-SIRST dataset. That is because DNIM progressively aggregates features at multiple scales to maintain the target information at the finest scale for better performance. The visualization maps shown in Fig. 10 also demonstrate the effectiveness of our DNIM: small targets are lost in the deep-layer feature maps of DNANet w/o DNIM (i.e., L(4,0), L(3,1)).
| Model | Model size (M) | NUDT-SIRST: IoU / Pd / Fa | NUAA-SIRST: IoU / Pd / Fa |
|---|---|---|---|
| DNANet w/o DNIM | 4.71 | 85.01 / 96.50 / 8.521 | 75.12 / 97.34 / 12.05 |
As shown in Table III, DNANet-left-to-right suffers decreases of 1.20% in IoU and 1.44% in Pd, and an increase of 0.426 in Fa compared with DNANet on the NUDT-SIRST dataset. That is because each node in DNANet-left-to-right only interacts with the deeper layer instead of fully interacting with the shallower, its own, and the deeper layers. The shallow layers have rich localization and profile information, but this information is not fully incorporated at the skip connection stage. Consequently, this variant has limited performance.
Compared to our DNANet, the variant DNANet-top-to-bottom suffers decreases of 1.34% in IoU and 1.77% in Pd, and an increase of 3.459 in Fa on the NUDT-SIRST dataset. That is because only the core part of this variant adopts the tri-direction skip connection, while the remaining part still uses the plain skip connection. Moreover, its tri-direction interactive area is relatively shallow, so high-level information cannot be fully exploited at shallow layers.
| Model | Model size (M) | NUDT-SIRST: IoU / Pd / Fa | NUAA-SIRST: IoU / Pd / Fa |
|---|---|---|---|
| DNANet w/o CSAM | 4.70 | 85.90 / 96.62 / 5.738 | 75.81 / 96.19 / 22.12 |
| DNANet w/o CSAM (element-wise summation) | 4.71 | 85.25 / 96.62 / 6.710 | 75.35 / 95.82 / 34.97 |
| DNANet w/o CA | 4.73 | 86.27 / 96.96 / 4.881 | 76.20 / 96.96 / 12.69 |
| DNANet w/o SA | 4.73 | 86.14 / 96.73 / 4.128 | 76.69 / 97.34 / 10.96 |
The channel and spatial attention module is used for adaptive feature enhancement to achieve better feature fusion. To investigate the benefits introduced by this module, we compare our DNANet with four variants. To achieve a fair comparison (i.e., comparable model size), we increased the number of filters of all convolution layers of the four variants to make their model sizes slightly larger than DNANet.
DNANet w/o CSAM: We removed the channel and spatial attention module in this variant and directly concatenated multi-layer features for subsequent processing.
DNANet w/o CSAM (Element-wise summation): We replaced CSAM with common element-wise summation in this variant to explore the effectiveness of CSAM. Specifically, we used 1×1 convolution operations and up-sampling/down-sampling to make features from different layers the same size. Then, element-wise summation is used to achieve multi-layer feature fusion.
DNANet w/o channel attention: We removed the channel attention operation in this variant to evaluate its contribution.
DNANet w/o spatial attention: We canceled the spatial attention operation in this variant to investigate the benefit introduced by spatial attention.
If CSAM is removed, the performance suffers decreases of 1.19% and 1.84% in IoU, 2.11% and 2.11% in Pd, and increases of 1.515 and 2.487 in Fa for DNANet w/o CSAM and DNANet w/o CSAM (element-wise summation) on the NUDT-SIRST dataset, respectively. Similar results are achieved on the NUAA-SIRST dataset. This clearly demonstrates the importance of the channel and spatial attention module. As shown in Fig. 12, with the help of CSAM, the feature maps from the deep layers of DNANet have high responses to informative cues, which finally results in precise shape segmentation.
As shown in Table IV, DNANet w/o channel attention suffers decreases of 0.82% in IoU and 1.77% in Pd, and an increase of 0.658 in Fa compared with DNANet on the NUDT-SIRST dataset. That is because the channel attention unit in our DNANet can better exploit informative channels to enhance the representation capability of features.
If the spatial attention unit is removed, the performance suffers decreases of 0.95% in IoU and 2.00% in Pd, and an increase of 0.095 in Fa compared with DNANet on the NUDT-SIRST dataset. That is because infrared small targets are easily immersed in heavy cloud and noise, making it hard to distinguish these small and dim targets from the background. Spatial attention encourages the network to pay attention to local informative areas and thus produces better results.
In this section, we evaluate the potential of our synthesized dataset for real SIRST tasks. Specifically, we mixed real SIRST images from NUAA-SIRST and synthesized SIRST images from NUDT-SIRST with different ratios to train our DNANet. With more real SIRST images included for training, the performance of our network is gradually improved. Even in the extreme situation with only 42 real images, our DNANet still achieves better performance than ACM trained with 100% real SIRST images. That is because our synthesized dataset can well cover the main challenges of infrared small target detection (i.e., different SCR, clutter backgrounds, target shapes, and target sizes). Consequently, the huge cost of collecting real SIRST images can be reduced.
Moreover, we compared the output of our network trained on the mixed dataset with the manually labeled masks of NUAA-SIRST in Fig. 13. It can be observed that the output of our network has more reasonable shape segmentation. That is because the synthesized SIRST images have precise labels. The network can learn the essence of infrared small targets from sufficiently well-labeled data, which finally contributes to improvements on real SIRST images. As a result, our network can even generate better visual results than the ground truth labels.
| # Real images | # Synthesized images | IoU / Pd / Fa |
|---|---|---|
| 0% (0/427) | 50% (663/1327) | 61.69 / 88.97 / 30.26 |
| 10% (42/427) | 50% (663/1327) | 68.26 / 94.67 / 14.11 |
| 20% (85/427) | 50% (663/1327) | 72.38 / 95.63 / 8.932 |
| 30% (128/427) | 50% (663/1327) | 74.53 / 97.71 / 4.341 |
In this paper, we propose DNANet to achieve SIRST detection. Different from existing CNN-based SIRST detection methods, we explicitly handle the problem of small targets being lost in deep layers by designing a new tri-direction dense nested interactive module with a cascaded channel and spatial attention module. The intrinsic information of small targets can be incorporated and fully exploited by repeated fusion and enhancement. Moreover, we develop an open SIRST dataset to evaluate the performance of infrared small target detection under challenging scenes, and we adopt a reorganized set of evaluation metrics. Experiments on both our dataset and the public dataset have shown the superiority of our method over state-of-the-art methods.