MoCoPnet
Infrared small target super-resolution (SR) aims to recover reliable and detailed high-resolution images with high-contrast targets from their low-resolution counterparts. Since infrared small targets lack color and fine structure information, it is essential to exploit the supplementary information among sequence images to enhance the target. In this paper, we propose the first infrared small target SR method, named local motion and contrast prior driven deep network (MoCoPnet), to integrate the domain knowledge of infrared small targets into a deep network, which can mitigate the intrinsic feature scarcity of infrared small targets. Specifically, motivated by the local motion prior in the spatio-temporal dimension, we propose a local spatio-temporal attention module to perform implicit frame alignment and incorporate the local spatio-temporal information to enhance the local features (especially for small targets). Motivated by the local contrast prior in the spatial dimension, we propose a central difference residual group to incorporate central difference convolution into the feature extraction backbone, which can achieve center-oriented gradient-aware feature extraction to further improve the target contrast. Extensive experiments demonstrate that our method can recover accurate spatial dependency and improve the target contrast. Comparative results show that MoCoPnet outperforms the state-of-the-art video SR and single image SR methods in terms of both SR performance and target enhancement. Based on the SR results, we further investigate the influence of SR on infrared small target detection, and the experimental results demonstrate that MoCoPnet promotes the detection performance. The code is available at https://github.com/XinyiYing/MoCoPnet.
Infrared imaging systems work in all weather conditions, day and night, and offer high penetrability, sensitivity and concealment. They are widely used in security monitoring, remote sensing investigation, aerospace offense-defense and other military missions. However, low-resolution (LR) infrared images cannot meet the high requirements of practical military missions. Therefore, it is necessary to improve the resolution of infrared images. A straightforward way to obtain high-resolution (HR) infrared images is to increase the size of infrared sensor arrays. However, due to the technical limitations of sensors and the high cost of large infrared sensor arrays, it is necessary and important to develop practical, low-cost and highly reliable infrared image super-resolution (SR) algorithms. Note that, modern autonomous driving technology requires the infrared imaging system to detect targets at fairly long distances. Therefore, a target occupies only a very small proportion of the whole image and is susceptible to noise and clutter. In this paper, we mainly focus on the infrared small target SR task and investigate its influence on infrared small target detection.
The special imaging mechanism and military applications of infrared imaging systems impose the following requirements on infrared small target SR: 1) High fidelity of super-resolved images. Noise and false contours should be avoided as much as possible. 2) High contrast of super-resolved targets. The target contrast in the super-resolved images should be strengthened to boost subsequent tasks. 3) High robustness to complex scenes and noise. Small targets are sometimes submerged in clutter and thus have low local contrast against the background. SR algorithms should be robust to various complex scenes and imaging noise. 4) High generalization with insufficient datasets. The lack of infrared image datasets requires that SR algorithms achieve stable results with a relatively small dataset.
The motivations of our method come from data analysis and can be summarized as follows: 1) The target occupies a small proportion of the whole infrared image (generally less than 0.12% [10]) and lacks color and fine structure information (e.g., contour, shape and texture). Little information is available for SR within a single image. Therefore, we perform SR on image sequences to use the supplementary information along the temporal dimension to improve the SR performance and the target contrast. 2) Due to the long distance between the target and the imaging system, the mobility of targets on the imaging plane is limited, leading to small motion of the target between neighboring frames (i.e., the local motion prior [51, 66, 67] in the spatio-temporal dimension). Therefore, we design a local spatio-temporal attention (LSTA) module to perform implicit frame alignment and exploit the supplementary information in the local spatio-temporal neighborhood to enhance the local features (especially for small targets). 3) Compared with background clutter, the contrast and gradient between the target and the background in the local neighborhood are high in all directions (i.e., the local contrast prior [57, 11] in the spatial dimension). Therefore, we design a central difference residual group (CD-RG) to achieve center-oriented gradient-aware feature extraction, which can encode the local contrast prior to further improve the target contrast.
Based on the above observations, we propose a local motion and contrast prior driven deep network (MoCoPnet) for infrared small target SR. The main contributions can be summarized as follows: 1) We propose the first infrared small target SR method, named local motion and contrast prior driven deep network (MoCoPnet), and summarize the definition and requirements of this task. The proposed modules of MoCoPnet (i.e., the central difference residual group and the local spatio-temporal attention module) integrate the domain knowledge of infrared small targets (i.e., the local contrast prior and the local motion prior) into deep networks, which can mitigate the intrinsic feature scarcity of data-driven approaches [11]. 2) The experimental results demonstrate that MoCoPnet achieves state-of-the-art SR performance and effectively improves the target contrast. 3) Based on the SR results, we further investigate the influence of SR on infrared small target detection. The experimental results show that MoCoPnet promotes the detection performance, achieving high signal-to-noise ratio gain (SNRG), signal-to-clutter ratio gain (SCRG) and contrast gain (CG) scores, and improved receiver operating characteristic (ROC) results.
Image SR is an inherently ill-posed optimization problem that has been investigated for decades. In the literature, researchers have proposed a variety of classic single image SR (SISR) methods, including prediction-based methods [15, 29, 33], edge-based methods [64, 16], statistics-based methods [36, 86], patch-based methods [16, 17, 6, 19] and sparse representation methods [88, 89]. However, most of these traditional methods use handcrafted features to reconstruct HR images, which cannot formulate the complex SR process and thus limits the SR performance. Recently, thanks to their powerful feature representation capability, convolutional neural networks (CNNs) have been widely used in the single image SR task and achieve state-of-the-art performance
[97, 78, 77, 81, 82]. Dong et al. [13, 14] proposed the pioneering CNN-based work SRCNN to recover an HR image from its LR counterpart. Kim et al. [34] deepened the network to 20 convolutional layers (i.e., VDSR) and achieved improved SR performance by increasing model complexity. Moreover, various increasingly deep and complex architectures (e.g., residual networks [48, 1, 45], recursive networks [35, 69, 68], densely connected networks [72, 99, 23, 44], attention-based networks [7, 97, 9]) have also been applied to SISR for performance improvement. Other than tackling average image distortion with norm losses, generative adversarial SR networks [42, 61, 80] employ a perceptual loss to improve perceptual quality.

Existing video SR methods commonly follow a three-step pipeline, including feature extraction, motion compensation and reconstruction [91]. Traditional video SR methods [50, 4, 49, 56, 87, 3] employ handcrafted models to estimate motion, noise and blur kernels and to reconstruct HR video sequences. Recent deep learning-based video SR methods exploit spatio-temporal information better through their powerful feature representation capability and achieve state-of-the-art performance. Liao et al. [47] proposed the pioneering CNN-based video SR method, which performs motion compensation by optical flow and then ensembles the compensated drafts via a CNN. Afterwards, a series of optical flow-based video SR algorithms [62, 5, 70, 60, 76, 24] emerged to explicitly perform motion estimation and frame alignment, which can introduce blurring and duplication artifacts [54, 30]. To avoid this problem, deformable convolution [8, 103] has been employed to perform motion compensation in a unified step [71, 79, 73] through extra offsets. Apart from these explicit motion compensation methods, implicit approaches (e.g., 3D convolution networks [31, 46, 37], recursive networks [27, 21, 102, 85], non-local networks [79, 90, 84]) have also been applied to video SR for performance improvement.

With the increased demand for high-resolution infrared images, some researchers have performed SR on infrared images. Traditional methods [55, 101] treat SR as sparse signal reconstruction in compressive sensing. Building on these studies, Zhang et al. [96] combined compressive sensing and deep learning to achieve improved SR performance with low computational cost. Han et al. [22] employed CNNs to recover high-frequency components from upscaled LR images to generate the SR results. He et al. [25] proposed a cascaded deep network with multiple receptive fields for large scale factor (×8) infrared image SR. Liu et al. [52] proposed to use a generative adversarial network and a perceptual loss to reconstruct the texture details of infrared images.
Since the importance of each spatial location and channel is not uniform, Hu et al. [26] proposed SeNet for classification, which consists of selection units that control whether data are passed on. Zhang et al. [97] proposed a channel attention mechanism that calculates the importance along the channel dimension for channel selection. Anwar et al. [2] proposed feature attention to urge the network to pay more attention to high-frequency regions. Dai et al. [9] proposed second-order attention to adaptively readjust features for powerful feature correlation learning. Wang et al. [74] explored the sparsity in the SR task and proposed sparse masks for efficient inference: the spatial mask and the channel mask calculate the importance along the spatial and channel dimensions to prune redundant computations. The aforementioned studies only consider the global importance along the spatial and channel dimensions. Since small targets occupy only a small portion of the whole image and have high contrast with their local neighborhood, we design a local attention mechanism that can better characterize small targets.
Sequence image infrared small target detection is significant for long-range precision strikes, aerospace offensive-defensive countermeasures and remote sensing intelligence reconnaissance. According to whether sequential information is used, sequence image infrared small target detection methods can be divided into two categories: detect-before-track (DBT) methods and track-before-detect (TBD) methods. Based on the results of single image infrared small target detection [41, 39, 18, 12, 94, 53], DBT methods employ the motion trajectories of targets, obtained through sequence image projection, to eliminate false targets and reduce the false alarm rate. DBT methods have low computational cost and are easy to implement; however, their performance drops rapidly at low SNR. TBD methods [58, 93, 20] commonly follow a three-step pipeline, including background suppression, region-of-interest extraction and target detection. TBD methods are robust to images with low SNR but have high computational cost, which cannot meet the requirements of real-time detection. It remains challenging to achieve a high detection rate and a low false alarm rate in real time due to the lack of target information, complex background noise, insufficient public datasets, and the explosion of data volume and computational cost. Therefore, it is necessary to recover reliable image details and enhance the contrast between target and background for detection.
In this section, we introduce our method in detail. Specifically, Section III-A introduces the overall framework of our network, and Sections III-B and III-C introduce the two modules that integrate the local contrast prior and the local motion prior of infrared small targets into deep networks.
The overall framework of our MoCoPnet is shown in Fig. 1. Specifically, an image sequence with 5 frames is first sent to a convolutional layer to generate the initial features, which are then sent to the central difference residual group (CD-RG) to achieve center-oriented gradient-aware feature extraction. Then, each neighborhood feature is paired with the reference feature and sent to two local spatio-temporal attention (LSTA) modules to achieve motion compensation and enhance the local features. Next, the reference feature is concatenated with two compensated neighborhood features and sent to a residual group (RG) and a convolutional layer for coarse fusion. Afterwards, the two fused features are concatenated and sent to an RG and a convolutional layer for fine fusion. Then, the fused feature is processed by an RG, a sub-pixel layer and a convolutional layer for SR reconstruction and upsampling. Finally, the SR reference frame is obtained by adding the bicubically upsampled LR reference frame, which accelerates training convergence. Note that, the number of input frames is set to 7 in this paper and the process is the same as in Fig. 1(a). We use the mean square error (MSE) between the SR reference frame and the groundtruth reference frame as the loss function of our network.
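To make the data flow concrete, below is a minimal PyTorch sketch of the reconstruction tail and training loss; the module layout and names are illustrative stand-ins, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconstructionTail(nn.Module):
    """Sketch of the tail: residual body -> sub-pixel upsampling -> conv,
    plus a global bicubic skip connection from the LR reference frame."""
    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.scale = scale
        self.body = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in for the RG
        self.up = nn.Sequential(
            nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))                              # sub-pixel layer
        self.out = nn.Conv2d(channels, 1, 3, padding=1)          # back to image space

    def forward(self, fused_feat, lr_ref):
        residual = self.out(self.up(self.body(fused_feat)))
        base = F.interpolate(lr_ref, scale_factor=self.scale,
                             mode='bicubic', align_corners=False)
        return residual + base                                   # SR reference frame

tail = ReconstructionTail()
sr = tail(torch.randn(1, 64, 32, 32), torch.randn(1, 1, 32, 32))
loss = F.mse_loss(sr, torch.randn(1, 1, 128, 128))               # MSE training loss
```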
The central difference residual group (CD-RG) incorporates central difference convolution (CD-Conv [92, 63]) into the residual group (RG [97, 99]) to achieve center-oriented gradient-aware feature extraction, which can exploit the spatial local saliency prior to strengthen the contrast of small targets. Note that, we employ RG as the backbone of our MoCoPnet for the following reasons: RG can generate features with a large receptive field and dense sampling rate, which promotes information exploitation, and the reuse of hierarchical features not only improves the SR performance [83] but also preserves the information of small targets [10, 43].
The architecture of the central difference residual group (CD-RG) is shown in Fig. 1(b). The input feature is first fed to central difference residual dense blocks [98] (CD-RDBs) to extract hierarchical features. Then, the hierarchical features are concatenated and fed to a 1×1 convolutional layer to generate the output feature. As shown in Fig. 1(b1), a CD-Conv and Convs with a growth rate of G are used within each CD-RDB to achieve dense feature representation. The architecture of CD-Conv is shown in Fig. 1(b2). CD-Conv aggregates the center-oriented gradient information, which echoes the spatial local saliency prior of infrared small targets. As shown in Fig. 2, different from the handcrafted dilated local contrast measure (DLCM [11]), which can only preserve the contrast information in one direction, CD-Conv is a learnable measure that can improve the contrast of small targets while maintaining the background information. In conclusion, CD-Conv is more in line with the task of infrared small target SR (i.e., recovering a reliable and detailed high-resolution image with high-contrast targets). DLCM and CD-Conv can be formulated as:
f_DLCM(x_0) = min_{i ∈ {1, …, 8}} (x_0 − x_i)    (1)

f_CD-Conv(x_0) = θ · Σ_{i=1}^{8} w_i (x_i − x_0) + (1 − θ) · Σ_{i=0}^{8} w_i x_i    (2)
where x_0 represents the value of a specific location in the feature map, x_i denotes the value at its i-th neighboring position, and i is the direction index. w_i is a learnable weight to continuously optimize the local contrast measure, and θ is a hyperparameter to balance the contribution between gradient-level detailed information and intensity-level semantic information.
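For reference, a minimal PyTorch sketch of such a central difference convolution is given below; the decomposition (vanilla response minus a center-weighted 1×1 convolution) follows Eq. (2), and the class and argument names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDConv2d(nn.Module):
    """3x3 convolution mixing a vanilla term and a central-difference term."""
    def __init__(self, in_ch, out_ch, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.theta = theta

    def forward(self, x):
        vanilla = self.conv(x)
        # Central-difference term: sum_i w_i * (x_i - x_0) equals the vanilla
        # response minus the center value weighted by the kernel sum.
        w_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)  # (out, in, 1, 1)
        center = F.conv2d(x, w_sum)
        return self.theta * (vanilla - center) + (1 - self.theta) * vanilla

out = CDConv2d(64, 64)(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```

With θ = 0, CDConv2d reduces to a vanilla convolution; with θ = 1, it becomes a purely gradient-level (contrast) operator.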
Note that, θ is set to 0.7 [92] in our paper.

The local spatio-temporal attention (LSTA) module calculates the local response between a neighborhood frame and the reference frame, and uses the local spatio-temporal information to enhance the local features of the reference frame. The inputs of LSTA are the reference frame and one neighborhood frame; for a sequence with 7 frames, the operation needs to be repeated 6 times. The architecture of LSTA is shown in Fig. 1(c). The red reference feature F_r and the blue neighborhood feature F_n are first fed to 1×1 convolutional layers for dimension compression, where the compression ratio r is set to 8 in our paper. The process can be formulated as:
F'_r = C_r(F_r),  F'_n = C_n(F_n)    (3)

where C_r and C_n represent 1×1 convolutions. Then, we calculate the response between each location p in F'_r and the corresponding local neighborhood (centered at p) in F'_n. Afterwards, the responses are summed along the channel dimension and normalized by a softmax to generate the attention map A. The process is defined as:
A_j(p) = softmax_j ( Σ_c F'_r(c, p) · F'_n(c, p + d·Δ_j) )    (4)

where p + d·Δ_j (j = 1, …, k²) indexes the local neighborhood centered at p with kernel size k and dilation rate d. The purple 3×3 grid in Fig. 1(c) is the local attention map with parameters (k=3, d=1). Note that, as shown in Figs. 3(c) and (d), d can be an integer larger than 1 to enlarge the receptive field without additional computational cost. As shown in Figs. 3(e) and (f), d can also be fractional to capture the sub-pixel motion between frames, in which case we employ bilinear interpolation to generate the exact corresponding values.
Finally, a dot product is performed between the local neighborhood features centered at p and the corresponding attention map to generate the value at location p of the output feature F_out. The process is formulated as:

F_out(p) = Σ_{j=1}^{k²} A_j(p) · F_n(p + d·Δ_j)    (5)
In summary, LSTA first calculates the response between the reference frame and an adjacent frame to generate the attention map, and then computes a weighted summation of the adjacent frame's local neighborhoods using the generated attention map. In this way, the neighborhood frames can be implicitly aligned, and the complementary temporal information can be incorporated to enhance the features of small targets.
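Eqs. (3)-(5) can be sketched in PyTorch as follows (integer dilation only; a fractional d as in Figs. 3(e) and (f) would require bilinear sampling, e.g. via torch.nn.functional.grid_sample; names are illustrative, not the released implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTA(nn.Module):
    def __init__(self, channels, k=3, d=1, r=8):
        super().__init__()
        self.k, self.d = k, d                                   # kernel size, dilation
        self.conv_ref = nn.Conv2d(channels, channels // r, 1)   # C_r in Eq. (3)
        self.conv_nbr = nn.Conv2d(channels, channels // r, 1)   # C_n in Eq. (3)

    def forward(self, f_ref, f_nbr):
        b, c, h, w = f_nbr.shape
        q = self.conv_ref(f_ref)                                # (B, C/r, H, W)
        key = self.conv_nbr(f_nbr)
        pad = self.d * (self.k - 1) // 2
        # k*k dilated local neighborhood of every position, Eq. (4).
        nbr = F.unfold(key, self.k, dilation=self.d, padding=pad)
        nbr = nbr.view(b, -1, self.k ** 2, h, w)                # (B, C/r, k^2, H, W)
        attn = (q.unsqueeze(2) * nbr).sum(dim=1)                # channel-wise sum
        attn = F.softmax(attn, dim=1)                           # (B, k^2, H, W)
        # Weighted sum of the uncompressed neighborhood features, Eq. (5).
        val = F.unfold(f_nbr, self.k, dilation=self.d, padding=pad)
        val = val.view(b, c, self.k ** 2, h, w)
        return (val * attn.unsqueeze(1)).sum(dim=2)             # (B, C, H, W)

out = LSTA(64, k=3, d=3)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```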
In this section, we first introduce the experiment settings, and then conduct ablation studies to validate our method. Next, we compare our network to several state-of-the-art SISR and video SR methods. Finally, we investigate the influence of SR on infrared small target detection.
In this subsection, we sequentially introduce the datasets, the evaluation metrics, the network parameters and the training details.
Hui et al. [28] developed a dataset for the detection and tracking of dim-small aircraft infrared targets under ground/air backgrounds. This dataset contains 22 image sequences (16,177 frames in total) with a resolution of 256×256. Recently, a large-scale high-quality semi-synthetic dataset (named SAITD [65]) has been proposed for small aerial infrared target detection. The SAITD dataset contains 350 image sequences with a resolution of 640×512 (175 image sequences with target annotations and 175 without, 150,185 images in total). The 2nd Anti-UAV Workshop & Challenge (Anti-UAV [100]) released 250 high-quality infrared video sequences with multi-scale UAV targets. In this paper, we employ 50 annotated sequences of SAITD as the test set and the remaining 300 sequences as the training set. In addition, we employ Hui and Anti-UAV as test datasets to evaluate the robustness of our MoCoPnet in real scenes. In the Anti-UAV dataset, only the sequences with infrared small targets [10] (21 sequences in total) are selected as the test set. Note that, we only use the first 100 images of each sequence for testing to balance computational/time cost and generalization performance.
We employ peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) to evaluate the SR performance. In addition, we introduce the signal-to-noise ratio (SNR) and contrast ratio (CR) in the local background neighborhood [18] of targets to evaluate the performance of recovering small targets. As shown in Fig. 4(a), the size of the target area is a×b, and the local background neighborhood extends the target area by d in both width and height. Note that, the parameters (a, b, d) of the local background neighborhood in HR images are set per dataset for SAITD (the synthetic target size in SAITD is preset to less than 7×7), Hui and Anti-UAV (the target size is less than 0.12% of the image size [10], i.e., of 256×256 in Hui and 640×512 in Anti-UAV). When 4× SR is performed on HR images, the parameters (a, b, d) are scaled up accordingly; when 4× downsampling is performed on HR images, they are scaled down accordingly.
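For reference, a minimal sketch of cropping the target area and its local background neighborhood is given below; the centering convention and border handling are assumptions, and all names are illustrative:

```python
import numpy as np

def target_and_background(img, cx, cy, a, b, d):
    """Crop the a x b target area centered at (cx, cy) and return it together
    with the surrounding background neighborhood, which extends the target
    area by d pixels in both width and height (cf. Fig. 4(a))."""
    y0, y1 = cy - b // 2, cy + b // 2 + 1
    x0, x1 = cx - a // 2, cx + a // 2 + 1
    target = img[y0:y1, x0:x1]
    window = img[max(y0 - d, 0):y1 + d, max(x0 - d, 0):x1 + d].astype(np.float64)
    # Mask out the target area so only the background ring remains.
    background = window.copy()
    ty0, tx0 = y0 - max(y0 - d, 0), x0 - max(x0 - d, 0)
    background[ty0:ty0 + (y1 - y0), tx0:tx0 + (x1 - x0)] = np.nan
    return target, background[~np.isnan(background)]
```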
To further evaluate the impact of SR algorithms on infrared small target detection, we adopt the SNR gain (SNRG), background suppression factor (BSF), signal-to-clutter ratio gain (SCRG), contrast gain (CG) and receiver operating characteristic (ROC) curve for comprehensive evaluation. Note that, the common detection evaluation metrics calculate the ratio of statistics in the local background neighborhood before and after detection. Since we first super-resolve the LR image and then perform detection, the inputs of the detection algorithms, which are the outputs of different SR algorithms, differ. Therefore, directly using the common detection evaluation metrics cannot accurately evaluate the impact of SR on detection. To eliminate the influence of different inputs, we modify the first four metrics to calculate the ratio of statistics in the local background neighborhood between the LR image before SR and the HR target image after detection. The modified evaluation metrics are illustrated in Fig. 4(b). We now introduce these evaluation metrics in detail. SNRG is used to measure the SNR improvement of detection algorithms and is formulated as:
SNRG = SNR_out / SNR_in    (6)

where the subscripts "in" and "out" denote metrics computed in the local background neighborhood of the LR image (before SR) and of the HR target image (after detection), respectively; the SNR is computed from T_max and B_max, the maximum values of the target area and the background area, respectively. BSF is used to measure the background suppression effect and is formulated as:
BSF = σ_in / σ_out    (7)

where σ is the standard deviation of the background area.
Table I. PSNR/SSIM and SNR/CR results achieved by the DLCM, Conv and CD-Conv variants.

| Dataset | Variant | PSNR | SSIM | SNR | CR |
|---|---|---|---|---|---|
| SAITD | DLCM | 26.37 | 0.725 | 0.664 | 14.200 |
| SAITD | Conv | 27.92 | 0.798 | 0.678 | 14.250 |
| SAITD | CD-Conv | 28.17 | 0.807 | 0.678 | 14.259 |
| Hui | DLCM | 32.32 | 0.832 | 0.820 | 15.167 |
| Hui | Conv | 33.00 | 0.854 | 0.846 | 15.198 |
| Hui | CD-Conv | 33.12 | 0.857 | 0.859 | 15.203 |
| Anti-UAV | DLCM | 31.44 | 0.901 | 0.946 | 6.739 |
| Anti-UAV | Conv | 31.85 | 0.913 | 0.960 | 6.696 |
| Anti-UAV | CD-Conv | 31.85 | 0.914 | 0.965 | 6.709 |
| Avg. | DLCM | 30.04 | 0.820 | 0.810 | 12.035 |
| Avg. | Conv | 30.93 | 0.855 | 0.828 | 12.048 |
| Avg. | CD-Conv | 31.05 | 0.859 | 0.834 | 12.057 |
SCRG is used to measure the SCR improvement of detection algorithms and is formulated as:
SCRG = SCR_out / SCR_in,  with SCR = |μ_T − μ_B| / σ_B    (8)
where μ_T and μ_B are the mean values of the target area and the background area, respectively, and σ_B is the standard deviation of the background area. CG is used to measure the improvement of the contrast between targets and background and is formulated as:
CG = CR_out / CR_in    (9)
Note that, in order to avoid values of "Inf" (i.e., the denominator is zero) and "NaN" (i.e., both the numerator and the denominator are zero), we add a small positive constant ε to each denominator in Eqs. (6)-(9) to prevent it from being zero. The ROC curve is used to measure the trend between the detection probability P_d and the false alarm probability F_a, which are formulated as:

P_d = N_true / N_target    (10)

F_a = N_false / N_pixel    (11)
where N_true and N_false are the numbers of true and false detections, and N_target and N_pixel are the number of targets and the number of image pixels, respectively. Note that, the criterion for a true detection is that the distance between the detected location and the groundtruth location is less than a threshold, which is set to 10 pixels [65] in our paper.
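A compact sketch of these metrics is given below; the exact SNR and CR formulas were lost in extraction, so the forms used here (ratios of region maxima and means) are assumptions, as is the ε value:

```python
import numpy as np

EPS = 1e-10  # small constant added to denominators (exact value unspecified)

def snr(t, b):  return t.max() / (b.max() + EPS)                  # assumed form
def cr(t, b):   return t.mean() / (b.mean() + EPS)                # assumed form
def scr(t, b):  return abs(t.mean() - b.mean()) / (b.std() + EPS) # Eq. (8)

def gains(t_in, b_in, t_out, b_out):
    """Ratios between the LR image before SR ("in") and the HR target
    image after detection ("out"), following Eqs. (6)-(9)."""
    return {
        "SNRG": snr(t_out, b_out) / (snr(t_in, b_in) + EPS),
        "BSF":  b_in.std() / (b_out.std() + EPS),
        "SCRG": scr(t_out, b_out) / (scr(t_in, b_in) + EPS),
        "CG":   cr(t_out, b_out) / (cr(t_in, b_in) + EPS),
    }

def roc_point(n_true, n_false, n_target, n_pixel):
    """Detection probability and false alarm probability, Eqs. (10)-(11)."""
    return n_true / n_target, n_false / n_pixel
```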
The parameters of the CD-RG in the feature extraction stage are CD-RG(D=4, C=6, G=32), and the parameters of the RGs are RG1,2(D=1, C=4, G=64) and RG3(D=8, C=6, G=32), where D, C and G denote the number of (CD-)RDBs, the number of convolutional layers per block and the growth rate, respectively. The parameters of the two LSTAs are LSTA1(k=3, d=3) and LSTA2(k=3, d=1).
During the training phase, we randomly extracted 7 consecutive frames from an LR video clip and randomly cropped a 64×64 patch as the input. Meanwhile, the corresponding patch in the HR video clip was cropped as the groundtruth. We followed [76, 75] to augment the training data by random flipping and rotation.
All experiments were implemented on a PC with an Nvidia RTX 3090 GPU. The networks were optimized using the Adam method [38] with β1 = 0.9 and β2 = 0.999, and the batch size was set to 12. The initial learning rate was halved at 10K, 20K and 60K iterations. We trained our network from scratch for 100K iterations.
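A minimal training-loop sketch matching these settings is given below (the model class, data loader and initial learning rate are placeholders, since the initial rate was not recoverable from the text):

```python
import torch
import torch.nn.functional as F

def train(model, loader, base_lr, iters=100_000,
          milestones=(10_000, 20_000, 60_000)):
    """`loader` is assumed to yield (lr_clip, hr_ref) pairs: 7 consecutive LR
    frames (random 64x64 crops with flip/rotation augmentation) and the
    matching HR reference patch."""
    opt = torch.optim.Adam(model.parameters(), lr=base_lr, betas=(0.9, 0.999))
    step = 0
    while step < iters:
        for lr_clip, hr_ref in loader:
            loss = F.mse_loss(model(lr_clip), hr_ref)  # MSE training loss
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step in milestones:                     # halve the learning rate
                for g in opt.param_groups:
                    g['lr'] *= 0.5
            if step >= iters:
                break
```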
In this subsection, we conduct ablation experiments to validate our design choice.
To demonstrate the effectiveness of our central difference residual group (CD-RG), we replace all the CD-Convs in CD-RG with plain Convs (i.e., a vanilla residual group) and retrain the network from scratch. The experimental results in Table I show that CD-RG (i.e., CD-Conv) introduces 0.12dB/0.004 gains on PSNR/SSIM and 0.006/0.009 gains on SNR/CR on average. This demonstrates that CD-RG can exploit the spatial local contrast prior to effectively improve the SR performance and the target contrast.
In addition, we visualize the feature maps generated by the residual group (RG) and CD-RG with a toy example in Fig. 5. Note that, the visualization maps are the L2 norm results along the channel dimension [40, 43], and the red and blue boxes represent target areas and edge areas, respectively. As illustrated in Fig. 5(a), the input frame of the image sequence consists of a target of size 3×3 (i.e., the white cube at the top) and clutter (i.e., the white area at the bottom). It can be observed from Figs. 5(b) and (c) that the target contrast in the feature map extracted by CD-RG is higher than that of RG. This demonstrates that CD-RG can enhance the target contrast (from 7.41 to 13.55). In addition, CD-RG can also improve the contrast between high-frequency edges and the background (from 6.64 to 13.59). This is because CD-RG aggregates gradient-level information and concentrates more on high-frequency edge information, thus improving the SR performance and the target contrast simultaneously.
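The channel-wise L2 norm used for these visualizations can be computed in one line, e.g.:

```python
import torch

def channelwise_l2_map(feat: torch.Tensor) -> torch.Tensor:
    """Collapse a (B, C, H, W) feature map to a (B, H, W) visualization map
    by taking the L2 norm along the channel dimension [40, 43]."""
    return feat.norm(p=2, dim=1)
```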
Moreover, we conduct an ablation experiment that replaces all the CD-Convs in MoCoPnet with DLCMs. Note that, the training of MoCoPnet with DLCMs is unstable, with sudden loss divergence caused by gradient fracture. By contrast, CD-Conv preserves the image feature information and updates all pixels, which keeps gradient propagation continuous. The ablation results in Table I show that CD-Conv introduces a significant performance gain on PSNR/SSIM (i.e., 1.01dB/0.039 on average) and further improves the contrast of small targets (i.e., 0.024/0.022 SNR/CR gains on average).
Table II. Ablation results of the LSTA variants; parameters are given as (k, d).

| Variant | PSNR | SSIM | SNR | CR |
|---|---|---|---|---|
| w/o LSTA | 30.77 | 0.851 | 0.813 | 12.044 |
| C-LSTA[(3,1),(3,1)] | 31.02 | 0.857 | 0.831 | 12.054 |
| LSTA(3,1) | 30.89 | 0.854 | 0.825 | 12.051 |
| LSTA(9,1/4) | 30.96 | 0.855 | 0.828 | 12.051 |
| P-LSTA[(3,1),(3,3),(3,5)] | 30.98 | 0.857 | 0.829 | 12.052 |
| OFM (optical flow module) | 30.94 | 0.855 | 0.819 | 12.048 |
| MoCoPnet: C-LSTA[(3,3),(3,1)] | 31.05 | 0.859 | 0.834 | 12.057 |
In MoCoPnet, two cascaded LSTAs with parameters LSTA1(k=3, d=3) and LSTA2(k=3, d=1) are used to enhance the spatio-temporal local features of the sequence images in a coarse-to-fine manner. To validate the effectiveness of this design choice, we first remove the LSTAs from MoCoPnet and name the model "w/o LSTA". In addition, we further conduct ablation experiments to investigate the influence of the parameters, number, sub-pixel information exploitation and arrangement of LSTAs on SR performance. Specifically, we first replace the LSTAs in MoCoPnet with two cascaded LSTAs with parameters (k=3, d=1) and name the model C-LSTA[(3,1),(3,1)]. Secondly, we replace the LSTAs with a single LSTA with parameters (k=3, d=1), named LSTA(3,1). Thirdly, we replace the LSTAs with a single LSTA with parameters (k=9, d=1/4), named LSTA(9,1/4). Fourthly, we replace the LSTAs with three parallel LSTAs with parameters (k=3, d=1), (k=3, d=3) and (k=3, d=5), named P-LSTA[(3,1),(3,3),(3,5)].
Table III. PSNR/SSIM results achieved by different SR methods.

| Dataset | Bicubic | VSRnet [32] | VESPCN [5] | RCAN [97] | SOF-VSR [76] | TDAN [71] | D3Dnet [91] | MoCoPnet |
|---|---|---|---|---|---|---|---|---|
| SAITD | 25.37 / 0.663 | 26.03 / 0.706 | 26.57 / 0.735 | 26.58 / 0.735 | 26.97 / 0.753 | 26.11 / 0.709 | 27.81 / 0.794 | 28.17 / 0.807 |
| Hui | 31.43 / 0.809 | 32.03 / 0.828 | 32.33 / 0.835 | 32.44 / 0.836 | 32.55 / 0.841 | 32.17 / 0.830 | 32.84 / 0.850 | 33.12 / 0.857 |
| Anti-UAV | 30.76 / 0.889 | 31.42 / 0.904 | 31.63 / 0.910 | 31.73 / 0.912 | 31.68 / 0.912 | 31.58 / 0.905 | 31.81 / 0.911 | 31.85 / 0.914 |
| Average | 29.19 / 0.787 | 29.83 / 0.813 | 30.18 / 0.827 | 30.25 / 0.828 | 30.40 / 0.835 | 29.95 / 0.815 | 30.82 / 0.852 | 31.05 / 0.859 |
Table IV. SNR and CR scores in the local background neighborhood. Part (a): 4× SR results of LR images (HR resolutions: 640×512 for SAITD, 256×256 for Hui, 640×640 for Anti-UAV). Part (b): 4× SR results of HR images (resolutions: 2560×2048, 2048×2048 and 2560×2048).

(a)

| Method | SAITD SNR | SAITD CR | Hui SNR | Hui CR | Anti-UAV SNR | Anti-UAV CR | Avg. SNR | Avg. CR |
|---|---|---|---|---|---|---|---|---|
| LR | 0.666 | 14.066 | 0.781 | 13.583 | 0.915 | 6.174 | 0.787 | 11.274 |
| Bicubic | 0.676 | 13.780 | 0.750 | 15.100 | 0.817 | 6.747 | 0.747 | 11.875 |
| VSRnet [32] | 0.659 | 14.125 | 0.776 | 15.100 | 0.882 | 6.726 | 0.773 | 11.984 |
| VESPCN [5] | 0.656 | 14.118 | 0.793 | 15.145 | 0.920 | 6.641 | 0.790 | 11.968 |
| RCAN [97] | 0.670 | 14.213 | 0.813 | 15.202 | 0.952 | 6.713 | 0.811 | 12.043 |
| SOF-VSR [76] | 0.662 | 14.175 | 0.808 | 15.113 | 0.932 | 6.698 | 0.800 | 11.995 |
| TDAN [71] | 0.655 | 14.206 | 0.772 | 15.192 | 0.882 | 6.711 | 0.770 | 12.036 |
| D3Dnet [91] | 0.672 | 14.240 | 0.845 | 15.215 | 0.972 | 6.736 | 0.830 | 12.064 |
| MoCoPnet | 0.678 | 14.259 | 0.859 | 15.203 | 0.965 | 6.709 | 0.834 | 12.057 |
| HR | 0.810 | 14.262 | 1.001 | 15.265 | 0.959 | 6.706 | 0.923 | 12.078 |

(b)

| Method | SAITD SNR | SAITD CR | Hui SNR | Hui CR | Anti-UAV SNR | Anti-UAV CR | Avg. SNR | Avg. CR |
|---|---|---|---|---|---|---|---|---|
| Bicubic | 0.808 | 14.736 | 0.986 | 15.817 | 0.958 | 7.173 | 0.917 | 12.575 |
| VSRnet [32] | 0.672 | 14.704 | 1.002 | 15.661 | 0.954 | 7.178 | 0.876 | 12.514 |
| VESPCN [5] | 0.901 | 14.664 | 1.005 | 15.616 | 0.963 | 7.115 | 0.956 | 12.465 |
| RCAN [97] | 0.914 | 14.720 | 0.997 | 15.649 | 0.947 | 7.168 | 0.953 | 12.512 |
| SOF-VSR [76] | 0.913 | 14.700 | 0.997 | 15.603 | 0.965 | 7.168 | 0.958 | 12.490 |
| TDAN [71] | 0.889 | 14.725 | 0.999 | 15.693 | 0.963 | 7.173 | 0.950 | 12.530 |
| D3Dnet [91] | 0.907 | 14.731 | 1.005 | 15.699 | 0.964 | 7.166 | 0.959 | 12.532 |
| MoCoPnet | 0.922 | 14.729 | 1.006 | 15.685 | 0.966 | 7.181 | 0.965 | 12.532 |
The experimental results of the LSTA variants are shown in Table II. It can be observed that the PSNR/SSIM/SNR/CR scores of "w/o LSTA" are 0.28dB/0.008/0.021/0.013 lower than those of MoCoPnet. This demonstrates that LSTA can effectively use the supplementary temporal information to enhance the local features, thus improving the SR performance and the target contrast. The PSNR/SSIM/SNR/CR scores of C-LSTA[(3,1),(3,1)] are 0.03dB/0.002/0.003/0.003 lower than those of MoCoPnet. This demonstrates that an LSTA with a larger dilation rate (i.e., d=3) for coarse processing helps our network better extract and utilize temporal information. The PSNR/SSIM/SNR/CR scores of LSTA(3,1) are 0.16dB/0.005/0.009/0.006 lower than MoCoPnet and 0.13dB/0.003/0.006/0.003 lower than C-LSTA[(3,1),(3,1)]. This demonstrates that coarse-to-fine processing benefits SR performance and target enhancement. The PSNR/SSIM/SNR/CR scores of LSTA(9,1/4) are slightly higher than those of LSTA(3,1) by 0.07dB/0.001/0.003/0.000, but its memory cost is about twice that of LSTA(3,1) (i.e., 2.46 vs. 1.17). This demonstrates that sub-pixel information exploitation benefits SR and target enhancement but significantly increases the memory cost. The PSNR/SSIM/SNR/CR scores of P-LSTA[(3,1),(3,3),(3,5)] are 0.07dB/0.002/0.005/0.005 lower than MoCoPnet and 0.09dB/0.003/0.004/0.001 higher than LSTA(3,1). This demonstrates that the cascade arrangement of LSTAs can better exploit inter-frame correlation, and that SR performance and target enhancement can be further improved by enlarging the receptive field of LSTAs.
Table V. Parameters of the detection methods (for HR images; "B" denotes the block size and "S" the stride).

| Method | Parameters |
|---|---|
| Top-hat | 5×5 square filter |
| ILCM | 5×5 filter |
| IPI | B = 50×50, S = 10, L = 1 |
Finally, we replace the LSTAs in MoCoPnet with an optical flow module (OFM) to compare our LSTA with the widely applied optical flow technique. The experimental results are listed in Table II. It can be observed that the PSNR/SSIM/SNR/CR scores of MoCoPnet with LSTAs are 0.11dB/0.004/0.015/0.009 higher than those of MoCoPnet with the OFM. Meanwhile, the parameters and FLOPs of MoCoPnet with LSTA modules are 0.11M and 2.70G lower than those of MoCoPnet with the OFM. This demonstrates that LSTA is superior in exploiting the information between frames to improve the SR performance and target contrast at lower computational cost. This is because LSTA directly learns motion compensation through the attention mechanism, without the explicit optical flow estimation and warping that can result in ambiguous and duplicated results [54, 30].
In addition, we visualize the feature maps generated by the OFM and the LSTAs with a toy example in Fig. 6. Note that, the visualization maps are the L2 norm results along the channel dimension [40, 43]. As illustrated in Fig. 6(a), the input image sequence consists of a random consistent movement of a 3×3 target (i.e., the white cube) over the background (i.e., the black area). The feature maps before the OFM and the LSTAs are shown in Figs. 6(b) and (d). It can be observed that the target positions in the extracted feature maps are close to the blue dots (i.e., the groundtruth positions of the target in the current features). Then the OFM and the LSTAs perform feature alignment on the extracted features. As illustrated in Fig. 6(c), the target positions in the feature maps generated by the OFM are close to the blue dots. The feature maps generated by LSTA1(k=3, d=3) and LSTA2(k=3, d=1) are shown in Figs. 6(e) and (f). As illustrated in Fig. 6(f), all the target positions in the feature maps generated by LSTA2 are closer to the red dot (i.e., the groundtruth position of the target in the reference feature) than those of the OFM. This demonstrates that LSTA is superior in motion compensation. Note that, as can be observed from Figs. 6(e) and (f), LSTA1 and LSTA2 achieve coarse-to-fine alignment to highlight the aligned target, which demonstrates the effectiveness and superiority of our coarse-to-fine alignment strategy.
In this subsection, we compare our MoCoPnet with one top-performing single image SR method, RCAN [97], and five video SR methods: VSRnet [32], VESPCN [5], SOF-VSR [76, 75], TDAN [71] and D3Dnet [91]. We also present bicubically upsampled (Bicubic) images as baseline results. For fair comparison, we retrain all the compared methods on the infrared small target dataset [28] and exclude the first and last 2 frames of each video sequence for performance evaluation.
Table VI. SNRG, BSF, SCRG and CG results of detection performed on the 4× SR results of LR images (resolutions: 640×512 for SAITD, 256×256 for Hui, 640×640 for Anti-UAV). The four column groups are, from left to right, SAITD, Hui, Anti-UAV and Average.

| Detection | Input | SNRG | BSF | SCRG | CG | SNRG | BSF | SCRG | CG | SNRG | BSF | SCRG | CG | SNRG | BSF | SCRG | CG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Top-hat | Bicubic | 0.50 | 2.55 | 6.43 | 3.60 | 0.93 | 1.79 | 15.94 | 9.66 | 3.77 | 4.32 | 15.58 | 3.12 | 1.73 | 2.89 | 12.65 | 5.46 |
| Top-hat | D3Dnet | 0.77 | 3.10 | 9.31 | 4.28 | 1.49 | 2.01 | 20.47 | 11.20 | 9.60 | 6.70 | 31.05 | 3.33 | 3.95 | 3.94 | 20.28 | 6.27 |
| Top-hat | MoCoPnet | 0.82 | 3.25 | 9.04 | 3.55 | 1.53 | 1.98 | 18.88 | 10.22 | 13.06 | 6.72 | 28.35 | 2.82 | 5.14 | 3.99 | 18.76 | 5.53 |
| Top-hat | HR | 1.62 | 1.84 | 5.40 | 3.22 | 1.73 | 1.55 | 8.82 | 5.30 | 7.61 | 13.18 | 74.73 | 2.99 | 3.65 | 5.52 | 29.65 | 3.83 |
| ILCM | Bicubic | 1.07 | 1.20 | 5.93 | 4.89 | 0.96 | 0.91 | 3.73 | 4.34 | 0.90 | 0.83 | 1.82 | 2.15 | 0.97 | 0.98 | 3.83 | 3.79 |
| ILCM | D3Dnet | 1.07 | 1.02 | 7.29 | 7.15 | 1.08 | 0.84 | 8.01 | 10.03 | 1.07 | 0.77 | 2.84 | 3.72 | 1.07 | 0.87 | 6.05 | 6.96 |
| ILCM | MoCoPnet | 1.08 | 1.00 | 7.91 | 7.91 | 1.09 | 0.84 | 8.21 | 10.12 | 1.07 | 0.76 | 3.30 | 4.47 | 1.08 | 0.87 | 6.47 | 7.50 |
| ILCM | HR | 1.37 | 0.89 | 11.85 | 13.20 | 1.31 | 0.77 | 10.88 | 15.39 | 1.05 | 0.70 | 3.14 | 4.54 | 1.24 | 0.79 | 8.62 | 11.04 |
| IPI | Bicubic | 5.17 | 2.110 | 1.67 | 1-3 | 6.29 | 1.110 | 2.19 | 0.82 | 9.89 | 9.99 | 3.39 | 0.06 | 5.39 | 1.410 | 1.89 | 0.29 |
| IPI | D3Dnet | 1.98 | 4.09 | 2.18 | 0.07 | 1.910 | 5.09 | 3.59 | 0.96 | 2.010 | 3.29 | 4.19 | 0.13 | 1.310 | 4.19 | 2.69 | 0.39 |
| IPI | MoCoPnet | 3.58 | 2.79 | 6.48 | 0.11 | 2.010 | 4.59 | 4.69 | 1.12 | 4.210 | 5.49 | 8.19 | 0.13 | 2.110 | 4.29 | 4.49 | 0.45 |
| IPI | HR | 1.110 | 2.69 | 2.39 | 0.13 | 0.60 | 2.29 | 8.63 | 0.61 | 8.09 | 1.59 | 1.39 | 0.16 | 6.29 | 2.19 | 1.29 | 0.30 |
Table VII. SNRG, BSF, SCRG and CG results of detection performed on the 4× SR results of HR images (resolutions: 2560×2048 for SAITD, 2048×2048 for Hui, 2560×2560 for Anti-UAV). The four column groups are, from left to right, SAITD, Hui, Anti-UAV and Average.

| Detection | Input | SNRG | BSF | SCRG | CG | SNRG | BSF | SCRG | CG | SNRG | BSF | SCRG | CG | SNRG | BSF | SCRG | CG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Top-hat | Bicubic | 1.01 | 1.79 | 4.45 | 2.78 | 1.19 | 1.82 | 6.09 | 3.03 | 4.66 | 6.95 | 49.83 | 4.40 | 2.28 | 3.52 | 20.13 | 3.40 |
| Top-hat | D3Dnet | 1.27 | 1.70 | 4.67 | 2.95 | 1.21 | 1.74 | 5.63 | 2.98 | 3.21 | 7.39 | 59.37 | 4.40 | 1.90 | 3.61 | 23.22 | 3.44 |
| Top-hat | MoCoPnet | 1.33 | 1.70 | 4.60 | 2.87 | 1.20 | 1.78 | 5.88 | 3.11 | 3.20 | 7.35 | 58.17 | 4.53 | 1.91 | 3.61 | 22.88 | 3.50 |
| ILCM | Bicubic | 1.20 | 0.90 | 9.02 | 11.10 | 0.97 | 0.84 | 9.72 | 11.76 | 1.00 | 0.86 | 7.77 | 8.71 | 1.06 | 0.86 | 8.84 | 10.52 |
| ILCM | D3Dnet | 1.37 | 0.85 | 12.65 | 16.53 | 1.04 | 0.78 | 10.54 | 13.94 | 1.02 | 0.85 | 8.58 | 9.60 | 1.14 | 0.82 | 10.59 | 13.35 |
| ILCM | MoCoPnet | 1.40 | 0.85 | 12.86 | 16.90 | 1.04 | 0.76 | 10.50 | 14.18 | 1.02 | 0.85 | 8.71 | 9.75 | 1.15 | 0.82 | 10.69 | 13.61 |
| IPI | Bicubic | 2.59 | 3.99 | 2.88 | 0.039 | 5.110 | 2.610 | 1.310 | 0.14 | 1.79 | 1.29 | 3.08 | 0.046 | 1.810 | 1.010 | 4.69 | 0.074 |
| IPI | D3Dnet | 3.49 | 2.99 | 3.18 | 0.048 | 4.810 | 1.910 | 1.610 | 0.16 | 3.58 | 4.07 | 7.97 | 0.048 | 1.710 | 7.29 | 5.49 | 0.086 |
| IPI | MoCoPnet | 3.69 | 2.69 | 3.28 | 0.044 | 3.710 | 1.610 | 1.710 | 0.18 | 0.23 | 20.90 | 9.95 | 0.049 | 1.310 | 6.29 | 5.99 | 0.092 |
The PSNR/SSIM results calculated on the whole image are listed in Table III, and the SNR and CR scores calculated in the local background neighborhood are listed in Table IV(a). It can be observed that MoCoPnet achieves the highest PSNR and SSIM scores and outperforms most of the compared algorithms on SNR and CR. These scores demonstrate that our network can effectively recover accurate details and improve the target contrast. This is because LSTA performs implicit motion compensation and CD-RG incorporates the center-oriented gradient information, which together effectively improve the SR performance and the target contrast.
Qualitative results are shown in Fig. 7. For SR performance, it can be observed from the blue zoom-in regions that MoCoPnet can recover more accurate details (e.g., the sharp edges of buildings and the lighthouse details, which are closer to the groundtruth HR image). For target enhancement, it can be observed from the red zoom-in regions that, in the first row, MoCoPnet can further improve the contrast of a target that is almost invisible in the results of the compared methods. In the second row, MoCoPnet is more robust to the large motion caused by turntable collection [28] (e.g., note the artifacts in the zoom-in region of D3Dnet). In the third row, MoCoPnet can effectively improve the target contrast to be even higher than in the HR image (i.e., 1.82 vs. 1.75).
SNR and CR scores calculated in the local background neighborhood of super-resolved HR images are listed in Table IV(b). It can be observed that MoCoPnet achieves the best average SNR score and the second-best average CR score on the test datasets under real-world degradation. This demonstrates the superiority of our method in improving the contrast between targets and background.
Qualitative results are shown in Fig. 8. It can be observed that MoCoPnet can recover finer details and achieve better visual quality, such as the edges of buildings and windows. In addition, MoCoPnet can further improve the intensity and the contour details of the targets.
In this subsection, we select three typical infrared small target detection algorithms (Top-hat [59], ILCM [95] and IPI [18]) to perform detection on the super-resolved infrared images. The parameters of the three detection algorithms are shown in Table V. When 4× SR is performed on HR images, the sizes of the filters, blocks and strides, as well as the true detection threshold, are enlarged by 4 times. When 4× downsampling is performed on HR images, the filter sizes of Top-hat and ILCM are set to 3×3, the block size and stride of IPI are set to 15×15 and 3, and the true detection threshold is set to 3.0. For simplicity, we only use the two best super-resolved results, those of D3Dnet and MoCoPnet, to perform detection. We also introduce bicubically upsampled (Bicubic) images and HR images as baseline results.
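The overall evaluation flow of this subsection can be sketched as follows, reusing the helpers sketched earlier; the degradation, SR and detection calls are placeholders for the methods above:

```python
def evaluate_sr_for_detection(hr_img, downsample_4x, sr_model, detector,
                              lr_box, hr_box):
    """Sketch of the protocol: 4x downsample the HR image, super-resolve it,
    run a small-target detector, then compute the gain metrics between the
    LR image before SR and the detector's target image (Eqs. (6)-(9))."""
    lr_img = downsample_4x(hr_img)            # placeholder degradation
    sr_img = sr_model(lr_img)                 # e.g., MoCoPnet or D3Dnet
    target_img = detector(sr_img)             # e.g., Top-hat / ILCM / IPI
    t_in, b_in = target_and_background(lr_img, *lr_box)     # helper from Sec. IV-A
    t_out, b_out = target_and_background(target_img, *hr_box)
    return gains(t_in, b_in, t_out, b_out)    # SNRG / BSF / SCRG / CG
```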
The quantitative detection results on super-resolved LR images are listed in Table VI. It can be observed that the SNRG, SCRG and CG scores of the super-resolved images are generally higher than those of the Bicubic images. This demonstrates that SR algorithms can effectively improve the contrast between the target and the background, thus promoting the detection performance. It is worth noting that the SNRG, SCRG and CG scores of D3Dnet and MoCoPnet can even surpass those of HR images. This is because SR algorithms perform better on the high-frequency small targets than on the low-frequency local background, thus achieving higher target contrast than the HR images. In addition, Bicubic achieves the highest BSF score in most cases. This is because SR algorithms act on the entire image, enhancing targets and background simultaneously, and detection algorithms have better filtering performance on smoothly changing backgrounds. Note that, the BSF of MoCoPnet is generally higher than that of D3Dnet. This is because MoCoPnet focuses on recovering the locally salient features in the image and further improves the contrast between targets and background, which benefits the detection performance.
The qualitative results on super-resolved LR images and the corresponding detection results are shown in Fig. 9. In the LR images, the target intensities are very low (e.g., the targets in SAITD and Anti-UAV are almost invisible). In the super-resolved images, the target intensities are higher and closer to those in the HR images. This is because SR algorithms can effectively use the spatio-temporal information to enhance the target contrast. Note that, our MoCoPnet is more robust to the large motion caused by turntable collection [28] (i.e., note the artifacts in the zoom-in region of D3Dnet on the Hui dataset). In addition, the neighborhood noise in the HR images is suppressed by downsampling followed by super-resolution (e.g., the point noise is absent from the zoom-in regions of the Hui and Anti-UAV datasets). Then, we perform detection on the super-resolved images. It can be observed in Fig. 9 that all the detection algorithms perform poorly on the Bicubic images (e.g., the target intensity in the target image is very low and almost invisible in all detection results). This is because bicubic interpolation cannot introduce additional information. In contrast, the target intensities in the target images of the super-resolved images are higher than those of the Bicubic images. Among the super-resolved images, MoCoPnet is superior to D3Dnet in improving the target contrast, owing to the center-oriented gradient-aware feature extraction of CD-RG and the effective spatio-temporal information exploitation of LSTA.
To evaluate the detection performance comprehensively, we further calculate the ROC results, which are shown in Fig. 10. Note that, the ROC results on the LR and HR images are used as baseline results. The targets in the HR images have the highest intensity; therefore, high detection probability and low false alarm probability can be obtained, and the detection probability reaches 1 fastest (e.g., on the SAITD and Hui datasets). Downsampling reduces the target intensity, thus reducing the detection probability and increasing the false alarm probability. Bicubic introduces no additional image prior information; therefore, LR and Bicubic have the worst detection performance, and their ROC results are significantly lower than those of the other algorithms (e.g., the ROC results of LR are the lowest and those of Bicubic the second lowest, except for the ROC of Top-hat on the SAITD dataset). SR algorithms can introduce prior information to improve the contrast between targets and background, thus achieving improved detection accuracy (e.g., the ROC results of MoCoPnet and D3Dnet are higher than those of Bicubic on the SAITD and Hui datasets and even higher than those of HR on the Anti-UAV dataset). Note that, the false alarm rates of LR and Bicubic can only reach a relatively low value. This is because IPI achieves detection by sparse and low-rank recovery, which yields a significantly lower false alarm rate than Top-hat and ILCM. From another perspective, IPI suffers from a low detection rate for low-contrast targets; therefore, the ROC curves of the Bicubic and LR images are shorter than those of the HR and super-resolved images. The above experimental results show that SR algorithms can recover high-contrast targets, thus improving the detection performance.
The quantitative detection results on super-resolved HR images are listed in Table VII. It can be observed that the detection performance on the results of the SR algorithms is superior to that on Bicubic. This demonstrates that MoCoPnet and D3Dnet can effectively improve the contrast between targets and background, resulting in a detection performance gain. Among the SR algorithms, thanks to the superior SR performance and target enhancement of our well-designed modules, MoCoPnet achieves the best SNRG, SCRG and CG scores in most cases. Note that, the SNRG and SCRG scores (achieved by IPI) of MoCoPnet on the Anti-UAV dataset are 7-8 orders of magnitude lower than those of Bicubic and D3Dnet. First, MoCoPnet achieves the highest CG scores, which demonstrates that the target intensity can be effectively and further enhanced by MoCoPnet. The differences therefore come from the background suppression performance. Since MoCoPnet achieves higher SR performance than Bicubic and D3Dnet, the local backgrounds of Bicubic and D3Dnet are smoother and the detection algorithms can suppress them better. IPI is particularly good at suppressing background clutter; consequently, the local backgrounds in the target images of Bicubic and D3Dnet are sometimes exactly zero. Since we add ε to each denominator in Eqs. (6)-(9) to prevent it from being zero, the SNRG and SCRG scores can become very large when the background is completely suppressed. In addition, bicubic interpolation suppresses the high-frequency components to a certain extent, resulting in the best BSF values.
The qualitative results on super-resolved HR images and the corresponding detection results are shown in Fig. 11. It can be observed that the targets in the Bicubic images are blurred, while SR can enhance the target intensity (e.g., the highlighted and sharpened targets). After processing by the SR algorithms, we then perform detection on the super-resolved images. Note that, SR algorithms can effectively improve the intensity of targets and their contrast against the background, resulting in better detection performance.
To evaluate the detection performance comprehensively, we further present the ROC results in Fig. 12. Note that, the ROC results on the HR images are used as baseline results. It can be observed that SR algorithms can improve the detection probability and reduce the false alarm probability in most cases. Compared with D3Dnet, MoCoPnet can further improve the target contrast, thus promoting the detection performance. Note that, the false alarm rates of Bicubic can only reach a relatively low value. This is because IPI achieves detection by sparse and low-rank recovery, which yields a significantly lower false alarm rate than Top-hat and ILCM; in other words, IPI suffers from a low detection rate for low-contrast targets.
In this paper, we propose a local motion and contrast prior driven deep network (MoCoPnet) for infrared small target super-resolution. Experimental results show that MoCoPnet can effectively recover the image details and enhance the contrast between targets and background. Based on the super-resolved images, we further investigate the effect of SR algorithms on detection performance. Experimental results show that MoCoPnet can improve the performance of infrared small target detection.