MoCoPnet: Exploring Local Motion and Contrast Priors for Infrared Small Target Super-Resolution

01/04/2022
by   Xinyi Ying, et al.

Infrared small target super-resolution (SR) aims to recover reliable and detailed high-resolution images with high-contrast targets from their low-resolution counterparts. Since infrared small targets lack color and fine structure information, it is important to exploit the supplementary information among sequence images to enhance the target. In this paper, we propose the first infrared small target SR method, named local motion and contrast prior driven deep network (MoCoPnet), to integrate the domain knowledge of infrared small targets into a deep network, which can mitigate the intrinsic feature scarcity of infrared small targets. Specifically, motivated by the local motion prior in the spatio-temporal dimension, we propose a local spatio-temporal attention module to perform implicit frame alignment and incorporate the local spatio-temporal information to enhance the local features (especially for small targets). Motivated by the local contrast prior in the spatial dimension, we propose a central difference residual group to incorporate the central difference convolution into the feature extraction backbone, which can achieve center-oriented gradient-aware feature extraction to further improve the target contrast. Extensive experiments have demonstrated that our method can recover accurate spatial dependency and improve the target contrast. Comparative results show that MoCoPnet can outperform the state-of-the-art video SR and single image SR methods in terms of both SR performance and target enhancement. Based on the SR results, we further investigate the influence of SR on infrared small target detection, and the experimental results demonstrate that MoCoPnet promotes the detection performance. The code is available at https://github.com/XinyiYing/MoCoPnet.



I Introduction

Infrared imaging systems work in all weather conditions, day and night, and offer high penetrability, sensitivity and concealment. They are widely used in security monitoring, remote sensing investigation, aerospace offense-defense and other military missions. However, low-resolution (LR) infrared images cannot meet the high requirements of practical military missions. Therefore, it is necessary to improve the resolution of infrared images. A straightforward way to obtain high-resolution (HR) infrared images is to increase the size of infrared sensor arrays. However, due to the technical limitations of sensors and the high cost of large infrared sensor arrays, it is necessary and important to develop practical, low-cost and highly reliable infrared image super-resolution (SR) algorithms. Note that, modern autonomous driving technology requires the infrared imaging system to detect targets at a fairly long distance. Therefore, the target only occupies a very small proportion of the whole image and is susceptible to noise and clutter. In this paper, we mainly focus on the infrared small target SR task and investigate its influence on infrared small target detection.

The special imaging mechanism and military applications of infrared imaging systems put forward the following requirements for infrared small target SR: 1) High fidelity of super-resolved images. Noise and false contours should be avoided as much as possible. 2) High contrast of super-resolved targets. The target contrast in the super-resolved images should be strengthened to boost the subsequent tasks. 3) High robustness to complex scenes and noise. Small targets are sometimes submerged in clutter and thus have low local contrast against the background. SR algorithms should be robust to various complex scenes and imaging noise. 4) High generalization to insufficient datasets. The lack of infrared image datasets requires that SR algorithms achieve stable results with a relatively small dataset.

The motivations of our method come from data analysis and can be summarized as: 1) The target occupies a small proportion of the whole infrared image (generally less than 0.12% [10]) and lacks color and fine structure information (e.g., contour, shape and texture). Little information is available for SR within a single image. Therefore, we perform SR on image sequences to use the supplementary information along the temporal dimension to improve the SR performance and the target contrast. 2) Due to the long distance between the target and the imaging system, the mobility of the targets on the imaging plane is limited, leading to small motion of the target between neighboring frames (i.e., the local motion prior [51, 66, 67] in the spatio-temporal dimension). Therefore, we design a local spatio-temporal attention (LSTA) module to perform implicit frame alignment and exploit the supplementary information in the local spatio-temporal neighborhood to enhance the local features (especially for small targets). 3) Compared with the background clutter, the contrast and gradient between the target and the background in the local neighborhood are high in all directions (i.e., the local contrast prior [57, 11] in the spatial dimension). Therefore, we design a central difference residual group (CD-RG) to achieve center-oriented gradient-aware feature extraction, which can encode the local contrast prior to further improve the target contrast.

Based on the above observations, we propose a local motion and contrast prior driven deep network (MoCoPnet) for infrared small target SR. The main contributions can be summarized as follows: 1) We propose the first infrared small target SR method, named local motion and contrast prior driven deep network (MoCoPnet), and summarize the definition and requirements of this task. The proposed modules (i.e., the central difference residual group and the local spatio-temporal attention module) of MoCoPnet integrate the domain knowledge (i.e., the local contrast prior and the local motion prior) of infrared small targets into deep networks, which can mitigate the intrinsic feature scarcity of data-driven approaches [11]. 2) The experimental results demonstrate that MoCoPnet can achieve state-of-the-art SR performance and effectively improve the target contrast. 3) Based on the SR results, we further investigate the influence of SR on infrared small target detection. The experimental results show that MoCoPnet can promote the detection performance to achieve high signal-to-noise ratio gain (SNRG), signal-to-clutter ratio gain (SCRG) and contrast gain (CG) scores, as well as improved receiver operating characteristic (ROC) curve results.

II Related Work

II-A Single Image SR

Image SR is an inherently ill-posed optimization problem and has been investigated for decades. In the literature, researchers have proposed a variety of classic single image SR (SISR) methods, including prediction-based methods [15, 29, 33], edge-based methods [64, 16], statistics-based methods [36, 86], patch-based methods [16, 17, 6, 19] and sparse representation methods [88, 89]. However, most of the aforementioned traditional methods use handcrafted features to reconstruct HR images, which cannot formulate the complex SR process and thus limits the SR performance. Recently, due to their powerful feature representation capability, convolutional neural networks (CNNs) have been widely used in the single image SR task and achieve state-of-the-art performance [97, 78, 77, 81, 82]. Dong et al. [13, 14] proposed the pioneering CNN-based work SRCNN to recover an HR image from its LR counterpart. Kim et al. [34] deepened the network to 20 convolutional layers (i.e., VDSR) and achieved improved SR performance by increasing model complexity. Moreover, various increasingly deep and complex architectures (e.g., residual networks [48, 1, 45], recursive networks [35, 69, 68], densely connected networks [72, 99, 23, 44], attention-based networks [7, 97, 9]) have also been applied to SISR for performance improvement. In addition to tackling average image distortion with norm-based losses, generative adversarial image SR networks [42, 61, 80] employ perceptual losses for perceptual quality improvement.

II-B Video SR

Existing video SR methods commonly follow a three-step pipeline, including feature extraction, motion compensation and reconstruction [91]. Traditional video SR methods [50, 4, 49, 56, 87, 3] employ handcrafted models to estimate motion, noise and blur kernels and reconstruct HR video sequences. Recent deep learning-based video SR methods are better at exploiting spatio-temporal information thanks to their powerful feature representation capability and can achieve state-of-the-art performance. Liao et al. [47] proposed the pioneering CNN-based video SR method to perform motion compensation by optical flow and then ensembled the compensated drafts via a CNN. Afterwards, a series of optical flow-based video SR algorithms [62, 5, 70, 60, 76, 24] emerged to explicitly perform motion estimation and frame alignment, which can result in blurring and duplication artifacts [54, 30]. To avoid the aforementioned problem, deformable convolution [8, 103] has been employed to perform motion compensation in a unified step [71, 79, 73] through extra offsets. Apart from these explicit motion compensation methods, implicit approaches (e.g., 3D convolution networks [31, 46, 37], recursive networks [27, 21, 102, 85], non-local networks [79, 90, 84]) have also been applied to video SR for performance improvement.

II-C Infrared Image SR

With the increased demand for high-resolution infrared images, some researchers perform image SR on infrared images. Traditional methods [55, 101] consider SR as sparse signal reconstruction in compressive sensing. Based on these studies, Zhang et al. [96] combined compressive sensing and deep learning to achieve improved SR performance with low computational cost. Han et al. [22] proposed to employ CNNs to recover high-frequency components from upscaled LR images to generate the SR results. He et al. [25] proposed a cascaded deep network with multiple receptive fields for large scale factor (×8) infrared image SR. Liu et al. [52] proposed to use a generative adversarial network and perceptual loss to reconstruct the texture details of infrared images.

II-D Attention Mechanism

Since the importance of each spatial location and channel is not uniform, Hu et al. [26] proposed SENet for classification, which uses squeeze-and-excitation units to adaptively recalibrate channel-wise feature responses. Zhang et al. [97] proposed a channel attention mechanism to calculate the importance along the channel dimension for channel selection. Anwar et al. [2] proposed feature attention to urge the network to pay more attention to high-frequency regions. Dai et al. [9] proposed second-order attention to adaptively readjust features for powerful feature correlation learning. Wang et al. [74] explored the sparsity in the SR task and proposed sparse masks for efficient inference. The spatial mask and channel mask calculate the importance along both the spatial dimension and the channel dimension to prune redundant computations. The aforementioned studies only consider the global importance along the spatial and channel dimensions. Since small targets only occupy a small portion of the whole image and have high contrast with the local neighborhood, we design a local attention mechanism which can better characterize small targets.

II-E Sequence Image Infrared Small Target Detection

Sequence image infrared small target detection is significant for long-range precision strikes, aerospace offensive-defensive countermeasures and remote sensing intelligence reconnaissance. According to whether sequential information is used, sequence image infrared small target detection methods can be divided into two categories: detect before track (DBT) methods and track before detect (TBD) methods. Based on the results of single image infrared small target detection [41, 39, 18, 12, 94, 53], DBT methods employ the motion trajectory of targets through sequence image projection to eliminate false targets and reduce the false alarm rate. DBT methods have low computational cost and are easy to implement. However, their performance drops rapidly with low SNR. TBD methods [58, 93, 20] commonly follow a three-step pipeline, including background suppression, region of interest extraction and target detection. TBD methods are robust to images with low SNR but have high computational cost, which cannot meet the requirements of real-time detection. It is challenging to achieve a high detection rate and a low false alarm rate in real time due to the lack of target information, complex background noise, insufficient public datasets, and the explosion of data volume and computational cost. Therefore, it is necessary to recover reliable image details and enhance the contrast between target and background for detection.

Fig. 1: The proposed architecture of MoCoPnet. (a) represents the overall framework. (b) represents the central difference residual group (CD-RG), and (b1), (b2) represent its sub-modules, the central difference residual dense block (CD-RDB) and the central difference convolution (CD-Conv), respectively. (c) represents the local spatio-temporal attention (LSTA) module with kernel size 3 and dilation rate 1.

III Methodology

In this section, we introduce our method in detail. Specifically, Section III-A introduces the overall framework of our network. Sections III-B and III-C introduce the two modules that integrate the local contrast prior and the local motion prior of infrared small targets into deep networks.

III-A Overall Framework

The overall framework of our MoCoPnet is shown in Fig. 1. Specifically, an image sequence with 5 frames is first sent to a convolutional layer to generate the initial features, which are then sent to the central difference residual group (CD-RG) to achieve center-oriented gradient-aware feature extraction. Then, each neighborhood feature is paired with the reference feature and sent to two local spatio-temporal attention (LSTA) modules to achieve motion compensation and enhance the local features. Next, the reference feature is concatenated with two compensated neighborhood features and then sent to a residual group (RG) and a convolutional layer for coarse fusion. Afterwards, the two fused features are concatenated and sent to an RG and a convolutional layer for fine fusion. Then, the fused feature is processed by an RG, a sub-pixel layer and a convolutional layer for SR reconstruction and upsampling. Finally, the SR reference frame is obtained by adding the bicubicly upsampled LR reference frame, which accelerates training convergence. Note that, the number of input frames is set to 7 in this paper and the process is the same as in Fig. 1(a). We use the mean square error (MSE) between the SR reference frame and the groundtruth reference frame as the loss function of our network.
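To make the data flow described above concrete, the following is a minimal, runnable PyTorch-style sketch of the wiring only (per-frame feature extraction, pairwise alignment against the reference frame, coarse-to-fine fusion, sub-pixel upsampling and the bicubic skip connection). Plain residual blocks and a concatenation-based fusion stand in for the CD-RG, RG and LSTA modules detailed in Sections III-B and III-C; all class names, channel sizes and stand-in blocks are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Plain residual block, standing in for the RG / CD-RG modules."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class MoCoPnetSketch(nn.Module):
    """Wiring of the framework in Fig. 1(a) with simplified stand-in blocks."""
    def __init__(self, c=64, scale=4):
        super().__init__()
        self.scale = scale
        self.extract = nn.Sequential(nn.Conv2d(1, c, 3, padding=1), ResBlock(c))  # CD-RG in the paper
        self.align = nn.Sequential(nn.Conv2d(2 * c, c, 1), ResBlock(c))           # two LSTAs in the paper
        self.coarse = nn.Sequential(nn.Conv2d(3 * c, c, 1), ResBlock(c))          # coarse fusion
        self.fine = nn.Sequential(nn.Conv2d(2 * c, c, 1), ResBlock(c))            # fine fusion
        self.rec = nn.Sequential(                                                  # reconstruction + upsampling
            ResBlock(c), nn.Conv2d(c, c * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale), nn.Conv2d(c, 1, 3, padding=1))

    def forward(self, seq):                                    # seq: (B, T, 1, H, W)
        b, t, _, h, w = seq.shape
        ref = seq[:, t // 2]                                   # reference (center) frame
        feats = [self.extract(seq[:, i]) for i in range(t)]
        f_ref = feats[t // 2]
        nbrs = [self.align(torch.cat([f_ref, feats[i]], dim=1))
                for i in range(t) if i != t // 2]              # implicit alignment (LSTA in the paper)
        g1 = self.coarse(torch.cat([f_ref, nbrs[0], nbrs[1]], dim=1))
        g2 = self.coarse(torch.cat([f_ref, nbrs[2], nbrs[3]], dim=1))
        res = self.rec(self.fine(torch.cat([g1, g2], dim=1)))
        base = F.interpolate(ref, scale_factor=self.scale, mode='bicubic', align_corners=False)
        return base + res                                      # bicubic skip connection

# Example: a 5-frame 32x32 LR clip is super-resolved to 128x128.
sr = MoCoPnetSketch()(torch.rand(1, 5, 1, 32, 32))
print(sr.shape)  # torch.Size([1, 1, 128, 128])
```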

Fig. 2: Differences between (a) dilated local contrast measure (DLCM [11]) and (b) central difference convolution (CD-Conv [92, 63]).

III-B Central Difference Residual Group

The central difference residual group (CD-RG) incorporates central difference convolution (CD-Conv [92, 63]) into the residual group (RG [97, 99]) to achieve center-oriented gradient-aware feature extraction, which can utilize the spatial local saliency prior to strengthen the contrast of small targets. Note that, we employ RG as the backbone of our MoCoPnet for the following reasons: RG can generate features with a large receptive field and dense sampling rate, which promotes information exploitation. The reuse of hierarchical features not only improves the SR performance [83] but also maintains the information of small targets [10, 43].

The architecture of the central difference residual group (CD-RG) is shown in Fig. 1(b). The input feature is first fed to central difference residual dense blocks [98] (CD-RDBs) to extract hierarchical features. Then, the hierarchical features are concatenated and fed to a 1×1 convolutional layer to generate the output feature. As shown in Fig. 1(b1), a CD-Conv and several Convs with a fixed growth rate are used within each CD-RDB to achieve dense feature representation. The architecture of CD-Conv is shown in Fig. 1(b2). CD-Conv aggregates the center-oriented gradient information, which echoes the spatial local saliency prior of infrared small targets. As shown in Fig. 2, different from the handcrafted dilated local contrast measure (DLCM [11]), which can only reserve the contrast information in one direction, CD-Conv is a learnable measure and can improve the contrast of small targets while maintaining the background information. In conclusion, CD-Conv is more in line with the task of infrared small target SR (i.e., recovering reliable and detailed high-resolution images with high-contrast targets). DLCM and CD-Conv can be formulated as:

$\hat{x}_0 = \min_{i \in [1,8]} \left( x_0 - x_i \right), \qquad (1)$
$\hat{x}_0 = \theta \sum_{i=1}^{8} w_i \left( x_i - x_0 \right) + (1-\theta) \sum_{i=0}^{8} w_i\, x_i, \qquad (2)$

where $x_0$ represents the value of a specific (central) location in the feature map, $x_i$ is the value at the neighboring location with direction index $i$, $w_i$ is a learnable weight to continuously optimize the local contrast measure, and $\theta$ is a hyperparameter to balance the contribution between gradient-level detailed information and intensity-level semantic information. Note that, $\theta$ is set to 0.7 [92] in our paper.
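For illustration, a CD-Conv layer following the formulation of Eq. (2) and [92] can be implemented in a few lines of PyTorch. The sketch below assumes the standard central difference convolution of [92], with a single learnable kernel shared by the gradient-level and intensity-level terms; it is not the released MoCoPnet layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDConv2d(nn.Module):
    """Central difference convolution: blends a vanilla convolution (intensity-level
    term) with a convolution of center-oriented differences (gradient-level term)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):
        vanilla = self.conv(x)
        # sum_i w_i * (x_i - x_0) = vanilla output - (sum of kernel weights) * x_0
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)   # (out_ch, in_ch, 1, 1)
        center = F.conv2d(x, kernel_sum)
        return self.theta * (vanilla - center) + (1 - self.theta) * vanilla

# theta = 0.7 as in Eq. (2); theta = 0 reduces the layer to a vanilla convolution.
y = CDConv2d(32, 32)(torch.rand(1, 32, 16, 16))
print(y.shape)  # torch.Size([1, 32, 16, 16])
```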

Fig. 3: An illustration of the local spatio-temporal attention (LSTA) module with different kernel sizes and dilation rates. (a) represents the reference frame, with the reference pixel highlighted by a red box. (b)-(f) represent the corresponding neighborhood pixels centered at the same location and are highlighted by blue boxes.

III-C Local Spatio-Temporal Attention Module

The local spatio-temporal attention (LSTA) module calculates the local response between the neighborhood frame and the reference frame and uses the local spatio-temporal information to enhance the local features of the reference frame. The inputs of LSTA are the reference frame and one neighborhood frame. For a sequence with 7 frames, the operation needs to be repeated 6 times. The architecture of LSTA is shown in Fig. 1(c). The red reference feature and the blue neighborhood feature are first fed to two 1×1 convolutional layers for dimension compression, where the compression ratio is set to 8 in our paper. The process can be formulated as:

(3)

where the two 1×1 convolutions compress the reference feature and the neighborhood feature respectively. Then, we calculate the response between each location in the compressed reference feature and the corresponding local neighborhood (centered at the same location) in the compressed neighborhood feature. Afterwards, the responses are summed and softmax-normalized along the channel dimension to generate the attention map. The process is defined as:

(4)

where the local neighborhood is centered at the corresponding location with a given kernel size and dilation rate. The purple 3×3 grid in Fig. 1(c) is the local attention feature map with kernel size 3 and dilation rate 1. Note that, as shown in Figs. 3(c) and (d), the dilation rate can be an integer larger than 1 to enlarge the receptive field without additional computational cost. As shown in Figs. 3(e) and (f), the dilation rate can also be fractional to capture the sub-pixel motion between frames, and we employ bilinear interpolation to generate the exact corresponding values.

Finally, a dot product is performed between the local neighborhood feature and the corresponding attention map to generate the value at each location of the output feature. The process is formulated as:

(5)

LSTA first calculates the response between the reference frame and its adjacent frames to generate the attention map, and then calculates a weighted summation of these frames using the generated attention maps. In this way, the neighborhood frames can be implicitly aligned and the complementary temporal information can be incorporated to enhance the features of small targets.
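As a concrete (hedged) illustration, one plausible PyTorch implementation of this module is sketched below, using unfold to gather the k×k dilated neighborhood of every position, computing the channel-summed responses, softmax-normalizing them over the k² neighbors, and taking a weighted sum of the neighborhood feature. The compression ratio, the use of the uncompressed neighborhood feature as values, and the restriction to integer dilation rates (the fractional case with bilinear interpolation is omitted) are assumptions for illustration rather than the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTA(nn.Module):
    """Local spatio-temporal attention between a reference and a neighborhood feature."""
    def __init__(self, channels, k=3, d=1, reduction=8):
        super().__init__()
        c = channels // reduction
        self.q_proj = nn.Conv2d(channels, c, 1)   # compress the reference feature
        self.k_proj = nn.Conv2d(channels, c, 1)   # compress the neighborhood feature
        self.ksize, self.dil = k, d

    def forward(self, ref, nbr):
        b, c_full, h, w = ref.shape
        q = self.q_proj(ref)                                                   # (b, c, h, w)
        kf = self.k_proj(nbr)                                                  # (b, c, h, w)
        pad = self.dil * (self.ksize - 1) // 2
        # gather the k*k (dilated) neighborhood of every spatial position
        k_unf = F.unfold(kf, self.ksize, dilation=self.dil, padding=pad)       # (b, c*k*k, h*w)
        k_unf = k_unf.view(b, -1, self.ksize ** 2, h * w)                      # (b, c, k*k, h*w)
        # response between each reference location and its neighborhood,
        # summed over channels and softmax-normalized over the k*k neighbors
        attn = (q.view(b, -1, 1, h * w) * k_unf).sum(dim=1)                    # (b, k*k, h*w)
        attn = F.softmax(attn, dim=1)
        # weighted sum of the (uncompressed) neighborhood feature
        v_unf = F.unfold(nbr, self.ksize, dilation=self.dil, padding=pad)
        v_unf = v_unf.view(b, c_full, self.ksize ** 2, h * w)
        out = (v_unf * attn.unsqueeze(1)).sum(dim=2)                           # (b, c_full, h*w)
        return out.view(b, c_full, h, w)

# Coarse-to-fine usage as in MoCoPnet: LSTA(64, k=3, d=3) followed by LSTA(64, k=3, d=1).
ref, nbr = torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32)
print(LSTA(64, k=3, d=3)(ref, nbr).shape)  # torch.Size([1, 64, 32, 32])
```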

IV Experiments

In this section, we first introduce the experiment settings, and then conduct ablation studies to validate our method. Next, we compare our network to several state-of-the-art SISR and video SR methods. Finally, we investigate the influence of SR on infrared small target detection.

IV-A Experiment Settings

In this subsection, we sequentially introduce the datasets, the evaluation metrics, the network parameters and the training details.

IV-A1 Datasets

Hui et al. [28] developed a dataset for detection and tracking of dim-small aircraft infrared targets under ground/air background. This dataset contains 22 image sequences (16177 frames in total) with a resolution of 256×256. Recently, a large-scale high-quality semi-synthetic dataset (named SAITD [65]) has been proposed for small aerial infrared target detection. The SAITD dataset contains 350 image sequences with a resolution of 640×512 (175 image sequences with target annotations and 175 without, 150185 images in total). The 2nd Anti-UAV Workshop & Challenge (Anti-UAV [100]) released 250 high-quality infrared video sequences with multi-scale UAV targets. In this paper, we employ the sequences with target annotations of SAITD as the test datasets and the remaining 300 sequences as the training datasets. In addition, we employ Hui and Anti-UAV as test datasets to evaluate the robustness of our MoCoPnet to real scenes. In the Anti-UAV dataset, only the sequences with infrared small targets [10] (21 sequences in total) are selected as the test set. Note that, we only use the first 100 images of each sequence for testing to balance the computational/time cost and the generalization performance.

Fig. 4: Evaluation metrics. (a) represents the local background neighborhood and (b) represents the modified evaluation metrics in this paper.

IV-A2 Evaluation Metrics

We employ peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) to evaluate the SR performance. In addition, we introduce signal-to-noise ratio (SNR) and contrast ratio (CR) in the local background neighborhood [18] of targets to evaluate the performance of recovering small targets. As shown in Fig. 4(a), the size of the target area is a×b, and the local background neighborhood extends the target area by d both in width and height. Note that, the parameters (a, b, d) of the local background neighborhood in HR images are set separately for SAITD¹, Hui and Anti-UAV², and are scaled accordingly when ×4 SR or ×4 downsampling is performed on the HR images. ¹The synthetic target size in SAITD is preset to less than 7×7. ²The target size is less than 0.12% of the image size [10] (i.e., 256×256 in Hui and 640×512 in Anti-UAV).

To further evaluate the impact of SR algorithms on infrared small target detection, we adopt SNR gain (SNRG), background suppression factor (BSF), signal-to-clutter ratio gain (SCRG), contrast gain (CG) and the receiver operating characteristic (ROC) curve for comprehensive evaluation. Note that, the common detection evaluation metrics calculate the ratio of the statistics in the local background neighborhood before and after detection. Since we first super-resolve the LR image and then perform detection, the inputs of the detection algorithms, which are the outputs of different SR algorithms, are different. Therefore, directly using the common detection evaluation metrics cannot accurately evaluate the impact of SR on detection. To eliminate the influence of different inputs, we modify the first four metrics to calculate the ratio of the statistics in the local background neighborhood between the LR image before SR and the HR target image after detection. The modified evaluation metrics are shown in Fig. 4(b). We then introduce the aforementioned evaluation metrics in detail. SNRG is used to measure the SNR improvement of detection algorithms and is formulated as:

$\mathrm{SNRG} = \mathrm{SNR}_{out} / \mathrm{SNR}_{in}, \qquad (6)$

where the subscripts $in$ and $out$ denote the metrics computed in the local background neighborhood of the LR images (before SR) and of the HR target images (after detection) respectively, and the SNR is computed from the maximum values of the target area and the background area. BSF is used to measure the background suppression effect and is formulated as:

$\mathrm{BSF} = \sigma_{in} / \sigma_{out}, \qquad (7)$
Dataset Variants PSNR SSIM SNR CR
SAITD DLCM 26.37 0.725 0.664 14.200
Conv 27.92 0.798 0.678 14.250
CD-Conv 28.17 0.807 0.678 14.259
Hui DLCM 32.32 0.832 0.820 15.167
Conv 33.00 0.854 0.846 15.198
CD-Conv 33.12 0.857 0.859 15.203
Anti-UAV DLCM 31.44 0.901 0.946 6.739
Conv 31.85 0.913 0.960 6.696
CD-Conv 31.85 0.914 0.965 6.709
Avg. DLCM 30.04 0.820 0.810 12.035
Conv 30.93 0.855 0.828 12.048
CD-Conv 31.05 0.859 0.834 12.057
TABLE I: Ablation results of DLCM, Conv and CD-Conv for SR on SAITD, Hui and Anti-UAV datasets. Best results are shown in boldface.

where $\sigma$ is the standard deviation of the background area. SCRG is used to measure the SCR improvement of detection algorithms and is formulated as:

$\mathrm{SCRG} = \mathrm{SCR}_{out} / \mathrm{SCR}_{in}, \qquad (8)$

where the SCR is computed from the mean values of the target area and the background area. CG is used to measure the improvement of the contrast between targets and background and is formulated as:

$\mathrm{CG} = \mathrm{CON}_{out} / \mathrm{CON}_{in}, \qquad (9)$

Note that, in order to avoid values of "Inf" (i.e., the denominator is zero) and "NaN" (i.e., the numerator and the denominator are both zero), we add a small constant $\epsilon$ to each denominator in Eqs. (6)-(9) to prevent it from being zero. ROC is used to measure the trend between the detection probability $P_d$ and the false alarm probability $F_a$, which are formulated as:

$P_d = N_{TD} / N_{AT}, \qquad (10)$
$F_a = N_{FD} / N_{P}, \qquad (11)$

where $N_{TD}$ and $N_{FD}$ are the numbers of true detections and false detections, and $N_{AT}$ and $N_{P}$ are the number of targets and the number of image pixels, respectively. Note that, the criterion for judging a true detection is that the distance between the detected location and the groundtruth location is less than a threshold, which is set to 10 pixels [65] in our paper.
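As a concrete (hedged) illustration, the local-neighborhood statistics above can be computed along the following lines in NumPy. The exact statistics entering SNR and CR (maximum vs. mean values) and the value of ε are assumptions inferred from the descriptions above, not the authors' evaluation code.

```python
import numpy as np

EPS = 1e-7   # placeholder for the small constant added to denominators

def split_neighborhood(img, box, border):
    """box = (y, x, a, b): target area of size a x b with top-left corner (y, x);
    the local background neighborhood extends the target area by border = (c, d)."""
    y, x, a, b = box
    c, d = border
    target = img[y:y + a, x:x + b]
    neigh = img[y - c:y + a + c, x - d:x + b + d]
    mask = np.ones(neigh.shape, dtype=bool)
    mask[c:c + a, d:d + b] = False            # exclude the target area itself
    return target, neigh[mask]                # target area, background area

def snr(img, box, border):
    t, bg = split_neighborhood(img, box, border)
    return (t.max() - bg.max()) / (bg.std() + EPS)        # assumed form (maximum values)

def contrast_ratio(img, box, border):
    t, bg = split_neighborhood(img, box, border)
    return t.max() / (bg.mean() + EPS)                    # assumed form

def scr(img, box, border):
    t, bg = split_neighborhood(img, box, border)
    return abs(t.mean() - bg.mean()) / (bg.std() + EPS)   # mean values, as in Eq. (8)

# The gains compare the LR image before SR with the HR target image after detection,
# e.g. SCRG = scr(detected_hr_target_img, box_hr, border_hr) / scr(lr_img, box_lr, border_lr).
```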

IV-A3 Network Parameters

The parameters of the CD-RG used for feature extraction are CD-RG(4, 6, 32), and the parameters of the RGs are RG1,2(1, 4, 64) and RG3(8, 6, 32). The parameters of the two LSTAs are LSTA1(3, 3) and LSTA2(3, 1).

IV-A4 Training Details

During the training phase, we randomly extracted 7 consecutive frames from an LR video clip and randomly cropped a 64×64 patch as the input. Meanwhile, the corresponding patch in the HR video clip was cropped as the groundtruth. We followed [76, 75] to augment the training data by random flipping and rotation.

All experiments were implemented on a PC with an Nvidia RTX 3090 GPU. The networks were optimized using the Adam method [38] with β1 = 0.9, β2 = 0.999, and the batch size was set to 12. The learning rate was halved at 10K, 20K and 60K iterations. We trained our network from scratch for 100K iterations.
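A skeleton of this training setup is sketched below for reference. The initial learning rate is not stated above, so the value used here is a placeholder, the data are random stand-in tensors, and the model refers to the simplified sketch in Section III-A.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 7-frame 64x64 LR clips and the corresponding x4 HR reference patches
# (in practice these come from the SAITD training sequences with flip/rotation augmentation).
train_set = TensorDataset(torch.rand(48, 7, 1, 64, 64), torch.rand(48, 1, 256, 256))
loader = DataLoader(train_set, batch_size=12, shuffle=True)

model = MoCoPnetSketch()                                         # see the sketch in Sec. III-A
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4,        # 4e-4 is a placeholder value
                             betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10_000, 20_000, 60_000], gamma=0.5)   # halve at 10K/20K/60K iterations
criterion = nn.MSELoss()

step = 0
while step < 100_000:
    for lr_clip, hr_patch in loader:
        loss = criterion(model(lr_clip), hr_patch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
        step += 1
        if step == 100_000:
            break
```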

Fig. 5: A toy example of features generated by RG (b) and CD-RG (c). Note that, (a) represents the corresponding frame of the input image sequence. Red and blue boxes represent target and edge area, and the remaining area is background area.

IV-B Ablation Study

In this subsection, we conduct ablation experiments to validate our design choice.

IV-B1 Central Difference Residual Group

To demonstrate the effectiveness of our central difference residual group (CD-RG), we replace all the CD-Convs in CD-RG with plain Convs (i.e., a residual group) and retrain the network from scratch. The experimental results in Table I show that CD-RG (i.e., CD-Convs) introduces 0.12dB/0.004 gains on PSNR/SSIM and 0.006/0.009 gains on SNR/CR. This demonstrates that CD-RG can exploit the spatial local contrast prior to effectively improve the SR performance and the target contrast.

In addition, we visualize the feature maps generated by the residual group (RG) and CD-RG with a toy example in Fig. 5. Note that, the visualization maps are the L2 norm results along the channel dimension [40, 43], and the red and blue boxes represent target areas and edge areas respectively. As illustrated in Fig. 5(a), the input frame of the image sequence consists of a target of size 3×3 (i.e., the white cube at the top) and the clutter (i.e., the white area at the bottom). It can be observed from Figs. 5(b) and (c) that the target contrast in the feature map extracted by CD-RG is higher than that of RG. This demonstrates that CD-RG can enhance the target contrast (from 7.41 to 13.55). In addition, CD-RG can also improve the contrast between high-frequency edges and background (from 6.64 to 13.59). This is because CD-RG aggregates the gradient-level information to concentrate more on the high-frequency edge information, thus improving the SR performance and the target contrast simultaneously.
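The channel-wise L2 norm used for these visualizations is straightforward to reproduce; the short helper below (illustrative) collapses a feature tensor to a per-pixel heat map.

```python
import torch

def channel_l2_map(feat):
    """Collapse a (B, C, H, W) or (C, H, W) feature tensor to a heat map by taking
    the L2 norm over the channel dimension [40, 43]."""
    return (feat ** 2).sum(dim=-3).sqrt()

print(channel_l2_map(torch.rand(1, 64, 32, 32)).shape)  # torch.Size([1, 32, 32])
```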

Moreover, we conduct ablation experiments to replace all the CD-Convs in MoCoPnet with DLCMs. Note that, the training process of MoCoPnet with DLCMs is unstable, with sudden loss divergence due to gradient fracture. By contrast, CD-Conv preserves the image feature information to update all pixels, which ensures the continuity of gradient propagation. The ablation results in Table I show that CD-Conv introduces significant performance gains on PSNR/SSIM (i.e., 1.01/0.039 on average) and further improves the contrast of small targets (i.e., 0.024/0.022 SNR/CR gains on average).

Variants Details PSNR SSIM SNR CR
LSTA1 w/o LSTA 30.77 0.851 0.813 12.044
LSTA2 C-LSTA[(3,1),(3,1)] 31.02 0.857 0.831 12.054
LSTA3 LSTA(3,1) 30.89 0.854 0.825 12.051
LSTA4 LSTA(9,1/4) 30.96 0.855 0.828 12.051
LSTA5 P-LSTA[(3,1),(3,3),(3,5)] 30.98 0.857 0.829 12.052
OFM optical flow module 30.94 0.855 0.819 12.048
MoCoPnet C-LSTA[(3,3),(3,1)] 31.05 0.859 0.834 12.057
TABLE II: Ablation results of the local spatio-temporal attention module, averaged over the SAITD, Hui and Anti-UAV datasets. Note that, LSTA1 validates the effectiveness of the module, and LSTA2-5 investigate the impact of its parameters, numbers, sub-pixel information exploitation and arrangements on SR performance. OFM validates the superiority of LSTA over the optical flow technique. "C-LSTA[(a,b),(c,d)]" represents cascaded LSTA(a,b) and LSTA(c,d). "P-LSTA[(a,b),(c,d)]" represents parallel LSTA(a,b) and LSTA(c,d). Best results are shown in boldface.

IV-B2 Local Spatio-Temporal Attention Module

In MoCoPnet, two cascaded LSTAs with parameters LSTA1(3, 3) and LSTA2(3, 1) are used to enhance the spatio-temporal local features of sequence images in a coarse-to-fine manner. To validate the effectiveness of our design choice, we first remove the LSTAs in MoCoPnet and name the model LSTA1. In addition, we further conduct ablation experiments to investigate the influences of the parameters, numbers, sub-pixel information exploitation and arrangements of LSTAs on SR performance. Specifically, we first replace the LSTAs in MoCoPnet with two cascaded LSTAs with parameters (3, 1) and name the model LSTA2. Secondly, we replace the LSTAs in MoCoPnet with a single LSTA with parameters (3, 1) and name the model LSTA3. Thirdly, we replace the LSTAs in MoCoPnet with a single LSTA with parameters (9, 1/4) and name the model LSTA4. Fourthly, we replace the LSTAs in MoCoPnet with three parallel LSTAs with parameters (3, 1), (3, 3), (3, 5) and name the model LSTA5.

Fig. 6: A toy example illustration of feature maps generated by OFM and LSTA. Note that, t represents the temporal dimension. The blue dots and the red dot represent the groundtruth position of the target in the current feature and in the reference feature, respectively.
Methods Bicubic VSRnet [32] VESPCN [5] RCAN [97] SOF-VSR [76] TDAN [71] D3Dnet[91] MoCoPnet
SAITD 25.37 / 0.663 26.03 / 0.706 26.57 / 0.735 26.58 / 0.735 26.97 / 0.753 26.11 / 0.709 27.81 / 0.794 28.17 / 0.807
Hui 31.43 / 0.809 32.03 / 0.828 32.33 / 0.835 32.44 / 0.836 32.55 / 0.841 32.17 / 0.830 32.84 / 0.850 33.12 / 0.857
Anti-UAV 30.76 / 0.889 31.42 / 0.904 31.63 / 0.910 31.73 / 0.912 31.68 / 0.912 31.58 / 0.905 31.81 / 0.911 31.85 / 0.914
Average 29.19 / 0.787 29.83 / 0.813 30.18 / 0.827 30.25 / 0.828 30.40 / 0.835 29.95 /0.815 30.82 / 0.852 31.05 / 0.859
TABLE III: PSNR and SSIM results of different methods achieved on SAITD [65], Hui [28] and Anti-UAV [100] datasets. Best results are shown in boldface.
Resolution 640×512 256×256 640×640 - 2560×2048 2048×2048 2560×2048 -
Methods SAITD Hui Anti-UAV Average SAITD Hui Anti-UAV Average
SNR CR SNR CR SNR CR SNR CR SNR CR SNR CR SNR CR SNR CR
LR 0.666 14.066 0.781 13.583 0.915 6.174 0.787 11.274 - - - - - - - -
Bicubic 0.676 13.780 0.750 15.100 0.817 6.747 0.747 11.875 0.808 14.736 0.986 15.817 0.958 7.173 0.917 12.575
VSRnet [32] 0.659 14.125 0.776 15.100 0.882 6.726 0.773 11.984 0.672 14.704 1.002 15.661 0.954 7.178 0.876 12.514
VESPCN [5] 0.656 14.118 0.793 15.145 0.920 6.641 0.790 11.968 0.901 14.664 1.005 15.616 0.963 7.115 0.956 12.465
RCAN [97] 0.670 14.213 0.813 15.202 0.952 6.713 0.811 12.043 0.914 14.720 0.997 15.649 0.947 7.168 0.953 12.512
SOF-VSR [76] 0.662 14.175 0.808 15.113 0.932 6.698 0.800 11.995 0.913 14.700 0.997 15.603 0.965 7.168 0.958 12.490
TDAN [71] 0.655 14.206 0.772 15.192 0.882 6.711 0.770 12.036 0.889 14.725 0.999 15.693 0.963 7.173 0.950 12.530
D3Dnet [91] 0.672 14.240 0.845 15.215 0.972 6.736 0.830 12.064 0.907 14.731 1.005 15.699 0.964 7.166 0.959 12.532
MoCoPnet 0.678 14.259 0.859 15.203 0.965 6.709 0.834 12.057 0.922 14.729 1.006 15.685 0.966 7.181 0.965 12.532
HR 0.810 14.262 1.001 15.265 0.959 6.706 0.923 12.078 - - - - - - - -
TABLE IV: SNR and CR results of different methods achieved on super-resolved LR images (left columns) and super-resolved HR images (right columns). Note that, we add the results of LR and HR as the baseline results, and the resolution of the LR images is 4 times lower than the listed resolution. Excluding LR and HR, best results are shown in boldface and second best results are underlined.
Fig. 7: Visual results of different SR methods on LR images for ×4 SR.
Fig. 8: Visual results of SR on real HR images for ×4 SR.

The experimental results of LSTA1 are shown in Table II. It can be observed that the PSNR/SSIM/SNR/CR scores of LSTA1 are 0.28dB/0.008/0.021/0.013 lower than those of MoCoPnet. This demonstrates that LSTA can effectively use the supplementary temporal information to enhance the local features, thus improving the SR performance and the target contrast. The PSNR/SSIM/SNR/CR scores of LSTA2 are 0.03dB/0.002/0.003/0.003 lower than those of MoCoPnet. This demonstrates that an LSTA with a larger dilation rate (i.e., 3) for coarse processing promotes our network to better extract and utilize temporal information. The PSNR/SSIM/SNR/CR scores of LSTA3 are 0.16dB/0.004/0.009/0.006 lower than those of MoCoPnet and 0.13dB/0.003/0.006/0.003 lower than those of LSTA2. This demonstrates that coarse-to-fine processing benefits SR performance and target enhancement. The PSNR/SSIM/SNR/CR scores of LSTA4 are slightly higher than those of LSTA3 by 0.07dB/0.001/0.003/0.000, but the memory cost of LSTA4 is about 2 times that of LSTA3 (i.e., 2.46 vs. 1.17). This demonstrates that sub-pixel information exploitation benefits the performance of SR and target enhancement but significantly increases the memory cost. The PSNR/SSIM/SNR/CR scores of LSTA5 are 0.07dB/0.002/0.005/0.005 lower than those of MoCoPnet and 0.09dB/0.003/0.004/0.001 higher than those of LSTA3. This demonstrates that the cascade mode of LSTAs can better exploit inter-frame information correlation, and that SR performance and target enhancement can be further improved by enlarging the receptive field of LSTAs.

Method Parameters
Top-hat 5×5 square filter
ILCM 5×5 filter
IPI B=50×50, S=10, L=1
TABLE V: Parameter settings of Top-hat [59], ILCM [95] and IPI [18] in HR images. "B" represents block size and "S" represents stride.

Finally, we replace the LSTAs in MoCoPnet with an optical flow module (OFM) to compare our LSTA with the widely applied optical flow technique. The experimental results are listed in Table II. It can be observed that the PSNR/SSIM/SNR/CR scores of MoCoPnet with LSTAs are higher than those of MoCoPnet with OFM by 0.11dB/0.004/0.015/0.009. Meanwhile, the parameters and FLOPs of MoCoPnet with LSTA modules are lower than those of MoCoPnet with OFM by 0.11M and 2.70G. This demonstrates that LSTA is superior in exploiting the information between frames to improve the SR performance and the target contrast with lower computational cost. This is because LSTA directly learns motion compensation through the attention mechanism, avoiding the explicit optical flow estimation and warping that can result in ambiguous and duplicated results [54, 30].

In addition, we visualize the feature maps generated by OFM and LSTAs with a toy example in Fig. 6. Note that, the visualization maps are the L2 norm results along the channel dimension [40, 43]. As illustrated in Fig. 6(a), the input image sequence consists of a random consistent movement of a 3×3 target (i.e., the white cube) in the background (i.e., the black area). The feature maps before OFM and LSTAs are shown in Figs. 6(b) and (d). It can be observed that the target positions in the extracted feature maps are close to the blue dots (i.e., the groundtruth position of the target in the current feature). Then, OFM and LSTA perform feature alignment on the extracted features. As illustrated in Fig. 6(c), the target positions in the feature maps generated by OFM remain close to the blue dots. The feature maps generated by LSTA1(3, 3) and LSTA2(3, 1) are shown in Figs. 6(e) and (f). As illustrated in Fig. 6(f), all the target positions in the feature maps generated by LSTA2 are closer to the red dot (i.e., the groundtruth position of the target in the reference feature) than those of OFM. This demonstrates that LSTA is superior in motion compensation. Note that, it can be observed from Figs. 6(e) and (f) that LSTA1 and LSTA2 achieve coarse-to-fine alignment to highlight the aligned target. This demonstrates the effectiveness and superiority of our coarse-to-fine alignment strategy.

IV-C Comparative Evaluation

In this subsection, we compare our MoCoPnet with a top-performing single image SR method, RCAN [97], and 5 video SR methods: VSRnet [32], VESPCN [5], SOF-VSR [76, 75], TDAN [71] and D3Dnet [91]. We also present the bicubicly upsampled (Bicubic) images as the baseline results. For a fair comparison, we retrain all the compared methods on the infrared small target dataset [28] and exclude the first and last 2 frames of the video sequences for performance evaluation.

Resolution 640×512 256×256 640×640 -
Methods SAITD Hui Anti-UAV Average
SNRG BSF SCRG CG SNRG BSF SCRG CG SNRG BSF SCRG CG SNRG BSF SCRG CG
Top-hat Bicubic 0.50 2.55 6.43 3.60 0.93 1.79 15.94 9.66 3.77 4.32 15.58 3.12 1.73 2.89 12.65 5.46
D3Dnet 0.77 3.10 9.31 4.28 1.49 2.01 20.47 11.20 9.60 6.70 31.05 3.33 3.95 3.94 20.28 6.27
MoCoPnet 0.82 3.25 9.04 3.55 1.53 1.98 18.88 10.22 13.06 6.72 28.35 2.82 5.14 3.99 18.76 5.53
HR 1.62 1.84 5.40 3.22 1.73 1.55 8.82 5.30 7.61 13.18 74.73 2.99 3.65 5.52 29.65 3.83
ILCM Bicubic 1.07 1.20 5.93 4.89 0.96 0.91 3.73 4.34 0.90 0.83 1.82 2.15 0.97 0.98 3.83 3.79
D3Dnet 1.07 1.02 7.29 7.15 1.08 0.84 8.01 10.03 1.07 0.77 2.84 3.72 1.07 0.87 6.05 6.96
MoCoPnet 1.08 1.00 7.91 7.91 1.09 0.84 8.21 10.12 1.07 0.76 3.30 4.47 1.08 0.87 6.47 7.50
HR 1.37 0.89 11.85 13.20 1.31 0.77 10.88 15.39 1.05 0.70 3.14 4.54 1.24 0.79 8.62 11.04
IPI Bicubic 5.17 2.110 1.67 1-3 6.29 1.110 2.19 0.82 9.89 9.99 3.39 0.06 5.39 1.410 1.89 0.29
D3Dnet 1.98 4.09 2.18 0.07 1.910 5.09 3.59 0.96 2.010 3.29 4.19 0.13 1.310 4.19 2.69 0.39
MoCoPnet 3.58 2.79 6.48 0.11 2.010 4.59 4.69 1.12 4.210 5.49 8.19 0.13 2.110 4.29 4.49 0.45
HR 1.110 2.69 2.39 0.13 0.60 2.29 8.63 0.61 8.09 1.59 1.39 0.16 6.29 2.19 1.29 0.30
TABLE VI: Quantitative detection results of Top-hat, ILCM and IPI achieved on super-resolved LR images in the infrared small target datasets. Best results are shown in boldface and second best results are underlined.
Resolution 2560×2048 2048×2048 2560×2560 -
Methods SAITD Hui Anti-UAV Average
SNRG BSF SCRG CG SNRG BSF SCRG CG SNRG BSF SCRG CG SNRG BSF SCRG CG
Top-hat Bicubic 1.01 1.79 4.45 2.78 1.19 1.82 6.09 3.03 4.66 6.95 49.83 4.40 2.28 3.52 20.13 3.40
D3Dnet 1.27 1.70 4.67 2.95 1.21 1.74 5.63 2.98 3.21 7.39 59.37 4.40 1.90 3.61 23.22 3.44
MoCoPnet 1.33 1.70 4.60 2.87 1.20 1.78 5.88 3.11 3.20 7.35 58.17 4.53 1.91 3.61 22.88 3.50
ILCM Bicubic 1.20 0.90 9.02 11.10 0.97 0.84 9.72 11.76 1.00 0.86 7.77 8.71 1.06 0.86 8.84 10.52
D3Dnet 1.37 0.85 12.65 16.53 1.04 0.78 10.54 13.94 1.02 0.85 8.58 9.60 1.14 0.82 10.59 13.35
MoCoPnet 1.40 0.85 12.86 16.90 1.04 0.76 10.50 14.18 1.02 0.85 8.71 9.75 1.15 0.82 10.69 13.61
IPI Bicubic 2.59 3.99 2.88 0.039 5.110 2.610 1.310 0.14 1.79 1.29 3.08 0.046 1.810 1.010 4.69 0.074
D3Dnet 3.49 2.99 3.18 0.048 4.810 1.910 1.610 0.16 3.58 4.07 7.97 0.048 1.710 7.29 5.49 0.086
MoCoPnet 3.69 2.69 3.28 0.044 3.710 1.610 1.710 0.18 0.23 20.90 9.95 0.049 1.310 6.29 5.99 0.092
TABLE VII: Quantitative detection results of Top-hat, ILCM and IPI achieved on super-resolved HR images in the infrared small target datasets. Best results are shown in boldface and second best results are underlined.
Fig. 9: Qualitative results of super-resolved LR image and detection results in SAITD, Hui and Anti-UAV datasets.

IV-C1 SR on Synthetic Images

PSNR/SSIM results calculated on the whole image are listed in Table III. SNR and CR scores calculated in the local background neighborhood are listed in the left columns of Table IV. It can be observed that MoCoPnet achieves the highest PSNR and SSIM scores and outperforms most of the compared algorithms on SNR and CR scores. These scores demonstrate that our network can effectively recover accurate details and improve the target contrast. That is because LSTA performs implicit motion compensation and CD-RG incorporates the center-oriented gradient information, which together effectively improve the SR performance and the target contrast.

Qualitative results are shown in Fig. 7. For SR performance, it can be observed from the blue zoom-in regions that MoCoPnet can recover more accurate details (e.g., the sharp edges of buildings, and the lighthouse details closer to the groundtruth HR image). For target enhancement, it can be observed from the red zoom-in regions that, in the first row, MoCoPnet can further improve the contrast of the target, which is almost invisible in the results of the compared methods. In the second row, MoCoPnet is more robust to the large motion caused by turntable collection [28] (e.g., note the artifacts in the zoom-in region of D3Dnet). In the third row, MoCoPnet can effectively improve the target contrast to be even higher than that of the HR images (i.e., 1.82 vs. 1.75).

IV-C2 SR on Real Images

SNR and CR scores calculated in the local background neighborhood of super-resolved HR images are listed in the right columns of Table IV. It can be observed that MoCoPnet achieves the best SNR score and the second best CR score on the average of the test datasets under real-world degradation. This demonstrates the superiority of our method in improving the contrast between targets and background.

Qualitative results are shown in Fig. 8. It can be observed that MoCoPnet can recover finer details and achieve better visual quality, such as the edges of buildings and windows. In addition, MoCoPnet can further improve the intensity and the contour details of the targets.

IV-D Effect on Infrared Small Target Detection Algorithms

Fig. 10: ROC results of Top-hat, ILCM and IPI achieved on super-resolved LR images in the SAITD, Hui and Anti-UAV datasets.
Fig. 11: Qualitative results of super-resolved HR images and detection results in the SAITD, Hui and Anti-UAV datasets.
Fig. 12: ROC results of Top-hat, ILCM and IPI achieved on super-resolved HR images in the SAITD, Hui and Anti-UAV datasets.

In this subsection, we select three typical infrared small target detection algorithms (Top-hat [59], ILCM [95] and IPI [18]) to perform detection on super-resolved infrared images. The parameters of the three detection algorithms are shown in Table V. When ×4 SR is performed on HR images, the filter sizes, the block size and stride, as well as the true detection threshold, are enlarged by 4 times. When ×4 downsampling is performed on HR images, the filter sizes of Top-hat and ILCM are set to 3×3, the block size and stride of IPI are set to 15×15 and 3, and the true detection threshold is set to 3.0. For simplicity, we only use the two best super-resolved results, from D3Dnet and MoCoPnet, to perform detection. We also introduce bicubicly upsampled (Bicubic) images and HR images as the baseline results.

IV-D1 Detection on Synthetic Images

The quantitative detection results of super-resolved LR images are listed in Table VI. It can be observed that the SNRG, SCRG and CG scores of the super-resolved images are generally higher than those of the Bicubic images. This demonstrates that SR algorithms can effectively improve the contrast between the target and the background, thus promoting the detection performance. It is worth noting that the SNRG, SCRG and CG scores of D3Dnet and MoCoPnet can even surpass those of HR. This is because SR algorithms perform better on the high-frequency small targets than on the low-frequency local background, thus achieving higher target contrast than the HR images. In addition, Bicubic achieves the highest BSF score in most cases. This is because SR algorithms act on the entire image, which enhances targets and background simultaneously, and detection algorithms have better filtering performance on smoothly changing backgrounds. Note that, the BSF of MoCoPnet is generally higher than that of D3Dnet. This is because MoCoPnet focuses on recovering the local salient features in the image and further improves the contrast between targets and background, which benefits the detection performance.

The qualitative results of super-resolved LR images and detection results are shown in Fig. 9. In the LR images, the target intensities are very low (e.g., the targets in SAITD and Anti-UAV are almost invisible). In the super-resolved images, the target intensities are higher and closer to those of the HR images. This is because SR algorithms can effectively use the spatio-temporal information to enhance the target contrast. Note that, our MoCoPnet is more robust to the large motion caused by turntable collection [28] (i.e., note the artifacts in the zoom-in region of D3Dnet in the Hui dataset). In addition, the neighborhood noise in the HR images is suppressed by downsampling followed by super-resolution (e.g., point noise does not exist in the zoom-in regions of the Hui and Anti-UAV datasets). Then, we perform detection on the super-resolved images. It can be observed in Fig. 9 that all the detection algorithms have poor performance on the Bicubic images (e.g., the target intensity in the target image is very low and almost invisible in all detection results). This is because bicubic interpolation cannot introduce additional information. However, the target intensities in the target images of the super-resolved images are higher than those of the Bicubic images. Among the super-resolved images, MoCoPnet is superior to D3Dnet in improving the target contrast due to the center-oriented gradient-aware feature extraction of CD-RG and the effective spatio-temporal information exploitation of LSTA.

To evaluate the detection performance comprehensively, we further calculate the ROC results, which are shown in Fig. 10. Note that, ROC results on LR and HR images are used as the baseline results. The targets in HR images have the highest intensity. Therefore, high detection probability and low false alarm probability can be obtained, and the detection probability reaches 1 faster (e.g., the ROC curves of HR reach 1 the fastest in the SAITD and Hui datasets). Downsampling reduces the target intensity, thus reducing the detection probability and increasing the false alarm probability. Bicubic introduces no additional image prior information; therefore, LR and Bicubic have the worst detection performance and their ROC results are significantly lower than those of the other algorithms (e.g., the ROC results of LR are the lowest and those of Bicubic are the second lowest, except for the ROC of Top-hat in the SAITD dataset). SR algorithms can introduce prior information to improve the contrast between targets and background, thus achieving improved detection accuracy (e.g., the ROC results of MoCoPnet and D3Dnet are higher than those of Bicubic in the SAITD and Hui datasets and even higher than those of HR in the Anti-UAV dataset). Note that, the false alarm rates of LR and Bicubic can only reach a relatively low value. This is because IPI achieves detection by sparse and low-rank recovery, which significantly decreases the false alarm rate compared with Top-hat and ILCM. On the other hand, IPI suffers from a low detection rate on low-contrast targets. Therefore, the ROC curves of the Bicubic and LR images are shorter than those of the HR and super-resolved images. The above experimental results show that SR algorithms can recover high-contrast targets, thus improving the detection performance.

IV-D2 Detection on Real Images

The quantitative detection results of super-resolved HR images are listed in Table VII. It can be observed that the detection performance of the SR algorithms is superior to that of Bicubic. This demonstrates that MoCoPnet and D3Dnet can effectively improve the contrast between targets and background, resulting in a performance gain in detection. Among the SR algorithms, due to the superior SR performance and target enhancement of our well-designed modules, MoCoPnet can achieve the best SNRG, SCRG and CG scores in most cases. Note that, the SNRG and SCRG scores (achieved by IPI) of MoCoPnet in the Anti-UAV dataset are 7-8 orders of magnitude lower than those of Bicubic and D3Dnet. First of all, MoCoPnet achieves the highest CG scores. This demonstrates that the target intensity can be effectively and further enhanced by MoCoPnet. The differences therefore come from the performance of background suppression. Since MoCoPnet achieves higher SR performance than Bicubic and D3Dnet, the local backgrounds of Bicubic and D3Dnet are smoother and the detection algorithms can achieve better suppression performance on them. IPI is superior in suppressing background clutter; therefore, the local backgrounds in the target images of Bicubic and D3Dnet are sometimes zero. Since we add $\epsilon$ to each denominator in Eqs. (6)-(9) to prevent it from being zero, the SNRG and SCRG scores can be very large due to the completely suppressed background. In addition, bicubic interpolation suppresses the high-frequency components to a certain extent, resulting in the optimal BSF value.

The qualitative results of super-resolved HR images and detection results are shown in Fig. 11. It can be observed that the targets in the Bicubic images are blurred, while SR can enhance the intensity of the targets (e.g., the highlighted and sharpened targets). We then perform detection on the super-resolved images. Note that, SR algorithms can effectively improve the intensity of targets and their contrast against the background, resulting in better detection performance.

To evaluate the detection performance comprehensively, we further present the ROC results in Fig. 12. Note that, ROC results on HR images are used as the baseline results. It can be observed that SR algorithms can improve the detection probability and reduce the false alarm probability in most cases. Compared with D3Dnet, MoCoPnet can further improve the target contrast, thus promoting the detection performance. Note that, the false alarm rates of Bicubic can only reach a relatively low value. This is because IPI achieves detection by sparse and low-rank recovery, which significantly decreases the false alarm rate compared with Top-hat and ILCM. In other words, IPI suffers from a low detection rate on low-contrast targets.

V Conclusion

In this paper, we propose a local motion and contrast prior driven deep network (MoCoPnet) for infrared small target super-resolution. Experimental results show that MoCoPnet can effectively recover the image details and enhance the contrast between targets and background. Based on the super-resolved images, we further investigate the effect of SR algorithms on detection performance. Experimental results show that MoCoPnet can improve the performance of infrared small target detection.

References

  • [1] N. Ahn, B. Kang, and K. Sohn (2018) Fast, accurate and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision, pp. 252–268. Cited by: §II-A.
  • [2] S. Anwar and N. Barnes (2019) Real image denoising with feature attention. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3155–3164. Cited by: §II-D.
  • [3] S. P. Belekos, N. P. Galatsanos, and A. K. Katsaggelos (2010) Maximum a posteriori video super-resolution using a new multichannel image prior. IEEE Transactions on Image Processing 19 (6), pp. 1451–1464. Cited by: §II-B.
  • [4] M. Ben-Ezra, A. Zomet, and S. K. Nayar (2005) Video super-resolution using controlled subpixel detector shifts. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (6), pp. 977–987. Cited by: §II-B.
  • [5] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi (2017) Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4778–4787. Cited by: §II-B, §IV-C, TABLE III, TABLE IV.
  • [6] H. Chang, D. Yeung, and Y. Xiong (2004) Super-resolution through neighbor embedding. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. I–I. Cited by: §II-A.
  • [7] J. Choi and M. Kim (2017) A deep convolutional neural network with selection units for super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 154–160. Cited by: §II-A.
  • [8] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773. Cited by: §II-B.
  • [9] T. Dai, J. Cai, Y. Zhang, S. Xia, and L. Zhang (2019) Second-order attention network for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11065–11074. Cited by: §II-A, §II-D.
  • [10] Y. Dai, Y. Wu, F. Zhou, and K. Barnard (2021) Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Cited by: §I, §III-B, §IV-A1, footnote 2.
  • [11] Y. Dai, Y. Wu, F. Zhou, and K. Barnard (2021) Attentional local contrast networks for infrared small target detection. IEEE Transactions on Geoscience and Remote Sensing, pp. 1–12. Cited by: §I, §I, Fig. 2, §III-B.
  • [12] Y. Dai and Y. Wu (2017) Reweighted infrared patch-tensor model with both nonlocal and local priors for single-frame small target detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10 (8), pp. 3752–3767. Cited by: §II-E.
  • [13] C. Dong, C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In Proceedings of the European Conference on Computer Vision, pp. 184–199. Cited by: §II-A.
  • [14] C. Dong, C. Loy, and X. Tang (2016) Accelerating the super-resolution convolutional neural network. In Proceedings of the European Conference on Computer Vision, pp. 391–407. Cited by: §II-A.
  • [15] C. Duchon (1979) Lanczos filtering in one and two dimensions. Journal of Applied Meteorology 18 (8), pp. 1016–1022. Cited by: §II-A.
  • [16] G. Freedman and R. Fattal (2011) Image and video upscaling from local self-examples. ACM Transactions on Graphics 30 (2), pp. 1–11. Cited by: §II-A.
  • [17] W. Freeman, T. Jones, and E. Pasztor (2002) Example-based super-resolution. IEEE Computer Graphics and Applications 22 (2), pp. 56–65. Cited by: §II-A.
  • [18] C. Gao, D. Meng, Y. Yang, Y. Wang, X. Zhou, and A. G. Hauptmann (2013) Infrared patch-image model for small target detection in a single image. IEEE Transactions on Image Processing 22 (12), pp. 4996–5009. Cited by: §II-E, §IV-A2, §IV-D, TABLE V.
  • [19] D. Glasner, S. Bagon, and M. Irani (2009) Super-resolution from a single image. In Proceedings of the IEEE International Conference on Computer Vision, pp. 349–356. Cited by: §II-A.
  • [20] N. J. Gordon, D. J. Salmond, and A. F. Smith (1993) Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F: Radar and Signal Processing 140 (2), pp. 107–113. Cited by: §II-E.
  • [21] J. Guo and H. Chao (2017) Building an end-to-end spatial-temporal convolutional network for video super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §II-B.
  • [22] T. Y. Han, Y. J. Kim, and B. C. Song (2017) Convolutional neural network-based infrared image super resolution under low light environment. In Proceedings of the European Signal Processing Conference, pp. 803–807. Cited by: §II-C.
  • [23] M. Haris, G. Shakhnarovich, and N. Ukita (2018) Deep back-projection networks for super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1664–1673. Cited by: §II-A.
  • [24] M. Haris, G. Shakhnarovich, and N. Ukita (2019) Recurrent back-projection network for video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3897–3906. Cited by: §II-B.
  • [25] Z. He, S. Tang, J. Yang, Y. Cao, M. Y. Yang, and Y. Cao (2018) Cascaded deep networks with multiple receptive fields for infrared image super-resolution. IEEE Transactions on Circuits and Systems for Video Technology 29 (8), pp. 2310–2322. Cited by: §II-C.
  • [26] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141. Cited by: §II-D.
  • [27] Y. Huang, W. Wang, and L. Wang (2017) Video super-resolution via bidirectional recurrent convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 1015–1028. Cited by: §II-B.
  • [28] B. Hui, Z. Song, H. Fan, P. Zhong, W. Hu, X. Zhang, J. Ling, H. Su, W. Jin, Y. Zhang, and Y. Bai (2020) A dataset for infrared detection and tracking of dim-small aircraft targets under ground/air background. China Scientific Data. Cited by: §IV-A1, §IV-C1, §IV-C, §IV-D1, TABLE III.
  • [29] M. Irani and S. Peleg (1991) Improving resolution by image registration. Graphical Models and Image Processing 53 (3), pp. 231–239. Cited by: §II-A.
  • [30] T. Isobe, S. Li, X. Jia, S. Yuan, G. Slabaugh, C. Xu, Y. Li, S. Wang, and Q. Tian (2020) Video super-resolution with temporal group attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8008–8017. Cited by: §II-B, §IV-B2.
  • [31] Y. Jo, S. Wug Oh, J. Kang, and S. Joo Kim (2018) Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3224–3232. Cited by: §II-B.
  • [32] A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos (2016) Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging 2 (2), pp. 109–122. Cited by: §IV-C, TABLE III, TABLE IV.
  • [33] R. Keys (1981) Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech, and Signal Processing 29 (6), pp. 1153–1160. Cited by: §II-A.
  • [34] J. Kim, J. K. Lee, and K. M. Lee (2016) Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654. Cited by: §II-A.
  • [35] J. Kim, J. K. Lee, and K. M. Lee (2016) Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1637–1645. Cited by: §II-A.
  • [36] K. Kim and Y. Kwon (2010) Single-image super-resolution using sparse regression and natural image prior. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (6), pp. 1127–1133. Cited by: §II-A.
  • [37] S. Y. Kim, J. Lim, T. Na, and M. Kim (2019) Video super-resolution based on 3D-CNNs with consideration of scene change. In Proceedings of the IEEE International Conference on Image Processing, pp. 2831–2835. Cited by: §II-B.
  • [38] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, Cited by: §IV-A4.
  • [39] C. Koch and T. Poggio (1999) Predicting the visual world: silence is golden. Nature Neuroscience 2 (1), pp. 9–10. Cited by: §II-E.
  • [40] N. Komodakis and S. Zagoruyko (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations, Cited by: §IV-B1, §IV-B2.
  • [41] Z. Kun (2004) Background noise suppression in small targets infrared images and its method discussion. Optics and Optoelectronic Technology 2 (2), pp. 9–12. Cited by: §II-E.
  • [42] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690. Cited by: §II-A.
  • [43] B. Li, C. Xiao, L. Wang, Y. Wang, Z. Lin, M. Li, W. An, and Y. Guo (2021) Dense nested attention network for infrared small target detection. arXiv preprint arXiv:2106.00487. Cited by: §III-B, §IV-B1, §IV-B2.
  • [44] J. Li, F. Fang, J. Li, K. Mei, and G. Zhang (2020) MDCN: multi-scale dense cross network for image super-resolution. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §II-A.
  • [45] J. Li, F. Fang, K. Mei, and G. Zhang (2018) Multi-scale residual network for image super-resolution. In Proceedings of the European Conference on Computer Vision, pp. 517–532. Cited by: §II-A.
  • [46] S. Li, F. He, B. Du, L. Zhang, Y. Xu, and D. Tao (2019) Fast spatio-temporal residual network for video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §II-B.
  • [47] R. Liao, X. Tao, R. Li, Z. Ma, and J. Jia (2015) Video super-resolution via deep draft-ensemble learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 531–539. Cited by: §II-B.
  • [48] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee (2017) Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 136–144. Cited by: §II-A.
  • [49] C. Liu and D. Sun (2011) A Bayesian approach to adaptive video super resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 209–216. Cited by: §II-B.
  • [50] C. Liu and D. Sun (2013) On Bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2), pp. 346–360. Cited by: §II-B.
  • [51] H. Liu, L. Zhang, and H. Huang (2020) Small target detection in infrared videos based on spatio-temporal tensor model. IEEE Transactions on Geoscience and Remote Sensing 58 (12), pp. 8689–8700. Cited by: §I.
  • [52] Q. Liu, R. Jia, Y. Liu, H. Sun, J. Yu, and H. Sun (2021) Infrared image super-resolution reconstruction by using generative adversarial network with an attention mechanism. Applied Intelligence 51 (4), pp. 2018–2030. Cited by: §II-C.
  • [53] T. Liu, J. Yang, B. Li, C. Xiao, Y. Sun, Y. Wang, and W. An (2021) Non-convex tensor low-rank approximation for infrared small target detection. IEEE Transactions on Geoscience and Remote Sensing. Cited by: §II-E.
  • [54] Y. Lu, J. Valmadre, H. Wang, J. Kannala, M. Harandi, and P. Torr (2020) Devon: deformable volume network for learning optical flow. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 2705–2713. Cited by: §II-B, §IV-B2.
  • [55] Y. Mao, Y. Wang, J. Zhou, and H. Jia (2016) An infrared image super-resolution reconstruction method based on compressive sensing. Infrared Physics & Technology 76, pp. 735–739. Cited by: §II-C.
  • [56] D. Mitzel, T. Pock, T. Schoenemann, and D. Cremers (2009) Video super resolution using duality-based TV-L1 optical flow. In Proceedings of the Joint Pattern Recognition Symposium, pp. 432–441. Cited by: §II-B.
  • [57] Y. Qin and B. Li (2016) Effective infrared small target detection utilizing a novel local contrast method. IEEE Geoscience and Remote Sensing Letters 13 (12), pp. 1890–1894. Cited by: §I.
  • [58] I. Reed, R. Gagliardi, and H. Shao (1983) Application of three-dimensional filtering to moving target detection. IEEE Transactions on Aerospace and Electronic Systems (6), pp. 898–905. Cited by: §II-E.
  • [59] J. Rivest and R. Fortin (1996) Detection of dim targets in digital infrared imagery by morphological image processing. Optical Engineering 35 (7), pp. 1886–1893. Cited by: §IV-D, TABLE V.
  • [60] M. S. Sajjadi, R. Vemulapalli, and M. Brown (2018) Frame-recurrent video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6626–6634. Cited by: §II-B.
  • [61] M. Sajjadi, B. Scholkopf, and M. Hirsch (2017) EnhanceNet: single image super-resolution through automated texture synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4491–4500. Cited by: §II-A.
  • [62] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883. Cited by: §II-B.
  • [63] Z. Su, W. Liu, Z. Yu, D. Hu, Q. Liao, Q. Tian, M. Pietikäinen, and L. Liu (2021) Pixel difference networks for efficient edge detection. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: Fig. 2, §III-B.
  • [64] J. Sun, Z. Xu, and H. Shum (2008) Image super-resolution using gradient profile prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §II-A.
  • [65] X. Sun, L. Guo, W. Zhang, Z. Wang, Y. Hou, Z. Li, and X. Teng (2021) A dataset for small infrared moving target detection under clutter background. Chinese Scientific Data. Cited by: §IV-A1, §IV-A2, TABLE III.
  • [66] Y. Sun, J. Yang, and W. An (2020) Infrared dim and small target detection via multiple subspace learning and spatial-temporal patch-tensor model. IEEE Transactions on Geoscience and Remote Sensing 59 (5), pp. 3737–3752. Cited by: §I.
  • [67] Y. Sun, J. Yang, M. Li, and W. An (2019) Infrared small target detection via spatial–temporal infrared patch-tensor model and weighted schatten p-norm minimization. Infrared Physics & Technology 102, pp. 103050. Cited by: §I.
  • [68] Y. Tai, J. Yang, X. Liu, and C. Xu (2017) MemNet: a persistent memory network for image restoration. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4539–4547. Cited by: §II-A.
  • [69] Y. Tai, J. Yang, and X. Liu (2017) Image super-resolution via deep recursive residual network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3147–3155. Cited by: §II-A.
  • [70] X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia (2017) Detail-revealing deep video super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4472–4480. Cited by: §II-B.
  • [71] Y. Tian, Y. Zhang, Y. Fu, and C. Xu (2020) TDAN: temporally deformable alignment network for video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §II-B, §IV-C, TABLE III, TABLE IV.
  • [72] T. Tong, G. Li, X. Liu, and Q. Gao (2017) Image super-resolution using dense skip connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4799–4807. Cited by: §II-A.
  • [73] H. Wang, D. Su, C. Liu, L. Jin, X. Sun, and X. Peng (2019) Deformable non-local network for video super-resolution. IEEE Access 7, pp. 177734–177744. Cited by: §II-B.
  • [74] L. Wang, X. Dong, Y. Wang, X. Ying, Z. Lin, W. An, and Y. Guo (2021) Exploring sparsity in image super-resolution for efficient inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §II-D.
  • [75] L. Wang, Y. Guo, Z. Lin, X. Deng, and W. An (2018) Learning for video super-resolution through HR optical flow estimation. In Proceedings of the Asian Conference on Computer Vision, pp. 514–529. Cited by: §IV-A4, §IV-C.
  • [76] L. Wang, Y. Guo, L. Liu, Z. Lin, X. Deng, and W. An (2020) Deep video super-resolution using HR optical flow estimation. IEEE Transactions on Image Processing. Cited by: §II-B, §IV-A4, §IV-C, TABLE III, TABLE IV.
  • [77] L. Wang, Y. Guo, Y. Wang, Z. Liang, Z. Lin, J. Yang, and W. An (2020) Parallax attention for unsupervised stereo correspondence learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §II-A.
  • [78] L. Wang, Y. Wang, Z. Liang, Z. Lin, J. Yang, W. An, and Y. Guo (2019) Learning parallax attention for stereo image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12250–12259. Cited by: §II-A.
  • [79] X. Wang, K. C. Chan, K. Yu, C. Dong, and C. Change Loy (2019) EDVR: video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Cited by: §II-B.
  • [80] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy (2018) ESRGAN: enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision Workshops, Cited by: §II-A.
  • [81] Y. Wang, L. Wang, J. Yang, W. An, J. Yu, and Y. Guo (2019) Spatial-angular interaction for light field image super-resolution. Proceedings of the European Conference on Computer Vision. Cited by: §II-A.
  • [82] Y. Wang, J. Yang, L. Wang, X. Ying, T. Wu, W. An, and Y. Guo (2020) Light field image super-resolution using deformable convolution. IEEE Transactions on Image Processing 30, pp. 1057–1071. Cited by: §II-A.
  • [83] Y. Wang, X. Ying, L. Wang, J. Yang, W. An, and Y. Guo (2021) Symmetric parallax attention for stereo image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 766–775. Cited by: §III-B.
  • [84] Z. Xiao, X. Fu, J. Huang, Z. Cheng, and Z. Xiong (2021) Space-time distillation for video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2113–2122. Cited by: §II-B.
  • [85] Z. Xiao, Z. Xiong, X. Fu, D. Liu, and Z. Zha (2020) Space-time video super-resolution using temporal profiles. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 664–672. Cited by: §II-B.
  • [86] Z. Xiong, X. Sun, and F. Wu (2010) Robust web image/video super-resolution. IEEE Transactions on Image Processing 19 (8), pp. 2017–2028. Cited by: §II-A.
  • [87] Z. Xiong, X. Sun, and F. Wu (2010) Robust web image/video super-resolution. IEEE Transactions on Image Processing 19 (8), pp. 2017–2028. Cited by: §II-B.
  • [88] J. Yang, J. Wright, T. Huang, and Y. Ma (2008) Image super-resolution as sparse representation of raw image patches. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §II-A.
  • [89] J. Yang, J. Wright, T. Huang, and Y. Ma (2010) Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19, pp. 2861–2873. Cited by: §II-A.
  • [90] P. Yi, Z. Wang, K. Jiang, J. Jiang, and J. Ma (2019) Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3106–3115. Cited by: §II-B.
  • [91] X. Ying, L. Wang, Y. Wang, W. Sheng, W. An, and Y. Guo (2020) Deformable 3D convolution for video super-resolution. IEEE Signal Processing Letters 27, pp. 1500–1504. Cited by: §II-B, §IV-C, TABLE III, TABLE IV.
  • [92] Z. Yu, C. Zhao, Z. Wang, Y. Qin, Z. Su, X. Li, F. Zhou, and G. Zhao (2020) Searching central difference convolutional networks for face anti-spoofing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5295–5305. Cited by: Fig. 2, §III-B, §III-B.
  • [93] H. Zhang and T. Zhang (2005) A new method based on multi-stages trace association for detection and tracking of crossing multi-targets. Chinese Journal of Electronics 33 (6), pp. 1109–1112. Cited by: §II-E.
  • [94] L. Zhang, L. Peng, T. Zhang, S. Cao, and Z. Peng (2018) Infrared small target detection via non-convex rank approximation minimization joint l2, 1 norm. Remote Sensing 10 (11), pp. 1821. Cited by: §II-E.
  • [95] X. Zhang, Q. Ding, H. Luo, H. Bin, C. Zheng, and Z. Junchao (2017) Infrared dim target detection algorithm based on improved LCM. Infrared and Laser Engineering 46 (7), pp. 726002–0726002. Cited by: §IV-D, TABLE V.
  • [96] X. Zhang, C. Li, Q. Meng, S. Liu, Y. Zhang, and J. Wang (2018) Infrared image super resolution by combining compressive sensing and deep learning. Sensors 18 (8), pp. 2587. Cited by: §II-C.
  • [97] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision, pp. 286–301. Cited by: §II-A, §II-D, §III-B, §IV-C, TABLE III, TABLE IV.
  • [98] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018) Residual dense network for image super-resolution. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2472–2481. Cited by: §III-B.
  • [99] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018) Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2472–2481. Cited by: §II-A, §III-B.
  • [100] J. Zhao, G. Wang, J. Li, L. Jin, N. Fan, M. Wang, X. Wang, T. Yong, Y. Deng, Y. Guo, and S. Ge The 2nd Anti-UAV Workshop & Challenge. Note: https://anti-uav.github.io/2021 Cited by: §IV-A1, TABLE III.
  • [101] Y. Zhao, Q. Chen, X. Sui, and G. Gu (2015) A novel infrared image super-resolution method based on sparse representation. Infrared Physics & Technology 71, pp. 506–513. Cited by: §II-C.
  • [102] X. Zhu, Z. Li, X. Zhang, C. Li, Y. Liu, and Z. Xue (2019) Residual invertible spatio-temporal network for video super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5981–5988. Cited by: §II-B.
  • [103] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable convnets v2: more deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9308–9316. Cited by: §II-B.