Super-Resolution (SR) methods are used to increase the spatial resolution and improve the level of detail of digital images, while preserving the image content. Such methods have important applications for multiple industries, such as health-care, agriculture, defense and film [Nasrollahi:2014:SCS:2647753.2647819]. In recent years, more advanced methods of SR have been heavily based on Deep Learning [DBLP:journals/corr/DongLHT15, DBLP:journals/corr/LedigTHCATTWS16, DBLP:journals/corr/abs-1904-07523], where one learns the mapping between Low-Resolution (LR) images and their High-Resolution (HR) counterparts, and applies the same mapping to an unseen low-resolution input, effectively performing super-resolution on that image.
The need for super-resolution becomes even more prominent when dealing with sensors other than the visible light, since those sensors typically produce images with lower resolution [kiran2017single, Mandanici2019]. For example, Infra-Red (IR) camera sensors are more expensive than classical camera sensors, and their output images commonly have much lower spatial resolution. While the aforementioned SR methods can still work on such images, there is still a big gap between the level of detail in the achieved results, and the one found in common RGB images. To bridge that gap, Joint Cross-Modality methods were developed. The idea is to use the higher-resolution RGB modality to guide the process of super-resolution on images taken by the lower resolution sensor, taking advantage of the finer details found in the RGB images. The challenge is to remain loyal to the target modality characteristics and to avoid adding redundant artifacts or textures from the RGB modality [almasri2018multimodal].
State-of-the-art Joint Cross-Modality SR methods rely on the assumption that their multiple inputs are well aligned [almasri2018multimodal, almasri2018rgb, DBLP:journals/corr/abs-1708-09105, chen2016color, ni2017color]. Thus, they perform well only when the input images were captured by different sensors placed in exactly the same position, and taken at exactly the same time. In real-life scenarios, perfect alignment of multiple sensors is often hard to achieve. Aligning the images in a pre-process typically yields only a weak alignment, reducing the effectiveness of joint cross-modality methods.
In our work, we introduce a new method to perform joint cross-modality SR, where different modality images are allowed to be moderately misaligned, namely Weakly Aligned. We tackle the problem of misalignment using a learnable deformation that implicitly aligns the two images together. More specifically, our architecture includes a deformation model which aligns the RGB image to the target modality in a coarse-to-fine manner, before they are fused together. The network does not use any explicit supervision for the deformation subtask, but rather optimizes the deformation parameters to adhere to the super-resolution goal.
Furthermore, since most multi-modal pairs are not perfectly aligned, we are able to improve results even on supposedly well-aligned datasets, compared to previous methods (see Section 4). The SR module in our approach is based on ZSSR [DBLP:journals/corr/abs-1712-06087], and allows the network to perform SR using only the input pair without any training dataset. The network learns the internal statistics of the images by training on patches extracted from the input pair, and uses them to perform SR on the entire target modality image.
In addition, since over-transfer of information is a frequently arising problem in multi-modal fusion [almasri2018multimodal, almasri2018rgb], our method is designed to transfer details from the RGB image carefully and conservatively; it avoids producing artifacts or learning redundant details such as textures. It only learns the details that help improve its super-resolution task. We show that our network achieves state-of-the-art results, while being generic in supporting any modality as input, requiring no training data and adjusting to any image size.
2 Related Works
Super-Resolution has been extensively studied throughout the last two decades. See [Nasrollahi:2014:SCS:2647753.2647819] for a survey covering various SR techniques. Recent surveys [DBLP:journals/corr/abs-1808-03344, DBLP:journals/corr/abs-1904-07523] cover more advanced methods, including Deep-Learning based methods. The first notable deep network-based SR method is SRCNN [DBLP:journals/corr/DongLHT15], a simple fully convolutional method that showed superior results to traditional methods. Like most methods, SRCNN uses external image datasets, such as T91, Set5 and Set14 [MSLapSRN, DBLP:journals/corr/LedigTHCATTWS16], for training and evaluation.
However, it was claimed [irani2009super, zontak2011internal, DBLP:journals/corr/abs-1712-06087] that methods which rely on large external datasets do not learn the internal image-specific properties of the given input. In [irani2009super, zontak2011internal], the subject of internal patch recurrence is investigated, leading to quantifiable results suggesting that patches of different scales tend to recur in the same image more than in external image datasets. This observation gave rise to powerful Zero-Shot methods [Huang_2015_CVPR, DBLP:journals/corr/abs-1712-06087, cui2014deep], most notably ZSSR [DBLP:journals/corr/abs-1712-06087], which applies random cropping to its input image, effectively creating an internal image-specific dataset of patches taken solely from a single input. The method we present builds upon these ideas to deal with cross-modality, enjoying both the strong property of internal patch recurrence, together with the ability to transfer fine-grained details from our guiding modality input image to obtain super resolution images of even higher quality.
A straightforward generalization of SR performed on the visual modality (in this paper, we use the terms RGB modality and visual modality interchangeably) is applying SR methods on varying modalities which are commonly acquired using low resolution sensors. Traditional SR methods for Thermal images (e.g., [liu2013infrared, mao2016infrared]) have approached the problem by using signal reconstruction methodologies, whereas SR methods for depth-maps (e.g., [kim2012high, xie2014single]) have been based on Markov random fields and coupled dictionary learning. Unlike the above methods, our method is generic in the sense that it can be applied to any given modality. In this paper, we evaluate our method on three modalities: Thermal (Infrared), NIR (Near-infrared), and depth-maps.
2.1 Joint Cross-Modality
In the Joint Cross-Modality setting, the two different modalities are jointly analyzed to enhance one of them. As mentioned earlier, camera sensors capturing the RGB modality produce images with richer HR details than other modalities. Thus, a common setting is the usage of a visual HR version of the image, alongside an LR version taken by the other modality sensor. This setting was adopted by all relevant joint cross-modality methods.
In [ni2017color], a learning-based visual-depth method is presented. It is based on a CNN architecture operating on a LR depth-map and a sharp edge-map extracted from the HR visual modality. The network is trained on visual-depth aligned pairs from the Middlebury dataset [Scharstein:2002:TED:598429.598475]. In [DBLP:journals/corr/abs-1708-09105], a GAN-based method (CDcGAN) is presented. The method adds auxiliary losses that encourage keeping the resulting depth-maps smooth and texture-free, and is also trained on the Middlebury dataset.
In [chen2016color], a non-learning-based joint visual-thermal method is presented. It is based on guided filtering of an up-sampled LR thermal input in areas that correlate well with the HR visual input. It is tested on visual-thermal pairs whose capturing sensors were manually calibrated to be aligned. Almasri et al. [almasri2018multimodal] introduced the learning-based visual-thermal SR methods VTSRCNN and VTSRGAN, built on top of the existing SRCNN and SRGAN. They perform joint visual-thermal SR by concatenating feature maps extracted from each input modality, and are trained and evaluated on the ULB17-VT [Almasri2019] visual-thermal dataset consisting of well aligned pairs.
As noted by Almasri et al. [almasri2018multimodal], in the context of cross-modal super-resolution, misalignment is a major limitation in producing artifact-free SR results. In their paper, it is claimed that the artifacts added to the SR result appear where there are cross-modal displacements, and that a better synchronized capturing device would likely solve that problem. Our method’s approach to handling cross-modal misalignment is to deform the RGB modality and align details that improve the SR objective to the target modality.
Our method differs from the aforementioned joint cross-modality techniques in two central aspects. First, it requires only weak alignment, as opposed to the aforementioned techniques which rely on well aligned pairs. Second, our network does not require any training data, and therefore avoids the need for a modal-specific dataset, relying on the internal image-specific statistics instead. This allows us to work on unseen modalities using a single architecture, and it is more suitable for cases where external modality image datasets are hard to obtain, making supervision practically impossible. Moreover, when facing unique modalities with high internal variance (i.e., the images look different from one another), it is more feasible to rely on the internal image statistics, and not on a highly varied dataset, if one exists.
2.2 Image Registration
The subject of multi-modal image registration has been studied mainly in the context of medical imaging. Deep methods [DBLP:journals/corr/SimonovskyGMNK16, DBLP:journals/corr/VosBVSI17, DBLP:journals/corr/abs-1809-06130] have mostly based their architectures on a regressor, a spatial transformer and a re-sampler. They use supervision to optimize their regression and deformation models. It is also possible to use similarity metrics (like cross-correlation) [DBLP:journals/corr/abs-1809-06130] instead of training a regressor with supervision, and obtain an unsupervised registration framework.
In our work, multi-modal image registration is integrated into the main SR task. We use the same SR reconstruction loss to optimize our deformation parameters. Thus, we do not require aligned pairs for training. The deformation framework used in our method is divided into three steps in a coarse-to-fine manner [DBLP:journals/corr/abs-1809-06130, DBLP:journals/corr/abs-1907-04641]. We first transform our image using global affine transformation for an initial rough approximation. Then, we further align our two modalities using CPAB [freifeld2017, skafte2018deep] transformation, which acts in a piecewise yet continuous manner. Finally, we use thin-plate spline (TPS) transformation for the final refinement of our alignment task.
3 Cross-Modality Super Resolution
The main motivation for our method is the ability to cope with pairs of images from different modalities which are only weakly aligned. To that end, our architecture includes a stage of local deformation which aligns objects in both images before they enter the SR network, as can be seen in Figures 4 and 5.
This concept can be used together with different super-resolution schemes. We chose to base our method on the ZSSR network of Shocher et al. [DBLP:journals/corr/abs-1712-06087] to enable our method to work on a single image, without pre-training. This has two key advantages: (i) it avoids the need to train on external image datasets, which are often scarce for various modalities, and (ii) it fully utilises the internal image statistics property, particularly relevant to non-standard capturing sensors.
Figure 4 describes the general architecture and the training process of our method, Cross-Modality Super-Resolution framework, CMSR. Our method includes three stages: a local deformation model to align the images of the different modalities, a patch selection phase which generates a training set out of a single pair of images, and a super-resolution network (CMSR). These components are introduced and described in Section 3.1. The way we incorporate those components into our training and inference schemes is covered by Sections 3.2 and 3.3.
3.1 Network Architecture
Alignment using Learnable Deformation
Our network corrects displacements between the two modalities on-the-fly, through a local deformation process applied to the RGB modality as a first gate to the network, optimized implicitly during training. In other words, instead of using explicit supervision to optimize the deformation parameters, they are trained with the super-resolution loss and therefore deform only parts which are relevant to this task. Our deformation process consists of three different transformation layers, performing the learned alignment in a coarse-to-fine manner.
The first layer of our deformation framework is the original Affine STN layer by Jaderberg et al. [DBLP:journals/corr/JaderbergSZK15]. It captures a global affine transformation that is used to position the two modalities together as a rough initial approximation.
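To make the role of the affine layer concrete, the following is a minimal sketch of how a 2x3 affine matrix can warp an image through a sampling grid and bilinear interpolation, in the style of a spatial transformer. It is an illustration only: the function names, the normalized [-1, 1] coordinate convention and the zero padding outside the image are assumptions, not details taken from our implementation, and in the actual network the matrix entries would be trainable parameters rather than fixed values.

```python
import math


def affine_grid(theta, height, width):
    """Return, for each output pixel, its source location in normalized [-1, 1] coordinates."""
    grid = []
    for i in range(height):
        y = (-1.0 + 2.0 * i / (height - 1)) if height > 1 else 0.0
        row = []
        for j in range(width):
            x = (-1.0 + 2.0 * j / (width - 1)) if width > 1 else 0.0
            src_x = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            src_y = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((src_x, src_y))
        grid.append(row)
    return grid


def bilinear_sample(image, grid):
    """Sample a single-channel image (list of rows) at the grid locations; zero outside."""
    h, w = len(image), len(image[0])

    def pixel(r, c):
        return image[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    output = []
    for grid_row in grid:
        out_row = []
        for src_x, src_y in grid_row:
            px = (src_x + 1.0) * (w - 1) / 2.0  # normalized -> pixel column
            py = (src_y + 1.0) * (h - 1) / 2.0  # normalized -> pixel row
            c0, r0 = math.floor(px), math.floor(py)
            dx, dy = px - c0, py - r0
            out_row.append(
                pixel(r0, c0) * (1 - dx) * (1 - dy)
                + pixel(r0, c0 + 1) * dx * (1 - dy)
                + pixel(r0 + 1, c0) * (1 - dx) * dy
                + pixel(r0 + 1, c0 + 1) * dx * dy
            )
        output.append(out_row)
    return output


def affine_warp(image, theta):
    """Warp an image with a 2x3 affine matrix (the quantity training would optimize)."""
    return bilinear_sample(image, affine_grid(theta, len(image), len(image[0])))
```

With the identity matrix `[[1, 0, 0], [0, 1, 0]]` the image is returned unchanged; a non-zero third column translates it. Because every step is a weighted sum of pixel values, gradients of a downstream loss can flow back to the matrix entries, which is what allows the alignment to be driven by the SR objective alone.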
The second layer is a DDTN transformation layer (Deep Diffeomorphic Transformation Network, [skafte2018deep]), a variant of the original STN layer supporting more flexible and expressive transformations. Our chosen transformation model is CPAB (Continuous Piecewise-Affine Based, [freifeld2017, skafte2018deep]). It is based on the integration of Continuous Piecewise-Affine (CPA) velocity fields, and yields a transformation that is both differentiable and has a differentiable inverse. It is Continuous Piecewise-Affine with respect to a tessellation of the image into cells. For this reason, it is well suited to our alignment task; each cell can be deformed differently, yet continuity is preserved between neighboring cells, yielding a deformation that can express local (per-cell) misalignments while preserving the image semantics.
The third and last layer of our deformation framework performs a TPS (Thin-plate spline) transformation, a technique that is widely used in computer vision and particularly in image registration tasks [DBLP:journals/pami/Bookstein89]. Our implementation is also taken from [skafte2018deep]. Since TPS displaces its keypoints freely, the displacement is not constrained by any image transformation model, and has the power to align the fine-grained objects of the scene, providing the final refinement of our alignment task.
Similarly to ZSSR [DBLP:journals/corr/abs-1712-06087] we produce our training set from a single pair of images by sampling patches using random augmentations. In our implementation we use scale, rotation, shear and translations. This random patch selection yields two patches that correspond to roughly the same area in the scene: one taken from the target modality and the second is taken from the deformed RGB modality which was previously aligned to the target modality.
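As an illustration of the random augmentations listed above, the sketch below draws a single 2x3 affine matrix that combines scale, rotation, shear and translation. The parameter ranges, the composition order and the function names are hypothetical choices made for this example, not values from our implementation. The point it illustrates is that one matrix is drawn per iteration and applied to both modalities, so the two patches cover roughly the same scene area.

```python
import math
import random


def random_patch_transform(rng, max_scale=0.2, max_rotation=math.pi / 12,
                           max_shear=0.1, max_shift=0.25):
    """Draw a random 2x3 affine matrix combining scale, rotation, shear and translation."""
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    angle = rng.uniform(-max_rotation, max_rotation)
    shear = rng.uniform(-max_shear, max_shear)
    shift_x = rng.uniform(-max_shift, max_shift)
    shift_y = rng.uniform(-max_shift, max_shift)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Linear part = Rotation(angle) @ Shear(shear) @ (scale * Identity)
    return [
        [scale * cos_a, scale * (cos_a * shear - sin_a), shift_x],
        [scale * sin_a, scale * (sin_a * shear + cos_a), shift_y],
    ]


def apply_to_point(theta, x, y):
    """Map a normalized image coordinate through the 2x3 affine matrix."""
    return (theta[0][0] * x + theta[0][1] * y + theta[0][2],
            theta[1][0] * x + theta[1][1] * y + theta[1][2])


rng = random.Random(0)
theta = random_patch_transform(rng)
# The same matrix defines the patch in the target modality and in the deformed RGB image.
patch_corner = apply_to_point(theta, -1.0, -1.0)
```

Since the rotation and shear factors both have unit determinant, the area of the sampled patch relative to the full image is controlled by the scale factor alone.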
The CMSR network is the main component of our architecture as it is the component responsible for performing super-resolution. Namely, it produces a HR version of its target modality LR input image, guided by its HR RGB input. As Figure 4 and Figure 5 suggest, this component can be applied to varying image sizes, thanks to its fully convolutional nature.
The fully convolutional architecture of CMSR is based on the one from Shocher et al. [DBLP:journals/corr/abs-1712-06087]. However, a few changes have been made to better apply it to cross-modality SR (see Figure 6).
The first gate to the network is up-sampling of the LR modality input to the size of the RGB input. This is done naively, using the Bi-cubic method, in case no specific kernels are given. (Optimal blur kernels can be directly estimated as shown in [irani2009super], and are fully supported by our method as an additional input to the network.) From the up-sampled modality input we generate a feature map using a number of convolutional layers, denoted as Feature-Extractor 1 in Figure 6. From the RGB modality input that was previously aligned to the target modality input, we generate a feature map using Feature-Extractor 2. We sum the two resulting feature maps, one from each Feature-Extractor block, together with an up-sampled version of the LR target modality image, in a residual manner. This yields our HR super-resolved output.
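The residual summation described above can be written compactly as follows. The symbols are introduced here for illustration only and are not notation from the paper: X denotes the LR target modality input, U the (bicubic or kernel-based) up-sampling to the RGB size, G-tilde the RGB input after alignment, and F_1, F_2 the two Feature-Extractor blocks.

```latex
\hat{Y} \;=\; U(X) \;+\; F_{1}\bigl(U(X)\bigr) \;+\; F_{2}\bigl(\tilde{G}\bigr)
```

Under this reading, the two extractors only need to predict the detail that naive up-sampling misses; if the RGB branch F_2 contributes nothing useful, its output can shrink toward zero and the network falls back to single-modality behaviour.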
During each training iteration, we perform local deformation on the RGB modality input and produce a displaced version of it, aligned to the target modality image, as described in Section 3.1. Then, a random patch is selected from the input pair (illustrated in Figure 4), yielding two corresponding patches; one taken from the target modality, and the second from the displaced (aligned) RGB modality, as described in Section 3.1. The patch selection phase is an integral part of the network, and is done in a differentiable manner, so as to allow the gradients to backpropagate through it to the deformation model. This enables us to optimize the transformation on the entire RGB image despite using patches of the image during training.
In order to generate supervision for the training process, we down-sample the two patches and use the original target modality patch as ground-truth. We use reconstruction loss between the reconstructed patch and original input target modality patch. Note that there is no ground truth for a perfectly aligned RGB modality. Instead, the deformation parameters are optimized using the same reconstruction loss as an integral part of the SR task.
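The self-supervision step can be sketched as follows. This is a simplified illustration under stated assumptions: average pooling stands in for the actual down-sampling kernel, the reconstruction loss is taken to be a mean absolute error, and all function names are hypothetical.

```python
def downsample(patch, ratio):
    """Average-pool a single-channel patch by an integer ratio (stand-in for the true kernel)."""
    rows, cols = len(patch) // ratio, len(patch[0]) // ratio
    return [
        [
            sum(patch[i * ratio + di][j * ratio + dj]
                for di in range(ratio) for dj in range(ratio)) / ratio ** 2
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def l1_loss(prediction, target):
    """Mean absolute difference between two equally sized patches."""
    total = sum(abs(p - t)
                for p_row, t_row in zip(prediction, target)
                for p, t in zip(p_row, t_row))
    return total / (len(target) * len(target[0]))


def make_training_example(target_patch, rgb_patch, ratio):
    """Build (LR target input, RGB guide input, ground truth) from one patch pair."""
    lr_target = downsample(target_patch, ratio)  # what the network has to super-resolve
    lr_rgb = downsample(rgb_patch, ratio)        # keeps the RGB/target size ratio unchanged
    return lr_target, lr_rgb, target_patch       # the original patch is the ground truth
```

For a 4x4 target patch and its 8x8 RGB counterpart with a ratio of 2, this yields a 2x2 network input, a 4x4 guide and a 4x4 ground truth. The loss between the network output and the ground truth is the only training signal, and the same signal reaches the deformation parameters through the RGB branch.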
As mentioned above, after the Patch Selection (3.1) step of our training scheme, we down-sample both patches (Figure 4, in Green) by our desired SR ratio (e.g., , ), denoted as . The modality patch is down-sampled to allow training the network to reconstruct it with self-supervision, whereas the RGB patch is down-sampled accordingly, to keep the ratio between the two patches equal to .
Instead of down-sampling the RGB patch, it is also possible to naively up-sample the modality patch, and still preserve the same ratio, , between patches. We found that by alternating between up-sampling and down-sampling of the aforementioned patches, we are able to significantly improve the results. More details regarding this technique can be found in the supplementary material.
At inference time, we use the trained CMSR network and deformation parameters, to perform SR on the entire target modality image guided by the RGB modality image (see Figure 5).
Since CMSR is fully convolutional, it can operate on any image size (e.g., both image patches of different scales, and full images) using the same network. We first apply the alignment dictated by the optimized deformation parameters, and then feed the LR target modality image and the aligned HR RGB image to the SR network which outputs a HR version of the target modality image.
After the HR target modality image is obtained, we perform two additional refinement operators aimed to further improve our SR results. The first operator, Geometric Self-Ensemble, is an averaging technique shown to improve SR results [DBLP:journals/corr/LimSKNL17, DBLP:journals/corr/TimofteRG15, DBLP:journals/corr/abs-1712-06087]. The second operator, Iterative Back-Projection, is an error-correcting technique that was used successfully in the context of SR [Glasner2009, Irani:1991:IRI:108693.108696, DBLP:journals/corr/abs-1712-06087].
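Both refinement operators can be sketched in a few lines. The code below is a toy illustration, not our implementation: `model` is any callable taking the LR target image and the RGB guide, images are single-channel lists of rows, and the back-projection uses average pooling and nearest-neighbour up-sampling as stand-ins for the true down-sampling and up-sampling kernels.

```python
def rot90(image):
    """Rotate a list-of-rows image by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]


def flip_lr(image):
    """Mirror an image horizontally."""
    return [row[::-1] for row in image]


def rotate(image, times):
    for _ in range(times % 4):
        image = rot90(image)
    return image


def geometric_self_ensemble(model, lr_target, rgb):
    """Average the model output over the 8 flip/rotation variants of its inputs."""
    outputs = []
    for flipped in (False, True):
        for k in range(4):
            t_in = rotate(flip_lr(lr_target) if flipped else lr_target, k)
            g_in = rotate(flip_lr(rgb) if flipped else rgb, k)
            out = rotate(model(t_in, g_in), 4 - k)            # undo the rotation
            outputs.append(flip_lr(out) if flipped else out)  # then undo the flip
    rows, cols = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[i][j] for o in outputs) / len(outputs) for j in range(cols)]
            for i in range(rows)]


def downsample(image, ratio):
    """Average-pool by an integer ratio (stand-in for the true kernel)."""
    return [[sum(image[i * ratio + di][j * ratio + dj]
                 for di in range(ratio) for dj in range(ratio)) / ratio ** 2
             for j in range(len(image[0]) // ratio)]
            for i in range(len(image) // ratio)]


def upsample_nearest(image, ratio):
    return [[image[i // ratio][j // ratio] for j in range(len(image[0]) * ratio)]
            for i in range(len(image) * ratio)]


def back_project(hr, lr, ratio, iterations=1):
    """Correct the HR estimate so that its down-sampled version agrees with the LR input."""
    for _ in range(iterations):
        simulated = downsample(hr, ratio)
        error = [[l - s for l, s in zip(l_row, s_row)]
                 for l_row, s_row in zip(lr, simulated)]
        correction = upsample_nearest(error, ratio)
        hr = [[h + c for h, c in zip(h_row, c_row)]
              for h_row, c_row in zip(hr, correction)]
    return hr
```

The ensemble cancels orientation-dependent errors of the network, while back-projection enforces consistency between the HR output and the observed LR input.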
4 Results and Evaluation
4.1 Implementation Details
Our model is implemented in TensorFlow 1.11.0 and trained on a single GeForce GTX 1080 Ti GPU. The full code and datasets will be published upon acceptance on the project’s GitHub page. We typically start with a learning rate of 0.0001 and gradually decrease it to, depending on the slope of our reconstruction error line, whereas the learning rates of our transformation layers follow the same pattern, multiplied by constant factors. Those factors are treated as hyper-parameters, and should typically be larger when dealing with highly displaced input pairs, as in the case of Weakly Aligned modalities (Figure 3). Performing SR on an input of size typically takes 30 to 60 seconds, depending on the desired number of iterations. To achieve SR of higher scales, we perform gradual SR with intermediate scales, as this further improves the results [DBLP:journals/corr/LaiHA017, DBLP:journals/corr/abs-1804-02900, DBLP:journals/corr/abs-1712-06087].
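One plausible way to tie the learning rate to the slope of the reconstruction error is sketched below. The exact rule is not given in the text, so everything here is an assumption made for illustration: the window length, the comparison between the fitted slope and the residual spread, the decay factor, the floor `min_lr`, and the per-layer factors in the usage note.

```python
import math


def linear_fit(errors):
    """Least-squares line through (0, e0), (1, e1), ...; returns (slope, residual std)."""
    n = len(errors)
    mean_x = (n - 1) / 2.0
    mean_y = sum(errors) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (e - mean_y) for i, e in enumerate(errors))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [e - (slope * i + intercept) for i, e in enumerate(errors)]
    return slope, math.sqrt(sum(r * r for r in residuals) / n)


def update_learning_rate(lr, errors, window=10, slope_ratio=1.5, decay=0.1, min_lr=1e-6):
    """Decay the rate once the recent error curve no longer descends clearly."""
    if window < 2 or len(errors) < window:
        return lr
    slope, std = linear_fit(errors[-window:])
    # Hypothetical criterion: the downward trend is smaller than the noise around it.
    if -slope * slope_ratio < std:
        lr = max(lr * decay, min_lr)
    return lr


def layer_learning_rates(base_lr, factors):
    """Transformation layers follow the base rate, multiplied by constant factors."""
    return {name: base_lr * factor for name, factor in factors.items()}
```

For example, `layer_learning_rates(1e-4, {"affine": 10.0, "cpab": 5.0, "tps": 5.0})` would give the deformation layers larger steps than the SR network; the factor values here are placeholders, and larger ones would be the natural choice for strongly displaced pairs.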
For Feature-Extractor 1 we use eight hidden layers, each containing 64 channels and a filter size of . For Feature-Extractor 2 we typically use four to eight hidden layers with the number of channels ranging from 4 to 128, a filter size of and a ReLU activation function. The last layer has no activation and a filter size of . We find that highly detailed RGB inputs require Feature-Extractor 2 to have more channels. The hyper-parameters rarely require adjustments; they only require manual tuning when dealing with inputs that are unique, unusual, or ones that reflect very unusual displacements.
4.2 Evaluation with State-of-the-arts
Strongly Aligned Modalities
We compared our method to cross-modal state-of-the-art SR methods on strongly aligned pairs. We used the ULB17-VT dataset [Almasri2019], consisting of visual-thermal pairs that are mostly well aligned, as shown in Figure 2 (bottom row). This proves to be an easier case for joint cross-modal super-resolution, and typically requires only local understanding of the input pair. We have included the results of our evaluation in Table 1, showing that our method, despite not being previously trained, beats competing methods, averaged across the ULB17-VT dataset which was used by the said methods for evaluation in their original papers. Figures 8 and 11 include some visual results.
Weakly Aligned Modalities
The Middlebury dataset [Scharstein:2002:TED:598429.598475] contains strongly aligned depth-visual pairs as shown in Figure 2 (top row). In that dataset, multiple angles and different sensor placements are included, for each pair. To obtain weakly-aligned pairs, we shuffled the pairs together such that the resulting pairs would correspond to a small sensor misplacement, shown in Figure 3 (left pair). We further increased the size of the dataset through random augmentations. We denote the new resulting dataset as Shuffled-Middlebury. CMSR surpasses competing cross-modal methods on those weakly aligned pairs by using a coarse-to-fine alignment approach, as summarized in Table 1.
Single modality baseline model
We evaluated CMSR against the baseline state-of-the-art single modality method, ZSSR [DBLP:journals/corr/abs-1712-06087]. Our experiment shows that our method leverages the fine details in its RGB input and produces an SR output that is both appealing to the eye, and numerically closer to a Ground-Truth version, as shown in Figures 7, 8, 9 and 12.
4.3 RGB Artifacts
A fusion of multiple image sources often causes the transfer of unnecessary artifacts from one modality to the other (e.g., [almasri2018multimodal]). Those artifacts not only degrade the quality of the image, but also harm the modality characteristics and could potentially make it unusable. Our method learns only the relevant RGB information that improves SR results; Figures 11 and 12 show cases where the RGB modality input contains a great amount of textural information, yet our SR output remains texture-free. In Figure 8, the learned RGB residual is given; it contains no irrelevant textures and it resembles an edge-map, used to sharpen our output image.
4.4 Local Deformation
As shown in Figure 10, our method aligns the RGB modality input to the target modality input on-the-fly, to aid the joint cross-modal SR task. Although we have no aligned RGB ground-truth image, nor any target modality ground-truth image, we still correct those cross-modal misalignments successfully, thanks to an expressive deformation framework integrated into our architecture. The deformation parameters are optimized using the SR reconstruction loss; hence we learn only the deformations that are needed to minimize that loss and assist in the SR task.
We have introduced CMSR, a method for cross-modality super-resolution. Our method utilises an associated high-resolution RGB image of the scene to boost its accuracy. The method presented is generic and yet outperforms state-of-the-art methods, even when its two modalities are misaligned, as elaborated below.
Generic. To the best of our knowledge, CMSR is the first self-supervised cross-modal SR method. It requires no training data, a prominent advantage when dealing with scarce and unique modalities. It is trained on the target image only, and can thus take any modality as input and learn its internal, possibly unique, statistics, adapting to the unknown imaging conditions and down-scaling kernels.
Furthermore, the method can be applied to any image size, and to any ratio between the two inputs. This is unlike other architectures that use strides for up-sampling [almasri2018multimodal] and are therefore fixed to a specific image size and a constant scale factor.
Our method is conservative, in the sense that it learns from its RGB features only when it contributes to the up-sampling process, without introducing outliers, ghosts, halos, or other artifacts.
We achieve state-of-the-art results, qualitatively (visually) and quantitatively, compared to competing cross-modal methods, as well as to our state-of-the-art single-modality baseline. Specifically, we show that the RGB modality indeed greatly contributes as a guide to the up-sampling process.
Misalignment. A unique property of our method is that it is robust to cross-modal misalignment. This property is imperative, since in real-life conditions, slight misalignment is, more often than not, unavoidable. It should be emphasized that the alignment is done without pre-training or any supervision.
In the future we would like to further enhance our technique by applying the deformation in the feature space instead of the RGB pixel-space. The hope is that in this way, it would be possible to adopt a deformation-per-feature scheme that would reflect different displacements for different scene objects, possibly using segmentation.
Weakly Aligned Joint Cross-Modality Super Resolution -
6 Additional Results
In Figure 13, additional results from our evaluation on the EPFL NIR dataset [ivrl] are included. This dataset was originally used in Figure 9 of the original paper. The results indicate that our method avoids transferring unnecessary RGB textures to its output; it only learns from its RGB input when it contributes to the results. This conservative approach enables CMSR to surpass state-of-the-art cross-modal methods, despite the fact that those competing methods were pre-trained extensively on the full dataset, whereas our method operates on its single input pair, without pre-training, in a Zero-Shot [DBLP:journals/corr/abs-1712-06087] manner.
7 Alternating Scales
In Section 3.2 of the submitted paper, the Alternating Scales technique is briefly discussed. It corresponds to training CMSR using two different scales, alternating between them across iterations. Here, we wish to further elaborate on this technique.
7.1 Alternating Scales - Elaboration
Denoting our desired SR ratio (e.g. , ) by , our network, CMSR, takes a target modality input of size alongside with an RGB input of size , and produces a target modality output of size . Hence, by design, a ratio of must be preserved between CMSR’s two inputs (The architecture of CMSR is given in the original paper, Figure 6). Since CMSR is trained to reconstruct a random patch taken from its modality input (Figure 4 of the original paper, Training process), this random patch is down-sampled, by ratio , before it is reconstructed by the CMSR network. However, since the ratio between CMSR’s two inputs must remain , the corresponding RGB patch is also down-sampled accordingly, by ratio . This way, we preserve the same ratio between CMSR’s two input patches, as needed.
Nonetheless, instead of down-sampling the RGB patch to match this required ratio, it is also possible to naively up-sample the modality patch by ratio . Clearly, this has the same effect on the ratio between the two patches, which yet again remains . However, this way, we obtain a different training scheme. Figure 14 compares the two different schemes, corresponding to the two different scales CMSR operates on.
We found that by alternating between the two schemes during training, we are able to significantly improve our results. We name this combination of training schemes as the Alternating Scales technique. It allows our network to be optimized using patches of their original scale, as explained in Table 2. We observe that training our network on patches of their original scale improves its generalization capabilities, since during the inference stage, the network operates on the full input pair, at its original scale.
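The bookkeeping behind the two schemes can be sketched as follows. This reflects our reading of the description above rather than the exact implementation: the scheme names, the function names and the mixing probability `p_upsample` are introduced here for illustration, and patches are assumed to be square with a side divisible by the SR ratio.

```python
import random


def choose_scheme(p_upsample, rng):
    """Pick the training scheme for one iteration."""
    return "upsampling" if rng.random() < p_upsample else "downsampling"


def patch_sizes(scheme, target_size, ratio):
    """Sides of (modality input, RGB input, reconstruction target) for a square patch.

    `target_size` is the side of the modality patch at its original scale; the
    corresponding RGB patch has side `target_size * ratio`.
    """
    if scheme == "downsampling":
        # Both patches shrink by `ratio`; the original modality patch is the target.
        return target_size // ratio, target_size, target_size
    if scheme == "upsampling":
        # Both inputs keep their original scale; the target is the naively
        # up-sampled modality patch.
        return target_size, target_size * ratio, target_size * ratio
    raise ValueError("unknown scheme: " + scheme)


rng = random.Random(0)
scheme = choose_scheme(0.5, rng)
modality_side, rgb_side, target_side = patch_sizes(scheme, 32, 2)
```

In both schemes the RGB input is `ratio` times larger than the modality input, so the same network can be used unchanged; only the up-sampling scheme exposes it to patches at the scale it will see at inference time.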
7.2 Alternating Scales - Ablation Study
for information on the schemes), alternating between them randomly. We used the Upsampling-Based scheme with probabilityand the Downsampling-Based with probability .
According to the results, summarized in Figures 15 and 16, the best PSNR was obtained when , which starts decaying when . We notice that always yields better results than . This observation is important, since the risk of using sub-optimal values on new, unseen input pairs is minimal; using this technique is always better than not using it, regardless of .
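For reference, the PSNR figures discussed here follow the standard definition, which can be computed as in the short sketch below. The function name and the default peak value of 255 (8-bit images) are assumptions of this example; the appropriate peak depends on how the modality is stored.

```python
import math


def psnr(prediction, ground_truth, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized single-channel images."""
    count = len(ground_truth) * len(ground_truth[0])
    mse = sum((p - g) ** 2
              for p_row, g_row in zip(prediction, ground_truth)
              for p, g in zip(p_row, g_row)) / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a reconstruction closer to the ground truth; since the scale is logarithmic, a gain of 1 dB corresponds to roughly a 20% reduction in mean squared error.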
|Training Scheme||Modality Scale||RGB Scale|
8 Alignment using Learnable Deformation - Ablation Study
To show the necessity of each layer of our coarse-to-fine deformation framework (Section 3.1 of the original paper), we evaluated CMSR on a Weakly Aligned pair, adding one layer at a time, averaged across multiple runs. The results indicate that each layer is necessary and plays a different role, which can be seen visually in Figure 18, and numerically in Figure 17.
Two additional points should be mentioned. First, we remind the reader that our goal is not to perform image registration. Hence, we measure the quality of alignment through the quality of the yielded SR result, and not by conventional image registration metrics. Second, when CMSR is evaluated with no transformation layers on a severely misaligned pair (like the one in Figure 18), its RGB input remains mostly unused, enabling CMSR to produce a result that is comparatively worse (as shown in Figure 17), but does not reflect the failed fusion of misaligned RGB objects. This conservative approach allows our method to surpass competing cross-modal SR methods. CMSR leverages its RGB modality input only when it contributes to the final SR result; when CMSR has no transformation layers, a severely misaligned RGB input will mostly be ignored.