Joint Implicit Image Function for Guided Depth Super-Resolution

07/19/2021 ∙ by Jiaxiang Tang, et al. ∙ Peking University

Guided depth super-resolution is a practical task where a low-resolution and noisy input depth map is restored to a high-resolution version with the help of a high-resolution RGB guide image. Existing methods usually view this task either as a generalized guided filtering problem that relies on designing explicit filters and objective functions, or as a dense regression problem that directly predicts the target image via deep neural networks. These methods suffer from either limited model capability or limited interpretability. Inspired by the recent progress in implicit neural representation, we propose to formulate guided super-resolution as a neural implicit image interpolation problem, where we keep the form of a general image interpolation but use a novel Joint Implicit Image Function (JIIF) representation to learn both the interpolation weights and values. JIIF represents the target image domain with spatially distributed local latent codes extracted from the input image and the guide image, and uses a graph attention mechanism to learn the interpolation weights at the same time in one unified deep implicit function. We demonstrate the effectiveness of our JIIF representation on the guided depth super-resolution task, significantly outperforming state-of-the-art methods on three public benchmarks. Code can be found at <https://git.io/JC2sU>.


1. Introduction

Figure 1. RGB guided noisy depth map super-resolution. Our method predicts a high-resolution target depth map from a noisy and low-resolution input depth map with the guidance from a high-resolution RGB image. The low-resolution depth map is up-sampled with bicubic interpolation for better visualization.

Depth maps have been widely used as a basic element in various computer vision tasks, such as semantic segmentation (Weder et al., 2020; Chen et al., 2020b; Gupta et al., 2014; Xing et al., 2019a, b) and 3D reconstruction (Song et al., 2017; Chen et al., 2020a, c). With the geometric information in depth maps, these tasks can be facilitated and better understood. Despite the improvement of depth sensors in recent years, high-quality depth maps are still challenging to acquire: the captured depth maps are usually low-quality due to sensor limitations. In the meantime, RGB cameras have evolved rapidly and can acquire high-quality RGB images at a comparatively low cost. Hence, RGB guided depth super-resolution, where a high-resolution (HR) RGB image is used to guide the up-sampling of a low-resolution (LR) depth input image, has become an important research topic. As illustrated in Figure 1, the detailed structures in the RGB image can be used to avoid blurry edges and suppress noise when up-sampling depth maps, but this process is non-trivial due to the complicated nature of RGB images.

Various methods have been proposed to extract the rich information from the RGB image to guide depth super-resolution. The widely used bilateral filtering (Tomasi and Manduchi, 1998) and guided filtering (He et al., 2012) can be extended to solve this task if we first up-sample the LR input image with a traditional interpolation method, e.g., bilinear or bicubic interpolation, to the same size as the HR guide image. These methods construct explicit filters that reflect structural information from the guide image. However, a common drawback of these filtering-based methods is that the filter depends only on the guide image, so it may transfer incorrect contents when the guide image is inconsistent with the input image (e.g., when the RGB guide image is too dark to distinguish the structure). Another popular way to model guided super-resolution is to treat it as a dense regression problem and train a neural network to directly predict the HR target image from the LR input image and the HR guide image (Li et al., 2016, 2019; Hui et al., 2016) in a supervised way. These methods first down-sample the HR targets to generate LR inputs, then learn to undo the down-sampling process. With the strong capability of CNNs to extract features, they often outperform traditional methods. However, simple feature aggregation does not model the guidance process explicitly, which lacks interpretability and may not generalize well to other datasets.

Inspired by the recent progress of implicit neural representation in 3D object/scene representation (Park et al., 2019; Jiang et al., 2020; Rist et al., 2020) and image super-resolution (Chen et al., 2021), we revisit the guided super-resolution problem from the perspective of implicit neural representation. The idea of implicit neural representation is to use a deep implicit function (DIF) to map continuous coordinates to signals in a certain domain. To share knowledge across different input observations, an encoder is often used to extract latent codes from the input so that the DIF is conditioned on the current observation. Thus, a scene/image can be represented by a set of local latent codes distributed over the coordinates of the input domain, which can be used in downstream tasks such as semantic segmentation (Rist et al., 2020) and super-resolution (Chen et al., 2021). To make the output of the DIF continuous, a weighted average of the predictions from several neighboring coordinates is usually calculated, which can be viewed as a neural implicit interpolation process. However, these weights are usually empirical (e.g., distance-based) in previous work, since there is no other prior knowledge relating the query coordinate to the neighboring coordinates. With the extra HR guide image in the guided super-resolution task, we can extract this knowledge and learn the weights in a data-driven way. We hypothesize that the guide image can benefit the learning of both interpolation weights and values, and propose to learn the interpolation weights via a graph attention mechanism. Furthermore, we integrate the learning of weights and values into one unified DIF, which we call the Joint Implicit Image Function (JIIF) representation.

To summarize, the contributions of this paper are as follows:

  • We propose a novel joint implicit image function representation for guided image super-resolution, where the target image is represented by local latent codes from both the input image and the guide image.

  • We learn interpolation weights at the same time via a graph attention mechanism, and integrate the learning of both interpolation weights and values into one unified representation.

  • Our method outperforms existing methods by large margins on guided depth super-resolution tasks, and achieves state-of-the-art results on guided noisy depth super-resolution tasks.

2. Related Work

2.1. Guided Super-Resolution

2.1.1. Filtering based methods

Guided filtering aims to enhance a target image by applying a filter that depends on a guide image. The bilateral filter (Tomasi and Manduchi, 1998) is the seminal work, where the target image also serves as the guide image. Later work includes the joint bilateral filter (Kopf et al., 2007), the guided filter (He et al., 2012) and the weighted median filter (Ma et al., 2013). Guided filtering can be used for a variety of tasks such as image denoising, colourisation and stereo matching. When extended to target and guide images of different sizes, it can also be used for guided super-resolution. We further divide these methods into two categories. Local methods first up-sample the low-resolution target image with a traditional interpolation method, then apply a local filter that is controlled by the guide image (Kopf et al., 2007; Yang et al., 2007) or by both the target image and the guide image (Chan et al., 2008). Global methods, on the other hand, formulate the filtering as an implicit energy minimization problem and optimize the values of all pixels in the target image. This category includes Markov Random Fields (Diebel and Thrun, 2005) and their non-local means variant (Park et al., 2011). Variational inference with an anisotropic total generalized variation prior (Ferstl et al., 2013) and auto-regressive models (Yang et al., 2014) are other types of global methods (Li et al., 2012; Kiechle et al., 2013). Some recent methods also combine the idea of guided filtering with the global optimization framework, such as the fast bilateral solver (Barron and Poole, 2016) and the SD filter (Ham et al., 2017).
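To make the local filtering idea concrete, the following is a minimal sketch of joint bilateral up-sampling in the spirit of (Kopf et al., 2007): the depth map is first bicubically up-sampled to the guide resolution, then each pixel is re-estimated as a weighted average of its neighbors, with weights combining a spatial Gaussian and a range Gaussian measured on the guide image. The window radius, kernel bandwidths and function name are illustrative assumptions, not values from the cited papers.

```python
import numpy as np

def joint_bilateral_upsample(depth_lr_up, guide, radius=3, sigma_s=2.0, sigma_r=0.1):
    """Minimal joint bilateral filter applied to a bicubically up-sampled depth map.

    depth_lr_up: (H, W) depth already up-sampled to the guide resolution.
    guide:       (H, W) grayscale guide image in [0, 1].
    """
    H, W = depth_lr_up.shape
    out = np.zeros_like(depth_lr_up)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma_s ** 2))  # fixed spatial kernel

    d = np.pad(depth_lr_up, radius, mode='edge')
    g = np.pad(guide, radius, mode='edge')
    for i in range(H):
        for j in range(W):
            d_win = d[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            g_win = g[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            # range kernel measured on the guide image, not on the depth itself
            rng = np.exp(-((g_win - guide[i, j]) ** 2) / (2 * sigma_r ** 2))
            w = spatial * rng
            out[i, j] = (w * d_win).sum() / w.sum()
    return out
```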

2.1.2. Learning based methods

Different from the unsupervised filtering-based methods above, learning-based methods provide a data-driven and supervised way to solve the guided super-resolution problem by training neural networks. Self-supervised super-resolution, where the HR target image is first down-sampled to serve as the LR input, has been explored by many methods (Chang et al., 2004; Dong et al., 2014; Lai et al., 2017; Zhang et al., 2020, 2018; Ledig et al., 2017; Chen et al., 2021; Ye et al., 2020). Guided super-resolution further introduces an HR guide image to direct the up-sampling of the LR input. Early work such as the Depth Multi-Scale Guided Network (DMSG) (Hui et al., 2016), Dynamic Guidance (DG) (Gu et al., 2017) and Deep Joint Filtering (DJF) (Li et al., 2016, 2019) uses CNNs to extract features and directly regress the target image. Pixel-Adaptive Convolution (PAC) (Su et al., 2019) learns a spatially variant kernel to fuse the guide image features into the LR input. The Deformable Kernel Network (DKN) (Kim et al., 2021) draws ideas from both explicit filtering-based methods and learning-based methods: it uses a CNN to adaptively learn a set of sparsely chosen neighbors and the interpolation weights, then applies an explicit image filter to calculate the final prediction. These methods either lack interpretability because they directly regress the target, or rely on a simple image filter that cannot take full advantage of the guide image. Instead, our method starts from the general form of image interpolation and equips it with an effective implicit neural representation, leading to both better performance and interpretability.

2.2. Implicit Neural Representation

Implicit neural representation uses a deep implicit function (DIF) to map coordinates to signals in a specific domain. A DIF is a continuous and differentiable function, usually parameterized by an MLP. To make the DIF conditional on different input observations, a latent code extracted from the input is usually appended to the coordinate. Recent research has demonstrated the potential of implicit neural representations for 3D single objects (Park et al., 2019; Xu et al., 2019; Mescheder et al., 2019), 3D scene surfaces (Sitzmann et al., 2019; Jiang et al., 2020; Genova et al., 2020; Sitzmann et al., 2020), 2D images (Chen et al., 2021; Sitzmann et al., 2020) and 1D audio (Sitzmann et al., 2020). Compared to traditional representations, a DIF is more efficient, more expressive, and fully continuous; it is able to capture better structural details with fewer parameters when trained properly. For example, DeepSDF (Park et al., 2019) takes a 3D coordinate and a categorical latent code as input, and outputs the signed distance (SDF) at this coordinate to decide whether it is inside the target shape. Local Implicit Grid (LIG) (Jiang et al., 2020) learns common geometric features from local overlapping patches and reconstructs complicated scenes by associating them. Local Implicit Image Function (LIIF) (Chen et al., 2021) extracts a set of latent codes distributed in the LR domain to interpolate the HR target image. SIREN (Sitzmann et al., 2020) proposes a general implicit neural representation for various domains that fits complicated signals by using periodic activation functions. Different from previous methods that focus on learning from single-modal data, we stress learning from multi-modal data, e.g., an HR RGB guide and an LR depth input, and we focus on extracting prior knowledge from the guide image to help the representation learning of the target image. In this respect, our method is closest to PixTransform (Lutio et al., 2019), which explores guided super-resolution from a different perspective, more similar to depth estimation, by training a DIF that maps each pixel in the guide image to the target image and is supervised only by the LR input. Different from our training pipeline, it does not rely on the HR target image for supervision and can be categorized as unsupervised depth super-resolution. Besides, PixTransform is not conditioned on observations, which means it needs to train a different set of parameters for every new image.

2.3. Graph Attention Mechanism

Graph Convolutional Networks (GCNs) address problems residing in graph-structured data by defining graph convolutions on the vertices and edges of a graph (Kipf and Welling, 2016; Veličković et al., 2017; Wang et al., 2019; Tang et al., 2021). An undirected graph $G = (V, E)$ is composed of vertices $V$ and edges $E$, together with an adjacency matrix $A$ that is binary or weighted. For tasks where $A$ is binary, some work explores learning the edge weights from vertex features to facilitate feature propagation. Graph Attention Networks (Veličković et al., 2017) leverage masked self-attention layers to regress a continuous weight between each pair of connected vertices. EdgeConv (Wang et al., 2019) learns different adjacency matrices in different layers to extract vertex features dynamically. For the task of guided super-resolution, we propose to first divide the image into pixels with an implicit neural representation, then treat each pixel query as a graph problem. Thus, the interpolation weights can be interpreted as graph edge weights and learned through the graph attention mechanism.

3. Method

In this section, we introduce our JIIF representation for the guided super-resolution task. We first review recent neural implicit interpolation methods in Section 3.1, then detail our JIIF representation for guided super-resolution in Section 3.2. Finally, we describe our design of the JIIF-Net to learn the representation from data in Section 3.3.

Figure 2. Network Architecture. The grid illustrates the relative image resolutions, and we use a small up-sampling ratio as an example for simplicity. Given an HR guide image and an LR input image, we extract two sets of latent codes via two encoders, then query the JIIF decoder with a coordinate in the HR domain to predict the pixel value at this coordinate. The prediction is a weighted average over the four nearest coordinates in the LR domain, just like standard image interpolation, but we learn the interpolation weights and values via a deep implicit function.

3.1. Neural Implicit Image Interpolation

We start from a general formulation of the image interpolation problem for image up-sampling, then view it from the perspective of implicit neural representation to introduce the neural implicit image interpolation method. For each LR input image $I^{LR}$, we want to calculate the corresponding HR target image $I^{HR}$:

(1)   $I^{HR}_x = \sum_{q \in N_x} w_{x,q} \, v_{x,q},$

where $x$ is the coordinate of the query pixel in the HR domain, $N_x$ is the set of neighbor pixels for $x$ in the LR domain, $w_{x,q}$ is the interpolation weight between $x$ and $q$, and $v_{x,q}$ is the interpolation value for $q$. The interpolation weights are usually normalized so that $\sum_{q \in N_x} w_{x,q} = 1$. We use a continuous image representation by scaling the image coordinates into a shared normalized range, so that the same coordinate $x$ can be used in both the HR and LR domains. Due to the nature of 2D images, $N_x$ is usually chosen as the four nearest corner pixels of $x$ in the LR domain (as illustrated in Figure 2). Different interpolation methods have different ways to calculate the interpolation weights and values. The most commonly used bilinear interpolation is implemented with:

(2)   $w_{x,q} = \frac{S_q}{S}, \qquad v_{x,q} = I^{LR}_q,$

where $S_q$ is the partial area diagonally opposite to the corner pixel $q$, $S$ is the total area serving as a normalization factor, and $I^{LR}_q$ is the pixel value of the LR input image at $q$.
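For reference, here is a minimal sketch of Equations (1)-(2) evaluated at a single continuous query coordinate; the function name and the convention of expressing coordinates in LR pixel units are our own illustrative choices.

```python
import numpy as np

def bilinear_query(img_lr, x, y):
    """Evaluate Equations (1)-(2): weights come from the diagonally opposite
    partial areas, values are taken directly from the LR image.
    (x, y) are continuous coordinates in LR pixel units."""
    H, W = img_lr.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    corners = [(x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)]
    result, total = 0.0, 0.0
    for (cx, cy) in corners:
        cxc, cyc = np.clip(cx, 0, W - 1), np.clip(cy, 0, H - 1)
        # S_q: area of the rectangle spanned by the query and the opposite corner
        w = abs((x - (x0 + 1 if cx == x0 else x0)) * (y - (y0 + 1 if cy == y0 else y0)))
        v = img_lr[cyc, cxc]          # v_{x,q}: the LR pixel value at corner q
        result += w * v
        total += w
    return result / total             # normalization by the total area S
```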

In implicit neural representation, instead of directly using the pixel value $I^{LR}_q$, a DIF is applied to calculate the interpolation value $v_{x,q}$. For example, LIIF (Chen et al., 2021), LIG (Jiang et al., 2020) and SCSSNet (Rist et al., 2020, 2021) all take the following form:

(3)   $v_{x,q} = f_\theta(z_q, x - q),$

where $f_\theta$ is an MLP with parameters $\theta$ that takes a local latent code $z_q$ and a relative coordinate $x - q$ as input. In this setting, the target image is represented by a set of local latent codes distributed at the pixel coordinates of the LR domain, each storing information about its local area (Chen et al., 2021). The latent code map $Z$ is the output feature map of an encoder network, and it is of the same resolution as the LR input image:

(4)   $Z = E_\varphi(I^{LR}),$

where $E_\varphi$ is the encoder network with parameters $\varphi$. The DIF thus models a local area centered at the coordinate of the given latent code $z_q$. By querying the conditioned DIF with a relative query coordinate $x - q$, it returns the estimated target value at the query coordinate $x$, e.g., the depth value in depth super-resolution. The weighted average of the estimated values from the four corners is further calculated to avoid discontinuous predictions (which is called local ensemble in (Chen et al., 2021)).
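The following is a hedged PyTorch-style sketch of this neural implicit interpolation (Equations (3)-(4)) with the area-based local ensemble described above, in the spirit of LIIF (Chen et al., 2021); the module name, feature dimension and MLP sizes are illustrative assumptions rather than the exact architecture used in this paper.

```python
import torch
import torch.nn as nn

class NeuralImplicitInterp(nn.Module):
    """Sketch of Eq. (3)-(4): v_{x,q} = f_theta(z_q, x - q), locally ensembled
    over the four nearest LR corner pixels with area-based weights."""

    def __init__(self, encoder, feat_dim=64):
        super().__init__()
        self.encoder = encoder                       # E_phi: LR image -> latent code map Z
        self.mlp = nn.Sequential(                    # f_theta, an MLP decoder
            nn.Linear(feat_dim + 2, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, img_lr, coords):
        # img_lr: (B, 1, h, w); coords: (B, N, 2) query positions in LR pixel units.
        z = self.encoder(img_lr)                     # (B, C, h, w) latent codes in the LR domain
        B, C, h, w = z.shape
        batch = torch.arange(B, device=z.device)[:, None]
        x, y = coords[..., 0], coords[..., 1]
        x0, y0 = x.floor().long(), y.floor().long()
        out = 0
        for dx in (0, 1):                            # visit the four surrounding corners
            for dy in (0, 1):
                cx = (x0 + dx).clamp(0, w - 1)       # corner q
                cy = (y0 + dy).clamp(0, h - 1)
                z_q = z[batch, :, cy, cx]            # (B, N, C) latent code z_q
                rel = torch.stack([x - cx, y - cy], dim=-1)    # relative coordinate x - q
                v = self.mlp(torch.cat([z_q, rel], dim=-1))    # (B, N, 1) value v_{x,q}
                # weight: area of the rectangle diagonally opposite to this corner
                w_q = ((x - (x0 + 1 - dx)).abs() * (y - (y0 + 1 - dy)).abs()).unsqueeze(-1)
                out = out + w_q * v
        return out                                   # corner weights sum to 1 on the unit cell
```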

3.2. Joint Implicit Image Function

We focus on the problem of guided super-resolution, where an extra HR guide image $I^{G}$ is provided along with the LR input image $I^{LR}$. Previous methods either directly regress the target image values by fusing CNN features, which lacks interpretability (Li et al., 2016, 2019), or treat it as an explicit filtering problem, which cannot fully take advantage of the information in the guide image (Kim et al., 2021). We hypothesize that the information in the guide image can benefit the learning of both interpolation weights and values, and that these two terms can be learned jointly to boost the performance. Inspired by the recent neural implicit image interpolation methods, we propose to use DIFs to model both the interpolation weights and values, which we call the Joint Implicit Image Function representation.

Similar to the LIIF representation, the target image is represented by a set of local latent codes, but our latent codes are extracted from both the LR input image and the HR guide image, allowing the detailed information from the guide image to help the up-sampling process. In particular, we apply two encoder networks to extract two sets of latent codes from the guide image and the input image respectively:

(5)   $G = E_\psi(I^{G}), \qquad Z = E_\varphi(I^{LR}),$

where $E_\psi$ is another encoder network with parameters $\psi$. Then, the interpolation values can be naturally calculated by querying the DIF with these two latent codes and a relative coordinate:

(6)   $v_{x,q} = f_\theta(z_q, g_q, x - q),$

where $q$ is one of the neighbors of $x$ in the LR domain ($q \in N_x$) and $g_q$ is the guide latent code at $q$. Please note that, due to the different resolutions of the HR and LR images, we cannot obtain the HR latent code at position $q$ directly. In such cases, we use bicubic interpolation to approximate the HR latent code at position $q$.
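As a concrete illustration of this approximation step, the sketch below samples the HR guide feature map at arbitrary (sub-pixel) corner positions with bicubic interpolation; the use of `grid_sample` and the [-1, 1] coordinate normalization are our assumptions about one convenient way to implement it, not a description of the released code.

```python
import torch.nn.functional as F

def sample_guide_code(guide_feat, coords_norm):
    """guide_feat:  (B, C, H, W) HR guide latent codes G.
    coords_norm: (B, N, 2) corner positions q normalized to [-1, 1].
    Returns (B, N, C) approximated guide codes g_q."""
    g = F.grid_sample(guide_feat, coords_norm.unsqueeze(1),   # (B, 1, N, 2) sampling grid
                      mode='bicubic', align_corners=False)    # bicubic feature interpolation
    return g.squeeze(2).permute(0, 2, 1)                      # (B, C, 1, N) -> (B, N, C)
```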

Furthermore, we propose to learn the interpolation weights at the same time. As illustrated in the neural implicit interpolation part of Figure 2, we view the interpolation at each query pixel as a graph problem: the four corner pixels and the query pixel are the vertices, and each corner pixel is connected to the query pixel with an edge. Previous methods usually use an empirical value for the edge weights (Chen et al., 2021), or directly regress the weights from the CNN features (Kim et al., 2021). Inspired by recent research in Graph Convolutional Networks (Veličković et al., 2017; Wang et al., 2019), we propose to use a graph attention mechanism to calculate the edge weights. Specifically, we extract the guide latent codes of each corner pixel and of the query pixel in the HR domain, and apply an MLP to learn the weight in an asymmetric way:

(7)   $\hat{w}_{x,q} = f_\phi(g_q, g_x - g_q),$

where $\hat{w}_{x,q}$ is the learned edge weight, $g_x$ and $g_q$ are the guide latent codes of the query pixel $x$ and the corner pixel $q$, and $f_\phi$ is an MLP with parameters $\phi$.

We notice that the representations of the interpolation weights (Equation 7) and values (Equation 6) take a similar form. Hence, we propose to integrate these two separate functions into a unified one:

(8)   $v_{x,q}, \; \hat{w}_{x,q} = f_\theta(z_q, g_q, g_x - g_q, x - q).$

By integrating the learning of interpolation weights and values, we reduce the parameters needed to model the representation, and allow interaction between these two processes, which is demonstrated to be more effective in our experiments. Finally, the edge weights are normalized by applying the softmax function to calculate the final interpolation weights:

(9)   $w_{x,q} = \dfrac{\exp(\hat{w}_{x,q})}{\sum_{q' \in N_x} \exp(\hat{w}_{x,q'})}.$
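To show how Equations (5)-(9) fit together, here is a hedged sketch of a JIIF-style decoder for one batch of query pixels: a single MLP jointly predicts the value and the raw graph-attention weight for each of the four corners, and the weights are softmax-normalized before the weighted average. Tensor layouts, layer sizes, and the exact way the latent codes enter the MLP are our assumptions for illustration; the released code at https://git.io/JC2sU is the authoritative implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JIIFDecoder(nn.Module):
    """Joint implicit image function: one MLP predicts both the interpolation
    value v_{x,q} and the unnormalized attention weight w_hat_{x,q} for each
    of the four corner pixels, then softmax-normalizes the weights (Eq. 9)."""

    def __init__(self, depth_dim=64, guide_dim=64, hidden=256):
        super().__init__()
        in_dim = depth_dim + guide_dim + guide_dim + 2   # z_q, g_q, g_x - g_q, x - q
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))                        # -> [v_{x,q}, w_hat_{x,q}]

    def forward(self, z_q, g_q, g_x, rel_coord):
        # z_q, g_q:  (B, N, 4, C) latent codes at the four corner pixels
        # g_x:       (B, N, C)    guide latent code at the query pixel
        # rel_coord: (B, N, 4, 2) relative coordinates x - q
        g_diff = g_x.unsqueeze(2) - g_q                  # graph-attention term per edge
        feat = torch.cat([z_q, g_q, g_diff, rel_coord], dim=-1)
        v, w_hat = self.mlp(feat).split(1, dim=-1)       # (B, N, 4, 1) each
        w = F.softmax(w_hat, dim=2)                      # normalize over the 4 corners
        return (w * v).sum(dim=2)                        # weighted average -> (B, N, 1)
```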
Datasets                    |      Middlebury      |          Lu          |        NYU v2
Down-sampling Ratio         |  ×4     ×8     ×16   |  ×4     ×8     ×16   |  ×4     ×8     ×16
Bicubic                     |  2.28   3.98   6.37  |  2.42   4.54   7.38  |  4.28   7.14   11.58
DMSG (Hui et al., 2016)     |  1.88   3.45   6.28  |  2.30   4.17   7.22  |  3.02   5.38   9.17
DG (Gu et al., 2017)        |  1.97   4.16   5.27  |  2.06   4.19   6.90  |  3.68   5.78   10.08
DJF (Li et al., 2016)       |  1.68   3.24   5.62  |  1.65   3.96   6.75  |  2.80   5.33   9.46
DJFR (Li et al., 2019)      |  1.32   3.19   5.57  |  1.15   3.57   6.77  |  2.38   4.94   9.18
PAC (Su et al., 2019)       |  1.32   2.62   4.58  |  1.20   2.33   5.19  |  1.89   3.33   6.78
DKN (Kim et al., 2021)      |  1.23   2.12   4.24  |  0.96   2.16   5.11  |  1.62   3.26   6.51
Ours                        |  1.09   1.82   3.31  |  0.85   1.73   4.16  |  1.37   2.76   5.27

Table 1. Quantitative comparison with the state of the art on depth map upsampling in terms of average RMSE.
(a) RGB image
(b) Bicubic Int.
(c) DJFR (Li et al., 2019)
(d) DKN (Kim et al., 2021)
(e) Ours
(f) Ground truth
Figure 3. Qualitative comparisons of guided depth map super-resolution on the NYU v2 dataset.

Datasets                    |         Art          |        Books         |       Moebius
Down-sampling Ratio         |  ×4     ×8     ×16   |  ×4     ×8     ×16   |  ×4     ×8     ×16
Bicubic                     |  6.07   7.27   9.59  |  5.15   5.45   5.97  |  5.51   5.68   6.11
DMSG (Hui et al., 2016)     |  6.19   7.26   9.53  |  5.38   5.18   5.20  |  5.48   5.06   5.36
PDN (Riegler et al., 2016)  |  3.11   4.48   7.35  |  1.56   2.24   3.46  |  1.68   2.48   3.62
DG (Gu et al., 2017)        |  2.96   4.41   7.06  |  1.64   2.35   3.50  |  1.74   2.57   3.79
DJFR (Li et al., 2019)      |  4.25   6.43   9.05  |  2.20   3.35   4.94  |  2.39   3.51   4.56
PAC (Su et al., 2019)       |  5.34   7.69   10.66 |  2.11   3.12   4.60  |  2.21   3.38   4.72
DKN (Kim et al., 2021)      |  3.01   4.14   7.01  |  1.44   2.10   3.09  |  1.63   2.39   3.55
Ours                        |  2.79   3.87   7.14  |  1.30   1.75   2.47  |  1.40   2.03   3.18

Table 2. Quantitative comparison with the state of the art on noisy depth map upsampling in terms of average RMSE.
(a) RGB image
(b) Bicubic Int.
(c) DJFR (Li et al., 2019)
(d) Ours
(e) Ground truth
Figure 4. Qualitative comparisons of guided noisy depth map super-resolution on the Noisy Middlebury dataset. The first row shows results of up-sampling on Art, and the second row shows up-sampling on Moebius.

3.3. Network Architecture and Training

After defining the JIIF representation, we design a neural network to learn the representation from large datasets. As shown in Figure 2, the network contains two image encoders and one JIIF decoder. The input image and the guide image are fed into two encoders respectively, generating two feature maps as the latent codes for the JIIF representation. During training, we sample a set of pixels from the HR image with their coordinates, and query the JIIF decoder with these coordinates to predict the pixel values. A standard L1 loss is applied to optimize the network for predicting accurate results:

(10)   $\mathcal{L} = \frac{1}{M} \sum_{i=1}^{M} \left\lvert \hat{I}^{HR}_{x_i} - I^{HR}_{x_i} \right\rvert,$

where $M$ is the total number of sampled pixels, $x_i$ is the coordinate of a sampled pixel, $I^{HR}_{x_i}$ is the ground truth pixel value, and $\hat{I}^{HR}_{x_i}$ is the predicted pixel value. In testing, we query all pixel coordinates in the target domain to recover the full up-sampled image.
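A minimal sketch of one training step as described above, with random HR pixel sampling and the L1 loss of Equation (10); the `model` interface, the number of sampled pixels and the tensor layout are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, depth_lr, guide_hr, depth_hr, num_samples=2048):
    """One optimization step: sample HR pixel coordinates, query the network at
    those coordinates, and minimize the L1 loss of Eq. (10)."""
    B, _, H, W = depth_hr.shape
    # random query coordinates in the HR domain (here in pixel units)
    ys = torch.randint(0, H, (B, num_samples), device=depth_hr.device)
    xs = torch.randint(0, W, (B, num_samples), device=depth_hr.device)
    coords = torch.stack([xs, ys], dim=-1).float()

    pred = model(depth_lr, guide_hr, coords)                   # (B, num_samples, 1)
    batch = torch.arange(B, device=depth_hr.device)[:, None]
    gt = depth_hr[batch, 0, ys, xs].unsqueeze(-1)              # ground truth at sampled pixels

    loss = F.l1_loss(pred, gt)                                 # Eq. (10)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```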

4. Experiments

In this section, we apply our method to the guided depth map super-resolution and guided noisy depth map super-resolution tasks to demonstrate its effectiveness.

Figure 5. Qualitative comparisons of guided depth map super-resolution on the Middlebury dataset (first two rows) and the Lu dataset (last row). From left to right: RGB images, Bicubic interpolation results, DJFR (Li et al., 2019) results, DKN (Kim et al., 2021) results, Our results, and the Ground Truth.

4.1. Guided depth map super-resolution

4.1.1. Datasets and Evaluation Metrics

We adopt three widely-used benchmarks for the guided depth super-resolution task:

  • NYU v2 dataset (Silberman et al., 2012): This dataset provides 1449 RGBD pairs of indoor scenes captured by a Microsoft Kinect (Zhang, 2012) using structured light. Following previous work (Kim et al., 2021; Li et al., 2019), we use the first 1000 pairs as the training set and the remaining 449 pairs as the evaluation set.

  • Middlebury dataset (Hirschmuller and Scharstein, 2007; Scharstein and Pal, 2007): We use a subset of 30 RGBD pairs from the 2001-2006 datasets provided by Lu et al. (Lu et al., 2014) for testing.

  • Lu dataset (Lu et al., 2014): This dataset consists of 6 RGBD pairs acquired by an ASUS Xtion Pro camera. We use it for testing.

Following Kim et al. (Kim et al., 2021), we train our model on the NYU v2 dataset and test it on all three datasets. We do not fine-tune the model on the Middlebury or Lu datasets, in order to test its generalization ability. The LR input images are generated at different down-sampling ratios (×4, ×8 and ×16) through bicubic down-sampling of the HR target images. We use the average RMSE as the evaluation metric for the depth map super-resolution task.
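For clarity, a small sketch of this evaluation protocol: the LR input is produced by bicubic down-sampling of the HR target, and predictions are scored by average RMSE. The exact interpolation call and data layout are our assumptions about a standard setup, not a verified copy of the authors' evaluation script.

```python
import torch
import torch.nn.functional as F

def make_lr_input(depth_hr, scale):
    """Generate the LR input by bicubic down-sampling the HR target
    (e.g. scale = 4, 8 or 16). depth_hr: (B, 1, H, W)."""
    return F.interpolate(depth_hr, scale_factor=1.0 / scale,
                         mode='bicubic', align_corners=False)

def average_rmse(pred, target):
    """Average RMSE over a batch of predicted / ground-truth depth maps."""
    return torch.sqrt(((pred - target) ** 2).mean(dim=(1, 2, 3))).mean().item()
```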

4.1.2. Implementation Details

We choose EDSR-baseline (Lim et al., 2017) as the backbone of the two encoders, and discard its up-sampling modules so that each encoder outputs a feature map of the same size as its input image. The two latent codes, together with the relative coordinate, form the input of the DIF, which is modeled by a 5-layer MLP with decreasing hidden dimensions.

We train the model for 200 epochs with a batch size of 1. The HR image is randomly cropped into patches, and a fixed number of pixels is sampled from each patch for every training step. The depth maps are normalized before being fed into the neural networks. For the Middlebury dataset and the Lu dataset, we interpret the provided disparity maps as depth maps following (Kim et al., 2021). We use the Adam optimizer (Kingma and Ba, 2014) to train our models, starting from a small initial learning rate that is decayed in a step-wise schedule during training. We apply data augmentation by flipping the image pairs vertically or horizontally during training. At test time, all of the pixels in the HR domain are queried to recover the target image. The model is implemented and trained with the PyTorch framework (Paszke et al., 2017).

4.1.3. Quantitative Comparisons

We compare the proposed method with state-of-the-art methods, including recent learning-based methods such as DJFR (Li et al., 2019) and DKN (Kim et al., 2021). Table 1 shows the detailed results on the three datasets. We report the average RMSE on each test set. For the NYU v2 dataset, the average RMSE is measured in centimeters. For the Middlebury dataset and the Lu dataset, the average RMSE is measured in the original scale of the provided disparity. Our method outperforms the existing methods by large margins on all datasets and settings. With the proposed JIIF representation, our method predicts more accurate target images at all up-sampling ratios, and generalizes well to data from other sources (e.g., disparity maps acquired by different devices). This improvement comes from the strong capability of the implicit neural representation and the joint learning of interpolation values and weights.

4.1.4. Qualitative Comparisons

We provide a visual comparison of the super-resolution results on the NYU v2 dataset in Figure 3. Generalization results on the Middlebury dataset and the Lu dataset are shown in Figure 5. Our method produces more accurate and sharper edges in areas with complicated structures, where other methods fail to model the geometry and generate blurred results. Besides, our method can restore reasonable structure even when the RGB guidance is ambiguous, e.g., too dark to provide any useful information. This confirms the advantages of the proposed JIIF representation.

Baseline   Joint Repr.   Residual  |  RMSE
   ✓                               |  3.12
   ✓           ✓                   |  2.95
   ✓                         ✓     |  2.97
   ✓           ✓             ✓     |  2.76

Table 3. Ablation study on proposed modules.

Methods             |  RMSE
Bilinear            |  3.68
Direct Regression   |  3.67
Graph Attention     |  2.76

Table 4. Ablation study on interpolation weight learning strategies.

4.2. Guided noisy depth map super-resolution

To show the robustness of our method on noisy data, we further perform experiments to restore noisy low-resolution input depth maps to noise-free high-resolution target depth maps.

4.2.1. Datasets

The Noisy Middlebury dataset (Park et al., 2011) is used as the evaluation dataset for this task. It contains three standard RGBD pairs from the Middlebury 2005 dataset, i.e., Art, Books and Moebius. We simulate noisy LR inputs following previous work (Riegler et al., 2016; Kim et al., 2021) by adding conditional Gaussian noise to the LR input:

(11)   $\tilde{I}^{LR}_q = I^{LR}_q + \epsilon_q, \qquad \epsilon_q \sim \mathcal{N}\big(0, \sigma^2(d_q)\big),$

where $\sigma(d_q)$ is proportional to the depth value $d_q$ at pixel $q$ (with the corresponding form when the input is a disparity map), and $m$ denotes the magnitude of the noise. For training, we use the NYU v2 dataset with the same type of noise added to the input images, and do not fine-tune the model on the Noisy Middlebury dataset. In particular, the noise magnitude for the Noisy Middlebury dataset follows (Riegler et al., 2016), and the value for the NYU v2 dataset is chosen to simulate a similar magnitude of noise. The other experimental settings are the same as in Section 4.1.2.
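A hedged sketch of the noise simulation of Equation (11); since the exact form of the noise standard deviation and the magnitude values are not recoverable from the text here, `sigma_fn` and `magnitude` are explicit placeholders rather than the authors' settings.

```python
import torch

def add_depth_dependent_noise(depth_lr, magnitude, sigma_fn=lambda d: d):
    """Eq. (11): corrupt the LR input with zero-mean Gaussian noise whose standard
    deviation grows with a (placeholder) function sigma_fn of the depth value,
    scaled by the overall noise magnitude."""
    sigma = magnitude * sigma_fn(depth_lr)
    return depth_lr + torch.randn_like(depth_lr) * sigma
```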

Figure 6. Visualization of the learned interpolation weights. We show two examples of the query pixel crossing an edge in the HR target image. The query pixel is in red, and the four corner pixels’ color indicates the learned interpolation weights. Higher weights are in bluer color, while lower weights are in greener color.

4.2.2. Quantitative Comparisons

From Table 2, we can see our method outperforms other methods on most settings of the guided noisy depth map super-resolution task. Although trained on depth maps from the NYU v2 dataset, our method generalizes well to the disparity maps from the Noisy Middlebury dataset. This agrees with the previous experiments and demonstrates the noise suppression ability of our method.

4.2.3. Qualitative Comparisons

The visual comparison of noisy depth map super-resolution is shown in Figure 4. Even when the input image is severely corrupted by noise and the up-sampling ratio is large, our method successfully restores reasonable structural details in the predicted HR depth map. Also, our method better suppresses noise and restores sharper edges compared to other methods.

4.3. Ablation Study

We conduct ablation studies on different proposed modules in our method, and verify the effect of these modules for the guided depth super-resolution task on the NYU v2 dataset.

4.3.1. Learning interpolation weights

Firstly, we conduct ablation studies on different strategies for learning the interpolation weights. As shown in Table 4, our graph attention based weight learning achieves the best performance. 'Bilinear' means the bilinear interpolation weights are used. 'Direct Regression' means we directly use a convolution layer to regress the weights from the guide image features, and 'Graph Attention' means we apply a graph attention layer to regress the weights from the guide image features. Compared to the baseline bilinear interpolation weights, our method reduces the average RMSE from 3.68 to 2.76. Direct regression of the weights, as used in DKN (Kim et al., 2021), also fails to learn meaningful interpolation weights. This verifies the effectiveness of the graph attention mechanism for learning edge weights. We also provide a visualization of the learned interpolation weights in Figure 6. The graph attention module can adapt to different locations dynamically, predicting higher weights when the two vertices share common guidance features. For example, when the query pixel crosses an edge, the interpolation weights switch to the correct side too. This avoids assigning large weights to wrong values from the opposite side, which is one of the main causes of blur in traditional image interpolation.

4.3.2. Joint learning of interpolation weights and values

We argue that the interpolation weights and values are correlated and can be learned together to boost performance. Our JIIF representation is designed to exploit this correlation by predicting them with one DIF. We conduct experiments to verify this hypothesis in Table 3. 'Baseline' means that we break the JIIF into two DIFs that learn the interpolation weights and values separately, as described in Equations 6 and 7. 'Joint Repr.' means we use one unified MLP to learn the interpolation weights and values, as described in Equation 8. Note that we use the same architecture for the DIFs, which means the 'Joint Repr.' setting also reduces the parameters of the last four MLP layers by half. The experimental results validate our hypothesis that the joint representation further enhances the final performance.

4.3.3. Residual Learning

Previous work (Li et al., 2019; Kim et al., 2021) has shown that residual learning, i.e., first up-sampling the input with bicubic interpolation and then correcting it by predicting a residual image, can speed up convergence and improve the final performance. We also adopt this idea and perform experiments to validate the effect of residual learning in Table 3. 'Residual' means we adopt a residual learning framework. With both residual learning and the joint representation applied, our method achieves the best performance.
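As a brief illustration of residual learning for a dense-prediction network (in the coordinate-query setting, the bicubic value at each queried coordinate plays the same role), here is a minimal sketch; the wrapper name and interface are our own, not the authors' implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualWrapper(nn.Module):
    """Predict a residual on top of the bicubic up-sampled input instead of the
    absolute depth values."""
    def __init__(self, net):
        super().__init__()
        self.net = net                                         # any guided SR network

    def forward(self, depth_lr, guide_hr):
        base = F.interpolate(depth_lr, size=guide_hr.shape[-2:],
                             mode='bicubic', align_corners=False)
        return base + self.net(depth_lr, guide_hr)             # network outputs only the correction
```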

5. Conclusion

In this paper, we propose a Joint Implicit Image Function (JIIF) representation for the guided super-resolution task. The JIIF representation takes the form of a general image interpolation but equips it with deep implicit functions to enhance model capability. In particular, the target image is represented with spatially distributed local latent codes extracted from both the input image and the guide image, and a graph attention mechanism is used to learn the interpolation weights at the same time in one unified deep implicit function. We demonstrate that the learning of interpolation weights and values is correlated, and our JIIF representation takes advantage of this correlation to boost performance. The effectiveness and generalization ability of our method are verified on two tasks.

Acknowledgements.
This work is supported by the National Key Research and Development Program of China (2017YFB1002601), National Natural Science Foundation of China (61632003, 61375022, 61403005), Beijing Advanced Innovation Center for Intelligent Robots and Systems (2018IRS11), and PEK-SenseTime Joint Laboratory of Machine Vision.

References

  • J. T. Barron and B. Poole (2016) The fast bilateral solver. In European Conference on Computer Vision, pp. 617–632. Cited by: §2.1.1.
  • D. Chan, H. Buisman, C. Theobalt, and S. Thrun (2008) A noise-aware filter for real-time depth upsampling. In Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications-M2SFA2 2008, Cited by: §2.1.1.
  • H. Chang, D. Yeung, and Y. Xiong (2004) Super-resolution through neighbor embedding. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), Vol. 1, pp. I–I. Cited by: §2.1.2.
  • X. Chen, K. Lin, C. Qian, G. Zeng, and H. Li (2020a) 3d sketch-aware semantic scene completion via semi-supervised structure prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4193–4202. Cited by: §1.
  • X. Chen, K. Lin, J. Wang, W. Wu, C. Qian, H. Li, and G. Zeng (2020b) Bi-directional cross-modality feature propagation with separation-and-aggregation gate for rgb-d semantic segmentation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pp. 561–577. Cited by: §1.
  • X. Chen, Y. Xing, and G. Zeng (2020c) Real-time semantic scene completion via feature aggregation and conditioned prediction. In 2020 IEEE International Conference on Image Processing (ICIP), pp. 2830–2834. Cited by: §1.
  • Y. Chen, S. Liu, and X. Wang (2021) Learning continuous image representation with local implicit image function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8628–8638. Cited by: §1, §2.1.2, §2.2, §3.1, §3.2.
  • J. Diebel and S. Thrun (2005) An application of markov random fields to range sensing. In NIPS, Vol. 5, pp. 291–298. Cited by: §2.1.1.
  • C. Dong, C. C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In European conference on computer vision, pp. 184–199. Cited by: §2.1.2.
  • D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, and H. Bischof (2013) Image guided depth upsampling using anisotropic total generalized variation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 993–1000. Cited by: §2.1.1.
  • K. Genova, F. Cole, A. Sud, A. Sarna, and T. Funkhouser (2020) Local deep implicit functions for 3d shape. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4857–4866. Cited by: §2.2.
  • S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang (2017) Learning dynamic guidance for depth image enhancement. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3769–3778. Cited by: §2.1.2, Table 1, Table 2.
  • S. Gupta, R. Girshick, P. Arbeláez, and J. Malik (2014) Learning rich features from rgb-d images for object detection and segmentation. In European conference on computer vision, pp. 345–360. Cited by: §1.
  • B. Ham, M. Cho, and J. Ponce (2017) Robust guided image filtering using nonconvex potentials. IEEE transactions on pattern analysis and machine intelligence 40 (1), pp. 192–207. Cited by: §2.1.1.
  • K. He, J. Sun, and X. Tang (2012) Guided image filtering. IEEE transactions on pattern analysis and machine intelligence 35 (6), pp. 1397–1409. Cited by: §1, §2.1.1.
  • H. Hirschmuller and D. Scharstein (2007) Evaluation of cost functions for stereo matching. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: 2nd item.
  • T. Hui, C. C. Loy, and X. Tang (2016) Depth map super-resolution by deep multi-scale guidance. In European conference on computer vision, pp. 353–369. Cited by: §1, §2.1.2, Table 1, Table 2.
  • C. Jiang, A. Sud, A. Makadia, J. Huang, M. Nießner, T. Funkhouser, et al. (2020) Local implicit grid representations for 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6001–6010. Cited by: §1, §2.2, §3.1.
  • M. Kiechle, S. Hawe, and M. Kleinsteuber (2013) A joint intensity and depth co-sparse analysis model for depth map super-resolution. In 2013 IEEE International Conference on Computer Vision, pp. 1545–1552. Cited by: §2.1.1.
  • B. Kim, J. Ponce, and B. Ham (2021) Deformable kernel networks for joint image filtering. International Journal of Computer Vision 129 (2), pp. 579–600. Cited by: §2.1.2, §3.2, Figure 3, Table 1, Table 2, Figure 5, 1st item, §4.1.1, §4.1.2, §4.1.3, §4.2.1, §4.3.1, §4.3.3.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.2.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2.3.
  • J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele (2007) Joint bilateral upsampling. ACM Transactions on Graphics (ToG) 26 (3), pp. 96–es. Cited by: §2.1.1.
  • W. Lai, J. Huang, N. Ahuja, and M. Yang (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 624–632. Cited by: §2.1.2.
  • C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: §2.1.2.
  • Y. Li, T. Xue, L. Sun, and J. Liu (2012) Joint example-based depth map super-resolution. In 2012 IEEE International Conference on Multimedia and Expo, pp. 152–157. Cited by: §2.1.1.
  • Y. Li, J. Huang, N. Ahuja, and M. Yang (2016) Deep joint image filtering. In European Conference on Computer Vision, pp. 154–169. Cited by: §1, §2.1.2, §3.2, Table 1.
  • Y. Li, J. Huang, N. Ahuja, and M. Yang (2019) Joint image filtering with deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 41 (8), pp. 1909–1923. Cited by: §1, §2.1.2, §3.2, Figure 3, Figure 4, Table 1, Table 2, Figure 5, 1st item, §4.1.3, §4.3.3.
  • B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee (2017) Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136–144. Cited by: §4.1.2.
  • S. Lu, X. Ren, and F. Liu (2014) Depth enhancement via low-rank matrix completion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3390–3397. Cited by: 2nd item, 3rd item.
  • R. d. Lutio, S. D’aronco, J. D. Wegner, and K. Schindler (2019) Guided super-resolution as pixel-to-pixel transformation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8829–8837. Cited by: §2.2.
  • Z. Ma, K. He, Y. Wei, J. Sun, and E. Wu (2013) Constant time weighted median filtering for stereo matching and beyond. In Proceedings of the IEEE International Conference on Computer Vision, pp. 49–56. Cited by: §2.1.1.
  • L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4460–4470. Cited by: §2.2.
  • J. Park, H. Kim, Y. Tai, M. S. Brown, and I. Kweon (2011) High quality depth map upsampling for 3d-tof cameras. In 2011 International Conference on Computer Vision, pp. 1623–1630. Cited by: §2.1.1, §4.2.1.
  • J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019) Deepsdf: learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 165–174. Cited by: §1, §2.2.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.1.2.
  • G. Riegler, D. Ferstl, M. Rüther, and H. Bischof (2016) A deep primal-dual network for guided depth super-resolution. arXiv preprint arXiv:1607.08569. Cited by: Table 2, §4.2.1.
  • C. B. Rist, D. Schmidt, M. Enzweiler, and D. M. Gavrila (2020) SCSSnet: learning spatially-conditioned scene segmentation on lidar point clouds. In 2020 IEEE Intelligent Vehicles Symposium (IV), pp. 1086–1093. Cited by: §1, §3.1.
  • C. Rist, D. Emmerichs, M. Enzweiler, and D. Gavrila (2021) Semantic scene completion using local deep implicit functions on lidar data. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §3.1.
  • D. Scharstein and C. Pal (2007) Learning conditional random fields for stereo. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: 2nd item.
  • N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from rgbd images. In European conference on computer vision, pp. 746–760. Cited by: 1st item.
  • V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein (2020) Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems 33. Cited by: §2.2.
  • V. Sitzmann, M. Zollhöfer, and G. Wetzstein (2019) Scene representation networks: continuous 3d-structure-aware neural scene representations. arXiv preprint arXiv:1906.01618. Cited by: §2.2.
  • S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser (2017) Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1746–1754. Cited by: §1.
  • H. Su, V. Jampani, D. Sun, O. Gallo, E. Learned-Miller, and J. Kautz (2019) Pixel-adaptive convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11166–11175. Cited by: §2.1.2, Table 1, Table 2.
  • J. Tang, X. Gao, and W. Hu (2021) RGLN: robust residual graph learning networks via similarity-preserving mapping on graphs. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2940–2944. Cited by: §2.3.
  • C. Tomasi and R. Manduchi (1998) Bilateral filtering for gray and color images. In Sixth international conference on computer vision (IEEE Cat. No. 98CH36271), pp. 839–846. Cited by: §1, §2.1.1.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §2.3, §3.2.
  • Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: §2.3, §3.2.
  • S. Weder, J. Schonberger, M. Pollefeys, and M. R. Oswald (2020) Routedfusion: learning real-time depth map fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4887–4897. Cited by: §1.
  • Y. Xing, J. Wang, X. Chen, and G. Zeng (2019a) 2.5 d convolution for rgb-d semantic segmentation. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 1410–1414. Cited by: §1.
  • Y. Xing, J. Wang, X. Chen, and G. Zeng (2019b) Coupling two-stream rgb-d semantic segmentation network by idempotent mappings. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 1850–1854. Cited by: §1.
  • Q. Xu, W. Wang, D. Ceylan, R. Mech, and U. Neumann (2019) Disn: deep implicit surface network for high-quality single-view 3d reconstruction. arXiv preprint arXiv:1905.10711. Cited by: §2.2.
  • J. Yang, X. Ye, K. Li, C. Hou, and Y. Wang (2014) Color-guided depth recovery from rgb-d data using an adaptive autoregressive model. IEEE transactions on image processing 23 (8), pp. 3443–3458. Cited by: §2.1.1.
  • Q. Yang, R. Yang, J. Davis, and D. Nistér (2007) Spatial-depth super resolution for range images. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §2.1.1.
  • X. Ye, B. Sun, Z. Wang, J. Yang, R. Xu, H. Li, and B. Li (2020) Depth super-resolution via deep controllable slicing network. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 1809–1818. Cited by: §2.1.2.
  • K. Zhang, L. V. Gool, and R. Timofte (2020) Deep unfolding network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3217–3226. Cited by: §2.1.2.
  • Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018) Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2472–2481. Cited by: §2.1.2.
  • Z. Zhang (2012) Microsoft kinect sensor and its effect. IEEE multimedia 19 (2), pp. 4–10. Cited by: 1st item.