Joint Image Filtering with Deep Convolutional Networks

10/11/2017 ∙ by Yijun Li, et al. ∙ University of Illinois at Urbana-Champaign ∙ University of California, Merced ∙ Virginia Polytechnic Institute and State University

Joint image filters leverage the guidance image as a prior and transfer the structural details from the guidance image to the target image for suppressing noise or enhancing spatial resolution. Existing methods rely on either explicit filter constructions or hand-designed objective functions, which makes it difficult to understand, improve, and accelerate these filters in a coherent framework. In this paper, we propose a learning-based approach for constructing joint filters based on Convolutional Neural Networks. In contrast to existing methods that consider only the guidance image, the proposed algorithm can selectively transfer salient structures that are consistent with both guidance and target images. We show that the model trained on a certain type of data, e.g., RGB and depth images, generalizes well to other modalities, e.g., flash/non-flash and RGB/NIR images. We validate the effectiveness of the proposed joint filter through extensive experimental comparisons with state-of-the-art methods.




1 Introduction

Image filtering with guidance signals, known as joint or guided filtering, has been successfully applied to numerous computer vision and computer graphics tasks, such as depth map enhancement [1, 2, 3], joint upsampling [4, 1], cross-modality noise reduction [5, 6, 7], and structure-texture separation [8, 9]. The wide applicability of joint filters can be attributed to their adaptability in handling visual signals in various image domains and modalities, as shown in Figure 1. For a target image, the guidance image can be the target image itself [10, 6], a high-resolution RGB image [6, 2, 3], an image from a different sensing modality [11, 12, 5], or the filter output from a previous iteration [9]. The basic idea behind joint image filtering is to transfer the important structural details contained in the guidance image to the target image. The main goal is to enhance a target image degraded by noise or low spatial resolution while avoiding the transfer of extraneous structures that do not originally exist in the target image, e.g., texture-copying artifacts.

Several approaches have been developed to transfer structures in the guidance image to the target image. One category of algorithms constructs joint filters for specific tasks. For example, the bilateral filtering algorithm [10] constructs spatially-varying filters that reflect local image structures (e.g., smooth regions, edges, textures) in the guidance image. Such filters can then be applied to the target image for edge-aware smoothing [10] or joint upsampling [4]. On the other hand, the guided image filter [6] assumes a locally linear model over the guidance image for filtering. However, these filters share one common drawback: the filter construction considers only the information contained in the guidance image and remains fixed (i.e., static guidance). When the local structures in the guidance and target images are not consistent, these methods may transfer incorrect or extraneous contents to the target image.
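To make the notion of static guidance concrete, the following is a minimal NumPy sketch of a joint bilateral filter for single-channel images. It is an illustration under our own assumptions (the function name, the parameters `radius`, `sigma_s`, and `sigma_r`, and the edge-replication padding are not taken from [10] or [4]). Note that the range weights are computed from the guidance image only, which is exactly the limitation discussed above.

```python
import numpy as np

def joint_bilateral_filter(target, guidance, radius=2, sigma_s=2.0, sigma_r=0.1):
    """Filter a 2-D `target` with weights computed only from a 2-D `guidance`."""
    h, w = target.shape
    # Replicate borders so that every pixel has a full neighborhood.
    t_pad = np.pad(target, radius, mode="edge")
    g_pad = np.pad(guidance, radius, mode="edge")
    num = np.zeros((h, w), dtype=np.float64)
    den = np.zeros((h, w), dtype=np.float64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ys, xs = radius + dy, radius + dx
            t_shift = t_pad[ys:ys + h, xs:xs + w]
            g_shift = g_pad[ys:ys + h, xs:xs + w]
            # Spatial weight: depends on the pixel offset only.
            spatial = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
            # Range weight: depends on the guidance image only (static guidance).
            range_w = np.exp(-((guidance - g_shift) ** 2) / (2.0 * sigma_r ** 2))
            weight = spatial * range_w
            num += weight * t_shift
            den += weight
    # The center pixel always contributes weight 1, so `den` is never zero.
    return num / den

# Toy usage: a clean step edge guides the smoothing of a noisy step edge.
rng = np.random.default_rng(0)
guidance = np.zeros((8, 8))
guidance[:, 4:] = 1.0
target = guidance + 0.05 * rng.standard_normal((8, 8))
filtered = joint_bilateral_filter(target, guidance)
```

Because the target image never enters the weight computation, any strong edge in the guidance image shapes the filter, whether or not a corresponding structure exists in the target.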

To address this issue, recent efforts focus on considering the common structures existing in both the target and guidance images [9, 13, 7]. These frameworks typically build on iterative methods for minimizing global objective functions. The guidance signals are updated at each iteration (i.e., dynamic guidance) towards preserving the mutually consistent structures while suppressing contents that are not shared by both images. However, these global optimization based methods often use hand-crafted objective functions that may not reflect natural image priors well, and they typically incur a heavy computational load.

Fig. 1 panels: depth map upsampling, saliency map upsampling, chromaticity map upsampling, flash/non-flash noise reduction, inverse halftoning, texture removal.
Fig. 1: Sample applications of joint image filtering. The target/guidance pair (top) can be various types of cross-modality visual data. With the help of the guidance image, important structures can be transferred to the degraded target image to help enhance the spatial resolution or suppress noise (bottom). The guidance image can be a high-resolution RGB image, an image from a different sensing modality, or the target image itself.

In this work, we propose a learning-based joint filter based on Convolutional Neural Networks (CNNs). We propose a network architecture that consists of three sub-networks and a skip connection, as shown in Figure 2. The first two sub-networks, the target network $\mathrm{CNN}_T$ and the guidance network $\mathrm{CNN}_G$, extract informative features from the target and guidance images, respectively. These feature responses are then concatenated as inputs for the filter network $\mathrm{CNN}_F$ to selectively transfer common structures. As the target input and output images are largely similar, we introduce a skip connection, which is combined with the output of $\mathrm{CNN}_F$ to reconstruct the filtered output. In other words, we enforce the network to focus on learning the residuals between the degraded target and the ground truth images. We train the network using large quantities of real data (RGB and depth images) and learn all the network parameters simultaneously without stage-wise training.

Our algorithm differs from existing methods in that the proposed joint image filter is purely data-driven. This allows the network to handle complicated scenarios that may be difficult to capture through hand-crafted objective functions. While the network is trained using the RGB-D data, the network learns how to selectively transfer structures by leveraging the prior from the guidance image, rather than predicting specific values. As a result, the learned network generalizes well for handling images in various domains and modalities.

We make the following contributions in this paper:

  • We propose a learning-based framework for constructing a generic joint image filter. Our network takes both target and guidance images into consideration and naturally handles the inconsistent structure problem.

  • With the learned joint filter, we demonstrate the state-of-the-art performance for joint depth upsampling on real datasets and competitive performance on the synthetic datasets.

  • We show that the model trained on a certain type of data (e.g., RGB-D) generalizes well to handle image data in a variety of domains.

A preliminary version of this work was presented earlier in [14]. In this paper, we significantly extend our work and summarize the main differences as follows. First, we propose an improved network architecture for joint image filtering. Instead of directly predicting filtered pixel values (as in [14]), we predict a residual image by adding a skip connection from the input target image to the output (Figure 2). As the residual learning alleviates the need for restoring specific target image contents (which complicates the learning process), we show significant improvement in transferring accurate details from the guidance to the target image. Second, in [14], we train the model only using an RGB/depth dataset and then evaluate its generalization ability on other domains. In this work, we show that the model trained using an RGB/flow dataset also generalizes well on other visual domains. This demonstrates that our network design is insensitive to the modality of the training data. Third, we evaluate our approach on various joint image filter applications, compare against several state-of-the-art joint image filters (including concurrent work [15, 16]), and conduct a detailed ablation study by analyzing the performance of all methods under different hyper-parameter settings (e.g., number of filters and filter size).

Fig. 2: Network architecture for joint image filter. The proposed deep joint image filter model consists of three major components. Each component is a three-layer network. The sub-networks $\mathrm{CNN}_T$ and $\mathrm{CNN}_G$ aim to extract informative feature responses from the target and guidance images, respectively. We then concatenate these feature responses and use them as input for the filter network $\mathrm{CNN}_F$. In addition, we introduce a skip connection so that the network learns to predict the residuals between the input target image and the desired ground truth output. We train the network to selectively transfer main structures while suppressing inconsistent structures using an RGB-D dataset. While we describe these sub-networks individually, the parameters of all three sub-networks are updated simultaneously during the training stage.

2 Related Work

Joint image filters.

Joint image filters can be categorized into two main classes based on explicit filter construction or global optimization of data fidelity and regularization terms.

Explicit joint filters compute the filtered output as a weighted average of neighboring pixels in the target image. The bilateral filters [10, 1, 4, 17, 9, 16] and guided filters [6] are representative algorithms in this class. The filter weights, however, depend solely on the local structure of the guidance image. Therefore, erroneous or extraneous structures may be transferred to the target image due to the lack of consistency constraints. In contrast, our model considers the contents of both images based on feature maps and enforces consistency implicitly through learning from examples.

Numerous approaches formulate joint filtering based on a global optimization framework. The objective function typically consists of two terms: data fidelity and regularization terms. The data fidelity term ensures that the filtering output is close to the input target image. These techniques differ from each other mainly in the regularization term that encourages the output to have a structure similar to that of the guidance image. The regularization term can be defined according to texture derivatives [18], mid-level representations [2] such as segmentation and saliency, filtering outputs [13], or mutual structures shared by the target and guidance image [7]. However, global optimization based methods rely on hand-designed objective functions that may not reflect the complexities in natural images. Furthermore, these approaches involve iterative optimization and are often time-consuming. In contrast, our method learns how to selectively transfer important details directly from real RGB-depth datasets. Although the training process is time-consuming, the learned model is efficient during run-time.
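As a rough illustration of this class of methods, the objective can be written in the following generic form. The symbols ($u$ for the filtered output, $f$ for the input target image, $g$ for the guidance image, $\Omega$ for the regularizer, and $\lambda$ for the trade-off weight) are our own shorthand for this sketch and are not taken from any of the cited methods:

```latex
\hat{u} \;=\; \arg\min_{u} \; \sum_{p} \left( u_p - f_p \right)^2 \;+\; \lambda \, \Omega(u, g)
```

The first term is the data fidelity term over pixels $p$, and the methods cited above differ mainly in how the regularizer $\Omega$ couples the output to the guidance image.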

Learning-based image filters.

With significant success in high-level vision tasks [19], substantial efforts have been made to construct image filters using learning algorithms and CNNs. For example, the conventional bilateral filter can be improved by replacing the predefined filter weights with those learned from a large amount of data [20, 21, 22]. In the context of joint depth upsampling, Tai et al. [15] use a multi-scale guidance strategy to improve upsampling performance. Gu et al. [23] adjust the original guidance dynamically to account for the iterative updates of the filtering results. However, these methods [15, 23] are limited to the application of depth map upsampling. In contrast, our goal is to construct a generic joint filter for various applications using target/guidance image pairs in different visual domains.

Deep models for low-level vision.

In addition to filtering, deep learning models have also been applied to other low-level vision and computational photography tasks. Examples include image denoising [24], raindrop removal [25], image super-resolution [26], image deblurring [27] and optical flow estimation [28]. Existing deep learning models for low-level vision use either one input image [26, 24, 25, 29] or two images in the same domain [28]. In contrast, our network can accommodate two streams of inputs from heterogeneous domains, e.g., RGB/NIR, flash/non-flash, RGB/depth, intensity/color. Our network architecture bears some resemblance to that in Dosovitskiy et al. [28]. The main difference is that the merging layer in [28] uses a correlation operator while our model integrates the inputs by concatenating the feature responses. Furthermore, we adopt residual learning by introducing the skip connection.

Another closely related work is by Xu et al. [29], which learns a CNN to approximate existing edge-aware filters from example images. Our method differs from [29] in two aspects. First, the goal of [29] is to use a CNN to approximate existing edge-aware filters. In contrast, our goal is to learn a new joint image filter. Second, unlike the network in [29] that takes only a single RGB image, the proposed joint filter handles two images from different domains and modalities.

Skip connections.

As deeper networks have been developed for vision tasks, the information contained in the input or gradients can vanish and wash out by the time it reaches the end (or beginning) of the network. He et al. [30] address this problem by bypassing the signals from one layer to the next via skip connections. This residual learning method makes it possible to train very deep networks effectively. The work of [31] further strengthens its effectiveness with dense connections across all the layers. For low-level vision tasks, skip connections have been shown to be useful for restoring high-frequency details [32, 33] by enforcing the network to learn the residual signals only.

3 Learning Joint Image Filters

In this section, we introduce a learning-based joint image filter based on CNNs. We first present the network design (Section 3.1) and skip connection (Section 3.2). Next, we describe the network training process (Section 3.3) and visualize the guidance map generated by the network (Section 3.4).

Our CNN model consists of three sub-networks: the target network $\mathrm{CNN}_T$, the guidance network $\mathrm{CNN}_G$, and the filter network $\mathrm{CNN}_F$, as shown in Figure 2. First, the sub-network $\mathrm{CNN}_T$ takes the target image as input and extracts a feature map. Second, similar to $\mathrm{CNN}_T$, the sub-network $\mathrm{CNN}_G$ extracts a feature map from the guidance image. Third, the sub-network $\mathrm{CNN}_F$ takes the concatenated feature responses from the sub-networks $\mathrm{CNN}_T$ and $\mathrm{CNN}_G$ as input and generates the residual, i.e., the difference between the degraded target image and ground truth. By adding the target input through the skip connection, we obtain the final joint filtering result. Here, the main roles of the two sub-networks $\mathrm{CNN}_T$ and $\mathrm{CNN}_G$ are to serve as non-linear feature extractors that capture the local structural details in the respective target and guidance images. The sub-network $\mathrm{CNN}_F$ can be viewed as a non-linear regression function that maps the feature responses from both target and guidance images to the desired residuals. Note that the information from target and guidance images is simultaneously considered when predicting the final filtered result. Such a design allows us to selectively transfer structures and avoid texture-copying artifacts.
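The data flow through the target, guidance, and filter sub-networks can be sketched in NumPy as follows. This is a minimal forward-pass illustration under stated assumptions, not the trained model: the filter sizes 9-1-5, the small layer widths, the ReLU activations, the edge-replication padding, and the random weights are all placeholders chosen for this example.

```python
import numpy as np

def conv2d(x, w, b):
    """Size-preserving convolution. x: (C_in, H, W), w: (C_out, C_in, k, k), b: (C_out,)."""
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Accumulate the contribution of filter tap (i, j) over all channels.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out + b[:, None, None]

def init_subnet(rng, c_in, widths=(8, 4, 1), sizes=(9, 1, 5)):
    """Random parameters for a three-layer sub-network (widths and sizes are assumptions)."""
    params = []
    for c_out, k in zip(widths, sizes):
        params.append((0.01 * rng.standard_normal((c_out, c_in, k, k)), np.zeros(c_out)))
        c_in = c_out
    return params

def run_subnet(x, params):
    """Apply the layers in order, with ReLU after every layer except the last."""
    for idx, (w, b) in enumerate(params):
        x = conv2d(x, w, b)
        if idx < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def joint_filter(target, guidance, p_t, p_g, p_f):
    feat_t = run_subnet(target, p_t)            # features from the target image
    feat_g = run_subnet(guidance, p_g)          # features from the guidance image
    fused = np.concatenate([feat_t, feat_g], axis=0)
    residual = run_subnet(fused, p_f)           # filter network predicts the residual
    return target + residual                    # skip connection adds the target back

rng = np.random.default_rng(0)
p_t, p_g, p_f = init_subnet(rng, 1), init_subnet(rng, 3), init_subnet(rng, 2)
target = rng.random((1, 12, 12))    # 1-channel target, e.g., a depth map
guidance = rng.random((3, 12, 12))  # 3-channel guidance, e.g., an RGB image
output = joint_filter(target, guidance, p_t, p_g, p_f)
```

With this structure, a filter network that outputs zeros leaves the target unchanged, which is the behavior the skip connection is meant to make easy to learn.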

Fig. 3 panels: (a) Ground truth, (b) Bicubic upsampling, (c) 3-layer (9-1-5), (d) 4-layer (9-1-1-5), (e) 4-layer (9-5-1-5), (f) Our network.
Fig. 3: Comparison of network design. Joint depth upsampling (8×) results of using different network architectures $f_1$-$f_2$-…, where $f_i$ is the filter size of the $i$-th layer. (a) GT depth map (inset: guidance image). (b) Bicubic upsampling. (c)-(e) Results from the straightforward implementation using $\mathrm{CNN}_F$. (f) Our results.

3.1 Network architecture design

To design a joint filter using CNNs, a straightforward implementation is to concatenate the target and guidance images together and directly train a generic CNN similar to the filter network $\mathrm{CNN}_F$. While in theory we can train a generic CNN to approximate the desired function for joint filtering, our empirical results show that such a network generates poor results. Figure 3(c) shows one typical example of joint upsampling using the network $\mathrm{CNN}_F$. The main structures (e.g., the boundary of the bed) contained in the guidance image are not well transferred to the target depth image, thereby resulting in blurry boundaries. In addition, inconsistent texture structures in the guidance image (e.g., the stripe pattern of the curtain on the wall) are also incorrectly copied to the target image. A potential approach that may improve the results is to adjust the architecture of $\mathrm{CNN}_F$, such as increasing the network depth or using larger filter sizes. However, as shown in Figure 3(d) and (e), these variants do not show notable improvement. Blurry boundaries and the texture-copying problem still occur. We note that similar observations have also been reported in [34], which indicates that the effectiveness of deeper structures for low-level tasks is not as apparent as that shown in high-level tasks (e.g., image classification).

We attribute the limitation of using a generic network for joint filtering to the fact that the original RGB guidance image fails to provide direct and effective guidance as it mixes a variety of information (e.g., texture, intensity, and edges). To validate this intuition, we show in Figure 4 one example where we replace the original RGB guidance image with its edge map extracted using [35]. Compared to the results guided by the RGB image (Figure 4(d)), the upsampled image using the edge map guidance (Figure 4(e)) shows substantial improvement in preserving the sharp edges.

Based on the above observation, we introduce two sub-networks, $\mathrm{CNN}_T$ and $\mathrm{CNN}_G$, to first construct two separate processing streams for the two images before concatenation. With the proposed architecture, we constrain the network to extract effective features from both images separately first and then fuse them at a later stage to generate the final filtering output. This differs from conventional joint filters where the guidance information is mainly computed from the pixel-level intensity/color differences in the local neighborhoods. As our models are jointly trained in an end-to-end fashion, our result (Figure 4(f)) shows further improvements over that of the edge-guided filtering (Figure 4(e)).

In this work, we adopt a three-layer structure for each sub-network as shown in Figure 2. Given $M$ training image samples $\{I^T_i, I^G_i, I^{gt}_i\}_{i=1}^{M}$, we learn the network parameters by minimizing the sum of the squared losses:

$$\sum_{i=1}^{M} \left\| I^{gt}_i - \Phi(I^T_i, I^G_i) \right\|_2^2, \qquad (1)$$

where $\Phi$ denotes the joint image filtering operator. In addition, $I^T$, $I^G$, and $I^{gt}$ denote the target image, the guidance image and the ground truth output, respectively.
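The sum of squared losses described above can be computed as in the following short NumPy sketch. The function name and the list-of-arrays interface are our own choices for illustration; the predictions stand for the outputs of the joint filtering operator on each training sample.

```python
import numpy as np

def sum_squared_loss(predictions, ground_truths):
    """Sum over samples of the squared L2 distance between prediction and ground truth."""
    return sum(float(np.sum((gt - pred) ** 2))
               for pred, gt in zip(predictions, ground_truths))

# Toy usage with two 2x2 "images": the first prediction is off by 1 at every pixel.
preds = [np.zeros((2, 2)), np.ones((2, 2))]
gts = [np.ones((2, 2)), np.ones((2, 2))]
loss = sum_squared_loss(preds, gts)  # four pixels, each contributing 1.0
```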

Fig. 4 panels: (a) GT depth, (b) Guidance, (c) Bicubic, (d) RGB guided, (e) Edge guided, (f) Ours.
Fig. 4: Comparison of different types of guidance. Joint depth upsampling (8×) results using different types of guidance images. Both (d) and (e) are trained using the $\mathrm{CNN}_F$ network. Our method generates sharper boundaries of the sculpture (left) and the cone (middle).
Fig. 5 panels: (a) Target input, (b) Residual output, (c) Filtering output, (d) Ground truth.
Fig. 5: Residual prediction. Joint depth upsampling results (8×) of using our network with a skip connection. The filtering output (c) is the summation of (a) the target input and (b) the residual predicted by the network.
Fig. 6 panels: (a) Target input depth, (b) GT depth, (c) Upsample by the flow model, (d) Upsample by the depth model, (e) Target input flow, (f) GT flow, (g) Upsample by the depth model, (h) Upsample by the flow model.
Fig. 6: Effect of training data modalities. (a)-(d) Joint depth map upsampling (8×). The model trained with RGB/flow data generates similar results when compared with the model trained with RGB-D data. (e)-(h) Joint flow map upsampling (8×): (g) the model trained with RGB-D data and (h) the model trained with RGB/flow data.

3.2 Skip connection

As the goal of the joint image filter is to leverage the signals from the guidance image to enhance the degraded target image, the input target image and the desired output share the same low-frequency components. We thus introduce a skip connection to enforce the network to focus on learning the residuals rather than predicting the actual pixel values. With the skip connection, the network does not need to learn the identity mapping function from the input target image to the desired output in order to preserve the low-frequency contents. Instead, the network learns to predict the sparse residuals in important regions (e.g., object contours). In Figure 5, we show an example of the predicted residuals, which highlights the estimated difference between the target input (Figure 5(a)) and the ground truth (Figure 5(d)). Quantitative results in Table I show that with the skip connection, the proposed algorithm obtains notable improvements over the method by Li et al. [14].
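A toy one-dimensional example may help to show why the residuals are sparse. In the sketch below, the degradation is a 3-tap box blur with edge replication, which is our own stand-in for a low-quality target and not the degradation used in the experiments; the function names are likewise hypothetical.

```python
import numpy as np

def residual_target(degraded, ground_truth):
    """What the network is asked to predict when a skip connection is used."""
    return ground_truth - degraded

def reconstruct(degraded, predicted_residual):
    """The skip connection adds the degraded input back to the prediction."""
    return degraded + predicted_residual

# A piecewise-constant "depth" row with a single discontinuity.
gt = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
# Assumed degradation: 3-tap box blur with replicated borders.
degraded = np.convolve(np.pad(gt, 1, mode="edge"), np.ones(3) / 3.0, mode="valid")

residual = residual_target(degraded, gt)
num_nonzero = int(np.count_nonzero(np.abs(residual) > 1e-9))
restored = reconstruct(degraded, residual)
```

In this example the residual is non-zero only at the two samples next to the discontinuity, while the smooth regions require no correction at all.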

Fig. 7: Visualization of feature responses. Sample feature responses of the input in Figure 9(a) at the first layer of $\mathrm{CNN}_T$ (top) and $\mathrm{CNN}_G$ (middle), and the second layer of $\mathrm{CNN}_F$ (bottom). Pixels with darker intensities indicate stronger responses. Note that with the help of $\mathrm{CNN}_F$, inconsistent structures (e.g., the window on the wall) are correctly suppressed.
(a) Guidance (b) Ground truth (c) JBU [4] (d) Park [2] (e) Ours
Fig. 8: Selective transfer. Comparisons of different joint upsampling methods on handling the texture-copying issue. The carpet on the floor contains grid-like texture structures that may be incorrectly transferred to the target image.

3.3 Network training

Since the target and guidance image pair can be from various modalities (e.g., RGB-D, RGB/NIR), it is infeasible and costly to collect large datasets and train one network for each type of data pair separately. The goal of our network training, however, is not predicting specific pixel values in one particular modality. Instead, we aim to train the network so that it can selectively transfer structures by leveraging the prior from the guidance image. Consequently, we need to train the network with only one type of image data and can then apply it to other domains.

To demonstrate that the proposed method is insensitive to the training data modality, we train the network with either the RGB-D dataset [36] or the RGB/flow dataset [37]. We conduct a cross-dataset evaluation (training with one type and evaluating on the other) and show exemplary results in Figure 6. Figure 6(a)-(d) shows the upsampled depth maps using models trained with different domains of image data. The flow model refers to the one trained with RGB/flow data for flow map upsampling, while the depth model is trained with RGB-D data for depth map upsampling. In Figure 6(c), we apply the flow model to upsample the degraded depth map and obtain results competitive with those of the depth model (Figure 6(d)). Similar observations on flow map upsampling are also found in Figure 6(e)-(h). The models trained with the flow and depth data achieve similar performance. More filtering results are shown in Section 4, where we evaluate the model with different image data from various domains. More quantitative results are presented in Table I.

3.4 What has the network learned?

Selective transferring.

Transferring details using the learned guidance alone may sometimes be erroneous. In particular, the structures extracted from the guidance image may not exist in the target image. The top and middle rows of Figure 7 show typical responses at the first layer of $\mathrm{CNN}_T$ and $\mathrm{CNN}_G$. These two sub-networks show strong responses to edges from the target and guidance images, respectively. Note that there are inconsistent structures in the guidance and target images, e.g., the window on the wall. The bottom row of Figure 7 shows sample responses at the second layer of $\mathrm{CNN}_F$. We observe that the sub-network $\mathrm{CNN}_F$ suppresses inconsistent details.

(a) Guidance (b) Depth map (c) Learned guidance (d) Edge map [35]
Fig. 9: Visualization of the learned guidance map. Comparison between the learned guidance feature maps from $\mathrm{CNN}_G$ and edge maps from [35]. The guidance network $\mathrm{CNN}_G$ is capable of extracting informative, salient structures from the guidance image for content transfer.

We present another example in Figure 8. We note that the ground truth depth map of the selected region is smooth. However, due to the high contrast patterns on the mat in the guidance image, several methods, e.g., [4, 2], incorrectly transfer the mat structure to the upsampled depth map. The reason is that these methods [4, 2] rely only on structures in the guidance image. The problem, commonly known as texture-copying artifacts, often occurs when the texture in the guidance image has strong color contrast. With the help of the filter network $\mathrm{CNN}_F$, our method successfully blocks the texture structure in the guidance image (Figure 8(e)).

Output of $\mathrm{CNN}_G$.

In Figure 9(c), we visualize the learned guidance from $\mathrm{CNN}_G$ using two examples from the NYU v2 dataset [36]. In general, the learned guidance appears to be similar to an edge map highlighting the salient structures in the guidance image. We show edge detection results from [35] in Figure 9(d). Both results show strong responses to the main structures, but the guidance map generated by $\mathrm{CNN}_G$ appears to detect sharper boundaries while suppressing responses to small-scale textures, e.g., the wall in the first example. The result suggests that using only $\mathrm{CNN}_F$ (Figure 3(c)) does not perform well due to the lack of the salient feature extraction step from the sub-network $\mathrm{CNN}_G$.


3.5 Relationship to prior work

The proposed framework is closely related to weighted-average, optimization-based, and CNN-based models. In each layer of the network, the convolutional filters also perform a weighted-average process. In this context, our filter is similar to the weighted-average filters. The key difference is that our weights are learned from data by considering the contents of both the target and guidance images, while weighted-average filters (e.g., bilateral filters) depend only on the guidance image. Compared with optimization-based joint filters, our network plays a role similar to that of the fidelity and regularization terms in the optimization methods by minimizing the error in (1). Through a data-driven approach and the incorporation of the skip connection, our model implicitly ensures that the output does not deviate significantly from the target image while sharing salient structures with the guidance image. For CNN-based models, our network architecture can be viewed as a unified model for different tasks. For example, if we remove $\mathrm{CNN}_G$ and use only $\mathrm{CNN}_T$ and $\mathrm{CNN}_F$, the resulting network architecture resembles an image restoration model, e.g., SRCNN [26]. On the other hand, if we remove the network $\mathrm{CNN}_T$, the remaining networks $\mathrm{CNN}_G$ and $\mathrm{CNN}_F$ can be viewed as one using CNNs for depth prediction [38].

| Method | Middlebury [41, 42] 4× | 8× | 16× | Lu [39] 4× | 8× | 16× | NYU v2 [36] 4× | 8× | 16× | SUN RGB-D [40] 4× | 8× | 16× |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bicubic | 4.44 | 7.58 | 11.87 | 5.07 | 9.22 | 14.27 | 8.16 | 14.22 | 22.32 | 2.09 | 3.45 | 5.48 |
| MRF [18] | 4.26 | 7.43 | 11.80 | 4.90 | 9.03 | 14.19 | 7.84 | 13.98 | 22.20 | 1.99 | 3.38 | 5.45 |
| GF [6] | 4.01 | 7.22 | 11.70 | 4.87 | 8.85 | 14.09 | 7.32 | 13.62 | 22.03 | 1.91 | 3.31 | 5.41 |
| JBU [4] | 2.44 | 3.81 | 6.13 | 2.99 | 5.06 | 7.51 | 4.07 | 8.29 | 13.35 | 1.37 | 2.01 | 3.15 |
| TGV [3] | 3.39 | 5.41 | 12.03 | 4.48 | 7.58 | 17.46 | 6.98 | 11.23 | 28.13 | 1.94 | 3.01 | 5.87 |
| Park [2] | 2.82 | 4.08 | 7.26 | 4.09 | 6.19 | 10.14 | 5.21 | 9.56 | 18.10 | 1.78 | 2.76 | 4.77 |
| Ham [13] | 3.14 | 5.03 | 8.83 | 4.65 | 7.53 | 11.52 | 5.27 | 12.31 | 19.24 | 1.67 | 2.60 | 4.36 |
| DMSG [15] | 1.79 | 3.39 | 5.87 | 1.84 | 4.24 | 7.19 | 3.78 | 6.37 | 11.16 | 1.33 | 1.82 | 2.87 |
| FBS [16] | 2.58 | 4.19 | 7.30 | 3.03 | 5.77 | 8.48 | 4.29 | 8.94 | 14.59 | 1.58 | 2.27 | 3.76 |
| Ours-flow | 2.31 | 3.95 | 6.34 | 2.87 | 5.14 | 8.08 | 4.42 | 7.32 | 11.62 | 1.36 | 1.91 | 2.90 |
| DJF [14] | 2.14 | 3.77 | 6.12 | 2.54 | 4.71 | 7.66 | 3.54 | 6.20 | 10.21 | 1.28 | 1.81 | 2.78 |
| Ours | 1.98 | 3.61 | 6.07 | 2.22 | 4.54 | 7.48 | 3.38 | 5.86 | 10.11 | 1.27 | 1.77 | 2.75 |

TABLE I: Quantitative comparisons on depth upsampling. Comparisons with the state-of-the-art methods in terms of RMSE for upsampling factors of 4×, 8×, and 16×. The depth values are scaled to the range for the Middlebury, Lu [39] and SUN RGB-D [40] datasets. For the NYU v2 dataset [36], the depth values are measured in centimeters. Note that the depth maps in the SUN RGB-D dataset may contain missing regions due to the limitation of depth sensors. We ignore these pixels in calculating the RMSE. Numbers in bold indicate the best performance and underscored numbers indicate the second best.
(a) Guidance (b) GT (c) GF [6] (d) JBU [4] (e) TGV [3] (f) Park [2] (g) Ours
Fig. 10: Qualitative comparisons on depth upsampling. Comparisons against existing depth upsampling algorithms for a scaling factor of 8.
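The RMSE metric reported in Table I, with missing depth pixels excluded, can be sketched as follows. This is an illustrative NumPy helper rather than the evaluation code used for the table; the function name and the convention that missing pixels are marked by a boolean validity mask (here derived from zero-valued depth) are assumptions.

```python
import numpy as np

def masked_rmse(pred, gt, valid=None):
    """Root mean squared error over the pixels where `valid` is True."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    diff = (pred - gt)[valid]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy usage: the zero in the ground truth marks a missing measurement (assumed convention).
gt = np.array([[1.0, 2.0], [0.0, 4.0]])
pred = np.array([[1.0, 4.0], [9.0, 4.0]])
score = masked_rmse(pred, gt, valid=gt > 0)  # the large error at the missing pixel is ignored
```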

4 Experimental Results

In this section, we demonstrate the effectiveness and applicability of our approach through a broad range of joint image filtering tasks, including joint image upsampling, texture-structure separation, and cross-modality image restoration. The source code and datasets will be made available to the public. More results can be found at

Network training.

To train our network, we randomly collect 160,000 training patch pairs of pixels from 1,000 RGB and depth images in the NYU v2 dataset [36]. Images in the NYU dataset are real data captured in complicated indoor scenarios. We train two models for two different tasks: (1) joint image upsampling and (2) noise reduction. For the upsampling task, we obtain each low-quality target image by downsampling the ground-truth image (with scale factors of 4×, 8×, and 16×) using nearest-neighbor interpolation. For the noise reduction task, we generate the low-quality target image by adding Gaussian noise with zero mean and a variance of 1e-3 to each of the ground-truth depth maps. We use the MatConvNet toolbox [43] to train our joint filters. We set the learning rate of the first two layers as 1e-3 and the third layer as 1e-4.
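The two degradation procedures described above can be sketched in NumPy as follows. The helper names are hypothetical, strided slicing is only one possible nearest-neighbor sampling convention (the exact pixel offsets used in the experiments are not specified here), and the noise variance is assumed to refer to depth values normalized to [0, 1].

```python
import numpy as np

def downsample_nearest(img, factor):
    """Nearest-neighbor downsampling by keeping every `factor`-th pixel (assumed convention)."""
    return img[::factor, ::factor]

def add_gaussian_noise(img, variance=1e-3, rng=None):
    """Add zero-mean Gaussian noise with the given variance."""
    rng = np.random.default_rng() if rng is None else rng
    return img + rng.normal(0.0, np.sqrt(variance), size=img.shape)

# Toy usage on a synthetic 64x64 "depth map" with values in [0, 1].
rng = np.random.default_rng(0)
depth = np.linspace(0.0, 1.0, 64 * 64).reshape(64, 64)
low_res = {s: downsample_nearest(depth, s) for s in (4, 8, 16)}
noisy = add_gaussian_noise(depth, variance=1e-3, rng=rng)
```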


Using RGB-D data for training, our model takes a 1-channel target image (depth map) and a 3-channel guidance image (RGB) as inputs. However, the trained model can be applied to other data types in addition to RGB-D images with simple modifications. For multi-channel target images, we apply the trained model independently to each channel. For single-channel guidance images, we replicate the channel three times to create a 3-channel guidance image.
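These two modifications amount to a thin wrapper around the trained model, as in the sketch below. Here `filter_fn` is a hypothetical callable that stands for the trained joint filter and maps a (1, H, W) target and a (3, H, W) guidance image to a (1, H, W) output; the channel-first layout and the stand-in filter used in the example are assumptions made for illustration.

```python
import numpy as np

def to_three_channels(guidance):
    """Replicate a single-channel guidance image three times; pass 3-channel input through."""
    if guidance.ndim == 2:
        guidance = guidance[None]
    if guidance.shape[0] == 1:
        guidance = np.repeat(guidance, 3, axis=0)
    return guidance

def filter_multichannel(target, guidance, filter_fn):
    """Apply a 1-channel joint filter independently to each channel of `target`."""
    guidance3 = to_three_channels(guidance)
    outputs = [filter_fn(channel[None], guidance3) for channel in target]
    return np.concatenate(outputs, axis=0)

# Stand-in for the trained model, used only to exercise the wrapper.
def dummy_filter(target, guidance):
    return target + guidance.mean(axis=0, keepdims=True)

color_target = np.zeros((3, 4, 4))   # e.g., a 3-channel target image
gray_guidance = np.ones((4, 4))      # e.g., a single-channel guidance image
result = filter_multichannel(color_target, gray_guidance, dummy_filter)
```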

4.1 Depth map upsampling

| Method | MRF [18] | GF [6] | JBU [4] | TGV [3] | Park [2] | Ham [13] | DMSG [15] | FBS [16] | Ours |
|---|---|---|---|---|---|---|---|---|---|
| Time (s) | 0.76 | 0.08 | 5.6 | 68 | 45 | 8.6 | 0.71 | 0.34 | 1.3 |

TABLE II: Run-time performance comparisons. Average run-time (in seconds) of depth map upsampling algorithms on images of size pixels in the NYU v2 dataset.


We present quantitative performance evaluation on joint depth upsampling using four benchmark datasets where the corresponding high-resolution RGB images are available:

  • Middlebury dataset [41, 42]: We collect 30 images from the 2001-2006 datasets with the missing depth values provided by Lu et al. [39].

  • Lu [39]: This dataset contains six real depth maps captured with the ASUS Xtion Pro camera.

  • NYU v2 dataset [36]: As we use 1,000 images in this dataset for training, we use the remaining 449 images for testing.

  • SUN RGB-D [40]: We use a random subset of 2,000 high-quality RGB-D image pairs from the 3,784 pairs captured by the Kinect V2 sensor. These images are captured from a variety of complicated indoor scenes.

(a) Scribbles (b) Levin [44] (8.2s) (c) Bicubic (1.3s) (d) GF [6] (1.5s) (e) Ham [13] (28.8s) (f) Ours (2.8s)
Fig. 11: Colorization upsampling. Joint image upsampling applied to colorization. We also list the required time (in seconds) for the colorization upsampling process for each method. The close-up areas show that our joint upsampling results (f) have fewer color bleeding artifacts when compared with other competing algorithms (c-e). Our visual results (f) are comparable with the results computed using the full resolution image in (b).

Evaluated methods.

We compare our model against several state-of-the-art joint image filters for depth map upsampling. The JBU [4], GF [6], Ham [13] and FBS [16] methods are generic joint image upsampling methods. On the other hand, the MRF [18], TGV [3], Park [2] and DMSG [15] algorithms are designed specifically for image-guided depth upsampling. Following the experimental protocols for evaluating joint depth upsampling algorithms [2, 3, 13], we obtain the low-resolution target image from the ground-truth depth map using the nearest-neighbor downsampling method.

Quantitative comparisons.

Table I shows the quantitative results in terms of the root mean squared errors (RMSE). For other methods, we use the default parameters in the original implementations. The proposed algorithm performs well against state-of-the-art methods on both synthetic and real datasets. The extensive evaluations on real depth maps demonstrate the effectiveness of our algorithm in handling complicated indoor scenes in the real world. We also compare the average run-time of different methods on the NYU v2 dataset in Table II. We carry out all the experiments on the same machine with an Intel i7 3.6GHz CPU and 16GB RAM. Among all the evaluated methods, the proposed algorithm is efficient while delivering high-quality upsampling results.
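For reference, the RMSE metric used in Table I can be sketched as follows. The optional `valid_mask` argument is an assumption added here to skip pixels with missing depth values; it is not described in the text.

```python
import numpy as np

def rmse(pred, gt, valid_mask=None):
    """Root mean squared error between a prediction and the ground truth.

    `valid_mask` (hypothetical) selects the pixels that enter the average,
    e.g. to ignore pixels whose ground-truth depth is missing.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    sq_err = (pred - gt) ** 2
    if valid_mask is not None:
        sq_err = sq_err[np.asarray(valid_mask, dtype=bool)]
    return float(np.sqrt(sq_err.mean()))

# Toy example: a single pixel is off by 2 depth units.
pred = [[1.0, 2.0], [3.0, 4.0]]
gt = [[1.0, 2.0], [3.0, 6.0]]
error_all = rmse(pred, gt)
error_masked = rmse(pred, gt, valid_mask=[[True, True], [True, False]])
```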

The concurrent DMSG method by Hui et al. [15] outperforms the proposed algorithm on synthetic depth datasets. This can be attributed to several reasons. First, Hui et al. [15] leverage multi-scale guidance data while we use only single-scale signals. The multi-scale design requires more network parameters to learn. For example, the model size of the upsampling model (8×) in [15] is 1,822 KB compared to our model size of 526 KB. Second, the model in [15] is trained on a collection of synthetic datasets (82 images) [41, 42, 37]. In contrast, our model is trained on a large real dataset (1,000 images). When evaluating the algorithms on real depth datasets [36, 40], we find that the model of [15] does not generalize well due to the different characteristics of synthetic and real depth maps. Table I shows that the proposed algorithm performs favorably against the other state-of-the-art methods on real datasets, which suggests its practical applicability to real-world applications. Another important difference is that the model in [15] is designed only for depth upsampling. Our approach, on the other hand, can be applied to generic joint image filtering tasks.

(a) Low-res saliency [45] (b) Bicubic (c) GF [6] (d) Ham [13] (e) Ours
Fig. 12: Saliency map upsampling. Visual comparisons of saliency map upsampling results (10×). (a) Low-res saliency map obtained from the downsampled RGB image (inset: guidance image).
(a) Input (b) RGF [9] (c) L0 [46] (d) Xu [8] (e) Kopf [47] (f) Ours
Fig. 13: Inverse halftoning. For each method, we carefully select the parameter for the optimal results. (b) . (c) . (d) . (e) Result of [47]. Note that [47] is an algorithm specifically designed for reconstructing halftoned images.
(a) Input (b) RGF [9] (c) L0 [46] (d) Xu [8] (e) Ham [13] (f) Ours
Fig. 14: Texture removal. For each method, we carefully select the parameter for the optimal results. (b) . (c) . (d) . (e) (f) Ours.

Effects of skip connection.

We validate the contribution of the skip connection by comparing the DJF [14] method with the proposed algorithm (bottom two rows of Table I). In Section 5, we show that it is difficult to gain further improvement by simply modifying network parameters such as the filter size, filter number, and network depth. With the skip connection, however, the proposed algorithm obtains a significant performance improvement. The gain can be explained as follows: the skip connection alleviates the problem that the network only learns the appearance of the target input image, and helps the network focus on learning the residuals instead.
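The residual formulation can be illustrated with a small sketch. It assumes that the skip connection adds the target input to the network output, so that the network only has to predict a correction; `residual_fn` is a hypothetical stand-in for the learned network and not the actual model.

```python
import numpy as np

def filter_with_skip(target, guidance, residual_fn):
    """Output = target + predicted residual (assumed form of the skip link)."""
    return target + residual_fn(target, guidance)

def zero_residual(target, guidance):
    # A network that predicts nothing already reproduces the target input.
    return np.zeros_like(target)

def toy_residual(target, guidance):
    # Hypothetical correction derived from the guidance image.
    return 0.1 * guidance

target = np.array([[1.0, 2.0], [3.0, 4.0]])
guidance = np.ones((2, 2))

identity_out = filter_with_skip(target, guidance, zero_residual)
corrected_out = filter_with_skip(target, guidance, toy_residual)
```

With this formulation, reproducing the appearance of the target costs no network capacity, which is consistent with the explanation above.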

Effects of training modality.

To validate the effect of training with different modalities, we compare our model with a variant that is trained with RGB/flow data (Ours-flow). We randomly select 1,000 RGB/flow image pairs from the Sintel dataset [37] and collect 80,000 training patch pairs of 32×32 pixels. We use either the x-component or the y-component of the optical flow as the target image. During the testing phase, we apply the trained model independently to each channel of the target image. Although the model Ours-flow is trained with the RGB/flow data for optical flow upsampling, Ours-flow performs favorably against Ours on the task of depth upsampling using the RGB-D data.
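Applying a single-channel model independently to each channel of the target can be sketched as below. The callable `single_channel_filter` is a hypothetical placeholder for the trained model, and the channel-last array layout is an assumption.

```python
import numpy as np

def filter_per_channel(target, guidance, single_channel_filter):
    """Run a single-channel joint filter on each channel of `target`.

    Assumes a channel-last layout (H, W) or (H, W, C); the same guidance
    image is shared by all channels.
    """
    target = np.asarray(target)
    if target.ndim == 2:
        return single_channel_filter(target, guidance)
    channels = [
        single_channel_filter(target[..., c], guidance)
        for c in range(target.shape[-1])
    ]
    return np.stack(channels, axis=-1)

# Toy stand-in for the trained model: it simply doubles the target.
def double(t, g):
    return 2.0 * t

flow = np.ones((4, 4, 2))        # x- and y-components of a flow field
rgb = np.zeros((4, 4, 3))        # guidance image
filtered_flow = filter_per_channel(flow, rgb, double)
```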

Visual comparisons.

We show five indoor real examples and one synthetic example (bottom) for qualitative comparisons in Figure 10. It is worth noticing that the proposed joint filter selectively transfers salient structures in the guidance image while avoiding texture-copying artifacts (see the green boxes). The GF [6] method does not recover the degraded boundary well under a large upsampling factor (e.g., 8×). The JBU [4], TGV [3] and Park [2] approaches are agnostic to structural consistency between the target and guidance images, and thus transfer erroneous details. In contrast, the results generated by the proposed algorithm are smoother, sharper and more accurate with respect to the ground truth.

4.2 Joint image upsampling

Numerous computational photography applications require obtaining a solution map (e.g., chromaticity, saliency, disparity, labels) over the pixel grid. However, it is often time-consuming or memory-intensive to compute the high-resolution solution maps directly. An alternative is to first obtain a low-resolution solution map over the downsampled pixel grid and then upsample it back to the original resolution with a joint image upsampling algorithm. Such a pipeline requires the upsampling method to restore the degradation caused by downsampling and to avoid inconsistency issues. In what follows, we demonstrate the use of the learned joint image filters for colorization and saliency as examples.
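The two-stage pipeline described above can be sketched as follows. Both `solver` and `joint_upsampler` are hypothetical callables: in practice the first would be the expensive per-pixel algorithm and the second a joint image filter. The toy stand-ins below exist only to make the sketch executable, and plain subsampling is assumed for the downsampling step.

```python
import numpy as np

def solve_then_upsample(image, factor, solver, joint_upsampler):
    """Solve on a downsampled grid, then upsample with the full image as guidance."""
    low_res_image = image[::factor, ::factor]      # assumed: plain subsampling
    low_res_solution = solver(low_res_image)       # expensive step, small grid
    return joint_upsampler(low_res_solution, image)

def toy_solver(img):
    # Stand-in for the expensive algorithm: normalize intensities to [0, 1].
    return img / img.max()

def nearest_upsample(low_res, guidance):
    # Stand-in for a joint filter: pixel replication that ignores the guidance
    # content and only uses its size.
    h, w = guidance.shape[:2]
    fy = -(-h // low_res.shape[0])   # ceiling division
    fx = -(-w // low_res.shape[1])
    up = np.repeat(np.repeat(low_res, fy, axis=0), fx, axis=1)
    return up[:h, :w]

image = np.arange(64, dtype=np.float64).reshape(8, 8)
solution = solve_then_upsample(image, 4, toy_solver, nearest_upsample)
```

The expensive solver only ever sees the small grid; the quality of the final map then depends on how well the upsampler exploits the guidance image.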

(a) Noisy RGB (b) Guided NIR (e) Noisy Non-Flash (f) Guided Flash
(c) Restoration [5] (d) Ours (g) Restoration [5] (h) Ours
Fig. 15: Cross-modality filtering for noise reduction. (a)-(d) Results of noise reduction using RGB/NIR image pairs. Target: RGB, Guidance: NIR. (e)-(h) Results of noise reduction using flash/non-flash image pairs. Target: Non-Flash, Guidance: Flash.

For the colorization task, we first compute the chromaticity map on the downsampled (4×) image using the user-specified color scribbles [44]. We then use the original high-resolution intensity image as the guidance image to jointly upsample the low-resolution chromaticity map. Figure 11 shows that our model efficiently produces visually pleasing results with fewer color bleeding artifacts. Our results are visually similar to the direct solutions on the high-resolution intensity images (Figure 11(b)). The quantitative comparisons are presented in the first row of Table III. We use the direct solution of [44] on the high-resolution image as ground truth and compute the RMSE over seven test images in [44]. Table III shows that our method achieves the lowest error. Note that our pipeline (low-res result + joint upsampling) is nearly three times faster (2.8 seconds) than directly running the colorization algorithm [44] on the original pixel grid to obtain the high-resolution result (8.4 seconds).

                           Bicubic   GF [6]   Ham [13]   Ours
RMSE (colorization)        6.01      5.74     6.31       5.48
F-measure (saliency)       0.759     0.766    0.763      0.778
TABLE III: Quantitative comparisons of different upsampling methods on different solution maps.

For saliency detection, we first compute the saliency map on the downsampled (10×) image using the manifold ranking method by Yang et al. [45]. We then use the original high-resolution intensity image as guidance to upsample the low-resolution saliency map. Figure 12 shows the saliency detection results of the state-of-the-art methods and the proposed algorithm. Overall, the proposed algorithm generates sharper edges than the alternatives. In addition, we present a quantitative evaluation using the ASD benchmark dataset [48], which consists of 1,000 images with labeled ground truth. Table III shows the comparison between different upsampling methods and our approach in terms of F-measure [49]. The experimental results demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.
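A common way to compute an F-measure for a saliency map is sketched below. The fixed binarization threshold and the weight β² = 0.3 are assumptions borrowed from common practice in salient object detection; the text does not state which settings are used for Table III.

```python
import numpy as np

def f_measure(saliency, gt_mask, threshold=0.5, beta_sq=0.3):
    """Weighted F-measure of a binarized saliency map.

    Assumptions: a fixed `threshold` for binarization and beta^2 = 0.3,
    neither of which is specified in the text.
    """
    pred = np.asarray(saliency) >= threshold
    gt = np.asarray(gt_mask, dtype=bool)
    true_pos = np.logical_and(pred, gt).sum()
    if true_pos == 0:
        return 0.0
    precision = true_pos / pred.sum()
    recall = true_pos / gt.sum()
    return float((1 + beta_sq) * precision * recall
                 / (beta_sq * precision + recall))

# Toy example: two pixels are predicted salient, one of them correctly.
saliency = np.array([[0.9, 0.8], [0.1, 0.2]])
gt_mask = np.array([[1, 0], [0, 0]])
score = f_measure(saliency, gt_mask)
```

In the toy example the precision is 0.5 and the recall is 1, so the score equals 0.65 / 1.15.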

4.3 Structure-texture separation

We apply our model trained for noise reduction to the task of structure-texture separation. Here we use the target image itself as the guidance and adopt a similar strategy as in the rolling guidance filter (RGF) [9] to remove small-scale textures.
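The iterative strategy can be sketched as a simple loop in which the original image remains the target while the previous output is fed back as guidance. The number of iterations and the choice of the input image as the initial guidance are assumptions; `joint_filter` is a hypothetical stand-in for the trained model.

```python
import numpy as np

def rolling_filter(image, joint_filter, num_iters=4):
    """Iteratively filter `image`, feeding each output back as the guidance.

    Assumptions: the first guidance is the image itself and `num_iters`
    is a free parameter; neither is specified in the text.
    """
    guidance = image
    for _ in range(num_iters):
        guidance = joint_filter(image, guidance)
    return guidance

# Toy stand-in for the trained model: halves the guidance, ignores the target.
def toy_filter(target, guidance):
    return 0.5 * guidance

image = np.full((4, 4), 8.0)
smoothed = rolling_filter(image, toy_filter, num_iters=3)
```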

We first use the inverse halftoning task as an example. A halftoned image is generated by the reprographic technique that simulates continuous tone imagery using various dot patterns [47], as shown in Figure 13(a). The goal of inverse halftoning is to remove these dots while preserving the main structures. We compare our results with those of the RGF [9], L0 [46], and Xu [8] methods, as well as the method of Kopf [47] for reconstructing halftoned images. For each method, we carefully select the parameter (listed in Figure 13) for the optimal results to ensure that the dots are removed. However, the RGF [9], L0 [46], and Xu [8] methods tend to over-smooth the salient structures (e.g., lines). Figure 13 shows that our filter preserves edges well and achieves performance comparable to that of Kopf [47], which is an algorithm specifically designed for reconstructing halftoned images.

Figure 14 shows another application of our filter for image smoothing. The goal here is to remove insignificant details while retaining and sharpening salient edges. We compare our algorithm with the RGF [9], L0 [46], Xu [8], and Ham [13] methods, and select the corresponding parameters for the optimal results. The proposed algorithm achieves results comparable to those of the other methods. While removing details in the petals and the background, the proposed algorithm also preserves the silhouette of the green bush on the right, whereas the other methods over-smooth this region.

4.4 Cross-modality filtering for noise reduction

Here, we demonstrate that our model can handle various visual domains through two noise reduction applications using RGB/NIR and flash/non-flash image pairs. Figure 15 (a)-(d) show sample results on joint image denoising with the NIR guidance image. The filtering results by our method are comparable to those of the state-of-the-art technique [5]. For flash/non-flash image pairs, we aim to merge the ambient qualities of the no-flash image with the high-frequency details of the flash image. Guided by a flash image, the filtering result of our method is comparable to that of [5], as shown in Figure 15 (e)-(h).

5 Discussions

In this section, we analyze how the performance varies under different hyper-parameter settings using the network architecture in Figure 2. Next, we discuss several limitations of the proposed algorithm. We evaluate the proposed network model without and with the skip link. As [34] suggests that the number of layers does not play a significant role for low-level tasks, we mainly vary the filter number and filter size of each of the three layers in each sub-network. We use the task of depth map upsampling (8×) as an example. In addition, we use the same training process as described in Section 4 and evaluate on the NYU v2 dataset [36] for 8× upsampling (449 test images) in terms of RMSE.

Without skip link   6.40   6.44   6.32   6.35
With skip link      5.82   5.84   5.90   5.97
TABLE IV: Quantitative results (RMSE in centimeters for 8×) of using different filter numbers in each sub-network. We apply the same parameters to three sub-networks. Top: without the skip link, Bottom: with the skip link.

5.1 Filter number

We first analyze the effects of the number of filters in the first two layers of each sub-network. The quantitative results are shown in Table IV. Without the skip connection (top row), we observe that a larger filter number does not always improve performance because it makes the network more difficult to train. The results suggest that the performance of such a network design saturates once the number of filters is sufficient. To obtain further improvements, we need to adjust the network design or the learning objectives rather than simply modifying hyper-parameters.

This hypothesis is supported by the setting with the skip link, where we add a skip connection to the entire network and reformulate the network as learning residual functions. The bottom row of Table IV shows that increasing the filter number does yield progressive improvements. This is in accordance with the observation in [30, 32] that residual learning is more effective for training networks with larger capacity. However, a larger network also slows down the training process and may provide only marginal performance improvements. Consequently, the selected hyper-parameters of our method (shown in Figure 2) strike a good balance between accuracy and computational efficiency.

Furthermore, we discuss the effects of the number of output channels of the target and guidance sub-networks and show the results in Table V. Intuitively, using multi-dimensional features may improve the model capacity and therefore its performance. However, our experimental results indicate that using multi-dimensional feature maps only slows down the training process without a clear performance gain, both without and with the skip link. Therefore, we set the output feature maps extracted from the target and guidance images to a single channel. The outputs of these two sub-networks can be viewed as a transformed version of the original target/guidance input pair.

5.2 Filter size

We examine the sensitivity of the network to the spatial support of the filters. With all the other experimental settings kept the same, we gradually increase the filter size in the different layers and show the corresponding performance in Table VI.

Starting from small filter sizes, we observe a steady trend of improvement as the filter sizes increase. This is because smaller filters restrict the network to focus on local smooth regions that provide little information for restoration. In contrast, a reasonably large filter size can cover richer structural cues that lead to better results. However, when we further enlarge the filter sizes, we do not see additional performance gain. We attribute this to the increasing difficulty of network training, because larger filter sizes mean that more parameters need to be learned. Consequently, we choose the filter sizes of our final model (shown in Figure 2) as a good trade-off between efficiency and performance.
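To make the link between filter size and the number of learnable parameters concrete, the following sketch counts the parameters of a single convolutional layer. The filter sizes and channel counts in the example are hypothetical and are not the settings used in our network.

```python
def conv_params(filter_size, in_channels, out_channels, bias=True):
    """Number of learnable parameters in one 2D convolutional layer."""
    weights = filter_size * filter_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)

# Hypothetical layer: 1 input channel, 32 output channels.
small = conv_params(3, 1, 32)   # 3x3 filters
large = conv_params(9, 1, 32)   # 9x9 filters
```

The weight count grows quadratically with the filter size, so tripling the filter size in this example raises the number of weights by a factor of nine.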

Without skip link   6.20   6.40   6.24   6.34
With skip link      5.86   6.11   5.93   6.02
TABLE V: Quantitative results (RMSE in centimeters for 8×) of using different filter numbers in the 3rd layer of the target and guidance sub-networks. Top: without the skip link, Bottom: with the skip link.

Without skip link   6.28   6.40   6.20   6.47   6.62
With skip link      5.93   6.05   5.86   6.06   6.24
TABLE VI: Quantitative results (RMSE in centimeters for 8×) of using different filter sizes in each sub-network. Top: without the skip link, Bottom: with the skip link.

5.3 Limitations

We note that in some images, our model fails to transfer small-scale details from the guidance map. In such cases, our model incorrectly treats certain small-scale details as noise. This can be explained by the fact that our training data consists of depth images, which are mostly smooth and do not contain many spatial details.

Figure 16 shows two examples of flash/non-flash pairs for noise reduction. There are several spotty textures on the porcelain in the guided flash image that should have been preserved when filtering the noisy non-flash image. Similarly, our method is not able to effectively transfer the small-scale strip textures on the carpet to the target image. Compared with the method by Petschnigg et al. [12] (Figure 16(b) and (d)), which is designed specifically for flash/non-flash images, our filter treats these small-scale details as noise and tends to over-smooth the contents. In future work, we will collect more training data from other domains (e.g., flash/non-flash) in addition to depth data to address the over-smoothing problem.

(a) Input (b) Petschnigg et al. [12] (c) Ours
Fig. 16: Failure cases. Detailed small-scale textures (yellow rectangle) in the guidance image are over-smoothed by our filter.

6 Conclusions

In this paper, we present a learning-based approach for joint filtering based on convolutional neural networks. Instead of relying only on the guidance image, we design two sub-networks to extract informative features from the target and guidance images, respectively. These feature maps are then concatenated as inputs to a third sub-network, which selectively transfers salient structures from the guidance image to the target image while suppressing structures that are not consistent in both images. Although we train our network on one type of data (RGB/depth or RGB/flow), our model generalizes well to images in various modalities, e.g., RGB/NIR and flash/non-flash image pairs. We show that the proposed algorithm is computationally efficient and performs favorably against the state-of-the-art techniques on a wide variety of computer vision and computational photography applications, including cross-modal denoising, joint image upsampling, and structure-texture separation.


  • [1] Q. Yang, R. Yang, J. Davis, and D. Nistér, “Spatial-depth super resolution for range images,” in IEEE Conference on Computer Vision and Pattern Recognition, 2007.
  • [2] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon, “High quality depth map upsampling for 3d-tof cameras,” in IEEE International Conference on Computer Vision, 2011.
  • [3] D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, and H. Bischof, “Image guided depth upsampling using anisotropic total generalized variation,” in IEEE International Conference on Computer Vision, 2013.
  • [4] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, “Joint bilateral upsampling,” in ACM SIGGRAPH, 2007.
  • [5] Q. Yan, X. Shen, L. Xu, S. Zhuo, X. Zhang, L. Shen, and J. Jia, “Cross-field joint image restoration via scale map,” in IEEE International Conference on Computer Vision, 2013.
  • [6] K. He, J. Sun, and X. Tang, “Guided image filtering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 6, pp. 1397–1409, 2013.
  • [7] X. Shen, C. Zhou, L. Xu, and J. Jia, “Mutual-structure for joint filtering,” in IEEE International Conference on Computer Vision, 2015.
  • [8] L. Xu, Q. Yan, Y. Xia, and J. Jia, “Structure extraction from texture via relative total variation,” ACM Transactions on Graphics, vol. 31, no. 6, p. 139, 2012.
  • [9] Q. Zhang, X. Shen, L. Xu, and J. Jia, “Rolling guidance filter,” in European Conference on Computer Vision, 2014.
  • [10] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in IEEE International Conference on Computer Vision, 1998.
  • [11] E. Eisemann and F. Durand, “Flash photography enhancement via intrinsic relighting,” in ACM SIGGRAPH, 2004.
  • [12] G. Petschnigg, M. Agrawala, H. Hoppe, R. Szeliski, M. Cohen, and K. Toyama, “Digital photography with flash and no-flash image pairs,” in ACM SIGGRAPH, 2004.
  • [13] B. Ham, M. Cho, and J. Ponce, “Robust image filtering using joint static and dynamic guidance,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [14] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep joint image filtering,” in European Conference on Computer Vision, 2016.
  • [15] T.-W. Hui, C. C. Loy, and X. Tang, “Depth map super-resolution by deep multi-scale guidance,” in European Conference on Computer Vision, 2016.
  • [16] J. T. Barron and B. Poole, “The fast bilateral solver,” in European Conference on Computer Vision, 2016.
  • [17] M.-Y. Liu, O. Tuzel, and Y. Taguchi, “Joint geodesic upsampling of depth images,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • [18] J. Diebel and S. Thrun, “An application of markov random fields to range sensing,” in Neural Information Processing Systems, 2005.
  • [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Neural Information Processing Systems, 2012.
  • [20] V. Jampani, M. Kiefel, and P. V. Gehler, “Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [21] M. Gharbi, J. Chen, J. T. Barron, S. W. Hasinoff, and F. Durand, “Deep bilateral learning for real-time image enhancement,” in ACM SIGGRAPH, 2017.
  • [22] Q. Chen, J. Xu, and V. Koltun, “Fast image processing with fully-convolutional networks,” in IEEE International Conference on Computer Vision, 2017.
  • [23] S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang, “Learning dynamic guidance for depth image enhancement,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [24] H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with bm3d?” in IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • [25] D. Eigen, D. Krishnan, and R. Fergus, “Restoring an image taken through a window covered with dirt or rain,” in IEEE International Conference on Computer Vision, 2013.
  • [26] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in European Conference on Computer Vision, 2014.
  • [27] J. Zhang, J. Pan, W.-S. Lai, R. Lau, and M.-H. Yang, “Learning fully convolutional networks for iterative non-blind deconvolution,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [28] P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “FlowNet: Learning optical flow with convolutional networks,” in IEEE International Conference on Computer Vision, 2015.
  • [29] L. Xu, J. Ren, Q. Yan, R. Liao, and J. Jia, “Deep edge-aware filters,” in International Conference on Machine Learning, 2015.
  • [30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [31] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [32] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [33] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep laplacian pyramid networks for fast and accurate super-resolution,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [34] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2015.
  • [35] P. Dollár and C. L. Zitnick, “Structured forests for fast edge detection,” in IEEE International Conference on Computer Vision, 2013.
  • [36] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European Conference on Computer Vision, 2012.
  • [37] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in European Conference on Computer Vision, 2012.
  • [38] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Neural Information Processing Systems, 2014.
  • [39] S. Lu, X. Ren, and F. Liu, “Depth enhancement via low-rank matrix completion,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [40] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [41] D. Scharstein and C. Pal, “Learning conditional random fields for stereo,” in IEEE Conference on Computer Vision and Pattern Recognition, 2007.
  • [42] H. Hirschmüller and D. Scharstein, “Evaluation of cost functions for stereo matching,” in IEEE Conference on Computer Vision and Pattern Recognition, 2007.
  • [43] A. Vedaldi and K. Lenc, “MatConvNet – convolutional neural networks for matlab,” in ACM Multimedia, 2015.
  • [44] A. Levin, D. Lischinski, and Y. Weiss, “Colorization using optimization,” in ACM SIGGRAPH, 2004.
  • [45] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • [46] L. Xu, C. Lu, Y. Xu, and J. Jia, “Image smoothing via gradient minimization,” in ACM SIGGRAPH ASIA, 2011.
  • [47] J. Kopf and D. Lischinski, “Digital reconstruction of halftoned color comics,” in ACM SIGGRAPH, 2012.
  • [48] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, “Frequency-tuned salient region detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [49] L. Mai and F. Liu, “Comparing salient object detection results without ground truth,” in European Conference on Computer Vision, 2014.