Code for our IEEE TCSVT Paper: Lightweight Modules for Efficient Deep Learning based Image Restoration
Low-level image restoration is an integral component of modern artificial intelligence (AI) driven camera pipelines. Most of these frameworks are based on deep neural networks, which present a massive computational overhead on resource-constrained platforms such as mobile phones. In this paper, we propose several lightweight low-level modules which can be used to create a computationally low-cost variant of a given baseline model. Recent works on efficient neural network design have mainly focused on classification. However, low-level image processing falls under the 'image-to-image' translation genre, which requires some additional computational modules not present in classification. This paper seeks to bridge this gap by designing generic efficient modules which can replace essential components used in contemporary deep learning based image restoration networks. We also present and analyse our results highlighting the drawbacks of applying depthwise separable convolutional kernels (a popular method for efficient classification networks) to sub-pixel convolution based upsampling (a popular upsampling strategy for low-level vision applications). This shows that concepts from the domain of classification cannot always be seamlessly integrated into image-to-image translation tasks. We extensively validate our findings on three popular tasks: image inpainting, denoising and super-resolution. Our results show that the proposed networks consistently output reconstructions visually similar to those of full-capacity baselines, with a significant reduction in parameters, memory footprint and execution time on contemporary mobile devices.
Image restoration refers to the recovery of a clean signal from an observed noisy input. Following the ground-breaking work of Krizhevsky et al. on ImageNet classification with deep neural networks, CNNs have superseded traditional methods across a variety of tasks such as object recognition [26, 66, 65], detection [15, 14, 56], tracking [4, 23], action recognition [7, 22] and segmentation [24, 51], to list a few. Image restoration frameworks have also benefited from the data-driven hierarchical feature learning capability of deep neural networks, with state-of-the-art performance on inpainting [30, 76, 74, 40, 37, 38], denoising [81, 69, 78], super-resolution [39, 67], de-hazing, de-occlusion, 3D surface reconstruction [32, 63], etc. Though these deep learning based restoration frameworks yield photo-realistic outputs, the models are computationally expensive, with millions of parameters. Inference through such complex networks requires billions of floating point operations (FLOPs). This might not be seen as a problem when executing on a GPU-enabled workstation; however, such networks do not scale to resource-constrained platforms such as a commodity CPU or a mobile device. Meanwhile, with the proliferation of multimedia-enabled mobile devices, there is an increased demand for on-device multimedia manipulation. For example, image denoising is a crucial component of the imaging setup in any contemporary smartphone. Super-resolution is also an essential component, because online multimedia hosting sites often prefer to transmit low-resolution images and videos, with super-resolution performed on the device so that the end user enjoys a high-resolution multimedia experience even on a low-bandwidth channel. Similarly, inpainting plays a crucial role in many downstream applications such as image editing, Augmented Reality, and ‘dis-occlusion’ inpainting [45, 42] for novel view synthesis in a multi-camera video capture setting to be integrated with mobile Head Mounted Displays (HMD).
Executing billions of FLOPs on mobile devices leads to fast reduction of battery life with potential heating up of the device. Also, the lag encountered while executing such large models on constrained platform tends to disrupt the engagement of the user. To address the above two issues, in this paper we propose several lightweight computing units which dramatically reduce the computational cost of a given deep neural network without any visual degradation of reconstructed outputs.
Recently, there has been a surge of interest in designing efficient neural networks, mainly for object classification and detection. However, there is a dearth of literature on efficient networks for low-level image restoration. Fundamentally, restoration requires the spatial resolution of the input and output signals to be the same, and the general practice [30, 76, 81] is to follow encoder-decoder based architectures that first down-sample and later up-sample the intermediate feature maps of the network. On the contrary, classification frameworks are mainly concerned with progressive downsampling, and thus efficient strategies to up-sample within a network are not discussed. Also, dense prediction tasks such as inpainting require long-range spatial information and often deploy dilated/atrous convolutions to increase the receptive field of processing. However, dilated convolutions are rarely used in classification frameworks, and thus recent advancements such as separable convolution and group convolution cannot be directly applied to dilated convolution operations.
In this paper we have mainly focused on design principles for components to be used in low-level restoration tasks. The 3×3 kernel is the most commonly used kernel in contemporary low-level vision applications [30, 76, 81, 39]. (Ideally, the kernel shape should also include the number of input channels; for brevity of notation, we henceforth drop the channel dimension.) We therefore introduce the ‘LIght Spatial Transition’ layer (LIST), which simultaneously benefits from local feature aggregation and multi-scale spatial processing and uses up to 24× fewer parameters than a similar 3×3 convolution layer. Next, we introduce the ‘Grouped Shuffled Atrous Transition’ layer (GSAT), an efficient atrous/dilated convolution layer that leverages the recent concepts of group convolution and channel shuffling; each layer uses approximately 7× fewer parameters than a usual dilated convolution layer. While designing an efficient upsampling module, we show that separable convolution kernels are inept at sub-pixel convolution based upsampling, and we provide an analytical justification for the same. Instead, we show that deterministic upsampling such as bilinear upsampling followed by our LIST module provides an efficient upsampling framework. The combination of these modules enables us to run image restoration models on mobile devices with millisecond-level execution time, compared to several seconds for contemporary full-scale models, without any visual degradation of outputs. One of the major advantages of our proposed modules is that they can seamlessly replace commonly used computational blocks such as 3×3 convolution, dilated convolution and differentiable upsampling within a given network. Thus, in this paper we refrain from proposing new end-to-end architectures; instead, we select recent state-of-the-art networks and reduce their computational footprints using our lightweight layers.
In summary, our key technical contributions in this paper are:
- We present the LIST layer as a computationally cheaper alternative to a regular 3×3 convolution layer. Each instance of LIST uses 12× to 24× fewer parameters. Repeated use of LIST in a deep network leads to a significant reduction of parameters and FLOPs.
- We present the GSAT layer, which implements dilated convolution on separate sparse groups of channels to reduce FLOPs, followed by feature mixing for enhanced representation capability. Each instance of the proposed module uses approximately 7× fewer parameters than a regular dilated convolution layer.
- We present our findings on the drawbacks of applying separable convolution to feature upsampling with sub-pixel convolution and provide detailed insight into the possible reason for failure. Instead, we show that deterministic upsampling followed by LIST layer based convolution is an efficient yet accurate alternative.
- We perform an extensive study of our network components on the tasks of image inpainting, denoising and super-resolution. On all tasks we achieve a significant reduction in parameters and FLOPs and massive execution speed-ups on resource-constrained platforms without any compromise in visual quality. Such exhaustive experiments demonstrate the generalizability of our processing components across a variety of low-level restoration tasks.
In recent years deep neural networks have achieved overwhelming success on a variety of computer vision tasks in which network design plays a crucial role. Executing these large models on resource constrained platforms requires efficient design strategies. Recently there has been a surge of interest in either compressing existing pre-trained big networks or designing small networks from scratch.
For training small networks from scratch, factorization of kernels has been a preferred choice. The most common realization is depthwise separable convolution, initially presented in and then popularized by the Inception module. Following that, it has become the backbone of many popular architectures such as MobileNet and MobileNet-V2. The Xception network showed how to scale up depthwise separable convolutions to outperform Inception-V3. Another popular concept, group convolution, was introduced to distribute model parameters over multiple GPUs. Currently, it is utilized in several recent efficient networks [83, 48, 72, 64]. The idea is to make dense convolutions across all feature channels sparse by grouping channels and performing convolution only within each group.
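To make the savings from these two factorizations concrete, the following is a minimal sketch that counts weights for a dense, a grouped and a depthwise separable 3×3 convolution. The function names and the 64-channel example are illustrative assumptions, not taken from the paper; biases are ignored.

```python
def conv_params(c_in, c_out, k=3, groups=1):
    """Weights in a k x k convolution (bias ignored), optionally grouped."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each group maps c_in/groups input channels to c_out/groups output channels.
    return groups * (c_in // groups) * (c_out // groups) * k * k


def depthwise_separable_params(c_in, c_out, k=3):
    """Depthwise k x k (one filter per input channel) followed by pointwise 1 x 1."""
    depthwise = conv_params(c_in, c_in, k, groups=c_in)
    pointwise = conv_params(c_in, c_out, k=1)
    return depthwise + pointwise


dense = conv_params(64, 64)                      # 64 * 64 * 9
grouped = conv_params(64, 64, groups=4)          # 4x fewer weights than dense
separable = depthwise_separable_params(64, 64)   # 9 * 64 + 64 * 64
print(dense, grouped, separable)
```

With four groups the weight count drops by exactly the number of groups, while the separable variant is dominated by its pointwise 1×1 stage.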
Model compression is another genre of approach for efficient inference, based on lossy compression of a pre-trained network while maintaining similar accuracy. Compression can be achieved either by pruning some of the intermediate synaptic connections in the network or by quantizing pre-trained kernels to be represented as integers or booleans. Denton et al. applied Singular Value Decomposition (SVD) to approximate a pre-trained network and achieve a 2× inference speedup. Han et al. pruned and fine-tuned a pre-trained network to identify important network connections and create a smaller network. The work was extended in Deep Compression to combine network pruning with quantization. Later, ‘Quantized CNN’ was proposed, which aimed at directly quantizing network weights during training. Chen et al. proposed ‘HashedNet’ to compress networks with hashing.
Some recent works have focused on smarter network designs for efficient low-level vision applications. Zhang and Tao proposed a lightweight multi-scale network for single image dehazing. Ahn et al. proposed a cascaded residual network coupled with group convolution for efficient single image super-resolution. Tan et al. presented a low-cost network for unmanned aerial vehicle (UAV) noise reduction at low signal-to-noise ratio (SNR). Zhang et al. presented a ‘mixed-convolution’ layer that merges normal and dilated convolution for image super-resolution. Kim et al. presented a dilated-Winograd transformation for a faster realization of dilated convolution. In RHNet the authors present a dilated spatial pyramid pooling framework for dense object counting.
In this paper we mainly focus on constructing lightweight modules for training efficient networks from scratch for low-level image restoration tasks. However, the building blocks of modern efficient networks are mainly concerned with classification tasks in which essential components such as upsampling, sub-pixel convolution and dilated convolution are usually not involved. Hence, those methods are not self-sufficient for low-level computer vision applications.
In recent years deep learning based methods have produced phenomenal performances on a variety of low-level image restoration tasks. However majority of research has been focused on improving the visual quality without worrying much about the computational burden. In this paper we aim to realize lightweight versions of these networks which can be run on mobile devices with milli-seconds level execution time instead of multiple seconds required by full-scale baselines.
This section elaborates on the architectural details of the LIST layer. A pictorial representation of a LIST layer is shown in Fig. 1(b). We will first discuss the driving intuitions and principles behind LIST, followed by calculating the computation savings achieved by using LIST instead of a regular 3×3 convolution layer.
Presence of a ‘sub-network’ capable of universal functional approximation, such as a multi-layer perceptron (MLP), in between two consecutive layers boosts the feature extraction capability of a CNN. In LIST, we realize this functionality by having one parallel branch of two successive layers of 1×1 convolution with a ReLU non-linearity in between to promote sparsity of features. Such cascades of 1×1 convolution promote parametric cross-channel pooling and enable a network to learn non-trivial transformations.
Starting from the Inception module of GoogleNet (see Fig. 1(a)), the multi-path branched module has become the de facto choice for multi-scale processing of features in deep neural networks. Following that, we incorporate a branch for 3×3 convolution in parallel with the 1×1 branch. In this case, the initial (top) 1×1 layer acts as an embedding layer by projecting the incoming feature volume to a lower dimension, thereby reducing the FLOPs required for performing 3×3 convolution. We further reduce the FLOPs count of the 3×3 convolution branch by factoring it with depthwise separable kernels. However, we deviate from the design principles of Inception by restricting the number of parallel branches inside the LIST layer. This is motivated by the ‘network fragmentation’ issue pointed out in earlier work. Parallel branches in a network create overhead from kernel launching and synchronization, resulting in a reduction of execution speed. So, unlike Inception, we refrain from using two additional parallel branches of 5×5 convolution and max-pooling inside our LIST layer. Apart from the ‘network fragmentation’ issue, avoiding parallel branches also reduces the number of final feature channels which need to be processed by the next layer, which further helps in decreasing FLOPs.
A LIST layer is meant to replace a normal 3×3 convolution layer with $M$ input and $N$ output feature channels.
Input to a LIST module is a feature volume of shape $(H, W, M)$ (height, width, channels). In the first stage, the input volume is pointwise convolved with $M/\alpha$ 1×1 kernels, where $\alpha$ is the compression ratio. In the second stage, these feature maps are passed to two parallel streams of 1×1 and 3×3 convolution. In the 1×1 branch, we perform another set of pointwise 1×1 convolutions and output $N/\beta$ channels, where $\beta$ is the branching factor. The 3×3 branch is realized with depthwise separable kernels and outputs the remaining $N - N/\beta$ channels. Outputs from the 1×1 and 3×3 streams are concatenated (to form a total of $N$ channels) and passed on to the next layer.
Comparison to 3×3 convolution: In this section we elaborate on the savings of parameters and FLOPs achieved by our LIST layer over the usual 3×3 layer. We assume the spatial resolution of incoming and outgoing features to be $H \times W$. Number of trainable parameters for a 3×3 layer is,
$$P_{3\times3} = 9MN, \tag{1}$$
while the total FLOPs is,
$$F_{3\times3} = 9MNHW. \tag{2}$$
Computations for a LIST module consist of three components: (a) 1×1 convolution in Stage-1; (b) 1×1 convolution in the Stage-2 parallel stream; (c) separable 3×3 convolution in the Stage-2 parallel stream. Assuming $\beta = 2$ (see Sec. IV-A1), the number of parameters for a LIST layer is,
$$P_{LIST} = \underbrace{\frac{M^2}{\alpha}}_{(a)} + \underbrace{\frac{MN}{2\alpha}}_{(b)} + \underbrace{\frac{9M}{\alpha} + \frac{MN}{2\alpha}}_{(c)} = \frac{M}{\alpha}\,(M + N + 9), \tag{3}$$
while total FLOPs is,
$$F_{LIST} = \frac{MHW}{\alpha}\,(M + N + 9). \tag{4}$$
Ratio of parameters of 3×3 to that of LIST is given by,
$$\frac{P_{3\times3}}{P_{LIST}} = \frac{9\alpha N}{M + N + 9} \tag{5}$$
$$\approx \frac{9\alpha N}{M + N}. \tag{6}$$
Since $\alpha$, the compression ratio of incoming to outgoing channels of the first 1×1 layer, satisfies $\alpha \geq 1$, we have,
$$\frac{P_{3\times3}}{P_{LIST}} \gtrsim \frac{9N}{M + N}. \tag{7}$$
From Eq. 7 we get the lower bound of parameter savings from using the proposed LIST layer instead of a 3×3 convolution layer. Some of the usual settings in a network are $N = M$, $N = M/2$ or $N = 2M$. After a brief hyper-parameter search (see Sec. IV-A1) we set $\alpha = 4$ and thus achieve approximately 18×, 12× and 24× parameter savings at $N = M$, $N = M/2$ and $N = 2M$. Thus a single instance of our LIST layer is significantly cheaper than a normal 3×3 convolution layer. On a similar note, we can show that the ratio of FLOPs of 3×3 to that of LIST is given by,
$$\frac{F_{3\times3}}{F_{LIST}} = \frac{9\alpha N}{M + N + 9}. \tag{8}$$
Since this ratio is the same as the one obtained for parameter savings, following the approximation done in Eq. 6 and the lower-bound logic of Eq. 7, we get similar scales of FLOPs savings as we showed for the parameters.
Stacking several LIST layers thereby yields a significant reduction of memory footprint (fewer parameters) and faster execution (fewer FLOPs) compared to a network realized with 3×3 convolution layers.
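The parameter savings quoted above can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes the Stage-1 layer squeezes the $M$ input channels to $M/\alpha$, that both Stage-2 branches output $N/2$ channels (branching factor 2), that the 3×3 branch is a depthwise convolution followed by a pointwise one, and that biases are ignored. The function names and the 256-channel example are hypothetical.

```python
def conv3x3_params(m, n):
    """Weights of a dense 3x3 convolution with m input and n output channels."""
    return 9 * m * n


def list_params(m, n, alpha=4):
    """Weights of a LIST layer under the assumptions stated in the lead-in."""
    squeezed = m // alpha                             # assumed Stage-1 width: m / alpha
    stage1 = m * squeezed                             # (a) Stage-1 1x1 convolution
    branch_1x1 = squeezed * (n // 2)                  # (b) Stage-2 1x1 branch
    branch_3x3 = 9 * squeezed + squeezed * (n // 2)   # (c) depthwise 3x3 + pointwise
    return stage1 + branch_1x1 + branch_3x3


ratios = {}
for m, n in [(256, 256), (256, 128), (256, 512)]:
    ratios[(m, n)] = conv3x3_params(m, n) / list_params(m, n)
    print(f"M={m}, N={n}: 3x3 / LIST parameter ratio = {ratios[(m, n)]:.1f}")
```

Under these assumptions the three settings give ratios close to the 18×, 12× and 24× figures in the text; the small gap comes from the depthwise term that the approximation drops.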
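Comparison to depthwise separable 3×3 convolution: In this section we first find the condition under which the proposed LIST layer is even cheaper than the widely used depthwise separable convolution layer. We again assume 3×3 convolution over a feature volume of $M$ incoming and $N$ outgoing channels and spatial resolution of $H \times W$, with $\alpha$ the compression ratio of the first 1×1 stage of LIST. Number of trainable parameters for a separable 3×3 convolution layer is,
$$P_{sep} = 9M + MN = M(N + 9), \tag{9}$$
while total FLOPs is,
$$F_{sep} = MHW\,(N + 9). \tag{10}$$
Ratio of parameters for a separable 3×3 convolution layer to that of LIST, with $P_{LIST} = \frac{M}{\alpha}(M + N + 9)$, is,
$$\frac{P_{sep}}{P_{LIST}} = \frac{\alpha\,(N + 9)}{M + N + 9} \approx \frac{\alpha N}{M + N}. \tag{11}$$
If we want $P_{sep}/P_{LIST} > 1$ then we need to satisfy the following condition:
$$\alpha > 1 + \frac{M}{N}. \tag{12}$$
So, we have the following criteria for $\alpha$ at different ratios of $M : N$:
$$\alpha > 2 \;\; (N = M), \qquad \alpha > 1.5 \;\; (N = 2M), \qquad \alpha > 3 \;\; (N = M/2). \tag{13}$$
To satisfy all the conditions of Eq. 13 we need $\alpha > 3$, which gives the lower bound of parameter savings. Since we set $\alpha = 4$ for all our experiments, the conditions of Eq. 13 are satisfied. With $\alpha = 4$, from Eq. 11 we have $P_{sep}/P_{LIST} \approx$ 2, 2.6 and 1.3 at $N = M$, $N = 2M$ and $N = M/2$ respectively. Similarly, we can show that the ratio of FLOPs of a depthwise separable 3×3 layer to that of LIST is,
$$\frac{F_{sep}}{F_{LIST}} = \frac{\alpha\,(N + 9)}{M + N + 9} \approx \frac{\alpha N}{M + N}. \tag{14}$$
With $N = M$, $N = 2M$ or $N = M/2$ we would save approximately 2×, 2.6× and 1.3× FLOPs respectively. Our LIST layer's design has appreciably fewer parameters and FLOPs than even a depthwise separable realization of 3×3 convolution and can thus be used as an off-the-shelf replacement for a separable convolution layer.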
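As a numerical companion to the comparison above, this sketch evaluates the separable-to-LIST parameter ratio and the smallest compression ratio at which LIST becomes cheaper. It rests on the same assumptions as before (Stage-1 width $M/\alpha$, two equal Stage-2 branches, biases ignored); the function names and channel counts are illustrative, not from the paper.

```python
def separable3x3_params(m, n):
    """Depthwise 3x3 over m channels plus pointwise 1x1 from m to n channels."""
    return 9 * m + m * n


def list_params(m, n, alpha=4):
    """LIST weights, assuming Stage-1 width m / alpha and two n/2-channel branches."""
    squeezed = m // alpha
    return m * squeezed + squeezed * (n // 2) + 9 * squeezed + squeezed * (n // 2)


def min_alpha(m, n):
    """Approximate break-even compression ratio: LIST is cheaper when alpha exceeds this."""
    return 1 + m / n


sep_ratios = {}
for m, n in [(256, 256), (256, 512), (256, 128)]:
    sep_ratios[(m, n)] = separable3x3_params(m, n) / list_params(m, n)
    print(f"M={m}, N={n}: separable / LIST = {sep_ratios[(m, n)]:.2f}, "
          f"needs alpha > {min_alpha(m, n):.1f}")
```

With a compression ratio of 4, all three channel settings clear their break-even thresholds, so LIST stays cheaper than the separable layer in each case.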
In this section we elaborate on the design of our proposed GSAT layer, which is an efficient replacement for the usual atrous/dilated convolution layer found in numerous contemporary low-level vision applications [30, 70, 19]. Realizing a 3×3 dilated convolution is not trivially possible with our LIST module because of the 1×1 convolution in the first stage. For this we propose the GSAT layer. We mainly consider a 3×3 dilated convolution with the same number of incoming and outgoing channels. This is the most popular configuration in contemporary architectures. An illustration of a GSAT layer is shown in Fig. 2(b).
Input to the layer is a feature volume of shape $H \times W \times M$. Based on group convolution, we divide the incoming channels into $g$ non-overlapping groups. Then each of the groups is individually processed by a usual dilated 3×3 convolution. The initial group partitioning reduces the number of incoming channels to the individual 3×3 dilated convolution layers and thereby saves on parameters and FLOPs. However, each of the groups is processed independently on a sub-group of channels without any cross-group interaction. This property weakens the representation capability of the model. Thus, for cross-channel interaction, we perform a channel shuffling operation to periodically sample and stack features from each of the $g$ groups. This results in an intermediate volume of shape $H \times W \times M$ in which features from a particular group are placed $g$ channels apart. Thus any run of $g$ consecutive channels inside the intermediate volume has features from each of the $g$ groups. Next, to perform a cross-channel interaction of features, we include a 1×1 convolution layer. However, to reduce FLOPs, we perform a grouped 1×1 convolution partitioned over $g$ groups. Since the channel shuffling operation has already populated each of the sub-groups with features from all the 3×3 dilated convolution layers, the grouped 1×1 convolution still mixes information across groups. Following [26], we add the input to the output of the 1×1 group convolution. To the best of our knowledge, this is the first realization of a dilated convolution layer with grouped convolution and channel shuffling.
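The channel shuffling step described above can be sketched on a plain list of channel indices. This is an illustration of the standard reshape-transpose-flatten shuffle, assumed here to match the operation the text describes; the function names and the 8-channel, 2-group example are hypothetical.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so that consecutive outputs come from different groups.

    Equivalent to viewing the channel axis as (groups, per_group),
    transposing to (per_group, groups) and flattening again.
    """
    if len(channels) % groups:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + j]
            for j in range(per_group)
            for g in range(groups)]


def source_groups(shuffled, groups):
    """For each contiguous group after shuffling, list the original groups it draws from."""
    per_group = len(shuffled) // groups
    return [sorted({c // per_group for c in shuffled[i * per_group:(i + 1) * per_group]})
            for i in range(groups)]


shuffled = channel_shuffle(list(range(8)), groups=2)
print(shuffled)                    # channels 0-3 came from group 0, 4-7 from group 1
print(source_groups(shuffled, 2))  # each new group now mixes both original groups
```

After the shuffle, every contiguous block handed to the grouped 1×1 convolution contains channels from all original groups, which is what restores cross-group interaction.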
In this section we numerically illustrate the computational benefits of using our GSAT layer instead of the usual dilated convolution layer. Number of trainable parameters for a normal 3×3 dilated convolution layer is given by,
$$P_{dil} = 9M^2,$$
where $M$ is the number of incoming and outgoing channels. For the GSAT layer with $g$ groups, the number of parameters for the first stage of grouped convolution is $9M^2/g$, while for the second stage of 1×1 grouped convolution it is $M^2/g$. So, total parameters for the GSAT layer is,
$$P_{GSAT} = \frac{9M^2}{g} + \frac{M^2}{g} = \frac{10M^2}{g}.$$
Ratio of parameters used in regular dilated convolution to that used by the proposed GSAT layer is,
$$\frac{P_{dil}}{P_{GSAT}} = \frac{9g}{10}.$$
So, we can save parameters if $9g/10 > 1$, which requires $g \geq 2$. In fact, after a hyper-parameter search (see Sec. IV-A1) we used $g = 8$, and thus the GSAT module requires almost 7× fewer parameters than a normal dilated convolution layer.
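The GSAT parameter count can be reproduced with a few lines of arithmetic. The sketch below assumes equal input and output channel counts, the same number of groups in both stages, and no biases; the function names and the 256-channel example are illustrative.

```python
def dilated_conv_params(m):
    """Weights of a dense 3x3 dilated convolution with m input and m output channels."""
    return 9 * m * m


def gsat_params(m, groups):
    """Grouped 3x3 dilated convolution followed by a grouped 1x1 convolution."""
    if m % groups:
        raise ValueError("channels must be divisible by groups")
    per_group = m // groups
    grouped_dilated = groups * 9 * per_group * per_group
    grouped_pointwise = groups * per_group * per_group
    return grouped_dilated + grouped_pointwise


gsat_ratio = dilated_conv_params(256) / gsat_params(256, groups=8)
print(f"dilated / GSAT parameter ratio with 8 groups: {gsat_ratio:.1f}")
```

Dilation changes the spacing of the kernel taps but not their number, so the dilation rate does not appear in either count.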
Upsampling of intermediate feature maps in a network is an essential component for low-level vision tasks. However, recent frameworks for efficient network design do not discuss upsampling strategies because it is rarely required in classification frameworks. We thus devote this section for discussing possible solutions for efficient upsampling.
In recent literature, transposed convolution (popularly known as deconvolution) has become the de facto choice for upsampling. However, from an image generation perspective, transposed convolution is known to render ‘checkerboard’ effects [53, 2] on the final synthesized image. Thus, even though there are efforts towards making transposed convolution computationally faster [66, 10], we explore other avenues for efficient upsampling.
Sub-pixel convolution based upsampling is a preferred paradigm of upsampling, specifically for image generation tasks, because of its demonstrated ability to get rid of the ‘checkerboard’ artifacts introduced by the transposed convolution layer. In this section we elaborate on our initial failed attempt at applying separable kernels to sub-pixel convolution based upsampling (see Fig. 5 for failed inpainting results) and provide justifications for the same.
It can be shown that, for an upscale factor of $r$, a sub-pixel convolution with kernel shape $(k, k, r^2 c_o, c_i)$ (height, width, # of output channels, # of input channels) is equivalent to a transposed convolution with a kernel of shape $(rk, rk, c_o, c_i)$. After sub-pixel convolution, the channel elements are periodically shuffled to upscale the feature maps by a factor of $r$ along height and width. See Fig. 3(a) for visualization. Refer to [60, 61] for a more detailed derivation.
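The periodic shuffling step can be sketched in pure Python on nested lists. The channel ordering used here (output channel first, then row offset, then column offset) follows a common convention and is an assumption; the paper's own ordering is not visible in this text. The function name and toy input are illustrative.

```python
def pixel_shuffle(x, r):
    """Rearrange x[h][w][c] with c = c_out * r * r into a (h*r, w*r, c_out) volume."""
    height, width, channels = len(x), len(x[0]), len(x[0][0])
    if channels % (r * r):
        raise ValueError("channels must be divisible by r * r")
    c_out = channels // (r * r)
    out = [[[None] * c_out for _ in range(width * r)] for _ in range(height * r)]
    for h in range(height):
        for w in range(width):
            for c in range(c_out):
                for i in range(r):          # row offset inside the r x r block
                    for j in range(r):      # column offset inside the r x r block
                        out[h * r + i][w * r + j][c] = x[h][w][c * r * r + i * r + j]
    return out


low_res = [[[0, 1, 2, 3]]]          # one pixel with four channels
high_res = pixel_shuffle(low_res, 2)
print(high_res)                     # a 2 x 2 single-channel block
```

Each low-resolution pixel spreads its channels over an r × r block of the output, so no values are created or discarded, only moved.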
From the theory of sub-pixel convolution we know that, with an upscale factor of 2, sub-pixel convolution can learn to represent feature maps in LR (low-resolution) space which are equivalent to feature maps in HR (high-resolution) space. We will show that this essentially means both networks have the same run-time complexity but sub-pixel convolution has more parameters. Let us consider a general case where the shape of the input volume at layer $l$ is (height, width, depth) = $(H, W, C_l)$. The target is to upscale this to a spatial resolution of $(2H, 2W)$ with $C_{l+1}$ channels for the next layer, $l+1$. Let, for sub-pixel convolution, we choose kernels of shape $(k, k, 4C_{l+1}, C_l)$. Then for the counterpart HR model (which first does deterministic up-scaling followed by convolution in HR space itself), kernel sizes will be $(k, k, C_{l+1}, C_l)$. Total FLOPs for sub-pixel convolution is,
$$F_{LR} = 4k^2 C_l C_{l+1} HW.$$
The number of trainable parameters for the LR model is,
$$P_{LR} = 4k^2 C_l C_{l+1}.$$
For convolution in HR, total FLOPs,
$$F_{HR} = k^2 C_l C_{l+1} (2H)(2W) = 4k^2 C_l C_{l+1} HW,$$
and number of parameters,
$$P_{HR} = k^2 C_l C_{l+1}.$$
So, the important observation is that FLOPs for the LR and HR models are the same, but the LR model has more parameters and thus greater representation capability.
Let us now examine what happens when we try to realize separable sub-pixel convolution. See Fig. 3(b) for a visualization. In the first (depthwise) stage, we need kernels of shape $(k, k, 1, C_l)$ (height, width, output channels, input channels). In this stage, total FLOPs,
$$F_{1} = k^2 C_l HW,$$
and number of parameters,
$$P_{1} = k^2 C_l.$$
In the next (pointwise) stage we need kernels of shape $(1, 1, 4C_{l+1}, C_l)$. Total FLOPs in this stage,
$$F_{2} = 4 C_l C_{l+1} HW,$$
and number of trainable parameters,
$$P_{2} = 4 C_l C_{l+1}.$$
So, total FLOPs for the separable LR model is $F_{sepLR} = C_l HW (k^2 + 4C_{l+1})$ and total number of parameters is $P_{sepLR} = C_l (k^2 + 4C_{l+1})$. Now consider the ratio,
$$\frac{P_{sepLR}}{P_{HR}} = \frac{k^2 + 4C_{l+1}}{k^2 C_{l+1}} = \frac{1}{C_{l+1}} + \frac{4}{k^2}.$$
This ratio is below 1 for the usual $k \geq 3$ and $C_{l+1} \geq 2$, and thus we see that converting a sub-pixel convolution to a separable paradigm reduces its representation prowess with respect to a convolution in HR. Similarly, if we compare the FLOPs by $F_{sepLR}/F_{HR} = \frac{1}{4C_{l+1}} + \frac{1}{k^2}$, separable sub-pixel convolution is computationally cheaper. But because of its reduced representation capability, it is not recommended for practical applications.
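The accounting above can be checked with a short script. This is a sketch under assumptions that are not confirmed by the surviving text: an upscale factor of 2, the same k × k kernel in the LR and HR models, a depthwise-then-pointwise separable layer, and no biases. All function names and the 32 × 32 × 64 example are illustrative.

```python
def subpixel_lr(h, w, c_in, c_out, k=3, r=2):
    """Sub-pixel convolution in LR space: r*r times more output channels."""
    params = k * k * c_in * c_out * r * r
    return params, params * h * w


def conv_hr(h, w, c_in, c_out, k=3, r=2):
    """Deterministic upscaling followed by a k x k convolution in HR space."""
    params = k * k * c_in * c_out
    return params, params * (h * r) * (w * r)


def separable_subpixel_lr(h, w, c_in, c_out, k=3, r=2):
    """Depthwise k x k followed by pointwise 1 x 1 with r*r*c_out outputs, in LR space."""
    params = k * k * c_in + c_in * c_out * r * r
    return params, params * h * w


p_lr, f_lr = subpixel_lr(32, 32, 64, 64)
p_hr, f_hr = conv_hr(32, 32, 64, 64)
p_sep, f_sep = separable_subpixel_lr(32, 32, 64, 64)
print(f"LR vs HR: same FLOPs = {f_lr == f_hr}, parameter ratio = {p_lr / p_hr:.0f}")
print(f"separable LR vs HR: parameters {p_sep / p_hr:.2f}, FLOPs {f_sep / f_hr:.2f}")
```

In this example the separable variant keeps less than half the parameters of the HR convolution, which is the loss of representation capacity the argument points to.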
One way to mitigate the ‘checkerboard’ effect is to disentangle the upsampling and convolution operations. A usual procedure is to use some deterministic upscaling followed by convolution in the high-resolution space. This has worked well in applications such as super-resolution and inpainting. But, when implemented naively, this increases the computational cost. For example, if we do a bilinear upscaling by 4 followed by convolution, there is a quadratic increase of feature size but the ‘same information content’ (if we count the number of floats). This makes bilinear upsampling + convolution almost 4× costlier than transposed convolution. We optimize this concept by first upsampling with bilinear interpolation, followed by an efficient convolution block realized by the proposed LIST layer. This is our preferred method for efficient upsampling.
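A rough FLOPs comparison of the three upsampling options is sketched below. It assumes the feature-map area grows 4× (a factor of 2 per side), counts a transposed convolution as one k × k × C_in × C_out product per input pixel, ignores the cost of bilinear interpolation itself, and reuses the earlier assumed LIST parameterization (Stage-1 width M/α, two equal branches). All names and sizes are illustrative, not the authors' code.

```python
def list_params(m, n, alpha=4):
    """LIST weights, assuming Stage-1 width m / alpha and two n/2-channel branches."""
    squeezed = m // alpha
    return m * squeezed + squeezed * (n // 2) + 9 * squeezed + squeezed * (n // 2)


def transposed_conv_flops(h, w, c_in, c_out, k=3):
    """Each LR input pixel is multiplied by a full k x k x c_in x c_out kernel."""
    return h * w * k * k * c_in * c_out


def bilinear_conv_flops(h, w, c_in, c_out, k=3, scale=2):
    """Dense k x k convolution applied after upscaling each side by `scale`."""
    return (h * scale) * (w * scale) * k * k * c_in * c_out


def bilinear_list_flops(h, w, c_in, c_out, scale=2, alpha=4):
    """LIST layer applied after upscaling each side by `scale`."""
    return (h * scale) * (w * scale) * list_params(c_in, c_out, alpha)


t = transposed_conv_flops(32, 32, 256, 256)
b = bilinear_conv_flops(32, 32, 256, 256)
l = bilinear_list_flops(32, 32, 256, 256)
print(f"bilinear + conv: {b / t:.1f}x transposed conv")
print(f"bilinear + LIST: {l / t:.2f}x transposed conv")
```

Under these assumptions the LIST-based variant more than recovers the 4× penalty of convolving in the high-resolution space.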
To maintain homogeneity in network design, we prefer to realize spatial downsampling with the LIST layer. However, strided convolution is not trivially possible with the LIST module because of the initial 1×1 convolution stream. So, we first down-sample feature maps with bilinear interpolation and follow up with LIST based efficient convolution.
We organize our results as follows. In Sec. IV-A, we initially perform extensive studies to select the hyper parameters governing the design choices for various proposed modules based on image inpainting. We systematically investigate the role of individual components towards reduction of parameters and FLOPs. This is followed by comparison with recent full capacity inpainting baselines and compressed models realized with MobileNet , ShuffleNet  and ShuffleNetV2 .
Next, with our understanding of best network configurations we compare applicability of our proposed layers on image denoising (Sec. IV-B) and image super-resolution (Sec. IV-C). It is encouraging to note that the proposed layers are quite insensitive to hyper parameters across different tasks which allows us to reuse the same set of hyper parameters across all the three above mentioned applications without degradation of visual quality.
We select the globally and locally consistent image inpainting model, GLCIC, as our baseline for image inpainting. Currently, GLCIC serves as a strong Generative Adversarial Network (GAN) based contemporary baseline for inpainting, and we aim at realizing a lightweight version of GLCIC using our proposed layers. A GAN framework consists of two deep neural nets, a generator, $G$, and a discriminator, $D$. The task of the generator is to generate an image, $G(z)$, with a latent noise prior vector, $z$, as input. $z$ is sampled from a known distribution, $p_z(z)$. A common choice is a simple prior such as a uniform or Gaussian distribution. The discriminator has to distinguish real samples (sampled from $p_{data}$) from generated samples. Discriminator and generator play the following two-player min-max game on $V(D, G)$:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].$$
At the core, GLCIC comprises repeated applications of 3×3 convolution, 3×3 dilated convolution and transposed convolution layers. Please refer to the original GLCIC paper for details of the architecture. We replaced the corresponding layers with the proposed LIST, GSAT and LIST based upsampling layers.
Automated Visual Quality Metric: Manually analyzing the perceptual quality of reconstruction by different models is not feasible. Recent works [39, 76] have shown that PSNR and MS-SSIM metrics are not suitable for evaluating quality of adversarial loss guided reconstructions. Analyzing the quality and diversity of GAN samples is still an open research topic. Recently Fréchet Inception Distance (FID)  was proposed for quantifying quality and diversity of GAN samples. Lower FID value indicates overall better quality and diversity of generated samples. For automated screening of models, we use FID as the base metric.
Datasets: We experimented on CelebA (128×128), CelebA-HQ (256×256), Places2 (256×256) and DTD (256×256). For CelebA, hole sizes greater than 48×48 occlude almost the entire face, and thus the maximum training hole size is 48×48 at a random location. For comparing FID during evaluation, a randomly positioned hole (but the same for all models for a given image) of 48×48 is considered. At 256×256 image resolution, a maximum hole size of 96×96 is considered during training and FID is reported at a hole size of 96×96. From CelebA, CelebA-HQ, Places2 and DTD we kept 20000, 10000, 20000 and 1000 (converted to 4000 with horizontal and vertical flips) samples for testing.
Training Details: In practice, we follow the stagewise training procedure as presented in . In Stage-1, we pre-train the inpainting (generator) network alone with (Mean Squared Error) loss for iterations. In Stage-2, we freeze the parameters of inpainting network and pre-train the critic (discriminator) network to distinguish between real and inpainted samples for iterations using cross-entropy loss. In Stage-3, both completion and critic networks are iteratively updated under the min-max GAN game formulation  for iterations.
Implementation Details We first discuss how we select design hyper parameters of our network modules such as LIST and GSAT. For a given parameter setting, we train on CelebA dataset and evaluate the FID on CelebA validation set (10000 samples). Due to lack of massive computational resources, we run parameter search sweep only on CelebA and adopted our understanding on other datasets. It is encouraging to see that lessons learned from CelebA generalize well to other datasets also. We set , and to 10, 10 and 10 iterations. Mini batch gradient descent based optimization is performed with ADAM  optimizer with batch size = 64. Following , we perform paired two-sided Wilcoxon signed-rank tests and significance level set to 10.
Design parameters for LIST module: A LIST module is characterized by two hyper-parameters, $\alpha$ and $\beta$. Firstly, we study the effect of reducing 3×3 kernels in the network by varying $1/\beta$. To keep things constant, the dilation layer for each case was realized with normal dilated convolution and $1/\alpha$ was fixed at 0.25. In Table I we report FID metrics on the CelebA validation set at a hole size of 48×48. Decreasing $1/\beta$ (pushing more computations to the 3×3 stream) below 0.5 does not improve FID appreciably but increases the model parameters, while FID deteriorates briskly with an increase of $1/\beta$ (pushing more computations to the 1×1 stream). We thus keep $\beta = 2$ in our further experiments. Such a balance of channels along two parallel processing streams is also recommended in [54, 44]. Next, we sweep over different settings of $1/\alpha$ at a fixed $1/\beta = 0.5$. Increasing $1/\alpha$ improves the representation efficacy of the Stage-1 1×1 layer and thus aids FID improvement, but at the cost of more parameters. With $1/\alpha \geq 0.35$, FID improvement almost saturates.
Finally, to find a suitable threshold of FID aligned with human perception, we showed 100 inpainted images from five models with FID $< 8$ (models with FID $> 8$ are perceptually not acceptable) to five independent raters, who were asked to rate a given image on a scale of 1 to 5 (5: excellent, 1: bad quality). The differences in mean scores of models with FID $\leq 7.0$ were statistically insignificant. With $1/\beta = 0.5$, from Table I we see that $1/\alpha = 0.30$ yields a model in the regime of FID $\leq 7.0$. Since channel counts in deep nets are usually of the form $2^n$, we proceed in the remainder of the paper with $\alpha = 4$ ($1/\alpha = 0.25$) and $\beta = 2$ ($1/\beta = 0.5$).
Number of Groups in GSAT layer: Our proposed GSAT module is characterized by the number of groups, , for the group convolution layers. For simplicity of the parameter sweep, we keep the group numbers the same for the dilated 3×3 and 1×1 stages. In Table II we report FID scores on the CelebA validation set for different values of . For other layers, all models used LIST with = 4 and = 2 as discussed in the previous section. A smaller value of indicates more computational load on the initial 3×3 layers and consequently better FIDs. However, at = 8, we get FID 7.0, which is perceptually acceptable. On the contrary, increasing creates many independent feature volumes, and the combined channel shuffle and 1×1 group convolution is not able to properly amalgamate the groups, leading to higher FID. So, for future experiments we set = 8 for GSAT layers.
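The trade-off described here, fewer parameters with more groups but weaker mixing across groups, can be illustrated with two generic operations: the parameter count of a group convolution and a ShuffleNet-style channel shuffle. This is only a sketch of those building blocks, not the GSAT layer itself; the function names and the 64-channel example are assumptions.

```python
def group_conv_params(c_in, c_out, k, groups):
    """Weights in a k x k group convolution (bias ignored).

    Each group maps c_in/groups input channels to c_out/groups outputs,
    so the total is k*k*c_in*c_out/groups.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return k * k * (c_in // groups) * (c_out // groups) * groups


def channel_shuffle(channels, groups):
    """Interleave channels so the next group convolution sees every group.

    Equivalent to viewing the channel axis as (groups, per_group),
    transposing to (per_group, groups) and flattening again.
    """
    n = len(channels)
    if n % groups:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


if __name__ == "__main__":
    # Hypothetical 64 -> 64 channel 3x3 layer.
    for g in (1, 2, 4, 8, 16):
        print(g, group_conv_params(64, 64, 3, g))
    print(channel_shuffle(list(range(8)), groups=2))
```

With one group the layer is an ordinary convolution; every doubling of the group count halves the weights, which is why the shuffle step is needed to let information cross group boundaries.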
|Model||3X3||Upsampling||Dilation||Params (10)||FLOPs (10)|
|DS||BiL + DS||Normal||2.81||26.9|
|LIST||BiL + DS||Normal||2.63||24.8|
|LIST||BiL + LIST||Normal||2.61||24.0|
|BiL + LIST||
In Table III we define the proposed architecture variants and compare the associated parameters and FLOPs. Such analysis gives a foundation to appreciate the effect of a given speedup technique.
In Table V we compare the FID scores of different proposed models with full-scale baseline models. Some of the key lessons from Tables III and V:
– Comparing and : The proposed LIST layer is a much more efficient alternative to the depthwise separable 3×3 convolution layer, yet both models have similar reconstruction performance.
– Comparing and : The proposed GSAT layer used in as an alternative to the normal dilated convolution layer significantly reduces parameters without hampering visual quality. Since combines both LIST and GSAT layers, it is our preferred proposed model unless otherwise stated.
– As per our theoretical justification, a network with separable sub-pixel convolution (model ) performs worse than a network with normal convolution based sub-pixel convolution (model ) as reflected by higher FID scores of . Also, see Fig. 5 for visualizing such failures.
– Comparing and : Model, , with bilinear upsampling + separable convolution has fewer FLOPs than (upsampled with sub-pixel convolution) while having similar FID. Thus it is prudent to have efficient bilinear upsampling which we improve further with proposed based upsampling in .
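The parameter gaps behind these comparisons follow from simple counting. The sketch below compares a standard k×k convolution with a depthwise separable one, and shows how a sub-pixel upsampling layer inflates the output channel count by r² before the pixel shuffle. Biases are ignored, and the 64-channel layer and 2× factor are hypothetical choices, not values taken from the tables.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(c_in, c_out, k=3):
    """Weights in a depthwise k x k plus pointwise 1 x 1 convolution."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


def subpixel_out_channels(c_out, r):
    """Channels the convolution must emit before an r-times pixel shuffle."""
    return c_out * r * r


if __name__ == "__main__":
    c_in, c_out, r = 64, 64, 2  # hypothetical layer sizes
    print("plain 3x3:", standard_conv_params(c_in, c_out))
    print("separable 3x3:", separable_conv_params(c_in, c_out))
    up = subpixel_out_channels(c_out, r)
    print("sub-pixel, standard:", standard_conv_params(c_in, up))
    print("sub-pixel, separable:", separable_conv_params(c_in, up))
```

The counts only show where the savings come from; they say nothing about reconstruction quality, which is what the FID comparison measures.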
Since we design all our smaller models based on the architecture of GLCIC , it is fair to compare performances only with GLCIC as the baseline. However, for initial benchmarking of our model designs we also compared against the recent state-of-the-art deep learning based models of GIP and Shift .
Reduction in Computation In Table IV we report the parameter count, FLOPs and mobile memory size. Our preferred model, achieves almost 91% ( = ) relative parameter savings compared to the parent framework of GLCIC with 88.6% and 93.5% relative savings in FLOPs.
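The relative savings quoted here are presumably the reduction expressed as a percentage of the baseline cost. A minimal sketch under that assumption follows; the helper name and the example numbers are hypothetical and are not taken from Table IV.

```python
def relative_savings(baseline, compressed):
    """Percentage reduction of a cost (parameters, FLOPs, bytes) vs. a baseline."""
    if baseline <= 0:
        raise ValueError("baseline cost must be positive")
    return 100.0 * (baseline - compressed) / baseline


if __name__ == "__main__":
    # Hypothetical costs: a baseline of 100 units compressed to 9 units.
    print(f"{relative_savings(100.0, 9.0):.1f}% saved")
```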
Comparison of Reconstruction In Table V we report FID metrics of the compared methods at 256×256 on the CelebA-HQ, Places2, and DTD datasets. We did not find any significant difference in FID between any of our models (except ) and the full-scale baselines. In Fig. 6 we provide some inpainting examples by GLCIC, GIP, Shift and our preferred proposed model, . Clearly, the reconstruction quality of our proposed smaller model is indistinguishable from that of the full-scale baselines.
Mean Opinion Score Testing (MOS): To further bolster our findings, we conducted MOS testing to visually quantify the quality of inpainting by different models. Raters were asked to rate an inpainted image on a scale of 1 (bad quality) to 5 (excellent quality). A total of 20 raters were selected for the study. From each dataset, each rater was shown 50 inpainted images by GIP, Shift, GLCIC and the proposed models , . Original images were also rated. So, each rater rated 1200 samples (4 datasets × 6 models × 50 images). We used two randomly positioned holes of 64×64 (kept the same across all models for a given image). In Table VI we report the MOS for each dataset. Encouragingly, MOS also follows the trend of FID scores. Similar to our FID findings, the differences in MOS between our models and any of the full-scale baselines are not significant.
For comparison on mobile we select two low-end mobile devices, namely Mi A1 and Motorola G5 S-Plus, and one high-end Asus Zenfone 5Z, all running the Android operating system. The Mi and Motorola devices have a 1.9 GHz Qualcomm Snapdragon 625 processor while the Asus has a 2.8 GHz Snapdragon 845 processor. TensorFlow Lite was used for mobile execution and the framework was executed on a single thread. In Table VII we report the execution times on 256×256 resolution images. Our preferred model, consistently runs in milliseconds, compared to multiple seconds for the full-scale baselines. It is also evident that the newer generation processor in the Asus device gives faster execution than the lower-end Mi and Motorola devices. We also profiled the execution times of the models on the CPU of a regular commodity laptop with a 2.2 GHz Intel i5 processor and 8 GB RAM, without any GPU acceleration. It is encouraging to see that even without GPU, model is able to inpaint approximately 3.3 second compared to 0.9, 1.25, and 0.7 second by [30, 76, 74] respectively. Another encouraging observation is that sub-pixel convolution based upsampling (models and ) is slower on resource constrained mobile platforms than the proposed bilinear upsampling followed by efficient convolution. This is attributed to the computationally heavy pixel-shuffle operation in and . However, on a more resourceful platform such as a CPU, this difference is nullified. This observation further strengthens the case for bilinear upsampling based efficient upsampling over pixel-shuffle based upsampling.
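For reference, the pixel-shuffle step of sub-pixel convolution only rearranges data: a tensor with C·r² channels of size H×W becomes C channels of size rH×rW. The pure-Python sketch below makes that memory shuffling explicit on nested lists. It assumes the channel ordering commonly used by deep learning frameworks (output position (h·r+i, w·r+j) of channel c reads input channel c·r²+i·r+j), and is an illustration rather than the TensorFlow Lite kernel profiled above.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r):
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # vertical sub-pixel offset
            for j in range(r):      # horizontal sub-pixel offset
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


if __name__ == "__main__":
    # Four 1x1 channels become one 2x2 channel.
    print(pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], r=2))
```

Every output pixel is a strided gather from a different input channel, which is the kind of scattered memory access that bilinear upsampling followed by a convolution avoids.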
We also designed cheaper variants of the GLCIC baseline using efficient convolution units from MobileNet , ShuffleNet and ShuffleNetV2 . However, as discussed earlier, these frameworks were targeted at classification tasks and lack efficient designs for dilated convolution and upsampling operations. For example, both ShuffleNet and ShuffleNetV2 units are invalid on layers in which the numbers of input and output channels differ, which is a common design for any upsampling layer. We could have used the usual full-scale dilated and transposed convolutions for these three frameworks, but for a fair comparison with our compressed networks, we add two modifications to these competing frameworks. First, for dilated convolution, we perform a dilated 3×3 depthwise convolution followed by a 1×1 pointwise convolution. This itself can be seen as a novel, cheaper way of designing a dilated convolution layer. Next, for upsampling, we perform bilinear upsampling followed by separable convolution. With these modified settings, we did not find any marked difference in visual quality between the cheaper models and the baselines (samples are provided in the supplementary material due to space constraints). From Table VIII we see that our recommended model, is much more computationally efficient than the MobileNet and ShuffleNet variants and, more importantly, has all the necessary components to be seamlessly used in ‘image-to-image’ translation tasks.
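The dilated depthwise plus pointwise construction keeps the weight count of an ordinary separable layer while enlarging the receptive field, because dilation only spreads the kernel taps apart. A small sketch of that bookkeeping, with hypothetical function names and layer sizes:

```python
def effective_kernel_size(k, dilation):
    """Spatial extent covered by a k x k kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


def dilated_separable_params(c_in, c_out, k=3, dilation=1):
    """Weights in a dilated depthwise k x k + pointwise 1 x 1 layer (bias ignored).

    The dilation argument is accepted only to make explicit that it does
    not enter the count.
    """
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    for d in (1, 2, 4, 8):
        print(d, effective_kernel_size(3, d), dilated_separable_params(256, 256, 3, d))
```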
|Noise Level ()||10||15||25||50|
In this section we show the applicability of our modules to reduce the computational costs of recent state-of-the-art image denoising networks. Henceforth, in all experiments we use the design strategy and components from our variant, , to realize a cheaper version of a given baseline. We initially experimented with the ‘DnCNN’ framework of Zhang et al. for synthetic Additive White Gaussian Noise (AWGN) removal. We term our proposed smaller variant as DnCNN.
We also experimented with compressing the more recent CBDNet model, which showed appreciable performance on real-world unknown noise removal and has immediate applications in today’s AI-enabled cameras. We term the smaller model as CBDNet.
Synthetic Dataset: We initially compared the performance of our cheaper realization of DnCNN on synthetic AWGN on the widely used BSD68  dataset consisting of 68 test images. We experimented on four different noise levels of and zero mean. We followed DnCNN to use 400 images with size for training the network. Random patches of were sampled for training.
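Synthetic AWGN training pairs of this kind are typically produced by adding zero-mean Gaussian noise of a chosen standard deviation to clean patches. The sketch below illustrates that step on a nested-list image with intensities in [0, 255]; the function name, the clipping to the valid range and the use of a seed are assumptions for illustration, not details taken from the DnCNN training setup.

```python
import random


def add_awgn(image, sigma, seed=None, clip=True):
    """Return a copy of `image` with zero-mean Gaussian noise of std `sigma`.

    `image` is a list of rows of floats in [0, 255]. Clipping to that range
    is an assumption of this sketch and can be disabled.
    """
    rng = random.Random(seed)
    noisy = []
    for row in image:
        new_row = []
        for pixel in row:
            value = pixel + rng.gauss(0.0, sigma)
            if clip:
                value = min(255.0, max(0.0, value))
            new_row.append(value)
        noisy.append(new_row)
    return noisy


if __name__ == "__main__":
    clean = [[128.0] * 4 for _ in range(4)]  # hypothetical flat grey patch
    print(add_awgn(clean, sigma=25, seed=0)[0])
```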
Real World Dataset: We also experimented with datasets perturbed with noise from the real-life unknown noise distributions usually encountered while capturing pictures with contemporary cameras. For this, we followed the procedures in CBDNet for training our models. A combination of synthetic noise images and real noisy images (120 from RENOIR, 400 images from BSD500, 1600 images from Waterloo, and 1600 images from MIT-Adobe FiveK) was used for training.
Since all the models are trained to minimize reconstruction loss (instead of adversarial loss), it is pragmatic to compare the models directly in terms of PSNR and SSIM (Structural Similarity Index ) instead of FID. Also, FID calculation requires at least a few thousand samples. However, our test set has a few hundred samples and thus FID metric would not have been a faithful representation of performance.
Denoising on Synthetic Dataset: In Table IX we report the denoising performance of the baseline DnCNN and our proposed DnCNN in terms of PSNR and SSIM for AWGN removal. SSIM is used as a metric for comparing the similarity between two images; SSIM = 1 means a perfect match. Across all noise levels, our model has performance comparable to that of the DnCNN baseline.
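For reference, PSNR is computed from the mean squared error between the reference and the restored image as 10·log10(MAX²/MSE). A minimal sketch for 8-bit images stored as nested lists (the function name and the peak value of 255 are assumptions):

```python
import math


def psnr(reference, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_errors = [(a - b) ** 2
                      for ref_row, res_row in zip(reference, restored)
                      for a, b in zip(ref_row, res_row)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


if __name__ == "__main__":
    ref = [[10, 20], [30, 40]]
    out = [[11, 21], [31, 41]]  # every pixel off by one, so MSE = 1
    print(f"{psnr(ref, out):.2f} dB")
```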
Denoising on Real Dataset For quantitative evaluation we used the publicly available PolyU dataset containing pairs of real-world noisy and ground truth images. The average PSNR and SSIM for the full-scale CBDNet are 37.95 dB and 0.951, while for the proposed CBDNet they are 37.29 dB and 0.948. Again, the differences are not significant. It is encouraging to see that even on real-world noise removal, our compressed variant performs on par with the full-scale CBDNet. Some visual comparisons are provided in Fig. 7. Additionally, for qualitative evaluations, we used the high-resolution DND dataset in which the ground truths are not publicly available. Due to size limitations we include DND results in this Google Drive link.
Human Rating: In Table XI we report the MOS on different datasets. For each dataset, each subject was shown 20 random pairs of noisy and denoised images (either from the baseline or from our compressed variant). A total of 10 people participated in the study. The grading strategy (between 0-5) was kept the same as the one we used during inpainting. We did not find any statistically significant difference (significance set to 10) between the MOS of the baselines and our variant on any of the datasets.
We report the total number of parameters for the full-scale baseline models of DnCNN and CBDNet and our proposed compressed versions in Table X. On DnCNN we achieve 87.27% and on CBDNet we achieve 90.2% relative savings of parameters. Since the models are fully convolutional, an image of arbitrary resolution can be processed. Thus reporting a single FLOPs count is not possible. However, for reference, in Table X we report the FLOPs for processing an input image of resolution 256×256. The proposed CBDNet achieves 89.4% relative savings in FLOPs compared to CBDNet. We also compare against corresponding compressed variants of DnCNN and CBDNet built with MobileNet, ShuffleNet and ShuffleNetV2 modules. Our proposed variant is more efficient in terms of memory requirement and FLOPs than both the MobileNet and ShuffleNet variants.
In Table X we compare the execution times (at 256×256) on mobile (Asus) and CPU and also the model sizes for mobile deployment. Both of our proposed variants are computationally more economical than the full-scale baselines as well as the MobileNet and ShuffleNet variants.
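The dependence of FLOPs on input resolution is easy to see for a single convolution layer: the parameter count is fixed, but the work scales with the number of output pixels. The sketch below counts multiply-accumulate operations for a stride-1, same-padded layer; the convention (one MAC per weight per output pixel), the function names and the 64-channel example are assumptions, and the count is not meant to reproduce Table X.

```python
def conv_params(c_in, c_out, k=3):
    """Weights in a k x k convolution (bias ignored); independent of image size."""
    return k * k * c_in * c_out


def conv_macs(height, width, c_in, c_out, k=3):
    """Multiply-accumulates for a stride-1, same-padded k x k convolution."""
    return height * width * conv_params(c_in, c_out, k)


if __name__ == "__main__":
    for side in (64, 128, 256, 512):
        print(side, conv_params(64, 64), conv_macs(side, side, 64, 64))
```

Doubling each side of the input quadruples the operation count while the parameter column stays constant, which is why a reference resolution has to be fixed before FLOPs can be compared.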
Image denoising is an essential component in the majority of contemporary AI-enabled smartphones, and the results presented above make our compressed variant a natural substitute for the full-scale models on mobile platforms.
In this section we showcase the efficacy of our modules for single image super-resolution. For this, we consider the benchmark SRGAN model as the baseline for 4× up-scaling. The baseline network consists of a series of residual blocks (realized with 3×3 convolution), and upsampling is achieved with sub-pixel convolution with the pixel-shuffle operation. We again follow the design principles of our model, to realize a cheaper variant of SRGAN.
We used the training partition of the Places2 dataset to train the baseline and proposed models. Similar to SRGAN, we tested the models on the Set5 , Set14 , and BSD100 (testing set of BSD300 ) datasets. Following , we randomly cropped a 96×96 patch from a given image as the HR (high resolution) target and down-sampled it with bicubic interpolation by 4× to create the corresponding LR (low resolution) input.
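The HR/LR pair construction can be sketched as a random crop followed by 4× downsampling, giving a 96×96 target and a 24×24 input. To stay dependency-free, the sketch below substitutes simple box averaging for the bicubic interpolation used in the paper, so it illustrates the shapes and data flow only; the function names are hypothetical.

```python
import random


def random_crop(image, size, rng):
    """Cut a size x size patch at a random location from a nested-list image."""
    height, width = len(image), len(image[0])
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]


def box_downsample(image, factor):
    """Average factor x factor blocks (a stand-in for bicubic downsampling)."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(0, height - height % factor, factor):
        row = []
        for x in range(0, width - width % factor, factor):
            block = [image[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 128 x 160 grey-level image.
    image = [[float((x + y) % 256) for x in range(160)] for y in range(128)]
    hr = random_crop(image, 96, rng)
    lr = box_downsample(hr, 4)
    print(len(hr), len(hr[0]), len(lr), len(lr[0]))
```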
We follow the exact same protocol of stagewise training as done in . Initially, we train the network with only the reconstruction loss. The authors term this network as SRResNet. For our smaller model, we term this network as SRResNet. Next, we fine-tune the network with the VGG-54 content loss and an adversarial loss. The network at this stage is termed SRGAN for the baseline network and SRGAN for our proposed smaller network.
In Table XII we first compare the PSNR (in dB) of SRResNet and SRResNet. Since both of these models are trained on MSE loss, we compare the PSNR metric. Based on the average PSNR, we could not find any significant difference (significance level set to 10) between the two models.
Qualitative Comparison Next, we conducted a MOS test for the 2 models with 10 independent raters. Each rater was shown the original HR image and the super-resolved versions by the SRGAN and SRGAN networks. In Table XIV we report the MOS on the three datasets. Again, we could not find any significant difference between the scores received by the models. In Fig. 8 we visualize some super-resolved images by the two models. It is visually challenging to distinguish samples from the full-scale SRGAN baseline and our cheaper variant. More examples are provided in the supplementary material.
In Table LABEL:table_sr_computational we report the total number of parameters and FLOPs of different models. FLOPs were calculated on the BSD100 dataset, in which the original images are usually of dimension 480×320 or 320×480. So, for 4× super-resolution, the input resolution is either 80×120 or 120×80. Compared to the SRGAN baseline, our proposed cheaper variant, SRGAN, achieves relative parameter and FLOPs savings of 88.4% and 99%. The proposed model is also appreciably cheaper than the MobileNet and ShuffleNet variants.
In Table LABEL:table_sr_computational we report the mobile model sizes and execution times on the Asus mobile and the commodity CPU. Proposed variant saves 92.1%, 47.1%, 76.1% and 75.0% on mobile memory compared to SRGAN-54 baseline, MobileNet, ShuffleNet and ShuffleNetV2 variants respectively. Execution speeds are reported on BSD100 dataset. Our proposed variant achieves significant speedup and reduction of FLOPs compared to full-scale baseline and even MobileNet and ShuffleNet versions.
In this paper we introduced several convolutional building blocks for low-level restoration tasks. Our proposed modules, LIST and GSAT, were shown to be task agnostic and to generalize to a variety of restoration tasks. We showed that, with specific design considerations, the LIST layer can be made computationally cheaper than the contemporary de facto choices of depthwise separable and group convolution based 3×3 layers. We analytically and empirically analyzed the shortcomings of using depthwise separable kernels to realize sub-pixel convolution based upsampling in an encoder-decoder network configuration. Instead, we showed that homogeneity of network structure can be maintained by deterministic upsampling (instead of transposed convolution or pixel-shuffle based upsampling) followed by efficient convolution with the LIST layer. Extensive evaluations on resource constrained platforms revealed the effectiveness of our modules in designing computationally efficient yet visually accurate models.
The work is funded by a Google PhD Fellowship and Qualcomm Innovation Fellowship awarded to Avisek.
Avisek is a Ph.D. candidate at the Indian Institute of Technology Kharagpur where he is focusing on image/video reconstruction tasks such as inpainting, super-resolution. His other research interests include data-efficient training of deep neural networks. He is recipient of Google PhD Fellowship and twice recipient of Qualcomm Innovation Fellowship. Avisek was selected as a Young Researcher by the Heidelberg Laureate Forum, 2019. Prior to his Ph.D, Avisek completed his M.S (by research) from IIT Kharagpur with focus on statistical machine learning.
Sourav received his M.Tech. degree from the Department of Electronics and Electrical Communication Engineering, IIT Kharagpur, Kharagpur, India. He is currently working as an advanced deep learning engineer at Mathworks India Pvt. Ltd., Hyderabad. He received the “Institute Silver Medal” for his academic performance at IIT Kharagpur. He received the “Best Student Award” with a gold medal for academic performance during his B.Tech. at Kalyani Government Engineering College, Kalyani, India. He has also received many scholarships and awards from the Government of West Bengal for securing 3rd rank at state level in the Class X Board examination (Madhyamik) and 8th rank at state level in the Class XII Board examination (Higher Secondary). His current research interests lie in deep learning, machine vision and generative adversarial models.
Sutanu received BS degree in Electronics and Communication Engineering from the West Bengal University of Technology in 2015 and the MS degree in Medical Imaging and Informatics from the Indian Institute of Technology Kharagpur in 2018. He was awarded the Institute Silver Medal at the time of M.Tech. He is currently a Ph.D. student in the Department of Electrical and Electronics Communication Engineering, Indian Institute of Technology Kharagpur. His current research interest is deep learning for image restoration, medical image restoration, low-level image processing.
Siddhant is a final year integrated master’s student at Indian Institute of Technology Kharagpur. He is majoring in Electrical Engineering with a minor in Computer Science. He has been doing research in the domain of computer vision, natural language processing, adversarial attacks and causal inference.
Prof. Prabir Kumar Biswas (M’93–SM’03) received the B.Tech. (Hons.), M.Tech., and Ph.D. degrees from IIT Kharagpur, Kharagpur, India, in 1985, 1989, and 1991, respectively. He was a Visiting Fellow with the University of Kaiserslautern, Kaiserslautern, Germany, under the Alexander von Humboldt Research Fellowship from 2002 to 2003. Since 1991, he has been a Faculty Member with the Department of Electronics and Electrical Communication Engineering, IIT Kharagpur, where he is currently a Professor, and is the Head of the Department. He has authored over 100 research publications in international and national journals and conferences, and has filed seven international patents. His current research interests include image processing, pattern recognition, computer vision, video compression, parallel and distributed processing, and computer networks.
Shift-net: Image inpainting via deep feature rearrangement. In ECCV, pages 1–17, 2018.
Robust LSTM-autoencoders for face de-occlusion in the wild. IEEE Transactions on Image Processing, 27(2):778–790, 2017.
Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2018.