TCSVTLightWeightCNNs
Code for our IEEE TCSVT Paper: Lightweight Modules for Efficient Deep Learning based Image Restoration
Low-level image restoration is an integral component of modern artificial intelligence (AI) driven camera pipelines. Most of these frameworks are based on deep neural networks, which present a massive computational overhead on resource-constrained platforms such as mobile phones. In this paper, we propose several lightweight low-level modules which can be used to create a computationally low-cost variant of a given baseline model. Recent works on efficient neural network design have mainly focused on classification. However, low-level image processing falls under the 'image-to-image' translation genre, which requires some additional computational modules not present in classification. This paper seeks to bridge this gap by designing generic efficient modules which can replace essential components used in contemporary deep learning based image restoration networks. We also present and analyse our results highlighting the drawbacks of applying the depthwise separable convolutional kernel (a popular method for efficient classification networks) to sub-pixel convolution based upsampling (a popular upsampling strategy for low-level vision applications). This shows that concepts from the domain of classification cannot always be seamlessly integrated into image-to-image translation tasks. We extensively validate our findings on three popular tasks: image inpainting, denoising and super-resolution. Our results show that the proposed networks consistently produce reconstructions visually similar to those of the full-capacity baselines, with significant reductions in parameters, memory footprint and execution time on contemporary mobile devices.
Image restoration refers to the recovery of a clean signal from an observed noisy input. Following the groundbreaking work of Krizhevsky et al. [36] on ImageNet classification with deep neural networks, CNNs have superseded traditional methods across a variety of tasks such as object recognition [26, 66, 65], detection [15, 14, 56] and tracking [4, 23], action recognition [7, 22], and segmentation [24, 51], to list a few. Image restoration frameworks have also benefited from the data-driven hierarchical feature learning capability of deep neural networks, with state-of-the-art performances on inpainting [30, 76, 74, 40, 37, 38], denoising [81, 69, 78], super-resolution [39, 67], dehazing [57], de-occlusion [85], 3D surface reconstruction [32, 63], etc. Though these deep learning based restoration frameworks yield photorealistic outputs, the models are computationally expensive, with millions of parameters. Inference through such complex networks requires billions of floating point operations (FLOPs). This might not be seen as a problem when executing on a GPU-enabled workstation; however, such networks are practically not scalable to resource-constrained platforms such as a commodity CPU or a mobile device. Meanwhile, with the proliferation of multimedia-enabled mobile devices, there is an increased demand for on-device multimedia manipulation. For example, image denoising is a crucial component of the imaging pipeline of any contemporary smartphone. Super-resolution is also an inevitable component, because online multimedia hosting sites often prefer to transmit low-resolution images and videos, with super-resolution performed on-device so that the end user enjoys a high-resolution multimedia experience even on a low-bandwidth channel. Similarly, inpainting plays a crucial role in many downstream applications such as image editing, Augmented Reality [59], and 'disocclusion' inpainting [45, 42] for novel view synthesis in multi-camera video capture settings [8] to be integrated with mobile Head Mounted Displays (HMDs).

Executing billions of FLOPs on mobile devices quickly drains battery life and can heat up the device. Also, the lag encountered while executing such large models on constrained platforms tends to disrupt user engagement.
To address the above two issues, in this paper we propose several lightweight computing units which dramatically reduce the computational cost of a given deep neural network without any visual degradation of reconstructed outputs.
Recently, there has been a surge of interest in designing efficient neural networks, mainly for object classification and detection. However, there is a dearth of literature on efficient processing for networks concerned with low-level image restoration. Fundamentally, restoration requires the spatial resolutions of the input and output signals to be the same, and the general practice [30, 76, 81] is to follow encoder-decoder based architectures which first downsample and later upsample the intermediate feature maps of the network. On the contrary, classification frameworks are mainly concerned with progressive downsampling, and thus efficient strategies for upsampling within a network are not discussed. Also, dense prediction tasks such as inpainting require long-range spatial information and often deploy dilated/atrous convolutions [75] to increase the receptive field. However, dilated convolutions are rarely used in classification frameworks, and thus recent advancements such as separable convolution [29] and group convolution [82] cannot be directly applied to dilated convolution operations.
In this paper we mainly focus on design principles for components to be used in low-level restoration tasks. Since the 3×3 kernel (ideally 3×3×C, where C is the number of input channels; for brevity of notation we henceforth drop the channel dimension) is the most commonly used kernel in contemporary low-level vision applications [30, 76, 81, 39], we introduce the 'LIght Spatial Transition' (LIST) layer, which simultaneously benefits from local feature aggregation [41] and multi-scale spatial processing [65] and uses up to 24× fewer parameters than a similar 3×3 convolution layer. Next, we introduce the 'Grouped Shuffled Atrous Transition' (GSAT) layer, an efficient atrous/dilated convolution layer leveraging the recent concepts of group convolution [36] and channel shuffling [82]; each such layer uses approximately 7× fewer parameters than a usual dilated convolution layer. While designing an efficient upsampling module, we show that separable convolution kernels are inept at sub-pixel convolution [60] based upsampling, and we provide an analytical justification for the same. Instead, we show that deterministic upsampling, such as bilinear upsampling followed by our LIST module, provides an efficient upsampling framework. The combination of these modules enables us to run image restoration models on mobiles with millisecond-level execution times, compared to several seconds for contemporary full-scale models, without any visual degradation of outputs. One of the major advantages of our proposed modules is that they can seamlessly replace commonly used computational blocks such as 3×3 convolution, dilated convolution and differentiable upsampling within a given network. Thus, in this paper we refrain from proposing new end-to-end architectures; instead, we select recent state-of-the-art networks and reduce their computational footprints using our lightweight layers.
In summary, our key technical contributions in this paper are:
We present the LIST layer as a computationally cheaper alternative to a regular 3×3 convolution layer. Each instance of LIST can save 12–24× parameters. Repeated use of LIST in a deep network leads to a significant reduction of parameters and FLOPs
We present the GSAT layer, which implements dilated convolution on separate sparse groups of channels to reduce FLOPs, followed by feature mixing for enhanced representation capability. Each instance of the proposed module uses approximately 7× fewer parameters than a regular dilated convolution layer
We present our findings on the drawbacks of applying separable convolution for feature upsampling with sub-pixel convolution and provide a detailed insight into the possible reason for this failure. Instead, we show that deterministic upsampling followed by LIST-based convolution is an efficient yet accurate alternative
We perform an extensive study of our network components on the tasks of image inpainting, denoising and super-resolution. On all tasks we achieve significant reductions in parameters and FLOPs, and massive execution speedups on resource-constrained platforms, without any compromise in visual quality. Such exhaustive experiments manifest the generalizability of our processing components across a variety of low-level restoration tasks.
In recent years deep neural networks have achieved overwhelming success on a variety of computer vision tasks in which network design plays a crucial role. Executing these large models on resource constrained platforms requires efficient design strategies
[25]. Recently there has been a surge of interest in either compressing existing pre-trained big networks or designing small networks from scratch.

For training small networks from scratch, factorization of kernels has been a preferred choice. The most common realization is depthwise separable convolution, initially presented in [62] and then popularized by the Inception module [65]. Following that, it has become the backbone of many popular architectures such as MobileNet [29] and MobileNetV2 [28]. The Xception network [10] showed how to scale up depthwise separable convolutions to outperform Inception-V3 [66]. Another popular concept, group convolution, was introduced in [36] to distribute model parameters over multiple GPUs. Currently, it is utilized in several recent efficient networks [83, 48, 72, 64]. The idea is to convert dense convolutions across all feature channels into sparse ones by grouping channels and performing convolution only within each group.

Model compression is another genre of approaches for efficient inference, achieved by lossy compression of a pre-trained network while maintaining similar accuracy. Compression can be achieved either by pruning some of the intermediate synaptic connections of the network or by quantizing pre-trained kernels to be represented as integers or booleans. Denton et al. [12]
applied Singular Value Decomposition (SVD) to approximate a pre-trained network and achieve a 2× inference speedup. Han et al. [21] pruned and fine-tuned a pre-trained network to identify important network connections and create a smaller network. This work was extended in Deep Compression [20] to combine network pruning with quantization. Later, 'Quantized CNN' [71] was proposed, which aims at directly quantizing network weights during training. Chen et al. proposed 'HashedNet' [9] to compress networks with hashing.

Some recent works have focused on smarter network designs for efficient low-level vision applications. Zhang and Tao [80] proposed a lightweight multi-scale network for single image dehazing. In [1], Ahn et al. proposed a cascaded residual network coupled with group convolution for efficient single image super-resolution. In [68], Tan et al. presented a low-cost network for unmanned aerial vehicle (UAV) noise reduction at low signal-to-noise ratio (SNR). In [84], Zhang et al. present a 'mixed-convolution' layer merging normal and dilated convolution for image super-resolution. Kim et al. [34] presented a dilated Winograd transformation for a faster realization of dilated convolution. In RHNet [77] the authors present a dilated spatial pyramid pooling framework for dense object counting.
In this paper we mainly focus on constructing lightweight modules for training efficient networks from scratch for low-level image restoration tasks. However, the building blocks of modern efficient networks are mainly designed for classification tasks, in which essential components such as upsampling, sub-pixel convolution and dilated convolution are usually not involved. Hence, those methods are not self-sufficient for low-level computer vision applications.
In recent years, deep learning based methods have produced phenomenal performance on a variety of low-level image restoration tasks. However, the majority of research has focused on improving visual quality without worrying much about the computational burden. In this paper we aim to realize lightweight versions of these networks which can run on mobile devices with millisecond-level execution times instead of the multiple seconds required by the full-scale baselines.
This section elaborates on the architectural details of the LIST layer; a pictorial representation is shown in Fig. 1(b). We first discuss the driving intuitions and design principles behind LIST, and then calculate the computational savings achieved by using LIST instead of a regular 3×3 convolution layer.

The presence of a 'sub-network' capable of universal function approximation, such as a multi-layer perceptron (MLP), between two consecutive layers boosts the feature extraction capability of a CNN [41]. In LIST, we realize this functionality with a parallel branch of two successive 1×1 convolution layers with a ReLU non-linearity in between to promote sparsity of features. Such cascades of 1×1 convolutions promote parametric cross-channel pooling and enable the network to learn non-trivial transformations.

Starting from the Inception module [65] of GoogLeNet (see Fig. 1(a)), multi-path branched modules have become the de facto choice for multi-scale feature processing in deep neural networks [66]. Following this, we incorporate a 3×3 convolution branch in parallel with the 1×1 branch. Here, the initial (top) 1×1 layer acts as an embedding layer, projecting the incoming feature volume to a lower dimension and thereby reducing the FLOPs required for the 3×3 convolution. We further reduce the FLOP count of the 3×3 branch by factorizing it with depthwise separable kernels. However, we deviate from the design principles of Inception by restricting the number of parallel branches inside the LIST layer. This is motivated by the 'network fragmentation' issue pointed out in [48]: parallel branches in a network create kernel launching and synchronization overheads that reduce execution speed. So, unlike Inception, we refrain from using two additional parallel branches of 5×5 convolution and max-pooling inside our LIST layer. Apart from the fragmentation issue, avoiding extra parallel branches also reduces the number of final feature channels that must be processed by the next layer, which further decreases FLOPs.

A LIST layer is meant to replace a normal 3×3 convolution layer with M input and N output feature channels.
Input to a LIST module is a feature volume of shape (H, W, M) (height, width, channels). In the first stage, the input volume is pointwise convolved with N/α 1×1 kernels, where α is the compression ratio. In the second stage, these feature maps are passed to two parallel streams of 1×1 and 3×3 convolution. In the 1×1 branch, we perform another pointwise 1×1 convolution and output N/β channels, where β is the branching factor. The 3×3 branch is realized with depthwise separable kernels and outputs the remaining N − N/β channels. Outputs from the 1×1 and 3×3 streams are concatenated (to form N total channels) and passed on to the next layer.
All throughout the paper, we consider 'same' convolution by padding zeros at the border, with a stride of 1 pixel; this preserves the image resolution.
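The channel bookkeeping of the two stages can be sketched in a few lines of Python. This is a hypothetical helper, not the paper's code; α and β are our labels for the compression ratio and branching factor described above:

```python
def list_layer_channels(n_out, alpha=4, beta=2):
    """Channel counts through a LIST layer producing N = n_out channels.

    Stage-1 compresses the input to N/alpha channels; Stage-2 splits
    computation between a 1x1 stream (N/beta channels) and a depthwise
    separable 3x3 stream (the remaining channels), whose outputs are
    concatenated back to N channels.
    """
    stage1 = n_out // alpha
    branch_1x1 = n_out // beta
    branch_3x3 = n_out - branch_1x1
    return stage1, branch_1x1, branch_3x3

# e.g. a layer with 64 output channels: Stage-1 embeds to 16 channels,
# and the two parallel streams emit 32 + 32 = 64 channels
assert list_layer_channels(64) == (16, 32, 32)
```

With the hyper-parameter values chosen later in the paper (α = 4, β = 2), the two streams carry an equal share of the output channels.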
Comparison to 3×3 convolution: In this section we elaborate on the savings in parameters and FLOPs achieved by our LIST layer over a usual 3×3 layer. We assume the spatial resolution of incoming and outgoing features to be H×W, with M incoming and N outgoing channels. The number of trainable parameters for a 3×3 convolution layer is,

P_conv = 9MN    (1)
while the total FLOPs is,
F_conv = 9MN·HW    (2)
Computations for a LIST module consist of three components: (a) the 1×1 convolution in Stage-1; (b) the 1×1 convolution in the Stage-2 parallel stream; (c) the separable 3×3 convolution in the Stage-2 parallel stream. Assuming β = 2 (see Sec. IV-A1), the number of parameters for a LIST layer is,

P_LIST = M(N/α) + (N/α)(N/2) + 9(N/α) + (N/α)(N/2) = (N/α)(M + N + 9)    (3)
while total FLOPs is,
F_LIST = (N/α)(M + N + 9)·HW    (4)
The ratio of parameters of a 3×3 convolution to that of LIST is given by,

P_conv / P_LIST = 9MN / [(N/α)(M + N + 9)] = 9αM / (M + N + 9)    (5)

≈ 9α / (1 + N/M)    (6)

Since the first 1×1 layer compresses the channels, i.e., N/(αM) ≤ 1 (α being the compression ratio between the incoming and outgoing channels of that layer), we have N/M ≤ α. Thus,

P_conv / P_LIST ≥ 9α / (1 + α)    (7)
From Eq. 7 we get the lower bound on the parameter savings obtained by using the proposed LIST layer instead of a 3×3 convolution layer. Some of the usual settings in a network are N = M, N = 2M or N = M/2. After a brief hyper-parameter search (see Sec. IV-A1) we set α = 4, and thus we achieve approximately 18×, 12× and 24× parameter savings at N = M, N = 2M and N = M/2 respectively. Thus a single instance of our LIST layer is significantly cheaper than a normal 3×3 convolution layer. On a similar note, we can show that the ratio of FLOPs of a 3×3 convolution to that of LIST is given by,
F_conv / F_LIST = 9αM / (M + N + 9)    (8)
Since this ratio is the same as that obtained for the parameter savings, following the approximation of Eq. 6 and the lower-bound logic of Eq. 7, we get similar scales of FLOP savings as we showed for the parameters.
Stacking several LIST layers thereby helps to significantly reduce the memory footprint (fewer parameters) and increase execution speed (fewer FLOPs) compared to a network realized with 3×3 convolution layers.
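The savings quoted above can be checked numerically from Eqs. 1 and 3. The short script below is a sanity check under our symbol conventions (M, N for input/output channels, α for the compression ratio, β for the branching factor), not the paper's code:

```python
def conv3x3_params(m, n):
    # Eq. (1): dense 3x3 convolution, biases ignored
    return 9 * m * n

def list_params(m, n, alpha=4, beta=2):
    # Eq. (3): stage-1 1x1 (M -> N/alpha), parallel 1x1 (N/alpha -> N/beta),
    # depthwise 3x3 (9 * N/alpha) plus its pointwise (N/alpha -> N - N/beta)
    na = n / alpha
    return m * na + na * (n / beta) + 9 * na + na * (n - n / beta)

# quoted savings: ~18x, ~12x and ~24x at N = M, N = 2M and N = M/2
for m, n, expected in [(256, 256, 18), (256, 512, 12), (256, 128, 24)]:
    ratio = conv3x3_params(m, n) / list_params(m, n)
    assert abs(ratio - expected) / expected < 0.05
```

The exact ratios fall slightly below the approximate values of Eq. 6 because of the +9 depthwise term in the denominator, which the approximation neglects.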
Comparison to depthwise separable 3×3 convolution: In this section we first find the condition under which the proposed LIST layer is even cheaper than the widely used depthwise separable convolution layer. We again assume a 3×3 convolution over a feature volume with M incoming and N outgoing channels and spatial resolution H×W. The number of trainable parameters for a separable 3×3 convolution layer is,

P_sep = 9M + MN    (9)
while the total FLOPs is,

F_sep = (9M + MN)·HW    (10)
The ratio of parameters of a separable 3×3 convolution layer to that of LIST is,

P_sep / P_LIST = (9M + MN) / [(N/α)(M + N + 9)] ≈ α / (1 + N/M)    (11)
If we want P_sep / P_LIST > 1, then we need to satisfy the following condition:

α > 1 + N/M    (12)
So, we have the following criteria for α at different ratios of N to M:

α > 2 at N = M;  α > 1.5 at N = M/2;  α > 3 at N = 2M    (13)
To satisfy all the conditions of Eq. 13 we need α > 3, which gives the lower bound on the parameter savings. Since we set α = 4 for all our experiments, the conditions of Eq. 13 are satisfied. With α = 4, from Eq. 11 we have P_sep / P_LIST ≈ 2, 2.6 and 1.3 at N = M, N = M/2 and N = 2M respectively. Similarly, we can show that the ratio of FLOPs of a depthwise separable 3×3 layer to that of LIST is,
F_sep / F_LIST ≈ α / (1 + N/M)    (14)
With N = M, N = M/2 or N = 2M we would approximately save 2×, 2.6× and 1.3× FLOPs respectively. Our LIST layer thus has appreciably fewer parameters and FLOPs than even a depthwise separable realization of 3×3 convolution, and can be used as an off-the-shelf replacement for a separable convolution layer.
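The comparison against a depthwise separable layer can be verified the same way (again a sketch using our symbol conventions, with `list_params` following Eq. 3):

```python
def separable3x3_params(m, n):
    # Eq. (9): depthwise 3x3 over M channels + pointwise M -> N
    return 9 * m + m * n

def list_params(m, n, alpha=4):
    # Eq. (3): (N/alpha) * (M + N + 9)
    return (n / alpha) * (m + n + 9)

# quoted ratios: ~2x, ~2.6x and ~1.3x at N = M, N = M/2 and N = 2M
for m, n, expected in [(256, 256, 2.0), (256, 128, 2.6), (256, 512, 1.3)]:
    ratio = separable3x3_params(m, n) / list_params(m, n)
    assert abs(ratio - expected) / expected < 0.11
```

The exact ratios sit slightly above the approximation of Eq. 11 because the 9M depthwise term in the numerator is neglected there.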
In this section we elaborate on the design of our proposed GSAT layer, which is an efficient replacement for a usual atrous/dilated convolution layer found in numerous contemporary low-level vision applications [30, 70, 19]. Realizing a 3×3 dilated convolution is not trivially possible with our LIST module because of the 1×1 convolution in the first stage. For this we propose the GSAT layer. We mainly consider a 3×3 dilated convolution with the same number of incoming and outgoing channels, as this is the most popular configuration in contemporary architectures. An illustration of a GSAT layer is shown in Fig. 2(b).
Input to the layer is a feature volume of shape H×W×M. Based on group convolution [36], we divide the incoming channels into g non-overlapping groups. Each group is then individually processed by a usual dilated 3×3 convolution. The initial group partitioning reduces the number of incoming channels to each individual 3×3 dilated convolution and thereby saves parameters and FLOPs. However, each group is processed independently on its own subset of channels without any cross-group interaction, which weakens the representation capability of the model. Thus, for cross-channel interaction, we perform a channel shuffling operation [82] to periodically sample and stack features from each of the g groups. This results in an intermediate volume of shape H×W×M in which features from a particular group are stacked every g channels apart, so that any group of g consecutive channels inside the intermediate volume contains features from each of the g groups. Next, to perform cross-channel blending of features [41], we include a 1×1 convolution layer. To reduce FLOPs, we perform grouped 1×1 convolution, partitioned over g groups. Since the channel shuffling operation has already populated each subgroup with features from all the 3×3 dilated convolution branches, the grouped 1×1 layer can now learn a non-linear transformation conditioned on all of the dilated convolution branches; thus we avoid any further channel shuffling. Lastly, inspired by residual connections [26], we add the input to the output of the grouped 1×1 convolution. To the best of our knowledge, this is the first realization of a dilated convolution layer with grouped convolution and channel shuffling.

We now numerically illustrate the computational benefits of using our GSAT layer instead of a usual dilated convolution layer. The number of trainable parameters for a normal 3×3 dilated convolution layer is given by,
P_dil = 9M²    (15)
where M is the number of incoming and outgoing channels. For the GSAT layer, the number of parameters for the first stage of grouped dilated convolution is 9M²/g, while for the second stage of grouped 1×1 convolution it is M²/g. So, the total parameter count of a GSAT layer is,

P_GSAT = 9M²/g + M²/g = 10M²/g    (16)
The ratio of parameters used by a regular dilated convolution to that used by the proposed GSAT layer is,

P_dil / P_GSAT = 9g/10    (17)
So, we save parameters whenever 9g/10 > 1, which requires g ≥ 2. In fact, after a hyper-parameter search (see Sec. IV-A1) we used g = 8, and thus a GSAT module requires almost 7× fewer parameters than a normal dilated convolution layer.
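Both the parameter ratio of Eq. 17 and the shuffle pattern are easy to verify in isolation. The snippet below is an illustrative sketch, not the paper's implementation; `channel_shuffle` operates on a flat list of channel indices purely to show the interleaving:

```python
def dilated3x3_params(m):
    # Eq. (15): dense dilated 3x3 with M channels in and out
    return 9 * m * m

def gsat_params(m, g=8):
    # Eq. (16): grouped dilated 3x3 plus grouped 1x1
    return 9 * m * m / g + m * m / g

# Eq. (17): ratio is 9g/10 = 7.2 for g = 8, independent of M
assert dilated3x3_params(256) / gsat_params(256) == 7.2

def channel_shuffle(channels, g):
    """Interleave g contiguous groups so that every window of g consecutive
    channels contains exactly one channel from each group."""
    per_group = len(channels) // g
    return [channels[j * per_group + i] for i in range(per_group) for j in range(g)]

# 8 channels in 4 groups [0,1 | 2,3 | 4,5 | 6,7] -> [0, 2, 4, 6, 1, 3, 5, 7]
assert channel_shuffle(list(range(8)), 4) == [0, 2, 4, 6, 1, 3, 5, 7]
```

After the shuffle, every consecutive block of g channels mixes all groups, which is what allows the subsequent grouped 1×1 convolution to see information from every dilated branch.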
Upsampling of intermediate feature maps is an essential component of networks for low-level vision tasks. However, recent frameworks for efficient network design do not discuss upsampling strategies, because upsampling is rarely required in classification. We thus devote this section to discussing possible solutions for efficient upsampling.
In recent literature, transposed convolution (popularly known as deconvolution) [52] has become the de facto choice for upsampling. However, from an image generation perspective, transposed convolution is known to produce 'checkerboard' artifacts [53, 2] on the final synthesized image. Thus, even though there are efforts towards making transposed convolution computationally faster [66, 10], we explore other avenues for efficient upsampling.
Sub-pixel convolution based upsampling is a preferred paradigm, specifically for image generation tasks, because of its demonstrated ability to get rid of the 'checkerboard' artifacts introduced by transposed convolution. In this section we elaborate on our initial failed attempt at applying separable kernels for sub-pixel convolution based upsampling (see Fig. 5 for failed inpainting results) and provide justifications for the same.
It can be shown that, for an upscale factor of r, a sub-pixel convolution with kernel shape (k, k, r²C, M) (height, width, # of output channels, # of input channels) is equivalent to a transposed convolution with a kernel of shape (rk, rk, C, M). After sub-pixel convolution, the channel elements are periodically shuffled to upscale the feature maps by a factor of r along height and width. See Fig. 3(a) for a visualization, and refer to [60, 61] for a more detailed derivation.
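The periodic shuffle itself is purely an indexing operation. Below is a minimal pure-Python version on nested lists, following the channel-to-space ordering used by common deep-learning frameworks (which may differ from the paper's exact convention):

```python
def pixel_shuffle(lr, r):
    """Rearrange a (C*r*r, H, W) nested-list volume into (C, H*r, W*r)."""
    c_r2, h, w = len(lr), len(lr[0]), len(lr[0][0])
    c = c_r2 // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c)]
    for ch in range(c_r2):
        oc, off = divmod(ch, r * r)   # output channel, sub-pixel offset
        dy, dx = divmod(off, r)       # row/column offset within each r x r cell
        for y in range(h):
            for x in range(w):
                out[oc][y * r + dy][x * r + dx] = lr[ch][y][x]
    return out

# four 1x1 maps are woven into one 2x2 map
assert pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], 2) == [[[1, 2], [3, 4]]]
```

No arithmetic is performed: all the learning happens in the preceding convolution, and the shuffle merely rearranges its r²C output channels into space.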
From the theory of sub-pixel convolution we know that, with an upscale factor of 2, sub-pixel convolution can learn to represent feature maps in LR (low-resolution) space which are equivalent to feature maps in HR (high-resolution) space. We will show that this essentially means both networks have the same run-time complexity, but that sub-pixel convolution has more parameters. Consider a general case where the shape of the input volume at layer l is (height, width, depth) = (H, W, D_l). The target is to upscale this to a spatial resolution of (2H, 2W), with D_{l+1} channels, for the next layer l+1. For sub-pixel convolution we choose kernels of shape (k, k, 4D_{l+1}, D_l). For the counterpart HR model (which first does deterministic upscaling followed by convolution in HR space itself), the kernel size is (k, k, D_{l+1}, D_l). The total FLOPs for sub-pixel convolution is,

F_LR = k²·D_l·(4D_{l+1})·HW = 4k²·D_l·D_{l+1}·HW    (18)
The number of trainable parameters for the LR model is,

P_LR = 4k²·D_l·D_{l+1}    (19)
For convolution in HR space, the total FLOPs is,

F_HR = k²·D_l·D_{l+1}·(2H)(2W) = 4k²·D_l·D_{l+1}·HW    (20)
and the number of parameters is,

P_HR = k²·D_l·D_{l+1}    (21)
So the important observation is that the FLOPs of the LR and HR models are identical, but the LR model has 4× more parameters and thus greater representation capability.
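This accounting is mechanical enough to check in code. The helper below (our notation: `d_in` for D_l, `d_out` for D_{l+1}, kernel size k, upscale r; a sketch, not the paper's code) confirms that the FLOPs match while the LR model carries r² times the parameters:

```python
def subpixel_lr_cost(h, w, d_in, d_out, k=3, r=2):
    # Eqs. (18)-(19): sub-pixel convolution in LR space, r*r*d_out output maps
    flops = h * w * k * k * d_in * (r * r * d_out)
    params = k * k * d_in * (r * r * d_out)
    return flops, params

def upsample_hr_cost(h, w, d_in, d_out, k=3, r=2):
    # Eqs. (20)-(21): deterministic upscaling, then dense convolution in HR
    flops = (r * h) * (r * w) * k * k * d_in * d_out
    params = k * k * d_in * d_out
    return flops, params

f_lr, p_lr = subpixel_lr_cost(64, 64, 32, 32)
f_hr, p_hr = upsample_hr_cost(64, 64, 32, 32)
assert f_lr == f_hr        # identical run-time complexity
assert p_lr == 4 * p_hr    # LR model has r^2 = 4x the parameters
```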
Let us now examine what happens when we try to realize a separable sub-pixel convolution (see Fig. 3(b) for a visualization). In the first stage, we need a depthwise kernel of shape (k, k, 1, D_l) (height, width, output channels per input channel, input channels). In this stage, the total FLOPs is,

F₁ = k²·D_l·HW    (22)
and the number of parameters is,

P₁ = k²·D_l    (23)
In the next stage we need pointwise kernels of shape (1, 1, 4D_{l+1}, D_l). The total FLOPs in this stage is,

F₂ = 4·D_l·D_{l+1}·HW    (24)
and the number of trainable parameters is,

P₂ = 4·D_l·D_{l+1}    (25)
So the total FLOPs for the separable LR model is F_sep-LR = (k² + 4D_{l+1})·D_l·HW, and the total number of parameters is P_sep-LR = (k² + 4D_{l+1})·D_l. Now consider the ratio,

P_sep-LR / P_HR = (k² + 4D_{l+1})·D_l / (k²·D_l·D_{l+1}) = 1/D_{l+1} + 4/k²    (26)
For k = 3 this ratio equals 4/9 + 1/D_{l+1} < 1 for any practical D_{l+1}, and thus we see that converting a sub-pixel convolution to a separable paradigm reduces its representation prowess with respect to a convolution in HR space. Similarly, if we compare the FLOPs via F_sep-LR / F_HR, separable sub-pixel convolution is computationally cheaper; but because of its reduced representation capability, it is not recommended for practical applications.
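The bound of Eq. 26 is easy to confirm numerically (same notation as above; an illustrative sketch under the stated kernel shapes):

```python
def separable_subpixel_params(d_in, d_out, k=3, r=2):
    # Eqs. (23) + (25): depthwise k x k over d_in channels,
    # then pointwise d_in -> r*r*d_out
    return k * k * d_in + d_in * (r * r * d_out)

def hr_conv_params(d_in, d_out, k=3):
    # Eq. (21): dense convolution in HR space
    return k * k * d_in * d_out

# Eq. (26): ratio = 4/9 + 1/d_out for k = 3, r = 2 -- always below 1
for d_out in (2, 8, 64, 256):
    ratio = separable_subpixel_params(32, d_out) / hr_conv_params(32, d_out)
    assert abs(ratio - (4 / 9 + 1 / d_out)) < 1e-9
    assert ratio < 1
```

As D_{l+1} grows, the ratio approaches 4/9, i.e., the separable sub-pixel layer retains fewer than half the parameters of the HR convolution it is meant to emulate.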
One way to mitigate the 'checkerboard' effect is to disentangle the upsampling and convolution operations [53]. A usual procedure is to apply some deterministic upscaling followed by convolution in the high-resolution space. This has worked well in applications such as super-resolution [13] and inpainting [76]. But, when implemented naively, this increases the computational cost. For example, if we do a bilinear upscaling by a factor of 2 followed by convolution, there is a quadratic (4×) increase in feature size but the 'same information content' (counting the number of floats). This makes bilinear upsampling + convolution almost 4× costlier than transposed convolution. We optimize this concept by first upsampling with bilinear interpolation and then applying an efficient convolution block realized by the proposed LIST layer. This is our preferred method for efficient upsampling.

To maintain homogeneity in the network design, we prefer to realize spatial downsampling with the LIST layer as well. However, strided convolution is not trivially possible with the LIST module because of the initial 1×1 convolution stream. So, we first downsample the feature maps with bilinear interpolation and follow up with LIST-based efficient convolution.
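To see why this recovers efficiency, compare FLOP counts under a simple cost model. The transposed-convolution cost below is our illustrative assumption (one k×k·D_l·D_{l+1} kernel application per low-resolution position), not a formula from the paper:

```python
def transposed_conv_flops(h, w, d_in, d_out, k=3):
    # assumed cost model: one k x k kernel application per LR input position
    return h * w * k * k * d_in * d_out

def bilinear_conv_flops(h, w, d_in, d_out, k=3, r=2):
    # naive scheme: dense convolution runs on the upscaled (r*h, r*w) grid
    return (r * h) * (r * w) * k * k * d_in * d_out

def bilinear_list_flops(h, w, d_in, d_out, alpha=4, r=2):
    # Eq. (4) applied in HR space: (N/alpha) * (M + N + 9) MACs per position
    return (r * h) * (r * w) * (d_out / alpha) * (d_in + d_out + 9)

h, w, c = 64, 64, 64
naive = bilinear_conv_flops(h, w, c, c)
assert naive == 4 * transposed_conv_flops(h, w, c, c)   # the ~4x penalty
# replacing the HR convolution with LIST more than absorbs that penalty
assert bilinear_list_flops(h, w, c, c) < transposed_conv_flops(h, w, c, c)
```

Under this model, the ~18× per-layer saving of LIST (at M = N) comfortably outweighs the 4× cost of convolving in the upscaled space.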
We organize our results as follows. In Sec. IV-A, we first perform extensive studies to select the hyper-parameters governing the design choices of the various proposed modules, based on image inpainting. We systematically investigate the role of individual components in reducing parameters and FLOPs. This is followed by comparison with recent full-capacity inpainting baselines and with compressed models realized with MobileNet [29], ShuffleNet [82] and ShuffleNetV2 [48].
Next, with our understanding of the best network configurations, we examine the applicability of our proposed layers to image denoising (Sec. IV-B) and image super-resolution (Sec. IV-C). It is encouraging to note that the proposed layers are quite insensitive to hyper-parameters across different tasks, which allows us to reuse the same set of hyper-parameters across all three applications without degradation of visual quality.
We select the globally and locally consistent image inpainting model, GLCIC [30], as our baseline for image inpainting. GLCIC currently serves as a strong Generative Adversarial Network (GAN) [16] based contemporary baseline for inpainting, and we aim at realizing a lightweight version of GLCIC using our proposed layers. A GAN framework consists of two deep neural networks, a generator G and a discriminator D. The task of the generator is to generate an image G(z) from a latent noise prior vector z as input, where z is sampled from a known distribution p_z(z); a common choice [16] is a uniform distribution. The discriminator has to distinguish real samples (drawn from the data distribution p_data(x)) from generated samples. Discriminator and generator play the following two-player min-max game on the value function V(D, G):

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]    (27)
At its core, GLCIC comprises repeated applications of 3×3 convolution, 3×3 dilated convolution and transposed convolution layers. Please refer to [30] for details of the architecture. We replace the corresponding layers with the proposed LIST, GSAT and LIST-based upsampling layers.
Automated Visual Quality Metric: Manually analyzing the perceptual quality of reconstructions from different models is not feasible at scale. Recent works [39, 76] have shown that the PSNR and MS-SSIM metrics are not suitable for evaluating the quality of adversarial-loss-guided reconstructions. Analyzing the quality and diversity of GAN samples is still an open research topic. Recently, the Fréchet Inception Distance (FID) [27] was proposed for quantifying the quality and diversity of GAN samples; a lower FID value indicates overall better quality and diversity of generated samples. For automated screening of models, we use FID as the base metric.
Datasets: We experimented on CelebA (128×128) [43], CelebA-HQ (256×256) [33], Places2 (256×256) [87] and DTD (256×256) [11]. For CelebA, hole sizes greater than 48×48 occlude almost the entire face, so the maximum training hole size is 48×48 at a random location. For comparing FID during evaluation, a randomly positioned hole (the same for all models on a given image) of 48×48 is considered. At 256×256 image resolution, a maximum hole size of 96×96 is used during training and FID is reported at a hole size of 96×96. From CelebA, CelebA-HQ, Places2 and DTD we kept 20000, 10000, 20000 and 1000 (converted to 4000 with horizontal and vertical flips) samples for testing.
Training Details: In practice, we follow the stage-wise training procedure presented in [30]. In Stage-1, we pre-train the inpainting (generator) network alone with a Mean Squared Error (MSE) loss. In Stage-2, we freeze the parameters of the inpainting network and pre-train the critic (discriminator) network to distinguish between real and inpainted samples using a cross-entropy loss. In Stage-3, both completion and critic networks are iteratively updated under the min-max GAN game formulation [16].
Implementation Details
We first discuss how we select the design hyper-parameters of our network modules such as LIST and GSAT. For a given parameter setting, we train on the CelebA dataset and evaluate FID on the CelebA validation set (10000 samples). Due to a lack of massive computational resources, we ran the parameter search sweep only on CelebA and adopted the findings on the other datasets; it is encouraging to see that the lessons learned from CelebA generalize well to the other datasets. We set the numbers of iterations for the three training stages to 10, 10 and 10. Mini-batch gradient descent based optimization is performed with the ADAM [35] optimizer with batch size 64. Following [39], we perform paired two-sided Wilcoxon signed-rank tests with the significance level set to 10.
Table I: FID on the CelebA validation set (hole size 48×48) for the LIST hyper-parameter sweeps.

With 1/α = 0.25 fixed, sweeping 1/β (fraction of output channels from the 1×1 stream):

1/β   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8
FID   6.93   6.95   6.98   7.10   7.11   8.92   10.23  14.25

With 1/β = 0.5 fixed, sweeping 1/α (Stage-1 compression fraction):

1/α   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8
FID   8.11   7.31   6.92   6.85   6.83   6.80   6.78   6.76
Table II: FID on the CelebA validation set for different numbers of groups g in the GSAT layer.

g     2      4      8      16     32     64
FID   6.72   6.75   6.80   7.09   10.23  20.31
Design parameters for the LIST module: A LIST module is characterized by two hyper-parameters, the compression ratio α and the branching factor β. Firstly, we study the effect of reducing 3×3 kernels in the network by varying 1/β. To keep things constant, the dilation layer in each case was realized with normal dilated convolution and 1/α was fixed at 0.25. In Table I we report FID metrics on the CelebA validation set at a hole size of 48×48. Decreasing 1/β below 0.5 (pushing more computation to the 3×3 stream) does not improve FID appreciably but increases the model parameters, while FID deteriorates briskly with increasing 1/β (pushing more computation to the 1×1 stream). We thus keep β = 2 in our further experiments. Such a balance of channels along the two parallel processing streams is also recommended in [54, 44]. Next, we sweep over different settings of 1/α at a fixed 1/β = 0.5. Increasing 1/α improves the representation efficacy of the Stage-1 1×1 layer and thus aids FID, but at the cost of higher parameters. Beyond 1/α ≈ 0.35, the FID improvement almost saturates.
Finally, to find a threshold on FID aligned with human perception, we showed 100 inpainted images from each of five models with FID (models with FID 8 are perceptually not acceptable) to five independent raters, who were asked to rate each image on a scale of 1 (bad quality) to 5 (excellent quality). The differences of mean scores among models with FID 7.0 were statistically insignificant. With = 0.5, from Table I we see that = 0.30 yields a model in the regime of FID 7.0. Since channel counts in deep networks are usually powers of two, we proceed in the remainder of the paper with = 4 ( = 0.25) and = 2 ( = 0.5).
Number of Groups in the GSAT layer: Our proposed GSAT module is characterized by the number of groups, , of its group convolution layers. For simplicity of the parameter sweep, we keep the number of groups the same for the dilated 3×3 and 1×1 stages. In Table II we report FID scores on the CelebA validation set for different values of . For the other layers, all models used LIST with = 4 and = 2, as discussed in the previous section. A smaller value of places more computational load on the initial 3×3 layers and consequently yields better FID. However, at = 8, we get FID 7.0, which is perceptually acceptable. On the contrary, increasing further creates many independent feature volumes, and the combined channel shuffle and 1×1 group convolution cannot properly amalgamate the groups, leading to higher FID. So, for future experiments we set = 8 for GSAT layers.
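The channel shuffle that GSAT combines with 1×1 group convolution (in the spirit of ShuffleNet [82]) can be sketched in NumPy; the group convolutions themselves are omitted here:

```python
import numpy as np

def channel_shuffle(x, groups):
    # Channel shuffle as popularized by ShuffleNet: reshape the channel
    # axis to (groups, channels_per_group), transpose, and flatten, so
    # that subsequent group convolutions mix information across groups.
    n, c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(n, c, h, w)
```

With 6 channels and 2 groups, channels [0, 1, 2 | 3, 4, 5] are interleaved to [0, 3, 1, 4, 2, 5], so each new group sees features from every original group.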
Table III: Proposed architecture variants with their parameter and FLOP counts.

Model  3×3  Upsampling  Dilation  Params (10)  FLOPs (10)
  DS    Normal  3.42  33.1
  DS    Normal  2.93  27.1
  DS  BiL + DS  Normal  2.81  26.9
  LIST  BiL + DS  Normal  2.63  24.8
  LIST  BiL + LIST  Normal  2.61  24.0
    BiL + LIST    0.54  7.4
In Table III we define the proposed architecture variants and compare their parameters and FLOPs. Such an analysis provides a foundation for appreciating the effect of each speed-up technique.
In Table V we compare the FID scores of the different proposed models with the full-scale baseline models. Some key lessons from Tables III and V:
– Comparing and : The proposed LIST layer is a much more efficient alternative to the depthwise separable 3×3 convolution layer, while both models have similar reconstruction performance.
– Comparing and : The proposed GSAT layer, used in as an alternative to the normal dilated convolution layer, significantly reduces parameters without hampering visual quality. Since combines both LIST and GSAT layers, it is our preferred proposed model unless otherwise stated.
– In line with our theoretical justification, a network with separable subpixel convolution (model ) performs worse than a network with normal convolution based subpixel convolution (model ), as reflected by the higher FID scores of . See Fig. 5 for visualizations of such failures.
– Comparing and : The model with bilinear upsampling + separable convolution, , has fewer FLOPs than (upsampled with subpixel convolution) while achieving similar FID. It is thus prudent to use efficient bilinear upsampling, which we improve further with the proposed -based upsampling in .
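For reference, the pixel-shuffle (depth-to-space) rearrangement at the heart of subpixel convolution can be sketched in NumPy as follows. This is a generic sketch of the operation, not the paper's implementation:

```python
import numpy as np

def pixel_shuffle(x, r):
    # Depth-to-space rearrangement used by sub-pixel convolution:
    # (N, C*r^2, H, W) -> (N, C, H*r, W*r). Each group of r^2 input
    # channels becomes the r x r "phases" of one output channel.
    n, crr, h, w = x.shape
    c = crr // (r * r)
    assert c * r * r == crr
    x = x.reshape(n, c, r, r, h, w)
    x = x.transpose(0, 1, 4, 2, 5, 3)   # N, C, H, r, W, r
    return x.reshape(n, c, h * r, w * r)
```

When the convolution feeding this rearrangement is depthwise, each of the r² phase channels of an output block is produced by a separate per-channel filter, which is consistent with the degraded FID the paper reports for separable subpixel convolution.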
Since we design all our smaller models based on the architecture of GLCIC [30], it is fair to compare performance only against GLCIC as the baseline. However, for initial benchmarking of our designs we also compare against the recent state-of-the-art deep learning based models GIP [76] and Shift [74].
Reduction in Computation
In Table IV we report the parameter counts, FLOPs and mobile memory sizes. Our preferred model, , achieves almost 91% relative parameter savings compared to the parent GLCIC framework, with 88.6% and 93.5% relative savings in FLOPs and mobile memory, respectively.
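The quoted savings follow directly from the Table IV entries; a quick arithmetic check (units as in the table, with the exponents elided in this extraction):

```python
def relative_saving(baseline, ours):
    # Relative saving as a percentage: (baseline - ours) / baseline * 100.
    return 100.0 * (baseline - ours) / baseline

# Numbers from Table IV: GLCIC has 6.02 params and 65.0 FLOPs (table
# units), versus 0.54 and 7.4 for the preferred proposed variant.
params_saving = relative_saving(6.02, 0.54)  # ~91.0%
flops_saving = relative_saving(65.0, 7.4)    # ~88.6%
```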
Comparison of Reconstruction
In Table V we report FID metrics of the competing methods at 256×256 resolution on the CelebA-HQ, Places2, and DTD datasets. We did not find any significant difference in FID between any of our models (except ) and the full-scale baselines. In Fig. 6 we provide some inpainting examples from GLCIC, GIP, Shift and our preferred proposed model, . Clearly, the reconstructions of our smaller model are indistinguishable from those of the full-scale baselines.
Table IV: Parameters and FLOPs of baselines and proposed variants.

  GLCIC  GIP  Shift

FLOPs (10)  65.0  41.2  70.1  33.1  27.1  26.9  7.4 
Params (10)  6.02  2.98  6.24  3.42  2.93  2.81  0.54 
Table V: FID scores of full-scale baselines and proposed efficient variants.

Dataset  Full-Scale Baselines
  GIP  Shift  GLCIC
CelebA  6.98  6.95  7.00
CelebA-HQ  8.12  8.00  8.05
Places2  13.10  13.00  13.25
DTD  6.00  6.01  6.04

Dataset  Proposed Efficient Variants
CelebA  7.12  23.41  7.11  7.09  7.11  7.03
CelebA-HQ  8.09  27.21  8.16  8.14  8.10  8.09
Places2  13.27  30.41  13.27  13.29  13.30  13.39
DTD  6.04  18.21  6.84  6.06  6.06  6.04
Table VI: Mean opinion scores (MOS) for inpainting.

Dataset  GIP  Shift  GLCIC  Original
CelebA  4.24  4.30  4.25  4.18  4.22  4.42
CelebA-HQ  4.17  4.20  4.14  4.13  4.18  4.72
Places2  4.00  3.97  3.95  3.93  3.98  4.60
DTD  4.36  4.38  4.41  4.32  4.40  4.55
Mean Opinion Score (MOS) Testing: To further bolster our findings, we conducted MOS testing to visually quantify the quality of inpainting by different models. Raters were asked to rate an inpainted image on a scale of 1 (bad quality) to 5 (excellent quality). A total of 20 raters were selected for the study. From each dataset, each rater was shown 50 inpainted images from GIP, Shift, GLCIC and the proposed models; original images were also rated. So, each rater rated 1200 samples (4 datasets × 6 models × 50 images). We used two randomly positioned 64×64 holes (identical across all models for a given image). In Table VI we report the MOS for each dataset. Encouragingly, MOS follows the trend of the FID scores. As with our FID findings, the differences in MOS between our models and any of the full-scale baselines are not significant.
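The MOS aggregation above reduces to simple averaging over raters and images; a sketch with hypothetical function names (the paper's significance testing uses paired Wilcoxon signed-rank tests, which are not reimplemented here):

```python
import numpy as np

def mean_opinion_score(ratings):
    # ratings: scores in {1, ..., 5}, one per (rater, image) pair.
    ratings = np.asarray(ratings, dtype=float)
    assert ratings.min() >= 1 and ratings.max() <= 5
    return float(ratings.mean())

def paired_mean_difference(ratings_a, ratings_b):
    # Paired difference of scores for the *same* images rated under two
    # models; this is the pairing the signed-rank test operates on.
    a = np.asarray(ratings_a, dtype=float)
    b = np.asarray(ratings_b, dtype=float)
    assert a.shape == b.shape
    return float(np.mean(a - b))
```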
Table VII: Execution times (in seconds) on mobile devices and a laptop CPU.

Device  GLCIC  GIP  Shift

Mi A1  8.2  5.5  9.2  1.9  1.6  1.4  0.8 
Motorola  8.0  5.4  9.1  1.7  1.5  1.2  0.7 
Asus  5.8  3.1  6.2  1.0  0.8  0.6  0.35 
CPU  2.1  0.8  1.4  0.49  0.42  0.38  0.30 
For comparison on mobile we selected two low-end devices, the Mi A1 and Motorola G5 S Plus, and one high-end Asus Zenfone 5Z, all running the Android operating system. The Mi and Motorola have a 1.9 GHz Qualcomm Snapdragon 625 processor, while the Asus has a 2.8 GHz Snapdragon 845 processor. TensorFlow Lite [46] was used for mobile execution, and the framework was executed on a single thread. In Table VII we report the execution times on 256×256 images. Our preferred model, , consistently runs in milliseconds, compared to multiple seconds for the full-scale baselines. It is also evident that the newer generation processor in the Asus phone enables faster execution than the lower-end Mi and Motorola models. We also profiled the execution times of the models on the CPU of a regular commodity laptop with an Intel i5 processor and 8 GB RAM @ 2.2 GHz, without any GPU acceleration. It is encouraging that even without a GPU, model inpaints approximately 3.3 images per second, compared to 0.9, 1.25, and 0.7 images per second by [30, 76, 74], respectively. Another encouraging observation is that subpixel convolution based upsampling (models and ) is slower on resource-constrained mobile platforms than the proposed bilinear upsampling followed by efficient convolution. This is attributed to the computationally heavy pixel-shuffle operation in and . However, on a more resourceful platform such as a CPU, this difference vanishes. This observation further strengthens the pragmatism of bilinear upsampling based efficient upsampling over pixel-shuffle based upsampling.

Table VIII: Comparison with MobileNet and ShuffleNet based variants of GLCIC.

Method  FLOPs  Params  Mobile  CPU  Memory

(10)  (10)  (s)  (s)  (MB)  
GLCIC  65.0  6.02  5.8  2.1  40.1 
GLCIC  9.8  0.68  0.52  1.1  4.3 
GLCIC  11.4  0.79  0.86  1.3  5.7 
GLCIC  10.6  0.70  0.68  1.2  4.9 
GLCIC(Proposed)  7.4  0.54  0.35  0.3  2.6 
We also designed cheaper variants of the GLCIC baseline using efficient convolution units from MobileNet [29], ShuffleNet [82] and ShuffleNetV2 [83]. However, as discussed earlier, these frameworks target classification tasks and lack efficient designs for dilated convolution and upsampling operations. For example, both ShuffleNet and ShuffleNetV2 units are invalid in layers where the numbers of input and output channels differ, which is a common design in any upsampling layer. We could have used the usual full-scale dilated and transposed convolutions in these three frameworks, but for fair comparison with our compressed networks we add two modifications to these competing frameworks. First, for dilated convolution, we perform a dilated 3×3 depthwise convolution followed by a 1×1 pointwise convolution; this in itself can be seen as a novel, cheaper way of designing a dilated convolution layer. Second, for upsampling, we perform bilinear upsampling followed by separable convolution. With these modified settings, we did not find any marked difference in visual quality between the cheaper models and the baselines (samples are provided in the supplementary material due to space constraints). From Table VIII we see that our recommended model, , is much more computationally efficient than the MobileNet and ShuffleNet variants and, more importantly, has all the necessary components to be used seamlessly in 'image-to-image' translation tasks.
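The first modification is cheap for the same reason ordinary separable convolution is: dilation changes only the sampling grid, not the weight count. A parameter-count sketch with hypothetical channel widths:

```python
def full_dilated_params(c_in, c_out):
    # A dilated 3x3 convolution has the same weight count as an
    # ordinary 3x3 convolution; dilation only spreads the taps apart.
    return c_in * c_out * 9

def separable_dilated_params(c_in, c_out):
    # Dilated 3x3 depthwise (one 3x3 filter per input channel)
    # followed by a 1x1 pointwise convolution.
    return c_in * 9 + c_in * c_out

full = full_dilated_params(128, 128)      # 147456 weights
sep = separable_dilated_params(128, 128)  # 1152 + 16384 = 17536 weights
```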
Table IX: AWGN denoising performance (PSNR/SSIM) on BSD68.

Noise Level ()  10  15  25  50
  PSNR (dB)  SSIM  PSNR (dB)  SSIM  PSNR (dB)  SSIM  PSNR (dB)  SSIM
DnCNN  33.78  0.92  31.75  0.89  29.23  0.83  26.29  0.72
DnCNN (Proposed)  33.66  0.92  31.50  0.88  29.11  0.82  26.10  0.71
In this section we show the applicability of our modules for reducing the computational cost of recent state-of-the-art image denoising networks. Henceforth, in all experiments we use the design strategy and components of our variant, , to realize a cheaper version of a given baseline. We initially experimented with the 'DnCNN' framework of Zhang et al. [81] for synthetic Additive White Gaussian Noise (AWGN) removal. We term our proposed smaller variant DnCNN.
We also experimented with compressing the more recent CBDNet model [18], which has shown appreciable performance on real-world unknown noise removal and has immediate applications in today's AI-enabled cameras. We term the smaller model CBDNet.
Synthetic Dataset: We first compared the performance of our cheaper realization of DnCNN on synthetic AWGN using the widely used BSD68 [58] dataset of 68 test images. We experimented with four different zero-mean noise levels of 10, 15, 25 and 50. We followed DnCNN in using 400 images of size for training the network; random patches of size were sampled during training.
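The AWGN corruption and random patch sampling used to build training pairs can be sketched as follows (image and patch sizes here are hypothetical, since the exact values were lost in extraction):

```python
import numpy as np

def add_awgn(img, sigma, rng):
    # Additive white Gaussian noise with zero mean and std `sigma`
    # (image assumed in [0, 255]; values are clipped after corruption).
    noisy = img + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0.0, 255.0)

def random_patch(img, size, rng):
    # Sample a square training patch uniformly at random.
    y = rng.integers(0, img.shape[0] - size + 1)
    x = rng.integers(0, img.shape[1] - size + 1)
    return img[y:y + size, x:x + size]

rng = np.random.default_rng(0)
clean = np.full((64, 64), 128.0)       # hypothetical flat test image
noisy = add_awgn(clean, 25, rng)       # sigma = 25, one of the four levels
patch = random_patch(noisy, 40, rng)   # hypothetical patch size
```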
Real-World Dataset: We also experimented with datasets perturbed by noise from real-life unknown noise distributions, as typically encountered when capturing pictures with contemporary cameras. For this, we followed the procedures of CBDNet for training our models. A combination of synthetically noised images and real noisy images (120 from RENOIR [3], 400 from BSD500 [50], 1600 from Waterloo [47], and 1600 from MIT-Adobe FiveK [6]) was used for training.
Table X: Computational comparison of denoising models.

Method  FLOPs  Params  Mobile  CPU  Memory
(10)  (10)  (s)  (s)  (MB)
DnCNN  36.73  0.55  3.43  0.58  7.8  
CBDNet  36.09  4.34  3.30  0.49  29.4  
DnCNN  5.73  0.08  0.34  0.20  0.46  
DnCNN  7.44  0.18  0.41  0.24  0.6  
DnCNN  7.30  0.14  0.37  0.22  0.5  
CBDNet  6.23  0.6  0.40  0.27  3.1  
CBDNet  8.09  0.72  0.52  0.30  4.3  
CBDNet  7.60  0.69  0.48  0.28  4.1  

2.97  0.04  0.16  0.11  0.21  

4.12  0.41  0.25  0.14  1.93 
Since all the models are trained to minimize a reconstruction loss (rather than an adversarial loss), it is pragmatic to compare them directly in terms of PSNR and SSIM (Structural Similarity Index) instead of FID. Moreover, FID calculation requires at least a few thousand samples, while our test sets contain only a few hundred, so FID would not be a faithful representation of performance.
Denoising on Synthetic Dataset:
In Table IX we report the denoising performance of the baseline DnCNN and our proposed DnCNN in terms of PSNR and SSIM for AWGN removal. SSIM measures the structural similarity between two images; SSIM = 1 indicates a perfect match. Across all noise levels, our model performs comparably to the DnCNN baseline.
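Both metrics are straightforward to compute; the SSIM below is a simplified single-window (global) variant for illustration, whereas the standard metric averages over local windows:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    # Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE).
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def ssim_global(x, y, peak=255.0):
    # Single-window SSIM over the whole image (a simplification of the
    # usual locally windowed SSIM); equals 1 for identical images.
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

A uniform error of 10 gray levels on an 8-bit image gives a PSNR of about 28.13 dB.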
Denoising on Real Dataset:
For quantitative evaluation we used the publicly available PolyU dataset [73], containing pairs of real-world noisy and ground-truth images. The average PSNR and SSIM for the full-scale CBDNet are 37.95 dB and 0.951, while for the proposed CBDNet they are 37.29 dB and 0.948. Again, the differences are not significant. It is encouraging that even on real-world noise removal, our compressed variant performs on par with the full-scale CBDNet. Some visual comparisons are provided in Fig. 7.
Additionally, for qualitative evaluations, we used the highresolution DND [55] dataset in which the ground truths are not publicly available. Due to size limitations we include DND results in this Google Drive link.
Human Rating:
In Table XI we report the MOS on the different datasets. For each dataset, each subject was shown 20 random pairs of noisy and denoised images (the latter either from a baseline or from our compressed variant). A total of 10 raters participated in the study. The grading strategy (1 to 5) was kept the same as in the inpainting study. We did not find any statistically significant difference (significance level set to 10) between the MOS of the baselines and our variant on any dataset.
We report the total number of parameters for the full-scale baseline models DnCNN and CBDNet and our proposed compressed versions in Table X. We achieve 87.27% relative parameter savings on DnCNN and 90.2% on CBDNet. Since the models are fully convolutional, images of arbitrary resolution can be processed, so reporting a single FLOP count is not possible. For reference, however, Table X reports the FLOPs for processing a 256×256 input image. The proposed CBDNet achieves 89.4% relative savings in FLOPs compared to CBDNet. We also compare against compressed variants of DnCNN and CBDNet built with MobileNet, ShuffleNet and ShuffleNetV2 modules. Our proposed variant is more efficient in terms of memory requirement and FLOPs than both the MobileNet and ShuffleNet variants.
In Table X we compare the execution times (at 256×256) on mobile (Asus) and CPU, along with the model sizes for mobile deployment. Both of our proposed variants are computationally more economical than the full-scale baselines as well as the MobileNet and ShuffleNet variants.
Image denoising is an essential component of the majority of contemporary AI-enabled smartphones, and the above results make our compressed variants a natural substitute for the full-scale models on mobile platforms.
Table XI: MOS for denoising.

Dataset  DnCNN  CBDNet  Proposed  Original

Set68 (10)  4.58    4.60  4.72 
Set68 (15)  4.34    4.32  4.72 
Set68 (25)  4.10    4.11  4.72 
Real (PolyU)    4.50  4.49  4.83 
In this section we showcase the efficacy of our modules for single image super-resolution. We consider the benchmark SRGAN model [39] as the baseline for 4× upscaling. The baseline network consists of a series of residual blocks (realized with 3×3 convolutions), and upsampling is achieved with subpixel convolution using the pixel-shuffle operation. We again follow the design principles of our model, , to realize a cheaper variant of SRGAN.
We used the training partition of the Places2 dataset [86] to train the baseline and proposed models. As in SRGAN, we tested the models on the Set5 [5], Set14 [79], and BSD100 (testing set of BSD300 [49]) datasets. Following [39], we randomly cropped a 96×96 patch from a given image as the HR (high resolution) target and downsampled it by 4× with bicubic interpolation to create the corresponding LR (low resolution) input.
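The LR/HR pair construction can be sketched as follows. A box filter stands in here for the bicubic kernel used in the paper, to keep the sketch dependency-free:

```python
import numpy as np

def make_lr_hr_pair(img, patch=96, scale=4, rng=None):
    # Crop a random HR patch and produce the LR input by box-filter
    # downsampling (a stand-in for bicubic interpolation).
    rng = rng or np.random.default_rng()
    y = rng.integers(0, img.shape[0] - patch + 1)
    x = rng.integers(0, img.shape[1] - patch + 1)
    hr = img[y:y + patch, x:x + patch]
    # Average each scale x scale block to obtain the low-resolution input.
    lr = hr.reshape(patch // scale, scale, patch // scale, scale).mean(axis=(1, 3))
    return lr, hr

rng = np.random.default_rng(1)
lr, hr = make_lr_hr_pair(np.full((128, 128), 7.0), rng=rng)
```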
We follow the exact stage-wise training protocol of [39]. Initially, we train the network with only the reconstruction loss; the authors term this network SRResNet, and we term our smaller counterpart SRResNet. Next, we fine-tune the network with the VGG54 content loss and an adversarial loss; the network at this stage is termed SRGAN for the baseline and SRGAN for our proposed smaller network.
Quantitative Comparison:
In Table XII we first compare the PSNR (in dB) of SRResNet and SRResNet. Since both models are trained with the MSE loss, PSNR is the appropriate metric. Based on the average PSNR, we could not find any significant difference (significance level set to 10) between the two models.
Qualitative Comparison: Next, we conducted a MOS test for the two models with 10 independent raters. Each rater was shown the original HR image and the super-resolved versions produced by the SRGAN and SRGAN networks. In Table XIV we report the MOS on the three datasets. Again, we could not find any significant difference between the scores received by the models. In Fig. 8 we visualize some super-resolved images from the two models; it is visually challenging to distinguish samples of the full-scale SRGAN baseline from those of our cheaper variant. More examples are provided in the supplementary material.
In Table XIII we report the total number of parameters and FLOPs of the different models. FLOPs were calculated on the BSD100 dataset, in which the original images are typically of dimension 480×320 or 320×480; for 4× super-resolution, the input resolution is therefore 120×80 or 80×120. Compared to the SRGAN baseline, our proposed cheaper variant SRGAN achieves relative parameter and FLOPs savings of 88.4% and 99%. The proposed model is also appreciably cheaper than the MobileNet and ShuffleNet variants.
Table XII: Average PSNR (dB) of SRResNet and the proposed variant.

Dataset  SRResNet  SRResNet (Proposed)

Set5  31.85  31.72 
Set14  27.90  27.74 
BSD100  27.01  26.90 
Table XIII: Computational comparison of super-resolution models.

Method  FLOPs  Params  Mobile  CPU  Memory

(10)  (10)  (s)  (s)  (MB)  
SRGAN  38.4  1.55  3.48  0.65  12.8  
SRGAN  0.42  0.19  0.05  0.03  1.1  
SRGAN  1.08  0.42  0.11  0.07  2.2  
SRGAN  1.05  0.40  0.09  0.05  1.7  

0.27  0.10  0.02  0.01  0.5 
Table XIV: MOS for super-resolution.

Dataset  SRGAN  SRGAN (Proposed)  Original

Set5  3.62  3.64  4.45 
Set14  3.69  3.72  4.41 
BSD100  3.50  3.49  4.23 
In Table XIII we report the mobile model sizes and the execution times on the Asus mobile and the commodity CPU. The proposed variant saves 92.1%, 47.1%, 76.1% and 75.0% of mobile memory compared to the SRGAN baseline, MobileNet, ShuffleNet and ShuffleNetV2 variants, respectively. Execution speeds are reported on the BSD100 dataset. Our proposed variant achieves significant speedup and FLOP reduction compared to the full-scale baseline and even the MobileNet and ShuffleNet versions.
In this paper we introduced several convolutional building blocks for low-level restoration tasks. Our proposed modules, LIST and GSAT, were shown to be task-agnostic and to generalize across a variety of restoration tasks. We showed that with specific design considerations, the LIST layer can be made computationally cheaper than the contemporary de facto choices of depthwise separable and group convolution based 3×3 layers. We analytically and empirically analyzed the shortcomings of using depthwise separable kernels to realize subpixel convolution based upsampling in an encoder-decoder configuration. Instead, we showed that homogeneity of the network structure can be maintained by deterministic upsampling (rather than transposed convolution or pixel-shuffle based upsampling) followed by efficient convolution with the LIST layer. Extensive evaluations on resource-constrained platforms demonstrated the effectiveness of our modules for designing computationally efficient yet visually accurate models.
This work is funded by a Google PhD Fellowship and a Qualcomm Innovation Fellowship awarded to Avisek.
Avisek is a Ph.D. candidate at the Indian Institute of Technology Kharagpur, where he is focusing on image/video reconstruction tasks such as inpainting and super-resolution. His other research interests include data-efficient training of deep neural networks. He is a recipient of the Google PhD Fellowship and twice a recipient of the Qualcomm Innovation Fellowship. Avisek was selected as a Young Researcher by the Heidelberg Laureate Forum, 2019. Prior to his Ph.D., Avisek completed his M.S. (by research) from IIT Kharagpur with a focus on statistical machine learning.
Sourav received his M.Tech degree from the Department of Electronics and Electrical Communication Engineering, IIT Kharagpur, Kharagpur, India. He is currently working as an advanced deep learning engineer at MathWorks India Pvt. Ltd., Hyderabad. He received the "Institute Silver Medal" for his academic performance at IIT Kharagpur. He received the "Best Student Award" with a gold medal for academic performance during his B.Tech at Kalyani Government Engineering College, Kalyani, India. He has also received many scholarships and awards from the Government of West Bengal for securing the 3rd rank at state level in the Class X Board examination (Madhyamik) and the 8th rank at state level in the Class XII Board examination (Higher Secondary). His current research interests lie in deep learning, machine vision and generative adversarial models.
Sutanu received the B.S. degree in Electronics and Communication Engineering from the West Bengal University of Technology in 2015 and the M.S. degree in Medical Imaging and Informatics from the Indian Institute of Technology Kharagpur in 2018, for which he was awarded the Institute Silver Medal. He is currently a Ph.D. student in the Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology Kharagpur. His current research interests include deep learning for image restoration, medical image restoration, and low-level image processing.
Siddhant is a final year integrated master’s student at Indian Institute of Technology Kharagpur. He is majoring in Electrical Engineering with a minor in Computer Science. He has been doing research in the domain of computer vision, natural language processing, adversarial attacks and causal inference. 
Prof. Prabir Kumar Biswas (M’93–SM’03) received the B.Tech. (Hons.), M.Tech., and Ph.D. degrees from IIT Kharagpur, Kharagpur, India, in 1985, 1989, and 1991, respectively. He was a Visiting Fellow with the University of Kaiserslautern, Kaiserslautern, Germany, under the Alexander von Humboldt Research Fellowship from 2002 to 2003. Since 1991, he has been a Faculty Member with the Department of Electronics and Electrical Communication Engineering, IIT Kharagpur, where he is currently a Professor, and is the Head of the Department. He has authored over 100 research publications in international and national journals and conferences, and has filed seven international patents. His current research interests include image processing, pattern recognition, computer vision, video compression, parallel and distributed processing, and computer networks. 
Shift-Net: Image inpainting via deep feature rearrangement. In ECCV, pages 1–17, 2018.
Robust LSTM-autoencoders for face de-occlusion in the wild. IEEE Transactions on Image Processing, 27(2):778–790, 2017.
Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2018.