
Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models

11/03/2022
by   Muyang Li, et al.

During image editing, existing deep generative models tend to re-synthesize the entire output from scratch, including the unedited regions. This leads to a significant waste of computation, especially for minor editing operations. In this work, we present Spatially Sparse Inference (SSI), a general-purpose technique that selectively performs computation for edited regions and accelerates various generative models, including both conditional GANs and diffusion models. Our key observation is that users tend to make gradual changes to the input image. This motivates us to cache and reuse the feature maps of the original image. Given an edited image, we sparsely apply the convolutional filters to the edited regions while reusing the cached features for the unedited regions. Based on our algorithm, we further propose Sparse Incremental Generative Engine (SIGE) to convert the computation reduction to latency reduction on off-the-shelf hardware. With about 1.2% area edits, our method reduces the computation of DDIM by 7.5× and GauGAN by 18× while preserving the visual fidelity. With SIGE, we accelerate the speed of DDIM by 3.0× on RTX 3090 and 6.6× on Apple M1 Pro CPU, and GauGAN by 4.2× on RTX 3090 and 14× on Apple M1 Pro CPU.




1 Introduction

Deep generative models, such as GANs goodfellow2014generative; karras2019style and diffusion models sohl2015deep; ho2020denoising; song2020denoising, excel at synthesizing photo-realistic images, enabling many image synthesis and editing applications. For example, users can edit an image by drawing sketches isola2017image; sangkloy2017scribbler, semantic maps isola2017image; park2019semantic, or strokes meng2022sdedit. All of these applications require users to interact with generative models frequently and therefore demand short inference time.

In practice, content creators often edit images gradually and only update a small image region each time. However, even for a minor edit, recent generative models often synthesize the entire image, including the unchanged regions, which leads to a significant waste of computation. As a concrete example shown in Figure 2(a), the result of the previous edits has already been computed, and the user further edits 9.4% of the image area. However, the vanilla DDIM song2020denoising needs to generate the entire image to obtain the newly edited regions, wasting 80% of the computation on the unchanged regions. A naive approach to address this issue would be to first segment the newly edited regions, synthesize the corresponding output regions, and blend the outputs back into the previous output. Unfortunately, this method often creates visible seams between the newly edited and unedited regions. How could we save computation by only updating the edited regions without losing global coherence?

Figure 1: We introduce Spatially Sparse Inference, a general-purpose method to selectively perform computation at the edited regions for image editing applications. Our method reduces the computation of DDIM song2020denoising by 7.5× and GauGAN park2019semantic by 18× for the examples shown in the figures while preserving the image quality. When combined with existing model compression methods such as GAN Compression li2020gan, our method further reduces the computation of GauGAN by 50×.

In this work, we propose Spatially Sparse Inference (SSI), a general method to accelerate deep generative models, including conditional GANs and diffusion models, by utilizing the spatial sparsity of edited regions. Our method is motivated by the observation that feature maps at the unedited regions remain mostly the same during user editing. As shown in Figure 2(b), our key idea is to reuse the cached feature maps of the previous edits and sparsely update the newly edited regions. Specifically, given user input, we first compute a difference mask to locate the newly edited regions. For each convolution layer in the model, we sparsely apply the filters to the masked regions while reusing the previous activations for the unchanged regions. The sparse update can significantly reduce the computation without hurting the image quality. However, the sparse update involves a gather-scatter process, which often incurs significant latency overheads in existing deep learning frameworks. To address this issue, we propose Sparse Incremental Generative Engine (SIGE) to translate the theoretical computation reduction of our algorithm into measured latency reduction on various hardware.

To evaluate our method, we automatically create new image editing benchmark datasets on LSUN Church yu15lsun and Cityscapes cordts2016cityscapes. Without loss of visual fidelity, we reduce the computation of DDIM song2020denoising by 7.5×, Progressive Distillation salimans2021progressive by 2.7×, and GauGAN by 18×, measured in MACs (we measure the computational cost with the number of Multiply-Accumulate operations; 1 MAC = 2 FLOPs). Compared to existing generative model acceleration methods li2020gan; hou2021slimmable; fu2020autogan; li2022learning; jin2021teachers; shaham2021spatially; wang2020gan, our method directly uses the off-the-shelf pre-trained weights and can be applied to these methods as a plugin. When applied to GAN Compression li2020gan, our method reduces the computation of GauGAN by 40×. See Figure 1 for some examples of our method. With SIGE, we accelerate DDIM by 3.0× on an RTX 3090 GPU and 6.6× on an Apple M1 Pro CPU, and GauGAN by 4.2× on an RTX 3090 GPU and 14× on an Apple M1 Pro CPU. Our code and benchmarks are available at https://github.com/lmxyy/sige.

2 Related Work

Generative models.

Generative models such as GANs goodfellow2014generative; karras2019style; karras2020analyzing; brock2018large, diffusion models ho2020denoising; sohl2015deep; dhariwal2021diffusion, and auto-regressive models esser2021taming; razavi2019generating have demonstrated impressive photorealistic synthesis capability. They have also been extended to conditional image synthesis tasks such as image-to-image translation saharia2021palette; isola2017image; zhu2017unpaired; zhu2020sean, controllable image generation meng2022sdedit; nichol2021glide; park2019semantic, and real image editing choi2021ilvr; nichol2021glide; kim2021diffusionclip; zhu2016generative; patashnik2021styleclip; abdal2019image2stylegan; abdal2020image2stylegan++; zhu2020sean. Unfortunately, recent generative models have become increasingly computationally intensive compared to their recognition counterparts. For example, GauGAN park2019semantic consumes 281G MACs, 500× more than MobileNet howard2019searching; howard2017mobilenets; sandler2018mobilenetv2. Similarly, one key limitation of diffusion models ho2020denoising is their long inference time and substantial computation cost. To generate one image, DDPM requires hundreds or thousands of forward steps ho2020denoising; dhariwal2021diffusion, which is often infeasible in real-world interactive settings. To improve the sampling efficiency of DDPMs, recent works song2020denoising; song2020score; kong2021fast propose to interpret the sampling process of DDPMs from the perspective of ordinary differential equations. However, these approaches still require hundreds of steps to generate high-quality samples. To further reduce the sampling cost, DDGAN xiao2022DDGAN uses a multimodal conditional GAN to model each denoising step. Salimans et al. salimans2021progressive propose to progressively distill a pre-trained DDPM model into a new model that requires fewer steps. Although this approach drastically reduces the number of sampling steps, the distilled model itself remains computationally prohibitive. Unlike prior work, our work focuses on reducing the computation cost of a pre-trained model. It is complementary to recent efforts on model compression, distillation, and sampling-step reduction for diffusion models.

Model acceleration.

People apply model compression techniques, including pruning han2016deep; he2018amc; lin2017runtime; he2017channel; liu2017learning; liu2019metapruning and quantization han2016deep; zhou2016dorefa; rastegari2016xnor; wang2019haq; choi2018pact; jacob2018quantization, to reduce the computation and model size of off-the-shelf deep learning models. Recent works apply Neural Architecture Search (NAS) zoph2017neural; zoph2018learning; liu2019darts; cai2019proxylessnas; tan2019mnasnet; wu2019fbnet; lin2020mcunet to automatically design efficient neural architectures. The above ideas have been successfully applied to accelerate the inference of GANs li2020gan; lin2021anycost; shu2019co; liu2021content; hou2021slimmable; ma2021cpgan; fu2020autogan; li2022learning; jin2021teachers; shaham2021spatially; wang2020gan; aguinaldo2019compressing. Although these methods achieve prominent compression and speedup ratios, they all reduce the computation along the model dimension and fail to exploit the redundancy in the spatial dimension during image editing. Besides, these methods require re-training the compressed model to maintain performance, while our method can be directly applied to existing pre-trained models. We show that our method can be combined with model compression li2020gan to achieve a 40× MACs reduction in Section 4.1.

Sparse computation.

Sparse computation has been widely explored in the weight domain han2015learning; li2016pruning; liu2015sparse; jaderberg2014speeding, input domain tang2022torchsparse; riegler2017octnet, and activation domain ren2018sbnet; judd2017cnvlutin2; shi2017speeding; dong2017more. For activation sparsity, RRN pan2018recurrent utilizes the sparsity in consecutive video frame differences to accelerate video models. However, this sparsity is unstructured, which requires special hardware to reach its full speedup potential. Several works instead use structured sparsity. Li et al. li2017not use a deep layer cascade to apply more convolutions to the hard regions than the easy regions to improve the accuracy and speed of semantic segmentation. To accelerate 3D object detection, SBNet ren2018sbnet uses a spatial mask, either from a priori problem knowledge or an auxiliary network, to sparsify the activations. It adopts a tiling-based sparse convolution algorithm to handle spatial sparsity. Recent works further integrate the spatial mask generation network into the sparse inference network in an end-to-end manner verelst2020dynamic and extend the idea to different tasks wang2021exploring; han2021spatially; wang2022adafocus; parger2022deltacnn. Compared to SBNet ren2018sbnet, our mask is directly derived from the difference between the original and edited images. Additionally, our method does not require any auxiliary network or extra model training. We also introduce other optimizations, such as normalization removal and kernel fusion, to better adapt our engine to image editing.

3 Method

Figure 2: In the interactive editing scenario, a user adds a new building, which occupies 9.4% of the pixels. (a) Vanilla DDIM has to regenerate the entire image, even though only 9.4% of the area is edited. (b) Our method instead reuses the feature maps of the previous edits and only applies convolutions to the newly edited regions sparsely, which substantially reduces the MACs in this case.

We build our method based on the following observation: during interactive image editing, a user often only changes the image content gradually. As a result, only a small subset of pixels in a local region is updated at any moment. Therefore, we can reuse the activations of the original image for the unedited regions. As shown in Figure 3, we first pre-compute all activations of the original input image. During the editing process, we locate the edited regions by computing a difference mask between the original and edited images. We then reuse the pre-computed activations for the unedited regions and only update the edited regions by applying convolutional filters to them. In Section 3.1, we show the sparsity in the intermediate activations and present our main algorithm. In Section 3.2, we discuss the technical details of how our Sparse Incremental Generative Engine (SIGE) supports sparse inference and converts the theoretical computation reduction into a measured speedup on hardware.

3.1 Activation Sparsity

Preliminary.

First, we closely study the computation within a single layer. We denote $A_l^\text{original}$ and $A_l^\text{edited}$ as the input tensors of the original image and the edited image to the $l$-th convolution layer $F_l$, respectively. $W_l$ and $b_l$ are the weight and bias of $F_l$. Due to the linearity of convolution, the output of $F_l$ with input $A_l^\text{edited}$ can be computed as

$$F_l(A_l^\text{edited}) = W_l * A_l^\text{edited} + b_l = W_l * (A_l^\text{original} + \Delta A_l) + b_l = F_l(A_l^\text{original}) + W_l * \Delta A_l,$$

where $*$ is the convolution operator and $\Delta A_l = A_l^\text{edited} - A_l^\text{original}$. If we have already pre-computed all the $F_l(A_l^\text{original})$, we only need to compute the term $W_l * \Delta A_l$. Naïvely, computing $W_l * \Delta A_l$ has the same complexity as computing $W_l * A_l^\text{edited}$. However, since the edited image shares similar features with the original image given a small edit, $\Delta A_l$ should be sparse. Below, we discuss different strategies to leverage the activation sparsity to accelerate model inference.
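To make the decomposition concrete, here is a minimal numerical sketch in PyTorch (our own illustration, not the released SIGE code; the tensor shapes and the edit location are arbitrary assumptions):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W = torch.randn(16, 8, 3, 3)                     # W_l: weight of the l-th convolution
b = torch.randn(16)                              # b_l: bias of the l-th convolution
A_orig = torch.randn(1, 8, 64, 64)               # A_l^original (cached)
delta = torch.zeros_like(A_orig)                 # Delta A_l: sparse activation difference
delta[..., 20:30, 20:30] = torch.randn(10, 10)   # a small localized "edit"
A_edit = A_orig + delta                          # A_l^edited

dense = F.conv2d(A_edit, W, b, padding=1)                                # F_l(A_l^edited)
incremental = F.conv2d(A_orig, W, b, padding=1) + F.conv2d(delta, W, padding=1)
print(torch.allclose(dense, incremental, atol=1e-4))                     # True: only W_l * Delta A_l is new
```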

Our first attempt was to prune $\Delta A_l$ by zeroing out elements smaller than a certain threshold to achieve a target sparsity. Unfortunately, this pruning method fails to achieve a measured speedup due to the overheads of the on-the-fly pruning and the irregular sparsity pattern.

Figure 3: Tiling-based sparse convolution overview. For each convolution $F_l$ in the network, we wrap it into SIGE Conv. The activations of the original image are pre-computed. Given the edited image, we first compute a difference mask between the original and edited images and reduce the mask to the active block indices to locate the edited regions. In each SIGE Conv, we directly gather the active blocks from the edited activation $A_l^\text{edited}$ according to the reduced indices, stack the blocks along the batch dimension, and feed them into $F_l$. The gathered blocks have an overlap of width 2 if $F_l$ is a 3×3 convolution ren2018sbnet. After getting the output blocks from $F_l$, we scatter them back into $F_l(A_l^\text{original})$ to get the edited output, which approximates $F_l(A_l^\text{edited})$.
Figure 4: Left: detailed edit example. Right: channel-wise average of $\Delta A_l$ at the $l$-th layer of DDIM for different feature map resolutions. $\Delta A_l$ is sparse, and its non-zero values are aggregated at the edited regions.

Structured sparsity.

Fortunately, user edits are often highly structured and localized. As a result, $\Delta A_l$ should also share this structured spatial sparsity: its non-zero values are mostly aggregated within the edited regions, as shown in Figure 4. We therefore directly use the original and edited images to compute a difference mask and sparsify $\Delta A_l$ with this mask.
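A minimal sketch of how such a mask can be derived (our own assumption of one reasonable realization; the released SIGE code may differ in thresholding and dilation details):

```python
import torch
import torch.nn.functional as F

def difference_mask(original, edited, threshold=0.0, dilation=1):
    """original, edited: (1, C, H, W) tensors in the same value range.
    Returns a (1, 1, H, W) boolean mask of the edited regions."""
    diff = (edited - original).abs().amax(dim=1, keepdim=True)   # max difference over channels
    mask = (diff > threshold).float()
    if dilation > 0:
        k = 2 * dilation + 1                                     # dilate via max pooling
        mask = F.max_pool2d(mask, kernel_size=k, stride=1, padding=dilation)
    return mask.bool()
```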

3.2 Sparse Engine SIGE

But how can we leverage the structured sparsity to accelerate the computation of $W_l * \Delta A_l$? A naïve approach is to crop a rectangular edited region out of $\Delta A_l$ for each convolution and only compute features for the cropped region. Unfortunately, this naïve cropping method works poorly for irregular edited regions (e.g., the example shown in Figure 4).

Tiling-based sparse convolution.

Instead, as shown in Figure 5(a), we use a tiling-based sparse convolution algorithm. We first downsample the difference mask to different scales and dilate the downsampled masks (by width 1 for diffusion models and 2 for GauGAN). Then we divide $\Delta A_l$ spatially into multiple small blocks of the same size and index the difference mask at the corresponding resolution: each block index refers to a single block with non-zero elements. We then gather the non-zero blocks (which we also call active blocks) along the batch dimension and feed them into the convolution $F_l$. Finally, we scatter the output blocks into a zero tensor according to the indices to recover the original spatial size and add the pre-computed residual $F_l(A_l^\text{original})$ back. The gathered active blocks have an overlap of width 2 for 3×3 convolutions to ensure that the output blocks of adjacent input blocks are stitched together seamlessly ren2018sbnet.
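The index-reduction step can be sketched as follows (a simplified illustration under our own assumptions, e.g. that the feature size is a multiple of the block size; the actual SIGE kernels perform this reduction on the GPU):

```python
import torch
import torch.nn.functional as F

def mask_to_block_indices(mask, feat_hw, block=6):
    """mask: (1, 1, H, W) boolean difference mask at image resolution.
    feat_hw: (h, w) of the target feature map (assumed multiples of `block`).
    Returns the top-left (row, col) corners of the active blocks."""
    m = F.interpolate(mask.float(), size=feat_hw, mode="nearest")   # downsample the mask
    h, w = feat_hw
    indices = []
    for r in range(0, h, block):
        for c in range(0, w, block):
            if m[0, 0, r:r + block, c:c + block].any():             # block touches the edit
                indices.append((r, c))
    return indices
```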

The pipeline in Figure 5(a) is equivalent to a simpler pipeline, shown in Figure 5(b). Instead of gathering $\Delta A_l$, we can directly gather $A_l^\text{edited}$. The convolution then needs to be computed with the bias $b_l$. Besides, we scatter the output blocks into $F_l(A_l^\text{original})$ instead of a zero tensor. Thus, we no longer need to store $A_l^\text{original}$, which further saves memory and removes the overheads of the addition and subtraction. Figure 3 visualizes this pipeline.
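For a single 3×3 convolution, the simplified pipeline of Figure 5(b) can be sketched roughly as follows (a Python-level illustration under our own assumptions; the real SIGE implementation uses fused custom kernels and also handles strides, 1×1 convolutions, and batching):

```python
import torch
import torch.nn.functional as F

def sparse_conv3x3(weight, bias, A_edit, F_A_orig, indices, block=6):
    """weight: (Cout, Cin, 3, 3), bias: (Cout,).
    A_edit: edited input activation (1, Cin, H, W).
    F_A_orig: cached output of the original image (1, Cout, H, W).
    indices: (row, col) corners of the active output blocks (block-aligned)."""
    A_pad = F.pad(A_edit, (1, 1, 1, 1))                      # halo for the 3x3 kernel
    # Gather: stack (block+2)x(block+2) patches along the batch dimension
    patches = torch.stack([A_pad[0, :, r:r + block + 2, c:c + block + 2]
                           for (r, c) in indices])
    out_blocks = F.conv2d(patches, weight, bias)             # valid conv -> block x block outputs
    # Scatter: write the edited blocks into the cached original output
    out = F_A_orig.clone()
    for i, (r, c) in enumerate(indices):
        out[0, :, r:r + block, c:c + block] = out_blocks[i]
    return out
```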

However, the aforementioned pipeline still fails to produce a noticeable speedup due to the extra kernel calls and memory movement overheads in Gather and Scatter. For example, an original dense convolution with 128 channels takes 0.78 ms on an RTX 3090, while the sparse convolution using the pipeline of Figure 5(b) on the example shown in Figure 4 (15.5% edited regions) needs 0.42 ms in total, of which Gather and Scatter introduce a significant overhead (0.17 ms, about 41%). To reduce these overheads, we further optimize SIGE by pre-computing normalization parameters and applying kernel fusion.

Figure 5: The tiling-based sparse convolution pipelines. (a) We first compute the activation difference $\Delta A_l$ and gather the active blocks from it along the batch dimension according to the indices reduced from the difference mask. We then feed the blocks into the convolution without bias, scatter the output into a zero tensor, and add the residual $F_l(A_l^\text{original})$ back. (b) We directly gather the blocks from $A_l^\text{edited}$ without computing $\Delta A_l$. The convolution is computed with the bias $b_l$. We scatter the output into $F_l(A_l^\text{original})$ instead of a zero tensor.

Pre-computing normalization parameters.

For batch normalization ioffe2015batch, it is easy to remove the normalization layer at inference time since we can use the mean and variance statistics pre-computed during model training. However, recent deep generative models often use instance normalization ulyanov2016instance; huang2017arbitrary or group normalization wu2018group; nichol2021improved, which compute the statistics on the fly during inference. These normalization layers incur overheads because we need to estimate the statistics from the full-size tensors. However, as the original and edited images are quite similar given a small user edit, we assume the normalization statistics of $A_l^\text{edited}$ are approximately equal to those of $A_l^\text{original}$. This allows us to reuse the statistics of $A_l^\text{original}$ for normalization instead of recomputing them for $A_l^\text{edited}$. Thus, the normalization layers can be replaced by simple Scale+Shift operations with pre-computed statistics.
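For group normalization, this replacement can look roughly like the sketch below (our own illustration of folding the cached statistics into a Scale+Shift; not the exact SIGE code, and it assumes the layer has affine parameters):

```python
import torch

class PrecomputedGroupNorm(torch.nn.Module):
    """Replace a GroupNorm layer by Scale+Shift with statistics cached from the original image."""
    def __init__(self, gn, A_orig):
        super().__init__()
        n, c, h, w = A_orig.shape
        g = gn.num_groups
        x = A_orig.reshape(n, g, -1)
        mean = x.mean(dim=-1).reshape(n, g, 1, 1, 1)
        var = x.var(dim=-1, unbiased=False).reshape(n, g, 1, 1, 1)
        weight = gn.weight.reshape(1, g, c // g, 1, 1)
        bias = gn.bias.reshape(1, g, c // g, 1, 1)
        scale = weight / (var + gn.eps).sqrt()          # fold cached stats into scale/shift
        shift = bias - scale * mean
        self.register_buffer("scale", scale.reshape(n, c, 1, 1))
        self.register_buffer("shift", shift.reshape(n, c, 1, 1))

    def forward(self, x):                               # no statistics computed at edit time
        return x * self.scale + self.shift
```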

Kernel fusion.

As mentioned before, both the Gather and Scatter operations introduce significant data movement overheads. To reduce them, we fuse several element-wise operations (Scale+Shift and Nonlinearity) into Gather and Scatter ren2018sbnet; ding2021ios; jia2019taso and only apply these element-wise operations to the active blocks (i.e., the edited regions). Furthermore, we perform in-place computation to reduce the number of kernel calls and the memory allocation overheads.

In Scatter, we need to copy the pre-computed activation $F_l(A_l^\text{original})$. This copying operation is highly redundant: given a small edit, most elements of $F_l(A_l^\text{original})$ do not involve any computation and will be discarded in the next Gather. To reduce the tensor copying overheads, we fuse the Scatter with the following Gather by directly gathering the active blocks from $F_l(A_l^\text{original})$ and from the input blocks to be scattered. Sometimes, the residual connection in the ResBlock he2016deep contains a shortcut convolution to match the channel number of the residual and the ResBlock output. We also fuse the Scatter in the shortcut branch, the Scatter in the main branch, and the residual addition together to avoid the tensor copying overheads of the shortcut Scatter. Please refer to Appendix A for more details.

4 Experiments

Below we first describe our models, baselines, datasets, and evaluation protocols. We then discuss our main qualitative and quantitative results. Finally, we include a detailed ablation study regarding the importance of each algorithmic design.

Models.

We conduct experiments on three models, including diffusion models and GAN-based models, to explore the generality of our method.


  • DDIM song2020denoising is a fast sampling approach for diffusion models. It proposes to interpret the sampling process of diffusion models through the lens of ordinary differential equations.

  • Progressive Distillation (PD) salimans2021progressive adopts network distillation hinton2015distilling to progressively reduce the number of steps for diffusion models.

  • GauGAN park2019semantic is a paired image-to-image translation model which learns to generate a high-fidelity image given a semantic label map.

Baselines.

We compare our methods against the following baselines:


  • Patch. We crop the smallest patch covering all the edited regions, feed it into the model, and blend the output patch into the original image.

  • Crop. For each convolution $F_l$, we crop the smallest rectangular region that covers all masked elements of the activation $A_l^\text{edited}$, feed it into $F_l$, and scatter the output patch into $F_l(A_l^\text{original})$.

  • 40% Pruning. We uniformly prune 40% weights of the models without further fine-tuning, as our method directly uses the pre-trained weights. Since the fine-grained pruning is unstructured, it requires special hardware to achieve measured speedup, so we do not report MACs for this baseline.

  • 0.19× GauGAN. We reduce each convolution layer of GauGAN to 19% of its channels (about a 21× MACs reduction) and train it from scratch.

  • GAN Compression li2020gan. A general-purpose compression method for conditional GANs. GAN Comp. (S) means GAN Compression with a larger compression ratio.

  • 0.5× Original means linearly scaling each layer of the original model to 50% channels; we only use this baseline to benchmark our efficiency results.

Datasets.

We use the following two datasets in our experiments:


  • LSUN Church. We use the LSUN Church Outdoor dataset yu15lsun and follow the same preprocessing steps as prior works ho2020denoising; song2020score. To automatically generate a stroke editing benchmark, we first use Detic zhou2021detecting to segment the images in the validation set. For each segmented object, we use its segmentation mask to inpaint the image with CoModGAN zhao2021comodgan and treat the inpainted image as the original image. We generate the corresponding user strokes by first blurring the masked regions with a median filter and then quantizing them into 6 colors, following SDEdit meng2022sdedit. We collect 454 editing pairs in total (431 synthetic + 23 manual). We evaluate DDIM song2020denoising and PD salimans2021progressive on this dataset.

  • Cityscapes. The dataset cordts2016cityscapes contains images of German street scenes. The training and validation sets consist of 2975 and 500 images, respectively. Our editing dataset has 1505 editing pairs in total. We evaluate GauGAN park2019semantic on this dataset.

Please refer to Appendix B for more details about the benchmark datasets.

Metrics.

Following previous works meng2022sdedit; li2020gan; park2019semantic, we use the standard metrics Peak Signal-to-Noise Ratio (PSNR, higher is better), LPIPS (lower is better) zhang2018perceptual, and Fréchet Inception Distance (FID, lower is better) heusel2017gans; parmar2021cleanfid (we use clean-fid for FID calculation) to evaluate the image quality on both LSUN Church yu15lsun and Cityscapes cordts2016cityscapes. For Cityscapes, we additionally adopt a semantic segmentation metric to evaluate the generated images. Specifically, we run DRN-D-105 yu2017dilated on the generated images and compute the mean Intersection over Union (mIoU) of the segmentation results. Generally, a higher mIoU indicates that the generated images look more realistic and align better with the input.
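For reference, these metrics can be computed with standard packages roughly as follows (a hedged sketch; the package APIs are as we understand them, the folder names are placeholders, and this is not the paper's actual evaluation script):

```python
import torch
import lpips                      # pip install lpips
from cleanfid import fid          # pip install clean-fid

def psnr(a, b):
    """a, b: image tensors in [0, 1]."""
    mse = torch.mean((a - b) ** 2)
    return 10 * torch.log10(1.0 / mse)

# LPIPS expects inputs scaled to [-1, 1]
lpips_fn = lpips.LPIPS(net="alex")
img0 = torch.rand(1, 3, 256, 256)        # placeholder generated image
img1 = torch.rand(1, 3, 256, 256)        # placeholder reference image
print(psnr(img0, img1).item(), lpips_fn(img0 * 2 - 1, img1 * 2 - 1).item())

# FID with clean-fid over two (hypothetical) image folders
score = fid.compute_fid("generated_images/", "reference_images/")
```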

| Model | Method | MACs | Ratio | PSNR w/ G.T. (↑) | PSNR w/ Orig. (↑) | LPIPS w/ G.T. (↓) | LPIPS w/ Orig. (↓) | FID (↓) | mIoU (↑) |
|---|---|---|---|---|---|---|---|---|---|
| DDIM | Original | 249G | – | 26.8 | – | 0.069 | – | 65.4 | – |
| DDIM | 40% Pruning | – | – | 24.9 | 31.0 | 0.991 | 0.101 | 72.2 | – |
| DDIM | Patch | 72.0G | 3.5× | 26.8 | 40.6 | 0.076 | 0.022 | 66.4 | – |
| DDIM | Ours | 65.3G | 3.8× | 26.8 | 52.4 | 0.070 | 0.009 | 65.8 | – |
| PD | Original | 66.9G | – | 21.9 | – | 0.143 | – | 90.0 | – |
| PD | 40% Pruning | – | – | 21.6 | 37.6 | 0.164 | 0.051 | 101 | – |
| PD | Ours | 32.5G | 2.1× | 21.9 | 60.7 | 0.154 | 0.003 | 90.1 | – |
| GauGAN | Original | 281G | – | 15.8 | – | 0.409 | – | 55.4 | 62.4 |
| GauGAN | GAN Comp. li2020gan | 31.2G | 9.0× | 15.8 | 19.5 | 0.412 | 0.288 | 55.5 | 61.5 |
| GauGAN | Ours | 30.7G | 9.2× | 15.8 | 26.5 | 0.413 | 0.113 | 54.4 | 62.1 |
| GauGAN | 0.19× GauGAN | 13.3G | 21× | 15.5 | 18.6 | 0.424 | 0.322 | 57.9 | 53.5 |
| GauGAN | GAN Comp. (S) | 9.64G | 29× | 15.7 | 19.1 | 0.422 | 0.310 | 50.4 | 57.4 |
| GauGAN | GAN Comp.+Ours | 7.06G | 40× | 15.7 | 19.2 | 0.416 | 0.299 | 54.6 | 60.0 |
Table 1: Quantitative quality evaluation. PSNR/LPIPS with G.T. means computing the metrics with the ground-truth images, and with Orig. means computing them with the generated samples from the original model. 40% Pruning: uniformly pruning 40% of the model weights without fine-tuning. Patch: cropping the smallest image patch that covers all the edited regions and blending the output patch into the original image. 0.19× GauGAN: uniformly reducing each layer of GauGAN to 19% channels and training from scratch. GAN Comp. (S): GAN Compression with a larger compression ratio. For all models, our method outperforms the other baselines with less computation. When applied to GAN Compression, our method reduces the MACs of GauGAN by 40× with minor performance degradation.
Figure 6: Qualitative results of our method under different edit sizes. Our method preserves the visual fidelity of the original model well without losing global context. In contrast, Patch (cropping the smallest image patch that covers all the edited regions and scattering the output patch back into the original image) performs poorly because it lacks global context when the edit is small.

Implementation details.

The numbers of denoising steps for DDIM and PD are 100 and 8, respectively, and we use 50 and 5 steps for SDEdit. We dilate the difference mask by 5, 2, 5, and 1 pixels for DDIM, PD with resolution 128, PD with resolution 256, and GauGAN, respectively. Besides, we apply our sparse kernels to all convolution layers whose input feature map resolutions are larger than a model-specific threshold for DDIM, PD, original GauGAN, and GAN Compression. For DDIM song2020denoising and PD salimans2021progressive, we pre-compute and reuse the statistics of the original image for all group normalization layers wu2018group. For GAN Compression li2020gan, we pre-compute and reuse the statistics of the original image for all instance normalization layers ulyanov2016instance whose resolution is higher than a threshold. For all models, the sparse block size is 6 for 3×3 convolutions and 4 for 1×1 convolutions.

4.1 Main Results

Image quality.

We report the quantitative results of applying our method to DDIM song2020denoising, Progressive Distillation (PD) salimans2021progressive, and GauGAN park2019semantic in Table 1 and show the qualitative results in Figure 6. For PSNR and LPIPS, with G.T. means computing the metric with the ground-truth images, and with Orig. means computing the metric with the samples generated by the original model. On LSUN Church, we only use the 431 synthetic images for the PSNR/LPIPS with G.T. metrics, as the manual edits do not have ground truths. For the other metrics, we use the entire LSUN Church benchmark (431 synthetic + 23 manual). On Cityscapes, we view the synthetic semantic maps as the original inputs and the ground-truth semantic maps as the edited inputs for the PSNR/LPIPS with G.T. metrics, which gives 1505 samples. For the other metrics, we also include the symmetric edits (viewing the ground-truth semantic maps as the original inputs and the synthetic semantic maps as the edited inputs), which gives 3010 samples in total. For the Patch and Ours methods, whose computation is edit-dependent, we report the average MACs over the whole dataset.

| Model | Edit Size | Method | MACs | Ratio | 3090 | Ratio | 2080Ti | Ratio | Core i9-10920X | Ratio | M1 Pro | Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DDIM | – | Original | 248G | – | 37.5ms | – | 54.6ms | – | 609ms | – | 12.9s | – |
| DDIM | – | 0.5× Original | 62.5G | 4.0× | 20.0ms | 1.9× | 31.2ms | 1.8× | 215ms | 2.8× | 3.22s | 4.0× |
| DDIM | 1.20% | Crop | 32.6G | 7.6× | 15.5ms | 2.4× | 29.3ms | 1.9× | 185ms | 3.3× | 1.85s | 6.9× |
| DDIM | 1.20% | Ours | 33.4G | 7.5× | 12.6ms | 3.0× | 19.1ms | 2.9× | 147ms | 4.1× | 1.96s | 6.6× |
| DDIM | 15.5% | Crop | 155G | 1.6× | 30.5ms | 1.2× | 44.5ms | 1.2× | 441ms | 1.4× | 8.09s | 1.6× |
| DDIM | 15.5% | Ours | 78.9G | 3.2× | 19.4ms | 1.9× | 29.8ms | 1.8× | 304ms | 2.0× | 5.04s | 2.6× |
| PD256 | – | Original | 119G | – | 35.1ms | – | 51.2ms | – | 388ms | – | 6.18s | – |
| PD256 | – | 0.5× Original | 31.0G | 3.8× | 29.4ms | 1.2× | 43.2ms | 1.2× | 186ms | 2.1× | 1.72s | 3.6× |
| PD256 | 1.20% | Ours | 25.9G | 4.6× | 18.6ms | 1.9× | 26.4ms | 1.9× | 152ms | 2.5× | 1.55s | 4.0× |
| PD256 | 15.5% | Ours | 48.5G | 2.5× | 21.4ms | 1.6× | 30.7ms | 1.7× | 250ms | 1.6× | 3.22s | 1.9× |
| GauGAN | – | Original | 281G | – | 45.4ms | – | 49.5ms | – | 682ms | – | 14.1s | – |
| GauGAN | – | GAN Compression | 31.2G | 9.0× | 17.0ms | 2.7× | 25.0ms | 2.0× | 333ms | 2.1× | 2.11s | 6.7× |
| GauGAN | 1.18% | Ours | 15.3G | 18× | 11.1ms | 4.1× | 19.3ms | 2.6× | 114ms | 6.0× | 0.990s | 14× |
| GauGAN | 1.18% | GAN Comp.+Ours | 5.59G | 50× | 10.8ms | 4.2× | 16.2ms | 3.1× | 53.1ms | 13× | 0.370s | 38× |
| GauGAN | 13.5% | Ours | 69.8G | 4.0× | 17.8ms | 2.5× | 27.1ms | 1.8× | 238ms | 2.9× | 4.06s | 3.5× |
| GauGAN | 13.5% | GAN Comp.+Ours | 10.8G | 26× | 11.8ms | 3.8× | 17.4ms | 2.8× | 94.4ms | 7.2× | 0.741s | 19× |
Table 2: Measured latency speedup on different devices. The detailed edit examples are shown in Figure 6. 0.5× Original: linearly scaling each layer of the model to 50% channels. Crop: for each convolution, we find the smallest patch covering the masked elements, crop it out, feed it into the convolution, and scatter the output patch into the original activation. Our method reduces MACs by up to 18× and achieves up to 4.1×, 2.9×, 6.0×, and 14× latency reduction on NVIDIA RTX 3090, RTX 2080Ti, Intel Core i9-10920X, and Apple M1 Pro CPU, respectively. With GAN Compression, we can further speed up GauGAN by 13× on the Intel Core i9 and 38× on the Apple M1 Pro CPU.

For DDIM and Progressive Distillation, our method consistently outperforms all baselines and achieves results on par with the original model. The Patch baseline fails when the edited region is small because the global context is insufficient. Although our method only applies convolutional filters to the locally edited regions, it reuses the global context stored in the original activations and therefore performs the same as the original model. For GauGAN, our method also performs better than GAN Compression li2020gan with an even larger MACs reduction. When applying our method to GAN Compression, we further achieve a 40× MACs reduction with minor performance degradation, beating both 0.19× GauGAN and GAN Comp. (S).

Model efficiency.

For real-world interactive image editing applications, inference acceleration on hardware is more critical than the computation reduction alone. To verify the effectiveness of our proposed engine, we measure the speedup of the edit examples shown in Figure 6 on four devices with different computational power: NVIDIA RTX 3090, NVIDIA RTX 2080Ti, Intel Core i9-10920X CPU, and Apple M1 Pro CPU. We use batch size 1 to simulate real-world use. For GPU devices, we first perform 200 warm-up runs and measure the average latency of the next 200 runs. For CPU devices, we perform 10 warm-up runs and 10 test runs, repeat this process 5 times, and report the average latency. The latency is measured in PyTorch 1.7 (https://github.com/pytorch/pytorch). The results are shown in Table 2.
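The GPU timing protocol above corresponds roughly to the following sketch (our own illustration; the actual benchmarking scripts may differ):

```python
import time
import torch

@torch.no_grad()
def gpu_latency_ms(model, example_input, warmup=200, runs=200):
    model = model.cuda().eval()
    x = example_input.cuda()
    for _ in range(warmup):                  # warm-up runs
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()                 # wait for all queued kernels to finish
    return (time.time() - start) / runs * 1e3   # average latency in milliseconds
```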

The original Progressive Distillation salimans2021progressive can only generate 128×128 images, which is too small for real use, so we add some extra layers to adapt the model to 256×256 resolution. For fair comparisons, we also pre-compute the normalization parameters for the Crop baseline. When the edit pattern is close to a rectangle, this baseline reduces a similar amount of computation as ours (e.g., the first example of DDIM in Figure 6). However, its speedup is still worse than ours on the RTX 3090, RTX 2080Ti, and Intel Core i9-10920X due to the large memory indexing overheads in native PyTorch. When the edited region is far from a rectangle (e.g., the third example of DDIM), the cropped patches contain much redundancy: even though only 15.5% of the region is edited, the MACs reduction is only 1.6×. With about 1.2% edit size, our method achieves a 7.5×, 4.6×, and 18× MACs reduction for DDIM, Progressive Distillation, and GauGAN, respectively. With SIGE, we achieve at most 4.1×, 2.9×, 6.0×, and 14× speedup on RTX 3090, RTX 2080Ti, Intel Core i9-10920X, and Apple M1 Pro CPU, respectively. When applied to GAN Compression, SIGE achieves a 9.5× and 38× latency reduction on Intel Core i9-10920X and Apple M1 Pro CPU, respectively.

| Sparse | Norm. | Elem. | Sct. | MACs | Latency | Ratio |
|---|---|---|---|---|---|---|
|  |  |  |  | 249G | 54.6ms | – |
| ✓ |  |  |  | 32.6G (7.6×) | 34.0ms | 1.6× |
| ✓ | ✓ |  |  | 32.6G (7.6×) | 29.6ms | 1.8× |
| ✓ | ✓ | ✓ |  | 32.6G (7.6×) | 20.7ms | 2.6× |
| ✓ | ✓ | ✓ | ✓ | 32.6G (7.6×) | 19.1ms | 2.9× |
(a)
| Method | Edit Size | MACs | Ratio | PyTorch | Ratio | TensorRT | Ratio |
|---|---|---|---|---|---|---|---|
| Original | – | 249G | – | 54.6ms | – | 47.7ms | – |
| Ours | 1.20% | 33.4G | 7.5× | 19.1ms | 2.9× | 14.4ms | 3.3× |
| Ours | 7.19% | 51.8G | 4.8× | 22.1ms | 2.5× | 18.6ms | 2.6× |
| Ours | 15.5% | 78.9G | 3.2× | 29.8ms | 1.8× | 26.9ms | 1.8× |
(b)
Table 3: (a) Ablation study of each kernel optimization. Sparse: using tiling-based sparse convolution. Norm.: pre-computing normalization parameters. Elem.: fusing element-wise operations. Sct.: fusing Scatter to reduce the tensor copying overheads. With all optimizations, we reduce the latency of DDIM by 2.9× on an NVIDIA RTX 2080Ti. (b) Latency comparison of DDIM on the RTX 2080Ti between PyTorch and TensorRT. The speedup ratio is larger in TensorRT than in PyTorch, especially when the edit size is small.

4.2 Ablation Study

Below we perform several ablation studies to show the effectiveness of each design choice.

Memory usage.

The pre-computed activations of the original image require additional memory. We profile the peak memory usage of the original models and our method in PyTorch. Our method only increases the peak memory usage of a single forward pass by 0.1G, 0.1G, 0.8G, and 0.3G for DDIM, PD, GauGAN, and GAN Compression, respectively. Specifically, it needs to store an additional 169M, 56M, 275M, and 120M parameters for DDIM song2020denoising, PD salimans2021progressive, original GauGAN park2019semantic, and GAN Compression li2020gan, respectively, for a single forward pass. For diffusion models, we need to store the activations of all iteration steps (e.g., 50 for DDIM and 5 for PD). However, data movement and kernel computation are asynchronous on the GPU, so we can store the activations in CPU memory and load them onto the GPU on demand to reduce the peak GPU memory usage.
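A sketch of this offloading idea (our own illustration, not the SIGE code):

```python
import torch

class ActivationCache:
    """Keep per-step cached activations in pinned CPU memory and copy them to the GPU on demand."""
    def __init__(self, activations):                  # one tensor per denoising step
        self.cpu = [a.detach().cpu().pin_memory() for a in activations]

    def fetch(self, step, device="cuda"):
        # non_blocking copies from pinned memory can overlap with running GPU kernels
        return self.cpu[step].to(device, non_blocking=True)
```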

Speedup of each design.

Table 3(a) shows the effectiveness of each kernel optimization we add to SIGE for DDIM song2020denoising on the RTX 2080Ti. Naïvely applying the tiling-based sparse convolution reduces the computation by 7.6×, yet the latency reduction is only 1.6× due to the large memory overheads in Gather and Scatter. Pre-computing the normalization parameters removes the latency of on-the-fly statistics calculation and reduces the overall latency to 29.6 ms. Fusing element-wise operations into Gather and Scatter removes redundant operations applied to the unedited regions and also reduces the memory allocation overheads (about 9 ms). Finally, fusing Scatter with the following Gather into Scatter-Gather, and fusing the Scatter in the shortcut and main branches, further removes about 1.6 ms of tensor copying overheads, achieving a 2.9× speedup.

Experiments with TensorRT.

Real-world model deployment also depends on deep learning backends with optimized libraries and runtimes. To demonstrate the effectiveness and extensibility of SIGE, we also implement our kernels in the widely-used backend TensorRT (we benchmark the results with TensorRT 8.4) and report the DDIM latency results on the RTX 2080Ti in Table 3(b). Our speedup ratio becomes more prominent with TensorRT than with PyTorch, especially for small edits, as TensorRT better supports small convolutional kernels with higher GPU utilization.

5 Conclusion & Discussion

For image editing, existing deep generative models often waste computation by re-synthesizing the image regions that do not require modifications. To solve this issue, we have presented a general-purpose method, Spatially Sparse Inference (SSI), to selectively perform computation on edited regions, and Sparse Incremental Generative Engine (SIGE) to convert the computation reduction to latency reduction on commonly-used hardware. We have demonstrated the effectiveness of our approach in various hardware settings.

Limitations.

As discussed in Section 4.2, our method requires extra memory to store the original activations, which slightly increases the peak GPU memory usage. It may not work on certain memory-constrained devices, especially for the diffusion models (e.g., DDIM song2020denoising), since our method requires storing activations of all denoising steps.

Our engine has limited speedup for convolutions with low-resolution inputs. When the input resolution is low, the active block size needs to be even smaller (such as 1 or 2) to achieve a decent sparsity. However, such extremely small block sizes have poor memory locality and result in low hardware efficiency.

Besides, we sometimes observe a noticeable boundary between the edited and unedited regions in the samples generated by GauGAN park2019semantic. This is because, for the GauGAN model, the unedited regions also change slightly under normal inference. Since our method does not update the unedited regions, there may be visible seams between the edited and unedited regions even though the semantics are coherent. Dilating the difference mask helps reduce this gap.

In most cases, an edit only affects the edited regions. However, an edit sometimes also introduces global illumination changes such as shadows and reflections. In this case, as we only update the edited regions, we cannot propagate these global changes outside the edited regions.

Societal impact.

In this paper, we investigate how to update user edits locally without losing global coherence to enable smoother interaction with generative models. In real-world scenarios, people can use an interactive interface to edit an image, and our method provides a quick, high-quality preview of their edit, which eases visual content creation and reduces energy consumption, leading to greener AI applications. The reduced cost also provides a better user experience on lower-end devices, which further democratizes the applications of generative models.

However, our method could be utilized by malicious users to generate fake content, deceive people, and spread misinformation, which may lead to negative societal impacts. Following previous works meng2022sdedit, we explicitly specify the usage permission of our engine with proper licenses. Additionally, we run a forensics detector wang2020cnn on the generated results of our method. On GauGAN, our generated images can be detected with high average precision (AP). However, on DDIM song2020denoising and Progressive Distillation salimans2021progressive, the APs are much lower. Such low APs are caused by the model differences between GANs and diffusion models, as observed in SDEdit meng2022sdedit. We believe developing forensic methods for diffusion models is a critical future research direction.

Acknowledgment.

We thank Yaoyao Ding, Zihao Ye, Lianmin Zheng, Haotian Tang, and Ligeng Zhu for the helpful comments on the engine design. We also thank George Cazenavette, Kangle Deng, Ruihan Gao, Daohan Lu, Sheng-Yu Wang and Bingliang Zhang for their valuable feedback. The project is partly supported by NSF, MIT-IBM Watson AI Lab, Kwai Inc, and Sony Corporation.

References

Appendix A Kernel Fusion

Figure 7: Visualization of kernel fusion in DDIM song2020denoising ResBlock he2016deep. We omit the element-wise operations for simplicity and follow the notations in Section 3. As the kernel sizes of the convolution in the shortcut branch and main branch are different, their reduced active block indices are different (Indices and Shortcut Indices). To reduce the tensor copying overheads in Scatter, we fuse Scatter and the following Gather into Scatter-Gather and fuse the Scatter in the shortcut, main branch and residual addition into Scatter with Block Residual. We pre-compute an additional Scatter Map for the Scatter-Gather kernel.
Figure 8: Scatter-Gather fusion visualization. (a) The original pipeline of a Gather directly follows a Scatter. The indices indicate the top left corner of the Scatter/Gather position (zero-based). The black blocks are discarded by the Gather. (b) We pre-compute the Scatter process and get a Scatter Map, which tracks the data source during Scatter. If the data come from the original activation, it stores NULL at this location (gray blocks). Otherwise, it will store a triple locating the data in the input blocks (non-gray blocks). (c) In the fused Scatter-Gather kernel, we directly use the Scatter Map to index and fetch the data from the input blocks and the original activation, avoiding copying the entire original feature map.

As mentioned in Section 3.2, we fuse Scatter and the following Gather into a Scatter-Gather operator and also fuse Scatter in the shortcut, main branch and residual addition together. The detailed fusion pattern is shown in Figure 7. For simplicity, we omit the element-wise operations (e.g., Nonlinearity and Scale+Shift). Below we include more implementation details of each fusion design. Please refer to our code for the detailed implementation.

Scatter-Gather fusion.

When a Scatter is directly followed by a Gather, we can fuse these two operators into a Scatter-Gather to avoid copying the entire original activation. As shown in Figure 8(a), the black blocks are copied from the original activation and then discarded by the following Gather, which incurs redundant data movement. Instead, we pre-build a Scatter Map to track the data source (Figure 8(b)). For example, if the data at a position in the Scatter output come from the original activation, the Scatter Map stores NULL at that position (gray blocks). Otherwise, it stores a triple at this position (non-gray blocks). The first element of the triple indicates the block ID that the data come from, while the latter two indicate the offsets of the data within the block. Note that this pre-computation is cheap and only needs to be done once for each resolution. Therefore, in the fused Scatter-Gather, we can use the Scatter Map to index and fetch the data directly from either the input blocks or the original activation, given the Gather indices. For example, to gather the data at a given location, we look up that position in the Scatter Map. If it is NULL, we fetch the data at the same location in the original activation. Otherwise, we fetch the data in the input blocks indicated by the triple. In this way, we avoid copying the unused regions in Scatter.
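In pure Python, the Scatter Map lookup can be sketched as follows (our own illustration with an assumed per-pixel layout; the actual kernel works on blocks and runs on the GPU):

```python
import torch

def build_scatter_map(h, w, indices, block):
    """indices: (row, col) corners of the blocks a Scatter would write.
    None means 'read from the original activation'; otherwise store
    (block id, row offset, col offset)."""
    smap = [[None] * w for _ in range(h)]
    for bid, (r, c) in enumerate(indices):
        for dr in range(block):
            for dc in range(block):
                smap[r + dr][c + dc] = (bid, dr, dc)
    return smap

def fused_scatter_gather(smap, blocks, A_orig, positions):
    """Fetch pixels either from the scattered blocks or the cached original
    activation, without materializing the full Scatter output.
    blocks: (N, C, block, block) tensor; A_orig: (1, C, H, W) tensor."""
    cols = []
    for (r, c) in positions:
        entry = smap[r][c]
        cols.append(A_orig[0, :, r, c] if entry is None
                    else blocks[entry[0]][:, entry[1], entry[2]])
    return torch.stack(cols)
```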

Shortcut Scatter fusion.

The convolution in the shortcut branch consumes much less computation than the convolution in the main branch; therefore, the overheads of Gather and Scatter weigh more in the shortcut branch. We fuse the Scatter in the shortcut branch and the Scatter in the main branch, along with the residual addition, into Scatter with Block Residual to reduce these overheads. Specifically, as shown in Figure 7, we first scatter the main-branch output into the pre-computed output and add the original residual only at the scattered locations according to Indices. Then we calibrate the resulting feature map with the shortcut-branch output by adding the residual difference in place at the scattered locations indexed by Shortcut Indices.

Figure 9: Examples of our synthetic edits on (a) LSUN Church and (b) Cityscapes. On LSUN Church, we view the inpainted image as the original image and generate the edits by quantizing color at the corresponding regions. On Cityscapes, we generate the edits by pasting some foreground objects on the ground-truth semantic maps.
Figure 10: Detailed edit ratio distributions of our synthetic datasets: (a) LSUN Church; (b) Cityscapes.
Figure 11: Several examples of our collected foreground object semantic masks.
Figure 12: Visualization results of different dilation sizes on GauGAN. Although it does not improve mIoU, increasing the dilation smoothly blends the boundary between the edited and unedited regions and slightly improves the image quality. Specifically, the shadow boundary of the added car fades as the dilation increases. However, larger dilation incurs more computation.

Appendix B Benchmark Datasets

We elaborate more details on how we build the synthetic edit datasets.

LSUN Church.

Figure 9(a) shows some examples of our synthetic edits on LSUN Church. The average edited area over the whole dataset is 13.1%. The detailed distribution is shown in Figure 10(a).

Cityscapes.

We collect 27 foreground object semantic masks from the validation set. The objects include 4 bicycles, 1 motorcycle, 7 cars, 6 trucks, 3 buses, 5 persons, and 1 train. Figure 11 shows some of the collected semantic masks. We generate the edits by randomly pasting one of these objects onto the ground-truth semantic maps with augmentation. The augmentation includes a random horizontal flip, resizing with a random scale factor, and random vertical and horizontal translations. To make the synthetic edits more plausible, when the scale factor is larger than 1, the vertical translation can only be positive; otherwise, it can only be negative. Figure 9(b) shows some edit examples. The average edited area over the entire dataset is 4.77%. The detailed distribution is shown in Figure 10(b). A rough sketch of this paste-based edit synthesis is given below.
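The sketch below is our own illustration of the paste-based edit synthesis; the scale and translation ranges are assumptions (the exact values are not recoverable here), and the sign constraint on the vertical translation is omitted for brevity:

```python
import random
import torch
import torch.nn.functional as F

def synthesize_edit(semantic_map, obj_mask, obj_label):
    """semantic_map: (H, W) long tensor of class IDs; obj_mask: (h, w) bool mask
    of a collected foreground object; obj_label: class ID to paste."""
    m = obj_mask.float()[None, None]
    if random.random() < 0.5:                              # random horizontal flip
        m = torch.flip(m, dims=[-1])
    scale = random.uniform(0.75, 1.25)                     # assumed scale range
    m = F.interpolate(m, scale_factor=scale, mode="nearest")[0, 0].bool()
    H, W = semantic_map.shape
    h, w = m.shape
    dy = random.randint(0, max(H - h, 0))                  # assumed translation range
    dx = random.randint(0, max(W - w, 0))
    edited = semantic_map.clone()
    region = edited[dy:dy + h, dx:dx + w]                  # view into `edited`
    region[m[:region.shape[0], :region.shape[1]]] = obj_label   # paste the object label
    return edited
```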

| Method | MACs | Ratio | PSNR w/ G.T. (↑) | PSNR w/ Orig. (↑) | LPIPS w/ G.T. (↓) | LPIPS w/ Orig. (↓) | mIoU (↑) |
|---|---|---|---|---|---|---|---|
| Original | 281G | – | 15.9 | – | 0.414 | – | 57.3 |
| GAN Comp. li2020gan | 31.2G | 9.0× | 15.8 | 19.1 | 0.417 | 0.329 | 56.3 |
| Ours | 30.7G | 9.2× | 15.9 | 27.5 | 0.425 | 0.076 | 56.1 |
| 0.19× GauGAN | 13.3G | 21× | 15.4 | 18.4 | 0.427 | 0.356 | 49.5 |
| GAN Comp. (S) | 9.64G | 29× | 15.8 | 18.9 | 0.422 | 0.344 | 51.2 |
| GAN Comp.+Ours | 7.06G | 40× | 15.8 | 18.8 | 0.429 | 0.345 | 52.4 |
Table 4: Quality evaluation of GauGAN at the edited regions. PSNR/LPIPS with G.T. means computing the metrics with the ground-truth images, and with Orig. means computing them with the generated samples from the original model. 0.19× GauGAN: reducing each layer of GauGAN to 19% channels and training from scratch. GAN Comp. (S): GAN Compression with a larger compression ratio. Our method matches the performance of GAN Compression li2020gan. When applied to GAN Compression, our method achieves results on par with GAN Comp. (S) with less computation, reaching a 40× MACs reduction.
Figure 13: The input and output activation differences of a self-attention layer in vanilla DDIM. Left: Detailed edit example with the difference mask. Right: Activation differences with the downsampled difference mask. Attn is the self-attention layer. Both the input difference and output difference correspond to the difference mask very well.

Appendix C Additional Results

Dilation hyper-parameter.

We show the results of our method with different dilation sizes on GauGAN in Figure 12. Increasing the dilation brings more computation but also slightly improves the image quality. Specifically, the shadow boundary of the added car fades as the dilation increases. We choose dilation 1, since its image quality is almost the same as that of dilation 20 while delivering the best speed.

Quality results at the edited regions.

In Table 1, we show the quantitative quality results of our method on DDIM song2020denoising, Progressive Distillation (PD) salimans2021progressive, and GauGAN park2019semantic. For DDIM and PD, the unedited regions in the generated image remain the same as in the input image due to the mask trick meng2022sdedit. For GauGAN, the generated unedited regions vary across different methods. In this case, the image quality at the unedited regions influences the metrics reported in Table 1. We therefore additionally report the quantitative quality results of GauGAN at the edited regions alone in Table 4. Our method still preserves the image quality of the original GauGAN and matches the performance of GAN Compression li2020gan. When applied to GAN Compression, it achieves a 40× MACs reduction on average, with results on par with GAN Comp. (S) at less computation. This indicates that our method can update the edited regions in high quality without losing global context.

Working with self-attention layers.

Recent generative models often have attention layers to improve the generated image quality zhang2019self; vaswani2017attention. Such attention layers can model long-range and multi-level dependencies across image regions, which seems to break the local correspondence between the edited regions in the input image and the generated image. In Figure 13, we visualize the difference between the input and output activations of a self-attention layer in the vanilla DDIM model. The patterns of the input and output activation differences are quite similar and correspond to the difference mask very well. This shows that the local correspondence of user edits still exists with self-attention layers, which justifies our Spatially Sparse Inference algorithm.

Large edits.

In Table 5 and Figure 14, we show the results of applying our method to large edits (about 33% to 39% of the image area). Even for these edits, we still achieve up to a 1.7× speedup on DDIM, 1.5× on PD256, and 1.7× on GauGAN (11× when combined with GAN Compression) without losing visual fidelity. Furthermore, in many practical cases, users can decompose a large edit into several small edits. Our method can then incrementally update the results while the edit is being created, as described below.

Sequential edits.

In Figure 15, we show the results of sequential edits with our method. One-time Pre-computation performs as well as the Full Model, demonstrating that in most cases our method can handle multiple sequential edits with only a one-time pre-computation. Moreover, for extremely large edits, we can use SIGE to incrementally update the pre-computed features (Incremental Pre-computation) and condition the later edits on the recomputed features. Its results are also as good as those of the full model. Therefore, our method handles sequential edits well.

| Model | Edit Size | Method | MACs | Ratio | 3090 | Ratio | 2080Ti | Ratio | Core i9-10920X | Ratio | M1 Pro | Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DDIM | – | Original | 248G | – | 37.5ms | – | 54.6ms | – | 609ms | – | 12.9s | – |
| DDIM | 32.9% | Ours | 115G | 2.2× | 26.0ms | 1.4× | 36.9ms | 1.5× | 449ms | 1.4× | 7.53s | 1.7× |
| PD256 | – | Original | 119G | – | 35.1ms | – | 51.2ms | – | 388ms | – | 6.18s | – |
| PD256 | 32.9% | Ours | 64.3G | 1.9× | 25.3ms | 1.4× | 35.1ms | 1.5× | 334ms | 1.2× | 4.47s | 1.4× |
| GauGAN | – | Original | 281G | – | 45.4ms | – | 49.5ms | – | 682ms | – | 14.1s | – |
| GauGAN | – | GAN Compression | 31.2G | 9.0× | 17.0ms | 2.7× | 25.0ms | 2.0× | 333ms | 2.1× | 2.11s | 6.7× |
| GauGAN | 38.7% | Ours | 148G | 1.9× | 27.9ms | 1.6× | 41.7ms | 1.2× | 512ms | 1.3× | 8.37s | 1.7× |
| GauGAN | 38.7% | GAN Comp.+Ours | 18.3G | 15× | 15.3ms | 3.0× | 22.2ms | 2.2× | 169ms | 4.0× | 1.25s | 11× |
Table 5: Measured latency speedup of large edits on different devices. The detailed edit examples are shown in Figure 14. Our method reduces MACs by up to 2.2× and latency by 1.4×, 1.5×, 1.4×, and 1.7× on NVIDIA RTX 3090, RTX 2080Ti, Intel Core i9-10920X, and Apple M1 Pro CPU, respectively. With GAN Compression, we can further accelerate GauGAN by 4.0× on the Intel Core i9 and 11× on the Apple M1 Pro CPU.
Figure 14: Qualitative results of our method with large edits. Our method still preserves the visual fidelity of the original model without losing global context while reducing the computation by up to 2.2×.
Figure 15: Sequential edit results with SIGE. Full Model means the results with the full model. One-time Pre-computation means we pre-compute the original image features for all the edit steps. Incremental Pre-computation means we incrementally update the pre-computed features with SIGE before the next edit step. The image quality of all methods are quite similar.

Additional visualization.

In Figure 16, we show additional synthetic edit visual results of DDIM song2020denoising and Progressive Distillation salimans2021progressive on LSUN Church yu15lsun. In Figure 17, we show additional synthetic edit visual results of GauGAN on Cityscapes cordts2016cityscapes.

Appendix D License & Computation Resources

Here we list the licenses of the assets we used. DDIM song2020denoising, Progressive Distillation salimans2021progressive, GauGAN park2019semantic, and GAN Compression li2020gan are under the MIT, Apache, Creative Commons, and BSD licenses, respectively. SDEdit is under the MIT license. Cityscapes cordts2016cityscapes is distributed under its own license. LSUN Church yu15lsun does not have an explicit license.

Since our method does not involve any model training, all our generated results are obtained on a single NVIDIA RTX 3090, and processing all the test images for both the original models and our method takes only a matter of hours. We measure the model latency on NVIDIA RTX 3090, RTX 2080Ti, Intel Core i9-10920X CPU, and Apple M1 Pro CPU. On Apple M1 Pro, we use Intel Anaconda for our Python environment.

Figure 16: More visualization results of DDIM song2020denoising and Progressive Distillation salimans2021progressive on LSUN Church. Prune 40%: uniformly pruning 40% of the model weights without fine-tuning. Patch: cropping the smallest image patch that covers all the edited regions of the model input and blending the model output back into the original output image. Our method achieves lower FID with fewer MACs for both DDIM and Progressive Distillation.
Figure 17: More visualization results of GauGAN park2019semantic on Cityscapes. 0.19× GauGAN: uniformly reducing each layer of GauGAN to 19% channels and training from scratch. Our method achieves higher mIoU than GAN Compression with fewer MACs. When applied to GAN Compression, our method achieves a 40× MACs reduction with a minor mIoU drop.