[NeurIPS 2022] Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models
During image editing, existing deep generative models tend to re-synthesize the entire output from scratch, including the unedited regions. This leads to a significant waste of computation, especially for minor editing operations. In this work, we present Spatially Sparse Inference (SSI), a general-purpose technique that selectively performs computation for edited regions and accelerates various generative models, including both conditional GANs and diffusion models. Our key observation is that users tend to make gradual changes to the input image. This motivates us to cache and reuse the feature maps of the original image. Given an edited image, we sparsely apply the convolutional filters to the edited regions while reusing the cached features for the unedited regions. Based on our algorithm, we further propose Sparse Incremental Generative Engine (SIGE) to convert the computation reduction to latency reduction on off-the-shelf hardware. With about 1.2% of the image area edited, our method reduces the computation of DDIM by 7.5× and GauGAN by 18× while preserving the visual fidelity. With SIGE, we accelerate DDIM by 3.0× on RTX 3090 and 6.6× on Apple M1 Pro CPU, and GauGAN by 4.2× on RTX 3090 and 14× on Apple M1 Pro CPU.
Deep generative models, such as GANs goodfellow2014generative; karras2019style and diffusion models sohl2015deep; ho2020denoising; song2020denoising, excel at synthesizing photo-realistic images, enabling many image synthesis and editing applications. For example, users can edit an image by drawing sketches isola2017image; sangkloy2017scribbler, semantic maps isola2017image; park2019semantic, or strokes meng2022sdedit. All of these applications require users to interact with generative models frequently and therefore demand short inference time.
In practice, content creators often edit images gradually and only update a small image region each time. However, even for a minor edit, recent generative models often synthesize the entire image, including the unchanged regions, which leads to a significant waste of computation. As a concrete example shown in Figure 2(a), the result of the previous edits has already been computed, and the user further edits 9.4% of the image area. However, the vanilla DDIM song2020denoising needs to generate the entire image to obtain the newly edited regions, wasting 80% of the computation on the unchanged regions. A naive approach to address this issue would be to first segment the newly edited regions, synthesize the corresponding output regions, and blend the outputs back into the previous output. Unfortunately, this method often creates visible seams between the newly edited and unedited regions. How can we save computation by only updating the edited regions without losing global coherence?
In this work, we propose Spatially Sparse Inference (SSI), a general method to accelerate deep generative models, including conditional GANs and diffusion models, by utilizing the spatial sparsity of edited regions. Our method is motivated by the observation that feature maps at the unedited regions remain mostly the same during user editing. As shown in Figure 2(b), our key idea is to reuse the cached feature maps of the previous edits and sparsely update the newly edited regions. Specifically, given user input, we first compute a difference mask to locate the newly edited regions. For each convolution layer in the model, we only apply the filters to the masked regions sparsely while reusing the previous activations for the unchanged regions. The sparse update can significantly reduce the computation without hurting the image quality. However, the sparse update involves a gather-scatter process, which often incurs significant latency overheads for existing deep learning frameworks. To address the issue, we propose
Sparse Incremental Generative Engine (SIGE) to translate the theoretical computation reduction of our algorithm to measured latency reduction on various hardware.

To evaluate our method, we automatically create new image editing benchmark datasets on LSUN Church yu15lsun and Cityscapes cordts2016cityscapes. Without loss of visual fidelity, we reduce the computation of DDIM song2020denoising by 7.5×, Progressive Distillation salimans2021progressive by 2.7×, and GauGAN by 18×, measured in MACs (we measure the computational cost with the number of Multiply-Accumulate operations; 1 MAC = 2 FLOPs). Compared to existing generative model acceleration methods li2020gan; hou2021slimmable; fu2020autogan; li2022learning; jin2021teachers; shaham2021spatially; wang2020gan, our method directly uses the off-the-shelf pre-trained weights and can be applied to these methods as a plugin. When applied to GAN Compression li2020gan, our method reduces the computation of GauGAN by 40×. See Figure 1 for some examples of our method. With SIGE, we accelerate DDIM by 3.0× on RTX 3090 GPU and 6.6× on Apple M1 Pro CPU, and GauGAN by 4.2× on RTX 3090 GPU and 14× on Apple M1 Pro CPU. Our code and benchmarks are available at https://github.com/lmxyy/sige.

Generative models such as GANs goodfellow2014generative; karras2019style; karras2020analyzing; brock2018large, diffusion models ho2020denoising; sohl2015deep; dhariwal2021diffusion, and auto-regressive models esser2021taming; razavi2019generating
have demonstrated impressive photorealistic synthesis capability. They have also been extended to conditional image synthesis tasks such as image-to-image translation
saharia2021palette; isola2017image; zhu2017unpaired; zhu2020sean, controllable image generation meng2022sdedit; nichol2021glide; park2019semantic, and real image editing choi2021ilvr; nichol2021glide; kim2021diffusionclip; zhu2016generative; patashnik2021styleclip; abdal2019image2stylegan; abdal2020image2stylegan++; zhu2020sean. Unfortunately, recent generative models have become increasingly computationally intensive compared to their recognition counterparts. For example, GauGAN park2019semantic consumes 281G MACs, 500× more than MobileNet howard2019searching; howard2017mobilenets; sandler2018mobilenetv2. Similarly, one key limitation of diffusion models ho2020denoising is their long inference time and substantial computation cost. To generate one image, DDPM requires hundreds or thousands of forwarding steps ho2020denoising; dhariwal2021diffusion, which is often infeasible in real-world interactive settings. To improve the sampling efficiency of DDPMs, recent works song2020denoising; song2020score; kong2021fast propose to interpret the sampling process of DDPMs from the perspective of ordinary differential equations. However, these approaches still require hundreds of steps to generate high-quality samples. To further reduce the sampling cost, DDGAN
xiao2022DDGAN uses a multimodal conditional GAN to model each denoising step. Salimans et al. salimans2021progressive propose to progressively distill a pre-trained DDPM model into a new model that requires fewer steps. Although this approach drastically reduces the number of sampling steps, the distilled model itself remains computationally prohibitive. Unlike prior work, our work focuses on reducing the computation cost of a pre-trained model. It is complementary to recent efforts on model compression, distillation, and sampling step reduction for diffusion models.

People apply model compression techniques, including pruning han2016deep; he2018amc; lin2017runtime; he2017channel; liu2017learning; liu2019metapruning and quantization han2016deep; zhou2016dorefa; rastegari2016xnor; wang2019haq; choi2018pact; jacob2018quantization, to reduce the computation and model size of off-the-shelf deep learning models. Recent works apply Neural Architecture Search (NAS) zoph2017neural; zoph2018learning; liu2019darts; cai2019proxylessnas; tan2019mnasnet; wu2019fbnet; lin2020mcunet to automatically design efficient neural architectures. The above ideas can be successfully applied to accelerate the inference of GANs li2020gan; lin2021anycost; shu2019co; liu2021content; hou2021slimmable; ma2021cpgan; fu2020autogan; li2022learning; jin2021teachers; shaham2021spatially; wang2020gan; aguinaldo2019compressing. Although these methods have achieved prominent compression and speedup ratios, they all reduce the computation from the model dimension and fail to exploit the redundancy in the spatial dimension during image editing. Besides, these methods require re-training the compressed model to maintain performance, while our method can be directly applied to existing pre-trained models. We show that our method can be combined with model compression li2020gan to achieve a 40× MACs reduction in Section 4.1.
Sparse computation has been widely explored in the weight domain han2015learning; li2016pruning; liu2015sparse; jaderberg2014speeding, input domain tang2022torchsparse; riegler2017octnet, and activation domain ren2018sbnet; judd2017cnvlutin2; shi2017speeding; dong2017more. For activation sparsity, RRN pan2018recurrent utilizes the sparsity in the consecutive video frame difference to accelerate video models. However, their sparsity is unstructured, which requires special hardware to reach its full speedup potential. Several works instead use structured sparsity. Li et al. li2017not use a deep layer cascade to apply more convolutions on the hard regions than the easy regions to improve the accuracy and speed of semantic segmentation. To accelerate 3D object detection, SBNet ren2018sbnet uses a spatial mask, either from a priori problem knowledge or an auxiliary network, to sparsify the activations. It adopts a tiling-based sparse convolution algorithm to handle spatial sparsity. Recent works further integrate the spatial mask generation network into the sparse inference network in an end-to-end manner verelst2020dynamic and extend the idea to different tasks wang2021exploring; han2021spatially; wang2022adafocus; parger2022deltacnn. Compared to SBNet ren2018sbnet, our mask is directly derived from the difference between the original image and the edited image. Additionally, our method does not require any auxiliary network or extra model training. We also introduce other optimizations, such as normalization removal and kernel fusions, to better adapt our engine for image editing.
We build our method based on the following observation: during interactive image editing, a user often only changes the image content gradually. As a result, only a small subset of pixels in a local region is updated at any moment. Therefore, we can reuse the activations of the original image for the unedited regions. As shown in Figure 3, we first pre-compute all activations of the original input image. During the editing process, we locate the edited regions by computing a difference mask between the original and edited images. We then reuse the pre-computed activations for the unedited regions and only update the edited regions by applying convolutional filters to them. In Section 3.1, we show the sparsity in the intermediate activations and present our main algorithm. In Section 3.2, we discuss the technical details of how our Sparse Incremental Generative Engine (SIGE) supports the sparse inference and converts the theoretical computation reduction to measured speedup on hardware.

First, we closely study the computation within a single layer. We denote $A_l^\text{original}$ and $A_l^\text{edited}$ as the input tensors of the original image and the edited image to the $l$-th convolution layer $F_l$, respectively. $W_l$ and $b_l$ are the weight and bias of $F_l$. The output of $F_l$ with input $A_l^\text{edited}$ can be computed in the following way due to the linearity of convolution:
$$F_l(A_l^\text{edited}) = W_l * A_l^\text{edited} + b_l = W_l * (A_l^\text{original} + \Delta A_l) + b_l = F_l(A_l^\text{original}) + W_l * \Delta A_l,$$
where $*$ is the convolution operator and $\Delta A_l = A_l^\text{edited} - A_l^\text{original}$. If we have already pre-computed all the $F_l(A_l^\text{original})$, we only need to compute $W_l * \Delta A_l$. Naïvely, computing $W_l * \Delta A_l$ has the same complexity as $W_l * A_l^\text{edited}$. However, since the edited image shares similar features with the original image given a small edit, $\Delta A_l$ should be sparse. Below, we discuss different strategies to leverage the activation sparsity to accelerate model inference.
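To make the decomposition concrete, the following PyTorch snippet (a minimal sketch, not the released SIGE code; the layer sizes and the edit region are arbitrary choices for illustration) checks that adding the convolved difference to the cached output reproduces the dense result:

```python
# Minimal sketch: verify F(A_edited) = F(A_original) + W * dA for a conv layer.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
conv = torch.nn.Conv2d(16, 32, kernel_size=3, padding=1)

a_original = torch.randn(1, 16, 64, 64)                      # cached input of the original image
a_edited = a_original.clone()
a_edited[:, :, 20:40, 20:40] += torch.randn(1, 16, 20, 20)   # a localized edit
delta = a_edited - a_original                                # dA is zero outside the edit

dense_out = conv(a_edited)                                   # full recomputation
cached_out = conv(a_original)                                # pre-computed once
# Reuse the cached output and only convolve the sparse difference (no bias term).
sparse_out = cached_out + F.conv2d(delta, conv.weight, bias=None, padding=1)

print(torch.allclose(dense_out, sparse_out, atol=1e-4))      # True up to float error
```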
Our first attempt was to prune $\Delta A_l$ by zeroing out elements smaller than a certain threshold to achieve the target sparsity. Unfortunately, this pruning method fails to achieve measured speedup due to the overheads of the on-the-fly pruning and the irregular sparsity pattern.
Fortunately, user edits are often highly structured and localized. As a result, $\Delta A_l$ should also share this structured spatial sparsity, with non-zero values mostly aggregated within the edited regions, as shown in Figure 4. We therefore directly use the original and edited images to compute a difference mask and sparsify $\Delta A_l$ with this mask.
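A minimal sketch of this mask computation is shown below; the channel-wise reduction, thresholding, and max-pooling-based dilation are our simplifications of the actual preprocessing, and the function name is illustrative:

```python
# Minimal sketch: derive a binary difference mask from the original and edited images.
import torch
import torch.nn.functional as F

def compute_difference_mask(original, edited, threshold=0.0, dilation=5):
    """original, edited: (1, 3, H, W) tensors in [0, 1]."""
    diff = (edited - original).abs().sum(dim=1, keepdim=True)   # per-pixel change
    mask = (diff > threshold).float()                           # 1 inside edited regions
    if dilation > 0:                                            # grow the mask slightly
        kernel = 2 * dilation + 1
        mask = F.max_pool2d(mask, kernel_size=kernel, stride=1, padding=dilation)
    return mask                                                 # (1, 1, H, W) binary mask
```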
But how could we leverage the structured sparsity to accelerate $W_l * \Delta A_l$? A naïve approach is to crop a rectangular edited region out of $\Delta A_l$ for each convolution and only compute features for the cropped region. Unfortunately, this naïve cropping method works poorly for irregular edited regions (e.g., the example shown in Figure 4).
Instead, as shown in Figure 5(a), we use a tiling-based sparse convolution algorithm. We first downsample the difference mask to different scales and dilate the downsampled masks (width 1 for diffusion models and 2 for GauGAN). Then we divide $\Delta A_l$ into multiple small blocks of the same size spatially and index the difference mask at the corresponding resolution. Each block index refers to a single block with non-zero elements. We then gather the non-zero blocks (we also call them active blocks) along the batch dimension and feed them into the convolution $F_l$. Finally, we scatter the output blocks into a zero tensor according to the indices to recover the original spatial size and add the pre-computed $F_l(A_l^\text{original})$ back. The gathered active blocks have an overlap of width 2 for 3×3 convolutions to ensure that the output blocks of adjacent input blocks are seamlessly stitched together ren2018sbnet.
The pipeline in Figure 5(a) is equivalent to the simpler pipeline in Figure 5(b). Instead of gathering $\Delta A_l$, we can directly gather $A_l^\text{edited}$. The convolution then needs to be computed with the bias $b_l$. Besides, we need to scatter the output blocks into $F_l(A_l^\text{original})$ instead of a zero tensor. Thus, we no longer need to store $A_l^\text{original}$, which further saves memory and removes the overheads of the extra addition and subtraction. Figure 3 visualizes this pipeline.
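The sketch below illustrates the simplified pipeline of Figure 5(b) in plain PyTorch. The block bookkeeping is written with Python loops for readability (the real engine uses custom fused kernels), and it assumes a 3×3 convolution with padding 1 and a feature map size that is a multiple of the block size:

```python
# Minimal sketch of the tiling-based sparse convolution (gather -> conv -> scatter).
import torch
import torch.nn.functional as F

def sparse_conv3x3(conv, a_edited, cached_out, mask, block=6):
    """conv: nn.Conv2d(3x3, padding=1); a_edited: (1, C, H, W) edited activation;
    cached_out: F(A_original) with the same spatial size; mask: (1, 1, H, W) binary
    difference mask already downsampled to this resolution."""
    _, _, H, W = a_edited.shape
    out = cached_out.clone()
    # Pad by 1 so every (block+2)x(block+2) input tile yields a block x block output.
    x = F.pad(a_edited, (1, 1, 1, 1))
    tiles, positions = [], []
    for i in range(0, H, block):                     # index active blocks from the mask
        for j in range(0, W, block):
            if mask[:, :, i:i + block, j:j + block].any():
                tiles.append(x[:, :, i:i + block + 2, j:j + block + 2])
                positions.append((i, j))
    if not tiles:
        return out                                   # nothing edited at this layer
    batch = torch.cat(tiles, dim=0)                  # gather along the batch dimension
    y = F.conv2d(batch, conv.weight, conv.bias)      # one dense conv on active blocks only
    for k, (i, j) in enumerate(positions):           # scatter outputs into the cached map
        out[:, :, i:i + block, j:j + block] = y[k:k + 1]
    return out
```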
However, the aforementioned pipeline still fails to produce a noticeable speedup due to extra kernel calls and memory movement overheads in Gather and Scatter. For example, a dense convolution with 128 channels at the benchmarked input resolution takes 0.78ms on RTX 3090. The sparse convolution using the pipeline of Figure 5(b) on the example shown in Figure 4 (15.5% edited regions) needs 0.42ms in total, while Gather and Scatter introduce a significant overhead (0.17ms, about 41%). To reduce it, we further optimize SIGE by pre-computing normalization parameters and applying kernel fusion.
For batch normalization, it is easy to remove the normalization layer at inference time since we can use the pre-computed mean and variance statistics from model training. However, recent deep generative models often use instance normalization ulyanov2016instance; huang2017arbitrary or group normalization wu2018group; nichol2021improved, which compute the statistics on the fly during inference. These normalization layers incur overheads as we need to estimate the statistics from the full-size tensors. However, as the original and edited images are quite similar given a small user edit, we assume the statistics of $A_l^\text{edited}$ are close to those of $A_l^\text{original}$. This allows us to reuse the statistics of $A_l^\text{original}$ for the normalization instead of recomputing them for $A_l^\text{edited}$. Thus, normalization layers can be replaced by simple Scale+Shift operations with pre-computed statistics.

As mentioned before, both the Gather and Scatter operations introduce significant data movement overheads. To reduce them, we fuse several element-wise operations (Scale+Shift and Nonlinearity) into Gather and Scatter ren2018sbnet; ding2021ios; jia2019taso and only apply these element-wise operations to the active blocks (i.e., edited regions). Furthermore, we perform in-place computation to reduce the number of kernel calls and memory allocation overheads.
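The following sketch illustrates how cached group-normalization statistics can replace the on-the-fly statistics with a simple Scale+Shift; it is our simplified illustration (assuming an affine GroupNorm layer), not the engine's fused kernel:

```python
# Minimal sketch: replace GroupNorm with Scale+Shift using statistics of the original image.
import torch

def precompute_gn_stats(gn, a_original):
    """Compute per-group mean/std of the original activation once, offline."""
    n, c, h, w = a_original.shape
    x = a_original.view(n, gn.num_groups, -1)
    mean = x.mean(dim=2, keepdim=True)
    std = (x.var(dim=2, unbiased=False, keepdim=True) + gn.eps).sqrt()
    return mean, std

def scale_shift(gn, a_edited, mean, std):
    """Normalize the edited activation with the cached statistics (no new reduction)."""
    n, c, h, w = a_edited.shape
    x = a_edited.view(n, gn.num_groups, -1)
    x = ((x - mean) / std).view(n, c, h, w)
    # Apply the layer's affine parameters per channel, as GroupNorm would.
    return x * gn.weight.view(1, -1, 1, 1) + gn.bias.view(1, -1, 1, 1)
```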
In Scatter, we need to copy the pre-computed activation $F_l(A_l^\text{original})$. This copying operation is highly redundant, as most elements of $F_l(A_l^\text{original})$ do not involve any computation given a small edit and will be discarded in the next Gather. To reduce the tensor copying overheads, we fuse the Scatter with the following Gather by directly gathering the active blocks from $F_l(A_l^\text{original})$ and the input blocks to be scattered. Sometimes, the residual connection in the ResBlock he2016deep contains a shortcut convolution to match the channel number of the residual and the ResBlock output. We also fuse the Scatter in the shortcut branch, the Scatter in the main branch, and the residual addition together to avoid the tensor copying overheads in the shortcut Scatter. Please refer to Appendix A for more details.

Below we first describe our models, baselines, datasets, and evaluation protocols. We then discuss our main qualitative and quantitative results. Finally, we include a detailed ablation study regarding the importance of each algorithmic design.
We conduct experiments on three models, including diffusion models and GAN-based models, to explore the generality of our method.
DDIM song2020denoising is a fast sampling approach for diffusion models. It proposes to interpret the sampling process of diffusion models through the lens of ordinary differential equations.
Progressive Distillation (PD) salimans2021progressive adopts network distillation hinton2015distilling to progressively reduce the number of steps for diffusion models.
GauGAN park2019semantic is a paired image-to-image translation model which learns to generate a high-fidelity image given a semantic label map.
We compare our methods against the following baselines:
Patch. We crop the smallest patch covering all the edited regions, feed it into the model, and blend the output patch into the original image.
Crop. For each convolution $F_l$, we crop the smallest rectangular region that covers all masked elements of the activation $\Delta A_l$, feed it into $F_l$, and scatter the output patch into $F_l(A_l^\text{original})$.
40% Pruning. We uniformly prune 40% weights of the models without further fine-tuning, as our method directly uses the pre-trained weights. Since the fine-grained pruning is unstructured, it requires special hardware to achieve measured speedup, so we do not report MACs for this baseline.
0.19 GauGAN. We reduce each convolution layer of GauGAN to 19% of the original channels (a 21× MACs reduction) and train it from scratch.
GAN Compression li2020gan. A general-purpose compression method for conditional GANs. GAN Comp. (S) means GAN Compression with a larger compression ratio.
0.5 Original means linearly scaling each layer of the original model to 50% channels, and we only use this to benchmark our efficiency results.
We use the following two datasets in our experiments:
LSUN Church. We use the LSUN Church Outdoor dataset yu15lsun and follow the same preprocessing steps as prior works ho2020denoising; song2020score. To automatically generate a stroke editing benchmark, we first use Detic zhou2021detecting to segment the images in the validation set. For each segmented object, we use its segmentation mask to inpaint the image with CoModGAN zhao2021comodgan and treat the inpainted image as the original image. We generate the corresponding user strokes by blurring the masked regions with a median filter and quantizing them into 6 colors, following SDEdit meng2022sdedit (a minimal sketch of this stroke synthesis is given below). We collect 454 editing pairs in total (431 synthetic + 23 manual). We evaluate DDIM song2020denoising and PD salimans2021progressive on this dataset.
Cityscapes. The dataset cordts2016cityscapes contains images of German street scenes. The training and validation sets consist of 2975 and 500 images, respectively. Our editing dataset has 1505 editing pairs in total. We evaluate GauGAN park2019semantic on this dataset.
Please refer to Appendix B for more details about the benchmark datasets.
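Below is a minimal sketch of the stroke synthesis described above for LSUN Church; the median filter size and the blending details are our assumptions for illustration, not the exact benchmark-generation code:

```python
# Minimal sketch: turn a masked object region into stroke-like flat colors.
from PIL import Image, ImageFilter
import numpy as np

def synthesize_strokes(image_path, mask, median_size=23, colors=6):
    """mask: (H, W) boolean array marking the edited object region."""
    image = Image.open(image_path).convert("RGB")
    # Blur the image with a median filter, then quantize to a small palette.
    blurred = image.filter(ImageFilter.MedianFilter(size=median_size))
    quantized = blurred.quantize(colors=colors).convert("RGB")
    out = np.array(image)
    out[mask] = np.array(quantized)[mask]        # replace only the masked region
    return Image.fromarray(out)
```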
Following previous works meng2022sdedit; li2020gan; park2019semantic, we use the standard metrics Peak Signal-to-Noise Ratio (PSNR, higher is better), LPIPS (lower is better) zhang2018perceptual, and Fréchet Inception Distance (FID, lower is better) heusel2017gans; parmar2021cleanfid (we use clean-fid for the FID calculation) to evaluate the image quality on both LSUN Church yu15lsun and Cityscapes cordts2016cityscapes. For Cityscapes, we also adopt a semantic segmentation metric to evaluate the generated images. Specifically, we run DRN-D-105 yu2017dilated on the generated images and compute the mean Intersection over Union (mIoU) of the segmentation results. Generally, a higher mIoU indicates that the generated images look more realistic and are better aligned with the input.
| Model | Method | MACs | MACs Ratio | PSNR (↑) w/ G.T. | PSNR (↑) w/ Orig. | LPIPS (↓) w/ G.T. | LPIPS (↓) w/ Orig. | FID (↓) | mIoU (↑) |
|---|---|---|---|---|---|---|---|---|---|
| DDIM | Original | 249G | – | 26.8 | – | 0.069 | – | 65.4 | – |
| DDIM | 40% Pruning | – | – | 24.9 | 31.0 | 0.991 | 0.101 | 72.2 | – |
| DDIM | Patch | 72.0G | 3.5× | 26.8 | 40.6 | 0.076 | 0.022 | 66.4 | – |
| DDIM | Ours | 65.3G | 3.8× | 26.8 | 52.4 | 0.070 | 0.009 | 65.8 | – |
| PD | Original | 66.9G | – | 21.9 | – | 0.143 | – | 90.0 | – |
| PD | 40% Pruning | – | – | 21.6 | 37.6 | 0.164 | 0.051 | 101 | – |
| PD | Ours | 32.5G | 2.1× | 21.9 | 60.7 | 0.154 | 0.003 | 90.1 | – |
| GauGAN | Original | 281G | – | 15.8 | – | 0.409 | – | 55.4 | 62.4 |
| GauGAN | GAN Comp. li2020gan | 31.2G | 9.0× | 15.8 | 19.5 | 0.412 | 0.288 | 55.5 | 61.5 |
| GauGAN | Ours | 30.7G | 9.2× | 15.8 | 26.5 | 0.413 | 0.113 | 54.4 | 62.1 |
| GauGAN | 0.19 GauGAN | 13.3G | 21× | 15.5 | 18.6 | 0.424 | 0.322 | 57.9 | 53.5 |
| GauGAN | GAN Comp. (S) | 9.64G | 29× | 15.7 | 19.1 | 0.422 | 0.310 | 50.4 | 57.4 |
| GauGAN | GAN Comp.+Ours | 7.06G | 40× | 15.7 | 19.2 | 0.416 | 0.299 | 54.6 | 60.0 |
The number of denoising steps for DDIM and PD is 100 and 8, respectively, and we use 50 and 5 steps for SDEdit. We dilate the difference mask by 5, 2, 5, and 1 pixels for DDIM, PD with resolution 128, PD with resolution 256, and GauGAN, respectively. Besides, we apply our sparse kernel only to convolution layers whose input feature map resolution exceeds a per-model threshold for DDIM, PD, the original GauGAN, and GAN Compression. For DDIM song2020denoising and PD salimans2021progressive, we pre-compute and reuse the statistics of the original image for all group normalization layers wu2018group. For GAN Compression li2020gan, we pre-compute and reuse the statistics of the original image for all instance normalization layers ulyanov2016instance above a certain resolution. For all models, the sparse block size is 6 for 3×3 convolutions and 4 for 1×1 convolutions.
We report the quantitative results of applying our method on DDIM song2020denoising, Progressive Distillation (PD) salimans2021progressive, and GauGAN park2019semantic in Table 1 and show the qualitative results in Figure 6. For PSNR and LPIPS, with G.T. means computing the metric with the ground-truth images. With Orig. means computing the metric with the samples generated by the original model. On LSUN Church, we only use 431 synthetic images for the PSNR/LPIPS with G.T. metrics, as manual edits do not have ground truths. For the other metrics, we use the entire LSUN Church dataset (431 synthetic + 23 manual). On Cityscapes, we view the synthetic semantic maps as the original input and the ground-truth semantic maps as the edited input for the PSNR/LPIPS with G.T. metrics, which has 1505 samples. For the other metrics, we include the symmetric edits (view the ground-truth semantic maps as the original inputs and synthetic semantic maps as the edited inputs), which has 3010 samples in total. For the models with method Patch and Ours, whose computation is edit-dependent, we measure the average MACs over the whole dataset.
| Model | Edit Size | Method | MACs | MACs Ratio | 3090 Latency | 3090 Speedup | 2080Ti Latency | 2080Ti Speedup | i9-10920X Latency | i9-10920X Speedup | M1 Pro Latency | M1 Pro Speedup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DDIM | – | Original | 248G | – | 37.5ms | – | 54.6ms | – | 609ms | – | 12.9s | – |
| DDIM | – | 0.5 Original | 62.5G | 4.0× | 20.0ms | 1.9× | 31.2ms | 1.8× | 215ms | 2.8× | 3.22s | 4.0× |
| DDIM | 1.20% | Crop | 32.6G | 7.6× | 15.5ms | 2.4× | 29.3ms | 1.9× | 185ms | 3.3× | 1.85s | 6.9× |
| DDIM | 1.20% | Ours | 33.4G | 7.5× | 12.6ms | 3.0× | 19.1ms | 2.9× | 147ms | 4.1× | 1.96s | 6.6× |
| DDIM | 15.5% | Crop | 155G | 1.6× | 30.5ms | 1.2× | 44.5ms | 1.2× | 441ms | 1.4× | 8.09s | 1.6× |
| DDIM | 15.5% | Ours | 78.9G | 3.2× | 19.4ms | 1.9× | 29.8ms | 1.8× | 304ms | 2.0× | 5.04s | 2.6× |
| PD256 | – | Original | 119G | – | 35.1ms | – | 51.2ms | – | 388ms | – | 6.18s | – |
| PD256 | – | 0.5 Original | 31.0G | 3.8× | 29.4ms | 1.2× | 43.2ms | 1.2× | 186ms | 2.1× | 1.72s | 3.6× |
| PD256 | 1.20% | Ours | 25.9G | 4.6× | 18.6ms | 1.9× | 26.4ms | 1.9× | 152ms | 2.5× | 1.55s | 4.0× |
| PD256 | 15.5% | Ours | 48.5G | 2.5× | 21.4ms | 1.6× | 30.7ms | 1.7× | 250ms | 1.6× | 3.22s | 1.9× |
| GauGAN | – | Original | 281G | – | 45.4ms | – | 49.5ms | – | 682ms | – | 14.1s | – |
| GauGAN | – | GAN Compression | 31.2G | 9.0× | 17.0ms | 2.7× | 25.0ms | 2.0× | 333ms | 2.1× | 2.11s | 6.7× |
| GauGAN | 1.18% | Ours | 15.3G | 18× | 11.1ms | 4.1× | 19.3ms | 2.6× | 114ms | 6.0× | 0.990s | 14× |
| GauGAN | 1.18% | GAN Comp.+Ours | 5.59G | 50× | 10.8ms | 4.2× | 16.2ms | 3.1× | 53.1ms | 13× | 0.370s | 38× |
| GauGAN | 13.5% | Ours | 69.8G | 4.0× | 17.8ms | 2.5× | 27.1ms | 1.8× | 238ms | 2.9× | 4.06s | 3.5× |
| GauGAN | 13.5% | GAN Comp.+Ours | 10.8G | 26× | 11.8ms | 3.8× | 17.4ms | 2.8× | 94.4ms | 7.2× | 0.741s | 19× |
For DDIM and Progressive Distillation, our method outperforms all baselines consistently and achieves results on par with the original model. The Patch baseline fails when the edited region is small, as the global context is insufficient. Although our method only applies convolutional filters to the locally edited regions, we can reuse the global context stored in the original activations. Therefore, our method performs the same as the original model. For GauGAN, our method also performs better than GAN Compression li2020gan with an even larger MACs reduction. When applying our method to GAN Compression, we further achieve a 40× MACs reduction with minor performance degradation, beating both 0.19 GauGAN and GAN Comp. (S).
For real-world interactive image editing applications, inference acceleration on hardware is more critical than the computation reduction. To verify the effectiveness of our proposed engine, we measure the speedup of the edit examples shown in Figure 6 on four devices with different computational power, including NVIDIA RTX 3090, NVIDIA RTX 2080Ti, Intel Core i9-10920X CPU, and Apple M1 Pro CPU. We use batch size 1 to simulate real-world use. For GPU devices, we first perform 200 warm-up runs and measure the average latency of the next 200 runs. For CPU devices, we perform 10 warm-up runs and 10 test runs, repeat this process 5 times, and report the average latency. The latency is measured in PyTorch 1.7 (https://github.com/pytorch/pytorch). The results are shown in Table 2.

The original Progressive Distillation salimans2021progressive can only generate 128×128 images, which is too small for real use, so we add some extra layers to adapt the model to resolution 256×256. For fair comparisons, we also pre-compute the normalization parameters for the Crop baseline. When the edit pattern resembles a rectangle, this baseline reduces a similar amount of computation to ours (e.g., the first example of DDIM in Figure 6). However, its speedup is still worse than ours on RTX 3090, 2080Ti, and Intel Core i9-10920X due to the large memory indexing overheads in native PyTorch. When the edited region is far from a rectangle (e.g., the third example of DDIM), the cropped patches contain much redundancy. Therefore, even though only 15.5% of the region is edited, the MACs reduction is only 1.6×. With a 1.2% edit size, our method achieves 7.5×, 4.6×, and 18× MACs reductions for DDIM, Progressive Distillation, and GauGAN, respectively. With SIGE, we achieve at most 4.1×, 2.9×, 6.0×, and 14× speedups on RTX 3090, 2080Ti, Intel Core i9-10920X, and Apple M1 Pro CPU, respectively. When applied to GAN Compression, SIGE achieves 9.5× and 38× latency reductions on Intel Core i9-10920X and Apple M1 Pro CPU, respectively.
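For reference, the GPU timing protocol above (200 warm-up runs, 200 timed runs, batch size 1) can be reproduced with a few lines of PyTorch; `model` and `inputs` below are placeholders for any of the benchmarked networks:

```python
# Minimal sketch of the GPU latency measurement protocol.
import time
import torch

@torch.no_grad()
def measure_gpu_latency(model, inputs, warmup=200, runs=200):
    for _ in range(warmup):
        model(*inputs)
    torch.cuda.synchronize()                    # make sure warm-up kernels have finished
    start = time.time()
    for _ in range(runs):
        model(*inputs)
    torch.cuda.synchronize()                    # wait for all timed kernels to finish
    return (time.time() - start) / runs * 1e3   # average latency in milliseconds
```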
Below we perform several ablation studies to show the effectiveness of each design choice.
The pre-computed activations of the original image require additional memory storage. We profile the peak memory usage of the original model and our method in PyTorch. Our method only increases the peak memory usage of a single forward pass for DDIM, PD, GauGAN, and GAN Compression by 0.1G, 0.1G, 0.8G, and 0.3G, respectively. Specifically, it needs to store an additional 169M, 56M, 275M, and 120M parameters for DDIM song2020denoising, PD salimans2021progressive, the original GauGAN park2019semantic, and GAN Compression li2020gan, respectively, for a single forward pass. For the diffusion models, we need to store activations for all denoising steps (e.g., 50 for DDIM and 5 for PD). However, data movement and kernel computation can be executed asynchronously on the GPU, so we can store the activations in CPU memory and load them onto the GPU on demand to reduce the peak memory usage.
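A minimal sketch of this offloading strategy is shown below; the `ActivationCache` class and its interface are our illustration, not part of the released engine:

```python
# Minimal sketch: keep cached activations of all denoising steps in pinned CPU memory
# and prefetch the ones needed for the current step onto the GPU asynchronously.
import torch

class ActivationCache:
    def __init__(self):
        self.cpu_store = {}                          # (step, layer) -> pinned CPU tensor

    def put(self, step, layer, activation):
        # Pinned memory enables fast, asynchronous host-to-device copies later.
        self.cpu_store[(step, layer)] = activation.detach().cpu().pin_memory()

    def prefetch(self, step, layer, device="cuda"):
        # non_blocking=True lets the transfer overlap with ongoing GPU computation.
        return self.cpu_store[(step, layer)].to(device, non_blocking=True)
```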
Table 2(a) shows the effectiveness of each kernel optimization we add to SIGE for DDIM song2020denoising on RTX 2080Ti. Naïvely applying the tiling-based sparse convolution reduces the computation by 7.6×, yet the latency reduction is only 1.6× due to the large memory overheads in Gather and Scatter. Pre-computing the normalization parameters removes the latency of the normalization statistics calculation and reduces the overall latency to 29.6ms. Fusing element-wise operations into Gather and Scatter removes redundant operations applied to the unedited regions and also reduces the memory allocation overheads (about 9ms). Finally, fusing Scatter and Gather into Scatter-Gather, together with the Scatter in the shortcut and main branches, further reduces about 1.6ms of tensor copying overheads, achieving a 2.9× speedup.
Real-world model deployment also depends on deep learning backends with optimized libraries and runtimes. To demonstrate the effectiveness and extensibility of SIGE, we also implement our kernels in the widely-used backend TensorRT (we benchmark the results with TensorRT 8.4) and report the DDIM latency results on RTX 2080Ti in Table 2(b). Our speedup ratio becomes even more prominent with TensorRT than with PyTorch, especially for small edits, as TensorRT better supports small convolutional kernels with higher GPU utilization.
For image editing, existing deep generative models often waste computation by re-synthesizing the image regions that do not require modifications. To solve this issue, we have presented a general-purpose method, Spatially Sparse Inference (SSI), to selectively perform computation on edited regions, and Sparse Incremental Generative Engine (SIGE) to convert the computation reduction to latency reduction on commonly-used hardware. We have demonstrated the effectiveness of our approach in various hardware settings.
As discussed in Section 4.2, our method requires extra memory to store the original activations, which slightly increases the peak GPU memory usage. It may not work on certain memory-constrained devices, especially for the diffusion models (e.g., DDIM song2020denoising), since our method requires storing activations of all denoising steps.
Our engine has limited speedup for convolutions at low resolutions. When the input resolution is low, the active block size needs to be even smaller (e.g., 1 or 2) to obtain a decent sparsity. However, such extremely small block sizes have poor memory locality and result in low hardware efficiency.
Besides, we sometimes observe noticeable boundaries between the edited and unedited regions in our generated samples of GauGAN park2019semantic. This is because, for the GauGAN model, the unedited regions also change slightly when we perform normal inference. However, since our method does not update the unedited regions, there may be some visible seams between the edited and unedited regions, even though the semantics are coherent. Dilating the difference mask helps reduce the gap.
In most cases, an edit only affects the edited regions. However, sometimes an edit also introduces global illumination changes such as shadows and reflections. In this case, as we only update the edited regions, we cannot propagate such global changes outside the edited regions.
In this paper, we investigate how to apply user edits locally without losing global coherence to enable smoother interaction with generative models. In real-world scenarios, people can use an interactive interface to edit an image, and our method provides a quick and high-quality preview of their edits, which eases the process of visual content creation and reduces energy consumption, leading to greener AI applications. The reduced cost also provides a good user experience on lower-end devices, which further democratizes the applications of generative models.
However, our method could be utilized by malicious users to generate fake content, deceive people, and spread misinformation, which may lead to negative social impacts. Following previous works meng2022sdedit, we explicitly specify the usage permission of our engine with proper licenses. Additionally, we run a forensics detector wang2020cnn on the generated results of our method. On GauGAN, our generated images can be detected with high average precision (AP). However, on DDIM song2020denoising and Progressive Distillation salimans2021progressive, the APs are much lower. Such low APs are caused by the model differences between GANs and diffusion models, as observed in SDEdit meng2022sdedit. We believe developing forensic methods for diffusion models is a critical future research direction.
We thank Yaoyao Ding, Zihao Ye, Lianmin Zheng, Haotian Tang, and Ligeng Zhu for the helpful comments on the engine design. We also thank George Cazenavette, Kangle Deng, Ruihan Gao, Daohan Lu, Sheng-Yu Wang and Bingliang Zhang for their valuable feedback. The project is partly supported by NSF, MIT-IBM Watson AI Lab, Kwai Inc, and Sony Corporation.
As mentioned in Section 3.2, we fuse Scatter and the following Gather into a Scatter-Gather operator and also fuse Scatter in the shortcut, main branch and residual addition together. The detailed fusion pattern is shown in Figure 7. For simplicity, we omit the element-wise operations (e.g., Nonlinearity and Scale+Shift). Below we include more implementation details of each fusion design. Please refer to our code for the detailed implementation.
When a Scatter is directly followed by a Gather, we can fuse the two operators into a Scatter-Gather to avoid copying the entire original activation. As shown in Figure 8(a), the black blocks are copied from the original activation and then discarded by the following Gather, which incurs redundant data movement. Instead, we pre-build a Scatter Map to track the data source (Figure 8(b)). If the data at a given position in the Scatter output comes from the original activation, the Scatter Map stores NULL at that position (gray blocks). Otherwise, it stores a triple at that position (non-gray blocks). The first element of the triple indicates the block ID that the data comes from, while the latter two indicate the offsets of the data within the block. Note that this pre-computation is cheap and only needs to be performed once for each resolution. Therefore, in the fused Scatter-Gather, we can use the Scatter Map to index and fetch the data we want directly from either the input blocks or the original activation, given the Gather indices. For example, if we want to gather the data at a certain location, we look up this position in the Scatter Map. If it is NULL, we fetch the data at that location from the original activation. Otherwise, we fetch the data from the input blocks as indicated by the triple. In this way, we avoid copying the unused regions in Scatter.
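The sketch below illustrates the Scatter Map idea in NumPy for readability; the data layout and the -1 sentinel standing in for NULL are our assumptions, and the real engine implements this lookup inside fused CUDA kernels:

```python
# Minimal sketch: a Scatter Map that lets a fused Scatter-Gather fetch values directly
# from either the gathered blocks or the original activation, without materializing
# the scattered tensor.
import numpy as np

def build_scatter_map(h, w, block, positions):
    """positions: list of (i, j) top-left corners of the active blocks."""
    scatter_map = np.full((h, w, 3), -1, dtype=np.int64)    # -1 plays the role of NULL
    for bid, (i, j) in enumerate(positions):
        for di in range(block):
            for dj in range(block):
                scatter_map[i + di, j + dj] = (bid, di, dj)  # (block id, row/col offsets)
    return scatter_map

def fused_scatter_gather(scatter_map, blocks, original, y, x):
    """Fetch the value at (y, x). blocks: (num_blocks, C, block, block); original: (C, H, W)."""
    bid, di, dj = scatter_map[y, x]
    if bid < 0:
        return original[:, y, x]       # untouched region: read the cached activation
    return blocks[bid, :, di, dj]      # edited region: read from the gathered blocks
```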
The convolution in the shortcut branch consumes much less computation than the convolution in the main branch, so the overheads of Gather and Scatter weigh more in the shortcut branch. We fuse the Scatter in the shortcut branch and the main branch, along with the residual addition, into a single Scatter with Block Residual to reduce these overheads. Specifically, as shown in Figure 7, we first scatter the main branch output into the pre-computed activation and add the original residual only at the scattered locations according to Indices. Then we calibrate the resulting feature map with the shortcut output by adding the residual difference in place at the scattered locations indexed by Shortcut Indices.
We elaborate more details on how we build the synthetic edit datasets.
We collect 27 foreground object semantic masks from the validation set. The objects include 4 bicycles, 1 motorcycle, 7 cars, 6 trucks, 3 buses, 5 persons, and 1 train. Figure 11 shows some visualizations of the collected semantic masks. We generate the edits by randomly pasting one of these objects onto the ground-truth semantic maps with augmentation. The augmentation includes random horizontal flips, resizing (within a fixed scale-factor range), and translation (within fixed vertical and horizontal ranges). To make the synthetic edits more plausible, when the scale factor is larger than 1, the vertical translation can only be positive; otherwise, it can only be negative. Figure 9(b) shows some edit examples. The average edited area of the entire dataset is 4.77%, and the detailed distribution is shown in Figure 9(b).
| Method | MACs | MACs Ratio | PSNR (↑) w/ G.T. | PSNR (↑) w/ Orig. | LPIPS (↓) w/ G.T. | LPIPS (↓) w/ Orig. | mIoU (↑) |
|---|---|---|---|---|---|---|---|
| Original | 281G | – | 15.9 | – | 0.414 | – | 57.3 |
| GAN Comp. li2020gan | 31.2G | 9.0× | 15.8 | 19.1 | 0.417 | 0.329 | 56.3 |
| Ours | 30.7G | 9.2× | 15.9 | 27.5 | 0.425 | 0.076 | 56.1 |
| 0.19 GauGAN | 13.3G | 21× | 15.4 | 18.4 | 0.427 | 0.356 | 49.5 |
| GAN Comp. (S) | 9.64G | 29× | 15.8 | 18.9 | 0.422 | 0.344 | 51.2 |
| GAN Comp.+Ours | 7.06G | 40× | 15.8 | 18.8 | 0.429 | 0.345 | 52.4 |
We show the results of our method with different dilation sizes on GauGAN in Figure 12. Increasing the dilation adds more computation but also slightly improves the image quality. Specifically, the shadow boundary of the added car fades as the dilation increases. We choose dilation 1, since the image quality is almost the same as with dilation 20 while delivering the best speed.
In Table 1, we show the quantitative quality results of our method on DDIM song2020denoising, Progressive Distillation (PD) salimans2021progressive, and GauGAN park2019semantic. For DDIM and PD, the unedited regions in the generated image remain the same as the input image due to the mask trick meng2022sdedit. For GauGAN, the generated unedited regions vary across different methods, so the image quality in the unedited regions influences the metrics we report in Table 1. We therefore additionally report the quantitative quality results of GauGAN on the edited regions alone in Table 4. Our method still preserves the image quality of the original GauGAN and matches the performance of GAN Compression li2020gan. When applied to GAN Compression, it achieves a 40× MACs reduction on average, with results on par with GAN Comp. (S) at less computation. This indicates that our method can update the edited regions in high quality without losing the global context.
Recent generative models often have attention layers to improve the generated image quality zhang2019self; vaswani2017attention. Such attention layers can model long-range and multi-level dependencies across image regions, which seems to break the local correspondence between the edited regions in the input image and the generated image. In Figure 13, we visualize the difference between the input and output activations of a self-attention layer in the vanilla DDIM model. The patterns of the input and output activation differences are quite similar and correspond to the difference mask very well. This shows that the local correspondence of user edits still exists with self-attention layers, which justifies our Spatially Sparse Inference algorithm.
In Table 5 and Figure 14, we show the results of applying our method to large edits (32.9% for DDIM and PD256, and 38.7% for GauGAN). Even for such large edits, our method still achieves measured speedups on DDIM, PD256, and GauGAN without losing visual fidelity. Furthermore, in many practical cases, users can decompose a large edit into several small edits. Our method can then incrementally update the results instantly while the edit is being created, as described below.
In Figure 15, we show the results of sequential edits with our method. Specifically, One-time Pre-computation performs as well as the Full Model, demonstrating that our method can be applied to multiple sequential edits with only a one-time pre-computation in most cases. Moreover, for extremely large edits, we can use SIGE to incrementally update the pre-computed features (Incremental Pre-computation) and condition the later edits on the recomputed features. Its results are also as good as the full model's. Therefore, our method handles sequential edits well.
| Model | Edit Size | Method | MACs | MACs Ratio | 3090 Latency | 3090 Speedup | 2080Ti Latency | 2080Ti Speedup | i9-10920X Latency | i9-10920X Speedup | M1 Pro Latency | M1 Pro Speedup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DDIM | – | Original | 248G | – | 37.5ms | – | 54.6ms | – | 609ms | – | 12.9s | – |
| DDIM | 32.9% | Ours | 115G | 2.2× | 26.0ms | 1.4× | 36.9ms | 1.5× | 449ms | 1.4× | 7.53s | 1.7× |
| PD256 | – | Original | 119G | – | 35.1ms | – | 51.2ms | – | 388ms | – | 6.18s | – |
| PD256 | 32.9% | Ours | 64.3G | 1.9× | 25.3ms | 1.4× | 35.1ms | 1.5× | 334ms | 1.2× | 4.47s | 1.4× |
| GauGAN | – | Original | 281G | – | 45.4ms | – | 49.5ms | – | 682ms | – | 14.1s | – |
| GauGAN | – | GAN Compression | 31.2G | 9.0× | 17.0ms | 2.7× | 25.0ms | 2.0× | 333ms | 2.1× | 2.11s | 6.7× |
| GauGAN | 38.7% | Ours | 148G | 1.9× | 27.9ms | 1.6× | 41.7ms | 1.2× | 512ms | 1.3× | 8.37s | 1.7× |
| GauGAN | 38.7% | GAN Comp.+Ours | 18.3G | 15× | 15.3ms | 3.0× | 22.2ms | 2.2× | 169ms | 4.0× | 1.25s | 11× |
Here we list the licenses of the assets we use. DDIM song2020denoising, Progressive Distillation salimans2021progressive, GauGAN park2019semantic, and GAN Compression li2020gan are released under the MIT, Apache, Creative Commons, and BSD licenses, respectively. SDEdit is under the MIT license. Cityscapes cordts2016cityscapes is distributed under its own license, and LSUN Church yu15lsun does not have an explicit license.
Since our method does not involve any model training, all our generated results are obtained on a single NVIDIA RTX 3090, which only takes a matter of hours to process all the test images, including both the original models and our method. We measure the model latency on NVIDIA RTX 3090, NVIDIA RTX 2080Ti, Intel Core i9-10920X CPU, and Apple M1 Pro CPU. On Apple M1 Pro, we use the Intel Anaconda distribution for our Python environment.