[CVPR 2022 Oral] Towards Layer-wise Image Vectorization
Image rasterization is a mature technique in computer graphics, while image vectorization, the reverse path of rasterization, remains a major challenge. Recent advanced deep-learning-based models achieve vectorization and semantic interpolation of vector graphics and demonstrate better topology when generating new figures. However, deep models cannot be easily generalized to out-of-domain testing data. The generated SVGs also contain complex and redundant shapes that are not convenient for further editing. Specifically, the crucial layer-wise topology and fundamental semantics in images are still not well understood and thus not fully explored. In this work, we propose Layer-wise Image Vectorization, namely LIVE, to convert raster images to SVGs while simultaneously maintaining their image topology. LIVE can generate compact SVGs with layer-wise structures that are semantically consistent with human perception. We progressively add new Bézier paths and optimize these paths with the layer-wise framework, newly designed loss functions, and a component-wise path initialization technique. Our experiments demonstrate that LIVE produces more plausible vectorized forms than prior works and generalizes to new images. With the help of this newly learned topology, LIVE yields human-editable SVGs for both designers and other downstream applications. Code is available at https://github.com/Picsart-AI-Research/LIVE-Layerwise-Image-Vectorization.
Scalable Vector Graphics (SVGs), which describe images with a collection of parametric shape primitives, have recently attracted increasing attention due to their high practical value in computer graphics. Compared with raster images, which present visual concepts using ordered pixels, vector images enjoy many advantages, such as compact file size and resolution independence. Most importantly, vector images provide layer-wise topological information, which is crucial for image understanding and editing.
In the last few years, we have witnessed various achievements in image-to-vector translation [14, 3, 24, 7, 29, 16], mainly due to advances in two technical directions: building powerful generation models and employing decent differentiable rendering methods. These methods, despite their promising vectorization and generation ability, have overlooked the topological information hidden behind raster images. The absence of such information incurs inadequate learning of vectorization and requires superfluous shape primitives to compensate [14, 38, 15]. Some methods attempt to resolve this dilemma by either focusing on particular simple datasets [24, 25] or employing a segmentation pre-processing method [8, 7], but each has its own drawbacks and subtleties. The first line of work learns to explore the geometric information of fonts or emojis but cannot be generalized to broader domains. The other line, which relies on segmentation pre-processing, requires heavy pre-processing operations and tends to segment high-contrast texture into many small regions, resulting in redundancy. Hence, a simple yet effective method is desired in the community to capture the layer-wise representation for image-to-vector translation.
In this paper, we introduce a Layer-wise Image VEctorization method, termed LIVE, to translate a raster image into vector graphics (i.e., an SVG) with a layer-wise representation. Different from previous works [15, 24], LIVE is model-free and requires no shape-primitive labels. This property allows us to escape the regime of particular domains like fonts and emojis, and to bypass the difficulty of SVG dataset collection and generalization. Moreover, LIVE enjoys an intuitive and succinct learning course. In each step, we pursue maximal topology exploration rather than only minimizing the pixel-wise difference. The key insight behind this idea is that simply minimizing the vectorization error (e.g., the MSE loss between an input raster image and the rendered vector graphics) would lead to a mean-color error. We achieve this by a component-wise path initialization method and a novel Unsigned Distance guided Focal loss (UDF loss). Besides, to mitigate the self-interaction issue, which frequently occurs during optimization, we present a novel Self-Crossing loss (Xing loss) that adds constraints to the control-point optimization.
We evaluate our proposed method on various tasks, including image-to-vector translation and interpolation across domains (e.g., cliparts, emojis, photos, and natural images), to showcase the effectiveness of LIVE. Our main contributions can be summarized as follows:
We propose LIVE, a general image vectorization pipeline that hierarchically optimizes the vector graphics in a layer-wise manner. Our rendering scheme is fully differentiable and generates layer-wise SVGs that are largely consistent with human perception.
Together with LIVE, we also introduce a general initialization method and novel loss functions, including the Self-Crossing loss (Xing loss) and the Unsigned Distance guided Focal loss (UDF loss). These methods improve the generation of SVGs from raster images, reducing curve intersections and minimizing shape distortion.
Comprehensive experiments demonstrate that LIVE can generate precise and compact SVGs in various domains. Our SVG results surpass the results from prior works in terms of simplicity and layer-wise topology.
In this section, we mainly summarize prior approaches and introduce works that are closely related to our paper.
Rasterization and vectorization are dual problems in computer graphics. In the past decades, many rasterization works focused on either effective rendering [20, 11, 9, 19] or anti-aliasing [6, 4, 2, 18]. Traditional vectorization methods [28, 32, 33, 13, 5, 34] pre-segmented images before vectorization; among them, empirical two-stage algorithms regress segmented components as polygons and bezigons. Researchers also investigated approaches independent of segmentation, such as diffusion curves [21, 35, 37] and gradient meshes. The rise of deep learning motivated researchers to tackle vectorization via differentiable rendering. Yang et al. proposed that bezigons can be directly optimized with self-crafted loss functions by computing gradients using wavelet rasterization. Li et al. obtained shape gradients by differentiating the Reynolds transport theorem with Monte Carlo edge sampling. Meanwhile, combining differentiable rendering techniques with deep learning models is a trending research direction. New networks based on recurrent neural networks, variational autoencoders [10, 17], and transformers have been introduced to tackle vectorization and vector graphics generation. Ha et al. introduce SketchRNN, the first RNN-based sketch generation network. Lopes et al. introduce an SVG decoder and combine it with a pixel VAE to generate novel font SVGs in latent space. Ribeiro et al. propose Sketchformer, a transformer-based network that recovers sketches from raster images.
A human-editable SVG should be well organized into objects and shapes. Prior works have explored such image topology for both raster images and vectorized shapes. A prototype work was Photo2ClipArt, where images were first split into segments that were vectorized and then combined into a visual hierarchy. Similar designs recur in other research works such as [30, 29]. Nevertheless, these methods relied largely on the accuracy of the segmentation step and struggled to recover implicit shape geometry in complex scenes. Another research branch designed end-to-end frameworks to generate or edit image hierarchy in one forward pass. For example, DeepSVG used a VAE as its primary structure, where input strokes are first represented via an encoder and later replaced by resampled strokes via a decoder; however, DeepSVG performed SVG-to-SVG translation, a relatively simple task. Stylized Neural Painting reconstructed images progressively with stylized strokes; its design principle was to greedily search for the best stroke to minimize the loss, but its main focus was on raster images, not SVGs. Lastly, Im2Vec proposed an Encoder-RNN-Rasterizer pipeline to vectorize an image and obtain its topology simultaneously; however, the ordering of the generated shapes was not robust, and the method was domain-specific. Different from the aforementioned methods, our LIVE requires no pre-segmentation and no deep models, yet exhibits a gratifying ability to explore image topologies.
We present a new method to progressively generate an SVG that fits the raster image in a layer-wise fashion. Given an arbitrary input image, LIVE recursively learns the visual concepts by adding new optimizable closed Bézier paths and optimizing all these paths. While various shape primitives could be appended to an SVG, we consider the parametric closed Bézier path as our fundamental shape primitive, like the implementations in [24, 14]. There are several reasons for this choice. First, this strategy greatly reduces the design space and significantly eases the learning course of LIVE. Also, Bézier paths are powerful and can easily approximate diverse shapes, making it unnecessary to introduce various shape primitives. Last, it is convenient to control the shape complexity by varying the number of segments in each path: for complex visual concepts, we can simply increase the segment number to better reconstruct the input, and vice versa. Note that the rendering operation is usually non-differentiable, making it difficult to optimize the paths directly under the sole supervision of the target raster image. To grapple with this dilemma, we take advantage of the differentiable rasterizer of DiffVG.
Algorithm 1 shows the entire pipeline. Briefly, we first introduce a component-wise initialization method that selects the major components as the initialization points. Then we run a recursive pipeline that progressively adds paths according to a path-number scheduler sequence. In each step, we optimize the graphics with newly proposed objective functions, including an Unsigned Distance guided Focal (UDF) loss and a Self-Crossing (Xing) loss, for a better optimization result regarding reconstruction quality and the self-interaction problem. In addition to its layer-wise representation ability, our method can reconstruct an image using a minimal number of Bézier paths, significantly reducing the SVG file size compared to other methods. More details are covered in the following sections.
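The progressive pipeline can be sketched schematically as follows; `live_vectorize`, `init_paths`, and `optimize` are hypothetical names standing in for the paper's components, not the released implementation:

```python
def live_vectorize(target, schedule, optimize, init_paths):
    """Schematic sketch of the progressive layer-wise loop (Algorithm 1).

    schedule: number of new paths added at each stage, e.g. [1, 1, 2, 4].
    init_paths(render, target, k): component-wise initialization of k new paths.
    optimize(paths, target): optimizes all current paths jointly (UDF + Xing
    losses in the paper) and returns the rendered image.
    """
    paths, render = [], None
    for k in schedule:
        # add k new paths, initialized on the largest missing components
        paths += init_paths(render, target, k)
        # re-optimize every path on the canvas, not only the new ones
        render = optimize(paths, target)
    return paths, render
```

The key design choice visible here is that each stage re-optimizes all paths, so earlier layers can still adjust as later layers refine the details.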
We find that the initialization of Bézier paths is crucial in LIVE. A bad initialization leads to unsuccessful topological extraction and redundant shapes. To overcome this defect, we introduce component-wise path initialization, which greatly helps the optimization course.
The design principle of the component-wise path initialization is to identify the most suitable initial location of each path based on the color and size of each component, where one component is a connected area with a uniform fill color. As mentioned earlier, LIVE is a progressive learning pipeline: given the SVG output from previous stages, we prioritize the next learning target so that the chosen component is both large and missing. We identify such a component via the following steps: a) We compute the pixel-wise color difference between the currently rendered SVG and the ground-truth image. b) We reject color differences smaller than a preset threshold; pixel regions with color differences below this threshold are considered to be correctly rendered. c) For the remaining regions, we equally quantize all valid color-difference values into 200 bins, so that the quantization is approximately uniformly distributed. d) Finally, we identify the largest connected component based on the quantization and use its center of mass as the next path's initial location. If we want to add more paths, we choose the top components for next-stage initialization. Note that for each path we use the circle initialization method, in which all control points are initialized uniformly on a circle. Empirically, this simple strategy eases the optimization course and proves helpful.
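Steps a)–d) above can be sketched in NumPy as follows; the function name, the flood-fill component labeling, and the default threshold value are our illustrative assumptions, not the authors' released code:

```python
import numpy as np

def next_path_centers(render, target, tau=0.1, num_bins=200, k=1):
    """Pick initialization centers for the next k paths (steps a-d).

    render, target: float arrays of shape (H, W, 3) in [0, 1].
    Returns up to k (y, x) centers of mass, largest component first.
    """
    h, w, _ = render.shape
    # a) pixel-wise color difference between current render and target
    diff = np.abs(render - target).mean(axis=-1)
    # b) reject pixels whose difference is below the threshold tau
    valid = diff > tau
    # c) quantize the remaining difference values into num_bins bins
    bins = np.zeros((h, w), dtype=np.int32)
    if valid.any():
        lo, hi = diff[valid].min(), diff[valid].max()
        scaled = (diff[valid] - lo) / max(hi - lo, 1e-8) * (num_bins - 1)
        bins[valid] = 1 + np.minimum(scaled.astype(np.int32), num_bins - 1)
    # d) label connected components of equal bin value (4-neighbour flood fill)
    labels = -np.ones((h, w), dtype=np.int32)
    sizes, centers = [], []
    for y in range(h):
        for x in range(w):
            if bins[y, x] == 0 or labels[y, x] >= 0:
                continue
            lab = len(sizes)
            stack, pts = [(y, x)], []
            labels[y, x] = lab
            while stack:
                cy, cx = stack.pop()
                pts.append((cy, cx))
                for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                    if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] < 0 \
                            and bins[ny, nx] == bins[cy, cx]:
                        labels[ny, nx] = lab
                        stack.append((ny, nx))
            pts = np.array(pts, dtype=np.float64)
            sizes.append(len(pts))
            centers.append(pts.mean(axis=0))  # center of mass (y, x)
    order = np.argsort(sizes)[::-1]
    return [tuple(centers[i]) for i in order[:k]]
```

Grouping by quantized bin rather than exact RGB value is what makes the selection robust to small color noise around a component.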
The merit of our component-wise path initialization is that it maintains a good balance between the color and size of the missing region. Unlike DiffVG, which initializes paths randomly, and Neural Painting, which initializes strokes based on MSE, our approach focuses on semantically influential components independent of their RGB values. While adding new paths to the existing figure, our initialization method can always identify the largest missing component of similar color and fill in the major regions.
A straightforward objective is the MSE loss between the rendered image $\hat{I}$ and the target image $I$, $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i}\|\hat{I}_i - I_i\|_2^2$, where $N$ represents the image size. MSE loss is simple yet efficient for image comparison, but it biases the optimization towards the mean color of the entire target image, as shown in Figure 2. This phenomenon arises because the MSE is computed over all available pixels, while not all pixels are related to the path being optimized. Hence, we should focus only on the relevant pixels and ignore the unrelated ones.
To resolve this problem, we introduce the Unsigned Distance guided Focal (UDF) loss, which treats each pixel differently based on its distance to the shape contour. Intuitively, the UDF loss emphasizes differences close to the contour and suppresses differences elsewhere. By doing so, LIVE avoids MSE's mean-color issue and maintains accurate color reconstruction.
Without loss of generality, we formulate our UDF loss for the case of a single path. We render the path and compute each pixel's unsigned distance $d_i$ to the path contour. We then threshold, flip, and normalize the unsigned distance by
$$\tilde{d}_i = \frac{\max(0,\, \tau - d_i)}{\sum_{j} \max(0,\, \tau - d_j)},$$
where both $i$ and $j$ are pixel indices and $\tau$ is a distance threshold, set to 10 by default. Next, we formulate our Unsigned Distance guided Focal loss as
$$\mathcal{L}_{\mathrm{UDF}} = \frac{1}{3}\sum_{i} \tilde{d}_i \sum_{c=1}^{3} \big(I_{i,c} - \hat{I}_{i,c}\big)^2,$$
where $i$ indexes the pixels in $\tilde{d}$ and $c$ indexes the RGB channels. With the help of the UDF loss, we pay close attention to the path contour and avoid the effect of inner or distant regions. Figure 2 shows the learning course with the Unsigned Distance guided Focal loss. To support multiple paths in our LIVE framework, we simply extend Equation 2 by averaging over all paths.
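The threshold-flip-normalize weighting described above can be sketched in a few lines of NumPy; `udf_loss` is our illustrative name, and the distance map is taken as a precomputed input rather than derived from a rasterizer:

```python
import numpy as np

def udf_loss(render, target, dist, tau=10.0):
    """Unsigned-distance-guided focal loss (a NumPy sketch).

    render, target: (H, W, 3) images in [0, 1].
    dist: (H, W) unsigned distance of each pixel to the path contour.
    """
    # threshold and flip: pixels farther than tau from the contour get weight 0
    w = np.maximum(0.0, tau - dist)
    # normalize so the per-pixel weights sum to one
    w = w / max(w.sum(), 1e-8)
    # weighted squared error, averaged over the 3 RGB channels
    sq = ((render - target) ** 2).sum(axis=-1)  # sum over channels c
    return float((w * sq).sum() / 3.0)
```

Because the weight decays to zero away from the contour, color errors deep inside already-covered regions (the source of MSE's mean-color bias) contribute nothing to the gradient.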
We notice that some Bézier paths can become self-intersecting during optimization, leading to detrimental artifacts and improper topology [36, 24]. While additional paths might be expected to cover such artifacts, this would complicate the generated SVG and cannot effectively explore the underlying topological information. To this end, we introduce the self-interaction (Xing) loss to mitigate this problem.
All the Bézier curves in this paper are third-order (cubic). By analyzing a number of optimized shapes, we found that a self-intersecting path always intersects the lines of its control points, and vice versa; Figure 3 shows examples. This suggests that instead of constraining the Bézier path itself, one potential solution is to add a constraint on the control points. Assume the control points of a cubic Bézier path are $A$, $B$, $C$, and $D$ in sequence. We add a constraint that the angle $\alpha$ between $\overrightarrow{AB}$ and $\overrightarrow{CD}$ ($\alpha$ in the figure) should be greater than $180°$. We first determine the characteristic (acute or obtuse) of $\alpha$ as $D_1$ and the value of $\alpha$ as $D_2$ by
$$D_1 = \mathrm{sgn}\big(\overrightarrow{AB} \times \overrightarrow{BC}\big), \qquad D_2 = \frac{\overrightarrow{AB} \times \overrightarrow{CD}}{\|\overrightarrow{AB}\|\,\|\overrightarrow{CD}\|},$$
where $\mathrm{sgn}(\cdot)$ is a sign function that returns 1 (if the input is positive) or 0 (otherwise), and $\times$ denotes the cross product, which returns a real value. We then formulate our Xing loss as
$$\mathcal{L}_{\mathrm{Xing}} = D_1\,\mathrm{ReLU}(-D_2) + (1 - D_1)\,\mathrm{ReLU}(D_2).$$
The basic idea of Equation 4 is that we only penalize the case $\alpha < 180°$ (achieved by the ReLU terms): the first term handles the case $D_1 = 1$ and the second term the case $D_1 = 0$. Combining the UDF loss and the Xing loss, our final loss function is given by
$$\mathcal{L} = \mathcal{L}_{\mathrm{UDF}} + \lambda\,\mathcal{L}_{\mathrm{Xing}},$$
where $\lambda$ is set to 0.01 empirically to balance the two losses.
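The exact Xing-loss equations are garbled in this copy of the paper, so the NumPy sketch below follows only the stated structure: a 0/1 sign term on one cross product, a normalized cross product for the angle, and two ReLU branches. `xing_loss` and the particular vector pairs are our assumptions and should be checked against the authors' released code:

```python
import numpy as np

def cross2d(u, v):
    """Scalar (z-component) cross product of two 2D vectors."""
    return u[0] * v[1] - u[1] * v[0]

def xing_loss(A, B, C, D):
    """Self-crossing (Xing) loss for one cubic Bezier segment (sketch).

    A, B, C, D: control points as length-2 arrays, in sequence.
    """
    AB, BC, CD = B - A, C - B, D - C
    # D1: characteristic of the turn (1 if AB x BC is positive, else 0)
    D1 = 1.0 if cross2d(AB, BC) > 0 else 0.0
    # D2: normalized cross product of AB and CD (signed sine of their angle)
    D2 = cross2d(AB, CD) / max(np.linalg.norm(AB) * np.linalg.norm(CD), 1e-8)
    relu = lambda x: max(0.0, x)
    # penalize only the orientation consistent with self-intersection
    return D1 * relu(-D2) + (1.0 - D1) * relu(D2)
```

Only one of the two ReLU branches is active for a given control-point configuration, so gradients push the offending angle back toward the feasible side without affecting already-valid paths.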
Existing vector graphics datasets [16, 3] mainly focus on the generation of either fonts or icons; a broader domain of images has not been explored, and there is no testing set that can serve as an evaluation benchmark. In this paper, we test our method on two datasets: an Emoji dataset, which collects a subset of emojis from the NotoEmoji project, and a Pics dataset, which collects images from different domains. Figure 4 showcases examples from both datasets.
We collect 134 emojis with various shapes, colors, and combinations from the NotoEmoji project. While this project provides various fonts and icons, we mainly collect the smiling-face images and resize all collected images to a fixed resolution. Compared with the emojis used in prior work, our Emoji dataset includes more images and presents more diversity. Since these images are relatively simple and present clear topological information, we mainly use this dataset to evaluate the exploration of layer-wise representation.
Besides the Emoji dataset, we also introduce the Pics dataset, which contains 153 images, including fonts, icons, and complex clipart images. Compared with the Emoji dataset, the Pics dataset is more complex and challenging for image vectorization. Moreover, some images in the Pics dataset have various backgrounds, further increasing the vectorization difficulty. We mainly use this dataset to examine layer-wise modeling and compact SVGs with fewer paths.
Note that since LIVE is a model-free method, both datasets are used only for evaluation. Besides the two datasets, we also evaluate LIVE on some realistic photos.
We implement LIVE in PyTorch and optimize it with the Adam optimizer, with learning rates of 1 and 0.01 for point and color optimization, respectively. By default, we use four segments for each path in our experiments. The circle radius is set to 5 pixels for circle initialization. In each optimization step, all parameters are trained for 500 iterations. Since our method progressively adds new paths to the canvas, the number of new paths in each step is flexible; considering both efficiency and vectorization quality, we set the number of paths added in each optimization step according to a scheduler sequence. Other scheduling strategies also work, such as adding one path each time or a customized setting.
We first evaluate LIVE’s vectorization quality with both quantitative and qualitative analysis, measuring the differences between input targets and SVG rendered images.
Figure 5 shows visual comparisons with previous state-of-the-art methods, including DiffVG and Im2Vec. For fairness, we set the number of paths to 4 (the number of components in these emojis) and 20 (the default setting in Im2Vec) for evaluation. Clearly, our LIVE achieves a more faithful reconstruction with better component shapes and colors, while the others still exhibit artifacts. Therefore, the proposed LIVE better decouples the geometry of different components. More results are in the supplementary materials.
Next, we quantitatively evaluate the vectorization results on the Emoji and Pics datasets. For a fair comparison, the number of segments is set to 4, the default setting in DiffVG. To show that LIVE can reconstruct an image with a minimal number of paths, we vary the path number from 8 to 64 for the simple Emoji dataset and from 32 to 256 for the complex Pics dataset. For comparison, we calculate the MSE of each image over the entire dataset.
The results on the Emoji and Pics benchmarks are reported in Figure 6. Clearly, LIVE shows a much lower MSE than DiffVG, especially when the path number is small: with only a few paths, LIVE is able to fit the desired shapes, leading to a better result. Adding excessive paths saturates the vectorization performance.
Besides vectorization quality and efficiency, the main objective of LIVE is to build a layer-wise representation. Empirically, LIVE is able to explicitly vectorize each individual visual concept and explore the layer-wise representation for simple images like emojis and simple cliparts. We demonstrate the layer-wise representation ability of LIVE in Figure 7: each component is clearly learned as a single Bézier path. Different from vectorization methods that leverage segmentation pre-processing or use abundant paths, we learn each component as an unbroken shape. In Figure 9, we compare the vectorization results of LIVE and DiffVG on the Emoji benchmark.
For complex images like photos and natural images, the topological clues are relatively hard to model. However, LIVE still exhibits a gratifying "coarse-to-fine" learning style, as shown in Figure 8, and is more likely to achieve better reconstruction performance under the same number of paths. Moreover, we notice that LIVE models local information much better than the others, as shown in the red boxes. This may be explained by our progressive learning and initialization method: in each step, LIVE encourages the new paths to fit local details. Since previous paths have already reconstructed the main context, each newly added path focuses only on its initialized local region under the enforcement of the UDF loss. A comprehensive user study also demonstrates the superiority of LIVE (please refer to the supplementary).
Among existing vectorization methods, some VAE-based methods have explored interpolation [24, 3, 16]. Although LIVE is not based on a VAE model, we show that interpolation is easy to achieve by integrating LIVE with a vanilla raster-image-based VAE.
Before implementing VAE interpolation, we first conduct an interesting experiment: given two semantically similar SVGs generated by LIVE, we directly interpolate the control points of each ordered path linearly. Normally, two SVGs are hard to interpolate due to the disorder of both shapes and control points. In contrast, LIVE does not suffer from this issue because of the ordered topology structure after optimization. Empirically, even with simple linear interpolation of control points, LIVE presents a reasonable result, as shown in Figure 10.
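Because LIVE emits paths in a deterministic order, the interpolation above reduces to a per-control-point lerp; a minimal sketch, with `interpolate_svg` as our illustrative name:

```python
import numpy as np

def interpolate_svg(points_a, points_b, t):
    """Linearly interpolate the control points of two LIVE-generated SVGs.

    points_a, points_b: arrays of shape (num_paths, num_points, 2), with
    paths in the same (deterministic) order produced by LIVE's progressive
    pipeline. t in [0, 1] blends from the first SVG to the second.
    """
    points_a = np.asarray(points_a, dtype=float)
    points_b = np.asarray(points_b, dtype=float)
    return (1.0 - t) * points_a + t * points_b
```

Note that this only works because path i in one SVG corresponds to path i in the other; with unordered outputs (as in DiffVG with a random seed), the same lerp blends unrelated shapes.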
Next, we integrate our method with a VAE model. We train a simple VAE on the MNIST dataset, select two random images, linearly interpolate their latent vectors to obtain the interpolated images, and use LIVE to vectorize the resulting image sequence. To form a continuous sequence, we treat the previous result as the initialization of the next sample. The results in Figure 10 demonstrate that, combined with a vanilla VAE model, our method works for interpolation as well. Given the efficient optimization and strong generalization ability, LIVE can be even more practical for interpolation when combined with a powerful image generation model.
We first investigate the effectiveness of control-point initialization. Figure 11 compares circle initialization with random initialization. Clearly, circle initialization significantly reduces artifacts compared with random initialization. Moreover, we notice that circle initialization is more likely to achieve better vectorization results, as shown in the first row. The reason is that with circle-initialized control points, the closed path starts convex and reaches a finer optimization result.
To understand the effectiveness of the proposed Xing loss, we conduct an ablation study through the visualizations in Figure 12. With the help of the Xing loss, we clearly mitigate the self-interaction problem under the same optimization conditions: given the constraints on the control points, the circle shape tends not to intersect itself. This shows that the proposed Xing loss is an intuitive, simple, yet effective objective for mitigating self-intersection issues. More results are presented in the supplementary materials.
LIVE produces a layer-wise vectorization result, which can be used for clipart creation and other applications. However, some issues remain to discuss. First, the layer-wise optimization is not as efficient as single-pass optimization; other optimization-based methods also suffer from this issue. An interesting research direction is how to combine the highly efficient inference of deep models with the generalization ability of optimization-based methods. Second, introducing gradient colors and adaptively choosing the segment number and color type for each path would be worth exploring. Third, for more complicated images such as landscapes or human photos, combining layer-wise vectorization with deep amodal segmentation in pixel space would be an interesting topic. We leave these for future work.
Image-to-vector technology can be misused to illegally convert and copy vector graphics resources online, especially easily reused and modified fonts or other images. To mitigate these issues, one can protect the copyright of the graphics by watermarking the raster images. Besides, although our method achieves reasonable layer-wise modeling of images, results converted from raster images can still be identified by checking whether each component is sufficiently intact. These measures can help prevent the abuse of similar algorithms.
In this work, we present Layer-wise Image VEctorization (LIVE), a framework that equips image vectorization with layer-wise representation. LIVE progressively fits the input raster image with the help of component-wise path initialization and new loss functions: a UDF loss for vectorization and a Xing loss to mitigate the self-interaction problem. With LIVE, we can explicitly vectorize individual components of a simple emoji or clipart, and investigate the "coarse-to-fine" representation of complex natural images. To ease the evaluation of image vectorization, we also present two datasets, Emoji and Pics. Besides image vectorization, LIVE can also be integrated with other methods to explore applications like interpolation.
We conducted a user study (details: https://wj.qq.com/s2/9665341/19ed) to quantitatively compare LIVE with DiffVG and Neural Painting. We randomly selected 21 images from the Emoji and Pics datasets and invited 20 users to select the method with the best layer-wise learning process. The average scores of DiffVG, Neural Painting, and our LIVE are 14.3%, 11.9%, and 73.8%, respectively, indicating that most people believe LIVE achieves the best layer-wise representation.
We also test on more emoji images, as shown in Figure 13. Without bells and whistles, LIVE explicitly disentangles the visual concepts, where each new path fits a particular component of the input image. Adding more paths does not decrease the performance.
We evaluate the impact of the Xing loss weight in Figure 14. Generally, adding the Xing loss greatly reduces the risk of self-interaction. We notice that a small Xing loss weight achieves the best result, while a larger weight (e.g., 1.0) leads to failed optimization. Empirically, we set the Xing loss weight to 0.01 by default.
An interesting property of LIVE-generated SVGs is the deterministic order of the optimized Bézier paths, due to the progressive learning pipeline and our component-wise initialization method. We demonstrate this property by linearly interpolating two generated SVGs, comparing DiffVG with a random seed, DiffVG with a fixed seed, and our LIVE. Clearly, the interpolation results of DiffVG (random seed) are scrambled because the path order is not deterministic. Even with all random seeds fixed, DiffVG still performs worse than our LIVE.
We next present more interpolation results in Figure 16. Rather than interpolating between two images, we further interpolate new images among four randomly selected images. Holistically, the results shown in Figure 16 indicate that combining a VAE model with our LIVE method achieves results similar to other VAE-based vectorization methods.
In this section, we present more examples comparing our LIVE with DiffVG and Neural Painting. In the following figures, the input image is on the left (in the green box) and the outputs of the different methods are on the right. Empirically, under the same conditions (i.e., the same number of paths/strokes), LIVE exhibits much better representation results, especially when the path number is small. Please zoom in to see the details.