Style transfer is the task of migrating a style from one image to another to synthesize a new artistic image. It is of special interest in visual design, with applications such as painting synthesis and photography post-processing. However, manually creating an image in a particular style requires great skill beyond the capabilities of average users. Therefore, automatic style transfer has become a trending topic in both academic literature and industrial applications.
Text and other stroke-based design elements such as symbols, icons and labels concisely convey abstract imagery in human visual perception and are ubiquitous in daily life. The stylization of text-based binary images, as in Fig. 1(a), is of great research value, but it also poses the challenge of bridging the large visual discrepancy between flat binary shapes and a colorful style image.
Style transfer has been investigated for years, and many successful methods have been proposed, such as the non-parametric Image Quilting and the parametric Neural Style. Non-parametric methods take samples from the style image and place them based on the pixel intensity of the target image [1, 3, 4, 5] to synthesize a new image. Parametric methods represent the style as statistical features, and adjust the target image to satisfy these features. Recent deep-learning-based parametric methods [2, 6, 7] exploit high-level deep features, and thereby have a superior capability for semantic style transfer. However, none of the aforementioned methods is specific to the stylization of text-based binary images. In fact, for non-parametric methods, it is hard to use pixel intensity or deep features to establish a direct mapping between a binary image and a style image, due to their great modality discrepancy. On the other hand, text-based binary images lack high-level semantic information, which limits the performance of parametric methods.
As the method most related to our problem, a text effects transfer algorithm was recently proposed to stylize binary text images. In that work, the authors analyzed the high correlation between texture patterns and their spatial distribution in text effects images, and modeled it as a distribution prior, which has proven highly effective for text stylization. However, this method strictly requires the source style to be a well-structured typography image. Moreover, it follows the idea of Image Analogies and stylizes the image in a supervised manner. For supervised style transfer, in addition to the source style image, its non-stylized counterpart is also required to learn the transformation between them, as shown in Fig. 1(b). Unfortunately, such a pair of inputs is not readily available in practice, which greatly limits its application scope.
In this work, we handle a more challenging unsupervised stylization problem, with only a text-based binary image and an arbitrary style image as input, as in Fig. 1(a). To bridge the large visual discrepancy between the binary image and the style image, we extract the main structural imagery of the style image to build a preliminary mapping to the binary image. The mapping is then refined using a structure transfer algorithm, which adds shape characteristics of the source style to the binary shape. In addition to the distribution constraint, a saliency constraint is proposed to jointly guide the texture transfer process for shape legibility. These improvements allow our unsupervised style transfer to yield satisfying artistic results without the ideal input required by supervised methods.
Furthermore, we investigate the combination of stylized shapes (text, symbols, icons) and background images, which is very common in visual design. Specifically, we propose a new context-aware text-based binary image stylization and synthesis framework, where the target binary shape is seamlessly embedded in a background image with a specified style. By “seamless”, we mean the target shape is stylized to share context consistency with the background image without abrupt image boundaries, such as decorating a blue sky with cloud-like typography. To achieve it, we leverage cues considering both seamlessness and aesthetics to determine the image layout, where the target shape is finally synthesized into the background image. When a series of different styles are available, our method can generate diverse artistic typography, symbols or icons against the background image, thereby facilitating a much wider variety of aesthetic interest expression. In summary, our major technical contributions are:
We pose a new text-based binary image stylization and synthesis problem for visual design and develop the first automatic aesthetics-driven framework to solve it.
We present novel structure and texture transfer algorithms to balance shape legibility with texture consistency, which we show to be effective in style transition between the binary shape and the style image.
We propose a context-aware layout design method to create professional looking artwork, which determines the image layout and seamlessly synthesizes the artistic shape into the background image.
The rest of this paper is organized as follows. In Section II, we review related works in style transfer and text editing. Section III defines the text-based binary image stylization problem, and gives an overview of the framework of our method. In Section IV and V, the details of the proposed legibility-preserving style transfer method and context-aware layout design method are presented, respectively. We validate our method by conducting extensive experiments and comparing with state-of-the-art style transfer algorithms in Section VI. Finally, we conclude our work in Section VII.
II Related Work
II-A Color Transfer
Pioneering methods transfer colors by applying a global transformation to the target image to match the color statistics of a source image [10, 11, 12]. When the target image and the source image have similar content, these methods generate satisfying results. Subsequent methods perform color transfer in a local manner to cope with images of arbitrary scenes. They infer local color statistics in different regions through image segmentation [13, 14], perceptual color categories [15, 16] or user interaction. More recently, Shih et al. employed fine-grained patch/pixel correspondences to transfer illumination and color styles for landscape images and headshot portraits. Yan et al. leveraged deep neural networks to learn effective color transforms from a large database. In this paper, we employ color transfer to reduce the color difference between the style image and the background image for seamless shape embedding.
II-B Texture Synthesis
Texture synthesis technologies attempt to generate new textures from a given texture example. Non-parametric methods use pixel or patch samplings from the example to synthesize new textures. For these methods, the coherence of neighboring samples is the research focus, and patch blending via image averaging, dynamic programming, graph cut and coherence function optimization has been proposed. Meanwhile, parametric methods build mathematical models to simulate certain texture statistics of the texture example. Among these methods, the most popular is the Gram-matrix model proposed by Gatys et al. Using the correlations between multi-level deep features to represent textures, this model produces natural textures of noticeably high perceptual quality. In this paper, we adapt conventional texture synthesis methods to deal with binary text images. We apply four constraints of text shape, texture distribution, texture repetitiveness and text saliency to the texture synthesis method of Wexler et al. to build our novel texture transfer model.
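As an aside on the Gram-matrix model mentioned above, the texture statistic it matches can be sketched in a few lines; this is an illustrative reimplementation with plain Python lists, not code from any of the cited works:

```python
def gram_matrix(features):
    """Compute the Gram matrix G[i][j] = <F_i, F_j> / N from feature maps.

    `features` is a list of C feature maps, each flattened to a list of N
    activations. The Gram matrix captures which channels co-activate, which
    is the texture statistic matched by Gram-matrix style models.
    """
    C = len(features)
    N = len(features[0])
    gram = [[0.0] * C for _ in range(C)]
    for i in range(C):
        for j in range(i, C):  # the matrix is symmetric, so fill both halves
            g = sum(features[i][k] * features[j][k] for k in range(N)) / N
            gram[i][j] = gram[j][i] = g
    return gram
```

In the full model, one such matrix is computed per network layer, and textures are compared by the distance between their Gram matrices across layers.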
II-C Texture Transfer
In texture transfer, textures are synthesized under the structure constraint from an additional content image. According to whether a guidance map is provided, texture transfer can be further categorized into supervised and unsupervised methods.
Supervised methods, also known as image analogies, rely on the availability of an input image and its stylized result. These methods learn a mapping between such an example pair, and stylize the target image by applying the learned mapping to it. Since first reported, image analogies have been extended in various ways, such as video analogies and fast image analogies. The main drawback of image analogies is the strict requirement for a registered example pair. Most often, we only have a style image at hand, and need to turn to unsupervised texture transfer methods.
Without the guidance of the example pair, unsupervised methods directly find mappings between different texture modalities. For instance, Efros and Freeman  introduced a guidance map derived from image intensity to help find correspondences between two texture modalities. Zhang et al. 
used a sparse-based initial sketch estimation to construct a mapping between the source sketch texture and the target image. Frigo et al. put forward a patch partition mechanism for adaptive patch mapping, which balances the preservation of structures and textures. However, these methods attempt to use intensity features to establish a direct mapping between the target image and the style image, and would fail in our case, where the two input images have a huge visual difference. By contrast, our method extracts an abstract binary imagery from the style image, which shares the same modality as the target image and serves as a bridge.
Fueled by the recent development of deep learning, there has been rapid advancement in deep-based methods that leverage high-level image features for style transfer. In the pioneering Neural Style, the authors adapted Gram-matrix-based texture synthesis to style transfer by incorporating content similarities, which enables the composition of different perceptual information. This method has inspired a new wave of research on video stylization, perceptual factor control and acceleration. In parallel, Li and Wand introduced a framework called CNNMRF that exploits a Markov Random Field (MRF) to enforce local texture transfer. Based on CNNMRF, Neural Doodle incorporates semantic maps for analogy guidance, which turns semantic maps into artwork. The main advantage of parametric deep-based methods is their ability to establish semantic mappings. For instance, it has been reported that such networks can find accurate correspondences between real faces and sketched faces, even if their appearances differ greatly in the pixel domain. However, in our problem, the plain text image provides little semantic information, so these parametric methods lose their advantage over our non-parametric method, as demonstrated in Fig. 12.
II-D Text Stylization
In the domain of text image editing, several tasks have been addressed, such as calligrams [34, 35, 36] and handwriting generation [37, 38]. Lu et al. arranged and deformed pre-designed patterns along user-specified paths to synthesize decorative strokes. Handwriting style transfer is accomplished using either non-parametric sampling from a stroke library created by trained artists or parametric neural networks that learn stroke styles. However, most of these studies focus on text deformation. Much less has been done with respect to striking text effects such as shadows, outlines, dancing flames (see Fig. 1) and soft clouds (see Fig. 2).
To the best of our knowledge, the work of Yang et al. is the only prior attempt at generating text effects. It solves the text stylization problem using a supervised texture transfer technique: a pair consisting of registered raw text and its text effects counterpart is provided to calculate the distribution characteristics of the text effects, which guide the subsequent texture synthesis. In contrast, our framework automatically generates artistic typography, symbols and icons from arbitrary source style images, without such input requirements. Our method provides a more flexible and effective tool to create unique visual design artworks.
III Problem Formulation and Framework
We aim to automatically embed the target text-based binary shape in a background image with the style of a source reference image. To achieve this goal, we decompose the task into two subtasks: 1) Style transfer for migrating the style from source images to text-based binary shapes to design artistic shapes. 2) Layout design for seamlessly synthesizing artistic shapes in the background image to create visual design artwork such as posters and magazine covers.
Fig. 2 shows an overview of our algorithm. For the first subtask, we abstract a binary image from the source style image, adjust its contour and the outline of the target shape to narrow the structural difference between them. The adjusted results establish an effective mapping between the target binary image and the source style image. Then we are able to synthesize textures for the target shape. For the second subtask, we first seek the optimal layout of the target shape in the background image. Once the layout is determined, the shape is seamlessly synthesized into the background image under the constraint of the contextual information. The color statistics of the background image and the style image are optionally adjusted to ensure visual consistency.
III-A Style Transfer
The goal of text-based image style transfer is to stylize the target text-based binary image $T$ based on a style image $S$. In the previous text style transfer method, the distribution prior is a key factor in its success. However, this prior requires that $S$ has highly structured textures and that its non-stylized counterpart is provided. By comparison, we solve a tougher unsupervised style transfer problem, where the non-stylized counterpart is not provided and $S$ contains arbitrary textures. To meet these challenges, we propose to build a mapping between $T$ and $S$ using a binary imagery of $S$, and gradually narrow their visual discrepancy through structure and texture transfer. Moreover, a saliency cue is introduced for shape legibility.
In particular, instead of directly handling $T$ and $S$, we first propose a two-stage abstraction method to abstract a binary imagery $\tilde{S}$ as a bridge, based on the color features and contextual information of $S$ (Section IV-A). Since the textons in $\tilde{S}$ and the glyphs in $T$ probably do not match, a legibility-preserving structure transfer algorithm is proposed to adjust the contours of $\tilde{S}$ and $T$ (Section IV-B). The resulting $\hat{S}$ and $\hat{T}$ share the same structural features and establish an effective mapping between $S$ and $T$. Given $\hat{S}$ and $\hat{T}$, we are able to synthesize textures for $\hat{T}$ by objective function optimization (Section IV-C). In addition to the distribution term, we further introduce a saliency term in our objective function, which guides our algorithm to stylize the interior of the target shape (white pixels in $\hat{T}$) to be of high saliency and the exterior (black pixels in $\hat{T}$) to be of low saliency. This principle enables the stylized shape to be highlighted against the background, thereby increasing its legibility.
III-B Layout Design
The goal of context-aware layout design is to seamlessly synthesize the stylized shape into a background image $B$ with the style of $S$. We formulate an optimization function to estimate the optimal embedding position of the target shape. The proposed text-based image stylization is then adjusted to incorporate the contextual information of $B$ for seamless image synthesis.
In particular, the color statistics of $S$ and $B$ are first adjusted to ensure color consistency (Section V-A). Then, we seek the optimal position of the target shape in the background image based on four cues: local variance, non-local saliency, coherence across images, and visual aesthetics (Section V-B). Once the layout is determined, the background information around the target shape is collected. Constrained by this contextual information, the target shape is seamlessly synthesized into the background image in an image-inpainting manner (Section V-C).
IV Text-Based Binary Image Style Transfer
[Fig. 3: guidance map extraction results. (e)-(g) K-means clustering results of (a)-(c), respectively. (h) Result by multi-scale label-map extraction. Cropped regions are zoomed for better comparison.]
IV-A Guidance Map Extraction
The perception of texture is a process of acquiring abstract imagery, which enables us to see concrete images in the disordered (such as clouds). This inspires us to follow the human abstraction of texture information to extract the binary imagery $\tilde{S}$ from the source image $S$. $\tilde{S}$ serves as a guidance map, where white pixels indicate the reference region for the shape interior (foreground) and black pixels that for the shape exterior (background). The boundary between foreground and background depicts the morphological characteristics of the textures in $S$. We propose a simple yet effective two-stage method to abstract the texture into the foreground and the background with the help of texture removal technologies.
In particular, we use Relative Total Variation (RTV) to remove the color variance inside the texture, and obtain a rough structure abstraction $S'$. However, texture contours are also smoothed in $S'$ (see Fig. 3(b)(f)). Hence, we put forward a two-stage abstraction method. In the first stage, pixels in $S'$ are abstracted as fine-grained superpixels to precisely match the texture contours. Each superpixel uses its mean pixel value in $S'$ as its feature vector to suppress the texture variance. In the second stage, the superpixels are further abstracted into the coarse-grained foreground and background via $K$-means clustering ($K = 2$). Fig. 3 shows an example where our two-stage method generates an accurate abstract imagery of the plaster wall. In this example, our result has more details at the boundary than the one-stage method, and fewer errors than the state-of-the-art label-map extraction method (see the zoomed region in Fig. 3(h)).
Finally, we use saliency as the criterion to determine the foreground and background of the image. Pixel saliency in $S$ is detected, and the cluster with the higher mean pixel saliency is set as the foreground. Compared with the brightness criterion commonly used in artistic thresholding methods [45, 46] to retrieve artistic binary images, our criterion helps the foreground shape find salient textures.
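The second-stage clustering and the saliency criterion can be sketched as follows, assuming superpixel mean colors and mean saliencies have already been computed; the function names and the simple Lloyd-style loop are ours, not the paper's implementation:

```python
import random

def kmeans_2(colors, iters=20, seed=0):
    """Cluster superpixel mean-color feature vectors into K=2 groups
    (candidate foreground/background), returning a label per superpixel."""
    rng = random.Random(seed)
    centers = rng.sample(colors, 2)
    labels = [0] * len(colors)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for i, c in enumerate(colors):
            d = [sum((a - b) ** 2 for a, b in zip(c, ctr)) for ctr in centers]
            labels[i] = 0 if d[0] <= d[1] else 1
        # update step: recompute each center as the mean of its members
        for k in (0, 1):
            members = [colors[i] for i in range(len(colors)) if labels[i] == k]
            if members:
                centers[k] = [sum(ch) / len(members) for ch in zip(*members)]
    return labels

def pick_foreground(labels, saliency):
    """The cluster with the higher mean superpixel saliency is the foreground."""
    means = []
    for k in (0, 1):
        vals = [saliency[i] for i in range(len(labels)) if labels[i] == k]
        means.append(sum(vals) / len(vals) if vals else 0.0)
    return 0 if means[0] >= means[1] else 1
```

A production implementation would operate on SLIC-style superpixels of the RTV-smoothed image; the sketch only shows the cluster-then-select logic.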
IV-B Structure Transfer
Directly using the $\tilde{S}$ extracted in Section IV-A and the input $T$ for style transfer results in unnatural texture boundaries, as shown in Fig. 4(a). A potential solution could be to employ a shape synthesis technique to minimize the structural inconsistency between $\tilde{S}$ and $T$. In Layered Shape Synthesis (LSS), shapes are represented as a collection of boundary patches at multiple resolutions, and the style of a shape is transferred onto another by optimizing a bidirectional similarity function. However, such an approach does not consider legibility in our application, and the shape becomes illegible after adjustment, as shown in the second row of Fig. 6. Hence, we incorporate a stroke trunk protection mechanism into LSS and propose a legibility-preserving structure transfer method.
The main idea is to adjust the shape of the stroke ends while preserving the shape of the stroke trunk, because the legibility of a glyph is mostly determined by the shape of its trunk. Toward this, we extract the skeleton from $T$ and detect each stroke end as a circular region centered at an endpoint of the skeleton. For each resolution level $\ell$, we generate a mask $M_\ell$ indicating the stroke end regions, as shown in Fig. 5. At the top level $\ell = L$, the radius of the circular region is set to the average radius of the stroke. The radius increases linearly as the resolution increases, and at the bottom level $\ell = 0$ (original resolution), it is set to just cover the entire shape. Let $T_\ell$, $\tilde{S}_\ell$ and $\hat{T}_\ell$ denote the downsampled $T$, the downsampled $\tilde{S}$ and the legibility-preserving structure transfer result at level $\ell$, respectively. Given $T_\ell$, $\tilde{S}_\ell$, $M_\ell$ and $\hat{T}_{\ell+1}$, we calculate $\hat{T}_\ell$ by

$$\hat{T}_\ell = \mathrm{bin}\Big( M_\ell \circ \Psi\big( (\hat{T}_{\ell+1})\!\uparrow,\ \tilde{S}_\ell \big) + (1 - M_\ell) \circ T_\ell \Big),$$

where $\circ$ is the element-wise multiplication operator and $\uparrow$ is the bicubic upsampling operator. $\Psi(\cdot, \tilde{S}_\ell)$ is the shape synthesis result given $\tilde{S}_\ell$ as the shape reference by LSS, and $\mathrm{bin}(\cdot)$ is the binarization operation with a fixed threshold. The pipeline of the proposed structure transfer is visualized in Fig. 5. In our implementation, the image resolution at the top level is fixed; therefore the degree of deformation is solely controlled by $L$. We show in Fig. 6 that our stroke trunk protection mechanism effectively balances structural consistency with shape legibility even under a very large $L$.
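The per-level combination above can be sketched as follows; this is an illustrative pure-Python reading of the mask-based blend (the LSS synthesis and bicubic upsampling are assumed to be computed elsewhere, and all names are ours, not the paper's):

```python
def combine_level(mask, synthesized, original, tau=0.5):
    """One level of legibility-preserving structure transfer.

    Inside the stroke-end mask, take the shape-synthesis result; on the
    stroke trunk (mask == 0), keep the original glyph; finally binarize
    with threshold `tau`.

    All arguments are equally sized 2-D lists of floats in [0, 1];
    `synthesized` stands in for the (upsampled) LSS result.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # element-wise blend, then hard threshold back to a binary shape
            v = mask[y][x] * synthesized[y][x] + (1 - mask[y][x]) * original[y][x]
            out[y][x] = 1 if v >= tau else 0
    return out
```

Running this from the coarsest to the finest level, with the mask radius growing per level, reproduces the gradual ends-first deformation described above.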
For characters without stroke ends (e.g., “O”), our method automatically reverts to the baseline LSS algorithm. We use a compromise strategy to keep these characters consistent with the average degree of adjustment of the characters that have stroke ends. Specifically, during our $L$-level hierarchical shape transfer, these characters are masked out in the first several levels, and are then adjusted using LSS in the remaining levels.
In addition, we propose a bidirectional structure transfer (Fig. 4(a)-(d)) to further enhance shape consistency, where a backward transfer is added after the aforementioned forward transfer. The backward transfer migrates the structural style of the forward transfer result $\hat{T}$ back to $\tilde{S}$ to obtain $\hat{S}$ using the original LSS algorithm. The results $\hat{T}$ and $\hat{S}$ are used as guidance for texture transfer. For simplicity, we omit the superscripts in the following.
IV-C Texture Transfer
In our scenario, $S$ does not contain well-structured text effects, and thus the distribution prior used to ensure shape legibility takes limited effect. We introduce a saliency cue to compensate. We augment the texture synthesis objective function with the proposed saliency term as follows,
$$\min_{q} \sum_{p} \Big( E_{\mathrm{app}}(p, q) + \lambda_{\mathrm{dist}} E_{\mathrm{dist}}(p, q) + \lambda_{\mathrm{psy}} E_{\mathrm{psy}}(p, q) + \lambda_{\mathrm{sal}} E_{\mathrm{sal}}(p, q) \Big),$$

where $p$ is the center position of a target patch in $\hat{T}$ and the output image, and $q$ is the center position of the corresponding source patch in $\hat{S}$ and $S$. The four terms $E_{\mathrm{app}}$, $E_{\mathrm{dist}}$, $E_{\mathrm{psy}}$ and $E_{\mathrm{sal}}$ are the appearance, distribution, psycho-visual and saliency terms, respectively, weighted by the $\lambda$'s. $E_{\mathrm{app}}$ and $E_{\mathrm{dist}}$ constrain the similarity of the local texture pattern and the global texture distribution, respectively. $E_{\mathrm{psy}}$ penalizes texture over-repetitiveness for naturalness. We refer the reader to the text effects transfer work for details of the first three terms. For the distribution term $E_{\mathrm{dist}}$, we truncate the distance maps of $\hat{T}$ and $\hat{S}$, where distance $0$ corresponds to the shape boundaries. By doing so, we relax the distribution constraint for pixels far away from the shape boundary; these pixels are mainly controlled by our saliency cue,

$$E_{\mathrm{sal}}(p, q) = w(p)\,\big( \mathrm{Sal}(q) - \hat{T}(p) \big)^{2},$$

where $\mathrm{Sal}(q)$ is the saliency at pixel $q$ in $S$, $\hat{T}(p) \in \{0, 1\}$ indicates whether $p$ lies inside the shape, and $w(p)$ is a Gaussian weight determined by $d(p)$, the distance of $p$ to the shape boundary. The saliency term encourages pixels inside the shape to find salient textures for synthesis and keeps the background less salient. We show in Fig. 7 that a higher weight of our saliency term makes the stylized shape more prominent.
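One plausible realization of the saliency cue is sketched below; the exact weighting in the paper is not reproduced, and in particular the assumption that the Gaussian weight grows with the distance to the boundary (so that far-from-boundary pixels are governed by saliency rather than distribution) is ours:

```python
import math

def saliency_cost(sal_q, inside, dist_to_boundary, sigma=10.0):
    """Illustrative saliency term for one patch correspondence.

    sal_q            : saliency of the candidate source pixel q in S, in [0, 1]
    inside           : True if the target pixel p lies inside the shape
    dist_to_boundary : distance of p to the shape boundary
    sigma            : bandwidth of the Gaussian weighting (assumed value)

    Pixels inside the shape are pushed toward salient source textures
    (target saliency 1), background pixels toward non-salient ones
    (target saliency 0); the weight grows with distance to the boundary,
    complementing the truncated distribution term near the boundary.
    """
    target = 1.0 if inside else 0.0
    weight = 1.0 - math.exp(-dist_to_boundary ** 2 / (2 * sigma ** 2))
    return weight * (sal_q - target) ** 2
```

With this shape, a salient source patch is cheap inside the shape and expensive outside it, which is the legibility behavior described in the text.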
V Context-Aware Layout Design
V-A Color Transfer
Obvious color discontinuities may appear when the style image $S$ has colors different from those of the background $B$. Therefore, we employ color transfer. Here we use the linear method introduced in Image Analogies color transfer. This technique estimates a color affine transformation matrix and a bias vector that match the mean and standard deviation of the target color features with those of the source. In general, color transfer in a local manner is more robust than global methods. Hence, we employ a perception-based color clustering technique to divide pixels into eleven color categories. The linear color transfer is performed within corresponding categories between $S$ and $B$. More sophisticated methods or user interaction could optionally be employed to further improve the color transfer result.
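For a single channel within one color category, the per-category linear transfer reduces to matching means and standard deviations; a minimal sketch (a 1-D diagonal special case of the affine transform, with illustrative names):

```python
def match_channel(target, source):
    """Scale and shift `target` values so their mean and standard deviation
    match those of `source`; both are flat lists of channel values."""
    def stats(v):
        m = sum(v) / len(v)
        var = sum((x - m) ** 2 for x in v) / len(v)
        return m, var ** 0.5

    mt, st = stats(target)
    ms, ss = stats(source)
    scale = ss / st if st > 0 else 1.0
    # center on the target mean, rescale to the source spread, shift to the
    # source mean: x -> (x - mt) * (ss / st) + ms
    return [(x - mt) * scale + ms for x in target]
```

The full method applies an analogous affine transform per color category, which keeps the adjustment local and robust for scenes with mixed content.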
V-B Position Estimation
In order to synthesize the target shape seamlessly into the background image, the image layout should be properly determined. In the literature, similar problems have been studied in cartography to place text labels on maps; viewed as an optimization problem there, only the overlap between labels is considered. In this paper, both the seamlessness and the aesthetics of text placement are taken into account. Specifically, we formulate a cost minimization problem for context-aware position estimation by considering the cost of each pixel of $B$ in four aspects,
$$\Omega^{*} = \arg\min_{\Omega \subset B} \sum_{p \in \Omega} \Big( E_{\mathrm{var}}(p) + E_{\mathrm{ns}}(p) + E_{\mathrm{coh}}(p) + \lambda_{\mathrm{aes}} E_{\mathrm{aes}}(p) \Big), \quad (4)$$

where $\Omega$ is a rectangular area of the same size as the target shape, indicating the embedding position. The position is estimated by searching for the $\Omega$ whose pixels have the minimum total cost. $E_{\mathrm{var}}$ and $E_{\mathrm{ns}}$ are the local variance and non-local saliency costs, concerning the background image itself, and $E_{\mathrm{coh}}$ is a coherence cost measuring the coherence between $B$ and $S$. In addition, $E_{\mathrm{aes}}$ is the aesthetics cost for subjective evaluation, weighted by $\lambda_{\mathrm{aes}}$. All terms are normalized independently. We use equal weights for the first three terms, and a lower weight for the aesthetics term.
We consider the local and non-local cues of $B$. First, we seek flat regions for seamless embedding by defining the local variance cost as the intensity variance within a local patch centered at $p$. Then, a saliency cost that prefers non-salient regions is introduced. These two internal terms prevent our method from overlaying important objects in the background image with the target shape.
In addition, we use the mutual coherence cost to measure the texture consistency between $B$ and $S$. More specifically, it is obtained by calculating the distance between the patch centered at $p$ in $B$ and its best-matched patch in $S$.
So far, both the internal and mutual cues are modeled. However, a model that considers only seamlessness may place the target shape in unimportant image corners, which is not ideal aesthetically, as shown in Fig. 8(g). Hence, we also model the centrality of the shape by

$$E_{\mathrm{aes}}(p) = 1 - \exp\!\Big( -\frac{\|\Delta p\|^{2}}{2\sigma^{2}} \Big),$$

where $\Delta p$ is the offset of $p$ to the image center, and $\sigma$ is set to the length of the short side of $B$. Fig. 8 visualizes these four costs, which jointly determine the ideal image layout.
As for the minimization of (4), we use the box filter to efficiently compute the total cost of every valid $\Omega$ throughout $B$, and choose the one with the minimum cost.
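The box-filter evaluation can be realized with a summed-area table, so the total cost of every candidate box is obtained in constant time; a small self-contained sketch with illustrative names:

```python
def best_box(cost, bh, bw):
    """Find the top-left corner of the bh x bw box with minimum total cost.

    `cost` is a 2-D list of per-pixel costs. A summed-area table (integral
    image) makes each box sum an O(1) lookup, which is the fast evaluation
    of the placement objective over all valid positions.
    """
    H, W = len(cost), len(cost[0])
    # integral image with an extra zero row and column
    I = [[0.0] * (W + 1) for _ in range(H + 1)]
    for y in range(H):
        for x in range(W):
            I[y + 1][x + 1] = cost[y][x] + I[y][x + 1] + I[y + 1][x] - I[y][x]
    best, arg = float("inf"), None
    for y in range(H - bh + 1):
        for x in range(W - bw + 1):
            # box sum via four corner lookups
            s = I[y + bh][x + bw] - I[y][x + bw] - I[y + bh][x] + I[y][x]
            if s < best:
                best, arg = s, (y, x)
    return arg, best
```

The same table can be reused when enumerating box sizes for the scale extension, since only the four corner lookups change.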
We further consider the scales and rotations of the text. In addition, when the target image contains multiple shapes (which is quite common for text), we optimize the placement of each individual shape rather than treating them as a whole, which greatly enhances the design flexibility.
Text scale: Our method can be easily extended to handle different scales. During the box filtering, we enumerate the size of the box and then find the global minimum penalty point over both position and scale. Specifically, we enumerate a scaling factor $s$ over a predefined range in fixed steps. The text box is then zoomed in or out according to $s$ to obtain $\Omega_{s}$. Finally, the optimal embedding position and scale are detected by

$$(\Omega^{*}, s^{*}) = \arg\min_{\Omega, s} \frac{1}{|\Omega_{s}|} \sum_{p \in \Omega_{s}} E(p),$$

where $E(p)$ denotes the total per-pixel cost in (4), normalized by the box area $|\Omega_{s}|$ so that costs are comparable across scales.
Fig. 9 shows an example where the target image is originally too large and is automatically adjusted by the proposed method so that it could be seamlessly embedded into the background.
Text rotation: Similar to the text scale, we enumerate the rotation angle $\theta$ over a predefined range in fixed steps, and find the global minimum penalty point over the entire space of positions and angles. To use the box filter for a fast solution, instead of rotating $\Omega$, we rotate the cost map by $-\theta$, perform box filtering on the rotated map, and then detect the minimum point. Fig. 10 shows an example where the target image is automatically rotated to match the direction of the coastline.
Multiple shapes: To deal with multiple shapes, we first treat them as a whole and optimize (4) to search for an initial position, and then refine their layouts separately. In each refinement step, every shape searches for the location with the lowest cost within a small spatial neighborhood of its current position and moves there. After several steps, all the shapes converge to their respective optimal positions. To prevent the shapes from overlapping, the search space is limited so that the distance between adjacent shapes is not less than their initial distance. Fig. 11 shows that after layout refinement, the characters on the left and right sides are adjusted to a more central position in the vertical direction, making the overall text layout better match the shape of the Ferris wheel.
It is worth noting that the above three extensions can be combined with each other to provide users with more flexible layout options.
V-C Shape Embedding
Once the layout is determined, we synthesize the target shape into the background image in an image-inpainting manner. Image inpainting technologies [50, 51, 52] have long been investigated in the image processing literature to fill the unknown parts of an image. Similarly, our problem treats $\Omega$ as the unknown region of $B$, and we aim to fill it with the textures of $S$ under the structure guidance of $\hat{S}$ and $\hat{T}$. We first enlarge $\Omega$ by expanding its boundary by several pixels. Let the augmented frame-like region be denoted as $\Omega^{+}$; the pixel values of $B$ in $\Omega^{+}$ provide contextual information for the texture transfer. Throughout the coarse-to-fine texture transfer process described in Section IV-C, each voting step is followed by replacing the pixel values of $\Omega^{+}$ in the synthesized image with this contextual information. This enforces a strong boundary constraint that ensures a seamless transition at the boundary.
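The boundary constraint after each voting step amounts to copying the known background pixels back over the frame-like context region; a minimal sketch with illustrative names:

```python
def enforce_context(image, context, frame_mask):
    """After each voting step of the texture transfer, copy the known
    background pixels back into the synthesized image over the augmented
    frame region, enforcing a seamless boundary transition.

    `image` and `context` are equally sized 2-D grids of pixel values;
    `frame_mask[y][x]` is True where the pixel belongs to the frame-like
    context region around the embedding area. `image` is updated in place.
    """
    for y in range(len(image)):
        for x in range(len(image[0])):
            if frame_mask[y][x]:
                image[y][x] = context[y][x]
    return image
```

Because the frame pixels are reset before the next patch-matching pass, patches straddling the boundary are forced to agree with the real background, which is what produces the seamless transition.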
VI Experimental Results and Analysis
VI-A Comparison of Style Transfer Methods
In Fig. 12, we present a comparison of our method with six state-of-the-art supervised and unsupervised style transfer techniques on text stylization. For supervised methods, the structure guidance map extracted by our method in Section IV-A is directly given as input. Please enlarge and view these figures on the screen for better comparison.
Structural consistency. In comparison to these approaches, our method preserves the critical structural characteristics of the textures in the source style image. The other methods do not adapt the stroke contour to the source textures, and as a result they fail to guarantee structural consistency. For example, the text boundaries of most methods are rigid in the leaf group of Fig. 12. By comparison, Neural Style and CNNMRF implicitly characterize the texture shapes using deep features, while our method explicitly transfers structural features; therefore only these three approaches create leaf-like letters. Similar cases can be found in the spume and coral reef groups of Fig. 12. The structural consistency achieved by our method can be better observed in the zoomed regions in Fig. 13, where even Neural Style does not appear to transfer structure effectively.
Text legibility. For supervised style transfer approaches, the binary guidance map can only provide rough background/foreground constraints for texture synthesis. Consequently, the background of the island result by Image Analogies is filled with many salient repetitive textures, making it hard to distinguish from the foreground. Text Effects Transfer introduces an additional distribution constraint for text legibility, which, however, is not effective for pure texture images. For example, due to the scale discrepancy between $S$ and $T$, the distribution constraint forces Text Effects Transfer to place textures compactly inside the text, leading to textureless artifacts in the leaf and spume results. Our method further proposes a complementary saliency constraint, resulting in artistic text that is highlighted against a clean background. We show in Fig. 14 that when the foreground and background colors are not sufficiently contrasting, our approach demonstrates even greater superiority.
Texture naturalness. Compared with other style transfer methods, our method produces visually more natural results. For instance, in the coral reef example of Fig. 12, our method places irregular coral reefs of different densities based on the shape of the text, which highly respects the content of $S$. This is achieved by our context-aware style transfer, which ensures structure, color, texture and semantic consistency. By contrast, Image Quilting relies on patch matching between two completely different modalities, $T$ and $S$, and thus its results are just as flat as the raw text. The three deep-learning-based methods, Neural Style, CNNMRF and its supervised version Neural Doodles, transfer suitable textures onto the text; however, their main drawbacks are color deviation and checkerboard artifacts (see the coral reef example in Fig. 12).
User study. For quantitative evaluation, we conducted a user study in which twenty participants were shown the five test cases in Figs. 12-14 and asked to assign a score from 1 to 7 to each of the seven methods in each case (a higher score indicates that the stylized result is more consistent in style with the style image). Figure 15 shows that our method outperforms the others in all cases, obtaining the best average score of 6.54, significantly higher than the 4.51, 4.04, 3.55, 2.99, 3.48 and 2.92 of Image Analogies, Neural Doodles, Text Effects Transfer, Image Quilting, Neural Style and CNNMRF, respectively.
VI-B Generating Stylish Text in Different Fonts and Languages
We experimented on text in various fonts and languages to test the robustness of our method. Some results are shown in Figs. 16-17; the complete results can be found in the supplementary material. In Fig. 16, the characters vary greatly across languages. Our method successfully synthesizes dancing flames onto text in a variety of languages while maintaining local fine details, such as the small circles in Thai. In Fig. 17, the rigid outlines of the text are adjusted to the shape of a coral reef without losing the main features of the original font. Thanks to our stroke trunk protection mechanism, our approach balances the authenticity of textures with the legibility of fonts.
VI-C Visual-Textual Presentation Synthesis
We aim to synthesize professional-looking visual-textual presentations that combine beautiful images with overlaid stylish text. Fig. 18 shows three visual-textual presentations automatically generated by our method. In the example barrier reef, a LOVE-shaped barrier reef is created that is visually consistent with the background photo. We further show in the example cloud that we can integrate completely new elements into the background: clouds with a specific text shape are synthesized in the clear sky. The colors in the sky of the style image are adjusted to match those of the background, which effectively avoids abrupt image boundaries. Note that the text layout automatically determined by our method is quite reasonable. Our approach is thus capable of artistically embellishing photos with meaningful and expressive text and symbols, providing a flexible and effective tool to create original and unique visual-textual presentations. This art form can be employed in posters, magazine covers and many other media. We show in Fig. 21 a poster design example, where the stylish headline is automatically generated by our method and the main body is manually designed. A headline made of clouds effectively enhances the attractiveness of the poster.
VI-D Symbol and Icon Rendering
The proposed method can also render textures for text-based geometric shapes such as symbols and icons. Fig. 19 shows that our method successfully transfers rippling textures onto the binary Zodiac symbols. The proposed method is also capable of stylizing more general shapes, such as the emoji icons in Fig. 20. Meanwhile, we notice that our saliency term selects the prominent orange moon to be synthesized into the sun and heart, which enriches the color layering of the results.
VI-E Structure-Guided Image Inpainting
Our method naturally extends to structure-guided image inpainting. Fig. 22 demonstrates the feasibility of controlling the inpainting result via user-specified shapes. The input photo in Fig. 22(a) is used as both the style image and the background image, where the red mask directly indicates the embedding position. The user sketches in white the approximate shapes of the branches (shown in the upper-right corner of the result), and the resulting sketch serves as the target for stylization. Figs. 22(b)(c) show our inpainting results, which are quite plausible; the filled regions blend well with the background.
VI-F Running Time
When analyzing the complexity of the proposed method, we consider the time of guidance map extraction, position estimation and color/structure/texture transfer. To simplify the analysis, we assume the target image has N pixels, and that the resolutions of the style and background images are of the same order of magnitude as N. In addition, the patch size and the number of iterations are constants that can be ignored in the complexity analysis.
Guidance map extraction. According to [42, 43, 44], the complexity of RTV, superpixel extraction and saliency detection is O(N). K-means has a practical complexity of O(kNI), where k is the number of clusters and I is the number of iterations. Ignoring the constants k and I, the total complexity of guidance map extraction is O(N).
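The linear practical behavior of the k-means step can be sketched as follows. This is an illustrative NumPy version under simplifying assumptions (the function name, feature layout and parameters are ours, not the paper's actual implementation):

```python
import numpy as np

def kmeans_labels(pixels, k=2, iters=10, seed=0):
    """Plain k-means on per-pixel feature vectors.

    Each iteration costs O(N*k) distance evaluations, so the total
    cost is O(k*N*I); with k and I treated as constants this is O(N).
    """
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)].astype(float)
    labels = np.zeros(len(pixels), dtype=int)
    for _ in range(iters):
        # assignment step: squared distance from every pixel to every center
        d = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        # update step: recompute each center as the mean of its members
        for j in range(k):
            members = pixels[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers
```

A guidance map would then follow by reshaping the labels back onto the image grid.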
Position estimation. The complexity of computing the cost terms and the box filter is O(N). For the coherence cost, we use FLANN to search matched patches between the style image and the background, which has a complexity of O(N log N). Therefore, the overall complexity of the proposed position estimation is O(N log N).
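To make the role of that search concrete, here is a brute-force version of patch matching between two grayscale images; FLANN's randomized kd-trees replace the quadratic scan below with an approximate index of roughly O(N log N) total cost. The image and patch sizes here are illustrative assumptions:

```python
import numpy as np

def all_patches(img, p):
    """Collect every p-by-p patch of a grayscale image as a row vector."""
    H, W = img.shape
    return np.array([img[y:y + p, x:x + p].ravel()
                     for y in range(H - p + 1)
                     for x in range(W - p + 1)])

def match_patches_bruteforce(src_img, dst_img, p=3):
    """For each source patch, the index of its nearest destination patch.

    The full distance matrix below costs O(N^2) evaluations; an
    approximate index such as FLANN reduces this to about O(N log N).
    """
    src, dst = all_patches(src_img, p), all_patches(dst_img, p)
    d = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)
```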
Style transfer. Color transfer has O(N) complexity. During structure transfer, patches along the shape boundary are matched using FLANN; the number of such patches is at most O(N), and thus the proposed structure transfer has O(N log N) complexity. As reported by its authors, PatchMatch in texture transfer has a complexity of O(N log N).
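A minimal single-scale PatchMatch sketch (an illustration under simplifying assumptions, not the paper's tuned multi-scale implementation) makes the O(N log N) figure plausible: per iteration, each pixel tries a constant number of propagation candidates plus O(log max(H, W)) random-search radii:

```python
import numpy as np

def patch_dist(A, B, ay, ax, by, bx, p):
    """Sum of squared differences between two p-by-p patches."""
    return float(((A[ay:ay + p, ax:ax + p] - B[by:by + p, bx:bx + p]) ** 2).sum())

def patchmatch(A, B, p=3, iters=4, seed=0):
    """Single-scale PatchMatch: nearest-neighbour field from A's patches into B."""
    rng = np.random.default_rng(seed)
    Ha, Wa = A.shape[0] - p + 1, A.shape[1] - p + 1
    Hb, Wb = B.shape[0] - p + 1, B.shape[1] - p + 1
    # random initialization of the nearest-neighbour field (NNF)
    nnf = np.stack([rng.integers(0, Hb, (Ha, Wa)),
                    rng.integers(0, Wb, (Ha, Wa))], axis=-1)
    cost = np.array([[patch_dist(A, B, y, x, *nnf[y, x], p)
                      for x in range(Wa)] for y in range(Ha)])
    for it in range(iters):
        # alternate the scan order so good matches propagate in both directions
        d = 1 if it % 2 == 0 else -1
        ys = range(Ha) if d == 1 else range(Ha - 1, -1, -1)
        for y in ys:
            xs = range(Wa) if d == 1 else range(Wa - 1, -1, -1)
            for x in xs:
                # propagation: shift the matches of the two preceding neighbours
                for cy, cx in ((y - d, x), (y, x - d)):
                    if 0 <= cy < Ha and 0 <= cx < Wa:
                        by = nnf[cy, cx][0] + (y - cy)
                        bx = nnf[cy, cx][1] + (x - cx)
                        if 0 <= by < Hb and 0 <= bx < Wb:
                            c = patch_dist(A, B, y, x, by, bx, p)
                            if c < cost[y, x]:
                                nnf[y, x], cost[y, x] = (by, bx), c
                # random search: exponentially shrinking window,
                # O(log max(Hb, Wb)) candidates per pixel per iteration
                r = max(Hb, Wb)
                while r >= 1:
                    by = int(np.clip(nnf[y, x][0] + rng.integers(-r, r + 1), 0, Hb - 1))
                    bx = int(np.clip(nnf[y, x][1] + rng.integers(-r, r + 1), 0, Wb - 1))
                    c = patch_dist(A, B, y, x, by, bx, p)
                    if c < cost[y, x]:
                        nnf[y, x], cost[y, x] = (by, bx), c
                    r //= 2
    return nnf, cost
```

The sketch omits the coarse-to-fine pyramid and the parallel tiling used in practice.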
In summary, the overall computational complexity of the proposed method is O(N log N).
Table I shows the running time of our method on three test sets (Fig. 18) with an Intel Xeon E5-1607 3.00 GHz CPU. The proposed method is implemented in MATLAB. Texture transfer is the major computational bottleneck, accounting for most of the total time, because matching patches in mega-pixel (Mp) images is slow at the finer scales. As our method is not multithreaded, it uses only a single core. It could be further sped up by a well-tuned, fully parallelized implementation of the PatchMatch algorithm.
While our approach generates visually appealing results, some limitations remain. Our guidance map extraction relies on simple color features and is not fool-proof; abstracting the main structure from complex scenes may be beyond its capability. This problem may be addressed by employing high-level deep features or user interaction. Moreover, even with precise guidance maps, our method may fail when the source style contains tightly spaced objects. As shown in Fig. 23, our method yields interesting results, but they do not correctly reflect the original texture shapes. The main reason is that it is hard for our patch-based method to find pure foreground or background patches in dense patterns for shape and texture synthesis. Therefore, when synthesizing the foreground (background) region, shapes and textures from the background (foreground) region will be used, which causes mixture and disorder.
VII Conclusion and Future Work
In this paper, we present a new technique for text-based binary image stylization and synthesis that brings together binary shapes and colorful images. We exploit guidance map extraction to facilitate structure and texture transfer, and leverage cues for seamlessness and aesthetics to determine the image layout. Our context-aware text-based image stylization and synthesis approach breaks through the barrier between images and shapes, allowing users to create fine artistic shapes and to design professional-looking visual-textual presentations.
There remain some interesting issues for further investigation. One direction for future work is automatic style image selection: a style image that shares visual consistency and semantic relevance with the background image will contribute more to seamless embedding and aesthetic interest. Such recommendation could be achieved by leveraging deep neural networks to extract semantic information.
-  A. A. Efros and W. T. Freeman, “Image quilting for texture synthesis and transfer,” in Proc. ACM Conf. Computer Graphics and Interactive Techniques, 2001, pp. 341–346.
-  L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2016.
-  O. Frigo, N. Sabater, J. Delon, and P. Hellier, “Split and match: example-based adaptive patch sampling for unsupervised style transfer,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2016.
-  M. Elad and P. Milanfar, “Style transfer via texture synthesis,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2338–2351, 2017.
-  J. Liao, Y. Yao, L. Yuan, G. Hua, and S. Kang, “Visual attribute transfer through deep image analogy,” ACM Transactions on Graphics, vol. 36, no. 4, p. 120, 2017.
-  C. Li and M. Wand, “Combining markov random fields and convolutional neural networks for image synthesis,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2016.
-  D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua, “Stylebank: An explicit representation for neural image style transfer,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2017.
-  S. Yang, J. Liu, Z. Lian, and Z. Guo, “Awesome typography: Statistics-based text effects transfer,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2017.
-  A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin, “Image analogies,” in Proc. Conf. Computer Graphics and Interactive Techniques, 2001, pp. 327–340.
-  E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley, “Color transfer between images,” IEEE Computer Graphics and Applications, vol. 21, no. 5, pp. 34–41, 2001.
-  A. Hertzmann, “Algorithms for rendering in artistic styles,” Ph.D. dissertation, New York University, 2001.
-  F. Pitié, A. C. Kokaram, and R. Dahyot, “Automated colour grading using colour distribution transfer,” Computer Vision and Image Understanding, vol. 107, no. 1, pp. 123–137, 2007.
-  Y. W. Tai, J. Jia, and C. K. Tang, “Local color transfer via probabilistic segmentation by expectation-maximization,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2005, pp. 747–754.
-  ——, “Soft color segmentation and its applications.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 9, pp. 1520–1537, 2007.
-  Y. Chang, S. Saito, K. Uchikawa, and M. Nakajima, “Example-based color stylization of images,” ACM Transactions on Applied Perception, vol. 2, no. 3, pp. 322–345, 2006.
-  Y. Chang, S. Saito, and M. Nakajima, “Example-based color transformation of image and video using basic color categories,” IEEE Transactions on Image Processing, vol. 16, no. 2, pp. 329–336, 2007.
-  T. Welsh, M. Ashikhmin, and K. Mueller, “Transferring color to greyscale images,” ACM Transactions on Graphics, vol. 21, no. 3, pp. 277–280, 2002.
-  Y. Shih, S. Paris, F. Durand, and W. T. Freeman, “Data-driven hallucination of different times of day from a single outdoor photo,” ACM Transactions on Graphics, vol. 32, no. 6, pp. 2504–2507, 2013.
-  Y. Shih, S. Paris, C. Barnes, W. T. Freeman, and F. Durand, “Style transfer for headshot portraits,” ACM Transactions on Graphics, vol. 33, no. 4, pp. 1–14, 2014.
-  Z. Yan, H. Zhang, B. Wang, S. Paris, and Y. Yu, “Automatic photo adjustment using deep neural networks,” ACM Transactions on Graphics, vol. 35, no. 1, 2016.
-  A. A. Efros and T. K. Leung, “Texture synthesis by non-parametric sampling,” in Proc. IEEE Int’l Conf. Computer Vision, 1999.
-  L. Liang, C. Liu, Y. Xu, B. Guo, and H. Shum, “Real-time texture synthesis by patch-based sampling,” ACM Transactions on Graphics, vol. 20, no. 3, pp. 127–150, 2001.
-  V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick, “Graphcut textures: image and video synthesis using graph cuts,” ACM Transactions on Graphics, vol. 22, no. 3, pp. 277–286, 2003.
-  Y. Wexler, E. Shechtman, and M. Irani, “Space-time completion of video.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 463–476, March 2007.
-  L. A. Gatys, A. S. Ecker, and M. Bethge, “Texture synthesis using convolutional neural networks,” in Advances in Neural Information Processing Systems, 2015.
-  P. Bénard, F. Cole, M. Kass, I. Mordatch, J. Hegarty, M. S. Senn, K. Fleischer, D. Pesare, and K. Breeden, “Stylizing animation by example,” ACM Transactions on Graphics, vol. 32, no. 4, p. 119, 2013.
-  C. Barnes, F. L. Zhang, L. Lou, X. Wu, and S. M. Hu, “Patchtable: efficient patch queries for large datasets and applications,” ACM Transactions on Graphics, vol. 34, no. 4, p. 97, 2015.
-  S. Zhang, X. Gao, N. Wang, and J. Li, “Robust face sketch style synthesis,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 220–232, 2016.
-  ——, “Face sketch synthesis via sparse representation-based greedy search,” IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2466–2477, 2015.
-  D. Chen, J. Liao, L. Yuan, N. Yu, and G. Hua, “Coherent online video style transfer,” in Proc. Int’l Conf. Computer Vision, 2017.
-  L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman, “Controlling perceptual factors in neural style transfer,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2017.
-  J. Johnson, A. Alahi, and F. F. Li, “Perceptual losses for real-time style transfer and super-resolution,” in Proc. European Conf. Computer Vision, 2016.
-  A. J. Champandard, “Semantic style transfer and turning two-bit doodles into fine artworks,” 2016, arXiv:1603.01768.
-  X. Xu, L. Zhang, and T.-T. Wong, “Structure-based ascii art,” ACM Transactions on Graphics, vol. 29, no. 4, pp. 52:1–52:9, July 2010.
-  R. Maharik, M. Bessmeltsev, A. Sheffer, A. Shamir, and N. Carr, “Digital micrography,” ACM Transactions on Graphics, pp. 100:1–100:12, 2011.
-  C. Zou, J. Cao, W. Ranaweera, I. Alhashim, P. Tan, A. Sheffer, and H. Zhang, “Legible compact calligrams,” ACM Transactions on Graphics, vol. 35, no. 4, pp. 122:1–122:12, July 2016.
-  T. S. F. Haines, O. Mac Aodha, and G. J. Brostow, “My text in your handwriting,” ACM Transactions on Graphics, vol. 35, no. 3, pp. 26:1–26:18, May 2016.
-  Z. Lian, B. Zhao, and J. Xiao, “Automatic generation of large-scale handwriting fonts via style learning,” in SIGGRAPH ASIA 2016 Technical Briefs. ACM, 2016, pp. 12:1–12:4.
-  J. Lu, C. Barnes, R. Mech, and A. Finkelstein, “Decobrush: drawing structured decorative patterns by example,” ACM Transactions on Graphics, vol. 33, no. 4, p. 90, 2014.
-  J. Lu, F. Yu, A. Finkelstein, and S. Diverdi, “Helpinghand: example-based stroke stylization,” ACM Transactions on Graphics, vol. 31, no. 4, pp. 13–15, 2012.
-  Y. D. Lockerman, B. Sauvage, R. Allègre, J.-M. Dischler, J. Dorsey, and H. Rushmeier, “Multi-scale label-map extraction for texture synthesis,” ACM Transactions on Graphics, vol. 35, no. 4, p. 140, 2016.
-  L. Xu, Q. Yan, Y. Xia, and J. Jia, “Structure extraction from texture via relative total variation,” ACM Transactions on Graphics, vol. 31, no. 6, p. 139, 2012.
-  R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “Slic superpixels compared to state-of-the-art superpixel methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282, 2012.
-  J. Zhang and S. Sclaroff, “Saliency detection: A boolean map approach,” in Proc. Int’l Conf. Computer Vision, 2013, pp. 153–160.
-  D. Mould and K. Grant, “Stylized black and white images from photographs,” in International Symposium on Non-Photorealistic Animation and Rendering, 2008, pp. 49–58.
-  J. Xu and C. S. Kaplan, “Artistic thresholding,” in International Symposium on Non-Photorealistic Animation and Rendering, 2008, pp. 39–47.
-  A. Rosenberger, D. Cohen-Or, and D. Lischinski, “Layered shape synthesis: automatic generation of control maps for non-stationary textures,” ACM Transactions on Graphics, vol. 28, no. 5, pp. 107:1–107:9, 2009.
-  C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman, “Patchmatch: a randomized correspondence algorithm for structural image editing,” ACM Transactions on Graphics, vol. 28, no. 3, pp. 341–352, August 2009.
-  J. Christensen, J. Marks, and S. Shieber, “An empirical study of algorithms for point-feature label placement,” ACM Transactions on Graphics, vol. 14, no. 3, pp. 203–232, 1995.
-  M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, “Image inpainting,” ACM Transactions on Graphics, pp. 417–424, 2000.
-  A. Criminisi, P. Pérez, and K. Toyama, “Region filling and object removal by exemplar-based image inpainting,” IEEE Transactions on Image Processing, vol. 13, pp. 1200 – 1212, September 2004.
-  O. Le Meur, M. Ebdelli, and C. Guillemot, “Hierarchical super-resolution-based inpainting,” IEEE Transactions on Image Processing, vol. 22, pp. 3779 – 3790, October 2013.
-  C. Elkan, “Using the triangle inequality to accelerate k-means,” in Proc. Int’l Conf. Machine Learning, 2003, pp. 147–153.
-  M. Muja and D. G. Lowe, “Scalable nearest neighbor algorithms for high dimensional data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 11, pp. 2227–2240, 2014.