StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis

Generating images that fit a given text description using machine learning has improved greatly with the release of technologies such as the CLIP image-text encoder model; however, current methods lack artistic control over the style of the generated image. We introduce StyleCLIPDraw, which adds a style loss to the CLIPDraw text-to-drawing synthesis model to allow artistic control of the synthesized drawings in addition to control of the content via text. Whereas performing decoupled style transfer on a generated image only affects the texture, our proposed coupled approach is able to capture a style in both texture and shape, suggesting that the style of the drawing is coupled with the drawing process itself. More results and our code are available at
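The coupled-versus-decoupled distinction the abstract draws can be illustrated with a toy optimization. The sketch below is not the StyleCLIPDraw algorithm (which optimizes stroke parameters against a CLIP text-matching loss plus a style loss); it stands in simple quadratic losses for the content and style objectives, and contrasts jointly minimizing their weighted sum with applying style as a post-hoc second phase. All names and targets here are illustrative assumptions.

```python
import numpy as np

# Toy illustration (NOT the actual StyleCLIPDraw code): quadratic stand-ins
# for the CLIP text-matching (content) loss and the style loss.
content_target = np.array([1.0, 0.0])  # stands in for the text-matching objective
style_target = np.array([0.0, 1.0])    # stands in for the style objective
style_weight = 1.0

def coupled_optimize(steps=500, lr=0.1):
    """Jointly minimize content loss + style_weight * style loss."""
    x = np.zeros(2)
    for _ in range(steps):
        grad_content = 2 * (x - content_target)
        grad_style = 2 * (x - style_target)
        x -= lr * (grad_content + style_weight * grad_style)
    return x

def decoupled_optimize(steps=500, lr=0.1):
    """First fit content only, then fit style only (post-hoc style transfer)."""
    x = np.zeros(2)
    for _ in range(steps):
        x -= lr * 2 * (x - content_target)
    for _ in range(steps):
        x -= lr * 2 * (x - style_target)
    return x

coupled = coupled_optimize()      # settles between the two targets
decoupled = decoupled_optimize()  # second phase overwrites the content fit
```

In this toy setting the coupled run converges to the midpoint of the two targets, retaining both objectives, while the decoupled run collapses onto the style target alone, mirroring the abstract's claim that post-hoc style transfer loses aspects the joint drawing process preserves.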








Ethical Considerations

StyleCLIPDraw relies heavily on feedback from the CLIP [radford2021-clip] image-text encoding model. CLIP was trained on 400 million image-text pairs scraped from the internet, and this dataset has not been made publicly available. As pointed out in the original CLIPDraw paper [frans2021-clipdraw], the biases in this data will be reflected in the images generated by the model. The biases of the CLIP model have been investigated [radford2021-clip], and it is important to recognize them when utilizing StyleCLIPDraw.