StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis

Generating images that fit a given text description using machine learning has improved greatly with the release of technologies such as the CLIP image-text encoder model; however, current methods offer little artistic control over the style of the generated image. We introduce StyleCLIPDraw, which adds a style loss to the CLIPDraw text-to-drawing synthesis model so that the style of the synthesized drawing can be controlled in addition to its content, which is specified via text. Whereas performing decoupled style transfer on a generated image affects only the texture, our proposed coupled approach captures a style in both texture and shape, suggesting that the style of a drawing is coupled with the drawing process itself. More results and our code are available at
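To make the notion of a coupled objective concrete, the following is a minimal sketch and not the released implementation. It assumes the content term is a negative cosine similarity between an image embedding and a text embedding, and it uses a Gram-matrix distance as a stand-in style term. The function names, the `style_weight` parameter, and the choice of style loss are all illustrative assumptions; the arrays stand in for encoder outputs that a real system would compute from the rendered drawing.

```python
import numpy as np


def cosine_content_loss(image_emb, text_emb):
    """Negative cosine similarity between two embedding vectors (assumed content term)."""
    a = image_emb / np.linalg.norm(image_emb)
    b = text_emb / np.linalg.norm(text_emb)
    return -float(a @ b)


def gram_matrix(features):
    """Channel-by-channel correlations of a (channels, positions) feature map."""
    return features @ features.T / features.shape[1]


def gram_style_loss(drawing_feats, style_feats):
    """Mean squared difference between Gram matrices (a stand-in style term)."""
    diff = gram_matrix(drawing_feats) - gram_matrix(style_feats)
    return float(np.mean(diff ** 2))


def coupled_loss(image_emb, text_emb, drawing_feats, style_feats, style_weight=1.0):
    """Single objective: content term plus a weighted style term.

    Because both terms are evaluated on the same drawing, one optimizer
    step on the drawing parameters would be driven by content and style
    together, rather than by style applied after the drawing is finished.
    """
    content = cosine_content_loss(image_emb, text_emb)
    style = gram_style_loss(drawing_feats, style_feats)
    return content + style_weight * style
```

In a decoupled pipeline, by contrast, the content term alone would shape the drawing and the style term would only be applied afterwards to the rendered pixels, which is consistent with the observation that such a pipeline changes texture but not shape.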



Ethical Considerations

StyleCLIPDraw relies heavily on feedback from the CLIP[radford2021-clip] image-text encoding model. CLIP was trained on 400 million image-text pairs scraped from the internet, and this dataset has not been made publicly available. As pointed out in the original CLIPDraw paper[frans2021-clipdraw], the biases in this data will be reflected in the images the model generates. The biases of the CLIP model have been investigated[radford2021-clip], and it is important to recognize their presence when using StyleCLIPDraw.