ClipMatrix: Text-controlled Creation of 3D Textured Meshes

by Nikolay Jetchev et al.

If a picture is worth a thousand words, a moving 3d shape must be worth a million. We build upon the success of recent generative methods that create images fitting the semantics of a text prompt, and extend them to the controlled generation of 3d objects. We present a novel algorithm for the creation of textured 3d meshes, controlled by text prompts. Our method creates aesthetically pleasing, high resolution, articulated 3d meshes, and opens new possibilities for automation and AI control of 3d assets. We call it "ClipMatrix" because it leverages CLIP text embeddings to breed new digital 3d creatures, a nod to the Latin meaning of the word "matrix" - "mother". See the online gallery for a full impression of our method's capability.


1 ClipMatrix: Background and Method

Pretrained neural networks know a lot about the visual world, and visualizing their learned representations as images is a digital artform with a passionate online following. Approaches creating 2d images as output are everywhere on the net, due to their instant appeal: colourful aesthetics, fast to train, easy to modify, suitable for social network sharing. Deepdream and related differentiable image parametrisations Mordvintsev et al. (2018) were among the first approaches to show how optimizing neural networks w.r.t. input pixels can lead to beautiful art. More recently, CLIP Radford et al. (2021) ushered in a new era for generative art: its joint embedding space relates the image and text modalities, which allows artists and ML practitioners to flexibly play with both. Telling the AI "draw me object X" and having the AI draw "X" is a powerful creativity paradigm. Many artists and researchers R. M. (2021) showed what beauty can arise by optimizing image similarity with a text embedding. The CLIP representations are so flexible that they can also guide the creation of 3d graphics. CLIP has already been used for 3d learning Jain et al. (2021) with an image reconstruction objective. However, this rigid supervision limits artistic creativity; NeRF fitting also has a huge computational cost.

In contrast, ClipMatrix is built around performant high-resolution mesh models as the 3d representation. Our method can surprise the user with novel shapes and textures. ClipMatrix is controlled by the semantic similarity to CLIP's text embeddings, a different objective with many more optima than a reconstruction-supervised loss. As the initial mesh we use a parametric rigged human body model Pavlakos et al. (2019). ClipMatrix tunes these parameters $\theta$:

  • SMPL body shape

  • joint pose of the rigged SMPL model

  • deformation per SMPL vertex

  • texture image

  • camera, light and material parameters
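The tuned parameter set above can be pictured as one flat collection of arrays, all optimized jointly by gradient descent. The following NumPy sketch is illustrative only; the names and shapes are our guesses, not the authors' exact implementation:

```python
import numpy as np

# Illustrative shapes only; the real model follows SMPL-X conventions.
params = {
    "body_shape":   np.zeros(10),               # SMPL shape coefficients
    "joint_pose":   np.zeros((24, 3)),          # axis-angle per joint
    "vertex_delta": np.zeros((10475, 3)),       # per-vertex deformation
    "texture":      np.zeros((1024, 1024, 3)),  # UV texture image
    "render_misc":  np.zeros(8),                # camera/light/material knobs
}
n_free = sum(v.size for v in params.values())   # total optimizable scalars
```

The texture image dominates the parameter count; the mesh-related parameters are comparatively few, which is one reason the mesh representation stays cheap to optimize.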

The final rendered image output is $I = R(M(\theta), c, l, m, T)$, see Fig. 2. Here $M(\theta)$ is the mesh output from SMPL (given the mesh parameters $\theta$), and $R$ is the rendered image given mesh, camera $c$, light $l$, material $m$ and texture $T$. We leverage Pytorch3d Ravi et al. (2020) as a performant differentiable 3d renderer. ClipMatrix connects images of rendered 3d views and text prompts in a fully differentiable loss function. We sample camera $c$ and pose $p$, and minimize the expected loss w.r.t. the parameters $\theta$:

$$\min_\theta \; \mathbb{E}_{c,p}\left[\, L_{CLIP}(I, t) + \lambda_{reg}\, L_{reg}(M(\theta)) \,\right]$$

By sampling random cameras $c$ we ensure our output mesh has the desired properties from any viewing angle; in contrast, optimizing a single fixed camera reduces the method to simpler 2d image generation. Similarly, we sample random poses $p$ to leverage the dynamism of the rigged 3d model, as opposed to a static sculpture. $L_{reg}$ is a standard 3d mesh regularization Mir et al. (2020) weighted by $\lambda_{reg}$, keeping deformed meshes 'well-behaved'.
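The random camera sampling can be sketched as follows; the spherical placement and the radius below are our illustrative assumptions, not the paper's exact camera distribution:

```python
import numpy as np

def sample_camera(rng, radius=2.0):
    # Place the camera uniformly at random on a sphere around the mesh.
    # Radius and uniform-on-sphere distribution are illustrative assumptions.
    azim = rng.uniform(0.0, 2.0 * np.pi)
    elev = np.arcsin(rng.uniform(-1.0, 1.0))  # uniform over the sphere surface
    return radius * np.array([
        np.cos(elev) * np.cos(azim),
        np.cos(elev) * np.sin(azim),
        np.sin(elev),
    ])

rng = np.random.default_rng(0)
cam = sample_camera(rng)  # a 3d camera position on the sphere
```

Drawing a fresh camera (and pose) per optimization step turns the loss into a Monte Carlo estimate of the expectation, which is what forces the mesh to look right from every angle.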

$L_{CLIP}(I, t)$ is the negative cosine similarity in CLIP embedding space between image $I$ and the embedding of the fixed input text prompt $t$, as used by R. M. (2021). We can flexibly sum over multiple text prompts $t_i$. In addition, we use specifically defined camera distributions to enable specific mesh-part-to-text correspondence, e.g. Fig. 3(b) samples a grid of cameras centered around the mesh head.
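The loss term can be illustrated with a minimal NumPy sketch; in the real method the embeddings come from a pretrained CLIP model, and the toy vectors below are stand-ins:

```python
import numpy as np

def neg_cosine(a, b):
    # negative cosine similarity between two embedding vectors
    return -float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_loss(image_emb, text_embs):
    # sum the similarity loss over multiple text prompts
    return sum(neg_cosine(image_emb, t) for t in text_embs)

# toy embeddings: the image aligns perfectly with the first prompt
img = np.array([1.0, 0.0])
prompts = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
loss = clip_loss(img, prompts)  # -1.0 (aligned) + 0.0 (orthogonal) = -1.0
```

Summing several prompt terms is what lets different prompts, paired with different camera distributions, steer different parts of the mesh.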

2 Summary and Outlook

We presented ClipMatrix: a novel generative art tool that allows the text-controlled creation of high resolution 3d textured shapes. The method leverages the SMPL mesh model with a CLIP loss. The framework is very flexible, and practitioners can get a range of appealing results when engineering different text prompts and camera views. Appendix I and the online gallery showcase sample creations. As a limitation, we note that optimization of discrete mesh parameters is quite sensitive to tweaks of the learning rate and regularisation strength $\lambda_{reg}$. While acceptable for curated generation, this instability currently prevents fully automated 3d asset creation. We plan to investigate other 3d parametrisations such as implicit surfaces: they can improve stability, but are costly in terms of image resolution and computational speed.

Figure 2: Schema of ClipMatrix: parameters $\theta$ (SMPL shape, vertex deform, texture, light and material) are optimized; random camera views $c$ and body poses $p$ are sampled. All these influence the renderer $R$, which creates the final 2d image views $I$. These views are embedded in CLIP space, and used together with input text prompts $t$ in a loss $L_{CLIP}$. We show this for one prompt only, but in general multiple prompts can be used to define loss sum terms.
Figure 3: Illustration of how ClipMatrix couples 3d mesh views with text control. Text prompts are shown on top of images. (a,b,c) different rendered images used for a set of text prompts, enabling enhanced control of the final results. (d) the learned UV texture is used for the renders (a,b) but not for (c), which uses plain material. W.l.o.g. we can have unique textures and cameras for each prompt - e.g. (b) zooms in on the creature head, and the prompt says "head of undead sorcerer."


  • R. M. (2021) Thoughts on deepdaze, bigsleep, and aleph2image. Cited by: §1.
  • A. Jain, M. Tancik, and P. Abbeel (2021) Putting nerf on a diet: semantically consistent few-shot view synthesis. External Links: 2104.00677 Cited by: §1.
  • A. Mir, T. Alldieck, and G. Pons-Moll (2020) Learning to transfer texture from clothing images to 3d humans. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
  • A. Mordvintsev, N. Pezzotti, L. Schubert, and C. Olah (2018) Differentiable image parameterizations. Distill. Cited by: §1.
  • G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black (2019) Expressive body capture: 3d hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, Appendix II: Technical Details.
  • A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever (2021) Learning transferable visual models from natural language supervision. External Links: 2103.00020 Cited by: §1.
  • N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W. Lo, J. Johnson, and G. Gkioxari (2020) Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501. Cited by: §1.

Ethical Implications

We see no specific risks related to the current work that exceed the risks of similar 2d generation approaches. ClipMatrix is a tool allowing playful exploration and novel creation for artists. Such art does not touch any critical issues, such as privacy and personal data. It is also a tool requiring human-machine interaction for best results (exploration, curation, quality control), so full automation is not possible yet. Full automation of 3d asset creation will ultimately be disruptive to the 3d modelling and animation industry, but we don't see this happening in the foreseeable future.

Appendix I: Additional Results

While creating art via optimization sounds straightforward, the design of the loss function and rendering priors is a long process of trial-and-error experimentation. The ClipMatrix framework can produce many different results depending on the design choices. These include the degree of penalizing mesh deviation from the base human form, how much the lighting and material are allowed to vary, how to place cameras, and how many unique CLIP prompts to use as sum terms in the loss definition. Figure 4 shows four examples (out of many more available online) of the evolution of the ClipMatrix method. Each of these presents a step in the improvement of the method, as the tweets and timestamps of the artworks indicate. We expect the method to change even more in the future, and would be very happy if users contact the authors and share ideas for technical improvements, or interesting text prompts and sample artwork.

(a) video
(b) video
(c) video
(d) video
Figure 4:

Examples of ClipMatrix 3d artwork. See tweet text above each image for a description of the unique design choices explored inside each artwork. Click the video links for an animated viewing experience: a rotating camera and body pose interpolation shows different facets of each artwork.

Appendix II: Technical Details

We use the SMPL-X model Pavlakos et al. [2019] as the underlying mesh. It has around 10000 vertices and 20000 triangle faces. We render the (textured) mesh at size 224x224 pixels, which is also the image size for CLIP embeddings. For inference and video post-processing, we typically render at size 768x768 pixels. Since this is a 3d model, output size can be flexibly adjusted depending on the context and model details. We optimize a texture of size 1024x1024 pixels, corresponding to the SMPL-X UV coordinates. With the 224x224 render size, we fit 4 random camera views per minibatch for training, on a 16GB GPU card. Given the complexity of the overall rendering pipeline, many trade-offs are possible between image quality and the memory/computation footprint.
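A rough back-of-the-envelope check of these numbers (fp32, counting only the largest explicit buffers; the autograd intermediates of the rendering pipeline, not these buffers, dominate the real 16GB budget):

```python
# rough fp32 memory estimate for the largest explicit buffers
bytes_per_float = 4
texture = 1024 * 1024 * 3 * bytes_per_float     # UV texture image
renders = 4 * 3 * 224 * 224 * bytes_per_float   # 4 RGB camera views per batch
verts   = 10_000 * 3 * bytes_per_float          # per-vertex deformations
total_mb = (texture + renders + verts) / 2**20  # roughly 14-15 MB of raw tensors
```

The explicit parameters are thus tiny relative to GPU memory, which is why the number of camera views per minibatch, not the mesh or texture size, is the practical memory knob.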