1 ClipMatrix: Background and Method
Pretrained neural networks know a lot about the visual world, and visualizing their learned representations as images is a digital artform with a passionate online following. Approaches that create 2d images as output are everywhere on the net, thanks to their instant appeal: colourful aesthetics, fast training, easy modification, and suitability for social-network sharing. DeepDream and related differentiable image parametrisations Mordvintsev et al. (2018) were among the first approaches to show how optimizing neural networks w.r.t. input pixels can lead to beautiful art. More recently, CLIP Radford et al. (2021) ushered in a new era for generative art: its joint embedding space relates the image and text modalities, which allows artists and ML practitioners to flexibly play with both. Telling the AI "draw me object X" and having the AI draw "X" is a powerful creativity paradigm. Many artists and researchers https://twitter.com/advadnoun (2021) showed what beauty can arise by optimizing image similarity with a text embedding. The CLIP representations are so flexible that they can also guide the creation of 3d graphics. CLIP has already been used for 3d learning Jain et al. (2021) with an image reconstruction objective. However, this rigid supervision limits artistic creativity; moreover, NeRF fitting has a huge computational cost.
In contrast, ClipMatrix is built around performant high-resolution mesh models as its 3d representation. Our method can surprise the user with novel shapes and textures. ClipMatrix is controlled by semantic similarity to CLIP’s text embeddings: a different objective with many more optima than a reconstruction-supervised loss. As the initial mesh we use a parametric rigged human body model Pavlakos et al. (2019). ClipMatrix tunes these parameters:
- SMPL body shape
- joint pose of the rigged SMPL model
- deformation per SMPL vertex
- camera, light and material parameters
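As a rough illustration of the scale of this parameter set, the tuned groups could be laid out as plain arrays. This is a hypothetical sketch, not the authors' code: the names, the joint count, and the camera/light layouts are illustrative assumptions; only the ~10k SMPL-X vertex count and the 10 shape coefficients follow public SMPL-X conventions.

```python
import numpy as np

# Illustrative parameter groups jointly optimized by a ClipMatrix-style method.
NUM_VERTICES = 10475  # SMPL-X vertex count (paper: "around 10000")
NUM_JOINTS = 55       # illustrative joint count for the rigged body model

params = {
    "betas": np.zeros(10),                          # SMPL body shape coefficients
    "joint_pose": np.zeros((NUM_JOINTS, 3)),        # axis-angle rotation per joint
    "vertex_offsets": np.zeros((NUM_VERTICES, 3)),  # free-form deformation per vertex
    "camera": np.zeros(3),                          # e.g. distance, azimuth, elevation
    "light_material": np.zeros(3),                  # e.g. ambient/diffuse/specular weights
}

# The per-vertex deformation dominates the parameter count.
total = sum(v.size for v in params.values())
print(total)
```

Note that the per-vertex offsets alone contribute over 30k free parameters, which is one reason the mesh regularization term discussed below matters.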
The final rendered image output is $I = R(M(\theta))$, see Fig. 2. Here $M(\theta)$ is the mesh output from SMPL (given the mesh params $\theta$); $R$ renders the image given mesh, camera, material, light and texture. We leverage Pytorch3d Ravi et al. (2020) as a performant differentiable 3d renderer. ClipMatrix connects images of rendered 3d views and text prompts in a fully differentiable loss function. We sample camera $c$ and pose $p$, and minimize the expected loss w.r.t. the parameters $\theta$:

$$\min_\theta \; \mathbb{E}_{c,p}\left[\, L_{CLIP}\big(R(M(\theta,p),c),\, t\big) + \lambda L_{reg}\big(M(\theta,p)\big) \,\right]$$
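This sampled-view optimization can be sketched as a Monte Carlo loop. The sketch below uses stand-in functions — sample_camera, sample_pose and loss_for_view are hypothetical placeholders for the real renderer and CLIP loss, which are not reproduced here:

```python
import random

def sample_camera():
    # Random viewpoint around the mesh (toy azimuth/elevation ranges).
    return {"azimuth": random.uniform(0, 360), "elevation": random.uniform(-30, 60)}

def sample_pose():
    # Toy random joint angles standing in for a rigged-model pose sample.
    return [random.gauss(0, 0.2) for _ in range(3)]

def loss_for_view(params, camera, pose):
    # Stand-in for L_CLIP(render(mesh(params, pose), camera), text)
    # plus the weighted mesh regularization; here a dummy quadratic loss.
    return sum(p * p for p in params)

def expected_loss(params, n_views=4):
    # Monte Carlo estimate of the expectation over cameras and poses:
    # average the loss over several freshly sampled views per step.
    views = [(sample_camera(), sample_pose()) for _ in range(n_views)]
    return sum(loss_for_view(params, c, p) for c, p in views) / n_views
```

In a real implementation each step would backpropagate this averaged loss through the differentiable renderer into the mesh, texture, camera and material parameters.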
By sampling random cameras $c$ we ensure our output mesh has the desired properties from any viewing angle. In contrast, optimizing for a single fixed camera reduces the method to simpler 2d image generation. Similarly, we sample random poses $p$ to leverage the dynamism of the rigged 3d model, as opposed to a static sculpture. $L_{reg}$ is a standard 3d mesh regularization Mir et al. (2020) weighted by $\lambda$, keeping deformed meshes ’well-behaved’.
$L_{CLIP}$ is the negative cosine similarity in CLIP embedding space between the rendered image and the embedding of the fixed input text prompt $t$, as used by https://twitter.com/advadnoun (2021). We can flexibly sum over multiple text prompts $t_i$. In addition, we use specifically defined camera distributions to enable specific mesh-part-to-text correspondence, e.g. Fig. 3b) samples a grid of cameras centered around the mesh head.
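The negative-cosine-similarity term, summed over prompts, takes only a few lines. The function below is an illustrative numpy stand-in operating on precomputed embeddings (real CLIP image/text embeddings are 512-dimensional; any dimension works here):

```python
import numpy as np

def clip_loss(image_emb, text_embs):
    """Negative cosine similarity between one image embedding and one or
    more text-prompt embeddings, summed over prompts (sketch of L_CLIP)."""
    img = image_emb / np.linalg.norm(image_emb)
    loss = 0.0
    for t in text_embs:
        t = t / np.linalg.norm(t)
        loss += -float(img @ t)  # cosine similarity of unit vectors
    return loss

# A perfectly aligned embedding contributes -1 per prompt,
# an orthogonal one contributes 0.
e = np.array([1.0, 0.0, 0.0])
print(clip_loss(e, [e]))
```

Minimizing this loss pushes the rendered views' embeddings toward the text embeddings, which is what lets a text prompt steer the 3d shape and texture.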
2 Summary and Outlook
We presented ClipMatrix: a novel generative art tool that allows text-controllable creation of high-resolution 3d textured shapes. The method combines the SMPL mesh model with a CLIP loss. The framework is very flexible, and practitioners can get a range of appealing results by engineering different text prompts and camera views. Appendix I and the online gallery showcase sample creations. As a limitation, we note that the optimization of discrete mesh parameters is quite sensitive to tweaks of the learning rate and regularisation strength $\lambda$. While acceptable for curated generation, this instability currently prevents fully automated 3d asset creation. We plan to investigate other 3d parametrisations such as implicit surfaces: they can improve stability, but are costly in terms of image resolution and computational speed.
- https://twitter.com/advadnoun (2021). Thoughts on deepdaze, bigsleep, and aleph2image. Blog post: https://rynmurdock.github.io/2021/02/26/Aleph2Image.html
- Jain et al. (2021). Putting NeRF on a diet: semantically consistent few-shot view synthesis.
- Mir et al. (2020). Learning to transfer texture from clothing images to 3d humans.
- Mordvintsev et al. (2018). Differentiable image parameterizations. Distill. https://distill.pub/2018/differentiable-parameterizations
- Pavlakos et al. (2019). Expressive body capture: 3d hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
- Radford et al. (2021). Learning transferable visual models from natural language supervision.
- Ravi et al. (2020). Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501.
We see no specific risks related to the current work that exceed the risks of similar 2d generation approaches. ClipMatrix is a tool allowing playful exploration and novel creation for artists. Such art does not touch critical issues such as privacy and personal data. It is also a tool that requires human-machine interaction for best results (exploration, curation, quality control), so full automation is not yet possible. Full automation of 3d asset creation will ultimately be disruptive to the 3d modelling and animation industry, but we do not see this happening in the foreseeable future.
Appendix I: Additional Results
While creating art via optimization sounds straightforward, designing the loss function and rendering priors is a long process of trial-and-error experimentation. The ClipMatrix framework can produce many different results depending on the design choices: how strongly to penalize mesh deviation from the base human form, how much lighting and material are allowed to vary, how to place cameras, how many unique CLIP prompts to use as sum terms in the loss definition, etc. Figure 4 shows four examples (out of many more available online) of the evolution of the ClipMatrix method. Each presents a step in the improvement of the method, as the tweets and timestamps of the artworks indicate. We expect the method to evolve further, and we would be very happy if users contact the authors to share ideas for technical improvements, interesting text prompts, or sample artwork.
Appendix II: Technical Details
We use the SMPLx model Pavlakos et al. (2019) as the underlying mesh. It has around 10000 vertices and 20000 triangle faces. We render the (textured) mesh at 224x224 pixels, which is also the input image size for CLIP embeddings. For inference and video post-processing, we typically render at 768x768 pixels. Since this is a 3d model, the output size can be flexibly adjusted depending on the context and model details. We optimize a texture of size 1024x1024 pixels, corresponding to the SMPLx UV coordinates. With the 224x224 render size, we fit 4 random camera views per minibatch for training on a 16GB GPU card. Given the complexity of the overall rendering pipeline, many trade-offs are possible between image quality and the memory/computation footprint.
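A back-of-the-envelope check of these sizes helps explain the trade-offs (float32 storage is assumed here; actual GPU memory use is dominated by the renderer's intermediate buffers and gradients, which this sketch does not model):

```python
# Rough memory footprint of the learnable texture and one training minibatch
# of rendered views, using the sizes quoted above.
TEX = 1024
texture_bytes = TEX * TEX * 3 * 4          # 1024x1024 RGB float32 texture
batch = 4
render_bytes = batch * 224 * 224 * 3 * 4   # four 224x224 RGB float32 renders

print(texture_bytes / 2**20, "MiB texture")
print(render_bytes / 2**20, "MiB rendered views")
```

The raw tensors are small (roughly 12 MiB for the texture, about 2.3 MiB for the rendered batch); it is the differentiable rasterization and backpropagation state that fills a 16GB card, which is why render size and views-per-minibatch are the main tuning knobs.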