Acknowledgements
The authors would like to give special thanks to Kashif Rasul, whose PyTorch expertise greatly aided the project, and to Roland Vollgraf for the many useful generative model discussions.
Ethical Considerations
The MAGIC algorithm we presented here is a tool for digital collages. It allows artist practitioners to experiment and iterate faster when exploring various artistic choices, e.g. which source material (memory templates) to use as style, or which content image to stylize. The new tool amortizes the costs of the collage process: it is faster than either manually cutting paper or using traditional optimization-based tools. It is nevertheless inherently meant to be part of a collaboration between artist and AI machine. In that sense, we would say that our tool is ethical and does not risk radical negative disruption of the artistic landscape (which is a risk when tools completely automate and replace processes). MAGIC is rather a subtle evolution towards finer and faster intelligent control of specific steps involved in the artistic process for a specific artform: collages and mosaics. Such a tool empowers artists to explore and create more interesting artworks.
References
[1] Image quilting for texture synthesis and transfer. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH. ISBN 158113374X.
[2] A neural algorithm of artistic style. CoRR abs/1508.06576.
[3] Generative adversarial nets. In Advances in Neural Information Processing Systems 27.
[4] Image-to-image translation with conditional adversarial networks. CoRR abs/1611.07004.
[5] Spatial transformer networks. In Advances in Neural Information Processing Systems 28.
[6] GANosaic: mosaic creation with generative texture manifolds. CoRR abs/1712.00269.
[7] Texture synthesis with spatial generative adversarial networks. CoRR abs/1611.08207.
[8] Copy the old or paint anew? an adversarial framework for (non) parametric image stylization. CoRR abs/1811.09236.
[9] Jigsaw image mosaics. In Proc. of the 29th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH.
[10] Adam: a method for stochastic optimization. CoRR abs/1412.6980.
[11] Set transformer. CoRR abs/1810.00825.
[12] Combining markov random fields and convolutional neural networks for image synthesis. In CVPR, pp. 2479–2486.
[13] Visual attribute transfer through deep image analogy. ACM Trans. Graph. 36 (4), pp. 120:1–120:15.
[14] Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
[15] Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434.
[16] Stick ‘em up! a surprising history of collage. https://www.1843magazine.com/culture/lookcloser/stickemupasurprisinghistoryofcollage
Appendix I: Architecture and loss details
The structure of the generator is shown in Figure 3. The generator uses the following tensors (for simplicity, we omit the minibatch index and give the tensor sizes for a single-instance minibatch):

all spatial patch dimensions are for training.

Without loss of generality, the generator can be applied to spatially larger input tensors, since all convolutional and ST layers can handle such a size adjustment.

Let K be the number of memory template set elements.

The input is the sampled memory set.

The output of the first STUNet is a set of convex combination weights. Applying softmax to it ensures that the weights are nonnegative and sum to one over the set elements.

The collage is the convex combination of the memory set elements, with these weights. This is a form of permutation-invariant set pooling: the output does not change if we permute both the memory set and the weights in the same way along the first dimension, i.e. along the set elements.

The second UNet takes the collage as input and produces the final output, using a parametric architecture of stacked convolutional blocks to correct the collage results and make them more perceptually plausible.
The generator takes as input randomly sampled memory sets with K elements (optionally combined with content guidance patches). These are fed to a UNet (with skip connections as in [4]). However, in addition to standard convolutional blocks, we also add blocks that can process the set structure of the data using Set Transformer (ST) layers; we call this network architecture STUNet. It outputs mixing coefficients and processes the set elements permutation-equivariantly (the order of the elements does not matter) while also modelling the interactions between them. See Figure 4 for an illustration of exactly how the set operation is used. Note that the usual convolutional block is permutation-equivariant by definition, since it works on each element independently. Afterwards, by applying softmax to the mixing coefficients, we can calculate a convex combination of the memories and produce the collage image with permutation-invariant set pooling.
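The set pooling step described above can be sketched in a few lines. This is a minimal NumPy illustration rather than the authors' PyTorch code: the names `set_pool`, `memory` and `logits`, and the assumption that there is one mixing coefficient per set element and pixel, are ours.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def set_pool(memory, logits):
    """Blend K memory templates into one collage.

    memory: (K, C, H, W) sampled memory set
    logits: (K, H, W) unnormalized mixing coefficients (assumed shape)
    returns the (C, H, W) collage and the (K, H, W) convex weights
    """
    weights = softmax(logits, axis=0)            # nonnegative, sum to 1 over K
    collage = (weights[:, None] * memory).sum(axis=0)
    return collage, weights

rng = np.random.default_rng(0)
K, C, H, W = 5, 3, 8, 8
memory = rng.random((K, C, H, W))
logits = rng.normal(size=(K, H, W))
collage, weights = set_pool(memory, logits)

# Permuting the set elements (and their coefficients) leaves the collage unchanged.
perm = rng.permutation(K)
collage_perm, _ = set_pool(memory[perm], logits[perm])
```

Because each output pixel is a convex combination, its value always lies between the smallest and largest value of the templates at that pixel.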
In addition, we can optionally apply spatial warping to each image of the memory set, using a parametrization like the Spatial Transformer [5] or directly a full optical flow. For this purpose, we simply predict deformation parameters for each set element. We calculate these parameters as an output of the STUNet together with the mixing coefficients, and apply the warping deformation before the set pooling.
The discriminator uses a classical PatchGAN [12] approach: it should discriminate the sampled training patches from the generated image patches. The overall loss for training MAGIC (and finding a good generator), which combines adversarial and content guidance terms (for the content-guidance use case, see [4, 6]), is the following:
(1)  
(2)  
(3) 
Appendix II: Finer collage control by imposing additional GAN generator constraints
The outputs of the set-transformer UNet are the convex combination weights. These determine how the memory templates are blended (a form of set pooling). By constraining the generator to output weights with different statistics, we gain a way to enforce different artistic choices for the collage generation. We experimented with three additional constraints that influence the generator:

The entropy determines how sparse the convex combination weights are. Low entropy means that one memory template is fully copied in a spatial region with weight 1, and the others are left out with weight 0. Conversely, high entropy is softer and blends different templates more gently, with intermediate weights. See Figure 5 for an example of how this changes the face blending for unsupervised MAGIC.

Total variation determines whether we have bigger segments with small borders, or many small segments that vary spatially. In Figure 6 we show the effects of such regularization, using a large guided mosaic collage as an example.

It is desirable to have a collage using a varied selection of the memory template elements. To achieve this, we propose to penalize the spatial size of the largest memory template over the whole spatial region. Formally, we define this term as . This term is required especially for the unguided case, where a trivial solution would be to give one memory element full weight everywhere spatially and copy it completely. This would ideally fool the discriminator, but is a failure mode for the collage purpose. Our design of this term prevents that case easily.
By tuning the scalar weights of the regularization loss terms, we can set the contribution of each one to the total loss, in addition to the adversarial and content terms already defined in the previous section:
(4) 
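To make the three constraints concrete, here is one plausible way to compute them from a weight tensor of assumed shape (K, H, W). The function names and the exact formulas (mean per-pixel entropy, anisotropic total variation, largest spatially averaged weight) are our assumptions for illustration and may differ from the definitions used in the paper.

```python
import numpy as np

def entropy_term(weights, eps=1e-8):
    """Mean per-pixel entropy of the weights over the set dimension."""
    return float(-(weights * np.log(weights + eps)).sum(axis=0).mean())

def total_variation_term(weights):
    """Mean absolute difference between neighbouring pixels (anisotropic TV)."""
    dh = np.abs(weights[:, 1:, :] - weights[:, :-1, :]).mean()
    dw = np.abs(weights[:, :, 1:] - weights[:, :, :-1]).mean()
    return float(dh + dw)

def max_usage_term(weights):
    """Largest spatially averaged weight of any single template."""
    return float(weights.mean(axis=(1, 2)).max())

def regularizer(weights, lam_entropy, lam_tv, lam_max):
    """Weighted sum of the three terms; the lam_* scalars are hypothetical."""
    return (lam_entropy * entropy_term(weights)
            + lam_tv * total_variation_term(weights)
            + lam_max * max_usage_term(weights))

K, H, W = 4, 6, 6
uniform = np.full((K, H, W), 1.0 / K)      # soft blend of all templates
one_hot = np.zeros((K, H, W))
one_hot[0] = 1.0                           # a single template copied everywhere
striped = np.zeros((K, H, W))
striped[0, :, 0::2] = 1.0                  # two templates alternating by column
striped[1, :, 1::2] = 1.0
```

On these toy inputs, `uniform` has the highest entropy (log K) and the lowest maximum usage (1/K), `one_hot` has zero entropy and maximum usage 1 (the failure mode described above), and `striped` is the only one with nonzero total variation.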
Appendix III: Detailed comparison with Fully Adversarial Mosaics (FAMOS)
We can compare MAGIC, the method presented in this paper, with FAMOS [8], another approach that proposes a hybrid combination of non-parametric patch copying and fully adversarial end-to-end learning. The novel ST architecture of MAGIC improves image generation quality and convergence, allowing more flexible cutting of regions of interest from memory templates. Using the set structure allows MAGIC to generalize flexibly to randomly sampled sets, unlike FAMOS, where a predefined ordered tensor was required and patch coordinates were "memorized" explicitly. We compare FAMOS and MAGIC visually on the guided mosaic task, which both models support. We visualize the results of FAMOS, in both fully parametric and hybrid memory copying mode, and of MAGIC, our novel method. For training we used patches cropped from 4 Milan city satellite images as the template distribution, and an Arcimboldo portrait of size 1200x1800 pixels as the content image.

The fully parametric convolutional approach (top row) is smooth, but the city image lacks visual details and has some distortion – the training distribution of Milan city maps is not accurately learned.

The hybrid memory template copying and refining approach (middle row) shows a different mosaic result. The available memory templates were fixed for the whole training procedure. A few of the memory templates were copied as background, and the convolutional layers then added more details on top of them, mainly depicting the content image more accurately. However, this collage has quite a rough structure: the segments copied from the memory templates are too big (see the plot of the mixing coefficients at the bottom right). While such large memory segment cutting also has a certain visual charm, it reflects imprecise control over exact patch cutting and placement for the collage.

For comparison, see the respective MAGIC results (bottom row), with memory templates randomly sampled from a whole distribution. The cut and pasted segments are much smaller, and MAGIC can control much better what is copied and pasted where. In addition, despite being trained with a given set size, MAGIC can work with a different number of set elements and sample them flexibly, an advantage of its set transformer generator architecture.
While all 3 tested methods can produce beautiful mosaics with good stylization and content properties, the aesthetic quality of a mosaic is a subjective estimate of the artist or audience. We think that the fine control that MAGIC offers over the placement and cutting of memory templates makes it a worthwhile addition to an AI artist toolbox.
Appendix IV: Technical details
We implemented our code using PyTorch 1.0, and ran experiments on a Tesla V100 GPU. Each convolutional block had a convolution with a 5x5 kernel, instance normalization and a ReLU nonlinearity. As is typical for UNets [4], we use downsampling and upsampling to form an hourglass shape. Channels were 48 at the largest spatial resolution and doubled whenever the spatial resolution was halved. For the discriminator, we could use many more channels: 128 at the first layer, doubling after every layer. We also used FP16 precision to lower memory costs and fit larger set sizes; note that the complexity of the ST block is quadratic in the number of set elements. The UNets we used had skip connections. For training we used the usual cross-entropy GAN loss [15] with minibatches, which effectively means that the sampled memory templates were 5-dimensional tensors (a minibatch dimension in addition to the set, channel and two spatial dimensions). Since we use instance normalisation (and not batch normalisation), the batch size can be chosen flexibly depending on the GPU memory constraints. We trained on 3 V100 GPUs.
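The quadratic cost of the ST block comes from the attention matrix, which holds one score for every pair of set elements. The following single-head self-attention in NumPy is a simplified stand-in for a Set Transformer layer, not the layer used in MAGIC: the projection matrices `w_q`, `w_k`, `w_v` and the use of one feature vector per set element are our assumptions.

```python
import numpy as np

def set_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a set.

    x: (K, D) array, one feature vector per set element.
    Returns the (K, D_v) output and the (K, K) attention matrix.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])             # (K, K): quadratic in K
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)      # row-wise softmax
    return attn @ v, attn

rng = np.random.default_rng(0)
K, D = 6, 4
x = rng.normal(size=(K, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))
out, attn = set_self_attention(x, w_q, w_k, w_v)

# Permutation equivariance: reordering the inputs reorders the outputs identically.
perm = rng.permutation(K)
out_perm, _ = set_self_attention(x[perm], w_q, w_k, w_v)
```

Doubling K quadruples the number of entries in `attn`, which is why reduced precision helps fit larger sets in GPU memory.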
For the two experiments shown in Figures 1 and 2, we used the following settings:

Guided generation (Fig. 1): the training data distribution for the memory templates consists of randomly cropped 256x256 pixel patches from 6 Berlin city map images (each of resolution 1800x900 pixels). K=50 memory templates, UNets of depth 5, discriminator depth 6. Batch size . For inference of the large output mosaic, we can unroll the generator on any size (for posters, many megapixels are typical), and decide how to split the rendering spatially given the GPU memory constraints of the system.

Unguided generation (Fig. 2): image size 256x256 pixels, the same for training and inference. UNets of depth 5, discriminator depth 7. Batch size . The element count of the memory set is sampled randomly at each minibatch iteration.
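Sampling a memory set by random cropping, as in the guided setting above, might look like the following sketch. The helper `sample_memory_set` is hypothetical, and the stand-in images and patch size are scaled down so that the example stays light; the settings reported above are 256x256 patches from 1800x900 pixel images.

```python
import numpy as np

def sample_memory_set(images, k, patch, rng):
    """Draw k random patch x patch crops from a list of (C, H, W) arrays."""
    crops = []
    for _ in range(k):
        img = images[rng.integers(len(images))]        # pick a source image
        _, h, w = img.shape
        top = rng.integers(0, h - patch + 1)           # crop stays inside the image
        left = rng.integers(0, w - patch + 1)
        crops.append(img[:, top:top + patch, left:left + patch])
    return np.stack(crops)                             # (k, C, patch, patch)

rng = np.random.default_rng(0)
# Small random arrays standing in for the 6 source images (assumption).
images = [rng.random((3, 90, 180)) for _ in range(6)]
memory = sample_memory_set(images, k=50, patch=32, rng=rng)
```

For the unguided setting, the same helper could be called with a different `k` at every iteration, since the set size is itself sampled randomly there.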
We used the standard Adam [10] optimizer settings as in [15]. In general, training a MAGIC model is quite fast, orders of magnitude faster than training a comparable classical parametric GAN to convergence: around 15 minutes for the guided example and 1 hour for the unguided example. Such quick adaptation of MAGIC to a new dataset, after which convincing collages can be sampled, is yet another advantage for artistic exploration.
However, we note that integrating warping with blending makes the training of MAGIC more difficult. Depending on the deformation model (optical flow, or spatial transformer with various degrees of freedom), more training iterations may be required. Warping is not strictly necessary, though, when the training data patches are well aligned in the unguided case, or in the guided case.