Copy the Old or Paint Anew? An Adversarial Framework for (non-) Parametric Image Stylization

11/22/2018 ∙ by Nikolay Jetchev, et al. ∙ Zalando

Parametric generative deep models are state-of-the-art for photo and non-photo realistic image stylization. However, learning complicated image representations requires compute-intense models parametrized by a huge number of weights, which in turn requires large datasets to make learning successful. Non-parametric exemplar-based generation is a technique that works well to reproduce style from small datasets, but is also compute-intensive. These aspects are a drawback for the practice of digital AI artists: typically one wants to use a small set of stylization images, and needs a fast flexible model in order to experiment with it. With this motivation, our work has these contributions: (i) a novel stylization method called Fully Adversarial Mosaics (FAMOS) that combines the strengths of both parametric and non-parametric approaches; (ii) multiple ablations and image examples that analyze the method and show its capabilities; (iii) source code that will empower artists and machine learning researchers to use and modify FAMOS.


Appendix I: Exploration of FAMOS Possibilities

Figure 3: A content image of two famous people, stylized with an aerial view of downtown Los Angeles. Mosaic image size 1920x1080 pixels. Image created with a FAMOS model without memory templates – the regular nature of the city grid allowed nice results with the purely convolutional Unet generator.

Tweaking FAMOS: parametric vs non-parametric modules

Our tool can deliver good-looking mosaics, but it remains a tool that depends on experimentation by the artist. The right choice of texture and content images is part of the creative search for a good artwork, and the selection of the neural architecture and hyperparameters can also change the final result a lot. Here we give a small summary of all the choices the user of FAMOS can make in order to influence the final artwork (a configuration sketch follows the list):

  • choice of content and texture image sets, potentially scaling them to tweak the scale of visual details relative to the generator receptive field (see [9] for examples). We get the best results when using a few style images with consistent texture properties (repetition of small details), but large sets of diverse images may also work.

  • the most important choice: whether to keep only the parametric or only the non-parametric generative module of the network, or both, discussed in detail in the current section.

  • layer count and kernel size of the generator and discriminator networks, changing their receptive fields

  • correspondence map for the image reconstruction distance metric – see below

  • count of memory templates and choice of tiling mode (see below)

  • regularization of the blending mask α, affecting how much FAMOS relies on copying non-parametrically versus painting parametrically. See Appendix II for more on this and other regularization terms we considered.

  • stochastic noise – we also add noise to the bottleneck of the Unet; see [2] for some intuition on how this can change the generated output.
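As a compact summary, here is a minimal, hypothetical configuration sketch of these choices in Python; the field names and defaults are our own illustration and do not correspond to the options of the released code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FamosConfig:
    """Hypothetical summary of the user-facing FAMOS choices listed above."""
    content_images: List[str] = field(default_factory=list)  # content image paths
    style_images: List[str] = field(default_factory=list)    # texture/style image paths
    image_scale: float = 1.0           # rescale inputs relative to the receptive field
    use_parametric: bool = True        # keep the Unet "paint anew" module
    use_memory: bool = True            # keep the non-parametric "copy the old" module
    n_layers: int = 4                  # generator/discriminator depth
    kernel_size: int = 5               # conv kernel size, changes the receptive field
    correspondence: str = "blur_grey"  # or "downsample_grey", "learned_1x1" (see below)
    n_templates: int = 0               # count of memory templates
    tiling: str = "mirror"             # or "wrap"
    mask_reg_weight: float = 0.0       # regularization of the blending mask
    bottleneck_noise: bool = True      # add stochastic noise to the Unet bottleneck
```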

We share links to two online galleries we prepared, with many additional examples of the mosaics that can be created with the FAMOS model:

  • Gallery 1: pure convolutional generation (memory module disabled). When the style images have texture-like properties, this architecture is also an interesting tool and can create good-looking output image stylizations (texture mosaics), while being computationally faster. Please see Figure 3 for an example of the capabilities of that mode of FAMOS.

  • Gallery 2: non-parametric mosaics with FAMOS which make use of the memory module – this allows the model to flexibly copy parts of the style images when necessary.

Figure 5 shows the drastic differences in the output of the generative model when we keep only some of its modules. If we disable the memory and have N=0 templates (equivalently, if α=1 everywhere in the blending equation), we end up with a model very close to a traditional Unet generator: purely parametric generation of the output image. While in some cases this can be quite efficient as well, especially for easy-to-learn repetitive textures, it can fail for more complicated image styles such as the Santorini island one.

We can also force α=0 and keep entirely the mixed template image, a purely non-parametric behaviour. This preserves the details of Santorini well, but has visual glitches: some hard edges, and blurriness at the borders between different template mixing regions.

The full output of FAMOS, the blended and refined image, has the best image quality in our opinion, keeping the details of the non-parametric mix while correcting some of its artifacts.
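For concreteness, the blending described above can be written in a few lines of PyTorch; the tensor names, shapes, and the softmax-normalized mixing matrix are our notational assumptions, consistent with the description in this section.

```python
import torch

B, N, H, W = 1, 4, 64, 64
I_gan = torch.rand(B, 3, H, W)                     # parametric Unet output
T = torch.rand(N, 3, H, W)                         # N memory templates
A = torch.softmax(torch.rand(B, N, H, W), dim=1)   # mixing matrix over templates
alpha = torch.sigmoid(torch.rand(B, 1, H, W))      # blending mask in [0, 1]

I_mix = (A.unsqueeze(2) * T.unsqueeze(0)).sum(dim=1)  # non-parametric template mix
I_out = alpha * I_gan + (1.0 - alpha) * I_mix          # full blended FAMOS output
# alpha -> 1 recovers the purely parametric Unet generator (variant i),
# alpha -> 0 keeps the copied template mix unchanged (variant ii).
```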

However, as might be expected of an artistic digital algorithm with so many options, depending on the complexity of the stylization image distribution, any of the 3 variants can be an effective tool of artistic expression.

Figure 4: The image of the island of Santorini from commons.wikimedia.org we used as a complex stylization (an unconventional texture since its scale varies wildly). The content image of a fashion model is from www.zalando.de.
Figure 5: Different outputs from different variations of the architecture. From left to right: (i) only parametric image generation, (ii) nonparametric mix without blending, (iii) blended parametric and non-parametric as in the full FAMOS architecture. Images of size 1536x1024 pixels, best seen zoomed-in.

Tiled memory templates

The memory templates are created by moving a regular coordinate grid over the style image. We can use either mirror or wrap padding mode when interpolating with a coordinate grid c + δ, where δ is a random offset added to the regular grid c.

In mirror mode (a.k.a. reflect mode), coordinate values smoothly decrease again past the image border, and the interpolation routine copies pixels from reflected positions: a coordinate w + d maps back to position w - d (for an image of width w). Mirror mode simplifies the task of the prediction module, since the templates tile neatly into each other without hard borders. It also has a visually interesting, kaleidoscope-like aesthetic because of the axes of reflection in the appearance of the tiled templates.

In wrap mode, coordinates are hard reset at the image border: a coordinate w + d copies pixels from position d. In wrap mode, the model should learn to adapt and avoid visible borders, which would be penalized by the discriminator.
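A minimal PyTorch sketch of this tiling, under our own naming assumptions (the released implementation may differ):

```python
import torch
import torch.nn.functional as F

def tile_template(style, out_h, out_w, mode="mirror", offset=(0.0, 0.0)):
    """Interpolate one memory template of size (out_h, out_w) from a style
    image of shape (1, 3, H, W) using a regular coordinate grid plus an offset."""
    H, W = style.shape[-2:]
    ys = torch.arange(out_h, dtype=torch.float32) + offset[0]
    xs = torch.arange(out_w, dtype=torch.float32) + offset[1]
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")

    if mode == "wrap":
        # hard reset at the border: coordinate w + d copies from position d
        yy, xx = yy % H, xx % W
    else:
        # mirror/reflect: coordinate w + d maps back to w - d, no hard borders
        yy = (H - 1) - torch.abs(yy % (2 * (H - 1)) - (H - 1))
        xx = (W - 1) - torch.abs(xx % (2 * (W - 1)) - (W - 1))

    # grid_sample expects (x, y) coordinates normalized to [-1, 1]
    grid = torch.stack([xx / (W - 1) * 2 - 1, yy / (H - 1) * 2 - 1], dim=-1)
    return F.grid_sample(style, grid.unsqueeze(0), align_corners=True)

# Example: two randomly offset tilings enrich the memory tensor
style = torch.rand(1, 3, 780, 1270)
t1 = tile_template(style, 1024, 1536, mode="wrap", offset=(100.0, 300.0))
t2 = tile_template(style, 1024, 1536, mode="mirror", offset=(50.0, 200.0))
```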

Figure 6 illustrates this process, showing wrap and mirror mode, with 2 interpolations at random coordinate offsets, to illustrate how this translation enriches the memory tensor and allows our fixed-spatial-position memory module to copy varied content.

Figure 6: Using the Santorini panorama image as style (size 1270x780 pixels), tilings of size 1024x1536 pixels (the size of the given content image) are interpolated; 2 images each illustrate wrap and mirror mode.

Correspondence map

The choice of correspondence map used for reconstruction also matters, as discussed by [4], who select different layers of the pretrained VGG-19 network, or [8], where the effects of downsampling are discussed. For FAMOS, any of these settings worked well (a sketch follows the list):

  • convolve with a Gaussian filter (using reflection padding to avoid border artifacts) and convert to greyscale

  • downsample the image 4, 8 or 16 times and convert to greyscale

  • train a small reconstruction conv. network φ with 1x1 kernels and stride 1, so that the loss becomes a distance between φ applied to the generated image and the content image – effectively this converts from the color space of the textures to that of the content image.
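A sketch of these three correspondence map variants in PyTorch; the function names, kernel size, and the learned 1x1 network φ are illustrative assumptions rather than the exact settings we used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_grey(img, ksize=21, sigma=5.0):
    """Variant 1: Gaussian blur (reflection padding) followed by greyscale."""
    x = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2
    g = torch.exp(-x ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    kernel = torch.outer(g, g).view(1, 1, ksize, ksize).repeat(3, 1, 1, 1).to(img.device)
    pad = ksize // 2
    img = F.pad(img, (pad, pad, pad, pad), mode="reflect")
    return F.conv2d(img, kernel, groups=3).mean(dim=1, keepdim=True)

def downsample_grey(img, factor=8):
    """Variant 2: downsample 4, 8 or 16 times and convert to greyscale."""
    return F.avg_pool2d(img, factor).mean(dim=1, keepdim=True)

# Variant 3: a small learned 1x1 conv network phi, so the reconstruction loss
# becomes a distance between phi(generated) and the content image.
phi = nn.Conv2d(3, 3, kernel_size=1, stride=1)

def learned_recon_loss(generated, content):
    return (phi(generated) - content).abs().mean()
```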

Appendix II: Training and Implementation Details

Our code will be released on GitHub at https://github.com/zalandoresearch/famos after publication at a conference or workshop.

Network and training

We typically used N memory templates for the memory tensor; these can take a lot of memory if the spatial size is large, but they can stay in RAM – only patches from them are shifted to the more limited GPU memory. Both generator and discriminator use batch norm and kernels of size 5x5. We use ReLU and leaky ReLU nonlinearities. To avoid checkerboard artefacts we use upsampling followed by convolution, instead of transposed convolution. The training patch size was 160 – deeper Unets would require a larger patch size; the minibatch size was 8. The typical channel (width) and layer (depth) counts we used for generator and discriminator are shown in Figure 7. Standard architectures as described in [7] can also work. For training we use the DCGAN loss [14] and the ADAM [11] optimizer.
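As an illustration of the upsampling-convolution choice, one decoder step could look like the following sketch; the channel counts and the function name are placeholders, not the exact released architecture.

```python
import torch.nn as nn

def up_block(c_in, c_out):
    """One Unet decoder step: nearest-neighbour upsampling followed by a 5x5
    convolution with batch norm and ReLU, instead of a transposed convolution,
    which avoids checkerboard artefacts."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(c_in, c_out, kernel_size=5, padding=2),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )
```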

Our code is implemented in PyTorch, and we ran it on a single NVIDIA P100 GPU card. The time to get a first nice image is usually a few minutes, but several hours can be required to fully train a complicated model.

Figure 7: Channel and layer counts for our standard FAMOS architecture. We typically use a 4-layer-deep encoder and decoder (in the Unet generator) and a 4-layer-deep discriminator. The generator takes the content image and coordinates as input, and outputs the mixture array A, the blending mask α, and the image to use for blending. The total number of output channels is N+4, given N memory templates.

Note of caution: training instabilities

We also note that, in general, the FAMOS architecture can sometimes diverge or require some tuning for different content and style images. The mixing properties of the texture process of the style image are very important (see [2] for some discussion). Such texture properties determine how easy it is to sample and generate – a simple texture (e.g. rice particles) is much easier to use than a non-texture image such as the island of Santorini from Figure 4. The interplay of generation and curation on the human artist's side is an essential part of generative art, a human counterpart to the adversarial nature of the GAN game between generator and discriminator. Changing the parameters or restarting can often lead to better results. As with many GAN models, early stopping may be useful: the user of FAMOS can regularly save the output of the model and keep those images that seem most promising.

A particular failure mode can happen if the mixing matrix A collapses: a few templates are always used to generate the mix and the other templates are ignored (i.e. the entropy of A is close to 0). In that case, the parametric part of FAMOS can still paint a nice image on top of the mix serving as a background canvas, but it would be preferable if FAMOS could use the full expressive power of its memory templates. The other extreme case happens rarely: the entropy of A stays high and all templates mix to a greyish image. This may require special tuning of an entropy regularization schedule.

We acknowledge these convergence issues, but even so we think that FAMOS is an interesting novel image generation method and a fun tool to use and explore. Some regularization terms can help stabilize training of FAMOS, but finding the "right" regularization terms and their weights in the loss is left to future work. We considered these terms as part of a regularization loss for the generator (a sketch follows the list):

  • small entropy of the mixture matrix A – this forces the values of A towards 0 or 1, cleanly selecting and keeping a single memory template per location and avoiding blurriness

  • small total variation of A – spatially smooth changes of the mixing coefficients

  • small norm of the blending mask α – i.e. forcing the output to stay close to the mixed memory templates and to paint less with the parametric GAN. This forces the GAN Unet to paint details only where copying from memory does not work

  • small total variation of α – spatially smooth changes of the blending mask
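A minimal sketch of these four regularizers, assuming a softmax-normalized mixing matrix A of shape (B, N, H, W) and a blending mask alpha of shape (B, 1, H, W); the weighting of the terms is left to the user.

```python
import torch

def generator_regularizers(A, alpha, eps=1e-8):
    """Return the four regularization terms discussed above (unweighted)."""
    # 1) low entropy of A: push mixing coefficients towards 0/1
    entropy = -(A * (A + eps).log()).sum(dim=1).mean()
    # 2) total variation of A: spatially smooth template selection
    tv_A = ((A[:, :, 1:, :] - A[:, :, :-1, :]).abs().mean()
            + (A[:, :, :, 1:] - A[:, :, :, :-1]).abs().mean())
    # 3) small norm of alpha: stay close to the copied templates, paint less
    norm_alpha = alpha.abs().mean()
    # 4) total variation of alpha: spatially smooth blending mask
    tv_alpha = ((alpha[:, :, 1:, :] - alpha[:, :, :-1, :]).abs().mean()
                + (alpha[:, :, :, 1:] - alpha[:, :, :, :-1]).abs().mean())
    return entropy, tv_A, norm_alpha, tv_alpha
```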

While it is not entirely clear when these terms stabilize training, they are interesting on their own as additional controls for the artist using FAMOS. The next paragraph has more comments on the stability and performance of the method.

Practical tips: what works and does not work

In order to find a good architecture for FAMOS, we tested many architectural choices. Some work better than others.

  • If the cropping coordinates are identical for the content and template patch distributions, generalization to out-of-sample content images may deteriorate, since the network learns by heart that certain content patches go together with certain template patches. To remedy this, cropping content patches and memory templates at different random locations allows better generalization (see the sketch after this list).

  • Downsampling and copying the coordinates after every pair of conv + batch-norm layers makes FAMOS better when using this cropping mode for generalization. This effect is subtle and needs more investigation, but we assume that it makes the network more sensitive to the cropped template offset location.

  • However, if we just want to train mosaics really well for the training content image set, and do not need to generalize to additional out-of-sample content images, then we can crop the "same" coordinates from the content image and the templates, both for training and inference. This allows the network to learn the spatial relation between content and memory template much better, and gives better mosaic results. In a sense, such a mode is analogous to optimization-based stylization [4, 8], since only the result on a single image matters. However, even in that case FAMOS remains a very performant model, capable of dealing with very high resolution content images.

  • We use a single Unet to predict both the channels of A for mixing templates and the 4 channels for blending (the mask α and the image to blend) – this additional weight sharing works and is much faster than a design with 2 Unets as in Figure 8. But further experiments may find cases where more capacity (e.g. by adding residual layers) can improve image quality, as is often the case for GAN methods.

  • The WGAN-GP loss [6] gives better convergence for FAMOS than the DCGAN loss [14] when using the non-parametric memory. However, if we disable the memory and rely only on the parametric part (e.g. as in Figure 3), then DCGAN is much better. This inconsistency in which loss is best was quite surprising to us, but we did not investigate it in detail.

  • Adding some noise channels (spatially constant, with the same spatial dimensions as the content image) to the input of the Unet (concatenated to the content image) or to the bottleneck seems to stabilize the behaviour of the Unet. However, the outputs are not very sensitive to the stochastic noise inputs – techniques exist [18] to ensure that conditional GAN methods have multimodal outputs.

  • If the memory tensor is given as an input to the generator as well (adding more input channels when having many templates), there is no gain in quality, only the downside of more computation required.

  • We use randomly translated copies of the textures in the memory tensor, duplicating the images with different offsets. An alternative could be to predict a morphing to each of the texture images and avoid having random duplicates. We tried directly morphing with optical flow – there were issues with poor gradients of the loss and deformations of the style images, so we did not pursue that option.

  • Using only a supervised content loss without a GAN loss for the memory mixing module gives poor results; the GAN loss is indeed required.
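The sketch below contrasts the two cropping modes from the first and third items above; the tensor names and the patch size of 160 are assumptions in line with the training details given earlier.

```python
import random
import torch

def random_crop_coords(height, width, patch=160):
    """Pick a random top-left corner for a patch of the given size."""
    return random.randint(0, height - patch), random.randint(0, width - patch)

I_C = torch.rand(1, 3, 1024, 1536)   # content image
T = torch.rand(4, 3, 1024, 1536)     # memory templates

# Generalization mode: independent crop locations for content and templates.
yc, xc = random_crop_coords(1024, 1536)
yt, xt = random_crop_coords(1024, 1536)
content_patch = I_C[:, :, yc:yc + 160, xc:xc + 160]
template_patch = T[:, :, yt:yt + 160, xt:xt + 160]

# Single-image mosaic mode: reuse the same coordinates for both, letting the
# network memorize the spatial relation between content and memory templates.
template_patch_same = T[:, :, yc:yc + 160, xc:xc + 160]
```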

Figure 8: An alternative architecture of the generator with a chain of non-parametric and parametric modules. The discriminator of real/fake-looking texture patches is a standard patch-based one [7].

Appendix III: Outlook

As stated in the previous section, we are examining in detail various model convergence issues and testing regularization terms that can improve the stability of FAMOS.

In the future, we plan to investigate the ability of our model to generalize to new content and memory images at inference time. The capacity of our model (the channels of the Unet) with respect to the template memory size can also be examined.

We currently mix the memory templates only at a single scale using RGB pixels. Previous work [12] has used multiple scales and feature spaces other than RGB for copying – this can be added as an improvement in the FAMOS framework.

The Unet we use for predicting the mixing coefficients can be replaced with an attention-like structure [15], which can further improve the generalization of the FAMOS model, allowing the use of many additional texture templates at inference time. Our first results in that direction seem promising.