Image Synthesis and Style Transfer

Affine transformation, layer blending, and artistic filters are popular processes that graphic designers employ to transform the pixels of an image and create a desired effect. Here, we examine various approaches for synthesizing new images: pixel-based compositing models and, in particular, models based on the distributed representations of deep neural networks. This paper focuses on synthesizing new images from a learned representation obtained from the VGG network. The approach offers an interesting creative process because information such as contour, shape, etc. is effectively captured in the distributed representation held in the hidden layers of a deep VGG network. Conceptually, if Φ is the function that transforms input pixels into the distributed representation h of the VGG layers, a new synthesized image X can be generated from its inverse function, X = Φ^-1(h). We describe the concept behind the approach and present representative synthesized images and style-transferred image examples.

1 Introduction

Computers have been extensively employed to process 2D vector graphics and raster graphics. Human designers apply various techniques such as cropping, compositing, transforming (e.g., scaling, rotating) and applying visual effects using filtering techniques. The graphic content processing of raster graphics can be further divided into two main approaches: (i) manually or algorithmically composing existing images into new graphical content; and (ii) creating new graphical content using generative models that are either algorithmically crafted (one of the pioneers in this area is Harold Cohen, aaronshome.com/aaron/index.html [1, 2, 3, 4]) or learnt from image examples, e.g., deep dream (https://github.com/google/deepdream). The first approach is the more popular contemporary technique, where graphic designers employ off-the-shelf graphics authoring tools to create new content from existing images. In this work, we are particularly interested in the second approach, where new graphical content is synthesized at the pixel level using a generative model learnt from examples (in this paper, the terms synthesis and generation are used interchangeably). This approach has received much interest in recent years.

Recent advances in deep neural networks have shed some interesting light on knowledge representation and learning issues [5, 6]. It has been found that the hidden nodes of deep convolutional neural networks (CNNs) trained with audio or visual stimuli can represent the fundamental frequency of sound or basic visual patterns [7, 8]. These ideas motivate us to look into the synthesis of an image from a distributed representation. In this paper, we explore a generative model that synthesizes new images using a distributed representation obtained from the VGG16 network [9]. More information about the VGG16 deep convolutional neural network is given in Section 4.

The rest of the paper is organized as follows: Section 2 discusses the background and some representative related works; Section 3 formalizes our approach and details the techniques behind it; Section 4 presents images synthesized from distributed representations; Section 5 provides a critical discussion of the output from the proposed approach; and finally, the conclusion and future research directions are presented in Section 6.

2 Related Works

A two-dimensional image is a projection of the three-dimensional world onto a 2D plane, with the pixel as its primitive unit. This process abstracts the world to pixel intensity and colour alone. Features such as histograms, edges, contours, corner points, object skeletons after erosion, lines, regions, Fourier coefficients, convolution filter coefficients, etc. have been extensively exploited to infer information about the original world from pixel information. Much research in computer vision has been devoted to object recognition and scene understanding from features derived from pixel information. It is interesting to explore whether the process could be reversed, synthesizing an image from the features mentioned above.

Abstract patterns are commonly generated using mathematical functions [10]. Random patterns, fractals and other abstract patterns can normally be expressed using rather short, finite-length programs. Although many insights have been observed and formulated in the graphic design domain, such as the rule of thirds and the golden ratio, there is still a big gap in our understanding of how a non-abstract image can be automatically generated by a computer program, i.e., creating an outdoor scene on a blank canvas. Due to this complexity, researchers have often chosen either to explicitly describe the generative process as a computer program [1] or to describe a sequence of image processing operations to be applied to existing images, e.g., non-photorealistic rendering (NPR) [11].

In the more popular contemporary approach, a new image is created by modifying an original image using transformations, artistic filters and various composition tactics. Existing images go through various processes in which their colours, shapes and textures are modified and then re-composited into a new image [12]. In this style, the content is consciously modified by a human designer, who injects extraneous information into the composited content. Various commercial graphics software packages have been designed to assist human designers in this approach, and humans employ both top-down and bottom-up creative processes here.

One of the early works attempting to automate the above creative process is [14], in which the authors propose the concept of image analogies. Given images A and A′ where A′ = f(A), f is a function that transforms A into A′; for example, f could be an artistic filter. The authors show that the transforming function f can be learnt from an example pair (A, A′) and then applied to a new image B to generate a new output B′. This approach provides an interesting style-transfer process.

With the recent advances in deep learning, it has been shown that the convolution filters of a convolutional neural network (CNN) exhibit important characteristics of the way our brain responds to visual patterns: simple patterns such as lines, dots and colours emerge in early layers, while complex patterns such as textures and compound structures emerge in deeper layers [7, 8].

The weights of a CNN can be seen as a function Φ that re-represents an input image X in the hidden nodes of the network. Given an input image to a trained CNN, information residing in the trained network can be transferred to the input image by enhancing the activation signals of the hidden nodes that respond strongly to it. The gradient of those hidden node activations with respect to the input image can be used to modify the input image. Iteratively repeating this process enhances, in the resulting image, the features strongly correlated with those hidden nodes, e.g., abstract geometrical patterns and complex patterns that have been learnt by the network. Google's deep dream is one of the influential works that employs this technique, generating images from a network trained with images of cats and dogs. Fed a new input image such as the sky, the network produces a kind of hallucinatory flavour by embedding parts of cats and dogs into the output image. This has sparked many subsequent works on image synthesis and style transfer [15].

3 Formalizing the Pixel-based Image Synthesis Process

3.1 Composite Models

A composite model generates a new image by modifying existing pixel information: changing pixel intensity, applying affine transformations or convolutions, or compositing pixels from various sources. We highlight some important operations in the categories below.

3.1.1 Convolutional Filters:

Let X be a matrix of pixels of an image, where X(i, j) is the pixel intensity at position (i, j). We can express the convolution operation between image X and a convolution kernel K as

  Y = X ∗ K,   (1)

where ∗ is the convolution operator and Y is the output image.
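As a minimal sketch of Eq. 1, the snippet below convolves a single-channel image array with a kernel using an off-the-shelf 2-D convolution routine; the random image and the 3×3 box-blur kernel are arbitrary stand-ins, not values used in this paper.

```python
# Minimal sketch of Eq. (1): convolving a single-channel image X with a kernel K.
import numpy as np
from scipy.signal import convolve2d

X = np.random.rand(256, 256)        # stand-in for the pixel intensities of image X
K = np.ones((3, 3)) / 9.0           # convolution kernel K (3x3 box blur)

Y = convolve2d(X, K, mode="same", boundary="symm")   # Y = X * K, same size as X
```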

3.1.2 Layers Blending:

We can express the composition of images X1 and X2 as

  Y = α X1 + (1 − α) X2,   (2)

where α ∈ [0, 1] controls the blending percentage of images X1 and X2. Various operations such as affine transformations, intensity/colour manipulation and convolution can be expressed as a sequence of these basic processes.
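A corresponding sketch of Eq. 2, assuming two equally sized RGB images scaled to [0, 1]; the blending weight is an arbitrary illustrative value.

```python
# Minimal sketch of Eq. (2): alpha-blending two images X1 and X2.
import numpy as np

X1 = np.random.rand(256, 256, 3)     # first source image
X2 = np.random.rand(256, 256, 3)     # second source image
alpha = 0.7                          # blending percentage of X1 (illustrative)

Y = alpha * X1 + (1.0 - alpha) * X2  # composited output image
```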

Figure 1: Composite models: the image in column two is generated by applying a filter to emulate the pencil sketch effect (see [12]). The image in column three is generated by compositing two images together.
Figure 2: Automated compositing models: the images in columns one and two are generated using cellular automata [16] with rule 30 and rule 90 respectively. The image in column three is generated using a compact mathematical formula while the image in column four is generated using an elaborate procedure that draws sunflower seeds, petals and their compositions (see [1]).

Figure 1 shows output from the composite approach. Columns one and three show the original images and columns two and four show the processed images. The outputs in columns two and four are generated using Eqs. 1 and 2 respectively, and closely resemble the originals. In this style, the variation is manually controlled by a human graphic designer. The process can also be automated: by weaving many small generative units together into algorithms, complex processes such as drawing sunflower seeds and petals can be carried out by a program [1]. Figure 2 shows more complex composite models generated by computer programs.

3.2 Learned Distributed Representation Models

Let X be a matrix of input pixels and G be a gradient matrix computed from the hidden node parameters. Conceptually speaking, the generative model that optimizes the activations of the hidden nodes can be expressed as:

  X_{t+1} = X_t + η G,   (3)

where η is the step size; that is, the image gradually transforms according to the added gradient. Images produced in this style are interesting, but their contents are random in nature since the generated image depends on the input pixels and the contents learnt by the network. There is no means to control the relationships among the components of the generated content.
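The sketch below illustrates the update of Eq. 3 in PyTorch, using a pretrained VGG16 from a recent torchvision as the source of hidden activations; the chosen layer index, step size and iteration count are illustrative assumptions, not settings from this paper.

```python
# Sketch of Eq. (3): repeatedly nudge the input image X in the direction of the
# gradient that increases the activations of one hidden convolution layer.
import torch
from torchvision import models

vgg = models.vgg16(weights="IMAGENET1K_V1").features.eval()   # pretrained conv stack
x = torch.rand(1, 3, 224, 224, requires_grad=True)            # input image X

for _ in range(20):                          # X_{t+1} = X_t + eta * G
    h = x
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i == 10:                          # stop at an (arbitrarily chosen) hidden layer
            break
    h.norm().backward()                      # gradient of activation magnitude w.r.t. X
    with torch.no_grad():
        x += 0.01 * x.grad / (x.grad.abs().mean() + 1e-8)     # eta * normalised gradient G
        x.grad.zero_()
```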

A generative model can be viewed as a function. If this function can be precisely determined, then, given an input image, a synthesized image can be computed and, vice versa, given a synthesized image, the original image can be computed using the inverse function. Here, the functionality of this generative function is emulated in an artificial neural network architecture using the process explained below.

Let Φ be the function that transforms input pixels X into a feature vector h derived from hidden nodes (e.g., weights or activations) distributed in different layers of the network. Conceptually, the original image can be reproduced from its distributed representation h [17]:

  X = Φ^-1(h).   (4)

This concept can be extended to generating X from a weighted combination of multiple transform functions Φ_k:

  X = Φ^-1(Σ_k w_k h_k),   (5)

where h_k is the representation produced by the k-th transform function and w_k is its weight.
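In practice Φ^-1 is not available in closed form; a common workaround, sketched below, is to optimize the pixels until Φ(X) matches a stored representation h. The function `phi` and the target `h_target` are placeholders for a network forward pass and previously recorded activations, not API provided by this paper.

```python
# Conceptual sketch of Eq. (4): recover X by minimising || Phi(X) - h ||^2 over the pixels.
import torch

def invert_representation(phi, h_target, steps=200, lr=0.05):
    x = torch.rand(1, 3, 224, 224, requires_grad=True)   # start from random pixels
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((phi(x) - h_target) ** 2).mean()          # discrepancy in representation space
        loss.backward()
        opt.step()
    return x.detach()                                     # approximation of Phi^-1(h)
```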

4 Synthesizing Images from Distributed Representations

We exploit the VGG deep neural network in our image synthesis task. VGG is the acronym of the Visual Geometry Group at the University of Oxford. The group has released two fully trained deep convolutional neural networks, VGG16 and VGG19, to the public (see www.robots.ox.ac.uk/~vgg/research/very_deep). Here, we experiment with image synthesis using parameters read from the convolution layers of the VGG16 network. Fig. 3 shows the architecture of VGG16. Each block represents a hidden layer; for example, "3×3 conv, 64" denotes 64 convolution filters of size 3×3.

Figure 3: A graphical representation of the VGG16 architecture.

Feeding an image X to the VGG network, we observe the activations h_l in all hidden convolution layers l = 1, …, L. In other words, the activations in the hidden layers re-represent the pixel information of the input image and, conceptually, a copy of X should be reproducible by reversing the process, X = Φ^-1(h). In this work, instead of analytically solving for the function Φ and its inverse Φ^-1, we adopt an optimization method that gradually adjusts the activations of the hidden nodes toward the desired values.
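A sketch of how the activations h_l can be read out, assuming torchvision's pretrained VGG16; forward hooks record the output of every convolution layer for a single (already preprocessed) input image, here replaced by a random stand-in.

```python
# Collect the distributed representation: activations of all conv layers of VGG16.
import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights="IMAGENET1K_V1").features.eval()
activations = {}                                   # layer index -> activation tensor h_l

def save_activation(idx):
    def hook(module, inputs, output):
        activations[idx] = output.detach()
    return hook

for idx, layer in enumerate(vgg):
    if isinstance(layer, nn.Conv2d):               # observe every hidden conv layer
        layer.register_forward_hook(save_activation(idx))

x = torch.rand(1, 3, 224, 224)                     # stand-in for a preprocessed image X
with torch.no_grad():
    vgg(x)                                         # one forward pass fills `activations`
```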

In [18], the authors approach the style transfer task by minimizing two kinds of loss functions: a content loss and a style loss. Let P^l, S^l and F^l be three matrices derived from layer l of the VGG network fed with the content image, the style image and the noise input respectively, where N_l denotes the number of feature maps in layer l and M_l denotes the size of each feature map. The style loss and content loss are defined as follows:

  L_style = Σ_l w_l [1/(4 N_l^2 M_l^2)] Σ_{i,j} (G^l_{ij} − A^l_{ij})^2,   (6)

  L_content = (1/2) Σ_{i,j} (F^l_{ij} − P^l_{ij})^2,   (7)

  G^l_{ij} = Σ_k F^l_{ik} F^l_{jk},   (8)

  A^l_{ij} = Σ_k S^l_{ik} S^l_{jk},   (9)

where G^l and A^l are the Gram matrices calculated from the inner products of F^l and S^l respectively, and w_l weights the contribution of layer l.
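The sketch below computes the per-layer quantities of Eqs. 6-9; the feature matrices are random stand-ins with an illustrative N_l and M_l rather than real VGG activations.

```python
# Per-layer content loss (Eq. 7), Gram matrices (Eqs. 8-9) and style term (Eq. 6).
import torch

def gram(feat):                      # feat: (N_l, M_l) -> Gram matrix of shape (N_l, N_l)
    return feat @ feat.t()

N_l, M_l = 64, 56 * 56
F = torch.rand(N_l, M_l)             # layer-l features of the generated (noise) image
P = torch.rand(N_l, M_l)             # layer-l features of the content image
S = torch.rand(N_l, M_l)             # layer-l features of the style image

content_loss = 0.5 * ((F - P) ** 2).sum()                    # Eq. (7)
G, A = gram(F), gram(S)                                      # Eqs. (8) and (9)
style_term = ((G - A) ** 2).sum() / (4 * N_l**2 * M_l**2)    # Eq. (6), single layer
```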

4.1 Synthesized Images

Figure 4 presents twelve images in two groups. The first and third rows are the original images; the second and fourth rows are the images synthesized from them. These twelve images are chosen to highlight the characteristics of images synthesized using Eq. 6. Four patterns (the two rightmost columns) are chosen because they display strong long-range dependency. All synthesized images successfully capture local dependency through the style loss. However, long-range dependency is not captured: although the synthesized patterns in the two rightmost columns seem to capture the style (local dependency) in general, the dependency among components over longer distances is lost and a lot of information is missing.

Figure 4: Synthesized images using equation 6 display good similarities at the local level but relationships among image components over a spatial distance are lost (see synthesized images in the last two columns).
Figure 5: Images of a mask (top row) and images of a child (bottom row) are synthesized using a combination of the style loss (Eq. 6) and the content loss (Eq. 7).

An image synthesized using only the loss function in Eq. 7 is an exact replica of the content image, since the generation is driven by the content loss alone. Combining the content loss and the style loss, the relationships among components are obtained via the content loss while the texture is obtained via the style loss. Figure 5 shows images synthesized using Eqs. 6 and 7, which represent the style loss and the content loss respectively. Combining weighted losses from both functions produces interesting output, since the pixel information from two different sources is blended together. The blending is not according to the spatial positions of the pixels (as in Eq. 2) but according to a deeper abstraction obtained from the hidden layers of a deep neural network. This gives a kind of control known as style transfer, where one image provides the content and the other image provides the style. The newly synthesized images successfully capture both content and style information, providing an interesting new generative approach.

5 Reflection & Discussion

In [14], image analogies learn a generative function f between a pair of images (A, A′). The learnt function captures a specific transformation which can then be applied to other images; the transformation is, however, limited to that specific learnt function f. Leveraging recent advances in deep learning, pre-trained models (e.g., deep dream, the VGG networks) can be employed in the generative process. This allows a richer transformation style, since the deep neural network acts as a transform function that re-represents the information of a given image in its hidden layers. In [15], two classes of loss functions, a content loss and a style loss, are proposed; this allows different combinations of style and content to be realized with ease.

We offer a summary of the creative process using a distributed representation as follows: let X and X′ be matrices of input pixels from white noise and from the target image respectively. Feeding X and X′ to the VGG network produces two sets of activations, h and h′, in the hidden layers. Gradually reducing the discrepancy between h and h′ should conceptually synthesize an image based on information from the image X′. Let G be a gradient matrix computed from the loss function L(h, h′); the generative model can then be expressed as an iterative update:

  X_{t+1} = X_t − η G,   (10)

where η is the step size. That is, an image X gradually transforms into a new image using information from X′. The synthesized image will share many characteristics with the original image, depending on the loss functions. The content loss is, in essence, the difference between the synthesized image (initialized from white noise) and the target image:

  L_content = (1/c) Σ_{i,j} (F_{ij} − P_{ij})^2,   (11)

where c is a constant normalizing the loss. Minimizing L_content enforces a one-to-one relationship between the nodes in the hidden layers and thus preserves the original content. The style loss, on the other hand, minimizes the difference between Gram matrices in the hidden layers. Minimizing over Gram matrices abstracts away spatial information, since the inner product correlates feature maps only as a whole and not the detail inside each feature map:

  L_style = (1/c′) Σ_{i,j} (G_{ij} − A_{ij})^2,   (12)

where c′ is a constant normalizing the loss. In [19], the authors argue that the essence of style transfer is to match the feature distributions of the style image and the generated image, and show that minimizing the Gram matrix difference is equivalent to minimizing the Maximum Mean Discrepancy (MMD) with a second-order polynomial kernel.
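To make the iterative update of Eq. 10 concrete, the sketch below optimizes a white-noise image against a combined content loss (Eq. 11) and Gram-matrix style loss (Eq. 12) using torchvision's pretrained VGG16. The layer indices, loss weights, optimizer and random stand-in images are illustrative assumptions, not the settings used to produce the figures in this paper.

```python
# Sketch of Eq. (10): iteratively update a noise image X so that its VGG16 activations
# match the content image directly and the style image via Gram matrices.
import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

CONTENT_LAYER, STYLE_LAYERS = 19, [0, 5, 10, 19, 28]   # conv layer indices (assumed)

def features(img):
    feats, h = {}, img
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i in STYLE_LAYERS or i == CONTENT_LAYER:
            feats[i] = h
    return feats

def gram(f):
    n, c, hh, ww = f.shape
    f = f.view(c, hh * ww)
    return f @ f.t() / (c * hh * ww)

content_img = torch.rand(1, 3, 224, 224)       # stand-ins for preprocessed images
style_img = torch.rand(1, 3, 224, 224)
content_feats = features(content_img)
style_grams = {i: gram(f) for i, f in features(style_img).items()}

x = torch.rand(1, 3, 224, 224, requires_grad=True)   # start from white noise
opt = torch.optim.Adam([x], lr=0.05)

for step in range(300):                               # X_{t+1} = X_t - eta * G
    opt.zero_grad()
    feats = features(x)
    c_loss = ((feats[CONTENT_LAYER] - content_feats[CONTENT_LAYER]) ** 2).mean()
    s_loss = sum(((gram(feats[i]) - style_grams[i]) ** 2).sum() for i in STYLE_LAYERS)
    loss = c_loss + 1e3 * s_loss                      # weighted combination (weights assumed)
    loss.backward()
    opt.step()
```

Shifting the relative weight of the two losses moves the output between a near replica of the content image and a pure texture, which is the behaviour discussed above.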

6 Conclusion & Future Work

Synthesizing an image using the information in the distributed VGG layers has several strengths: (i) the approach often produces visually appealing images, more appealing than those produced by filter techniques such as artistic filters; and (ii) the approach offers a flexible means to combine different content and style images. The synthesized output convincingly shows that the style loss produces an image with clear local texture but often without a clear relationship among texture components over longer spatial distances. Source images with strong local texture, such as pebbles or line drawings, produce impressive outcomes.

The issue of long-range dependency is a universal issue across domains, and researchers have approached it differently in each. For example, Long Short-Term Memory (LSTM) [20] is an enhanced recurrent neural network that has been successfully applied to speech, text and image processing. Combining content loss and style loss to synthesize a new image offers one means to deal with the long-range dependency issue in images: the approach consistently produces interesting output, since the content loss preserves the content while the style loss decorates the existing content with the style texture. In future work, we wish to further explore how to assert control over the generative process [1, 21].

Acknowledgments

We would like to thank the GSR office for their partial financial support given to this research.

References

  • [1] Phon-Amnuaisuk, S., Panjapornpon, J.: Controlling generative processes of generative art. In: Proceedings of the International Neural Network Society Winter Conference (INNS-WC 2012). Procedia Computer Science 13:43-52 (2012)
  • [2] Mohd Salleh, N.D., Phon-Amnuaisuk, S.: Quantifying aesthetic beauty through its dimensions: a case study on trochoids. International Journal of Knowledge Engineering and Soft Data Paradigms 5:51-64 (2015)

  • [3] Mandelbrot, B.: The Fractal Geometry of Nature. New York: W.H. Freeman (1983)
  • [4] Prusinkiewicz, P., Lindenmayer, A.: The Algorithmic Beauty of Plants. Springer (1990)
  • [5] Schmidhuber, J.: Deep learning in neural networks: An overview. Neural Networks, 61:85-117. (2015)
  • [6] LeCun, Y., Bengio, Y., and Hinton, G.: Deep learning. Nature, 521:436–444. (2015)
  • [7] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Proceedings of the European Conference on Computer Vision (ECCV 2014) pp. 818-833 (2014)
  • [8] Zhou, B., Bau, D., Oliva, A., Torralba, A.: Interpreting deep visual representations via network dissection. arXiv:1711.05611 (2017)
  • [9] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Proceedings of the International Conference on Learning Representations (ICLR 2015) arXiv:1409.1556 (2015)
  • [10] Kalajdzievski, S.: Math and Art: An Introduction to Visual Mathematics. CRC Press (2008)
  • [11] Strothotte, T., and Schlechtweg, S.: Non-Photorealistic Computer Graphics: Modeling, Rendering and Animation. Morgan Kaufmann Publishers, Elsevier Science, USA. (2002)
  • [12] Ahmad, A., Phon-Amnuaisuk, S.: Emulating pencil sketches from 2D images. In: Proceedings of the International Conference on Soft Computing and Data Mining (SCDM 2014) pp. 613-622 (2014)
  • [13] He, K., Wang, Y., Hopcroft, J.: A powerful generative model using random weights for the deep image representation. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain. pp. xxx-xxx (2016)
  • [14] Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analogies. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 2001). ACM Press / ACM SIGGRAPH, pp. 327-340 (2001)
  • [15] Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), pp. 2414-2423 (2016)

  • [16] Wolfram, S.: Cellular automata as models of complexity. Nature 311:419-424 (1984)
  • [17] Mahendran, A., Vedaldi, A.: Understanding deep image representations by inverting them. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015) pp. 5188-5196 (2015)
  • [18] Gatys, L.A., Ecker, A.S., Bethge, M.: Texture synthesis using convolutional neural networks. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS 2015) pp. 262-270 (2015)
  • [19] Li, Y., Wang, N., Liu, J., Hou, X.: Demystifying neural style transfer. arXiv:1701.01036v2 (2017)
  • [20] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8):1735-1780 (1997)
  • [21] Champandard, A.J.: Semantic style transfer and turning two-bit doodles into fine artwork. nuci.ai Conference 2016, Artificial Intelligence in Creative Industries. arXiv:1603.01768v1 (2016)