Unsupervised Learning of Artistic Styles with Archetypal Style Analysis

05/28/2018 ∙ by Daan Wynen, et al. ∙ 0

In this paper, we introduce an unsupervised learning approach to automatically discover, summarize, and manipulate artistic styles from large collections of paintings. Our method is based on archetypal analysis, which is an unsupervised learning technique akin to sparse coding with a geometric interpretation. When applied to deep image representations from a collection of artworks, it learns a dictionary of archetypal styles, which can be easily visualized. After training the model, the style of a new image, which is characterized by local statistics of deep visual features, is approximated by a sparse convex combination of archetypes. This enables us to interpret which archetypal styles are present in the input image, and in which proportion. Finally, our approach allows us to manipulate the coefficients of the latent archetypal decomposition, and achieve various special effects such as style enhancement, transfer, and interpolation between multiple archetypes.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 3

page 6

page 7

page 8

page 9

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Artistic style transfer consists in manipulating the appearance of an input image such that its semantic content and its scene organization are preserved, but a human may perceive the modified image as having been painted in a similar fashion as a given target painting. Closely related to previous approaches to texture synthesis based on modeling statistics of wavelet coefficients heeger1995pyramid ; portilla2000parametric , the seminal work of Gatys et al. gatys_neural_2015 ; gatys_texture_2015

has recently shown that a deep convolutional neural network originally trained for classification tasks yields a powerful representation for style and texture modeling. Specifically, the description of “style” in 

gatys_neural_2015 consists of local statistics obtained from deep visual features, represented by the covariance matrices of feature responses computed at each network layer. Then, by using an iterative optimization procedure, the method of Gatys et al. gatys_neural_2015 outputs an image whose deep representation should be as close as possible to that of the input content image, while matching the statistics of the target painting. This approach, even though relatively simple, leads to impressive stylization effects that are now widely deployed in consumer applications.

Subsequently, style transfer was improved in many aspects. First, removing the relatively slow optimization procedure of gatys_neural_2015 was shown to be possible by instead training a convolutional neural network to perform style transfer johnson_perceptual_2016 ; ulyanov_texture_2016 . Once the model has been learned, stylization of a new image requires a single forward pass of the network, allowing real-time applications. Whereas these networks were originally trained to transfer a single style (e.g., a network trained for producing a “Van Gogh effect” was unable to produce an image resembling Monet’s paintings), recent approaches have been able to train a convolutional neural network to transfer multiple styles from a collection of paintings and to interpolate between styles chen_stylebank:_2017 ; dumoulin_learned_2016 ; huang_arbitrary_2017 .

Then, key to our work, Li et al. li_universal_2017

recently proposed a simple learning-free and optimization-free procedure to modify deep features of an input image such that their local statistics approximately match those of a target style image. Their approach is based on decoders that have been trained to invert a VGG network 

simonyan_very_2014 , allowing them to iteratively whiten and recolor feature maps of every layer, before eventually reconstructing a stylized image for any arbitrary style. Even though the approach may not perserve content details as accurately as other learning-based techniques yeh_quantitative_2018 , it nevertheless produces oustanding results given its simplicity and universality. Finally, another trend consists of extending style transfer to other modalities such as videos ruder_artistic_2017 or natural photographs luan2017deep (to transfer style from photo to photo instead of painting to photograph).

Whereas the goal of these previous approaches was to improve style transfer, we address a different objective and propose to use an unsupervised method to learn style representations from a potentially large collection of paintings. Our objective is to automatically discover, summarize, and manipulate artistic styles present in the collection. To achieve this goal, we rely on a classical unsupervised learning technique called archetypal analysis cutler_archetypal_1994 ; despite its moderate popularity, this approach is related to widely-used paradigms such as sparse coding mairal2014sparse ; field1996 or non-negative matrix factorization paatero1994

. The main advantage of archetypal analysis over these other methods is mostly its better interpretability, which is crucial to conduct applied machine learning work that requires model interpretation.

In this paper, we learn archetypal representations of style from image collections. Archetypes are simple to interpret since they are related to convex combinations of a few image style representations from the original dataset, which can thus be visualized (see, e.g., chen_fast_2014 for an application of archetypal analysis to image collections). When applied to painter-specific datasets, they may for instance capture the variety and evolution of styles adopted by a painter during his career. Moreover, archetypal analysis offers a dual interpretation view: if on the one hand, archetypes can be seen as convex combinations of image style representations from the dataset, each image’s style can also be decomposed into a convex combination of archetypes on the other hand. Then, given an image, we may automatically interpret which archetypal style is present in the image and in which proportion, which is a much richer information than what a simple clustering approach would produce. When applied to rich data collections, we sometimes observe trivial associations (e.g., the image’s style is very close to one archetype), but we also discover meaningful interesting ones (when an image’s style may be interpreted as an interpolation between several archetypes).

After establishing archetypal analysis as a natural tool for unsupervised learning of artistic style, we also show that it provides a latent parametrization allowing to manipulate style by extending the universal style transfer technique of li_universal_2017 . By changing the coefficients of the archetypal decomposition (typically of small dimension, such as 256) and applying stylization, various effects on the input image may be obtained in a flexible manner. Transfer to an archetypal style is achieved by selecting a single archetype in the decomposition; style enhancement consist of increasing the contribution of an existing archetype, making the input image more “archetypal”. More generally, exploring the latent space allows to create and use styles that were not necessarily seen in the dataset, see Figure 1.

Figure 1: Using deep archetypal style analysis, we can represent an artistic image (a) as a convex combination of archetypes. The archetypes can be visualized as synthesized textures (b), as a convex combination of artworks (c) or, when analyzing a specific image, as stylized versions of that image itself (d). Free recombination of the archetypal styles then allows for novel stylizations of the input.

To the best of our knowledge, ghiasi is the closest work to ours in terms of latent space description of style; our approach is however based on significantly different tools and our objective is different. Whereas a latent space is learned in ghiasi for style description in order to improve the generalization of a style transfer network to new unseen paintings, our goal is to build a latent space that is directly interpretable, with one dimension associated to one archetypal style.

The paper is organized as follows: Section 2 presents the archetypal style analysis model, and its application to a large collection of paintings. Section 3 shows how we use them for various style manipulations. Finally, Section 4 is devoted to additional experiments and implementation details.

2 Archetypal Style Analysis

In this section, we show how to use archetypal analysis to learn style from large collections of paintings, and subsequently perform style decomposition on arbitrary images.

Learning a latent low-dimensional representation of style.

Given an input image, denoted by , we consider a set of feature maps produced by a deep network. Following li_universal_2017 , we consider five layers of the VGG-19 network simonyan_very_2014 , which has been pre-trained for classification. Each feature map may be seen as a matrix in where  is the number of channels and is the number of pixel positions in the feature map at layer . Then, we define the style of as the collection of first-order and second order statistics of the feature maps, defined as

where represents the column in that carries the activations at position in the feature map . A style descriptor is then defined as the concatenation of all parameters from the collection , normalized by the number of parameters at each layer—that is,   and are divided by

, which was found to be empirically useful for preventing layers with more parameters to be over-represented. The resulting vector is very high-dimensional, but it contains key information for artistic style

gatys_neural_2015

. Then, we apply a singular value decomposition on the style representations from the paintings collection to reduce the dimension to

while keeping more than

of the variance. Next, we show how to obtain a lower-dimensional latent representation.

Archetypal style representation.

Given a set of vectors in , archetypal analysis cutler_archetypal_1994 learns a set of archetypes in such that each sample can be well approximated by a convex combination of archetypes—that is, there exists a code in such that , where lies in the simplex

Conversely, each archetype is constrained to be in the convex hull of the data and there exists a code in such that . The natural formulation resulting from these geometric constraints is then the following optimization problem

(1)

which can be addressed efficiently with dedicated solvers chen_fast_2014 . Note that the simplex constraints lead to non-negative sparse codes for every sample since the simplex constraint enforces the vector  to have unit -norm, which has a sparsity-inducing effect mairal2014sparse . As a result, a sample will be associated in practice to a few archetypes, as observed in our experimental section. Conversely, an archetype can be represented by a non-negative sparse code and thus be associated to a few samples corresponding to non-zero entries in .

In this paper, we use archetypal analysis on the -dimensional style vectors previously described, and typically learn between to archetypes. Each painting’s style can then be represented by a sparse low-dimensional code in , and each archetype is itself associated to a few input paintings, which is crucial for their interpretation (see the experimental section). Given a fixed set of archetypes , we may also quantify the presence of archetypal styles in a new image by solving the convex optimization problem

(2)

where is the high-dimensional input style representation described at the beginning of this section. Encoding an image style into a sparse vector allows us to obtain interesting interpretations in terms of the presence and quantification of archetypal styles in the input image. Next, we show how to manipulate the archetypal decomposition by modifying the universal feature transform of li_universal_2017 .

3 Archetypal Style Manipulation

In the following, we briefly present the universal style transfer approach of li_universal_2017 and introduce a modification that allows us to better preserve the content details of the original images, before presenting how to use the framework for archetypal style manipulation.

A new variant of universal style transfer.

We assume, in this section only, that we are given a content image and a style image . We also assume that we are given pairs of encoders/decoders such that produces the -th feature map previously selected from the VGG network and is a decoder that has been trained to approximately “invert” —that is, .

Universal style transfer builds upon a simple idea. Given a “content” feature map in , making local features match the mean and covariance structure of another “style” feature map can be achieved with simple whitening and coloring operations, leading overall to an affine transformation:

where are the mean of the content and style feature maps, respectively, is the coloring matrix and is a whitening matrix that decorrelates the features. We simply summarize this operation as a single function .

Of course, feature maps between network layers are interconnected and such coloring and whitening operations cannot be applied simultaneously at every layer. For this reason, the method produces a sequence of stylized images , one per layer, starting from the last one in a coarse-to-fine manner, and the final output is . Given a stylized image (with ), we propose the following update, which differs slightly from li_universal_2017 , for a reason we will detail below:

(3)

where in controls the amount of stylization since corresponds to the -th feature map of the original content image. The parameter in controls how much one should trust the current stylized image in terms of content information before stylization at layer . Intuitively,
 (a)  can be interpreted as a refinement of the stylized image at layer in order to take into account the mean and covariance structure of the image style at layer .
 (b)  can be seen as a stylization of the content image by looking at the correlation/mean structure of the style at layer regardless of the structure at the preceding stylization steps.

Whereas takes into account the style structure of the top layers, it may also have lost a significant amount of content information, in part due to the fact that the decoders do not invert perfectly the encoders and do not correctly recover fine details. For this reason, being able to make a trade-off between (a) and (b) to explicitly use the original content image at each layer is important.

In contrast, the update of li_universal_2017 involves a single parameter and is of the form

(4)

Notice that here the original image is used only once at the beginning of the process, and details that have been lost at layer have no chance to be recovered at layer . We present in the experimental section the effect of our variant. Whenever one is not looking for a fully stylized image—that is, in (3) and (4), content details can be much better preserved with our approach.

Archetypal style manipulation.

We now aim to analyze styles and change them in a controllable manner based on styles present in a large collection of images rather than on a single image. To this end, we use the archetypal style analysis procedure described in Section 2. Given now an image , its style, originally represented by a collection of statistics , is approximated by a convex combination of archetypes , where archetype can also be seen as the concatenation of statistics . Indeed, is associated to a sparse code in , where is the number of training images—allowing us to define for archetype and layer

where and are the mean and covariance matrices of training image at layer . As a convex combination of covariance matrices, is positive semi-definite and can be also interpreted as a valid covariance matrix, which may then be used for a coloring operation producing an “archetypal” style.

Given now a sparse code in , a new “style” can be obtained by considering the convex combination of archetypes:

Then, the collection of means and covariances may be used to define a coloring operation. Three practical cases come to mind: (i) may be a canonical vector that selects a single archetype; (ii) may be any convex combination of archetypes for archetypal style interpolation; (iii) may be a modification of an existing archetypal decomposition to enhance a style already present in an input image —that is, is a variation of defined in (2).

4 Experiments

In this section, we present our experimental results on two datasets described below. Our implementation is in PyTorch 

paszke2017automatic and relies in part on the universal style transfer implementation111https://github.com/sunshineatnoon/PytorchWCT. Archetypal analysis is performed using the SPAMS software package chen_fast_2014 ; mairal2010online , and the singular value decomposition is performed by scikit-learn pedregosa2011scikit . Our implementation will be made publicly available. Further examples can be found on the project website at http://pascal.inrialpes.fr/data2/archetypal_style.

GanGogh

is a collection of 95997 artworks222https://github.com/rkjones4/GANGogh downloaded from WikiArt.333https://www.wikiart.org The images cover most of the freely available WikiArt catalog, with the exception of artworks that are not paintings. Due to the collaborative nature of WikiArt, there is no guarantee for an unbiased selection of artworks, and the presence of various styles varies significantly. We compute 256 archetypes on this collection.

Vincent van Gogh

As a counter point to the GanGogh collection, which spans many styles over a long period of time and has a significant bias towards certain art styles, we analyze the collection of Vincent van Gogh’s artwork, also from the WikiArt catalog. Based on the WikiArt metadata, we exclude a number of works not amenable to artistic style transfer such as sketches and studies. The collection counts 1154 paintings and drawings in total, with the dates of their creation ranging from 1858 to 1926. Given the limited size of the collection, we only compute 32 archetypes.

4.1 Archetypal Visualization

To visualize the archetypes, we first synthesize one texture per archetype by using its style representation to repeatedly stylize an image filled with random noise, as described in li_universal_2017 . We then display paintings with significant contributions. In Figure 2, we present visualizations for a few archetypes. The strongest contributions usually exhibit a common characteristic like stroke style or choice of colors. Smaller contributions are often more difficult to interpret (see project website for the full set of archetypes). Figure 1(a) also highlights correlation between content and style: the archetype on the third row is only composed of portraits.

To see how the archetypes relate to each other, we also compute t-SNE embeddings maaten_visualizing_2008 and display them with two spatial dimensions. In Figure 3, we show the embeddings for the GanGogh collection, by using the texture representation for each archetype. The middle of the figure is populated by Baroque and Renaissance styles, whereas the right side exhibits abstract and cubist styles.

(a) Archetypes from GanGogh collection.
(b) Archetypes from van Gogh’s paintings.
Figure 2: Archetypes learned from the GanGogh collection and van Gogh’s paintings. Each row represents one archetype. The leftmost column shows the texture representations, the following columns the strongest contributions from individual images in order of descending contribution. Each image is labelled with its contribution to the archetype. For layout considerations, only the center crop of each image is shown. Best seen by zooming on a computer screen.
Figure 3: t-SNE embeddings of 256 archetypes computed on the GanGogh collection. Each archetype is represented by a synthesized texture. Best seen by zooming on a computer screen.

Similar to showing the decomposition of an archetype into its contributing images, we display in Figure 4 examples of decompositions of image styles into their contributing archetypes. Typically, only a few archetypes contribute strongly to the decomposition. Even though often interpretable, the decomposition is sometimes trivial, whenever the image’s style is well described by a single archetype. Some paintings’ styles also turn out to be hard to interpret, leading to non-sparse decompositions. Examples of such trivial and “failure” cases are provided on the project website.

(a) Picasso’s “Pitcher and Fruit Bowl”.
(b) “Portrait of Patience Escalier” by van Gogh.
Figure 4: Image decompositions from the GanGogh collection and van Gogh’s work. Each archetype is represented as a stylized image (top), as a texture (side) and as a decomposition into paintings.
Figure 5: Top: stylization with our approach for , varying the product from to on an equally-spaced grid. Bottom: results using li_universal_2017 , varying . At , the approaches are equivalent, resulting in equal outputs. Otherwise however, especially for , li_universal_2017 produces strong artifacts. These are not artifacts of stylization, since in this case, no actual stylization occurs. Rather, they are the effect of repeated, lossy encoding and decoding, since no decoder can recover information lost in a previous one. Best seen on a computer screen.

4.2 Archetypal Style Manipulation

First, we study the influence of the parameters and make a comparison with the baseline method of li_universal_2017

. Even though this is not the main contribution of our paper, this apparently minor modification yields significant improvements in terms of preservation of content details in stylized images. Besides, the heuristic

appears to be visually reasonable in most cases, reducing the number of effective parameters to a single one that controls the amount of stylization. The comparison between our update (3) and (4) from li_universal_2017 is illustrated in Figure 5, where the goal is to transfer an archetypal style to a Renaissance painting. More comparisons on other images and illustrations with pairs of parameters , as well as a comparison of the processing workflows, are provided on the project website, confirming our conclusions.

Then, we conduct style enhancement experiments. To obtain variations of an input image, the decomposition of its style can serve as a starting point for stylization. Figure 6 shows the results of enhancing archetypes an image already exhibits. Intuitively, this can be seen as taking one aspect of the image, and making it stronger with respect to the other ones. In Figure 6, while increasing the contributions of the individual archetypes, we also vary , so that the middle image is very close visually to the original image (), while the outer panels put a strong emphasis on the modified styles. As can be seen especially in the panels surrounding the middle, modifying the decomposition coefficients allows very gentle movements through the styles.

(a) “Les Alpilles, Mountain Landscape near South-Reme” by van Gogh, from the van Gogh collection.
(b) Self-Portrait by van Gogh, from the van Gogh collection.
(c) “Schneeschmelze” by Max Pechstein, from the GanGogh collection.
(d) “Venice” by Maurice Prendergast, from the GanGogh collection.
Figure 6: We demonstrate the enhancement of the two most prominent archetypal styles for different artworks. The middle panel shows a near-perfect reconstruction of the original content image in every case and uses parameters . Then, we increase the relative weight of the strongest component towards the left, and of the second component towards the right. Simultaneously, we increase and from in the middle panel to on the outside.

As can be seen in the leftmost and rightmost panels of Figure 6, enhancing the contribution of an archetype can produce significant changes. As a matter of fact, it is also possible, and sometimes desirable, depending on the user’s objective, to manually choose a set of archetypes that are originally unrelated to the input image, and then interpolate with convex combinations of these archetypes. The results are images akin to those found in classical artistic style transfer papers. In Figure 7, we apply for instance combinations of freely chosen archetypes to “The Bitter Drunk”. Other examples involving stylizing natural photographs are also provided on the project website.

(a) Content image
(b) Pairwise interpolations between four freely chosen archetypal styles.
Figure 7: Free archetypal style manipulation of “The Bitter Drunk” by Adriaen Brouwer.

5 Discussion

In this work, we introduced archetypal style analysis as a means to identify styles in a collection of artworks without supervision, and to use them for the manipulation of artworks and photos. Whereas other techniques may be used for that purpose, archetypal analysis admits a dual interpretation which makes it particularly appropriate for the task: On the one hand, archetypes are represented as convex combinations of input image styles and are thus directly interpretable; on the other hand, an image style is approximated by a convex combination of archetypes allowing various kinds of visualizations. Besides, archetypal coefficients may be used to perform style manipulations.

One of the major challenge we want to address in future work is the exploitation of metadata available on the WikiArt repository (period, schools, art movement…) to link the learned styles to the descriptions employed outside the context of computer vision and graphics, which we believe will make them more useful beyond style manipulation.

Acknowledgements

This work was supported by a grant from ANR (MACARON project ANR-14-CE23-0003-01), by the ERC grant number 714381 (SOLARIS project) and the ERC advanced grant Allegro.

References

  • [1] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. StyleBank: an explicit representation for neural image style transfer. In

    Proc. Conference on Computer Vision and Pattern Recognition (CVPR)

    , 2017.
  • [2] Y. Chen, J. Mairal, and Z. Harchaoui. Fast and robust archetypal analysis for representation learning. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [3] A. Cutler and L. Breiman. Archetypal analysis. Technometrics, 36(4):338–347, 1994.
  • [4] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In Proc. International Conference on Learning Representations (ICLR), 2017.
  • [5] L. A. Gatys, A. S. Ecker, and M. Bethge. A Neural Algorithm of Artistic Style. preprint arXiv:1508.06576, 2015.
  • [6] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In Adv. in Neural Information Processing Systems (NIPS), 2015.
  • [7] G. Ghiasi, H. Leeu, M. Kudlur, V. Dumoulin, and J. Shlens. Exploring the structure of a real-time, arbitrary neural artistic stylization network. In Proc. British Machine Vision Conference (BMVC), 2017.
  • [8] D. J. Heeger and J. R. Bergen. Pyramid-based texture analysis/synthesis. In Proc. 22nd annual conference on Computer graphics and interactive techniques (SIGGRAPH), 1995.
  • [9] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proc. International Conference on Computer Vision (ICCV), 2017.
  • [10] J. Johnson, A. Alahi, and L. Fei-Fei.

    Perceptual losses for real-time style transfer and super-resolution.

    In European Conference on Computer Vision (ECCV), 2016.
  • [11] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Universal style transfer via feature transforms. In Adv. Neural Information Processing Systems (NIPS), 2017.
  • [12] F. Luan, S. Paris, E. Shechtman, and K. Bala. Deep photo style transfer. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [13] J. Mairal, F. Bach, J. Ponce, et al. Sparse modeling for image and vision processing. Foundations and Trends in Computer Graphics and Vision, 8(2-3):85–283, 2014.
  • [14] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research (JMLR), 11(Jan):19–60, 2010.
  • [15] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.
  • [16] P. Paatero and U. Tapper.

    Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values.

    Environmetrics, 5(2):111–126, 1994.
  • [17] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
  • [18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research (JMLR), 12(Oct):2825–2830, 2011.
  • [19] J. Portilla and E. P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. International journal of computer vision, 40(1):49–70, 2000.
  • [20] M. Ruder, A. Dosovitskiy, and T. Brox. Artistic style transfer for videos and spherical images. International Journal on Computer Vision (IJCV), 2018.
  • [21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. preprint arXiv:1409.1556, 2015.
  • [22] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: feed-forward synthesis of textures and stylized images. In Proc. International Conference on Machine Learning (ICML), 2016.
  • [23] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9(Nov):2579–2605, 2008.
  • [24] M.-C. Yeh, S. Tang, A. Bhattad, and D. A. Forsyth. Quantitative evaluation of style transfer. preprint arXiv:1804.00118, 2018.