
Formal Analysis of Art: Proxy Learning of Visual Concepts from Style Through Language Models

by   Diana Kim, et al.

We present a machine learning system that can quantify fine art paintings with a set of visual elements and principles of art. This formal analysis is fundamental for understanding art, but developing such a system is challenging: paintings have high visual complexity, and it is difficult to collect enough training data with direct labels. To resolve these practical limitations, we introduce a novel mechanism, called proxy learning, which learns visual concepts in paintings through their general relation to styles. This framework does not require any visual annotation; it uses only style labels and a general relationship between visual concepts and styles. In this paper, we propose a novel proxy model and reformulate four pre-existing methods in the context of proxy learning. Through quantitative and qualitative comparison, we evaluate these methods and compare their effectiveness in quantifying the artistic visual concepts, where the general relationship is estimated by language models: GloVe or BERT. Language modeling is a practical and scalable solution requiring no labeling, but it is inevitably imperfect. We demonstrate how the new proxy model is robust to this imperfection, while the other models are sensitively affected by it.





1 Introduction

Artists and art historians usually use elements of art, such as line, texture, color, and shape [fichner2011foundations], and principles of art, such as balance, variety, symmetry, and proportion [ocvirk2002art] to visually describe artworks. These elements and principles provide structured grounds for effectively communicating about art, especially the first principle of art, which is “visual form” [van1887principles].

However, in the area of AI, understanding art has mainly focused on a limited version of the first principle, through developing systems such as predicting styles [elgammal2018shape, diana2018artprinciple], finding non-semantic features for style [mao2017deepart], or designing digital filters to extract some visual properties like brush strokes, color, textures, and so on [berezhnoy2005computerized, johnson2008image]. While they are useful, the concepts do not reveal much about the visual properties of paintings in depth. Kim et al. [diana2018artprinciple] suggested a list of 58 concepts that break down the elements and principles of art. We focus on developing an AI system that can quantify such concepts. These concepts are referred to as “visual elements” in this paper and presented in Table 1.

Elements and Principles of Art: Concepts
Subject: representational, non-representational
blurred, broken, controlled, curved, diagonal, horizontal, vertical, meandering, thick, thin, active, energetic, straight
bumpy, flat, smooth, gestural, rough
calm, cool, chromatic, monochromatic, muted, warm, transparent
ambiguous, geometric, amorphous, biomorphic, closed, open, distorted, heavy, linear, organic, abstract, decorative, kinetic, light
Light and Space: bright, dark, atmospheric, planar, perspective
General Principles of Art: overlapping, balance, contrast, harmony, pattern, repetition, rhythm, unity, variety, symmetry, proportion, parallel
Table 1: A list of 58 concepts describing elements and principles of art. We propose an AI system that can quantify such concepts. These concepts are referred to as "visual elements" in this paper.

The main challenge in learning the visual elements and principles of art is that it is not easy to deploy any supervised methodology. In general, it is difficult to collect enough annotations with multiple attributes. When it comes to art, the lack of visual element annotation becomes an even more significant issue. Art is typically annotated only with artist information (name, dates, bio), style, and genre attributes, while annotating elements of art requires high expertise in identifying the visual properties of artworks. Perhaps this sparsity of art data is one reason why art has been analyzed computationally in such a limited way.

To resolve the sparsity issue, this paper proposes to learn the visual elements of art through their general relations to styles (period styles). While it is difficult to obtain labels for the visual concepts, there are plenty of available paintings labeled by style, as well as language resources relating styles to visual concepts, such as online encyclopedias and museum websites. In general, knowing the dominant visual features of a painting enables us to identify its plausible styles. So we ask the following questions: (1) What if we take multiple styles as proxy components to encode the visual information of paintings? (2) Can a deep Convolutional Neural Network (deep-CNN) help to retrace visual semantics from the proxy representation of multiple styles?

In previous studies [elgammal2018shape, diana2018artprinciple], the existence of conceptual ties between visual elements and styles is demonstrated by using the hierarchical structure of a deep-CNN. They showed that the machine can learn underlying semantic factors of styles in its hidden layers. Inspired by these studies, we hypothetically set a linear relation between visual elements and styles. Next, we constrain a deep-CNN by the linear relation to make the machine learn visual concepts in its last hidden layer, while it is trained only as a style classifier.

To explain the methodology, a new concept, proxy learning, is defined first. It refers to all possible learning methods aiming to quantify paintings with a finite set of visual elements, which has no available labels, by correlating it to another concept that has abundant labeled data. In this paper, we reformulate four pre-existing methods in the context of proxy learning and introduce a novel approach that utilizes a deep-CNN to learn visual concepts from style labels and language models. We propose to name it deep-proxy. The output of deep-proxy quantifies the relatedness of an input painting to each of the visual elements. In Table 2, the most relevant and irrelevant visual elements are listed for example paintings. The results are computed by a deep-proxy model trained using only the style labels and the language model BERT.

In the experiment, deep-proxy and four methods in attribute learning—sparse coding [efron2004least], logistic regression (LGT), the Principal Component Analysis (PCA) method [diana2018artprinciple], and an Embarrassingly Simple approach to Zero-Shot Learning (ESZSL) [romera2015embarrassingly]—are quantitatively compared with each other. We analyze their effectiveness depending on two practical solutions for estimating a general relationship: (1) language models, GloVe [pennington2014glove] and BERT [devlin2018bert, vaswani2017attention], and (2) sample means of a few ground truth values. Language modeling is a practical and scalable solution requiring no labeling, but it is inevitably imperfect. We demonstrate how deep-proxy's cooperative structure learning with styles creates strong resilience to the imperfection of the language models, while PCA and ESZSL are significantly affected by it. On the other hand, when the general relation is estimated from some ground truth samples, PCA performs best in various experiments. We summarize our contributions as follows.

  1. Formulating the proxy learning methodology and applying it to learn visual artistic concepts.

  2. A novel and end-to-end framework to learn multiple visual elements from fine art paintings without any direct annotation.

  3. A new word embedding trained with BERT [devlin2018bert, vaswani2017attention] on a large art corpus. This is the first BERT model for art, trained on art-related texts.

  4. A ground truth set of 58 visual semantics for 120 fine art paintings completed by seven art historians.

[Table 2: example paintings (images omitted) with their top-five relevant (R) and bottom-five irrelevant (IR) visual elements.]
Table 2: The Relevant (R) and Irrelevant (IR) Visual Elements by Deep-Proxy: Based on the output of deep-proxy, the top and bottom five ranked visual elements are listed for each example painting. In this result, deep-proxy is trained using the style labels and the general relationship estimated by the language model BERT. The most relevant and irrelevant words are in bold. The title, artist, year, and style of these paintings are shown in Supplementary Information (SI) A.

2 Related Work

2.1 Attribute Classification

For learning semantic attributes, mainstream literature has been based on simple binary classification and fully [farhadi2009describing, lampert2013attribute] or weakly supervised methods [ferrari2007learning, shankar2015deep]. Support Vector Machines [farhadi2009describing, lampert2013attribute, patterson2014sun] and logistic regression [danaci2016low, farhadi2009describing] are used to recognize the presence or absence of targeted semantic attributes.

2.2 Descriptions by Visual Semantics

This paper's method is not designed as a classification problem; rather, it generates real-valued vectors. Each dimension of each vector is aligned with a certain visual concept, so the vectors naturally indicate which paintings are more or less relevant to the concept. In the most similar format, Parikh et al. [parikh2011relative, ma2012unsupervised] propose to predict the relative strength of the presence of attributes through real-valued ranks.

Recently, the practical merits of attribute learning have been emphasized, such as zero-shot learning [xian2018zero] and the use of semantic [li2010objects] or non-semantic attributes [huang2016unsupervised] to boost object recognition. However, in this paper, we focus on attribute learning itself and pursue its descriptive, human-understandable advantages, in the same way that Chen et al. [chen2012describing] focused on describing clothes with words understandable to humans.

2.2.1 Incorporating Classes as Learning Attributes

Informative dependencies between semantic attributes and objects (classes) are useful; in fact, they have co-appeared in many papers. Lampert et al. [lampert2013attribute] assign attributes to images on a per-class basis and train attribute classifiers in a supervised way. On the other hand, Yu et al. [yu2014modeling] model attributes based on their generalized properties—such as their proportions and relative strength—with a set of categories, and make learning algorithms satisfy them as necessary conditions. These methods do not require any instance-level attributes for training, like the method in this paper, but learning visual elements that satisfy constraints on relative proportions among classes is not related to our goal or methodology. Some researchers [mahajan2011joint, wang2013unified] propose joint learning frameworks to more actively incorporate class information into attribute learning. In particular, Akata et al. [akata2013label] and Romera-Paredes et al. [romera2015embarrassingly] hierarchically treat attributes as intermediate features, which serve to describe classes. These systems are designed to learn attributes through bi-directional influences, from classes to attributes (top-down) and from image features to attributes (bottom-up), like deep-proxy. However, their single, linear layering from image features to intermediate attributes differs from the multiple, non-linear layering in deep-proxy.

2.2.2 Learning Visual Concepts from Styles

Elgammal et al. [elgammal2018shape, diana2018artprinciple] show that a deep-CNN can learn semantic factors of styles in its last hidden layers by using the hierarchical structure of the deep-CNN. They interpret the deep-CNN's last hidden layer with pre-defined visual concepts through multiple, separate post-procedures, whereas deep-proxy learns visual elements simultaneously while the machine is trained for style classification. In the experiment, the method proposed by Kim et al. [diana2018artprinciple] is compared with deep-proxy under the name PCA.

3 Methodology

3.1 Linear Relation

3.1.1 Two Conceptual Spaces

Styles are seldom completely uniform and cohesive; they often carry forward former styles and other influences that are still operating within the work. As explained in The Concept of Style [lang1987concept], a style can be both a possibility and an interpretation. It is not a definite quality that inherently belongs to objects, although each of the training samples is artificially labeled with a unique style. Due to the complex variations of the visual properties of art pieces in sequential arrangements of time, styles can overlap, blend, and merge. Based on this idea, this research begins by representing paintings with two entities: a set of visual elements and a set of styles. Two conceptual vector spaces, S for styles and V for visual elements, are introduced, in which each dimension is aligned with one semantic concept. Two vector functions, s and v, are defined to transform an input image x into the conceptual spaces in equation (1) below.

s : x -> s(x) in S,   v : x -> v(x) in V   (1)


Figures 1 and 2 show example paintings that are encoded by visual elements and styles. They are generated by deep-proxy (using a general relationship estimated by sample means of a few ground truth values).

Figure 1: Fire Island by Willem de Kooning (1946): Along with the original style of Abstract Expressionism, Expressionism, Cubism, and Fauvism have all left visual traces within the painting. These are earlier styles that De Kooning, the artist, knew and learned from before developing his own mature style. And the visual traces of the styles are as follows: the breaking apart of whole forms or objects (Cubism), a loose, painterly touch (Expressionism) and vivid, striking color choices and juxtapositions such as pink, teal and yellow (Fauvism).
Figure 2: East Cowes Castle, the seat of J. Nash, Esq.; the Regatta beating to windward by J.M.W. Turner (1828): The painting originally belongs to Romanticism, but Realism and Impressionism are also used for its style encoding.

3.1.2 Category-Attribute Matrix

Inspired by prototype theory [murphy2004big] in cognitive science, we posit that a set of pre-defined visual elements of art is sufficient for characterizing styles. According to this theory, once a set of attributes is arranged to construct a vector space, a single vector point can summarize each category. Mathematically modeling this principle, a vector a_k in V is set to be the typical (best) example for style k, where k = 1, ..., K and K is the number of styles. By accumulating the a_k as columns for all K styles, a matrix A is formed. A becomes the category-attribute matrix.
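The prototype construction above can be sketched numerically. The snippet below is a minimal illustration, not the paper's pipeline: the per-style attribute vectors are random stand-ins, and only the accumulation of typical vectors a_k into the columns of A follows the text (M = 58 visual elements, K = 20 styles).

```python
import numpy as np

rng = np.random.default_rng(0)

M, K = 58, 20              # number of visual elements and styles (from the paper)
n_per_style = 5            # hypothetical number of examples per style

# Hypothetical attribute vectors grouped by style; in the paper these points
# are never observed directly -- this only illustrates the prototype idea.
samples = {k: rng.random((n_per_style, M)) for k in range(K)}

# a_k: the "typical" (prototype) vector of style k, here the mean of its members.
A = np.stack([samples[k].mean(axis=0) for k in range(K)], axis=1)  # shape (M, K)

assert A.shape == (M, K)
```

Each column of A is one style prototype expressed in the visual-element space.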

3.1.3 Linear System

Matrix A ideally defines typical points for styles in the attribute space V. However, as aforementioned, images that belong to a specific style show intra-class variations. In this sense, for a painting x that belongs to style k, its visual vector v(x) is likely to be the closest to a_k, and its similarities to the other styles' typical points can be calculated by the inner products between v(x) and a_j for all styles j = 1, ..., K. All of these computations are expressed by A and the output s(x), resulting in the linear equation (2) below.

s(x) = A^T v(x)   (2)
Figure 3: Summary of Proxy Learning: (a) The paintings of three styles (Pop art, Post-impressionism, and Expressionism) are scattered in the space of three visual elements (abstract, gestural, and linear). The red vectors represent the typical vectors of the three styles. (b) A painting x, originally positioned in the visual space, can be transformed to the three-style (proxy) representation by computing inner products with each of the typical vectors. (c) Proxy learning aims to estimate or learn the original visual coordinates from a proxy representation and a set of typical vectors.
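The forward and inverse directions of this linear relation can be written in a few lines. This is a hedged sketch with random stand-in data; the least-squares inversion is only one of many possible estimators, not the deep-proxy model itself.

```python
import numpy as np

rng = np.random.default_rng(1)

M, K = 58, 20
A = rng.random((M, K))          # hypothetical category-attribute matrix
v_x = rng.random(M)             # hypothetical visual-element vector v(x)

# Proxy (style) representation: inner products with every typical vector a_k.
s_x = A.T @ v_x                 # shape (K,)

# Proxy learning goes the other way: recover an estimate of v(x) from s(x)
# and A, e.g. with a minimum-norm least-squares solve (underdetermined: M > K).
v_hat, *_ = np.linalg.lstsq(A.T, s_x, rcond=None)

assert s_x.shape == (K,)
```

Because M > K, the inverse problem has many solutions; the deep-proxy network described later uses image features to pick a visually meaningful one.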

3.2 Definition of Proxy Learning

In equation (2), knowing s(x) becomes linearly tied with knowing v(x), so we have the following questions: (1) given s(x) and A, how can we learn the function v? (2) Before doing that, how can we properly estimate A and encode s(x) first? This paper aims to answer these questions. We first re-define them by a new concept of learning, named proxy learning. Figure 3 is an illustrative example to describe it.

Proxy learning: a computational framework that learns the function v from s through the linear relationship s(x) = A^T v(x), where A is estimated by language models or a human survey.

3.3 Language Modeling

The matrix A is estimated by using distributed word embeddings from NLP. Two embeddings were considered: GloVe [pennington2014glove] and BERT [devlin2018bert, vaswani2017attention]. However, their original dictionaries do not provide all the necessary art terms. BERT, especially, holds a relatively smaller dictionary than GloVe. In the original BERT, vocabulary words are represented by several word pieces [wu2016google], so it is unnecessary to hold a large set of words. However, the token-level vocabulary words can lose their original meanings, so a new BERT model had to be trained from scratch on a suitable art corpus in order to compensate for the deficient dictionaries.

3.3.1 A Large Corpus of Art

To prepare a large corpus of art, we first gathered all the descendant categories (about 6500) linked with the parent categories of ‘‘Painting’’ and ‘‘Art Movement’’ in Wikipedia and crawled all the texts under those categories by using a publicly available library. Some art terms and their definitions presented by the TATE museum were also added, and finally a new word embedding set—BERT—was trained for art on the resulting sentences.

3.3.2 Training BERT

For the new BERT model for art, the BERT-BASE configuration (12 layers, 768 hidden units, 12 heads, uncased) was selected and trained from scratch on the collected art corpus. For training, the original vocabulary set was updated by adding some words missing from the original framework. We averaged the embeddings of all 12 layers to compute each final word embedding. All details about BERT training are presented in SI B.

3.3.3 Estimation of Category-Attribute Matrix

To estimate the matrix A, vector representations of the concept words were collected and the following over-determined system of equations was set up. Let E denote the d x M matrix of which each column is a d-dimensional word embedding encoding one of the M visual elements, and let w_k be the word embedding that represents style k among the K styles.

E a_k = w_k   (3)

By solving equation (3) for a_k in the least-squares sense, each column vector a_k of A was estimated. It quantifies how the visual elements are positively or negatively related to a certain style in a distributed vector space. In general, word embeddings geometrically capture semantic and syntactic relations between words, so this paper postulates that the general relationship among the concepts can be reflected by the linear formulation (3).
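This estimation can be sketched numerically. Random vectors stand in for the real GloVe/BERT embeddings; the names E and W and the dimension 768 (the BERT-BASE hidden size) are illustrative choices, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)

d, M, K = 768, 58, 20           # embedding dim (BERT-BASE), elements, styles

E = rng.standard_normal((d, M)) # columns: word embeddings of the visual elements
W = rng.standard_normal((d, K)) # columns: word embeddings of the styles

# Solve the over-determined system E a_k ~= w_k for every style in one call;
# the columns of the solution form the estimated category-attribute matrix A.
A, residuals, rank, _ = np.linalg.lstsq(E, W, rcond=None)

assert A.shape == (M, K)
```

Since d > M, each column is a least-squares fit; its entries quantify how positively or negatively each visual element relates to the style.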

3.4 Deep-Proxy

Figure 4: Deep-Proxy Architecture

We propose a novel method to jointly learn the two multi-modal functions, s and v, through a pre-estimated general matrix A. Its principal mechanism is that the category-attribute matrix A is hard-coded into the last fully connected (FC) parameters of a deep-CNN, so the network is forced to learn the visual elements v(x) in its last hidden layer while it is outwardly trained to learn multiple styles s(x). We propose to name this framework deep-proxy. In this paper, the original VGG network [Simonyan15] is adopted for its popularity and modified into a style classifier, as shown in Figure 4.

3.4.1 Implementation of Deep-Proxy

All convolution layers are transferred from ImageNet as-is and frozen, but the original FC layers are expanded to five layers. These FC parameters (cyan-colored, dashed box) are updated during training. We also tried to fine-tune the convolution parts, but this showed slightly degraded performance compared to FC-only training. Therefore, all the results presented in the later sections are FC-only trained with a momentum optimizer, with the learning rate decayed by a constant factor at regular epoch intervals. The final soft-max output is designed to encode s(x), and the last hidden layer is set to encode v(x). The two layers are interconnected by the FC block (magenta-colored, dashed box) to impose a linear constraint between the two modalities. For v(x), the last hidden layer's Rectified Linear Unit (ReLU) is removed, so it can take both positive and negative values.

3.4.2 Objective Function of Deep-Proxy

Let 1[y_i = k] be an indicator function stating whether or not the i-th sample belongs to style class k. Let p_k(x_i) be the k-th style component of the soft-max output simulating s(x_i). Let h(x_i; theta) be the last hidden activation vector, where x_i is an input image and theta is the network's parameters. Then, an objective for multiple style classification is set as in equation (4) below. The second term is added to regularize the magnitudes of the last hidden layer.

L(theta) = - sum_i sum_k 1[y_i = k] log p_k(x_i) + lambda sum_i ||h(x_i; theta)||^2   (4)


In the next subsections, three versions of deep-proxy are defined depending on how the matrix A is formulated.
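Before turning to those versions, the style-classification objective of equation (4) can be sketched with plain NumPy. This is an illustrative stand-in for the actual training graph, with random activations in place of the network's; the per-sample loss combines softmax cross-entropy on s = A^T h with an L2 penalty on the hidden vector.

```python
import numpy as np

rng = np.random.default_rng(3)

M, K, lam = 58, 20, 1e-3
A = rng.random((M, K))                    # fixed category-attribute matrix

def deep_proxy_loss(h, y, A, lam):
    """Softmax cross-entropy on s = A^T h plus an L2 penalty on h.

    h : (M,)  last hidden activations, standing in for v(x)
    y : int   style label
    """
    s = A.T @ h                           # style logits via the linear relation
    s = s - s.max()                       # numerical stability
    log_p = s - np.log(np.exp(s).sum())   # log-softmax
    return -log_p[y] + lam * np.dot(h, h)

h = rng.standard_normal(M)
loss = deep_proxy_loss(h, 4, A, lam)
assert loss > 0
```

In the real network, h is produced by the FC stack and the gradient of this loss flows back through it while A stays fixed (or is transformed, as described next).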

3.4.3 Plain Method

A matrix A is estimated and plugged into the network as it is. Two practical solutions are considered to estimate A: language models and sample means of a few ground truth values. In training the Plain model, the last FC block is fixed as the matrix A, while the other FC layers are updated. This modeling is exactly aligned with equation (2).

3.4.4 SVD Method

A structural disadvantage of the Plain method is noted and resolved by using Singular Value Decomposition (SVD).

It is natural that the columns of the matrix A are correlated, because a considerable number of visual properties are shared among typical styles. Thus, if the machine learns the visual space properly, multiplying A^T with v(x) necessarily produces an s(x) that is highly valued on multiple styles. On the other hand, deep-proxy is trained with one-hot vectors, which promote orthogonality among styles and a sharp high value on a single style component. Hence, learning with one-hot vectors can interfere with learning visual semantics if we simply adopt the Plain method above. For example, suppose there is a typical Expressionism painting x. Then it is likely to be highly valued on both Expressionism and Abstract Expressionism under equation (2), because the two styles are visually correlated. But if a one-hot vector encourages the machine to assign a high value to the Expressionism axis only, the machine might fail to learn the shared visual concepts well, such as gestural brush strokes, mark-making, and the impression of spontaneity, because those concepts are supposed to be high on the correlated style, too. To fix this discordant structure, s(x) and A are transformed to a space where the typical style representations are orthogonal to one another. This reformulates equation (2) as equation (5), where T is the transform matrix to that space.

T^T s(x) = (A T)^T v(x)   (5)
To compute the transform matrix T, A is decomposed by SVD. As the number of attributes (M) is greater than the number of classes (K), and the rank of A is K, A is decomposed as A = U Sigma V^T, where U (M x K) and V (K x K) are matrices with orthonormal columns and Sigma is a diagonal matrix. From the decomposition, A V Sigma^{-1} = U, so we can use T = V Sigma^{-1} as the transform matrix. T maps each column of A to an orthonormal column of U. In this deep-proxy SVD method, s(x) is reformulated with these SVD components as presented in equation (6) below.

s(x) = V Sigma U^T v(x)   (6)
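The decomposition, the transform T, and the identity A T = U can be checked directly. A small sketch with a random stand-in matrix (the real A would come from the language models or the survey):

```python
import numpy as np

rng = np.random.default_rng(4)

M, K = 58, 20
A = rng.standard_normal((M, K))           # hypothetical category-attribute matrix

# Thin SVD: A = U @ diag(sig) @ Vt, where U is (M, K) with orthonormal columns.
U, sig, Vt = np.linalg.svd(A, full_matrices=False)

# Transform T = V Sigma^{-1} maps the columns of A onto the orthonormal
# columns of U:  A @ T == U.
T = Vt.T @ np.diag(1.0 / sig)
assert np.allclose(A @ T, U)

# Equation (2) rewritten with the SVD pieces: s(x) = V Sigma U^T v(x).
v_x = rng.standard_normal(M)
assert np.allclose(A.T @ v_x, Vt.T @ np.diag(sig) @ U.T @ v_x)
```

Training against U instead of A removes the inter-column correlations, so a one-hot target no longer fights the shared visual structure of the styles.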


3.4.5 Offset Method

A positive offset vector o is introduced to learn a threshold for determining whether a visual concept is relevant. Each component of o is an individual threshold for one of the visual elements, so when it is subtracted from a column of the A matrix, zero can be taken as an absolute threshold for interpreting whether a visual concept is relevant to a style class. Since the matrix A is often encoded with values between zero and one, especially when it is created by human survey (ground truth values), a proper offset is needed to shift the matrix. Hence, the vector o is set as learnable parameters in the third version of deep-proxy, which sets the matrix as A - O, where O is the matrix obtained by tiling the vector o across columns. In the Offset method, the SVD components are recalculated for the new matrix at every batch during training.
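The offset subtraction itself is a simple tiling operation. The sketch below uses a fixed offset of 0.5 purely for illustration; in the actual method the offset is a learnable parameter.

```python
import numpy as np

rng = np.random.default_rng(5)

M, K = 58, 20
A = rng.random((M, K))                 # values in [0, 1), as with survey data
o = np.full(M, 0.5)                    # hypothetical per-element thresholds

# Tile o across the K columns and subtract, so zero becomes the threshold
# separating "relevant" from "irrelevant" for every style.
O = np.tile(o[:, None], (1, K))
A_shifted = A - O

assert A_shifted.shape == A.shape
# After shifting, entries above the threshold are positive, others negative.
assert np.all((A > 0.5) == (A_shifted > 0))
```

With the shifted matrix, the sign of each entry directly reads as relevant (positive) or irrelevant (negative).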

4 Experiments

Four pre-existing methods—sparse coding [efron2004least], logistic regression (LGT) [danaci2016low], the Principal Component Analysis (PCA) method [diana2018artprinciple], and an Embarrassingly Simple approach to Zero-Shot Learning (ESZSL) [romera2015embarrassingly]—are reformulated in the context of proxy learning and quantitatively compared with each other. In this section, we demonstrate how LGT and deep-proxy, which are based on deep learning, are more robust than the others when the general relationship (the A matrix) is estimated by language models: GloVe [pennington2014glove] or BERT [devlin2018bert, vaswani2017attention]. We also show that LGT degrades sensitively when the A matrix is sparse. All detailed implementations of the four pre-existing methods are explained in SI C.

4.1 Four Proxy Methods

4.1.1 Logistic Regression (LGT) [danaci2016low]

Each column of A was used to assign attributes to images on a per-class basis. When the A matrix is not a probabilistic representation, the positive values were rescaled into an upper range and the negative values into a lower range, without shifting the zero points.

4.1.2 PCA [diana2018artprinciple]

The last hidden feature of a deep-CNN style classifier is encoded by styles and then multiplied with the transpose of the A matrix to compute the degree of each visual element.

4.1.3 ESZSL [romera2015embarrassingly]

This can be regarded as a special case of deep-proxy Plain: a single FC layer is set between the visual features and A, the softmax loss is replaced with a Frobenius norm, and styles are encoded with signed label vectors. To compute the single layer, a global optimum is found through the closed-form formula proposed by Romera-Paredes et al. [romera2015embarrassingly].

4.1.4 Sparse Coding [efron2004least]

It estimates v(x) directly from the style encodings and the A matrix by solving a sparse-coding equation, without seeing the input images. Its better performance versus random baselines supports our hypothesis of informative ties between styles and visual elements.

4.2 WikiArt Data Set and Visual Elements

This paper used the paintings in the WikiArt data set [wikiart] and merged their original styles into 20 styles (Abstract-Expressionism, Art-Nouveau-Modern, Baroque, Color-Field-Painting, Cubism, Early-Renaissance, Expressionism, Fauvism, High-Renaissance, Impressionism, Mannerism, Minimalism, Naïve-Art-Primitivism, Northern-Renaissance, Pop-Art, Post-Impressionism, Realism, Rococo, Romanticism, and Ukiyo-e), the same as those presented by Elgammal et al. [elgammal2018shape]. A set of paintings was separated for evaluation, and the remaining samples were randomly split into training and validation sets. This paper adopts the pre-selected visual concepts proposed by Kim et al. [diana2018artprinciple]; we used the 58 words listed in Table 1, excluding "medium" because it is not descriptive.

4.3 Evaluation Methods

4.3.1 Human Survey

A binary ground truth set was completed by seven art historians. The subjects were asked to choose one of the following three answers: (1) yes, the shown attribute and painting are related; (2) they are somewhat relevant; (3) no, they are not related. Six paintings were randomly selected from each of the 20 styles, and the art historians first made three sets of ground truths of the 58 visual elements for the 120 paintings. From the three sets, a final set was determined by majority vote. For example, if a question received three different answers, (2) "somewhat relevant" was taken as the final answer. The results show 1652 answers (24%) marked as relevant, 782 (11%) as somewhat, and 4526 (65%) as irrelevant. In order to balance positive and negative values, this paper considered the "somewhat" answers as relevant and created a binary ground truth set. The 120 paintings will be called "eval" throughout this paper.

4.3.2 AUC Metric

The Area Under the receiver operating characteristic Curve (AUC) was used for evaluation. When we say AUC, it means an AUC score averaged over a number of attributes. A random case is simulated and drawn in every plot as a comparable baseline: images are sorted randomly, without considering the machine's output values, and then AUCs are computed. We explain why AUC is selected for art instead of other metrics (mAP or PR) in SI D.
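A rank-based AUC can be computed without anything beyond NumPy; the sketch below also simulates the random baseline described above. Labels and scores are synthetic, and the tie handling of a full implementation is omitted.

```python
import numpy as np

def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney U) formulation; ties ignored."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels.astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(7)
labels = rng.integers(0, 2, 200)

# Random scores: the comparable baseline used in the plots, AUC near 0.5.
random_auc = auc(rng.random(200), labels)

# Scores correlated with the labels give an AUC well above chance.
good_auc = auc(labels + 0.1 * rng.standard_normal(200), labels)
assert good_auc > random_auc
```

AUC is insensitive to the class imbalance of the binary ground truth, which is part of why it suits this evaluation.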

4.3.3 Plots

To draw a plot, we measured AUC scores for all visual elements. The scores were sorted in descending order, every three scores were grouped, and averaged AUC points were computed. Since many of the visual concepts were not learnable (AUC near chance), a single averaged AUC value did not differentiate performance clearly. Hence, the descending scores were used, but averaged at every three points for simplicity. Regularization parameters are written in the legend boxes of the plots where necessary.

4.3.4 SUN and CUB

SUN [patterson2014sun] and CUB [WahCUB_200_2011] are used to understand the models in general situations. All experiments are based on the standard splits proposed by Xian et al. [xian2018zero]. For evaluation, mean Average Precision (AP) is used because the ground truth of these data sets is imbalanced by very large negative samples. For the A matrix, their ground truth samples are averaged.

4.4 Estimation of Category-Attribute Matrix

Two ways to estimate the A matrix are considered. First, from the two sets of word embeddings, GloVe and BERT, two matrices are computed by equation (3); this paper refers to them as the BERT matrix and the GloVe matrix. The GloVe matrix is used only for the reduced-style experiments in a later section because the vocabulary of GloVe does not contain all the necessary style terms. Where necessary, the matrices are written with the number of styles involved in the experiments. Second, the 58-D ground truths of three paintings, randomly selected among the six paintings of each style, were averaged and accumulated into columns, establishing the ground truth matrix. To do so, we first mapped the three answers of the survey to integers: "relevant" = 1, "somewhat" = 0, and "irrelevant" = -1. The paintings of "eval" used to compute the ground truth matrix form one subset, and the remaining paintings form the other.
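The ground-truth matrix construction can be sketched as follows. The integer mapping (relevant = 1, somewhat = 0, irrelevant = -1) follows the later remark that the matrix's zero point aligns with "somewhat"; the survey answers themselves are random stand-ins here.

```python
import numpy as np

rng = np.random.default_rng(8)

answer_values = {"relevant": 1, "somewhat": 0, "irrelevant": -1}
options = list(answer_values)

# Hypothetical survey answers: 3 paintings of one style, 58 elements each.
survey = rng.choice(options, size=(3, 58))
votes = np.vectorize(answer_values.get)(survey)

# One column of the ground-truth matrix: the mean over the three paintings.
a_k = votes.mean(axis=0)

assert a_k.shape == (58,)
```

Repeating this for all 20 styles and stacking the columns yields the 58 x 20 ground-truth matrix.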

4.5 Quantitative Evaluations

Figure 5: (a) Three deep-proxy models by are compared on “eval”; SVD is selected as the best model for art. (b) Proxy models by are compared: solid lines are evaluated on “eval-” and dotted lines on “eval-”. (c) Five proxy models by are evaluated on “eval”. (d) SVD and LGT by and are compared on “eval-VAL”.

4.5.1 Model Selection for Deep-Proxy

To select the best deep-proxy model for art, three versions, Plain, SVD, and Offset, by are compared. For Offset, the matrix is pre-shifted by to make all of its components positive, and the machine then learns a new offset from it. For the regularization in equation (4), , , , and are tested. In Figure 5 (a), SVD achieved the best rates and outperformed the Plain model. Offset was not as effective as SVD. Since the matrix was computed from ground truth values, its zero point was originally aligned with “somewhat”, so offset learning may not be needed.

For a comparable result on the SUN data, Offset is shown as the best in Figure 6 (a). SUN’s G matrix is computed from “binary” ground truths, so it is impossible to gauge the right offset; hence offset learning becomes advantageous for SUN. However, for CUB, SVD and Offset were not learnable (they converged to a local minimum whose recognition is the random choice of equal probabilities). Since CUB’s G matrix has smaller eigenvalues than the other data sets, implying subtle visual differences among bird classes, the two deep-proxy methods become infeasible: they demand that a neural net first capture fine, local visual differences of birds in order to discern the birds as orthogonal vectors. For a neural net, especially in the initial stage of learning, finding the right direction toward such a goal is far more challenging than in art and SUN, whose attributes can be found rather distinctively and globally in different class images. The detailed results for CUB and SUN are shown in SI E.

4.5.2 Sensitive Methods to Language Models

Proxy models by and are evaluated in Figure 5 (b) and (c). To avoid bias from the samples used in computing the G matrix, for the models by , validation (solid line) and test (dotted line) are computed separately on “eval-” and “eval-”, respectively.

High sensitivity to is observed for PCA and ESZSL. In Figure 5 (b), PCA and LGT show similar performance on “eval-”, but on “eval-”, PCA performs better than LGT. The same phenomenon is observed between ESZSL and SVD. The better performance on “eval-” indicates a somewhat direct replication of the G matrix into the outcomes. This hints that ESZSL and PCA can suffer more degradation than the other models when the G matrix is estimated by language models, whose imperfection acts directly on their results, as shown in Figure 5 (b) and (c). Since they compute visual elements through direct multiplication between processed features and the G matrix, and since ESZSL in particular finds a global optimum for a given matrix, they show the highest sensitivity to the condition of the G matrix.

4.5.3 Robust Methods to Language Models

Deep learning makes LGT and deep-proxy adapt slowly to the information given by the G matrix, so these models are less affected by language models than ESZSL and PCA, as shown in Figure 5 (c). LGT can learn some visual elements through BERT or GloVe even when not all style relations for those elements are correct in the language models. For example, for , ‘expressionism’ is encoded as more related to “abstract” than ‘cubism’ or ‘abstract-expressionism’, which is false. Despite the partial incorrectness, LGT gradually learns the semantics of “abstract” at the rate of AUC using training data over a larger range of styles that are correctly encoded: northern-renaissance (least related to “abstract”), rococo, cubism, abstract-expressionism (most related to “abstract”), etc. (abstract AUCs of SVD, PCA, and ESZSL by : ).

Deep-proxy adjusts some portion of the G matrix more actively. Suppose a neural net is trained with distorted by . By equation (2), this is a valid convergence point of the net, and we can also see it as another possible point where . If the bottom of the neural net approximately learns , it works as if a better matrix were given, absorbing some of the errors. This adjustment could explain deep-proxy’s greater robustness to the imperfection of language models than the other models, and also its flexibility with the sparse G matrix that is shown to be directly harmful to LGT. This is discussed in the next section.

Figure 6: SUN experiment: (a) Validation results for all models. (b) The relationship between validation AP (y-axis) and intra-class (x-axis) for LGT, for each attribute (each red dot). Offset points (two green dots) are drawn only when the AP gap between Offset and LGT is more than . The higher scores on the green dots show that Offset is less affected by the sparsity of the G matrix than LGT. When the AP-gap threshold is lowered to , words were found, and Offset worked better than LGT for all of them.
Figure 7: (a) The relationship between test AUC (y-axis), (x-axis), and (dot size) for LGT (by and “eval-NG”), for each attribute (each red dot). SVD points (four green dots) are drawn only when the AUC gap between SVD and LGT is more than ; each performance gap is traced by the blue dotted lines. (b) Visual elements that scored more than AUC by SVD or LGT (by and “eval”). (c) Style encoding of BERT and GloVe for the word “planar”.

4.5.4 Logistic and Deep-Proxy on

Two factors are analyzed against LGT and SVD performance: the intra-class standard deviation ( ) and mean ( ). The intra-class statistics of each style are computed with “eval” and averaged across the styles to estimate them for the visual elements. For LGT and SVD by , AUC is moderately related to the intra-class standard deviation (Pearson correlation coefficient = and = ), but performance is not explained solely by it. As shown in Figure 7 (a), “monochromatic” (AUC of LGT and SVD: and ) scored far less than “flat” (AUC of LGT and SVD: and ) even though both words’ deviations are similar and small. Since the element “monochromatic” was not a relevant feature for most styles, it was consistently encoded as very small values across the styles in the G matrix. The element has small variance within a style but not enough power to discriminate styles, so it failed to be learned. LGT can be degraded more by this sparsity because information encoded close to zero for all styles cannot be back-propagated properly. As shown in Figure 6 (b), the intra-class values of the attributes in SUN are densely distributed between and , so LGT is ranked lower than for art; LGT’s AP is the most tied to the sparse G matrix among the models ( = , = , = , = at ).
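The intra-class statistics behind this analysis can be sketched as follows; `intra_class_stats` is a hypothetical helper mirroring the description above, and SciPy’s `pearsonr` would supply the correlation coefficients:

```python
import numpy as np
from scipy.stats import pearsonr

def intra_class_stats(values, style_labels):
    """Average intra-style standard deviation and mean of one visual
    element's ground-truth values across styles.

    values:       (n_paintings,) ground-truth scores of one element
    style_labels: (n_paintings,) style of each painting
    """
    stds, means = [], []
    for s in np.unique(style_labels):
        v = values[style_labels == s]
        stds.append(v.std())   # spread within this style
        means.append(v.mean()) # level within this style
    return float(np.mean(stds)), float(np.mean(means))

# Correlation between per-element AUCs and per-element intra-class stds:
#   aucs, sigmas = ...      # one value per visual element
#   r, p = pearsonr(aucs, sigmas)
```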

For SVD by , the overall performance is lower than LGT by . For the words “diagonal”, “parallel”, “straight”, and “gestural” (four green dots in Figure 7 (a)), found by the condition , LGT scored higher than SVD. Since SVD is trained with an objective function for multiple-style classification, its learning of visual elements can be restricted in the following ways: some hidden axes could be used to learn non-semantic features that promote style learning, or some semantics necessary for styles could be strewn across multiple axes. Hence, LGT generally learns more words than SVD when the G matrix is estimated from ground truths, as shown in Figure 5 (b), but the G matrix should not be too sparse for LGT.

Figure 8: Correlation analysis between AUCs and matrix scores. The BERT plots (a-1) and (a-2) are drawn from the visual elements that scored more than AUC by either SVD or LGT (by ). The GloVe plots (b-1) and (b-2) are drawn from the visual elements that scored more than AUC by either SVD or LGT (by ). This shows that LGT is more sensitively affected by the quality of language models.
Figure 9: AUC plots of the random split on BERT: SVD by scored better than LGT at most AUC@ points.

4.5.5 Logistic and Deep-Proxy on BERT and GloVe

For language models, it is harder to generalize about the performance of LGT and SVD. As shown in Figure 7 (b), it was not clear which is better with BERT, so another comparable language model was needed to understand their performance. GloVe is tested after dividing the styles into train (10 styles: Baroque, Cubism, Expressionism, Fauvism, High-Renaissance, Impressionism, Mannerism, Minimalism, Pop-Art, and Ukiyo-e) and test (10 styles). Aligned with this split, “eval” was also separated into “eval-VAL” and “eval-TEST” (the 10 styles unseen in training), and “eval-VAL” was used to select hyper-parameters. The models by BERT were compared on the same split, too. The ranking relations differed depending on the language model: in Figure 5 (d), SVD by scored better than LGT at all AUC@ points, whereas LGT by was better than SVD for the first top words but worse for the second. To identify a key factor behind the different performance, we scored the quality of BERT and GloVe with for each visual element and conducted a correlation analysis between these scores and the AUC results; Pearson correlation coefficients between the AUCs and the scores were computed. The results are shown in Figure 8.

In the analysis, GloVe scored higher than BERT, and LGT showed a stronger correlation than SVD between AUCs and the scores. Along with the results of Figure 5 (d), this demonstrates the robustness of SVD to the imperfection of language models. As a specific example, the word “planar” is incorrectly encoded by BERT, which assigns negative values to Expressionism, Impressionism, and Post-Impressionism, as shown in Figure 7 (c). The wrong information influenced LGT more, so its AUC was (eval-TEST: ) by BERT but (eval-TEST: ) by GloVe, while SVD learned “planar” at the similar rates of (eval-TEST: ) by BERT and (eval-TEST: ) by GloVe on “eval-VAL”. For LGT, the defective information is provided directly through the training data, so LGT is more sensitively affected by noisy language models. On the other hand, SVD can learn some elements even when trained with an imperfect matrix, if the elements are essential for style classification, possibly through the adjustment operation mentioned above. Another split of 12 training styles (Art-Nouveau-Modern, Color-Field-Painting, Early-Renaissance, Fauvism, High-Renaissance, Impressionism, Mannerism, Northern-Renaissance, Pop-Art, Rococo, Romanticism, and Ukiyo-e) vs. 8 test styles is also tested by BERT. In this experiment, LGT again scored less than SVD, as shown in Figure 9.

Visual Elements (AUC > 0.65): abstract (0.90), chromatic (0.79), atmospheric (0.75), planar (0.71), representational (0.67)
Visual Elements (AUC < 0.65): geometric (0.63), perspective (0.46)
Table 3: Descending ranking results (top to bottom) based on the predictions of SVD ( and ). The five most relevant ( rows) and five least relevant ( rows) paintings are shown as the machine predicted; the value after each visual element is its AUC score.

4.5.6 Descending Ranking Results of SVD by

To present some example results, the 120 paintings of “eval” are sorted by the activation values of SVD by . Table 3 presents results for words whose AUC was above or below the 0.65 threshold with the BERT model, showing how the “eval” paintings differ visually according to each output value of deep-proxy for the seven selected visual elements (abstract, chromatic, atmospheric, planar, representational, geometric, and perspective).

5 Conclusion and Future Work

Quantifying fine art paintings with visual elements is a fundamental part of developing AI systems for art, but direct annotations are very scarce. In this paper, we presented several proxy methods that learn this valuable information through its general, linear relation to style, which can be estimated by language models or a human survey. The methods are quantitatively analyzed to reveal how their inherent structures make them robust or weak under practical estimation scenarios. The robustness of deep-proxy to the imperfection of language models is a key finding. In future work, we will look at more complex systems; for example, a non-linear relation block learned by language models could be transferred or transplanted into a neural network to learn visual elements through deeper relations with styles. Furthermore, direct applications such as finding acoustic semantics for music genres or learning principal elements for fashion design would be interesting subjects for proxy learning: their attributes are shared visually or acoustically to define a higher level of categories, but their class boundaries could be softened as proxy representations.


Supplementary Information (SI)

Appendix A Painting Information

Painting Information Painting Information
title: Madonna Conestabile
title: The Architect, Jesus T. Acevedo
author: Raphael
author: Diego Rivera
year: 1502
year: 1915
style: High Renaissance
style: Cubism
title: The Sistine Madonna
title: Water of the Flowery Mill
author: Raphael
author: Arshile Gorky
year: 1513
year: 1944
style: High Renaissance
style: Abstract Expressionism
title: Judith
title: Pendulum
author: Correggio
author: Helen Frankenthaler
year: 1512-1514
year: 1972
style: High Renaissance
style: Abstract
title: Morning in a Village
title: Untitled Vessel #10
author: Fyodor Vasilyev
author: Denise Green
year: 1869
year: 1977
style: Realism
style: Neo-Expressionism
title: Forest Landscape with Stream
title: Untitled No. 40
author: Ivan Shishkin
author: Claude Viallat
year: 1870-1880
year: 1996
style: Realism
style: Color-Field-Painting
Table A.1: Title, author, year, and style of each painting

Appendix B BERT Model

B.1 Training BERT

List of Words
non-representational, representational, meandering, gestural
amorphous, biomorphic, planar, chromatic
monochromatic, bumpy, art-nouveau, cubism
expressionism, fauvism, abstract-expressionism
color-field-painting, minimalism, naive-art, ukiyo-e
early-renaissance, pop-art, high-renaissance, mannerism
northern-renaissance, rococo, romanticism, impressionism
Table B.1: 28 words newly added to the original dictionary of BERT

For a new BERT model for art, the BERT-BASE model (12 layers, 768 hidden units, 12 heads, uncased) was selected and trained from scratch on the collected art corpus. For training, the original BERT vocabulary is extended with new words; Table B.1 lists the words that were added. The Adam optimizer with decay is used. The BERT model is trained for steps with a learning rate of , and the number of warm-up steps is set to .

B.2 Collecting BERT Embedding

A context-free embedding is collected for each of the art terms (20 styles and 58 visual elements). Context-free means that only the target word is input to BERT to collect the word embedding, without accompanying words. Each target word is enclosed by [CLS] and [SEP] and input to BERT in the format “[CLS] target-word [SEP]”. The output vectors (768-D) from all 12 hidden layers are collected and averaged. The representations corresponding to [CLS] and [SEP] are discarded, so only the vector representations of the input word are taken as the final embedding. We also tried averaging the embeddings from only the top or bottom four layers, but averaging all 12 layers was slightly better at presenting the general relationship between styles and visual elements.
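The layer-averaging step can be sketched with a placeholder tensor standing in for real BERT hidden states (collecting the states themselves would require a trained model; the function here only shows the averaging and the [CLS]/[SEP] removal):

```python
import numpy as np

def context_free_embedding(hidden_states):
    """Average a word's representations over all hidden layers,
    dropping the [CLS] and [SEP] positions.

    hidden_states: (n_layers, seq_len, hidden_dim) activations for the
                   input "[CLS] target-word [SEP]"
    """
    word_vectors = hidden_states[:, 1:-1, :]   # drop [CLS] (first) and [SEP] (last)
    # average over word-piece positions, then over the layers
    return word_vectors.mean(axis=1).mean(axis=0)
```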

Figure B.1: A general relation between 58 visual elements and 20 styles is estimated by BERT. It is visualized in gray-scale.

B.3 Visualization of BERT Matrix

The general relation estimated by BERT is visualized in gray-scale in Figure B.1. In this figure, a brighter square shows a stronger association between a style and a visual element.

Appendix C Four Proxy Models

C.1 Sparse Coding

Sparse coding [efron2004least] is the simplest model for proxy learning; it estimates the visual information from style encodings without seeing the input paintings. Treating the rows of the G matrix as an over-complete basis of style space (the number of styles in this paper is at most 20, while the number of visual elements is 58), the visual-element vector a is estimated from a given style encoding s and G by the sparse coding equation (7) below:

a* = argmin_a || s − G^T a ||_2^2 + λ || a ||_1 .   (7)
To encode a painting into multiple styles, the soft-max output of a style classifier is used. To implement the style classifier, all convolution layers of a VGG-16 ImageNet model [Simonyan15] are transferred as-is and frozen, while the original FC layers are replaced with five FC layers whose parameters are updated during training. The hyper-parameter values tried during development are , , and . The linear equation (7) is solved by Least Angle Regression [efron2004least].
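Under the Lasso reading of the sparse-coding objective, the method can be sketched with scikit-learn’s `LassoLars`, which implements least-angle regression; the hyper-parameter value and the helper name are illustrative:

```python
import numpy as np
from sklearn.linear_model import LassoLars

def sparse_code(style_probs, G, alpha=0.01):
    """Recover a sparse visual-element vector a from a painting's
    soft-max style encoding s, treating the rows of G (n_attrs x
    n_styles) as an over-complete basis of style space: s ~ G^T a
    with an L1 penalty, solved by Least Angle Regression.
    """
    lasso = LassoLars(alpha=alpha, fit_intercept=False)
    lasso.fit(G.T, style_probs)   # design matrix: (n_styles, n_attrs)
    return lasso.coef_            # (n_attrs,) sparse visual elements
```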

C.2 Logistic Regression

Logistic regression [danaci2016low] (LGT) is used to learn visual elements in a supervised fashion. Similar to the work of Lampert et al. [lampert2013attribute], each column of the G matrix was used to assign attributes to images on a per-class basis, with the column index determined by the style label. When the G matrix entries are not probabilistic representations, without shifting zero points, the positive values were mapped into the range to and the negative values into the range to , so that finally all values lay between zero and one.

Logistic regression is implemented on the last convolution layer of a VGG-16 ImageNet model. Multiple FC layers are added on top of the convolution layer, the same as for deep-proxy, and the FC part is newly updated by an objective function. Let be an indicator function stating whether the -th sample belongs to style class . Let (58-D) be the probabilistic representation of the column of the G matrix corresponding to that style class. Let be the (58-D) logistic output of the network, where is an input image, denotes the network’s parameters, and denotes the last FC layer. The objective function for logistic regression is then set as in equation (8) below.


The regularization term is added to reduce undesirable correlations among attributes by restricting the magnitudes of the last-layer parameters. The tested values are , , , and . All logistic models are trained for 200,000 steps at batch size 32 with a momentum optimizer. The learning rate is initially set to and decayed by a factor of 0.94 every two epochs, the same as for deep-proxy.
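Since equation (8) is not reproduced in this extraction, the following is a minimal sketch of one plausible form: a per-attribute binary cross-entropy between the network’s 58-D logistic output and the soft targets taken from the style’s column of G (the paper’s exact objective may differ):

```python
import numpy as np

def soft_target_bce(f, g):
    """Per-attribute binary cross-entropy against soft targets.

    f: (58,) logistic outputs of the network for one painting
    g: (58,) probabilistic column of G for the painting's style label
    """
    eps = 1e-7
    f = np.clip(f, eps, 1 - eps)   # numerical stability
    return float(-np.mean(g * np.log(f) + (1 - g) * np.log(1 - f)))
```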

C.3 PCA

Kim et al. [diana2018artprinciple] proposed a framework based on Principal Component Analysis (PCA) to annotate paintings with a set of visual concepts. Unlike proxy learning, they consider a style conceptual space as a joint space in which paintings and each visual concept can be encoded to measure their associations. Even though its approach to style space differs from proxy learning, the PCA method can be reproduced in the context of proxy learning as follows.

  1. PCA transform of the visual embedding. The visual embedding, collected from the last hidden layer (before activation) of a deep CNN (style classification model), is transformed to PCA space. A VGG-16 style classification model (FC part: ) proposed by Elgammal et al. [elgammal2018shape] is used to collect the embedding. Let denote the dimension of the embedding, the number of painting samples, and the number of PCA components that cover 95% of the variance of the embedding. The last layer’s hidden embedding is collected, and then the PCA projection matrix (without whitening) is computed from it. The embedding is transformed to PCA samples by equation (9) below, where is the sample mean and is its tiling matrix.

  2. Style encoding of PCA components. Each column of the label matrix is a one-hot binary vector encoding the style label of each sample, where denotes the number of styles. The multivariate linear regression equation (10) below is solved; in the solution, each PCA axis is encoded by styles.

  3. Computing visual attributes. Finally, the attribute representation is computed by equation (11) below, where indicates the PCA representation of a test sample and is the category-attribute matrix.
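The three steps can be sketched end-to-end in NumPy; the final composition of PCA representation, style encoding, and G in the last line is one plausible reading of equations (9)-(11), not the paper’s verbatim formulas:

```python
import numpy as np

def pca_proxy(H, Y, G, x):
    """Sketch of the PCA proxy: project style-CNN embeddings to PCA
    space, regress PCA axes onto one-hot styles, then map a test
    sample to visual elements through G.

    H: (m, d) training embeddings     Y: (m, z) one-hot style labels
    G: (a, z) category-attribute matrix   x: (d,) test embedding
    """
    mu = H.mean(axis=0)
    Hc = H - mu
    # PCA basis (no whitening): right singular vectors of centered data
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    W = Vt.T                                    # (d, k) projection matrix
    P = Hc @ W                                  # (m, k) PCA samples, eq. (9)
    B, *_ = np.linalg.lstsq(P, Y, rcond=None)   # (k, z) style encoding, eq. (10)
    p = (x - mu) @ W                            # PCA rep. of the test sample
    return G @ (B.T @ p)                        # (a,) visual elements, eq. (11)
```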


C.4 ESZSL

An Embarrassingly Simple Approach to Zero-Shot Learning (ESZSL) [romera2015embarrassingly] was originally designed for zero-shot learning, not for learning visual semantics. However, its linear modeling between semantics and classes is comparable to deep-proxy and can be re-defined as a method for proxy learning as follows.

  1. Computing a matrix to transform visual features into visual attributes. Let denote the image features of the training samples, where is the dimension of the features, and let denote the ground truth labels of each sample belonging to one of the styles. The closed formulation (12) below then computes a single matrix, which is used to transform image samples to visual elements. Two regularization parameters ( and ) are used, and all possible combinations of and for are considered. In this paper, the last hidden layer of VGG-16 ImageNet (4096-D, before activation), generally called the “fc7 feature”, is used to collect the image features.

  2. Computing visual attributes. The final attribute representation is computed by equation (13) below, where indicates the feature vector of an input image.
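The closed form of ESZSL is well known from Romera-Paredes and Torr; a NumPy sketch of steps 1 and 2, with illustrative matrix names, follows:

```python
import numpy as np

def eszsl_proxy(X, Y, S, gamma=1.0, lam=1.0):
    """ESZSL closed form (after Romera-Paredes & Torr, 2015):
        V = (X X^T + gamma I)^{-1} X Y S^T (S S^T + lam I)^{-1}

    X: (d, m) image features   Y: (m, z) style labels (one-hot or +/-1)
    S: (a, z) category-attribute (G) matrix
    Returns V (d, a); a test image's elements are V^T x, as in eq. (13).
    """
    d = X.shape[0]
    a = S.shape[0]
    left = np.linalg.inv(X @ X.T + gamma * np.eye(d))
    right = np.linalg.inv(S @ S.T + lam * np.eye(a))
    return left @ X @ Y @ S.T @ right
```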


Appendix D Evaluation Metrics

(a) PR example: “representational” and “proportion”.
(b) AUC example: “transparent” and “monochromatic”.
Figure D.1: Example cases of PR and AUC. AUC can be biased by the imbalance of large negative ground truths, as in the case of “transparent” in (b). However, we confirmed that AUC is in general less affected by data imbalance than PR or mAP, while Precision-Recall (PR) presented deceptively high scores, as shown in (a). For AUC, most random cases scored around 0.5 regardless of the negative ratios, like “monochromatic” in (b).
58 Visual Elements of Art
Elements of Art Words
Subject representational (0.18/0.56), non-representational (0.76/0.5)
blurred (0.72/0.43), broken (0.82/0.43), controlled (0.35/0.51)
curved (0.6/0.52), diagonal (0.82/0.53), horizontal (0.61/0.55)
vertical (0.46/0.45), meandering (0.93/0.37)
thick (0.78/0.51), thin (0.8/0.45), active (0.53/0.48)
energetic (0.56/0.46), straight (0.76/0.52)
bumpy (0.87/0.42), flat (0.45/0.53), smooth (0.63/0.58)
gestural (0.63/0.46), rough (0.68/0.44)
calm (0.63/0.46), cool (0.63/0.49), chromatic (0.36/0.54)
monochromatic (0.88/0.54), muted (0.8/0.35)
warm (0.5/0.56), transparent (0.95/0.82)
ambiguous (0.72/0.42), geometric (0.78/0.49), amorphous (0.88/0.4)
biomorphic (0.82/0.4), closed (0.3/0.51), open (0.68/0.46), distorted (0.75/0.45)
heavy (0.72/0.52), linear (0.72/0.51), organic (0.76/0.43), abstract (0.69/0.48)
decorative (0.63/0.56), kinetic (0.77/0.48), light (0.81/0.55)
Light and Space
bright (0.56/0.48), dark (0.74/0.57), atmospheric (0.68/0.54)
planar (0.58/0.55), perspective (0.5/0.54)
General Principles of Art
overlapping (0.59/0.56), balance (0.51/0.41), contrast (0.47/0.49)
harmony (0.5/0.56), pattern (0.61/0.56) repetition (0.57/0.54)
rhythm (0.65/0.48), unity (0.59/0.45), variety (0.55/0.46)
symmetry (0.78/0.48), proportion (0.35/0.51), parallel (0.77/0.56)
Table D.1: 58 visual elements: ratios of negative ground truth (left value) and random-AUC scores (right value) on the paintings in “eval”. We confirmed that AUC is in general less affected by data imbalance than PR or mAP; most random cases scored close to 0.5 regardless of the negative ratios of the ground truths.

Both mean Average Precision (mAP) and the area under the Precision-Recall curve (PR) were initially considered along with AUC. However, we found that some visual concepts were highly imbalanced in “eval” (positive samples greatly outnumber negative samples), so the two metrics presented deceptively high scores. For example, for the concepts “representational” and “proportion”, randomly ordered samples achieved 0.81 and 0.8 PR because more than 80% of the 120 paintings in “eval” are relevant or somewhat relevant to these concepts. In Figure D.1 (a), each element’s positive ground-truth ratio is directly reflected in its PR score.

Theoretically, AUC scores of random samples can also be biased by larger negative ground truths, as presented for “transparent” in Figure D.1 (b). However, we confirmed that AUC is in general less affected by data imbalance than PR or mAP (PR is the lower bound of mAP). When we compared random performances under AUC, in most cases the random scores were around 0.5 regardless of the negative ratios, like “monochromatic” in Figure D.1 (b). As a reference, both the ratios of negative ground truths and the random-AUC scores on “eval” are presented for all 58 visual elements of art in Table D.1.
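The contrast between AP and AUC under imbalance is easy to reproduce with synthetic random scores; the positive ratio here mirrors the roughly 80%-positive case of “representational”:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# imbalanced ground truth: ~80% positive, as for "representational"
y_true = (rng.random(2000) < 0.8).astype(int)
y_rand = rng.random(2000)            # randomly ordered scores

ap = average_precision_score(y_true, y_rand)   # tracks the positive ratio (~0.8)
auc = roc_auc_score(y_true, y_rand)            # stays near 0.5 regardless
```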

Appendix E SUN and CUB AP Plots

Figure E.1: SUN AP performance of the proxy models: due to the sparsity of SUN’s G matrix, LGT is degraded. The ranking relations among the five models in validation also hold in test.
Figure E.2: Eigenvalue distributions of CUB, SUN, and Art: the cropped eigenvalue distributions between 0.0 and 10.0 are presented for CUB, SUN, and WikiArt to show that CUB’s G matrix has more small eigenvalues than the others, implying subtle visual differences among bird classes.
Figure E.3: CUB AP performance of the proxy models: since the visual differences among bird classes are very subtle, fine, and local, Offset and SVD were not learnable.

To examine the models in more general situations, SUN [patterson2014sun] and CUB [WahCUB_200_2011] are tested with all five proxy models. All experiments are based on the standard splits (train, validation, and test) proposed by Xian et al. [xian2018zero] for zero-shot learning; the classes in the test split are unseen in training, and the validation split is used to select hyper-parameters. Since both ground-truth attribute sets are imbalanced by very large negative samples (the mean of all the samples is for SUN and for CUB at the binary threshold of ), mean Average Precision (AP) is used for evaluation. For the G matrix, the ground truth samples are averaged. PCA scored best on both SUN and CUB, but the other models performed differently depending on the data set. All AP plots for SUN and CUB are presented in Figures E.1 and E.3, and the eigenvalue distributions of CUB, SUN, and Art in Figure E.2.