Automatically generating meaningful and accurate image descriptions is a challenging task that has been extensively addressed in recent years. This task implies recognizing objects and their relationships in an image and generating syntactically and semantically correct textual descriptions. Significant progress on this task has been made using deep learning-based techniques. A prerequisite for this kind of approach is the availability of large datasets of semantically related image and sentence pairs. In the domain of natural images, several well-known large-scale datasets are commonly used for caption generation, such as the MS COCO, Flickr30k and Visual Genome datasets. Although the availability of such datasets has enabled remarkable results in generating high-quality captions for photographs of various objects and scenes, generating image captions remains difficult for domain-specific image collections. In particular, in the cultural heritage domain, generating image captions is an open problem with various challenges. One of the major obstacles is the lack of a truly large-scale dataset of artwork images paired with adequate descriptions. It is also relevant to address what kind of description would be regarded as “adequate” for a particular purpose. Considering, for instance, Erwin Panofsky’s three levels of analysis, we can distinguish the “pre-iconographic” description, the “iconographic” description and the “iconologic” interpretation as possibilities of aligning semantically meaningful, yet very different, textual descriptions with the same image. While captions of natural images usually function on the level of “pre-iconographic” descriptions, which implies simply listing the elements depicted in an image, for artwork images this type of description represents only the most basic level of visual understanding and is often not considered to be of great interest.
In the context of artwork images, it would be more interesting to generate “iconographic” captions that capture the subject and symbolic relations between objects. Creating a dataset for such a complex task requires expert knowledge in the process of collecting sentence-based descriptions of images. There have been some attempts to create such datasets, but the existing datasets consist of only a few thousand images and are therefore not suitable for training deep neural models in the current state-of-the-art setting for image captioning. However, there are several existing large-scale artwork collections that associate images with keywords and specific concepts. The idea of this work is to use a concatenation of concept descriptions associated with an image as textual input for training an image captioning model. Recently, an interesting large-scale artwork dataset has been published under the name “Iconclass AI Test Set”. This dataset is a collection of various artwork images annotated with alphanumeric classification codes that correspond to notations from the Iconclass system. Iconclass is a classification system designed for art and iconography and is widely accepted by museums and art institutions as a tool for the description and retrieval of subjects represented in images. Although the “Iconclass AI Test Set” is not structured primarily as an image captioning dataset, each code is paired with its “textual correlate”, a description of the iconographic subject of the particular Iconclass notation. Therefore, the main intention of this work is to extract and preprocess the given annotations into clean textual descriptions and create the “Iconclass Caption” dataset. This dataset is then used to fine-tune a pre-trained unified vision-language model on the downstream task of image captioning. Transformer-based vision-language pre-trained models currently represent the leading approach for solving a variety of tasks at the intersection of computer vision and natural language processing. This paper represents a first attempt to employ this approach on a collection of artwork images with the goal of generating image captions relevant in the context of art history.
2 Related work
The availability of large collections of digitized artwork images has led to an increase in interest in the use of deep learning-based techniques for a variety of tasks. Research in this area most commonly focuses on problems related to computer vision in the context of art historical data, such as image classification [4, 29], visual link retrieval [31, 3], analysis of visual patterns and conceptual features [33, 11, 6, 14], object and face detection [10, 36], pose and character matching [24, 19] and computational aesthetics [18, 5, 30].
Recently, however, there has been a surge of interest in topics that deal not only with the visual but with both the visual and textual modalities of artwork collections. The pioneering works in this research area mostly addressed the task of multi-modal retrieval. In particular, the SemArt dataset, a collection of fine-art images associated with textual comments, was introduced with the aim of mapping the images and their descriptions into a joint semantic space. The authors compare different combinations of visual and textual encodings, as well as different methods of multi-modal transformation. In projecting the visual and textual encodings into a common multi-modal space, they achieve the best results by applying a neural network trained with a cosine margin loss on ResNet50 features as visual encodings and bag-of-words as textual encodings. The task of creating a shared embedding space was also addressed through the introduction of a new visual-semantic dataset named BibleVSA, a collection of miniature illustrations and commentary text pairs, together with an exploration of supervised and semi-supervised approaches to learning cross-references between textual and visual information in documents. The Artpedia dataset, consisting of 2930 images annotated with visual and contextual sentences, was presented together with a cross-modal retrieval model that projects images and sentences into a common embedding space and discriminates between contextual and visual sentences of the same image. A similar extension of this approach to other artistic datasets has also been presented.
Besides multi-modal retrieval, another emerging topic of interest is visual question answering (VQA). In one work, a subset of the Artpedia dataset was annotated with visual and contextual question-answer pairs, and the authors introduced a question classifier that discriminates between visual and contextual questions together with a model that is able to answer both types of questions. Another work introduces a novel dataset, AQUA, which consists of automatically generated visual and knowledge-based QA pairs, and presents a two-branch model in which visual and knowledge questions are handled independently.
A limited number of studies have contributed to the task of generating descriptions of artwork images using deep neural networks, and all of them rely on an image captioning approach based on the encoder-decoder architecture. For example, one study proposes an encoder-decoder framework for generating captions of artwork images in which the encoder (a ResNet18 model) extracts the input image feature representation and the artwork type representation, while the decoder is a long short-term memory (LSTM) network. The authors introduce two image captioning datasets referring to ancient Egyptian art and ancient Chinese art, which contain 17,940 and 7,607 images respectively. Another very recent work presented a novel captioning dataset for art historical images consisting of 4000 images across 9 iconographies, along with a description for each image consisting of one or more paragraphs. The authors used this dataset to fine-tune different variations of image captioning models based on the well-known encoder-decoder approach.
Influenced by the success of large-scale pre-trained language models such as BERT on different tasks related to natural language processing, there has recently been a surge of interest in developing Transformer-based vision-language pre-trained models. Vision-language models are designed to learn joint representations that combine information from both modalities and the alignments across those modalities. It has been shown that models pre-trained on intermediate tasks with unsupervised learning objectives using large datasets of image-text pairs achieve remarkable results when adapted to different downstream tasks such as image captioning, cross-modal retrieval or visual question answering [42, 37, 23, 7]. However, to the best of our knowledge, this approach has until now not been explored for tasks in the domain of art historical data.
3 Experimental setup
3.1 Iconclass Caption Dataset
In our experiment we use a subset of 86,530 valid images from the “Iconclass AI Test Set”. This is a very diverse collection of images sampled from the Arkyves database (www.arkyves.org). It includes images of various types of artworks such as paintings, posters, drawings, prints, manuscript pages, etc. Each image is associated with one or more codes linked to labels from the Iconclass classification system. The authors of the “Iconclass AI Test Set” provide a JSON file with the list of images and corresponding codes, as well as an Iconclass Python package for analyzing and extracting information from the assigned classification codes. To extract textual descriptions of images for the purpose of this work, the English textual descriptions of each code associated with an image are concatenated. Further preprocessing of the descriptions includes removing text in brackets and some recurrent uppercased dataset-specific codes. In this dataset, the text in brackets most commonly includes very specific named entities, which are considered noisy input for the image captioning task. Therefore, when preprocessing the textual items, all text in brackets is removed, even at the cost of sometimes removing useful information. Figure 1 shows several example images from the Iconclass dataset and their corresponding descriptions before and after preprocessing. Depending on the number of codes associated with each image, the final textual descriptions can vary significantly in length. Also, because of the specific properties of this dataset, the image descriptions are not structured as sentences but as lists of comma-separated words and phrases.
Because of this structure, and because there is only one reference caption for each image, the Iconclass Caption dataset is not a standard image captioning dataset. However, bearing in mind the difficulties of obtaining adequate textual descriptions for images of artworks, this dataset can be considered a valuable source of image-text pairs in the current context, particularly because the large number of annotated images enables the training of deep neural models. In the experimental setting, a subset of 76k items is used for training the model, 5k for validation and 5k for testing.
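The preprocessing described above can be illustrated with a minimal sketch. It is not the code used in the experiments: the dictionary `code_to_text` is a hypothetical stand-in for the labels that would be obtained from the Iconclass Python package, the example codes and labels are invented for illustration, and the pattern used for the "uppercased dataset-specific codes" (whole words of two or more capital letters) is an assumption, since the exact codes are not listed in the text.

```python
import re

BRACKETS = re.compile(r"\([^()]*\)")       # innermost (...) group
UPPER_CODE = re.compile(r"\b[A-Z]{2,}\b")  # assumed shape of dataset-specific codes


def clean_label(label: str) -> str:
    """Remove bracketed text and uppercase codes from one concept description."""
    previous = None
    while previous != label:               # repeat so nested brackets are removed too
        previous = label
        label = BRACKETS.sub("", label)
    label = UPPER_CODE.sub("", label)
    label = re.sub(r"\s+", " ", label)             # collapse runs of whitespace
    label = re.sub(r"\s+([,;:])", r"\1", label)    # no space before punctuation
    return label.strip(" ,;:")


def build_caption(codes, code_to_text):
    """Concatenate the cleaned descriptions of all codes assigned to one image."""
    parts = [clean_label(code_to_text[c]) for c in codes if c in code_to_text]
    return ", ".join(p for p in parts if p)


# Hypothetical codes and labels, for illustration only.
code_to_text = {
    "X1": "the monk and hermit Jerome (Hieronymus)",
    "X2": "lion (as attribute)",
    "X3": "landscape with ruins",
}
caption = build_caption(["X1", "X2", "X3"], code_to_text)
print(caption)
```

The result is a comma-separated list of phrases rather than a sentence, which mirrors the structure of the descriptions discussed above.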
3.2 Vision-Language Model
In this work, the unified vision-language pre-training (VLP) model is employed. This model is denoted as “unified” because the same pre-trained model can be fine-tuned for different types of tasks. These tasks include both vision-language generation (e.g. image captioning) and vision-language understanding (e.g. visual question answering). The model is based on an encoder-decoder architecture comprising 12 Transformer blocks. The model input consists of the image embedding, the text embedding and three special tokens that indicate the start of the image input, the boundary between the visual and textual input and the end of the textual input. The image input consists of 100 object-classification-aware region features extracted using the Faster R-CNN model pre-trained on the Visual Genome dataset. For a more detailed description of the overall VLP framework and pre-training objectives, the reader is referred to the original VLP publication. The experiments in this work employ as the base model the VLP model pre-trained on the Conceptual Captions dataset using the sequence-to-sequence objective. This base model is fine-tuned on the Iconclass Caption dataset using the recommended fine-tuning configuration, namely training with a constant learning rate of 3e-5 for 30 epochs. Because the descriptions in the Iconclass Caption dataset are on average longer than captions in other captioning datasets, the maximum number of tokens in the input and target sequence is increased from the default value (20) to 100 when fine-tuning the VLP model.
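To make the input layout and the sequence-to-sequence objective more concrete, the following sketch builds the token layout described above and the corresponding attention mask. It is an illustration under stated assumptions rather than the VLP implementation: the special token strings are placeholders, the limit of 100 tokens is assumed to apply to the caption tokens only, and the mask follows the usual definition of a sequence-to-sequence objective in unified Transformers, in which the image side is attended bidirectionally and each caption position sees only the image side and the caption positions up to itself.

```python
NUM_REGIONS = 100       # region features per image (from the text)
MAX_TEXT_TOKENS = 100   # raised from the default of 20 (from the text)

# Placeholder names for the three special tokens; the actual token strings
# of the VLP implementation are not given in the text.
IMG_START, BOUNDARY, TEXT_END = "[IMG]", "[SEP]", "[END]"


def build_layout(caption_tokens, num_regions=NUM_REGIONS, max_text=MAX_TEXT_TOKENS):
    """Return the layout: start token, regions, boundary, caption, end token."""
    text = list(caption_tokens)[:max_text]   # truncate overly long captions
    regions = [f"<region_{i}>" for i in range(num_regions)]
    return [IMG_START] + regions + [BOUNDARY] + text + [TEXT_END]


def seq2seq_mask(num_source, num_target):
    """mask[i][j] is True when position i may attend to position j.

    Source positions (image side) attend to each other in both directions;
    target positions (caption side) attend to the whole source and to the
    target positions up to and including themselves.
    """
    total = num_source + num_target
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_source:
                mask[i][j] = True            # every position sees the source
            elif i >= num_source and j <= i:
                mask[i][j] = True            # causal attention within the target
    return mask


layout = build_layout(["token"] * 150)
mask = seq2seq_mask(num_source=3, num_target=2)
```

With the values from the text, a maximally long caption yields a sequence of 100 region positions, 100 caption positions and three special tokens.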
4.1 Quantitative results
To quantitatively evaluate the generated captions, standard language evaluation metrics for image captioning are computed on the Iconclass Caption test set. These include the four standard BLEU metrics (BLEU-1 to BLEU-4), METEOR, ROUGE and CIDEr. BLEU and METEOR originate from machine translation and ROUGE from text summarization, while CIDEr was specifically developed for image caption evaluation. The BLEU metrics represent n-gram precision scores multiplied by a brevity penalty factor to assess the length correspondence of candidate and reference sentences. ROUGE is a metric that measures the recall of n-grams and therefore rewards long sentences. Specifically, ROUGE-L measures the longest common subsequence of words between a pair of sentences. METEOR represents the harmonic mean of precision and recall of unigram matches between sentences and additionally includes synonym and paraphrase matching. CIDEr measures the cosine similarity between TF-IDF weighted n-grams of the candidate and the reference sentences. The TF-IDF weighting of n-grams reduces the score of frequent n-grams and assigns higher scores to distinctive words. The results obtained using these metrics are presented in Table 1.
Table 1. Columns: Evaluation metric, Iconclass Caption test set.
Although the current results cannot be compared with any other work, because the experiments are performed on a new and syntactically and semantically different dataset, the quantitative evaluation results are included to serve as a benchmark for future work. In comparison with current state-of-the-art caption evaluation results on natural image datasets (e.g. BLEU-4 of 37 for COCO and 30 for Flickr30k) [40, 42], the BLEU scores are lower for the Iconclass dataset. Similar behaviour is also reported in another study addressing iconographic image captioning. On the other hand, the CIDEr score is quite high in comparison to those reported for natural image datasets (e.g. CIDEr of 116 for COCO and 68 for Flickr30k) [40, 42].
However, it remains questionable how adequate these metrics are for assessing the overall quality of the captions in this particular context. All of the reported metrics mostly measure the word overlap between generated and reference captions. They are not designed to capture the semantic meaning of a sentence and therefore often correlate poorly with human judgement. They are also not appropriate for evaluating very short descriptions, which are quite common in the Iconclass Caption dataset. Moreover, they do not address the relation between the generated caption and the image content, but express only the similarity between the original and generated textual descriptions. A generated caption could be semantically aligned with the image content but represent a different version of the original caption and therefore receive very low metric scores. In Figure 2, several such examples from the Iconclass Caption test set are presented.
Those examples indicate that the existing evaluation metrics are not very suitable for assessing the relevance of generated captions for this particular dataset. Therefore, a qualitative analysis of the results is also required in order to better understand the potential contributions and drawbacks of the proposed approach.
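The word-overlap nature of these metrics can be made concrete with a simplified sketch of single-reference, unsmoothed BLEU and of the longest common subsequence underlying ROUGE-L. This is not the evaluation code used for the reported scores, which would normally come from a standard captioning evaluation toolkit; the example captions are invented for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference (clipped counts)."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def brevity_penalty(candidate, reference):
    """Penalize candidates that are shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)


def bleu(candidate, reference, max_n=4):
    """Single-reference, unsmoothed BLEU with uniform n-gram weights."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, reference) * math.exp(log_mean)


def lcs_length(a, b):
    """Length of the longest common subsequence, the quantity behind ROUGE-L."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


reference = "the virgin mary with child".split()
paraphrase = "madonna and infant".split()   # related meaning, no shared words
print(bleu(reference, reference), bleu(paraphrase, reference, max_n=1))
```

In this invented example the paraphrase shares no words with the reference and therefore scores zero even at the unigram level, although both captions could describe the same image. The sketch also shows why very short descriptions are problematic: a caption with fewer than four tokens has no 4-grams, so its unsmoothed BLEU-4 is zero regardless of its content.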
4.2 Qualitative analysis
For the purpose of qualitative analysis, examples of images and generated captions from two datasets are analyzed. One is the test set of the Iconclass Caption dataset, which serves for direct comparison between the generated captions and ground-truth descriptions. The other is a subset of the WikiArt painting collection, which does not include textual descriptions of images but has a broad set of labels associated with each image. This enables the study of the relation between generated captions and other concepts, e.g. the genre categorization of paintings, and gives an insight into how well the model generalizes to a different artwork dataset.
4.2.1 Iconclass Caption test set
To give a better insight into the generated image captions, several examples are shown in Figure 3. The presented image-text pairs are chosen to demonstrate both successful examples (left column) and failed examples (right column) of generated captions.
Analysis of the failed examples indicates an underlying “logic” in those erroneous captions and reveals biases within the dataset. For instance, in the Iconclass Caption training set there are more than a thousand examples that include the phrase “New Testament” in the description. Therefore, images that include structurally similar scenes, particularly from classical history and mythology, are sometimes wrongly described as depicting a scene from the New Testament. This signifies the importance of balanced examples in the training dataset and indicates directions for possible future improvements. The Iconclass dataset is a collection of very diverse images and, apart from the Iconclass classification codes, there are currently no other metadata available for the images. Therefore, it is difficult to perform an in-depth exploratory analysis of the dataset and the generated results with regard to attributes relevant in the context of art history, such as the date of creation, style, genre, etc. For this reason, the fine-tuned image captioning model is applied to a different artwork dataset, a subset of the WikiArt collection of paintings.
4.2.2 WikiArt dataset
In order to explore how the model generalizes to a new artwork dataset, a subset of 52,562 images of paintings from the WikiArt collection (www.wikiart.org) is used. Because images in the WikiArt dataset are annotated with a broad set of labels (e.g. style, genre, artist, technique, date of creation, etc.), the relation between the generated captions and those labels is studied as one method of qualitative assessment. Figure 4 shows the distribution of the most commonly generated descriptions in relation to four different genres. This basic analysis indicates that the generated captions are meaningful in relation to the content and the genre classification of images.
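The genre-wise analysis described above amounts to counting the most frequent generated descriptions within each genre. A minimal sketch is given below; the function name, the record format and the example records are hypothetical and serve only to illustrate the counting step, not to reproduce the reported distributions.

```python
from collections import Counter, defaultdict


def top_captions_by_genre(records, k=3):
    """Return the k most frequent generated captions for every genre.

    records: iterable of (genre, generated_caption) pairs.
    """
    counts = defaultdict(Counter)
    for genre, caption in records:
        counts[genre][caption.strip().lower()] += 1   # normalize before counting
    return {genre: counter.most_common(k) for genre, counter in counts.items()}


# Invented records, for illustration only.
records = [
    ("portrait", "portrait of a man"),
    ("portrait", "Portrait of a man"),
    ("portrait", "woman"),
    ("landscape", "trees"),
    ("landscape", "Trees "),
    ("landscape", "river"),
]
summary = top_captions_by_genre(records, k=2)
print(summary["portrait"][0], summary["landscape"][0])
```

Normalizing case and surrounding whitespace before counting avoids splitting one description into several near-identical entries.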
To understand the contribution of the proposed model in the context of iconographic image captioning, it is interesting to compare the Iconclass captions with captions obtained from models trained on natural images. For this purpose, two models of the same architecture but fine-tuned on the Flickr30k and MS COCO datasets are used. Figure 5 shows several examples from the WikiArt dataset with the corresponding Iconclass, Flickr and COCO captions. It is evident that the other two models generate results that are meaningful in relation to the image content but do not necessarily produce fine-grained and context-aware descriptions.
This paper introduces a novel model for generating iconographic image captions. This is done by utilizing a large-scale dataset of artwork images annotated with concepts from Iconclass, a classification system designed for art and iconography. To the best of our knowledge, this dataset has not yet been widely used in the computer vision community. Within the scope of this work, the available annotations are processed into clean textual descriptions and the existing dataset is transformed into a collection of suitable image-text pairs. The dataset is used to fine-tune a Transformer-based vision-language model. For this purpose, object-classification-aware region features are extracted from the images using the Faster R-CNN model. The base model in our fine-tuning experiment is an existing model, called the VLP model, that is pre-trained on a natural image dataset on intermediate tasks with unsupervised learning objectives. Fine-tuning pre-trained vision-language models represents the current state-of-the-art approach for many different multi-modal tasks.
The captions generated by the fine-tuned model are evaluated using standard image captioning metrics. Unlike other image captioning datasets, which usually contain several short sentences per image, the ground-truth descriptions of the Iconclass dataset vary significantly in length. Because of the specific properties of the Iconclass dataset, standard image captioning evaluation metrics are not very informative regarding the relevance and appropriateness of the generated captions in relation to the image content. Therefore, the quality of the generated captions and the model’s capacity to generalize to new data are further explored by applying the model to another artwork dataset. The overall quantitative and qualitative evaluation of the results suggests that the model can generate meaningful captions that capture not only the depicted objects but also the art historical context and the relations between subjects. However, there is still room for significant improvement. In particular, the unbalanced distribution of themes and topics within the training set results in subjects that are often wrongly identified in the generated image descriptions. Furthermore, the generated textual descriptions are often very short and could serve as labels rather than captions. Nevertheless, the current results show a significant improvement in comparison to captions generated for artwork images using models trained on natural image caption datasets. Further improvement can potentially be achieved by fine-tuning the current model on a smaller dataset with more elaborate ground-truth iconographic captions.
References
[1] Aligning text and document illustrations: towards visually explainable digital humanities. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 1097–1102.
[2] (2020) Visual question answering for cultural heritage. arXiv preprint arXiv:2003.09853.
[3] (2020) Towards a tool for visual link retrieval and knowledge discovery in painting datasets. In Italian Research Conference on Digital Libraries, pp. 105–110.
[4] Fine-tuning convolutional neural networks for fine art classification. Expert Systems with Applications 114, pp. 107–118.
[5] (2019) A deep learning perspective on beauty, sentiment, and remembrance of art. IEEE Access 7, pp. 73694–73710.
[6] (2020) Learning the principles of art history with convolutional neural networks. Pattern Recognition Letters 129, pp. 56–62.
[7] (2019) UNITER: learning universal image-text representations. arXiv preprint arXiv:1909.11740.
[8] (2020) Explaining digital humanities by aligning images and textual descriptions. Pattern Recognition Letters 129, pp. 166–172.
[9] (1983) Iconclass: an iconographic classification system. Art Libraries Journal 8 (2), pp. 32–49.
[10] (2014) In search of art. In European Conference on Computer Vision, pp. 54–70.
[11] (2020) Exploring the representativity of art paintings. IEEE Transactions on Multimedia.
[12] (2014) Meteor Universal: language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380.
[13] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[14] The shape of art history in the eyes of the machine. In 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pp. 2183–2191.
[15] (2018) How to read paintings: semantic art understanding with multi-modal retrieval. In Proceedings of the European Conference on Computer Vision (ECCV).
[16] (2020) A dataset and baselines for visual question answering on art. arXiv preprint arXiv:2008.12520.
[17] (2020) Towards image caption generation for art historical data. AI methods for digital heritage, Workshop at KI2020 43rd German Conference on Artificial Intelligence.
[18] (2017) Subjective ratings of beauty and aesthetics: correlations with statistical image properties in western oil paintings. i-Perception 8 (3), pp. 2041669517715474.
[19] (2019) Linking art through human poses. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1338–1345.
[20] (2017) Visual Genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73.
[21] (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81.
[22] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
[23] (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pp. 13–23.
[24] (2019) Recognizing characters in art history using deep learning. In Proceedings of the 1st Workshop on Structuring and Understanding of Multimedia heritAge Contents, pp. 15–22.
[25] (1972) Studies in iconology: humanistic themes in the art of the Renaissance. New York: Harper and Row.
[26] (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
[27] (2020) Brill Iconclass AI Test Set.
[28] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
[29] (2019) Two-stage deep learning approach to the classification of fine-art paintings. IEEE Access 7, pp. 41770–41781.
[30] (2020) Aesthetical issues of Leonardo da Vinci’s and Pablo Picasso’s paintings with stochastic evaluation. Heritage 3 (2), pp. 283–305.
[31] (2016) Visual link retrieval in a database of paintings. In European Conference on Computer Vision, pp. 753–767.
[32] (2018) Conceptual Captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565.
[33] (2019) Discovering visual patterns in art collections with spatially-consistent feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9278–9287.
[34] (2019) Generating captions for images of ancient artworks. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 2478–2486.
[35] (2019) Artpedia: a new visual-semantic dataset with visual and contextual sentences in the artistic domain. In International Conference on Image Analysis and Processing, pp. 729–740.
[36] (2018) OmniArt: a large-scale artistic benchmark. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14 (4), pp. 1–21.
[37] (2019) LXMERT: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.
[38] (2015) CIDEr: consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575.
[39] (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164.
[40] (2020) XGPT: cross-modal generative pre-training for image captioning. arXiv preprint arXiv:2003.01473.
[41] (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, pp. 67–78.
[42] (2020) Unified vision-language pre-training for image captioning and VQA. In AAAI, pp. 13041–13049.