
Is GPT-3 all you need for Visual Question Answering in Cultural Heritage?

by Pietro Bongini, et al.

The use of Deep Learning and Computer Vision in the Cultural Heritage domain has become highly relevant in the last few years, with many applications such as smart audio guides, interactive museums and augmented reality. All these technologies require large amounts of data to work effectively and be useful for the user. In the context of artworks, such data is annotated by experts in an expensive and time-consuming process. In particular, for each artwork, an image of the artwork and a description sheet have to be collected in order to perform common tasks like Visual Question Answering. In this paper we propose a method for Visual Question Answering that generates at runtime a description sheet that can be used for answering both visual and contextual questions about the artwork, completely avoiding the image and the annotation process. For this purpose, we investigate the use of GPT-3 for generating descriptions of artworks, analyzing the quality of the generated descriptions through captioning metrics. Finally, we evaluate the performance on Visual Question Answering and captioning tasks.


1 Introduction

Cultural Heritage often relies on digital resources to engage and attract visitors. From audio guides to smartphone applications, museum visits are becoming increasingly interactive, allowing users to explore concepts in depth without the need of a human assistant, or after the visit is concluded. Forms of gamification are also important, favoring engagement especially for young visitors and for instructional purposes. Artificial Intelligence and Computer Vision are playing a large part in the development of such smart visits and applications [13, 5, 6, 8]. A notable machine learning application that has recently found usage in cultural heritage is Visual Question Answering (VQA), which exploits both Computer Vision and Natural Language Processing to allow users to ask questions on the content of an image [6]. The advantage of VQA is that it allows museums to develop smart guides and interactive gamification approaches. However, for pictorial art, most questions posed by users concern contextual information rather than what is actually depicted in a painting.

To address this limitation, an evolution of VQA known as Contextual Question Answering (CQA) was proposed [6]. The authors explicitly focused on cultural heritage applications, combining visual and contextual cues to answer questions. The contextual information is derived from textual metadata, which is fed to the model along with the question and the image. In this way the VQA/CQA model has to learn to attend to either relevant parts of an image or relevant sections of the text to provide an adequate answer. The need for textual data nonetheless opens a new issue, namely where to obtain such a description. Information sheets for artworks may already be available to museum curators, yet extending this kind of application to new data is time-consuming and requires a domain expert.

In this paper we explore the usage of a generative natural language processing model to automatically create contextual information to be fed to a CQA model. Generative text models have recently become widespread, with groundbreaking results. Among these we find GPT-3, a generative model trained on a massive corpus of textual data regarding several domains, including art [7]. GPT-3 is capable of generating a description starting from a textual query, and it has been demonstrated that the model includes knowledge of the entities described in the training data, for example paintings and artworks. We therefore investigate the possibilities and the limitations of GPT-3 in applications for cultural heritage, with a specific focus on question answering. In particular, we explore the quality of the textual descriptions of artworks that the model is able to generate and we evaluate their applicability for visual and contextual question answering.

The main contributions of our work are the following:

  • We propose an automatic approach to generate textual information sheets of artworks exploiting GPT-3. We find that the model has excellent knowledge of art concepts and event details of specific paintings.

  • We propose a method to answer both visual and contextual questions which is artwork agnostic, i.e. it does not require any additional data or training to be adapted to a new set of images.

  • We explore the applicability of GPT-3 in cultural heritage applications. To the best of our knowledge we are the first to apply GPT-3 to the art domain.

2 Related Work

Natural Language Processing (NLP) in recent years has evolved at an extremely fast pace, converging to a set of well defined application paradigms [33]. Such paradigms include text classification, matching, machine reading comprehension, sequence to sequence translation, sequence tagging and language modeling. Despite the wide variety of tasks [28, 1, 9], some recent notable approaches have been shown to perform well as generic pre-training for NLP models [11, 7]. In particular, this can be attributed to the introduction of attention models, based on the transformer architecture [36]. The effectiveness of models such as BERT [11] stems from the capability of processing text bidirectionally, exploiting the self-attention mechanism of transformers to obtain word level representations that are informed of their surrounding context within the sentence. Whereas BERT is built exploiting the encoder part of the transformers, another state of the art approach for NLP, Generative Pre-trained Transformer (GPT) [22], is built stacking transformer decoder blocks and is trained to predict the next word in a sentence. The model has then been improved in subsequent versions, GPT-2 [23] and GPT-3 [7], yielding larger and more effective models.

Interestingly, GPT-3 has been trained using a large quantity of internet data, meaning that the training process has distilled common sense knowledge into the model, making it able to generate essays and even poetry [10]. In this paper we exploit GPT-3 as a generator of textual content describing artworks, showing that it can be used for interactive applications for cultural heritage such as captioning [19] and Visual Question Answering (VQA) [2]. VQA is a recent trend in machine learning that bridges the Natural Language Processing and Computer Vision domains [4]. The goal is to answer questions regarding the content of an image through artificial intelligence. This involves several sub-tasks such as object detection [15], recognition [16] and question reasoning [20]. Typical VQA approaches use Convolutional Neural Networks (CNNs) to interpret images and Recurrent Neural Networks (RNNs) to process questions. The authors of [1] proposed a bottom-up attention mechanism looking at salient objects in images. Differently from previous approaches that considered regularly spaced image portions [30], they use Faster R-CNN [25] object features as attention candidates. In the past few years multiple Transformer-based approaches have reached impressive performance on this task [32, 34, 17, 38].

Recently, a few approaches [6, 14, 35, 3] have addressed VQA in the cultural heritage domain. A dataset of questions and answers for art related questions has been recently proposed [3], exploiting an ontology based framework to extract data with question templates. The authors of [6] and [14] found that to make the best out of VQA for museum applications, a model must be able to integrate some source of external knowledge in order to address contextual questions, i.e. questions concerning non-visual cues such as name of the author, year and artistic style. In particular, [6] used a question classifier to understand whether visual or contextual knowledge is required. Depending on the output of the classifier, either a VQA model is used or a purely text-based question answering model is used, discarding the image content. In this work we explore the effectiveness of using GPT-3 to generate artwork captions suitable for such a visual and contextual question answering model.

Other approaches have been used to answer questions relying on captions, yet only regarding visual content [29]. The most similar approach to ours is instead [39], which used GPT-3 for VQA. However, differently from us, the authors directly feed GPT-3 with questions and descriptions generated by an image captioner to obtain an answer. We, instead, aim at extracting from GPT-3 the domain-specific knowledge that is required to correctly answer a question.

3 GPT-3

To give the reader a better understanding of our work, here we present a brief background on GPT-3, the third version of the Generative Pre-Trained Transformer [7]. This is an autoregressive language model with 175 billion parameters that can be used for different tasks without any fine-tuning, achieving strong performance.

The architecture of the GPT-3 Transformer model is made of 96 attention layers. While language models like BERT [11] use the Encoder to generate embeddings from the raw text, which can be used in other machine learning applications, GPT-3 uses the Decoder half, so it takes embeddings as inputs and produces text. In particular, the GPT-3 language model has the ability to generate natural language text that can be hard to distinguish from human-written text, to the point that research has been carried out to assess whether GPT-3 could pass a written Turing test [12].

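Concretely, during inference, the target $y$ of the new task is directly predicted conditioned on the given context $\mathcal{C}$ and the new task's input $x$, as a text sequence generation task. Note that all of $\mathcal{C}$, $x$ and $y$ are text sequences. For example, $y = (y^1, \dots, y^T)$. Therefore, at each decoding step $t$ we have

$$y^t = \arg\max_{y^t} \; p_{\mathrm{LM}}\left(y^t \mid \mathcal{C}, x, y^{<t}\right)$$

where $\mathrm{LM}$ are the weights of the pretrained language model, which are frozen for all new tasks. The context $\mathcal{C} = \{h, x_1, y_1, \dots, x_n, y_n\}$ consists of an optional prompt head $h$ and $n$ in-context examples $(x_i, y_i)$ from the new task.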
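The step-by-step generation described above can be illustrated with a toy sketch. This is not GPT-3: `toy_next_token_probs` is a hypothetical stand-in for the frozen language model, implemented as a hand-written lookup table, and the token lists are invented for illustration. The sketch only shows the mechanics of picking the most probable next token given the context, the task input and the tokens generated so far.

```python
from typing import Callable, Dict, List

# Hypothetical "frozen model": maps the last token seen to a distribution
# over next tokens. A real language model conditions on the whole sequence.
TOY_TABLE: Dict[str, Dict[str, float]] = {
    "painted": {"by": 0.9, "in": 0.1},
    "by": {"Leonardo": 0.7, "Raphael": 0.3},
    "Leonardo": {"<eos>": 0.8, "da": 0.2},
}


def toy_next_token_probs(tokens: List[str]) -> Dict[str, float]:
    """Return next-token probabilities given all tokens seen so far."""
    return TOY_TABLE.get(tokens[-1], {"<eos>": 1.0})


def greedy_decode(
    context: List[str],
    task_input: List[str],
    next_token_probs: Callable[[List[str]], Dict[str, float]],
    max_steps: int = 10,
) -> List[str]:
    """At each step, pick the most probable token given context, input and
    the tokens generated so far. The model itself is never updated."""
    generated: List[str] = []
    for _ in range(max_steps):
        probs = next_token_probs(context + task_input + generated)
        token = max(probs, key=probs.get)
        if token == "<eos>":
            break
        generated.append(token)
    return generated


context = ["Describe", "the", "painting", ":"]       # prompt head (invented)
task_input = ["Mona", "Lisa", "was", "painted"]      # new task input (invented)
print(greedy_decode(context, task_input, toy_next_token_probs))
```

Greedy selection is used here only because it is the simplest decoding strategy; sampling-based decoding would replace the `max` call.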

4 Method

In a Cultural Heritage context, the information useful to answer questions about a specific artwork is contained in the artwork image and in its contextual description. Finding such a description might not be trivial, since it might require a domain expert to write it down. At the same time, it is quite costly to train a Visual Question Answering model that takes as input both the image and the description. This is also not straightforward, since the two modalities need to be blended and matched together. Consequently, the main idea of this work is to generate new descriptions for artworks based on a specific prompt or a specific question and directly use these descriptions to answer visual and contextual questions. The overall pipeline of our proposed work is as follows:

  1. GPT-3 caption generation. We use GPT-3 to generate descriptions of artworks, leveraging its memorization capabilities, which allow the model to retain relevant information about training instances. An important aspect of this phase is to feed the correct prompt to GPT-3 in order to obtain realistic and correct descriptions. We consider two different types of input prompt:

    • General - A general prompt where the expected output is a general description of the artwork. The input text follows the structure:

      "Describe and Contextualize the painting painting_name "

    • Question-based - A specific question based prompt. The input text follows the structure:

      "Painting painting_name question ".

      The expected generated text by GPT-3 is a small text snippet that consists in a couple of sentences, focused on the topic of the question.

  2. Question answering. Once the description has been generated in the previous step, we can exploit it to answer both visual and contextual questions through a Question Answering language model. For this purpose we use a pretrained version of DistilBERT [27] fine-tuned on the SQuAD [24] dataset. We feed the DistilBERT model the text generated in the previous step together with the question. The answer given as output is the final answer of our method.
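The two steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the prompt templates follow the structures quoted in the list, while `answer_question`, `stub_generate` and `stub_extract_answer` are hypothetical names introduced here. The text generator and the question answering model are passed in as plain callables, so the sketch runs without access to GPT-3 or DistilBERT.

```python
from typing import Callable


def general_prompt(painting_name: str) -> str:
    """General prompt: asks for a full description of the artwork."""
    return f"Describe and Contextualize the painting {painting_name}"


def question_prompt(painting_name: str, question: str) -> str:
    """Question-based prompt: painting name followed by the question."""
    return f"Painting {painting_name} {question}"


def answer_question(
    painting_name: str,
    question: str,
    generate: Callable[[str], str],
    extract_answer: Callable[[str, str], str],
    question_based: bool = False,
) -> str:
    """Step 1: generate a description. Step 2: answer from that text only."""
    if question_based:
        prompt = question_prompt(painting_name, question)
    else:
        prompt = general_prompt(painting_name)
    description = generate(prompt)
    return extract_answer(question, description)


# Stubs standing in for the real models (hypothetical, illustration only).
def stub_generate(prompt: str) -> str:
    return "The Mona Lisa was painted by Leonardo da Vinci."


def stub_extract_answer(question: str, context: str) -> str:
    return context.split(" by ")[-1].rstrip(".")


if __name__ == "__main__":
    print(answer_question("Mona Lisa", "Who painted it?",
                          stub_generate, stub_extract_answer,
                          question_based=True))
```

In a real setting, `generate` would wrap a call to a text generation service and `extract_answer` would wrap an extractive question answering model. One plausible choice for the latter is the Hugging Face `pipeline("question-answering", ...)` interface with a DistilBERT checkpoint distilled on SQuAD, but the exact checkpoint used in the paper is not stated here, so that pairing is an assumption.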

Fig. 1 and Fig. 2 show a scheme of the two variants of our method. More precisely, in Fig. 1 the general input prompt for GPT-3 yields the generation of a long description of the artwork (similar to a museum information sheet). On the other hand, the question-based prompt in Fig. 2 yields only the generation of a brief output text, which we find suitable for answering the question. In conclusion, these two schemes follow roughly the same structure. The difference is in the input prompt that in the case of Fig. 1 is more general and in Fig. 2 is more task oriented.

Figure 1: Scheme of our method for answering questions using a general generated description. A prompt with a specific structure is given in input to GPT-3. Subsequently the generated text is fed together with the question to a Question Answering model that outputs the answer.
Figure 2: Scheme of our method for answering questions using a question-based generated description. A prompt containing the name of the painting and the question is given in input to GPT-3. Subsequently the generated text is fed together with the question to a Question Answering model that outputs the answer.

5 Experiments

In this section we first outline the experimental setting for the experiments carried out in this paper, presenting dataset and experimental protocol and we then move on to a discussion of the results.

5.1 Dataset

For our experiments, we use the Artpedia dataset [31]. Artpedia contains a collection of 2,930 artworks, each associated with a variable number of textual descriptions gathered from Wikipedia. Sentences are labelled either as visual descriptions or as contextual descriptions. Contextual descriptions regard information about the artwork that does not directly describe its visual content. For instance, contextual descriptions can describe the historical context of the artwork, its author, the artistic influence or the museum where a painting is exhibited. The dataset contains 28,212 descriptions, 9,173 of which are labelled as visual and the remaining 19,039 as contextual. The Artpedia dataset has been extended with Question-Answer annotations in [6]: a subset of the images has been associated with visual and contextual questions, derived from the corresponding captions. In this work we follow the dataset split of [6].

5.2 Experimental protocol

Following prior work such as [6], we evaluate visual questions and contextual questions with different metrics. In fact, visual question answering and traditional text-based question answering are often treated in two different ways. Visual Question Answering is considered as a classification problem, meaning that a model has to pick an answer from a predefined dictionary of possible candidates containing a few words each. This stems from the fact that questions in most datasets are a way of guiding attention towards specific objects or attributes in the image, without requiring any complex form of language reasoning. Question Answering on the other hand is based on a set of sentences, which may contain rare or out-of-dictionary words. The task is in fact defined as identifying a subset of the textual description that contains the answer.

In light of this, to evaluate visual questions we rely on accuracy:

$$A = \frac{n_c}{n_a}$$

where $n_c$ is the number of correct answers and $n_a$ the number of total answers.

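For text-based question answering, instead, we use both accuracy and F1-measure, a metric that takes into account the global correctness of the answer:

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

where precision is defined as:

$$\mathrm{precision} = \frac{n_{cw}}{n_{g}}$$

where $n_{cw}$ is the number of common words between the output answer and the ground truth answer and $n_{g}$ the number of words in the generated answer.

Recall is instead defined as:

$$\mathrm{recall} = \frac{n_{cw}}{n_{gt}}$$

where $n_{gt}$ is the number of words in the ground truth.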
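Both metrics can be sketched in a few lines of Python. The details below are assumptions, not taken from the paper: answers are lowercased and split on whitespace, accuracy uses exact string match, and common words are counted with multiplicity, as is usual in SQuAD-style evaluation.

```python
from collections import Counter
from typing import List


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    if not references or len(predictions) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


def token_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between a generated answer and the ground truth."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: words shared by both answers, with multiplicity.
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# "Leonardo da Vinci" vs "da Vinci": 2 common words,
# precision = 2/3, recall = 2/2, so F1 = 0.8 (up to rounding).
print(token_f1("Leonardo da Vinci", "da Vinci"))
print(accuracy(["blue", "two"], ["blue", "three"]))
```

The example shows why F1 is used for text-based answers: a prediction that contains the ground truth plus an extra word is counted as wrong by exact-match accuracy but still receives partial credit.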

We also evaluate the quality of the descriptions generated by GPT-3, considering it as a standalone image captioning model. We use the following standard metrics for captioning:

  • BLEU1 [21]: BiLingual Evaluation Understudy (BLEU) is the most commonly used metric for machine translation and image captioning. BLEU scores are based on how similar a generated caption is to a reference caption, computing the precision of the generated words. The downside of BLEU is that it is very sensitive to small changes, such as synonyms or different word order.

  • ROUGE [18]: differently from BLEU, which measures the precision of the caption, Recall Oriented Understudy of Gisting Evaluation (ROUGE) focuses on quantifying the amount of correct words with respect to the reference. Thus, this metric is recall-based and tends to reward long sentences.

  • CIDEr [37]: Consensus-based Image Description Evaluation (CIDEr) is an automatic consensus metric that measures the similarity of captions against a set of ground truth sentences written by humans. This metric has been shown to yield a higher agreement with human-generated text since it captures notions of grammar, importance, precision and recall.

  • Cosine Similarity: we compute the cosine similarity between feature vectors for the generated caption and the reference caption. Features are extracted with the TF-IDF algorithm.
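As an illustration of the last metric in the list, the snippet below computes a TF-IDF cosine similarity between a generated and a reference caption in plain Python. The exact TF-IDF variant is not specified in the text, so the choices here are assumptions: whitespace tokenization, raw term counts, and a smoothed inverse document frequency computed over the captions being compared. The example captions are invented.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector per document."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption; other variants are equally common).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


generated = "a portrait of a woman with a faint smile"
reference = "the painting shows a woman with a smile"
gen_vec, ref_vec = tfidf_vectors([generated, reference])
print(cosine(gen_vec, ref_vec))
```

Unlike BLEU1 and ROUGE, this score weights each shared word by how informative it is, so frequent function words contribute less than rare content words.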


5.3 Experimental Results

Description type   Metric   OFA [38]   Ours General   Ours Question-based
Visual             BLEU1    0.048      0.181          0.137
Visual             ROUGE    0.138      0.188          0.16
Visual             CIDEr    0.091      0.079          0.172
Visual             COSINE   0.113      0.157          0.110
Contextual         BLEU1    0.002      0.168          0.160
Contextual         ROUGE    0.062      0.178          0.179
Contextual         CIDEr    0.000      0.248          0.129
Contextual         COSINE   0.082      0.218          0.324
All                BLEU1    0.000      0.113          0.185
All                ROUGE    0.053      0.158          0.184
All                CIDEr    0.000      0.016          0.098
All                COSINE   0.122      0.253          0.341
Table 1: Image captioning results. We compare our method, which generates captions with GPT-3, using the General and the Question-based approaches. In the Question-based approach we concatenate all the outputs of GPT-3 after conditioning it with different questions related to the image. We compare the results against visual captions, contextual captions or both.

5.3.1 Captioning results

We start by assessing the quality of the captions generated by GPT-3. First of all, we ask GPT-3 to generate captions with our General approach. In Tab. 1 we compare the captions using as reference visual captions, contextual captions and both. All reference captions are ground truth captions taken from the Artpedia dataset [31].

Interestingly, the model appears to obtain better results for visual captions using the BLEU1 and ROUGE metrics, while using CIDEr and cosine similarity the model obtains higher results for contextual captions. This may seem counter-intuitive but can be explained by looking at the nature of the metrics. BLEU1 and ROUGE in fact respectively check for word-wise precision and recall, while CIDEr and cosine similarity perform a sentence level scoring, which is closer to human consensus. We observe that the model is able to obtain good results, especially with the cosine metric, even when using all the captions as reference.

We then evaluate the method by taking a concatenation of the outputs generated by GPT-3 after being conditioned by different questions related to the image. This obviously introduces a strong bias, given also the fact that questions have been generated from information contained in the captions, but at the same time proves the usefulness of such captions for more advanced applications such as visual question answering. As can be seen in Tab. 1, conditioning GPT-3 with the questions leads to better captions according to all metrics when both visual and contextual captions are used as reference.

In Tab. 1 we also provide a baseline as reference, i.e. the output of the state of the art OFA captioning model [38]. We observe that captions generated by OFA do not align well with the ground truth sentences. We attribute this to a domain shift between the datasets commonly used to train captioning models and descriptions of artworks. In fact, the former are sentences written by non-experts, while applications in cultural heritage require domain knowledge. This further motivates the usage of GPT-3, which seems to have integrated sufficient knowledge to articulate complex sentences with domain-specific jargon.

Model                   Question type   Accuracy   F1 score
VQA-CH [6]              Contextual      0.684      0.832
VQA-CH [6]              Visual          0.176      0.150
VQA-CH [6]              Both            0.504      0.417
Ours - General          Contextual      0.557      0.719
Ours - General          Visual          0.070      0.055
Ours - General          Both            0.239      0.360
Ours - Question-based   Contextual      0.473      0.602
Ours - Question-based   Visual          0.134      0.202
Ours - Question-based   Both            0.256      0.330
Table 2: Experimental results for Visual Question Answering. We compare our approach against VQA-CH [6] to understand whether GPT-3 can replace information sheets for artworks either for visual or contextual questions. We compare two versions of our model, the General version, which produces generic descriptions of artworks, and the Question-based version, where prompts are conditioned with the input question to generate more specific descriptions.

5.3.2 VQA results

To evaluate the Visual Question Answering capabilities of our proposed method, we follow the setting of [6]. However, we do not rely on any vision-based model but rather on a fully textual question answering model based on DistilBERT [27], as explained in Sec. 4. In Tab. 2, we compare our approach to that of VQA-CH [6]. It has to be noted that, contrary to [6], we do not rely on real textual descriptions, which are known to contain the answer, but we only extract information from GPT-3. This is a strong disadvantage for our method. However, we are not interested in obtaining better results than VQA-CH; rather, our goal is to assess whether GPT-3 can act as a substitute for textual descriptions handcrafted by domain experts.

We test our method evaluating the accuracy for visual questions, contextual questions and both together. Quantitative results indicate that captions generated by GPT-3 can yield high results for contextual questions, yet very low accuracy for visual questions. As for the captioning setting, we attribute this behavior to the fact that GPT-3 generates generic descriptions, without including a fine-grained description of the visual content. Thus, on the one hand the question answering model is capable of extracting meaningful information from the generated captions. This means that GPT-3 is indeed capable of integrating domain knowledge during training and is capable of generating a complete information sheet of the artwork. On the other hand, captions appear to be too generic to obtain information about specific details in the image.

To overcome this limitation, we test the model using captions generated with our Question-based approach. By feeding the question to GPT-3 along with the title of the artwork, the model is able to generate more specific captions. Such captions, as explained in Sec. 6, are usually shorter but are focused on the prompt. This is particularly interesting since it means that a purely text-based model is capable of addressing a vision-based task. In Tab. 2 it can be seen that for visual questions alone, our method with question-based captions performs on par with or better than the vision-based VQA-CH model.

Figure 3: Qualitative results of our method. Green: ground truth description from the Artpedia dataset [31] and input question. Yellow: general descriptions provided by GPT-3 and answer obtained based on such text. Blue: Question-based description and corresponding answer. General descriptions are longer and more detailed than question-based generated descriptions. However, question-based generated descriptions are customized for the specific question.

6 Qualitative Analysis

In this section we provide a qualitative analysis of the captions generated by GPT-3 in order to characterize which kind of information they contain in both the General and Question-based formulation.

Since the prompts that we feed to GPT-3 are different, with one being more general and the other being question-based, we expect the corresponding text generated by GPT-3 to be different. In Fig. 3 we can observe these differences. Generated general descriptions are very long and resemble artwork information sheets, in which we can find some visual and contextual information. Question-based generated descriptions are instead shorter and contain the knowledge needed to answer the specific questions. From Fig. 3 we can observe that the general description is very useful to answer contextual questions but fails on some visual questions. This is likely due to different reasons:

  • The generated text does not take into account any specific question and this can lead to the generation of a description without the specific information needed to answer the question.

  • Visual questions are very specific since they refer to object relationships, colors, counting, etc. and the GPT-3 model tends to be more shallow in generating its descriptions.

On the other hand, question-based generated descriptions are helpful to answer visual questions, but such short descriptions could contain incorrect information, leading to wrong answer predictions. In conclusion, these two ways of generating text to answer visual and contextual questions have some pros and cons:

  • General descriptions are longer and contain several pieces of information about the artwork. However, the description is fixed and might not contain the information needed to answer some questions.

  • Question-based descriptions are generated for specific questions and contain only the information needed to answer the question on which GPT-3 has been conditioned. If the model has not memorized any specific information regarding such questions, the description may contain mistakes. Moreover, descriptions have to be re-computed for each question.

7 Considerations on complexity and accessibility of GPT-3

In the previous sections we have demonstrated that GPT-3 could indeed replace the usage of an information sheet handcrafted by a domain expert. However, we need to understand the actual applicability of GPT-3 in a real application. GPT-3 has 175B parameters, which amounts to approximately 700 GB. This means that inference on a single GPU is unfeasible due to current technological limits. The model, however, has been made available by OpenAI and is accessible through an API that charges a fee per generated token. These considerations somewhat limit a large-scale usage of the model, especially if a description has to be generated for each question to be answered. On the other hand, generating fixed descriptions offline, one for each artwork, appears to be a viable solution, at least for addressing contextual questions.
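The figures in this section can be checked with a back-of-the-envelope calculation. The 700 GB estimate is consistent with storing each parameter as a 32-bit float (4 bytes), which is an assumption made here. The cost function is purely illustrative: the description length and the price per 1,000 tokens are hypothetical placeholders, not actual API prices.

```python
def model_size_gb(n_params: int, bytes_per_param: int = 4) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def generation_cost(
    n_descriptions: int,
    tokens_per_description: int,
    price_per_1k_tokens: float,
) -> float:
    """Total cost of generating descriptions under a per-token pricing."""
    total_tokens = n_descriptions * tokens_per_description
    return total_tokens / 1000 * price_per_1k_tokens


# 175B parameters at 4 bytes each (assumed 32-bit floats).
print(model_size_gb(175_000_000_000))

# One fixed description per Artpedia artwork, generated offline.
# 300 tokens per description and 0.02 per 1k tokens are made-up values.
print(generation_cost(2_930, 300, 0.02))
```

Under a per-token fee the offline setting scales with the number of artworks, while the question-based setting scales with the number of questions asked, which is the trade-off discussed above.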

8 Conclusions

In this paper we presented a method for Visual Question Answering in the Cultural Heritage domain. In particular, we have addressed the problem of data annotation for artworks, generating descriptions with GPT-3. The performance on the VQA task shows that the generated descriptions are useful to answer the questions correctly. This technique makes it possible to answer visual and contextual questions focusing only on the generated description, and can be used for any artwork. In fact, there is no need to retrain the model to incorporate new knowledge. This is possible thanks to the memorization capabilities of GPT-3, which at training time has observed millions of tokens regarding domain-specific knowledge. Finally, the generated description can be integrated as textual input (textual description) in a more complex architecture such as [6] in order to address tasks like Visual Question Answering. This is of particular interest for Cultural Heritage due to the domain shift between common VQA and captioning datasets and the technical jargon that is needed to properly address questions about art.


  • [1] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6077–6086 (2018)
  • [2] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: Vqa: Visual question answering. In: Proceedings of the IEEE international conference on computer vision. pp. 2425–2433 (2015)
  • [3] Asprino, L., Bulla, L., Marinucci, L., Mongiovì, M., Presutti, V.: A large visual question answering dataset for cultural heritage. In: International Conference on Machine Learning, Optimization, and Data Science. pp. 193–197. Springer (2021)
  • [4] Barra, S., Bisogni, C., De Marsico, M., Ricciardi, S.: Visual question answering: Which investigated applications? Pattern Recognition Letters 151, 325–331 (2021)
  • [5] Becattini, F., Ferracani, A., Landucci, L., Pezzatini, D., Uricchio, T., Del Bimbo, A.: Imaging novecento. a mobile app for automatic recognition of artworks and transfer of artistic styles. In: Euro-Mediterranean Conference. pp. 781–791. Springer (2016)
  • [6] Bongini, P., Becattini, F., Bagdanov, A.D., Del Bimbo, A.: Visual question answering for cultural heritage. In: IOP Conference Series: Materials Science and Engineering. vol. 949, p. 012074. IOP Publishing (2020)
  • [7] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners (2020)
  • [8] Cetinic, E., She, J.: Understanding and creating art with ai: Review and outlook. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 18(2), 1–22 (2022)
  • [9] Cornia, M., Stefanini, M., Baraldi, L., Cucchiara, R.: Meshed-memory transformer for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10578–10587 (2020)
  • [10] Dale, R.: Gpt-3: What’s it good for? Natural Language Engineering 27(1), 113–118 (2021)
  • [11] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • [12] Elkins, K., Chun, J.: Can gpt-3 pass a writer’s turing test? Journal of Cultural Analytics 5(2), 17212 (2020)
  • [13] Fiorucci, M., Khoroshiltseva, M., Pontil, M., Traviglia, A., Del Bue, A., James, S.: Machine learning for cultural heritage: A survey. Pattern Recognition Letters 133, 102–108 (2020)
  • [14] Garcia, N., Ye, C., Liu, Z., Hu, Q., Otani, M., Chu, C., Nakashima, Y., Mitamura, T.: A dataset and baselines for visual question answering on art. In: European Conference on Computer Vision. pp. 92–108. Springer (2020)
  • [15] Han, J., Zhang, D., Cheng, G., Liu, N., Xu, D.: Advanced deep-learning techniques for salient and category-specific object detection: a survey. IEEE Signal Processing Magazine 35(1), 84–100 (2018)
  • [16] Kheradpisheh, S.R., Ganjtabesh, M., Thorpe, S.J., Masquelier, T.: Stdp-based spiking deep convolutional neural networks for object recognition. Neural Networks 99, 56–67 (2018)
  • [17] Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., et al.: Oscar: Object-semantics aligned pre-training for vision-language tasks. In: European Conference on Computer Vision. pp. 121–137. Springer (2020)
  • [18] Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out. pp. 74–81 (2004)
  • [19] Liu, S., Zhu, Z., Ye, N., Guadarrama, S., Murphy, K.: Improved image captioning via policy gradient optimization of spider. In: Proceedings of the IEEE international conference on computer vision. pp. 873–881 (2017)
  • [20] Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: Advances In Neural Information Processing Systems. pp. 289–297 (2016)
  • [21] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
  • [22] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)
  • [23] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
  • [24] Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016)
  • [25] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. pp. 91–99 (2015)
  • [26] Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information processing & management 24(5), 513–523 (1988)
  • [27] Sanh, V., Debut, L., Chaumond, J., Wolf, T.: Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)
  • [28] Seidenari, L., Galteri, L., Bongini, P., Bertini, M., Del Bimbo, A.: Language based image quality assessment. In: ACM Multimedia Asia, pp. 1–7 (2021)
  • [29] Sheng, S., Laenen, K., Moens, M.F.: Can image captioning help passage retrieval in multimodal question answering? In: European Conference on Information Retrieval. pp. 94–101. Springer (2019)
  • [30] Shih, K.J., Singh, S., Hoiem, D.: Where to look: Focus regions for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4613–4621 (2016)
  • [31] Stefanini, M., Cornia, M., Baraldi, L., Corsini, M., Cucchiara, R.: Artpedia: A new visual-semantic dataset with visual and contextual sentences in the artistic domain. In: International Conference on Image Analysis and Processing. pp. 729–740. Springer (2019)
  • [32] Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., Dai, J.: Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019)
  • [33] Sun, T.X., Liu, X.Y., Qiu, X.P., Huang, X.J.: Paradigm shift in natural language processing. Machine Intelligence Research 19(3), 169–183 (2022)
  • [34] Tan, H., Bansal, M.: Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 (2019)
  • [35] Vannoni, F., Bongini, P., Becattini, F., Bagdanov, A.D., Del Bimbo, A.: Data collection for contextual and visual question answering in the cultural heritage domain (2020)
  • [36] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems. pp. 5998–6008 (2017)
  • [37] Vedantam, R., Lawrence Zitnick, C., Parikh, D.: Cider: Consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4566–4575 (2015)
  • [38] Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., Yang, H.: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052 (2022)
  • [39] Yang, Z., Gan, Z., Wang, J., Hu, X., Lu, Y., Liu, Z., Wang, L.: An empirical study of gpt-3 for few-shot knowledge-based vqa. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 3081–3089 (2022)