Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts

by Julia Kruk et al.
SRI International

Computing author intent from multimodal data like Instagram posts requires modeling a complex relationship between text and image. For example, a caption might reflect ironically on the image, so neither the caption nor the image is a mere transcript of the other. Instead they combine, via what has been called meaning multiplication, to create a new meaning that has a more complex relation to the literal meanings of text and image. Here we introduce a multimodal dataset of 1299 Instagram posts labeled for three orthogonal taxonomies: the authorial intent behind the image-caption pair, the contextual relationship between the literal meanings of the image and caption, and the semiotic relationship between the signified meanings of the image and caption. We build a baseline deep multimodal classifier to validate the taxonomy, showing that employing both text and image improves intent detection by 8% compared to using only the image modality, demonstrating the commonality of non-intersective meaning multiplication. Our dataset offers an important resource for the study of the rich meanings that result from pairing text and image.






1 Introduction

Multimodal social platforms such as Instagram let content creators combine visual and textual modalities. The resulting widespread use of text+image makes interpreting author intent in multimodal messages an important task for NLP.


Figure 1: Image-Caption meaning multiplication: A change in the caption of the image completely changes the overall meaning of the image-caption pair.

There are many recent studies on images accompanied by basic text labels or captions chen2015microsoft; faghri2018vsepp. But prior work on image–text pairs has generally been asymmetric, regarding either image or text as the primary content and viewing the other as a mere complement. Scholars from semiotics as well as machine learning and computer vision have pointed out that this is insufficient; often text and image are not combined by a simple addition or intersection of the component meanings bateman2014text; marsh2003taxonomy; zhang2018equal.

Rather, determining author intent with textual+visual content requires a richer kind of meaning composition that has been called meaning multiplication bateman2014text: the creation of new meaning through integrating image and text. Meaning multiplication includes the simpler intersective or concatenative kinds of overlap (a picture of a tennis court with the text “tennis court”, or a picture of a dog with the label “Rufus”). But it also includes more sophisticated kinds of composition, such as irony or indirection, as shown in Figure 1, where the integration requires inference that creates a new meaning. In Figure 1, a picture of a young woman smoking is given two different hypothetical captions that result in different composed meanings. Caption I uses the picture to highlight relaxation through smoking, while Caption II uses the tension between her looks and her actions to highlight the dangers of smoking.

To better understand author intent given such meaning multiplication, we create three novel taxonomies related to the relationship between text and image and their combination/multiplication in Instagram posts, designed by modifying existing taxonomies bateman2014text; marsh2003taxonomy from semiotics, rhetoric, and media studies. Our taxonomies measure the authorial intent behind the image-caption pair and two kinds of text-image relations: the contextual relationship between the literal meanings of the image and caption, and the semiotic relationship between the signified meanings of the image and caption. We then introduce a new dataset, MDID (Multimodal Document Intent Dataset), with Instagram posts covering a variety of topics, annotated with labels from our three taxonomies.

Finally, we build a deep neural network based model for automatically annotating Instagram posts with the labels from each taxonomy, and show that combining text and image leads to better classification, especially when the caption and the image diverge.

2 Prior Work

A wide variety of work in multiple fields has explored the relationship between text and image and extracting meaning, although often assigning a subordinate role to either text or images, rather than the symmetric relationship in media such as Instagram posts. The earliest work in the Barthesian tradition focuses on advertisements, in which the text serves as merely another connotative aspect to be incorporated into a larger connotative meaning heath1977image. marsh2003taxonomy offer a taxonomy of the relationship between image and text by considering image/illustration pairs found in textbooks or manuals. We draw on their taxonomy, although as we will see the connotational aspects of Instagram posts require some additions.

For our model of speaker intent, we draw on the classic concept of illocutionary acts austin1962things to develop a new taxonomy of illocutionary acts focused on the kinds of intentions that tend to occur on social media. For example, we rarely see commissive posts on Instagram and Facebook because of the focus on information sharing and constructions of self-image.

Computational approaches to multi-modal document understanding have focused on key problems such as image captioning chen2015microsoft; faghri2018vsepp or visual question answering  goyal2017making; zellers2018recognition, which assume that the text is a subordinate modality, or on extracting the literal or connotative meaning of a post soleymani2017survey, or its perlocutionary force (how it is perceived by its audience), including aspects such as memorability khosla2015understanding, saliency bylinskii2018different, popularity khosla2014makes and virality deza2015understanding; alameda2017viraliency.

Some prior work has focused on intention. joo2014visual and huang2016inferring study prediction of intent behind politician portraits in the news. hussain2017automatic studies the understanding of image and video advertisements, predicting topic, sentiment, and intent. alikhani introduce a corpus of the coherence relationships between recipe text and images. Our work builds on siddiquieICMI2015, who focused on a single type of intent (detecting politically persuasive video on the internet), and even more closely on zhang2018equal, who study visual rhetoric as interaction between the image and the text slogan in advertisements. They categorize image-text relationships into parallel equivalent (image and text deliver the same point at equal strength), parallel non-equivalent (image and text deliver the same point at different levels), and non-parallel (text or image alone is insufficient to deliver the point). They also identify the novel issue of understanding the complex, non-literal ways in which text and image interact. Our work thus takes the next step by exploring the kind of “non-additive” models of image-text relationships that we see in Figure 1.

3 Taxonomies

We propose three taxonomies, two (contextual and semiotic) to capture different aspects of the relationship between the image and the caption, and one to capture speaker intent.

3.1 The Contextual Taxonomy

The contextual relationship taxonomy captures the relationship between the literal meanings of the image and text. We draw on the three top-level categories of the marsh2003taxonomy taxonomy, which distinguished images that are minimally related to the text, highly related to the text, and related but going beyond it. These three classes, reflecting Marsh et al.’s primary interest in illustration, frame the image only as subordinate to the text. We slightly generalize these three top-level categories to make them symmetric for the Instagram domain:

Minimal Relationship:

The literal meanings of the caption and image overlap very little. For example, a selfie of a person at a waterfall with the caption “selfie”.

Close Relationship:

The literal meanings of the caption and the image overlap considerably. For example, a selfie of a person at a crowded waterfall, with the caption “Selfie at Hemlock falls on a crowded sunny day”.

Transcendent Relationship:

The literal meaning of one modality picks up and expands on the literal meaning of the other. For example, a selfie of a person at a crowded waterfall with the caption “Selfie at Hemlock Falls on a sunny and crowded day. Hemlock falls is a popular picnic spot. There are hiking and biking trails, and a great restaurant 3 miles down the road …”.

3.2 The Semiotic Taxonomy

The contextual taxonomy described above does not deal with the more complex forms of “meaning multiplication” illustrated in Figure 1. For example, an image of three frolicking puppies with the caption “My happy family” sends a message of pride in one’s pets that is not directly reflected in either modality taken by itself. First, it forces the reader to step back and consider what is being signified by the image and the caption, in effect offering a meta-comment on the text-image relation. Second, there is a tension between what is signified (a family and a litter of young animals, respectively) that results in a richer idiomatic meaning.

Our second taxonomy therefore captures the relationship between what is signified by the respective modalities, their semiotics. We draw on the earlier 3-way distinction of Kloepfer1977 as modeled by bateman2014text and the two-way (parallel vs. non-parallel) distinction of zhang2018equal to classify the semiotic relationship of image/text pairs as divergent, parallel and additive. A divergent relationship occurs when the image and text semiotics pull in opposite directions, creating a gap between the meanings suggested by the image and text. A parallel relationship occurs when the image and text independently contribute to the same meaning. An additive relationship occurs when the image and text semiotics amplify or modify each other.

The semiotic classification is not always homologous to the contextual. For example, an image of a mother feeding her baby with a caption “My new small business needs a lot of tender loving care”, would have a minimal contextual relationship. Yet because both signify loving care and the image intensifies the caption’s sentiment, the semiotic relationship is additive. Or a lavish formal farewell scene at an airport with the caption “Parting is such sweet sorrow”, has a close contextual relationship because of the high overlap in literal meaning, but the semiotics would be additive, not parallel, since the image shows only the leave-taking, while the caption suggests love (or ironic lack thereof) for the person leaving.


Figure 2: The top three images exemplify the semiotic categories. Images I-VI show instances of divergent semiotic relationships.

Figure 2 further illustrates the proposed semiotic classification. The first three image-caption pairs (ICP’s) exemplify the three semiotic relationships. To give further insights into the rich complexity of the divergent category, the six ICP’s below showcase the kinds of divergent relationships we observed most frequently on Instagram.

ICP I exploits the tension between the reference to retirement expressed in the caption and the youth projected by the two young women in the image to convey irony and thus humor in what is perhaps a birthday greeting or announcement. Many ironic and humorous posts exhibit divergent semiotic relationships. ICP II has the structure of a classic Instagram meme, where the focus is on the image, and the caption is completely unrelated to the image. This is also exhibited in the divergent “Good Morning” caption in the top row. ICP III is an example of a divergent semiotic relationship within an exhibitionist post. A popular communicative practice on Instagram is to combine selfies with a caption that is some sort of inside joke. The inside joke in ICP III is a lyric from a song a group of friends found funny and discussed the night this photo was taken. ICP IV is an aesthetic photo of a young woman, paired with a caption that has no semantic elements in common with the photo. The caption may be a prose excerpt, the author’s reflection on what the image made them think or feel, or perhaps just a pairing of pleasant visual stimulus with pleasant literary material. This divergent relationship is often found in photography, artistic and other entertainment posts. ICP V uses one of the most common divergent relationships, in which exhibitionist visual material is paired with reflections or motivational captions. ICP V is thus similar to ICP III, but without the inside jokes/hidden meanings common to ICP III. ICP VI is an exhibitionist post that seems to be common recently among public figures on Instagram. The image appears to be a classic selfie or often a professionally taken image of the individual, but the caption refers to that person’s opinions or agenda(s). This relationship is divergent—there are no common semantic elements in the image and caption—but the pair paints a picture of the individual’s current state or future plans.

3.3 Intent Taxonomy

We developed a set of eight illocutionary intents informed by our examination of a large body of representative Instagram content, and by previous studies of intent in Instagram posts. For example, drawing on Goffman’s idea of the presentation of self goffman59, Mahoney16, in their study of Scottish political Instagram posts, define acts like Presentation of Self, which, following hogan10, we refer to as exhibition, or Personal Political Expression, which we generalize to advocative.

  1. advocative: advocate for a figure, idea, movement, etc.

  2. promotive: promote events, products, organizations etc.

  3. exhibitionist: create a self-image for the user using selfies, pictures of belongings etc.

  4. expressive: express emotion, attachment, or admiration at an external entity or group.

  5. informative: relay information regarding a subject or event using factual language.

  6. entertainment: entertain using art, humor, memes etc.

  7. provocative/discrimination: directly attack an individual or group.

  8. provocative/controversial: be shocking.

Table 1: Counts of different labels in the Multimodal Document Intent Dataset (MDID) across the intent, contextual, and semiotic taxonomies; the distribution is highly skewed.

4 The MDID Dataset

Our dataset, MDID (the Multimodal Document Intent Dataset), consists of Instagram posts that we collected with the goal of developing a rich and diverse set of posts for each of the eight illocutionary types in our intent taxonomy. For each intent we collected hashtags or users likely to yield a high proportion of posts that could be labeled under that heading, with the goal of populating each category with a rich and diverse set.

For the advocative intent, we selected mostly hashtags advocating for a political or social ideology, such as #pride and #maga. For the promotive intent we relied on the #ad tag that Instagram has recently begun requiring for sponsored posts, along with tags relating to events rather than products. For the exhibitionist intent we used #selfie and #ootd (outfit of the day); any tag that focused on the self as the most important aspect of the post would usually yield exhibitionist data. The expressive posts were retrieved via tags that actively expressed something, such as #lovehim or #merrychristmas. Informative posts were taken from informative accounts such as news websites. Entertainment posts drew on an eclectic group of tags such as #meme, #earthporn, and #fatalframes. Finally, provocative posts were extracted via tags that either expressed a provocative message or that would draw people into being influenced or provoked by the post (#redpill, #antifa, #eattherich, #snowflake).

Data Labeling:

Data was pre-processed (for example, to convert all albums to single image-caption pairs). We developed a simple annotation toolkit that displayed an image–caption pair and asked the user to confirm whether the data was acceptable and, if so, to identify the post’s intent (advocative, promotive, exhibitionist, expressive, informative, entertainment, provocative), contextual relationship (minimal, close, transcendent), and semiotic relationship (divergent, parallel, additive). Every image was labeled by at least two independent human annotators. We retained only those images on which all annotators agreed. Dataset statistics are shown in Table 1.
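The unanimity filter described above can be sketched as follows; the function name and data layout are illustrative, not taken from the released annotation toolkit.

```python
def retain_unanimous(annotations):
    """Keep only posts whose annotators (at least two per post, as in
    MDID) all chose the same label for a given taxonomy."""
    kept = {}
    for post_id, labels in annotations.items():
        if len(labels) >= 2 and len(set(labels)) == 1:
            kept[post_id] = labels[0]
    return kept

# Post "a" has unanimous labels and survives; post "b" does not.
labels = {"a": ["promotive", "promotive"], "b": ["promotive", "expressive"]}
print(retain_unanimous(labels))  # {'a': 'promotive'}
```

Requiring full agreement trades dataset size for label quality, which matters given how subjective the semiotic judgments can be.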

5 Computational Model

We train and test a deep convolutional neural network (DCNN) model on the dataset, both to offer a baseline model for users of the dataset, and to further explore our hypothesis about meaning multiplication.

Our model can take as input image (Img), text (Txt), or both (Img + Txt), and consists of modality-specific encoders, a fusion layer, and a class prediction layer. We use the ResNet-18 network pre-trained on ImageNet as the image encoder he2016deep. For encoding captions, we use a standard pipeline that employs an RNN model on word embeddings. We experiment with both word2vec-type (word token-based) embeddings trained from scratch mikolov2013distributed and pre-trained character-based contextual embeddings (ELMo) peters2018deep. For our purpose ELMo character embeddings are more useful, since they increase robustness to the noisy and often misspelled Instagram captions. For the combined model, we implement a simple fusion strategy that first linearly projects the encoded vectors from both modalities into the same embedding space and then adds the two vectors. Although naive, this strategy has been shown to be effective at tasks such as Visual Question Answering nguyen2018improved and image-caption matching ahuja2018understanding. We then use the fused vector to predict class-wise scores using a fully connected layer.
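As a rough sketch of this fusion strategy (linear projection into a shared space, addition, then a fully connected classification layer), consider the NumPy version below. Apart from ResNet-18's 512-d pooled feature, the dimensionalities are assumptions, and the weights are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# ResNet-18 pools to a 512-d feature; the text-encoder size, shared
# embedding size, and class count below are illustrative choices.
IMG_DIM, TXT_DIM, JOINT_DIM, N_CLASSES = 512, 256, 128, 8

W_img = rng.standard_normal((JOINT_DIM, IMG_DIM)) * 0.01   # image projection
W_txt = rng.standard_normal((JOINT_DIM, TXT_DIM)) * 0.01   # text projection
W_cls = rng.standard_normal((N_CLASSES, JOINT_DIM)) * 0.01 # classifier

def fuse_and_classify(img_feat=None, txt_feat=None):
    """Project each available modality into the shared space, add the
    projections, and produce class-wise scores (pre-softmax logits).
    With a single modality, only that projection contributes."""
    fused = np.zeros(JOINT_DIM)
    if img_feat is not None:
        fused += W_img @ img_feat
    if txt_feat is not None:
        fused += W_txt @ txt_feat
    return W_cls @ fused

scores = fuse_and_classify(rng.standard_normal(IMG_DIM),
                           rng.standard_normal(TXT_DIM))
print(scores.shape)  # (8,)
```

Because addition in the shared space is symmetric, the same fusion layer handles image-only, text-only, and combined inputs without architectural changes.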

6 Experiments

[Table 2 rows: Chance, Img, Txt-emb, Txt-ELMo, Img + Txt-emb, Img + Txt-ELMo; columns: Intent, Semiotic, Contextual. Numeric entries not recoverable.]

Table 2: Results with different DCNN models: image-only (Img), text-only (Txt-emb and Txt-ELMo), and combined models (Img + Txt-emb and Img + Txt-ELMo). Here emb denotes the model using standard word (token)-based embeddings, while ELMo uses pre-trained ELMo word embeddings peters2018deep.

We evaluate our models on predicting intent, semiotic relationships, and contextual relationships from Instagram posts, using image only, text only, and both modalities.

6.1 Dataset, Evaluation and Implementation

We use the 1299-sample MDID dataset (section 4). We only use corresponding image and text information for each post and do not use other meta-data to preserve the focus on image-caption joint meaning. We perform basic pre-processing on the captions such as removing stopwords and non-alphanumeric characters. We do not perform any pre-processing for images.
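A minimal version of this caption pre-processing might look like the following; the stopword list is a tiny illustrative stand-in, since the paper does not specify which list was used.

```python
import re

# Illustrative stopword list; the actual list used is not specified.
STOPWORDS = {"the", "a", "an", "is", "at", "of", "and", "on"}

def preprocess_caption(caption):
    """Lowercase, replace non-alphanumeric characters with spaces,
    and drop stopwords, returning the remaining tokens."""
    caption = re.sub(r"[^0-9a-zA-Z\s]", " ", caption.lower())
    return [tok for tok in caption.split() if tok not in STOPWORDS]

print(preprocess_caption("Selfie at Hemlock Falls!! #blessed"))
# ['selfie', 'hemlock', 'falls', 'blessed']
```

Note that stripping non-alphanumeric characters also flattens hashtags and emoji into plain tokens, which is one reason character-aware embeddings like ELMo help downstream.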

Due to the small dataset size, we perform 5-fold cross-validation for our experiments, reporting average performance across all splits. We report classification accuracy (ACC) and area under the ROC curve (AUC), since AUC is more robust to class skew, using a macro-average across all classes jeni2013facing; stager2006dealing.
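Macro-averaged AUC, used here because it is robust to the skewed class distribution, can be computed with a one-vs-rest reduction as sketched below in plain NumPy; this is an illustration of the metric, not the authors' evaluation code.

```python
import numpy as np

def binary_auc(scores, labels):
    """AUC = P(score of a random positive > score of a random negative),
    with ties counted as one half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def macro_auc(score_matrix, y):
    """One-vs-rest AUC per class, then an unweighted (macro) average,
    so rare classes count as much as frequent ones."""
    n_classes = score_matrix.shape[1]
    aucs = [binary_auc(score_matrix[:, c], (y == c).astype(int))
            for c in range(n_classes)]
    return float(np.mean(aucs))
```

Unlike accuracy, this macro average cannot be inflated by predicting only the majority class, which matters given the skew shown in Table 1.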

We use a pre-trained ResNet-18 model as the image encoder. For word token-based embeddings we use vectors trained from scratch; for ELMo we use a publicly available API and a pre-trained model with two layers. We use a bi-directional GRU as the RNN model, and the fusion layer projects both modalities into a common embedding space. When there is a single modality, the fusion layer only projects features from that modality. We train with the Adam optimizer, decaying the learning rate by a fixed factor at regular epoch intervals. We report results with the best model selected based on performance on a small held-out validation set.
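The step-decay schedule can be written as below; the base learning rate, decay factor, and step size did not survive extraction, so the values in the example are placeholders.

```python
def stepped_lr(base_lr, epoch, decay_factor, step_size):
    """Learning rate under step decay: multiply `base_lr` by
    `decay_factor` once every `step_size` epochs."""
    return base_lr * decay_factor ** (epoch // step_size)

# Hypothetical schedule: 1e-3 decayed by 0.1 every 10 epochs.
for epoch in (0, 10, 25):
    print(epoch, stepped_lr(1e-3, epoch, 0.1, 10))
```

Equivalent behavior is available as `torch.optim.lr_scheduler.StepLR` when training with PyTorch.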

6.2 Quantitative Results

We show results in Table 2. For the intent taxonomy, images are more informative than (word2vec) text (Img vs. Txt-emb), but with ELMo, text outperforms using images alone (Txt-ELMo vs. Img). ELMo similarly improves performance on the contextual taxonomy but not the semiotic taxonomy.

For the semiotic taxonomy, ELMo and word2vec embeddings perform similarly (Txt-emb vs. Txt-ELMo), suggesting that individual words are sufficient for the semiotic labeling task and that sentence context (as in ELMo) is not needed.

Combining the visual and textual modalities helps across the board. For example, for the intent taxonomy the joint model Img + Txt-ELMo outperforms Txt-ELMo alone. Images help even more when using a word embedding-based text model (Img + Txt-emb vs. Txt-emb). Joint models also improve over single-modality models on labeling the contextual and semiotic taxonomies. We show class-wise performance of the single- and multi-modality models in Table 3. It is particularly interesting that in the semiotic taxonomy, multimodality helps the most with divergent semiotics, where the gain over the image-only model is largest.

Table 3: Class-wise results (AUC) for the three taxonomies with different DCNN models on the MDID dataset. Except for the semiotic taxonomy, we used ELMo text representations (based on the performance in Table 2). Here I refers to image-only, T to text-only, and I + T to the combined model.


Figure 3: Confusion between intent classes for the intent classification task. The confusion matrix was obtained using the Img + Txt-ELMo model, with results averaged over the five cross-validation splits.


6.3 Discussion

In general, using both text and image is helpful, a fact that is unsurprising since combining text and image is known to improve performance on tasks such as predicting post popularity hessel17. Most telling, however, were the differences in this helpfulness across items. In the semiotic taxonomy, the greatest gain came when the text-image semiotics were “divergent”. By contrast, multimodal models help less when the image and text are additive, and help the least when the image and text are parallel and provide less novel information. Similarly, for contextual relationships, multimodal analysis helps the most with the “minimal” category (1.6%). This further supports the idea that on social media such as Instagram, the relation between image and text can be richly divergent and thereby form new meanings.

The category confusion matrix in Figure 3 provides further insights. The least confused category is informative. Informative posts are least similar to the rest of Instagram, since they consist of detached, objective posts without much use of “I” or “me.” Promotive posts are also relatively easy to detect, since they are formally informative, telling the viewer the advantages and logistics of an item or event, with the addition of a persuasive intent reflecting the poster’s personal opinions (“I love this watch”). The entertainment label is often misapplied; perhaps to some extent all posts have a goal of entertaining, and any analysis must account for this filter of entertainment. Exhibitionist tends to be predicted well, likely due to its visual and textual signifiers of individuality (e.g., selfies are almost always exhibitionist, as are captions like “I love my new hair”). There is a great deal of confusion, however, between the expressive and exhibitionist categories, since the only distinction lies in whether the post is about a general topic or about the poster, and between the provocative and advocative categories, perhaps because both often seek to prove points in a similar way.


Figure 4: Sample output predictions for the three taxonomies, showing ranked classes and predicted probabilities. In Image IV, the same image paired with a different caption gives rise to a different intent.

6.4 Sample Outputs

We show some sample outputs of the (multimodal) model in Figure 4. The top-left image-caption pair (Image I) is classified as exhibitionist, with expressive a close second, since it is a picture of someone’s home with a caption describing an experience at home. The semiotic relationship is classified as additive because the image and caption together signify the concept of spending winter at home with pets before the fireplace. The contextual relationship is classified as transcendent because the caption goes well beyond the image.

The top-right image-caption pair (Image II) is classified as entertainment because the pair works as an ironic reference to dancing (“yeet”) grandparents, who are actually reading, phrased in language usually used by young people that a typical grandparent would never use. The semiotic relationship is classified as divergent and the contextual relationship as minimal because of the semantic and semiotic divergence of the image-caption pair caused by the juxtaposition of youthful references with older people.

To further understand the role of meaning multiplication, we consider the change in intent and semiotic relationships when the same image of the British Royal Family is matched with two different captions in the bottom row of Figure 4 (Image IV). When the caption is “the royal family” the intent is classified as entertainment because such pictures and caption pairs often appear on Instagram intending to entertain. The semiotic relationship is classified as parallel, and the contextual relationship as close because the caption and the image overlap heavily. But when the caption is “my happy family” the intent is classified as expressive because the caption expresses family pride, and the semiotic relationship is additive because the caption’s reference to a happy family goes beyond what the image signifies.

7 Conclusion

We have proposed a model to capture the complex meaning multiplication relationship between image and text in multimodal Instagram posts. Our three new taxonomies, adapted from the media and semiotic literature, allow the literal, semiotic, and illocutionary relationship between text and image to be coded. Of course our new dataset and the baseline classifier models are just a preliminary effort, and future work will need to examine larger datasets, richer classification schemes, and more sophisticated classifiers. Some of these may be domain-specific. For example, alikhani show how to develop rich coherence relations that model the contextual relationship between recipe text and accompanying images (specific versions of Elaboration or Exemplification such as “Shows a tool used in the step but not mentioned in the text”). Expanding our taxonomies with richer sets like these is an important goal. Nonetheless, the fact that we found multimodal classification to be most helpful in cases where the image and text diverged semiotically points out the importance of these complex relations, and our taxonomies, dataset, and tools should provide impetus for the community to further develop more complex models of this important relationship.

8 Acknowledgment

This project is sponsored by the Office of Naval Research (ONR) under the contract number N00014-17C-1008. Disclaimer: The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.