With the widespread usage of online advertising to promote brands (advertisers), there has been a steady need to innovate upon ad formats, and associated ad creatives (3). The image and text comprising the ad creative can have a significant influence on online users, and their thoughtful design has been the focus of creative strategy teams assisting brands, advertising platforms, and third party marketing agencies.
Numerous recent studies have indicated the emergence of a phenomenon called ad fatigue, where online users get tired of repeatedly seeing the same ad each time they visit a particular website (e.g., their personalized news stream) (Schmidt and Eisend, 2015; Youn and Kim, 2019). Such an effect is common even in native ad formats, where the ad creatives are supposed to be in line with the content feed they appear in (3; Youn and Kim, 2019). In this context, frequently refreshing ad creatives is emerging as an effective way to reduce ad fatigue (7; 17).
From a creative strategist’s view, coming up with new themes and translating them into ad images and text is a time-consuming task which inherently requires human creativity. Numerous online tools have emerged to help strategists translate raw ideas (themes) into actual images and text, e.g., via querying stock image libraries (25), and by offering generic insights on the attributes of successful ad images and text (27). In a similar spirit, there is room to further assist strategists by automatically recommending brand specific themes which can be used with downstream tools similar to the ones described above. In the absence of human creativity, inferring such brand specific themes using the multimodal (images and text) data associated with successful past ad campaigns (spanning multiple brands) is the focus of this paper.
A key enabler in pursuing the above data driven approach for inferring themes is that of a dataset of ad creatives spanning multiple advertisers. Such a dataset (2) spanning ad images was recently introduced in (Hussain et al., 2017), and also used in the followup work (Ye and Kovashka, 2018). The collective focus in the above works (Hussain et al., 2017; Ye and Kovashka, 2018) was on understanding ad creatives in terms of sentiment, symbolic references and VQA. In particular, no connection was made with the brands inferred in creatives, and the associated world knowledge on the inferred brands. As the first work in connecting the above dataset (2) with brands, (Mishra et al., 2019) formulated a keyword ranking problem for a brand (represented via its Wikipedia page), and such keywords could be subsequently used as themes for ad creative design. However, the ad images were not used in (Mishra et al., 2019), and recommended themes were restricted to single words (keywords) as opposed to longer keyphrases which could be more relevant. For instance, in Figure 1, the phrase take anywhere has much more relevance for Audi than the constituent words in isolation.
In this paper, we primarily focus on addressing both of the above mentioned shortcomings by (i) ingesting ad images as well as textual information, i.e., Wikipedia pages of the brands and text in ad images (OCR), and (ii) considering keyphrases (themes) as opposed to keywords. Due to the multimodal nature of our setup, we propose a VQA formulation as exemplified in Figure 1, where the questions are around the advertised product (as in (Ye and Kovashka, 2018; Hussain et al., 2017)) and the answers are in the form of keyphrases (derived from answers in (2)). Brand specific keyphrase recommendations can be subsequently collected from the predicted outputs of brand-related VQA instances. Compared to prior VQA works involving questions around an image, the difference in our setup lies in the use of Wikipedia pages for brands, and OCR features; both of these inputs are considered to assist the task of recommending ad themes. In summary, our main contributions can be listed as follows:
- we study two formulations for VQA based ad theme recommendation (classification and ranking) while using multimodal sources of information (ad image, OCR, and Wikipedia),
- we show the efficacy of transformer based visual-linguistic representations for our task, with significant performance lifts versus using separate visual and text representations,
- we show that using multimodal information (images and text) for our task is significantly better than using only visual or textual information, and
- we report selected ad insights from the public dataset (2).
2. Related Work
In this section, we cover related work on online advertising, understanding ad creatives, and visual-linguistic representations.
2.1. Online advertising
Brands typically run online advertising campaigns in partnerships with publishers (i.e., websites where ads are shown), or advertising platforms (McMahan et al., 2013; Zhou et al., 2019) catering to multiple publishers. Such an ad campaign may be associated with one or more ad creatives to target relevant online users. Once deployed, the effectiveness of targeting and ad creatives is jointly gauged via metrics like click-through rate (CTR) and conversion rate (CVR) (Bhamidipati et al., 2017). To separate out the effect of targeting from creatives, advertisers typically try out different creatives for the same targeting segments, and efficiently explore (Li et al., 2010) which ones have better performance. In this paper, we focus on quickly creating a pool of ad creatives for a brand (via recommended themes learnt from past ad campaigns), which can then be tested online with targeting segments.
2.2. Automatic understanding of ad creatives
A large-scale dataset of ad creatives (2) was introduced in (Hussain et al., 2017), where the authors focused on automatically understanding the content in ad images and videos from a computer vision perspective. The dataset has ad creatives with annotations including topic (category), and questions and answers (e.g., the reasoning behind the ad, and the expected user response due to the ad). In a followup work (Ye and Kovashka, 2018), the focus was on understanding symbolism in ads (via object recognition and image captioning) to match human-generated statements describing actions suggested in the ad. Understanding ad creatives from a brand’s perspective was missing in both (Ye and Kovashka, 2018; Hussain et al., 2017), and (Mishra et al., 2019) was the first to study the problem of recommending keywords for guiding a brand’s creative design. However, (Mishra et al., 2019) was limited to only text inputs for a brand (e.g., the brand’s Wikipedia page), and the recommendations were limited to single words (keywords). In this paper, we extend the setup in (Mishra et al., 2019) in a non-trivial manner by including multimodal information from past ad campaigns, e.g., images, text in the image (OCR), and Wikipedia pages of associated brands. We also extend the recommendations from single words to longer keyphrases.
2.3. Visual-linguistic representations and VQA
With an increasing interest in joint vision-language tasks like visual question answering (VQA) (Antol et al., 2015), and image captioning (Sharma et al., 2018), there has been lot of recent work on visual-linguistic representations which are key enablers in the above mentioned tasks. In particular, there has been a surge of proposed methods using transformers (Devlin et al., 2018), and we cover some of them below.
In LXMERT (Tan and Bansal, 2019), the authors proposed a transformer based model that encodes different relationships between text and visual inputs trained using five different pre-training tasks. More specifically, they use encoders that model text, objects in images and relationship between text and images using (image,sentence) pairs as training data. They evaluate the model on two VQA datasets. More recently ViLBERT (Lu et al., 2019) was proposed, where BERT (Devlin et al., 2018) architecture was extended to generate multimodal embeddings by processing both visual and textual inputs in separate streams which interact through co-attentional transformer layers. The co-attentional transformer layers ensure that the model learns to embed the interactions between both modalities. Other similar works include VisualBERT (Li et al., 2019b), VLBERT (Su et al., 2019), and Unicoder-VL (Li et al., 2019a).
In this paper, our goal is to leverage visual-linguistic representations to solve an ads specific VQA task formulated to infer brand specific ad creative themes. In addition, VQA tasks on ad creatives tend to be relatively challenging (e.g., compared to image captioning) due to the subjective nature and hidden symbolism frequently found in ads (Ye and Kovashka, 2018). Another difference between our work and the existing VQA literature is that our task is not limited to understanding the objects in the image, but also covers the emotions the ad creative would evoke in the viewer. Our primary task is to predict the different themes and sentiments that an ad creative image can evoke in its viewer, and to use such brand specific understanding to help creative strategists develop new ad creatives.
We first describe our formulation of ad creative theme recommendation as a classification problem in Section 3.1. This is followed by subsections on text and image representation, cross modal encoder, and optimization (Figure 2 gives an overview). Finally, in Section 3.5 we cover an alternative ranking formulation for recommendation.
3.1. Theme recommendation: classification formulation
In our setup, we are given an ad image (indexed by $i$), and associated text denoted by $s_i$. Text is sourced from: (i) text in the ad image (OCR), (ii) questions around the ad, and (iii) the Wikipedia page of the brand in the ad. Given image $i$, we represent it as a sequence of objects together with their corresponding regions in the image (details in Section 3.2). The sentence is represented as a sequence of words. Given the three sequences (objects, regions, and words), the objective is to recommend a keyphrase $y \in \mathcal{Y}$, where $\mathcal{Y}$ is a pre-determined vocabulary of keyphrases. In other words, the model estimates $P(y \mid i, s_i)$ for each $y \in \mathcal{Y}$, and the top keyphrase for instance $i$ can then be selected as:

$\hat{y}_i = \arg\max_{y \in \mathcal{Y}} P(y \mid i, s_i).$
The above classification formulation is similar to that for VQA in (Ye and Kovashka, 2018); the difference is in the multimodal features explained below.
3.2. Text and image embeddings
Text embedding. We first use the WordPiece tokenizer (Wu et al., 2016) to convert a sentence into a sequence of tokens. Then, the tokens are projected to vectors $h^{tok}_j$ in the embedding layer, and their corresponding positions are projected to vectors $h^{pos}_j$. Then, $h^{tok}_j$ and $h^{pos}_j$ are added to form the token representation $h_j$ as shown below:

$h^{tok}_j = W^{tok} w_j, \qquad h^{pos}_j = W^{pos} p_j, \qquad h_j = h^{tok}_j + h^{pos}_j,$

where $W^{tok}$ and $W^{pos}$ are the embedding matrices, $V_t$ and $V_p$ are the vocabulary sizes of tokens and token positions, and $d_t$ and $d_p$ are the dimensions of the token and position embeddings.
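As an illustrative sketch of the above (the dimensions, random matrices, and token ids below are placeholders, not the paper's actual parameters), the token representation is simply the sum of two table lookups:

```python
import numpy as np

rng = np.random.default_rng(0)

V_t, V_p, d = 1000, 64, 16           # token vocab size, max positions, embedding dim (illustrative)
W_tok = rng.normal(size=(V_t, d))    # token embedding matrix
W_pos = rng.normal(size=(V_p, d))    # position embedding matrix

def embed_tokens(token_ids):
    """Sum of token and position embeddings for a token sequence."""
    positions = np.arange(len(token_ids))
    return W_tok[token_ids] + W_pos[positions]

h = embed_tokens([5, 42, 7])
assert h.shape == (3, d)
```

In a real implementation both tables would be learned jointly with the rest of the model rather than sampled randomly.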
Image embedding. We use bounding boxes and their region-of-interest (RoI) features to represent an image. As in (Lu et al., 2019; Tan and Bansal, 2019), we leverage Faster R-CNN (Ren et al., 2015) to generate the bounding boxes and RoI features. Faster R-CNN is an object detector which identifies instances of objects belonging to certain classes, and then localizes them with bounding boxes. Though image regions lack a natural ordering compared to token sequences, their spatial locations can be encoded (e.g., as demonstrated in (Tan and Bansal, 2019)). The image embedding layer takes in the RoI features $f_k$ and spatial features $s_k$ and outputs a position-aware image embedding $v_k$ as shown below:

$v_k = \tfrac{1}{2}\left[(W_f f_k + b_f) + (W_s s_k + b_s)\right],$

where $W_f$ and $W_s$ are weights, and $b_f$ and $b_s$ are biases.
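A minimal sketch of this projection (dimensions and weights below are illustrative assumptions; LXMERT-style models use 2048-dim RoI features and 4-dim normalized box coordinates, which we mirror here):

```python
import numpy as np

rng = np.random.default_rng(1)

d_roi, d_spatial, d = 2048, 4, 16    # RoI feature dim, box coordinate dim, output dim (illustrative)
W_f, b_f = rng.normal(size=(d_roi, d)), np.zeros(d)
W_s, b_s = rng.normal(size=(d_spatial, d)), np.zeros(d)

def embed_region(f_k, s_k):
    """Position-aware region embedding: average of projected RoI and spatial features."""
    return ((f_k @ W_f + b_f) + (s_k @ W_s + b_s)) / 2.0

v = embed_region(rng.normal(size=d_roi), np.array([0.1, 0.2, 0.5, 0.6]))
assert v.shape == (16,)
```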
3.3. Transformer-based cross-modality encoder
We apply a transformer-based cross-modality encoder to learn a joint embedding from both visual and textual features. Here, without loss of generality, we follow the LXMERT architecture (Tan and Bansal, 2019) to encode the cross-modal features. As shown in Figure 2, the token embedding is first fed into a language encoder while the image embedding goes through an object-relationship encoder. The cross-modality encoder contains two unidirectional cross-attention sub-layers which attend the visual and textual embeddings to each other. We use the cross-attention sub-layers to align the entities from the two modalities and to learn a joint embedding. We follow (Devlin et al., 2018) in adding a special token [CLS] to the front of the token sequence; the embedding vector learned for this special token is regarded as the cross-modal embedding (recently proposed ViLBERT (Lu et al., 2019) and VisualBERT (Li et al., 2019b) can serve as alternatives). In terms of query ($Q$), key ($K$), and value ($V$), the visual cross-attention is computed as

$\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_L}}\right)V,$

where $Q$, $K$, and $V$ are linguistic features, visual features, and visual features, respectively, and $d_L$ represents the dimension of the linguistic features (Tan and Bansal, 2019). Textual cross-attention is similar, with the visual and linguistic features swapped.
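The scaled dot-product cross-attention above can be sketched as follows (a single-head, weight-free simplification with made-up dimensions; real LXMERT layers add learned projections, multiple heads, and residual connections):

```python
import numpy as np

def cross_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V: queries from one modality attend to the other."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(2)
lang = rng.normal(size=(5, 16))   # 5 token embeddings (queries)
vis = rng.normal(size=(8, 16))    # 8 region embeddings (keys and values)
out = cross_attention(lang, vis, vis)
assert out.shape == (5, 16)
```

Swapping the roles of `lang` and `vis` gives the other direction of cross-attention.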
3.4. Learning and optimization.
Based on the joint embedding for each image and sentence pair, the keyphrase recommendation task can now be tackled with a fully connected layer. Given the cross-modal embedding $z$, the probability distribution over all the candidate keyphrases is calculated by a fully-connected layer and the softmax function as shown below:

$P(y \mid i, s_i) = \mathrm{softmax}(Wz + b),$

where $W$ and $b$ are the weight and bias of the fully connected layer, and $z$ is the cross-modal embedding.
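End to end, the classification head reduces to a softmax over the keyphrase vocabulary followed by an argmax (the toy vocabulary, dimensions, and random weights below are our own illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["great sale", "take anywhere", "save money"]   # toy keyphrase vocabulary
d = 16
W, b = rng.normal(size=(d, len(vocab))), np.zeros(len(vocab))

def predict_keyphrase(z):
    """Softmax over the keyphrase vocabulary from the cross-modal embedding z,
    followed by argmax to pick the top keyphrase."""
    logits = z @ W + b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return vocab[int(np.argmax(probs))], probs

phrase, probs = predict_keyphrase(rng.normal(size=d))
assert phrase in vocab and abs(probs.sum() - 1.0) < 1e-9
```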
3.5. Theme recommendation: ranking formulation
We also consider solving the theme recommendation problem via a ranking model, where the model outputs a list of keyphrases in decreasing order of relevance for a given (image, sentence) pair. We use the state-of-the-art pairwise deep relevance matching model (DRMM) (Guo et al., 2016), whose architecture for our theme recommendation setup is shown in Figure 3. It is worth noting that our pairwise ranking formulation can be changed to accommodate other multi-objective or list-based loss functions. We chose the DRMM model since it is not restricted by the length of the input, as most ranking models are, but relies on capturing local interactions between query and document terms with fixed-length matching histograms. Given an (image, sentence, phrase) combination, the model first computes a fixed-length matching histogram between the cross-modal embedding and the phrase embedding. Each matching histogram is passed through a multi-layer perceptron (MLP), and the overall score is aggregated with a query term gate, which is a softmax function over all terms in that query.
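A fixed-length matching histogram in the DRMM sense can be sketched as below (a count-based variant over cosine similarities; the bin count and random vectors are illustrative, not the paper's settings):

```python
import numpy as np

def matching_histogram(query_vec, term_vecs, n_bins=5):
    """Count-based histogram of cosine similarities between one query vector
    and each document term vector, over n_bins equal intervals of [-1, 1]."""
    sims = [float(query_vec @ t / (np.linalg.norm(query_vec) * np.linalg.norm(t)))
            for t in term_vecs]
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    hist, _ = np.histogram(sims, bins=edges)
    return hist

rng = np.random.default_rng(4)
h = matching_histogram(rng.normal(size=16), rng.normal(size=(6, 16)))
assert int(h.sum()) == 6 and len(h) == 5
```

The histogram has the same length regardless of how many terms the document (here, the keyphrase) contains, which is what frees DRMM from input-length restrictions.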
The ranking segment of the model takes two inputs: (i) the cross-modal embedding (as explained in Section 3.3), and (ii) the phrase embedding. It then learns to predict the relevance of the given phrase with respect to the query (image, sentence) pair. Given that our input documents (i.e., keyphrases) are short, we select the top interactions in the matching histogram between the cross-modal embedding and the keyphrase embedding. Mathematically, we denote the (image, sentence) pair by just $q$ below. Given a triple $(q, p_1, p_2)$ where phrase $p_1$ is ranked higher than $p_2$ with respect to the image-question pair $q$, the loss function is defined as:

$\mathcal{L}(q, p_1, p_2) = \max\left(0,\; 1 - s(q, p_1) + s(q, p_2)\right),$

where $s(q, p)$ denotes the predicted matching score for phrase $p$ and the query image-question pair $q$.
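This pairwise hinge loss is straightforward to compute; a small sketch (the margin of 1 follows the standard DRMM hinge formulation, which we assume here):

```python
def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    """max(0, margin - s(q, p1) + s(q, p2)) for a phrase p1 ranked above p2."""
    return max(0.0, margin - score_pos + score_neg)

# correctly ordered by a wide margin -> zero loss
assert pairwise_hinge_loss(2.0, 0.5) == 0.0
# correctly ordered, but inside the margin -> small positive loss
assert pairwise_hinge_loss(0.5, 0.4) == 0.9
```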
In this section, we go over the public dataset used in our experiments, classification and ranking results, and inferred insights.
We rely on a publicly available dataset (Hussain et al., 2017; 2) that consists of advertisement creatives spanning brands across multiple product categories, of which 80% is used as the training set and 20% as the test set. We select 10% of the data from the training set for validation. Crowdsourcing was used to gather the following labels for each creative: (i) topics (categories), and (ii) questions and answers as reasons for buying from the brand depicted in the creative (3 per creative). In addition to the existing annotations, we add the following: (i) the brand present in a creative, (ii) the Wikipedia page relevant to the brand-category pair in a creative, and (iii) the set of target themes (keyphrases) associated with each image. In particular, for (i) and (ii) we follow the method in (Mishra et al., 2019), and for (iii) the keyphrases (labels) were extracted from the answers using the position-rank method (Boudin, 2016; Florescu and Caragea, 2017) for each image. The number of keyphrases was limited to at most five (based on the top keyphrase scores returned by position-rank). We define a score for each keyphrase; the five keyphrases are assigned decreasing scores in order of their position-rank ranking. (The annotated ads dataset can be found at https://github.com/joey1993/ad-themes.)
The minimum, mean, and maximum number of images associated with a brand are 1, 19, and 282, respectively. The top three categories of advertisements are clothing, cars, and beauty products, with 7798, 6496, and 5317 images, respectively. The fewest advertisements are associated with gambling (32), pet food (37), and security and safety services (47). Additional statistics around the dataset (i.e., keyphrase lengths, images per category, and unique keyphrases per category) are shown in Figures 4, 5, and 6.
4.2. Evaluation Metrics
We use three different metrics each to evaluate the performance of our classification and ranking models.
4.2.1. Classification metrics
We rely on accuracy, text similarity and set-intersection based recall to evaluate model performance.
Accuracy. We predict the keyphrase with the highest probability for each image and match it against the labels (ground truth keyphrases) for the image. We use the score of the matched phrase as the accuracy. If no labels match for a sample, the accuracy is 0. We average the accuracy scores over all the test instances to report test accuracy.
Similarity. Accuracy neglects the semantic similarity between the predicted phrase and the labels. For example, a predicted keyphrase “a great offer” is similar to one of the labels, “great sale”, but gains no credit under accuracy. So we calculate the cosine similarity (Han et al., 2011) between the embeddings of the predicted keyphrase and each label. Then we multiply the similarity scores by each label’s score and keep the maximum as the final similarity score for the sample.
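A sketch of this similarity metric, using averaged word vectors as phrase embeddings as described in Section 4.3 (the random word vectors below stand in for actual GloVe embeddings):

```python
import numpy as np

def phrase_embedding(phrase, word_vecs):
    """Average of the word vectors of the words in a phrase (GloVe-style)."""
    return np.mean([word_vecs[w] for w in phrase.split()], axis=0)

def similarity_score(predicted, labels, word_vecs):
    """Max over labels of cosine(predicted, label) * label score."""
    p = phrase_embedding(predicted, word_vecs)
    best = 0.0
    for label, score in labels.items():
        l = phrase_embedding(label, word_vecs)
        cos = float(p @ l / (np.linalg.norm(p) * np.linalg.norm(l)))
        best = max(best, cos * score)
    return best

rng = np.random.default_rng(5)
word_vecs = {w: rng.normal(size=8) for w in ["a", "great", "offer", "sale"]}
s = similarity_score("a great offer", {"great sale": 1.0}, word_vecs)
assert -1.0 <= s <= 1.0
```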
VQA Recall. We use Recall at $K$ (R@$K$) as an evaluation metric for the classification task (essentially like the VQA formulation in (Ye and Kovashka, 2018)). For each test instance $i$, the ground truth is limited to the top keyphrases, leading to the set $G_i$. From the classification model’s predictions, the top $K$ keyphrases are chosen, leading to the set $P_i$. R@$K$ is simply $|G_i \cap P_i| / |G_i|$.
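This set-intersection recall is a one-liner; a sketch with hypothetical keyphrases:

```python
def recall_at_k(ground_truth, ranked_predictions, k):
    """|G ∩ P_k| / |G| for one test instance, with P_k the top-k predictions."""
    top_k = set(ranked_predictions[:k])
    return len(set(ground_truth) & top_k) / len(set(ground_truth))

r = recall_at_k(["great sale", "save money"],
                ["save money", "new colors", "great sale"], k=2)
assert r == 0.5   # only "save money" appears in the top 2
```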
4.2.2. Ranking metrics
We use the same evaluation metrics as prior work (Mishra et al., 2019), mainly precision (P@$K$), recall (R@$K$), and NDCG@$K$ (Järvelin and Kekäläinen, 2002), to evaluate the proposed ranking model. It is worth noting that recall is computed differently for evaluating the ranking and classification models proposed in this work. Formally, given a set of queries $Q$, the set of phrases $\mathrm{rel}_q$ labeled relevant for each query $q \in Q$, and the set $\mathrm{ret}_{q,K}$ of relevant phrases retrieved by the model for $q$ at position $K$, we define

$\mathrm{R@}K = \frac{1}{|Q|} \sum_{q \in Q} \frac{|\mathrm{ret}_{q,K}|}{|\mathrm{rel}_q|}.$
4.3. Implementation details
For the classification model, we set the number of object-relationship, language, and cross-modality layers to match the pre-trained LXMERT model, whose parameters we leverage (Tan and Bansal, 2019). We fine-tune the encoders on our dataset (Adam optimizer). For the similarity evaluation, we average the GloVe (Pennington et al., 2014) embeddings of all the words in a phrase to calculate the phrase embedding. For the DRMM model (ranking formulation), we used the MatchZoo implementation (18) with the Adadelta optimizer. We combine textual data from different sources in a consistent order, separated by a [SEP] symbol, before feeding it into the encoder.
Table 1. Classification performance (accuracy (%), similarity (%), and R@$K$) for different feature sets.
For different sets of multimodal features, the performance results are reported in Table 1 (for classification) and Table 2 (for ranking) respectively. The presence of Wikipedia and OCR text gives a significant lift over using only the image. Both classification and ranking metrics show the same trend in terms of feature sets.
Table 1 shows that linguistic features dramatically lift accuracy, similarity, and R@$K$ compared to the model trained only with visual features, while using only linguistic features (Q+W+O) causes a big drop across all metrics. We observe that the OCR features bring more performance lift than the Wikipedia features. We believe knowing more about the brand via its Wikipedia page is beneficial for recommending themes to designers (Mishra et al., 2019), while the text written on the images (OCR) is sometimes more directly useful for recommendations. In addition, as reported in Table 1 (non cross-modal), using separate text and image embeddings (obtained from the model in Figure 2) is inferior in performance compared to the cross-modal embeddings. We note that the accuracy scores are comparatively low; this reflects the difficult nature of understanding visual ads (Hussain et al., 2017).
In Table 2, we observe very similar patterns: OCR features bring more benefits to ranking than Wikipedia pages. We notice that in P@$K$ and R@$K$, using only the image features (I) achieves a better score than adding the question features. This may indicate that the local interactions in DRMM are not effective with short questions, but favor longer textual inputs such as OCR and Wikipedia pages.
Figure 7 shows the performance lifts in the accuracy and similarity metrics per category (where lift is defined as the ratio of improvement over the baseline result without text features in the classification task). As shown, multiple categories benefit from the presence of text features, e.g., public service announcement (PSA) ads around domestic violence and animal rights; this may be related to the hidden symbolism (Ye and Kovashka, 2018) common in PSAs, where text can help clarify the context even for humans. Also, the similarity and accuracy metrics do not follow the same trend in general. Along the lines of inferring themes from past ad campaigns and assisting strategists in designing new creatives, we show an example based on our classification model in Figure 8. In general, a strategist can aggregate recommended keyphrases across a brand or product category, and use them to design new creatives.
In this paper, we make progress towards automating the inference of themes (keyphrases) from past ad campaigns using multimodal information (i.e., images and text). In terms of model accuracy, there is room for improvement, and using generative models for keyphrase generation may be a promising direction. In terms of application, i.e., automating creative design, we believe that the following are natural directions for future work: (i) automatically selecting ad images and generating ad text based on recommended themes, and (ii) launching ad campaigns with new creatives (designed via our proposed method) and learning from their performance in terms of CTR and CVR. Nevertheless, the proposed method in this paper can increase diversity in ad campaigns (and potentially reduce ad fatigue), reduce end-to-end design time, and enable faster exploratory learnings from online ad campaigns by providing multiple themes per brand (and multiple images per theme via stock image libraries).
- VQA: visual question answering. In The IEEE International Conference on Computer Vision (ICCV).
- Automatic understanding of image and video advertisements. http://people.cs.pitt.edu/~kovashka/ads
- Banner blindness. https://en.wikipedia.org/wiki/Banner_blindness
- A large scale prediction engine for app install clicks and conversions. In CIKM 2017.
- PKE: an open source python-based keyphrase extraction toolkit. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Facebook business: optimize your ad results by refreshing your creative. https://www.facebook.com/business/m/test-ads-on-facebook
- PositionRank: an unsupervised approach to keyphrase extraction from scholarly documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management.
- Data mining: concepts and techniques. Elsevier.
- Automatic understanding of image and video advertisements. In CVPR.
- Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20 (4), pp. 422–446.
- Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066.
- VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
- Exploitation and exploration in a performance based contextual advertising system. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS.
- Marketing land: social media ad fatigue. https://marketingland.com/ad-fatigue-social-media-combat-224234
- MatchZoo. https://github.com/NTMC-Community/MatchZoo
- Ad click prediction: a view from the trenches. In KDD 2013.
- Guiding creative design in online advertising. In RecSys.
- GloVe: global vectors for word representation. In EMNLP.
- Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- Advertising repetition: a meta-analysis on effective frequency in advertising. Journal of Advertising 44 (4), pp. 415–428.
- Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL.
- Shutterstock: search millions of royalty free stock images, photos, videos, and music. https://www.shutterstock.com/
- VL-BERT: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.
- Taboola trends. https://trends.taboola.com/
- LXMERT: learning cross-modality encoder representations from transformers. In EMNLP-IJCNLP.
- Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- ADVISE: symbolism and external knowledge for decoding advertisements. In Computer Vision – ECCV 2018, 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part XV, pp. 868–886.
- Newsfeed native advertising on Facebook: young millennials' knowledge, pet peeves, reactance and ad avoidance. International Journal of Advertising 38 (5), pp. 651–683.
- Understanding consumer journey using attention based recurrent neural networks. In KDD.