VisualTextRank: Unsupervised Graph-based Content Extraction for Automating Ad Text to Image Search

08/05/2021 ∙ by Shaunak Mishra, et al. ∙ Verizon Media

Numerous online stock image libraries offer high quality yet copyright free images for use in marketing campaigns. To assist advertisers in navigating such third party libraries, we study the problem of automatically fetching relevant ad images given the ad text (via a short textual query for images). Motivated by our observations in logged data on ad image search queries (given ad text), we formulate a keyword extraction problem, where a keyword extracted from the ad text (or its augmented version) serves as the ad image query. In this context, we propose VisualTextRank: an unsupervised method to (i) augment input ad text using semantically similar ads, and (ii) extract the image query from the augmented ad text. VisualTextRank builds on prior work on graph based context extraction (biased TextRank in particular) by leveraging both the text and image of similar ads for better keyword extraction, and using advertiser category specific biasing with sentence-BERT embeddings. Using data collected from the Verizon Media Native (Yahoo Gemini) ad platform's stock image search feature for onboarding advertisers, we demonstrate the superiority of VisualTextRank compared to competitive keyword extraction baselines (including an 11% accuracy lift over biased TextRank). For the case when the stock image library is restricted to English queries, we show the effectiveness of VisualTextRank on multilingual ads (translated to English) while leveraging semantically similar English ads. Online tests with a simplified version of VisualTextRank led to a 28.7% lift in the rate of advertisers selecting stock images, and a 41.6% lift in the onboarding completion rate on the ad platform.




1. Introduction

Online advertising is an effective tool for both major brands and small businesses to grow awareness about their products and influence online users towards purchases (Bhamidipati et al., 2017; Zhou et al., 2019). Ad images are a crucial component of online ads, and small businesses with limited advertising budgets may find it challenging to obtain relevant high quality images for their ad campaigns. In this context, small businesses are increasingly relying on online stock image libraries (21) which offer access to high quality and copyright free images available for marketing purposes. To assist such small businesses in a self-serve manner, the Verizon Media Native (Yahoo Gemini) ad platform offers onboarding advertisers the ability to query a third party stock image library for candidate ad images (as shown in Figure 1).

Figure 1. Onboarding workflow for Verizon Media Native advertisers: (1) advertiser enters their website URL with ad title and description (ad text), (2) advertiser proceeds to choose an ad image and may be shown a random selection of images from a third party image library, (3) advertiser queries the library with a relevant textual query, and (4) selects a query result to get a preview of the final ad.

Although the example (an ad for bikes) shown in Figure 1 is borrowed from Verizon Media Native, it brings up a fundamental question: can we automatically figure out the search query for images given the ad text? Such automation will not only attract advertisers towards the stock image library, but will also speed up the onboarding process. However, there are several challenges in this context: (i) limited data (for supervised learning) on ad text and associated image search queries, and (ii) third party image libraries with unknown (proprietary) image indexing. The first challenge, i.e., limited data, stems from the fact that this is a relatively new service (i.e., ad platforms offering a stock image search feature), and small businesses may not be aware of such features. One can use current state-of-the-art image understanding methods (e.g., object detection (Krasin et al., 2017), captioning (Sharma et al., 2018)) on the large set of existing ad images (which were not created by querying a stock image library) to get image annotations given ad text. But in our analysis (details in Section 3), such annotations were not specific enough to be considered as image queries in most cases, and hence the problem of limited ad text-to-image query data for supervised learning still remains. The second challenge listed above, i.e., unknown indexing, makes image libraries act like black boxes allowing short (few words) queries which may often be restricted to the English language. Due to such a black-box nature of third party image libraries, we do not consider approaches requiring the mapping of a text query to an image.

Given the above challenges, in this paper, we focus on unsupervised approaches for figuring out the ad image search query given an ad text. As we explain later in this paper (in Section 3), available logged data on ad image search behavior shows that in a significant fraction of cases, the ad text may already contain a keyword suitable as the image search query (as in the bike ad example in Figure 1); in other words, the image query is extractive in nature. In the remaining cases, the ideal image search query may be symbolic (abstractive) with respect to the input ad text. Citing a real example from logged data, the (anonymized) ad ‘Best ETFs to buy ’ had image search queries: ‘resort’ and ‘private jet’, both of which are indicative (symbolic) of wealth after plausibly investing in the advertised financial products. Although we mostly focus on methods for extractive queries in this paper, we report preliminary findings on the usage of symbolic image queries given ad text.

Focusing on extractive image queries, we build on prior work on unsupervised graph based keyword extraction (Kazemi et al., ; Boudin, 2016) (none of which are specifically designed for ad text). Such graph based methods typically build a graph where nodes are tokens (e.g., words) from input text, and the weighted edges between the nodes represent similarity between the associated tokens (Kazemi et al., ). With such a graph built from tokens, PageRank style algorithms (Mihalcea and Tarau, ) are used to infer the dominant token (keyword); the popularity of such methods spans more than a decade. Recently, biased TextRank (Kazemi et al., ) was proposed to leverage sentence-BERT (SBERT) (Reimers and Gurevych, ) embeddings of the input text to adjust the bias for each node (random restart probability in PageRank). In other words, tokens which are closer to the overall meaning of the entire text (as encoded by SBERT) are likely to have higher bias; this works significantly better than TextRank. In our setup focused on ad text (and ad image queries), we improve upon biased TextRank using two key ideas. Firstly, we retrieve similar existing ads to augment the (relatively short) input ad text. The augmentation is done using the text of similar ads as well as image tags (detected objects) in the images of similar ads in an unsupervised manner. Figure 2 illustrates this idea: for both the English and French ad text in the example, a semantically similar English ad already contains hints (in the image and text) that the original ad is about furniture (which is also the ground truth image query in the example for both ads). Secondly, we bias the keyword scores towards the advertiser category inferred from the ad text using SBERT embeddings (category biasing). The above ideas not only make VisualTextRank ads specific, but also significantly push performance on the ad text-to-image query task.

Figure 2. Illustrative example where an existing similar ad has image query hints about two input ads (one in English and one in French) advertising furniture. The image tags are objects detected in the raw ad image but may not be descriptive enough to serve as image queries in many cases.

Our main contributions can be summarized as follows.

(1) Keyword extraction for ad text-to-image query

We formulate ad text-to-image query as a keyword extraction problem. For this task, we propose an unsupervised graph based content extraction method (VisualTextRank) which builds on biased TextRank (Kazemi et al., ) by introducing category biasing, and input text augmentation using the text and images of similar (existing) ads. For our task, we obtain lift in offline accuracy using VisualTextRank compared to competitive keyword extraction baselines.

(2) Cross-lingual learning

Leveraging VisualTextRank, we extend the ad-text-to-image task for non-English ad text (including German, French, Spanish, and Portuguese) and English-only image queries. For such a setup, we demonstrate the benefit of using semantically similar English ads to augment the non-English ad text for the ad-text-to-image query task.

(3) Product impact

We productionized a light-weight version of VisualTextRank for online tests with Verizon Media Native advertisers qualifying as small businesses (e.g., with relatively low advertising budgets). For advertisers with English ad text, we compared the effect of automatically showing ad text-relevant images (via VisualTextRank queries) as initial (default) images to the advertiser versus showing random stock images. Against this baseline, we observed a 28.7% lift in the rate of advertisers selecting stock images, and a 41.6% lift in the onboarding completion rate for advertisers.

The remainder of the paper is organized as follows. Section 2 covers related work, and Section 3 covers data analysis. The proposed VisualTextRank method and its multilingual extension are covered in Sections 4 and 5. Experimental results are covered in Section 6, and we end the paper with a discussion in Section 7.

2. Related Work

2.1. Keyword extraction

In this paper, we formulate the ad text-to-image query task as an unsupervised keyword extraction problem (explained in Section 4). Unsupervised keyword extraction research has a long history spanning over a decade (Boudin, 2016). In this paper, we focus on graph based methods which are inspired from PageRank on a graph created from the input text (Mihalcea and Tarau, ; Boudin, 2016). TextRank is a widely used method in this class, with applications spanning keyword extraction, summarization, and opinion mining (Kazemi et al., ; Mihalcea and Tarau, ). A notable recent work is by (Kazemi et al., ), where the authors improve TextRank by introducing sentence-BERT (Reimers and Gurevych, ) based biasing leading to the biased TextRank method. Sentence-BERT (SBERT) (Reimers and Gurevych, ) is a sentence representation method based on BERT (Devlin et al., 2018). In particular, SBERT is a modification of pretrained BERT that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings which can be compared using cosine-similarity. This provides a computationally efficient way to achieve state-of-the-art results for sentence-pair regression tasks like semantic textual similarity. Our proposed VisualTextRank method builds on top of biased TextRank (Kazemi et al., ) with a focus on ad text, and the ad text-to-image query task.

2.2. Textual description of images: object detection and image captioning

Object detection methods (image tagging) predict the category and location details of the objects present in an image. Faster R-CNN (Ren et al., 2015) is a popular method for this task, and more recent works build on top of it (Qiao et al., 2020). Compared to image tagging which provides a bag-of-words style textual description (object tags with confidence scores) of an image, captioning models generate a natural language description; this is a much harder task than object detection. Current state-of-the-art image captioning models use a pre-trained object detector to generate features and spatial information of the objects present in an image (e.g., as in Oscar (Li et al., 2020b)). In particular, (Li et al., 2020b) utilizes BERT-like objectives to learn cross-modal representation on different vision-language tasks (similar ideas form the basis of recent pre-trained multi-modal models, e.g., VisualBERT (Li et al., 2020a)). Prior captioning approaches have involved attention mechanisms and their variants to capture spatial relationship between objects (Herdade et al., 2019) for generating captions. In this paper, we explore both object detection (image tagging) and captioning as features for assisting the ad text-to-image query task. In addition, since we focus on an unsupervised approach, we do not consider the possibility of fine-tuning the above mentioned pre-trained multi-modal models for our task.

2.3. Ad image and text understanding

Studying ad images and text using state-of-the-art deep learning models in computer vision and natural language processing (NLP) is an emerging area of research. In (Hussain et al., 2017), ad image content was studied using computer vision models, and their dataset had manual annotations for: ad category, reasons to buy products advertised in the ad, and expected user response given the ad. Using this dataset, (Mishra et al., 2019; Zhou et al., 2020) used ranking models to recommend themes for ad creative design using a brand’s Wikipedia page. In (Mishra et al., 2020), object tag recommendations for improving an ad image were studied using data from A/B tests. Although related to ads, the above methods are not applicable in our setup due to the lack of sufficient (ad text-to-image query) data for supervised training.

3. Data Insights

In this section, we cover data insights around ad image search behavior which guided our proposed approach for the ad text-to-image task. We first explain the data source for the analysis, followed by metadata from image understanding models (for the purpose of analysis), and the resultant insights.

3.1. Data source

To gather preliminary insights around ad text-to-image search behavior, we collected data from a sample of advertisers who used the stock image library feature for their ad image while launching ad campaigns (onboarding) in the Verizon Media Native (Yahoo Gemini) ad platform. For each advertiser, the data included: (i) ad text, (ii) raw ad image, and (iii) image query for the ad image. We will refer to this dataset as onboarded-sample.

3.2. Image tags and caption metadata

For the purposes of analyzing onboarded-sample for insights, we used image tags, and captions as described below. We obtained image tags using the pre-trained Inception Resnet v2 object detection model (Krasin et al., 2017). The model consists of about classes (possible tags). For each image the model returns a list of inferred tags with confidence scores (we used tags with confidence above ). For captioning, we used an object relation transformer (ORT) model (Herdade et al., 2019) with a Faster R-CNN object detector. For our analysis, we trained two ORT models: one on Microsoft COCO 2014 Captions dataset (Lin et al., 2014) (COCO-captions model), and the other on Conceptual Captions dataset (Sharma et al., 2018) (CC-captions model).

3.3. Insights

3.3.1. Query length, parts-of-speech, and ad text overlap

We found that of image queries were a single word, and the remaining of queries comprised two or more words. For English ads in onboarded-sample, we found an overwhelming of queries to be either nouns or proper nouns; verbs accounted for . The parts-of-speech (POS) tags were inferred via Spacy (Honnibal et al., 2020). Finally, we found that of image queries were already present (as a word or as a phrase) in the ad text. This indicates that the majority of the image queries are extractive in nature; of remaining queries are symbolic in nature (e.g., an ad about retirement investments with image queries like ‘vacation’ and ‘resort’ which are not part of the ad text).

Supported by the above observations on query length and overlap with ad text, we formulate the ad text-to-image task as a keyword extraction task given ad text (and additional side information as explained later in Section 4.1). For our experiments, we restrict the output to be either a noun, proper noun or verb (guided by the POS tags observation above).

3.3.2. Poor correlation with image captions (but better with image tags)

The idea of using image understanding models stems from the observation that there is a huge dataset of existing ad text and images which were not created via a stock image query (e.g., the public dataset of older ads in (Ye and Kovashka, 2018), or proprietary older ads in the ad platform). If it is possible to get descriptions for existing ad images via an image understanding tool (e.g., captions (Sharma et al., 2018), object detectors (Krasin et al., 2017)), one can treat such a description as a proxy image query, and train an end-to-end model for ad text-to-image query. To test this hypothesis, we analyzed the captions and image tags of onboarded ad images and compared them with the ground truth image queries. Checking for image query overlap with captions for the raw ad image, we found that there was an overlap in 5% of samples for the COCO-captions model and an overlap in 7.6% of samples for the CC-captions model (see Table 1). Figure 3 shows two examples of ad images from the onboarded sample with captions (COCO), tags and the ground truth query; clearly, captioning failed in both examples, while tagging did better in one.

Figure 3. Examples of ad images with captions (COCO), tags, and ground truth image query.

In comparison to captions, for the image tag model the overlap with image queries was . Due to higher overlap, we retained image tags for existing ads (used in our experiments), and did not focus on captions. Note that with overlap, image tags are still not expressive enough (just labels) to create a labeled dataset for training an ad-text-to-image query predictor; but they can plausibly be used as a signal for better keyword extraction. Table 1 provides a brief summary of the insights.

feature | image query insights
query length | of queries are a single word
ad text | of queries are within the ad text
image captions (COCO) | 5% query overlap
image captions (CC) | 7.6% query overlap
image tags | query overlap

Table 1. Summary of insights from onboarded-sample data.

4. Keyword Extraction for Ad Text-to-Image Query

In this section, we first formulate ad text-to-image query as a keyword extraction task in Section 4.1. Next, we describe the TextRank method as necessary background in Section 4.2. Finally, we explain our proposed method VisualTextRank in Section 4.3.

4.1. Problem Formulation

Given an ad text , our objective is to come up with a single keyword to serve as the ad image query (to a stock image library). We assume the presence of a pool of existing ads , where each ad has the following attributes: (i) ad text, and (ii) image tags in the raw ad image (as described in Section 3.3.2).

In this paper, we focus on a graph based unsupervised method to model the ad text (and information relevant to in ), for the purpose of extracting a word (from the union of and ) as a relevant ad image query given . The key motivation behind a graph based approach as opposed to just using the image tags from the most similar ad text (i.e., a nearest neighbor approach), lies in Table 1. The intuition here is to identify the central entity in the ad text (which contains the query in of cases as noted in Table 1), with additional help from similar ads. In the following sections, we first explain an existing graph based keyword extraction method (TextRank) in our context as necessary background, and then explain the proposed VisualTextRank method.

4.2. TextRank

At a high level, TextRank is primarily based on PageRank (Mihalcea and Tarau, ) on the graph of tokens (words) obtained from the input text. Below, we explain the token graph construction, the (original) unbiased version of TextRank, and the recently proposed biased version.

Token graph

The token-graph is formed from tokens (words) from the input ad text , where denotes the set of vertices and denotes the set of (undirected) edges. Each word is mapped to a vertex . An edge between two vertices has weight:


where denotes the cosine similarity between the (word) embeddings of and (denoted by and respectively). It is common to set to zero if it is below a similarity threshold (Kazemi et al., ).
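The token-graph construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy 2-dimensional vectors (standing in for real word embeddings), and the threshold value of 0.5 are all assumptions made for the example.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_token_graph(embeddings, threshold=0.5):
    """Return {(word_i, word_j): weight} for undirected edges.

    `embeddings` maps each word to its vector. Pairs whose cosine
    similarity falls below `threshold` (a hypothetical value) get no edge,
    which is equivalent to setting their weight to zero.
    """
    edges = {}
    for wi, wj in combinations(sorted(embeddings), 2):
        sim = cosine(embeddings[wi], embeddings[wj])
        if sim >= threshold:
            edges[(wi, wj)] = sim
    return edges


# Toy vectors standing in for real word embeddings (assumption).
toy = {"bike": [1.0, 0.1], "cycling": [0.9, 0.2], "sale": [0.1, 1.0]}
graph = build_token_graph(toy, threshold=0.5)
```

In this toy example only the pair (‘bike’, ‘cycling’) is similar enough to be connected; ‘sale’ ends up as an isolated vertex, which mirrors the sparsity problem for short ad text discussed later in Section 4.3.1.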

Unbiased TextRank

The original (unbiased) TextRank method (Mihalcea and Tarau, ) iteratively computes the score of each vertex as:


where is the damping factor (typically ), and is the set of vertices which share an edge (with non-zero edge weight) with vertex .
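For reference, the weighted TextRank update is commonly written in the form below. The notation is an assumption introduced here for illustration and may differ from the symbols in the original equation (2): S(v_i) is the score of vertex v_i, d is the damping factor, N(v_i) is the set of neighbors of v_i, and w_{ij} is the weight of the edge between v_i and v_j.

```latex
S(v_i) = (1 - d) + d \sum_{v_j \in N(v_i)} \frac{w_{ji}}{\sum_{v_k \in N(v_j)} w_{jk}} \, S(v_j)
```

Each neighbor v_j passes on a share of its own score proportional to the weight of its edge to v_i, normalized by the total weight of the edges of v_j.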

Biased TextRank

The recently proposed biased TextRank method (Kazemi et al., ) iteratively computes the score of each vertex in the token-graph in the following manner:


where the only change with respect to the unbiased version in (2) is the addition of a bias term for vertex . For the task of keyword extraction, in (Kazemi et al., ) bias is defined as:


where is the sentence embedding for input ad text . Intuitively, biased TextRank tries to favor words which are closer to the overall (semantic) meaning of the input ad text. In (Kazemi et al., ), the authors demonstrate that this intuition works well and show the superiority of biased TextRank over the original TextRank method. In the remainder of the paper, we will refer to the above biasing method (as in (4)) as self-biasing since it computes the bias with respect to the input (ad text). Sentence-BERT (SBERT) (Reimers and Gurevych, ) is an effective sentence embedding method used in biased TextRank (Kazemi et al., ), and we leverage the same in our proposed method (explained below).
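The biased update can be sketched in a few lines of plain Python. This is an illustrative sketch, not the reference implementation of biased TextRank: the function name, the dense weight-matrix representation, the fixed iteration count, and the default damping factor of 0.85 (the value conventionally used for PageRank) are assumptions. Passing a bias of 1 for every vertex gives the unbiased update.

```python
def biased_textrank(weights, bias, d=0.85, iters=50):
    """Iterate S(i) = (1 - d) * bias[i] + d * sum_j (w_ji / sum_k w_jk) * S(j).

    `weights[i][j]` is the symmetric edge weight between vertices i and j
    (0 means no edge). `bias[i]` is the restart weight of vertex i, e.g. the
    cosine similarity between the word embedding and the ad text embedding.
    """
    n = len(bias)
    out_sum = [sum(row) for row in weights]  # total edge weight per vertex
    scores = [1.0] * n
    for _ in range(iters):
        new_scores = []
        for i in range(n):
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if weights[j][i] > 0 and out_sum[j] > 0
            )
            new_scores.append((1 - d) * bias[i] + d * incoming)
        scores = new_scores
    return scores


# Fully connected toy graph with three vertices and equal edge weights.
weights = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
ranked = biased_textrank(weights, bias=[0.9, 0.5, 0.1])
uniform = biased_textrank(weights, bias=[1.0, 1.0, 1.0])
```

On this symmetric toy graph the unbiased scores are all equal, so the ordering of `ranked` comes entirely from the bias term; this is the effect self-biasing is meant to have on words close to the overall meaning of the ad text.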

4.3. VisualTextRank

We start with an overview of the proposed VisualTextRank method in Section 4.3.1, and then go over the details of important components in Sections 4.3.2 and 4.3.3.

4.3.1. Overview

The biased TextRank method (3) has the following shortcomings when it comes to understanding an ad (i.e., extracting a keyword for image search):

  1. ads are usually tied to categories (e.g., IAB category taxonomy (7)) and TextRank is oblivious to this, and

  2. ad text can be short leading to a very sparse token graph with no or negligible edges above a reasonable similarity threshold.

In addition to the above shortcomings from a keyword extraction perspective, TextRank’s original motivation was not to extract queries suitable for ad images, and hence it lacks the visual aspect needed in an image search query (in both extractive and symbolic cases). In VisualTextRank, we address the above shortcomings as outlined below (detailed explanation after the outline).

  1. Category biasing: we introduce an ad category specific biasing term in addition to the self-biasing term in (4).

  2. Augmentation with similar ads’ text: for a given input ad, we fetch semantically similar ads from a pool of existing ads. We augment the token-graph with a filtered version of the text from similar ads.

  3. Augmentation with similar ads’ image tags: we also augment the token-graph with a filtered version of the image tags (as described in Section 3.3.2) of similar ads’ images.

Steps (2) and (3) above, not only alleviate the problem of short ad text, but also offer the capability of going beyond words in the ad text while coming up with the image query. As a result of augmenting the initial token-graph (as described in Section 4.2) with text and image tags from similar ads (augmentation details described later in Section 4.3.2), we obtain an augmented token-graph .

Combining the steps above, the vertex value update for VisualTextRank can be written as follows.


where is a vertex in the augmented graph , and denotes the set of vertices shares an edge with in . The details behind the category biasing term are described in Section 4.3.3, and the details behind the graph augmentation leading to are described in 4.3.2.

4.3.2. Token-graph augmentation with similar ads

Token graph augmentation has three steps as described below.

Retrieving similar ads

For augmentation, we assume a pool (set) of existing ads denoted by . An ad in has the following attributes: (i) ad text, and (ii) image tags for the ad image along with confidence scores for each tag. For each ad in the pool , we compute the semantic similarity (relevance) of the ad with respect to the input ad as follows:


where is an ad in , and denotes its relevance with respect to the input ad . A sentence embedding method like SBERT (Reimers and Gurevych, ) (trained for semantic textual similarity) can be used to obtain and . We use the above relevance score to obtain the top- similar ads from the pool for the given input ad .
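The retrieval step can be sketched as a nearest-neighbor lookup. The sketch below assumes that relevance is the cosine similarity between sentence embeddings of the input ad text and of each pool ad; the function names, the dictionary layout of a pool ad, and the toy 2-dimensional vectors (standing in for SBERT embeddings) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar_ads(input_emb, pool, k=2):
    """Return the texts of the k pool ads most similar to the input ad,
    sorted by decreasing relevance."""
    scored = [(cosine(input_emb, ad["embedding"]), ad["text"]) for ad in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy pool; the embeddings are made-up stand-ins for sentence embeddings.
pool = [
    {"text": "Sofas and tables on sale", "embedding": [0.9, 0.1]},
    {"text": "Best ETFs to buy", "embedding": [0.1, 0.9]},
    {"text": "Modern chairs delivered", "embedding": [0.8, 0.3]},
]
similar = top_k_similar_ads([1.0, 0.2], pool, k=2)
```

For a large pool, an approximate nearest-neighbor index would likely replace the exhaustive scan used here; the exhaustive version is kept only to make the ranking explicit.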

Augmentation with similar ads’ image tags

We augment the token-graph with the image tags of similar ads by selecting image tags which are semantically close to a word in the ad text of the similar ad. For example, as shown in Figure 4, if the similar ad has a word like ‘furniture’, and the corresponding ad image has tags like ‘chair’ and ‘table’ (which are semantically close to furniture), we select such tags for augmenting the original token-graph.

Figure 4. Illustrative example of VisualTextRank in action for a furniture ad. The green nodes in the token graph are augmented using the similar ad’s text and image tag, and eventually lead to ‘furniture’ being selected (as they form a distinct cluster with other furniture related words).

The details of our proposed image tag augmentation can be described as follows. We assume a bound on: (i) the maximum number of image tags that we can add (), and (ii) the minimum cosine similarity between an image tag and a word in its parent ad text to select the image tag for augmentation (). As described in Algorithm 1, we start with a list of similar ads sorted in decreasing order of relevance to the input ad text . For each similar ad , denotes the list of image tags associated with the corresponding ad image; this tags list is sorted in decreasing order by the confidence scores from the object detector. Given and , we select tags which are close to at least one word in the (i.e., ). We keep iterating as shown in Algorithm 1, until we collect image tags in .

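Data: similar ad texts (sorted by decreasing relevance to input), image tags list for each similar ad (sorted by decreasing confidence)
Result: image tags set to augment the token-graph
for each similar ad do
      for each image tag of the similar ad do
             if the tag is above the similarity threshold for at least one word in the similar ad's text then
                   if fewer than the maximum number of tags have been collected then
                         add the tag to the image tags set
                   end if
             end if
       end for
end for
Algorithm 1 Augmentation with similar ads’ image tags.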
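The tag selection loop described for Algorithm 1 can be sketched in Python as follows. This is a hedged illustration: the function name, the dictionary layout of a similar ad, the parameter names `max_tags` and `min_sim`, their default values, and the toy embeddings are all assumptions, not values from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_image_tags(similar_ads, word_emb, max_tags=2, min_sim=0.8):
    """Collect up to `max_tags` image tags that are close to at least one
    word of their parent ad text."""
    selected = []
    for ad in similar_ads:        # assumed sorted by decreasing relevance
        for tag in ad["tags"]:    # assumed sorted by decreasing confidence
            if len(selected) >= max_tags:
                return selected
            close = any(
                cosine(word_emb[tag], word_emb[w]) >= min_sim
                for w in ad["words"]
            )
            if close and tag not in selected:
                selected.append(tag)
    return selected


# Toy word vectors (made up for the example).
word_emb = {
    "furniture": [1.0, 0.1], "deals": [-1.0, 0.2],
    "chair": [0.9, 0.2], "table": [0.95, 0.15], "sky": [0.0, 1.0],
}
ads = [{"words": ["furniture", "deals"], "tags": ["chair", "sky", "table"]}]
tags = select_image_tags(ads, word_emb)
```

Here ‘chair’ and ‘table’ are kept because they are close to ‘furniture’, while ‘sky’ is dropped because it is not close to any word in the similar ad's text.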
Augmentation with similar ads’ text

In addition to image tags, we also use words from similar ads to augment the token-graph . Our proposed word augmentation method is described in Algorithm 2. It is similar in spirit to the image tag augmentation method. We keep iterating over words in an ad text in their order of occurrence, and select words which are above the similarity threshold (cosine similarity between the input ad text and the candidate word using SBERT embeddings). The set of selected words is used to augment .

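Data: similar ad texts (sorted by decreasing relevance to input)
Result: words set to augment the token-graph
for each similar ad text do
      for each word in the similar ad text (in order of occurrence) do
             if the word is above the similarity threshold with respect to the input ad text then
                   if fewer than the maximum number of words have been collected then
                         add the word to the words set
                   end if
             end if
       end for
end for
Algorithm 2 Augmentation with similar ads’ text.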

In the example in Figure 4, the word ‘furniture’ from the similar ad’s text was selected for augmentation using the above method.

4.3.3. Category biasing

We consider a set of category phrases . Such a set can be created by using phrases from a standard (flattened) ad category taxonomy, e.g., IAB categories (7) (see Appendix for details). For example, the set can contain words and phrases like ‘travel’, ‘legal services’, and ‘fashion’ which are used to denote different categories in the IAB taxonomy. Given such a set, we intend to find the closest category in the following manner


where and denote (SBERT) embeddings for the category phrase and the input ad text. We do not assume the presence of category labels for a given ad text in our setup, and the above method is a proxy to infer the closest category for a given ad text. Having inferred for given input ad text, we compute the category bias for each vertex in the augmented graph as:


where denotes the word associated with vertex . Intuitively, category biasing tries to up-weight the words closer to the inferred category. In the final vertex value update in (4.3.1), is used in conjunction with the self bias as defined as in (4).
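Category inference and category biasing can be sketched as below. The sketch assumes that the category bias of a word is the cosine similarity between the word embedding and the embedding of the inferred category phrase; that form, the function names, and the toy 2-dimensional vectors (standing in for SBERT embeddings) are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def infer_category(ad_emb, category_embs):
    """Return the category phrase whose embedding is closest to the ad text."""
    return max(category_embs, key=lambda c: cosine(category_embs[c], ad_emb))


def category_bias(word_embs, category_emb):
    """Bias each word by its similarity to the inferred category (assumed form)."""
    return {w: cosine(e, category_emb) for w, e in word_embs.items()}


# Toy embeddings (made up for the example).
categories = {
    "travel": [1.0, 0.0],
    "fashion": [0.0, 1.0],
    "legal services": [-1.0, 0.0],
}
ad_embedding = [0.9, 0.3]
words = {"flight": [0.8, 0.1], "cheap": [0.3, 0.7]}

closest = infer_category(ad_embedding, categories)
bias = category_bias(words, categories[closest])
```

In this toy setup the ad is assigned to ‘travel’, and ‘flight’ receives a higher category bias than ‘cheap’, so it would be up-weighted in the vertex value update.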

5. Multilingual Ad Text

In the multilingual setting, we assume the following: (i) ad text can be non-English, (ii) the image query is in English, and (iii) we can retrieve semantically similar (existing) English ads. Furthermore, in this paper, we assume access to a generic translation tool which can translate non-English ad text to English. Given the above assumptions, for multilingual ad text, we simply translate the ad text from its original language to English, then use the English version with VisualTextRank (i.e., fetching similar English ads, and doing category biasing). The intuition behind such an approach is that regardless of the language, visual components in the ad image will be strongly tied to the product mentions in the ad text. For example, for a furniture ad in English or an equivalent ad in French, the ad image is likely to have furniture. By borrowing from text and image tags of semantically similar English ads for a given multilingual ad, VisualTextRank facilitates a simple yet effective form of cross lingual learning. It also does not rely on having a large set of existing ads in the ad’s original language (except in the case of English ads).

In our experiments, we also explored the use of multilingual (xlm) SBERT (Reimers and Gurevych, ) to: (i) fetch similar English ads for a given multilingual ad (without translating it), (ii) run VisualTextRank with such xlm-SBERT embeddings to get a keyword (translated to English by using word-level-translation). As we explain in Section 6.4.3, this approach was inferior to the above ad text translation approach.

6. Results

6.1. Offline evaluation dataset

We collected ad text to image query data from Verizon Media Native’s stock image search feature (as illustrated in Figure 1). This feature is currently available in the onboarding workflow for small and medium businesses with relatively low advertising budgets. For a given ad text, the advertiser might make multiple search queries before converging on the final choice for onboarding (launching the ad campaign). We consider all such queries as relevant (golden set) for the given ad text. For example, for the ad text ‘Bank Foreclosing? Stop Foreclosure Now!’, if the user searches for ‘stress’, ‘eviction’, and ’bills’, all the three queries are labeled relevant queries. We sampled the collected data to create an evaluation data set with ads with ad text in English, and non-English (multilingual) ads spanning German, French, Spanish, Portuguese, Swedish, and Italian ad text. For the set of English ads, the image queries were editorially reviewed and corrected for spelling mistakes, and we will refer to this set as the ENG-eval set. For the multilingual set, each non-English ad was translated to English using an internal translation tool. The image queries were also editorially reviewed, translated to English (if they were not in English), and corrected for spelling mistakes as well (if they were already in English). We will refer to this as the non-ENG-eval set. While processing raw data to create the non-ENG-eval set we observed a significant number of cases where the advertiser tried only non-English queries, or had spelling mistakes in the English query (plausibly due to lack of English proficiency). Hence, editorial review was needed to correct such samples for proper evaluation with our proposed methods (and baselines). Note that both ENG-eval and non-ENG-eval datasets are of a unique nature due to the integration of the stock image search feature in the ad platform (allowing us to map ad text to corresponding ad image search queries).

6.2. Ads pool for similar ads retrieval

Since VisualTextRank leverages semantically similar ads for augmenting the input ad text, we need a pool of ads from which we can retrieve similar ads given an input ad text. For our experiments, we collected a sample of English ads (US-only) from Verizon Media Native (Yahoo Gemini), with their ad text and the raw ad image. We obtained image tags for the raw image (as explained in Section 3.3.2). The ads in this set were not created using the stock image search feature, and hence did not have associated image queries. The time range for the data pull was the same as that for the evaluation datasets described in Section 6.1. In the remainder of this paper, we will refer to this set of existing ads as ads-pool.
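The retrieval step over ads-pool can be illustrated with a minimal sketch. It assumes each pooled ad already carries a precomputed sentence embedding; the dictionary keys, the toy 3-dimensional vectors, and the function names are hypothetical and are not taken from our implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve_similar_ads(query_vec, ads_pool, top_k=1):
    """Return the top_k pooled ads whose embeddings are closest to query_vec.

    ads_pool is a list of dicts with hypothetical keys
    'text', 'image_tags', and 'embedding'.
    """
    scored = [(cosine(query_vec, ad["embedding"]), ad) for ad in ads_pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ad for _, ad in scored[:top_k]]

# Toy embeddings for illustration only; real sentence embeddings are much larger.
pool = [
    {"text": "Furniture and Decor Sale", "image_tags": ["furniture"],
     "embedding": [0.9, 0.1, 0.0]},
    {"text": "Sell your home fast", "image_tags": ["house", "home"],
     "embedding": [0.0, 0.2, 0.9]},
]
nearest = retrieve_similar_ads([0.8, 0.2, 0.1], pool, top_k=1)
```

In this toy example the query vector is closest to the furniture ad, so its text and image tags would be the candidates for augmenting the input ad text.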

6.3. Evaluation metrics

Offline metrics

For offline evaluation on the ENG-eval and non-ENG-eval sets, we consider three metrics: hard-accuracy, soft-accuracy, and w2v-similarity. Each metric is computed for a given ad text from the predicted keyword and the golden set of queries. Hard-accuracy simply checks whether the predicted keyword is in the golden set. Soft-accuracy uses the word2vec (Mikolov et al., 2013) similarity (as implemented in spaCy (Honnibal et al., 2020); details in the Appendix) between each query in the golden set and the predicted keyword to check whether the predicted keyword is approximately close (i.e., above a cosine similarity threshold) to any of the golden set queries. This takes care of minor differences between the golden set queries and the predicted keyword (e.g., run and running would be considered similar). Finally, w2v-similarity is the similarity between the predicted keyword and the closest query (in the word2vec sense) in the golden set. Note that all three metrics are computed on a single prediction per ad, since we focus on a single keyword as the ad image query.
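The three offline metrics can be sketched as follows. This is a minimal illustration under stated assumptions: the similarity function is passed in as a parameter (a toy lookup table here, standing in for a word2vec similarity), and the threshold value of 0.5 is an arbitrary placeholder, not the value used in our experiments.

```python
def hard_accuracy(pred, golden):
    """1.0 if the predicted keyword is exactly one of the golden queries."""
    return 1.0 if pred in golden else 0.0

def w2v_similarity(pred, golden, sim):
    """Similarity between the prediction and its closest golden query."""
    return max(sim(pred, query) for query in golden)

def soft_accuracy(pred, golden, sim, threshold):
    """1.0 if the closest golden query is above the similarity threshold."""
    return 1.0 if w2v_similarity(pred, golden, sim) > threshold else 0.0

# Toy similarity table (hypothetical values, for illustration only).
TOY_SIM = {("running", "run"): 0.8, ("running", "stress"): 0.2}

def toy_sim(a, b):
    if a == b:
        return 1.0
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))

golden_set = {"run", "stress"}
hard = hard_accuracy("running", golden_set)                      # exact match fails
soft = soft_accuracy("running", golden_set, toy_sim, threshold=0.5)
closest = w2v_similarity("running", golden_set, toy_sim)
```

Here the prediction ‘running’ misses under hard-accuracy but counts under soft-accuracy because it is close to ‘run’. Dataset-level numbers would be averages of these per-ad values.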

Online metrics

An advertiser session is defined as a continuous stream of interactions of the advertiser with the onboarding UI, with the gap between consecutive events bounded by a fixed number of minutes. To gauge the effectiveness of automatic ad text-to-image search, we track two events in each advertiser session: (i) selecting (clicking) stock images to get a preview of the ad, and (ii) completing the onboarding process. Tracking the image selection event, we define image selection rate as:


Tracking the onboarding event, we define onboarding rate as:


The online metrics listed above are expected to capture the effect of automatic image recommendations on the stock image selection rate and the onboarding completion rate. Note that an image selection event may or may not lead to an onboarding completion event (if the advertiser does not launch the ad campaign).
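A minimal sketch of how such session-level rates could be computed from an event log is given below. It rests on assumptions: the 30-minute gap, the event names, and the tuple layout are hypothetical, and each rate is taken to be the fraction of sessions containing at least one event of the given type.

```python
def sessionize(events, max_gap_minutes):
    """Group (advertiser_id, minute, event_type) tuples into sessions.

    A new session starts when the advertiser changes or when the gap to the
    previous event of the same advertiser exceeds max_gap_minutes.
    """
    sessions = []
    last_advertiser = None
    last_minute = None
    for advertiser, minute, event_type in sorted(events):
        if advertiser != last_advertiser or minute - last_minute > max_gap_minutes:
            sessions.append([])
        sessions[-1].append(event_type)
        last_advertiser, last_minute = advertiser, minute
    return sessions

def event_rate(sessions, event_type):
    """Fraction of sessions that contain at least one event of event_type."""
    if not sessions:
        return 0.0
    hits = sum(1 for session in sessions if event_type in session)
    return hits / len(sessions)

# Hypothetical event log: (advertiser_id, minute, event_type).
events = [
    ("a1", 0, "open"), ("a1", 5, "image_select"), ("a1", 12, "onboard"),
    ("a1", 200, "open"),
    ("a2", 3, "open"), ("a2", 4, "image_select"),
]
sessions = sessionize(events, max_gap_minutes=30)  # 30 is an assumed gap
selection_rate = event_rate(sessions, "image_select")
onboarding_rate = event_rate(sessions, "onboard")
```

In this toy log there are three sessions, two of which contain an image selection and one of which ends in onboarding. A lift between test and control buckets would then be the relative change in these rates.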

6.4. Offline results

We first describe multiple baseline methods for ad text-to-image query (most of them are existing keyword extraction methods), and then go over the offline results for VisualTextRank and the baselines (for both ENG-eval and non-ENG-eval sets).

model | hard accuracy | soft accuracy | avg. w2v similarity
position rank | 11.5% | 17.07% | 0.412
YAKE | 14.11% | 20.03% | 0.4655
topical page rank | 14.81% | 21.25% | 0.4515
multipartite rank | 15.51% | 21.78% | 0.4639
topic rank | 16.2% | 23.17% | 0.4719
tf-idf | 18.12% | 23.69% | 0.486
TextRank | 21.95% | 28.92% | 0.5297
(self) biased TextRank | 26.48% | 35.37% | 0.5739
VisualTextRank | 28.05% | 39.37% | 0.6147
Table 2. Offline metrics for ENG-eval.

6.4.1. Baseline methods

We use TF-IDF, existing graph based keyword extraction methods (Boudin, 2016) (TextRank, Position Rank, Topic Rank, Topical Page Rank, Multipartite Rank, YAKE), as well as the recently proposed (self) biased TextRank (Kazemi et al., 2020) as baselines. For all the baselines, as well as the VisualTextRank method, we consider only nouns, proper nouns and verbs as valid output keywords, i.e., we apply a POS tag filter (as guided by the POS tag insight in Section 3).
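The POS tag filter can be sketched as below. The example assumes the tokens have already been tagged with Universal POS labels (the label names NOUN, PROPN, and VERB follow that tag set); the tagged input and the function name are hypothetical.

```python
# Universal POS labels kept as valid output keywords.
VALID_POS = {"NOUN", "PROPN", "VERB"}

def filter_candidates(tagged_tokens):
    """Keep only tokens whose POS tag is a noun, proper noun, or verb."""
    return [token for token, pos in tagged_tokens if pos in VALID_POS]

# Hand-tagged tokens for illustration; a POS tagger would produce these pairs.
tagged = [("Stop", "VERB"), ("Foreclosure", "NOUN"), ("Now", "ADV"), ("!", "PUNCT")]
candidates = filter_candidates(tagged)
```

Only the surviving candidates are eligible to be returned as the image query, regardless of which ranking method scores them.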

6.4.2. ENG-eval results

Table 2 shows the performance of the proposed VisualTextRank method versus the baselines outlined in Section 6.4.1 for the ENG-eval set. As shown in Table 2, VisualTextRank significantly outperforms the closest baseline, (self) biased TextRank, by about 11% in terms of soft-accuracy (39.37% versus 35.37%).

6.4.3. Non-ENG-eval results

For brevity, Table 3 shows only the performance of the proposed VisualTextRank method versus the (self) biased TextRank method (the best baseline in the ENG-eval case) and TextRank for the non-ENG-eval set; the results for the other baselines listed in Table 2 are in line with the observations for the ENG-eval set.

model | hard accuracy | soft accuracy | avg. w2v similarity
TextRank | 13.19% | 17.18% | 0.4403
(self) biased TextRank | 22.39% | 29.75% | 0.5302
VisualTextRank | 23.62% | 31.60% | 0.5563
Table 3. Offline metrics for non-ENG-eval.

As shown in Table 3, in terms of soft-accuracy, VisualTextRank outperforms (self) biased TextRank in the multilingual setting as well (31.60% versus 29.75%). Note that VisualTextRank leverages similar English ads for multilingual (input) ad text in this setup (as explained in Section 5). Compared to the above result for VisualTextRank, the xlm-SBERT based approach described in Section 5 led to poorer soft accuracy and average w2v similarity.
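The relative soft-accuracy lifts implied by Tables 2 and 3 can be checked with a short calculation. The helper name below is hypothetical; the inputs are the soft-accuracy values for VisualTextRank and (self) biased TextRank from the two tables.

```python
def relative_lift(new, baseline):
    """Relative improvement of new over baseline, in percent."""
    return 100.0 * (new - baseline) / baseline

# Soft accuracy: VisualTextRank vs (self) biased TextRank.
eng_lift = relative_lift(39.37, 35.37)      # ENG-eval, Table 2
non_eng_lift = relative_lift(31.60, 29.75)  # non-ENG-eval, Table 3
```

The ENG-eval lift works out to roughly 11%, and the non-ENG-eval lift to roughly 6%.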

6.5. Ablation study

In this section, we study the contribution of different components of VisualTextRank via offline performance on the ENG-eval and non-ENG-eval sets. Table 4 shows the impact of different components on offline metrics for the ENG-eval set. The baseline for the ablation study is the self-biased TextRank method (line 1 in Table 4), which achieves 35.37% soft accuracy. With the addition of category biasing to self biasing (as explained in Section 4.3.3), soft accuracy improves to 37.80% (line 3 in Table 4). The second major improvement (39.02%, line 5 in Table 4) comes from adding one image tag from similar English ads (as explained in Section 4.3.2). Combining it with the similar ads’ text (as explained in Section 4.3.2), we get the best result of 39.37% soft accuracy (line 7 in Table 4). We notice that: (i) increasing the number of tags or words from the similar ad does not help, because it takes attention away from the ad text words (which are more likely to be in the golden set); (ii) adding image tags without pre-filtering by Algorithm 1 (line 6) yields lower soft accuracy than with the pre-filtering (line 5).

line | model | hard accuracy | soft accuracy | avg. w2v similarity
1 | self biased TextRank | 26.48% | 35.37% | 0.5739
2 | cat biased TextRank | 27.70% | 35.71% | 0.5975
3 | self+cat biased TextRank | 28.92% | 37.80% | 0.6061
4 | self+cat biased + nbr text | 28.57% | 38.15% | 0.6074
5 | self+cat biased + nbr img tag | 28.75% | 39.02% | 0.611
6 | self+cat biased + nbr img tag | 28.92% | 38.5% | 0.6111
7 | self+cat biased + nbr img tag + nbr text | 28.05% | 39.37% | 0.6147
Table 4. VisualTextRank ablation study for ENG-eval.

For the non-ENG-eval set, the ablation study (Table 5) shows that similar ads in English provide useful information for the non-English ad text-to-image query task (and hence facilitate cross-lingual learning). In particular, the results follow the same pattern as for the ENG-eval set, except that the best result (line 5 in Table 5) is achieved by augmentation with only similar image tags, and no similar ad text words.

line | model | hard accuracy | soft accuracy | avg. w2v similarity
1 | self biased TextRank | 22.39% | 29.75% | 0.5235
2 | cat biased TextRank | 21.78% | 28.83% | 0.5409
3 | self+cat biased TextRank | 24.23% | 31.29% | 0.5575
4 | self+cat biased + nbr text | 23.93% | 31.60% | 0.554
5 | self+cat biased + nbr img tag | 24.85% | 32.21% | 0.5676
6 | self+cat biased + nbr img tag | 24.54% | 31.9% | 0.5608
7 | self+cat biased + nbr img tag + nbr text | 23.62% | 31.60% | 0.5563
Table 5. VisualTextRank ablation study for non-ENG-eval.

Table 6 shows two (anonymized) examples of the data samples in the non-ENG-eval set with image search queries, along with the outputs of (self) biased TextRank and VisualTextRank. The first sample in Table 6 is an English ad text, and the second sample is a Spanish ad text. In both cases, VisualTextRank returns a keyword matching a (golden) query, unlike (self) biased TextRank.

ad text | user queries | biased TextRank | VisualTextRank
Online Store. Asian antique and vintage furniture | furniture; wooden furniture | antique | furniture
Necesitas Vender Tu Casa Rápido? | home selling; house | sell | house
Table 6. Biased TextRank vs VisualTextRank examples.

For these two ad examples, we also show their (anonymized) nearest neighbours and the tags of the corresponding neighbour images (Table 7). Note that in both cases the neighbour image tags coincide with the ground truth user query and shift the algorithm’s predictions towards the correct output (furniture and house, respectively).

ad text | nearest nbr ad | nbr image tag
Online Store. Asian antique and vintage furniture | Furniture and Decor Sale. Up to 70% Off Top Brands And Styles! | furniture
Necesitas Vender Tu Casa Rápido? | Homeowners Could Sell Their Homes Fast. ____ realtors get the job done. | house; home
Table 7. Examples of ads nearest neighbours and image tags.

6.6. Online results

We conducted an A/B test where advertisers were split into test and control buckets (50-50 split). Test advertisers saw recommended images (via an automatic text-to-image search query) when they accessed the image search feature, while control advertisers saw a random set of images. Both test and control advertisers were permitted to query the stock image library further after seeing the initial set of (recommended or random) images. Due to latency constraints, a lighter version of VisualTextRank (with category biasing, averaged word2vec embeddings for text instead of SBERT, and no augmentation via similar ads) was used. The end-to-end latency of this version was below one second. In the A/B test, we saw a 28.7% lift in the selection rate in the test bucket (compared to control). In terms of onboarding rate, we saw a 41.6% lift. The data was collected over a one-month period, and was limited to US-only traffic. Note that the final image selected from the stock image library need not be from the initial set of images shown to the advertiser. However, the selection rate lift indicates higher adoption of stock images (driven by initial recommendations, which increase the advertiser’s interest in querying further). The lift in onboarding rate is indicative of the positive impact on the ad platform.

7. Discussion

The motivation behind VisualTextRank stemmed from insufficient data for supervised training for the ad text-to-image query task. As more advertisers explore stock images (influenced by VisualTextRank), we may collect enough data to fine-tune pre-trained multi-modal models (Li et al., 2020b) on our task. Cross-lingual learning with such multi-modal models for our task is another future direction.


  • N. Bhamidipati, R. Kant, and S. Mishra (2017) A large scale prediction engine for app install clicks and conversions. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17.
  • F. Boudin (2016) Pke: an open source python-based keyphrase extraction toolkit. In COLING 2016.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • S. Herdade, A. Kappeler, K. Boakye, and J. Soares (2019) Image captioning: transforming objects into words. arXiv preprint arXiv:1906.05963.
  • M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd (2020) spaCy: Industrial-strength Natural Language Processing in Python.
  • Z. Hussain, M. Zhang, X. Zhang, K. Ye, C. Thomas, Z. Agha, N. Ong, and A. Kovashka (2017) Automatic understanding of image and video advertisements. In CVPR.
  • [7] IAB categories.
  • [8] A. Kazemi, V. Pérez-Rosas, and R. Mihalcea. Biased textrank: unsupervised graph-based content extraction. COLING 2020.
  • I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy (2017) OpenImages: a public dataset for large-scale multi-label and multi-class image classification.
  • L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2020a) What does bert with vision look at?. In ACL (short).
  • X. Li, X. Yin, C. Li, X. Hu, P. Zhang, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, Y. Choi, and J. Gao (2020b) Oscar: object-semantics aligned pre-training for vision-language tasks. ECCV 2020.
  • T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV.
  • [13] R. Mihalcea and P. Tarau. TextRank: bringing order into text. In Proceedings of EMNLP 2004.
  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. NIPS’13.
  • S. Mishra, M. Verma, and J. Gligorijevic (2019) Guiding creative design in online advertising. In Proceedings of the 13th ACM Conference on Recommender Systems, RecSys ’19.
  • S. Mishra, M. Verma, Y. Zhou, K. Thadani, and W. Wang (2020) Learning to create better ads: generation and ranking approaches for ad creative refinement. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM ’20.
  • S. Qiao, L. Chen, and A. Yuille (2020) DetectoRS: detecting objects with recursive feature pyramid and switchable atrous convolution. arXiv preprint arXiv:2006.02334.
  • [18] N. Reimers and I. Gurevych. Sentence-bert: sentence embeddings using siamese bert-networks. EMNLP 2020.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NIPS 2015.
  • P. Sharma, N. Ding, S. Goodman, and R. Soricut (2018) Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL 2018.
  • [21] Unsplash: Photos for everyone.
  • K. Ye and A. Kovashka (2018) ADVISE: symbolism and external knowledge for decoding advertisements. In ECCV 2018, pp. 868–886.
  • Y. Zhou, S. Mishra, J. Gligorijevic, T. Bhatia, and N. Bhamidipati (2019) Understanding consumer journey using attention based recurrent neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19.
  • Y. Zhou, S. Mishra, M. Verma, N. Bhamidipati, and W. Wang (2020) Recommending themes for ad creative design via visual-linguistic representations. In Proceedings of The Web Conference 2020, WWW ’20.

Notes on Reproducibility

In this section, we list helpful information for reproducing the results in our paper.

POS tagging and offline metrics

For POS tagging based filters (i.e., to keep only nouns, proper nouns and verbs), we used spaCy’s POS tagger (Honnibal et al., 2020). For the offline evaluation metrics based on word2vec similarity between the ground truth query and the model output, we used spaCy’s word similarity function, implemented using word2vec embeddings (trained on the Google News dataset) for English text (Honnibal et al., 2020).

TextRank and baselines for offline evaluation

We used the PKE (Boudin, 2016) library for running experiments with the baseline keyword extraction methods: Position Rank, Topic Rank, Topical Page Rank, Multipartite Rank, and YAKE (as reported in Section 6.4.2). For TextRank and biased TextRank, we used an existing implementation and set the damping factor, the number of iterations (chosen for the node importance estimate to converge), and the node similarity threshold (using SBERT embeddings). The SBERT (Reimers and Gurevych) embeddings were based on stsb-distilbert-base, as listed among the SBERT pre-trained models. We also explored larger BERT models such as BERT-base, BERT-large, and RoBERTa-large, but the benefits were not significant.
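To make the roles of the damping factor, iteration count, and node similarity threshold concrete, the following is a minimal sketch of a biased TextRank style power iteration. It is not the implementation used in our experiments: the default values (0.85, 50, 0.3) are illustrative placeholders, the bias weights are normalized to sum to one for convenience, and the similarity matrix is a hand-made toy example.

```python
def biased_textrank(similarity, bias, damping=0.85, iterations=50, threshold=0.3):
    """Score nodes with a PageRank-style iteration whose restart term is biased.

    similarity: symmetric matrix of pairwise node similarities.
    bias: non-negative per-node bias weights (normalized internally).
    Edges with similarity below the threshold are dropped.
    """
    n = len(bias)
    # Keep only sufficiently similar pairs; no self-loops.
    weights = [
        [similarity[i][j] if i != j and similarity[i][j] >= threshold else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    total_bias = sum(bias)
    restart = [b / total_bias for b in bias]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Mass flowing into node j from its neighbours.
            incoming = sum(
                weights[i][j] / out_sum[i] * scores[i]
                for i in range(n) if out_sum[i] > 0
            )
            new_scores.append((1 - damping) * restart[j] + damping * incoming)
        scores = new_scores
    return scores

words = ["antique", "furniture", "store"]
toy_similarity = [
    [1.0, 0.6, 0.2],
    [0.6, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]
toy_bias = [0.2, 0.6, 0.2]  # hypothetical bias favouring 'furniture'
scores = biased_textrank(toy_similarity, toy_bias)
best_word = words[scores.index(max(scores))]
```

In this toy graph the edge between ‘antique’ and ‘store’ falls below the threshold and is dropped, so ‘furniture’ receives mass from both neighbours as well as the largest restart weight.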

Category list for computing category bias

We used a list of categories as the category set defined in Section 4.3.3. Each entry in this set is a phrase corresponding to a category in the flattened IAB ads category taxonomy (7). For example, the ‘Arts & Entertainment’ high level IAB category has ‘Books & Literature’ and ‘Music’ as subcategories (7). After flattening, ‘Arts & Entertainment’, ‘Books & Literature’ and ‘Music’ are listed as three separate entries in the category set used for computing category bias.
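The flattening step can be illustrated with a short sketch. The nested dictionary layout and the function name are assumptions made for this example; only the category names come from the text above.

```python
def flatten_taxonomy(taxonomy):
    """Turn a {parent: [subcategories]} mapping into one flat list of phrases."""
    flat = []
    for parent, children in taxonomy.items():
        flat.append(parent)      # the high level category is its own entry
        flat.extend(children)    # each subcategory becomes a separate entry
    return flat

taxonomy = {"Arts & Entertainment": ["Books & Literature", "Music"]}
category_set = flatten_taxonomy(taxonomy)
```

Each phrase in the resulting list is then treated as an independent category when computing category bias.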


For the best results from VisualTextRank (for both ENG-eval and non-ENG-eval), we used a configuration specified by the following hyperparameters:

  • damping factor
  • node similarity threshold (using distilbert based SBERT embeddings)
  • min. tag similarity for augmentation
  • min. word similarity for augmentation
  • for category biasing, a cutoff below which all bias values are set to zero.

Apart from using distilbert based SBERT, we also explored larger BERT models such as BERT-base, BERT-large, and RoBERTa-large, but the benefits were not significant. For the multilingual ad experiment with xlm-SBERT (i.e., without ad text translation) described in Section 5, we used the stsb-xlm-r-multilingual model, which was trained on parallel data for 50+ languages.