A Price-Per-Attention Auction Scheme Using Mouse Cursor Information

01/21/2020 ∙ by Ioannis Arapakis, et al. ∙ Aalto University ∙ Universitat Pompeu Fabra

Payments in online ad auctions are typically derived from click-through rates, so that advertisers do not pay for ineffective ads. But advertisers often care about more than just clicks: this is the case, for example, when they aim to raise brand awareness or visibility. There is thus an opportunity to devise a more effective ad pricing paradigm, in which ads are paid for only if they are actually noticed. This article contributes a novel auction format based on a pay-per-attention (PPA) scheme. We show that the PPA auction inherits the same desirable properties (strategy-proofness and efficiency) as its pay-per-impression and pay-per-click counterparts, and that it also compares favourably in terms of revenues. To make the PPA format feasible, we also contribute a scalable diagnostic technology that predicts user attention to ads in sponsored search using raw mouse cursor coordinates only, regardless of the page content and structure. We use the resulting attention predictions in numerical simulations to evaluate the PPA auction scheme. Our results show that, in relevant economic settings, the PPA revenues would be strictly higher than those of the existing auction payment schemes.


1. Introduction

The majority of online advertisements are sold through auctions. Online ad auctions differ in their baseline format,[1] but all existing formats adopt either pay-per-impression (PPI) or pay-per-click (PPC) schemes. While important parts of the market still adopt PPI, in recent years the market has increasingly shifted toward the PPC scheme, which presents several advantages for advertisers.[2] First, it insures advertisers against the risk of paying for ineffective ads, since under the PPC scheme they pay only if their ads are actually clicked.[3] Second, it better aligns the platform's incentives with the advertisers' objectives, since it links the former's revenue directly to the Click-Through Rate (CTR), which represents the main performance measure targeted by advertisers. Perhaps for these reasons, PPC systems also tend to induce lower costs per click than PPI on average.

[1] This is especially true when multiple slots are sold at the same time. Google, for example, introduced and still uses the celebrated Generalized Second Price (GSP) auction format, whereas Facebook follows the classic Vickrey-Clarke-Groves (VCG) format. For single-slot ad sales, GSP and VCG are identical.
[2] The PPC format is predominant for ads sold through auctions. With posted prices, PPI systems tend to be preferred instead; see, e.g., Choi et al. (2018).
[3] Google reported that more than half of ad impressions go unnoticed by users (Google, 2014).

CTRs, however, are often an imperfect measure of ad performance. While clicks are one of the main objectives of advertisers, they need not be the only one, nor the most important.[4] This is especially the case for advertisers whose campaigns aim mainly to generate brand or product awareness, rather than to induce direct online sales, which is the case for many, if not most, of the highest-value advertisers. Such advertisers ultimately care more about making sure that consumers notice the ad, and may thus attach a value to grabbing the consumer's attention beyond the click it may or may not trigger. Therefore, if consumer attention to ads could be measured in an accurate and reliable way, not only would it provide a more effective target for advertiser campaigns, but it could also serve as a basis for auctions which directly price user attention, thereby further aligning the platforms' incentives with the objectives of the advertisers. This creates an opportunity for a novel ad pricing paradigm to take root: one in which ads are paid for only if they are actually noticed.

[4] Blake et al. (2015), for example, discuss evidence which suggests that advertisers are willing to pay to post an ad beyond the clicks it is likely to generate.

In this article, we introduce a novel auction format based on a pay-per-attention (PPA) scheme, in which advertisers' payments are proportional to the probability that their ad is noticed by users. To make this PPA format possible, we also contribute a scalable diagnostic technology to measure users' attention to ads, using client-side user interactions derived from mouse cursor movements (without such a diagnostic technology, the PPA auction format would not be feasible in practice). Finally, we show both analytically and through numerical simulations that the two innovations combined may enable new ad platforms to extract revenues which are strictly higher in relevant settings, and in any environment never lower, than those of the existing PPI and PPC auction payment schemes.

We begin by introducing our novel PPA auction scheme, a second-price auction which takes measures of users' attention as input, similar to the way existing PPC formats take clicks as input. We show that, like its PPI and PPC counterparts, the PPA auction is strategy-proof (that is, bidding one's own value is optimal regardless of the bids placed by others) and efficient (that is, the ad slot is always allocated to the advertiser with the highest valuation), and that it has desirable revenue properties. In particular, we show that in relevant economic environments the PPA revenues are strictly larger, and in any environment never lower, than those of the PPI and PPC formats. Environments in which the PPA yields strictly higher revenues include, for example, those in which advertisers' valuations are positively correlated with the probability of their ads being noticed, or in which at least some advertisers are subject to framing effects, in the sense that they bid taking into account only the components of their value which are made salient by the different auction formats.
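As an illustration of the mechanics, the PPA second-price rule can be sketched in a few lines of code. This is a minimal sketch of our own (the function name and the dict-based interface are not the paper's):

```python
import random

def ppa_second_price(bids, attention_prob):
    """Single-slot pay-per-attention (PPA) second-price auction.

    `bids` maps bidder ids to bids expressed per unit of attention.
    The highest bidder wins (ties broken by a fair lottery) and pays
    the highest losing bid, but only when the ad is noticed, so the
    expected payment is scaled by `attention_prob`.
    """
    top = max(bids.values())
    winner = random.choice([i for i, b in bids.items() if b == top])
    price = max((b for i, b in bids.items() if i != winner), default=0.0)
    return winner, attention_prob * price

# Truthful bids equal to each bidder's value-per-attention (strategy-proofness).
winner, expected_payment = ppa_second_price(
    {"a": 5.0, "b": 3.0, "c": 2.0}, attention_prob=0.4
)
```

With truthful bids, the winner is the highest-value-per-attention bidder, and the expected payment is the attention probability times the runner-up bid.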

Then, we introduce our diagnostic technology, which is specifically designed to measure consumer attention to ads in sponsored search. More concretely, it is based on the analysis of mouse cursor movements and builds upon previous work examining user engagement with direct displays in web search (Arapakis and Leiva, 2016). We conduct a crowdsourced user study and collect mouse cursor data from participants who interacted with instrumented Search Engine Result Pages (SERPs) in brief transactional search tasks with Google Search. The SERPs contained, among other elements, sponsored ads that were served under different formats and positions. We use the collected data to train several baseline machine learning models and a recurrent neural network model to predict user attention to SERP ads. We further demonstrate noticeable improvements by our recurrent neural network (which uses raw mouse cursor data) over the baseline machine learning models (which rely on ad-hoc and domain-specific features).
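To make the contrast with the raw-coordinate approach concrete, here is a sketch of the kind of ad-hoc, domain-specific features that baseline models rely on. Both the feature set and the interface are illustrative assumptions, not the exact ones used in the study:

```python
import math

def cursor_features(trajectory, ad_box):
    """Hand-crafted cursor features of the kind baseline models use.

    `trajectory` is a list of (t, x, y) samples; `ad_box` is
    (left, top, right, bottom) in page coordinates.
    """
    # Total cursor "travel distance" across consecutive samples.
    dist = sum(
        math.hypot(x2 - x1, y2 - y1)
        for (_, x1, y1), (_, x2, y2) in zip(trajectory, trajectory[1:])
    )
    left, top, right, bottom = ad_box
    # Time spent with the cursor hovering inside the ad's bounding box.
    hover = sum(
        t2 - t1
        for (t1, x, y), (t2, _, _) in zip(trajectory, trajectory[1:])
        if left <= x <= right and top <= y <= bottom
    )
    return {"travel_distance": dist, "ad_hover_time": hover}

feats = cursor_features(
    [(0.0, 0, 0), (0.5, 30, 40), (1.0, 35, 40)],  # short drift toward the ad
    ad_box=(20, 20, 100, 100),
)
```

A recurrent model instead consumes the (t, x, y) sequence directly, avoiding the need to design such features per page layout.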

Next, we provide numerical simulations based on a variation of the statistical model proposed by Ausubel and Baranov (2018), both to exemplify the main insights highlighted by our theoretical findings and to illustrate their significance in the context of the distribution of attention probabilities observed in our online user study. Taken together, our results show that, combined with an effective and reliable technology to predict users' attention, such as the one proposed in this work, the PPA auction may draw from sources of economic surplus that are precluded to the existing formats. Our analysis focuses on some of these possibilities, but the main innovations of our work open other directions for future research, which we discuss in the final section of this article.

2. Related Work

In what follows, we provide a review of the state of the art in the two areas which our work bridges: online ad auctions and mouse cursor analysis.

2.1. Online Ad Auctions

A large body of work has analysed the various auction formats used to sell online ads. For the sake of conciseness, we only review work that focuses on the two most common formats, namely the Generalized Second Price (GSP) auction (used e.g., by Google and Taobao) and the Vickrey-Clarke-Groves (VCG) mechanism (used e.g., by Facebook and Quora). For the case of a single slot, both formats coincide with the sealed-bid second-price auction.

2.1.1. Models with Competitive Bidding

The VCG is a classic and well-known mechanism in the economics literature, introduced by Vickrey (1961), Clarke (1971) and Groves (1973). It is both strategy-proof (that is, bidding one's own value is optimal regardless of the bids placed by others) and efficient (that is, the ad slot is always allocated to the advertiser with the highest valuation). The study of the GSP was pioneered by Aggarwal et al. (2006), Edelman et al. (2007) and Varian (2007), who introduced an equilibrium refinement of the GSP (which induces the same revenues and allocations as the VCG), and further refinements were later provided by others (Edelman and Schwarz, 2010; Milgrom and Mollner, 2018). Interestingly, most of the literature on GSP auctions studies environments with complete information. A notable exception considered a standard independent private values environment (Gomes and Sweeney, 2014). Borgers et al. (2013) also maintain the complete information assumption, but consider a more general model of CTRs and valuations. Without resorting to Edelman et al. (2007) and Varian (2007)'s refinement, they provide a more critical view of the GSP; see also Paes Leme and Tardos (2010). Athey and Nekipelov (2014) introduced uncertainty over quality scores in a model with competitive bids, to account for the fact that existing models assume that bids are customised for a single user query but in practice queries arrive more quickly than advertisers can change their bids.

2.1.2. Coordinated and Collusive Bidding

Recent literature has studied online ad auctions with coordinated or collusive bidding, to account for the increasing diffusion of digital marketing agencies and of agency trading desks. Mansour et al. (2012), for example, pointed at the potential risk of collusive bidding that intermediaries pose for online ad auctions, and focused on the ad exchanges used for sponsored ads. Decarolis et al. (2018) studied agency bidding in both the GSP and VCG auctions, allowing for the co-presence of collusive and competitive bidders. The earlier literature on "bidding rings" mostly focused on single-unit mechanisms in which either non-cooperative behavior is straightforward (Mailath and Zemsky, 1991) or the coalition is assumed to include all bidders (McAfee and McMillan, 1992; Hendricks et al., 2008). In multi-unit settings, Bachrach (2010) studied collusive bidding in the VCG, but from a cooperative game theory perspective.

2.1.3. Variations of the Baseline Formats

In the baseline GSP auction, Yahoo! initially ranked advertisers by bids. Then Google adopted a ranking based on value per impression, in which bids are weighted by quality scores, designed to increase revenues. Other variations of the baseline GSP apply instead reservation prices in the spirit of Myerson (1981), thereby offering a compromise between efficiency and revenue maximization. Ostrovsky and Schwarz (2011) studied the effects of applying optimal per-impression reserve price and Roberts et al. (2016) studied the revenue optimal auction, empirically showing that it led to good trade-offs between revenue and other objectives. Theoretical results on optimal trade-offs are also available (Bachrach et al., 2014; Jehiel and Lamy, 2015). Finally, Thompson and Leyton-Brown (2013) studied a variety of ways of increasing revenue, including optimal reserve prices as well as alternative ranking algorithms, and others have studied the welfare effects of reservation prices in various settings (Edelman and Schwarz, 2010; Lahaie, 2011; Athey and Ellison, 2011).

2.1.4. PPC, PPI, and Consumer Attention

A few works compare PPC and PPI schemes, but mostly in non-auction pricing models of ads — where, unlike auction settings, PPI schemes are predominant; see Choi et al. (2018) and references therein. In deterministic settings, Mangani (2004) compares revenues between PPC and PPI schemes, and Fjell (2009) determines the optimal choice between both schemes, which have been further studied with stochastic arrivals of viewers and advertisers (Najafi-Asadolahi and Fridgeirsdottir, 2014; Fridgeirsdottir and Najafi-Asadolahi, 2018).

2.2. Mouse Cursor Analysis

The construct of attention, broadly indicating a high degree of involvement in a given activity, has become a common currency on the Web. Objective measurements of attentional processes are increasingly sought after by both the media industry and scholarly communities to explain or predict user behavior. In recent years, a large body of research (Shapira et al., 2006; Guo and Agichtein, 2008, 2010; Guo et al., 2012; Huang et al., 2012a; Navalpakkam et al., 2013; Lagun et al., 2014a; Liu et al., 2015; Martín-Albo et al., 2016; Chen et al., 2017) has demonstrated the utility of mouse cursor analysis as a low-cost and scalable proxy of visual attention. In line with this evidence, several works have closely investigated the user interactions that stem from mouse cursor data for various use cases, such as web search (Guo and Agichtein, 2008, 2010; Guo et al., 2012; Lagun et al., 2014a; Liu et al., 2015; Arapakis and Leiva, 2016; Chen et al., 2017) or web page usability evaluation (Atterer et al., 2006; Arroyo et al., 2006; Leiva, 2011). In what follows, we review those research efforts that have focused on mouse cursor analysis to predict user interest and attention. We deliberately leave out works that investigate ad impression forecasting using click logs or query traffic features (Kolesnikov et al., 2012; Jiang et al., 2016; Nath et al., 2013; Guo and Agichtein, 2010; Lagun et al., 2014b; Zhai et al., 2016; Mao et al., 2018b), since our approach relies solely on implicit, online interaction signals instead of historical, click-through data. Furthermore, while the PPA scheme we propose is independent of how user attention is detected, the diagnostic technology we introduce here addresses a desktop setting. Hence, works on ad noticeability or attention in mobile browsing (Barbieri et al., 2016; Li et al., 2017; Wang et al., 2018; Mao et al., 2018a; Grusky et al., 2017; Lagun et al., 2016) also fall outside the scope of this article.

2.2.1. Measuring User Interest

For a long time, user models of scanning behaviour in SERPs have been assumed to be linear, as users tend to explore the list of search results from top to bottom. Today this is no longer the case, since SERPs now include several heterogeneous modules (direct displays) such as image and video search results, featured snippets, or ads (Arapakis et al., 2015). To account for this SERP heterogeneity, Diaz et al. (2013) proposed a generalization of the classic linear scanning model which incorporates ancillary page modules. Here, a user interaction log is represented by a sequence of visited (mouse-hovered) SERP modules. This model can help improve SERP design by anticipating searchers' engagement patterns given a proposed arrangement of the SERP. However, it is not designed to effectively measure whether a user is actually paying attention to ads, and it does not exploit the potential information encoded in mouse coordinates.

Early research considered simple, coarse-grained features derived from mouse cursor data to be surrogate measurements of user interest, such as the amount of mouse cursor movement (Shapira et al., 2006) or the mouse cursor's "travel time" (Claypool et al., 2001). More recent work has adopted fine-grained mouse cursor features, which have been shown to be more effective. For example, Guo and Agichtein (2008) found differences in mouse cursor distances between informational and navigational queries, and could classify the query type using mouse cursor movements more accurately than using clicks alone. In a similar vein, Guo and Agichtein (2010) examined implicit interaction signals like mouse cursor movements, hovers, and scrolling activity to accurately infer search intent and interest in SERPs. They focused on automatically identifying a user's research or purchase intent based on features of the interaction. These approaches have been directed at predicting general-purpose web search outcomes like search success (Guo et al., 2012) or search satisfaction (Liu et al., 2015) and, in that respect, lack the granularity in predicting attention to particular direct displays of a SERP, such as ads, that our proposed modelling approach achieves.

In more recent work, Huang et al. (2012b) and Speicher et al. (2013) modelled mouse cursor interactions on SERPs, extending click models to compute more accurate relevance judgements for the search results. Along similar lines, Huang et al. (2011) sought to understand result relevance and search abandonment by mining mouse cursor behaviour on SERPs. They showed that the mouse cursor position is mostly aligned with eye gaze, especially on SERPs, and that it could be used as a good proxy for predicting good and bad abandonment. Diriye et al. (2012) extended this work and investigated the effectiveness of mouse cursor interactions for predicting the reasons for observed search abandonment: whether the user's information need was satisfied or whether they were dissatisfied with the search results. Feild et al. (2010) considered mouse movements, among other sensory and log-based features, to predict success and frustration in an information seeking task, and Guo et al. (2012) examined mouse cursor movements to identify patterns of examination and interaction behaviour that indicated search success. Finally, Guo and Agichtein (2012) looked at mouse cursor interactions after the click onto the landing page and found that these post-click interactions (e.g., mouse cursor movements, dwell time) correlate with document relevance. They showed that a post-click behaviour model is more effective than simply using dwell time for computing document relevance scores.

2.2.2. Measuring User Attention

Most research studies assume that eye fixation means examination, including studies from industry (Brightfish et al., 2018). However, Liu et al. (2014) noticed that almost half of the search results fixated by users are not actually read, since there is often a preceding skimming step in which the user quickly looks at the search result without reading it. Based on this observation, they proposed a two-stage examination model: a first "from skimming to reading" stage and a second "from reading to clicking" stage. Interestingly, they showed that both stages can be predicted from mouse movement behaviour, which can be collected at large scale.

Cursor movements can therefore be used to estimate user attention on SERP components, including traditional snippets, aggregated results, maps, and advertisements, among others. However, works that employ mouse cursor information to predict user attention to specific elements within a web page have been scarce. This is understandable, considering the inherently complex nature of mouse cursor data and the difficulty of constructing ground-truth labels at scale. Despite these challenges, some of the early work by Arapakis et al. (2014a, b) investigated the utility of mouse movement patterns to measure within-content engagement on news pages and predict reading experiences. Lagun et al. (2014a) introduced the concept of frequent cursor subsequences (namely motifs) in the estimation of result relevance. Although their work proposes a more general approach to mouse cursor pattern analysis, it does not target specific user engagement proxies such as attention. Similarly, Liu et al. (2015) applied the motifs concept to SERPs in order to predict search result utility, searcher effort, and satisfaction at a search task level. Their approach assumes a uniform engagement with all parts of the page and, in that sense, lacks the desired granularity in the analysis of mouse cursor interactions.

To our knowledge, the closest work to ours is that of Arapakis and Leiva (2016), which investigated user engagement with direct displays on SERPs, and more specifically with the Knowledge Graph display.[5] Similarly, we implement a predictive modelling framework to measure user attention to ads, which can be seen as a particular instance of direct displays. However, our work differs significantly from theirs in three key respects. First, we compare their machine learning model (which relies on ad-hoc and domain-specific features) against a recurrent neural network model (which uses raw mouse cursor data) to predict user attention, and demonstrate noticeable improvements by our recurrent neural network model. Second, we examine the performance of our predictive modelling approach w.r.t. sponsored ads served under (i) different formats and (ii) different positions within a SERP and, thus, significantly expand on the original application and findings of Arapakis and Leiva (2016). Last, we examine interaction effects in performance between our predictive model and different demographic attributes, such as gender and age, which highlights new opportunities for market segmentation.

[5] The Knowledge Graph is a card-like direct display that appears, for some informational queries, at the top-right part of the SERP, comprising information related to the named entities in the current user query.

2.3. Summary

Overall, a few works have studied the relative performance of PPI and PPC auction schemes, but we are not aware of theoretical work that explains the difference in performance which seems to emerge from the data, nor why most auction platforms tend to favour a PPC over a PPI scheme (as explained above, the opposite is true in non-auction pricing settings). To the best of our knowledge, existing theoretical analyses of online auctions assume that bidders attach no value to obtaining an ad slot unless it is clicked. This assumption has proven to be a good proxy for the theoretical questions addressed by the literature, but it overlooks aspects of bidders' valuations which are important in many markets, where an ad's value is more directly related to its ability to attract users' attention, beyond its clicks. There is thus an opportunity to devise a more effective ad pricing paradigm, in which ads are paid for only if they are actually noticed. However, for our novel PPA auction scheme to become feasible, a scalable diagnostic technology that estimates user attention to ads is necessary. We make this possible with a recurrent neural network model that relies exclusively on implicit interaction signals derived from mouse cursor movements, and show noticeable improvements over the state of the art. We also show that our model generalizes to different ad formats and different positions within a SERP.

3. The PPA Auction Scheme

In line with the motivation discussed in the Introduction, the auction model we introduce in this section explicitly accounts for the possibility that advertisers value an ad slot's ability to grab user attention, beyond the clicks that it may generate (Blake et al., 2015). We show that the PPA auction is both efficient (that is, the slot is always allocated to the advertiser with the highest valuation) and strategy-proof (that is, bidding one's own value is optimal regardless of the bids placed by others). We also compare the revenue properties of the PPA auction with those of its standard PPI and PPC counterparts, under various settings. Analytic results show that the PPA outperforms its PPI and PPC counterparts both when bidders' valuations are positively correlated with their ability to attract users' attention, and in the presence of framing effects, under which the payment format is assumed to affect advertisers' bidding strategies through the components of their valuations which are made salient. Both of these features (i.e., positive correlation and framing effects) are expected to be present in most relevant economic settings. Moreover, our results also show that in environments in which these conditions are not met, revenues under the PPA auction are in any case no lower than under the standard PPI and PPC formats.

3.1. Auction Environment

For simplicity, we focus on the case of a single slot for sale (extensions to the multiple-slot case are the subject of future work). We thus consider a setting in which a single ad slot needs to be allocated to one of $n$ bidders, indexed by $i = 1, \dots, n$. We let $\alpha$ denote the probability that a particular consumer visiting a web page notices the posted ad, and let $c$ denote its click-through rate (CTR), conditional on being noticed (the case of bidder-specific attention probabilities is discussed below).[6] For each $i$, let $q_i$ denote the probability that $i$ realizes a sale conditional on the consumer having clicked on the ad, and let $r_i$ denote the probability that $i$ realizes a sale conditional on the ad being noticed but not clicked. Finally, we let $v_i$ denote the value of a sale for bidder $i$.

[6] In most models, CTRs are typically expressed unconditional on the ad being noticed. The standard unconditional CTR in this model is thus equal to $\alpha c$.

We assume that both $\alpha$ and $c$ are commonly known by advertisers, whereas $q_i$, $r_i$ and $v_i$ are $i$'s private information, drawn from mutually independent distributions $F_q$, $F_r$ and $F_v$, respectively, i.i.d. across bidders. Bidders' types are thus three-dimensional, $\theta_i = (q_i, r_i, v_i)$. The total unconditional expected value of obtaining the slot for bidder $i$ with type $\theta_i$ therefore is

$$V_i = \alpha \left[ c\, q_i + (1 - c)\, r_i \right] v_i. \qquad (1)$$

The most standard auction format in this environment is the second-price auction: bidders submit bids, denoted by $b_i$; the highest bidder gets the slot and pays the second-highest bid, with ties broken randomly by a fair lottery over the highest bidders. Existing formats differ in the payment rule: they are typically based on either a PPC or PPI scheme. In the PPC scheme, a bidder pays only if their ad is clicked; in a PPI scheme, the bidder pays for the very fact of obtaining the ad slot.

In the PPA scheme we propose, a bidder's payment is proportional to the probability that their ad is noticed. Clearly, the possibility of defining such a payment rule presumes that the attention probability is observable. For this reason, the next two sections will be dedicated to the development of a diagnostic technology to detect attention probabilities. Here, however, we first focus on the properties of the PPA auction, under the assumption that attention probabilities are observable. We will return to the matter of how attention probabilities can be predicted in Sections 4 and 5, and will then combine our theoretical and experimental results with numerical simulations in Section 6.

Formally, letting $B_{-i}$ denote the highest opponent's bid, i.e. $B_{-i} = \max_{j \neq i} b_j$, in the PPA second-price auction bidder $i$'s (expected) payoff, conditional on a particular profile of bids $b_i$ and $B_{-i}$, is:

$$u_i^{PPA}(b_i, B_{-i}) = \begin{cases} \alpha \left( w_i - B_{-i} \right) & \text{if } b_i > B_{-i}, \\ \dfrac{\alpha \left( w_i - B_{-i} \right)}{m} & \text{if } b_i = B_{-i}, \\ 0 & \text{if } b_i < B_{-i}, \end{cases} \qquad (2)$$

where $w_i = \left[ c\, q_i + (1 - c)\, r_i \right] v_i$ denotes $i$'s expected value conditional on the ad being noticed, and $m$ denotes the number of bidders placing the highest bid.

For later reference, we contrast this with the payoff functions which result from the PPI and PPC schemes, respectively:

$$u_i^{PPI}(b_i, B_{-i}) = \begin{cases} \alpha w_i - B_{-i} & \text{if } b_i > B_{-i}, \\ \dfrac{\alpha w_i - B_{-i}}{m} & \text{if } b_i = B_{-i}, \\ 0 & \text{if } b_i < B_{-i}, \end{cases} \qquad u_i^{PPC}(b_i, B_{-i}) = \begin{cases} \alpha w_i - \alpha c\, B_{-i} & \text{if } b_i > B_{-i}, \\ \dfrac{\alpha w_i - \alpha c\, B_{-i}}{m} & \text{if } b_i = B_{-i}, \\ 0 & \text{if } b_i < B_{-i}. \end{cases} \qquad (3)$$

Note that the term $m$ in the denominators in equations (2) and (3) is equal to the number of bidders who are placing the highest bid, since $b_i = B_{-i}$ in that case and ties are broken by a fair lottery.

Existing analyses of the PPI and PPC formats are special cases of these payoff functions, obtained by setting $r_i = 0$ in Eq. 1. As discussed in the previous footnote, in most models of PPC auctions it is standard to work with the unconditional click-through rate, or $\alpha c$, and valuations are expressed in terms of value-per-click (VPC), or $q_i v_i$ in our setting; see e.g., Varian (2007); Edelman et al. (2007). Models of PPI schemes instead typically do not specify the various components of the overall value (namely CTRs, attention probabilities, and value per click), which are irrelevant in those settings, and work directly with the value-per-impression (VPI), which in our setting corresponds to $V_i = \alpha w_i$. In our setting, it is also convenient to define the valuation-per-attention (VPA) for each $i$ as $w_i = \left[ c\, q_i + (1 - c)\, r_i \right] v_i$.
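The three payoff functions can be written down directly. A sketch under the notation above, ignoring ties for brevity (the helper names are ours):

```python
def vpa(c, q, r, v):
    # Value-per-attention w: expected value conditional on the ad being
    # noticed, combining the clicked (q) and noticed-but-not-clicked (r)
    # sale probabilities.
    return (c * q + (1 - c) * r) * v

def payoff_ppa(alpha, w, bid, rival):
    # PPA: the winner pays the runner-up bid only when the ad is noticed.
    return alpha * (w - rival) if bid > rival else 0.0

def payoff_ppi(alpha, w, bid, rival):
    # PPI: the winner pays the runner-up bid unconditionally.
    return alpha * w - rival if bid > rival else 0.0

def payoff_ppc(alpha, c, w, bid, rival):
    # PPC: the winner pays the runner-up bid per click; clicks occur with
    # unconditional probability alpha * c.
    return alpha * w - alpha * c * rival if bid > rival else 0.0

w = vpa(c=0.2, q=0.5, r=0.1, v=10.0)
```

Reading off the dominant strategies from these expressions, a bidder is indifferent to winning exactly when the rival's bid equals $w_i$ (PPA), $\alpha w_i$ (PPI), or $w_i / c$ (PPC).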

3.2. Main Properties

In what follows we show that the PPA second-price auction thus defined shares the main properties of its PPI and PPC counterparts, namely strategy-proofness and efficiency. As standard in the literature, strategy-proofness refers to the property that bidding one’s own valuation is a dominant strategy, i.e. it is optimal independent of bids placed by others; whereas efficiency instead refers to the standard ex-post concept, which ensures that the slot is always assigned to the highest-valuation advertiser (both per-attention, and overall).

Theorem 3.1.

In the PPA second-price auction, bidding $b_i = w_i$ is a dominant strategy for every player. It follows that, in this equilibrium, for each realisation of types the slot is assigned to the advertiser with the highest valuation (both per-attention, and overall).

Proof.

We first show that $b_i = w_i$ dominates any bid $b_i' < w_i$: if $B_{-i} > w_i$, both $w_i$ and $b_i'$ yield a payoff of zero, since the slot is allocated to a different bidder; if $B_{-i} < b_i'$, both $w_i$ and $b_i'$ suffice to win the auction, yielding a payoff of $\alpha(w_i - B_{-i})$; if $b_i' < B_{-i} < w_i$, then $b_i'$ loses the auction and yields a payoff of $0$, while $w_i$ wins the auction, yielding a payoff of $\alpha(w_i - B_{-i}) > 0$. Hence, $w_i$ dominates any bid $b_i' < w_i$. A similar argument shows that it also dominates any $b_i' > w_i$. Thus, if everybody follows the dominant strategy $b_i = w_i$, the rules of the auction imply that the winner is in the set $\arg\max_i w_i$. ∎

Corollary 3.2.

Relabeling players if necessary so that $w_1 \geq w_2 \geq \dots \geq w_n$, for each realisation of types the revenues in this auction are equal to $\alpha w_2$, and bidders' payoffs are equal to $\alpha (w_1 - w_2)$ for the highest bidder, and $0$ for all others.
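Strategy-proofness can also be checked by brute force on a coarse bid grid; the following small verification harness is entirely our own:

```python
import itertools

def ppa_payoff(alpha, w_i, own_bid, rival_bids):
    # Expected payoff in the PPA second-price auction, with ties
    # split evenly among the highest bidders (fair lottery).
    best_rival = max(rival_bids)
    if own_bid > best_rival:
        return alpha * (w_i - best_rival)
    if own_bid == best_rival:
        n_top = 1 + sum(1 for b in rival_bids if b == best_rival)
        return alpha * (w_i - best_rival) / n_top
    return 0.0

# Bidding one's value-per-attention is never worse than any deviation,
# for every rival-bid profile on the grid (strategy-proofness).
alpha, w = 0.4, 3.0
grid = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
truthful_dominates = all(
    ppa_payoff(alpha, w, w, rivals) >= ppa_payoff(alpha, w, dev, rivals)
    for rivals in itertools.product(grid, repeat=2)
    for dev in grid
)
```

The check enumerates all rival profiles and all deviations on the grid; truthful bidding weakly dominates in every case, mirroring the proof of Theorem 3.1.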

3.3. Revenue Comparisons

We now compare the revenues of the PPA second-price auction with those of its PPI and PPC counterparts, in different settings. We show that, with fully sophisticated bidders (i.e., those who try to bid optimally), both the PPI and PPC auctions induce the same revenues and payoffs for all bidders as the PPA. Such equivalence between PPI and PPC may seem at odds with the industry's consensus that the latter induces on average lower costs per click, but it stems from the fact that all parameters and distributions in the model are kept constant across different payment schemes, and perfectly anticipated by a fully sophisticated bidder.

To make sense of the common wisdom, one has to enrich the model so as to take into account the possibility that CTRs and attention probabilities vary systematically across payment schemes, possibly due to the platforms' incentives to improve their performance, thereby increasing the overall value for the advertisers. We abstract from these possibilities here, which would provide further reasons to prefer the PPA scheme, and focus instead on two simpler variations of the baseline model: namely, heterogeneity in the attention probabilities and the possibility of framing effects associated with the different payment formats.

In this section we offer some analytic results. In Section 6 we will illustrate the significance of these results for relevant distributions of attention probabilities, through some numerical simulations specifically calibrated on the distributions of attention probabilities observed in the online user study from Sections 4 and 5.

3.3.1. Analytic Results

Fully Sophisticated Bidders

Existing theoretical analyses of PPI and PPC auctions do not consider the possibility that advertisers may value an ad beyond it being clicked. They thus consider an ex-ante value equal to $\alpha c \cdot q_i v_i$, where $\alpha c$ is the unconditional click-through rate, and $q_i v_i$ is the value-per-click.[7] In terms of our model, as we mentioned, this amounts to assuming $r_i = 0$ for all $i$. In such models, it is well-known that bidding $b_i = \alpha c\, q_i v_i$ in the PPI, and $b_i = q_i v_i$ in the PPC, are dominant strategies. But if bidders are concerned with their ad being noticed, and if they are fully sophisticated, then their optimal bids would be higher than these, to take into account the possibility that, for some realisations, ads may be noticed but not clicked. The real incentive to win the auction therefore exceeds the value of being clicked, which would be $\alpha c\, q_i v_i$, and a fully sophisticated bidder incorporates this by raising their bid by an extra $\alpha (1 - c)\, r_i v_i$ in the PPI and an extra $(1 - c)\, r_i v_i / c$ in the PPC. This logic is confirmed by the next analytic result:

[7] See e.g., Varian (2007); Edelman et al. (2007), which boil down to this in the special case of a single slot on sale.

Proposition 3.3 ().

With fully sophisticated bidders, the PPI and PPC auctions have dominant strategies and , respectively. In the corresponding equilibria, the PPI and PPC induce the same allocation, revenues and payoffs for every bidder as the PPA auction.

Proof.

If bidders fully realise the payoff implications of the auction rule, their payoffs in the PPI and PPC auctions when are equal to and , respectively. For both auctions, payoffs are if , and if . By the same logic as in the proof of Theorem 3.1, it follows that and . Hence, for each realisation of types, revenues equal the second-highest bid in the PPI, and that bid times the unconditional CTR in the PPC. The result follows. ∎

Hence, with fully sophisticated bidders, the PPA second-price auction performs just as well as its PPI and PPC counterparts. This equivalence result may seem at odds with the intuition that the PPC only prices the value-per-click, but it follows from the fact that is common to all bidders, and that, under the full sophistication assumption, all parameters and distributions are perfectly anticipated by all bidders, who (as explained) therefore bid higher than their value-per-click. We next discuss two variations of the baseline model, to accommodate the possibility of heterogeneous attention probabilities as well as possible framing effects associated with the different payment schemes.
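The equivalence result can be checked with a quick Monte Carlo sketch. The model below is a simplified stand-in with our own notation, not the paper's: a common attention probability alpha, a click probability q conditional on attention, and per-bidder values for a clicked versus a merely noticed ad. Fully sophisticated bids scale the ex-ante per-impression value by the relevant event probability, so all three second-price formats produce the same expected revenue.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed parameters (illustrative, not from the paper): common attention
# probability alpha and click probability q conditional on attention.
alpha, q = 0.4, 0.3
n_bidders, n_rounds = 5, 10_000

def second_highest(x):
    return np.sort(x)[-2]

rev_ppi = rev_ppc = rev_ppa = 0.0
for _ in range(n_rounds):
    vc = rng.uniform(0, 1, n_bidders)    # value if the ad is clicked
    vn = rng.uniform(0, 1, n_bidders)    # value if noticed but not clicked
    V = alpha * (q * vc + (1 - q) * vn)  # ex-ante value per impression
    # Fully sophisticated dominant-strategy bids, second-price payments:
    rev_ppi += second_highest(V)                            # pay per impression
    rev_ppc += second_highest(V / (alpha * q)) * alpha * q  # expected pay per click
    rev_ppa += second_highest(V / alpha) * alpha            # expected pay per attention

print(rev_ppi / n_rounds, rev_ppc / n_rounds, rev_ppa / n_rounds)
```

The three averages coincide (up to floating-point error), mirroring Proposition 3.3.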

Bidder-Specific Attention

We consider a simple extension of the model in which the probability of grabbing a consumer's attention varies with the identity of the advertiser. We thus replace the parameter above with a profile . For each realisation , we relabel advertisers if necessary in decreasing order of value-per-attention and value-per-impression, respectively: that is, so that , and .

Proposition 3.4 ().

For each realisation of types, expected revenues in the dominant-strategy equilibria of the PPI, PPC and PPA auctions are, respectively, , and .

Proof.

By the arguments above, , , and are dominant strategies. Therefore, for each , in the PPI the highest bid is placed by the highest-VPI bidder (namely ), who wins the auction and pays ; in the PPC and PPA auctions, instead, the highest bids are placed by the highest-VPA bidder (namely ), who wins the auction and pays: (i) whenever clicked in the PPC auction, which happens with probability ; and (ii) whenever noticed in the PPA auction, which happens with probability . Multiplying the price by the corresponding probability yields the result. ∎

Thus, for any , the PPA does just as well as the PPC, and they both outperform the PPI whenever . The revenue ranking at the ex-ante stage thus depends on the joint distribution of the ’s and ’s. For example, if attention probabilities and values-per-attention are positively correlated, the ex-ante expected revenues under the PPA are higher than under the PPI. This point is illustrated later with numerical simulations.
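The correlation effect behind Proposition 3.4 can be illustrated with a small simulation. All parameters below are assumptions for illustration: value-per-attention w is uniform, and the attention probability alpha is an increasing function of w (positive correlation), so the PPA winner also has the highest attention probability.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bidders, n_rounds = 5, 20_000

def avg_revenues(slope):
    """Average PPI vs PPA revenues when attention probabilities are
    related to values-per-attention with the given slope (sign = correlation)."""
    rev_ppi = rev_ppa = 0.0
    for _ in range(n_rounds):
        w = rng.uniform(0, 1, n_bidders)                      # VPA_i
        alpha = np.clip(0.5 + slope * (w - 0.5), 0.05, 0.95)  # attention prob
        vpi = alpha * w                                       # VPI_i
        winner = np.argmax(w)                # PPA winner: highest-VPA bidder
        rev_ppa += alpha[winner] * np.sort(w)[-2]  # winner's alpha x 2nd-highest VPA
        rev_ppi += np.sort(vpi)[-2]                # second-highest VPI
    return rev_ppi / n_rounds, rev_ppa / n_rounds

ppi, ppa = avg_revenues(0.8)  # positive correlation
print(ppi, ppa)
```

With a positive slope, the PPA winner's attention probability is the highest one, so the PPA revenue dominates the PPI revenue round by round, as the proposition suggests.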

Framing Effects

As explained above, a fully sophisticated advertiser in the PPC should raise their bid by an extra over their per-click valuation, because they would also take into account the expected value the ad might generate conditional on not being clicked. This, however, is not a straightforward calculation, and advertisers in practice need not bid this way. The bidding tutorials provided by the most prominent platforms, for example, implicitly assume that .8See, e.g., the Google AdWords tutorial in which Hal Varian teaches how to bid in the GSP auction: https://www.youtube.com/watch?v=jRx7AMb6rZ0 — recall that, in the single-slot case, the GSP auction coincides with the PPC second-price auction discussed above. Hence, if advertisers with followed the tutorials' recommendation, they would fail to take the term into account, and would thus bid based on the VPC alone, not on their full value-per-impression (VPI).

Bidding tutorials are only one reason why advertisers may bid this way. More broadly, by drawing advertisers' attention to different components of their valuation, different pricing systems may produce framing effects that impact the way advertisers bid in practice, for example by making them account only for the component of the value that the particular pricing scheme makes salient. These considerations are relevant in practice, since the level of understanding of the environment implicit in the model with fully sophisticated bidders, and the associated calculations needed to obtain the optimal bidding strategies, are unlikely to be fully reflected in the abilities and sophistication of real-world bidders.

For these reasons, we also study the performance of the three auction formats if advertisers were affected by the pricing system, in the sense that they only focus on the value made salient by the rules of the auction, which is the one they pay for: the VPI in the PPI, the VPC in the PPC, and the VPA in the PPA. For ease of reference, we reproduce here the formulae for these different values:

As above, for each realisation , we relabel advertisers if necessary in decreasing order of value per-attention, per-click and per-impression, respectively: that is, so that , , and .

Proposition 3.5 ().

If bidders only focus on the value which is made salient by the auction rules, for each realisation of types, the revenues in the dominant-strategy equilibria of the PPA, PPC and PPI auctions are, respectively, , and . Hence: (i) if and only if ; and (ii) if and , then whenever .

Proof.

If bidders only focus on the value made salient by the auction rules, the perceived payoffs in case of a win (i.e., if ) are equal to and , and , where , and . By the usual argument, it is easy to show that the dominant strategies are, respectively, , and . Hence, for each type realisation, the winner in the PPA, PPC, and PPI is the bidder with the highest VPA, VPC, and VPI, respectively (resp., bidders , and ). In the PPA the winner pays if noticed; in the PPC they pay for each click; in the PPI they pay . The revenues in the statement are obtained by multiplying these payments by the corresponding probabilities ( in the PPA, in the PPC, and in the PPI). The revenue ranking between PPA and PPI follows immediately from these results, noting that revenues can be rewritten as and . For the revenue ranking between PPA and PPC, first note that, by definition of the relabellings, implies that , and if . Hence, if , it follows that . ∎

Hence, as long as some advertisers are subject to framing effects in the sense that they solely focus on the value made salient by the rules of the auction, the second-price PPA auction does better than both its PPI and PPC counterparts under weak conditions which are expected to hold in relevant economic settings, as will be illustrated in the numerical simulations of Section 6.
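The framing scenario can also be sketched numerically. The setup below is a simplified stand-in (our own symbols, not the paper's): framed bidders bid only the salient value (VPI, VPC, or VPA), and because framed PPC bidders ignore the value of an ad that is noticed but not clicked, the PPA revenue dominates the PPC revenue.

```python
import numpy as np

rng = np.random.default_rng(2)
n_bidders, n_rounds = 5, 20_000
alpha, q = 0.4, 0.3  # assumed common attention and click-given-attention probs

rev_ppi = rev_ppc = rev_ppa = 0.0
for _ in range(n_rounds):
    vc = rng.uniform(0, 1, n_bidders)   # value per click
    vn = rng.uniform(0, 1, n_bidders)   # value if noticed but not clicked
    vpa = q * vc + (1 - q) * vn         # value per attention
    vpc = vc                            # salient value in the PPC
    vpi = alpha * vpa                   # value per impression
    # Framed bidders bid the salient value; second-price payment is charged
    # per impression / per click / per attention, respectively.
    rev_ppi += np.sort(vpi)[-2]
    rev_ppc += alpha * q * np.sort(vpc)[-2]
    rev_ppa += alpha * np.sort(vpa)[-2]

print(rev_ppi / n_rounds, rev_ppc / n_rounds, rev_ppa / n_rounds)
```

With a common attention probability, the PPA and PPI revenues coincide here; the strict gain is over the PPC, whose framed bidders leave the non-click component of their value on the table.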

4. User Study

Online advertising involves (1) a publisher, who integrates ads into its online content, such as the native ads on a SERP, promoted tweets on Twitter, or sponsored content in a news stream; and (2) an advertiser, who provides the ads to be displayed. These ads can be served under different formats (e.g., text, image, video, or rich media), each with its unique look and feel. Some formats appear to be more effective than traditional online ads in terms of user attention and purchase intent (IPG, 2013), and others may cause ad blindness to a greater or lesser extent (Owens et al., 2011). Therefore, to understand how web search users engage with ads that appear under different formats and positions in SERPs, we conducted a user study through the Figure Eight9https://www.figure-eight.com crowdsourcing platform. Following a similar experimental setup to that introduced by Arapakis and Leiva (2016), we collected feedback from participants who performed brief transactional search tasks using Google Search. With this study, we aimed to predict when users notice the ads that appear on SERPs under the aforementioned conditions.

Crowdsourcing studies offer several advantages over in-situ methods of experimentation (Mason and Suri, 2012), such as access to a larger and more diverse pool of participants with stable availability, collection of real usage data at a relatively large scale, and a low-cost alternative to more expensive laboratory-based experiments. On the downside, experimenters have to account for potential threats to ecological validity, distractions in the physical environment of the participant, and privacy issues, to name a few. Still, crowdsourcing allows for exploring a wider range of parameters in a more controlled manner than in-the-wild large-scale studies. To mitigate and discount low-quality responses, we put several preventive measures into practice, such as introducing test (gold-standard) questions in our tasks, selecting experienced contributors (Level 3) with high accuracy rates, and monitoring their task completion time, thus ensuring the internal validity of our experiment.

(a) Native ad
(b) Bundled ad (left)
(c) Bundled ad (right)
Figure 1. Examples of ad formats and their positions on SERPs. In our experiments, only one ad format was shown at a time.

4.1. Experiment Design

The experiment had a between-subjects design with two independent variables: (1) ad format, with two levels: “native” (organic ads) or “bundled” (direct display ads), and (2) bundled ad position, with two levels: “left” and “right” position. Native ads are only shown in the left part of Google SERPs (see below). The dependent variable was ad attention (see Section 4).

Our experiment consisted of a brief transactional search task where participants were presented with a predefined search query and the corresponding SERP, and were asked to click on any element of the page that answered it best. The search queries (Section 4.2) were all picked from a pool of queries that triggered both native (Figure 1a) and bundled ads (Figures 1b and 1c) on Google SERPs. The search queries were randomly assigned to the participants.

All SERPs, which were in English, were scraped for later instrumentation. As hinted previously, all SERPs had both native and bundled ads. Native ads appear both at the top-left and bottom-left position of the SERP, whereas bundled ads could appear either at the top-right or top-left position (but not both at the same time on the same SERP). Therefore, we ensured that only one ad was visible per condition and participant at a time, since we focus on the single-slot auction case. This was possible by instrumenting each downloaded SERP with custom JavaScript code that removed all ads except the one to be tested in each experimental condition. In any case, native bottom-most ads were not shown, since (i) users have to scroll all the way down to the bottom of the SERP to reveal them and (ii) these ads have the same look and feel as the native ads shown in the top-most position.

Participants accessed the instrumented SERPs through a dedicated server, which did not alter the look and feel of the original SERPs. This allowed us to capture fine-grained user interactions while ensuring that the content of the SERPs remained consistent and that each experimental condition was properly administered. Each participant was allowed to perform the search task only once, since inquiring at post-task about the presence of an ad would make them aware of it and could introduce carry-over effects, thus altering their browsing behaviour in subsequent search tasks. In sum, each participant was exposed to a single condition only; i.e., a unique combination of query, ad format, and ad position.

4.2. Search Query Sample

Our search query set was constructed as follows. Starting from Google Trends,101010https://trends.google.com/trends/ we selected a subset of the Top Categories and Shopping Categories (Table 1) that were suitable candidates for the transactional character of our search tasks. From this subset of categories, we extracted the top search queries issued in the US during the last 12 months. Next, from the resulting collection of 375 search queries, we retained 150 for which the SERPs were showing at least one bundled ad (50 search queries for each combination of bundled ad format and position). Such examples include the search queries samsung tablet, casio watches, or adidas ultra boost. Using this final selection of search queries, we produced the static version of the corresponding Google SERPs and injected the JavaScript code (Section 4.3) that allowed us to control the ads format and capture all client-side user interactions. The final collection of 150 search queries per ad condition was repeated as many times as needed to produce the desired number of search sessions for the final dataset.

Top Categories: Autos & Vehicles; Computers & Electronics; Food & Drink; Games; Real Estate; Travel.
Shopping Categories: Apparel; Event Ticket Sales; Gifts & Special Event; Luxury Goods; Photo & Video Services; Sporting Goods; Tobacco Products; Toys; Wholesalers & Liquidation.
Table 1. Selected search query categories (Google Trends).

4.3. Mouse Cursor Tracking

As previously stated, all SERPs were downloaded and instrumented with custom JavaScript code. This way, we could automatically insert mouse tracking code and log cursor movements, hovers, and associated metadata. For this, we used EvTrack,11https://github.com/luileito/evtrack an open-source JavaScript event tracking library derived from the smt2 system (Leiva and Vivó, 2013). EvTrack makes it possible to specify which browser events should be captured and how, i.e., via event listeners (the event is captured as soon as it is fired) or via event polling (the event is captured at fixed-time intervals). Concretely, we captured all regular browser events (e.g., load, click, scroll) via event listeners and only mousemove via event polling (every 150 ms), since this event may introduce unnecessary overhead both while recording on the client side and while transmitting the data to the server (Leiva and Huang, 2015). Whenever an event was recorded, we logged the following information: mouse cursor position (x and y coordinates), timestamp, event name, XPath of the DOM element related to the event, DOM element attributes, and the Euclidean distance to five control points (four corners and middle point) of the ad. This distance was required by one of the baseline models we tested (Section 5.2).

4.4. Self-Reported Measures

In addition to the aforementioned mouse cursor data, we collected ground-truth labels on the noticeability of the ads through an online questionnaire. Similar to what other works have done before (Feild et al., 2010; Liu et al., 2015; Lagun et al., 2014a; Arapakis and Leiva, 2016), the questionnaire was administered at post-task and asked the following question: While performing the search task, to what extent did you pay attention to the advertisement? We used a 5-point Likert-type scale to collect the labels: 1 (“Not at all”), 2 (“Not much”), 3 (“I can’t decide”), 4 (“Somewhat”), and 5 (“Very much”).

These scores were later collapsed to binary labels (true/false), but we felt it was necessary to use a 5-point Likert-type scale for several reasons. First, using 2-point scales often results in highly skewed data (Johnson et al., 1982). Second, it is important to leave room for neutral responses: some users may not want to commit one way or the other, and forcing them to do so can introduce response biases. However, 3-point scales can lead more users to stay neutral, because the remaining options may be seen as “too extreme”. Therefore, we opted for a 5-point scale, which leaves more room for “soft” responses and is also easy to understand. With this scoring scheme, we are therefore confident that the eventual binary labels actually reflect positive and negative user votes. All tasks that received a neutral score were excluded from the analysis.
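The label mapping described above can be sketched as follows; the raw scores are illustrative, not the study's actual data.

```python
def binarize(score):
    """Map a 5-point Likert score to a binary attention label.

    1-2 -> False (ad not noticed), 4-5 -> True (ad noticed),
    3 (neutral) -> None, i.e. the task is dropped from the analysis.
    """
    if score in (1, 2):
        return False
    if score in (4, 5):
        return True
    return None  # neutral responses are excluded

scores = [5, 3, 1, 4, 2, 3]
labels = [binarize(s) for s in scores if binarize(s) is not None]
print(labels)  # [True, False, True, False]
```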

4.5. Participants

We recruited participants through Figure Eight, of which we retained data from (“female” = , “male” = , “Prefer not to say” = ) after excluding cases with incomplete mouse cursor logs. Participants were aged 18 to 66 (“18–23” = , “24–29” = , “30–35” = , “36–41” = , “42–47” = , “+48” = ), were of mixed nationalities (e.g., American, Belgian, British, German), and had diverse educational backgrounds: 21.6% had a high school diploma, 16.9% had a college diploma, 27.1% had a BSc degree, 17.5% were graduates, 14.3% had an MSc, 1.1% had a PhD, and 1.5% preferred not to say. The majority were full-time (45.1%) or part-time (15.38%) employees, while the remaining were either full-time students (11.6%), pursuing further studies while working (13.9%), performing home duties (6.5%), or other (7.5%). Finally, all participants were proficient in English and were experienced (Level 3) contributors.

4.6. Procedure

Initially, participants were instructed to carefully read the terms and conditions of the study, which, among other things, informed them that they should perform the task on a desktop or laptop computer using a computer mouse (refraining from using a touchpad, tablet, or mobile device) and that they should deactivate any ad blocker before proceeding with the search task. Our JavaScript code detected installed ad blockers and prevented such users from taking part in the study.

Participants were also asked to act naturally and choose whatever would best answer the search query, since all “clickable” elements (e.g., result links, images, etc.) on the SERP were considered valid answers. The instructions were followed by a brief search task description like “Imagine that you want to buy ⟨noun⟩ (for you or someone else as a gift) and you have submitted the search query ‘⟨noun⟩’ to Google Search. Please browse the search results page and click on the element that you would normally select under this scenario.”

The search task had to be completed in a single session, and each search query was performed on average by five different participants. The SERPs were randomly assigned to the participants and each participant could take the study only once (see Section 4.1). Participants were allowed as much time as they needed to examine the SERP and proceed with the search task, which concluded whenever they selected any of the “clickable” elements on the SERP. Upon concluding the search task, participants were asked to complete the post-task questionnaire (which inquired about the presence of the ad and other ground-truth information) and a brief demographics questionnaire. The payment for participation was $0.20. Participants could also opt out at any moment, in which case they were not compensated.

5. Predicting Ad Attention

In this section we present our diagnostic technology for predicting user attention to ads on SERPs, using as ground truth the labels collected in our user study. To this end, we implement several baseline models: the Random Forest classifier proposed in Arapakis and Leiva (2016), a ZeroR classifier that always predicts the majority class, and a feed-forward neural network using three classic IR features (see Section 5.2). We also implement a recurrent neural network that operates exclusively on the raw sequences of 2D mouse cursor coordinates we collected (Section 5.4). We then compare and contrast the accuracy of these models' predictions for the three ad conditions from our user study: (1) native ad, (2) left-bundled ad, and (3) right-bundled ad, and for different demographic attributes (gender, age). Our findings show that our recurrent neural network model achieves better performance than the baseline models in most cases, while avoiding the additional cost of feature engineering and the use of additional page-level information.

5.1. Data Set

After excluding logs with incomplete mouse cursor data (fewer than five mouse coordinates, which corresponds roughly to one second of user interaction data), we obtained a set of cursor positions from search sessions. Of these search sessions, correspond to the native ad, correspond to the left-bundled ad, and correspond to the right-bundled ad. We then converted our ground-truth labels to a binary scale using the following mapping: (1) “Not at all” and (2) “Not much” were assigned to the negative class, and (4) “Somewhat” and (5) “Very much” were assigned to the positive class. We note that the class distribution was fairly balanced (66% positive cases) across the experimental conditions. Next, our data set was divided per ad condition, and for each condition we created a 10-fold cross-validation split using stratified sampling to produce balanced splits that preserve the original class distribution. In each fold, 70% of the data was used for training and 30% for validation.
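The splitting procedure can be sketched with scikit-learn's stratified splitter; the data below is a synthetic stand-in for the real cursor features and attention labels, and the 70/30 repeated-split reading of "10-fold" is our interpretation of the text.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in: the real inputs are mouse cursor sequences and the
# binary attention labels from the user study (~66% positive).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.uniform(size=1000) < 0.66).astype(int)

# Ten stratified 70/30 splits that preserve the class distribution.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
rates = []
for train_idx, val_idx in splitter.split(X, y):
    rates.append(y[val_idx].mean())  # positive rate in each validation split
```

Each validation split keeps roughly the original positive rate, which is what the stratified sampling above guarantees.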

5.2. Baseline Models

We trained a random forest (RF) classifier to predict ad attention using all the features proposed by Arapakis and Leiva (2016). More specifically, we engineered the base features (e.g., viewport position, cursor distance from the ad, cursor speed, cursor acceleration) and the high-level meta-features (e.g., cursor traversed distance, hovers over the ad, entropy indices, spectral features) derived from the mouse cursor data. Table 2 summarises these features under different categories and also lists the aggregate functions applied to them. We then removed the highly correlated () and linearly dependent features from our feature set. In addition, we normalised the values of all features to the range so that features with larger numeric ranges would not dominate those with smaller ones. As a last step, we determined via grid search the optimal hyper-parameter values (number of trees, number of features, -threshold) for the baseline model and evaluated its performance against the test set.

Base features Meta-features Aggregate functions
Viewport (width, height) # Moves (towards, away) Ad ,
Cursor positions and timestamps # Moves (towards, away) Ad within dist. , ,
Unique cursor positions # Clicks (inside, outside) Ad , , SST
Normalised viewport positions      Time to first click on Ad intra-distances of cursor positions w.r.t. Ad
Unique normalised viewport pos. # Preceding clicks to Ad Shannon entropy
Subsequent points’ distance # Hovers over Ad Permutation entropy )
Subsequent points’ duration # Hovers over other elements Weighted Permutation entropy )
Cursor distance from Ad # Hovers over Ad vs. other elements Approximate entropy )
Cursor speed # Preceding hovers over other elements FFT most powerful frequency )
Cursor normalised speed      Time to first hover (Ad, other elements) Multivariate KL div. (symmetric, non-symmetric)
Cursor acceleration      Time hovering (Ad, other elements) Earth mover’s distance
Cursor normalised acceleration      Distance traversed overall Hausdorff distance
Cursor position status wrt. Ad      Distance traversed (inside, outside) Ad
Vector angles      Distance from Ad (corners, center)
# Cursor positions within distance from Ad
* These functions are computed for most base and meta-features.
Table 2. Features used by the baseline RF model for predicting ad attention.
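The RF preprocessing pipeline described above can be sketched as follows, on synthetic stand-in features; the 0.9 correlation threshold and the parameter grid are assumptions, since the paper's exact values are not shown here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the engineered cursor features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
X[:, 7] = X[:, 0] * 0.99 + rng.normal(scale=0.01, size=300)  # near-duplicate feature
y = rng.integers(0, 2, size=300)

# Drop one feature of every highly correlated pair (|r| > 0.9, assumed threshold).
corr = np.abs(np.corrcoef(X, rowvar=False))
keep = [i for i in range(X.shape[1])
        if not any(corr[i, j] > 0.9 for j in range(i))]
X = X[:, keep]

# Min-max normalisation plus a grid search over RF hyper-parameters.
grid = GridSearchCV(
    Pipeline([("scale", MinMaxScaler()),
              ("rf", RandomForestClassifier(random_state=0))]),
    param_grid={"rf__n_estimators": [50, 100], "rf__max_features": ["sqrt", None]},
    cv=3,
)
grid.fit(X, y)
```

The near-duplicate eighth feature is discarded by the correlation filter, and the scaler inside the pipeline ensures normalisation is fit only on each training split.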

We also tested a ZeroR classifier, also known as 0-R (zero rule), which simply predicts the majority class. It will always output the same target value and does not use any input features, hence its name. Despite its simplicity and lack of discriminative power, this classifier is very useful for determining the baseline performance, as a benchmark for other classification methods like the ones we used in these experiments. If any other classifier is correct less frequently than ZeroR, it is obviously of no value for the task at hand.
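A ZeroR baseline is available off the shelf in scikit-learn; the tiny data below is only for illustration.

```python
from sklearn.dummy import DummyClassifier

# ZeroR: always predict the majority class, ignoring all input features.
X = [[0], [1], [2], [3]]
y = [1, 1, 1, 0]  # majority class is 1
zeror = DummyClassifier(strategy="most_frequent").fit(X, y)
preds = zeror.predict([[9], [7]])
print(preds)            # always the majority class
print(zeror.score(X, y))  # baseline accuracy = majority share = 0.75
```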

5.3. Feed-forward Neural Network Model

The RF model introduced previously is a machine learning ensemble that uses a sum of piecewise functions for classification, and therefore may have limited accuracy. In contrast, neural networks can model complex dependencies within the data. We therefore trained a feed-forward neural network (FFNN) as a third baseline model. The FFNN uses three classic features from the literature that have been suggested to correlate well with user engagement (Arapakis et al., 2014a; Barbieri et al., 2016; Lagun and Agichtein, 2015): dwell time, number of clicks over the ad, and number of hovers over the ad.

The FFNN input layer takes a vector with these three features and feeds it to a fully-connected hidden layer with 6 neurons (two neurons per feature) and ReLU activation, followed by a dropout layer with drop rate for regularisation, and finally a fully-connected output layer with 1 neuron and sigmoid activation. The FFNN outputs a probability prediction of the user's attention to an ad, where indicates that the user has noticed the ad.

We trained the FFNN with a batch size of 64 sequences for 50 epochs, using the same 10-fold cross-validation splits as the RF model. We used the popular Adam optimizer (stochastic gradient descent with momentum) with learning rate and decay rates and . The loss function to minimise is binary cross-entropy, since the task is a 2-class classification problem.
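The FFNN architecture described above can be sketched in Keras; the dropout rate and learning rate are assumptions, since the paper's exact values are not reproduced here.

```python
import tensorflow as tf

# Sketch of the three-feature FFNN: dwell time, clicks over the ad,
# and hovers over the ad. Dropout rate (0.5) and learning rate (1e-3)
# are assumed values, not the paper's.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(6, activation="relu"),     # two neurons per feature
    tf.keras.layers.Dropout(0.5),                    # regularisation
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(user noticed the ad)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
probs = model(tf.zeros((2, 3)))  # forward pass sanity check, shape (2, 1)
# Training would mirror the text: batch_size=64, epochs=50, per CV fold, e.g.
# model.fit(X_train, y_train, batch_size=64, epochs=50, validation_data=(X_val, y_val))
```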

5.4. Recurrent Neural Network Model

Feature engineering requires domain expertise to come up with an optimal set of discriminative features. The previous baseline models use ad-hoc features that exploit the SERP structure and are thus potentially less generalisable. Given that we are interested in a scalable diagnostic technology of user attention, regardless of the underlying page content or structure, we propose a more versatile model to predict user attention to ads. The model is a particular type of recurrent neural network (RNN), since mouse movements are sequential in nature and RNNs are very good at modelling data sequences and time series, where each multivariate data point can be assumed to depend on the previous ones.

Concretely, the model architecture is a bidirectional long short-term memory (LSTM) network; see Figure 2. An LSTM network is essentially an RNN that can remember long-term dependencies. We used the bidirectional variant (BLSTM) because a major limitation of standard RNNs is that they can only learn representations from previous time steps; sometimes, however, representations from future time steps are needed to better understand the context and eliminate potential ambiguities.

Figure 2. Diagram of our bidirectional LSTM architecture.

The BLSTM takes as input a sequence of raw mouse cursor positions, which can be seen as a multivariate time series of two-dimensional data points. The input layer has 50 neurons (one neuron per timestep). The hidden layer is a recurrent block with a forward + backward LSTM, with hyperbolic tangent as activation function and sigmoid activation in the recurrent step. As in the previously discussed FFNN model, we added a dropout layer with drop rate for regularisation, followed by a fully-connected layer of 1 output unit using sigmoid activation. The BLSTM outputs a probability prediction of the user's attention to an ad, where indicates that the user has noticed the ad. In sum, the only architectural differences between our FFNN and BLSTM models are the input layer and the first hidden layer. The model architecture is illustrated in Figure 2.

Since our BLSTM takes as input a raw sequence of mouse cursor positions only, and because each sequence has a different length, the input sequences are padded to a fixed length of 50 timesteps, which corresponds roughly to the mean sequence length observed in our data set plus one standard deviation. Also, because each mouse cursor sequence was produced on a different web browser with a different screen size, and thus with different positions of the SERP components, horizontal coordinates were normalised by each user's viewport width.

We trained this model with a batch size of 64 sequences and for 50 epochs, using the same 10-fold cross-validation splits as the baseline models. We used the popular Adam optimizer (stochastic gradient descent with momentum) with learning rate and decay rates and .
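The padding, normalisation, and BLSTM pipeline above can be sketched in Keras. The viewport widths, sequence lengths, hidden size (32 units), and dropout rate are all assumptions for illustration; the defaults of Keras' LSTM (tanh activation, sigmoid recurrent activation) match the text.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)

# Two stand-in cursor sequences of different lengths, in pixel coordinates,
# with hypothetical viewport widths.
raw = [rng.uniform(0, 1280, size=(12, 2)), rng.uniform(0, 1440, size=(73, 2))]
widths = [1280.0, 1440.0]
for seq, w in zip(raw, widths):
    seq[:, 0] /= w  # normalise horizontal coordinates by viewport width

# Pad/truncate every sequence to 50 timesteps of (x, y) points.
X = tf.keras.preprocessing.sequence.pad_sequences(
    raw, maxlen=50, dtype="float32", padding="post", truncating="post")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(50, 2)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),  # hidden size assumed
    tf.keras.layers.Dropout(0.5),                             # rate assumed
    tf.keras.layers.Dense(1, activation="sigmoid"),           # P(ad noticed)
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])
out = model(X)  # forward pass sanity check, shape (2, 1)
# Training mirrors the text: model.fit(X_train, y_train, batch_size=64, epochs=50, ...)
```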

5.5. Results

5.5.1. Classification Accuracy

Table 3 shows the experimental results comparing the baseline models (Section 5.2) and our recurrent neural network (Section 5.4) of user attention. We report weighted Precision, Recall, and F-measure (F1 score), according to the target class distributions in each case, averaged across the ten folds. We also report the Area Under the ROC Curve (AUC), to highlight the discriminative power of each classifier. We use Pearson's chi-squared test of proportions as the omnibus test and, if it reveals a statistically significant difference, we use pairwise comparisons between pairs of proportions, with correction for multiple testing, as post-hoc tests to see whether statistically significant differences exist between individual models.
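The testing procedure can be sketched as follows. The correct/incorrect counts are hypothetical, not the paper's data, and Bonferroni is used as a simple stand-in for the multiple-testing correction.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical correct-prediction counts per model (RF, ZeroR, FFNN, BLSTM)
# out of 250 test cases each; illustrative only.
correct = np.array([160, 152, 148, 178])
total = 250
table = np.vstack([correct, total - correct])  # 2 x 4 contingency table

# Omnibus chi-squared test of equal accuracy across the four models.
chi2, p, dof, _ = chi2_contingency(table)

adj = []
if p < 0.05:
    # Post-hoc: pairwise 2x2 tests of proportions, Bonferroni-corrected
    # (a stand-in for the paper's correction method).
    pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
    for i, j in pairs:
        sub = np.vstack([correct[[i, j]], total - correct[[i, j]]])
        p_ij = chi2_contingency(sub)[1]
        adj.append(min(1.0, p_ij * len(pairs)))
```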

Ad condition Model Adj. Precision Adj. Recall Adj. F-measure AUC
Native RF 0.584 0.570 0.568 0.601
ZeroR 0.465 0.682 0.553 0.500
FFNN 0.515 0.663 0.552 0.473
BLSTM 0.712 0.714 0.650 0.634
Bundle, left RF 0.592 0.580 0.578 0.641
ZeroR 0.524 0.724 0.608 0.500
FFNN 0.524 0.724 0.608 0.519
BLSTM 0.524 0.724 0.608 0.624
Bundle, right RF 0.537 0.514 0.498 0.590
ZeroR 0.485 0.692 0.570 0.500
FFNN 0.485 0.692 0.570 0.501
BLSTM 0.560 0.687 0.576 0.630
Table 3. Ad attention prediction results, weighted by class distribution. A bold typeface denotes the best result for the corresponding experimental condition.

Our findings indicate that our BLSTM classifier achieves competitive performance in detecting ad attention, as compared to the other models, for most metrics and under most ad conditions. Notice that the BLSTM does not use engineered features (like the RF model) and does not use page-level information (like the FFNN model), only the raw sequences of mouse movement coordinates.

For the case of native ads, the omnibus test was statistically significant for all metrics. In terms of Precision, BLSTM performed significantly better than the other models, and there was no statistically significant difference between ZeroR and FFNN. All other differences were statistically significant. The effect size suggests a moderate practical importance. In terms of Recall, all models performed better than RF. All other differences were not statistically significant. The effect size suggests a moderate practical importance. In terms of F-measure, BLSTM performed significantly better than the other models. All other differences were not statistically significant. The effect size suggests a small practical importance. In terms of AUC, both BLSTM and RF performed significantly better than FFNN and ZeroR, and the difference between BLSTM and RF was not statistically significant. The difference between FFNN and ZeroR was not statistically significant. The effect size suggests a moderate practical importance.

For the case of left-bundled ads, the omnibus test was statistically significant for all metrics except F-measure. The effect size suggests a small practical importance. In terms of Precision, the post-hoc tests revealed no statistically significant differences between models. The effect size suggests a small practical importance. In terms of Recall, all models performed better than RF. All other differences were not statistically significant. The effect size suggests a moderate practical importance. In terms of AUC, both BLSTM and RF performed significantly better than FFNN and ZeroR, and the difference between BLSTM and RF was not statistically significant. The difference between FFNN and ZeroR was not statistically significant. The effect size suggests a moderate practical importance.

For the case of right-bundled ads, the omnibus test was statistically significant for all metrics. In terms of Precision , BLSTM performed significantly better than FFNN and ZeroR, and the difference between BLSTM and RF was not statistically significant. All other differences were not statistically significant. The effect size suggests a small practical importance. In terms of Recall , BLSTM performed significantly better than the other models. All other differences were not statistically significant. The effect size suggests a moderate practical importance. In terms of F-measure , the RF performed significantly worse than the other models. All other differences were not statistically significant. The effect size suggests a small practical importance. In terms of AUC , both BLSTM and RF performed significantly better than FFNN and ZeroR, but the difference between BLSTM and RF was not statistically significant. The difference between FFNN and ZeroR was not statistically significant. The effect size suggests a moderate practical importance.

The case of left-bundled ads is interesting. As can be observed in Table 3, both neural network models achieved the same Precision, Recall, and F-measure as the ZeroR classifier, suggesting that they were unable to model the data distribution and thus learned to use the prior probability for classification. The results therefore suggest that predicting attention to left-bundled ads is a challenging task. Still, the superiority of the BLSTM was evident with respect to the AUC. The AUC represents the capability of a classifier to distinguish between classes; an AUC of 0.5 means the model has no class separation capacity, as is the case of the ZeroR classifier. Thus, we conclude that it is possible to detect user attention to online ads with competitive accuracy. More importantly, it is possible to do so unobtrusively and at large scale.
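To make the AUC argument concrete, the following sketch (function and data are illustrative, not from the study) computes AUC via its rank-sum formulation and shows why a ZeroR-style constant classifier always scores exactly 0.5: every positive-negative pair is a tie.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic,
    counting each tied positive-negative pair as half concordant."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()   # concordant pairs
    ties = (pos[:, None] == neg[None, :]).sum()     # tied pairs
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]
# ZeroR outputs the same score for every instance: all pairs tie -> AUC 0.5.
print(auc([0.7] * 6, labels))                        # -> 0.5
# A model that separates the classes perfectly reaches AUC 1.0.
print(auc([0.9, 0.8, 0.2, 0.1, 0.7, 0.3], labels))  # -> 1.0
```

This is why the BLSTM's AUC advantage is meaningful even when its Precision/Recall match the prior-based baseline.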

In what follows, we examine the effect that certain demographic attributes, such as gender and age, may have on the proposed diagnostic technology for ad attention. Such effects are important because they allow for market segmentation and better ad tailoring, and they can inform the online auction scheme, further improving auction performance in various ways. We use the BLSTM model, since it is the best performer overall, as indicated by the results discussed above.

5.5.2. Gender Analysis

We were interested in observing how the accuracy of our ad attention model may vary by user gender, i.e. whether users of a specific gender exhibit more (or less) predictable patterns of attention to ad displays. To this end, we analysed the mouse cursor data separately for male and female users, using the same experimental setup as in Section 5.5.1, and compared the BLSTM model's performance across all ad conditions. We used Pearson's χ² test with Yates' continuity correction to assess the differences in accuracy performance and highlight those cases where the model performs significantly better.

Gender N Ratio Adj. Precision Adj. Recall Adj. F-measure AUC
Male 1256 334:922 0.490 0.700 0.576 0.596
Female 884 289:595 0.652 0.691 0.593 0.576
Table 4. Variation of ad attention prediction performance by gender. ‘N’ denotes the sample size of each group and ‘Ratio’ denotes the number of positive:negative instances (users who noticed vs. did not notice the ad). A bold typeface denotes the best result.

The BLSTM model achieved significantly better Precision when predicting attention for female users than for male users. However, no statistically significant differences were observed for any of the other metrics, and effect sizes were small in all cases, thus suggesting a small practical importance. Therefore, we cannot conclude that users' gender plays an important role in predicting user attention to online ads and, subsequently, gender should not be used to inform the online auction.
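As a sketch of this testing procedure, two groups' correct-prediction counts can be compared with SciPy's χ² test with Yates' correction. The 2×2 counts below are hypothetical stand-ins, since the per-group confusion counts are not reported here.

```python
from scipy.stats import chi2_contingency

# Illustrative 2x2 contingency table: correct vs. incorrect predictions
# per gender group. These counts are hypothetical, not the study's data.
table = [[615, 641],   # male:   correct, incorrect
         [576, 308]]   # female: correct, incorrect

# correction=True applies Yates' continuity correction (df = 1).
chi2, p, dof, expected = chi2_contingency(table, correction=True)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3g}")
```

A small p-value here would indicate that prediction accuracy differs significantly between the two groups.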

5.5.3. Age Analysis

We were also interested in observing how the accuracy of our diagnostic technology may vary with user age, i.e. whether users of a specific age group exhibit more (or less) predictable patterns of attention to ad displays, compared to other age groups. We divide our users into six age groups: “18–23”, “24–29”, “30–35”, “36–41”, “42–47”, and “48+”. The age groups are split in such a way that each group has enough users while preserving the common understanding of young, adult, middle-aged, and elderly people.

Again, we use the same setup as in the previous experiments and compare the BLSTM model's performance across all ad conditions. We use Pearson's χ² test of proportions as the omnibus test and, if it reveals a statistically significant difference, we use pairwise comparisons between pairs of proportions, with correction for multiple testing, as post-hoc tests to see whether statistically significant differences exist between individual age groups.
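A minimal sketch of such a post-hoc procedure, assuming hypothetical per-group (successes, trials) counts (not the study's data), using pairwise χ² tests with a Bonferroni correction for multiple testing:

```python
from itertools import combinations
from scipy.stats import chi2_contingency

# Hypothetical per-group (correct predictions, group size) -- illustrative only.
groups = {"18-23": (203, 289), "24-29": (283, 471), "48+": (240, 383)}

pairs = list(combinations(groups, 2))
raw = []
for a, b in pairs:
    sa, na = groups[a]
    sb, nb = groups[b]
    table = [[sa, na - sa], [sb, nb - sb]]  # 2x2: correct vs. incorrect
    raw.append(chi2_contingency(table, correction=True)[1])

# Bonferroni adjustment: multiply each p-value by the number of comparisons.
adjusted = [min(1.0, p * len(pairs)) for p in raw]
for (a, b), p in zip(pairs, adjusted):
    print(f"{a} vs {b}: adjusted p = {p:.4f}")
```

Holm or other step-down corrections could be substituted for Bonferroni; the pairwise structure is the same.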

Age Group N Ratio Adj. Precision Adj. Recall Adj. F-measure AUC
18–23 289 91:198 0.787 0.689 0.574 0.469
24–29 471 131:340 0.602 0.654 0.530 0.616
30–35 459 119:340 0.679 0.739 0.641 0.586
36–41 343 105:238 0.607 0.660 0.541 0.578
42–47 206 64:142 0.659 0.709 0.637 0.580
48+ 383 118:265 0.531 0.721 0.612 0.598
Table 5. Variation of ad attention prediction performance by age. ‘N’ denotes the sample size of each group and ‘Ratio’ denotes the number of positive:negative instances (users who noticed vs. did not notice the ad). A bold typeface denotes the best result.

There was a statistically significant difference between groups for all metrics. In terms of Precision, the differences between the “48+” group and all the other age groups were found to be statistically significant. The differences between the “18–23” group and the remaining groups were also statistically significant. The effect size suggests a small practical importance. In terms of Recall and F-measure, the post-hoc tests revealed no statistically significant differences between age groups; the effect size suggests a small practical importance for both metrics. In terms of AUC, the difference between the “18–23” group and all the other groups was found to be statistically significant. The effect size suggests a small practical importance.

As can be observed in Table 5, the most accurate results were observed for the younger age groups, up to 30–35 years old. Then, as user age increased, the BLSTM decreased significantly in Precision and consistently increased in Recall. This observation, together with the fact that the AUC also deteriorated for the older age groups, suggests an increase in the number of false positives, and therefore a degradation in classification performance.

Our findings underline potential age effects on the way a mouse device is used in an online search task. We also found that the number of mouse movements consistently increased with age in our dataset, from the “18–23” group through the “30–35” group to the “48+” group. However, we should point out that the number of mouse movements alone provides an incomplete picture of age-related effects. Overall, ageing is marked by a decline in motor control abilities, so it is expected to affect users' pointing performance and, by extension, how they move the computer mouse. For example, Smith et al. (1999) observed that older people incurred longer mouse movement times, which we also found in our data, as well as more sub-movements and more pointing errors than the young. Prior work (Hsu et al., 1999; Bohan and Chaparro, 1998; Jastrzembski et al., 2003; Lindberg et al., 2006; Smith et al., 1999; Walker et al., 1997) has also linked age with motor control and pointing performance in tasks that involve the use of a computer mouse. Therefore, we conclude that user age plays an important role in predicting user attention to online ads and, subsequently, age could be used to inform the online auction. We elaborate more on these observations in Section 7 and discuss how they may impact our diagnostic technology.

6. PPA Evaluation

Having shown that user attention to ads on SERPs can be accurately predicted and at a large scale, we proceed to evaluate the expected performance of the PPA auction scheme. Since we do not have full control over an ads platform to test our method live, in this section we illustrate our theoretical findings with a series of numerical simulations, to show how the revenue rankings between the three auction formats are affected when we vary both the correlation between values and attention probabilities, and the fraction of bidders subject to framing effects. We also exemplify the main insights highlighted by our theoretical results and show their significance in the context of the distribution of attention probabilities derived from the BLSTM model discussed in the previous section.

First, we assume that the parameters introduced in Section 3.1 are i.i.d. draws from a uniform distribution over the unit interval, and that attention probabilities are independently drawn from some common distribution. Recall that these parameters denote, respectively, the probability that a bidder realizes a sale conditional on the consumer having clicked on the ad, and the probability that the bidder realizes a sale conditional on the ad being noticed but not clicked.

To illustrate the effects of varying the correlation between valuations and attention probabilities, we follow the simple statistical model of Ausubel and Baranov (2018). Namely, we assume that with some probability valuations are independent draws from a uniform distribution, and with the complementary probability valuations are instead perfectly correlated with the attention probabilities. Recall that a valuation is not a probability: it denotes the bidder's value for making a sale. (We normalise the valuations to the unit interval so that revenues in these simulations are directly expressed as percentages of the highest value of making a sale.) This simple statistical model illustrates how results are affected when one varies the correlation between attention probabilities and values, captured by a single mixture parameter: at one extreme all variables are independent, and at the other valuations and attention probabilities are perfectly correlated. We also track the fraction of bidders who are subject to the framing effects discussed above.
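The mixture can be sketched as follows. The parameter name `alpha` and the seed are ours (the paper's own notation was lost in extraction); `alpha = 0` gives full independence and `alpha = 1` perfect correlation, matching the description above.

```python
import numpy as np

def draw_valuations(q, alpha, rng):
    """Ausubel & Baranov (2018)-style mixture: each bidder's valuation is,
    with probability alpha, perfectly correlated with its attention
    probability (here simply v = q), and otherwise an independent U[0,1] draw."""
    correlated = rng.random(len(q)) < alpha
    return np.where(correlated, q, rng.random(len(q)))

rng = np.random.default_rng(42)
q = rng.random(100_000)            # attention probabilities ~ U[0, 1]
v0 = draw_valuations(q, 0.0, rng)  # fully independent draws
v1 = draw_valuations(q, 1.0, rng)  # perfectly correlated draws
print(np.corrcoef(q, v0)[0, 1])    # ~ 0 by construction
print(np.corrcoef(q, v1)[0, 1])    # ~ 1 by construction
```

Intermediate values of `alpha` interpolate the empirical correlation between these two extremes, which is what the simulations below vary.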

We then simulate the expected revenues of the three auction formats, assuming that for each parameter profile drawn from such a joint distribution, advertisers follow the optimal strategies identified in our results above, which differ depending on whether they are sophisticated or subject to framing effects. We thus compute the expected revenues generated by the optimal strategy profiles in the three auction formats (with and without framing effects), by sampling the parameters from the same common distribution, held constant across the auction formats. The two simulations only differ in the exact specification of such a distribution, and particularly in that of the distribution of attention probabilities.

In the first simulation, we set the distribution of attention probabilities equal to a uniform distribution over the unit interval. This textbook example is best suited to illustrate the various possibilities indicated by the theoretical results. Namely, expected revenues under the three auction formats are the same if all bidders are sophisticated and there is no correlation between valuations and attention probabilities; but as soon as that correlation becomes positive, the difference between the revenues of the PPA and PPI auctions becomes positive and increasing in the correlation. If some bidders are subject to framing effects, PPA revenues are strictly higher than those of the PPC, and also higher than those of the PPI, and more so as the fraction of non-fully-sophisticated bidders increases. As for the relative ranking between PPI and PPC revenues, there is a threshold level of correlation, increasing in the fraction of framed bidders, that determines which of the two formats yields the higher revenues. The results are illustrated in Figure 3 for varying fractions of framed bidders. For intermediate correlation levels, for example, PPA revenues are noticeably higher than those of the PPC and, when framing effects are present, higher still than those of the PPI.

(a) No framing effects
(b) Framing effects for half the bidders
(c) Framing effects for all bidders
Figure 3. Comparisons of expected revenues as a function of the correlation between valuations and attention probabilities, with attention probabilities uniformly distributed over the unit interval, under varying fractions of bidders subject to framing effects: none (a), half of the bidders (b), all bidders (c).
(a) No framing effects
(b) Framing effects for half the bidders
(c) Framing effects for all bidders
Figure 4. Comparisons of expected revenues as a function of the correlation between valuations and attention probabilities, with attention probabilities following a fitted Beta distribution (MLE computed from the BLSTM model and the data gathered from our user study), under varying fractions of bidders subject to framing effects: none (a), half of the bidders (b), all bidders (c).

In the second simulation, we set the distribution of attention probabilities equal to a parametric maximum-likelihood estimate (MLE) based on the BLSTM model predictions from our crowdsourced user study. We note that as long as a diagnostic technology like ours is correct on average – that is, without systematically under- or over-estimating the attention probabilities – then the accuracy of the estimates would not affect the optimal bids of risk-neutral advertisers, nor the expected revenues of the auctions. For this reason, the accuracy of the estimated attention probabilities plays no role in these simulations.

More specifically, in the second simulation we fit a Beta distribution to the observed attention probabilities derived from the BLSTM model and then perform a two-parameter MLE, pooling data across all subjects and ad types in our experiment. (It is possible to perform alternative simulations using MLE distributions based on subsamples of the data.) We thus set the distribution of attention probabilities in the numerical simulation equal to the fitted Beta distribution and generate the expected revenues for the three auction formats as explained above. The results are illustrated in Figure 4 for varying fractions of framed bidders.
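The two-parameter MLE can be sketched with SciPy by fixing the location and scale of the Beta fit, so that only the two shape parameters are estimated. The data below are synthetic stand-ins for the BLSTM-predicted probabilities, not the study's actual predictions.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
# Synthetic stand-in for predicted attention probabilities; clip away
# exact 0/1 values, where the Beta log-likelihood is undefined.
probs = np.clip(rng.beta(2.0, 5.0, size=2000), 1e-6, 1 - 1e-6)

# Two-parameter MLE: floc/fscale pin location to 0 and scale to 1,
# leaving only the shape parameters a and b free.
a_hat, b_hat, loc, scale = beta.fit(probs, floc=0, fscale=1)
print(round(a_hat, 2), round(b_hat, 2))
```

The fitted `Beta(a_hat, b_hat)` then replaces the uniform distribution of the first simulation when drawing attention probabilities.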

Comparing these results with those from the previous simulation, it is interesting to note that using the MLE of the observed probabilities (derived from the BLSTM model) increases the extent by which the PPA outperforms the PPC: PPA revenues are consistently higher than those of the PPC across all correlation values. In contrast, the difference between revenues under the PPA and PPI for low correlation values is less pronounced than in the previous simulation. Obviously, these figures are purely indicative, since the actual revenue performance also depends on the distribution of other variables, in particular the valuations, which were not elicited by our experiment and hence were not calibrated in the above simulations. Nonetheless, as shown in Section 3.3.1, our analytic results imply that the PPA revenues would always be strictly higher than those of the PPI whenever valuations and attention probabilities are positively correlated, and higher than those of the PPC whenever some bidders are subject to framing effects.

7. Discussion

Our analysis explicitly accounts for the possibility that bidders may value ads beyond the clicks they generate. The literature’s benchmark, in which only clicks are valued, is embedded as a special case in the PPA scheme. Thanks to this generalisation, our study produces novel theoretical insights on the revenue ranking of PPI and PPC schemes, as well as on the novel auction format we propose in this article.

We have shown that the PPA second-price auction has the same desirable properties (namely, strategy-proofness and efficiency) as its PPI and PPC counterparts. Revenues are identical under the three formats if bidders are fully sophisticated and if attention probabilities are either constant across bidders or uncorrelated with their valuations. But PPA’s revenues are higher than the PPI if valuations and attention probabilities are positively correlated, and they are higher than the PPC as soon as some of the bidders are subject to framing effects. PPA’s revenues could be lower than PPI’s only for correlation structures (e.g., with negative correlation between valuations and attention probabilities) under which also the PPC would do worse than the PPI. Since the PPC is widely considered to outperform the PPI, the possibility of such environments seems less relevant in practice.

To the extent that higher valuation advertisers have stronger incentives to invest in better advertisements, a positive correlation between valuations and attention probabilities should be expected on average; i.e. situations in which the PPA outperforms the PPI. It is also expected that at least a small fraction of bidders are not fully sophisticated; i.e. situations in which the PPA outperforms the PPC. If we consider the possibility that different formats affect the incentives of the platform to maximise CTRs and users’ attention, there would be even stronger reasons to prefer the PPA over the current alternatives, because it would align the platform’s incentives with a more direct measure of the advertisers’ objectives. This would have the effect of increasing the total surplus, and hence increase revenues beyond the effects covered by our analysis.

The fact that estimated probabilities may be strictly between 0 and 1 reflects the uncertainty of the information we may have on whether the ad was actually noticed. However, as long as the estimates are consistent, a high estimated probability indicates a high chance that the ad was truly noticed. Hence, the fact that payment is proportional to the estimated probability already ensures that, over large numbers of impressions, advertisers are paying the right proportion of times: the lower the probability, the lower the payment, and hence the revenues. Different payment schemes (e.g., PPI) would generate lower revenues in a different way, i.e. through lower bids, since, under the PPI scheme, bidders would understand that they would be charged also for ads which are ineffective. The point of the PPA auction is precisely to mitigate this effect, without going all the way to the PPC scheme, in which payments are not made even if the ad is noticed but not clicked.

Along these lines, perhaps other metrics related to ad performance could be taken into account in the PPA scheme or its derivatives. For example, the amount of time an ad is shown on screen may increase the chance of noticing the ad. However, to the best of our knowledge, on-screen time is not factored into the pricing of existing auction formats. Of course, one could devise an alternative auction format that takes time, rather than attention, as the input to the pricing scheme, but we think it would be a less direct, and hence less effective, method than the format we propose. The reason is that the ultimate source of surplus is always the attention of users. A longer on-screen time may make it more probable that an ad will eventually be noticed, but it is not obvious that time per se creates extra value independent of the increase in the probability of noticing the ad that it may generate.

One of the reasons why the PPC is often preferred to the PPI in practice is that it insures advertisers against the risk of paying for ineffective ads. In the PPI, in contrast, it is the seller who is fully insured, in that there is no uncertainty associated with the payments they receive. The PPA scheme provides an intermediate allocation of risk. Hence, in situations in which both sides exhibit some degree of risk aversion, the PPA may actually increase the total surplus, as well as provide a more equitable split thereof. A systematic analysis of the impact of risk aversion on the different payment schemes, however, is left as an opportunity for future work.

The results on demographics discussed at the end of Section 5 suggest perceptual differences across the examined user groups. For example, the findings reported in Sections 5.5.2 and 5.5.3 suggest differences in the reliability of the attention predictions across gender and age groups, though gender was not found to be a statistically significant confounding factor. Demographic information may be used to fine-tune the design of the PPA auction, so as to further increase its profitability or to pursue other kinds of desiderata. More specifically, it could be used in further developments of the PPA auction by re-weighting the way bids affect advertisers' payments, as a function of the observable demographics. We note that, in our analysis, we do not assume prior knowledge of any such demographic information, nor do we use gender and age attributes as part of the training input to the BLSTM model. However, we argue that inferring these attributes is possible even for a commercial Web search service that does not intentionally store such user profile information (Hu et al., 2007; Pentel, 2017; Yamauchi and Bowman, 2014).

At first sight, one might think that the PPA auction will only benefit the advertisers, because it ensures that they are charged only if their ads are actually noticed. However, as shown by our results, it is also beneficial for the platform, since it may boost its revenues. Ultimately, the PPA auction promotes a fairer and more transparent auction process: because it directly prices user attention, the PPA provides a more effective target for advertiser campaigns, thereby further aligning the ad delivery platform's incentives with the objectives of the advertisers. As mentioned in Section 1, this is especially the case for advertisers whose campaigns aim to generate mainly brand or product awareness, rather than to induce direct online sales, which is the case for many (if not most) of the highest-value advertisers.

The PPA scheme depends on a diagnostic technology to effectively capture user attention, therefore in principle it is more difficult to put into production than the existing PPC and PPI schemes. In PPC, a simple redirect through the host site is sufficient to know that the user clicked the ad. In PPI, the ad delivery platform knows that a particular ad was displayed while the host site renders the page. In PPA, the host site must track the mouse movements and, on exiting the page, it must inform the ad delivery platform. However, it is relatively easy to use JavaScript code to track the mouse cursor movements and send the data unobtrusively in the background. It can be accomplished, for example, with two lines of code in Google Analytics: one line to set the onmousemove event listener and another line to call the ga.send() method. In addition, previous work has showcased scalable technologies to log mouse movements, such as transmitting the data whenever a mouse pause is detected (Huang et al., 2011) or even using LZW compression to save bandwidth (Leiva and Vivó, 2013). On the other hand, the host site does not need to compute ad attention, nor should it. Instead, computing ad attention should be performed by the ad delivery platform, not only because some host sites may have limited computational resources, but also to prevent potential ad fraud. So, as we see it, the host site only needs to transmit the mouse movements to the ad delivery platform, and this can be done easily and at scale. Then, the ad delivery platform must query a trained model, which usually takes a few milliseconds, to re-estimate ad valuations according to the predicted attention probability. However, computing ad attention does not need to be done in real time; instead, each ad could be queued for a few seconds before effectively charging the advertiser.
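As a sketch of how such a transmission payload can be kept small (the cited works use pause-based batching and LZW compression; the plain delta-encoding shown here is our illustrative stand-in), successive cursor coordinates can be sent as small differences rather than absolute positions:

```python
def delta_encode(points):
    """[(x, y), ...] -> first point followed by successive differences.
    Small deltas compress far better than absolute screen coordinates."""
    out = [points[0]]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        out.append((x1 - x0, y1 - y0))
    return out

def delta_decode(deltas):
    """Inverse of delta_encode: rebuild absolute coordinates."""
    pts = [deltas[0]]
    for dx, dy in deltas[1:]:
        x, y = pts[-1]
        pts.append((x + dx, y + dy))
    return pts

trail = [(120, 80), (123, 82), (129, 85), (140, 85)]
assert delta_decode(delta_encode(trail)) == trail  # lossless round trip
```

The decoding side of such a scheme would run on the ad delivery platform, which, as argued above, is also where attention should be computed.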

8. Limitations and Future Work

Our work comes with certain limitations that we intend to address in future work. First and foremost, the proposed PPA scheme has been introduced for the single-slot auction case, yet SERPs clearly display more than one ad at a time, at different positions. Now that we know that the PPA auction scheme is feasible, an important extension we plan to pursue is to investigate the case in which multiple slots are sold at the same time. This is challenging from an economic analysis perspective because of the well-known complexity associated with ensuring strategy-proofness in multi-unit settings (see, e.g., Facebook's VCG mechanism), and because of the complexity of strategic behaviour in the (non-strategy-proof) multi-unit versions of second-price auctions (such as Google's GSP auction format). Nonetheless, the logic of the PPA pricing scheme can be extended in both directions, and we expect that the advantages discussed above for the single-item case would extend to multi-unit environments as well.

The multi-slot auction case is also challenging for our diagnostic technology, because mouse cursor movements may not clearly indicate user attention to every possible ad on the SERP. The reason is that current SERPs usually include a variable number of modules, such as advertisements, query suggestions, video and image results, and media-rich vertical content, that compete for users' attention. There is evidence to support the claim that increasing the number of modules on the SERP may affect task completion (Rosenholtz et al., 2005). Furthermore, the diversity of modules in SERPs can also impact user experience and scan order (McCay-Peet et al., 2012; Marcos et al., 2015). One way to circumvent this would be to incorporate contextual information about the SERP structure, at the expense of making our BLSTM model less generalisable. Note that, currently, the input to our BLSTM model is a sequence of raw mouse cursor coordinates only. In addition to trying to solve these challenges, we will examine other dependent variables, such as perceived ad relevance or usefulness, which may be used to inform fine-grained auction schemes.

Second, our experimental methodology, and in particular the way we split our query sample during cross-validation, may have introduced some artifacts. More specifically, we introduced multiple instances of the same queries for each combination of ad format and ad position. Ideally, one would like a model to generalise to previously unseen queries. We argue that the way we chose to sample the search queries, i.e. so that they span several popular and diverse topics and a period of 12 months, helped mitigate unaccounted, systematic biases in our analysis. This is further supported by prior evidence demonstrating that mouse cursor patterns are, to some extent, independent of the web page content (Arapakis et al., 2014b; Lagun et al., 2014a) and, by extension, of the web search queries as well. Nevertheless, we cannot entirely discount the possibility that our collection of web search queries may have had an effect on the mouse cursor behaviour. Therefore, we plan to investigate this further in follow-up work.

Third, the experimental approach to studying user attention through simulated web search tasks may have affected the ecological validity of our results. However, unlike other possible methodologies that could be adopted to analyse web browsing behaviour at large scale (e.g., bucket testing or query log analysis), crowdsourcing allows for exploring a wider range of parameters in a more controlled manner. The downside is, as noted, the difficulty of generalising the findings, because the sample size that can be collected in a crowdsourcing study (e.g., hundreds or thousands of users) is typically much smaller than the sample size that can be collected in the wild (e.g., millions of users). Ultimately, we decided to ensure, to the best of our ability, the internal validity of the experiment, i.e. the extent to which an observed effect is due to the test conditions. Taking these limitations into consideration, we devised our experimental design so that it mitigates most of the unwanted effects. For example, we introduced brief search tasks that offered a context that the users could relate to, by engaging them with popular web search queries. We further allowed the users to conduct the study from their environment of choice, whether that was, e.g., their home or office. Moreover, we did not impose any limitations on the task duration and, last, we collected the mouse cursor data in a non-invasive way that did not disrupt the natural flow of the search task.

We have shown that it is possible to detect user attention to online ads with competitive accuracy. And, more importantly, we have shown that it is possible to do so unobtrusively and at large scale. Still, there is an opportunity to improve further the prediction capabilities of our diagnostic technology; for example, by stacking more recurrent layers, increasing the number of neurons, using other activation functions, improving the optimizer hyperparameters, or implementing more sophisticated model architectures. In particular, the multi-headed self-attention mechanism (Vaswani et al., 2017) has shown promise in Natural Language Processing tasks such as Machine Translation and Speech Recognition, and therefore could be explored for mouse cursor trajectories as well.

Our work has presented some preliminary findings on the potential effects of user’s age on attention prediction. We observed perceptual differences in performance, across different age groups, that are in line with previous findings from the motor control literature. These findings may have implications for the proposed diagnostic technology and, by extension, to the PPA auction. Although in this work we did not explore the role of user’s age in the bidding process, we believe it is definitely worthy of investigation and thus leave this extension to our auction scheme for follow-up work. We note that the general idea of the PPA auction scheme, as well as our diagnostic technology, lend themselves to many extensions such as this one.

Another promising avenue for future work is the examination of other implicit behavioural signals, such as eye movements, brain activity (which can be measured by means of electroencephalography), and other potentially more objective metrics of attention. While these sensory channels are not as easy to collect as mouse cursor activity, they can nevertheless provide more accurate insights about how users perceive and react to different ad formats. So far, the role of these signals in web search has been studied mostly in isolation (Goldberg et al., 2002; Barral et al., 2016; Jacucci et al., 2019), and we believe that a combination of them would provide us with more insightful information.

Finally, we should mention that our diagnostic technology has been studied in a desktop setting only. It is not expected to work without adjustments on touch-capable devices such as tablets and smartphones. This limitation may raise some concerns about the practicality of this diagnostic technology, since currently half of the web traffic is mobile. However, it has been reported elsewhere (e.g., https://www.perficientdigital.com/insights/our-research/mobile-vs-desktop-usage-study and https://hostingfacts.com/internet-facts-stats/) that engagement is higher on desktop. For example, 58% of time spent on sites is by desktop users and 42% by mobile users, and similar trends are reported for the percentage of page views per visit. In other words, desktop search is still very relevant and accounts for a profitable and sizeable percentage of web traffic, and hence it provides a reasonable and important starting point for our analysis. Potential extensions of our diagnostic technology to account for touch-based interactions include, for example, tracking zoom/pinch gestures and scroll activity instead of the mouse cursor position. This was in fact investigated in previous work by Guo et al. (2013), who proposed the Mobile Touch Interaction model, a feature-based classifier that could identify basic patterns of reading and scanning behaviour. Unfortunately, all the proposed touch-based features were found to be weakly correlated with explicit judgements of document relevance. Therefore, there is still plenty of room for improvement in this research area.

9. Conclusion

We have introduced the PPA auction scheme, a novel pay-per-attention second-price auction format that includes user attention to ads in the bidding process. Under the PPA scheme, advertisers are charged only if their ads are actually noticed by the users. We have proved that the PPA inherits the same desirable properties as the popular PPI and PPC formats (namely, strategy-proofness and efficiency). We have also shown that the revenues of the PPA auction compare favourably with those of its PPI and PPC counterparts, and that in many relevant economic environments they are in fact strictly higher than those of the PPI and PPC formats.

To make PPA feasible, we have introduced a scalable diagnostic technology that estimates user attention to ads in sponsored search using mouse cursor information and a recurrent neural network. We have evaluated four different classifiers, all of which achieve reasonable performance, and have shown that our recurrent neural network, which operates on raw mouse cursor data, yields further noticeable improvements over the baseline models, which rely on ad hoc, domain-specific features of the SERP.
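The core idea of scoring a raw cursor trajectory with a recurrent network can be sketched as follows. This is not our implementation; it is a minimal forward pass of a vanilla Elman RNN whose final hidden state is mapped through a sigmoid to an attention probability, with all weights, the hidden size, and the synthetic trajectory chosen purely for illustration.

```python
import numpy as np

def rnn_attention_score(coords, Wxh, Whh, Why, bh, by):
    """coords: array of shape (T, 2) holding raw (x, y) cursor positions."""
    h = np.zeros(Whh.shape[0])
    for xy in coords:
        h = np.tanh(Wxh @ xy + Whh @ h + bh)   # recurrent state update
    logit = Why @ h + by                        # linear read-out layer
    return 1.0 / (1.0 + np.exp(-logit))         # sigmoid -> probability

rng = np.random.default_rng(0)
H = 8  # hidden units (illustrative)
Wxh = rng.normal(size=(H, 2)) * 0.1
Whh = rng.normal(size=(H, H)) * 0.1
Why = rng.normal(size=H) * 0.1
bh, by = np.zeros(H), 0.0

traj = rng.normal(size=(50, 2))  # a 50-step synthetic cursor trail
p = rnn_attention_score(traj, Wxh, Whh, Why, bh, by)
```

In practice the weights would of course be trained on labelled trajectories; the point of the sketch is only that the model consumes raw coordinates directly, with no hand-crafted SERP features.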

Further, our numerical simulations exemplify the main insights of our theoretical analysis and illustrate their significance under the distribution of attention probabilities produced by our diagnostic technology. Ultimately, this work extends our current understanding of user engagement with ad displays in web search and paves the way towards more intelligent ad auctions and better user models.

Acknowledgements.
We thank the anonymous reviewers for their constructive and valuable feedback. We also thank Malachy Gavan for valuable research assistance work. L. A. Leiva acknowledges support from the Academy of Finland (BAD project).

References

  • G. Aggarwal, A. Goel, and R. Motwani (2006) Truthful auctions for pricing search keywords. In Proc. EC, Cited by: §2.1.1.
  • I. Arapakis, M. Lalmas, B. B. Cambazoglu, M. Marcos, and J. M. Jose (2014a) User engagement in online news: under the scope of sentiment, interest, affect, and gaze. J. Assoc. Inf. Sci. Technol. 65 (10). Cited by: §2.2.2, §5.3.
  • I. Arapakis, M. Lalmas, and G. Valkanas (2014b) Understanding within-content engagement through pattern analysis of mouse gestures. In Proc. CIKM, Cited by: §2.2.2, §8.
  • I. Arapakis, L. A. Leiva, and B. B. Cambazoglu (2015) Know your onions: understanding the user experience with the knowledge module in web search. In Proc. CIKM, Cited by: §2.2.1.
  • I. Arapakis and L. A. Leiva (2016) Predicting user engagement with direct displays using mouse cursor information. In Proc. SIGIR, Cited by: §1, §2.2.2, §2.2, §4.4, §4, §5.2, §5.
  • E. Arroyo, S. Sullivan, and T. Selker (2006) CarCoach: a polite and effective driving coach. In Proc. CHI EA, Cited by: §2.2.
  • S. Athey and D. Nekipelov (2014) A structural model of sponsored search advertising auctions. Mimeo. Cited by: §2.1.1.
  • S. Athey and G. Ellison (2011) Position auctions with consumer search. Q. J. Econ. 126 (3). Cited by: §2.1.3.
  • R. Atterer, M. Wnuk, and A. Schmidt (2006) Knowing the user’s every move: user activity tracking for website usability evaluation and implicit interaction. In Proc. WWW, Cited by: §2.2.
  • L. M. Ausubel and O. Baranov (2018) Core-selecting auctions with incomplete information. Mimeo. Cited by: §1, §6.
  • Y. Bachrach, S. Ceppi, I. A. Kash, P. Key, and D. Kurokawa (2014) Optimising trade-offs among stakeholders in ad auctions. In Proc. EC, Cited by: §2.1.3.
  • Y. Bachrach (2010) Honor among thieves: collusion in multi-unit auctions. In Proc. AAMAS, Cited by: §2.1.2.
  • N. Barbieri, F. Silvestri, and M. Lalmas (2016) Improving post-click user engagement on native ads via survival analysis. In Proc. WWW, Cited by: §2.2, §5.3.
  • O. Barral, I. Kosunen, T. Ruotsalo, M. M. Spapé, M. J. Eugster, N. Ravaja, S. Kaski, and G. Jacucci (2016) Extracting relevance and affect information from physiological text annotation. User Model. User-Adap. 26 (5). Cited by: §8.
  • T. Blake, C. Nosko, and S. Tadelis (2015) Consumer heterogeneity and paid search effectiveness: a large-scale field experiment. Econometrica 83 (1). Cited by: §3, footnote 4.
  • M. Bohan and A. Chaparro (1998) Age-related differences in performance using a mouse and trackball. Hum. Factors 42 (2). Cited by: §5.5.3.
  • T. Borgers, I. Cox, M. Pesendorfer, and V. Petricek (2013) Equilibrium bids in sponsored search auctions: theory and evidence. Am. Econ. J.: Microeconomics 5 (4). Cited by: §2.1.1.
  • Brightfish, Profacts, and Lumen (2018) From viewable to viewed: using eye tracking to understand the reality of attention to advertising across media. Note: White paper. Retrieved: October 10 2019. Available: https://effectiveviews.be/files/White_Paper_From_Viewable_to_viewed.pdf, Cited by: §2.2.2.
  • Y. Chen, Y. Liu, M. Zhang, and S. Ma (2017) User satisfaction prediction with mouse movement information in heterogeneous search environment. IEEE Trans. Knowl. Data. Eng. 29 (11). Cited by: §2.2.
  • H. Choi, C. F. Mela, S. Balseiro, and A. Leary (2018) Online display advertising markets: a literature review and future directions. Columbia Business School Research Paper No. 18-1. Cited by: §2.1.4, footnote 2.
  • E. H. Clarke (1971) Multipart pricing of public goods. Public Choice 11. Cited by: §2.1.1.
  • M. Claypool, P. Le, M. Wased, and D. Brown (2001) Implicit interest indicators. In Proc. IUI, Cited by: §2.2.1.
  • F. Decarolis, M. Goldmanis, and A. Penta (2018) Marketing Agencies and Collusive Bidding in Online Ad Auctions. Mimeo. Cited by: §2.1.2.
  • F. Diaz, R. White, G. Buscher, and D. Liebling (2013) Robust models of mouse movement on dynamic web search results pages. In Proc. CIKM, Cited by: §2.2.1.
  • A. Diriye, R. White, G. Buscher, and S. Dumais (2012) Leaving so soon?: understanding and predicting web search abandonment rationales. In Proc. CIKM, Cited by: §2.2.1.
  • B. Edelman, M. Ostrovsky, and M. Schwarz (2007) Internet advertising and the generalized second-price auction: selling billions of dollars worth of keywords. Am. Econ. Rev. 97 (1). Cited by: §2.1.1, §3.1, footnote 7.
  • B. Edelman and M. Schwarz (2010) Optimal auction design and equilibrium selection in sponsored search auctions. Am. Econ. Rev. 100 (2). Cited by: §2.1.1, §2.1.3.
  • H. A. Feild, J. Allan, and R. Jones (2010) Predicting searcher frustration. In Proc. SIGIR, Cited by: §2.2.1, §4.4.
  • K. Fjell (2009) Online advertising: pay-per-view versus pay-per-click – a comment. IJRM 8 (2-3). Cited by: §2.1.4.
  • K. Fridgeirsdottir and S. Najafi-Asadolahi (2018) Cost-per-impression pricing for display advertising. Oper. Res. 66 (3). Cited by: §2.1.4.
  • J. H. Goldberg, M. J. Stimson, M. Lewenstein, N. Scott, and A. M. Wichansky (2002) Eye tracking in web search tasks: design implications. In Proc. ETRA, Cited by: §8.
  • R. Gomes and K. Sweeney (2014) Bayes–Nash equilibria of the generalized second-price auction. Games Econ. Behav. 86. Cited by: §2.1.1.
  • Google (2014) The importance of being seen: viewability insights for digital marketers and publishers. Technical report Cited by: footnote 3.
  • T. Groves (1973) Incentives in teams. Econometrica 41 (4). Cited by: §2.1.1.
  • M. Grusky, J. Jahani, J. Schwartz, D. Valente, Y. Artzi, and M. Naaman (2017) Modeling sub-document attention using viewport time. In Proc. CHI, Cited by: §2.2.
  • Q. Guo and E. Agichtein (2008) Exploring mouse movements for inferring query intent. In Proc. SIGIR, Cited by: §2.2.1, §2.2.
  • Q. Guo and E. Agichtein (2010) Ready to buy or just browsing?: detecting web searcher goals from interaction data. In Proc. SIGIR, Cited by: §2.2.1, §2.2.
  • Q. Guo and E. Agichtein (2012) Beyond dwell time: estimating document relevance from cursor movements and other post-click searcher behavior. In Proc. WWW, Cited by: §2.2.1.
  • Q. Guo, H. Jin, D. Lagun, S. Yuan, and E. Agichtein (2013) Mining touch interaction data on mobile devices to predict web search result relevance. In Proc. SIGIR, Cited by: §8.
  • Q. Guo, D. Lagun, and E. Agichtein (2012) Predicting web search success with fine-grained interaction data. In Proc. CIKM, Cited by: §2.2.1, §2.2.1, §2.2.
  • K. Hendricks, R. Porter, and G. Tan (2008) Bidding rings and the winner’s curse. RAND J. Econ. 39 (4). Cited by: §2.1.2.
  • S. H. Hsu, C. C. Huang, Y. H. Tsuang, and J. S. Sun (1999) Effects of age and gender on remote pointing performance and their design implications. Int. J. Ind. Ergon. 23 (5). Cited by: §5.5.3.
  • J. Hu, H. Zeng, H. Li, C. Niu, and Z. Chen (2007) Demographic prediction based on user’s browsing behavior. In Proc. WWW, Cited by: §7.
  • J. Huang, R. White, and G. Buscher (2012a) User see, user point: gaze and cursor alignment in web search. In Proc. CHI, Cited by: §2.2.
  • J. Huang, R. W. White, G. Buscher, and K. Wang (2012b) Improving searcher models using mouse cursor activity. In Proc. SIGIR, Cited by: §2.2.1.
  • J. Huang, R. W. White, and S. Dumais (2011) No clicks, no problem: using cursor movements to understand and improve search. In Proc. CHI, Cited by: §2.2.1, §7.
  • IPG Media Lab and Sharethrough (2013) Native ads vs banner ads: native ad research from IPG & Sharethrough reveals that in-feed beats banners. Technical report Cited by: §4.
  • G. Jacucci, O. Barral, P. Daee, M. Wenzel, B. Serim, T. Ruotsalo, P. Pluchino, J. Freeman, L. Gamberini, S. Kaski, and B. Blankertz (2019) Integrating neurophysiologic relevance feedback in intent modeling for information retrieval. J. Assoc. Inf. Sci. Technol. 70 (9). Cited by: §8.
  • T. Jastrzembski, N. Charness, P. Holley, and J. Feddon (2003) Input devices for web browsing: age and hand effects. Universal Access Inf. 4. Cited by: §5.5.3.
  • P. Jehiel and L. Lamy (2015) On discrimination in auctions with endogenous entry. Am. Econ. Rev. 105. Cited by: §2.1.3.
  • Z. Jiang, S. Gao, and W. Dai (2016) Research on CTR prediction for contextual advertising based on deep architecture model. Control Eng. Appl. Inf. 18. Cited by: §2.2.
  • S.M. Johnson, P. Smith, and S. Tucker (1982) Response format of the job descriptive index: assessment of reliability and validity by the multitrait-multimethod matrix. J. Appl. Psychol. 67 (4). Cited by: §4.4.
  • A. Kolesnikov, Y. Logachev, and V. Topinskiy (2012) Predicting ctr of new ads via click prediction. In Proc. CIKM, Cited by: §2.2.
  • D. Lagun, M. Ageev, Q. Guo, and E. Agichtein (2014a) Discovering common motifs in cursor movement data for improving web search. In Proc. WSDM, Cited by: §2.2.2, §2.2, §4.4, §8.
  • D. Lagun and E. Agichtein (2015) Inferring searcher attention by jointly modeling user interactions and content salience. In Proc. SIGIR, Cited by: §5.3.
  • D. Lagun, C. Hsieh, D. Webster, and V. Navalpakkam (2014b) Towards better measurement of attention and satisfaction in mobile search. In Proc. SIGIR, Cited by: §2.2.
  • D. Lagun, D. McMahon, and V. Navalpakkam (2016) Understanding mobile searcher attention with rich ad formats. In Proc. CIKM, Cited by: §2.2.
  • P. Lahaie (2011) Efficient ranking in sponsored search auctions. In Proc. WINE, Cited by: §2.1.3.
  • L. A. Leiva and J. Huang (2015) Building a better mousetrap: compressing mouse cursor activity for web analytics. Inf. Process. Manag. 51 (2). Cited by: §4.3.
  • L. A. Leiva and R. Vivó (2013) Web browsing behavior analysis and interactive hypervideo. ACM Trans. Web 7 (4). Cited by: §4.3, §7.
  • L. A. Leiva (2011) Restyling website design via touch-based interactions. In Proc. MobileHCI, Cited by: §2.2.
  • Y. Li, P. Xu, D. Lagun, and V. Navalpakkam (2017) Towards measuring and inferring user interest from gaze. In Proc. WWW Companion, Cited by: §2.2.
  • T. Lindberg, R. Näsänen, and K. Müller (2006) How age affects the speed of perception of computer icons. Displays 27 (4). Cited by: §5.5.3.
  • Y. Liu, Y. Chen, J. Tang, J. Sun, M. Zhang, S. Ma, and X. Zhu (2015) Different users, different opinions: predicting search satisfaction with mouse movement information. In Proc. SIGIR, Cited by: §2.2.1, §2.2.2, §2.2, §4.4.
  • Y. Liu, C. Wang, K. Zhou, J. Nie, M. Zhang, and S. Ma (2014) From skimming to reading: a two-stage examination model for web search. In Proc. CIKM, Cited by: §2.2.2.
  • G. J. Mailath and P. Zemsky (1991) Collusion in second price auctions with heterogeneous bidders. Games Econ. Behav. 3 (4). Cited by: §2.1.2.
  • A. Mangani (2004) Online advertising: pay-per-view versus pay-per-click. IJRM 2 (4). Cited by: §2.1.4.
  • Y. Mansour, S. Muthukrishnan, and N. Nisan (2012) Doubleclick ad exchange auction. CoRR abs/1204.0535. Cited by: §2.1.2.
  • J. Mao, Y. Liu, N. Kando, Z. He, M. Zhang, and S. Ma (2018a) A two-stage model for user’s examination behavior in mobile search. In Proc. CHIIR, Cited by: §2.2.
  • J. Mao, C. Luo, M. Zhang, and S. Ma (2018b) Constructing click models for mobile search. In Proc. SIGIR, Cited by: §2.2.
  • M. Marcos, F. Gavin, and I. Arapakis (2015) Effect of snippets on user experience in web search. In Proc. INTERACCION, Cited by: §8.
  • D. Martín-Albo, L. A. Leiva, J. Huang, and R. Plamondon (2016) Strokes of insight: user intent detection and kinematic compression of mouse cursor trails. Inf. Process. Manag. 52 (6). Cited by: §2.2.
  • W. Mason and S. Suri (2012) Conducting behavioral research on amazon’s mechanical turk. Behav. Res. Methods 44 (1). Cited by: §4.
  • R. McAfee and J. McMillan (1992) Bidding rings. Am. Econ. Rev. 82 (3). Cited by: §2.1.2.
  • L. McCay-Peet, M. Lalmas, and V. Navalpakkam (2012) On saliency, affect and focused attention. In Proc. CHI, Cited by: §8.
  • P. Milgrom and J. Mollner (2018) Equilibrium selection in auctions and high stakes games. Econometrica 86 (1). Cited by: §2.1.1.
  • R. B. Myerson (1981) Optimal auction design. Math. Oper. Res. 6 (1). Cited by: §2.1.3.
  • S. Najafi-Asadolahi and K. Fridgeirsdottir (2014) Cost-per-click pricing for display advertising. Manuf. Serv. Oper. Manag. 16 (4). Cited by: §2.1.4.
  • A. Nath, S. Mukherjee, P. Jain, N. Goyal, and S. Laxman (2013) Ad impression forecasting for sponsored search. In Proc. WWW, Cited by: §2.2.
  • V. Navalpakkam, L. Jentzsch, R. Sayres, S. Ravi, A. Ahmed, and A. Smola (2013) Measurement and modeling of eye-mouse behavior in the presence of nonlinear page layouts. In Proc. WWW, Cited by: §2.2.
  • M. Ostrovsky and M. Schwarz (2011) Reserve prices in internet advertising auctions: a field experiment. In Proc. EC, Cited by: §2.1.3.
  • J. W. Owens, B. S. Chaparro, and E. M. Palmer (2011) Text advertising blindness: the new banner blindness?. J. Usability Stud. 6 (3). Cited by: §4.
  • R. Paes Leme and É. Tardos (2010) Pure and Bayes–Nash price of anarchy for generalized second price auction. In Proc. FOCS, Cited by: §2.1.1.
  • A. Pentel (2017) Predicting age and gender by keystroke dynamics and mouse patterns. In Proc. Adj. UMAP, Cited by: §7.
  • B. Roberts, D. Gunawardena, I. A. Kash, and P. Key (2016) Ranking and tradeoffs in sponsored search auctions. ACM Trans. Econ. Comput. 4 (3). Cited by: §2.1.3.
  • R. Rosenholtz, Y. Li, J. Mansfield, and Z. Jin (2005) Feature congestion: a measure of display clutter. In Proc. CHI, Cited by: §8.
  • B. Shapira, M. Taieb-Maimon, and A. Moskowitz (2006) Study of the usefulness of known and new implicit indicators and their optimal combination for accurate inference of users interests. In Proc. SAC, Cited by: §2.2.1, §2.2.
  • M. W. Smith, J. Sharit, and S. J. Czaja (1999) Aging, motor control, and the performance of computer mouse tasks. Hum. Factors 41 (3). Cited by: §5.5.3.
  • M. Speicher, A. Both, and M. Gaedke (2013) TellMyRelevance!: predicting the relevance of web search results from cursor interactions. In Proc. CIKM, Cited by: §2.2.1.
  • D. R. M. Thompson and K. Leyton-Brown (2013) Revenue optimization in the generalized second-price auction. In Proc. EC, Cited by: §2.1.3.
  • H. R. Varian (2007) Position auctions. Int. J. Ind. Organ. 25 (6). Cited by: §2.1.1, §3.1, footnote 7.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. NIPS, Cited by: §8.
  • W. Vickrey (1961) Counterspeculation, auctions, and competitive sealed tenders. J. Finance 16 (1). Cited by: §2.1.1.
  • N. Walker, D. A. Philbin, and A. D. Fisk (1997) Age-related differences in movement control: adjusting submovement structure to optimize performance. J. Gerontol. A Biol. Sci. Med. Sci. 52 (1). Cited by: §5.5.3.
  • X. Wang, N. Su, Z. He, Y. Liu, and S. Ma (2018) A large-scale study of mobile search examination behavior. In Proc. SIGIR, Cited by: §2.2.
  • T. Yamauchi and C. Bowman (2014) Mining cursor motions to find the gender, experience, and feelings of computer users. In Proc. ICDM, Cited by: §7.
  • S. Zhai, K. Chang, R. Zhang, and Z. M. Zhang (2016) DeepIntent: learning attentions for online advertising with recurrent neural networks. In Proc. KDD, Cited by: §2.2.