Online Advertising Security: Issues, Taxonomy, and Future Directions

06/06/2020 ∙ by Zahra Pooranian, et al. ∙ Università di Padova, Imperial College London

Online advertising has become the backbone of the Internet economy by revolutionizing business marketing. It provides a simple and efficient way for advertisers to display their advertisements to specific individual users, and over the last couple of years has contributed to an explosion in the income stream for several web-based businesses. For example, Google's income from advertising grew 51.6% between 2016 and 2018, to $136.8 billion. This exponential growth in advertising revenue has motivated fraudsters to exploit the weaknesses of the online advertising model to make money, and researchers to discover new security vulnerabilities in the model, to propose countermeasures, and to forecast future trends in research. Motivated by these considerations, this paper presents a comprehensive review of the security threats to online advertising systems. We begin by introducing the motivation for online advertising systems, explain how they differ from traditional advertising networks, introduce terminology, and define the current online advertising architecture. We then devise a comprehensive taxonomy of attacks on online advertising to raise awareness among researchers about the vulnerabilities of the online advertising ecosystem. We discuss the limitations and effectiveness of the countermeasures that have been developed to secure entities in the advertising ecosystem against these attacks. To complete our work, we identify some open issues and outline some possible directions for future research towards improving security methods for online advertising systems.

I Introduction

Over the past few years, the widespread adoption of the Internet has led to the emergence of a new form of online business, namely online advertising. Online advertising now provides a significant financial pillar of the Internet ecosystem [vratonjic2008securing, crussell2014madfraud, song2013multi, gill2013best].

Many companies such as Google and Microsoft have increased their investment in online advertising to improve their revenue and sales. According to the report in [Forbes], Google’s income from advertising grew 51.6% between 2016 and 2018, to $136.8 billion. It is expected that this revenue will reach nearly $203.4 billion by 2020 and will continue to increase over time.

Online advertising uses the same mechanisms that are applied to manage other “traditional” advertising channels, such as newspapers, radio or TV, but is much more creative in providing targeted and personalized advertisements [yurovskiy2015pros, haddadi2011targeted]. Thanks to the rise of the Internet and online advertising, sales of TV and radio advertisements have stagnated, and those of newspaper advertisements have dropped. Fig. 1 shows a comparison of global ad spending by medium [Recode].

Online advertising provides profit for all the components of the system, such as publishers, advertisers, and advertising networks (or ad networks). Given the high profits involved, the online advertising system is an obvious target for fraud. Hence, several attacks on the current online advertising market have been identified that target various entities in the market, such as hacking [mladenow2015online], click fraud [linden2012method], malvertising [poornachandran2017demalvertising], adware [haddadi2009not], and inflight modification of advertising (ad) traffic [vratonjic2011online].

The value of the global online advertising market is expected to reach $225 billion in 2020, so it is not surprising that fraudsters are attempting to steal a piece of the pie [PYMNTS]. The study in [Yahoo] argues that the total ad spend lost to fraud exceeded $23.7 billion in 2019. The level of fraud is expected to reach $32 billion by 2022.

The inherent complexity and lack of transparency of the online advertising ecosystem give rise to higher risks, and an adversary can easily exploit these aspects to engage in fraudulent activities and launch an attack on the system. Ad fraud can occur in various forms and may involve fooling different components of the online advertising ecosystem to make money. For instance, dishonest publishers may deceive advertisers into paying an extra fee, or hackers could hijack an advertising slot to gain revenue for themselves.

In view of the factors described above, the success and popularity of the online advertising ecosystem depend primarily on the level of security that it can provide against such malicious threats. These considerations motivate the current work, which studies security issues in the online advertising market and the essential related techniques.

Figure 1: Global ad spending by medium.

I-A Contributions

This article presents a survey that primarily targets the security issues and challenges of online advertising systems and reviews the related fundamental concepts. From a security perspective, it presents a comprehensive taxonomy of well-known ad fraud. It also categorizes several security mechanisms that have been proposed in recent years to cope with and mitigate the existing security challenges in the online advertising industry. In particular, our classification focuses on the goals of attacks, the revenue model, and the primary component targets.

Numerous existing works have discussed general aspects of online advertising systems. Most of the early works focused on issues relating to the economic aspects of advertising [edelman2007internet, evans2008economics, tucker2012economics, chen2014economic, evans2009online], challenges in online advertising [bostanshirin2014online], theoretical or analytical assessments of sponsored searches [aggarwal2008sponsored, goldfarb2008search, varian2007position], and especially analyses of privacy threats and protection mechanisms [estrada2017online, chen2016depth, sprankel2011online]. However, none of the existing works address security issues with an emphasis on online ad fraud in this area. There is therefore a need for a concise survey to provide a reader who is planning to undertake research in this field with a classification of online ad fraud, along with an exhaustive review of the corresponding countermeasures. In brief, the essential contributions of the survey are as follows:

  • First, some essential background knowledge is presented, including the differences between traditional and current online advertising systems, the terminology used, and the existing architecture of online advertising. The goal is to enable new readers to gain the required familiarity with online advertising systems and their underlying technologies, such as revenue models and the payment of commissions.

  • We present a detailed taxonomy of the current security threats to online advertising. We investigate several possibilities, including both theoretical and practical vulnerabilities, that fraudsters can use to launch an attack on the online advertising industry. In addition, we present a detailed discussion of the goals of these attacks, the revenue model, and the primary component targets.

  • We review several cutting-edge solutions that address security threats to online advertising systems, and explain the advantages and disadvantages of each solution.

  • Finally, we identify a number of open challenges and future research directions in the field of online advertising, with particular attention to the security aspects.

To the best of our knowledge, there are no existing surveys that have reviewed and summarized the existing security vulnerabilities and outlined future research directions in the realm of online advertising systems. Motivated by this consideration, the main goals of this study are threefold: to help the reader to understand the scope and consequences of the security threats and challenges in the domain of online advertising systems; to estimate the potential damage associated with these threats; and to highlight paths that are likely to lead to the detection and containment of these threats. From a practical perspective, our research aims to raise awareness in the online advertising research community of the urgent need to prevent various attacks from disrupting the healthy online advertising market.

I-B Roadmap

The remainder of this article is structured as follows. In Section II, we explain the differences between the online advertising system and traditional advertising networks, introduce the terminology used, and describe its architecture. Section III presents our proposed taxonomy of attacks on the online advertising system. We also discuss the goal of these attacks, the revenue model, and the primary targets. In Section IV, we categorize and discuss various security solutions identified in the literature and present a preliminary overview of the advantages and disadvantages of the use of these solutions in online advertising systems. In Section V, we highlight several open challenges for future research in online advertising systems. We conclude our work in Section VI.

II Background

Today, when users visit any website using a PC or mobile device, they are often presented with advertising content. Advertising helps publishers to disseminate their materials and grab the audience’s attention. Publishers can also sell space on their sites to gain income from advertising or for other purposes.

We begin this section with a brief introduction to the current online advertising system, and then compare it with the traditional approach in Section II-A. Section II-B then explains some of the most widely used terminology associated with the online advertising ecosystem, including terms used in the remainder of the present article. To gain insight into how online advertising networks operate, we discuss the current architecture in Section II-C. Next, different methods of targeted advertising and the most common types of ad campaign (revenue model) are described in Sections II-D and II-E, respectively. Finally, we explore the way in which advertisers pay commission fees to commissioners and publishers in Section II-F, and review the types of adversaries who are likely to launch attacks on online advertising systems in Section II-G.

For ease of reading, Table I lists all the abbreviations used in this paper.

Abbreviation  Description
Ad            Advertising
Ad Network    Advertising Network
HTTP          Hypertext Transfer Protocol
CPC           Cost per Click
CPM           Cost per Mille (cost per thousand impressions)
CPA           Cost per Action
MITM          Man-In-The-Middle
DNS           Domain Name System
Adware        Advertising Software
CTR           Click-Through Rate
TTP           Trusted Third Party
CGI           Common Gateway Interface
GBF           Group Bloom Filter
TBF           Timing Bloom Filter
ISP           Internet Service Provider
SLEUTH        Single-publisher attack dEtection Using correlaTion Hunting
NMF           Non-negative Matrix Factorization
CAPTCHA       Completely Automated Public Turing test to tell Computers and Humans Apart
IoT           Internet of Things
AI            Artificial Intelligence
AR            Augmented Reality
5G            5th Generation of Mobile Internet
ML            Machine Learning
DLT           Distributed Ledger Technology
API           Application Programming Interface
Table I: List of abbreviations and corresponding descriptions.

II-A What is Online Advertising?

Not surprisingly, advertising techniques have evolved over time with the growth of the Internet, and online advertising has become one of the biggest and most profitable Internet businesses. The main idea behind online advertising is to provide an advertiser with a cost-effective, easy, fast, and flexible way to promote and sell their products through the Internet to suitable customers. There are several significant differences between current online advertising and traditional advertising (e.g., via television, radio, and newspapers). For example, traditional advertising uses massive broadcast advertisements without considering the user’s interests; in contrast, online advertising can deliver advertisements to targeted users based on their interests and browsing behavior, regardless of geographical barriers.

II-B Terminology

In this section, we define some essential terminology related to online advertising ecosystems, as used throughout this paper.

  • An advertiser is a party who is willing to show a product, service, or event to the user via advertisements, in order to promote sales or attendance. Advertisers typically pay (or buy traffic from) an advertising network (ad network) to display their advertisements in the advertising space on publishers’ websites or phone applications. The publisher also receives a percentage of this fee.

  • A publisher is an entity that receives money (via selling traffic) from advertisers by displaying their advertisements to users through its web pages.

  • A user is an individual who visits a publisher’s web pages.

  • An advertising network (such as Google, Yahoo, Google AdSense, Media.net, or PulsePoint), also known as a commissioner, is part of an ad exchange. It acts as a broker between the advertiser and the publisher to manage the interaction between them [blizard2012click], and is responsible for finding suitable spaces to present advertisements on publishers’ websites for advertisers. It may also buy or sell ad traffic (as ad requests), either internally or together with other ad networks.

  • An ad exchange (such as DoubleClick [doubleclick], AdECN [AdECN], or OpenX [OpenX]) is a graph of the ad networks that allows the advertiser and publisher to serve advertisements more effectively within an advertising space.

  • Ad servers are a type of web server (or platform) that is used to host the content of an online advertisement and distribute this content on digital platforms such as Facebook, Quora, Twitter, etc.

  • An advertising request (ad request) is a query, in the form of Hypertext Transfer Protocol (HTTP) traffic, that is triggered by a web user’s impressions or clicks, and calls an ad server to display an ad to the user.

  • Sections (also known as zones or regions) are components in the form of a block of space on a web page that can help a publisher to load advertisements dynamically (by placing these sections inside their web pages) rather than exposing the same content using static advertisements.

  • Creative content is associated with the actual advertising message (e.g., an anchor tag, an Adobe Flash animation, text, or images) in the ad slot displayed to the user. The process of linking an ad message to an advertiser’s website is called click-through [stone2011understanding].

  • An ad server counts a click event whenever a user clicks on an ad.

  • An ad server counts an impression event whenever the content or ad page is loaded for the user. Clicks and impressions generate two different events, which are handled separately in the online advertising system.

  • An auction is a competitive process that runs within the ad exchange. It is designed to allow each advertiser to bid for advertisement space, where the highest bidder is permitted to place an advertisement in the slot. An auction aims to generate more profit for publishers. In general, the time taken to complete the entire process is on the order of 100 ms.

  • After an auction, ad networks may perform arbitrage to increase their revenue. To initiate arbitrage, the ad network must run a new and independent auction by buying and reselling traffic from the publisher.

  • An ad campaign is a method that emerged to help advertisers to decide how much to pay when their advertisements are displayed. We discuss the most common forms of ad campaigns in Section II-E.

  • A banner is a space on a page that displays a message from the advertiser.

II-C Architecture of Online Advertising

In this subsection, we briefly review the infrastructure of an online advertising system and describe how such a network typically works.

Fig. 2 shows an architecture for an online advertising system. This scheme relies on the integration of four main components: an advertiser, a publisher, an ad exchange (e.g., multiple ad networks), and a user.

The process of ad serving in an online advertising system is illustrated in Fig. 3. The process is initiated when a user request calls for an advertisement to be served by the publisher (steps 1-2). Following this, the publisher asks the ad exchange to fetch the ad that best matches the user’s profile and has the best price (step 3). The ad exchange starts an auction between multiple advertisers to determine which can make the most profit for the publisher and consequently for the whole network [niu2017era] (step 4). Finally, the highest-paying advertiser (the winning bidder) wins the auction (step 5), and its ad is served and displayed to the user (steps 6-7).
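The auction and serving steps (steps 4-7) can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: the `Bid` record, its field names, and the tie-breaking rule are hypothetical and are not taken from any real ad exchange; the only property taken from the description above is that the highest bidder wins the slot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bid:
    advertiser: str  # hypothetical advertiser identifier
    amount: float    # price offered for the ad slot
    ad_url: str      # creative to serve if this bid wins


def run_auction(bids: list[Bid]) -> Optional[Bid]:
    """Return the highest bid, or None if nobody bid (steps 4-5)."""
    if not bids:
        return None
    # max() returns the first of several equal bids, so ties go to the
    # earliest bidder; this tie-breaking rule is an assumption.
    return max(bids, key=lambda b: b.amount)


def serve_ad(bids: list[Bid]) -> Optional[str]:
    """Return the winning bidder's creative for display (steps 6-7)."""
    winner = run_auction(bids)
    return winner.ad_url if winner else None
```

For example, given bids of 1.2 and 2.5 for the same slot, `run_auction` returns the 2.5 bid and `serve_ad` returns its creative; how much the winner actually pays is deliberately left out of the sketch.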

Figure 2: Architecture of current online advertising systems.
Figure 3: The process of serving advertisements in an online advertising system.

II-D Targeted Advertising

The most obvious difference between online advertising and a traditional approach is that the former displays advertisements to the customer based on their interests, while in the latter, advertisements are massively broadcast without considering the customer’s interests. Ad networks use ad targeting methods to increase their income, and in this way can display advertisements based on the user’s preferences. The three most popular types of ad targeting can be categorized into contextual, behavioral, and location-based approaches.

In the contextual approach, advertisers display relevant advertisements by focusing solely on the content of the web page being viewed by the user [zhang2012contextual]. A behavioral targeting strategy allows advertisers and publishers to utilize information from the user’s browsing history (e.g., by monitoring the behavior of the user on the Internet) to customize the types of advertisements they are served. Whenever an individual visits a website, all of the relevant information, including the pages visited, the time spent on each page, the links clicked, and the elements interacted with, is stored in a profile linked to that visitor [koran2013behavioral]. Based on the data in these profiles, publishers can show visitors advertisements that match their habits. In location-based targeting, location-specific advertisements are delivered to potential users; this technique is particularly useful for mobile advertising [chatwin2013overview].
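As a rough illustration of the contextual approach, the sketch below picks the ad whose keywords overlap most with the words on the page being viewed. The keyword-overlap score, the function names, and the ad catalogue format are all assumptions made for illustration; production systems use far richer relevance models.

```python
from typing import Optional

PUNCTUATION = ".,;:!?"


def tokenize(text: str) -> set[str]:
    """Lower-case the page text and split it into a set of words."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return {w for w in words if w}


def contextual_match(page_text: str, ads: dict[str, set[str]]) -> Optional[str]:
    """Return the id of the ad sharing the most keywords with the page.

    `ads` maps a hypothetical ad id to its (lower-case) keyword set.
    Returns None when no ad shares any keyword with the page.
    """
    page_words = tokenize(page_text)
    best_ad, best_score = None, 0
    for ad_id, keywords in ads.items():
        score = len(page_words & keywords)
        if score > best_score:
            best_ad, best_score = ad_id, score
    return best_ad
```

A behavioral variant could reuse the same scoring function, replacing the page text with words drawn from the visitor's stored profile.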

II-E Revenue Models

In this subsection, we discuss how entities in the online advertising network generate revenue.

Typically, publishers agree to display an advertiser’s advertisements and share the keywords used by the advertiser in their website, charging a commission fee for the action(s) generated by the user. This agreement includes a contract made by a broker (also called an Internet advertising commissioner) between publishers and advertisers. The commissioner also controls the advertisers’ budget, to avoid over-spending [metwally2006hide]. As soon as the advertiser pays the publisher the commission fee, the publisher displays the links determined by the advertiser on its website [mittal2006detecting].

The general models [stone2011understanding, chatwin2013overview] used by publishers to make money through advertising are determined based on the numbers of impressions, clicks, and actions. We explain each of these types of revenue model in detail in the following sub-sections.

II-E1 Cost per Impression

This model is favored by publishers and was developed based on traditional advertising systems. A metric called Cost Per Mille (CPM), i.e., the cost per thousand impressions, is often used to measure the cost per impression: the advertiser’s payment to the ad network is calculated based on the cost of 1,000 views of an ad.

To give a better understanding of how commissioners process incoming impression traffic, Fig. 4(a) illustrates this process. As shown in the figure, advertisements are loaded along with the page requested in the user’s browser. The steps are as follows: (1) a user requests a website; (2) in response, the publisher displays the requested website in the user’s web browser; and (3) the user’s browser redirects to the commissioner’s web server (the commissioner does not repeat advertisements, since it stores the advertisements recently shown to the user in browser cookies). In steps 4 and 5, the commissioner allows the user’s browser to redirect to the advertiser’s server. In step 6, the commissioner loads the advertisement into the user’s browser.

II-E2 Cost per Click

In the Cost Per Click (CPC) model, the advertiser pays the publisher based on how many times a viewer clicks the ad on the publisher’s web page. Many search engines, including Yahoo, Microsoft, and Google, prefer to use the CPC model. This is because a user clicking on an ad is a strong signal of interest; as such, CPC is expected to offer a better return on investment than CPM, where advertisers pay for their advertisements to be shown without counting on any implicit feedback from users.

The click traffic model is very similar to the impression traffic model. Fig. 4(b) illustrates this scheme. As shown in the figure, when a user clicks on a hyperlink on the publisher’s site, the user is redirected to the commissioner’s server. The server then logs the click for accounting purposes. After that, the commissioner’s server redirects the user’s web browser to the advertiser’s web page.
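The log-then-redirect behavior of the commissioner's server can be sketched as follows. This is a minimal in-memory model, not a web server: the `Commissioner` class, its field names, and the example URL are hypothetical, and a real deployment would return an HTTP redirect response and persist the log.

```python
from dataclasses import dataclass, field


@dataclass
class Commissioner:
    # Maps a hypothetical ad identifier to the advertiser's landing page.
    landing_pages: dict[str, str]
    # Each entry is (ad_id, publisher_id, user_ip), kept for accounting.
    click_log: list[tuple[str, str, str]] = field(default_factory=list)

    def handle_click(self, ad_id: str, publisher_id: str, user_ip: str) -> str:
        """Log the click, then return the URL the browser is redirected to."""
        target = self.landing_pages[ad_id]  # raises KeyError for unknown ads
        self.click_log.append((ad_id, publisher_id, user_ip))
        return target

    def clicks_for(self, publisher_id: str) -> int:
        """Number of logged clicks that originated from one publisher."""
        return sum(1 for _, pub, _ in self.click_log if pub == publisher_id)
```

Because every click passes through the commissioner before reaching the advertiser, the commissioner's log is the accounting record on which CPC payments are based, which is also why forged or inflated clicks translate directly into money.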

(a) Steps in the impression traffic model
(b) Steps in the click traffic model
Figure 4: Traffic models used in online advertising systems.

II-E3 Cost per Action

In general, the CPC charging model is considered to be a specific case of the Cost Per Action (CPA) model, in which the publisher is paid whenever a user-generated click leads to a predefined action being performed, e.g., filling in a form on the page, signing up, registering, or downloading an item corresponding to the ad. Advertisers prefer to deploy this type of cost model since they only pay the publisher for specific actions. Although this approach has advantages for the advertiser, it also has some drawbacks. It is challenging to implement, especially in the case of complex actions, and publishers are less interested in applying this model, since dishonest advertisers may under-report the number of actions to pay a lower commission fee (see Section III-B2).
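The three revenue models differ only in which event is billed. The sketch below computes the advertiser's charge under each model; the function names and the rates used in the example are illustrative assumptions, not figures from the literature.

```python
def cpm_charge(impressions: int, rate_per_thousand: float) -> float:
    """CPM: the advertiser pays a fixed rate for every 1,000 impressions."""
    return impressions / 1000 * rate_per_thousand


def cpc_charge(clicks: int, rate_per_click: float) -> float:
    """CPC: the advertiser pays for each click on the ad."""
    return clicks * rate_per_click


def cpa_charge(actions: int, rate_per_action: float) -> float:
    """CPA: the advertiser pays only for completed actions (e.g., sign-ups)."""
    return actions * rate_per_action


# Hypothetical campaign: 100,000 impressions, 500 clicks, 20 sign-ups.
cpm_total = cpm_charge(100_000, 2.0)   # rate: 2.0 per 1,000 impressions
cpc_total = cpc_charge(500, 0.5)       # rate: 0.5 per click
cpa_total = cpa_charge(20, 10.0)       # rate: 10.0 per action
```

The counted quantity is also the quantity an adversary has an incentive to manipulate: inflating impressions or clicks raises a CPM or CPC charge, while under-reporting actions lowers a CPA charge.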

II-F Payment of Commissions in Online Advertising Systems

In this subsection, we briefly explain how advertisers pay commission fees to commissioners and publishers.

When advertisers receive valid traffic generated from impressions or clicks, they have to pay the publisher. The commissioner also earns a fraction of this income. If the advertiser uses a similar scheme to pay the publisher, then the commissioner’s percentage will be calculated at a fixed rate. For example, in the case where an advertiser pays a publisher per click (or impression), and the publisher receives the money based on the number of clicks (or impressions), then the commissioner receives a fixed payment.

However, an advertiser may pay based on the number of sales, while the publisher earns per click (or impression). This practice is known as an arbitrage campaign [metwally2006hide]. In formal terms, an arbitrage campaign is one where the advertiser uses different payment metrics to pay the commissioner and publisher. In an arbitrage campaign, the commissioner should ensure that its share of the profit from the advertiser is more than the publisher’s payment; otherwise, the commissioner loses money. In reality, advertisers prefer to pay based on sales, while publishers prefer to receive income according to the number of impressions or clicks. Hence, Internet advertising schemes are mainly arbitrage campaigns. However, some advertisers may prefer to pay on the basis of clicks or impressions for product branding.
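The commissioner's risk in an arbitrage campaign can be made concrete with a small calculation. The sketch assumes the simplest case described above, in which the advertiser pays per sale and the publisher is paid per click; the function name and the numbers in the usage note are hypothetical.

```python
def commissioner_margin(
    sales: int,
    pay_per_sale: float,
    clicks: int,
    pay_per_click: float,
) -> float:
    """Commissioner's profit (or loss, if negative) in an arbitrage campaign.

    Income comes from the advertiser, who pays per sale; the payout goes
    to the publisher, who is paid per click.
    """
    income_from_advertiser = sales * pay_per_sale
    payout_to_publisher = clicks * pay_per_click
    return income_from_advertiser - payout_to_publisher
```

With 1,000 clicks paid at 0.25 each, ten sales at 30.0 leave the commissioner a margin of 50.0, whereas five sales leave a loss of 100.0. Clicks that never convert into sales therefore cost the commissioner directly.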

II-G Adversary

We can classify adversaries who attack an online advertising system into two types: selfish and malicious adversaries. A selfish adversary is one who exploits the vulnerabilities in the online advertising system to divert a portion of the ad income for him- or herself, while the aim of the malicious type is to launch an attack with fraudulent or malicious intent (e.g., hurting a competitor, executing or spreading malware). As all of the entities in the online advertising ecosystem benefit from the delivery of advertisements to customers, they may all play the role of the adversary to increase their profits. There are several different ways that these attacks can be launched. Depending on the amount of resources available to the adversary, these may range from a simple attack using a single machine to an enormous number of machines performing an automated attack.

One example of the standard tools that are currently used to launch distributed attacks and perpetrate ad fraud is a botnet. To build a botnet, a botmaster (an entity who controls the botnet remotely) needs a network of software robots, i.e., bots, that run independently. A bot turns a compromised machine into a member of the botnet. The botnet can then be used to perform denial-of-service attacks, send spam, or steal sensitive personal information. An adversary who wants to carry out fraud via a botnet can either create a custom botnet or rent one from an existing botmaster.

III Security: Taxonomy of Attacks on Online Advertising Systems

As described above, web browsing and online advertising systems are still dependent on HTTP, with no guarantee of the authenticity and integrity of web content. Consequently, adversaries can exploit the lack of a secure protocol to engage in fraudulent activities and perform attacks to increase their profits. In view of the large amounts of money at stake, it is unsurprising that all components involved in online advertising systems are concerned about the security of these systems. Since this type of system is becoming a notable revenue source for many online businesses, attacks can threaten the business model of the participating stakeholders, and there are significant concerns over the future of the Internet.

Online advertising systems are vulnerable to various types of attacks, and in this section, we present a taxonomy of current attack methods. We classify online advertising attack methods into five main categories: hacking, ad fraud, malvertising, inflight modification of ad traffic, and adware. Fig. 5 illustrates the proposed taxonomy.

The rest of this section explains, for each type of attack, how an adversary can exploit the risks in online advertising systems and conduct ad fraud. We review the methods that fraudsters use to gain money from online advertising systems, present a detailed discussion of the goals of these attacks, and identify which revenue model the adversary targets and which components of the online business system could be the primary targets. To aid comprehension, the results of this comparison are presented in Table II.

Figure 5: Proposed taxonomy of ad fraud attacks in online advertising systems.
Figure 6: Hacking attack in online advertising network.

III-A Hacking

The threat of hacking in online advertising arises due to unauthorized access to campaign accounts [mladenow2015online]. One of the most effective ways for companies to scale their businesses, make more money, and find new customers is to use online advertising. In search engine advertising, companies aim to attract customers by improving the visibility of their advertisements in results pages. The primary factors in the success of an online advertising business include flexibility, cost savings, time, and quality. Many companies utilize online advertising for these reasons [goldfarb2011search, langville2011google, xiang2011travel]. Advertisers can quickly adapt the information in their online campaigns, which are more flexible, targeted, and tailored than traditional marketing campaigns. The flexibility and time savings of online campaigns mean that transactions are processed quickly. An example of this is AdWords [AdWords], a tool developed by Google to allow advertisers to create online campaigns in only a few minutes. However, despite all the above advantages to online business, online campaigns face many challenges, including security and privacy.

We illustrate this with an example. Consider the case where an advertiser creates an AdWords account. Users navigate via the web to run search queries, and advertisements can be presented on the websites of publishers or on the search engine network. If an adversary takes control of the advertiser’s AdWords account to launch an attack, this is known as hacking. The consequences of campaign accounts being hacked include blocking, limited access or unauthorized entry to the account of the advertiser. The availability of short-term online campaigns will also be limited. These results may lead to significant reputational damage, loss of money, and violations of user privacy. Fig. 6 illustrates the hacking of an advertiser’s AdWords account.

Attack | Description | Attack Goal | Revenue Model Goal (CPC / CPM / CPA) | Primary Component Targets (Advertiser / Publisher / User / Ad Network)
Hacking [mladenow2015online] | Unauthorized access to campaign accounts | The hacker aims to take control of the advertiser’s account | |
Crowd Fraud [tian2015crowd] | Malicious behaviors by humans against competitors for specific targets | Increase fraudulent traffic | |
Badvertising [gandhi2006badvertisements] | Utilizing malicious JavaScript code to publish invisible automatic advertisements in the user’s browser | Increase the number of clicks | |
Hit Shaving [reiter1998detecting] | Dishonest advertisers claim that they received less traffic than in reality | Dishonest advertisers avoid paying the publisher commission on some of the received traffic | |
Hit Inflation [metwally2007detectives] | Artificial inflation of the actual amount of traffic | Economic advantage from over-counting the numbers of transactions | |
Malvertising [vratonjic2011online] | Perpetrators inject malicious code into legitimate online advertising networks to spread malware | The malicious code eventually attempts to redirect users to malicious websites | |
Inflight Modification of Ad Traffic [vratonjic2011online] | Infecting the system to show altered search results along with modified advertisements to the users | Generate revenue fraudulently for ad networks and publishers | |
Adware [vratonjic2011online] | Advertising software that displays advertisements without the users’ permission in order to generate revenue | Generate revenue based on the collected marketing information or by displaying advertisements | |
Table II: Summary of attacks, description, attack goal, revenue model goal and primary component targets in online advertising system.

III-B Click Fraud

Online advertisements help to develop a healthy Internet, since they provide financial support for online businesses. The emergence of click fraud (also known as malicious clicks, or click spam [li2014search]) therefore poses a serious security risk to the Internet ecosystem. Click fraud refers to cybercrime activity that is carried out either manually (using human clickers) or automatically (software-supported) to generate fraudulent clicks on advertisements in order to make illegal profits.

Fraudulent clicks can damage the health of online businesses, since these clickers can increase their profits or deplete the advertising budgets of their competitors. They achieve this by clicking on advertisements with no actual interest in the content.

In the manual approach, fraud consists of hiring a group of people to increase fraudulent traffic, while automatic click fraud is usually based on the use of botnets [rodriguez2013survey]. Malicious software called a “clickbot” [daswani2007anatomy] is one example of this use of botnets to generate fraudulent clicks automatically [kantardzic2008improving]. Using a clickbot to launch a click fraud attack is more efficient than the manual type of attack, since it can spread its automatic clicks over a period of several minutes to avoid detection. We can categorize click fraud into two types, crowd fraud and conventional ad fraud, as described in the following subsections.

III-B1 Crowd Fraud

The emergence of crowdsourcing [kamar2012combining] has led to a novel form of fraud in online advertising, since it can broadcast a large number of tasks to numerous online workers. Due to the openness of crowdsourcing systems [choi2016detecting, howe2006rise, doan2011crowdsourcing], a crowd of workers can easily be recruited via malicious crowdsourcing platforms to perform an attack against a competitor or to increase their advertising expenses. There are many differences between automatic fraudulent behaviors (conventional fraud) and frauds carried out by humans. For example, a vast number of workers recruited via crowdsourcing platforms can be involved in human-generated fraud, while automatic fraudulent traffic can be deployed from relatively few machines. A further difficulty is that the traffic induced by real humans looks normal and has no distinctive pattern, unlike the noisy traffic generated by machines. Methods used to detect conventional fraud therefore fail to identify these human-generated frauds. The phenomenon of exploiting a group of real humans to increase fraudulent traffic in online advertising is termed crowd fraud [tian2015crowd].

III-B2 Conventional Ad Fraud

In contrast to crowd fraud, which is characterized by large numbers of attacking machines, normal-looking and indistinct click behavior by each web worker, and a limited amount of fake traffic generated by each worker, conventional forms of advertising fraud often have specific features in terms of individual behavior patterns, with few sources and large amounts of traffic. In this regard, the detection of conventional fraud is more straightforward than the detection of crowd fraud [tian2015crowd].

We divided the conventional advertising frauds shown in Fig. 5 into three categories: badvertising, hit shaving, and hit inflation. A brief overview of how these attacks are carried out on online advertising ecosystems is given below.

  • Badvertising. Gandhi et al. defined badvertisement as a kind of camouflaged click fraud attack on the advertising industry [gandhi2006badvertisements] that silently and automatically generates click-throughs on an advertisement when users visit the website. This attack not only can remain undetected by web publishers, but also does not compromise the user’s machine. Unlike a traditional malware-based click fraud attack [miller2011s], badvertisement is a stealthy offense in the form of a malicious mutation of spam and phishing [jagatic2007social] attacks, except that this attack targets the unaware advertiser as the victim rather than an individual. This is very worrying, since it is easier for an attacker to deceive an individual into visiting a web page than to damage a machine with malware.

    This attack artificially and stealthily increases the number of clicks on ad banners hosted by the fraudster or unaware associates to generate more revenue for the attacker through advertising. The revenue generated in this way is transferred from the advertiser to the hosting websites by the fraudster.

    Badvertisement has two main components: delivery, which either transfers consumers to corrupt data or corrupt data to consumers; and execution, which automatically and invisibly displays advertisements to a targeted user. This stealth attack can be accomplished by corrupting the JavaScript code that is downloaded and executed by the client’s browser to publish sponsored advertisements [zhang2008detecting]. Online advertisement systems typically work by placing a JavaScript snippet file into a publisher’s web page. Whenever a user visits this page and downloads an advertisement from the ad server, the JavaScript file will be executed. Downloading the ad causes the frame in the JavaScript file to be rewritten with the HTML code required to show the advertisement. The publisher relies on the click-through payment process to count the number of times the user clicks on the link to the ad provider’s server. Finally, the user is referred to the ad client’s website. This scenario is illustrated in Fig. 7.

    Figure 7: Typical online advertisement services.

    Badvertisements run extra malicious scripts to automatically deploy clicks. In a nutshell, after running the script code and rewriting the frame, the malicious script parses the HTML code and collects all links. It then changes the web page to embed an HTML iframe. If the user decides to click the link, the iframe is activated in the background and loads its content to exploit the user (Fig. 8).

    Figure 8: Auto-clicking in a hidden badvertisement.
  • Hit Shaving. Advertisers often prefer the CPA model for online advertising since they pay the publisher based on the desired user action, rather than for each click on their ad. However, the CPA model is vulnerable to hit shaving (also called deflation fraud [ding2010hybrid]). In this attack, a fraudulent advertiser undercounts the real transactions to pay a lower commission fee. Inflation fraud is the converse of deflation fraud: an entity fraudulently over-counts or over-reports transactions to gain more revenue.

    Before describing how the hit shaving attack is applied in an advertising network, we need to give an overview of the mechanisms used in click-through payment programs.

    Electronic commerce has grown only gradually, while the Internet has rapidly become recognized as an effective advertising medium. Hence, advertising has become a pivotal technology on the Internet, as confirmed by the growth of click-through payments. The main entities involved in click-through payment programs are the user who views the page and clicks on a link, the referrer who exposes advertising material to the user, and the target site running the click-through payment process.

    A click-through payment system works as follows. Suppose that there are two websites A and B, and that A can refer the user to B. Whenever B receives a referral from A, B has to pay the webmaster of A (the person controlling the content served to the user) for this reference. In more detail, when a user views web page A and clicks on a link that refers the user to web page B, then A should receive money from B. In other words, the user has “clicked-through” A to reach B. The use of a click-through payment program by the webmaster of B leads to an increase in traffic to the website, since other websites display links to B. However, since this scheme is built on the HTTP protocol, it is exposed to attack.

    For a better understanding of how this mechanism is vulnerable to fraud, we review the procedure used to exchange HTTP messages (see Fig. 9) during a click-through event. As illustrated in Fig. 9, when users view a web page from site A (called the referrer), the HTTP procedure is executed. Site A includes a link to site B (called the target), and agrees to take part in the click-through payment program of site B. The customer’s browser sends a request to load the page from site B when the link is clicked. Site B can identify the site from which the requested web page originated (i.e., where the user is being referred from) simply by checking the referrer field in the HTTP header.

    The previous explanation should reveal that the click-through payment system has the potential to be exploited for fraud. The problem arises from the lack of communication between A and B after the user clicks on the link. A cannot verify how many times its web page has referred users to the targeted page, and as a consequence, B is able to omit some of the click-through events from the referrer, in a scheme called hit shaving. In addition, although the referrer site can detect that the target site has shaved its referrals, it cannot provide proof of this to a third party. A can also conduct fraud against B by generating false requests in order to increase the payment from B, and this is called hit inflation. In brief, hit shaving is a form of fraud by a dishonest advertiser who can undetectably change the number of clicks received from a publisher in order to pay a lower commission fee [metwally2007hit, metwally2006hide].

    Figure 9: Workflow for a click-through system. Step 1: user retrieves PageA.html from site A (referrer site). Step 2: user clicks on a link in site A and requests the page from site B (target site). Step 3: PageB.html on site B is loaded for the user.
  • Hit Inflation. This is a fraudulent activity performed by an adversary to inflate the hit count, in order to boost revenue or hurt competitors.

    In [anupam1999security], a sophisticated type of hit inflation attack is defined that is very hard to detect. Fig. 10 illustrates this attack scenario, which involves an association between a fraudulent website (W) and a fraudulent publisher (P), where W uses a script code to silently divert a user to P. The scenario starts when a user issues a request or click to fetch page W.html from W (step 1). However, the user is redirected to page P.html (step 2). P has two forms of the web page: a manipulated form and a valid form. P will show the manipulated web page to the user when the referrer field in the HTTP request shows W (step 3), and this page clicks the ad by itself without the user’s knowledge. Otherwise, P will direct the user to the valid web page, and the user is free to either click on the ad or not (steps 4 and 5).

    Publishers and advertisers are the two entities in online advertising systems that are the major sources of inflation attacks. The two most common types of hit inflation attack are called publisher click inflation and advertiser competitor clicking. We briefly illustrate both types of attack below.

    Figure 10: Hit inflation attack on online advertising network.

    Publisher click inflation. In publisher click inflation, a dishonest publisher is motivated to artificially inflate the click-through count (without real interest in the content of the advertisement) to obtain more income from ad networks. As discussed earlier, if the advertiser wants to present its advertisements on the publisher’s website, the publisher enters into a contract with the broker (commissioner). The publisher then gains income from advertisers through the user-generated traffic that they send to websites of advertisers. Obviously, the more clicks the publishers earn, the more money they generate. Consequently, this opens the way for malicious publishers to create illegal revenue by increasing the numbers of clicks, impressions, and actions on their websites.

    Publisher click inflation attacks can be classified into two categories: non-coalition and coalition attacks [oger2015privacy]. The former is performed by a single publisher (one fraudster) who solely generates traffic to its own resource(s), while the latter is performed by a group of publishers who share their systems. If we can detect both categories of attack, we can claim that the problem of hit inflation is solved.

    Launching a coalition attack has several benefits for fraudsters. Firstly, the possibility of fraud detection decreases because the attackers do not need to reuse their resources to generate more attacks [kim2011catch], making it difficult for detection algorithms to identify the relationships (e.g., the relationships between the cookie IDs and IP addresses of the resources generating traffic and the sites of fraudsters) between each fraudster and all the attacking machines. Secondly, the cost of launching an attack is reduced by sharing resources rather than increasing the number of physical resources. Fig. 11 illustrates non-coalition and coalition attacks [kim2011dark].

    (a) Non-coalition attack
    (b) Coalition attack
    Figure 11: Non-coalition and coalition attacks. Fig. 11(a): in a non-coalition attack, each attacker creates traffic to its own website; Fig. 11(b): in a coalition attack, each attacker creates traffic to both its own website and those of others in the coalition.

    The study in [metwally2006hide] classifies non-coalition attacks according to the number of IPs and the cookie IDs of the system, and the way in which the commissioners recognize the machines of the surfers (potential Internet customers). When customers visit a website, this traffic has certain fixed characteristics which are different from automatic traffic, and typically involve relationships between IP addresses and cookie IDs. Hence, if fraud detectives find inconsistencies between the cookie IDs and the IP addresses, they can investigate manually by selecting a subgroup of the publishers to detect the attack. On the other hand, when dishonest publishers want to launch the attack, they can leave a false fingerprint for the relationship between the IPs and cookie IDs in order to confuse the detection mechanisms.

    The attack can be launched by one or multiple IPs, and these addresses may be associated with no, one or multiple cookie IDs. There are therefore six possible types of attack based on combinations of IPs and cookie IDs, as follows.

    2. Cookie-Less Attacks. A fraudster can launch cookie-less attacks in at least two known ways. Firstly, the attacker can turn off cookies on the system(s) that will launch the attack. Secondly, a fraudster can employ commercial network anonymization services, which are designed to protect the privacy of users [broder1999data] and which block third-party cookies, producing more cookie-less traffic.

    3. Single Cookie and Single IP Address Attacks. In this type of attack, a dishonest publisher can employ a script to launch an attack from one machine with a fixed IP and one cookie ID. The author in [klein1999defending] provided an example of this type of script.

    4. Single Cookie and Multiple IP Addresses Attacks. Attacks of this type are more widespread among fraudulent advertisers than fraudulent publishers, since changing the IP address of the attacking machines is more convenient than changing the cookie ID. The commissioner shows Internet customers the most profitable advertisements that have not recently been displayed to them. In addition, if the same cookie is repeatedly sent to the commissioner, the same advertisements are displayed to the user. Hence, a dishonest advertiser can start the attack by visiting the publisher’s website and continuing until the broker shows advertisements from its competitors. The fraudster then stores the cookie ID with the intention of continuously reusing it to force the broker to show the advertisements from its competitors. In this way, it can simulate clicks on advertisements in order to drain its competitors’ advertising budgets.

    5. Multiple Cookies and Single IP Address Attacks. An attacker can perform this type of attack in various forms. The simplest method is to connect different systems to the Internet via a single router, and then execute various scripts on the systems. In this way, the attacker can simulate receiving traffic with several cookie IDs but a single IP address. However, this type of attack is not economically viable. The attack also resembles regular Internet traffic, in which different customers connect to the Internet with various cookie IDs using a single IP address through an ISP, which makes the two hard to tell apart.

      In the second form, in order to make the attack more comprehensive and sophisticated, the attacker can connect several machines to the Internet via an ISP with the same IP address. To make the attack less conspicuous and to evade the detection algorithms, a dishonest publisher can combine fraudulent traffic with regular traffic.

    6. Multiple Cookies and Multiple IP Addresses Attacks. Both performing and detecting this class of attack are difficult. The malicious publisher uses various valid cookies and IPs. The attacker can perform this type of attack by using the cookies and IPs in multiple forms. In the simplest form, which is not economically viable, the attacking publisher has access to various machines with different accounts with ISPs. Another method is to use botnets, such as spyware and Trojans. The aim of using a botnet [shaw2003spyware] is to simulate impressions and clicks on the website of the attacker by sending the proper HTTP requests while exploiting the cookies and IPs of legitimate users. The traffic generated in this way is very similar to regular traffic.

      This type of attack can be considered a more sophisticated version of some of the above examples. Suppose that the publisher has access to different legitimate cookies and IPs, and that the IP used with each cookie can be either chosen at random or pre-assigned. Then, whenever a cookie ID is used with its pre-assigned IP, the attack can be considered a more sophisticated version of the multiple cookies/single IP attack that uses multiple IPs. In contrast, when the IP is selected randomly, identical cookies are used with different IPs. In that case, the attack can be considered a more sophisticated version of the single cookie/multiple IPs attack with multiple cookies.

    Advertiser competitor clicking.

    In this attack, malicious advertisers carry out hit inflation attacks against their competitors to drain their advertising budgets. When competitors have a limited daily advertising budget for participating in bidding, exhausting that budget removes them from the auction, so fraudsters increase the probability of winning the auction and having their own advertisements displayed.
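The IP/cookie combinations listed under publisher click inflation can be illustrated with a small classifier over a click log. This is only a sketch and not the method of [metwally2006hide]: the `Click` record, its field names, and the category labels are assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Click:
    """Hypothetical log record: one click received for a publisher's site."""
    ip: str
    cookie_id: Optional[str]  # None when the client sent no cookie


def classify_source_pattern(clicks: Iterable[Click]) -> str:
    """Label a batch of clicks by how many distinct IPs and cookie IDs it uses."""
    clicks = list(clicks)
    if not clicks:
        return "no traffic"
    ips = {c.ip for c in clicks}
    cookies = {c.cookie_id for c in clicks if c.cookie_id is not None}
    if not cookies:
        return "cookie-less"
    cookie_part = "single cookie" if len(cookies) == 1 else "multiple cookies"
    ip_part = "single IP" if len(ips) == 1 else "multiple IPs"
    return f"{cookie_part} / {ip_part}"
```

The label only describes the shape of the traffic. It is not evidence of fraud by itself, since, as noted above, legitimate customers behind one ISP address also appear as multiple cookies on a single IP.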

More generally, the consequences of fraudulent traffic include reducing the reputation of the commissioner and attracting fewer advertisers, and also may lead to extra fees or penalty payments for advertisers [johnston1976cliques, kannan2004clusterings].
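The click-through accounting discussed under hit shaving can be made concrete with a toy tally. The sketch below assumes that the target site B logs the value of the HTTP `Referer` header for each request; the host names, the log contents, and the `keep_percent` parameter are hypothetical. It shows why shaving is invisible to the referrer: only B holds the log from which the true count is derived.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_referrals(referer_values):
    """Count click-throughs per referring host from raw Referer header values."""
    counts = Counter()
    for value in referer_values:
        host = urlsplit(value).hostname  # None for empty or malformed values
        if host:
            counts[host] += 1
    return counts


def shaved_report(true_counts, keep_percent):
    """What a dishonest target might report: only part of the real referrals."""
    return {host: n * keep_percent // 100 for host, n in true_counts.items()}


# Hypothetical request log on target site B (one Referer value per request).
log = ["http://site-a.example/PageA.html"] * 10 + ["", "http://other.example/"]
actual = count_referrals(log)
reported = shaved_report(actual, keep_percent=80)
shortfall = {host: actual[host] - reported[host] for host in actual}
```

Because the `Referer` value is supplied by the client, it can also be missing or forged, which is the same weakness that hit inflation exploits in the opposite direction.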

III-C Malvertising

The primary goal of the online advertising system is to reach users, and users are therefore more exposed to threats in this system than the other entities. We recall that the online advertising system is based on users’ web browsing interests. As users surf the web, their movements within websites enable ad providers to track them (e.g., by using tracking cookies [pishva2013online]) and identify them to deliver targeted advertisements in the future. Many ad companies such as Google or Yahoo monitor how visitors have landed on their website with the help of tracking cookies. The main problem with using tracking cookies is the violation of the individual’s privacy.

When a user navigates the Internet and visits different websites associated with a single advertiser, the same cookies are allocated to the user. In this way, the ad provider can track the user’s online activities by compiling the information from the cookies without the user’s permission or consent. The consequence of this tracking is that the user’s privacy is violated. Moreover, users can be involved in fraud (e.g., click fraud) without realizing. Malvertising (malicious advertisements) is another fast-growing security threat on the web that can infect users [vratonjic2011online].

Malvertisement is a platform for distributing malware by injecting malicious code into legitimate ad networks. This malicious code eventually attempts to redirect users to malicious websites that serve malware [sood2011malvertising].

As previously mentioned, there are several entities involved in an online advertising system, making it a complex network. This complexity and the use of multiple redirections between different components allow attackers to embed malicious content (e.g., malicious advertisements) in places that publishers and ad networks would not expect. For example, a report by Blue Coat [larsen2010exploiting] shows that JavaScript code can be served by an ad server to inject a hidden iframe tag into a benign site instead of fetching legitimate advertisements. In this scheme, the iframe commands the browser of the victim to silently interact with a malware server, allowing a PDF exploit file to be downloaded. Both publishers and advertisers in the online advertising ecosystem have the potential to launch a malvertising threat; for instance, an advertiser can easily inject a malicious ad into a legitimate ad network to trigger malvertising. As a result, the advertising network may deploy those advertisements on publishers’ websites, and users will then access them by clicking. Moreover, publishers can insert malicious content into their sites to indirectly cause a consumer to install malware. In this scheme, users do not even need to click on advertisements to activate malware. One of the most common forms of malvertising is flash-based advertisements [ford2009analyzing], in which an Adobe Flash file (also referred to as an SWF) that contains malicious script is abused by criminals to run arbitrary commands. Creating advertisements with animation and sound in an SWF file allows advertisers to attract a greater audience, and this popularity means that Flash is liable to be used in malicious attacks. It is therefore clear that attackers can spread malicious advertisements via Flash, a practice known as “malvertisement” [ford2009analyzing].
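The hidden iframe described above suggests a simple static check that a publisher could run on served markup. The sketch below is a heuristic written for illustration, not a technique taken from the cited report: the class name, the function name, and the notion of "hidden" (zero or one pixel size, `display:none`, `visibility:hidden`, or the `hidden` attribute) are assumptions.

```python
from html.parser import HTMLParser


class HiddenIframeFinder(HTMLParser):
    """Collect the src of iframes that appear to be invisible (heuristic)."""

    def __init__(self):
        super().__init__()
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        # Attribute values may be None (e.g. a bare `hidden` attribute).
        a = {name: (value or "") for name, value in attrs}
        style = a.get("style", "").replace(" ", "").lower()
        tiny = a.get("width") in ("0", "1") or a.get("height") in ("0", "1")
        invisible = "display:none" in style or "visibility:hidden" in style
        if tiny or invisible or "hidden" in a:
            self.suspicious.append(a.get("src", ""))


def find_hidden_iframes(html):
    """Return the src values of iframes flagged as hidden in an HTML string."""
    finder = HiddenIframeFinder()
    finder.feed(html)
    return finder.suspicious
```

A check of this kind inspects static markup only. An iframe that a script injects at run time, as in the scheme described above, would only be visible after the page has been rendered, so this sketch would need to be applied to the rendered DOM to catch it.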

III-D Inflight Modification of Ad Traffic

In [vratonjic2011online], a new form of ad fraud was presented that involves the inflight modification of advertising traffic (also called a Man-In-The-Middle (MITM) attack). A well-known example of this type of fraud is the Bahama botnet, which allows malware to force compromised machines to show surfers altered advertisements, and to change the results of searches [Botnet]. The key difference between this attack and traditional click fraud is that in the latter case, ad networks can gain income from fraudulent clicks, while inflight modification of ad traffic can allow either traffic or income to be diverted from the ad networks to the attacker’s server.

In the Bahama botnet, compromised systems direct users to a malicious site that looks identical to real Google search results. In this case, the attacker leads the user traffic to another site of the attacker’s choosing, such as a fake website, by corrupting the translation of the Domain Name System (DNS) on the infected systems. For example, when a compromised user clicks on advertisements on Yahoo or Google, they are silently redirected to a server that is under the attacker’s control. Consequently, the domain name/hostname “Google.com” (or Yahoo.com) translates to an IP address that belongs to the attacker and not to Google (or Yahoo).
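The DNS corruption described above implies that a name resolves to an address outside the networks of its legitimate owner. A minimal sketch of that comparison follows; it assumes a trusted list of expected networks obtained out of band, which the source does not discuss, and it uses reserved documentation ranges as placeholders rather than real provider ranges.

```python
import ipaddress


def resolves_inside(resolved_ips, expected_networks):
    """Return True if every resolved address lies in one of the expected networks."""
    nets = [ipaddress.ip_network(n) for n in expected_networks]
    addrs = [ipaddress.ip_address(ip) for ip in resolved_ips]
    # An empty answer is treated as a failure rather than a pass.
    return bool(addrs) and all(any(a in n for n in nets) for a in addrs)


# Placeholder networks (documentation ranges), standing in for the real owner.
EXPECTED = ["192.0.2.0/24", "2001:db8::/32"]
```

On a live system, the addresses passed in could come from the local resolver, for example via `socket.getaddrinfo`, so that a poisoned resolution on an infected machine would show up as a mismatch.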

Moreover, a viewer can enter a query into the input box that appears to belong to the Google server, but the traffic is in fact redirected to the poisoned server. The user is sent back (malicious) results for the given query that appear to come from Google, i.e., results that are different from the real ones. Clicking on these fake results leads to the click-through payment program being triggered, and thus to advertisers receiving money, meaning that click fraud has taken place. In the case of the Bahama botnet, income is diverted from main ad networks to smaller publishers and ad networks.

The adversary can also use botnets of compromised wireless routers rather than compromising the users’ systems [Botnet2]. In this scheme, the wireless router, which is hacked by malware, is converted to a bot. The botnet master can then give instructions to launch an inflight modification of traffic attack to transmit traffic through the router. Many public hotspots operate on this model by providing users with free Wi-Fi while embedding advertisements in the users’ traffic to earn more money.

In-flight modification of ad traffic has a drawback in that if a user clicks on the displayed advertisements, profit is generated for the fraudster rather than the legal ad network. Hence, this attack weakens the network industry model. It is worth noting that there are other catastrophic effects of these attacks in terms of the security of end-users (as it leads to malvertisement rather than legitimate advertisements), and also a loss of reputation and income for legal advertisers.

III-E Adware

According to [chien2005techniques, zhang2011inflight], adware (advertising software) is software with advertisements embedded in the application that can display advertisements without the user’s knowledge. This type of software is mainly used to show advertisements with the help of the websites users visit. The primary goal of this software is typically to make a profit based on the collection of marketing information or by displaying advertisements. Some people consider adware to be similar to malvertising, but there are notable differences between them: the target of adware is a single user, while malvertising serves malicious code to be deployed on a publisher’s web page. Adware is code that runs continuously on a user’s machine, while malvertising can only affect the user’s machine when the infected web page is viewed. Both advertisers and publishers can produce adware, and this software can therefore be divided into two main categories.

The first group of adware is known as shareware. This is designed for consumers who are not willing to pay for specific software, and numerous ad-supported software, games, and utilities have been distributed as adware. This type of software automatically displays advertisements in the form of annoying pop-up messages, and users have an option to disable these advertisements if they buy a license key. Moreover, when users uninstall the software, the advertisements should disappear. The developer uses the adware to recover the costs of development, and this approach allows consumers to use the software free of charge or for a low price. The revenue from displaying advertisements is the source of motivation for the developers, and helps them to carry out the development, maintenance, and upgrading of the software. For example, the Eudora mail client displays advertisements to users as a substitute for shareware registration fees.

The second category can be thought of as a kind of spyware. This group stealthily collects information on customers by spying on them, in order to serve advertisements embedded in websites. In formal terms, these types of applications contain adware that tracks the user’s Internet surfing habits to display advertisements associated with the user. This type of adware acts as an intrusive application with respect to the user’s data, and users need to protect their system against this software for security and privacy reasons. This adware is able to gather information about the individuals by surfing unauthorized sites via the Internet connection, and by monitoring the user’s favorites list and browser profile. The adware can even collect the required information by continually monitoring the search toolbars of browsers without the user’s awareness or permission. In extreme cases, the adware sells this private information to other entities without the awareness or permission of the user. This adware can also hijack the user’s web browser homepage and search engines in such a way that they cannot be changed.

For instance, YapBrowser is adware or spyware that can serve unrequested, offensive advertisements, modify system configuration settings and redirect users to unwanted websites. This software is illegally installed on the user’s machine to create revenue for spyware and adware owners. UK’s SearchWebMe assured users in June 2006 that the updated version of YapBrowser did not contain either adware or malicious apps to sniff and gather private information from users. Last but not least, Gator and Bargain Buddy are two other popular adware programs in this class, and were developed by Claria Corporation and Exact Advertising, respectively.

IV Countermeasures for Online Advertising Attacks

Over recent years, the field of online advertising security has attracted attention from many researchers in both academia and industry. Several solutions have been proposed to tackle the security threats identified in Section III.

Motivated by this consideration, we discuss several approaches proposed in the literature to combat various types of attacks on online advertising systems. Table III summarizes the existing detection methods for online advertising systems and gives a preliminary overview of the pros and cons of using these methods.

IV-A Countermeasures to Hacking

After Google AdWords [GOOGLECompany] was launched in 2000, it quickly became Google’s primary source of revenue and, soon afterwards, a rich source of targets for ad fraud attacks.

Various reports and forums have discussed the fact that stolen Gmail addresses and passwords are used in the majority of campaign account hacks. The different approaches used to hack Google AdWords accounts can be categorized as brute force login attacks; email spoofing; and malware and spy tools for obtaining user account information [LEFTYGBALOGH]. When fraudsters gain access to a campaign account, they can duplicate campaigns. Attackers can also generate enormous numbers of clicks and redirect destination URLs to other companies [MOZBlog, GOOGLE].

One of the more straightforward options for preventing this type of attack is to select strong passwords. The security and protection of an account can be increased by choosing a complicated and lengthy password combining letters and numbers with special characters, and by changing the password regularly. It is also possible to monitor and control browsers with phishing filters [GOOGLEE], especially when connecting to unsecured Wi-Fi networks and signing in to Google accounts. Industries and business owners should have a contingency advertising plan for monitoring their revenue trends to handle the drop in revenue caused by fraudulent campaigns.

To detect hacking attacks in online advertising, accounts can be checked daily, both to verify the cost-benefit ratio and performance of the campaign and to protect the campaign from reputational damage and loss of income. It is vital to monitor and analyze the performance of each AdWords campaign on a daily basis [mladenow2015online].
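A daily check of this kind can be partly automated. The sketch below flags days on which a campaign metric (clicks, spend, or similar) deviates strongly from its recent history. It is an illustrative rule, not a procedure from [mladenow2015online]: the function name, the seven-day window, and the threshold of three standard deviations are assumptions.

```python
from statistics import mean, pstdev


def flag_unusual_days(daily_values, window=7, threshold=3.0):
    """Return indices of days that deviate strongly from the trailing window."""
    flagged = []
    for i in range(window, len(daily_values)):
        history = daily_values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is unusual.
            if daily_values[i] != mu:
                flagged.append(i)
        elif abs(daily_values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily click counts for one campaign; day 7 shows a sudden spike.
clicks = [100, 102, 98, 101, 99, 100, 103, 400, 101]
spikes = flag_unusual_days(clicks)
```

A flagged day is only a prompt for manual review of the account, since legitimate events such as a new promotion can produce the same jump.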

IV-B Countermeasures to Crowd Fraud

As discussed in Section III-B1, there are significant differences between automatic fraudulent behaviors and frauds carried out by humans. The most immediate difference is that crowd fraud usually derives from a large number of attacking machines, while the malicious traffic generated by each computer is low. Another major difference is that fraudulent behaviors by web workers are irregular and have no specific order, allowing the suspicious traffic generated in this way to blend into normal traffic. Based on these differences, we can conclude that, over the short term, crowd fraud can appear very similar to normal individual behavior. As a result, conventional fraud detection methods are unable to detect crowd fraud.

The techniques typically used in business markets for crowd fraud detection mainly emphasize human interactions, including prior knowledge of malicious queries and principles associated with filtering. These approaches are costly, and tend to become invalid quickly because web workers may change their patterns of behavior to avoid detection.

To address these problems, the authors of [tian2015crowd] investigated the group behaviors associated with crowd fraud, and found that compared with the individual actions of each worker, which may involve considerable noise, group behaviors were more continuous.

In formal terms, these authors discovered certain typical feature distributions and network functions of crowd fraud that can be effectively applied to detect this activity. They noted the following aspects: moderateness: crowd fraud sometimes targets advertisers or queries with medium hit frequencies; synchronicity: Internet users participating in crowd fraud can be grouped into coalitions [49zhang2016detecting], within which they typically target a distinct collection of advertisers and execute the fraud quickly; and dispersivity: surfers involved in crowd fraud may search for an unrelated series of topics and click advertisements from different industries simultaneously.

Based on the attributes mentioned above, the authors of [tian2015crowd] introduced an efficient solution for crowd fraud in search engine advertising, which was divided into three phases: constructing, clustering, and filtering. In the constructing phase, they deleted irrelevant data from raw data logs of queries that did not meet the moderateness condition (e.g., either markedly small or large hit frequencies) to create a surfer-advertiser bigraph in which each edge referred to a single unique click history and included aspects such as search queries and hit times. Finally, they built a surfer-advertiser inverted list for this bigraph for the next phase. In this list, each entry referred to the click history for each unique surfer. In the clustering phase, they described the sync-similarity between click histories to discover coalitions of surfers, indicating synchronicity.

Next, they converted the coalition detection task into a clustering problem that could be solved through a nonparametric clustering algorithm (such as DP-means [kulis2011revisiting]). After the clustering phase, a large number of coalitions were found, some of which were false detections and therefore false alarms. For instance, in some business domains such as healthcare or games, regular Internet users with related interests may repeatedly click on the same advertisements to receive similar services. Hence, in the filtering phase, they created a filter for clusters based on dispersivity to eliminate false alarm clusters.
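The three phases described above can be illustrated with a minimal sketch. It is not the implementation from [tian2015crowd]: the function name, the thresholds, the tuple layout of the click log, and the greedy grouping step (used here as a simple stand-in for DP-means) are all assumptions made for illustration.

```python
from collections import defaultdict

def detect_coalitions(clicks, min_hits=2, max_hits=100,
                      time_window=60, min_shared=2, min_topics=2):
    """clicks: list of (surfer, advertiser, topic, timestamp) tuples (assumed layout)."""
    # Constructing: keep only advertisers with moderate hit counts (moderateness)
    # and build the surfer -> click history inverted list.
    hits = defaultdict(int)
    for _, adv, _, _ in clicks:
        hits[adv] += 1
    history = defaultdict(list)  # surfer -> [(advertiser, topic, time)]
    for surfer, adv, topic, ts in clicks:
        if min_hits <= hits[adv] <= max_hits:
            history[surfer].append((adv, topic, ts))

    # Clustering: a simplified sync-similarity counts the advertisers that two
    # surfers both clicked within `time_window` seconds (synchronicity).
    def sync_similarity(a, b):
        shared = set()
        for adv_a, _, t_a in history[a]:
            for adv_b, _, t_b in history[b]:
                if adv_a == adv_b and abs(t_a - t_b) <= time_window:
                    shared.add(adv_a)
        return len(shared)

    # Greedy grouping (stand-in for a nonparametric algorithm such as DP-means).
    clusters = []
    for surfer in sorted(history):
        for cluster in clusters:
            if sync_similarity(surfer, cluster[0]) >= min_shared:
                cluster.append(surfer)
                break
        else:
            clusters.append([surfer])

    # Filtering: keep groups whose clicks span several topics (dispersivity),
    # which drops groups of regular users sharing a single interest.
    coalitions = []
    for cluster in clusters:
        topics = {topic for s in cluster for _, topic, _ in history[s]}
        if len(cluster) >= 2 and len(topics) >= min_topics:
            coalitions.append(cluster)
    return coalitions
```

In this sketch, two surfers who click the same advertisers from unrelated industries within a minute of each other are reported, whereas two game enthusiasts who click the same game advertisements are removed by the dispersivity filter.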

Since this method does not require tuning of any parameters, it can be applied in real scenarios to find an unbounded number of coalitions without human interaction. The authors also built a parallel version of their detection method (by parallelizing the nonparametric clustering algorithm) to make the system more scalable for massive web searching. The results of this experiment validated the accuracy and scalability of their approach. Although the proposed algorithm was capable of detecting crowd fraud, it could not prevent this fraud. Moreover, evaluating the accuracy of the algorithm was hard due to the difficulty of collecting fraud data.

IV-C Countermeasures to Badvertising

A successful badvertisement stealthily and artificially generates automatic clicks on advertisements when users visit a site hosted by a fraudster, and can persist unseen by auditors from the ad provider. It does not require any specific technical knowledge to run this kind of attack, and any illegal webmaster can perform it [gandhi2006badvertisements].

At first glance, it may seem easy to detect this attack by monitoring the click-through rate (CTR) from the intended domain, where the CTR is the number of clicks an advertiser (i.e., publisher or ad) gains as a proportion of the impressions, but this is not always the case. For example, the attacker can generate both click-throughs and non-click-throughs by manipulating the traffic in the damaged page, while the customers correlated to those types are not informed of the advertisement. It should be noted that the owners of the site who earn income for a “badvertiser” may not be aware of their participation in running the attack. For example, the owners of a domain may be pretending not to know of the existence of an attack, or may be fooled by a corrupt webmaster. The former case corresponds to a phishing attack [tripathi2017novel].

Developing tools for the discovery and prevention of frequent click fraud attacks is a major aim of industries in this field. AdWatcher [AdWatcher] and ClickProtector [ClickProtector] are two well-known companies that try to detect and prevent such attacks. The most common attack types are malware-based, which use automated scripts, individuals hired to deplete their competitors’ advertising budget [India] or proxy servers to generate fake clicks. These attacks can be detected by tracking the IP addresses of the systems that generate the clicks or by distinguishing the click registered domains. Companies try to identify aspects such as duplicate clicks for a specific ad by a single IP address or irregularities in the traffic history, and to carry out careful analyses. However, a badvertising attack cannot easily be detected using these approaches, and there is a pressing need for other types of mechanisms to detect and prevent this attack.

The countermeasures discussed in [gandhi2006badvertisements] involve the construction of an ad code to detect an attack when preventing it is not possible. These methods can be divided into two types: active and passive. Active methods are used to detect click fraud, while passive methods are used to monitor the progress of a click fraud.

In formal terms, an active client-side solution is based on interactions with search engines, the execution of public searches, and visits to the resulting sites. It can carry out web surfing in a manner similar to the user. An active mechanism can conceal its status such that an agent cannot recognize it as a robot, and can present itself as a real user to the servers in order to interact with the agent and other entities.

In contrast, passive client-side approaches monitor the actions performed by users that lead to a click. It is possible to trap requests for advertisements by virtual execution of JavaScript code, and any attempt to display a specific web page in a way that it should be occurring after a click can be considered a fraudulent request. It should be pointed out that although this solution can be used against automatic click-fraud, it cannot be applied to protect a system against a type of attack that first creates a significant delay and then performs a click fraud. The only way to do this and to capture a delay is to let the virtual machine randomly select scripts for generating a delay.

We should recall that long delays are not preferred by attackers, since their session might be disconnected from the target before they can generate a click on the website. Passive client-side methods can be included with security toolboxes or anti-virus programs.

Another form of passive scheme is an infrastructure component, which can detect click fraud by shifting traffic, identifying candidate traffic and mimicking the system of the user receiving the packets. Example applications of infrastructure component schemes include an ISP-level spam filter and MTA.

We can conclude from a performance analysis that even if a client-side detection mechanism is installed by only a small proportion of customers, these attacks become entirely unprofitable.

IV-D Countermeasures to Hit Shaving

The author of [Johnson] explained that the rationale behind all inflation and deflation fraud (also called hit shaving) is a lack of knowledge. In both attacks, the entities who perform the fraud may under- or over-count transactions for financial gain, and it is difficult for the victim to prove the damage that arises. As a consequence, a general technique for detecting these frauds is to collect information relating to the victim’s claim.

For example, in the case of deflation fraud, the authors of [Johnson] proposed the use of an online Trusted Third Party (TTP) as a mediator to facilitate interactions between two parties. To detect deflation fraud, the publisher must collect as much information as it can, based on the advertiser’s claim. In a nutshell, the more information a publisher can gather, the stronger the detection scheme. The disadvantage of this solution is that it cannot be applied to the online advertising ecosystem. Similarly, in Google’s AdWords, the publisher directly monitors the transactions. The methods mentioned above suffer from a lack of scalability and efficiency, since the publisher can interfere in the business operation of the advertiser and in turn with the TTP.

In [ding2010hybrid], an efficient and flexible mechanism was proposed to relax the security solution slightly. The authors point out that there is a certain level of tolerable counting error for the publisher if they miss some transactions. Their mechanism involved a novel deflation fraud detection scheme that applied cryptography and probability-based techniques with the following features: the publisher can detect deflation fraud with a high probability of success, and the security parameters can be tuned by the publisher to provide a balance between cost-effectiveness and security assurance; under these conditions, the web publisher can estimate and detect the expected number of transactions on a large scale; although a transaction takes place only between advertiser and users, the proposed scheme is easy for end-users, since they are not required to keep any secret information; the costs (such as computation, communication, and storage) of this method are all constant, making the scheme efficient and scalable.

The proposed hybrid method does not require the cooperation of a third party, and retains the simplicity of the current advertising system. The publisher also has the option to tune the security parameters to balance the security and cost of the model. The drawback of the proposed scheme is the need for manual tuning of parameters by the publisher.

Although there are many click-through payment mechanisms on the web, the publishers cannot verify whether they have received payment for each click-through to the target site. This allows for hit shaving, in which the target sites can avoid paying the publisher sites for some click-throughs.

The study in [reiter1998detecting] proposed some rapid and straightforward approaches to enable referrers to track the number of click-throughs, allowing them to be aware of how much money they are owed. These methods included ways of creating web pages and Common Gateway Interface (CGI) scripts that offer the referrer webmasters a greater ability to monitor the numbers of legal clicks, and also which pages the users click. They implemented these approaches by placing upper bounds and lower bounds on referrals. These are effective techniques that do not require awareness or cooperation by the webmasters of the sites to which the referrals are made.

The authors also explore more aggressive approaches for cooperating with the providers of click-through mechanisms, to allow webmasters to more accurately control the number of click-throughs. Although this second group of approaches requires cooperation by the webmasters of the click-through payment programs, it does not need trusted webmasters, since any failure to cooperate is quickly detectable. This is a robust solution: a referrer can discover this fraud after 20 probes even if the target shaves only 5% of the commission. However, this method is not always feasible, for example if the target website sells expensive items. In this method, referrers are expected to report their payments for leads and sales correctly, with the help of the target sites. Although the techniques presented here are mainly invisible to the web user, their main disadvantage is the communication overhead for implementing the protocol, which makes the scheme inefficient and inflexible.
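One way to read the figure of 20 probes at a 5% shaving rate is as an expected value. The sketch below assumes, purely for illustration, that the target shaves each probe click-through independently with a fixed probability; this independence model and the function names are assumptions, not part of [reiter1998detecting].

```python
def detection_probability(shave_rate: float, probes: int) -> float:
    """Probability that at least one of `probes` probe click-throughs is shaved,
    assuming each probe is shaved independently with probability `shave_rate`."""
    return 1.0 - (1.0 - shave_rate) ** probes

def expected_probes_until_detection(shave_rate: float) -> float:
    """Mean number of probes until the first shaved one (geometric distribution)."""
    return 1.0 / shave_rate
```

Under this assumption, a 5% shaving rate gives an expected 20 probes until the first missing click-through is observed, while the probability of catching the target within exactly 20 probes is roughly 64%, so additional probes may be needed for high confidence.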

IV-E Countermeasures to Malvertising

As discussed in detail in Section III-C, malvertising can affect both web users and publishers in different ways. Malvertising can redirect users to malicious sites [sculley2011detecting] or install malware on the user’s computer simply by viewing the ad, without even clicking it. This results in losses of reputation, traffic and revenue for the publishers, and even if publishers are aware of this attack, it is difficult for them to find and block malicious advertisements, since the online advertising ecosystem is dynamic and displays advertisements from a vast number of advertisers.

To avoid malvertising, the authors of [vratonjic2011online] suggest checking the advertisements regularly and validating their appropriateness. It is the responsibility of the publishers and ad networks to verify the advertising content (whether active or malicious) by performing regular checks. If they become aware of any unexpected or unwanted behavior in the code, such as automated redirections, they should avoid publishing the advertisements to end-users. For example, in June 2009, Google launched an investigative research engine to help ad networks by regularly checking the source code of websites. This search engine is publicly available at www.anti-malvertising.com, and enables ad networks to detect potential malvertising providers. Surfers also need to install and regularly update anti-malware programs on their systems to protect against such risks.

IV-F Countermeasures to Inflight Modification of Ad Traffic

As in [Rescorla], the authors of [vratonjic2011online] proposed data integrity and authentication tools to ensure end-to-end security for communication to prevent inflight modification. However, the use of these mechanisms has certain disadvantages that make them challenging to deploy on a wide scale. Firstly, web content protection depends on cryptographic processes that impose a high computational cost on servers [Rescorla]. Secondly, the authentication mechanism uses digital certificates to authenticate web servers, and these are expensive since certificate authorities are required to carry out the authentication of web servers manually. Clearly, if a site has a certificate assigned by a trusted certification authority, a trusted connection can be made that helps browsers to authenticate websites [vratonjic2011online].

Web administrators also prefer to use a customized self-signed certificate without relying on third-party certification authorities to avoid the extra cost; however, such self-signed certificates are vulnerable to MITM attacks, and do not provide a reliable solution that allows the web browser to identify the website, and users need to decide whether or not to trust the corresponding website [Wendlandt]. From the user’s point of view, it is complicated to determine the operation of a given certificate and to validate it. As a result, a malicious server can often communicate with users. A notary office can be established to control the consistency of the web server’s public keys and to help the user verify self-signed certificates. Although this technique is a new and reliable solution, it has the same limitations as the scheme in [Rescorla].

To tackle the above problems, researchers have introduced several alternative approaches to protect Web content effectively [Langley, Reis, Vratonjic]. For example, in [Langley], the authors present a new opportunistic encryption method for encrypting web communications, involving a secure channel without other host authentication. However, this technique is unable to protect systems against MITM attacks, since the attacker can easily access the certificates used for authentication and replace them to impersonate the web page. In other work, the authors of [Reis] adopted a web-based measurement tool called Web Tripwire to detect inflight changes to websites. This method can inject JavaScript code into the site and monitor the HTTP web page to identify any changes in it. The tool immediately reports any modifications to the web page to both the end-user and the web server. Web Tripwire checks the integrity of pages and is cheaper than HTTPS, but it is not a cryptographically secure method. In [Vratonjic], a secure scheme based on a collaboration between ad networks and web servers was introduced to counteract inflight traffic modification. This method is based on the fact that ad networks with digital authentication certificates can ensure the authenticity and integrity of the traffic. However, the implementation of this method imposes a high cost on publishers and ad networks.
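The comparison step behind a tripwire-style check can be sketched as follows. The actual Web Tripwire of [Reis] runs as JavaScript inside the page; the Python function below is only a conceptual illustration, and its name and inputs (the page as served and the page as received) are assumptions.

```python
import difflib

def detect_inflight_changes(served_html: str, received_html: str) -> list:
    """Return the lines that differ between the page as served and as received.

    Lines prefixed with '- ' were removed or altered in transit; lines prefixed
    with '+ ' were added or are the altered versions. An empty list means that
    no modification was observed.
    """
    served = served_html.splitlines()
    received = received_html.splitlines()
    return [line for line in difflib.ndiff(served, received)
            if line.startswith(("+ ", "- "))]
```

As the text notes, a check of this kind is not cryptographically secure: an attacker who can rewrite the page can also rewrite or strip the checking code itself.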

IV-G Countermeasures to Adware

Untrusted websites generally deliver spyware and adware to unaware customers [vratonjic2011online], and it is therefore better that users avoid these kinds of sites. Users are also advised to install or update their anti-adware software regularly. Finally, some free software has the potential to install adware in the user’s system, and the user therefore needs to pay attention to license agreements and installation screens before installing them.

IV-H Countermeasures to Hit Inflation

Due to the nature of hit inflation attacks, they are an important concern for advertising commissioners [zeller2004each]. Most research to date has focused on publisher fraud, since this can also be generalized to advertiser fraud. In the following subsections, we therefore concentrate on publisher fraud unless it is specifically necessary to investigate advertiser fraud. We start with examples of classical approaches to inflation fraud detection in Section IV-H1, and give an overview of cryptography-based methods in Section IV-H2. We conclude that the commissioner cannot track individual computers to identify fraud due to violations of user privacy. Finally, in Section IV-H3, we argue that the application of statistical analysis to streams of traffic is the most appropriate way to detect hit inflation.

IV-H1 Classical Approach

Classical fraud detection, also called offline fraud detection, employs a variety of metrics to evaluate publishers according to the quality of traffic to their websites [metwally2007hit]. It should be emphasized that the quality of traffic can be measured by its conformity with normal network traffic. In classical detection methods, brokers can store the total traffic in databases and validate the quality (based on certain metrics) of the stored traffic using complex SQL scripts.

One of the most appropriate metrics is the CTR of the advertisements, which is constant across websites of the same type [klein1999defending], while advertisements of different types have different CTRs on identical sites. If a publisher's website is visited and clicked automatically, the advertisements it displays not only tend to show similar CTRs, but those CTRs also deviate from the normal values. Commissioners can extend this technique to monitor the behavior of advertisements by loading empty advertisements into the websites of publishers and checking clicks on these false advertisements.
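A CTR-based check of this kind can be sketched as a simple outlier test over publishers of the same type. The z-score rule, the threshold, and the input layout below are illustrative assumptions rather than a metric taken from the cited works.

```python
from statistics import mean, pstdev

def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate: clicks as a proportion of impressions."""
    return clicks / impressions if impressions else 0.0

def flag_ctr_outliers(stats: dict, z_threshold: float = 2.0) -> list:
    """stats: {publisher: (clicks, impressions)} for websites of the same type.

    Returns the publishers whose CTR lies more than `z_threshold` standard
    deviations away from the mean CTR of the group.
    """
    rates = {pub: ctr(c, i) for pub, (c, i) in stats.items()}
    mu = mean(rates.values())
    sigma = pstdev(rates.values())
    if sigma == 0:
        return []  # all publishers behave identically; nothing stands out
    return sorted(pub for pub, rate in rates.items()
                  if abs(rate - mu) / sigma > z_threshold)
```

A rule this simple also illustrates the weakness discussed next: a fraudster who knows the normal CTR for a site type can shape the traffic so that the metric stays within the expected range.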

However, classical metrics have several problems. They are not efficient metrics, since fraudsters can easily circumvent traditional tools, and can fool classical detection tools by abusing the site architecture [metwally2006hide] of a specific publisher to model the network metrics of advertisements and gain information about the parameters of the advertisements displayed on their website.

A lack of scalability is the second problem. It should be noted that the average impressions per second currently received by the commissioner is 20K, corresponding to 70M records that need to be stored in a database per hour. It is clear that executing SQL scripts to compute these metrics will lead to a decrease in database performance, and commissioners therefore execute them only periodically. Moreover, the updating of these metrics is also not scalable. Each click on an ad in any site may mean that the statistical parameters and the ranking of the website need to be recalculated.

Thirdly, the classical approach was developed before Internet advertising reached maturity, and hence represents the standard conflict between advertisers and commissioners. Traffic that does not conform to the network metrics may be legal, although it will be low-quality traffic. Since classical methods are unable to detect malicious intent, they discard legitimate traffic with low quality.

IV-H2 Cryptographic Approach

There are various cryptographic methods in the literature that can replace off-line measures. The central idea behind these is to change the industry standard to give fraudulent publishers less chance to conduct fraud [naor1998secure]. For example, in [jakobsson1999secure], a simple model involving e-coupons was developed. In this model, the advertiser exploits cryptographic algorithms to produce coupons and distributes them to the publishers. Then, the publishers redistribute the coupons to users, who can use these cryptographic coupons to purchase items from the websites of advertisers. Web advertisers favor this model because it is based on pay-per-sale. Most publishers prefer to be paid based on the number of clicks or impressions, since this relates to the load on their servers.

Conversely, advertisers can exploit the model to receive a vast amount of clicks or impressions, which are essential to increase awareness of their brand. The authors claim that the proposed model meets most security and safety requirements; however, the model is vulnerable to hit shaving attack by advertisers.

The solution proposed by Goodman [goodman2005pay] is to replace the current pay-per-click scheme used in online advertising with a pay-per-impression system. Under this approach, click fraud no longer imposes a monetary cost on the advertiser, since the advertiser no longer pays per click. The authors of [juels2007combating] suggest a cryptographic technique for changing the CPC model to CPA in which valid clicks are identified rather than invalid clicks being removed. This model guarantees the legitimacy of the clicks received by advertisers through a TTP. However, this model requires sharing information between third parties, which is not possible due to the security restrictions in modern browsers.

Other cryptographic methods rely on assistance from users to distinguish fraudulent traffic from regular traffic. Different groups of protocols using basic cryptography methods have been introduced to count the total number of visitors viewing a website [reiter1998detecting, blundo2002sawm]. One framework requires users to register with a broker, from which the user receives a token to use free services on the website of the publisher. The broker also shares the corresponding token with the publisher to allow them to recognize registered users. In this way, each time a user visits the publisher’s website and sends a token-based authentication to the publisher, access is granted to that free service. The user updates the publisher via a hash function when an authentication token is sent to the publisher. Since the publisher cannot predict the value of the next visit (but can verify the value of the token), the number of user visits stored in the last token is sent back to the publisher at accounting time.
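The source does not spell out the token construction, so the sketch below assumes a standard hash chain, which is one common way to build such a visit counter. The class and function names, the use of SHA-256, and the exact message flow are assumptions made for illustration.

```python
import hashlib

def h(value: bytes) -> bytes:
    """One-way hash used to link consecutive tokens."""
    return hashlib.sha256(value).digest()

def make_chain(seed: bytes, length: int) -> list:
    """Broker/user side: return [w_0, ..., w_length] with w_{i+1} = h(w_i)."""
    chain = [seed]
    for _ in range(length):
        chain.append(h(chain[-1]))
    return chain

class Publisher:
    """Holds the chain's end value (shared by the broker) and counts visits."""

    def __init__(self, anchor: bytes):
        self.last_token = anchor
        self.visits = 0

    def visit(self, token: bytes) -> bool:
        # A valid token hashes to the previously accepted one. The publisher can
        # verify it but cannot compute the next token itself (preimage resistance).
        if h(token) != self.last_token:
            return False
        self.last_token = token
        self.visits += 1
        return True

def broker_verify(seed: bytes, length: int, token: bytes, visits: int) -> bool:
    """Accounting time: check that `token` really lies `visits` steps down the chain."""
    if not 0 <= visits <= length:
        return False
    return make_chain(seed, length)[length - visits] == token
```

In this sketch the user reveals the chain values in reverse order, one per visit, and the publisher presents the last accepted value together with its claimed visit count. Inflating the count would require inverting the hash function.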

There are some limitations to this framework. Firstly, it presumes that the users trust brokers to download code to run the hash function [khare1998trust] and communicate with the publishers’ servers. Secondly, it suffers from a lack of scalability, since numerous hash functions are required (one for each user). Thirdly, this scheme needs brokers to identify users uniquely in order to be effective, although exposing personal information on the users to the brokers violates the user’s privacy. The last problem can be handled by user registration in the broker’s website (by exposing the user’s personal information). Brokers can also track and monitor the behavior of users by downloading spyware [saroiu2004measurement] onto their systems.

IV-H3 Data Analysis Approach

Many advanced data analysis technologies have been developed to alleviate the problems caused by cryptographic methods. As mentioned above, a broker needs to deal with the conflict between protecting the user’s privacy and security, and the best way to address this challenge is to carry out statistical analysis on collected data (such as cookie IDs and IPs) with the help of temporary user identification. The analysis of IPs and cookie IDs is more privacy-friendly than cryptographic methods.

Commissioners can also track users based on their cookie IDs and IPs. In the current Internet architecture, the use of cookies and IPs to detect fraud can be a less intrusive technique than methods requiring user login. Cookies do not store any personal information, and the user has the ability to block, accept, or periodically clear them [mcgann2005study]. IP addresses can also be assigned to the user temporarily, and can be shared with other users. There is therefore no reason to change the industry model and to obfuscate the identity of the users when applying data analysis methods to cookie IDs and IPs, and these methods can detect fraud with high accuracy [metwally2007detectives].

Several data analysis techniques have been proposed in the literature to detect and fight click fraud [metwally2005using, metwally2005duplicate]. The principal aim of these technologies is to find particular patterns that characterize fraudulent traffic [mann2006click]. The known data analysis approaches to defending against hit inflation are described below.

Detecting duplicate clicks. Since some publishers try to increase the number of clicks on their websites by clicking the same advertisement repeatedly, some detection techniques rely on searching for duplicate clicks in the clickstream [zhang2008detecting, metwally2005duplicate]. The detection of duplicate clicks within a short time (for example, a single day) raises suspicion for the commissioner.

In classical data analysis techniques, the commissioner can store the total traffic in databases and run complex SQL scripts to find duplicate clicks within a certain period. However, this method suffers from scalability and performance problems. Storing traffic in the database and then checking it to find duplicate clicks is very expensive for commissioners, since they receive a vast amount of traffic (an average of around 70M records is generated per hour). In an online scenario, a detection scheme also needs to be fast, and should process each traffic entry within 50 µs, which matches an arrival rate of roughly 20K entries per second. Hit inflation detection is therefore a critical application of streaming and sampling algorithms.

To cope with the above problem, Metwally et al. [metwally2005duplicate] proposed a fast algorithm for detecting duplicate clicks in data streams. Their algorithm relies on original Bloom filters [bloom1970space] and aims to find click fraud with an error rate of less than 1%. They provide different solutions by considering three types of window, as follows: sliding windows (finding duplicate clicks corresponding to the last observed part of the stream); landmark windows (keeping particular parts of the stream for deduplication); and jumping windows (a trade-off between the first two types).
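The jumping-window idea can be sketched with one small Bloom filter per sub-window. This is a simplified illustration rather than the algorithm of [metwally2005duplicate]: the class names, filter size, number of hash functions, and the count-based sub-window boundaries are all assumptions.

```python
import hashlib
from collections import deque

class BloomFilter:
    """Minimal Bloom filter: no false negatives, a small false-positive rate."""

    def __init__(self, size: int = 8192, hashes: int = 4):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)  # one byte per bit, for clarity

    def _positions(self, item: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

class JumpingWindowDeduplicator:
    """Keeps the last `sub_windows` sub-windows; the oldest one expires as a whole."""

    def __init__(self, sub_windows: int = 4, clicks_per_sub_window: int = 1000):
        self.filters = deque([BloomFilter()], maxlen=sub_windows)
        self.capacity = clicks_per_sub_window
        self.count = 0

    def is_duplicate(self, cookie_id: str, ad_id: str) -> bool:
        key = f"{cookie_id}|{ad_id}"
        duplicate = any(key in f for f in self.filters)
        if self.count == self.capacity:
            # Start a new sub-window; deque(maxlen=...) drops the oldest filter.
            self.filters.append(BloomFilter())
            self.count = 0
        self.filters[-1].add(key)
        self.count += 1
        return duplicate
```

Expiring a whole sub-window at once avoids storing per-click expiry times, at the cost of a window whose effective length varies between `sub_windows - 1` and `sub_windows` sub-windows.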

The results of an experiment on a real dataset show that within one day, one ad was clicked 10,781 times by users with the same cookie ID. Since the method is successful in identifying fraudulent intent, it can be considered a complementary approach to classical schemes that cannot differentiate low-quality from malicious traffic. However, the method has high computational complexity of order O(n), since it needs to keep active click identifications in its memory until they expire.

To address this problem, two algorithms, namely the Group Bloom Filter (GBF) and Timing Bloom Filter (TBF) algorithms, were developed in [zhang2008detecting]. The difference between them lies in the number of sub-windows. The GBF can detect click fraud using jumping windows with a small number of sub-windows, whilst TBF achieves this using a large number of sub-windows. These two algorithms involve simple operations and relatively little storage space, with zero false negatives. The error rate of duplicate detection is also reduced to less than 0.1%.

Fabricated impressions and clicks. Other solutions collect ad traffic across user IPs and cookie IDs to identify fabricated clicks and impressions. They are based on finding client behavior (e.g., advertisement traffic) that deviates from normal behavior [metwally2007detectives, metwally2007hit].

Cryptographic and classical methods cannot determine the difference between attacks launched by a single publisher and by a group of publishers (also called a coalition attack). In principle, drawing this distinction is the main idea behind the data analysis approach. In coalition attacks, fraudsters share their machines to reduce the overhead and costs by carrying out distributed attacks rather than individual ones. Since numerous publishers share the pattern of fraudulent traffic, the detection of coalition attacks is difficult. Although it is easy for coalition attacks to defraud classical methods, data analysis mechanisms have been developed to try to find evidence of these attacks [metwally2007detectives].

Metwally et al. [metwally2005using] designed a scheme to detect the hit inflation attack identified in [anupam1999security]. They observed that several websites could cooperate to make fake clicks and consequently improve their business interests, and proposed an algorithm named Streaming-Rules to detect hit inflation in an online advertising system. This approach relies on discovering the association rules (defined as forward and backward association rules) between each pair of corresponding elements in the stream.

This algorithm requires cooperation between Internet Service Providers (ISPs) and brokers. An ISP can recognize which websites are generally visited before a particular website, while maintaining users’ privacy [iqbal2018protecting], by analyzing the entire HTTP requests stream. The authors claimed that Streaming-Rules could discover the association between elements occurring in a stream with tight error guarantees and minimal memory usage.

The solution proposed in [metwally2005using] is not efficient against other coalition attacks, since it is designed to detect the specific attack described in [anupam1999security]. For example, if each adversary in the coalition attack takes control of the user’s system via Trojans, then the adversary can separate the HTTP request stream by ISP, making it impossible to detect the attack using Streaming-Rules. Hence, in [metwally2007detectives], an approach was developed to identify different types of sophisticated coalition attacks (e.g., a coalition formed of multiple dishonest publishers) called the Similarity-Seeker algorithm. This detection mechanism relies on analyzing traffic to find similarities in the traffic to websites. Legitimate websites do not have similar traffic, and traffic from similar sets of IPs is therefore suspicious. The original model can discover coalition attacks of size two, and the extended model can find attacks by coalitions of arbitrary sizes. The exploitation of statistical traffic analysis gives more scalability than traditional technologies.
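The core intuition behind the similarity-based approach, namely that legitimate websites rarely share the same visitor IPs, can be sketched with an exact pairwise comparison. The Jaccard measure, the threshold, and the brute-force loop below are illustrative assumptions; they are not the streaming algorithm of [metwally2007detectives] and would not scale to real traffic volumes.

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Share of IPs common to both sets, relative to all IPs seen on either site."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def suspicious_pairs(visitors: dict, threshold: float = 0.5) -> list:
    """visitors: {publisher: set of visitor IPs}.

    Returns (publisher, publisher, similarity) for every pair whose visitor
    sets overlap at least as much as `threshold`.
    """
    pairs = []
    for p, q in combinations(sorted(visitors), 2):
        similarity = jaccard(visitors[p], visitors[q])
        if similarity >= threshold:
            pairs.append((p, q, similarity))
    return pairs
```

Coalitions of more than two publishers could then be assembled from the flagged pairs, for example by grouping publishers that are connected through suspicious pairs.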

Another method presented by Metwally et al. in [metwally2008sleuth] called SLEUTH (Single-pubLisher attack dEtection Using correlaTion Hunting) addresses the problem of fraudulent traffic generated by a single publisher via several IPs. This approach focuses on discovering an association between the publisher and the IP address of a machine. However, SLEUTH is only an adequate solution for a botnet that utilizes a vast number of IPs, and assumes that the traffic features of non-fraudulent publishers and IPs are constant. This assumption is not applicable to online advertising systems, where trends are highly temporal.

The Clicktok tool [nagaraja2019clicktok] uses a Non-negative Matrix Factorization (NMF) algorithm to partition click traffic and thereby identify fraudulent clicks. The authors claimed that the proposed solution reached an accuracy of 99.6%. Despite this high accuracy, however, the solution only works on the user side.

Although these solutions have certain benefits, all of them remain under threat from sophisticated botnet ad fraud [stone2011understanding], in which many compromised machines are used to vary the IPs and cookie IDs of fraudulent requests.

In [haddadi2010fighting], the authors described the use of bluff advertisements, an online click-fraud detection strategy that blacklists malicious publishers based on a predefined threshold. This approach displays several unrelated or fake advertisements amongst the user’s targeted advertisements, with the expectation that these advertisements will not be clicked on. In addition to monitoring IPs and applying profile-matching and threshold detection techniques, bluff advertisements create obstacles for botnet owners who want to train their software. Negative attitudes of users can also be reduced by decreasing the number of precisely targeted advertisements. These considerations motivated the authors of [dave2012measuring] to recommend a technique for advertisers to measure the proportion of invalid clicks on their advertisements by generating fake ones. However, running bluff advertisements increases advertisers’ advertising budgets.
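
The threshold-based blacklisting step can be sketched as below. This is a minimal illustration under stated assumptions, not the procedure of [haddadi2010fighting]: the function name, the threshold of three clicks, the publisher and ad identifiers, and the choice to count clicks per publisher are all hypothetical.

```python
from collections import Counter

def blacklist_publishers(clicks, bluff_ad_ids, threshold=3):
    """Return publishers whose traffic clicked bluff ads at least `threshold` times.

    `clicks` is an iterable of (publisher_id, ad_id) tuples.
    """
    # Count only the clicks that landed on a bluff advertisement.
    bluff_clicks = Counter(
        publisher for publisher, ad_id in clicks if ad_id in bluff_ad_ids
    )
    return {pub for pub, n in bluff_clicks.items() if n >= threshold}

# Hypothetical click log: pub_a's traffic clicks bluff ads three times.
clicks = [
    ("pub_a", "bluff_1"), ("pub_a", "bluff_2"), ("pub_a", "bluff_1"),
    ("pub_a", "ad_9"), ("pub_b", "bluff_1"), ("pub_b", "ad_9"),
]
blacklisted = blacklist_publishers(clicks, {"bluff_1", "bluff_2"}, threshold=3)
```

Because genuine users occasionally click bluff advertisements by accident, the threshold trades false positives against detection delay.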

All of the above detection methods can only address fraud after it has occurred. The authors of [costa2012proposal] therefore proposed an automated method for preventing click fraud called clickable CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart). In the proposed method, customers complete a simple Turing test [turing2009computing] and are then diverted to the publisher’s site. Although this allows click fraud to be identified by separating valid users from automated clicks, loading CAPTCHAs requires additional time and space.

| Attack | Countermeasure | Advantages | Disadvantages | Ref. |
|---|---|---|---|---|
| Hacking | Daily checking of the user’s account | Protects against possible financial and reputation losses | Highly time-consuming | [mladenow2015online] |
| | Detection strategies based on human interactions | Strong detection scheme | Labor costs; quickly becomes invalid due to rapid changes in web workers’ behavior | [tian2015crowd] |
| Crowd Fraud | Substantial-randomness solution based on group behaviors | Robust, scalable, and reliable; no need to tune parameters manually; applicable in the real world | Fails to prevent fraud; accuracy of the algorithm is difficult to evaluate | [Johnson] |
| Badver. | Detecting and preventing badvertisements via active and passive schemes | Preserves user privacy | Needs third-party interaction; time-consuming | [gandhi2006badvertisements] |
| | Collecting information | Strong detection scheme | Lack of scalability; lack of efficiency | [Johnson] |
| | Using cryptography and probability tools to detect fraud | User-friendly and simple model; no third party needed; constant communication, computation, and storage cost for ads | Parameters must be tuned manually | [ding2010hybrid] |
| Hit Shaving | Enabling the referrer webmasters to monitor the number of legal clicks | No awareness or cooperation needed from the webmasters | Communication overhead | [reiter1998detecting] |
| | Enabling the providers of click-through mechanisms to control the number of clicks | Robust; no need for an honest webmaster | Requires cooperation or awareness by the webmaster | [reiter1998detecting] |
| Malvertising | Regular checking of advertisements and validation of their appropriateness by publishers or ad networks | Prevents losses of reputation, traffic, and revenue | Highly time-consuming | [vratonjic2011online] |
| | Installing/updating anti-malware software by users | Prevents malware from being installed on the user’s machine | Uses a lot of memory and disk space and slows down the system | [vratonjic2011online] |
| | Data integrity and authentication mechanisms | Ensures end-to-end security of communications to prevent inflight modifications | Lack of scalability; high communication cost | [Rescorla] |
| Inflight Modification of Ad Traffic | Using a new encryption method to encrypt Web communications without other host authentication | Highly scalable | Fails to protect the system against MITM attacks | [Langley] |
| | Using Web Tripwire to detect inflight changes to websites | Cheaper than HTTPS | Not cryptographically secure | [Reis] |
| | Secure scheme based on collaboration between ad networks and web servers | Ensures authenticity and integrity of the traffic | Additional cost for publishers and ad networks | [Vratonjic] |
| | Not visiting untrusted websites | Prevents adware and spyware from being delivered to unaware users | Some free software still installs adware on the user’s system | [vratonjic2011online] |
| Adware | Installing/updating anti-adware software | Prevents adware and spyware from being installed on the user’s machine | Uses a lot of memory and disk space and slows down the system | [vratonjic2011online] |

Table III: Comparison of existing detection methods in online advertising systems. Ref. := Reference, Badver. := Badvertising.
| Attack | Countermeasure | Advantages | Disadvantages | Ref. |
|---|---|---|---|---|
| Classical | Using a variety of metrics to monitor the quality of the traffic to find fraud | No third party needed | Lack of efficiency and scalability; conflict of interest between commissioners and advertisers | [metwally2007hit], [klein1999defending], [metwally2006hide] |
| | Changing the industry model to pay-per-sale | Safe; robust | Vulnerable to hit shaving | [jakobsson1999secure] |
| Cryptographic | Replacing the pay-per-click model with the pay-per-impression/pay-per-action model | Guarantees the legitimacy of the clicks received by advertisers through a trusted third party | Sharing of information between the third parties | [goodman2005pay], [juels2007combating] |
| Hit Inflation | Assistance from users to distinguish fraudulent traffic from regular traffic | Cost saving through a free service | Lack of scalability and user privacy; sharing of information between the third parties | [reiter1998detecting], [blundo2002sawm] |
| | Detecting duplicate clicks: original Bloom Filter algorithm; GBF algorithm and TBF algorithm | Lower error rate; requires simpler operations and less storage space, with a low false-positive rate | Memory waste; only a theoretical analysis was made | [zhang2008detecting], [metwally2005duplicate] |
| Data Analysis | Fabricated impressions and clicks: Streaming-Rules algorithm; Similarity-Seeker algorithm; SLEUTH; Clicktok | Scalability and ability to detect specific hit inflation; highly scalable; high accuracy and ability to detect complex coalition attacks; low latency | Thwarted by sophisticated botnet ad fraud; under threat from complicated botnet ad fraud; not applicable to online advertising systems; works only on the user side | [metwally2007detectives], [metwally2007hit], [metwally2008sleuth], [nagaraja2019clicktok] |
| | Bluff Ads; CAPTCHAs | Creates obstacles for botnet owners who want to train their software; identifies click fraud based on the valid user | Increases advertisers’ advertising budgets; loading CAPTCHAs needs time and space | [haddadi2010fighting], [dave2012measuring] |

Table IV: Continued from Table III.

V Summary of Observations and Future Research Directions

As discussed in detail in Section III, various types of security threats can endanger the online advertising ecosystem, and a considerable amount of research has been conducted to deal with these threats. Nevertheless, these studies do not provide a consensus qualitative and quantitative analysis of the security of the online advertising system. Fig. 12 shows an overview of four open issues and the corresponding possible solutions.

Given the various limitations of previous investigations and the properties of the current online advertising system, we introduce some possible future research directions towards building a reliable, secure, and efficient online advertising ecosystem in Section V-A. In Section V-B, we describe some possible solutions to mitigate each open issue.

Figure 12: Proposed research roadmap for measuring and optimizing the security of online advertising networks.

V-A Future Direction

The security, reliability, and efficiency of online advertising systems rely on four major aspects of research, as described below.

  • Combating ad fraud. Although 2020 is expected to be a year of growth, this growth can be undermined by ad fraud. A report released by Juniper Research states that in 2018, about $42 billion was lost to ad fraud in the online advertising business, and this amount is expected to grow to $100 billion by 2023. The damage is not only financial: ad fraud can also affect user privacy and obscure the best-performing marketing channels. To limit this damage, growth marketers must treat fraud prevention as a priority. The report claims that attackers tend to apply methods such as domain spoofing, which increases the number of clicks by making a low-quality site resemble a high-quality website, rather than techniques such as app install farms. As a result, it is essential to distinguish fake ad clicks from genuine ones, which is not an easy task in real-time bidding.

  • Demand for transparency. The report in [Nielsen] points out that the majority of the money allocated to online advertising is currently wasted, due to fraud or off-target audiences. However, there are ways to adapt, and transparency can play a significant role in this. For the entities involved in the ad industry, it is vital to know where their banners are served and where their budgets are spent, since advertisers who lose control over the budget allocated to an ad campaign will not know what has been spent where. Advertisers and publishers are doing business, and their activities therefore aim to make money, but the fragmentation of this economy means that media customers pay more than the media is worth.

  • Cross-border complexity. This aspect concerns attracting and protecting global users who require multi-currency pricing options. For example, customers from all parts of the world trust ad providers to deliver ad services to them, but the payment methods offered by ad providers may not be acceptable to these customers. As a result, to gain customer loyalty, ad service providers need to allow customers to exchange currency on their side at suitable exchange rates. In this way, they can build a sustainable and secure platform that supports different multi-currency scenarios.

  • Disruptive technologies. The online advertising industry has been significantly penetrated by technological innovations such as the Internet of Things (IoT), Artificial Intelligence (AI), Augmented Reality (AR), and 5th generation mobile Internet (5G). In 2018, for example, Google launched a beta experiment involving automatic ad placement on the basis of AI, and publishers’ incomes increased by 10%. To survive and gain a competitive advantage in the market, an enterprise needs to adapt to these changes faster than others, and the future of companies that are not ready for the newest technologies is in question.

V-B Suggestion of Security Responses

In this section, we propose some responses to the challenges introduced in Section V-A.

  • Ad fraud has become a significant concern for everyone involved in the ad industry, and can lead to reductions in trustworthiness and campaign effectiveness, and the siphoning of budgets. Many companies have put considerable effort into fighting against ad fraud.

    The industry’s primary solution for combating all types of fraud is the use of Machine Learning (ML) to analyze the history of attacks and how they appeared, to help companies predict what will happen next. However, as mentioned in Section V-A, one of the best and most efficient solutions to prevent ad fraud is to apply sophisticated click validation mechanisms. This increases the workload for fraudsters aiming to steal advertisers’ and brands’ budgets, and makes it uneconomical for them. In 2019, Adjust [Adjust] proposed a standard based on click validation in which ad channels send impressions with a unique identifier before the click claim is sent.

    As mentioned previously, whenever a user clicks on a hyperlink in a publisher’s website, the advertiser must pay a fee. The question therefore arises as to how an advertiser can verify that the bill received from the publisher is correct. This poses a challenge and remains an open issue. Our suggestion in this case is to apply verifiable computation, in which the result of a computation can be checked without requiring the user to re-execute it [walfish2015verifying]. The underlying theory is that of probabilistic proof systems, which are composed of two elements, a prover and a verifier. The prover aims to convince the verifier of a mathematical assertion by supplying a proof, while the verifier checks the proof.

    However, in practice, this computational technique is not economically sound. We therefore propose the use of a blockchain-based scheme for validation and verification. The concept underlying the blockchain is Distributed Ledger Technology (DLT), which helps various untrusting and distributed agents to transmit data in a trusted, secure, and valid way by providing distributed validation, transparency, and cryptographic immutability [croman2016scaling, androulaki2018hyperledger]. Recently, a wide range of applications (such as healthcare [shae2018transform] and genomics [ozercan2018realizing]) have begun to use the blockchain to guarantee trustworthiness in interactions among untrusting agents. Thus, the blockchain is an appropriate mechanism to ensure trust in cases which require long-running computations. We believe that an important future research direction in the use of validation clicks to fight ad fraud could be to investigate how blockchain-based validation can be extended and used to ensure effective, trusted verifiable computations.

  • High levels of transparency play a significant role in building trust between entities in the online advertising system and customers. This also affects the relationship between the publisher and advertiser. One way to help bring transparency over cost is to create a real-time analytic method to follow all activities. In the following, we highlight some other technologies and tools that can improve and guarantee transparency.

    • In 2016, IAB released a Programmatic Fee Transparency Calculator to add transparency to the collaboration between publisher and advertiser. This tool was designed to help actors in the online advertising market to define and apply cost models differently. In this way, they have the flexibility to enter their planning rates and budgets into the calculator, and then select the available advertising technologies for the campaign. It is essential to mention that the calculation cost model is based on the “% of media.”

    • The blockchain can provide security and transparency for the transfer of data from advertiser to publisher. Blockchain technology also makes real-time transactions possible, especially when the price is visible to all participating members of the supply chain.

    • A few advertising tools are available to cope with the transparency challenge, including Havas and Apomaya. These platforms aim to support transparency by calculating the fees that media buyers must pay.

  • Engineers have expertise in developing ad software that facilitates multiple-currency and cross-border operations. They are aware of how to create and maintain smart billing services that can support multi-currency payments. We identify some other techniques for coping with the challenges of cross-border complexity as follows.

    • One technique is to integrate a currency converter calculator into a pre-built framework. This requires finding an Application Programming Interface (API), such as currencylayer, Fixer, or XE Currency Data, to allow regular updating of currency exchange rates and access to the maximum number of worldwide currencies.

    • Another technique is to provide customers with access to different payment gateways, including PayPal, Secure-Pay, Stripe, Authorize.Net, etc. Offering diverse payment options can help to attract and retain loyal customers.

    • Ad services can also be provided with adjustable prices by considering the average transaction cost across a specific country, since a given amount might be adequate for one country but too high for another.

  • It is not easy to adopt cutting-edge technologies when traditional ones work well. For example, it is difficult for an advertiser to move their ad campaigns to emerging technologies. However, in this new era, there is a need to adapt and be aware of the latest technologies, and the domain of online advertising systems is no exception. Emerging technologies such as AI, AR, IoT, and 5G can help ad tech companies in several ways. For example, the role of AI is three-fold. Firstly, the use of AI-based chatbot applications will motivate users to buy products, since a chatbot allows them to ask questions, give commands, and receive services in a conversational style. An AI chatbot can read data, analyze complex information, and make decisions based on this information. Depending on the customer’s question, the system should refer them to a specific social group that demonstrates the items that can be purchased. Secondly, AI provides a method of targeted advertising. By applying machine learning algorithms to big data, AI can automatically sort marketing messages and deliver them to the target users, making ad targeting more accurate and cost-effective. Thirdly, running AI-based algorithms allows ad mediation to be optimized to maximize profits for publishers by finding the best-matched slots for their advertisements. AI-based advertising helps companies in four ways: by displaying personalized advertisements to the relevant customers and minimizing human effort; by interacting with audiences in a natural way; by reducing errors using a data-oriented approach for network selection; and by saving time through automating the process of ad publishing.

    Although AR-based marketing is in its infancy, it has become interesting to marketers. Since everyone has a smartphone, advertisement based on AR is now much easier than before. For example, stores can install AR-driven ad applications to send customers popup advertisements to tell them about products, and consequently attract customers to purchase items. AR-based advertising helps companies in three ways: by providing targeted and innovative contextual advertising; by improving customer experience and making it unique and immersive; and by boosting customer loyalty through interactive advertising.

    A report from the IAB found that around 65% of people in the US own at least one IoT device, and are interested in receiving advertisements on IoT screens. IoT technologies can therefore provide new levels of ad targeting. IoT data can be used to dig even deeper into customers’ habits, interests, preferences, and other factors, and allows advertisers to learn more about their customers to create customer personas and targeted ad campaigns. It is also possible to integrate cloud solutions with various gateways to achieve better results in the ad campaign. IoT-based ad software can help advertising companies in three ways: better recognition and prediction of consumers’ individual preferences and needs to increase the efficiency and accuracy of target advertisements; increased user engagement and satisfaction by providing them with valuable information about products; and improved ad campaign effectiveness.

    The arrival of 5G is expected to open up substantial new opportunities for advertising. Although current 4G providers have attempted to influence the public regarding the security and privacy concerns over 5G, the possibility of achieving Internet speeds 20 times faster than 4G will tempt both advertisers and consumers. To fully exploit the potential of 5G in the advertising industry, all the entities in the industry should prepare themselves before 5G is launched. In the following, we identify some of the issues that should be considered.

    • Faster load speeds. Despite the advent of new technologies such as AI, AR, and 3D modeling, which have revolutionized the ad market, advertisers may be deterred from online advertising by issues relating to speed. With the high speeds of 5G, a new era will open up for advertisers to exploit customer profiling, ad creative, targeting, and many other aspects. 5G will increase the speed of a device from 45 Mbps up to a maximum of one gigabit per second, with response times of a few milliseconds, i.e., a substantial decrease in latency. This provides better conditions for using streaming video (or even more immersive augmented and virtual reality) to create advertisements. It also opens the way for high-resolution video advertisements that give customers a reason to stop scrolling the web page and watch.

    • Precise locations. 5G can help advertisers not only with ad creation but also with targeted advertising. Targeting audiences with low-speed Internet connections was not straightforward, but 5G paves the way for the creation of a range of channels that can enable advertisers to connect directly with consumers. 5G networks can also enable cloud-based processing to increase speed and connectivity. With higher speeds, devices can offload processing into the cloud, meaning that it is not necessary to process data in the devices’ processors, thus preserving battery life and allowing for more connected devices.

    • Unlocking identity. It is worth pointing out that advertisers with more digital touchpoints are more likely to be selected by customers, and should treat this as a chance to learn more about their audiences. Not surprisingly, the most significant impact of 5G will be on the quality and quantity of data in the system, which will allow advertisers to target and capture the correct audience more effectively. To achieve maximum benefit from the new data, it is vital to ensure that the foundations are in place: the right technology partners, the right infrastructure, and the right kind of privacy measures for the use of the data. Moreover, in the run-up to the introduction of 5G, marketers should urge stakeholders to consider the principles of privacy by design when using this valuable data.

    In a nutshell, advertising companies can use the potential of new technologies as follows: AI-based chatbots can help companies to communicate with their customers; AR-based advertisements can lead to more interactive experiences; interacting with customers through IoT devices will allow companies to match advertisements with the real interests of the audience; and the faster and more sophisticated network speeds available through 5G will allow video resolution to be increased and page loading times to be reduced, creating more interaction between customers and advertisers.
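
The click-validation idea from the first response above, in which an impression carrying a unique identifier must be reported before the corresponding click claim, can be sketched as follows. This is an illustrative sketch only and does not reproduce the standard proposed by Adjust: the class name, the `max_delay_s` window, and the rule of one click per impression are assumptions.

```python
class ClickValidator:
    """Accept a click only if an impression with the same id was seen earlier."""

    def __init__(self, max_delay_s=3600.0):
        self.max_delay_s = max_delay_s
        self._impressions = {}  # impression_id -> timestamp in seconds

    def record_impression(self, impression_id, timestamp):
        # Keep the first report of a pending impression; later duplicates are ignored.
        self._impressions.setdefault(impression_id, timestamp)

    def validate_click(self, impression_id, timestamp):
        # Each impression can back at most one click, so it is consumed here.
        shown_at = self._impressions.pop(impression_id, None)
        if shown_at is None:
            return False  # no matching impression was reported
        return 0 <= timestamp - shown_at <= self.max_delay_s

# Hypothetical usage: one genuine click, one replay, one unmatched claim.
validator = ClickValidator(max_delay_s=60)
validator.record_impression("imp-1", 100.0)
results = [
    validator.validate_click("imp-1", 130.0),
    validator.validate_click("imp-1", 131.0),
    validator.validate_click("imp-2", 132.0),
]
```

A fraudster who fabricates click claims must now also fabricate matching, earlier impressions, which raises the cost of the attack without eliminating it.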

VI Conclusion

Online advertising is vital in sustaining the economy of the Internet, since each party in the system can profit from it. However, abuse can result in severe damage. In many countries, there is a lack of legal protection against ad fraud, and given the amount of ad revenue at stake, online advertising has become a target for criminals seeking financial gain through fraudulent activities.

In this article, we have investigated and discussed the security aspects of the online advertising market. We first gave a brief introduction to the online advertising system, followed by the fundamental concepts that have emerged in relation to the online advertising system. Next, we presented a state-of-the-art study of the various forms of security attacks on the online advertising ecosystem that arise from the weaknesses of the ecosystem. We then proposed a comprehensive taxonomy of ad fraud to describe these threats in global terms and facilitate cooperation among researchers to deal with ad fraud attacks. We classified the existing solutions that have been proposed in the literature to cope with these attacks, along with the limitations and effectiveness of these solutions. Finally, we presented our view of current research challenges and future directions to improve existing security solutions in the online advertising system.

References