Fake news and misinformation not only pose serious threats to the integrity of journalism, but have also created societal turmoil (i) in the economy, (ii) in the political world [7, 67], and even (iii) in human life. However, this phenomenon is not new. Fake news has always existed (e.g., tabloids have been capitalizing on fake news for decades), but the tools people use to spread misinformation have dramatically improved with the Internet, social media and other Web resources, providing a hospitable environment for fake news to thrive. Unlike the yellow newspapers of the past, social media and search engines pose an additional threat to truth: the more successful (i.e., in luring visitors) the content of a website is, the more it is promoted by the algorithms underpinning these platforms. The BBC interviewed a panel of 50 experts about the “grand challenges we face in the 21st century” and many of them named propaganda and fake news as a key challenge.
Considering its significant impact, tech firms, researchers, governments and other stakeholders have explored various methods to identify and curtail the spread of fake news. There is an abundance of academic work analyzing [1, 68, 102, 19, 87] or detecting [85, 103, 105, 92] fake news sources on the Web. As a mitigation action, the World Federation of Advertisers’ Global Media Charter states that advertisers commit to supporting their partners in avoiding the funding of actors seeking to influence division or to inflict reputational harm on business or society and politics at large. In 2015, AppNexus purged its exchange of outright fake sites and reduced fake ad impressions from 260 billion to 20 billion per month. Google and other tech companies signed up to a voluntary EU code of conduct which required them to “improve the scrutiny of ad placements to reduce revenues of the purveyors of disinformation”.
Despite these important actions and studies, unreliable news websites significantly increased their share of engagement among the top performing news sources in the past year alone. There is no doubt that success in curbing fake news depends primarily on efforts to reduce or eliminate the incentives of fake news producers. Admittedly, we know little about the incentives and funding of fake news on the Web. Aside from the various political gains that may motivate the spread of doctored narratives [1, 94], disseminating fake news has been a lucrative Web business. The ad industry provides wide avenues for high revenues: for every $2.16 spent on news websites in the USA, $1 is spent on misinformation [54, 59]. In fact, ad-tech agencies programmatically bid [83, 77, 79] for (lower-cost) ad spaces that reside in sites of questionable content. As a consequence, ad budgets move from high-quality news websites to low-cost ones of controversial or false content, with various examples of ads from prestigious companies (e.g., Microsoft, Citigroup, and IBM) and from small business owners being placed on, and thus unwittingly funding, websites that promote fake or even illegal (e.g., Jihadi terrorist and neo-Nazi related) content.
In this study, we shed light on the revenue flows of fake news sites by investigating who supports and maintains their existence. We systematize the auditing process of digital advertising in those sites by developing a novel ad detection methodology, which enables us to identify 1) the intermediary companies that sell or facilitate the ad space of fake news websites to the ad ecosystem, 2) the advertisers who buy the ad space on such sites, 3) the type of ads they place, and 4) the companies that track users who visit such sites and consume their content and ads.
The contributions of this paper are summarized as follows:
We develop a novel ad detection methodology which enables us to identify the ad-related companies that collaborate with fake news websites. We find that about 70% of the fake news websites advertise “Business” products and services, and close to 40% of the fake news websites advertise “Entertainment” products and services.
We study who provides advertising revenue of fake news websites and show that the most well-known legitimate advertising networks (such as google.com, indexexchange.com, and appnexus.com) have a direct advertising relation with more than 40% of the fake news websites in our dataset, and have a reseller relation with more than 60% of those sites. This means that about half of fake news websites receive ads from the well-known top advertising networks. Additionally, we show that fake news websites track their users more aggressively than an average website. The main trackers used in fake news sites are analytics and ad-related properties of Google, Facebook and Amazon.
We show that owners of fake news websites own other types of websites as well, including “Entertainment”, “Business”, and “Politics”. This implies that the operation of an average fake news website is not an isolated or outlying event, focused on fast ad-profits, but instead is probably part of a wider business function.
We make publicly available our lists of fake and real news websites, the screenshots of ads collected on the top 100 websites, and the code of our crawler and ad detection method.
2 Overview of Methodology
In this section, we outline the methodological steps followed in our study to facilitate our analysis of fake news websites and the advertisers who support them. As illustrated in Figure 1, we first construct a list of fake and real news websites and crawl them to collect ad- and tracking-related data on each. Then, we perform a three-pronged study, focusing on:
Ads displayed on such websites, including ad-networks and intermediaries facilitating them, and categories of their products or services.
Publisher-specific IDs embedded in these websites, which we use in a graph-based analysis to detect and study clusters of websites operated by the same entity.
User tracking intensity demonstrated by third parties embedded in fake and real news websites.
2.1 Fake and real news website lists
We utilize four different datasets to create a corpus of fake and real news websites. We focused on publicly available data to enable reproducibility of our work:
MBFC: We extract information provided by Media Bias Fact Check, an organization that aims to detect bias of media and other information sources. MBFC follows a very strict and manual methodology that makes use of a combination of objective measures, while also pledging to always review and correct any factual errors. MBFC’s bias detection has been a very valuable service, used in numerous past studies (e.g., [52, 34, 48, 35, 104]). To bootstrap our process, we download the list on 15.11.2021 and extract fake news websites. Specifically, we focus on websites that have been labeled as “Questionable Source” or “Conspiracy-Pseudoscience”, have “Low” or “Very Low” factual reporting, and whose credibility has been described as “Low”. These websites manifest extreme bias, obvious propaganda, lack of proper sourcing to credible information, complete lack of transparency, and focus on spreading fake news. From the original list of 3,915 websites, we end up with 816 fake news websites that meet the above criteria.
CJR: We process a list of fake news websites provided by Columbia Journalism Review (CJR), a journal for professionals of various disciplines. Their list was created by merging some of the most common fake news lists (e.g., OpenSources, Politifact and Snopes) and then removing inactive websites. Additionally, CJR curated the list to remove highly partisan websites that do not serve fake news content. From the resulting list, we consider websites that have been labeled as “fake news”, “conspiracy” or “extremely biased websites”, and end up with a list of 350 websites. (At the time of writing, this list is no longer available, but it is accessible through an archive site.)
Golbeck et al.: We also utilize the dataset described in this work, where the authors focused on fake news and satirical articles related to USA politics, posted after January 2016. They follow a manual investigation process, where each article is evaluated by two researchers. Additionally, for fake news articles, they provide a link to a well-researched, factual article that rebutted the fake news story. From this list, we select 55 websites that have been found to have published more than two such articles.
Zhou et al.: The authors created a dataset of 2,029 news articles and 140,820 tweets related to COVID-19. Regarding the news articles, they extracted knowledge using NewsGuard and MBFC. Similarly, we focus on low-credibility news websites with at least three fake news articles. Thus, we end up with 31 websites.
Additionally, we form a list of credible news websites that serve factual content, cite credible sources and usually cover both sides of reported stories. We focus on websites that have been evaluated by the MBFC organization and have been found to have minimal or no bias. Specifically, we extract websites that have been labeled as “Pro-Science”, “Least-Biased”, “Left-Center” or “Right-Center” and have “High” or “Very High” factual reporting and high credibility. This approach results in a list of 1,368 credible websites, which we refer to as real news.
|Source||Type||Websites|
|1. Media Bias/Fact Check||Questionable Sources||816|
|2. CJR||Fake News & Biased||350|
|3. Golbeck et al.||Fake & Satire Articles||55|
|4. Zhou et al.||News articles||31|
|5. Media Bias/Fact Check||High Credibility||1,368|
|Total unique fake news websites||1,044|
|Total unique real news websites||1,368|
Fake & Real News Lists: By combining these sources, we construct a list of 1,044 unique fake news and 1,368 unique real news websites. Table 1 summarizes the four aforementioned sources of fake and real news websites (our lists are publicly available). One might be tempted to think that fake news sites are not popular (apart possibly from a small number of them). However, this is far from true. Figure 2 suggests that several fake news websites in our list are popular. Indeed, 45 of them are among the top 10K most popular sites in the Tranco list. Such rankings usually translate into a wide audience with millions of visitors per month.
2.2 Website crawling
We crawl the identified news websites, and extract various identifiers and other metadata relevant for our analysis. For the crawling process, we utilize and build on top of a public crawling system presented in prior work. This Puppeteer-based crawler provides scraping functionality and is able to store the HTML content of the visited website, a cookie-jar for both first-party and third-party cookies, the ads.txt file of the server (if present), a screenshot of the landing page, as well as the HTTP(S) requests issued during the visit. According to the original implementation, all requests and responses are observed passively without any modifications, thus ensuring that the behavior of the website remains the same. Additionally, we make use of a fingerprinting detection methodology presented in prior work and update the list of fingerprinting functions based on the current version of the popular fingerprinting library, FingerprintJS. Finally, we implement and integrate the ad-detection mechanism, described later in Sec. 3.2.1. Ethical considerations regarding the crawling process are discussed in Appendix A. Using this crawler, we visit the landing page of all news websites on 13.12.2021. The crawler was located in an EU-based institution and collected approximately 31GB of raw data. The timeout for loading the landing page was set to 60 seconds.
3 News Website Financing
In this section, we set out to explore the following question: “Who funds fake news websites through ads?”. We answer the question in (i) Section 3.1, where we present the ad networks that have business relations with fake news websites and (ii) Section 3.2, where we present the ad producers and service providers who are advertised on fake news websites.
3.1 Who sells ad space on fake news websites?
To understand who facilitates the monetization of news websites via ads, we study the entities responsible for selling ads.
For this, we utilize the ads.txt files served by websites.
ads.txt is part of a technology introduced by the Interactive Advertising Bureau (IAB) Tech Lab to stop domain spoofing, where attackers create fake bid requests for impressions in websites they do not own. ads.txt is a simple text file located at the root of the website, and explicitly states which auctioneers are authorized to sell the impression inventory of a specific website. In order for the entire ad ecosystem to work as expected, Supply-Side Platforms (SSPs) should ignore inventory which they are not authorized to sell, while Demand-Side Platforms (DSPs) should not buy inventory from unauthorized sellers. Each record in the file is an entry with comma-separated fields that authorizes a specific SSP to sell impressions for this website. The fields that the record contains are: (i) the domain of the SSP, (ii) an identifier which uniquely identifies the account of the publisher within the service specified in (i), (iii) the relationship for this account, and (iv) optionally, an identifier that maps to the company listed in (i) and uniquely identifies it within a certification authority. The relationship, which is defined in the third field of an entry, can be either DIRECT (i.e., the publisher is the owner of the specified account) or RESELLER (i.e., a third party has been assigned by the website owner to manage the specified account). Finally, every <seller, publisher ID, relationship> triple found in the ads.txt file defines a business relationship between the owner of the website and the seller.
We parse and analyze the content of these files, fetched by the crawler described in Section 2.2. We disregard comments, as well as lines that do not follow the standard. In total, we find 198 fake news websites and 627 real news websites with a valid ads.txt file.
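The record format described above can be illustrated with a minimal parsing sketch. This is not the crawler's actual implementation: the names `AdsTxtRecord` and `parse_ads_txt` are hypothetical, and the validity check (three or four comma-separated fields, with a DIRECT or RESELLER relationship) is a simplifying assumption about what "following the standard" means.

```python
from typing import NamedTuple, Optional


class AdsTxtRecord(NamedTuple):
    seller_domain: str                  # field (i): domain of the SSP
    publisher_id: str                   # field (ii): publisher account ID
    relationship: str                   # field (iii): DIRECT or RESELLER
    cert_authority_id: Optional[str]    # field (iv): optional


def parse_ads_txt(text: str) -> list:
    """Return the well-formed records of an ads.txt file (hypothetical helper)."""
    records = []
    for raw_line in text.splitlines():
        # Disregard comments: everything after '#' is ignored.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        # Assumption: a valid record has 3 mandatory fields plus 1 optional.
        if len(fields) not in (3, 4):
            continue
        domain, pub_id = fields[0].lower(), fields[1]
        relationship = fields[2].upper()
        if not domain or not pub_id or relationship not in ("DIRECT", "RESELLER"):
            continue
        cert = fields[3] if len(fields) == 4 and fields[3] else None
        records.append(AdsTxtRecord(domain, pub_id, relationship, cert))
    return records
```

Lines that do not parse into such a record (for example, variable declarations or malformed entries) are skipped, mirroring the filtering step described in the text.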
In Figure 4, we illustrate the top 10 most popular digital sellers of ads for the DIRECT relationships that appear in both real and fake news websites (i.e., intersection). We find that, on average, a fake news website forms direct business relationships with 27 ad systems, while surprisingly, real news websites do so with 41 systems. For each ad network, we plot the portion of websites that provide an ads.txt file and have a business relationship with this network. According to the specification, relationships with the DIRECT type indicate that the publisher (i.e., the owner of the content) directly controls the specific account related to the respective service. Consequently, these relationships are of special interest, since they disclose a direct business contract between the publisher and the ad network.
Figure 4 suggests that a large portion of real news websites tend to form DIRECT business relationships with the well-known ad networks. For example, 96.0% of real news websites have a direct business relationship with google.com, 82.1% of real news websites have a direct business relationship with indexexchange.com, and 82.0% of real news websites have a direct business relationship with appnexus.com.
What is more interesting, however, is that a lot of fake news websites also have direct business relationships with these ad networks. Indeed, 80.8% of fake news websites have a direct business relationship with google.com, 49.0% of fake news websites have a direct business relationship with indexexchange.com, and 52.5% of fake news websites have a direct business relationship with appnexus.com. Although the percentages vary from one ad network to the next, Fig. 4 shows that they are always higher than 45.5%, which suggests that, on average, these popular ad networks have a DIRECT business relation with about half of the fake news websites in our corpus.
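The per-network percentages discussed above are shares over the websites that serve a valid ads.txt file. A minimal sketch of that computation follows, assuming the parsed records are available as a mapping from website to `(seller_domain, relationship)` pairs; the function names and the toy data layout are hypothetical.

```python
from collections import Counter


def direct_seller_shares(site_records):
    """Share of websites that have a DIRECT relationship with each seller.

    site_records: dict mapping a website to a list of
    (seller_domain, relationship) pairs taken from its ads.txt file.
    """
    if not site_records:
        return {}
    counts = Counter()
    for records in site_records.values():
        # Count each seller at most once per website.
        counts.update({seller for seller, rel in records if rel == "DIRECT"})
    total = len(site_records)
    return {seller: n / total for seller, n in counts.items()}


def mean_direct_sellers(site_records):
    """Average number of distinct DIRECT sellers per website."""
    if not site_records:
        return 0.0
    per_site = [
        len({seller for seller, rel in records if rel == "DIRECT"})
        for records in site_records.values()
    ]
    return sum(per_site) / len(per_site)
```

Counting distinct sellers per website avoids inflating a network's share when a publisher lists several accounts with the same seller.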
It is interesting to note that before starting such a DIRECT business relation between an ad network and a website, there is a vetting process to be followed. For example, before a publisher is able to start showing ads using Google’s AdSense, a review process is necessary, where Google ensures that the website complies with the policy of the AdSense service. One might expect that during such a review process, the popular ad networks would not approve requests of fake news websites, or of websites proven to publish misinformation. Fig. 4 suggests that this is not the case, as popular ad networks sell ad space on fake news websites, and in this way may support fake news content through advertising.
Moreover, we examine independently the top ad systems for fake and real news websites and find that revcontent.com is the only ad system that is popular (i.e., ranked 5th) among the ad networks integrated with fake news websites, but ranked very low (i.e., 51st) among the ad networks of real news websites, which suggests that this network might prefer doing business with fake news websites. Conversely, we find that yahoo.com is mainly preferred by real news websites: 68% of them form a business relationship with yahoo.com, but less than half as many (30%) of fake news websites do so.
Although ads.txt and Fig. 4 provide a clear view of the top ad ecosystem in fake news websites, this view is based on data provided by the fake news websites themselves (i.e., the ads.txt file).
To provide the point of view of the sellers themselves, we make use of sellers.json files.
sellers.json is a complementary mechanism to ads.txt, introduced by the IAB Tech Lab to combat ad fraud and profiting from counterfeit inventory. Specifically, sellers.json files can be used by buyers to discover the final sellers of a bid request (either direct sellers or intermediaries). In an attempt to achieve a more transparent marketplace, each seller (i.e., SSP) publishes in its own sellers.json file all entities with which it has business relationships. According to the specification, the list of entities which are represented by the ad network must be included in this file, even if the identity of the seller is confidential. Additionally, for each seller, a seller ID is required. This ID is the same as the one that appears in the website’s ads.txt file. In cases where the information is not confidential, the name of the legal entity (i.e., company name) which generates revenue under the given ID is also specified. Finally, for each seller ID, the type of the account must be specified. If this account is defined as PUBLISHER, then the ad inventory is sold on a website directly owned by the company. In such cases, the ad network pays the company directly. On the other hand, INTERMEDIARY accounts indicate that ad inventory is sold by an entity which does not directly own it.
On 12.01.2022, we downloaded and parsed these files for the fake and real news sites in our lists, to verify the business relationships we found in ads.txt. We exclude sovrn.com from this experiment as we were unable to retrieve its sellers.json file. For each identifier with a DIRECT relationship found in ads.txt files of news websites, we verify whether the respective business relationship is also registered by the advertising system in its sellers.json file. Please note that we do not investigate whether there is a business relationship mismatch. Instead, we focus on whether there is a business relationship of any kind between a news website and the respective ad network. Indeed, discrepancies and mislabeled relationships are common between ads.txt and sellers.json files, but they are considered out of scope for this work and left for future research. We present our findings for the top 10 most popular sellers in Figure 4.
We find that for all ad networks, the results obtained from the ads.txt files and from the sellers.json files are very similar (or even the same in some cases). We attribute the small disparities that may exist between the two views to (i) the one-month difference between the crawling of news websites (i.e., their ads.txt files) and the fetching of the sellers.json files and (ii) the fact that ads.txt files might not be all-inclusive, up-to-date or syntactically correct. Despite the differences that may exist, the key observation is that both points of view agree: a substantial percentage of fake news websites receive ads through well-known services including google.com, indexexchange.com, and appnexus.com. For example, 74.3 - 80.8% of the fake news websites in our dataset have a DIRECT relation with google.com (i.e., receive ads through Google), 47.0 - 49.0% receive ads through indexexchange.com, and 46.0 - 52.5% receive ads from appnexus.com.
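The cross-check between the two files can be sketched as follows. This is an illustrative sketch rather than the study's code: the function names and the tuple layout of `ads_txt_entries` are hypothetical, and the sellers.json layout (a top-level `sellers` array whose objects carry a `seller_id`) is assumed from the IAB specification. In line with the text, only the existence of the seller ID is checked, not the account type.

```python
import json


def index_sellers(sellers_json_text):
    """Map seller_id -> seller entry for one ad network's sellers.json."""
    data = json.loads(sellers_json_text)
    return {
        str(entry["seller_id"]): entry
        for entry in data.get("sellers", [])
        if "seller_id" in entry
    }


def verify_direct(ads_txt_entries, sellers_by_network):
    """Return the (website, network) pairs whose DIRECT ads.txt entry is
    also registered in the network's sellers.json file.

    ads_txt_entries: iterable of
        (website, seller_domain, publisher_id, relationship) tuples.
    sellers_by_network: dict seller_domain -> output of index_sellers().
    """
    confirmed = set()
    for website, network, publisher_id, relationship in ads_txt_entries:
        if relationship != "DIRECT":
            continue
        # Any registered account counts, regardless of PUBLISHER or
        # INTERMEDIARY type: mismatches are out of scope here.
        if publisher_id in sellers_by_network.get(network, {}):
            confirmed.add((website, network))
    return confirmed
```

Networks whose sellers.json file could not be retrieved are simply absent from `sellers_by_network`, so none of their relationships are confirmed.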
|Real News||Fake News|
Next, we examine how many fake news websites have a RESELLER relationship with the ad networks studied so far. A RESELLER business relationship expresses cases where a third party has been authorized to control the ad space. Table 2 presents the results. We see that 67.71 - 73.73% of fake news websites in our list have a RESELLER relationship with these popular ad networks: appnexus.com, openx.com, rubiconproject.com, indexexchange.com, and pubmatic.com. We note that the percentages reported in Tab. 2 are even higher than those reported in Fig. 4. For example, although as many as 52.5% of fake news websites engage in a DIRECT relation with appnexus.com, an even higher percentage of them (73.73%) engage in a RESELLER relation with appnexus.com. The same trend is true for the rest of the ad networks, which means that roughly six out of ten fake news websites have RESELLER relationships with the major ad networks.
3.2 Who buys ad space on fake news websites?
3.2.1 Ad Detection
Detecting ads embedded in websites is trickier than it sounds. The main difficulty here is that the final advertiser may be selected after an auction and is accessed after several re-directions. To detect ads embedded in websites and identify the actual advertisers who post them, we propose and implement a novel methodology as outlined in Figure 5. It consists of two main components that utilize external blocking lists and network traffic monitoring, respectively.
First (step (1) in Fig. 5), we extract all URLs from the landing page of the website by leveraging the Chrome DevTools Protocol. Specifically, we make use of Puppeteer’s ability to create a session for the Chrome DevTools Protocol and recursively visit all nodes of the Document Object Model (DOM) tree. Then, we extract the URL attribute of all hyperlink nodes (i.e., HTML <a> tag). This way, we are able to extract all hyperlinks, even those found in iFrames or the Shadow DOM. From the extracted URLs, we consider only URLs to other domains. Next (step (2)), we search for URLs which belong to ad networks and represent ads. When users click on such URLs, either directly, or because they clicked on an image, they are redirected to the landing page of the actual advertiser. To detect such ad URLs, we make use of Brave’s Rust-based adblock engine and the popular open-source filter lists EasyList and uBlock Origin. Using these lists and the adblock engine, we evaluate all URLs found in the website, and detect (step (3)) those which are ads.
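Steps (1) to (3) can be illustrated with a simplified, standard-library sketch. It is a stand-in under stated assumptions, not the described pipeline: it parses static HTML instead of walking the live DOM through the Chrome DevTools Protocol (so it does not see iFrames or the Shadow DOM), it compares hostnames rather than registrable domains, and the hypothetical `ad_domains` set replaces the Brave adblock engine with the EasyList and uBlock Origin filter lists.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect absolute URLs of all <a href=...> nodes (step 1)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def external_links(html, page_url):
    """Keep only hyperlinks that point to a different host than the page."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    page_host = urlsplit(page_url).hostname
    return [
        url for url in collector.links
        if urlsplit(url).hostname not in (None, page_host)
    ]


def ad_links(links, ad_domains):
    """Toy replacement for steps 2-3: flag links hosted on known ad domains."""
    def is_ad(url):
        host = urlsplit(url).hostname or ""
        return any(host == d or host.endswith("." + d) for d in ad_domains)
    return [url for url in links if is_ad(url)]
```

A real filter-list engine matches on full URL patterns and request types, not only on the host, so the domain check above should be read as a placeholder for that component.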
Additionally, our methodology is able to detect ad URLs that belong to the actual advertiser, and not to an ad network that redirects to the advertiser. Specifically, we perform an application-level network traffic analysis and trace HTTP(S) requests towards ad or tracking domains, by evaluating their URLs against the EasyList and uBlock Origin filter lists. For a more robust and thorough approach, we follow redirect chains and evaluate all request URLs in these chains. When we discover such a request, we extract the body of the response (step (4)) and the request URL (step (5)). If we find a URL in this response, and we know that this URL has already been found in the website (step (6)), we determine whether the original request was towards an advertising domain (step (7)). If so, we deduce that this URL has been placed into the website through an ad network, and consequently, it is an ad URL (step (8)). By combining the two approaches (steps (1), (4), and (5)), our methodology is able to detect ad URLs that are either direct URLs to the actual advertiser, or URLs of ad networks that eventually redirect to the advertiser. In order to establish and attribute the actual advertiser, we navigate to these URLs and extract the landing page.
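The correlation between network traffic and page hyperlinks (steps (4) to (8)) can be sketched as below. The function name, the `(request_url, response_body)` pair layout and the `is_ad_request` callback are hypothetical; the callback stands in for the filter-list evaluation, and redirect chains are assumed to have been flattened into individual exchanges beforehand. The sketch checks the advertising-domain condition first purely to skip irrelevant responses early.

```python
def ad_urls_from_traffic(page_links, exchanges, is_ad_request):
    """Find page hyperlinks that were delivered by an ad-related request.

    page_links: set of hyperlink URLs extracted from the page (step 1).
    exchanges: iterable of (request_url, response_body) pairs (steps 4-5).
    is_ad_request: callable(url) -> bool, e.g. a filter-list match.
    """
    detected = set()
    for request_url, body in exchanges:
        # Step 7: only responses to ad or tracking requests are relevant.
        if not is_ad_request(request_url):
            continue
        # Step 6: look for URLs in the response that also appear on the page.
        for link in page_links:
            if link in body:
                # Step 8: the link was injected through an ad network.
                detected.add(link)
    return detected
```

Plain substring matching is used here for brevity; escaped or encoded URLs inside a response body would need to be normalized before comparison.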
Manual Verification: To validate our methodology we use a list of popular websites. In particular, using SimilarWeb, we extract the 50 most popular websites from each of the “News and Media” and “Sports” categories, for a total of 100 websites. We select these categories based on empirical analysis, since they are more likely to contain ads. We crawl these websites and apply our methodology for ad detection and advertiser attribution, while at the same time storing a screenshot of the website, as well as its HTML code. Next, we manually evaluate how accurately our method can detect ads on these 100 websites. We find that our approach has both high Precision (i.e., the share of “ads” marked in the websites that are actual ads) and high Recall (i.e., the share of actual ads in the websites that were correctly detected). These results indicate that our method detects most ads in websites very accurately, with very few false positives. For extensibility, we release a collection of annotated screenshots with ads detected by our methodology.
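For reference, the two metrics used in the manual verification follow the standard definitions, sketched below. The function name is hypothetical and the counts in the usage note are made-up illustration values, not results from the study.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Standard precision and recall from manually annotated ad counts.

    true_positives: detected items that are actual ads.
    false_positives: detected items that are not ads.
    false_negatives: actual ads the method missed.
    """
    detected = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall
```

For example, with 3 correctly detected ads, 1 wrongly flagged element and 1 missed ad, `precision_recall(3, 1, 1)` yields a precision and a recall of 0.75 each.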
3.2.2 Types of advertising on news websites
Using this novel methodology, we detect and extract the actual entities that advertise their products or services in news websites. We crawled these websites with a clean browser state (i.e., no synthetic personas). For each ad that our methodology detects, we visit the respective website being advertised and extract the domain of the specific advertiser. Our methodology was able to detect almost 200 advertisers in 138 fake news websites and 900 advertisers in real news websites. We discover that entertainment websites with captivating and luring ads are the most popular advertisers in fake news websites. In particular, we find that newscityhub.com and inspiredot.net are the most popular advertisers, appearing in 15% and 14% of fake news websites, respectively. These advertisers are known for using click-bait ads with “catchy” titles that attract users and pique their curiosity. Consequently, these advertisers fuel fake news content and, through their ad impressions, financially support this ecosystem.
Next, we extract the categories of advertisers by utilizing the URL classification engine provided by Cyren. Cyren is an Internet Security company that also provides Threat Intelligence services and its classification engine has already been used in previous academic work (e.g., [82, 39, 14]), showing it can classify a greater set of websites than other similar systems. Using their classification service, we are able to extract the categories of over 95% of advertisers in our dataset. For websites assigned to multiple categories, we single out the most frequent label in our dataset.
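One possible reading of the tie-resolution rule for multi-category advertisers is sketched below: label frequencies are computed over the whole dataset, and each advertiser keeps its globally most frequent label. The function name, the input layout and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter


def single_category(site_labels):
    """Reduce each advertiser to one category.

    site_labels: dict mapping an advertiser domain to the list of
    categories returned by the classification service.
    """
    # Frequency of every label across the whole dataset.
    frequency = Counter(
        label for labels in site_labels.values() for label in labels
    )
    result = {}
    for site, labels in site_labels.items():
        if not labels:
            continue  # unclassified advertisers are left out
        # Assumption: ties are broken alphabetically for determinism.
        result[site] = max(labels, key=lambda l: (frequency[l], l))
    return result
```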
Figure 6 illustrates the types of advertisers both in real news and in fake news websites. The first thing we notice is that the majority of advertisers in both fake and real news websites come from the “Business” category. This behavior is expected, since these advertisers promote websites that contain business-related information in an attempt to popularize their services or products. In addition to this, we observe that a large number of fake news websites (almost 40%) display ads from the “Entertainment” websites. These ads contain captivating, and, sometimes even click-bait, content from celebrity websites, television and movie programs, as well as entertainment news that tempt users. The rest of the advertisers fall into “Technology”, “Shopping”, “Education”, “News”, etc. On the other hand, real news websites place ads coming primarily from advertisers of other businesses, news, and education-related services.
What is interesting is that all these types of advertisers on fake news sites seem to be normal and legitimate businesses. Even the “Spam” category seems to be less prominent in fake news sites than it is in real news sites. These results suggest that fake news websites are serious about ads, host ads from legitimate advertisers, and try to avoid ads from malicious sites such as spam, which could jeopardize their existence or legitimacy in the ad ecosystem.
4 News Websites Ownership
In this section, our goal is to answer the following two questions: Who owns fake news websites? And what other websites do the owners of fake news websites own? Answering such questions will allow us to understand whether operating a fake news website is an isolated activity or part of a broader business. Towards this goal, we first need to expand our news websites data set. Indeed, so far we focused on websites which were categorized as fake news (1,044), or which were clearly categorized as real news (1,368). Therefore, in the following analysis, we also include a corpus of 1,548 extra news websites from the sources of Sec. 2.1, which were not clearly categorized as either fake or real. This is primarily to answer the second aforementioned question. This brings the total number of news websites to 3,960.
4.1 ID & graph methodology
To be able to answer what kind of other websites the owners of fake news websites own, we first need to determine who the owner of a fake news website is. Although this question is rather tricky to answer, we capitalize on a methodology described in prior work. The methodology makes use of four different types of Publisher-specific IDs used in three separate Google Services. Websites can then be linked together if they have such identifiers in common. We outline details of this process in the next paragraphs.
4.1.1 Publisher-specific ID detection
|Description||Volume||% of total|
|Initial set of websites||3,960||100.00%|
|Websites successfully crawled||3,311||83.61%|
|Websites that errored||649||16.39%|
|Websites with no ad-related identifiers||737||22.26%|
|Fake News websites||325||9.82%|
|Real News websites||172||5.19%|
|Websites with at least one identifier||2,574||77.74%|
|Fake News websites||385||11.63%|
|Real News websites||1,025||30.96%|
|Websites with all types of identifiers||184||5.56%|
|Fake News websites||2||0.06%|
|Real News websites||62||1.87%|
Such identifiers are alphanumeric values that follow strict formats and uniquely identify user accounts in popular services, such as AdSense and Google Analytics. Since some of these identifiers are associated with the receipt of the funds generated via ads, it is generally safe to assume that websites that share the same identifier (i.e., give their ad revenue to the same entity) are closely related, or even owned by the same entity. Using regular expressions and common data cleaning techniques, identifiers are extracted from the HTML code of websites, network traffic and first- or third-party cookies. Then, values that are words of the English dictionary, or match a custom list of common keywords, are removed.
Table 3 summarizes the websites containing Publisher-specific IDs. We find that there are 385 fake news websites and 1,025 real news websites with at least one type of identifier. Additionally, a summary of the detected identifiers can be found in Table 4. There is also clear indication that there are identifiers which are being re-used in more than one domain. This is evident from the fact that for Publisher IDs, Measurement IDs and Container IDs, there are more domains than identifiers.
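The extraction step can be illustrated with a short regular-expression sketch. The four patterns below are assumptions based on the commonly seen formats of AdSense Publisher IDs, Universal Analytics Tracking IDs, Google Analytics Measurement IDs and Tag Manager Container IDs; they are not the exact expressions of the referenced methodology, and the tiny `STOPWORDS` set is a hypothetical stand-in for the dictionary and keyword filtering described above.

```python
import re

# Assumed identifier formats (illustrative, not taken from the paper).
ID_PATTERNS = {
    "publisher_id": re.compile(r"\bpub-\d{16}\b"),
    "tracking_id": re.compile(r"\bUA-\d{4,10}-\d{1,4}\b"),
    "measurement_id": re.compile(r"\bG-[A-Z0-9]{10}\b"),
    "container_id": re.compile(r"\bGTM-[A-Z0-9]{4,8}\b"),
}

# Hypothetical keyword list used to discard false positives.
STOPWORDS = {"script", "config"}


def extract_ids(text):
    """Return the set of identifiers of each type found in the given text
    (HTML code, request URLs or cookie values)."""
    found = {}
    for kind, pattern in ID_PATTERNS.items():
        values = set()
        for value in pattern.findall(text):
            suffix = value.split("-", 1)[1].lower()
            if suffix not in STOPWORDS:
                values.add(value)
        found[kind] = values
    return found
```

Running the same function over HTML, network traffic and cookie values and taking the union per website gives the per-site identifier sets used for linking websites.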
|Description||Unique Identifiers||Unique Domains of landing URLs||% successful websites|
4.1.2 Graph Analysis & Cluster Construction
Using the aforementioned identifiers detected, we construct three directed bipartite graphs (identifierswebsites), one per type of identifier. Then, we combine these bipartite graphs, by constructing a Metagraph. Websites that share an identifier are connected through an edge in the Metagraph. The weight of the edge is proportional to the number of Publisher-specific IDs two websites share. A large edge weight represents greater confidence that these two websites are indeed operated and managed by the same entity. We focus only on identifiers which can be found in more than 1 but at most 50 websites. Consequently, we focus only on the Small and Medium classes of website administrators found in , and exclude intermediary publishing partners from our analysis. In addition to this, we integrate information from the 1MT crawl dataset, presented in , which contains Publisher-specific IDs found in the top 1M most popular websites of the Tranco list, created on April, 2021. The resulting Metagraph contains over 114.5K nodes (websites) and 443K edges.
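The Metagraph construction can be sketched with plain dictionaries. The function name and input layout are hypothetical, the three bipartite graphs are assumed to be merged into a single identifier-to-websites mapping, and the edge weight is taken to be exactly the number of shared identifiers (the text only states that it is proportional to it).

```python
from collections import defaultdict
from itertools import combinations


def build_metagraph(id_to_sites, min_sites=2, max_sites=50):
    """Build weighted website-website edges from shared identifiers.

    id_to_sites: dict mapping an identifier to the websites it appears in.
    Returns a dict {(site_a, site_b): weight} with site_a < site_b.
    """
    weights = defaultdict(int)
    for sites in id_to_sites.values():
        unique_sites = sorted(set(sites))
        # Keep identifiers found in more than 1 but at most 50 websites.
        if not (min_sites <= len(unique_sites) <= max_sites):
            continue
        for site_a, site_b in combinations(unique_sites, 2):
            weights[(site_a, site_b)] += 1
    return dict(weights)
```

The upper bound on identifier reuse keeps the number of generated edges manageable, since an identifier shared by n websites contributes n(n-1)/2 edges.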
To detect clusters of websites operated or owned by the same entity, we apply a graph-based community detection algorithm to the Metagraph. Very large communities may arise due to the presence of intermediary publishing partners, which are third-party services that help publishers manage their websites and increase their popularity and, consequently, their revenue. Contrary to prior work that uses the Girvan-Newman method to detect communities, in this paper we apply the Louvain method. Our decision is based on the performance benefits of the Louvain method: it is faster, more scalable, and able to accommodate the entire Metagraph without any edge pruning. Additionally, we perform hierarchical clustering through successive runs of the algorithm. Specifically, we extract a dendrogram where each level is a partition of the nodes of the Metagraph. The first level contains the smallest communities, while moving to higher levels results in bigger communities.
Table 5 summarizes the detected communities of websites operated by the same entity. We define a Fake News Cluster as a community of websites that contains at least one fake news website. This implies that a community is operated or owned by an entity which, among other business, also spreads fake news. Similarly, we define a Real News Cluster as a community with at least one real news website.
In Table 5, we observe that at higher levels of the dendrogram (i.e., levels 1 and 2), fake news clusters contain thousands of websites, and each such cluster is very large (i.e., 50.43 websites on average at level 2). These communities are formed due to the presence of intermediary publishing partners that control hundreds or even thousands of websites, and they do not indicate a clear co-administration relationship. We find 73 fake news clusters that remain identical across the different levels of the dendrogram. Thus, we focus only on the first level of the dendrogram, which contains smaller and more accurate communities.
4.2 Birds of a feather flock together - or not?
For each cluster identified in the previous section (Fake News and Real News), we compute the portion of its websites that are labeled (fake or real, respectively) in our lists. For example, a portion of 0.5 for a Fake News Cluster indicates that half of the websites in the cluster were labeled as fake news. In Figure 9, we illustrate the size of the detected fake news and real news clusters, as well as their labeled portions.
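The labeled portion described above can be computed as in the following sketch. It assumes communities and labeled lists are available as Python sets of domain names; the function names are hypothetical.

```python
def labeled_portion(cluster, labeled_sites):
    """Fraction of a cluster's websites that appear in a labeled list."""
    return len(cluster & labeled_sites) / len(cluster)


def news_clusters(communities, labeled_sites):
    """Keep communities with at least one labeled website, with their portion."""
    return [
        (community, labeled_portion(community, labeled_sites))
        for community in communities
        if community & labeled_sites
    ]
```

Calling `news_clusters` once with the fake news list and once with the real news list yields the Fake News Clusters and Real News Clusters, respectively, so a community containing both kinds of websites appears in both results.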
We see that both Fake and Real News Clusters behave similarly, apart from the (0.8, 1] range. In that range, the large communities of real news (the big green rectangle on the right) imply that there are Real News Clusters of decent size in which more than 80% of the websites are categorized as real news. That is, such clusters contain several websites that disseminate real news. On the contrary, for the Fake News Clusters in the same range (i.e., (0.8, 1]), the red rectangle is very thin, with a value close to two. This implies that fake news websites do not tend to cluster together - at least not as much as real news websites do.
4.3 Categories of website clusters
By definition, each Fake News Cluster contains at least one fake news website, but it also contains other websites. Figure 9 shows the types of these other sites contained in each cluster. We see that for the Fake News Clusters (red bars) about 29.5% of the websites are news. The rest (almost 70%) are “not news” websites and encompass “Entertainment”, “Business”, “Politics”, “Technology”, etc. This suggests that fake news owners also concern themselves with controversial topics that have been found to be a focus of fake news stories. A similar trend can also be seen in the Real News Clusters, although the actual percentages differ. Both types of clusters exhibit some diversity, but it is not clear which type is more diverse. To find out, we use Shannon’s diversity index, a statistical measure that captures how many different categories there are in a community while also reflecting the relative abundance of website categories. Shannon’s diversity index is defined as $H' = -\sum_{i=1}^{R} p_i \ln p_i$, where $R$ is the number of different categories in the dataset (i.e., richness) and $p_i$ is the proportion of websites belonging to category $i$. When all categories in a community are equally common, the Shannon index takes its maximum value, $\ln R$. The more unequal the categories are, the smaller the index is. Shannon’s diversity index equals zero when there is only one category of websites in a community.
We apply this statistical measure to communities that contain fake news or real news websites. Figure 9 illustrates the distribution of the diversity index for fake news and real news clusters. The index is normalized by its maximum value $\ln R$ (where $R$ is the number of website categories), which is attained when all categories are equally common. Consequently, a value of 0% in Fig. 9 indicates that there is only one category of websites in the community, while a value of 100% indicates that the categories are equally distributed, thus revealing a diverse community. We see that Real News Clusters tend to cluster higher and to the left (for the same value of y) of the Fake News Clusters. This indicates that Real News Clusters are more homogeneous: owners of these clusters tend to focus on a smaller number of different Web businesses. In contrast, owners of Fake News Clusters seem to engage in more diverse businesses.
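As an illustration, the normalized index can be computed per community as sketched below. The sketch assumes each community is given as a list of category labels, one per website, and takes the richness to be the number of categories present in that community; the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_index(categories):
    """Shannon diversity index: minus the sum of p_i * ln(p_i)."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def normalized_shannon(categories):
    """Shannon index divided by its maximum, ln(R), giving a value in [0, 1]."""
    richness = len(set(categories))
    if richness <= 1:
        return 0.0  # a single category means no diversity
    return shannon_index(categories) / math.log(richness)
```

A community with three news websites and one business website scores about 0.81, whereas a community split evenly between two categories scores 1 (i.e., 100%).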
4.4 Who owns fake news websites?
In this section, we provide some examples of fake news sites, their owners, and their ecosystem. To do so, we manually investigate communities that contain at least one fake news website, and discover the legal entity that operates the websites of each community.
For example, we find a cluster of 6 websites published by Sophia Media. Four of these websites are part of the Health Impact News Network. These websites have been marked as “Pseudoscience website” by MBFC, since they promote anti-vaccination propaganda and have multiple failed fact checks [30, 24, 25]. In fact, in 2020 NewsGuard, a journalism company that tracks online misinformation, identified Health Impact News as one of the greatest spreaders of COVID-19 misinformation on Facebook.
In another cluster, we detect a community of websites related to the Family Research Council (FRC). FRC is an activist group with an affiliated lobbying organization. One of these websites has been labeled as a “Questionable Source” by MBFC, since it promotes far-right propaganda, lacks transparency regarding funding and has numerous failed fact checks. Moreover, the Southern Poverty Law Center (SPLC) has designated FRC as a hate group. We also find the websites adfmedia.org and adflegal.org being controlled by the same entity, with the latter having been labeled as an extremely biased website due to propaganda. The Alliance Defending Freedom (ADF) is a multi-million organization with an extensive legal force and has been classified as a hate group by the SPLC [16, 18].
Furthermore, we find that not only coordinated organizations but also individuals are behind communities of fake news websites. Specifically, we find a pair of websites, freedomforceinternational.org and needtoknow.news, founded and powered by G. Edward Griffin. On his websites, he promotes not only right-wing beliefs, but also conspiracy theories and pseudoscientific treatments. Some of his beliefs about cancer treatment have been debunked by the American Journal of Public Health, since he promoted the use of a banned chemical compound without any scientific evidence. Similarly, we find another pair of websites, thetruthaboutcancer.com and thetruthaboutvaccines.com, owned by Ty and Charlene Bollinger. Their websites promote unproven and dangerous remedies (i.e., pseudoscience), as well as information regarding COVID-19 and vaccines that has been proven to be false. In fact, Ty and Charlene Bollinger have been identified as part of the “Disinformation Dozen”, a set of 12 individuals who produce 65% of the misinformation and misleading claims regarding COVID-19 on social media [9, 46].
These examples, along with others omitted for brevity, demonstrate the correctness and efficiency of our methodology. That is, we are able to accurately detect clusters of fake news websites owned or operated by the same entity, which pushes a specific political, ideological, or conspiracy-theory agenda and tries to shift public opinion accordingly. Apparently, administrators of such websites rely on the fact that there is strength in numbers: the more times a visitor comes across a fake story, the more likely they are to be convinced it is true. Sadly, such malpractice can be extremely harmful, as people may be led to accept false beliefs or even make life-altering decisions based on this false information.
Ambiguous Website Ownership: Finally, we observe some contradictory communities that contain both a real (i.e., credible) and a fake news website. First, we observe a community of three websites consisting of checkyourfact.com, smokeroom.com and dailycaller.com. According to MBFC, CheckYourFact is a credible fact checker with high factual reporting. Specifically, even though they slightly favor right-wing opinions, they utilize proper sources and experts and adhere to credible fact-checking principles. Additionally, since 2019, CheckYourFact has been an accepted signatory of the International Fact Checking Network. However, according to their About page, CheckYourFact is a news product of TheDailyCaller, a conservative news website that deliberately publishes misleading information and false stories.
In addition, we find another contradictory community of 51 different websites. We manually visited and explored all of these websites and deduced that they belong to Salem Media Group and its subsidiaries (e.g., Salem Web Network and Salem Interactive Media). One of the websites in this community, srnnews.com, is part of our real news dataset, since it has been rated HIGH for its factual reporting: it adheres to proper sourcing and has a clean fact check record. In the same community, we also find pjmedia.com. This website is labeled as a questionable source, since it displays extreme right-wing bias, regularly promotes propaganda and conspiracy theories, and has published multiple false stories that failed fact checks.
It is evident that news websites tend to form business relationships with ad networks in order to monetize their published content and generate revenue. Based on our analysis, we find that not all such networks evaluate their clients and refuse deals with fake news websites. Such ad companies prefer to increase their profits at the expense of a more transparent, reliable and safe Web. Therefore, even if these business relationships have been formed due to a lack of thorough examination of news websites, it is evident that some ad networks facilitate fake news content on the Web.
5 User Tracking on Fake and Real News
In this section, we explore what kind of user tracking fake news websites perform, and how it compares with the tracking done by real news websites and by “general” websites. Specifically, we are interested in (i) the number of unique identifiers found in each news website, and how these identifiers differ across types of news website (fake vs. real), (ii) the number of third parties engaged with fake and real news websites, and (iii) the number of websites that engage in advanced tracking approaches such as browser fingerprinting.
Our first question concerns the number of unique identifiers each website employs. We plot this information for fake and real news websites in Figures 12 and 12 (resp.). From Fig. 12, we find that 86% of fake news websites use a single Publisher ID or Container ID, 76.8% of them use only one Tracking ID, and almost all of them (i.e., 98%) contain only one Measurement ID. On the contrary, Fig. 12 shows that real news websites follow a different approach and seem to use more unique identifiers: 33% and 20% of real news websites contain two or more Container IDs and Tracking IDs, respectively. This may be because real news websites have different authors, each of whom may have their own ID. Fake news websites, on the other hand, may have fewer authors, possibly due to the lower variety or diversity of their material, and thus tend to have fewer IDs (or even just one). From a tracking point of view, more IDs usually mean more aggressive tracking. Thus, real news websites seem to track their users more intensively. To further support this observation, we study the number of third parties that websites interact with. Figure 12 illustrates our findings. We focus on the fake and real news websites of our lists, as well as the 1MT crawl dataset introduced in prior work. We argue that the last dataset (i.e., the “General Web” column) represents the behavior of the general Web, since it has been formed by crawling the top 1M most popular websites.
First, we observe that both fake and real news websites interact with more third parties than is typical in the general Web. This result is in line with previous work that found that right-leaning websites embed more advertising and tracking third parties [1, 2]. This is expected, since news websites are more likely to load third-party content, libraries and images, or even use third-party services for advertising and user tracking. In contrast, the general Web includes many types of websites that are self-sufficient and do not need to fetch external content. Second, we observe that the median fake news website interacts with 12 third parties, while the median real news website interacts with 23 unique third parties (i.e., almost double). This further supports our hypothesis that real news websites track users more aggressively than fake news websites: each request to a third party is one more opportunity to track the users of the website.
In Table 6 we show the most popular trackers (actually companies that own these trackers) which can be found in the top 10 third-parties. We find that big technology conglomerates (such as Google, Facebook, Amazon, etc.) are among the most popular companies that interact with news websites. It is interesting to see all these companies being engaged in aggressive tracking in Fake News websites - much more aggressive than the tracking they do in the “General Web”.
Additionally, we plot in Figure 13 the most popular trackers that websites interact with. We classify domains as trackers based on the Tracker Protection Lists provided by Disconnect, a popular technology company that focuses on transparency and control over personal information on the Web. Again, we observe that tracking and advertising services provided by Google are the most popular among all types of websites. More importantly, we find that, for all trackers, fake news websites behave more similarly to the general Web. On the other hand, a far greater portion of real news websites track their visitors through popular third-party trackers. As an illustration, 70.26% of real news websites interact with Google Analytics, compared to only 47.32% of fake news websites. This behavior further supports our hypothesis and shows that real news websites use third-party trackers in order to improve their website performance and, ultimately, generate greater ad revenue.
Next, we study how aligned the websites of the same cluster are with respect to tracking activities. To that end, we compute the Jaccard similarity of the third parties used by websites of the same cluster. We plot in Figure 14 the distribution of this similarity for real and fake news clusters. We observe that websites in fake news clusters are more aligned regarding the third parties they utilize, indicating that they use the same tracking resources and services. Take, for example, the case where y=0.5 (the median): the Jaccard similarity of the median real news website (green point at y=0.5) is considerably lower than that of the median fake news website (red point at y=0.5). This implies that the median real news website has a much more diverse set of trackers compared to the median fake news website.
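A small sketch of this measure is given below. The text does not state how per-website similarities are aggregated within a cluster, so averaging over all pairs of websites is an assumption made here, as are the function names and the convention that two empty sets are fully similar.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets of third parties."""
    union = a | b
    if not union:
        return 1.0  # assumption: two websites with no third parties are identical
    return len(a & b) / len(union)


def cluster_similarity(third_parties_by_site):
    """Mean pairwise Jaccard similarity over the websites of one cluster."""
    pairs = list(combinations(third_parties_by_site.values(), 2))
    if not pairs:
        return None  # a single-website cluster has no pair to compare
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A value of 1 means that every website in the cluster contacts exactly the same third parties, while values near 0 indicate almost disjoint sets of trackers.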
Finally, using the methodology described in Sec. 2.2, we investigate websites performing browser fingerprinting. In total, we detect 29 fingerprinting websites. Surprisingly, 25 (86%) of these websites have been labeled as real news, while only one of them is a fake news website. The remaining three are news websites that have been evaluated by MBFC but do not fall into either of the two categories. This indicates that real news websites tend to perform more sophisticated and persistent forms of tracking. This finding is in line with a recent privacy campaign that found that popular lifestyle and news websites utilize the most trackers, as well as previous academic work that found that news websites have the most trackers across all examined categories.
Overall, we find that the clusters of fake news websites owned by the same entity are usually aligned when it comes to the type and resources of user tracking, which are then used for ad targeting and, ultimately, ad revenue generation. This alignment is stronger than among real news websites, which use a more diverse set of trackers and methods. However, both types of news websites use typical, well-known tracking resources from companies such as Google, Facebook and Amazon.
6 Related Work
Fake News: Over the last few years, there has been a phenomenal growth of fake news, which has urged researchers from multiple disciplines to investigate this alarming social phenomenon. To that end, there has been a lot of effort to create fake news datasets that support and enable future research on misinformation. Most recently, one group of authors produced a dataset of fake news information related to the COVID-19 pandemic. Similarly, others manually annotated a set of news articles and social media posts of real or fake stories related to COVID-19. In another study, authors collected and manually evaluated recent news articles regarding the American politics ecosystem, resulting in a dataset of fake news and satirical news articles. For each fake news article, a well-researched and factual article that disproves and invalidates it is provided. In addition, the authors used their dataset to analyze the shared content. In yet another study, authors assembled over 2,000 news articles and 140,000 tweets about the COVID-19 pandemic. Using NewsGuard and Media Bias/Fact Check as credibility sources, they form lists of reliable and unreliable news publishers. Using these news publishers, they identify news articles related to COVID-19, extract their content and then explore the spread of these articles on Twitter. The resulting dataset contains valuable information for each article, such as publication date, authors, images, country and political bias. Finally, using this dataset, the authors trained multiple classifiers to predict the credibility of COVID-19-related news.
More importantly, and similar to our work, prior work explored the advertising market of traditional, fake news and low-quality news websites. The authors collected 12 weeks of ad-related URLs from over 1.6K news websites and, using a manually curated list of popular ad servers, found that (i) fake publishers interact with fewer ad servers than low-quality and traditional news websites, (ii) fake and low-quality news websites are more likely to form business relationships with risky ad servers, yet (iii) still rely on credible ad servers to display ads and monetize their traffic. Surprisingly, they found that even though the most popular ad servers deliver ad revenue to a very big portion of fake news websites, the revenue that these ad servers generate through fake and low-quality websites is insignificant. Other authors utilized non-perceptual features to detect disinformation websites in the wild. Specifically, using features related to the domain name, DNS configuration, certificates and hosting infrastructure, they trained a multi-class model that was able to accurately distinguish between disinformation, news and non-news websites. They then deployed their system on over 1.3M websites and were able to discover two unlisted disinformation websites. Finally, Bakir et al. discuss, among other things, how digital advertising generates revenue for fake news websites due to the lack of understanding and control that advertisers have regarding where their ads appear. Additionally, they explain how fake news websites proliferate because, even if they are blocked in one ad network, they can readily move to another.
Website Administration: Academic research has focused on identifying the legal entities that control, operate and handle websites. The methodology followed in this work is closely related to one presented in prior work, whose authors proposed a graph-based model of website administration using ad network and tracking service relationships in order to study the entities that operate and monetize websites. They performed a large-scale analysis of the monetization models of advertising networks and web publishers and detected patterns of preferential administration of websites, as well as a correlation between popularity and portfolio size. Moreover, through a historical analysis, they were also able to detect trends in publisher behavior over time, as well as study the evolution of the market of publishing partners. In our work, we make use of the bipartite graphs of Publisher-specific IDs and the notion of the metagraph to detect websites operated or even owned by the same entity. Contrary to that work, we do not perform edge pruning on the metagraph and focus only on identifiers found in more than 1 but at most 50 websites. Additionally, we refrain from analyzing the behavior of intermediary publishing partners, since they do not provide any additional information to this work. Finally, we make use of ads.txt and sellers.json files in order to have access to more identifiers used by more advertising networks for our analysis of business relationships, instead of focusing solely on Google’s Publisher-specific IDs.
In other work, authors proposed a property graph that represents Internet infrastructures in order to study security threats and the entities involved. To discover these entities, they make use of HTTPS certificates to extract organization names. In a separate study, the authors utilize the email addresses found in WHOIS records and filter out addresses that do not uniquely identify domains. Next, similar to our work, they build bipartite graphs to represent the relationships between legal entities and domains and use Louvain’s community detection algorithm to extract groups of domains owned by the same entity.
Ad Detection: The study, detection and exclusion of advertisements in websites has long been a focus of research. One line of work proposed MadTracer, a system that detects malicious advertisements in websites. Similar to our work, their system relies on the redirection chains among publishers and ad networks. By identifying the nodes in ad-delivery paths, MadTracer is able to detect malicious advertising systems and infected publishers. In another study, similar to our methodology, authors crawl websites and extract URLs embedded in the webpage and in iFrames. Using EasyList and through manual curation, they form a list of popular ad servers, against which URLs are matched. Contrary to our work, they do not examine network traffic and delivered content for ad detection.
The analysis of funding received from advertisers relies on the methodology presented in Section 3.2.1. Even though our methodology has a high Precision score, we acknowledge that it might fail to detect some advertisements in websites. However, we argue that this limitation does not reduce the credibility of our findings, since we were still able to study a big portion of the advertising ecosystem. Additionally, our study utilizes Publisher-specific IDs to discover clusters of news websites operated by the same entity. As discussed in prior work, the existence of intermediary publishing partners can affect results by leading to different communities. We made efforts to exclude these intermediaries by discarding Publisher-specific IDs found in a substantial number of websites. Finally, as explained in prior work, domain classification services might suffer from classification and disagreement flaws. We acknowledge that no service is error-free, but we select Cyren for our analysis because it: (i) accepts mis-classification reports, (ii) is language and content agnostic, covering a wide range of websites, and (iii) has a vast database with approximately 140 million classified domains.
Although fake news is being used more in recent years as a tool of political propaganda, there is no doubt that spreading fake news has become a very lucrative business on the Web. Consequently, the success of curbing fake news primarily depends on the incentives of fake news producers and the ability of stakeholders to remove these incentives.
In this paper, we are the first to systematize the auditing process of fake news ad-based revenue flows. Specifically, we develop a novel ad detection methodology which enables us to identify the companies that advertise in fake news websites and the middlemen responsible for keeping the avenues of ad revenues open. We show that popular, legitimate advertising networks (such as google.com, indexexchange.com, and app-nexus.com) have a direct advertising relation with more than 40% of the fake news websites in our list, and have a re-seller relation with more than 60% of them. Through clustering based on advertiser IDs present in such websites, we report that owners (or operators) of fake news sites usually operate a set of websites that include entertainment, business, politics, technology, etc. As a result, the operation of a fake news website is rarely an isolated event, but it is frequently part of a larger business function. Finally, fake news websites clustered together under the same owner perform user tracking which is aligned across all websites of the cluster, using a smaller set of tracking resources than real news websites. However, they still use top tracking companies such as Google, Facebook and Amazon. We hope our study, and the material we make publicly available, help curb the financial and advertising incentives that such websites have been enjoying so far.
Data & Code Availability
To support and enable further research on fake news, and the extensibility of our work, we make publicly available:
- The lists of 1,044 fake and 1,368 real news websites.
- Screenshots of ads collected on the top 50 most popular websites for each category of “News & Media” and “Sports”.
- Code for the crawler and the novel ad detection method used.
This project received funding from the EU H2020 Research and Innovation programme under grant agreements No 830927 (Concordia), No 830929 (CyberSec4Europe), No 871370 (Pimcity) and No 871793 (Accordion). These results reflect only the authors’ view and the Commission is not responsible for any use that may be made of the information it contains.
- (2020) Stop tracking me bro! differential tracking of user demographics on hyper-partisan websites. In Proceedings of The Web Conference, pp. 1479–1490.
- (2021) Under the spotlight: web tracking in indian partisan news websites. arXiv preprint arXiv:2102.03656.
- Open-source code and public datasets. Note: https://anonymous.4open.science/r/FakeNews-9358/
- (2021) The sites that are tracking your every move online. Note: https://surfshark.com/whos-tracking-you
- (2018) Fake news and the economy of emotions: problems, causes, solutions. Digital journalism 6 (2), pp. 154–175.
- (2019) A longitudinal analysis of the ads.txt standard. In Proceedings of the Internet Measurement Conference, pp. 294–307.
- (2019) The brexit botnet and user-generated hyperpartisan news. Social Science Computer Review 37 (1), pp. 38–54.
- (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008 (10).
- (2021) Just 12 people are behind most vaccine hoaxes on social media, research shows. Note: https://www.npr.org/2021/05/13/996570855/disinformation-dozen-test-facebooks-twitters-ability-to-curb-vaccine-hoaxes?t=1628769021116
- (2021) Market forces: quantifying the role of top credible ad servers in the fake news ecosystem. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 15, pp. 83–94.
- Adblock-rust. Note: https://github.com/brave/adblock-rust
- (2016) Measurement and analysis of private key sharing in the https ecosystem. In Proceedings of the SIGSAC Conference on Computer and Communications Security, pp. 628–640.
- (2014) Understanding interest-based behavioural targeted advertising. arXiv preprint arXiv:1411.5281.
- (2015) I always feel like somebody’s watching me: measuring online behavioural advertising. In Proceedings of the Conference on Emerging Networking Experiments and Technologies, pp. 1–13.
- Sign up for adsense. Note: https://support.google.com/adsense/answer/10162
- Alliance defending freedom. Note: https://www.splcenter.org/fighting-hate/extremist-files/group/alliance-defending-freedom
- Family research council. Note: https://www.splcenter.org/fighting-hate/extremist-files/group/family-research-council
- Why is alliance defending freedom a hate group?. Note: https://www.splcenter.org/news/2020/04/10/why-alliance-defending-freedom-hate-group
- (2021) The rise and fall of fake news sites: a traffic analysis. In Proceedings of the Web Science Conference, pp. 168–177.
- Alliance defending freedom. Note: https://mediabiasfactcheck.com/alliance-defending-freedom/
- CheckYourFact. Note: https://mediabiasfactcheck.com/check-your-fact/
- Daily caller. Note: https://mediabiasfactcheck.com/daily-caller/
- Family research council. Note: https://mediabiasfactcheck.com/family-research-council/
- Health impact news. Note: https://mediabiasfactcheck.com/health-impact-news/
- Medical kidnap. Note: https://mediabiasfactcheck.com/medical-kidnap/
- Need to know. Note: https://mediabiasfactcheck.com/need-to-know/
- PJ media. Note: https://mediabiasfactcheck.com/pj-media/
- Salem radio network news (srn news). Note: https://mediabiasfactcheck.com/salem-radio-network-news-srn-news/
- The truth about cancer. Note: https://mediabiasfactcheck.com/the-truth-about-cancer/
- Vaccine impact. Note: https://mediabiasfactcheck.com/vaccine-impact/
- Methodology. Note: https://mediabiasfactcheck.com/methodology/
- Search and learn the bias of news media. Note: https://mediabiasfactcheck.com/
- (2020) Proactive discovery of fake news domains from real-time social media feeds. In Companion Proceedings of the Web Conference, pp. 584–592.
- Joint estimation of user and publisher credibility for fake news detection. In Proceedings of the International Conference on Information & Knowledge Management (KDD), pp. 1993–1996.
- (2021) Examining opaque programmatic markets with the credibility coalition adsellers dataset. Note: https://misinfocon.com/examining-opaque-programmatic-markets-with-the-credibility-coalition-adsellers-dataset-b9ff5d6781c4
- (2016) Jihadi website with beheadings profited from google ad platform.
- Website url category check. Note: https://www.cyren.com/security-center/url-category-check-gate
- (2020) The seven deadly sins of the html5 webapi: a large-scale study on the risks of mobile sensor-based attacks. ACM Transactions on Privacy and Security (TOPS) 23 (4), pp. 1–31.
- (2014) Legal alliance gains host of court victories for conservative christian movement. Note: https://www.nytimes.com/2014/05/12/us/legal-alliance-gains-host-of-court-victories-for-conservative-christian-movement.html
- (2018) Dear google: please stop using my advertising dollars to monetize hate speech.
- (2016) Online tracking: a 1-million-site measurement and analysis. In Proceedings of the SIGSAC Conference on Computer and Communications Security, pp. 1388–1401.
- EasyList. Note: https://easylist.to/
- FingerprintJS. Note: https://github.com/fingerprintjs/fingerprintjs
- “Unreliable” news sources got more traction in 2020. Note: https://www.axios.com/unreliable-news-sources-social-media-engagement-297bf046-c1b0-4e69-9875-05443b1dca73.html
- The disinformation dozen. why platforms must act on twelve leading online anti-vaxxers. Note: https://www.counterhate.com/disinformationdozen
- (2020) Marketers helped fake news kill real news with bullets supplied by ad tech. Note: https://www.forbes.com/sites/augustinefou/2020/10/18/marketers-helped-fake-news-kill-real-news-with-bullets-supplied-by-ad-tech/?sh=19ac37b15d73
- (2021) Fakeflow: fake news detection by modeling the flow of affective information. arXiv preprint arXiv:2101.09810.
-  (2018) Fake news vs satire: a dataset and analysis. In Proceedings of the Web Science Conference, pp. 17–21. Cited by: item 3, Table 1, §6.
-  (2022) Battling misinformation. Note: https://newsinitiative.withgoogle.com/dnifund/report/battling-misinformation/ Cited by: §1.
-  (2017) Lies, propaganda and fake news: a challenge for our age. Note: https://www.bbc.com/future/article/20170301-lies-propaganda-and-fake-news-a-grand-challenge-of-our-age Cited by: §1.
-  (2021) NELA-gt-2020: a large multi-labelled news dataset for the study of misinformation in news articles. arXiv preprint arXiv:2102.04567. Cited by: item 1.
-  (2021) Mapping recent development in scholarship on fake news and misinformation, 2008 to 2017: disciplinary contribution, topics, and impact. American behavioral scientist 65 (2), pp. 290–315. Cited by: §4.3.
-  (2021) Advertisers spend $2.6bn on misinformation websites, study finds. Note: https://www.campaignlive.co.uk/article/advertisers-spend-26bn-misinformation-websites-study-finds/1725293 Cited by: §1.
-  UBlock origin. Note: https://github.com/gorhill/uBlock Cited by: §3.2.1.
-  (2020) Identifying disinformation websites using infrastructure features. In 10th USENIX Workshop on Free and Open Communications on the Internet (FOCI), Cited by: §6.
-  Tracker protection lists. Note: https://github.com/disconnectme/disconnect-tracking-protection Cited by: §5.
-  (2020) Adgraph: a graph-based approach to ad and tracker blocking. In IEEE Symposium on Security and Privacy (SP), pp. 763–776. Cited by: §6.
-  (2020) Fake news websites still profit from google advertising. Note: https://www.ft.com/content/5f8a405c-c132-4d9b-a86f-c52884535f3e Cited by: §1.
-  (2020) Fake news makes disease outbreaks worse, study finds. Note: https://www.reuters.com/article/us-health-fake-idUSKBN208028 Cited by: §4.4.
-  (2012) The menlo report: ethical principles guiding information and communication technology research. Available at SSRN 2445102. Cited by: Appendix A.
-  (2021) FibVID: Comprehensive fake news diffusion dataset during the COVID-19 period. Telematics and Informatics 64, pp. 101688. Cited by: §6.
-  (2017) The economics of “fake news”. IT Professional 19 (6), pp. 8–12. Cited by: §1.
-  (2019) Ads.txt specification version 1.0.2. Note: https://iabtechlab.com/wp-content/uploads/2019/03/IAB-OpenRTB-Ads.txt-Public-Spec-1.0.2.pdf Cited by: §3.1, §3.1, §3.1.
-  (2019) Sellers.json specification. Note: https://iabtechlab.com/wp-content/uploads/2019/07/Sellers.json_Final.pdf Cited by: §3.1.
-  World without cancer; the story of vitamin b17. American Journal of Public Health. Note: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1653400/ Cited by: §4.4.
-  (2016) How teens in the balkans are duping trump supporters with fake news. Note: https://www.buzzfeednews.com/article/craigsilverman/how-macedonia-became-a-global-hub-for-pro-trump-misinfo Cited by: §1.
-  (2018) The science of fake news. Science 359 (6380), pp. 1094–1096. Cited by: §1.
-  (2019) Tranco: a research-oriented top sites ranking hardened against manipulation. In Network and Distributed System Security Symposium (NDSS), Cited by: §2.1.
-  (2012) Knowing your enemy: understanding and detecting malicious web advertising. In Proceedings of the Conference on Computer and Communications Security, pp. 674–686. Cited by: §6.
-  Website traffic - check and analyze any website. Note: https://www.similarweb.com/ Cited by: §3.2.1.
-  (2021) Dataset of fake news detection and fact verification: a survey. arXiv preprint arXiv:2111.03299. Cited by: §6.
-  IFCN code of principles - check your fact. Note: https://ifcncodeofprinciples.poynter.org/application/public/check-your-fact/16EBE6DB-6072-CE51-EDC0-0D6347FF9605 Cited by: §4.4.
-  NewsGuard - combating misinformation with trust ratings for news. Note: https://www.newsguardtech.com/ Cited by: item 4, §4.4.
-  Tracking Facebook’s COVID-19 Misinformation “Super-spreaders”. Note: https://www.newsguardtech.com/special-reports/superspreaders/ Cited by: §4.4.
-  (2006) Beachcomber biology: the shannon-weiner species diversity index. In Proc. Workshop ABLE, Vol. 27, pp. 334–338. Cited by: §4.3.
-  (2013) Selling off privacy at auction. In Network and Distributed System Security Symposium (NDSS), Cited by: §1.
-  Sources. Note: https://github.com/BigMcLargeHuge/opensources/blob/master/sources/sources.csv Cited by: item 2.
-  (2019) No more chasing waterfalls: a measurement study of the header bidding ad-ecosystem. In Proceedings of the Internet Measurement Conference, Cited by: §1.
-  (2021) User tracking in the post-cookie era: how websites bypass gdpr consent to track users. In Proceedings of the Web Conference, pp. 2130–2141. Cited by: §2.2.
-  (2022) Leveraging google’s publisher-specific ids to detect website administration. Cited by: §2.2, §4.1.1, §4.1.2, §4.1.2, §4.1, §5, §6, §7.
-  (2017) The long-standing privacy debate: mobile websites vs mobile apps. In Proceedings of the Web Conference, pp. 153–162. Cited by: §3.2.2.
-  (2017) If you are not paying for it, you are the product: how much do advertisers pay to reach you?. In Proceedings of the 2017 Internet Measurement Conference, Cited by: §1.
-  (2021) Fighting an infodemic: COVID-19 fake news dataset. In International Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation, pp. 21–29. Cited by: §6.
-  (2017) Automatic detection of fake news. arXiv preprint arXiv:1708.07104. Cited by: §1.
-  Fake news almanac. Note: https://infogram.com/politifacts-fake-news-almanac-1gew2vjdxl912nj Cited by: item 2.
-  (2017) A stylometric inquiry into hyperpartisan and fake news. arXiv preprint arXiv:1702.05638. Cited by: §1.
-  (2021) Index of fake-news, clickbait, and hate sites. Note: https://web.archive.org/web/20210720140548/https://www.cjr.org/fake-beta Cited by: item 2, Table 1, footnote 1.
-  (2014) Ethical research standards in a world of big data. F1000Research 3. Cited by: Appendix A.
-  (2017) Hoax over ‘dead’ ethereum founder spurs $4 billion wipe out. Note: https://fortune.com/2017/06/26/vitalik-death/ Cited by: §1.
-  (2016) The $8.2 billion adtech fraud problem that everyone is ignoring. Note: https://techcrunch.com/2016/01/06/the-8-2-billion-adtech-fraud-problem-that-everyone-is-ignoring/ Cited by: §3.1.
-  (2017) Fake news detection on social media: a data mining perspective. ACM SIGKDD explorations newsletter 19 (1), pp. 22–36. Cited by: §1.
-  (2021) WebGraph: capturing advertising and tracking information flows for robust blocking. arXiv preprint arXiv:2107.11309. Cited by: §6.
-  (2017) Inside the partisan fight for your news feed. Note: https://www.buzzfeednews.com/article/craigsilverman/inside-the-partisan-fight-for-your-news-feed Cited by: §1.
-  (2017) Who controls the internet? analyzing global threats using property graph traversals. In Proceedings of the Web Conference, pp. 647–656. Cited by: §6.
-  (2020) Filter list generation for underserved regions. In Proceedings of The Web Conference, pp. 1682–1692. Cited by: §6.
-  Field guide to fake news sites and hoax purveyors. Note: https://www.snopes.com/news/2016/01/14/fake-news-sites/ Cited by: item 2.
-  (2020) Mis-shapes, mistakes, misfits: an analysis of domain classification services. In Proceedings of the Internet Measurement Conference, pp. 598–618. Cited by: §7.
-  (2016) 6 months after fraud cleanup, appnexus describes impact on its exchange. Note: https://www.adexchanger.com/ad-exchange-news/6-months-after-fraud-cleanup-appnexus-shares-effect-on-its-exchange/ Cited by: §1.
-  (2018) Stakeholders including WFA develop voluntary code on disinformation. Note: https://wfanet.org/knowledge/item/2018/09/27/Stakeholders-including-WFA-develop-voluntary-code-on-disinformation Cited by: §1.
-  (2021) Fighting misinformation in the time of COVID-19, one click at a time. Note: https://www.who.int/news-room/feature-stories/detail/fighting-misinformation-in-the-time-of-covid-19-one-click-at-a-time Cited by: §1.
-  (2019) The web of false information: rumors, fake news, hoaxes, clickbait, and various other shenanigans. Journal of Data and Information Quality. Cited by: §1.
-  (2020) An overview of online fake news: characterization, detection, and discussion. Information Processing & Management 57 (2), pp. 102025. Cited by: §1.
-  (2020) ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), Cited by: item 1, item 4, Table 1, §6.
-  (2019) Fake news: fundamental theories, detection strategies and challenges. In Proceedings of the International Conference on Web Search and Data Mining (WSDM), pp. 836–837. Cited by: §1.
Appendix A Ethical Considerations
This work followed established principles and guidelines for conducting ethical information research and for using shared measurement data [61, 89]. In particular, we paid attention to the following dimensions.
We keep our crawling to a minimum so that we do not slow down or otherwise degrade the performance of any web service. To that end, we crawl only the landing page of each website and visit it only once. We do not interact with any component of the visited website and only passively observe network traffic. In addition, our crawler waits for the website to load fully, and then for an extra period of time, before visiting the next website. In this way, we emulate the behavior of a normal user who stumbled upon the website, and we make a concerted effort to avoid imposing any DoS-like load on the websites we visit.
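The crawling policy described above can be illustrated with a minimal sketch. This is not the crawler used in this work: the function names, the `visit` callable (assumed to load a page, block until loading completes, and only passively record traffic), and the 5-second default pause are all hypothetical, and input URLs are assumed to include a scheme.

```python
import time
from urllib.parse import urlsplit


def landing_page(url):
    """Reduce a URL to its site's landing page (scheme + lowercased host + '/')."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}/"


def polite_crawl(urls, visit, extra_wait_s=5.0, sleep=time.sleep):
    """Visit each landing page exactly once, pausing after every visit.

    `visit` is a hypothetical callable that blocks until the page has
    fully loaded. `extra_wait_s` is an assumed value, not one taken from
    the paper. `sleep` is injectable so the policy can be tested quickly.
    """
    seen = set()
    visited = []
    for url in urls:
        page = landing_page(url)
        if page in seen:
            continue  # never request the same website twice
        seen.add(page)
        visit(page)          # load the landing page only, no interaction
        sleep(extra_wait_s)  # extra pause before moving to the next site
        visited.append(page)
    return visited
```

Passing a different `sleep` or `visit` lets the same policy be exercised offline, for example against stored captures, without sending any traffic to live websites.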
In accordance with the GDPR and ePrivacy regulations, we did not collect any data from real users. We also do not share any data collected by our crawler with any other entity. Moreover, we ensure that the privacy of publishers is not invaded: we do not collect any of their information (e.g., email addresses). Last but not least, we intentionally do not make our crawled dataset public; we release only the fake and real news lists and the ads detected on them, to ensure that no copyrighted material from any website is infringed.
Finally, regarding the ad detection methodology, we were careful not to affect the advertising ecosystem or deplete advertiser budgets. The development and testing of our methodology were performed on offline captures of websites. Additionally, for each website we process, we visit only the landing page and “click” on advertisements only once. We argue that this approach emulates a normal visitor who is interested in the advertisements and follows them to the advertiser's website.