
Leveraging Google's Publisher-specific IDs to Detect Website Administration

Digital advertising is the most popular way for content monetization on the Internet. Publishers spawn new websites, and older ones change hands with the sole purpose of monetizing user traffic. In this ever-evolving ecosystem, it is challenging to effectively answer questions such as: Which entities monetize what websites? What categories of websites does an average entity typically monetize, and how diverse are these websites? How has this website administration ecosystem changed across time? In this paper, we propose a novel, graph-based methodology to detect administration of websites on the Web, by exploiting the ad-related publisher-specific IDs. We apply our methodology across the top 1 million websites and study the characteristics of the created graphs of website administration. Our findings show that approximately 90% of websites are each associated with a single publisher, and that small publishers tend to manage less popular websites. We perform a historical analysis of up to 8 million websites, and find a constantly rising number of new (intermediary) publishers that control and monetize traffic from hundreds of websites, seeking a share of the ad-market pie. We also observe that over time, websites tend to move from big to smaller administrators.

1. Introduction

Digital advertising keeps the content we consume on the Web free of charge, serving as an important stream of revenue for web publishers (Cramer-Flood, 2021). Even during 2020, with all the adverse economic impacts of the COVID-19 pandemic, there was a reported 12.2% increase in ad revenues (Laboratory, 2021), and hundreds of billions of dollars in annual spending worldwide ($455B in 2021 (Cramer-Flood, 2021)). However, it is inherently difficult to assess the effectiveness of digital ad spending, due to the overly complex, layered ecosystem of digital marketing, with thousands of intermediaries brokering ads and ad-slots between sellers and buyers. Some even consider this market overvalued and possibly due for a correction, with various adverse effects (Hwang, 2020).

In an attempt to increase ad profits, advertisers, intermediaries and publishers resort to analytics and other web tracking services to better measure user audiences and their engagement with webpages. But the increasing complexity of this ecosystem makes it hard to answer questions such as: Who are the entities that control and monetize websites, and which websites are they? How many websites does the average such entity control? Are these websites from the same category, or diverse in nature? What are the characteristics of these website administrators, and how have they changed over time?

In the last decade, journalists and academic researchers have grappled with such questions. In fact, they have made several efforts to (i) provide more transparency to the ecosystem of web content monetization and administration (Silverman et al., 2017), (ii) raise awareness of its impact on users’ privacy due to online tracking and possibilities of de-anonymization (Baio, 2011; Matic et al., 2015; Starov et al., 2018; Bashir et al., 2019; Yoon et al., 2019), and (iii) shed light on how this ecosystem drives misinformation and fake news (Alexander, 2015; Silverman et al., 2017; Samson, 2018). For example, L. Alexander (Alexander, 2015) used Google Analytics IDs to find evidence of a concerted pro-Kremlin web campaign, executed among different websites owned by the same entity. C. Silverman et al. (Silverman et al., 2017) looked into Google-related IDs and found websites operated by the same entities, which promoted fake news content and delivered polarizing ads during the 2016 USA presidential election. Furthermore, C.I. Samson (Samson, 2018) discussed the issue of fake news spreading within the context of the 2016 Philippines presidential election. Such reports demonstrate the urgent need for more transparency around website administration. In addition, academic works (Matic et al., 2015; Starov et al., 2018; Bashir et al., 2019; Yoon et al., 2019) have looked at the problem from the point of view of user tracking or de-anonymization, using such Google-related IDs to detect malicious websites and their administrators. However, to date, there has been no systematic study to reveal, at scale, the way websites are monetized and by which entities.

In this work, we try to shed light on website administration and propose a novel, graph-based methodology to detect the entities that are in charge of websites. To that end, we exploit the ad-related, publisher-specific IDs that publishers embed in their websites in order to use third-party services. We (i) apply our methodology across the top 1 million websites of the Tranco list to detect groups of websites monetized by the same entity, (ii) study the characteristics of the generated website administration graphs, and (iii) find intermediary publishers that manage and monetize traffic from hundreds or even thousands of websites. We also perform a 2-year historical analysis of up to the top 8 million websites and explore how small, medium and large publishers have evolved over time.

In summary, the contributions of this work are:

  • We propose a novel methodology for detecting website administration and co-ownership based on publisher-specific IDs, with applicability in different use cases.

  • We conduct the first, to our knowledge, large-scale systematic study of such publisher-specific IDs embedded in up to 8M websites. We make our implementation (Papadogiannakis, 2021a) along with our results (Papadogiannakis, 2021b) publicly available to support further research on this topic.

  • Our findings show that approximately 90% of the websites are associated with a single publisher and that small publishers tend to manage less popular websites. We also conclude that there is preferential administration with an inclination towards “News and Media” websites. Finally, we show that over time, websites tend to move away from big administrators into smaller ones.

2. Publisher-specific IDs

2.1. Google AdSense

AdSense is a service for publishers to generate revenue by displaying ads on their websites. For ads to be displayed, publishers need to insert the AdSense code snippet in their website, which includes a Publisher ID: a unique identifier for an AdSense account that follows the format pub-XXXXXXXXXXXXXXX. The owner of the account is allowed to share the account with employees or even business partners; however, there is always a single account holder, and different AdSense accounts cannot be merged (Center, 2021c). An AdSense account cannot be transferred to another individual (Center, 2021b), but two or more AdSense accounts with different Publisher IDs can co-exist on the same website (Center, 2021e). These other Publisher IDs can belong to a business partner, contributing authors, or even third parties.

2.2. Google Tag Manager (GTM)

Google Tag Manager (GTM) is a service for web administrators to manage code snippets (called Tags; provided by third parties to integrate their respective services, e.g., analytics, marketing, support) in their website. GTM provides an interface for publishers to handle such code snippets, built around an abstraction called a container, which needs to be installed in the website by inserting its own snippet (Farney, 2016). A container is uniquely identified by a Container ID, formatted as GTM-XXXXXX. One GTM account can create and manage more than one container. A GTM account usually represents the topmost level of organization and, typically, an organization uses a single GTM account (Center, 2021f). Containers are not bound to a domain or a website, and with the appropriate configuration, the same container can be used in multiple websites (Center, 2021d).

2.3. Google Analytics

Google Analytics is a service to track and report website traffic. The service revolves around Properties, which contain the reports and traffic data for one or more websites or applications. There are two types: (i) Google Analytics 4 properties, which are identified by a Measurement ID that follows the format G-XXXXXXX, and (ii) Universal Analytics properties (the older version of properties (Center, 2021h)), which are uniquely identified by a Tracking ID formatted as UA-000000-1. When a user creates a Google Analytics account, a unique identifier is created that acts as the prefix of its Tracking IDs (i.e., the first set of numbers). Consequently, the Tracking ID included in the code snippet indicates which account data is sent to (Center, 2021g). The suffix of a Tracking ID represents the property that data is sent to. A website publisher that owns more than one website is able to associate a single property with all of these websites.
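As a minimal illustration of these formats, the sketch below (in Python; the helper names are ours) encodes the identifier patterns as regular expressions, matching the ones used for detection in Section 3.2, and splits a Universal Analytics Tracking ID into its account prefix and property suffix.

```python
import re

# Illustrative patterns for the publisher-specific IDs described above.
ID_PATTERNS = {
    "publisher_id": re.compile(r"pub-\d{9,}"),        # Google AdSense
    "tracking_id": re.compile(r"UA-\d{4,}-\d+"),      # Universal Analytics
    "measurement_id": re.compile(r"G-[A-Z0-9]{7,}"),  # Google Analytics 4
    "container_id": re.compile(r"GTM-[A-Z0-9]{6,}"),  # Google Tag Manager
}

def split_tracking_id(tracking_id: str) -> tuple[str, str]:
    """Split a Tracking ID (e.g., UA-123456-2) into its account
    prefix (123456) and its property suffix (2)."""
    _, account, prop = tracking_id.split("-")
    return account, prop

if __name__ == "__main__":
    account, prop = split_tracking_id("UA-123456-2")
    print(account, prop)  # -> 123456 2
```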

3. Methodology

Description        Unique IDs   Unique URLs   % of sites   HTML    Reqs    Cookies
Publisher IDs          71,745        87,273       10.05%   99.4%   77.6%     0.00%
Tracking IDs          485,405       451,498       52.02%   76.9%   94.3%     22.5%
Measurement IDs        47,087        47,606        5.48%   96.0%   91.2%     0.36%
Container IDs         193,693       179,114       20.64%   99.7%   93.0%     0.01%
Table 1. Detected publisher-specific IDs and their origin (share found in HTML, HTTP(S) requests, and cookies).

3.1. Crawling Methods

To detect websites operated by the same entity, we search for the identifiers of the respective services described in Section 2. Specifically, we develop a Puppeteer-based crawler that instruments instances of the Chromium browser. Using these instances, we crawl, with a clean state, the landing page of the top 962K websites of the Tranco list, which aggregates the ranks from the lists provided by Alexa, Umbrella, and Majestic from 16/3/2021 to 14/4/2021 (Tranco, 2021). This list is formed based on techniques that enable list stability, facilitate reproducibility, and protect against adversarial manipulation. The implementation of our crawler is publicly available (Papadogiannakis, 2021a). When our crawler visits a website, it waits until the page has completely loaded, plus an additional 5-second period, to ensure that all programmatically purchased ads (via Real-Time Bidding (RTB)) have been rendered. Then, it stores the HTML of the page, a cookie jar, and the HTTP(S) requests performed during the website visit. We capture all requests passively, in a read-only fashion, without mutating or intercepting them. This ensures that the behavior of the website is not affected by our crawler. To collect the HTML of the website, we utilize the Chrome DevTools Protocol (The Chromium Authors, 2014). This way, we ensure that we capture not only the actual HTML code but also the documents, styles, or code fetched by iFrames or code snippets. Our crawler visits the 962K websites of the Tranco list (1MT crawl) from 15-27/4/2021 and collects 415GB of data. For 93,817 websites (9.75%), the crawling process failed due to timeouts or site inaccessibility. Overall, we detect 525,493 websites with at least one of the identifiers discussed in Section 2. Ethical concerns regarding the crawling process and collected data are addressed in Appendix A.
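Our crawler is Puppeteer-based and publicly available (Papadogiannakis, 2021a); as a rough Python approximation of the same steps (full page load, a 5-second grace period for RTB-bought ads, then passive collection of HTML, cookies, and requests), one could use Playwright as sketched below. Unlike our crawler, this sketch does not use the Chrome DevTools Protocol to also capture iframe documents.

```python
import json
from playwright.sync_api import sync_playwright

def crawl_landing_page(url: str) -> dict:
    """Visit a landing page once, wait for it to settle, and passively
    record HTML, cookies and outgoing requests. A sketch, not the
    paper's Puppeteer implementation."""
    requests = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()          # clean state per site
        page = context.new_page()
        page.on("request", lambda r: requests.append(r.url))  # read-only
        page.goto(url, wait_until="load", timeout=60_000)
        page.wait_for_timeout(5_000)             # let RTB-bought ads render
        record = {
            "url": url,
            "html": page.content(),
            "cookies": context.cookies(),
            "requests": requests,
        }
        browser.close()
    return record

if __name__ == "__main__":
    print(json.dumps(crawl_landing_page("https://example.com"))[:200])
```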

3.2. Detecting Identifiers

We detect the Google identifiers described in Section 2 by performing an offline analysis on the collected data. Specifically, using regular expressions (pub-[0-9]{9,}, UA-[0-9]{4,}-[0-9]+, G-[A-Z0-9]{7,} and GTM-[A-Z0-9]{6,}), we search for these identifiers inside the page content, HTTP(S) requests, and stored cookies. Then, we remove false positives using a combination of data-filtering techniques.

First, using the dictionary of GNU Aspell (Project and Atkinson, 2019), an open-source spell-checking tool, we remove values that are words of the English dictionary but match the suffix of the regular expressions (e.g., G-BACKPACK). Using this technique, we were able to remove 1,500 distinct false-positive values, which were found in 5,000 unique websites. Then, we remove false positives using a list of common keywords. This list was generated by manually inspecting over 10,000 values that satisfy the regular expressions and investigating whether they are actually used as identifiers. Our keyword list contains over 1,250 values which were filtered out (e.g., G-APRIL2020).
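The filtering step can be illustrated as follows; the word set and blocklist below are tiny stand-ins for the GNU Aspell dictionary and our manually curated keyword list.

```python
import re

GA4_RE = re.compile(r"G-[A-Z0-9]{7,}")

# Stand-ins for the GNU Aspell dictionary and the manually curated
# keyword list (both illustrative here).
ENGLISH_WORDS = {"backpack", "baseball", "notebook"}
KEYWORD_BLOCKLIST = {"G-APRIL2020"}

def extract_measurement_ids(text: str) -> set[str]:
    """Extract GA4 Measurement IDs, dropping matches whose suffix is an
    English word (e.g., G-BACKPACK) or a known false-positive keyword."""
    ids = set()
    for match in GA4_RE.findall(text):
        suffix = match.split("-", 1)[1]
        if suffix.lower() in ENGLISH_WORDS or match in KEYWORD_BLOCKLIST:
            continue  # false positive, not an actual identifier
        ids.add(match)
    return ids

print(extract_measurement_ids("gtag('config', 'G-AB12CD34'); G-BACKPACK"))
# -> {'G-AB12CD34'}
```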

As shown in Table 1, we find that 10% of the most popular websites monetize their content through an AdSense account (i.e., a Publisher ID), 52% use Google Analytics to track their traffic, and 20% use Google Tag Manager for easier management of code snippets. Moreover, for some services, we observe that there are more domains than publisher-specific IDs. This suggests that some identifiers are re-used in more than one website. Additionally, we examine the source of information for each type of identifier. Specifically, we investigate whether the identifiers can be found in the HTML code of a website, in its outgoing network traffic, or in the cookies set by either the first party or various third parties. As shown in Table 1, regardless of the type of identifier, the majority can be found in both the HTML code of the website and the HTTP(S) requests. This result is in line with the official guidelines for using Tags (Center, 2021a). It indicates that the detected identifiers are not only valid but, since they are sent to the respective Google services, also in use. Finally, we find that only Tracking IDs are commonly found in cookies.

Figure 1. Example of a Publisher ID bipartite graph. Blue nodes represent websites and red nodes represent Publisher IDs. A directed edge in these bipartite graphs indicates that the website contains the respective identifier.
Figure 2. Example of metagraph construction. Websites that share an identifier, share an undirected edge in the resulting graph. The weight of the edge rises proportionally with the number of common publisher-specific IDs.

3.3. Bipartite graphs

Using the detected publisher-specific IDs, we construct a bipartite graph for each of the respective types of identifiers. In these graphs, the nodes are either websites or identifiers. Whenever a website contains an identifier, we introduce a directed edge from the respective website node to the respective identifier node. Tracking IDs and Measurement IDs are placed into the same graph since they represent the same service and have similar functionality. Thus, we create three bipartite graphs. Moreover, for Tracking IDs, we focus only on the prefix, which refers to the account number, as discussed in Section 2.3. Figure 1 illustrates an example of a small Publisher ID bipartite graph.
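A minimal sketch of this construction, using networkx and a toy site-to-identifiers map (the domain names and IDs below are made up):

```python
import networkx as nx

def build_bipartite_graph(site_to_ids: dict[str, set[str]]) -> nx.DiGraph:
    """Build a directed bipartite graph with website nodes on one side
    and identifier nodes on the other (one graph per identifier type;
    for Tracking IDs, only the account prefix would be used)."""
    g = nx.DiGraph()
    for site, ids in site_to_ids.items():
        g.add_node(site, bipartite=0)            # website node
        for identifier in ids:
            g.add_node(identifier, bipartite=1)  # identifier node
            g.add_edge(site, identifier)         # site contains the ID
    return g

# Hypothetical toy input: two sites sharing one AdSense Publisher ID.
g = build_bipartite_graph({
    "example-news.com": {"pub-123456789"},
    "example-blog.com": {"pub-123456789", "pub-987654321"},
})
print(g.number_of_nodes(), g.number_of_edges())  # -> 4 3
```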

3.4. Metagraph

We also form a metagraph based on the three bipartite graphs of different publisher-specific IDs. The metagraph contains nodes only for websites and represents the relationships between websites. Whenever two websites share an identifier, we introduce an undirected meta-edge between the two respective website nodes. The more identifiers two websites share, the greater the weight of the connecting edge. Each shared identifier increases the weight of the meta-edge by $\frac{1}{N}$, where $N$ is the total number of distinct identifiers of this type found in more than one website. A larger edge weight between two websites implies greater confidence that they are indeed operated and monetized by the same entity. Figure 2 illustrates an example of how the metagraph is constructed. The code to construct both the bipartite graphs and the metagraph is publicly available (Papadogiannakis, 2021b).
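A minimal sketch of the metagraph construction follows, assuming per-type maps from websites to their detected identifiers; the 1/N weighting reflects our reading of the scheme described above.

```python
import itertools
from collections import defaultdict

import networkx as nx

def build_metagraph(graphs_by_type: dict[str, dict[str, set[str]]]) -> nx.Graph:
    """Combine per-type site->IDs maps into a weighted website metagraph.
    Each identifier shared by two sites adds 1/N to their edge weight,
    where N is the number of distinct IDs of that type that appear on
    more than one website."""
    meta = nx.Graph()
    for site_to_ids in graphs_by_type.values():
        # Invert: identifier -> set of websites containing it.
        id_to_sites = defaultdict(set)
        for site, ids in site_to_ids.items():
            for identifier in ids:
                id_to_sites[identifier].add(site)
        shared = {i: s for i, s in id_to_sites.items() if len(s) > 1}
        if not shared:
            continue
        n = len(shared)  # distinct IDs of this type on >1 website
        for sites in shared.values():
            for a, b in itertools.combinations(sorted(sites), 2):
                w = meta.get_edge_data(a, b, default={"weight": 0})["weight"]
                meta.add_edge(a, b, weight=w + 1 / n)
    return meta
```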

Figure 3. Distribution of Girvan-Newman community sizes in the metagraph. Over 60% of the communities are made of just two sites.
Figure 4. Distribution of publisher-specific IDs per website. Most websites contain only one identifier.
Figure 5. Number of websites monetized by each publisher. Most publishers (63,000) appear in only one site.
Figure 6. Connected components sorted by size (number of nodes) for the Publisher ID bipartite graph.

3.5. Metagraph Validation

We hypothesize that the metagraph constructed by following the steps above can lead to clusters of websites operated by the same entity. Meta-edges reflect the actual relationships between websites; thus, the websites operated by the same entity should form a strongly connected community. Since the metagraph combines information from multiple services (i.e., multiple bipartite graphs), it provides us with greater confidence about the actual relationships of the websites.

We find that there are some outlying cases where communities consist of thousands of websites. After manual investigation, we conclude that these communities are formed due to intermediary publishing partners. These publishers provide services for content creators to monetize their content or improve their website traffic, and require that websites integrate the partner’s identifiers. To focus our analysis on a finer level of granularity, we ignore such publishers for the time being. This allows us to study more detailed cases of website administrators and results in a metagraph that contains 127,000 nodes and 2,885,000 edges. We discuss intermediary publishing partners in later sections.

To find, with high confidence, the websites operated and monetized by the same entity, we perform edge pruning, thus removing noise. Specifically, we remove edges that do not belong to the top 5% when ranked by weight. We choose this threshold based on empirical analysis. This way, we ensure that there are limited false positives, i.e., websites that are wrongfully added to a community because of a typographical error in their source code or because of older identifiers. After the edge pruning, dangling nodes are also removed from the graph, as they do not provide any additional information. To further explore this graph, we execute the Girvan–Newman community detection algorithm (Girvan and Newman, 2002), as sketched below. In fact, we compare our methodology with the one described in (Cangialosi et al., 2016), where the authors apply the Louvain method (Blondel et al., 2008) to the connected components extracted from their bipartite graph. Specifically, we manually examine and evaluate 40 distinct communities, which can be found through both methodologies, consisting of 215 unique websites. We find that applying the Girvan-Newman algorithm after edge pruning results in better communities in 42.5% of the cases, in exactly the same communities in 37.5% of the cases, and in worse communities in 20% of the cases. Consequently, the Girvan–Newman algorithm results in higher-quality communities at the expense of a higher computational cost.
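A sketch of the pruning and clustering steps, assuming the metagraph from the previous sketch; since the text does not specify where the Girvan-Newman hierarchy is cut, taking the first split per component is one plausible choice.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

def prune_and_cluster(meta: nx.Graph, keep_top: float = 0.05):
    """Keep only the top 5% of edges by weight, drop dangling nodes,
    then run Girvan-Newman community detection per connected component."""
    edges = sorted(meta.edges(data="weight"), key=lambda e: e[2], reverse=True)
    kept = edges[: max(1, int(len(edges) * keep_top))]
    pruned = nx.Graph()
    pruned.add_weighted_edges_from(kept)  # dangling nodes never added
    communities = []
    for nodes in nx.connected_components(pruned):
        sub = pruned.subgraph(nodes)
        if sub.number_of_nodes() <= 2:
            communities.append(set(nodes))
            continue
        # First level of the Girvan-Newman hierarchy; one plausible
        # stopping criterion among several.
        communities.extend(next(girvan_newman(sub)))
    return communities
```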

Finally, we compare the communities our methodology detects with the (publicly available) communities detected in (Cangialosi et al., 2016). We acknowledge that the comparison is difficult, because the two studies focus on different websites, in a different time period (i.e., five years ago: a big portion of the websites are no longer active and cannot be evaluated), and in a very dynamic environment like the Web. We compare only communities whose websites have all been crawled in our 1MT crawl. In total, we manually evaluated these communities and found 12 with results similar to ours, and 15 cases where the methodology of (Cangialosi et al., 2016) fails, placing websites operated by the same legal entity into different communities. Using our methodology, we detect 2,369 communities, formed by 11,000 distinct websites. The distribution of community sizes (Figure 3) shows that the majority of them are small (i.e., <6 websites). Indeed, 61% of the communities are pairs of websites, indicating that the median publisher operates just 2 websites.

Figure 7. Average popularity of publishers (based on ranking of sites) vs. publisher size.
Figure 8. Distribution of categories for websites with Publisher ID.
Figure 9. Poisson sampling experiment for site categories
Figure 10. Shannon’s Diversity Index for website categories in detected communities.

4. Analysis of Bipartite Graphs

4.1. Contained publisher-specific IDs

First, we study how organizations and publishers use the identifiers necessary for Google services. For this, we measure the number of unique identifiers found in each website of our dataset. In Figure 4, we plot the portion of websites in our data that contain a certain number of identifiers. Around 82-83% of the websites contain only one Container ID or Analytics ID, and about 90% of websites have only one Publisher ID. This indicates that the majority of organizations prefer to use the simplest and most straightforward configuration of services in their websites, where they use a single identifier to achieve their goal, be it monetization or traffic measurement. Most importantly, in the case of Publisher IDs, it indicates that the majority of websites have a single contributing author and that revenue is not shared. This contrasts with the small portion of websites (less than 3.2% in all cases) with 3 or more identifiers, which indicate multiple collaborating authors, each with their own Publisher ID, contributing to a website. Surprisingly, we see a small number of websites with an extremely large number of identifiers. For instance, we find that prykoly.ru contains 94 Tracking IDs, while www.pps.net, a website for public schools in Portland, contains 88 IDs hard-coded in its JavaScript code, with the correct identifier selected based on the page that is visited. To further investigate this abnormal behavior, we look up these websites in the VirusTotal (Limited) and Sucuri (GoDaddy Media Temple, 2021a) security services for malicious content. Sucuri reports (GoDaddy Media Temple, 2021b) that prykoly.ru contains known JavaScript malware associated with a back-link purchase service called Sape. For pps.net, VirusTotal reports (Limited, 2021) that there are 4 detected files that communicate with this domain. In total, we find 67 distinct websites with over 40 publisher-specific IDs of any type. This preliminary analysis suggests that numerous publisher-specific IDs in a website might imply abnormal or even malicious behavior. This observation, though interesting, is out of scope for this work and is left for future research.

4.2. Publisher Size

Next, we explore the number of websites that publishers manage and monetize. For each publisher, we measure the number of websites in which they place their publisher-specific IDs. Analysis from now on is performed on the distinct domains of landing pages, not on the distinct domains in the Tranco list. Specifically, if two different domains in the Tranco list redirect to the same domain, we count this website only once towards the size of the respective publisher. In Figure 5, we plot, in descending order, the number of websites monetized by each unique Publisher ID in our data. We show that the great majority of publishers (up to 87.8%) monetize traffic from a single site. On the other hand, we find 340 publishers monetizing traffic from more than 10 websites each.

We observe some “mega-publishers” that can be found in hundreds or even thousands of websites. Indeed, the top 10 publishers in our data can be found in a total of more than 4,200 websites. We observe similar behavior for Container IDs and Analytics IDs, where we find that the top 10 identifiers can be found in a total of 4,245 and 6,795 websites, respectively. To verify this finding, we explore the connectivity of the three bipartite graphs (described in Section 3.3) and generate a list of connected components in each graph. In Figure 6, we plot, in decreasing order using a log-log scale, the number of nodes in each connected component of the Publisher ID bipartite graph. We see that the distribution of connected component sizes can fit a power law with a cutoff, an anomaly due to the intermediary publishers mentioned earlier. By applying appropriate statistical tests (Clauset et al., 2009), we find that the distribution is indeed heavy-tailed, with the power law being a better fit than the exponential distribution (log-likelihood ratio test). We find similar results for the Container ID and Analytics ID bipartite graphs but exclude them for brevity. This verifies our finding that there are only a few publishers monetizing traffic from a very large number of websites, while the majority of publishers operate one website.
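The heavy-tail test can be reproduced with the `powerlaw` Python package, which implements the methodology of Clauset et al. (2009); the component sizes below are synthetic stand-ins, not our measured data.

```python
import powerlaw  # implements the Clauset et al. (2009) methodology

# Hypothetical connected-component sizes (stand-in for the real data).
component_sizes = [2] * 5000 + [3] * 900 + [10] * 50 + [500, 1200]

fit = powerlaw.Fit(component_sizes, discrete=True)
print("alpha =", fit.power_law.alpha, "xmin =", fit.power_law.xmin)

# Log-likelihood ratio test: positive R favors the power law over the
# exponential distribution; p gives the significance of that preference.
R, p = fit.distribution_compare("power_law", "exponential")
print(f"R = {R:.2f}, p = {p:.3f}")
```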

We attribute this behavior to the existence of intermediary publishing partners (Google, 2021). These are third-party services that allow content creators to readily deliver their content, effortlessly monetize it, optimize their revenue, and deliver better experiences to users. Publishers are required to integrate the service’s identifiers in their websites so that the service can monitor traffic and user behavior, and deliver ads. By examining requests towards third-party services, we successfully identified multiple such “mega-publishers”, including Ezoic, optAd360, Blogger and ProjectAgora. Specifically, Blogger provides publishers with AdSense gadgets, which can be used to display ads in a blog, without taking any percentage of earnings (Community, 2019). At the moment of writing, the PublicWWW service (PublicWWW) reports that Blogger’s Publisher ID can be found in more than 364,000 websites.

4.3. Monetizer Popularity

Next, we explore if there is an association between the popularity of websites and the size of publishers. First, we group together websites that share the same Publisher ID, meaning that there is a single account responsible for their monetization. For each such account (i.e., publisher), we compute its popularity as the average rank of the websites it operates, based on the Tranco list. In Figure 7, we plot this average popularity of publishers for a given publisher size. We show that the average website popularity (y-axis) increases (i.e., the Tranco rank decreases) as the number of monetized websites increases (x-axis). The average popularity subsections have also been fitted with a straight line (the negative slope is due to the reversed y-axis and indicates increased popularity), showing a clear trend. We observe similar behavior when plotting the median popularity, indicating that there is no skewness in the distribution.
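A sketch of this aggregation, assuming a pandas DataFrame with one row per (Publisher ID, website) pair and the website's Tranco rank; the column names and toy values are illustrative.

```python
import pandas as pd

# Hypothetical input: one row per (Publisher ID, website) pair, with the
# website's Tranco rank (lower rank = more popular).
df = pd.DataFrame({
    "publisher": ["pub-1", "pub-1", "pub-2", "pub-3", "pub-3", "pub-3"],
    "rank": [120_000, 80_000, 900_000, 5_000, 12_000, 30_000],
})

per_pub = df.groupby("publisher").agg(
    size=("rank", "count"),      # number of websites monetized
    avg_rank=("rank", "mean"),   # publisher popularity (lower = better)
)
# Average popularity per publisher size, as plotted in the figure.
print(per_pub.groupby("size")["avg_rank"].mean())
```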

As a result, independent publishers who generate revenue from a single website tend to monetize less popular websites. On the other hand, publishers that manage multiple websites usually manage the most popular ones. Consequently, in a classic case of the rich getting richer, big publishers who operate dozens of websites not only claim a bigger share of the market and generate bigger revenue, but are also able to improve their reputation and attract more visitors. The increased popularity of some websites can also be credited to the intermediary publishers mentioned earlier.

Figure 11. Total websites crawled and portion of websites that contain Publisher ID and Tracking IDs.
Figure 12. Distribution of Publisher IDs per website through time. Almost 90% of websites in all snapshots contain only one Publisher ID.
Figure 13. Transition of websites between publishers through time.

4.4. Website Categories

Manual inspection of communities (Section 3.5) revealed that most operators tend to manage websites with similar content. To investigate this further, we retrieve (when available) from SimilarWeb (LTD) the category of each website with a Publisher ID in our dataset. Figure 8 illustrates the distribution of the categories we retrieved for over 23,000 websites with a Publisher ID. Websites with no category information are excluded from the analysis. We see a preference towards “News and Media” websites (24.5%), followed by websites related to “Computers, Electronics and Technology” (18.6%), “Arts” (11%) and “Science” (8%).

Next, we investigate if there is a preference in the types of websites a publisher monetizes, or if their portfolio is usually random. As a first step, we perform a Poisson sampling experiment to construct a scenario where publishers monetize websites based on their categories’ measured popularity in the data. For this sampling, we perform the following steps. For a given size of publisher (i.e., the number of websites they operate and monetize), we randomly select websites from our data. For example, if the size of a publisher is 10, we randomly select 10 websites from the 23K websites with a category. This selection is biased by the prior probability of a website appearing, based on its category. Thus, for example, a “News and Media” website has almost a 1/4 chance of being selected for any publisher. We perform this process for all publishers. Then, for each publisher, we compute the number of unique website categories under their control (i.e., richness).
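The sampling experiment can be approximated as below, assuming category priors measured from the data (the figures are the rounded shares reported above plus a catch-all remainder); the exact sampling procedure in the paper may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical category priors, based on the shares reported above.
categories = ["News and Media", "Computers/Tech", "Arts", "Science", "Other"]
priors = np.array([0.245, 0.186, 0.11, 0.08, 0.379])

def sampled_richness(publisher_size: int) -> int:
    """Draw `publisher_size` websites, biased by category popularity,
    and return the number of distinct categories (richness)."""
    draws = rng.choice(categories, size=publisher_size, p=priors)
    return len(set(draws))

# Expected richness under category-biased random administration.
print(np.mean([sampled_richness(10) for _ in range(10_000)]))
```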

Figure 9 plots the richness distribution of the observed (or “actual”) data and the Poisson-sampling experiment data. The diagonal line represents the case of a uniform distribution, where all of a publisher’s websites come from different categories with equal probability. We see that the average number of website categories in our actual data is lower than a purely probabilistic choice, for every possible publisher size. This demonstrates a preferential administration of websites when it comes to their category. Thus, publishers tend to monetize websites with a similar type of content. To verify this hypothesis, we also utilize Shannon’s diversity index (Shannon, 1948). Shannon’s diversity index is a statistical measure which provides information about the composition of a community. It is defined as $H' = -\sum_{i=1}^{R} p_i \ln p_i$, where $R$ is the number of different categories in the dataset (i.e., richness) and $p_i$ is the proportion of websites belonging to category $i$. The maximum value of the diversity index is $\ln R$, where $R$ stands for the number of distinct website categories in a community. This maximum represents the case where all categories are equally common inside a cluster of websites operated by the same publisher. As we can see in Figure 10, Shannon’s diversity index for the websites in our data is much smaller than the maximum value and is closer to zero. A smaller diversity index corresponds to a more unequal composition of the community. We conclude that there is indeed preferential administration or monetization, and that publishers tend to acquire new websites of the same type as the ones they already manage.
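Shannon's diversity index itself is straightforward to compute; a self-contained sketch with a toy two-category publisher:

```python
import math
from collections import Counter

def shannon_diversity(site_categories: list[str]) -> float:
    """Shannon's diversity index H' = -sum(p_i * ln p_i) over the
    category proportions p_i of one publisher's websites."""
    counts = Counter(site_categories)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

sites = ["News and Media"] * 8 + ["Arts"] * 2
h = shannon_diversity(sites)
h_max = math.log(2)  # ln R for R = 2 observed categories
print(f"H' = {h:.3f} vs max {h_max:.3f}")  # low H' => unequal composition
```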

5. Historical Analysis

5.1. Historical Presence of Identifiers

We perform a historical analysis of the last two years (April 2019 to April 2021) on a trimester basis, resulting in 9 snapshots. We use the entire dataset of HTTPArchive (Archive, 2021) and do not limit ourselves to the websites found in the Tranco list. In Section 3.2, we show that HTTP requests are a reliable source for finding the identifiers of interest. Thus, we examine the HTTP(S) requests of websites in these snapshots and detect Publisher IDs and Tracking IDs embedded in them. Figure 11 illustrates our findings for the total number of websites crawled per trimester snapshot. We find that, on average, 9.9% of websites monetize their content using Google AdSense Publisher IDs, while around 64% use Google Analytics in order to track and measure their traffic. These trends are stable over both the snapshots and the sample size, with a standard deviation of only 0.7 and 1.81, respectively. These results, computed on far more websites than the 1MT crawl (up to 8M websites in 2021), are in line with our earlier findings described in Section 3, and lend credence to our analysis as being representative of the general Web.

Figure 14. Transition linear trends between different size publishers through time.
Figure 15. Number of websites monetized by the top 10 publishers in each snapshot.
Figure 16. Changes in the population of different sizes of publishers.

Next, we study how many publishers contribute to the content of a website. In Figure 12, we plot the portion of websites that contain one, two, or three or more Publisher IDs for each time period. We also plot the average number of distinct Publisher IDs in each website and observe that it is almost constant across time, with a mean value of 1.11 and a standard deviation of only 0.015. We find that, on average, 88.75% of the websites have a single contributing publisher that generates revenue. Finally, we find a very small number of websites (less than 1% in all snapshots) that contain identifiers of 3 or more publishers. These cases are due either to an intermediary publishing partner, or to websites running under a partnership of various or authorized external authors. Overall, these numbers match our earlier in-depth analysis using the 1MT crawl of April 2021.

5.2. Top Publishers Market

Next, we explore how the market of publishers has changed in the last couple of years and, specifically, how intermediary publishing partners have grown. First, we study how websites behave with regard to their Publisher IDs and detect changes in these identifiers. We perform our analysis on websites that have been crawled in all snapshots (i.e., their intersection) and contain at least one Publisher ID. There are over 191,000 such websites. For each time interval and for each website, we compare the detected identifiers in the previous and the next snapshot. We ignore websites that made no change in their Publisher IDs. For websites that do not contain exactly the same identifiers across two consecutive snapshots, we compare the size of their publishers. Specifically, we define the old community size of a website as the maximum size of its publishers, detected in the first snapshot for that website. Respectively, we retrieve the new community size from the second snapshot. Note that the size of a publisher for a specific snapshot is computed across all websites in the snapshot, not only the common websites. If the new community size of a website is greater than the old size, we conclude that the website moved to a bigger publisher. If the old size is greater, the website moved to a smaller publisher; if the size is the same, the publisher made an insignificant change. Such a change might be the addition or removal of a secondary contributing author, the move to a different AdSense account, etc. The results of this analysis can be seen in Figure 13. We observe that the majority of websites made no changes to their Publisher IDs. Indeed, over 96% of websites show consistent and stable behavior and do not change their monetization scheme. In contrast, we find that, on average, 3.35% of websites changed their contained Publisher IDs.
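The comparison logic for a single website across two consecutive snapshots can be sketched as follows (a simplified reading of the procedure above; names are illustrative):

```python
def classify_transition(old_ids: set[str], new_ids: set[str],
                        old_sizes: dict[str, int],
                        new_sizes: dict[str, int]) -> str:
    """Compare a website's publishers across two consecutive snapshots.
    Publisher sizes are computed over all websites in each snapshot."""
    if old_ids == new_ids:
        return "no change"
    old_size = max(old_sizes[i] for i in old_ids)   # old community size
    new_size = max(new_sizes[i] for i in new_ids)   # new community size
    if new_size > old_size:
        return "moved to bigger publisher"
    if new_size < old_size:
        return "moved to smaller publisher"
    return "insignificant change"
```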

In Figure 14, we show a linear regression model for the different cases of Figure 13. Interestingly, we find that both the case where a website moved to a bigger publisher and the case where a website made an insignificant change in its Publisher IDs have a negative slope. In contrast, the case where websites move to a smaller publisher is the only one with a positive slope. This suggests that there is a tendency for decentralization, meaning that websites are inclined to move away from big intermediary publishers. To test this hypothesis, we plot in Figure 15 the sum of websites operated by the 10 most popular publishers in each snapshot, along with the total number of websites operated by the top 10 publishers present in all snapshots. We can see that there is a constant decrease in the number of websites that these “mega-publishers” manage. Indeed, in just a 2-year span, “mega-publishers” lost approximately 25% of their population. Interestingly, this decrease in managed websites is observed even though the number of crawled websites has increased over the years (as shown in Figure 11).

Next, we explore how this market of big publishers has changed over time. We characterize publishers with up to 10 websites as Small, those that monetize from 11 to 50 websites as Medium, those that monetize from 51 to 100 as Large, and those that monetize more than 100 websites as Mega. In Figure 16, we plot the population of these classes, i.e., the number of such publishers, and we also fit the data subsections with a straight line. As we can see, the number of Small publishers has greatly increased over the years (15K new Small publishers per trimester), which is expected given the increasing ad revenues motivating new independent publishers to monetize their content. We also observe an increase in the number of Medium and Large publishers (29 new Medium and 5 new Large publishers per trimester), while Mega publishers are the only class that shrinks over time (2 Mega publishers lost per trimester). This is evident in the negative slope of the fitted straight line. This behavior attests to the fact that the market of intermediary publishing partners seems to be flourishing and that new such services have emerged during the last couple of years. These services provide a new platform for independent content creators to generate revenue and lure clients away from Mega publishers. It is evident that these new services seek their share of a competitive, but highly profitable, market.
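A sketch of the size classification and the linear trend fit, with hypothetical per-snapshot counts standing in for the measured populations:

```python
import numpy as np

def size_class(n_sites: int) -> str:
    """Publisher size classes as defined in the text."""
    if n_sites <= 10:
        return "Small"
    if n_sites <= 50:
        return "Medium"
    if n_sites <= 100:
        return "Large"
    return "Mega"

# Hypothetical per-trimester counts of Mega publishers over 9 snapshots.
mega_counts = [61, 60, 58, 57, 55, 53, 52, 50, 46]
slope, _ = np.polyfit(range(len(mega_counts)), mega_counts, deg=1)
print(f"{slope:.1f} Mega publishers per trimester")  # negative slope
```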

6. Detecting Website Ownership

During our manual analysis, we found many communities that were not only operated but also owned by the same legal entity. To better understand the utility of the metagraph in identifying websites owned by the same legal entity, we manually examine detected communities. Overall, we find communities belonging to the news and media sector, music and entertainment, as well as manufacturing and other industrial applications. As an example, we detect a community of websites owned by Koninklijke Philips N.V. We find 45 official websites, each with a different country-code top-level domain (ccTLD), all of which belong to the same company but serve clients in different countries. Another community in our dataset is a cluster of 73 news websites, all serving news content using .au as their ccTLD. In their privacy policy, these websites mention that they are published by a subsidiary or related body corporate of Rural Press Pty Ltd and, in their footer, they declare that they are operated by Australian Community Media & Printing. Australian Community Media is a media company operating over 160 regional publications and targets a vast audience in multiple geographic locations.

One of the biggest detected communities is related to the music industry (i.e., over 140 websites of popular singers or music bands). To our surprise, these websites are owned by various companies, including Atlantic Records, Elektra Records, Warner Records and Nonesuch Records. By observing the copyright notifications and the privacy notices of these websites, we find that all of them are subsidiaries of a single multinational conglomerate, Warner Music Group (Group, 2021). This is clear proof that our methodology is able to overcome the barriers of business organization and subsidiaries, and detect ownership at the highest level of the hierarchy. Finally, we find two communities of websites related to public entertainment and information. We detect a community of 111 radio websites operated by Townsquare Media, Inc, a US-based radio network and media company that owns hundreds of local terrestrial radio stations (Media, 2021). We also detect a community of 76 websites which explicitly state in their copyright claims that they are owned by Gray Television, Inc., an American television broadcasting company.

These examples, and many more not analyzed here due to space constraints, demonstrate the efficacy of our methodology in detecting the co-ownership status of websites by organizations that monetize them in a collective fashion. Overall, we find and report 112 distinct communities of various sizes, consisting of over 1,280 websites. For each community, we report its size, the websites that compose the community, as well as the legal entity that owns the respective websites. We manually visited, evaluated and verified all of these websites and make our results publicly available (Papadogiannakis, 2021b). We report some of the largest communities, along with their size (i.e., number of websites) and some indicative websites as examples, in Appendix B.

7. Related Work

The ecosystem of digital advertising and analytics has motivated many studies that aim to reverse engineer it (e.g., (Papadopoulos et al., 2017, 2018; Carrascosa et al., 2015; Englehardt and Narayanan, 2016)). In (Gill et al., 2013), the authors studied the advertising ecosystem and services provided by Google, including AdSense, and focused on how revenues are generated across aggregators. In (Matic et al., 2015), the authors presented an automated tool to de-anonymize Tor hidden services using information such as Google Analytics and AdSense IDs to disclose the server’s IP. Their analysis is limited with regard to publisher-specific IDs, since they only extract 24 unique Analytics IDs and 3 Publisher IDs. Similarly, in (Yoon et al., 2019), Yoon et al. studied phishing threats on the Dark Web by trying to obtain the identity of the owners operating such websites. Using the technique of (Matic et al., 2015), they extracted 276 Analytics IDs and 1,171 Publisher IDs. In (Starov et al., 2018), Starov et al. analyzed identifiers of multiple analytics services to bundle websites and discover malicious websites and campaigns. With a focus on malicious content, they identified 7,945 Analytics IDs and 278 Container IDs and, contrary to our work, did not consider Publisher IDs or Measurement IDs. In (Rogers, 2021), the author outlined how Google Analytics IDs can be used in digital forensics investigations to unmask online actors and lead to the entity that operates a cluster of websites, which can be an individual, an organization, or a media group.

In (Simeonovski et al., 2017), the authors associate organizations with domain names in an attempt to create a property graph for Internet infrastructures. To achieve this, they utilize X.509 certificates and extract the organization to which each certificate was issued. In (Cangialosi et al., 2016), the authors argue that relying on such certificates is not effective and propose a methodology that revolves around the email addresses found in WHOIS records. Similar to our work, the authors build a bipartite graph and apply a community detection algorithm to extract clusters of domains owned by the same organization. Limitations of the proposed methodology include that many WHOIS records contain the email address of the registrar or the hosting provider instead (i.e., due to WHOIS privacy services). Finally, our methodology provides an additional advantage in cases where websites are purchased by a new legal entity, since new website owners or administrators have an incentive to update the publisher-specific IDs in their new websites in order to gain revenue or insights. This does not apply to WHOIS records.

In (Bashir et al., 2019), Bashir et al. performed a study of the specification and adoption of ads.txt files over a 15-month period and clustered publishers serving identical ads.txt files. Similar to our work, they found a large number of small clusters (i.e., fewer than 5 websites) but only a few big clusters with over 50 websites. Finally, the authors manually investigated the top clusters in their dataset and found that such clusters exist due to (i) shared media properties with a common owner, (ii) independent publishers, (iii) the use of the same platform to deliver content, or (iv) the use of consolidated SSP services.

8. Discussion

8.1. Summary

In this work, we shed light on website administration by using bipartite graphs and exploiting the publisher-specific IDs that publishers embed in their websites to use third-party, ad-related services. We studied various properties induced by these graphs, reflecting important characteristics of administration such as portfolio size and popularity, and we identified power-law patterns of website administration, as well as indications of preferential monetization in the type of controlled websites. We studied the use of such publisher-specific IDs across time and showed how the market of intermediary publishing partners has boomed in the last few years. We showed that our methodology can be used to detect ownership on the Web and even overcome corporate organization barriers (i.e., subsidiary companies).

8.2. Limitations

Our methodology is based on detecting publisher-specific IDs using regular expressions. However, there are cases where alphanumeric values might match with these regular expressions without being actual identifiers. While we perform various techniques to limit these false positives (Section 3.2), we acknowledge that there might be cases that we miss. Additionally, our study focuses on publisher-specific IDs related to services offered by Google, one of the biggest players in the advertising and analytics ecosystem. Even though the analysis of Google services provides a good coverage of the real world, there are several other ad networks and analytics services that can be studied. Finally, we acknowledge that our analysis of website categories (Section 4.4) relies on SimilarWeb, which might be prone to errors or subjective bias.

8.3. Implications

We believe our graph methodology and analysis is a powerful tool for web and privacy measurements aiming to understand the context, nature and activity of websites, as well as the possible leverage or political agendas behind their administration. In fact, our proposed technique can help researchers, journalists, and even individual users to better understand popular websites and the entities that control and monetize them. Furthermore, our preliminary analysis shows that outlier websites in the bipartite graphs yielded by our method may reveal anomalous or even malicious behavior, suggesting that our methodology can be used to discover malicious actors without even examining their published content. Also, ad networks can make use of our technique to detect fraudulent or fake news-related website administrators that may violate their ad campaign policies. Altogether, we believe that our method can help improve the safety and health of the Web ecosystem at large.

Acknowledgements

This project received funding from the EU H2020 Research and Innovation programme under grant agreements No 830927 (Concordia), No 830929 (CyberSec4Europe), No 871370 (Pimcity) and No 871793 (Accordion). These results reflect only the authors’ view and the Commission is not responsible for any use that may be made of the information it contains.

References

  • L. Alexander (2015) Open-source information reveals pro-kremlin web campaign. Note: https://globalvoices.org/2015/07/13/open-source-information-reveals-pro-kremlin-web-campaign/ Cited by: §1.
  • I. Archive (2021) HTTPArchive. Note: https://httparchive.org/ Cited by: §5.1.
  • A. Baio (2011) Think you can hide, anonymous blogger? two words: google analytics. Note: https://www.wired.com/2011/11/goog-analytics-anony-bloggers/ Cited by: §1.
  • M. A. Bashir, S. Arshad, E. Kirda, W. Robertson, and C. Wilson (2019) A longitudinal analysis of the ads.txt standard. In Proceedings of the Internet Measurement Conference, IMC ’19, New York, NY, USA, pp. 294–307. External Links: ISBN 9781450369480, Link, Document Cited by: §1, §7.
  • V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre (2008) Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment 2008 (10), pp. P10008. Cited by: §3.5.
  • F. Cangialosi, T. Chung, D. Choffnes, D. Levin, B. M. Maggs, A. Mislove, and C. Wilson (2016) Measurement and analysis of private key sharing in the https ecosystem. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, New York, NY, USA, pp. 628–640. External Links: ISBN 9781450341394, Link, Document Cited by: §3.5, §3.5, §7.
  • J. M. Carrascosa, J. Mikians, R. Cuevas, V. Erramilli, and N. Laoutaris (2015) I always feel like somebody’s watching me: measuring online behavioural advertising. In Proceedings of the 11th ACM Conference on Emerging Networking Experiments and Technologies, CoNEXT ’15, New York, NY, USA. External Links: ISBN 9781450334129, Link, Document Cited by: §7.
  • G. H. Center (2021a) About tags. Note: https://support.google.com/tagmanager/answer/3281060 Cited by: §3.2.
  • G. H. Center (2021b) Ad placement policies. Note: https://support.google.com/adsense/answer/2659106 Cited by: §2.1.
  • G. H. Center (2021c) Manage user access to your account. Note: https://support.google.com/adsense/answer/2646544 Cited by: §2.1.
  • G. H. Center (2021d) Organize your containers. Note: https://support.google.com/tagmanager/answer/6261285 Cited by: §2.2.
  • G. H. Center (2021e) Revenue share. Note: https://support.google.com/adsense/answer/1346295 Cited by: §2.1.
  • G. H. Center (2021f) Setup and install tag manager. Note: https://support.google.com/tagmanager/answer/6103696 Cited by: §2.2.
  • G. H. Center (2021g) Tracking id and property number. Note: https://support.google.com/analytics/answer/7372977 Cited by: §2.3.
  • G. H. Center (2021h) Universal analytics property. Note: https://support.google.com/analytics/answer/10220206 Cited by: §2.3.
  • A. Clauset, C. R. Shalizi, and M. E. Newman (2009) Power-law distributions in empirical data. SIAM review 51 (4), pp. 661–703. Cited by: §4.2.
  • A. H. -. Community (2019) What is the adsensehostid on blogger?. Note: https://support.google.com/adsense/thread/18637422/what-is-the-adsensehostid-on-blogger-why-there-is-another-pub-id Cited by: §4.2.
  • E. Cramer-Flood (2021) Worldwide digital ad spending 2021. Note: https://www.emarketer.com/content/worldwide-digital-ad-spending-2021 Cited by: §1.
  • D. Dittrich and E. Kenneally (2012) The menlo report: ethical principles guiding information and communication technology research. Technical report Cited by: Appendix A.
  • S. Englehardt and A. Narayanan (2016) Online tracking: a 1-million-site measurement and analysis. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, New York, NY, USA, pp. 1388–1401. External Links: ISBN 9781450341394, Link, Document Cited by: §7.
  • T. Farney (2016) Google analytics and google tag manager. ALA TechSource, Chicago. External Links: ISBN 978-0-8389-5976-3 Cited by: §2.2.
  • P. Gill, V. Erramilli, A. Chaintreau, B. Krishnamurthy, K. Papagiannaki, and P. Rodriguez (2013) Best paper – follow the money: understanding economics of online aggregation and advertising. In Proceedings of the 2013 Conference on Internet Measurement Conference, IMC ’13, New York, NY, USA, pp. 141–148. External Links: ISBN 9781450319539, Link, Document Cited by: §7.
  • M. Girvan and M. E. Newman (2002) Community structure in social and biological networks. Proceedings of the national academy of sciences 99 (12), pp. 7821–7826. Cited by: §3.5.
  • Inc. GoDaddy Media Temple (2021a) Sucuri - free website security check and malware scanner. Note: https://sitecheck.sucuri.net/ Cited by: §4.1.
  • Inc. GoDaddy Media Temple (2021b) Sucuri report - prykoly.ru. Note: https://sitecheck.sucuri.net/results/prykoly.ru Cited by: §4.1.
  • Google (2021) Certified publishing partner. Note: https://www.google.com/ads/publisher/partners/ Cited by: §4.2.
  • W. M. Group (2021) Services - recorded music. Note: https://www.wmg.com/services Cited by: §6.
  • T. Hwang (2020) Subprime attention crisis: advertising and the time bomb at the heart of the internet. FSG Originals x Logic, Farrar, Straus and Giroux, New York. External Links: ISBN 9780374538651 Cited by: §1.
  • I. T. Laboratory (2021) IAB releases internet advertising revenue report for 2020. Note: https://www.iab.com/news/iab-internet-advertising-revenue/ Cited by: §1.
  • [30] C. S. I. Limited VirusTotal - analyze suspicious files and urls to detect types of malware, automatically share them with the security community. Note: https://www.virustotal.com/ Cited by: §4.1.
  • C. S. I. Limited (2021) VirusTotal report - pps.net. Note: https://www.virustotal.com/gui/domain/www.pps.net Cited by: §4.1.
  • [32] S. LTD Website traffic - check and analyze any website. Note: https://www.similarweb.com/ Cited by: §4.4.
  • S. Matic, P. Kotzias, and J. Caballero (2015) CARONTE: detecting location leaks for deanonymizing tor hidden services. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, New York, NY, USA, pp. 1455–1466. External Links: ISBN 9781450338325, Link, Document Cited by: §1, §7.
  • T. Media (2021) Digital media and radio advertising company. Note: https://www.townsquaremedia.com/ Cited by: §6.
  • E. Papadogiannakis (2021a) Scrape titan. Note: https://gitlab.com/papamano/scrape-titan Cited by: 2nd item, §3.1.
  • E. Papadogiannakis (2021b) Website administration graphs. Note: https://gitlab.com/papamano/website-administration-graphs Cited by: 2nd item, §3.4, §6.
  • P. Papadopoulos, N. Kourtellis, and E. P. Markatos (2018) The cost of digital advertisement: comparing user and advertiser views. In Proceedings of the 2018 World Wide Web Conference, WWW ’18, Republic and Canton of Geneva, CHE, pp. 1479–1489. External Links: ISBN 9781450356398, Link, Document Cited by: §7.
  • P. Papadopoulos, N. Kourtellis, P. R. Rodriguez, and N. Laoutaris (2017) If you are not paying for it, you are the product: how much do advertisers pay to reach you?. In Proceedings of the 2017 Internet Measurement Conference, IMC ’17, New York, NY, USA, pp. 142–156. External Links: ISBN 9781450351188, Link, Document Cited by: §7.
  • G. Project and K. Atkinson (2019) GNU aspell - free and open source spell checker. Note: http://aspell.net/ Cited by: §3.2.
  • [40] PublicWWW Source code search engine. Note: https://publicwww.com/ Cited by: §4.2.
  • C. M. Rivers and B. L. Lewis (2014) Ethical research standards in a world of big data. F1000Research 3, pp. 38. External Links: Document, Link Cited by: Appendix A.
  • R. Rogers (2021) Digital forensics: repurposing google analytics ids. In The Data Journalism Handbook: Towards A Critical Data Practice, pp. 241–245. External Links: ISBN 9789462989511, Link Cited by: §7.
  • C. I. Samson (2018) VERA files fact check yearender: ads reveal links between websites producing fake news. Note: https://www.verafiles.org/articles/vera-files-fact-check-yearender-ads-reveal-links-between-web Cited by: §1.
  • C. E. Shannon (1948) A mathematical theory of communication. The Bell system technical journal 27 (3), pp. 379–423. Cited by: §4.4.
  • C. Silverman, J. Lytvynenko, L. T. Vo, and J. Singer-Vine (2017) Inside the partisan fight for your news feed. Note: https://www.buzzfeednews.com/article/craigsilverman/inside-the-partisan-fight-for-your-news-feed Cited by: §1.
  • M. Simeonovski, G. Pellegrino, C. Rossow, and M. Backes (2017) Who controls the internet? analyzing global threats using property graph traversals. In Proceedings of the 26th International Conference on World Wide Web, WWW ’17, Republic and Canton of Geneva, CHE, pp. 647–656. External Links: ISBN 9781450349130, Link, Document Cited by: §7.
  • O. Starov, Y. Zhou, X. Zhang, N. Miramirkhani, and N. Nikiforakis (2018) Betrayed by your dashboard: discovering malicious campaigns via web analytics. In Proceedings of the 2018 World Wide Web Conference, WWW ’18, Republic and Canton of Geneva, CHE, pp. 227–236. External Links: ISBN 9781450356398, Link, Document Cited by: §1, §7.
  • The Chromium Authors (2014) Chrome devtools protocol. Note: https://chromedevtools.github.io/devtools-protocol/ Cited by: §3.1.
  • Tranco (2021) Tranco list with the 1m top sites generated on 14 april 2021. Note: https://tranco-list.eu/list/7JVX/full Cited by: §3.1.
  • C. Yoon, K. Kim, Y. Kim, S. Shin, and S. Son (2019) Doppelgängers on the dark web: a large-scale assessment on phishing hidden web services. In The World Wide Web Conference, WWW ’19, New York, NY, USA, pp. 2225–2235. External Links: ISBN 9781450366748, Link, Document Cited by: §1, §7.

Appendix A Ethical Considerations

The execution of this work has followed the principles and guidelines of how to perform ethical information research and the use of shared measurement data (Dittrich and Kenneally, 2012; Rivers and Lewis, 2014). In particular, this study paid attention to the following dimensions.

We keep our crawling to a minimum to ensure that we do not slow down or deteriorate the performance of any web service in any way. Therefore, we crawl only the landing page of each website and visit it only once. We do not interact with any component in the visited website, and only passively observe network traffic. In addition to this, our crawler has been implemented to wait for both the website to fully load and an extra period of time before visiting another website. Consequently, we emulate the behavior of a normal user that stumbled upon a website.

In accordance with the GDPR and ePrivacy regulations, we did not engage in the collection of data from real users. Also, we do not share any data collected by our crawler with any other entity. Our analysis is, to a large extent, based on public historical data (e.g., the HTTPArchive project). Moreover, we ensure that the privacy of publishers and administrators is not invaded. We do not collect any of their information (e.g., email addresses) and only discuss publishers who explicitly and voluntarily disclose their identity on their websites, as we did in Section 6. Last but not least, we intentionally do not make our 1MT crawl dataset public, to ensure that there is no infringement of copyrighted material.

Appendix B Detected Communities

Description                     Size   Example Websites
MinuteMedia                      172   showsnob.com, sodomojo.com, sportdfw.com, thejetpress.com, reignoftroy.com, 90min.de, …
Warner Music Group               142   brunomars.com, blakeshelton.com, greenday.com, vancejoy.com, paramore.net, disturbed1.com, …
Townsquare Media, Inc            111   wkdq.com, wgrd.com, wbkr.com, wbckfm.com, mix108.com, b985.fm, keyw.com, 929nin.com, …
Gray Media Group, Inc             76   wrdw.com, witn.com, whsv.com, wcax.com, wbay.com, nbc29.com, kwch.com, kktv.com, abc12.com, …
Australian Community Media        73   thesenior.com.au, nvi.com.au, theflindersnews.com.au, mailtimes.com.au, portnews.com.au, …
Postmedia Network Canada Corp     58   lfpress.com, nationalpost.com, windsorstar.com, winnipegsun.com, intelligencer.ca, coldlakesun.com, …
Philips                           45   usa.philips.com, philips.com.br, philips.com.mx, philips.com.pk, philips.cz, philips.ru, philips.pl, …
Table 2. Communities detected using the proposed methodology. Through manual investigation we determine the legal entity behind these websites.

We manually examined communities and tried to determine the legal entity operating or even owning all websites in each community. We report some of the largest of these communities in Table 2 along with their size (i.e., number of websites) and some indicative websites as examples.