The Internet with Privacy Policies: Measuring The Web Upon Consent

09/01/2021
by   Nikhil Jha, et al.
Politecnico di Torino

To protect users' privacy, legislators have regulated the usage of tracking technologies, mandating the acquisition of users' consent before collecting data. Consequently, websites started showing more and more consent management modules – i.e., Privacy Banners – that visitors have to interact with to access the website content. These banners challenge the automatic collection of Web measurements, which are used primarily to monitor the extensiveness of tracking technologies but also to measure Web performance in the wild. Privacy Banners in fact prevent crawlers from observing the actual website content. In this paper, we present a thorough measurement campaign focusing on popular websites in Europe and the US, visiting both landing and internal pages from different countries around the world. We engineer Priv-Accept, a Web crawler able to accept the privacy policies, as most users would do in practice. This lets us compare how webpages change before and after acceptance. Our results show that all measurements that do not deal with Privacy Banners offer a very biased and partial view of the Web. After accepting the privacy policies, we observe an increase of up to 70 trackers, which in turn slows down the webpage load time by a factor of 2x-3x.


1. Introduction

The Web is a complex ecosystem where websites monetize their audience through advertising and data collection. They use Web trackers, i.e., third-party services that collect the visitors' browsing history, to build per-user profiles and display targeted ads and personalised content (acar2014web; rizzo2021unveiling; papadogiannakis2021user). Hundreds of tracking platforms exist, with many of them gathering information from a large base of users and websites (falahrastegar2014rise; metwalley2015online; pujol2015annoyed; iordanou2018tracing).

This picture has created tension over users’ online privacy, and regulatory bodies have started governing the scenario. Most recently, in May 2018, the EU introduced the General Data Protection Regulation (GDPR) (gpdr). It sets strict rules on collecting and storing personal data and mandates firms to ask for informed consent. Similarly, the California Consumer Privacy Act of 2018 (CCPA) (ccpa) gives consumers more control over the personal information that businesses collect. All this has changed the Web too. Nowadays, when users visit a website for the first time, a consent management module – commonly called a Privacy Banner – appears, asking the visitors whether they accept the website privacy policy and the use of tracking techniques, and possibly which tracking mechanisms to accept or to block. Upon the user’s acceptance, the browser activates the accepted tracking techniques and updates the webpage to include all ads and third-party objects.

This challenges the commonly accepted approach of automatically crawling websites to measure the Web ecosystem in terms of privacy (acar2014web; falahrastegar2014rise; metwalley2015online; englehardt2016online; pujol2015annoyed; iordanou2018tracing; hu2019characterising; rizzo2021unveiling; papadogiannakis2021user; vandrevu2019what; traverso2017benchmark; mazel2019comparison) and performance (wang2014speedy; de2015http; erman2015towards; bocchi2016measuring; alay2017experience; asrese2019measuring; ruamviboonsuk2017vroom; sivakumar2014parcel; netravali2015mahimahi). These measurements are typically carried out with headless browsers that access webpages of popular websites and automate the collection of metadata and statistics. However, today, these measurements could be biased and unrealistic, with the crawler observing possibly very different content than what a user would get after accepting the privacy policies – as most users would commonly do (bauer2021you; hausner2021dark; CookieBenchmarkStudy). While researchers have shown the importance of carefully choosing which webpages to test (aqeel2020on), to the best of our knowledge, we are the first to consider the impact of Privacy Banners on automatic measurements.

For this, we engineer Priv-Accept, a tool to automatically handle the privacy acceptance mechanisms the websites put in place. In a nutshell, Priv-Accept enables the collection of user-like Web measurements. It overcomes the limitations of traditional crawling approaches, allowing the measurement of the tracking ecosystem to which users are actually exposed and thus yielding realistic figures on performance. The non-standard way of displaying the Privacy Banner, the presence of multiple languages, and the freedom to customize the accept button make automatic detection and acceptance non-trivial. We base Priv-Accept on a keyword list that we thoroughly build to accept the privacy policies automatically. Compared to other solutions (idontcare; remove; ninja; consentomatic), Priv-Accept proves the most robust approach, bypassing the Privacy Banner in about of cases when present.

Armed with Priv-Accept, we run an extensive measurement campaign. We focus mostly on European and US websites that we visit from different countries. We demonstrate how different the picture is before and after accepting the website privacy policies. Interestingly, many websites correctly implement the regulations, and they activate trackers and personalized ads only after consent is collected. This creates the illusion that tracking is decreasing with respect to the past (hu2019characterising). However, the number of trackers websites embed substantially increases upon acceptance of the privacy policy, in some cases by up to 70. As such, popular trackers suddenly become much more pervasive than one can measure using traditional and naive Web crawlers. Considering performance, after accepting privacy policies, webpages become more than three times heavier and more complex, loading objects from many more third-party websites. Thus, they are slower to load: for webpages embedding many trackers and ads, the webpage load time doubles or triples.

Recently, the authors of (aqeel2020on) showed how important it is to extend the crawling to internal pages. Here, we show that it is equally fundamental to correctly handle the Privacy Banners when running extensive Web measurements. For this, we offer Priv-Accept as an open-source tool to encourage other researchers to contribute to it (Priv-Accept is available as an open-source GitHub project at https://github.com/marty90/priv-accept). Similarly, we offer all the data we collected for this study to the community in an effort to support reproducibility and foster other studies.

After discussing the scenario and related work in Section 2, we present Priv-Accept and thoroughly test it in Section 3. In Section 4, we report how different the Web tracking ecosystem looks before and after the acceptance of the privacy policies. We then show the implications on performance in Section 5. After discussing Ethics in Section 6, we summarize our findings in Section 7.

2. Background and related work

Content providers on the Web (websites, social networks, etc.) often monetize the content they offer using advertisements. To increase their effectiveness, the so-called behavioral advertisement leverages users’ interests to provide targeted ads. This is possible thanks to Web trackers, i.e., third-party services embedded in the webpages that gather users’ browsing history. Trackers are nowadays largely present on websites and reach the majority of internauts (metwalley2015online; pujol2015annoyed). Trackers exploit cookies and advanced techniques to enable the collection of personal information (acar2014web; rizzo2021unveiling; papadogiannakis2021user).

Figure 1. Example of Privacy Banner on dailymail.co.uk. Only upon consent, trackers are contacted and ads displayed.

2.1. The Role of Legislators

In this tangled picture, legislators started to regulate the ecosystem to avoid massive indiscriminate tracking that may threaten users’ privacy. The first attempt was the European Cookie Law (directive2009), which entered into force in 2013 and mandates websites to ask for informed consent before using any profiling technology. In May 2018, the General Data Protection Regulation (GDPR) (gpdr) entered into force in all European member states. It is an extensive regulation on privacy, aiming at protecting users’ privacy by imposing strict rules when handling personal information. Unlike previous regulations, it sets severe fines: infringements could result in a fine of up to €10 million, or 2% of the firm’s worldwide annual revenue, whichever amount is higher. Some websites have already been found to have legal violations in their Cookie Banner implementation (matte2020cookie), and a large fraction have been shown to use tracking technologies before users’ consent (trevisan20194; sanchez2019can). In the US, the California Consumer Privacy Act (CCPA) (ccpa) similarly enhances privacy rights and consumer protection for California residents by requiring businesses to give consumers notices about their privacy practices.

As a result, most of the websites now provide explicit Privacy Banners (degeling2018we) and many adopt Consent Management toolsets (hills2020consent), making the website content difficult to access until visitors accept the privacy policy. For example, Figure 1 shows the same news website homepage before and after accepting the privacy policy. Only upon pressing the “Got it” button, the website content is fully visible.

Figure 2. Percentage of websites containing at least one tracker (from HTTPArchive). The black vertical line indicates the entry into force of the GDPR.

2.2. The Effect of Privacy Banners on Web Measurements

Despite cases of misuse, the new regulations had a large impact on Internet users, and this complicates the measurement of the tracking ecosystem. A simple Web crawler visiting the websites without accepting the privacy policies would offer a biased picture, with no trackers and no ads being loaded. Hu et al. (hu2019characterising) already found that the number of third parties dropped by more than 10% after GDPR when visiting websites automatically. Conversely, when using a dataset from 15 real users, they measure no significant reduction in long-term numbers of third-party cookies. Dabrowski et al. (dabrowski2019measuring) draw similar conclusions, finding an apparent decrease in the use of persistent cookies from 2016 to 2018. Sorensen et al. (sorensen2019before) report a decreasing trend in the number of third parties during 2018. We quantify this phenomenon in Figure 2, using the HTTPArchive open dataset (httparchive). The curators of this dataset maintain a list of top websites worldwide that they automatically visit using the Google Chrome browser from a US-based server to store a copy of each visited webpage. Using the tracker list detailed in Section 3, we report the percentage of websites embedding one or more trackers for 5 European countries (simply using the Top-Level Domain to identify the country). We restrict the analysis to those websites that exist for the whole six-year period ( website in total).

Figure 2 apparently shows that the introduction of the GDPR (the black vertical line in May 2018) results in an abrupt decrease in the number of tracker-embedding websites, a trend that continues up to the moment we write. However, as we will show, these measurements are an artifact due to the GDPR itself. Indeed, the Web crawler used by HTTPArchive can only capture the behavior of the websites as a “first-time visitor”, before the user accepts any privacy policy. The crawler thus misses cookies, third-party trackers, and any personalized ads.

Research papers that rely on crawling large portions of the Web for different reasons could be affected by the same bias in their measurements. For instance, this would challenge the automatic measurement of the Web ecosystem in terms of privacy (acar2014web; falahrastegar2014rise; metwalley2015online; pujol2015annoyed; englehardt2016online; iordanou2018tracing; hu2019characterising; rizzo2021unveiling; vandrevu2019what; papadogiannakis2021user; aqeel2020on) and countermeasures (pujol2015annoyed; traverso2017benchmark; mazel2019comparison). Moreover, this will also impact those works that rely on crawlers and headless browsers (avasarala2014selenium) to quantify the impact in the wild of new technologies like SPDY, HTTP/2 (wang2014speedy; de2015http; bocchi2016measuring; erman2015towards), 4G/5G (alay2017experience; asrese2019measuring), accelerating proxies (sivakumar2014parcel; wang2016speeding; ruamviboonsuk2017vroom), or generic benchmark solutions (netravali2015mahimahi). Finally, even spiders and mirroring tools like HTTPArchive may be affected if the website allows the visitor to access its content only after accepting the privacy policy.

2.3. Related Work and Tools

Authors of (vallina2019tales) are the first to consider the impact of the Privacy Banner presence. First, they instruct a custom OpenWPM crawler to identify specific Cookie Banners, and then they manually verify the results. Unfortunately, they solely focus on the pornographic ecosystem, which they acknowledge to be rather different from the Web at large, and thus their work can hardly be generalized.

Recently, authors of (aqeel2020on) demonstrated that it is fundamental to consider the complexity of the Web ecosystem and include internal pages in every measurement study. They find a number of recent works that neglect internal pages and, as such, might provide biased results. Yet, they ignore the implications of Privacy Banners. Here, we aim at providing an extensive and thorough study of their impact on the Web. Our goal is to enable the study of webpage characteristics as visitors would experience, assuming that most of them accept the default privacy setting as offered by the Privacy Banner. Indeed, it has been shown that users tend to ignore privacy-related notices (vila2003we; grossklags2007empirical; coventry2016personality). Considering GDPR Privacy Banners, users tend to accept privacy policies when offered a default button via intrusive banners that nudge users  (CookieBenchmarkStudy; bauer2021you), which is often the case (hausner2021dark) with websites offering large pop-ups or wall-style banners that cover most of the webpage as seen in Figure 1.

There exist solutions that aim at automatically managing Privacy Banners: some browser add-ons try to hide Privacy Banners using a list of CSS selectors of known Privacy Banners. The most popular add-ons of this kind are “I don’t care about cookies” (idontcare) and “Remove Cookie Banners” (remove). Unfortunately, hiding the Privacy Banners has an unpredictable effect, in some cases falling back to privacy policy acceptance, while in other cases triggering an opt-out choice. Other proposals, again in the form of browser add-ons, try to explicitly opt in to or opt out of cookies. For example, “Ninja Cookie” (ninja) approves only cookies strictly needed to proceed on the website. Conversely, Autoconsent (autoconsent) and Consent-O-Matic (consentomatic) use a set of predefined rules to either opt in or opt out, according to the user configuration. These two are the most similar solutions to Priv-Accept, as they make it possible to automate the action of providing consent to privacy policies if used in combination with a crawler. However, they are based on a list of actions the browser has to automatically run when finding one of a set of popular Consent Management Providers (CMPs), which limits their effectiveness. In Section 3.2, we compare Priv-Accept with Consent-O-Matic – the most mature tool – showing that it accepts privacy policies on a much smaller portion of websites than Priv-Accept. Indeed, the diversity of the Web ecosystem, the presence of multiple languages and the fully customizable choice of cookie banner buttons make the engineering of Priv-Accept non-trivial.

3. Priv-Accept design and testing

We explicitly engineer Priv-Accept to fully automate the visit to websites and collect statistics. The key element of Priv-Accept is its ability to identify the presence of a Privacy Banner and automatically accept privacy policies. We aim at a practical and effective approach to accept privacy policies through the offered button. As previously said, most users will indeed be nudged in this direction, since the opt-out options are often made cumbersome on purpose (bauer2021you; hausner2021dark; CookieBenchmarkStudy).

To illustrate Priv-Accept operation, consider again Figure 1. A large Privacy Banner appears on the first-ever visit, and the user has to click on the “Got it” button to access the webpage content. Priv-Accept has to locate this button and click on it automatically. As a result, the website starts loading advertisements and contacting trackers in the background. We refer to the visits before and after this click as Before-Visit and After-Visit in the remainder of the paper.

We implement Priv-Accept using the Selenium browser automation tool (avasarala2014selenium), the de-facto standard for browser automation. We focus on Google Chrome, but we could easily extend it to other browsers.

Given a target URL, Priv-Accept carries out the following tasks:

  1. It navigates to the URL with a fresh browser profile, i.e., with an empty cache and cookie storage. This makes the visit the equivalent of a Before-Visit to the website.

  2. It inspects the Document Object Model (DOM) of the rendered webpage to find a possible Accept-button in a Privacy Banner. For this, we match a list of keywords on the text of each node of the DOM. We identify an Accept-button if we exactly match one of our keywords. For robustness, the match is case insensitive, and leading, trailing or repeated blank characters are removed.

  3. If Priv-Accept finds the Accept-button, it tries to accept the default privacy policies by clicking on the corresponding DOM element (typically a <button>, <a> or <span> element).

  4. Priv-Accept then revisits the URL to collect statistics about the After-Visit experience.
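The matching rule of step 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Priv-Accept implementation: the keyword list below is a tiny hypothetical sample, DOM nodes are represented as plain (tag, text) pairs rather than live browser elements, and the function names are invented for the example.

```python
import re

# Hypothetical sample; the actual list is built manually by the authors.
ACCEPT_KEYWORDS = ["Ok", "Got it", "Accept all", "I'm fine with this"]


def normalize(text):
    """Lowercase the text, trim it, and collapse repeated whitespace."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_accept_button(nodes, keywords=ACCEPT_KEYWORDS):
    """Return the index of the first node whose text exactly matches a keyword.

    `nodes` is a list of (tag, text) pairs standing in for DOM elements.
    Returns None when no node matches.
    """
    wanted = {normalize(keyword) for keyword in keywords}
    for index, (_tag, text) in enumerate(nodes):
        # Exact match on the normalized text, not a substring search.
        if normalize(text) in wanted:
            return index
    return None


nodes = [
    ("p", "We use cookies. Got it?"),   # contains a keyword, but is not an exact match
    ("button", "  GOT   it "),          # exact match after normalization
]
print(find_accept_button(nodes))
```

Requiring an exact match on the whole node text, rather than a substring match, keeps long banner paragraphs that merely mention a keyword from being mistaken for the button.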

In the beginning, we built Priv-Accept to look for accept buttons through CSS selectors combined with keywords as done in (vallina2019tales) and popular add-ons. However, we soon observed that this methodology was too fragile as the use of selectors is strongly CMP-specific and highly customizable by webmasters. The keyword-based approach eases the generalization of the solution. Considering the complexity, Priv-Accept adds marginal overhead to the time required to visit a webpage. Only for very complex webpages, iterating through all DOM elements may require some time, but this is still much less than the time needed to load and render the webpage by the browser.

During each visit, Priv-Accept stores metadata regarding the whole process in a JSON log file. It includes details on all HTTP transactions and installed cookies. Moreover, it optionally takes screenshots of the webpage during the various phases to allow manual verification.

Priv-Accept is highly customizable and offers the user various features. It lets the user customize the declared User-Agent and browser language (in the Accept-Language headers). Important to our analysis, it can run the following visits:

  • Warm-up visit: to populate the browser cache.

  • Before-Visit: to collect statistics on the webpage before accepting the privacy policy, as a Naive Crawler would do.

  • After-Visit: to collect statistics on the webpage as it appears after accepting the privacy policy (if an Accept-button is found).

  • Additional-Visits: to a number of webpages of the same website, randomly choosing among the internal links. This step runs regardless of the presence of the Accept-button.

Among metadata Priv-Accept collects, we record the Page Load Time, or OnLoad time, on all visits. It allows us to compare the performance with and without privacy acceptance. The OnLoad time is a performance index often used as a proxy for Quality of Experience measurements (da2018narrowing). We leave the measurements of more sophisticated QoE-related metrics such as the SpeedIndex (speedindex) as future work. Moreover, we neglect metrics that are not affected by the presence of a Privacy Banner, such as the Time-to-first-byte (TTFB). To avoid suffering the bias of the After-Visit that can only occur with a warm browser cache, we run a preliminary Warm-up visit, then we perform another Before-Visit and take performance measurement only on the latter. This lets us fairly compare the OnLoad on the two visits with hot cache in both cases. Alternatively, Priv-Accept can erase the HTTP cache and clean the socket pool upon each visit to measure webpage performance with a cold cache.

Finally, to limit the impact of random delays due to webpage download and rendering, Priv-Accept uses quite conservative timeouts before eventually aborting the visit. In detail, the DOM inspection starts 5 seconds after the OnLoad event. While this clearly slows down the visit of multiple webpages, it maximizes the accept success rate.
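The warm-cache visit ordering described above can be summarized with a short Python sketch. The callables `visit` and `try_accept` are hypothetical placeholders for the browser automation (they are not part of Priv-Accept's interface): `visit(url)` is assumed to load the page and return its OnLoad time in seconds, and `try_accept()` is assumed to wait after the OnLoad event, look for the Accept-button, and return whether it clicked it.

```python
def run_test_sequence(url, visit, try_accept):
    """Run Warm-up, Before-Visit and After-Visit on one URL.

    visit(url) -> float       hypothetical: loads the page, returns OnLoad time (s)
    try_accept() -> bool      hypothetical: clicks the Accept-button if it is found
    """
    visit(url)                 # Warm-up visit: only populates the cache, timing discarded
    before = visit(url)        # Before-Visit, measured with a warm cache
    accepted = try_accept()    # attempt to accept the privacy policy
    after = visit(url) if accepted else None  # After-Visit only if consent was given
    slowdown = after / before if accepted and before > 0 else None
    return {"before": before, "after": after,
            "accepted": accepted, "slowdown": slowdown}
```

Because both measured visits happen after the warm-up, the two OnLoad times are compared under the same cache conditions, which is the point of the extra visit.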

To allow large-scale measurement campaigns, we containerize Priv-Accept using the Docker container engine (docker). In the containerized version, we use Google Chrome version 89 in headless mode and force it to use a standard User-Agent instead of the pre-defined ChromeHeadless.

We offer Priv-Accept as open-source to foster its usage and allow the reproducibility of the results presented in this paper. For this, we also commit to releasing all the data we collected for this study.

3.1. Keyword Selection and Validation

Figure 3. Validation results of Priv-Accept over 200 randomly picked websites per country.
Figure 4. Frequency of the Priv-Accept keywords, with some examples reported.

The core of Priv-Accept is the list of keywords to be matched against the webpage content to locate the clickable DOM element for accepting the privacy policy. We thoroughly build this list manually in an iterative way. To handle different languages, we build a list that includes keywords for each country we are interested in. For this work, we focus on 5 European countries, namely France, Germany, Italy, Spain and the UK (since January 2021, the UK has enforced the UK GDPR, with practically identical requirements), plus the US – which we use as an example of an extra-EU country. For each country, we pick the most popular websites according to the Similarweb lists (similarweb), a website-ranking service analogous to Alexa.

3.1.1. First Round - keyword extraction from top websites

In the first round, for each of the countries, we consider the top-200 websites that have a Privacy Banner. We randomly choose half of these websites and manually visit them (from Europe) to extract the accept keyword. In total, we identify unique keywords. We next instruct Priv-Accept to visit the other half of websites and accept privacy policies. For those where it fails, we manually visit them and extract keywords. With this, we include new keywords, in total.

3.1.2. Second Round - testing and keyword increase

To evaluate the accuracy of Priv-Accept in the wild, we next consider 200 new random websites for each country from the Similarweb lists. We let Priv-Accept visit them and manually check the subset of websites for which Priv-Accept fails to accept the privacy policy. We depict the results in Figure 3. Priv-Accept can accept the privacy policy in more than half of websites. In of cases, we find new keywords – that we promptly add to our list. Interestingly, we find a non-negligible portion of websites () that do not present any Privacy Banner. At last, Priv-Accept fails to accept privacy in only of cases. Investigating further, this is due to non-standard behaviors of the webpage when accessed in headless mode. For instance, some websites present a CAPTCHA when they detect an automated visit; other websites return a blank webpage. This is a common problem for any crawler-based measurement study (vastel2020fp). Note that cases of False Positives – i.e., Priv-Accept clicking on a wrong DOM element – are possible, although we have not observed any during the development and testing phases.

At the end of the keyword list building phases, we collect a total of keywords covering languages (in Spain, some websites are in Catalan, in addition to Spanish). The most frequent one is the simple “Ok” string. In Figure 4 we show the cumulative distribution of keyword frequency on the whole set of Similarweb websites with some keyword examples. The top-50 keywords already cover of websites, while 100 are enough to cover . Interestingly, we also find complex strings like “I’m fine with this” or “Alle auswählen, weiterlesen und unsere arbeit unterstützen” (which translates to “Select all, keep reading and support our work”).
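A cumulative keyword-coverage curve of the kind shown in Figure 4 could be computed as in the following sketch. The input format (one matched keyword per website) and the function name are assumptions made for illustration, and the sample data is invented.

```python
from collections import Counter


def cumulative_coverage(site_keywords):
    """Fraction of websites covered by the top-k keywords, for k = 1, 2, ...

    `site_keywords` holds, for each website, the keyword that matched its
    Accept-button. Keywords are ranked by decreasing frequency.
    """
    counts = Counter(site_keywords)
    total = len(site_keywords)
    covered = 0
    curve = []
    for _keyword, count in counts.most_common():
        covered += count
        curve.append(covered / total)
    return curve


# Invented sample: six websites, three distinct keywords.
sample = ["ok", "ok", "accept", "ok", "got it", "accept"]
print(cumulative_coverage(sample))
```

Reading the curve at position k gives the share of websites that a list truncated to its k most frequent keywords would still handle.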

3.2. Priv-Accept vs. Consent-O-Matic

Figure 5. Privacy policy acceptance rate of Priv-Accept and Consent-O-Matic on 100 websites per country.

We compare the effectiveness of Priv-Accept with Consent-O-Matic, the most mature browser plugin designed to offer/deny consent to privacy policies automatically. Unlike our tool, Consent-O-Matic exploits the presence of popular Consent Management Providers (CMP), services that take care of the management of users’ choices on behalf of the website. At the time of writing, Consent-O-Matic allows managing Privacy Banners for 35 CMPs. To gauge its performance, we visit the top-100 most popular websites with a Privacy Banner for the 5 countries using a Chrome browser with the Consent-O-Matic plugin enabled. Consent-O-Matic accepts the privacy policies in less than 35% of websites with Privacy Banner, and as little as 17% and 20% for websites in Italy and UK, respectively. Here Priv-Accept accepts the privacy policies on all websites by construction.

We then run a second experiment considering another set of 100 websites randomly picked from the per-country Similarweb lists. We visit each website with Priv-Accept and a Consent-O-Matic-enabled browser. Figure 5 summarizes the comparison. Priv-Accept can accept the privacy policies in more than 50% of websites, more than twice the success rate of Consent-O-Matic. These results are in line with those of Figure 3. The remaining websites may not have a Privacy Banner, fail to load, or use an unknown keyword. This shows that the customization of Privacy Banners makes it difficult to engineer a generic and simple solution. The keyword-based strategy proves more robust than the CMP-based approach (with similar complexity in curating the lists).

3.3. Dataset and Tracker list

In the following, we use Priv-Accept to run several measurement campaigns. Most of our analyses, unless otherwise indicated, target a large set of websites popular in Europe, using a test server located on a European university campus. We also use the US as a representative of an extra-EU country. For each of the countries, we consider the top 100 websites from 25 different categories - see Figure 10. In total, we include unique websites to visit (as the lists in different countries partially overlap).

We run Priv-Accept on the target websites using a single high-end server running 16 parallel instances to speed up the crawl. We instrument it to run a test sequence, which consists of a Warm-up visit, Before-Visit, and After-Visit to the landing page, followed by Additional-Visits to 5 randomly chosen internal pages – previous studies indeed show that internal and landing pages have different properties (aqeel2020on). For each website, we repeat the test sequence times, randomizing the order of websites to visit in each repetition. Our main experimental campaign took place over two weeks in April 2021.

We run additional measurement campaigns to investigate specific aspects. First, we repeat the above experiments using servers located in the US, Brazil and Japan. We use Amazon Web Services to deploy on-demand servers on the desired availability zone. Our goal is to understand whether Privacy Banners appear or have a different impact depending on the visitor location. Second, we visit the top-100 000 websites according to the Tranco list (pochat2018tranco) to offer a view on a larger number of websites. Unfortunately, the Tranco list does not offer a per-category and per-country rank. We test these websites twice, with and without clearing the browser cache, to compare webpage performance on the Before-Visit and on the After-Visit both with a warm and a cold cache.

To observe how the presence of trackers changes from the Before-Visit to the After-Visit, we rely on publicly-available lists provided by Whotracksme (whotracksme) (a tracking-related open-data provider), EasyPrivacy (easyprivacy) (one of the lists at the core of AdBlock tracker-blocking strategy) and AdGuard (adguard) (another ad-blocking tool). For robustness, we merge the three lists and consider as potential trackers those third-party domains that appear in at least two lists. In total, we obtain domains that we consider tracking services (in the following, we identify them with their second-level domain name – i.e., a hostname truncated after the second label – and we handle the case of two-label country code TLDs such as co.uk). We then record the presence of a tracker during a visit if the webpage embeds an object from a tracking domain, and the latter installs a cookie with a lifetime longer than one month (trevisan20194) – commonly referred to as profiling cookie. As such, we divide the HTTP transactions carried out during a visit into:

  • First-Party: objects from the same domain of the target webpage.

  • Third-Party: objects from a different domain than the target webpage.

  • Trackers: objects from a Third-Party that is a tracking domain and sets a profiling cookie.
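The classification above can be illustrated with a minimal Python sketch. Several details are assumptions made for the example and are not taken from the paper: the set of two-label suffixes is a small hand-written sample (a complete solution would rely on a full public-suffix list), "one month" is approximated as 30 days, and all function names are hypothetical.

```python
from collections import Counter

# Assumption: tiny sample of two-label country-code suffixes, far from complete.
TWO_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.br", "co.jp"}
# Assumption: "one month" is approximated as 30 days.
ONE_MONTH_SECONDS = 30 * 24 * 3600


def second_level_domain(hostname):
    """Truncate a hostname after its second label (third for suffixes like co.uk)."""
    labels = hostname.lower().rstrip(".").split(".")
    keep = 3 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


def merge_tracker_lists(*lists, min_lists=2):
    """Keep the domains that appear in at least `min_lists` of the given lists."""
    counts = Counter()
    for entries in lists:
        # A domain counts once per list, even if several hostnames map to it.
        counts.update({second_level_domain(entry) for entry in entries})
    return {domain for domain, n in counts.items() if n >= min_lists}


def classify(page_host, object_host, cookie_lifetime_s, trackers):
    """Label one HTTP transaction; cookie_lifetime_s is None if no cookie is set."""
    domain = second_level_domain(object_host)
    if domain == second_level_domain(page_host):
        return "First-Party"
    sets_profiling_cookie = (cookie_lifetime_s is not None
                             and cookie_lifetime_s > ONE_MONTH_SECONDS)
    if domain in trackers and sets_profiling_cookie:
        return "Tracker"
    return "Third-Party"
```

In the paper, Trackers are a subset of Third-Parties; `classify` simply returns the most specific label, so a caller counting Third-Parties would include the transactions labelled "Tracker" as well.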

4. Impact on Tracking

In this section, we characterize how the Web tracking ecosystem changes if observed with or without accepting the privacy policies. We break down results by Third-Party/Tracker, by country and website category. For this, we focus on the list of the top-100 most popular websites per country and category. Out of the websites, Priv-Accept accepts the privacy policies on of them. The percentage is not uniform across countries and it is generally higher on European () and lower for US () websites - even though most keywords are in English. Differences are more pronounced across categories; Priv-Accept accepts privacy policies on of News websites, while only on of Adult portals. For the sake of completeness, the per-country and per-category acceptance rate is reported on the top- labels of Figures 8(a) and 9(a), respectively. These figures are in line with the acceptance rates seen in Figures 3 and 5. Some manual random checking confirms that Priv-Accept does not accept the privacy policy for those websites that do not present any Privacy Banner or that the headless browser fails to visit.

4.1. Third-Party and Tracker Pervasiveness

Figure 6. Pervasiveness of the top-15 Third-Parties.
Figure 7. Pervasiveness of the 342 identified Trackers.

We first study the pervasiveness of Third-Parties and Trackers and check how it varies when we measure it in a Before-Visit or After-Visit. We here focus on the websites that are popular in the European countries according to the Similarweb ranks. Indeed, we aim at quantifying the impact of privacy policy acceptance on European websites. As such, we exclude those websites exclusively popular in the US.

We first detail the top-15 most pervasive Third-Parties in Figure 6. The GDPR mandates obtaining informed consent before starting to collect any personal data. As such, Third-Parties may be seen as possibly offending services if activated before the privacy policy is accepted. (Footnote 6: Here, we do not enter into the debate of what can be considered a Tracker.) Unsurprisingly, the most pervasive Third-Party is google-analytics.com. It grows from to in popularity on the After-Visit. The growth is also sizeable for other Google services such as googleadservices.com and googlesyndication.com. Conversely, domains belonging to Content Delivery Networks, such as cloudflare.com and cloudflare.net, do not increase their pervasiveness on the After-Visit, likely because they are not governed by the Privacy Banner mechanisms. Interestingly, only 3 out of the top-15 Third-Parties are Trackers – i.e., present in our tracker list and setting a persistent cookie. doubleclick.net and facebook.com are the most popular ones, with pervasiveness growing from to and from to on the After-Visit, respectively. They are present on more than twice as many websites as their closest competitor (quantserve.com).
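As a rough illustration of how such a ranking can be computed, the sketch below assumes that pervasiveness is the fraction of visited websites that embed a given Third-Party domain; the function name and the input layout are hypothetical.

```python
from collections import Counter


def pervasiveness(sites_to_domains):
    """sites_to_domains: dict website -> set of third-party domains seen.

    Returns a dict domain -> fraction of websites embedding that domain.
    """
    counts = Counter(
        domain
        for domains in sites_to_domains.values()
        for domain in set(domains)  # count each domain once per website
    )
    total = len(sites_to_domains)
    return {domain: count / total for domain, count in counts.items()}


def top_k(scores, k=15):
    """Return the k most pervasive domains, most pervasive first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Running the same function on the Before-Visit and After-Visit observations gives the two values compared for each domain.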

Focusing now on Trackers only, Figure 7 shows their pervasiveness in our dataset. We count 342 of them. Notice that the figure has log-log axes to better show the large variability of Tracker popularity. The red curve shows the pervasiveness on the Before-Visit, which is what a naive crawler would report. The blue curve shows how the picture changes on the After-Visit. The increase in pervasiveness is general and affects both popular and infrequent Trackers, reaching one order of magnitude in a few cases. On the After-Visit, the number of Trackers that are present on or more of websites grows from to . Interestingly, if we sort the Before-Visit and After-Visit Trackers by their pervasiveness, the rank remains almost unchanged. The Spearman’s rank correlation is , indicating that the Tracker popularity order is approximately the same before and after the privacy policy acceptance. What changes is how often they are present.
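The rank comparison can be reproduced with a few lines of code. The following is a minimal, dependency-free sketch of Spearman's rank correlation (Pearson correlation computed on average ranks); the dictionary layout and the choice of assigning pervasiveness 0 to a Tracker missing from one of the two visits are assumptions made for illustration.

```python
def average_ranks(values):
    """Rank values (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(before, after):
    """before/after: dicts mapping tracker -> pervasiveness."""
    trackers = sorted(set(before) | set(after))
    x = [before.get(t, 0.0) for t in trackers]
    y = [after.get(t, 0.0) for t in trackers]
    return pearson(average_ranks(x), average_ranks(y))
```

A value close to 1 means that the popularity order is preserved even if every Tracker becomes more pervasive after acceptance.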

As emerges from Figure 7, many Trackers are widespread even on the Before-Visit. This hints at a possibly wrong implementation of the GDPR, which mandates acquiring the visitor’s explicit consent before activating any tracking mechanism. To be precise, the presence of Trackers on the Before-Visit does not necessarily entail a violation of the law. A manual analysis shows that some Trackers install test cookies during the Before-Visit using a form similar to test_cookie = CheckForPermission. These cookies merely check whether profiling cookies can be installed once the user accepts. It is thus possible that the Before-Visit pervasiveness of some Trackers includes cases in which only test cookies are actually used. Here we limit ourselves to observing that Trackers often set some (potentially) profiling cookies even on the Before-Visit.

In conclusion, these results show how different the picture is when collecting measurements with or without accepting the privacy policies. Priv-Accept enables the collection and analysis of what most users would be exposed to, thanks to its ability to handle the Privacy Banners and accept website privacy policies.

4.2. Breakdown on Websites

We now detail the impact of accepting privacy policies on the number of Trackers found in each website, breaking down our results by country and website category.

4.2.1. Analysis by country

Figure 8. Trackers per website seen on the landing page. Websites are sorted by Tracker number on the Before-Visit.
(a) Percentage of websites embedding Trackers.
(b) Average number of Trackers per website.
Figure 9. Trackers penetration and number on websites during different phases of a browsing sessions.

We first check whether the number of Trackers found during the Before-Visit differs in the After-Visit. Figure 8 shows websites sorted in descending order by the number of contacted Trackers as measured in the Before-Visit (red curve). The blue points report the number of Trackers in the After-Visit for the same website. The number tends to grow on the After-Visit, underlining again the need for tools such as Priv-Accept to accept the privacy policy and measure the footprint of Web tracking correctly. Some websites present a sizeable increase, with figures that grow by 50-70 Trackers. Curiously, some websites that already include Trackers in the Before-Visit include more Trackers in the After-Visit. This possibly hints at a wrong implementation of the Privacy Banner, which fails to hinder the presence of possibly offending Trackers. The increase is less remarkable for US-popular websites – again, mainly due to the less widespread presence of Privacy Banners.

To better quantify Tracker presence, we show the fraction of websites containing at least one Tracker in Figure 9(a). About of websites popular in European countries already include some Trackers on the Before-Visit. This happens more frequently in the UK () and less frequently in Germany (). Again, notice that a website embedding a Tracker on the Before-Visit does not necessarily represent a violation of the GDPR, even if this can often be the case (trevisan20194). Interestingly, in the US this figure is higher than in European countries. Recalling that in the US the probability of encountering a Privacy Banner is lower, this hints at a positive effect of the GDPR on popular European websites. The percentage of websites containing Trackers in the After-Visit grows for all European countries, from a increase in the UK to for Germany. This increase is moderate () in the US, given the lower fraction of those websites having a Privacy Banner. We complete this analysis by reporting how this fraction increases when performing Additional-Visits, as recommended in (aqeel2020on). We perform 5 Additional-Visits per website. Our results confirm the value of this practice: the chance of observing at least one Tracker grows by a further - in Additional-Visits compared to the After-Visit.

We next investigate the quantity of Trackers contacted while visiting websites in Figure 9(b), which shows the average number of Trackers contacted on the websites, separately by country. For all countries, the average number of Trackers more than doubles on the After-Visit, and performing Additional-Visits further increases this figure. In Italy, for instance, this figure grows by a factor of when comparing Before-Visit and Additional-Visits. As previously noted, the behavior of US-popular websites differs from that of European ones: before acceptance, the number of Trackers is already higher than in popular European websites, while it is comparable after. (Footnote 7: Recall that these measurements are taken from a European country.) This hints that popular websites in the United States may not have to deal with the European legislation, and are thus less receptive to GDPR indications. At the opposite end, German-popular websites appear to be the most observant of the regulations, installing Trackers only once the privacy policies are accepted. Afterward, they reach levels comparable to the other countries. In summary, European websites use the same quantity of Trackers as US ones, although these are often contacted only after the privacy policy is accepted.

We finally observe that the probabilistic nature of web tracking and bidding mechanisms results in a different number of Trackers contacted at each visit. To obtain the most reliable measurements, we test each website times. We notice that the fraction of websites containing at least one Tracker (as in Figure 9(a)) is only moderately impacted by the number of tests. Indeed, when considering a single After-Visit per website, overall, we find of them containing one (or more) Trackers, which increases only to when considering all visits. Similarly, the average number of Trackers (as in Figure 9(b)) increases from to .
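The effect of repeating the tests can be illustrated with the sketch below. It assumes that a Tracker counts for a website if it shows up in any of the visits considered, i.e., the per-visit observations are merged by set union; the data layout is hypothetical.

```python
def tracker_stats(visits_per_site):
    """visits_per_site: dict site -> list of sets of Trackers, one per visit.

    Returns (fraction of sites with at least one Tracker,
             average number of distinct Trackers per site).
    """
    seen = {
        site: set().union(*visits)  # merge the Trackers seen across visits
        for site, visits in visits_per_site.items()
    }
    total = len(seen)
    with_tracker = sum(1 for trackers in seen.values() if trackers)
    average = sum(len(trackers) for trackers in seen.values()) / total
    return with_tracker / total, average


def first_visit_only(visits_per_site):
    """Restrict every site to its first visit, as a single test would do."""
    return {site: visits[:1] for site, visits in visits_per_site.items()}
```

Comparing `tracker_stats(first_visit_only(data))` with `tracker_stats(data)` gives the two pairs of values contrasted in the paragraph above.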

4.2.2. Analysis by category

(a) Percentage of websites embedding Trackers.
(b) Average number of Trackers per website.
Figure 10. Trackers penetration and number on websites during different phases of a browsing session, separately by category.

We now break down the picture by category, showing the results in Figure 10. As we previously pointed out, we explicitly target websites from categories, each holding the top- websites for each of the considered countries. (Footnote 8: We find a handful of websites belonging to more than one category.)

Figure 10(a) reports the percentage of websites of a given category that contain at least one Tracker. We sort categories from the highest to the lowest percentage of websites with Trackers in the Before-Visit. For completeness, the top x-axis details the fraction of websites in each category where Priv-Accept accepts privacy policies. As before, there is a significant difference between the Before-Visit and the After-Visit. An exception is the Adult category, where the increase is marginal. This is likely due to the low number of websites with Privacy Banners () and confirms the peculiarity of the tracking ecosystem on Adult websites (vallina2019tales). As observed in Figure 9(a), considering Additional-Visits further increases the chance of encountering at least one Tracker.

Figure 10(b) shows the average number of Trackers in websites, with categories sorted from the highest to the lowest average number of Trackers in the Before-Visit. In the After-Visit and Additional-Visits, there is a large increase in the number of Trackers, confirming that most Trackers appear only after the user accepts the privacy policies and when visiting internal pages. Here, differences across categories are pronounced. Categories that heavily depend on advertisement-related income (such as News and Media, Sports, Games, Arts and Entertainment) tend to rely on a large number of Trackers to support more effective behavioral advertising. This is noticeable already on the Before-Visit. For example, accessing a News website leads to contacting Trackers on average. Here, Priv-Accept successfully accepts the privacy policies in of cases. Indeed, since News websites are very popular, they tend to implement the privacy regulations correctly, showing a well-configured Privacy Banner. Upon acceptance, the number of Trackers suddenly becomes almost 6 times higher ( for News) and 9 times higher in Additional-Visits (). Similar considerations hold for many other categories, e.g., for Sport, Food and Drink and Arts and Entertainment the average number of Trackers more than triples.

These results clearly highlight the need to handle Privacy Banners correctly in order to observe the extensiveness of Trackers. Without Priv-Accept, one would radically underestimate the footprint of the tracking and ads ecosystems on the Web. In a nutshell, thanks to Priv-Accept, we obtain a fundamentally different picture in the After-Visit and Additional-Visits.

The case of Adult websites is worth a specific comment. Priv-Accept finds the Privacy Banner on only of them, and a manual check confirms that the large majority of them do not offer any Privacy Banner. In general, tracking remains scarce even upon acceptance, as previously found by the authors of (vallina2019tales). They suggest that the specialized pornographic advertisement ecosystem may cause this behavior: usually, trackers and advertisers related to pornographic websites do not operate outside of them, and thus often evade tracker listing efforts.

4.3. Visiting from Outside Europe

Figure 11. Trackers per websites when crawling from different countries.

To complete the analysis, we run additional measurement campaigns using crawling servers in the Amazon AWS data centers located in the US (Ohio and California), Japan and Brazil. We target the same websites as before. We aim to check whether websites behave differently based on the location of the visitors. Figure 11 summarizes our findings. First, we notice that Priv-Accept accepts privacy policies on around fewer websites when run from outside Europe, as reported on the top x-axis labels. Investigating further, we find websites for which Priv-Accept can accept the Privacy Banner when visited from Europe, but fails when visited from non-EU countries. Checking the screenshots taken by Priv-Accept during the visit on a random subset of these websites, we confirm that no Privacy Banner is present. Thus, we conclude that some websites are starting to customize their Privacy Banners based on visitors’ properties, such as their location.

This impacts the percentage of websites that embed Trackers on the Before-Visit. It grows from to when visiting from outside Europe. On the After-Visit, these differences smooth out, revealing how Priv-Accept helps obtain user-centric measurements regardless of the presence or absence of Privacy Banners on websites. As a final note, we do not observe any significant difference when visiting the websites from Ohio or California, despite the CCPA. (Footnote 9: This figure may require further investigation since we are measuring from Amazon AWS servers whose location may not be correctly handled by the CMPs.)

5. Impact on Website Complexity and Performance

(a) Acceptance rate.
(b) Average number of Third-Parties per website.
Figure 12. Acceptance rate and average Third-Parties per website over the top-100 k websites in Tranco list, computed every websites in the rank.

In this section, we measure the impact of accepting privacy policies on the webpage characteristics and loading performance. Trackers and Third-Party objects that the browser has to load and display upon consent could impact the amount of data to download and the rendering performance. Since we are not interested in breaking down results per country or per category, here we use the crawl on the top-100k websites according to the Tranco list. For each website, we visit only the landing page, doing a Warm-up visit to fill the browser cache, followed by a Before-Visit and an After-Visit. We compare results on the latter two visits, considering only those websites for which Priv-Accept successfully accepted the privacy policy, which happens on % of websites. This is in line with the previous findings, as the Tranco list is a worldwide rank and includes (i) European websites in a language different from those for which we built the keyword list and (ii) websites neither based in European countries nor popular in Europe. We detail the acceptance rate on the Tranco list in Figure 12(a), computed over blocks of websites sorted by their rank, totaling on the x-axis. The solid red line reports the acceptance rate for websites popular in the 5 European countries we target. Websites belong to this set if either (i) they appear in the Similarweb ranks for the 5 countries or (ii) their Top-Level Domain belongs to the 5 countries. (Footnote 10: The Tranco list does not provide a per-country rank.) Out of these websites, Priv-Accept accepts the privacy policy on (%), which is close to what we have obtained with the Similarweb ranks (). Conversely, for the remaining websites (blue dashed line), the acceptance rate is % for the top-ranked and then it settles around %, hinting that some globally popular websites use a Privacy Banner even if they are based outside Europe.
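A per-block acceptance rate of this kind can be computed as in the following sketch. The block size is left as a parameter because its value is not assumed here, and the input is a hypothetical list of booleans ordered by rank.

```python
def acceptance_rate_by_block(accepted_flags, block_size):
    """accepted_flags: booleans ordered by rank (best-ranked website first).

    Returns one acceptance rate per consecutive block of `block_size`
    websites; the last block may be shorter.
    """
    rates = []
    for start in range(0, len(accepted_flags), block_size):
        block = accepted_flags[start:start + block_size]
        rates.append(sum(block) / len(block))
    return rates
```

Applying the function separately to the websites popular in the 5 European countries and to the remaining ones yields the two curves being compared.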

The high acceptance rate for the 5 European countries results in a large increase of Third-Parties from the Before-Visit to the After-Visit, as we depict in Figure 12(b), again computing values over blocks of websites. The solid red line shows that these websites use, on average, Third-Parties in the Before-Visit. In the After-Visit, the average grows to . In contrast, the increase for the other Tranco websites is limited – see the area between the blue solid and dashed lines. In the Before-Visit, Third-Parties are more numerous than for the 5 European countries if we compare the solid blue and red lines. This is likely due to the larger presence of non-EU websites, which are not required to use a Privacy Banner. In the After-Visit (dashed blue line), the increase is moderate, not reaching the values of the 5 European countries (red dashed line), potentially because Priv-Accept misses many Accept buttons in non-supported languages.

5.1. Impact on Page Objects and Size

We first focus on the webpage size in terms of downloaded bytes and number of objects. To compare the results, we compute the ratio between the measurement on the After-Visit and that on the Before-Visit, i.e., $x_{\mathrm{After}} / x_{\mathrm{Before}}$, where $x$ is the metric of interest. We show the results in Figure 13(a), separately for total downloaded bytes (solid red line) and objects (blue dashed line). As expected, accepting the privacy policy increases the webpage size by a sizeable factor for most websites. For instance, about % of websites download more than twice the objects, and about % see an increase of 3 times or more.
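A minimal sketch of this comparison follows. It assumes the ratio is taken as the After-Visit value over the Before-Visit value (so that values above 1 indicate growth upon acceptance), and the dictionaries mapping each website to its metric are hypothetical.

```python
def size_ratios(before, after):
    """before/after: dicts site -> metric (downloaded bytes or object count).

    Returns site -> after/before, skipping sites missing from either visit
    or with a zero Before-Visit value.
    """
    return {
        site: after[site] / before[site]
        for site in before
        if site in after and before[site] > 0
    }


def share_above(ratios, factor):
    """Fraction of websites whose ratio is strictly greater than `factor`."""
    return sum(1 for r in ratios.values() if r > factor) / len(ratios)
```

For example, `share_above(ratios, 2)` gives the fraction of websites that more than double the chosen metric, while websites with a ratio below 1 are lighter after acceptance.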

(a) Page size (in number of bytes and objects) ratio.
(b) Number of Third Parties. Notice the log scales.
Figure 13. Webpage characteristic before and upon consent to privacy policies (Tranco list).

Interestingly, we also observe some websites that are lighter on the After-Visit than on the Before-Visit. Investigating further, these cases are mostly due to the lack of additional content upon acceptance combined with the savings from not loading the CMP objects on the After-Visit. This commonly happens on websites that either add a Privacy Banner despite not using tracking mechanisms or load and contact Trackers and Third-Parties even before the user has accepted the privacy policies. While the former might be seen as an excess of caution, the latter cases are likely violating the privacy regulations.

To better characterize the differences, we quantify the number of Third-Parties seen in the Before-Visit and After-Visit. We show the Complementary Cumulative Distribution Function (CCDF) in Figure 13(b). On median, websites rely on Third-Parties on the Before-Visit (solid red line). This figure grows to on the After-Visit (blue dashed line). The CCDF highlights the tail of the distribution where we observe those websites that rely on a large number of Third-Parties: the percentage of websites with more than grows from to , with including more than 75 Third-Parties upon acceptance.
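For readers less familiar with the CCDF, the sketch below computes the empirical version from a hypothetical list holding the number of Third-Parties per website: for each observed value, it reports the fraction of websites with strictly more Third-Parties.

```python
def ccdf(values):
    """Return sorted (x, P[X > x]) pairs for the distinct observed values."""
    total = len(values)
    points = []
    for x in sorted(set(values)):
        points.append((x, sum(1 for v in values if v > x) / total))
    return points
```

Reading the curve at a given x directly gives the share of websites in the tail, which is how a statement like "the percentage of websites with more than a given number of Third-Parties" is obtained.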

(a) Warm Browser Cache.
(b) Cold Browser Cache.
Figure 14. OnLoad time of websites versus the increase of Third-Party number upon acceptance (Tranco list). The cardinality of each category is reported on the top axis of the left-most figure.

5.2. Impact on Page Load Time

The Third-Party domains appearing after acceptance are generally devoted to advertisements, analytics and Web tracking – see Figure 6 for the most pervasive ones. Contacting a large number of them has direct implications on the page load time and, indirectly, on the users’ QoE (da2018narrowing). We thus expect the growth of Third-Parties and the increase in the number of objects to download to degrade the page load time, since the browser has to resolve via DNS and contact many servers. This possibly limits the advantages of new solutions like stream multiplexing and header compression offered by HTTP/2.

We dissect the webpage performance in Figure 14, comparing separately visits with a warm cache (Figure 14(a)) and a cold cache (Figure 14(b)). We report the onLoad time by grouping the websites according to the number of additional Third-Parties observed in the After-Visit. We use boxplots, where the boxes span from the first to the third quartile and whiskers from the to percentile. The central stroke represents the median. The number of websites in each set is detailed above the respective boxplot. The more Third-Parties are loaded upon acceptance, the larger the time needed to load the webpage and the larger its variability. Especially for the websites that add more than 10 Third-Parties, the distributions are remarkably different on the Before-Visit and After-Visit. Considering visits with warm browser cache (Figure 14(a)), for websites with additional Third-Parties, the median onLoad time passes from to seconds. The difference increases for the websites adding more than Third-Parties upon acceptance. The median onLoad time increases from to seconds, more than doubling. Notice also the tail of of websites loading in more than  s, which happens in less than of cases during the Before-Visit. Similar considerations hold for visits with an empty browser cache (Figure 14(b)). In this case, Priv-Accept cleans the browser cache and the socket pool after each visit. As expected, with the clean cache, websites generally load more slowly – compare values in Figures 14(a) and 14(b). Those that do not add new Third-Parties tend to load slightly faster on the After-Visit, potentially due to the absence of the Privacy Banner or CMPs. Again, we observe that those adding several Third-Parties after acceptance have a larger onLoad time on the After-Visit. Indeed, the median onLoad time for websites adding more than 50 Third-Parties increases from to seconds.
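The grouping behind such boxplots can be sketched as follows. The bin edges `(0, 10, 50)` and the 5th/95th whisker percentiles used as defaults are illustrative assumptions, not the values used for Figure 14, and the percentile helper uses plain linear interpolation.

```python
import bisect


def percentile(sorted_values, q):
    """Linear-interpolation percentile of a sorted list, q in [0, 100]."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def onload_summary(sites, edges=(0, 10, 50), whiskers=(5, 95)):
    """sites: iterable of (added_third_parties, onload_seconds) pairs.

    Bin i holds the sites with edges[i-1] < added <= edges[i]; the last bin
    holds the sites above the largest edge. Returns bin index -> boxplot stats.
    """
    groups = {}
    for added, onload in sites:
        groups.setdefault(bisect.bisect_left(edges, added), []).append(onload)
    summary = {}
    for index, times in sorted(groups.items()):
        times.sort()
        summary[index] = {
            "n": len(times),
            "whisker_low": percentile(times, whiskers[0]),
            "q1": percentile(times, 25),
            "median": percentile(times, 50),
            "q3": percentile(times, 75),
            "whisker_high": percentile(times, whiskers[1]),
        }
    return summary
```

Computing the summary once on Before-Visit onLoad times and once on After-Visit onLoad times, with the same grouping, gives the pairs of boxes that are compared in each set.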

In summary, measuring the webpage load time of websites without considering the implications of accepting the privacy banners would result in a very biased measurement. These results highlight how approaches such as Priv-Accept are fundamental to obtain a realistic picture of the Web performance and testify how actual users’ experience cannot be measured without handling the Privacy Banners.

6. Ethical considerations

During our measurements, we took care to avoid harming the crawled webpages. We contacted each website times in a span of two weeks and accessed a limited number of internal webpages each time. Considering that the targets of our analysis were some of the most popular websites of Western countries, we believe we did not cause an overload on the servers or any undesirable side effect. Moreover, since we did not interact with Third-Parties after accepting the privacy policies – including displayed ads – we believe we have not significantly altered the economic ecosystem of the crawled websites. We only used the standard HTTP and HTTPS ports for our measurements, carefully avoiding any type of port scanning procedure, and we used large timers to avoid creating any kind of congestion.

7. Conclusions

In this paper, we have demonstrated how the recent regulations have changed the Web scenario, challenging its automatic measurements through traditional Web crawlers. Websites now massively deploy Privacy Banners to obtain visitors’ consent for using tracking technologies and collecting personal data. As a result, webpages appear very different once users provide their consent. This has vast implications on the understanding of Web tracking, on webpage characteristics, on performance measurement, and any other measurement based on Web crawling.

To address this, we engineered Priv-Accept, a tool that automatically crawls websites, accepting the privacy policy when a Privacy Banner is found. We ran it on a large set of websites popular in Europe and worldwide. Our results highlighted how the picture of the Web varies when measured upon accepting privacy policies: Web Trackers and Third-Parties suddenly become more pervasive, and websites become more complex and slower to load.

We release Priv-Accept as an open-source project. It is based on a set of keywords and thus has room for improvement. We encourage the research community to use it, contribute to it and extend our results. We also hope Priv-Accept will be included as part of the public projects that provide periodic Web measurements. Our goal is to keep developing Priv-Accept to enrich the keyword list and implement additional functionalities, such as the possibility to deny the privacy policies, a much harder task. For this, we envision the design of more sophisticated approaches to manage Privacy Banners, likely based on recent advances in Natural Language Processing and Machine Learning.

Acknowledgments

The research leading to these results has been funded by the European Union’s Horizon 2020 research and innovation program under grant agreement No. 871370 (PIMCity project) and the SmartData@PoliTO center for Data Science technologies.

References