A majority of today’s online services combine original content with, to a non-negligible extent, third party resources (Sørensen and Kosta, 2019). Most notably, online advertising is embedded via external resources that display ads to finance these services so that they can be provided to users free of charge. Other third parties are included for various purposes, e. g., libraries are used to develop services quickly, to decrease loading times, or for analytical purposes. Consequently, this leads to a highly dynamic Web with complicated dependencies among all participants. This trend comes with the drawback that some service providers might not be aware of which third parties are delivered to users in their name when users interact with their website. Ultimately, third parties can pose risks to users that the service provider does not intend. For example, third parties can create security problems (e. g., malvertising (Kumar et al., 2017; Siddiqui et al., 2008; Sood and Enbody, 2011)), might have negative privacy implications (e. g., trackers (Englehardt and Narayanan, 2016; Englehardt et al., 2015; Acar et al., 2013)), or they can include content that might impact users in other negative ways (e. g., crypto miners (Rüth et al., 2018; Konoth et al., 2018)). Services themselves reinforce these dynamics as they make use of different sets of third parties in different sections and webpages. For example, news websites often insert scripts to connect with social media below articles, but not on the actual landing page. This raises the question of whether previous studies that exclusively measured the landing pages (e. g., (Dabrowski et al., 2019; Sørensen and Kosta, 2019; Urban et al., 2020; Englehardt and Narayanan, 2016; Ikram et al., 2019; Iordanou et al., 2018; Merzdovnik et al., 2017)) captured a complete and comprehensive view of the analyzed phenomenon.
We perform a measurement study of 10,000 websites and analyze relations between third parties. We use the notion of third party trees (TPT) as a metric for loading dependencies of all third parties embedded into a given website. More specifically, a TPT contains information on all third parties (TP) observed when visiting a given website and accounts for the loading sequence of each TP. Consider the following example: adidas.com embeds a script which loads content from Adobe (3rd party). The script again loads a script from Tealium (4th party), which also loads a script from Akamai (5th party). As a result, a TPT captures the hierarchical structures of third parties on a given website and enables us to study the typical characteristics and dynamic nature of the modern Web. Furthermore, we show that embedding a single TP might result in embedding a non-deterministic number of additional TPs, which might pose privacy or security risks. Previous work in this area has analyzed implications of the presence of multiple third parties on websites. Recently, Ikram et al. (Ikram et al., 2019) raised awareness of the problem of implicit trust created by dependency chains in website embeddings. Earlier work focused on the extent of tracking (e. g., (Dabrowski et al., 2019; Sørensen and Kosta, 2019; Englehardt et al., 2015)), or on the used mechanisms (e. g., (Englehardt and Narayanan, 2016; Acar et al., 2013; Kurtz et al., 2016)), and again other works on defense mechanisms (e. g., (Nikiforakis et al., 2015; Pan et al., 2015)), or the effectiveness of such (e. g., (Merzdovnik et al., 2017; Mayer and Mitchell, 2012; Fouad et al., 2020)). In this work, we want to assess in more detail by whom third parties are embedded into websites and study the extent of control service providers have over the embedded third parties.
Most importantly, we show that previous studies did not capture the full extent of the phenomenon and only measured a (not necessarily generalizable) lower bound of included TP content. Our results show a significant increase in used cookies (36 %) and tracking techniques (6 %) on subsites.
In summary, we make the following key contributions:
We introduce the concept of third party trees (TPTs) that reflects all third parties and dependencies when loading a website. Utilizing TPTs, we show that some TPs load several further partners and that those are not always deterministic and possibly in conflict with current legislation.
We show that measuring only the traffic generated by the landing page of a website, or by only a few subsites, risks capturing just a (potentially limited) subset of the loaded third parties. This implies that the obtained results might be biased and not generalizable. For example, our study indicates that subsites use substantially more cookies (over 45 %) than the respective landing pages.
Using our data, we try to replicate previous work to test if they only measured an incomplete view of their studied phenomenon and show that most privacy-invasive technologies occur more often on subsites.
Before introducing our approach, we briefly describe third party usage and outline its privacy implications.
2.1. Third Party Usage
2.2. Online Tracking
Tracking users online is a widespread phenomenon on the Web (Englehardt and Narayanan, 2016). It is used to re-identify users navigating the Web and is a crucial part of the modern online advertisement ecosystem as it allows advertisers to provide targeted ads. Techniques to track users can be divided into stateless and stateful approaches. Stateless approaches use specific attributes of the user’s device to identify it (Englehardt and Narayanan, 2016; Nikiforakis et al., 2013; Acar et al., 2013; Xu et al., 2016; Formby et al., 2016) (often called “device fingerprinting”). In contrast, stateful approaches use the machine’s state to identify users. Typically, an ID is assigned to each user and is stored in a cookie on the user’s device. The upside of stateless approaches is that they cannot be prevented by deleting third-party cookies. However, they are more error-prone as device-specific attributes tend to change over time (Vastel et al., 2018; Gómez-Boix et al., 2018).
3. Related Work
Previous work analyzed tracking mechanisms and the effects of privacy legislation through measurement studies.
Privacy & Tracking Measurements
Englehardt et al. introduce OpenWPM and use it to crawl the top 1 million websites and analyze their tracking capabilities (Englehardt and Narayanan, 2016). They find that many websites use highly sophisticated fingerprinting methods (e. g., based on image rendering) and that most companies participate in cookie syncing. Degeling et al. analyze different cookie banner notifications and effects of the GDPR on privacy policies (Degeling et al., 2019). They find that more than half of websites provide a cookie consent notice, but only very few offer users a real choice regarding cookie usage. The effects of the GDPR have been studied extensively in the past. For example, Utz et al. (Utz et al., 2019) analyzed implementations of cookie consent banners, and Urban et al. (Urban et al., 2019a, b) analyzed the usability of the GDPR right to access and the effect of the GDPR on cookie syncing activities (Urban et al., 2020). Dabrowski et al. test if the GDPR has an impact on cookie settings when users access the same websites from different countries (Dabrowski et al., 2019). They find that around 50 % of websites do not set cookies when a user from the EU visits them, while they set a cookie when the user visits from a non-EU country. Most recently, Sørensen et al. analyzed the effect of the GDPR regarding third parties embedded into websites (Sørensen and Kosta, 2019). The authors measure several prominent websites and test whether the GDPR affects their third party usage. They conclude that the overall usage of cookies declined but that the GDPR was not necessarily the driver for that change.
Third Party Inclusion
Closely related to our approach is the work of Kumar et al. (Kumar et al., 2017) and Ikram et al. (Ikram et al., 2019). Both works use a concept of implicit trust in the embedded third and further parties. Kumar et al. show that websites heavily rely on third parties, that almost one-third of websites embed a third party that loads further parties, and that these dependencies are a problem if one wants to serve a website fully via HTTPS. Ikram et al. also show that many websites (approx. 40 %) implicitly trust parties loaded by directly embedded third parties and see an increase in embedded malicious or at least suspicious sites or script files in these chains.
Our work differs from previous work, as most studies tried to measure effects on a horizontal scale (i. e., visiting a lot of distinct domains) while we instead analyze websites on a vertical scale (i. e., we visit several subsites of the same domain). Furthermore, we focus on privacy-invasive technologies and the determinism of third party dependencies. With this vertical approach and dependency identification, we can (1) analyze if subsites show different behavior compared to landing pages, (2) study effects of embedding different third parties into websites, and (3) understand who is responsible for embedding specific third parties.
4. Measurement Approach
Before describing our approach, we define two terms we use throughout this work. By TLD+1 we mean the last label of the hostname, i. e., the part following the last dot in it. For example, the URL https://tools.ietf.org has TLD=org, hostname=tools.ietf, and TLD+1=ietf. In most cases, TLD+1 is a “second-level domain”. However, some domain name registries use a second-level hierarchy. For example, New Zealand uses various second-level domains for different purposes: .co.nz for organisations or .school.nz for schools. We identified the TLDs using the Python package tldextract (Python, 2019), which accurately splits generic or country code top-level domains (ccTLD). Furthermore, we distinguish between landing pages and subsites. A website is a subsite (SB) of a landing page (LP) if both share the same TLD+1 but have distinct URLs. Hence, first-party links on landing pages, i. e., the pages that are usually visited first, lead to subsites. We chose to use the term SB rather than “webpage” to explicitly highlight the hierarchical relation between SBs and LPs.
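The two definitions above can be illustrated with a minimal sketch. It does not use tldextract; instead, it assumes a tiny, hard-coded excerpt of public suffixes (`KNOWN_SUFFIXES`), and the helper names `split_host` and `is_subsite` are hypothetical. A real analysis would rely on the complete Public Suffix List.

```python
from urllib.parse import urlsplit

# Assumption: a deliberately tiny excerpt of public suffixes for illustration.
# A real analysis would use the full Public Suffix List (e.g., via tldextract).
KNOWN_SUFFIXES = {"org", "com", "nz", "co.nz", "school.nz", "uk", "co.uk"}


def split_host(url):
    """Return (hostname prefix, TLD+1, TLD) for a URL, following the text's terms."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Scan from the left so that the longest matching suffix wins
    # (e.g., "co.nz" is preferred over "nz").
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_SUFFIXES:
            if i == 0:
                return "", "", candidate
            return ".".join(labels[: i - 1]), labels[i - 1], candidate
    # Fallback for unknown suffixes: treat the last label as the TLD.
    if len(labels) >= 2:
        return ".".join(labels[:-2]), labels[-2], labels[-1]
    return "", "", host


def is_subsite(landing_url, url):
    """A URL is a subsite of a landing page if both share the TLD+1 but differ."""
    return landing_url != url and split_host(landing_url)[1] == split_host(url)[1]
```

For https://tools.ietf.org, `split_host` yields the prefix `tools`, the TLD+1 `ietf`, and the TLD `org`, matching the example in the text.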
4.2. Website Corpus
In our analysis, we use the Tranco top 1M list (Le Pochat et al., 2019), which is an aggregation of four other domain top lists. We used the list generated on 03/26/2019 (ID: W9L9). First, we removed all websites with the same TLD+1 and only kept the one with the highest rank. We did so because we wanted to remove URLs of services that offer users (almost) the same functionality. For example, if the list contains google.com (rank 1) and google.co.uk (rank 4), we would drop google.co.uk because both domains share the same TLD+1. In total, we removed 607 websites in this step. From the remaining domains, we used the top 10,000 domains, grouped them by the category of their content, and sorted them into four different buckets based on their ranking.
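The deduplication step described above can be sketched as follows. This is a minimal illustration, not the original tooling: the function name `dedupe_by_tld_plus_1` is hypothetical, and the default TLD+1 extractor naively takes the first label of a domain, which is only an assumption that holds for simple list entries such as google.co.uk.

```python
def naive_tld_plus_1(domain):
    # Assumption: list entries are registrable domains, so the first label
    # is the TLD+1 (e.g., "google" for "google.co.uk").
    return domain.lower().split(".")[0]


def dedupe_by_tld_plus_1(ranked_domains, tld_plus_1_of=naive_tld_plus_1):
    """Keep only the best-ranked domain for each TLD+1.

    ranked_domains is an iterable of (rank, domain) pairs, where a smaller
    rank means a more popular domain.
    """
    kept = {}
    for rank, domain in sorted(ranked_domains):
        key = tld_plus_1_of(domain)
        # The list is processed in rank order, so the first hit is the best one.
        if key not in kept:
            kept[key] = (rank, domain)
    return sorted(kept.values())
```

Applied to google.com (rank 1) and google.co.uk (rank 4), only google.com is kept, as in the example from the text.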
We used the McAfee SmartFilter Internet Database service to retrieve a list of content categories for the websites (McAfee LLC, 2019). We cluster the websites by categories because we want to check if the category of a website has an impact on the usage of cookies and other privacy-invasive technologies. Previous work has shown that, for example, News websites utilize more third parties (e. g., ad services) than other categories (Sørensen and Kosta, 2019). In total, 85 different categories are assigned to the websites of the dataset. An overview of the 15 most prominent categories is given in Figure 1. In the remainder, we limit the analyzed categories to the top eight categories and combine all remaining categories in “Other”. Additionally, we group the websites by the following buckets based on the website’s rank in the used list: (1) rank , (2) rank , (3) rank , and (4) rank . Due to the removal of duplicate domains, bucket (4) holds these 607 domains, 6.1 % of all visited domains. We use the buckets to test whether the popularity of websites has an impact on the usage of specific technologies.
4.3. Measurement Framework
To measure the dynamics of websites, we utilize the OpenWPM platform (Englehardt and Narayanan, 2016). For each visit, we use the same user agent (Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0) and desktop resolution (1366x768), allow all third party cookies, do not set the “Do Not Track” HTTP header or use other privacy-preserving techniques (e. g., anti-tracking extensions), and use standard bot mitigation techniques to disguise our crawler (i. e., random scrolling and mouse jiggling). Furthermore, the browser adopts other properties from the operating system (Ubuntu 18.04). Aside from our bot mitigation techniques, we do not interact with the visited websites in any way; limitations of this approach are discussed in Section 6. While a website might detect our crawler, it is not detected by current mechanisms seen in the wild, as presented by Jonker et al. (Jonker et al., 2019).
As the Web is highly dynamic, any attempt to measure it is quite challenging. To get a comprehensive view of cookie and third party usage, we conducted a pre-study to approximate which measurement parameters to use (e. g., the number of subsites to visit) while limiting the crawling time and generated traffic to a reasonable amount. In the following, we limit our pre-study to TP cookies as prior work extensively analyzed those (Gonzalez et al., 2017; Dabrowski et al., 2019; Englehardt et al., 2015; Franken et al., 2018; Sanchez-Rola et al., 2019; Degeling et al., 2019; Kristol, 2001), and we want to test whether they might have missed cookies due to their measurement setup. However, in our primary analysis, we also analyze various tracking mechanisms (see Section 5.2). To find the optimal number of subsites to visit, we randomly selected 100 websites (TLD+1) from the top 1,000 websites and visited 25, 50, 75, 100, 250, 500, and 1,000 subsites of these websites. Each configuration was run as a separate measurement on the same set of TLD+1s. We conducted these measurements once using a browser with a profile that already has some cookies present in the local cookie store and once with a vanilla browser to see if active cookies influence cookie usage. We filled the local cookie store by randomly visiting 100 websites from the top 1,000 websites and used the resulting cookie store. In a separate measurement, we visited the landing page of the selected websites 1,000 times and recorded the used cookies to test if there is a difference between users visiting the landing pages and users visiting subsites.
We compared the number of TP cookies set in each measurement of the pre-study and found that subsites of websites typically set significantly more cookies than the respective landing page does. In our measurement, the mean number of cookies used increased by approx. 20 (41 %) when visiting subsites rather than only the landing page. This shows that if one wants to perform cookie/third party measurements, one should always include subsites in the measurement setup rather than only measuring landing pages. Furthermore, we measured a mean increase of 12 cookies (27 %) per website visit if a browser is used that already has cookies in the local cookie storage. When it comes to the change of cookie usage based on the number of visited subsites, we found that the mean number of accessed/set cookies stabilizes around 50 (SD: 100; median at 12) after visiting 100 subsites (see Figure 2). In conclusion, to maximize the number of cookies set, we use a browser profile that has cookies set and visit 100 subsites and the landing page of each website.
4.3.2. Measurement Sequence
We used the same method to create the browser profile for our experiment crawls that we utilized in the pre-study. This profile is loaded before each website visit but is not altered. Hence, each website visit uses the same profile and the order of visited websites does not impact the results. In total, we conduct the measurements from three different locations (Europe (DE), North America (US), and Asia (JP)) to account for possible geographical differences (Dabrowski et al., 2019). For all measurements, we used two computers located at a European university. For each of our regional measurement runs, we created a new distinct browser profile. We used a commercial VPN service (NordVPN) to obtain an IP address from the locations outside the EU. Using a VPN service comes with the risk that it might inject content into the communication stream (Khan et al., 2018). However, we did not find any hints of this practice for the used service, neither in the Terms of Service nor publicly on the Internet.
We configured OpenWPM to visit the landing page of each website and to gather all first-party hyperlinks on that site (subsites) one day before the first measurement. Therefore, some of these links might not be present on the front page anymore at the time we perform the measurements from different regions or might not exist anymore at all. We did so to increase comparability between our measurements since we visited the same landing pages and subsites in each measurement. Additionally, we collected all first-party hyperlinks on the subsites but only used them (in random order) if there were not enough subsites linked on the landing page. Afterward, we chose 100 random subsites that we used during the experiment crawls. In each measurement, we visited 549,715 (SD 16,851) distinct URLs on average.
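The subsite selection described above can be sketched as follows. This is a minimal illustration under assumptions: the function name `select_subsites` and its parameters are hypothetical, and link extraction itself (done with OpenWPM in the study) is outside the scope of the sketch.

```python
import random


def select_subsites(lp_links, subsite_links, n=100, seed=None):
    """Pick up to n subsites, preferring links found on the landing page.

    lp_links: first-party links found on the landing page.
    subsite_links: first-party links found on subsites, used only as a
    fallback (in random order) if the landing page links do not suffice.
    """
    rng = random.Random(seed)
    primary = list(dict.fromkeys(lp_links))  # deduplicate, keep order
    if len(primary) >= n:
        return rng.sample(primary, n)
    seen = set(primary)
    fallback = [u for u in dict.fromkeys(subsite_links) if u not in seen]
    rng.shuffle(fallback)
    return primary + fallback[: n - len(primary)]
```

Fixing the selection once (here via `seed`) and reusing the resulting URL list for every region mirrors the goal of visiting the same landing pages and subsites in each measurement.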
4.4.1. Cookie classification
Cookies can be used for various purposes. We want to assess the specific purposes for which third parties set cookies and which purposes are most dominant to get a better understanding of real-world cookie usage. We use the following cookie type classes defined by the International Chamber of Commerce UK (International Chamber of Commerce UK, 2012): (1) “Strictly Necessary Cookies” are needed to provide basic functionality of a website, (2) “Performance Cookies” aggregate (anonymously) users’ usage of the website, (3) “Functionality Cookies” personalize the website’s usage, and (4) “Targeting/Advertising Cookies” are used to track users or to display personalized ads to them. For our analysis, we used Cookiepedia, a platform that provides public classifications of cookie classes (OneTrust LLC, 2019). This process might be error-prone as cookie classes are assigned by hand, but it is, from our point of view, the best approximation of online cookie usage today. In total, we can classify 45.3 % of all observed cookies.
4.5. Third Party Trees
If not stated otherwise, we use the TLD+1 of a third party domain as the node identifier; otherwise, we use the companies associated with the TLD+1. We use the WhoTracks.me database (Cliqz, 2018) to link domains to the respective companies owning them. Thus, a branch in the tree could consist of multiple domains operated by the same company (e. g., foo.com → googletagmanager.com → googleapis.com → youtube.com). However, we collapsed requests stemming from one company into one leaf. In the previous example, we would not add googleapis.com even if youtube.com loaded a script from that domain. We did so because, otherwise, the resulting trees would be much deeper if several resources were loaded from the same TLD+1. For example, if foo.com was embedded and would then load metric.foo.com, subsequently ad.foo.com, and finally foo.com/?ad_loaded=1, the resulting branch would be much deeper. Overall, the maximum depth using this more lax approach would increase substantially, from eight to 52. Thus, a branch consists of all TLD+1/companies that could perform a task on the client.
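The collapsing of same-company nodes can be sketched as follows. This is a simplified illustration under assumptions rather than the implementation used in the study: load chains are given as lists of domains, the mapping `entity_of` stands in for a domain-to-company database such as WhoTracks.me, and only directly consecutive nodes of the same company are merged. All function names are hypothetical.

```python
def collapse_chain(chain, entity_of):
    """Map domains to companies and merge consecutive same-company nodes."""
    collapsed = []
    for domain in chain:
        # Domains without a known company keep their own identifier.
        entity = entity_of.get(domain, domain)
        if not collapsed or collapsed[-1] != entity:
            collapsed.append(entity)
    return collapsed


def build_tree(chains, entity_of):
    """Merge collapsed load chains into one nested-dict tree."""
    root = {}
    for chain in chains:
        node = root
        for entity in collapse_chain(chain, entity_of):
            node = node.setdefault(entity, {})
    return root


def depth(tree):
    """Number of nodes on the longest root-to-leaf path."""
    return 0 if not tree else 1 + max(depth(child) for child in tree.values())
```

With `entity_of` mapping googletagmanager.com, googleapis.com, and youtube.com to one company, the chain from the example collapses to two nodes. A company that reappears later in a branch after a different party stays a separate node in this sketch.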
An example of a third party tree is given in Figure 3, including the companies’ names, not TLD+1s. The tree shows the visited website (adidas.com), the directly embedded third parties (MediaMath, TrustArc, and Adobe), the partners of these third parties (fourth parties, e. g., Improve Digital), and further embedded services, e. g., Akamai or Instana. The services that actively set cookies are marked with a [C]. The example illustrates that by embedding a single service, many other direct partners of that third party might be embedded into a website (e. g., MediaMath embeds four partners). Furthermore, embedding a single third party might implicitly lead to a long branch of direct and indirect partners of the used third party (e. g., Adobe). Note that at depth four, a service from Adobe is embedded. This is not a loop; the previously loaded party simply utilizes a different service of Adobe.
We conducted our measurements in the second quarter of 2019 and found around 93 % of the landing pages in our dataset to be accessible. The remaining websites provided services that seem not to be intended for rendering in a web browser (e. g., APIs) or did not exist anymore. In total, we visited over 1.5 million websites that embedded over 37,000 third parties producing over 4.5 TB of data. More than 17,000 third parties access/set over 59 million cookies across all website visits in our experiment. An overview of our measurements is given in Table 1.
|Region||Websites||Subsites||TPs||C TPs||Cookies|
5.1. General Overview
First, we tested how many cookies are set/accessed when visiting subsites in contrast to the respective landing pages to assess the potential bias in previous studies that focused on the landing page only. In our measurements, as shown in Figure 4, subsites set considerably more (36 %) cookies than the respective landing pages. On average, 55 cookies were set when loading a landing page while 78 were set when a subsite was accessed. The difference between the number of cookies used by third parties is statistically significant (ANOVA test) when comparing (1) different categories and (2) landing pages to subsites. However, we did not find a statistically significant effect of the originating region of the visit or the rank of the website on the cookie setting behavior. Our results show that landing pages of websites show a different cookie usage behavior than the respective subsites, as the latter make more use of third parties. To get a better understanding of the implications of increased cookie usage, we analyze the primary purposes for which cookies are set.
5.1.1. Lifetime and Cookie Types
Aside from the number of cookies set, it is interesting to analyze why they are set and how long they stay active in the browser. Overall, we could classify 45.3 % of all observed cookies in terms of distinct used keys. Regarding absolute numbers, we could classify 74 % of all observed cookies. Most of the observed cookies are used to track website visitors or to provide targeted ads (99 %). The “type” of the cookie shows a strong correlation with the number of cookies set for this type. This means that specific types of cookies are set more often than others. Furthermore, the purpose of a cookie is not related to its lifetime; a test does not show a correlation between “type” and “lifetime”. Moreover, third parties use similar types and lifetimes for their cookies, no matter which website they are embedded in. We did not find a correlation between the “type” or “lifetime” of a cookie and the website’s category. Our results show that cookies are overwhelmingly used to track users or to provide them with targeted ads. In addition, cookies in all categories use various lifetimes. Given the primary purpose of cookies (“Targeting/Advertising”) and the measured increased usage of cookies on subsites, we see that subsites show different behavior in that regard (see also Section 5.2). Tracking users on subsites provides a more comprehensive view of their online activities. For example, visiting the landing page of an online shop does not necessarily indicate which products a user is interested in, but this information can be extracted on subsites.
5.1.2. Legal Compliance
With the introduction of the General Data Protection Regulation (GDPR) (The European Parliament and the Council of the European Union, 2016) and the California Consumer Privacy Act (CCPA) (California State Legislature, 2020), service providers have to be more aware of the business partners they work with. If a business partner tracks users or uses personal information in other ways and is not located in a GDPR adequate member state (Government Digital Service, 2019) or not a member of the Privacy Shield (The International Trade Administration, 2019), both parties need to agree on a data processing contract (Article 28 §3 GDPR) which ensures that “appropriate safeguards” (Article 46 §1 GDPR) are taken to enforce the privacy rights of EU citizens. Based on the IP addresses observed in our measurements (see Section 4.3), we analyzed if connections were established to IP addresses that are associated with countries that are neither a member of the EEA nor part of the Privacy Shield. In the remainder of the paper, we call these parties “non-adequate” or “possibly problematic” to improve the reading flow of this work. Note that every business can agree by contract that the data of EU citizens are processed according to EU legislation and, therefore, these parties might pose no problem at all (Article 28 §3 GDPR). However, the current legal debate only focuses on TPs as “joint controllers” (Higher Regional Court, Düsseldorf, Germany, 2018; European Court of Justice, 2018) and does not cover fourth or further parties. We want to highlight that a binary classification of what is compliant with legal regulation and what is not is impossible to make without looking at the specific service agreements between websites and third parties.
Figure 5 shows the origins and targets of all requests for which service providers need to make sure that they have taken appropriate safeguards. These numbers only refer to our EU measurement, and the results are not violations of the legislation but provide insights into potential data flows that might conflict with the legal requirements. The origins/targets are based on the observed IP addresses in our measurements. Overall, 4.7 % of all cookies were set by services outside adequate geolocations and only 7.1 % of the visited domains (TLD+1) exclusively used TPs that are located at adequate geolocations. Domains using only adequate locations are located in the US (59 %), followed by Germany (7 %), and the United Kingdom (3 %). In our dataset, Singapore is the most prevalent target of non-adequate requests (26 %), followed by China (5 %) and Australia (5 %). The US is the most common origin of such requests (63 %), followed by China (6 %) and Germany (5 %). We did not find a statistically significant impact of the region on the question of whether or not a third party from a non-adequate geolocation is used. When looking at the services located in possibly non-adequate geolocations, we found that roughly half (53 %) only sometimes used possibly non-adequate geolocations, while the other half (47 %) always did. Overall, roughly 10 % of all observed TPs used IP addresses in possibly problematic geolocations.
In the following, we analyze the services that sometimes use adequate and sometimes non-adequate geolocations. This is an interesting subset as service providers might not be aware of the possibility that these TPs change their geolocations over time. In contrast, third parties that always send data to possibly problematic geolocations are easier to identify and, therefore, the transfer of data to these non-adequate countries is likely part of the data processing contracts. Requests to TPs that only sometimes used adequate geolocations were most of the time resolved to an EU IP address but sometimes to addresses outside the EU. For example, sometimes a similar resource of a third party was requested from different locations in the same measurement. That is, the URL csm.ad-network.foo was resolved to sgp.csm.ad-network.foo in Singapore and nl.csm.ad-network.foo in the Netherlands. This is challenging as service providers cannot ensure that only EU endpoints of the used third party are used. In our measurement, gstatic.com (a service operated by Google) with 20 % of all inclusions of possibly non-adequate services and upravel.com (a Russian advertising service) with 15 % are the top services that might pose a problem to service providers. The next service only accounts for 1 % of these possibly conflicting services (i. e., there is a long tail distribution). One likely explanation is that these are effects of load balancing or similar techniques and that the servers belonging to these IP addresses are controlled by the same third party. However, service providers need to account for this behavior in the data processing contracts with the TP, and the TP must assure that GDPR adequate data processing rules are in place no matter where their servers are located.
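The distinction between third parties that always and those that only sometimes resolve to possibly non-adequate geolocations can be sketched as follows. This is a minimal illustration under assumptions: the set `ADEQUATE` is a hypothetical, tiny excerpt of country codes, the input is assumed to be already geolocated (one country code per observed request), and the function name is hypothetical.

```python
from collections import defaultdict

# Assumption: hypothetical excerpt of country codes treated as adequate.
ADEQUATE = {"DE", "NL", "FR"}


def classify_third_parties(observations, adequate=ADEQUATE):
    """Classify third parties by the geolocations their requests resolved to.

    observations: iterable of (third_party, country_code), one per request.
    """
    seen = defaultdict(set)
    for third_party, country in observations:
        # Record whether this single request went to an adequate location.
        seen[third_party].add(country in adequate)
    labels = {}
    for third_party, flags in seen.items():
        if flags == {True}:
            labels[third_party] = "always adequate"
        elif flags == {False}:
            labels[third_party] = "always non-adequate"
        else:
            labels[third_party] = "sometimes non-adequate"
    return labels
```

In this sketch, a service such as csm.ad-network.foo that is resolved once to Singapore and once to the Netherlands ends up in the "sometimes non-adequate" group.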
5.2. Replication and Comparison
To provide a more comprehensive overview of our measurements in comparison with previous work, we tried to replicate the main findings of previous work using our data set. We differentiate between studies we could replicate using our data (●, see column “Rep.” in Table 2) and studies we could only partly replicate (◑). Furthermore, we indicate (“Res.”) if we could produce similar results (✓). To reproduce the results, we analyzed the landing pages of each website (if the paper did so) or used the same number of subsites. If we could replicate the results, we measured them on all visited subsites to test if these studies measured a comprehensive, generalizable view or if, as shown in our study, subsites show a different behavior (“Scales”). We indicate if visiting subsites makes a measurable difference in contrast to only visiting landing pages (✗). The results are given in Table 2. Our replication studies do not aim to replicate all results of previous work; we only focus on the main takeaways and results closely related to our work. We do not claim that our replications are sound or complete, but we tried to replicate previous work as faithfully as possible using our data set.
In contrast to Dabrowski et al. (Dabrowski et al., 2019), and as previously stated, we could not find statistical evidence that the originating region of a request influences cookie setting practices in general. On the one hand, this could be a result of different experimental setups as we tried to maximize the “cookie setting behavior” of each website to achieve more generalizable results. Dabrowski et al. used a headless browser that can be easily detected by websites and, therefore, might affect the loaded TPs (e. g., ads might not be loaded to counter ad fraud). On the other hand, we performed our experiment on a larger scale and interacted (e. g., scrolling) with the websites, which could fundamentally affect the results.
|1st Author||Ref.||Year||Venue||Scale||Main finding||Rep.||Res.||Scales|
|Dabrowski||(Dabrowski et al., 2019)||2019||PAM||LP||Websites set 49 % fewer cookies if users located in the EU visit them.||●||✗||✓|
|Sørensen||(Sørensen and Kosta, 2019)||2019||WWW||LP + 9 SB||Effects of the GDPR on third-party usage are not definite.||●||✓||✗|
|Sanchez-Rola||(Sanchez-Rola et al., 2019)||2019||AsiaCCS||LP||Tracking is often still present even if opted-out.||◑||✓||✗|
|Urban||(Urban et al., 2020)||2020||AsiaCCS||LP + 3–5 SB||Cookie syncing reduced by around 40 %.||◑||✓||✓|
|Merzdovnik||(Merzdovnik et al., 2017)||2017||EuroS&P||LP + 2 SB||State of the art tracking blocking tools can limit user tracking but still have blind spots.||◑||✓||✓|
|Englehardt||(Englehardt and Narayanan, 2016)||2016||CCS||LP||Websites use various fingerprinting methods.||○||—||✓|
|Kumar||(Kumar et al., 2017)||2017||WWW||LP||Implicitly included TPs pose a challenge when upgrading to HTTPS.||●||✓||✗|
|Ikram||(Ikram et al., 2019)||2019||WWW||LP||Implicitly included parties might pose a security threat.||●||✓||✓|
|Iordanou||(Iordanou et al., 2018)||2018||IMC||LP||In the EU, tracking data is transferred across countries but rarely leaves the EU.||●||✓||✓|
Furthermore, we found that subsites set significantly more cookies than the respective landing pages. As for the results of Sørensen and Kosta (Sørensen and Kosta, 2019), we could verify that the GDPR has no immediate effect on third party usage. Sanchez-Rola et al. (Sanchez-Rola et al., 2019) show that opting out of cookies often has no measurable effect on cookie setting practices in the field. We could only partly reproduce this work as we never interacted with any cookie banners, but our results show that cookies are still widely used and that there are no regional differences, even though users in the EU should opt in before cookies are used. We used data from our prior work that was collected before the GDPR became effective (Urban et al., 2020). Using this data and comparing the regional data in our experiments, we could verify that cookie syncing seems to be influenced by different legislation. Scaled to our collected data, we found an increase of cookie syncing activities on subsites in contrast to landing pages. This replication cannot be seen as representative because we only used one profile in each region and, therefore, our measurement misses features that are essential to assess cookie syncing, especially the means to identify IDs.
To test whether our result of increased cookie usage on subsites also applies to user tracking, we use the numbers presented by Merzdovnik et al. (Merzdovnik et al., 2017) on the presence of trackers on websites as a baseline. To test if a tracker is active on a website, we use the EasyPrivacy List (EasyList, 2019), which is a list combining URLs of known trackers. However, we do not test whether anti-tracking tools are useful or not. In our measurement, we found that trackers occur more often on subsites than on their respective landing pages (an increase of approx. 6 %). 2.5 % of the measured websites do not embed any trackers on the landing page but use trackers on subsites. Overall, we could show that tracking on subsites increases and that future work in this area should include subsites in its measurements. In terms of overall tracking occurrence, we produced results comparable to the “plain” profile used by Merzdovnik et al. Finally, we tested the prevalence of device fingerprinting scripts in our data set, as previously studied by Englehardt et al. (Englehardt and Narayanan, 2016). The scripts identified by Englehardt et al. are probably outdated, as we only found four of them in our total data set. Therefore, we used the popular “Fingerprint2” library (Fingerprint.js, 2019) to test for the presence of such trackers. Hence, our results can be seen as a lower bound as we only test for the presence of one script. We identified a mean increase of device fingerprinting of 25 % on subsites in contrast to the respective landing pages. In all three measurements, we found 13 domains (0.14 %) that did not use the script on the front page but did use it on subsites. Overall, we found the tracking script on 0.15 % of the landing pages, while Englehardt et al. identified device fingerprinting on 1.8 % and the most common script on 0.45 % of the analyzed websites.
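The comparison between landing pages and subsites can be illustrated with a minimal Python sketch. It is not the measurement code used in this study: the domain set, the function names, and the input format are assumptions made for illustration, and real EasyPrivacy rules are considerably richer than the plain domain matching shown here.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for a tracker list. The study used the EasyPrivacy
# list; plain domain names are a simplification of its rule syntax.
TRACKER_DOMAINS = {"tracker.example", "metrics.example"}


def is_tracker(url, tracker_domains=TRACKER_DOMAINS):
    """Return True if the URL's host equals, or is a subdomain of, a listed domain."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in tracker_domains)


def subsite_only_tracking(visits):
    """Find sites that embed trackers on subsites but not on the landing page.

    visits: iterable of (site, is_landing_page, request_url) tuples.
    """
    on_landing, on_subsite = set(), set()
    for site, is_landing, url in visits:
        if is_tracker(url):
            (on_landing if is_landing else on_subsite).add(site)
    return on_subsite - on_landing
```

Dividing the size of the returned set by the number of visited sites yields the share of websites whose tracking would be missed by a landing-page-only measurement.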
In this section, we demonstrated that only measuring landing pages hides the scale of different phenomena observable on the Web. Furthermore, the behavior of TPs differs on different subsites, which raises the question to what extent service providers are in control of TPs embedded into their services. To tackle this challenge, one needs to understand relations between TPs and the determinism of which third parties will be loaded into a service.
5.3. Third Party Trees
5.3.1. Cookies Set in Trees
Not every party in each TPT, more specifically in each branch, will necessarily set a cookie. Therefore, we analyzed the depth of the cookie setting parties and the overall number of cookies set in each branch. We limit ourselves to cookies but expect, based on our results presented in Section 5.2, that other privacy-invasive techniques would likely produce similar results. Starting with the depth of set cookies, on average, 1.5 parties in each branch do not set a cookie. In 48 % of all branches, no party sets a cookie, and in only 125 branches (approx. %) do all parties set one. The website’s category and its rank both show a statistically significant effect on the number of cookies set in each branch (both p-values ). Furthermore, we found that deeper branches do not necessarily, in relative numbers, lead to more cookies being set. As for the depth at which cookies are set, we found that most cookies (72 %) are set by the fourth party (i. e., at depth one). The main reason why most cookies are set at depth one is likely that most trees are of depth one. Hence, deeper trees occur less often and, consequently, in absolute numbers, set fewer cookies.
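The per-branch statistics discussed above can be sketched in a few lines of Python. This is an illustrative sketch only: the representation of a branch as a list of cookie counts and all names are assumptions, not the data model used in our measurement.

```python
from collections import Counter


def cookie_stats(branches):
    """Summarize cookie setting per branch.

    branches: non-empty list of non-empty branches. Each branch is a list of
    integers, one per party, ordered by depth (index 0 is the directly
    embedded third party); each integer is the number of cookies that party set.
    """
    n = len(branches)
    # Number of parties in each branch that set no cookie at all.
    non_setting = [sum(1 for c in branch if c == 0) for branch in branches]
    none_set = sum(1 for branch in branches if all(c == 0 for c in branch))
    all_set = sum(1 for branch in branches if all(c > 0 for c in branch))

    # Cookies per depth, summed over all branches.
    per_depth = Counter()
    for branch in branches:
        for depth, cookies in enumerate(branch):
            per_depth[depth] += cookies
    total = sum(per_depth.values())

    return {
        "mean_non_setting": sum(non_setting) / n,
        "share_none": none_set / n,
        "share_all": all_set / n,
        "depth_share": {d: c / total for d, c in per_depth.items()} if total else {},
    }
```

Because `depth_share` is computed from absolute cookie counts, frequent shallow branches dominate it, which mirrors the observation that deeper trees set fewer cookies in absolute numbers.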
5.3.2. Determinism of Third Party Trees
The determinism of each branch that is generated by an embedded TP is important if service providers want to understand which TPs are loaded and who is responsible for loading them. If it is known, before loading the third party object, which other third parties might be embedded, service providers can evaluate the potential risks of a TP for their users. Therefore, we tested the fluctuation of embedded companies for each TP in the measured trees. First, we tested the fluctuation within each visited website (TLD+1) and its subsites, i. e., we tested which third parties are embedded into the visited website by each observed third party on a specific subsite in a specific region. Second, we tested the fluctuation across all websites and all regions, i. e., we tested whether a broader view of a third party provides more insight into the further loaded parties or whether third parties behave differently on different websites.
Half of the branches (50.4 %) have at least one fluctuating partner in them. Figure 7 shows the measured fluctuation of a TP within (gray) and across (black) the visited sites. The x-axis shows the relative amount of fluctuating companies in all branches of an embedded third party. Zero means that the branches of this TP always include the same third parties, and six means that six distinct TPs occurred only in some of the branches. These numbers exclude third parties that never had any children because these would naturally be zero and might lead to a false conclusion about the determinism of TPs. The results show that almost two thirds (62 %) of the third parties that embed other third parties use fluctuating partners (e. g., due to real-time ad bidding) when loaded on different subsites. Across all regions, we see a long-tail distribution of companies that only occur in some of the branches (note the increase at more than six new children). Regarding the originating region in which the measurement was performed, we found no statistically significant impact on the fluctuation. However, the weighted mean (local) fluctuation was highest in the US (5.78) and lowest in the EU (5.49).
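One way to make the notion of fluctuation concrete is the following Python sketch. It assumes that fluctuating partners are the children of a TP that appear in some, but not all, observations of that TP; the function name and the input format are hypothetical and do not reproduce our analysis code.

```python
from collections import defaultdict


def fluctuation(observations):
    """Count fluctuating children per parent.

    observations: iterable of (parent, children) pairs, one per observed load
    of `parent`; `children` holds the parties that the parent loaded then.
    Returns {parent: number of distinct children seen in some but not all
    observations}. Parents that never had any children are excluded.
    """
    seen = defaultdict(list)
    for parent, children in observations:
        seen[parent].append(frozenset(children))

    result = {}
    for parent, child_sets in seen.items():
        union = frozenset().union(*child_sets)
        if not union:
            continue  # never loaded anything; would trivially be zero
        common = frozenset.intersection(*child_sets)
        result[parent] = len(union - common)
    return result
```

Using `(website, third_party)` tuples as the `parent` key gives the fluctuation within a website, whereas using only the third party as the key pools all websites and regions.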
On a global scale, we find a different picture. We see that the global fluctuation in the EU is more distributed than it is in other regions. We found no statistical evidence that the region affects the local or global fluctuation of children. In conclusion, we see that measuring TPs on a global scale does not necessarily provide a generalizable view as some TPs behave differently on different sites (e. g., due to the advertised products or partners in different regions). Our results show that the list of third parties embedded in a website is not deterministic, which makes it challenging for service providers to account for all TPs that might be present on their websites. Embedding some third parties leads to a frequently changing set of embedded third parties (e. g., different TPs providing ads). However, service providers have only little control over these processes as they often depend on third parties to provide their service. As the (non-)determinism of these trees is related to the embedded TP, it is interesting to analyze the depth of trees generated by different TPs (companies).
Figure 8 shows the average, scaled branch depth that is created by embedding a single object of different companies. All values are scaled for each company, not overall, and include all TLDs+1 operated by the company. Thus, Figure 8 presents the resulting depth of each company and does account for the overall occurrence of each company. Furthermore, the figure only lists the top 15 companies with regard to the absolute number of embeddings of these companies. All remaining companies are combined in the category “Other”. The top companies account for over 98 % of absolute third party embeddings. In general, embedding most TPs results in short trees of depth zero. However, ad-tech companies, the primary source of financing for many websites, show a wider spread of resulting TPT depths (e. g., PubMatic or Rubicon Project), which reduces the options to choose partners that do not load many other partners. We found statistical significance that the embedded company impacts the depth of the generated tree (p-value ). Regarding the position of companies in the trees, we found that larger companies (e. g., Google or Facebook) occur mostly at depth zero (absolute numbers) while service providers of TPs (e. g., companies that counter ad fraud) occur deeper in the trees.
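The per-company scaling can be sketched as follows. The sketch assumes that each embedding is recorded as a pair of company and resulting branch depth; the function name, the input format, and the `top_n` parameter are illustrative assumptions rather than the code used to produce the figure.

```python
from collections import Counter, defaultdict


def scaled_depths(embeddings, top_n=15):
    """Compute, per company, the share of its embeddings at each branch depth.

    embeddings: iterable of (company, depth) pairs, one per embedded object.
    Companies outside the `top_n` most frequently embedded ones are merged
    into the category "Other". Shares are scaled per company, not overall.
    """
    embeddings = list(embeddings)
    totals = Counter(company for company, _ in embeddings)
    top = {company for company, _ in totals.most_common(top_n)}

    counts = defaultdict(Counter)
    for company, depth in embeddings:
        counts[company if company in top else "Other"][depth] += 1

    return {
        company: {d: n / sum(ctr.values()) for d, n in ctr.items()}
        for company, ctr in counts.items()
    }
```

Since every row sums to one, a rarely embedded company and a very common one are directly comparable with respect to the tree depth they typically cause.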
Our results indicate that it is quite challenging for service providers to keep track of all third parties that might be embedded into their services. Furthermore, before loading the directly embedded TP, it is often not definite which other parties might be loaded—especially ad networks load various fluctuating partners.
In the following, we discuss limitations of our work. We use the classification of Cookiepedia, which might be wrong to some extent and is incomplete. We could only classify slightly over 45 % of all observed cookies but show that an overwhelming majority (99 %) of them track users or serve targeted advertisements. We mapped requests from different services to a single company, if possible. If we observed multiple requests to domains owned by one company (e. g., ads.foo.com and fonts.foo.com), we collapsed them into a single request if they occurred in sequence. Our measurement platform, a customized OpenWPM instance, does not interact with any cookie banners that are present on the visited websites. Hence, we do not capture cookies set by third parties that honor opt-in choices of (European) users. However, previous works demonstrated that cookie consent notices often do not offer choices to opt in (Utz et al., 2019), do not work at all (Sanchez-Rola et al., 2019), and that the used libraries often are not compliant with current legislation (Degeling et al., 2019). Therefore, our results are a lower bound since (1) we shortened the TPTs and (2) some cookies might only be used after affirmative action of the user.
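The collapsing step, which shortens the TPTs, can be illustrated with a short Python sketch. The domain-to-company mapping and the function name are hypothetical; the sketch only shows the idea of merging consecutive requests that belong to the same company.

```python
from itertools import groupby

# Hypothetical mapping from request domains to the owning company.
COMPANY_OF = {
    "ads.foo.com": "Foo",
    "fonts.foo.com": "Foo",
    "cdn.bar.net": "Bar",
}


def collapse_chain(domains, company_of=COMPANY_OF):
    """Map a loading chain of domains to companies and merge consecutive repeats.

    Domains without a known owner are kept as they are. Only requests that
    occur in sequence are merged, so a company may reappear later in the chain.
    """
    companies = [company_of.get(domain, domain) for domain in domains]
    return [company for company, _ in groupby(companies)]
```

For the chain `["ads.foo.com", "fonts.foo.com", "cdn.bar.net", "ads.foo.com"]`, the function returns `["Foo", "Bar", "Foo"]`, i. e., the branch is one level shorter than the raw request chain.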
We have shown the challenges service providers face when they rely on third-party code and try to account for which third parties are loaded when users use their service. It is the highly dynamic nature and previously nominal regulation of the Web that now presents challenges to service providers. While service providers might carefully select the directly embedded third parties (e. g., ad networks), they cannot control which third parties might get included when these third parties load their content (e. g., due to real-time ad bidding). The primary tools website providers have to solve these challenges are data processing contracts that include indirectly embedded third parties. From a research perspective, we have shown that a simple horizontal scaling of websites to visit (i. e., websites from a given toplist) is not sufficient to measure a phenomenon of interest. This means that (1) future work should scale its experiments vertically and (2) previous results of different Web measurement areas should be revisited to measure the given challenges adequately. Finally, our assessment of the purposes of cookies underlines the dire need for privacy protection mechanisms that limit cookie-based tracking, which is currently promoted by several browser vendors (e. g., Firefox (Mozilla Corporation, 2019), Chrome (Google Inc., 2020), and Safari (Apple Inc., 2019)).
In this work, we have analyzed the cookie setting practices of the top 10k websites on the Web. We found that 99 % of all cookies we could classify were set with the intention to track users or to serve them targeted ads. Furthermore, we modeled third party trees, which assemble all third parties embedded into a website and the loading dependencies among them. By analyzing the third party trees, we found that the median depth of such trees is one (max eight), that there is a severe fluctuation of children in different branches with the same parent node (third party), that especially ad networks result in longer tree branches, and that only 7 % of all visited websites (TLD+1) never embedded a third party that might pose possible legal problems. Moreover, we have shown that studies that only measure landing pages of websites miss a substantial number of embedded third parties and set cookies.
Acknowledgements. This work was partially supported by the Ministry of Culture and Science of the State of North Rhine-Westphalia (MKW grants 005-1703-0021 “MEwM” and Research Training Group NERD.nrw). We would like to thank Cybot (Cookiebot) for their support.
- FPDetective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM Conference on Computer and Communications Security, CCS ’13, New York, NY, USA, pp. 1129–1140.
- Flash & the future of interactive content. Accessed: 2019-10-05.
- Intelligent Tracking Prevention 2.3. Accessed: 2019-04-24.
- The California Consumer Privacy Act.
- WhoTracks.me data - tracker database. Accessed: 2019-04-24.
- Measuring cookies and web privacy in a post-GDPR world. In Proceedings of the 2019 Conference on Passive and Active Measurement, PAM ’19, Cham.
- We Value Your Privacy … Now Take Some Cookies: Measuring the GDPR’s Impact on Web Privacy. In Proceedings of the 2019 Symposium on Network and Distributed System Security, NDSS ’19, San Diego, California, USA.
- EasyPrivacy. Accessed: 2019-04-24.
- Online tracking: a 1-million-site measurement and analysis. In Proceedings of the 2016 ACM Conference on Computer and Communications Security, CCS ’16, New York, NY, USA, pp. 1388–1401.
- Cookies that give you away: the surveillance implications of web tracking. In Proceedings of the 24th World Wide Web Conference, WWW ’15, New York, New York, USA, pp. 289–299.
- Unabhängiges Landeszentrum für Datenschutz Schleswig-Holstein vs Wirtschaftsakademie Schleswig-Holstein GmbH - case C‑210/16.
- Fingerprint.js is the most advanced open-source fraud detection JS library. Accessed: 2019-04-24.
- Who’s in control of your control system? Device fingerprinting for cyber-physical systems. In Proceedings of the 2016 Symposium on Network and Distributed System Security, NDSS ’16, San Diego, California, USA.
- Missed by Filter Lists: Detecting Unknown Third-Party Trackers with Invisible Pixels. In Proceedings of the 20th Privacy Enhancing Technologies Symposium, PETS ’20, Berlin, Heidelberg.
- Who left open the cookie jar? A comprehensive evaluation of third-party cookie policies. In Proceedings of the 27th USENIX Security Symposium, SEC ’18, Berkeley, CA, USA, pp. 151–168.
- Hiding in the crowd: an analysis of the effectiveness of browser fingerprinting at large scale. In Proceedings of the 2018 World Wide Web Conference, WWW ’18, New York, New York, USA.
- Building a more private web: A path towards making third party cookies obsolete. Accessed: 2020-01-15.
- Countries in the EU and EEA. Accessed: 2019-10-05.
- Opinion of Advocate General Bobek on Fashion ID GmbH & Co. KG vs Verbraucherzentrale NRW eV - case C‑40/17.
- The chain of implicit trust: an analysis of the web third-party resources loading. In Proceedings of the 2019 World Wide Web Conference, WWW ’19, New York, NY, USA, pp. 2851–2857.
- Cookie guide. International Chamber of Commerce UK.
- Tracing cross border web tracking. In Proceedings of the 2018 Internet Measurement Conference, IMC ’18, New York, NY, USA, pp. 329–342.
- Fingerprint surface-based detection of web bot detectors. In Proceedings of the 2019 European Symposium on Research in Computer Security, ESORICS ’19, Cham, pp. 586–605.
- An empirical analysis of the commercial VPN ecosystem. In Proceedings of the 2018 Internet Measurement Conference, IMC ’18, New York, NY, USA.
- MineSweeper: an in-depth look into drive-by cryptocurrency mining and its defense. In Proceedings of the 2018 ACM Conference on Computer and Communications Security, CCS ’18, New York, NY, USA, pp. 1714–1730.
- HTTP cookies: standards, privacy, and politics. ACM Trans. Internet Technol. 1 (2), pp. 151–198.
- Malvertising: a case study based on analysis of possible solutions. In Proceedings of the 2017 International Conference on Inventive Computing and Informatics, ICICI ’17, San Francisco, United States, pp. 288–291.
- Fingerprinting mobile devices using personalized configurations. Proceedings of the Privacy Enhancing Technologies Symposium 2016 (1), pp. 4–19.
- Tranco: a research-oriented top sites ranking hardened against manipulation. In Proceedings of the 26th Annual Network and Distributed System Security Symposium, NDSS ’19, San Diego, California, USA.
- GeoIP databases & services. Accessed: 2019-10-05.
- Third-party web tracking: policy and technology. In Proceedings of the 2012 IEEE Symposium on Security and Privacy, S&P ’12, San Francisco, United States, pp. 413–427.
- Customer URL ticketing system. Accessed: 2019-10-05.
- Block me if you can: a large-scale study of tracker-blocking tools. In Proceedings of the 2017 IEEE European Symposium on Security and Privacy, EuroS&P ’17, San Francisco, United States, pp. 319–333.
- Today’s Firefox Blocks Third-Party Tracking Cookies and Cryptomining by Default. Accessed: 2019-04-24.
- Cookieless monster: exploring the ecosystem of web-based device fingerprinting. In Proceedings of the 2013 IEEE Symposium on Security and Privacy, S&P ’13, San Francisco, United States, pp. 541–555.
- PriVaricator: deceiving fingerprinters with little white lies. In Proceedings of the 24th World Wide Web Conference, WWW ’15, New York, New York, USA, pp. 820–830.
- Cookiepedia. Accessed: 2019-10-05.
- I do not know what you visited last summer: protecting users from third-party web tracking with TrackingFree browser. In Proceedings of the 2015 Symposium on Network and Distributed System Security, NDSS ’15, San Diego, California, USA.
- tldextract 2.2.1. Accessed: 2019-04-24.
- Digging into browser-based crypto mining. In Proceedings of the 2018 Internet Measurement Conference, IMC ’18, New York, NY, USA, pp. 70–76.
- Can I Opt Out Yet?: GDPR and the Global Illusion of Cookie Control. In Proceedings of the 2019 ACM Symposium on Information, Computer and Communications Security, AsiaCCS ’19, New York, New York, USA, pp. 340–351.
- Data mining methods for malware detection using instruction sequences. In Proceedings of the 26th International Conference on Artificial Intelligence and Applications, AIA ’08, Anaheim, CA, USA, pp. 358–363.
- Malvertising – exploiting web advertising. Computer Fraud & Security 2011 (4), pp. 11–16.
- Before and after GDPR: the changes in third party presence at public and private European websites. In Proceedings of the 2019 World Wide Web Conference, WWW ’19, New York, New York, USA.
- Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Official Journal of the European Union, L 119/1.
- The privacy shield. Accessed: 2019-10-05.
- “Your hashed IP address: Ubuntu.”: perspectives on transparency tools for online advertising. In Proceedings of the 35th Annual Computer Security Applications Conference, ACSAC ’19, New York, NY, USA, pp. 702–717.
- A study on subject data access in online advertising after the GDPR. In Data Privacy Management, Cryptocurrencies and Blockchain Technology, C. Pérez-Solà, G. Navarro-Arribas, A. Biryukov, and J. Garcia-Alfaro (Eds.), DPM ’19, Cham, pp. 61–79.
- Measuring the impact of the GDPR on data sharing. In Proceedings of the 15th ACM Symposium on Information, Computer and Communications Security, AsiaCCS ’20, New York, NY, USA.
- (Un)informed consent: studying GDPR consent notices in the field. In Proceedings of the 2019 ACM Conference on Computer and Communications Security, CCS ’19, New York, NY, USA.
- FP-STALKER: Tracking Browser Fingerprint Evolutions. In Proceedings of the 39th IEEE Symposium on Security and Privacy, S&P ’18, San Francisco, United States, pp. 728–741.
- Device fingerprinting in wireless networks: challenges and opportunities. IEEE Communications Surveys Tutorials 18 (1), pp. 94–104.