Every year, on average, 300 billion dollars are spent globally on digital advertising (300billions, ), and its dominant segment is programmatic ad-buying, which accounts for 84.9% of the ad market in the US (programmatic2019, ). Real-Time Bidding (RTB) auctions are the backbone of this programmatic ad-buying ecosystem: there, digital advertisers bid to win an “impression”, i.e., the right to display their advertisement on part of the real estate of a web page rendered by the browser of a user. Following a real-time auction, the winning bid is communicated via RTB and is proxied through the browser of the end user to assure the winning bidder that the ad they were charged for was correctly rendered. This exact notification channel opens the way to estimating the value of individuals for online advertisers by tallying up their winning bids.
The work of Olejnik et al. (lukasz2014selling-privacy-auction, ) monitored RTB traffic to compute detailed analytics on the advertising value of different individuals. Following up on this work, Papadopoulos et al. (imcRTB2017, ) showed how to compute such analytics even when winning bids are encrypted. Specifically, by running purpose-built probe campaigns, they obtained the necessary ground truth to train classifiers that use features of users and auctions to estimate the values of encrypted winning bids. The above-mentioned work used data from a mobile proxy and analysed them offline to demonstrate the effectiveness of the method. A corresponding online system was sketched but was not implemented beyond the mock-up stage.
In this paper, we present YourAdvalue: a first-of-its-kind, full-fledged system that allows users to know how much money the programmatic advertising ecosystem pays to show them ads. By leveraging the methodology of previous work (imcRTB2017, ), YourAdvalue takes into account both cleartext and encrypted winning bids (or charge prices). An important component of this methodology is the ability of users to contribute the ad prices they receive, along with the auction’s metadata, thus allowing the system to re-train its classifiers and improve its accuracy in estimating the encrypted prices. To preserve the privacy of this operation, YourAdvalue anonymizes data before transmission and applies feature aggregation to achieve a minimal surprisal rate.
Contributions. In summary, this paper makes the following main contributions:
We design YourAdvalue: a full-fledged service consisting of an end-user part, in the form of a browser extension, and a corresponding back end based on the methodology of previous work (imcRTB2017, ). To our knowledge, our system is the first practical tool for enhancing the transparency of the ad ecosystem over RTB regarding the pricing dynamics and how users’ personal data affect the cost advertisers pay to deliver them ads. (Footnote 1: A similar tool has been developed for advertising within Facebook, but it is much more limited in scope than ours. See Related Work for details.) YourAdvalue allows its users to contribute their ad prices and metadata and thereby help the trained classifiers remain up-to-date and accurate.
We deploy our system and provide it as open source (Footnote 2: https://youradvalue.tid.es:2222/). It consists of a remote server and a browser extension (Footnote 3: Available for the majority of the most popular web browsers: Google Chrome, Mozilla Firefox, Opera, and Brave). By making the browser extension publicly available on the Chrome Web Store for the last 6 months, we were able to collect data from 200 users who opted in to contribute their data, which we analyze. Results of this analysis show that ad prices have increased by 75% over the last 3 years and that advertisers pay more for younger users.
We describe the possible re-identification attack scenarios an attacker could launch against the back-end of our system in an attempt to reconstruct the reporting sessions of individual users. In addition, we present how YourAdvalue protects the privacy of the reported user data by scrambling them locally on the browser extension before transmission. Finally, we evaluate the impact of the granularity of the reported data on the k-anonymity of users and we study how the different data anonymization levels impact the utility of the reported data and thus the final estimated prices.
2. Background on RTB
A Real-Time Bidding (RTB) auction is a programmatic, instantaneous type of auction, where a publisher’s advertising inventory is bought and sold on a per-ad-slot basis. During an RTB auction, advertisers place bids for an ad slot on a publisher’s website, and the one with the highest bid gets to render its ad on the user’s display. RTB is more efficient in terms of revenue than traditional static ad-buying for both advertisers and publishers. In RTB auctions there are six key players: the publisher, the advertiser, the Supply-Side Platform (SSP), the Ad-exchange (ADX), the Demand-Side Platform (DSP) and the Data Exchange Platform.
2.1. A typical RTB auction
A typical transaction for an ad slot begins with a user visiting a website. This triggers a bid request from the publisher (or SSP) to an ADX, usually including various pieces of information (e.g., the user’s interests, demographics, location, cookie-related info, the minimum acceptable price, etc.). Then, multiple DSPs programmatically submit their impressions and their bids in CPM (i.e., cost per thousand impressions (cpm, )) to the ADX. All bids are sealed, so every participant places only one bid for a particular ad slot; this allows the RTB auction to finish within milliseconds (the entire RTB protocol usually runs in around 100 ms). The ad slot goes to the highest bidder, whose impression is served on the user’s display. The charge price for the ad slot is the second-highest bid, following the Vickrey auction model (vickrey1961counterspeculation, ). The main advantage of this type of auction is that it forces all bidders to have their bids truly reflect what they think the value of the ad slot should be. Note that the minimum acceptable price defined by the publisher can act as the second price in an auction if the second-highest bid falls below it.
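The charging rule described above can be illustrated with a minimal Python sketch. It is not the logic of any real ad exchange: the function name `run_second_price_auction`, the DSP labels and the CPM values are all hypothetical, and ties are resolved arbitrarily by input order.

```python
from typing import Dict, Optional, Tuple

def run_second_price_auction(
    bids_cpm: Dict[str, float], floor_cpm: float
) -> Optional[Tuple[str, float]]:
    """Return (winner, charge price in CPM), or None if no bid reaches the floor.

    Illustrative only: one sealed bid per DSP, second-price charging,
    with the publisher's minimum acceptable price acting as a floor.
    """
    # Rank sealed bids from highest to lowest.
    ranked = sorted(bids_cpm.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] < floor_cpm:
        return None
    winner = ranked[0][0]
    second_bid = ranked[1][1] if len(ranked) > 1 else 0.0
    # The floor replaces the second price when the second bid falls below it.
    return winner, max(second_bid, floor_cpm)

print(run_second_price_auction({"dsp_a": 2.0, "dsp_b": 1.5, "dsp_c": 0.4}, 0.5))
print(run_second_price_auction({"dsp_a": 2.0, "dsp_b": 0.3}, 0.5))
```

In the first call the winner pays the second-highest bid; in the second, the second bid is below the floor, so the floor becomes the charge price.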
2.2. Charge price notification
When an ADX selects the winning bid of an auction, the corresponding bidder must be notified about its win to log the successful entry and the price to be paid to the ADX. This happens with a notification message (or notification URL, or simply nURL) that carries the price and is piggybacked on the ad response. This nURL passes through the user’s browser and acts as a callback to the DSP. This assures the DSP that the winning impression was indeed delivered (the callback is fired soon after the impression is rendered on the user’s device), and also gives it the opportunity to drop a cookie on the user’s device. The nURL includes the winning DSP’s domain, the charge price, the impression ID, the auction ID and other relevant logistics (see Table 1 for some examples). In the present work, we monitor such nURLs and study the prices embedded in them, as well as how they are associated with the users’ browsing behavior and other personal information.
Table 1: Examples of winning price notification URLs (nURLs).
2.3. Estimation of encrypted charge prices
Major ad companies like Doubleclick tend to encrypt important information such as charge prices; this way, they ensure the integrity of the reported information and avoid the prying eyes of competitors. Olejnik et al. (lukasz2014selling-privacy-auction, ) assumed that encrypted prices follow the same distribution as cleartext ones. After observing an increase in the number of ad companies using encryption, Papadopoulos et al. (imcRTB2017, ) designed a methodology to estimate the value of encrypted prices. Specifically, the authors analyze a large dataset of real users to detect RTB traffic and extract a set of both encrypted and cleartext prices along with their user- and auction-related information (e.g., geolocation, time of day, user interests, etc.). Then, in order to assess possible differences between the distributions of encrypted and cleartext prices, they conduct their own probing ad campaigns and collect ad prices of both types. Contrary to the previous study (lukasz2014selling-privacy-auction, ), they show that encrypted ad prices are more expensive (1.7×) than cleartext ones.
By using this ad-campaign data as ground truth and features such as user geolocation, time of day, day of week, ad-slot size, user device and user interest, the authors train a Random Forest classifier that can estimate the value of an encrypted price with an accuracy of 82%. This trained model could then be shipped to the user’s end to allow them to estimate, in real time, the monetary cost advertisers paid per delivered ad. In this work, we design, build and make publicly available YourAdvalue: a system that leverages this methodology for encrypted prices. We also describe how we preserve the privacy of users (and what the trade-off on data utility is) when they decide to contribute their data. This data contribution allows the trained model to be updated and to address potential shifts of charge prices over time.
3. System Design
In this section, we elaborate on the guiding principles behind the design of the system, the various components it needs and the functionalities they need to execute in order to correctly compute the cost advertisers pay per user, while delivering a proper end-user experience.
3.1. Guiding Design Principles
We group the design principles into three classes: user-related, system-related and ad-related. The user-related principles are summarized as follows:
Simple to understand. The tool should display metrics and visuals that are simple to understand by users who are neither tech- nor privacy-savvy.
Unhampered user experience. The tool should be easy to install and should not degrade the user’s existing Web experience. This means it should not change a webpage’s appearance, should not slow down the loading of a webpage, should not affect the network traffic induced by a webpage, and should use a minimum of computing resources on the user’s device.
Preserve user anonymity. The tool should not, in any case, leak sensitive information that could help an entity to re-identify the end-user on the Web.
The system-related principles are summarized as follows:
Transparency. The tool’s operations and functionality should be transparent and an end-user or engineer should be able to audit the tool and its functionalities.
Scalability. The tool’s system architecture should be able to scale to accommodate thousands of users without any problem.
Fault tolerance. The tool should continue to function even after encountering unexpected behaviour, without the need of user interaction and without causing errors/crashes to the user’s browser.
Privacy-by-design. The tool should allow users to opt-in to reporting metadata from ads detected, but in a privacy-preserving fashion. That is, the data transmitted from the tool should not expose the user’s identity.
The ad-related principles are summarized as follows:
Real-time operation. The tool should operate in real time and be able to inform the user of updated measures as soon as new RTB ads are detected.
Generality of RTB detection. The tool should be able to detect RTB-related ads regardless of whether they are delivered in regular webpages (e.g., news websites) or by closed publishers (e.g., social networks).
Untampered RTB ad-protocol. The tool’s detection mechanism should be simple and should not tamper with the RTB protocol responsible for the impressions logistics.
RTB price detection. The tool’s detection mechanism should be able to detect and consider both encrypted and cleartext prices included in the RTB nURLs.
3.2. System Components & Functionalities
3.3. User Interface & Input
For the end-user client to adhere to the user-related principles, we opt for building a browser extension, which allows easy access to the user’s browsing sessions, can provide an easy-to-use UI, and can perform network traffic inspection without interfering with the normal operation of the website being visited. The web browser extension executes a web-request inspector, which analyses the incoming and outgoing network traffic of a website and detects RTB-related requests (nURLs). An example of this user interface (UI) can be seen in Figure 2.
This UI is simple and intuitive and was designed with the help of UX experts. In this UI, the user is prompted to indicate (if they want) their gender and age range. Furthermore, it displays to the end-user various simple metrics, such as the total cost computed for the said user since the tool’s installation, as well as the cost of the user for the current session. The RTB ad detection and ad-related metadata extraction methods are described in the next paragraphs. The design of the extension is lightweight, and its operation was tested with hundreds of different websites to confirm that it does not hinder the normal loading or operation of each website. Finally, it provides the user with a button to opt in to or out of anonymous metadata reporting at any time.
3.4. Web Server & Database
At the back-end, the web server receives the anonymous metadata reported for further analysis. Upon arrival, the data are shuffled with the already collected data to break any relationship of the reported data with their original users. The collected data are subsequently cleaned and modelled, to enable the creation of an updated decision tree (DT) model for the encrypted prices (as described in (imcRTB2017, )). The updated DT is then sent at regular intervals to the user clients. For this transmission, the tree must be serialized using an agreed-upon structure (e.g., XML format) and deserialized at the browser extension.
3.5. RTB ad detection & price extraction
nURL detection. The browser extension analyses all network traffic outgoing from the browser to detect all RTB-related requests. For each request leaving the browser, the extension collects metadata to check whether it is a nURL for an ad. To detect such requests, we employ the webRequest API, which is commonly available in the browsers we support. In particular, we create a listener on the onBeforeRequest event, which is triggered when a request is about to be made. The goal is only to detect RTB nURLs; we do not block or redirect them, and we do not tamper with the RTB protocol or the user experience.
DSP matching. After the tool catches an outgoing request, it checks whether it is a nURL by comparing the destination to a set of known DSPs (Figure 3). If the destination does not match any of the DSPs, the request is allowed to pass. If it does match, we consider it a potential notification URL and extract the metadata related to it (i.e., all HTTP variables and the values accompanying them).
Price extraction. We then check whether there is a possible price keyword in the nURL’s metadata, and whether this keyword is associated with the nURL’s advertiser. The argument here is that each advertiser or DSP is associated with a specific keyword used in RTB. Therefore, if the price keyword does not match the typical keyword used by the detected DSP, the request is considered a false positive and is let through. However, if the price keyword matches what the DSP uses, the tool extracts the associated price value.
Encrypted vs. cleartext prices. By leveraging the methodology of previous work (imcRTB2017, ), the extension is able to detect both cleartext and encrypted RTB prices by checking whether the price value is numeric or alphanumeric. If the value is numeric (cleartext), it is normalized from CPM or micros to US dollars and is added to the user’s total ad-cost. Otherwise, the tool applies the provided Decision Tree classifier to estimate the ad’s value, which is then added to the user’s total ad-cost.
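The three steps above (DSP matching, price extraction, and the cleartext vs. encrypted check) can be sketched as follows. This is a Python illustration of the logic rather than the extension’s code: the hostnames, price keywords and the `KNOWN_DSPS` table are hypothetical, and the interpretation of “micros” as millionths of a CPM dollar is an assumption.

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical DSP table (hostname -> price keyword and unit); not the real list.
KNOWN_DSPS = {
    "dsp.example.com": ("win_price", "cpm"),
    "ads.example.net": ("price", "micros"),
}

def inspect_request(url):
    """Return price info for a suspected nURL, or None if the request is let through."""
    parsed = urlparse(url)
    dsp = parsed.hostname or ""
    if dsp not in KNOWN_DSPS:
        return None  # destination is not a known DSP
    keyword, unit = KNOWN_DSPS[dsp]
    values = parse_qs(parsed.query).get(keyword)
    if not values:
        return None  # the DSP's usual price keyword is absent: false positive
    raw = values[0]
    if not re.fullmatch(r"\d+(\.\d+)?", raw):
        # Alphanumeric value: treated as an encrypted price, to be estimated later.
        return {"dsp": dsp, "kind": "encrypted", "raw": raw}
    # Assumption: "micros" means millionths of a dollar of CPM.
    cpm_usd = float(raw) / 1_000_000 if unit == "micros" else float(raw)
    # CPM is the cost of one thousand impressions, so one impression costs CPM / 1000.
    return {"dsp": dsp, "kind": "cleartext", "cpm_usd": cpm_usd,
            "impression_usd": cpm_usd / 1000}

print(inspect_request("https://dsp.example.com/win?win_price=1.5&imp=abc"))
print(inspect_request("https://dsp.example.com/win?win_price=AAABWXyz_9k"))
```

The function only reads the URL; nothing in it blocks or rewrites the request, in line with the principle of not tampering with the RTB protocol.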
Encrypted price inference with DT model. Classifying an encrypted price from a nURL is not a trivial task. First, before the tool even detects a nURL, it needs to load the DT model provided by the web server. The model is constructed at the web server and exported as an XML file, which is rolled out with the extension. The extension parses this XML file and creates an internal representation of the model that the server created. We choose to embed (or preload) the DT model as an XML representation inside the extension for two reasons: (i) the DT does not change very frequently, so the extension can be preloaded with an existing version of it, and (ii) doing so reduces the number of requests made from the extension to the server to a bare minimum. Whenever the DT is updated by the server, which does not happen frequently, the extension receives it in the form of an update. For the actual prediction of the price, we implemented a lightweight and efficient function that traverses the DT and reports back the price prediction, so that it neither interferes with the user experience nor slows down the user’s browser. In case the extracted feature values cannot be matched to the DT model, the extension uses a rolling average of the cleartext prices as an estimate of the encrypted price that could not be inferred.
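A minimal Python sketch of this inference path is given below, assuming a hypothetical XML schema (`node`, `branch`, `leaf` elements with `feature`, `value` and `cpm` attributes); the actual serialization format used by YourAdvalue is not specified here. The class name `PriceEstimator`, the window size and the toy tree are also illustrative.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Toy model in a hypothetical schema; real feature names and prices will differ.
MODEL_XML = """
<node feature="ad_format">
  <branch value="300x250">
    <node feature="time_of_day">
      <branch value="morning"><leaf cpm="0.4"/></branch>
      <branch value="evening"><leaf cpm="0.9"/></branch>
    </node>
  </branch>
  <branch value="728x90"><leaf cpm="0.6"/></branch>
</node>
"""

def predict_cpm(node, features):
    """Walk the tree; return None when a feature value matches no branch."""
    while node.tag == "node":
        wanted = features.get(node.get("feature"))
        for branch in node.findall("branch"):
            if branch.get("value") == wanted:
                node = branch[0]  # the single child: another node or a leaf
                break
        else:
            return None
    return float(node.get("cpm"))

class PriceEstimator:
    def __init__(self, model_xml, window=50):
        self.root = ET.fromstring(model_xml.strip())
        self.recent_cleartext = deque(maxlen=window)  # rolling window of CPMs

    def observe_cleartext(self, cpm):
        self.recent_cleartext.append(cpm)

    def estimate_encrypted(self, features):
        predicted = predict_cpm(self.root, features)
        if predicted is not None:
            return predicted
        if self.recent_cleartext:
            # Fallback: rolling average of recently seen cleartext prices.
            return sum(self.recent_cleartext) / len(self.recent_cleartext)
        return None

estimator = PriceEstimator(MODEL_XML)
estimator.observe_cleartext(0.2)
estimator.observe_cleartext(0.4)
print(estimator.estimate_encrypted({"ad_format": "300x250", "time_of_day": "evening"}))
print(estimator.estimate_encrypted({"ad_format": "160x600"}))
```

The second call uses an ad format that the toy tree has no branch for, so the estimate falls back to the rolling average of the observed cleartext prices.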
When an encrypted price is detected, the tool extracts the features that the DT requires to make the value inference of the price. For the feature extraction process, we collect metadata about the publisher and the detected ad. These features are provided to the prediction function to perform the inference. More details on the feature extraction are provided in the next paragraphs.
3.6. Ad-related feature extraction
Location extraction. For the location extraction we use an online API. By making a GET request over a secure channel (HTTPS), the extension obtains and stores the user location at country level. The location is stored in the client for the whole session (as long as the user keeps the browser open, does not restart the extension, etc.). In this manner, we keep the number of requests and the network resource needs low.
Table 2: Metadata reported to the server.
|Group|Feature|Description|
|User Information|Gender|User’s chosen gender.|
|User Information|Age|User’s chosen age.|
|User Information|Location|User’s current location.|
|Browsing Information|Time Of Day|Time of the day an ad was detected.|
|Browsing Information|Day Of Week|Day of the week an ad was detected.|
|Browsing Information|Cookie Syncing|Cookie Synchronization detected.|
|Browsing Information|DoNotTrack|DoNotTrack flag enabled/disabled.|
|Advertisement Information|Ad Format|Impression’s size in display.|
|Advertisement Information|Winner DSP|The DSP that won the auction.|
|Advertisement Information|Category|Topic category of 1st party domain.|
|Advertisement Information|Price Keyword|Keyword with the RTB ad-price.|
|Advertisement Information|Price Value|RTB price detected in the nURL.|
Ad-format extraction. After analysing several nURLs, we were able to create a list of keywords that correspond to the width and height of ads. Based on that list, we examine the nURLs’ parameters for any possible matches, which are then stored in the price features.
Cookie Synchronization extraction. Besides detecting RTB ads, we are interested in detecting whether a user is being tracked by a 3rd-party entity. To achieve that, we again employ the webRequest API and, in particular, the onBeforeRequest event. We attempt to detect whether user identifiers are sent from the publisher’s webpage to other hostnames. We first load all the cookie identifiers stored in the user’s browser, discarding session cookies and those with values shorter than 10 characters, because most of the time such values are unrelated to user identifiers. For each request from a tab, we check whether a user identifier is present in the URL and whether that URL is sent to a tracker. We also make sure that the detected URL does not belong to the same domain that the user is currently visiting in that tab. Our method is able to detect a user identifier if it is included in the URL’s parameter values or as part of the URL path. When we detect such tracking from a 3rd party, we store the information in a binary flag, indicating that user tracking with cookie synchronization took place in a specific tab towards a specific domain.
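The detection heuristic above can be sketched in Python as follows. The cookie record shape (`value`, `session`), the tracker list and all domain names are hypothetical; the sketch only mirrors the filtering and matching rules stated in the text.

```python
from urllib.parse import urlparse, unquote

def candidate_ids(cookies):
    """Keep persistent cookie values of at least 10 characters as possible user IDs.

    `cookies` is an iterable of dicts with 'value' and 'session' keys
    (an assumed shape, chosen for this illustration).
    """
    return {c["value"] for c in cookies
            if not c["session"] and len(c["value"]) >= 10}

def _in_domain(host, domain):
    return host == domain or host.endswith("." + domain)

def is_cookie_sync(request_url, tab_domain, ids, trackers):
    """True if a known ID travels to a tracker on a domain other than the visited one."""
    parsed = urlparse(request_url)
    host = parsed.hostname or ""
    if _in_domain(host, tab_domain):
        return False  # same domain as the page the user is visiting
    if not any(_in_domain(host, t) for t in trackers):
        return False  # destination is not a known tracker
    # Look for the ID both in the URL path and in the parameter values.
    haystack = unquote(parsed.path + "?" + parsed.query)
    return any(uid in haystack for uid in ids)

ids = candidate_ids([
    {"value": "abcdef1234567890", "session": False},
    {"value": "short", "session": False},
    {"value": "sessiontoken12345", "session": True},
])
print(is_cookie_sync("https://sync.tracker.example/pixel?uid=abcdef1234567890",
                     "news.example.org", ids, {"tracker.example"}))
```

Only the boolean result would feed the reported Cookie Syncing feature; the identifier and the tracker hostname stay on the client.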
IAB category extraction. In order to extract the IAB category of each 1st party, we created a list that maps websites to their IAB categories. We chose to include only very popular websites, so that the list stays small and the extension remains lightweight. This list contains the top 500 Alexa sites. When the user visits a website, we extract the 1st-party domain and compare it to the list entries. In case of a match, we report the IAB category found; otherwise we report “not specified IAB”.
3.7. User- & ad-related metadata reporting
As mentioned earlier, YourAdvalue allows users to contribute with their data like ad-prices and auction metadata. In Table 2, we list the metadata that are reported to the server. The list of features is carefully selected to reflect the needed features for the DT modelling to happen at the server side. It also enables further analysis of RTB prices, how they evolve through time and if the advertisers target particular user categories based on their demographics (e.g., gender, age etc.).
Being very cautious to preserve the user anonymity and privacy, the data reported do not include any user identifiers or other PII. Also, they are anonymized in the client side before they are sent to the server. Each feature can be reported at different granularities. However, one crucial question in this work is what is a sufficient granularity for a given feature to allow effective price modelling, but not compromise user anonymity? In Section 5, we elaborate on the aggregation methods possible for the given features, and the guarantees or limits on anonymity they can offer to the reporting users.
Also, in order to overcome the problem of an honest-but-curious server trying to re-identify the sender, each extension reports its data at random times throughout the day and week. Because of this induced randomness, reports may be delayed from several seconds to days after they are created, before leaving the user’s browser. Therefore, the server cannot identify a user based on the rate of their reports, since reports from different users get mixed up while being received at the server side. In Section 5, we also investigate various possible threats to the anonymity of the users reporting such data, and how the system’s design can protect them from such de-anonymization attacks.
4. RTB Price Evaluation
In this section, we perform a price analysis to observe how RTB ad prices change over time. To do this analysis, we use a previously collected dataset from (imcRTB2017, ) and the data collected with YourAdvalue. The first dataset is larger and contains RTB ad metadata from the years 2015-2016. It consists of about 80k ads from 810 users, and each entry contains several different features. The second dataset was collected with YourAdvalue and is from 2019. It consists of about 3600 ads from 200 users over a period of 6 months. YourAdvalue collects metadata similar to those in the 2015-2016 dataset, so we can directly compare the price distributions for the two time periods. Also, in contrast with previous datasets, we collect anonymized demographics from our users to study their association with RTB prices, something not done before.
In Figure 4, we compare the CDFs of RTB prices for the two time periods. We observe that the median price in the newest dataset is CPM, whereas in the older dataset it was CPM, pointing to a increase in RTB prices in a 3-year period. In the same way, we observe that the top 5% of RTB prices of the new dataset are CPM, compared to CPM for the older dataset.
Next, we compare the two datasets in basic dimensions such as day of the week, time of day and IAB category of visited publisher. In Figure 10, we compare the prices found for each day of the week, between the two datasets. For the dataset of 2015-2016, we observe a relatively stable distribution for each day with a median value around CPM. In contrast, in the dataset of 2019, we do not capture a stable trend. From Sunday to Friday the median prices tend to be between and CPM, but on Saturday the median price peaks to CPM. Overall, we see an increase in prices advertisers pay in 2019 compared to the ones 3 years before.
Similarly, when studying the prices paid by advertisers and the time that the ad was displayed (Figure 10), we see that advertisers tend to spend more as the day goes by, with a peak in the afternoon slot of 15:00-18:00. Regarding the prices paid per IAB category of the publisher, in Figure 10 we observe that in 2019 the prices are indeed higher, but in most cases the median prices are relatively close. This means that advertisers have not increased, over the years, the budget they spend on the different IAB categories.
Next, we focus on some demographics of our users for 2019 (Footnote 4: We cannot compare with the older data since these demographics were not captured there). In Figure 10, we see the prices paid by advertisers based on the age of the users. We observe that advertisers are willing to spend higher prices for younger users. In particular, for users in the age group of 25-34, the median price was CPM compared to the median price of CPM for the user group of age 35-44. Also, for the younger group of users, advertisers are willing to pay as high as CPM, whereas in the older group the maximum reaches CPM.
In Figure 10, we study the prices advertisers spend on users located in different countries. We observe that for Spain and Greece the median values are relatively close but the maximum price advertisers are willing to pay is 4× higher in Spain ( CPM vs. CPM). In Switzerland the median value is CPM and can reach as high as CPM, whereas in Cyprus the median value is CPM. Cookie Synchronization (CS) is a very important technique for user data sharing, tightly connected with the ad auctions (bashirtracing, ; lukasz2014selling-privacy-auction, ). However, as shown in Figure 10, the existence of prior CS has no impact on the charge prices.
5. Privacy Analysis
In order for YourAdvalue to accurately estimate encrypted charge prices, it requires data from the users in order to have its price modelling engine up-to-date with the current pricing dynamics in the ad-ecosystem. Given the system’s principle for privacy-by-design, YourAdvalue needs to preserve the privacy of the users that opt-in to data contribution against re-identification attacks after a possible data-breach or a malicious server controller.
In this section, we first elaborate on the possible threats that could expose a user’s identity and allow an attacker to link the user to their reported data (Section 5.1). We also discuss how we address these threats, thus maintaining user anonymity. Next, we analyse the limits on user privacy imposed by the features reported by the tool (Section 5.2). To do that, we focus on a larger, real-world dataset of 810 users used in a previous study (imcRTB2017, ) with the same features (Section 5.3). This dataset allows us to study real users’ distributions of features, and how user anonymity changes with feature aggregation. We also study the inherent trade-off between feature aggregation and price estimation accuracy (Section 5.4).
5.1. User de-anonymization threats
Below, we discuss possible de-anonymization threats that the system could face during data reporting, and how the system design prevents them from materializing.
Threat 1: De-anonymization using users’ cumulative online behavior. Most users have a consistent online behavior, meaning that they tend to visit the same websites throughout the week (e.g., a user may consistently visit cnn.com to read daily news, whereas another user does that via boston.com). This behavior makes users identifiable, as already shown in (wit, ). We address this threat by not reporting the 1st-party domains visited by users, but their corresponding IAB categories instead (iab_lab, ; iab_taxonomy, ). In this way, multiple different sites can appear under the same IAB category (e.g., all news sites will be under the same IAB category).
Threat 2: De-anonymization using unique 1st-party domains. In addition to the previous threat, which considers the cumulative activity of a user, there is a more subtle attack also based on 1st-party domains. The 1st-party domain on which an ad was detected could be a private or sensitive domain, or a domain that other users are very unlikely to visit. Let us assume that a user (Alice) visits such a private domain (e.g., myorganization.org) and an ad is detected on that domain. Let us also assume that YourAdvalue includes in its report to the server that an ad was caught on myorganization.org. Now, if a malicious user (Bob) gets access to the stored data and knows that Alice works for this organization (myorganization.org), he could uniquely identify Alice. As explained above, this attack is addressed by not reporting the actual 1st party but only its IAB category.
Threat 3: De-anonymization using cookie synchronization activity. Depending on their browsing behavior, different users could have different trackers tailing them. Therefore, it could be possible to distinguish a user by the vector of trackers that track them. In fact, there could be cases where specific users are tracked by very specific trackers, e.g., due to special websites they visit, making them stand out in the crowd. We address this threat by not reporting the actual trackers detected during a user’s browsing. Instead, we report only whether trackers were present when an ad was detected and attempted cookie syncing (binary signal).
Threat 4: De-anonymization using user-related identifiers. In many cases, the nURLs contain user-related information such as cookies or other user-specific IDs. An attacker accessing the database could try to detect such ID-looking strings and attempt to reconstruct user sessions by grouping same ID related information. To address this threat, the tool does not report back any user IDs or the complete nURL. Instead, it extracts metadata such as the winner DSP, the price keyword, the price value and the ad size, if available.
Threat 5: De-anonymization using user demographics. Users can choose to report back some demographic information (e.g., gender and age). This information may or may not be truthful, and cannot point with certainty to a specific user. However, if it is consistently reported, and if it is combined with other public information, it may expose a user’s identity. To reduce the probability of such a threat, we aggregate these demographics into coarse-grained classes (e.g., the age is reported in 10-year instead of 1-year bins).
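As a small illustration of such coarsening, the Python sketch below maps an exact age to a 10-year bin. The bin edges are an assumption, aligned with the 25-34 and 35-44 age groups used in the price evaluation; the function name `age_bin` is hypothetical.

```python
def age_bin(age: int) -> str:
    """Map an exact age to a 10-year class such as '25-34'.

    Assumption: bins start at ages ending in 5 (15, 25, 35, ...), which
    matches the 25-34 and 35-44 groups mentioned in the evaluation.
    """
    start = ((age - 5) // 10) * 10 + 5
    return f"{start}-{start + 9}"

print(age_bin(27), age_bin(34), age_bin(35))
```

Only the bin label would leave the browser, so two users aged 27 and 34 become indistinguishable on this feature.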
Threat 6: De-anonymization using user’s IP address. Even if the data reporting is done anonymously, an individual’s identity could be compromised by grouping together the data they reported in the database. Two ways for this to happen would be to: (1) find unique identifiers for users and link database entries together (Threat 4), or (2) link incoming data based on the IPs reporting them. For the second scenario to occur, we suppose that the server becomes malicious and logs the users’ IP addresses. To address this threat and increase user trust in the server, we employ previously proposed techniques (noside_eff, ; humanweb, ) and use a web proxy that is not controlled by our team but by other members of the organization. The proxy receives the data reports and forwards them to the system’s server using the proxy’s IP address. Also, in order to ensure that the proxy cannot read the data reports, we use SSL between the YourAdvalue clients and the server. This way, only the server can decrypt the reports, and even if the proxy stored the clients’ IP addresses, it would have no way of knowing the contents of the reported data.
Threat 7: De-anonymization using grouped records into user sessions. An attacker may attempt to de-anonymize a user by assuming that consecutive records of incoming data were reported by the same user. The system takes the following actions to reduce or even eliminate such a threat: (1) As explained earlier in Section 3.7, the reporting client does not send data for every detected ad, but instead collects a set of records, which it then shuffles and sends to the server at random times. This strategy also breaks possible time linkage between records, since they are not reported as soon as they are created. (2) The receiving server frequently and randomly shuffles the incoming data with the stored data, depending on the amount of data stored. This enables the server to break any possible sessions and relationships that records may have with each other.
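Both countermeasures can be sketched in a few lines of Python. The delay bounds (5 seconds to 3 days), the function names and the record contents are illustrative assumptions; the text only states that delays range from several seconds to days.

```python
import random

def plan_report(records, rng, min_delay_s=5.0, max_delay_s=3 * 24 * 3600.0):
    """Client side: shuffle a batch of records and pick a random sending delay."""
    batch = list(records)
    rng.shuffle(batch)  # break the order in which ads were detected
    delay_s = rng.uniform(min_delay_s, max_delay_s)
    return delay_s, batch

def merge_and_shuffle(stored, incoming, rng):
    """Server side: mix newly received records into the stored ones."""
    merged = list(stored) + list(incoming)
    rng.shuffle(merged)  # consecutive rows no longer imply the same sender
    return merged

rng = random.Random()
delay_s, batch = plan_report(["ad1", "ad2", "ad3"], rng)
stored = merge_and_shuffle(["old1", "old2"], batch, rng)
print(round(delay_s), len(stored))
```

A deployed client would more likely draw its randomness from a cryptographically secure source (for example `random.SystemRandom`) so that the shuffle order and delays are harder to predict.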
5.2. User uniqueness via surprisal analysis
Overall, the system design can protect users’ anonymity by removing identifiers or attributes that can contain unique examples of users’ activity, reducing the granularity of reported data (with respect to frequency of reporting or detail), and randomizing the data reporting and storing to avoid the possibility of linking records together into sessions. However, it is important to study the bounds of anonymity that can be achieved, when the features reported back to the server are taken into consideration together while attempting to de-anonymize a user.
In this section, we formally investigate this problem. In particular, we are interested in studying what granularity of reporting is low enough to protect the anonymity of each user, while still maintaining full or partial utility of the reported data. For this investigation, we employ two commonly known tools to understand this trade-off. First, we study how unique a user is, given a particular reporting setup (i.e., combination of features and granularity classes), and, similar to other works (Eckersley:2010:UYW:1881151.1881152, ), we estimate this by measuring the number of surprisal bits that such a setup achieves. Second, we compute the number of users who are expected to be found, or to collide, in a given reporting setup. The second study is a classic k-anonymity analysis, in which we measure how many users have similar behaviour and are likely to report the same combination of features and granularity classes. In the next paragraphs, we offer these analyses of surprisal and k-anonymity for the features considered in Section 3.7.
5.2.1. Background on Information Theory
In information theory, given an event $E$ with probability $P(E)$, surprisal is defined as:
$$I(E) = -\log_2 P(E)$$
Surprisal is measured in bits, and the higher the surprisal, the more unique an event is. Something that is certain has no surprisal (0 bits), flipping a fair coin is associated with 1 bit of surprisal, and winning the lottery is associated with 24 bits of surprisal. In our case, higher surprisal of a user’s reported data means the user can be uniquely identified more easily.
If the said event $E$ is dependent on $N$ independent attributes or features ($f_1, f_2, \dots, f_N$), then the overall probability of the event can be broken down and expressed as:
$$P(E) = \prod_{i=1}^{N} P(f_i)$$
If we assume that each feature $f_i$ has $n_i$ different discrete classes, and they are equally likely to appear (i.e., they are governed by a uniform distribution), then every feature $f_i$ has a probability:
$$P(f_i) = \frac{1}{n_i}$$
In a more realistic scenario, an event $E$, or in our case an experimental setup $E_s$ (that dictates a specific class for each of the features), is dependent on $N$ independent features that are governed by real distributions. Thus, the above probabilities are not equal for all classes of a given feature. Instead, for class $c_{ij}$ of feature $f_i$ we have:
$$P(f_i = c_{ij}) = PDF_{f_i}(c_{ij})$$
Here, $PDF_{f_i}$ is the real (or observed) probability distribution function of feature $f_i$, and $PDF_{f_i}(c_{ij})$ is the probability returned by this function for the event to occur at class $c_{ij}$ of this feature (i.e., the particular level or class that the experimental setup $E_s$ takes for the feature $f_i$). For example, feature $f_i$ could be the time of day, and the different classes of this feature could be the 24 hours available ($j = 1, \dots, 24$). Then, the probability of an ad being detected at the 11am slot is given by $PDF_{f_i}(c_{ij})$, evaluated at the class $c_{ij}$ that corresponds to 11am.
To compute the overall surprisal for a specific combination of features and classes, we need to take into account all such features $f_i$ and classes $c_{ij}$:
$$I(E_s) = -\log_2 P(E_s) = -\sum_{i=1}^{N} \log_2 P_i(E_s)$$
where: $E_s$ = the experimental setup of interest, comprised of a selected class $c_{ij}$ for each feature $f_i$; $P_i(E_s)$ = probability of $E_s$ to occur, as computed from the PDF of feature $f_i$ for its class $c_{ij}$; $P(E_s) = \prod_{i=1}^{N} P_i(E_s)$ = overall probability of $E_s$ to occur.
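The definitions in this subsection translate directly into a few lines of code. The sketch below assumes independent features, as the text does; the function names `surprisal_bits` and `setup_surprisal_bits` are ours and do not come from the YourAdvalue code base.

```python
import math


def surprisal_bits(probability):
    """Surprisal of a single event in bits: -log2(p)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def setup_surprisal_bits(class_probabilities):
    """Surprisal of a setup made of independent features.

    class_probabilities holds, for each feature, the probability of the
    class that the setup selects. Independence lets the log of the
    product be written as a sum of per-feature surprisals.
    """
    return sum(surprisal_bits(p) for p in class_probabilities)


# A fair coin flip carries 1 bit; two independent fair coins carry 2 bits.
coin = surprisal_bits(0.5)
two_coins = setup_surprisal_bits([0.5, 0.5])
```

Under the uniform assumption, the per-feature probability is simply one over the number of classes, so a feature with 24 equally likely classes contributes $\log_2 24$ bits.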
5.2.2. Uniqueness using Uniform Distributions
Following the assumption of a uniform distribution of classes of features (Eq. 2), all the combinations of features have the same probability to occur in our dataset. This assumption allows us to study the problem of reporting and define some baselines of features and their classes that we could report. In Table 3, we compute the theoretical surprisal bits of a system that reports the features mentioned earlier, each shown in an individual row of the table. The granularity of reporting, i.e., the number of independent and uniformly distributed classes, is given in each column for each row.
We begin by assuming that the YourAdvalue client reports back the features exactly as collected, without any generalization. This scenario is hypothetical, but it gives us a baseline understanding of what we are dealing with in terms of surprisal bits. We assume that the age ranges from 0 to 99 years, that the user could be from any of the 240 different countries of the world, and that the reporting is in 1-hour bins. We also assume that the client reports back the 1st-party domains they browse through (e.g., 500 distinct values) and the trackers they have encountered and with which cookie synchronization was performed (e.g., 200 distinct values). The price keyword can take only 1 distinct value because each DSP is using one keyword, but the price value can obviously vary. Thus, we assume this price value can take 200 distinct values. In this hypothetical scenario, there is no aggregation, so the surprisal, as expected, is really high ( bits).
Next, we assume that the tool performs aggregation of the distinct classes of some of the features, to explore a more realistic setup. For example, by just grouping the age in 5 distinct ranges, we reduce the surprisal by about 4.3 bits (from $\log_2 100 \approx 6.6$ to $\log_2 5 \approx 2.3$ bits). Grouping the reported time of day in 3-hour intervals has only a small impact on the overall surprisal rate (about 1.6 bits). In contrast, eliminating the cookie synchronization DSP and reporting back only whether CS took place (2 distinct values) greatly reduces the overall surprisal, by almost 7 bits. Also, if instead of reporting the 1st-party domain we report only 30 distinct IAB categories, we reduce the surprisal by over 4 bits.
In the far right case, in which all possible features are aggregated, we can effectively halve the initial surprisal rate. However, this can greatly impact the utility of our data. In this extreme case, no actual price value is sent. Instead, the client only reports whether the price was low, medium or high. The same holds for the ad format, where the client reports whether the ad was small, medium or large. Finally, instead of the actual country, the client reports only a country zone (e.g., Central EU, North America, etc.). The combination of distinct features and their impact on the surprisal rate can be seen in detail in Table 3. From the previous threat analysis and the computation of surprisal bits assuming a uniform distribution of classes, we can extract some lessons on which features can be reported and with how many classes each.
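Under the uniform assumption, the effect of coarsening one feature depends only on the ratio of class counts before and after aggregation. The following sketch computes this saving for the class counts named in the text; the helper name `bits_saved` and the dictionary labels are ours, and features whose class counts are not stated in the text are left out.

```python
import math


def bits_saved(classes_before, classes_after):
    """Surprisal reduction in bits when a uniformly distributed feature
    is coarsened from classes_before to classes_after classes."""
    return math.log2(classes_before) - math.log2(classes_after)


# Class counts taken from the hypothetical setups discussed in the text.
aggregations = {
    "age: 100 values -> 5 ranges": (100, 5),
    "time of day: 24 one-hour bins -> 8 three-hour bins": (24, 8),
    "cookie sync: 200 trackers -> 2 values (yes/no)": (200, 2),
    "1st party: 500 domains -> 30 IAB categories": (500, 30),
}

for label, (before, after) in aggregations.items():
    print(f"{label}: {bits_saved(before, after):.2f} bits")
```

Because the saving is a log ratio, halving the number of classes of any feature always removes exactly one bit, regardless of how many classes the feature started with.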
5.3. Real-world data privacy analysis
Building on the previous theoretical analysis, we conduct an anonymity analysis of these features on a real user dataset. In this case, however, we employ the formula of Eq. 3, where the probability of a class (or level) in a feature is no longer uniform, but is computed using the probability distribution function of the feature. The dataset used was the one provided by the authors of the study in (imcRTB2017, ) and consists of about 80k entries from 810 different users, each entry containing 27 different features. Out of these 27, we selected the ones that are also reported back by YourAdvalue, ending up with a set of 8 features. Namely, these features are: user location, day of the week, time of the day, ad size, ADX involved, IAB category for the 1st-party, price keyword and price value. Based on the classes in each feature, we calculate the probability of each different class. (Due to space constraints, we show the probability distribution functions for these features in Figure 20 in Appendix A.)
We start our analysis by examining a pessimistic scenario, which assumes an experimental setup where every feature is reported at its class with the lowest observed probability. That is, we take the lowest probability class for each feature and compute the surprisal bits of this scenario. Indeed, the surprisal bits for a user who reports such an unlikely event are about bits. Interestingly, this number is very close to our previous theoretical worst case scenario, where no kind of anonymization was applied.
| Feature | Number of classes |
| Time Of Day | 24 | 24 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| Day Of Week | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
Next, we proceed to analyze the setups that did occur in the real dataset, and compute their surprisal bits. In Figure 11 (black line), we show the CDF of the surprisal bits for the observed setups. All setups have fewer than bits, and 95% of the cases have fewer than 21 bits. Also, the median setup has 12.5 bits. One could argue that the high surprisal of a small portion of entries could expose some users. We remind the reader that, in the absence of any sort of PII, linking entries to specific users can be difficult, if not impossible, inside this database.
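The real-data analysis can be sketched as follows, assuming each dataset entry is available as a dictionary that maps feature names to class labels. The helper names and the record layout are assumptions made for illustration and are not taken from the dataset of (imcRTB2017, ).

```python
import math
from collections import Counter
from statistics import median


def empirical_pdfs(records, features):
    """Observed probability of each class, computed separately per feature."""
    pdfs = {}
    for feature in features:
        counts = Counter(record[feature] for record in records)
        total = sum(counts.values())
        pdfs[feature] = {cls: n / total for cls, n in counts.items()}
    return pdfs


def record_surprisal(record, pdfs):
    """Surprisal bits of one observed setup, assuming independent features."""
    return -sum(math.log2(pdfs[f][record[f]]) for f in pdfs)


def pessimistic_surprisal(pdfs):
    """Surprisal bits of the setup that takes the rarest class of every feature."""
    return -sum(math.log2(min(pdf.values())) for pdf in pdfs.values())


def median_surprisal(records, features):
    """Median surprisal over all observed setups."""
    pdfs = empirical_pdfs(records, features)
    return median(record_surprisal(r, pdfs) for r in records)
```

Sorting the per-record values returned by `record_surprisal` gives the empirical CDF; other percentiles can be read off the same sorted list.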
| Feature | Number of classes per feature |
| day of week | 7 | 7 | 2 | 2 | 2 | 2 |
| time of day | 6 | 6 | 6 | 2 | 2 | 2 |
| AUC of ROC | 0.858 | 0.854 | 0.820 | 0.798 | 0.798 | 0.786 |
As in the analysis for the uniform distribution, we can attempt to improve the anonymity of users by aggregating features into fewer classes. We demonstrate this effort in Table 4 and in Figure 11, in which we show the CDF of the overall surprisal bits for the different setups included in the real dataset, under different aggregation efforts. Interestingly, when we perform aggregation for the different features, we see a steady reduction of surprisal bits across all cases, except for the location aggregation, which is effective only for a portion of the ads. This is because, for the remaining ads, the locations are already popular, and summarizing them does not have an impact on the surprisal rate. After all aggregations of classes are applied, the surprisal rate of the median setup drops to 7.7 bits from 12.5 bits, and for the 95% case it drops to 13.7 bits from 21 bits.
We continue this investigation by performing a k-anonymity analysis of this real dataset, to understand at what level it is satisfied, based on the different classes and the aggregation performed. K-anonymity is used as a privacy criterion in real applications such as the “Family Educational Rights and Privacy Act” (FERPA) of the USA (ferpa, ), and the “Guidelines for De-identification of Personal Data” of South Korea (koreanprivacy, ). In Figure 12, we show the CDF of the number of users who collide in the same type of setup, when the features are reported in a given set of classes. That is, the figure shows how many users (k) match the same entry reported in the dataset (i.e., ad), with the same features and classes. We notice that the median entry (ad) reported can be mapped to k=4–6 users, and the minimum k-anonymity achieved is k=2 users.
We further consider a realistic scenario where the attacker has limited access per ad detected, without any knowledge of the user’s browsing behavior (i.e., does not have IAB categories reported). In this scenario, the attacker groups the entries found in the dataset based on the features of location, day of the week and time of the day (i.e., does not have access to other features). In Figure 13, we see that in this attack scenario and the different setups tested, a minimum k-anonymity of k=6 users is satisfied, and a median entry (ad) can be mapped to k=35 users. In general, these scores are within the range of k=3–10 reported in (healthdatareporting2017bcm, ) and applied for electronic health records, lending support to the applicability in our scenarios as well.
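A k-anonymity count of this kind can be sketched as a group-by over the features the attacker is assumed to see. The sketch presumes an offline analysis dataset in which every entry still carries a user identifier; the field name `user`, the feature names, and the function name `k_per_record` are hypothetical.

```python
from collections import defaultdict


def k_per_record(records, quasi_identifiers, user_key="user"):
    """For every record, count the distinct users that share its
    combination of quasi-identifier values (its k)."""
    users_by_combo = defaultdict(set)
    for record in records:
        combo = tuple(record[q] for q in quasi_identifiers)
        users_by_combo[combo].add(record[user_key])
    return [
        len(users_by_combo[tuple(record[q] for q in quasi_identifiers)])
        for record in records
    ]


# Toy entries: three users, two of whom share the same setup.
toy = [
    {"user": "u1", "location": "ES", "day": "Sat", "time": "am"},
    {"user": "u2", "location": "ES", "day": "Sat", "time": "am"},
    {"user": "u2", "location": "ES", "day": "Sat", "time": "am"},
    {"user": "u3", "location": "GR", "day": "Mon", "time": "pm"},
]
ks = k_per_record(toy, ["location", "day", "time"])
```

The minimum of the returned list is the k-anonymity level the dataset satisfies for that attacker view, and its median corresponds to the median entry discussed above. Restricting `quasi_identifiers` to fewer features models a weaker attacker and can only increase k.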
5.4. User Anonymity vs. Price Modelling
We close this investigation by looking into the inherent trade-off between the aggregation of feature classes and the modelling of prices at the server using machine learning methods. We use a random forest model and 4 equidistant classes for the RTB prices detected, and measure standard ML performance metrics, (i) AUC of ROC and (ii) F1-score, for the different aggregation levels of the available features, as examined in the previous paragraphs. The results are shown at the bottom of Table 4. We find that both metrics are not greatly impacted by the aggregation performed: an 8.4% decrease in AUC and a 16.5% decrease in F1 when comparing the no-aggregation scenario (1st column) with the fully aggregated scenario (last column). These results suggest that the YourAdvalue client can aggregate features aggressively before reporting, at only a modest cost in the performance of the ML model at the server side.
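The reported AUC decrease can be checked directly against the AUC row of Table 4. The snippet below is only this arithmetic check; it does not reproduce the random forest experiment, and the F1 values are not included because they do not appear in the text.

```python
def relative_decrease(baseline, value):
    """Relative drop of value with respect to baseline."""
    return (baseline - value) / baseline


# AUC of ROC per aggregation setup, from no aggregation to full aggregation.
auc_by_setup = [0.858, 0.854, 0.820, 0.798, 0.798, 0.786]

drop = relative_decrease(auc_by_setup[0], auc_by_setup[-1])
print(f"{drop:.1%}")  # 8.4%
```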
6. Related Work
Papadopoulos et al. in (imcRTB2017, ) set out to explore the cost advertisers pay to deliver an ad to the user in the waterfall standard and RTB auctions. In addition, they studied how personal data can affect the pricing dynamics. The authors proposed a methodology to compute the total cost paid for the user even when advertisers hide the charged prices. Finally, they evaluated their methodology by using data from a large number of volunteering users. According to their findings, advertisers paid a total of around 25 CPM to deliver ads to the average user across a year. In (lukasz2014selling-privacy-auction, ), Olejnik et al. followed a similar approach. They detected RTB notification URLs and extracted the value of the auctioned advertisement. They conducted an extensive study of the RTB ecosystem and estimated the value of users’ private data based on the cleartext price notification URLs. They found that the average price of an ad is in the range of $0.0001–$0.004, depending on the user and the ad campaign.
In (followTheMoney, ), the authors use a dataset of users’ HTTP traces and provide rough estimates of the relative value of users by leveraging the suggested bid amounts for the visited websites, based on categories provided by Google AdWords. FDVT (fdvt, ) is a tool that informs users in real time about the monetary value of the personal information associated with their Facebook activity. Although FDVT is similar to our work in objective, our approach is more general, since it applies to all RTB-based advertising, whereas the methodology of (fdvt, ) obtains prices from the Facebook AdPlanner and is thus relevant only for Facebook advertising. In a follow-up work (FDVT2, ), the same group investigated whether Facebook complies with the recent GDPR. They concluded that Facebook labels 73% of EU users with sensitive interests, and they estimated that a malicious third party could unveil the identity of Facebook users that have been assigned a sensitive interest at a cost as low as EUR 0.015 per user.
In (CSync, ), the authors used a heuristic mechanism to detect information exchanged between advertisers through CS. They concluded that 97% of the users are exposed to CS at least once and that ad-related entities participate in more than 75% of the overall synchronization. In (Papadopoulos:2018:ECM:3193111.3193117, ), the authors demonstrated how this technique may leak a user’s cookie IDs and browsing history to a snooping ISP, even when the user uses TLS and secure VPN services. By probing the top 12k Alexa sites, they found that 1 out of 13 websites exposes its visitors to these privacy leaks.
Bashir et al. in (DiffusionofUserTrackingDataintheOnlineAdvertisingEcosystem, ) studied the diffusion of user tracking caused by RTB-based programmatic ad auctions. The results of their study show that, under specific assumptions, no fewer than 52 tracking companies can observe at least 91% of an average user’s browsing history. In (bashirtracing, ), the same group tried to enhance transparency in the ad ecosystem with regard to information sharing, by developing a content-agnostic methodology that detects client-side and server-side flows of information between ad exchanges by leveraging retargeted ads. Using crawled data, they collected 35.4k ad impressions and identified 4 different kinds of information sharing behavior between ad exchanges.
The work in (acquisti2013privacy, ) discusses the value of privacy after defining two concepts: (i) Willingness To Pay, the monetary amount users are willing to pay to protect their privacy, and (ii) Willingness To Accept, the compensation that users are willing to accept for their privacy loss. Two user studies (bigmac, ; staiano2014moneywalks, ) have measured how much users value their own offline and online personal data, and consequently for how much they would sell them to advertisers. In (forSale, ), the authors propose “transactional” privacy, to allow users to decide what personal information can be released and to receive compensation from selling it.
7. Discussion & Conclusion
In this paper, we present the design, implementation, deployment and operation of YourAdvalue, a first-of-its-kind, full-fledged system that allows any user, through a browser extension, to know in real time how much money the RTB ad-ecosystem pays to show them ads. We have been operating the system for the last 6 months and it has already been used by 200 real users. Using the RTB-related metadata collected during this period, together with a previously analyzed RTB dataset, we make the following main findings regarding RTB prices and their evolution over time:
- In a period of 3 years, RTB prices increased by about 75%.
- Advertisers kept the same bidding policies regarding the IAB category of a website over time.
- Advertisers are bidding more aggressively on Saturdays, compared to the rest of the week.
- Prices increase during the day until afternoon (6am-6pm).
- Prices for younger users are higher than for older users.
- Cookie Synchronization does not have an impact on the price of an auctioned ad.
In addition, we performed a privacy evaluation of the system, to identify possible threats against user’s anonymity. We measured the limits of user anonymity with a uniqueness study via surprisal and k-anonymity analysis. Finally, we studied the trade-off of anonymity via feature aggregation vs. performance of price modelling with machine learning methods. In summary, the main takeaways from the privacy evaluation of this tool were:
- YourAdvalue’s privacy-preserving design protects users from typical and extreme de-anonymization attacks.
- With feature aggregation, the surprisal can be roughly halved under uniformly distributed classes, and the median surprisal on the real dataset drops from 12.5 to 7.7 bits, in comparison to no-aggregation scenarios.
- Location aggregation does not reduce user uniqueness as much as other features (e.g., time of day or day of week).
- With feature aggregation, a median 30-45-anonymity can be achieved.
- YourAdvalue’s client can do high feature aggregation before reporting with minimal impact on the ML price model.
We envision that in the future, the YourAdvalue tool will be further used by many end-users, privacy researchers and auditors, who can take advantage of its simple functionalities to increase transparency in the RTB ad-ecosystem and its obscure practices of user modelling and ad-costs.
The research leading to these results has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under grant agreements No 786669 (project CONCORDIA) and Marie Sklodowska-Curie grant agreement No 690972 (project PROTASIS). The paper reflects only the authors’ views and the Agency and the Commission are not responsible for any use that may be made of the information it contains.
-  Alessandro Acquisti, Leslie K John, and George Loewenstein. What is privacy worth? The Journal of Legal Studies, 2013.
-  Muhammad Ahmad Bashir, Sajjad Arshad, William Robertson, and Christo Wilson. Tracing information flows between ad exchanges using retargeted ads. In 25th USENIX Security Symposium, pages 481–496, Austin, TX, August 2016.
-  Muhammad Ahmad Bashir and Christo Wilson. Diffusion of user tracking data in the online advertising ecosystem. Proceedings on Privacy Enhancing Technologies, 2018.
-  José González Cabañas, Ángel Cuevas, and Rubén Cuevas. Facebook use of sensitive data for advertising in Europe. CoRR, 2018.
-  Juan Pablo Carrascal, Christopher Riederer, Vijay Erramilli, Mauro Cherubini, and Rodrigo de Oliveira. Your browsing behavior for a big mac: Economics of personal information online. In Proceedings of the 22nd international conference on World Wide Web, 2013.
-  Peter Eckersley. How unique is your web browser? In Proceedings of the 10th International Conference on Privacy Enhancing Technologies, PETS’10, pages 1–18, Berlin, Heidelberg, 2010. Springer-Verlag.
-  eMarketer. emarketer releases new global media ad spending estimates. https://www.emarketer.com/content/emarketer-total-media-ad-spending-worldwide-will-rise-7-4-in-2018, 2018.
-  Lauren Fisher. US programmatic ad spending forecast 2019. https://www.emarketer.com/content/us-programmatic-ad-spending-forecast-2019, 2019.
-  Phillipa Gill, Vijay Erramilli, Augustin Chaintreau, Balachander Krishnamurthy, Konstantina Papagiannaki, and Pablo Rodriguez. Follow the money: Understanding economics of online aggregation and advertising. In Proceedings of the 2013 Conference on Internet Measurement Conference, 2013.
-  José González Cabañas, Ángel Cuevas, and Rubén Cuevas. FDVT: Data valuation tool for Facebook users. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 3799–3809. ACM, 2017.
-  InvestingAnswers. Cost Per Thousand (CPM). http://www.investinganswers.com/financial-dictionary/businesses-corporations/cost-thousand-cpm-2917, 2017.
-  Noel Agnew and Kevin Flood. IAB Tech Lab announces final content taxonomy v2 ready for adoption. https://iabtechlab.com/blog/iab-tech-lab-announces-final-content-taxonomy-v2-ready-for-adoption/, 2017.
-  Hyukki Lee, Soohyung Kim, Jong Wook Kim, and Yon Dohn Chung. Utility-preserving anonymization for health data publishing. BMC Medical Informatics & Decision Making, 17, 2017.
-  Ministry of the Interior and Safety. Personal data protection laws in korea. https://www.privacy.go.kr/eng.
-  Konark Modi, Alex Catarineu, Philipp Classen, and Josep M. Pujol. Human web overview. Technical report, 2017.
-  Konark Modi and Josep M. Pujol. Data collection without privacy side-effects. Technical report, 2017.
-  Mopub. IAB categories. https://developers.mopub.com/publishers/ui/marketplace/iab-category-blocking/, 2019.
-  Lukasz Olejnik, Minh-Dung Tran, and Claude Castelluccia. Selling off user privacy at auction. In 21st Annual Network and Distributed System Security Symposium, NDSS, San Diego, California, USA, February 23-26, 2014.
-  Panagiotis Papadopoulos, Nicolas Kourtellis, and Evangelos Markatos. Cookie synchronization: Everything you always wanted to know but were afraid to ask. In The World Wide Web Conference, pages 1432–1442. ACM, 2019.
-  Panagiotis Papadopoulos, Nicolas Kourtellis, and Evangelos P. Markatos. Exclusive: How the (synced) cookie monster breached my encrypted vpn session. In Proceedings of the 11th European Workshop on Systems Security, EuroSec’18, pages 6:1–6:6, New York, NY, USA, 2018. ACM.
-  Panagiotis Papadopoulos, Nicolas Kourtellis, Pablo Rodriguez Rodriguez, and Nikolaos Laoutaris. If you are not paying for it, you are the product: How much do advertisers pay to reach you? In Proceedings of the 2017 Internet Measurement Conference, pages 142–156. ACM, 2017.
-  Fotios Papaodyssefs, Costas Iordanou, Jeremy Blackburn, Konstantina Papagiannaki, and Nikolaos Laoutaris. Web identity translator. ACM HotNets, 2015.
-  Christopher Riederer, Vijay Erramilli, Augustin Chaintreau, Balachander Krishnamurthy, and Pablo Rodriguez. For sale : Your data: By : You. In Proceedings of the 10th ACM Workshop on Hot Topics in Networks, 2011.
-  Jacopo Staiano, Nuria Oliver, Bruno Lepri, Rodrigo de Oliveira, Michele Caraviello, and Nicu Sebe. Money walks: A human-centric study on the economics of personal mobile data. In Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2014.
-  US Department of Education. Family educational rights and privacy act (ferpa). https://www2.ed.gov/policy/gen/guid/fpco/ferpa/index.html.
-  William Vickrey. Counterspeculation, auctions, and competitive sealed tenders. The Journal of finance, 1961.
Appendix A
In Figure 20 we show the distributions of features and their different classes (levels) extracted from a real user dataset with RTB prices reported.