1 Introduction
Online advertising strives to serve the most beneficial advertisement (ad) to the most relevant online users in the appropriate context (a specific website, mobile application, etc.). This typically results in attaining higher return-on-investment (ROI) for the advertisers [10], where the value is generated either from a direct response such as a click or conversion (e.g. the purchase of a product, subscription to a newsletter, etc.), or through delivering a branding message. For this purpose, advertisers receive help from multiple entities in the domain. Supply-side platforms (SSP) provide ad space (inventory) on websites or mobile apps, to serve ad impressions to users. Ad exchanges run auctions on available inventory from SSPs. Demand-side platforms (DSP) act on behalf of the advertisers and aim to bid on the most valuable inventory.
Advertisers often get performance reports from an independent evaluation agency^{†}. For privacy reasons, these reports, in most cases, only contain aggregate metrics (e.g. click-through rate, percentage of female audiences).
^{†} There are certain independent evaluation agencies in the online advertising domain, whose names we cannot list here to comply with the company policy. Advertisers trust these organizations to collect the ground truth.
In order to reach the right audience usually defined by the advertiser, which in general would improve direct response and branding metrics, the advertisers need to utilize various data sources to label the users in the most accurate way possible. Data management platforms (DMP) have been emerging as a central hub to seamlessly collect, integrate and manage large volumes of user data [6]. Such user data could be first-party (i.e. historical user data collected by advertisers in their private customer relationship management systems), or third-party (i.e. data provided by third-party data partners, typically each specializing in a specific domain, e.g., demographics, credit scores, buying intentions). While first-party data is proprietary to the advertiser and free to utilize, third-party data often carries a pre-negotiated cost per impression (ad served to a user in a website or application). In both cases, it is important for the advertiser to know how accurate a data source is. That is, if a data source has tagged a user to be in category $c$ (user property, e.g. gender, age, income), how likely it is for the user to actually be in that category.
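Table 1: Confusion matrix of a data source for a category $c$. Rows give the ground truth label ($c$, not $c$, Unknown) and columns give the label predicted by the data source ($c$, not $c$, Unknown).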
In this paper, we investigate the above problem, which we call data quality assessment in online advertising. The main issue in evaluating the accuracy of a data source is the lack of ground truth in the user-level granularity. For example, the advertisers, in reality, never have access to the confusion matrix (Table 1) of a data source in either first- or third-party cases. Therefore, the only way for an advertiser to evaluate the quality of a data source is to run an advertising campaign on a set of users and then evaluate the performance in hindsight. Even in those cases, the post-campaign data is often constrained (mostly due to privacy concerns) and in aggregate, that is, only the total number of users in different categories is provided, and not a granular user-to-category assignment. If it were possible to have the granular data, it would then be trivial to just use the ground truth data source to come up with the accuracy metrics, e.g. filling in the entries of the confusion matrix. Having to rely on aggregate performance statistics instead makes the data quality evaluation task quite challenging, and somewhat similar to aggregate learning tasks in machine learning [5], few of which are also directly applicable to this problem.
The main contributions of this work are as follows:

- a formal definition of the data quality assessment problem, and the challenges of solving it in the online advertising domain,
- multiple approaches for evaluating the quality of a data source, which also take into account the efficiency requirements due to the large number of possible data sources^{*} to be evaluated,
- several use cases where data quality assessment comes in handy for online advertising, and,
- an initial evaluation of our methodology utilizing simulated data and real-world advertising campaigns.
^{*} While we cannot list the exact number due to the company policy, there are currently over 200k active data sources in Turn’s system.
The rest of the paper is organized as follows. In Section 2, we give a more formal definition of the data quality assessment problem. Next, in Section 3, we discuss the literature that deals with either data quality assessment or aggregate learning (which, as mentioned above, is relevant to our problem). We present our two proposed assessment methodologies in Section 4 and Section 5, and later give some use cases on how we can utilize our data quality assessment output in Section 6. Finally, we present some initial results in Section 7 and conclude the paper along with some potential future work in Section 8.
2 Research Problems
As we have explained in the previous section, we seek to evaluate first- or third-party data sources available for online advertising using multiple accuracy metrics.
Definition 1
A sound data source tags each virtual user (cookie ID that might be specific to a browser and device) with one and only one of the 3 labels – {Positive, Negative, Unknown}.
The tagging process could be explicit or deductive, but cannot be self-contradictory. For example, a user can have a positive tag – Age25, or a negative tag – NotAge25, but cannot be tagged as both Age25 and NotAge25. The data source can also simply indicate that it has no knowledge on a user by tagging it as Unknown. In real-time bidding, the positive tags are the most important, as advertisers usually utilize them to target the desirable audiences.
The problem of data quality assessment is defined as the following:
Definition 2
Given a sound data source $D$, its data quality assessment is defined as a measurement $\hat{f}(D)$ that has error no more than $\epsilon$ over $n$ user examples drawn from user set $\mathcal{U}$, with probability of at least $1 - \delta$, where $f$ is a metric to measure the granular targeting performance when this data source tags user $u$ with $D(u)$.
As an example, suppose we have a data source $D$ which we utilize to tag a user as Male (positive example) or Not Male (negative example). Consider two evaluation metrics: Accuracy (percentage of correct taggings by our data source), and True Positive Rate (percentage of positive examples, i.e. Males, that our data source also tags as Male). For the accuracy metric, we have the following $f$:
$$f(u) = \mathbb{1}\left[\, g(u) = D(u) \,\right],$$
which does an exact comparison of the ground truth tagging of user $u$ ($g(u)$) against the tagging by the data source ($D(u)$). On the other hand, if we were to calculate true positive rate, then we would have the following $f$:
$$f(u) = \mathbb{1}\left[\, g(u) = \text{Male} \;\wedge\; D(u) = \text{Male} \,\right],$$
which counts only those cases where both ground truth and the data source tag the user as Male.
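When user-level ground truth is available, the two metrics in this example are straightforward to compute. The sketch below is a minimal illustration; the function names and the toy label lists are our own assumptions and are not taken from any production system.

```python
from typing import Sequence


def accuracy(truth: Sequence[str], tags: Sequence[str]) -> float:
    """Fraction of users whose data source tag equals the ground truth."""
    matches = sum(1 for g, d in zip(truth, tags) if g == d)
    return matches / len(truth)


def true_positive_rate(
    truth: Sequence[str], tags: Sequence[str], positive: str = "Male"
) -> float:
    """Fraction of ground truth positives that the data source also tags as positive."""
    positives = sum(1 for g in truth if g == positive)
    hits = sum(1 for g, d in zip(truth, tags) if g == positive and d == positive)
    return hits / positives


# Hypothetical toy data: five users with ground truth labels and data source tags.
truth = ["Male", "Male", "Not Male", "Male", "Not Male"]
tags = ["Male", "Not Male", "Not Male", "Male", "Male"]

acc = accuracy(truth, tags)            # 3 of the 5 tags match the ground truth
tpr = true_positive_rate(truth, tags)  # 2 of the 3 Males are tagged as Male
```

Both functions need one label per user, which is exactly the information that aggregate reports withhold.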
Note that the above problem definition is a very general formulation, which is typically used in evaluating Machine Learning models [7, 8, 12]. As long as both the ground truth category of a user and that of the data source are available, one can come up with a perfect data quality assessment, i.e. one with zero error in the sense of Def. 2. The problem occurs when we don’t have direct access to the ground truth category of every single user. Typically, the user-level ground truth is unknown, and only the category distribution of groups of users is provided. The main reason is to protect the privacy of users [1]. In these kinds of situations, especially in online advertising, we may utilize a specific data source to make smart advertising decisions to choose the most appropriate set of users, and in the end, we can receive an aggregated report from a third-party evaluator, which is considered as the ground truth and provides a non-granular distribution of the audience we have reached over many categories of interest. As an example, the report may state that over all users we have 20% Male, and 80% Not Male. When this occurs, we can no longer do a one-to-one comparison between ground truth and data source in the user granularity, but rather need to come up with alternative methods that can deal with aggregated data, which is our main focus in this paper.
In many cases, we need to select the best data source from a large set of candidates with the same semantic goal and adopt it for targeting. For example, given a set of data sources that tag users as male, female, or unknown, we may care more about their relative performance and less about their absolute measurements. The data quality assessment can then be simplified as a ranking problem:
Definition 3
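Given two sound data sources $D_1$ and $D_2$, and an accuracy metric $f$, a data quality ranking system outputs a rank measurement $r_1$ for $D_1$ and $r_2$ for $D_2$ such that the order of $r_1$ and $r_2$ reflects which of $D_1$ and $D_2$ is more likely to perform better under $f$.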
Once we have the rank measurements for each sound data source, we can order them and select the best one.
3 Previous Work
As we have explained in the problem definition, evaluation of a data source can be taken as any other machine learning model evaluation task, provided that we have the ground truth information in the user granularity. A detailed evaluation of 18 performance metrics for classification problems is given in [7]. These 18 metrics can be listed as accuracy, kappa statistic, mean F-measure, macro average arithmetic, macro average geometric, AUC of each class against the rest (two variants), AUC of each class couples (two variants), scored AUC, probabilistic AUC, macro average mean probability rate, mean probability rate, mean absolute error, mean squared error, LogLoss, calibration loss, and calibration by bins. The paper provides a detailed correlation analysis and noise sensitivity analysis. Also, the survey by Gunawardana et al. [8] discusses both the evaluation settings and proper evaluation metrics for different classes of recommendation problems, of which online advertising is a subproblem.
When we only have access to aggregated ground truth data, evaluation of a data source is much harder. There has been significant work in aggregate learning tasks which utilize aggregate assignments of classes to groups of samples to train a model. Our aim in this paper is significantly different from such works, since we already have a model (i.e. data source), and we are trying to evaluate its performance utilizing many campaigns and multiple aggregates of ground truth data. Cheplygina et al. [5] provide an overview of aggregate learning methodologies, which may utilize granular response variables/feature vectors (single instance) or aggregate response variables/feature vectors for groups (multiple instance) to train their models, and later, to test them. Musicant et al. [11] utilize aggregate outputs for the response variables to specialize the training process of k-nearest neighbors, decision trees, and support vector machines. In [3], the authors utilize aggregate views of data, which consist of a choice of different combinations of features, response variables, and combining machine learning models learned from these views. Another interesting work is presented in [16], which gives error bounds on how a model learned from aggregate data can perform. They assert that a machine learning model should minimize empirical proportion risk, and prove that under certain assumptions for the class distributions, learning in the aggregate setting can actually improve individual classification performance.
Finally, specific to the online advertising domain, we can list [14] as being a relevant work to ours. In that paper, similar to aggregate learning techniques, the authors aim to learn a predictive model to decide whether a user is in a specific ground truth category using the aggregate data over many campaigns, by assigning the most likely label to all users in the aggregate, or assigning a probabilistic single label. They utilize logistic regression with L norm regularization, where the response variables are the artificially generated labels.
4 Brute Force Evaluation
In this section we will present our first proposal for data quality assessment, which includes setting up specialized campaigns for a data source and utilizing the targeting results directly for evaluation.
Note that we typically rely on the independent survey agencies to collect the ground truth analysis data on our audience population. Such agencies use offline data (such as credit card information) and online data (such as information filled in social networking websites) to profile an Internet user. Reports from these survey agencies are generally considered as the ground truth by advertisers. Such reports are aggregated statistics and contain no user-level information due to privacy reasons.
4.1 Performance Campaign for Data Source
An intuitive and straightforward way to evaluate a data source is to set up a campaign that only targets users which are tagged by data source $D$ to be in category $c$. This way we can calculate the quality of the data source as $n_c / n$, where $n_c$ is the number of users in category $c$ reached by this campaign as reported by the ground truth and $n$ is the total number of users reached by the campaign via at least one impression. Note that we can put a limit on the number of impressions to be served to a user so that we can increase the unique user reach and have more reliable results.
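A minimal sketch of this calculation follows, assuming the agency report lists the number of unique reached users per ground truth category. The report layout, the function name and the numbers are made up for illustration.

```python
from typing import Dict


def brute_force_precision(report: Dict[str, int], category: str) -> float:
    """Share of reached users that the ground truth report places in `category`.

    `report` maps each ground truth category to the number of unique users
    reached by a campaign that targeted only users tagged with `category`.
    """
    total_reached = sum(report.values())
    if total_reached == 0:
        raise ValueError("campaign reached no users")
    return report.get(category, 0) / total_reached


# Hypothetical aggregate report for a campaign targeting users tagged "Male".
report = {"Male": 6200, "Not Male": 3100, "Unknown": 700}
precision = brute_force_precision(report, "Male")  # 6200 of 10000 reached users
```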
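Table 2: Age range predictions of a data provider against the ground truth. Each row is an age range predicted by the data source ($\hat{A}_i$), each column is a ground truth age range ($A_j$).
| Age Ranges | $A_1$ | $A_2$ | $A_3$ | $A_4$ | $A_5$ | $A_6$ | $A_7$ | $A_8$ | $A_9$ | $A_{10}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| $\hat{A}_1$ | 0.414 | 0.245 | 0.033 | 0.018 | 0.012 | 0.032 | 0.043 | 0.064 | 0.075 | 0.030 |
| $\hat{A}_2$ | 0.058 | 0.297 | 0.246 | 0.037 | 0.021 | 0.043 | 0.051 | 0.089 | 0.091 | 0.046 |
| $\hat{A}_3$ | 0.031 | 0.053 | 0.298 | 0.227 | 0.049 | 0.042 | 0.042 | 0.049 | 0.132 | 0.054 |
| $\hat{A}_4$ | 0.031 | 0.041 | 0.089 | 0.345 | 0.182 | 0.057 | 0.037 | 0.039 | 0.116 | 0.041 |
| $\hat{A}_5$ | 0.037 | 0.057 | 0.056 | 0.063 | 0.337 | 0.212 | 0.047 | 0.038 | 0.082 | 0.050 |
| $\hat{A}_6$ | 0.056 | 0.049 | 0.061 | 0.041 | 0.053 | 0.339 | 0.230 | 0.041 | 0.065 | 0.046 |
| $\hat{A}_7$ | 0.080 | 0.067 | 0.054 | 0.035 | 0.030 | 0.041 | 0.332 | 0.204 | 0.077 | 0.048 |
| $\hat{A}_8$ | 0.065 | 0.074 | 0.082 | 0.043 | 0.030 | 0.040 | 0.048 | 0.339 | 0.203 | 0.052 |
| $\hat{A}_9$ | 0.050 | 0.044 | 0.066 | 0.065 | 0.048 | 0.033 | 0.048 | 0.044 | 0.488 | 0.101 |
| $\hat{A}_{10}$ | 0.035 | 0.036 | 0.055 | 0.048 | 0.039 | 0.048 | 0.061 | 0.060 | 0.106 | 0.492 |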
We applied this methodology to evaluate age and gender categorizations of some well-established data providers in online advertising. Table 2 demonstrates some results for one of the better performing such data providers in age categorization. We anonymize the name of the data provider and the exact age ranges listed in the table to comply with the company’s regulations. In Table 2, $A_1, \dots, A_{10}$ represent the age ranges such that they are mutually exclusive and sorted in ascending order, e.g. $A_2$ is the immediate higher range after $A_1$ (i.e. the minimum age in range $A_2$ is one larger than the maximum age in range $A_1$), and the immediate lower range before $A_3$. $\hat{A}_i$ represents the predicted results by the data source for age range $A_i$, while $A_i$ stands for ground truth data for the same age range. For example, we can observe from the table that $P(A_4 \mid \hat{A}_4) = 0.345$, that is, when our data source classifies the user to be in age range $A_4$, 34.5% of the time it is correct, or in other words, 34.5% of the users reached by the campaign that targets category $A_4$, as predicted by the data source, were actually in category $A_4$, as provided by ground truth. Note that for this particular data source, the exact match (the diagonal entries of the table) is quite high compared to random.
Data providers utilize various models and online/offline information to tag users. Please note that data sources from the same data provider may have quite diverse accuracy values. For example, the accuracy results in Table 2 range from 0.29 to 0.49. Thus we cannot evaluate one data source and assume its sibling data sources have similar predictive power. Each data source needs to be evaluated individually. This exposes a significant disadvantage of the above methodology: we need to set up a separate campaign for each category so that we can gather the accuracy statistics. To remedy this problem, we propose a second methodology in Section 5, which solves an optimization problem to come up with the best fitting accuracy probabilities based on the aggregate reports.
4.2 Cost Analysis
In this subsection, we further analyze the cost of the brute force method discussed in Section 4.1. It must be noted that obtaining the ground truth, that is the aggregated labeled data, is costly in its own right. However, in the following lemma, we focus only on the ad serving costs to underline the utility and benefit of our approach.
Lemma 1
Given a data source that can tag the user by one of the $k$ possible categories (e.g. the data source gives a positive/negative output on one age group of $k$ possible age groups), then to observe a significant difference between the calculated accuracy of one category versus others, we need at least impressions.
Assuming a uniform distribution, we assume the average on-target rate is $1/k$ for each data source (although the intention of a data source is to increase this value). Each impression can be considered as a Bernoulli trial with $1/k$ probability of success. The sample variance is, thus, $\frac{1}{k}\left(1 - \frac{1}{k}\right)$. We would like to detect a significant difference between the prediction accuracy of the correct category versus the rest of possible tags. The industry standard accepts a error. Then, for a significance level of for a two-tailed hypothesis test and to attain at least power, we have , where stands for the number of users, and are the values of the quantile function of the standard normal distribution for and , respectively [2]. Therefore, the number of users that receive the ad impressions must be more than for each data source.
Based on Lemma 1, for data sources that provide information on one of the possible categories, we need at least impressions. As we discussed before, it is necessary to evaluate each data source individually, considering the diversity of their predictive power. Given the very large number of data sources, this causes the brute-force approach to incur a very significant cost.
5 Accuracy Inference
High-quality data sources can enable advertisers to reach the right audience at the right moment. Because they have become an important component of online advertising, more and more online/offline data are being ingested into Turn’s data management platform. As we have mentioned previously, there are currently over 200k active data sources in our system. Lemma 1 established that explicitly evaluating each of these data sources by running performance campaigns is overwhelmingly costly: not only is a large amount of money required to run the performance campaigns, but enormous manual effort is also needed to set up and manage those campaigns. We need an efficient way to simultaneously infer the accuracy of multiple data sources.
As we have presented in Section 2, our focus in this paper is to calculate the accuracy metrics of a data source for single or multiple categories. In essence, we are trying to calculate a set of probabilities, which represent the likelihood of a data source predicting correctly/incorrectly that a user belongs to a category $c$. In Figure 1, we have shown the set of probabilities that we aim to predict. For representational purposes, we have shown the accuracy probabilities of a data source which denote its capabilities to tag a user as Male or not, though the same logic follows for any category. Writing $p_{x \mid y}$ for the probability that the ground truth is $x$ when the data source tags the user as $y$, with $M$ for Male, $\bar{M}$ for Not Male and $U$ for Unknown, the probabilities in the figure can be listed as follows:
- $p_{M \mid M}$: The probability of the user actually being Male when the data source tags it as Male. This value can also be called precision or positive predictive value.
- $p_{\bar{M} \mid M}$: The probability of the user actually being Not Male when the data source tags it as Male. This value can also be called false discovery rate.
- $p_{U \mid M}$: The probability of the user being Unknown (i.e. ground truth does not exist) when the data source tags it as Male.
- $p_{M \mid \bar{M}}$: The probability of the user actually being Male when the data source tags it as Not Male.
- $p_{\bar{M} \mid \bar{M}}$: The probability of the user actually being Not Male when the data source tags it as Not Male. This value can also be called negative predictive value.
- $p_{U \mid \bar{M}}$: The probability of the user being Unknown (i.e. ground truth does not exist) when the data source tags it as Not Male.
- $p_{M \mid U}$: The probability of the user actually being Male when the data source tags it as Unknown.
- $p_{\bar{M} \mid U}$: The probability of the user actually being Not Male when the data source tags it as Unknown.
- $p_{U \mid U}$: The probability of the user being Unknown (i.e. ground truth does not exist) when the data source tags it as Unknown.
As can be seen from the figure, and trivially from the definitions, the three probabilities for each tag sum to one: $p_{M \mid y} + p_{\bar{M} \mid y} + p_{U \mid y} = 1$ for every tag $y \in \{M, \bar{M}, U\}$. Among these nine variables, $p_{M \mid M}$, i.e. precision, is often the most important value for the advertisers, since it denotes the goodness of a data source to be used for their advertising purposes. To calculate this value, Section 4 presented a methodology based on creating specific campaigns to evaluate a single data source for a specific category. In this section we propose an optimization scheme, which utilizes the aggregated category distributions over multiple campaigns, for both the data source we want to evaluate, and the ground truth. In the following subsections, we call the above nine variables predictive values of a data source.
5.1 Setup for Inference
We propose to set up multiple performance campaigns without using any data source for targeting, so the audience will not be explicitly skewed by any data source. We compare the ground truth of each campaign against the hypothesis of each data source, and infer the quality of the data source.
We follow a set of rules to set up a performance campaign. First, the targeting criteria should be minimal and cannot be biased by any third-party data. For example, targeting the online users in U.S. is fine, since this is purely based on IP address and not biased; however, using a data source to limit audience to middle-aged men is not acceptable as the quality of this data source is what we want to assess. In general, only geographical location should be used as targeting criteria. Second, the targeted websites must be discriminative, that is, the population of the visiting users should be largely skewed towards one of our possible tags. This way, we will not mistakenly estimate a data source to be accurate, when in fact it is predicting the label in a random manner. For example, a website is beneficial for such an experiment if 70% of visitors are female and 30% of visitors are male, and not necessarily if the distribution is 50% to 50% (since, in such a case, a random prediction of female vs. male will closely fit the overall audience in aggregate). One should note that obtaining such knowledge is not always feasible before running the campaign. Therefore, in our system, we target only the websites that are known, based on our domain experience and verified by independent reports, to be popular among a certain group of audience. After creating such a campaign, we run it for a certain period to collect data. The ground truth is collected through independent agencies as described in Section 4.
We log the received report data along with the first-party campaign data into our in-house data warehousing system called Cheetah [4], which is built on top of the Hadoop framework [13]. Cheetah is designed specifically for our online advertising application to allow various simplifications and custom optimizations. Campaign facts are stored within nested relational data tables. Fast MapReduce jobs are designed to extract key features of the performance campaigns, compare them with the ground truth and infer the accuracy of a data source. Utilizing the collected campaign information, we present two approaches in this section to efficiently infer the quality of data sources: one that ranks data sources, and another which directly deduces the precision (positive predictive value) of a data source.
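To illustrate the second campaign rule (discriminative websites), consider a hypothetical data source that tags users at random, marking half of any audience as female. The sketch below compares its aggregate positive ratio with the ground truth on a balanced and on a skewed website. The function name and the numbers are illustrative assumptions, not measurements.

```python
def aggregate_gap(tagged_positive_ratio: float, truth_positive_ratio: float) -> float:
    """Absolute gap between the tagged and the ground truth positive ratios."""
    return abs(tagged_positive_ratio - truth_positive_ratio)


# A random tagger marks 50% of any audience as female, whoever the users are.
random_tagger_ratio = 0.5

gap_balanced = aggregate_gap(random_tagger_ratio, 0.5)  # website with 50% female visitors
gap_skewed = aggregate_gap(random_tagger_ratio, 0.7)    # website with 70% female visitors
```

On the balanced website the random tagger cannot be told apart from a perfect data source in aggregate, since the gap is zero. On the skewed website its aggregate is visibly off, which is what makes such websites useful for the assessment.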
5.2 Ranking Based Assessment
Ranking Data Sources. In many instances, we only need to choose the best data source from a large set of candidates with similar semantic purposes. If this is the case, then data quality assessment becomes a ranking problem. A data source’s absolute precision is of less importance then, and rather its rank among others is critical.
Since the independent evaluation agency sends us the aggregate statistics on a campaign, we can similarly construct such statistics using a data source. Note that this approach will only represent the view of the data source, and not the ground truth, unlike the independent agency case. We can then evaluate this data source based on how close the constructed statistics are to that of the ground truth, and therefore rank data sources based on such closeness measure. This logic is presented in Algorithm 1.
There are multiple ways to design the closeness function to compare two aggregated statistics. Since the positive taggings are the most valuable for online advertising purposes, we propose to compare the positive population distributions between the ground truth and the data source. We define the percentage of population marked as positive by a data source $D$ in a campaign $C$ as
$$r_D(C) = \frac{pos_D(C)}{pos_D(C) + neg_D(C)} \qquad (1)$$
where $pos_D(C)$ counts the number of users in $C$ marked as positive by $D$, and $neg_D(C)$ counts negatives. Given that $r_G(C)$ is the ground truth ratio of positive population, a simple way to calculate the closeness can be defined as:
$$\left| r_D(C) - r_G(C) \right| \qquad (2)$$
However, this does not consider the scale of $r_D(C)$ or $r_G(C)$, which are usually quite small for a rare positive group. To make the measurements more comparable, instead we propose to calculate the relative error as the closeness:
$$RE_D(C) = \frac{\left| \widehat{pos}_D(C) - pos_G(C) \right|}{pos_G(C)}$$
where $pos_G(C)$ is the number of positives reported by the ground truth and $\widehat{pos}_D(C)$ is the scaled number of positives marked by the given data source; the number of ground truth negatives, $neg_G(C)$, enters through the scaling. We want to scale the number of positives of the data source, because the population recognized by $D$ might be quite different from the one recognized by the ground truth. For example, the independent evaluation agency might have data on 10000 users, while $D$ might only have data on 1000 users. Therefore, we need to extrapolate the population unrecognized by $D$ to scale up the populations:
$$\widehat{pos}_D(C) = r_D(C) \cdot \left( pos_G(C) + neg_G(C) \right) \qquad (3)$$
The resulting relative error reflects the potential error rate if we scale the data source’s recognizable population to the size of the ground truth population. Per Algorithm 1, we calculate the average relative error ($\overline{RE}$) across all performance campaigns for each data source. We can then rank data sources based on their average relative errors.
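The ranking logic described above can be sketched as follows. This is a minimal illustration under our own assumptions: the `CampaignCounts` container, the function names and the toy counts are hypothetical, and the relative error uses the scaled positive count described in the text.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CampaignCounts:
    """Aggregate counts for one performance campaign (hypothetical container)."""
    truth_pos: int   # positives reported by the ground truth
    truth_neg: int   # negatives reported by the ground truth
    source_pos: int  # users the data source marks as positive
    source_neg: int  # users the data source marks as negative


def relative_error(c: CampaignCounts) -> float:
    """Relative error between scaled data source positives and ground truth positives."""
    source_ratio = c.source_pos / (c.source_pos + c.source_neg)
    # Scale the data source's positive ratio up to the ground truth population.
    scaled_pos = source_ratio * (c.truth_pos + c.truth_neg)
    return abs(scaled_pos - c.truth_pos) / c.truth_pos


def rank_sources(campaigns_by_source: Dict[str, List[CampaignCounts]]) -> List[str]:
    """Order data sources by average relative error, smallest (best) first."""
    avg_re = {
        name: sum(relative_error(c) for c in campaigns) / len(campaigns)
        for name, campaigns in campaigns_by_source.items()
    }
    return sorted(avg_re, key=avg_re.get)


# Toy example: the ground truth reports 3000 positives out of 10000 users in
# both campaigns, while each data source only recognizes 1000 users.
campaigns_by_source = {
    "source_a": [CampaignCounts(3000, 7000, 280, 720), CampaignCounts(3000, 7000, 310, 690)],
    "source_b": [CampaignCounts(3000, 7000, 450, 550), CampaignCounts(3000, 7000, 400, 600)],
}
ranking = rank_sources(campaigns_by_source)
```

In this toy data `source_a` stays close to the ground truth positive ratio of 30% in both campaigns, so it is ranked ahead of `source_b`.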
Soundness Analysis. A ranking algorithm needs to be sound to ensure the optimal assessment: Given two data sources $D_1$ and $D_2$, a ranking algorithm is sound if it outputs measurements $r_1$ for $D_1$ and $r_2$ for $D_2$, such that $r_1$ is ranked ahead of $r_2$ if and only if $D_1$ is more likely to perform better than $D_2$, as we defined in Def. 3. We will show that the relative error (RE) based ranking algorithm is sound in many cases. First, let us define the notion of unbiasedness for a data source:
Definition 4
A data source is unbiased if and only if its positive predictive value equals its negative predictive value: $P(\text{truth positive} \mid \text{tagged positive}) = P(\text{truth negative} \mid \text{tagged negative})$.
Our experience suggests that many data sources we utilize are on demographics and can be considered as unbiased. For example, the accuracy that a data source claims someone as male, in general, is close to the accuracy that it claims someone as female (Not Male as in Figure 1). In real-time bidding, better on-target metrics, i.e. improving the ratio of audience that really have the data source’s claimed characteristics, is the endeavor of any data provider. We can show that the RE-based ranking algorithm is sound:
Lemma 2
Given a set of sound data sources, the RE-based ranking algorithm is sound for precisions and orders the data sources based on their expected performance. In addition, if the data sources are all unbiased, the algorithm is sound for any on-target metrics.
Per definition, the positive ratio of the better data source is closer to the reality, thus its RE is smaller. Since the ground truth positive count is constant across data sources, the order of RE is preserved for precisions. On-target metrics can be the precision, or the negative predictive value, or simply the micro or macro average of the two predictive values. Since the data sources are unbiased, the order of metrics for negatives is also preserved. Averaging is monotonic, therefore we can expand the previous statements to micro and macro averaging cases as well.
5.3 Precision Inference Approach
Although the ranking methodology is able to pinpoint the highest performing data sources, the output ranking measurement is only a surrogate of precision. It correlates with the underlying precision, but is inherently different. As we will show later, in online advertising, it is often necessary to forecast the campaign performance as well as evaluate whether a third-party data source is worth the extra amount of money that an advertiser has to pay in order to utilize it. In such cases we need an accurate estimation of a data source’s precision.
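Direct Inference of Data Source Precision. We propose an efficient way to directly estimate the predictive values of a data source. As shown in Figure 1, a data source’s hypothesis on the audience population can be mapped into the ground truth using its predictive values. Write $p_{x \mid y}$ for the predictive value that the ground truth is $x$ when the data source tags $y$, with $x, y \in \{+, -, u\}$ standing for Positive, Negative and Unknown (in the example of Figure 1, Positive is Male and Negative is Not Male). Given a performance campaign $C_i$, let the size of Positive, Negative and Unknown audiences identified by data source $D$ be $d^i_+$, $d^i_-$ and $d^i_u$ correspondingly (these are scaled values to cover the whole population of $C_i$), and the size of ground truth Positive, Negative and Unknown audiences be $g^i_+$, $g^i_-$ and $g^i_u$ correspondingly. When the audience population size is large, it is clear that we have
$$g^i_x \approx p_{x \mid +}\, d^i_+ + p_{x \mid -}\, d^i_- + p_{x \mid u}\, d^i_u, \qquad x \in \{+, -, u\}.$$
Combining with the probability simplex constraint and the unbiasedness constraint, we can estimate a data source’s predictive values by solving a quadratic optimization problem: given performance campaigns $C_1, \dots, C_n$, we search for predictive values $p_{x \mid y}$ so that
$$\min_{p} \; \sum_{i=1}^{n} \sum_{x} \Big( g^i_x - \sum_{y} p_{x \mid y}\, d^i_y \Big)^2 \qquad (4)$$
$$\text{s.t.} \quad \sum_{x} p_{x \mid y} = 1 \quad \forall y \qquad (5)$$
$$0 \le p_{x \mid y} \le 1 \quad \forall x, y \qquad (6)$$
$$\left| p_{+ \mid +} - p_{- \mid -} \right| \le \epsilon \qquad (7)$$
Here, (4) is our optimization objective which aims to find the best mapping between the data source’s hypothesis and the ground truth. Note that here we assume the size of audiences of campaigns to be similar, which can be controlled at campaign set up time. Otherwise we need to normalize by the audience size of each campaign. (5) and (6) enforce the probability simplex. (7) attempts to help us find the unbiased solution, and the predefined constant $\epsilon$ controls our confidence on the unbiasedness of the predictive values.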
We, therefore, can run a few performance campaigns, extract each data source’s hypothesis on those campaigns, compare with the ground truth and solve the above optimization problem. As we will show, this will efficiently give us the estimated predictive values of data sources in batch (among those, precision is the most valuable for online advertising).
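As a rough sketch of how such a fit could be computed, the snippet below minimizes the squared mismatch between the mapped data source hypothesis and the ground truth by projected gradient descent, keeping the predictive values of each tag on the probability simplex. It is an illustration under our own assumptions rather than the solver used in production: the function names, the use of audience fractions instead of raw sizes, the choice of projected gradient descent, and the omission of the unbiasedness constraint (7) are all simplifications introduced here.

```python
from typing import List, Sequence


def project_to_simplex(v: Sequence[float]) -> List[float]:
    """Euclidean projection of v onto {w : w_k >= 0, sum(w) = 1}."""
    cumulative, theta = 0.0, 0.0
    for j, value in enumerate(sorted(v, reverse=True), start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def infer_predictive_values(
    tag_fracs: Sequence[Sequence[float]],
    truth_fracs: Sequence[Sequence[float]],
    iterations: int = 2000,
) -> List[List[float]]:
    """Return q where q[x][y] estimates P(ground truth = x | tag = y).

    tag_fracs[i] and truth_fracs[i] hold the (positive, negative, unknown)
    audience fractions of campaign i according to the data source and the
    ground truth report, respectively.
    """
    n = 3
    q = [[1.0 / n] * n for _ in range(n)]  # start from uniform predictive values
    # Step size 1/L, with L bounded by twice the sum of squared tag fractions.
    step = 1.0 / (2.0 * sum(d_y * d_y for d in tag_fracs for d_y in d))
    for _ in range(iterations):
        grad = [[0.0] * n for _ in range(n)]
        for d, g in zip(tag_fracs, truth_fracs):
            for x in range(n):
                residual = sum(q[x][y] * d[y] for y in range(n)) - g[x]
                for y in range(n):
                    grad[x][y] += 2.0 * residual * d[y]
        for y in range(n):  # each tag's column must remain a probability distribution
            column = project_to_simplex([q[x][y] - step * grad[x][y] for x in range(n)])
            for x in range(n):
                q[x][y] = column[x]
    return q


# Hypothetical predictive values used to generate noise-free toy campaigns.
true_q = [
    [0.70, 0.25, 0.40],  # ground truth positive, given tag positive / negative / unknown
    [0.20, 0.65, 0.40],  # ground truth negative
    [0.10, 0.10, 0.20],  # ground truth unknown
]
tag_fracs = [(0.6, 0.3, 0.1), (0.2, 0.7, 0.1), (0.3, 0.2, 0.5), (0.1, 0.4, 0.5)]
truth_fracs = [
    [sum(true_q[x][y] * d[y] for y in range(3)) for x in range(3)] for d in tag_fracs
]
estimate = infer_predictive_values(tag_fracs, truth_fracs)
precision = estimate[0][0]  # estimated P(ground truth positive | tagged positive)
```

With noise-free toy campaigns whose tag distributions differ enough from each other, the fit recovers the generating values. With real reports the estimate would carry the campaign-level noise discussed next, and an off-the-shelf quadratic programming solver could handle constraint (7) directly.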
Performance Analysis. The proposed inference approach is efficient, in terms of both computational complexity and money. First, it is straightforward to show that the quadratic programming problem has a positive semidefinite Hessian (a bowl shape). The optimization problem is therefore convex and can be solved efficiently with polynomial time complexity. Additionally, we only need to run a limited number of performance campaigns to simultaneously estimate the predictive values of multiple data sources. In practice, it is possible that a data source’s predictive values are slightly different in different performance campaigns due to variance. Given a campaign $C_i$, it is natural to assume that any predictive value $p$ of a data source takes the value
$$p^{(i)} = p + \varepsilon^{(i)}$$
for this specific campaign, where $\varepsilon^{(i)}$ is normally distributed with zero mean. In such cases, we can get the unbiased estimate of a data source’s predictive values by running a limited number of performance campaigns:
Theorem 3
Given performance campaigns, our direct inference method can get the unique and unbiased estimate of a data source’s predictive values. Furthermore, given any predictive value $p$ and its estimation $\hat{p}$, we have
where is a constant, is the standard error of the estimation, and is th quantile of Student’s t-distribution with degrees of freedom.
The optimization can be converted into a linear regression problem within a simplex search space. This regression problem contains 6 free regressors and each campaign provides 2 points in the space. When we have campaigns, the quadratic matrix is positive-definite and we will have a unique global optimal solution. A Bias-Variance-Noise decomposition shows the solution is unbiased.
Since the errors are normally distributed, the sum of the regression residuals is then distributed proportional to Student’s t-distribution with degrees of freedom:
We then construct the confidence levels for the estimated regressors. By running more campaigns, we can quickly reduce the estimation errors and get highly reliable predictive value estimations of multiple data sources. Given its computational and economic efficiency, we adopted the direct inference method and utilize it continuously to generate the quality report on data sources.
6 Use Cases
In this section, we will discuss some use cases where the quality assessment of first- or third-party data sources can be useful. First, we will talk about targeting in online advertising, and the amount that an advertiser should be willing to pay for a data source. Then we will give a very general use case in campaign forecasting, i.e. predicting, before an online advertising campaign starts, what category of users will actually be reached by preset targeting criteria.
6.1 Targeting in Online Advertising
Advertisers aim to reach the best audiences to promote their products, so that they can increase the likelihood of a click or an action happening. The automated way of grouping users into beneficial and non-beneficial subsets is often called audience segmentation. For an informative work on how this kind of audience segmentation can improve click rates, refer to [15]. In this paper, however, we focus on a different kind of targeting where the advertisers already have a predefined set of users they want to target. As an example, suppose that an advertiser wants to reach only female audiences within the age range 21-35. There are multiple data sources this advertiser can utilize to reach this group, but as discussed in this paper, none of these data sources gives a definitive classification. Intuitively, an accurate estimate of the quality of a data source is essential for advertisers to choose it over others. Also, note that here we mostly care about the precision or positive predictive value of a data source (i.e. when a data source suggests that a user is in a category, the likelihood that this user actually belongs to that category), since this is the signal that the advertiser uses to bid on a user.
Here, we would also like to discuss data cost. In general, when an advertiser wants to utilize a third-party data source for bidding purposes, it has to pay the third-party provider. This cost is generally charged per impression served using the data source, and hence can have a significant effect on the ROI of an advertiser (i.e. the advertiser needs to pick up extra clicks/conversions to make up for the money paid to the third party for the targeting information it provides). An important point for an advertiser to consider is whether a data contract is “worth” its price. We give a simple calculation here that compares the case where the advertiser utilizes no data source to reach a specific audience (i.e. free targeting) with the case where it adds the data source and pays for it. Our main argument is that, by paying for the data source to target a specific audience, the reduced cost of the mistargeted impressions (i.e. impressions served to users outside our desired audience) should make up for the data cost. In other words, for the same amount of money we should get more of the desired impressions, although our total number of impressions is smaller due to the data cost. Please note that below we assume the effective cost per impression ($cpi_e$) to be the same for both free targeting and data source assisted targeting; the data source only adds the data cost per impression ($cpi_d$):
$$\frac{totalSpend}{cpi_e + cpi_d} \times (1 - errorRate_d) \geq \frac{totalSpend}{cpi_e} \times (1 - errorRate_f)$$
$$\frac{totalSpend}{cpi_e + cpi_d} \times precision_d \geq \frac{totalSpend}{cpi_e} \times (1 - errorRate_f)$$
Above, $totalSpend$ is the amount of money that the campaign spends, $cpi_e$ is the effective cost per impression, and $cpi_d$ is the data cost per impression (hence $totalSpend / (cpi_e + cpi_d)$ is the number of impressions picked up by data source assisted targeting, and $totalSpend / cpi_e$ is the number of impressions that can be picked up via free targeting, i.e. with no data cost). $errorRate_f$ is the inaccuracy of free targeting (the percentage of the reached audience that is not desired), and $errorRate_d$ is the percentage of cases in which the data source predicts a user to be in the desired audience while, in fact, the user is not. In the second inequality, we translated $(1 - errorRate_d)$ into the precision of the data source from Section 5, written here as $precision_d$. After further reorganizing the above inequality, we get the following:
$$cpi_d \leq cpi_e \times \left( \frac{precision_d}{1 - errorRate_f} - 1 \right) \qquad (8)$$
This means that for a data source to be beneficial for a campaign, its data cost per impression should be at most $cpi_e \times (precision_d / (1 - errorRate_f) - 1)$. Please note that we assume here that the effective cost per impression is the same for free targeting and for data source assisted targeting, which is not always valid, i.e. we may have to pay more to show ads (impressions) to those users that the data source tagged as desirable. Also, it can be seen that the benefit of the data source depends on how expensive the impressions are for a campaign, and hence is campaign specific. Finally, the above calculations do not take into account the cost of data evaluation using our two proposed methodologies, which was also mentioned in Section 4.2. However, this evaluation can be performed once for each data source, and hence is not significant for each campaign that utilizes it.
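The break-even argument above can be sketched in a few lines of Python. This is a minimal illustration under the same assumption of an identical effective cost per impression for both targeting modes. The function and parameter names are hypothetical, and the numbers in the example are made up.

```python
def desired_impressions(total_spend, cost_per_impression, hit_rate):
    """Impressions that reach the desired audience for a given budget."""
    return total_spend / cost_per_impression * hit_rate


def break_even_data_cpi(cpi_effective, precision, free_error_rate):
    """Largest data cost per impression at which the data source still pays off.

    Obtained by rearranging
      spend / (cpi_e + cpi_d) * precision >= spend / cpi_e * (1 - free_error_rate).
    A negative result means the data source is not worth paying for at all.
    """
    if not 0.0 <= free_error_rate < 1.0:
        raise ValueError("free_error_rate must be in [0, 1)")
    return cpi_effective * (precision / (1.0 - free_error_rate) - 1.0)


if __name__ == "__main__":
    # Made-up numbers: 0.002 per impression, precision 0.8, and free targeting
    # that misses the desired audience half of the time.
    limit = break_even_data_cpi(0.002, precision=0.8, free_error_rate=0.5)
    print("maximum worthwhile data cost per impression:", limit)
    print(desired_impressions(1000.0, 0.002 + limit, 0.8),
          desired_impressions(1000.0, 0.002, 0.5))
```

At the break-even data cost both targeting modes deliver the same number of desired impressions, so any cheaper data contract favors the data source.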
6.2 Forecasting
Forecasting the performance (return-on-investment), reach (unique users we can show an ad to), and delivery (amount of money we can spend on advertising given the targeting criteria) of a campaign is a significant problem that has to be dealt with in online advertising [9]. Here we show that by utilizing the accuracy metrics (i.e. the predictive values from Section 5) over multiple data sources that may tag a user, we can predict the expected number of users that will fall into a specific audience/category once the advertising campaign goes live, on top of the total spend/reach as in the traditional forecasting problem.
Here is how the forecasting process in online advertising works. Once the advertiser sets some targeting criteria (filtering of users to show ads to, according to anonymous user properties) and goals (in terms of clicks and conversions) for a campaign, we can utilize our system as explained in [9] to find out which users this campaign is likely to reach. We can already calculate the expected number of unique users and the delivery for the campaign from this information alone. Furthermore, one problem that we can solve for the advertisers is the prediction of what percentage of these users will fall into a specific user category/class. Note that the approach in [14] does address the problem of predicting the likelihood of a user belonging to a specific category by utilizing many features, but its focus is on targeting rather than forecasting. Here, we suggest that rather than training a simple model for predicting the membership in a category, we can utilize multiple data sources and their estimated predictive values to forecast the expected number of users that will fall into a category. This information is not used at bidding time; hence, contrary to the targeting use case we explained in Section 6.1, there is no data cost.
Figure 2 summarizes the overall idea. In the first step, we apply the targeting criteria set by the advertiser to our set of users. For real-time forecasting, we often need to use a sampled set of users [9]. Once we filter the users that match the targeting criteria, we go through each of these users and check whether they are tagged with a first- or third-party category $\hat{c}_i$. Given these tags, we can calculate the probability of a user belonging to a desired ground-truth category $c$ as $P(c \mid \hat{c}_i)$, which is the corresponding predictive value from Section 5, i.e. precision. If we assume that each user is tagged by one and only one $\hat{c}_i$, we can forecast the expected number of users that will belong to category $c$ as:
$$E\big[\,|U_T \cap c|\,\big] = \sum_{u \in U_T} \sum_{i} I(u, \hat{c}_i) \, P(c \mid \hat{c}_i) \qquad (9)$$
In (9), $U_T$ is the set of users that satisfy a certain set of targeting criteria T. $c$ is the category for which the advertiser wants to forecast how many of the targeted users will belong to it. $\hat{c}_i$ is a first/third-party category from a list of categories for which we have the predictive values. $I(u, \hat{c}_i)$ is an indicator of whether user $u$ has category $\hat{c}_i$, and the above formula is valid only because we assume that each user has only one of the possible $\hat{c}_i$. If each user can have multiple first/third-party categories (as is the case in real situations), we need to aggregate multiple $P(c \mid \hat{c}_i)$ values, where we can utilize combination methods such as taking the maximum, minimum, average or median of the precision values.
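The forecasting step can be sketched as follows. This is a minimal illustration rather than the production system: the function name, the dictionary-based inputs, and the choice to let users without any evaluated tag contribute nothing are all assumptions made for the example.

```python
from statistics import mean, median

# Combination methods named in the text for users with several tags.
AGGREGATORS = {"max": max, "min": min, "mean": mean, "median": median}


def forecast_category_size(user_tags, precision_by_tag, aggregate="mean"):
    """Expected number of targeted users that truly belong to the category.

    user_tags:        {user_id: [tags]} for users passing the targeting criteria.
    precision_by_tag: {tag: Pr(user is in the category | user carries the tag)}.
    """
    combine = AGGREGATORS[aggregate]
    expected = 0.0
    for tags in user_tags.values():
        known = [precision_by_tag[t] for t in tags if t in precision_by_tag]
        if known:  # assumption: users without an evaluated tag add nothing
            expected += combine(known)
    return expected


if __name__ == "__main__":
    # Made-up users and precision values, for illustration only.
    users = {"u1": ["a"], "u2": ["b"], "u3": ["a", "b"], "u4": []}
    precisions = {"a": 0.8, "b": 0.3}
    for name in AGGREGATORS:
        print(name, forecast_category_size(users, precisions, name))
```

When every user carries exactly one evaluated tag, all four aggregators return the same value and the function reduces to the single-tag sum described above.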
7 Experimental Results
In this section we will give some preliminary results for our optimization-based evaluation technique. We have already given some preliminary results for our first methodology (Section 5.1) in Table 2. As mentioned earlier, we aim to calculate all nine predictive values of a data source for a category, but for the purposes of online advertising, the most important one is the precision. Only if this value is high can we reliably use this data source to reach a certain category of users.
Simulation Results. To evaluate our methodology we ran several simulated campaigns, where for each campaign we create an audience of 100 users and assign them to predicted categories as follows:

Random, disjoint sets of 20 users each are assigned to category c, not c, and unknown,

The remaining 40 users are each assigned to category c, not c, or unknown uniformly at random.
Then, we generate the ground truth categories for two types of data sources with the following actual probability values:

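High Quality: This data set has the following underlying nine probability values, listed as the probabilities of the ground truth being (c, not c, unknown): (0.8, 0.15, 0.05) for users predicted as c, (0.2, 0.7, 0.1) for users predicted as not c, and (0.4, 0.5, 0.1) for users predicted as unknown; note that each triple sums to 1,

Low Quality: This data set has the following underlying nine probability values: (0.4, 0.5, 0.1) for users predicted as c, (0.3, 0.6, 0.1) for users predicted as not c, and (0.5, 0.4, 0.1) for users predicted as unknown.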
Please note that this kind of synthetic data generation is quite counterintuitive. We first create the predicted values using some preset distribution, and then generate these users’ actual categories using the predictive values of the two data sources. For example, if a user in our synthetically generated audience has a predicted category of not c by the High Quality data source, then we assign it to the ground truth category c with probability 0.2, not c with probability 0.7, and unknown with probability 0.1.
Once we generate the dataset, we actually have aggregated values of category counts for each data source. Using these category counts, we can utilize our data quality assessment method we described in Section 5.3, and look into the difference between our computed predictive values, and the actual predictive values as given above.
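The generation procedure described above can be sketched as follows. This is an illustrative sketch only: the function and constant names are hypothetical, and the mapping of each probability triple to a predicted category (c, not c, unknown, in that order) is our reading of the setup rather than a confirmed detail.

```python
import random
from collections import Counter

CATEGORIES = ("c", "not c", "unknown")

# Assumed layout: predicted category -> probabilities of the ground truth
# being (c, not c, unknown), using the High Quality values from the text.
HIGH_QUALITY = {
    "c": (0.8, 0.15, 0.05),
    "not c": (0.2, 0.7, 0.1),
    "unknown": (0.4, 0.5, 0.1),
}


def simulate_campaign(predictive_values, rng, audience_size=100, fixed=20):
    """Return aggregate (predicted_counts, truth_counts) for one campaign."""
    # Disjoint blocks of `fixed` users for each predicted category.
    predicted = [cat for cat in CATEGORIES for _ in range(fixed)]
    # The remaining users get a uniformly random predicted category.
    predicted += [rng.choice(CATEGORIES)
                  for _ in range(audience_size - len(predicted))]
    # Ground truth is drawn from the row that matches the predicted category.
    truth = [rng.choices(CATEGORIES, weights=predictive_values[p])[0]
             for p in predicted]
    return Counter(predicted), Counter(truth)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are repeatable
    predicted_counts, truth_counts = simulate_campaign(HIGH_QUALITY, rng)
    print("predicted:", dict(predicted_counts))
    print("truth:    ", dict(truth_counts))
```

Only the two aggregate count vectors are returned, which mirrors the setting where the ground truth is available in aggregate form and not per user.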
The results of the simulations described above are given in Figures 3 and 4. In Figure 3, we estimate the nine predictive values for each of the data sources using our methodology, by utilizing multiple campaigns (the results are averaged over 100 trials at each value on the x-axis). As we have proven in Section 5.3, we need at least three campaigns to get a unique solution for all nine probabilities. We can observe that starting with four campaigns, the difference between the real values and our predicted values falls to zero. We have plotted the difference for the precision values, but the results are similar for the other predictive values.
Next, we performed another experiment in which we added uniform noise (whose magnitude we varied between 0 and 0.35) to the above nine real predictive values, and then generated the ground truth assignments. We tried to recover the nine predictive values using six campaigns, and present the difference (averaged over 100 trials) between real and predicted values in Figure 4. We can see that even under significant noise levels, our methodology can recover the precision values accurately. Because the precision of the high quality data source is higher, we can also observe that the noise effect is slightly smaller for it.
Real-World Results. Following the methods we discussed in Section 5.1, we ran 156 performance campaigns, each of which targeted a specific website. We used half of the campaigns to calculate and predictive values for around 100 data sources. Then, we tried to estimate the positive population in the rest of the campaigns using these 100 data sources. We utilized the average of positives predicted by the sources, and calculated the correlation with the ground truth positive sizes. For the direct inference method, it is clear that for each campaign , its estimated positive population is (for a single data source). For the method, by deducing (3), we can roughly estimate , where is the population size, is the percentage of population recognized by the ground truth in the training set, and is the average (i.e. RelativeErr from Eq. 3) of the training set for a single data source.
Each method’s Pearson correlation coefficient is shown in Figure 5. The direct inference method gives a significantly more accurate estimate of the positive population () and it correlates well with the ground truth.
Next, we utilized two popular data sources and as the targeting criterion and ran test campaigns individually. The positive rates reported by the independent evaluation agency can be treated as the ground truth of their precisions. The ground-truth precisions and our estimated values are listed in Figure 6. The direct inference method yields a much closer estimate of the ground truth (), while the ranking method preserves the order but yields values that are substantially different from the ground truth.
Our proposed approach was deployed into Turn’s data management platform and generates weekly reports on the quality of our top data sources. We have received positive feedback from our campaign optimization managers in the field, commenting that the reported precisions are close to the real campaign results. Interestingly, by evaluating our data sources periodically, we are forming a positive reinforcement loop over their data quality: feeling the pressure, data providers work consistently to improve their data quality. For example, the estimated precision of one data source over a three-month period is plotted in Figure 7. It is clear that the data source’s quality improved over this time period.
8 Conclusions and Future Work
In this paper, we have presented a novel framework to evaluate first- or third-party data sources on user properties for online advertising, which is a particularly challenging task when the ground truth is reported in aggregate form. We call this problem data quality assessment, and presented two solutions: one that utilizes the data sources directly in a campaign, and another that utilizes outputs from multiple online advertising campaigns to optimize a set of probabilities which represent the “goodness” of a data source. We have also presented some use cases on how these evaluations can be utilized in the online advertising domain, mainly in targeting, in assessing the amount of money that an advertiser should pay for a data source, and in forecasting. Some preliminary simulation and real-world results were also presented that show the effectiveness of our methodology, as well as some results on the performance of a well-established actual data provider for age categorization of users on multiple real-world advertising campaigns.
Possible future work mainly lies in the use cases of the evaluation output of our methodologies, as given in Section 6. Our current focus is on accurate targeting of online users, which also needs to take into account the problem of combining multiple data sources and their quality assessments to come up with a better model.
Acknowledgments
We thank many talented scientists and engineers at Turn for their help and feedback in this work.
References
 [1] N. R. Adam and J. Wortmann. Security-control methods for statistical databases: A comparative study. ACM Computing Surveys, 21:515–556, 1989.
 [2] G. Casella and R. L. Berger. Statistical inference, volume 2. Duxbury Pacific Grove, CA, 2002.
 [3] B. Chen, L. Chen, R. Ramakrishnan, and D. R. Musicant. Learning from aggregate views. In Proc. IEEE ICDE, 2006.
 [4] S. Chen. Cheetah: a high performance, custom data warehouse on top of MapReduce. Proc. VLDB Endowment, 3(12):1459–1468, 2010.
 [5] V. Cheplygina, D. M. J. Tax, and M. Loog. On classification with bags, groups and sets. Pattern Recognition Letters, 59:11–17, 2015.
 [6] H. Elmeleegy, Y. Li, Y. Qi, P. Wilmot, M. Wu, S. Kolay, A. Dasdan, and S. Chen. Overview of turn data management platform for digital advertising. Proc. VLDB Endowment, 6(11):1138–1149, 2013.
 [7] C. Ferri, J. Hernandez-Orallo, and R. Modroiu. An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30:27–38, 2009.
 [8] A. Gunawardana and G. Shani. A survey of accuracy evaluation metrics of recommendation tasks. Journal of Machine Learning Research, 10:2935–2962, 2009.
 [9] A. Jalali, S. Kolay, P. Foldes, and A. Dasdan. Scalable audience reach estimation in realtime online advertising. In Proc. ICDMW, pages 629–637, 2013.
 [10] K.-C. Lee, B. Orten, A. Dasdan, and W. Li. Estimating conversion rate in display advertising from past performance data. In Proc. ACM KDD, pages 768–776, 2012.
 [11] D. R. Musicant, J. M. Christensen, and J. F. Olson. Supervised learning by training on aggregate outputs. In Proc. IEEE ICDM, pages 252–261, 2007.
 [12] C. Parker. An analysis of performance measures for binary classifiers. In Proc. IEEE ICDM, pages 517–526, 2011.
 [13] T. White. Hadoop: The Definitive Guide. O’Reilly Media, Sebastopol, CA, 2012.
 [14] M. H. Williams, C. Perlich, B. Dalessandro, and F. Provost. Pleasing the advertising oracle: Probabilistic prediction from sampled, aggregated ground truth. In Proc. ACM ADKDD, 2014.
 [15] J. Yan, N. Liu, G. Wang, W. Zhang, Y. Jiang, and Z. Chen. How much can behavioral targeting help online advertising? In Proc. ACM WWW, pages 261–270, 2009.
 [16] F. X. Yu, K. Choromanski, S. Kumar, T. Jebara, and S. Chang. On learning from label proportions. arXiv:1402.5902v2, 2015.