1. Introduction
One of the most important tasks in modern machine learning is that of supervised classification
(ng2017talk), whereby a training set $X$, with associated class labels $y$, is used to minimize the value of an objective function $J$ with respect to the data and model parameters $\theta$, such that the trained model is able to reliably assign labels to new, unseen data. Training a model that will generalize to unseen data is a fundamental challenge in supervised learning, and is subject to the bias-variance dilemma
(geman1992neural)(shalev2014understanding). To lower bias, the model needs to be adapted such that the decision boundary is permitted to contort to accommodate more boundary training samples and improve training accuracy, which results in a more complex classification model (see Figure 1(a)). However, in this process noisy or uncertain points may also be accommodated, which can harm generalization and make predictions less accurate against a test set. On the other hand, a less complex, higher-bias model is relatively simple and may exhibit improved generalization (i.e., have a lower variance). One issue with reducing both model bias and variance lies in the trustworthiness of a sample: ideally, an informative sample should be wholly accommodated and a noisy sample should be discounted or discarded completely. Understanding and study of uncertainty has long been an active topic in AI research. We take a data-centered view and consider that noise and uncertainty come from four sources:

Value noise: value noise exists in collected values due to imprecise collection procedures and measurement tools; this is the stochastic noise or aleatoric (inherent to the problem) uncertainty discussed in the literature (der2009aleatory).

Feature uncertainty: failure to include sufficient discriminative features in the collection process can result in overlapping classes that would be separable in the higher-dimensional space that would describe the data if the necessary features were included.

Distribution uncertainty: the number of samples is too small, or the data is biased, and so does not accurately reflect the true data distribution. This uncertainty is often called deterministic noise or epistemic (due to the modeling process) uncertainty in the literature (der2009aleatory).

Label uncertainty: class labels are often conceptual entities, and labeling is performed by human annotators. Even with carefully designed labels and experienced annotators, uncertainty can arise due to different interpretations of labels and samples.
This categorization of noise and uncertainty solely serves to inform our approach; a comprehensive discussion of noise and uncertainty goes beyond the scope of this paper. Here, we focus on a particular type of label uncertainty: that which is tied to the representation of the samples rather than errors which are stochastic in nature (labels that are flipped with a certain probability by a random process). As we will illustrate in the following sections, better understanding and handling of label uncertainty can contribute to the reduction of both model bias and variance.
While label uncertainty can be revealed through multiple inconsistent labels if more than one annotator is used, in practice a sample is often annotated only once due to prohibitive labeling cost. Using a single label for both high- and low-confidence samples obscures the label uncertainty. In this paper we will consider this specific type of heteroscedastic (data-dependent) label uncertainty and provide a method to mitigate its effects on the decision boundary of the classification model. Our main idea is to use a $k$-nearest neighbor-based entropy measure to estimate the degree to which a point's label is uncertain: the label of a point surrounded by points of another class should be treated as less trustworthy, and the point's ability to move the decision boundary reduced, as the position of its representation is suspicious. Points in a highly heterogeneous neighborhood are similarly weighted down, as they are in an area of the representation space that is very chaotic; it is possible the points from two different classes are being projected down an unavailable data axis into the same region of the feature space. To be more specific, our approach to improving both model performance and generalization is to identify points in the dataset that are likely to be "noisy" or "misleading" to the model (terms which will be defined formally in section 3 of this paper) due to labeling uncertainty and to automatically adjust their corresponding sample weights, such that data samples with uncertain labels do not contribute to the loss function to the same extent as informative data samples during training. We derive a sample weight for each point, using a function that calculates a pointwise score based on the entropy and distances of samples within each point's neighborhood, such that a point in a homogeneous neighborhood, with many neighbors of the same class as itself, will be weighted up, and points in heterogeneous neighborhoods will be weighted down as potentially noisy or mislabeled. The main contributions of this paper are the following:

A novel definition of what is meant by an informative training sample with respect to its paired label. Instead of following existing approaches focused on model fine-tuning, normally subject to the bias-variance tradeoff, we directly attack the core of the problem: how to measure the trustworthiness of a sample so we can decide whether a decision boundary should be adjusted to accommodate it. In this way we can improve generalization performance while reducing model variance.

A new $k$-nearest neighbor-based framework for estimating label uncertainty point by point and mapping it to sample weights for use in model training.

When performing experiments, we configure our training environment such that our runs are deterministic for each random seed. In this way we can be sure that variation in performance is due to the variable we intend to observe, the sample weighting, and not to randomness in the learning models.
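This deterministic setup can be sketched as follows (a minimal Python illustration, not our actual training code; a real pipeline would also seed the ML framework in use and disable any nondeterministic kernels):

```python
import random
import numpy as np

def seed_everything(seed: int) -> None:
    """Fix stdlib and NumPy RNG state so repeated runs are identical."""
    random.seed(seed)
    np.random.seed(seed)

def run_once(seed: int) -> np.ndarray:
    """Stand-in for one training run: returns some 'initial weights'."""
    seed_everything(seed)
    return np.random.randn(4)

# Identical seeds give identical runs, so performance differences between
# experiments come from the sample weighting, not from RNG variation.
assert np.array_equal(run_once(7), run_once(7))
assert not np.array_equal(run_once(7), run_once(8))
```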
This paper is organized as follows. In the next section, we discuss related work. We then present our nearest neighborbased label uncertainty measure and describe how to map those values to sample weights. We then describe our experimental procedure and results, and give final concluding remarks.
2. Related Work
The tradeoff of bias and variance has been studied from different perspectives. Geman et al. introduced the dilemma in terms of neural networks (NNs) in (geman1992neural), showing that variance increasing with model capacity makes NN models require an "unrealistic" number of training examples, a number the authors could not have foreseen becoming realistic with today's data collection and storage. Goodfellow et al. presented an updated discussion in (Goodfellowetal2016). We examine the bias-variance relationship in settings where labeling errors are data-dependent, with the probability of a given example having an incorrect label depending on the example's representation. This setting could arise in human activity recognition, for example, when trying to classify walking and running: examples along a model's decision boundary separating the two classes are much more likely to be mislabeled than examples far from it (imagine a model that separated only on a person's average speed over a window of activity; examples near the speed representing the split point between the classes likely have much more uncertainty in their labels than very high- or low-speed examples).
In (liu2016classification), Liu et al. took an approach to handling label noise using importance-based sample reweighting. They worked specifically on two-class label noise, where there is a certain probability that an example of one class will have its label flipped to the other, and a (possibly different) probability for the reverse flip. They used a density ratio estimation method to perform the reweighting. This work and that of Scott et al. (scott2013classification)(blanchard2010semi)(scott2015rate) requires estimation of the two noise rates. The authors of (menon2015learning) offer optimization methods to avoid issues due to binary label noise and give steps to estimate noise from corrupted data.
In (northcutt2017learning), the authors introduce a method called Rank Pruning to treat noise in labels, whereby they train one or more classifiers and then prune the training set of likely false positives and false negatives based upon the confidence rankings of the trained models; they then retrain a new model based upon the cleaned training set. With our method, we are concerned not only with points that are actively misleading to the model (and should be removed entirely, as in the case of label corruption in (northcutt2017learning)), but also with classes whose definitions are more ambiguous, such as those in activity recognition; in these cases, removing points with low model confidence could result in a loss of useful information.
Natarajan et al. (natarajan2013learning) also addressed binary classification with class-specific label noise, offering approaches to modify surrogate loss functions to be robust to it.
A popular avenue of current research, the creation of adversarial examples, studies a form of stochastic label noise that is assumed to be as bad as possible for the model (fawzi2016robustness)(fawzi2018analysis)(goodfellow2014explaining)(gu2014towards).
In (ren2018learning), the authors use sample weighting to improve the performance of deep learning models. They develop a method to calculate sample weights for examples by learning a weighting during optimization, modifying the weights at each step based on how they reduce loss with respect to minibatches drawn from a high-quality validation set. This method can help with label noise because instances with incorrect labels should presumably have very poor agreement with the validation set and therefore be weighted down. In our case, where we are trying to improve generalization along decision boundaries where we expect labeling problems, it would be difficult to obtain a truly clean validation set. Additionally, the authors analyze the method only in stochastic noise settings, where there is a uniform probability of label flipping, or a certain probability with which labels from any class are flipped to a 'background' class, representing the case when human annotators miss a positive example. Other recent methods (goldberger2016training)(jiang2017mentornet) also assume stochastic label noise or corruption that is not data-dependent. We use a $k$-nearest neighbor-based method to calculate our pointwise uncertainty scores, comparing the local self-information of the class of each point with the local self-information of the other classes in the dataset, weighted by the distances from each point to the other points.
$k$-nearest neighbor methods are commonly used in estimators of differential entropy and mutual information over continuous random variables (berrett2016efficient)(gao2018demystifying)(gao2017density). A popular method from (kozachenko1987sample) uses the volume of open $d$-dimensional balls around each point, with radius equal to the distance from the point to its $k$th neighbor, to estimate the pointwise local densities for the available samples, then uses those volumes to compute an estimate of the global entropy. We work in the discrete case, analyzing the local self-information over a finite set of classes, but find the nearest neighbor approach to entropy estimation valuable because it allows us to look at the local label entropy by class from a pointwise perspective. This gives us a measure of the surprise of finding the point's label at its location in the representation space, which we combine with the sparsity of the neighborhood and the local entropy of all classes to assign a score from which we can derive a sample weight.
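As an illustration of the continuous-case estimator, here is a small sketch of the Kozachenko-Leonenko entropy estimate (our own minimal implementation, not code from the cited work); on a uniform sample over [0, 1] the true differential entropy is 0:

```python
import math
import numpy as np

def digamma(n: int) -> float:
    """psi(n) for positive integer n: -gamma + sum_{j=1}^{n-1} 1/j."""
    return -0.5772156649015329 + sum(1.0 / j for j in range(1, n))

def kl_entropy(x: np.ndarray, k: int = 3) -> float:
    """Kozachenko-Leonenko estimate (in nats) of differential entropy,
    built from each point's distance to its k-th nearest neighbor."""
    n, d = x.shape
    # brute-force pairwise distances; inf on the diagonal so a point
    # is never counted as its own neighbor
    dists = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)
    eps = np.sort(dists, axis=1)[:, k - 1]        # distance to k-th neighbor
    # log volume of the d-dimensional unit ball
    log_vd = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    return digamma(n) - digamma(k) + log_vd + d * float(np.mean(np.log(eps)))

rng = np.random.default_rng(0)
est = kl_entropy(rng.uniform(0.0, 1.0, size=(1000, 1)))
assert abs(est) < 0.2   # true entropy of U(0, 1) is 0
```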
In our work, we consider the multiclass case where there is uncertainty in the labeling. We will show how to reduce bias and variance together in the following sections.
3. Overview of model bias and variance and connection with label uncertainty
It is clear that label uncertainty impacts the trustworthiness of a sample, which in turn determines how much the sample should be accommodated when a decision boundary is produced; finding an optimal decision boundary can reduce both bias and variance. In this section we will discuss how we define label uncertainty, then analyze its connection to model bias and variance.
3.1. Uncertain and informative data points
In this paper, we make many references to the ideas of uncertain and informative data points. We use these terms relative to the ground truth of the classification problem and a hypothetical "ideal model" or true labeling function. Any set of training data can be viewed as a set of draws from some unknown joint distribution, with each class representing a marginal distribution over the feature space. Each vector describing one of those samples is a single point in the feature space.
We consider the case where the data collection and labeling process is complete and cannot be revisited. Of course, if more features could be added to each point’s representation, the neighborhoods of the points would be changed and consequently the uncertainty estimation would be different as well. If adding a feature to a point maps it into a homogeneous region instead of a heterogeneous one, treating it as a more certain point is a reasonable approach.
A model is then a function that maps input vectors to class labels: integers that denote which class, which marginal distribution, a data vector was most likely generated from. It may be that a particular data point, say the MNIST (lecun1998gradient) 9 with its top left open enough to resemble a 4 (as illustrated in Figure 1(b)), could have been generated by the process of people writing 9s with one probability and by the process of people writing 4s with another (for the purposes of this discussion, we assume no one writing any of the other digits could ever produce the sample). When we have a label (in an ideal setting), we have the correct answer for which class marginal distribution a sample was generated from, but we do not know whether or not that class's generating process was the one most likely to generate a sample at that point in the feature space.
In the cases where the same point in the feature space could be drawn from more than one class marginal distribution (there is some overlap), an ideal model, a model with perfect knowledge of the probability with which each process will generate a sample at each point in the space, cannot have perfect accuracy. The best that a model can do with such data is to predict the most likely generating class marginal distribution at each point in the domain. If the example digit is generated by the process producing 9s with probability 0.7 and the process producing 4s with probability 0.3, the best possible model can only predict 9 for that image, and be incorrect 30% of the time.
This leads us to a formal definition of informative and uncertain data points: a data point is "informative" if it is of the class most likely to generate a data point at its location in the feature space. It is "uncertain" if it was generated by any other class.
Definition. Let $X$ be a data set composed of $d$-dimensional training samples $x_1, \dots, x_n$, and let $\rho: \mathbb{R}^d \to [0,1]^m$, where $m$ is the number of classes in the dataset, be a function taking any input point $x$ to the distribution over classes representing the probability that each class generated a sample at $x$, with $\sum_{j=1}^{m} \rho(x)_j = 1$. Let $y$ give the observed labels $y_i$ for each $x_i$. $\rho$ in this formulation can be seen as the ground-truth labeling function because it contains all possible knowledge of the labeling behavior of the problem.
Then we say that a sample $x_i$ is an informative point when $y_i = \arg\max_j \rho(x_i)_j$. In other cases, $x_i$ is an uncertain point.
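A tiny numeric illustration of this definition, with a hypothetical ground-truth labeling function (`rho` in the sketch) given explicitly; in practice this function is unknown:

```python
import numpy as np

# Hypothetical rho: row i gives the probability that each class's process
# generated a sample at x_i's location (rows sum to 1).
rho = np.array([[0.7, 0.3],    # e.g. the ambiguous digit: 9 with p=.7, 4 with p=.3
                [0.7, 0.3],
                [0.2, 0.8]])
y = np.array([0, 1, 1])        # observed labels

# informative: the label matches the class most likely to generate the point
informative = y == rho.argmax(axis=1)
assert informative.tolist() == [True, False, True]   # sample 1 is uncertain
```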
3.2. Model bias and variance
When we refer to model bias and variance, we refer to the bias and variance terms in the decomposition of the expected out-of-sample (generalization) error of a classifier, as introduced in (geman1992neural) and represented in many canonical texts including (hastie2005elements) and (AbuMostafa:2012:LD:2207825). A good term-by-term explanation is available in (vijayakumar2007tradeoff).
Take a setting where we have a problem domain $\mathcal{D}$, which consists of a dataset and associated labels given by a true labeling function $\rho$ which encapsulates an element of data-dependent label noise, as above: for a given point $x$, $\rho(x)$ is vector-valued, giving the probability distribution of observing a particular label at $x$, with $\sum_{j=1}^{m} \rho(x)_j = 1$. We can think of the objects subscripted with $\mathcal{D}$ as being population-level; let $D$ refer to drawing a particular $X$ and $y$ from $\mathcal{D}$. This yields an observed set of individual points $x_i$ and associated labels $y_i$, with each $y_i$ drawn from the distribution $\rho(x_i)$. Model bias and variance are decomposed from the expected generalization error taken over such draws. Let $g^{(D)}$ be a model function, trained on the draw $D$, that yields a predicted class. The typical decomposition of the expected prediction error of a model trained on a single draw into bias and variance is expressed as follows, using the squared error loss function (AbuMostafa:2012:LD:2207825). $y(x)$ represents the labels drawn from the distributions $\rho(x)$:
(1) $\quad E_x\!\left[\left(g^{(D)}(x) - y(x)\right)^2\right]$
and we write the expected error over potential observed datasets $D$ from domain $\mathcal{D}$ as
(2) $\quad E_D\!\left[E_x\!\left[\left(g^{(D)}(x) - y(x)\right)^2\right]\right] = E_x\!\left[E_D\!\left[\left(g^{(D)}(x) - y(x)\right)^2\right]\right]$
(AbuMostafa:2012:LD:2207825) observe that $E_D\!\left[g^{(D)}(x)\right]$ is an "average function" over trained models and denote it $\bar{g}(x)$, then derive the model bias and variance:
(3) $\quad \bar{g}(x) = E_D\!\left[g^{(D)}(x)\right]$
Adding in terms summing to 0, $\bar{g}(x) - \bar{g}(x)$,
(4) $\quad E_D\!\left[E_x\!\left[\left(g^{(D)}(x) - y(x)\right)^2\right]\right] = E_x\!\left[\left(\bar{g}(x) - y(x)\right)^2\right] + E_x\!\left[E_D\!\left[\left(g^{(D)}(x) - \bar{g}(x)\right)^2\right]\right]$
where $E_x\!\left[\left(\bar{g}(x) - y(x)\right)^2\right]$ is the bias and $E_x\!\left[E_D\!\left[\left(g^{(D)}(x) - \bar{g}(x)\right)^2\right]\right]$ is the variance. Extension to loss functions beyond squared error and detailed analysis of systemic and variance effects is available in (james2003variance).
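The decomposition can be checked numerically. The following sketch (our own illustration, with a sine curve standing in for the target $y(x)$) trains many low-capacity models on independent draws and verifies that expected error equals bias plus variance over the sampled ensemble:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = lambda x: np.sin(2 * np.pi * x)      # stand-in target y(x)
x_test = np.linspace(0.0, 1.0, 200)

# Train degree-1 polynomial models g^(D) on many independent draws D.
preds = []
for _ in range(300):
    x_tr = rng.uniform(0.0, 1.0, 20)
    y_tr = y_true(x_tr) + rng.normal(0.0, 0.3, 20)   # noisy observed labels
    coef = np.polyfit(x_tr, y_tr, deg=1)
    preds.append(np.polyval(coef, x_test))
preds = np.array(preds)                        # shape: (draws, test points)

g_bar = preds.mean(axis=0)                     # the "average function"
expected_err = ((preds - y_true(x_test)) ** 2).mean()
bias = ((g_bar - y_true(x_test)) ** 2).mean()
variance = ((preds - g_bar) ** 2).mean()
# With g_bar taken as the ensemble mean, the decomposition is exact:
assert np.isclose(expected_err, bias + variance)
```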
3.3. Bias, variance, and label noise
The above shows a bias-variance decomposition for squared error in a setting where $y$ is considered the absolute truth, not a particular draw from a set of data-dependent label distributions $\rho(x_i)$. Here we examine how label noise impacts the bias and variance of models.
Take a model function $g$ trained on data with noisy labels $y$. Let $y^*$ be the ground-truth label vector, the draw from $\rho$ for which each point is assigned its most probable label, $y_i^* = \arg\max_j \rho(x_i)_j$, and let $g^*$ be a model trained on $y^*$ of the same hypothesis class as $g$.
We can split the observed data into two subsets, the informative points $X_I$ and noisy points $X_N$. $X_I$ and $X_N$ are disjoint and their union includes all examples in $X$. Assume $X_N$ is nonempty. In the case where we have a deterministic training process and infinite model capacity in the hypothesis class, $g^*$ will be Bayes optimal but $g$ will be suboptimal: it will incorrectly predict each point in $X_N$, and its bias relative to $g^*$ will be higher.
If we were able to identify which samples were in $X_N$ via some process with perfect confidence, we could remove that subset from the training set and remove the effect of the label noise. Since we assume that any such attempt to identify noisy samples would have its own uncertainty, or that the labels themselves might not be perfectly orthogonal, we instead weight down those examples that we suspect are noisy. A model trained with the noisy vector $y$ but with the noisy points downweighted will have a decision boundary between that of $g$ and that of $g^*$. In this case, since potential decision boundaries do not reach the noisy points to the same extent that the $g$ models' boundaries do, the downweighted models will have lower variance than the $g$ models: the models with the noisy points weighted down have better bias and variance.
Of course, the above does not hold in general, even if we stipulate that the points in $X_N$ can be reliably identified. We assume above that we have deterministic training and infinite capacity; it is easy to imagine a case where a linear decision boundary is pivoted to accommodate a noisy point and the boundary on the far side of the pivot changes the prediction associated with a new area of the feature space from incorrect to correct; the incorrect accommodation of the point in this case would improve the model's performance. Additionally, improved variance does not guarantee better expected generalization error in all cases (james2003variance); a higher-variance model can have a lower variance effect on expected prediction error than a lower-variance model. Mislabeled points could also drag the decision boundary over regions that were previously being predicted incorrectly (possibly even because no data had been observed there), improving the expected prediction error over the whole domain.
With that said, many modern machine learning methods, especially neural networks, have enormous model capacity; two-layer neural networks can approximate arbitrary functions in the infinite-width limit (leshno1993multilayer). We expect that when working with data from real-world distributions, weighting down points with high label uncertainty will yield models with improved decision boundaries in practice.
4. Estimating pointwise label uncertainty
4.1. Requirements for the uncertainty estimation function
Of course, to say for certain which points are uncertain and informative using the above definition (§3.1) would require knowledge of the generating processes or other information that could be difficult or impossible to obtain. Instead, we estimate which points are likely to be uncertain by examining each sample's neighborhood within the available dataset, defined by a parameter $k$, the neighborhood size (by number of neighbors, including the point itself). We define a scoring function to assign a value to each point based on the entropy of observed classes within its neighborhood and the relative sparsity of the neighborhood, with the intention that the value is indicative of the uncertainty of the point's label. The score should have the following properties:

A sample should have score 0 when all neighbors are of the same class as the sample.

Examples in highly heterogeneous neighborhoods (i.e., neighborhoods with a high number of classes present) should have higher scores than points in homogeneous neighborhoods consisting mostly of their own class, but lower scores than points in homogeneous neighborhoods consisting mostly of points of another class.

Examples in relatively dense neighborhoods should have higher scores than points in relatively sparse neighborhoods, with label composition held constant.
The intuition for these requirements follows from our goal to use only the information contained in the dataset, i.e., the neighborhood of each sample, to estimate its label uncertainty. If all other points in a given point's neighborhood are of the same class as the point, we choose to trust its label; its neighborhood score should be 0, indicating no uncertainty. If a point is in a dense region of the feature space and its neighbors are all of another class, we should be highly suspicious of its label as being potentially incorrect. The second and third requirements follow from how we think of noisy regions. Points with highly diverse labels in their neighborhood, especially a dense neighborhood, are less likely to be of the class most likely to generate a sample at that point in the domain, because the presence of many classes in the same neighborhood indicates that several class processes could generate samples in that region; consequently, the model should put less weight on such samples when drawing the decision boundary. Performance gained from adjusting the boundary to accommodate those points is unlikely to generalize because the region is chaotic. Both informative and uncertain samples near class boundaries will have nonzero scores, as they will have neighbors with different labels.
4.2. Incorporating Neighborhood Uncertainty Scores into the Loss Function for Classification
After a neighborhood is analyzed from the view of label uncertainty, we need to take a further step to perform classification. Uncertainty scores are converted to sample weights via a logistic mapping function and incorporated into the objective function optimized during model training. Let $J(X; \theta)$ denote the objective function without sample weighting, where $X$ is the set of all data points and $\theta$ represents the model parameters. If $s$ is the length-$n$ vector containing the neighborhood scores for each $x_i$, and $f$ is the logistic mapping function taking neighborhood scores to sample weights, then our objective function becomes:
(5) $\quad J_w(X; \theta) = \sum_{i=1}^{n} f(s)_i \, \ell(x_i, y_i; \theta)$, where $\ell$ is the per-sample loss with $J(X; \theta) = \sum_{i=1}^{n} \ell(x_i, y_i; \theta)$
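As a concrete sketch (our own minimal illustration; any per-sample loss works the same way), the weighted objective simply scales each point's loss contribution by its weight:

```python
import numpy as np

def weighted_nll(probs: np.ndarray, y: np.ndarray, weights: np.ndarray) -> float:
    """Sum of per-sample negative log-likelihoods, each scaled by its weight."""
    per_sample = -np.log(probs[np.arange(len(y)), y])
    return float(np.sum(weights * per_sample))

probs = np.array([[0.9, 0.1],
                  [0.4, 0.6],
                  [0.2, 0.8]])   # predicted class probabilities
y = np.array([0, 0, 1])          # observed labels; the middle point looks suspect

uniform = weighted_nll(probs, y, np.ones(3))
downweighted = weighted_nll(probs, y, np.array([1.0, 0.25, 1.0]))
assert downweighted < uniform    # the suspicious point now contributes less
```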
4.3. Calculation of Neighborhood Uncertainty Scores
We calculate the score $S(x_i)$ for a sample $x_i$ with label $c$ as follows:
(6) $\quad S(x_i) = \dfrac{\bar{d}_c^{\,-1}\left(-\log \frac{k_c}{k}\right)}{\sum_{l=1}^{m} \bar{d}_l^{\,-1}\left(-\frac{k_l}{k}\log \frac{k_l}{k}\right)}$
where $m$ is, as before, the number of classes in the dataset, $k$ is the number of neighbors that we consider for each sample, $k_c$ is the number of neighbors with the same label as $x_i$, and $k_l$ is the number of neighbors with class label $l$. $\bar{d}_l$ is the average of the normalized distances to the neighbors with class label $l$ (so $\bar{d}_c$ averages the distances to the neighbors sharing $x_i$'s label). The normalization is performed by setting the distance to $x_i$'s nearest neighbor to be 1, and scaling the distances to the other neighbors based on that value. The terms $\bar{d}_c^{\,-1}$ in the numerator and $\bar{d}_l^{\,-1}$ in the denominator are included to weight the class self-information in the formula by the inverse average distance to the neighbors of that class (to reduce the influence of faraway points). The denominator of this formula equals 0 if all neighbors of the sample have label $c$. In these cases, we define the value of $S(x_i)$ to equal 0.
For a point $x_i$, this calculation compares the self-information of $x_i$'s class within the neighborhood, $-\log \frac{k_c}{k}$, to the expected entropy of labels in the neighborhood. This meets our established requirements:

When all neighbors have the same label, the entropy in the denominator of equation (6) is 0, and $S(x_i)$ is defined to be 0.

In a neighborhood that is highly heterogeneous, the total number of points with each label is similar (if one class label had many more points than the others, the neighborhood would not be highly heterogeneous). Therefore, the term for the label of point $x_i$, $-\log \frac{k_c}{k}$, is close to the expected entropy over all labels, represented in the denominator of equation (6). Additionally, inverse average distances are similar over all classes in such regions, so the neighborhood score is close to 1 for each point in the neighborhood, giving us the desired effect.

By weighting the terms corresponding to each class $l$ by the inverse average distance from $x_i$ to its neighbors in class $l$, we reduce the effect of sparsely represented classes in the neighborhood and increase that of denser classes.
Example values can be found in Figure 3.
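The score can be sketched as follows. This follows our reading of equation (6) and the properties above (zero in fully homogeneous neighborhoods, near 1 in balanced heterogeneous ones); exact constants may differ from the experimental implementation:

```python
import numpy as np

def neighborhood_score(labels: np.ndarray, dists: np.ndarray,
                       own_label: int, m: int) -> float:
    """Self-information of the point's own label over the distance-weighted
    entropy of all labels among its k neighbors. Distances are normalized so
    the nearest neighbor is at distance 1. The neighborhood includes the
    point itself, so own_label always appears among `labels`."""
    k = len(labels)
    dists = dists / dists.min()                     # nearest neighbor -> 1
    num, denom = 0.0, 0.0
    for c in range(m):
        mask = labels == c
        k_c = int(mask.sum())
        if k_c == 0:
            continue
        inv_avg = 1.0 / float(dists[mask].mean())   # inverse average distance
        denom += inv_avg * (-(k_c / k) * np.log(k_c / k))
        if c == own_label:
            num = inv_avg * (-np.log(k_c / k))
    return 0.0 if denom == 0.0 else num / denom     # homogeneous: defined as 0

# fully homogeneous neighborhood -> score 0
assert neighborhood_score(np.zeros(6, dtype=int), np.ones(6), 0, 2) == 0.0
# balanced two-class neighborhood at equal distances -> score 1
s = neighborhood_score(np.array([0, 0, 0, 1, 1, 1]), np.ones(6), 0, 2)
assert np.isclose(s, 1.0)
```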
4.4. Mapping neighborhood scores to sample weights for classification
There are potentially many ways to map neighborhood scores to sample weights. The neighborhood scoring function has a minimum value of 0 (for a point in a fully homogeneous neighborhood).
The logistic function (and especially the sigmoid function, a special case of the logistic) is commonly used in machine learning as a 'squashing' function to ensure output values fall in a certain range (Gershenfeld:1999:NMM:299882); we use a negative logistic function to map neighborhood scores into the desired range for sample weights, because it allows us to generate high weights (> 1.0) on points with a low neighborhood score and low weights (< 1.0) on points with a high score. This is exactly the behavior we want: points with low uncertainty are weighted up, and points with high uncertainty are weighted down. See the experiments section for further examination of this relationship. The hyperparameters that define this function do need to be tuned based on data. We find that logistic functions of a form similar to the following fit our requirements:
(7) $\quad f(s) = \dfrac{a}{1 + e^{\beta (s - s_0)}} + c$
Here, $s_0$ controls the value of the neighborhood score on which the logistic function is centered. We find empirically that the median of the nonzero scores is a good initial value for this parameter, and tends not to needlessly downweight useful samples by considering too many of them to be uncertain. $\beta$ controls the steepness of the logistic curve, the "hardness" of the threshold that separates an informative point that is upweighted from an uncertain point that is downweighted. $a$ and $c$ take the score values and map them to values in the range $(c, a + c)$, such that low score values (near 0) are mapped to nearly $a + c$, and high values are mapped to nearly $c$.
We find empirically that the sample weights should fall in a bounded range above zero; allowing weights to go to 0 effectively shrinks the available data set, reducing performance, while upweighting samples beyond 2.0 tends to overfit those samples when applied to our dataset.
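A sketch of such a mapping follows; the parameter values here are illustrative assumptions chosen to land weights in (0.25, 2.0), not the tuned values from our experiments:

```python
import numpy as np

def score_to_weight(s, s0: float, beta: float, a: float = 1.75, c: float = 0.25):
    """Negative logistic map from neighborhood score to sample weight.
    s0: center score (e.g. the median nonzero score); beta: steepness;
    a, c: squash outputs into the open range (c, a + c)."""
    return a / (1.0 + np.exp(beta * (np.asarray(s, dtype=float) - s0))) + c

w = score_to_weight([0.0, 1.0, 5.0], s0=1.0, beta=4.0)
assert w[0] > 1.0                          # low-uncertainty point weighted up
assert np.isclose(w[1], 1.75 / 2 + 0.25)   # the center score maps to the midpoint
assert w[2] < 0.5                          # high-uncertainty point weighted down
```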
Top 5 Weight Combinations by Average Improvement Over Baseline (%)

| Group Assignments | G0 / G1 / G2 | Avg Over Baseline | Group Assignments | G0 / G1 / G2 | Avg Over Baseline |
|---|---|---|---|---|---|
| NB Score | 1.5 / 0.6 / 0.25 | +0.92 | Random | 2.0 / 0.6 / 1.0 | +0.74 |
| NB Score | 2.0 / 0.6 / 0.25 | +0.91 | Random | 1.5 / 0.25 / 0.25 | +0.71 |
| NB Score | 0.25 / 1.0 / 0.6 | +0.85 | Random | 0.6 / 0.25 / 0.25 | +0.69 |
| NB Score | 0.25 / 0.6 / 0.25 | +0.84 | Random | 1.5 / 0.25 / 0.6 | +0.68 |
| NB Score | 0.6 / 1.5 / 0.25 | +0.77 | Random | 1.5 / 0.25 / 2.0 | +0.68 |

Bottom 5 Weight Combinations by Average Performance Reduction from Baseline (%)

| Group Assignments | G0 / G1 / G2 | Avg Under Baseline | Group Assignments | G0 / G1 / G2 | Avg Under Baseline |
|---|---|---|---|---|---|
| NB Score | 0.6 / 0.6 / 2.0 | 3.7 | Random | 0.6 / 0.6 / 0.6 | 0.82 |
| NB Score | 0.25 / 2.0 / 2.0 | 3.7 | Random | 0.25 / 2.0 / 0.6 | 0.86 |
| NB Score | 0.25 / 0.6 / 2.0 | 4.5 | Random | 1.5 / 2.0 / 1.5 | 0.87 |
| NB Score | 0.25 / 0.25 / 1.5 | 5.4 | Random | 0.25 / 1.5 / 1.5 | 1.1 |
| NB Score | 0.25 / 0.25 / 2.0 | 8.6 | Random | 0.6 / 0.25 / 1.5 | 1.9 |
5. Experiments
5.1. Case Study 1: Classification of a Real-World Physical Activity Dataset
Objective and accurate measurement of physical activity is a critical requirement for a better understanding of the relationship between sedentary behaviors, physical activity, and health (crouter2015estimating)(mu2014bipart). We evaluate our method on a physical activity recognition dataset collected from hip-mounted, triaxial accelerometers worn by a cohort of 184 child participants. There were 98 male subjects from ages 8 to 15, and 86 female subjects from ages 8 to 14. Each subject was observed for a period of lying rest with median 17 minutes (maximum 30 minutes), and for a median of 4 minutes for each other activity (maximum 10 minutes). Researchers observed each activity and recorded the activity performed and the start and end times of each bout, so the data features ground-truth segmentation. We split each bout into discrete 12-second windows of activity described with the output of a single triaxial accelerometer running at 1 Hz, resulting in 36 features per sample. We have 11,543 samples, and calculate the neighborhood scores using a fixed neighborhood size $k$ and cosine similarity.
As in many real-world applications, the labeling process is difficult and comes with significant uncertainty. Labeling is performed based on both in-person and video observation, and classes are often difficult to distinguish. There are 5 classes in our analysis: sedentary, light household and games, moderate-vigorous household and sports, walking, and running. These are superclasses of the full label set, which consists of Computer Games, Reading, Light Cleaning, Sweeping, Brisk Track Walking, Slow Track Walking, Track Running, Walking Course, Playing Catch, Wall Ball, and Workout Video. Even among the superclasses, there are typically samples from different classes that appear to be very similar (e.g., walking across the house during a "light household" sample and walking across a basketball court during a "sports" sample). Using the even more fine-grained labeling would introduce more noise and drastically reduce the amount of data available per class, making it impractical.
5.1.1. Grid search validation for neighborhood scoring function
In our first set of experiments we aim to validate our scoring function, showing that the samples with high scores are in fact the uncertain samples and that weighting them down improves performance. To do this, we calculated the neighborhood score for each sample in the training set, and assigned those samples into groups. First, we put all samples for which $S(x_i) = 0$ into group $G_0$; a score of 0 indicates that a point's entire neighborhood is from the same class as itself. This is the zero-uncertainty group: our method considers points in fully homogeneous regions to have no label uncertainty. We then take all the remaining points, sort them by score, and divide them in half such that we have two more groups: $G_1$, those points with low but nonzero scores, and $G_2$, those points with the highest scores. Points that are close to decision boundaries but are not uncertain (by the definition in §3.1) should have low, nonzero scores, and points that are misleading should have high scores, so we aim to separate points around class boundaries into informative and uncertain points with this split. We perform this process to test whether there is an advantage to downweighting, leaving unchanged, or upweighting the three groups split by estimated uncertainty; we need to make sure that our intuition to upweight samples with low uncertainty and downweight samples with high uncertainty holds in practice. We refer to experiments performed on data split this way as neighborhood- or NB-weighted experiments. Of course, no test examples are ever used in the calculation of the uncertainty scores; including those examples would leak information about the test set into the training process whenever a test example was in the neighborhood of a training example.
For comparison we create a second split into three groups (such that the sizes of the groups correspond to the sizes of the NB-split groups) but assign samples to groups randomly. We refer to experiments with these splits as having been run with "random assignment" groups. We do this to make sure that the results we observe are due to our method and not to chance. By running the whole suite of experiments a second time on randomly split groups, we can observe the distribution of results from the random assignments to see what variation in performance we should expect due to randomness. We can then compare our NB-weighted results against this distribution to make sure they are significant.
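A size-matched random control split can be built by shuffling the NB group labels, which preserves the group sizes exactly while decoupling membership from the neighborhood scores. This is a minimal sketch; the function name is illustrative.

```python
import random
from collections import Counter
from typing import List, Sequence

def random_matched_groups(nb_groups: Sequence[int], seed: int = 0) -> List[int]:
    """Control split: shuffle the NB group labels so each random group
    has exactly the same size as the corresponding NB group, but
    membership is independent of the neighborhood scores."""
    labels = list(nb_groups)
    random.Random(seed).shuffle(labels)
    return labels
```

Because only the assignment is permuted, `Counter(random_matched_groups(g))` always equals `Counter(g)`.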
We perform a grid search over every possible assignment of sample weights to groups, using five discrete weights chosen to cover a range of weighting options without leaving large gaps: 0.25, 0.6, 1.0, 1.5, and 2.0. These values are chosen as proxies for the possible ways to adjust the sample weights for a group: strongly down-weight, somewhat down-weight, no adjustment, somewhat up-weight, and strongly up-weight.
This experiment is intended to show two main points: 1) that the score values capture useful information about the data's feature space, and 2) that our interpretation of the score values is consistent with observed performance differences in the grid search; i.e., that weighting up the zero-score group (those points we identify as having no label uncertainty) and weighting down the high-score group (those points we identify as having uncertain labels) outperformed other sample weight-group assignments. The five weights were assigned to the three sample groups, for both assignment schemes, in all possible combinations. Each combination was run 10 times and the results (as measured against a fixed held-out test set) were averaged, for a total of 2,500 model runs ($5^3 = 125$ combinations of weight assignments, 10 runs per assignment, 2 experimental conditions per combination-assignment = 2,500 runs).
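The run-count bookkeeping above can be reproduced in a few lines; the constant names here are illustrative.

```python
from itertools import product

WEIGHTS = (0.25, 0.6, 1.0, 1.5, 2.0)  # the five discrete grid-search weights
RUNS_PER_ASSIGNMENT = 10              # averaged runs per weight assignment
CONDITIONS = 2                        # NB-weighted groups vs. random assignment

# One weight per group (G0, G1, G2): all combinations with repetition.
assignments = list(product(WEIGHTS, repeat=3))

total_runs = len(assignments) * RUNS_PER_ASSIGNMENT * CONDITIONS
```

This yields 125 weight assignments and 2,500 total model runs, matching the figures in the text.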
Training was performed using the Keras library (chollet2015keras) with the Theano backend, using single-threaded CPU computation only. This step was taken to remove nondeterminism introduced by multithreaded CPU context switching and by cuDNN. With these settings, a run with fixed sample weights and a fixed random seed is deterministic, and will finish training with exactly the same result each time. Ten random seeds were generated once at the beginning of the experiment, and the same ten seeds were used for every combination of sample weight assignments, to reduce the chance of particular combinations showing stronger performance merely due to a lucky set of initializations in weight space. We use a simple two-layer multi-layer perceptron architecture to keep the running time reasonable.
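The seed-reuse scheme can be sketched as follows: the per-run seeds are drawn once, up front, and the same list is reused for every weight-assignment combination. The master seed value and function name are illustrative; the paper does not state them.

```python
import random
from typing import List

def make_experiment_seeds(n: int = 10, master_seed: int = 42) -> List[int]:
    """Generate the per-run seeds once; every weight-assignment combination
    then reuses this same list, so differences between combinations cannot
    be explained by luckier initializations.
    (master_seed and the seed range are assumptions, not from the paper.)"""
    rng = random.Random(master_seed)
    return [rng.randrange(100_000) for _ in range(n)]

seeds = make_experiment_seeds()
```

Because the generator is seeded, repeated calls return the identical seed list, which is exactly the property the experiment relies on.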
The results of this process are shown in Table 1. Reported figures represent the change in model performance under various weighting configurations compared to a baseline model trained with no sample weighting (all weights = 1.0). The performance of the baseline model was 83.4%, averaged over 10 runs. While there is no discernible pattern in the random results, as we would expect, there is a clear pattern in the results when nearest-neighbor-based weighting is used: performance is strong when the most uncertain points (group $G_2$) are weighted down ($G_2$ is weighted down in all 5 of the top combinations), and performance is weak when the uncertain points are weighted up ($G_2$ is weighted up in all 5 of the bottom combinations). All 5 of the best neighborhood-score combinations are better than the best combination under random assignment; all 5 of the worst weight combinations perform worse than the worst run under the random setting. This validates our interpretation of the neighborhood scoring function: weighting down the points with the highest scores improves model performance over the baseline, while weighting up high-scoring points increases the model's focus on points with uncertain labels and decreases accuracy.
5.1.2. Evaluating nearest neighbor weighting against baseline for activity recognition
Our second round of experiments takes the best k-nearest-neighbor weighting model, with weights (1.5 / 0.6 / 0.25), uses equation (3) to create a continuous mapping function from scores to weights, and then measures the performance of models trained with these weights against baseline models (where all sample weights = 1.0) more thoroughly. We choose 1,000 random seeds from the integers in the interval [0, 100,000) and run an NB-weighted model and a baseline model for each one, under the same CPU-based computation conditions as the models from the grid search, so that any difference in performance is directly attributable to the difference in the weighting scheme. This yields 1,000 NB-weighted models and 1,000 baseline models. We summarize the results in Figure 4. The histogram on the left of Figure 4 shows the distribution of trained models by performance, with better-performing models to the right. The absolute counts are provided in the table to the right of the histogram. We can see that the models using k-nearest-neighbor weighting both have improved performance on average (are further right) and have lower variance (are more tightly clustered in the histogram). The NB-weighted models are on average +0.534% more accurate than the baseline models, and their results show greatly reduced variance and standard deviation compared to those of the baseline models. Note that this is the variance of the model results, a different mathematical quantity from the model variance defined in §3; that variance is measured over models each trained on a different dataset sampled from a particular data domain. The variance calculated here is a variance over the model training process using a single dataset.
6. Conclusion and future work
With an eye on the bias-variance dilemma, we formulated a nearest-neighbor-based method to estimate pointwise uncertainty in labeling and mitigate its effects by weighting down the samples in areas of the feature space with high density and high label entropy. By working on the fundamental issue of the bias-variance dilemma (i.e., whether a decision boundary should accommodate a sample point according to the trustworthiness of that sample), we show improved model bias and variance in a real-world application. Using a neural network architecture to classify accelerometer data for activity recognition, we improve performance in a real-world domain where accurate, consistent labeling is very difficult. In future work, we hope to improve the method we use for estimating label uncertainty, with the aim of obviating the need to calculate a full nearest-neighbor distance matrix.