1 Introduction
Most research is focused on revising the performance of classification architectures, optimization algorithms, and loss functions, but hardly any work addresses the adaptability of the decision threshold that separates the boundary between classes. Although there are myriad state-of-the-art models for extracting discriminative features, it is still common to choose a classification threshold by the hit-and-trial method. In the author's experience, this works reasonably well for a handful of feature vectors in the database, but it becomes ineffective as the database grows. If the chosen threshold is futile, then no matter how accurate the model is, its state-of-the-art performance goes in vain. Conversely, if the threshold is chosen wisely and updated iteratively as soon as a new feature vector is registered in the database, even a less accurate model will do a better job. In this project, an optimization-based statistical feature learning algorithm has been developed to boost the performance of any recognition and re-identification model through its decision boundary.
A decision threshold is a numerical value that dichotomizes different classes. Different thresholds yield different numbers of true/false positives and true/false negatives, and consequently different precision, recall, and F1-score for a given dataset. In this project, the decision threshold is selected to maximize the F1-score (the harmonic mean of precision and recall). For instance, in a cancer diagnosis problem, it acts as the cutoff for classifying an observation as either 0 (no cancer) or 1 (has cancer). Choosing an optimal threshold value is a challenging task because it is case-specific, i.e., it differs across objectives and datasets. In verification and identification tasks such as face recognition and person re-identification, every new identity registered in the identity database raises the need to retune the threshold. This threshold acts not only as the deciding factor for verification and identification but also as the gatekeeper for updating the identities in the database.
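The F1-maximizing threshold selection described above can be sketched as a simple grid search (a minimal NumPy sketch; the function name, the candidate grid, and the small epsilon guard are illustrative choices, not the paper's exact procedure):

```python
import numpy as np

def best_f1_threshold(scores, labels, candidates=None):
    """Pick the decision threshold that maximizes the F1-score.

    scores: similarity/confidence values in [0, 1]; labels: 1 = positive class.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    if candidates is None:
        candidates = np.linspace(0.0, 1.0, 101)  # coarse grid over [0, 1]
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        pred = (scores >= t).astype(int)
        tp = int(np.sum((pred == 1) & (labels == 1)))
        fp = int(np.sum((pred == 1) & (labels == 0)))
        fn = int(np.sum((pred == 0) & (labels == 1)))
        precision = tp / (tp + fp + 1e-9)   # eps avoids division by zero
        recall = tp / (tp + fn + 1e-9)
        f1 = 2 * precision * recall / (precision + recall + 1e-9)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

For well-separated scores, any threshold between the two clusters attains the maximal F1-score; the sweep simply returns the first such candidate.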
2 Related work
Not much work has yet been done on selecting the classifier threshold adaptively; however, Chou et al. (2018) [Chou] developed an online face registration mechanism with a distinctive threshold per face. With this mechanism, they achieved a 22% accuracy improvement on the LFW dataset compared to the common method of keeping the threshold at a fixed value throughout inference. In brief, they proposed an adaptive thresholding technique that assigns a specific threshold to each registered face in the database, which is adapted accordingly as new entries arrive. However, this method is limited to the evaluation task and is not favorable for real-time inference.
The receiver operating characteristic (ROC) curve plots the true-positive rate against the false-positive rate at particular thresholds. A diagonal line indicates that the model performs no better than random guessing on the test dataset; points above the diagonal indicate thresholds with higher model accuracy, and points below it indicate poorer accuracy. The ROC curve helps one understand the trade-off between the true-positive rate and the false-positive rate at different threshold values. The area under the ROC curve (AUC) depicts the model's overall performance.
Zou et al. (2016) [Zou2016] analyzed the drawbacks of using the ROC curve as the sole measure for selecting the threshold in imbalanced classification. They proposed a novel framework, a sampling-based threshold auto-tuning method, for finding the best classification threshold, yielding an improvement over the conventional choice of a default threshold of 0.5. Though the ROC curve and AUC reflect the ranking power of the positive prediction probability, Zou et al. claim that classifier performance, including precision, recall, and F1-score, might not be perfect even though the AUC value exceeds 0.9. They employed the F1-score and AUC for Liao's protein remote homology detection, tuning the classification threshold for the best F1-score. Similarly, Lipton et al. (2014) [Lipton2014] proposed optimal classification threshold selection through maximization of the F1-score for binary and multi-label classification problems, claiming an improvement in predicting 26,853 labels of Medline documents compared to the traditional method. Al Hartmann [hartmann2009] filed a patent on an adaptive threshold for detecting spam email messages based on the ratio between clean and spam emails received in previous time periods and the misclassification cost ratio. Bauer et al. (2015) [Bauer2015] applied a Bayesian model of neurofeedback and reinforcement learning to evaluate the impact of an adaptive classification threshold on optimizing restorative brain-computer interfaces (BCI). They claim that threshold adaptation proved superior to any fixed threshold throughout their experimental results.
3 Methodology
For statistical feature learning of a probability density function, different metrics, such as cosine distance, cosine similarity, and Euclidean distance, as given in equations 1-3, are used; nonetheless, the similarity metric seems to fit our objective problem best. Cosine similarity is a metric for measuring the closeness of two vectors in an inner-product space. In other words, it is the cosine of the angle between the two vectors projected in a multidimensional space, i.e., the inner (dot) product of the two vectors after both are normalized to unit length. The similarity lies between 0 and 1, where 0 means entirely dissimilar and 1 means completely similar; the higher the value, the higher the similarity between the vectors. Similarly, Euclidean distance is a metric measuring the length between two points in Euclidean space. The significance of all these metrics, especially in the machine learning arena, is to compute how similar or dissimilar feature vectors are.

In this project, since there are multiple feature vectors for the same identity (i.e., face or person), vectorial similarities and distances are organized into auto and cross categories, inspired by the auto-/cross-correlation terminology. Similarities and distances between feature vectors of the same identity (denoted auto) and of different identities (denoted cross; identities here are faces, persons, or any other subject of interest) are computed simultaneously. Figures 1 and 2 show all the possible pairings of identities used to compute the auto and cross distance/similarity between the query and database identities. While finding an adaptive threshold, the query embeddings are cloned from the gallery embeddings to find the similarity among the same (auto-similarity) and different (cross-similarity) identities. Hence, during query mode, the query and gallery will hold different images or embeddings, whereas during optimal-threshold search mode, the query embeddings are an exact copy of the database embeddings. Different statistical features, such as minima, maxima, mean, standard deviation, and variance, can be drawn from the auto and cross distance/similarity distributions. From such statistical information, the adaptive threshold is adjusted whenever an identity is added to or deleted from the database. Since calculating statistical measures along with Euclidean distances or cosine similarities between feature vectors drawn from a large pool of distributions is computationally expensive, the adaptive threshold adjustment can instead be performed periodically, once the number of newly registered identities exceeds a specified count.
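The auto/cross split described above, with the query set cloned from the gallery, can be sketched as follows (a minimal NumPy sketch; the function name and the returned statistics dictionary are illustrative, not the paper's API):

```python
import numpy as np

def auto_cross_similarities(embeddings, identities):
    """Split pairwise cosine similarities into auto (same identity)
    and cross (different identity) distributions; the query set is
    treated as an exact copy of the gallery."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = E @ E.T                                    # all query-gallery pairs
    ids = np.asarray(identities)
    same = ids[:, None] == ids[None, :]
    off_diag = ~np.eye(len(ids), dtype=bool)          # drop trivial self-pairs
    auto = sims[same & off_diag]
    cross = sims[~same]
    stats = lambda x: {"min": x.min(), "max": x.max(),
                       "mean": x.mean(), "std": x.std()}
    return stats(auto), stats(cross)
```

The summary statistics (min, max, mean, std) of the two distributions are exactly the quantities later used to estimate the Gaussian functions and the adaptive threshold.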
(1) $d_{\cos}(\mathbf{a}, \mathbf{b}) = 1 - \dfrac{\mathbf{a} \cdot \mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}$

(2) $s_{\cos}(\mathbf{a}, \mathbf{b}) = \dfrac{\mathbf{a} \cdot \mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}$

(3) $d_{E}(\mathbf{a}, \mathbf{b}) = \lVert\mathbf{a} - \mathbf{b}\rVert_{2}$
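The three metrics listed in equations 1-3 can be written as small NumPy helpers (a sketch; the function names are illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    # eq. (2): inner product of the two vectors, both normalized to unit length
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    # eq. (1): complement of the cosine similarity
    return 1.0 - cosine_similarity(a, b)

def euclidean_distance(a, b):
    # eq. (3): L2 norm of the difference vector
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.linalg.norm(a - b))
```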
Referring to figure 3, once the auto and cross similarity distributions are obtained, their probabilities of occurrence are depicted in a histogram. With the mean and standard deviation of the auto and cross probability distributions, a Gaussian function is estimated for each of them, as shown in equation 4. The main reason for estimating the auto and cross Gaussian distribution functions is to choose a value between their means such that neither distribution dominates or biases the choice of threshold. We could instead take the average of the auto and cross means directly, without estimating their normal distributions, but doing so might unnecessarily shift the threshold towards the curve with the higher peak. To find the point of intersection of the estimated auto and cross Gaussian distribution functions, we equate and solve them, as shown in the quadratic equation 5.

(4) $f_{a}(x) = \dfrac{1}{\sigma_{a}\sqrt{2\pi}}\, e^{-\frac{(x-\mu_{a})^{2}}{2\sigma_{a}^{2}}}, \qquad f_{c}(x) = \dfrac{1}{\sigma_{c}\sqrt{2\pi}}\, e^{-\frac{(x-\mu_{c})^{2}}{2\sigma_{c}^{2}}}$

where $f_{a}$ is the auto Gaussian function, $f_{c}$ is the cross Gaussian function, and $\mu$, $\sigma$, and $x_{i}$ denote the mean, the standard deviation, and the x-coordinate of the intersection point of the auto and cross Gaussian distribution functions, respectively.
Equating the two functions in equation 4 and solving, the result can be arranged in the condensed quadratic form $a x_{i}^{2} + b x_{i} + c = 0$, whose coefficients are obtained as shown in equation 5.

(5) $a = \sigma_{a}^{2} - \sigma_{c}^{2}, \qquad b = 2\left(\sigma_{c}^{2}\mu_{a} - \sigma_{a}^{2}\mu_{c}\right), \qquad c = \sigma_{a}^{2}\mu_{c}^{2} - \sigma_{c}^{2}\mu_{a}^{2} + 2\sigma_{a}^{2}\sigma_{c}^{2}\ln\dfrac{\sigma_{c}}{\sigma_{a}}$

where $\sigma_{a}$, $\sigma_{c}$ and $\sigma_{a}^{2}$, $\sigma_{c}^{2}$ are the standard deviations and variances of the auto and cross Gaussian distribution functions. Since the similarity lies in $[0, 1]$, we can ignore roots lying outside this bound; if both roots lie inside it, select either root, evaluate the model, and keep the root giving the higher model accuracy. If the point of intersection lies between $\mu_{a}$ and $\mu_{c}$, we proceed to compute the model accuracy; otherwise, we take the average of the two means, as shown in equation 6.

(6) $x_{i} = \dfrac{\mu_{a} + \mu_{c}}{2}$
We used different performance metrics, such as precision, recall, F1-score, and accuracy, for model evaluation. Their computation requires the counts of true/false positives and negatives at a given threshold $\tau$, calculated as shown in equation 7.

(7) $\mathrm{TP} = \#\{\text{correct positive predictions, max. similarity} \ge \tau\}$, $\quad \mathrm{TN} = \#\{\text{correct negative predictions, max. similarity} < \tau\}$, $\quad \mathrm{FP} = \#\{\text{incorrect positive predictions, max. similarity} \ge \tau\}$, $\quad \mathrm{FN} = \#\{\text{incorrect negative predictions, max. similarity} < \tau\}$

where
$\mathrm{TP}$ is the total number of times the model correctly predicts the positive class, with the maximum similarity between feature vectors greater than or equal to the threshold $\tau$;
$\mathrm{TN}$ is the total number of times the model correctly predicts the negative class, with the maximum similarity between feature vectors less than the threshold $\tau$;
$\mathrm{FP}$ is the total number of times the model incorrectly predicts the positive class, with the maximum similarity between feature vectors greater than or equal to the threshold $\tau$;
$\mathrm{FN}$ is the total number of times the model incorrectly predicts the negative class, with the maximum similarity less than the threshold $\tau$.
Using the counts of true/false positives and negatives at a specified threshold $\tau$ from equation 7, the model precision, recall, F1-score, and overall accuracy can be computed as shown in equation 8.

(8) $\mathrm{precision} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \quad \mathrm{recall} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad F_{1} = \dfrac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \quad \mathrm{accuracy} = \dfrac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$
Similarly, the true-positive rate (TPR) and false-positive rate (FPR) for plotting the ROC curve are computed from the same counts of true/false positives and negatives resulting from the model evaluation at the given threshold $\tau$, as shown in equation 9.

(9) $\mathrm{TPR} = \dfrac{\mathrm{TP}}{P + \epsilon}, \qquad \mathrm{FPR} = \dfrac{\mathrm{FP}}{N + \epsilon}$

where $P = \mathrm{TP} + \mathrm{FN}$ is the cumulative count of all positives, $N = \mathrm{FP} + \mathrm{TN}$ is the cumulative count of all negatives, and $\epsilon$ is added to prevent a possible division-by-zero error.
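Equations 8 and 9 can be combined into one small helper (a sketch; the dictionary layout and the `eps` default are illustrative):

```python
def eval_metrics(tp, tn, fp, fn, eps=1e-9):
    """Precision, recall, F1, accuracy (eq. 8) and TPR/FPR for the
    ROC curve (eq. 9); eps guards against division by zero."""
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    accuracy = (tp + tn) / (tp + tn + fp + fn + eps)
    tpr = tp / (tp + fn + eps)      # P = TP + FN
    fpr = fp / (fp + tn + eps)      # N = FP + TN
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "tpr": tpr, "fpr": fpr}
```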
If the model accuracy at the specified threshold exceeds the targeted value, we can select it as the optimal threshold and proceed with model inference; otherwise, we search for the optimal value with the objective of maximizing the F1-score, and select the final working threshold as shown in equation 10.

(10) $\tau = \begin{cases} \tau^{*} & \text{if } F_{1}(\tau^{*}) \ge F_{1}(x_{i}) \\ x_{i} & \text{otherwise} \end{cases}$
To optimize a threshold at which the F1-score is below the targeted value, a bounded minimization method is used to maximize the trade-off between the TPR and FPR, as shown in equation 11. In other words, using the optimization technique, we try to pick the best threshold value from the ROC curve, maximizing the number of true positives while suppressing the count of false positives.

(11) $\tau^{*} = \operatorname*{arg\,max}_{\tau} \; \left[\mathrm{TPR}(\tau) - \mathrm{FPR}(\tau)\right]$
s.t. $0 \le \tau \le 1$
However, in this study, we have directly taken the maximization of the F1-score as our objective function, constrained and bounded as in equation 12. Under the assumption that choosing a threshold between the means of the auto- and cross-similarity density functions would maximize the number of true positives while minimizing the count of false negatives, we tried both the bounded method ($\tau$ between $\mu_{a}$ and $\mu_{c}$) and the unbounded method. It was found that the unbounded method, i.e., $0 \le \tau \le 1$, resulted in higher model accuracy. Hence, we tweaked our objective function 12 by making the search space flexible to $0 \le \tau \le 1$.

(12) $\tau^{*} = \operatorname*{arg\,max}_{\tau} \; F_{1}(\tau)$
s.t. $0 \le \tau \le 1$
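One way to realize the bounded search of equation 12 is SciPy's bounded scalar minimizer applied to the negative F1-score (a sketch assuming `scipy` is available; since the F1 surface is piecewise-constant in the threshold, a grid refinement may be preferable in practice):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def optimize_threshold(scores, labels, lo=0.0, hi=1.0):
    """Search the threshold maximizing F1 over [lo, hi] (eq. 12) by
    minimizing the negative F1 with a bounded scalar optimizer."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, int)

    def neg_f1(t):
        pred = (scores >= t).astype(int)
        tp = np.sum((pred == 1) & (labels == 1))
        fp = np.sum((pred == 1) & (labels == 0))
        fn = np.sum((pred == 0) & (labels == 1))
        p = tp / (tp + fp + 1e-9)
        r = tp / (tp + fn + 1e-9)
        return -(2 * p * r / (p + r + 1e-9))

    res = minimize_scalar(neg_f1, bounds=(lo, hi), method="bounded")
    return float(res.x), float(-res.fun)
```

Passing `lo=mu_a, hi=mu_c` gives the bounded variant discussed in the text, while the default `(0, 1)` gives the unbounded one.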
At optimization termination, if the precision or F1-score at the converged threshold $\tau^{*}$ is still less than that at the point of intersection of the Gaussian distribution functions, we choose the intersection point $x_{i}$ as the working threshold for model inference, as depicted in equation 10. This procedure, shown in figure 3, repeats whenever there is a change in the database, i.e., the addition or deletion of any feature vector; otherwise, the chosen threshold is reused for inference until a database change is observed. Since searching for the optimal threshold is computationally expensive, we can alternatively run this algorithm only once the model accuracy or F1-score falls below a certain value. However, for areas like medicine, where a near-perfect threshold is a must, continuously running this technique, as illustrated in figure 3, would be of significant benefit.
Model Inference
Once the optimal threshold is chosen by the adaptive algorithm portrayed in figure 3, we proceed with model inference. In this stage, as unfolded in figure 4, the feature vector of an image, taken either from a webcam or directly from a stored image, is generated using a pretrained recognition or re-identification model. Here, the FaceNet [Schroff] and dlib [King2009] models are used for face recognition, and a joint discriminative and generative learning model for person re-identification. For a given query image, we first detect and align the face or person using CNN models. Once a face is detected, we feed it into the pretrained model to extract its 128-D or 512-D facial feature vector. We then take the dot product of this feature vector with the entire pool of vectors stored in the database to calculate the similarity to each of them. If the maximum similarity for the query vector is greater than the chosen decision threshold, the identity is considered identified or recognized, and model inference continues. Conversely, if the similarity of the query identity does not match any of the gallery vectors, it is recorded as a new identity in the database.
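The matching-and-enrollment loop above can be sketched as follows (a minimal NumPy sketch; the function name, the gallery dict layout, and the `id_{n}` naming scheme are illustrative, not the paper's API):

```python
import numpy as np

def identify(query_emb, gallery, threshold):
    """Match a query embedding against the gallery by cosine similarity;
    register it as a new identity when nothing exceeds the threshold.

    gallery: dict mapping identity name -> unit-normalized embedding."""
    q = np.asarray(query_emb, float)
    q = q / np.linalg.norm(q)
    best_id, best_sim = None, -1.0
    for name, emb in gallery.items():
        sim = float(q @ emb)          # dot product of unit vectors
        if sim > best_sim:
            best_id, best_sim = name, sim
    if best_sim >= threshold:
        return best_id, best_sim      # recognized as an existing identity
    new_id = f"id_{len(gallery)}"     # hypothetical naming scheme
    gallery[new_id] = q               # enroll as a new identity
    return new_id, best_sim
```

Note how the same threshold acts as both the recognition cutoff and the gatekeeper for database updates, which is why retuning it on every enrollment matters.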
4 Experiments
This method has been tested on Labeled Faces in the Wild (LFW) [Huang], the de facto face verification dataset. Since most of the identities in the LFW dataset have at most two images per identity, the full effectiveness and robustness of the adaptive threshold algorithm cannot be observed there. A separate dataset of the top 100 highest-paid athletes, referring to Forbes's 2018 list [Flores2019], was therefore prepared, with more than 20 images per identity, using an online automated dataset generator [online_dataset_generator]. Since the same set of identities is used for the gallery and query embeddings, only identities with two or more images are taken into account. On the other hand, since the overall model accuracy depends on the number of images covering the entire range of facial orientations of a person, there needs to be flexibility in adjusting it. If head pose is incorporated, we can choose faces oriented only in particular directions, so that 3-4 images would cover all possible regions of detection. For instance, even the most accurate CNN-based face detectors can detect a face rotated only up to a limited angle in either direction w.r.t. a line perpendicular to the face. The other important point to note is that, since the same datasets, i.e., LFW and Athletes, are used for model evaluation, the number of identities is increased one by one, starting with two, to simulate the real-time dynamic nature of the database or gallery size. That is to say, at the start of the algorithm, two identities are used to compute a threshold; then all the available identities are added to the gallery one at a time, while the appropriate threshold is simultaneously recomputed for the momentary database size. Hence, using the proposed algorithm, the threshold learns to adapt to the changing size of the gallery/database. From figure 5, the cosine similarities between the same identities (auto) are concentrated within a relatively high range.
From figure 6, it is observed that the similarity among different identities is mostly distributed over a lower range. Given the distribution patterns of the auto/cross similarities observed in figures 5 and 6, and in histogram 7, if a single threshold had to be chosen, it would lie between their means. If we chose the lower bound of the choice list, the threshold would perform better at identifying similar identities but worse at differentiating dissimilar ones, and vice versa for the upper bound. Therefore, in this project, we use an online optimization-based threshold technique that can adapt to changes in the database size and the resulting probability density functions.
Figure 7 shows the probability density of the similarity function for same and different identities in the Athletes dataset. From the observed histogram, a normal distribution function is estimated using the mean and standard deviation of the actual auto and cross similarity distributions. Figure 8 shows that the point of intersection of the actual auto and cross similarity distribution functions separates the curves in the most unbiased way; higher model accuracy is expected if it is taken as the threshold. There is no closed-form expression for those actual probability distribution functions; however, their estimated Gaussian distribution functions mimic the distributions to a large extent, though with a slight shift in the point of intersection. The intersection point of the estimated normal distribution functions is computed by solving the quadratic equation with the coefficients shown in equation 5. This point is taken as the initial value for the adaptive threshold and subjected to a conditional inspection. If the model accuracy (here, the F1-score) is greater than the targeted value, we take it as the optimal threshold; if not, we proceed with an optimum search via the bounded optimization method, as shown in objective function 12. Upon termination of the optimal search, if the F1-score at that point ($\tau^{*}$) is greater than that at the point of intersection of the estimated Gaussian functions, it is taken as the optimal threshold; otherwise, the intersection point ($x_{i}$) is taken as the optimal threshold.
The model performance at the chosen optimum threshold is compared with its performance at various fixed threshold values of the kind usually picked via hit-and-trial. In this study, the prespecified fixed threshold points are chosen from the auto and cross similarity distribution functions, i.e., the points of maximum density or occurrence. From the auto and cross probability density functions in figure 7, these points for the Athletes dataset are observed to be 0.3 and 0.6, respectively. From figure 10, it is observed that model accuracy at the adaptive threshold overshadows the performance at the fixed threshold of 0.5 (the threshold usually chosen in conventional classification problems) and at the fixed threshold at which model precision is highest of all. Looking at the F1-score, however, model performance at the adaptive threshold is comparable to that at the lowest fixed threshold, probably because of the latter's higher recall; in contrast, its precision curve is the worst among all the chosen thresholds, including the adaptive one. Hence, inspecting precision, recall, and F1-score separately, the adaptive threshold outperforms all the fixed thresholds conventionally believed to be the best. Another noticeable aspect of the adaptive threshold in figure 10 is that once the number of identities in the gallery increases, causing a drastic drop in model accuracy, the threshold adapts abruptly to keep the model constraints in check. In this particular test case, the model accuracy at the lowest fixed threshold is quite similar to that at the adaptive threshold; however, for a large number of identities in the database, the adaptive threshold will start outperforming all the fixed values.
Hence, with a larger dataset, or in the case of a continuously running model, the gallery size will grow indefinitely; consequently, the adaptive threshold will keep adjusting for better performance while even the best fixed threshold deteriorates relentlessly over time.
From figure 11, it is observed that model precision is highest at the highest fixed threshold, followed by the next-highest one; but from figure 12, recall at those values is worst. Hence, looking at the harmonic mean of precision and recall, those fixed points can be discarded. Among all of the chosen points, model performance is good at the adaptive threshold and at the lowest fixed threshold. Since this experiment has been done only on a small dataset, performance at the lowest fixed threshold seems comparable to that of the adaptive one; nevertheless, it is likely to worsen on a larger dataset because of its deteriorating precision with a drastically increasing database size. From the ROC curve in figure 9, it is observed that the ratio of true positives to false positives is higher with the adaptive threshold than with any of the fixed thresholds. Looking at the curve, if we had to choose one of the fixed thresholds, it would be the one whose area under the curve is highest among the fixed ones. Thus, the comparative ROC plot validates the adaptive method as a good choice of threshold for the chosen dataset.
Table 1

Adaptive     0.72   88.34   20.32
fixed@0.3    0.62   66.89   19.54
fixed@0.5    0.68   78.22    5.31
fixed@0.7    0.64   60.53    0.0
Table 1 shows that the model performs better with the adaptive threshold than with any of the fixed thresholds. Though the F1-score at the lowest fixed threshold is competitive most of the time, its precision is poor; the higher F1-score is likely due to its higher recall. On the contrary, the highest fixed threshold does have higher precision than any of the other bands, as is evident in figure 11, but due to its bad recall, its final F1-score is the lowest of all. If we look solely at the model accuracy, as defined in equation 8, the model with the adaptive threshold outperforms all the fixed threshold values. Similarly, if we look at the area under the ROC curve, the adaptive threshold occupies the maximum area, which further validates the model's better performance with the adaptive threshold.
5 Conclusion
Most of the research in machine learning focuses on improving existing model architectures and modifying optimization algorithms and loss functions; however, hardly any research has been done on making the decision threshold adaptive. Once the model outputs confidence scores, they need to be sieved through a decision threshold to be categorized into their respective classes. If the threshold is not chosen wisely, the entire model performance can go in vain. Yet in most cases this significant deciding value is taken via hit-and-trial, or some fixed value is chosen by analogy (mostly 0.5 by default). For tasks like classifying different objects, where each object's features are vividly distinct, this might work, but not in all cases.
Nonetheless, for identification tasks such as face recognition and person re-identification, where the features of different identities are more or less similar and hard to separate, the conventional method of choosing a fixed threshold might not work. Even selecting an optimum threshold based on the ROC curve can fail for a temporally increasing identity-database size. For a deep learning model that must classify or identify objects with a high degree of feature similarity, a threshold that can adapt to the varying database size plays a crucial role. Therefore, in this project we have proposed an online optimization-based statistical feature learning technique to formulate an adaptive threshold. Whenever there is any update to the identity-database size, a new decision threshold is calculated based on the estimated Gaussian distribution functions of the auto/cross-similarity distributions and a bound-constrained optimization search. The proposed algorithm was tested on Labeled Faces in the Wild (LFW), the de facto face verification dataset, and on a self-prepared dataset of the top 100 highest-paid athletes published by Forbes magazine (2018). The reason for preparing our own dataset is that the algorithm requires a large number of samples per identity to probe its robustness over widely distributed features. The method achieved higher accuracy than the fixed thresholds at 0.3, 0.5, and 0.7, and its F1-score remained competitive or better across the range of gallery sizes. This project has been tested on a concise dataset, so in the future researchers can explore further techniques that make the decision threshold adapt to a varying identity-database size and indistinguishable features.

Acknowledgement
The author declares that the research was conducted in the absence of grants or financial support from any institution or organization. However, I would like to take this opportunity to thank my friends, Mr. Bibek Raj Shrestha and Mr. Baibhav Raj Shrestha, for providing the necessary hardware components, namely 16 GB of RAM and a 500 GB SSD, to meet the higher computational requirements of the project.