Most research focuses on improving classification architectures, optimization algorithms, and loss functions, but hardly any work addresses the adaptability of the decision threshold that marks the boundary between classes. Although there are myriad state-of-the-art models for extracting discriminative features, it is still common to choose a classification threshold by trial and error. In the author’s experience, this works well for a handful of feature vectors in the database but becomes ineffective as the database grows. If the threshold is poorly chosen, the performance of even a state-of-the-art model goes to waste, however accurate the model is. Conversely, if the threshold is chosen wisely and updated as soon as a new feature vector is registered in the database, even a less accurate model can do a better job. In this project, an optimization-based statistical feature learning algorithm has been developed to boost the performance of any recognition or re-identification model through its decision boundary.
A decision threshold is a numerical value that separates classes. Different thresholds yield different numbers of true/false positives and true/false negatives, and consequently different precision, recall, and f1-score for a given dataset. In this project, the decision threshold is selected to maximize the f1-score (that is, the harmonic mean of precision and recall). For instance, in a cancer diagnosis problem, it acts as the cut-off for classifying an observation as either 0 (no cancer) or 1 (has cancer). Choosing an optimal threshold is challenging because it is case-specific, i.e., different for different objectives and datasets. In verification and identification tasks such as face recognition and person re-identification, the identity database is updated with new identities, which raises the need to retune the threshold with each database update. This threshold serves not only as the deciding factor for verification and identification but also as the gatekeeper for updating the identities in the database.
2 Related work
Little work has been done on selecting the classifier threshold adaptively; however, Chou et al. (2018) [Chou] developed an online face registration mechanism with a distinct threshold per face. With this mechanism, they achieved a 22% accuracy improvement on the LFW dataset compared to the common practice of using a fixed threshold throughout inference. In brief, they proposed an adaptive thresholding technique that assigns a specific threshold to each registered face in the database, which adapts as new entries are added. However, this method is limited to the evaluation task and is not well suited to real-time inference.
The receiver operating characteristic (ROC) curve plots the true-positive rate against the false-positive rate at varying thresholds. The diagonal line corresponds to a model that performs no better than random guessing; operating points above the diagonal indicate thresholds with better-than-chance performance, and points below it indicate worse-than-chance performance. The ROC curve helps to understand the trade-off between the true-positive rate and the false-positive rate for different threshold values. The area under the ROC curve (AUC) summarizes the model’s overall performance across all thresholds.
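As a small illustration of the ROC construction described above, the following sketch sweeps the threshold over a set of scores and integrates the resulting curve with the trapezoidal rule. The function names and the toy scores are assumptions made for illustration; the sketch also assumes distinct scores and at least one positive and one negative label.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold one score at a time.

    Assumes distinct scores and that both classes are present in `labels` (1 or 0).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit samples from the highest score to the lowest.
    for _, label in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points with non-decreasing x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example (made-up scores): one negative is ranked above one positive.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
toy_auc = area_under_curve(roc_points(toy_scores, toy_labels))
```

A perfectly ranked set of scores gives an area of 1, while a chance-level ranking gives about 0.5, which is the diagonal mentioned above.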
Zou et al. (2016) [Zou2016] analyzed the drawbacks of using the ROC curve as the sole measure for selecting the threshold in imbalanced classification. They proposed a novel framework, a sampling-based threshold auto-tuning method, for finding the best classification threshold, yielding an improvement over the conventional default threshold of 0.5. Though the ROC curve and AUC reflect the ranking power of the positive prediction probability, Zou et al. claim that classifier performance in terms of precision, recall, and f1-score might still be poor even when the AUC exceeds 0.9. They employed f1-score and AUC for Liao’s protein remote homology detection, tuning the classification threshold for the best f1-score. Similarly, Lipton et al. (2014) [Lipton2014] proposed optimal classification threshold selection through maximization of the f1-score for binary and multi-label classification problems, claiming an improvement in predicting 26,853 labels of Medline documents compared to the traditional method.
Al Hartmann [hartmann2009] filed a patent on an adaptive threshold for detecting spam email messages, based on the ratio between clean and spam emails received in previous time periods and on the misclassification cost ratio. Bauer et al. (2015) [Bauer2015] applied a Bayesian model of neurofeedback and reinforcement learning to evaluate the impact of an adaptive classification threshold on optimizing restorative brain-computer interfaces (BCI). They claim that, across their experimental results, threshold adaptation is superior to any fixed threshold.
Similarities and distances between the feature vectors of the same identities (here denoted auto) and of different identities (here denoted cross) are computed simultaneously; identities here denote faces, persons, or any other subject of interest. Figures 1 and 2 show all the possible pairings of identities used to compute the auto and cross distance/similarity between the query and database identities. While finding an adaptive threshold, the query embeddings are cloned from the gallery embeddings to find the similarity among the same identities (auto-similarity) and among different identities (cross-similarity). Hence, during query mode the query and gallery embeddings come from different images, whereas during optimal-threshold search mode the query embeddings are an exact copy of the database embeddings. Statistical features such as the minimum, maximum, mean, standard deviation, and variance can be drawn from the auto and cross distance/similarity distributions. From this statistical information, the adaptive threshold is adjusted whenever an identity is added to or deleted from the database. Since computing these statistics along with the Euclidean distance or cosine similarity between feature vectors over a large pool is computationally expensive, the threshold adjustment can instead be done periodically, once the number of newly registered identities exceeds a specified number.
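A minimal sketch of how the auto and cross similarity distributions could be collected in threshold-search mode is given below. It assumes cosine similarity, skips self-pairs (since the query set is a copy of the gallery), and uses made-up 2-D embeddings; the function names are illustrative and not taken from the original implementation. It uses only the standard library to stay self-contained, whereas an array library would be the usual choice for real embeddings.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def auto_cross_similarities(embeddings, labels):
    """Split pairwise similarities into auto (same identity) and cross (different identity)."""
    auto, cross = [], []
    # Each unordered pair is visited once; a vector is never paired with itself.
    for i, j in combinations(range(len(embeddings)), 2):
        similarity = cosine_similarity(embeddings[i], embeddings[j])
        if labels[i] == labels[j]:
            auto.append(similarity)
        else:
            cross.append(similarity)
    return auto, cross


# Toy gallery: two identities with two embeddings each (values are made up).
gallery = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-0.6, 0.8]]
identities = ["A", "A", "B", "B"]
auto, cross = auto_cross_similarities(gallery, identities)

# Statistics from which Gaussian estimates of the two distributions can be built.
auto_mean, auto_std = mean(auto), pstdev(auto)
cross_mean, cross_std = mean(cross), pstdev(cross)
```

Because every pair is evaluated, the cost grows quadratically with the number of stored vectors, which is consistent with the suggestion above to recompute the statistics only periodically.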
Referring to figure 3, once the auto and cross similarity distributions are obtained, their probabilities of occurrence are depicted in a histogram. From the mean and standard deviation of the auto and cross probability distributions, a Gaussian function is estimated for each of them, as shown in equation 4. The main reason for estimating the auto and cross Gaussian distribution functions is to choose a value between their means such that neither distribution dominates or biases the choice of threshold. We could instead take the average of the auto and cross means directly, without estimating their normal distributions, but doing so might unnecessarily shift the threshold towards the curve with the higher peak. To find the point of intersection of the estimated auto and cross Gaussian distribution functions, we equate them and solve, as shown in the quadratic equation 5.
where the two functions are the auto Gaussian function and the cross Gaussian function, each defined by its mean and standard deviation, and x is the x-coordinate of the intersection point of the auto and cross Gaussian distribution functions.
where the coefficients are expressed in terms of the standard deviations and variances of the auto and cross Gaussian distribution functions. Roots lying outside the admissible bound can be ignored; if both roots lie within it, evaluate the model at each and keep the root with the higher model accuracy. If the point of intersection lies between the cross and auto means, we proceed to compute the model accuracy; otherwise, we take the average of the two means, as shown in equation 6.
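The intersection step can be sketched as follows. Equating two Gaussian densities and taking logarithms gives a quadratic a·x² + b·x + c = 0; the symbol names (mu_a, sigma_a for auto and mu_c, sigma_c for cross) and the coefficient layout are this sketch's own notation and are not necessarily those of equation 5. When several roots qualify, the sketch simply takes the first one between the means, whereas the text above evaluates the model at each candidate.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Normal density with mean `mu` and standard deviation `sigma`."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def gaussian_intersections(mu_a, sigma_a, mu_c, sigma_c):
    """Real x values where the auto and cross Gaussian densities are equal."""
    # Coefficients obtained from log(pdf_auto(x)) = log(pdf_cross(x)).
    a = 1.0 / (2 * sigma_c ** 2) - 1.0 / (2 * sigma_a ** 2)
    b = mu_a / sigma_a ** 2 - mu_c / sigma_c ** 2
    c = (mu_c ** 2 / (2 * sigma_c ** 2)
         - mu_a ** 2 / (2 * sigma_a ** 2)
         + math.log(sigma_c / sigma_a))
    if abs(a) < 1e-12:
        # Equal variances: the equation is linear and has a single root.
        return [-c / b] if abs(b) > 1e-12 else []
    discriminant = b * b - 4 * a * c
    if discriminant < 0:
        return []
    root = math.sqrt(discriminant)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


def initial_threshold(mu_a, sigma_a, mu_c, sigma_c):
    """Intersection between the two means, or the average of the means as a fallback."""
    low, high = sorted((mu_c, mu_a))
    inside = [x for x in gaussian_intersections(mu_a, sigma_a, mu_c, sigma_c)
              if low <= x <= high]
    return inside[0] if inside else (mu_a + mu_c) / 2.0


# Illustrative (made-up) statistics: auto ~ N(0.7, 0.1), cross ~ N(0.3, 0.2).
threshold = initial_threshold(0.7, 0.1, 0.3, 0.2)
```

With equal standard deviations the quadratic term vanishes and the intersection reduces to the midpoint of the two means, which matches the fallback average.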
We used several performance metrics for model evaluation: precision, recall, f1-score, and accuracy. Their computation requires the counts of true/false positives and negatives at a given threshold, which are calculated as shown in equation 7.
True positives (TP): the total number of times the model correctly predicts the positive class, with the maximum similarity between feature vectors greater than or equal to the threshold.
True negatives (TN): the total number of times the model correctly predicts the negative class, with the maximum similarity between feature vectors less than or equal to the threshold.
False positives (FP): the total number of times the model incorrectly predicts the positive class, with the maximum similarity between feature vectors greater than or equal to the threshold.
False negatives (FN): the total number of times the model incorrectly predicts the negative class, with the maximum similarity between feature vectors less than or equal to the threshold.
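One possible reading of these four definitions is sketched below. It assumes that each query is summarized by its maximum similarity to the gallery and by a flag saying whether that best match shares the query's identity; a similarity exactly equal to the threshold is counted as a positive prediction. The function and argument names are hypothetical.

```python
def confusion_counts(max_similarities, best_match_is_same, threshold):
    """Count (TP, TN, FP, FN) for one threshold.

    max_similarities   -- highest similarity of each query to any gallery vector
    best_match_is_same -- True if that best match has the query's own identity
    """
    tp = tn = fp = fn = 0
    for similarity, same in zip(max_similarities, best_match_is_same):
        if similarity >= threshold:
            # The model accepts the best match as a known identity.
            if same:
                tp += 1
            else:
                fp += 1
        else:
            # The model rejects the best match.
            if same:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


# Toy example with made-up similarities.
counts = confusion_counts([0.9, 0.8, 0.4, 0.2, 0.6],
                          [True, False, True, False, True],
                          threshold=0.5)
```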
Similarly, the true positive rate (TPR) and false positive rate (FPR) for plotting the ROC curve are computed from the same counts of true/false positives and negatives, resulting from the model evaluation at a given threshold, as shown in equation 9.
where the denominators are the cumulative counts of all positives and of all negatives, respectively, and a small constant is added to each to prevent a possible division-by-zero error.
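The metrics mentioned above can be computed from the four counts as in the sketch below. These are the standard definitions; the value of the small constant `eps` and the function name are assumptions, and the exact placement of the constant in equations 8 and 9 may differ.

```python
def evaluate_counts(tp, tn, fp, fn, eps=1e-9):
    """Standard metrics from confusion counts; `eps` guards against division by zero."""
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return {
        "precision": precision,
        "recall": recall,
        # f1-score is the harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall + eps),
        "accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        "tpr": tp / (tp + fn + eps),  # true positives over all actual positives
        "fpr": fp / (fp + tn + eps),  # false positives over all actual negatives
    }


metrics = evaluate_counts(tp=2, tn=1, fp=1, fn=1)
empty = evaluate_counts(0, 0, 0, 0)  # no samples: every metric is 0.0, no exception
```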
If the model accuracy at the specified threshold exceeds the targeted value, we select it as the optimal threshold and proceed with model inference; otherwise, we search for the optimal value with the objective of maximizing the f1-score, as shown in equation 10.
To optimize a threshold at which the f1-score is below the targeted value, a bounded minimization method is used to maximize the trade-off between the true positive rate and the false positive rate, as shown in equation 11. In other words, the optimization picks the threshold on the ROC curve that maximizes the number of true positives while suppressing the number of false positives.
However, in this study, we directly took maximization of the f1-score as our objective function, subject to bounds on the threshold. Assuming that a threshold between the means of the auto and cross similarity density functions would maximize the number of true positives and minimize the number of false positives, we tried both bounded and unbounded optimization. The unbounded method resulted in higher model accuracy. Hence, we adjusted objective function 12 by relaxing the search space.
When the optimization terminates, if the precision or f1-score at the converged threshold is still lower than at the point of intersection of the Gaussian distribution functions, we choose the intersection point as our working threshold for model inference, as depicted in equation 10. This procedure, shown in figure 3, repeats whenever the database changes, i.e., when a feature vector is added or deleted; otherwise, the chosen threshold continues to be used for inference. Alternatively, since searching for the optimal threshold is computationally expensive, the algorithm can be run only once the model accuracy/f1-score falls below a certain value. However, in areas like medicine where a precise threshold is a must, running this technique continuously, as illustrated in figure 3, would be of significant benefit.
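The selection logic described above (accept the initial threshold if it meets the target, otherwise search, then fall back to the initial value if the search does not improve on it) could be sketched as follows. A plain grid search stands in here for the bounded minimization method, which is a simplification; the search range of [-1, 1] (the range of cosine similarity), the step count, and all function names are assumptions.

```python
def f1_at(threshold, max_similarities, best_match_is_same, eps=1e-9):
    """f1-score when best matches with similarity >= threshold are accepted."""
    pairs = list(zip(max_similarities, best_match_is_same))
    tp = sum(1 for s, same in pairs if s >= threshold and same)
    fp = sum(1 for s, same in pairs if s >= threshold and not same)
    fn = sum(1 for s, same in pairs if s < threshold and same)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)


def select_threshold(initial, max_similarities, best_match_is_same,
                     target_f1, lower=-1.0, upper=1.0, steps=200):
    """Keep `initial` if it meets the target; otherwise search for a better threshold."""
    f1_initial = f1_at(initial, max_similarities, best_match_is_same)
    if f1_initial >= target_f1:
        return initial
    # Grid search over [lower, upper], used as a stand-in for a bounded optimizer.
    candidates = [lower + (upper - lower) * k / steps for k in range(steps + 1)]
    best = max(candidates,
               key=lambda t: f1_at(t, max_similarities, best_match_is_same))
    f1_best = f1_at(best, max_similarities, best_match_is_same)
    # Fall back to the initial (intersection) threshold if the search did not help.
    return best if f1_best > f1_initial else initial


# Toy data (made up): one impostor at 0.65 sits above the initial threshold of 0.5.
sims = [0.9, 0.85, 0.7, 0.65, 0.4, 0.3]
same = [True, True, True, False, False, False]
chosen = select_threshold(0.5, sims, same, target_f1=0.95)
```

Because the f1-score is piecewise constant in the threshold, a derivative-free search is a natural fit; the grid here only makes that explicit and is not meant to reproduce the optimizer used in the study.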
Once the optimal threshold is chosen by the adaptive algorithm portrayed in figure 3, we proceed with model inference. As shown in figure 4, the feature vector of an image, taken either from a webcam or directly from an image, is generated using a pre-trained recognition or re-identification model. Here, the FaceNet [Schroff] and dlib [King2009] models are used for face recognition, and a joint discriminative and generative learning model is used for person re-identification. For a given query image, we first detect and align the face or person using CNN models. Once a face is detected, we feed it into the pre-trained model to extract its 128-D or 512-D facial feature vector. We then take the dot product of this feature vector with the entire pool of vectors stored in the database to calculate its similarity to each of them. If the similarity for the query vector is greater than the chosen decision threshold, the identity is considered identified or recognized, and model inference continues. Conversely, if the query identity does not match any of the gallery vectors, it is recorded as a new identity in the database.
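The gatekeeping role of the threshold at inference time might look like the sketch below. It assumes unit-length embeddings, so that the dot product equals the cosine similarity, and stores the gallery as a dictionary from identity name to a list of vectors; the naming scheme for new identities and the function name are hypothetical.

```python
def identify_or_register(query, gallery, threshold):
    """Return (identity, best_similarity); register the query if nothing matches.

    gallery -- dict mapping identity name to a list of unit-length embeddings
    """
    best_name, best_similarity = None, float("-inf")
    for name, vectors in gallery.items():
        for vector in vectors:
            # Dot product of unit vectors equals their cosine similarity.
            similarity = sum(q * g for q, g in zip(query, vector))
            if similarity > best_similarity:
                best_name, best_similarity = name, similarity
    if best_name is not None and best_similarity > threshold:
        return best_name, best_similarity
    # No gallery vector is similar enough: record the query as a new identity.
    new_name = f"identity_{len(gallery)}"
    gallery[new_name] = [query]
    return new_name, best_similarity


# Toy gallery with made-up 2-D unit embeddings.
known = {"A": [[1.0, 0.0]], "B": [[0.0, 1.0]]}
match = identify_or_register([0.8, 0.6], known, threshold=0.7)
stranger = identify_or_register([-1.0, 0.0], known, threshold=0.7)
```

Each newly registered identity changes the gallery, which is the event that would trigger a fresh threshold computation in the adaptive scheme.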
This method has been tested on Labeled Faces in the Wild (LFW) [Huang], a de facto face verification dataset. Since most identities in the LFW dataset have at most two images each, the full effectiveness and robustness of the adaptive threshold algorithm cannot be observed on it. A separate dataset of the top 100 highest-paid athletes, taken from Forbes’s 2018 list [Flores2019], was therefore prepared with more than 20 images per identity using an online automated dataset generator [online_dataset_generator]. Since the same set of identities is used for the gallery and query embeddings, only identities with two or more images are taken into account. On the other hand, since the overall model accuracy depends on the number of images covering the full range of a person’s facial orientations, there needs to be flexibility in adjusting this number. If head pose is incorporated, we can select faces oriented in particular directions so that 3-4 images cover the entire detectable range. For instance, even the most accurate CNN-based face detector can detect a face rotated only up to a limited angle in either direction with respect to a line perpendicular to the face. Another important point is that, since the same datasets, i.e., LFW and Athletes, are used for model evaluation, the number of identities is increased one by one, starting with two, to simulate the dynamic nature of a real-time database or gallery. That is, at the start of the algorithm two identities are used to compute a threshold; then all the remaining identities are added to the gallery one at a time, and the appropriate threshold is recomputed for the current database size. Hence, with the proposed algorithm, the threshold learns to adapt to a changing gallery/database size.
From figure 5, the cosine similarities between the same identities (auto) are distributed such that most of the distribution lies within a limited range.
From figure 6, the similarity among different identities is likewise observed to be concentrated mostly within a limited range. Given the distribution patterns of the auto/cross similarities in figures 5 and 6 and the histogram in figure 7, a single threshold, if one had to be chosen, would lie between their means. If we chose the lower end of that range, the threshold would perform better at identifying similar identities but worse at differentiating dissimilar ones, and vice versa for the upper end. Therefore, in this project, we have used an online optimization-based threshold technique that can adapt to changes in the database size and in the probability density functions.
Figure 7 shows the probability density of the similarity function for the same and different identities in the Athletes dataset. From the observed histogram, a normal distribution function is estimated using the mean and standard deviation of the actual auto and cross similarity distributions. Figure 8 shows that the point of intersection of the actual auto and cross similarity distribution functions separates the curves in the most unbiased way, and higher model accuracy is expected if it is taken as the threshold. The actual probability distribution functions have no closed-form expression. However, their estimated Gaussian distribution functions mimic them to a large extent, though with a slight shift in the point of intersection. The intersection point of the estimated normal distribution functions is computed by solving a quadratic equation with the coefficients shown in equation 5. This point is taken as the initial value of the adaptive threshold and inspected conditionally. If the model accuracy (here, f1-score) is greater than the targeted value, we take it as the optimal threshold; if not, we proceed with an optimum search via the bounded optimization method, as shown in objective function 12. When the search terminates, if the f1-score at the converged point is greater than at the point of intersection of the estimated Gaussian functions, the converged point is taken as the optimal threshold; otherwise, the intersection point is taken as the optimal threshold.
The model performance at the chosen optimum threshold is compared with its performance at several fixed threshold values of the kind usually picked by trial and error. In this study, the pre-specified fixed threshold points are taken from the auto and cross similarity distribution functions, i.e., the points of maximum density or occurrence. From the auto and cross probability density functions in figure 7, these points are observed to be 0.3 and 0.6 for the Athletes dataset. From figure 10, model accuracy at the adaptive threshold exceeds the performance at the fixed threshold conventionally chosen in classification problems and at the fixed threshold at which model precision is highest. However, in terms of f1-score, model performance at the adaptive threshold is comparable to that of one of the fixed thresholds. This might be because of that threshold’s higher recall; in contrast, its precision curve is the worst among all the chosen thresholds, including the adaptive one. Hence, inspecting precision, recall, and f1-score separately, the adaptive threshold outperforms the chosen fixed thresholds that are conventionally believed to be the best. Another noticeable property of the adaptive threshold in figure 10 is that when the number of identities in the gallery increases and model accuracy drops drastically, the threshold adapts abruptly to keep the model within its constraints. In this particular test case, model accuracy at one of the fixed thresholds is quite similar to that of the adaptive threshold; however, for a large number of identities in the database, the adaptive threshold is expected to outperform all the fixed values.
Hence, with a larger dataset or a continuously running model, the gallery size will keep increasing; consequently, the adaptive threshold will adjust for better performance while even the best fixed threshold deteriorates steadily over time.
At the fixed thresholds where precision is highest, recall is the worst. Hence, judged by the harmonic mean of precision and recall, those fixed points can be discarded. Among all the chosen points, model performance is good at the adaptive threshold and at the lowest fixed threshold. Since this experiment was done only on a small dataset, performance at the lowest fixed threshold seems comparable to that of the adaptive one. Nevertheless, it is likely to get worse on a larger dataset, because its precision worsens as the database size increases drastically.
From the ROC curve in figure 9, the ratio of true positives to false positives is higher with the adaptive threshold than with any of the fixed thresholds. Looking at the curve, if we had to choose one of the fixed thresholds, it would be the one whose area under the curve is the highest among the fixed ones. Thus, the comparative ROC plot supports the adaptive method as a good way of choosing a threshold for the chosen dataset.
Table 1 shows that the model performs well with the adaptive threshold compared to any of the fixed thresholds. Though the f1-score is high most of the time for one of the fixed thresholds, its precision is poor; the higher f1-score might be due to its higher recall. In contrast, another fixed threshold has higher precision than any of the others, as is evident in figure 11, but because of its poor recall its final f1-score is the lowest of all. Looking solely at model accuracy, as defined in equation 8, the model with the adaptive threshold outperforms all the fixed threshold values. Similarly, the adaptive threshold has the largest area under the ROC curve, which further supports the model’s better performance with the adaptive threshold.
Most research in machine learning focuses on improving existing model architectures and modifying optimization algorithms and loss functions. However, hardly any research has been done on making the decision threshold adaptive. Once the model outputs confidence scores, they must be sieved with a decision threshold to assign them to their respective classes. If the threshold is not chosen wisely, the model’s performance goes to waste. Yet in most cases this decisive value is chosen by trial and error, or set to some fixed value by convention (mostly 0.5 by default). This might work in cases such as classifying different objects, where each object’s features are clearly distinct, but it does not work in all cases.
In particular, for identification tasks such as face recognition and person re-identification, where the features of different identities are highly similar and hard to separate, the conventional method of choosing a fixed threshold might not work. Selecting an optimum threshold from the ROC curve might also fail when the identity database grows over time. For a deep learning model that must classify or identify objects with a high degree of feature similarity, a threshold that can adapt to the varying database size plays a crucial role. Therefore, in this project we have proposed an online optimization-based statistical feature learning technique to formulate an adaptive threshold. Whenever the identity database is updated, a new decision threshold is calculated from the estimated Gaussian distribution functions of the auto/cross-similarity distributions and a bound-constrained optimization search. The proposed algorithm was tested on Labeled Faces in the Wild (LFW), a de facto face verification dataset, and on a self-prepared dataset of the top 100 highest-paid athletes published by Forbes magazine (2018). The reason for preparing our own dataset is that the algorithm requires a large number of samples per identity to probe its robustness over widely distributed features. The method achieved, , and higher accuracy compared to the fixed thresholds at , , and , respectively. Looking at model accuracy, with adaptive threshold f1-score was observed to times w.r.t the total number of samples; whereas , , and for fixed thresholds at 0, , and respectively. This project has been tested on a small dataset, so future research can explore such techniques further to make the decision threshold adapt to varying identity-database sizes and hard-to-distinguish features.
The author declares that the research was conducted without grants or financial support from any institution or organization. However, I would like to thank my friends, Mr. Bibek Raj Shrestha and Mr. Baibhav Raj Shrestha, for providing the necessary hardware components, such as 16 GB of RAM and a 500 GB SSD, to meet the higher computational requirements of the project.