Along with progress in machine learning, the performance of text detectors and recognizers has improved remarkably over the past few years [tian2016ctpn, zhou2017east, shi2017seglink, baek2019craft, baek2019wrong, long2018scene]. However, existing detection and recognition evaluation metrics fail to provide a fair and reliable comparison among those methods. Especially when evaluating end-to-end models, errors accumulate because the measurement at each stage is unreliable. Fig. 1 illustrates common problems encountered when evaluating end-to-end detection and recognition modules. The previously used instance-level binary scoring process assigns a value of 0 to both decent (left figures) and wrong (right figures) results.
To better understand where established text detection and recognition evaluation metrics fail, a closer look into the intrinsic nature of texts is required. At a fundamental level, text consists of words, which can further be decomposed into an array of characters. This character array embodies two intrinsic characteristics: its sequential nature and its content. Text detection’s goal of locating words can then be reinterpreted as finding the area that encapsulates the right sequence and content in a group of characters. The degree to which the right sequence and content are recognized within the detection area denotes the granularity and correctness issues, respectively. A more specific explanation of these attributes follows – with examples of what they measure in Figure 2.
Granularity is the degree to which the text detection model captures the sequence of characters in exactly one word as one unbroken sequence. Split detection results (Figure 2(a)) break the character sequence in the word and merged detection results (Figure 2(b)) fail to capture exactly one word. Both cases incur penalties proportionate to the number of splits or merges per word.
Correctness is the degree to which the text detection and recognition model captures the content of a word. Specifically, each character in that word must be detected and recognized exactly once and in the right order. A penalty is incurred proportionate to the number of missing or overlapping characters in both detection and recognition results (Figure 2(c, d)).
The majority of public datasets provide word-level annotations since each word carries a semantic meaning. However, as Fig. 1 shows, evaluating word-level boxes with a predefined threshold provokes various issues. Despite appropriate box predictions, the binary scoring process discards acceptable prediction results and produces unexplainable scores. To provide a more detailed interpretation of the models, recent studies have adopted a character-level evaluation process [lee2019tedeval, lee2019popeval]. Inspired by them, our proposed metric, named CLEval (Character-Level Evaluation), is designed to perform end-to-end evaluations without explicit character annotations. The method has two key components: an instance matching process and a character scoring process. The instance matching process solves granularity issues by pairing all possible GT and detection boxes that share at least one character, and the character-level scoring process solves correctness issues by calculating the longest common subsequence between GT and predicted transcriptions.
The CLEval metric is primarily designed to evaluate end-to-end tasks, but it can also be applied to individual detection and recognition modules. Interpretation of each module is valuable since it allows us to discover how each component affects overall performance. Assuming that the characters are evenly placed within a word box, the detection evaluation is conducted using pseudo-character center positions. This method was first proposed by [lee2019tedeval], but we further developed the idea to handle end-to-end models based on the same scoring policy. The proposed metric also provides quantitative indicators of recognition modules by measuring correctly recognized words within detection boxes.
The main contributions of this paper can be summarized as follows. 1) We propose a character-level evaluation metric that is favorable from a qualitative point of view and does not require character-level annotations. 2) We define granularity and correctness issues, and solve them by performing instance matching and character scoring processes. 3) We propose a unified protocol that can evaluate not only end-to-end tasks, but also individual detection and recognition modules.
2 Related works
2.1 Detection evaluation
Intersection-over-Union (IoU). The IoU metric originally comes from object detection tasks such as Pascal VOC [everingham2015pascal]. IoU accepts detections that match the ground truth (GT) box in an exclusive one-to-one manner, and only when the overlapping region satisfies the predefined threshold. Although IoU is the most widely used evaluation metric thanks to its simplicity, its behavior is clearly not suitable for evaluating texts, as argued by [calarasanu2016good, nguyenstate]. IoU cannot handle granularity and correctness issues, which are critical for OCR tasks.
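The thresholded acceptance rule described above can be sketched as follows. This is a minimal illustration that assumes axis-aligned boxes given as `(x1, y1, x2, y2)`; the function names `iou` and `is_accepted` are hypothetical, and the default threshold of 0.5 is the value commonly used with this metric.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_accepted(gt_box, det_box, threshold=0.5):
    """Binary decision: the detection counts only if IoU reaches the threshold."""
    return iou(gt_box, det_box) >= threshold


# A detection covering the left 40% of the word box is rejected outright,
# although it localizes part of the text correctly.
gt = (0, 0, 100, 20)
partial = (0, 0, 40, 20)
print(iou(gt, partial), is_accepted(gt, partial))
```

The binary decision is what makes a partially correct box indistinguishable from a completely wrong one.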
DetEval. DetEval [wolf2013deteval] was designed to solve the granularity issue by allowing a single bounding box to take part in multiple relationships. Its matching process accepts one-to-one, one-to-many, and many-to-one relationships. However, each instance is evaluated based on both area recall and area precision thresholds. Area-based thresholds not only cause correctness issues, but are also a limitation when applying the metric to end-to-end evaluation.
Tightness-aware IoU (TIoU). Liu et al. recently suggested the TIoU metric, which penalizes detections based on the occupation ratio between the detection and the ground truth. In doing so, TIoU gives higher scores to detections that are more similar to the ground truth box. Its major weakness is that it penalizes slight differences between the ground truth and the detection even if the recognition results of those detection boxes are the same. This is far from the end-user perspective, and it is unfair for a correct box to receive a different score under TIoU because of a small perturbation in box size.
TedEval. A character-level evaluation metric for text detection has been proposed by [lee2019tedeval]. The metric alleviates qualitative disagreements, but can only be used to evaluate text detection modules. We adopt the idea of using pseudo-characters, and apply a character-level evaluation process to evaluate end-to-end results.
2.2 Recognition evaluation
Correctly Recognized Words (CRW). As its name suggests, CRW is a binary scoring metric that counts a result as correct only when the ground truth transcription and the recognition result are exactly the same. This method has the fundamental limitation of a binary scoring system: it cannot distinguish an absurd recognition result from an almost accurate one.
Edit Distance [levenshtein1966binary]. The Edit Distance (ED) is a common measure of how dissimilar two given strings are. The ED of two strings is the minimum number of operations required to transform one string into the other. In the standard Levenshtein distance, the operations are insertion, deletion, and substitution. Using ED to evaluate scene text recognition models is considered reasonable in that the score reflects, as a distance, how well the model performs. The Longest Common Subsequence (LCS) is the longest subsequence common to a set of sequences, and computing it corresponds to a special case of ED that uses only insertion and deletion operations [LCS-ED-Relation].
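The relation between LCS and the insertion/deletion-only edit distance can be made concrete: for strings $a$ and $b$, that distance equals $|a| + |b| - 2\,\mathrm{LCS}(a, b)$. The sketch below is a plain dynamic-programming illustration; the function names are hypothetical and not taken from any reference implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a, b):
    """Edit distance restricted to insertions and deletions."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


# One misread character costs one deletion plus one insertion.
print(lcs_length("RIVERSIDE", "RIVER5IDE"), indel_distance("RIVERSIDE", "RIVER5IDE"))
```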
2.3 End-to-end evaluation
IoU and CRW. IoU and CRW together form a strictly cascaded evaluation metric. The detection stage filters out detection results whose IoU with the corresponding GT is below the threshold. Matches that pass the IoU stage are then judged by CRW. Both metrics are reported to hinder fine-grained assessment because of the chained binary scoring.
PopEval. PopEval was proposed to make full use of text information from recognition results. Its character elimination process is simple yet good enough from a practical point of view. However, PopEval does not provide detection evaluation. We adopt the idea of character elimination and enhance it with a substring elimination scheme to mitigate the problem of ignoring the order of texts.
Fig. 3 shows a comparison of our method with the IoU + CRW metric. Detection boxes in the word “RIVERSIDE” show the granularity issue, and boxes in the word “WALK” show the correctness issue. While previous metrics fail to accept decent prediction results, our metric successfully quantifies various conditions through the matching process and the scoring process. The matching process identifies instance-level pairs between GT and detected boxes, and the scoring process provides a final score by analyzing extracted statistics.
3.1 Matching process
In this section, the matching process is explained in detail. First, to overcome the absence of character annotations, we adopt the idea from [lee2019tedeval] and calculate Pseudo-Character Center (PCC) positions. A GT box and a detection box are considered a match if they satisfy two conditions: their overlapping region must include at least one PCC, and it should also cover an adequate GT area. Any candidates that do not satisfy these conditions are filtered out of the matching process.
3.1.1 Pseudo-Character Center (PCC)
We first need to know the location of the characters to identify whether a character region is covered by a detection box. However, most public datasets only provide word-level bounding box annotations. To handle this issue, we synthetically generate Pseudo-Character Center (PCC) points using the GT word box and its transcription.
Let $G = \{G_1, \dots, G_{|G|}\}$ be a set of GT boxes and $D = \{D_1, \dots, D_{|D|}\}$ be a set of detected boxes, where $|G|$ and $|D|$ denote the size of each set. Each GT box, $G_i$, contains a word with $l_i$ characters. As shown in Fig. 4, we compute the $k$-th PCC of $G_i$, denoted $c_i^k$, to obtain the positional information of the characters,
$$c_i^k = P_i^{L} + \frac{2k-1}{2\,l_i}\left(P_i^{R} - P_i^{L}\right), \quad k = 1, \dots, l_i, \tag{1}$$
where $P_i^{L}$ and $P_i^{R}$ indicate the midpoints of the left and right edges of $G_i$. The equation allows us to construct PCC points using both quadrilateral and polygon boxes. PCC point generation for polygon boxes is described in the Appendix.
The PCC points are generated under the assumption that the characters are evenly divided within a word box. However, constructed points may not be perfectly aligned with actual character positions because characters of different sizes coexist in the word image. Even with this ambiguity, our assumption works fairly well in most cases. Fig. 5 shows constructed PCC points on real datasets.
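A minimal sketch of PCC generation for a quadrilateral box follows. It assumes the four corners are ordered top-left, top-right, bottom-right, bottom-left, and that the $k$-th center sits in the middle of the $k$-th of $l$ equal segments between the edge midpoints; both the corner ordering and the function name `pseudo_character_centers` are assumptions made for illustration.

```python
def pseudo_character_centers(quad, num_chars):
    """Return evenly spaced pseudo-character centers for a word box.

    quad: four (x, y) corners, assumed ordered TL, TR, BR, BL.
    num_chars: number of characters in the GT transcription.
    """
    tl, tr, br, bl = quad
    left_mid = ((tl[0] + bl[0]) / 2, (tl[1] + bl[1]) / 2)
    right_mid = ((tr[0] + br[0]) / 2, (tr[1] + br[1]) / 2)
    centers = []
    for k in range(1, num_chars + 1):
        # Center of the k-th of num_chars equal segments (assumed spacing).
        t = (2 * k - 1) / (2 * num_chars)
        centers.append((left_mid[0] + t * (right_mid[0] - left_mid[0]),
                        left_mid[1] + t * (right_mid[1] - left_mid[1])))
    return centers


# An 80x20 box holding the four-character word "WALK".
print(pseudo_character_centers([(0, 0), (80, 0), (80, 20), (0, 20)], 4))
```

Because only the edge midpoints are used, the same interpolation works for rotated or slightly skewed quadrilaterals.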
3.1.2 Matching based on character inclusion
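The first criterion for finding a match between a GT box and a detection box is the inclusion of at least one PCC point. Here, we define the character inclusion candidate, $m_{i,j}^k$, between $G_i$ and $D_j$ as
$$m_{i,j}^k = \mathbb{I}\left(c_i^k \in D_j\right), \tag{2}$$
where $c_i^k$ is the $k$-th PCC of $G_i$ and $\mathbb{I}(A)$ is a conditional function that gives a value of 1 when $A$ is satisfied and 0 otherwise. If $\sum_{k} m_{i,j}^k \geq 1$, it is probable that $G_i$ and $D_j$ are matched since they share at least one PCC.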
The second matching criterion is the area of intersection between the detection box and the GT text region. Since a PCC is a single coordinate in the image, inclusion of a point does not guarantee good localization of the ground truth text region. To alleviate this ambiguity, detection boxes that cover only small GT box regions are filtered out. To this end, the area precision of $D_j$ is defined as follows:
$$AP_j = \frac{\mathrm{Area}\left(\bigcup_{i \,:\, \sum_k m_{i,j}^k \geq 1} \left(G_i \cap D_j\right)\right)}{\mathrm{Area}(D_j)}, \tag{3}$$
where the union condition, $\sum_k m_{i,j}^k \geq 1$, selects the set of GT boxes that have at least one PCC point matched with $D_j$. The matching process filters out candidates whose non-text region is larger than the text region. Therefore, the final box matching flags are defined by considering the character inclusion flag and its area precision as
$$M_{i,j} = \mathbb{I}\left(\sum_k m_{i,j}^k \geq 1 \;\wedge\; AP_j \geq 0.5\right). \tag{4}$$
To solve the granularity issue, we need to consider one-to-many and many-to-one cases. In our matching process, the area precision explicitly handles one-to-one and many-to-one cases by calculating the union of intersections between the matched GT boxes and $D_j$. A one-to-many match does not need separate processing, since each split detection box is matched with one GT box by checking the inclusion of at least one character.
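The two matching criteria can be sketched as follows. This is a simplified illustration, not the reference implementation: boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` rectangles, GT boxes are assumed not to overlap each other (so the union of intersections reduces to a sum), and the 0.5 cut-off is inferred from the rule that the non-text region must not exceed the text region. All function names are hypothetical.

```python
def contains(box, point):
    """True if the point lies inside the axis-aligned box."""
    x1, y1, x2, y2 = box
    return x1 <= point[0] <= x2 and y1 <= point[1] <= y2


def intersection_area(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def box_area(b):
    return (b[2] - b[0]) * (b[3] - b[1])


def match_boxes(gt_boxes, gt_pccs, det_boxes, min_area_precision=0.5):
    """Return flags[i][j] = 1 when GT box i and detection j are matched."""
    flags = [[0] * len(det_boxes) for _ in gt_boxes]
    for j, det in enumerate(det_boxes):
        # Criterion 1: the detection contains at least one PCC of the GT box.
        candidates = [i for i, pccs in enumerate(gt_pccs)
                      if any(contains(det, p) for p in pccs)]
        if not candidates or box_area(det) <= 0:
            continue
        # Criterion 2: area precision over all candidate GT boxes
        # (sum equals union here because GT boxes are assumed disjoint).
        covered = sum(intersection_area(gt_boxes[i], det) for i in candidates)
        if covered / box_area(det) >= min_area_precision:
            for i in candidates:
                flags[i][j] = 1
    return flags


gt_boxes = [(0, 0, 80, 20)]
gt_pccs = [[(10, 10), (30, 10), (50, 10), (70, 10)]]
det_boxes = [(0, 0, 40, 20), (60, 0, 200, 20)]  # second box is mostly background
print(match_boxes(gt_boxes, gt_pccs, det_boxes))
```

The second detection touches the last PCC but is rejected because most of its area lies outside any text region.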
3.1.3 Summarized matching statistics
Table 1 shows matching flags and statistics. Upper-case and lower-case symbols denote box-level and character-level instances, respectively. The subscript indicates the box index, and the superscript represents the character index. The statistics are written in the script font, and they represent row-wise and column-wise summations.
Matching statistics are summarized in Table 2. The values are obtained using the character inclusion flag $m_{i,j}^k$ and the box matching flag $M_{i,j}$. By aggregating the matching statistics, we can identify the total number of box matches and character inclusions. The first two statistics give the number of matched box candidates for each $G_i$ and $D_j$. The third gives the number of detection boxes matched on the $k$-th PCC point of $G_i$, and the fourth gives the number of PCC points covered by a detection box $D_j$. These statistics are finally used to calculate character scores and granularity penalties.
|number of detection boxes matched with $G_i$|
|number of GT boxes matched with $D_j$|
|number of detection boxes including the $k$-th character of $G_i$|
|number of characters in the GT boxes matched with $D_j$|
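The four statistics are row-wise and column-wise sums over the matching flags, which can be sketched as below. The nested-list layout (`char_flags[i][j][k]`, `box_flags[i][j]`), the masking of character flags by the box flag, and the dictionary keys are all assumptions made for this illustration.

```python
def matching_statistics(char_flags, box_flags, gt_lengths):
    """Aggregate matching flags into per-box and per-character counts.

    char_flags[i][j][k]: 1 if the k-th PCC of GT i lies inside detection j.
    box_flags[i][j]:     1 if GT i and detection j are matched.
    gt_lengths[i]:       number of characters in GT i.
    """
    num_gt = len(gt_lengths)
    num_det = len(box_flags[0]) if box_flags else 0
    return {
        # number of detection boxes matched with each GT box
        "dets_per_gt": [sum(box_flags[i]) for i in range(num_gt)],
        # number of GT boxes matched with each detection box
        "gts_per_det": [sum(box_flags[i][j] for i in range(num_gt))
                        for j in range(num_det)],
        # number of matched detection boxes including each GT character
        "dets_per_char": [[sum(box_flags[i][j] * char_flags[i][j][k]
                               for j in range(num_det))
                           for k in range(gt_lengths[i])]
                          for i in range(num_gt)],
        # number of GT characters covered by each detection box
        "chars_per_det": [sum(box_flags[i][j] * sum(char_flags[i][j])
                              for i in range(num_gt))
                          for j in range(num_det)],
    }


# "WALK" split into two detections that both cover the second character.
char_flags = [[[1, 1, 0, 0], [0, 1, 1, 1]]]
box_flags = [[1, 1]]
print(matching_statistics(char_flags, box_flags, [4]))
```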
3.2 Scoring process
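Once the matching candidates are obtained, we evaluate character-level correctness. Eq. 5 is the ground rule for calculating recall and precision:
$$\mathrm{Recall},\ \mathrm{Precision} = \frac{\mathrm{CorrectNum} - \mathrm{GranulPenalty}}{\mathrm{TotalNum}}. \tag{5}$$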
TotalNum represents the number of GT or detected characters, and CorrectNum denotes the number of correct characters. GranulPenalty is proportional to the split number of GT or detection boxes. Each attribute will be explained in the following subsections.
3.2.1 TotalNum: Total Number of Characters
TotalNum, the denominator in Eq. 5, indicates the number of target characters. When evaluating the recall of $G_i$, $\mathrm{TotalNum}^{G}_{i}$ is set to the GT text length, $l_i$. However, note that the text length is not available on the detection side when measuring detector accuracy, so the value of $\mathrm{TotalNum}^{D}_{j}$ differs depending on the availability of word transcriptions.
For end-to-end evaluation, the length of the predicted text in $D_j$ can be used as $\mathrm{TotalNum}^{D}_{j}$. However, for detection evaluation, the length of the predicted word transcription is unknown. In this case, we define $\mathrm{TotalNum}^{D}_{j}$ using the last statistic in Table 2, which counts the PCC points of all the matched GT boxes that are included in $D_j$.
3.2.2 CorrectNum: Correct Number of Characters
Correct Number for end-to-end evaluation Since word transcriptions are available when performing an end-to-end evaluation, the number of correct characters can be measured by finding a subsequence between the transcriptions of matched GTs and detection boxes. However, multiple matches could occur during the instance matching process, and thus, a character score could be calculated multiple times. To avoid this problem, we introduce Subsequence Elimination Scoring Process (SESP) that calculates each character score once and eliminates the matched subsequences in both GTs and predictions.
SESP is described in Algorithm 1. For each GT box, the set of matched detection boxes is collected and sorted according to the order of the included PCC points. Given the sorted detection boxes, their word transcriptions are concatenated to form a single word (recog_text). We then extract common_seq, which is the Longest Common Subsequence (LCS) [LCS-ED-Relation] between the GT transcription and recog_text. The length of common_seq is directly used as $\mathrm{CorrectNum}^{G}_{i}$. For each matched detection box, det_seq is extracted between common_seq and the transcription of that detection box, and the length of det_seq over all matched GTs is accumulated into $\mathrm{CorrectNum}^{D}_{j}$. Finally, det_seq is eliminated from both the detection transcription and common_seq. This elimination is required to avoid multiple matches between detection and GT transcriptions. Figure 6 shows an example of SESP on split and merge cases.
Note that unlike PopEval [lee2019popeval], which ignores the order of characters for scoring, we perform SESP after ordering detection boxes into the right sequence. Therefore, the order of the characters is taken into account during evaluation, and our metric is almost free from errors due to character permutations.
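The steps of SESP for a single GT box can be sketched as follows. Since Algorithm 1 is not reproduced here, this is only an interpretation of the prose description: detections are represented as hypothetical `(first_pcc_index, text)` pairs, and the elimination step removes det_seq from common_seq greedily. The helper names are assumptions.

```python
def lcs(a, b):
    """Return one longest common subsequence of a and b as a string."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a)):
        for j in range(len(b)):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    out, i, j = [], len(a), len(b)
    while i and j:  # backtrack through the table
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def remove_subsequence(text, sub):
    """Remove the characters of sub (a subsequence of text) from text."""
    out, pos = [], 0
    for ch in text:
        if pos < len(sub) and ch == sub[pos]:
            pos += 1
        else:
            out.append(ch)
    return "".join(out)


def sesp_single_gt(gt_text, matched_dets):
    """Score one GT word against its matched detections.

    matched_dets: list of (first_pcc_index, det_text) pairs.
    Returns (correct characters for the GT, correct characters per detection
    in sorted order).
    """
    ordered = sorted(matched_dets, key=lambda d: d[0])
    recog_text = "".join(text for _, text in ordered)
    common_seq = lcs(gt_text, recog_text)
    gt_correct = len(common_seq)
    det_correct = []
    for _, det_text in ordered:
        det_seq = lcs(common_seq, det_text)
        det_correct.append(len(det_seq))
        # Eliminate the credited characters so they are not counted twice.
        common_seq = remove_subsequence(common_seq, det_seq)
    return gt_correct, det_correct


print(sesp_single_gt("RIVERSIDE", [(5, "SIDE"), (0, "RIVER")]))
print(sesp_single_gt("WALK", [(0, "WA"), (2, "IK")]))
```

Sorting before concatenation is what lets the character order influence the score, in contrast to an order-agnostic elimination.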
Correct Number for detection evaluation
Word transcription is not available when evaluating detection results. Therefore, we use the number of included PCC points, since the detection accuracy is related to whether a detected box covers a character or not. The number of correct characters, CorrectNum, is defined from the detection statistics in Eq. 6.
$\mathrm{CorrectNum}^{G}_{i}$ indicates the number of PCC points of $G_i$ included within detection boxes, and $\mathrm{CorrectNum}^{D}_{j}$ represents the number of accumulated PCC points that $D_j$ covers within all the matched GT boxes. Additionally, the inclusion flag of each character is divided by its inclusion count to penalize overlapping cases. By doing this, only one of the matched characters is marked correct, which mirrors the subsequence elimination process used when measuring end-to-end results.
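A sketch of the detection-side counts is given below. It follows the prose description only: a GT character counts once if any matched detection includes it, and each including detection receives a share of $1/n$ when $n$ detections include the same character. The data layout (`char_flags[i][j][k]`, `box_flags[i][j]`) and the function name are assumptions.

```python
def detection_correct_nums(char_flags, box_flags):
    """Return (correct chars per GT box, correct chars per detection box).

    char_flags[i][j][k]: 1 if the k-th PCC of GT i lies inside detection j.
    box_flags[i][j]:     1 if GT i and detection j are matched.
    """
    num_gt = len(box_flags)
    num_det = len(box_flags[0]) if num_gt else 0
    gt_correct = [0.0] * num_gt
    det_correct = [0.0] * num_det
    for i in range(num_gt):
        matched = [j for j in range(num_det) if box_flags[i][j]]
        if not matched:
            continue
        num_chars = len(char_flags[i][matched[0]])
        for k in range(num_chars):
            covering = [j for j in matched if char_flags[i][j][k]]
            if covering:
                gt_correct[i] += 1
                for j in covering:
                    # Overlapping detections share the credit for one character.
                    det_correct[j] += 1 / len(covering)
    return gt_correct, det_correct


# Two detections over "WALK" that both include the second character.
print(detection_correct_nums([[[1, 1, 0, 0], [0, 1, 1, 1]]], [[1, 1]]))
```

The detection-side credits sum to the number of covered GT characters, so a duplicated character lowers precision without inflating recall.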
3.2.3 GranulPenalty: Granularity Penalty
The granularity indicates the connectivity condition between characters. We define GranulPenalty as a penalty representing how much the detection result loses the connectivity information.
From a GT perspective, the ideal condition is a match with a single detection box. Likewise, from a detection perspective, the ideal condition is a match with a single GT box. The granularity penalty is given in Eq. 7. As the number of boxes matched with $G_i$ or $D_j$ grows, the penalty increases proportionally.
This equation means that the loss of connectivity carries the same weight as the failure to detect a single character.
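As a hedged illustration of this weighting, the sketch below assumes the penalty equals the number of matched boxes minus one, so that each extra split or merge costs exactly one character. That form, the clamping at zero, and the function names are assumptions consistent with the description rather than a transcription of Eq. 7.

```python
def granularity_penalty(num_matches):
    """One character of penalty for every matched box beyond the first."""
    return max(0, num_matches - 1)


def instance_score(correct_num, num_matches, total_num):
    """(CorrectNum - GranulPenalty) / TotalNum for a single box instance."""
    if total_num == 0:
        return 0.0
    return max(0.0, correct_num - granularity_penalty(num_matches)) / total_num


# "RIVERSIDE" (9 characters), fully covered by 1, 2, or 3 detection boxes.
for splits in (1, 2, 3):
    print(splits, instance_score(9, splits, 9))
```

Under this assumption, a nine-character word split in two scores 8/9 on recall, the same as missing one character.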
3.2.4 Character Number of False Positive Detection
An appropriate penalty should be given to false positive (FP) detections, but for detection evaluation we cannot obtain the number of characters needed to penalize an FP detection explicitly. Therefore, the character length of the FP is estimated by assuming that the number of characters that fit into the box is proportional to its aspect ratio. As a result, the TotalNum for an FP is given through the aspect ratio as shown in Eq. 8a. For end-to-end evaluation, the character length of the recognized text in the detection box is given, so this value is applied to the TotalNum of the FP as shown in Eq. 8b.
Here, the minimum length and the maximum length refer to the lengths of the shorter and longer sides of the detection bounding box, respectively.
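The two cases can be sketched as follows. The exact form of Eq. 8a is not reproduced here, so rounding the long-to-short side ratio and enforcing a minimum of one character are assumptions, as is the function name `fp_total_num`.

```python
def fp_total_num(side_a, side_b, recognized_text=None):
    """Estimate the character count charged to a false positive detection.

    side_a, side_b: lengths of two adjacent sides of the detection box.
    recognized_text: predicted transcription, available only in end-to-end mode.
    """
    if recognized_text is not None:
        # End-to-end evaluation: the predicted text length is known.
        return len(recognized_text)
    short_side, long_side = min(side_a, side_b), max(side_a, side_b)
    if short_side <= 0:
        return 0
    # Detection evaluation: assume roughly square characters (assumption).
    return max(1, round(long_side / short_side))


print(fp_total_num(120, 30))           # detection-only estimate
print(fp_total_num(120, 30, "HELLO"))  # end-to-end: text length is used
```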
3.2.5 Scoring summary
Table 3 shows how scoring is processed on various issues. Basically, instance matching is processed on both detection and end-to-end evaluations. The scoring process, however, differs depending on the level of evaluation. When estimating detection performance, we take the information of inclusive PCC points, and when evaluating end-to-end results, we take the correct subsequence of recognized texts.
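In this way, the recall and precision of each box instance are obtained. Other evaluation metrics calculate the final score by averaging all recall and precision values. This is not the case for our evaluation, since the denominator needs to be the sum of character numbers. The final recall $R$ and precision $P$ are obtained by separately adding the numerator and denominator scores of each instance as
$$R = \frac{\sum_{i=1}^{|G|}\left(\mathrm{CorrectNum}^{G}_{i} - \mathrm{GranulPenalty}^{G}_{i}\right)}{\sum_{i=1}^{|G|}\mathrm{TotalNum}^{G}_{i}}, \qquad P = \frac{\sum_{j=1}^{|D|}\left(\mathrm{CorrectNum}^{D}_{j} - \mathrm{GranulPenalty}^{D}_{j}\right)}{\sum_{j=1}^{|D|}\mathrm{TotalNum}^{D}_{j}}. \tag{9}$$
Finally, the H-mean is calculated as usual using Eq. 10:
$$H = \frac{2\,R\,P}{R + P}. \tag{10}$$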
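The difference between summing per-instance numerators and denominators and averaging per-instance ratios can be seen in a small sketch. Each instance is represented here as a hypothetical `(correct_num, granul_penalty, total_num)` tuple; the function names are assumptions.

```python
def aggregate(instances):
    """Character-weighted score: sum of numerators over sum of denominators."""
    numerator = sum(correct - penalty for correct, penalty, _ in instances)
    denominator = sum(total for _, _, total in instances)
    return numerator / denominator if denominator else 0.0


def h_mean(recall, precision):
    """Harmonic mean of recall and precision."""
    total = recall + precision
    return 2 * recall * precision / total if total else 0.0


# GT side: "RIVERSIDE" split into two boxes (penalty 1) and "WALK" matched once.
gt_instances = [(9, 1, 9), (4, 0, 4)]
# Detection side: three boxes, each fully correct and matched to a single GT.
det_instances = [(5, 0, 5), (4, 0, 4), (4, 0, 4)]

recall = aggregate(gt_instances)        # 12 / 13
precision = aggregate(det_instances)    # 13 / 13
print(recall, precision, h_mean(recall, precision))
```

Averaging the two per-instance recalls instead would give (8/9 + 1) / 2, which over-weights the short word.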
Explicit recognition performance is also important for researchers developing recognition models. Using the attributes introduced in section 3.1, we measure the sole performance of the recognizer.
In order to solely evaluate recognition outputs, it is necessary to eliminate factors coming from detection outputs. One element that does not affect recognition performance is box granularity. We therefore remove granularity penalty when evaluating recognition performance. Additionally, unpaired prediction boxes should also be excluded. End-to-end performance assigns a penalty if a predicted word has no GT pair, and this is not fair since the error propagates from the detection performance. After eliminating the factors that disrupt fair recognition performance, we obtain Eq. 11 that expresses the Recognition Score (RS).
The equation measures recognition performance by dividing the number of correctly recognized characters by the total number of predicted characters matched with the GT instance.
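A minimal sketch of this ratio is shown below, under the assumption that each detection paired with a GT contributes a hypothetical `(correct_num, predicted_length)` tuple and that unpaired detections are simply left out of the list. No granularity penalty is applied, following the description above.

```python
def recognition_score(matched_dets):
    """Correct characters over predicted characters, matched detections only.

    matched_dets: list of (correct_num, predicted_length) for detections
    that have at least one GT pair.
    """
    correct = sum(c for c, _ in matched_dets)
    predicted = sum(n for _, n in matched_dets)
    return correct / predicted if predicted else 0.0


# "RIVER" read perfectly, "SIDE" read as "5IDE" (3 of 4 characters correct).
print(recognition_score([(5, 5), (3, 4)]))
```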
In this section, for ease of discussion, we analyze the tendency of our metric using toy examples constructed on the ICDAR2013 dataset. The evaluation of our metric on real detection and recognition outputs is provided in the appendix.
4.1 Toy-example experiments
To compare the characteristics of the evaluation metrics, a toy set is designed to reflect the granularity and correctness issues. To evaluate detection performance, nine cases were synthetically generated using the ICDAR2013 dataset [karatzas2013icdar]. The cases are categorized into three parts: crop, split, and overlap. To simulate recognition issues, we assume that the synthetic detection results have the same boxes as the GTs, and modify the text to cover insert, delete, and replace cases. Detailed toy examples are illustrated in Figure 7.
4.2 Detection evaluation
The detection evaluation result of the toy-example experiment is shown in Table 4. The DetEval and IoU metrics show typical problems encountered when using a threshold-based binary scoring policy. For example, DetEval uses an area recall threshold of 0.8. Any detection box that falls below this threshold is not considered a match, and therefore the H-mean values under a crop ratio of 80% are close to 0. A similar tendency is found in the IoU metric. The metric uses a threshold of 0.5, and thus the H-mean values under a crop ratio of 50% are close to 0. This indicates that the binary scoring process does not take into account boxes that do not meet the predefined threshold conditions.
As shown in Table 4, the DetEval and IoU metrics produce unreasonable values in many cases. DetEval scores in the split and overlap cases are almost identical. We expect the scores to be different, but we get the same results because a penalty of 0.8 is assigned in all cases. On the other hand, the IoU recall and precision values are strange in the split and overlap cases. This is because only one of the detection boxes is paired with the GT box. The recall value of the matched detection box is close to 1, and the precision value of the mismatched detection box is close to 0.5. Also, the DetEval and IoU scores remain the same regardless of the change in overlapping ratio.
While the DetEval and IoU metrics fail to cover acceptable detection results, the CLEval metric performs fine-grained evaluation of detection results. The calculated recall score in CLEval is proportional to the size of the cropped box region. For the overlapping case, a precision penalty relative to the size of the overlapping region is given. The recall scores in the three overlapping cases are almost the same. This is reasonable because every GT character is detected, and the number of detected duplicate characters decreases the precision score.
During the CLEval evaluation process, intermediate attributes directly related to CorrectNum, TotalNum, and GranulPenalty are extracted. These are the number of split and merge, frequency of missing and overlapping characters, and estimated character numbers in false positives. Table 5 shows a summary of extracted intermediate attributes on ICDAR2013 toy dataset. Each quantified attribute conveys a practical view to researchers and end-users. The information can be used by the researchers to further analyze and develop detection models.
|Case|DetEval R|DetEval P|DetEval H|IoU R|IoU P|IoU H|CLEval R|CLEval P|CLEval H|
|Split by 2|78.7|79.9|79.3|98.4|49.3|65.7|82.4|97.2|89.2|
|Split by 3|78.6|79.8|79.2|0.0|0.0|0.0|66.7|94.2|78.1|
|Split by 4|78.5|79.7|79.1|0.0|0.0|0.0|53.2|89.9|66.8|
|Case||Attributes from CLEval|
|Split by 2||1014||12||0||60||93|
|Split by 3||1014||16||0||61||276|
|Split by 4||1014||19||0||60||581|
4.3 End-to-end evaluation
In end-to-end evaluation, the strength of using CLEval metric is much more apparent. We insert, delete, and replace characters in GT transcriptions to form end-to-end test samples. When using IoU+CRW metric, all H-mean values become 0 since CRW fails to evaluate partially recognized texts. On the other hand, CLEval metric assigns partial scores according to the conditions. In the case of insertion, recall value becomes 1 because predicted transcription contains all GT characters. In the case of deletion, precision value becomes 1 because recognized texts are marked all correct. In the case of replacement, the score is affected by the penalty added to the character that is incorrectly recognized.
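The three cases can be reproduced on a single word with a short sketch. It assumes one GT box matched with one detection box (so no granularity penalty applies) and uses the LCS length as the number of correct characters; the example strings and function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def char_recall_precision(gt_text, pred_text):
    """Character-level recall and precision for a one-to-one match."""
    correct = lcs_length(gt_text, pred_text)
    recall = correct / len(gt_text) if gt_text else 0.0
    precision = correct / len(pred_text) if pred_text else 0.0
    return recall, precision


for case, pred in [("insert", "WALKS"), ("delete", "WAK"), ("replace", "WELK")]:
    print(case, char_recall_precision("WALK", pred))
```

An exact-match criterion would score all three predictions as zero, whereas the character-level score separates them.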
While performing CLEval end-to-end evaluation, we can also obtain Recognition Score (RS). The RS value is obtained regardless of the detection result by mathematically removing the detection-related terms. A detailed description of RS is provided in the appendix with actual examples.
|IoU + CRW||CLEval|
Fair and detailed evaluation of OCR models is needed, and yet no robust evaluation metric has been proposed in the OCR community. The proposed CLEval metric can evaluate text detection, recognition, and end-to-end results. This is done by solving the granularity and correctness issues through instance matching and character scoring processes. Our metric allows fine-grained assessment and alleviates qualitative disagreement. We expect researchers and end-users to take advantage of the metric to conduct thorough end-to-end evaluation.
Appendix A Toy-example experiment on ICDAR2015.
To show the stability of our metric, we additionally performed experiments on the ICDAR2015 dataset. For detection evaluation, the toy set was produced in the same way as for the ICDAR2013 dataset, and for end-to-end evaluation we constructed another set based on the detection toy examples. Note that we evaluated the end-to-end result using the best model reported in [baek2019wrong]. The results of the experiment and their attributes from CLEval are shown in Tables 7 and 8, and the line graphs in Figure 8 show results obtained under different conditions.
The results on the ICDAR2015 dataset show a similar tendency to the evaluation results on the ICDAR2013 dataset. Because the IoU metric uses a threshold of 0.5, it assigns a zero score to boxes cropped to less than 50 percent of their area. The zero scores in the split cases are caused by the absence of handling for granularity issues. One-to-many and many-to-one matches occur frequently, but the IoU metric only considers one-to-one matches. Multiple box predictions could cover a single ground truth box, but zero scores are given if the overlapping region does not meet the predefined threshold.
The same holds for the IoU+CRW metric in end-to-end evaluation. Using a predefined threshold, a one-to-one match is first made to select valid box candidates, then CRW is performed to identify matching transcripts. In the transcript matching process, CRW requires the ground truth and the predicted text to match perfectly. Otherwise, a zero score is assigned to the matched box candidates. For this reason, a meaningful comparison was difficult with IoU+CRW.
The proposed metric provides stable scores under various cases by performing evaluations at the character level. Table 8 shows the recall and precision scores of partially correct detections.
|Case||Attributes from CLEval|
|Split by 2||1990||18||0||38||88|
|Split by 3||1988||19||0||36||178|
|Split by 4||1994||21||0||35||633|
|Case|DetEval R|DetEval P|DetEval H|IoU R|IoU P|IoU H|CLEval (Det) R|CLEval (Det) P|CLEval (Det) H|IoU+CRW R|IoU+CRW P|IoU+CRW H|CLEval (E2E) R|CLEval (E2E) P|CLEval (E2E) H|
|Split by 2|69.8|74.1|71.9|97.9|49.0|65.3|81.9|98.7|89.5|0.1|0.1|0.1|63.4|76.6|69.4|
|Split by 3|65.9|70.3|68.0|0.0|0.0|0.0|64.0|97.9|77.4|0.0|0.0|0.0|38.4|62.3|47.5|
|Split by 4|65.2|69.3|67.2|0.0|0.0|0.0|49.2|94.1|64.6|0.0|0.0|0.0|15.2|48.1|23.1|
Appendix B Evaluation of text detectors
In this section, we compare CLEval with other commonly used evaluation metrics using state-of-the-art text detectors. We requested the authors of various scene text detectors to provide their test results on public datasets and organized the results in Tables 9 and 10 for ICDAR2013 and ICDAR2015, respectively.
The strength of the CLEval metric lies in its use of additional instance-level and character-level information to calculate recall, precision, and H-mean values. As shown in Tables 9 and 10, even without knowing the recall, precision, and H-mean values, we can examine the quality of the detection models by observing the attributes produced by the CLEval metric.
Appendix C Evaluation on real end-to-end results
In this experiment, we take a close look at the end-to-end performance of various detector and recognizer combinations. We used well-known detectors such as CRAFT [baek2019craft], EAST [zhou2017east], RRPN [ma2017rrpn], PixelLink [deng2018pixellink], and TextBoxes++ [liao2018textboxes++]. We recognized the texts from those detectors with three types of recognizers provided in [baek2019wrong]. CLEval results are listed in Table 11. High indicates the recognizer with TPS+ResNet+BiLSTM+Attn modules, Mid indicates the recognizer with None+VGG+BiLSTM+CTC modules, and Low indicates the recognizer with None+VGG+None+CTC modules. We observe that the RS scores within each of the High, Mid, and Low recognizer combinations are similar. This suggests that RS can be used to evaluate recognition performance regardless of the detection module.
|Detector||CLEval Det||Recognizer||CLEval E2E||E2E Rec|
Appendix D PCC generation in polygon annotation
Most of the text bounding boxes in public datasets are represented using four quadrilateral points. However, there exist polygon-type datasets that use multiple vertices to tightly bound the text regions. For polygon datasets, we can acquire the center information by splitting the polygon into sub-groups of quadrilaterals. Algorithm 2 describes the detailed procedure for generating PCCs in polygon-type datasets. By extending PCC generation to polygon datasets, CLEval can be used to evaluate a variety of datasets represented by both rectangles and polygons.