Log In Sign Up

CLEval: Character-Level Evaluation for Text Detection and Recognition Tasks

by   Youngmin Baek, et al.

Despite the recent success of text detection and recognition methods, existing evaluation metrics fail to provide a fair and reliable comparison among those methods. In addition, there exists no end-to-end evaluation metric that takes characteristics of OCR tasks into account. Previous end-to-end metric contains cascaded errors from the binary scoring process applied in both detection and recognition tasks. Ignoring partially correct results raises a gap between quantitative and qualitative analysis, and prevents fine-grained assessment. Based on the fact that character is a key element of text, we hereby propose a Character-Level Evaluation metric (CLEval). In CLEval, the instance matching process handles split and merge detection cases, and the scoring process conducts character-level evaluation. By aggregating character-level scores, the CLEval metric provides a fine-grained evaluation of end-to-end results composed of the detection and recognition as well as individual evaluations for each module from the end-performance perspective. We believe that our metrics can play a key role in developing and analyzing state-of-the-art text detection and recognition methods. The evaluation code is publicly available at


page 1

page 4


TedEval: A Fair Evaluation Metric for Scene Text Detectors

Despite the recent success of scene text detection methods, common evalu...

PopEval: A Character-Level Approach to End-To-End Evaluation Compatible with Word-Level Benchmark Dataset

The most prevalent scope of interest for OCR applications used to be sca...

Tightness-aware Evaluation Protocol for Scene Text Detection

Evaluation protocols play key role in the developmental progress of text...

Negation-Instance Based Evaluation of End-to-End Negation Resolution

In this paper, we revisit the task of negation resolution, which include...

End-to-End Page-Level Assessment of Handwritten Text Recognition

The evaluation of Handwritten Text Recognition (HTR) systems has traditi...

Setting the Yardstick: A Quantitative Metric for Effectively Measuring Tactile Internet

The next frontier in communications is the telemanipulation of remote en...

SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization

In this paper, we present a conceptually simple while empirically powerf...

1 Introduction

Along with the progress in the field of machine learning, the performances of text detectors and recognizers have remarkably improved over the past few years 

[tian2016ctpn, zhou2017east, shi2017seglink, baek2019craft, baek2019wrong, long2018scene]. However, existing detection and recognition evaluation metrics fail to provide a fair and reliable comparison among those methods. Especially when evaluating end-to-end models, errors are aggregated due to unconvincing measurements in each part. Fig. 1 illustrates common problems encountered in end-to-end detection and recognition modules. The previously used instance-level binary scoring process assigns a value of 0 on both decent(left figures) and wrong(right figures) results.

(a) Scores on problematic detection examples
(b) Scores on problematic recognition examples
Figure 1: Comparison of CLEval with previous end-to-end metric[karatzas2015icdar]. Figures in the left column are slightly wrong cases while figures in the right are completely wrong cases. Red: GT. Blue: detection. Texts indicate the recognized results of the detection box, and texts in red are incorrect.
(a) Split case
(b) Merged case
(c) Missing characters
(d) Overlapping characters
Figure 2: Examples of issues for the fair evaluation of text detection and recognition tasks. (a) and (b) correspond to the issue of granularity, (c) and (d) are related to the issue of correctness.

To better understand where established text detection and recognition evaluation metrics fail, a closer look into the intrinsic nature of texts is required. At a fundamental level, text consists of words, which can further be decomposed into an array of characters. This character array embodies two intrinsic characteristics: its sequential nature and its content. Text detection’s goal of locating words can then be reinterpreted as finding the area that encapsulates the right sequence and content in a group of characters. The degree to which the right sequence and content are recognized within the detection area denotes the granularity and correctness issues, respectively. A more specific explanation of these attributes follows – with examples of what they measure in Figure 2.

Granularity is the degree to which the text detection model captures the sequence of characters in exactly one word as one unbroken sequence. Split detection results (Figure 2(a)) break the character sequence in the word and merged detection results (Figure 2(b)) fail to capture exactly one word. Both cases incur penalties proportionate to the number of splits or merges per word.

Correctness is the degree to which the text detection and recognition model captures the content of word. Specifically each character in that word must be detected and recognized exactly once with its right order. A penalty is incurred proportionate to the number of missing or overlapping characters in both detection and recognition results (Figure 2(c, d)).

Majority of the public datasets provide a word-level annotation since each word contains a semantic meaning. However, as Fig. 1 shows, evaluating word-level boxes with a predefined threshold provokes various issues. Despite having appropriate box predictions, the binary scoring process discards acceptable prediction results and produces unexplainable scores. To provide more detailed interpretation of the models, recent studies have adopted a character-level evaluation process [lee2019tedeval, lee2019popeval]. Inspired by them, our proposed metric, named as CLEval (Character-Level Evaluation), is designed to perform end-to-end evaluations without explicit character annotations. The method adopts two key components; instance matching process and character scoring process. The instance matching process solves granularity issues by pairing all possible GT and detection boxes that share at least one character, and the character-level scoring process solves correctness issues by calculating the longest common subsequence between GT and predicted transcriptions.

The CLEval metric is primarily designed to evaluate end-to-end tasks, but it can also be applied to individual detection and recognition modules. Interpretation of each module is valuable since it allows us to discover how each component affects overall performance. Assuming that the characters are evenly placed within a word box, the detection evaluation is conducted using pseudo-character center positions. This method was first proposed by [lee2019tedeval], but we further developed the idea to handle end-to-end models based on the same scoring policy. The proposed metric also provides quantitative indicators of recognition modules by measuring correctly recognized words within detection boxes.

The main contributions of this paper can be summarized as follows. 1) We propose a character-level evaluation metric that is favorable qualitative view without character-level annotations. 2) We define granularity and correctness issues, and solve them by performing instance matching and character scoring processes. 3) We propose a unified protocol that could evaluate not only end-to-end tasks, but also individual detection and recognition modules.

Figure 3: Overall evaluation process of IoU+CRW and CLEval metrics. In IoU+CRW, only two bounding boxes are matched through 0.5 IoU threshold, and partially detected texts are discarded in CRW scoring process. In CLEval, all detection boxes are matched, and correct numbers of characters are reflected to the final evaluation.

2 Related works

2.1 Detection evaluation

Intersection-over-Union (IoU) IoU metric originally comes from object detection task such as Pascal VOC [everingham2015pascal]. IoU accepts detections that match the ground truth (GT) box in an exclusive one-to-one manner only when the overlapping region satisfy the predefined threshold. Although IoU is the most widely used evaluation metric thanks to its simplicity, its behavior is clearly not suitable for evaluating texts as argued by [calarasanu2016good, nguyenstate]. IoU cannot handle granularity and correctness issues, which is critical for OCR tasks.

DetEval DetEval [wolf2013deteval] was designed to solve the granularity issue by allowing multiple relationships of a single bounding box. Their matching processes are conducted by accepting one-to-one, one-to-many, and many-to-one relationships. However, each instance is evaluated based on both area recall and area precision thresholds. Area-based threshold not only causes correctness issues, but also has a limitation when trying to apply end-to-end evaluation.

Tightness-aware IoU (TIoU) Liu et al. recently suggested the TIoU metric that penalizes based on the occupation ratio between detection and ground truth. By doing that, TIoU tried to give the high score to more similar detection compared to the box of ground truth. The major weakness is that TIoU penalizes slight differences between the ground truth and detection, even if the recognition results of those detection boxes are same. This is far from the end user perspective, and it is unfair if the correct box got a different score under the TIoU metric due to the small perturbation of box size.

TedEval A character-level evaluation metric for text detection has been proposed by [lee2019tedeval]. The metric alleviates qualitative disagreements, but can only be used to evaluate text detection modules. We adopt the idea of using pseudo-characters, and apply a character-level evaluation process to evaluate end-to-end results.

2.2 Recognition evaluation

Correctly Recognized Words (CRW) As its term suggests, the CRW is a binary score metric by judging correct answers when the transcription and the recognition result are exactly same. This method has a fundamental limit of a binary score system that fails to give different scores to an absurd recognition result and an almost accurate result.

Edit Distance [levenshtein1966binary] The Edit Distance (ED) method is a common algorithm used to quantify how dissimilar two given strings are. The ED of two strings is the minimum operation required to transform one string into another. In the most standard Levenshtein distance calculation, such an operation involves insertion, deletion, and substitution. Utilizing ED to evaluate scene text recognition model is considered reasonable in that the score reflects how well the model is performing with distance measures. Longest Common Subsequence (LCS) is literally the longest common subsequence in a set of sequences, and is a specialized case of ED which only uses insertion and deletion operations [LCS-ED-Relation].

2.3 End-to-end evaluation

IoU and CRW The IoU and CRW are strictly a cascaded evaluation metric. The detection stage filters out detection results whose IoU with the corresponding GT is below the threshold. Matches with IoU are judged by the CRW. Both metrics in each of the stages are reported to have hindrances for fine-grained assessment due to the binary chain scoring.

PopEval PopEval was proposed to make full use of text information from recognition results. Their character elimination process is simple yet good enough from a practical point of view. However, PopEval does not provide detection evaluation. We adopted the idea of character elimination, and enhanced by using a substring elimination scheme to mitigate the problem of ignoring the order of texts.

3 Methodology

Fig. 3 shows comparison of our method with the IoU + CRW metric. Detection boxes in the word “RIVERSIDE” show the granularity issue, and boxes in the word “WALK” show the correctness issue. While previous metrics fail to accept decent prediction results, our metric successfully quantifies various conditions through the matching process and the scoring process. The matching process identifies instance-level pairs between GT and detected boxes, and the scoring process provides a final score by analyzing extracted statistics.

3.1 Matching process

In this section, the matching process is explained in detail. First, to overcome the absence of character annotations, we adopt the idea from [lee2019tedeval] and calculate Pseudo-Character Center(PCC) positions. GT and detection boxes are considered a match if they satisfy two conditions. Their overlapping regions between GT and detection must share at least one PCC in common and should also cover an adequate GT area. Any candidates that do not satisfy the conditions are filtered out from the matching process.

3.1.1 Pseudo-Character Center (PCC)

We first need to know the location of the characters to identify whether a character region is covered by a detection box. However, most of the public datasets only provide word-level bounding box annotations. To handle this issue, we synthetically generate Pseudo-Character Center(PCC) points using GT word box and transcription.

Figure 4: An example of computing PCC of . Green dot: PCC. Red dash: pseudo character box. Grey: .

Let be a set of GT boxes and be a set of detected boxes where and denote the size of each set. Each GT box, , contains a word with characters. As shown in Fig.  4, we compute the -th PCC of to obtain the positional information of the characters,


where and indicate midpoints located on the left and right edges of . The equation allows us to construct PCC points using both quadrilateral and polygon boxes. PCC point generation for polygon boxes is described in the Appendix.

The PCC points are generated under the assumption that the characters are evenly divided within a word box. However, constructed points may not be perfectly aligned with actual character positions because characters of different sizes coexist in the word image. Even with this ambiguity, our assumption works fairly well in most cases. Fig.  5 shows constructed PCC points on real datasets.

3.1.2 Matching based on character inclusion

The first criterion for finding a match between GT and detection box is the inclusion of at least one PCC point. Here, we define character inclusion candidate, , between and as


where is a conditional function that gives a value of 1 when A is satisfied and 0 otherwise. If

, it is probable that

and is matched since they share at least one PCC in common.

The second matching criterion is the area of intersection between detection box and GT text region. Since PCC is a single coordinate in the image, inclusion of a point does not guarantee good localization of the ground truth text region. In order to alleviate this ambiguity, detection boxes that cover small GT box regions are filtered out. To this end, the area precision of is defined as follow


where the union condition, , indicates a set of GT boxes that contains at least one of its PCC points matched with the . The matching process filters out candidates whose non-text region is larger than the text region. Therefore, the final box matching flags are defined by considering the character inclusion flag and its area precision as


To solve the granularity issue, we need to consider one-to-many and many-to-one cases. In our matching process, explicitly handles one-to-one and many-to-one cases by calculating the union of intersections between matched s and . In our metric, one-to-many match does not need to be processed since each split detection box is matched with one GT box by checking the inclusion of at least one character.

(a) ICDAR2013[karatzas2013icdar]
(b) ICDAR2015[karatzas2015icdar]
(c) TotalText[ch2017total]
Figure 5: Visualization of PCC points on different ground-truth annotations.

3.1.3 Summarized matching statistics

Table. 1 shows matching flags and statistics. The upper and lower cases mean box-level and the character-level instances, respectively. The subscript indicates the box index, and the superscript represents the character index. The statistics are written in the script font, and they represent row wise and column wise summations.

Stat. Recall
Table 1: Table of box-level and character-level matching flags and their statistics.

Matching statistics are summarized in Table 2. The values are obtained using the character inclusion flag and the box matching flag . By aggregating matching statistics, we could identify total number of box matches and character inclusions. The value of and show the number of matched box candidates on each and . The number denotes matched detection boxes on -th PCC point of , and shows the number of PCC points covered by a detection box . These statistics are finally used to calculate character scores and granularity penalties.

Stat. Eq. Description
number of matched with
number of matched with
number of including -th character of
number of characters in s matched with
Table 2: Description of matching statistics.

3.2 Scoring process

Once the matching candidates are obtained, we now evaluate character-level correctness. Eq. 5 is the ground rule to calculate recall and precision.


TotalNum represents the number of GT or detected characters, and CorrectNum denotes the number of correct characters. GranulPenalty is proportional to the split number of GT or detection boxes. Each attribute will be explained in the following subsections.

3.2.1 TotalNum: Total Number of Characters

TotalNum, the denominator in Eq.  5, indicates the number of target characters. When evaluating the recall of , is set to the GT text length, . However, note that the value of text length is absent when measuring detector accuracy. The value of differs depending on the availability of word transcriptions.

For end-to-end evaluation, , which is the length of the predicted text in , can be used to represent . However, for detection evaluation, the length of predicted word transcription is unknown. In this case, we define using in Table. 2, where it defines the number of included PCC points of all the matched GT boxes.

Evaluation Detection End-to-end
Score Recall Precision Recall Precision






Scoring policy


Table 3: Comprehensive examples of scoring for issues in text evaluations. Note that the scores are expressed as fractions rather than reduction because each denominator has the important meaning; that is character length. indicates the separated summation of each nominator and denominator.

3.2.2 CorrectNum: Correct Number of Characters

Correct Number for end-to-end evaluation Since word transcriptions are available when performing an end-to-end evaluation, the number of correct characters can be measured by finding a subsequence between the transcriptions of matched GTs and detection boxes. However, multiple matches could occur during the instance matching process, and thus, a character score could be calculated multiple times. To avoid this problem, we introduce Subsequence Elimination Scoring Process (SESP) that calculates each character score once and eliminates the matched subsequences in both GTs and predictions.

SESP is described in Algorithm 1. For each GT box, a set of matched detection boxes are collected and sorted according to the order of included PCC points. Given sorted detection boxes, word transcriptions are assembled together to form a single word(recog_text). We then extract common_seq, which is the Longest Common Sequence(LCS)[LCS-ED-Relation] between GT and recog_text. The length of common_seq is directly used as . For each matched detection box, det_seq is extracted between the common_seq and , and the length of det_seq within all matched GTs is accumulated to . Finally, the det_seq gets eliminated in both and common_seq. This elimination process is required to avoid multiple matches between detection and GT transcriptions. Figure 6 shows an example of SESP on split and merge cases.

Figure 6: Examples of SESP on different matched cases: split case on left, merge case on right.

Note that unlike the PopEval[lee2019popeval], which ignores the order of characters for scoring, we perform SESP after ordering detection boxes into the right sequence. Therefore, when performing evaluation, the order of the characters are taken into account. Our metric is almost free from having errors due to character permutations.

1 for  in
2       a set of the matched s with
3       recog_text a serialized text from in
4                                         in the order of the matched PCCs
5       common_seq the Longest Common Subsequence(LCS)
6                                               between (, recog_text )
7       len(common_seq)
8       for  in
9             det_seq the characters from in common_seq
10             len(det_seq)
11             det_seq
12             common_seq det_seq
14       end
16 end
Algorithm 1 Subsequence Elimination Scoring Process (SESP)

Correct Number for detection evaluation

Word transcription is not available when evaluating detection results. Therefore, we utilize the number of PCC inclusion since the detection accuracy is related to whether a detected box covers a character or not. The number of correct characters, CorrectNum, is defined using detection statistics as follows;


indicates the number of included PCC points of within detection boxes, and represents the number of accumulated PCC points of within all the matched s. Additionally, of each character is divided by the inclusion counts to penalize overlapping cases. By doing this, only one of the matched characters is marked correct, and this can be seen from the same perspective of the subsequence elimination process when measuring end-to-end results.

3.2.3 GranulPenalty: Granularity Penalty

The granularity indicates the connectivity condition between characters. We define GranulPenalty as a penalty representing how much the detection result loses the connectivity information.

From a GT perspective, the most ideal condition is formed when a single detection box is matched(). Likewise, from a detection perspective, the most ideal condition is formed when a single GT box is matched(). The granularity penalty equation is shown in Eq. 7. As number of and grows, penalty increases proportionally.


This equation means that the weight of the loss of connectivity is same as the failure to detect a single character.

3.2.4 Character Number of False Positive Detection

An appropriate penalty should be given to false positive(FP) detection, but we can’t get the number of characters to penalize FP detection explicitly for detection evaluation. Therefore, the character length of the FP is estimated by assuming that the number of characters is proportional to the aspect ratio to fit into the box. As a result, the

TotalNum for FP is given through aspect ratio as shown in Eq. 8a. For end-to-end evaluation, the character length of recognized text in the detection box is given, so this value is applied to the TotalNum of FP as shown in Eq. 8b.


where indicates the minimum length of bounding box of the detection, and indicates the maximum length of it.

3.2.5 Scoring summary

Table  3 shows how scoring is processed on various issues. Basically, instance matching is processed on both detection and end-to-end evaluations. The scoring process, however, differs depending on the level of evaluation. When estimating detection performance, we take the information of inclusive PCC points, and when evaluating end-to-end results, we take the correct subsequence of recognized texts.

In this way, recall and precision of each box instance are obtained. Other evaluation metrics calculate the final score by taking the average of all recall and precision values. This is not the case of our evaluation since the denominator needs to be the sum of character numbers. The final recall and precision values are obtained by separately adding numerator and denominator scores of each instance as


Finally, H-Mean is calculated using Eq. 10 as usual.


Explicit recognition performance is also important for researchers developing recognition models. Using the attributes introduced in section 3.1, we measure the sole performance of the recognizer.

In order to solely evaluate recognition outputs, it is necessary to eliminate factors coming from detection outputs. One element that does not affect recognition performance is box granularity. We therefore remove granularity penalty when evaluating recognition performance. Additionally, unpaired prediction boxes should also be excluded. End-to-end performance assigns a penalty if a predicted word has no GT pair, and this is not fair since the error propagates from the detection performance. After eliminating the factors that disrupt fair recognition performance, we obtain Eq. 11 that expresses the Recognition Score (RS).


The equation measures recognition performance by dividing the number of correctly recognized characters by the total number of predicted characters matched with the GT instance.

4 Experiments

In this section, for ease of discussion, we analyze the tendency of our metric using the toy-examples constructed on ICDAR2013 dataset. The evaluation of our metric on real detection and recognition outputs are provided in the appendix.

4.1 Toy-example experiments

To compare the characteristics of the evaluation metric, a toyset is designed to reflect the granularity and correctness issues. To evaluate detection performance, nine cases were synthetically generated using ICDAR2013 dataset[karatzas2013icdar]. The cases are categorized into three parts; crop, split and overlap. To simulate recognition issues, we expect that the synthetic detection results have the same box as the GTs, and modify the text to cover insert, delete, and replace cases. Detailed toy-examples are illustrated in Figure 7.

Figure 7: Detailed cases for toy-example experiment. First row shows crop cases, second row shows split cases, and third row shows overlap cases. Recognition cases such as insert, delete, and replace are in the last row. The number of insertion, deletion and replacement is increased from 1 to 3.

4.2 Detection evaluation

The detection evaluation result of the toy-example experiment is shown in Table 4. DetEval and IoU metrics show typical problems encountered when using a threshold based binary scoring policy. For example, DetEval uses an area recall threshold value of 0.8. Any detection boxes outside this threshold are not considered a match, and therefore, the H-mean values under crop ratio 80% gets a value close to 0. Similar tendency is also found in the IoU metric. The metric uses a threshold value of 0.5, and thus, the H-mean value under crop ratio 50% gets a value close to 0. This indicates that the binary scoring process does not take into account boxes that do not meet predefined threshold conditions.

As shown in Table.  4, DetEval and IoU metric produces unreasonable values in many cases. DetEval scores in split and overlap cases are almost identical. We expect the scores to be different, but we get the same results because a penalty of 0.8 is assigned in all cases. On the other hand, the IoU recall precision values are strange in split and overlap cases. This is because only one of the detection boxes is paired with the GT box. Recall value of the matched detection box is close to 1 and the precision value of the mismatched detection box is close to 0.5. Also, the DetEval and IoU scores remain the same regardless of the change in overlapping ratio.

While DetEval and IoU metrics fail to cover acceptable detection results, CLEval metric performs fine-grained evaluation on detection results. The calculated recall score in CLEval is proportional to the size of the cropped box region. For the overlapping case, a precision penalty relative to the size of the overlapping region is given. The recall scores in three overlapping cases are almost the same. This is reasonable because every GT character is detected, and the number of detected duplicate characters decrease the precision score.

During the CLEval evaluation process, intermediate attributes directly related to CorrectNum, TotalNum, and GranulPenalty are extracted. These are the number of split and merge, frequency of missing and overlapping characters, and estimated character numbers in false positives. Table  5 shows a summary of extracted intermediate attributes on ICDAR2013 toy dataset. Each quantified attribute conveys a practical view to researchers and end-users. The information can be used by the researchers to further analyze and develop detection models.

Case Detection Metrics
DetEval IoU CLEval
Original 99.7 99.9 99.8 99.8 100 99.9 99.7 98.7 99.2
Crop 80% 97.1 97.3 97.2 99.8 100 99.9 82.7 98.7 90.0
Crop 60% 0.3 0.5 0.4 99.7 99.9 99.8 60.7 98.9 75.2
Crop 40% 0.1 0.1 0.1 0.0 0.0 0.0 40.1 95.2 56.4
Split by 2 78.7 79.9 79.3 98.4 49.3 65.7 82.4 97.2 89.2
Split by 3 78.6 79.8 79.2 0.0 0.0 0.0 66.7 94.2 78.1
Split by 4 78.5 79.7 79.1 0.0 0.0 0.0 53.2 89.9 66.8
Overlap 10% 78.9 79.8 79.4 99.8 50.0 66.6 81.1 88.7 84.7
Overlap 20% 79.0 79.9 79.4 99.8 50.0 66.6 80.9 82.0 81.5
Overlap 30% 79.0 79.8 79.4 99.8 50.0 66.6 80.9 74.0 77.3
Table 4: Comparison of detection evaluation metrics on toy-example from ICDAR2013 dataset. Some scores are highlighted: Red above 95, Blue below 5.
Case Attibutes from CLEval
Split Merge Miss Overlap FP
Original 15 9 0 61 0
Crop 80% 15 9 996 50 0
Crop 60% 12 8 2292 27 0
Crop 40% 9 7 3499 12 96
Split by 2 1014 12 0 60 93
Split by 3 1014 16 0 61 276
Split by 4 1014 19 0 60 581
Overlap 10% 1085 15 0 710 10
Overlap 20% 1093 15 0 1252 2
Overlap 30% 1093 15 0 2020 2
Table 5: Detection attributes from CLEval on ICDAR2013 dataset.

4.3 End-to-end evaluation

In end-to-end evaluation, the strength of using CLEval metric is much more apparent. We insert, delete, and replace characters in GT transcriptions to form end-to-end test samples. When using IoU+CRW metric, all H-mean values become 0 since CRW fails to evaluate partially recognized texts. On the other hand, CLEval metric assigns partial scores according to the conditions. In the case of insertion, recall value becomes 1 because predicted transcription contains all GT characters. In the case of deletion, precision value becomes 1 because recognized texts are marked all correct. In the case of replacement, the score is affected by the penalty added to the character that is incorrectly recognized.

While performing CLEval end-to-end evaluation, we can also obtain Recognition Score (RS). The RS value is obtained regardless of the detection result by mathematically removing the detection-related terms. A detailed description of RS is provided in the appendix with actual examples.

Case End-to-end Metrics
IoU + CRW CLEval
Original 99.6 99.8 99.7 99.7 99.7 99.7 98.9
Insert 1 0.0 0.0 0.0 99.7 84.0 91.2 83.6
Insert 2 0.0 0.0 0.0 99.7 73.1 84.4 73.1
Insert 3 0.0 0.0 0.0 99.7 65.7 79.2 65.6
Delete 1 0.0 0.0 0.0 81.0 99.7 89.4 80.4
Delete 2 0.0 0.0 0.0 63.7 99.6 77.7 63.3
Delete 3 0.0 0.0 0.0 48.0 99.5 64.8 47.8
Replace 1 0.0 0.0 0.0 81.0 81.0 81.0 80.4
Replace 2 0.0 0.0 0.0 64.1 64.2 64.1 63.7
Replace 3 0.0 0.0 0.0 49.9 49.9 49.9 49.7
Table 6: Comparison of end-to-end metrics on toy-example. Some scores are highlighted: Red above 95, Blue below 5.

5 Conclusion

Fair and detailed evaluation of OCR models is needed, and yet, no robust evaluation metric was proposed in the OCR community. The proposed CLEval metric could evaluate text detection, recognition, and end-to-end results. This is done by solving the granularity and correctness issues by performing instance matching and character scoring process. Our metric allows fine assessment, and alleviates qualitative disagreement. We expect researchers and end-users to take advantage of the metric to conduct thorough end-to-end evaluation.


Appendix A Toy-example experiment on ICDAR2015.

To show the stability of our metric, we additionally performed experiments on ICDAR2015 dataset. For detection evaluation, toy-set was produced in the same way as it was made using the ICDAR2013 dataset, and for end-to-end evaluation we constructed another set based on detection toy-examples. Note that we evaluated the end-to-end result using the best model reported in [baek2019wrong]. The result of the experiment and their attributes from CLEval are shown in table 7,8, and the line graphs in Figure 8 show results performed under different conditions.

The results on ICDAR2015 dataset show similar tendency when compared with the evalution results on ICDAR2013 dataset. Due to the trait of the IoU metric using a threshold value of 0.5, the metric assigns zero score to the cropped area less than 50 percent. Also, the zero scores on split cases are caused by the absence of handling granularity issues. One-to-many or many-to-one match cases frequently occur, but the IoU metric only considers one-to-one matching cases. Multiple box predictions could cover a single ground truth box, but zero scores are given if the overlapping region does not meet a predefined threshold.

The same holds for the IoU+CRW metric on end-to-end evaluation. Using a predefined threshold, a one-to-one match is first made to filter out valid box candidates, then CRW is performed to identify matching transcripts. In the transcript matching process, CRW requires ground truth and predicted text to be matched perfectly. Otherwise, a zero score is assigned to the matched box candidates. For this reason, we observe meaningful comparison was difficult with the IoU+CRW.

The proposed metric provides stable scores under various cases by performing evaluations at the character-level. Table  8 shows recall, precision scores of partially corrected detections.

Case Attributes from CLEval
instance-level character-level
Split Merge Miss Overlap FP
Original 18 15 5 45 0
Crop 80% 11 10 1723 26 0
Crop 60% 9 9 4592 17 0
Crop 40% 8 8 6246 10 0
Split by 2 1990 18 0 38 88
Split by 3 1988 19 0 36 178
Split by 4 1994 21 0 35 633
Overlap 10% 2073 21 0 1574 1
Overlap 20% 2074 23 0 2431 0
Overlap 30% 2074 23 0 4157 0
Table 7: Detection attributes from CLEval on ICDAR2015 dataset.
Case Detection Metrics E2E Metrics
DetEval IoU CLEval IoU+CRW CLEval
Original 98.1 99.9 99.0 100 100 100 99.8 99.4 99.6 72.4 72.4 72.4 88.6 91.1 89.8
Crop 80% 98.0 98.1 98.0 100 100 100 84.4 99.6 91.4 50.7 50.7 50.7 80.6 87.7 84.0
Crop 60% 0.7 1.8 1.0 100 100 100 58.6 99.6 73.8 8.0 8.0 8.0 54.8 77.0 64.0
Crop 40% 0.0 0.1 0.1 0.0 0.0 0.0 43.7 99.6 60.8 0.0 0.0 0.0 35.2 69.4 46.7
Split by 2 69.8 74.1 71.9 97.9 49.0 65.3 81.9 98.7 89.5 0.1 0.1 0.1 63.4 76.6 69.4
Split by 3 65.9 70.3 68.0 0.0 0.0 0.0 64.0 97.9 77.4 0.0 0.0 0.0 38.4 62.3 47.5
Split by 4 65.2 69.3 67.2 0.0 0.0 0.0 49.2 94.1 64.6 0.0 0.0 0.0 15.2 48.1 23.1
Overlap 10% 73.9 78.2 76.0 100 50.0 66.7 81.1 87.4 84.1 0.1 0.1 0.1 66.3 73.6 69.8
Overlap 20% 74.3 78.6 76.4 100 50.0 66.7 81.1 81.9 81.5 0.7 0.3 0.4 68.3 69.4 68.8
Overlap 30% 74.8 78.8 76.8 100 50.0 66.7 81.1 72.6 76.6 1.1 0.5 0.7 69.4.7 65.2 67.2
Table 8: Comparison of evaluation metrics on toy-set from ICDAR2015 dataset. Some scores are highlighted: Red above 95, Blue below 5. denotes our implemented code since official evaluation does not exist.
Figure 8: Line graph for H-mean of each evaluation metric according to different crop, split, overlap condition. Solid line indicates the detection evaluation while dashed line indicates the end-to-end evaluation results.
Detector Metrics Attributes
IoU CLEval instance-level character-level
R P H R P H Split Merge Miss Overlap FP
CTPN [tian2016ctpn] 83.0 93.0 87.7 82.5 84.1 83.3 14 231 1015 56 411
SegLink [shi2017seglink] 60.0 73.9 66.2 74.0 95.4 83.3 93 28 1293 46 116
EAST [zhou2017east] 70.7 81.6 75.8 84.7 94.2 89.2 52 47 827 51 200
RRPN [ma2017rrpn] 87.3 95.2 91.1 90.2 95.3 92.7 43 35 521 75 140
TextBoxes++ [liao2018textboxes++] 85.6 91.9 88.6 92.7 94.1 93.4 26 29 374 41 246
FOTS [liu2018fots] 90.4 95.4 92.8 94.0 96.4 95.2 57 34 289 99 62
MaskTextSpotter [lyu2018mask] 88.6 95.0 91.7 93.9 97.7 95.7 26 23 325 24 69
CRAFT [baek2019craft] 93.1 97.4 95.2 96.3 96.6 96.4 34 37 177 84 60
PMTD [liu2019pmtd] 92.2 95.1 93.6 96.1 97.6 96.8 24 28 199 28 76
Table 9: Comparison of evaluation metrics & additional detection attributes for different detectors on ICDAR2013 dataset. R, P, and H refer to recall, precision, and H-mean. Detectors are sorted from the highest score on DetEval metric.
Detector Metrics Attributes
IoU CLEval instance-level character-level
R P H R P H Split Merge Miss Overlap FP
CTPN [tian2016ctpn] 51.6 74.2 60.9 63.2 93.6 75.4 75 103 3842 31 302
SegLink [shi2017seglink] 72.9 80.2 76.4 79.4 95.1 86.5 130 117 2123 107 224
RRPN [ma2017rrpn] 77.1 83.5 80.2 81.8 94.6 87.7 60 77 1938 49 377
EAST [zhou2017east] 77.2 84.6 80.8 84.8 93.9 89.1 66 124 1607 65 401
TextBoxes++ [liao2018textboxes++] 80.8 89.1 84.8 84.2 95.0 89.3 35 52 1713 33 397
MaskTextSpotter [lyu2018mask] 79.5 89.0 84.0 83.7 96.2 89.5 52 63 1751 62 224
FOTS [liu2018fots] 87.9 91.9 89.8 90.0 97.1 93.4 74 69 1033 67 160
CRAFT [baek2019craft] 84.3 89.8 86.9 90.0 97.4 93.5 33 73 1076 19 171
PMTD [liu2019pmtd] 87.4 91.3 89.3 90.8 97.0 93.8 38 47 982 33 232
Table 10: Comparison of evaluation metrics & additional detection attributes for different detectors on ICDAR2015 dataset. R, P, and H refer to recall, precision, and H-mean. Detectors are sorted from the highest score on IoU metric.

Appendix B Evaluation of text detectors

In this section, we compare CLEval with other commonly used evaluation metrics using the state-of-the-art text detectors. We requested the authors of various scene text detectors to provide their test results on public datasets and organized the results in Table 9, 10 for ICDAR2013 and ICDAR2015, respectively.

The strength of using the CLEval metric is in its use of additional instance-level and character-level information to calculate recall, precision, and hmean values. As shown in Table 9, 10, even without the knowledge of recall, precision, and hmean values, we could examine the quality of the detection models by observing the attributes produced by the CLEval metric.

Appendix C Evaluation on real end-to-end results

In this experiment, we take a close look into the end-to-end performance of various detector and recognizer combinations. We used the well-known detectors such as CRAFT[baek2019craft], EAST[zhou2017east], RRPN[ma2017rrpn], PixelLink[deng2018pixellink], and TextBoxes++[liao2018textboxes++]. We recognized the texts of those detectors with three types of recognizers provided in [baek2019wrong]. CLEval results are listed in the Table 11. High indicates recognizer with TPS+ResNet+BiLSTM+Attn moduels, Mid indicates recognizer with None+VGG+BiLSTM+CTC modules, and Low indicates recognizer with None+VGG+None+CTC modules. We observe that RS scores in each High, Mid, and Low recognition combination are similar. This infers that RS can be used to evaluate recognition performance regardless of the detection module.

Detector CLEval Det Recognizer CLEval E2E E2E Rec
RRPN[ma2017rrpn] 81.8 94.6 87.7 High 76.7 84.3 79.9 89.0
Mid 74.0 82.9 78.2 86.4
Low 70.4 82.4 75.9 83.0
EAST[zhou2017east] 84.8 93.9 89.1 High 78.4 83.7 81.0 88.4
Mid 75.2 83.6 79.2 85.5
Low 72.0 82.5 76.9 82.2
TextBoxes++[liao2018textboxes++] 84.2 95.0 89.3 High 78.1 86.4 82.0 90.0
Mid 72.2 84.0 77.6 84.0
Low 67.8 82.8 74.6 79.0
PixelLink[deng2018pixellink] 89.0 96.7 92.7 High 80.8 88.4 84.5 87.8
Mid 78.1 87.3 82.5 85.3
Low 73.8 86.1 79.4 81.0
CRAFT[baek2019craft] 89.9 97.3 93.5 High 81.6 88.9 85.1 88.2
Mid 78.4 87.3 82.6 85.1
Low 74.4 86.3 79.9 81.1
Table 11: End-to-end evaluation using CLEval for state-of-the art text detectors and recognizers.

Appendix D PCC generation in polygon annotation

Most of the text bounding boxes in public datasets are represented using four quadrilateral points. However, there exist polygon-type datasets that use multiple vertexes to tightly bound the text regions. For polygon datasets, we could acquire the center information by splitting the polygon into a sub-groups of quadrilaterals. Algorithm 2 describes the detailed procedure for generating PCCs in polygon-type dataset. By extending PCC generation to polygon datasets, CLEval can be used to evaluate on a variety of datasets represented by both rectangles and polygons.

1 a set of even-number annotation points from left-top, clockwise order
2 length of transcription
3 def PCC_polygon(Points, Length):
4       initialize PCCs as array
5       a set of points above center ()
6       a set of points below center ()
7       =
8       # to make order from left to right, reverse order of element

= Interpolate(

, , )
11       for k from to
12             =
13             =
14             =
15             =
16             PCCs.append()
18       end
19      return PCCs
21 end
22def Interpolate(, , ):
23       initialize , as array
24       for i from to
25             , = , point of
26             , = , point of
27             .append()
28             .append()
29             for k from to
30                   =
31                   =
32                   .append()
33                   .append()
35             end
37       end
38      .append(last element of )
39       .append(last element of )
40       return ,
42 end
Algorithm 2 PCC generation process from polygon annotation