OLALA: Object-Level Active Learning Based Layout Annotation

10/05/2020 ∙ by Zejiang Shen, et al. ∙ University of Waterloo ∙ Harvard University

In layout object detection problems, the ground-truth datasets are constructed by annotating object instances individually. Yet active learning for object detection is typically conducted at the image level, not at the object level. Because objects appear with different frequencies across images, image-level active learning may be subject to over-exposure to common objects. This reduces the efficiency of human labeling. This work introduces an Object-Level Active Learning based Layout Annotation framework, OLALA, which includes an object scoring method and a prediction correction algorithm. The object scoring method estimates the object prediction informativeness considering both the object category and the location. It selects only the most ambiguous object prediction regions within an image for annotators to label, optimizing the use of the annotation budget. For the unselected model predictions, we propose a correction algorithm to rectify two types of potential errors with minor supervision from ground-truths. The human annotated and model predicted objects are then merged as new image annotations for training the object detection models. In simulated labeling experiments, we show that OLALA helps to create the dataset more efficiently and report strong accuracy improvements of the trained models compared to image-level active learning baselines. The code is available at https://github.com/lolipopshock/Detectron2_AL.




1 Introduction

Deep learning-based approaches have been widely applied to document layout analysis and content parsing Zhong et al. (2019b); Schreiber et al. (2017). Document layout object detection, like image object detection, requires identifying content regions and categories within images. A key distinction, however, is that it is not uncommon for dozens to hundreds of content regions to appear on a single page in documents such as firm financial reports or newspapers, compared to around five objects per image in natural image datasets like MS-COCO Lin et al. (2014). Hence, the manual labeling process often used on natural images to create high-quality labeled datasets can be prohibitively costly to replicate for documents of central interest to academic researchers and business organizations.

Active learning (AL) provides powerful tools that can optimize object detection tasks by identifying the most important samples to label  Aghdam et al. (2019); Haussmann et al. (2020); Brust et al. (2018); Roy et al. (2018). However, while the end goal is to annotate individual objects within an image, these AL methods typically score and select samples at the image level, rather than at the object level. For object-dense document images, image-level selection may neglect objects that appear less frequently across the dataset, leading to suboptimal results.

To address this challenge, we propose Object-level Active Learning based Layout Annotation, OLALA. It selects critical objects within an image to label, rather than identifying whole images for labeling, addressing the overexposure of common objects. As shown in Figure 1, we assume - based on extensive experience with real-world documents - that the visual signals of different objects form a long-tailed distribution Wang et al. (2017). Therefore, if we directly sample images to label, the annotation budget is mainly spent on common objects, instead of being allocated evenly across objects with different frequencies. It is difficult to resample, since the distribution is unknown ex ante. One possible solution is to train a detection model during labeling, choosing objects to label based on the informativeness of model predictions. OLALA selects ambiguous object predictions for human annotation and derives the remaining labels from high-confidence model predictions. This oversamples irregular objects and uses human effort more efficiently during labeling, resulting in a large, high-quality annotated dataset that can subsequently be used to train object detection models on other related documents.

Figure 1: The intuition behind object-level labeling. 1) Layout regions follow a long-tailed distribution, where the frequency of different samples differs massively. 2) If we randomly sample images for labeling, the distribution of labeling time is approximately the same as the layout region distribution. It would be better to spend equal time labeling regular and unique layout regions; however, the layout distribution is unobserved ex ante. 3) As models tend to learn quickly on regular regions, we can employ model predictions to reduce the time spent on these samples, so that more time can be spent on unique samples during labeling.

Developing these methods poses two technical difficulties, the most central of which is quantifying prediction informativeness. We must not only select which predictions to re-label but also assess which will be included in the ground-truth. Since the existing literature focuses on selecting image samples rather than object samples, current methods are not directly applicable to measuring object-level informativeness. Hence, we propose a novel object-centered scoring and selection method. Tailored to object detection tasks, this measurement accounts for both the object bounding box and category. Specifically, for each predicted object, we perturb its bounding box by slightly translating it. We use the perturbed box as a new region proposal and generate new bounding boxes and categories. We measure the disagreement between the original and new predicted boxes and categories, and prioritize labeling objects with high disagreement. This is analogous to applying consistency regularization methods Jeong et al. (2019).

Evaluating possible errors in the model predictions not selected for annotation poses another challenge. Though our scoring function aims to exhaustively identify suspicious predictions, there is no guarantee that predicted objects will be correct. For this reason, the literature typically treats such predictions as pseudo labels, using them only to boost model performance and not including them in created datasets Wang et al. (2018); Lee (2013). However, to the extent that these labels are accurate, discarding them throws away a large amount of information that could be used to better pretrain models for related document layout tasks. In our context, specific patterns of errors, such as duplicated predictions or mis-identification of an object, are common and can be corrected with minimal extra supervision. We design an object fusion algorithm that combines the model-predicted and human-labeled objects and attempts to correct errors in the predictions. With an additional scheduling function, this fusion algorithm generates accurate combined datasets at different stages of labeling.

The study’s contributions are twofold. First, we introduce the object-level AL setup, where the model selects the most important objects for users to label. It incorporates a novel object scoring and selection method, which estimates the informativeness of predictions based on the disagreement between the raw object prediction and the perturbed versions. Second, to ensure the correctness of the pseudo labels, we build a prediction correction and object fusion algorithm that rectifies possible errors in predicted objects and merges them with user-labeled objects. Evaluated on two document layout object detection datasets, OLALA shows strong improvements over image-level AL methods. In addition, the object-based selection algorithm outperforms other object selection baselines.

2 Related Work

Document Layout Detection poses significant challenges because of the complex organization of objects within an image and the many object types that may be present Lee (2013); Clausner et al. (2019); Gao et al. (2017); Antonacopoulos et al. (2015). In order to train layout detection models, researchers have created various datasets for historical manuscripts Simistira et al. (2016); Grüning et al. (2018), newspapers Clausner et al. (2015), and modern magazines Antonacopoulos et al. (2009). They typically contain only hundreds of images with object labels, since it is laborious for human annotators to draw accurate bounding boxes for the large number of objects within each image. While deep learning based object detection models (e.g., Faster R-CNN Ren et al. (2015)) have demonstrated superb performance in many applications, these small datasets are not sufficient for the training and evaluation of such data-hungry models. Recently, there have been efforts to generate large-scale annotated datasets programmatically by parsing modern PDF documents Zhong et al. (2019b, a). Unfortunately, this approach is not extensible to millions of scanned document images without parsable PDF metadata, as they contain significant noise from scanning and often from other sources, such as historical printing technologies and the aging of the original documents. These documents have the potential to make fundamental contributions to important research questions in business, the social sciences, and the humanities; but more efficient methods to create larger labeled datasets for training document layout detection models are required to make progress.

Active Learning (AL) has long been applied to object detection. Abramson and Freund (2006) use AL for labeling pedestrians based on predictions generated by an AdaBoost model. They demonstrate a large labeling time reduction by proposing a sampling strategy for selecting image instances to label. Yao et al. (2012) study an annotator-centered labeling cost estimation method and prioritize labeling for high-cost images to boost efficiency. AL for object detection has also attracted attention recently in the context of deep learning. Brust et al. (2018) generate marginal scores for candidate boxes and aggregate them to image-level scores for active selection, while Roy et al. (2018) apply the query by committee Seung et al. (1992) method within convolution outputs to generate these scores. Aghdam et al. (2019) propose a pixel level scoring method using convolution backbones and aggregate them to informativeness scores for image ranking. In general, most related works concentrate on image-level scoring and selection, whereas our method is targeted at object-level scoring, selection, and annotation.

Wang et al. (2018), in contrast, apply object-level AL combined with self-supervised learning to improve labeling efficiency, but unfortunately their approach is not applicable in our context. It crops the selected object, pastes it into another image without objects of the selected category, and evaluates the score based on the composite image. This requires that the algorithm can find images with appropriate blank space to paste the selected object. Document images are object-dense, and there is typically no space to paste an additional object. For common category objects, it may also be impossible to find an image in the dataset without that object. Additionally, Wang et al. (2018) discard the generated labels once the training of the specific model is complete. Their accuracy is not evaluated, and hence the method does not yield a dataset that could be used to pretrain models for performing object detection on other similar documents. A central focus of this study is to produce methods that can yield open-source labeled datasets. These in turn can be used to scale the methods to many documents that are of interest to researchers, businesses, and the general public.

3 Problem Formulation

In layout object detection problems, a detection model f is trained to identify objects within an input image x_i, estimating the bounding box b_ij and category distribution c_ij for the j-th object. y_i = {(b_ij, c_ij)} are the object annotations for x_i. f is initially trained on a small labeled dataset D_L, and it receives a large unlabeled dataset D_U. The goal of active learning is to optimally sample instances from D_U for annotation to maximally improve the model's performance on given metrics. This process can be iterative: at each round t, the agent selects samples S_t from D_U to query labels, obtains the corresponding labeled set D_t, and updates the existing labeled set D_L ← D_L ∪ D_t. The new model f_t is obtained by training (or fine-tuning) f on D_L. For the next round, the unlabeled set becomes D_U \ S_t.

In existing object detection active learning setups, the AL agent usually selects images for user annotation based on some scoring function s. During the training process, scores s(x_i) are calculated for all (or part of Aghdam et al. (2019)) the unlabeled samples x_i in D_U. High scores may imply larger information gains if the sample is selected for training, and an AL agent usually prioritizes labeling such instances. It is designed to select the most informative samples for models Aghdam et al. (2019), or samples with the most ambiguous model predictions Brust et al. (2018). This image-level selection schema aligns perfectly with the typical AL framework for tasks like image classification, yet the user needs to create all object labels for the selected images. This is not optimal for object detection, especially for tasks like layout object detection where many objects can appear within a single image. Because of the uneven distribution of objects, sometimes only a small portion of object predictions in an image are inaccurate. Labeling whole images wastes budget on these accurately predicted regions, which could otherwise be used for labeling less accurate objects.
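As an illustration, image-level scoring in the spirit of Brust et al. (2018) can be sketched as follows. The 1-vs-2 margin formula and the aggregation function are assumptions made for illustration, not the released implementation:

```python
def margin_score(probs):
    """Uncertainty via the 1-vs-2 margin: a small gap between the two most
    likely classes indicates an ambiguous object prediction."""
    top2 = sorted(probs, reverse=True)[:2]
    return 1.0 - (top2[0] - top2[1])

def image_score(object_probs, aggregate=max):
    """Aggregate per-object margin scores into a single image-level score,
    since image-level AL must rank whole images for labeling."""
    if not object_probs:
        return 0.0
    return aggregate(margin_score(p) for p in object_probs)
```

Note that once the per-object scores are collapsed into one image score, a single highly ambiguous object can dominate an image full of confidently predicted ones, which is exactly the inefficiency the object-level setup targets.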

Consider an alternative setup, where the AL agent prioritizes annotation for a portion of objects within each image x_i. An object scoring function selects the regions where the model predictions are the most unconfident, for users to create labels y_i^h. It automatically generates labels y_i^p for the other regions based on model predictions. With measures to control the quality of the selection and of y_i^p, the combined annotation y_i^h ∪ y_i^p is approximately close to the ground truth y_i, while users only need to spend time creating |y_i^h| labels (|·| being the cardinality of a set). Therefore, more images can be annotated given the same labeling budget. This is the object-centered labeling setup in OLALA, and we address the challenges of controlling the quality of the selection and of the generated labels in the following section.

4 Method

4.1 Perturbation-based Object Scoring

We directly utilize both the bounding box and category predictions to devise the object scores. We calculate the differences in boxes and categories between the original object prediction and a perturbed version. Inspired by the self-diversity idea in Jiang et al. (2020) and Zhou et al. (2017), the proposed method hypothesizes that adjacent image patches share similar feature vectors, and that the predicted object boxes and categories for them should be consistent. Therefore, any large disagreement between the original and perturbed predictions indicates that the model is insufficiently trained for this type of input, or that there is some anomaly in the given sample. Both cases demand user attention, and extra labeling is required. This contrasts with existing methods designed for image-level selection, which usually focus on the categorical - rather than positional - information in object detection model outputs (e.g., Brust et al. (2018), which considers the marginal score of the object category predictions and does not use the bounding boxes, or Aghdam et al. (2019), which indirectly uses the positional information based on a pixel map for image-level aggregation).

Specifically, for each object prediction, we take the bounding box prediction b = (x, y, w, h) and apply small shifts to perturb the given box, where (x, y) are the coordinates of the top-left corner, and (w, h) are the width and height of the box. The new boxes are created via horizontal and vertical translation by ratios r_x and r_y of the box dimensions, i.e., b~_k = (x ± r_x w, y ± r_y h, w, h), where b~_k is the k-th perturbed box for box prediction b, and a total of K perturbations are generated. Based on the image features within each b~_k, the model generates new box and category predictions. We then measure the disagreement between the original prediction and the perturbed versions, and use it as a criterion for selecting objects for labeling.
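A minimal sketch of this box perturbation, assuming (x, y, w, h) box coordinates; the helper name and the example ratio values are illustrative, not taken from the released code:

```python
def perturb_box(box, ratio_pairs):
    """Translate a box (x, y, w, h) diagonally by width/height ratios.

    For each (rx, ry) pair, four shifted copies are produced, moving
    towards the top-left, top-right, bottom-left, and bottom-right."""
    x, y, w, h = box
    perturbed = []
    for rx, ry in ratio_pairs:
        dx, dy = rx * w, ry * h
        for sx, sy in [(-1, -1), (1, -1), (-1, 1), (1, 1)]:
            perturbed.append((x + sx * dx, y + sy * dy, w, h))
    return perturbed
```

With four ratio pairs, this yields 16 perturbed proposals per object prediction, matching the setup described in Section 5.1.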

In practice, we build this method upon a typical two-stage object detection architecture Ren et al. (2015), where 1) a region proposal network estimates possible bounding boxes, and 2) a region classification and improvement network (ROIHeads, a module name in Detectron2 Wu et al. (2019)) predicts the category and modifies the box prediction based on the input proposals. We use the perturbed boxes as new inputs for the ROIHeads, and obtain the new box and class predictions (b~'_k, c~'_k) for each perturbed box b~_k. For object regions of low confidence, the new predictions are unstable under such perturbation, and the predicted boxes and category distributions can change drastically from the original version. To this end, we formulate the position disagreement D_pos and the category disagreement D_cat for each object prediction as

D_pos = 1 - (1/K) Σ_k IOU(b, b~'_k),  D_cat = (1/K) Σ_k ℓ(c, c~'_k),

where b and c are the original box and category predictions, (b~'_k, c~'_k) are the predictions generated from the k-th perturbed box, IOU calculates the intersection-over-union (IOU) scores for the inputs, and ℓ is a measure of distribution difference, e.g., cross entropy. The overall disagreement is defined as D = D_pos + λ D_cat, with λ being a weighting constant. Objects with larger D will be prioritized for labeling, and users will create annotations for them in the image.
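The disagreement computation can be sketched as below. This is a plausible reconstruction of the formulas above under the stated definitions; the averaging over the K perturbations, the helper names, and the box format are assumptions:

```python
import math

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def cross_entropy(p, q, eps=1e-12):
    """Distribution-difference measure between category distributions."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))

def disagreement(box, cat, perturbed_boxes, perturbed_cats, lam=1.0):
    """Combine positional and categorical disagreement for one prediction:
    D = D_pos + lam * D_cat, averaged over the K perturbed predictions."""
    k = len(perturbed_boxes)
    d_pos = 1.0 - sum(iou(box, b) for b in perturbed_boxes) / k
    d_cat = sum(cross_entropy(cat, c) for c in perturbed_cats) / k
    return d_pos + lam * d_cat
```

A prediction whose perturbed versions land on the same box with the same category distribution scores zero; unstable predictions score high and are sent to annotators first.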

The proposed method thoroughly evaluates the robustness of the box and category predictions, and can effectively identify false-positive object predictions. Based on the self-diversity assumption, an incorrect category prediction will cause high category disagreement because the new class predictions for nearby patches diverge. When the predicted box is wrong, the perturbed box is less likely to be an appropriate proposal box; the generated predictions are then unreliable, causing higher overall disagreement. It is worth noting that adding the positional prediction evaluation is especially helpful for layout object detection tasks. Different from real-world images, layout regions are boundary-sensitive: a small vertical shift of a text region box can cause the complete disappearance of a row of text. We aim to search for samples that lead to ambiguous boundary predictions, and the position disagreement helps identify these samples, as it explicitly analyzes the box prediction quality. Additionally, the perturbation-based object scoring method is well suited to object-dense images like document scans.

4.2 Prediction Correction and Object Fusion

Recent work in semi-supervised learning Wang et al. (2018) has demonstrated that using model predictions as labels in training leads to accuracy improvements. However, without guarantees of their accuracy, the predicted (pseudo) labels are discarded after the training loop. These discarded labels may contain information that could, for example, allow researchers to more efficiently train other object detection tasks using transfer learning. To ensure the quality of the large dataset that includes both the ground-truth and predicted labels - so that researchers can use this information for future tasks - we propose post-processing methods that aim to reduce the false-positive and false-negative model predictions. The predictions are then merged with user annotations, resulting in a larger dataset given the same labeling budget. The trained model achieves higher object detection accuracy than several baseline methods in our experiments.

Adaptive False-positive Correction False positives occur when the model generates the wrong bounding box or class for an object. We delegate correcting most false-positive predictions to the annotators via the proposed object scoring and selection method. To use human effort wisely, we change the selection ratio of predictions sent for user annotation at different stages of training. Inspired by Curriculum Learning Bengio et al. (2009), we set a high initial ratio to rely more on human labeling and ease model training in the beginning. Linear or exponential decay is then applied; the shrinking ratio gradually increases the trust in model predictions as their accuracy improves during training.
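Such a schedule can be sketched as follows, assuming linear decay between the initial and last ratio values listed in Table 1; the exponential variant is one plausible form, and the function name is illustrative:

```python
def selection_ratio(t, total_rounds, r_init=0.9, r_last=0.4, mode="linear"):
    """Fraction of predicted objects sent to human annotators at round t
    (0-indexed): starts high to rely on humans early, then decays to
    trust the improving model more."""
    if total_rounds <= 1:
        return r_last
    frac = t / (total_rounds - 1)
    if mode == "linear":
        return r_init + (r_last - r_init) * frac
    # exponential interpolation from r_init down to r_last
    return r_init * (r_last / r_init) ** frac
```

For example, with the PubLayNet settings (0.9 initial, 0.4 last, 10 rounds), the linear schedule hands 90% of objects to annotators in round 0 and only 40% in the final round.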

Duplication Removal In practice, models can generate multiple close predictions for a large object, yet only one or some of the predictions are sent for user inspection. Naively merging the user's labels with the remaining predictions can thus lead to overlapping labels for the same object, which would mislead the trained model into producing overlapping boxes with high confidence. We fix this error by filtering out predictions that overlap any human annotation above a score threshold. Different from IOU scores, we use the pairwise Overlap Coefficient - |A ∩ B| / min(|A|, |B|) - to better address scenarios where a predicted box is contained within a labeled box. The threshold is set to 0.25 empirically.
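The duplication filter can be sketched as below, assuming (x, y, w, h) boxes; the function names are illustrative, not from the released code:

```python
def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|) for two (x, y, w, h) boxes: equals 1.0
    whenever one box is fully contained in the other."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    smaller = min(a[2] * a[3], b[2] * b[3])
    return inter / smaller if smaller else 0.0

def remove_duplicates(predictions, human_labels, threshold=0.25):
    """Drop predictions that overlap any human-labeled box above the
    threshold, so merged annotations contain one label per object."""
    return [p for p in predictions
            if all(overlap_coefficient(p, g) <= threshold for g in human_labels)]
```

Using the overlap coefficient rather than IOU is the key design choice here: a small predicted box nested inside a large labeled box has low IOU but an overlap coefficient of 1.0, so it is still filtered.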

Missing Annotation Recovery

False negatives occur when no prediction is generated for a given object. They cannot be directly identified or evaluated in the aforementioned architecture, so extra supervision is required to provide labels for these missed regions. To minimize human effort, we first increase the number of object predictions from the model by adjusting related hyperparameters to reduce false negatives. Additionally, in real-world labeling experiments, we highlight the regions without predictions during labeling. Users can add a small number of labels within a limited region at minimal labeling cost. In our experiments (Section 5.4), we find this step has a substantial impact on the accuracy of the merged dataset.

Input : Initial sets D_L, D_U; labeling budget B; object selection ratio schedule r
Initialize D_L, D_U, and detection model weights f;
for t ← 1 to T do
       Calculate budget B_t and selection ratio r_t for round t
       Update the model f using D_L
       Let b = 0
       for i ← 1 to |D_U| do
             Generate object predictions ŷ_i for image x_i
             Let n_i = r_t |ŷ_i|, b = b + n_i
             if b > B_t then break;
             Calculate object scores for the predictions in ŷ_i
             Select the n_i objects with top scores and label them
             Correct errors in the unselected predictions
             Merge with the user labels to get image annotations y_i
             Remove x_i from D_U and add (x_i, y_i) to D_L
       end for
end for
Update the model f using D_L
Algorithm 1 Object-level Active Learning

4.3 Olala Algorithm

We now present the formal OLALA algorithm. Given an initial labeled set D_L, it aims to use the predictions from a model f to optimally label the remaining unlabeled set D_U given some labeling budget. Different from existing work, we define the labeling budget per round B_t as the number of objects rather than images that human annotators can label. The algorithm iteratively proposes the most informative objects to label for a total of T rounds. At each round t, it selects up to B_t objects to label. For each image from the existing unlabeled set D_U, a fraction r_t of predicted objects is selected for user labeling using our scoring function introduced in Section 4.1. The rest of the labels are created by correcting errors in the unselected model predictions based on the method in Section 4.2. The labeled image is then removed from D_U and the annotated sample is added to D_L. After each round, the selection ratio r_t decays as the model accuracy improves.

5 Experiments

In this section, we first describe the setup of our experiments for evaluating OLALA and then report the results. We conduct several controlled experiments to systematically study different aspects of the proposed OLALA framework. First, we contrast OLALA with image-level AL methods to analyze the modeling accuracy improvements from conducting object-level labeling. We find Brust et al. (2018) the most appropriate comparison target for image-level AL. As their method calculates image scores based on aggregating object-level scores, the comparison can reveal the benefits of conducting ranking and selection at the object-level as opposed to the image-level. Second, we show that the proposed object scoring method could further improve object selection and the modeling accuracy compared to random baselines and the marginal object scoring used in Brust et al. (2018). Due to the aforementioned dense-object issues, Wang et al.’s method does not apply to the selected document image datasets, and it is not included in the comparison results. Third, we demonstrate that OLALA leads to substantial efficiency improvements in creating larger datasets by conducting object-level correction. Finally, we evaluate the effectiveness of the prediction correction and object fusion algorithm (Section 4.2) by comparing the accuracy of the created dataset under different settings.

Datasets PubLayNet PRImA HJDataset
Data Source Digital PDF Image Scan Image Scan
Annotation Auto PDF Parsing Human Labeling Combined
Dataset Size 360,000 453 2,048
Train / test split 8,896 / 2,249 363 / 90 1,433 / 307
Avg / max objects per image 10.72 / 59 21.63 / 79 73.48 / 98
Labeling budget 21,140 (2,000) 5,623 (240) 51,436 (700)
Total rounds 10 4 8
Initial / last selection ratio 0.9 / 0.4 0.9 / 0.75 0.9 / 0.5
Table 1: Statistics and parameters for the PubLayNet, PRImA, and HJDataset datasets. "Avg / max" gives the average and maximum number of objects in each image. For the labeling budget, the numbers in parentheses indicate the equivalent number of images for the given object labeling budget.
Dataset PubLayNet HJDataset PRImA
Exp Final AP Labeled Images Final AP Labeled Images Final AP Labeled Images
[a] 60.73 2046 69.82 709 31.49 244
[b] 67.91 (+11.84%) 2465
[c] 64.21 (+5.73%) 3187 (+55.77%) 72.16 (+3.36%) 1105 (+55.85%) 32.08 (+1.89%) 277.80 (+13.85%)
[d] 69.23 (+14.00%) 3661 (+78.93%) 71.48 (+2.38%) 1075 (+51.62%) 32.85 (+4.33%) 306.60 (+25.66%)
[e] 69.13 (+13.84%) 3686 (+80.16%) 73.40 (+5.13%) 1159 (+63.47%) 33.87 (+7.57%) 286.20 (+17.30%)
  • The figures in PRImA are the average from the 5 folds in cross validation to account for possible noise due to the small dataset size.

  • All the percentages are compared against experiment [a] in this dataset.

Table 2: The final AP and number of labeled images under different settings. OLALA achieves strong performance improvements in model accuracy in all experiments, and creates datasets with more images given the same labeling budget.

5.1 Datasets and Experimental Design

To validate our approach, we run several simulations using three representative layout analysis datasets: PubLayNet Zhong et al. (2019b), PRImA Antonacopoulos et al. (2009), and HJDataset Shen et al. (2020). PubLayNet is a large dataset of 360k images. The images and annotations are generated from noiseless digital PDF files of medical papers. As the original training set in PubLayNet is too large to conduct experiments efficiently, we use a downsampled version of 8,896 and 2,249 samples for training and validation, respectively. PRImA is created by human annotators drawing bounding boxes for text regions in both scanned magazines and technical articles, resulting in greater heterogeneity in this dataset than in PubLayNet. We convert the original dataset into COCO format, and divide it into training (363 images) and validation (90 images) sets. HJDataset contains layout annotations for 2k historical Japanese documents. It has an intermediate dataset size, and shares properties with both of the aforementioned datasets: it is established from noisy image scans, and its creation method combines rule-based layout parsing from images with human inspection and correction. Table 1 shows a thorough comparison; PubLayNet and PRImA represent two typical types of existing layout analysis datasets - large and automatically generated vs. small and human-labeled - with HJDataset an intermediate case.

The proposed algorithm is implemented based on Detectron2 Wu et al. (2019), an open-source object detection benchmark. For fair comparison, the same object detection model is used for all experiments, with an identical learning rate and optimization algorithm. It is built on the Faster R-CNN Ren et al. (2015) network with a ResNet-50 He et al. (2016) backbone and Feature Pyramid Network (FPN) Lin et al. (2017). We train each model on a single Tesla V100 GPU with a batch size of 6. To encourage reproducibility as well as inspire and facilitate further research, the code along with configuration files for all hyperparameter settings (including those for active learning) is released on GitHub.

For object-level AL, we set the total labeling budget B and the total number of rounds T per dataset. The labeling budget is evenly distributed across rounds. For the object selection ratio, by default, we use a linear decay function with given initial and last values. These hyperparameters are initialized as indicated in Table 1. When calculating the object scores, we set the weighting constant to 1 and use cross entropy as the distribution-difference measure. In addition, unless otherwise mentioned, we use four pairs of translation ratios, and for each pair, four boxes are created (moving towards the top left, top right, bottom left, and bottom right). A total of 16 perturbed boxes are generated per object prediction for a comprehensive analysis of prediction performance under small and large perturbations in different directions.

When running simulations, we build several additional helper algorithms to imitate human labeling behavior. First, we obtain the ground-truth object annotations based on the overlap between the selected model predictions and the ground truths. For each prediction, we calculate the IOU with all ground truths and choose the top one to substitute for the prediction. Duplicated ground truths selected in an image are removed by this process. In addition, when a selected object prediction is highly accurate (both the bounding box and category prediction are correct), human annotators do not need to correct it, and the labeling budget is unused. For a selected prediction, if it has an IOU ≥ 0.925 with some ground-truth object and the categories are the same, we do not substitute it with the ground truth and do not expend budget for it. The threshold 0.925 is determined empirically, indicating high overlap in the context of layout analysis. Finally, to find the false-negative regions, we compute the pairwise IOU between the ground truth and the combined labeling. Ground-truth objects whose maximum IOU with predicted objects is less than a small threshold are chosen and added to the dataset. This increases the expended budget, to imitate how human annotators create labels for false-negative regions. The threshold is set to 0.05 in the following experiments, to allow minor overlap caused by noise in the predictions.
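The simulated annotator described above can be sketched as follows. The (box, category) tuple layout, the greedy matching, and the function names are assumptions made for illustration, not the paper's released helpers:

```python
def _iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def simulate_annotation(selected_preds, ground_truths, accept_iou=0.925):
    """Replace each selected prediction with its best-matching ground truth.

    Returns the resulting labels and the budget spent: a prediction that
    already matches a ground truth closely (IOU >= accept_iou, same
    category) costs nothing, imitating an annotator who leaves it as-is."""
    labels, spent, used = [], 0, set()
    for box, cat in selected_preds:
        best, best_iou = None, 0.0
        for j, (g_box, g_cat) in enumerate(ground_truths):
            score = _iou(box, g_box)
            if score > best_iou and j not in used:
                best, best_iou = j, score
        if best is None:
            continue
        used.add(best)  # each ground truth is matched at most once per image
        g_box, g_cat = ground_truths[best]
        if best_iou >= accept_iou and cat == g_cat:
            labels.append((box, cat))      # accurate prediction, budget-free
        else:
            labels.append((g_box, g_cat))  # corrected by the annotator
            spent += 1
    return labels, spent
```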

5.2 Accuracy of Trained Models

Figure 2: Modeling validation accuracy (lines) and number of labeled images (bars) during labeling under different settings. We compare five settings when labeling and training PubLayNet and HJDataset. The object-level active learning methods show substantial improvements in model accuracy and create more labeled images with the same labeling budget. The proposed object scoring method (black) achieves the optimal AP scores among all settings.

In the first batch of experiments, we aim to evaluate how OLALA improves data labeling and model training. As shown in Figure 2, we compare OLALA with the following:

  1. Random image selection and labeling

  2. Image-level active learning based on Brust et al. (2018), with the average aggregation function

  3. Random object selection and labeling

  4. Object-level active learning with marginal scoring function for object category prediction

  5. Object-level active learning with the proposed scoring function

During labeling, we train the model after gathering new batches of labels, and we evaluate its detection accuracy on the validation set. We use the Average Precision (AP) score as a measure of detection accuracy, as is common in other object detection tasks Lin et al. (2014). Comparing [c] and [a], or [d] and [b], on PubLayNet, object-level selection and labeling is more accurate than its image-level counterparts. Appropriate object selection methods like [d] and [e] also perform better than random object selection. In addition, comparing [d] and [e], the proposed method - which considers both box and category predictions - tends to improve model accuracy. This is especially true at the beginning of training.

For the PRImA dataset, we use a 5-fold cross-validation strategy to ensure stable results, given its smaller size and more complicated layouts. To account for the different object sizes in this dataset, a single pair of values is used for the perturbation parameters. When comparing the mean AP of the cross-validated models (Table 2), the object-level selection methods [e] and [d] attain better scores than the image-level selection methods.

5.3 Efficiency of Dataset Creation

We next turn to the efficiency of creating a labeled image dataset. As shown in Figure 2, the object-level scoring methods create significantly larger datasets than the image-level AL methods in all experiments. The gap between object-level and image-level methods widens during training as the ratio of objects selected for labeling decreases. This is especially helpful when creating large-scale datasets like PubLayNet: by the end, [e] labels 80% more images than the baseline random image labeling method (63% for HJDataset). AL-based methods ([e] and [b]) also tend to be more efficient than the random baselines ([c] and [a]). This is because the datasets contain some atypical images; in PubLayNet, for example, some images have fewer than 5 objects (the average is 10.72). As the AL models prioritize labeling these unusual images, more images are annotated overall.

5.4 Accuracy of the Merged Datasets

In this section, we go beyond the design of the scoring function to show that the predicted labels can be used to produce a large dataset of annotations that combines human and predicted labels. This contrasts with the past literature Wang et al. (2018), which discards the predicted labels after each round of training and hence cannot produce a dataset of annotations for pre-training models on other related document layout analysis tasks. Note that such pre-training cannot rely on the ground-truth labels alone: typically only some objects on each page are labeled by annotators, whereas object detection networks are trained on entire images. The utility of the combined dataset requires that the predicted labels be reasonably accurate, which we ensure through our error correction method. We now provide intuition for how the error correction works by reporting additional experiments on PubLayNet. Experiments on the other datasets show similar results but are omitted due to space constraints.

As Section 4.2 outlines, we develop two correction methods: duplicate removal and recovery of missing labels. We compare four scenarios: [e] both corrections are used, [f] duplicate removal is disabled, [g] missing annotation recovery is disabled, and [h] both corrections are disabled. For the newly created dataset at each round, we measure its AP against the ground-truth objects. When equipped with both corrections, the merged dataset maintains a high accuracy level even as the object selection ratio drops gradually (Figure 3). Recovering missing labels is vital for accuracy: in the initial training stage, when models are not yet sufficiently trained, both the false-positive and false-negative ratios are high. [g] and [h] do not allow annotators to correct false negatives (recover missing objects), and hence show significantly worse accuracy for both the created datasets and the model predictions. They stop at the end of round 5, as they exhaust all samples in the training set. Similar results are observed on the other two datasets.
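A minimal sketch of the duplicate-removal correction is a greedy IOU-based filter over score-sorted predictions. The 0.5 threshold, the `iou` helper, and the `(box, category, score)` tuple layout are illustrative assumptions rather than the exact released implementation.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def remove_duplicates(preds, iou_thresh=0.5):
    """Greedy duplicate removal: scan predictions from highest to lowest
    confidence and drop any box that overlaps an already-kept box above
    iou_thresh."""
    kept = []
    for box, cat, score in sorted(preds, key=lambda p: p[2], reverse=True):
        if all(iou(box, kept_box) < iou_thresh for kept_box, _, _ in kept):
            kept.append((box, cat, score))
    return kept
```

Missing-label recovery, by contrast, requires extra human supervision: regions of the page that overlap no retained prediction are surfaced to annotators, as simulated in Section 5.1.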

Figure 3: The influence of error correction methods on the PubLayNet Dataset. Without using the error correction methods, both the trained model accuracy and merged dataset accuracy deteriorate notably. The blue dashed line in the bottom figure shows the average accuracy of the created dataset with both correction methods.

6 Conclusion

This paper considers active learning for layout object detection tasks, where there are typically many objects in each image. Unlike existing work, we propose to select and label at the object level rather than at the image level. The proposed Object-Level Active Learning based Layout Annotation framework, OLALA, consists of two components. The perturbation-based object scoring function measures the informativeness of a predicted object by calculating the disagreement between the original prediction and its perturbed versions. The prediction correction and object fusion method corrects false-positive and false-negative object predictions with minimal extra supervision and can generate a highly accurate merged dataset. Through simulations on real-world data, we show that the proposed algorithm significantly improves dataset creation efficiency relative to image-level methods, and that the trained models outperform the baselines. By making the creation of layout analysis datasets more efficient, our work can benefit many downstream tasks, such as processing historical documents at scale.


Acknowledgments

This project is supported in part by an NSERC Discovery Grant. We thank computecanada.ca for providing the computational resources, and Ruochen Zhang for her valuable feedback on the manuscript.


References

  • Y. Abramson and Y. Freund (2006) Active learning for visual object detection. Department of Computer Science and Engineering, University of California …. Cited by: §2.
  • H. H. Aghdam, A. Gonzalez-Garcia, J. v. d. Weijer, and A. M. López (2019) Active learning for deep detection neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3672–3680. Cited by: §1, §2, §3, §4.1.
  • A. Antonacopoulos, D. Bridson, C. Papadopoulos, and S. Pletschacher (2009) A realistic dataset for performance evaluation of document layout analysis. In 2009 10th International Conference on Document Analysis and Recognition, pp. 296–300. Cited by: §2, §5.1.
  • A. Antonacopoulos, C. Clausner, C. Papadopoulos, and S. Pletschacher (2015) ICDAR2015 competition on recognition of documents with complex layouts-rdcl2015. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1151–1155. Cited by: §2.
  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §4.2.
  • C. Brust, C. Käding, and J. Denzler (2018) Active learning for deep object detection. arXiv preprint arXiv:1809.09875. Cited by: §1, §2, §3, §4.1, item b, §5.
  • C. Clausner, A. Antonacopoulos, and S. Pletschacher (2019) ICDAR2019 competition on recognition of documents with complex layouts-rdcl2019. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1521–1526. Cited by: §2.
  • C. Clausner, C. Papadopoulos, S. Pletschacher, and A. Antonacopoulos (2015) The enp image and ground truth dataset of historical newspapers. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 931–935. Cited by: §2.
  • L. Gao, X. Yi, Z. Jiang, L. Hao, and Z. Tang (2017) ICDAR2017 competition on page object detection. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 1417–1422. Cited by: §2.
  • T. Grüning, R. Labahn, M. Diem, F. Kleber, and S. Fiel (2018) Read-bad: a new dataset and evaluation scheme for baseline detection in archival documents. In 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 351–356. Cited by: §2.
  • E. Haussmann, M. Fenzi, K. Chitta, J. Ivanecky, H. Xu, D. Roy, A. Mittel, N. Koumchatzky, C. Farabet, and J. M. Alvarez (2020) Scalable active learning for object detection. arXiv preprint arXiv:2004.04699. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.1.
  • J. Jeong, S. Lee, J. Kim, and N. Kwak (2019) Consistency-based semi-supervised learning for object detection. In Advances in neural information processing systems, pp. 10759–10768. Cited by: §1.
  • Z. Jiang, Z. Gao, Y. Duan, Y. Kang, C. Sun, Q. Zhang, and X. Liu (2020) Camouflaged chinese spam content detection with semi-supervised generative active learning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3080–3085. Cited by: §4.1.
  • D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Vol. 3. Cited by: §1, §2.
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §5.1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1, §5.2.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §2, §4.1, §5.1.
  • S. Roy, A. Unmesh, and V. P. Namboodiri (2018) Deep active learning for object detection.. In BMVC, pp. 91. Cited by: §1, §2.
  • S. Schreiber, S. Agne, I. Wolf, A. Dengel, and S. Ahmed (2017) Deepdesrt: deep learning for detection and structure recognition of tables in document images. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR), Vol. 1, pp. 1162–1167. Cited by: §1.
  • H. S. Seung, M. Opper, and H. Sompolinsky (1992) Query by committee. In Proceedings of the fifth annual workshop on Computational learning theory, pp. 287–294. Cited by: §2.
  • Z. Shen, K. Zhang, and M. Dell (2020) A large dataset of historical japanese documents with complex layouts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 548–549. Cited by: §5.1.
  • F. Simistira, M. Seuret, N. Eichenberger, A. Garz, M. Liwicki, and R. Ingold (2016) DIVA-hisdb: a precisely annotated large dataset of challenging medieval manuscripts. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 471–476. Cited by: §2.
  • K. Wang, X. Yan, D. Zhang, L. Zhang, and L. Lin (2018) Towards human-machine cooperation: self-supervised sample mining for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1605–1613. Cited by: §1, §2, §4.2, §5.4, §5.
  • X. Wang, A. Shrivastava, and A. Gupta (2017) A-fast-rcnn: hard positive generation via adversary for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2606–2615. Cited by: §1.
  • Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. Note: https://github.com/facebookresearch/detectron2 Cited by: §5.1, footnote 1.
  • A. Yao, J. Gall, C. Leistner, and L. Van Gool (2012) Interactive object detection. In 2012 IEEE conference on computer vision and pattern recognition, pp. 3242–3249. Cited by: §2.
  • X. Zhong, E. ShafieiBavani, and A. J. Yepes (2019a) Image-based table recognition: data, model, and evaluation. arXiv preprint arXiv:1911.10683. Cited by: §2.
  • X. Zhong, J. Tang, and A. J. Yepes (2019b) PubLayNet: largest dataset ever for document layout analysis. arXiv preprint arXiv:1908.07836. Cited by: §1, §2, §5.1.
  • Z. Zhou, J. Shin, L. Zhang, S. Gurudu, M. Gotway, and J. Liang (2017) Fine-tuning convolutional neural networks for biomedical image analysis: actively and incrementally. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7340–7351. Cited by: §4.1.