
OPAD: An Optimized Policy-based Active Learning Framework for Document Content Analysis

by   Sumit Shekhar, et al.
IIT Roorkee

Documents are central to many business systems, and include forms, reports, contracts, invoices or purchase orders. The information in documents is typically in natural language, but can be organized in various layouts and formats. There has been a recent spurt of interest in understanding document content with novel deep learning architectures. However, document understanding tasks need dense information annotations, which are costly to scale and generalize. Several active learning techniques have been proposed to reduce the overall budget of annotation while maintaining the performance of the underlying deep learning model. However, most of these techniques work only for classification problems. Content detection is a more complex task, and has been scarcely explored in active learning literature. In this paper, we propose OPAD, a novel framework using reinforcement policy for active learning in content detection tasks for documents. The proposed framework learns the acquisition function to decide the samples to be selected while optimizing performance metrics that the tasks typically have. Furthermore, we extend to weak labelling scenarios to further reduce the cost of annotation significantly. We propose novel rewards to account for class imbalance and user feedback in the annotation interface, to improve the active learning method. We show superior performance of the proposed OPAD framework for active learning for various tasks related to document understanding like layout parsing, object detection and named entity recognition. Ablation studies for human feedback and class imbalance rewards are presented, along with a comparison of annotation times for different approaches.



1. Introduction

Documents are a key part of several business processes, and include reports, business contracts, forms, agreements, etc. Extracting data from documents through deep networks has recently started gaining attention. These tasks include document page segmentation, entity extraction and classification. Fueled by the availability of both labeled and unlabeled data, and advances in the computation infrastructure, a number of deep learning models have recently been proposed for modeling complex tasks (He et al., 2015; Ren et al., 2015a; Devlin et al., 2019). The promising results from this research direction motivated the development of several deep learning models which show significant performance improvements on these tasks when trained on a large amount of labelled data (Oliveira et al., 2018; Yang and Mitchell, 2016; Xu et al., 2020a). However, deployment of these models requires considerable effort and cost to annotate unlabelled data, especially for document tasks, because of the requirement for dense annotations, e.g. annotating page structures with components like title, table, figures or references. Thus, there is a need to explore methods to optimize annotation budgets to accelerate the development of document analysis models.

Several approaches have been proposed in the domains of semi-supervised learning (Yang et al., 2021), unsupervised learning (Wilson and Cook, 2020), few-shot learning (Wang et al., 2020) and active learning (Settles, 2009) to overcome the limited availability of labeled data. Each of these approaches incorporates its own objectives in modeling, data annotation or both to achieve superior performance in a limited annotated data setup. Among these, our motivation for using active learning is two-fold: (1) active learning bridges the gap in the model by querying samples in the data space for which the model does not have enough information (Settles, 2009); (2) active learning approaches seek to learn higher accuracy models within a given annotation cost by optimizing data acquisition, which aligns well with our objective of optimizing annotation costs. In the pool-based active learning scenario, the query for annotations selects a batch of data samples for the oracle (i.e. the annotator). Pool or batch-based active learning methods are more scalable than querying a single data sample per learning cycle (Guo and Schuurmans, 2007). Most active learning work (Settles, 2009; Aggarwal et al., 2014) formulates acquisition functions as information theoretic uncertainty estimates. While uncertainty-based methods work well for tasks like classification (Wang and Shang, 2014; Gal et al., 2017), where a single annotation is required per data sample, generalizations to document tasks such as page segmentation and named entity recognition, which require multiple annotations per selected data sample, have been scarcely explored. This is because methods to aggregate uncertainties over various entities present in a data sample are not well developed (Roy et al., 2018; Brust et al., 2019). Recent techniques have been proposed to obtain a better acquisition function for active learning in these tasks (Liu et al., 2018a, 2019). However, these methods assume highly task-specific heuristics, and hence cannot be generalized across different content detection scenarios.

In addition to active learning, in particular for dense annotation tasks in documents, weak learning can be an effective approach to reduce annotator’s efforts (Papadopoulos et al., 2016; Wang et al., 2017; Papadopoulos et al., 2017). When there are multiple entities to be annotated in a data sample, weak learning reduces the annotation effort, either by providing faster variations of annotation techniques (Papadopoulos et al., 2017) or simply asking the annotator to verify the model predictions (Papadopoulos et al., 2016). However, there are very few works (Desai et al., 2019; Brust et al., 2020) that combine weak learning with active learning. Furthermore, to the best of our knowledge, none of the works takes advantage of the annotator feedback (e.g. from annotator’s corrections of detected instance boundaries) during an active learning cycle.

In this work, we propose a policy-based active learning approach, taking into account the complexities of aggregating model uncertainties in the selection of samples to be labelled. We model the task of active learning as a Markov decision process (MDP) and learn an optimal acquisition function using deep Q-learning (Mnih et al., 2013). While several works rely on reinforcement learning for learning an optimal acquisition function (Liu et al., 2018a, 2019; Haussmann et al., 2019; Casanova et al., 2020), they assume task-specific representations of states and actions and hence are not generalizable across tasks. We further show that the proposed method can be combined with weak labelling, reducing the cost of annotation compared to strong labelling. Moreover, we incorporate class imbalance and human feedback signals into the design of the MDP using suitable reward functions to further improve the performance of our approach.

To summarize, the major contributions of our work are as follows:

  • We propose a policy-based task-agnostic active learning approach for complex content detection tasks, layout detection and named entity recognition in documents.

  • We show that the proposed approach is generalizable by demonstrating the performance of our active learning setup on varied detection tasks.

  • We investigate the effectiveness of incorporating class balance and human feedback rewards in improving the active learning policy.

  • We demonstrate the advantage of the proposed approach in reducing the costs of annotation in aforementioned complex detection tasks.

Throughout the remainder of the paper, we explain the proposed concepts, models, configurations, and discussions from the perspective of the layout and object detection, and named entity recognition tasks.

2. Related Work

Document content analysis has been studied extensively along several dimensions such as document classification (image (Xu et al., 2020a; Xu et al., 2020b) or text (Adhikari et al., 2019; Pappagari et al., 2019) or both (Jain and Wigington, 2019; Audebert et al., 2019)), named entity recognition in documents (Yang and Mitchell, 2016; Luo et al., 2018), content segmentation (Oliveira et al., 2018; Grüning et al., 2019), document retrieval (Sugathadasa et al., 2018; Choudhary et al., 2020; Trabelsi et al., 2021) and layout analysis (Binmakhashen and Mahmoud, 2019), among many others. The availability of large scale labeled datasets of documents (Lewis et al., 2006; Harley et al., 2015; Li et al., 2020; Tkaczyk et al., 2014; Zhong et al., 2019) led to the advent of several state-of-the-art deep learning models which have significantly improved these tasks in a large scale data setup. However, to the best of our knowledge, there is very little literature that uses active learning to optimize data annotation cost in a low resource setting, specifically for document analysis tasks (Godbole et al., 2004; Bouguelia et al., 2013). Therefore, in this section, we discuss works that deal with general active learning policies, and active learning in related well studied domains: image classification, object detection and named entity recognition.

Active learning selects data samples with high uncertainty in the model prediction, which can provide more information to the underlying model. Different works have proposed different ways to compute model uncertainty (Settles, 2009). While some methods depend on information theory for designing acquisition functions (Houlsby et al., 2012; Wang and Shang, 2014; Gal et al., 2017), others rely on alternative ways to approximate model uncertainty (Freytag et al., 2014; Ducoffe and Precioso, 2018). Yoo et al (Yoo and Kweon, 2019) add a light-weight loss prediction module to the prediction model to predict the loss for the unlabelled samples, and use that as an uncertainty measure. Mayer et al (Mayer and Timofte, 2020) use uncertainty measure to find the optimal sample and query the data sample closest to the optimal sample.

For complex tasks such as object detection and named entity recognition, recent works (Roy et al., 2018; Brust et al., 2019; Shen et al., 2017) have proposed using uncertainty scores for the acquisition of samples. Most of these methods rely on aggregating the uncertainties of various entities within a data sample using max, sum or average functions (Roy et al., 2018; Brust et al., 2019). Aghdam et al (Aghdam et al., 2019) proposed a novel approach combining pixel-level scores to obtain an image-level score for active learning in the pedestrian detection task.

Several works have been proposed to incorporate reinforcement learning to learn an optimal acquisition function for active learning. The objective of these approaches is to model the active learning process as a Markov decision process by defining and designing suitable representations for states, actions, and rewards (Fang et al., 2017; Liu et al., 2018b; Haussmann et al., 2019). Liu et al (Liu et al., 2018a) proposed an imitation learning approach for active learning in tasks related to natural language processing, relying on an algorithmic expert to find an optimal acquisition function. Our work differs from that of Casanova et al (Casanova et al., 2020), who use a reinforced active learning approach for image segmentation, in the generalizability of our approach across various tasks. We also report the effectiveness of using weak learning on top of policy-based active learning in consuming the budget with maximum efficiency.

3. Proposed OPAD Framework

In this section, we describe the proposed Optimized Policy-based Active Learning Framework for Document Content Analysis, OPAD. Figure 1 shows the interface for OPAD, which enables various scenarios of detection tasks for human annotators. The underlying algorithm for OPAD is a Deep Query Network (DQN)-based reinforcement learning policy, optimized for data sample selection based on the performance metrics for the task. OPAD has two stages - policy training stage and deployment stage. In the policy training stage, OPAD is trained using simulated active learning cycles to maximize performance on a validation set. While deploying, the trained policy is used to make online batch selection for annotation. The overall formulation for OPAD is described below.

3.1. Formulation

The underlying objective for policy training in OPAD is to perform an iterative selection of the samples from an unlabelled pool that would maximally increase the performance of the model being trained, until the annotation budget is consumed. In each active learning cycle, the policy DQN (Mnih et al., 2013) selects a batch of samples, which are labelled and added to the set of labelled samples. The detection model is then trained for a fixed number of epochs using the expanded labelled set. The reward for the policy network for selecting the samples is the performance of the underlying model computed using a metric appropriate to the task (e.g. Average Precision for layout detection, and F-score for named entity recognition) on a separate held-out metric calculation set. The training of the policy is performed through episodes of active learning.

Notations Description
, , Train, Validation and Test sets of a given dataset
, , Unlabelled, labelled, and initial labelled sets
Candidate unlabelled examples for an active learning cycle
, Metric calculation set, State representation set
, , Action, State and Reward at time
, Policy deep Q network and Prediction model to be trained
, Memory buffer for Q learning, Total budget for active learning
, , Number of samples to be acquired in one active learning cycle, Number of samples in a pool, Number of samples labelled for initial training
Table 1. Notations used to represent various data splits and model components.

We now describe the various components of the proposed policy-based active learning approach in detail.

3.2. Data Splits

Given a dataset, we split the samples (or use the existing splits of the dataset) into train, validation and test sets. For the two stages of OPAD, the further splits are as follows.

During policy training stage

We separate a set of samples along with their labels from the train set, which is used for validating the performance of the underlying model and computing rewards for training the policy DQN (the metric calculation set). For the RL setup of the policy DQN, we use a held-out state representation set, which is later used together with the candidate set to compute the overall state representation. Note that, unlike (Casanova et al., 2020), we do not require labels for the state representation set, which further reduces the annotation budget. During this stage, we train the detection model on the labelled set, which is initialized with the initial labelled set and populated with samples from the unlabelled set as the active learning progresses. Here, the initial labelled set consists of randomly selected samples with the corresponding labels for initial training of the model. Therefore, before the active learning process starts, the labelled set equals the initial labelled set, and the unlabelled set contains the remaining train samples.

During deployment stage

We utilize the validation set for training the detection model. We make this differentiation from the policy training stage to ensure that sample selection by the policy happens on an unseen set. During this stage, we use the same terminology for the unlabelled, labelled and initial labelled sets as in the previous stage. However, the samples in the initial labelled set are selected from the validation set and therefore, at the start of the active learning process, the labelled set equals the initial labelled set, and the unlabelled set contains the remaining validation samples. We use the same set of examples for the state computation set. In this stage we do not require the metric calculation set.

Though we have ground truth annotations available for all the samples in all three sets, to simulate the annotation setup, we mask this data from both the policy and prediction models and utilize the labels as and when required.

3.3. Active Learning

Figure 2. Overview of the policy training in OPAD - (1) Candidate samples are chosen randomly from the unlabelled pool. (2) State representation is calculated using the candidate set and the state representation set, which is then passed to the policy DQN to select the samples to be annotated (3, 4 and 5). (6) The labelled set is then updated and the model is retrained. (7) Finally, the reward is computed using the metric calculation set.
Figure 3. Architecture of the proposed Deep Query Network, for the policy.
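1: Input: Train set, budget
2: Output: Policy DQN, trained for querying the samples for annotation
3: Randomly sample examples from the train set to form the metric calculation and state representation sets
4: Initialize policy and target DQN
5: Initialize memory replay buffer
6: while DQN loss has not converged do
7:     Initialize the prediction model
8:     Randomly sample examples from the train set to form the initial labelled set
9:     Initialize the labelled set to the initial labelled set
10:     Initialize the unlabelled set to the remaining train samples
11:     Train the prediction model on the labelled set
12:     Compute the performance metric on the metric calculation set
13:     while budget is not consumed do
14:         Sample candidates for labelling from the unlabelled set
15:         Compute state representation using predictions of the prediction model on the candidate and state representation sets
16:         Select samples from the candidate set using ε-greedy policy and add them to the labelled set - Action
17:         Retrain the prediction model on the labelled set
18:         Compute the metric on the metric calculation set
19:         Compute the reward as the difference in metric
20:         Re-do steps 14 and 15 - Next State
21:         Add tuple (state, action, reward, next state) to the memory replay buffer
22:         Optimize policy DQN
23:     end while
24: end while
Algorithm 1 Training the Policy DQN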
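The control flow of one episode (steps 7 to 23) can be sketched in plain Python. This is only an illustrative sketch under stated assumptions: `run_episode`, `train_and_score` and `select` are hypothetical names, the budget is assumed to count acquired samples beyond the seed set, and the replay buffer and DQN update are omitted.

```python
import random


def run_episode(pool, init_size, batch_size, budget, train_and_score, select):
    """Simulate one active learning episode and return (labelled, rewards).

    train_and_score(labelled) -> float : hypothetical stand-in for retraining the
        prediction model and scoring it on the metric calculation set.
    select(unlabelled, k) -> list      : hypothetical stand-in for the policy action.
    """
    pool = list(pool)
    random.shuffle(pool)
    # Seed the labelled set; everything else starts unlabelled.
    labelled, unlabelled = pool[:init_size], pool[init_size:]
    metric = train_and_score(labelled)
    rewards = []
    spent = 0
    while spent + batch_size <= budget and len(unlabelled) >= batch_size:
        chosen = select(unlabelled, batch_size)
        chosen_set = set(chosen)
        unlabelled = [x for x in unlabelled if x not in chosen_set]
        labelled.extend(chosen)
        new_metric = train_and_score(labelled)
        # Reward is the change in the task metric, as in step 19.
        rewards.append(new_metric - metric)
        metric = new_metric
        spent += batch_size
    return labelled, rewards
```

With a toy scorer such as `lambda labelled: len(labelled) / 100` and a selector that takes the first `k` unlabelled items, every cycle yields the same small positive reward, which is a convenient sanity check before plugging in a real model.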

Figure 2 shows an overview of active learning (the inner while loop in Algorithm 1) in a single episode of policy training. In an active learning cycle, we sample candidates from the unlabelled set for the current cycle. The policy DQN computes the Q-value for samples within each pool of candidates, based on the candidate set and the state representation set. The policy selection network is optimized to maximize the reward.

The annotator then annotates the selected samples, and the labelled set is updated by adding these new samples. We then retrain the model using the updated labelled set and finally calculate the reward for the current cycle by measuring the performance of the model on the metric calculation set.

The performance is measured in terms of the AP metric for the layout and object detection tasks, and the F-score for the named entity recognition task. Algorithm 1 summarizes the training phase of the proposed approach.

3.4. Policy Training Stage

Policy Network

Our policy network is a deep query network, as shown in Figure 3. The underlying prediction model computes the candidate and state representations from the candidate set and the state representation set respectively (details in Section 4.3). The policy network then receives these two inputs, which together we denote as the state representation in Figure 3. We pass the two representations through convolution layers, followed by a vector product of the state and candidate representations. The final Q-value is obtained by passing the combined representation through fully connected layers.

Policy Optimization

The computed Q-value is used for selecting samples at each step. For this, a memory or experience replay buffer, is created using MDP state representation tuples, (, , , ). Further, as a batch of needs to be selected, the candidate set, , is randomly partitioned into mini-batches, and action set is set to . The loss is then optimized as follows to train the policy network:


The values for are computed using a double DQN formulation (Haussmann et al., 2019) incorporating a target network, for stable training:


where γ is the discount factor for future reward, set to in our experiments.
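For orientation, the standard double DQN target can be sketched as follows. This is the textbook form, assumed here for illustration and possibly different in detail from the formulation used in this work; the function and argument names are hypothetical. The online network picks the best next action, and the target network evaluates it.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma, terminal=False):
    """Textbook double DQN target for a single transition.

    next_q_online : Q-values of the next state's actions from the online network
    next_q_target : Q-values of the same actions from the target network
    """
    if terminal or not next_q_online:
        return reward
    # Action selection uses the online network ...
    best = max(range(len(next_q_online)), key=lambda i: next_q_online[i])
    # ... while action evaluation uses the target network.
    return reward + gamma * next_q_target[best]


def td_loss(q_value, target):
    """Squared temporal difference error for one transition."""
    return (target - q_value) ** 2
```

Decoupling selection from evaluation is what reduces the overestimation bias of plain Q-learning, which is the usual motivation for keeping a separate target network.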

ε-greedy selection

To encourage exploration of diverse samples by the policy during training, an ε-greedy strategy is followed, which selects a random sample for the action with probability ε, instead of the sample maximizing the Q-value. The value of ε starts with for the initial cycle, and decreases by a factor of for subsequent cycles. For policy deployment, ε is set to 0. The gradient optimization is done using the temporal difference method (Sutton, 1988).
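A minimal sketch of this selection rule is given below. The helper names are hypothetical, and the multiplicative decay is an assumption about how the exploration probability shrinks between cycles; the starting value and decay factor are not specified here.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return the index of the selected action.

    With probability epsilon a uniformly random index is returned (exploration),
    otherwise the index with the highest Q-value (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def decay(epsilon, factor):
    """Assumed multiplicative decay of the exploration probability."""
    return epsilon * factor
```

Setting `epsilon` to 0 recovers the purely greedy behaviour used at deployment.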

3.5. Deployment Stage

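1: Input: Validation set, state representation set, trained policy DQN, budget
2: Randomly sample examples from the validation set to form the initial labelled set
3: Initialize the labelled set to the initial labelled set
4: Initialize the unlabelled set to the remaining validation samples
6: Train the prediction model on the labelled set
7: Compute the performance metric on the test set
8: while budget is not consumed do
9:     Sample candidates for labelling from the unlabelled set
10:     Compute state representation using predictions of the prediction model on the candidate and state representation sets
11:     Select samples from the candidate set using ε-greedy policy and add them to the labelled set - Action
12:     Retrain the prediction model on the labelled set
13:     Compute the metric on the test set and report
14: end while
Algorithm 2 Testing the Policy DQN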

Algorithm 2 summarizes the deployment stage (or policy testing stage). We freeze the parameters of the policy DQN in this stage. We iteratively select samples from the unlabelled set and train the prediction model. At the end of each active learning cycle we compute the performance of the model on the held-out test set and report the values in Section 4.

3.6. Weak labelling

In a usual annotation scenario (as shown in Figure 4 - top), the annotator has to mark all the entities present in a sample by drawing the bounding boxes and selecting labels for them. To reduce the annotation cost, we propose a weak labelling annotation framework (Figure 4 - bottom). Inspired by (Papadopoulos et al., 2016), the annotator is shown the document as well as the high confidence predictions from the model for that document. The annotator can then (1) add a missing box, (2) mark a box either correct or incorrect, and (3) mark a label either correct or incorrect for the associated box. The annotation interface for the weak labelling approach is shown in Figure 1.

Figure 4. Weak labelling in the case of layout detection. In the top image, the annotator has to draw and mark all the layout boxes, while in the bottom image, the annotator can verify the predictions of the model in the input image, and add new boxes. Image is best viewed in color.

The advantage of weak labelling is that it significantly reduces the annotation time. Annotation of a new entity by drawing a bounding box or selecting words takes seconds on an average in the case of detection tasks and seconds in case of named entity recognition. Verifying an entity takes seconds for layout detection task and seconds for named entity recognition. (All the mentioned values are average annotation times of individuals measured on the developed annotation tool.)

3.7. Additional Rewards

We propose the following additional rewards to improve the performance of the active learning approach.

  • Class balance reward: To reduce class imbalance in the newly acquired samples that are to be labelled, we propose an additional class distribution entropy reward, which reinforces a class-balanced selection of samples. The reward is the Shannon entropy (Shannon, 2001) of the probability distribution over the classes of the newly acquired samples.

  • Human feedback reward: In a weak labelling scenario, where the annotator can modify the output from the prediction model, a human feedback signal can be added at each active learning cycle while training the policy. The objective is to promote the selection of those samples for which the annotator heavily modifies the high confidence predictions of the model, because such samples would be more informative for the model. Accordingly, the additional human feedback reward for detection during training time is computed from the AP metric on the newly acquired samples after the annotator has verified the predictions, and the AP of the same samples before feedback.
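Both rewards can be illustrated with a short sketch. The function names are hypothetical; the entropy uses the natural logarithm, and writing the human feedback reward as the difference between the AP after and before feedback is an assumption made for illustration, since only the two AP terms are described.

```python
import math
from collections import Counter


def class_balance_reward(labels):
    """Shannon entropy of the class distribution of newly acquired samples.

    labels: iterable of class labels of the entities in the acquired batch.
    Higher entropy means a more balanced selection.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def human_feedback_reward(ap_after, ap_before):
    """Assumed form: gain in AP on the acquired samples due to annotator feedback."""
    return ap_after - ap_before
```

A batch drawn evenly from two classes scores log 2, while a batch from a single class scores 0, so the first reward grows as the selection covers more classes more evenly.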

4. Experiments and Results

In this section, we provide a comprehensive experimental evaluation of the proposed policy-based active learning approach on the document understanding tasks, document layout detection and named entity recognition. Furthermore, we also evaluate our models on Pascal VOC object detection task to demonstrate the generalizability of the proposed solution across different domains.

4.1. Datasets

We use the following datasets for the corresponding tasks:

  • GROTOAP2 (Tkaczyk et al., 2014) dataset is used for the complex document layout detection task. The dataset consists of 22 layout classes for scientific journals. We sampled two sets of images as training and validation sets. Among these, we hold out 10% for the reward computation set and 256 random samples for the state representation set, and use the remaining samples for the active learning setup. We use the validation set for simulating the active learning during the deployment phase and finally report the performance on a held-out subset of images. Further, we merged those classes having very few instances (e.g. glossary, equation, etc.) with the body content class, resulting in a modified dataset with 13 classes.

  • CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003) English Corpus is used for performing active learning experiments for the named entity recognition task. We use the sentences from the train set of CoNLL-2003 for training our policy in the active learning setup, after separating out 10% for the reward computation set and 512 sentences for the state representation set. We use the sentences of the dev set of this corpus to train the underlying model through active learning cycles in the deployment stage. Finally, we use the test set of CoNLL-2003 consisting of sentences for calculating and reporting the F-scores of the model during the deployment stage.

  • Pascal VOC-2007 (Everingham et al., [n.d.]) dataset with 20 object classes is used for the object detection task. We use the train set of VOC-2007 containing images during the policy training phase. Similar to layout detection task, we hold-out 10% for reward computation set and 256 random samples for and use remaining samples for the active learning setup. During the deployment phase, we utilize the val set of VOC-2007 containing images for simulating the active learning setup i.e selecting samples using trained model and training the model . We use the test set of VOC-2007 consisting of — samples for reporting the performance of model after each active learning cycle during the deployment stage.

We also use the following datasets for pre-training the underlying model :

  • PubLayNet (Zhong et al., 2019) We use this dataset for pre-training for document layout detection. This dataset contains over 360K page samples and has typical document layout elements such as text, title, list, figure, and table as the annotations. While the list, figure, table and title classes contain the corresponding information from the document, the text category consists of the rest of the content such as author, author affiliation; paper information; copyright information; abstract; paragraph in main text, footnote, and appendix; figure & table caption; table footnote.


  • MS-COCO (Lin et al., 2014) This dataset consists of 91 object classes. We use this dataset to pre-train the underlying detection model (i.e. Faster-RCNN model) in the case of object detection on the VOC dataset. We pre-train the model on this dataset and remove the last layers from both the class prediction and bounding box regression branches, which are class-specific.

4.2. Models and configurations

We use the Faster-RCNN model (Ren et al., 2015b) with RESNET-101 backbone (He et al., 2015) as the underlying prediction model for the layout detection and object detection tasks. The Faster-RCNN model is pre-trained on a subset of images from the PubLayNet (Zhong et al., 2019) dataset for the layout detection task, and on the MS-COCO (Lin et al., 2014) dataset for the object detection task, to bootstrap the active learning experiments. For the NER task, we use the BiLSTM-CRF (Huang et al., 2015) model.

For active learning, we use a seed set of labelled samples in case of detection tasks and labelled samples in case of NER for training the prediction model initially. Both the Faster-RCNN model and BiLSTM-CRF models are trained for iterations on the labelled set in an active learning cycle. In each of the active learning cycles we select samples in the case of detection tasks and samples in the case of NER, from unlabelled dataset for labelling giving a total of and labelled samples in a single episode for detection tasks and NER respectively. We run episodes of these active learning cycles to train the policy network. The learning rate for training the policy DQN is set to with a gamma value of . The learning rates of Faster-RCNN and BiLSTM-CRF are set to and respectively. We also apply a momentum of to optimize the training of policy network. We set the size of memory replay buffer to 1000 samples with first-in-first-out mechanism.

Figure 5. Plots showing the performance of the methods, viz. random, entropy, margin and proposed, for GROTOAP2, CoNLL-2003 and VOC-2007: panels (a) to (c) show strong labelling and panels (d) to (f) show weak labelling, in that dataset order.

4.3. MDP state representation

For the layout detection and object detection tasks, we use a randomly sampled set of images from the train set as the subset for representing the overall distribution of the dataset (). We pass each instance from the candidate () and state () subsets through the Faster-RCNN model, to get the top confident bounding box predictions. We concatenate the class scores for these top predictions to the feature map of RESNET-101 backbone to get a final representation (-dimension for VOC-2007, and -dimension for GROTOAP2) for each sample in the candidate and state subset sets.

For the NER task, we use a randomly sampled set of sentences from the train set of CoNLL-2003 as the state representation set. We pass each sentence from the candidate and state representation sets through the BiLSTM-CRF model and compute the final representation by taking the class scores of all the entities in the sentence. We pad each sentence to get a dimensional representation, and with classes using IOBES formatting (Ramshaw and Marcus, 1999), we generate a -dimensional representation for each sentence. The representations obtained from the samples in each of the two sets are stacked, and together the two stacked representations form the state representation in Figure 3.

4.4. Human Annotation Simulation

To simulate the role of a human annotator for weak labelling, we use the ground truths of the datasets on which we perform our experiments. In detection tasks (i.e. layout detection and object detection), we treat predictions that have an IoU greater than 0.5 with a ground truth box as boxes marked correct by the annotator. Ground truth boxes that do not have any prediction with IoU greater than 0.5 are included in the labelled set as full annotations (strong labels). In the case of named entity recognition, we compare the predictions with the ground truth, and the predictions which are correct are included in the weakly labelled set. The ground truth entities which are missed are added as strong labels.
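The detection case of this simulation can be sketched as below. The function names and the `(x1, y1, x2, y2)` box format are assumptions made for illustration, and class labels are ignored for brevity.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def simulate_annotator(predictions, ground_truth, threshold=0.5):
    """Split a sample into verified predictions and boxes needing a strong label.

    verified: predictions overlapping some ground truth box above the threshold
    strong:   ground truth boxes that no prediction covers above the threshold
    """
    verified = [p for p in predictions
                if any(iou(p, g) > threshold for g in ground_truth)]
    strong = [g for g in ground_truth
              if not any(iou(p, g) > threshold for p in predictions)]
    return verified, strong
```

Counting the two returned lists separately is what allows verification time and full annotation time to be charged at different rates when estimating annotation cost.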

Average time per cycle (minutes and seconds)
Method        GROTOAP2   CoNLL2003   VOC2007

Random        10m14s     0m29s       12m21s
Entropy Max   18m14s     1m15s       17m16s
Entropy Sum   18m01s     1m10s       15m53s
Margin        18m00s     1m05s       15m39s
OPAD          11m22s     0m53s       14m00s

Random        10m23s     0m35s       12m24s
Entropy Max   18m20s     1m20s       17m31s
Entropy Sum   18m10s     1m11s       15m26s
Margin        18m03s     1m10s       15m48s
OPAD          11m36s     1m00s       14m12s
Table 2. Time required to complete one active learning cycle, i.e., the selection of samples by each algorithm together with the training time of the model. Note that the training time of the model is constant across the algorithms, so the relative order reflects the sample selection time. OPAD is the fastest method after the simple random baseline.
Annotation time required (seconds)
Method        GROTOAP2   CoNLL2003   VOC2007

Random        72500      1650        9000
Entropy Max   81600      2100        10500
Entropy Sum   81200      2050        10000
Margin        92700      1950        11000
OPAD          66000      1000        7000

Random        38000      1100        4250
Entropy Max   39000      1500        2000
Entropy Sum   41000      1600        2500
Margin        48000      1300        7500
OPAD          33000      700         2250
Table 3. Annotation time required to reach an AP of 42.5 on GROTOAP2, an F1 score of 76.0 on CoNLL2003, and an AP of 45.5 on VOC-2007. These targets are the best performance levels that every method is able to reach on the respective dataset.

4.5. Results

We compare the performance of our proposed method with three baselines:

  • Random: Data samples from the unlabelled pool are randomly chosen for annotation.

  • Entropy (Roy et al., 2018): For entropy-based selection, the entropy of the class prediction probabilities produced by the model is first computed over all the entities of a data sample. We aggregate the entropies of a single sample in two ways: (1) the maximum entropy, and (2) the sum of the entropies of all detected entities within the sample. The samples with the highest aggregate entropy are selected for labelling.

  • Margin (Brust et al., 2019): Similar to entropy, a margin score is computed from the difference between the prediction probabilities of the highest and second-highest classes for all the instances of a sample. The maximum margin score over all the instances is then taken as the aggregate margin measure for the sample. Samples with the highest aggregate margin are selected for labelling.

The baseline metrics are as described in the existing prior art.
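The aggregation schemes used by the entropy and margin baselines can be sketched as below. This is a simplified illustration: each sample is assumed to be a list of per-detection class-probability vectors, the margin score is taken to be the raw difference between the two highest probabilities (the exact scoring used in the cited baseline is not given here, so this is an assumption), and samples without detections are assigned a score of 0. All function names are hypothetical.

```python
import math
from typing import List

Sample = List[List[float]]  # one class-probability vector per detected entity


def entropy(probs: List[float]) -> float:
    """Shannon entropy (natural log) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_max(sample: Sample) -> float:
    """Aggregate by the maximum entropy over all detections."""
    return max((entropy(p) for p in sample), default=0.0)


def entropy_sum(sample: Sample) -> float:
    """Aggregate by the sum of entropies over all detections."""
    return sum(entropy(p) for p in sample)


def margin(probs: List[float]) -> float:
    """Difference between the highest and second-highest class probability."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def margin_max(sample: Sample) -> float:
    """Aggregate by the maximum margin score over all detections."""
    return max((margin(p) for p in sample), default=0.0)


def select_top_k(scores: List[float], k: int) -> List[int]:
    """Indices of the k samples with the highest aggregate score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

A selection round would score every unlabelled sample with one of the aggregate functions and pass the scores to `select_top_k` with the per-cycle budget as `k`.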

Some active learning approaches in the literature have been proposed for content detection tasks, but they impose an implicit or explicit constraint of detecting only one object/element, or a fixed number of objects/elements, per input instance in order to generate viable fixed-length representations of instances for their active learning setups. Our proposed active learning setup imposes no such restriction and uses all the detections from the underlying model, which makes those approaches infeasible for a direct comparison.

Figure 5 shows the accuracy of all the methods on the test sets of the different datasets, for both strong and weak labelling settings. The proposed policy-based AL method significantly outperforms the baseline methods, which we attribute to the optimized selection policy, learned to reward better performance of the prediction model. While the curves for the VOC-2007 and CoNLL-2003 datasets approach saturation, we stop the GROTOAP2 training before reaching saturation, as our objective is to show the performance of the underlying model with a limited budget. Note that the proposed method uses the vanilla reward in all the plots in Figure 5.

Further, as shown in Table 2 and Table 3, the proposed method needs less annotation time than the baselines to reach the minimum best performance achievable by all the models in all but one case (Entropy Max on VOC-2007 in the second block of Table 3), while being second only to the random algorithm in sample selection time. The annotation times in Table 3 are computed as the number of samples selected for annotation multiplied by the average human annotation times mentioned in Section 3.6.

5. Ablation Study

In this section we discuss the importance of the proposed additional rewards in improving the performance of the proposed AL approach.

5.1. Class balance reward

We conduct ablations by adding the class distribution entropy reward (Equation 5) to the vanilla reward function. The overall reward function is:

$$R_{\text{overall}} = R_{\text{vanilla}} + \lambda \, R_{\text{cls\_ent}}$$

where $\lambda$ is a hyper-parameter, $R_{\text{vanilla}}$ is the vanilla reward, and $R_{\text{cls\_ent}}$ is the class distribution entropy reward. We compare the results of using this reward in our policy against the baselines and the vanilla policy in a strong labelling setting in Table 4. The overall reward improves performance over the vanilla reward policy most clearly on the NER task, where the F-score on CoNLL-2003 rises from 80.099 to 86.853 with a weight of 0.25. This suggests that reducing class imbalance in the samples selected by the policy is important.
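As a rough illustration of how a weighted class-balance term can be combined with a vanilla reward, the sketch below uses the Shannon entropy of the class distribution among the selected labels. Equation 5 is not reproduced in this section, so the entropy definition, the additive form and the names `class_distribution_entropy`, `overall_reward` and `lam` are assumptions made for the example rather than the paper's exact formulation.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def class_distribution_entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (natural log) of the empirical class distribution.

    Higher values mean the selected samples cover the classes more evenly.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def overall_reward(
    vanilla_reward: float, labels: Sequence[Hashable], lam: float = 0.25
) -> float:
    """Vanilla reward plus a weighted class-balance term (assumed additive)."""
    return vanilla_reward + lam * class_distribution_entropy(labels)
```

Under this sketch, a batch whose labels are evenly spread over the classes receives a larger bonus than a batch dominated by one class, which is the behaviour the class balance reward is meant to encourage.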

Method                 GROTOAP2 (AP)   CoNLL-2003 (F-score)   VOC-2007 (AP)
Random                 50.668          77.438                 47.490
Entropy Max            46.229          77.401                 46.671
Entropy Sum            46.634          77.351                 47.431
Margin                 47.428          77.598                 47.179
OPAD                   51.508          80.099                 48.061
OPAD (ClsEnt - 0.25)   53.241          86.853                 47.727
OPAD (ClsEnt - 0.50)   51.185          86.541                 47.701
OPAD (ClsEnt - 0.75)   52.143          86.512                 48.566
OPAD (ClsEnt - 1.0)    51.530          86.395                 48.060
Table 4. Performance of our method on test data with the class distribution entropy reward on various datasets. The reported results are after consuming a total budget of 1152 samples for GROTOAP2 and VOC-2007, and 350 samples for the CoNLL-2003 dataset.

5.2. Human feedback reward

In this experiment, we report the effect of adding the human feedback reward to the vanilla reward, i.e.

$$R_{\text{overall}} = R_{\text{vanilla}} + \lambda \, R_{\text{feedback}}$$

where $\lambda$ is a hyper-parameter. We report the results of using this overall reward in our policy in Table 5, along with the baselines and the vanilla policy, in a weak labelling setup. A small weight on the feedback reward (0.1) improves AP over the vanilla policy on both datasets, from 45.813 to 48.524 on GROTOAP2 and from 46.708 to 47.238 on VOC-2007, whereas larger weights give little or no gain. Thus, the reward can be useful in some cases; however, further investigation is needed to establish its efficacy. Note that the proposed human feedback reward applies only to detection tasks because of the IoU component discussed earlier. Therefore, we do not perform this experiment on the NER task.

Method                   GROTOAP2   VOC2007
Random                   44.127     45.541
Entropy Max              44.951     46.433
Entropy Sum              43.842     46.437
Margin                   42.690     45.639
OPAD                     45.813     46.708
OPAD (Feedback - 0.1)    48.524     47.238
OPAD (Feedback - 0.25)   46.266     46.835
OPAD (Feedback - 0.40)   44.899     46.646
OPAD (Feedback - 0.70)   44.839     46.071
OPAD (Feedback - 1.0)    44.110     46.304
Table 5. Performance of our method with the human feedback reward for weak labelling on GROTOAP2 and VOC2007. Values are AP after consuming a total budget of 1152 samples.

6. Conclusion and Future Works

We present a robust policy-based method for the active learning task in complex content detection problems. The problem of active learning in detection is formulated using a DQN-based sampling network, optimized for the performance metrics of the underlying model. We extend the active learning setting to weak labelling, and propose rewards for class balance and human feedback. To the best of our knowledge, this is the first work to optimize active learning for detection tasks in documents. We further show the efficacy of the proposed methods on two document analysis tasks and a related object detection task by evaluating our models on three large datasets, with significantly better performance than the baselines. As a future direction, we would like to improve the DQN by exploiting recent advances in the field. Currently, our model uses generic features that are common across the different tasks. We would look into document-specific details that could further improve the performance of our models on layout detection and named entity recognition tasks. We would also like to investigate Generative Adversarial Networks or self-supervision for learning from unlabelled data.


References

  • Adhikari et al. (2019) Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. Docbert: Bert for document classification. arXiv preprint arXiv:1904.08398 (2019).
  • Aggarwal et al. (2014) Charu C. Aggarwal, Xiangnan Kong, Quanquan Gu, Jiawei Han, and Philip S. Yu. 2014. Active learning: A survey. CRC Press, 571–605.
  • Aghdam et al. (2019) Hamed H. Aghdam, Abel Gonzalez-Garcia, Antonio Lopez, and Joost Weijer. 2019. Active Learning for Deep Detection Neural Networks. 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (Oct 2019).
  • Audebert et al. (2019) Nicolas Audebert, Catherine Herold, Kuider Slimani, and Cédric Vidal. 2019. Multimodal deep networks for text and image-based document classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 427–443.
  • Binmakhashen and Mahmoud (2019) Galal M Binmakhashen and Sabri A Mahmoud. 2019. Document layout analysis: a comprehensive survey. ACM Computing Surveys (CSUR) 52, 6 (2019), 1–36.
  • Bouguelia et al. (2013) Mohamed-Rafik Bouguelia, Yolande Belaïd, and Abdel Belaïd. 2013. A stream-based semi-supervised active learning approach for document classification. In 2013 12th International Conference on Document Analysis and Recognition. IEEE, 611–615.
  • Brust et al. (2019) Clemens-Alexander Brust, Christoph Käding, and Joachim Denzler. 2019. Active Learning for Deep Object Detection. Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (2019).
  • Brust et al. (2020) Clemens-Alexander Brust, Christoph Käding, and Joachim Denzler. 2020. Active and Incremental Learning with Weak Supervision. KI - Künstliche Intelligenz 34, 2 (Jan 2020), 165–180.
  • Casanova et al. (2020) Arantxa Casanova, Pedro O. Pinheiro, Negar Rostamzadeh, and Christopher J. Pal. 2020. Reinforced active learning for image segmentation. In International Conference on Learning Representations.
  • Choudhary et al. (2020) Sneha Choudhary, Haritha Guttikonda, Dibyendu Roy Chowdhury, and Gerard P Learmonth. 2020. Document Retrieval Using Deep Learning. In 2020 Systems and Information Engineering Design Symposium (SIEDS). IEEE, 1–6.
  • Desai et al. (2019) Sai Vikas Desai, Akshay Chandra Lagandula, Wei Guo, Seishi Ninomiya, and Vineeth N. Balasubramanian. 2019. An Adaptive Supervision Framework for Active Learning in Object Detection. In BMVC.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186.
  • Ducoffe and Precioso (2018) Melanie Ducoffe and Frederic Precioso. 2018. Adversarial Active Learning for Deep Networks: a Margin Based Approach. arXiv:1802.09841 [cs.LG]
  • Everingham et al. ([n.d.]) M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. [n.d.]. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results.
  • Fang et al. (2017) Meng Fang, Yuan Li, and Trevor Cohn. 2017. Learning how to Active Learn: A Deep Reinforcement Learning Approach. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (2017).
  • Freytag et al. (2014) Alexander Freytag, Erik Rodner, and Joachim Denzler. 2014. Selecting Influential Examples: Active Learning with Expected Model Output Changes. In Computer Vision – ECCV 2014, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 562–577.
  • Gal et al. (2017) Yarin Gal, Riashat Islam, and Zoubin Ghahramani. 2017. Deep Bayesian Active Learning with Image Data. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, International Convention Centre, Sydney, Australia, 1183–1192.
  • Godbole et al. (2004) Shantanu Godbole, Abhay Harpale, Sunita Sarawagi, and Soumen Chakrabarti. 2004. Document classification through interactive supervision of document and term labels. In European Conference on Principles of Data Mining and Knowledge Discovery. Springer, 185–196.
  • Grüning et al. (2019) Tobias Grüning, Gundram Leifert, Tobias Strauß, Johannes Michael, and Roger Labahn. 2019. A two-stage method for text line detection in historical documents. International Journal on Document Analysis and Recognition (IJDAR) 22, 3 (2019), 285–302.
  • Guo and Schuurmans (2007) Yuhong Guo and Dale Schuurmans. 2007. Discriminative Batch Mode Active Learning. In NIPS. Citeseer, 593–600.
  • Harley et al. (2015) Adam W Harley, Alex Ufkes, and Konstantinos G Derpanis. 2015. Evaluation of deep convolutional nets for document image classification and retrieval. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 991–995.
  • Haussmann et al. (2019) Manuel Haussmann, Fred Hamprecht, and Melih Kandemir. 2019. Deep Active Learning with Adaptive Acquisition. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (Aug 2019).
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs.CV]
  • Houlsby et al. (2012) Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, and Jose M. Hernández-lobato. 2012. Collaborative Gaussian Processes for Preference Learning. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2096–2104.
  • Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
  • Jain and Wigington (2019) Rajiv Jain and Curtis Wigington. 2019. Multimodal Document Image Classification. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 71–77.
  • Lewis et al. (2006) David Lewis, Gady Agam, Shlomo Argamon, Ophir Frieder, David Grossman, and Jefferson Heard. 2006. Building a test collection for complex document information processing. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. 665–666.
  • Li et al. (2020) Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. 2020. Docbank: A benchmark dataset for document layout analysis. arXiv preprint arXiv:2006.01038 (2020).
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, 740–755.
  • Liu et al. (2018a) Ming Liu, Wray Buntine, and Gholamreza Haffari. 2018a. Learning How to Actively Learn: A Deep Imitation Learning Approach. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 1874–1883.
  • Liu et al. (2018b) Ming Liu, Wray Buntine, and Gholamreza Haffari. 2018b. Learning to Actively Learn Neural Machine Translation. In Proceedings of the 22nd Conference on Computational Natural Language Learning. Association for Computational Linguistics, Brussels, Belgium, 334–344.
  • Liu et al. (2019) Z. Liu, J. Wang, S. Gong, D. Tao, and H. Lu. 2019. Deep Reinforcement Active Learning for Human-in-the-Loop Person Re-Identification. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 6121–6130.
  • Luo et al. (2018) Ling Luo, Zhihao Yang, Pei Yang, Yin Zhang, Lei Wang, Hongfei Lin, and Jian Wang. 2018. An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics 34, 8 (2018), 1381–1388.
  • Mayer and Timofte (2020) Christoph Mayer and Radu Timofte. 2020. Adversarial Sampling for Active Learning. 2020 IEEE Winter Conference on Applications of Computer Vision (WACV) (2020), 3060–3068.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. ArXiv abs/1312.5602 (2013).
  • Oliveira et al. (2018) Sofia Ares Oliveira, Benoit Seguin, and Frederic Kaplan. 2018. dhSegment: A generic deep-learning approach for document segmentation. In 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR). IEEE, 7–12.
  • Papadopoulos et al. (2016) D. P. Papadopoulos, J. R. R. Uijlings, F. Keller, and V. Ferrari. 2016. We Don’t Need No Bounding-Boxes: Training Object Class Detectors Using Only Human Verification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 854–863.
  • Papadopoulos et al. (2017) Dim P. Papadopoulos, Jasper R. R. Uijlings, Frank Keller, and Vittorio Ferrari. 2017. Training Object Class Detectors with Click Supervision. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017).
  • Pappagari et al. (2019) Raghavendra Pappagari, Piotr Zelasko, Jesús Villalba, Yishay Carmiel, and Najim Dehak. 2019. Hierarchical transformers for long document classification. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 838–844.
  • Ramshaw and Marcus (1999) Lance A Ramshaw and Mitchell P Marcus. 1999. Text chunking using transformation-based learning. In Natural language processing using very large corpora. Springer, 157–176.
  • Ren et al. (2015a) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015a. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 91–99.
  • Ren et al. (2015b) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015b. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91–99.
  • Roy et al. (2018) Soumya Roy, Asim Unmesh, and Vinay P Namboodiri. 2018. Deep active learning for object detection. In BMVC. 91.
  • Settles (2009) Burr Settles. 2009. Active learning literature survey. Technical Report. University of Wisconsin-Madison Department of Computer Sciences.
  • Shannon (2001) Claude Elwood Shannon. 2001. A mathematical theory of communication. ACM SIGMOBILE mobile computing and communications review 5, 1 (2001), 3–55.
  • Shen et al. (2017) Yanyao Shen, Hyokun Yun, Zachary Lipton, Yakov Kronrod, and Animashree Anandkumar. 2017. Deep Active Learning for Named Entity Recognition. Proceedings of the 2nd Workshop on Representation Learning for NLP (2017).
  • Sugathadasa et al. (2018) Keet Sugathadasa, Buddhi Ayesha, Nisansa de Silva, Amal Shehan Perera, Vindula Jayawardana, Dimuthu Lakmal, and Madhavi Perera. 2018. Legal document retrieval using document vector embeddings and deep learning. In Science and information conference. Springer, 160–175.
  • Sutton (1988) Richard S Sutton. 1988. Learning to predict by the methods of temporal differences. Machine learning 3, 1 (1988), 9–44.
  • Tjong Kim Sang and De Meulder (2003) Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4. 142–147.
  • Tkaczyk et al. (2014) Dominika Tkaczyk, Pawel Szostek, and Lukasz Bolikowski. 2014. GROTOAP2 - The Methodology of Creating a Large Ground Truth Dataset of Scientific Articles. D-Lib Mag. 20 (2014).
  • Trabelsi et al. (2021) Mohamed Trabelsi, Zhiyu Chen, Brian D Davison, and Jeff Heflin. 2021. Neural Ranking Models for Document Retrieval. arXiv preprint arXiv:2102.11903 (2021).
  • Wang and Shang (2014) D. Wang and Y. Shang. 2014. A new active labeling method for deep learning. In 2014 International Joint Conference on Neural Networks (IJCNN). 112–119.
  • Wang et al. (2017) Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. 2017. Cost-Effective Active Learning for Deep Image Classification. IEEE Transactions on Circuits and Systems for Video Technology 27, 12 (Dec 2017), 2591–2600.
  • Wang et al. (2020) Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. 2020. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys (CSUR) 53, 3 (2020), 1–34.
  • Wilson and Cook (2020) Garrett Wilson and Diane J Cook. 2020. A survey of unsupervised deep domain adaptation. ACM Transactions on Intelligent Systems and Technology (TIST) 11, 5 (2020), 1–46.
  • Xu et al. (2020a) Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020a. Layoutlm: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1192–1200.
  • Xu et al. (2020b) Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, et al. 2020b. LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding. arXiv preprint arXiv:2012.14740 (2020).
  • Yang and Mitchell (2016) Bishan Yang and Tom Mitchell. 2016. Joint extraction of events and entities within a document context. arXiv preprint arXiv:1609.03632 (2016).
  • Yang et al. (2021) Xiangli Yang, Zixing Song, Irwin King, and Zenglin Xu. 2021. A Survey on Deep Semi-supervised Learning. arXiv preprint arXiv:2103.00550 (2021).
  • Yoo and Kweon (2019) Donggeun Yoo and In So Kweon. 2019. Learning Loss for Active Learning. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Jun 2019).
  • Zhong et al. (2019) Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. 2019. Publaynet: largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 1015–1022.