Different approaches utilize different training datasets, optimization strategies (e.g., optimizers, learning rate schedules, epoch numbers, pre-trained weights, and data augmentation pipelines), and network designs (e.g., network architectures and losses). To encompass the diversity of components used in various models, we propose the MMOCR toolbox, which covers recent popular text detection, recognition and understanding approaches in a unified framework. As of now, the toolbox implements seven text detection methods, five text recognition methods, one key information extraction method and one named entity recognition method. Integrating various algorithms promotes code reusability and therefore dramatically simplifies the implementation of algorithms. Moreover, the unified framework allows different approaches to be compared against each other fairly and their key effective components to be easily investigated. To the best of our knowledge, MMOCR reimplements the largest number of deep learning-based text detection and recognition approaches among open-source toolboxes, and we believe it will facilitate future research on text detection, recognition and understanding.
Extracting structured information such as “shop name”, “shop address” and “total payment” in receipt images, and “name” and “organization name” in document images plays an important role in many practical scenarios. For example, in the case of office automation, such structured information is useful for efficient archiving or compliance checking. To provide a comprehensive pipeline for practical applications, MMOCR reimplements not only text detection and text recognition approaches, but also their downstream tasks such as key information extraction and named entity recognition as illustrated in Figure 1. In this way, MMOCR can meet the document image processing requirements in a one-stop-shopping manner.
MMOCR is publicly released at https://github.com/open-mmlab/mmocr under the Apache-2.0 License. The repository contains all the source code and detailed documentation, including installation instructions, dataset preparation scripts, API documentation, a model zoo, tutorials and a user manual. MMOCR reimplements more than ten state-of-the-art text detection, recognition, and understanding algorithms, and provides extensive benchmarks and models trained on popular academic datasets. To support multilingual OCR tasks, MMOCR also releases Chinese text recognition models trained on industrial datasets (https://github.com/chineseocr/chineseocr). In addition to (distributed) training and testing scripts, MMOCR offers a rich set of utility tools covering visualization, demonstration and deployment. The models provided by MMOCR can be easily converted to ONNX (https://github.com/onnx/onnx), which is widely supported by deployment frameworks and hardware devices. MMOCR is therefore useful for both academic researchers and industrial developers.
2. Related Work
Text detection. Text detection aims to localize the bounding boxes of text instances (He et al., 2017; Liu et al., 2019; Yue et al., 2018; Duan et al., 2019; Zhu et al., 2021; Wang et al., 2019a). Recent research focus has shifted to the more challenging task of arbitrary-shaped text detection (Duan et al., 2019; Zhu et al., 2021). While Mask R-CNN (He et al., 2017; Liu et al., 2019) can be used to detect texts, it might fail to detect curved and dense texts due to its rectangle-based ROI proposals. On the other hand, TextSnake (Long et al., 2018) describes text instances with a series of ordered, overlapping disks. PSENet (Wang et al., 2019b) proposes a progressive scale expansion network which enables the differentiation of curved text instances that are located close together. DB (Liao et al., 2020) simplifies the post-processing of binarization for scene-text segmentation by adding a differentiable binarization function to a segmentation network, where the threshold value at every point of the probability map of an image can be adaptively predicted.
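As a minimal sketch of the differentiable binarization idea, the approximate binary map can be written as B = 1 / (1 + exp(-k (P - T))), where P is the probability map, T is the adaptively predicted threshold map and k is an amplifying factor (50 in the DB paper). The toy maps and the function name below are illustrative assumptions, not MMOCR code.

```python
import math


def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Apply B = 1 / (1 + exp(-k * (P - T))) to every pixel.

    prob_map and thresh_map are equally sized 2D lists with values in [0, 1].
    """
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]


# Toy 2x2 maps (hypothetical values, for illustration only).
prob_map = [[0.9, 0.2], [0.5, 0.6]]
thresh_map = [[0.3, 0.3], [0.5, 0.7]]
approx_binary = differentiable_binarization(prob_map, thresh_map)
```

Pixels whose probability clearly exceeds the local threshold map to values close to 1, pixels clearly below it map to values close to 0, and the function stays differentiable, so the threshold map can be learned together with the segmentation network.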
Text recognition. Text recognition has gained increasing attention due to its ability to extract rich semantic information from text images. Convolutional Recurrent Neural Network (CRNN) (Shi et al., 2016a) is an end-to-end trainable neural network which consists of a Deep Convolutional Neural Network (DCNN) for feature extraction, a Recurrent Neural Network (RNN) for sequential prediction and a transcription layer to produce a label sequence. RobustScanner (Yue et al., 2020) is capable of recognizing contextless texts by using a novel position enhancement branch and a dynamic fusion module, which mitigate the misrecognition of random text images. Efforts have also been made to rectify irregular text inputs into regular ones that are compatible with typical text recognizers. For instance, Thin-Plate-Spline (TPS) transformation is employed in a deep neural network that combines a Spatial Transformer Network (STN) and a Sequence Recognition Network (SRN) to rectify curved and perspective texts in the STN before they are fed into the SRN (Shi et al., 2018).
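The transcription step of a CRNN-style recognizer can be illustrated with greedy CTC decoding: take the best class per frame, merge consecutive repeats, then drop blanks. The sketch below is a simplified illustration, assuming that index 0 is the blank symbol and using a toy four-symbol alphabet; it is not the MMOCR implementation.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy (best-path) CTC decoding of per-frame class scores."""
    # Pick the highest-scoring class index for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    decoded = []
    previous = blank
    for index in best_path:
        # Emit a character only when it is not blank and not a repeat
        # of the immediately preceding frame.
        if index != blank and index != previous:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Toy alphabet: "-" stands for the blank symbol at index 0 (an assumption).
alphabet = "-abc"
path = [3, 3, 0, 1, 1, 0, 2]
frame_scores = [[1.0 if j == i else 0.0 for j in range(4)] for i in path]
text = ctc_greedy_decode(frame_scores, alphabet)
```

Here the frame-wise path `c c - a a - b` collapses to `cab`. A blank between two identical characters keeps them separate, which is how repeated letters survive the merge step.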
Key information extraction. Key Information Extraction (KIE) for unstructured document images, such as receipts or credit notes, is most notably used for office automation tasks including efficient archiving and compliance checking. Conventional approaches, such as template matching, fail to generalize well to documents with unseen templates. Several models have been proposed to resolve the generalization problem. For example, CloudScan (Palm et al., 2017) employs NER to analyze the concatenated one-dimensional text sequence for the entire invoice. Chargrid (Faddoul et al., 2018) encodes each document page as a two-dimensional grid of characters to conduct semantic segmentation, but it cannot make full use of the non-local, distant spatial relations between text regions since it captures two-dimensional spatial layout information only within a small neighborhood. Recently, an end-to-end Spatial Dual Modality Graph Reasoning (SDMG-R) model (Sun et al., 2021) has been developed which is particularly robust against text recognition errors. It models unstructured document images as spatial dual-modality graphs with graph nodes as detected text boxes and graph edges as spatial relations between nodes.
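The graph view described above can be sketched as follows: every detected text box becomes a node, and every ordered pair of boxes gets an edge carrying a spatial relation. The edge feature used here (normalized offsets plus width and height ratios), the `TextBox` type and the sample boxes are hypothetical choices for illustration and are not necessarily those used by SDMG-R.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class TextBox:
    text: str
    x: float  # left coordinate
    y: float  # top coordinate
    w: float  # width
    h: float  # height


def spatial_relation(a, b, norm):
    """Hypothetical edge feature: offset of b relative to a, plus size ratios."""
    return ((b.x - a.x) / norm, (b.y - a.y) / norm, b.w / a.w, b.h / a.h)


def build_graph(boxes, norm=1000.0):
    """Return (nodes, edges) of a fully connected directed graph over boxes."""
    nodes = list(boxes)
    edges = {
        (i, j): spatial_relation(nodes[i], nodes[j], norm)
        for i, j in permutations(range(len(nodes)), 2)
    }
    return nodes, edges


# Made-up receipt boxes, for illustration only.
boxes = [
    TextBox("TOTAL", x=100, y=800, w=120, h=30),
    TextBox("12.50", x=400, y=800, w=60, h=30),
    TextBox("SHOP NAME", x=100, y=50, w=240, h=40),
]
nodes, edges = build_graph(boxes)
```

In this toy receipt, the edge from "TOTAL" to "12.50" records that the value sits on the same row to the right of the key, which is the kind of spatial cue a graph reasoning module can exploit even when the recognized text contains errors.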
Named entity recognition. Named entity recognition (NER) aims to locate and classify named entities into pre-defined categories such as the name of a person or organization. Existing approaches are typically based on either bidirectional LSTMs or conditional random fields.
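To make the "locate and classify" output of named entity recognition concrete, the sketch below converts a per-token tag sequence into typed entity spans. It assumes the common BIO tagging scheme, which the text above does not specify, and the tokens and tags are made-up examples.

```python
def decode_bio(tokens, tags):
    """Group tokens into (entity_type, text) pairs from BIO tags."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" tags and inconsistent "I-" tags close the current entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


tokens = ["Alice", "Smith", "works", "at", "Acme", "Corp"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]
entities = decode_bio(tokens, tags)
```

For this example the decoder yields one person entity ("Alice Smith") and one organization entity ("Acme Corp").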
Table 1: Comparison of open-source OCR toolboxes.

| | Tesseract | ChineseOCR | ChineseOCR_lite | EasyOCR | PaddleOCR | MMOCR |
|---|---|---|---|---|---|---|
| Inference engine | - | OpenCV DNN | NCNN, TNN | PyTorch | Paddle inference, Paddle lite | PyTorch, onnx runtime |
| Detection | conventional | YOLOV3 (Redmon and Farhadi, 2018) | DB (Liao et al., 2020) | CRAFT (Baek et al., 2019) | EAST (Zhou et al., 2017), DB (Liao et al., 2020), SAST (Wang et al., 2019a) | MaskRCNN (He et al., 2017), PAN (Wang et al., 2019c), PSENet (Wang et al., 2019b), DB (Liao et al., 2020), TextSnake (Long et al., 2018), DRRG (Zhang et al., 2020), FCENet (Zhu et al., 2021) |
| Recognition | conventional, LSTM | CRNN (Shi et al., 2016a) | CRNN (Shi et al., 2016a) | CRNN (Shi et al., 2016a) | CRNN (Shi et al., 2016a), Rosetta (Borisyuk et al., 2019), SRN (Yu et al., 2020a), Star-Net (Liu et al., 2016b), RARE (Shi et al., 2016b) | CRNN (Shi et al., 2016a), RobustScanner (Yue et al., 2020), SAR (Li et al., 2019), SegOCR (Yue et al., 2021), Transformer (Li et al., 2019) |
| Downstream tasks | | | | | | KIE, NER |
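Table 2: The effects of backbones. All models are pre-trained on ImageNet, trained on the ICDAR2015 training set and evaluated on its test set.

| Backbone | FLOPs | PSENet Recall | PSENet Precision | PSENet H-mean | PAN Recall | PAN Precision | PAN H-mean | DB Recall | DB Precision | DB H-mean |
|---|---|---|---|---|---|---|---|---|---|---|
| ddrnet23-slim (Hong et al., 2021) | 16.7G | 75.2 | 80.1 | 77.6 | 72.3 | 83.4 | 77.5 | 76.7 | 78.5 | 77.6 |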
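Table 3: The effects of necks. All models are pre-trained on ImageNet, trained on the ICDAR2015 training set and evaluated on its test set.

| Neck | FLOPs | PSENet Recall | PSENet Precision | PSENet H-mean | PAN Recall | PAN Precision | PAN H-mean | DB Recall | DB Precision | DB H-mean |
|---|---|---|---|---|---|---|---|---|---|---|
| FPNF (Wang et al., 2019b) | 208.6G | 78.4 | 83.1 | 80.7 | 72.4 | 86.4 | 78.8 | 77.5 | 82.3 | 79.8 |
| PFNC (Liao et al., 2020) | 22.4G | 75.6 | 80.0 | 77.7 | 70.9 | 83.3 | 76.6 | 73.1 | 87.1 | 79.5 |
| FPEM_FFM (Wang et al., 2019c) | 7.79G | 71.7 | 82.0 | 76.5 | 73.4 | 85.6 | 79.1 | 71.8 | 86.7 | 78.6 |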
Open source OCR toolbox. Several open-source OCR toolboxes have been developed over the years to meet the increasing demand from both academia and industry. Tesseract (https://github.com/tesseract-ocr/tesseract) is the pioneering open-source OCR toolbox. It was publicly released in 2005, and provides CLI tools to extract printed text from images. It initially followed a traditional, step-by-step pipeline comprising connected component analysis, text line finding, baseline fitting, fixed pitch detection and chopping, proportional word finding, and word recognition (Smith, 2007). It now provides an LSTM-based OCR engine and supports more than 100 languages. EasyOCR (https://github.com/JaidedAI/EasyOCR), a deep learning-based open-source OCR toolbox, has been released recently. It provides simple APIs for industrial users and supports more than 80 languages. It implements the CRAFT (Baek et al., 2019) detector and the CRNN (Shi et al., 2016a) recognizer. However, it is for inference only and does not support model training. ChineseOCR (https://github.com/chineseocr/chineseocr) is another popular open-source OCR toolbox. It uses YOLO-v3 (Redmon and Farhadi, 2018) and CRNN (Shi et al., 2016a) for text detection and recognition respectively, and uses OpenCV DNN for deep model inference. By contrast, ChineseOCR_lite (https://github.com/DayBreak-u/chineseocr_lite) is a lightweight Chinese detection and recognition toolbox that uses DB (Liao et al., 2020) to detect texts and CRNN (Shi et al., 2016a) to recognize texts. It provides forward inference based on NCNN (https://github.com/Tencent/ncnn) and TNN (https://github.com/Tencent/TNN), and can be deployed easily on multiple platforms such as Windows, Linux and Android. PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR) is a practical open-source OCR toolbox based on PaddlePaddle and can be deployed on multiple platforms such as Linux, Windows and MacOS. It currently supports more than 80 languages and implements three text detection methods (EAST (Zhou et al., 2017), DB (Liao et al., 2020), and SAST (Wang et al., 2019a)), five recognition methods (CRNN (Shi et al., 2016a), Rosetta (Borisyuk et al., 2019), STAR-Net (Liu et al., 2016b), RARE (Shi et al., 2016b) and SRN (Yu et al., 2020a)), and one end-to-end text spotting method (PGNet (Wang et al., 2021)). Comprehensive comparisons among these open-source toolboxes are given in Table 1.
3. Text Detection Studies
Many important factors can affect the performance of deep learning-based models. In this section, we investigate the backbones and necks of network architectures. We exchange these components between different segmentation-based text detection approaches to measure their effects on performance and computational complexity.
Backbone. ResNet18 (He et al., 2016) and ResNet50 (He et al., 2016) are commonly used in text detection approaches. For practical applications, we also introduce a GPU-friendly lightweight backbone, ddrnet23-slim (Hong et al., 2021). Table 2 compares ResNet18, ResNet50 and ddrnet23-slim in terms of FLOPs and H-mean by plugging them into PSENet, PAN and DB. The results show that ddrnet23-slim performs slightly worse than ResNet18 and ResNet50, while requiring only 45% and 21% of their FLOPs, respectively.
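The H-mean reported in these comparisons is the harmonic mean of precision and recall. As a quick consistency check, the sketch below recomputes it from the ddrnet23-slim row of Table 2, where each group of three numbers is two detection scores followed by the reported H-mean. Because the harmonic mean is symmetric, the check does not depend on which of the two scores is precision and which is recall.

```python
def h_mean(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (score, score, reported H-mean) triples from the ddrnet23-slim row of Table 2.
ddrnet_row = [(75.2, 80.1, 77.6), (72.3, 83.4, 77.5), (76.7, 78.5, 77.6)]

# Recompute the H-mean for each triple and round to one decimal place.
recomputed = [round(h_mean(a, b), 1) for a, b, _ in ddrnet_row]
```

Rounded to one decimal place, the recomputed values agree with the reported H-means of all three triples.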
Neck. PSENet, PAN and DB propose different FPN-like necks to fuse multi-scale features. Our experimental results in Table 3 show that the FPNF proposed in PSENet (Wang et al., 2019b) can achieve the best H-mean in PSENet and DB (Liao et al., 2020). However, its FLOPs are substantially higher than those of PFNC proposed in DB (Liao et al., 2020) and FPEM_FFM proposed in PAN (Wang et al., 2019c). By contrast, FPEM_FFM has the lowest FLOPs and achieves the best H-mean in PAN (Wang et al., 2019c).
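The neck comparison can be reproduced from the Table 3 rows with a few lines of code. The sketch below assumes that the three H-mean values in each row belong to PSENet, PAN and DB in that order, which agrees with the statements in the text. The dictionary layout and variable names are illustrative.

```python
# FLOPs (in G) and H-mean per detector, copied from the Table 3 rows.
necks = {
    "FPNF": {"gflops": 208.6, "h_mean": {"PSENet": 80.7, "PAN": 78.8, "DB": 79.8}},
    "PFNC": {"gflops": 22.4, "h_mean": {"PSENet": 77.7, "PAN": 76.6, "DB": 79.5}},
    "FPEM_FFM": {"gflops": 7.79, "h_mean": {"PSENet": 76.5, "PAN": 79.1, "DB": 78.6}},
}

# Neck with the highest H-mean for each detector.
best_neck = {
    detector: max(necks, key=lambda name: necks[name]["h_mean"][detector])
    for detector in ("PSENet", "PAN", "DB")
}

# Neck with the lowest computational cost.
cheapest_neck = min(necks, key=lambda name: necks[name]["gflops"])

# How many times more FLOPs the most expensive neck needs than the cheapest.
flops_ratio = necks["FPNF"]["gflops"] / necks[cheapest_neck]["gflops"]
```

With these numbers FPNF is selected for PSENet and DB, FPEM_FFM is selected for PAN and is also the cheapest neck, and FPNF needs roughly 27 times the FLOPs of FPEM_FFM.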
We have publicly released MMOCR, a comprehensive toolbox for text detection, recognition and understanding. MMOCR implements 14 state-of-the-art algorithms, more than any other existing open-source OCR project. Moreover, it offers a wide range of trained models, benchmarks, detailed documentation, and utility tools. In this report, we have extensively compared MMOCR with other open-source OCR projects. In addition, we have introduced a GPU-friendly lightweight backbone, ddrnet23-slim, and carefully studied the effects of backbones and necks on detection performance and computational complexity, which can guide industrial applications.
This work was supported by the Shanghai Committee of Science and Technology, China (Grant No. 20DZ1100800).
- Character region awareness for text detection. In CVPR, pp. 9365–9374.
- Rosetta: Large scale system for text detection and recognition in images. In ACM SIGKDD, pp. 71–79.
- Named Entity Recognition with Bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics 4, pp. 357–370.
- Geometry normalization networks for accurate scene text detection. In ICCV, pp. 9136–9145.
- Chargrid: Towards Understanding 2D Documents. In EMNLP, pp. 4459–4469.
- Fast R-CNN. In ICCV, pp. 1440–1448.
- Mask R-CNN. In ICCV, pp. 2961–2969.
- Deep residual learning for image recognition. In CVPR, pp. 770–778.
- Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes. CoRR abs/2101.06085.
- Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
- Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition. In AAAI, pp. 8610–8617.
- Real-Time Scene Text Detection with Differentiable Binarization. In AAAI, pp. 11474–11481.
- Scene text recognition from two-dimensional perspective. In AAAI, pp. 8714–8721.
- Pyramid Mask Text Detector. CoRR.
- SSD: single shot multibox detector. In ECCV, pp. 21–37.
- STAR-Net: A SpaTial Attention Residue Network for Scene Text Recognition. In BMVC.
- Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440.
- TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes. In ECCV, pp. 19–35.
- CloudScan - A configuration-free invoice analysis system using recurrent neural networks. In ICDAR, pp. 406–413.
- YOLO9000: Better, Faster, Stronger. In CVPR, pp. 6517–6525.
- YOLOv3: An Incremental Improvement. CoRR abs/1804.02767.
- Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pp. 91–99.
- NRTR: A no-recurrence sequence-to-sequence model for scene text recognition. In ICDAR, pp. 781–786.
- An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition. PAMI 39 (11), pp. 2298–2304.
- Robust Scene Text Recognition with Automatic Rectification. In CVPR, pp. 4168–4176.
- ASTER: An Attentional Scene Text Recognizer with Flexible Rectification. PAMI 41 (9), pp. 2035–2048.
- Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR.
- An Overview of the Tesseract OCR Engine. In ICDAR, pp. 629–633.
- Spatial Dual-Modality Graph Reasoning for Key Information Extraction. arXiv preprint.
- Going deeper with convolutions. In CVPR, pp. 1–9.
- A Single-Shot Arbitrarily-Shaped Text Detector based on Context Attended Multi-Task Learning. In ACM MM, pp. 1277–1285.
- PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network. In AAAI, pp. 2782–2790.
- Shape robust text detection with progressive scale expansion network. In CVPR, pp. 9336–9345.
- Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network. In ICCV, pp. 8439–8448.
- Aggregation cross-entropy for sequence recognition. In CVPR, pp. 6538–6547.
- CLUENER2020: Fine-grained Named Entity Recognition Dataset and Benchmark for Chinese. arXiv preprint.
- Symmetry-constrained rectification network for scene text recognition. In ICCV, pp. 9146–9155.
- Towards Accurate Scene Text Recognition With Semantic Reasoning Networks. In CVPR, pp. 12110–12119.
- PICK: Processing key information extraction from documents using improved graph learning-convolutional networks. In ICPR, pp. 4363–4370.
- RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition. In ECCV, pp. 135–151.
- SegOCR: Simple Baseline. Unpublished manuscript.
- Boosting up Scene Text Detectors with Guided CNN. In BMVC.
- Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection. In CVPR, pp. 9696–9705.
- Joint extraction of entities and relations based on a novel tagging scheme. In ACL, pp. 1227–1236.
- EAST: An Efficient and Accurate Scene Text Detector. In CVPR, pp. 2642–2651.
- Fourier Contour Embedding for Arbitrary-Shaped Text Detection. In CVPR.