bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents

08/21/2023
by Imam Mohammad Zulkarnain, et al.

Despite the existence of numerous Optical Character Recognition (OCR) tools, the lack of comprehensive open-source systems hampers document digitization in many low-resource languages, including Bengali. Low-resource languages, especially those with an alphasyllabary writing system, lack large-scale datasets for document OCR components such as word-level OCR, document layout extraction, and distortion correction, all of which are available as individual modules for high-resource languages. In this paper, we introduce Bengali.AI-BRACU-OCR (bbOCR): an open-source, scalable document OCR system that reconstructs Bengali documents into a structured, searchable, digitized format. The system leverages a novel Bengali text recognition model and two novel synthetic datasets. We present extensive component-level and system-level evaluations, both of which use a novel, diversified evaluation dataset and comprehensive evaluation metrics. Our evaluation suggests that the proposed solution is preferable to the current state-of-the-art Bengali OCR systems. The source code and datasets are available here: https://bengaliai.github.io/bbocr.

research
06/29/2021

SDL: New data generation tools for full-level annotated document layout

We present a novel data generation tool for document processing. The too...
research
01/20/2023

Transforming Unstructured Text into Data with Context Rule Assisted Machine Learning (CRAML)

We describe a method and new no-code software tools enabling domain expe...
research
01/06/2021

On-Device Document Classification using multimodal features

From small screenshots to large videos, documents take up a bulk of spac...
research
11/23/2017

Open Evaluation Tool for Layout Analysis of Document Images

This paper presents an open tool for standardizing the evaluation proces...
research
01/10/2022

DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population

We present a new open-source and extensible knowledge extraction toolkit...
research
05/15/2023

Document Understanding Dataset and Evaluation (DUDE)

We call on the Document AI (DocAI) community to reevaluate current metho...
