A Comprehensive Gold Standard and Benchmark for Comics Text Detection and Recognition

12/27/2022
by Gürkan Soykan, et al.

This study improves the optical character recognition (OCR) data for panels in the COMICS dataset, the largest dataset containing text and images from comic books. To do this, we developed a pipeline for OCR processing and labeling of comic books and created the first text detection and recognition datasets for Western comics, called "COMICS Text+: Detection" and "COMICS Text+: Recognition". We evaluated state-of-the-art text detection and recognition models on these datasets and found significant improvements in word accuracy and normalized edit distance compared to the text in COMICS. We also created a new dataset called "COMICS Text+", which contains the extracted text from the textboxes in the COMICS dataset. Using the improved text data of COMICS Text+ in an existing comics processing model resulted in state-of-the-art performance on cloze-style tasks without changing the model architecture. The COMICS Text+ dataset can be a valuable resource for researchers working on text detection, text recognition, and high-level processing of comics, such as narrative understanding, character relations, and story generation. All the data and inference instructions can be accessed at https://github.com/gsoykan/comics_text_plus.
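The two recognition metrics named in the abstract, word accuracy and normalized edit distance, can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so this assumes the definitions commonly used in text recognition benchmarks: word accuracy as the fraction of exact string matches, and a "1 - N.E.D." score where each Levenshtein distance is divided by the length of the longer string. The function names `levenshtein`, `word_accuracy`, and `one_minus_ned` are hypothetical and are not taken from the authors' repository.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def _check(preds: Sequence[str], targets: Sequence[str]) -> None:
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not targets:
        raise ValueError("at least one sample is required")


def word_accuracy(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that match the ground truth exactly."""
    _check(preds, targets)
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def one_minus_ned(preds: Sequence[str], targets: Sequence[str]) -> float:
    """1 minus the mean length-normalized edit distance (assumed definition)."""
    _check(preds, targets)
    total = 0.0
    for p, t in zip(preds, targets):
        longest = max(len(p), len(t))
        if longest:  # two empty strings contribute zero distance
            total += levenshtein(p, t) / longest
    return 1.0 - total / len(targets)


if __name__ == "__main__":
    # Made-up strings for illustration only; not data from COMICS Text+.
    predictions = ["POW!", "HELL0"]
    ground_truth = ["POW!", "HELLO"]
    print(word_accuracy(predictions, ground_truth))
    print(one_minus_ned(predictions, ground_truth))
```

Both scores range from 0 to 1, with higher being better. Word accuracy penalizes any single-character error as a full miss, while the normalized edit distance gives partial credit, which is why the two are usually reported together.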


