DocLangID: Improving Few-Shot Training to Identify the Language of Historical Documents

05/03/2023
by Furkan Simsek, et al.

Language identification is the task of recognizing the language of written text in documents. This information is crucial because it supports the analysis of a document's vocabulary and context. In recent years, supervised learning methods have advanced language identification. However, these methods usually require large labeled datasets, which are often unavailable for various image domains, such as documents or scene images. In this work, we propose DocLangID, a transfer learning approach to identify the language of unlabeled historical documents. First, we leverage labeled data from a different but related domain of historical documents. Second, we implement a distance-based few-shot learning approach to adapt a convolutional neural network to the new languages of the unlabeled dataset. By introducing small amounts of manually labeled examples from the set of unlabeled images, our feature extractor adapts better to new and different data distributions of historical documents. We show that such a model can be effectively fine-tuned for the unlabeled set of images by reusing only the same few-shot examples. We showcase our work across 10 languages that mostly use the Latin script. Our experiments on historical documents demonstrate that our combined approach improves language identification performance, achieving 74% accuracy on the four unseen languages of the unlabeled dataset.
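To make the idea of distance-based few-shot classification concrete, here is a minimal sketch. It is not the DocLangID implementation: the abstract does not specify the distance metric or how the few-shot examples are aggregated, so this assumes a nearest-prototype scheme with Euclidean distance. The function names, the two-dimensional feature vectors, and the language labels are hypothetical; in practice the vectors would come from the convolutional feature extractor.

```python
import math
from collections import defaultdict


def class_prototypes(support_features, support_labels):
    """Average the few labeled feature vectors of each language into one prototype.

    Assumption: one mean vector per class (nearest-prototype scheme);
    the abstract only states that the approach is distance-based.
    """
    grouped = defaultdict(list)
    for vec, label in zip(support_features, support_labels):
        grouped[label].append(vec)
    return {
        label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict_language(query, prototypes):
    """Return the label whose prototype is closest to the query (Euclidean distance)."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical 2-D features standing in for CNN embeddings of document images.
support_features = [[0.9, 0.1], [1.1, -0.1], [-1.0, 0.2], [-0.8, 0.0]]
support_labels = ["de", "de", "fr", "fr"]

prototypes = class_prototypes(support_features, support_labels)
print(predict_language([0.7, 0.3], prototypes))
```

With only a handful of labeled examples per language, the classifier itself has no trainable parameters in this sketch; any adaptation would happen in the feature extractor that produces the vectors.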
