Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation

05/02/2022
by Idris Abdulmumin, et al.

Multi-modal Machine Translation (MMT) enables the use of visual information to enhance the quality of translations. The visual information can serve as valuable context to reduce the ambiguity of input sentences. Despite the increasing popularity of the technique, good and sizeable datasets are scarce, limiting its full potential. Hausa, a Chadic language, is a member of the Afro-Asiatic language family. It is estimated that about 100 to 150 million people speak the language, with more than 80 million indigenous speakers, more than any other Chadic language. Despite its large number of speakers, Hausa is considered low-resource in natural language processing (NLP) because sufficient resources for most NLP tasks are lacking. The datasets that do exist are either small, machine-generated, or restricted to the religious domain. There is therefore a need to create training and evaluation data for machine learning tasks and to bridge the research gap in the language. This work presents the Hausa Visual Genome (HaVG), a dataset that pairs the description of an image, or of a region within it, in Hausa with its equivalent in English. To prepare the dataset, we started by automatically translating the English descriptions of the images in the Hindi Visual Genome (HVG) into Hausa. The synthetic Hausa data was then carefully post-edited with reference to the corresponding images. The dataset comprises 32,923 images and their descriptions, divided into training, development, test, and challenge test sets. The Hausa Visual Genome is the first dataset of its kind and can be used for Hausa-English machine translation, multi-modal research, and image description, among various other natural language processing and generation tasks.
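The abstract describes the preparation pipeline and the training/development/test/challenge-test split but not the distribution format of the released files. The sketch below shows one way the parallel segments might be read, assuming a Hindi-Visual-Genome-style plain-text layout in which each line carries a tab-separated image id, bounding-box coordinates (X, Y, width, height), the English segment, and the Hausa segment; the file name and the HaVGSegment helper are hypothetical, not part of the official release.

import csv
from dataclasses import dataclass

@dataclass
class HaVGSegment:
    # One image region with its bilingual description (assumed layout).
    image_id: str
    x: int
    y: int
    width: int
    height: int
    english: str
    hausa: str

def read_havg_split(path: str) -> list[HaVGSegment]:
    # Read a tab-separated HaVG split file into a list of segments.
    segments = []
    with open(path, encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            image_id, x, y, w, h, english, hausa = row
            segments.append(
                HaVGSegment(image_id, int(x), int(y), int(w), int(h),
                            english.strip(), hausa.strip())
            )
    return segments

if __name__ == "__main__":
    # Hypothetical file name for the training split.
    train = read_havg_split("hausa-visual-genome-train.txt")
    print(f"{len(train)} training segments")
    print(train[0].english, "->", train[0].hausa)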
