E2TIMT: Efficient and Effective Modal Adapter for Text Image Machine Translation

05/09/2023
by Cong Ma, et al.

Text image machine translation (TIMT) translates text embedded in images from a source language into a target language. Existing methods, whether two-stage cascades or one-stage end-to-end architectures, each have drawbacks. Cascade models benefit from large-scale optical character recognition (OCR) and machine translation (MT) datasets, but their two-stage architecture is redundant. End-to-end models are efficient but suffer from a shortage of training data. We therefore propose an end-to-end TIMT model that makes full use of the knowledge in existing OCR and MT datasets, aiming for a framework that is both effective and efficient. Specifically, we build a novel modal adapter that bridges the OCR encoder and the MT decoder. An end-to-end TIMT loss and a cross-modal contrastive loss are used jointly to align the feature distributions of the OCR and MT tasks. Extensive experiments show that the proposed method outperforms existing two-stage cascade models and one-stage end-to-end models while using a lighter and faster architecture. Ablation studies further verify that the method generalizes: the proposed modal adapter effectively bridges various OCR and MT models.
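
The abstract does not specify the form of the cross-modal contrastive loss or how the two losses are weighted. As a rough illustration only, the sketch below assumes an InfoNCE-style loss over paired image-side and text-side feature vectors, combined with a translation loss through a scalar weight. The names `info_nce`, `joint_loss`, `temperature`, and `weight` are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(image_feats, text_feats, temperature=0.1):
    """InfoNCE-style contrastive loss (an assumed form, not the paper's).

    image_feats[i] and text_feats[i] are treated as a positive pair;
    every other text vector in the batch serves as a negative.
    """
    img = [l2_normalize(v) for v in image_feats]
    txt = [l2_normalize(v) for v in text_feats]
    total = 0.0
    for i, anchor in enumerate(img):
        # Cosine similarity of the anchor to each text vector, scaled by temperature.
        logits = [sum(a * t for a, t in zip(anchor, cand)) / temperature
                  for cand in txt]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Cross-entropy with the matching text vector as the target.
        total += log_denom - logits[i]
    return total / len(img)


def joint_loss(translation_loss, contrastive_loss, weight=1.0):
    """Combine the two objectives; `weight` is a hypothetical hyperparameter."""
    return translation_loss + weight * contrastive_loss


if __name__ == "__main__":
    image_side = [[1.0, 0.0], [0.0, 1.0]]
    text_side = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce(image_side, text_side))        # small: pairs are aligned
    print(info_nce(image_side, text_side[::-1]))  # large: pairs are mismatched
```

Under these assumptions the loss shrinks as each image-side feature moves closer to its paired text-side feature than to the other text features in the batch, which is one way to read "aligning the feature distributions" of the two tasks.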


