Dynamic Context-guided Capsule Network for Multimodal Machine Translation
Multimodal machine translation (MMT), which mainly focuses on enhancing text-only translation with visual features, has attracted considerable attention from both the computer vision and natural language processing communities. Most current MMT models resort to attention mechanisms, global context modeling, or multimodal joint representation learning to exploit visual features. However, attention mechanisms lack sufficient semantic interaction between modalities, while the other two approaches provide a fixed visual context, which is unsuitable for modeling the variability observed when generating translations. To address these issues, in this paper we propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT. Specifically, at each decoding timestep, we first employ conventional source-target attention to produce a timestep-specific source-side context vector. Next, DCCN takes this vector as input and uses it to guide the iterative extraction of related visual features via a context-guided dynamic routing mechanism. In particular, since we represent the input image with both global and regional visual features, we introduce two parallel DCCNs to model multimodal context vectors with visual features at different granularities. Finally, we obtain two multimodal context vectors, which are fused and incorporated into the decoder to predict the target word. Experimental results on the Multi30K dataset for English-to-German and English-to-French translation demonstrate the superiority of DCCN. Our code is available at https://github.com/DeepLearnXMU/MM-DCCN.
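To make the context-guided dynamic routing step more concrete, the following is a minimal PyTorch sketch of how a source-side context vector could bias the routing of visual capsules at one decoding timestep. All class names, dimensions, and the exact guidance formula here are assumptions for illustration only; the authors' actual implementation is in the repository linked above.

```python
# Minimal sketch of context-guided dynamic routing (assumed, illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextGuidedRouting(nn.Module):
    """Iteratively extracts a multimodal context vector from visual capsules,
    with routing weights guided by the timestep-specific source-side context."""

    def __init__(self, visual_dim, ctx_dim, num_out_capsules=4, num_iters=3):
        super().__init__()
        self.num_iters = num_iters
        self.num_out = num_out_capsules
        # Transform each visual feature (input capsule) into votes for the output capsules.
        self.vote = nn.Linear(visual_dim, num_out_capsules * ctx_dim)
        # Fuse the output capsules into a single multimodal context vector.
        self.fuse = nn.Linear(num_out_capsules * ctx_dim, ctx_dim)

    def forward(self, visual_feats, src_ctx):
        # visual_feats: (batch, num_regions, visual_dim) -- global or regional features
        # src_ctx:      (batch, ctx_dim) -- timestep-specific source-side context vector
        B, N, _ = visual_feats.size()
        votes = self.vote(visual_feats).view(B, N, self.num_out, -1)  # (B, N, K, D)

        # Routing logits, initialised to zero as in standard dynamic routing.
        logits = votes.new_zeros(B, N, self.num_out)
        for _ in range(self.num_iters):
            weights = F.softmax(logits, dim=2)                         # (B, N, K)
            out_caps = (weights.unsqueeze(-1) * votes).sum(dim=1)      # (B, K, D)
            out_caps = F.normalize(out_caps, dim=-1)
            # Context guidance (assumed form): bias agreement toward output capsules
            # that match the current source-side context as well as the votes.
            guided = out_caps + src_ctx.unsqueeze(1)                   # (B, K, D)
            logits = logits + torch.einsum('bnkd,bkd->bnk', votes, guided)

        return self.fuse(out_caps.flatten(1))                          # (B, ctx_dim)
```

Following the abstract, two such modules running in parallel, one over global visual features and one over regional ones, would each yield a multimodal context vector; the two vectors are then fused and passed to the decoder for predicting the target word.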