How2: A Large-scale Dataset for Multimodal Language Understanding

by Ramon Sanabria et al.
Carnegie Mellon University

In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations. We also present integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multimodal summarization. By making available data and code for several multimodal natural language tasks, we hope to stimulate more research on these and similar challenges, to obtain a deeper understanding of multimodality in language processing.
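The four baseline tasks (ASR, MT, spoken language translation, and summarization) all draw on the same per-segment pairing of modalities. A minimal sketch of that pairing is below; the field names and on-disk layout are illustrative assumptions, not the release's actual schema:

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class How2Segment:
    """One utterance-level segment of the corpus.

    Field names here are hypothetical; the real release stores speech
    features, transcripts, and translations in separate aligned files.
    """
    video_id: str
    audio_features_path: str      # precomputed speech features (ASR/SLT source)
    english_transcript: str       # English subtitle text
    portuguese_translation: str   # crowdsourced Portuguese translation
    video_description: str        # summary-like description of the video

    def task_pairs(self) -> Dict[str, Tuple[str, str]]:
        """Source/target pairing for the four baseline tasks."""
        return {
            "asr": (self.audio_features_path, self.english_transcript),
            "mt": (self.english_transcript, self.portuguese_translation),
            "slt": (self.audio_features_path, self.portuguese_translation),
            "summarization": (self.english_transcript, self.video_description),
        }

seg = How2Segment(
    video_id="vid0001",
    audio_features_path="feats/vid0001.npy",
    english_transcript="cut the onion finely",
    portuguese_translation="corte a cebola finamente",
    video_description="a cooking tutorial on preparing onions",
)
print(sorted(seg.task_pairs()))  # the four baseline task names
```

Each task then reduces to a standard sequence-to-sequence problem over one of these source/target pairs, which is what makes a single integrated baseline architecture possible.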


