Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions

07/02/2021
by Motonari Kambara, et al.

There have been many studies in robotics to improve the communication skills of domestic service robots. Most studies, however, have not fully benefited from recent advances in deep neural networks because the training datasets are not large enough. In this paper, we aim to augment these datasets using a crossmodal language generation model. We propose the Case Relation Transformer (CRT), which generates a fetching instruction sentence from an image, such as "Move the blue flip-flop to the lower left box." Unlike existing methods, the CRT uses the Transformer to integrate the visual features and geometry features of objects in the image. The CRT handles these objects by means of the Case Relation Block. We conducted comparison experiments and a human evaluation. The experimental results show that the CRT outperforms baseline methods.
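The abstract says only that the CRT integrates per-object visual and geometry features with a Transformer; it gives no layer details. The sketch below illustrates one generic way such a fusion can work, under stated assumptions: each object's visual feature vector and a hypothetical geometry encoding of its bounding box (normalized corners plus relative area) are projected to a shared dimension, summed into one token, and mixed with single-head scaled dot-product self-attention. It is not the CRT and does not implement the Case Relation Block; the function names, the geometry encoding, `d_model`, and the random weights are all illustrative assumptions.

```python
import math
import random

def linear(x, weight):
    """Multiply vector x (length d_in) by a d_in x d_out weight matrix."""
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(len(weight[0]))]

def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def geometry_features(box, image_w, image_h):
    """Hypothetical geometry encoding: normalized (x1, y1, x2, y2, relative area)."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    return [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h, area / (image_w * image_h)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over object tokens."""
    q = [linear(t, w_q) for t in tokens]
    k = [linear(t, w_k) for t in tokens]
    v = [linear(t, w_v) for t in tokens]
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        # Each output is an attention-weighted average of the value vectors.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d)])
    return out

def fuse_objects(visual_feats, boxes, image_size, d_model=8, seed=0):
    """Project visual and geometry features to d_model, sum them, then self-attend."""
    rng = random.Random(seed)
    w_vis = random_matrix(len(visual_feats[0]), d_model, rng)
    w_geo = random_matrix(5, d_model, rng)
    w_q, w_k, w_v = (random_matrix(d_model, d_model, rng) for _ in range(3))
    tokens = []
    for feat, box in zip(visual_feats, boxes):
        geo = geometry_features(box, *image_size)
        tokens.append([a + b for a, b in zip(linear(feat, w_vis), linear(geo, w_geo))])
    return self_attention(tokens, w_q, w_k, w_v)

# Usage with three made-up objects in a 640x480 image.
visual = [[0.2, 0.5, 0.1, 0.9], [0.7, 0.3, 0.4, 0.0], [0.1, 0.1, 0.8, 0.6]]
boxes = [(10, 20, 60, 80), (100, 40, 180, 120), (50, 150, 90, 200)]
fused = fuse_objects(visual, boxes, image_size=(640, 480))
print(len(fused), len(fused[0]))  # 3 8
```

Summing the two projections (rather than concatenating them) keeps every object as a single token of width `d_model`, so the attention step can relate each object's appearance and position to every other object in one pass.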



research
03/01/2021

CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation

Navigation guided by natural language instructions is particularly suita...
research
07/02/2021

Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots

Currently, domestic service robots have an insufficient ability to inter...
research
04/13/2020

Relation Transformer Network

The identification of objects in an image, together with their mutual re...
research
07/25/2023

Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation

The goal of the audio-visual segmentation (AVS) task is to segment the s...
research
04/14/2023

Masked Pre-Training of Transformers for Histology Image Analysis

In digital pathology, whole slide images (WSIs) are widely used for appl...
research
11/11/2021

Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture

Previous studies such as VizWiz find that Visual Question Answering (VQA...
research
01/04/2023

Multi-Aspect Explainable Inductive Relation Prediction by Sentence Transformer

Recent studies on knowledge graphs (KGs) show that path-based methods em...
