CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

07/11/2023
by Long Bai, et al.

Medical students and junior surgeons often rely on senior surgeons and specialists to answer their questions when learning surgery. However, experts are often busy with clinical and academic work and have little time to give guidance. Meanwhile, existing deep learning (DL)-based surgical Visual Question Answering (VQA) systems can only provide simple answers without localizing them in the scene, and vision-language (ViL) embedding remains under-explored in such tasks. A surgical Visual Question Localized-Answering (VQLA) system would therefore help medical students and junior surgeons learn and understand from recorded surgical videos. We propose an end-to-end Transformer with Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios, which requires no feature extraction through detection models. The CAT-ViL embedding module is designed to fuse multimodal features from visual and textual sources. The fused embedding is fed into a standard Data-Efficient Image Transformer (DeiT) module before a parallel classifier and detector produce the joint prediction. We conduct experimental validation on public surgical videos from the MICCAI EndoVis Challenges 2017 and 2018. The results highlight the superior performance and robustness of our model compared with state-of-the-art approaches, and ablation studies further confirm the effectiveness of each proposed component. The proposed method provides a promising solution for surgical scene understanding and takes a first step toward an Artificial Intelligence (AI)-based VQLA system for surgical training. Our code is publicly available.
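To make the described pipeline concrete, the sketch below shows one way such a model could look in PyTorch: each modality cross-attends to the other, a sigmoid gate mixes original and attended tokens, the fused sequence passes through a Transformer encoder standing in for DeiT, and parallel heads predict the answer class and a normalized bounding box. This is a minimal illustration under stated assumptions, not the authors' implementation; the module names, gating formula, pooling, and answer-class count are all placeholders.

```python
# Minimal PyTorch sketch of the pipeline described above: co-attention
# gated fusion of visual and textual tokens, a Transformer encoder as a
# stand-in for DeiT, and parallel classification/detection heads.
# All module names, dimensions, the gating formula, the pooling, and the
# answer-class count are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class CoAttentionGatedFusion(nn.Module):
    """Cross-attends each modality to the other, then uses a sigmoid gate
    to decide how much attended context to mix into the original tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.txt2vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate_v = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.gate_t = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        vis_ctx, _ = self.txt2vis(vis, txt, txt)  # vision queries text
        txt_ctx, _ = self.vis2txt(txt, vis, vis)  # text queries vision
        g_v = self.gate_v(torch.cat([vis, vis_ctx], dim=-1))
        g_t = self.gate_t(torch.cat([txt, txt_ctx], dim=-1))
        vis = g_v * vis + (1.0 - g_v) * vis_ctx   # gated residual mix
        txt = g_t * txt + (1.0 - g_t) * txt_ctx
        return torch.cat([vis, txt], dim=1)       # fused token sequence


class VQLAModel(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 18, depth: int = 4):
        super().__init__()
        self.fusion = CoAttentionGatedFusion(dim)
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)  # DeiT stand-in
        self.classifier = nn.Linear(dim, num_classes)  # answer logits
        self.detector = nn.Sequential(                 # normalized (cx, cy, w, h) box
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4), nn.Sigmoid()
        )

    def forward(self, vis, txt):
        tokens = self.encoder(self.fusion(vis, txt))
        pooled = tokens.mean(dim=1)                    # average over all tokens
        return self.classifier(pooled), self.detector(pooled)


if __name__ == "__main__":
    model = VQLAModel()
    vis = torch.randn(2, 49, 256)  # e.g. 7x7 grid of patch features
    txt = torch.randn(2, 20, 256)  # embedded question tokens
    logits, boxes = model(vis, txt)
    print(logits.shape, boxes.shape)  # torch.Size([2, 18]) torch.Size([2, 4])
```

The gated residual keeps each modality's original signal dominant when the cross-modal context is uninformative, which is the usual motivation for gating over plain concatenation, and the parallel heads let a single forward pass return both the answer and its location.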

Related research

05/19/2023
Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery
Despite the availability of computer-aided simulators and recorded video...

06/22/2022
Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer
Visual question answering (VQA) in surgery is largely unexplored. Expert...

10/01/2022
A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering
Research in medical visual question answering (MVQA) can contribute to t...

04/19/2023
SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery
Advances in GPT-based large language models (LLMs) are revolutionizing n...

07/22/2023
Revisiting Distillation for Continual Learning on Visual Question Localized-Answering in Robotic Surgery
The visual-question localized-answering (VQLA) system can serve as a kno...

09/21/2023
Towards Answering Health-related Questions from Medical Videos: Datasets and Approaches
The increase in the availability of online videos has transformed the wa...

05/19/2023
SurgMAE: Masked Autoencoders for Long Surgical Video Analysis
There has been a growing interest in using deep learning models for proc...
