A Multimodal Visual Encoding Model Aided by Introducing Verbal Semantic Information

08/29/2023
by   Shuxiao Ma, et al.
0

Biological research has revealed that the verbal semantic information in the brain cortex, as an additional source, participates in nonverbal semantic tasks, such as visual encoding. However, previous visual encoding models did not incorporate verbal semantic information, contradicting this biological finding. This paper proposes a multimodal visual information encoding network model based on stimulus images and associated textual information in response to this issue. Our visual information encoding network model takes stimulus images as input and leverages textual information generated by a text-image generation model as verbal semantic information. This approach injects new information into the visual encoding model. Subsequently, a Transformer network aligns image and text feature information, creating a multimodal feature space. A convolutional network then maps from this multimodal feature space to voxel space, constructing the multimodal visual information encoding network model. Experimental results demonstrate that the proposed multimodal visual information encoding network model outperforms previous models under the exact training cost. In voxel prediction of the left hemisphere of subject 1's brain, the performance improves by approximately 15.87 hemisphere, the performance improves by about 4.6 encoding network model exhibits superior encoding performance. Additionally, ablation experiments indicate that our proposed model better simulates the brain's visual information processing.

READ FULL TEXT

page 9

page 11

page 12

research
10/07/2020

VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks

Text-to-image multimodal tasks, generating/retrieving an image from a gi...
research
10/18/2022

Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation

Recently, diffusion models have been proven to perform remarkably well i...
research
12/15/2022

RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis

With the advent of deep learning, a huge number of text-to-speech (TTS) ...
research
12/21/2019

Multimodal Representation Model based on Graph-Based Rank Fusion

This paper proposes an unsupervised representation model, based on rank-...
research
02/17/2021

I Want This Product but Different : Multimodal Retrieval with Synthetic Query Expansion

This paper addresses the problem of media retrieval using a multimodal q...
research
12/09/2017

Modulating and attending the source image during encoding improves Multimodal Translation

We propose a new and fully end-to-end approach for multimodal translatio...
research
07/18/2023

Improving Text Semantic Similarity Modeling through a 3D Siamese Network

Siamese networks have gained popularity as a method for modeling text se...

Please sign up or login with your details

Forgot password? Click here to reset