Speaker Naming in Movies

09/24/2018
by Mahmoud Azab, et al.

We propose a new model for speaker naming in movies that leverages visual, textual, and acoustic modalities in a unified optimization framework. To evaluate the performance of our model, we introduce a new dataset consisting of six episodes of the Big Bang Theory TV show and eighteen full movies covering different genres. Our experiments show that our multimodal model significantly outperforms several competitive baselines on the average weighted F-score metric. To demonstrate the effectiveness of our framework, we design an end-to-end memory network model that leverages our speaker naming model and achieves state-of-the-art results on the subtitles task of the MovieQA 2017 Challenge.
