Deep Multimodal Speaker Naming

07/17/2015
by Yongtao Hu, et al.

Automatic speaker naming is the problem of localizing and identifying each speaking character in a TV, movie, or live show video. The problem is challenging mainly because of its multimodal nature: face cues alone are insufficient to achieve good performance. Previous multimodal approaches usually process the data of each modality individually and merge the results using handcrafted heuristics. Such approaches work well for simple scenes but fail to achieve high performance for speakers with large appearance variations. In this paper, we propose a novel convolutional neural network (CNN) based learning framework that automatically learns the fusion function of both face and audio cues. We show that, without using face tracking, facial landmark localization, or subtitles/transcripts, our system with robust multimodal feature extraction achieves state-of-the-art speaker naming performance on two diverse TV series. The dataset and implementation of our algorithm are publicly available online.
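To make the contrast between handcrafted merging and a learned fusion function concrete, here is a minimal sketch in plain Python. It is not the authors' architecture: the function names, feature dimensions, and random weights are all assumptions chosen for illustration, and a single linear layer with softmax stands in for the CNN that the paper trains. The only point it illustrates is that a fixed-weight average of per-modality scores has no trainable parameters, whereas a layer applied to the concatenated face and audio features can be learned from data.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def handcrafted_fusion(face_probs, audio_probs, face_weight=0.5):
    """Heuristic late fusion: fixed, hand-picked weighting of per-modality scores."""
    return [face_weight * f + (1.0 - face_weight) * a
            for f, a in zip(face_probs, audio_probs)]


def learned_fusion(face_feat, audio_feat, weights, bias):
    """Feature-level fusion: concatenate both feature vectors, then apply a
    linear layer and softmax. In a real system `weights` and `bias` would be
    trained; here they are placeholders (an assumption of this sketch)."""
    fused = face_feat + audio_feat  # list concatenation
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Hypothetical sizes and random values, for illustration only.
random.seed(0)
NUM_SPEAKERS, FACE_DIM, AUDIO_DIM = 3, 4, 2
face_feat = [random.random() for _ in range(FACE_DIM)]
audio_feat = [random.random() for _ in range(AUDIO_DIM)]
weights = [[random.uniform(-1.0, 1.0) for _ in range(FACE_DIM + AUDIO_DIM)]
           for _ in range(NUM_SPEAKERS)]
bias = [0.0] * NUM_SPEAKERS

probs = learned_fusion(face_feat, audio_feat, weights, bias)
predicted = max(range(NUM_SPEAKERS), key=probs.__getitem__)
print("speaker probabilities:", probs)
print("predicted speaker index:", predicted)
```

With untrained weights the prediction is meaningless; the sketch only shows where the trainable parameters sit in each style of fusion.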


