Arsha Nagrani

research

∙ 09/07/2023

LanSER: Language-Model Supported Speech Emotion Recognition

Speech emotion recognition (SER) models typically rely on costly human-l...

0 Taesik Gong, et al. ∙

research

∙ 08/21/2023

UnLoc: A Unified Framework for Video Localization Tasks

While large-scale image-text pretrained models such as CLIP have been us...

0 Shen Yan, et al. ∙

research

∙ 06/08/2023

Modular Visual Question Answering via Code Generation

We present a framework that formulates visual question answering as modu...

6 Sanjay Subramanian, et al. ∙

research

∙ 04/13/2023

Verbs in Action: Improving verb understanding in video-language models

Understanding verbs is crucial to modelling how people and objects inter...

4 Liliane Momeni, et al. ∙

research

∙ 04/05/2023

VicTR: Video-conditioned Text Representations for Activity Recognition

Vision-Language models have shown strong performance in the image-domain...

5 Kumara Kahatapitiya, et al. ∙

research

∙ 03/29/2023

AutoAD: Movie Description in Context

The objective of this paper is an automatic Audio Description (AD) model...

14 Tengda Han, et al. ∙

research

∙ 03/29/2023

AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR

Audiovisual automatic speech recognition (AV-ASR) aims to improve the ro...

5 Paul Hongsuck Seo, et al. ∙

research

∙ 02/27/2023

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

In this work, we introduce Vid2Seq, a multi-modal single-stage dense eve...

5 Antoine Yang, et al. ∙

research

∙ 11/18/2022

AVATAR submission to the Ego4D AV Transcription Challenge

In this report, we describe our submission to the Ego4D AudioVisual (AV)...

6 Paul Hongsuck Seo, et al. ∙

research

∙ 08/14/2022

TL;DW? Summarizing Instructional Videos with Task Relevance Cross-Modal Saliency

YouTube users looking for instructions for a specific task may spend a l...

2 Medhini Narasimhan, et al. ∙

research

∙ 06/20/2022

M&M Mix: A Multimodal Multiview Transformer Ensemble

This report describes the approach behind our winning solution to the 20...

2 Xuehan Xiong, et al. ∙

research

∙ 06/15/2022

AVATAR: Unconstrained Audiovisual Speech Recognition

Audio-visual automatic speech recognition (AV-ASR) is an extension of AS...

8 Valentin Gabeur, et al. ∙

research

∙ 05/17/2022

A CLIP-Hitchhiker's Guide to Long Video Retrieval

Our goal in this paper is the adaptation of image-text models for long v...

13 Max Bain, et al. ∙

research

∙ 04/01/2022

Learning Audio-Video Modalities from Image Captions

A major challenge in text-video and text-audio retrieval is the lack of ...

3 Arsha Nagrani, et al. ∙

research

∙ 01/20/2022

End-to-end Generative Pretraining for Multimodal Video Captioning

Recent video and language pretraining frameworks lack the ability to gen...

7 Paul Hongsuck Seo, et al. ∙

research

∙ 12/08/2021

Audio-Visual Synchronisation in the wild

In this paper, we consider the problem of audio-visual synchronisation a...

2 Honglie Chen, et al. ∙

research

∙ 11/01/2021

Masking Modalities for Cross-modal Video Retrieval

Pre-training on large scale unlabelled datasets has shown impressive per...

2 Valentin Gabeur, et al. ∙

research

∙ 11/01/2021

With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

In egocentric videos, actions occur in quick succession. We capitalise o...

2 Evangelos Kazakos, et al. ∙

research

∙ 06/30/2021

Attention Bottlenecks for Multimodal Fusion

Humans perceive the world by concurrently processing and fusing high-dim...

24 Arsha Nagrani, et al. ∙

research

∙ 04/06/2021

Localizing Visual Sounds the Hard Way

The objective of this work is to localize sound sources that are visible...

7 Honglie Chen, et al. ∙

research

∙ 04/01/2021

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

Our objective in this work is video-text retrieval - in particular a joi...

7 Max Bain, et al. ∙

research

∙ 04/01/2021

Composable Augmentation Encoding for Video Representation Learning

We focus on contrastive methods for self-supervised video representation...

12 Chen Sun, et al. ∙

research

∙ 03/05/2021

Slow-Fast Auditory Streams For Audio Recognition

We propose a two-stream convolutional network for audio recognition, tha...

4 Evangelos Kazakos, et al. ∙

research

∙ 01/11/2021

WiCV 2020: The Seventh Women In Computer Vision Workshop

In this paper we present the details of Women in Computer Vision Worksho...

2 Hazel Doughty, et al. ∙

research

∙ 12/10/2020

Look Before you Speak: Visually Contextualized Utterances

While most conversational AI systems focus on textual dialogue only, con...

2 Paul Hongsuck Seo, et al. ∙

research

∙ 09/17/2020

Cough Against COVID: Evidence of COVID-19 Signature in Cough Sounds

Testing capacity for COVID-19 remains a challenge globally due to the la...

0 Piyush Bagad, et al. ∙

research

∙ 07/21/2020

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos

Despite the recent advances in video classification, progress in spatio-...

5 Anurag Arnab, et al. ∙

research

∙ 07/02/2020

Spot the conversation: speaker diarisation in the wild

The goal of this paper is speaker diarisation of videos collected 'in th...

2 Joon Son Chung, et al. ∙

research

∙ 05/08/2020

Condensed Movies: Story Based Retrieval with Contextual Embeddings

Our objective in this work is the long range understanding of the narrat...

10 Max Bain, et al. ∙

research

∙ 03/30/2020

Speech2Action: Cross-modal Supervision for Action Recognition

Is it possible to guess human action from dialogue alone? In this work w...

6 Arsha Nagrani, et al. ∙

research

∙ 02/20/2020

Disentangled Speech Embeddings using Cross-modal Self-supervision

The objective of this paper is to learn representations of speaker ident...

17 Arsha Nagrani, et al. ∙

research

∙ 12/05/2019

VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge

The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well...

5 Joon Son Chung, et al. ∙

research

∙ 09/23/2019

WiCV 2019: The Sixth Women In Computer Vision Workshop

In this paper we present the Women in Computer Vision Workshop - WiCV 20...

8 Irene Amerini, et al. ∙

research

∙ 09/19/2019

Count, Crop and Recognise: Fine-Grained Recognition in the Wild

The goal of this paper is to label all the animal individuals present in...

34 Max Bain, et al. ∙

research

∙ 08/22/2019

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

We focus on multi-modal fusion for egocentric action recognition, and pr...

12 Evangelos Kazakos, et al. ∙

research

∙ 07/31/2019

Use What You Have: Video Retrieval Using Representations From Collaborative Experts

The rapid growth of video on the internet has made searching for video c...

3 Yang Liu, et al. ∙

research

∙ 02/26/2019

Utterance-level Aggregation For Speaker Recognition In The Wild

The objective of this paper is speaker recognition "in the wild"-where u...

2 Weidi Xie, et al. ∙

research

∙ 08/16/2018

Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

Obtaining large, human labelled speech datasets to train models for emot...

8 Samuel Albanie, et al. ∙

research

∙ 06/14/2018

VoxCeleb2: Deep Speaker Recognition

The objective of this paper is speaker recognition under noisy and uncon...

0 Joon Son Chung, et al. ∙

research

∙ 05/02/2018

Learnable PINs: Cross-Modal Embeddings for Person Identity

We propose and investigate an identity sensitive joint embedding of face...

0 Arsha Nagrani, et al. ∙

research

∙ 04/01/2018

Seeing Voices and Hearing Faces: Cross-modal biometric matching

We introduce a seemingly impossible task: given only an audio clip of so...

0 Arsha Nagrani, et al. ∙

research

∙ 01/31/2018

From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script

The goal of this paper is the automatic identification of characters in ...

0 Arsha Nagrani, et al. ∙

research

∙ 06/26/2017

VoxCeleb: a large-scale speaker identification dataset

Most existing datasets for speaker identification contain samples obtain...

0 Arsha Nagrani, et al. ∙
