Can we find every moment when our favorite actor appears or talks in a movie? Humans can perform such a search by relying on a high-level understanding of the actor's facial appearance while also analyzing their voice [bruce1986understanding]. The computer vision community has embraced this problem primarily from a visual perspective by advancing face identification [facenet, parkhi2015deep, tapaswi2019video]. However, the ability to search for people using audiovisual patterns remains limited. In this work, we address the lack of large-scale audiovisual datasets to benchmark the video person retrieval task. Beyond finding actors, several video domain applications could benefit from our dataset, from accelerating the creation of highlight moments to summarizing arbitrary video data via speaker diarization.
Contrary to image collections, video data poses additional challenges for face and person retrieval tasks [grother2017face]. Such challenges include drastic changes in appearance, facial expressions, pose, or illumination as a video progresses. These challenges have fostered research in video person search. Some works have focused on person re-identification in surveillance videos [gheissari2006person, farenzena2010person, market]. In this setup, the goal is to track a person among a set of videos recorded from various cameras, where the global appearance (clothing) of the target person remains constant. Another setup is the cast search problem, where models take a portrait image as a query to retrieve all person tracks that match the query's identity [csm]. The community has achieved relevant progress, but the lack of large-scale audiovisual information still prevents the development of richer multi-modal search models.
Motivated by PIN cognitive models [bruce1986understanding], Nagrani et al. [learnablepins] developed self-supervised models that learn joint face-voice embeddings. Their key idea is to use supervision across modalities to learn representations where individual faces match their corresponding voices via contrastive learning [contrastive]. However, many videos in the wild contain multiple visible individuals who remain mostly silent. This situation introduces a significant amount of noise into the supervision signal [ava-active-speakers]. Liu et al. [iqiyi] have also explored audiovisual information for person retrieval in videos. To this end, their work introduces the iQIYI-VID dataset, which contains videos from social media platforms depicting, in large proportion, Asian celebrities. Despite its large scale, the dataset contains only short clips, most of them five seconds long or shorter. We argue that long videos are crucial for high-level reasoning about context to model real-life expressions of people's faces and voices. Additionally, we require densely annotated ground-truth labels to enable direct links between speech and visual identities.
In this paper, we introduce APES (Audiovisual PErson Search), a novel dataset and benchmark for audiovisual person retrieval in long untrimmed videos. Our work aims to mitigate existing limitations from two angles. First, we densely annotate untrimmed videos with person identities and match those identities to faces and voices. Second, we establish audiovisual baselines and benchmarks to facilitate future research with the new dataset. Our dataset includes a broad set of 15-minute videos from movies labeled with a long-tailed distribution of identities. The dataset samples cover many challenging re-identification scenarios, such as small faces, poor illumination, or short speech segments. In terms of baselines, we develop a two-stream model that predicts people's identities using audiovisual cues. We include benchmarks for two alternative tasks: Seen, which aims at retrieving all segments where a query face appears on-screen; and Seen & Heard, which focuses on finding instances where the target person is on-screen and talking. Figure 1 showcases APES annotations and tasks.
This paper’s primary goal is to push the envelope of audiovisual person retrieval by introducing a new video dataset annotated with face and voice identities. To this end, our work brings two contributions.
We annotate a dataset of untrimmed videos from movies. We associate more than face tracks with voice segments and label about identities. Section 3 details our data collection procedure and showcases the characteristics of our novel dataset.
We establish an audiovisual baseline for person search in Section 4. Our study showcases the benefits of modeling audiovisual information jointly for video person retrieval in the wild.
2 Related Work
There is a large corpus of related work on face [tapaswi2019video, facenet, parkhi2015deep] and voice [zhang2019fully, voxceleb2] retrieval. This section focuses on related work on datasets for video person retrieval and audiovisual learning for identity retrieval.
Table 1: Comparison of APES with related video person retrieval datasets.

| Dataset | Video Type | Task | # Instances | # Tracks/Clips | # Identities |
| --- | --- | --- | --- | --- | --- |
| MARS [mars] | Person Tracks | Re-identification | 1M | 20K | 1261 |
| VoxCeleb [voxceleb] | Short Clips | Speaker Recognition | - | 22.5K | 1251 |
| VoxCeleb2 [voxceleb2] | Short Clips | Speaker Recognition | - | 150K | 6112 |
| iQIYI-VID [iqiyi] | Short Clips | Visual Search | 70M | 600K | 5000 |
| CSM [csm] | Person Tracks | Visual Search | 11M | 127K | 1218 |
| Big Bang Theory [bigban] | Untrimmed Videos | Audiovisual Search | - | 3.7K | 8 |
| Buffy [everingham2006hello] | Untrimmed Videos | Audiovisual Search | 49K | 1K | 12 |
| Sherlock [sherlock] | Untrimmed Videos | Audiovisual Search | - | 6.5K | 33 |
| APES | Untrimmed Videos | Audiovisual Search | 3.1M | 30.8K | 1913 |
APES is the largest dataset for audiovisual person search. Compared to available audiovisual search datasets, it contains two orders of magnitude more identities (1.9K). Additionally, its 30K manually curated face tracks exhibit a much larger diversity of audiovisual patterns than similar datasets, as its source movie set spans far more diverse videos across multiple genres and demographics. Finally, its 3.1 million individual instances allow modern machine learning techniques to be applied to the APES dataset.
Video Person Retrieval Datasets.
After many milestones achieved on image-based face retrieval, the computer vision community shifted its attention to video use cases. There are many datasets and tasks related to person and face retrieval in videos. Three popular tasks have been established: person re-identification, speaker recognition, and, recently, person search. Table 1 summarizes datasets for these tasks and compares them with the APES dataset.
The first group includes datasets designed for person re-identification [market, mars, psd]. These datasets usually contain many identities; however, most are composed only of cropped tracks without any visual context or audio information. Moreover, person re-identification datasets focus on surveillance scenarios, where the task is to find a target subject across an array of cameras placed in a limited area.
The second group includes speaker recognition datasets [voxceleb2, voxceleb]. Datasets such as VoxCeleb2 [voxceleb2] have pushed the scale up to 150K face clips. A drawback of this group of datasets is that clips are only a few seconds long and tend to contain a single face. The third group includes datasets for person retrieval. CSM [csm], for instance, introduces the task of cast search, which consists of retrieving all the tracklets that match the identity of a portrait query. iQIYI-VID [iqiyi] scales up the total number of tracklets, clips, and identities. Both datasets provide a step towards visual-based person retrieval but exhibit limitations for multimodal (faces and voices) modeling. On the one hand, CSM does not provide audio streams or video context; on the other hand, iQIYI-VID contains short clips and does not associate voices with person identities. Our APES dataset mitigates these limitations by annotating long videos with people's faces, voices, and their corresponding identities.
The last group is the closest to our setup; it comprises datasets for audiovisual person search. While the Big Bang Theory dataset [bigban] allows studying the same tasks as APES, it is limited to only 8 identities (which are observed mostly indoors). Additionally, its speech events are only approximately localized using the show's transcripts, whereas APES contains dense manual annotations for speech events and identity pairs. Sherlock [sherlock] also allows for audiovisual identity search but contains only 33 identities, and its cast is composed mostly of white European adults. The Sherlock dataset also discards short segments of speech (shorter than 2 seconds); this is a key limitation, as our analysis shows that these short segments constitute a large portion of utterances in natural conversations. Finally, Buffy [everingham2006hello] is also very small in terms of the number of identities, and its data lacks diversity, as it was collected from only two episodes of the series.
Audiovisual Learning for Identity Retrieval.
Audiovisual learning has been widely explored in the realm of multiple video tasks [owens2018audio, chung2016out, ava-active-speakers, afouras2020self], but only a few works have focused their efforts on learning embeddings for person identity retrieval [nagrani2018seeing, learnablepins]. Nagrani et al. [nagrani2018seeing] proposed a cross-modal strategy that can "see voices" and "hear faces". It does so by learning to match a voice to one out of multiple candidate faces using a multi-way classification loss. More recently, the work in [learnablepins] introduces a cross-modal approach that leverages synchronization between faces and speech. This approach assumes a one-to-one correspondence in the audiovisual signals to form query, positive, and negative sets, and trains a model via contrastive learning [contrastive]. Although the method proposed in [learnablepins] does not require manually annotated data, it assumes all face crops contain a person talking, an assumption that often breaks for videos in the wild. Our baseline model draws inspiration from the success of these previous approaches in cross-modal and audiovisual learning. It leverages the newly annotated APES dataset, and its design includes a two-stream audiovisual network that jointly learns audiovisual characteristics of individuals.
3 APES: Audiovisual PErson Search Dataset
This section introduces the Audiovisual PErson Search (APES) dataset, which aims at fostering the study of multimodal person retrieval in videos. This new dataset consists of more than hours of untrimmed videos annotated with unique identities. APES’ videos pose many challenges, including small faces, unconventional viewpoints, as well as short segments of speech intermixed with environmental sound and soundtracks. Figure 2 shows a few APES samples. Here, we describe our data collection procedure and statistics of APES.
3.1 Data Collection
We aim for a collection of videos showing faces and voices in unconstrained scenarios. While there has been a surge of video datasets in the last few years [kinetics, momentsintime, charades, epickitchens], most of them focus on action recognition on trimmed videos. As a result, it is hard to find a large video corpus with multiple instances where individuals are seen on-screen and speaking. This trend limits the availability of relevant data for audiovisual person search.
Instead of gathering user-generated videos, we build on the AVA dataset, which is made from movies. AVA's list of movies includes productions filmed around the world, in different languages, and across multiple genres. It contains a diverse and extensive number of dialogue and action scenes. Another appealing property is that it provides a download mechanism (researchers can download the full set of AVA videos at: https://github.com/cvdfoundation/ava-dataset) to gather the videos in the dataset; this is crucial for reproducibility and promotes future research. Finally, the AVA dataset has been augmented with Active Speaker annotations [ava-active-speakers]. This new set contains face tracks annotated at 20 frames per second; it also includes annotations that link speech activity with those face tracks. Consequently, we choose videos from AVA to construct the APES dataset. Our task is then to label the available face tracks and speech segments with actors' identities.
Labeling face identities.
We first downloaded a total of videos from the AVA dataset, gathered all available face tracks, and annotated identities for a total of face tracks. We did so in two stages. In the first stage, we addressed the identity assignment task per video and asked human annotators to cluster the face tracks into matching identities. To complete this first task, we employed three human annotators for hours each. We noticed two common errors during this stage: (i) false positives emerging from small faces or noisy face tracks; and (ii) false negatives that assign the same person to more than one identity cluster. We alleviated these errors with a second stage, in which annotators reviewed all instances in the clustered identities and merged wrongfully split clusters. This second stage was a relatively shorter verification task, which annotators completed in eight hours. At the end of this annotation process, about of the face tracks were labeled as ambiguous; therefore, we obtained a total of face tracks annotated among identities.
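The second-stage merge of wrongfully split clusters can be sketched with a small union-find structure; the cluster ids and merge pairs below are illustrative placeholders, not values from our annotation tool.

```python
def merge_clusters(num_clusters, merges):
    """Return a canonical label per cluster after applying reviewer merges.

    num_clusters: number of first-stage identity clusters.
    merges: pairs (a, b) of cluster ids flagged as the same person.
    """
    parent = list(range(num_clusters))

    def find(x):
        # Walk to the root, compressing the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in merges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # unify the two identity clusters

    return [find(i) for i in range(num_clusters)]
```

A reviewer marking clusters 0, 2, and 3 as the same actor would leave cluster 1 untouched while collapsing the other three into one identity.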
Labeling voice identities.
After labeling all face tracks with their corresponding identities, we now need to find their voice pairs. Our goal then is to cluster all voices in the videos and match their corresponding faces’ original identity. In other words, we want the same person’s faces and voices to share the same identity. Towards this goal, we leveraged the original annotations from the AVA-ActiveSpeakers dataset [ava-active-speakers], which contain temporal windows of speech activity associated with a face track. We mapped speech segments to their corresponding face track and assigned a common identity. We annotated voice segments accounting for a total of hours of speech among the identities.
3.2 Dataset Statistics
We annotated over hours from videos, including face tracks, face bounding boxes, and voice segments. Our labeling framework annotated identities and discarded ambiguous face tracks. We discuss in detail the statistics of the dataset below.
In Figure 3 (Left), we observe the distribution of the number of tracklets and identities per video. The number of tracklets per video follows a long-tailed distribution, and there is no correlation between the number of tracklets and the number of identities. This fact indicates that certain identities have longer coverage than others. Figure 3 (Center) shows the average length for which an identity is Seen or Seen & Heard. Interestingly, the Seen & Heard distribution is long-tailed, with many identities being heard only very few times. Also, some identities are seen many times without speaking at all. Finally, we investigate the demographics of the dataset in Figure 3 (Right). To do this, we manually annotate the identities with gender, age, and race attributes. Although more work is needed to balance samples across demographics, the survey shows that our video source has representative samples covering the various demographic groups. For instance, for the most under-represented group, kids, APES contains more than 1.5K tracks. Moreover, APES provides significant progress over previous datasets that contain a single demographic group, e.g., iQIYI-VID [iqiyi] contains only Asian celebrities, and the Big Bang Theory dataset [bigban] comprises a cast limited to a single TV series.
We analyze here the characteristics of the annotated identities. First, in Figure 4 (Left), we show the distribution of the number of tracklets per identity. We observe a long-tailed distribution where some identities, likely the main characters, have ample screen time, while others, the supporting cast, appear just a few times. Figure 4 (Center) shows the average face coverage per identity, where we also observe a long-tailed distribution. On the one hand, identities with large average face coverage include actors favored with close-ups; on the other hand, identities with low average face coverage include actors framed within wide shots. Finally, we plot the average length of continuous speech per identity (Figure 4 (Right)). Naturally, different characters have different speech rhythms, and therefore the dataset exhibits a non-uniform distribution. Interestingly, a large mass centers around one second of speech. This characteristic might be due to the natural dynamics of engaging dialogues. Moreover, we observe that about 25% of the identities do not speak at all.
4 Experimental Analysis
We now outline the standard evaluation setup and metrics for the APES dataset along with a baseline method that relies on a two-stream architecture for multi-modal metric learning. The first stream receives cropped faces while the second works over audio clips. Initially, each stream is optimized via triplet loss to minimize the distance between matching identities in the feature space. As highlighted in other works [facenet, nagrani2018seeing, parkhi2015deep], it is essential to acquire a clean and extensive set of triplets to achieve good performance. Below, we detail each modality of our baseline model and different subsets for training.
To optimize the face matching network, we remove the classification layer from a standard ResNet-18 encoder [resnet] pre-trained on ImageNet [imagenet] and fine-tune it using a triplet loss [facenet]. We choose the Adam optimizer [adam] with an initial learning rate of and learning rate annealing of every 30 epochs, for a total of 70 epochs. We resize face crops to and perform random flipping and corner cropping during training.
Similar to the visual stream, we use a ResNet-18 model initialized with ImageNet weights and fine-tuned via triplet loss learning. We follow a setup similar to [ava-active-speakers] and use a Mel-spectrogram calculated from audio snippets of seconds in length in the audio stream. We use the same hyper-parameter configuration described for the visual matching network.
For the audiovisual experiments, we combine the individual configurations for voice and face matching. However, we add a third loss term which optimizes the feature representation obtained from a joint embedding of audiovisual features, which we obtain via concatenation of each stream’s last layer feature map. This third loss is also optimized using the triplet loss. Figure 5 illustrates our cross-modal baseline.
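The three-term objective can be sketched as follows; this is a minimal NumPy illustration with toy embeddings standing in for the real network features, mirroring the sum of the face, voice, and joint (concatenated) triplet losses described above.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-form triplet loss on Euclidean distances [facenet]."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

def audiovisual_loss(face, voice, margin=0.2):
    """face / voice: dicts with 'anchor', 'pos', 'neg' embeddings.

    The joint embedding concatenates each stream's last-layer features;
    the total loss sums the per-stream and joint triplet terms.
    """
    l_face = triplet_loss(face["anchor"], face["pos"], face["neg"], margin)
    l_voice = triplet_loss(voice["anchor"], voice["pos"], voice["neg"], margin)
    joint = {k: np.concatenate([face[k], voice[k]])
             for k in ("anchor", "pos", "neg")}
    l_joint = triplet_loss(joint["anchor"], joint["pos"], joint["neg"], margin)
    return l_face + l_voice + l_joint
```

When the anchor and positive coincide and the negative is far away, all three terms vanish; the margin keeps embeddings of different identities separated.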
4.1 Experimental Setup
Dataset splits and Tasks.
We follow the official train-val splits defined in the original AVA-ActiveSpeaker dataset [ava-active-speakers]. As not every person is actively speaking at every moment, we define two tasks: Seen & Heard (a person is on-screen and talking) and only Seen (the person is on-screen but not speaking). Each of these tasks yields a corresponding training and validation subset. The Seen task subsets have and tracklets for training and validation, respectively. Conversely, the Seen & Heard subset comprises tracklets for training and for validation.
Table 2: Average positive and negative tracklet counts for the Seen & Heard and Seen tasks under each sampling setup.
The APES dataset allows us to draw positive and negative samples during training in three different ways. The most direct sampling gathers every tracklet from a single movie. In this scenario, we create a positive bag with the tracklets that belong to a given identity, while negative samples are obtained from every other tracklet in the same movie. We name this setup Within, a simplified configuration with a 1:15 ratio of positive tracklets (same identity) to negative tracklets (different identity).
While the Within modality allows us to explore the problem, it might be an overly simplified scenario. Hence, we also devise the Across setup, where we sample negative identities over the full video collection, i.e., across different movies. This sampling strategy significantly changes the ratio of positive to negative tracklets to 1:150 and better resembles the natural imbalance of positive/negative identities in real-world data.
Finally, we create an extreme setup that resembles few-shot learning scenarios for identity retrieval. In this case, a single (or very few) positive samples are available to train our embeddings. These positive samples are drawn from the same tracklet as the query, and instances from every other tracklet in the dataset form the negative bag. We name this setup Weak; it results in a strongly imbalanced subset with about a 1:1500 ratio of positive to negative tracklets. A summary of these three sampling sets is presented in Table 2.
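The three sampling setups can be sketched as follows; the tracklet records and field names (`movie`, `identity`, `tid`) are hypothetical stand-ins for our annotation format, not the released schema.

```python
def candidate_sets(tracklets, query, setup="within"):
    """Return (positives, negatives) for a query tracklet.

    tracklets: list of dicts with 'movie', 'identity', 'tid' keys.
    'within' restricts both bags to the query's movie; 'across' draws
    negatives from the full collection; 'weak' keeps only the query's
    own tracklet as positive and treats every other tracklet as negative.
    """
    others = [t for t in tracklets if t["tid"] != query["tid"]]
    same_id = lambda t: t["identity"] == query["identity"]
    same_movie = lambda t: t["movie"] == query["movie"]
    if setup == "within":
        pos = [t for t in others if same_movie(t) and same_id(t)]
        neg = [t for t in others if same_movie(t) and not same_id(t)]
    elif setup == "across":
        pos = [t for t in others if same_movie(t) and same_id(t)]
        neg = [t for t in others if not same_id(t)]
    elif setup == "weak":
        pos = [query]
        neg = others
    else:
        raise ValueError(f"unknown setup: {setup}")
    return pos, neg
```

On real data, Within yields roughly a 1:15 positive-to-negative ratio, Across about 1:150, and Weak about 1:1500, matching the figures above.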
Three evaluation metrics assess methods’ effectiveness in APES:
Precision at K (P@K): we estimate the precision from the top K retrieved identities for every tracklet in a video. As there are no shared identities over videos, we simply estimate the precision at K for every video, and then average for the full validation set.
Recall at K (R@K): we estimate the recall from the top K retrieved identities for every tracklet in a video. Again, we estimate the recall at K for every video and compute the average over the full validation set.
Mean average precision (mAP): as the final and main evaluation metric, we use the mean average precision. As in the recall and precision cases, we compute the mAP for every tracklet in a video and then average the results over the videos in the validation set.
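The three metrics can be sketched per query as follows (mAP then averages the per-tracklet AP over the videos); the identity labels here are toy strings.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved tracklets sharing the query identity."""
    return sum(1 for r in ranked[:k] if r in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant tracklets recovered within the top-k."""
    return sum(1 for r in ranked[:k] if r in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """AP for one query: precision accumulated at each relevant hit."""
    hits, total = 0, 0.0
    for rank, r in enumerate(ranked, start=1):
        if r in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)
```

For a ranking ["a", "x", "b", "y"] with relevant set {"a", "b"}, P@2 and R@2 are both 0.5 and the AP is (1/1 + 2/3)/2 = 5/6.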
4.2 Benchmark Results
Seen & Heard benchmark results.
Table 3 summarizes our benchmark results for the three main configurations: Facial Matching, where we operate exclusively on visual data; Voice Matching, where we exclusively model audio data; and Cross-modal Matching, where we train and validate over both the audio and visual modalities. Overall, results obtained with facial and cross-modal matching are far from perfect. Our best baseline model obtains a maximum mAP of 64.8%. This result indicates that the standard triplet loss is just an initial baseline for the APES dataset and that there is ample room for improvement. The relatively high precision at K=1 and K=5 in almost every setting suggests the existence of a few easy positive matches for every query in the dataset. However, the relatively low precision and recall at K=10 suggest that the method quickly produces wrong estimations as we progress through the ranking. This drawback is most severe in the APES-Weak training setting, as its selection bias induces much lower variability in the positive samples. The analysis of the recall scores highlights the importance of the additional audio information. After the network is enhanced with audio data, the recall metrics improve significantly, reaching 98.5% at K=100. While this improvement comes at the cost of some precision, overall the mAP improves by 0.7%. Finally, there is only a slight difference between the Within and Across sampling settings, with the former having marginally better recall. This suggests that the massive imbalance induced in the Across setting does not significantly improve the diversity of the data observed at training time, and that more sophisticated sampling strategies, such as hard negative mining, might be required to take advantage of the extra information.
Seen benchmark results.
We empirically found that fusing modalities with noisy audio data (original splits) provides no improvement. As outlined before, our experiments suggest that audio models are highly sensitive to noisy speech annotations and do not converge when there is large uncertainty in the ground truth. Under these circumstances, the optimization simply learns to ignore the audio cues and yields the same performance as the visual-only setting.
Despite this, we report the results of facial matching for the Seen task, as they will serve as a baseline for future works that can handle noisy speech data. Table 4 contains the benchmark for the Seen task. We find that this task is actually harder than the Seen & Heard task, despite having more available data. We explain this result by noting that movies typically depict speakers over large portions of the screen and offer a clear viewing angle of them, which results in larger faces with less noise, on average, in the Seen & Heard task, and smaller, more challenging faces in the Seen task. In other words, we hypothesize that speaking faces are easier to re-identify, as they are usually framed within close-ups. This bias makes it harder to find every matching tracklet for an identity (thus reducing recall).
4.3 Qualitative Results
We showcase easy and hard instances for our baseline in Figure 6. Every row shows a query on the left and the top five nearest neighbors on the right. The first two rows show instances where our baseline model correctly retrieves instances of the same person. We have empirically noticed that when query faces are large, as in close-ups and medium shots from dialogue scenes, our baseline model tends to rank the retrieved instances very well. The two bottom rows illustrate hard cases where the baseline model fails. We have seen that small faces, poor illumination, occlusion, and subjects in motion present challenging scenarios for our baseline.
5 Conclusion
We introduce APES, a new dataset for audiovisual person search. We compare APES with existing datasets for person identity analysis and show that it complements previous datasets, which have mainly focused on visual analysis. We include benchmarks for two tasks, Seen and Seen & Heard, to showcase the value of curating a new audiovisual person search dataset. We believe in the crucial role of datasets in measuring progress in computer vision; therefore, we are committed to releasing APES to enable further development, research, and benchmarking in the field of audiovisual person identity understanding.