Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video with only video-level categorical supervision. Previous works use both appearance and motion features, but they do not exploit them properly, applying only simple concatenation or score-level fusion. In this work, we argue that the features extracted from a pretrained extractor, e.g., I3D, are not specific to the WS-TAL task, and thus feature re-calibration is needed to reduce the task-irrelevant information redundancy. Therefore, we propose a cross-modal consensus network (CO2-Net) to tackle this problem. In CO2-Net, we introduce two identical cross-modal consensus modules (CCM) that employ a cross-modal attention mechanism to filter out the task-irrelevant information redundancy, using the global information of the main modality and the cross-modal local information of the auxiliary modality. Moreover, we treat the attention weights derived from each CCM as pseudo targets for the attention weights derived from the other CCM to maintain the consistency between the predictions of the two CCMs, forming a mutual learning scheme. Finally, we conduct extensive experiments on two commonly used temporal action localization datasets, THUMOS14 and ActivityNet1.2, to verify our method, and achieve state-of-the-art results. The experimental results show that our proposed cross-modal consensus module can produce more representative features for temporal action localization.

1. Introduction

Temporal action localization is a task to localize the start and end timestamps of action instances and recognize their categories. In recent years, many works (zhai2020two; nawhal2021activity; zhao2017temporal; zeng2019graph) have tackled it in a fully supervised manner and achieved strong results. However, these fully supervised methods require extensive manual frame/snippet-level annotations. To address this problem, many weakly supervised temporal action localization (WS-TAL) methods (shou2018autoloc; luo2020weakly; zeng2020hybrid; islam2020weakly; jain2020actionbytes) have been proposed to explore an efficient way to detect action instances in given videos with only video-level supervision, which is much easier for annotators to obtain.

Like other weakly supervised video understanding tasks such as video anomaly detection (sultani2018real; feng2021mist) and video highlight detection (hong2020mini), most existing WS-TAL methods build their frameworks on the multiple-instance learning (MIL) paradigm (lee2021Weakly; islam2020weakly; zeng2020hybrid; lee2020background; liu2019completeness). These methods first predict the categorical probabilities for each snippet and then aggregate them into a video-level prediction. Finally, they perform optimization using the given video-level labels.

Among them, some works (nguyen2018weakly; lee2020background; zeng2020hybrid; liu2019completeness) introduce an attention module to improve the ability to recognize the foreground by suppressing the background parts. For action completeness modeling, Islam et al. (islam2021hybrid) utilize an attention module to drop the most discriminative parts of the video and focus on the less discriminative ones. With regard to feature learning, most WS-TAL methods (islam2020weakly; paul2018w) mainly apply a contrastive learning loss on their intermediate features. Lee et al. (lee2021Weakly) propose to distinguish foreground from background via the inconsistency of their feature magnitudes.

The aforementioned methods use the original extracted features, which contain task-irrelevant information redundancy (wang2015learning; wang2019pruning; feng2021mist; lei2021less), to directly produce predictions for each snippet. However, since the features are extracted from a model trained for another task, i.e., trimmed video action classification, redundancy is inevitably introduced; their performance is therefore restricted by the quality of the extracted features and only reaches a sub-optimal solution (feng2021mist; lei2021less). Intuitively, performing feature re-calibration to obtain task-specific features is a way to tackle this problem. Instead of finetuning the feature extractor (feng2021mist; alwassel2020tsp; xu2020boundary), which incurs high time and computation cost, we explore how to re-calibrate the features in a more efficient manner. In this work, our intuition is simple: the RGB and FLOW features contain modality-specific information (i.e., appearance and motion information) from different perspectives of the given data. Therefore, we can filter out the redundancy contained in a certain modality with the help of the global context information from that modality itself and the local context information from the other modality (Figure 1).

As discussed above, the inconsistency between the pre-training task and the target task leads to inevitable task-irrelevant information in the extracted features, denoted as redundancy, which restricts the optimization, especially under weak supervision. Previous works pay little attention to this problem and use the features directly. Here, we aim to re-calibrate the features at the very beginning by leveraging information from two different modalities (i.e., RGB and FLOW features). In this work, we develop a CrOss-modal cOnsensus NETwork (CO-Net) to re-calibrate the representations of each modality for each snippet in the video. CO-Net contains two identical cross-modal consensus modules (CCM). Specifically, both types of modal features are fed into each CCM: one of them acts as the main modality and the other serves as the auxiliary modality. In CCM, we obtain the modality-specific global context information from the main modality and the cross-modal local-focused descriptor from the auxiliary modality. Then we aggregate them to produce a channel-wise descriptor that is used to filter out the task-irrelevant information redundancy. Intuitively, with the global information of the main modality, CCM can use the information from the different perspective of the auxiliary modality to determine whether a certain part of the main modality is task-irrelevant information redundancy. Thus we obtain the RGB-enhanced features and FLOW-enhanced features from the two CCMs after filtering the redundancy in the original RGB and FLOW features, respectively. Then we utilize these two enhanced features to estimate modality-specific attention weights and apply a mutual learning loss on the two estimated attention weights for mutual promotion. In addition, we also apply the top-k multiple-instance learning loss (paul2018w; islam2020weakly; lee2021Weakly) that is widely used to learn the temporal class activation map (T-CAM) for each video.

Finally, we conduct extensive experiments on two public temporal action localization benchmarks, i.e., the THUMOS14 dataset (THUMOS14) and the ActivityNet1.2 dataset (caba2015activitynet). In our experiments, we investigate and discuss the effect of our proposed cross-modal consensus module compared with other feature fusion manners (e.g., addition and concatenation). The experimental results show that our CO-Net achieves state-of-the-art performance on both datasets, which verifies its efficacy for temporal action localization. To summarize, our contribution is three-fold:

  • To the best of our knowledge, this is the first work to investigate multimodal feature re-calibration and modal-wise consistency via mutual learning for temporal action localization.

  • We propose a framework, i.e., CO-Net, for temporal action localization that explores a novel cross-modal attention mechanism to re-calibrate the representation of each modality.

  • We conduct extensive experiments on two public benchmarks, where our proposed method achieves the state-of-the-art results.

Figure 2. An overview of the proposed cross-modal consensus network (CO-Net) with two identical CCMs. Each CCM filters out the task-irrelevant redundancy of the main modality and generates enhanced features for the main modality by the consensus of its own global-context information and the local information from the auxiliary modality. The enhanced features are fed into an attention unit to estimate modality-specific attention weights. On the one hand, we aggregate the two attention weights to generate the final attention weights, while the two modality-specific attention weights are optimized by a mutual learning loss for mutual promotion. On the other hand, we fuse the two enhanced features and feed them into a classifier to predict a temporal class activation map (T-CAM). Finally, we apply the top-k multiple-instance learning losses and the co-activity similarity loss to optimize the whole framework.

2. Related Works

Weakly Supervised Temporal Action Localization. Weakly supervised temporal action localization provides an efficient way to detect action instances without expensive frame-level annotations. Many works tackle this problem using the multiple-instance learning (MIL) framework (islam2021hybrid; islam2020weakly; lee2020background; lee2021Weakly; liu2021acsnet; luo2020weakly; nguyen2018weakly). Several works (paul2018w; islam2020weakly) mainly aggregate snippet-level class scores to produce video-level predictions and learn from video-level action labels. In this formulation, background frames are forced to be mis-classified as action classes to predict video-level labels accurately. To address such a problem, many works (lee2020background; islam2021hybrid) apply an attention module in their framework to suppress the activation of background frames and improve localization performance. Lee et al. (lee2020background) introduce an auxiliary class for background and propose a two-branch weight-sharing architecture with an asymmetrical training strategy. Besides, MIL-based methods tend to focus only on optimizing the most discriminative snippets in the video (choe2019attention; feng2021mist). For action completeness modeling, some works (islam2021hybrid; min2020adversarial) adopt a complementary learning scheme that drops the most discriminative parts of the video and focuses on the complementary parts. Also, several works (pardo2021refineloc; zhai2020two) attempt to optimize their framework under a self-training regime. Zhai et al. (zhai2020two) treat the outputs of the last epoch as pseudo labels and refine the network using these pseudo labels.

Different from the aforementioned methods, this work is the first to consider filtering out the task-irrelevant information redundancy from each modality with the help of the consensus of different modalities. Our method aims to re-calibrate the representations so that each modality has less information redundancy, which leads to more accurate predictions.

Modalities Fusion.

Recently, deep neural networks have been widely exploited for multi-modal learning due to their powerful feature transformation ability. Many computer vision models (hong2020mini; xu2015learning; deng2018triplet; xu2018PAD-Net; munro2020multi; rao2020a; jing2020cross; xu2017learning) adopt multiple modalities in their frameworks to obtain performance gains, since different modalities can complement each other when combined properly. In the early stage, Ngiam et al. (ngiam2011multimodal) adopt a deep auto-encoder architecture to learn common representations of multi-modal data and achieve significant performance in speech and vision tasks. Several works (hong2020mini; afouras2020self) combine the visual and audio modalities to tackle a specific task. In general, video and audio contain different modal information but can enhance each other because visual and audio events tend to occur together. Hong et al. (hong2020mini) utilize the audio modality in a multiple-head structure to assist the vision modality in localizing video highlights.

In this work, instead of finetuning the feature extractor, we attempt to filter out the task-irrelevant information redundancy from a specific modality via a novel re-calibration procedure, in which we form a consensus between the global context of the modality itself and the local context information from the other modality, whereas the aforementioned works treat the multiple modalities equally.

3. Method

Video is a typical type of multimedia that can be translated into multiple modalities, each representing the information from a different perspective. In this work, we propose a cross-modal consensus network (CO-Net) to re-calibrate the representations of each modality using the information from the different perspective of the other modality.

3.1. Problem Formulation

We first formulate the WS-TAL problem as follows: suppose $\{V_n\}_{n=1}^{N}$ denotes a batch of data with $N$ videos, and $\{y_n\}_{n=1}^{N}$ are the corresponding video-level categorical labels, where $y_n \in \{0,1\}^{C}$ for the $n$-th video and $C$ denotes the number of action categories. The goal of WS-TAL is to learn a function that simultaneously detects and classifies all action instances temporally with precise timestamps as $(t_s, t_e, c, q)$ for each video, where $t_s$, $t_e$, $c$ and $q$ denote the start time, the end time, the predicted category and the confidence score of the corresponding action proposal, respectively.
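To make the formulation concrete, the following is a minimal Python sketch of the data structures it implies; the class and function names are our own illustrative choices, not part of the paper.

```python
from dataclasses import dataclass
from typing import List
import torch

@dataclass
class ActionProposal:
    """One localized action instance (t_s, t_e, c, q)."""
    t_start: float   # start timestamp (seconds)
    t_end: float     # end timestamp (seconds)
    category: int    # predicted action class index in [0, C)
    score: float     # confidence score of the proposal

def make_video_label(present_classes: List[int], num_classes: int) -> torch.Tensor:
    """Video-level supervision: a multi-hot vector over the C action classes."""
    y = torch.zeros(num_classes)
    y[present_classes] = 1.0
    return y

# Example: a THUMOS14 video (C = 20) containing action classes 3 and 7.
y = make_video_label([3, 7], num_classes=20)
```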

3.2. Pipeline

Feature Extraction. Following recent WS-TAL methods (paul2018w; islam2021hybrid), we construct CO-Net upon snippet-level feature sequences extracted from non-overlapping video volumes, where each volume contains 16 frames. The features for the appearance modality (RGB) and the motion modality (optical flow) are both extracted from a pretrained extractor, i.e., I3D (carreira2017quo), and are 1024-dimensional for each snippet. For the $n$-th video with $T$ snippets, we use matrices $X_n^{R} \in \mathbb{R}^{T \times D}$ and $X_n^{F} \in \mathbb{R}^{T \times D}$ to represent the RGB and FLOW features of the whole video, respectively, where $D$ denotes the dimension of the feature vector. For brevity, we omit the video index $n$ in the rest of this section.
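As a shape-level illustration of the feature extraction step (a sketch assuming precomputed I3D features; the helper below is hypothetical):

```python
import torch

D = 1024                 # per-modality I3D feature dimension
FRAMES_PER_SNIPPET = 16  # non-overlapping 16-frame volumes

def num_snippets(num_frames: int) -> int:
    # One snippet per non-overlapping 16-frame volume (trailing remainder dropped).
    return num_frames // FRAMES_PER_SNIPPET

# Hypothetical example: a 2-minute video at 30 fps yields T = 225 snippets,
# each represented by a 1024-d RGB vector and a 1024-d FLOW vector.
T = num_snippets(2 * 60 * 30)
rgb_feat = torch.randn(T, D)    # X^R in R^{T x D}
flow_feat = torch.randn(T, D)   # X^F in R^{T x D}
```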

Structure Overview. Figure 2 shows the whole pipeline of our proposed CO-Net. Both RGB and FLOW features are fed into two identical cross-modal consensus modules. In each CCM, we select one of the two modalities as the main modality, which is enhanced by removing the task-irrelevant information redundancy with the help of its own global context and the cross-modal local-focused information from the other (auxiliary) modality. Thus we obtain a more task-specific representation for each modality. Then, the enhanced representation is used to produce attention weights, which indicate the probability of each snippet being foreground, through an attention unit that consists of three convolution layers. We aggregate the two attention weights generated from the enhanced features of the two CCMs to produce the final attention weights used in the testing stage. We also fuse the two enhanced features and feed them into a classifier to predict the categorical probabilities for each snippet.

3.3. Cross-modal Consensus Module

Figure 3. An overview of the proposed cross-modal consensus module. The module contains a global-context-aware unit and a cross-modal-aware unit to identify the information redundancy and re-calibrate the features. In this module, the main modality cooperates with the auxiliary modality to generate channel-wise descriptors that govern the excitation of each channel and filter out the information redundancy. The main modality features are then enhanced by the channel-wise attention mechanism as $\hat{X}^{R}$. The workflow is the same when the roles of the two modalities are exchanged.

In this work, we employ a cross-modal consensus module to filter out the task-irrelevant information redundancy for each modality before the downstream learning task. The proposed cross-modal consensus module consists of a global-context-aware unit and a cross-modal-aware unit, which identify the information redundancy and filter it out via channel-wise suppression on the features. As shown in Figure 3, we treat the appearance modality (RGB features) as the main modality and the motion modality (FLOW features) as the auxiliary modality fed into our proposed cross-modal consensus module, while the same workflow is performed when the roles of the two modalities are exchanged. For convenience of expression, we take the RGB features as the main modality in the rest of the article.

As the features are extracted from an encoder pretrained on large datasets that are not related to the WS-TAL task, they may contain task-irrelevant, misleading redundancy that restricts the localization performance. Given the main modality and the auxiliary modality, instead of directly concatenating them, we design a mechanism to filter out the task-irrelevant information redundancy in the main modality. Motivated by the self-attention mechanism (vaswani2017attention) and the squeeze-and-excitation block (hu2018squeeze), we develop a similar mechanism, named cross-modal attention, to identify the information redundancy and filter it out.

In the global-context-aware unit, we first squeeze the modality-specific global context information into a video-level feature, which is aggregated from the main modality $X^{R}$ using an average pooling operator $\mathrm{AP}(\cdot)$ along the temporal dimension. Then, we adopt a convolution layer $f_{g}^{R}$ to fully capture channel-wise dependencies and produce the modality-specific global-aware descriptor $d_{g}^{R} \in \mathbb{R}^{1 \times D}$. The process is formulated below:

$d_{g}^{R} = f_{g}^{R}\big(\mathrm{AP}(X^{R})\big)$  (1)

As multiple modalities provide information from different perspectives, we can leverage the information from the auxiliary modality to detect the task-irrelevant information redundancy in the main modality. Thus, in the cross-modal-aware unit, we aim to capture the cross-modal local-specific information from the auxiliary modality features $X^{F}$. Here, we introduce a convolution layer $f_{l}^{F}$ that embeds the features of the auxiliary modality to produce a cross-modal local-focused descriptor $d_{l}^{F} \in \mathbb{R}^{T \times D}$ as follows:

$d_{l}^{F} = f_{l}^{F}\big(X^{F}\big)$  (2)

Here, we obtain the channel-wise descriptor for feature re-calibration by multiplying the modality-specific global-aware descriptor $d_{g}^{R}$ with the cross-modal local-focused descriptor $d_{l}^{F}$. Finally, the task-irrelevant information redundancy is filtered out via the cross-modal attention mechanism as follows:

$\hat{X}^{R} = X^{R} \odot \sigma\big(d_{g}^{R} \odot d_{l}^{F}\big)$  (3)

where $\sigma(\cdot)$ is a Sigmoid function and "$\odot$" denotes the element-wise multiplication operator (the global-aware descriptor is broadcast along the temporal dimension). Remarkably, $d_{g}^{R}$ and $d_{l}^{F}$ can be treated as the "Query" and "Key" in the self-attention module (vaswani2017attention). Instead of using a softmax operator, we apply a Sigmoid function to produce channel-wise re-calibration weights to enhance the original main modality features $X^{R}$.
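A minimal PyTorch sketch of the CCM described by Eqs. (1)-(3) is given below; it follows our reading of the text (1x1 convolutions and a broadcast element-wise product of the two descriptors), not the authors' released code.

```python
import torch
import torch.nn as nn

class CrossModalConsensusModule(nn.Module):
    """Sketch of CCM: re-calibrate main-modality features channel-wise using
    their own global context and local context from the auxiliary modality."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        # Global-context-aware unit: conv over the temporally pooled feature.
        self.global_conv = nn.Conv1d(dim, dim, kernel_size=1)
        # Cross-modal-aware unit: conv over the auxiliary features.
        self.local_conv = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x_main: torch.Tensor, x_aux: torch.Tensor) -> torch.Tensor:
        # x_main, x_aux: (B, T, D)
        main = x_main.transpose(1, 2)                     # (B, D, T)
        aux = x_aux.transpose(1, 2)                       # (B, D, T)

        # Eq. (1): modality-specific global-aware descriptor, shape (B, D, 1).
        d_g = self.global_conv(main.mean(dim=2, keepdim=True))
        # Eq. (2): cross-modal local-focused descriptor, shape (B, D, T).
        d_l = self.local_conv(aux)
        # Eq. (3): channel-wise re-calibration weights in (0, 1),
        # broadcasting the global descriptor over time ("Query" x "Key").
        weights = torch.sigmoid(d_g * d_l)                # (B, D, T)
        enhanced = main * weights                         # filter the redundancy
        return enhanced.transpose(1, 2)                   # (B, T, D)
```

Exchanging the two arguments gives the second CCM, which enhances the FLOW features instead.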

3.4. Dual Modal-specific Attention Units

After obtaining the enhanced features, we attempt to produce modality-specific temporal attention weights that indicate the snippet-level probability of being foreground. Here, following previous works (lee2020background; islam2021hybrid), we feed the enhanced features into the attention units to obtain the modality-specific attention weights:

$\mathcal{A}^{R} = g^{R}\big(\hat{X}^{R}\big), \qquad \mathcal{A}^{F} = g^{F}\big(\hat{X}^{F}\big)$  (4)

where $g^{R}$ is the attention unit for RGB, consisting of three convolution layers, with the same structure as the attention unit $g^{F}$ for FLOW.

As we have two CCMs in the proposed CO-Net, we obtain the RGB-enhanced features $\hat{X}^{R}$ and the modality-specific attention weights $\mathcal{A}^{R}$ from the CCM that treats the appearance modality as the main modality and the motion modality as the auxiliary modality, while we obtain the FLOW-enhanced features $\hat{X}^{F}$ and the modality-specific attention weights $\mathcal{A}^{F}$ from the other CCM, in which the roles of the two modalities are exchanged.

After obtaining the enhanced features (i.e., $\hat{X}^{R}$ and $\hat{X}^{F}$) and the modality-specific attention weights (i.e., $\mathcal{A}^{R}$ and $\mathcal{A}^{F}$), we first fuse the two attention weights:

$\mathcal{A} = \tfrac{1}{2}\big(\mathcal{A}^{R} + \mathcal{A}^{F}\big)$  (5)

We consider that the two modality-specific attention weights, produced from the two enhanced features respectively, have different emphases on the video, while the fused attention weights $\mathcal{A}$ can better represent the probability of a snippet being foreground because they trade off between the two modality-specific attention weights. Finally, we concatenate the two types of enhanced features, i.e., $\hat{X}^{R}$ and $\hat{X}^{F}$, to form the fused features $X^{fuse} \in \mathbb{R}^{T \times 2D}$ and feed them into a classifier that contains three convolution layers to produce the temporal class activation map (T-CAM) $\mathcal{T} \in \mathbb{R}^{T \times (C+1)}$ for the given video, where the $(C+1)$-th class is the background class.
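Below is a sketch of the dual attention units, the attention fusion, and the classifier in PyTorch. The attention-unit widths (512, 512, 1) and kernel sizes follow Section 4.2; the final sigmoid, the hidden widths of the classifier, and the simple averaging in Eq. (5) are our assumptions.

```python
import torch
import torch.nn as nn

def attention_unit(dim: int = 1024) -> nn.Sequential:
    """Three temporal conv layers (512, 512, 1) producing per-snippet
    foreground attention; the trailing sigmoid is our assumption."""
    return nn.Sequential(
        nn.Conv1d(dim, 512, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv1d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv1d(512, 1, kernel_size=1), nn.Sigmoid(),
    )

def classifier(dim: int = 2048, num_classes: int = 20) -> nn.Sequential:
    """Three temporal conv layers mapping fused features to a T-CAM over
    C action classes plus one background class (hidden widths assumed)."""
    return nn.Sequential(
        nn.Conv1d(dim, 512, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv1d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv1d(512, num_classes + 1, kernel_size=1),
    )

# Usage (shapes only): the CCM outputs have shape (B, T, 1024).
B, T = 2, 500
x_rgb_enh, x_flow_enh = torch.randn(B, T, 1024), torch.randn(B, T, 1024)
att_r = attention_unit()(x_rgb_enh.transpose(1, 2)).squeeze(1)    # A^R: (B, T)
att_f = attention_unit()(x_flow_enh.transpose(1, 2)).squeeze(1)   # A^F: (B, T)
att = 0.5 * (att_r + att_f)                                        # Eq. (5), assumed average
fused = torch.cat([x_rgb_enh, x_flow_enh], dim=-1)                 # X^fuse: (B, T, 2048)
tcam = classifier()(fused.transpose(1, 2)).transpose(1, 2)         # T-CAM: (B, T, C+1)
```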

Figure 4. Illustration of the workflow of the mutual learning process. The two temporal attention weights generated from the dual modal-specific attention units learn from each other by treating the other as pseudo labels and stopping the gradient propagation.

3.5. Optimizing Process

Constraints on Attention Weights. Here, we have obtained two modality-specific attention weights (i.e., $\mathcal{A}^{R}$ and $\mathcal{A}^{F}$) and the fused attention weights $\mathcal{A}$. We first apply a mutual learning scheme on the two modality-specific attention weights:

$\mathcal{L}_{ml} = \alpha\, d\big(\mathcal{A}^{R}, sg(\mathcal{A}^{F})\big) + (1-\alpha)\, d\big(\mathcal{A}^{F}, sg(\mathcal{A}^{R})\big)$  (6)

where $sg(\cdot)$ represents a function that truncates the gradient of its input, $d(\cdot,\cdot)$ denotes a similarity metric function, and $\alpha$ is a hyperparameter. In Eq. 6, we treat $\mathcal{A}^{R}$ and $\mathcal{A}^{F}$ as pseudo-labels of each other (as shown in Figure 4), so that they can learn from each other and align the attention weights. Here, we adopt the mean square error (MSE) as the function $d(\cdot,\cdot)$ in Eq. 6. Besides MSE, we also discuss other similarity metric functions (i.e., Jensen-Shannon (JS) divergence, Kullback-Leibler (KL) divergence and mean absolute error (MAE)) applied in Eq. 6 in Section 4.4.
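A minimal sketch of this mutual learning loss, assuming MSE as the similarity metric and interpreting the hyperparameter as a balance between the two directions (our assumption):

```python
import torch
import torch.nn.functional as F

def mutual_learning_loss(att_r: torch.Tensor,
                         att_f: torch.Tensor,
                         alpha: float = 0.5) -> torch.Tensor:
    """Sketch of Eq. (6): each modality-specific attention treats the other,
    gradient-detached, as a pseudo target; MSE plays the role of d(.,.).
    The placement of the balancing weight `alpha` is our assumption."""
    loss_r = F.mse_loss(att_r, att_f.detach())   # sg() corresponds to .detach()
    loss_f = F.mse_loss(att_f, att_r.detach())
    return alpha * loss_r + (1.0 - alpha) * loss_f

# Example: attention weights of shape (B, T) in [0, 1].
loss_ml = mutual_learning_loss(torch.rand(2, 500, requires_grad=True),
                               torch.rand(2, 500, requires_grad=True))
```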

In addition, the distribution of the attention weights should be opposite to the probability distribution of the background class in the T-CAM $\mathcal{T}$:

$\mathcal{L}_{opp} = \frac{1}{T}\sum_{t=1}^{T}\big|\,\mathcal{A}(t) + \mathcal{T}(t, C+1) - 1\,\big|$  (7)

where $|\cdot|$ is an absolute value function and $\mathcal{T}(\cdot, C+1)$ is the last column of the T-CAM, which represents the probability of each snippet being background. We also utilize a normalization loss to make the attention weights more polarized:

$\mathcal{L}_{norm} = \frac{1}{T}\,\big\|\mathcal{A}\big\|_{1}$  (8)

where $\|\cdot\|_{1}$ is an L1-norm function.
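The two constraints above admit a compact sketch; whether the background column is softmax-normalized before Eq. (7) is our assumption.

```python
import torch

def opposite_loss(att: torch.Tensor, tcam: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (7): the fused attention should be opposite to the
    background probability taken from the last T-CAM column."""
    bg_prob = torch.softmax(tcam, dim=-1)[..., -1]   # (B, T), softmax assumed
    return (att + bg_prob - 1.0).abs().mean()

def norm_loss(att: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (8): an L1 penalty that polarizes the attention weights."""
    return att.abs().mean()

loss_opp = opposite_loss(torch.rand(2, 500), torch.randn(2, 500, 21))
loss_norm = norm_loss(torch.rand(2, 500))
```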

Constraints on T-CAMs and Features. In order to better recognize background activity, we apply the fused attention weights to suppress the background snippets in the T-CAM and obtain the suppressed T-CAM $\hat{\mathcal{T}}$:

$\hat{\mathcal{T}} = \mathcal{A} \odot \mathcal{T}$  (9)

In this work, we apply the widely used top-k multiple-instance learning loss (paul2018w) on the T-CAM $\mathcal{T}$ and the suppressed T-CAM $\hat{\mathcal{T}}$, denoted as $\mathcal{L}_{mil}$. We also apply the co-activity similarity loss $\mathcal{L}_{cas}$ (paul2018w) on the fused features and the suppressed T-CAM to learn better representations and a better T-CAM. (Both the top-k multiple-instance learning loss and the co-activity similarity loss are widely used in current WS-TAL methods; they are not the main contributions of this work, so we do not detail them here. More details can be found in (paul2018w).) Because we utilize the suppressed T-CAM in the testing stage (Section 3.6), we only apply $\mathcal{L}_{cas}$ on the suppressed T-CAM.
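A sketch of the suppression in Eq. (9) together with a top-k MIL loss in the style of (paul2018w); the value of k, the softmax aggregation, and the handling of the background column are our assumptions rather than the paper's exact choices.

```python
import torch

def suppressed_tcam(tcam: torch.Tensor, att: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (9): modulate the T-CAM with the fused attention so that
    background snippets are suppressed."""
    return tcam * att.unsqueeze(-1)                          # (B, T, C+1)

def topk_mil_loss(tcam: torch.Tensor, video_label: torch.Tensor,
                  k_ratio: int = 8) -> torch.Tensor:
    """Sketch of a top-k MIL loss: average the top-k snippet scores per class
    as the video-level logit, then apply cross-entropy against the normalized
    video-level label. `video_label` has shape (B, C+1); setting its background
    column to 1 for the raw T-CAM and 0 for the suppressed one is a common
    convention we assume here, not the paper's stated recipe."""
    B, T, _ = tcam.shape
    k = max(1, T // k_ratio)                                 # k_ratio assumed
    video_logits = tcam.topk(k, dim=1).values.mean(dim=1)    # (B, C+1)
    video_prob = torch.softmax(video_logits, dim=-1)
    target = video_label / video_label.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    return -(target * torch.log(video_prob + 1e-6)).sum(dim=-1).mean()

# Example: apply the loss to both the raw and the suppressed T-CAM.
tcam = torch.randn(2, 500, 21)                               # B=2, T=500, C=20 (+1 background)
att = torch.rand(2, 500)
labels_bg = torch.zeros(2, 21); labels_bg[:, 3] = 1.0; labels_bg[:, -1] = 1.0
labels_fg = labels_bg.clone(); labels_fg[:, -1] = 0.0
loss_mil = topk_mil_loss(tcam, labels_bg) + topk_mil_loss(suppressed_tcam(tcam, att), labels_fg)
```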

Final Objective Function. Finally, we aggregate all the aforementioned objective functions to form the final objective function for optimizing the whole framework:

$\mathcal{L} = \mathcal{L}_{mil} + \mathcal{L}_{cas} + \mathcal{L}_{ml} + \lambda_{1}\,\mathcal{L}_{opp} + \lambda_{2}\,\mathcal{L}_{norm}$  (10)

where $\lambda_{1}$ and $\lambda_{2}$ are hyperparameters. By optimizing this final objective function, our framework learns more robust representations and produces a more accurate T-CAM.
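As a one-function sketch of Eq. (10) under the reconstructed notation (the default weights below are placeholders, not the paper's tuned values):

```python
def total_loss(l_mil, l_cas, l_ml, l_opp, l_norm,
               lambda_1: float = 1.0, lambda_2: float = 1.0):
    """Combine the CO-Net losses; lambda_1 and lambda_2 weight the last two
    regularization terms (placeholder defaults)."""
    return l_mil + l_cas + l_ml + lambda_1 * l_opp + lambda_2 * l_norm
```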

3.6. Temporal Action Localization

At the testing stage, we follow the procedure of (islam2021hybrid). Firstly, we calculate the video-level categorical probabilities that indicate the possibility of each action class occurring in the given video. Then we set a threshold to determine the action classes to be localized in the video. For each selected action class, we threshold the attention weights to drop the background snippets and obtain class-agnostic action proposals by selecting the continuous components of the remaining snippets. As stated in Section 3.1, a candidate action proposal is a four-tuple $(t_s, t_e, c, q)$. After obtaining the action proposals, we utilize the suppressed T-CAM to calculate the class-specific score for each proposal using the Outer-Inter Score (shou2018autoloc). Moreover, we apply multiple thresholds to the attention weights to enrich the proposal set with proposals at different scales. Finally, we remove overlapping proposals using soft non-maximum suppression.
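A simplified sketch of this inference procedure (multi-threshold class-agnostic grouping plus an Outer-Inner-Contrastive style score): the threshold set, the margin ratio, and the fact that proposals are kept in snippet units (conversion to seconds and soft-NMS omitted) are assumptions of this sketch.

```python
import numpy as np

def generate_proposals(att: np.ndarray, tcam: np.ndarray, cls: int,
                       thresholds=(0.1, 0.2, 0.3, 0.4, 0.5),
                       margin_ratio: float = 0.25):
    """Threshold the fused attention with multiple values, take contiguous
    foreground runs as class-agnostic proposals, and score each with an
    outer-inner contrast on the suppressed T-CAM column of class `cls`."""
    T = len(att)
    proposals = []
    for th in thresholds:
        fg = att > th
        start = None
        for t in range(T + 1):
            if t < T and fg[t] and start is None:
                start = t                                  # open a segment
            elif (t == T or not fg[t]) and start is not None:
                end = t                                    # segment is [start, end)
                inner = tcam[start:end, cls].mean()
                margin = max(1, int(margin_ratio * (end - start)))
                lo, hi = max(0, start - margin), min(T, end + margin)
                outer_idx = list(range(lo, start)) + list(range(end, hi))
                outer = tcam[outer_idx, cls].mean() if outer_idx else 0.0
                proposals.append((start, end, cls, float(inner - outer)))
                start = None
    return proposals   # snippet-level boundaries; soft-NMS applied afterwards
```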

4. Experiments

In this section, we conduct extensive experiments on two public temporal action localization benchmarks, i.e., THUMOS14 (THUMOS14) and ActivityNet1.2 dataset (caba2015activitynet), to investigate the effectiveness of our proposed framework. In addition, we conduct ablation studies to discuss each component in CO-Net and visualize some results.

Supervision Method | mAP@IoU (%): 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 | AVG mAP (%): 0.1:0.5 0.1:0.7 0.1:0.9
Fully S-CNN (shou2016temporal) (2016) 47.7 43.5 36.3 28.7 19.0 10.3 5.3 - - 35.0 24.3 -
SSN(zhao2017temporal) (2017) 60.3 56.2 50.6 40.8 29.1 - - - - 47.4 - -
BSN (lin2018bsn) (2018) - - 53.5 45.0 36.9 28.4 20.0 - - - - -
TAL-Net (chao2018rethinking) (2018) 59.8 57.1 53.2 48.5 42.8 33.8 20.8 - - 52.3 45.1 -
P-GCN(zeng2019graph) (2019) 69.5 67.5 63.6 57.8 49.1 - - - - 61.5 - -
Weakly† CMCS(liu2019completeness) (2019) 57.4 50.8 41.2 32.1 23.1 15.0 7.0 - - 40.9 32.4 -
STAR (xu2019segregated) (2019) 68.8 60.0 48.7 34.7 23.0 - - - - 47.4 - -
3C-Net (narayan20193c) (2019) 59.1 53.5 44.2 34.1 26.6 - 8.1 - - 43.5 - -
PreTrimNet (zhang2020multi) (2020) 57.5 50.7 41.4 32.1 23.1 14.2 7.7 - - 41.0 23.7 -
SF-Net (ma2020sf) (2020) 71.0 63.4 53.2 40.7 29.3 18.4 9.6 - - 51.5 40.8 -
Weakly BaS-Net (lee2020background) (2020) 58.2 52.3 44.6 36.0 27.0 18.6 10.4 3.3 0.4 43.6 35.3 27.9
Gong et al. (gong2020learning) (2020) - - 46.9 38.9 30.1 19.8 10.4 - - - - -
DML (islam2020weakly) (2020) 62.3 - 46.8 - 29.6 - 9.7 - - - - -
A2CL-PT (min2020adversarial) (2020) 61.2 56.1 48.1 39.0 30.1 19.2 10.6 4.8 1.0 46.9 37.8 30.0
TSCN (zhai2020two) (2020) 63.4 57.6 47.8 37.7 28.7 19.4 10.2 3.9 0.7 47.0 37.8 29.9
ACSNet (liu2021acsnet) (2021) - - 51.4 42.7 32.4 22.0 11.7 - - - - -
HAM-Net (islam2021hybrid) (2021) 65.9 59.6 52.2 43.1 32.6 21.9 12.5 4.4* 0.7* 50.7 39.8 32.5
UM (lee2021Weakly) (2021) 67.5 61.2 52.3 43.4 33.7 22.9 12.1 3.9* 0.4* 51.6 41.9 33.0
CO-Net 70.1 63.6 54.5 45.7 38.3 26.4 13.4 6.9 2.0 54.4 44.6 35.7
Table 1. Comparisons of CO-Net with other methods on the THUMOS14 dataset. AVG is the average mAP under multiple thresholds, namely 0.1:0.5:0.1, 0.1:0.7:0.1 and 0.1:0.9:0.1; † means that additional information, such as action frequency or human pose, is adopted in the method. * indicates that the results were obtained by contacting the corresponding authors via email.

4.1. Datasets and Metrics

We evaluate our proposed approach on two public benchmark datasets, i.e., THUMOS14 dataset (THUMOS14) and ActivityNet1.2 dataset (caba2015activitynet), for temporal action localization.

THUMOS14. There are 200 validation videos and 213 test videos of 20 action classes in the THUMOS14 dataset. These videos have diverse lengths, and actions occur frequently within them. Following previous works (islam2021hybrid; paul2018w), we use the 200 validation videos to train our framework and the 213 test videos for testing.

ActivityNet1.2. The ActivityNet1.2 dataset is a large temporal action localization dataset with coarser annotations. It is composed of 4,819 training videos, 2,383 validation videos and 2,489 test videos of 100 action classes. We cannot obtain the ground-truth annotations for the test videos because they are withheld for the challenge. Therefore, we utilize the validation videos for testing (islam2021hybrid; islam2020weakly).

Evaluation Metrics. In this work, we evaluate our method with the mean average precision (mAP) under several intersection over union (IoU) thresholds, which is the standard evaluation metric for temporal action localization (paul2018w). Moreover, we utilize the officially released evaluation code (http://github.com/activitynet/ActivityNet) to measure our results.
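The official ActivityNet toolkit computes the full mAP; the snippet below only sketches the temporal IoU criterion that underlies mAP@IoU.

```python
def temporal_iou(p_start: float, p_end: float,
                 g_start: float, g_end: float) -> float:
    """Temporal intersection over union between a predicted proposal and a
    ground-truth segment; a proposal counts as correct at threshold t if
    its IoU with a ground-truth instance of the same class is at least t."""
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0

# Example: proposal [12.0s, 20.0s] vs. ground truth [10.0s, 18.0s] -> IoU 0.6.
print(temporal_iou(12.0, 20.0, 10.0, 18.0))
```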

4.2. Implementation Details

In this work, we implement our method in PyTorch (paszke2019pytorch). First, we apply I3D networks (carreira2017quo) pretrained on Kinetics-400 (kay2017kinetics) to extract both RGB and FLOW features for each video, following previous works (islam2020weakly; paul2018w). We sample continuous, non-overlapping 16-frame volumes from the video as snippets, and the features of each snippet are 1024-dimensional for each modality. In the training stage, we randomly sample 500 snippets for the THUMOS14 dataset and 60 snippets for the ActivityNet1.2 dataset, while all snippets are taken during testing. For fair comparison, we do not finetune the feature extractor, i.e., I3D. The attention unit is constructed with 3 convolution layers, whose output dimensions are 512, 512 and 1 and whose kernel sizes are 3, 3 and 1. The classification module contains 3 temporal convolution layers. Between the convolution layers, we use Dropout regularization with probability 0.7.

We set the hyperparameters of the last two regularization terms in the final objective function to obtain the best performance on both datasets. In the training process, we sample 10 videos in a batch, among which there are 3 pairs of videos where each pair shares the same categorical tags for the co-activity similarity loss $\mathcal{L}_{cas}$. We deploy the Adam optimizer (kingma2014adam), with a learning rate of 5e-5 and a weight decay of 0.001 for THUMOS14, and 3e-5 and 5e-4 for the ActivityNet1.2 dataset. All experiments are run on a single NVIDIA GTX TITAN (Pascal) GPU.
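A sketch of this optimization setup (THUMOS14 settings); `model` is a stand-in placeholder since the full CO-Net is not reproduced here.

```python
import torch

model = torch.nn.Conv1d(2048, 21, kernel_size=1)   # placeholder for CO-Net
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, weight_decay=0.001)
# For ActivityNet1.2: lr=3e-5, weight_decay=5e-4.

BATCH_VIDEOS = 10        # 10 videos per batch, including 3 same-class pairs for L_cas
SNIPPETS_THUMOS14 = 500  # snippets randomly sampled per video during training
SNIPPETS_ANET12 = 60
DROPOUT_P = 0.7          # dropout between convolution layers
```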

4.3. Comparison With State-of-the-art Methods

We first compare our proposed CO-Net with current weakly supervised state-of-the-art methods and several fully supervised methods, and report the results in Table 1 and Table 2. From Table 1, we can see that our method outperforms all weakly supervised methods under all IoU metrics on the THUMOS14 dataset and is even comparable with fully supervised methods in the low-IoU region. Compared with naive early fusion methods (e.g., HAM-Net (islam2021hybrid) and UM (lee2021Weakly)) and late fusion methods (e.g., TSCN (zhai2020two)), our method gains a significant improvement. For example, the "AVG mAP (0.1:0.7)" of CO-Net vs. that of UM is 44.6% vs. 41.9%. These results show that using the information from different modalities to reduce the task-irrelevant information redundancy benefits temporal action localization. In addition, the results produced by our CO-Net are even comparable with several fully supervised methods in terms of metrics with low IoU, i.e., mAP@IoU 0.1 and mAP@IoU 0.2. Moreover, our method even outperforms some fully supervised methods, e.g., S-CNN (shou2016temporal) and BSN (lin2018bsn). These results validate the effectiveness of our proposed method.

With regard to the results on the ActivityNet1.2 dataset reported in Table 2, our method is still better than the current SOTA methods overall. However, we cannot obtain the same impressive improvement on the ActivityNet1.2 dataset as we do on the THUMOS14 dataset, because the ActivityNet1.2 dataset has only about 1.5 action instances per video, compared with the THUMOS14 dataset which has around 15 action instances per video. Additionally, we find that the annotations of the ActivityNet1.2 dataset are coarser than those of the THUMOS14 dataset. Taking all of this into account, we consider the THUMOS14 dataset more suitable for the temporal action localization task than the ActivityNet1.2 dataset (as discussed in (islam2020weakly)). Therefore, we mainly use the former to verify our method in the following.

Supervision Method | mAP@IoU (%): 0.5 0.75 0.95 | AVG
Fully SSN(zhao2017temporal) (2017) 41.3 27.0 6.1 26.6
Weakly† 3C-Net (narayan20193c) (2019) 35.4 22.9 8.5 21.1
CMCS (liu2019completeness) (2019) 36.8 22.0 5.6 22.4
Weakly BaSNet (lee2020background) (2020) 38.5 24.2 5.6 24.3
ActionBytes (jain2020actionbytes) (2020) 39.4 - - -
DGAM (shi2020weakly) (2020) 41.0 23.5 5.3 24.4
Gong et al. (gong2020learning) (2020) 40.0 25.0 4.6 24.6
TSCN (zhai2020two) (2020) 37.6 23.7 5.7 23.6
RefineLoc (pardo2021refineloc) (2021) 38.7 22.6 5.5 23.2
HAM-Net (islam2021hybrid) (2021) 41.0 24.8 5.3 25.1
UM (lee2021Weakly) (2021) 41.2 25.6 6.0 25.9
ACSNet (liu2021acsnet) (2021) 40.1 26.1 6.8 26.0
CO-Net 43.3 26.3 5.2 26.4
Table 2. Comparison of our algorithm with other methods on the ActivityNet1.2 dataset. AVG means average mAP from IoU 0.5 to 0.95 with 0.05 increment.

4.4. Ablation study

In this work, we propose a cross-modal consensus module to re-calibrate the representations and produce the enhanced features, and a mutual learning loss to enable the two CCMs to learn from each other. Our final objective function also consists of several components. Here, we first conduct ablation studies to investigate the effect of each objective function. Then we discuss different combinations of main and auxiliary modalities in the cross-modal consensus module. Finally, we also report the results of different multi-modal fusion methods as well as SE-attention (hu2018squeeze) replacing the CCM in CO-Net, to verify the effectiveness of CCM.

Exp Avg mAP (%)
1 38.1
2 40.0
3 41.4
4 42.8
5 42.6
6 44.6
Table 3. Ablation studies of our algorithm in terms of average mAP under IoU thresholds 0.1:0.7:0.1.
Metric | mAP@IoU: 0.1 0.2 0.3 0.4 0.5 0.6 0.7 | AVG
MAE 69.2 63.0 53.8 45.2 38.6 26.2 13.9 44.3
KL 67.7 62.2 53.9 44.8 37.3 25.7 14.7 43.8
JS 69.2 63.3 54.5 46.0 38.3 26.5 14.1 44.5
MSE 70.1 63.6 54.5 45.7 38.3 26.4 13.4 44.6
Table 4. Ablation studies of different types of mutual learning loss in terms of average mAP under IoU thresholds from 0.1 to 0.7 with an interval of 0.1.

Effect of each component of the final objective function. Each component in the final objective function (Eq. 10) plays an important role in our framework, helping to learn the feature representations and the final predictions. To verify the effectiveness of each objective function, we conduct the related ablation studies and report the results in Table 3. We can find that each objective function contributes to the final performance. We treat "Exp 1" as our baseline, which only uses the multiple-instance learning loss $\mathcal{L}_{mil}$. It is notable that our baseline is similar to BaS-Net (lee2020background) but outperforms the latter by 2.8%, because our baseline contains the cross-modal consensus module to filter out the task-irrelevant information redundancy from the two modalities and uses the concatenation of the two enhanced features as the representation of each snippet. The results in Table 3 show that each component in the final objective function helps to train our proposed CO-Net. We also evaluate the effect of different types of mutual learning loss and report the results in Table 4. All types of mutual learning loss outperform the current state-of-the-art results shown in Table 1. These results indicate that it is necessary for the two CCMs to learn from each other, and that MSE is the most suitable metric.

Main Auxiliary | mAP@IoU: 0.1 0.3 0.5 0.7 | AVG
Local Local 68.1 52.6 36.3 13.2 43.0
Local Global 70.0 54.1 37.5 12.3 44.0
Global Local 70.1 54.5 38.3 13.4 44.6
Table 5. Comparisons of different combinations of the main modality and the auxiliary modality in our cross-modal consensus module. "Global" means that a convolution layer after global pooling is adopted to capture the modal-specific global context, while "Local" means a convolution layer without global pooling, i.e., local-focused.

Effect of different combinations of the two modalities. In our proposed CCM, we treat one modality as the main modality and the other as the auxiliary modality. In Section 3.3, we utilize the main modality to generate the modality-specific global-aware descriptor $d_{g}^{R}$, while the auxiliary modality derives the cross-modal local-focused descriptor $d_{l}^{F}$ via a convolution layer. Here, we evaluate different combinations of the main and auxiliary modality in our cross-modal consensus module. The results are reported in Table 5. We can find that obtaining the global context information from either the main or the auxiliary modality brings a stable improvement compared with the first row of Table 5; e.g., the third row outperforms the first row by 1.6% in AVG mAP. This verifies that the global context information helps to guide the recognition of information redundancy. In addition, we get the best results when we obtain the global context information from the main modality. This is because we aim to remove the task-irrelevant information redundancy from the main modality rather than the auxiliary modality, and obtaining the global context information from the main modality captures the overall information of the main modality.

Figure 5. The illustration of the action localization results predicted by our full method and several variant methods on several video samples. Action proposals are represented by green boxes. The horizontal and vertical axes are time and intensity of attention, respectively. The method “Ours + CCM” means our full method CO-Net.
method Add Concat SSMA (valada2019self) SE (hu2018squeeze) CCM
Avg mAP 39.9 39.5 38.0 43.0 44.6
Table 6. Comparisons with other multi-modal early fusion methods (i.e., addition and concatenation), SSMA (valada2019self) and SE-attention (hu2018squeeze) in CO-Net in terms of average mAP under IoU thresholds 0.1:0.7:0.1.

Comparison with other fusion methods. To verify that our proposed cross-modal consensus module is more suitable for WS-TAL than other fusion methods, we compare it with other fusion methods and report the results in Table 6, in which SSMA is the fusion method of (valada2019self). Our proposed CCM obtains the best results compared with the other fusion methods; e.g., our CO-Net with CCM outperforms the "Concat" variant by 5.1%. We can also find that SSMA even underperforms "Add" and "Concat", because it contains a specific structure that is not suitable for temporal action localization. The SE-attention mechanism (hu2018squeeze) also gains an improvement compared with the early fusion methods, but our method still outperforms SE by 1.6%. The results in Table 6 verify that our proposed cross-modal consensus module can better fuse the two modalities to boost performance than those fusion methods. Moreover, although CO-Net also concatenates the two types of features after filtering the information redundancy with CCM, the results of "Concat" and "CCM" in Table 6 indicate that our method with CCM performs much better than applying "Concat" to the original features, showing the significance of feature re-calibration for more representative features.

4.5. Visual Results

To better illustrate our method, we visualize the detection results of several samples in Figure 5 using the methods in Table 6. It is obvious that our method with CCM predicts more accurate localizations than the other fusion methods, showing the significance of removing the task-irrelevant information redundancy and the efficacy of our CCM.

5. Conclusion

In this work, we explore feature re-calibration for action localization to reduce the redundancy, and propose a cross-modal consensus network to tackle this problem. We utilize a cross-modal consensus module to filter out the information redundancy in the main modality with the help of information from the different perspective of the auxiliary modality. We also apply a mutual learning loss to enable the two cross-modal consensus modules to learn from each other for mutual promotion. Finally, we conduct extensive experiments to verify the effectiveness of our CO-Net, and the ablation results show that our proposed cross-modal consensus module helps to produce more representative features that boost the performance of WS-TAL.

6. Acknowledgement

This work was supported partially by the NSFC (U1911401, U1811461), the Guangdong NSF Project (No. 2020B1515120085, 2018B030312002), the Key-Area Research and Development Program of Guangzhou (202007030004), the Early Career Scheme of the Research Grants Council (RGC) of the Hong Kong SAR under grant No. 26202321, and HKUST Startup Fund No. R9253. This work was partially done during Fa-Ting's and Jia-Chang's internships in ARC, PCG Tencent.

References