CT-SAT: Contextual Transformer for Sequential Audio Tagging

by Yuanbo Hou, et al.
Ghent University

Sequential audio event tagging can provide not only the type information of audio events, but also the order information between events and the number of events that occur in an audio clip. Most previous works on audio event sequence analysis rely on connectionist temporal classification (CTC). However, CTC's conditional independence assumption prevents it from effectively learning correlations between diverse audio events. This paper first attempts to introduce Transformer into sequential audio tagging, since Transformers perform well in sequence-related tasks. To better utilize the contextual information of audio event sequences, we draw on the idea of bidirectional recurrent neural networks and propose a contextual Transformer (cTransformer) with a bidirectional decoder that can exploit the forward and backward information of event sequences. Experiments on a real-life polyphonic audio dataset show that, compared to CTC-based methods, the cTransformer can effectively combine fine-grained acoustic representations from the encoder with coarse-grained audio event cues to exploit contextual information and successfully recognize and predict audio event sequences.






1 Introduction

Audio Tagging (AT) is a multi-label classification task that identifies which target audio events occur in an audio clip. AT predicts only the types of events occurring in an audio clip, not the order between these events or how many times they occur. Audio events naturally occur in sequence, and there is often a relationship between preceding and following events. This paper studies sequential audio tagging (SAT), which aims to learn such relationships between events and to predict the sequences of audio events in audio clips. SAT can be applied to tasks such as audio classification [audio_cl], audio captioning [audio_cap], acoustic scene analysis [acoustic_scene], and event anticipation [event_pred].

Previous works related to SAT mostly rely on connectionist temporal classification (CTC) [ctc] to identify event sequences. Paper [dcase2018_ctc] explores the possibility of polyphonic SAT using sequential labels and utilizes CTC to train convolutional recurrent neural networks (CRNN) [crnn] with learnable gated linear units (GLU) [GLU] to tag event sequences. As audio events often overlap with each other, the order of the start and end boundaries of events is used in [dcase2018_ctc] as sequential labels. For example, the double-boundary sequential label of an audio clip might be dishes_start, dishes_end, speech_start, blender_start, speech_end, speech_start, blender_end, speech_end.

Sequential labels do not contain the onset and offset times of audio events, which avoids the problem of inaccurate frame-level annotations and reduces the annotation workload. In addition to exploring the feasibility of recognizing audio event sequences in SAT, CTC-based methods have also been attempted for sound event detection (SED), which detects the type, start time, and end time of audio events. A bidirectional long short-term memory (LSTM) RNN [lstm] equipped with CTC (BLSTM-CTC) [wangyun] is used to detect events using double-boundary sequential labels. The results [wangyun] on a very noisy corpus show that BLSTM-CTC is able to locate the boundaries of audio events given rough hints about their positions. Apart from methods using double-boundary labels, another CTC-based SED system [hou2019sound] uses single-boundary sequential labels (the order of the start boundaries of events) with unsupervised clustering to detect the type and occurrence time of audio events. CTC redefines the loss function of RNNs [ctc] and allows them to be trained for sequence-related tasks while keeping the order information of events. However, CTC implicitly assumes that the outputs of the network at different time steps are conditionally independent [ctc], which makes CTC-based approaches unable to effectively learn the contextual information inherent in audio event sequences. This paper attempts to introduce Transformers [Transformer], which have revolutionized the field of natural language processing [nlp], into SAT. Transformer [Transformer] does not have CTC's conditional independence assumption. Compared with RNN-based models, Transformer can access information at any time step from any other time step, thereby capturing long-term dependencies [long_term] between audio events. In addition, the training of Transformer can be efficiently parallelized.

Figure 1: The proposed contextual Transformer. In the forward and backward masks, the red, gray, and white blocks indicate the masked position of the information to be predicted, the positions of the masked information, and the positions of the available information, respectively.

When learning sequence information, the decoder in Transformer [Transformer] exploits past information to infer the upcoming event. For example, when recognizing the audio event sequences “fire, alarm, run” and “fire, crying, sobbing”, the model may confuse alarm and crying when forward-inferring the next event from fire. But if the target event is backward-inferred from run and sobbing, respectively, the probabilities of alarm and crying differ between the two sequences. Contextual information can thus help the model learn the differences between sequences in detail. To more comprehensively utilize the contextual information in audio event sequences, this paper draws on the idea of bidirectional RNNs [brnn] and proposes a contextual Transformer (cTransformer) to explore the bidirectional information of audio event sequences. The cTransformer consists of an encoder and a decoder, the latter of which is the main contribution of this paper. The decoder attempts to fuse frame-level representations from the encoder with clip-level event cues to infer the target by combining the forward and backward information learned from the normal and reverse sequences, respectively. Then, the loss between the prediction from the normal sequence branch and the prediction from the reverse sequence branch is calculated and fed back to update the parameters, yielding a more accurate prediction for the same target. During training, some weights of the normal and reverse sequence branches are shared; these shared weights learn both forward and backward information. That is, with the help of the shared weights, the decoder is able to learn contextual information simultaneously and thus identify audio event sequences more comprehensively and accurately.

The contributions of this paper are: 1) introducing Transformer into SAT; 2) proposing the cTransformer, which can utilize bidirectional information to better identify audio event sequences in audio clips; 3) to explore the feasibility of SAT based on the cTransformer, manually annotating a real-life polyphonic audio dataset with sequential labels and comparing the performance of the cTransformer and other CTC-based methods on it. This paper is organized as follows: Section 2 introduces the cTransformer; Section 3 describes the dataset and experimental setup, and analyzes the results; Section 4 gives conclusions.

2 Contextual Transformer

Motivated by the performance of Transformer in sequence modeling [Transformer, devlin2018bert] and the significance of contextual information in audio tasks [contextual_audio, contextual_audio2, contextual_audio3], this paper proposes the cTransformer for audio event sequence analysis. The cTransformer aims to transform acoustic features into the corresponding event sequential labels using both global information and rich contextual details.

2.1 Data preparation

In audio event tasks, the most frequently used acoustic feature is the log mel spectrogram [mel]. The audio clip is converted to the time-frequency representation of the log mel spectrogram and input to the model. Following [hou2019sound], the start-boundary order of events is used as the sequential label. For the normal sequence branch in Figure 1, the sequential label is <S>, e_1, e_2, …, e_n, <E>, where e_i denotes the i-th event, and <S> and <E> are the default tokens [Transformer] indicating the start and end of prediction, respectively. For the reverse sequence branch, the sequential label is <R>, e_n, e_{n-1}, …, e_1, <E>, where <R> is the token indicating the start of reverse sequence prediction. For example, the sequential label of an audio clip might be “<S>, dishes, speech, speech, blender, speech, <E>”; the corresponding reverse label is “<R>, speech, blender, speech, speech, dishes, <E>”.
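The label construction above can be sketched as a minimal illustration. Note the reverse-start token name `<R>` is an assumption for readability; the paper only specifies that it is a distinct start token:

```python
# Hypothetical helper: the paper specifies <S>/<E> tokens and a distinct
# reverse-start token; the name "<R>" here is an assumption.
def make_labels(events):
    """Build normal and reverse sequential labels from the start-boundary
    order of events in an audio clip."""
    normal = ["<S>"] + events + ["<E>"]
    reverse = ["<R>"] + events[::-1] + ["<E>"]
    return normal, reverse

# Example from Section 2.1:
events = ["dishes", "speech", "speech", "blender", "speech"]
normal, reverse = make_labels(events)
```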

2.2 Encoder in contextual Transformer

The encoder aims to convert input acoustic features into high-level representations. To consider the audio information globally, this paper does not divide input features into small patches [ast], so there is no positional encoding [Transformer] in the encoder. The encoder mainly consists of N identical blocks with multi-head attention (MHA) layers and feed-forward layers, analogous to the encoder in Transformer [Transformer]. The attention function in MHA is scaled dot-product attention, whose input consists of queries and keys of dimension d_k, and values of dimension d_v [Transformer]. The attention is computed on a set of queries, keys, and values packed into matrices Q, K, and V, respectively:

Attention(Q, K, V) = softmax(QK^T / √d_k)V
Then, MHA is used to allow the model to jointly focus on representations from different subspaces at different positions:

MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

where head_i represents the output of the i-th attention head for a total number of h heads; W_i^Q, W_i^K, W_i^V, and W^O are learnable weights. For MHA in the encoder, Q, K, and V come from the same place; in this case, the attention in MHA is called self-attention [Transformer]. Next, the feed-forward layer, which consists of two linear transformations with a ReLU activation function [ReLU] in between, is applied. For the parameters involved above, all refer to the default settings of Transformer [Transformer].
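As an illustration, scaled dot-product attention for a single head can be sketched in plain Python. This is a toy, list-based version for clarity only; real implementations operate on batched tensors in a deep-learning framework:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head, on list-of-lists matrices:
    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        row = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        output.append(row)
    return output
```

With a single key, the softmax weight is 1 and the output equals the value row; with two equidistant keys, the values are averaged.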

2.3 Decoder in contextual Transformer

The cTransformer is expected to efficiently capture contextual information in audio event sequences without reducing the Transformer’s global summarization ability. The global attention in the encoder can attend to the information at each position. However, the self-attention in the Masked MHA of the decoder relies only on forward information to sequentially predict the next event, preserving the autoregressive property [Transformer], as in the normal sequence branch in Figure 1. Thus, a bidirectional sequence decoder that can exploit both forward and backward information is proposed, as shown in the decoder of Figure 1. To enhance the model's ability to capture the contextual information of the target event, the normal and reverse sequence branches jointly predict the same target each time. Since some weights of the two branches are shared, these weights both learn forward information about the target and capture its related backward information, helping the model learn and model the contextual information about each target more accurately.

The decoder consists of two branches with the same structure, and each branch contains M identical blocks, analogous to the decoder in Transformer [Transformer]. In Masked MHA, the forward and backward masks block future and past information, respectively, to preserve the autoregressive property. Positions corresponding to invisible information are masked with −∞ [Transformer]. The attention in Masked MHA is self-attention, which means that Q, K, and V all come from the input sequence of event labels. In the next MHA, which fuses frame-level acoustic representations from the encoder with clip-level event cues (the encoder-decoder attention), Q comes from the previous decoder layer, while K and V come from the output of the encoder. For the i-th target e_i, the input embedding for the normal sequence branch is E_f, and the input embedding for the reverse sequence branch is E_b. Let ŷ_i and ỹ_i be the predictions for e_i from the normal and reverse sequence branches, respectively. For the normal sequence branch exploring forward information, ŷ_i is jointly derived from the output O of the encoder and the embedding E_f after the forward Masked MHA. For the reverse sequence branch exploring backward information, ỹ_i is jointly derived from O and the embedding E_b after the backward Masked MHA:

ŷ_i = F(Attention(E_f W^Q, O W^K, O W^V) + b),  ỹ_i = F′(Attention(E_b W′^Q, O W′^K, O W′^V) + b′)

where b and b′ are the biases in the normal and reverse sequence branches, W^Q, W^K, W^V and W′^Q, W′^K, W′^V are learnable weights in MHA, and F and F′ denote the sets of mapping functions in each branch of the decoder. In the inference phase, the model uses the normal sequence branch for prediction. The remaining layers and parameters in the decoder are the same as those of Transformer [Transformer].
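The forward and backward masks described above can be sketched as follows, assuming the usual additive-mask convention in which −∞ marks invisible positions that the softmax then zeroes out:

```python
NEG_INF = float("-inf")

def forward_mask(n):
    """Normal branch: position i may attend only to positions j <= i;
    future positions are masked with -inf (lower-triangular visibility)."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def backward_mask(n):
    """Reverse branch: position i may attend only to positions j >= i;
    past positions are masked with -inf (upper-triangular visibility)."""
    return [[0.0 if j >= i else NEG_INF for j in range(n)] for i in range(n)]
```

Adding these masks to the attention scores before the softmax gives each branch its one-directional view of the label sequence.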

2.4 Loss function in contextual Transformer

Denote the sequences of predictions {ŷ_i} and {ỹ_i} as ŷ and ỹ, and the corresponding ground-truth label sequences as y and y′, respectively. Following the loss function in Transformer [Transformer], cross-entropy (CE) loss is used for the normal and reverse sequence branches to compute the normal and reverse sequential tagging losses:

L_nor = CE(ŷ, y),  L_rev = CE(ỹ, y′)
Since ŷ and ỹ are predictions for the same target, the mean squared error (MSE) loss, which performs well in regression tasks [mse][mse-regression][mse-classification], is used as the context loss to measure the distance between ŷ and ỹ in the latent space:

L_con = MSE(ŷ, ỹ)
To consider the forward and backward information at the same time during training, the losses of the different branches are calculated together. The final loss of the cTransformer is

L = λ_nor L_nor + λ_rev L_rev + λ_con L_con

where each λ adjusts the weight of its loss component during training and defaults to 1. During the training process, the forward prediction and backward prediction are aligned to capture the rich contextual information around the target event and learn the entire sequence embeddings more accurately.
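A minimal sketch of how the three losses might be combined for a single prediction step. This is illustrative only: it operates on plain probability vectors rather than batched logits, and the default λ values shown are just one setting from the experiments:

```python
import math

def cross_entropy(pred, onehot):
    """CE between a probability vector and a one-hot target."""
    return -sum(t * math.log(p) for p, t in zip(pred, onehot) if t > 0)

def mse(a, b):
    """Mean squared error between two prediction vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def total_loss(p_nor, p_rev, y_nor, y_rev, lam=(1.0, 0.5, 1.0)):
    """Weighted sum of the normal CE, reverse CE, and MSE context losses:
    L = lam[0]*L_nor + lam[1]*L_rev + lam[2]*L_con."""
    l_nor = cross_entropy(p_nor, y_nor)
    l_rev = cross_entropy(p_rev, y_rev)
    l_con = mse(p_nor, p_rev)  # distance between the two branch predictions
    return lam[0] * l_nor + lam[1] * l_rev + lam[2] * l_con
```

When the two branches agree exactly, the context term vanishes and only the two tagging losses remain.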

# {N, M} AUC BLEU # {N, M} AUC BLEU
1 {1, 1} 0.771 0.468 7 {3, 3} 0.784 0.482
2 {1, 2} 0.800 0.491 8 {3, 6} 0.770 0.472
3 {2, 2} 0.775 0.481 9 {4, 2} 0.779 0.467
4 {2, 4} 0.775 0.483 10 {4, 4} 0.787 0.464
5 {2, 5} 0.783 0.473 11 {5, 5} 0.774 0.461
6 {3, 1} 0.782 0.474 12 {6, 6} 0.778 0.456
Table 1: Results of the model with different ratios of encoder blocks N to decoder blocks M.

3 Experiments and results

3.1 Dataset, Baseline, Experiments Setup, and Metrics

Since there is no publicly available polyphonic audio dataset with sequential labels, we manually label the DCASE domestic environment audio dataset [dcase2018] with the start-boundary order of events as sequential labels, following [hou2019sound], and release the sequential label set to motivate more relevant research. The domestic audio dataset, excerpted from Audioset [aduioset], contains 10 classes of real-life polyphonic audio events; the training and test sets consist of 1578 and 288 audio clips, respectively. During training, the validation set is randomly composed of 20% of the samples in the training set. After manual annotation and cross-checking, the training and test sets contain 3619 and 923 event occurrences, respectively, and the longest audio event sequences in them have lengths of 20 and 14.

Most previous audio event sequence analysis works rely on CTC, so BLSTM-CTC [wangyun] is used as the baseline. This paper also compares the cTransformer with CTC-based convolutional bidirectional gated recurrent units (CBGRU-CTC) [csps_ctc], and with CBGRU-CTC equipped with GLU in the convolutional layers (CGLU-BGRU-CTC) [dcase2018_ctc] and in both the convolutional and recurrent layers (CBGRU-GLU-CTC) [hou2019sound].

In training, log mel-band energies with 64 banks [mel] are extracted using STFT with a Hamming window length of 46 ms, with the overlap between windows following the settings of [dcase_kong]. Stochastic gradient descent with momentum (SGDM) [sgdm] with an initial learning rate of 1e-3, a batch size of 64, and a momentum of 0.9 is used to minimize the loss. Dropout [dropout] and layer normalization [layernorm] are used to prevent over-fitting. Systems are trained on a single Tesla V100-SXM2-32GB card for a maximum of 1000 epochs. For more details, the source code, and the manually labeled dataset with sequential labels, please visit the project homepage.
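The SGDM update used here can be illustrated with a toy sketch of one common formulation (velocity accumulates the negative scaled gradient); an actual framework optimizer would be used in practice:

```python
def sgdm_step(weights, grads, velocity, lr=1e-3, momentum=0.9):
    """One SGD-with-momentum update on flat parameter lists:
    v <- momentum * v - lr * g;  w <- w + v.
    (Toy version; formulations differ slightly across frameworks.)"""
    velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    weights = [w + v for w, v in zip(weights, velocity)]
    return weights, velocity
```

With lr=1e-3 and momentum=0.9 as in the paper's setup, repeated steps along a constant gradient accelerate up to a bounded velocity.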


SAT consists of AT plus order information between events. This paper uses precision (P), recall (R), F-score (F), accuracy (Acc) [metrics], and area under the curve (AUC) [AUC] to measure the AT results from various aspects, showing the models' performance on basic event recognition. Then, the bilingual evaluation understudy (BLEU) [bleu], commonly used in sequence tasks, is adopted to comprehensively evaluate the SAT results. Higher P, R, F, Acc, AUC, and BLEU indicate better performance.
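As an illustration of the clip-level tagging metrics, micro-averaged P, R, and F over binary multi-label vectors can be sketched as follows. This is a simplified version: the paper does not specify micro vs. macro averaging, so the averaging choice here is an assumption:

```python
def micro_prf(preds, refs):
    """Micro-averaged precision/recall/F-score over binary multi-label
    vectors (one 0/1 vector of event classes per audio clip)."""
    tp = fp = fn = 0
    for pv, rv in zip(preds, refs):
        for p, r in zip(pv, rv):
            tp += p * r            # predicted and present
            fp += p * (1 - r)      # predicted but absent
            fn += (1 - p) * r      # present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```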

3.2 Results and Analysis

The encoder and decoder of the cTransformer consist of N and M identical blocks, respectively. This paper first explores the optimal ratio of encoder to decoder blocks to determine the final model structure, as shown in Table 1. SAT is equivalent to AT with additional sequence information of events. So, AUC, which avoids the interference of different thresholds, is used to measure the AT results more comprehensively, and BLEU is used to evaluate the SAT results.

In Table 1, the performance of the model does not increase monotonically with the number of blocks. When {N, M} is {1, 2}, the model achieves the best results on the test dataset. In Transformer [Transformer], {N, M} defaults to {6, 6}. The best model in this paper is smaller than Transformer, possibly because the polyphonic audio dataset with manually labeled sequential labels is not large scale, so smaller models with fewer blocks perform well. Indeed, in the experiments, we found that the model overfits more severely when the values of N and M are large.

# F (%) Acc (%) AUC BLEU
1 66.42 90.41 0.780 0.474
2 64.58 89.79 0.765 0.472
3 67.39 90.66 0.785 0.489
4 70.42 91.63 0.800 0.491
Table 2: Ablation experiments of the cTransformer on test set.

The next step is to optimize the scaling factors of the different losses. Different losses target different information. The context loss L_con with MSE aims to align the predictions of the normal and reverse sequence branches, making their predictions of the current event more consistent, while L_nor and L_rev focus on learning task-goal-oriented representations to improve the accuracy of individual event sequence recognition. Table 2 presents ablation studies showing the importance of the information represented by the different losses to the cTransformer.

# λ_nor λ_rev λ_con AUC BLEU # λ_nor λ_rev λ_con AUC BLEU
1 1 0.5 0.1 0.789 0.481 8 0.5 1 1 0.774 0.467
2 1 0.5 0.25 0.803 0.511 9 1 1 0.1 0.791 0.485
3 1 0.5 0.5 0.782 0.488 10 1 1 0.25 0.788 0.501
4 1 0.5 1 0.805 0.505 11 1 1 0.5 0.783 0.487
5 0.5 1 0.1 0.784 0.479 12 0.1 0.1 1 0.763 0.465
6 0.5 1 0.25 0.788 0.482 13 0.25 0.25 1 0.774 0.466
7 0.5 1 0.5 0.778 0.465 14 0.5 0.5 1 0.785 0.472
Table 3: The effect of different λ values on the cTransformer.

In Table 2, # 1 has only the normal sequence branch of the cTransformer; that is, the structure of # 1 in Table 2 is equivalent to Transformer [Transformer]. Conversely, # 2 has only the reverse sequence branch. Except for the result of # 2, which comes from the reverse sequence branch, all results are predicted by the normal sequence branch. The model in # 4 outperforms # 3, which lacks the context loss, indicating that the context loss helps the model effectively integrate the target-related forward information in the normal sequence with the target-related backward information in the reverse sequence. With the support of contextual information, the model can more accurately identify and effectively confirm the target event.

Table 3 further controls the scale of the different losses in a fine-grained manner to filter out the optimal combination of coefficients, comparing the performance of models with different coefficient combinations under controlled variables. Finally, giving the same weight to L_nor and L_con while lightening the weight of L_rev achieves the best AUC in # 4. This reveals that, in these experiments, the cTransformer should focus on capturing forward and contextual information, while putting backward information in a secondary position, for better event sequence recognition.

After the structure of the proposed contextual Transformer and the hyperparameters of the losses are determined, Table 4 compares the cTransformer with the baseline and other methods related to audio event sequence analysis. To analyze the ability of different models to recognize polyphonic audio events from multiple perspectives, several metrics are adopted to evaluate the AT results in Table 4, while the classical BLEU is still used for SAT. In Table 4, BLSTM-CTC [wangyun], which uses only LSTM to extract acoustic representations to identify polyphonic audio event sequences, has the worst overall performance. The CBGRU-CTC [csps_ctc] with a composite convolutional recurrent neural network outperforms BLSTM-CTC overall, which implies the superior ability of the convolutional layers in feature extraction. CGLU-BGRU-CTC [dcase2018_ctc] and CBGRU-GLU-CTC [hou2019sound], with GLU assembled in the convolutional layers and in both the convolutional and recurrent layers, respectively, do not perform very well overall, although they outperform CBGRU-CTC in some metrics. This paper also shows the results of the default Transformer [Transformer] with a 6-layer encoder and decoder. Possibly because the polyphonic audio dataset containing diverse and complex event sequences is not large, the performance of Transformer is close to that of the CTC-based methods. Overall, the cTransformer achieves better results in both AT and SAT. Since data augmentation was not used in the previous CTC-based methods, none of the above models were trained with data augmentation, for a fair comparison.

Method AT SAT
P (%) R (%) F (%) Acc (%) AUC BLEU
BLSTM-CTC [wangyun] 69.73 50.12 58.32 89.47 0.713 0.323
CBGRU-CTC [csps_ctc] 67.79 63.39 63.23 90.93 0.793 0.475
CGLU-BGRU-CTC [dcase2018_ctc] 79.87 60.99 69.17 90.48 0.786 0.468
CBGRU-GLU-CTC [hou2019sound] 75.97 64.30 69.65 91.77 0.787 0.463
Transformer [Transformer] 67.24 64.53 65.86 90.17 0.785 0.432
cTransformer 75.66 67.61 71.41 92.05 0.805 0.505
Table 4: Comparison of SAT and AT results with other methods related to the analysis of audio event sequences.

Figure 2: Attention scores from the masked MHA in the decoder. Subgraphs (a) and (b) are from the normal and reverse sequence branches, respectively. The x-axis shows each event predicted in the autoregressive way; the y-axis shows the corresponding reference event.

To gain a more intuitive insight into the performance of the model on polyphonic audio event sequences, for the same audio clip, Figure 2 shows the distribution of attention scores from the masked MHA of the normal and reverse sequence branches. In Figure 2 (a), after inputting <S>, the attention value for <S> is 1; then, combining acoustic representations from the encoder, the model predicts that the next event should be frying (the event corresponding to the 2nd column of the x-axis), and the reference event label is frying (the event corresponding to the 1st row of the y-axis). Then, when the input is “<S>, frying”, the attention values for the two events are 0.34 and 0.66, respectively; the next event is predicted to be dishes (the 3rd column of the x-axis), and the reference event label is dishes (the 2nd row of the y-axis). Finally, when the input is “<S>, frying, dishes, dishes”, based on the acoustic representations, the model judges that the event sequence is complete and outputs <E> (the 4th row of the y-axis) to indicate that inference stops. After the autoregressive process, the predicted event sequence “frying, dishes, dishes” is obtained; the reference label sequence is also “frying, dishes, dishes”. The exact match between the prediction and the reference indicates that the cTransformer successfully fuses frame-level acoustic representations from the encoder with clip-level event cues from the decoder to jointly infer the event sequence. In Figure 2 (b), the attention scores from the reverse sequence branch for the same audio clip differ from the attention scores for forward inference in Figure 2 (a). Guided by the reverse-start token, the reverse sequence branch, combining the audio representations, successfully predicts the reverse event sequence “dishes, dishes, frying”, whose corresponding label is also “dishes, dishes, frying”. The match between them indicates that, with the assistance of different prediction cues and mask matrices, the cTransformer effectively infers the event sequence in both the normal and reverse directions, which implies that the model is effective at modeling contextual information.
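The autoregressive inference loop walked through for Figure 2 can be sketched generically. Here `step_fn`, which maps a partial sequence to the next predicted event, is a hypothetical stand-in for the decoder plus encoder representations:

```python
def greedy_decode(step_fn, start_token="<S>", end_token="<E>", max_len=20):
    """Autoregressive greedy decoding: repeatedly feed the growing partial
    sequence to step_fn (a stand-in for the decoder) until the end token
    is produced or max_len is reached. Returns the event sequence without
    the start/end tokens."""
    seq = [start_token]
    while len(seq) < max_len:
        nxt = step_fn(seq)
        seq.append(nxt)
        if nxt == end_token:
            break
    return seq[1:-1] if seq[-1] == end_token else seq[1:]

# Toy decoder reproducing the Figure 2 example: emits a canned next event
# based on the current prefix length.
canned = ["frying", "dishes", "dishes", "<E>"]
decoded = greedy_decode(lambda seq: canned[len(seq) - 1])
```

The max_len cap of 20 matches the longest annotated sequence in the training set, though the actual inference limit used in the paper is not stated.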

4 Conclusions

This paper first introduces Transformer into SAT. To utilize the context information of audio event sequences, cTransformer is proposed to recognize diverse event sequences in polyphonic audio clips. The cTransformer can automatically assign different attention scores to the existing information to effectively model contextual information and accurately infer the event, then frame-level acoustic representations and clip-level event cues are efficiently fused to successfully identify and predict event sequences implicit in audio clips. Future work will explore the performance of cTransformer using fully bidirectional information to infer audio event sequences on more datasets.