1 Introduction
With the rapid increase in the number of user-generated videos shared on the Internet, it is becoming increasingly advantageous to explore new ways of retrieving them, for example by automatically detecting events occurring in them. One less explored approach is to analyze the soundtracks. While the analysis of visual content is widely studied in disciplines such as image processing and computer vision, analysis of video soundtracks has largely been restricted to specific, inherently audio-focused tasks such as speech processing and music retrieval.
However, when visual information cannot reliably identify content (e.g., due to poor lighting conditions), audio may still furnish vivid information. In other words, audio content is frequently complementary to visual content; in addition, it is often more tractable to process.
Multimodal approaches that use both visual and audio cues have recently gained traction. However, there has not been much in-depth exploration of how best to leverage the audio information. On the audio side, due to a historical focus on carefully curated speech and music processing corpora, fewer audio researchers consider the problems posed by unfiltered generic audio with varied background noises; but these problems must be addressed to build audio classifiers that can handle user-generated video. In addition, past work on video analysis has often fed audio "blindly" into a machine learner, without much consideration of how audio information is structured.
In the last few years, the focus has been changing in both event detection and audio analysis. For example, sound event detection was included in the 2013 Detection and Classification of Acoustic Scenes and Events (DCASE) challenge at IEEE AASP [25]. More recently, the annotated video corpus YLI-MED became available [4]; YLI-MED is targeted toward multimedia event detection, so it provides a good platform for studying audio-based event detection (see Section 5.1).
Major areas of audio-based event detection research include audio data representation and learning methodologies. In this work, we focus on the first aspect, audio data representation, which aims to extract specific features that can refine an enormous amount of raw audio data into higher-level information about the audio signals. Section 2
gives an overview of current representation approaches and discusses their limitations in detail. In brief, current approaches do not effectively capture signal variance within audio tracks or local structure (for example, between Gaussian components); they risk losing information about geometric manifold structure and hidden structure within the data; they often require a lot of storage space; and they rarely leverage available label information.
In this paper, we address these issues by introducing a Discriminative and Compact Audio Representation (DCAR) to model audio information. This method is implemented in two phases. First, each audio track is modeled using a Gaussian mixture model (GMM) with several mixture components to describe its statistical distribution. This is beneficial for capturing the variability within each audio track and for reducing the storage space required, relative to the full original number of frames.
Second, by integrating the labels for the audio tracks and the local structure among the Gaussian components, we identify an embedding to reduce the dimensionality of the mixture components and render them more discriminative. In this phase, the dimensionality reduction task is formulated as an optimization problem on a Grassmannian manifold and solved via the conjugate gradient method. Then a new audio track can be represented with the aid of the learned embedding, which further compacts the audio information. For classification, we adopt the kernel ridge regression (KRR) method, which is compatible with the manifold structure of the data.
As we argue in detail in Section 3, DCAR represents a considerable advance over the state of the art in audio-based event detection. In a nutshell, the novelty of DCAR lies in its being a compact representation of an audio signal that captures variability and has better discriminative ability than other representations.
Our claim is supported by a series of experiments, described in Section 6, conducted on the YLI-MED dataset. We first built binary classifiers for each pair of events in the dataset, and found that the proposed DCAR performed better than an i-vector strategy on pairwise discrimination. We then delved deeper, comparing multi-event detection results for DCAR with three existing methods (including simple GMMs and mean/variance vectors as well as i-vectors) for events that are difficult to distinguish vs. events that are easy to distinguish. We showed that DCAR can handle both easy and hard cases; Section 6.2 discusses how these results may follow from how each type of model leverages (or doesn't leverage) the intrinsic structure of the data. Finally, we conducted multi-event detection experiments on all ten events, again showing that DCAR is the most discriminative representation. In particular, DCAR shows notable accuracy gains on events where humans find it more difficult to classify the videos, i.e., events with lower average annotator confidence scores.
The remainder of this paper is organized as follows: Section 2 surveys related audio work; Section 3 presents the proposed DCAR model in detail; and Section 4 describes the audio-based event detection process with KRR. Section 5 describes our methods and the real-world video dataset YLI-MED; and Section 6 discusses a series of binary and multi-event detection experiments. The results demonstrate that DCAR significantly improves event detection performance. Conclusions and future work are discussed in Section 7.
2 Related Work
Audio representations include low-level features (e.g., energy, cepstral, and harmonic features) and intermediate-level features obtained via further processing steps such as filtering, linear combination, unsupervised learning, and matrix factorization (see the overview in Barchiesi et al. 2015
[3]). A typical audio representation method for event detection is to model each audio file as a vector so that traditional classification methods can be easily applied. The most popular low-level features are Mel-frequency cepstral coefficients (MFCCs) [8]
, which describe the local spectral envelope of audio signals. However, MFCCs are a short-term, frame-level representation, so they do not capture the whole structure hidden in each audio signal. As one means to address this, some researchers have used end-to-end classification methods (e.g., neural networks), for example to simultaneously learn intermediate-level audio concepts and train an event classifier
[22]. Several approaches have used first-order statistics derived from the frames' MFCC features, which empirically improves performance on audio-based event detection. For example, Jin et al. adopted a codebook model to define audio concepts [12]. This method uses first-order statistics to represent audio: it quantizes low-level features into discrete codewords, generated via clustering, and provides a histogram of codeword counts for each audio file (i.e., it uses the mean of the data in each cluster). However, such methods do not capture the complexity of real-life audio recordings. For event detection, researchers have therefore modeled audio using the second-order statistical covariance matrix of the low-level MFCC features [19, 10, 7, 23]. There are two ways to compute the second-order statistics. The first assumes that each audio file can be characterized by the mean and variance of the MFCC features representing each audio frame, then modeled via a vector by concatenating the mean and variance [23]; this representation can be referred to as a mean/variance vector or mv-vector. The other method is to model all training audio via a Gaussian mixture model and then compute the Baum-Welch statistics of each audio file according to the mixture components, as in GMM-supervector representations [19]. Again, each audio file is represented by stacking the means and covariance matrices. However, such a vectorization process will inevitably distort the geometric structure of the data [24]. (By geometric structure, we mean intrinsic structure within the data, such as affine structure and projective structure.)
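To make the mv-vector construction concrete, here is a minimal NumPy sketch; the function name and toy data are ours, not from [23]:

```python
import numpy as np

def mv_vector(mfcc_frames):
    """Concatenate the per-file mean and variance of frame-level MFCC features.

    mfcc_frames: (n_frames, d) array of MFCC features for one audio file.
    Returns a (2*d,) "mv-vector" representation.
    """
    mean = mfcc_frames.mean(axis=0)
    var = mfcc_frames.var(axis=0)
    return np.concatenate([mean, var])

# toy example: 500 frames of 60-dimensional MFCC features
frames = np.random.default_rng(0).standard_normal((500, 60))
vec = mv_vector(frames)
```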
An exciting area of recent work is the i-vector approach, which uses latent factor analysis to compensate for foreground and background variability [6]. The i-vector approach can be seen as an extension of the GMM-supervector. It assumes that these high-dimensional supervectors can be confined to a low-dimensional subspace; this can be implemented by applying probabilistic principal component analysis (PCA) to the supervectors. The advantage of an i-vector is that the system learns the total variance from the training data and then uses it on new data, so that the representation of the new data has discriminative power similar to that of the training data. I-vectors have shown promising performance in audio-based event detection [7, 10].

In fact, many of these representation methods have shown promising performance, but they have some limitations with regard to audio-based event detection. First, the signal variance within a given audio track may be large; training a Gaussian mixture model on all of the audio tracks together (as in the GMM-supervector or i-vector approaches) does not capture that variability, and thus may not characterize a given event well. Second, each mixture component consists of both a mean vector and a covariance matrix, which can introduce many more variables, resulting in high computational complexity and substantial storage requirements. Third, the covariance matrices of the mixture components in these methods are usually flattened into one supervector, which may distort the geometric manifold structure within the data and lose information about hidden structure. Fourth, most audio representations are derived in an unsupervised manner, i.e., they do not make use of any existing label information. But in fact, label information has been very useful for representing data in classification tasks such as image classification [14] and text classification [16]. Last but not least, these methods do not explicitly consider the local structure between Gaussian components, which may be useful for distinguishing events.
These drawbacks motivate us to propose a new audio representation method to capture the variability within each audio file and to characterize the distinct structures of events with the aid of valuable existing labels and local structure within the data; these characteristics of our method have significant benefits for event detection.
3 Discriminative and Compact Audio Representation
In this section, we describe our proposed twophase audio representation method. The first phase, described in Subsection 3.1, aims to capture the variability within each audio file. The second phase, described in Subsection 3.2, identifies a discriminative embedding.
3.1 Phase 1: Characterizing Per-Track Variability
Given a set of audio tracks, we first extract their low-level features, in this case MFCC features. Let $\{A_1, \dots, A_N\}$ denote a set of $N$ audio files. Each file $A_i$ is segmented into $n_i$ frames. Each frame is computed using a 100 ms Hamming window with a stride of 10 ms per frame shift, and its representation is built from the first 20 MFCC features together with their first-order and second-order derivatives. Each frame is thus modeled as a vector $x$ of $d = 60$ MFCC features, i.e., $x \in \mathbb{R}^{60}$. Previous work has demonstrated that second-order statistics are much more appropriate for describing complicated multimedia data [6, 27]. Therefore, we train a GMM with $K$ components using the Expectation-Maximization algorithm for each audio track:

$$p(x) = \sum_{k=1}^{K} w_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k) \qquad (1)$$

The estimated GMM components are denoted as:

$$G_i = \{(w_k, \mu_k, \Sigma_k)\}_{k=1}^{K} \qquad (2)$$

where $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the Gaussian density with mean $\mu_k$ and covariance $\Sigma_k$, $w_k \ge 0$, and $\sum_{k=1}^{K} w_k = 1$. When each audio file is modeled via $K$ components, the $N$ training files yield $NK$ components in total. Each component $k$ has its corresponding weight $w_k$, mean $\mu_k \in \mathbb{R}^{d}$, and covariance matrix $\Sigma_k \in \mathbb{R}^{d \times d}$.
Generally, covariance matrices are positive semidefinite; they can be made strictly positive definite by adding a small constant to the diagonal elements of the matrix. For convenience, we use the notation $\mathcal{S}_{++}^{d}$ to indicate the set of $d \times d$ symmetric positive definite (SPD) matrices. After GMM modeling, each audio file, typically containing hundreds to thousands of frames, is reduced to a small number of mixture components with prior probabilities. The covariance matrices provide a compact and informative feature descriptor, which lies on a specific manifold and directly captures the (second-order) variability of the audio.
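The per-track modeling in this phase can be sketched with scikit-learn's GaussianMixture. This is our own illustrative version; the function name, component count, and SPD-regularization constant are assumptions, not the authors' exact settings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_track_gmm(frames, n_components=5, eps=1e-6):
    """Fit a per-track GMM to one file's (n_frames, 60) MFCC matrix.

    Returns (weights, means, covariances); a small constant is added to the
    diagonal of each covariance so it is strictly positive definite (SPD).
    """
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          reg_covar=eps, random_state=0).fit(frames)
    covs = gmm.covariances_ + eps * np.eye(frames.shape[1])
    return gmm.weights_, gmm.means_, covs

# toy track: 2000 frames of 60-dimensional features
frames = np.random.default_rng(0).standard_normal((2000, 60))
w, mu, sigma = fit_track_gmm(frames, n_components=3)
```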
3.2 Phase 2: Identifying a Discriminative Embedding
In the second phase of the DCAR method, a discriminative embedding is identified by integrating the global and local structure of the training data, so that both training data and unseen new data can be rerepresented in a discriminative and compact manner.
3.2.1 Overview of Phase 2
Although an audio file can be represented via the above mixture components, the model presented thus far ignores the global structure of the data (e.g., the valuable label information) and the local structure among the components (e.g., nearest neighbors). Meanwhile, the original feature representation is large (since there are $d = 60$ MFCC features, each mean vector has 60 elements and each covariance matrix contains $60 \times 60$ elements), which may make later data processing time-consuming. Therefore, in this subsection, we propose a new method for generating a discriminative and compact representation from the high-dimensional mixture components. The DCAR method is summarized in Figure 1.
Our main goal is to learn an embedding $W \in \mathbb{R}^{d \times d'}$ (where $d$ is the number of MFCC features and $d'$ is the embedding space size) based on the GMM components $\{(\mu_i, \Sigma_i, l_i)\}$ of the labeled audio tracks, where $l_i$ is the label for component $i$, inherited from the audio file from which $(\mu_i, \Sigma_i)$ was generated. The resulting low-dimensional GMM components should preserve the important structure of the original GMM components as much as possible. To accomplish this, we introduce the embedding $W$ and define the new GMM components with mean
$$\tilde{\mu}_i = W^{\top} \mu_i \qquad (3)$$
and covariance matrix
$$\tilde{\Sigma}_i = W^{\top} \Sigma_i W \qquad (4)$$
As mentioned above, each covariance matrix is SPD, i.e., $\Sigma_i \in \mathcal{S}_{++}^{d}$. To maintain this property for the projected matrices, i.e., $\tilde{\Sigma}_i \in \mathcal{S}_{++}^{d'}$, the embedding $W$ is constrained to be full rank. A simple way of enforcing this requirement is to impose orthonormality constraints on $W$ (i.e., $W^{\top} W = I_{d'}$), so that the embedding can be identified by solving an optimization problem on the Grassmannian manifold.
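The two projection maps and the effect of the orthonormality constraint can be checked numerically. The sketch below (our own, with an arbitrary random $W$) verifies that a full-rank orthonormal embedding keeps the projected covariance SPD:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_new = 60, 20

# a random orthonormal embedding W (W^T W = I), obtained here via QR
W, _ = np.linalg.qr(rng.standard_normal((d, d_new)))

mu = rng.standard_normal(d)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + 1e-3 * np.eye(d)      # an SPD covariance matrix

mu_low = W.T @ mu                        # projected mean, cf. Eq. (3)
Sigma_low = W.T @ Sigma @ W              # projected covariance, cf. Eq. (4)
```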
For event detection, each training file has label information, which we also assign to its GMM components. This valuable information can be interpreted as global structure for those components. There is also intrinsic internal structure among the components, such as the affinity between each pair of components. When reducing the dimensionality of the GMM components, it is necessary to maintain both types of structure. Motivated by the ideas of linear discriminant analysis [18] and the Maximum Margin Criterion [17]
, DCAR aims to minimize the intra-class distance while simultaneously maximizing the inter-class distance. In the next subsection, we introduce an undirected graph, defined by a real symmetric affinity matrix $A$, that we use to encode these structures.

3.2.2 Affinity Matrix Construction
The affinity matrix $A$ is defined by building an intra-class (within-class) similarity graph and an inter-class (between-class) similarity graph, as follows:
$$A = G^{w} - G^{b} \qquad (5)$$
$G^{w}$ and $G^{b}$ are two binary matrices describing the intra-class and inter-class similarity graphs respectively, formulated as:
$$G^{w}_{ij} = \begin{cases} 1, & \text{if } j \in N_w(i) \text{ or } i \in N_w(j) \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$
and
$$G^{b}_{ij} = \begin{cases} 1, & \text{if } j \in N_b(i) \text{ or } i \in N_b(j) \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$
where $N_w(i)$ contains the $k_w$ nearest neighbors of component $i$ that share the same label as $i$, and $N_b(i)$ is the set of $k_b$ nearest neighbors of $i$ that have different labels. Here, the nearest neighbors of each component can be identified via their similarity. We use heat kernel weights with a self-tuning technique (for the bandwidth parameters $\sigma_i$ and $\sigma_j$) to measure the similarity between components:
$$S_{ij} = \kappa \exp\!\left(-\frac{d_m^2(\mu_i, \mu_j)}{\sigma_i \sigma_j}\right) + (1 - \kappa) \exp\!\left(-\frac{d_c^2(\Sigma_i, \Sigma_j)}{\sigma_i \sigma_j}\right) \qquad (8)$$
where $\kappa$ is a tradeoff parameter that controls the contributions of the components' means and covariance matrices, and $d_m$ indicates the distance measure for the means of the mixture components. Here we use the simple Euclidean distance
$$d_m(\mu_i, \mu_j) = \|\mu_i - \mu_j\|_2 \qquad (9)$$
$d_c$ indicates the distance measure for the covariance matrices of the components.
A number of metrics have been used in previous research, including the Affine-Invariant Riemannian Metric (AIRM) [21], Stein Divergence [5], and the Log-Euclidean Metric (LEM) [2]. AIRM imposes a high computational burden in practice, and we have observed experimentally that nearest neighbors selected according to LEM more often fall into the same event than nearest neighbors selected according to either AIRM or Stein (see Appendix A for details). For these two reasons, we exploit LEM to compute $d_c$:
$$d_c(\Sigma_i, \Sigma_j) = \|\log(\Sigma_i) - \log(\Sigma_j)\|_F \qquad (10)$$

where $\log(\cdot)$ denotes the matrix logarithm.
The constructed affinity matrix $A$ thus effectively combines local structure (i.e., nearest neighbors) and global structure (i.e., label information), which is used to find the within-class nearest neighbors ($N_w$) and the between-class nearest neighbors ($N_b$).
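One plausible reading of this construction, assuming the affinity combines the two graphs as $A = G^w - G^b$ and using only the covariance part of the similarity for brevity, can be sketched as follows; all names are ours:

```python
import numpy as np

def spd_log(S):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def lem_distance(S1, S2):
    """Log-Euclidean metric: || log(S1) - log(S2) ||_F."""
    return np.linalg.norm(spd_log(S1) - spd_log(S2), "fro")

def affinity_matrix(covs, labels, k_w=1, k_b=1):
    """Build A = G_w - G_b from same-label / different-label kNN graphs."""
    n = len(covs)
    D = np.array([[lem_distance(covs[i], covs[j]) for j in range(n)]
                  for i in range(n)])
    Gw = np.zeros((n, n))
    Gb = np.zeros((n, n))
    for i in range(n):
        same = [j for j in range(n) if j != i and labels[j] == labels[i]]
        diff = [j for j in range(n) if labels[j] != labels[i]]
        for j in sorted(same, key=lambda j: D[i, j])[:k_w]:
            Gw[i, j] = Gw[j, i] = 1      # within-class neighbor edge
        for j in sorted(diff, key=lambda j: D[i, j])[:k_b]:
            Gb[i, j] = Gb[j, i] = 1      # between-class neighbor edge
    return Gw - Gb

# toy example: four components from two events; scaled-identity covariances
covs = [np.eye(3) * s for s in (1.0, 1.1, 3.0, 3.3)]
labels = [0, 0, 1, 1]
A = affinity_matrix(covs, labels)
```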
3.2.3 Embedding Optimization
Once we have $A$, the next step is to learn an embedding $W$ such that the structure among the original GMM components is reflected by the low-dimensional mixture components $(\tilde{\mu}_i, \tilde{\Sigma}_i)$. This process can be modeled using the following optimization problem:
$$\min_{W} \sum_{i,j} A_{ij}\, D\!\left((\tilde{\mu}_i, \tilde{\Sigma}_i), (\tilde{\mu}_j, \tilde{\Sigma}_j)\right) \qquad (11)$$
With the aid of the mapping functions in (3) and (4) and the distance metrics (9) and (10), the optimization problem can be rewritten as:
$$\min_{W^{\top}W = I_{d'}} f(W) = \sum_{i,j} A_{ij} \left[\kappa\, \|W^{\top}\mu_i - W^{\top}\mu_j\|_2^2 + (1-\kappa)\, \|\log(W^{\top}\Sigma_i W) - \log(W^{\top}\Sigma_j W)\|_F^2\right] \qquad (12)$$
As in (8), $\kappa$ is used to balance the effects of the two terms, and is tuned by cross-validation on the training data. Optimizing (12) results in a situation where the low-dimensional components are close if their corresponding original high-dimensional components are event-aware neighbors; otherwise, they will be as far apart as possible.
In image processing, there are several lines of research where a mapping has been learned from a high-dimensional manifold to a low-dimensional manifold [9, 11]. However, Harandi et al. [9] exploit AIRM and Stein Divergence to measure the distance, and as we noted in Subsection 3.2.2, these metrics are not appropriate for handling audio data. Huang et al. [11] identified an embedding on the logarithms of SPD matrices, but our goal is to identify an embedding on GMM components, including both means and covariance matrices.
The problem in (12) is a typical optimization problem with orthogonality constraints; it can therefore be formulated as an unconstrained optimization problem on Grassmannian manifolds [1]. Given that the objective function has the property that $f(WR) = f(W)$ for any rotation matrix $R$ (i.e., $R^{\top}R = RR^{\top} = I$) (see Appendix B for a detailed proof), this optimization problem is most compatible with a Grassmannian manifold. In other words, we can model the embedding $W$ as a point on a Grassmannian manifold $\mathcal{G}(d', d)$, which consists of the set of all $d'$-dimensional linear subspaces of $\mathbb{R}^{d}$.
Here we employ the conjugate gradient (CG) technique to solve (12), because CG is easy to implement, has low storage requirements, and provides superlinear convergence in the limit [1]. On a Grassmannian manifold, the CG method performs minimization along geodesics with specific search directions; here, a geodesic is the shortest path between two points on the manifold. For every point on the manifold, its tangent space is a vector space that contains the tangent vectors of all possible curves passing through that point. Unlike in flat spaces, on a manifold we cannot directly transport a tangent vector from one point to another by simple translation; instead, tangent vectors must be parallel transported along geodesics.
More specifically, on the Grassmannian manifold, let $G_k$ and $\nabla f_k$ be the tangent-space gradient and the Euclidean gradient of $f$ at the point $W_k$, respectively. The gradient on the manifold at the $k$th iteration can be obtained by subtracting the normal component at $W_k$:
$$G_k = \nabla f_k - W_k W_k^{\top} \nabla f_k \qquad (13)$$
Then the search direction in the $k$th iteration can be computed by parallel transporting the previous search direction and combining it with the gradient direction at the current solution:
$$H_k = -G_k + \gamma_k\, \tau H_{k-1} \qquad (14)$$
Here $\tau H_{k-1}$ is the parallel translation of the vector $H_{k-1}$. According to Absil et al. [1], the geodesic going from the point $W$ in the direction $H$ can be represented by the geodesic equation
$$W(t) = W V \cos(\Lambda t) V^{\top} + U \sin(\Lambda t) V^{\top} \qquad (15)$$
Thus, the parallel translation can be obtained by
$$\tau H(t) = \left(-W V \sin(\Lambda t) + U \cos(\Lambda t)\right) \Lambda V^{\top} \qquad (16)$$
where $U \Lambda V^{\top}$ is the compact singular value decomposition of $H$. We use the exact conjugacy condition to adaptively determine $\gamma_k$, as follows:
$$\gamma_k = \frac{\langle G_k - \tau G_{k-1},\; G_k\rangle}{\langle G_{k-1},\, G_{k-1}\rangle} \qquad (17)$$
where $\langle X, Y\rangle = \operatorname{tr}(X^{\top} Y)$. Similar to $\tau H$, $\tau G_{k-1}$ is the parallel translation of the vector $G_{k-1}$ on the Grassmannian manifold, which can be calculated as:
$$\tau G(t) = G - \left(W V \sin(\Lambda t) + U \left(I - \cos(\Lambda t)\right)\right) U^{\top} G \qquad (18)$$
Going back to the objective function in (12), by setting and , (12) can be rewritten as
(19) 
Then its tangent vector on the manifold can be computed in three steps (see Appendix C for details):
(20) 
(21) 
(22) 
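The core manifold operations, projecting a Euclidean gradient onto the tangent space and stepping along a geodesic via the compact SVD of the search direction, follow standard formulas for the Grassmannian [1]. Below is our own minimal NumPy sketch of those two operations, not the authors' implementation:

```python
import numpy as np

def tangent_project(Y, egrad):
    """Project a Euclidean gradient onto the tangent space of the
    Grassmannian at Y (columns of Y orthonormal): G = egrad - Y Y^T egrad."""
    return egrad - Y @ (Y.T @ egrad)

def geodesic_step(Y, H, t):
    """Move from Y along the geodesic with tangent direction H for step t,
    using the compact SVD H = U diag(s) Vt (cf. the geodesic equation)."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return (Y @ Vt.T * np.cos(s * t)) @ Vt + (U * np.sin(s * t)) @ Vt

rng = np.random.default_rng(0)
Y, _ = np.linalg.qr(rng.standard_normal((60, 20)))   # a point on G(20, 60)
G = tangent_project(Y, rng.standard_normal((60, 20)))
Y_new = geodesic_step(Y, G, t=0.1)
```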
Then, given a new audio file, we can extract its MFCC features, train its GMM components, and re-represent these components with the learned embedding $W$ to get its discriminative, low-dimensional mixture components, i.e., the proposed DCAR representation.
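Putting the two phases together for a new track, the re-representation step might look like the following sketch (our own; the component count and regularization constant are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def dcar_represent(frames, W, n_components=3):
    """Re-represent one (new) audio track: fit a per-track GMM on its MFCC
    frames, then project each component with the learned embedding W."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          reg_covar=1e-6, random_state=0).fit(frames)
    return [(w, W.T @ mu, W.T @ S @ W)
            for w, mu, S in zip(gmm.weights_, gmm.means_, gmm.covariances_)]

rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((60, 20)))   # stand-in for a learned embedding
comps = dcar_represent(rng.standard_normal((400, 60)), W)
```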
4 Event Detection with DCARs
As described above, each audio file is represented via several mixture components, including mean vectors and covariance matrices. It would be possible to flatten the matrices into vectors and then use traditional vector-based classification methods for event detection. However, the covariance matrices lie on the manifold of positive definite matrices, and such a vectorization process would ignore this manifold structure [24]. Therefore, we use the Kernel Ridge Regression (KRR) method to build the event classifiers.
Let $\{C_i = (\mu_i, \Sigma_i)\}_{i=1}^{n}$ be the mixture components for the training audio tracks belonging to $c$ events, and let $Y \in \{0, 1\}^{n \times c}$ indicate the label information for the components, where $Y_{ij} = 1$ if component $i$ belongs to the $j$th event and $Y_{ij} = 0$ otherwise. The KRR method aims to train a classifier by solving the following optimization problem:
$$\min_{\mathbf{w}} \|Y - \Phi^{\top}\mathbf{w}\|_F^2 + \lambda \|\mathbf{w}\|_F^2 \qquad (23)$$
where $\phi(\cdot)$ is a feature mapping from the original feature space to a high-dimensional space, $\Phi = [\phi(C_1), \dots, \phi(C_n)]$, and the kernel function can be written as $k(C_i, C_j) = \phi(C_i)^{\top}\phi(C_j)$.
Since each component $C_i$ has a mean $\mu_i$ and a covariance matrix $\Sigma_i$, we can define a combined kernel function to integrate these two parts, as follows:
$$k(C_i, C_j) = \kappa\, k_m(\mu_i, \mu_j) + (1 - \kappa)\, k_c(\Sigma_i, \Sigma_j) \qquad (24)$$
The tradeoff parameter $\kappa$ (see (8)) can be tuned by cross-validation on the training data. As described in Section 3.2.2, we use Gaussian kernels to calculate $k_m$ and $k_c$ via
$$k_m(\mu_i, \mu_j) = \exp\!\left(-\frac{\|\mu_i - \mu_j\|_2^2}{2\sigma^2}\right) \qquad (25)$$
and
$$k_c(\Sigma_i, \Sigma_j) = \exp\!\left(-\frac{\|\log(\Sigma_i) - \log(\Sigma_j)\|_F^2}{2\sigma^2}\right) \qquad (26)$$
The problem in (23), as a quadratic convex problem, can be optimized by setting its derivative with respect to $\mathbf{w}$ to zero and then computing $\mathbf{w}$ in closed form:
$$\mathbf{w} = \Phi \left(K + \lambda I\right)^{-1} Y \qquad (27)$$
Here $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(C_i, C_j)$ as given in (24).
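A compact sketch of the combined kernel and the closed-form KRR solution; this is our own toy version, and the kernel bandwidth, kappa, and lambda values are arbitrary:

```python
import numpy as np

def spd_log(S):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def combined_kernel(mus_a, covs_a, mus_b, covs_b, kappa=0.5, gamma=0.1):
    """Weighted sum of Gaussian kernels on mean distances and on
    log-Euclidean covariance distances, cf. Eqs. (24)-(26)."""
    logs_a = [spd_log(S) for S in covs_a]
    logs_b = [spd_log(S) for S in covs_b]
    K = np.zeros((len(mus_a), len(mus_b)))
    for i in range(len(mus_a)):
        for j in range(len(mus_b)):
            dm = np.sum((mus_a[i] - mus_b[j]) ** 2)
            dc = np.sum((logs_a[i] - logs_b[j]) ** 2)
            K[i, j] = kappa * np.exp(-gamma * dm) + (1 - kappa) * np.exp(-gamma * dc)
    return K

def krr_train(K, Y, lam=1e-2):
    """Closed-form KRR coefficients: alpha = (K + lam I)^{-1} Y."""
    return np.linalg.solve(K + lam * np.eye(len(K)), Y)

# toy data: 6 training components from 2 well-separated events
rng = np.random.default_rng(0)
mus = [rng.standard_normal(4) + (0 if i < 3 else 5) for i in range(6)]
covs = [np.eye(4) * (1 + 0.1 * i) for i in range(6)]
Y = np.eye(2)[[0, 0, 0, 1, 1, 1]]          # one-hot labels, shape (6, 2)

K = combined_kernel(mus, covs, mus, covs)
alpha = krr_train(K, Y)
scores = combined_kernel(mus[:1], covs[:1], mus, covs) @ alpha
```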
Given a new test audio track, its $m$ mixture components can be obtained via the method described in Section 3.1. The corresponding discriminative, low-dimensional mixture components can then be generated as in (3) for the means and as in (4) for the covariance matrices, where the embedding $W$ is learned from the training data. Then the class membership matrix $\hat{Y} \in \mathbb{R}^{m \times c}$ (where row $k$ gives the event membership of the $k$th component) can be calculated:
$$\hat{Y} = K_{\mathrm{new}} \left(K + \lambda I\right)^{-1} Y \qquad (28)$$
Here $K_{\mathrm{new}} \in \mathbb{R}^{m \times n}$, with $(K_{\mathrm{new}})_{ki}$ indicating the similarity between the new track's $k$th component and the $i$th training mixture component. We can then make a final prediction about the event of the new audio track with $m$ components using an average voting scheme:
$$\text{event} = \arg\max_{1 \le j \le c} \sum_{k=1}^{m} w_k\, \hat{Y}_{kj} \qquad (29)$$
where $w_k$ is the weight of the $k$th component.
5 Data and Experimental Methods
We evaluated the event detection performance of our proposed representation against several baseline representations, using the recently released public dataset YLI-MED.
5.1 Dataset
YLI-MED [4] is an open video corpus for multimedia event detection research (modeled on the TRECVID MED corpus [20], but publicly available); its videos are drawn from the YFCC100M dataset [26]. YLI-MED includes about 2000 videos that contain examples of ten events, with standard training and test videos for each event. Since this work focuses on analyzing the acoustic environment of videos, we conducted a series of experiments using the audio tracks. Table 1 describes the data we used, including the number of training and test audio files and the range of track lengths for the training and testing sets for each event. The wide variation in length among the tracks makes the event detection task more challenging.
| ID | Event Name | Training: # of Videos | Training: Length (ms) | Testing: # of Videos | Testing: Length (ms) |
|---|---|---|---|---|---|
| Ev101 | Birthday Party | 99 | 6850–248950 | 131 | 8380–328960 |
| Ev102 | Flash Mob | 91 | 8290–325630 | 49 | 11710–152560 |
| Ev103 | Getting a Vehicle Unstuck | 89 | 5590–591670 | 39 | 11170–157690 |
| Ev104 | Parade | 95 | 7840–303850 | 127 | 5770–216460 |
| Ev105 | Person Attempting a Board Trick | 99 | 5950–391150 | 88 | 5500–254980 |
| Ev106 | Person Grooming an Animal | 97 | 5950–574300 | 38 | 7210–292870 |
| Ev107 | Person Hand-Feeding an Animal | 95 | 6850–174880 | 113 | 7840–244450 |
| Ev108 | Person Landing a Fish | 99 | 7930–363610 | 41 | 7480–250120 |
| Ev109 | Wedding Ceremony | 90 | 9640–631630 | 108 | 9820–646300 |
| Ev110 | Working on a Woodworking Project | 98 | 5590–373690 | 44 | 6760–281080 |
5.2 Methodology
To evaluate our proposed DCAR method, we compared it with three state-of-the-art audio representations used for event detection: mv-vector [23], i-vector [7], and GMM. By GMM, we mean here the base GMMs obtained by extracting the GMM components from each audio file (as described in Section 3.1), but without the discriminative dimensionality reduction step used in DCAR (described in Section 3.2).
As we mentioned in Section 2, the i-vector approach models all of the training audio frames to obtain a GMM supervector for each file, then factorizes the supervectors to obtain the i-vector representation. In contrast, an mv-vector models each audio file via the mean and variance of its MFCC features, then concatenates the mean and variance to obtain a vector representation. (As we want to evaluate the most comparable aspects of the representations, we do not consider the temporal information from the RQA features for the mv-vector method.)
There are several parameters for each of the representations, which we tuned using cross-validation on the training data to obtain the best result. For GMM and DCAR, the number of components for each audio track is tuned from 1 to 10, with a step of 1. For i-vector, the number of components in all of the training data and the vector dimensionality are each tuned over a set of candidate values. For DCAR, the number of nearest neighbors ($k_w$ and $k_b$) is set to 5 for affinity matrix construction, the embedding space size $d'$ is tuned from 5 to 60 with a step of 5, and the tradeoff parameter $\kappa$ is tuned over a set of candidate values. (In addition, we tried normalizing each track, with each MFCC feature having a mean of 0 and a variance of 1. All GMM components then have 0 means, so the KRR classifier depends solely on the covariance matrix. However, for event detection, information about means appears to be important: when we tuned the tradeoff parameter $\kappa$ in Eq. (24), we found that we usually obtained the best results with both means and covariance matrices.)
Because our focus is on comparing different audio representations, we describe here experiments that all used the same classification method, KRR. (To check the validity of this approach, we also tested several other classification techniques with the mv-vector and i-vector representations, including SVM (support vector machines), KNN (k-nearest neighbor), and PLDA (probabilistic linear discriminant analysis). The performance rankings between representations were parallel across all classification techniques.)
For the $j$th event in the testing tracks, we compared the prediction result to the ground truth to determine the number of true positives ($TP$), false positives ($FP$), true negatives ($TN$), and false negatives ($FN$). We then evaluated event detection performance using four metrics, $F\text{-}Score$, $Accuracy$, $MissRate$, and False Alarm Rate ($FAR$), defined (respectively) as:

$$F\text{-}Score = \frac{2\,TP}{2\,TP + FP + FN} \qquad (30)$$

$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \qquad (31)$$

$$MissRate = \frac{FN}{TP + FN} \qquad (32)$$

$$FAR = \frac{FP}{FP + TN} \qquad (33)$$
$Accuracy$ is calculated over all testing tracks (i.e., we use combined or overall accuracy), and the other three metrics are calculated for each event and then averaged over events. Larger $F\text{-}Score$ and $Accuracy$ values indicate better performance; smaller $MissRate$ and $FAR$ values indicate better performance.
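The four metrics reduce to simple ratios of the confusion counts; a small self-check (our own, with made-up counts):

```python
def detection_metrics(tp, fp, tn, fn):
    """F-score and accuracy (higher is better); miss rate and
    false-alarm rate (lower is better), from per-event confusion counts."""
    f_score = 2 * tp / (2 * tp + fp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    miss_rate = fn / (tp + fn)
    far = fp / (fp + tn)
    return f_score, accuracy, miss_rate, far

f, a, m, r = detection_metrics(tp=8, fp=2, tn=85, fn=5)
```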
6 Experimental Results
We evaluated the four representations under study in a combination of binary detection and multi-event detection tasks, described below.
6.1 Binary-Event Detection
In the first experiment, we built 45 binary classifiers (one for each pair of events). We had two goals in conducting this experiment. The first was to compare two representation strategies: modeling GMMs on all training tracks, in this case as the first phase of the i-vector approach, vs. modeling GMMs on each training track, as in DCAR. The second was to investigate how the events are distinguished.
As the graphs in Figure 2 show, DCAR outperforms i-vector on most tasks. On average, DCAR achieves an accuracy improvement of roughly 0.08 over i-vector (0.8293 vs. 0.7489) across the binary detection tasks. The win-tie-loss record for pairwise tests at the 0.05 significance level for DCAR against i-vector is 35-7-3; the win-tie-loss record at the 0.01 significance level is 40-2-3.
From these results, we can see that there are some event pairs that are particularly difficult to distinguish, such as Ev106–Ev107 and Ev107–Ev108. Considering the nature of the events involved, it could be argued that distinguishing between a Person Grooming an Animal (Ev106), a Person Hand-Feeding an Animal (Ev107), and a Person Landing a Fish (Ev108) could be nontrivial even for humans. Nonetheless, compared with i-vector, our proposed DCAR increases binary classification accuracy even on these difficult pairs. This result demonstrates that modeling each audio file via a Gaussian mixture model is more suitable for characterizing audio content, and that integrating label information and local structure is useful in generating discriminative representations.
6.2 Easy vs. Hard Cases
To further explore how our proposed DCAR method performs on audio-based event detection at different difficulty levels, we extracted two subsets from YLI-MED. Subset EC5 ("Easy-Case") contains five events (Ev101, Ev104, Ev105, Ev108, and Ev109) that are generally easy to distinguish from (most of) the others. Subset HC4 ("Hard-Case") contains four events (Ev103, Ev106, Ev107, and Ev110) that are more difficult to distinguish. (The division was made based on results from the experiments described in Subsections 6.1 and 6.3, as well as a priori understanding of the events' similarity from a human point of view. Because multiple criteria were used, Ev102 did not fall clearly into either category.)
6.2.1 Dimensionality Tuning
Before comparing DCAR with other representations, we conducted a set of multi-event detection experiments to study how the dimensionality parameter $d'$ affects DCAR at these two difficulty levels. Here, we used five-fold cross-validation on the training data to tune $d'$, which was varied from 5 to 60 with a step of 5. Combined $Accuracy$ and average $F\text{-}Score$ on EC5 and HC4 for each step are given in Figure 3.
The results show that DCAR performs better as $d'$ increases, reaches its best value at an intermediate setting for both cases, and then decreases in performance as $d'$ grows larger. We believe this is because a smaller $d'$ cannot optimally characterize the hidden structure of the data, while a larger $d'$ may spread the structure across so many dimensions that the representation becomes essentially equivalent to the original data, decreasing its efficacy.
6.2.2 Easy and Hard Results
Moving on to comparing DCAR with other state-of-the-art representations at these two difficulty levels, Table 2 shows the multi-event detection performance of DCAR and three baseline representations (base GMM, mv-vector, and i-vector) in terms of $F\text{-}Score$, $Accuracy$, $MissRate$, and $FAR$ on EC5 and HC4.
Subset EC5 (Ev101, Ev104, Ev105, Ev108, Ev109):

| Evaluation Metric | mv-vector | i-vector | GMM | DCAR |
|---|---|---|---|---|
| F-Score (↑) | 0.4773 | 0.6415 | 0.6670 | **0.7067** |
| Accuracy (↑) | 0.5455 | 0.6828 | 0.7131 | **0.7434** |
| Miss Rate (↓) | 0.5168 | 0.3367 | 0.3252 | **0.2779** |
| FAR (↓) | 0.1136 | 0.0785 | 0.0730 | **0.0647** |

Subset HC4 (Ev103, Ev106, Ev107, Ev110):

| Evaluation Metric | mv-vector | i-vector | GMM | DCAR |
|---|---|---|---|---|
| F-Score (↑) | 0.4278 | 0.2795 | 0.4821 | **0.5282** |
| Accuracy (↑) | 0.4573 | 0.2863 | 0.5000 | **0.5684** |
| Miss Rate (↓) | 0.5393 | 0.6975 | 0.4840 | **0.4577** |
| FAR (↓) | 0.1788 | 0.2409 | 0.1684 | **0.1496** |
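For concreteness, here is one way the four evaluation metrics can be computed from per-track predictions. The exact definitions (macro-averaged per-event F-score, miss rate, and FAR, plus overall accuracy) are an assumption based on the standard usage of these terms:

```python
# Assumed metric definitions: per-event F-score, miss rate (missed
# detections), and false alarm rate are macro-averaged over events;
# accuracy is the overall fraction of correctly classified tracks.
import numpy as np

def detection_metrics(y_true, y_pred, events):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f, miss, far = [], [], []
    for ev in events:
        tp = np.sum((y_pred == ev) & (y_true == ev))
        fp = np.sum((y_pred == ev) & (y_true != ev))
        fn = np.sum((y_pred != ev) & (y_true == ev))
        tn = np.sum((y_pred != ev) & (y_true != ev))
        f.append(2 * tp / max(2 * tp + fp + fn, 1))   # per-event F-score
        miss.append(fn / max(tp + fn, 1))             # missed detections
        far.append(fp / max(fp + tn, 1))              # false alarms
    return {"FScore": float(np.mean(f)),
            "Accuracy": float(np.mean(y_true == y_pred)),
            "MissRate": float(np.mean(miss)),
            "FAR": float(np.mean(far))}
```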
For both subsets, DCAR consistently achieves the best results (marked in bold) on each evaluation metric, in comparison with the three baselines. (For accuracy, p = 0.01 or better for pairwise comparisons of DCAR vs. mv-vector and i-vector, and p = 0.05 or better for DCAR vs. the base GMM, for both subsets. Significance is assessed using McNemar's two-tailed test for correlated proportions.) However, interestingly, i-vector performs better than mv-vector on the EC5 data, but worse than mv-vector on the HC4 data (p = 0.0001 or better for comparisons on both subsets).

6.2.3 DCAR, Variance, and Structure
We can make a number of observations about these results. First, it seems that modeling GMM components for each audio track (as in the GMM, DCAR, and mv-vector representations; for these purposes, we can treat the mean and variance within the mv-vector as a GMM with one component) is more effective than modeling a GMM on all the training audio tracks together (as in i-vector) when the events are semantically related to each other (as in HC4). We believe this is because, in real-world applications (e.g., with user-generated content), each audio track may have large variance. The strategies that model each track via a GMM capture the hidden structure within each audio track, while the i-vector strategy may smooth away that structure (even between events), leading to a less useful representation.
Second, GMM and DCAR perform better than mv-vector on both subsets. We believe this indicates that one mixture component (as in mv-vector) may not sufficiently capture the full structure of the audio; in addition, vectorizing the mean and variance inevitably distorts the intrinsic geometric structure of the data. Third, DCAR outperforms the base GMM. As we described in Section 3, DCAR begins by extracting such a GMM model, but it also takes into account the label information and the intrinsic nearest-neighbor structure among the audio files when modeling the training data, and outputs a mapping function to effectively represent the test data.
In sum, these experimental results further confirm that discriminative dimensionality reduction is beneficial for characterizing the distinguishing information for each audio file, leading to a better representation.
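The two modeling strategies contrasted in this subsection, one GMM per audio track versus a single GMM pooled over all training tracks, can be sketched with off-the-shelf tools. The component counts below are illustrative assumptions, not the paper's settings:

```python
# Per-track GMMs (as in GMM/DCAR/mv-vector) vs. one GMM pooled over all
# frames of all tracks (the i-vector-style universal background model).
import numpy as np
from sklearn.mixture import GaussianMixture

def per_track_gmms(tracks, n_components=2, seed=0):
    """One GMM per track: keeps each track's internal structure."""
    return [GaussianMixture(n_components, random_state=seed).fit(t)
            for t in tracks]

def pooled_gmm(tracks, n_components=8, seed=0):
    """A single GMM over all frames: may smooth away per-track structure."""
    return GaussianMixture(n_components, random_state=seed).fit(
        np.vstack(tracks))
```

Each `tracks[i]` is assumed to be an (n_frames, n_features) array of frame-level features for one audio track.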
6.3 Ten-Event Detection
Because event detection is a kind of supervised learning, learning becomes more difficult as the number of events increases. In the next experiment, we again compared the proposed DCAR model with the three baseline representations, this time on a ten-event detection task. As before, the parameters for each method were tuned by cross-validation on the training data. Table 3 gives the detection performance on each event, and the average over the ten events, in terms of F-score (↑) and miss rate (↓).

F-Score (↑):

| Event | mv-vector | i-vector | GMM | DCAR |
|---|---|---|---|---|
| Ev101 | 0.7259 | 0.7842 | 0.7303 | 0.7835 |
| Ev102 | 0.2837 | 0.3396 | 0.3651 | 0.4603 |
| Ev103 | 0.2178 | 0.2569 | 0.2410 | 0.3820 |
| Ev104 | 0.4274 | 0.6206 | 0.6000 | 0.6207 |
| Ev105 | 0.3354 | 0.3899 | 0.5714 | 0.5178 |
| Ev106 | 0.1964 | 0.1835 | 0.2963 | 0.3750 |
| Ev107 | 0.3850 | 0.3298 | 0.3250 | 0.4024 |
| Ev108 | 0.3191 | 0.3853 | 0.3878 | 0.4231 |
| Ev109 | 0.4211 | 0.5028 | 0.4286 | 0.5176 |
| Ev110 | 0.0833 | 0.2299 | 0.2857 | 0.2162 |
| Average | 0.3395 | 0.4023 | 0.4231 | 0.4699 |

Miss Rate (↓):

| Event | mv-vector | i-vector | GMM | DCAR |
|---|---|---|---|---|
| Ev101 | 0.2824 | 0.1679 | 0.1527 | 0.1298 |
| Ev102 | 0.5918 | 0.6327 | 0.5306 | 0.4082 |
| Ev103 | 0.7179 | 0.6410 | 0.7436 | 0.5641 |
| Ev104 | 0.6063 | 0.4331 | 0.3622 | 0.3621 |
| Ev105 | 0.6932 | 0.6477 | 0.3864 | 0.4205 |
| Ev106 | 0.7105 | 0.7368 | 0.6842 | 0.6053 |
| Ev107 | 0.6814 | 0.7257 | 0.7699 | 0.7080 |
| Ev108 | 0.6341 | 0.4878 | 0.5366 | 0.4634 |
| Ev109 | 0.6667 | 0.5833 | 0.6667 | 0.5926 |
| Ev110 | 0.9091 | 0.7727 | 0.7500 | 0.8182 |
| Average | 0.6494 | 0.5829 | 0.5583 | 0.5072 |
For all of the individual events and on average, DCAR achieves superior or competitive performance. DCAR also performs better in terms of accuracy and FAR: the overall accuracy scores for mv-vector, i-vector, GMM, and DCAR are 0.3907, 0.4640, 0.4923, and 0.5321, respectively (p = 0.01 for DCAR vs. each baseline, McNemar's two-tailed test), and the average FAR scores are 0.0674, 0.0593, 0.0570, and 0.0523.
Although other representations may perform as well or better on particular events, DCAR consistently outperforms the other representations on the average and overall scores for all evaluation metrics (an average of more than 8% relative gain on all metrics over the second-best representation). These results further demonstrate that modeling each audio file via a GMM and then integrating both label information and local structure is beneficial for constructing a discriminative audio representation for event detection.
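The significance tests reported in these experiments use McNemar's two-tailed test for correlated proportions. A minimal exact-binomial version, operating on per-track 0/1 correctness vectors for two systems, might look like this (a sketch, not the authors' exact implementation):

```python
# Exact two-tailed McNemar test: only the discordant pairs (tracks one
# system classifies correctly and the other does not) carry information;
# under the null they split 50/50.
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-tailed exact McNemar p-value from per-track 0/1 correctness."""
    n01 = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    n10 = sum(1 for a, b in zip(correct_a, correct_b) if not a and b)
    n = n01 + n10
    if n == 0:
        return 1.0
    k = min(n01, n10)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # two-tailed, clipped at 1
```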
In addition to comparing results across the four methods for ten-event detection, we also experimented with applying feature-reduction methods at the frame level before training the GMM, as an alternative to DCAR's approach to dimensionality reduction: PCA [15] and linear discriminant analysis (LDA) [18]. The number of principal components in PCA was tuned from 5 to 60 in steps of 5. For LDA, the dimensionality is fixed at C − 1, where C is the number of events. Average and overall results are given in Table 4. The results with PCA are slightly better than GMM without PCA, but the accuracy difference is not statistically significant (p = 0.7117, McNemar's two-tailed test). Results with LDA are much worse than GMM without LDA. We hypothesize that the main reason for the poor performance of LDA+GMM is that LDA only considers C − 1 components, which is usually too few to capture sufficient information for the subsequent GMM training.

| Evaluation Metric | GMM | PCA+GMM | LDA+GMM |
|---|---|---|---|
| F-Score (↑) | 0.4231 | 0.4293 | 0.3278 |
| Accuracy (↑) | 0.4923 | 0.4987 | 0.3419 |
| Miss Rate (↓) | 0.5583 | 0.5508 | 0.6574 |
| FAR (↓) | 0.0570 | 0.0562 | 0.0935 |
6.4 Intra-Event Variation
Delving deeper into how the effectiveness of a representation may depend on variable characteristics of audio tracks, we looked at the degree to which some test tracks in YLI-MED could be classified more accurately than others.
6.4.1 Variable Performance Within Events
We took the predicted result for each test audio track in the experiments with four representations described in Subsection 6.3 and calculated how many of the representations made correct predictions for that track.
Figure 4 shows the distribution of the number of representations making accurate predictions for each audio track, broken down by event. Generally, there is wide variation in accuracy among audio files belonging to the same event, with the exception of Ev101 (Birthday Party); this suggests that Ev101 may have distinctive audio characteristics that lead to more consistent classification. It is worth noting that there are some audio files that are never correctly classified by any of the representations (i.e., where acc = 0). For example, more than 50% of the audio tracks for Ev110 (Working on a Woodworking Project) could not be correctly classified by any of the four representations. This situation highlights a challenging property of the event detection task: some of the events are quite confusable due to their inherent characteristics. For example, Ev103 (Getting a Vehicle Unstuck) may have similar audio properties to Ev110 (Working on a Woodworking Project) in that both are likely to involve sounds generated by motors. This is also the reason we included Ev103 and Ev110 in the "Hard Case" HC4 dataset for the experiments described in Section 6.2.
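The per-track agreement counts underlying this analysis can be tabulated as follows (a sketch; `preds_by_rep` is an assumed name for one prediction vector per representation):

```python
# For each test track, count how many of the representations predicted
# its event correctly (0..4), then histogram those counts.
import numpy as np

def agreement_histogram(y_true, preds_by_rep):
    """preds_by_rep: list of prediction arrays, one per representation."""
    y_true = np.asarray(y_true)
    correct = sum((np.asarray(p) == y_true).astype(int) for p in preds_by_rep)
    # histogram over 0..len(preds_by_rep) correct representations per track
    return np.bincount(correct, minlength=len(preds_by_rep) + 1)
```

Splitting the tracks by ground-truth event before calling this gives the per-event breakdown shown in Figure 4.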
6.4.2 Relationship to Annotator Confidence
When the YLIMED videos were collected, three annotators were asked to give a confidence score for each video, chosen from {1,2,3}, with 1 being “Not sure” and 3 being “Absolutely sure” that the video contains an example of the event in question. The average of the three scores can be used as an indicator of how easily classifiable a given video is with respect to the event category, from the perspective of human beings.
Figure 5 shows the combined scores of the different representations for audio tracks from videos in three confidence ranges (79 of the test audio tracks in the range [1,2), 169 in the range [2,3), and 530 with an average confidence of exactly 3). DCAR achieves the best performance in every range. However, it is interesting that i-vector shows a notable improvement with each increment of annotator confidence (+24.1% between [1,2) and [2,3) and +22.9% between [2,3) and [3]), while DCAR shows a dramatic improvement between low- and intermediate-confidence videos but performs similarly on intermediate- and high-confidence videos (+37.3% for the first step, but −2.3% for the second); GMM follows the same pattern (+45.6%, then −1.8%). The mv-vector approach shows yet another pattern, performing similarly poorly on all but the high-confidence videos (only +2.8% improvement for the first step, but +29.8% for the second).
These differing patterns may indicate that the i-vector approach is more sensitive to the particular audio cues associated with the characteristics of an event that humans find most important in categorizing it, while GMM and DCAR require a lower overall threshold of cue distinctiveness. On the other hand, the results in Sections 6.2 and 6.3 suggest that modeling audio files with only one mixture component, as in mv-vector, generally cannot sufficiently capture the full structure of the audio signal; however, it may be able to capture that structure somewhat more often when the signal is more distinctive (i.e., mv-vector has a particularly high distinctiveness threshold).
In other words, in cases where humans do not consider the events occurring in videos to be clear or prototypical examples of the target category, and thus those videos are less likely to have plentiful audio cues distinctive to that event category, a more discriminative representation may be required to improve event detection performance. (The experiments here did not include any negative test tracks; the task was simply to identify which of the target events each video's contents are most like. If negative examples were included, it might be less clear what constitutes the "best" performance in categorizing videos that are on the borders of their categories.) Fortunately, it appears that the proposed DCAR method can partially address this problem, as shown by its relatively high performance on lower-confidence videos.
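The confidence bucketing used in this analysis, averaging the three annotator scores and grouping tracks into [1,2), [2,3), and exactly 3, can be sketched as follows (function and argument names are illustrative):

```python
# Group tracks by mean annotator confidence and compute per-bucket
# accuracy from per-track 0/1 correctness.
import numpy as np

def accuracy_by_confidence(scores, correct):
    """scores: (n, 3) annotator scores in {1,2,3}; correct: per-track 0/1."""
    avg = np.asarray(scores, dtype=float).mean(axis=1)
    correct = np.asarray(correct, dtype=float)
    buckets = {"[1,2)": (avg >= 1) & (avg < 2),
               "[2,3)": (avg >= 2) & (avg < 3),
               "[3]":   avg == 3}
    return {name: float(correct[mask].mean()) if mask.any() else None
            for name, mask in buckets.items()}
```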
7 Conclusions and Future Work
In this article, we have presented a new audio representation, DCAR, and demonstrated its use in event detection. One distinguishing characteristic of the DCAR method is that it can capture the variability within each audio file. Another is that it achieves better discriminative ability by integrating label information and the graph of the components' nearest neighbors among the audio files; i.e., it can successfully characterize both global and local structure among audio files. Representing audio using the proposed DCAR notably improves performance on event detection, as compared with state-of-the-art representations (an average of more than 8% relative gain for ten-event detection and more than 10% gain for binary classification across all metrics).
The proposed representation benefits from leveraging global and local structure within audio data; however, videos are of course multimodal. Other data sources such as visual content, captions, and other metadata can provide valuable information for event detection; we therefore plan to extend the current model by incorporating such information. Within audio, we also hope to evaluate the use of DCAR for other related tasks, such as audio scene classification (for example, testing it with the DCASE acoustic scenes dataset [25]). Related work in audio (e.g., Barchiesi et al. 2015 [3]) has demonstrated that the temporal evolution of different events plays an important role in audio analysis; another possible direction for expanding DCAR would be to take complex temporal information into consideration when modeling video events. Last but not least, we might explore extracting the information-rich segments from each audio track rather than modeling the whole track.
Acknowledgments
This work was partially supported by the NSFC (61370129, 61375062), the PCSIRT (Grant IRT201206), and a collaborative Laboratory Directed Research & Development grant led by Lawrence Livermore National Laboratory (U.S. Dept. of Energy contract DE-AC52-07NA27344). Any findings and conclusions are those of the authors, and do not necessarily reflect the views of the funders.
References
 [1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
 [2] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache. Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications, 29(1):328–347, 2007.
 [3] D. Barchiesi, D. Giannoulis, D. Stowell, and M. Plumbley. Acoustic scene classification: Classifying environments from the sounds they produce. IEEE Signal Processing Magazine, 32(3):16–34, 2015.
 [4] J. Bernd, D. Borth, B. Elizalde, G. Friedland, H. Gallagher, L. Gottlieb, A. Janin, S. Karabashlieva, J. Takahashi, and J. Won. The YLI-MED corpus: Characteristics, procedures, and plans (TR-15-001). Technical report, ICSI, 2015. arXiv:1503.04250.
 [5] A. Cherian, S. Sra, A. Banerjee, and N. Papanikolopoulos. Efficient similarity search for covariance matrices via the JensenBregman LogDet divergence. In Proceedings of the International Conference on Computer Vision (ICCV), 2011.
 [6] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798, 2011.
 [7] B. Elizalde, H. Lei, and G. Friedland. An i-vector representation of acoustic environments for audio-based video event detection on user-generated content. In Proceedings of the IEEE International Symposium on Multimedia, 2013.
 [8] A. Eronen, J. Tuomi, A. Klapuri, and S. Fagerlund. Audio-based context awareness—acoustic modeling and perceptual evaluation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 529–532, 2003.
 [9] M. Harandi, M. Salzmann, and R. Hartley. From manifold to manifold: geometryaware dimensionality reduction for SPD matrices. In Proceedings of the European Conference on Computer Vision (ECCV), 2014.
 [10] Z. Huang, Y. Cheng, K. Li, V. Hautamaki, and C. Lee. A blind segmentation approach to acoustic event detection based on i-vector. In Proceedings of INTERSPEECH, 2013.
 [11] Z. Huang, R. Wang, S. Shan, X. Li, and X. Chen. Log-Euclidean metric learning on symmetric positive definite manifold with application to image set classification. In Proceedings of ICML, 2015.
 [12] Q. Jin, P. Schulam, S. Rawat, S. Burger, and D. Ding. Event-based video retrieval using audio. In Proceedings of INTERSPEECH, pages 2085–2088, 2012.
 [13] L. Jing, B. Liu, J. Choi, A. Janin, J. Bernd, M. W. Mahoney, and G. Friedland. A discriminative and compact audio representation for event detection. In Proceedings of the ACM Multimedia Conference, 2016. To appear.
 [14] L. Jing, C. Zhang, and M. Ng. SNMFCA: Supervised NMF-based image classification and annotation. IEEE Transactions on Image Processing, 21(11):4508–4521, 2012.
 [15] I. Jolliffe. Principal Component Analysis. John Wiley & Sons, Ltd, 2005.
 [16] M. Lan, C. Tan, J. Su, and Y. Lu. Supervised and traditional term weighting methods for automatic text categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4):721–735, 2009.

 [17] H. Li, T. Jiang, and K. Zhang. Efficient and robust feature extraction by maximum margin criterion. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2004.
 [18] G. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley-Interscience, 2004.
 [19] R. Mertens, H. Lei, L. Gottlieb, G. Friedland, and A. Divakaran. Acoustic super models for large scale video event detection. In Proceedings of ACM Multimedia, 2011.
 [20] P. Over, G. Awad, J. Fiscus, B. Antonishek, M. Michel, A. Smeaton, W. Kraaij, and G. Quénot. TRECVID 2011: An overview of the goals, tasks, data, evaluation mechanisms, and metrics. Technical report, National Institute of Standards and Technology, Gaithersburg, MD, May 2012.

 [21] X. Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensor computing. International Journal of Computer Vision, 66(1):41–66, 2006.
 [22] M. Ravanelli, B. Elizalde, J. Bernd, and G. Friedland. Insights into audio-based multimedia event classification with neural networks. In Proceedings of the Multimedia COMMONS Workshop at ACM Multimedia (MMCommons), 2015.
 [23] G. Roma, W. Nogueira, and P. Herrera. Recurrence quantification analysis features for auditory scene classification. In Proceedings of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events (DCASE), 2013.
 [24] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
 [25] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. Plumbley. Detection and classification of acoustic scenes and events. IEEE Transactions on Multimedia, 17(10):1733–1746, 2015.
 [26] B. Thomee, D. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.

 [27] W. Wang, R. Wang, Z. Huang, S. Shan, and X. Chen. Discriminant analysis on Riemannian manifold of Gaussian distributions for face recognition with image sets. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 2048–2057, 2015.
Appendices
Appendix A Comparison of LEM, AIRM, and Stein Divergence
To empirically determine which metric (LEM, AIRM, or Stein divergence) is most appropriate for measuring the distance between GMM covariance matrices, we calculated what percentage of each component's nearest neighbors belong to the same event as that component, as in (34):

$$\rho = \frac{1}{nK} \sum_{i=1}^{n} \sum_{x_j \in \mathcal{N}_K(x_i)} \delta(l_i = l_j) \qquad (34)$$

Here $n$ is the number of training mixture components, $\mathcal{N}_K(x_i)$ gives the $K$-nearest-neighbor set for component $x_i$ obtained with the given metric, and $\delta(l_i = l_j)$ equals 1 when $x_i$ and $x_j$ belong to the same event and 0 otherwise. A high value for $\rho$ indicates that the metric is suitable for characterizing the structure of the components, in the sense that the result is similar to the external structure given by the label information.

Figure 6 plots the values for $\rho$ obtained by the three metrics when varying the target number of neighbors $K$. The plot shows that, overall, LEM achieves higher values for $\rho$ than AIRM or Stein divergence; i.e., the nearest neighbors selected according to LEM more often fall into the same event than those selected according to either AIRM or Stein divergence.
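The label-agreement score defined above can be computed directly from a pairwise distance matrix between mixture components, whatever metric (LEM, AIRM, or Stein divergence) produced it. A sketch, with function and argument names chosen for illustration:

```python
# For each component, take its K nearest neighbors under the given
# distance matrix and measure how often they share the component's
# event label; average over all components and neighbors.
import numpy as np

def neighbor_label_agreement(dist, labels, K):
    """Fraction of K-nearest neighbors sharing each component's label."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    agree = 0
    for i in range(n):
        order = np.argsort(dist[i])
        neighbors = [j for j in order if j != i][:K]  # exclude self
        agree += sum(labels[j] == labels[i] for j in neighbors)
    return agree / (n * K)
```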
Appendix B Proof: The Rotational Invariance of $\mathcal{J}(W)$

Given the objective function $\mathcal{J}(W)$ in (12), repeated here as (36), and any rotation matrix $R$ (i.e., $R^\top R = R R^\top = I$), we can show that $\mathcal{J}(WR) = \mathcal{J}(W)$.

$$\mathcal{J}(W) = \sum_{i,j} A_{ij}\, d_{LE}^2\!\left(W^\top \Sigma_i W,\; W^\top \Sigma_j W\right) \qquad (36)$$

where the weights $A_{ij}$ encode the label and nearest-neighbor information from (12), and $d_{LE}(X, Y) = \lVert \log X - \log Y \rVert_F$ is the log-Euclidean distance.

Proof 1

According to the definition of $\mathcal{J}(W)$ in (36), we can write $\mathcal{J}(WR)$ as

$$\mathcal{J}(WR) = \sum_{i,j} A_{ij}\, d_{LE}^2\!\left(R^\top W^\top \Sigma_i W R,\; R^\top W^\top \Sigma_j W R\right).$$

We set

$$S_i = W^\top \Sigma_i W. \qquad (37)$$

Since $S_i$ is symmetric positive-definite, and the log-Euclidean metric has the properties of Lie group bi-invariance and similarity invariance, i.e.,

$$\log(R^\top S_i R) = R^\top \log(S_i)\, R,$$

then

$$d_{LE}^2(R^\top S_i R,\; R^\top S_j R) = \lVert R^\top (\log S_i - \log S_j) R \rVert_F^2 = \lVert \log S_i - \log S_j \rVert_F^2 = d_{LE}^2(S_i, S_j). \qquad (38)$$

Thus

$$\mathcal{J}(WR) = \sum_{i,j} A_{ij}\, d_{LE}^2(S_i, S_j) = \mathcal{J}(W), \qquad (39)$$

as claimed.
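The key identity in the proof, that the matrix logarithm commutes with conjugation by an orthogonal matrix, can be checked numerically. A sketch using the eigendecomposition form of the SPD matrix logarithm:

```python
# Numerical check: for SPD S and orthogonal R,
# log(R^T S R) = R^T log(S) R, so log-Euclidean distances are
# unchanged by rotation.
import numpy as np

def spd_log(S):
    """Matrix logarithm of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.log(w)) @ V.T

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
S = A @ A.T + 4 * np.eye(4)                   # random SPD matrix
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix

lhs = spd_log(R.T @ S @ R)
rhs = R.T @ spd_log(S) @ R
assert np.allclose(lhs, rhs, atol=1e-8)
```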