Multimedia content is now produced and shared ubiquitously in daily life, which has encouraged the development of algorithms for multimedia search and analysis in various applications. As one of the most popular directions, multimedia event detection has received increasing attention in recent years. Unlike atomic object or action recognition, which focuses on retrieving simple primitives, event detection aims to identify more complex scenarios, for example semantically meaningful human activities that take place within a specific environment and involve a number of necessary objects, which makes it more suitable for multimedia search.
Because of this complexity, recognizing an event alone is insufficient for a user to understand a search result thoroughly. The question “why is this video classified as this event?” must also be answered; that is, the system should provide the exact temporal locations of several key-frames or key-shots in the video that contain the observed evidence leading to the decision. This task is referred to as event recounting, and several efforts have been devoted to it. For example, Yu et al. [1] adopted a semantic concept based event representation for learning a discriminative event model, and generated recounting by weighting the contribution of each individual concept to the final event classification decision score. Sun et al. [3] proposed to identify event-oriented discriminative video segments and their descriptions with a linear SVM classifier and noise-filtered concept detectors; user-friendly concepts including objects, scenes, and speech were then extracted as recounting results to generate descriptions. As event detection and recounting are two highly related tasks that can benefit each other, some recent work has addressed the two problems simultaneously. Sun and Nevatia [4] introduced an evidence localization model in which evidence locations were modeled as latent variables, and optimized the model via a max-margin framework under constraints on global video appearance, local evidence appearance, and the temporal structure of the evidence. Chang et al. [5] proposed a joint framework to optimize both the event detection model and the recounting model using an improved Alternating Direction Method of Multipliers (ADMM) algorithm. Gan et al. [6] proposed a flexible deep CNN architecture named DevNet that detected pre-defined events and provided key spatio-temporal evidence at the same time. Figure 1 illustrates event recounting results in an ideal multimedia search system.
Although these attempts have obtained promising results by indicating plausible observations, event recounting remains a less addressed problem because evidence is difficult to annotate, which leads to two limitations of the existing techniques. First, with only event labels for the training videos, the evidential and non-evidential parts are confused with each other and are distinguished purely on the basis of category information, which overlooks key evidence shared among different events or even with background samples. Second, without ground truth for evidence locations, there can be no substantial, quantitative comparison among different methods. The performance of a system can only be evaluated by a subjective, qualitative judgement of whether the recounted evidence or semantic concepts are reasonable. In this paper, we focus on these issues and make efforts in the following two aspects: (1) We propose a weakly supervised evidence discovery method based on the self-paced learning framework [7], which follows a learning process from easy “evidences” to gradually more complex ones, and simultaneously exploits more and more positive evidence samples from numerous weakly annotated video segments. (2) To evaluate our method quantitatively, we also propose two metrics, Percentage of Overlap (PctOverlap) and F1-score, for measuring the performance of evidence localization against a small set of ground truth annotated by humans (the collection and generation of the ground truth are detailed in Section 4 below). The experiments are conducted on a subset of the TRECVID MED dataset and demonstrate the promising results obtained by our method.
The rest of this paper is organized as follows. Section 2 presents the semantic concept based video representation, which provides high-level semantic information that should benefit evidence interpretation. Section 3 introduces the essential technique of this paper, i.e. the self-paced learning framework, with its detailed formulation and optimization process. In Section 4, we provide comprehensive evaluations of the whole framework and comparisons with several closely related methods. Finally, we conclude the work and discuss possible future directions in Section 5.
2 Semantic Concept Feature
Low-level feature based video representations, for example SIFT [8], STIP [9], and dense trajectory based features [10, 11], have been widely used in action recognition and event detection. However, such low-level features carry little semantic meaning and are therefore not suitable for interpretation purposes, such as the recounting task, which requires higher-level information for event or evidence description. Recently, semantic representations based on these kinds of attributes or concepts have become increasingly popular in the field of event detection and recounting [1, 4, 5, 12, 13, 14]. In the same spirit, we also learn to generate a video representation based on various semantic concepts, such as objects, scenes, and activities.
Specifically, we pre-define our concept collection (the sources of these concepts are detailed in Section 4; Table 1 provides some examples of the concepts grouped by type). For each concept, we collect training samples, i.e. video segments, from auxiliary datasets, and employ improved dense trajectory features [11] for representation. Based on these low-level features, binary linear SVMs are trained as concept detectors, and finally we obtain concept detectors in total. The next step is to process the testing event videos. To identify the key evidence temporally, we first segment each video sample into a number of shots using well-established shot boundary detection techniques [15]. For each shot, we extract the same dense trajectory features and apply all the concept detectors to the shot to obtain confidence scores as its representation (note that the scores should be normalized to the range from 0 to 1). Formally, we denote the concept representation of the -th shot from the -th video as . Suppose there are shots in the -th video, the collection of all the shots can be represented as , where , and is the total number of videos.
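The shot-level representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each concept detector is taken to be a linear model given as a hypothetical `(weights, bias)` pair, and the logistic function is assumed as the normalization to the range 0 to 1, since the text does not specify which normalization is used.

```python
import math

def concept_scores(shot_feature, detectors):
    """Apply every linear concept detector (weights, bias) to one shot feature.

    The logistic squashing to (0, 1) is an assumed normalization, not
    necessarily the one used in the paper.
    """
    scores = []
    for weights, bias in detectors:
        raw = sum(w * x for w, x in zip(weights, shot_feature)) + bias
        scores.append(1.0 / (1.0 + math.exp(-raw)))
    return scores

def represent_video(shot_features, detectors):
    """One concept-score vector per shot; the video is the list of these vectors."""
    return [concept_scores(x, detectors) for x in shot_features]

# Toy example: two hypothetical detectors applied to a video with three shots.
detectors = [([1.0, 0.0], 0.0), ([0.0, -1.0], 0.5)]
video = [[2.0, 1.0], [-1.0, 0.5], [0.0, 0.0]]
rep = represent_video(video, detectors)
```

Each row of `rep` has one entry per concept, so its dimensionality equals the number of concept detectors.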
3 Self-paced Learning
Self-paced learning [7] is a recently proposed theory inspired by the learning process of humans and animals. The idea is to learn the model gradually, from easy samples to complex ones, in an iterative, self-paced fashion. This theory has been widely applied to various problems, including image classification [16], visual tracking [17], segmentation [18, 19], and multimedia event detection [20, 21].
In the context of the evidence recounting problem, the easy samples are video shots with high confidence scores obtained by a binary event-oriented detector. Starting from these initial training samples, our algorithm learns a gradually more “mature” model by iteratively mining and appending more and more complex evidence samples according to their losses, and it also adaptively determines their weights for the next iteration. In this section we introduce the detailed problem formulation and optimization.
3.1 Problem Formulation
Given video candidates annotated only with event labels, and continuing with the notation of Section 2, the -th sample can be represented as , where denotes the representation of the -th shot from the -th video, and denotes its label, indicating whether it can be regarded as an “evidence” or not. This formulation partially agrees with the definition of Multiple Instance Learning (MIL) [22, 23], in which the label is known for each “bag” but not for the instances assigned to that “bag”. As in MIL, if , which indicates that the video is categorized as a certain event, then at least one instance is a positive sample (i.e. ), meaning that at least one piece of evidence leads to the decision. Unlike MIL, if , in most cases there is no evidence in the video, but this cannot be guaranteed, since some complex and confusing evidence is shared among different events or even background videos; in the traditional MIL framework, by contrast, leads to for all .
Although we cannot employ the solution to the MIL problem directly, we can exploit the same idea of heuristic optimization proposed in [23], i.e. assuming that all instances have initial pseudo labels and alternately seeking the optimal hyperplane and labels. In our task, we take all shots extracted from the background videos as negative samples, and all shots from the videos labeled as a certain event as positive samples. A linear SVM is trained as the initial classifier, and the current samples and model parameters then serve as the initialization for self-paced learning in the next step.
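The initialization above amounts to letting every shot inherit the label of its video. A minimal sketch follows, assuming (for illustration only) that videos are lists of shot feature vectors, that a positive video label is any value greater than zero, and that pseudo labels are encoded as +1 and -1; the function name `init_pseudo_labels` is hypothetical.

```python
def init_pseudo_labels(videos, video_labels):
    """Flatten videos into shots and give every shot its video's label.

    videos: list of videos, each a list of shot feature vectors.
    video_labels: one label per video (> 0 for an event video, else background).
    Returns the flat shot list, the +1/-1 pseudo labels, and the index of the
    video each shot came from.
    """
    shots, pseudo_labels, owner = [], [], []
    for video_index, (video, label) in enumerate(zip(videos, video_labels)):
        for shot in video:
            shots.append(shot)
            pseudo_labels.append(1 if label > 0 else -1)
            owner.append(video_index)
    return shots, pseudo_labels, owner

# Toy example: one event video with two shots, one background video with one shot.
videos = [[[0.9, 0.1], [0.2, 0.3]], [[0.1, 0.1]]]
shots, pseudo_labels, owner = init_pseudo_labels(videos, [1, 0])
```

Keeping the `owner` index makes it possible to tell later which shots came from event videos and which from background videos.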
For all the video shots , where is a kind of pseudo label that needs to be optimized during the self-paced learning process. Let denote the loss function which calculates the cost between the (pseudo) label and the predicted label , where and represent the model parameters in the decision function . In SPL, the goal is to jointly learn the model parameters , the pseudo label and the latent weight variable according to the objective function :
where denotes the total number of instances from videos, denotes the pseudo labels for all instances, and denotes their weighting parameters, which reflect the importance of each sample in training the model. is the standard hinge loss of under classifier (in this work, we simply employ the linear SVM version), calculated from:
More importantly, is the regularization term, called the self-paced function, which specifies how the sample weights are generated. Here is a parameter that determines the learning rate. can be defined in various forms in terms of the learning rate . A conventional one proposed in [7] is based on the ℓ1-norm of as:
This regularizer is very general and has been applied to various learning tasks with different loss functions [18, 13, 21]. At this point, we can observe that the objective function is subject to two sets of constraints: one is the max-margin constraints inherited from the traditional SVM; the other is the self-paced term, which controls the pseudo labels and sample weights. This objective is difficult to optimize directly because it is non-convex. In the next subsection, we introduce the effective Cyclic Coordinate Method (CCM) [24] to solve this problem, as in [7, 25, 21, 19].
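For reference, the self-paced learning formulation described in this subsection is commonly written in the SPL literature [7, 21] in the form below. The notation is an assumption made for illustration only: $(\mathbf{w}, b)$ for the classifier parameters, $y_i$ for the pseudo label and $v_i \in [0, 1]$ for the weight of the $i$-th of $n$ instances $\mathbf{x}_i$, and $\lambda$ for the learning-pace parameter. The three lines give the joint objective, the hinge loss of a linear SVM, and the $\ell_1$-norm self-paced regularizer.

```latex
\begin{align*}
\min_{\mathbf{w},\, b,\, \mathbf{y},\, \mathbf{v}} \quad
  & \frac{1}{2}\|\mathbf{w}\|_2^2
    + \sum_{i=1}^{n} v_i\, L\big(y_i, g(\mathbf{x}_i; \mathbf{w}, b)\big)
    + f(\mathbf{v}; \lambda), \\
L\big(y_i, g(\mathbf{x}_i; \mathbf{w}, b)\big)
  &= \max\big(0,\; 1 - y_i(\mathbf{w}^{\top}\mathbf{x}_i + b)\big), \\
f(\mathbf{v}; \lambda)
  &= -\lambda \|\mathbf{v}\|_1 = -\lambda \sum_{i=1}^{n} v_i .
\end{align*}
```

With this regularizer and a fixed classifier and labels, the objective is linear in each $v_i$ with coefficient $L_i - \lambda$, so the minimizer over $v_i \in [0, 1]$ is $v_i = 1$ when $L_i < \lambda$ and $v_i = 0$ otherwise.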
The Cyclic Coordinate Method (CCM) is an iterative method for non-convex optimization in which the model variables are divided into independent blocks (two blocks in our case): (1) classifier parameters ; (2) pseudo labels and sample weights . We switch between the two blocks iteratively, optimizing one block of variables while keeping the other fixed. Starting from the MIL-inspired initialization, the alternating optimization in each iteration proceeds as follows:
Optimizing while fixing and . In this step, we hold the pseudo labels and weight variables constant, so the objective (1) reduces to the sum of weighted losses across all instances:
Generally, is the discounted hinge loss of the shot instance . To simplify the solution, in conventional SPL all the are forced to take binary values, i.e. 0 or 1. The objective (4) then degenerates to a conventional linear SVM that considers only the selected samples, whose weight equals 1. However, this binary setting of has limited ability to balance the positive and negative costs, since in our task there are only a few positive evidence (event) examples but a large number of negative (background) samples. To address this problem, we employ an idea similar to Exemplar-SVM [26], which introduces two regularization parameters (i.e. and ) to balance the effects of these two types of costs. The difference is that in our formulation there is a small set of positive samples rather than a single “exemplar”. Accordingly, we can rewrite (4) in an ESVM-like form as follows:
By solving (5), we obtain as the classification hyperplane, which is then held fixed in the next optimization step.
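To make the role of the two cost parameters concrete, the sketch below evaluates a class-balanced, sample-weighted hinge objective for a linear classifier. It is an illustration under assumptions rather than the exact form of (5): the names `c_pos` and `c_neg` stand in for the two regularization parameters, labels are assumed to be +1 or -1, and the weights are the self-paced weights held fixed in this step.

```python
def weighted_objective(w, b, X, y, v, c_pos, c_neg):
    """0.5 * ||w||^2 plus class-weighted, sample-weighted hinge losses.

    w, b: linear classifier parameters.
    X, y, v: shot features, +1/-1 pseudo labels, and self-paced weights.
    c_pos, c_neg: assumed names for the costs of positive and negative samples.
    """
    regularizer = 0.5 * sum(wd * wd for wd in w)
    loss = 0.0
    for xi, yi, vi in zip(X, y, v):
        margin = yi * (sum(wd * xd for wd, xd in zip(w, xi)) + b)
        cost = c_pos if yi > 0 else c_neg
        loss += cost * vi * max(0.0, 1.0 - margin)
    return regularizer + loss

# Toy example: the third sample has weight 0 and therefore does not contribute.
value = weighted_objective(
    w=[1.0, 0.0], b=0.0,
    X=[[0.5, 0.0], [-2.0, 0.0], [3.0, 0.0]],
    y=[1, -1, 1], v=[1.0, 1.0, 0.0],
    c_pos=2.0, c_neg=0.5,
)
```

Choosing a larger cost for the few positive samples than for the many negative ones keeps the positives from being swamped during training.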
Optimizing and while fixing . With the classifier parameters fixed, we can omit the term and the objective (1) becomes:
Based on (6), learning is independent of . Moreover, all the pseudo labels are independent of each other in the loss function. As each label can take only two integer values, +1 and -1, the globally optimal solution can be obtained by enumerating times.
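Because the labels decouple across instances, the label step can be carried out one instance at a time by trying both candidate values and keeping the one with the smaller loss. A minimal sketch, assuming hinge loss and labels in {+1, -1}; the function name `best_pseudo_label` and the tie-breaking toward +1 are illustrative choices, not taken from the paper.

```python
def best_pseudo_label(score):
    """Return the label in (+1, -1) with the smaller hinge loss for a fixed score.

    score: decision value of the fixed classifier on one shot.
    Ties are broken in favour of +1 because it is listed first (an arbitrary choice).
    """
    candidates = (1, -1)
    return min(candidates, key=lambda label: max(0.0, 1.0 - label * score))

labels = [best_pseudo_label(s) for s in (2.0, -0.3, 0.0)]
```

For hinge loss this reduces to taking the sign of the decision value, which is why the step is cheap even for a large number of shots.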
After obtaining the optimal , the remaining task is to optimize . Following the solution in [7], the weight for sample can be calculated by:
The criterion in (7) indicates that if the loss of an instance is less than the current threshold , meaning that it is “easy”, it is selected for training in the next iteration; otherwise it is left unselected. Here controls the learning pace, i.e. how many training samples are selected at this stage. As increases, the tolerance for sample loss becomes larger, and more “hard” samples are appended to the training set to learn a stronger model. We summarize the whole optimization procedure in Algorithm 1.
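The alternating procedure described in this section can be sketched end to end as follows. This is a simplified illustration, not a reproduction of Algorithm 1: the linear SVM is trained with plain subgradient descent, the pace parameter is grown by a hypothetical multiplicative factor `growth`, labels are re-estimated only for shots flagged in `relabel` (here, shots from event videos, a simplifying assumption), and all default parameter values are placeholders.

```python
def decision(w, b, x):
    """Linear decision value w.x + b."""
    return sum(wd * xd for wd, xd in zip(w, x)) + b

def hinge(y, score):
    return max(0.0, 1.0 - y * score)

def fit_weighted_svm(X, y, v, c_pos=1.0, c_neg=1.0, epochs=300, lr=0.05):
    """Subgradient descent on 0.5*||w||^2 + sum_i C(y_i) * v_i * hinge_i."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of the 0.5*||w||^2 term
        for xi, yi, vi in zip(X, y, v):
            if vi > 0.0 and yi * decision(w, b, xi) < 1.0:
                cost = c_pos if yi > 0 else c_neg
                for d, xd in enumerate(xi):
                    grad_w[d] -= cost * vi * yi * xd
                grad_b -= cost * vi * yi
        w = [wd - lr * gd for wd, gd in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def update_weights(losses, lam):
    """Binary self-paced weights: select a sample only if its loss is below lam."""
    return [1.0 if loss < lam else 0.0 for loss in losses]

def self_paced_learning(X, y_init, relabel, lam=0.5, growth=1.5, iterations=4):
    y = list(y_init)
    v = [1.0] * len(X)
    w, b = fit_weighted_svm(X, y, v)  # MIL-style initialization on all shots
    for _ in range(iterations):
        scores = [decision(w, b, xi) for xi in X]
        # Label step: pick the label with the smaller hinge loss where allowed.
        y = [(1 if s >= 0.0 else -1) if free else yi
             for s, yi, free in zip(scores, y, relabel)]
        # Weight step: keep only the currently "easy" samples.
        losses = [hinge(yi, s) for yi, s in zip(y, scores)]
        v = update_weights(losses, lam)
        # Classifier step: retrain on the selected samples only.
        w, b = fit_weighted_svm(X, y, v)
        lam *= growth  # admit harder samples in the next round
    return w, b, y, v

# Toy example: three shots from an event video, two from a background video.
X = [[2.0, 0.0], [2.5, 0.5], [0.1, 0.0], [-2.0, 0.0], [-2.5, -0.5]]
y_init = [1, 1, 1, -1, -1]
relabel = [True, True, True, False, False]
w, b, y, v = self_paced_learning(X, y_init, relabel)
```

Growing the threshold after every round is what moves the training set from easy shots toward harder ones; the growth schedule itself is a design choice that would need tuning on real data.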
4.1 Dataset and Protocol
We conduct our experiments on the TRECVID Multimedia Event Detection dataset [27, 28]. Specifically, there are two sets, MED13 and MED14, collected by the National Institute of Standards and Technology (NIST) for the TRECVID competition. Each set includes 20 complex events, with 10 events in common, from E221 to E230. In this paper, we take only the common part for evaluation. A detailed list of event names with their evidential descriptions is provided in Table 2.
According to the evaluation procedure outlined by the TRECVID MED task, the dataset can be divided into 3 partitions: (1) Background, which contains background videos not belonging to any of the target events; (2) 10Ex (or 100Ex), which contains 10 (or 100) positive video examples for each event as the training samples; (3) MEDTest, which contains about videos for testing. For our recounting task, we select a small number of samples from each event (about for each on average), and annotate the temporal evidence locations manually. To alleviate the bias of different annotators, we average the results from persons to obtain the final ground truth. Note that this annotation is performed only on test data for evaluation purposes. For training data, only event labels are available, without any indication of evidence.
To enable quantitative evaluation, we also propose two metrics, Percentage of Overlap (PctOverlap) and F1-score, for measuring the performance of evidence localization. To define them, we first introduce the following notation: (1) : all temporal regions with predicted scores higher than a certain threshold (0.5 in this paper); (2) : all temporal regions with annotated scores 1 (or higher than a certain threshold); (3) : intersection regions of “prediction” and “ground truth”. Based on these notations, . As , , we can have F1-score.
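The region-based quantities above can be computed directly from lists of temporal intervals. The sketch below is hedged in two ways: precision and recall are taken as the intersection length divided by the predicted and annotated lengths respectively, which is the usual reading, and PctOverlap is assumed here to be intersection over union, since its exact formula is not reproduced in this text. Regions are assumed to be non-overlapping `(start, end)` pairs within each list.

```python
def total_length(regions):
    """Summed duration of non-overlapping (start, end) regions."""
    return sum(end - start for start, end in regions)

def intersection_length(pred, truth):
    """Summed duration shared by predicted and ground-truth regions."""
    return sum(
        max(0.0, min(p_end, t_end) - max(p_start, t_start))
        for p_start, p_end in pred
        for t_start, t_end in truth
    )

def evidence_metrics(pred, truth):
    """Return (pct_overlap, precision, recall, f1) for two region lists.

    pct_overlap is ASSUMED to be intersection over union.
    """
    p = total_length(pred)
    g = total_length(truth)
    i = intersection_length(pred, truth)
    precision = i / p if p > 0 else 0.0
    recall = i / g if g > 0 else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom > 0 else 0.0
    union = p + g - i
    pct_overlap = i / union if union > 0 else 0.0
    return pct_overlap, precision, recall, f1

# Toy example with times in seconds.
pred = [(0.0, 10.0), (20.0, 30.0)]
truth = [(5.0, 15.0), (25.0, 30.0)]
pct_overlap, precision, recall, f1 = evidence_metrics(pred, truth)
```

In the toy example the prediction covers 20 s, the annotation 15 s, and they share 10 s, so precision is 10/20 and recall is 10/15.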
4.2 Parameter Settings
For the semantic concept representation, we pre-train the concept detectors on three auxiliary datasets: the TRECVID SIN dataset (346 concepts), the Google Sports dataset [29] (478 concepts), and the Yahoo Flickr Creative Commons (YFCC) dataset [30] (609 concepts); the prediction scores of these detectors serve as the feature vector of each video shot. The SPL framework is run on these three kinds of features, corresponding to the different concept sets. In the learning process according to Algorithm 1, we set , and in each iteration, the learning pace controller . The regularization parameters in (5) are set as and , which follow the default settings in [26] and proved to be insensitive in our experiments. Table 3 reports the performance at different SPL iterations for the three concept sets.
According to Table 3, performance in both PctOverlap and F1-score shows an approximately rising trend as the number of iterations increases. Specifically, for TRECVID SIN, SPL converges very quickly and reaches its peak at . This fast convergence also appears in the 10Ex setting for YFCC (at ). Another observation concerns the improvements over BasicMIL (details are presented in the next subsection). For all three concept sets, the relative improvements in the 10Ex setting are much more significant than those in the 100Ex setting, which indicates that our method is particularly advantageous for weakly supervised learning when extremely few samples are available.
4.3 Performance Comparison
We compare our method with three baselines: RandomForest, AdaBoost, and BasicMIL. Random forest and AdaBoost are both classical approaches that introduce sample weights implicitly in the training process through random or rule-based sampling, which is similar in spirit to SPL. In BasicMIL, by contrast, the model is trained using all samples simultaneously with equal weights. We implement BasicMIL with an SVM for a fair comparison with SPL. All of the results are shown in Table 4 and Figure 2.
According to Table 4, BasicMIL always shows the worst results because of its straightforward use of the data, i.e. no sampling and equal weights. RandomForest performs better since it considers sample weights implicitly through random sampling. AdaBoost is the best baseline method, because its mechanism is much closer to ours: it gradually selects “hard” samples according to the “error”, whereas in SPL the criterion is the “loss”.
We also conduct a late fusion of the different concept features and report the comparison results in Table 5, which also lists the performance for each individual event. Figure 5 provides the corresponding results for a more intuitive visualization. Moreover, to support our results qualitatively, we illustrate some video examples with predicted evidence locations and their recounting concepts. Figure 3 shows the event “E227 Rock Climbing”, in which concepts such as Climbing, Mountaineer, and Person appear with high frequency. Figure 4 shows the event “E229 Winning a race without a vehicle”, where the concepts Sport, Running, and Athlete are the most likely to appear with high confidence.
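Late fusion combines the outputs of models trained on different feature sets rather than the features themselves. As a minimal sketch, assuming that fusion is a (possibly weighted) average of per-shot evidence scores from the three concept sets, which the text does not specify, it could look as follows; the function name `late_fusion` and the uniform default weights are illustrative.

```python
def late_fusion(score_lists, weights=None):
    """Weighted average of per-shot scores from several models.

    score_lists: one list of per-shot scores per concept set, all of equal length.
    weights: optional fusion weights; uniform averaging is assumed by default.
    """
    n_models = len(score_lists)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_shots = len(score_lists[0])
    return [
        sum(wt * scores[i] for wt, scores in zip(weights, score_lists))
        for i in range(n_shots)
    ]

# Toy example: scores for two shots from three hypothetical concept-set models.
fused = late_fusion([[0.2, 0.8], [0.4, 0.6], [0.6, 0.1]])
```

The fused scores can then be thresholded in the same way as the scores of a single model.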
In this paper, we propose a weakly supervised evidence discovery method based on the self-paced learning framework, which follows a learning process from easy “evidences” to gradually more complex ones, and simultaneously exploits more and more positive evidence samples from numerous weakly annotated video segments. Our method is evaluated on the TRECVID MED dataset and shows promising results both quantitatively and qualitatively. In future work, we will investigate various forms of the self-paced learning function that can be effectively adapted to our specific task for further improvement.
- [1] Yu, Q., Liu, J., Cheng, H., Divakaran, A., Sawhney, H.: Multimedia event recounting with concept based representation. In: ACM Multimedia. (2012)
- [2] Li, L.J., Fei-Fei, L.: What, where and who? classifying events by scene and object recognition. In: International Conference on Computer Vision. (2007)
- [3] Sun, C., Burns, B., Nevatia, R., Snoek, C., Bolles, B., Myers, G., Wang, W., Yeh, E.: Isomer: Informative segment observations for multimedia event recounting. In: ACM International Conference on Multimedia Retrieval. (2014)
- [4] Sun, C., Nevatia, R.: Discover: Discovering important segments for classification of video events and recounting. In: IEEE Conference on Computer Vision and Pattern Recognition. (2014)
- [5] Chang, X., Yu, Y.L., Yang, Y., Hauptmann, A.G.: Searching persuasively: Joint event detection and evidence recounting with limited supervision. In: ACM Multimedia. (2015)
- [6] Gan, C., Wang, N., Yang, Y., Yeung, D.Y., Hauptmann, A.G.: Devnet: A deep event network for multimedia event detection and evidence recounting. In: IEEE Conference on Computer Vision and Pattern Recognition. (2015)
- [7] Kumar, M.P., Packer, B., Koller, D.: Self-paced learning for latent variable models. In: The Conference on Neural Information Processing Systems. (2010)
- [8] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2) (2004) 91–110
- [9] Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE Conference on Computer Vision and Pattern Recognition. (2008)
- [10] Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: IEEE Conference on Computer Vision and Pattern Recognition. (2011)
- [11] Wang, H., Schmid, C.: Action recognition with improved trajectories. In: International Conference on Computer Vision. (2013)
- [12] Merler, M., Huang, B., Xie, L., Hua, G., Natsev, A.: Semantic model vectors for complex video event recognition. IEEE Transactions on Multimedia 14(1) (2012) 88–101
- [13] Liu, J., Yu, Q., Javed, O., Ali, S., Tamrakar, A., Divakaran, A., Cheng, H., Sawhney, H.: Video event recognition using concept attributes. In: IEEE Winter Conference on Applications of Computer Vision. (2013)
- [14] Mazloom, M., Habibian, A., Snoek, C.G.: Querying for video events by semantic signatures from few examples. In: ACM Multimedia. (2013)
- [15] Boreczky, J.S., Rowe, L.A.: Comparison of video shot boundary detection techniques. Journal of Electronic Imaging 5(2) (1996) 122–128
- [16] Tang, Y., Yang, Y.B., Gao, Y.: Self-paced dictionary learning for image classification. In: ACM Multimedia. (2012)
- [17] Supancic, J., Ramanan, D.: Self-paced learning for long-term tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. (2013)
- [18] Kumar, M.P., Turki, H., Preston, D., Koller, D.: Learning specific-class segmentation from diverse data. In: International Conference on Computer Vision. (2011)
- [19] Zhang, D., Meng, D., Li, C., Jiang, L., Zhao, Q., Han, J.: A self-paced multiple-instance learning framework for co-saliency detection. In: International Conference on Computer Vision. (2015)
- [20] Jiang, L., Meng, D., Yu, S.I., Lan, Z., Shan, S., Hauptmann, A.: Self-paced learning with diversity. In: The Conference on Neural Information Processing Systems. (2014)
- [21] Jiang, L., Meng, D., Mitamura, T., Hauptmann, A.G.: Easy samples first: Self-paced reranking for zero-example multimedia search. In: ACM Multimedia. (2014)
- [22] Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence 89(1) (1997) 31–71
- [23] Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: The Conference on Neural Information Processing Systems. (2002)
- [24] Gorski, J., Pfeuffer, F., Klamroth, K.: Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research 66(3) (2007) 373–407
- [25] Tang, K., Ramanathan, V., Fei-Fei, L., Koller, D.: Shifting weights: Adapting object detectors from image to video. In: The Conference on Neural Information Processing Systems. (2012)
- [26] Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of exemplar-svms for object detection and beyond. In: International Conference on Computer Vision. (2011)
- [27] NIST. http://nist.gov/itl/iad/mig/med13.cfm (2013)
- [28] NIST. http://nist.gov/itl/iad/mig/med14.cfm (2014)
- [29] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition. (2014)
- [30] Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.J.: The new data and new challenges in multimedia research. arXiv (2015)
- [31] Breiman, L.: Random forests. Machine Learning 45(1) (2001) 5–32
- [32] Friedman, J.H.: Stochastic gradient boosting. Computational Statistics & Data Analysis 38(4) (2002) 367–378