P-ODN: Prototype based Open Deep Network for Open Set Recognition

05/06/2019 ∙ by Yu Shu, et al. ∙ Peking University ∙ Beijing Institute of Technology

Most existing recognition algorithms are proposed for closed set scenarios, where all categories are known beforehand. However, in practice, recognition is essentially an open set problem. There are categories we know, called "knowns", and many more we do not know, called "unknowns". Enumerating all categories beforehand is never possible; consequently, it is infeasible to prepare sufficient training samples for those unknowns. Applying closed set recognition methods will naturally lead to unseen-category errors. To address this problem, we propose the prototype based Open Deep Network (P-ODN) for open set recognition tasks. Specifically, we introduce prototype learning into open set recognition. Prototypes and prototype radiuses are trained jointly to guide a CNN to derive more discriminative features. P-ODN then detects the unknowns by applying a multi-class triplet thresholding method based on the distance metric between features and prototypes. The unknowns detected in this process are manually labeled as new categories, and predictors for the new categories are added to the classification layer to "open" the deep neural networks and incorporate new categories dynamically. The weights of the new predictors are carefully initialized by a distances based algorithm that transfers the learned knowledge; this initialization speeds up the fine-tuning process and reduces the number of samples needed to train the new predictors. Extensive experiments show that P-ODN can effectively detect unknowns and needs only a few samples with human intervention to recognize a new category. Our method achieves state-of-the-art performance on the UCF11, UCF50, UCF101 and HMDB51 datasets in real world scenarios.

I Introduction

Deep neural networks have demonstrated significant performance on many visual recognition tasks [1, 2, 3]. Almost all of them are proposed for closed set scenarios, where all categories are known beforehand. However, in practice, some categories can be known beforehand, but more categories cannot be known until we have seen them. We call the categories we know a priori the "knowns" and those we do not know beforehand the "unknowns". Enumerating all categories is never possible due to this incomplete knowledge of categories, and preparing sufficient training samples for all categories beforehand is time and resource consuming, which is likewise infeasible for unknowns. Consequently, applying closed set recognition methods in real scenarios naturally leads to unseen-category errors. Therefore, recognition in the real world is essentially an open set problem, and an open set method is more desirable for recognition tasks.

Fig. 1:

Open Set Recognition. The training set contains sufficient labeled samples of different categories (colored in green and blue), while the testing set contains not only knowns but also unknowns which have never been seen. The solution should be able to accept the knowns and reject the unknowns. Simultaneously, the solution should classify knowns into the correct known categories and further be able to classify unknowns as well.

As shown in Fig. 1, in the open set recognition problem, the categories of the training set have sufficient labeled samples (the knowns), while the testing set contains both knowns and unknowns. The solution should accept and classify knowns into the correct known categories and also reject unknowns. It is natural to further extend the solution to recognize the unknowns as well, as shown in the right part of Fig. 1, where different color boxes classify the unknowns.

Technically speaking, many methods on incremental learning can be used to handle new instances of known categories [4, 5, 6, 7, 8]. However, most of these approaches do not consider unknowns or dynamically adding new categories to the system. In [9], a discriminative metric is learned for Nearest Class Mean (NCM) classification on the knowns, and new categories are added according to the mean features. This approach, however, assumes that the number of known categories is relatively large. An alternative multi-class incremental approach based on least-squares SVM has been proposed by Kuzborskij et al. [10], where a decision hyperplane is learned for each category. However, every time a category is added, the whole set of hyperplanes has to be updated, which becomes too expensive as the number of categories grows.

In particular, most research on open set recognition focuses on detecting unknowns only; recent works [11, 12, 13] have established formulations for classifying knowns and rejecting unknowns. Yet it is natural to go further and classify unknown samples after detecting them, and in the real world, solutions which further classify the unknowns are more challenging and have a wider range of applications. Bendale et al. [14] proposed an SVM-based recognition system that can continuously recognize new categories in an open world model by extending NCM-like algorithms [15] to a Nearest Non-Outlier (NNO) algorithm. But it is not applicable to deep neural networks, and its performance is much worse than that of deep neural network based algorithms. Recently, Yang et al. [16] tried to handle the open set recognition problem by training prototypes to represent the unknowns. But the solution rests on the assumption that all unknowns have sufficient labeled samples to train discriminative prototypes, which is not realistic. Moreover, the system needs to be retrained when new categories come, which consumes considerable time and computational resources.

In our previous work [17], we proposed an Open Deep Network (ODN) algorithm for open set recognition. First, we train a CNN to classify the knowns, which have sufficient samples. Then the triplet threshold of each category is calculated based on the correctly classified features of the training set. Unknowns can be detected by applying the triplet thresholds to the features derived by the CNN. The unknowns detected in this process are manually labeled, and predictors are added dynamically to the classification layer to incorporate new categories. The weights of the new predictors are initialized by the emphasis initialization method, which transfers the learned knowledge of the CNN to speed up the fine-tuning. However, the triplet thresholds are calculated on sampled features of the training set; consequently, the unknowns detection process might be affected by outliers of the training set. Besides, in the emphasis initialization method the relations between categories are defined on the feature scores, which is a rather simple way to estimate the similarity of categories.

Note that we will give a brief illustration of the methods proposed in our previous work [17] later. Specifically, the triplet thresholding unknowns detection method is detailed in Sec. IV-C, and the emphasis initialization method is detailed in Sec. IV-D.

Most recently, prototype learning has been introduced to improve the robustness of CNNs. In [16], Yang et al. proposed convolutional prototype learning (CPL) to improve robustness by using prototypes, together with a prototype loss (PL) to improve the intra-class compactness and inter-class distance of the feature representation. Yang et al. also introduced a method to handle the open set recognition problem with prototypes. However, as mentioned before, this method assumes that samples of unknowns are sufficient to train the prototypes, and the system needs to be retrained whenever new unknowns come. Inspired by the prototype learning concept, we propose the prototype based Open Deep Network (P-ODN) to handle the open set recognition problem.

In this paper, we propose P-ODN to improve the robustness of detecting unknowns and updating deep neural networks, thereby facilitating open set recognition. Basically, prototypes and prototype radiuses are trained jointly to derive more precise features that better represent the categories. In the prototype module, the prototypes learn the centers of the knowns. In the prototype radius module, the values of the prototypes are further restricted to a certain range by learning a radius for each category, which acts as a regularization term on the prototypes. Both modules help to improve the intra-class compactness and inter-class distance of the feature representation. Then the correctly classified features of the training set are projected into a different feature space by calculating the distance distribution between the features and the prototypes, and the triplet thresholds are learned from this correctly classified distance distribution. Instead of detecting unknowns directly on the features, based only on the statistics of the training samples, detecting unknowns based on distances to the prototypes retains the knowledge of the model and is less likely to be affected by outliers of the training set. After the unknowns detected in this process are manually labeled, new predictors are initialized based on the distance distribution between the new samples and the prototypes: each weight column of the knowns is integrated to initialize the new weights according to the distance distribution, which contains more robust relation knowledge between the new category and the knowns. Finally, the model is fine-tuned with the manually labeled samples to incorporate the new categories.

In order to provide convincing results for P-ODN, in this paper we focus on the action recognition problem, which is a more challenging recognition task. The effectiveness of the proposed framework is evaluated on four public datasets: UCF11, UCF50, UCF101 and HMDB51. The experimental results show that our method can effectively detect unknowns and needs only a few samples with human intervention to recognize a new category. Our method achieves state-of-the-art performance on all four datasets in real world scenarios.

The remainder of this paper is organized as follows. Sec. II briefly reviews the related work on incremental learning, open set learning and action recognition. Sec. III gives an overview of the prototype based Open Deep Network (P-ODN). The specific algorithms of P-ODN are presented in Sec. IV. The experimental results are discussed in Sec. V. Finally, Sec. VI concludes this paper.

A preliminary version of this work was published in [17]. The main extensions include four aspects. First, we introduce prototype learning into open set recognition tasks. In order to learn more discriminative features, we train the prototype and the prototype radius of each category jointly with the prototype module and the prototype radius module. Second, the triplet thresholding method is extended to detect unknowns based on the distance metric between features and prototypes, which is more robust. Third, we extend the emphasis initialization method to a distances based weights initialization method that considers the relations between the new category and all known categories. Finally, extensive experiments are performed on more datasets to evaluate the effectiveness of the proposed method.

Fig. 2: Framework of open set recognition. The left part of the blue dotted line illustrates the two training phases: a) The Initial training phase (detailed in Sec. IV-A & Sec. IV-B) takes the initial training set (knowns only) as input, then learns and outputs an initial model, prototypes and prototype radiuses for each category. b) The Incremental training phase (detailed in Sec. IV-D) takes the incremental training set (both knowns and unknowns) and the outputs of the Initial training phase as inputs, then detects the unknowns. The unknowns detected in this process are manually labeled. Next, the new category is dynamically incorporated in the model. Finally, the model is fine-tuned with only a few samples to make the unknowns "known". The right part of the dotted line illustrates two evaluation phases corresponding to the two training phases: a) Evaluation phase 1, where the detection f-score of unknowns is measured on the initial model trained in the Initial training phase. b) Evaluation phase 2, where the classification accuracy on both knowns and unknowns is measured on the final model trained in the Incremental training phase.

II Related Works

II-A Incremental learning

Many incremental methods based on SVMs have been proposed in recent years. Cauwenberghs et al. [5] proposed an incremental binary SVM by means of saving and updating KKT conditions. Yeh et al. [7] extended the approach to object recognition and demonstrated multi-class incremental learning. Pronobis et al. [18] proposed a memory-controlled online incremental SVM combined with an approximate technique [19] for visual place recognition. However, the updating process is extremely expensive, because the whole system needs to be retrained when new categories are added. Some other multi-class incremental learning works [20, 21, 22, 6] are incremental in terms of additional training samples rather than additional training categories.

II-B Open set learning

Open set learning assumes that both knowns and unknowns appear in the test phase, since knowledge of categories is incomplete in real world scenarios; the system needs to handle the unknowns at test time. Recent works on open set learning [11, 12, 13] formalized the processes of rejecting unknowns and classifying knowns in the test phase. Among them, Bendale et al. [13] adapted the concept of Meta-Recognition [23, 24] to deep neural networks and proposed an OpenMax method based on NCM that rejects unknown categories at the activation level. However, these solutions focus on detecting unknowns only and have not tried to further classify the unknowns. In [14], Bendale et al. proposed a recognition system that can continuously learn new categories in an open world model by extending Nearest Class Mean type algorithms [9, 15] to a Nearest Non-Outlier (NNO) algorithm, but it is not applicable to deep neural networks. Therefore, open set recognition with deep neural networks is still a challenging problem that deserves more effort.

II-C Action recognition

In recent years, Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance on various tasks (e.g. [25, 26, 27, 28]), and it has been proven that features learned by CNNs are much better than hand-crafted features [29, 30]. Many works [31, 32] have transferred CNNs to video tasks and made significant progress. The two-stream network [33] is the most widely used baseline video classification model; it has both spatial and temporal networks, is pre-trained on the ImageNet [34] dataset, and achieved the best performance at the time. In [35], dense trajectories are employed to simultaneously identify the spatial and temporal extents of the actions of interest. Hasan et al. [36] proposed a continuous activity learning framework for streaming videos, which intricately ties together deep hybrid feature models and active learning. To address the cross-modal video retrieval task, Pang et al. [37] presented a multi-pathway Deep Boltzmann Machine (DBM) dealing with low-level features of various types. Later works successfully trained very deep video classification networks: Wang et al. [2] adopted very deep ConvNet architectures [38, 39] to unleash the full potential of the temporal segment network framework. The latest works [2, 40] have achieved excellent performance. However, these works are designed for a static closed world; how to open the deep neural networks and dynamically handle unknowns remains unsolved.

III Overview

The framework of our open set recognition approach is shown in Fig. 2. Two training phases and two evaluation phases constitute the whole framework. The initial training set, which contains only knowns, is provided to the Initial training phase (detailed in Sec. IV-A & Sec. IV-B) as input. Then an initial model is trained, as well as prototypes and prototype radiuses of the categories. The Incremental training phase (detailed in Sec. IV-D) takes the incremental training set (containing both knowns and unknowns) as input and extracts features using the initial model. Then the distances between the features and the prototypes are measured under the constraint of the prototype radiuses. Next, a triplet thresholding method proposed in our previous work [17] is adapted to our framework to detect the unknowns. The detected unknowns are manually labeled as new categories, which are dynamically incorporated into the model. The model is then fine-tuned with only a few samples to make the unknowns "known". Finally, the output model can classify both knowns and unknowns.

Two evaluation phases are set to evaluate the model performance. Evaluation phase 1 is carried out after the Initial training phase: the testing set, which contains both knowns and unknowns, is fed to the initial model, and we measure the detection f-score of unknowns. Evaluation phase 2 is carried out after the Incremental training phase: the Top-1 classification accuracy on both knowns and unknowns is measured as the most important performance indicator of open set recognition tasks.
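As a concrete reference, the following is a minimal Python sketch (using scikit-learn, our choice) of how the detection f-score of Evaluation phase 1 can be computed, treating unknown detection as a binary decision; the label convention (-1 for unknowns) is an assumption:

```python
from sklearn.metrics import f1_score

def detection_fscore(true_labels, predicted_labels, unknown=-1):
    """Detection f-score: binary F1 over unknown-vs-known decisions.

    Any sample whose (true or predicted) label equals `unknown` counts as a
    positive of the 'unknown' class; everything else counts as 'known'.
    """
    y_true = [int(t == unknown) for t in true_labels]
    y_pred = [int(p == unknown) for p in predicted_labels]
    return f1_score(y_true, y_pred)
```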

IV Prototype based Open Deep Network

Fig. 3: Structure of the prototype based open deep network (P-ODN) in the Initial training phase. The CNN takes knowns as input, and a classification loss is applied to train the initial classification model. Two major modules: a) The Prototype module (detailed in Sec. IV-A) takes the features extracted by the CNN as input and learns prototypes for the categories. b) The Prototype Radius module (detailed in Sec. IV-B) takes the prototype based distances, calculated in the Prototype module, as input and learns the scope of each category prototype. Finally, the Initial training phase outputs an initial model, trained prototypes and prototype radiuses for the later Incremental training phase.

The structure of the prototype based open deep network (P-ODN) in the Initial training phase is shown in Fig. 3. This phase includes two major modules: first, a Prototype module is applied to learn prototypes of the categories based on prototype learning. Second, in order to keep each category prototype within a certain range, a Prototype Radius module is proposed; each category learns a prototype radius to further restrict the scope of the features derived by the model. Three kinds of losses are applied to train the prototypes and prototype radiuses. First, we apply the cross entropy loss to train the classification capacity of the neural networks, which we denote as $L_{cls}$:

$$L_{cls} = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{f_i^{(y_i)}}}{\sum_{j}e^{f_i^{(j)}}} \qquad (1)$$

where $n$ is the batch size, $f_i$ is the feature of the $i$th sample in the batch (with $f_i^{(j)}$ its $j$th component), and $y_i$ is the ground truth label. Second, the prototype loss, first proposed in [16], is adapted to our framework to train the prototypes of the knowns. Third, we propose the prototype radius loss, which guides the model to learn the radius scope of each known category. Note that the prototype loss and the prototype radius loss will be introduced in detail in Sec. IV-A and Sec. IV-B.
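For reference, a minimal NumPy sketch of the classification loss of Eq. (1) might look as follows; the function and variable names are ours:

```python
import numpy as np

def classification_loss(logits, labels):
    """Softmax cross entropy of Eq. (1).

    logits: (n, N) classification-layer outputs f_i for a batch of n samples.
    labels: (n,) ground-truth category indices y_i.
    """
    n = logits.shape[0]
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(n), labels].mean()
```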

The Initial training phase outputs the initial model, trained prototypes and prototype radiuses, and they will be used later in the Incremental training phase.

Major modules (Detecting Unknowns and Updating Network) of P-ODN in the Incremental training phase are shown in Fig. 2. The initial model trained in the Initial training phase extracts the features of the incremental training set. In Sec. IV-C, we briefly review the triplet thresholding unknowns detection method proposed in our previous work [17]; the method is then modified to be applicable in P-ODN, detecting the unknowns based on the distance metric between the features and the trained prototypes. In Sec. IV-D, a new distances based weights initialization method is introduced to initialize the weights of new category predictors in the Updating Network module. After the initialization of the new weights, a few manually labeled samples are used to fine-tune the model, and new categories are incorporated into the current model continuously. At the end of this phase, a final model which can handle both knowns and unknowns is trained.

IV-A Prototype module

Fig. 4 illustrates the algorithm for training the prototypes to represent the centers of the knowns. Since prototype learning has shown its effectiveness in increasing the inter-class variation [16], we introduce prototype learning into open set recognition tasks and further use the prototypes to detect unknowns.

Fig. 4: The illustration of the Prototype Module. Different colors represent different categories. To give an explicit explanation of the process, we assume the batch size is 3, as shown in the figure. The corresponding prototypes are chosen according to the labels of the features. Then an L2 loss is applied to guide the prototypes to learn. Simultaneously, a distance distribution matrix is calculated to train the classification capacity of the prototypes with the distance based classification loss.

Specifically, a prototype matrix $P \in \mathbb{R}^{N \times d}$ is initialized with zeros, where $N$ is the number of known categories and $d$ is the feature dimension. Each row of the prototype matrix, shown in a different color in Fig. 4, represents the prototype (or center) of one known category. The prototype loss ($L_p$) is applied to obtain trained prototypes which vary greatly across categories. The prototype loss consists of an L2 loss ($L_{l2}$) and a distance based classification loss ($L_{dce}$). The two losses are combined with a weight argument $\lambda$:

$$L_p = L_{l2} + \lambda L_{dce} \qquad (2)$$

The L2 loss: $n$ prototypes are chosen according to the labels of the features, where $n$ is the batch size of the CNN. To give an explicit explanation of the process, we assume the batch size is 3, as shown in Fig. 4. The L2 loss is applied to the chosen prototypes and the features to guide the prototypes to learn the characteristics of the features:

$$L_{l2} = \frac{1}{n}\sum_{i=1}^{n}\left\| f_i - p_{y_i} \right\|_2^2 \qquad (3)$$

where $f_i$ is the feature of the $i$th data sample in the batch and $p_{y_i}$ is the corresponding prototype.

The distance based classification loss: as the prototypes and features are trained jointly, simply applying the L2 loss to make the prototypes similar to the features would be unstable; the prototypes could easily be misled by outliers among the training samples. We add the distance based classification loss to improve the classification capacity of the prototypes and to increase the penalty on misclassified samples, which helps to learn more stable and characteristic prototypes of the categories.

As shown in Fig. 4, the Euclidean distance between each feature and each category prototype is calculated to get a distance distribution matrix $D$:

$$D_{ij} = \frac{1}{\left\| f_i - p_j \right\|_2 + \epsilon} \qquad (4)$$

where $1 \le i \le n$ and $1 \le j \le N$. We take the reciprocal of the distances between features and prototypes so that features near a prototype get a larger probability value, and $\epsilon$ is applied to avoid dividing by zero. Classification can thus be implemented by assigning the label according to the largest value in each row of $D$. Then the cross entropy loss is applied to $D$, giving the distance based classification loss $L_{dce}$:

$$L_{dce} = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{D_i^{(y_i)}}}{\sum_{j=1}^{N}e^{D_i^{(j)}}} \qquad (5)$$

where $n$ is the batch size (the same as the number of rows of $D$), $y_i$ is the ground truth, and $D_i$ denotes the $i$th row of $D$.

In this module, P-ODN learns the category prototypes by applying $L_p$; the trained prototypes are saved and used later in the Incremental training phase.
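To make the computation concrete, here is a minimal NumPy sketch of the prototype module (Eqs. 2 to 5); the function names, the softmax inside the cross entropy, and the value of the epsilon are our assumptions:

```python
import numpy as np

EPS = 1e-8  # the epsilon of Eq. (4); value assumed

def distance_distribution(features, prototypes):
    """Distance distribution matrix D of Eq. (4): D[i, j] = 1 / (||f_i - p_j|| + eps)."""
    diff = features[:, None, :] - prototypes[None, :, :]   # (n, N, d)
    dist = np.linalg.norm(diff, axis=2)                    # (n, N) Euclidean distances
    return 1.0 / (dist + EPS)

def prototype_loss(features, labels, prototypes, lam=1.0):
    """Prototype loss of Eq. (2): L_p = L_l2 + lam * L_dce (lam is the weight lambda)."""
    n = features.shape[0]
    chosen = prototypes[labels]                            # prototype matching each label
    l_l2 = ((features - chosen) ** 2).sum(axis=1).mean()   # Eq. (3)
    D = distance_distribution(features, prototypes)
    shifted = D - D.max(axis=1, keepdims=True)             # softmax cross entropy on D
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    l_dce = -log_probs[np.arange(n), labels].mean()        # Eq. (5)
    return l_l2 + lam * l_dce, D
```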

IV-B Prototype Radius module

The prototype radius module is a key part of the Initial training phase; it aims to restrict the values of the prototypes to a certain range and to learn the prototype radius of each category. The prototypes and the prototype radiuses are trained jointly by adding an L2 loss ($L_r$), which can be regarded as a regularization term of the prototype learning.

Fig. 5: The illustration of the Prototype Radius Module. To restrict the features to a certain range, the prototype radius module is applied. First, the correctly classified rows of the distance distribution matrix are chosen. Then the distance values of the corresponding categories are used to train the prototype radiuses with an L2 loss. This module can be regarded as a regularization term on the prototypes.

As shown in Fig. 5, a radius vector $R \in \mathbb{R}^N$, holding one prototype radius per category, is initialized with zeros. The distance distribution matrix ($D$) calculated in the prototype module is input to the prototype radius module. Then the correctly classified probability scores of $D$ are chosen to guide the prototype radiuses, giving the prototype radius loss $L_r$:

$$L_r = \frac{1}{m}\sum_{i=1}^{m}\left\| d_i - r_{y_i} \right\|_2^2 \qquad (6)$$

where $m$ is the number of correctly classified distance probability values, $d_i$ is the $i$th correctly classified distance probability value (the largest score of its distance distribution row), and $r_{y_i}$ is the category prototype radius chosen according to the label of the sample.

The prototype radiuses act like a memory unit of the distance probability values of the correctly classified samples. They are updated continuously according to the correctly classified samples, while storing the distance probability information of previously correctly classified samples. The prototype radius module thus integrates long-term information to restrict the distance distribution values to a certain range and indirectly restricts the scope of the features extracted by the model. The use of this long-term information brings a large improvement, especially in the temporal stream of action recognition.

In this module, P-ODN learns the category prototype radiuses by applying $L_r$, and the trained prototype radiuses are also saved. Note that the total loss of the Initial training phase is:

$$L = L_{cls} + \alpha L_p + \beta L_r \qquad (7)$$

where $\alpha$ and $\beta$ are weight arguments, set empirically in our experiments.

Obviously, the prototype radiuses are trained jointly with the prototypes, playing an important role as a regularization term in training the prototypes.
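A minimal NumPy sketch of the prototype radius loss (Eq. 6) and the total loss (Eq. 7), under the reconstruction above, could be:

```python
import numpy as np

def radius_loss(D, labels, radiuses):
    """Prototype radius loss of Eq. (6), over correctly classified rows of D."""
    correct = D.argmax(axis=1) == labels          # keep correctly classified samples
    if not correct.any():
        return 0.0
    d = D[correct].max(axis=1)                    # largest score of each chosen row
    r = radiuses[labels[correct]]                 # radius of the corresponding category
    return ((d - r) ** 2).mean()                  # Eq. (6)

def total_loss(l_cls, l_p, l_r, alpha=1.0, beta=1.0):
    """Total loss of Eq. (7); the weights alpha and beta are placeholders."""
    return l_cls + alpha * l_p + beta * l_r
```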

IV-C Detecting Unknowns

In our previous work [17], we proposed a multi-class triplet thresholding method to detect the unknowns. Basically, a triplet threshold per category is calculated, i.e. an accept threshold $T_{acc}^{l}$, a reject threshold $T_{rej}^{l}$ and a distance-reject threshold $T_{dist}^{l}$. The triplet threshold of category $l$ is calculated as

$$T_{acc}^{l} = \frac{\lambda}{K_l}\sum_{i=1}^{K_l} v_i^{max} \qquad (8)$$
$$T_{rej}^{l} = \frac{\mu}{K_l}\sum_{i=1}^{K_l} v_i^{max} \qquad (9)$$
$$T_{dist}^{l} = \frac{1}{K_l}\sum_{i=1}^{K_l}\left( v_i^{max} - v_i^{sec} \right) \qquad (10)$$

where $v_i^{max}$ and $v_i^{sec}$ are the maximal and the second maximal confidence values of the $i$th correctly classified sample of category $l$, $K_l$ is the number of correctly classified samples of category $l$, and $\lambda$ and $\mu$ are empirical parameters.

Fig. 6: The illustration of Detecting Unknowns in P-ODN. The first column can be viewed as the previous version in [17]. In P-ODN, the distance distribution matrix is calculated using the prototypes. Thresholds calculated on the mean distance distribution are then applied to the distance distributions of the test samples to detect unknowns.

A data sample is classified as category $l$ only if the index of its top confidence value is $l$ and that value is greater than $T_{acc}^{l}$, and a sample is regarded as unknown when all of its confidence values are below $T_{rej}^{l}$. The threshold $T_{dist}^{l}$ helps detect unknowns among hard samples, whose top confidence values lie between $T_{rej}^{l}$ and $T_{acc}^{l}$. The statistical properties of $v^{max} - v^{sec}$ include correlation information between the top two categories, which is a simple way of using inter-class relation information at the activation level: if the distance is large enough, we accept the data sample as category $l$. The process of unknowns detection is shown in the first column of Fig. 6.

Unlike the previous version of unknowns detection in [17], we modify the method to be applicable in the P-ODN framework, obtaining a more robust unknowns detection algorithm based on the distance metric.

Fig. 6 compares the unknowns detection of our previous ODN [17] with that of P-ODN. Instead of calculating the triplet thresholds based on the mean feature vectors, the distance distribution matrix of features and prototypes is calculated first, as detailed in Sec. IV-A (Eq. 4). Then the triplet thresholds are calculated for each category in the same routine, but based on the mean distance distribution. Once the triplet thresholds are acquired, the subsequent unknowns detection process is the same.

The insight behind this improvement is that, in the previous version, the thresholds are calculated from statistics of the training samples, and such statistics, the mean feature vectors, are easily affected by outliers. In P-ODN, the features are transferred into a different feature space by calculating the distance distribution matrix, and the thresholds are then calculated on the mean distance distribution. In this way the thresholds exploit the knowledge of the model rather than just the statistics of the dataset, which is more robust, because the distance distribution can be regarded as a projection of the features under the guidance of the prototypes, which are trained with the model.
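Under the notation of Eqs. (8) to (10), a minimal NumPy sketch of threshold learning and unknown detection in P-ODN might look as follows; the values of λ and μ and the exact order of the checks are our assumptions:

```python
import numpy as np

def triplet_thresholds(D, labels, num_classes, lam=1.0, mu=0.5):
    """Per-category triplet thresholds (Eqs. 8-10) from the correctly
    classified rows of the distance distribution matrix D."""
    T_acc, T_rej, T_dist = (np.zeros(num_classes) for _ in range(3))
    correct = D.argmax(axis=1) == labels
    for l in range(num_classes):
        rows = D[correct & (labels == l)]
        if len(rows) == 0:
            continue                              # no correct sample for this category
        top2 = np.sort(rows, axis=1)[:, -2:]      # second-max and max score per row
        T_acc[l] = lam * top2[:, 1].mean()        # Eq. (8)
        T_rej[l] = mu * top2[:, 1].mean()         # Eq. (9)
        T_dist[l] = (top2[:, 1] - top2[:, 0]).mean()  # Eq. (10)
    return T_acc, T_rej, T_dist

def classify_or_reject(row, T_acc, T_rej, T_dist, unknown=-1):
    """Assign a known label to one distance-distribution row, or reject it."""
    l = int(row.argmax())
    v_max, v_sec = np.sort(row)[-1], np.sort(row)[-2]
    if v_max > T_acc[l]:
        return l                                  # confidently accepted as category l
    if (row < T_rej).all():
        return unknown                            # every score is low: unknown
    # hard sample between the thresholds: accept only with a large top-2 margin
    return l if (v_max - v_sec) > T_dist[l] else unknown
```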

IV-D Updating Network

After detecting the unknowns, the unknown samples are manually labeled; these samples can then be used to fine-tune the model. As discussed above, retraining the entire system with the known data plus the new samples is time consuming and wastes computational resources. It is also prone to over-fitting, because the new categories have far too few training samples.

Fig. 7: The illustration of distances based weights initialization. A distance distribution is calculated between the prototypes and the new sample features. Then a weight distribution is acquired by applying mean normalization. Finally, the weights of the new predictors are initialized from the weights of the initial networks according to the weight distribution.

In our previous work [17], an updating method was proposed that transfers knowledge from the trained model, which helps to speed up the training stage and needs very few manual annotations. A brief retrospective of the method is given below.

In each iteration of the Incremental training phase, a new category is incorporated into the current model by adding a corresponding weight column to the classification layer of the networks. By initializing the weight column as in Formula 11, the knowledge of the previous model is transferred to the new model:

$$w_{N+1} = \alpha \frac{1}{N}\sum_{i=1}^{N} w_i + \beta \frac{1}{|S|}\sum_{w_s \in S} w_s \qquad (11)$$

where $N+1$ is the current category number, $w_i$ is the weight column of the $i$th category in the classification layer of the networks, $S$ is the set of weight columns of the most similar categories measured by the scores of the features, and $\alpha$ and $\beta$ are empirical parameters.

In P-ODN, a new weight column is likewise added to the classification layer to incorporate the new category. Unlike the previous version, the distance distribution of the new category sample is calculated first. Then, by applying mean normalization, we get a distribution $q = (q_1, \dots, q_N)$ with $\sum_{j=1}^{N} q_j = 1$, as shown in Fig. 7. The new weight is initialized as:

$$w_{N+1} = \sum_{j=1}^{N} q_j \, w_j \qquad (12)$$

where $w_j$ is the weight column of the $j$th category and $N+1$ is the current category number.

The insight behind this improvement is that the distances based weights initialization method takes more robust relations between the new category and the knowns into account. By initializing the new weights as in Formula 12, both the global knowledge and the relation knowledge are incorporated into the new model. First, every weight column contributes to the initialization of the new weights, which keeps the new weights in the same distribution as the knowns. Second, the distances based relation metric is much more robust than that of the previous work [17], which compared the scores in the features.
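A minimal NumPy sketch of this distances based initialization (Eq. 12) follows; treating "mean normalization" as scaling the reciprocal distances to sum to one is our reading:

```python
import numpy as np

def init_new_predictor(W, new_feature, prototypes, eps=1e-8):
    """Append a new weight column to the classification layer (Eq. 12).

    W:           (d, N) weight matrix, one column per known category.
    new_feature: (d,) feature of a labeled sample of the new category.
    prototypes:  (N, d) trained prototypes of the knowns.
    """
    sim = 1.0 / (np.linalg.norm(prototypes - new_feature, axis=1) + eps)
    q = sim / sim.sum()                    # normalized distance distribution, sums to 1
    w_new = W @ q                          # weighted combination of known columns
    return np.concatenate([W, w_new[:, None]], axis=1)
```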

After the networks are updated, the few samples detected in the Detecting Unknowns module are used to fine-tune the model. As the new weights incorporate the knowledge of the previous model, the fine-tuning phase is much less complex and converges quickly. We also adopt the Allometry Training and Balance Training methods proposed in [17] while fine-tuning the model. Specifically, different learning rates are embedded into the classification layer to force the new weights to learn at a faster rate, and we use the same few samples for each known and new category to avoid greatly influencing the accuracy on the knowns. At the end of the Incremental training phase, the final model can classify both knowns and unknowns.

V Experiments

This section first introduces the details of the datasets and the evaluation schemes. Then we describe the experiment settings. Finally, we report the experimental results and analyze them.

V-A Datasets

To verify the effectiveness of P-ODN, we conducted experiments on four public datasets: UCF11 [41], UCF50 [42], UCF101 [43] and HMDB51 [44]. Note that we divide the datasets into knowns and unknowns to simulate open world scenarios, as detailed in Sec. V-B.

The UCF11 dataset contains 11 action categories. For each category, the videos are grouped into 25 groups with more than 4 action clips in each. The video clips in the same group share some common features, such as the same actor, similar background, similar viewpoint, and so on.

The UCF50 dataset is an action recognition dataset with 50 action categories, consisting of realistic videos taken from YouTube. For each category, the videos are grouped into 25 groups, where each group consists of more than 4 action clips.

The UCF101 dataset is one of the most popular action recognition benchmarks. It contains 13,320 video clips from 101 action categories, with at least 100 video clips per category.

The HMDB51 dataset is a large collection of realistic videos from various sources, including movies and web videos. It contains 6,766 clips divided into 51 action categories, each containing a minimum of 101 clips.

Compared with the very large datasets used for image classification, the datasets for action recognition are relatively small. Therefore we pre-trained our model on the ImageNet dataset [34].

V-B Experiment settings

To simulate open world scenarios, we choose nearly half of the categories of each dataset as knowns and the remaining categories as unknowns. The training set of each dataset is then divided into two subsets according to knowns and unknowns. The subset containing the knowns is the initial training set. A small subset, in which each category is guaranteed a minimum number of samples, is chosen from both the knowns and the unknowns to form the incremental training set. Note that we use far fewer training samples and withhold the labels of the unknown categories to simulate open world scenarios.
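For clarity, the category split of this protocol can be sketched in plain Python as follows; the even split ratio and the fixed seed are illustrative only:

```python
import random

def open_world_split(categories, seed=0):
    """Split categories roughly in half: knowns keep their labels for the
    initial training set; unknowns have their labels withheld."""
    rng = random.Random(seed)
    cats = sorted(categories)
    rng.shuffle(cats)
    half = len(cats) // 2
    return cats[:half], cats[half:]        # (knowns, unknowns)
```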

sample number UCF11 UCF50 UCF101 HMDB51
ODN 7 5.8 5.39 5.57
P-ODN 7 5.4 5.2 5
TABLE I: Average number of manually annotated unknown samples needed to add a new category.

After initializing the spatial and temporal streams from the pre-trained ImageNet model, we conduct the Initial training phase to train the prototypes, the prototype radiuses of the categories, and the initial model jointly. Note that the spatial stream and the temporal stream each train their own prototypes and prototype radiuses. After the prototypes and prototype radiuses of the two streams are trained, the triplet thresholds of both streams are calculated on the initial training set based on the prototypes.

During the Incremental training phase, we keep the same experiment settings as our previous work [17] to give a convincing comparison. Basically, we update the networks when the number of labeled samples of any new category reaches a preset count, at which point this category is incorporated into the current model.

Table I compares the average number of manually annotated unknown samples needed to add a new category: with P-ODN, UCF101 needs about 5.2 labeled samples per unknown category (5.4 for UCF50, 7 for UCF11 and 5 for HMDB51). Compared with our previous ODN [17], P-ODN uses the same number of labeled unknown samples or even fewer.

For closed set recognition, by contrast, using UCF101 as an example, the training list of UCF101 split1 has 9,537 data samples of 101 categories; on average, each category (half corresponding to the known categories and half to the unknown categories) needs about 94 annotated samples. It is obvious that we use far fewer samples of unknowns in the open set setting.

We also conducted closed set recognition experiments as our baseline using the same sample sizes as in the open set setting. The results in the closed set setting are much worse than those of P-ODN when both use insufficient unknown samples (detailed in Sec. V-D). So P-ODN needs much fewer human annotations than closed set recognition and achieves better performance. It is worth mentioning that P-ODN suits real world scenarios, which closed set recognition cannot handle.

Fig. 8: Heat map of the mean features and the prototypes of the knowns. Prototypes have a much stronger response on the correctly classified values: the diagonal is much brighter, while the upper and lower triangular parts are much darker.
Fig. 9: t-SNE visualization of the mean features and the prototypes. Each colored number represents the mean feature or the prototype of one category. The mean features (left) show more confusable categories, for which the inter-class distances are very short, while the prototypes (right) handle these confusable categories better.

We use the TensorFlow toolbox [45] and report results on Inception-ResNet-v2. The network weights are trained using mini-batch stochastic gradient descent with momentum. We resize all input images and then use the fixed-crop strategy [46] to crop regions from the images or their horizontal flips.

V-C Exploration experiments

Benefits from prototypes. To illustrate the improvement brought by prototypes, we first conduct an exploration experiment on UCF101 with GoogLeNet [27].

We visualize the heat map of the mean features and the prototypes of the knowns, as shown in Fig. 8. We can see that prototypes have a much stronger response on the correctly classified values: the diagonal is much brighter, while the upper and lower triangular parts are much darker.

As shown in Fig. 9, we reduce the dimensions of the mean features and the prototypes and visualize them with t-SNE. In the figure, each colored number represents the mean feature or the prototype of one category. The visualization of the mean features (left) shows more confusable categories, for which the inter-class distances are very short, while the prototypes (right) handle these confusable categories better.

The comparison shows that the prototypes are more suitable for representing category centers, because they have a stronger response on the correctly classified values and larger inter-class distances. These two advantages help to implement more robust unknowns detection and to guide better feature training.
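As a usage reference, the visualization of Fig. 9 can be reproduced with scikit-learn's t-SNE along the following lines; the plotting details are our own sketch:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(mean_features, prototypes):
    """Embed mean features and prototypes in 2-D and draw each point as its
    category index, roughly as in Fig. 9."""
    X = np.vstack([mean_features, prototypes])
    emb = TSNE(n_components=2).fit_transform(X)
    n = len(mean_features)
    for i, (x, y) in enumerate(emb):
        # category index as the glyph; one color for features, one for prototypes
        plt.text(x, y, str(i % n), color='C0' if i < n else 'C1')
    plt.xlim(emb[:, 0].min() - 1, emb[:, 0].max() + 1)
    plt.ylim(emb[:, 1].min() - 1, emb[:, 1].max() + 1)
    plt.show()
```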

V-D Evaluation of detecting unknowns

(a) UCF11
(b) UCF50
(c) UCF101
(d) HMDB51
Fig. 10: Comparison of the baseline method, ODN [17], P-ODN and P-ODN with radius on category accuracy for UCF11, UCF50, UCF101 and HMDB51 in real world scenarios. Each sub-figure corresponds to one dataset; take sub-figure (a) as an example. The light gray bar denotes the accuracy of the baseline on knowns, while the dark gray denotes the accuracy of the baseline on unknowns. The light blue line denotes the accuracy of ODN on knowns, while the dark blue denotes the accuracy of ODN on unknowns. Likewise, the light green and dark green denote the knowns and unknowns of P-ODN, and the light red and dark red denote the knowns and unknowns of P-ODN with radius, respectively.
F-score UCF11 UCF50 UCF101 HMDB51
OSDN[13] 82.59% 75.34% 72.1% 50.31%
ODN[17] 87.39% 74.91% 73.35% 63.70%
P-ODN 89.12% 80.14% 75.45% 66.79%
P-ODN + radius 89.50% 82.15% 76.2% 67.36%
TABLE II: Unknowns detection results of P-ODN.

In this subsection, we evaluate the unknowns detection performance of P-ODN on UCF11, UCF50, UCF101 and HMDB51. As mentioned before, this evaluation is conducted at the end of the Initial training phase as Evaluation phase 1. The experimental results are summarized in Table II. The first row reports the performance of OSDN proposed in [13], which we ran on the action recognition task. The second row is the performance of our previous work [17]. The third row is P-ODN with the prototype module only, and the last row is P-ODN with both the prototype module and the prototype radius module. P-ODN with both modules improves over ODN by 2.11% on UCF11, 7.24% on UCF50, 2.85% on UCF101 and 3.66% on HMDB51.

TOP1 Acc. UCF11 UCF50 UCF101 HMDB51
baseline 85.1% 84.95% 72.01% 44.58%
ODN [17] 94.91% 93.73% 76.07% 46.01%
P-ODN 94.9% 95.16% 77.21% 47.84%
P-ODN + radius 95.31% 96.15% 78.64% 49.09%
TABLE III: Recognition results of P-ODN.

As mentioned in Sec. IV-C and shown in Sec. V-C, unknowns detection based on the prototypes is more robust. First, the prototypes are more discriminative than mean features. Second, the prototypes guide the features to be trained better, which improves the intra-class compactness and inter-class distance of the feature representation. We also learn the triplet thresholds based on the prototypes, so the thresholds contain the knowledge of the model itself. The more discriminative features and the model based triplet thresholds together lead to a large improvement in unknowns detection performance.

V-E Evaluation of classification on both knowns and unknowns

In this subsection, we evaluate the classification performance of P-ODN on UCF11, UCF50, UCF101 and HMDB51. This evaluation is conducted at the end of the Incremental training phase as Evaluation phase 2; the final classification accuracy on both knowns and unknowns is the most important performance indicator of open set recognition tasks. The experimental results are summarized in Table III. First, we carried out closed set recognition experiments using the same quantity of samples as in our open set setting. Under the closed set setting, all training samples must have labels, so we provide labels of both knowns and unknowns. The result is shown in the first row as our baseline. The rest of Table III reports results under the open set setting detailed in Sec. V-B. The second row is the result of our previous work [17]; we add the experiment on UCF11 here, since UCF11 was not used in the previous work. The last row is P-ODN with both the prototype module and the prototype radius module, which achieves the best performance. P-ODN finally improves over ODN by 0.4% on UCF11, 2.42% on UCF50, 2.57% on UCF101 and 3.08% on HMDB51, and over the baseline by 10.21% on UCF11, 11.2% on UCF50, 6.63% on UCF101 and 4.51% on HMDB51.

A more explicit illustration can be seen in Fig. 10. Each sub-figure corresponds to one dataset: UCF11, UCF50, UCF101 and HMDB51. Taking sub-figure (a) as an example, we compare the accuracy on both knowns and unknowns on UCF11 for four methods: baseline, ODN [17], P-ODN and P-ODN with radius (P-ODN with both the prototype module and the prototype radius module). The light gray bar denotes the accuracy of the baseline on knowns, while the dark gray denotes the accuracy of the baseline on unknowns; the light blue denotes the accuracy of ODN on knowns, while the dark blue denotes the accuracy of ODN on unknowns; likewise, the light green and dark green denote the knowns and unknowns of P-ODN, and the light red and dark red denote the knowns and unknowns of P-ODN with radius.

We can see that with the baseline method, the knowns, trained with abundant data samples, achieve much better performance than the unknowns, trained with insufficient labeled samples. Our methods improve greatly on the unknowns while using insufficient samples. The performance on the knowns may decrease slightly, since the fine-tuning phase incorporates new data continuously. In Fig. 10, P-ODN with radius is generally above the other methods and achieves the best performance. Note that, unlike the baseline method, which is provided with all labels beforehand, the other three methods must detect the unknowns first and then have them manually labeled. Therefore, open set recognition is more realistic than the closed set setting.

VI Conclusion

This paper proposed a prototype based Open Deep Network (P-ODN) for open set recognition. We introduce prototype learning into open set recognition tasks by training prototypes and prototype radiuses of the categories with a prototype module and a prototype radius module. A distance metric method based on the prototypes is then applied to detect unknowns more robustly. In the Incremental training phase, a distances based weights initialization method is employed to quickly transfer the knowledge of the model and speed up the fine-tuning process. Experimental results show that P-ODN can effectively detect and recognize new categories with little human intervention and achieves state-of-the-art performance on the UCF11, UCF50, UCF101 and HMDB51 datasets.

In this paper, we have demonstrated the importance of more discriminative centers (or prototypes) for open set recognition tasks. More characteristic features with larger margins among categories will further improve the performance of unknowns detection. In addition, the method of [47], which utilizes a GAN to generate unknown samples and uses them to train the neural networks, also has potential to improve the recognition performance on unknowns. In future work, we will conduct more experiments along these lines to further improve the performance of open set recognition.

References

  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [2] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European Conference on Computer Vision.   Springer, 2016, pp. 20–36.
  • [3] Y. Shi, Y. Tian, Y. Wang, W. Zeng, and T. Huang, “Learning long-term dependencies for action recognition with a biologically-inspired deep network,” in Proceedings of the International Conference on Computer Vision, 2017, pp. 716–725.
  • [4] A. Tveit and M. L. Hetland, “Multicategory incremental proximal support vector classifiers,” in International Conference on Knowledge-Based and Intelligent Information and Engineering Systems.   Springer, 2003, pp. 386–392.
  • [5] G. Cauwenberghs and T. Poggio, “Incremental and decremental support vector machine learning,” in Advances in neural information processing systems, 2001, pp. 409–415.
  • [6] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, “Online passive-aggressive algorithms,” Journal of Machine Learning Research, vol. 7, no. Mar, pp. 551–585, 2006.
  • [7] T. Yeh and T. Darrell, “Dynamic visual category learning,” 2008.
  • [8] M. Herbster, “Learning additive models online with fast evaluating kernels,” in International Conference on Computational Learning Theory.   Springer, 2001, pp. 444–460.
  • [9] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka, “Metric learning for large scale image classification: Generalizing to new classes at near-zero cost,” in Computer Vision–ECCV 2012.   Springer, 2012, pp. 488–501.
  • [10] I. Kuzborskij, F. Orabona, and B. Caputo, “From n to n+1: Multiclass transfer incremental learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3358–3365.
  • [11] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult, “Toward open set recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1757–1772, 2013.
  • [12] W. J. Scheirer, L. P. Jain, and T. E. Boult, “Probability models for open set recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 11, pp. 2317–2324, 2014.
  • [13] A. Bendale and T. E. Boult, “Towards open set deep networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1563–1572.
  • [14] A. Bendale and T. Boult, “Towards open world recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1893–1902.
  • [15] M. Ristin, M. Guillaumin, J. Gall, and L. Van Gool, “Incremental learning of ncm forests for large-scale image classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 3654–3661.
  • [16] H.-M. Yang, X.-Y. Zhang, F. Yin, and C.-L. Liu, “Robust classification with convolutional prototype learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3474–3482.
  • [17] Y. Shu, Y. Shi, Y. Wang, Y. Zou, Q. Yuan, and Y. Tian, “Odn: Opening the deep network for open-set action recognition,” in 2018 IEEE International Conference on Multimedia and Expo (ICME).   IEEE, 2018, pp. 1–6.
  • [18] A. Pronobis, L. Jie, and B. Caputo, “The more you learn, the less you store: Memory-controlled incremental svm for visual place recognition,” Image and Vision Computing, vol. 28, no. 7, pp. 1080–1097, 2010.
  • [19] N. A. Syed, S. Huan, L. Kah, and K. Sung, “Incremental learning with support vector machines,” 1999.
  • [20] G. Fung and O. L. Mangasarian, “Incremental support vector machine classification,” in Proceedings of the 2002 SIAM International Conference on Data Mining.   SIAM, 2002, pp. 247–260.
  • [21] Z. Wang, K. Crammer, and S. Vucetic, “Multi-class pegasos on a budget,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10).   Citeseer, 2010, pp. 1143–1150.
  • [22] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, “Pegasos: Primal estimated sub-gradient solver for svm,” Mathematical programming, vol. 127, no. 1, pp. 3–30, 2011.
  • [23] W. J. Scheirer, A. Rocha, R. J. Micheals, and T. E. Boult, “Meta-recognition: The theory and practice of recognition score analysis,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 8, pp. 1689–1695, 2011.
  • [24] P. Zhang, J. Wang, A. Farhadi, M. Hebert, and D. Parikh, “Predicting failures of vision systems,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3566–3573.
  • [25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [26] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
  • [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
  • [28] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
  • [29] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1.   IEEE, 2005, pp. 886–893.
  • [30] N. Dalal, B. Triggs, and C. Schmid, “Human detection using oriented histograms of flow and appearance,” in European conference on computer vision.   Springer, 2006, pp. 428–441.
  • [31] Y. Shi, W. Zeng, T. Huang, Y. Wang et al., “Learning deep trajectory descriptor for action recognition in videos using deep neural networks,” in 2015 IEEE International Conference on Multimedia and Expo (ICME).   IEEE, 2015, pp. 1–6.
  • [32] S. Zha, F. Luisier, W. Andrews, N. Srivastava, and R. Salakhutdinov, “Exploiting image-trained cnn architectures for unconstrained video classification,” arXiv preprint arXiv:1503.04144, 2015.
  • [33] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, 2014, pp. 568–576.
  • [34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on.   IEEE, 2009, pp. 248–255.
  • [35] Z. Zhou, F. Shi, and W. Wu, “Learning spatial and temporal extents of human actions for action detection,” IEEE Transactions on multimedia, vol. 17, no. 4, pp. 512–525, 2015.
  • [36] M. Hasan and A. K. Roy-Chowdhury, “A continuous learning framework for activity recognition using deep hybrid feature models,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1909–1922, 2015.
  • [37] L. Pang, S. Zhu, and C.-W. Ngo, “Deep multimodal learning for affective analysis and retrieval,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2008–2020, 2015.
  • [38] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [39] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [40] Y. Shi, Y. Tian, Y. Wang, and T. Huang, “Sequential deep trajectory descriptor for action recognition with three-stream cnn,” IEEE Transactions on Multimedia, vol. 19, no. 7, pp. 1510–1520, 2017.
  • [41] J. Liu, J. Luo, and M. Shah, “Recognizing realistic actions from videos “in the wild”,” in Computer vision and pattern recognition, 2009. CVPR 2009. IEEE conference on.   IEEE, 2009, pp. 1996–2003.
  • [42] K. K. Reddy and M. Shah, “Recognizing 50 human action categories of web videos,” Machine Vision and Applications, vol. 24, no. 5, pp. 971–981, 2013.
  • [43] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
  • [44] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “Hmdb: a large video database for human motion recognition,” in Computer Vision (ICCV), 2011 IEEE International Conference on.   IEEE, 2011, pp. 2556–2563.
  • [45] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: a system for large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.
  • [46] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao, “Towards good practices for very deep two-stream convnets,” arXiv preprint arXiv:1507.02159, 2015.
  • [47] Z. Ge, S. Demyanov, Z. Chen, and R. Garnavi, “Generative openmax for multi-class open set classification,” arXiv preprint arXiv:1707.07418, 2017.