Segregation Network for Multi-Class Novelty Detection

05/11/2019 · by Supritam Bhattacharjee, et al.

The problem of multiple class novelty detection is gaining increasing importance due to the wide availability of multimedia data and the increasing requirement for classification models to work in open-set scenarios. To this end, novelty detection tries to answer an important question: given a test example, should we even try to classify it? In this work, we design a novel deep learning framework, termed Segregation Network, which is trained using the mixup technique. We construct interpolated points as convex combinations of pairs of training data and use our novel loss function to predict their constituent classes. During testing, for each input query, mixed samples are generated with the known class prototypes and passed through the proposed network. The output of the network reveals the constituent classes, which can be used to determine whether the incoming data is from the known class set or not. Our algorithm is trained using only data from the known classes and does not require any auxiliary dataset or attributes. Extensive evaluation on two benchmark datasets, namely Caltech-256 and Stanford Dogs, and comparison with the state-of-the-art demonstrate the effectiveness of the proposed framework.

1 Introduction

Deep learning methods have achieved impressive performance in object recognition and classification [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton], [Simonyan and Zisserman(2014)] by using large networks trained with millions of data examples. However, these networks usually operate under a closed-set assumption and thus try to classify each query sample even if it does not belong to one of the training classes. For example, a neural network classifier trained to classify fruits might classify an input from a completely different category, say "bird", into one of the fruit classes with high confidence, which is unlikely to happen if a human performs the same task. To make such systems more intelligent and better suited to real-world (open-set) applications, they should be able to determine whether the input belongs to one of the trained classes, and only then try to classify it [Pimentel et al.(2014)Pimentel, Clifton, Clifton, and Tarassenko], [Markou and Singh(2003)].

This problem is addressed in recent literature as out-of-distribution detection, anomaly detection, novelty detection, open-set recognition and one-class classification, each having subtle differences from the others. One-class classification rejects all classes as outliers except the class of concern. Open-set recognition aims to recognize unknown classes while also classifying the known classes correctly. In out-of-distribution detection, the algorithm identifies samples coming from other datasets or distributions; such algorithms often have knowledge of similar out-of-distribution data.

Figure 1: A few example images from the Stanford Dogs (top row) and Caltech-256 Object Categories (bottom row) datasets.

In this work, we address the multi-class novelty detection task, where, given a query, the goal is to determine whether it belongs to one of the training classes. This is very challenging, since the novel data can come from the same data distribution as the training data. Here, we propose a novel framework, termed Segregation Network, which utilizes the mixup technique for this task. The network takes as input a pair of data points and a third, interpolated data point generated by mixing the pair in a variable ratio. The goal is to segregate the constituent classes and their respective proportions in the interpolated data using a novel loss function. Once the network is trained, given an unknown query sample, we mix it with the known class prototypes in a predefined proportion and pass it through the network. Based on the network output, we can infer whether the query belongs to the known set of classes or to a novel class unknown to the system. The main contributions of our work are as follows:
(1) We propose a novelty detection framework, termed Segregation Network, along with a novel loss function for training it using the mixup technique.
(2) Our algorithm works with only the available training data and does not require access to any auxiliary or out-of-distribution dataset, as in [Perera and Patel(2018)]. This is advantageous, since collecting auxiliary data is often difficult and expensive, and the choice of auxiliary data may depend on the training set of classes.
(3) We perform experiments on two standard benchmark datasets for novelty detection, and the results obtained compare favourably with the state-of-the-art method, which leverages auxiliary training data.

The rest of the paper is organized as follows. A brief description of related work is provided in Section 2. The proposed approach is discussed in Section 3, and the experimental evaluation is described in Section 4. The paper ends with a brief discussion and conclusion.

Figure 2: Illustration of the proposed network. The network accepts feature vectors $x_i$ (of the bird category) and $x_j$ (of the chimpanzee category) to create hybrid data $\hat{x}$ in the feature space. All three vectors $(x_i, x_j, \hat{x})$ are first passed through the first fully connected layer of the network, which transforms each into a lower-dimensional vector; the three outputs are then concatenated and passed through the rest of the network. The final activation is kept sigmoid so that the network can learn the mixture ratio.

2 Related Work

The foundation of this work rests on two threads of machine learning research, namely novelty detection algorithms and mixup-based learning techniques.



Novelty Detection: This problem is an active area of research for detecting abnormalities in data. There have been statistical [Stevens(1984)], [Yamanishi et al.(2004)Yamanishi, Takeuchi, Williams, and Milne], [Kim and Scott(2012)], [Eskin(2000)], distance-based [Knorr et al.(2000)Knorr, Ng, and Tucakov], [Hautamaki et al.(2004)Hautamaki, Karkkainen, and Franti], [Eskin et al.(2002)Eskin, Arnold, Prerau, Portnoy, and Stolfo] and deep learning based approaches. Statistical methods generally focus on fitting the distribution of the known data using probability models. Early works on open-set recognition were mostly based on statistical methods; one of the early works in this direction uses a 1-vs-Set Machine to determine the representative space between novel and seen classes. Subsequently, [Jain et al.(2014)Jain, Scheirer, and Boult], [Scheirer et al.(2014)Scheirer, Jain, and Boult] were proposed to enhance performance. Distance-based algorithms generally perform some transform and then identify novel classes by thresholding the distance to known examples, the assumption being that known-class examples will be much closer to the known-class representatives than unknown ones in the transformed space. A relatively recent work in this direction is the Kernel Null Foley-Sammon Transform (KNFST) [Bodesheim et al.(2013)Bodesheim, Freytag, Rodner, Kemmler, and Denzler] for multi-class novelty detection. Here, points of the same class are projected onto a single point in the null space, and during testing, the distance to the class representative is thresholded to obtain a novelty score. This algorithm was improved in [Liu et al.(2017)Liu, Lian, Wang, and Xiao] to handle incrementally arriving classes and update the novelty detector accordingly; in addition, [Liu et al.(2017)Liu, Lian, Wang, and Xiao] made the approach more scalable and reduced the computational burden of the method proposed in [Bodesheim et al.(2013)Bodesheim, Freytag, Rodner, Kemmler, and Denzler]. Among deep learning based approaches, OpenMax fits a Weibull distribution to determine novelty [Bendale and Boult(2016)]; a generative version of this approach was proposed in [Ge et al.(2017)Ge, Demyanov, Chen, and Garnavi], where unknown samples are generated. Several one-class deep learning based novelty detection methods have been proposed in recent literature [Sabokrou et al.(2018)Sabokrou, Khalooei, Fathy, and Adeli], [Perera et al.(2019)Perera, Nallapati, and Xiang], [Sadooghi and Khadem(2018)]. The work in [Perera and Patel(2019)] designs a novel training paradigm where a reference set is used to learn a set of negative filters that are not activated for the known-category data; to this end, they design a novel loss function called the membership loss. Masana et al. [Masana et al.(2018)Masana, Ruiz, Serrat, van de Weijer, and Lopez] propose a method to improve the feature space by forming discriminative features with a contrastive loss for this task.

Mixing: Learning algorithms involving interpolation or mixup between classes have recently been introduced in the community. The first such works in vision improve classification by interpolating between classes [Zhang et al.(2017)Zhang, Cisse, Dauphin, and Lopez-Paz], [Tokozume et al.(2018)Tokozume, Ushiku, and Harada]. While those works interpolate in the input space, [Berthelot et al.(2018)Berthelot, Raffel, Roy, and Goodfellow], [Dumoulin et al.(2016)Dumoulin, Belghazi, Poole, Mastropietro, Lamb, Arjovsky, and Courville], [Mathieu et al.(2016)Mathieu, Zhao, Zhao, Ramesh, Sprechmann, and LeCun], [Ha and Eck(2017)], [Bowman et al.(2015)Bowman, Vilnis, Vinyals, Dai, Jozefowicz, and Bengio], [Mescheder et al.(2017)Mescheder, Nowozin, and Geiger] interpolate in the latent space of autoencoders. In our work, unlike these papers, we interpolate in the feature space to train our model.

3 Proposed Method

In this section, we describe the network architecture of the proposed Segregation Network, the novel loss function used to train the model and the training and testing protocol. First, we describe the notations used.
Notations: Let the input data be represented as $X = \{x_1, \dots, x_N\}$, $x_i \in \mathbb{R}^d$, $N$ being the number of training samples and $d$ the feature dimension. Let the labels be denoted as $y_i \in \{1, \dots, C\}$, where $C$ is the number of training or known classes. We define the known class set to be $\mathcal{K} = \{1, \dots, C\}$, and thus $|\mathcal{K}| = C$. In the open-set scenario, the testing data can come from the seen classes or from unseen/novel classes, for which no information is available to the system. During testing, given a query, the goal is to determine whether it comes from the set $\mathcal{K}$ or not, i.e. whether it belongs to a seen class or a novel class. Classifying known examples into their correct classes is not the focus of this work and can be done using the base classifier trained on the training data. Now, we describe the details of the Segregation Network.
Features: Any pre-trained standard deep learning model can be used to extract features. Here, we use the pre-trained Alexnet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] and VGG-16 [Simonyan and Zisserman(2014)] architectures. These networks are fine-tuned, and the extracted features are normalized and given as input to our network.
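As a concrete illustration, a minimal PyTorch sketch of this feature-extraction step is given below; taking the penultimate fc layer as the feature and the exact preprocessing are our assumptions, since the text only specifies that the networks are fine-tuned and the features normalized.

import torch
import torch.nn.functional as F
import torchvision.models as models

vgg = models.vgg16(weights="IMAGENET1K_V1")      # assumed: ImageNet weights, fine-tuned separately
vgg.eval()
# Keep everything up to the penultimate fc layer as a 4096-d feature extractor.
feature_extractor = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(),
    *list(vgg.classifier.children())[:-1])

def extract_features(images):                    # images: (B, 3, 224, 224), preprocessed
    with torch.no_grad():
        feats = feature_extractor(images)
    return F.normalize(feats, p=2, dim=1)        # L2-normalize before feeding our network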

3.1 Segregation Network

The proposed network consists of three fully connected (fc) layers with ReLU activations and dropout between each layer except the final fc layer. The final layer is of dimension $C$, the number of known classes. Sigmoid is used as the final-layer activation since its output lies between $0$ and $1$, which in our case can be interpreted as the proportions of the mixture. For training this network, the Adam optimizer with a learning rate of 0.001 is used.

The network takes as input a triplet of data samples $(x_i, x_j, \hat{x})$, where $x_i$, $x_j$ are data from the training set and $\hat{x}$ is the mixture obtained by mixing $x_i$ and $x_j$ in some proportion. Let us denote the output of the first fc layer, which is shared by all three inputs, as $f(\cdot)$. Then $[f(x_i), f(x_j), f(\hat{x})]$ is concatenated together and passed forward through the rest of the network. In all implementations, since we use pretrained features from the Alexnet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] or VGG-16 [Simonyan and Zisserman(2014)] deep networks, which are of very high dimension, the first fc layer serves as a dimensionality-reduction layer. The final output of the network, after passing through the sigmoid activation function, is denoted as $p \in (0, 1)^C$.
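A minimal PyTorch sketch of this architecture follows; the hidden widths (512 and 256) and the dropout rate are illustrative assumptions, since only the overall structure (a shared dimensionality-reducing first fc layer, concatenation, and a sigmoid output of dimension $C$) is fixed by the description above.

import torch
import torch.nn as nn

class SegregationNetwork(nn.Module):
    # Sketch of the proposed network; widths 512/256 and dropout 0.5 are assumed.
    def __init__(self, feat_dim, num_classes, reduced_dim=512, hidden_dim=256):
        super().__init__()
        self.reduce = nn.Linear(feat_dim, reduced_dim)    # shared by x_i, x_j and x_hat
        self.rest = nn.Sequential(
            nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(3 * reduced_dim, hidden_dim),
            nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_classes))           # final fc layer of dimension C

    def forward(self, x_i, x_j, x_hat):
        h = torch.cat([self.reduce(x_i), self.reduce(x_j), self.reduce(x_hat)], dim=1)
        return torch.sigmoid(self.rest(h))                # each element lies in (0, 1)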
Training the model: The network is trained such that, given an interpolated input, it decouples/segregates the input data into its constituent known classes. This property is exploited in the following way. Given a pair of feature vectors $(x_i, x_j)$, we perform a convex combination on this pair to produce $\hat{x} = \lambda x_i + (1 - \lambda) x_j$, where $\lambda \in [0, 1]$. We feed these three feature vectors to our network. The output of the network is a $C$-dimensional vector $p$ from the final sigmoid layer. Since the output is passed through the sigmoid activation function, each element of the $C$-dimensional vector is bounded between $0$ and $1$. In addition, each element denotes the proportion by which the mixed sample has been constructed from that training class. For example, if $y_i = a$ and $y_j = b$, an output with $p_a = \lambda$, $p_b = 1 - \lambda$ and $p_k = 0$ for all other $k$ indicates that the mixed sample has been constructed as $\hat{x} = \lambda x_i + (1 - \lambda) x_j$. Given $\hat{x}$, the following cases may arise:

  • If $y_i = a$ and $y_j = b$, where $a \neq b$, and both $a$, $b$ belong to seen classes, the model should output $p_a = \lambda$ and $p_b = 1 - \lambda$, while $p_k = 0$ for $k \notin \{a, b\}$. We consider such a pair to be a non-matched pair, as the interpolated point lies somewhere between the two classes based on the value of $\lambda$.

  • If $y_i = y_j = a$, the network should output $p_a = 1$, or as high as possible, whereas $p_k = 0$ for $k \neq a$. This is because a mixed element constructed from two data items of the same class must ideally belong to that same class. We consider such a pair to be a matched pair.

  • During testing in the open-set scenario, we pair the query sample with different training examples, so a third case may arise if the query belongs to a novel class. Here, since one of the two inputs to the network is seen, only the output node corresponding to that class should be non-zero and equal to the proportion of this class in the generated mixture. We do not explicitly train the network for this scenario, since we do not assume any auxiliary datasets; we consider only the first two cases for training.

Note that since the final activation function is sigmoid and not softmax, the elements of $p$ may not sum to $1$. This is important: if the input belongs to a novel class, our network will only account for the mixing ratios of the known classes, so the proportion of the unknown class in the mixture is ignored and the sum will not equal 1.
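The two training cases above reduce to a single target construction; a small sketch of building the target mixing-coefficient vector $t$ (the implementation details are ours) follows.

import torch

def mixing_target(y_i, y_j, lam, num_classes):
    # Target t: lam at class y_i, (1 - lam) at class y_j, zero elsewhere.
    # For a matched pair (y_i == y_j) the two contributions sum to 1 at that class.
    t = torch.zeros(num_classes)
    t[y_i] += lam
    t[y_j] += 1.0 - lam
    return t

# Example: classes a=2, b=5 mixed with lam=0.7 in a 10-class problem gives
# t[2] = 0.7, t[5] = 0.3 and all other entries 0 (a non-matched pair).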

1: Input: $X$ is the input data with the provided labels $y$.
2: Output: Trained Segregation Network model to detect novel samples.
3: Initialize: Initialize the network parameters of the Segregation Network. Extract the fine-tuned features for $X$ using Alexnet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] or Vggnet [Simonyan and Zisserman(2014)].
4: Train the network by repeating the following steps:
5:   Randomly generate the mixing coefficient value $\lambda \in [0, 1]$.
6:   Randomly take pairs of training data $(x_i, x_j)$ to construct $\hat{x} = \lambda x_i + (1 - \lambda) x_j$ using the mixing coefficient $\lambda$.
7:   Feed-forward the triplet $(x_i, x_j, \hat{x})$ through the network.
8:   Compute the constituency loss and back-propagate it to train the network.
Algorithm 1: Algorithm for training the Segregation Network
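A sketch of one training iteration following Algorithm 1 is shown below; the sampling scheme and batch size of one are simplifications of ours, and constituency_loss refers to the sketch given after Eq. (1) below.

import torch

# model: SegregationNetwork as sketched above; feats: (N, d) tensor of fine-tuned
# features; labels: (N,) tensor of integer class labels.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)   # Adam, lr 0.001 (Sec. 3.1)

def train_step(feats, labels, num_classes):
    i = torch.randint(len(feats), (1,)).item()               # step 6: pick a random pair
    j = torch.randint(len(feats), (1,)).item()
    lam = torch.rand(1).item()                               # step 5: mixing coefficient
    x_hat = lam * feats[i] + (1 - lam) * feats[j]            # convex combination
    p = model(feats[i].unsqueeze(0), feats[j].unsqueeze(0), x_hat.unsqueeze(0))
    t = mixing_target(labels[i].item(), labels[j].item(), lam, num_classes)
    loss = constituency_loss(p, t.unsqueeze(0))              # steps 7-8
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()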

What is important is that the value of $p$ peaks at the right places, signifying the classes from which the mixed data has been generated. Since we do not have a softmax output, we cannot use the cross-entropy loss function to train this model. In addition, the cross-entropy loss tries to maximize the probability of a single correct class, while in our case we may need to find two constituent classes. We therefore design our own novel loss function, termed the Constituency loss, which we describe next.
Constituency loss: This loss ensures that the output of the Segregation Network, $p$, takes positive values only for those classes which have been mixed to create $\hat{x}$. Thus, the network is expected to output not only the correct mixing proportions for the mixing classes but also zero for the non-mixing classes. Based on this requirement, the loss function can be written over the mixing ("m") and non-mixing ("nm") classes as follows:

$\mathcal{L} = w \sum_{k \in m} (p_k - t_k)^2 + \sum_{k \in nm} p_k^2$    (1)

where $t$ denotes the mixing coefficient vector, which has zeros for the non-mixing classes and the values $\lambda$ and $1 - \lambda$ in the relevant places for the mixing classes. The "nm" term enforces sparse outputs for the non-mixing classes, while the "m" term covers the classes used in forming $\hat{x}$. It is to be noted that the weight $w$ plays a significant role in training the model, as shown in the ablation studies. This factor is important since the number of zero elements is much larger than the number of non-zero (mixing) coefficients. Hence, during training, we penalize errors in wrongly predicting the mixing coefficients as zero much more severely than incorrect predictions of the zero elements. In the implementation, we found the best value of $w$ to lie in the same range in all our experiments.
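Read this way, Eq. (1) admits a direct implementation; a sketch follows, in which the squared-error form and the default value of $w$ are our assumptions based on the description above.

import torch

def constituency_loss(p, t, w=10.0):        # w: weight on the mixing classes (value assumed)
    # p: predicted mixture vector, shape (B, C); t: target mixing-coefficient vector.
    mixing = t > 0                           # "m": the classes used in forming x_hat
    loss_m = ((p - t)[mixing] ** 2).sum()    # errors on the predicted mixing proportions
    loss_nm = (p[~mixing] ** 2).sum()        # push the non-mixing outputs toward zero
    return w * loss_m + loss_nm              # w > 1 penalizes the mixing term more heavily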

3.2 Testing Scenario

We assume that a base network has been trained on the training classes with a softmax output; here, it is the AlexNet or VGGNet from which the features are extracted. In the open-set testing scenario, the test query can come from one of the seen classes or from a novel class. Given a test query, we consider the top-N classes which receive the highest output scores, i.e. the classes to which the query is most likely to belong. The goal of the Segregation Network is to consider each of these top classes and verify whether the query actually belongs to that class. Taking the top few classes is intuitive since (1) if the query belongs to a seen class, its score is usually high, and (2) it reduces the computation required for novelty detection using the proposed Segregation Network. If the query is from a novel class, all retrieved classes are necessarily wrong.

Here we use the training class centres $\mu_k$, $k \in \mathcal{K}$, as the prototype exemplars. For each query $q$, a set of interpolated points is generated as $\hat{x}_k = \lambda q + (1 - \lambda) \mu_k$, which is then passed through the proposed network. The mixing coefficient for the prototype exemplars, $1 - \lambda$, is kept low while feeding to our model; in other words, the mixing coefficient $\lambda$ is kept high for the incoming test data. This is for the following reasons:

  • If the query data is from the domain of known classes, the high weight $\lambda$ on the known-class query would produce a high output for the corresponding class.

  • If the query data comes from an unknown class, the low weight on the prototype exemplars forces the network output to be low for all the classes.

Thus, for each query, the average of the highest network outputs is taken as the probability of the query being known.
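Putting the pieces together, a sketch of the test-time scoring follows; the value of $\lambda$ and the number of top outputs averaged are assumptions of ours, while the use of class centres as prototypes and the high weight on the query follow the text above.

import torch

def novelty_score(model, query, prototypes, topn_classes, lam=0.9, top_avg=3):
    # query: (1, d) feature; prototypes: (C, d) class centres mu_k; topn_classes:
    # class indices ranked by the base network's softmax scores.
    outputs = []
    for k in topn_classes:
        proto = prototypes[k].unsqueeze(0)
        x_hat = lam * query + (1 - lam) * proto   # query dominates the mixture
        p = model(query, proto, x_hat)
        outputs.append(p[0, k].item())            # response at the prototype's class
    top = sorted(outputs, reverse=True)[:top_avg]
    return sum(top) / len(top)                    # high for known, low for novel queries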

4 Experiments

In this section, we evaluate our method, Mixing Novelty Detection (MND), against several state-of-the-art approaches. We also describe the datasets we tested on and the testing protocol that was followed, and then provide an analysis of our algorithm.

4.1 Datasets Used

Here, we report results on two benchmark datasets, namely Caltech-256 and Stanford Dogs.
Caltech-256 Dataset: [Griffin et al.(2007)Griffin, Holub, and Perona] This is a standard dataset for visual recognition consisting of 256 object categories of diverse types. It contains 30,607 images, with a minimum of 81 and a maximum of 827 images per class. As per our protocol, we take the first 128 classes as known; the rest are considered unknown.
Stanford Dogs Dataset: [Khosla et al.(2011)Khosla, Jayadevaprakash, Yao, and Li] This is a fine-grained dataset consisting of 120 classes of different breeds of dogs, with a total of 20,580 images. We consider the first 60 classes, sorted alphabetically, as known. The final testing is performed on the remaining 60 classes.

4.2 State-of-the-art Baseline Methods

We evaluate our method against the following baseline algorithms: (1) Finetune [Simonyan and Zisserman(2014)]: the fine-tuned network output is thresholded to determine whether a query is from a known or a novel class; (2) One-class SVM [Schölkopf et al.(2001)Schölkopf, Platt, Shawe-Taylor, Smola, and Williamson]: all known classes are considered during SVM training, and during testing the maximum SVM score is used; (3) KNFST [Bodesheim et al.(2013)Bodesheim, Freytag, Rodner, Kemmler, and Denzler]: deep features are extracted and normalized, and the KNFST algorithm is applied to those features to detect novel classes; (4) Local KNFST [Bodesheim et al.(2015)Bodesheim, Freytag, Rodner, and Denzler]: deep features are extracted and the algorithm is evaluated with 600 local regions; (5) Openmax [Bendale and Boult(2016)]: the feature embedding of the penultimate layer of a trained network is taken, and mean activation vectors are determined to fit a Weibull distribution; (6) K-extremes [Bodesheim et al.(2015)Bodesheim, Freytag, Rodner, and Denzler]: VGG16 features are extracted and the top 0.1 activation indices are used to get the extreme value signatures; (7) Finetune (c+C) [Perera and Patel(2019)]: the network is trained on additional classes coming from a reference dataset; and (8) the state-of-the-art algorithm proposed in [Perera and Patel(2019)], where an external reference dataset is used to learn negative filters that do not get activated for data from the known categories, using a novel membership loss. This not only requires an extra auxiliary dataset but is also computationally expensive, requiring a separate network to be trained. A further concern is what kind of reference dataset to choose for learning the negative filters. Our approach compares favorably with the state-of-the-art algorithm without knowledge of any reference data or training on any extra data, which not only reduces computational cost but also avoids the collection of an extra reference dataset.

4.3 Testing Protocol

In our testing protocol, half of the classes are taken to be known; the rest are considered unknown. The training and test splits of the known classes are equally divided, while the unknown classes are considered only during testing. We use the area under the receiver operating characteristic (ROC) curve (AUC) as the evaluation criterion.
We select AUC because it is threshold-independent and hence less susceptible to parameter variation. Also, there is often an imbalance between the number of known-class instances and the number of unknown-class instances, and AUC is well suited to handling data imbalance. Imbalance during testing arises from the fact that the number of unknown-class data could be potentially infinite compared to the fixed set of classes used to train the network.
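For reference, the evaluation itself is a one-liner with scikit-learn; the label convention below (1 for known, 0 for novel) is our assumption.

from sklearn.metrics import roc_auc_score

# scores: one novelty score per test query (higher = more likely a known class);
# is_known: ground-truth flags, 1 for seen-class queries, 0 for novel-class queries.
auc = roc_auc_score(is_known, scores)   # threshold-independent, robust to imbalance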

4.4 Results

The evaluation of our algorithm is based on features extracted from Alexnet and VGG16, which are also used in [Perera and Patel(2019)]. The compared baseline methods are evaluated on these same features, as reported in [Perera and Patel(2019)]. As seen in Table 1, our method exceeds the state-of-the-art baseline in most cases. Our method on VGG16 features convincingly outperforms the method in [Perera and Patel(2019)]: the margin is a staggering 7.9 points on the Stanford Dogs dataset, and 1.3 points on Caltech-256. For Alexnet features, Caltech-256 shows a dip in performance, but on Stanford Dogs our method again outperforms all other baselines. We would like to highlight again that our algorithm produces these results without knowledge of any external or auxiliary datasets, which also makes it much more computationally efficient than Deep Transfer Novelty detection [Perera and Patel(2019)].

Method Stanford Dogs Caltech-256
VGG16 AlexNet VGG16 AlexNet
FineTune[Simonyan and Zisserman(2014)] 0.766 0.702 0.827 0.785
One-Class SVM [Schölkopf et al.(2001)Schölkopf, Platt, Shawe-Taylor, Smola, and Williamson] 0.542 0.520 0.576 0.561
KNFST pre [Bodesheim et al.(2013)Bodesheim, Freytag, Rodner, Kemmler, and Denzler] 0.649 0.619 0.727 0.672
KNFST [Liu et al.(2017)Liu, Lian, Wang, and Xiao],[Bodesheim et al.(2013)Bodesheim, Freytag, Rodner, Kemmler, and Denzler] 0.633 0.602 0.743 0.688
Local KNFST pre [Bodesheim et al.(2015)Bodesheim, Freytag, Rodner, and Denzler] 0.652 0.589 0.657 0.600
Local KNFST [Bodesheim et al.(2015)Bodesheim, Freytag, Rodner, and Denzler] 0.626 0.600 0.712 0.628
K-extremes [Bodesheim et al.(2015)Bodesheim, Freytag, Rodner, and Denzler] 0.610 0.592 0.546 0.521
OpenMax [Bendale and Boult(2016)] 0.776 0.711 0.831 0.787
Finetune(c+C) [Perera and Patel(2019)] 0.780 0.692 0.848 0.788
Deep Transfer Novelty [Perera and Patel(2019)] 0.825 0.748 0.869 0.807
MND (ours) 0.904 0.762 0.882 0.751
Table 1: Comparison of our method (MND) to the baseline methods. Our algorithm (MND) gives convincing results compared to the state-of-the-art method, Deep Transfer Novelty, without the use of any extra dataset.

4.5 Analysis and Observation

The following points are noted and observed in our experiments:

Effect of the number of class prototypes: The results for our algorithm when the query is compared with the full set of class prototypes from the training data are provided in Table 1. Here we investigate the effect of taking only the top-N class representatives for comparison (as ranked by the softmax values from the base network, namely Alexnet or VGG16) and provide the results in Table 2. We observe that as the value of N is increased from 5 to 60, the performance increases monotonically and then starts saturating. This tells us that we need not compare the query with all the class prototypes in the training set, which makes our algorithm considerably faster.

Figure 3: Class-wise novelty detection scores for the (a) Stanford Dogs and (b) Caltech-256 dataset.

Analysis of class-wise novelty detection scores: Here, we plot the novelty detection score (the lower the score, the more novel) for images of the seen and unseen categories for the (a) Stanford Dogs and (b) Caltech-256 datasets. The first 60 and 128 categories are the training classes in Stanford Dogs and Caltech-256, respectively. We can draw two conclusions: (1) the separation between the novelty detection scores for the seen and novel categories is greater for the Stanford Dogs dataset than for Caltech-256, which is also reflected in the overall performance of our algorithm in Table 1; (2) the curve for the VGG16 model has higher peaks and lower troughs, indicating that it gives a better margin for error while detecting novelty, which is reflected in the better performance of the VGG16 model over the Alexnet model in Table 1.

Classification task: Our model can be used for the classification task as well. Classification is performed on the known-class test data split. The accuracy, shown in Table 3, is broadly similar to the softmax accuracy obtained when testing on the fine-tuned base network.

No. of Class Prototypes (N)
5 10 20 40 60
Dogs-Alexnet 0.775 0.779 0.779 0.780 0.781
Dogs- VGG16 0.866 0.884 0.894 0.901 0.904
Caltech- Alexnet 0.699 0.716 0.726 0.732 0.734
Caltech- VGG16 0.837 0.853 0.862 0.867 0.870
Table 2: Variation of AUC as the number of training-class prototypes compared with the query is changed.
Dogs-Alexnet Dogs- VGG16 Caltech- Alexnet Caltech- VGG16
Accuracy (in %) 65 86.5 70 87
Table 3: Classification accuracy of our method on the studied datasets with different base networks.

5 Conclusions

In this work, we propose a new method for multi-class novelty detection using the mixup technique. To train our network, we define a novel constituency loss that achieves the desired objective. Our novelty detection algorithm compares favorably with the state-of-the-art without the need for any auxiliary dataset. Further analysis shows that our method can be made much more efficient by leveraging the softmax confidence outputs of the pre-trained network, while giving comparable results.

References