DeepStreamCE: A Streaming Approach to Concept Evolution Detection in Deep Neural Networks

04/08/2020
by Lorraine Chambers, et al.
Birmingham City University

Deep neural networks have experimentally demonstrated superior performance over other machine learning approaches in decision-making predictions. However, one major concern is the closed-set nature of the classification decision on the trained classes, which can have serious consequences in safety-critical systems. When the deep neural network operates in a streaming environment, fast interpretation of its classifications is required to determine whether the results can be trusted. Untrusted classifications can occur when the input data to the deep neural network changes over time. One type of change is concept evolution, where a new class is introduced that the deep neural network was not trained on. In the majority of deep neural network architectures, the only option is to assign such an instance to one of the trained classes, which would be incorrect. The aim of this research is to detect the arrival of a new class in the stream. Existing work on interpreting deep neural networks often focuses on neuron activations to provide visual interpretation and feature extraction. Our novel approach, coined DeepStreamCE, uses streaming approaches for real-time concept evolution detection in deep neural networks. DeepStreamCE applies neuron activation reduction using an autoencoder and MCOD stream-based clustering in the offline phase. Both outputs are used in the online phase to analyse the neuron activations in the evolving stream and detect concept evolution in real time. We evaluate DeepStreamCE by training VGG16 convolutional neural networks on combinations of data from the CIFAR-10 dataset, holding out some classes to be used as concept evolution. For comparison, we apply the data and VGG16 networks to an open-set deep network solution – OpenMax. DeepStreamCE outperforms OpenMax when identifying concept evolution for our datasets.


1 Introduction

The ability of deep neural networks to classify data based on a sufficiently representative training set is established experimentally. However, when unseen instances deviate from the training set distribution, they may be incorrectly classified. This is problematic in safety-critical systems such as autonomous vehicles, flight control, medical image classification or medical sensor analysis. In such systems, data arrives in real time and any potentially incorrect classifications must be flagged as quickly as possible. There are many reasons that unseen instances can vary from the training data, ranging from valid data changes over time to deliberate adversarial attacks. The data discrepancy that this research focuses on is concept evolution, where new valid classes appear over time in the data stream.

Detecting concept evolution in data streams is not a new topic. Various approaches have been discussed in the literature to detect evolution in the stream Masud et al. (2011); Haque et al. (2016). These approaches focus mostly on inspecting changes in the input data distribution using methods such as statistics Song et al. (2007) or PCA Kuncheva and Faithfull (2012), or on applying the data to one or more classifiers and looking for a change in the confidence of those classifiers. Since deep neural networks frequently have high-dimensional input data, detecting distributional change in such a multi-dimensional space is challenging Harel et al. (2014); hence, the latter option of using classifier outputs to detect change is more viable. However, classifier outputs do not leverage the feature-rich activation data available within deep neural networks. Therefore, in our research, we detect concept evolution specifically in data presented to deep neural networks, which gives us the opportunity to change the input data space by utilising activations from the hidden layers within the deep neural network instead of the input data. This research utilises a deep neural network that classifies images, known as a Convolutional Neural Network (CNN). The network is given an image and calculates transformations on it until it produces a classification label. Inside the network, there are hidden layers containing neurons, which are functions with weights and biases whose values are calculated during training of the network. When the trained network processes an image, values are produced in the hidden layers of the network. These values are called activations. The activations can be accessed and provide different information about the image. As each hidden layer of the network has learned to detect different features of the image, when we look at the activations we are looking at the feature space of the image instead of the pixel space. This is beneficial as it makes the analysis independent of the type of input data and increases the space between closed-set and open-set instances Bendale and Boult (2016).
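To make this concrete, hidden-layer activations can be read directly from a trained network. The following sketch assumes a TensorFlow/Keras setup; the layer name and the random stand-in image are illustrative, not taken from this paper's experiments:

    import numpy as np
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.models import Model

    # Build a probe model that returns the activations of one hidden layer.
    base = VGG16(weights="imagenet")
    probe = Model(inputs=base.input,
                  outputs=base.get_layer("block3_conv3").output)
    image = np.random.rand(1, 224, 224, 3)   # stand-in for a real input image
    activations = probe.predict(image)       # feature-space view of the image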

Analysis of the internal neuron activations of deep neural networks is a popular area of study in the fields of deep neural network inspection, visualisation and explainable AI Buhrmester et al. (2019); Adadi and Berrada (2018); Carter et al. (2019); Kahng et al. (2018), and this body of work shows that the activations of a deep neural network can help determine how a network arrives at its classification decision. Many deep neural networks operate within a streaming environment; however, to our knowledge, there have been no studies analysing the activation data with streaming analysis techniques.

Detecting and analysing neural network activations is challenging in a streaming environment due to the sheer number of activations produced, even for a low-resolution image and a small deep neural network. Identification of the most important neurons based on their activation data is therefore required. We discuss an overview of existing techniques for detecting new classes in deep neural networks, existing techniques for extracting important neuron activation data, and concept evolution detection in a streaming environment. We propose utilising multi-layer activations from the deep neural network, reducing the activations via an autoencoder, and applying Micro-cluster-based Continuous Outlier Detection (MCOD) Kontaki et al. (2016), a stream-based clustering method, to analyse the activations and determine whether concept evolution has occurred.

Our contributions can be stated as follows.

  • We detect, using streaming techniques, the difference between the activations of unseen concept-evolution instances and those of the training data;

  • We use fast interpretation of deep neural network activations to detect concept evolution using streaming techniques;

  • We investigate how changing the streaming analysis parameters affects the concept evolution detection; and

  • We compare our technique to a leading deep neural network open-set classification solution – OpenMax Bendale and Boult (2016).

This paper is organised as follows: in Section 2 we discuss the related work, then present a system description including formalisation and implementation details of the DeepStreamCE components and methodology in Section 3. In the experimental study in Section 4, we evaluate and analyse DeepStreamCE on sub-datasets from the CIFAR-10 dataset and experiment with varying the input parameters of the streaming clustering algorithm; the same data and deep neural networks are applied to an open-set deep network solution and the results are compared. Section 5 presents the experimental results, and Section 6 summarises our findings with a conclusion and future work suggestions.

2 Background and Related Work

Within the field of deep neural networks, there are two areas of research that focus on identifying new classes that the deep neural network has not been trained on: (1) open-set classification and (2) out-of-distribution detection. Open-set classification means that the deep neural network has the ability to reject unseen instances as unknown, rather than having to choose a classification from the known classes it was trained on Bendale and Boult (2016). Open-set classification deals with inputs ranging from rubbish/fooling images – images that are plainly rubbish to the human eye, such as computer-generated patterns – to adversarial images that have been deliberately but only slightly modified. These represent two opposite ends of a scale: rubbish/fooling images are far from the feature space of the images the network was trained on and are easier to capture, whereas adversarial images that manipulate only a few pixels are close to the feature space of the original training instances. Common methods for detecting images that differ from the training data are thresholding softmax scores, uncertainty estimation and extra training using negative samples, as summarised by Dhamija et al. (2018). Other methods include Extreme Value Theory (EVT), a branch of statistics dealing with extreme deviations from the median of probability distributions Geng et al. (2019), and re-training the neural network using a different error loss function Hassen and Chan (2018). Bendale proposes a system called OpenMax, in which activation patterns in the penultimate activation layer are utilised and another probability calculation layer is added whose outputs are compared with those of the original softmax layer of the network. The author also suggests that there could be a layer in the deep neural network where the activations of unknown images are far from the training samples, making them outliers in an open-set recognition problem Bendale and Boult (2016).

The solutions used in out-of-distribution detection involve perturbing the images and using thresholds and temperature scaling on the softmax layer, or training on the data. Temperature scaling and adding small perturbations to images are used in Liang et al. (2018), and in DeVries and Taylor (2018) the penultimate layer output is used to calculate a confidence estimate for each input to the deep neural network. In both the open-set classification and out-of-distribution fields, a common theme is using the activations of the layer preceding the softmax layer. As the deep neural network activations represent the features of the image, using activations from more layers could provide more information than the penultimate layer activations or the pixel data alone. Another field that uses activations is Deep Neural Network (DNN) inspection, which is reviewed in Section 2.1.
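As an illustration of the simplest of these methods, a softmax-thresholding baseline can be sketched in a few lines (the threshold value here is arbitrary, not taken from any of the cited works):

    import numpy as np

    def predict_with_reject(softmax_probs, threshold=0.9):
        """Reject as 'unknown' when the maximum softmax probability
        falls below a threshold; otherwise return the argmax class."""
        if softmax_probs.max() < threshold:
            return "unknown"
        return int(np.argmax(softmax_probs))

    print(predict_with_reject(np.array([0.55, 0.45])))  # -> 'unknown'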

2.1 Neuron Activations

Activations have been widely used in the Visual Interpretation of deep neural networks to determine what neurons are related to what image features to explain how the neural network is arriving at its classification. How the network arrives at its classification is out of the scope of this research as we are only interested in identifying the important neurons in an image’s classification. Activations have also been used in the field of adversarial attacks on deep neural networks. We draw inspiration from these fields with respect to the detection of important neurons.

In the DNN inspection field, there have been many approaches to identifying the most important activations, and these have been recently surveyed, showing that this is an important area forming part of explainable AI Adadi and Berrada (2018). In Buhrmester et al. (2019), neuron activations are shown to be used by explainers for deep neural networks in different fields, such as image- and text-based DNNs. Approaches to using neuron activations can be summarised as: taking the top percentage of activations in each layer, activation magnitude, average activations and clustering, nearest neighbour, and backpropagation. The first approach, taking the top percentage of activations, is used in Hohman et al. (2019), which uses the activations of channels in a CNN to determine edges, shapes and texture, and applies global max pooling to reduce the data. It uses the activations of the channels and is therefore appropriate to CNNs only. This method only performs a forward pass through the network to obtain the activations, so it is computationally cheap, which is required in a streaming environment. Activation magnitude and matrix factorisation are utilised by Olah et al. (2018), which uses the magnitude of the neuron activations, represents them as a cube and breaks them up using matrix factorisation to obtain more meaningful groups of neurons; however, matrix factorisation is computationally expensive as it has to be done separately for each image, so it is not suitable for a streaming environment. Average activation and clustering are used in Liu et al. (2017), where the average activation of each neuron in the activation layer is taken over all instances with the same class, then clustered, and a number of neurons from each cluster is selected. In ActiVis Kahng et al. (2018), the average activation for each neuron over all instances in a class is also used, but presented to the user for visualisation. The nearest neighbour approach was used in Papernot and McDaniel (2018), where nearest neighbour search was applied to the activation outputs of each hidden layer; a Locality Sensitive Hashing (LSH) function reduces the data dimensionality to make it suitable for the nearest neighbour representation. However, this is computationally expensive and unsuitable for a streaming environment. Backpropagation is used in Samek et al. (2017) and Qiu et al. (2019). The latter is applicable to both CNNs and fully connected networks; it describes an effective critical path of weights and neurons that leads to the final prediction and uses an activation-based backpropagation algorithm to extract this path. This requires a backward pass through the network, which is computationally expensive and not appropriate in a streaming environment.
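As a rough sketch of the first, low-computation style of reduction – global max pooling of channel activations followed by a top-percentage selection, in the spirit of Hohman et al. (2019) but not their implementation – consider:

    import numpy as np

    def channel_summary(feature_map):
        """Global max pooling over the spatial dimensions:
        one summary value per channel of a conv layer."""
        # feature_map has shape (height, width, channels)
        return feature_map.max(axis=(0, 1))

    def top_fraction_channels(summary, fraction=0.1):
        """Indices of the top fraction of channels by activation."""
        count = max(1, int(len(summary) * fraction))
        return np.argsort(summary)[-count:]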

Activation data is also used in the adversarial detection field, and although this research is concerned with detecting concept evolution, that work is worth noting. In Chen et al. (2018), the activations of the last hidden neural network layer are flattened into a 1D vector and clustered. Dimensionality reduction is performed using Independent Component Analysis (ICA) to avoid issues with clustering on very high-dimensional data – as dimensionality increases, distance metrics become less effective. They use k-means clustering, train a new model on the original data minus the data corresponding to the clusters in question, and use this model to classify the removed clusters: if a cluster contained legitimate data, the cluster will be classified as its correct label. This is computationally expensive. The method proposed in Chen et al. (2018) also uses neuron activations via backpropagation; however, backpropagation is itself computationally expensive. Hendrycks uses abnormality detection based on activations and suggests using auxiliary decoders such as autoencoders as further work Hendrycks and Gimpel (2018).

The usage of neuron activations faces the challenge of deciding which neurons are used (i.e. only particular layers or channels, or a general reduction of the activations, such as LSH, to make them suitable for post-processing). Using only the last activation layer is a common method of data reduction, as it is the most representative of the image and provides the most information. However, we are not restricted to using only the last layer, and given the recommendations on using activations with auxiliary decoders, a summary of data reduction techniques follows.

2.2 Data Reduction

Popular methods of data reduction are Independent Component Analysis (ICA), Principal Component Analysis (PCA), autoencoders and Restricted Boltzmann Machines (RBMs). The method proposed in Chen et al. (2018) uses ICA; however, Hinton describes autoencoders as superior to PCA Hinton (2006). PCA and ICA are linear transformations, whereas the mappings computed by DNNs are not linear. For this research, autoencoders will be used as they provide scope for more complex data reduction, including expansion into RBMs. Once the data is reduced, we have the opportunity to apply further detection methods to it. As the aim is to detect concept evolution, we first review concept evolution detection techniques.

2.3 Concept Evolution in Data Streams

Concept evolution is the appearance of new classes as streams evolve. New concepts need to be detected as soon as they arrive, without training on labelled data. There has been much investigation into concept evolution Gama et al. (2014); Faria et al. (2016); Khamassi et al. (2018); Masud et al. (2011); Haque et al. (2016); Abdallah et al. (2016), some of which covers concept drift, of which concept evolution is defined as one manifestation. These works also cover the whole process of handling concept evolution in data streams, including the forgetting of the evolved concept and the update of the algorithm. The aspect of interest to this research is the learning process that is used.

There are two types of methods for detecting drift – sequential methods and windowing. Sequential methods consider only one instance at a time, utilising statistical analysis to check whether there is a change between distributions of instances. One method is to measure the dissimilarity between an incoming instance and a set of data; this can be measured with distance functions Tran et al. (2016); Tsymbal (2004); Gonçalves et al. (2014) or with summary statistics of the two distributions, such as mean and variance Ross et al. (2012). In sequential approaches, each instance is processed only once and then discarded. This is suitable for detecting drift where data streams are infinite and it is not practical to store all instances, which is usually the case in real-world applications. These sequential approaches do detect abrupt drift, which is what a new class would present as. Windowing methods consider the most recent observations to be the most informative; they progressively estimate the change through a time or data window. Generally, the windowing approach considers the drift to be uniform and to affect the entire instance space, so it can handle global concept drift – this is sufficient for concept evolution. For this research, an unsupervised windowing method has been selected to identify concept evolution via outlier detection: MCOD Kontaki et al. (2011) is an established outlier detection technique which performs clustering on continuous data streams Haidar and Gaber (2019), and it outperforms other streaming clustering methods Tran et al. (2016). Section 3.4 provides a description of MCOD.
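The core inlier/outlier rule behind such distance-based detectors can be sketched as follows. This is a naive brute-force version for illustration only; MCOD itself maintains micro-clusters so that most distance computations are avoided:

    import numpy as np

    def is_outlier(instance, window, R=0.1, k=50):
        """Distance-based outlier rule: an instance is an inlier if it
        has at least k neighbours within radius R in the current window."""
        distances = np.linalg.norm(window - instance, axis=1)
        return np.sum(distances <= R) < k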

In summary, deep neural networks produce many activations that require reduction before they are suitable for use in outlier detection techniques to detect concept evolution. Section 3 details our proposed methods for activation reduction and streaming outlier detection: we use a multi-layer technique to extract activation information from the deep neural network; as this produces a large amount of data, an autoencoder is used to reduce it; the reduced data is then fed into a streaming clustering algorithm to detect outliers.

2.4 Comparison with OpenMax

This research will compare results with Bendale's OpenMax solution for detecting unknown classes Bendale and Boult (2016). OpenMax was selected for comparison as, similarly to DeepStreamCE, it utilises activation data from within the deep neural network and identifies unknown instances. It uses the penultimate and final layer activations from the network: the final layer of the network contains the softmax probabilities, and the penultimate layer is a representation of the instance's class. OpenMax estimates the probability of an instance being from an unknown class. To achieve this, it extracts the activations from the penultimate layer of the network, calculates the mean of these activations for the training instances of each class, and constructs a mean activation vector per class. For each training instance, it also calculates the distance between the instance and its class activation vector. For an unseen instance, the mean activation of the penultimate layer is calculated and the distance between the instance's mean activations and each class's mean activations is measured; a Weibull fit is then applied to the distances between the instance and the classes, and extreme value theory (EVT) is used to estimate the probability of the instance being an outlier with respect to each class. Either a known class or an 'unknown' classification is returned.
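A simplified sketch of this pipeline is given below. It is not the OpenMax implementation (which uses the libMR library and recalibrates every class score); it only illustrates the mean-activation-vector, distance and Weibull-tail steps using SciPy, with the tail size as a free parameter:

    import numpy as np
    from scipy.stats import weibull_min

    def mean_activation_vectors(acts, labels):
        """Per-class mean of penultimate-layer activations (MAVs)."""
        return {c: acts[labels == c].mean(axis=0) for c in np.unique(labels)}

    def fit_tails(acts, labels, mavs, tail=9):
        """Fit a Weibull to the largest training distances per class."""
        models = {}
        for c, mav in mavs.items():
            d = np.linalg.norm(acts[labels == c] - mav, axis=1)
            models[c] = weibull_min.fit(np.sort(d)[-tail:], floc=0)
        return models

    def unknown_probability(instance_act, mavs, models):
        """How extreme the instance's distance is w.r.t. its nearest class."""
        c = min(mavs, key=lambda cls: np.linalg.norm(instance_act - mavs[cls]))
        d = np.linalg.norm(instance_act - mavs[c])
        shape, loc, scale = models[c]
        return weibull_min.cdf(d, shape, loc=loc, scale=scale)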

3 DeepStreamCE System Description

DeepStreamCE comprises two phases: an offline phase and an online phase. Figure 1 shows the offline phase components and Figure 2 shows the online phase components. Prerequisites for the system are: (1) a trained deep neural network that is being analysed for concept evolution and (2) the data instances that the network was trained on.

Figure 1: DeepStreamCE: Offline Phase

Figure 2: DeepStreamCE: Online Phase

3.1 Offline Phase

During the offline phase, all of the training instances are presented to the deep neural network and the resulting activations of each training instance are extracted. The activation data then goes through three stages: activation layer filtering, activation reduction setup and activation analysis setup. The algorithm for the offline phase is listed in Algorithm 1, and Table 2 describes the DeepStreamCE symbols.

3.2 Activation Layer Filtering Setup

One data instance produces many activations. The number of activations per data instance depends on the neural network architecture (e.g. how many layers and what kinds of layers). For instance, if the network has activation layers or fully connected layers as opposed to convolutional layers, there will be more activation data. Therefore, the activations must be filtered in order to proceed with a manageable amount of data, with the manageable size depending on the amount of system memory available. There is also a decision to be made regarding which layers are selected from the deep neural network: it is generally considered that the later layers of a deep neural network produce more interesting data, as they are closer to the final class outcome. For the network in this experimental setup, more information regarding the layer selection is given in Section 4.2. The amount of memory required in the offline phase is important, as this is where the maximum amount of memory is needed to handle the activations extracted from the training data. The activations no longer represent the pixel space but a representation of it, so we no longer need to keep the dimensions of the data; thus we flatten the activation data into a 1D vector (flatten – line 2 of Algorithm 1), ready for use in the activation reduction setup phase, as described in Section 3.3.
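The flatten step itself is straightforward; a minimal sketch with NumPy, assuming the selected layers' activations arrive as a list of arrays:

    import numpy as np

    def flatten_activations(layer_activations):
        """Concatenate the selected layers' activation tensors into a
        single 1D vector (flatten - line 2 of Algorithm 1)."""
        return np.concatenate([a.ravel() for a in layer_activations])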

3.3 Activation Reduction Setup

The activation data is high dimensional. To make use of it in a clustering algorithm, the dimensionality must be reduced. To do this, DeepStreamCE uses an autoencoder to reduce the data to 100 dimensions. The autoencoder is trained using the data selected from the activation layers of the deep neural network (line 4 of Algorithm 1), then each training instance is processed through the autoencoder to reduce its dimensionality to 100 (reduce – line 6 of Algorithm 1). The training and creation of the autoencoder has the largest memory and computational requirement of the system. For these initial experiments, the autoencoder is an undercomplete autoencoder Rumelhart et al. (1986) with a ReLU activation function and a mean squared error loss function, which makes it roughly equivalent to PCA; this has potential for further work Hinton (2006); Wang et al. (2016). With this experimental setup, the autoencoder reduces the activations from 47,104 to 100 dimensions. The activation instances are then ready for use in the activation analysis setup.
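A minimal Keras sketch of such an undercomplete autoencoder is shown below; the layer sizes come from this setup, while the optimiser and training schedule are illustrative assumptions:

    from tensorflow.keras.layers import Dense, Input
    from tensorflow.keras.models import Model

    # Undercomplete autoencoder: 47104-dimensional flattened activations
    # compressed to a 100-dimensional code.
    inputs = Input(shape=(47104,))
    code = Dense(100, activation="relu")(inputs)        # bottleneck
    outputs = Dense(47104, activation="linear")(code)   # reconstruction
    autoencoder = Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="mse")   # mean squared error
    # autoencoder.fit(train_acts, train_acts, ...)      # train to reconstruct
    encoder = Model(inputs, code)  # reduce(): encoder.predict(flattened_acts)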

3.4 Activation Analysis Setup

MCOD, an outlier detection method built on stream clustering, is used for the activation analysis Kontaki et al. (2011). Figure 3 illustrates the concept of MCOD and Table 1 describes the symbols used. MCOD is based on a micro-clustering technique that takes three parameters: (1) the radius (R) of the micro-cluster (mc), (2) the minimum number of instances required to form a micro-cluster (k) and (3) the window size (w) – the number of instances considered by the clustering algorithm.

Symbol  Description
k       Minimum number of neighbours to form a micro-cluster
R       Radius – distance parameter for outlier detection
mc      Micro-cluster
MC      MCOD clusterer
x       Data instance
w       Window size
Table 1: MCOD Symbols

Figure 3: Example MCOD clusters for k=4

If there are at least k instances within radius R of a micro-cluster centre, an instance x becomes a member of that micro-cluster mc. If x is not within R of any micro-cluster, it is flagged as an outlier with respect to those micro-clusters. MCOD uses the centres of the micro-clusters to perform its calculations, which makes it computationally efficient.

For the DeepStreamCE implementation, one MCOD clusterer (MC) is created for each possible classification output of the trained deep neural network (createClusterer – line 9 of Algorithm 1), and the reduced activation instances are added to the MCOD clusterer that represents their classification label (addToClusterer – line 12 of Algorithm 1). When the reduced instances are added to the MCOD clusterer, micro-clusters (mc) may be formed within the clusterer (MC); however, during the offline phase we are not interested in these micro-clusters as we do not require inlier/outlier decisions. The window size w is set to the number of training instances for that class plus one, and the effect of varying k and R is investigated in this research. These parameters can only be set once, at the time the MCOD clusterer is created. The use of the micro-clusters for outlier detection in DeepStreamCE is described in Section 3.5.

The output components from the offline phase are a trained autoencoder (AE) and an MCOD clusterer (MC) for each class of the deep neural network. Both are used during the online phase, as described in Section 3.5.

Symbol  Description
n       Number of instances
i       Instance iterator
L       Number of activation layers
V       Number of activation values
v       Activation value
l       Activation layer
x       Correctly classified training instance
a       Activation values for all layers in an instance
a_f     Flattened activations for an instance
a_r     Reduced activations for an instance
ND      Non-discrepancy class
AE      Trained autoencoder
u       Unseen instance
Table 2: DeepStreamCE Symbols
Require: Pre-trained VGG16 DNN on 2 classes
Require: Activation levels in layers 9, 12, 13, 15, 16, 17, 20, 21
Require: MCOD parameters k, R, w
Require: For each correctly classified training instance x_i, let a_i = activation values for all selected layers
1: for i = 1…n do
2:     a_f,i = flatten(a_i)
3: end for
4: Train autoencoder AE
5: for i = 1…n do
6:     Reduce activations to 100 dimensions: a_r,i = reduce(AE, a_f,i)
7: end for
8: for each class c do
9:     Create stream and MCOD clusterer: MC_c = createClusterer(k, R, w)
10: end for
11: for i = 1…n do
12:     Add class instances to stream: addToClusterer(MC_c, a_r,i)
13: end for
14: return AE, MC
Algorithm 1 Offline Algorithm
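A condensed Python sketch of this offline structure is given below. DeepStreamCE uses MOA's MCOD via Py4J; here the clusterer is stood in by a simple windowed neighbour count so that the per-class organisation is visible:

    import numpy as np

    class SimpleClusterer:
        """Stand-in for an MCOD clusterer: a window of instances plus
        the (k, R) inlier rule. Not the MOA implementation."""
        def __init__(self, R=0.04, k=80):
            self.R, self.k, self.window = R, k, []

        def add(self, instance):
            self.window.append(instance)

        def is_outlier(self, instance):
            d = np.linalg.norm(np.asarray(self.window) - instance, axis=1)
            return np.sum(d <= self.R) < self.k

    # Offline phase: one clusterer per trained class, filled with the
    # autoencoder-reduced training activations.
    clusterers = {c: SimpleClusterer() for c in (0, 1)}
    # for a_r, label in zip(reduced_train_acts, train_labels):
    #     clusterers[label].add(a_r)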

3.5 Online Phase

The algorithm for the online phase is listed in Algorithm 2. During the online phase, previously unseen instances arrive at the deep neural network. For each instance, the activations are extracted (line 2 of Algorithm 2) and the deep neural network's prediction is stored (line 3 of Algorithm 2). The activation layers are filtered and flattened to a 1D vector in the same way as during the offline phase (flatten – line 6 of Algorithm 2). The flattened activations are then processed through the autoencoder produced during the offline phase (reduce – line 9 of Algorithm 2). The reduced activation instance is added to the MCOD clusterer corresponding to the deep neural network's predicted class for that instance (addToClusterer – line 12 of Algorithm 2), the inlier/outlier information for that instance is obtained from the clusterer, and a decision is made as to whether the instance is Non-Discrepancy (ND) or Concept Evolution (CE) (analyse – line 13 of Algorithm 2). During the addToClusterer phase, each unseen instance is applied to the MCOD clusterer associated with its class prediction; this clusterer otherwise contains only training data, so that no other previously unseen instances affect the inlier/outlier decision. During the analyse phase, the inlier/outlier decision is obtained from the clusterer and transformed into a non-discrepancy or concept evolution decision. MCOD defines an inlier as a data point within distance R of a micro-cluster centre and an outlier otherwise, as shown in Figure 4. For DeepStreamCE, data points within R ("INLIERS") are reported as non-discrepancy (ND); data points in the "OUTLIER" and "NO OUTLIERS REPORTED" ranges are reported as concept evolution (CE).

Figure 4: CE and ND Definition for DeepStreamCE
Require: Pre-trained VGG16 DNN on 2 classes
Require: Trained autoencoder AE
Require: Initialised MCOD clusterers MC
Require: Specified activation layers 9, 12, 13, 15, 16, 17, 20, 21
1: for each unseen instance u_j do
2:     let a_j = activation values for all selected layers
3:     Get DNN prediction: y_j = predict(u_j)
4: end for
5: for each unseen instance u_j do
6:     a_f,j = flatten(a_j)
7: end for
8: for each unseen instance u_j do
9:     Reduce activations to 100 dimensions: a_r,j = reduce(AE, a_f,j)
10: end for
11: for each unseen instance u_j do
12:     Add instance to stream: addToClusterer(MC_y_j, a_r,j)
13:     Analyse for outlier: result_j = analyse(MC_y_j, a_r,j)
14: end for
15: return result
Algorithm 2 Online Algorithm
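Continuing the sketch from Section 3.4, the online step can be expressed as one function per arriving instance. Here dnn_predict and encode stand for the trained network and autoencoder, and clusterers is the per-class dictionary built offline; all names are illustrative:

    import numpy as np

    def online_step(image, dnn_predict, encode, clusterers):
        """Process one unseen instance and report ND or CE (a sketch of
        Algorithm 2, reusing the SimpleClusterer stand-in from above)."""
        label, layer_acts = dnn_predict(image)   # prediction + activations
        a_f = np.concatenate([a.ravel() for a in layer_acts])  # flatten
        a_r = encode(a_f)                        # reduce to 100 dimensions
        clusterer = clusterers[label]            # clusterer of predicted class
        clusterer.add(a_r)
        return "CE" if clusterer.is_outlier(a_r) else "ND"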

4 Experimental Study

4.1 Data Setup

The aim of DeepStreamCE is to detect concept evolution. Therefore, we need to introduce a new class into the system beyond the classes it has been trained on. To achieve this, the deep neural network is trained on only two of the classes in the CIFAR-10 dataset, then a third CIFAR-10 class that the deep neural network has not been trained on is introduced. The CIFAR-10 dataset Krizhevsky (2009) consists of 10 different classes of colour images, with 50,000 training images and 10,000 test images in total. Table 3 lists the classes, along with a coarser classification/type.

Class Name Class Type
airplane Vehicle
automobile Vehicle
bird Animal
cat Animal
deer Animal
dog Animal
frog Animal
horse Animal
ship Vehicle
truck Vehicle
Table 3: CIFAR-10 Dataset Classes

As listed in Table 3, CIFAR-10 consists of four different types of vehicle and six different types of animal. The classes are mutually exclusive; however, some classes are more separated from each other than others. For instance, airplane and automobile are more similar to each other than either is to frog. This separation of types will be used to introduce concept evolution. The data setup specifications are written as (NDclass,NDclass-CEclass), where the ND classes are the non-discrepancy classes the neural network is trained on, and the CE class is the class introduced to simulate concept evolution. The data setup specifications are split into two groups. The first group uses two vehicle classes as training data, with concept evolution introduced by applying unseen instances of an animal class from the dataset. The second group consists of class combinations that are perceived to have less separation between the classes: airplane, ship and bird are selected for their similar backgrounds, giving overall image similarity; ship, truck and automobile are selected as they are all transport; cat, frog and deer are selected as they are all animals; and cat, deer and horse are selected as they are all four-legged animals. The combinations of classes that the deep neural network is trained on, together with the concept evolution classes, are shown in Table 4.

Data Setup Name  Trained Classes  Concept Evolution Class
(airplane,automobile-frog) airplane, automobile frog
(ship,truck-cat) ship, truck cat
(airplane,truck-deer) airplane, truck deer
(ship,truck-bird) ship, truck bird
(airplane,ship-bird) airplane, ship bird
(ship,truck-automobile) ship, truck automobile
(cat,frog-deer) cat, frog deer
(cat,deer-horse) cat, deer horse
Table 4: Concept Evolution Class Combinations
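Building one of these data setups is a simple filtering of CIFAR-10 by label. A sketch for (airplane,automobile-frog), using the standard Keras loader and CIFAR-10's documented label indices (airplane = 0, automobile = 1, frog = 6):

    import numpy as np
    from tensorflow.keras.datasets import cifar10

    (x_train, y_train), (x_test, y_test) = cifar10.load_data()

    nd_classes, ce_class = [0, 1], 6   # airplane, automobile | frog
    nd_mask = np.isin(y_train.ravel(), nd_classes)
    x_nd, y_nd = x_train[nd_mask], y_train[nd_mask]     # ND training data
    x_ce = x_test[y_test.ravel() == ce_class]           # unseen CE instances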

4.2 Deep Neural Network

The deep neural network that the system operates on is the widely used VGG16 model Simonyan and Zisserman (2015). This model was originally designed and trained on ImageNet; we have trained it on two classes at a time from the CIFAR-10 dataset. The system is not limited to VGG16 as the deep neural network; however, for our experiments we selected VGG16 because it is a smaller, efficient network, and the number of activations it produces is a manageable size for our initial experiments. Figure 5 shows the layers in the VGG16 network, which are numbered from the top starting with 0. The following eight layers are used: 9, 12, 13, 15, 16, 17, 20 and 21. Using a small network allows a good representative number of layers to be selected, facilitating future work on analysing the usefulness of the layers. These particular layers were selected because the closer a hidden layer is to the end of the network, the more feature information it contains; therefore, the final convolutional layer prior to each pooling layer was selected, rather than the pooling layers, to provide maximum information. The open classification technique we compare DeepStreamCE to is OpenMax, as described in Section 2.4; OpenMax uses the final two layers of the network – the fc2 (Dense) layer and the predictions (Dense) layer.
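For reference, the layer indexing can be reproduced by enumerating the stock Keras VGG16 (shown here with its default ImageNet input shape; the network used in this paper is retrained on CIFAR-10 classes):

    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.models import Model

    base = VGG16(weights=None)            # stock architecture, indexed from 0
    for index, layer in enumerate(base.layers):
        print(index, layer.name)

    selected = [9, 12, 13, 15, 16, 17, 20, 21]
    activation_model = Model(inputs=base.input,
                             outputs=[base.layers[i].output for i in selected])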

Figure 5: VGG16 network representation with layer names as reported by Keras

4.3 Experimental Setup

For the experimental setup, the class combinations defined in Table 4 are used. For each of these data setup specifications, a parameter investigation is conducted to examine the effect that the MCOD parameters k (the minimum number of neighbours required to form an MCOD micro-cluster) and R (the radius of the micro-cluster) have on Recall and Precision; k and R are deemed to be the most influential parameters on DeepStreamCE. The DeepStreamCE experimental setup is trained on 5000 instances per class and utilises 500 unseen instances per run, 250 of which are non-discrepancy instances and 250 of which are concept evolution instances. The unseen instances are selected randomly from the 1000 test instances per class. Any test instances that are wrongly classified by the network are removed for experimental purposes. These runs are repeated four times and averages are taken. For OpenMax, all the non-discrepancy and concept-evolution instances for that data setup are applied to the system in one run (2000 non-discrepancy instances and 1000 concept-evolution instances). Similarly, any test instances that are wrongly classified by the network (softmax layer) are removed.

Each data setup specification is applied to the OpenMax open-classification method to detect unknown classes Bendale and Boult (2016). The OpenMax experimental setup has been modified to work with the VGG16 deep neural network and our data setup specifications, as defined in Table 5. It was modified with the assistance of source code from the original paper Bendale (2020) and a wrapper implementation from Neupane (2019).

Data Setup  Layer Filtering  Activation Reduction  Activation Analysis
(airplane,automobile-frog) 8 layers autoencoder MCOD
(ship,truck-cat) 8 layers autoencoder MCOD
(airplane,truck-deer) 8 layers autoencoder MCOD
(ship,truck-bird) 8 layers autoencoder MCOD
(airplane,ship-bird) 8 layers autoencoder MCOD
(ship,truck-automobile) 8 layers autoencoder MCOD
(cat,frog-deer) 8 layers autoencoder MCOD
(cat,deer-horse) 8 layers autoencoder MCOD
(airplane,automobile-frog) penultimate layer N/A OpenMax
(ship,truck-cat) penultimate layer N/A OpenMax
(airplane,truck-deer) penultimate layer N/A OpenMax
(ship,truck-bird) penultimate layer N/A OpenMax
(airplane,ship-bird) penultimate layer N/A OpenMax
(ship,truck-automobile) penultimate layer N/A OpenMax
(cat,frog-deer) penultimate layer N/A OpenMax
(cat,deer-horse) penultimate layer N/A OpenMax
Table 5: Experimental Setup

4.4 Evaluation Metrics

The following metrics are computed: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). True positives are images that belong to the new class and were correctly identified as concept evolution. False positives are images that belong to an existing class and were incorrectly identified as concept evolution. True negatives are images that belong to an existing class and were correctly identified as such. False negatives are images that belong to the new class and were incorrectly identified as belonging to an existing class. From these, we calculate Precision, Recall and F-Measure as defined in Table 6. The OpenMax paper Bendale and Boult (2016) uses F-Measure to evaluate open-set performance because it is better suited than accuracy, which is inflated by true negatives; the F-Measure combines Precision and Recall as their harmonic mean Scheirer et al. (2013). For a given threshold on OpenMax/SoftMax probability values, they compute true positives, false positives and false negatives over the entire dataset. To compare DeepStreamCE with OpenMax, we use a modified version of OpenMax adapted to work with our data and deep neural network. The alpha and tail parameters required modification for our data; optimum values for these were found empirically to be 2 and 9, respectively. See the code availability section for a link to the source code for the implementation of this research.

Name       Description                                                      Formula
Precision  Ratio of CE instances amongst all instances declared as outliers  TP / (TP + FP)
Recall     Ratio of CE instances declared as CE amongst all CE instances     TP / (TP + FN)
F-Measure  The harmonic mean of Precision and Recall                         2 × (Precision × Recall) / (Precision + Recall)
Table 6: Performance Measures

5 Experimental Results

5.1 Parameter Investigation Results

The parameters k (the minimum number of neighbours required to form an MCOD micro-cluster) and R (the radius of the micro-cluster) were varied between 10 and 200, and between 0.01 and 0.1, respectively. Figures 6 and 7 show the effect these parameters have on Precision and Recall. Figure 6 demonstrates that the smaller the radius, the better the Recall; however, if the radius is set too small, the Precision drops, suggesting that R = 0.04 would give a balance of Precision and Recall. Figure 7 demonstrates that the higher the value of k, the better the Recall without a large drop in Precision, suggesting that k could be set to 80 with only a small drop in Precision.

(a) Variation of Precision with R

(b) Variation of Recall with R
Figure 6: Variation of Precision and Recall with R (k = 50)

(a) Variation of Precision with k

(b) Variation of Recall with k
Figure 7: Variation of Precision and Recall with k (R = 0.04)

5.2 DeepStreamCE and OpenMax Results

For the comparison of DeepStreamCE to OpenMax, the MCOD parameters k = 80 and R = 0.04 have been used, as suggested by the experimentation reported in Section 5.1. DeepStreamCE is compared with OpenMax via Precision, Recall and F-Measure. Table 7 shows the results for these measures.

Data Setup  DS Precision  DS Recall  DS F-Meas  OM Precision  OM Recall  OM F-Meas
(airplane,automobile-frog) 0.788 0.565 0.638 0.700 0.178 0.284
(ship,truck-cat) 0.615 0.414 0.436 0.750 0.096 0.170
(airplane,truck-deer) 0.661 0.505 0.518 0.085 0.079 0.081
(ship,truck-bird) 0.666 0.502 0.531 0.735 0.089 0.158
(airplane,ship-bird) 0.527 0.192 0.281 0.191 0.225 0.206
(ship,truck-automobile) 0.361 0.050 0.088 0.640 0.057 0.104
(cat,frog-deer) 0.594 0.848 0.698 0.000 0.000 0.000
(cat,deer-horse) 0.527 0.271 0.357 0.317 0.128 0.182
Table 7: Experimental Results for DeepStreamCE (DS) and OpenMax (OM)

For the first group of data, where the class combinations consist of vehicles as the trained classes and an animal as the concept evolution class, the results for DeepStreamCE indicate that the data setup (airplane,automobile-frog) obtained the best Precision and Recall, with (ship,truck-cat) obtaining the lowest results. This suggests that frog is more distinguishable from transport than cat. For OpenMax, the F-Measure is lower than DeepStreamCE on all data setup specifications, because of low Recall rates; this means that a high percentage of unknown-class instances are assigned to known classes. The OpenMax F-Measure scores follow the trend of the DeepStreamCE F-Measure scores in that the most separated data setup, (airplane,automobile-frog), also displays the highest F-Measure score. Data setup (airplane,truck-deer) shows very low Precision and Recall for OpenMax; this was also one of the lower-scoring data setup specifications for DeepStreamCE. In the second group of data setup specifications, the DeepStreamCE F-Measure results were generally lower than in the first group, as expected due to the more similar nature of the data. However, (cat,frog-deer) is the highest of all data setup specifications, with an F-Measure of 0.698 and a high Recall of 0.848. OpenMax reports a higher F-Measure than DeepStreamCE for (ship,truck-automobile), where all categories are transport, although with a very low Recall – it identified automobile as the unknown class, but also assigned automobile instances to the known classes a high number of times. Both DeepStreamCE and OpenMax struggled to identify concept evolution when all classes were vehicles. OpenMax scored zero on (cat,frog-deer), where it did not identify any instances as unknown – it could not identify a difference between any of these classes. From these results, it can be seen that DeepStreamCE outperforms OpenMax in the scenario of detecting concept evolution for the data setup specifications provided. Using the Wilcoxon Signed-Rank test, the difference between the F-Measure of DeepStreamCE and that of OpenMax over the eight tested cases is statistically significant: the p-value of 0.01563 is less than the 0.05 significance level, supporting the alternative hypothesis that the true location shift is not equal to 0.
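The test can be reproduced directly from the F-Measure columns of Table 7; a sketch with SciPy, which computes the exact two-sided p-value for a sample of eight differences:

    from scipy.stats import wilcoxon

    # F-Measure values from Table 7
    ds = [0.638, 0.436, 0.518, 0.531, 0.281, 0.088, 0.698, 0.357]
    om = [0.284, 0.170, 0.081, 0.158, 0.206, 0.104, 0.000, 0.182]

    statistic, p_value = wilcoxon(ds, om)
    print(p_value)   # ~0.0156, below the 0.05 significance level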

5.3 Detection Time Analysis

In a streaming scenario, the time taken to process one instance is of interest. Table 8 shows the average, minimum and maximum time taken to process instances for DeepStreamCE and OpenMax. The average speed of detection for DeepStreamCE is 324 ms per instance, measured in batches of 100 instances. The system runs on 64 vCPUs, 416 GB RAM and 4 NVIDIA Tesla T4 GPUs. This duration covers selecting the layers, flattening the activations, reducing them in the autoencoder and processing them through the MCOD clustering algorithm to produce an outlier result. The variation in time taken across all runs is within 55 ms, and run times stay consistent when varying k and R. The average time taken for OpenMax to calculate its outcome for each instance is 257 ms; this covers computing the mean activation vector for an instance, applying it to the Weibull distribution of the mean activation vectors for each class and re-calibrating the output decision to allow an 'unknown' classification. The results show that OpenMax is faster than DeepStreamCE. However, OpenMax considers much less activation data than DeepStreamCE and performs fewer computations, as it is based on probability rather than activation reduction via an autoencoder and cluster processing. OpenMax performs considerably worse at concept evolution detection than DeepStreamCE while providing only a small decrease in execution time.

System Average Min Max
DeepStreamCE 324 306 361
OpenMax 257 256 260
Table 8: Time taken to process an instance in ms

6 Conclusion and Future Work

The experiments have shown that detecting concept evolution by applying streaming detection methods to deep neural network activations is a viable approach. This was demonstrated using two separated types of classes (transport and animals) from the CIFAR-10 dataset. The effectiveness of this approach was compared to OpenMax, which DeepStreamCE outperformed. The values of the radius R and the number of neighbours k of the MCOD clusterer are significant factors in the concept evolution decision, with the potential to increase Recall with only a small decrease in Precision. This research has provided an introduction to utilising deep neural network activations in a streaming environment to detect concept evolution. Further directions of study are: (1) expanding the data to more classes and less separated classes, (2) extending the analysis to concept drift and adversarial detection using larger neural network models, (3) investigating which network layers are optimal to use, and (4) experimenting with data other than images and with different types of deep neural networks, as the system becomes data agnostic once the activations are utilised instead of the input data.

Acknowledgements.
The authors thank the Google Cloud Platform Research Credits scheme for enabling this research via the use of cloud resources, and acknowledge the software utilised from MOA Bifet et al. (2010) and Py4J Dagenais.

References

  • Z. S. Abdallah, M. M. Gaber, B. Srinivasan, and S. Krishnaswamy (2016) AnyNovel: detection of novel concepts in evolving data streams: an application for activity recognition. Evolving Systems 7(2), pp. 73–93.
  • A. Adadi and M. Berrada (2018) Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, pp. 52138–52160.
  • A. Bendale and T. E. Boult (2016) Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1563–1572.
  • A. Bendale (2020) abhijitbendale/OSDN. GitHub repository.
  • A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer (2010) MOA: massive online analysis. Journal of Machine Learning Research 11, pp. 1601–1604.
  • V. Buhrmester, D. Munch, and M. Arens (2019) Analysis of explainers of black box deep neural networks for computer vision: a survey. arXiv preprint.
  • S. Carter, Z. Armstrong, L. Schubert, I. Johnson, and C. Olah (2019) Activation atlas. Distill 4(3), e15.
  • B. Chen, W. Carvalho, N. Baracaldo, B. Edwards, T. Lee, H. Ludwig, I. Molloy, and B. Srivastava (2018) Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728.
  • B. Dagenais. Py4J – a bridge between Python and Java.
  • T. DeVries and G. W. Taylor (2018) Learning confidence for out-of-distribution detection in neural networks. arXiv:1802.04865.
  • A. R. Dhamija, M. Günther, and T. Boult (2018) Reducing network agnostophobia. In Advances in Neural Information Processing Systems 31, pp. 9157–9168.
  • E. R. Faria, I. J. C. R. Gonçalves, A. C. P. L. F. de Carvalho, and J. Gama (2016) Novelty detection in data streams. Artificial Intelligence Review 45(2), pp. 235–269.
  • J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia (2014) A survey on concept drift adaptation. ACM Computing Surveys (CSUR).
  • C. Geng, S. Huang, and S. Chen (2019) Recent advances in open set recognition: a survey. arXiv:1811.08581.
  • P. M. Gonçalves, S. G. T. de Carvalho Santos, R. S. M. Barros, and D. C. L. Vieira (2014) A comparative study on concept drift detectors. Expert Systems with Applications 41(18), pp. 8144–8156.
  • D. Haidar and M. M. Gaber (2019) Data stream clustering for real-time anomaly detection: an application to insider threats. In Clustering Methods for Big Data Analytics: Techniques, Toolboxes and Applications, pp. 115–144.
  • A. Haque, L. Khan, M. Baron, B. Thuraisingham, and C. Aggarwal (2016) Efficient handling of concept drift and concept evolution over stream data. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 481–492.
  • A. Haque, L. Khan, and M. Baron (2016) SAND: semi-supervised adaptive novel class detection and classification over data stream. In Thirtieth AAAI Conference on Artificial Intelligence.
  • M. Harel, K. Crammer, R. El-Yaniv, and S. Mannor (2014) Concept drift detection through resampling. In International Conference on Machine Learning.
  • M. Hassen and P. K. Chan (2018) Learning a neural-network-based representation for open set recognition. arXiv:1802.04365.
  • D. Hendrycks and K. Gimpel (2018) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv:1610.02136.
  • G. E. Hinton (2006) Reducing the dimensionality of data with neural networks. Science 313(5786), pp. 504–507.
  • F. Hohman, H. Park, C. Robinson, and D. H. Chau (2019) Summit: scaling deep learning interpretability by visualizing activation and attribution summarizations. arXiv:1904.02323.
  • M. Kahng, P. Y. Andrews, A. Kalro, and D. H. Chau (2018) ActiVis: visual exploration of industry-scale deep neural network models. IEEE Transactions on Visualization and Computer Graphics 24(1), pp. 88–97.
  • I. Khamassi, M. Sayed-Mouchaweh, M. Hammami, and K. Ghédira (2018) Discussion and review on evolving data streams and concept drift adapting. Evolving Systems 9(1), pp. 1–23.
  • M. Kontaki, A. Gounaris, A. N. Papadopoulos, K. Tsichlas, and Y. Manolopoulos (2011) Continuous monitoring of distance-based outliers over data streams. In 2011 IEEE 27th International Conference on Data Engineering, pp. 135–146.
  • M. Kontaki, A. Gounaris, A. N. Papadopoulos, K. Tsichlas, and Y. Manolopoulos (2016) Efficient and flexible algorithms for monitoring distance-based outliers over data streams. Information Systems 55, pp. 37–53.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto.
  • L. I. Kuncheva and W. J. Faithfull (2012) PCA feature extraction for change detection in multidimensional unlabelled streaming data. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR 2012), pp. 1140–1143.
  • S. Liang, Y. Li, and R. Srikant (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv:1706.02690.
  • M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, and S. Liu (2017) Towards better analysis of deep convolutional neural networks. IEEE Transactions on Visualization and Computer Graphics 23(1), pp. 91–100.
  • M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham (2011) Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Transactions on Knowledge and Data Engineering 23(6), pp. 859–874.
  • A. Neupane (2019) aadeshnpn/OSDN. GitHub repository.
  • C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and A. Mordvintsev (2018) The building blocks of interpretability. Distill 3(3), e10.
  • N. Papernot and P. McDaniel (2018) Deep k-nearest neighbors: towards confident, interpretable and robust deep learning. arXiv:1803.04765.
  • Y. Qiu, J. Leng, C. Guo, Q. Chen, C. Li, M. Guo, and Y. Zhu (2019) Adversarial defense through network profiling based path extraction. arXiv:1904.08089.
  • G. J. Ross, N. M. Adams, D. K. Tasoulis, and D. J. Hand (2012) Exponentially weighted moving average charts for detecting concept drift. Pattern Recognition Letters 33(2), pp. 191–198.
  • D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning internal representations by error propagation. Parallel Distributed Processing 1.
  • W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K. Müller (2017) Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems 28(11), pp. 2660–2673.
  • W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult (2013) Toward open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(7), pp. 1757–1772.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
  • X. Song, M. Wu, C. Jermaine, and S. Ranka (2007) Statistical change detection for multi-dimensional data. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • L. Tran, L. Fan, and C. Shahabi (2016) Distance-based outlier detection in data streams. Proceedings of the VLDB Endowment 9(12), pp. 1089–1100.
  • A. Tsymbal (2004) The problem of concept drift: definitions and related work. Technical report, Trinity College Dublin.
  • Y. Wang, H. Yao, and S. Zhao (2016) Auto-encoder based dimensionality reduction. Neurocomputing 184, pp. 232–242.