Towards Neural Network Patching: Evaluating Engagement-Layers and Patch-Architectures

In this report we investigate fundamental requirements for the application of classifier patching [8] to neural networks. Neural network patching is an approach for adapting neural network models to handle concept drift in nonstationary environments. Instead of creating a new network or updating the existing one to accommodate concept drift, neural network patching leverages the inner layers of the network as well as its output to learn a patch that enhances the classification and corrects errors caused by the drift. It learns (i) a predictor that estimates whether the original network will misclassify an instance, and (ii) a patching network that fixes the misclassification. Neural network patching is based on the idea that the original network can still classify a majority of instances well, and that the inner feature representations encoded in the deep network aid the classifier in coping with unseen or changed inputs. In order to apply this kind of patching, we evaluate different engagement layers and patch architectures in this report, and find a set of generally applicable heuristics that aid in parametrizing the patching procedure.


1 Introduction

Nowadays, machine learning research is dominated by neural networks. Although they have been around since the 1940s, it took a long time to leverage their potential, mostly because of the computational complexity involved. This changed in the mid-2000s, when new methods and hardware emerged that allowed bigger networks to be trained faster, opening up new possibilities of application. The main advantage of deep networks is their layered architecture, which turns out to be easier to train than networks with a single hidden layer, provided enough training data is available.

The possibility of training bigger and deeper networks has enabled neural networks to deal with more complex problems. The current understanding is that each layer of a network represents a different stage of abstraction from the input data, similar to how we believe the human brain processes information. Besides this abstraction, specific functional units such as convolutional layers or long short-term memory units provide functionality that is beneficial for certain problems, for example when dealing with image data or sequential prediction tasks. A typical network for image classification consists of multiple layers of convolutional units [6], each representing feature detectors at different grades of abstraction. Early layers detect simple structures like edges or corners. Later layers comprise more complex features related to the given task, for example eyes or ears when recognizing faces.

Due to the large amounts of data available today, building highly capable deep neural networks for certain tasks has become feasible. However, most domains are subject to changing conditions in the long run. That means that either the data, the data distribution, or the target classification function changes. This is usually caused by concept drift or other kinds of non-stationarity. The result is that once perfectly capable systems degrade in their performance or even become unusable over time.

An example could be an image classification task where previously unknown classes need to be detected. This usually requires retraining at least a part of the network in order to accommodate the new classes. Another example could be a piece of complex machinery, as used in production environments such as factories. This machine might be fitted with hardware and software to precisely detect its current state, and a predictive model for failures on top of it. When the next hardware revision of that machine is sold by the manufacturer, new data from the machine has to be collected and the failure prediction model has to be retrained, which can be very expensive.

A final example to motivate the necessity of adaptation is the personalization setting. A product is sold with a general prediction model that covers a wide variety of users. However, personalization would help to make it even more suitable for a specific user. This is a type of adaptation that is difficult to manage with neural networks as underlying models.

To address these problems, we build upon patching, a framework that has recently been proposed for exactly such situations [8]. Contrary to many conventional techniques, this framework does not assume that it is feasible to re-train the model from scratch with newly recorded data. Instead, it tries to recognize regions where the model errs, and learns local models—so-called patches—that repair the original model in these error regions.

In this paper, we present neural network patching (NN-Patching), a variant of patching that is specifically tailored to neural network classifiers. We recognize that building a well-working neural network for a certain task can be cumbersome and require many iterations w.r.t. the choice of architecture and hyper-parameters. Once such a network is established and properly trained, prolonged use of it is usually desired. However, it is not guaranteed that the underlying problem domain remains stationary, and it is desirable that the network can adapt to such changes.

NN-Patching allows existing neural networks to be adapted to new scenarios by adding a patch network on top of the existing network. This patch is not only fed by the output of the base network, but also leverages inner layers of the network that enhance its capabilities. Furthermore, the patch network is only activated when the underlying base network gives erroneous results.

This report is structured as follows. In section 2 we elaborate on the concept of NN-Patching and define its requirements. We derive a set of experiments in section 3 and test various assumptions and methods based on this setup in sections 4-8. We conclude our findings in section 9.

2 Deep Neural Network Adaptation

Since neural networks are usually trained by backpropagation, adapting a neural network towards a changed scenario can be achieved via training on the latest examples, hence refining the weights in the network towards the current concept. However, this may lead to catastrophic forgetting [5] and—depending on the size of the network—may be costly. To mitigate this issue, a common approach is to train only part of the network and not adapt the more general layers [9], but only the specific layers relevant to the target function. For example, [4] leverages this behavior to achieve transfer to problems with higher complexity than the original problem the network was intended for.

In summary, we make three observations: (i) neural networks are useful for adaptation tasks due to their hierarchical structure, (ii) neural networks can be trained such that they adapt to changed environments via new examples, and (iii) this adaptation may lead to catastrophic forgetting. In our proposed method, we want to leverage the advantages of (i) and (ii), but avoid the disadvantage of (iii). In the next section we explain the patching procedure for neural networks.

2.1 Patching for Neural Networks

We tailored the Patching-procedure [8] to the specific case of neural network classifiers. The idea is depicted in Figure 1. NN-Patching therefore consists of three steps:

  1. Learn an error estimator that determines where the base classifier errs. In this step, when receiving a new batch of labeled data, the data is used to learn a classifier that estimates where the base network will misclassify instances.

  2. Learn a patch network. The patch network engages one inner layer of the base classifier as well as its last layer (usually a softmax for classification), and takes the activations of these layers as its own input (Fig. 1).

  3. Divert classification from the base classifier to the patch, if the error estimator is confident. When an instance is to be classified, the error estimator is executed. If the result is positive, classification is diverted to the patch, otherwise to the base classifier. (A minimal sketch of this divert logic is given below.)
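The following Python sketch illustrates how the three components could interact at prediction time. The names base_model, error_estimator, and patch are placeholders for the networks described above, not an API prescribed by [8]; the patch is assumed here to be wired end-to-end, i.e. it takes raw instances and internally reads the engagement-layer activations.

```python
import numpy as np

def nn_patching_predict(x, base_model, error_estimator, patch, threshold=0.5):
    """Divert classification to the patch wherever the error estimator
    predicts a misclassification of the base network (step 3 above)."""
    p_error = error_estimator.predict(x)   # estimated probability that the base errs
    y_base = base_model.predict(x)         # class probabilities of the base network
    y_patch = patch.predict(x)             # class probabilities of the patch
    use_patch = p_error.reshape(-1) > threshold
    # Instance-wise selection: patch output where an error is predicted, base otherwise.
    return np.where(use_patch[:, None], y_patch, y_base)
```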

(a) Base classifier. Dashed black line symbolizes the decision boundary of the base classifier.
(b) Concept drift. Some areas in the instance space are now error-prone.
(c) Patching error regions. The green area symbolizes the detected error region.
Figure 1: Patching Algorithm. (a) shows the instance space of the classes 1 and 2. The dashed black line marks the decision boundary of the base classifier. The instances are classified satisfactorily. In (b), concept drift occurs, and we have an error-prone region in the instance space. In (c) the error region is detected and a patch classifier is learned. The classification of an instance from the error region is diverted to the patch.

In contrast to the original procedure (cf. [8]), neural networks enable us to iteratively update both the error estimator and the patch over time. We will hence not create separate versions for each new batch, but rely on the existing ones and update them via backpropagation with the instances from the latest batch.

To learn the patch network, we must engage one of the inner layers of the base classifier. The selection of this layer is non-trivial. Furthermore, we need to determine an appropriate architecture for the patch itself. The experiments described in the next section will aid in determining some generalized rules of thumb to approach this parametrization problem.
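As an illustration, the following Keras sketch builds such a patch on top of a frozen base classifier, concatenating the flattened activations of a chosen engagement layer with the base softmax output before a single hidden layer (the 128xSoftmax patch used later). The layer name, the concatenation, and the optimizer choice are assumptions made for this sketch, not prescribed by the framework.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_patch(base, engagement_layer_name, num_classes, hidden_units=128):
    """Sketch: patch network fed by an engagement layer and the base softmax output."""
    base.trainable = False                                 # the base classifier is not updated
    inner = base.get_layer(engagement_layer_name).output   # engagement-layer activations
    final = base.output                                    # softmax output of the base network
    features = layers.Concatenate()([layers.Flatten()(inner), final])
    x = layers.Dense(hidden_units, activation="relu")(features)
    out = layers.Dense(num_classes, activation="softmax")(x)
    patch = keras.Model(inputs=base.input, outputs=out, name="patch")
    patch.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return patch
```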

3 Experimental Setup

Figure 2: Different phases of the experimental scenario and evaluation measures.

In this section, we will elaborate on the datasets we used to determine optimal engagement layers and patch architectures. Our datasets are derived from well-known datasets and are engineered to give a stream of instances, where each stream contains one or multiple drifts of the underlying concept. We evaluate these streams as a sequence of instances, where the true labels are retrieved in regular intervals. These intervals define so-called batches of instances. At the end of each batch, this allows us to retrospectively evaluate the performance of the classifier and make adaptations for the next batch. Bifet et al. [1] describe this process more thoroughly w.r.t. their Massive Online Analysis (MOA) framework. We applied the same principles, although we implemented our solution in Python.
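A minimal sketch of this batch-wise test-then-train protocol is shown below; the stream_batches iterable and the single adaptation epoch per batch are illustrative assumptions, not the exact settings of our implementation.

```python
import numpy as np

def evaluate_stream(model, stream_batches, epochs_per_batch=1):
    """Test-then-train evaluation: score each incoming batch first,
    then use its labels to adapt the model for the next batch."""
    accuracies = []
    for x_batch, y_batch in stream_batches:          # y_batch: one-hot labels
        y_pred = np.argmax(model.predict(x_batch), axis=1)
        accuracies.append(np.mean(y_pred == np.argmax(y_batch, axis=1)))
        model.fit(x_batch, y_batch, epochs=epochs_per_batch, verbose=0)
    return accuracies
```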

3.1 Evaluation Datasets

Dataset Init CPs Total Chunks
MNIST Dataset
40k #70k 140k 100
20k #35k 70k 100
15k #20.4k 57.2k 100
20k #35.7k 70k 100
20k #35.7k 70k 100
NIST Dataset
30k #40k 100k 100
30k #40k 100k 100
20k #28.6k 88.6k 100
20k #28k 55.8k 100
20k #30k 80k 100
Table 1: Summary of the datasets used in the experiments (Init = size of the initialization fraction, CPs = positions of the change points, Total = total stream length, Chunks = number of evaluation batches).

We will evaluate our findings in 10 scenarios which are based on two datasets. Each scenario represents a different type of concept drift with varying severity up until a complete transfer of knowledge to an unknown problem. The scenarios are summarized in Table 1.

The MNIST Dataset.

The first dataset is the MNIST dataset of handwritten digits (http://yann.lecun.com/exdb/mnist/). It contains the pixel data of 70,000 digits (28x28 pixel resolution), which we treat as a stream of data and introduce changes to the labeling. We created the following scenarios.

  • : The second half of the dataset consists of vertically and horizontally flipped digits.

  • : The digits in the dataset are rotated at a random angle from instance #35k onwards with increasing degree of rotation up to 180 degrees (at #65k).

  • : The digits change during the stream, such that classes 5–9 do not exist in the beginning, but only start to appear at the change point (in addition to 0–4).

  • : In the first half, only the digits 0–4 exist. The input images of 0–4 are then replaced by the images of 5–9 for the second half (labels remain 0–4). Here we only have 5 classes.

  • : The first half of the stream only consists of digits 0-4, while the second half only consists of the previously unseen digits 5-9.

The NIST Dataset.

The second dataset is the NIST dataset of handprinted forms and characters (https://www.nist.gov/srd/nist-special-database-19). It contains 810,000 digits and characters, to which we apply similar transformations as to the MNIST data. Contrary to MNIST, NIST items are not pre-aligned, and the image size is 128x128 pixels. We use all digits 0-9 and uppercase characters A-Z for a total of 36 classes as a data stream and draw a random sample for each scenario.

  • : The second half of the dataset consists of vertically and horizontally flipped images.

  • : The images in the dataset start to rotate randomly at instance #40k with increasing rotation up to 180 degrees for the last 10k instances.

  • : The distribution of the images changes during the stream so that instances of classes 0–9 do not exist in the beginning, but only start to appear at the change point (mixed in between the characters A–Z).

  • : In the first half, only the digits 0–9 exist. The input images are then replaced by the letters A–J for the second half, but the labels remain 0–9. Here we only have 10 classes.

  • : The first 30k instances of the stream only consist of digits 0–9, while the following 80k are solely characters A–Z.

3.2 The Base Classifiers

In the original Patching-procedure, it is assumed that a base classifier exists, upon which we can learn errors and build patches. Since such classifiers are not given in our case, we use part of each dataset to create them, based on popular neural network architectures.

We employed three architectures that are generally suited to solve the scenarios we described: (i) a fully-connected deep neural network (FC-NN), (ii) a convolutional neural network (CNN), and (iii) a residual network (ResNet) architecture. Each classifier architecture is tuned to achieve high accuracy on the unaltered datasets (Table 2). We assume ReLU activation for all fully-connected and convolutional layers, except in the ResNet and the residual blocks, where the application of ReLU activation is stated explicitly whenever used. The CNN and the FC-NN are trained for 10 epochs on the initialization fraction (Fig. 2) of the dataset. The ResNet architectures are trained for 20 epochs instead, since the deeper structure requires more epochs to converge. For the training in the initialization phase we use a batch size of 64.

Dataset FC-NN CNN ResNet
MNIST 98.87% 99.28% 99.35%
NIST 94.07% 97.77% 98.03%
Table 2: Base classifier accuracy on unaltered datasets. NIST only consists of uppercase letters and numbers. The test set consisted of 10,000 instances.

In the following sections we show the architectural details w.r.t. layer configuration and activations of the chosen networks.

Fully-Connected Architectures

The fully-connected architectures for NIST and MNIST are stated in Table 3. Both networks tend to overfit, hence two dropout layers are utilized to counteract this problem. We use fully-connected layers with a decreasing number of nodes to build the architectures.

MNIST: InputLayer(28x28) - Flatten() - Dropout(0.2) - FC(2048) - FC(1024) - FC(1024) - FC(512) - FC(128) - Dropout(0.5) - Softmax(#classes)
NIST: InputLayer(128x128) - Flatten() - Dropout(0.2) - FC(1024) - FC(1024) - FC(768) - FC(512) - FC(512) - FC(256) - FC(256) - Dropout(0.5) - Softmax(#classes)
InputLayer(i): Input layer, i = shape of the input
Flatten(): Flatten input to one dimension
FC(n): Fully Connected, n = number of units
Dropout(d): Dropout, d = dropout rate
Softmax(n): FC layer with softmax activation, n = number of units
Table 3: Fully-Connected Architectures.
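For illustration, the following is a Keras sketch of the MNIST fully-connected base classifier as we read Table 3; the optimizer and loss are assumptions, since the table only specifies the layer stack (ReLU is assumed for all fully-connected layers, as stated above).

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_fc_mnist(num_classes=10):
    """Sketch of the MNIST FC-NN base classifier from Table 3."""
    model = keras.Sequential([
        keras.Input(shape=(28, 28)),
        layers.Flatten(),
        layers.Dropout(0.2),
        layers.Dense(2048, activation="relu"),
        layers.Dense(1024, activation="relu"),
        layers.Dense(1024, activation="relu"),
        layers.Dense(512, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```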

Convolutional Architectures

In the CNN architectures we additionally use convolutional and pooling layers. In the architecture for MNIST only two convolutional layers and one pooling layer are required to achieve an accuracy greater than 99.25%. The NIST dataset has a total of 128x128 = 16384 attributes. We use one convolutional layer with stride=2 and two pooling layers to reduce the dimensionality of the data while propagating through the network. In both cases we counteract overfitting with the help of two dropout layers.

MNIST: InputLayer(28x28) - Conv2D(32,(3,3),1) - Conv2D(64,(3,3),1) - MaxPooling((2,2),2) - Dropout(0.25) - Flatten() - FC(128) - Dropout(0.5) - Softmax(#classes)
NIST: InputLayer(128x128) - Conv2D(32,(7,7),2) - MaxPooling((2,2),2) - Conv2D(64,(5,5),1) - Conv2D(64,(5,5),1) - Conv2D(64,(3,3),1) - Conv2D(64,(3,3),1) - MaxPooling((2,2),2) - Dropout(0.25) - Flatten() - FC(256) - Dropout(0.5) - Softmax(#classes)
InputLayer(i): Input layer, i = shape of the input
Flatten(): Flatten input to one dimension
FC(n): Fully Connected, n = number of units
Conv2D(f,k,s): 2D Convolution, f = number of filters, k = kernel size, s = stride
MaxPooling(k,s): Max Pooling, k = kernel size, s = stride
Dropout(d): Dropout, d = dropout rate
Softmax(n): FC layer with softmax activation, n = number of units
Table 4: Convolutional Architectures.

Residual Architectures

Our ResNet architecture is based on the contest-winning model by He et al. [6]. It consists of two different residual block types (Figure 3). An important tool in both block types is the 1x1 convolutional layer. 1x1 convolutions can be used to change the dimensionality in the filter space. Both residual block types follow the same pattern: at first a 1x1 convolution is used to reduce the dimensionality, then a 3x3 convolution is applied on the data with reduced dimensionality. Finally, another 1x1 convolution is utilized to restore the original filter space. The reduction of dimensionality results in a reduced computational cost for applying the 3x3 convolutions. The optional layer parameter 'same' refers to zero padding in Keras [2]: zeros are added around the image in such a way that for stride=1 the width and height of the input and output of the layer are the same.

(a) Identity block
(b) Convolutional block
Figure 3: Residual block types.

The identity block preserves the input size, whereas the convolutional block can be used to change the width and height of each feature map. Hence, the convolutional block has an additional convolutional layer in the residual connection. With a stride greater than one, the width and height of the block output can be manipulated. The ResNet architectures for NIST and MNIST are stated in Table 5.

MNIST: InputLayer(28x28) - Dropout(0.2) - Conv2D(64,(5,5),2,'same') - BatchNorm() - ReLU() - ConvBlock((64,256),1) - IdBlock(64,256) - IdBlock(64,256) - ConvBlock((128,512),2) - IdBlock(128,512) - IdBlock(128,512) - IdBlock(128,512) - ConvBlock((256,1024),2) - IdBlock(256,1024) - IdBlock(256,1024) - IdBlock(256,1024) - IdBlock(256,1024) - AveragePooling2D((2,2),2) - Flatten() - Dropout(0.5) - Softmax(#classes)
NIST: InputLayer(128x128) - Dropout(0.2) - Conv2D(64,(7,7),2) - BatchNorm() - ReLU() - MaxPooling((3,3),3) - ConvBlock((64,256),1) - IdBlock(64,256) - IdBlock(64,256) - ConvBlock((128,512),2) - IdBlock(128,512) - IdBlock(128,512) - IdBlock(128,512) - ConvBlock((256,1024),2) - IdBlock(256,1024) - IdBlock(256,1024) - IdBlock(256,1024) - AveragePooling2D((2,2),2,'same') - Flatten() - Dropout(0.5) - Softmax(#classes)
InputLayer(i): Input layer, i = shape of the input
Flatten(): Flatten input to one dimension
Conv2D(f,k,s): 2D Convolution, f = number of filters, k = kernel size, s = stride
MaxPooling(k,s): Max Pooling, k = kernel size, s = stride
AveragePooling2D(k,s): Average Pooling, k = kernel size, s = stride
IdBlock(f1,f2): Identity Block, f1 = #reduced filters, f2 = #output filters
ConvBlock((f1,f2),s): Convolutional Block, f1 = #reduced filters, f2 = #output filters, s = stride
BatchNorm(): Batch Normalization
ReLU(): ReLU Activation
Dropout(d): Dropout, d = dropout rate
Softmax(n): FC layer with softmax activation, n = number of units
Table 5: Residual Architectures.
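The sketch below illustrates the two residual block types in Keras. The 1x1 -> 3x3 -> 1x1 bottleneck pattern and the extra convolution in the shortcut of the convolutional block follow the text above; the exact placement of BatchNorm and ReLU inside the blocks is our assumption, since Figure 3 is not reproduced here.

```python
from tensorflow.keras import layers

def identity_block(x, reduced_filters, output_filters):
    """Bottleneck block that preserves the input shape (assumes x already
    has output_filters channels)."""
    shortcut = x
    y = layers.Conv2D(reduced_filters, (1, 1))(x)            # reduce filter dimensionality
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(reduced_filters, (3, 3), padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(output_filters, (1, 1))(y)             # restore filter dimensionality
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([shortcut, y]))

def conv_block(x, reduced_filters, output_filters, stride):
    """Bottleneck block whose shortcut contains its own convolution so that
    width/height and number of filters of both paths match."""
    shortcut = layers.Conv2D(output_filters, (1, 1), strides=stride)(x)
    shortcut = layers.BatchNormalization()(shortcut)
    y = layers.Conv2D(reduced_filters, (1, 1), strides=stride)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(reduced_filters, (3, 3), padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(output_filters, (1, 1))(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([shortcut, y]))
```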

Although batch normalization is frequently used in the ResNet, and Ioffe and Szegedy [7] reported that batch normalization increases the robustness of a network to its initialization, the CNN seems to be more robust in terms of initialization than the ResNet structure. On rare occasions the ResNet gets stuck in a local minimum during training; we never observed this for the CNN or FC-NN. In such cases the ResNet already converges in the first epoch of the training process. During the experiment runtime we detect this stagnation in the training process and discard the current model. We then reinitialize the network with a new random seed and restart the training process.

3.3 Evaluation Measures

For the comparison of the different architectures we use the following metrics (a sketch of how they can be computed from per-batch accuracies follows the list):

  • Final Accuracy (F.Acc): Classification accuracy, measured in the Finish phase, which consists of the last five batches of the stream.

  • Average Accuracy (Avg.Acc): Average accuracy in the Adaptation and Finish phases (after first change point).

  • Recovery Speed (R.Spd): Number of instances that a classifier requires during the Adaptation phase to achieve 90% of its final accuracy.
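A minimal sketch of how these measures can be derived from the per-batch accuracies recorded after the first change point; the batch_size parameter and the indexing convention are illustrative assumptions.

```python
def evaluation_measures(batch_accuracies, batch_size, finish_batches=5):
    """Compute F.Acc, Avg.Acc and R.Spd from per-batch accuracies
    recorded after the first change point."""
    final_acc = sum(batch_accuracies[-finish_batches:]) / finish_batches
    avg_acc = sum(batch_accuracies) / len(batch_accuracies)
    # Recovery speed: instances needed to first reach 90% of the final accuracy.
    recovery_speed = None
    for i, acc in enumerate(batch_accuracies):
        if acc >= 0.9 * final_acc:
            recovery_speed = (i + 1) * batch_size
            break
    return final_acc, avg_acc, recovery_speed
```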

4 Network Architecture and Engagement Layer

In our scenario we want to select a suitable engagement layer. To do so, we have to consider the architecture of the base classifier. For some architectures, well-performing engagement layers are found in higher layers of the network, close to the output layer, whereas for other network architectures it is preferable to choose engagement layers close to the network input. If we categorize the networks into three different archetypes, we can recognize essential similarities and differences. The distinguished archetypes are: Fully-Connected Network (FC-NN), Convolutional Network (CNN) and Residual Network (ResNet).

In Figure 4 we show a typical engagement-layer accuracy progression, representative for each network archetype. Flatten and Dropout layers are excluded from the evaluation for all archetypes. The Flatten layer has the same patching performance as the previous network layer, since output values are not altered and the engagement layer output is flattened in any case before being directed to the patch. Dropout layers can also be excluded from the evaluation, since they have no functionality during prediction and testing. For residual networks we only consider the output of each residual block in the performance evaluation, and hence additionally exclude the parallel layers inside the residual blocks. The exact base classifier architectures are described in Section 3.2.

(a) Fully-Connected Network
(b) Convolutional Network
(c) Residual Network
Figure 4: Patching performance by engagement layer for three different base classifier archetypes. The used dataset was NIST and the patch architecture was 128xSoftmax. The plots show the evaluation measures average accuracy and final accuracy, obtained by using the output of the respective engagement layer as input for the patch network. The x-axis states the layer type of the engagement layer: 'fc' denotes a fully-connected layer, 'conv' or 'c' a convolutional layer, 'pool' or 'p' a pooling layer, and 'r' the output of a residual block. The left side of the x-axis starts with the input layer, which is the raw data. The last network layer is the output layer, a softmax layer with the number of nodes equal to the number of different classes in the dataset.

Engagement in Fully Connected Networks

In Figure 4(a) the patching performance by engagement layer is shown for a fully-connected network architecture. The optimal engagement layer in the presented configuration is the second fully-connected layer of the network. We observe that the average and final accuracy increase up to the second fully-connected layer. After this point, the accuracy decreases gradually. This indicates that fully-connected layers tend to perform classification, instead of extracting transferable features.

Yosinski et al. [9] describe this behaviour as general versus specific: the features in early layers tend to be general, whereas later layers consist of more specific features with respect to the classification task.

Furthermore, average and final accuracy show a strong correlation. The final accuracy is higher than the average accuracy, since the patch network has already adapted to the new concept when the final phase begins (Fig. 2). Apart from this accuracy offset, average and final accuracy do not show qualitative differences. The optimal engagement layer for maximizing the average accuracy is usually also ideal w.r.t. the final accuracy. This property holds for all examined network archetypes.

Engagement in Convolutional Neural Networks

Figure 4(b) shows the patching performance based on the engagement layer for a convolutional network. The graph shows a gradual increase in accuracy for layers further away from the input layer. The accuracy maximum is achieved by using the fifth convolutional layer as the engagement layer for the patch. The following pooling layer shows a marginal loss in average and final accuracy. Moreover, we know that stacking convolutional layers generates a strong feature hierarchy [10]. The graph indicates that, in contrast to fully-connected layers, convolutional layers tend to extract transferable features. The last two network layers in the CNN are fully-connected layers. The graph shows a significant performance decrease when these layers are used as engagement layers. Although it is intuitive that layers close to the network output perform classification, the graph of the FC-NN indicates that fully-connected layers always tend to perform classification. Therefore, fully-connected layers in CNNs do not seem suitable as engagement layers for patching; the resulting heuristic is to engage the last convolutional or pooling layer instead (cf. Section 6.1).

Engagement in Residual Networks

Our residual network architecture consists of convolution, batch normalization, add, dropout and pooling layers. The only fully-connected layer in the network is the output layer. In contrast to the gradual feature extraction by the CNN, the patching performance suddenly increases for the last residual layers of the ResNet (Figure 4(c)). The residual blocks 'r1' to 'r7' are not suited as engagement layers for patching. The patching accuracy of these layers is comparable to training the patch on noise that bears no relation to the classification task. Although the output from the early residual blocks seems to contain no useful information for classifying instances under the new concept, the rear residual blocks of the network recover useful and transferable features. Since these transferable features are recovered from the poorly performing residual block outputs, those blocks also have to contain useful information. However, the information must be encoded in the activations in such a way that the patch network is not capable of extracting it for the classification task. We observe this behaviour for our residual networks on all datasets.

Conclusion on the engagement layer

In summary, the best engagement layers for FC-NNs lie early in the network (the first or second fully-connected layer), whereas for CNNs and ResNets the layers closest to the output, excluding fully-connected layers, perform best (the last convolutional or pooling layer, respectively the last residual block). We formalize these observations into heuristic rules in Section 6.1.

4.1 On the importance of the activation function

We notice that the first convolutional layer of the ResNet shows poor performance. This contradicts the assumption that convolutional layers are good feature extractors. We have to mention that we used the output of the engagement layer after applying the activation function. To be consistent, we also used the output of the convolutional layer in the ResNet after the activation layer. Since the ResNet architecture from He et al. [6] applies batch normalization before ReLU activation, the patching accuracy is obtained after applying batch normalization and ReLU activation. In Table 6 the patching accuracies for these layers are stated in detail.

# Layer Type Avg. Acc.
1 Input 0.53
2 Convolutional 0.49
3 Batch Normalization 0.22
4 ReLU Activation 0.09
5 Max Pooling 0.15
Table 6: Comparison of the average patching accuracies for the first layers of the ResNet. The table shows the patching accuracy decrease after applying batch normalization and ReLU activation. The used dataset is NIST and the patch architecture was 128xSoftmax (the setting is similar to the experiment leading to Figure 4).

It shows that when we use the convolutional layer (before applying the activation or batch normalization) as engagement layer, we achieve an average accuracy of 49%. After applying batch normalization, the accuracy for patching decreases to 22%. If, additionally, the ReLU activation is applied, the accuracy drops to 9%.

In our experiments, the application of ReLU activation sometimes decreases the patching performance in comparison to the pure layer output (without applied activation). ReLU activation returns the identity for each value greater than zero and zero for every negative input value. If we consider these characteristics, it becomes clear that ReLU activation obliterates information contained in negative values, information that the patch could otherwise use to achieve a better performance. This should be intuitive, since information which is unusable for the original task may be useful after the concept has changed. In Figure 5 we present a comparison between using a layer before and after applying ReLU activation as the engagement layer.

(a) Fully-Connected Network
(b) Convolutional Network
(c) Residual Network
Figure 5: Comparison of the average patching accuracies for each network archetype before and after applying ReLU activation. The used dataset is NIST and the patch architecture was 128xSoftmax (the setting is similar to the experiment leading to Figure 4).
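The comparison in Figure 5 requires access to both the pre- and post-activation output of a layer. A possible way to do this in Keras is sketched below; it assumes the activation is a separate named layer (as in our ResNet), otherwise the layer would have to be rebuilt with activation=None and an explicit ReLU layer to expose the pre-activation tensor.

```python
from tensorflow import keras

def engagement_outputs(base, layer_name, activation_layer_name, x):
    """Return the activations of a candidate engagement layer before and
    after its ReLU, assuming the activation is a separate named layer."""
    pre = keras.Model(base.input, base.get_layer(layer_name).output)
    post = keras.Model(base.input, base.get_layer(activation_layer_name).output)
    return pre.predict(x), post.predict(x)
```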

In case of the FC-NN (Figure 5(a)), the patching accuracy obtained with the engagement layer after applying the activation is higher for the first network layers. After the fourth fully-connected layer, the accuracy from the raw layer output without activation surpasses it.

In the CNN architecture (Figure 5(b)), it is always beneficial to apply the activation before using the engagement layer output for patching. Only for the fully-connected layer, which is the second to last layer in the CNN, both methods show comparable accuracy.

With the ResNet (Figure 5(c)) it is beneficial to retrieve the engagement layer output before applying the activation for most layers, except the last two residual block outputs.

This comparison shows two effects of ReLU activation on the information in a network layer. Sometimes the patching performance decreases after applying ReLU activation to an engagement layer output; in other cases, we observe a performance increase from applying the ReLU activation. Since the ReLU function maps every negative value to zero and returns the identity for non-negative values, the ReLU activation discards the information implied by negative values. For positive values the exact activations remain, but for negative values only the fact that the value was non-positive is preserved.

In order to explain the behaviour shown in Figure 5, we recognize that applying the ReLU activation is only beneficial for layers with general features. For engagement layers which already show decreased patching performance due to the specificity of their features, applying no activation function is beneficial. We propose that solving discrimination tasks with neural networks consists of two phases, to which the network layers can be allocated: (i) a phase where feature extraction is conducted, and (ii) a phase where the classification task is performed. This definition is related to the characterization into general and specific layers by Yosinski et al. [9]. Features may be more general in the early layers, but our experiments indicate that they are not necessarily features that are beneficial for the target task.

The optimal engagement layer is the layer with features general enough to solve the drift task and specific enough to contain suitable high-level features related to the drift task. All layers before the optimal engagement layer are too general and all layers after are too specific with respect to the given drift task.

Layers with general features.

For engagement layers with general features it is beneficial to apply the ReLU activation in order to increase patching accuracy. The ReLU activation discards information about unsuitable features, which are not beneficial for the original classification task. Since the dropped features are general, these features tend to be unsuited for the drift task as well.

Layers with specific features.

Contrarily, in engagement layers with specific features the discarded features may be relevant for the target task due to their specificity. High-level features, which are dispensable for the original concept, may still be useful for the drift task. Hence, applying ReLU activation yields a performance decrease for specific layers.

This explanation holds for the FC-NN and the CNN. To explain the performance decrease through ReLU in the ResNet, we propose that the shallow layers serve an information preserving purpose and neither work as feature extraction nor classification layers.

In order to rule out the possibility that the ResNet was not sufficiently trained, we also ran it with 500 epochs of training, but obtained the same results. If we assume that the only functionality of these layers is to preserve information, and that the layers receive no task-related feedback due to a vanishing gradient, the ReLU activation will discard random features, which explains the performance loss through the ReLU activation in the ResNet layers. Nevertheless, the ResNet still achieves the highest accuracy on NIST and MNIST (Table 2).

Conclusion on activation function

Concluding the discussion on using the engagement layer output before or after the ReLU activation, we recognize that, in the experiments we conducted, the highest performing engagement layer always has the ReLU activation applied. Hence, we use the engagement layer output after activation as input for the patch network in further experiments.

4.2 On the effects of Batch Normalization

We further explored the accuracy loss through batch normalization (Table 6). We suspected that the data is shifted by the batch normalization parameters into a range which is not feasible for the patch network. This turned out not to be the case: in fact, every entry of the learned scale vector is close to one and every entry of the learned shift vector is close to zero. Therefore, the overall mean of the data is close to zero and the overall standard deviation is close to one after applying batch normalization. Since the only difference between the convolutional output and the batch normalization output is the normalized mean and variance, we conclude that the observed accuracy decrease is caused by value compression: significant differences between values may seem negligible after normalization.

4.3 Conv Layers in CNN and ResNet

Next we compare the patching accuracy of the first convolutional layer in the CNN and the ResNet architecture (Figure 4). Both convolutional layers have similar filter size and stride. They only differ in the number of filters (64 filters in the ResNet vs. 32 filters in the CNN). Since their position in the network is also the same (first layer), we would expect a comparable patching performance. However, the convolutional layer from the CNN performs much better, with an average accuracy of 0.68 (after ReLU activation) in comparison to the accuracy of 0.49 (before activation and batch normalization) from the ResNet.

Regarding the engagement, we observed that the application of the ReLU activation is beneficial for layers with general features. Since we investigate the first layer of each network, it is an obvious assumption that the features in the layer are fairly general. We propose that the compression effect of batch normalization negates the positive effect of the ReLU activation in this case. We know that the mean activation is zero after batch normalization, because of the observed shift parameter. Furthermore, we assume that, since the mean activation is zero, roughly half of the values are mapped to zero by applying ReLU activation. Thus, it is possible that the convolutional layers from the CNN and the ResNet extract similar features, but the performance difference is caused by the batch normalization.

Another way to interpret the application of the ReLU function to the engagement layer output is the effect of distractive information on neural networks. The engagement layers in the CNN (Figure 5(b)) show a huge performance increase by applying the ReLU activation. We interpret this to mean that the ReLU function maps distractive information to zero. Since an activation of zero is similar to the absence of the neuron, we conclude that the removal of irrelevant information can yield large performance increases for neural networks.

5 Dataset Dependence of Engagement Layers

In the previous section we concluded that the optimal engagement layer is highly dependent on the base classifier archetype. In this section we investigate the engagement layer's dependence on the dataset. Hence, we conducted a test series for every dataset and base classifier archetype. The results are presented in Table 7.

Base Archetype: Fully-Connected Convolutional Residual
Best Layer by: Avg. Acc. F. Acc. Avg. Acc. F. Acc. Avg. Acc. F. Acc.
MNIST FC1(1.) FC1(1.) Pool1(3.last) Pool1(3.last) Pool1(2.last) R11(4.last)
MNIST FC1(1.) FC1(1.) Pool1(3.last) Pool1(3.last) R12(3.last) R12(3.last)
MNIST FC1(1.) FC1(1.) Pool1(3.last) Conv2(4.last) R12(3.last) R12(3.last)
MNIST FC1(1.) FC1(1.) Conv2(4.last) Pool1(3.last) R12(3.last) Pool1(2.last)
MNIST FC1(1.) FC1(1.) Conv2(4.last) Conv2(4.last) Pool1(2.last) R12(3.last)
NIST FC2(2.) FC2(2.) Conv5(4.last) Conv5(4.last) Pool2(2.last) R11(3.last)
NIST FC2(2.) FC2(2.) Conv5(4.last) Conv5(4.last) Pool2(2.last) R11(3.last)
NIST FC1(1.) FC1(1.) Conv5(4.last) Conv5(4.last) R11(3.last) R11(3.last)
NIST FC2(2.) FC2(2.) Conv5(4.last) Pool2(3.last) R11(3.last) R11(3.last)
NIST FC1(1.) FC1(1.) Conv5(4.last) Conv5(4.last) R11(3.last) R11(3.last)
Table 7: Best engagement layer by dataset and classifier. The optimal engagement layer is given for the evaluation measures average accuracy and final accuracy. We state the name of the optimal engagement layer and its position in the base classifier network, shown in parentheses (e.g. (3.last) refers to the third-to-last layer of the base network). The patch architecture is 128xSoftmax.

The best engagement layer for the FC-NN is always the first or second fully-connected layer of the base classifier. For the CNN, the best engagement layer is either the last pooling layer or the last convolutional layer. The best engagement layer for the ResNet follows a similar pattern: Either the output of the last residual block or the average pooling layer show highest patching accuracy. The ResNet architecture on MNIST shows the best final accuracy for the second last residual block. We consider this a coincidence, since the patching performance of the last and second to last residual block hardly differs in this specific case.

We consider NIST and NIST the datasets with the most difficult concept drifts, since the base classifier is trained on numbers and has to adapt to letters.

The best engagement layer for the fully-connected base classifier is the first fully-connected layer for both datasets (Table 7). For the three remaining datasets the second fully-connected layer is ideal.

This indicates that more complex concept drifts tend to be solved with the information of earlier engagement layers, whereas for moderate drifts the ideal engagement layer tends to be later in the network. The second fully connected layer appears to be too specific for the target tasks in NIST and NIST. This is in line with our previous observations regarding the generality vs. specificity dilemma of fully-connected layers.

However, this behavior was not observed with the other two base classifier architectures. We suggest that the phenomenon still occurs, but since convolutional layers generate fairly general features, it does not show in this case.

The observed difference in generality and specificity is huge when we compare the last convolutional layer (the residual block also consists of convolutions) and the first fully-connected layer of both the CNN and the ResNet. Fully-connected layers in the CNN and ResNet are apparently too specific to deal with all the different types of drift.

Between the last convolution and the fully-connected layer is a pooling layer in both the ResNet and the CNN architecture. The pooling layer is apparently more specific than the convolutional layer. For some datasets the pooling layer is the best engagement layer, but never for NIST and NIST.

The observed property of fully-connected layers to perform specific classification tasks and the generality of convolutional layers lead to a strong division between general and specific sections in a neural network. We can use this property to make a robust selection regarding suitable engagement layers for patching. For every base classifier architecture, only two layers per archetype (cf. the heuristics in Section 6.1) qualify as the highest performing engagement layer across all datasets.

To give an intuition on how relevant it is to choose the best engagement layer over the second best engagement layer, we obtained the average accuracy difference for choosing the best over the second best engagement layer for each base classifier architecture (Table 8).

Base Archetype: Fully-Connected Convolutional Residual
Difference in: Avg.Acc. F.Acc. Avg.Acc. F.Acc. Avg.Acc. F.Acc.
MNIST +4.27 +3.18 +0.63 +0.41 +1.39 +0.59
NIST +2.92 +2.87 +1.58 +0.68 +1.66 +1.26
Table 8: Average patching accuracy difference in percentage points between the best and second best engagement layer. The values are averaged over the five NIST or MNIST datasets, respectively, and stated for each base classifier architecture. The accuracy difference is stated for average and final accuracy. The patch network architecture was 128xSoftmax.

First, we notice that the difference in average accuracy is considerably larger than the difference in final accuracy for all base classifier architectures. The performance difference between the best and second best engagement layer is biggest for the fully-connected archetype.

The accuracy difference between selecting the best or second best engagement layer is less pronounced for CNN and ResNet architectures. The convolutional layers all generate rather general features, therefore the last convolutional layers show smaller differences in generality and specificity. Moreover, the last convolutional layer is followed by a pooling layer, which only slightly increases the specificity of a layer due to discarding or averaging of features. Hence, a comparable patching performance is expected.

To summarize this discussion, we conclude that the performance increase gained by choosing the best engagement layer over the second best engagement layer is not negligible.

6 Patch Architecture Dependence

In this section we investigate the inter-dependence between the optimal engagement layer selection and the patch network architecture. To examine the optimal engagement layer with respect to the patch architecture, we conducted a test series with three different patch architectures. The results are shown in Figure 6.

(a) Fully-Connected Network
(b) Convolutional Network
(c) Residual Network
Figure 6: Comparison of layer-wise patching average accuracies for different patch and base classifier architectures. Although the graphs show differences in the average accuracy for the three patch architectures, the layer with the maximum average accuracy does not change for the FC-NN and the CNN. For the ResNet the best engagement layer with 256x128xSoftmax is the output of the last residual block. The average pooling layer ’p2’ achieves highest patching accuracy for the two remaining patch architectures. The dataset is NIST.

We observe that the patching accuracy varies for different patch sizes, but the optimal engagement layer is not influenced by the patch architecture for the CNN and FC-NN.

To substantiate this claim, we tested the three patch architectures for each classifier archetype and dataset. We obtained the average rank of the best engagement layer as shown in Table 9. The average is computed over the three patch architectures. An average rank of 1.0 indicates that the best engagement layer is the same for all three patch architectures.

Base Archetype: Fully-Connected Convolutional Residual
Best Layer Rank by: Avg. Acc. Final Acc. Avg. Acc. Final Acc. Avg. Acc. Final Acc.
MNIST 1.0 1.0 1.0 1.0 1.7 1.7
MNIST 1.0 1.0 1.0 1.0 2.0 2.7
MNIST 1.0 1.0 1.0 1.0 2.3 2.7
MNIST 1.0 1.3 1.3 1.0 2.7 2.7
MNIST 1.0 1.0 1.0 1.0 1.3 3.0
NIST 1.3 1.0 1.0 1.3 1.7 3.0
NIST 1.0 1.0 1.0 1.0 2.7 2.3
NIST 1.0 1.0 1.0 1.0 2.7 2.0
NIST 1.3 1.0 1.0 1.0 3.0 2.0
NIST 1.0 1.0 1.0 1.0 3.3 2.3
Table 9: Average rank of the best engagement layer for three different patch architectures. The average is computed over the three patch architectures (128xSoftmax, 256xSoftmax, 256x128xSoftmax). An average rank of 1.0 indicates that the best engagement layer is the same for all three patch architectures.

From Table 9 we can infer that the optimal engagement layer for fully-connected and convolutional base network archetypes is independent of the patch architecture. In contrast, for the ResNet this independence cannot be observed. Consequently, for ResNets the engagement layer choice should be validated together with the intended patch architecture, rather than fixed beforehand.

6.1 Heuristic for Engagement Layer Selection

After we thoroughly investigated the dependencies of engagement layer selection, we want to formulate a heuristic rule for each network archetype. In order to do that, we consider the main findings of the previous sections.

Our experiments indicate that there are strong differences in patching performance between the network archetypes. For FC-NNs earlier layers are best suited as engagement layers, whereas for CNNs or ResNets layers close to the output layer achieve the highest patching accuracy (excluding FC-layers). Furthermore, we observed that it is not helpful to use the engagement layer output for patching before applying the ReLU activation function.

In Section 5 we investigated the engagement layer's dependence on the dataset. Across all datasets only two layers from each base network architecture qualify for the optimal engagement layer.

The dependence on the patch architecture was reviewed in Section 6. The experiments showed a low engagement layer dependence on patch architecture for FC-NNs and CNNs, but a higher dependence for ResNets.

After inclusion of our findings, we formulate the following heuristic rules for engagement layer selection:
Fully-Connected Neural Network: The best engagement layer is either the first or second fully-connected layer in the network. Selecting the best engagement layer is important, since we observed a huge performance difference between best and second best engagement layer. Best practice would be to try both candidate layers.
Convolutional Neural Network: The best engagement layer is either the last convolutional layer or the last pooling layer of the network.
Residual Neural Network: The best engagement layer is either the output of the last residual block or the last pooling layer of the network.
We investigated the performance difference between the best and second best engagement layer; the performance increase for the best layer is not negligible. These rules of thumb narrow down the search space for the optimal engagement layer to two layers.

7 Patch Architecture Selection

In this section we investigate the influence of different patch architectures on the patching performance. We evaluate 25 different patch architectures on all datasets for FC-NN, CNN and ResNet. Each experiment was conducted five times with varying random seeds. All presented values are averaged over these five runs. We exclude all engagement layers except the two most promising layers from the previous sections. The two candidate engagement layers are selected by applying our selection heuristic for engagement layers (Section 6.1). The 25 patch architectures have between one and three hidden layers. Only fully-connected layers are used to build the patch. If the engagement layer output is multidimensional, the first layer of the patch network is a flatten layer. The 25 evaluated patch architectures are those listed in Table 14.

It is not beneficial to choose a network with a single softmax output layer as patch architecture. In this case the patch often gets stuck in a poor local minimum during the training process. This is in concordance with the findings of Choromanska et al. [3], who suggest that the quality of local minima in the error surface of the loss function increases with the size of the network. Although recovering the global minimum becomes harder as the network size increases, local minima in larger networks already tend to minimize the loss function satisfactorily. Hence, smaller networks tend to get stuck in poor local minima.

Archetype: FC-NN CNN ResNet
Dataset Layer Patch Arch. Layer Patch Arch. Layer Patch Arch.
MNIST fc1 2048 conv2 256 p1 128
MNIST fc1 2048 pool1 256 r12 128
MNIST fc1 1536 pool1 256 r12 128
MNIST fc1 2048 pool1 1536 r12 256
MNIST fc1 2048 conv2 256 p1 128
NIST fc2 2048 conv5 2048 p2 2048
NIST fc2 2048 conv5 1536 p2 1536
NIST fc1 1024 conv5 1024 r11 128
NIST fc2 2048 conv5 1536 p2 1024
NIST fc1 1536 conv5 512 p2 1024
Table 10: Best engagement layer and patch architecture by average accuracy for each dataset and base archetype. The table shows the best engagement layer/patch architecture combination for each dataset and classifier with respect to average accuracy. Performances of the layer/architecture combinations are averaged over five runs with different random seeds.

Furthermore, a neural network consisting of merely one softmax layer is a linear classifier, since a softmax output layer trained to minimize cross-entropy is equivalent to multinomial logistic regression. A linear model does not have enough representational power to handle the concept drifts in a satisfactory way.

The patch architectures are presented without explicitly indicating the softmax classification layer as the last layer of every patch, since the presence of an output layer is mandatory. Therefore, ’256x128’ refers to a patch architecture: Input() - FC(256) - FC(128) - Softmax(num_classes). The architecture ’128’ refers to: Input() - FC(128) - Softmax(num_classes).
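A small sketch of this naming convention, assuming Keras and an already flattened patch input; the input dimension is a placeholder for the size of the engagement layer output (plus the base softmax output, where used).

```python
from tensorflow import keras
from tensorflow.keras import layers

def patch_from_spec(spec, input_dim, num_classes):
    """Build a patch such as '256x128' = Input() - FC(256) - FC(128) - Softmax(num_classes)."""
    model = keras.Sequential([keras.Input(shape=(input_dim,))])
    for units in spec.split("x"):          # e.g. "256x128" -> 256, 128
        model.add(layers.Dense(int(units), activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model
```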

The model used in the patch architecture experiments is NN-patching. Due to the absence of an error estimator network, all arriving instances after the concept drift are referred to the patch for classification. The FC-NN and CNN base classifiers are trained for 10 epochs on the data from the initialization phase of the dataset. The ResNet is trained for 20 epochs, because of slower convergence.

In Table 10 we show the ideal engagement layer/patch architecture combination with respect to maximizing average accuracy. All patch architectures maximizing the average accuracy consist of one fully-connected layer and a softmax classification layer. Deeper patch structures perform significantly worse for average accuracy (Table 15). The nature of the dataset (i.e. the inherent concept drift) has an influence on the engagement layer. As predicted by our heuristic rules for engagement layer selection, it is not possible to select a single perfect engagement layer across all datasets without considering the nature of the concept drift.

Archetype: FC-NN CNN ResNet
Dataset Layer Patch Arch. Layer Patch Arch. Layer Patch Arch.
MNIST fc1 1024x512 pool1 256 r12 256
MNIST fc1 2048x512x256 pool1 1024 r12 128
MNIST fc1 1536x256 conv2 128 r12 128
MNIST fc1 128 pool1 2048 r12 512x256
MNIST fc1 1024 conv2 256x128 r12 256x128
NIST fc2 2048 conv5 512 p2 128
NIST fc1 2048 conv5 512 r11 1024
NIST fc1 1536 conv5 256 r11 256x128
NIST fc2 2048 conv5 1536 r11 1536
NIST fc1 1536x512 conv5 1024 r11 256
Table 11: Best engagement layer and patch architecture by final accuracy for each dataset and base archetype. The table shows the best engagement layer/patch architecture combination for each dataset and classifier with respect to final accuracy. Performances of the layer/architecture combinations are averaged over five runs with different random seeds.

If we consider the best patch architecture with respect to maximizing the final accuracy instead of the average accuracy (Table 11), occasionally deeper patch architectures with two hidden layers achieve the highest performance. Deeper architectures require more training to converge than shallow architectures and are harder to train in general. The average accuracy is obtained as the average accuracy over all batches after the concept drift. In comparison, the final accuracy is obtained by averaging the accuracy of the patch network on the last five batches of the data stream. The amount of training data available is larger by the time the final accuracy is measured. It is therefore expected that deeper patch architectures perform better for final accuracy than for average accuracy, since the patch network receives more training.

Moreover, we observe that, in comparison to average accuracy, final accuracy is more often higher with the earlier, more general layer of the two candidate layers (Table 13).

Archetype: FC-NN CNN ResNet
Dataset Layer Patch Arch. Layer Patch Arch. Layer Patch Arch.
MNIST fc1 2048 pool1 2048 p1 128
MNIST fc1 1024 conv2 256 p1 1024
MNIST fc1 2048 pool1 1024 p1 1024
MNIST fc2 1024 pool1 2048 r12 128
MNIST fc1 1536 pool1 1536 p1 512
NIST fc2 1024 conv5 1024 p2 2048
NIST fc2 1024 conv5 2048 p2 1024
NIST fc1 1024 conv5 1536 p2 256
NIST fc2 2048 pool2 1536 p2 2048
NIST fc1 1024 conv5 1024 r11 1024
Table 12: Best engagement layer and patch architecture by recovery speed for each dataset and base archetype. The table shows the best engagement layer/patch architecture combination for each dataset and classifier with respect to recovery speed. Performances of the layer/architecture combinations are averaged over five runs with different random seeds.

The third evaluation measure we examine is recovery speed. Recovery speed states the number of batches, and therefore the amount of training, required to recover to 90% of the base classifier accuracy before the concept drift. Recovery speed is optimized if the model performs well on the first batches of the data stream; thus, fast adaptation to the new concept is demanded. Final accuracy is optimized if the model performs well on the last batches of the data stream; hence, the ability of the model to represent the new concept arbitrarily well is required to optimize final accuracy. Sometimes this is achieved by more complex (deeper) models, but we only occasionally observed this in our experiments. Average accuracy can be interpreted as a trade-off between recovery speed and final accuracy, since the model performance on all batches after the concept drift is considered when obtaining this evaluation measure.

The best engagement layer and patch architecture with respect to recovery speed for each dataset and base archetype is shown in Table 12. Similar to average accuracy, all patch architectures optimizing the recovery speed are architectures with one hidden layer. Shallow network architectures converge faster than deeper networks, therefore they are well suited for quick adaptation to new concepts. In comparison to final accuracy, the later of the two engagement layer candidates more often optimizes the recovery speed (Table 13).

Engagement Layer Rec. Speed Avg. Acc Final Acc.
Earlier Layer 13 18 24
Later Layer 17 12 6
Table 13: Number of early and late layers acting as the best engagement layer for different evaluation measures. The table shows how often the earlier and the later of the two evaluated candidate layers is the best engagement layer. We have three base classifier archetypes and ten different datasets, hence each column adds up to 30. This table can be reproduced from the contents of Tables 10 to 12.

The difference between the two candidate layers in the CNN and the ResNet is the pooling operation (max-pooling for CNNs, average-pooling for ResNets). The aggregated information in the pooling layers tends to be better suited for fast adaptation, whereas the more comprehensive features from the preceding layer result in better final accuracy.

Dataset: NIST          Classifier: CNN           Engagement Layer: Conv5
Average Accuracy Final Accuracy Recovery Speed
Patch Architecture Accuracy Patch Architecture Accuracy Patch Architecture Batches
1536 89.35 512 93.88 2048 8.4
1024 89.34 1024 93.88 1536 9.4
2048 89.33 2048 93.82 256 9.4
512 89.29 1536 93.76 512 9.4
256 89.03 256 93.65 1024 9.8
128 88.46 128 93.55 128 10.2
1024x512 87.03 2048x256 93.49 1536x256 11.2
1536x512 87.02 1536x512 93.4 2048x256 11.6
2048x512 86.99 1024x512 93.35 512x128 12.0
2048x256 86.93 512x128 93.35 1536x512 12.2
1024x256 86.91 256x128 93.32 2048x512 12.2
1536x256 86.91 2048x512 93.27 1024x256 12.4
512x256 86.62 1536x256 93.24 256x128 12.8
512x128 86.41 1024x256 93.13 1024x512 13.4
256x128 85.88 512x256 93.11 512x256 13.8
2048x512x256 83.57 1024x512x256 92.97 1536x256x128 16.0
1024x512x256 83.41 512x256x128 92.78 1536x512x128 16.0
1536x512x256 83.38 1536x256x128 92.53 1024x256x128 16.2
1024x256x128 82.88 2048x512x256 92.51 2048x256x128 16.6
1024x512x128 82.79 1536x512x128 92.45 1024x512x256 17.0
1536x512x128 82.66 2048x512x128 92.43 1536x512x256 17.0
2048x256x128 82.64 1024x256x128 92.38 2048x512x128 17.4
512x256x128 82.6 1024x512x128 92.24 2048x512x256 17.4
1536x256x128 82.53 1536x512x256 92.0 512x256x128 17.8
2048x512x128 82.51 2048x256x128 91.95 1024x512x128 18.2
Table 14: Various patch architectures sorted by average accuracy, final accuracy and recovery speed. For each evaluation measure we state the patch architecture and the corresponding accuracy (or number of batches to recover), sorted so that the best-performing architecture comes first. Accuracies are given in percent. The values are averaged over five runs with different random seeds. The dataset is NIST, the base classifier is a CNN and the engagement layer is Conv5.

In Table 14 we show the 25 patch architectures sorted by average accuracy, final accuracy and recovery speed. The shallow architectures with one hidden layer perform best on average for all evaluation measures. For average accuracy and recovery speed we notice a clear performance separation by patch depth. The maximum difference in average accuracy between the highest and lowest performing of the six architectures with one hidden layer is 0.89 %, whereas the gap between the lowest performing architecture with a single hidden layer and the best performing architecture with two hidden layers is 1.43 %. The gap between the lowest performing architecture with two hidden layers and the best performing architecture with three hidden layers is 2.31 %. Similar dependencies between architectures of different depth can be found for recovery speed.

In contrast, these large performance steps are not observed for final accuracy, where the discrepancies between architectures of different depth are less pronounced. The architectures are still roughly ordered by depth, but the transitions are gradual.

Moreover, the performance differences between the one-hidden-layer architectures ’512’, ’1024’, ’1536’, and ’2048’ in average and final accuracy are negligible, since the standard deviation in average accuracy is approximately 0.01 % for these architectures.

Average Accuracy Final Accuracy Recovery Speed
Patch Architecture Avg. Rank Patch Architecture Avg. Rank Patch Architecture Avg. Rank
1024 4.07 2048 5.57 1024 4.43
512 4.1 1024 5.92 1536 6.37
2048 4.25 512 6.28 2048 6.6
1536 4.28 1536 6.42 512 7.83
256 5.4 256 7.78 256 7.92
128 7.83 1024x512 9.87 128 8.6
512x256 9.6 128 10.08 1024x512 9.68
1024x512 10.33 1024x256 10.17 1024x256 9.75
1024x256 10.4 2048x256 10.33 1536x256 10.58
512x128 10.6 1536x256 10.55 1536x512 10.92
1536x256 10.95 1536x512 10.65 512x256 12.17
1536x512 11.0 512x256 10.65 2048x256 12.18
256x128 11.28 512x128 10.67 2048x512 12.47
2048x512 11.75 2048x512 11.2 512x128 12.9
2048x256 11.8 256x128 11.92 256x128 13.7
1024x512x256 18.2 1024x256x128 17.65 1024x256x128 15.77
512x256x128 18.48 1536x512x128 18.12 1024x512x256 16.03
1024x256x128 18.77 1024x512x256 18.42 1536x512x128 16.77
1536x512x256 18.9 1536x512x256 18.43 1024x512x128 17.18
1536x256x128 19.82 2048x512x128 18.47 1536x256x128 17.68
1536x512x128 19.82 512x256x128 18.5 1536x512x256 17.83
1024x512x128 19.92 1024x512x128 18.78 2048x512x256 18.65
2048x512x256 20.5 1536x256x128 19.2 2048x512x128 19.5
2048x256x128 21.25 2048x256x128 19.47 2048x256x128 19.55
2048x512x128 21.7 2048x512x256 19.92 512x256x128 19.93
Table 15: Patch architectures ranked by evaluation measures. The table shows the average rank of each patch architecture for average accuracy, final accuracy and recovery speed. The rank is calculated as the average rank over all datasets, classifiers and both candidate layers. Each configuration is tested five times with different random seeds. The ranks are averaged over these five runs.

To get a more comprehensive picture of the performance of different patch architectures, we ranked the 25 patch architectures by our three evaluation criteria (Table 15). For each configuration (dataset, classifier and candidate layer), the rank is obtained by sorting the patch architectures by the respective evaluation measure and assigning ranks; the reported rank is the average over all configurations. The architectures ’512’, ’1024’, ’1536’, and ’2048’ are the top four architectures for all evaluation measures.
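The ranking procedure, as we understand it, can be sketched as follows; the data layout and function name are assumptions made for illustration:

```python
from collections import defaultdict

def average_ranks(results, higher_is_better=True):
    """results: one dict per configuration (dataset x classifier x candidate
    layer), mapping patch architecture -> measured value for that run."""
    rank_sums, counts = defaultdict(float), defaultdict(int)
    for config in results:
        # Sort architectures by the measure and assign ranks 1..n (1 = best).
        ordered = sorted(config, key=config.get, reverse=higher_is_better)
        for rank, arch in enumerate(ordered, start=1):
            rank_sums[arch] += rank
            counts[arch] += 1
    return {arch: rank_sums[arch] / counts[arch] for arch in rank_sums}

# For recovery speed, fewer batches is better, so pass higher_is_better=False.
```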

The results indicate that it is not beneficial to increase the number of nodes in the hidden layer arbitrarily: we do not observe an advantage of the ’2048’ architecture over the ’1024’ architecture.

In conclusion, patch architectures with a single hidden layer and a sufficient number of nodes show the best patching performance on average in all scenarios. In the following, we discuss the theoretical representation power of neural networks with merely one hidden layer.

Base Archetype: FC-NN CNN ResNet
MNIST fc1 pool1 p1
NIST fc2 conv5 p2
Table 16: Engagement layer choice for further experiments.

Based on our findings on patch architecture selection, we use a single hidden layer with 512 nodes as the patch architecture in all further experiments, since this configuration showed good performance in all evaluated scenarios. We also fix a specific engagement layer for each base classifier (Table 16).
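For illustration only, a patch of this shape could be attached to a frozen base network roughly as in the following Keras sketch; the function name, the ReLU activation and the concatenation of the base output are our assumptions, not code from the report:

```python
import tensorflow as tf

def build_patch(base_model, engagement_layer_name, n_classes, hidden_units=512):
    """Attach a one-hidden-layer patch (hidden_units x softmax) to the
    activations of a chosen engagement layer of a frozen base network."""
    base_model.trainable = False  # the base network is not trained further
    engaged = base_model.get_layer(engagement_layer_name).output
    features = tf.keras.layers.Flatten()(engaged)
    # The base network's own output can optionally be appended as extra input.
    features = tf.keras.layers.Concatenate()([features, base_model.output])
    hidden = tf.keras.layers.Dense(hidden_units, activation="relu")(features)
    patch_out = tf.keras.layers.Dense(n_classes, activation="softmax")(hidden)
    return tf.keras.Model(inputs=base_model.input, outputs=patch_out)
```

With the engagement layers fixed in Table 16, a call such as build_patch(base_cnn, "conv5", n_classes=...) would correspond to the NIST/CNN configuration, assuming the layer is registered under that name.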

For selecting patch architectures to deal with non-stationary environments, we recommend shallow architectures with one hidden layer of 256 to 2048 nodes, due to their fast adaptation capability and adequate representation power.

8 Inclusive vs Exclusive Training on the Error Region

In previous sections we trained the patch network on all instances of each batch; the patch is trained to approximate the new concept after the drift. We call training on all instances of the datastream inclusive training, since the patch is not restricted to instances from the error region of the base classifier.

The intuition behind the patching algorithm is that a secondary model improves the base classifier in error-prone regions of the instance space. After reporting the performance on a batch, we assume that its labels become available, so we can train the patch only on instances that are misclassified by the base network. We call this training scheme exclusive training, since the patch is exclusively trained on instances from the error region of the base classifier.
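The difference between the two schemes amounts to a simple selection step per batch; the following NumPy sketch (names are ours) illustrates it:

```python
import numpy as np

def select_training_data(x_batch, y_batch, base_predictions, exclusive=True):
    """Inclusive training uses the whole batch; exclusive training keeps only
    the instances the base network misclassified (its error region)."""
    x_batch, y_batch = np.asarray(x_batch), np.asarray(y_batch)
    if not exclusive:
        return x_batch, y_batch
    error_mask = np.asarray(base_predictions) != y_batch
    return x_batch[error_mask], y_batch[error_mask]
```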

In this section we compare inclusive and exclusive training by obtaining theoretical performance boundaries. The base network and the patch network form a classifier ensemble. To obtain theoretical performance boundaries, we assume perfect ensemble usage: an instance counts as correctly classified if either the patch or the base network predicts its true label.
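Under this assumption the theoretical boundary reduces to a one-line computation; a minimal sketch in our own formulation:

```python
import numpy as np

def perfect_ensemble_accuracy(y_true, base_pred, patch_pred):
    """Upper bound: an instance counts as correct if either model hits the true label."""
    y_true, base_pred, patch_pred = map(np.asarray, (y_true, base_pred, patch_pred))
    return ((base_pred == y_true) | (patch_pred == y_true)).mean()
```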

Figure 7: Comparison of accuracy curves for inclusive and exclusive training of the patch. Panels: (a) exclusive training, (b) inclusive training, (c) inclusive versus exclusive training. Panels (a) and (b) show the accuracies of the unaltered base network, the patch network and the combined classifier assuming perfect ensemble usage. Panel (c) compares the ensemble with the inclusive patch to the ensemble with the exclusive patch. The results are obtained using a FC-NN as base classifier, the patch architecture is 512xSoftmax and the dataset is MNIST.

Figure 7 shows the accuracy comparison between exclusive and inclusive training. Figures 7(a) and 7(b) present the accuracies of the unaltered base network, the patch network and the combined classifier, assuming perfect ensemble usage, for the exclusive and the inclusive patch respectively. The accuracies are obtained by evaluating the respective model on the data of the next batch. The base classifier reaches an accuracy of approximately 50 % and the exclusive patch only around 42 %, yet the combined ensemble shows a large accuracy increase: since the patch is trained exclusively on instances from the error region of the base network, the classification capabilities of the two models complement one another.

The inclusive patch (Figure 7(b)) and the respective ensemble have comparable accuracy. The inclusive patch is able to classify data from the whole instance space, so the classification capabilities of the base network and the patch overlap, which results in a higher accuracy for the patch itself. The total accuracy of this type of ensemble, however, is lower, as shown in Figure 7(c), where we compare the theoretical accuracy boundaries of the classifier ensembles consisting of the base network and the inclusive or exclusive patch. The exclusive patch combined with the base classifier achieves a higher accuracy than the inclusive patch combined with the base network.

Evaluation Measure: Average Accuracy Final Accuracy Recovery Speed
Dataset excl. incl. excl. incl. excl. incl.
MNIST 94.92 92.52 99.18 98.52 6.6 7.8
MNIST 94.55 94.01 97.99 97.59 5.2 5.2
MNIST 94.05 93.36 97.69 97.78 6.6 5.6
MNIST 77.07 75.22 76.71 76.68 —- —-
NIST 95.68 92.42 97.17 94.92 1.8 3.8
NIST 88.29 87.62 94.34 94.06 10.4 10.6
NIST 85.55 84.46 93.21 92.9 9.2 12.8
NIST 62.75 62.09 68.06 66.83 —- —-
Table 17: Evaluation of the inclusive and exclusive patch network combined with the base network. We assume perfect ensemble usage for the classifier ensemble consisting of the inclusive/exclusive patch and the base network. The values are averaged over the three different classifiers and five runs for each individual base classifier with different random seeds. Average accuracy and final accuracy are given in percent. Recovery speed is measured as the number of batches needed to recover to 90 % of the base classifier accuracy; the classifiers do not recover to 90 % of the base classifier accuracy when training on MNIST and NIST (marked '----').

To substantiate this observation, we conducted experiments to obtain evaluation measures for all datasets and classifiers. The results are shown in Table 17. In this table we compare the inclusive and the exclusive ensemble for all datasets. The exclusive patch ensemble outperforms the inclusive ensemble in average accuracy, final accuracy and recovery speed.

The performance difference between the inclusive and the exclusive ensemble depends on the classification capabilities of the base classifier after the occurrence of the concept drift. The average accuracies of the base classifiers after the concept drift are listed in Table 18 for all datasets. The performance difference is highest for NIST and MNIST, the datasets with the highest base classifier accuracy after the concept drift. For datasets with lower performing base classifiers, we observe a smaller performance difference between the inclusive and the exclusive ensemble.

Datasets FC-NN CNN ResNet
MNIST 50.58 50.78 50.89
MNIST 29.55 35.9 39.67
MNIST 41.99 33.6 35.34
MNIST 46.58 51.64 52.87
MNIST 0.0 0.0 0.0
NIST 64.89 69.71 68.3
NIST 14.36 17.35 15.68
NIST 13.33 13.05 13.64
NIST 32.16 38.71 39.45
NIST 0.0 0.0 0.0
Table 18: Average accuracy of base classifiers in percent after occurrence of the concept drift. We evaluate the base network on every batch after the concept drift and report the accuracy. The base classifier is not further trained after the initialization phase. Values are averaged over five runs with different random seeds. The average standard deviation is 1.35 % (MNIST and NIST were excluded for obtaining this value).

The datasets NIST and MNIST have a base classifier accuracy of zero. In these datasets the data before the drift consists of instances from different classes than the data after the concept drift. To understand why the base classifier only predicts the classes from its training data and never an unseen class, we have to look at the cross-entropy loss function. With the cross-entropy loss L = -∑_i t_i log(o_i), the derivative with respect to an output node is ∂L/∂o_i = -t_i / o_i. The singularity at o_i = 0 can easily be conditioned by adding a small positive number to the denominator. We note that, in comparison to the MSE loss, the CE loss generates sparse gradient updates: if the target t_i is zero, the gradient update is zero, regardless of the actual node output. Connections from a neuron in the output layer to the previous layer are only updated if the respective target is one. Since for NIST and MNIST labels from the post-drift classes never occurred during training, all connections from these output nodes to the previous layer have not been updated since weight initialization. The weights are uniformly distributed around zero, hence the features from the previous layer are randomly combined; random positive activations weighted by uniformly distributed values with mean zero add up to approximately zero. The nodes of the original classes have trained weights and receive higher activations, therefore the new classes are never predicted.
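The sparsity argument can be checked numerically; the following standalone PyTorch snippet (our own demo, not code from the report) computes the gradient of the cross-entropy loss with respect to the output nodes:

```python
import torch

# Softmax outputs of three output nodes and a one-hot target for class 1.
outputs = torch.tensor([0.7, 0.2, 0.1], requires_grad=True)
targets = torch.tensor([0.0, 1.0, 0.0])

loss = -(targets * torch.log(outputs)).sum()  # L = -sum_i t_i * log(o_i)
loss.backward()

# Only the node with target 1 receives a gradient (-t_i / o_i = -1 / 0.2 = -5);
# nodes with target 0 get exactly zero, so their incoming weights never change.
print(outputs.grad)  # tensor([ 0., -5.,  0.])
```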

The performances of the exclusive and the inclusive ensemble are equal for NIST and MNIST, since the error region of the base classifier comprises all instances of the data stream.

The performance increase of the exclusive ensemble is caused by the fact that a reduced sub-problem is easier to solve than the more comprehensive problem. The feature space is divided by the error region of the base classifier, and the exclusive patch only has to classify instances from the error region, which is a sub-problem. Therefore, the exclusive patch classifies instances inside the error region with higher accuracy than the inclusive patch.

The difference between inclusive and exclusive training depends on how well the base network still classifies instances after the concept drift. The error region is larger for poorly performing base classifiers, hence the respective sub-problem for the exclusive patch is also larger. Better base classifier performance after the drift results in a smaller sub-problem for the patch, and the smaller the problem, the higher the accuracy of the patch network.

We conclude that an exclusive training on instances from the error region of the base network leads to an increase in patching performance, since the sub-problem in the instance space is easier to solve for the patch network.

The advantage of inclusive training is its robustness towards a poor error region estimator. To obtain the theoretical accuracy boundaries, we assumed perfect ensemble usage; in a real-world scenario, the error region estimator makes imperfect predictions. Since the inclusive patch can classify instances from the whole instance space satisfactorily, an error region estimator that refers most instances to the patch for classification still results in good performance. The exclusive patch relies more heavily on a well-tuned error region estimator.
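In a deployed setting, the estimator acts as a per-instance router between the two networks. A hedged sketch of such a routing rule, assuming scikit-learn-style predict methods (all names are ours):

```python
import numpy as np

def ensemble_predict(x, base_net, patch_net, error_estimator):
    """Route each instance to the patch if the estimator predicts that the
    base network will misclassify it; otherwise keep the base prediction."""
    x = np.asarray(x)
    use_patch = np.asarray(error_estimator.predict(x)).astype(bool)
    predictions = np.asarray(base_net.predict(x))
    if use_patch.any():
        predictions[use_patch] = patch_net.predict(x[use_patch])
    return predictions
```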

9 Conclusion

In this report we have investigated the behavior of neural network patches with respect to their architecture, the layer they are attached to, and other specifics, in combination with three different types of deep neural network architectures: fully connected (FC-NN), convolutional (CNN), and residual (ResNet). In conclusion, we have derived four heuristics:

  • Engagement Layer: of the two candidate layers we evaluated per archetype, the earlier layer tends to maximize final accuracy, while the later, pooled layer tends to speed up recovery (Tables 13 and 16).

  • Patch Architecture: a single hidden layer with 256 to 2048 nodes (512 in our further experiments) is sufficient; deeper patches rarely pay off.

  • Use of activation function.

  • Inclusive or exclusive training: exclusive training on the error region of the base network yields better patching performance, whereas inclusive training is more robust to a poor error region estimator.


With these heuristics, we can establish appropriate patches for the process of neural network patching. Future work will feature a detailed experimental analysis of how patching can be applied to further datasets and network architectures, together with a comparison of related approaches to the patching architecture.

References