Model Weight Theft With Just Noise Inputs: The Curious Case of the Petulant Attacker

This paper explores the scenarios under which an attacker can claim that 'Noise and access to the softmax layer of the model is all you need' to steal the weights of a convolutional neural network whose architecture is already known. We were able to achieve 96% and 82% accuracy with stolen models learned using only i.i.d. Bernoulli noise inputs. We posit that this theft-susceptibility of the weights is indicative of the complexity of the dataset and propose a new metric that captures the same. The goal of this dissemination is not just to showcase how far knowing the architecture can take you in terms of model stealing, but also to draw attention to the rather idiosyncratic weight-learnability aspects of CNNs spurred by i.i.d. noise input. We also disseminate some initial results obtained using the Ising probability distribution in lieu of the i.i.d. Bernoulli distribution.


1 Introduction

In this paper, we consider the fate of an attacker who is adamant about using only noise as input to a convolutional neural network (CNN) whose architecture is known and whose weights are the target of theft. We assume that the attacker has earned access to the softmax layer and is not restricted in terms of the number of inputs that can be used to carry out the attack.
At the outset, we’d like to emphasize that our goal in disseminating these results is not to convince the reader of the real-world validity of the attacker scenario described above or to showcase a novel attack. This paper contains our initial explorations after a chance discovery that we could functionally replicate the weights of an MNIST-trained CNN model by just using noise as input into the framework described below.

Through a set of empirical experiments, which we are duly open sourcing to aid reproducibility, we seek to draw the attention of the community to the following two issues:

  1. The risk of model weight theft clearly entails an interplay between the dataset and the architecture. Given a fixed architecture, can we use the level of susceptibility as a novel metric of complexity of the dataset?

  2. Given the wide variations in success attained by varying the noise distribution, how do we formally characterize the relationship between the input noise distribution being used by the attacker and the true distribution of the data, while considering a specific CNN architecture? What aspects of the true data distribution are actually important for model extraction?

The rest of the paper is structured as follows:
In Section 2, we provide a brief literature survey of the related work. In Section 3, we describe the methodology used to carry out the attack. In Section 4, we cover the main results obtained and conclude the paper in Section 5.

2 Related work

The art form of stealing machine learning models has received a lot of attention in recent years. In stealing_tramer, the authors specifically targeted real-world ML-as-a-service mlaas platforms such as BigML and Amazon Machine Learning and demonstrated effective attacks that resulted in extraction of machine learning models with near-perfect fidelity for several popular model classes. In copycat_cnn, the authors trained what they termed a copycat network using Non-Problem Domain images and stolen labels to achieve impressive results in the three problems of facial expression, object, and crosswalk classification. This was followed by work on Knockoff Nets knockoff_net, where the authors demonstrated that by merely querying with random images sourced from an entirely different distribution than that of the black box target training data, one could not only train a well-performing knockoff but also achieve high accuracy even when the knockoff was constructed using a completely different architecture.
This work differs from the above works in that the attacker is adamant on using only noise images as querying inputs. Intriguingly enough, state-of-the-art CNNs are not robust enough to provide a flat (uniform) softmax output (with weight $1/K$ on each of the $K$ classes) when we input non-input-domain noise at the input layer. This has been studied in two contexts. The first context was within the framework of fooling images. In nguyen2015deep, the authors showcased how to generate synthetic images that were noise-like and completely unrecognizable to the human eye but that state-of-the-art CNNs classified as one of the training classes with high confidence. The second context concerned what the authors of goodfellow2014explaining termed rubbish-class examples. Here, they showcased the high levels of confident mispredictions exuded by state-of-the-art models trained on the MNIST and CIFAR-10 datasets in response to isotropic Gaussian noise inputs.
In this work, we focus on using Bernoulli noise-samples as inputs and using the softmax responses of the target model to siphon away the weights.
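
To make the notion of a flat softmax output concrete, the following minimal sketch shows that equal logits give every class the same probability, whereas a single dominant logit gives a peaked, confident output. The number of classes (10, as in the MNIST family) and the helper name `softmax` are assumptions made for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

K = 10  # assumed number of classes

# Equal logits: every class receives probability 1/K (a flat output).
flat = softmax([0.0] * K)

# One dominant logit: the output is peaked on a single class.
peaked = softmax([5.0] + [0.0] * (K - 1))
```

A model that were robust to noise inputs would return something close to `flat`; the works cited above report outputs that look more like `peaked`.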

3 Methodology

3.1 Threat model

We propose a framework for model extraction in which the adversary possesses neither samples from the true dataset on which the model was trained nor knowledge of the model's purpose. The adversary knows only the dimensionality of the input tensors and can access the resulting class distribution, assumed to be the output of a softmax activation, for any given input. We make the additional assumption that the architecture of the model to be extracted is known by the adversary. In our experiments, we assume that the input tensor is of dimension $28$ by $28$ and each pixel has values on the interval $[0, 1]$.

3.2 Victim model

The black box model which we attempt to extract (the victim model), whose architecture is described in Table 1, is trained to convergence on a standard dataset for a fixed number of epochs using the Adadelta optimizer with an initial learning rate of 1.0 and a fixed minibatch size Mnistcnn1:online. From this point onward, this model is assumed to be a black box in which we have no access to the parameters of each layer.

Layer type       Dimensions   Additional
Convolutional                 ReLU
Convolutional                 ReLU
Max Pooling                   -
Dense                         ReLU
Dropout                       -
Dense                         Softmax

Table 1: Victim architecture as found in the MNIST example in the documentation for the Keras deep learning library.
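
As a rough sense of scale for what is being extracted, the sketch below counts the trainable parameters of an architecture with the layer sequence of Table 1. The layer sizes used here (32 and 64 filters of size 3 by 3, 2 by 2 pooling, 128 hidden units, 10 classes) are assumptions taken from the commonly distributed Keras MNIST CNN example, not values stated in the table, and the helper names are hypothetical.

```python
def conv_out(size, kernel):
    """Spatial size after a stride-1 convolution without padding."""
    return size - kernel + 1

def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of a 2D convolutional layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Assumed hyperparameters (not given in Table 1).
side, kernel, pool = 28, 3, 2
filters_1, filters_2, hidden, classes = 32, 64, 128, 10

side = conv_out(side, kernel)                 # 28 -> 26
p_conv1 = conv_params(kernel, 1, filters_1)
side = conv_out(side, kernel)                 # 26 -> 24
p_conv2 = conv_params(kernel, filters_1, filters_2)
side //= pool                                 # 24 -> 12 after max pooling
flat = side * side * filters_2                # flattened feature size
p_dense1 = dense_params(flat, hidden)
p_dense2 = dense_params(hidden, classes)

# Pooling and dropout layers contribute no trainable parameters.
total_params = p_conv1 + p_conv2 + p_dense1 + p_dense2
```

Under these assumed sizes, almost all of the parameters sit in the first dense layer, which is worth keeping in mind when interpreting how well noise inputs constrain the weights.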

3.3 Random stimulus response for model extraction

We procedurally generate a dataset of ‘stimuli’ comprised of $28$ by $28$ binary tensors where each pixel is sampled from a Bernoulli distribution with success probability parameter $p$. In other words, each image is $x \in \{0, 1\}^{28 \times 28}$, where $x_{ij} \sim \mathrm{Bernoulli}(p)$ independently for every pixel $(i, j)$. We sample these tensors using several values of $p$, each of which is used to generate an equal share of the data. We obtain predictions from the black box model for each randomly sampled example, which we refer to as ‘responses.’
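
The stimulus generation described above can be sketched with the Python standard library alone. The function name `bernoulli_stimuli`, the particular success probabilities, and the number of examples per probability are illustrative assumptions, not the values used in our experiments.

```python
import random

def bernoulli_stimuli(n_per_p, probs, side=28, seed=0):
    """Return a list of side-by-side binary images.

    For each success probability in `probs`, `n_per_p` images are drawn
    with every pixel sampled independently from Bernoulli(p).
    """
    rng = random.Random(seed)
    stimuli = []
    for p in probs:
        for _ in range(n_per_p):
            image = [[1 if rng.random() < p else 0 for _ in range(side)]
                     for _ in range(side)]
            stimuli.append(image)
    return stimuli

# Illustrative values only: three probabilities, four images each.
probs = [0.1, 0.5, 0.9]
stimuli = bernoulli_stimuli(n_per_p=4, probs=probs)
```

Each probability contributes the same number of images, so the resulting dataset is an equal mixture over the chosen values of the success parameter.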

3.4 Extraction

We train a new model, the extracted model, on the stimulus response pairs with no regularization and evaluate it on the dataset originally used to train the victim model. The architecture for this model is the same as that of the victim model, except that we remove the dropout layers to encourage overfitting. We train for 50 epochs using the Adadelta optimizer with an initial learning rate of 1.0 and a fixed minibatch size. Additionally, we observe a significant class imbalance in the highest probability classes of the softmax response vectors, so we remedy this by computing class weights according to the argmax of each softmax vector and applying this re-weighting during the training of the extracted model. We show the full extraction algorithm in Algorithm 1 and summarize it in Figure 1.
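
One way to realize the re-weighting step is sketched below. The text does not fix a formula, so the inverse-frequency rule used here (total count divided by the number of classes times the class count) is an assumption, as are the helper names `argmax` and `class_weights` and the toy response vectors.

```python
from collections import Counter

def argmax(vec):
    """Index of the largest entry of a list."""
    return max(range(len(vec)), key=lambda i: vec[i])

def class_weights(responses, n_classes):
    """Inverse-frequency weights keyed by the argmax class of each response.

    Classes that never appear as an argmax receive no entry, since no
    training example would use their weight.
    """
    counts = Counter(argmax(r) for r in responses)
    n = len(responses)
    return {c: n / (n_classes * counts[c]) for c in counts}

# Toy softmax responses over three classes (illustrative only).
responses = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.8, 0.1, 0.1],
    [0.2, 0.5, 0.3],
]
weights = class_weights(responses, n_classes=3)
```

Rarely predicted classes receive larger weights, so their stimulus response pairs contribute more to the training loss of the extracted model.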

We evaluate our proposed framework on four datasets from the MNIST family of datasets with identical dimensions: MNIST, KMNIST, Fashion MNIST, and notMNIST lecun-mnisthandwrittendigit-2010 ; clanuwat2018deep ; xiao2017/online ; notMNIST66:online .

3.5 Experiments with noise distributions

We evaluated the effect of sampling random data from different distributions on the performance of the extracted model on the MNIST validation set. We used the same training procedure as in the previously described experiments with two exceptions: we sample a smaller set of procedurally generated examples and we train the extracted model for only 10 epochs. We evaluated the uniform distribution on a bounded interval, the standard normal distribution, the standard Gumbel distribution, the Bernoulli distribution with a fixed success parameter, and samples from an Ising model simulation with a fixed inverse temperature parameter and resulting values scaled to the input pixel range.
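
The i.i.d. distributions listed above can be sampled pixel by pixel as in the following sketch. The uniform interval (taken here as $[0, 1)$), the Bernoulli success parameter (0.5), and the names `sample_pixel` and `noise_image` are assumptions for illustration; the Gumbel draw uses the standard inverse-CDF transform.

```python
import math
import random

def sample_pixel(dist, rng, p=0.5):
    """Draw one pixel value from the named distribution."""
    if dist == "uniform":
        return rng.random()                      # assumed interval [0, 1)
    if dist == "normal":
        return rng.gauss(0.0, 1.0)               # standard normal
    if dist == "gumbel":
        u = max(rng.random(), 1e-12)             # avoid log(0)
        return -math.log(-math.log(u))           # standard Gumbel via inverse CDF
    if dist == "bernoulli":
        return 1.0 if rng.random() < p else 0.0
    raise ValueError(f"unknown distribution: {dist}")

def noise_image(dist, side=28, seed=0, p=0.5):
    """Image whose pixels are i.i.d. draws from `dist`."""
    rng = random.Random(seed)
    return [[sample_pixel(dist, rng, p) for _ in range(side)]
            for _ in range(side)]

images = {d: noise_image(d) for d in ("uniform", "normal", "gumbel", "bernoulli")}
```

None of these samplers introduces any dependence between neighbouring pixels, which is the property that distinguishes them from the Ising samples.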

3.6 The Ising prior as a model of spatial correlation

The Ising prior is defined by the density given in taroni2015statistical.

Examples of images sampled from the Ising model can be found in Figure 4.
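
For reference, a commonly used form of the Ising prior over spins $x_i \in \{-1, +1\}$ on the pixel lattice is shown below. It is stated here as an assumption: the parameterization in taroni2015statistical may differ, for example in how neighbouring agreement is scored. Here $\beta$ is the inverse temperature, $i \sim j$ ranges over pairs of neighbouring pixels, and $Z(\beta)$ is the normalizing constant.

```latex
p(x \mid \beta) = \frac{1}{Z(\beta)} \exp\Big( \beta \sum_{i \sim j} x_i x_j \Big),
\qquad
Z(\beta) = \sum_{x'} \exp\Big( \beta \sum_{i \sim j} x'_i x'_j \Big)
```

Under this form, $\beta = 0$ recovers i.i.d. fair coin flips, and larger $\beta$ favours images in which neighbouring pixels agree.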

For this experiment, we evaluated the role of the inverse temperature parameter, $\beta$, of the Ising sampler in training the extracted model. We first partition the stimulus response pairs into subsets of equal size, each corresponding to one of the values of $\beta$ used to generate the samples. We then train the extracted model for a fixed number of epochs on each subset and validate on the original dataset. We performed this experiment for MNIST, KMNIST, Fashion MNIST, and notMNIST and report the variation in performance over different values of $\beta$.
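
A minimal sketch of how Ising samples at a given inverse temperature could be produced is shown below. It uses single-site Metropolis updates with periodic boundaries and rescales spins from $\{-1, +1\}$ to $\{0, 1\}$; the sampler choice, the number of sweeps, the rescaling, and the names `ising_sample` and `neighbour_agreement` are all assumptions rather than a description of the simulation we ran.

```python
import math
import random

def ising_sample(side, beta, sweeps=20, seed=0):
    """Approximate sample from p(x) proportional to exp(beta * sum of x_i * x_j)."""
    rng = random.Random(seed)
    spins = [[rng.choice((-1, 1)) for _ in range(side)] for _ in range(side)]
    for _ in range(sweeps * side * side):
        i, j = rng.randrange(side), rng.randrange(side)
        # Sum of the four nearest neighbours (periodic boundaries).
        nb = (spins[(i - 1) % side][j] + spins[(i + 1) % side][j]
              + spins[i][(j - 1) % side] + spins[i][(j + 1) % side])
        # Flipping spin (i, j) changes the log-density by -beta * delta.
        delta = 2 * spins[i][j] * nb
        if delta <= 0 or rng.random() < math.exp(-beta * delta):
            spins[i][j] = -spins[i][j]
    # Rescale {-1, +1} to {0, 1} so samples match a binary pixel range.
    return [[(s + 1) // 2 for s in row] for row in spins]

def neighbour_agreement(image):
    """Fraction of horizontally adjacent pixel pairs with equal values."""
    pairs = [(row[k], row[k + 1]) for row in image for k in range(len(row) - 1)]
    return sum(a == b for a, b in pairs) / len(pairs)

independent = ising_sample(side=16, beta=0.0)   # no spatial correlation
correlated = ising_sample(side=16, beta=1.0)    # strong spatial correlation
```

Comparing `neighbour_agreement(independent)` with `neighbour_agreement(correlated)` gives a quick check that raising the inverse temperature increases spatial correlation in the stimuli.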

Figure 1: Overview of the model extraction algorithm.
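Input: training data and labels, validation data and labels
Initialize the victim model.
Initialize the extracted model.
Initialize an empty stimulus set.
Fit the victim model on the training data and labels.
Evaluate the victim model on the validation data and labels.
for each success probability p do
     for each example index n do
         for each pixel row i do
              for each pixel column j do
                   Sample pixel (i, j) of stimulus n from Bernoulli(p).
              end for
         end for
     end for
end for
Initialize an empty response set.
for each stimulus do
     Query the victim model and store its softmax response.
end for
Compute class weights given the argmax of the responses.
Fit the extracted model on the stimulus response pairs with the class weights.
Evaluate the extracted model on the validation data and labels.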
Algorithm 1 Stimulus response model extraction.
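
The query step of the algorithm only requires a callable that maps an image to a softmax vector. The sketch below illustrates that interface with a hypothetical stand-in, `toy_black_box`, which is not a trained CNN; the 8 by 8 image size, the three classes, and the success probabilities are likewise illustrative assumptions.

```python
import math
import random

def toy_black_box(image):
    """Hypothetical victim: maps an image to a 3-class softmax vector."""
    n_pixels = len(image) * len(image[0])
    frac = sum(sum(row) for row in image) / n_pixels
    logits = [4.0 * frac, 2.0, 4.0 * (1.0 - frac)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def collect_pairs(black_box, stimuli):
    """Query the black box once per stimulus and keep (stimulus, response)."""
    return [(x, black_box(x)) for x in stimuli]

# Build a small Bernoulli stimulus set: five 8-by-8 images per probability.
rng = random.Random(0)
stimuli = []
for p in (0.1, 0.5, 0.9):
    for _ in range(5):
        stimuli.append([[1 if rng.random() < p else 0 for _ in range(8)]
                        for _ in range(8)])

pairs = collect_pairs(toy_black_box, stimuli)
```

The resulting list of pairs is exactly what the extracted model is fitted on; nothing about the victim beyond its softmax outputs is used.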

4 Results

4.1 MNIST

We evaluate the efficacy of our framework by training the victim model on MNIST and then evaluating the performance of the extracted model on MNIST after extraction. The validation accuracies of the victim model and the extracted model are reported in Table 3. The distribution of the argmax of the victim model's responses can be found in Figure 2. The most underrepresented class according to the argmax of the responses was class 6.

4.2 KMNIST

For our experiments with KMNIST, the validation accuracies of the victim model and the extracted model are reported in Table 3. Class 8 was found to be the class with the fewest representatives according to the argmax of the victim model's responses.

4.3 Fashion MNIST

On the Fashion MNIST dataset, the validation accuracies of the victim model and the extracted model are reported in Table 3. For Fashion MNIST, the most underrepresented class according to the argmax of the victim model's responses was class 7 (sneaker), with only 12 of the random examples. Notably, the most common mispredictions according to Figure 3 were predicting class 5 (sandal) when the ground truth is class 7 (sneaker) and predicting class 5 (sandal) when the ground truth is class 9 (ankle boot). The extracted model seems to predict the majority of examples from shoe-like classes to be of class 5 (sandal).

4.4 notMNIST

We found that the notMNIST dataset had a more uniform class distribution according to the argmax of the victim model's responses than the other datasets that we evaluated. The class with the fewest representatives in this sense was class 9 (the letter j). Despite this potential advantage, the extracted model failed to generalize to the notMNIST validation set and, as can be seen in Figure 3, predicts class 5 (the letter e) in the vast majority of cases. The validation accuracies of the victim model and the extracted model are reported in Table 3.

4.5 The performance of different noise distributions

In evaluating the effect of sampling from different distributions to construct the stimulus dataset, we found that among the uniform, standard normal, standard Gumbel, and Bernoulli distributions and the Ising model, samples from the Ising model attained the highest accuracy when evaluating on the MNIST validation set. The results for each of the distributions can be found in Table 2. We postulate that this is due to the modelling of spatial correlations, a property that is lacking when sampling from the uniform, standard normal, standard Gumbel, and Bernoulli distributions, as their pixels are assumed to be i.i.d.

Distribution       Validation accuracy
Uniform
Standard Normal
Standard Gumbel
Bernoulli
Ising
Table 2: Performance using different noise distributions.

4.6 Extraction hardness resulting from data

We propose a measure of model extraction hardness resulting from the dataset on which the original model is trained: the ratio of the post-extraction validation accuracy (using the extracted model) to the pre-extraction validation accuracy (using the victim model) under our framework. We show that the resulting ratios align with the mainstream intuition regarding the general relative learnability of MNIST, KMNIST, Fashion MNIST, and notMNIST. The ratio for each dataset follows from the pre-extraction and post-extraction accuracies in Table 3.
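
The proposed measure is a single division, sketched below with basic input checks. The function name `extraction_ratio` is hypothetical, and the accuracies passed in the usage line are placeholders rather than results from our experiments.

```python
def extraction_ratio(post_acc, pre_acc):
    """Ratio of post-extraction to pre-extraction validation accuracy.

    Both accuracies are expected as fractions in [0, 1]; the
    pre-extraction accuracy must be strictly positive.
    """
    if not 0.0 < pre_acc <= 1.0:
        raise ValueError("pre_acc must be in (0, 1]")
    if not 0.0 <= post_acc <= 1.0:
        raise ValueError("post_acc must be in [0, 1]")
    return post_acc / pre_acc

# Placeholder accuracies for illustration only.
example_ratio = extraction_ratio(post_acc=0.4, pre_acc=0.8)
```

A ratio near 1 means the extracted model recovers almost all of the victim's validation accuracy from noise alone, while a ratio near 0 means the dataset's structure was not captured by the noise responses.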

4.7 The role of modelling spatial correlation

We found that the loss and accuracy demonstrate an Occam’s hill effect when the value of the inverse temperature parameter $\beta$ is varied, which, as Figure 6 demonstrates, is particularly clear in the cases of MNIST and KMNIST. In Figure 5, we see that across datasets, the losses tend to be minimized around similar values of $\beta$; however, the behavior at larger values of $\beta$ varies from dataset to dataset. We postulate that this is indicative of the different distributions of the amount of spatial correlation across each dataset. We also found that accuracy is maximized at the same value of $\beta$ for MNIST, KMNIST, and Fashion MNIST. The optimal setting of $\beta$ for notMNIST was reported separately, and the behavior there varies as $\beta$ increases from the optimal value.

Dataset    Pre-extraction accuracy    Post-extraction accuracy
Table 3: Performance on original dataset before and after extraction (measured on the validation set).

5 Conclusion and future work

In this paper, we demonstrated a framework for extracting model parameters by training a new model on random stimulus response pairs gleaned from the softmax output of the victim neural network. We went on to demonstrate the variation in model extractability based on the dataset on which the original model was trained. Finally, we proposed our framework as a method by which relative dataset complexity can be measured.

5.1 Future work

This is a work in progress and we are currently working along the following three directions:

  1. In our experiments, pixels are notably i.i.d., whereas in real-world settings, image data is comprised of pixels which are spatially correlated. In this vein, we intend to establish the relationship between the temperature of an Ising prior and the accuracy obtained by the stolen model.

  2. We will experiment with different architectures, specifically exploring the architecture-unknown scenario where the attacker has a fixed plug-and-play swiss-army-knife architecture whose weights are learned from the noise and the true-model softmax outputs.

  3. We will explore methods for constructing the stimulus dataset that give more uniform distributions over the argmax of the victim model's responses, and evaluate the associated effect on the performance of the extracted model.


Appendix A Additional figures

Figure 2: Distribution of classes given the argmax of the victim model's responses. From top to bottom: MNIST, KMNIST, Fashion MNIST, notMNIST.
Figure 3: Confusion matrices of the extracted model on the validation set. From top to bottom: MNIST, KMNIST, Fashion MNIST, notMNIST.
Figure 4: Examples of images from an Ising model simulation at various inverse temperature parameters.
Figure 5: Occam’s hill effect for loss when the inverse temperature parameter $\beta$ is varied. From top to bottom: MNIST, KMNIST, Fashion MNIST, notMNIST.
Figure 6: Occam’s hill effect for accuracy when the inverse temperature parameter $\beta$ is varied. From top to bottom: MNIST, KMNIST, Fashion MNIST, notMNIST.