Deep neural networks (DNNs) have lately been used very effectively in many applications, e.g., object detection (as exemplified by the ImageNet challenge), with state-of-the-art performance exceeding human-level capabilities; natural language processing, where text translation using DNNs with an attention mechanism has achieved remarkable results; playing highly-complex games (such as chess and Go) at a grandmaster level; generation of realistic-looking images; etc.
The recent impressive advancement of deep learning (DL) can be attributed to a number of factors, including: (1) enhancement of computational capabilities (e.g., using powerful graphical processing units (GPUs)), (2) improvement of network architectures, and (3) acquisition of vast amounts of training data. With the growing availability of powerful computational capabilities, much of the research has focused on innovative network architectures in pursuit of state-of-the-art performance in various problem domains. Some examples include: transferable architectures, which suggest a method of learning the model architectures directly on the dataset of interest, fractional max-pooling, and exponential linear units (ELUs), which provide a new activation function for improving learning characteristics.
In this paper, we focus mainly on exploiting the large amounts of unlabeled data that are typically available. Specifically, we introduce a new DeepMimic training methodology, demonstrating its effectiveness with respect to object classification based on the use of CNNs.
Occupied mainly with the performance of DNNs in numerous applications, researchers may tend to overlook various aspects of the learning process, e.g., the specific manner in which supervised learning (i.e., the training of a network using labeled data) is performed. In the case of multi-class classification, each data item is associated with a class label, represented by a “one-hot” encoding vector. (The dimension of a one-hot encoding vector is the number of possible classes in the dataset, such that the correct class index contains ’1’ and all other indexes contain ’0’.) It is reasonable to assume that a label distribution different from the one-hot vector representation might convey extra insight or knowledge about the model, thereby changing the training process significantly.
To explore this idea, we need a meaningful label distribution, which we obtain using the proposed DeepMimic paradigm. In our method, we use a relatively small subset of our data to perform supervised training of our mentor model, while treating the rest of the dataset as unlabeled data, i.e., ignoring its labels completely. Once the mentor is trained, we use it to create a label distribution by outputting the softmax components for each data item in the unlabeled dataset. During the data splitting process, the one-hot labels are used merely to ensure a balanced dataset split. We later show that this might not actually be required, based on our empirical results for the unbalanced dataset, which yield the same accuracy as for the balanced dataset.
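As a rough sketch of how such soft labels arise, the mentor's softmax output turns raw logits into a full distribution rather than a one-hot vector (the logits below are hypothetical; the actual mentor is a trained CNN):

```python
import math

def softmax(logits):
    """Convert raw network outputs (logits) into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical mentor logits for one unlabeled image:
soft_label = softmax([2.0, 0.5, 0.1])
# The components sum to 1 and every class gets nonzero mass,
# unlike a one-hot label.
```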
Using the unlabeled data and the label distribution produced by the mentor model, we train a student model. We are able to achieve performance comparable to the mentor’s, with a student model that is simpler, shallower, and substantially faster. In other words, our method can extract a model’s knowledge and successfully transfer it to another model using essentially no labeled data. These remarkable results suggest that the presented method can be used in many applications. One can take advantage of large amounts of unlabeled data and mimic a black-box trained model, without even knowing its architecture or the labeled data used for its training. For example, an individual can purchase a neural network-based product and create a copy of it, which will match the original product’s performance, with no access to the data used to train the product. Finally, the student model can have a substantially simpler architecture, and therefore a much faster inference time, which is very important in production for various real-life systems and services.
Many real-life problems have led to interesting, innovative DL techniques, such as those pertaining to the mentor-student learning process. These methods suggest a less strict training of the mentor, with the overall gain of lowering the student’s risk of overfitting. One such work presents a class-distance loss that assists mentor networks in forming densely-clustered vector spaces, which are easy for a student network to learn from. Another focuses on enhancing the robustness of the student network without sacrificing performance. Model compression, an early line of research in this direction, presents a way of compressing the function learned by a complex model into a much smaller, faster model.
The problems addressed in that early work are considered rather simple nowadays, and thus the method should be re-established on more challenging problems. Furthermore, years ago, when the Internet was much less developed and considerably smaller amounts of data were available, the focus was directed at the ability to generate synthetic data for training and development purposes. With tens of zettabytes (a zettabyte is 10^21 bytes) of data available online, acquiring unlabeled data is no longer an issue. Currently, the main interest is to develop ways of exploiting these data efficiently.
The knowledge captured by a trained model is extremely valuable. During such knowledge transfer, the method of training on soft labels, i.e., using a vector of class probabilities (summing up to 1) as labels, seems to provide much more information to the training process than training with one-hot vectors only. This supports the notion that training based on one-hot labels may not be ideal.
Another interesting aspect of soft-label training is its regularization effect. Regularization techniques for preventing overfitting and achieving better generalization consist mainly of dropout, i.e., randomly “shutting down” some of the neurons; DropConnect, which randomly cancels synapses between neurons in a very similar way to dropout; random noise addition; and weight decay, techniques also referred to as L1 and L2 regularization. Another relevant work is the mixup paper, which shows that averaging training examples and their labels, e.g., creating a new image and its label as a weighted average of two original images and the two one-hot vectors used as their labels, improves regularization. It is also possible to transfer knowledge between different types of networks, e.g., from a recurrent neural network (RNN) to a DNN.
Mimicking a model’s predictions in order to obtain its knowledge has been researched from various angles. In one line of work it is used to transfer knowledge from one domain to another, in order to generalize it and teach a reinforcement learning agent how to behave in multiple tasks simultaneously. In our case, we mimic a mentor model and try to acquire its knowledge as well; yet, we always remain in the same domain and try to maximize the student’s performance there. In another work, the authors show that their method can extract the policy of a reinforcement learning agent and train a new network, which is dramatically smaller and more efficient, while performing at a level comparable to the agent’s. Thinner and deeper student models are presented in the FitNets work; the method discussed allows using not only the outputs but also the intermediate representations learned by the mentor as hints to improve the training process and the final performance of the student. Elsewhere it is argued that even though a student model does not have to be as deep as its mentor, it requires the same number of convolutional layers in order to learn functions of comparable accuracy. According to those results, the large gap between CNNs and fully-connected DNNs cannot be significantly reduced, as long as the student model does not consist of multiple convolutional layers.
The difference from our work is that in those studies both mentor and student models are trained over the entire dataset, i.e., there are no data seen only by the student, as in our case. For now, state-of-the-art results on visual tasks are achieved by CNNs, as classical fully-connected DNNs simply cannot compete with them. Even though the limits of DNNs can be pushed further, they are no match for the CNN architecture, which relies on local correlations in a given image. Our method may enable DNNs to overcome this boundary, since it does not use the regular training procedures that have failed to do so. Note that we can alter the mentor model as we see fit, and rely on the soft labels it predicts in order to train a student model, regardless of its architecture.
3 DeepMimic Training
3.1 Data Split
When it comes to available data, our goal is to simulate real-life scenarios. In such scenarios, we would usually have huge amounts of unlabeled data; most of the time, these data are considered useless, unless used, for example, for training autoencoders.
In order to simulate such a scenario, we choose a ratio between the mentor’s training data and that of the student’s, such that there is a sufficient amount of unlabeled data to train the student and a sufficient amount of training data for the mentor model to reach good performance on the test set. We performed this experiment on the following datasets: MNIST , CIFAR-10 , and Tiny ImageNet. All the training data chosen for the student model are treated as unlabeled data, i.e., we ignore the labels as described in the next section.
The ratio chosen for the data split is 1:4, which produced the best results after testing various split ratios, considering the need for sufficient training data for the student model. In most experiments, all the images are randomly assigned so as to create balanced datasets. In other words, by splitting the data randomly, we ensure that for each image of a certain label in the mentor dataset, there are four images of the same label in the student training set. This way, both the mentor and the student datasets contain an equal number of images per label, i.e., the number of images per class is balanced. In order to simulate scenarios where the available data distribution is unknown and unbalanced, in some of the experiments we modified the student dataset by forcing a different number of samples in each class, as described in Section 4.3. We did so by either removing a random number of samples from each class in the student dataset or adding a random number of out-of-domain images to the student training set. Regardless, it seems the student is bound only by its mentor’s accuracy rate, i.e., even with huge amounts of data for the student, it could not be significantly better than its mentor. As for testing, we used the original test set of each dataset to test both models. Since the datasets are fairly limited in size, we decided not to split the data into training, validation, and test sets; instead, we use all of the available data for training and testing.
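The balanced 1:4 split described above could be sketched as follows (a simplified illustration; the function name and structure are ours, not the paper's):

```python
import random
from collections import defaultdict

def split_1_to_4(labels, seed=0):
    """Per-class random split: 1/5 of the indices go to the mentor set,
    the remaining 4/5 to the student set (whose labels are then discarded)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    mentor_idx, student_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        k = len(idxs) // 5  # one fifth of this class for the mentor
        mentor_idx.extend(idxs[:k])
        student_idx.extend(idxs[k:])
    return mentor_idx, student_idx
```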
3.2 Training Method
In the training process, we first train the mentor model using its assigned dataset. Regularization methods, such as dropout, were used extensively in order to reach high accuracy on the test set. Since we consider mainly classification problems, the last layer of each model is a softmax layer, which normalizes the output and provides a probability for each possible class (with all probabilities summing up to 1). The training uses the stochastic gradient descent (SGD) algorithm and the cross-entropy loss. Once the mentor is well trained, we can predict a soft label for each image in the student dataset. By doing that, we generate an estimate for each image while still ignoring all the real labels. We then train the student model, using its assigned data and the soft labels generated by the mentor model. For the student training, we also use SGD and the cross-entropy loss. In the student model training, regularization is less needed, since training on the soft labels creates a very strong generalization in the training process. The student reaches the mentor’s accuracy on the test set in all experiments. Based on the performance of shallow students on the test sets, it is clear that the student architecture does not have to be similar to the mentor’s for the performance to remain almost identical on the test set. In all the classification tasks we worked on, the reduced student network consistently maintained the mentor’s performance.
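The student's loss is ordinary cross-entropy, except that the target is the mentor's soft distribution instead of a one-hot vector. A minimal sketch of the loss itself (our own illustration, not the training code used in the paper):

```python
import math

def soft_cross_entropy(student_probs, target_probs):
    """Cross-entropy of the student's predicted distribution against a target
    distribution. With a one-hot target this reduces to -log p[correct_class];
    with the mentor's soft labels, every class contributes to the loss."""
    return -sum(t * math.log(max(p, 1e-12))
                for t, p in zip(target_probs, student_probs))
```

For example, against the one-hot target [1, 0, 0] a student prediction of [0.7, 0.2, 0.1] costs -log 0.7, exactly the standard classification loss.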
MNIST is a relatively simple dataset containing handwritten digit images; it is ideal for performing a “sanity check” of the method. It contains 70,000 grayscale images (of 28×28 pixels each), 60,000 of which are used for training the model and the remaining 10,000 for testing it.
As mentioned in the previous section, we use 20% of the training set for the mentor training; after the mentor is trained, we use it and the remaining 80% of the data to create the soft label distributions. In the experiments reported below, we tested a student model identical to the mentor model, as well as shallower, more simplified student models. The mentor’s accuracy is relative to the amount of data used for training; it is not expected to reach state-of-the-art results with only one fifth of the original training data. This is true for all models trained on a small subset of the standard dataset. As can be seen from Table 1 and Figure 2, the Mentor and Student-A (i.e., the model with the identical architecture) reach almost identical results (i.e., identical loss and accuracy) on the test set, while none of the unlabeled data used for training Student-A are ever used to train the Mentor. Student-B reaches very close results as well, i.e., it is possible to create a rather simplified student model that successfully mimics a mentor without knowing its architecture.
CIFAR-10 is an established dataset used for object recognition. It consists of 60,000 RGB images (of 32×32 pixels each) in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images in the official data.
We used deeper networks for this task; as before, the student networks manage to achieve very good results compared to the mentor’s, using various network architectures.
Models’ test loss and accuracy over 150 epochs for CIFAR-10 dataset; Students are averaged over multiple runs to show consistent results. In contrast to Mentor’s spiky and increasing loss function, Student models remain steady and consistent, owing to the very strong regularization of soft label training.
As can be seen from Table 2 and Figure 4, Student-A matches the Mentor’s performance, and Student-B reaches a very high accuracy compared to that of the Mentor (only 0.76% lower), which serves as its only training source. Finally, Student-C still reaches good results (only 3.51% lower than the Mentor’s accuracy), despite its substantially shallower architecture.
It might also be of interest to observe the training of a student model on 80% of the data using one-hot labels instead of the mentor’s predictions, i.e., treating it as a simpler mentor model trained on more data. There is a limit, of course, to how much the model can be simplified while still obtaining better accuracy than the original mentor, which trains on merely 20% of the data. In our case, Student-B and Student-C reach accuracy rates of 77.22% and 72.64%, respectively, while the original Mentor reaches an accuracy rate of 73.14%. Note that these models have four times more data to train on, despite their simpler architectures.
4.3 Experiments with Unbalanced CIFAR-10 Data
4.3.1 Reduced Student Dataset Samples
In the following experiment we tested our method on an unbalanced student dataset, as follows. After splitting the dataset by a 20%-80% ratio and creating a balanced student dataset, we decreased the number of samples in each class by a randomly chosen fraction to obtain an unbalanced dataset for the student. This was done on the training data alone, keeping the test set intact. The results obtained are presented in Table 3. We executed the experiment multiple times with different reduction bounds per class (i.e., different bounds on the fraction of samples removed from each class). Even for very large ratio bounds, i.e., where the amount of data available to the student model is decreased drastically, the student performance remains rather stable and the method still shows good accuracy.
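The per-class reduction could be sketched as follows (an illustrative reconstruction; the exact sampling procedure used in the experiments may differ):

```python
import random

def reduce_per_class(class_indices, max_fraction, seed=0):
    """For each class, remove a randomly chosen fraction (up to max_fraction)
    of its samples, yielding an unbalanced student dataset.
    class_indices: dict mapping class label -> list of sample indices."""
    rng = random.Random(seed)
    kept = {}
    for label, idxs in class_indices.items():
        frac = rng.uniform(0.0, max_fraction)      # fraction to drop
        n_keep = len(idxs) - int(frac * len(idxs))
        kept[label] = rng.sample(idxs, n_keep)
    return kept
```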
4.3.2 Added Out-of-domain Student Dataset Samples
Having shown that an unbalanced dataset for the student model (generated by removing at random large numbers of samples from the balanced dataset) has little effect on the performance, we now demonstrate the effect of adding “out-of-domain” random data to the student dataset, by testing our models on this newly created dataset. Specifically, the student dataset is modified by adding samples whose labels are very different from the categories contained in the CIFAR-10 dataset, so as to ensure the added data are unrelated to the student dataset. The labels of the added samples, taken from the CIFAR-100 dataset, include, for example, Flowers, Food Containers, Fruits and Vegetables, Household Electrical Devices and Furniture, Trees, and Insects. As before, we use for each experiment a specified fraction limit per class on the number of samples added at random from the other categories. The results are presented in Table 4; as can be seen, the models perform very well, reaching good accuracy with no disruption caused by the addition of out-of-domain data.
4.4 Tiny ImageNet
The Tiny ImageNet dataset is the most challenging dataset we have applied our method to. The training data consist of 100,000 RGB images (of 64×64 pixels each) in 200 classes, with 500 images per class. There are 10,000 images in the validation set and 10,000 in the test set. As can be seen from Table 5, the architecture used for these networks is much deeper. This makes it possible to demonstrate the effect of removing a substantial number of layers with almost no negative impact on the model’s performance. Note that Student-B and Student-C have much simpler architectures; yet, their results are very close to the Mentor’s.
Images successfully classified by both Mentor and Student-A.
As reported in prior work on this dataset, obtaining over 55% accuracy on the test set is an impressive result; in contrast, a random guess yields only 0.5% accuracy. Therefore, and considering that only a fifth of the original training data is used for training, obtaining over 20% accuracy on the test set for the Mentor is satisfactory as well. The result demonstrates our method’s effectiveness for this dataset. Table 5 and Figure 5 show that both Student-A and Student-B definitely match the Mentor’s performance. Student-C is the shallowest model we use; still, it achieves only 0.85% less accuracy than the Mentor, attesting to the method’s effectiveness and impressive results, even when applied to highly-complex and involved datasets. Figures 6 and 7 contain images classified correctly by both the Mentor and Student-A, and images classified differently by the two, respectively.
4.5 Inference Time Measurements
Table 6 columns: Dataset, Model, GeForce GTX 1050 Ti, GeForce GTX 1070.
We have also tested comparative inference times (in seconds) for each student model versus its associated mentor, running on the test sets that correspond to the datasets experimented with (see Table 6). Each model was tested on two different GPU architectures, with the results averaged over 100 executions. When using a more complex and deeper network, which is usually the case in real-life scenarios, the time reduction is more significant, and may allow for much faster data processing. Sometimes the student seems to slightly surpass the mentor; this behavior was observed mostly for student models that are replicas of the mentor, or students with relatively little reduction in architecture. Determining whether a smaller, albeit less accurate, model should be used instead of a larger, more accurate one is an interesting question. For DNN-based cloud services, the answer would probably be never, as such services usually rely on very strong and expensive hardware, so we would not be limited by any restrictions and would simply use the most accurate model. However, embedded devices, which usually do not rely on strong hardware or a stable internet connection, e.g., a cell phone or an IoT (Internet of Things) device, are far more limited in size, memory, and power. Manufacturers usually develop extremely small and less powerful hardware, in order to keep the product small, elegant, and rather inexpensive. These limitations are quite problematic when one is interested in deploying a massive model on a product. In such scenarios, creating a significantly smaller and faster model would enable deployment on smaller hardware, so it is highly likely that manufacturers would rather employ a less accurate model than a more accurate one that cannot be embedded in their products.
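The timing protocol can be sketched as follows (simple wall-clock averaging over repeated executions; the hardware and models come from Table 6, not this snippet):

```python
import time

def avg_inference_time(model_fn, batch, runs=100):
    """Average wall-clock time of model_fn(batch) over `runs` executions."""
    start = time.perf_counter()
    for _ in range(runs):
        model_fn(batch)
    return (time.perf_counter() - start) / runs
```

In practice, GPU timing also requires a few warm-up runs and device synchronization before reading the clock; the sketch above shows only the averaging.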
In this paper, we have presented a novel approach for training deep neural networks. Our DeepMimic method relies on utilizing two models, which are not necessarily identical. We have shown that reducing the student model’s complexity has a minor effect on its success rate compared to the mentor’s. According to this empirical evidence, it is possible to mimic a black-box mentor model with an unknown architecture and reach the same accuracy. In a series of experiments, we have shown that for both balanced and unbalanced training data available for the student, the method manages to mimic the mentor model successfully. One only needs to exploit large amounts of unlabeled data, which is the expected scenario in real-life situations. Our method raises serious security implications, as one can “duplicate” a proprietary neural network, by creating a copy of it without having access to the original training data. The method presented yields impressive results and exploits large amounts of unlabeled data for training, without having to manually tag them. We have worked solely on CNNs for both the mentor and student models. Our method can be further extended and used to explore the relations between different types of networks, e.g., a fully-connected network and a CNN.
This could prove to be a key factor in obtaining, extracting, and transferring knowledge between different types of networks, thereby pushing the performance level further.
6 Future Work
As can be seen from Table 5, the larger the network, the easier it is to reduce its size significantly with only a small loss in accuracy. In such cases, the effect on the inference time is more noticeable, and such compressed networks have an advantage, as shown in Table 6. Therefore, we would prefer to test our method on deeper networks such as VGG and ResNet, expecting to create models with even further improved inference times. So far we have experimented mainly with CNNs for classification problems, but it is of interest to explore the effect of DeepMimic in other problem domains, e.g., networks designed for detection and segmentation. Such networks usually perform feature extraction on the input and rely on massive architectures to do so; we expect our method to be very beneficial in these domains.
An additional idea that might lead to a much smaller, yet more accurate, student is to distill multiple mentor models into a single student model. By doing so, the student training data can be increased by using multiple mentors to generate them, or the different mentor predictions can be averaged, hopefully making the student more accurate.
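Averaging several mentors' predictions for one sample could be as simple as the following hypothetical sketch:

```python
def averaged_soft_label(mentor_predictions):
    """Average the soft labels predicted by several mentors for one sample.
    mentor_predictions: a list of probability vectors, one per mentor."""
    n = len(mentor_predictions)
    return [sum(class_probs) / n for class_probs in zip(*mentor_predictions)]

# Two disagreeing mentors yield an averaged, softer target:
print(averaged_soft_label([[1.0, 0.0], [0.0, 1.0]]))  # -> [0.5, 0.5]
```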
An interesting work on CNN classifiers using low-shot learning is given in the imprinted-weights paper. The idea is to enable a model to successfully classify a newly seen category after being presented with merely a few training examples, a notion resembling the way human vision works. The authors use a CNN as an embedding extractor, and after a classifier is trained, the embedding vectors of new low-shot examples are used to imprint weights for new classes in the extended classifier. As a result, the new model is able to classify examples belonging to a novel category well after seeing only a few of them. Combining that work with DeepMimic might be very interesting, in the following sense. While using a mentor model trained on specific categories, upon the arrival of a novel category it might be easier to implant the new category in a student model by combining the two training processes. It is possible that a student model would adjust more naturally to new categories during the training process itself than an already trained model would.
-  (2017) SoftTarget regularization: an effective technique to reduce over-fitting in neural networks. In Proceedings of the IEEE International Conference on Cybernetics, Exeter, UK, pp. 1–5. External Links: Cited by: §2, §3.2.
-  (2014) Do deep nets really need to be deep?. In Advances in Neural Information Processing Systems, Vol. 27, Montreal, Quebec, Canada, pp. 2654–2662. Cited by: §2.
-  Autoencoders, unsupervised learning, and deep architectures. In Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, Vol. 27, Edinburgh, Scotland, pp. 37–50. Cited by: §3.1.
-  (2006) Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Philadelphia, Pennsylvania, pp. 535–541. External Links: Cited by: §2.
-  (2015) Transferring knowledge from a RNN to a DNN. arXiv preprint arXiv:1504.01483. Cited by: §2.
-  (2016) Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations, San Juan, Puerto Rico. Cited by: §1.
-  (2018) Copycat CNN: stealing knowledge by persuading confession with random non-labeled data. In International Joint Conference on Neural Networks, Rio, Brazil, pp. 1–8. External Links: Cited by: §2.
-  (2016) DeepChess: end-to-end deep neural network for automatic learning in chess. In Proceedings of the International Conference on Artificial Neural Networks, Barcelona, Spain, pp. 88–96. External Links: Cited by: §1.
-  (2009) ImageNet: a large-scale hierarchical image database. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Miami Beach, Florida, pp. 248–255. External Links: Cited by: §1.
-  (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Montreal, Quebec, Canada, pp. 2672–2680. Cited by: §1.
-  (2014) Fractional max-pooling. arXiv preprint arXiv:1412.6071. Cited by: §1.
-  (2018) Robust student network learning. arXiv preprint arXiv:1807.11158. Cited by: §2.
-  (2016) Deep Residual Learning for Image Recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, Nevada, pp. 770–778. External Links: Cited by: §6.
-  (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. Cited by: §2.
-  (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
-  (2017) Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507. External Links: Cited by: §1.
-  (2017) Transferring knowledge to smaller network with class-distance loss. In International Conference on Learning Representations Workshop, Cited by: §2.
-  (2009) Learning multiple layers of features from tiny images. Technical report University of Toronto. Cited by: §3.1.
-  (1992) A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pp. 950–957. Cited by: §2.
-  (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. External Links: Cited by: §1.
-  (2010) MNIST handwritten digit database. Note: http://yann.lecun.com/exdb/mnist/ External Links: Cited by: §3.1.
-  (2016) Whiteout: gaussian adaptive noise regularization in feedforward neural networks. arXiv preprint arXiv:1612.01490. Cited by: §2.
-  (2015) How far can we go without convolution: improving fully-connected networks. arXiv preprint arXiv:1511.02580. Cited by: §2.
-  (2015) Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1412–1421. External Links: Cited by: §1.
-  (2019) Stealing knowledge from protected deep neural networks using composite unlabeled data. In Proceedings of the International Joint Conference on Neural Networks, Budapest, Hungary. Cited by: §2.
-  (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the International Conference on Machine Learning, Banff, Alberta, Canada, pp. 78. External Links: Cited by: §2.
-  (2016) Actor-mimic: deep multitask and transfer reinforcement learning. In International Conference on Learning Representations, San Juan, Puerto Rico. Cited by: §2.
-  (2018) Low-shot learning with imprinted weights. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, pp. 5822–5830. External Links: Cited by: §6.
-  (2015) Fitnets: hints for thin deep nets. In International Conference on Learning Representations, San Diego, California. Cited by: §2.
-  (2016) Policy distillation. In International Conference on Learning Representations, San Juan, Puerto Rico. Cited by: §2.
-  (2017) Mastering the game of Go without human knowledge. Nature, pp. 354–359. External Links: Cited by: §1.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §6.
-  (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, pp. 1929–1958. Cited by: §2.
-  (2017) Do deep convolutional nets really need to be deep and convolutional?. In International Conference on Learning Representations, Toulon, France. Cited by: §2.
-  (2013) Regularization of neural networks using DropConnect. In Proceedings of the International Conference on Machine Learning, Atlanta, Georgia, pp. 1058–1066. Cited by: §2.
-  (2017) Tiny Imagenet Challenge. External Links: Cited by: §3.1.
-  (2018) Knowledge distillation in generations: more tolerant teachers educate better students. arXiv preprint arXiv:1805.05551. Cited by: §2.
-  (2015) Tiny ImageNet classification with convolutional neural networks. External Links: Cited by: §4.4.
-  (2018) Mixup: beyond empirical risk minimization. In Proceedings of the International Conference on Learning Representations, Vancouver, British Columbia, Canada. Cited by: §2.
-  (2017) Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012. External Links: Cited by: §1.