This repository contains the architectures, models, logs, etc. pertaining to the SimpleNet paper (Lets keep it simple: Using simple architectures to outperform deeper architectures).
Major winning Convolutional Neural Networks (CNNs), such as AlexNet, VGGNet, ResNet and GoogleNet, include tens to hundreds of millions of parameters, which impose considerable computation and memory overhead. This limits their practical use for training, optimization and memory efficiency. In contrast, light-weight architectures proposed to address this issue mainly suffer from low accuracy. These inefficiencies mostly stem from following an ad hoc design procedure. We propose a simple architecture, called SimpleNet, based on a set of design principles, and we empirically show that SimpleNet provides a good tradeoff between computation/memory efficiency and accuracy. Our simple 13-layer architecture outperforms most of the deeper and more complex architectures to date, such as VGGNet, ResNet and GoogleNet, on several well-known benchmarks while having 2 to 25 times fewer parameters and operations. This makes it well suited to embedded systems or systems with computational and memory limitations. We achieved state-of-the-art results on standard datasets such as CIFAR10, outperformed several heavier architectures including but not limited to AlexNet on ImageNet, and obtained very good results on datasets such as CIFAR100, MNIST and SVHN. In our experiments we show that SimpleNet is more efficient in terms of computation and memory overhead compared to the state of the art. Models are made available at: https://github.com/Coderx7/SimpleNet
Since the resurgence of neural networks, deep learning methods have achieved huge success in diverse fields of application, among which semantic segmentation, classification, object detection, image annotation and natural language processing are a few to mention Guo et al. (2015). What has made this enormous success possible is the ability of deep architectures to learn features automatically, eliminating the need for a feature engineering stage. In that stage, which is the most important one in a traditional pipeline, the preprocessing and data transformations are designed using human ingenuity and prior knowledge Bengio et al. (2013), and they have a profound effect on the end result. Feature engineering depends heavily on the engineer's experience and expertise, and if done poorly the result is disappointing. It also does not scale or generalize well to other tasks. In deep learning methods, by contrast, feature learning is carried out automatically and efficiently instead of through manual and troublesome feature engineering. Deep methods also scale very well to tasks of a different nature. This has proved extremely successful, as the diversity of fields in which they are used shows.
The rest of the paper is organized as follows: Section 2 presents the most relevant works. In Section 3 we present our architecture and the set of design principles used in its design. Section 4 presents the experimental results on 4 major datasets (CIFAR10, CIFAR100, SVHN and MNIST) and explains the architecture and the changes pertaining to each dataset in more detail. Finally, conclusions and future work are summarized in Section 5 and acknowledgments are given in Section 6.
In this section, we review the latest trends in related works in the literature. We categorize them into 4 sections and explain them briefly.
Designing more effective networks has been desirable, and attempted, since the advent of neural networks Fukushima (1979, 1980); Ivakhnenko (1971). With the advent of deep learning methods, this desire manifested itself in the form of creating deeper and more complex architectures Ciresan et al. (2010); Cireşan et al. (2011); Cireşan et al. (2012); He et al. (2015b); Alex et al. (2012); Simonyan & Zisserman (2014); Srivastava et al. (2015); Szegedy et al. (2015); Zagoruyko & Komodakis (2016). This was first attempted and popularized by Ciresan et al. (2010), who trained a 9-layer MLP on a GPU, a practice then followed by other researchers Cireşan et al. (2011); Cireşan et al. (2012); Ciregan et al. (2012); He et al. (2015b); Alex et al. (2012); Simonyan & Zisserman (2014); Srivastava et al. (2015); Szegedy et al. (2015); Zagoruyko & Komodakis (2016).
In 2012 Alex et al. (2012) created a deeper version of LeNet5 Lecun et al. (1998) with 8 layers, called AlexNet. Unlike LeNet5, it had local contrast normalization, the ReLU Nair & Hinton (2010) nonlinearity instead of Tanh, and a new regularization layer called Dropout Hinton et al. (2012). This architecture achieved state of the art on ILSVRC 2012. The same year, Le (2013) trained a gigantic network with 1 billion parameters; their work was later followed by Coates et al. (2013), in which an 11 billion parameter network was trained. Both of them were ousted by the much smaller AlexNet Alex et al. (2012).
kernels, which they call an Inception module. Using this architecture they could decrease the number of parameters drastically compared to former architectures. They ranked first in the ImageNet challenge that year. They later revised their architecture and used two consecutive 3×3 conv layers with 128 kernels instead of the previous 5×5 layers; they also used a technique called Batch-Normalization Ioffe & Szegedy (2015) for reducing internal covariate shift. This technique provided improvements in several respects, which are explained thoroughly in Ioffe & Szegedy (2015). They achieved state of the art results in the ImageNet challenge.
Srivastava et al. (2015) released their highway networks, inspired by Long Short Term Memory (LSTM) recurrent networks, in which they used the initialization method proposed by He et al. (2015a) and created a special architecture that uses adaptive gating units to regulate the flow of information through the network. They created a 100-layer network, also experimented with a 1K-layer network, and reported that such networks are easy to train compared to plain ones. Their contribution was to show that deeper architectures can be trained with simple stochastic gradient descent.
investigated the effectiveness of combining residual connections with their Inception-v3 architecture. They gave empirical evidence that training with residual connections accelerates the training of Inception networks significantly, and reported that residual Inception networks outperform similarly expensive Inception networks by a thin margin. With these variations the single-frame recognition performance on the ILSVRC 2012 classification task Russakovsky et al. (2015) improves significantly. With an ensemble of three residual networks and one Inception-v4, they achieved 3.08 percent top-5 error on the test set of the ImageNet classification challenge. The same year, Zagoruyko & Komodakis (2016) ran a detailed experiment on residual nets He et al. (2015b) and came up with a novel architecture called Wide Residual Net (WRN), in which, instead of a thin deep network, they increased the width of the network and decreased its depth. They showed that the new architecture does not suffer from the diminishing feature reuse problem Srivastava et al. (2015) or from slow training time. They report that a 16-layer wide residual network outperforms any previous residual network architecture. They experimented with varying the depth of their architecture from 10 to 40 layers and achieved state of the art results on CIFAR10/100 and SVHN.
The computational and memory overhead caused by such practices limits the expansion and applications of deep learning methods. There have been several attempts in the literature to get around such problems. One of them is model compression, which tries to reduce the computational overhead at inference time. It was first researched by Buciluǎ et al. (2006), who tried to create a network that performs like a complex and large ensemble. In their method they used the ensemble to label unlabeled data, with which they trained the new neural network, thus learning the mappings learned by the ensemble and achieving similar accuracy. This idea was developed further by Ba & Caruana (2014). They proposed a similar concept, but this time they tried to compress deep and wide networks into shallower but even wider ones. Hinton et al. (2015) introduced their model compression method, called Knowledge Distillation (KD), which uses a teacher/student paradigm for transferring knowledge from a deep, complex teacher model, or an ensemble of such models, to less complex yet still similarly deep student models. Each student model can provide similar performance overall and perform better on the fine-grained classes that the teacher model confuses, which eases the training of deep networks. Inspired by Hinton et al. (2015), Romero et al. (2014) proposed a novel architecture to address what they referred to as not taking advantage of depth in previous works on model compression for Convolutional Neural Networks. Previously, all works tried to compress a teacher network or an ensemble of networks either into networks of similar width and depth or into shallower and wider ones. Romero et al. (2014) instead proposed an approach for training thin and deep networks, called FitNets, to compress wide and shallower (but still deep) networks. Their method is based on Knowledge Distillation (KD) Hinton et al. (2015) and extends the idea to allow for thinner and deeper student models. They introduce intermediate-level hints from the teacher's hidden layers to guide the training process of the student, and they showed that their model achieves the same or better accuracy than the teacher models.
In late 2015 Han et al. (2015) released their work on model compression. They introduced “deep compression”, a three-stage pipeline of pruning, trained quantization and Huffman coding, whose stages work together to reduce the storage requirement of neural networks by 35 to 49 times without affecting their accuracy. In their method, the network is first pruned by learning only the important connections. Next, the weights are quantized to enforce weight sharing; finally, Huffman coding is applied. After the first two steps they retrain the network to fine-tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9 to 13 times; quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, their method reduced the storage required by AlexNet by 35 times, from 240MB to 6.9MB, without loss of accuracy.
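The figures quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the "naive" ratio deliberately ignores the cost of storing sparse indices and the codebook as well as the gain from Huffman coding, so it should be read as a rough bound rather than as the computation used by Han et al. (2015).

```python
def naive_compression_ratio(prune_factor: float, bits_before: int, bits_after: int) -> float:
    """Pruning factor times bit-width reduction.

    Ignores sparse-index and codebook storage and any Huffman-coding gain,
    so it is only a rough bound on what pruning plus quantization can give.
    """
    return prune_factor * bits_before / bits_after


def storage_ratio(size_before_mb: float, size_after_mb: float) -> float:
    """Ratio of model sizes before and after compression."""
    return size_before_mb / size_after_mb


# Figures quoted in the text: 9x to 13x pruning, 32 -> 5 bits per connection.
low = naive_compression_ratio(9, 32, 5)
high = naive_compression_ratio(13, 32, 5)

# AlexNet storage quoted in the text: 240MB -> 6.9MB.
alexnet = storage_ratio(240, 6.9)

print(f"naive bound: {low:.1f}x to {high:.1f}x, AlexNet reported: {alexnet:.1f}x")
```

The naive bound comes out larger than the reported 35 to 49 times, which is plausibly due to the bookkeeping a sparse, weight-shared format needs; that explanation is an assumption here, not a claim taken from the text.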
In 2014 Springenberg et al. (2014) released their paper, in which the effectiveness of simple architectures was investigated. The authors intended to come up with a simplified architecture, not necessarily shallower, that would perform better than the more complex networks of the time. Later in 2015, they proposed different versions of their architecture and studied their characteristics, and using a 17-layer version of their architecture they achieved a result very close to the state of the art on CIFAR10 with intense data augmentation.
In 2016 Iandola et al. (2016) released their paper, in which they proposed a novel architecture called SqueezeNet, a small CNN architecture that achieves AlexNet-level accuracy on ImageNet with 50 times fewer parameters. To our knowledge this is the first architecture that tried to be small and yet achieve good accuracy.
In this paper, we tried to come up with a simple architecture that exhibits the best characteristics of these works, and we propose a 13-layer convolutional network that achieves state of the art results on CIFAR10 (footnote 1: While preparing our paper we found out that our record was beaten by Wide Residual Net, which we then addressed in related works. We still hold the state of the art record without data augmentation, i.e., without zero padding and normalization. We also hold the state of the art in terms of accuracy/parameters ratio). Our network has fewer parameters (2 to 25 times fewer) than all previous deep architectures, and performs either better than them or on par despite the huge difference in number of parameters and depth. Compared with architectures such as SqueezeNet/FitNet, which have fewer parameters than ours but are also deeper, our network's accuracy is far superior. Our architecture is also the shallowest architecture that both has a small number of parameters compared to all leading deep architectures and, unlike previous architectures such as SqueezeNet or FitNet, gives higher or very competitive performance against all deep architectures. Our model can then be compressed using deep compression techniques and be further enhanced, resulting in a very good candidate for many scenarios.
We propose a simple convolutional network with 13 layers. The network employs a homogeneous design utilizing 3×3 kernels for convolutional layers and 2×2 kernels for pooling operations. Figure 1 illustrates the proposed architecture.
The only layers that do not use 3×3 kernels are the 11th and 12th layers, which utilize 1×1 convolutional kernels. Feature-map down-sampling is carried out using non-overlapping max-pooling. In order to cope with the problem of vanishing gradients and also with over-fitting, we used batch normalization with a moving average fraction of 0.95 before every ReLU non-linearity. We also used weight decay as a regularizer. A second version of the architecture uses dropout to cope with over-fitting. Table 1 shows different architectures and their statistics, among which our architecture has the lowest number of parameters and operations. The extended list is provided in the appendix.
We used several principles in our work that helped us manage different issues much better and achieve desirable results. Here we present these principles with a brief explanation concerning the intuitions behind them:
In order to better manage the computational overhead, parameter utilization efficiency and network generalization power, start with a small and thin network, and then gradually expand it. Neither the depth nor the number of parameters is a good indicator of how a network should perform. They are neutral factors that are only beneficial when utilized mindfully; otherwise, the design results in an inefficient network imposing unwanted overhead. Furthermore, fewer learnable parameters decrease the chance of over-fitting, and together with sufficient depth they increase the network's generalization power. In order to utilize both depth and parameters more efficiently, design the architecture in a symmetric and gradual fashion, i.e., instead of creating a network with an arbitrary yet great depth and a large number of neurons per layer, start with a small and thin network and then gradually add more symmetric layers. Expand the network to reach a cone-shaped form. A large degree of invariance to geometric transformations of the input can be achieved with this progressive reduction of spatial resolution compensated by a progressive increase of the richness of the representation (the number of feature maps), which yields the cone shape; this is one of the reasons why deeper is better Lecun et al. (1998). Therefore a deeper network with thinner layers tends to perform better than the same network made much shallower with wider layers. It should however be noted that very deep and very thin architectures, like their shallow and very wide counterparts, are not recommended. The network needs to have proper processing and representational capacity, and what this principle suggests is a method of finding the right values for the depth and width of a network for this very reason.
Instead of thinking in layers, think and design in groups of homogeneous layers. The idea is to have several homogeneous groups of layers, each gradually wider than the last. The symmetric and homogeneous design makes it easy to manage the number of parameters a network holds and also provides better information pools for each semantic level.
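To make the idea of managing the parameter budget group by group concrete, the following sketch counts the weights of a stack of 3×3 convolutional groups. The group widths and depths are hypothetical values chosen for illustration and are not the SimpleNet configuration; batch-normalization parameters are ignored.

```python
def conv_params(kernel: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Weights (plus optional biases) of one kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def count_params(groups, in_channels: int = 3, kernel: int = 3):
    """Count parameters of homogeneous groups given as (num_layers, width).

    Returns the total and the per-group totals, so the growth of the
    budget from one group to the next is easy to inspect.
    """
    per_group = []
    c_in = in_channels
    for num_layers, width in groups:
        group_total = 0
        for _ in range(num_layers):
            group_total += conv_params(kernel, c_in, width)
            c_in = width
        per_group.append(group_total)
    return sum(per_group), per_group


# Hypothetical widths that widen gradually (NOT the values used in the paper).
groups = [(2, 32), (3, 64), (3, 128)]
total, per_group = count_params(groups)
print(total, per_group)  # 471520 [10144, 92352, 369024]
```

Because every layer in a group shares one width, changing a single number per group is enough to rescale the whole budget, which is the practical convenience this principle points at.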
Preserve locality information throughout the network as much as possible by avoiding 1×1 kernels in early layers. The cornerstone of CNN success lies in local correlation preservation. Avoid using 1×1 kernels or fully connected layers where locality of information matters. This applies specifically to the early layers of the network. 1×1 kernels have several desirable characteristics, such as increasing the network's non-linearity and feature fusion Lin et al. (2013), which raises the abstraction level, but they also ignore any local correlation in the input. Since they do not consider any neighborhood in the input and only take channels into account, they distort valuable local information. Preferably use 1×1 kernels at the end of the network, or, if one intends to use tricks such as the bottleneck employed by GoogleNet Szegedy et al. (2015) and ResNet He et al. (2015b), use more layers with skip connections to compensate for the loss of information. It is suggested to replace 1×1 kernels with 2×2 kernels if one plans on using them anywhere other than the end of the network. Using 2×2 kernels both helps to reduce the number of parameters and retains neighborhood information.
Utilize as much of the information made available to a network as possible by avoiding rapid down-sampling, especially in early layers. To increase a network's discriminative power, more information needs to be made available. This can be achieved either by a larger dataset or by larger feature-maps. If a larger dataset is not feasible, the existing training samples must be efficiently harnessed. Larger feature-maps, especially in early layers, provide more valuable information to the network than smaller ones. With the same depth and number of parameters, a network that utilizes bigger feature-maps achieves a higher accuracy. Therefore, instead of increasing the complexity of a network by increasing its depth and number of parameters, one can obtain more accuracy by simply using larger input dimensions or avoiding rapid early down-sampling. This is a good technique for keeping the complexity of the network in check while improving its performance.
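The effect of where down-sampling is placed can be seen by tracking feature-map sizes through a layer plan. The sketch below is illustrative only: it assumes size-preserving (padded) convolutions and non-overlapping 2×2 pooling with stride 2, and both layer plans are hypothetical rather than taken from the paper.

```python
def pool_out(size: int, kernel: int = 2, stride: int = 2) -> int:
    """Output size of an unpadded pooling window along one dimension."""
    return (size - kernel) // stride + 1


def feature_map_sizes(input_size: int, plan: str):
    """Spatial size seen by each convolution in a plan.

    'C' is a size-preserving convolution, 'P' is a 2x2 stride-2 pooling.
    """
    sizes = []
    size = input_size
    for op in plan:
        if op == "C":
            sizes.append(size)
        elif op == "P":
            size = pool_out(size)
        else:
            raise ValueError(f"unknown op {op!r}")
    return sizes


# Two hypothetical plans with the same six convolutions and two poolings.
early = feature_map_sizes(32, "CPCPCCCC")  # down-sample right away
late = feature_map_sizes(32, "CCCCPCCP")   # keep large maps longer

positions_early = sum(s * s for s in early)
positions_late = sum(s * s for s in late)
print(early, positions_early)  # [32, 16, 8, 8, 8, 8] 1536
print(late, positions_late)    # [32, 32, 32, 32, 16, 16] 4608
```

With equal layer widths both plans hold the same number of weights, yet the second one evaluates its kernels at three times as many spatial positions. That extra information comes with a proportional increase in computation, so the placement of pooling is a trade-off rather than a free gain.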
Use 3×3 kernels, and follow established industrial trends. For an architecture to be easily usable and widely practical, it needs to run fast and perform decently. By taking into account the current improvements in underlying libraries, designing better performing and more efficient architectures is possible. Using 3×3 kernels, apart from their already known benefits Simonyan & Zisserman (2014), allows a substantial boost in performance with NVIDIA's cuDNN library relative to the former v4 version (footnote 2: https://developer.nvidia.com/cudnn-whatsnew). This is illustrated in Figure 2. This ability to harness every bit of performance is a decisive criterion when it comes to production and industry. Fast and robust performance translates into less time, decreased cost and ultimately a higher profit for business owners. Apart from the performance point of view, larger kernels do not provide the same efficiency per parameter as a 3×3 kernel does. It may be theorized that, since larger kernels capture a larger neighborhood in the input, using them may help in ignoring noise and thus capturing better features, or more interesting correlations in the input because of the larger receptive field, ultimately improving performance. In fact, the overhead they impose, in addition to the loss of information they cause, makes them a less than ideal choice: the efficiency per parameter decreases and the computational burden grows unnecessarily. Moreover, larger kernels can be replaced with a cascade of smaller ones (e.g. 3×3), which still results in the same effective receptive field and adds more nonlinearity, making the cascade a better choice than larger kernels.
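The claim that a cascade of small kernels reaches the receptive field of one larger kernel with fewer weights can be verified directly. This is a minimal sketch assuming stride-1 convolutions with the same number of input and output channels; the helper names are hypothetical.

```python
def stacked_receptive_field(kernels) -> int:
    """Receptive field of consecutive stride-1 convolutions.

    Each k x k layer widens the field by k - 1 pixels.
    """
    field = 1
    for k in kernels:
        field += k - 1
    return field


def weights_per_channel_pair(kernels) -> int:
    """Weights per (input channel, output channel) pair, summed over layers."""
    return sum(k * k for k in kernels)


for stack, single in (([3, 3], [5]), ([3, 3, 3], [7])):
    assert stacked_receptive_field(stack) == stacked_receptive_field(single)
    print(
        f"{len(stack)} x 3x3: field {stacked_receptive_field(stack)}, "
        f"{weights_per_channel_pair(stack)} weights vs "
        f"{weights_per_channel_pair(single)} for one {single[0]}x{single[0]}"
    )
```

Two 3×3 layers cover a 5×5 field with 18 weights per channel pair instead of 25, and three cover a 7×7 field with 27 instead of 49, while also inserting an extra non-linearity between the layers.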
Test the architecture with different learning policies before altering it. Most of the time, it is not the architecture that needs to be changed but the optimization policy. A badly chosen optimization policy leads to bad convergence, wasting network resources. Simple things such as learning rates and regularization methods usually have an adverse effect if not tuned correctly. Therefore it is suggested to first use an automated optimization policy to run quick tests, and, once the architecture is finalized, to tune the optimization policy carefully to maximize network performance.
Conduct experiments under equal conditions. When testing a new feature, make sure only the new feature is being evaluated. For instance, when evaluating one kernel size against another, the overall network entropy must remain equal. This is usually neglected in experiments, and changes are not evaluated in isolation, or better said, under equal conditions. This can lead to wrong deductions and thus result in an inefficient design. In order to effectively assess a specific feature and its effectiveness in the architecture design, it is important to keep track of the changes, whether caused by previous experiments or by the addition of the new feature itself, and to take the necessary action to eliminate the sources of discrepancy.
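One way to read "equal conditions" when comparing kernel sizes is to hold the number of weights fixed and let the layer width absorb the change. The sketch below takes that reading, which is an assumption: the text speaks of network entropy, and parameter count is used here only as a convenient proxy. The function name is hypothetical, biases are ignored, and the layer is assumed to have equal input and output widths.

```python
import math


def matched_width(base_width: int, base_kernel: int, new_kernel: int) -> int:
    """Largest square-layer width for new_kernel within the base weight budget.

    The budget is base_kernel^2 * base_width^2 (a layer with equal input
    and output widths, biases ignored).
    """
    budget = base_kernel ** 2 * base_width ** 2
    return math.isqrt(budget // new_kernel ** 2)


base = 64
width_5x5 = matched_width(base, base_kernel=3, new_kernel=5)
budget_3x3 = 3 ** 2 * base ** 2
budget_5x5 = 5 ** 2 * width_5x5 ** 2
print(width_5x5, budget_3x3, budget_5x5)  # 38 36864 36100
```

Under this reading, a 64-wide 3×3 layer should be compared against a 5×5 layer of width 38 rather than width 64; keeping the width at 64 would nearly triple the weights and confound kernel size with capacity.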
As with the previous principle, here we discuss generalization power and why lower entropy matters. It is true that the more parameters a network holds, the faster it can converge and the more accuracy it can achieve, but it will also over-fit more. A model with fewer parameters that provides better results than, or performs comparably to, heavier models indicates that the network has learned much better features on which to base its decisions. In other words, by imposing more constraints on the amount of entropy a network has, we force the network to find and use much better and more robust features. This manifests itself specifically in the generalization power, since the network's decisions are based on more important and more discriminative features. It can thus perform much better than a network with a higher number of parameters, which would also over-fit easily.
While we try to formulate the best ways to achieve better accuracy in the form of rules or guidelines, they are not necessarily meant to be aggressively followed in all cases. These guidelines are meant to help achieve a good compromise between performance and the imposed overhead. Therefore start by designing according to the guidelines and then try to alter the architecture in order to get the best compromise according to your needs. In order to better tune your architecture, try not to alter or deviate a lot from multiple guidelines at once. Following a systematic procedure helps to avoid repetitive actions, and also obtain better understanding of what/which series of actions lead to specific outcomes that would normally be a hard task. Work on one aspect at a time until the desired outcome is achieved. Ultimately, it’s all about the well balanced compromise between performance/imposed overhead according to one’s specific needs.
As we have already briefly discussed in previous sections, the current trend in the community has been to start with a deep and big architecture and then use different regularization methods to cope with over-fitting. The intuition behind this trend is that it is naturally difficult to come up with an architecture with the right number of parameters and depth that suits exactly one's data requirements. While this intuition is plausible and correct, it is not without flaws.
One of the issues is that there are many use cases and applications for which no huge dataset (such as ImageNet) is available. Apart from the fact that less computation and memory overhead is desirable in any circumstances and results in decreased costs, the majority of applications have access only to medium or small datasets, and yet they are already exploiting the benefits of deep learning and achieving either state of the art or very outstanding results. Individuals coming from this background have two paths before them when they want to initiate a deep learning project: 1) design their own architecture, which is difficult and time-consuming and has its own share of issues, or 2) use one of the existing heavy but very powerful architectures that have won competitions such as ImageNet or performed well in a related field of interest.
Using these kinds of architectures imposes a lot of overhead, and users must also bear the cost of coping with the resulting over-fitting. It adversely affects training, making it more time and resource consuming. When such architectures are used for fine-tuning, the computational, memory and time overhead of such deep and heavy architectures is imposed as well.
Therefore it makes more sense to have a less computationally expensive architecture that provides higher or comparable accuracy compared to its heavier counterparts. The lowered computational overhead results in decreased time and power consumption, which is a decisive factor for mobile applications. Apart from such benefits, reliance on better and more robust features is another important reason to opt for such networks.
We experimented on the CIFAR10, CIFAR100, SVHN and MNIST datasets in order to evaluate and compare our architecture against the top ranking methods and deeper models that were also evaluated on these datasets. We only used simple data augmentation, consisting of zero padding and mirroring, on CIFAR10/100. The experiments on the MNIST and SVHN datasets were conducted without data augmentation. In our experiments we used one configuration for all datasets, and we did not fine-tune anything except on CIFAR10. We did this to see how this configuration performs with no change, or only the slightest change, in different scenarios. We used the Caffe framework Jia et al. (2014) for training our architecture and ran our experiments on a system with an Intel Pentium G3220 CPU, 14 gigabytes of RAM and an NVIDIA GTX980.
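The simple augmentation mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline used in the experiments (which were run in Caffe): the text only names zero padding and mirroring, so the random crop back to the original size, the padding width and the 50% mirroring probability are all assumptions, and a single-channel square image is represented as a list of rows.

```python
import random


def zero_pad(img, pad: int):
    """Surround a 2-D image (list of rows) with `pad` zeros on every side."""
    width = len(img[0])
    blank = [0] * (width + 2 * pad)
    rows = [[0] * pad + list(row) + [0] * pad for row in img]
    return [blank[:] for _ in range(pad)] + rows + [blank[:] for _ in range(pad)]


def mirror(img):
    """Flip the image horizontally."""
    return [list(reversed(row)) for row in img]


def random_crop(img, size: int, rng: random.Random):
    """Cut a size x size window at a random position (assumed step)."""
    top = rng.randint(0, len(img) - size)
    left = rng.randint(0, len(img[0]) - size)
    return [row[left:left + size] for row in img[top:top + size]]


def augment(img, pad: int, rng: random.Random):
    """Pad, crop back to the original (square) size, then mirror half the time."""
    out = random_crop(zero_pad(img, pad), len(img), rng)
    return mirror(out) if rng.random() < 0.5 else out


image = [[1, 2], [3, 4]]
print(zero_pad(image, 1))
print(mirror(image))
print(augment(image, pad=1, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when two runs are meant to differ in only one respect.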
The CIFAR10/100 Krizhevsky & Hinton (2009) datasets each include 60,000 color images, of which 50,000 belong to the training set and 10,000 are reserved for testing (validation). These images are divided into 10 and 100 classes respectively, and classification performance is evaluated using top-1 error. Table 2 shows the results achieved by different architectures.
We tried two different configurations for the CIFAR10 experiment: one with no data augmentation (i.e., no zero-padding or normalization) and another one using data augmentation. We name them Arch1 and Arch2 respectively. Arch1 achieves a new state of the art on CIFAR10 when no data augmentation is used, and Arch2 achieves 95.32%. In addition to the normal architecture, we used a modified version on CIFAR100 and achieved 74.86% with data augmentation. Since it had more parameters, we did not include it in the following table. More results are provided in the appendix.
|Method||#Params||CIFAR10 accuracy (%)||CIFAR100 accuracy (%)|
|VGGNet(16L) Zagoruyko (2015) / Enhanced||138M||91.4 / 92.45||-|
|ResNet-110L / 1202L He et al. (2015b) *||1.7 / 10.2M||93.57 / 92.07||74.84 / 72.18|
|SD-110L / 1202L Huang et al. (2016)||1.7 / 10.2M||94.77 / 95.09||75.42 / -|
|WRN-(16/8) / (28/10) Zagoruyko & Komodakis (2016)||11 / 36M||95.19 / 95.83||77.11 / 79.5|
|Highway Network Srivastava et al. (2015)||N/A||92.40||67.76|
|FitNet Romero et al. (2014)||1M||91.61||64.96|
|FMP* (1 test) Graham (2014a)||12M||95.50||73.61|
|Max-out (k=2) Goodfellow et al. (2013)||6M||90.62||65.46|
|Network in Network Lin et al. (2013)||1M||91.19||64.32|
|DSN Lee et al. (2015)||1M||92.03||65.43|
|Max-out NIN Jia-Ren Chang (2015)||-||93.25||71.14|
|LSUV Dmytro Mishkin (2016)||N/A||94.16||N/A|
*Note that Fractional Max Pooling Graham (2014a) uses a deeper architecture and also uses extreme data augmentation. Arch1 means no zero-padding or normalization, with dropout, and Arch2 means standard data augmentation, with dropout. To our knowledge, our architecture has the state of the art result without the aforementioned data augmentations.
The MNIST dataset Lecun et al. (1998) consists of 70,000 28x28 grayscale images of handwritten digits 0 to 9, of which 60,000 are used for training and 10,000 are used for testing. We did not use any data augmentation on this dataset, and yet scored second to the state of the art without data augmentation or fine-tuning. We also slimmed our architecture down to only 300K parameters and achieved 99.72% accuracy, beating all previous larger and heavier architectures. Table 3 shows the current state of the art results for MNIST.
|Method||Error rate|
|DropConnect Wan et al. (2013)**||0.21%|
|Multi-column DNN for Image Classification Ciregan et al. (2012)**||0.23%|
|APAC Sato et al. (2015)**||0.23%|
|Generalizing Pooling Functions in CNN Lee et al. (2016)**||0.29%|
|Fractional Max-Pooling Graham (2014a)**||0.32%|
|Batch-normalized Max-out NIN Jia-Ren Chang (2015)||0.24%|
|Max-out network (k=2) Goodfellow et al. (2013)||0.45%|
|Network In Network Lin et al. (2013)||0.45%|
|Deeply Supervised Network Lee et al. (2015)||0.39%|
|RCNN-96 Liang & Hu (2015)||0.31%|
*Note that we didn’t intend on achieving the state of the art performance here as we are using a single optimization policy without fine-tuning hyper parameters or data-augmentation for a specific task, and still we nearly achieved state-of-the-art on MNIST. **Results achieved using an ensemble or extreme data-augmentation
The SVHN dataset Netzer et al. (2011) is a real-world image dataset, obtained from house numbers in Google Street View images. It consists of 630,420 32x32 color images of which 73,257 images are used for training, 26,032 images are used for testing and the other 531,131 images are used for extra training. Like Huang et al. (2016); Goodfellow et al. (2013); Lin et al. (2013) we only used the training and testing sets for our experiments and didn’t use any data-augmentation. We also used the slimmed version with 300K parameters and obtained a very good test error of 2.37%. Table 4 shows the current state of the art results for SVHN.
|Method||Error rate (%)|
|Network in Network Lin et al. (2013)||2.35|
|Deeply Supervised Net Lee et al. (2015)||1.92|
|ResNet He et al. (2015b) (reported by Huang et al. (2016))||2.01|
|ResNet with Stochastic Depth Huang et al. (2016)||1.75|
|Wide ResNet Zagoruyko & Komodakis (2016)||1.64|
Some architectures can’t scale well when their processing capacity decreases. This shows the design is not robust enough to efficiently use its processing capacity. We tried a slimmed version of our architecture which has only 300K parameters to see how it performs and whether it’s still efficient. The network also does not use any dropout. Table 5 shows the results for our architecture with only 300K parameters in comparison to other deeper and heavier architectures with 2 to 20 times more parameters.
|Method||#Params||CIFAR10 accuracy (%)||CIFAR100 accuracy (%)|
|SimpleNet||310K - 460K||91.98 - 92.33||64.68 - 66.82|
|Maxout Goodfellow et al. (2013)||6M||90.62||65.46|
|DSN Lee et al. (2015)||1M||92.03||65.43|
|ALLCNN Springenberg et al. (2014)||1.3M||92.75||66.29|
|dasNet Stollenga et al. (2014)||6M||90.78||66.22|
|ResNet He et al. (2015b) (Depth32, tested by us)||475K||91.6||67.37|
|WRN Zagoruyko & Komodakis (2016)||600K||93.15||69.11|
|NIN Lin et al. (2013)||1M||91.19||—|
In this paper, we proposed a simple convolutional architecture that takes advantage of the simplicity of its design and outperforms deeper and more complex architectures in spite of having considerably fewer parameters and operations. We showed that a good design should be able to use its processing capacity efficiently, and that our slimmed version of the architecture, with far fewer parameters (300K), also outperforms deeper and/or heavier architectures. Intentionally limiting ourselves to a few layers and basic elements when designing the architecture allowed us to set aside unnecessary details and concentrate on the critical aspects of the architecture, keeping the computation in check and achieving high efficiency. We tried to show the importance of simplicity and optimization through our experiments, and we encourage more researchers to study the vast design space of convolutional neural networks in an effort to find more and better guidelines for proposing better performing architectures with much less overhead. This will hopefully help to expand deep learning methods and applications, making them viable in more situations. Due to a lack of good hardware, we had to content ourselves with a few configurations. We are still continuing our tests and would like to extend our work by experimenting with new applications and design choices, especially using the latest achievements on deep architectures in the literature.
We would like to express our deep gratitude to Dr. Ali Diba, the CTO of Sensifai, for his great help and cooperation in this work. We would also like to express our great appreciation to Dr. Hamed Pirsiavash for his insightful comments and constructive suggestions. We would also like to thank Dr. Reza Saadati and Dr. Javad Vahidi for their valuable help in the early stages of the work.
In this section, the extended results pertaining to CIFAR10 and CIFAR100 are provided, along with early results on the ImageNet Russakovsky et al. (2015) dataset.
ImageNet includes images of 1000 classes, and is split into three sets: 1.2M training images, 50K validation images, and 100K testing images. The classification performance is evaluated using two measures: the top-1 and top-5 error.
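To make the two measures concrete, the following is a minimal sketch of how top-1 and top-5 error can be computed from per-class scores. The function name `top_k_error` and the toy scores are hypothetical and serve only as an illustration; this is not the evaluation code used for the reported results.

```python
def top_k_error(scores, labels, k=1):
    """Fraction of samples whose true label is not among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the k best.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Toy example: 3 samples, 6 classes (scores are made up for illustration).
scores = [
    [0.10, 0.50, 0.20, 0.10, 0.05, 0.05],  # true class 1: correct at top-1
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],  # true class 2: wrong at top-1, found in top-5
    [0.30, 0.25, 0.20, 0.15, 0.08, 0.02],  # true class 5: missed even in top-5
]
labels = [1, 2, 5]

top1 = top_k_error(scores, labels, k=1)
top5 = top_k_error(scores, labels, k=5)
print(f"top-1 error: {top1:.3f}, top-5 error: {top5:.3f}")
```

An accuracy rate, as reported in a T1/T5 column, is simply 100 × (1 − error) for the corresponding k.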
We used the same architecture without any dropout and did not tune any parameters. We used plain SGD to see how the network performs with a simple learning policy. Table 6 shows the latest results, up to 300K iterations, from the ongoing test. Unlike other methods, which use techniques such as scale jittering, multi-crop and dense evaluation in the training and testing phases, no data augmentation was used to achieve the following results.
|Method||T1/T5 Accuracy Rate|
|AlexNet (60M) Alex et al. (2012)||57.2/80.3|
|VGGNet16 (138M) Simonyan & Zisserman (2014)||70.5|
|GoogleNet (8M) Szegedy et al. (2015)||68.7|
|Wide ResNet (11.7M) Zagoruyko & Komodakis (2016)||69.6/89.07|
|Method||CIFAR10 Accuracy||#Params|
|ResNet-110 He et al. (2015b)*||93.57||1.7M|
|ResNet-1202 He et al. (2015b)||92.07||10.2M|
|Stochastic depth-110L Huang et al. (2016)||94.77||1.7M|
|Stochastic depth-1202L Huang et al. (2016)||95.09||10.2M|
|Wide Residual Net Zagoruyko & Komodakis (2016)||95.19||11M|
|Wide Residual Net Zagoruyko & Komodakis (2016)||95.83||36M|
|Highway Network Srivastava et al. (2015)||92.40||-|
|FitNet Romero et al. (2014)||91.61||1M|
|SqueezeNet Iandola et al. (2016) (tested by us)||79.58||1.3M|
|ALLCNN Springenberg et al. (2014)||92.75||-|
|Fractional Max-pooling* (1 test) Graham (2014a)||95.50||12M|
|Max-out (k=2) Goodfellow et al. (2013)||90.62||6M|
|Network in Network Lin et al. (2013)||91.19||1M|
|Deeply Supervised Network Lee et al. (2015)||92.03||1M|
|Batch normalized Max-out NIN Jia-Ren Chang (2015)||93.25||-|
|All you need is a good init (LSUV) Dmytro Mishkin (2016)||94.16||-|
|Generalizing Pooling Functions in CNN Lee et al. (2016)||93.95||-|
|Spatially-Sparse CNNs Graham (2014b)||93.72||-|
|Scalable Bayesian Optimization Using DNN Snoek et al. (2015)||93.63||-|
|Recurrent CNN for Object Recognition Liang & Hu (2015)||92.91||-|
|RCNN-160 Liang & Hu (2015)||92.91||-|
|SimpleNet-Arch1 using data augmentation||95.32||5.4M|
|Method||CIFAR100 Accuracy|
|GoogleNet with ELU Clevert et al. (2015)*||75.72|
|Spatially-sparse CNNs Graham (2014b)||75.7|
|Fractional Max-Pooling (12M) Graham (2014a)||73.61|
|Scalable Bayesian Optimization Using DNNs Snoek et al. (2015)||72.60|
|All you need is a good init Dmytro Mishkin (2016)||72.34|
|Batch-normalized Max-out NIN (k=5) Jia-Ren Chang (2015)||71.14|
|Network in Network Lin et al. (2013)||64.32|
|Deeply Supervised Network Lee et al. (2015)||65.43|
|ResNet-110L He et al. (2015b)||74.84|
|ResNet-1202L He et al. (2015b)||72.18|
|WRN Zagoruyko & Komodakis (2016)||77.11/79.5|
|Highway Srivastava et al. (2015)||67.76|
|FitNet Romero et al. (2014)||64.96|
*Achieved using several data-augmentation tricks
In order to see how well the model generalizes and whether it was able to develop robust features, we took a model trained on the CIFAR10 dataset and tested it on some images that the network had never seen. As the results show, the network classifies them correctly despite the fact that they are very different from the images used for training. These visualizations were produced using the Deep Visualization Toolbox by Yosinski et al. (2015) and an early un-augmented version of SimpleNet.
An interesting point in Figure 4 lies in the black dog/cat-like drawing and the predictions the network makes on it. We intentionally drew a figure that resembles several categories in the CIFAR10 dataset, in order to test how it appears to the network and whether the network uses sensible features to distinguish between classes. Interestingly, the network tries its best and classifies the image according to the prominent features it finds in the picture. The similarity to some animals present in the dataset is manifested in the first four predictions, and the truck at the end suggests that the circular shape of the animal's legs might have been taken as an indication of a truck. This suggests the network is trying to use prominent features to identify each class rather than some random features. Investigating the internals of the network also shows that such predictions result from well-developed feature combinations, by which the network performs its deduction. Figure 5 shows that the network has developed a proper feature to distinguish the head/shoulder in the input, which is a possible deciding factor for distinguishing between animals and non-animals.
As can be seen from the samples, while the results are very encouraging and made with high confidence, they are still far from perfect. This observation may suggest three possible explanations: 1) The network does not have the capacity needed to deduce as well as we expect. 2) More data is needed for the network to develop better features; a small dataset such as CIFAR10 with no data augmentation is simply not enough to provide the capability we expect. 3) The current optimization process that we employ to train our deep architectures is insufficient and/or incapable of providing such capability easily, or at all. Apart from the current imperfections, the results show that even a simple architecture, when properly devised, can perform decently.