1 Introduction
The use of competitive activation units in deep convolutional neural networks (ConvNets) is generally understood as a way of building one network from the combination of multiple subnetworks, where each one is capable of solving a simpler task when compared to the complexity of the original problem involving the whole dataset [22]. Similar ideas have been explored in the past using multilayer perceptron models [6], but there is a resurgence in the use of competitive activation units in deep ConvNets [23, 22]. For instance, the rectified linear unit (ReLU) [1] promotes a competition between the input sum (usually computed from the output of convolutional layers) and a fixed value of 0, while maxout [4] and local winner-take-all (LWTA) [23] explore an explicit competition among the input units. As shown by Srivastava et al. [22], these competitive activation units allow the formation of subnetworks that respond similarly to similar input patterns, which facilitates training [1, 4, 23] and generally produces superior classification results [22].
Figure 1: (a) Competitive multiscale convolution module; (b) Competitive Inception module; (c) Original inception module [24]. The proposed deep ConvNet modules are depicted in (a) and (b), where (a) only contains multiscale convolutional filters within each module, while (b) contains the maxpooling path, which resembles the original inception module depicted in (c) for comparison.
In this paper, we introduce a new module for deep ConvNets composed of several multiscale convolutional filters that are joined by a maxout activation unit, which promotes competition among these filters. Our idea has been inspired by the recently proposed inception module [24], which currently produces state-of-the-art results on the ILSVRC 2014 classification and detection challenges [17]. The gist of our proposal is depicted in Fig. 1, where the data in the input layer is filtered in parallel by a set of multiscale convolutional filters [2, 24, 27]. Then the output of each scale of the convolutional layer passes through a batch normalisation unit (BNU) [5] that weights the importance of each scale and also preconditions the model (note that the preconditioning ability of BNUs in ConvNets containing piecewise linear activation units has recently been empirically shown in [11]). Finally, the multiscale filter outputs, weighted by the BNUs, are joined with a maxout unit [4] that reduces the dimensionality of the joint filter outputs and promotes competition among the multiscale filters, which prevents filter coadaptation and allows the formation of multiple subnetworks. We show that the introduction of our proposed module in a typical deep ConvNet produces the best results in the field for the benchmark datasets CIFAR10 [7], CIFAR100 [7], and street view house number (SVHN) [16], while producing competitive results for MNIST [8].
2 Literature Review
One of the main reasons behind the outstanding performance of deep ConvNets is the use of competitive activation units in the form of piecewise linear functions [14, 22], such as ReLU [1], maxout [4] and LWTA [23] (see Fig. 2). In general, these activation functions enable the formation of subnetworks that respond consistently to similar input patterns [22], dividing the input data points (and more generally the training space) into regions [14], where classifiers and regressors can be learned more effectively given that the subproblems in each of these regions are simpler than the one involving the whole training set. In addition, the joint training of the subnetworks present in such deep ConvNets represents a useful regularization method [1, 4, 23]. In practice, ReLU allows the division of the input space into two regions, but maxout and LWTA can divide the space into as many regions as the number of inputs; for this reason, the latter two functions can estimate exponentially complex functions more effectively because of the larger number of subnetworks that are jointly trained. An important aspect of deep ConvNets with competitive activation units is that the use of batch normalization units (BNU) helps not only with the convergence rate [5], but also with the preconditioning of the model by promoting an even distribution of the input data points, which maximizes the number of regions (and respective subnetworks) produced by the piecewise linear activation functions [11]. Furthermore, training ConvNets with competitive activation units [11, 22] usually involves the use of dropout [20], a regularization method that prevents filter coadaptation [20], which is a particularly important issue in such models, because filter coadaptation can lead to a severe reduction in the number of subnetworks that can be formed during training.
Figure 2: Competitive activation units, where the gray nodes are the active ones, from which errors flow during backpropagation. ReLU [1] (a) is active when the input is bigger than 0, LWTA [23] (b) activates only the node that has the maximum value (setting the other ones to zero), and maxout [4] (c) has only one output containing the maximum value from the input. This figure was adapted from Fig. 1 of [22].
Another aspect of the current research on deep ConvNets is the idea of making the network deeper, which has been shown to improve classification results [3]. However, one of the main ideas being studied in the field is how to increase the depth of a ConvNet without necessarily increasing the complexity of the model parameter space [19, 24]. In Szegedy et al.’s model [24], this is achieved with the use of 1×1 convolutional filters [12] that are placed before each local filter present in the inception module in order to reduce the input dimensionality of the filter. In Simonyan et al.’s approach [19], the idea is to use a large number of layers with convolutional filters of very small size (e.g., 3×3). In this work, we restrict the complexity of the deep ConvNet with the use of maxout activation units, which select only one of the input nodes, as shown in Fig. 2.
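The three competitive activation units described in Fig. 2 can be illustrated with a minimal sketch. The function names are chosen here for illustration, and the tie-breaking rule in the LWTA function (first maximum wins) is an assumption, not something specified in the cited works.

```python
from typing import List


def relu(x: float) -> float:
    """ReLU: competition between the input and the fixed value 0."""
    return max(x, 0.0)


def lwta(group: List[float]) -> List[float]:
    """Local winner-take-all: keep the maximum of the group, zero the rest.

    Ties go to the first maximum (an arbitrary choice made for this sketch).
    """
    winner = max(range(len(group)), key=lambda i: group[i])
    return [v if i == winner else 0.0 for i, v in enumerate(group)]


def maxout(group: List[float]) -> float:
    """Maxout: a single output holding the maximum of the group."""
    return max(group)
```

Note that LWTA preserves the number of units while maxout reduces a group to one output, which is the property used in this work to limit model complexity.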
Finally, the use of multiscale filters in deep ConvNets is another important idea that is increasingly being explored by several researchers [2, 24, 27]. Essentially, multiscale filtering follows a neuroscience model [18] that suggests that the input image data should be processed at several scales and then pooled together, so that the deeper processing stages can become robust to scale changes [24]. We explore this idea in our proposal, as depicted in Fig. 1, but we also argue (and show some evidence) that the multiscale nature of the filters can prevent their coadaptation during training.
3 Methodology
Assume that an image is represented by , where denotes the image lattice, and that an image patch of size (for ) centred at position is represented by . The models being proposed in this paper follow the structure of the NIN model [12], and are in general defined as follows:
(1) 
where denotes the composition operator, represents all the ConvNet parameters (i.e., weights and biases), denotes an averaging pooling unit followed by a softmax activation function [12], and the network has blocks represented by , with each block containing a composition of modules with . Each module at a particular position of the input data for block is defined by
(2) 
where represents the maxout activation function [4], the convolutional filters of the module are represented by the weight matrices for (i.e., filters of size , with denoting the number of D filters present in ), which means that each module in block has different filter sizes and different filters, and represent the batch normalization scaling and shifting parameters [5], and represents a max pooling operator on the subset of the input data for layer centred at , i.e. .
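To make the data flow of the module in (2) more concrete, the following is a minimal sketch under strong simplifying assumptions: the input is a one-dimensional signal rather than an image, the batch normalisation statistics are computed over the positions of a single signal rather than over a minibatch, and the max-pooling path is omitted. All function names and filter values are hypothetical and are not taken from the implementation used in the experiments.

```python
import math
from typing import List, Sequence


def conv1d_same(x: Sequence[float], w: Sequence[float]) -> List[float]:
    """Correlate x with an odd-length filter w, zero-padding the borders."""
    r = len(w) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, wj in enumerate(w):
            k = i + j - r
            if 0 <= k < len(x):
                acc += wj * x[k]
        out.append(acc)
    return out


def batch_norm(v: Sequence[float], gamma: float, beta: float,
               eps: float = 1e-5) -> List[float]:
    """Normalise v to zero mean and unit variance, then scale and shift."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [gamma * (a - mean) / math.sqrt(var + eps) + beta for a in v]


def competitive_multiscale(x, filters, gammas, betas) -> List[float]:
    """Maxout over the batch-normalised responses of filters of several sizes."""
    paths = [batch_norm(conv1d_same(x, w), g, b)
             for w, g, b in zip(filters, gammas, betas)]
    # Element-wise maximum across scales: one output value per position.
    return [max(vals) for vals in zip(*paths)]


# Toy input and three filters of sizes 1, 3 and 5 (illustrative values only).
x = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0]
filters = [[1.0], [1 / 3] * 3, [0.2] * 5]
y = competitive_multiscale(x, filters, gammas=[1.0] * 3, betas=[0.0] * 3)
```

The output has the same length as a single filter response, which reflects how the maxout unit reduces the dimensionality of the joint filter outputs instead of concatenating them.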
Using the ConvNet module defined in (2), our proposed models differ mainly in the presence or absence of the node with the maxpooling operator within the module (i.e., the node represented by ). When the module does not contain such a node, it is called Competitive Multiscale Convolution (see Fig. 3(a)), but when the module has the maxpooling node, we call it Competitive Inception (see Fig. 3(b)) because of its similarity to the original inception module [24]. The original inception module is also implemented for comparison purposes (see Fig. 3(c)), and we call this model the Inception Style, which is similar to (1) and (2) but with the following differences: 1) the function in (2) denotes the concatenation of the input parameters; 2) a convolution is applied to the input before a second round of convolutions with filter sizes larger than or equal to ; and 3) a ReLU activation function [1] is present after each convolutional layer.
An overview of all models with the structural parameters is displayed in Fig. 3. Note that all models are inspired by NIN [12], GoogLeNet [24], and MIM [11]. In particular, we replace the original convolutional layers of MIM by multiscale filters of sizes , , , and . For the inception style model, we ensure that the number of output units in each module is the same as for the competitive inception and competitive multiscale convolution, and we also use a maxpooling path in each module, as used in the original inception module [24]. Another important point is that, when designing the inception style network, we generally follow the suggestion by Szegedy et al. [24] and include a relatively larger number of and filters in each module, compared to filters of other sizes (e.g., and ). An important distinction between the original GoogLeNet [24] and the inception style network in Fig. 3(c) is that we replace the fully connected layer in the last layer by a single convolution node in the last module, followed by an average pooling and a softmax unit, similarly to the NIN model [12]. We propose this modification to limit the number of training parameters (with the removal of the fully connected layer) and to avoid the concatenation of the nodes from different paths (i.e., maxpooling, convolution filter, etc.) into a number of channels that is equal to the number of classes (i.e., each channel is averaged into a single node, which is used by a single softmax unit), where the concatenation would imply that some of the paths would be directly linked to a subset of the classes.
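The NIN-style output stage described above, where each class channel is averaged into a single node that feeds a softmax unit, can be sketched as follows. This is only an illustration: the function names and the toy channel values are assumptions, and each channel is flattened to a short list of spatial positions.

```python
import math
from typing import List


def global_average_pool(channels: List[List[float]]) -> List[float]:
    """Average every channel (one per class) down to a single score."""
    return [sum(c) / len(c) for c in channels]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over the class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three class channels, each flattened to four spatial positions (toy values).
channels = [[0.1, 0.3, 0.2, 0.0], [1.0, 1.2, 0.8, 1.0], [-0.5, 0.0, 0.1, 0.0]]
probs = softmax(global_average_pool(channels))
```

Because the head has no fully connected layer, it adds no trainable parameters beyond those of the last convolution node.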
3.1 Competitive Multiscale Convolution Prevents Filter Coadaptation
The main reason being explored in the field to justify the use of competitive activation units [1, 4, 23] is the fact that they build a network formed by multiple underlying subnetworks [22]. More precisely, given that these activation units consist of piecewise linear functions, it has been shown that the composition of several layers containing such units divides the input space into a number of regions that grows exponentially with the number of network layers [14]. Subnetworks are trained with the samples that fall into one of these regions, and as a result become specialised to the problem in that particular region [22], while overfitting can be avoided because these subnetworks must share their parameters with one another [22]. It is worth noting that these regions can only be formed if the underlying convolutional filters do not coadapt; otherwise all input training samples will fall into only one region of the competitive unit, which degenerates into a simple linear transform, preventing the formation of the subnetworks.
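The degenerate case mentioned above can be illustrated with a small sketch that counts how many samples fall into the region owned by each linear piece of a maxout unit. The samples and filters are toy values assumed for illustration, and ties are resolved in favour of the first filter.

```python
from collections import Counter
from typing import Sequence


def winning_piece(x: Sequence[float], filters: Sequence[Sequence[float]]) -> int:
    """Index of the linear piece (filter) that wins the maxout for sample x."""
    responses = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in filters]
    return max(range(len(responses)), key=lambda k: responses[k])


def region_counts(samples, filters) -> Counter:
    """How many samples fall into the region owned by each filter."""
    return Counter(winning_piece(x, filters) for x in samples)


samples = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.5, -1.0]]

distinct = [[1.0, 0.0], [0.0, 1.0]]   # two different linear pieces
coadapted = [[1.0, 0.0], [1.0, 0.0]]  # identical pieces: a single region

counts_distinct = region_counts(samples, distinct)
counts_coadapted = region_counts(samples, coadapted)
```

With distinct filters the samples are split between two regions, whereas with identical filters every sample lands in the same region and the unit behaves as a single linear transform.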
A straightforward solution to avoid such coadaptation can be achieved by limiting the number of training samples in a minibatch during stochastic gradient descent. These small batches allow the generation of “noisy” gradient directions during training that can activate different maxout gates, so that the different linear pieces of the activation unit can be fitted, allowing the formation of an exponentially large number of regions. However, the drawback of this approach lies in the determination of the “right” number of samples per minibatch. A minibatch size that is too small leads to poor convergence, and if it is too large, then it may not allow the formation of many subnetworks. Recently, Liao and Carneiro [11] proposed a solution to this problem based on the use of BNU [5], which distributes the training samples evenly over the regions formed by the competitive unit, allowing the training to use different sets of training points for each region of the competitive unit, resulting in the formation of an exponential number of subnetworks. However, there is still a potential problem with that approach [11], which is that the underlying convolutional filters are trained using feature spaces of the same size (i.e., the underlying filters are of fixed size), which can induce the filters to coadapt and converge to similar regions of the feature space, also preventing the formation of the subnetworks.
The competitive multiscale convolution module proposed in this paper represents a way to fix the issue introduced above [11]. Specifically, the different sizes of the convolutional filters within a competitive unit force the feature spaces of the filters to be different from each other, reducing the chances that these filters will converge to similar regions of the feature space. For instance, say you have two filters of sizes and being joined by a competitive unit, so this means that the former filter will have a dimensional space, while the latter filter will have additional dimensions for a total of dimensions, where these new dimensions will allow the training process for the filter to have a significantly larger feature space (i.e., for these two filters to converge to similar values, the additional dimensions will have to be pushed towards zero and the remaining dimensions to converge to the same values as the filter). In other words, the different filter sizes within a competitive unit impose a soft constraint that the filters must converge to different values, avoiding the coadaptation issue.
In some sense, this idea is similar to DropConnect [26], which, during training, drops to zero the weights of randomly picked network connections with the goal of regularizing the training. Nevertheless, with DropConnect the underlying filters still have the same size, which promotes coadaptation even with random connections being dropped to zero. Compared with DropConnect, which stochastically drops filter connections during training, our approach deterministically drops the border connections of a filter (e.g., a filter is a filter with the border connections dropped to zero, and a filter is a filter with the 40 border connections forced to zero; see Fig. 5). We show in the experiments that our approach is more effective than DropConnect at the task of preventing filter coadaptation within competitive units.
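The view of a small filter as a larger filter with its border connections forced to zero can be sketched as below. The sizes 3 and 7 are assumed here for illustration; they are chosen because they reproduce the count of 40 zeroed border connections mentioned above. The helper names and filter values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def embed_filter(small: Matrix, size: int) -> Matrix:
    """Centre a k x k filter inside a size x size filter of zeros."""
    k = len(small)
    if size < k or (size - k) % 2 != 0:
        raise ValueError("size must be >= k and share its parity")
    pad = (size - k) // 2
    big = [[0.0] * size for _ in range(size)]
    for r in range(k):
        for c in range(k):
            big[pad + r][pad + c] = small[r][c]
    return big


def zeroed_connections(k: int, size: int) -> int:
    """Number of border connections forced to zero by the embedding."""
    return size * size - k * k


w3 = [[1.0, 2.0, 1.0], [0.0, 0.0, 0.0], [-1.0, -2.0, -1.0]]
w7 = embed_filter(w3, 7)
print(zeroed_connections(3, 7))  # 40
```

The same helper gives the number of additional dimensions that a larger filter has over a smaller one, e.g. `zeroed_connections(3, 5)` for a 3×3 filter compared with a 5×5 filter.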
4 Experiments
We quantitatively measure the performance of our proposed models Competitive Multiscale Convolution and Competitive Inception on four computer vision/machine learning benchmark datasets: CIFAR10 [7], CIFAR100 [7], MNIST [8] and SVHN [16]. We first describe the experimental setup; then, using CIFAR10 and MNIST, we show a quantitative analysis (in terms of classification error, number of model parameters and train/test time) of the two proposed models, the Inception Style model presented in Sec. 3, and two additional versions of the proposed models that justify the use of multiscale filters, explained in Sec. 3.1. Finally, we compare the performance of the proposed Competitive Multiscale Convolution and Competitive Inception with respect to the current state of the art in the four benchmark datasets mentioned above.
The CIFAR10 [7] dataset contains 60000 images of 10 commonly seen object categories (e.g., animals, vehicles, etc.), where 50000 images are used for training and the remaining 10000 for testing, and all 10 categories have equal numbers of training and test images. The images of CIFAR10 are 32×32 pixel RGB images, where the objects are well-centered in the middle of the image. The CIFAR100 [7] dataset extends CIFAR10 by increasing the number of categories to 100, whereas the total number of images remains the same, so the CIFAR100 dataset is considered a harder classification problem than CIFAR10, since it contains 10 times fewer images per class and 10 times more categories. The well-known MNIST [8] dataset contains 28×28 grayscale images comprising 10 handwritten digits (from 0 to 9), where the dataset is divided into 60000 images for training and 10000 for testing, but note that the number of images per digit is not uniformly distributed. Finally, the Street View House Number (SVHN) [16] is also a digit classification benchmark dataset that contains 600000 32×32 RGB images of printed digits (from 0 to 9) cropped from pictures of house number plates. The cropped images are centered on the digit of interest, but nearby digits and other distractors are kept in the image. SVHN has three sets: a training set, a testing set and an extra set with 530000 images that are less difficult and can be used to help with the training process. We do not use data augmentation in any of the experiments, and we only compare our results with other methods that do not use data augmentation.
In all these benchmark datasets we minimize the softmax loss function present in the last layer of each model for the respective classification in each dataset, and we report the results as the proportion of misclassified test images, which is the standard way of comparing algorithms in these benchmark datasets. The reported results are generated with the models trained using an initial learning rate of 0.1 and following a multistep decay to a final learning rate of 0.001 (in 80 epochs for CIFAR10 and CIFAR100, 50 epochs for MNIST, and 40 epochs for SVHN). The stopping criterion is determined by the convergence observed in the error on the validation set. The minibatch size for the CIFAR10, CIFAR100, and MNIST datasets is 100, and 128 for the SVHN dataset. The momentum and weight decay are set to the standard values 0.9 and 0.0005, respectively. For each result reported, we compute the mean and standard deviation of the test error from five separately trained models, where for each model, we use the same training set and parameters (e.g., the learning rate sequence, momentum, etc.), and we change only the random initialization of the filter weights and randomly shuffle the training samples.
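The training protocol above can be sketched in a few lines. Only the initial and final learning rates (0.1 and 0.001) and the 80-epoch budget come from the text; the milestone epochs and the decay factor are hypothetical, and the use of the sample (rather than population) standard deviation is also an assumption.

```python
import statistics
from typing import List, Sequence, Tuple


def multistep_lr(epoch: int, base_lr: float, milestones: Sequence[int],
                 factor: float) -> float:
    """Learning rate at `epoch`: scaled by `factor` for each milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** passed


# Hypothetical milestones for an 80-epoch run, decaying 0.1 -> 0.01 -> 0.001.
schedule = [multistep_lr(e, base_lr=0.1, milestones=(40, 60), factor=0.1)
            for e in range(80)]


def summarise(test_errors: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over independently trained models."""
    return statistics.mean(test_errors), statistics.stdev(test_errors)
```

In this protocol, `summarise` would be called once per model and dataset with the five test errors obtained from the five random initializations.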
We use the GPU-accelerated ConvNet library MatConvNet [25] to perform the experiments specified in this paper. Our experimental environment is a desktop PC equipped with an i7-4770 CPU, 24 GB of memory and a 12 GB GTX TITAN X graphics card. Using this machine, we report the mean training and testing times of our models.
4.1 Model Design Choices
In this section, we present the results from several experiments that motivate the design choices for our models, where we provide comparisons in terms of their test errors, the number of parameters involved in the training process and the training and testing times. Tables 1 and 2 show the results on CIFAR10 and MNIST for the Competitive Multiscale Convolution, Competitive Inception, and Inception Style models, in addition to other models explained below. Note that all models in Tables 1 and 2 are constrained to have the same numbers of input channels and output channels in each module, and all networks contain three blocks [12], each with three modules (so there is a total of nine modules in each network), as shown in Fig. 3.
We argue that the multiscale nature of the filters within the competitive module is important to avoid the coadaptation issue explained in Sec. 3.1. We assess this importance by comparing both the number of parameters and the test error results between the proposed models and the model Competitive Singlescale Convolution, which has basically the same architecture as the Competitive Multiscale Convolution model represented in Fig. 3(a), but with the following changes: the first two blocks contain four sets of filters in the first module, and in the second and third modules, two sets of filters; and the third block has three filters of size in the first module, followed by two modules with two filters. Notice that this configuration implies that we replace the multiscale filters by the filter of the largest size of the module in each node, which is a configuration similar to the recently proposed MIM model [11]. The configuration for the Competitive Singlescale Convolution has around two times more parameters than the Competitive Multiscale Convolution model and takes longer to train, as displayed in Tables 1 and 2. The idea behind the use of the largest size filters within each module is based on the results obtained from the training of the batch normalisation units of the Competitive Multiscale Convolution modules, which indicate that the highest weights (represented by in (2)) are placed in the largest size filters within each module, as shown in Fig. 4. The classification results of the Competitive Singlescale Convolution, shown in Tables 1 and 2, demonstrate that it is consistently inferior to the Competitive Multiscale Convolution model.
Another important point that we test in this section is the relevance of dropping connections in a deterministic or stochastic manner when training the competitive convolution modules. Recall that one of the questions posed in Sec. 3.1 is whether the deterministic masking provided by our proposed Competitive Multiscale Convolution module is more effective at avoiding filter coadaptation than the stochastic masking provided by DropConnect [26]. We run a quantitative analysis of the Competitive DropConnect Singlescale Convolution, where we take the Competitive Singlescale Convolution proposed before and randomly drop connections using a rate that is computed such that the model has on average the same number of parameters to learn in each round of training as the Competitive Multiscale Convolution; notice, however, that the Competitive DropConnect Singlescale Convolution has in fact the same number of parameters as the Competitive Singlescale Convolution. Using Fig. 5, we see that the DropConnect rate is 0.57 for module 1 of blocks 1 and 2 specified in Fig. 3. The results in Tables 1 and 2 show that it has around two times more parameters, takes longer to train and performs significantly worse than the Competitive Multiscale Convolution model.
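The DropConnect rate quoted above can be reproduced with a short calculation, under the assumption (made here for illustration) that module 1 of blocks 1 and 2 joins filters of sizes 1×1, 3×3, 5×5 and 7×7 with the same number of channels at every scale, and that the singlescale counterpart uses the largest size for every filter. The function name is hypothetical.

```python
from typing import Sequence


def matched_dropconnect_rate(multiscale_sizes: Sequence[int]) -> float:
    """Drop rate that leaves a singlescale module with, on average, as many
    active weights as the multiscale module it replaces.

    Every filter in the singlescale module has the largest size in the list.
    """
    largest = max(multiscale_sizes)
    multiscale_weights = sum(k * k for k in multiscale_sizes)
    singlescale_weights = len(multiscale_sizes) * largest * largest
    return 1.0 - multiscale_weights / singlescale_weights


# Assumed sizes 1x1, 3x3, 5x5, 7x7: 84 active weights out of 4 * 49 = 196.
rate = matched_dropconnect_rate([1, 3, 5, 7])
print(round(rate, 2))  # 0.57
```

Under these assumptions the multiscale module keeps 84 of the 196 weights of the singlescale module, so roughly 57% of the connections have to be dropped to match it.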
Finally, the reported training and testing times in Tables 1 and 2 show a clear relation between the number of model parameters and those times.
Table 1: Results on CIFAR10.
| Method | No. of Params | Test Error (mean ± std dev) | Train Time (h) | Test Time (ms) |
|---|---|---|---|---|
| Competitive Multiscale Convolution | | | 6.4 | 2.7 |
| Competitive Inception | | | 7.6 | 3.1 |
| Inception Style | | | 3.9 | 1.5 |
| Competitive Singlescale Convolution | | | 8.0 | 3.2 |
| Competitive DropConnect Singlescale Convolution | | | 7.7 | 3.1 |
Table 2: Results on MNIST.
| Method | No. of Params | Test Error (mean ± std dev) | Train Time (h) | Test Time (ms) |
|---|---|---|---|---|
| Competitive Multiscale Convolution | | | 1.5 | 0.8 |
| Competitive Inception | | | 1.9 | 1.0 |
| Inception Style | | | 1.4 | 0.7 |
| Competitive Singlescale Convolution | | | 1.7 | 0.9 |
| Competitive DropConnect Singlescale Convolution | | | 1.6 | 0.9 |
4.2 Comparison with the State of the Art
We now show the performance of the proposed Competitive Multiscale Convolution and Competitive Inception models on CIFAR10, CIFAR100, MNIST and SVHN, and compare them with the current state of the art in the field, which can be listed as follows. Stochastic Pooling [28] proposes a regularization based on a replacement of the deterministic pooling (e.g., max or average pooling) by a stochastic procedure, which randomly selects the activation within each pooling region according to a multinomial distribution, estimated from the activations of the pooling unit. Maxout Networks [4] introduces a piecewise linear activation unit that is used together with dropout training [20] and is illustrated in Fig. 2(c). The Network in Network (NIN) [12] model consists of the introduction of multilayer perceptrons as activation functions to be placed between convolution layers, and the replacement of a final fully connected layer by average pooling, where the number of output channels represents the final number of classes in the classification problem. Deeply-supervised nets [9] introduce explicit training objectives to all hidden layers, in addition to the backpropagated errors from the last softmax layer. The use of a recurrent structure that replaces the purely feedforward structure in ConvNets is explored by the RCNN model [10]. An extension of the NIN model based on the use of the maxout activation function instead of the multilayer perceptron is introduced in the MIM model [11], which also shows that the use of batch normalization units is crucial for allowing an effective training of several singlescale filters that are joined by maxout units. Finally, the Tree based Priors [21] model proposes a training method for classes with few samples, using a generative prior that is learned from the data and shared between related classes during the model learning.
The comparison on the CIFAR10 [7] dataset is shown in Tab. 3, where results are sorted based on the performance of each method, and the results of our proposed methods are highlighted. The results on the CIFAR100 [7] dataset are displayed in Tab. 4. Table 5 shows the results on MNIST [8], where it is worth reporting that the best result (over the five trained models) produced by our Competitive Multiscale Convolution model is a test error of , which is better than the single result from Liang and Hu [10]. Finally, the comparison on the SVHN [16] dataset is shown in Table 6, where two out of the five models show test error results of .
Table 3: Comparison on CIFAR10.
| Method | Test Error (mean ± standard deviation) |
|---|---|
| Competitive Multiscale Convolution | |
| Competitive Inception | |
| MIM [11] | |
| RCNN160 [10] | |
| Deeply-supervised nets [9] | |
| Network in Network [12] | |
| Maxout Networks [4] | |
| Stochastic Pooling [28] | |
Table 4: Comparison on CIFAR100.
| Method | Test Error (mean ± standard deviation) |
|---|---|
| Competitive Multiscale Convolution | |
| Competitive Inception | |
| MIM [11] | |
| RCNN160 [10] | |
| Deeply-supervised nets [9] | |
| Network in Network [12] | |
| Tree based Priors [21] | |
| Maxout Networks [4] | |
| Stochastic Pooling [28] | |
Table 5: Comparison on MNIST.
| Method | Test Error (mean ± standard deviation) |
|---|---|
| RCNN96 [10] | |
| Competitive Multiscale Convolution | |
| MIM [11] | |
| Deeply-supervised nets [9] | |
| Competitive Inception | |
| Network in Network [12] | |
| Conv. Maxout+Dropout [4] | |
| Stochastic Pooling [28] | |
Table 6: Comparison on SVHN.
| Method | Test Error (mean ± standard deviation) |
|---|---|
| Competitive Multiscale Convolution | |
| RCNN192 [10] | |
| Competitive Inception Convolution | |
| Deeply-supervised nets [9] | |
| Dropconnect [26] | |
| MIM [11] | |
| Network in Network [12] | |
| Conv. Maxout+Dropout [4] | |
| Stochastic Pooling [28] | |
5 Discussion and Conclusions
In terms of the model design choices in Sec. 4.1, we can see that the proposed Competitive Multiscale Convolution produces more accurate classification results than the proposed Competitive Inception. Given that the main difference between these two models is the presence of the maxpooling path within each module, we can conclude that this path does not help with the classification accuracy of the model. The better performance of both models with respect to the Inception Style model can be attributed to the maxout unit that induces competition among the underlying filters, which helps the classification results more than the collaborative nature of the Inception module. Considering model complexity, it is important to notice that the relation between the number of parameters and the training and testing times is not linear: even though the Inception Style model has 10% fewer parameters, it trains and tests 1.5 to 2 times faster than the proposed Competitive Multiscale Convolution and Competitive Inception models.
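The speed comparison above can be checked directly from the timing columns of Table 1. The sketch below copies the training times (in hours) and testing times (in milliseconds) reported there for CIFAR10 and computes the ratios with respect to the Inception Style model; the variable names are chosen here for illustration.

```python
# Train time (h) and test time (ms) on CIFAR10, copied from Table 1.
times = {
    "Competitive Multiscale Convolution": (6.4, 2.7),
    "Competitive Inception": (7.6, 3.1),
    "Inception Style": (3.9, 1.5),
}

base_train, base_test = times["Inception Style"]

# Slow-down factors (train, test) of each proposed model relative to Inception Style.
ratios = {
    name: (round(train / base_train, 2), round(test / base_test, 2))
    for name, (train, test) in times.items()
    if name != "Inception Style"
}
```

The resulting factors lie between roughly 1.6 and 2.1, which is in line with the range stated in the text.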
When answering the questions posed in Sec. 3.1, we assume that classification accuracy is a proxy for measuring the coadaptation between filters within a single module, where the intuition is that if the filters joined by a maxout activation unit coadapt and become similar to each other, a relatively small number of large regions in the input space will be formed, which results in few subnetworks to train, with each subnetwork becoming less specialized to its region [14, 22]. We argue that the main consequence of that is a potentially lower classification accuracy, depending on the complexity of the original classification problem. Using this assumption, we note from Tables 1 and 2 that the use of multiscale filters within a competitive module is in fact important to avoid the coadaptation of the filters, as shown by the more accurate classification results of the Multiscale model, compared to the Singlescale model. Furthermore, the use of deterministic, as opposed to stochastic, masking also appears to be more effective in avoiding filter coadaptation, given the more accurate classification results of the former. Nevertheless, the worse performance of the stochastic masking may be due to the fact that DropConnect has been designed for fully connected layers only [26], while our test bed for the comparison is set in the convolutional filters. To be more specific, we think that a fully connected layer usually encapsulates hundreds to thousands of weights for inputs of a similar number of dimensions, so randomly dropping a subset of the weight elements can hardly change the distribution of the output patterns. However, the convolution filters are of small dimensions, and each of our maxout units controls 4 to 5 filters at most, so such a masking scheme over small weight matrices could result in “catastrophic forgetting” [13], which explains why the Competitive DropConnect Singlescale Convolution performs even worse than the Competitive Singlescale Convolution on CIFAR10.
We also run an experiment that assesses whether filters of larger size within a competitive module can improve the classification accuracy at the expense of having a larger number of parameters to train. We test the inclusion of two more filters of sizes and in module 1 of blocks 1 and 2, and two more filter sizes and in module 1 of block 3 (see Fig. 3). The classification result obtained is on CIFAR10, and the number of model parameters is 13.11 M. This experiment shows that increasing the number of filters of larger sizes does not necessarily help improve the classification results. An important modification that can be suggested for our proposed Competitive Multiscale Convolution model is the replacement of the maxout by ReLU activation, where only the largest size filter of each module is kept and all other filters are removed. One can argue that such a model is perhaps less complex (in terms of the number of parameters) and probably as accurate as the proposed model. However, the results we obtained with such a model on CIFAR10 show that it has 3.28 M parameters (i.e., just slightly less complex than the proposed models, as shown in Tab. 1) and has a classification test error of , which is significantly larger than for our proposed models. On MNIST, this model has 0.81 M parameters and produces a classification error of , which also shows no advantage over the proposed models.
The comparisons with the state of the art in Tables 3-6 of Sec. 4.2 show that the proposed Competitive Multiscale Convolution model produces the best results in the field for three out of the four considered datasets. However, note that this comparison is not strictly fair to us because we run a five-model validation experiment (using different model initializations and different sets of minibatches for the stochastic gradient descent), which provides a more robust performance assessment of our method. In contrast, most of the methods in the field only show one single result of their performance. If we consider only the best result out of the five results in the experiment, then our Competitive Multiscale Convolution model has the best results in all four datasets (with, for example, on MNIST and on SVHN). An analysis of these results also allows us to conclude that the main competitors of our approach are the MIM [11] and RCNN [10] models, where the MIM method is closely related to our approach, but the RCNN method follows a quite different strategy.
In this paper, we show the effectiveness of using competitive units in modules that contain multiscale filters. We argue that the main reasons for the superior classification results of our proposal, compared with the current state of the art in several benchmark datasets, are the following: 1) the deterministic masking implicitly used by the multiscale filters avoids the issue of filter co-adaptation; 2) the competitive unit that joins the underlying filters and the batch normalization units promote the formation of a large number of subnetworks that are specialized in the classification problem restricted to a small area of the input space and that are regularized by the fact that they are trained together within the same model; and 3) the maxout unit allows the reduction of the number of parameters in the model. It is important to note that such modules can be applied in several types of deep learning networks, and we plan to apply them to other types of models, such as the recurrent neural network [10].
References

[1] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
[2] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In Computer Vision–ECCV 2014, pages 392–407. Springer, 2014.
[3] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. International Conference on Learning Representations (ICLR), 2014.
[4] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. The 30th International Conference on Machine Learning (ICML), 2013.
[5] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning (ICML), 2015.
[6] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
[7] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 1(4):7, 2009.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[9] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In Proceedings of AISTATS, 2015.
[10] M. Liang and X. Hu. Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3367–3375, 2015.
[11] Z. Liao and G. Carneiro. On the importance of normalisation layers in deep learning with piecewise linear activation units. CoRR, abs/1508.00330, 2015.
[12] M. Lin, Q. Chen, and S. Yan. Network in network. International Conference on Learning Representations (ICLR), 2013.
[13] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. The psychology of learning and motivation, 24(109–165):92, 1989.
[14] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 2924–2932, 2014.
[15] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 807–814, 2010.
[16] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5. Granada, Spain, 2011.
[17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, pages 1–42, 2014.
[18] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(3):411–426, 2007.
[19] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR), 2015.
[20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[21] N. Srivastava and R. R. Salakhutdinov. Discriminative transfer learning with tree-based priors. In Advances in Neural Information Processing Systems (NIPS), pages 2094–2102, 2013.
[22] R. K. Srivastava, J. Masci, F. Gomez, and J. Schmidhuber. Understanding locally competitive networks. International Conference on Learning Representations (ICLR), 2015.
[23] R. K. Srivastava, J. Masci, S. Kazerounian, F. Gomez, and J. Schmidhuber. Compete to compute. In Advances in Neural Information Processing Systems (NIPS), pages 2310–2318, 2013.
[24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[25] A. Vedaldi and K. Lenc. MatConvNet – convolutional neural networks for MATLAB.
[26] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 1058–1066, 2013.
[27] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[28] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. International Conference on Learning Representations (ICLR), 2013.