The ever-increasing complexity of Convolutional Neural Networks (CNNs) and their associated sets of layers demands deeper insight into the internal mechanics of CNNs. The functionality of CNNs is often understood as a series of projections interleaved with a variety of non-linearities that increase the capacity of the model [Hinton2009, Nair and Hinton2010, Ramachandran, Zoph, and Le2017, Zheng et al.2015]. Despite the fact that the prediction layer of CNNs (e.g., the Softmax layer) and the loss functions (e.g., Cross Entropy) are borrowed from the Bayesian framework, a clear connection between the functionality of the intermediate layers and probability theory remains elusive. The current understanding of CNNs leaves much to subjective design choices with extensive experimental justification.
We informally argue that subjectivity is inherent to problems defined over real numbers, and that the confusion surrounding the functionality of CNNs reflects this theoretical subjectivity. Since real vector spaces are unbounded and uncountable, they require strong assumptions in the form of prior information about the underlying data distribution in a Bayesian inference framework. For example, fitting a Gaussian distribution to a set of samples requires that the prior distribution on the location parameter be non-vanishing near the samples. In this scenario, an uninformative prior needs to be close to the uniform distribution over the real numbers, a paradoxical distribution. Since the real line is unbounded and uncountable, the choice of the model and its prior distribution is always highly informative [Jaynes1968]. Although the choice of the prior in univariate distributions is not a practical issue, the adverse effects of subjective priors are more evident in high dimensions. When the sample space is large and the data is comparatively sparse, either a carefully designed prior or an uninformative prior is needed. Note that in the context of CNNs, the architecture, initialization, regularization, and other design choices can be interpreted as imposing some form of prior on the distribution of real data. [Jaynes1957] shows that the correct extension of entropy to distributions over the reals does not have a finite value. By switching to finite state distributions (FSDs), the entropy becomes calculable and finite, opening the door to an information-theoretic treatment.
In contrast to distributions defined over real numbers, working with FSDs makes the problem of objective inference theoretically more tractable. In problems where the data are represented by real numbers, the values can be treated as the parameters of a finite-state distribution, so each sample represents a distribution over some finite space. Discrete modeling of the sample space reduces the complexity of the input domain, and treating the inputs as distributions reduces the chance of overfitting, since every sample represents a set of realizations. In the case of natural images, this modeling of the input data is justified by the following observation: in conventional image acquisition devices, the intensity of a pixel can be interpreted as the probability of the presence of photons at a spatial position and wavelength. A single image can therefore be considered a distribution of photons over the spatial plane, with finitely many states when the number of pixels is finite.
In this paper, we present a framework for classification whose key feature is that, unlike in existing models, inference is made on finite-state spaces. Classification of FSDs is attractive in that it sets up the requirement for the composition of classifiers, since the output of a Bayesian classifier is itself an FSD. To construct a Bayesian FSD classifier, we borrow concepts from the theory of large deviations and information geometry, introducing the Kullback-Leibler divergence (KLD) as the log-likelihood function. The composition of Bayesian classifiers then serves as a multilayer classification model. The resulting structure deeply resembles CNNs: modules similar to the core CNN layers are naturally derived and fit together. Specifically, we show that the popular non-linearities used in deep neural networks, e.g., ReLU and Sigmoid [Nair and Hinton2010], are in fact element-wise approximations of a normalization mapping. Moreover, we show that the linearities amount to calculating the KLD, while max pooling is an approximation to the marginalization of spatial indices. In our framework, there exists a natural correspondence between the type of nonlinearity and the type of pooling: Sigmoid and ReLU correspond to Average Pooling and Max Pooling, respectively, with each pair dictated by the type of KLD used. The models in our framework are statistically analyzable in all layers; there is a clear statistical interpretation for every parameter, variable and layer. The interpretability of the parameters and variables provides insights into the initialization, the encoding of the parameters and the optimization process. Since the distributions are over finite states, the entropy is easily calculable for both the model and the data, providing a crucial tool for both theoretical and empirical analysis.
The organization of the paper is as follows. In Section 2, we review related work on FSDs and the analysis of CNNs. In Section 3, we describe the construction of the proposed framework and a single layer model for classification, and explain the connections to CNNs. We then extend the framework to multiple layers, introduce the convolutional model, and derive a natural pooling layer by assuming stationarity of the data distribution; we also explain the relation between vanilla CNNs and our model. In Section 4, we evaluate a few baseline architectures in the proposed framework as a proof of concept and provide an analysis of the entropy measurements available in our framework.
2 Related Work
A line of work on statistical inference in finite-state domains focuses on the problem of Binary Independent Component Analysis (BICA) and its extension over finite fields, influenced by [Barlow, Kaushal, and Mitchison1989, Barlow1989]. The general methodology in the context of BICA is to find an invertible transformation of the input random variables that minimizes the sum of the marginal entropies [Yeredor2011, Yeredor2007, e Silva et al.2011, Painsky, Rosset, and Feder2014, Painsky, Rosset, and Feder2016]. Although the input space is finite, the search space for the correct transformation is computationally intractable for high-dimensional distributions given the combinatorial nature of the problem. Additionally, the number of equivalent solutions is large and the probability of generalization is low.
In the context of CNNs, a body of research concerns the discretization of the variables and parameters of neural networks [Courbariaux and Bengio, Soudry, Hubara, and Meir2014, Courbariaux, Bengio, and David2015]. [Rastegari et al.2016] introduced XNOR-Networks, in which the weights and the input variables take binary values. While the discretization of values is motivated by efficiency, optimization and representation learning still take place over real numbers and follow dynamics similar to those of CNNs.
To formalize the functionality of CNNs, [Mallat2016] considered a wavelet theory perspective of CNNs and established a mathematical baseline for their analysis. [Tishby, Pereira, and Bialek2000] introduced the Information Bottleneck (IB) method to remove irrelevant information while maintaining the mutual information between two variables. [Tishby and Zaslavsky2015] proposed to use the IB method with the objective of minimizing the mutual information between consecutive layers while maximizing the mutual information between the prediction variables and the hidden representations. [Su, Carin, and others2017] introduced a framework for stochastic non-linearities in which various non-linearities, including ReLU and Sigmoid, are produced by truncated Normal distributions. In the context of probabilistic networks, Sum Product Networks (SPNs) [Poon and Domingos2011, Gens and Domingos2012, Gens and Pedro2013] are of particular interest; under some conditions, they represent the joint distribution of the input random variables quite efficiently. A particularly important property of SPNs is their ability to calculate marginal probabilities and normalizing constants in linear time. This representational efficiency, however, comes at the cost of restrictions on the distributions that can be estimated with SPNs. [Patel, Nguyen, and Baraniuk2016] constructed Deep Rendering Mixture Models (DRMMs), which generate images given some nuisance variables, and showed that when an image is generated by a DRMM, the MAP inference of the class variable coincides with the operations in CNNs.
3 Proposed Framework
We set up our framework by modeling the input data as a set of “uncertain realizations” over symbols. To be precise, we define an uncertain realization as a probability mass function (pmf) over states with non-zero entropy, and similarly a certain realization is a degenerate pmf over states. To demonstrate how real-valued data can be interpreted as uncertain realizations, consider a set of $n$-pixel RGB images. We can view each pixel as being generated from the set $\{R, G, B\}$ and further interpret the value of each channel as the unnormalized log-probability of being in the corresponding state. If we normalize the pmf of each pixel, we can interpret the image as a factored pmf with $n$ factors, each pixel being a pmf over $3$ states. Formally, we define a transfer function $f: \mathbb{R}^d \to \Delta$, where $\Delta$ is the appropriate product of simplices and $d$ is the dimension of the input vector space. In the previous example, each pixel is mapped from $\mathbb{R}^3$ to the $2$-dimensional simplex $\Delta^2$, and the entire image is therefore mapped from $\mathbb{R}^{3n}$ to $(\Delta^2)^n$. In general, the choice of $f$ depends on the nature of the data, and it can either be designed or estimated during the training process.
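The pixel-to-pmf interpretation above can be sketched in a few lines; the function name and toy shapes are ours, and a log-softmax across channels stands in for the normalization step:

```python
import numpy as np

# Each pixel's channel values are read as unnormalized log-probabilities over
# the color states; a log-softmax across the channel axis turns every pixel
# into a log-domain pmf.
def to_log_pmf(image):
    shifted = image - image.max(axis=-1, keepdims=True)  # numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 4, 3))          # toy 4x4 "RGB image", arbitrary values
log_p = to_log_pmf(img)
pixel_sums = np.exp(log_p).sum(axis=-1)   # each pixel is now a pmf over 3 states
```

Exponentiating and summing across channels recovers 1 at every pixel, so the image is a factored pmf as described.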
Although probability assignment to a certain realization given a model is trivial, the extension to uncertain realizations requires further consideration. We consider Moment Projection (M-Projection) and Information Projection (I-Projection) and observe that both projections are used to obtain probabilities on distributions in two established scenarios, namely Sanov’s Theorem and the Dirichlet distribution. Sanov’s theorem [Sanov1958] and the probability of type classes (Method of Types) [Cover and Thomas2012, Csiszár1998] use the KLD associated with the I-Projection of the input distribution onto the underlying pmf (1) to calculate the probability of observing empirical distributions. On the other hand, the Dirichlet distribution uses the KLD associated with the M-Projection (2) to asymptotically assign probabilities to the underlying distribution. We use the following approximations for probability assignments to a distribution $q$ given a distribution $p$:
$$p_I(q \mid p) \propto \exp\left(-\lambda D(q \,\|\, p)\right), \qquad (1)$$
$$p_M(q \mid p) \propto \exp\left(-\lambda D(p \,\|\, q)\right), \qquad (2)$$
where $D(a \,\|\, b) = \sum_i a_i \log (a_i / b_i)$ is the KLD.
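A hedged numerical illustration of the two probability assignments discussed above; the normalization constant is dropped and `lam` stands in for the concentration parameter:

```python
import numpy as np

def kld(a, b):
    # D(a || b) = sum_i a_i log(a_i / b_i), for pmfs with full support
    return float((a * (np.log(a) - np.log(b))).sum())

def p_ikld(q, p, lam=1.0):
    # Sanov-style assignment: probability of observing q under p
    return np.exp(-lam * kld(q, p))

def p_mkld(q, p, lam=1.0):
    # Dirichlet-style assignment: uses the reversed divergence
    return np.exp(-lam * kld(p, q))

q = np.array([0.7, 0.2, 0.1])
p = np.array([0.4, 0.4, 0.2])
# The two assignments differ because the KLD is asymmetric.
gap = abs(p_ikld(q, p) - p_mkld(q, p))
```

Both assignments peak when the two distributions coincide; the asymmetry of the KLD is what later separates the ReLU-type and Sigmoid-type networks.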
Inspired by the aforementioned probability assignments, we regard both types of KLD as the main tool for probability assignment on distributions in our model. We denote the KLD associated with the I-Projection and the M-Projection as I-KLD and M-KLD, respectively. Later, we will show that approximations to ReLU-type networks and Sigmoid-type networks are derived when employing the M-KLD and I-KLD probability assignments, respectively. We define a single layer model for supervised classification as an example of using M-KLD; constructing the I-KLD model follows a similar path and is briefly described later in this section. Let the model be a mixture of a set of probability distributions $\{\mu_c\}$ over $n$ symbols, each representing the distribution of a class,
$$p(x) = \sum_{c} \pi_c \, \mu_c(x). \qquad (3)$$
To calculate the membership probability of an input distribution $x$ in class $c$ following the Bayesian framework, we have
$$P(c \mid x) = \frac{\pi_c \, p(x \mid \mu_c)}{\sum_{k} \pi_k \, p(x \mid \mu_k)}, \qquad (4)$$
where $p(x \mid \mu_c)$ is the probability that $x$ is generated by $\mu_c$. Substituting the log-likelihood term with the M-KLD assignment (2), we get
$$P(c \mid x) \approx \frac{\exp\left(-\lambda D(\mu_c \,\|\, x) + \log \pi_c\right)}{\sum_{k} \exp\left(-\lambda D(\mu_k \,\|\, x) + \log \pi_k\right)}. \qquad (5)$$
Note that the KLD term in (5) is linear in $\log x$. We can break the operation in (5) into the composition of a linear mapping Divg and a non-linear mapping LNorm, defined componentwise below. To formally define Divg and LNorm, let us define the logarithmic simplex of dimension $n$, denoted by $\Delta_{\log}^{n}$, as
$$\Delta_{\log}^{n} = \left\{ y \in \mathbb{R}^{n} : \textstyle\sum_{i} e^{y_i} = 1 \right\}. \qquad (6)$$
Setting up the domain of Divg and the parameters as
$$\mathrm{Divg}: \Delta_{\log}^{n} \to \mathbb{R}^{m}, \qquad w_i \in \Delta^{n}, \; i = 1, \dots, m, \qquad (7)$$
where $w_i$ is the $i$-th row of the matrix $W$, we define the function Divg as
$$\mathrm{Divg}(y) = \lambda \left( W y + H(W) \right) + b, \qquad (8)$$
where each row of $W$ contains a distribution and $H(W)$ calculates the entropy of each row. The weights and biases, being the parameters of the model, are randomly initialized and trained according to some loss function. Unlike in current CNNs, the familiar terms in (8), such as the linear transformation $Wy$ and the bias term $b$, are not arbitrary. Specifically, $w_i^{\top} y$ is the negative cross entropy of the sample and the $i$-th distribution, while $b_i$ is the logarithm of the mixing coefficient in (3). The entropy $H(W)$ can be thought of as a regularizer matching the Maximum Entropy Principle [Jaynes1957]: the term biases the probability toward the distributions with the highest degree of uncertainty.
The non-linear function LNorm is the Log Normalization function, whose $i$-th component is defined as
$$\mathrm{LNorm}(z)_i = z_i - \log \sum_{j} e^{z_j}. \qquad (9)$$
Note that LNorm is a multivariate operation. The behavior of LNorm in one dimension of the output and input is similar to that of ReLU. Furthermore, $\lambda$ in (8) expresses the certainty in the choice of the model. For example, when $\lambda = 0$, equal probability is assigned to all input distributions, whereas when $\lambda$ is large, a slight deviation of the input from the class distributions results in a significant decrease in the membership probability. We refer to $\lambda$ as the concentration parameter; in all the models presented, however, we keep $\lambda$ fixed.
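A minimal sketch of a single M-KLD layer under the definitions above, with toy shapes of our choosing; the cross-check confirms that LNorm composed with Divg reproduces the posterior computed directly from the M-KLD likelihood:

```python
import numpy as np

def lnorm(z):
    # Log Normalization: z - logsumexp(z); the output is a log-domain pmf.
    m = z.max()
    return z - (m + np.log(np.exp(z - m).sum()))

def divg(y, W, b, lam=1.0):
    # y is a log-pmf; each row of W is a class distribution, b its log mixing weight.
    ent = -(W * np.log(W)).sum(axis=1)        # entropy of each row of W
    return lam * (W @ y + ent) + b

rng = np.random.default_rng(1)
W = rng.dirichlet(np.ones(5), size=3)         # 3 class distributions over 5 states
b = np.log(np.ones(3) / 3)                    # uniform mixing coefficients
x = rng.dirichlet(np.ones(5))                 # input distribution
y = np.log(x)

log_post = lnorm(divg(y, W, b))               # log membership probabilities

# Cross-check: the same posterior computed directly from the M-KLD likelihood.
def kld(a, c):
    return float((a * (np.log(a) - np.log(c))).sum())

scores = np.array([-kld(W[k], x) + b[k] for k in range(3)])
direct = lnorm(scores)
```

The equivalence holds because $-\lambda D(w_k \| x) = \lambda (w_k^\top \log x + H(w_k))$, which is exactly the $k$-th row of Divg before the bias.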
Multilayer Model, Convolutional Model, and Pooling
The model described in the previous section lends itself to a recursive generalization, i.e., the input and output of the model are both distributions on finite states. We extend the model simply by stacking single layer models. The input of each layer is in $\Delta_{\log}^{n}$, so the log-normalization performed by LNorm is crucial to maintain the recursion. The multilayer model (Finite Neural Network, FNN) is defined as the composition of Divg and LNorm layers,
$$F = \mathrm{LNorm}^{(L)} \circ \mathrm{Divg}^{(L)} \circ \cdots \circ \mathrm{LNorm}^{(1)} \circ \mathrm{Divg}^{(1)}, \qquad (10)$$
where the superscript denotes the layer index and $L$ is the total number of layers. To elaborate, after each pair of layers, the input to the next layer is the vector of log-probabilities of membership in the classes of the previous layer. Therefore, one can interpret the intermediate variables as distributions on a finite set of symbols (classes). In the case where I-KLD is used as the probability assignment mechanism, the input to the layers must be in the probability domain, so the nonlinearity reduces to Softmax, which in one dimension behaves similarly to the Sigmoid function. Note that the entropy term in I-KLD is not linear with respect to the input. We focus on the M-KLD (ReLU-activated) version; however, the concepts developed herein are readily extendable to I-KLD (Sigmoid-activated) networks.
Convolutional Model: One of the key properties of the distribution of image data is strict-sense stationarity, meaning that the joint distribution of pixels does not change with translation. Therefore, it is desirable that the model be shift-invariant. Inspired by CNNs, we impose shift invariance through convolutional KLD (KL-Conv) layers. In our convolutional model, a filter of spatial size $h \times w$ with $c$ channels represents a factorized distribution with $h \cdot w$ factors, each factor representing a pmf over $c$ states. The distribution associated with the filter is
$$w(s) = \prod_{r \in N} w_r(s_r), \qquad (11)$$
where $w_r$ is a single factor over $c$ states defined by the values of the filter at position $r$, $N$ is a neighborhood of pixels and $s_r$ is the state at position $r$. In other words, the values across the channels of the filter at each position represent a pmf and sum up to $1$. In the RGB image example provided previously, the factors of the filters compatible with the input layer are over $3$ states. The input of the layer is log-normalized across the channels. We model the input with $c$ channels as a factorized distribution where each pixel represents a factor. The filter distribution is shifted along the spatial positions and the KLD between the filter distribution and each neighborhood of pixels is calculated. As an example, we define the KL-Conv operation associated with M-KLD as
$$\mathrm{KLConv}(y) = \lambda \left( W \ast y + H(W) \right) + b, \qquad (12)$$
where $W$ represents the set of filters in the layer (each filter representing a distribution), $H(W)$ is the vector of filter entropies, $\ast$ is the convolution operator used in conventional CNNs and $\lambda$ is the concentration parameter.
The non-linearity is applied to the input across the channels in the same manner as in the multilayer model, i.e.,
$$\mathrm{LNorm}(z)_{p,i} = z_{p,i} - \log \sum_{j} e^{z_{p,j}}, \qquad (13)$$
where $p$ indexes the spatial positions and $i$ the channels.
The overall operation of the KL-Conv and LNorm layers is illustrated in Fig. 1.
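A loop-based sketch of KL-Conv under the factorized-filter assumption (valid-mode convolution, toy sizes, names ours):

```python
import numpy as np

def kl_conv(logx, filters, bias, lam=1.0):
    """Minimal valid-mode KL-Conv sketch (loops kept for clarity).

    logx:    (H, W, C) log-normalized input; each pixel is a log-pmf over C states
    filters: (K, h, w, C) filter bank; each spatial position holds a pmf over C states
    bias:    (K,) log mixing coefficients
    Output (H-h+1, W-w+1, K): lam * (<filter, log-patch> + H(filter)) + bias,
    i.e. -lam times the M-KLD of the filter and the patch, plus the bias.
    """
    K, h, w, C = filters.shape
    H, W, _ = logx.shape
    ent = -(filters * np.log(filters)).sum(axis=(1, 2, 3))  # entropy of each factorized filter
    out = np.empty((H - h + 1, W - w + 1, K))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            patch = logx[i:i + h, j:j + w, :]
            out[i, j] = lam * ((filters * patch).sum(axis=(1, 2, 3)) + ent) + bias
    return out

rng = np.random.default_rng(2)
filters = rng.dirichlet(np.ones(3), size=(4, 2, 2))   # 4 filters, 2x2 spatial, 3 states
x = rng.dirichlet(np.ones(3), size=(5, 5))            # 5x5 input, one pmf per pixel
response = kl_conv(np.log(x), filters, np.zeros(4))

# Cross-check one entry: with zero bias, the response equals minus the sum of
# the per-factor KLDs between the filter factors and the input patch.
check = 0.0
for r in range(2):
    for c in range(2):
        wrc, xrc = filters[0, r, c], x[r, c]
        check -= float((wrc * (np.log(wrc) - np.log(xrc))).sum())
```

In practice the inner loops would be replaced by the standard convolution primitive, exactly as (12) suggests.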
Pooling: We define the pooling function as a marginalization of indices in a random vector. In the case of the tensors extracted in FNNs, the indices correspond to the relative spatial positions. In other words, the distributions at the spatial positions are mixed together through the pooling function. Assume $y$ is the input to the pooling layer, where $y \in \Delta_{\log}^{n}$ at every spatial position. The input is in the logarithm domain, so to calculate the marginalized distribution the input needs to be transferred to the probability domain. After marginalization over the spatial index, the output is transferred back to the logarithm domain. We define the logarithmic pooling function as
$$\mathrm{LPool}(y)_i = \log \sum_{r \in \mathrm{supp}(\rho)} \rho(r)\, e^{y_{r,i}}, \qquad (15)$$
where $\rho$ is the probability distribution over the relative spatial positions and $\mathrm{supp}$ denotes the support. In the usual setting of pooling functions, and in our model, $\rho$ is assumed to be a uniform distribution and the support of the distribution represents the pooling window. Note that the log-sum-exp term in (15) approaches the Max function as the variables in the exponent deviate from each other. Therefore, we hypothesize that Max Pooling in conventional CNNs approximates (15). Evidently, the output of the pooling function is already normalized and is passed to the next layer. In the case that I-KLD is used, the input is in the probability domain and the pooling function is identical to average pooling.
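A small sketch of the log-domain pooling with a uniform $\rho$, illustrating how log-sum-exp collapses to the maximum when one log-probability dominates:

```python
import numpy as np

def lpool(y):
    # Log-domain pooling over a window: log mean(exp(y)) with uniform rho,
    # computed stably via the running maximum.
    m = y.max()
    return float(m + np.log(np.exp(y - m).mean()))

window = np.array([1.0, 0.3, -0.5])
# For comparable values LPool mixes the window; when one entry dominates,
# the result approaches the maximum (up to the constant log of the window size).
soft = lpool(window)
hard = lpool(100 * window)
```

The second call is within floating-point error of `100 * window.max() - log(3)`, which is the max-pooling behavior the paragraph above hypothesizes.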
Input Layer: The model presented so far takes finite state probability distributions as input to the layers. In the case of natural images, we chose to normalize all the pixel values to the interval $[0, 1]$. Each pixel value was interpreted as the expectation of a binary random variable with range $\{0, 1\}$. As a result, each filter with $n$ variables in total is a probability distribution over a space of $2^n$ states. Note that our model is not restricted by this choice of input distribution. Depending on the nature of the input, the user can modify the distribution represented by the filters, e.g., to distributions on real spaces.
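The binary interpretation above can be sketched as follows; the function name and the clipping constant are ours:

```python
import numpy as np

# A pixel value v in [0, 1] is read as the expectation of a binary variable,
# i.e. the pmf (v, 1 - v) over two states, so an image becomes a factorized
# distribution with one binary factor per value.
def to_binary_factors(pixels):
    v = np.clip(pixels, 1e-6, 1 - 1e-6)   # keep log-probabilities finite
    return np.stack([v, 1.0 - v], axis=-1)

pixels = np.array([0.25, 0.5, 0.9])
factors = to_binary_factors(pixels)        # shape (3, 2): three binary pmfs
```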
As explained, the parameters of the model represent the parameters of distributions, which are constrained to some simplex. To eliminate these constraints, we use a “Link Function”, $f$, mapping the “Seed” parameters to the acceptable domain of parameters, i.e., the logarithmic/probability simplex. The link function impacts the optimization process and partially reflects the prior distribution over the parameters. While the seed parameters are updated uniformly in Euclidean space, the mapped parameters change according to the link function. The filters in our model are factorized distributions and each component is a categorical distribution. Additionally, the biases are categorical distributions, so we use a similar parameterization for the biases and the filter components. In general, the filters of the model are obtained by
$$w_r = f(\theta_r), \qquad b = \log f(\theta_b), \qquad (16)$$
where $\theta_r$ denotes the seed parameters of the filter at spatial position $r$ across all the channels, $w_r$ represents the channels of the filter at position $r$, $\theta_b$ is the seed parameter of the bias and $b$ is the bias vector. Since the filters and biases comprise categorical distributions, we avoid complicating the notation by limiting the discussion to the parameterization of a categorical distribution. We suggest two forms of parameterization of a categorical distribution, namely the log-simplex and spherical parameterizations.
Log-Simplex Parameterization: We define the link function with respect to the natural parameterization of a categorical distribution, where the seed parameters are interpreted as the logarithms of unnormalized probabilities. Therefore, the link function is defined as the Softmax function
$$f(\theta)_i = \frac{e^{\theta_i}}{\sum_{j} e^{\theta_j}}, \qquad (18)$$
where $\theta$ is the seed parameter vector and the subscript denotes the index of the vector components. Writing down the Jacobian of (18),
$$\frac{\partial f(\theta)_i}{\partial \theta_j} = f(\theta)_i \left( \delta_{ij} - f(\theta)_j \right), \qquad (19)$$
we observe that the Jacobian depends only on $f(\theta)$ and not on the denominator in (18), and that the link function is invariant to translations of $\theta$ along the all-ones vector $\mathbf{1}$. The Log-Simplex parameterization thus renders this additional degree of freedom ineffective.
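A short numerical check of the log-simplex link and its translation invariance (names ours):

```python
import numpy as np

def softmax_link(theta):
    # Log-simplex link: seed parameters are unnormalized log-probabilities.
    shifted = theta - theta.max()      # numerical stability; also harmless by invariance
    e = np.exp(shifted)
    return e / e.sum()

theta = np.array([0.3, -1.2, 2.0, 0.0])
p = softmax_link(theta)

# Translating theta along the all-ones vector leaves the distribution unchanged.
p_shift = softmax_link(theta + 3.7)

# Jacobian of the link: J_ij = p_i (delta_ij - p_j). It depends on theta only
# through p; its rows sum to zero, reflecting the invariant direction.
J = np.diag(p) - np.outer(p, p)
```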
Initialization: We initialize each factor of the filters by sampling from a Dirichlet distribution with all parameters equal to $1$, so the components of each distribution are generated uniformly on the corresponding simplex. We speculate that the initialization of the model should maximize the mixing entropy, or Jensen-Shannon Divergence (JSD), of the filters in a given layer, defined as
$$\mathrm{JSD} = H\!\left( \sum_{i=1}^{K} \pi_i w_i \right) - \sum_{i=1}^{K} \pi_i H(w_i), \qquad (20)$$
where $K$ is the total number of filters, $w_i$ is the $i$-th filter and $\pi_i$ is the corresponding mixture proportion. There is a parallel between the orthogonal initialization of filters in conventional CNNs and maximizing the JSD in M-KLD networks. In the extreme case where the filters are degenerate distributions on unique states and together cover all possible states, the JSD is at its global maximum and the M-KLD operation is invertible. Similarly, the orthogonal initialization of conventional CNNs is motivated by having invertible transformations to help the information flow through the layers. Since it is hard to obtain a global maximizer of the JSD, we minimize the entropy of the individual filters (the second term in (20)) by scaling the log-probabilities with a constant factor chosen as a rule of thumb. Finally, the bias seed components are initialized to zero, indicating equal mixture proportions.
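A sketch of the Dirichlet initialization and the mixing-entropy computation in (20), with toy filter-bank sizes of our choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
K, n = 8, 16                                   # 8 filters over 16 states (toy sizes)
filters = rng.dirichlet(np.ones(n), size=K)    # Dirichlet(1,...,1): uniform on the simplex
pi = np.ones(K) / K                            # equal mixture proportions

def entropy(p):
    return float(-(p * np.log(p)).sum())

# Mixing entropy / JSD of the filter bank:
# JSD = H(sum_i pi_i w_i) - sum_i pi_i H(w_i).
mixture = (pi[:, None] * filters).sum(axis=0)
jsd = entropy(mixture) - sum(pi[i] * entropy(filters[i]) for i in range(K))
```

The JSD is bounded between 0 and $H(\pi) = \log K$, with the upper bound reached only by the degenerate, state-covering filter banks described above.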
Spherical Parameterization: Here, we present an alternative parameterization method that attempts to eliminate the learning rate hyper-parameter.
Assume that we parameterize the categorical distribution by the link function
$$f(\theta)_i = \frac{\theta_i^2}{\|\theta\|^2}. \qquad (21)$$
The expression in (21) maps $\theta / \|\theta\|$ to the unit sphere $S^{n-1}$, where the squares of the components are the probabilities. The mapping defined in (21) ensures that the values of the loss function and the predictions are invariant to scaling $\theta$. The Jacobian of (21) is
$$\frac{\partial f(\theta)_i}{\partial \theta_j} = \frac{2\, \theta_i \delta_{ij}}{\|\theta\|^2} - \frac{2\, \theta_i^2 \theta_j}{\|\theta\|^4}. \qquad (22)$$
It is evident from (22) that the norm of the gradient is inversely related to $\|\theta\|$. Scaling $\theta$ is equivalent to changing the step size, since the direction of the gradients does not depend on $\|\theta\|$. Additionally, the objective function does not depend on $\|\theta\|$, so the gradient vector obtained from the loss function is orthogonal to the vector $\theta$. As a consequence of this orthogonality, updating along the gradients always increases the norm of the parameter vector, so the effective learning rate decreases at each iteration, independent of the network structure.
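A numerical check of the spherical link's scale invariance and the orthogonality property discussed above (names ours):

```python
import numpy as np

def sphere_link(theta):
    # Spherical link: squared, normalized components are the probabilities.
    return theta ** 2 / (theta ** 2).sum()

theta = np.array([0.5, -1.0, 2.0])
p = sphere_link(theta)
p_scaled = sphere_link(3.0 * theta)        # scale invariance of the mapping

# Jacobian of the link, as in (22). Each row's directional derivative along
# theta vanishes (scale invariance), so loss gradients in the seed space are
# orthogonal to theta and updates can only grow ||theta||, shrinking the
# effective step size over time.
n2 = (theta ** 2).sum()
J = 2.0 * np.diag(theta) / n2 - 2.0 * np.outer(theta ** 2, theta) / n2 ** 2
radial = J @ theta                          # should vanish
```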
Initialization: The seed parameters are initialized uniformly on $S^{n-1}$. The standard way of generating samples uniformly on $S^{n-1}$ is to sample each component from a Normal distribution and normalize the resulting vector.
4 Experimental Evaluations
We experimented with our model on the CIFAR-10 and CIFAR-100 classification datasets [Krizhevsky and Hinton2009]. We employed three types of base CNN architectures, namely Quick-CIFAR, Network in Network (NIN) [Lin, Chen, and Yan2013] and VGG [Simonyan and Zisserman2014, Liu and Deng2015], to experiment with different network sizes. The CNN architectures were transformed into their Finite CNN (FCNN) versions by replacing Conv/ReLU/Pool with KL-Conv/LNorm/LPool. The inputs for the original architectures were whitened, while no whitening was applied when testing the FCNNs. We first compared the performance of the original networks with their corresponding transformed architectures in finite states. We excluded certain layers from our transformation, e.g., Dropout and batch normalization (BatchNorm) [Ioffe and Szegedy2015], since we do not yet have a clear justification for their roles in our model. We did not use weight decay [Krizhevsky, Sutskever, and Hinton2012] or regularization, and the learning rate was held fixed in the FCNNs. The FCNNs were parameterized with both the log-simplex and spherical schemes for comparison. Experiments with I-KLD were excluded, since they achieved lower accuracy than M-KLD. We justify this observation by two facts about I-KLD: 1) since the input is in the probability domain, the nonlinearity behaves similarly to Sigmoid, so the vanishing gradient problem exists in I-KLD; 2) as opposed to LNorm, the I-KLD normalization is not convex and interferes with the optimization process.
Table 1 demonstrates the performance achieved by the baselines and their FCNN analogues.
For all the conventional CNN networks, the data was centered at the origin and ZCA whitening was employed. Additionally, the originally optimized learning rates were used to train the CNNs. The weights in all the models were regularized with a norm penalty, where in the case of NIN and VGG the regularization coefficient is defined per layer. VGG was unable to learn without being equipped with BatchNorm and Dropout layers. In the case of NIN, we also could not train the network without Dropout and BatchNorm, so we rely on the results reported in [Lin, Chen, and Yan2013] for vanilla NIN (without Dropout and BatchNorm) trained on CIFAR-10 for 200 epochs. Figure 3 in [Lin, Chen, and Yan2013] reports the test error of vanilla NIN on CIFAR-10 as roughly 19%, which is similar to the results obtained by its finite counterpart. The final test error reduces to 14.51% after a number of epochs that is unknown to us. Vanilla NIN results on CIFAR-100 are not available in the original paper. The FCNNs achieved lower performance than the VGG and NIN architectures equipped with Dropout and BatchNorm. Note that the FCNNs perform without regularization, data preprocessing, hyper-parameter optimization, or learning rate schedules. The results show that the finite state models perform on the same scale as CNNs, considering the simplicity of FCNNs. Spherical parameterization performs better than Log-Simplex in the NIN-Finite and Quick-CIFAR-Finite networks, whereas in VGG-Finite Log-Simplex is superior. We do not have a definite explanation for the difference in performance of the parameterizations across architectures; however, the results show that neither is objectively superior as it stands.
Entropy of Filters and Biases
To analyze the behavior of the networks, we performed a qualitative analysis of the trends of the bias entropies and the filter entropies. In our experiments, M-KLD was used as the linearity. Since the input is represented by log-probabilities in the cross entropy term of M-KLD, the filter distributions naturally tend toward low-entropy distributions. However, in Figure 2, we observe that the average entropy of some layers starts to increase after some iterations. This trend is visible in the early layers of the networks. Since high-entropy filters are more prone to producing high divergences when the input distribution is low-entropy (a property of M-KLD), the network learns to approach the local optimum from low-entropy distributions. The entropies of the input tensors of the late layers are larger than those of the early layers and start decreasing during the learning process. Therefore, the entropy of the filters decreases as the entropy of their input decreases.
The entropies of the bias distributions contain information about the architecture of the network. Note that the bias component is the logarithm of the mixing coefficients. Degeneracy in the bias distribution removes the effect of the corresponding filters from the prediction. An increase in the entropy of the biases could also indicate the complexity of the input, in the sense that the input distribution cannot be estimated with a mixture of factorized distributions given the current number of mixture components.
5 Conclusion
Our work was motivated by the theoretical complications of objective inference in infinite state spaces. We argued that in finite state spaces objective inference is theoretically feasible, while such spaces remain complex enough to express high-dimensional data. The stepping stones for inference in high-dimensional finite spaces were provided in the context of Bayesian classification.
The recursive application of Bayesian classifiers resulted in FNNs, a structure remarkably similar to Neural Networks in terms of its activations (ReLU/Sigmoid) and its linearity. Consequently, by introducing the shift-invariance property (the strict-sense stationarity assumption) through convolution, FCNNs were produced as the finite state analogue of CNNs. The pooling function in FCNNs was derived by marginalizing the spatial position variables, and the Max Pool function was explained as an approximation to this marginalization in the log domain. In our work, it is evident that there exists a correspondence between M-KLD, ReLU and Max Pooling, and similarly between I-KLD, Sigmoid and Average Pooling.
In the context of classic CNNs, diverse interpretations of the layers and the values of the feature maps exist, whereas in FNNs the roles of the layers and the nature of every variable are clear. Additionally, the variables and parameters represent distributions, making the model amenable to a variety of statistical tools, stochastic forward passes and stochastic optimization. The initialization and parameterization of the model point clearly and directly to the objective inference literature [Jeffreys1946, Jaynes1968], which could potentially reveal further directions on how to encode the desired functionality objectively.
Open Questions: The pillar of our framework is assigning probabilities to uncertain events. We directed the reader to the literature that justifies the usage of both KLD forms in asymptotic cases. I-KLD is used to assign probabilities to empirical distributions, while M-KLD assigns probability to the true distribution given some empirical distribution. The concentration parameter roughly represents the number of empirical samples in both probability assignments. The following questions are subjects of future investigation.
- The experiments show that using M-KLD as opposed to I-KLD results in higher performance. How could one theoretically justify the performance gap?
- Could both schemes of probability assignment be incorporated in the learning process?
- The normalizing factors in the nonlinearities represent the probability of the observation given the mixture distribution of the filters. Can they be included in the objective to train without supervision?
- [Barlow, Kaushal, and Mitchison1989] Barlow, H. B.; Kaushal, T. P.; and Mitchison, G. J. 1989. Finding minimum entropy codes. Neural Computation 1(3):412–423.
- [Barlow1989] Barlow, H. B. 1989. Unsupervised learning. Neural computation 1(3):295–311.
- [Courbariaux and Bengio] Courbariaux, M., and Bengio, Y. 2016. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR abs/1602.02830.
- [Courbariaux, Bengio, and David2015] Courbariaux, M.; Bengio, Y.; and David, J.-P. 2015. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, 3123–3131.
- [Cover and Thomas2012] Cover, T. M., and Thomas, J. A. 2012. Elements of information theory. John Wiley & Sons.
- [Csiszár1998] Csiszár, I. 1998. The method of types [information theory]. IEEE Transactions on Information Theory 44(6):2505–2523.
- [e Silva et al.2011] e Silva, D. G.; Attux, R.; Nadalin, E. Z.; Duarte, L. T.; and Suyama, R. 2011. An immune-inspired information-theoretic approach to the problem of ica over a galois field. In IEEE Information Theory Workshop (ITW), 618–622.
- [Gens and Domingos2012] Gens, R., and Domingos, P. 2012. Discriminative learning of sum-product networks. In Advances in Neural Information Processing Systems, 3239–3247.
- [Gens and Pedro2013] Gens, R., and Pedro, D. 2013. Learning the structure of sum-product networks. In International Conference on Machine Learning, 873–880.
- [Hinton2009] Hinton, G. E. 2009. Deep belief networks. Scholarpedia 4(5):5947.
- [Ioffe and Szegedy2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 448–456.
- [Jaynes1957] Jaynes, E. T. 1957. Information theory and statistical mechanics. Physical review 106(4):620.
- [Jaynes1968] Jaynes, E. T. 1968. Prior probabilities. IEEE Transactions on systems science and cybernetics 4(3):227–241.
- [Jeffreys1946] Jeffreys, H. 1946. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences 453–461.
- [Krizhevsky and Hinton2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, University of Toronto.
- [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105.
- [Lin, Chen, and Yan2013] Lin, M.; Chen, Q.; and Yan, S. 2013. Network in network. arXiv preprint arXiv:1312.4400.
- [Liu and Deng2015] Liu, S., and Deng, W. 2015. Very deep convolutional neural network based image classification using small training sample size. In 3rd IAPR Asian Conference on Pattern Recognition (ACPR), 730–734. IEEE.
- [Mallat2016] Mallat, S. 2016. Understanding deep convolutional networks. Phil. Trans. R. Soc. A 374(2065):20150203.
- [Nair and Hinton2010] Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807–814.
- [Painsky, Rosset, and Feder2014] Painsky, A.; Rosset, S.; and Feder, M. 2014. Generalized binary independent component analysis. In IEEE International Symposium on Information Theory (ISIT), 1326–1330.
- [Painsky, Rosset, and Feder2016] Painsky, A.; Rosset, S.; and Feder, M. 2016. Large alphabet source coding using independent component analysis. arXiv preprint arXiv:1607.07003.
- [Patel, Nguyen, and Baraniuk2016] Patel, A. B.; Nguyen, M. T.; and Baraniuk, R. 2016. A probabilistic framework for deep learning. In Advances in Neural Information Processing Systems, 2558–2566.
- [Poon and Domingos2011] Poon, H., and Domingos, P. 2011. Sum-product networks: A new deep architecture. In IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 689–690.
- [Ramachandran, Zoph, and Le2017] Ramachandran, P.; Zoph, B.; and Le, Q. V. 2017. Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941.
- [Rastegari et al.2016] Rastegari, M.; Ordonez, V.; Redmon, J.; and Farhadi, A. 2016. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, 525–542. Springer.
- [Sanov1958] Sanov, I. N. 1958. On the probability of large deviations of random variables. Technical report, North Carolina State University. Dept. of Statistics.
- [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- [Soudry, Hubara, and Meir2014] Soudry, D.; Hubara, I.; and Meir, R. 2014. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In Advances in Neural Information Processing Systems, 963–971.
- [Su, Carin, and others2017] Su, Q.; Carin, L.; et al. 2017. A probabilistic framework for nonlinearities in stochastic neural networks. In Advances in Neural Information Processing Systems, 4489–4498.
- [Tishby and Zaslavsky2015] Tishby, N., and Zaslavsky, N. 2015. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop (ITW), 1–5.
- [Tishby, Pereira, and Bialek2000] Tishby, N.; Pereira, F. C.; and Bialek, W. 2000. The information bottleneck method. arXiv preprint physics/0004057.
- [Yeredor2007] Yeredor, A. 2007. ICA in Boolean XOR mixtures. In Independent Component Analysis and Signal Separation, 827–835.
- [Yeredor2011] Yeredor, A. 2011. Independent component analysis over Galois fields of prime order. IEEE Transactions on Information Theory 57(8):5342–5359.
- [Zheng et al.2015] Zheng, H.; Yang, Z.; Liu, W.; Liang, J.; and Li, Y. 2015. Improving deep neural networks using softplus units. In IEEE International Joint Conference on Neural Networks (IJCNN), 1–4.