Sparsity in either neural activity or synaptic connectivity plays an important role in sensory information processing across brain areas Lennie (2003); Huang and Toyoizumi (2016a). Sparsity constraints imposed on neural activity in a sparse coding model Olshausen and Field (1996)
reproduce Gabor-like filters (edge detectors), which resemble the receptive fields of simple cells in the mammalian primary visual cortex. Sparse representations were also applied to deep belief networks to model hierarchical representations of natural image statistics Lee et al. (2008), which capture higher-order features at deeper levels of the cortical hierarchy. In addition, from the perspective of optimal information storage, there must exist a large fraction of silent or potential synapses Brunel et al. (2004), consistent with the existence of these synapses in cortex and cerebellum Barbour et al. (2007). These (zero) synapses are vital for plasticity during learning Brunel et al. (2004). Therefore, a sparse representation in synaptic connectivity is also appealing from the standpoint of optimal neural computation.
Most artificial neural networks are trained by supervised learning, which requires a large library of images prelabeled with categories. By contrast, unsupervised learning gives humans and non-human animals the ability to make sense of the external world by themselves, without any additional supervision; it aims at extracting regularities in sensory inputs without specific labels. Identifying the learning algorithms that may model the external world in an unsupervised way is important for designing machine intelligence. However, understanding the computational mechanisms of unsupervised learning in sensory representation is extremely challenging Erhan et al. (2010).
Although zero synapses have been observed in real neural circuits, their computational role in concept formation during unsupervised learning remains unclear and lacks a simple explanatory model. Furthermore, previous theoretical efforts focused on random models of neural networks Agliari et al. (2012); Sollich et al. (2014); Tubiana and Monasson (2017), where the distribution of synaptic values is prefixed. Here, we propose a simple model of unsupervised learning with zero synapses, in which a two-layered neural network learns the synaptic values from sensory inputs and which is thus more practical than random models. The bottom layer is composed of visible neurons receiving sensory inputs, while the top layer contains only one hidden neuron responding to specific features in the inputs. The bottom layer is connected to the top layer by synapses; there are no lateral connections within the bottom layer. This kind of neural network is called a one-bit restricted Boltzmann machine (RBM) Smolensky (1986); Huang and Toyoizumi (2016b). Binary synapses have also been experimentally observed in real neural circuits Petersen et al. (1998); O’Connor et al. (2005). The one-bit RBM with binary synapses has been studied as a toy model of unsupervised feature learning Huang and Toyoizumi (2016b), and is analytically tractable at the mean-field level Huang (2017).
To study the computational role of zero synapses, we model the connections between the bottom and top layers of the one-bit RBM by ternary synapses, which take discrete values in $\{-1,0,+1\}$. Given a sensory input, the ternary synaptic connections provide a hidden feature representation of the input. One configuration of ternary synapses forms a feature map and is also called the receptive field of the hidden neuron at the top layer.
II. Problem setting and mean-field method
In this study, we use the one-bit RBM defined above to learn specific features in sensory inputs, which are raw unlabeled data. The machine is required to internally create concepts about the inputs; this process is thus called unsupervised learning. Here, the sensory inputs are given by handwritten digits taken from the MNIST dataset Lecun et al. (1998). Each image from this dataset has $28\times28$ pixels, specified by an Ising-like spin configuration $\boldsymbol{\sigma}\in\{-1,+1\}^N$, where $N=784$ is the input dimensionality. A collection of $M$ images is denoted by $\{\boldsymbol{\sigma}^a\}_{a=1}^{M}$. The number of synapses is the same as the input dimensionality. Synaptic values are characterized by $\boldsymbol{\xi}=(\xi_1,\ldots,\xi_N)$, where each component takes one of the ternary values. The one-bit RBM is thus described by the Boltzmann distribution $P(\boldsymbol{\sigma},s)\propto e^{\frac{\beta s}{\sqrt{N}}\sum_i\xi_i\sigma_i}$, where $s\in\{-1,+1\}$ denotes the activity of the hidden neuron, $\beta$ denotes an inverse temperature, and the synaptic strength is scaled by the factor $1/\sqrt{N}$ to ensure that the corresponding statistical mechanics model has an extensive free energy. After marginalization of the hidden activity, one obtains the distribution of the visible activity as:

  $P(\boldsymbol{\sigma}|\boldsymbol{\xi}) = \frac{2\cosh\left(\frac{\beta}{\sqrt{N}}\sum_i\xi_i\sigma_i\right)}{Z(\boldsymbol{\xi})}$.  (1)
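This marginalized distribution is easy to evaluate numerically. The following is a minimal sketch, assuming the standard one-bit RBM energy $-\frac{s}{\sqrt{N}}\sum_i\xi_i\sigma_i$; the function and variable names are ours, not the paper's:

```python
import numpy as np

def log_marginal(sigma, xi, beta=1.0):
    """Unnormalized log-probability of a visible configuration `sigma`
    under the one-bit RBM after summing out the single hidden unit
    s in {-1, +1}:  log sum_s exp(beta * s * xi.sigma / sqrt(N))
                  = log(2 cosh(beta * xi.sigma / sqrt(N))).
    """
    n = len(xi)
    field = beta * np.dot(xi, sigma) / np.sqrt(n)
    return np.logaddexp(field, -field)   # log(e^f + e^-f) = log(2 cosh f)

rng = np.random.default_rng(0)
n = 784                                  # 28 x 28 MNIST-sized input
xi = rng.choice([-1, 0, 1], size=n)      # ternary synapses
sigma = rng.choice([-1, 1], size=n)      # Ising-like visible spins
print(log_marginal(sigma, xi))
```

Because the hidden unit is summed out through an even function (cosh), the marginal is invariant under a global flip of the visible spins or of the synapses, a fact used later when discussing symmetry breaking.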
As an inference model, for any given input, the one-bit RBM has $3^N$ possible synaptic configurations to describe the sensory input. However, the machine must choose one of these potential candidates as the feature map. This process is naturally modeled by Bayes’ rule:

  $P(\boldsymbol{\xi}|\{\boldsymbol{\sigma}^a\}) = \frac{\prod_{a=1}^{M}P(\boldsymbol{\sigma}^a|\boldsymbol{\xi})}{Z} = \frac{1}{Z}\prod_{a=1}^{M}\cosh\left(\frac{\beta}{\sqrt{N}}\sum_i\xi_i\sigma_i^a\right)\prod_{a=1}^{M}\frac{2}{Z(\boldsymbol{\xi})}$,  (2)

where $Z$ is the partition function of the model, and a uniform prior for $\boldsymbol{\xi}$ is assumed. $\beta$ serves as an inverse-temperature-like parameter to control learning noise. The synaptic scaling also requires $N$ to be large, and under this condition, the last product in Eq. (2) becomes $\prod_i e^{-\gamma\xi_i^2}$ (up to a constant), where $\gamma=\alpha\beta^2/2$ with $\alpha=M/N$. It is clear from a Bayesian viewpoint that introducing zero synapses amounts to a Gaussian-like regularization, but with discrete support. The Bayesian method is able to capture uncertainty in the learned parameters (here, the synaptic values) and thus avoids over-fitting MacKay (1992). It may also reduce the necessary data size for learning Huang and Toyoizumi (2016b).
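The resulting posterior can be written as a short score function. Below is a hedged sketch, assuming the cosh likelihood of the one-bit RBM and the Gaussian-like penalty with $\gamma=\alpha\beta^2/2$; the exact prefactor of the penalty is our reconstruction, and all names are illustrative:

```python
import numpy as np

def log_posterior(xi, data, beta=1.0):
    """Log-posterior of ternary synapses xi (up to an additive constant),
    given M samples as rows of `data`.  The partition-function term is
    reduced, in the large-N limit, to the Gaussian-like penalty
    exp(-gamma * xi_i^2) with gamma = alpha * beta^2 / 2 and alpha = M/N
    (this reduction is an assumption of the sketch).
    """
    m_samples, n = data.shape
    gamma = (m_samples / n) * beta ** 2 / 2.0
    fields = beta * (data @ xi) / np.sqrt(n)
    log_like = np.sum(np.logaddexp(fields, -fields))   # sum_a log 2cosh(...)
    return log_like - gamma * np.sum(xi ** 2)

rng = np.random.default_rng(1)
M, N = 40, 100
data = rng.choice([-1, 1], size=(M, N))
xi = rng.choice([-1, 0, 1], size=N)
print(log_posterior(xi, data))
```

Note that the penalty vanishes exactly on zero synapses, which is why sparsity is favored whenever the likelihood gain of a non-zero synapse is small.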
In what follows, we compute the maximizer of the posterior marginals (MPM) estimator Nishimori (2001), $\hat{\xi}_i=\mathop{\rm argmax}_{\xi_i}P_i(\xi_i)$, where the data dependence of the probability is omitted. Hence, the task is to compute marginal probabilities, e.g., $P_i(\xi_i)=\sum_{\boldsymbol{\xi}\backslash\xi_i}P(\boldsymbol{\xi}|\{\boldsymbol{\sigma}^a\})$, which is a computationally hard problem due to the interaction among data constraints (the product over $a$ in Eq. (2)). However, by mapping the original model (Eq. (2)) onto a graphical model Huang and Toyoizumi (2016b), where data constraints and synaptic values are treated as factor (data) nodes and variable nodes respectively, one can estimate the marginal probability by running a message passing algorithm, as we explain below. The key assumption is that synapses on the graphical model are weakly correlated, which is called the Bethe approximation Mézard and Montanari (2009) in physics. We first define a cavity probability $P_{i\to a}(\xi_i)$ of $\xi_i$ with data node $a$ removed. Under the weak correlation assumption, $P_{i\to a}(\xi_i)$ satisfies a self-consistent equation:

  $P_{i\to a}(\xi_i) \propto e^{-\gamma\xi_i^2}\prod_{b\in\partial i\backslash a}\mu_{b\to i}(\xi_i)$,  (3a)
  $\mu_{b\to i}(\xi_i) = \frac{1}{\mathcal{Z}_{b\to i}}\sum_{\{\xi_j:\,j\in\partial b\backslash i\}}\prod_{j\in\partial b\backslash i}P_{j\to b}(\xi_j)\cosh\left(\frac{\beta}{\sqrt{N}}\sum_{j\in\partial b}\xi_j\sigma_j^b\right)$,  (3b)

where $\mathcal{Z}_{b\to i}$ is a normalization constant, $\partial i\backslash a$ denotes the neighbors of feature node $i$ except data node $a$, $\partial b\backslash i$ denotes the neighbors of data node $b$ except feature node $i$, and the auxiliary quantity $\mu_{b\to i}(\xi_i)$ indicates the probability contribution from data node $b$ to feature node $i$ given the value of $\xi_i$ Mézard and Montanari (2009). The products in Eq. (3) stem from the weak correlation assumption.
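The structure of this cavity update can be made concrete by brute force for a toy-sized system. The sketch below enumerates all ternary assignments explicitly (feasible only for tiny $N$) and omits the regularization prior for brevity; in the paper the enumeration is replaced by a central-limit (Gaussian) approximation, so this code only illustrates the message-passing structure, not the large-$N$ algorithm:

```python
import itertools
import numpy as np

VALS = (-1, 0, 1)

def bp_sweep(P, data, beta=1.0):
    """One sweep of the cavity (belief-propagation) update of Eq. (3) by
    exact enumeration.  P[i, a] is the cavity distribution P_{i->a} over
    {-1, 0, +1}; each data factor is cosh(beta * xi.sigma_a / sqrt(N)).
    """
    M, N = data.shape
    mu = np.ones((M, N, 3))                    # data-to-feature messages
    for a in range(M):
        for i in range(N):
            others = [j for j in range(N) if j != i]
            for k, xi_i in enumerate(VALS):
                total = 0.0
                for assign in itertools.product(VALS, repeat=N - 1):
                    weight = 1.0
                    s = xi_i * data[a, i]      # running weighted input sum
                    for j, xi_j in zip(others, assign):
                        weight *= P[j, a, VALS.index(xi_j)]
                        s += xi_j * data[a, j]
                    total += weight * np.cosh(beta * s / np.sqrt(N))
                mu[a, i, k] = total
            mu[a, i] /= mu[a, i].sum()
    # feature-to-data update: P_{i->a} proportional to prod over b != a
    newP = np.ones((N, M, 3))
    for i in range(N):
        for a in range(M):
            for b in range(M):
                if b != a:
                    newP[i, a] *= mu[b, i]
            newP[i, a] /= newP[i, a].sum()
    return newP

rng = np.random.default_rng(3)
M, N = 3, 4
data = rng.choice([-1, 1], size=(M, N))
P0 = np.full((N, M, 3), 1.0 / 3.0)             # uniform initial messages
P1 = bp_sweep(P0, data)
```

Starting from uniform messages, one sweep leaves the $\xi_i=+1$ and $\xi_i=-1$ components equal, which is exactly the unpolarized (symmetric) fixed point discussed below.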
In the thermodynamic limit, the sum inside the hyperbolic cosine function in Eq. (3b), excluding the term containing $\xi_i$, obeys a normal distribution with mean $G_{b\to i}=\frac{\beta}{\sqrt{N}}\sum_{j\in\partial b\backslash i}\sigma_j^b m_{j\to b}$ and variance $\Xi_{b\to i}^2=\frac{\beta^2}{N}\sum_{j\in\partial b\backslash i}\left(q_{j\to b}-m_{j\to b}^2\right)$ Huang and Toyoizumi (2015), where the cavity magnetization is defined as $m_{j\to b}=\sum_{\xi_j}\xi_j P_{j\to b}(\xi_j)$, while the second moment of the feature component is defined by $q_{j\to b}=\sum_{\xi_j}\xi_j^2 P_{j\to b}(\xi_j)$. Thus the intractable sum over all $\xi_j$ ($j\in\partial b\backslash i$) can be replaced by an integral over the normal distribution. Furthermore, because $\xi_i$ is a ternary variable, $P_{i\to a}(\xi_i)$ can be parametrized by cavity fields $H_{i\to a}$ and $\Lambda_{i\to a}$, as $P_{i\to a}(\xi_i)\propto e^{H_{i\to a}\xi_i+\Lambda_{i\to a}\xi_i^2}$. Combining this representation with Eq. (3), we have the following iterative learning equations:

  $m_{i\to a} = q_{i\to a}\tanh\left(H_{i\to a}\right)$,  (4a)
  $u_{b\to i} = \tanh^{-1}\left[\tanh\left(\frac{\beta\sigma_i^b}{\sqrt{N}}\right)\tanh\left(G_{b\to i}\right)\right]$,  (4b)

where $H_{i\to a}=\sum_{b\in\partial i\backslash a}u_{b\to i}$, and $q_{i\to a}=\frac{2e^{\Lambda_{i\to a}}\cosh(H_{i\to a})}{2e^{\Lambda_{i\to a}}\cosh(H_{i\to a})+1}$. $m_{i\to a}$ can be interpreted as the message passing from feature node $i$ to data node $a$, while $u_{b\to i}$ can be interpreted as the message passing from data node $b$ to feature node $i$, depending on $G_{b\to i}$. Note that the prefactor $q_{i\to a}$ in Eq. (4a) is the cavity probability of non-zero synapses, i.e., $q_{i\to a}=P_{i\to a}(+1)+P_{i\to a}(-1)$, and this is also the second moment according to the definition. In this sense, the sparsity of synapses is described by a single parameter $q_{i\to a}$, where $1-q_{i\to a}$ gives the probability of a zero synapse. The potential feature (synaptic configuration) is inferred by computing the full (non-cavity) quantities $m_i$ as well as $q_i$, in which the magnetization is related to $q_i$ via $m_i=q_i\tanh(H_i)$.
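The two-field parametrization of a ternary synapse mentioned above can be made concrete. The sketch below computes the magnetization and second moment from the two fields; the symbols `h` (linear field) and `lam` (quadratic field) are our placeholders for the paper's notation:

```python
import numpy as np

def ternary_moments(h, lam):
    """Moments of a ternary synapse xi in {-1, 0, +1} with cavity
    distribution P(xi) proportional to exp(h*xi + lam*xi^2).
    Returns (m, q): magnetization <xi> and second moment <xi^2>, the
    latter equal to the probability that the synapse is non-zero.
    """
    w = 2.0 * np.exp(lam)
    z = w * np.cosh(h) + 1.0        # normalization over {-1, 0, +1}
    m = w * np.sinh(h) / z
    q = w * np.cosh(h) / z
    return m, q

m, q = ternary_moments(0.7, -0.3)
print(m, q, 1.0 - q)                # 1 - q is the zero-synapse probability
```

Two checks follow directly from this parametrization: the magnetization always satisfies $m=q\tanh(h)$, so the probability of non-zero synapses appears as the prefactor of the magnetization; and a strongly negative quadratic field drives $q\to 0$, i.e., a silent synapse.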
If the weak correlation assumption is self-consistent, then starting from randomly initialized messages, the learning equations converge to a fixed point corresponding to a thermodynamically dominant minimum of the Bethe free energy function Mézard and Montanari (2009). In what follows, we study how the learned feature map and the fraction of zero synapses change with the data size. In particular, we focus on when the machine develops an internal concept about the input handwritten digits and on the computational role of zero synapses in feature selectivity and receptive field formation.
We use the above mean-field theory to study unsupervised feature learning with zero synapses; the model parameters are fixed across the following simulations unless otherwise stated. For simplicity, we consider only two digit classes, because other combinations of two different digits yield similar results. We first study how the receptive field of the hidden neuron develops during learning, as the number of training images increases. As shown in Fig. 1 (a), when the data is severely scarce, there is no apparent structure in the feature map. When the number of training images increases beyond a critical size, an intrinsically structured feature map starts to develop. Nevertheless, there is still a large fraction of zero synapses. As learning proceeds, the intrinsic structure concentrates more on the center of the feature map, indicating that the machine has already created an internal perception of the external stimuli. This kind of perception has been shown to have an excellent discriminative capability on different stimuli by a precision-recall analysis Huang and Toyoizumi (2016b). This is because the distribution of the weighted sum of inputs the hidden neuron receives develops two well-separated peaks for the two different digits.
We then study the computational role of zero synapses. As shown in Fig. 1 (b), the sparsity level of synapses decreases with the data size. Around some critical data size, the sparsity decreases abruptly, suggesting that a structured feature map is beginning to develop. Here, learning indeed causes the fraction of zero synapses to decrease Brunel et al. (2004), since some zero synapses are required to adopt non-zero values to capture characteristics of the input data. When the data size is further increased, the feature map is refined, and the sparsity decreases more slowly than around the critical region. At large data sizes, a small fraction of zero synapses is still maintained. The zero synapses at this stage seem to form an approximate boundary between active and inactive regions in the feature map (see the last feature map in Fig. 1 (a)). Therefore, the zero synapses behave like contour detectors. This effect is predicted by our model, but its neurobiological counterpart is still unclear and deserves to be tested in future experiments on feature learning.
In particular, the monotonic behavior of the sparsity level of synapses is intimately related to the overall strength of the cavity messages. The model originally has a $Z_2$ symmetry, since Eq. (2) is invariant under the transformation $\boldsymbol{\xi}\to-\boldsymbol{\xi}$. This symmetry can be spontaneously broken, as indicated by the growth of the message strength (Fig. 1 (b)). Around some critical data size, the cavity messages start to polarize, no longer maintaining trivial (null) values, and this is accompanied by a rapid decrease in the number of zero synapses. Moreover, the asymptotic behavior of the message strength in the small-strength limit can be derived analytically in terms of the image statistics; this asymptotic result captures well the trend of the message strength when it is small (Fig. 1 (b)).
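The $Z_2$ invariance underlying this spontaneous symmetry breaking can be verified by brute force for a toy-sized system. The sketch below is our own construction, assuming the cosh likelihood with a Gaussian-like penalty on $\xi_i^2$ (the exact penalty prefactor is an assumption): every synaptic configuration and its global flip receive identical posterior weight, so the exact marginal magnetizations vanish, and polarized solutions can only appear by breaking the symmetry:

```python
import itertools
import numpy as np

def posterior_weights(data, beta=1.0):
    """Exact (brute-force) posterior over all 3^N ternary synapse
    configurations for tiny N, using the cosh likelihood and a
    Gaussian-like penalty with gamma = (M/N) * beta^2 / 2."""
    m_samples, n = data.shape
    gamma = (m_samples / n) * beta ** 2 / 2.0
    configs = list(itertools.product([-1, 0, 1], repeat=n))
    logw = []
    for xi in configs:
        xi_arr = np.array(xi)
        f = beta * (data @ xi_arr) / np.sqrt(n)
        logw.append(np.sum(np.logaddexp(f, -f)) - gamma * np.sum(xi_arr ** 2))
    w = np.exp(np.array(logw))
    return configs, w / w.sum()

rng = np.random.default_rng(2)
data = rng.choice([-1, 1], size=(6, 4))        # M = 6 samples, N = 4 synapses
configs, w = posterior_weights(data)
lookup = dict(zip(configs, w))
# every configuration and its global flip carry identical posterior weight
for xi in configs:
    assert np.isclose(lookup[xi], lookup[tuple(-x for x in xi)])
```

Because the exact posterior is symmetric, any non-zero polarization reported by the message-passing algorithm reflects a broken-symmetry fixed point selected by the random initialization, not a bias of the posterior itself.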
Next, we study how the machine creates a perception of a single digit as learning proceeds. Receptive field formation is displayed in Fig. 2. Around a first critical data size, a rough structure of the receptive field starts to emerge from the learning process, and the structure becomes more apparent as more data are added. Meanwhile, the fraction of zero synapses decreases. Some of the synapses become active, while others become inactive, refining the developing receptive field or feature map. The belief about the stimulus image is continuously updated with more sensory inputs. At larger data sizes, a clear concept of the digit is created by the unsupervised learning, via combining likelihood and prior (see Eq. (2)). Interestingly, a very small fraction of zero synapses remain and serve as contour detectors: these zero synapses specify the boundary between active and inactive regions in the feature map.
Next, we study the effect of the inverse temperature $\beta$ on receptive field formation. $\beta$ can be thought of as a scalar tuning the global contrast level of the input image Orban et al. (2016). By increasing $\beta$, one observes a qualitative change of the feature map (Fig. 3 (a)), from an active-synapses-dominated phase (in the center of the feature map) at small $\beta$ to a zero-synapses-dominated phase at high $\beta$. Surprisingly, the zero-synapses-dominated phase still maintains the discriminative capability to distinguish different stimuli. Note that, at large $\beta$, the free energy ceases to be extensive, as can be seen from the last product of the second equality in Eq. (2). The qualitative change results from the competition between the data constraints and the biases introduced by zero synapses (Eq. (2)). The critical $\beta$ is determined as the value at which the sparsity ceases to decrease and starts to increase. As observed in Fig. 3 (b), the critical $\beta$ decreases as the data size increases.
Finally, we apply the computational framework to model retinal neural activity. We study a dataset composed of spike patterns of retinal ganglion cells, measured during a natural-movie-stimulus experiment on the salamander retina (data courtesy of Michael J. Berry Marre et al. (2012)). The retina is an early visual system performing decorrelation of redundant visual inputs Pitkow and Meister (2012). Downstream brain areas may directly model the structure of population activity from an upstream area (such as the retina), without any reference to the external sensory inputs Loback et al. (2016). Thus it is important to test our theory on this kind of unsupervised learning of retinal neural activity. In Fig. 4, we observe behavior of the synaptic sparsity similar to that found in learning the handwritten digits dataset. The learned feature map undergoes a spontaneous symmetry breaking at some critical data size, where the sparsity of synapses changes rapidly as well. After the spontaneous symmetry breaking, the feature map has two possible phases: the synapses are either all-active or all-inactive, and the fraction of zero synapses becomes nearly zero. The hidden neuron in our model can be thought of as a unit in a downstream circuit along the ventral visual pathway, and the polarization of its receptive field does not show intrinsic structures similar to those observed in learning the handwritten digits dataset. This may be because the retinal circuit sits at the bottom level of the visual hierarchy, while a concept of the visual input can only be formed at higher levels of the cortical hierarchy DiCarlo et al. (2012).
In conclusion, we have built a physics model of sparse unsupervised feature learning based on the one-bit RBM, in which the sparseness of the synaptic connectivity is automatically learned from noisy data. A rapid decrease in the number of zero synapses signals concept formation in the neural network, and the remaining zero synapses refine the learned concept by serving as contour detectors. In addition, zero synapses are sensitive to the contrast level of the sensory inputs. These predictions may guide future neurobiological experiments. In particular, the fact that the number of zero synapses acts as an indicator of concept formation is intimately related to the spontaneous symmetry breaking in the model. These findings may also have implications for deep neuromorphic computation with discrete synapses Ardakani et al. (2016).
It would be very interesting, yet challenging, to generalize the current framework to neural networks with multiple hidden neurons, and further to hierarchical multi-layered architectures.
Previous studies showed that parallel retrieval of memories is possible via a random dilution of connections in random RBMs Agliari et al. (2012); Sollich et al. (2014), which may be connected to our current findings, in the sense that zero synapses offer the possibility of simultaneously recalling multiple patterns. Furthermore, our findings on unsupervised learning with zero synapses are consistent with the results reported in Brunel et al. (2004), where supervised learning in a perceptron model of cerebellar Purkinje cells was studied. An intuitive explanation is that learning stretches the synaptic-weight distribution, pushing synapses towards their limit values (either zero in the perceptron model Brunel et al. (2004) or $\pm1$ here) (Nicolas Brunel, personal communication). A recent work derived the paramagnetic-to-spin-glass transition line in a generalized RBM with spin and weight priors interpolating between Gaussian and binary distributions Barra et al. (2017), which may connect to our results on spontaneous symmetry breaking and concept formation in real-data analysis.
I thank Taro Toyoizumi for his comments on silent synapses, Jack Raymond and James Humble for carefully reading the manuscript, and Adriano Barra for drawing my attention to his previous works. This research was supported by AMED under Grant Number JP15km0908001.
- Lennie (2003) P. Lennie, Current Biology 13, 493 (2003).
- Huang and Toyoizumi (2016a) H. Huang and T. Toyoizumi, Phys. Rev. E 93, 062416 (2016a).
- Olshausen and Field (1996) B. A. Olshausen and D. J. Field, Nature 381, 607 (1996).
- Lee et al. (2008) H. Lee, C. Ekanadham, and A. Y. Ng, in Advances in Neural Information Processing Systems 20, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis (Curran Associates, Inc., 2008), pp. 873–880.
- Brunel et al. (2004) N. Brunel, V. Hakim, P. Isope, J.-P. Nadal, and B. Barbour, Neuron 43, 745 (2004).
- Barbour et al. (2007) B. Barbour, N. Brunel, V. Hakim, and J.-P. Nadal, Trends in Neurosciences 30, 622 (2007).
- Erhan et al. (2010) D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, J. Mach. Learn. Res. 11, 625 (2010).
- Agliari et al. (2012) E. Agliari, A. Barra, A. Galluzzi, F. Guerra, and F. Moauro, Phys. Rev. Lett. 109, 268101 (2012).
- Sollich et al. (2014) P. Sollich, D. Tantari, A. Annibale, and A. Barra, Phys. Rev. Lett. 113, 238106 (2014).
- Tubiana and Monasson (2017) J. Tubiana and R. Monasson, Phys. Rev. Lett. 118, 138301 (2017).
- Smolensky (1986) P. Smolensky (MIT Press, Cambridge, MA, USA, 1986), chap. Information Processing in Dynamical Systems: Foundations of Harmony Theory, pp. 194–281.
- Huang and Toyoizumi (2016b) H. Huang and T. Toyoizumi, Phys. Rev. E 94, 062310 (2016b).
- Petersen et al. (1998) C. C. H. Petersen, R. C. Malenka, R. A. Nicoll, and J. J. Hopfield, Proc. Nat. Acad. Sci. 95, 4732 (1998).
- O’Connor et al. (2005) D. H. O’Connor, G. M. Wittenberg, and S. S.-H. Wang, Proc. Nat. Acad. Sci. 102, 9679 (2005).
- Huang (2017) H. Huang, Journal of Statistical Mechanics: Theory and Experiment 2017, 053302 (2017).
- Lecun et al. (1998) Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, Proceedings of the IEEE 86, 2278 (1998).
- MacKay (1992) D. J. C. MacKay, Neural Comput. 4, 448 (1992).
- Nishimori (2001) H. Nishimori, Statistical Physics of Spin Glasses and Information Processing: An Introduction (Oxford University Press, Oxford, 2001).
- Mézard and Montanari (2009) M. Mézard and A. Montanari, Information, Physics, and Computation (Oxford University Press, Oxford, 2009).
- Huang and Toyoizumi (2015) H. Huang and T. Toyoizumi, Phys. Rev. E 91, 050101 (2015).
- Orban et al. (2016) G. Orban, P. Berkes, J. Fiser, and M. Lengyel, Neuron 92, 530 (2016).
- Marre et al. (2012) O. Marre, D. Amodei, N. Deshmukh, K. Sadeghi, F. Soo, T. E. Holy, and M. J. Berry, J. Neurosci. 32, 14859 (2012).
- Pitkow and Meister (2012) X. Pitkow and M. Meister, Nat. Neurosci. 15, 628 (2012).
- Loback et al. (2016) A. R. Loback, J. S. Prentice, M. L. Ioffe, and M. J. Berry, II, ArXiv e-prints 1610.06886 (2016).
- DiCarlo et al. (2012) J. J. DiCarlo, D. Zoccolan, and N. C. Rust, Neuron 73, 415 (2012).
- Ardakani et al. (2016) A. Ardakani, C. Condo, and W. J. Gross, ArXiv e-prints 1611.01427 (2016).
- Nicolas Brunel, personal communication.
- Barra et al. (2017) A. Barra, G. Genovese, P. Sollich, and D. Tantari, ArXiv e-prints: 1702.05882 (2017).