A green-striped elephant! Probably no one has ever seen such a thing, which is no surprise. What is surprising is our ability to imagine one with almost no trouble. Humans are not only adept at recognizing what class an input instance belongs to (i.e., the classification task), but, more remarkably, they can imagine (i.e., generate) plausible instances of a desired class with ease when prompted. In fact, humans can generate instances of a desired class, say, elephant, that they have never encountered before, such as a green-striped elephant. (In counterfactual terms: had a human seen a green-striped elephant, they would still have recognized it as an elephant. Geoffrey Hinton once told a similar story about a pink elephant!) In this sense, humans' generative capacity goes beyond merely retrieving from memory. In computational terms, the notion of generating examples from a desired class can be formalized in terms of sampling
Cascade-Correlation Neural Networks (CCNNs; Fahlman & Lebiere, 1989) are a well-known class of discriminative (as opposed to generative) models that have been successful in simulating a variety of phenomena in the developmental literature, e.g., infant learning of word-stress patterns in artificial languages (Shultz & Bale, 2006), syllable boundaries (Shultz & Bale, 2006), and visual concepts (Shultz, 2006). They have also been successful in capturing important developmental regularities in a variety of tasks, e.g., the balance-scale task (Shultz et al., 1994; Shultz & Takane, 2007), transitivity (Shultz & Vogel, 2004), conservation (Shultz, 1998), and seriation (Mareschal & Shultz, 1999). Moreover, CCNNs exhibit several similarities with known brain functions: distributed representation, self-organization of network topology, layered hierarchical topologies, both cascaded and direct pathways, an S-shaped activation function, activation modulation via integration of neural inputs, long-term potentiation, growth at the newer end of the network via synaptogenesis or neurogenesis, pruning, and weight freezing (Westermann et al., 2006). Nonetheless, by virtue of being deterministic and discriminative, CCNNs have so far lacked the capacity to probabilistically generate examples from a category of interest. This ability can be used, e.g., to diagnose what the network knows at various points during training, particularly when dealing with high-dimensional input spaces.
In this work, we propose a framework that transforms CCNNs into probabilistic generative models, thereby enabling CCNNs to generate samples from a category. Our proposed framework is based on a Markov Chain Monte Carlo (MCMC) method called the Metropolis-Adjusted Langevin (MAL) algorithm, which employs the gradient of the target distribution to guide its exploration towards regions of high probability, thereby significantly reducing the undesirable random walk often observed at the beginning of an MCMC run (a.k.a. the burn-in period). MCMC methods are a family of algorithms for sampling from a desired probability distribution, and have been successful in simulating important aspects of a wide range of cognitive phenomena, e.g., temporal dynamics of multistable perception (Gershman et al., 2012; Moreno-Bote et al., 2011), developmental changes in cognition (Bonawitz, Denison, Griffiths, & Gopnik, 2014), category learning (Sanborn et al., 2010), causal reasoning in children (Bonawitz, Denison, Gopnik, & Griffiths, 2014), and accounts of many cognitive biases (Dasgupta et al., 2016).
Furthermore, work in theoretical neuroscience has shed light on possible mechanisms by which MCMC methods could be realized in generic cortical circuits (Buesing et al., 2011; Moreno-Bote et al., 2011; Pecevski et al., 2011; Gershman & Beck, 2016). In particular, Moreno-Bote et al. (2011) showed how an attractor neural network implementing MAL could account for multistable perception of drifting gratings, and Savin & Deneve (2014) showed how a network of leaky integrate-and-fire neurons could implement MAL in a biologically realistic manner.
2 Cascade-Correlation Neural Networks
CCNNs are a special class of deterministic artificial neural networks that construct their topology autonomously, an appealing property for simulating developmental phenomena (Westermann et al., 2006) and other cases where networks need to be constructed. CCNN training starts with a two-layer network (i.e., the input and the output layer) with no hidden units, and proceeds by recruiting hidden units one at a time, as needed. Each new hidden unit is trained to be maximally correlated with the residual error in the network built so far, and is recruited into a hidden layer of its own, giving rise to a deep network with as many hidden layers as recruited hidden units. CCNNs use sum-of-squared error as the objective function, and typically use symmetric sigmoidal activation functions with range to for hidden and output units. (Fahlman & Lebiere (1989) also suggest linear, Gaussian, and asymmetric sigmoidal (with range to ) activation functions as alternatives. Our proposed framework can be straightforwardly adapted to handle all such activation functions.) Some variants have been proposed for CCNNs, e.g., Sibling-Descendant Cascade-Correlation (SDCC; Baluja & Fahlman, 1994) and Knowledge-Based Cascade-Correlation (KBCC; Shultz & Rivest, 2001). Although in this work we specifically focus on standard CCNNs, our proposed framework can handle SDCC and KBCC as well.
3 The Metropolis-Adjusted Langevin Algorithm
MAL (Roberts & Tweedie, 1996) is a special type of MCMC method that employs the gradient of the target distribution to guide its exploration towards regions of high probability, thereby reducing the burn-in period. More specifically, MAL combines two concepts: Langevin dynamics (a random walk guided by the gradient of the target distribution) and the Metropolis-Hastings algorithm (an accept/reject mechanism for generating a sequence of samples whose distribution converges asymptotically to the target distribution).
We denote random variables with small bold-faced letters, random vectors by capital bold-faced letters, and their corresponding realizations by non-bold-faced letters. The MAL algorithm is outlined in Algorithm 1, wherein denotes the target probability distribution, is a positive real-valued parameter specifying the time-step used in the Euler-Maruyama approximation of the underlying Langevin dynamics, denotes the number of samples generated by the MAL algorithm, denotes the proposal distribution (a.k.a. transition kernel), denotes the multivariate normal distribution with mean vector and covariance matrix , and, finally, denotes the identity matrix. The sequence of samples generated by the MAL algorithm is guaranteed to converge in distribution to the target distribution (Robert & Casella, 2013). It is worth noting that work in theoretical neuroscience has shown that MAL, outlined in Algorithm 1, could be implemented in a neurally plausible manner (Savin & Deneve, 2014; Moreno-Bote et al., 2011). In the following section, we propose a target distribution , allowing CCNNs to generate samples from a category of interest.
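The combination of a gradient-guided Gaussian proposal with a Metropolis-Hastings accept/reject step can be sketched as follows. This is a minimal illustration under stated assumptions, not a transcription of Algorithm 1: the function name `mala`, its arguments, and the proposal with mean `x + tau * gradient` and covariance `2 * tau * I` (the usual Euler-Maruyama discretization of Langevin dynamics) are choices made for this sketch.

```python
import math
import random


def mala(log_target, grad_log_target, x0, tau, n_samples, rng=None):
    """Sketch of a Metropolis-adjusted Langevin sampler.

    log_target: unnormalized log-density, takes a list of floats.
    grad_log_target: gradient of log_target, returns a list of floats.
    tau: time step of the Euler-Maruyama discretization (assumed name).
    Returns (samples, acceptance_rate).
    """
    rng = rng or random.Random(0)

    def log_q(x_to, x_from):
        # Log-density, up to a constant, of N(x_from + tau * grad, 2 * tau * I).
        drift = grad_log_target(x_from)
        sq = sum((t - f - tau * d) ** 2 for t, f, d in zip(x_to, x_from, drift))
        return -sq / (4.0 * tau)

    x = list(x0)
    samples, accepted = [], 0
    for _ in range(n_samples):
        drift = grad_log_target(x)
        # Langevin proposal: gradient drift plus Gaussian noise of variance 2 * tau.
        proposal = [xi + tau * d + math.sqrt(2.0 * tau) * rng.gauss(0.0, 1.0)
                    for xi, d in zip(x, drift)]
        # Metropolis-Hastings log acceptance ratio for an asymmetric proposal.
        log_alpha = (log_target(proposal) + log_q(x, proposal)
                     - log_target(x) - log_q(proposal, x))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = proposal
            accepted += 1
        samples.append(list(x))
    return samples, accepted / n_samples
```

As a quick check, calling `mala` with `log_target = lambda x: -0.5 * sum(v * v for v in x)` and `grad_log_target = lambda x: [-v for v in x]` draws from a standard normal; only the unnormalized log-density and its gradient are ever evaluated.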
4 The Proposed Framework
In what follows, we propose a framework that transforms CCNNs into probabilistic generative models, thereby enabling them to generate samples from a category of interest. The proposed framework is based on the MAL algorithm given in Sec. 3. Let denote the input-output mapping learned by a CCNN at the end of the training phase, and denote the set of weights for a CCNN after training. (Formally, where and denote the set of values that input unit and output unit can take on, respectively.) Upon termination of training, presented with input , a CCNN outputs . Note that if a CCNN possesses multiple output units, the mapping will be a vector rather than a scalar. To convert a CCNN into a probabilistic generative model, we propose to use the MAL algorithm with its target distribution set as follows:
where denotes the -norm, is a damping factor, is the normalizing constant, and is a vector whose element corresponding to the desired class is (i.e., its element) and the rest of its elements are s. The intuition behind Eq. (1) can be articulated as follows: For an input instance belonging to the desired class (in counterfactual terms: had input instance been presented to the network, the network would have classified it in class ), the output of the network is expected to be close to in the -norm sense; in this light, Eq. (1) makes the likelihood of input instance inversely proportional to the exponential of that distance.
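Written out, a target distribution of the kind described here takes the following form. The symbol names are notational assumptions made for this sketch: $\beta$ for the damping factor, $Z$ for the normalizing constant, $L_j$ for the target vector of the desired class $j$, and $f(X; W^{*})$ for the mapping computed by the trained network. The squared $\ell_2$ norm is likewise assumed, in line with the description of the energy as the square of the prediction error.

```latex
p(X \mid Y = L_j) \;=\; \frac{1}{Z}\,
  \exp\!\Bigl(-\beta\,\bigl\lVert L_j - f(X; W^{*}) \bigr\rVert_2^{2}\Bigr),
\qquad
\log p(X \mid Y = L_j) \;=\; -\beta\,\bigl\lVert L_j - f(X; W^{*}) \bigr\rVert_2^{2} \;-\; \log Z .
```

Under this form, inputs whose network output lies close to $L_j$ receive high probability, and a larger $\beta$ concentrates the mass more tightly around them.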
For the reader familiar with probabilistic graphical models, the expression in Eq. (1) looks similar to the expression for the joint probability distribution of Markov random fields and probabilistic energy-based models, e.g., Restricted Boltzmann Machines and Deep Boltzmann Machines. However, there is a crucial distinction: The normalizing constant, the computation of which is intractable in general, renders learning in those models computationally intractable. (More specifically, it renders the computation of the gradient of the log-likelihood for those models intractable.) The appropriate way to interpret Eq. (1) is to see it as a Gibbs distribution for a non-probabilistic energy-based model whose energy is defined as the square of the prediction error (LeCun et al., 2006). Section 1.3 of LeCun et al. (2006) discusses the topic of Gibbs distributions for non-probabilistic energy-based models in the context of discriminative learning, computationally modeled by (i.e., to predict a class given an input), and raises the same issue that we highlighted above regarding the intractability of computing the normalizing constant in general. In sharp contrast to LeCun et al. (2006), our framework is proposed for the purpose of generating examples from a desired class, as evidenced by Eq. (1) being defined in terms of . Also crucially, the intractability of computing raises no issue for our proposed framework, due to an intriguing property of the MAL algorithm according to which the normalizing constant need not be computed at all. (The MAL algorithm inherits this property from the Metropolis-Hastings algorithm, which it uses as a subroutine.)
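The reason the normalizing constant never needs to be computed can be seen from the Metropolis-Hastings acceptance probability. In the worked step below, the notation is assumed for illustration: $\pi = \tilde{\pi}/Z$ is the target with unnormalized part $\tilde{\pi}$, $q$ is the proposal distribution, $X$ is the current state, and $X^{*}$ is the proposed state.

```latex
\alpha(X \to X^{*})
  = \min\left\{1,\;
      \frac{\pi(X^{*})\, q(X \mid X^{*})}{\pi(X)\, q(X^{*} \mid X)}
    \right\}
  = \min\left\{1,\;
      \frac{\tfrac{1}{Z}\,\tilde{\pi}(X^{*})\, q(X \mid X^{*})}
           {\tfrac{1}{Z}\,\tilde{\pi}(X)\, q(X^{*} \mid X)}
    \right\}
  = \min\left\{1,\;
      \frac{\tilde{\pi}(X^{*})\, q(X \mid X^{*})}{\tilde{\pi}(X)\, q(X^{*} \mid X)}
    \right\}
```

The factor $1/Z$ appears in both numerator and denominator and cancels, so only the unnormalized density is ever evaluated.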
Due to Line 4 of Algorithm 1, MAL's proposal distribution, , requires the computation of , which essentially involves the computation of (note that the gradient is operating on , and is merely treated as a set of fixed parameters). The multi-layer structure of a CCNN ensures that can be efficiently computed using backpropagation. Alternatively, in settings where CCNNs have a small number of input units (hence, the cardinality of is small), can be obtained by introducing a negligible perturbation to one component of input signal , dividing the resulting change in the network's outputs by the introduced perturbation, and repeating this process for all components of input signal . It is worth noting that although computing gradients through small perturbations would be computationally inefficient for training CCNNs, it is computationally efficient for generation, as the number of input units is typically much smaller than the number of weights in CCNNs (and in artificial neural networks in general). It is also crucial to note that the normalizing constant plays no role in the computation of .
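The perturbation-based alternative can be sketched as a forward-difference estimate. Everything named here is an assumption for illustration: `toy_network` is a hypothetical stand-in for a trained single-output network, `label` and `beta` stand for the desired output value and the damping factor, and the log-target is assumed to be `-beta * (label - f(x)) ** 2` up to an additive constant.

```python
import math


def input_gradient_fd(f, x, eps=1e-6):
    """Forward-difference estimate of the gradient of scalar f with respect to input x."""
    base = f(x)
    grad = []
    for k in range(len(x)):
        bumped = list(x)
        bumped[k] += eps  # perturb one input component at a time
        grad.append((f(bumped) - base) / eps)
    return grad


def grad_log_target_fd(f, x, label, beta, eps=1e-6):
    """Gradient of the assumed log-target -beta * (label - f(x))**2 via the chain rule."""
    residual = label - f(x)
    return [2.0 * beta * residual * g for g in input_gradient_fd(f, x, eps)]


def toy_network(x):
    """Hypothetical smooth two-input, one-output mapping (not a trained CCNN)."""
    return 0.5 * math.tanh(4.0 * (x[0] - 0.5) * (x[1] - 0.5))
```

Each gradient estimate costs one network evaluation per input component plus one baseline evaluation, which is why this route is attractive only when the input dimension is small.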
In this section we demonstrate the efficacy of our proposed framework through simulations. We focus on learning tasks that can be accomplished with two input units and one output unit. This permits visualization of the input-output space, which lies in . Note that our proposed framework can handle an arbitrary number of input and output units; this restriction is solely for ease of visualization.
5.1 Continuous-XOR Problem
In this subsection, we show how our proposed framework allows a CCNN, trained on the continuous-XOR classification task (see Fig. 1), to generate examples from a category of interest. The output unit has a symmetric sigmoidal activation function with range and . The training set consists of samples in the unit-square , paired with their corresponding labels. More specifically, the training set comprises all the ordered pairs starting from and going up to with equal steps of size , paired with their corresponding labels (i.e., for positive samples and for negative samples); see Fig. 1(top-left). After training, a CCNN with hidden layers is obtained whose input-output mapping, , is depicted in Fig. 1(top-right). (Due to the inherent randomness in CCNN construction, training could lead to networks with different structures. However, since in this work we are solely concerned with generating examples using CCNNs rather than with how well CCNNs could learn a given discriminative task, we arbitrarily pick a learned network. Note that our proposed framework can handle CCNNs with arbitrary structures; in that light, the choice of network is without loss of generality.)
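A grid-style training set of this kind can be sketched as follows. All numeric defaults are hypothetical placeholders rather than the values used in the simulations: the grid limits `lo` and `hi`, the `step`, the quadrant `boundary`, and the label values `pos` and `neg` are assumptions, as is the convention that a pattern is positive when exactly one coordinate lies above the boundary.

```python
def continuous_xor_grid(lo=0.1, hi=1.0, step=0.1, boundary=0.55, pos=0.5, neg=-0.5):
    """Build ((x1, x2), label) pairs on a regular grid with XOR-style labels.

    Every default value is a hypothetical placeholder chosen for illustration.
    """
    n = int(round((hi - lo) / step)) + 1
    # Round to suppress floating-point drift in the grid coordinates.
    coords = [round(lo + k * step, 10) for k in range(n)]
    data = []
    for x1 in coords:
        for x2 in coords:
            # Positive when exactly one coordinate is above the boundary.
            is_positive = (x1 < boundary) != (x2 < boundary)
            data.append(((x1, x2), pos if is_positive else neg))
    return data
```

With these placeholder defaults the grid has 100 patterns, split evenly between the two classes.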
Figure 1: A CCNN trained on the continuous-XOR classification task. Top-left: Training patterns. All the patterns in the gray quadrants are negative examples with label , and all the patterns in the white quadrants are positive examples with label . Red dotted lines depict the boundaries. Top-right: The input-output mapping, , learned by a CCNN, along with a colorbar. Bottom: The top-down view of the curve depicted in top-right, along with a colorbar.
Fig. 2 shows the efficacy of our proposed framework in enabling CCNNs to generate samples from a category of interest, under various choices for MAL parameter (see Algorithm 1) and damping factor (see Eq. (1)); generated samples are depicted by red dots. For the results shown in Fig. 2, the category of interest is the category of positive examples, i.e., the category of input patterns which, upon being presented to the (learned) network, would be classified as positive by the network. Because controls the size of the jump between consecutive proposals made by MAL, the following behavior is expected: For small (Fig. 2(a)), consecutive proposals are very close to one another, leading to a slow exploration of the input domain. As increases, bigger jumps are made by MAL (Fig. 2(b)). (Yet too large a is not good either, leading to a sparse and coarse-grained exploration of the input space. Some measures have been proposed in computational statistics for properly choosing ; cf. Roberts & Rosenthal (1998).) Parameter controls how severely deviations from the desired class label (here, ) should be penalized. The larger the parameter , the more severely such deviations are penalized and the less likely it becomes for MAL to make moves toward such regions of input space. The Acceptance Rate (AR), defined as the number of accepted moves divided by the total number of suggested moves, is also presented for the results shown in Fig. 2. Fig. 2(c) shows that for and , our proposed framework demonstrates a desirable performance: virtually all of the generated samples fall within the desired input regions (i.e., the regions associated with hot colors, signaling the closeness of the network's output to in those regions; see Fig. 1(bottom)), and the desired regions are adequately explored (i.e., all hot-colored input regions are visited and almost evenly explored).
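The qualitative effect of the step parameter on jump size and acceptance rate can be reproduced on a toy problem. The sketch below is an illustration only: it uses a one-dimensional standard-normal target rather than the CCNN-based target, and the function name `mala_1d_stats` and the parameter name `tau` are assumptions.

```python
import math
import random


def mala_1d_stats(tau, n_steps=2000, seed=0):
    """Run MAL on a toy standard-normal target.

    Returns (acceptance_rate, mean_squared_jump), where the acceptance rate is
    the number of accepted moves divided by the number of proposed moves.
    """
    rng = random.Random(seed)

    def log_pi(x):
        return -0.5 * x * x  # toy unnormalized log-target

    def grad(x):
        return -x

    def log_q(x_to, x_from):
        # Log-density, up to a constant, of N(x_from + tau * grad, 2 * tau).
        return -((x_to - x_from - tau * grad(x_from)) ** 2) / (4.0 * tau)

    x, accepted, sq_jump = 0.0, 0, 0.0
    for _ in range(n_steps):
        y = x + tau * grad(x) + math.sqrt(2.0 * tau) * rng.gauss(0.0, 1.0)
        log_alpha = log_pi(y) + log_q(x, y) - log_pi(x) - log_q(y, x)
        if rng.random() < math.exp(min(0.0, log_alpha)):
            sq_jump += (y - x) ** 2
            x = y
            accepted += 1
    return accepted / n_steps, sq_jump / n_steps
```

Comparing, say, `mala_1d_stats(0.01)` with `mala_1d_stats(1.5)` illustrates the trade-off: the small step is accepted almost always but moves very little per iteration, while the large step moves further per accepted proposal at the cost of more rejections.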
Results shown in Fig. 2 depict all the first samples generated by MAL, without excluding the so-called burn-in period. In that light, the result shown in Fig. 2(c) nicely demonstrates how MAL, by directing its proposals along the gradient and therefore moving toward regions of high likelihood, could alleviate the need for discarding a (potentially large) number of samples generated at the beginning of an MCMC run, which are assumed to be unrepresentative of the equilibrium state (a.k.a. the burn-in period). Fig. 3 shows the performance of our framework in enabling the learned CCNN to generate from the category of negative examples, with and .
5.2 Two-Spirals Problem
Next, we show how our proposed framework allows a CCNN, trained on the famously difficult Two-Spirals classification task (Fig. 4), to generate examples from a category of interest. The output unit has a symmetric sigmoidal activation function with range and . The training set consists of samples ( samples per spiral), in the square , paired with their corresponding labels ( and for positive and negative samples, respectively). The training patterns are shown in Fig. 4(top-left); cf. Chalup & Wiklendt (2007) for details. After training, a CCNN with hidden layers is obtained whose input-output mapping, , is depicted in Fig. 4(top-right).
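For readers unfamiliar with the task, a two-spirals training set can be generated along the following lines. This sketch follows a commonly cited construction of the classic benchmark and may differ from the variant used in these simulations: the number of points per spiral, the radius and angle formulas, and the label values `pos` and `neg` are all assumptions.

```python
import math


def two_spirals(points_per_spiral=97, pos=0.5, neg=-0.5):
    """Generate ((x, y), label) pairs for two interleaved spirals.

    Parameter defaults follow a commonly cited benchmark construction and are
    assumptions here; the label values are hypothetical placeholders.
    """
    data = []
    for i in range(points_per_spiral):
        angle = i * math.pi / 16.0
        radius = 6.5 * (104 - i) / 104.0
        x, y = radius * math.sin(angle), radius * math.cos(angle)
        data.append(((x, y), pos))    # first spiral
        data.append(((-x, -y), neg))  # second spiral: point reflection of the first
    return data
```

The second spiral is the point reflection of the first through the origin, which is what makes the two classes so tightly interleaved.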
Fig. 5(top) and Fig. 5(bottom) show the efficacy of our proposed framework in enabling CCNNs to generate samples from the positive and negative categories, respectively. The patterns of behavior observed in Sec. 5.1 due to increasing or decreasing and are observed here as well, but such results are omitted for lack of space. Note that the results shown in Fig. 5 depict all the first samples generated by MAL, without excluding the burn-in period. In that light, the results shown in Fig. 5(top) and Fig. 5(bottom) demonstrate once again the efficacy of MAL in alleviating the need for discarding a (potentially large) number of samples generated at the beginning of an MCMC run.
Interestingly, our proposed framework also allows CCNNs to generate samples subject to some forms of constraints. For example, Fig. 6 demonstrates how our proposed framework enables a CCNN, trained on the continuous-XOR classification task (see Sec. 5.1), to generate examples from the positive category, under the following constraint: Generated samples must lie on the curve . To generate samples from the positive category while satisfying the said constraint, MAL adopts our proposed target distribution given in Eq. (1), and treats as an independent and as a dependent variable.
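Treating one input as independent and the other as dependent amounts to restricting the two-dimensional target to a curve and sampling along it. A minimal sketch of that restriction follows; the helper names `on_curve` and `derivative_fd` are hypothetical, and `curve` is a placeholder for whatever constraint is imposed.

```python
def on_curve(log_target, curve):
    """Restrict a two-input log-target to the curve x2 = curve(x1)."""
    def log_target_1d(x1):
        # The dependent input is determined by the independent one.
        return log_target([x1, curve(x1)])
    return log_target_1d


def derivative_fd(fn, x, eps=1e-6):
    """Central-difference derivative of a scalar function of one variable."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)
```

The resulting one-dimensional log-target and its derivative can then be handed to a one-dimensional MAL sampler, and each sampled value of the independent input is mapped back to a point on the curve.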
6 General Discussion
Although we discussed our proposed framework in the context of CCNNs, it can be straightforwardly extended to handle other kinds of artificial neural networks, e.g., multi-layer perceptrons and convolutional neural networks. Furthermore, our proposed framework, together with recent work in theoretical neuroscience showing possible neurally plausible implementations of MAL (Savin & Deneve, 2014; Moreno-Bote et al., 2011), suggests an intriguing modular hypothesis according to which generation could result from two separate modules interacting with each other (in our case, a CCNN and a neural network implementing MAL). This hypothesis yields the following prediction: There should be some brain impairments which lead to a marked decline in a subject's performance in generative tasks (i.e., tasks involving imagery, or imaginative tasks in general) but leave the subject's learning abilities (nearly) intact. Studies on the learning and imaginative abilities of hippocampal amnesic patients already provide some supporting evidence for this idea (Hassabis et al., 2007; Spiers et al., 2001; Brooks & Baddeley, 1976).
According to Line 4 of Algorithm 1, to generate the sample, MAL requires access to fine-tuned Gaussian noise with mean for its proposal distribution . Recently, Savin & Deneve (2014) showed how a network of leaky integrate-and-fire neurons could implement MAL in a neurally plausible manner. However, as Gershman & Beck (2016) point out, Savin and Deneve leave unanswered what the source of that fine-tuned Gaussian noise could be. Our proposed framework may provide an explanation, not for the source of the Gaussian noise, but for its fine-tuned mean value. According to our modular account, the main component of the mean value, which is , may come from another module (in our case a CCNN) which has learned some input-output mapping , based on which the target distribution is defined (see Eq. (1)).
The idea of sample generation under constraints could be an interesting line of future work. Humans clearly have the capacity to engage in imaginative tasks under a variety of constraints, e.g., when given incomplete sentences or fragments of a picture, people can generate possible completions; cf. Sanborn & Chater (2016). Also, our proposed framework can be used to let a CCNN generate samples from a category of interest at any stage during CCNN construction. In that light, our proposed framework, along with a neurally plausible implementation of MAL, gives rise to a self-organized generative model: a generative model possessing the self-constructive property of CCNNs. Such self-organized generative models could provide a wealth of developmental hypotheses as to how the imaginative capacities of children change over development, and models with quantitative predictions to compare against. We see our work as a step towards such models.
This work is funded by an operating grant to TRS from the Natural Sciences and Engineering Research Council of Canada.
- Baluja, S., & Fahlman, S. E. (1994). Reducing network depth in the cascade-correlation learning architecture. Technical Report CMU-CS-94-209, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
- Bonawitz, E., Denison, S., Gopnik, A., & Griffiths, T. L. (2014). Win-stay, lose-sample: A simple sequential algorithm for approximating Bayesian inference. Cognitive Psychology, 74, 35–65.
- Bonawitz, E., Denison, S., Griffiths, T. L., & Gopnik, A. (2014). Probabilistic models, learning algorithms, and response variability: Sampling in cognitive development. Trends in Cognitive Sciences, 18(10), 497–500.
- Brooks, D., & Baddeley, A. (1976). What can amnesic patients learn? Neuropsychologia, 14(1), 111–122.
- Buesing, L., Bill, J., Nessler, B., & Maass, W. (2011). Neural dynamics as sampling: A model for stochastic computation in recurrent networks of spiking neurons. PLoS Comput Biol, 7(11), e1002211.
- Chalup, S. K., & Wiklendt, L. (2007). Variations of the two-spiral task. Connection Science, 19(2), 183–199.
- Dasgupta, I., Schulz, E., & Gershman, S. J. (2016). Where do hypotheses come from? Center for Brains, Minds and Machines (CBMM) Memo No. 056.
- Fahlman, S. E., & Lebiere, C. (1989). The cascade-correlation learning architecture.
- Gershman, S. J., & Beck, J. M. (2016). Complex probabilistic inference: From cognition to neural computation. In A. Moustafa (Ed.), Computational Models of Brain and Behavior. Hoboken, NJ: Wiley-Blackwell.
- Gershman, S. J., Vul, E., & Tenenbaum, J. B. (2012). Multistability and perceptual inference. Neural Computation, 24(1), 1–24.
- Hassabis, D., Kumaran, D., Vann, S. D., & Maguire, E. A. (2007). Patients with hippocampal amnesia cannot imagine new experiences. Proceedings of the National Academy of Sciences, 104(5), 1726–1731.
- LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., & Huang, F. (2006). A tutorial on energy-based learning. Predicting Structured Data, 1(0).
- Mareschal, D., & Shultz, T. R. (1999). Development of children's seriation: A connectionist approach. Connection Science, 11(2), 149–186.
- Moreno-Bote, R., Knill, D. C., & Pouget, A. (2011). Bayesian sampling in visual perception. Proceedings of the National Academy of Sciences, 108(30), 12491–12496.
- Pecevski, D., Buesing, L., & Maass, W. (2011). Probabilistic inference in general graphical models through sampling in stochastic networks of spiking neurons. PLoS Comput Biol, 7(12), e1002294.
- Robert, C., & Casella, G. (2013). Monte Carlo statistical methods. Springer Science & Business Media.
- Roberts, G. O., & Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1), 255–268.
- Roberts, G. O., & Tweedie, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 341–363.
- Sanborn, A. N., & Chater, N. (2016). Bayesian brains without probabilities. Trends in Cognitive Sciences, 20(12), 883–893.
- Sanborn, A. N., Griffiths, T. L., & Navarro, D. J. (2010). Rational approximations to rational models: Alternative algorithms for category learning. Psychological Review, 117(4), 1144.
- Savin, C., & Deneve, S. (2014). Spatio-temporal representations of uncertainty in spiking neural networks (pp. 2024–2032).
- Shultz, T. R. (1998). A computational analysis of conservation. Developmental Science, 1(1), 103–126.
- Shultz, T. R. (2006). Constructive learning in the modeling of psychological development. Processes of Change in Brain and Cognitive Development: Attention and Performance, 21, 61–86.
- Shultz, T. R., & Bale, A. C. (2006). Neural networks discover a near-identity relation to distinguish simple syntactic forms. Minds and Machines, 16(2), 107–139.
- Shultz, T. R., Mareschal, D., & Schmidt, W. C. (1994). Modeling cognitive development on balance scale phenomena. Machine Learning, 16(1-2), 57–86.
- Shultz, T. R., & Rivest, F. (2001). Knowledge-based cascade-correlation: Using knowledge to speed learning. Connection Science, 13(1), 43–72.
- Shultz, T. R., & Takane, Y. (2007). Rule following and rule use in the balance-scale task. Cognition, 103(3), 460–472.
- Shultz, T. R., & Vogel, A. (2004). A connectionist model of the development of transitivity. In Proceedings of the 26th Annual Conference of the Cognitive Science Society (pp. 1243–1248).
- Spiers, H. J., Maguire, E. A., & Burgess, N. (2001). Hippocampal amnesia. Neurocase, 7(5), 357–382.
- Westermann, G., Sirois, S., Shultz, T. R., & Mareschal, D. (2006). Modeling developmental cognitive neuroscience. Trends in Cognitive Sciences, 10(5), 227–232.