1 Introduction
Computational neuroscientists have made significant progress in understanding how correlations amongst neural spike counts influence information processing [averbeck_neural_2006; kohn_correlations_2016]. In particular, correlated fluctuations in the responses to a fixed stimulus (termed noise correlations) depend on stimulus identity, and can both hinder and facilitate information processing in model neural circuits [abbott_effect_1999; sompolinsky_population_2001; shamir_implications_2006; ecker_effect_2011; morenobote_informationlimiting_2014]. However, because it is challenging to assess the strength and order of noise correlations empirically, it remains unclear how noise correlations affect computation in biological neural circuits.
Measuring correlations of every order in large neural populations is infeasible, and modelling neural correlations requires knowledge of the order and dimensionality of significant correlations. Most models measure exclusively pairwise correlations [schneidman_towards_2016; gardella_modeling_2018], which can often explain most of the variability in spike-count data [schneidman_weak_2006; granotatedgi_stimulusdependent_2013]. Moreover, dimensionality-reduction methods have shown that the complete set of pairwise neural correlations often has a low-dimensional structure for both total [cowley_stimulusdriven_2016] and noise [ecker_state_2014; cunningham_dimensionality_2014; goris_partitioning_2014; okun_diverse_2015; rosenbaum_spatial_2017] correlations. Nevertheless, higher-order correlations become more significant as both stimulus complexity and neural population size increase [schneidman_towards_2016], motivating the use of higher-order models when modelling noise correlations in large populations of neurons.
Dimensionality-reduction methods have provided many insights into neural correlations, but correlations must be studied through models of population codes in order to understand their effect on neural coding. Maximum entropy models of binary neural spiking drive much of the contemporary research on neural correlations in model population codes [schneidman_towards_2016]. The flexibility of the maximum entropy framework allows the techniques for analyzing and fitting pairwise models to be generalized to both sparse, higher-order models [ganmor_sparse_2011] and stimulus-dependent models [granotatedgi_stimulusdependent_2013]. Nevertheless, there is still a need for a rate-based model of population coding that can reliably capture the complex yet low-dimensional noise correlations found in neural data.
To address this we propose a conditional maximum entropy model, specifically a conditional finite mixture of independent Poisson distributions (CMP). When conditioned on a context variable (e.g. a stimulus), a CMP is a mixture of component collections of independent Poisson distributions [karlis_finite_2007]. A CMP with only one component reduces to a standard rate-based population code with Poisson neurons that are conditionally independent given the stimulus [ma_bayesian_2006; ostojic_spiking_2011], and adding components to the CMP introduces noise correlations with dimensionality controlled by the number of components.
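As a concrete illustration, the following minimal Python/NumPy sketch (with our own variable names and rate values; not the paper's implementation, which is in Haskell) samples from a two-component mixture of independent Poisson distributions. Mixing two rate patterns induces noise correlations, while a single component leaves the neurons independent:

```python
import numpy as np

def sample_poisson_mixture(weights, rates, n_samples, rng):
    """Draw spike counts from a finite mixture of independent Poissons.

    weights: (K,) mixture weights summing to 1; rates: (K, N) array with one
    row of N Poisson rates per mixture component."""
    weights, rates = np.asarray(weights), np.asarray(rates)
    ks = rng.choice(len(weights), size=n_samples, p=weights)  # latent indices
    return rng.poisson(rates[ks])                             # (n_samples, N)

rng = np.random.default_rng(0)
rates = np.array([[2.0, 10.0],   # component 1: neuron 1 quiet, neuron 2 active
                  [10.0, 2.0]])  # component 2: the reverse
mixed = sample_poisson_mixture([0.5, 0.5], rates, 20000, rng)
single = sample_poisson_mixture([1.0], rates[:1], 20000, rng)
rho_mixed = np.corrcoef(mixed.T)[0, 1]    # strongly negative
rho_single = np.corrcoef(single.T)[0, 1]  # near zero
```

For these rates the law of total covariance gives a covariance of −16 and a variance of 22 per neuron, so the empirical correlation settles near −0.73, whereas the one-component sample is uncorrelated.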
By combining the properties of maximum entropy models with additional features of Poisson neurons, we derive a hybrid expectation-maximization algorithm that can efficiently and reliably fit the CMP to data. We apply the CMP to neural population data recorded in macaque primary visual cortex (V1). We find that noise correlation structure is best captured with 3–5 components across several datasets, and that while the overall dimensionality of correlations is largely stimulus-independent, the structure of noise correlations (i.e. the relative weights of the mixture components) depends on the stimulus.
2 Conditional Finite Mixtures of Independent Poisson Distributions
Many of the equations that describe how to analyze and train the CMP model can be solved by expressing the parameters of the model in an appropriate coordinate system. Two of the coordinate systems we consider are the mean and natural parameters that arise in the context of exponential families (maximum entropy models are also known as exponential families, and we prefer the latter name in this paper). We thus begin this section by reviewing exponential families, primarily for the purpose of developing notation (for a thorough development see [amari_methods_2007; wainwright_graphical_2008]). We continue by showing that the standard weighted-sum form of a finite mixture model is a third parameterization of a kind of exponential family known as an exponential family harmonium [welling_exponential_2005]. Finally, we introduce and develop conditional finite mixture models and CMPs, and we derive training algorithms for such models, including the hybrid expectation-maximization algorithm for training CMPs.
2.1 Exponential Families
Consider a random variable with an unknown distribution , and suppose we are given an independent and identically distributed sample from such that every . We may model based on the sample by first defining a statistic , where is the codomain of with dimension . We then look for a probability distribution that satisfies , where is the expected value of under . This is of course an underconstrained problem, but if we also assume that must have maximum entropy, then we arrive at a family of distributions which uniquely satisfy the constraints for every value of .

An exponential family of a given dimension is defined by the sufficient statistic , as well as a base measure which helps define integrals and expectations within the family. An exponential family is parameterized by a set of natural parameters , such that each element of the family may be identified with some parameters . The density of the distribution is given by , where is the log-partition function. Expectations of any are then given by . Because each is uniquely defined by , the means of the sufficient statistic also parameterize . The space of all mean parameters is , and we denote them by . Finally, a sufficient statistic is minimal when its component functions are non-constant and linearly independent. If the sufficient statistic of a given family is minimal, then and are isomorphic. Moreover, the transition functions between them and are given by and , where is the negative entropy of .
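For instance, writing the conventional symbols θ for the natural parameter and ψ for the log-partition function (the paper's own notation does not render here), a single Poisson neuron with rate λ has θ = log λ and ψ(θ) = e^θ, so the transitions between natural and mean coordinates are simple exponentials and logarithms. A quick numerical sketch:

```python
import math

# For one Poisson neuron: natural parameter theta = log(rate), log-partition
# psi(theta) = exp(theta). The mean parameter is the gradient of psi,
# eta = d psi / d theta = exp(theta), and the inverse map is theta = log(eta).
def log_partition(theta):
    return math.exp(theta)

def to_mean(theta):
    return math.exp(theta)  # gradient of the log-partition

def to_natural(eta):
    return math.log(eta)    # gradient of the negative entropy (Legendre dual)

theta = 1.3
eta = to_mean(theta)
# Finite differences confirm that the mean map is the gradient of psi.
h = 1e-6
fd_gradient = (log_partition(theta + h) - log_partition(theta - h)) / (2 * h)
round_trip = to_natural(to_mean(theta))
```

The round trip through mean coordinates recovers the natural parameter exactly, which is the isomorphism described above for minimal sufficient statistics.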
Before we conclude this subsection, we highlight the exponential families of categorical distributions and of independent Poisson distributions . The categorical exponential family contains all the distributions over integer values between 0 and its dimension. The base measure of is the counting measure, and the th element of its sufficient statistic at is 1, while every other element is 0. The sufficient statistic is thus a vector of all zeroes when , and all zeroes except for the th element when . The family of independent Poisson distributions contains distributions over vectors of natural numbers, where each element of the vector is independently Poisson distributed. The sufficient statistic of is the identity function, and the base measure is given by .

2.2 Finite Mixture Models
In this subsection we show how finite mixture models can be expressed as latent-variable models such that the complete joint model is a particular kind of exponential family. To begin, let us consider an exponential family . The density of a finite mixture of distributions in has the form , where each distribution is a mixture component with natural parameters , each is a mixture weight, and the weights satisfy the constraints and .
Building a model out of mixture distributions is theoretically problematic because swapping component indices has no effect on a mixture distribution, and the parameters are therefore not identifiable. Nevertheless, we may express mixture distributions as latent-variable distributions with the identifiable form , where and . Moreover, is a categorical distribution with mean parameters , and may thus be described by the natural parameters . Given and , we therefore define a finite mixture model as the set of all distributions , and parameterize it by the mixture weights and mixture component parameters .
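To make this parameterization concrete, the following sketch (Python; the convention of measuring categorical natural parameters as log-odds against a baseline 0th component is our assumption, since the paper's symbols do not render here) evaluates a mixture density by marginalizing over the latent index, and recovers mixture weights from categorical natural parameters via a softmax with a fixed zero entry:

```python
import math

def poisson_pmf(n, rate):
    # Computed in log space to stay numerically stable for large n.
    return math.exp(n * math.log(rate) - rate - math.lgamma(n + 1))

def mixture_pmf(n, weights, rates):
    """Marginal density of the latent-variable form: sum_k w_k p(n | k)."""
    return sum(w * poisson_pmf(n, r) for w, r in zip(weights, rates))

def weights_from_natural(theta_z):
    """Mixture weights from categorical natural parameters, taken here as
    log-odds against a baseline 0th component (a softmax with a fixed zero)."""
    exps = [1.0] + [math.exp(t) for t in theta_z]
    z = sum(exps)
    return [e / z for e in exps]

weights = weights_from_natural([math.log(2.0), math.log(1.0)])
total_mass = sum(mixture_pmf(n, weights, [1.0, 5.0, 10.0]) for n in range(100))
```

Summing the marginalized density over counts confirms it is a properly normalized distribution, even though the weights were specified in natural coordinates.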
Now, an exponential family harmonium is a kind of product exponential family which includes restricted Boltzmann machines as a special case [welling_exponential_2005]. We may construct an exponential family harmonium out of and by defining the base measure of as the product measure , and by defining the sufficient statistic of as the vector which contains the concatenation of all the elements in , , and the outer product . Intuitively, contains all the distributions with densities of the form , where , , and are the natural parameters of . As we show, a harmonium defined partially by the categorical exponential family is equivalent to a mixture model.

Theorem 1.
Let be an exponential family, let be the categorical exponential family of dimension , let be the mixture model defined by and , and let be the exponential family harmonium defined by and . Then with natural parameters , , and with mixture parameters and if and only if
(1)  
(2) 
where and are the th elements of and , is the th row of , and .
Proof.
We identify and by identifying the parameters of their conditionals and marginals. First note that the conditional distribution at is an element of with parameters [welling_exponential_2005], and that the sufficient statistic in this expression essentially selects a row from , or returns . Thus, where is the th row of , if and only if , and for .
To equate and , first note that the marginal density of a harmonium distribution satisfies [welling_exponential_2005]. Because is partially defined by , we may show with a bit of algebra that , where satisfies Equation 1, and by substituting this into the expression for the harmonium marginal, we may conclude that if and only if Equation 1 is satisfied. ∎
The advantage of formulating finite mixture models as exponential family harmoniums is that we may apply the theory of harmoniums and exponential families to analyzing and training them. For example, given a sample and a harmonium distribution with parameters , , and , an iteration of the expectation-maximization (EM) algorithm may be formulated as:
- Expectation Step: compute the unobserved means for every .
- Maximization Step: evaluate .
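The generic iteration above has a particularly simple shape when the observable family is independent Poisson (the case used later for CMPs): the E-step computes per-sample responsibilities, and the M-step updates weights and rates in closed form. The following sketch (Python/NumPy, under our own naming; an illustration of the generic unconditional algorithm, not the paper's Haskell implementation) runs EM on a two-component Poisson mixture and checks that the average log-likelihood never decreases:

```python
import numpy as np

def log_components(counts, rates):
    """Per-sample, per-component Poisson log-density, dropping the log(n!)
    term, which cancels in the responsibilities. counts: (S, N); rates: (K, N)."""
    return counts @ np.log(rates).T - rates.sum(axis=1)

def em_step(counts, weights, rates):
    # Expectation step: responsibilities, i.e. posteriors over the latent index.
    log_r = log_components(counts, rates) + np.log(weights)
    log_r -= log_r.max(axis=1, keepdims=True)  # stabilize before exponentiating
    resp = np.exp(log_r)
    resp /= resp.sum(axis=1, keepdims=True)    # (S, K)
    # Maximization step: closed form for mixture weights and Poisson rates.
    new_weights = resp.mean(axis=0)
    new_rates = resp.T @ counts / resp.sum(axis=0)[:, None]
    return new_weights, new_rates

def avg_log_likelihood(counts, weights, rates):
    ll = log_components(counts, rates) + np.log(weights)
    m = ll.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(ll - m).sum(axis=1))).mean())

rng = np.random.default_rng(1)
true_rates = np.array([[2.0, 10.0], [10.0, 2.0]])
ks = rng.choice(2, size=2000, p=[0.5, 0.5])
counts = rng.poisson(true_rates[ks])

weights, rates = np.array([0.5, 0.5]), np.array([[1.0, 8.0], [8.0, 1.0]])
ll_init = avg_log_likelihood(counts, weights, rates)
for _ in range(30):
    weights, rates = em_step(counts, weights, rates)
ll_final = avg_log_likelihood(counts, weights, rates)  # EM never decreases it
```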
On the other hand, the stochastic log-likelihood gradients of the parameters of are
(3)  
(4)  
(5) 
where .
Whether or not the various transition functions and their inverses can be evaluated depends on the defining manifolds and . When is the categorical family, its transition function is computable, and whether or are computable is reducible to whether or are computable, respectively. EM is typically preferred when maximizing the likelihood of finite mixture model parameters, especially for finite mixtures of normal distributions. However, gradient descent can perform just as well as EM for certain mixture models, as we demonstrate in Figure 1.

2.3 Conditional Finite Mixture Models
According to Equations 1 and 2, we may induce context dependence in all the mixture parameters of a mixture distribution by allowing to depend on additional variables. We thus define a conditional finite mixture model as the set of all the conditional distributions with densities , where is defined by an additional set of parameters . In the conditional setting we aim to maximize the conditional log-likelihood of the parameters , , , and , given a sample from a target conditional distribution . Stochastic gradient ascent on the conditional log-likelihood remains relatively straightforward: given a sample point , one must simply backpropagate the error gradient through the function , where .

The expectation step of the EM algorithm also remains trivial for conditional mixtures. It is easy to show that for any the conditional distribution , and we may therefore continue to evaluate the unobserved means as . On the other hand, the maximization step can be expressed as maximizing the sum over of the function
(6) 
with respect to the parameters , , , and . It is more difficult to evaluate the maximization step for conditional finite mixtures due to the dependence on , although we may still find maxima of the objective by gradient ascent. Nevertheless, when is the family of independent Poisson distributions, we can in fact evaluate the maximization step in closed form with respect to the parameters and . We refer to the resulting conditional finite mixture model as the family of conditional finite mixtures of independent Poisson distributions (CMPs).
Theorem 2.
Let be a CMP model defined by , , and . Then and , where .
Proof.
Similar to Theorem 1, consider the change of variables , , and . We may express our optimization in this alternative parameterization as such that . Moreover, the log-partition function of is given by , and we may therefore conclude that . This is a variant of the Legendre transform, and its solution is . ∎
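The Legendre-transform step in the proof can be illustrated for a single Poisson neuron (writing the conventional θ, η, ψ, since the paper's notation does not render here): maximizing ηθ − ψ(θ) with ψ(θ) = e^θ has the closed-form solution θ* = log η, which a crude grid search confirms numerically:

```python
import math

# The maximization step reduces to maximizing eta*theta - psi(theta); its
# solution is the Legendre dual. For a Poisson neuron, psi(theta) = exp(theta),
# so the maximizer is theta* = log(eta).
def objective(theta, eta):
    return eta * theta - math.exp(theta)

eta = 4.0
# A grid search over theta confirms the closed-form solution log(eta).
grid = [i / 1000.0 for i in range(-5000, 5001)]
theta_star = max(grid, key=lambda t: objective(t, eta))
```

This is why, for independent Poisson components, the maximization step can be evaluated in closed form rather than by gradient ascent.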
2.4 Training Algorithms
Based on these results, we propose three algorithms for training conditional finite mixture models, the third of which is specific to CMPs; we abbreviate them as EM, SGD, and Hybrid. The first approximates expectation-maximization (EM) by computing the expectation step in closed form, and approximating the maximization step by gradient ascent of the objective in Equation 6. The second is stochastic gradient descent (SGD) on the negative log-likelihood based on Equations 3, 4, and 5. The third is the Hybrid algorithm based on Theorem 2. The Hybrid algorithm alternates between one of two phases every training epoch: the first phase is simply SGD with respect to all the CMP parameters, and the second phase updates the parameters and based on Theorem 2.

3 Applications to Synthetic and Real Data
In this section we model synthetic and real data with a CMP. In both cases we define the context variable as an element of the half-circle, such that , which represents e.g. the orientation of a grating, as is common in V1 experiments. We define the natural parameter function by , where . This choice of allows us to express the conditional firing rates of the model neurons as , where is the gain and is a von Mises density function with preferred stimulus and precision . The dependence of the model on the categorical variable is expressed solely through the gain components , and where is the tuning curve of the th neuron, this form ensures that each tuning curve retains a von Mises shape regardless of correlation structure.

All simulations were run on a Lenovo P51 laptop with an Intel Xeon processor, and individual runs of the algorithms in Subsection 2.4 took on the order of tens of seconds to execute. All algorithms were implemented in the Haskell programming language.
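The von Mises tuning-curve parameterization above can be sketched as follows (Python; we double the angle so the curve is π-periodic on the half-circle, and absorb the von Mises normalization constant into the gain, which may differ from the paper's exact convention):

```python
import math

def tuning_curve(x, gain, kappa, phi):
    """Von Mises-shaped tuning on the half-circle [0, pi): doubling the angle
    makes the curve pi-periodic; normalization is absorbed into the gain."""
    return gain * math.exp(kappa * math.cos(2.0 * (x - phi)))

phi = math.pi / 3  # preferred orientation: 60 degrees
rates = [tuning_curve(d * math.pi / 180, 5.0, 1.5, phi) for d in range(180)]
peak_degree = max(range(180), key=lambda d: rates[d])  # peaks at phi
```

The curve peaks at the preferred stimulus and takes equal values at 0 and π, as an orientation tuning curve must.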
3.1 Synthetic Data
We first consider synthetic data generated from a CMP with neurons and mixture components. The preferred stimuli of the tuning curves are drawn from a uniform distribution over , and the precisions and gains are log-normally distributed with means of 0.8 and 2, respectively. We plot the von Mises part of each CMP tuning curve in Figure 2A. In Figure 2B we plot for every , as a function of . This provides a low-dimensional picture of the stimulus dependence of the correlations: the number of active components determines the dimensionality of the correlations, and their identity determines the structure of the correlations.

To synthesize a dataset consistent with typical recordings in visual cortex, we generate 62 responses to each of 8 stimuli tiled evenly over the half-circle, for a total of 496 sample points from . In Figure 2C we plot the descent of the resulting negative log-likelihood of the CMP parameters for each of the three algorithms proposed in Subsection 2.4. Each epoch of each algorithm involves exactly one pass over the dataset, ensuring that their computation times are of a similar order. In each epoch the dataset is broken up into randomized minibatches of 50 samples, resulting in ten gradient steps per epoch for each algorithm, except for the closed-form phase of the Hybrid algorithm, which processes the data as a single batch. Gradient descent is implemented with the Adam algorithm [kingma_adam_2014] with standard momentum parameters and a learning rate of 0.005, and the Adam parameters are reset every epoch. To avoid local minima we run 10 descents in parallel for each algorithm and select the best one.
In addition, we plot the true negative log-likelihood in Figure 2C to establish a lower bound on the training objective. Since a CMP with one component reduces to a set of conditionally independent Poisson neurons, and fitting such a CMP is a convex optimization problem, we also plot an upper bound in the form of for the optimal . This upper bound tells us how much the addition of correlations to the model improves the quality of the fit. As seen in Figure 2, the Hybrid algorithm converges more quickly and to a lower value than either SGD or EM. All algorithms overfit the data relative to the lower bound, which is unsurprising given the small sample size. We later apply cross-validation in our analysis of real data to help avoid this.
In Figure 2D we plot for the CMP model trained with the Hybrid algorithm. We highlight two curves to emphasize that the weight curves learned by the model are similar to those of the ground-truth CMP. In Figure 2E we plot stimulus-dependent, pairwise correlation matrices of the ground-truth and Hybrid-trained models. As can be seen, the learned-model and ground-truth correlations are nearly identical. Also note that when one mixture weight dominates, the correlations disappear. Finally, in Figure 3 we repeat the previous analysis on a CMP target and model with neurons. As can be seen, the results are essentially identical to those in Figure 2. As such, at least for synthetic data, a small number of sample points (e.g. 496) is sufficient for modelling low-dimensional noise correlations in large populations of neurons.
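The observation that correlations disappear when one mixture weight dominates follows from the law of total covariance, which for a mixture of independent Poissons can be computed exactly (a Python/NumPy sketch with illustrative rates of our own choosing):

```python
import numpy as np

def mixture_covariance(weights, rates):
    """Covariance of a finite mixture of independent Poissons via the law of
    total covariance: E[Cov | k] + Cov(E[. | k]). rates: (K, N)."""
    w = np.asarray(weights)[:, None]  # (K, 1)
    lam = np.asarray(rates)           # (K, N)
    mean = (w * lam).sum(axis=0)      # mixture mean rates
    within = np.diag(mean)            # a Poisson's variance equals its mean
    between = (w * lam).T @ lam - np.outer(mean, mean)
    return within + between

rates = np.array([[2.0, 10.0], [10.0, 2.0]])
cov_mixed = mixture_covariance([0.5, 0.5], rates)      # off-diagonal: -16
cov_peaked = mixture_covariance([0.999, 0.001], rates)  # off-diagonal near 0
```

With equal weights the two rate patterns anti-correlate the neurons, while putting a weight of 0.999 on one component shrinks the off-diagonal covariance to roughly −0.06: the mixture has effectively collapsed to a single set of independent Poisson neurons.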
3.2 Response Recordings from Macaque Primary Visual Cortex
In this subsection we repeat our analysis from the previous subsection on multielectrode recordings of macaque primary visual cortex (V1) in response to oriented grating stimuli (the data were originally presented in [coencagli_flexible_2015]). We analyze 8 datasets corresponding to 8 multielectrode recording sessions of between 28 and 76 well-tuned neurons. Each dataset is composed of 80 repeated trials of 16 stimuli spread evenly over the half-circle. (Each set of 80 trials pools 4 distinct phases of the stimulus, which could inflate measured correlations; nevertheless, the recorded neurons were found to be roughly phase independent [coencagli_flexible_2015].)
In Figure 4A we depict the 10-fold cross-validation of the log-likelihood of each of the 8 datasets as a function of the number of components in the Hybrid-trained CMP model, and we subtract the 1-component value from the value computed for the remaining components. This figure thus quantifies the amount of information (in nats) that is gained about the true data distribution by modelling the data with the given number of components. As can be seen, between 3 and 5 components are optimal for the various datasets before information gain yields to overfitting. In Figure 4B we plot the results of fitting the complete selected dataset with the three proposed algorithms. The Hybrid algorithm continues to perform well and converge quickly, while EM and SGD appear to require significantly more computation to find good solutions.
In Figure 4C we plot for the Hybrid-trained model , but in this case we colour the four curves to more easily distinguish them. The curves indicate that the correlations of the true data distribution are stimulus-dependent, but less so than those of our random models. Moreover, at no stimulus value does a single component dominate, and the effective dimensionality is largely stimulus-independent. Figure 4D shows that the CMP finds a subtle yet significant amount of stimulus dependence in model correlations, which is not clearly visible in the empirical correlations.
4 Conclusion
In this paper we introduced a novel model (the CMP) of context-dependent neural correlations, derived an expectation-maximization method for training it, and demonstrated that CMPs can capture significant correlation structure amongst neurons. We also demonstrated that EM and SGD can often perform equally well in the context of finite mixture models, which stands in contrast with common practice [mclachlan_finite_2019]. In future work we will compare the correlations learned with a CMP to the results of alternative dimensionality-reduction techniques for context-dependent neural correlations [granotatedgi_stimulusdependent_2013; ecker_state_2014; cowley_stimulusdriven_2016]. At the same time, CMPs are uniquely suited for studying rate-based neural coding. For example, information-limiting correlations [morenobote_informationlimiting_2014] can be directly incorporated into a CMP by defining the gain components as identical but shifted with respect to neuron index.
In our applications in Section 3, the stimulus-dependent parameters had a simple generalized linear form. Theoretically, however, these parameters could be modelled by a deep neural network, thereby allowing the output of the deep network to exhibit correlations. When the output of the network has a complex structure, such as when modelling the stimulus-dependent neural responses of mid-level brain regions to naturalistic images [yamins_using_2016], incorporating correlations could significantly improve model predictions. Moreover, because CMPs model noise correlations in a manner that is compatible with theories of population coding, such a combined model can extend our understanding of neural computation beyond low-level sensory areas and low-dimensional variables.