1 Introduction
Recent advances in imaging and recording techniques have enabled monitoring the activity of hundreds to several thousands of neurons simultaneously Jun et al. (2017); Helmchen (2009); Dombeck et al. (2007). These recordings can be made in awake animals engaged in specifically designed tasks or natural behavior Stringer et al. (2019); Pakan et al. (2018a, b), which further augments these already large datasets with a variety of behavioral variables. These complex high dimensional datasets necessitate the development of novel analytical approaches Staude et al. (2010); Brown et al. (2004); Saxena and Cunningham (2019); Stevenson and Kording (2011)
to address two central questions of systems and behavioral neuroscience: how do populations of neurons encode information? And how does this neuronal activity correspond to the observed behavior? In machine learning terms, both of these questions translate into understanding the high-dimensional multivariate dependencies between the recorded variables
Ince et al. (2010); Shamir and Sompolinsky (2004); Kohn et al. (2016); Shimazaki et al. (2012). There are two major methods suitable for recording the activity of large populations of neurons in behaving animals: multi-electrode probes, which provide millisecond precision for recordings of electrical activity Jun et al. (2017), and calcium imaging methods Helmchen (2009); Dombeck et al. (2007); Grienberger et al. (2015), which use changes in intracellular calcium concentration as a proxy for neuronal spiking activity at a lower (tens of milliseconds) temporal precision. As a result, the recorded neuronal and behavioral variables may operate at different timescales and exhibit different statistics, which further complicates the statistical analysis.
The natural approach to modeling statistical dependencies between variables with drastically different statistics is based on copulas, which separate the marginal statistics from the dependence structure Joe (2014). For this reason, copula models are particularly effective for mutual information estimation Jenison and Reale (2004); Calsaverini and Vicente (2009a). They can also escape the 'curse of dimensionality' by factorizing the multidimensional dependence into pair-copula constructions called vines Aas et al. (2009); Czado (2010). Copula models have been successfully applied to spiking activity Berkes et al. (2009); Onken et al. (2009); Hu et al. (2015); Shahbaba et al. (2014), two-photon calcium recordings Safaai (2019) and multimodal neuronal datasets Onken and Panzeri (2016). However, these models assumed that the dependence between variables was static, whereas in neuronal recordings it may be dynamic or modulated by behavioral context Doiron et al. (2016); Shimazaki et al. (2012). Therefore, it might be helpful to explicitly model the continuous time- or context-dependent changes in the relationships between variables, which reflect changes in an underlying computation.
Here, we extend a copula-based approach by adding explicit conditional dependence to the parameters of the copula model, approximating these latent dependencies with Gaussian Processes (GP). It was previously shown that such a combination of parametric copula models with GP priors outperforms static copula models Lopez-Paz et al. (2013) and even dynamic copula models on many real-world datasets, including weather forecasts, geological data and stock market data Hernández-Lobato et al. (2013). Yet, this method has never been applied to neuronal recordings before.
In this work, we improve the scalability of the method by using stochastic variational inference. We also increase the complexity of the copula models in order to adequately describe the complex dependencies commonly observed in neuronal data. In particular, we use mixtures of parametric copula models to account for changes in tail dependencies. We develop model selection algorithms based on the fully Bayesian Watanabe–Akaike information criterion (WAIC). Finally and most importantly, we demonstrate that our model is suitable for estimating mutual information. It performs especially well when the parametric model can closely approximate the target distribution. When this is not the case, our copula mixture model demonstrates sufficient flexibility and provides close information estimates, comparable to the best state-of-the-art non-parametric information estimators.
We first introduce the copula mixture models and propose model selection algorithms (Sec. 2). We then validate our model on synthetic data and compare its performance against other commonly used information estimators (Sec. 3). Next, we demonstrate the utility of the method on real neuronal and behavioral data (Sec. 4). We show that our CopulaGP method can produce interpretable bivariate models that emphasize the qualitative changes in tail dependencies and estimate mutual information that exposes the structure of the task without providing any explicit cues to the model. Finally, we apply the vine CopulaGP model to measure information content in the whole dataset with 5 behavioral variables and more than 100 neurons.
2 Parametric copula mixtures with Gaussian process priors
Our model is based on copulas: multivariate distributions with uniform marginals. Sklar's theorem Sklar (1959) states that any multivariate joint distribution can be written in terms of univariate marginal distribution functions and a unique copula $C$ which characterizes the dependence structure: $F(y_1, \dots, y_n) = C(F_1(y_1), \dots, F_n(y_n))$. Here, $F_i$ are the marginal cumulative distribution functions (CDF) and, as a result, each $u_i = F_i(y_i)$ is uniformly distributed on $[0,1]$.
For high-dimensional datasets (high $n$), maximum likelihood estimation of the copula parameters may become computationally challenging. The two-stage inference for margins (IFM) training scheme is typically used in this case Joe (2005). First, the univariate marginals are estimated and used to map the data onto a multidimensional unit cube. Second, the parameters of the copula model are inferred.
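As an illustration (not the authors' implementation), the two-stage IFM scheme can be sketched for the simplest case of a Gaussian copula, where the second stage reduces to a correlation estimate on normal scores; the helper name `ifm_gaussian_copula` is ours:

```python
import numpy as np
from scipy.stats import norm, rankdata

def ifm_gaussian_copula(y):
    """Two-stage IFM sketch. Stage 1: empirical marginal CDFs map each column
    of y (shape (n, d)) to the unit cube. Stage 2: for a Gaussian copula, the
    MLE of the correlation matrix is the sample correlation of normal scores."""
    n, d = y.shape
    # Stage 1: probability integral transform via empirical marginal CDFs
    # (ranks scaled to the open interval (0, 1)).
    u = np.column_stack([rankdata(y[:, j]) / (n + 1) for j in range(d)])
    # Stage 2: transform to normal scores and estimate the copula correlation.
    z = norm.ppf(u)
    return u, np.corrcoef(z, rowvar=False)
```

For non-Gaussian families, the second stage would instead maximize the copula log-likelihood over the transformed data.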
Conditional copulas
Following the approach of Hernández-Lobato et al. (2013), we use Gaussian Processes (GP) to model the conditional dependencies of the copula parameters:

$$p(y_1, \dots, y_n \mid x) = c\big(F_1(y_1 \mid x), \dots, F_n(y_n \mid x) \mid x\big) \prod_{i=1}^{n} p_i(y_i \mid x). \quad (1)$$
In the most general case, the marginal PDFs and CDFs and the copula itself can all be conditioned on $x$. In our framework, $x$ is assumed to be one-dimensional. A Gaussian Process is ideally suited for copula parametrization, as it provides an estimate of the uncertainty in the model parameters, which we utilize in our model selection process (Sec. 2.1).
Conditional marginals
In order to estimate the marginal CDFs $F_i(y_i \mid x)$, we use the non-parametric fastKDE algorithm O'Brien et al. (2016), which allows for direct estimation of conditional distributions. The conditional distribution is then used to map the data onto a unit hypercube using the probability integral transform: $u_i = F_i(y_i \mid x)$, such that $u_i$ is uniformly distributed for any $x$.

Bivariate copula families
We use 4 copula families as the building blocks for our copula models: the Gaussian, Frank, Clayton and Gumbel copulas (Figure 1). All of these families have a single parameter, corresponding to the rank correlation (Table 1). We also use rotated variants (90°, 180°, 270°) of the Clayton and Gumbel copula families in order to express upper tail dependencies and negative correlation.
Copula        | Parameter domain
--------------|-----------------
Independence  | –
Gaussian      | [−1, 1]
Frank         | (−∞, ∞)
Clayton       | [0, ∞)
Gumbel        | [1, ∞)
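As a hedged illustration of these building blocks, the Clayton density from Table 1 and the rotation trick for expressing upper-tail dependence can be sketched as follows (the function names are ours, not from the paper's code):

```python
import numpy as np

def clayton_pdf(u, v, theta):
    """Bivariate Clayton copula density for theta > 0:
    c(u,v) = (1+theta) (u v)^(-1-theta) (u^-theta + v^-theta - 1)^(-2-1/theta).
    It has lower-tail dependence (mass concentrated near (0, 0))."""
    return ((1 + theta) * (u * v) ** (-1 - theta)
            * (u ** -theta + v ** -theta - 1) ** (-2 - 1 / theta))

def rotate_pdf(pdf, degrees):
    """Rotations reflect the unit square, turning lower-tail dependence into
    upper-tail dependence (180) or negative dependence (90, 270)."""
    if degrees == 90:
        return lambda u, v, th: pdf(1 - u, v, th)
    if degrees == 180:
        return lambda u, v, th: pdf(1 - u, 1 - v, th)
    if degrees == 270:
        return lambda u, v, th: pdf(u, 1 - v, th)
    return pdf
```

In the limit theta → 0 the Clayton density tends to the Independence copula (density 1), consistent with the special case discussed below.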
Since we are primarily focused on the analysis of neuronal data, we have first visualized the dependencies in calcium signal recordings after a probability integral transform, yielding empirical conditional copulas. As a distinct feature in neuronal datasets, we observed changes in tail dependencies with regard to the conditioning variable. Since none of the aforementioned families alone could describe such conditional dependency, we combined multiple copulas into a linear mixture model (which is also a copula Nelsen (2007)):
$$c_{\mathrm{mix}}(u, v \mid x) = \sum_{k=1}^{K} \alpha_k(x)\, c_k\big(u, v \mid \theta_k(x)\big), \qquad \sum_{k=1}^{K} \alpha_k = 1, \quad (2)$$

where $K$ is the number of mixture elements, $\alpha_k$ is the concentration of the $k$-th copula in the mixture, $c_k$ is the probability density of the $k$-th copula, and $\theta_k$ is its parameter.
Each of the copula families includes the Independence copula as a special case. To resolve this overcompleteness, we add the Independence copula as a separate model with zero parameters (Table 1). For independent variables, the Independence model will be preferred over the other models in our model selection algorithm (Sec. 2.1), since it has the smallest number of parameters.
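A minimal sketch of the linear mixture (2), using a Gaussian copula and the Independence copula as example elements (the names and signatures are illustrative, not the framework's API):

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_pdf(u, v, rho):
    """Bivariate Gaussian copula density with correlation rho."""
    z1, z2 = norm.ppf(u), norm.ppf(v)
    q = rho * rho * (z1 ** 2 + z2 ** 2) - 2 * rho * z1 * z2
    return np.exp(-q / (2 * (1 - rho ** 2))) / np.sqrt(1 - rho ** 2)

def independence_pdf(u, v, _theta):
    """Independence copula: density 1 on the whole unit square."""
    return np.ones_like(np.asarray(u, dtype=float))

def mixture_pdf(u, v, alphas, pdfs, thetas):
    """Eq. (2): a convex combination of copula densities is itself
    a copula density, since the uniform marginals are preserved."""
    return sum(a * c(u, v, t) for a, c, t in zip(alphas, pdfs, thetas))
```

Because each element integrates to 1 on the unit square and the weights sum to 1, the mixture remains a valid copula density.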
Gaussian Process priors
We parametrize the mixture model (2) with independent latent GPs: $f_j(x) \sim \mathcal{GP}\big(0, k(x, x')\big)$. For each copula family, we constructed GP link functions (Table 1) that map the GP variable onto the copula parameter domain: $\theta_k = g_k\big(f_k(x)\big)$. Next, we also use GPs to parametrize the concentrations $\alpha_k$, which are defined on a simplex ($\sum_k \alpha_k = 1$). The concentrations are obtained from the GP outputs through a transformation based on $\Phi$, the CDF of a standard normal distribution. This parametrization ensures that when all GP variables $f = 0$, all of the concentrations are equal to $1/K$. We use the RBF kernel with bandwidth parameter $\lambda$. Therefore, the whole mixture model with $K$ copula elements requires $2K-1$ latent GP functions: $K$ for the parameters $\theta_k$ and $K-1$ for the concentrations $\alpha_k$.

Approximate Inference
Since our model has latent variables with GP priors and an intractable posterior distribution, direct Type-II maximum likelihood estimation is not possible and approximate inference is needed. Such an inference problem with copula models has previously been solved with the expectation propagation (EP) algorithm Hernández-Lobato et al. (2013). Considering the recent developments in high-performance parallel computing and stochastic optimization algorithms, we chose to use stochastic variational inference (SVI) instead. In order to scale the SVI to a large number of inputs, we use Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP) Wilson and Nickisch (2015). For an efficient implementation of these methods on GPU, we use the PyTorch Paszke et al. (2017) and GPyTorch Gardner et al. (2018) libraries.

2.1 Bayesian Model selection
We use the Watanabe–Akaike information criterion (WAIC; Watanabe (2013)) for model selection. WAIC is a fully Bayesian approach to estimating the Akaike information criterion (AIC) (see Eq. 31 in the original paper Watanabe (2013)). The main advantage of the method is that it avoids the empirical estimation of the effective number of parameters, which is often used to approximate the out-of-sample bias. It starts with the estimation of the log pointwise posterior predictive density (lppd) Gelman et al. (2014):

$$\mathrm{lppd} = \sum_{i=1}^{n} \log \left( \frac{1}{S} \sum_{s=1}^{S} p\big(y_i \mid \Theta^{(s)}\big) \right),$$

where $\Theta^{(s)}$ is a draw from the posterior distribution and $S$ must be large enough to represent the posterior. Next, the term $p_{\mathrm{WAIC}} = \sum_{i=1}^{n} \mathrm{Var}_{s}\big[\log p(y_i \mid \Theta^{(s)})\big]$ approximates the bias correction, where $\mathrm{Var}_s$ represents the sample variance over posterior draws. The bias-corrected estimate of the log pointwise posterior predictive density is then $\mathrm{lppd} - p_{\mathrm{WAIC}}$, and we define $\mathrm{WAIC} = -(\mathrm{lppd} - p_{\mathrm{WAIC}})$.

In the model selection process, we aim to choose the model with the lowest WAIC. Since our copula probability densities are continuous, their values can exceed 1 and the resulting WAIC is typically negative. Zero corresponds to the Independence model ($c = 1$ on the whole unit square).
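The WAIC computation described above can be sketched directly from an array of pointwise posterior log densities; the sign convention below follows the text (lower is better, the Independence model gives zero). This is an illustrative helper, not the framework's code:

```python
import numpy as np

def waic(log_p):
    """WAIC from an (S, n) array of pointwise log densities
    log p(y_i | Theta_s), one row per posterior draw.
    Returns -(lppd - p_waic), so lower is better."""
    S, n = log_p.shape
    # lppd: log of the posterior-averaged density, summed over data points.
    lppd = np.sum(np.log(np.mean(np.exp(log_p), axis=0)))
    # Effective-parameter correction: per-point sample variance over draws.
    p_waic = np.sum(np.var(log_p, axis=0, ddof=1))
    return -(lppd - p_waic)
```

For the Independence copula the pointwise density is identically 1, so every entry of `log_p` is 0 and the WAIC is exactly 0, matching the reference point used in the text.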
Since the total number of combinations of 10 copula elements (Fig. 1, considering rotations) is large, an exhaustive search for the optimal model is not feasible. In our framework, we propose two model selection algorithms for constructing close-to-optimal copula mixtures: greedy and heuristic (see Supplemental Material for details). The greedy algorithm is universal and can be used with any other copula families without adjustment, while the heuristic algorithm is fine-tuned to the specific copula families used in this paper (Fig. 1). Both model selection algorithms were able to select the correct 1- and 2-component models on simulated data and to at least find a close approximation (in terms of WAIC) for more complex models (see the validation of model selection in Supplemental Material).

2.2 Entropy and mutual information
Our framework provides tools for efficient sampling from the conditional distribution $p(\mathbf{y} \mid x)$ and for calculating its probability density. Therefore, for each $x$ the entropy can be estimated using Monte Carlo (MC) integration:

$$H(\mathbf{y} \mid x) = -\int p(\mathbf{y} \mid x) \log p(\mathbf{y} \mid x)\, d\mathbf{y} \approx -\frac{1}{N} \sum_{i=1}^{N} \log p\big(\mathbf{y}^{(i)} \mid x\big). \quad (3)$$

The density $p(\mathbf{y} \mid x)$ factorizes into the conditional copula density and the marginal densities (1), hence for each $x$ the entropy also factorizes Jenison and Reale (2004) as $H(\mathbf{y} \mid x) = H_c(x) + \sum_i H(y_i \mid x)$, where $H_c(x) = -\int_{[0,1]^n} c(\mathbf{u} \mid x) \log c(\mathbf{u} \mid x)\, d\mathbf{u}$ is the copula entropy. The conditional entropy can then be integrated as $H(\mathbf{y} \mid \mathbf{x}) = \int p(x) H(\mathbf{y} \mid x)\, dx$, separating the entropy of the marginals from the copula entropy.
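A sketch of the MC entropy estimate (3) for a Gaussian copula, for which the exact copula entropy $\frac{1}{2}\log(1-\rho^2)$ is known in closed form and can serve as a check (illustrative code, not the framework's implementation):

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_entropy_mc(rho, n_samples=200_000, seed=0):
    """MC estimate of the copula entropy H_c = -E[log c(U, V)] for a
    bivariate Gaussian copula. Exact value: 0.5 * log(1 - rho^2)."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    # Normal scores z are exactly the Gaussian-copula samples mapped
    # through the inverse marginal CDF, so we can evaluate log c on them.
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples)
    z1, z2 = z[:, 0], z[:, 1]
    q = rho * rho * (z1 ** 2 + z2 ** 2) - 2 * rho * z1 * z2
    log_c = -q / (2 * (1 - rho ** 2)) - 0.5 * np.log(1 - rho ** 2)
    return -np.mean(log_c)
```

Note that the copula entropy is non-positive and equals minus the mutual information between the two variables, which is how the bivariate results in Sec. 4 are computed.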
Now, $I(\mathbf{y}; x) = I(T(\mathbf{y}); x)$ if the transformation $T$ is 1) a homeomorphism, and 2) independent of $x$ Kraskov et al. (2004). If the marginal statistics are independent of $x$, then the probability integral transform satisfies both requirements, and $I(\mathbf{y}; x) = I(\mathbf{u}; x)$. Then, in order to calculate the mutual information $I(\mathbf{y}; x)$, we must also rewrite it using only the conditional distribution $p(\mathbf{y} \mid x)$, which is modelled with our conditional CopulaGP model. This can be done as follows:
$$I(\mathbf{y}; x) = \int_X p(x) \int_Y p(\mathbf{y} \mid x) \log p(\mathbf{y} \mid x)\, d\mathbf{y}\, dx - \int_X p(x) \int_Y p(\mathbf{y} \mid x) \log \left[ \int_X p(\mathbf{y} \mid x')\, p(x')\, dx' \right] d\mathbf{y}\, dx. \quad (4)$$
The last term in (4) involves nested integration, which is computationally difficult and does not scale well with the dimensionality $n$. Therefore, we propose an alternative way of estimating $I(\mathbf{y}; x)$, which avoids double integration and allows us to use the marginals conditioned on $x$ ($F_i(y_i \mid x)$), providing a better estimate of $H(\mathbf{y} \mid x)$. We can use two separate copula models: one for estimating $p(\mathbf{y})$ and calculating $H(\mathbf{y})$, and another one for estimating $p(\mathbf{y} \mid x)$ and calculating $H(\mathbf{y} \mid x)$:

$$I(\mathbf{y}; x) = H(\mathbf{y}) - H(\mathbf{y} \mid x), \quad (5)$$

where both entropy terms are estimated with MC (3). Here we only integrate over the unit cube (and then over $x$ for the conditional term), whereas (4) required nested integration over the domains of both $\mathbf{y}$ and $x$.
The performance of both (4) and (5) critically depends on the approximation of the dependence structure, i.e. how well the parametric copula approximates the true copula probability density. If the joint distribution has a complex dependence structure, as we will see in the synthetic examples, then the mixture of parametric copulas may provide a poor approximation of $p(\mathbf{y})$ and overestimate $H(\mathbf{y})$, thereby overestimating $I(\mathbf{y}; x)$. The direct integration (4), on the other hand, typically underestimates the mutual information due to the imperfect approximation of $p(\mathbf{y} \mid x)$, but it is only applicable if the marginals can be considered independent of $x$.
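The two-model estimator (5), and the overestimation caused by parametric mismatch, can be illustrated on a toy conditional Gaussian copula with an $x$-dependent correlation $\rho(x)$. Here the "unconditional model" is deliberately a single Gaussian copula fit (an assumption made for illustration; when $\rho$ varies with $x$, the marginalized copula is not Gaussian, which is exactly the mismatch discussed above):

```python
import numpy as np

def mi_via_two_entropies(rho_of_x, n=100_000, seed=0):
    """Sketch of Eq. (5) for a toy model: y|x is a bivariate Gaussian copula
    with correlation rho(x) (rho_of_x must accept arrays), marginals
    independent of x, so only copula entropies matter.
    Gaussian copula entropy in closed form: 0.5 * log(1 - rho^2)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)
    rho = rho_of_x(x)
    # Sample normal scores from the conditional copula.
    z1 = rng.standard_normal(n)
    z2 = rho * z1 + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)
    # H(y|x): conditional copula entropy, averaged over x (closed form here).
    h_cond = np.mean(0.5 * np.log(1 - rho ** 2))
    # H(y): fit ONE unconditional Gaussian copula to the pooled samples
    # (normal-scores correlation) -- a parametric approximation.
    r = np.corrcoef(z1, z2)[0, 1]
    h_marg = 0.5 * np.log(1 - r ** 2)
    return h_marg - h_cond
```

With a constant $\rho(x)$ the two entropies cancel and the estimate is near zero, as it should be; with a strongly $x$-dependent $\rho(x)$ the estimate is large and positive, and its bias is governed by how well the single parametric copula approximates the pooled dependence.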
2.3 Copula vine constructions
High-dimensional copulas can be constructed from bivariate copulas by organizing them into hierarchical structures called copula vines Aas et al. (2009). In this paper, we focus on the canonical vine, or C-vine, which factorizes the high-dimensional copula probability density function as follows:

$$c(u_1, \dots, u_n) = \prod_{j=1}^{n-1} \prod_{i=1}^{n-j} c_{j,\, j+i \mid 1, \dots, j-1}\Big(F(u_j \mid u_1, \dots, u_{j-1}),\; F(u_{j+i} \mid u_1, \dots, u_{j-1})\Big), \quad (6)$$

where the $c_{j,\, j+i \mid 1, \dots, j-1}$ are bivariate copula densities and $F(\cdot \mid \cdot)$ is a conditional CDF. Note that all of the copulas in (6) can also be conditioned on $x$ via the CopulaGP model. We choose the first variable to be the one with the highest rank correlation with the rest (the largest sum of absolute values of pairwise Kendall's $\tau$), and condition all variables on the first one. We repeat the procedure until no variable is left. It was shown by Czado et al. (2012) that this ordering facilitates C-vine modeling.
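For $n = 3$ and Gaussian pair-copulas, the factorization (6) can be written out explicitly; in this special case it exactly reproduces the trivariate Gaussian copula when the second-tree parameter is the partial correlation, which makes a convenient sanity check (illustrative code, with our own function names):

```python
import numpy as np
from scipy.stats import norm

def gauss_pdf(u, v, r):
    """Bivariate Gaussian copula density with correlation r."""
    z1, z2 = norm.ppf(u), norm.ppf(v)
    q = r * r * (z1 ** 2 + z2 ** 2) - 2 * r * z1 * z2
    return np.exp(-q / (2 * (1 - r ** 2))) / np.sqrt(1 - r ** 2)

def gauss_h(u, v, r):
    """Conditional CDF ('h-function') F(u | v) for a Gaussian pair-copula."""
    return norm.cdf((norm.ppf(u) - r * norm.ppf(v)) / np.sqrt(1 - r ** 2))

def cvine3_pdf(u1, u2, u3, r12, r13, r23_1):
    """Eq. (6) for n = 3: tree 1 couples variable 1 with 2 and 3;
    tree 2 couples the h-transformed u2, u3 given u1."""
    return (gauss_pdf(u1, u2, r12) * gauss_pdf(u1, u3, r13)
            * gauss_pdf(gauss_h(u2, u1, r12), gauss_h(u3, u1, r13), r23_1))
```

In the CopulaGP vine, each pair-copula in (6) would instead be a conditional mixture model (2), but the hierarchical structure is the same.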
Code availability
Code will be made available on GitHub upon paper acceptance.
3 Validation on artificial data
We compare our method with other commonly used non-parametric algorithms for mutual information estimation: Kraskov–Stögbauer–Grassberger (KSG; Kraskov et al. (2004)), Bias-Improved KSG (BI-KSG; Gao et al. (2016)) and the Mutual Information Neural Estimator (MINE; Belghazi et al. (2018)).
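For reference, a minimal version of the KSG estimator (algorithm 1 of Kraskov et al. (2004)) can be sketched with k-d trees; this is our own illustrative implementation, not the benchmarked one:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=3):
    """KSG estimator, Chebyshev (max-norm) metric:
    I = psi(k) + psi(N) - <psi(n_x + 1) + psi(n_y + 1)>."""
    x = np.atleast_2d(x.T).T  # ensure (n, d) shape for 1-d inputs
    y = np.atleast_2d(y.T).T
    n = len(x)
    joint = np.hstack([x, y])
    # Distance to the k-th neighbor in the joint space (self excluded).
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]
    # Count marginal neighbors strictly inside the ball of radius eps.
    nx = cKDTree(x).query_ball_point(x, eps - 1e-12, p=np.inf,
                                     return_length=True) - 1
    ny = cKDTree(y).query_ball_point(y, eps - 1e-12, p=np.inf,
                                     return_length=True) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
```

For a bivariate Gaussian the true mutual information is $-\frac{1}{2}\log(1-\rho^2)$, which this estimator recovers well in low dimensions; the comparisons below probe its behavior as the dimensionality grows.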
First, we test these estimators on a dataset sampled from a multivariate Gaussian distribution with covariance $\Sigma_{ij} = \delta_{ij} + \rho\,(1 - \delta_{ij})$, where $\delta_{ij}$ is Kronecker's delta and $\rho$ is the pairwise correlation. Our algorithm selects a Gaussian copula on these data, which perfectly matches the true distribution. Therefore, CopulaGP measures both the entropy and the mutual information exactly (within integration tolerance, see Fig. 2A). The performance of the non-parametric methods on this dataset is lower. It was shown before that KSG and MINE both severely underestimate the mutual information for high-dimensional Gaussians with high correlation (e.g. see Fig. 1 in Belghazi et al. (2018)). The CopulaGP model (integrated) provides accurate estimates for highly correlated Gaussian distributions, at least up to 20 dimensions (see Supplemental Material).

Next, we test the CopulaGP performance on the Student T distribution, which can only be approximated by the CopulaGP model, but does not exactly match any of the parametric copula families in Table 1. We keep the correlation coefficient fixed and only change the number of degrees of freedom exponentially from 2 to 150 as a function of the conditioning variable. This makes the dataset particularly challenging, as all of the mutual information is encoded in the tail dependencies. The true mutual information of the Student T distribution was calculated analytically (see Eq. A.12 in Calsaverini and Vicente (2009b)) and integrated numerically according to (4) given the true conditional density.

Figure 2B shows that most of the methods underestimate the mutual information. CopulaGP (integrated) and MINE (with 100 hidden units) provide the closest estimates. The training curve for MINE with more hidden units (200, 500) showed signs of overfitting (abrupt changes in loss at certain permutations) and the resulting estimate was higher than the true value at higher dimensions. It was shown before that MINE provides inaccurate and inconsistent results on datasets with low mutual information Song and Ermon (2019). We also demonstrate estimation with a combination of two copula models for $p(\mathbf{y})$ and $p(\mathbf{y} \mid x)$: "CopulaGP estimated" (see Eq. 5). In lower dimensions, it captures less information than "CopulaGP integrated", but starts overestimating the true value at higher dimensions, when the inaccuracy of the density estimation for $p(\mathbf{y})$ builds up. This shows the limitation of the "estimated" method, which can either underestimate or overestimate the correct value due to parametric model mismatch, whereas the "integrated" method consistently underestimates the correct value. We conclude that CopulaGP and MINE demonstrate similar performance in this example, while KSG-based methods significantly underestimate the mutual information in higher dimensions.
Finally, we created another artificial dataset that is not related to any of the copula models used in our framework (Table 1). We achieved this by applying a homeomorphic transformation to a multivariate Gaussian distribution. Since the transformation is independent of the conditioning variable, it does not change the mutual information Kraskov et al. (2004). Therefore, we possess the true $I(\mathbf{y}; x)$, which is the same as for the first example in Figure 2A. Note, however, that the entropy of the samples changes, so there is no ground truth for the conditional entropy. We transform the Gaussian copula samples from the first example with a fixed non-linear map and again transform the marginals using the empirical probability integral transform. Both the conditional and unconditional densities here do not match any of the parametric copulas from Table 1. As a result, "CopulaGP estimated" overestimated the correct value, while "CopulaGP integrated" underestimated it, similarly to the MINE estimator with 100 hidden units.
Figure 2 demonstrates that the performance of the parametric CopulaGP model critically depends on the match between the true probability density and the best mixture of parametric copula elements. When the parametric distribution matches the true distribution (Fig. 2A), our CopulaGP framework predictably outperforms all non-parametric methods. Nonetheless, even when the exact reconstruction of the density is not possible (Figs. 2B,C), the mixtures of copula models (2) are still able to model the changes in tail dependencies, at least qualitatively. As a result, our method performs similarly to the neural-network-based method (MINE) and still outperforms KSG-like methods.
4 Validation on real data
We investigate the dependencies observed in neuronal and behavioral data and showcase possible applications of the CopulaGP framework. We used two-photon calcium imaging data of neuronal population activity in the primary visual cortex of mice engaged in a visuospatial navigation task in virtual reality (data from Henschke et al. (2020)). Briefly, the mice learned to run through a virtual corridor with vertical gratings on the walls (Fig. 3A, 0–120 cm) until they reached a reward zone (Fig. 3A, 120–140 cm), where they could get a reward by licking a reward spout. We conditioned our CopulaGP model on the position $x$ in the virtual environment and studied the joint distribution of the behavioral and neuronal variables (109 neurons). Figure 3B shows a part of the dataset (trials 25–35 out of 130). The traces demonstrate changes in the position of the mouse as well as the activity of 3 selected neurons and the licking rate. These variables have different patterns of activity depending on the position and different signal-to-noise ratios. Both differences are reflected in the marginal statistics, which are shown on the right with density plots of equal area.
Constructing interpretable bivariate models
We first studied bivariate relationships between neurons. In order to do this, we transformed the raw signals (shown in Fig. 3B) with a probability integral transform. We observed strong non-trivial changes in the dependence structure subject to the position in the virtual reality and the related visual information (Fig. 3C). Such stimulus-related changes in the joint variability of two neuronal signals are commonly described as noise correlations. The CopulaGP model provides a more detailed description of the joint probability that goes beyond linear correlation analysis. In this example, the dependence structure is best characterized by a combination of a Gaussian and a Clayton copula (rotated by 90°). The density plots in Fig. 3C demonstrate the match between the true density (outlines) and the copula model density (blue shades) for each part of the task. We measure the accuracy of the density estimation with the proportion of variance explained, which shows how much of the variance of one variable can be predicted given the other variable (see Eq. (1) in Supplemental Material). The average value across positions is provided in the upper right corner of the density plots.
Next, we show that our model can be applied not only to the neuronal data, but also to any of the behavioral variables. Fig. 3D shows the dependence structure between one of the neurons and the licking rate. The best selected mixture model here is Frank + Clayton 0° + Gumbel 270°, which again provides an accurate estimate of the conditional dependence between the variables. Therefore, Figs. 3C,D demonstrate that our CopulaGP model provides both an accurate fit for the probability distribution and an interpretable visualization of the dependence structure.
Figs. 3E,F show the absolute value of the conditional copula entropy, which for a bivariate copula is equivalent to the mutual information between the two variables. For both examples, the mutual information peaks in the reward zone. The bivariate CopulaGP models were agnostic of the reward mechanism in this task, yet they revealed the position of the reward zone as an anomaly in the mutual information.
Measuring information content in a large neuronal population
Finally, we constructed a C-vine describing the distribution of all neuronal and behavioral variables (114 in total) and measured the conditional entropy for all variables in the dataset. The conditional entropy in Fig. 3G peaks in the reward zone (similarly to Figs. 3E,F) and also at the beginning of the trial. Now the model is informed about the velocity of the animal and the reward events, so the first peak can be attributed to the acceleration of the mouse at the start of a new trial Khan and Hofer (2018); Pakan et al. (2018b).
While constructing the C-vine, we ordered the variables according to their pairwise rank correlations (see Sec. 2.3). We considered subsets of the first variables in this ordering and measured the mutual information with the position for each subset. We compared the performance of our CopulaGP method on these subsets against KSG and MINE. Fig. 3H shows that all 3 methods provide similar results on subsets of up to 10 variables, yet in higher dimensions both MINE and KSG report smaller mutual information compared to our CopulaGP method, which agrees with the results obtained on the synthetic data (Fig. 2). The true values of the mutual information are unknown, yet we expect the integrated CopulaGP (solid line) to underestimate the true value due to parametric model mismatch. The CopulaGP "estimated" (dashed line) almost perfectly matches the "integrated" result, which suggests that the model was able to accurately approximate both $p(\mathbf{y})$ and $p(\mathbf{y} \mid x)$ and, as a result, the mutual information. These results demonstrate the superior performance of our CopulaGP model on high-dimensional neuronal data.
5 Discussion
We have developed a CopulaGP framework for modeling conditional multivariate joint distributions. The method is based on linear mixtures of parametric copulas, which provide flexibility for estimating complex dependencies. We approximate the conditional dependencies of the model parameters with Gaussian Processes, which allows us to implement a Bayesian model selection procedure. The selected models combine accuracy in density estimation with the interpretability of parametric copula models. Despite the limitations of parametric models, our framework demonstrated good results in mutual information estimation on synthetically generated data, performing similarly to state-of-the-art non-parametric information estimators. The framework is also well suited for describing neuronal and behavioral data. Possible applications include, but are not limited to, studying noise correlations, behavioral modulation and neuronal population statistics. We demonstrated that the model scales well at least up to 109 variables, while theoretically, the parameter inference scales with the number of samples and the (effective) number of variables (see Suppl. Mat.). In summary, we demonstrated that the CopulaGP approach can make stochastic relationships explicit and generate accurate and interpretable models of dependencies between neuronal responses, sensory stimuli, and behavior. Future work will focus on implementing model selection for the vine structure and improving the scalability of the estimation algorithm.
Broader Impact
We envision a wide range of impacts resulting from the use of the CopulaGP framework in computational neuroscience as well as in machine learning research, information technology and economics.
In computational neuroscience, this approach has the potential to reveal high dimensional contextdependent relationships between neuronal activity and behavioral variables. Therefore, our model can provide novel insights into the principles of neural circuitlevel computation, both under physiological conditions, and during the aberrant network function observed in many neuropsychiatric disorders Uhlhaas and Singer (2012); Bassett et al. (2018); Braun et al. (2018). Furthermore, understanding contextdependent processing in the brain might suggest ways to mimic the same principles in artificial neural networks.
The proposed framework also has some potentially far-reaching applications, not only in computational neuroscience but also in information technology and economics. Current problems of information flow in computer networks, as well as high-speed trading in micro- and macro-markets, can be phrased as non-stationary relationship problems that require proper stochastic representations, which, in turn, can benefit economic success. On the other hand, there is a possible risk of a one-sided adoption of the method by malicious actors benefiting from market manipulation, which may give them an unfair advantage and thus reduce market transparency and cause economic damage.
We encourage researchers adapting our method to understand the limitations of parametric copula models. The choice of the bivariate copula models and the vine structures introduces our beliefs about a particular conditional dependency or independency into the model. The misuse of parametric copula models has had a negative impact on the insurance industry in the past Donnelly and Embrechts (2010). Thus, these limitations must be taken into consideration when designing new applications in the future.
One potential negative societal impact related to the use of our framework is a relatively large energy footprint from training the models on graphics processing units (GPUs). The models used for creating the figures in this paper took about 2 weeks of computational time on 8 GPUs. The mitigation strategy should be focused on careful planning of the simulations, creating checkpoints and backups to prevent data loss, reuse of the trained models and other practices that reduce the energy-consuming computation.
We thank the GENIE Program and the Janelia Research Campus, specifically V. Jayaraman, R. Kerr, D. Kim, L. Looger, and K. Svoboda, for making GCaMP6 available. This work was funded by the Engineering and Physical Sciences Research Council (grant EP/S005692/1 to A.O.), the Wellcome Trust and the Royal Society (Sir Henry Dale fellowship to N.R.), the Marie Curie Actions of the European Union’s FP7 program (MCCIG 631770 to N.R.), the Simons Initiative for the Developing Brain (to N.R.), the Precision Medicine Doctoral Training Programme (MRC, the University of Edinburgh (to T.A.)).
References
 Jun et al. [2017] James J Jun, Nicholas A Steinmetz, Joshua H Siegle, Daniel J Denman, Marius Bauza, Brian Barbarits, Albert K Lee, Costas A Anastassiou, Alexandru Andrei, Çağatay Aydın, et al. Fully integrated silicon probes for high-density recording of neural activity. Nature, 551(7679):232–236, 2017.
 Helmchen [2009] Fritjof Helmchen. Two-photon functional imaging of neuronal activity. CRC Press, 2009.
 Dombeck et al. [2007] Daniel A Dombeck, Anton N Khabbaz, Forrest Collman, Thomas L Adelman, and David W Tank. Imaging large-scale neural activity with cellular resolution in awake, mobile mice. Neuron, 56(1):43–57, 2007.
 Stringer et al. [2019] Carsen Stringer, Marius Pachitariu, Nicholas Steinmetz, Charu Bai Reddy, Matteo Carandini, and Kenneth D Harris. Spontaneous behaviors drive multidimensional, brainwide activity. Science, 364(6437):eaav7893, 2019.
 Pakan et al. [2018a] Janelle MP Pakan, Stephen P Currie, Lukas Fischer, and Nathalie L Rochefort. The impact of visual cues, reward, and motor feedback on the representation of behaviorally relevant spatial locations in primary visual cortex. Cell reports, 24(10):2521–2528, 2018a.
 Pakan et al. [2018b] Janelle MP Pakan, Valerio Francioni, and Nathalie L Rochefort. Action and learning shape the activity of neuronal circuits in the visual cortex. Current opinion in neurobiology, 52:88–97, 2018b.
 Staude et al. [2010] Benjamin Staude, Stefan Rotter, and Sonja Grün. CuBIC: cumulant based inference of higher-order correlations in massively parallel spike trains. Journal of computational neuroscience, 29(1-2):327–350, 2010.
 Brown et al. [2004] Emery N Brown, Robert E Kass, and Partha P Mitra. Multiple neural spike train data analysis: state-of-the-art and future challenges. Nature neuroscience, 7(5):456–461, 2004.
 Saxena and Cunningham [2019] Shreya Saxena and John P. Cunningham. Towards the neural population doctrine, 2019. ISSN 18736882.
 Stevenson and Kording [2011] Ian H. Stevenson and Konrad P. Kording. How advances in neural recording affect data analysis. Nature Neuroscience, 14(2):139–142, feb 2011. ISSN 10976256. doi: 10.1038/nn.2731. URL http://www.nature.com/articles/nn.2731.
 Ince et al. [2010] Robin AA Ince, Riccardo Senatore, Ehsan Arabzadeh, Fernando Montani, Mathew E Diamond, and Stefano Panzeri. Information-theoretic methods for studying population codes. Neural Networks, 23(6):713–727, 2010.
 Shamir and Sompolinsky [2004] Maoz Shamir and Haim Sompolinsky. Nonlinear population codes. Neural computation, 16(6):1105–1136, 2004.
 Kohn et al. [2016] Adam Kohn, Ruben CoenCagli, Ingmar Kanitscheider, and Alexandre Pouget. Correlations and neuronal population information. Annual review of neuroscience, 39:237–256, 2016.
 Shimazaki et al. [2012] Hideaki Shimazaki, Shun-ichi Amari, Emery N Brown, and Sonja Grün. State-space analysis of time-varying higher-order spike correlation for multiple neural spike train data. PLoS computational biology, 8(3), 2012.
 Grienberger et al. [2015] Christine Grienberger, Xiaowei Chen, and Arthur Konnerth. Dendritic function in vivo. Trends in neurosciences, 38(1):45–54, 2015.
 Joe [2014] Harry Joe. Dependence modeling with copulas. CRC press, 2014.
 Jenison and Reale [2004] Rick L. Jenison and Richard A. Reale. The shape of neural dependence. Neural Computation, 16(4):665–672, 2004. doi: 10.1162/089976604322860659. URL https://doi.org/10.1162/089976604322860659.
 Calsaverini and Vicente [2009a] Rafael S Calsaverini and Renato Vicente. An information-theoretic approach to statistical dependence: Copula information. EPL (Europhysics Letters), 88(6):68003, 2009a.
 Aas et al. [2009] Kjersti Aas, Claudia Czado, Arnoldo Frigessi, and Henrik Bakken. Pair-copula constructions of multiple dependence. Insurance: Mathematics and economics, 44(2):182–198, 2009.
 Czado [2010] Claudia Czado. Pair-copula constructions of multivariate copulas. In Copula theory and its applications, pages 93–109. Springer, 2010.
 Berkes et al. [2009] Pietro Berkes, Frank Wood, and Jonathan W Pillow. Characterizing neural dependencies with copula models. In Advances in neural information processing systems, pages 129–136, 2009.
 Onken et al. [2009] Arno Onken, Steffen Grünewälder, Matthias HJ Munk, and Klaus Obermayer. Analyzing short-term noise dependencies of spike counts in macaque prefrontal cortex using copulas and the flashlight transformation. PLoS computational biology, 5(11), 2009.

 Hu et al. [2015] Meng Hu, Kelsey L Clark, Xiajing Gong, Behrad Noudoost, Mingyao Li, Tirin Moore, and Hualou Liang. Copula regression analysis of simultaneously recorded frontal eye field and inferotemporal spiking activity during object-based working memory. Journal of Neuroscience, 35(23):8745–8757, 2015.
 Shahbaba et al. [2014] Babak Shahbaba, Bo Zhou, Shiwei Lan, Hernando Ombao, David Moorman, and Sam Behseta. A semiparametric bayesian model for detecting synchrony among multiple neurons. Neural Computation, 26(9):2025–2051, 2014. doi: 10.1162/NECO_a_00631. URL https://doi.org/10.1162/NECO_a_00631. PMID: 24922500.
 Safaai [2019] H Safaai, A Wang, S Panzeri, and CD Harvey. Characterizing information processing of parietal cortex projections using vine copulas. In Bernstein Conference 2019, 2019. doi: 10.12751/nncn.bc2019.0067. URL https://abstracts.gnode.org/abstracts/f80ac63f88fc42039c2ba279bb9e201a.
 Onken and Panzeri [2016] Arno Onken and Stefano Panzeri. Mixed vine copulas as joint models of spike counts and local field potentials. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, page 1333–1341, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.
 Doiron et al. [2016] Brent Doiron, Ashok Litwin-Kumar, Robert Rosenbaum, Gabriel K Ocker, and Krešimir Josić. The mechanics of state-dependent neural correlations. Nature neuroscience, 19(3):383, 2016.
 Lopez-Paz et al. [2013] David Lopez-Paz, Jose Miguel Hernández-Lobato, and Zoubin Ghahramani. Gaussian process vine copulas for multivariate dependence. In International Conference on Machine Learning, pages 10–18, 2013.
 Hernández-Lobato et al. [2013] José Miguel Hernández-Lobato, James R Lloyd, and Daniel Hernández-Lobato. Gaussian process conditional copulas with applications to financial time series. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1736–1744. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/5084gaussianprocessconditionalcopulaswithapplicationstofinancialtimeseries.pdf.
 Sklar [1959] Abe Sklar. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statis. Univ. Paris, 8:229–231, 1959.

 Joe [2005] Harry Joe. Asymptotic efficiency of the two-stage estimation method for copula-based models. Journal of Multivariate Analysis, 94(2):401–419, 2005. ISSN 0047-259X. doi: https://doi.org/10.1016/j.jmva.2004.06.003. URL http://www.sciencedirect.com/science/article/pii/S0047259X04001289.
 O’Brien et al. [2016] Travis A. O’Brien, Karthik Kashinath, Nicholas R. Cavanaugh, William D. Collins, and John P. O’Brien. A fast and objective multidimensional kernel density estimation method: fastKDE. Computational Statistics & Data Analysis, 101:148–160, 2016. ISSN 0167-9473. doi: https://doi.org/10.1016/j.csda.2016.02.014. URL http://www.sciencedirect.com/science/article/pii/S0167947316300408.
 Nelsen [2007] Roger B Nelsen. An introduction to copulas. Springer Science & Business Media, 2007.
 Wilson and Nickisch [2015] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775–1784, 2015.
 Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems, 2017.
 Gardner et al. [2018] Jacob Gardner, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and Andrew G Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, pages 7576–7586, 2018.
 Watanabe [2013] Sumio Watanabe. A widely applicable Bayesian information criterion. Journal of Machine Learning Research, 14(Mar):867–897, 2013.
 Gelman et al. [2014] Andrew Gelman, Jessica Hwang, and Aki Vehtari. Understanding predictive information criteria for bayesian models. Statistics and computing, 24(6):997–1016, 2014.
 Kraskov et al. [2004] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Phys. Rev. E, 69:066138, Jun 2004. doi: 10.1103/PhysRevE.69.066138. URL https://link.aps.org/doi/10.1103/PhysRevE.69.066138.
 Czado et al. [2012] Claudia Czado, Ulf Schepsmeier, and Aleksey Min. Maximum likelihood estimation of mixed C-vines with application to exchange rates. Statistical Modelling, 12(3):229–255, 2012.
 Gao et al. [2016] Weihao Gao, Sewoong Oh, and Pramod Viswanath. Demystifying Fixed k-Nearest Neighbor Information Estimators, 2016.
 Belghazi et al. [2018] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. MINE: Mutual Information Neural Estimation, 2018.
 Calsaverini and Vicente [2009b] R. S. Calsaverini and R. Vicente. An informationtheoretic approach to statistical dependence: Copula information. EPL (Europhysics Letters), 88(6):68003, Dec 2009b. ISSN 12864854. doi: 10.1209/02955075/88/68003. URL http://dx.doi.org/10.1209/02955075/88/68003.
 Song and Ermon [2019] Jiaming Song and Stefano Ermon. Understanding the limitations of variational mutual information estimators, 2019.
 Henschke et al. [2020] Julia U Henschke, Evelyn Dylda, Danai Katsanevaki, Nathalie Dupuy, Stephen P Currie, Theoklitos Amvrosiadis, Janelle MP Pakan, and Nathalie L Rochefort. Reward association enhances stimulus-specific representations in primary visual cortex. Current Biology, 2020.
 Khan and Hofer [2018] Adil G Khan and Sonja B Hofer. Contextual signals in visual cortex. Current opinion in neurobiology, 52:131–138, 2018.
 Uhlhaas and Singer [2012] Peter J Uhlhaas and Wolf Singer. Neuronal dynamics and neuropsychiatric disorders: toward a translational paradigm for dysfunctional large-scale networks. Neuron, 75(6):963–980, 2012.
 Bassett et al. [2018] Danielle S Bassett, Cedric Huchuan Xia, and Theodore D Satterthwaite. Understanding the emergence of neuropsychiatric disorders with network neuroscience. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 3(9):742–753, 2018.
 Braun et al. [2018] Urs Braun, Axel Schaefer, Richard F Betzel, Heike Tost, Andreas MeyerLindenberg, and Danielle S Bassett. From maps to multidimensional network mechanisms of mental disorders. Neuron, 97(1):14–31, 2018.
 Donnelly and Embrechts [2010] Catherine Donnelly and Paul Embrechts. The devil is in the tails: Actuarial mathematics and the subprime mortgage crisis. ASTIN Bulletin, 40(1):1–33, 2010. doi: 10.2143/AST.40.1.2049222.
S1 Methods
S1.1 Goodness-of-fit
We measure the accuracy of the density estimation with the proportion of variance explained, $R^2$. We compare the empirical conditional CDF $\tilde{F}(y|x)$ vs. the estimated conditional CDF $\hat{F}(y|x)$ and calculate:
$R^2 = 1 - \sum_i \big(\tilde{F}(y_i|x) - \hat{F}(y_i|x)\big)^2 \Big/ \sum_i \big(\tilde{F}(y_i|x) - \bar{F}\big)^2$ (S1)
where $R^2$ quantifies the portion of the total variance of the empirical conditional CDF that our copula model can explain given $x$, and $\bar{F}$ is the mean of the empirical conditional CDF. The sums were calculated over the observed samples $y_i$.
Next, we select all of the samples from a certain interval of the conditioning variable of the task, matching one of those shown in Figure 3 in the paper. We split these samples into 20 equally sized bins over the copula variable. For each bin $k$, we calculate $R^2_k$ with (S1). We evaluate it using a copula model from the center of mass of the considered interval of the conditioning variable. We use the average measure:
$\bar{R}^2 = \frac{1}{20}\sum_{k=1}^{20} R^2_k$ (S2)
to characterize the goodness of fit for a bivariate copula model. Since the copula variable is uniformly distributed on [0,1], the probabilities for each bin are equal to 1/20, and the resulting measure is just an average over all bins. The results were largely insensitive to the number of bins (e.g. 20 vs. 100).
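The binned goodness-of-fit measure can be sketched in a few lines. This is a minimal illustration assuming precomputed arrays of empirical and model conditional CDF values; the helper names (`r2_score`, `binned_r2`) are ours, not the framework's API:

```python
import numpy as np

def r2_score(f_emp, f_model):
    """Proportion of variance of the empirical CDF explained by the model CDF."""
    ss_res = np.sum((f_emp - f_model) ** 2)
    ss_tot = np.sum((f_emp - np.mean(f_emp)) ** 2)
    return 1.0 - ss_res / ss_tot

def binned_r2(u, f_emp, f_model, n_bins=20):
    """Average R^2 over equally sized bins of the copula variable u in [0, 1]."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    scores = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (u >= lo) & (u < hi)
        if mask.sum() > 1:                      # need variance within the bin
            scores.append(r2_score(f_emp[mask], f_model[mask]))
    return np.mean(scores)
```

A perfect model gives an average $R^2$ of exactly 1, and any mismatch between the two CDFs lowers the score.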
S1.2 Variational inference
Since our model has latent variables with GP priors and an intractable posterior distribution, direct Type-II maximum likelihood estimation is not possible and approximate inference is needed. We used stochastic variational inference (SVI) with the evidence lower bound hensman2015scalable:
$\mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(f_i)}\left[\log p(y_i|f_i)\right] - \mathrm{KL}\left[q(\mathbf{u})\,\|\,p(\mathbf{u})\right]$ (S3)
implemented as VariationalELBO in GPyTorch gardner2018gpytorchS. Here $N$ is the number of data samples, $\mathbf{u}$ are the GP values at the inducing points, $q(\mathbf{u})$ is the variational distribution, and $p(\mathbf{u})$ is the GP prior at the inducing points.
Following the wilson2015kernelS approach (KISS-GP), we then constrain the inducing points to a regular grid, which imposes a deterministic relationship between the GP values at the grid points and at the data points. As a result, we only need to infer the variational distribution $q(\mathbf{u})$, but not the positions of the inducing points. The number of grid points is one of the model hyperparameters: grid_size.
Equation S3 enables joint optimization of the GP hyperparameters (constant mean and two kernel parameters: scale and bandwidth) and the parameters of the variational distribution (mean and covariance at the inducing points) hensman2015scalable. By studying the convergence on synthetic data, we empirically found that the best results are achieved when the learning rate for the GP hyperparameters (base_lr) is much greater than the learning rate for the variational distribution parameters (var_lr, see Table S1).
Priors
For both the neuronal and the synthetic data, we use a standard normal prior for the variational distribution. Note that the parametrization for mixture models was chosen such that this zero-mean prior corresponds to a priori equal mixing concentrations. In our experiments with the simulated and real neuronal data, we observed that the GP hyperparameter optimisation problem often had two minima (which is a common situation, see Figure 5.5 on page 116 in williams2006gaussian). One of those corresponds to a short kernel lengthscale and low noise, which we interpret as overfitting. To prevent overfitting, we used a prior on the RBF kernel lengthscale parameter that allows the optimizer to approach the minima from the region of larger lengthscales, ending up in the minimum with a larger lengthscale.
Optimization
We use the Adam optimizer with two learning rates: one for the GP hyperparameters (base_lr) and one for the variational distribution parameters (var_lr). We monitor the loss (averaged over 50 steps) and its change over the last 50 steps: loss_change = mean(loss[-100:-50]) - mean(loss[-50:]). If the change becomes smaller than check_waic, we evaluate the model WAIC and check whether it is lower than -waic_tol. If it is higher, we consider that either the variables are independent, or the model does not match the data. Either way, this indicates that further optimisation is counterproductive. If the WAIC is lower than -waic_tol, we proceed with the optimisation until the change of loss over 50 steps becomes smaller than loss_tol (see Table S1).
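The stopping rule can be sketched as follows. This is a simplified illustration of the checks described above; the helper names are ours, and the numeric default for loss_tol is a placeholder, not the value from Table S1:

```python
import numpy as np

def loss_change(loss_history):
    """Change of the smoothed loss over the last 50 steps:
    mean over steps [-100:-50] minus mean over steps [-50:]."""
    loss = np.asarray(loss_history)
    return np.mean(loss[-100:-50]) - np.mean(loss[-50:])

def should_check_waic(loss_history, check_waic=0.005):
    """Trigger the intermediate WAIC check once the loss starts to plateau."""
    return len(loss_history) >= 100 and loss_change(loss_history) < check_waic

def converged(loss_history, loss_tol=1e-4):  # loss_tol value is a placeholder
    """Declare convergence once the loss change falls below loss_tol."""
    return len(loss_history) >= 100 and loss_change(loss_history) < loss_tol
```

A loss that is still steeply decreasing keeps the optimisation running, while a plateau first triggers the WAIC check and then, if the model is retained, final convergence.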
Effective learning rates for different families
The coefficients in the GPLink functions for different copula families are also a part of model hyperparameters. The choice of these coefficients affects the gradients of the log probability function. Since GPLink functions are nonlinear, they affect the gradients in various parameter ranges to a different extent. This results in variable convergence rates depending on the true copula parameters.
To address the problem of setting these hyperparameters, we created tests on synthetic data with different copula parameters. Using these tests, we manually adjusted the hyperparameters such that the GP parameter inference converged in around 1000-2000 iterations for every copula family and parameter range. We also multiplied the GP values corresponding to the mixing concentrations by a small constant factor, which effectively slows down the learning of the mixing concentrations compared to the copula parameters and also facilitates the convergence.
Hyperparameter selection
The hyperparameters of our model (Table S1) were manually tuned, often considering the trade-off between model accuracy and evaluation time. A more systematic hyperparameter search might yield improved results and better determine the limits of model accuracy.
Hyperparameter  Value  Description
base_lr    Learning rate for GP hyperparameters
var_lr    Learning rate for the variational distribution
grid_size  128  Number of inducing points for KISS-GP
waic_tol  0.005  Tolerance for WAIC estimation
loss_tol    Loss tolerance that indicates convergence
check_waic  0.005  Loss tolerance at which we check the WAIC
Table S1: Hyperparameters of the model; the GPLink parameters are listed in Table 1.
S1.3 Bayesian model selection
In model selection, we aim to construct a model with the lowest possible WAIC. Since our copula probability densities are continuous, their values can exceed 1 and the resulting WAIC is typically negative. Zero WAIC corresponds to the Independence model (constant density equal to 1 on the whole unit square). We also set up a tolerance (waic_tol), and models with WAIC > -waic_tol are considered indistinguishable from the Independence model.
Since the total number of combinations of 10 copula elements (Fig. 1) is large, an exhaustive search for the optimal model is not feasible. In our framework, we propose two model selection algorithms for constructing close-to-optimal copula mixtures: greedy and heuristic.
S1.4 Model selection algorithms
The greedy algorithm (Algorithm 1) starts by comparing all possible single-copula models (from Table 1, in all rotations) and selecting the model with the lowest WAIC. After that, we add one more copula (from another family or in another rotation) to the first selected copula, keeping the element that yields the lowest WAIC of the mixture. We repeat the process until the WAIC stops decreasing. After the best model is selected, we remove the inessential elements using the reduce(.) function. This function removes those elements whose average concentration is negligible everywhere on the range of the conditioning variable. This step is added to improve the interpretability of the models and the computation time for entropy estimation (at a small accuracy cost) and can, in principle, be omitted.
The greedy algorithm can be improved by adding model reduction after each attempt to add an element. In this case, the number of elements can increase and decrease multiple times during the model selection process, which therefore must also be terminated if the algorithm returns to a previously observed solution. Even though this complicates the algorithm, it reduces the maximal execution time (observed on the real neuronal data) from 90 minutes down to 40 minutes.
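The greedy search can be sketched as follows. Here `waic(mixture)` is an assumed scoring callable that would, in the real framework, fit a Copula-GP mixture and return its WAIC (lower is better); in the usage example we mock it with a toy score, so the names and candidate list are illustrative:

```python
# Sketch of the greedy mixture construction (Algorithm 1), without the
# reduce(.) step. `waic` is an assumed callable: mixture -> WAIC score.

CANDIDATES = ['Gaussian', 'Frank',
              'Clayton 0', 'Clayton 90', 'Clayton 180', 'Clayton 270',
              'Gumbel 0', 'Gumbel 90', 'Gumbel 180', 'Gumbel 270']

def greedy_select(waic, candidates=CANDIDATES):
    """Add one element at a time, keeping it only if the WAIC decreases."""
    mixture, best = [], float('inf')
    while True:
        trials = [mixture + [c] for c in candidates if c not in mixture]
        if not trials:
            return mixture, best
        trial = min(trials, key=waic)       # best extension of the mixture
        score = waic(trial)
        if score < best:                    # keep only if WAIC improved
            mixture, best = trial, score
        else:
            return mixture, best
```

With a mock scorer that rewards overlapping with a known ground-truth set and mildly penalizes mixture size, the search recovers exactly the true elements and then stops.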
The heuristic algorithm focuses on the tail dependencies (Algorithm 2). First, we try a single Gaussian copula. If the variables are not independent, we next compare 2 combinations of 6 elements, organized as follows: an Independence copula together with a Gaussian copula and either 4 Clayton or 4 Gumbel copulas in all 4 rotations (0, 90, 180, 270). We select the combination with the lowest WAIC. After that, we take the remaining Clayton/Gumbel copulas one by one and attempt to switch the copula type (Clayton to Gumbel or vice versa). If this switch decreases the WAIC, we keep the better copula type for that rotation and proceed to the next element.
Here we make the assumption that, because Clayton and Gumbel copulas have most of their probability density concentrated in one corner of the unit square (the heavy tail), we can choose the best model for each of the 4 corners independently. When the best combination of Clayton/Gumbel copulas is selected, we can (optionally) reduce the model.
Up to this point, the heuristic algorithm has not used the Frank copula. We next attempt to substitute the Gaussian copula with a Frank copula (if it is still a part of the reduced mixture, see lines 16-19 in Alg. 2). Sometimes, a Gaussian copula can be mistakenly modeled as a Clayton & Gumbel or two Gumbel copulas. So, as a final step (lines 20-31, Alg. 2), we select all pairwise combinations of the remaining elements and attempt to substitute each of the pairs with a Gaussian copula, selecting the model with the lowest WAIC. Despite the large number of steps in this algorithm, the selection process takes only up to 25 minutes (in case all elements in all rotations are required).
The procedure was designed after observing the model selection process on a variety of synthetic and real neuronal datasets.
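The corner-selection step of the heuristic search can be sketched similarly. Again, `waic` is an assumed scorer and the function and element names are illustrative, not the framework's API:

```python
# Sketch of the tail-dependence step of the heuristic search (Algorithm 2):
# each corner of the unit square gets either a Clayton or a Gumbel copula,
# chosen independently per rotation.

ROTATIONS = (0, 90, 180, 270)

def choose_corners(waic, base=('Independence', 'Gaussian')):
    """Pick Clayton vs. Gumbel independently for each corner (rotation)."""
    clayton = list(base) + [f'Clayton {r}' for r in ROTATIONS]
    gumbel = list(base) + [f'Gumbel {r}' for r in ROTATIONS]
    mixture = min([clayton, gumbel], key=waic)   # best of the 2 combinations
    for i in range(len(mixture)):
        family, _, rotation = mixture[i].partition(' ')
        if family not in ('Clayton', 'Gumbel'):
            continue                              # keep Independence/Gaussian
        other = 'Gumbel' if family == 'Clayton' else 'Clayton'
        trial = mixture[:i] + [f'{other} {rotation}'] + mixture[i + 1:]
        if waic(trial) < waic(mixture):           # switch type if WAIC improves
            mixture = trial
    return mixture
```

With a mock scorer that rewards matching a known ground-truth set of elements, the per-corner switching recovers mixed Clayton/Gumbel corners.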
S1.5 Vine copulas
Vine models provide a way to factorize a high-dimensional copula probability density into a hierarchical set of bivariate copulas aas2009pairS. There are many possible decompositions based on different assumptions about conditional independence of specific elements in a model, which can be classified using graphical models called regular vines bedford2001probability,bedford2002vines. A regular vine can be represented using a hierarchical set of trees, where each node corresponds to a conditional distribution function and each edge corresponds to a bivariate copula. The copula models from the lower trees are used to obtain new conditional distributions (new nodes) with additional conditional dependencies for the higher trees; e.g. a conditional CDF of a copula and a marginal conditional distribution from the 1st tree provide a new conditional distribution for the 2nd tree. Therefore, bivariate copula parameters are estimated sequentially, starting from the lowest tree and moving up the hierarchy. The total number of edges in all trees (= the number of bivariate copula models) for an $n$-dimensional regular vine equals $n(n-1)/2$.
Regular vines often assume that the conditional copulas themselves are independent of their conditioning variables, but depend on them indirectly through the conditional distribution functions (nodes) acar2012beyond. This is known as the simplifying assumption for vine copulas haff2010simplified, which, if applicable, allows one to escape the curse of dimensionality in high-dimensional copula construction.
In this study, we focus on the canonical vine or C-vine, which has a unique node in each tree, connected to all of the edges in that tree. For illustration, see, for example, Figure 2 in aas2009pair. The C-vine was shown to be a good choice for neuronal datasets onken2016nipsS, as they often include some proxy of neuronal population activity as an outstanding variable, strongly correlated with the rest. This variable provides a natural choice for the first conditioning variable in the lowest tree. In the neuronal datasets from henschke2020rewardS, this outstanding variable is the global fluorescence signal in the imaged field of view (global neuropil).
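The C-vine structure and the $n(n-1)/2$ edge count can be made concrete with a short sketch (illustrative only; the real framework attaches a fitted bivariate Copula-GP model to each edge):

```python
# In a C-vine, tree t has a single root: variable t is paired with every
# remaining variable, conditioned on the roots of all lower trees.

def cvine_edges(n):
    """Enumerate (pair, conditioning set) for every bivariate copula in a C-vine."""
    edges = []
    for t in range(1, n):                 # tree index; variable t is the root
        for j in range(t + 1, n + 1):     # remaining variables in this tree
            edges.append(((t, j), tuple(range(1, t))))
    return edges
```

For example, with four variables the first tree contains the unconditional pairs (1,2), (1,3), (1,4), and the last edge is the pair (3,4) conditioned on {1, 2}, for six edges in total.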
To construct a C-vine describing the neuronal and behavioural data from henschke2020rewardS, we used a heuristic element ordering based on the sum of the absolute values of Kendall's $\tau$ between a given element and all of the other elements. It was shown by czado2012maximumS that this ordering facilitates C-vine modeling. For all of the animals and most of the recordings (14 out of 16), including the one used in Figure 3, the first variable after such ordering was the global neuropil activity. This again confirms that a C-vine with the global neuropil activity as the first variable is an appropriate model for the dependencies in neuronal datasets.
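The ordering heuristic can be sketched as follows, with a naive $O(N^2)$ Kendall's $\tau$ (a production implementation would use a library estimator; `cvine_order` is our illustrative name):

```python
import numpy as np

def kendall_tau(x, y):
    """Naive O(N^2) Kendall's tau between two 1-D arrays (no tie correction)."""
    n = len(x)
    concordance = 0.0
    for i in range(n):
        concordance += np.sum(np.sign(x[i] - x[i + 1:]) * np.sign(y[i] - y[i + 1:]))
    return 2.0 * concordance / (n * (n - 1))

def cvine_order(data):
    """Order variables (columns) by the sum of |tau| with all other variables."""
    d = data.shape[1]
    tau = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            tau[i, j] = tau[j, i] = kendall_tau(data[:, i], data[:, j])
    return np.argsort(np.abs(tau).sum(axis=1))[::-1]  # strongest first
```

On synthetic data where one column is a shared latent signal (analogous to the global neuropil), that column is ranked first.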
S1.6 Algorithmic complexity
In this section, we discuss the algorithmic complexity of parameter inference for a C-vine copula model.
The parameter inference for each of the bivariate Copula-GP models scales as $O(N)$, where $N$ is the number of samples, since we use scalable kernel interpolation (KISS-GP) wilson2015kernel. As we mentioned in Sec. S1.5, a full $n$-dimensional C-vine model requires $n(n-1)/2$ bivariate copulas, trained sequentially. As a result, the GP parameter inference has to be repeated $n(n-1)/2$ times, which yields $O(n^2 N)$ complexity.
In practice, the computational cost (in terms of time) of the parameter inference for each bivariate model varies from tens of seconds to tens of minutes. The heuristic model selection is designed such that it discards independent variables in just around 20 seconds (line 3 in Alg. 2). As a result, most of the models are quickly skipped and further considered as Independence models, and their contribution to the total computational cost can be neglected. When the model is evaluated, the Independence components are also efficiently 'skipped' during sampling, as the ppcf function is not called for them. The Independence models add zero to the C-vine log probability, so they are 'skipped' during log probability calculation as well. They also reduce the total memory storage, as no GP parameters, which dominate the memory requirements, are stored for these models.
In a conditional C-vine trained on a real neuronal dataset with 109 variables, 5253 out of 5886 (89%) bivariate models were Independence, which leaves only 633 non-Independence models.
In practice, this means that the algorithmic complexity of the model is much better than the naïve theoretical prediction $O(n^2 N)$ based on the structure of the graphical model. Suppose that the actual number of non-Independence models in a vine is much smaller than $n(n-1)/2$ and can be characterized by an effective number of dimensions $n_{\mathrm{eff}}$. In this case, instead of scaling with the number of variables $n$, the complexity highly depends on the sparsity of the dependencies in the graphical model and scales as $O(n_{\mathrm{eff}}^2 N)$.
Therefore, our method is especially efficient on datasets with a low effective dimensionality $n_{\mathrm{eff}}$, such as neuronal data. The number of variables itself has little effect on the computational cost and memory storage.
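The numbers quoted above can be checked directly (the helper names are ours): an $n = 109$ C-vine has 5886 pair models, and 633 non-Independence models correspond to an effective dimensionality of roughly 37:

```python
import math

def n_pairs(n):
    """Number of bivariate copulas in an n-dimensional vine: n*(n-1)/2."""
    return n * (n - 1) // 2

def effective_dim(n_models):
    """Smallest n_eff whose full vine has at least n_models pair copulas."""
    return math.ceil((1 + math.sqrt(1 + 8 * n_models)) / 2)

print(n_pairs(109))        # 5886 bivariate models in the full C-vine
print(effective_dim(633))  # 37: cost scales as if the vine had ~37 variables
```

So discarding 89% of the pair models shrinks the effective quadratic factor from 109² to roughly 37².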
S2 More validation on synthetic data
Computing infrastructure
We developed our framework and ran the majority of our experiments (described both in the paper and Supplemental Material) on an Ubuntu 18.04 LTS machine with 2 x Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz and 1x GeForce RTX 2080 + 1 x GeForce RTX 2080 Ti GPUs. For training Cvine models, we used another Scientific Linux 7.6 machine with 1 x Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz and 8 x GeForce RTX 2080 Ti GPUs.
Code availability
Code will be made available on GitHub upon paper acceptance.
S2.1 Model selection for bivariate copulas
Synthetic data
We generate artificial data by sampling from a copula mixture, parametrized in two different ways:

the mixing concentrations of all copulas were constant and equal to one over the number of copulas, but the copula parameters were parametrized by phase-shifted sine functions:
(S4)
where $i$ is the index of the copula in the mixture. For Clayton and Gumbel copulas, the absolute value of the sine was used. The amplitudes were chosen to cover most of the range of parameters, except for extremely low or high values for which all copula families become indistinguishable (from independence or deterministic dependence, respectively).

the copula parameters were constant, but the mixing concentrations were parametrized by phase-shifted sine functions (same as Eq. S4, with appropriately chosen amplitude and offset). Such a parametrization ensures that the sum of all mixing concentrations remains equal to one. Yet, each concentration turns to zero somewhere along this trajectory, allowing us to discriminate the models and infer the correct mixture.
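One concrete parametrization with these properties is sketched below. The exact amplitudes and phases used in our tests are not reproduced in this text, so treat this as an assumed example: with phases $2\pi i/N$, the shifted sines cancel, the concentrations sum to one for every $x$, and each concentration vanishes somewhere on [0, 1]:

```python
import numpy as np

def mixing_concentrations(x, n_copulas):
    """Phase-shifted sine parametrization of mixing concentrations.

    c_i(x) = (1 + sin(2*pi*x + 2*pi*i/n_copulas)) / n_copulas is non-negative
    and vanishes somewhere on x in [0, 1]; for n_copulas >= 2 the shifted
    sines cancel, so the concentrations sum to 1 for every x."""
    phases = 2 * np.pi * np.arange(n_copulas)[:, None] / n_copulas
    return (1 + np.sin(2 * np.pi * np.asarray(x)[None, :] + phases)) / n_copulas
```

Sampling from a mixture with these concentrations then produces data where each component dominates in a different region of the conditioning variable.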
Identifiability tests
We tested the ability of the model selection algorithms to select the correct mixture of copula models, the same as the one from which the data was generated. We generated 5000 samples with equally spaced unique inputs on [0,1].
Both model selection algorithms were able to correctly select all of the 1-component and most of the 2-component models on simulated data. For simulated data with larger numbers of components (or 2 very similar components), the WAIC of the selected model was either lower (which is possible given a limited number of samples) or close to the WAIC of the correct parametric model. In other words, the difference between the WAIC of the correct model and that of the best selected model never exceeded the test tolerance, which we set as the criterion for passing the test. Since all the tests were passed successfully, we conclude that both algorithms are capable of finding optimal or close-to-optimal solutions for copula mixtures.
A more detailed report on the model identifiability tests
Tables S2-S6 below illustrate the search for the best model. The copula model names in these tables are shortened to the first two letters, e.g. Gumbel becomes 'Gu', Frank becomes 'Fr'. The information in these tables provides some intuition on the model selection process and the range of WAICs for the correct or incorrect models. The final selected models are shown in bold.
Table S2 demonstrates that both the greedy and the heuristic algorithms can identify the correct single-copula model. Some key intermediate models (from Alg. 1-2) with their WAICs are listed in the table, along with the total duration of the simulations (T, in minutes) on an RTX 2080 Ti for both algorithms.
Table S3 shows the identification of the mixtures with 2 components, where the copula parameters were constant (independent of the conditioning variable) and the mixing concentrations were parameterized by phase-shifted sine functions (Eq. S4). All of these models were correctly identified with both algorithms. The mixtures with 2 components, where the copula parameters varied harmonically (as in Eq. S4) but the mixing concentrations were constant, were harder to identify. Table S4 shows that a few times, each of the algorithms selected a model that was better than the true model in terms of WAIC. The greedy algorithm made one mistake, yet the model it selected was very close to optimal. Such misidentification happens due to the limited number of samples in a given synthetic dataset.
Tables S5-S6 show the model selection for 3-component models. Again, as in Tables S3-S4, either the copula parameters or the mixing concentrations were constant. Here, the model selection algorithms could rarely identify the correct model (due to the overcompleteness of the mixture models), but always selected one that was very close to optimal in terms of WAIC.
Note that the test tolerance is different from waic_tol. We have set waic_tol for the comparison against the Independence model to such a small value (10x smaller than the test tolerance) because we want to avoid making false assumptions about conditional independences in the model. Also note that the WAIC of the true model depends on the particular synthetic dataset generated in each test. Therefore, the final WAICs in the left and in the right columns of Tables S2-S6 can be slightly different (yet, well within the test tolerance).
S2.2 Accuracy of entropy estimation
In this section, we consider a fixed copula mixture model with known parameters and test the reliability of the entropy estimation with Monte Carlo (MC) integration. We test the accuracy of the entropy estimation on a multivariate Gaussian distribution with covariance $\Sigma_{ij} = \delta_{ij} + \rho(1-\delta_{ij})$, where $\delta_{ij}$ is Kronecker's delta. Given a known Gaussian copula, we estimate the entropy with MC integration and compare it to the analytically calculated true value, within a fixed tolerance. As a result, for every correlation (Fig. S4A-B) and every number of dimensions (Fig. S4C-D), the Copula-GP provides an accurate result, within the error margin. In Figure S4, BI-KSG estimates gao2016demystifyingS obtained on a dataset with 10k samples are shown for comparison. This experiment 1) validates the MC integration; 2) validates the numerical stability of the probability density function of the Gaussian copula up to a specified maximal correlation (beyond which the model becomes indistinguishable from a deterministic dependence).
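The MC check can be illustrated in the bivariate case, where the Gaussian copula entropy has the closed form $\frac{1}{2}\log(1-\rho^2)$ (equal to minus the mutual information). This is a self-contained sketch, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_copula_entropy_mc(rho, n_samples=200_000):
    """MC estimate of the bivariate Gaussian copula entropy, H = -E[log c(u,v)].

    We sample correlated standard normals; the log copula density is the joint
    log-density minus the two marginal log-densities."""
    z1 = rng.standard_normal(n_samples)
    z2 = rho * z1 + np.sqrt(1 - rho ** 2) * rng.standard_normal(n_samples)
    log_joint = (-0.5 * (z1 ** 2 - 2 * rho * z1 * z2 + z2 ** 2) / (1 - rho ** 2)
                 - np.log(2 * np.pi) - 0.5 * np.log(1 - rho ** 2))
    log_marginals = -0.5 * (z1 ** 2 + z2 ** 2) - np.log(2 * np.pi)
    return -np.mean(log_joint - log_marginals)

def gaussian_copula_entropy_true(rho):
    """Closed form: copula entropy = -MI = 0.5 * log(1 - rho^2) (in nats)."""
    return 0.5 * np.log(1 - rho ** 2)
```

The copula entropy is zero for independent variables and grows more negative as the correlation approaches 1, which is exactly the regime where numerical stability of the copula density matters.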
S3 Model parameters for the bivariate neuronal and behavioural examples
In this section, we provide visualisations for the parameters of the bivariate copula models from Figure 3C-F and discuss the interpretability of these models.
Figure S5 shows the probability density of the joint distribution of two variables and the parameters of the corresponding Copula-GP mixture model. The plots on the left repeat Fig. 3C-D and represent the true density (outlines) and the copula model density (blue shades) for each part of the task.
In the noise correlation example (Fig. S5A), we observe the tail dependencies between the variables (i.e. concentration of the probability density in a corner of the unit square) around [0-60] cm and [140-160] cm of the virtual corridor. There is only one element with a tail dependency in this mixture: the Clayton 90 copula. On the rightmost plot in Fig. S5A, we see the mixing concentrations for the elements of the mixture model. The concentration of the Clayton 90 copula (orange line) is close to 100% around 20 cm and 150 cm, which agrees with our observations from the density plots.
The confidence intervals for the parameters approximated with Gaussian processes are shown with shaded areas in the parameter plots. These intervals provide a measure of uncertainty in model parameters. For instance, when the concentration of the Gaussian copula in the mixture is close to 0% (around 20 cm and 150 cm), the confidence intervals for the Gaussian copula parameter (blue shade) in Fig. S5A become very wide (from almost 0 to 1). Since this copula element is not affecting the mixture for those positions, its parameter has no effect on the mixture model log probability. Therefore, this parameter is not constrained to any certain value. In a similar manner, we see that the variables are almost independent between 60 and 120 cm (see density plots on the left in Fig. S5). Both copula elements can describe this independence. As a result, the mixing concentrations for both elements have high uncertainty in that interval. Yet, the Gaussian copula with a slightly positive correlation is still slightly more likely to describe the data in that interval.
The copula parameter plot in Fig. S5A also shows Pearson's $\rho$, which does not change much in this example and remains close to zero. This illustrates that a traditional linear noise correlation analysis would ignore (or downplay) this pair of neurons as one with no dependence. This happens because Pearson's $\rho$ only captures the linear correlation and ignores the tail dependencies, whereas our model provides a more detailed description of the joint bivariate distribution.
In the behavioural modulation example (Fig. S5B), we observe more complicated tail dependencies in the density plots. The best selected model supports this observation and provides a mixture with 3 components, 2 of which have various tail dependencies. The Clayton 0 copula (orange) describes the lower tail dependence observed in the second part of the virtual corridor with gratings (around [60-120] cm, see Fig. 3A for the task structure). This dependence can be verbally interpreted as follows: when there is no licking, Neuron 60 is certainly silent; but when the animal is licking, the activity of Neuron 60 is slightly positively correlated with the licking rate.
These examples illustrate that by analysing the copula parameters and the mixing concentrations of the Copula-GP mixture model, one can interpret the changes in the bivariate dependence structure. Just like traditional tuning curves characterize the response of a single neuron, our mixture model characterizes the 'tuning' of the dependence structure between pairs of variables to a given stimulus or context. Knowing the qualitative properties of the copula elements that constitute a copula mixture, one can focus on the dominant element of the mixture for every given value of the conditioning variable and describe the shape of the dependence.
True model  Greedy (Search attempts, WAIC, T)  Heuristic (Search attempts, WAIC, T)
Ga  Ga  0.1619  25 m  Ga  0.1513  3 m 
GaFr  0.1610  InGaGuGuGuGu  0.1499  
Ga  0.1619  InGaClClClCl  0.1498  
Ga  0.1513  
Fr  Fr  0.1389  57 m  Ga  0.1400  3 m 
FrCl  0.1395  InGaGuGuGuGu  0.1391  
FrClGu  0.1396  InGaClClClCl  0.1391  
FrClGuGu  0.1396  Fr  0.1509  
Fr  0.1389  
Cl  Cl  0.5225  37 m  Ga  0.3825  5 m 
ClGu  0.5226  InGaGuGuGuGu  0.4943  
ClGuCl  0.5225  InGaClClClCl  0.5303  
Cl  0.5224  Cl  0.5311  
Gu  Gu  0.6267  43 m  Ga  0.5555  7 m 
GuCl  0.6268  InGaGuGuGuGu  0.5988  
GuClGu  0.6267  InGaClClClCl  0.5946  
Gu  0.6230  GaGu  0.6040  
Gu  0.6050  
Cl  Cl  0.5389  22 m  Ga  0.3922  5 m 
ClCl  0.5389  InGaGuGuGuGu  0.5047  
Cl  0.5389  InGaClClClCl  0.5409  
Cl  0.5410  
Gu  Gu  0.6137  55 m  Ga  0.5501  7 m 
GuGu  0.6144  InGaGuGuGuGu  0.5893  
GuGuCl  0.6145  InGaClClClCl  0.5831  
GuGuClCl  0.6144  GaGu  0.5887  
Gu  0.6137  Gu  0.5950  
Cl  Cl  0.5566  36 m  Ga  0.3932  7 m 
ClCl  0.5582  InGaGuGuGuGu  0.4956  
ClClIn  0.5582  InGaClClClCl  0.5493  
Cl  0.5565  Cl  0.5489  
Gu  Gu  0.6131  43 m  Ga  0.5553  6 m 
GuCl  0.6164  InGaGuGuGuGu  0.6091  
GuClFr  0.6163  InGaClClClCl  0.6045  
Gu  0.6131  Gu  0.6154  
Cl  Cl  0.5434  23 m  Ga  0.3909  5 m 
ClGu  0.5433  InGaGuGuGuGu  0.5094  
Cl  0.5434  InGaClClClCl  0.5535  
Cl  0.5548  
Gu  Gu  0.5928  51 m  Ga  0.5763  6 m 
GuCl  0.5934  InGaGuGuGuGu  0.6277  
GuClIn  0.5935  InGaClClClCl  0.6179  
GuClInCl  0.5931  Gu  0.6300  
Gu  0.5928 
True model  Greedy (Search attempts, WAIC, T)  Heuristic (Search attempts, WAIC, T)
Gu  Ga  0.1877  101 m  Ga  0.1922  11 m 
Ga  GaGu  0.2855  InGaGu 