1 Introduction
Spatiotemporal receptive fields (STRFs) are frequently used in neuroscience to approximate the computation implemented by a sensory neuron. Such models consist typically of one or more linear filters summing up sensory inputs across time and space, followed by a static nonlinearity and a probabilistic output process [aljadeff_analysis_2016]. For this, different distributions can be used depending on the type of recorded neuronal responses.
The simplest way to estimate a STRF for a given data set is arguably to compute the spiketriggered average (STA), the average over all stimuli preceding a spike [schwartz_spiketriggered_2006]. However, the results of the spiketriggered analysis are often noisy due to limited data and prone to be distorted when the neuron is stimulated with correlated noise or natural stimuli [paninski_convergence_2003]. The maximum likelihood estimate (MLE), also known as whitenedSTA (wSTA) [theunissen_estimating_2001], does not suffer from such shortcomings, but it requires even more data to converge as the whitening procedure tends to amplify noise at frequencies with low power, making it a less preferred option in many experimental settings. A modified wSTA that removes low power frequencies can result in smoother STRF, while the threshold for such a frequency cutoff is normally chosen by crossvalidation [theunissen_estimating_2001].
It has been suggested that these shortcomings can be overcome by computing the maximum a posteriori (MAP) estimates within the framework of Bayesian inference
[wu_complete_2006]. Under this framework, prior knowledge about STRFs such as sparsity, smoothness [sahani_evidence_2002], and locality[park_receptive_2011]can be encoded into a prior covariance matrix, whose hyperparameters are learnt automatically from the data via evidence optimization (also known as empirical Bayes, or Type II maximum likelihood in the literature
[casella_introduction_1985]). These algorithms thus provide STRF estimates with the desired properties even with little or noisy data. Although elegant in theory, the evidence optimization step of this framework is computationally costly: the cubic complexity in the dimensionality of the parameter space renders it difficult to scale to highdimensional settings as often encountered in sensory neuroscience [wu_complete_2006]. Also, current popular implementations of MAP estimates work mostly under the LinearGaussian encoding model, which assumes additive Gaussian noise to the neural response, and are restricted to simple one filter models.An alternative approach to encoding prior knowledge for estimating STRFs is to parameterize the STRF with a set of basis functions with the desired properties. In this way, the model fitting becomes much more efficient as the number of parameters to be optimized is significantly reduced, and some degree of smoothness is automatically enforced. Previous studies used a raisedcosine basis (Figure 1A) for 1D retinal ganglion cell STRFs in macaque monkey [pillow_spatiotemporal_2008] and Gaussian spectral or Morlet wavelet basis for 2D auditory STRFs in ferrets [thorson_essential_2015]. Those basis functions are chosen to capture certain STRF properties in their corresponding studies, thus they are relatively inflexible to generalize to other settings. Besides, they are typically controlled by multiple hyperparameters, which creates a burdensome and timeconsuming model selection process.
Here, we propose to use an alternative basis for receptive field estimation: natural cubic splines (Figure 1
B). These are known as the smoothest possible interpolant and can be easily extended to highdimensional settings with minimal assumptions and few hyperparameters
[wood_generalized_2017]. This spline basis can be incorporated into a wide range of existing receptive field models (Figure 2), such as the LinearGaussian (LG) and the LinearNonlinearPoisson (LNP) model[pillow_prediction_2005, pillow_spatiotemporal_2008] for single STRF estimation as well as LinearNonlinearLinearNonlinear (LNLN) cascade models[mcfarland_inferring_2013, liu_inference_2017] and spiketriggered clustering methods [liu_inference_2017, shah_inference_2020] for subunit estimation. When combined with proper regularization, smooth and sparse STRFs can be retrieved with very little data and in a computationally efficient manner, while their predictive performance reaches stateoftheart level. Furthermore, we introduce diagnostic tools for assessing the quality of an estimated STRF: (1) a significance test based on the analytical solution of the STRF covariance matrix and (2) a permutation test based on model prediction. We provide all developed methods as part of the RFEst Python toolbox.2 Results
We developed the RFEst Python toolbox for splinebased STRF estimation in various models, including the singlefilter LG and LNP models as well as the multifilter LNLN cascade model (Figure 2). We measured the performance of the splinebased methods and compared it with their nonspline versions or other previous methods, using simulated data as well as neural recordings from the retina and the visual cortex. To quantify the goodness of fits for the simulated benchmark data, we measured the mean squared error (MSE) between the ground truth STRF and the estimated STRFs for various methods and studied their dependence on the amount of available data. For the experimental data, we computed the correlation between the predicted and measured responses on a heldout test set as a performance measure.
2.1 Estimating receptive fields from simulated LG, LNP and LNLN models
First, we simulated the responses of a sensory neuron with additive Gaussian noise using a STRF with 30 time frames and 40 pixels in space in response to white and pink noise stimuli for various stimulus lengths (see Methods 4.1 for details). First, we fit LG models (Figure 2A) to these responses using various STRF estimation techniques.
We measured the MSEs between the ground truth STRF and the different STRF estimates with respect to the ratio of the number of samples to the number of parameters , averaged over 10 different random seeds (Figure 3A). As expected, the STA and wSTA approached the ground truth only in high datatoparameters regimes. For limited data (e.g. n/d=4), the estimated STRF showed substantial background noise (Figure 3B). In the case of pink noise, the STA was in fact never able to recover the STRF due to the high correlation between pixels in the stimulus [paninski_convergence_2003] (Figure 3CD).
We also applied more advanced techniques to estimate the STRF to the simulated data, such as automatic smoothness determination (ASD, aka MAP with smoothness prior [sahani_evidence_2002]) and automatic locality determination (ALD, aka MAP with locality prior [park_receptive_2011]). These estimates were much closer to the ground truth already with a relatively small amount of data, with ALD outperforming ASD for both white and pink noise (Figure 3). However, their computation time for estimating the optimized prior was long due to the high computational complexity.
We then applied a splinebased version of the wSTA (SPL wSTA) with and without L1 regularization on the basis function coefficients (see Methods). This simple change of basis functions led to a STRF estimator which performed very well for all ndratios tested, even better than ASD and ALD, especially when using L1 regularization (Figure 3). At the same time, this estimator was also computationally very efficient, as the number of coefficients to be estimated was reduced to the number of basis functions (for this example, the STRF had parameters, whereas the number of basis functions was only )
We next simulated responses from an LNP model (Figure 2B) with Poisson noise similar to before using white and pink noise for stimulation. For simplicity, we used an exponential nonlinearity and treated it as known during the estimation procedure. We estimated the linear filter from different amounts of data using the STA, wSTA, ASD and ALD, as well as spline and nonspline LNP fitted via gradient descent with L1 regularization (referred to as LNP+SPL and LNP, respectively). Note that the original ASD and ALD were not designed for the LNP but the LG model (though some progress has been made for the former [park_bayesian_2013]). Despite this model mismatch, they still recovered good estimates of the STRF (as has been demonstrated in the original papers [sahani_evidence_2002, park_receptive_2011]).
Again, we found that different methods resulted in estimates of different quality depending on the available amount of data (Figure 4). Here, in contrast to the LG model, we changed the xaxis of Figure 4A and C to time in minutes for better presentation, as the model performance was dependent on the number of effective samples (frames that contain at least one spike) instead of the total number of samples, and the stochastic nature of the Poisson spike generator yielded different number of spikes for each trial. For both stimulus conditions, we found that the splinebased LNP yielded extremely dataefficient estimates of the linear filter compared to all other methods. In contrast, gradientbased optimization of an LNP model without spline basis, even with L1 regularization, performed only slightly better than the STA and wSTA computed under the mismatched LG model assumptions. The gap in performance of the LNP+SPL and MAP estimators of ASD and ALD were larger here compared to the LG case. This was partially due to the fact that the number of effective samples was smaller in Poisson spiking data, but the mentioned model mismatch between the cost function of the LG model optimized by the ASD and ALD methods and the data likely also contributed.
We finally investigated how recently proposed subunit models (Figure 2C), in particular an LNLN cascade model [mcfarland_inferring_2013, maheswaranathan_inferring_2018] and spiketriggered clustering or nonnegative matrix factorization methods[shah_inference_2020, liu_inference_2017], can be made more dataefficient using a spline basis. We compared their performance with and without spline augmentation (see Methods 4.6
). The assumption behind the spiketriggered clustering methods is that each spike can be attributed to a specific subunit of a STRF, so that each subunit can be viewed as a STA of a cluster of spikes. If this assumption holds, then a splineapproximated STA should be able to serve as a better centroid for clustering, as the smoothed STA is closer to the ground truth subunit STRF. We used kmeans clustering as an inplace of the previous published softclustering method
[shah_inference_2020] and used a multiplicative update rule[ding_convex_2010] for seminonnegative matrix factorization (semiNMF) [liu_inference_2017]. Moreover, we modified the previous published semiNMF [liu_inference_2017] and imposed the nonnegative constraint to the weight matrix instead of the subunit matrix, which allowed us to retrieve antagonistic STRFs.We simulated responses with Poisson noise from the model by stimulating two 20x20 pixels antagonistic centersurround subunits with different lengths of white noise stimuli. Both filter outputs first went through an exponential nonlinearity, were then summed up, and finally went through a softplus output nonlinearity.
We estimated subunit STRFs using kmeans clustering or semiNMF of the spiketriggered ensemble, and maximumlikelihood estimation of the LNLN model directly, with or without a spline basis (Figure 5). Similar to the results from previous simulations, retrieved subunit STRFs from splinebased models were generally much more similar to the ground truth than models without spline, and required a smaller amount of data to achieve a similar level of performance. For example, when using a spline basis, we were able to retrieve the two subunit filters at high quality using only 4 minutes of data. With this amount of data, all other methods still showed essentially only noise with a hint of signal in the estimate of the linear filter.
It is worth noting that the computational efficiency of LNLN estimation was improved a lot by using a spline basis, as the number of coefficients to be estimated was significantly reduced. In contrast, the computation time for both spiketriggered clustering methods was actually increased by incorporating splines, because more computational steps were added to the original algorithms in those modified versions (see Methods 4.6).
2.2 Comparing computation time between different methods and model selection
It was not straightforward to fairly compare the computation time between MAPderived and splinebased GLMs, as the most timeconsuming part for MAP was to estimate the prior hyperparameters via evidence optimization, while splinebased GLMs need to resort to crossvalidation for selecting the number of basis functions (, also known as the degrees of freedom) for each dimension. Yet, we can still get an impression of the algorithmic efficiency of each method by profiling their most timeconsuming step.
Therefore, we compared the time taken by the computation of the STA (see equation 1 in Methods), wSTA (equation 2), MAP (equation 3) and SPL wSTA (equation 12) for different amounts of data by separately profiling the single most time consuming computation (Figure 6). Not surprisingly, STA was the fastest, as it had a linear time complexity () and its computation time depended mainly on the amount of data (), while wSTA and MAP were the slowest, as they were both limited by the calculation of the inverse matrix, and their computation time grew in a cubic manner with respect to the number of coefficients () in the STRF due to the cubic time complexity () of the matrix inversion. SPL on the other hand, was less timeconsuming than wSTA and MAP, even though the time complexity was still cubic. However, the improvement was due to the reduced number of model parameters by parameterizing the STRF coefficients with the spline basis ( coefficients), which resulted in a much smaller covariance matrix to invert (), thereby speeding up the most timeconsuming step of the algorithm substantially. Thus, splinebased STRF estimators offer the best available compromise between STRF estimation quality and computation time.
The total amount of time spent on finding an optimal STRF depends on many factors, such as how the model is initialized, the choice of learning rate and stopping criteria for gradient descent optimization. Importantly, the way hyperparameters are selected also contributes significantly: most commonly, the model prediction performance is monitored for on a heldout data set over a discrete grid of hyperparameters. Even for evidence optimization methods, in which the main selling point is to learn the optimal hyperparameters automatically, crossvalidation can still be used to assess model goodness of fit and generalization performance [sahani_evidence_2002, park_bayesian_2011]. For splinebased GLM, this requires refitting the same model as often as the number of hyperparameter sets. Given that splines are equally spaced, a good strategy to accelerate the process would be to measure the performance of SPL wSTA over a grid of possible candidates from very few to many, choose the number when the performance is starting to plateau, then fit SPL+L1 with gradient descent to retrieve a sparse STRF.
2.3 Diagnostic tools
In addition, even if a wellregularized model is chosen according to predictive performance, additional diagnostic tools can be useful to decide whether a certain STRF is reasonable and shows a clear structure (Figure 7
). We devised three diagnostic tools that can be employed to this end: (1) the confidence interval (CI) of the fitted STRF and its corresponding pvalues based on Wald statistics (with the null hypothesis that all STRF coefficients are zero, see also Method
4.7); (2) the difference between the metrics of the training and the validation set over all iterations of the optimization; and (3) a permutation test on model prediction for validation and/or test data set (Method 4.8).To demonstrate their usefulness, we investigated a ground truth 3D STRF under a LinearGaussian model. We showcased four different scenarios (Figure 7): First, we looked at a wellfitted STRF (Figure 7
A). Here, the CI was narrow and significantly different from an allzero STRF (p<0.001), the gap between the training and validation prediction metrics was small and the prediction on the validation set was better than chance (p<0.001). The confidence interval and its corresponding pvalue were determined by the training data. Next, we looked at scenarios where the amount of data was small or the signaltonoise ratio (SNR) was low (Figure
7 C and D). Here, the STRF did not pass the significance test due to a lack of power, but it is still possible that the model predictions on the validation set were better than chance. Additionally, if the gap between the performance metric on the training and validation sets gets too large, it would be a good idea to refit the model with a stronger regularization. Finally, in the case of the data with extremely low SNR (Figure 7E), which we generated by randomly permuting the simulated response, the prediction performance of the fitted STRF was just at the chance level regarding all indicators.2.4 Application to experimental data
We next documented the complete process of using gradientbased estimation for the parameters of splinebased LNP and LNLN models in a previously published experimental data set: extracellularly recorded spikes from tiger salamander retinal ganglion cells (RGC), stimulated by 3D nonrepeat checkerboard white noise [liu_inference_2017] (Figure 9A). In addition, we applied the same techniques to 2photon calcium imaging data recorded from a mouse retinal ganglion cell soma stimulated by checkerboard white noise (Figure 9B) (using the positive gradient of the calcium trace as the response). We compared the results of splinebased models with nonspline versions.
For both data sets, only the first 10 minutes of the whole recording were used for fitting the models, another 2 minutes were used for model validation and another 4 sets of 2 minutes for testing. The predictive performance was measured by the average of the four correlation coefficients of the predicted and measured responses. We set the number of subunits throughout. The dimensionality of the 3D STRFs were [20, 25, 25] and [25, 15, 15], respectively. We added diagnostic information as in Figure 7. Moreover, a contour was drawn on the spatial filter to illustrate the spatial extent of the STRF.
We fitted the nonspline LNP and LNLN with different L1 regularization weights from 0 and 10, with steps of 1.
For spline LNP and LNLN, the number of basis functions (aka degree of freedom, df) for temporal and spatial dimensions (assuming x and y dimensions here used the same df) were first selected based on the predicted performance of the maximum likelihood estimates on the validation set (from 1/3 to 2/3 of the pixels in each dimension, with steps of 1). Finally, for data set (1), the grid search took 11.2 minutes on the same hardware setup used for Figure 6 and yielded the optimal df=[9, 9, 9] ; for data set (2), it took 2.4 minutes and yielded the optimal df=[12, 9, 9] (Figure 8). See also how different amounts of training data affects the optimal df in Supplementary Section 1.
Then, we fitted spline LNP and LNLN with L1 regularization weights ranged from 0.5 and 1.5, with steps of 0.1 (as we observed, the spline model required a smaller value for regularization due to the reduction of model parameters). The ordered grid search for the L1 weight, starting with the smallest value, was interrupted if the current predictive performance was lower than the previous one. All models were initialized with maximum likelihood estimates (plus small additive Gaussian noise) and fitted by gradient descent with 1500 iterations, but an early stop was also deployed with two triggers: either the training cost changed less than 1e5 for the last 10 iterations, or the validation cost monotonically increased for the last 10 iterations. Sometimes the optimization didn’t stop after reaching the lowest validation cost due to the adaptive learning rate resulting in the cost changing in a zigzag manner. But regardless of the stopping trigger, the model at the lowest validation cost was returned. The computation time of each fit for all four models was summarized by box plots in Figure 9A and B.
Fitting a 3D STRF with gradient descent directly was considered infeasible unless certain regularization was imposed [maheswaranathan_inferring_2018], and the situation would become even worse for subunit models, as the number of model parameters increases linearly with the number of subunits. Previously methods [liu_inference_2017, shah_inference_2020] (as shown in Figure 5) relied on dimensionality reduction techniques to retrieve subunit STRFs but those methods could not be applied to 2photon calcium imaging data directly without an extra step of spike inference. Spline GLMs not only overcome both of these problems but also allow us to assess the quality of the estimated STRF with CI and their corresponding pvalues (In principle, the same statistics can also be applied to nonspline GLMs, yet a larger number of model coefficients would result in high computational cost for computing the STRF posterior covariance).
We found that for both example neurons from the two data sets, STRFs estimated by nonspline models were noisy and prediction performance was only moderate (Figure 9A, B, A, B). This was especially prominent for LNLN estimates, as here the number of parameters to estimate exceeded the amount of available data to fit the models (Figure 9A and B). In contrast, the splinebased LNP and LNLN estimated in general were smooth and sparse even with a severely limited amount of data, and they consistently outperformed their nonspline versions (Figure 9A and B).
The number of subunits that could be retrieved by the model was limited by the amount of available data (total length of the recording, and for spike data, the total number of spikes) as well as the signaltonoise ratio of the data. With the limited data in our examples, we found three and two nonoverlapping subunits for the two neurons, respectively (Figure 9A and B). Additional subunits available to the model were pushed to near zero by the L1 regularization and the significance test failed to reject the null hypothesis that all coefficients in those subunit STRFs were close to zero. But as noted in the previous section, the pvalue should not be the only judge for the quality of the estimated STRF. As shown in 9B, the STRF from the spline LNP model was also nonsignificant but looked reasonable from the snapshot visualization. In cases like this, other tools such as the permutation test shown in the previous section would be required to determine whether the STRF in question was good or not.
We provided two more examples of visual neurons (RGC and V1) and one example of nonvisual neuron (entorhinal grid cell) in Supplementary Section 2.
3 Discussion
Here, we showed that natural cubic splines form a good basis for STRF estimation by tackling one of the main challenges in sensory neuroscience: to efficiently estimate highdimensional STRFs with limited data. Using this basis function set, we showed that inference for single filter LN models with different noise models as well as LNLN cascade models can be efficiently performed from both spiking as well as twophoton imaging data, even if the signaltonoise ratio is low (and see also how splinebased methods compare to previous methods in Table 1). We provide an implementation of the proposed methods in the RFEstpackage for Python.
Methods  Stimuli Type  Noise Distribution  Data needed  Complexity  Multiple Filters 

STA  White Noise only  Gaussian  a lot  No  
wSTA  Any  Gaussian  a lot  No  
MAP  Any  Gaussian  little  No  
LNLN  Any  Poisson  a lot  Yes  
SPL LG  Any  Gaussian  little  No  
SPL LNP  Any  Poisson  little  No  
SPL LNLN  Any  Poisson  little  Yes 
3.1 Choice of basis functions
We chose the natural cubic spline basis to represent the linear filters as default in our toolbox, because among many other spline bases such as cyclic cubic splines, Bsplines, or thinplate splines, natural cubic splines have been shown to be the smoothest possible interpolant through any set of data and theoretically nearoptimal to approximate any true underlying functions closely [wood_generalized_2017]. Conveniently, assuming all knots are spaced equally in each dimension, natural cubic splines only require setting one hyperparameter: the degree of freedom, or the number of basis functions. This is unlike the raisedcosine basis used previously[pillow_spatiotemporal_2008]
, which is governed by multiple hyperparameters. Also, natural cubic splines can be easily extended to highdimensional settings, also known as tensor product bases, by simply taking the rowwise Kronecker product of the basis matrix along each dimension (see Method
4.4), making them uniquely suitable for estimating STRFs. There is also an interesting connection between smoothing splines and Gaussian processes. As shown by [kimeldorf_correspondence_1970, bay_generalization_2016, bay_generalization_2016], spline regression converges to the MAP solution of a Gaussian process regression under certain constraints.Through extensive validation with simulated and real experimental data sets, we showed that STRFs estimated with this basis can not only be more accurate than previous stateoftheart methods, but the estimation procedure was also computationally much more efficient. The estimated STRFs were smoothed automatically, as they were the linear combination of a set of smooth basis functions. Further, sparsity and localization could often be achieved by adding an L1 regularization term to the loss function, which pushed the coefficient for the less relevant basis functions to zeros. Moreover, by parameterizing STRFs with basis functions, the number of coefficients could be significantly reduced, hence model fitting became much more computationally efficient, and the amount of data needed for the optimization to converge was also significantly less.
3.2 Potential pitfalls of using splines in nonideal situations
One caveat of using splines to fit STRFs is that some STRFs might not need smoothing, thus using splines would lead to suboptimal results. For example, if a STRF occupies only one pixel in space or the amplitude changes sharply in the temporal profile, splinebased model would likely yield suboptimal results by yielding overly smoothed filters. However, one can identify these cases by monitoring the crossvalidation performance on heldout data while increasing the number of basis functions in each dimension. If the performance keeps increasing as the number of basis functions reaches the number of pixels in each dimension, a splinefree model could be preferable.
Another concern of using splines is that it would be difficult to tell a poor fit from a good one in some cases, as a poorlyfitted splinebased STRF would also look like a smooth STRF, unlike pixelbased methods that yield a noisy STRF when the fit is poor. In these cases, the diagnostic tools we introduced in Section 2.3 and Figure 7 would be helpful to assess the quality of the STRFs.
3.3 Other approaches to efficient estimation of receptive fields
Similar to our approach, fast automatic smoothness determination (fASD) [aoi_scalable_2017]
has been suggested to speed up the process of evidence optimization in ASD. Because the smoothness prior covariance can be represented in frequency coordinates and many highfrequency coefficients of the Fouriertransformed filter tend to be small, the size of the prior covariance matrix can be reduced by cutting off those small value frequencies (an idea similar to the modified wSTA
[theunissen_estimating_2001]), leading to a speedup in the optimization process. However, as the efficiency of fASD is controlled completely by the initial frequency truncation, and the level of frequency to be truncated has to be fixed before the evidence optimization starts, one needs to resort back to grid search and crossvalidation for choosing a good starting point, making the automatic hyperparameters selection via evidence optimization less automatic and comparable to the need to find the optimal degrees of freedom for the spline basis. Also, as fASD (and original ASD) rely on only a few hyperparameters, it might not even require to do evidence optimization at all but just to find the optimal hyperparameters through crossvalidation completely. Importantly, fASD is so far only applicable within the framework of linear models with Gaussian additive noise, while our approach is ready to use in nonlinear models with nonGaussian noise, as shown for LNP and LNLN models.Regarding the inference for multiple STRFs in LNLNcascade models, previous work has attempted to provide efficient algorithms by first retrieving subunit STRFs through spiketriggered clustering methods such as softclustering [shah_inference_2020] or semiNMF [liu_inference_2017]. These were subsequently used to estimate other model parameters via gradient descent while holding the subunit STRFs fixed. Utilizing only the collection of stimuli that elicit a spike, these approaches have advantages in terms of the computational cost in time and memory compared to the direct fitting of the complete model using gradient descent [mcfarland_inferring_2013, maheswaranathan_inferring_2018].
The ease of using these algorithms relying on the spiketriggered ensemble under white noise stimulation is also the disadvantage of these approaches – they can only be applied to spike data under white noise stimulation. In many cases, experiments today record more diverse measurements from neurons, such as twophoton calcium imaging [baden_functional_2016, ran_typespecific_2020] or synaptic current recording [trong_origin_2008] under more diverse types of stimulation, such as correlated noise and natural stimuli [qiu_mouse_2020]. In those cases, spiketriggered analysis can not be used directly without modification, as the spiketriggered average is no longer the maximum likelihood estimate [paninski_convergence_2003]. In contrast, our method does not rely on Gaussianity of the stimulus or the discreteness of the response, and in principle can be generalized to other stimuli distribution and other response types.
In addition, current spiketriggered clustering methods rely on temporally collapsing the 3D stimuli to further reduce the model complexity, assuming all subunits share the same temporal response kernel. This limits the usability of these methods to some very specific cell types and will not account for the spatiotemporal inseparable computation implemented by neurons receiving inputs from multiple sources, such as bistratified OnOff directionselective cell in the retina [taylor_diverse_2002]. In this case, a more general approach such as our method should be preferred.
3.4 Further extensions
Here, we only showcased how to estimate STRFs using a natural cubic spline in a set of popular models with standard gradient descent with L1 regularization. This simple spline addon can be directly incorporated into other previous models [mcfarland_inferring_2013, maheswaranathan_inferring_2018] with different constraints, regularization and optimization techniques, as it only changed the linear step of the models without modifying other parts.
Furthermore, to further improve model prediction performance, a responsehistory filter (as shown in Figure 2) and flexible nonlinearity can be added to the model [williamson_equivalence_2015, maheswaranathan_inferring_2018], as both of which can also be parameterized by a 1D spline basis. With such extra flexibility, a model can better adapt to the cellspecific firing threshold and adaptation to the stimulus ensemble. Still, even with the responsehistory filter, a STRF model may still not be able to account for the total variability of the neuronal responses, if the response is influenced by other sources. Future extension of the current GLM for STRFs can consider incorporating more diverse sources of behavioral data [balzani_efficient_2020] of the experimental animal under experimental or naturalistic settings as 1D timedependent filters.
4 Material and Methods
4.1 Data sets
Simulated data
For the simulated data shown in Figure 3 and 4, we used a 2D rank2 STRF with 30 time frames and 40 pixels in space () as shown as "ground truth" in both Figures and generated responses with two types of flicker bar stimuli: Gaussian white noise and pink noise. For LNP models, we used a fixed exponential nonlinearity.
For Figure 5, we used two 20x20 pixels antagonistic centersurround STRF and with Gaussian white noise as stimulus. We used a fixed exponential function for the filter nonlinearity and a softplus function for the output nonlinearity. We controlled the mean firing rate of all simulated spike trains to about 21 Hz by adjusting the intercept of the corresponding models (for Figure 4A and C, the intercepts were 3.5 and 2.5, respectively. For Figure 5, the intercept was 0) and also multiplying a constant to .
For all simulations, we measured the similarity of the ground truth and the model estimators by computing the MSE between the respective normalized STRFs. The STRFs were normalized by elementwise division by the matrixnorm.
For Figure 3, the similarity was measured for various stimulus length (), which were scaled by different factors () with respect to the number of coefficients of the STRF. For Figure 4 and 5, the similarity was measured at various lengths of time ( minutes), which is equal to setting , where is the time bin size and took the value of 0.033 s.
Error bars in Figure 35 were obtained by averaging over 10 trials of simulations with different random seeds for the stimulus.
For the simulated data shown in Figure 7, we used a 3D STRF with 25 time frames and 25x25 pixels in space (), where the spatial profile was a 2D Gaussian field and the temporal profile had a sharp dip close to zero time lag, and generated responses with Gaussian white noise (sample size unless noted otherwise). Additive Gaussian noise was added to the responses ( except (c), for which ).
Experimental data
We used a previously published data set for Figure 9A: extracellularly recorded spikes from tiger salamander retinal ganglion cells [liu_inference_2017], available to download at https://gin.gnode.org/gollischlab/Liu_etal_2017_RGC_spiketrains_for_STNMF.
Another two data sets of visual neurons and one data set of a nonvisual neuron were used for the supplementary Figure 2 and 3:

Extracellularly recorded spikes from a tiger salamander retinal ganglion cell [maheswaranathan_inferring_2018], available to download at https://github.com/baccuslab/inferringhiddenstructureretinalcircuits;

Extracellular recorded spikes from a V1 complex cell of macaque monkey [rust_spatiotemporal_2005], available on request from the authors;

Extracellular recorded spikes from a grid cell in rat entorhinal cortex [hafting_microstructure_2005], available to download at https://www.ntnu.edu/kavli/research/gridcelldata;
The details of the second data set for Figure 9B are described below.
4.2 Experimental procedures for 2photon calcium imaging
Animals and tissue preparation
One wildtype mouse (C57Bl/6J, JAX 000664) housed under a standard 12 hour day/night rhythm was used in this study. Before experiments, the mouse was dark adapted for at least 2 hours. Then, the animal was anaesthetized with isoflurane (Baxter) and killed with cervical dislocation. Eyes were quickly enucleated in carboxygenated (95% O, 5% CO) artificial cerebral spinal fluid (ACSF) solution containing (in mM): 125 NaCl, 2.5 KCl, 2 CaCl, 1 MgCl, 1.25 NaHPO, 26 NaHCO, 20 glucose, and 0.5 Lglutamine (pH 7.4). After carefully removed the cornea, sclera and vitreous body, the retina was flattened on an Anodisc (0.2 M pore size, GE Healthcare) with the RGC side facing up and electroporated (see below). Then, the retina was transferred to the recording chamber of the microscope. To visualize the blood vessels and the damaged cells, we added 5 M sulforhodamine101 (SR101, Invitrogen) into 2l of ACSF solution. All experimental procedures were carried out under very dim red light. All animal procedures were approved by the governmental review board (Regierungspräsidium Tübingen, BadenWürttemberg, KonradAdenauerStr. 20, 72072 Tübingen, Germany) and performed according to the laws governing animal experimentation issued by the German Government.
Electroporation
For twophoton imaging, OregonGreen BAPTA1 (OGB1, hexapotassium salt; Life Technologies) was bulk electroporated [briggman_bulk_2011]. After the retina was flattened on an Anodisc and placed between two 4mm horizontal plate electrodes (CUY700P4E/L, Nepagene/Xceltis), 10 l OGB1 (5 mM) diluted in ACSF solution was suspended on the upper electrode. Then the upper electrode was lowered onto the retina and applied 10 pulses ( 9.2 V, 100 ms pulse width, at 1 Hz) from a pulse generator/wideband amplifier combination (TGP110 and WA301, Thurlby handar/Farnell). For complete recovery of the electroporated retina, we started recordings 30 min after electroporation.
Two photon imaging and light stimulation
A MOMtype twophoton microscope (designed by W. Denk, MPI, Martinsried; purchased from Sutter Instruments/Science Products) as described previously [euler_eyecup_2009, euler_studying_2019] was used for this study. Briefly, the system was equipped with a modelocked Ti:Sapphire laser (MaiTaiHP DeepSee, Newport SpectraPhysics, Darmstadt, Germany), green and red fluorescence detection channels for OGB1 (HQ 510/84, AHF, Tübingen, Germany) and SR101 (HQ 630/60, AHF), and a water immersion objective (16x/0.80W, DIC N2, /0 WD 3.0, Nikon). For imaging, we tuned the laser to 927 nm, and used a custommade software (ScanM, by M. Müller, MPI, Martinsried, and T.E.) running under IGOR Pro 6.3 for Windows (Wavemetrics). Timeelapsed signals were recorded with 6416 pixel image sequences (31.25 Hz).
Light stimuli were projected through the objective lens [euler_studying_2019]. A digital light processing (DLP) LightCrafter 4500 (Texas Instruments, Dallas, TX) coupled with external unit with lightemitting diodes (LEDs) – "green" (576 nm) and UV (387 nm) that match the spectral sensitivities of mouse M and Sopsins (for details, see [baden_functional_2016, franke_arbitraryspectrum_2019]) was used in this study. Both LEDs were intensitycalibrated to range from ("black" background) to ("white" full field) photoisomerizations per cone. The light stimulus was centered before every experiment, ensuring that its center corresponded to the center of the microscope’s scan field. For all experiments, the tissue was kept at a constant mean stimulator intensity level for 15 s after the laser scanning started and before light stimuli were presented.
Binary dense noise (2015 matrix of 3030 M pixels; each pixel displayed an independent, balanced random sequence at 5 Hz for 20 minutes) were generated and presented using the Pythonbased software package QDSpy (https://github.com/eulerlab/QDSpy).
Data preprocessing
imaging data were preprocessed using customwritten scripts in IGOR Pro. Regionsofinterest (ROIs) were manually defined depending on the outline of the soma. For each ROI, traces were extracted using the image analysis toolbox SARFIA for IGOR Pro [dorostkar_computational_2010]. A stimulus time marker embedded in the recorded data served to align the traces relative to the light stimulus at a temporal resolution of 2 ms. All stimulusaligned traces together with the relative ROI position were exported for further analysis.
4.3 Previous methods
Spiketriggered Average (STA)
The STA was computed as the average of all stimuli that proceeded a spike [aljadeff_analysis_2016]:
(1) 
where is the stimulus design matrix, the response, can be the number of spikes or the number of samples (rows) in and for nonspike data.
Whitened spiketriggered average (wSTA)
The maximumlikelihood estimator under a Gaussian noise model, or whitenedSTA (wSTA), was computed by multiplying the STA with the inverse of the stimulus autocorrelation matrix:
(2) 
In our code base, we did not compute the matrix inverse explicitly but computed the least squared solution instead, which is common practice.
Maximumaposteriori estimator (MAP)
Similar, the MAP under a Gaussian noise model was computed by adding the inverse of the prior covariance to the stimulus autocorrelation matrix:
(3) 
where is a prior covariance matrix of choice, e.g. corresponding to a smoothness or sparseness assumption [sahani_evidence_2002, park_receptive_2011]. Its hyperparameters can be optimized by minimizing the negative logevidence[sahani_evidence_2002, park_receptive_2011]:
(4) 
where
is variance of the additive Gaussian noise,
and are posterior mean and covariance, respectively:(5) 
and
(6) 
Smoothness prior covariance matrix for ASD
The 1D ASD prior covariance matrix is formulated as following [sahani_evidence_2002]:
(7) 
where is the squared distance between STRF coefficients and , controls the scale and controls the smoothness.
Locality prior covariance matrix for ALD
The 1D ALD prior covariance matrix is formulated as following [park_receptive_2011]:
(8) 
where is an orthogonal basis matrix for the 1D discrete Fourier transform, and are the diagonal ALD prior covariance in space and frequency, respectively, which can be constructed as:
(9) 
where is scalar values control the shape and extent of the local region, is the STRF coefficient coordinates in spacetime and is center coordinate of the STRF. And
(10) 
where is scalar values control the scale of smoothness in frequency, is of STRF coefficient coordinates in frequency and is center coordinate of the STRF in Fourier domain.
Highdimensional prior covariance matrix can be approximated by taking the Kronnecker product () of the covariance matrix in each dimension.
4.4 Estimating STRFs with Natural Cubic Regression Splines
To generate a natural cubic spline matrix (), we used the implementation from Patsy (0.5.1), a Python equivalent of the original implementation in R package mcgv [wood_generalized_2017]. The only hyperparameter needed to generate a 1D spline basis matrix was the number of basis functions, or the degrees of freedom (df), as we set all knots to be equally spaced.
(11) 
where is the spline basis matrix, is knot locations, , and , , , , , are defined in Table 2.
i=1…k2  
i=1…k3 
An illustration of 1D and 2D natural cubic splines can be seen in Figure 1B. A splinebased STRF under a Gaussian noise model can be fitted in the similar manner as the wSTA:
(12) 
where is the spline coefficients, and the corresponding STRF is:
(13) 
The spline matrix can be extended to high dimensions, which is also called tensor product smooth [wood_generalized_2017], by simply taking the Kronnecker product () of the basis functions in each dimension:
(14) 
4.5 Splinebased GLMs with gradient descent
The spline coefficients can also be fitted by gradient descent with respect to the cost function defined in each model with L1 regularization.
For the LG model, the cost function is the sum square error:
(15) 
where is the weight of L1 regularization.
For LNP and LNLN model, the cost function is the negative loglikelihood:
(16) 
where is the conditional intensity or the instantaneous firing rate, is the size of time bin, and is a nonlinearity applied to the filter output, which can be fixed as a softplus or exponential nonlinearity.
We used JAX [bradbury_jax_2018] for automatic differentiation, and Adam optimizer [kingma_adam_2017]for parameters optimization.
4.6 Splinebased spiketriggered clustering methods
Both kmeans clustering and semiNMF can be formulated as a matrix factorization problem:
(17) 
where is the spiketriggered ensemble, is clustering centroids matrix (or subunit matrix), and is the clustering label (or subunit weight matrix). A spline basis can be easily incorporated into kMeans clustering by assuming that each centroid can be approximated by the same spline basis matrix: , where is the spline coefficient matrix for each subunit, and can be computed as the leastsquares solution to the linear matrix equation , where is pseudoinverse. For semiNMF, splines can also be combined with the update rules [ding_convex_2010]:
(18a)  
(18b) 
where is pointwise multiplication, and .
4.7 Confidence interval and statistical significance of the estimated STRFs
The asymptotic posterior probability of the estimated STRF coefficients can be computed analytically
[wood_generalized_2017]: , where is the fitted STRF coefficients, is the posterior covariance. For LG model:(19) 
where is the residual sum of squares of the data and predicted response from the training set. And for LNP model:
(20) 
where is a diagonal matrix with entries , and
is the nonlinearity of the model. Thus, the standard error of
is the square root of the diagonal of , and the corresponding confidence interval (CI) of the estimated STRF is as following:(21) 
Given the computed confidence interval, Wald ChiSquared Test can be used to determine whether a STRF is statistically significant (against the null hypothesis that all STRF coefficients are zero), with the Wald statistic .
4.8 Permutation Test
To evaluate the predictive performance of the estimated STRF, we permuted the validation test stimulus in time and made prediction over 100 repetitions. A onetailed onesample Student’s Ttest was used to evaluate if the model prediction was better than chance level.
4.9 Visualization of 3D STRF
For visualization of 3D STRF in Figure 7 and 9
, we first performed singular value decomposition (SVD) to separate the spatial and temporal components of the estimated 3D STRF. The coordinates of the extremum pixel in the spatial component were selected, and the temporal filter of the original STRF is shown for those coordinates. The confidence interval of the same extremum point was plotted as a gray shaded region on the temporal filter. For the spatial filter, we show the frame of the original STRF at the extremum point of the temporal filter (indicated by a vertical dotted line). The maximum value of the color map was set to the maximum absolute value of the STRF.
4.10 Comparison of Computation Time
For comparison of computation time in Figure 6, we used simulated 2D RFs of different sizes (15x15, 20x20 and 30x30 pixels, respectively) and white noise stimuli. Simulations were carried out at different lengths of time in the same manner as Figure 4 and 5. The number of basis functions for SPL wSTA were fixed as 5 and 10 along each dimension in all three cases. Measurements of computation time were averaged over 10 repetitions and performed on a 2019 16inch MacBook Pro with a 2.3 GHz 8core Intel Core i9 and 16 GB of RAM.
4.11 Data availability
All data used are available in GitHub (https://github.com/huangziwei/data_RFEst).
4.12 Code availability
All methods (splinebased GLMs, evidence optimization and spikeclustering methods) were implemented in Python3 and available in GitHub (https://github.com/berenslab/RFEst). And Jupyter notebooks for all figures are available also in GitHub (https://github.com/huangziwei/notebooks_RFEst)
Acknowledgements
We thank the authors of [maheswaranathan_inferring_2018, liu_inference_2017, rust_spatiotemporal_2005] and [hafting_microstructure_2005]
for making their data available. Research was funded by the Deutsche Forschungsgemeinschaft through a Heisenberg Professorship (BE5601/41 and BE5601/81, PB), the Collaborative Research Center 1233 “Robust Vision” (reference number 276693517) as well as individual research grants (EU 42/101, BE5601/61), the German Ministry of Education and Research through the Bernstein Award (01GQ1601, PB). PB is a member of the Excellence Cluster 2064 “Machine Learning — New Perspectives for Science” (reference number 390727645) and the Tübingen AI Center (FKZ: 01IS18039A).