Discovering Neuronal Cell Types and Their Gene Expression Profiles Using a Spatial Point Process Mixture Model

Cataloging the neuronal cell types that comprise circuitry of individual brain regions is a major goal of modern neuroscience and the BRAIN initiative. Single-cell RNA sequencing can now be used to measure the gene expression profiles of individual neurons and to categorize neurons based on their gene expression profiles. While the single-cell techniques are extremely powerful and hold great promise, they are currently still labor intensive, have a high cost per cell, and, most importantly, do not provide information on spatial distribution of cell types in specific regions of the brain. We propose a complementary approach that uses computational methods to infer the cell types and their gene expression profiles through analysis of brain-wide single-cell resolution in situ hybridization (ISH) imagery contained in the Allen Brain Atlas (ABA). We measure the spatial distribution of neurons labeled in the ISH image for each gene and model it as a spatial point process mixture, whose mixture weights are given by the cell types which express that gene. By fitting a point process mixture model jointly to the ISH images, we infer both the spatial point process distribution for each cell type and their gene expression profile. We validate our predictions of cell type-specific gene expression profiles using single cell RNA sequencing data, recently published for the mouse somatosensory cortex. Jointly with the gene expression profiles, cell features such as cell size, orientation, intensity and local density level are inferred per cell type.




1 Introduction

1.1 Motivations and Goals

The human brain comprises about one hundred billion neurons and one trillion supporting glial cells. These cells are specialized into a surprising diversity of cell types. The retina alone boasts well over 50 cell types, and it is an active area of research to perform a census of the various neuronal cell types that comprise the central nervous system. Many criteria have been used to categorize neuronal cell types, from neuronal morphology and connectivity to their functional response properties. Neurons can also be categorized based on the proteins they make. Immunohistochemistry has been used with great success for many decades to differentiate excitatory neurons from inhibitory neurons by labeling for known proteins involved in the synthesis and regulation of glutamate and GABA, the primary excitatory and inhibitory neurotransmitters respectively.

More recently, there has been an effort to systematically measure the complete transcriptome of single neurons. Single-cell RNA sequencing (RNA-Seq) is an extremely powerful technique that can quantitatively determine the expression level of every gene that is expressed in individual neurons. This so-called transcriptome or gene expression / transcription profile can then be used to define cell types by clustering. A recent study produced the most comprehensive census of cell types to date in the mouse somatosensory cortex and hippocampus by performing single-cell RNA-Seq on over 3000 neurons [zeisel2015cell]. While this study is quite exciting, trying to replicate it for all brain regions might well require the equivalent of a thousand such experiments. Thus, it is likely that the unprecedented insights that RNA-Seq can provide will be slow to arrive. More importantly, single-cell sequencing methods are not currently able to capture the precise three-dimensional location of the individual neurons.

Here we propose a complementary approach that uses computational strategies to identify cell types and their spatial distribution by re-analysing data published by the Allen Institute for Brain Science. The Allen Brain Atlas (ABA) contains cellular resolution brain-wide in-situ hybridization (ISH) images for 20,000 genes. (Although the Atlas contains ISH data for approximately 20,000 distinct mouse genes, we focus on the top 1743 reliable genes whose sagittal and coronal experiments are highly correlated.) ISH is a histological technique that labels the mRNA in all cells expressing the corresponding gene in a manner roughly proportional to the gene expression level. An example of an ISH image can be seen in panel (a) of the overview figure.

The ABA contains genome-wide and brain-wide ISH images of the adult mouse brain. These images were generated by slicing the brain into a series of thin sections and performing ISH. Image series of ISH performed for different genes come from different mouse brains, since ISH can only be performed for one gene at a time. The ISH image series for different genes were then computationally aligned into a common reference brain coordinate system. Such data have been productively used to infer the average transcriptomes corresponding to different brain regions.

It is commonly thought that the ABA cannot be used to infer the transcriptomes of individual cells in a given brain region, since mouse brains cannot be aligned to the precision of a single cell. This is because there is individual variation in the precise number and location of neurons from brain to brain. However, we expect the average number and spatial distribution of neurons of each cell type to be conserved from brain to brain, for a given brain area. More concretely, we might expect that parvalbumin-expressing (PV) inhibitory interneurons in layer 2/3 of the mouse somatosensory cortex comprise approximately 7% of all neurons and have a conserved spatial and size distribution from brain to brain. We use this expected conservation to derive a method for simultaneously inferring the cell types in a given brain region and their gene expression profiles from the ABA.

We propose to model the spatial distribution of neurons in a brain as being generated by sampling from an unknown but consistent spatial point process distribution that depends on brain region and cell type. Since each gene might be expressed in only a subset of cell types, an ISH image for a single gene can be thought of as a mixture of spatial point processes, where the mixture weights reflect which cell types express that gene. We infer cell types, their gene expression profiles and their spatial distribution by unmixing the spatial point processes corresponding to the ISH images for 1743 genes. This is in notable contrast to the information provided by single-cell RNA sequencing, which can measure the gene expression profile of individual cells to high accuracy but where, due to the destructive measurement process, all information about the spatial position and distribution of cell types is lost.
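To make the mixture view concrete, the following is a minimal simulation sketch, not the model fitted in this paper. It assumes three hypothetical cell types with made-up depth profiles, a made-up expression profile for one gene, and Poisson-distributed cell counts per depth bin (a convenience for simulation only). It illustrates that individual brains differ in their cell counts, while the average over brains approaches the weighted sum of the cell-type intensities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 cell types, each with its own depth-dependent
# intensity (expected number of cells per depth bin) in a cortical column.
n_bins = 20
depth = np.linspace(0.0, 1.0, n_bins)

def bump(center, width, scale):
    """Gaussian-shaped intensity profile over depth (illustrative only)."""
    return scale * np.exp(-0.5 * ((depth - center) / width) ** 2)

cell_type_intensity = np.stack([
    bump(0.2, 0.08, 30.0),   # type A: superficial
    bump(0.5, 0.10, 20.0),   # type B: middle
    bump(0.8, 0.07, 25.0),   # type C: deep
])                            # shape (3, n_bins)

# Hypothetical expression profile of one gene: the fraction of cells of
# each type that appear labeled in the ISH image for that gene.
gene_weights = np.array([0.9, 0.0, 0.4])

# Labeled cells in this gene's image follow a point process whose
# intensity is the weighted sum of the cell-type intensities.
gene_intensity = gene_weights @ cell_type_intensity

def sample_counts(intensity):
    """Labeled-cell counts per depth bin for one brain (Poisson assumption)."""
    return rng.poisson(intensity)

# Individual brains give different counts ...
brains = np.stack([sample_counts(gene_intensity) for _ in range(2000)])
# ... but the average over brains approaches the mixture intensity.
mean_counts = brains.mean(axis=0)

print(np.round(gene_intensity[:5], 2))
print(np.round(mean_counts[:5], 2))
```

Unmixing runs in the opposite direction: given such statistics for many genes, the goal is to recover both the rows of `cell_type_intensity` and the per-gene weights, neither of which is observed directly.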

1.2 Previous Work

The Allen Brain Atlas (ABA) [lein2007genome] is a landmark study which mapped the expression of about 20,000 genes across the entire mouse brain. The ABA dataset consists of high-resolution, cellular-level 2D imagery of in-situ hybridized series of brain sections, digitally aligned to a common reference atlas. However, since the in-situ images for each gene come from different mouse brains and since there is significant variability in the individual locations of labeled cells, it is not possible to register brain-wide gene expression at a resolution higher than about . Therefore, the cellular resolution detail was down-sampled to construct a coarser 3D representation of the average gene expression level in voxels.

The coarse-resolution averaged gene expression representation has been widely used and analyzed to understand differences in gene expression at the level of brain regions. Hawrylycz et al. [hawrylycz2011multi] analyzed the correlational structure of gene expression at this scale, across the entire mouse brain. However, due to the poor resolution of the average gene expression representation, it has proven challenging to use the ABA to discover the microstructure of gene expression within a brain region. To address this issue from a complementary perspective, Grange et al. [grange2014cell] used the gene expression profiles of 64 known cell types, combined with linear unmixing, to determine the spatial distribution of these known cell types. However, such an approach can be confounded by the presence of cell types whose expression profiles have yet to be characterized, and is limited by the resolution of the averaged gene expression representation.
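As an illustration of linear unmixing with known profiles, the sketch below solves a small non-negative least-squares problem for a single voxel. It is a generic example under stated assumptions, not the procedure of [grange2014cell]: the profile matrix `C`, the abundances `rho_true`, the problem sizes, and the helper `nnls_projected_gradient` are all hypothetical, and the data are noiseless.

```python
import numpy as np

def nnls_projected_gradient(C, e, n_iter=5000):
    """Minimize 0.5 * ||C @ rho - e||^2 subject to rho >= 0.

    C : (n_genes, n_types) known cell-type expression profiles
    e : (n_genes,) average expression measured in one voxel
    """
    # Step size 1 / L, where L = ||C||_2^2 is the Lipschitz constant
    # of the gradient of the objective above.
    step = 1.0 / np.linalg.norm(C, 2) ** 2
    rho = np.zeros(C.shape[1])
    for _ in range(n_iter):
        grad = C.T @ (C @ rho - e)
        # Gradient step followed by projection onto the non-negative orthant.
        rho = np.maximum(rho - step * grad, 0.0)
    return rho

rng = np.random.default_rng(0)

# Hypothetical data: 50 genes, 4 known cell types.
C = rng.uniform(0.0, 1.0, size=(50, 4))
rho_true = np.array([2.0, 0.0, 1.5, 0.5])   # cell-type abundances in the voxel
e = C @ rho_true                            # noiseless voxel-level expression

rho_hat = nnls_projected_gradient(C, e)
print(np.round(rho_hat, 4))
```

The limitation noted above is visible in this setup: if the voxel contained a cell type missing from `C`, its contribution to `e` would be attributed to the known types, which biases the estimated abundances.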

In contrast to previous approaches, we aim to solve the difficult problem of automatically discovering the gene expression profiles of cell types within a brain region by analyzing the original cellular resolution ISH imagery. We propose to use the spatial distributions of labeled cells, together with their shapes and sizes, which are a far richer representation than simply the average expression level in voxels. This spatial point process is then unmixed to determine the gene expression profile of cell types.

Most previous work on unmixing point process mixtures adopted parametric generative models where the point process is limited to some distribution family such as Poisson or Gaussian [ji2009spatial, kottas2007bayesian]. However, since we are not interested in building a generative model of a point process, but rather care more about inferring the mixing proportions (the gene expression profile), we take a simpler parameter-free approach. This approach models only the statistics of the point process; it is not a generative model, and so cannot be used to model individual points/cells.