Systemic lupus erythematosus (SLE) is the tenth most common cause of death among females 15-24 years old in the US (Yen2018). SLE is one of many autoimmune diseases, which are diseases in which a patient’s immune system mistakes parts of their own body as foreign, attacking their healthy organs and tissue (LupusFoundationofAmerica2020).
SLE can be driven by defects in the innate immune system and/or the adaptive immune system. SLE patients are often characterized by high levels of interferon-1, which causes inflammation in the innate immune system in response to viruses. In SLE, high interferon levels can be caused by a variety of factors, such as neutrophil extracellular traps (Bengtsson2017). Most SLE patients also have high levels of autoantibodies, which are antibodies directed against self cells and are created by mature B cells (plasma cells) (Dema2016). Autoantibodies cause a much more targeted response than the innate immune system, but SLE patients can have a wide range of autoantibodies - one study found over 180 autoantibodies expressed in SLE patients (Yaniv2015). Some patients with lupus do not even have autoantibodies, and many of the autoantibodies in SLE are also found in other rheumatic diseases (Egner2000).
The heterogeneity of lupus symptoms and immune pathways affected makes it difficult to treat, because different drugs work well on different patients. Merrill2017 found that certain standard drugs (anti-rheumatic drugs and immunosuppressants) affect immune pathways differently in interferon-low and interferon-high patients. While there is still debate on whether SLE is one disease or many (Agmon-Levin2012), it is clear that subdividing SLE patients into categories will help treat patients.
Previous studies have tackled this problem by dividing patients based on antibody levels (Artim-Esen2014), gene expression (Toro-Dominguez2018), and immune molecule levels (Hamilton2018). However, none of these studies have reached a consensus on the best subdivision of SLE. Guthridge2020 used all three of these factors to divide SLE patients into seven clusters with unsupervised machine learning. However, in practice, gathering these different types of patient data to categorize a patient into an SLE subdivision is infeasible.
In this study, we use unsupervised machine learning to categorize SLE patients using only gene expression data, which is more accessible than all three types of data combined. This would help determine if gene expression data alone reveals similar patterns in immune pathway expression as does its combination with antibody levels and immune molecule levels.
2 Data and Methods
We use a gene expression dataset available on GEO (accession number GSE138458) containing data collected by Guthridge2020. The data includes 336 samples in total, with 24 control patients and 198 SLE patients. 108 of the SLE patients have two or more samples taken. The data analysis methods are shown in fig:flowchart in A.1. Data pre-normalized by Guthridge2020
Given the high dimensionality nature of gene expression data, dimension reduction techniques are required. Prior work has used unsupervised learning following dimension reduction of gene expression data for other diseases such as cancer (Shi2010). The dimensionality of our 47,323 gene data is reduced using three separate methods: Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and a simple autoencoder (AE), which are intended to minimize the effects of random variation on the unsupervised clustering model in different ways.
With both PCA and UMAP, 200 reduced features are selected. In PCA, these explain 96.29% of the variance in the original 47,323 genes. UMAP is a nonlinear model (unlike PCA) similar to t-SNE, used for visualization as well as nonlinear dimension reduction (McInnes2018)
. The autoencoder aims to reduce the loss of information between the original inputs (genes) and the decoded output of the same dimension. AEs with both linear and sigmoid activation functions are validated and the linear AE is found to perform much better after 100 epochs (validation loss 0.068) than the sigmoid AE (validation loss 48.28). So, we select the 1000 encoded components from the linear AE for subsequent clustering.
The three datasets with reduced features are then used for k-means clustering. To determine the best number of clusters, we use the YellowBrick Python package, which creates visualizations quantifying the “elbow method” used on the metrics distortion score, silhouette score, and Calinski-Harabasz score. All these metrics are found to converge on 6 clusters for each dataset. k-means clustering is then used to derive 6 clusters from each dataset.
For visualization and interpretation of the clusters, we use 27 pre-existing modules created by Chaussabel2008
. Each module represents a group of genes with a common function. These are used to calculate module scores for the three datasets. Module scores for each cluster represent the percentage of genes in each module that were significantly upregulated (i.e., overexpressed) or downregulated (i.e., underexpressed) in that cluster as compared to the controls, based on a two-tailed t-test ().
3 Results and Discussion
fig:heatmaps shows heatmaps generated from the module scores for the 3 feature-reduced datasets. These heatmaps show the percentage of underexpressed (brown) or overexpressed (purple) genes for SLE patients as compared to the controls.
The clusters originating from the PCA and UMAP dimensionality reductions show very similar patterns in the upregulated and downregulated modules, while the clusters originating from the AE mostly show a consistent level of increased or decreased gene expression across all modules, excepting cluster 5.
The patients in the clusters created from the PCA and UMAP dimensionality reduction techniques and AE cluster 5 can be designated as belonging to one of three groups: (a) interferon-driven SLE, (b) autoantibody-driven SLE, and (c) SLE caused by mitochondrial apoptosis. The first two groups substantiate results from prior literature, while the third group presents a pathway that suggests a novel cause of SLE.
3.1 Interferon-driven SLE
In lupus, type 1 interferon levels are often elevated, which can lead to inflammation and tissue damage caused by the innate immune system (Crow2014). PCA cluster 6 and UMAP cluster 3 in fig:heatmaps both display substantial upregulation of genes related to interferons and inflammation. These two clusters validate the patterns also observed in Guthridge2020’s clusters 1, 4 and 6. All of these upregulated genes are related to the innate immune response. Those patients also have underexpressed B and T cells and normal expression of plasma cells, which would all be overexpressed if production of autoantibodies by plasma cells was the main reason for autoimmunity, rather than interferon levels.
3.2 Autoantibody-driven SLE
Many of the other PCA and UMAP clusters displayed upregulation of antibody-producing plasma cells; particularly PCA clusters 2, 3, 4 and 5 and UMAP clusters 4, 5, and 6. Guthridge2020 observed a similar trend, where their clusters 2, 3, and 5 had higher T cell, B cell, and plasma cell related expression. While autoantibodies are known to be common in SLE, the diversity of autoantibodies (as discussed in Yaniv2015) means that there is still work to be done understanding what is different among the antibodies produced in these four PCA clusters and three UMAP clusters. Some of these differences might come from genes used to create the clusters that were not included in the modules used for the heatmap visualization.
Brant2020, who grouped lupus patients based on their correlation between gene expression and disease activity, found one cluster where neutrophil levels correlated to disease activity and one where lymphocyte levels correlated to disease activity. Since neutrophil extracellular traps are one way that interferon levels become elevated, their neutrophil-correlated group might correspond to our high-interferon group, and their lymphocyte-correlated group might correspond to our antibody-driven group. More analysis should be done on disease activity correlation in our data to confirm this.
3.3 SLE caused by mitochondrial apoptosis
PCA clusters 1 and 4, UMAP cluster 2, and Autoencoder cluster 5 display a different pattern from many of the other clusters. In these clusters, many of the modules labeled as Undetermined by Chaussabel2008 were underexpressed. A closer look at the genes in these Undetermined modules reveals that they include mitochondrial ribosomal proteins, mitochondrial elongation factors, and proteins in the cAMP-signaling pathway. Mitochondrial ribosomal proteins, in addition to their ribosomal functions, are involved in apoptotic (programmed cell death) pathways (Kim2017), and cAMP signaling regulates mitochondrial apoptosis (Valsecchi2013). Apoptosis is known to be a factor in SLE, but mainly because ineffective clearance of apoptotic cells can expose B and T cells to intracellular material, leading to the creation of autoantibodies against this intracellular material (Mevorach2003).
We suggest that for the patients in these clusters, dysregulation of mitochondrial pathways or signaling from outside molecules, possibly lymphocytes, could cause mitochondrial apoptotic pathways to become activated in healthy cells, destroying healthy cells as is characteristic of SLE. These healthy cells would have a range of gene expression of mitochondrial proteins, but the cells with higher expression of the proteins would activate the apoptotic pathway and die. Only cells with lower expression levels would survive, so lower expression levels were found in our study. These lower expression levels would also impair mitochondrial functions, which has been observed to be true in SLE patients (Leishangthem2016).
Cluster 7 from the Guthridge2020 study also had low expression of mitochondrial respiration and mitochondrial stress genes (not discussed in their study). The discovery of this cluster of patients using two completely different machine learning approaches corroborates the idea that the mitochondrial apoptotic pathway is a novel cause for SLE. Future studies should investigate to a further extent the mitochondrial apoptotic pathway in SLE patients as a reason for destruction of self cells in addition to a way that autoantibodies are produced.
Appendix A Supplement
a.1 Methods Flowchart
fig:flowchart illustrates the data analysis steps undertaken in this study. A detailed description is provided in Section 2.
a.2 Dimensionality Reduction Models’ Specifications
The following specifications were used in the three dimensionality reduction techniques.
: keras API was used for the simple autoencoder with: Input dimension: (47323,), Output dimension: (1000,), activation = ‘linear’ in encoded and decoded layers, epochs = 50, optimizer = ‘adam’, loss = ‘mse’, batch_size = 64, shuffle = True, validation_split = 0.2.
UMAP: UMAP parameters used: n_components = 200, n_neighbors = 15, min_dist = 0.1, metric = ‘euclidean’.
: Linear dimensionality reduction using Singular Value Decomposition (SVD) was used with the PCA class in sklearn API with the following parameters: n_components = 200, svd_solver = ‘randomized’.
a.3 Analyzing Patients with Multiple Samples
For patients who had multiple samples taken, k-means following the autoencoder classified them into the same cluster 97.3% of the time, while k-means following PCA and UMAP classified them into the same cluster 32.9% and 45.7% of the time respectively (fig:same/diff). While gene expression data is correlated with SLE disease activity(Kegerreis2019; Toro-Dominguez2018), Petri2019 found that the majority of gene expression signatures were stable in patients over time. This suggests that the autoencoder’s dimensionality reduction may have emphasized the stable gene expression signatures, causing them to be a major factor in the clustering, but that PCA and UMAP, which aimed to preserve more of the variance in the data, did not maintain the data from genes whose expression was stable over time. Many of these more stable genes might not have been related to the immune system, so they were not included in Chaussabel2008’s coexpression modules. Thus, the more variable modules that were in the heatmap would have shown a lot of variation between the patients in each cluster, causing the clusters in the autoencoder heatmap to show a more consistent level of expression across all genes in the modules. Further analysis should be done to determine the level of variation in gene expression in the modules for the autoencoder clusters in comparison to the PCA and UMAP clusters, and to determine whether the PCA and UMAP clusters correlated to disease activity more than the autoencoder clusters, which this idea would imply.