scRNAseq: an encounter with Simpson's Paradox in the wild

Flipping signs

Solving the riddle of model coefficient sign reversals in scRNAseq expression data

By Combiz Khozoie |

06 May 2022

∙

4 minute read

d3 | simple square grid

Flipping signs—Interactive header art reminiscent of a gene expression heatmap

Background

Single-cell RNA sequencing (scRNAseq) analyses typically include differential gene expression (DGE) results as a key output. These DGE models can identify differences in gene expression between experimental groups (e.g. healthy vs disease) while accounting for sources of intra- and inter-sample variance. And increasingly, studies are applying scRNAseq DGE models with a numeric dependent variable, such as tissue staining measures quantified by immunohistochemistry (IHC). Such models allow for identifying differences in gene expression associated with, for example, quantitative measurements of local disease pathology and spatiotemporal disease progression.

Our recent work used this approach to identify gene expression differences associated with the abundance of malformed protein deposits—amyloid-beta and phosphorylated tau—in the brains of patients with Alzheimer's disease (AD). Similarly, our upcoming multiple sclerosis (MS) study identifies gene expression differences associated with the density (death) of neurons and the activation of microglia (brain-resident immune cells).

The extensive iteration of the DGE approach and model specification for these studies requires a separate discussion, a subject for a future post. Here, I share an interesting statistical phenomenon encountered during the optimisation process: an encounter with Simpson's Paradox.

Cross-validating our model

We used human brain cortical grey matter from the same tissue block with two slightly different protocols: with and without a novel cell-sorting strategy to enrich glial (non-neuronal) cells. Unusually for snRNAseq, this provided a quasi-replicated cross-validation dataset (with a proviso for differences in protocol and relative cell-type abundances). We could therefore test whether our DGE model predictions were generalisable across datasets.

We focused on microglia, a cell type represented in both datasets. The overlap of differentially expressed genes between the two datasets was predictably high (hypergeometric probability $\lt10^{-80}$). But when we considered the directionality of the genes, we were surprised to find that the overlap dropped ~50%. Genes up-regulated with pathology in one dataset could appear down-regulated in the other dataset, and vice-versa.

Genes up-regulated with pathology in one dataset could appear down-regulated in the other dataset, and vice-versa.

Diagnosing the problem

The initial effort to understand the contradictory results involved aligning the two datasets further; we retained only samples present in both experiments and applied stratified random sampling to ensure the same relative contribution of cells from each sample in both datasets. The issue remained.

At this stage, attention turned to our model specification. We were computing gene expression differences associated with a quantitative histopathology marker: -

$\sim\text{histopath_marker}$ + $\text{(1|individual)}$ + $\text{cngeneson}$ + $\text{pc_mito}$ + $\text{sex}$ + $\text{brain_region}$

The model is typical for a generalised linear mixed-effects model fit using MAST. A random effect for individuals accounts for the correlation structure among measures from different cells within an individual ('(1|individual)'). Additionally, we included the cell-level confounding variables 'cngeneson' (number of genes switched on, also known as the cellular detection rate) and the relative proportion of gene counts mapping to mitochondrial genes ('pc_mito'). And finally, we included the sample-level confounders 'sex' and 'brain_region'.

We diagnosed—after not inconsiderable reflection—the issue: an easy to miss model misspecification. Although the literature for scRNAseq DGE statistics is replete with statements underscoring the importance of a random effect term for individuals, these statements are imprecise. The random effect term should account for the correlation structure among measures from different cells within a unique sample, not each individual. While these may be equivalent for some studies, this is frequently not the case.

The random effect term should account for the correlation structure among measures from different cells within a unique sample, not each individual.

Our study evaluated two samples per individual, each from a different brain region. We selected these regions based on the known spatiotemporal progression of Alzheimer's disease: a region affected early in the disease and another late. These regions exhibit different associations with the model outcome variable—the local measure of disease pathology. The original model with a random effect for individuals inappropriately aggregated these measures resulting in collinearity and erratic results. A random effect term for sample identifier, rather than individual, fully resolved the paradox and allowed the partial associations to be correctly determined, producing consistent results across datasets (low hypergeometric probability values with 100% of significant genes directionally matched).

Visualising Simpson's Paradox with synthesised data

I produced an animated visualisation with synthesised data to explain the phenomenon below:

The models fitted in this visualisation used linear regression rather than the two-part GLMM used by MAST. Still, the principle is the same: the marginal (aggregated) association is a reversal relative to the partial associations with the introduction of group information (represented by colours).

Summary

Fortunately, this issue was picked up early during our model optimisation and validation. Worryingly, I'm aware of examples of published scRNAseq studies in high-impact journals with models strongly suggestive of multicollinearity and the potential for erratic results analogous to those described here. I've also encountered similarly high-impact works that have omitted the DGE model specification altogether!

I'll explore other aspects of DGE analysis in a future post, including algorithm choices, model selection, power calculations, and more. Stay tuned!

Tags:

Return to home

Categories

Tags

Background

Cross-validating our model

Diagnosing the problem

Visualising Simpson's Paradox with synthesised data

Summary