The scFlow analytical toolkit and pipeline was delivered at the Department of Brain Sciences at Imperial College London. It addressed a need, both within the department and in the wider field, for flexible, reproducible, and scalable analyses of single-cell/nucleus RNA-sequencing data.
scFlow comprises two components: (1) an independent R package, scFlow, containing a toolkit for the analysis of single-cell RNA-sequencing data, and (2) a Nextflow pipeline (nf-core/scflow) that orchestrates end-to-end, automated, and scalable single-cell analyses using the scFlow R package.
Users of the pipeline specify analytical parameters (documented here) and the location of input files (gene-cell matrices). All analytical steps are then performed automatically, and an interactive HTML report is generated for each step. These reports capture key parameters and analysis results and may be used to guide parameter optimization.
An example interactive HTML report for automated quality control of a single sample may be viewed here.
In this example, the report captures the results from the initial step of quality control (step a, above) performed on a sample. This typically includes ambient RNA profiling using the EmptyDrops algorithm, thresholding to remove low-quality cells and uninformative (non-expressed) genes, and identification and filtering of doublets using the DoubletFinder algorithm.
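The thresholding part of this step can be illustrated with a small, self-contained Python sketch. It is not the scFlow implementation (scFlow is an R package), and it leaves out ambient RNA profiling and doublet detection entirely. The function name `qc_filter`, the threshold names, and the toy matrix are all assumptions made for illustration.

```python
from typing import List, Tuple


def qc_filter(
    counts: List[List[int]],   # counts[i][j] = count of gene i in cell j
    genes: List[str],
    cells: List[str],
    min_counts: int,           # hypothetical threshold: minimum total counts per cell
    min_genes: int,            # hypothetical threshold: minimum detected genes per cell
    min_cells: int = 1,        # hypothetical threshold: cells a gene must be detected in
) -> Tuple[List[List[int]], List[str], List[str]]:
    """Drop low-quality cells, then drop genes not expressed in the remaining cells."""
    n_genes, n_cells = len(genes), len(cells)

    # Per-cell quality metrics: library size and number of detected genes.
    total = [sum(counts[i][j] for i in range(n_genes)) for j in range(n_cells)]
    detected = [sum(1 for i in range(n_genes) if counts[i][j] > 0) for j in range(n_cells)]

    # Keep cells that pass both thresholds.
    keep_cells = [
        j for j in range(n_cells)
        if total[j] >= min_counts and detected[j] >= min_genes
    ]

    # Keep genes detected in at least `min_cells` of the retained cells.
    keep_genes = [
        i for i in range(n_genes)
        if sum(1 for j in keep_cells if counts[i][j] > 0) >= min_cells
    ]

    filtered = [[counts[i][j] for j in keep_cells] for i in keep_genes]
    return filtered, [genes[i] for i in keep_genes], [cells[j] for j in keep_cells]


# Toy gene-cell matrix (rows are genes, columns are cells).
genes = ["GeneA", "GeneB", "GeneC", "GeneD"]
cells = ["cell1", "cell2", "cell3"]
counts = [
    [5, 0, 3],
    [2, 1, 4],
    [0, 0, 0],
    [1, 0, 0],
]

filtered, kept_genes, kept_cells = qc_filter(
    counts, genes, cells, min_counts=5, min_genes=2
)
```

With these toy thresholds, `cell2` is dropped for having too few counts and `GeneC` is dropped because it is not expressed in any retained cell. In a real analysis the thresholds would be chosen with the help of the QC report rather than fixed in advance.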
The pipeline permits the revision of parameters during a run. The cache-based resume functionality of Nextflow means that only the tasks downstream of a revised parameter are re-run, which keeps analyses time- and cost-efficient.
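The idea behind cache-based resumption can be sketched in a few lines of Python. This is a simplified illustration, not Nextflow's actual hashing scheme: the task graph, the parameter names, and the `run` helper are hypothetical. Each task's cache key is derived from the parameters it reads and from the keys of its upstream tasks, so a changed parameter invalidates only the tasks that depend on it.

```python
import hashlib
import json
from typing import Dict, List

# Hypothetical task graph: task name -> (parameters it reads, upstream tasks).
# Dict insertion order is used as a valid topological order.
TASKS = {
    "qc": (["min_counts"], []),
    "integrate": (["n_latent"], ["qc"]),
    "cluster": (["resolution"], ["integrate"]),
    "report_qc": ([], ["qc"]),
}


def task_key(name: str, params: Dict[str, float], keys: Dict[str, str]) -> str:
    """Hash the task name, the parameters it uses, and its upstream cache keys."""
    used, upstream = TASKS[name]
    payload = {
        "task": name,
        "params": {p: params[p] for p in used},
        "upstream": [keys[u] for u in upstream],
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


def run(params: Dict[str, float], cache: Dict[str, str]) -> List[str]:
    """Run the pipeline, skipping cached tasks; return the tasks actually executed."""
    keys: Dict[str, str] = {}
    executed: List[str] = []
    for name in TASKS:
        keys[name] = task_key(name, params, keys)
        if keys[name] not in cache:
            cache[keys[name]] = f"result of {name}"  # stand-in for the real work
            executed.append(name)
    return executed


cache: Dict[str, str] = {}
first = run({"min_counts": 500, "n_latent": 30, "resolution": 0.8}, cache)
second = run({"min_counts": 500, "n_latent": 30, "resolution": 1.2}, cache)
third = run({"min_counts": 500, "n_latent": 50, "resolution": 1.2}, cache)
```

The first run executes every task. Changing only `resolution` re-runs `cluster` alone, while changing `n_latent` re-runs `integrate` and, because its upstream key changed, `cluster` as well. The `qc` and `report_qc` results are reused throughout.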
The pipeline has complex dependencies, including over 400 R packages in addition to system-level dependencies. These are managed following best practices in data science, including containerization using Docker. Each versioned release of the pipeline is assigned a unique digital object identifier (DOI), linking pipeline code with software dependencies to ensure reproducibility.
All code is open source (GPL3), with development managed using version control on GitHub and continuous integration using GitHub Actions. New versions are automatically tested on an example dataset before release, which helps to keep the pipeline in a working state.
The nf-core framework and community have been central to this effort by creating standards that ensure pipelines are portable, that is, independent of compute infrastructure, so that they can be deployed across institutions and research facilities. With minimal configuration, scFlow can be run on a local workstation, a private or university HPC, or a cloud environment (e.g. GCP, AWS).
At the time of writing, scFlow has already been trusted to analyse diverse datasets both internally, including for The Multi-'Omics Atlas Project for Alzheimer's disease and in studies of multiple sclerosis, and internationally, including in studies of schizophrenia in Geneva. The total value of samples analysed even before the public release of the pipeline is approaching $1m.
If you're interested in learning more, take a look at our bioRxiv pre-print here.
Special thanks to Dr Nathan Skene for valuable guidance, to the team for their contributions, including Dr Nurun Fancy (impacted pathway analysis in particular), Dr Mahdi M Marjaneh (dataset integration), and Mr Alan E Murphy (containerization), and to Professor Paul M Matthews for funding and valuable feedback.