Delineating the effective use of self-supervised learning in single-cell genomics
Data curation
Preprocessing
All datasets used in this study underwent a preprocessing pipeline commonly used in SCG: normalization to 10,000 counts per cell followed by log1p transformation, which mitigates technical variation and facilitates more meaningful biological comparisons. This uniform preprocessing ensured that our models were trained and evaluated on data that closely reflect the underlying biology while minimizing technical noise.
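As a minimal illustration of these two steps (using a toy AnnData object in place of a real dataset), the pipeline corresponds to standard Scanpy calls:

```python
import numpy as np
import anndata as ad
import scanpy as sc

# Toy AnnData object with raw counts standing in for a real dataset.
counts = np.random.poisson(lam=1.0, size=(100, 2000)).astype(np.float32)
adata = ad.AnnData(counts)

# Normalize every cell to 10,000 total counts, then log1p transform.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
```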
scTab dataset
The core dataset for our study stems from scTab5 and is derived from the CELLxGENE44 census version 2023-05-15, a long-term supported release hosted by CELLxGENE. This dataset represents a substantial collection of human single-cell RNA-sequencing data, encompassing 22.2 million cells spanning 164 unique cell types, 5,052 unique donors and 56 different tissues. To ensure the reproducibility of dataset creation, scTab applied stringent inclusion criteria, focusing on primary data from 10x-based sequencing protocols and ensuring broad representation across cell types and donors. The scTab data are divided into training, validation and test sets by donor, so that each set contains unique donors and label leakage is avoided. This donor-based splitting maintains a proportional representation of cells across the sets, ensures that each cell type appears in both the training and testing phases, and yields a challenging test split consisting of unseen donors. The final split resulted in 15.2 million cells for training, 3.5 million for validation and 3.4 million for testing.
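The actual split is taken from the scTab publication; purely as an illustration of donor-grouped splitting, an equivalent procedure can be sketched with scikit-learn's GroupShuffleSplit (donor labels and split fractions below are hypothetical):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical donor labels, one per cell.
rng = np.random.default_rng(0)
donors = rng.choice([f"donor_{i}" for i in range(50)], size=10_000)
cell_idx = np.arange(donors.size)

# Hold out roughly 15% of donors for testing, then split the remaining donors
# into training and validation, again grouped by donor.
outer = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=0)
trainval_idx, test_idx = next(outer.split(cell_idx, groups=donors))

inner = GroupShuffleSplit(n_splits=1, test_size=0.18, random_state=0)
train_rel, val_rel = next(inner.split(trainval_idx, groups=donors[trainval_idx]))
train_idx, val_idx = trainval_idx[train_rel], trainval_idx[val_rel]

# No donor appears in more than one split.
assert not set(donors[train_idx]) & set(donors[test_idx])
```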
Single-cell atlases
We further considered smaller, focused datasets to test whether access to the auxiliary data confers an advantage. These datasets are subsets of the CELLxGENE44 census underlying scTab5 (scTab dataset), tailored to specific applications: the Human Lung Cell Atlas (HLCA)4 (available at cellxgene.cziscience.com/e/9f222629-9e39-47d0-b83f-e08d610c7479.cxg; 775,790 cells after filtering, 51 cell types, 540,732 training, 117,541 validation, 117,517 test samples), peripheral blood mononuclear cells (PBMCs) after SARS-CoV-2 infection48 (available at cellxgene.cziscience.com/e/2a498ace-872a-4935-984b-1afa70fd9886.cxg; 78,354 cells after filtering, 30 cell types, 78,354 training, 33,761 validation, 189,756 test samples), and the Tabula Sapiens Atlas49 (available at cellxgene.cziscience.com/e/53d208b0-2cfd-4366-9866-c3c6114081bc.cxg; 335,861 cells after filtering, 161 cell types, 223,337 training, 54,908 validation, 57,616 test samples). The division into training, validation and test sets follows their allocation within the scTab dataset to prevent data leakage. Note that the training, validation and test sets of the PBMC, Tabula Sapiens and HLCA datasets are also part of the corresponding splits of the full scTab dataset.
Unseen datasets
To evaluate our models’ performance in unseen data analysis scenarios, we incorporated five unseen datasets published after the CELLxGENE census version used by scTab: (1) all non-neuronal cells from the Human Brain Atlas52 (available at cellxgene.cziscience.com/e/b165f033-9dec-468a-9248-802fc6902a74.cxg), (2) dissection, tail of hippocampus (HiT) – caudal hippocampus – CA4-DGC from the Human Brain Atlas52 (available at cellxgene.cziscience.com/e/9f499d32-400d-4c42-ac9a-fb1481844fee.cxg), (3) the single-cell analysis of prenatal and postnatal human cortical development53 (available at cellxgene.cziscience.com/e/1a38e762-2465-418f-b81c-6a4bce261c34.cxg), (4) circulating immune cells—CV19 infection, vaccination and HC54 (available at cellxgene.cziscience.com/e/242c6e7f-9016-4048-af70-d631f5eea188.cxg), and (5) human, great apes study55 (available at cellxgene.cziscience.com/e/2bdd3a2c-2ff4-4314-adf3-8a06b797a33a.cxg). The unseen datasets were filtered for the genes used in scTab; missing genes were zero-padded. The datasets were then normalized to 10,000 counts per cell and log1p transformed. The full datasets were used as the test split, that is, no samples were used for training.
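Aligning an unseen dataset to the scTab gene space amounts to reindexing its variables against the reference gene list and zero-filling the genes it lacks. A minimal sketch of such a step is shown below; the helper name `align_to_gene_space` and the toy gene names are hypothetical, not taken from the original code.

```python
import numpy as np
import pandas as pd
import anndata as ad

def align_to_gene_space(adata: ad.AnnData, reference_genes: list) -> ad.AnnData:
    """Subset to the reference genes and zero-pad genes missing from the dataset."""
    x = adata.X.toarray() if hasattr(adata.X, "toarray") else adata.X
    df = pd.DataFrame(x, index=adata.obs_names, columns=adata.var_names)
    # Reindexing keeps the reference gene order; absent genes are filled with zeros.
    df = df.reindex(columns=reference_genes, fill_value=0.0)
    return ad.AnnData(df.values, obs=adata.obs.copy(), var=pd.DataFrame(index=reference_genes))

# Toy example: a dataset measuring only two of four reference genes.
reference_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
small = ad.AnnData(
    np.ones((5, 3), dtype=np.float32),
    var=pd.DataFrame(index=["GENE_B", "GENE_A", "GENE_X"]),
)
aligned = align_to_gene_space(small, reference_genes)  # GENE_C and GENE_D become zero columns
```

Normalization and log1p transformation are then applied to the aligned matrix as described above.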
NeurIPS multiome dataset
Our study included the NeurIPS multiome dataset57, a multimodal bone marrow dataset that integrates gene-expression counts with proteomics data. While distinct in its multi-omic nature, this dataset underwent similar preprocessing steps to our other datasets, ensuring consistency across all analyses. We split the dataset into training, validation and test sets using an 80/10/10 random split. We chose 2,000 highly variable genes using Scanpy64 as a standard preprocessing step for this dataset.
Self-supervision methods
Overview
SSL is the concept that data, along with their inherent pairwise relationships, are sufficient for learning meaningful data representations, even in the absence of explicit labels. While supervised learning relies on paired observations and labels (X, Y), SSL depends on only the input X and an inter-sample relationship (X, G), where G is constructed through a data augmentation that sustains the semantic information of X8. Thereby, the method distils signal from noise65, a crucial aspect for managing challenges such as class imbalances in large, real-world datasets66. In single-cell data, this means distilling the signal of the cellular omics and removing noise sources such as batch effects or inconsistent labelling.
In the context of SCG, SSL harnesses these capabilities to navigate the complexities of vast, unlabelled datasets replete with intricate biological interdependencies. The framework is structured into two distinct phases: pre-training and fine-tuning. During the pre-training phase, the model employs contrastive learning or denoising methods to learn a data representation. This representation, characterized by its broad applicability, is then utilized in one of two ways. First, as a zero-shot SSL model, it can be directly applied to a downstream task without further label-dependent training. Alternatively, as an SSL model, it undergoes fine-tuning to enhance performance on specific tasks. This fine-tuning capitalizes on the rich data representation acquired during pre-training, adjusting and optimizing it for the desired application. The fine-tuning phase of SSL, therefore, is not only about refining the pre-training but also about strategically leveraging the pre-established data mappings for task-specific optimizations.
Core principles and strategies
The choice of self-supervised pre-training, that is, learning the inter-sample relationship, is critical to obtaining a meaningful data representation as it gives rise to the signal-to-noise distinction in the dataset. Our SSL framework is designed around two primary pre-training strategies: masked autoencoders and contrastive learning, both adapted to meet the unique demands of SCG.
Masked autoencoders
This approach follows the concept of self-prediction, where a significant portion of input features (genes in SCG) are masked (that is, set to zero), and the model is trained to reconstruct these missing parts9,45,67. It thus focuses on inter-feature dependencies. We implemented various masking strategies. (1) In random masking, 50% of genes are randomly chosen and masked, with a different selection in each iteration. (2) In gene programme (GP) masking, sets of genes with known biological functions (gene programmes) are masked such that n% of all genes are masked and reconstructed. The C8 cell-type signature gene sets from the Human MSigDB Collections68,69,70 were used. Next, we introduce isolated masked autoencoders, in which all genes but a defined set are masked, and only this set is reconstructed. (3) For this, we present a GP to transcription factor (TF) isolated masking, which predicts the expression value of the transcription factor known to correspond to a gene programme. This connection is given in the TFT transcription factor targets subset of the C3 regulatory target gene sets from the Human MSigDB Collections71,72. (4) Last, we present a GP to GP isolated masking, in which a gene programme is kept unmasked and used to predict only itself. The gene programmes for this strategy also stem from the C8 cell-type signature gene sets from the Human MSigDB Collections. These strategies are tailored to capture specific gene interactions and relationships, making them particularly suited for the intricate nature of single-cell data.
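As a concrete illustration of the random-masking objective, the following PyTorch sketch zeroes out half of the gene entries and reconstructs them with a small fully connected autoencoder. It is a minimal sketch rather than our training code; in particular, restricting the loss to the masked positions is a choice of the sketch, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MaskedAutoencoder(nn.Module):
    """Minimal fully connected autoencoder for masked gene-expression reconstruction."""

    def __init__(self, n_genes: int, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, n_genes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

def random_masking_step(model: nn.Module, x: torch.Tensor, mask_rate: float = 0.5) -> torch.Tensor:
    # Draw a fresh random mask each iteration and zero out the selected genes.
    mask = torch.rand_like(x) < mask_rate
    x_masked = x.masked_fill(mask, 0.0)
    recon = model(x_masked)
    # Reconstruction loss on the masked genes the model must recover.
    return ((recon - x)[mask] ** 2).mean()

model = MaskedAutoencoder(n_genes=2000)
x = torch.rand(32, 2000)            # stand-in for normalized, log1p-transformed expression
loss = random_masking_step(model, x)
loss.backward()
```

The GP and isolated masking variants follow the same pattern, replacing the random mask with masks derived from the MSigDB gene sets.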
Contrastive learning
Unlike self-prediction, contrastive learning focuses on relationships between different samples, that is, on inter-sample dependencies. This method minimizes distances between similar samples and maximizes distances between dissimilar ones in the embedded space. Contrastive methods are typically distinguished by their strategy for avoiding representation collapse, the trivial solution to contrastive losses in which all inputs map to a constant representation9,10. BYOL is an example of architectural regularization through its teacher–student network. Barlow twins is an example of an information maximization method that avoids collapse by maximizing the information content of the embedding. We incorporated BYOL and Barlow twins into our framework to benchmark both schools of thought. We used a combination of negative binomial noise and masking as data augmentation, simulating the expected noise profiles in SCG data.
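The sketch below illustrates how two augmented views of the same cells might be generated from negative binomial noise plus random masking. The noise parameterization and the way the noise is combined with the (log-transformed) expression values are illustrative assumptions, not the exact augmentation used in our implementation.

```python
import torch

def augment(x: torch.Tensor, mask_rate: float = 0.5,
            nb_mean: float = 0.5, nb_dispersion: float = 2.0) -> torch.Tensor:
    """Create one augmented view: add negative binomial noise, then mask genes."""
    # Negative binomial noise simulating sequencing count noise (parameters illustrative).
    total_count = torch.full_like(x, nb_dispersion)
    probs = torch.full_like(x, nb_mean / (nb_mean + nb_dispersion))
    noise = torch.distributions.NegativeBinomial(total_count=total_count, probs=probs).sample()
    x_noisy = x + torch.log1p(noise)
    # Random masking, as in the masked autoencoder objective.
    mask = torch.rand_like(x) < mask_rate
    return x_noisy.masked_fill(mask, 0.0)

x = torch.rand(32, 2000)                     # stand-in for normalized, log1p-transformed expression
view_1, view_2 = augment(x), augment(x)      # two views fed to BYOL or Barlow twins
```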
Zero-shot SSL concept
A key concept in our study is the differentiation between the zero-shot SSL and SSL models. The zero-shot SSL model represents the initial phase of pre-training, where the model learns from the data without any label guidance through self-supervision algorithms. This model, even without fine-tuning, can provide meaningful insights into data, as demonstrated in various downstream tasks. The SSL model, in contrast, undergoes an additional fine-tuning phase tailored to specific downstream applications. This distinction allows us to explore the full spectrum of SSL’s capabilities, from a generalized understanding of data to specialized, task-specific optimizations.
In summary, our self-supervision methods in SCG are defined by a nuanced application of masked autoencoders and contrastive learning adapted to the field’s specific challenges. The zero-shot SSL concept plays a central role in our approach, highlighting the potential of SSL to derive meaningful insights from large-scale, unlabelled datasets. This methodological framework sets the stage for a detailed exploration and benchmarking of SSL’s impact on various SCG tasks, as detailed in the following sections of our study.
Downstream applications in SCG
Cell-type annotation
Cell-type annotation in SCG is a classification task in which data samples, represented as vectors of RNA-sequencing counts, are assigned to distinct cellular identities. Although seemingly straightforward, this task is complicated by the noise and heterogeneity inherent in large-scale datasets. We use the scTab dataset as the primary basis for our cell-type annotation analysis, employing various SSL methods and comparing their effectiveness against supervised approaches. We train the classifier using a cross-entropy loss. We evaluate cell-type annotation performance by kNN (k = 5) classification, using the scTab validation set as the neighbour pool for each test sample. The validation set is sufficiently large and diverse, making it a simple and scalable alternative to the training set for this purpose. This choice ensures an identical evaluation across models, including the zero-shot SSL model, which has no prediction head. Our evaluation metrics focus on the macro F1 score, reflecting the models’ ability to handle class imbalances, supplemented by the micro F1 score as an additional, frequency-weighted perspective. Exemplary loss curves for this training are shown in Supplementary Fig. 5 and a list of hyperparameters is shown in Supplementary Table 1.
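The kNN evaluation can be reproduced with scikit-learn; in the sketch below, random arrays stand in for the model embeddings of the validation (neighbour pool) and test cells and for their cell-type labels.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

# Toy embeddings and labels standing in for model representations.
rng = np.random.default_rng(0)
emb_val, y_val = rng.normal(size=(1000, 64)), rng.integers(0, 10, size=1000)
emb_test, y_test = rng.normal(size=(200, 64)), rng.integers(0, 10, size=200)

# k = 5 nearest neighbours in the embedding space, as in our evaluation.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(emb_val, y_val)
pred = knn.predict(emb_test)

macro_f1 = f1_score(y_test, pred, average="macro")   # primary metric
micro_f1 = f1_score(y_test, pred, average="micro")   # supplementary metric
```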
Gene-expression reconstruction
Gene-expression reconstruction, the task of reconstructing expression counts across the transcriptome, still presents challenges due to the inherent noise and dispersion in RNA-sequencing data. Our approach is inspired by the popular scVI model38 but diverges in its use of input data: while scVI takes raw counts as input and models them with a negative binomial distribution, our method employs normalized data for consistency with the other downstream tasks. Nonetheless, similar to scVI, we predict the parameters of the negative binomial distribution. Modelling distribution parameters rather than directly predicting RNA-sequencing counts enhanced reconstruction accuracy in our experiments. We opt for a non-variational, fully connected autoencoder framework consistent with our cell-type prediction approach. Performance evaluation encompasses MSE as well as uniform and weighted explained variance. We report the weighted explained variance as the primary metric because it best reflects the actual reconstruction efficacy while accounting for class imbalances; MSE and uniform explained variance are included in our framework as supplementary evaluations. The hyperparameters used are shown in Supplementary Table 1.
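A minimal PyTorch sketch of a negative binomial output head is shown below. It assumes a gene-wise dispersion shared across cells and a mean-dispersion parameterization of the likelihood; the exact parameterization and loss in our implementation may differ.

```python
import torch
import torch.nn as nn

class NBDecoderHead(nn.Module):
    """Decoder head predicting negative binomial mean and dispersion per gene (illustrative)."""

    def __init__(self, latent_dim: int, n_genes: int):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(latent_dim, n_genes), nn.Softplus())
        # One dispersion parameter per gene, shared across cells (a common simplification).
        self.log_dispersion = nn.Parameter(torch.zeros(n_genes))

    def forward(self, z: torch.Tensor):
        mu = self.mean(z)
        theta = torch.exp(self.log_dispersion).expand_as(mu)
        return mu, theta

def nb_negative_log_likelihood(x, mu, theta, eps: float = 1e-8):
    # Negative binomial log-likelihood parameterized by mean (mu) and inverse dispersion (theta).
    log_prob = (
        torch.lgamma(x + theta) - torch.lgamma(theta) - torch.lgamma(x + 1)
        + theta * torch.log(theta + eps) + x * torch.log(mu + eps)
        - (x + theta) * torch.log(mu + theta + eps)
    )
    return -log_prob.mean()

head = NBDecoderHead(latent_dim=64, n_genes=2000)
z = torch.randn(32, 64)           # stand-in for the encoder output
x = torch.rand(32, 2000)          # stand-in for target expression values
mu, theta = head(z)
loss = nb_negative_log_likelihood(x, mu, theta)
```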
Cross-modality prediction
Cross-modality prediction is the task of predicting one modality from another. Such a task can augment cellular data with a different modality, offering an additional perspective on the same cells. For pre-training, we used masking (1) on the auxiliary scTab dataset and (2) on the downstream task dataset. For fine-tuning, we included two studies, both using normalized and log1p transformed RNA-sequencing counts as the originating modality. First, we predicted all 134 normalized and log1p transformed protein counts (proteomics) available in the NeurIPS CITE-seq dataset57. We trained the models on a random training, validation and test split using coupled RNA and proteomics counts. Second, we predicted all 116,490 TF-IDF (term frequency-inverse document frequency)73-normalized ATAC counts available in the NeurIPS multiome dataset57. Again, we trained the models on a random training, validation and test split using the coupled RNA and ATAC counts. Hyperparameters are shown in Supplementary Table 1.
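TF-IDF weighting of a sparse peak-by-cell matrix can be sketched with scikit-learn's TfidfTransformer; note that the exact TF-IDF variant used for the ATAC preprocessing may differ from this generic formulation, so the snippet is purely illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Toy near-binary ATAC accessibility matrix (cells x peaks).
rng = np.random.default_rng(0)
atac_counts = csr_matrix(rng.integers(0, 2, size=(100, 500)))

# TF-IDF weighting, commonly used to normalize sparse ATAC counts.
tfidf = TfidfTransformer()
atac_tfidf = tfidf.fit_transform(atac_counts)
```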
Data integration
Data integration is the joint analysis of a set of related SCG datasets that are often curated from various donors, with different pipelines and in different settings, which introduces batch effects and technical artefacts. The scIB63 integration benchmark is a well-established analysis that determines how well relevant biological signals are preserved in a model’s data representation while unwanted batch effects are removed, yielding a well-mixed representation of the different datasets. Accordingly, the scIB pipeline reports two groups of metrics, bio-conservation and batch-correction, each consisting of several evaluations computed with different methods. The hyperparameters for data integration are shown in Supplementary Table 1.
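The scIB metrics can be computed with the scib package. The sketch below uses a toy AnnData object with a random embedding; argument names follow the scib documentation, the embedding key `X_emb` is illustrative, and only a subset of the metrics is switched on, so treat this as a sketch rather than our exact evaluation code.

```python
import numpy as np
import anndata as ad
import scanpy as sc
import scib

rng = np.random.default_rng(0)
n_cells = 300
adata = ad.AnnData(rng.random((n_cells, 50)).astype(np.float32))
adata.obs["batch"] = rng.choice(["batch_a", "batch_b", "batch_c"], size=n_cells)
adata.obs["cell_type"] = rng.choice(["t_cell", "b_cell", "monocyte"], size=n_cells)
adata.obs = adata.obs.astype("category")

# The "integrated" object carries the model embedding; random values stand in here.
adata_int = adata.copy()
adata_int.obsm["X_emb"] = rng.random((n_cells, 16)).astype(np.float32)
sc.pp.neighbors(adata_int, use_rep="X_emb")

# A subset of the scIB bio-conservation and batch-correction metrics on the embedding.
results = scib.metrics.metrics(
    adata,
    adata_int,
    batch_key="batch",
    label_key="cell_type",
    embed="X_emb",
    type_="embed",
    silhouette_=True,
    graph_conn_=True,
)
```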
Contrastive method choice
For this benchmark, we developed contrastive methods based on BYOL and Barlow twins, two well-performing negative-pair-free methods. This choice is motivated by their proven performance46,47 and by their reliance solely on data augmentations rather than on sampling negative pairs in a large and heterogeneous dataset. Other reasonable choices include simple Siamese networks74, which were excluded owing to repeatedly observed training instability in our setting, and SimCLR12, which was not pursued further because BYOL and Barlow twins showed superior performance in previous benchmarks. While VICReg11 is promising by design, we focused on BYOL and Barlow twins owing to their robustness. As contrastive learning methods generally performed worse than the masking approaches, we prioritized the latter for thorough investigation.
Batch effect
Batch effects were not explicitly corrected when working with large collections, such as scTab, which covers 249 datasets; including many datasets appears to reduce the relative impact of such effects on the overall variation. When working with fewer datasets, as in the data integration experiments covering three datasets, a batch covariate needs to be included to avoid strong batch effects.
Computational resources
The experiments for this work were conducted on graphics processing unit (GPU) servers with the following specifications:

- 16x Tesla V100 GPUs with 32 GB random access memory (RAM) per card
- 2x Tesla V100 GPUs with 16 GB RAM per card
- 8x A100-SXM4 GPUs with 40 GB RAM per card
All pre-training methods were trained on a single GPU for 2 days with early stopping, using up to 160 GB of system memory at a batch size of 8,192. For practitioners with limited GPU memory, smaller batch sizes can reduce memory usage. For example, a batch size of 2 uses under 1 GB of VRAM but greatly increases training time (>200 h per epoch on scTab). All fine-tuning methods were trained on a single GPU for 1 day with early stopping. All models were checked for convergence in the validation metrics.
Terminology
In this paper, we use the following terminology:
| Term | Definition | Example |
|---|---|---|
| Architecture | Neural network structure | Multi-layer perceptron, transformer |
| Method | Training approach | SSL, supervised learning, unsupervised learning |
| Model | A trained architecture using a specific method | scTab5, scGPT36, Nicheformer38 |
We use the above table’s terminology throughout the paper to distinguish between architecture, method and model. This distinction clarifies how different methods impact models that share similar architectures. For example, scGPT trains a transformer architecture using SSL.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.