gPRINT is a computational framework that integrates gene expression levels and chromosomal positional information to generate unique "gene prints" for annotating disease-specific cell subtypes (DSCSs). Inspired by speech recognition, this approach leverages spatial gene organization (e.g., co-regulated genes within nucleosomes) to reduce noise and improve resolution in heterogeneous datasets.
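To make the "gene print" idea concrete, the following is a minimal sketch, not the published gPRINT implementation. It assumes a gene print can be approximated by ordering a cell's expression values along the genome and averaging them within fixed-size positional bins. The function name `build_gene_print`, the `bin_size` parameter, and the chromosome ordering are hypothetical choices made only for illustration.

```python
from collections import defaultdict

# Assumed chromosome ordering for this sketch (human autosomes, then X and Y).
CHROM_ORDER = [str(i) for i in range(1, 23)] + ["X", "Y"]


def build_gene_print(expression, positions, bin_size=1_000_000):
    """Return [((chrom, bin_index), mean_expression), ...] in genomic order.

    expression: dict mapping gene name -> expression value for one cell
    positions:  dict mapping gene name -> (chromosome, start position in bp)
    bin_size:   width of each positional bin in bp (illustrative default)
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for gene, value in expression.items():
        if gene not in positions:
            continue  # a gene without coordinates cannot be placed on the print
        chrom, start = positions[gene]
        if chrom not in CHROM_ORDER:
            continue  # skip contigs outside the assumed ordering
        key = (chrom, start // bin_size)
        sums[key] += value
        counts[key] += 1
    # Sort bins by chromosome, then by position along the chromosome.
    ordered = sorted(sums, key=lambda k: (CHROM_ORDER.index(k[0]), k[1]))
    return [(key, sums[key] / counts[key]) for key in ordered]
```

In this sketch, neighbouring genes fall into the same bin and are averaged, which is one simple way that positional grouping could dampen gene-level noise. The real framework may encode position quite differently.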
Targeted benchmarking against marker-based (SingleR) and clustering-based (Seurat) methods showed gPRINT’s superiority in resolving ambiguities within mixed-cell populations (e.g., tumor-stroma interfaces). In tendinopathy, gPRINT identified novel chondrogenic tendon cells marked by SOX9/COL2A1 co-expression, a population undetectable by conventional methods. Cross-species alignment further validated conserved fibroblast subtypes driving fibrotic cascades in human, mouse, and primate models.
Further comparative analyses highlighted the generalizability of the "gene print" approach across diverse tissue types and disease models. These findings establish gPRINT as a powerful tool for single-cell data integration and subtype annotation, providing a unified platform for decoding cellular heterogeneity in human diseases.
Key findings from the study include:
1. Gene Print Framework for Cross-Dataset Annotation: The gPRINT algorithm integrates gene expression and chromosomal positional information to generate unique "gene prints," enabling platform-agnostic identification of disease-specific cell subtypes (DSCSs). Validated across 1.2 million cells, gPRINT achieved 98.37% cross-platform accuracy, outperforming traditional methods (SingleR, Seurat) in resolving ambiguous populations (e.g., tumor-stroma interfaces) and identifying novel subtypes like SOX9/COL2A1-expressing chondrogenic tendon cells in tendinopathy.
2. Mechanistic Link to 3D Genome Architecture: Hi-C data confirmed that gene prints reflect spatial co-localization of signature genes (e.g., COL1A1/ACTA2 clusters on chromosome 7) in DSCSs. Disrupting chromosomal topology (e.g., CTCF anchor deletions) reduced annotation accuracy by 63%, while CRISPR-mediated enhancer deletions abolished subtype-specific pathways (e.g., TGF-β signaling).
3. Therapeutic Discovery and Universal Utility: gPRINT prioritized drug candidates (e.g., ascorbic acid, celastrol) via CMAP database integration and revealed conserved fibrotic networks across species (human/mouse/primate). Its application to TendonBase, a multi-omics database, established a universal framework for decoding cellular heterogeneity in fibrosis, cancer, and degenerative diseases.
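As an illustration of how gene prints could support cross-dataset annotation, here is a hedged sketch that assigns a query cell to the most similar reference print. It uses cosine similarity with a nearest-reference rule as a stand-in technique; the summary above does not describe the classifier gPRINT actually uses, and the names `cosine_similarity` and `annotate` are hypothetical. It assumes that query and reference prints have already been computed over the same positional bins.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero print carries no signal
    return dot / (norm_a * norm_b)


def annotate(query_print, reference_prints):
    """Return (label, score) for the reference print closest to the query.

    query_print:      list of per-bin values for one cell
    reference_prints: dict mapping subtype label -> list of per-bin values
    """
    best_label, best_score = None, -1.0
    for label, ref in reference_prints.items():
        if len(ref) != len(query_print):
            raise ValueError("prints must be defined over the same bins")
        score = cosine_similarity(query_print, ref)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score
```

Because cosine similarity ignores overall magnitude, a rule like this is insensitive to platform-wide scaling differences, which is one plausible reason a shape-based comparison could transfer across datasets. The reported accuracy figures come from the published method, not from this sketch.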
This study established gPRINT, a computational framework that unifies cell subtype annotation across single-cell datasets by integrating gene expression and chromosomal spatial organization into unique "gene prints." Validated on 1.2 million cells, gPRINT achieved 98.37% cross-platform accuracy, identifying novel pathological subtypes and linking their gene prints to 3D chromatin architecture via Hi-C. Disrupting chromosomal topology reduced annotation accuracy by 63%, while drug-database integration prioritized candidates like ascorbic acid for fibrosis. The work, entitled “Gene print-based cell subtypes annotation of human disease across heterogeneous datasets with gPRINT,” was published in Protein & Cell on Mar. 14, 2025.
DOI: 10.1093/procel/pwaf001