DNA methylation (DNAm) is one of the earliest identified epigenetic modifications and plays an essential role in regulating normal cellular processes, embryogenesis, and tumor development and progression. In recent years, advances in single-cell DNA methylation (scDNAm) profiling have provided unprecedented opportunities to explore epigenetic differences between cells at high resolution. Most current analyses of scDNAm data are based on cell-by-region matrices. A simple and effective way to construct such a matrix is genome window binning: the genome is divided into fixed-length windows (e.g., 100 kbp), and the average DNA methylation level of each cell in each window is computed. Before any downstream analysis, however, a critical issue remains: how to handle not available (NA) values in scDNAm data. In single-cell RNA sequencing (scRNA-seq) or single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq) data, missing values are usually represented as zero read counts. In scDNAm data, by contrast, captured methylation sites are typically binary, either methylated (read count of 1) or unmethylated (read count of 0), while uncaptured sites are marked as NA. When a cell-by-region matrix is built by window binning, the uneven distribution of methylation sites across the genome and the choice of window size leave many regions without any captured sites, so their average methylation level is NA. A methylation matrix containing NA values cannot be used for downstream analyses, making the imputation of NA values a necessary preprocessing step.
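The window binning step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used in the study: the function name `bin_methylation`, the 0-based site coordinates, the toy genome length, and the example cells are all assumptions made for the illustration.

```python
import numpy as np

def bin_methylation(positions, states, genome_length, window=100_000):
    """Average binary methylation calls in fixed-length windows for one cell.

    positions: 0-based coordinates of captured sites (assumed < genome_length)
    states:    1 for methylated, 0 for unmethylated, one entry per position
    Windows without any captured site are returned as NaN (the NA value).
    """
    n_windows = int(np.ceil(genome_length / window))
    idx = np.asarray(positions, dtype=np.int64) // window  # window index per site
    states = np.asarray(states, dtype=float)

    sums = np.bincount(idx, weights=states, minlength=n_windows)
    counts = np.bincount(idx, minlength=n_windows)

    levels = np.full(n_windows, np.nan)  # start with every window marked NA
    covered = counts > 0
    levels[covered] = sums[covered] / counts[covered]
    return levels

# Hypothetical toy data: two cells on a 300 kbp "genome", three 100 kbp windows.
cell_a = bin_methylation([10, 50_000, 250_000], [1, 0, 1], genome_length=300_000)
cell_b = bin_methylation([120_000], [1], genome_length=300_000)

# Rows are cells, columns are regions.
matrix = np.vstack([cell_a, cell_b])
```

In this toy example the first cell has no captured site in the second window and the second cell has none in the first or third, so those entries of `matrix` are NaN, which is exactly the situation that makes imputation necessary.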
Recently, a study by the BioX lab at the School of Mathematical Sciences, Nankai University, published in the journal Quantitative Biology under the title "Imputing not available values in single-cell DNA methylation data using the median is straightforward and effective," showed that median imputation is a simple and effective way to handle NA values in scDNAm data.
When analyzing scDNAm data, an intuitive solution is to impute all NA values as zeros. From another perspective, however, higher read counts in scRNA-seq data typically correspond to higher gene expression levels, and gene expression is strongly negatively correlated with DNA methylation levels. Treating missing values in scRNA-seq data as zeros therefore corresponds to imputing NA values in scDNAm data as ones. Another intuitive approach is to smooth NA values with summary statistics. For instance, EpiScanpy imputes NA values with the mean methylation level of a region across all cells. This study suggests that imputing NA values with the median is a simple and effective method for highlighting cellular heterogeneity in scDNAm data. It provides an accurate data foundation for downstream analyses and allows more precise and reliable interpretation of the underlying biological processes.
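The four imputation strategies mentioned above (zeros, ones, mean, median) can be compared side by side with a short NumPy sketch. It is a hedged illustration, not the study's code or the EpiScanpy API: the function `impute_na` and the toy matrix are hypothetical, and the median is assumed to be taken per region across cells, by analogy with the mean-based description above; the paper may define it differently.

```python
import numpy as np

def impute_na(matrix, strategy="median"):
    """Fill NaN entries of a cell-by-region methylation matrix.

    Assumption: "mean" and "median" are computed per region (column)
    over the cells in which that region was captured.
    """
    matrix = np.asarray(matrix, dtype=float)
    missing = np.isnan(matrix)

    if strategy == "zero":
        return np.where(missing, 0.0, matrix)
    if strategy == "one":
        return np.where(missing, 1.0, matrix)
    if strategy in ("mean", "median"):
        reducer = np.nanmean if strategy == "mean" else np.nanmedian
        # One value per region; a region that is NA in every cell stays NaN.
        per_region = reducer(matrix, axis=0)
        return np.where(missing, per_region[np.newaxis, :], matrix)
    raise ValueError(f"unknown strategy: {strategy!r}")

# Hypothetical toy matrix: 4 cells (rows) x 3 regions (columns).
toy = np.array([
    [0.0, np.nan, 1.0],
    [1.0, 0.0, np.nan],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
])

by_zero = impute_na(toy, "zero")
by_one = impute_na(toy, "one")
by_mean = impute_na(toy, "mean")
by_median = impute_na(toy, "median")
```

In this toy matrix the second region is observed as 0, 1, 1 in three cells, so the missing entry becomes 2/3 under mean imputation and 1 under median imputation. A region that is NA in every cell has no per-region statistic and would need a separate fallback, which this sketch does not attempt.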
DOI:10.1002/qub2.7000