A landmark study titled “How error correction affects polymerase chain reaction deduplication: A survey based on unique molecular identifier datasets of short reads,” recently published in Quantitative Biology, reveals critical flaws in widely used computational tools for next-generation sequencing (NGS) data analysis.
Researchers from the University of Technology Sydney and the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, conducted the first comprehensive evaluation of PCR deduplication and error correction methods using "ground truth" datasets built with Unique Molecular Identifiers (UMIs): molecular "barcodes" that track individual DNA molecules through the sequencing process. They found that:

1) Purely computational deduplication methods showed less than 60% overlap with UMI-based results, meaning thousands of genuine biological sequences were incorrectly eliminated or retained (a toy illustration of this comparison follows below).

2) All tested error correction tools introduced tens to hundreds of thousands of new sequences that never existed in the original sample, like a spellchecker that adds typos while trying to fix them.

3) Tools that tolerate small sequence differences in order to catch PCR errors end up mistakenly removing authentic reads, while still leaving hundreds of thousands of erroneous reads untouched.

4) Performance varied dramatically across datasets, with no single method emerging as consistently reliable, a "methodological roulette" for researchers.

To address these challenges, the researchers propose three key directions for improvement:

1) Incorporate sequence abundance information to distinguish true duplicates from errors;

2) Develop tools that preserve read identity and quantity information throughout processing;

3) Apply machine learning to understand platform-specific error patterns and make more intelligent corrections.
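To make the deduplication comparison concrete, the sketch below (Python, not from the paper) contrasts naive sequence-based deduplication with UMI-based deduplication on a toy set of reads and reports a simple Jaccard agreement score. The read tuples, function names, and the agreement metric are illustrative assumptions; real tools additionally use alignment positions, quality scores, and mismatch tolerance, and the paper's exact overlap metric may differ.

```python
from collections import defaultdict

def dedup_by_sequence(reads):
    """Naive computational deduplication: keep one read per unique sequence.
    Illustrative only; surveyed tools also use alignment positions and
    tolerate mismatches."""
    seen, kept = set(), []
    for read_id, seq, umi in reads:
        if seq not in seen:
            seen.add(seq)
            kept.append(read_id)
    return set(kept)

def dedup_by_umi(reads):
    """UMI-based deduplication: reads sharing a UMI are treated as PCR copies
    of one original molecule, so keep one representative per UMI."""
    by_umi = defaultdict(list)
    for read_id, seq, umi in reads:
        by_umi[umi].append(read_id)
    return {ids[0] for ids in by_umi.values()}

def jaccard(a, b):
    """One simple way to quantify agreement between two retained-read sets."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Toy reads: (read_id, sequence, UMI). r2 is a PCR copy of r1 carrying one
# error, so sequence-based deduplication wrongly keeps both.
reads = [
    ("r1", "ACGTACGT", "UMI1"),
    ("r2", "ACGTACGA", "UMI1"),  # PCR duplicate of r1 with a single-base error
    ("r3", "TTGGCCAA", "UMI2"),
]

seq_kept = dedup_by_sequence(reads)
umi_kept = dedup_by_umi(reads)
print(f"sequence-based kept: {sorted(seq_kept)}")
print(f"UMI-based kept:      {sorted(umi_kept)}")
print(f"agreement (Jaccard): {jaccard(seq_kept, umi_kept):.2f}")
```

On this toy input the two methods agree on only two of three retained reads, which is the kind of disagreement the study quantifies at scale against UMI-derived ground truth.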
DOI: 10.1002/qub2.99