New Way to Count Unique Big Data Items Faster


23/07/2025 Frontiers Journals

Researchers at Renmin University of China have completed a comprehensive survey of methods for quickly and accurately counting the number of unique items in large datasets, a task vital to everything from speeding up database queries to detecting network intrusions. By systematically comparing the two main classes of approaches, sampling-based and sketch-based algorithms, the study identifies when each method is most effective and where gaps remain in handling today's massive and complex data environments.
Distinct Value Counting Powers Database Speed & Anomaly Detection
Counting distinct values underpins core functions in database engines, network security systems that flag unusual activity, and machine-learning models that must recognize rare events. Better estimators mean faster queries, reduced storage and communication costs, and more reliable detection of anomalies—benefits that extend to enterprises, cloud providers, and policymakers regulating data-driven services.
“Accurately knowing how many unique items you have is the cornerstone of any high-performance database or security system,” says Prof. Zhewei Wei. “Our survey shows just how far we can push speed without sacrificing reliability.”
Survey Identifies Optimal Sampling vs. Sketch Strategies
The survey distills decades of research into several critical insights:
  • Two main strategies dominate: sampling methods read only a fraction of records to save time (with some loss of precision), while sketch methods hash every item into a compact summary for higher accuracy but greater input/output (I/O) costs.
  • Historical progression: early statistical estimators date back to 1943; recent advances tailor solutions to streaming data, distributed systems, and modern storage systems.
  • Performance trade-offs: sampling can miss rare values, whereas sketching can be too heavy on huge tables.
  • Adaptive approaches that tune themselves to a dataset’s skew and size outperform fixed-form formulas in real-world tests.
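To make the sketch-based strategy above concrete, here is a minimal sketch (not the authors' implementation) of a classic K-Minimum-Values (KMV) estimator: every item is hashed into [0, 1), only the k smallest distinct hash values are kept, and the distinct count is estimated as (k - 1) divided by the k-th smallest hash. Function names and the parameter k=256 are illustrative choices, not taken from the paper.

```python
import hashlib
import heapq

def _hash01(item):
    # Map an item to a pseudo-random 64-bit hash in [0, 1).
    h = int.from_bytes(
        hashlib.blake2b(str(item).encode(), digest_size=8).digest(), "big"
    )
    return h / 2**64

def kmv_estimate(stream, k=256):
    """K-Minimum-Values sketch: keep the k smallest distinct hash values
    and estimate the distinct count as (k - 1) / (k-th smallest hash)."""
    heap = []        # max-heap via negation: heap of the k smallest hashes
    members = set()  # hashes currently held, so duplicates are skipped
    for item in stream:
        x = _hash01(item)
        if x in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -x)
            members.add(x)
        elif x < -heap[0]:
            # Evict the current largest of the k kept hashes.
            members.discard(-heapq.heappushpop(heap, -x))
            members.add(x)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct items seen: exact count
    return int((k - 1) / (-heap[0]))
```

The sketch uses O(k) memory regardless of input size, illustrating the trade-off the survey describes: one pass over every record (higher I/O than sampling) in exchange for a bounded-memory summary with relative error roughly 1/sqrt(k).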
Review Evaluates Estimator Accuracy, I/O Cost & Memory Footprint
The authors reviewed decades of literature, classifying methods by their core ideas—sampling, hashing, linear programming, and maximum likelihood—and evaluated them on criteria such as error rate, I/O cost, memory footprint, and one-pass processing capability. Complex mathematical analyses were distilled into clear comparisons, with case studies showing where specific techniques excel or lag.
Adaptive & Learning-Based Estimators Set to Enhance Counting
Despite significant strides, counting distinct items in increasingly large and complex datasets remains a challenging task. Promising directions include block-level sampling that aligns with physical storage layouts, learning-based estimators trained on past data, and tighter integration of these methods into mainstream databases for real-time analytics. This survey was published in Frontiers of Computer Science in March 2025 (https://doi.org/10.1007/s11704-024-40952-3).
By mapping the landscape of sampling- and sketch-based estimators and highlighting where each approach excels or struggles, this work lays a foundation for future innovations—helping ensure that as data volumes continue to grow, our ability to understand and manage the uniqueness of that data keeps pace.
DOI: 10.1007/s11704-024-40952-3
Regions: Asia, China
Keywords: Applied science, Computing


