Last week at the PASS Summit in Seattle, Kevin Farlee & Joe Sack ran a vNext demo in the Microsoft booth and dropped a little surprise. SQL Server 2019 will let you trade speed for accuracy.
They’re working on a new APPROXIMATE_COUNT_DISTINCT.
It would work like Oracle 12c's feature, giving accuracy within about 4% by using HyperLogLog (a PDF with serious math, didn't read it). They also showed it on a slide under "Approximate Query Processing," and the way it was presented suggested there might be other APPROXIMATE% features coming, too.
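To make the idea concrete, here's a sketch of what the syntax might look like, based on what was shown in the demo. The final function name and behavior weren't confirmed, and the table and column names here are hypothetical:

```sql
-- Exact distinct count: has to track every distinct value it sees.
SELECT COUNT(DISTINCT UserId) AS ExactUsers
FROM dbo.PageViews;

-- Approximate version: uses a HyperLogLog-style sketch instead,
-- trading a few percent of accuracy for far less memory and CPU.
SELECT APPROX_COUNT_DISTINCT(UserId) AS RoughUsers
FROM dbo.PageViews;
```

On a small table the two return the same kind of answer in roughly the same time; the payoff only shows up when the distinct-value count is huge.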
If you have a use case for this and you’d be willing to run preproduction versions of SQL Server, contact us with info about your use case & database size, and we can put you in touch with the MS folks involved.
I have been thinking about this a little… one place this could be useful is somewhere like Reddit, where you see the number of comments and votes. That number doesn't have to be exact, because by the time you read the text and click the comments link, it would have changed anyway. So this makes SQL Server webscale 🙂 I'm trying to think of another use for this, but can't come up with anything at the moment.
I’m trying to think of a use case as well – I could see it having a place in data profiling activities where you need to get a snapshot feel for the data and don’t want it to be too expensive. I’m going to follow this feature just to understand how it would be used, and if we think of something for us, I’m all on board with trying it!
Sounds absolutely daft to me. How about some new features:
Why? Getting approximate counts makes a ton of sense for large data sets where being off by small margins doesn’t make a difference. You can still figure out proportionally how values are distributed.
Interesting idea. Just to clarify, I think you mean, “will let you trade accuracy for speed”..?
Brad – same thing. Remember, sometimes you trade more for less, and sometimes you trade less for more. 😉
Dennis, Erik, and Ryan are correct. This is extremely useful, but only when you have big tables/datasets. It's a common feature in Big Data platforms such as Apache Hive (percentile_approx is one command that comes to mind). The most common use case is table statistics for query optimizers. To really understand why this is useful, I had to step outside the perfect-transactional-consistency-RDBMS world a bit.
When you have massive tables (think hundreds of billions of rows) spread out across dozens or hundreds of cluster nodes, full counts and distinct counts are massive operations. For processes like updating table statistics, you don't need perfect numbers – you need close numbers. If your exact count is 138.2 billion and the approximation says 134 billion, it doesn't matter for query optimization purposes, since the optimizer will pick the same execution plan either way. But the approximated count could save a gazillion CPU cycles and a ton of processing time on your cluster/server, freeing up cycles for processes that truly need them.
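The Hive function mentioned above is a good illustration of the same trade-off. percentile_approx builds a histogram sketch rather than fully sorting the column, which matters a lot at this scale; the table and column names here are made up for illustration:

```sql
-- Hive: approximate 95th-percentile request latency.
-- An exact percentile would require sorting billions of values;
-- percentile_approx(column, p) answers from a bounded-size histogram instead.
SELECT percentile_approx(latency_ms, 0.95) AS p95_latency
FROM web_requests;
```

As with approximate distinct counts, a p95 of 412 ms versus a true 418 ms leads you to the same operational decision, at a fraction of the cost.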
PS – I’m also the Assistant Deputy Secretary of the Joe Sack Fan Club (Upper Midwest #1252). Joe rocks.
What I do right now is look at rowcnt in sysindexes (or the rows column in sys.partitions when I'm trying to be modern)
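For anyone following along, the modern version of that trick looks roughly like this. It reads stored metadata instead of scanning the table, so it's fast but only as fresh as the last metadata update; the table name is hypothetical:

```sql
-- Approximate row count from catalog metadata, no table scan.
-- index_id 0 = heap, 1 = clustered index; other values are nonclustered copies.
SELECT SUM(p.rows) AS ApproxRows
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID('dbo.PageViews')
  AND p.index_id IN (0, 1);
```

That's already an "approximate count" in spirit – it just can't help with distinct counts, which is where the new function would come in.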