Coming in SQL Server 2019: Approximate_Count_Distinct


Last week at the PASS Summit in Seattle, Kevin Farlee & Joe Sack ran a vNext demo in the Microsoft booth and dropped a little surprise. SQL Server 2019 will let you trade speed for accuracy.

I have had approximately all of the breakfast margaritas

They’re working on a new APPROXIMATE_COUNT_DISTINCT.

It would work like Oracle 12c’s feature, giving accuracy within about 4% by using HyperLogLog (PDF with serious math, didn’t read). It also showed up on a slide under “Approximate Query Processing,” worded in a way that suggests there might be other APPROXIMATE% features coming, too.
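Here’s a rough sketch of how it might look next to a plain old COUNT(DISTINCT) – the table and column names are mine, and since this is preproduction, the final name and syntax could easily change:

    /* Exact: the engine has to track every distinct value it sees */
    SELECT COUNT(DISTINCT UserId) AS ExactUsers
    FROM dbo.Posts;

    /* Approximate: a small HyperLogLog sketch stands in for the full set */
    SELECT APPROXIMATE_COUNT_DISTINCT(UserId) AS RoughUsers
    FROM dbo.Posts;

The payoff is memory: the exact version’s grant grows with the number of distinct values, while the sketch stays tiny no matter how many you feed it.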

If you have a use case for this and you’d be willing to run preproduction versions of SQL Server, contact us with info about your use case & database size, and we can put you in touch with the MS folks involved.


9 Comments

  • I have been thinking about this a little… one place where this could be useful is, for example, Reddit, where you see the number of comments and votes. That number doesn’t have to be exact, because by the time you’ve read the post and clicked the comments link, it would have changed anyway. So this makes SQL Server webscale 🙂 I’m trying to think of another use for this and can’t come up with anything at the moment.

  • I’m trying to think of a use case as well – I could see it having a place in data profiling activities where you need to get a snapshot feel for the data and don’t want it to be too expensive. I’m going to follow this feature just to understand how it would be used, and if we think of something for us, I’m all on board with trying it!

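That data profiling idea is easy to sketch – one cheap pass to get rough cardinality for several columns at once (table and column names below are just for illustration, using the function name from the demo):

    /* Rough cardinality per column in a single scan */
    SELECT APPROXIMATE_COUNT_DISTINCT(FirstName)    AS FirstNames,
           APPROXIMATE_COUNT_DISTINCT(LastName)     AS LastNames,
           APPROXIMATE_COUNT_DISTINCT(EmailAddress) AS Emails
    FROM dbo.Users;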
  • Sounds absolutely daft to me. How about some new features:
    DELETE_A_FEW_ROWS_OR_SO
    INSERT_WHATEVER_YOU_REALLY_FEEL_LIKE

    • Why? Getting approximate counts makes a ton of sense for large data sets where being off by small margins doesn’t make a difference. You can still figure out proportionally how values are distributed.

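The proportional point is worth a sketch of its own – relative sizes survive the approximation even if the absolute numbers are a few percent off (hypothetical table again):

    /* Rough distinct users per country – the ranking is what matters */
    SELECT Country,
           APPROXIMATE_COUNT_DISTINCT(UserId) AS RoughUsers
    FROM dbo.Sessions
    GROUP BY Country
    ORDER BY RoughUsers DESC;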
  • Interesting idea. Just to clarify, I think you mean, “will let you trade accuracy for speed”?

  • Bill Preachuk
    November 9, 2017 6:17 am

    Dennis, Erik, and Ryan are correct. This is extremely useful, but only when you have big tables/datasets. It’s a common feature in Big Data platforms such as Apache Hive (percentile_approx is one command that comes to mind). The most common use case is table statistics for query optimizers. To really understand why this is useful, I had to step outside the perfect-transactional-consistency RDBMS world a bit.

    When you have massive tables (think hundreds of billions of rows) spread across dozens or hundreds of cluster nodes, full counts and distinct counts are massive operations. For processes like updating table statistics, you don’t need perfect numbers – you need close numbers. Whether your perfect count is 138.2 billion or your approximation is 134 billion doesn’t matter for query optimization purposes, since either one gets you the same execution plan. But an approximated count could save a gazillion CPU cycles and a ton of processing time on your cluster/server, freeing up those cycles for processes that truly need them.

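For anyone who hasn’t met the Hive function Bill mentions, a percentile_approx call looks like this – the table and column names are made up:

    -- Approximate 95th percentile without sorting the whole column
    SELECT percentile_approx(latency_ms, 0.95) AS p95_latency
    FROM requests;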
  • Bill Preachuk
    November 9, 2017 6:20 am

    PS – I’m also the Assistant Deputy Secretary of the Joe Sack Fan Club (Upper Midwest #1252). Joe rocks.

  • Stephen Morris
    November 10, 2017 4:16 am

    What I do right now is look at rowcnt in sysindexes (or sys.partitions when I’m trying to be modern).

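If you haven’t used that trick, here’s the sys.partitions flavor – the table name is just an example:

    /* Near-instant rowcount from metadata, no table scan */
    SELECT SUM(p.rows) AS ApproxRows
    FROM sys.partitions AS p
    WHERE p.object_id = OBJECT_ID(N'dbo.Posts')
      AND p.index_id IN (0, 1);  /* heap or clustered index only */

Fittingly, the rows column there is itself documented as approximate, so it’s very much in the spirit of this post.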
