Coming in SQL Server 2019: Approximate_Count_Distinct


Last week at the PASS Summit in Seattle, Kevin Farlee & Joe Sack ran a vNext demo in the Microsoft booth and dropped a little surprise. SQL Server 2019 will let you trade speed for accuracy.

I have had approximately all of the breakfast margaritas

They’re working on a new APPROXIMATE_COUNT_DISTINCT.

It would work like Oracle 12c’s feature, giving accuracy within about 4% by using HyperLogLog (PDF with serious math, didn’t read). It also showed up on a slide under “Approximate Query Processing,” worded in a way that suggests there might be other APPROXIMATE% features coming, too.
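Here’s a rough sketch of how it might look next to a plain old COUNT(DISTINCT) – the table and column names are mine, and since this is preproduction, the final name and syntax could easily change:

    /* Exact: the engine has to track every distinct value it sees */
    SELECT COUNT(DISTINCT UserId) AS ExactUsers
    FROM dbo.Posts;

    /* Approximate: a small HyperLogLog sketch stands in for the full set */
    SELECT APPROXIMATE_COUNT_DISTINCT(UserId) AS RoughUsers
    FROM dbo.Posts;

The payoff is memory: the exact version’s grant grows with the number of distinct values, while the sketch stays tiny no matter how many you feed it.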

If you have a use case for this and you’d be willing to run preproduction versions of SQL Server, contact us with info about your use case & database size, and we can put you in touch with the MS folks involved.


9 Comments

  • I have been thinking about this a little… one place where this could be useful is, for example, Reddit, where you see the number of comments and votes. That number doesn’t have to be exact, because by the time you’ve read the post and clicked the comments link, it would have changed anyway. So this makes SQL Server webscale 🙂 I’m trying to think of another use for this and can’t come up with anything at the moment.

  • I’m trying to think of a use case as well – I could see it having a place in data profiling activities where you need to get a snapshot feel for the data and don’t want it to be too expensive. I’m going to follow this feature just to understand how it would be used, and if we think of something for us, I’m all on board with trying it!

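That data profiling idea is easy to sketch – one cheap pass to get rough cardinality for several columns at once (table and column names below are just for illustration, using the function name from the demo):

    /* Rough cardinality per column in a single scan */
    SELECT APPROXIMATE_COUNT_DISTINCT(FirstName)    AS FirstNames,
           APPROXIMATE_COUNT_DISTINCT(LastName)     AS LastNames,
           APPROXIMATE_COUNT_DISTINCT(EmailAddress) AS Emails
    FROM dbo.Users;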
  • Sounds absolutely daft to me. How about some new features:
    DELETE_A_FEW_ROWS_OR_SO
    INSERT_WHATEVER_YOU_REALLY_FEEL_LIKE

    • Why? Getting approximate counts makes a ton of sense for large data sets where being off by small margins doesn’t make a difference. You can still figure out proportionally how values are distributed.

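The proportional point is worth a sketch of its own – relative sizes survive the approximation even if the absolute numbers are a few percent off (hypothetical table again):

    /* Rough distinct users per country – the ranking is what matters */
    SELECT Country,
           APPROXIMATE_COUNT_DISTINCT(UserId) AS RoughUsers
    FROM dbo.Sessions
    GROUP BY Country
    ORDER BY RoughUsers DESC;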
  • Interesting idea. Just to clarify, I think you mean, “will let you trade accuracy for speed”?

  • Bill Preachuk
    November 9, 2017 6:17 am

    Dennis, Erik, and Ryan are correct. This is extremely useful, but only when you have big tables/datasets. It’s a common feature in Big Data platforms such as Apache Hive (percentile_approx is one command that comes to mind). The most common use case is table statistics for query optimizers. To really understand why this is useful, I had to step outside the perfect-transactional-consistency RDBMS world a bit.

    When you have massive tables (think hundreds of billions of rows) spread across dozens or hundreds of cluster nodes, full counts and distinct counts are massive operations. For processes like updating table statistics, you don’t need perfect numbers – you need close numbers. Whether your perfect count is 138.2 billion or your approximation is 134 billion doesn’t matter for query optimization purposes, since either one gets you the same execution plan. But an approximated count could save a gazillion CPU cycles and a ton of processing time on your cluster/server, freeing up those cycles for processes that truly need them.

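For anyone who hasn’t met the Hive function Bill mentions, a percentile_approx call looks like this – the table and column names are made up:

    -- Approximate 95th percentile without sorting the whole column
    SELECT percentile_approx(latency_ms, 0.95) AS p95_latency
    FROM requests;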
  • Bill Preachuk
    November 9, 2017 6:20 am

    PS – I’m also the Assistant Deputy Secretary of the Joe Sack Fan Club (Upper Midwest #1252). Joe rocks.

  • Stephen Morris
    November 10, 2017 4:16 am

    What I do right now is look at rowcnt in sysindexes (or sys.partitions when I’m trying to be modern).

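If you haven’t used that trick, here’s the sys.partitions flavor – the table name is just an example:

    /* Near-instant rowcount from metadata, no table scan */
    SELECT SUM(p.rows) AS ApproxRows
    FROM sys.partitions AS p
    WHERE p.object_id = OBJECT_ID(N'dbo.Posts')
      AND p.index_id IN (0, 1);  /* heap or clustered index only */

Fittingly, the rows column there is itself documented as approximate, so it’s very much in the spirit of this post.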
