Update: SQL Server 2025’s REGEX Performance Isn’t So Bad!

T-SQL

Back in March 2025 when Microsoft first announced that REGEX support was coming to SQL Server 2025 and Azure SQL DB, I gave it a quick test, and the performance was horrific. It was bad in 3 different ways:

  1. The CPU usage was terrible, burning 60 seconds of CPU time to check a few million rows
  2. It refused to use an index
  3. The cardinality estimation was terrible, hard-coded to 30% of the table

Prompted by a comment from Erland Sommarskog this month, I circled back and ran the tests again with the release version of SQL Server 2025. Great news! Microsoft fixed one of the problems outright, improved another, and… well, the third is a little tricky. To demonstrate, I’m going to use the large 2024-04 Stack Overflow database to create a worst-case scenario, then start with an index on the small Users table and query it via regex like we did in the March 2025 post.
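The original query isn’t reproduced here, so here’s a sketch of the kind of filter under test – the exact regex and column are assumptions, based on the comments below noting that the filter matched email-address-shaped values in Users.Location:

```sql
-- Hypothetical reconstruction: find the first 100 Users whose Location
-- looks like an email address. The exact regex pattern is an assumption.
SELECT TOP 100 u.Id, u.DisplayName, u.Location
FROM dbo.Users AS u
WHERE REGEXP_LIKE(u.Location, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$');
```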

The actual execution plan:

Actual plan for TOP 100

It took about 8 seconds, all of which was spent burning CPU. That’s actually GREAT, a HUGE improvement from last time! 8 seconds of CPU time sounds bad, but it’s fantastic given the number of rows that SQL Server had to examine to find 100 matches:

Number of rows read

Because the data I was looking for was relatively rare, SQL Server had to read about 10 million rows in order to find 100 matches. That means SQL Server was able to read 1.2 million rows per second, and examine their contents with regex. That’s awesome! I love it, and I wish the story ended there.

But let’s switch over to examining the Title column of the Posts table, one of the bigger ones in the database. I’ve created an index on the Title column:
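The index itself is nothing fancy – something along these lines, with the index name being my own choice:

```sql
-- A plain nonclustered index on Title; the name is an assumption.
CREATE INDEX IX_Title ON dbo.Posts (Title);
```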

The table has about 60M rows, the clustered index is 163GB, and the index on just Title is 3GB. If SQL Server will use the index, this will give us a giant performance boost over having to scan the whole table.

Posts table size

Let’s run the same WHERE clause filter, but use a SUM(1) this time instead of TOP 100 so that SQL Server is forced to hit all of the rows, and so I can demonstrate the cardinality estimation:
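A sketch of that query – the regex is the same assumed email-shaped pattern, and SUM(1) means there’s no early bailout, so every row gets tested:

```sql
-- SUM(1) instead of TOP 100: SQL Server must evaluate the filter
-- against every row. The regex pattern is an assumption.
SELECT SUM(1) AS MatchingTitles
FROM dbo.Posts AS p
WHERE REGEXP_LIKE(p.Title, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$');
```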

The actual plan doesn’t look great at first glance, but hang in there, because this really is a worst-case scenario – there’s some great stuff in here:

Index scan on Posts.Title

First, it used the index! That’s fantastic. Obviously we can’t seek on it, but at least we’re only reading 3GB of data instead of 163GB. That’s good – that’s the one problem Microsoft completely fixed. Love it.

Second, it went parallel automatically, recognizing that it was gonna be a lot of work. It had to read 60M rows, and it took 7 minutes, so it processed about 145K rows per second. That’s… not good. That’s a huge drop from our Users table processing which was hitting about 1.2 million rows per second. While it was running, ooof, our poor server:

It’s getting hot in here, so light up all your cores

Wait stats are a parallelism disaster:

Wait stats

So… is parallelism a problem? I’ve heard folks say CXCONSUMER is harmless, but long-term readers here will know better. Slap an OPTION (MAXDOP 1) hint on the query, and it runs in 32 seconds:
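The hinted version looks like this – same assumed query as before, just single-threaded:

```sql
SELECT SUM(1) AS MatchingTitles
FROM dbo.Posts AS p
WHERE REGEXP_LIKE(p.Title, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$')
OPTION (MAXDOP 1); -- one core, no parallelism waits
```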

32 seconds for MAXDOP 1

Which brings us back up to 1.86 million rows per second processed by REGEX. That’s honestly fantastic. If you need to find a needle in a haystack with regex, and you’ve got an index so it can scan less data, and if CPU scheduling doesn’t get in the way, this is a dang fast way to do it. Note that I didn’t try more complex regular expressions – I don’t wanna make up synthetic stuff for testing, and instead you should test with regexes you actually intend to use in your work.

On the down side, note that the estimated number of rows is still hot garbage – SQL Server estimated that 5,383,710 rows would come back, when in actuality none did. I think that’s fair, because I don’t know how you could predict the number of rows that would match a given regex. Hell, I can’t even READ most regexes and understand what they’re trying to do. If your query has a regular expression in the filter, you probably want to load the matching rows into a temp table first so that on subsequent joins, SQL Server better understands the number of rows that’ll be involved.
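A sketch of that temp table pattern, with the join target and column names as assumptions:

```sql
-- Step 1: materialize the regex matches. After this, #MatchingPosts has
-- real statistics, so later joins get sane row estimates instead of the
-- hard-coded guess. The regex pattern is an assumption.
SELECT p.Id
INTO #MatchingPosts
FROM dbo.Posts AS p
WHERE REGEXP_LIKE(p.Title, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$');

-- Step 2: join from the temp table instead of repeating the regex filter.
SELECT COUNT(*)
FROM #MatchingPosts AS m
INNER JOIN dbo.Comments AS c ON c.PostId = m.Id;
```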

So in summary, SQL Server 2025’s regex situation is better than it was in Azure SQL DB – at least it’s using indexes now, and CPU is better than it was. Just be careful with running it in production – if you’re just using it as a utility query for quick research, try hinting it with MAXDOP 1 for two reasons. It might run faster, and it’ll be less likely to dominate the server’s entire CPU stack.


14 Comments

  • Wow, this is great news! Thanks for researching this and updating us on this cool new feature’s performance.

  • It’s rather amusing that at least 100 users have a valid-enough email address as their Location.

  • “Hell, I can’t even READ most regexes and understand what they’re trying to do.” Right there with you on this one. Regex – Sanskrit, same same to me.

  • How does the performance compare to using LIKE '%SearchTerm%' and CONTAINS queries?

  • Brent always keeping us on our toes.
    I presume the format for your example’s Title and email is username@domain.com – what if I add another character to the format?

    • I wish I could run personalized tests for everyone for free, but your best bet would be to download the freely available databases in the blog post and run your own experiments. Hope that’s fair, and enjoy!

      • From my tests, Brent, adding any number of extra characters to either of the + quantifiers didn’t change the results; however, taking a default character out of your format above threw an error.

  • I don’t fully agree that you can’t do an Index Seek with REGEX.
    Your expression starts with ‘^[A-Za-z0-9._%+-]’, which means the string has to start with one of those chars / ranges in the square brackets (caution: it does not include e.g. German umlauts, contrary to a LIKE, which includes them when your collation is accent insensitive).

    The server could at least precompile the expression, check whether it filters on the first char (yes, it does), and add a SEEK predicate matching at least that one.

    There is no big difference in execution time / rows read when you do it with your expression (which searches for Title = valid email address), since it covers almost all possible chars.

    But when you look just for mail addresses starting with a number (and allow other text after it, otherwise my smaller StackOverflow database would find nothing), there is a big difference in execution time / rows read:
    -- 51 ms / 137 reads
    SELECT t.Title
    FROM dbo.Posts AS t
    WHERE t.Title LIKE '[0-9]%' -- manually forcing the seek
    AND REGEXP_LIKE(t.Title, '^[0-9]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

    -- 22 sec / 101k reads
    SELECT t.Title, t.Id
    FROM dbo.Posts AS t
    WHERE REGEXP_LIKE(t.Title, '^[0-9]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

    And of course I don’t want to do this myself – ideally SQL Server should recognize it and add the correct seek predicate when my RegExp starts with ^. On the other hand, this could become the next big parameter sniffing issue when the search expression is dynamic 😮

    • Yeah, I think that’s pretty far beyond what I would expect SQL Server to do in the first release of regex. I do agree that it’d be awesome to have – but I bet so few regexes will be sargable anyway.

  • Index seek for REGEXP_LIKE is coming. Stay tuned

