Indexing Wide Keys in SQL Server

Key length matters in SQL Server indexes.

It’s a best practice to keep your index keys as narrow as possible, and SQL Server enforces a maximum key length of 900 bytes on most “normal” clustered and nonclustered indexes.

But what happens if you want to optimize lookups on a wide column? You’re not necessarily out of luck; you may just have to get a bit creative.

What If I Need to Do an Equality Search on a Wide Column?

Let’s say I have a simple table. I have a narrow key on my clustered index and then a pretty wide variable-length column. I need the wide column to be Unicode, which makes it even wider, since Unicode data types take two bytes per character.

Here’s our sample table with a few rows (just pretend it has a lot more):
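Roughly something like this minimal sketch, assuming the dbo.LookupValues table, integer clustered key i, and NVARCHAR(2000) column bigval that show up in the error message and the queries further down; the rows are purely illustrative:

[code]
-- Sketch of the sample table: narrow clustered key, wide Unicode column.
-- NVARCHAR(2000) = up to 4000 bytes, matching the error message below.
CREATE TABLE dbo.LookupValues (
    i INT IDENTITY(1,1) NOT NULL,
    bigval NVARCHAR(2000) NOT NULL,
    CONSTRAINT pk_LookupValues PRIMARY KEY CLUSTERED (i)
);
GO

-- A few illustrative rows (pretend there are many more).
-- The padded row is 700 characters = 1400 bytes, like the failing index entry.
INSERT dbo.LookupValues (bigval)
VALUES (N'bunny'),
       (N'kitten'),
       (REPLICATE(N'x', 695) + N'puppy');
GO
[/code]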

Let’s say we write to this table rarely, but query it often. When this query runs, I want to make it as fast as possible:
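The query is a plain equality search on the wide column (the same one that shows up in the comments below):

[code]
SELECT i
FROM dbo.LookupValues
WHERE bigval = N'bunny';
GO
[/code]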

Right now, this query has to scan every row in the clustered index (the whole table) to find rows where bigval = N'bunny'. That’s not ideal, and as the table grows it’ll become worse and worse, burning more IO and CPU, and taking longer over time.

There’s usually an easy way to make a query like this fast: just create a nonclustered index on the bigval column. But when I try, it doesn’t work because of restrictions on key size.
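The attempt looks something like this; the index name ix_LookupValues_bigval is taken from the error message:

[code]
-- bigval can be up to 4000 bytes, well over the 900-byte index key limit.
CREATE NONCLUSTERED INDEX ix_LookupValues_bigval
    ON dbo.LookupValues (bigval);
GO
[/code]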

SQL Says:

[code]
Warning! The maximum key length is 900 bytes. The index 'ix_LookupValues_bigval' has maximum length of 4000 bytes. For some combination of large values, the insert/update operation will fail.
Msg 1946, Level 16, State 3, Line 1
Operation failed. The index entry of length 1400 bytes for the index 'ix_LookupValues_bigval' exceeds the maximum length of 900 bytes.
The statement has been terminated.
[/code]

Terminated. Yeah. I can’t just index this to make my query fast.

Options for Indexing Wide Keys

So what’s a performance tuner to do?

My first thought when I hit this problem was that I might have to use a fulltext index. A fulltext index can work here– it lets you index large columns, but it would be kind of a bummer to have to do it. Fulltext indexes have extra overhead and are really designed for different things than doing a simple equality search, so it would be like using a jackhammer because you can’t find a mallet.

My partner Jeremiah Peschka came up with a quick and clever solution using an indexed computed column. You can work all sorts of cool magic with computed columns in SQL Server– the trick is just to remember them!

Here’s how it works: you add a computed column to the table that’s the hash of the large value. You then index the computed column and modify your query to take advantage of it.

In this example we use SHA2_512 for the hashing algorithm. This gives an output of 64 bytes– well within our 900-byte limit for index key sizes.
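Here’s a sketch of the computed column and index, assuming the bigvalhash column name used in the rewritten query below. The CAST to BINARY(64) (the size of a SHA2_512 hash) keeps the computed column’s declared type narrow (HASHBYTES is typed as a wide VARBINARY by default), which avoids the 900-byte key warning– a detail a couple of commenters below also call out:

[code]
-- Add a persisted computed column containing the hash of the wide value,
-- cast down to the 64 bytes a SHA2_512 hash actually occupies.
ALTER TABLE dbo.LookupValues
    ADD bigvalhash AS CAST(HASHBYTES('SHA2_512', bigval) AS BINARY(64)) PERSISTED;
GO

-- Index the narrow hash column instead of the wide original column.
CREATE NONCLUSTERED INDEX ix_LookupValues_bigvalhash
    ON dbo.LookupValues (bigvalhash);
GO
[/code]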

Now, to get the query to use the new index, we need to change it a bit:
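It now seeks on the hashed value and keeps the original equality check as a safety net (this is the form quoted verbatim in the comments below):

[code]
SELECT i
FROM dbo.LookupValues
WHERE bigvalhash = HASHBYTES('SHA2_512', N'bunny')
  AND bigval = N'bunny';
GO
[/code]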

This revised approach gives me a targeted index seek and limits my logical reads. Voila!

The Fine Print on This Solution

There are a few things to note:

  • HASHBYTES results depend on the input datatype. If my query used HASHBYTES('SHA2_512', 'bunny'), it would not find any rows, because the column stores hashes of Unicode values and I would be supplying the hash of a non-Unicode value. (The quick check after this list shows the difference.)
  • I do still include "bigval = N'bunny'" in my query. In theory there shouldn’t be collisions with SHA-512, but it doesn’t add much expense to the query and in my example I deemed it “worth it” to me. You might make a different choice.
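If you want to see that datatype dependence for yourself, a quick check like this makes it obvious:

[code]
-- The same literal hashed as VARCHAR vs. NVARCHAR yields different hashes,
-- so the predicate must hash an N'' literal to match the computed column.
SELECT HASHBYTES('SHA2_512', 'bunny')  AS varchar_hash,
       HASHBYTES('SHA2_512', N'bunny') AS nvarchar_hash;
GO
[/code]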

Sometimes Old Tools Do the Trick

What I love most about this solution is that it’s creative, but it’s not really weird when you think about it. It uses standard features that have been in SQL Server for a long time to do something it seems like the product wouldn’t support– and that’s really cool.


26 Comments

  • Very cool. I love this technique.

    It reminds me of the example in the CHECKSUM documentation. If we’re guarding against collisions anyway, then maybe this hash function would suit us.

    I mean, in a different scenario where we do write to that table frequently, maybe CPU does matter. My guess is that CHECKSUM is less CPU intensive; we don’t get the cryptographic security of a one-way hash function, but we weren’t using that anyway. Is that fair? Is the CPU difference between HASHBYTES(SHA2_512) and CHECKSUM significant?

    • Hi Michael,

      I happen to have a quick test harness saved from a few months ago when I was looking into alternate dedupe methods. I put it up here if you want to take a look. I tested checksum, binary checksum, and the different hashbytes algorithms.

      http://pastebin.com/iF4XBSab

      Thanks,
      Erik

  • Todd Everett
    May 8, 2013 10:24 am

    That is a great technique Kendra – thanks for sharing. You are right – we often forget about the ability to use a computed column and this is a great reminder of their many great uses. One question – let’s say you are attempting to tune a query that is already in production and you don’t have the option to change the SQL? How would you go about deciding if implementing the fulltext index, even if it is a jackhammer when a mallet will do, is worth it?

    • Kendra Little
      May 8, 2013 10:39 am

      Hi Todd,

      For the fulltext solution, you would also need to change the TSQL. To take advantage of the fulltext index you’d need to use a fulltext predicate in the query (such as “CONTAINS”). There’d also be some different considerations regarding timing– you can have latency where data is in the table and not yet present in the fulltext index, which can be a deal breaker sometimes.

      Hope this helps!
      Kendra

  • Nice example! Just out of curiosity, is there any real difference between using a computed column of HASHBYTES vs. using LEFT(bigval, 450)? Admittedly it’s a larger index, but it seems like it would avoid the problem of data type and make the queries simpler.

    The following query ends up using the new index when I tested it.

    SELECT i
    from dbo.LookupValues
    where bigval = N'bunny';
    GO

  • CHECKSUM is much less CPU intensive, but you will get collisions like crazy: it is basically a CRC32, and that just isn’t enough bits to encode a large volume of bytes. Also, since it is a very simple algorithm, you would need many more bits to avoid collisions than just using SHA. You could generate a SHA on collision, but that would mean extra coding and may slow you down more than just using SHA in the first place. Even MD5 on a large enough set of columns and enough rows will yield collisions at some point. SHA2_512 is probably as efficient as you are going to get and be guaranteed(*) no collisions. If you have very small amounts of data to hash, CHECKSUM may be fine, but for this I wouldn’t use it.

    -wes

  • John Halunen
    May 8, 2013 1:29 pm

    I agree with Wes. I played with both CHECKSUM and SHA recently for an nvarchar(max) column and found them to have similar performance (on an underpowered billing db server), both about 10x faster than not having an index, but I had issues with CHECKSUM when I tried to keep the index length low. Using Full Text for me was quite a bit slower, but not as bad as having no index, and not worth the PITA given SHA was faster and simpler to administer.
    John

  • I recommend shortening the hash to 16 bytes (GUID size, which is already considered worldwide unique). As you’re keeping the true equality check anyway, this is safe. You could even reduce the hash size to something like 8 bytes and still have an insanely high probability of matching only the exact sought-after row.

  • Hey Team,

    Thanks Kendra for your example. With this I really found an easier way to work with wide columns. The only issue I had was with the type of hash used in this example: I was getting NULL values as results, but then I realized that SHA2_512 only works on SQL Server 2012 and later, not previous versions. I think it’s great to find solutions like this that many of us (me included) don’t keep in mind when resolving issues, overlooking features like computed columns, which are awesome depending on the case.

  • Really very nice post, but my problem now is: how can I check all the indexes on all tables in the database to know which indexes need to be changed so they aren’t indexing wide keys in SQL Server?

    Does anyone have a script to check the key column size for an index?

  • I think it is more optimal to store the hash as BIGINT (aka 8 bytes), as the default behavior is to store that column as varbinary (8000 bytes). See the cast below:
    ALTER TABLE dbo.LookupValues ADD bigvalhash AS CAST(HASHBYTES('SHA2_512', bigval) as BIGINT) PERSISTED;

  • So what if you have a string that’s larger than 8000 bytes (NVARCHAR(4000)+ or VARCHAR(8000)+ ) or a combination of columns that you want to apply HASHBYTES to that adds up to more than 8000B? Do you have any examples of hashing such large strings with pure SQL?

    • Not offhand. But this just hasn’t come up for me much in the 18 months since I wrote the post. The first question I would ask is why you need to index such large strings, and are you really going to be able to seek on an index?

  • Also, just a general note of caution with indexed computed columns and running DBCC CHECKDB:

    http://www.sqlskills.com/blogs/paul/dbcc-checkdb-performance-and-computed-column-indexes/

  • Bhupendra Baraiya
    May 6, 2015 5:42 am

    Hi

    What if I want to do a LIKE search on a wide key?
    Will I have to go for fulltext? How can I implement LIKE in the above example?

    SELECT i
    from dbo.LookupValues
    where bigvalhash = HASHBYTES('SHA2_512', N'bunny')
    and bigval = N'bunny';
    GO

    • Kendra Little
      May 6, 2015 7:19 am

      Hi there,

      Fulltext might help with some scenarios, but it doesn’t support full wildcard searches like ‘%this%’. If you want to do prefix searches or find phrases containing a word, it might fit your needs though.

      Kendra

  • You are wrong saying “In theory there shouldn’t be collisions with SHA-512”. This phrase from Wikipedia is about security issues. It says that no proven collision attack was found (yet).
    Since your data column is longer than the hash column, there will be collisions. It’s just a matter of probability and luck.

  • Michael Schall
    December 7, 2015 11:12 am

    I found that I had to create the persisted column with a cast. Otherwise the database created a varbinary(8000) column which cannot have an index on it…

    You will have to adjust the size you cast to based on the algorithm you choose.

    ALTER TABLE LookupValues ADD bigvalhash AS CAST(HASHBYTES('SHA2_256', bigval) as VARBINARY(32)) PERSISTED;

