
Key length matters in SQL Server indexes.

It’s a best practice to keep your index keys as narrow as possible, and SQL Server enforces a maximum key length of 900 bytes on most “normal” clustered and nonclustered indexes.

But what happens if you want to optimize lookups against a wide column? You're not necessarily out of luck; you may just have to get a bit creative.

What If I Need to Do an Equality Search on a Wide Column?

Let's say I have a simple table. I have a narrow key on my clustered index, and then I have a pretty wide variable-length column. I need the wide column to be Unicode, which makes it even wider: nvarchar takes two bytes per character.

Here’s our sample table with a few rows (just pretend it has a lot more):

CREATE TABLE dbo.LookupValues (
	i int identity,
	bigval nvarchar(2000) default (REPLICATE('d',700)),
	constraint pk_LookupValues_i primary key (i)
);
GO

--Insert rows with the default values
begin tran
	declare @i smallint = 0;
	while @i < 10000
	begin
		insert dbo.LookupValues default values;
		set @i=@i+1;
	end
commit
GO

--Insert a few smaller values
insert dbo.LookupValues (bigval) VALUES ('big');
insert dbo.LookupValues (bigval) VALUES ('bunny');
insert dbo.LookupValues (bigval) VALUES ('bunny bunny');
GO

Let’s say we write to this table rarely, but query it often. When this query runs, I want to make it as fast as possible:

SELECT i
from dbo.LookupValues
where bigval = N'bunny';

Right now, this query has to scan every row in the clustered index (the whole table) to find instances where bigval = N'bunny'. That's not ideal, and as the table grows it'll only get worse, burning more IO and CPU and taking longer over time.
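
If you want to see the scan for yourself, STATISTICS IO will show the logical reads piling up (the exact counts will vary with your data):

SET STATISTICS IO ON;
GO
SELECT i
from dbo.LookupValues
where bigval = N'bunny';
GO
SET STATISTICS IO OFF;
GO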

There’s usually an easy way to make a query like this fast: just create a nonclustered index on the bigval column. But when I try, it doesn’t work because of restrictions on key size.

--Make my query faster!
CREATE NONCLUSTERED INDEX ix_LookupValues_bigval on dbo.LookupValues (bigval);
GO

SQL Says:

Warning! The maximum key length is 900 bytes. The index 'ix_LookupValues_bigval' has maximum length of 4000 bytes. For some combination of large values, the insert/update operation will fail.
Msg 1946, Level 16, State 3, Line 1
Operation failed. The index entry of length 1400 bytes for the index 'ix_LookupValues_bigval' exceeds the maximum length of 900 bytes.
The statement has been terminated.

Terminated. Yeah. I can’t just index this to make my query fast.

Options for Indexing Wide Keys

So what’s a performance tuner to do?

My first thought when I hit this problem was that I might have to use a fulltext index. A fulltext index can work here: it lets you index large columns. But it would be kind of a bummer to have to do it. Fulltext indexes have extra overhead and are really designed for different things than a simple equality search, so it would be like using a jackhammer because you can't find a mallet.

My partner Jeremiah Peschka came up with a quick and clever solution using an indexed computed column. You can work all sorts of cool magic with computed columns in SQL Server; the trick is just to remember they exist!

Here’s how it works: you add a computed column to the table that’s the hash of the large value. You then index the computed column and modify your query to take advantage of it.

In this example we use SHA2_512 as the hashing algorithm. This gives an output of 64 bytes, well within our limits for index key sizes.

ALTER TABLE dbo.LookupValues ADD bigvalhash AS HASHBYTES('SHA2_512', bigval) PERSISTED;
GO

CREATE NONCLUSTERED INDEX ix_LookupValues_bigvalhash on dbo.LookupValues (bigvalhash) INCLUDE (bigval);
GO

Now, to get the query to use the new index, we need to change it a bit:

SELECT i
from dbo.LookupValues
where bigvalhash = HASHBYTES('SHA2_512', N'bunny')
	and bigval = N'bunny';
GO

This revised approach gives me a targeted index seek and limits my logical reads. Voila!
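
Beyond just reading the execution plan, one rough way to sanity-check which index the query used is sys.dm_db_index_usage_stats. This is only a sketch of a check, not a precise measurement: the DMV's counters reset when the instance restarts.

SELECT ix.name, us.user_seeks, us.user_scans
FROM sys.dm_db_index_usage_stats AS us
JOIN sys.indexes AS ix
	ON us.object_id = ix.object_id AND us.index_id = ix.index_id
WHERE us.database_id = DB_ID()
	and us.object_id = OBJECT_ID('dbo.LookupValues');
GO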

The Fine Print on This Solution

There are a few things to note:

  • HASHBYTES results are dependent upon datatype. If my query used HASHBYTES('SHA2_512', 'bunny'), it would not find any rows, because the column holds hashes of Unicode values and I would be providing the hash of a non-Unicode value (see the quick demo after this list).
  • I do still include "bigval = N'bunny'" in my query. SHA-512 collisions are wildly improbable, but the extra predicate doesn't add much expense to the query, so in my example I deemed it "worth it" to me. You might make a different choice.
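
Here's a quick way to see the datatype dependence for yourself. The two hashes come out different, so the varchar version would never match the indexed values:

SELECT HASHBYTES('SHA2_512', 'bunny') as varchar_hash,
	HASHBYTES('SHA2_512', N'bunny') as nvarchar_hash;
GO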

Sometimes Old Tools Do the Trick

What I love most about this solution is that it's creative, but not really weird when you think about it. It uses standard features that have been in SQL Server for a long time to do something it seems like the product wouldn't support. And that's really cool.

  1. Very cool. I love this technique.

    It reminds me of the example at CHECKSUM. If we're guarding against collisions anyway, then maybe this hash function would suit us.

    I mean, in a different scenario where we do write to that table frequently, maybe CPU does matter. My guess is that CHECKSUM is less CPU intensive; we don't get the cryptographic security of a one-way hash function, but we weren't using that anyway. Is that fair? Is the CPU difference between HASHBYTES(sha2_512) and CHECKSUM significant?
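
    Just so I'm concrete, here's the sort of variation I'm imagining, reusing Kendra's table (a sketch only; the column and index names are made up):

    --CHECKSUM returns an int, so the key is only 4 bytes wide
    ALTER TABLE dbo.LookupValues ADD bigvalchecksum AS CHECKSUM(bigval);
    GO
    CREATE NONCLUSTERED INDEX ix_LookupValues_bigvalchecksum on dbo.LookupValues (bigvalchecksum) INCLUDE (bigval);
    GO

    --The equality check on bigval matters even more here, since collisions are likely
    SELECT i
    from dbo.LookupValues
    where bigvalchecksum = CHECKSUM(N'bunny')
    	and bigval = N'bunny';
    GO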

  2. That is a great technique Kendra – thanks for sharing. You are right – we often forget about the ability to use a computed column and this is a great reminder of their many great uses. One question – let’s say you are attempting to tune a query that is already in production and you don’t have the option to change the SQL? How would you go about deciding if implementing the fulltext index, even if it is a jackhammer when a mallet will do, is worth it?

    • Hi Todd,

      For the fulltext solution, you would also need to change the TSQL. To take advantage of the fulltext index you’d need to use a fulltext predicate in the query (such as “CONTAINS”). There’d also be some different considerations regarding timing– you can have latency where data is in the table and not yet present in the fulltext index, which can be a deal breaker sometimes.
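
      Just to sketch out what that would look like (this assumes a fulltext catalog, and remember CONTAINS does word matching rather than a strict equality test):

      --ft_LookupValues is just an example catalog name
      CREATE FULLTEXT CATALOG ft_LookupValues;
      GO
      CREATE FULLTEXT INDEX ON dbo.LookupValues (bigval)
      	KEY INDEX pk_LookupValues_i ON ft_LookupValues;
      GO

      SELECT i
      from dbo.LookupValues
      where CONTAINS(bigval, N'bunny');
      GO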

      Hope this helps!
      Kendra

  3. Nice example! Just out of curiosity, is there any real difference between using a computed column of HASHBYTES vs using LEFT(bigval, 450)? Admittedly it's a larger index, but it seems like it would avoid the data type problem and make the queries simpler (sketch below).

    The following query ends up using the new index when I tested it.

    SELECT i
    from dbo.LookupValues
    where bigval = N'bunny';
    GO
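
    For concreteness, the kind of computed column I mean is something along these lines (450 nvarchar characters lands right at the 900-byte limit; names are made up):

    ALTER TABLE dbo.LookupValues ADD bigvalprefix AS LEFT(bigval, 450);
    GO
    CREATE NONCLUSTERED INDEX ix_LookupValues_bigvalprefix on dbo.LookupValues (bigvalprefix) INCLUDE (bigval);
    GO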

  4. CHECKSUM is much less CPU intensive, but you will get collisions like crazy: it is basically a CRC32, and that just isn't enough bits to encode a large volume of bytes. Also, since it is a very simple algorithm, you would need many more bits to avoid collisions than just using SHA. You could generate a SHA on collision, but that would mean extra coding and might slow you down more than just using SHA in the first place. Even MD5 on a large enough set of columns and enough rows will yield collisions at some point. SHA2_512 is probably as efficient as you are going to get while being guaranteed(*) no collisions. If you have very small amounts of data to hash, CHECKSUM may be fine, but for this I wouldn't use it.

    -wes

    • After about 100,000 random values, the chance of a CHECKSUM collision is about 60%. But avoiding collisions is not a goal. (Neither Kendra nor CHECKSUM's documentation counts on unique hashes.)

      • The collision question makes me wonder about statistics and query plans on some larger sets of data! It would take a bit of time to do some testing on it, but it would certainly be interesting.

        How the different options handle null values also might be an interesting consideration for some people– if the dataset contains a lot of nulls and you want to filter them out, you’d want to specifically use one of the functions that preserves null in the result. (That doesn’t really matter when it comes to storage size for fixed width types like ints, but the ability to filter is interesting.)

  5. I agree with Wes. I played with both CHECKSUM and SHA recently for an nvarchar(max) column and found their performance similar (on an underpowered billing db server), both about 10x faster than not having an index, but I had issues with CHECKSUM when I tried to keep the index length low. Using fulltext for me was quite a bit slower, though not as bad as having no index, and not worth the PITA given that SHA was faster and simpler to administer.
    John

  6. I recommend shortening the hash to 16 bytes (Guid size, which is already considered worldwide unique). As you're keeping the true equality check anyway, this is safe. You could even reduce the hash size to something like 8 bytes and still have an insanely high probability of matching only the exact sought-after row.
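
    As a sketch, trimming the persisted hash could look like this (casting varbinary to a shorter binary truncates on the right, keeping the first 16 bytes of the SHA2_512 output; names are made up):

    ALTER TABLE dbo.LookupValues ADD bigvalhash16 AS CAST(HASHBYTES('SHA2_512', bigval) AS binary(16)) PERSISTED;
    GO
    CREATE NONCLUSTERED INDEX ix_LookupValues_bigvalhash16 on dbo.LookupValues (bigvalhash16) INCLUDE (bigval);
    GO

    SELECT i
    from dbo.LookupValues
    where bigvalhash16 = CAST(HASHBYTES('SHA2_512', N'bunny') AS binary(16))
    	and bigval = N'bunny';
    GO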

  7. Hey Team,

    Thanks Kendra for your example. With this I really found an easier way to work with wide columns. The only issue I had was that the hash in this example gave me NULL values as results, until I realized that SHA2_512 only works on SQL 2012 and later, not on previous versions. I think it's great to find solutions like this that many of us (me included) don't keep in mind when resolving issues, ignoring features like computed columns, which are awesome depending on the case.

  8. Really nice post, but my problem now is: how can I check all the indexes on all tables in my database to find which ones have wide keys?

    Does anyone have a script to check the key column sizes for each index?


  10. I think it is more optimal to store the hash as BIGINT (aka 8 bytes), as the default behavior is to store that column as varbinary (8000 bytes). See the cast below:
    ALTER TABLE dbo.LookupValues ADD bigvalhash AS CAST(HASHBYTES('SHA2_512', bigval) as BIGINT) PERSISTED;

