Key length matters in SQL Server indexes.
It’s a best practice to keep your index keys as narrow as possible, and SQL Server enforces a maximum key length of 900 bytes on most “normal” clustered and nonclustered indexes.
But what happens if you want to optimize the lookup of a wide column? You're not necessarily out of luck; you may just have to get a bit creative.
What If I Need to do an Equality Search on a Wide Column?
Let’s say I have a simple table. I have a narrow key on my clustered index and then I have a pretty wide variable length column. I need the wide column to be unicode, which makes it even wider, since unicode data types take up more room.
Here’s our sample table with a few rows (just pretend it has a lot more):
CREATE TABLE dbo.LookupValues (
    i int identity,
    bigval nvarchar(2000) default (REPLICATE('d',700)),
    constraint pk_LookupValues_i primary key (i)
);
GO

--Insert rows with the default values
begin tran
declare @i smallint = 0;
while @i < 10000
begin
    insert dbo.LookupValues default values;
    set @i = @i + 1;
end
commit
GO

--Insert a few smaller values
insert dbo.LookupValues (bigval) VALUES ('big');
insert dbo.LookupValues (bigval) VALUES ('bunny');
insert dbo.LookupValues (bigval) VALUES ('bunny bunny');
GO
Let’s say we write to this table rarely, but query it often. When this query runs, I want to make it as fast as possible:
SELECT i from dbo.LookupValues where bigval = N'bunny';
Right now, this query has to scan every row in the clustered index (the whole table) to find instances where bigval = N'bunny'. That's not ideal, and as the table grows it will only get worse, burning more IO and CPU and taking longer over time.
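If you'd like to see the scan for yourself, SET STATISTICS IO shows the logical reads. This is just a sketch; the exact read counts depend on your data:

```sql
--See the scan's cost for yourself (read counts will vary):
SET STATISTICS IO ON;
SELECT i from dbo.LookupValues where bigval = N'bunny';
SET STATISTICS IO OFF;
--The Messages tab shows a high logical read count for LookupValues,
--and the actual execution plan shows a Clustered Index Scan.
```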
There’s usually an easy way to make a query like this fast: just create a nonclustered index on the bigval column. But when I try, it doesn’t work because of restrictions on key size.
--Make my query faster!
CREATE NONCLUSTERED INDEX ix_LookupValues_bigval on dbo.LookupValues (bigval);
GO
SQL Says:
[code]
Warning! The maximum key length is 900 bytes. The index 'ix_LookupValues_bigval' has maximum length of 4000 bytes. For some combination of large values, the insert/update operation will fail.
Msg 1946, Level 16, State 3, Line 1
Operation failed. The index entry of length 1400 bytes for the index 'ix_LookupValues_bigval' exceeds the maximum length of 900 bytes.
The statement has been terminated.
[/code]
Terminated. Yeah. I can’t just index this to make my query fast.
Options for Indexing Wide Keys
So what’s a performance tuner to do?
My first thought when I hit this problem was that I might have to use a fulltext index. A fulltext index can work here (it lets you index large columns), but it would be kind of a bummer to have to do it. Fulltext indexes have extra overhead and are really designed for different things than a simple equality search, so it would be like using a jackhammer because you can't find a mallet.
My partner Jeremiah Peschka came up with a quick and clever solution using an indexed computed column. You can work all sorts of cool magic with computed columns in SQL Server– the trick is just to remember them!
Here’s how it works: you add a computed column to the table that’s the hash of the large value. You then index the computed column and modify your query to take advantage of it.
In this example we use SHA2_512 as the hashing algorithm. It produces a 64-byte output, well within the 900-byte limit for index keys.
ALTER TABLE dbo.LookupValues
    ADD bigvalhash AS HASHBYTES('SHA2_512', bigval) PERSISTED;
GO
CREATE NONCLUSTERED INDEX ix_LookupValues_bigvalhash
    on dbo.LookupValues (bigvalhash) INCLUDE (bigval);
GO
Now, to get the query to use the new index, we need to change it a bit:
SELECT i from dbo.LookupValues
where bigvalhash = HASHBYTES('SHA2_512', N'bunny')
and bigval = N'bunny';
GO
This revised approach gives me a targeted index seek and limits my logical reads. Voila!
The Fine Print on This Solution
There are a few things to note:
- HASHBYTES results depend on the datatype of the input. If my query used HASHBYTES('SHA2_512', 'bunny'), it would not find any rows: the column stores hashes of unicode values, and I would be providing the hash of a non-unicode value.
- I do still include "bigval = N'bunny'" in my query. In theory there shouldn't be collisions with SHA-512, but the extra check doesn't add much expense to the query, and in my example I deemed it "worth it" to me. You might make a different choice.
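The datatype point is easy to demonstrate. This quick sketch hashes the same five letters as varchar and as nvarchar, and the two results differ:

```sql
--HASHBYTES output depends on the input datatype:
SELECT HASHBYTES('SHA2_512', 'bunny')  AS varchar_hash,   --hashes 5 bytes
       HASHBYTES('SHA2_512', N'bunny') AS nvarchar_hash;  --hashes 10 bytes
--The two hashes are different, so a non-unicode literal will never match
--the hash of the unicode column.
```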
Sometimes Old Tools Do the Trick
What I love most about this solution is that it’s creative, but it’s not really weird, when you think about it. It uses standard features that have been in SQL Server for a long time to create a way to do something that seems like the product wouldn’t support– and that’s really cool.
26 Comments
Very cool. I love this technique.
It reminds me of the example in the CHECKSUM documentation. If we're guarding against collisions anyway, then maybe that hash function would suit us.
I mean, in a different scenario where we do write to that table frequently, maybe CPU does matter. My guess is that CHECKSUM is less CPU intensive; we don't get the cryptographic security of a one-way hash function, but we weren't using that anyway. Is that fair? Is the CPU difference between HASHBYTES('SHA2_512') and CHECKSUM significant?
Hi Michael,
I happen to have a quick test harness saved from a few months ago when I was looking into alternate dedupe methods. I put it up here if you want to take a look. I tested checksum, binary checksum, and the different hashbytes algorithms.
http://pastebin.com/iF4XBSab
Thanks,
Erik
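For the curious, a minimal sketch of that kind of comparison (this is not the linked harness; timings will vary wildly by hardware):

```sql
--Minimal CPU comparison sketch: time 100,000 hash computations.
DECLARE @start datetime2 = SYSDATETIME();
DECLARE @h varbinary(64), @i int = 0;
DECLARE @s nvarchar(2000) = REPLICATE(N'd', 700);
WHILE @i < 100000
BEGIN
    SET @h = HASHBYTES('SHA2_512', @s);
    SET @i += 1;
END
SELECT DATEDIFF(ms, @start, SYSDATETIME()) AS hashbytes_ms;
--For the second run, swap in an int variable and CHECKSUM(@s) to compare.
```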
That is a great technique Kendra – thanks for sharing. You are right – we often forget about the ability to use a computed column and this is a great reminder of their many great uses. One question – let’s say you are attempting to tune a query that is already in production and you don’t have the option to change the SQL? How would you go about deciding if implementing the fulltext index, even if it is a jackhammer when a mallet will do, is worth it?
Hi Todd,
For the fulltext solution, you would also need to change the TSQL. To take advantage of the fulltext index you’d need to use a fulltext predicate in the query (such as “CONTAINS”). There’d also be some different considerations regarding timing– you can have latency where data is in the table and not yet present in the fulltext index, which can be a deal breaker sometimes.
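A rough sketch of what that fulltext alternative looks like against the sample table (the catalog name ft_demo is made up for this example, and your server must have fulltext installed):

```sql
--Sketch of the fulltext alternative. Requires a unique, single-column,
--non-nullable key index; the primary key from the example works.
CREATE FULLTEXT CATALOG ft_demo AS DEFAULT;
GO
CREATE FULLTEXT INDEX ON dbo.LookupValues (bigval)
    KEY INDEX pk_LookupValues_i ON ft_demo;
GO
--The query must change to use a fulltext predicate:
SELECT i from dbo.LookupValues WHERE CONTAINS(bigval, N'bunny');
```

Note the latency caveat above: rows can exist in the table before the fulltext index has picked them up.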
Hope this helps!
Kendra
Nice example! Just out of curiosity is there any real difference between using a computed column of HASHBYTES vs using a LEFT(bigval,450) ? Admittedly it’s a larger index, but It seems like it would avoid the problem of data type and make the queries simpler.
The following query ends up using the new index when I tested it.
SELECT i
from dbo.LookupValues
where bigval = N'bunny';
GO
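The prefix-column idea can be sketched like this (the column and index names are made up; 450 nvarchar characters is 900 bytes, right at the key-size limit):

```sql
--Sketch of the LEFT() prefix alternative to hashing:
ALTER TABLE dbo.LookupValues
    ADD bigvalprefix AS LEFT(bigval, 450) PERSISTED;
GO
CREATE NONCLUSTERED INDEX ix_LookupValues_bigvalprefix
    on dbo.LookupValues (bigvalprefix) INCLUDE (bigval);
GO
--For search strings of 450 characters or fewer, the prefix equals the
--search string itself, so the residual check on bigval handles the rest:
SELECT i from dbo.LookupValues
where bigvalprefix = N'bunny'
and bigval = N'bunny';
```

The tradeoff is a much wider index key than the 64-byte hash, in exchange for simpler queries and no datatype gotchas.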
CHECKSUM is much less CPU intensive, but you will get collisions like crazy: it is basically a CRC32, and that just isn't enough bits to encode a large volume of bytes. Also, since it is a very simple algorithm, you would need many more bits to avoid collisions than just using SHA. You could generate a SHA on collision, but that would mean extra coding and may slow you down more than just using SHA in the first place. Even MD5 on a large enough set of columns and enough rows will yield collisions at some point. SHA2_512 is probably as efficient as you are going to get while being guaranteed(*) no collisions. If you have very small amounts of data to hash, CHECKSUM may be fine, but for this I wouldn't use it.
-wes
After about 100,000 random values, the chance of a CHECKSUM collision is about 60%. But avoiding collisions is not a goal here (neither Kendra nor CHECKSUM's documentation counts on unique hashes).
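For comparison, a CHECKSUM version of the same pattern is a sketch away. The collisions it produces are harmless precisely because the query re-checks bigval; the index just has to narrow the candidate rows:

```sql
--CHECKSUM variant: a 4-byte int key, cheap to compute, many collisions.
ALTER TABLE dbo.LookupValues
    ADD bigvalcheck AS CHECKSUM(bigval);
GO
CREATE NONCLUSTERED INDEX ix_LookupValues_bigvalcheck
    on dbo.LookupValues (bigvalcheck) INCLUDE (bigval);
GO
--The residual equality check on bigval is mandatory here:
SELECT i from dbo.LookupValues
where bigvalcheck = CHECKSUM(N'bunny')
and bigval = N'bunny';
```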
The collision question makes me wonder about statistics and query plans on some larger sets of data! It would take a bit of time to do some testing on it, but it would certainly be interesting.
How the different options handle null values also might be an interesting consideration for some people– if the dataset contains a lot of nulls and you want to filter them out, you’d want to specifically use one of the functions that preserves null in the result. (That doesn’t really matter when it comes to storage size for fixed width types like ints, but the ability to filter is interesting.)
@Kendra
Tony Rogerson is publishing a series of posts on using bloom filters with SQL Server.
He starts with a very similar hashing technique but stores the output in a bloom filter rather than directly in an index.
Anyway, I’m mentioning it here because he has some stats on hash collisions and false positives. Not entirely applicable here but instructive nonetheless.
http://dataidol.com/tonyrogerson/2013/05/09/reducing-sql-server-io-and-access-times-using-bloom-filters-part-1-concepts/
archived link for the lazy since the live one is no longer available
https://web.archive.org/web/20130907204530/http://dataidol.com/tonyrogerson/2013/05/09/reducing-sql-server-io-and-access-times-using-bloom-filters-part-1-concepts/
I agree with Wes. I played with both Checksum and SHA recently for a nvarchar(max) column and found them to be similar performance (on an underpowered billing db server), both about 10x faster than not having an index, but had issues with checksum when I tried to keep the index length low. Using Full Text for me was quite a bit slower, but not as bad as having no index, and not worth the PITA given SHA was faster and simpler to administer.
John
I recommend shortening the hash to 16 bytes (the size of a GUID, which is already considered worldwide unique). As you're keeping the true equality check anyway, this is safe. You could even reduce the hash to something like 8 bytes and still have an insanely high probability of matching only the exact sought-after row.
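That suggestion can be sketched by casting the SHA2_512 output down to its first 16 bytes (the column and index names below are made up for illustration):

```sql
--Sketch of the shortened-hash idea: keep only 16 bytes of SHA2_512.
--Safe only because the query still verifies bigval itself.
ALTER TABLE dbo.LookupValues
    ADD bigvalhash16 AS CAST(HASHBYTES('SHA2_512', bigval) AS binary(16)) PERSISTED;
GO
CREATE NONCLUSTERED INDEX ix_LookupValues_bigvalhash16
    on dbo.LookupValues (bigvalhash16) INCLUDE (bigval);
GO
SELECT i from dbo.LookupValues
where bigvalhash16 = CAST(HASHBYTES('SHA2_512', N'bunny') AS binary(16))
and bigval = N'bunny';
```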
Hey Team,
Thanks Kendra for your example. With this I really found an easier way of working with wide columns. The only issue I had was with the hash algorithm used in this example: I was getting NULL values as results, but then I realized that SHA2_512 only works on SQL Server 2012 and later, not previous versions. It's great to find solutions like this that many of us (me included) don't keep in mind when resolving issues, ignoring this type of feature (computed columns), which is awesome depending on the case.
Really very nice post, but my problem now is: how can I check all the indexes on all tables in my database to find which ones have wide keys?
Does anyone have a script to check the key column size for an index?
Hi there– our sp_BlitzIndex® free tool looks for wide keys. https://www.brentozar.com/blitzindex/
Thanks for sharing.
I think it is more optimal to store the hash as BIGINT (8 bytes), as the default behavior is to store that column as varbinary (8000 bytes). See the cast below:
ALTER TABLE dbo.LookupValues ADD bigvalhash AS CAST(HASHBYTES('SHA2_512', bigval) as BIGINT) PERSISTED;
So what if you have a string that’s larger than 8000 bytes (NVARCHAR(4000)+ or VARCHAR(8000)+ ) or a combination of columns that you want to apply HASHBYTES to that adds up to more than 8000B? Do you have any examples of hashing such large strings with pure SQL?
Not offhand. But this just hasn’t come up for me much in the 18 months since I wrote the post. The first question I would ask is why you need to index such large strings, and are you really going to be able to seek on an index?
Also, just a general note of caution with indexed computed columns and running DBCC CHECKDB:
http://www.sqlskills.com/blogs/paul/dbcc-checkdb-performance-and-computed-column-indexes/
Hi
What if I want to do a LIKE search on a wide key?
Will I have to go for fulltext? How can I implement LIKE in the above example?
SELECT i
from dbo.LookupValues
where bigvalhash = HASHBYTES('SHA2_512', N'bunny')
and bigval = N'bunny';
GO
Hi there,
Fulltext might help with some scenarios, but it doesn’t support full wildcard searches like ‘%this%’. If you want to do prefix searches or find phrases containing a word, it might fit your needs though.
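Concretely, a fulltext prefix search would look something like this (assuming a fulltext index already exists on bigval):

```sql
--Fulltext supports prefix terms (note the double quotes around the term):
SELECT i from dbo.LookupValues WHERE CONTAINS(bigval, N'"bunn*"');
--But a leading-wildcard search like LIKE N'%unny%' is something neither
--fulltext nor the hash index can speed up.
```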
Kendra
You are wrong to say "In theory there shouldn't be collisions with SHA-512". That phrase from Wikipedia is about security: it says that no practical collision attack has been found (yet).
Since your data column is longer than your hash column, collisions must exist. It's just a matter of probability and luck.
I found that I had to create the persisted column with a cast. Otherwise the database created a varbinary(8000) column which cannot have an index on it…
You will have to adjust the size you cast to based on the algorithm you choose.
ALTER TABLE LookupValues ADD bigvalhash AS CAST(HASHBYTES('SHA2_256', bigval) as VARBINARY(32)) PERSISTED;
I also had to add the cast, as Michael Schall suggested. Great solution.