Why You Should Test Your Queries Against Bigger Data

I Like Testing Things

Especially Brent’s patience. And I like watching people test things. SQL Server is sort of perfect for that. You make some changes, you hit F5, and you wait for something to finish. If it’s faster, you ship it. Right? I mean, we all know you just added NOLOCK hints anyway.

A long time ago, at a company far, far away, I did something stupid. I was working on a dashboard report, and I used data from a past project to replicate the layout. The old project only had about 1000 users, so reporting on it was simple. We didn’t even need nonclustered indexes. Of course, the new project ended up fielding 2.6 million users, and got extended for 12 months. It ended up with close to 4 million users. The dashboard queries stopped working well after around 100k users, if you’re wondering.

Working on sp_BlitzIndex

Is fun and rewarding and you should totally contribute code. We recently added some informational checks for statistics: whether they’re out of date despite a bunch of modifications, whether they were created with NORECOMPUTE, and some other stuff. Cool! I wanted to get something started for when I find out the details on this Connect Item getting fixed. It’s easier to add information to an existing pull than it is to add a whole new pull of information. Just figuring out column nullability is a headache. I MEAN FUN AND REWARDING NOT A HEADACHE!
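Something in the spirit of this is what those checks look like (a minimal sketch, not sp_BlitzIndex’s actual code; the modification threshold here is made up for illustration):

/* Minimal sketch of the informational checks described above, using
   sys.dm_db_stats_properties (2008 R2 SP2 / 2012 SP1 and later).
   The 1,000-modification threshold is arbitrary, for illustration. */
SELECT s.object_id,
       s.name AS stats_name,
       s.no_recompute,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE sp.modification_counter > 1000 /* lots of modifications since the last update */
   OR s.no_recompute = 1;            /* created or updated with NORECOMPUTE */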

While working out some kinks, I wanted a way to create a ton of statistics objects to see how slow the queries would run under object duress. This is handy, because we may skip or filter certain checks based on how many objects there are to process. For instance, we won’t loop through all databases if you have more than 50 of them. That can take a heck of a long time. Thinking back to one of Brent’s Bad Idea Jeans posts, I decided to do something similar by creating statistics.
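As a rough sketch of what that kind of guard looks like (not the actual sp_BlitzIndex logic):

/* Hedged sketch of a bail-out guard: if there are more than 50
   databases, skip the all-databases loop entirely. */
DECLARE @database_count INT;

SELECT @database_count = COUNT(*)
FROM sys.databases
WHERE state = 0; /* ONLINE databases only */

IF @database_count > 50
BEGIN
    RAISERROR('More than 50 databases; skipping the loop.', 0, 1) WITH NOWAIT;
    RETURN; /* bail out early (inside a stored procedure) */
END;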

Different Strokes

Statistics have slightly different rules than indexes. You can have up to 30,000 statistics objects per table. Well, fine, but I don’t need that many. I only want to create about 100,000 total. In my restore of the Stack Overflow database, I have 11 tables. After some tweaking to this script, I got it so that it creates 100,044 objects. I had to do some maneuvering around not touching views, and not hitting certain column data types.

The thing is, I’m also lazy. I don’t want to copy and paste 100k rows. No thank you. I’m gonna loop this one and I don’t care who knows it. You can uncomment the EXEC if you want. There’s precious little defensive scripting in here, I know. But creating 100k stats objects is going to take long enough without me protecting myself from self-inflicted SQL injection.
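The original script isn’t reproduced here, but it was shaped something like this (a hedged reconstruction: the stats naming convention, the multiplier, and the excluded type list are my assumptions, not the original’s):

/* Hedged reconstruction of the looping approach: generate CREATE
   STATISTICS statements for every eligible table/column combination,
   multiplied out to inflate the object count. sys.tables keeps views
   out of the picture, and the type list skips columns that can't
   have statistics (list illustrative). */
DECLARE @sql NVARCHAR(MAX);

DECLARE stats_cur CURSOR LOCAL FAST_FORWARD FOR
SELECT N'CREATE STATISTICS '
       + QUOTENAME(N'bad_idea_stats_' + t.name + N'_' + c.name + N'_'
                   + CONVERT(NVARCHAR(10), m.n))
       + N' ON ' + QUOTENAME(SCHEMA_NAME(t.schema_id)) + N'.' + QUOTENAME(t.name)
       + N' (' + QUOTENAME(c.name) + N');'
FROM sys.tables AS t
JOIN sys.columns AS c
    ON c.object_id = t.object_id
JOIN sys.types AS ty
    ON ty.user_type_id = c.user_type_id
CROSS JOIN (SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
            FROM sys.all_objects) AS m /* multiplier: 100 stats per column */
WHERE ty.name NOT IN (N'text', N'ntext', N'image', N'xml', N'geography', N'geometry')
  AND c.is_computed = 0;

OPEN stats_cur;
FETCH NEXT FROM stats_cur INTO @sql;

WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @sql;
    -- EXEC sys.sp_executesql @sql; /* uncomment to actually create them */
    FETCH NEXT FROM stats_cur INTO @sql;
END;

CLOSE stats_cur;
DEALLOCATE stats_cur;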

So What Did I Find?

Well, after adding 100k objects, the query favoring the new syntax still finished in 2 seconds. The older syntax query ran… Forever. And ever. Maybe. I killed it after 30 seconds, because that’s unacceptable. For reference, the ‘old query’ is the one that doesn’t use sys.dm_db_stats_properties, because it wasn’t invented yet. I have to hit sys.sysindexes to get a little bit of the information back. Right now, this query pulls back a few more columns than necessary, but they’ll likely be used when I finish adding gizmos and doodads. It’ll be grand.
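The post’s query is longer than this, but the ‘old syntax’ is shaped roughly like so (a hedged approximation; matching statistics rows in sys.sysindexes by name is my assumption):

/* Hedged approximation of the "old syntax": on builds without
   sys.dm_db_stats_properties, rowcount and modification info has to
   come from the deprecated sys.sysindexes compatibility view. */
SELECT s.object_id,
       s.name AS stats_name,
       s.no_recompute,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated,
       si.rowcnt AS [rows],
       si.rowmodctr AS modification_counter
FROM sys.stats AS s
JOIN sys.objects AS o
    ON o.object_id = s.object_id
JOIN sys.sysindexes AS si
    ON si.id = s.object_id
   AND si.name = s.name /* assumption: match stats rows by name */
WHERE o.is_ms_shipped = 0;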

Where’s The Beef?

The first thing I want to find out is whether one of the join conditions is the culprit. I’ll usually change my query to a COUNT(*) and comment out joins to see which is the most gruesome. In this case it was, of course, the one I really needed: joining to sys.sysindexes. Without it, the query finishes immediately. Of course, it also finishes without some really helpful information. So I can’t just skip it! I mean, I could just not give you information about statistics prior to 2008 R2 SP2, 2012 SP1, etc. But that would leave large swaths of the community out. You people are terribly lazy about upgrading and patching! So the kid stays in the picture, as they say. I think.
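In miniature, the technique looks like this (same hypothetical query shape as above):

/* Swap the select list for COUNT(*), then comment joins back in one
   at a time until the runtime blows up. Here, sys.sysindexes is the
   suspect. */
SELECT COUNT(*)
FROM sys.stats AS s
JOIN sys.objects AS o
    ON o.object_id = s.object_id
-- JOIN sys.sysindexes AS si
--     ON si.id = s.object_id
--    AND si.name = s.name /* comment back in: runtime explodes */
WHERE o.is_ms_shipped = 0;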

Another thing I figured out is that if I filtered out rows from sys.sysindexes that had 0 for rowcnt, the query was back to finishing in about a second. Unfortunately, that seemed to filter out all system-generated statistics. I tried my best to get SQL Server to pay attention to them, but never got anything out of it. That sucks, because now I can’t give you any information about your system-generated stats, but there’s nothing I can do about that part. Aside from being outdated, they wouldn’t get caught in any of our other queries anyway. They’re probably not going to be filtered or created with NORECOMPUTE, and we can’t get rows sampled here either. So, out they go. If you ever wonder why you don’t get as much cool stats information on older versions of SQL Server, this is why.
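For reference, the filter amounts to this (same hypothetical query shape as before):

/* The rowcnt filter that restored performance, per the paragraph
   above. Side effect: system-generated (_WA_Sys) statistics show
   rowcnt = 0 in sys.sysindexes, so they vanish from the results. */
SELECT s.object_id,
       s.name AS stats_name,
       si.rowcnt,
       si.rowmodctr
FROM sys.stats AS s
JOIN sys.sysindexes AS si
    ON si.id = s.object_id
   AND si.name = s.name
WHERE si.rowcnt > 0; /* fast again, but system stats disappear */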

Back To The Original Point

If I hadn’t tested this query against way more stats objects than most databases probably have, I never would have figured this out. If you have way more than 100k, let me know, and I’ll test against a higher number. The next time you’re testing a query, ask yourself if the data you’re developing with is near enough to the size of your production data to avoid missing obvious perf bottlenecks.

Thanks for reading!


2 Comments

  • Geoff Patterson
    December 8, 2016 3:21 pm

    SELECT COUNT(*) FROM sys.stats
-- 12,193,760

    It’s a dev / unit testing playground though, not production. Up to you whether that warrants further stress testing of sp_blitzIndex 🙂

