StackOverflow’s SQL 2008 FTS issue solved

Last Updated 7 years ago

Whew – the Microsoft folks really came through on this one and tracked down the problem pretty quickly. I want to personally thank Denis Churin, Nikita Shamgunov and Sundaram Narayanan for their detailed investigation into the issue and helping us get it fixed. I’m going to explain the issue to help anybody else get through it in case they run into similar problems.

How SQL 2008’s Full Text Search is Different than 2005

Let’s start with the basics: in SQL Server 2005, the full text data lived outside of SQL Server and wasn’t subject to transactional locking. If you inserted a gazillion records into a table that had a full text index on it, the indexes were rebuilt without worrying about simultaneous inserts and locking. It wasn’t like a traditional SQL Server table.

In SQL Server 2008, the full text indexes were moved inside the database server and became subject to transactional locking. By default, the indexes are updated automatically whenever SQL Server determines they need to be updated. When that merge process happens, SQL Server needs to obtain some locks on the indexed table. Ideally, it grabs the locks when there’s a brief period of no load, does its merge work, and lets go of the locks. The merge process can be quite brief (well under a second) as long as the amount of data hasn’t changed dramatically.

In a heavily transactional environment when there’s a whole lot of inserts/updates/deletes on the indexed table, the DBA may need to restrict those merge activities to only a certain time window of the day. SQL Server 2008 gives you that ability to do merges manually, but I wanted to avoid that on StackOverflow.com. Every time I put in a manual job into a solution, it requires manual maintenance, alerting and corrective actions, and I don’t take that lightly.

Where The Problem Comes In: Convoys of Queries

Imagine this scenario:

A full-text select query is issued that looks like it’ll finish extremely quickly
The SQL Server doesn’t see much activity in the full text table (only selects, not inserts/updates/deletes) so it kicks off a merge
The select query doesn’t finish as quickly as the engine expected, and the merge can’t start until it obtains the locks it needs
More full-text queries come in (could be selects, inserts, updates, deletes) that need to obtain locks

Those newly issued queries in step 4 are suddenly delayed while waiting for query #1 to finish. The impact on full text performance varies depending on how long it takes query #1 to finish – might be milliseconds, might be seconds.

Denis Churin, Nikita Shamgunov and Sundaram Narayanan (the Microsoft heroes) suspected this might be our performance problem at StackOverflow, so they had us grab a memory dump and a database backup at the exact moment we were having performance problems. They looked at the memory dump and isolated a single particular full-text query that was confusing the engine. The engine was building an execution plan for it that didn’t work well, and instead of taking milliseconds, it took seconds (as many as 50 seconds). During that time, performance went into the toilet.

They rewrote the query in a different way the engine would analyze better, and when that query was changed, presto, the performance problems disappeared.

There’s a QFE coming in a few weeks that will let the merge thread run without blocking other queries, but for now, we’re in good shape.

Diagnosing The Problem in Your Environment

Sundaram gave us this query to help troubleshoot when a long-running query is blocking other queries. This helps identify the issue when a full-text select query suddenly blocks the SQL Server 2008 full text merge thread, and you can look at that query to make it run faster. I haven’t tested this in depth, but these guys have proven to be much more qualified than me, so I have a hunch it’ll work, heh:

declare @temp int
declare @parent int
declare @final int
set @parent = 0
while (@parent = 0)
BEGIN
select @parent=blocked from sys.sysprocesses where lastwaittype=’LCK_M_SCH_M’ and waittime > 30000
WAITFOR DELAY ’00:00:01′;
END

WHILE (@parent <> 0)
BEGIN
set @final = @parent
select @temp=blocked from sys.sysprocesses where spid = @parent
set @parent = @temp
END
select * from sys.sysprocesses where spid = @final
select * from sys.dm_exec_requests ER
CROSS APPLY sys.dm_exec_sql_text(ER.plan_handle) ST
where ER.session_id = @final
select * from sys.dm_exec_requests ER
CROSS APPLY sys.dm_exec_query_plan(ER.plan_handle) ST
where ER.session_id = @final

And now, finally, I can start performance tuning that server!

Moved to Michigan and got older

Tony Davis disses StackOverflow.com

31 Comments. Leave new

MarlonRibunal
November 12, 2008 7:49 pm

So NO downgrade to SQL Server 2005 took place? So, SO sticks with SQL Server 2008?

Reply
Rob Boek
November 13, 2008 2:17 am

I knew you could do it!

Reply
Jerry Hung
November 13, 2008 4:38 pm

I knew Microsoft AND you can do it, my faith in SQL 2008 has been restored 😀

Reply
Brent Ozar
November 13, 2008 4:39 pm

Mostly Microsoft. 🙂

Reply
Keith Robinson
November 17, 2008 5:24 pm

I believe I am having the same issue. How exactly should you run the query Microsoft gave you? I have the query running right now and it has been over 5 minutes and it is still running??

Reply
Brent Ozar
November 17, 2008 5:26 pm

It runs forever waiting for the issue to occur, and then it breaks when the issue occurs. If you want to force it to see what’ll happen, change this:

waittime > 30000

to this:

waittime > 1

Very quickly, you should get some results back. That 30,000 means 30 seconds (30,000 milliseconds). You can tweak that number to get results in your environment. They set it to 30 seconds for us because we were having some pretty long locking problems, like a minute or two, so 30 seconds was a safe number.

Reply
Keith Robinson
November 17, 2008 5:31 pm

Thank you for the reply. I tried it as you explained and it does work like you said. Thank you very much. This should really help to troubleshoot our queries.

Thanks again!

Reply
David
November 4, 2009 11:14 am

Hi guys, useful post, was wondering does the indexed text get stored in RAM or hardisk, i have a small VPS 512mb RAM one, and wondered, would it hammer all its memory to use FTS on a 50,000 record table, or is the index stored in database and accessed that way? So memory is not affected in the sense its not stored in memory?

Thanks

Reply
- Brent Ozar
  November 4, 2009 11:16 am
  
  Full text indexes are stored on the hard drives. You can’t store them in memory, unfortunately.
  
  Reply
Keith Robinson
November 4, 2009 11:18 am

In SQL 2008 the FTS are stored in tables and taken off disk. At least that is my experience. In SQL 2005 they are stored on disk.

Reply
- Brent Ozar
  November 4, 2009 11:41 am
  
  Keith – yes, they’re inside the database, but the database is on disk. If you have a big full text index, you have to be acutely aware of storage performance. It’s not like it’s cached in memory once and you’re done.
  
  Reply
David
November 5, 2009 4:16 am

Thanks so much guys that was helpful.

Reply
Steve
December 9, 2009 5:51 am

Hi Brent – Could you confirm when this problem was fixed? We’re going to be moving to 2008 early next year from 2005 and we use full-text heavily. I’m guessing the fix is in SP1 but can’t seem to find the article explaining the problem.

Thanks

Reply
- Brent Ozar
  December 9, 2009 6:19 am
  
  Steve – sorry, but I didn’t want to deploy the QFE in StackOverflow’s environment without testing, so we didn’t follow up on that. QFEs are quick fixes to address a specific problem, but they may cause other problems. If you’ve got any alternatives, you should avoid deploying QFEs in production. The guys tweaked the query instead, and it worked fine.
  
  I’d deploy a 2008 development server first, and do some stress testing there. You should be able to find out right away if your queries exhibit the convoy problem.
  
  Hope that helps!
  
  Reply
Steve
December 9, 2009 6:32 am

Thanks Brent – you’ve confirmed our potential problem. I was hoping for the easy answer. I’ll make sure we get the testing under way. We have a similar problem in 2005 with contention between FT Change Tracking and Transactional Replication on our front end DB machines.

Reply
Jonlee
June 24, 2010 4:26 am

Brent,

What RAID would you recommmend for iFTS?

Thanks

JL

Reply
- Brent Ozar
  June 24, 2010 5:17 am
  
  JL – well, if money’s no object, I recommend RAID 10 for everything. Unfortunately, money’s usually an object, so you have to decide where to invest your money based on the performance of your application. If 75% of your user queries are full text search, then it makes sense to put the iFTS data on RAID 10 even at the expense of data and log drives, but if that number is closer to 5-10%, you might want to invest more in your data or log drives. It’s a balance that requires careful investigation of the user queries, load patterns, and HA/DR needs (in terms of transaction log activity). Sorry that I don’t have an easier answer!
  
  Reply
Andrei Rinea
June 24, 2010 5:30 am

Could we know if this was fixed in SQL Server 2008 R2?

Reply
- Brent Ozar
  June 24, 2010 11:09 am
  
  Andrei – unfortunately, I don’t have an answer to that.
  
  Reply
Bryan Crow
July 6, 2010 9:05 pm

Yes, it’s fixed and released.

http://support.microsoft.com/kb/958947

Reply
- Dennis Gorelik
  June 12, 2011 11:35 am
  
  Bryan,
  Was it fixed in SP1 or SP2?
  
  I’m still getting similar timeouts in SQL Server 2008 SP1.
  
  http://www.microsoft.com/downloads/en/details.aspx?FamilyID=66ab3dbb-bf3e-4f46-9559-ccc6a4f9dc19&displaylang=en
  
  http://www.microsoft.com/downloads/en/details.aspx?FamilyID=8fbfc1de-d25e-4790-88b5-7dda1f1d4e17
  
  Reply
Mike Bird
August 3, 2010 10:09 pm

Brent thanks for the find here. Would you be willing to direct me to some input of lucene.net vs. Ifts? I am a dba who has been very happy with ifts, but I have dev now pushing for something different mainly lucene. Can you give me some pros and cons of both. Thanks!

Reply
- Brent Ozar
  August 7, 2010 4:59 pm
  
  Howdy, Mike. Unfortunately I haven’t worked with Lucene, so I don’t have any good references for it.
  
  Reply