StackOverflow’s SQL 2008 FTS issue solved

Stack Overflow
31 Comments

Whew – the Microsoft folks really came through on this one and tracked down the problem pretty quickly.  I want to personally thank Denis Churin, Nikita Shamgunov and Sundaram Narayanan for their detailed investigation into the issue and helping us get it fixed. I’m going to explain the issue to help anybody else get through it in case they run into similar problems.

How SQL 2008’s Full Text Search is Different than 2005

Let’s start with the basics: in SQL Server 2005, the full text data lived outside of SQL Server and wasn’t subject to transactional locking.  If you inserted a gazillion records into a table that had a full text index on it, the indexes were rebuilt without worrying about simultaneous inserts and locking.  It wasn’t like a traditional SQL Server table.

In SQL Server 2008, the full text indexes were moved inside the database server and became subject to transactional locking.  By default, the indexes are updated automatically whenever SQL Server determines they need to be updated.  When that merge process happens, SQL Server needs to obtain some locks on the indexed table.  Ideally, it grabs the locks when there’s a brief period of no load, does its merge work, and lets go of the locks.  The merge process can be quite brief (well under a second) as long as the amount of data hasn’t changed dramatically.

In a heavily transactional environment when there’s a whole lot of inserts/updates/deletes on the indexed table, the DBA may need to restrict those merge activities to only a certain time window of the day.  SQL Server 2008 gives you that ability to do merges manually, but I wanted to avoid that on StackOverflow.com.  Every time I put in a manual job into a solution, it requires manual maintenance, alerting and corrective actions, and I don’t take that lightly.

Where The Problem Comes In: Convoys of Queries

Imagine this scenario:

  1. A full-text select query is issued that looks like it’ll finish extremely quickly
  2. The SQL Server doesn’t see much activity in the full text table (only selects, not inserts/updates/deletes) so it kicks off a merge
  3. The select query doesn’t finish as quickly as the engine expected, and the merge can’t start until it obtains the locks it needs
  4. More full-text queries come in (could be selects, inserts, updates, deletes) that need to obtain locks

Those newly issued queries in step 4 are suddenly delayed while waiting for query #1 to finish.  The impact on full text performance varies depending on how long it takes query #1 to finish – might be milliseconds, might be seconds.

Denis Churin, Nikita Shamgunov and Sundaram Narayanan (the Microsoft heroes) suspected this might be our performance problem at StackOverflow, so they had us grab a memory dump and a database backup at the exact moment we were having performance problems.  They looked at the memory dump and isolated a single particular full-text query that was confusing the engine.  The engine was building an execution plan for it that didn’t work well, and instead of taking milliseconds, it took seconds (as many as 50 seconds).  During that time, performance went into the toilet.

They rewrote the query in a different way the engine would analyze better, and when that query was changed, presto, the performance problems disappeared.

There’s a QFE coming in a few weeks that will let the merge thread run without blocking other queries, but for now, we’re in good shape.

Diagnosing The Problem in Your Environment

Sundaram gave us this query to help troubleshoot when a long-running query is blocking other queries.  This helps identify the issue when a full-text select query suddenly blocks the SQL Server 2008 full text merge thread, and you can look at that query to make it run faster.  I haven’t tested this in depth, but these guys have proven to be much more qualified than me, so I have a hunch it’ll work, heh:

declare @temp int
declare @parent int
declare @final int
set @parent = 0
while (@parent = 0)
BEGIN
select @parent=blocked from sys.sysprocesses where lastwaittype=’LCK_M_SCH_M’ and waittime > 30000
WAITFOR DELAY ’00:00:01′;
END

WHILE (@parent <> 0)
BEGIN
set @final = @parent
select @temp=blocked from sys.sysprocesses where spid = @parent
set @parent = @temp
END
select * from sys.sysprocesses where spid = @final
select * from sys.dm_exec_requests ER
CROSS APPLY sys.dm_exec_sql_text(ER.plan_handle) ST
where ER.session_id = @final
select * from sys.dm_exec_requests ER
CROSS APPLY sys.dm_exec_query_plan(ER.plan_handle) ST
where ER.session_id = @final

And now, finally, I can start performance tuning that server!

Previous Post
Moved to Michigan and got older
Next Post
Tony Davis disses StackOverflow.com

31 Comments. Leave new

  • So NO downgrade to SQL Server 2005 took place? So, SO sticks with SQL Server 2008?

    Reply
  • I knew you could do it!

    Reply
  • I knew Microsoft AND you can do it, my faith in SQL 2008 has been restored 😀

    Reply
  • Mostly Microsoft. 🙂

    Reply
  • Keith Robinson
    November 17, 2008 5:24 pm

    I believe I am having the same issue. How exactly should you run the query Microsoft gave you? I have the query running right now and it has been over 5 minutes and it is still running??

    Reply
  • It runs forever waiting for the issue to occur, and then it breaks when the issue occurs. If you want to force it to see what’ll happen, change this:

    waittime > 30000

    to this:

    waittime > 1

    Very quickly, you should get some results back. That 30,000 means 30 seconds (30,000 milliseconds). You can tweak that number to get results in your environment. They set it to 30 seconds for us because we were having some pretty long locking problems, like a minute or two, so 30 seconds was a safe number.

    Reply
  • Keith Robinson
    November 17, 2008 5:31 pm

    Thank you for the reply. I tried it as you explained and it does work like you said. Thank you very much. This should really help to troubleshoot our queries.

    Thanks again!

    Reply
  • Hi guys, useful post, was wondering does the indexed text get stored in RAM or hardisk, i have a small VPS 512mb RAM one, and wondered, would it hammer all its memory to use FTS on a 50,000 record table, or is the index stored in database and accessed that way? So memory is not affected in the sense its not stored in memory?

    Thanks

    Reply
  • Keith Robinson
    November 4, 2009 11:18 am

    In SQL 2008 the FTS are stored in tables and taken off disk. At least that is my experience. In SQL 2005 they are stored on disk.

    Reply
    • Keith – yes, they’re inside the database, but the database is on disk. If you have a big full text index, you have to be acutely aware of storage performance. It’s not like it’s cached in memory once and you’re done.

      Reply
  • Thanks so much guys that was helpful.

    Reply
  • Hi Brent – Could you confirm when this problem was fixed? We’re going to be moving to 2008 early next year from 2005 and we use full-text heavily. I’m guessing the fix is in SP1 but can’t seem to find the article explaining the problem.

    Thanks

    Reply
    • Steve – sorry, but I didn’t want to deploy the QFE in StackOverflow’s environment without testing, so we didn’t follow up on that. QFEs are quick fixes to address a specific problem, but they may cause other problems. If you’ve got any alternatives, you should avoid deploying QFEs in production. The guys tweaked the query instead, and it worked fine.

      I’d deploy a 2008 development server first, and do some stress testing there. You should be able to find out right away if your queries exhibit the convoy problem.

      Hope that helps!

      Reply
  • Thanks Brent – you’ve confirmed our potential problem. I was hoping for the easy answer. I’ll make sure we get the testing under way. We have a similar problem in 2005 with contention between FT Change Tracking and Transactional Replication on our front end DB machines.

    Reply
  • Brent,

    What RAID would you recommmend for iFTS?

    Thanks

    JL

    Reply
    • JL – well, if money’s no object, I recommend RAID 10 for everything. Unfortunately, money’s usually an object, so you have to decide where to invest your money based on the performance of your application. If 75% of your user queries are full text search, then it makes sense to put the iFTS data on RAID 10 even at the expense of data and log drives, but if that number is closer to 5-10%, you might want to invest more in your data or log drives. It’s a balance that requires careful investigation of the user queries, load patterns, and HA/DR needs (in terms of transaction log activity). Sorry that I don’t have an easier answer!

      Reply
  • Could we know if this was fixed in SQL Server 2008 R2?

    Reply
  • Yes, it’s fixed and released.

    http://support.microsoft.com/kb/958947

    Reply
  • Brent thanks for the find here. Would you be willing to direct me to some input of lucene.net vs. Ifts? I am a dba who has been very happy with ifts, but I have dev now pushing for something different mainly lucene. Can you give me some pros and cons of both. Thanks!

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.