Say you run an online store, and on each item’s page, you need to show related items that people purchased. Let’s take an Amazon page for my favorite rechargeable AA batteries:
In a perfect world, we would cache this data in the web/app tier – but back here in the real world, sometimes our developers build stored procedures to fetch this kind of data, and the stored procedure ends up getting called way too often.
To solve it, let’s build a caching layer into our stored procedure.
Here’s the pseudocode of how our stored procedure usually works:
I’m using a really simple stored procedure, but this approach works best when you have a complex stored procedure that does a lot of heavy lifting – calculations, querying big tables, aggregations, etc.
Instead of directly doing all the heavy work, check this out:
I’ve introduced a new table here – Cache.dbo.GetRelatedItems. I created a new database called Cache, and I keep it in simple recovery mode. I can use a totally different backup strategy for this database – perhaps not even backing it up at all.
The Cache.dbo.GetRelatedItems table has exactly the same columns that are normally returned by usp_GetRelatedItems, plus the input field. In this case, our stored procedure has an input field of ItemID, and it returns RelatedItemID and RelatedItemName, so the table would look like the one shown at right.
Rocket surgeon readers will note that I added an ID field to give the table something unique. They will also notice that I didn’t add a DatePopulated or DateCached field – depending on my business needs, I would probably just truncate this table every night. In the case of related items in an e-commerce store, I might only truncate it every week, and I’d want to do it right before a really low load period so that the cache could gradually refresh. This outright-truncation technique is less efficient for refreshing the cache, but it minimizes the locking required by deletes. In a caching setup, I’m worried about concurrency here.
When implementing a solution like this, I usually do a lot of A/B performance testing to find the right clustered index for the table. Typically each caching table has no non-clustered indexes, and has just one clustered index designed to produce the fastest range scans for the number of parameters for the stored proc. (Before somebody posts a comment asking for that strategy, no, I’m not blogging that, because it’d take me a day, and I’m lazy.)
If you choose not to back up the Cache database, your code should not rely on the existence of objects in it. It should start by checking to see if the Cache.dbo.GetRelatedItems table exists, and if not, create it. That way if you fail over to your DR site and the Cache database is empty, your queries won’t fail.
I’ve over-simplified the stored procedure a little, though – something actually has to populate the cache table. There’s two ways I could do it: externally, like a SQL Agent job or an SSIS process, or internally – inside the stored procedure itself. Let’s code that:
We start by checking the cache table for results, and if we don’t find any, we add them.
I’m cringing as I write this because I can hear the screams of performance tuners. Yes, I’m adding additional write load on the SQL Server – keep in mind that I only use this approach when I’m faced with:
- A very work-intensive but read-only stored procedure
- Called very frequently (hundreds or thousands of times per minute)
- Whose results change less than once a day (or where we’re not concerned about real-time accuracy)
- A business that needs immediate speed and can’t wait for developers to implement a caching layer
As soon as I deploy a solution like this and the business pain goes away, I immediately start working with the developers on a better long-term solution. This solution is an emergency band-aid to get the business up and running, but it still incurs load on the SQL Server for writing the caching results, getting locks, and running the stored procedure. This is when I start talking to the developers about caching in the app tier, and here’s my favorite resources on that:
Thanks for the informative blog.
Note, that two concurrent cache fills can cause the cache to be filled with duplicate results. This needs a concurrency strategy. Maybe just an app-lock.
You never need to use the pattern IF EXISTS (do some SELECT) do the SELECT. All you need to do is to remove the IF EXISTS clause, just perform the SELECT and check the returned @@ROWCOUNT to work out if you need to reload the cache. Also, need to rework the bottom half to avoid race conditions and duplicated values.
Yeah, exactly – my personal favorite is to populate the cache periodically via batch jobs rather than one-off stored procs, but I didn’t want to go into quite that much detail, heh.
Greg, I’m not sure I agree with your stmt regarding the pattern. If we just perform initial SELECT, and upon seeing @@ROWCOUNT is zero we then perform another SELECT, that’s two resultsets. I’m sure caller code can be adapted to handle this, but that may well be out of scope. Or am I wrong…is this one of those issues that used to be a problem a decade ago but no longer applies?
Brent, good article. I’ve worked at shops where we had success with this pattern, but implemented at the next higher tier of the framework. App checked if results existed for proc+parm1+parm2…, and if so returned them, else ran proc & cached results, then returned them. Separate asynch routine deleted old items from cache. Advantage of this was we could use one simple framework for all manner of procs. If trying to do this within SQL, we need a different cache table for each different type of resultset; not terribly difficult, but not nearly as simple.
Mike, that would indeed cause two select result sets.
You could get around it by using somethlng like:
DECLARE @Test INT;
SELECT TOP 1 @Test=1 WHERE Cache.dbo.field1 = @ItemId;
IF @@ROWCOUNT = 0
(The @Test=… should swallow the resultset.)
I have worked with similar logic to cache the results, but we used Tempdb to keep the cached results, and there was a separate async process to remove the cached results after an expiry time.
Brent Ozar needs to study the English language before writing again.
Bad news – I’ve got blog posts queued up and no English books on the shelf.
Hey Brent, this comment (and your reaction) made me laugh.
Hahaha, thanks sir. And for reference, I do have some English books on the shelf and queued up, heh.
My old man always says “never trust anyone that can spell… or type”.
Maybe old fashioned.
And with puttin it on Ramdisk, it’s the new style!
Very cool idea Brent. I was able to patch up your template with medicinal rocketry to check for the existence of the database (if not just get the data from the table like normal), then as you suggested checking for the table (creating it if needed). I’m currently running some tests with the Cache db on RamDisk vs. Internal disk arrays.
Not only cool but extremely fun
This is the driest humour I’ve seen in some considerable time. You’re hired
I had to do something like this – except with a function. It’s much harder with a function because you’re not allowed to make your function do any database writes. I had to jump through cache checking steps on all the procs using the function
Some readers might want to consider Oracle which has had native function/query result caching for a few releases. http://docs.oracle.com/cd/E11882_01/appdev.112/e25519/subprograms.htm#LNPLS00817
Just “consider Oracle”, huh? That’s your answer to a feature request? Change the entire database platform? Wow, that sounds reasonable.
No; asking SQLServer devoplement to implement function-result-caching would be another alternative. Of course, what you’re suggesting to “roll your own” is also possible, but getting it right in all cases can be tricky. BTW, Oracle is not the only vendor, MySQL has been providing result caching for longer than the Oracle.
Seems like this might be a viable place to use in-memory OLTP. The cache table could be in memory, giving it the speed that people would hope for in a cache.
Daniel – if the data is frequently cached, it would be in memory anyway. SQL Server automatically caches data that is frequently queried. Plus, the in-memory OLTP (Hekaton) would suffer from the same problem that regular tables would – it still only caches the raw data, not the query output, so it would have to rebuild the query output every time the query runs. I like your thinking though!
Amazing. This post is so good I had to read it twice.
Being the Cache DB probably small I wonder if it could be possible to put it on a RAM drive.