Tag Archive: fulltext

A Different World: Three Use Cases For Riak

Writing and deploying an application is pretty easy, there’s a process to follow that will get you most of the way there. The tricky part is getting the business logic down. But this isn’t about writing code, this is about what happens once you have an application out in the wild. The tricky part comes when you have to grow with your application. There are some tricky parts of keeping an application up and running. Things get complicated when you start trying to add complex functionality for your data. For those of us who use SQL Server there are a lot of options within SQL Server, but some of them require a seasoned DBA, expensive features, or both.

Durability

Making sure your data sticks around is important. Losing data can be catastrophic for a business. Even if you have a backup, it can take hours or even days, to restore your database and verify that everything is back up and running. That doesn’t even include the possibility that a backup is corrupted and you won’t be able to bring your database online. Or, worse yet, you could lose your SAN completely. Durability is important.

With SQL Server it’s possible to set up a cluster, mirroring, or replication to provide a second copy of your data for readability. These three solutions all require knowledge of some specialized features of SQL Server. Keeping each of these features up and running requires some knowledge, planning, set up, and monitoring. None of it is easy, some of it’s hard, and some features even need a lot of care and feeding to stay up and running. DBAs tell horror stories about hand feeding replication for days to get it up and running again after a catastrophic failure; it’s almost a badge of honor. Here’s the thing: none of this should be difficult. One way data durability (mirroring, transactional replication) is limited in its functionality – reads can happen everywhere, but writes can only happen in one place. If the master server goes down, it can be difficult to switch the master server around.

Durability problems plague DBAs every day, whether they know it or not. I spent a lot of time trying to solve my durability problems by configuring replication or log shipping and building complex application logic to support the possibility of multiple write locations. I later solved the problem by using Riak. One of the main benefits of Riak is the availability and fault tolerance it provides. That is: it’s durable as all get out. Data isn’t written to one place, it’s written to several places at once; writes won’t succeed unless more than one server responds that a write is successful. In a way, it’s like a really fancy version of database mirroring or SQL Server Denali’s Availability Groups feature that you can have right now.

When a server in a Riak cluster eventually fails, as hardware always does, you could recover it using a filesystem backup. Riak will handle making sure everyone has the right data by using a very cunning technique called read repair. If the server completely fails the other option is to simply replace it. You tell the cluster of servers that one server is gone and another server is taking its place. The cluster then figures out how to perform recovery or redistribute the data. It’s a lot less painful than shipping full database backups or replication snapshots across the data center or even across the country.

Latency

Relational database, and SQL Server in particular, use B+trees indexes to store data. B+trees provide reasonably fast look ups of data (any lookup of a single row will need to traverse the same number of pages on disk as any other lookup). By contrast, inserting data into a B+tree is not always particularly fast – page splits may occur, forcing data to be shuffled around on disk and moving things out of order could require intermediate pages to be updated. Because of the B+tree structure, many SQL Server DBAs advocate using an ordered, constantly increasing key for clustered indexes.

Riak, by contrast, defaults to using the Bitcask storage back end (with the option to use several others). Bitcask is unique in that it doesn’t use a B+tree to store data on disk. Instead Bitcask uses a pair of data structures – a series of data files (written as log-structured hash table) and an in memory directory of keys (a keydir) to make it easy to find records in the log-structured hash table. Reading data out of Bitcask will take two hops – one for the keydir and one for the data file – versus many potential hops in larger SQL Server tables (probably around 4 physical hops with a cold cache on a large-ish table).

There are two things that immediately stand out to me about Riak’s data latency. One is that Riak’s latency is predictable. Any write is going to take as much time as it takes to stream the data to disk. Since Bitcask is a log-structured hash, there’s no possibility of fragmentation; data is written in the order it arrives. Like some other databases, Riak does not perform in place updates of data. Instead, a new record is written in the log-structured hash table and the keydir is updated with the location of the new record. Because of this, it’s easy to predict how long an insert or update will take – as long as it takes to write that many bytes of data to disk.

Just as Riak provides a predictable low latency for writes, it also provides low latency for reading data from disk. Riak doesn’t use locking, so there can be no blocking. Instead read latency boils down to the amount of time that it takes to pull data off of disk. The more data there is in a given record, the slower the read is going to be. Obviously, there’s some seek time involved, but it’s negligible when you consider that a seek involves a single memory read and a single disk read.

Full Text Search

SQL Server’s full text search has caused a lot of problems for a lot of smart people. The more complex the schema and the more data load involved, the more likely that SQL Server’s full text search is going to have some problems. It’s a great tool, but there are some limitations and tuning full text queries is very different from tuning regular T-SQL queries.

A problem around full text search is the inability to scale the full text search service independently of the SQL Server. They’re tied together on the same physical instance. If you need to increase the performance of full text search, you must resort to increasing the performance of SQL Server, including the same licensing costs for SQL Server. Anyone how has gone from a 2 socket to a 4 socket machine can tell you that a doubling in license costs isn’t trivial. Full text search can, like many things, be moved to a separate SQL Server using replication, but SQL Server’s transactional replication has a reputation for being a bit manpower intensive. Many teams eventually move full text search to something other than SQL Server. Pushing full text search outside of the database engine frees up the database engine to serve other queries. Some people, like the fine folks at StackOverflow, have used Lucene.net. One of the advantages of using a full text search engine is that it offers phenomenal flexibility. Unfortunately, both SQL Server’s full text search Lucene and limited to a single node. Riak Search is Lucene compatible, flexible, and durable.

Riak Search automatically indexes documents whenever they’re saved in a specific bucket, just like SQL Server. Server side triggers are created to asynchronously index data as it is saved. Unlike SQL Server’s full text search, it’s possible to create custom functionality for searching. Different word breakers and tokenizers can be created to support different methods of indexing. Riak Search also allows complex documents to be indexed using custom schemas, much like Lucene.

The icing on the cake, for me, is that Riak Search can be used to provide linear scaling – as servers begin to run out of capacity, additional nodes can be brought online to quickly scale out. The upside of Riak Search is that there’s also durability built-in: the search indexes are spread across the cluster. Spreading indexes across the cluster, term partitioning, means that load from complex queries can be served by many nodes at once making it possible to provide overall higher query throughput on large data sets.

Making It Work

It’s fair to say that many people using SQL Server are also using the .NET Framework. While Riak is implemented in Erlang and runs on Unix based operating systems, it’s easy to work with SQL Server and Riak together in the same environment. CorrugatedIron is a community developed, open source, .NET library for Riak. While it’s still under heavy development pending Riak’s 1.0 release later this month, CorrugatedIron makes it possible to develop cross database functionality to save data in SQL Server and Riak without requiring developers to learn a new programming language or operating system. Functionality can be moved out of SQL Server and onto a distributed platform where functionality can be scaled horizontally and linearly.

This isn’t to say that I’m advocating for teams to abandon SQL Server in favor of Riak. In fact, that’s the opposite of what I want to say. Teams should pick a best of breed solution for their data storage needs. SQL Server is a great relational database, but there are some things it doesn’t do as well as we’d like. Riak is a great distributed database that is impervious to single node failure. They both provide different features and functionality that complement each other very well. Riak removes single points of failure, provides rich full text search functionality, and provides consistent latency guarantees. SQL Server provides rich querying semantics on high structure data and support for atomic transactions.

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

Website - Twitter - Facebook - More Posts

Sargability: Why %string% Is Slow

People love to search.

Google has us addicted to fast, easy search functions.  Users expect every application to have a built-in blazing-fast search functionality.  To pull that off, developers build search queries that let users enter a string, and we ask SQL Server to find matches.  For example, say our users need to find some nuts.  We take their string, put percent signs on either side, and then search for that field in the product names:

SELECT *
FROM AdventureWorks.Production.Product
WHERE [Name] LIKE '%nut%'

This works great on our machine – but when we scale it up to hundreds or thousands of simultaneous users, or if we ramp up the number of products, our response time sucks.  Let’s find out why by checking the execution plan.  In SQL Server Management Studio, click Query, Display Estimated Execution Plan, and we get:

Query Execution Plan

Query Execution Plan

There’s only one operation, which would sound like a good thing, but that one operation is a clustered index scan.  That means SQL Server has to read every row out of the Product table, check to see whether it’s got “nut” anywhere in the name, and then return our results.

But wait – there’s an index on Production.Product for the Name field, and SQL Server will use that index if we have a slightly different query:

Less Wildcards, Same Nuts

Less Wildcards, Same Nuts

If we take the leading % wildcard out, notice that this time it does two operations.  First, it does an index seek against our Name index, then it looks up the corresponding rows in the base table (Product).  Two operations might sound more expensive, but if we looked at the total cost for each query, the second one would be faster in situations where we’re only pulling a small number of search results back.  If this table was for a company called Nothin’ But Nuts, on the other hand, we would probably still need to scan the entire table, but that’s a discussion for another day.

So why doesn’t SQL Server use the index for the %nut% query?  Pretend for a moment that you held a phone book in your hand, and I asked you to find everyone whose last name contains the letters HAM.  You would have to scan every single page in the phone book, because the results would include things like:

  • Beckham
  • Chambers
  • Hamilton
  • Thames

If, on the other hand, I asked you to find everyone whose last name began with the letters HAM, you could turn straight to the H page in the phone book, scan through a few lines, and quickly get the information you need.

When I asked you for everyone beginning with the letters HAM, my query was sargable.  When I asked you for all last names containing HAM anywhere in the name, my query was not sargable – meaning, you couldn’t leverage the indexes to do an index seek.  (Yes, sargable is sort of a real word – it stems from the phrase Search Arguments.)

Since we can’t talk users out of using search queries, and since non-sargable queries don’t scale, we need to find a better way to search for text.  That’s where SQL Server’s Full Text Search comes in.  Unfortunately, your queries must change – normally when you add an index, your queries just magically become faster.  Full text indexes don’t work that way.  You have to tweak your queries to use operators like CONTAINS instead of LIKE.  If you’re dealing with an old boat anchor of a database and the developers are long gone, you might be out of luck.

%String% is just one thing that will cause SQL Server to do slower scans instead of high-performing index seeks.  To learn more about sargability, check out:

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

StackOverflow’s SQL 2008 FTS issue solved

Whew – the Microsoft folks really came through on this one and tracked down the problem pretty quickly.  I want to personally thank Denis Churin, Nikita Shamgunov and Sundaram Narayanan for their detailed investigation into the issue and helping us get it fixed. I’m going to explain the issue to help anybody else get through it in case they run into similar problems.

How SQL 2008′s Full Text Search is Different than 2005

Let’s start with the basics: in SQL Server 2005, the full text data lived outside of SQL Server and wasn’t subject to transactional locking.  If you inserted a gazillion records into a table that had a full text index on it, the indexes were rebuilt without worrying about simultaneous inserts and locking.  It wasn’t like a traditional SQL Server table.

In SQL Server 2008, the full text indexes were moved inside the database server and became subject to transactional locking.  By default, the indexes are updated automatically whenever SQL Server determines they need to be updated.  When that merge process happens, SQL Server needs to obtain some locks on the indexed table.  Ideally, it grabs the locks when there’s a brief period of no load, does its merge work, and lets go of the locks.  The merge process can be quite brief (well under a second) as long as the amount of data hasn’t changed dramatically.

In a heavily transactional environment when there’s a whole lot of inserts/updates/deletes on the indexed table, the DBA may need to restrict those merge activities to only a certain time window of the day.  SQL Server 2008 gives you that ability to do merges manually, but I wanted to avoid that on StackOverflow.com.  Every time I put in a manual job into a solution, it requires manual maintenance, alerting and corrective actions, and I don’t take that lightly.

Where The Problem Comes In: Convoys of Queries

Imagine this scenario:

  1. A full-text select query is issued that looks like it’ll finish extremely quickly
  2. The SQL Server doesn’t see much activity in the full text table (only selects, not inserts/updates/deletes) so it kicks off a merge
  3. The select query doesn’t finish as quickly as the engine expected, and the merge can’t start until it obtains the locks it needs
  4. More full-text queries come in (could be selects, inserts, updates, deletes) that need to obtain locks

Those newly issued queries in step 4 are suddenly delayed while waiting for query #1 to finish.  The impact on full text performance varies depending on how long it takes query #1 to finish – might be milliseconds, might be seconds.

Denis Churin, Nikita Shamgunov and Sundaram Narayanan (the Microsoft heroes) suspected this might be our performance problem at StackOverflow, so they had us grab a memory dump and a database backup at the exact moment we were having performance problems.  They looked at the memory dump and isolated a single particular full-text query that was confusing the engine.  The engine was building an execution plan for it that didn’t work well, and instead of taking milliseconds, it took seconds (as many as 50 seconds).  During that time, performance went into the toilet.

They rewrote the query in a different way the engine would analyze better, and when that query was changed, presto, the performance problems disappeared.

There’s a QFE coming in a few weeks that will let the merge thread run without blocking other queries, but for now, we’re in good shape.

Diagnosing The Problem in Your Environment

Sundaram gave us this query to help troubleshoot when a long-running query is blocking other queries.  This helps identify the issue when a full-text select query suddenly blocks the SQL Server 2008 full text merge thread, and you can look at that query to make it run faster.  I haven’t tested this in depth, but these guys have proven to be much more qualified than me, so I have a hunch it’ll work, heh:

declare @temp int
declare @parent int
declare @final int
set @parent = 0
while (@parent = 0)
BEGIN
select @parent=blocked from sys.sysprocesses where lastwaittype=’LCK_M_SCH_M’ and waittime > 30000
WAITFOR DELAY ’00:00:01′;
END

WHILE (@parent <> 0)
BEGIN
set @final = @parent
select @temp=blocked from sys.sysprocesses where spid = @parent
set @parent = @temp
END
select * from sys.sysprocesses where spid = @final
select * from sys.dm_exec_requests ER
CROSS APPLY sys.dm_exec_sql_text(ER.plan_handle) ST
where ER.session_id = @final
select * from sys.dm_exec_requests ER
CROSS APPLY sys.dm_exec_query_plan(ER.plan_handle) ST
where ER.session_id = @final

And now, finally, I can start performance tuning that server!

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts