Tag Archive: corrugatediron

A Different World: Three Use Cases For Riak

Writing and deploying an application is pretty easy, there’s a process to follow that will get you most of the way there. The tricky part is getting the business logic down. But this isn’t about writing code, this is about what happens once you have an application out in the wild. The tricky part comes when you have to grow with your application. There are some tricky parts of keeping an application up and running. Things get complicated when you start trying to add complex functionality for your data. For those of us who use SQL Server there are a lot of options within SQL Server, but some of them require a seasoned DBA, expensive features, or both.

Durability

Making sure your data sticks around is important. Losing data can be catastrophic for a business. Even if you have a backup, it can take hours or even days, to restore your database and verify that everything is back up and running. That doesn’t even include the possibility that a backup is corrupted and you won’t be able to bring your database online. Or, worse yet, you could lose your SAN completely. Durability is important.

With SQL Server it’s possible to set up a cluster, mirroring, or replication to provide a second copy of your data for readability. These three solutions all require knowledge of some specialized features of SQL Server. Keeping each of these features up and running requires some knowledge, planning, set up, and monitoring. None of it is easy, some of it’s hard, and some features even need a lot of care and feeding to stay up and running. DBAs tell horror stories about hand feeding replication for days to get it up and running again after a catastrophic failure; it’s almost a badge of honor. Here’s the thing: none of this should be difficult. One way data durability (mirroring, transactional replication) is limited in its functionality – reads can happen everywhere, but writes can only happen in one place. If the master server goes down, it can be difficult to switch the master server around.

Durability problems plague DBAs every day, whether they know it or not. I spent a lot of time trying to solve my durability problems by configuring replication or log shipping and building complex application logic to support the possibility of multiple write locations. I later solved the problem by using Riak. One of the main benefits of Riak is the availability and fault tolerance it provides. That is: it’s durable as all get out. Data isn’t written to one place, it’s written to several places at once; writes won’t succeed unless more than one server responds that a write is successful. In a way, it’s like a really fancy version of database mirroring or SQL Server Denali’s Availability Groups feature that you can have right now.

When a server in a Riak cluster eventually fails, as hardware always does, you could recover it using a filesystem backup. Riak will handle making sure everyone has the right data by using a very cunning technique called read repair. If the server completely fails the other option is to simply replace it. You tell the cluster of servers that one server is gone and another server is taking its place. The cluster then figures out how to perform recovery or redistribute the data. It’s a lot less painful than shipping full database backups or replication snapshots across the data center or even across the country.

Latency

Relational database, and SQL Server in particular, use B+trees indexes to store data. B+trees provide reasonably fast look ups of data (any lookup of a single row will need to traverse the same number of pages on disk as any other lookup). By contrast, inserting data into a B+tree is not always particularly fast – page splits may occur, forcing data to be shuffled around on disk and moving things out of order could require intermediate pages to be updated. Because of the B+tree structure, many SQL Server DBAs advocate using an ordered, constantly increasing key for clustered indexes.

Riak, by contrast, defaults to using the Bitcask storage back end (with the option to use several others). Bitcask is unique in that it doesn’t use a B+tree to store data on disk. Instead Bitcask uses a pair of data structures – a series of data files (written as log-structured hash table) and an in memory directory of keys (a keydir) to make it easy to find records in the log-structured hash table. Reading data out of Bitcask will take two hops – one for the keydir and one for the data file – versus many potential hops in larger SQL Server tables (probably around 4 physical hops with a cold cache on a large-ish table).

There are two things that immediately stand out to me about Riak’s data latency. One is that Riak’s latency is predictable. Any write is going to take as much time as it takes to stream the data to disk. Since Bitcask is a log-structured hash, there’s no possibility of fragmentation; data is written in the order it arrives. Like some other databases, Riak does not perform in place updates of data. Instead, a new record is written in the log-structured hash table and the keydir is updated with the location of the new record. Because of this, it’s easy to predict how long an insert or update will take – as long as it takes to write that many bytes of data to disk.

Just as Riak provides a predictable low latency for writes, it also provides low latency for reading data from disk. Riak doesn’t use locking, so there can be no blocking. Instead read latency boils down to the amount of time that it takes to pull data off of disk. The more data there is in a given record, the slower the read is going to be. Obviously, there’s some seek time involved, but it’s negligible when you consider that a seek involves a single memory read and a single disk read.

Full Text Search

SQL Server’s full text search has caused a lot of problems for a lot of smart people. The more complex the schema and the more data load involved, the more likely that SQL Server’s full text search is going to have some problems. It’s a great tool, but there are some limitations and tuning full text queries is very different from tuning regular T-SQL queries.

A problem around full text search is the inability to scale the full text search service independently of the SQL Server. They’re tied together on the same physical instance. If you need to increase the performance of full text search, you must resort to increasing the performance of SQL Server, including the same licensing costs for SQL Server. Anyone how has gone from a 2 socket to a 4 socket machine can tell you that a doubling in license costs isn’t trivial. Full text search can, like many things, be moved to a separate SQL Server using replication, but SQL Server’s transactional replication has a reputation for being a bit manpower intensive. Many teams eventually move full text search to something other than SQL Server. Pushing full text search outside of the database engine frees up the database engine to serve other queries. Some people, like the fine folks at StackOverflow, have used Lucene.net. One of the advantages of using a full text search engine is that it offers phenomenal flexibility. Unfortunately, both SQL Server’s full text search Lucene and limited to a single node. Riak Search is Lucene compatible, flexible, and durable.

Riak Search automatically indexes documents whenever they’re saved in a specific bucket, just like SQL Server. Server side triggers are created to asynchronously index data as it is saved. Unlike SQL Server’s full text search, it’s possible to create custom functionality for searching. Different word breakers and tokenizers can be created to support different methods of indexing. Riak Search also allows complex documents to be indexed using custom schemas, much like Lucene.

The icing on the cake, for me, is that Riak Search can be used to provide linear scaling – as servers begin to run out of capacity, additional nodes can be brought online to quickly scale out. The upside of Riak Search is that there’s also durability built-in: the search indexes are spread across the cluster. Spreading indexes across the cluster, term partitioning, means that load from complex queries can be served by many nodes at once making it possible to provide overall higher query throughput on large data sets.

Making It Work

It’s fair to say that many people using SQL Server are also using the .NET Framework. While Riak is implemented in Erlang and runs on Unix based operating systems, it’s easy to work with SQL Server and Riak together in the same environment. CorrugatedIron is a community developed, open source, .NET library for Riak. While it’s still under heavy development pending Riak’s 1.0 release later this month, CorrugatedIron makes it possible to develop cross database functionality to save data in SQL Server and Riak without requiring developers to learn a new programming language or operating system. Functionality can be moved out of SQL Server and onto a distributed platform where functionality can be scaled horizontally and linearly.

This isn’t to say that I’m advocating for teams to abandon SQL Server in favor of Riak. In fact, that’s the opposite of what I want to say. Teams should pick a best of breed solution for their data storage needs. SQL Server is a great relational database, but there are some things it doesn’t do as well as we’d like. Riak is a great distributed database that is impervious to single node failure. They both provide different features and functionality that complement each other very well. Riak removes single points of failure, provides rich full text search functionality, and provides consistent latency guarantees. SQL Server provides rich querying semantics on high structure data and support for atomic transactions.

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

More Posts - Website

Follow Me:
TwitterFacebook

CorrugatedIron v0.1.2 – Now with options!

Corrugated Iron, the C# library to working with Riak, isn’t terribly old – OJ and I released the first version two and a half weeks ago. In the mean time, we’ve fixed a few bugs but haven’t pushed out major enhancements. The minor changes that we’ve pushed out, though, have been in direct response to user feedback. This next version is no exception.

When we first put the library together, we made some very broad assumptions about how people were using the library. We assumed that everyone would be using Inversion of Control containers to do configuration – that turned out to be wrong. We assumed that everyone would not be using load balancers – that turned out to be wrong (fix coming in v0.2). And we assumed that everyone would be using JSON to serialize their data – once again we were wrong.

I like being wrong because it gives me the chance to re-think what I’m doing and learn something new. In this case, a user was looking for examples using a different way to serialize data. I provided some sample code and it only halfway worked; OJ and I had made assumptions about how users would be reading and writing data. Thankfully the developer pointed out the exact code that was in error and I was able to quickly turn around a patch and get it pushed out the door this evening. Once we identified the issue, the turnaround time was about 17 hours. Not too bad for a few weeks on the job.

To get the latest version of CorrugatedIron, head on over to the download page and grab the latest version.

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

More Posts - Website

Follow Me:
TwitterFacebook

Just One More Thing… Introducing CorrugatedIron

I like to share what I know. That’s why earlier this year, I contributed some code to the Riak function contrib. Since then, I’ve been quietly becoming an independent consultant, starting a business, and working on a big chunk of code. I’m proud to release to the world CorrugatedIron.

More NoSQL for .NET

If you’ve been paying any attention at all (please say you pay attention to me), you’ll have noticed that I like to find fun and interesting ways to use and abuse data. You’ve also noticed that a lot of my interest lies in Riak. When I discovered Riak, there was a very good Ruby library, Ripple, some Java and Erlang libraries, and two .NET libraries that seemed a bit dead in the water.

I saw a lot of promise in Riak – it’s a distributed key/value database and it’s incredibly fault-tolerant. But the downside for many developers, especially developers who work with the Microsoft stack, was that there was no good way to connect to Riak. What that really means is that there was no good way for me to bring Riak to the masses of Microsoft developers and IT pros. Luckily, I had a plan.

I Love It When a Plan Comes Together – Developing Corrugated Iron

Through the Basho folks, I got in touch with OJ Reeves. OJ had started work on a C# library for Riak. Well, he’d started in his head. We emailed back and forth (he has an interesting take on the exchange), and nothing happened. I started a company, he had barbecues and ate shrimp with people named Bruce. In April, we decided to stop slacking off and write some software.

We wrote code quickly and threw it away even faster. We wrote and broke unit tests daily. Interest grew and we decided that we needed a date. We chose July 25th – it’s the first day of OSCON and it was a hard and fast date to get software out the door.

Throughout the development process, OJ and I have joked about strong opinions held loosely. Change is good, challenge is good; having a healthy respect for each other is good. Working with a developer on another continent taught me a lot about evaluating ideas, careful communication, and my own skills as a developer. OK, some of that’s a lie. There were many emails, Skype chats, and pair programming sessions via Webex, but through it all we maintained a respect for each other and a willingness to build and tear down code as many times as necessary to make a feature work.

Where is CorrugatedIron Now?

The whole idea behind CorrugatedIron is to make it easier for .NET developers to use Riak in their applications. SQL Server does many things really well, but there are some things that RDBMSes just aren’t good for. CorrugatedIron opens up choices for .NET developers.

The documentation isn’t where we want it to be and the code isn’t tested as thoroughly as we’d like, but we have working samples. I put some working into building a Session State Provider for ASP.NET (a great use of Riak, by the way) and OJ wrote several sample applications and configuration samples.

What Does It Mean To Me?

I’ve enjoyed working on this project for a few reasons. It’s given me the chance to write code outside of SQL Server. I didn’t realize how rusty I was with C# until I started working on CorrugatedIron. After a few weeks I was right back into it and I’ve been slowly working my way through many of the topics that I missed over the last few years. The best way to learn something is to do it.

The other big reason, for me, is giving something back. I’ve used open source software a lot throughout my career. While I can’t give back directly to many of the projects I’ve worked on, I’m able to give back by writing code and sharing it with the world.

What Next? ###

If you want to get started with CorrugatedIron, or you think you might know a developer who’d like to experiment, grab the source, download some binaries, install the NuGet package, and write some code. This is open source, so if there are missing features, issues you’re running into, or problems you’re having, hit us up on github, fork the repository, and get contributing!

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

More Posts - Website

Follow Me:
TwitterFacebook