Sunday, February 12, 2012

A basic shared-nothing data sharding system

There's a lot of buzz about sharding data. Today, I'll provide a very brief overview of how sharding helps systems I manage run more efficiently and how we're addressing keeping individual shards balanced.

The goal of sharding data in our environment involves: 1) make the structure of the data consistent across all the shards, 2) dividing data up so it can be found easily, 3) automatically and continuously re-balance the shards, and 4) allow for changes in scale (like adding a new shard or different shard sizes).

Item 1 is a snap - all we do there is to deploy the same data structures in each of the shards with all the supporting data required to answer questions related to a user. Some of this data is user-specific, some is globally replicated. In any case, this goal makes it easy to use one set of code to access data in any of the shards without having to cross to another shard or database to get the answer for a question. This reduces workload in the application and on other database servers.

Item 2 is done by hashing our key data. Let's say that we have a set of widgets that users are concerned with. Some users have a few widgets, some have a lot, but each user is very different from another. Widgets are pretty common and well defined. Each user has a user ID and any question we ask the system always involves a specific user ID. So - our key data we hash against in this case would be the user ID. Data about the widgets is replicated to all the shards, but data about each user is only kept on the shard where that user's data lives.

Item 3 is handled by a separate process that utilizes the same API the application uses. Balancing the data between shards is simple - the balancer asks the API if there are any users that need to move. If yes, the balancer lets the API know to lock that user temporarily, moves the data, then unlocks the users for use on the new shard. What this means for applications is each time a location is returned for a specific user, that location is only guaranteed for a given window of time (30 seconds for example). So - when the balancer tells the API it's moving a user's records, any requests for that user's records are held up until the user's data is moved. The API is smart enough to only let the balancer move data that has not been accessed recently. This doesn't prevent all lock collisions, but it handles most of them.

Item 4 is handled through the configuration of the API. Because we use an API to tell the application where the data is for a given user, we've abstracted away where data actually lives. This makes it easy to add and remove servers from the sharding pool. We've extended this to include allowing a shard to be marked as in a draining state. When a shard is draining, the API will ask the balancer to move rows from the draining shard and redistribute that information onto other members of the sharding pool. This makes it possible to take a shard out of rotation for routine maintenance without the loss of data.

Notice that I didn't mention any specific software here. I didn't tell you what language the application is written in, what language the API is written in, or what the actual data store was. The technique of sharding data is pretty simple and can be done with nearly any persistence layer using any programming language.

The beauty of this system is that once the API is written, the balancer can be a complete "black box" to the application. This type of system could be implemented with a data store when just starting out and be expanded to multiple stores as the need expands. Also - sharding key needs to change, again, the application doesn't need to change - just the API and the balancer.

One other big benefit to sharding data like this - it's often a lot cheaper to buy several smaller systems than to buy and maintain one very large system. If one of the systems in the sharding pool goes off-line, the worst possible exposure in a shared-nothing sharding system is the data stored on the member that went down. In a monolithic system, you stand to lose a lot more.

While I wouldn't suggest trying to do this type of work on top of every data set out there, I do see that there is a lot of benefit when the types of questions being asked of a data set can be divided up easily while still making it relatively easy to answer the "question at hand" from a single source. The secret in the sauce is making sure that any common data is shared among all the systems in the pool.