< October 2013 >
   1 2 3 4 5
6 7 8 9101112
Sun, 20 Oct 2013:

This is very definitely not about computer monitors. It is a story about data hashing, distribution and indeed re-distribution - about divisors, primes and more than all of that, downtime & costs.

In all honesty, I'm just posting a conversation from a coffee break almost two years ago (and some context to frame it). I'm writing this down now because it has reached the edge of my memory and is about to fall off.

One day, Prakash in the middle of coffee started to talk about the pains that powers of two cause. I assume, it was because one of the membase clusters (now called zBase) was running out of capacity and needed to be doubled in size.

Doubling in size was the only way our clusters grew, which was all well & good when the product was on its way up. It means you can double in size rather rapidly and without any real downtime. But it does cost a lot of money to keep doing after the first or second rehashing. Doubling made even more sense in a master-slave setup because the old cluster became the new masters and dropped half the data each, which would then be caught up by a new array of slaves which meant it would happen without any downtime for data copying and was a nearly no-risk operation.

This is relatively straight-forward, because the masters only have key invalidations happening during the re-shard. No data actually moves between masters during the reshard. This is brilliant because a few seconds after you flip the switch, the master array is ready to update and the slaves have a few minutes to catch up. But this is very expensive. Say, you had a 4 node cluster serving out data and eventually node1 runs out of space, then you need to add 4 more extra servers where otherwise only 2 would have been sufficient.

But this is primarily a one-way street. You cannot shrink a server cluster to half its size by this method.

Most startups never need to do that - but this kind of trickery needs to be done to optimize costs in a more mature product. And more relevantly, doubling a cluster to go from 8 to 16 to fix one node that's going out of capacity is so wasteful. If you are prepared to move a lot of data around, you don't necessarily need to double it - but that's the question, the downtime involved in moving the data is usually too much to justify that approach - but at least some of the data moving problems can be bypassed by more intelligent hash partitioning.

The most common hash partitioning method used is a modulo of the hash. Where you take the hash value and % by the number of servers to end up with a server where the data resides. But this has catastrophic consequences when rehashing data across multiple machines. Let us run through the simple scenario of data distributed across 10 machines and adding a 11th machine. So if we follow the hashing algorithms' approach, the only data that would sit still would be items with hash keys which are a multiple of both 10 and 11. For a uniform distribution of data that would approx 0.9% of the data.

Or in other words, with such a stupid hash partitioning mechanism, you'll end up moving 99.1% of data around to bring the cluster to order. So someone was bound to invent a method to work around the doubling problem.

When I was in Yahoo!, I ran into one of the most brilliant solutions to this problem. A mathematical, clean solution - but one which I notice hasn't even been patented. This is somewhat sad, but it shouldn't be lost to the world because it hasn't been published. Basically, it is a hash based solution that outputs a partition id instead of a hash value - basically, when you move up from a 10 node to 11 node cluster, the data only gets moved from all 10 nodes to the 11th node. And with a uniform distribution, each node only loses 10% of data.

There are tons of examples of this approach of consistent hashing, the easiest to use would be the Ketama hash. Or if you care about its origin story, you can see the genesis of this idea elsewhere. But the trouble really is that even with a consistent hash, you'll be moving around individual keys and value pairs to balance your data out. And there was no real clean way to drop data from the machine quickly without going through everything.

As with the revolution pallets caused in warehousing and containers did for shipping (or Quantum theory for physics), the idea of putting data into indivisble containers was the solution. On top of that, having partitioning be independent of machines was a rather easy way out of this particular problem in membase. Basically, bucket the data into a fixed number of partitions independently of actual machine count and then have a second order lookup map which actually tells you where that particular bucket is residing. This would be catastrophic if the clients went out of sync in this mechanism - but all a server had to do was say "Not Mine" when it sees an update or get to the wrong bucket. The client could then refresh the 2nd order mapping lazily without having to get a push update for each reconfiguration of the servers.

Since you move buckets between machines instead of keys and values, it is easy to proxy all updates to a given server into another with a filter on the bucket id (instead of failing - the failure is triggered after the move is finalized), avoiding downtime during the rehash. And you can just drop off the hash for the bucket once it has moved all data from the current machine to another. The locking, deallocation and movement of data now happens in a larger chunk as single transaction - a break in transaction does not leave either side in an inconsistent state. Just drop that vbucket in destination, pick another destination and serialize the in-mem hash for that vbucket for streaming it over.

In some human way, knowing which of those buckets have data/key-space/traffic allows a fairly involved user to actually micro-manage the resharding. And ideally, you can reshard down into a smaller cluster with this approach faster by hand than you can with the ketama, which is likely to make you go nuts since it is a spectrum of keys instead of being wrapped bundles. It is far easier to write a packing algorithm in 2 dimensions with such "rectangular" blocks - x being data size and y being traffic.

Now is when things start to get really interesting, from the Prakash angle. Almost everybody uses a power of 2 as number of these buckets. I remember that membase used to use 1024 by default and I talked to people at Aerospike recently, where they use the last 12 bits of the hash to determine virtual buckets (for the innumerate that is 4096 buckets). Now there is no fair way to distribute these numbers fairly across server counts, naively. According to Prakash, a number like 1440 would've been far nicer. And I did some grade school math to explore that. Basically, the number of fair distribution values below 1000.

1440 : [1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 30, 32, 36, 40, 45, 48, 60, 72, 80, 90, 96, 120, 144, 160, 180, 240, 288, 360, 480, 720]
1024 : [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] 

In essence, 1024 has 10 divisors and 1440 has 35. In the 1440 case there are 3x number of possible fair distributions of uniform data than in the 1024 case. The scenario is even better when you look at the small digit numbers in the distribution. The first 10 numbers for 1440 are 1, 2, 3, 4, 5, 6, 8, 9, 10, 12. All of which are fair configurations of bucket distributions, where as the 1024 case jumps straight from 8 to 16.

The real commentary here is on the lack of magic in 1024 (and 4096). Also slightly about people who pick prime numbers in this scenario - they work better for some key distributions, but lack the divisibility (duh!) that simplifies the distribution. This number choice is something you have to live with once you pick it, because changing it requires as massive a data movement as the original modulo distribution (because let's face it, the vbucket is a modulo distribution, with modulo picking the vbucket). And for those who picked a power of 2 already, I have to say that it is possible to acheive an ideal distribution of data/traffic irrespective of the actual number of vbuckets. When a server in a group runs out of capacity, just move the highest capacity vbucket in it out onto a new node. It is that simple - the default number only sucks in scenarios of uniform distribution and uniform traffic.

1024's just an imperfect number which will work for you in an imperfect world. Which I am told is often the case.

There is no safety in numbers, or in anything else.
  -- James Thurber

posted at: 23:20 | path: /hacks | permalink | Tags: , ,