Thu, 22 Dec 2011:

Divide and conquer. That's how the web's scaling problems have always been solved.

Split the system into tiers, and each tier scales out horizontally for a while. You scale the tiers and everything works. But sooner or later you end up with a different problem - latency. The system gets choked by its interconnects, and the magnitude of the problem is just mind-boggling.

Let's take a random example - imagine about 500 memcached servers, 1000 web nodes and 64 processes on each node. Simple back-of-the-envelope math (500 x 1000 x 64) turns that into 32 million persistent connections open at any given time. Yes, I'm assuming the worst-case scenario, where every process holds a connection to every memcached server - but only because that's what's in production in most places.
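The arithmetic behind that figure, as a quick sketch. It assumes the worst case described above: every process on every web node keeps one persistent connection open to every memcached server. The variable names are mine.

```python
# Worst case: every web process keeps one persistent connection
# open to every memcached server.
memcached_servers = 500
web_nodes = 1000
processes_per_node = 64

# Connections across the whole cluster.
total_connections = memcached_servers * web_nodes * processes_per_node

# Connections that each single memcached server has to hold open.
connections_per_memcached = web_nodes * processes_per_node

print(f"{total_connections:,} connections in total")
print(f"{connections_per_memcached:,} connections per memcached server")
```

The per-server number is the one that hurts: each memcached box carries 64,000 mostly idle connections.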

The real problem is that the preferred scale-out approach for a web tier is a round-robin or least-connection based fair distribution of requests. That works great for read-heavy workloads where the key distribution is not tied to a user session. But if you end up operating on only one user's data per request, the wastefulness of this approach starts to become evident: any web node may serve any user, so every web node needs a connection to every data node.

What I really want is to pick a web node which is currently optimal for this particular user's request. The traditional approach is to pick a node, route all subsequent requests to that node and hope that I can do at least a few stale reads off the local cache there. I want to go further and pick an optimal web node (network-wise) for this user session. Since the data layer gets dynamically rebalanced and failed nodes get replaced, the mapping is by no means static. And that is not the only issue: strict pinning might create a hotspot of web activity that brings down a web server.
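A minimal sketch of what "optimal, network-wise" could mean. Everything here is an assumption for illustration: users are mapped to data shards by a crc32 hash of a per-user key, someone measures latency from each web node to each shard, and the function names and latency numbers are made up.

```python
import zlib


def shard_for(uid, shard_count):
    """Map a user id to a data shard by hashing a per-user key."""
    return zlib.crc32(f"{uid}:blob".encode()) % shard_count


def pick_web_node(uid, shard_count, latency_ms, healthy):
    """Return the healthy web node with the lowest latency to the user's shard.

    latency_ms[shard][node] is a measured round-trip time in milliseconds.
    Returns None when no healthy node has a measurement for the shard.
    """
    shard = shard_for(uid, shard_count)
    candidates = {n: ms for n, ms in latency_ms[shard].items() if n in healthy}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Made-up measurements for two shards and three web nodes.
latency_ms = {
    0: {"app1": 0.3, "app2": 1.9, "app3": 2.2},
    1: {"app1": 2.1, "app2": 0.4, "app3": 0.6},
}

node = pick_web_node(42, 2, latency_ms, healthy={"app1", "app2", "app3"})
print(node)
```

Because the table is rebuilt from fresh measurements, a rebalanced shard or a replaced node simply shows up as a different answer the next time around. Nothing is pinned forever.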

The solution is to repurpose stateless user pinning as used by HAProxy to let the web tier rebalance requests as it pleases. We plan on hijacking the cookie mechanism in HAProxy and setting the cookies from the web servers themselves instead of injecting them from the proxy.

Here's how my haproxy.cfg looks at the moment:

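backend app
	balance roundrobin
	# "preserve" lets the servers set SERVERID themselves
	cookie SERVERID indirect preserve
	# addresses are placeholders; substitute your own
	server app1 192.0.2.11:80 cookie app1 maxconn 32
	server app2 192.0.2.12:80 cookie app2 maxconn 32
	server app3 192.0.2.13:80 cookie app3 maxconn 32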

That's pretty much the complicated part. What remains is merely the PHP code to set the cookie. The following code runs on each app node (so that "app2" can set the cookie to "app3" if needed).

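// Same hash as used for the memcache key.
$h = crc32("$uid:blob");

// Candidate web nodes for this user's data shard; the first entry is used.
$servers = $optimal_server[($h % $shard_count)];
$s = $servers[0];

// Must match a "cookie" value on a server line in haproxy.cfg.
header("Set-Cookie: SERVERID=$s");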

As long as optimal_server (chosen by, say, ping-time < threshold or, well, EC2 availability zone) is kept up-to-date for each user data shard, this will send all requests to the server named in the cookie until its maxconn is reached. It falls back to round-robin when no cookie is provided or the machine is down or out of capacity. HAProxy even holds the connection and reroutes it to a different node on failover instead of erroring out.
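To make that routing behaviour concrete, here is a toy model of it. This is only a sketch of the behaviour as I've described it, not HAProxy's actual implementation (which also queues requests and has its own timeouts); the Backend class and its methods are made up for illustration.

```python
from itertools import cycle


class Backend:
    """Toy model of cookie pinning with round-robin fallback."""

    def __init__(self, servers, maxconn):
        self.maxconn = maxconn
        self.active = {name: 0 for name in servers}  # open connections per server
        self.up = {name: True for name in servers}   # health state per server
        self._rr = cycle(servers)                    # round-robin order

    def _available(self, name):
        # Unknown cookie values count as unavailable.
        return self.up.get(name, False) and self.active[name] < self.maxconn

    def route(self, cookie=None):
        """Pick a server for one request; None if every server is full or down."""
        if cookie is not None and self._available(cookie):
            choice = cookie
        else:
            choice = None
            # Try each server at most once, in round-robin order.
            for _ in range(len(self.active)):
                candidate = next(self._rr)
                if self._available(candidate):
                    choice = candidate
                    break
        if choice is not None:
            self.active[choice] += 1
        return choice


backend = Backend(["app1", "app2", "app3"], maxconn=32)
pinned = [backend.route(cookie="app2") for _ in range(32)]  # fills app2
overflow = backend.route(cookie="app2")  # app2 is full, so round-robin
backend.up["app1"] = False               # simulate a dead machine
rerouted = backend.route(cookie="app1")  # pinned to a dead node
print(pinned[0], overflow, rerouted)
```

The cookie is only a hint: the first 32 requests stick to app2, and everything after that degrades to plain round-robin over whatever still has capacity.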

And nobody said you had to do this for every user - do it only for a fraction that you care about :)

“Many are stubborn in pursuit of the path they have chosen, few in pursuit of the goal.”
            -- Friedrich Nietzsche

posted at: 10:55 | path: /hacks