not my sock 24 Nov 2017

Fri, 24 Nov 2017:

Problems I Want to Precipitate

I have started to remember more details about problems than I do about code.

Code used to occupy a large amount of my work-time thinking roots - algorithms, data structures and architectural concepts. Over the last two years, the organization of my thoughts has shifted onto understanding parallels with the real life structure of the problems. I'm looking through my attic for my brain and finding new uses for all that I've already done, which makes a better part II.

So for a while I've been putting thought into revisiting ideas, to feed them through this new lens. Here are some of them which I'd like to work towards thinking more clearly about them over 2018. If any of you have one of these problems or would like to enlighten me about how you are solving them, I'd like to buy you a beverage of your choice.

First up, large corpus compression. At a previous job, we had a few million users' game worlds in blob form, compressed with SNAPPY. This of course was incredibly inefficient to store every user's world's stereotypical artifacts again and again. The fundamental problem was to split apart the common structures between multiple blobs into a common blob and implement a LZ77 variant which can entirely remove large chunks of the input data from the compressed form and merely reference the common blob range (i.e SNAPPY COPY instructions with -ve offsets). This is not very revolutionary, because you can find something similar for english text in a new compression algorithm like Brotli (see RFC7932 Appendix A). It is already possible to do something similar, when it comes to Druid segment files, to compress URL columns by a large fraction within the segment, while still being able to decode any random row out of the column.

But what I want to do with that core idea is to apply it for another data-set which has a large amount of repetition between records - DNA and RNA. The idea is to apply this sort of lossless compression models to FASTQ data, but to seed it with known sequences as a database for long-range LZ77. The core idea is to compress the million genomes (like the one VA is building with the Million Veterans Project) while making it much easier to lookup by a known (or relevant) sequence. The compression format offers a fast way to do 1000 base comparisons, if you could build a pre-set dictionary and map them while they are being ingested. This is not a trivial operation at scale, but if you can cross-cut through the compression algorithm for your search implementation, this offers a massive improvement on the IO requirement on a cloud system. In a very vague way of saving, the more data you have the less data you'll have per-person.

The basic search optimization problems are also interesting for this, since nobody is going to download a large warehouse of data to match a single chunk of ctDNA that they pulled out of somebody's blood, but it is a batch problem with interesting parallels to other plain alphabet problems. My scrabble word finding cheat scripts do prime-hashing, which would work for fragments shuffles (& also off-by 1 identifications), which compresses 8 letter combos into 1 long without order (i.e all combinations of 8 letters into the same long). As a performance engineer (& and not a scientist), finding patterns with a prime multiply Rabin Karp in an alphabet soup is exactly the as complex as an alighnment problem. But the difference between the toy problem and the real one is literally life and death.

Second up, fleeting value capture problems. There are some problems where the value of a data entry decays as more time passes, if nothing reacts to it. These show up for a large number of streaming real-time systems, though the real life problems tend to tolerate a fractional loss of information while the streaming platforms are built to be feeding a system of record without loss. The demand-supply curve in a number of systems tend to work in similar fashion, where the historical data is not particularly relevant except in aggregate, but the current values are exceedingly pointy problems. Reactions to discounts and surge pricing are extremely individual, if you could map the actual demand to the demand against 68719476736a price point, in real-time, that can drive decisions at a finer granularity while keeping the maximum number of people happy.

This might look like a new idea, but I've seen this used to great effect in Facebook games - the random rewards within that skinner box is not exactly random. However, the difference is that in a value-cost model, the supply-demand curve does not meet at the same place for all people who are making financial decisions. The micro-economics in the single party model (with the "house" monopoly handing out infinite inventory and the players putting money down, there is no per-unit-profitability, only non-recoverable-engineering costs from the artwork & code) is much more easy to model than the delayed elasticity of the supply curve in a two-party model like the new gig economy.

This brings up another point, when you have weakly interacting processor models, they do not support strong failure tolerance characteristics. Failure tolerance is one of those "can't do without" concepts in distributed systems. This is sort of the "UDP for audio streams" discussion, because failure tolerance often demands waiting for the failure to be corrected before moving on - in a number of these processing models, delaying a future operation to make sure you have processed the current one is a complete waste of time. If you had an ETA estimation system for cars, then there is absolutely no value in processing a five minute old record when a user has already left a car. These problems routinely crop up in geo-fencing problems, when sending location updates over a cell phone network - when position changes, cancel the previous queued request and send a new one.

68719476736 The weakly interacting processors tend to be implemented as reliable queues and retried/stateless micro-services, both of which are imperfect abstractions to the actual problem at hand, but very clearly look like what MPI with buffering would look like. The data flow model is easier to delegate to different teams to build, but it does not match the characteristics of a single system which can handle geographical data as a sort of cellular network handing off cars and rides to the next cell as a vehicle moves through the city. If San Francisco has 45,000 drivers operating in it and 150,000 passengers with their apps open, you might realize that the state distribution for those + all the people in a vehicle right now is not exactly a giant scale problem, even if it looks like a 50k HTTP calls/sec of traffic going over the wire when it rains.

Geo-spatial data is another crucial problem for me to reapply some optimization work I did with NPCs in a video game - basically, if you ask them to get coffee, they need to find the nearest coffee shop and walk there. This meant dividing up the city into QuadTrees and build an A* path routing over that space. However, after I started to poke about that class of problems, I ran into Hilbert curve numbers and Hex bins. A Hilbert curve converts a 2-dimensional space into a single number and there are a large number of ways to handle single dimensional data in indexes, which suddenly become useful when you dump your data into Druid or another SQL DB which handles integers much faster than a complex square root function. There are space filling curves for hexagons which form a much prettier picture, but the real advantage with Hex bins is that for a QuadTree for a given position, there are 8 adjacent grids to check, while the Hexagon only ends up with six adjacent blocks to check.

To put it together, Geo-spatial data mapped onto a weakly handed-off cellular processor model (i.e the fuzzy boundaries) tends to work very well in a distributed systems, without a significant cost of the failure tolerance required to perform a strong hand-off. This is something left over from my first job writing a call-handling application for Ericsson, where the cellular paging channels and how they handle a user who merely drives through cells over someone who's on active call driving through, while being able to get GSM audio packets correctly (also it is encrypted). The interesting part of that is using the first derivative of position computed for each bogey in motion & the accidental utility of knowing collision/occlusion math from old video games.

Over the last 15 years, I've gathered up a bunch of eclectic experience, which seem to be useful in several other places across the industry in different scenarios with their own additional trade-offs and complexities. Complexities I haven't considered and trade-offs I haven't explored.

So, if you're working on something like this, I'd like to pick your brain and understand what I don't know - expand my ignorance, if you will.

--
If you're not part of the solution ...

posted at: 18:17 | path: /fun | permalink | Tags: public, me, work

<	November 2017					>
Su	Mo	Tu	We	Th	Fr	Sa
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30