So, there I was debugging what looked like a memory leak in APC - a perfectly straightforward bug at first glance. APC's internal allocator was leaving around a bunch of 40 byte fragments all over the place. The fragments were literally killing APC allocate and deallocate performance - with nearly 85k fragments lying around in the 128Mb cache that www.php.net uses. Even though the allocator is a first-fit based system, it still has to traverse a large number of blocks to locate the previous free block to free any particular allocated block.
Basically, it was having serious issues with memory performance. This had something to do with one of the changes I'd put into APC-3.0.13 - canary checks. The canary essentially increased the memory header size by one size_t exactly. This broke the default word alignment on x86, but I thought I had all bases covered when I put in the approriate word aligns.
24 Bytes: Now, the default allocation size in APC is 24 bytes on x86. That is 12 bytes (3 x sizeof(size_t)) plus padding to make it a multiple of 8 coming to a total of 16 bytes. Then put in the data area (say, 1 byte), which is padded up to 8 bytes. Add all of it together and the smallest block APC can allocate is 24 bytes.
Due to some strange quirk of code, 40 bytes seems to be a very unpopular size to allocate. The allocations for 17-24 bytes of data goes into the 40 byte block and for some strange reason that seems really rare. I ran through a bunch of the standard tests I run with APC to get some sane statistics out of it. After running through hundred odd random tests from the standard phpt files, I got a pile of data. Twenty minutes later I had pulled that data together into a rough histogram (which is nearly the same thing as a bar chart for discrete data, I suppose) by printing out SVG and applying styling in inkscape.
Maybe I'm just hooked into drawing pretty graphs. But it clearly indicates what is going wrong. There are not enough 40 byte allocations to consume all the spare chunks being created. But, is this not true for the 80 or 96 blocks, you might ask. Unlike the 40 byte block both 80 (32 + 48) and 96 (48 + 48) byte blocks are easily consumed by requirements for smaller blocks. The 40 byte block on the other hand cannot be split into any smaller block because it is smaller than 24x2.
Thus due to the lack of demand and the inabilty to compromise (*heh*), the 40 byte blocks remain unwilling to accept any commitments. Until a memory block nearby is free'd the block will sit around waiting for someone to allocate 40 bytes - which as the pretty graph shows, is not a popular choice.
Now to sleep on this problem and hope I wake up with a solution - clear and perfect.
--If it breaks then you get to keep both pieces.
-- Warranty disclaimer for the chat program.