Thu, 01 Jun 2006:

Valgrind is one of the most common tools people use to debug memory. Recently, while debugging APC, my primary problem was PHP Zend code writing into shared memory without acquiring the required locks. I had been debugging that with gdb for a while, but gdb is just dead slow at watching writes to 16 MB of memory and generating backtraces.

The result of all that pain was a quick patch on valgrind 3.1.1. The patch logs all writes to a memory block, with backtraces. But valgrind, unlike gdb, has no terminal to type into midway, so the question was how to indicate a watchpoint. Valgrind magic functions were the answer: they can pass a parameter to valgrind during execution. This is a source hack, and a hell of a lot easier than actually breaking in gdb and marking a breakpoint every time you run it. So here's what the code looks like :-

#include <stdlib.h>
#include "valgrind/memcheck.h"

/* write into the watched block, triggering the watchpoint */
static void modify(int *k)
{
	*k = 42;
}

int main()
{
	int *k = malloc(sizeof(int));
	int x = VALGRIND_SET_WATCHPOINT(k, sizeof(int));
	modify(k);
	VALGRIND_CLEAR_WATCHPOINT(x);
	return 0;
}

That call is marked out in the compiled code by the following assembly fragment.

    movl    $1296236555, -56(%ebp)
    movl    8(%ebp), %eax
    movl    %eax, -52(%ebp)
    movl    $4, -48(%ebp)
    movl    $0, -44(%ebp)
    movl    $0, -40(%ebp)
    leal    -56(%ebp), %eax
    movl    $0, %edx
    roll $29, %eax ; roll $3, %eax
    rorl $27, %eax ; rorl $5, %eax
    roll $13, %eax ; roll $19, %eax
    movl    %edx, %eax
    movl    %eax, -12(%ebp)

This does nothing at all on a normal x86 CPU, but inside the valgrind executor it is picked up and delivered to mc_handle_client_request, where I handle the case and add the address and size to the watchpoints list.

So whenever a helperc_STOREV* function is called, the address passed in is checked against the watchpoints list, which is stored in the corresponding primary map of access bits. All of these bright ideas were completely stolen from Richard Walsh's patch for valgrind 2.x. But of course, if it weren't for the giants on whose shoulders I stand ...

bash$ valgrind a.out

==6493== Watchpoint 0 event: write
==6493==    at 0x804845E: modify (in /home/gopalv/hacks/valgrind-tests/a.out)
==6493==    by 0x80484EA: main (in /home/gopalv/hacks/valgrind-tests/a.out)
==6493== This watchpoint has been triggered 1 time
==6493== This watchpoint was set at:
==6493==    at 0x80484DB: main (in /home/gopalv/hacks/valgrind-tests/a.out)

Now I can actually run a huge-ass set of tests on php5 after marking the APC shared memory as watched, see all the writes, filter out the APC ones, and continue to copy the other written segments out into local memory for Zend's pleasure.

Writing software gives you that high of creating something out of nearly nothing. Since I am neither a poet nor a painter, there's no other easy way to get that high (unless ... *ahem*).

--
Mathematicians stand on each other's shoulders while computer scientists stand on each other's toes.
                -- Richard Hamming

posted at: 23:44 | path: /hacks | permalink | Tags: , ,

Some time in late 2002, I got to see a clear picture of what interpreter optimisation is all about. While I only wrote a small paragraph of the Design of the Portable.net Interpreter, I got a good look at some of the design decisions that went into pnet. The history of the CVM engine aside, more recently I started looking into the PHP5 engine's core interpreter loop. And believe me, it wasn't written with raw performance in mind.

The VM goes neither the register-VM nor the stack-VM way, thereby throwing away years of optimisations which have gone into either. Opcode parameters are passed between opcodes in each opcode's ->result entry, which is then used as the op1 or op2 of the next opcode. You can literally see the tree of operations in this data structure. As good as that is for clarity, it means that every time I add two numbers, I write to a memory location somewhere. I cannot, for example, persist some data in a register and have it picked up by a later opcode - which is pretty easy to do with a stack VM.

Neither did I see any concept of verifiability, which means that I cannot predict output types or make any assumptions about them either. For example, the following is the code for the add operation.

ZEND_ADD_SPEC_VAR_VAR_HANDLER:
{
    zend_op *opline = EX(opline);
    zend_free_op free_op1, free_op2;

    add_function(&EX_T(opline->result.u.var).tmp_var,
        _get_zval_ptr_var(&opline->op1, EX(Ts), &free_op1 TSRMLS_CC),
        _get_zval_ptr_var(&opline->op2, EX(Ts), &free_op2 TSRMLS_CC) TSRMLS_CC);
    if (free_op1.var) {zval_ptr_dtor(&free_op1.var);};
    if (free_op2.var) {zval_ptr_dtor(&free_op2.var);};
    ZEND_VM_NEXT_OPCODE();
}

Since we have no idea what type of zval is contained in the operands, the code has to do a set of conversions to number. All these conversions basically involve a conditional jump somewhere (aka an if), which is exactly what we're supposed to be avoiding to speed up.

Neither could I registerify variables easily, because there was a stupid CALL-based VM (which is flexible enough to allow some weird hacks by replacing opcodes) which throws away all variables in every scope. That's some serious stack space churn, which I can't force anyone to re-do. At least, not yet. So in spite of having a CGOTO core, there was hardly anything I could do without breaking the CALL core codebase.

Basically, after I'd exhausted all my usual bag of tricks I looked a little closer at the assembly thrown out by the compiler. Maybe there was something that wasn't quite obvious happening in the engine.

.L1031:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp .L1194
.L203:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp .L1194
....
L1194: 
    jmp *-2748(%ebp)

As you can clearly see, the jump target is yet another jump instruction. For a pipelined CPU that's really bad news, especially when the jump is so far away. So I wrote up some assembly to remove the double jump and convert it into a single one.

#ifdef __i386__
#define ZEND_VM_CONTINUE() do { __asm__ __volatile__ (\
        "jmp *%0" \
        :: "r" (EX(opline)->handler) ); \
    /* just to fool the compiler */ \
    goto * ((void **)(EX(opline)->handler)); } while(0)
#else
#define ZEND_VM_CONTINUE() goto *(void**)(EX(opline)->handler)
#endif

So in i386 land the jump is assembly code, marked volatile so that it will not be optimised or rearranged to be more "efficient".

.L1031:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
#APP
    jmp *%eax
#NO_APP
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp .L1194
.L203:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
#APP
    jmp *%eax
#NO_APP
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp .L1194

The compiler requires a goto to actually realize it has to flush all the stack params inside the scope. I learnt that a long time ago trying to do the same for dotgnu's amd64 unroller. Anyway, let's look at the numbers.

Before:                     After:
simple             0.579    simple             0.482
simplecall         0.759    simplecall         0.692
simpleucall        1.193    simpleucall        1.111
simpleudcall       1.409    simpleudcall       1.320
mandel             2.034    mandel             1.830
mandel2            2.551    mandel2            2.227
ackermann(7)       1.438    ackermann(7)       1.638
ary(50000)         0.100    ary(50000)         0.097
ary2(50000)        0.080    ary2(50000)        0.080
ary3(2000)         1.051    ary3(2000)         1.024
fibo(30)           3.914    fibo(30)           3.383
hash1(50000)       0.185    hash1(50000)       0.182
hash2(500)         0.209    hash2(500)         0.198
heapsort(20000)    0.616    heapsort(20000)    0.580
matrix(20)         0.500    matrix(20)         0.481
nestedloop(12)     0.953    nestedloop(12)     0.855
sieve(30)          0.499    sieve(30)          0.494
strcat(200000)     0.079    strcat(200000)     0.074
------------------------    ------------------------
Total             18.149    Total             16.750

This is in comparison to the default php5 core, which takes a pathetic 23.583 seconds to complete the tests. But there's more to the story. If you look carefully, you'll notice that there's a register indirection just before the jump. But x86 does support an indirect indexed jump with a zero index.

   __asm__ __volatile__ ("jmp *(%0)" :: "r" (&(EX(opline)->handler)));

That generates a nice jmp *(%eax), which is perfect for my purpose. Except that, as I can see in the assembly, the fix didn't really do much for performance. For example, look at the following code :-

    leal    -72(%ebp), %eax
    addl    $76, (%eax)
#APP
    nop
#NO_APP
    movl    -72(%ebp), %eax
#APP
    jmp *(%eax)
#NO_APP

The EAX load between the two custom asm statements is what I was trying to avoid. But the variable is re-loaded from the stack because there is no register variable cache for the handler pointer. One way around that is to do what pnet did: keep your PC (the equivalent of the handler var) in a register, preferably EBX, and use it directly. The separation between operands (stack) and operator (handler) makes it hard to optimise both in one go. The opline contains both together, making it really, really hard to properly speed up.

But there's this thing about me - I'm lazy.

--
Captain, we have lost entire left hamster section.
Now, pedal faster.

posted at: 17:55 | path: /php | permalink | Tags: , ,