Thu, 01 Jun 2006:

Some time in late 2002, I got a clear picture of what interpreter optimisation is all about. While I only wrote a small paragraph of the Design of the Portable.net Interpreter, I got a good look at some of the design decisions that went into pnet. The history of the CVM engine aside, more recently I started looking into the PHP5 engine's core interpreter loop. And believe me, it wasn't written with raw performance in mind.

The VM goes neither the register VM nor the stack VM way, thereby throwing away years of optimisations which have gone into either. Parameters are passed between opcodes via the ->result entry in each opcode, which is then used as the op1 or op2 of the next opcode. You can literally see the tree of operations in this data structure. As much as it is good for data clarity, it means that every time I add two numbers, I write to a memory location somewhere. For example, I cannot persist some data in a register and have it picked up by a later opcode - which is pretty easy to do with a stack VM.
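
Roughly, the data flow looks like this (a sketch with made-up struct and field names, not the real zend_op or znode):

/* Sketch only - hypothetical names, not the actual Zend structs -
 * showing how one opcode's result slot feeds the next opcode's
 * operand, always going through memory. */
typedef struct sk_node {
    int var;                    /* index into the temporaries table */
} sk_node;

typedef struct sk_op {
    int opcode;
    sk_node op1, op2;           /* inputs, read from the temp table */
    sk_node result;             /* output, written to the temp table */
} sk_op;

/* "$c = $a + $b" compiles to something like:
 *
 *   ADD    op1=$a, op2=$b, result=T0    T0 is written to memory
 *   ASSIGN op1=$c, op2=T0               T0 is read back from memory
 *
 * so the intermediate sum touches RAM instead of staying in a
 * machine register. */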

Nor did I see any concept of verifiability, which means that I cannot predict output types or make any assumptions about them either. For example, the following is the code for the add operation.

ZEND_ADD_SPEC_VAR_VAR_HANDLER:
{
    zend_op *opline = EX(opline);
    zend_free_op free_op1, free_op2;

    /* add_function has to discover both operand types at runtime */
    add_function(&EX_T(opline->result.u.var).tmp_var,
        _get_zval_ptr_var(&opline->op1, EX(Ts), &free_op1 TSRMLS_CC),
        _get_zval_ptr_var(&opline->op2, EX(Ts), &free_op2 TSRMLS_CC) TSRMLS_CC);
    /* drop refcounts on any temporaries fetched above */
    if (free_op1.var) {zval_ptr_dtor(&free_op1.var);};
    if (free_op2.var) {zval_ptr_dtor(&free_op2.var);};
    ZEND_VM_NEXT_OPCODE();
}

Since we have no idea what type of zval is contained in the operands, the code has to do a set of conversions to number. All these operations basically involve a conditional jump somewhere (aka an if), which is exactly what we're supposed to be avoiding to speed things up.
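
To illustrate (this is a simplified stand-in, not the actual add_function), an add over untyped operands ends up looking something like this:

/* Illustrative only - hypothetical types, not the real Zend zval.
 * The point is the per-operand type branch. */
typedef enum { SK_LONG, SK_DOUBLE } sk_type;

typedef struct {
    sk_type type;
    union { long lval; double dval; } u;
} sk_val;

static long sk_to_long(const sk_val *v)
{
    return (v->type == SK_LONG) ? v->u.lval       /* branch #1 */
                                : (long)v->u.dval;
}

static long sk_add(const sk_val *a, const sk_val *b)
{
    return sk_to_long(a) + sk_to_long(b);         /* two branches for one add */
}

With verifiable bytecode you could infer both operand types up front and specialise the opcode, compiling those branches away entirely.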

Neither could I registerify variables easily, because there was a stupid CALL-based VM (which is flexible enough to do some weird hacks by replacing opcodes) which throws away all variables in every scope. That's some serious stack space churn, which I can't force anyone to re-do. At least, not yet. So in spite of having a CGOTO core, there was hardly anything I could do without breaking the CALL core codebase.
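
For contrast, here's roughly what the two dispatch styles boil down to (made-up names, not the actual Zend macros):

/* Sketch of the two dispatch styles; all names are hypothetical. */
typedef struct sk_ctx sk_ctx;
typedef int (*sk_handler)(sk_ctx *ctx);

struct sk_insn { sk_handler handler; void *target; };
struct sk_ctx  { struct sk_insn *pc; };

/* CALL core: each opcode is a separate C function, so every
 * handler starts with a cold stack frame and nothing survives in
 * a register between opcodes. Easy to hack, though - you can swap
 * any handler pointer at runtime. */
static void sk_run_call(sk_ctx *ctx)
{
    while (ctx->pc->handler(ctx))
        ;
}

/* CGOTO core: one big function using GCC computed gotos, so the
 * compiler is free to keep hot state in registers across opcodes. */
static void sk_run_cgoto(sk_ctx *ctx)
{
    goto *ctx->pc->target;      /* target would hold e.g. &&op_add */
op_add:
    /* ... perform the add ... */
    ++ctx->pc;
    goto *ctx->pc->target;
op_ret:
    return;
}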

Basically, after I'd exhausted my usual bag of tricks, I looked a little closer at the assembly thrown out by the compiler. Maybe there was something not quite obvious happening in the engine.

.L1031:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp .L1194
.L203:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp .L1194
....
.L1194:
    jmp *-2748(%ebp)

As you can clearly see, the jump target is yet another jump instruction. For a pipelined CPU that's really bad news, especially when the second jump is so far away. So I wrote up some assembly to remove the double jump and convert it into a single one.

#ifdef __i386__
#define ZEND_VM_CONTINUE() do { __asm__ __volatile__ (\
        "jmp *%0" \
        :: "r" (EX(opline)->handler) ); \
    /* just to fool the compiler */ \
    goto * ((void **)(EX(opline)->handler)); } while(0)
#else
#define ZEND_VM_CONTINUE() goto *(void**)(EX(opline)->handler)
#endif

So in i386 land the jump is assembly code, marked volatile so that it will not be optimised away or rearranged to be more "efficient".

.L1031:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
#APP
    jmp *%eax
#NO_APP
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp .L1194
.L203:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
#APP
    jmp *%eax
#NO_APP
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp .L1194

The compiler requires a goto to actually realize it has to flush all the stack params inside the scope - gcc doesn't parse the asm string, so without the goto it assumes control falls through and keeps emitting the old dispatch tail. I learnt that fact a long time ago trying to do the same for dotgnu's amd64 unroller. Anyway, let's look at the numbers.

           
Before:                     After:
simple             0.579    simple             0.482
simplecall         0.759    simplecall         0.692
simpleucall        1.193    simpleucall        1.111
simpleudcall       1.409    simpleudcall       1.320
mandel             2.034    mandel             1.830
mandel2            2.551    mandel2            2.227
ackermann(7)       1.438    ackermann(7)       1.638
ary(50000)         0.100    ary(50000)         0.097
ary2(50000)        0.080    ary2(50000)        0.080
ary3(2000)         1.051    ary3(2000)         1.024
fibo(30)           3.914    fibo(30)           3.383
hash1(50000)       0.185    hash1(50000)       0.182
hash2(500)         0.209    hash2(500)         0.198
heapsort(20000)    0.616    heapsort(20000)    0.580
matrix(20)         0.500    matrix(20)         0.481
nestedloop(12)     0.953    nestedloop(12)     0.855
sieve(30)          0.499    sieve(30)          0.494
strcat(200000)     0.079    strcat(200000)     0.074
------------------------    ------------------------
Total             18.149    Total             16.750

This is in comparison to the default php5 core, which takes a pathetic 23.583 seconds to complete the same tests. But there's more to the story. If you look carefully, you'll notice that there's an extra load into a register just before the jump. But x86 does support an indirect jump through memory, with a zero index.

   __asm__ __volatile__ ("jmp *(%0)" :: "r" (&(EX(opline)->handler)));

That generates a nice jmp *(%eax), which is perfect for my purpose. Except that, as I can see in the assembly, the above fix didn't really do much for performance. For example, look at the following code :-

    leal    -72(%ebp), %eax
    addl    $76, (%eax)
#APP
    nop
#NO_APP
    movl    -72(%ebp), %eax
#APP
    jmp *(%eax)
#NO_APP

The EAX load between the two custom asm statements is what I was trying to avoid. But the variable is re-loaded from the stack because there is no register variable cache for the handler pointer. One way around that is to do what pnet did - keep your PC (the equivalent of the handler var) in a register, preferably EBX, and use it directly. The separation between operands (stack) and operator (handler) makes it hard to optimise both in one go. The opline contains both together, making it really, really hard to properly speed up.
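
If I ever get around to it, the pnet trick would look roughly like this (a sketch using GCC's explicit register variable extension; the names are hypothetical and none of this is in the PHP source):

/* Sketch: pin the VM pc in EBX so dispatch needs no stack reload. */
typedef struct sk_insn {
    void *target;               /* computed-goto label address */
} sk_insn;

static void sk_execute(sk_insn *program)
{
#ifdef __i386__
    register sk_insn *pc __asm__("ebx") = program;  /* lives in EBX */
#else
    sk_insn *pc = program;
#endif
    goto *pc->target;           /* roughly jmp *(%ebx) - no reload */
op_nop:
    ++pc;
    goto *pc->target;
op_halt:
    return;
}

That is roughly how pnet gets away with a single indirect jump per dispatch instead of a load plus a jump.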

But there's this thing about me - I'm lazy.

--
Captain, we have lost entire left hamster section.
Now, pedal faster.

posted at: 17:55 | path: /php | permalink | Tags: , ,