Some time in late 2002, I got to see a clear picture of what interpreter optimisation is all about. While I only wrote a small paragraph of the Design of the Portable.NET Interpreter, I got a good look at some of the design decisions that went into pnet. The history of the CVM engine aside, more recently I started looking into the PHP5 engine's core interpreter loop. And believe me, it wasn't written with raw performance in mind.
The VM goes neither the register-VM nor the stack-VM way, thereby throwing away years of optimisations which have gone into either. Opcode parameters are passed between opcodes via the ->result entry in each opcode, which is then used as the op1 or op2 of the next opcode. You can literally see the tree of operations in this data structure. As good as that is for data clarity, it means that every time I add two numbers, I write to a memory location somewhere. I cannot, for example, persist some data in a register and have it picked up by a later opcode - which is pretty easy to do with a stack VM.
Nor did I see any concept of verifiability, which means I cannot predict operand types or make any assumptions about them either. For example, the following is the code for the add operation.
ZEND_ADD_SPEC_VAR_VAR_HANDLER: {
    zend_op *opline = EX(opline);
    zend_free_op free_op1, free_op2;

    add_function(&EX_T(opline->result.u.var).tmp_var,
        _get_zval_ptr_var(&opline->op1, EX(Ts), &free_op1 TSRMLS_CC),
        _get_zval_ptr_var(&opline->op2, EX(Ts), &free_op2 TSRMLS_CC) TSRMLS_CC);

    if (free_op1.var) { zval_ptr_dtor(&free_op1.var); }
    if (free_op2.var) { zval_ptr_dtor(&free_op2.var); }
    ZEND_VM_NEXT_OPCODE();
}
Since we have no idea what type of zval is contained in the operands, the code has to do a set of conversions to number. Each of those conversions basically involves a conditional jump somewhere (aka an if) - which is exactly what we're supposed to be avoiding to speed things up.
Nor could I registerify variables easily, because there was a stupid CALL based VM (which, admittedly, is flexible enough to allow some weird hacks by replacing opcodes) which throws away all variables in every scope. That's some serious stack space churn, which I can't force anyone to redo. At least, not yet. So in spite of having a CGOTO core, there was hardly anything I could do without breaking the CALL core codebase.
Basically, after I'd exhausted my usual bag of tricks, I looked a little closer at the assembly thrown out by the compiler. Maybe there was something not quite obvious happening in the engine.
.L1031:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp     .L1194
.L203:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp     .L1194
....
.L1194:
    jmp     *-2748(%ebp)
As you can clearly see, the jump target is yet another jump instruction. For a pipelined CPU that's really bad news, especially when the jump is so far away. So I wrote up some inline assembly to collapse the double jump into a single one.
#ifdef __i386__
#define ZEND_VM_CONTINUE() do { \
    __asm__ __volatile__ ( \
        "jmp *%0" \
        :: "r" (EX(opline)->handler) ); \
    /* just to fool the compiler */ \
    goto * ((void **)(EX(opline)->handler)); \
} while (0)
#else
#define ZEND_VM_CONTINUE() goto *(void**)(EX(opline)->handler)
#endif
So in i386 land the jump is inline assembly, marked volatile so that it will not be optimised away or rearranged to be more "efficient".
.L1031:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
#APP
    jmp     *%eax
#NO_APP
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp     .L1194
.L203:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
#APP
    jmp     *%eax
#NO_APP
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp     .L1194
The compiler still needs the trailing goto to realize it has to flush all the stack params inside the scope - I learnt that fact a long time ago trying to do the same for DotGNU's amd64 unroller. Anyway, let's look at the numbers.
                     Before     After
simple                0.579     0.482
simplecall            0.759     0.692
simpleucall           1.193     1.111
simpleudcall          1.409     1.320
mandel                2.034     1.830
mandel2               2.551     2.227
ackermann(7)          1.438     1.638
ary(50000)            0.100     0.097
ary2(50000)           0.080     0.080
ary3(2000)            1.051     1.024
fibo(30)              3.914     3.383
hash1(50000)          0.185     0.182
hash2(500)            0.209     0.198
heapsort(20000)       0.616     0.580
matrix(20)            0.500     0.481
nestedloop(12)        0.953     0.855
sieve(30)             0.499     0.494
strcat(200000)        0.079     0.074
--------------------------------------
Total                18.149    16.750
This is in comparison to the default php5 core, which takes a pathetic 23.583 to complete the tests. But there's more to the story. If you look carefully at the assembly above, you'll notice there's still a register load just before the jump - the handler pointer is moved into a register first. But x86 does support an indirect indexed jump with a zero index.
__asm__ __volatile__ ("jmp *(%0)" :: "r" (&(EX(opline)->handler)));
That generates a nice jmp *(%eax), which is perfect for my purpose. Except that, as I can see in the assembly, the fix didn't really do much for performance. For example, look at the following code:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
#APP
    nop
#NO_APP
    movl    -72(%ebp), %eax
#APP
    jmp     *(%eax)
#NO_APP
The EAX load between the two custom asm statements is what I was trying to avoid. But the variable is re-loaded from the stack because there is no register cache for the handler pointer. One way around that is to do what pnet did: keep your PC (the equivalent of the handler var) in a register, preferably EBX, and use it directly. The separation between operands (stack) and operator (handler) makes it hard to optimize both in one go. The opline holds both together, making it really, really hard to speed up properly.
But there's this thing about me - I'm lazy.
--Captain, we have lost the entire left hamster section.
Now, pedal faster.