Valgrind is one of the most common tools people use to debug memory. Recently, while I was debugging APC, the primary problem I had was PHP Zend code writing into shared memory without acquiring the required locks. I had been debugging that with gdb for a while, but gdb is just dead slow for watching writes to 16 MB of memory and generating backtraces.
The result of all that pain was a quick patch on valgrind 3.1.1. The patch logs all writes to a memory block, with backtraces. But valgrind does not have a terminal to type into midway, unlike gdb. So the question was how to indicate a watchpoint. Valgrind magic functions were the answer. The magic functions can pass a parameter to valgrind mid-execution. This is a source hack, and it is a hell of a lot easier than actually breaking in gdb and marking a breakpoint every time you run it. So here's what the code looks like :-
#include "valgrind/memcheck.h" int main() { int * k = malloc(sizeof(int)); int x = VALGRIND_SET_WATCHPOINT(k, sizeof(int)); modify(k); VALGRIND_CLEAR_WATCHPOINT(x); }
In the compiled code, that request is marked out by the following assembly fragment.
movl $1296236555, -56(%ebp)
movl 8(%ebp), %eax
movl %eax, -52(%ebp)
movl $4, -48(%ebp)
movl $0, -44(%ebp)
movl $0, -40(%ebp)
leal -56(%ebp), %eax
movl $0, %edx
roll $29, %eax ; roll $3, %eax
rorl $27, %eax ; rorl $5, %eax
roll $13, %eax ; roll $19, %eax
movl %edx, %eax
movl %eax, -12(%ebp)
This does nothing at all on a normal x86 CPU, but inside the valgrind executor it is picked up and delivered to mc_handle_client_request, where I handle the case and add the address and size to the watchpoints list.
So whenever a helperc_STOREV* function is called, the address passed in is checked against the watchpoints list, which is stored in the corresponding primary map of access bits. All of these bright ideas were completely stolen from Richard Walsh's patch for valgrind 2.x. But of course, if it weren't for the giants on whose shoulders I stand ...
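To make the mechanics concrete, here is a minimal sketch of that store-side check, assuming a plain linked list and a hypothetical record_watch_event() helper (the actual patch keeps the bits in the primary map, as described above):

/* sketch only -- not the actual patch code */
typedef struct WatchPoint {
   Addr               base;   /* start of the watched block  */
   SizeT              size;   /* length of the watched block */
   struct WatchPoint *next;
} WatchPoint;

static WatchPoint *watchpoints;  /* filled in by the client request */

/* called on the path of every helperc_STOREV* hit */
static void check_store_against_watchpoints(Addr a, SizeT len)
{
   WatchPoint *wp;
   for (wp = watchpoints; wp != NULL; wp = wp->next) {
      /* does [a, a+len) overlap [base, base+size) ? */
      if (a < wp->base + wp->size && wp->base < a + len)
         record_watch_event(wp, a);  /* log a backtrace, bump the count */
   }
}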
bash$ valgrind a.out
==6493== Watchpoint 0 event: write
==6493==    at 0x804845E: modify (in /home/gopalv/hacks/valgrind-tests/a.out)
==6493==    by 0x80484EA: main (in /home/gopalv/hacks/valgrind-tests/a.out)
==6493== This watchpoint has been triggered 1 time
==6493== This watchpoint was set at:
==6493==    at 0x80484DB: main (in /home/gopalv/hacks/valgrind-tests/a.out)
Now I can actually run a huge-ass set of tests on php5 after marking the APC shared memory as watched, see all the writes, filter out the APC writes, and continue to copy out the other written segments into local memory for Zend's pleasure.
Writing software gives you that high of creating something out of nearly nothing. Since I am neither a poet nor a painter, there's no other easy way to get that high (unless ... *ahem*).
--Mathematicians stand on each other's shoulders while computer scientists stand on each other's toes.
-- Richard Hamming
Some time in late 2002, I got to see a clear picture of what interpreter optimisation is all about. While I only wrote a small paragraph of the Design of the Portable.net Interpreter, I got a good look at some of the design decisions that went into pnet. The history of the CVM engine aside, more recently I started looking into the PHP5 engine's core interpreter loop. And believe me, it wasn't written with raw performance in mind.
The VM doesn't go either the register-VM or the stack-VM way, thereby throwing away years of optimisations which have gone into either. Opcode parameters are passed between opcodes through the ->result entry of each opcode, which is then used as the op1 or op2 of the next opcode. You can literally see the tree of operations in this data structure. As good as that is for clarity, it means that every time I add two numbers, I write to a memory location somewhere. For example, I cannot keep some data in a register and have it picked up by a later opcode - which is pretty easy to do with a stack VM.
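As a sketch of what that data structure implies (field names simplified by me, not the exact zend_op layout):

/* The result slot of one opcode is read back as the op1/op2 of a later
 * one, so "$x = $a + $b + $c" compiles roughly to:
 *
 *   ADD    op1=$a, op2=$b  -> result=~0   (write temp to memory)
 *   ADD    op1=~0, op2=$c  -> result=~1   (read temp, write temp)
 *   ASSIGN op1=$x, op2=~1
 *
 * Every intermediate value round-trips through a temp slot in memory,
 * where a stack or register VM could have kept it live. */
typedef struct znode_sketch { int op_type; int var; } znode_sketch;

typedef struct zend_op_sketch {
    int          opcode;
    znode_sketch result;  /* written by this opcode ...              */
    znode_sketch op1;     /* ... and read back as the op1/op2 of the */
    znode_sketch op2;     /*     next opcode in the expression tree  */
} zend_op_sketch;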
Nor did I see any concept of verifiability, which means that I cannot predict output types or make any assumptions about them either. For example, the following is the code for the add operation.
ZEND_ADD_SPEC_VAR_VAR_HANDLER:
{
    zend_op *opline = EX(opline);
    zend_free_op free_op1, free_op2;

    add_function(&EX_T(opline->result.u.var).tmp_var,
        _get_zval_ptr_var(&opline->op1, EX(Ts), &free_op1 TSRMLS_CC),
        _get_zval_ptr_var(&opline->op2, EX(Ts), &free_op2 TSRMLS_CC) TSRMLS_CC);

    if (free_op1.var) {zval_ptr_dtor(&free_op1.var);};
    if (free_op2.var) {zval_ptr_dtor(&free_op2.var);};
    ZEND_VM_NEXT_OPCODE();
}
Since we have no idea what type of zval is contained in the operands, the code has to do a set of conversions to number. Each of those operations basically involves a conditional jump somewhere (a.k.a. an if), which is exactly what we're supposed to be avoiding in order to speed up.
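Here is a minimal, self-contained sketch (my own types and names, far simpler than the real add_function) of the kind of branching that untyped operands force on every single add:

#include <stdio.h>

typedef enum { IS_LONG, IS_DOUBLE } ztype;

typedef struct {
    ztype type;
    union { long lval; double dval; } value;
} zval_sketch;

static void add_sketch(zval_sketch *result,
                       const zval_sketch *op1, const zval_sketch *op2)
{
    if (op1->type == IS_LONG && op2->type == IS_LONG) {     /* branch 1 */
        result->type = IS_LONG;
        result->value.lval = op1->value.lval + op2->value.lval;
    } else {                                                /* branches 2..n */
        double a = op1->type == IS_LONG ? (double)op1->value.lval
                                        : op1->value.dval;
        double b = op2->type == IS_LONG ? (double)op2->value.lval
                                        : op2->value.dval;
        result->type = IS_DOUBLE;
        result->value.dval = a + b;
    }
}

int main(void)
{
    zval_sketch a = { IS_LONG,   { .lval = 2   } };
    zval_sketch b = { IS_DOUBLE, { .dval = 0.5 } };
    zval_sketch r;
    add_sketch(&r, &a, &b);
    printf("%f\n", r.value.dval);  /* 2.500000 */
    return 0;
}

With a verifiable, typed opcode stream the IS_LONG path could be picked once, ahead of time; here the ifs run on every execution.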
Neither could I registerify variables easily, because there was a stupid CALL-based VM (which is flexible enough to allow some weird hacks by replacing opcodes) which throws away all its variables in every scope. That's some serious stack-space churn, which I can't force anyone to redo. At least, not yet. So in spite of having a CGOTO core, there was hardly anything I could do without breaking the CALL core codebase.
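To make the distinction concrete, here is a minimal sketch (assumed shape and names, not the generated Zend cores) of CALL-style dispatch: every opcode is an indirect function call, so each handler's locals die with its own stack frame and nothing stays cached across opcodes:

#include <stdio.h>

typedef struct vm_op vm_op;
typedef const vm_op *(*handler_fn)(const vm_op *op, long *acc);
struct vm_op { handler_fn fn; long arg; };

static const vm_op *op_add(const vm_op *op, long *acc)
{
    *acc += op->arg;          /* any locals here are gone after return */
    return op + 1;
}

static const vm_op *op_halt(const vm_op *op, long *acc)
{
    (void)op; (void)acc;
    return NULL;
}

static long run_call_core(const vm_op *pc)
{
    long acc = 0;
    while (pc != NULL)
        pc = pc->fn(pc, &acc);  /* one call/ret per opcode */
    return acc;
}

int main(void)
{
    const vm_op prog[] = { { op_add, 40 }, { op_add, 2 }, { op_halt, 0 } };
    printf("%ld\n", run_call_core(prog));  /* prints 42 */
    return 0;
}

A CGOTO core instead jumps between labels inside one big function, which is the core the assembly below comes from.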
Basically, after I'd exhausted my usual bag of tricks, I looked a little closer at the assembly thrown out by the compiler. Maybe something not quite obvious was happening in the engine.
.L1031:
    leal -72(%ebp), %eax
    addl $76, (%eax)
    movl -72(%ebp), %eax
    movl (%eax), %eax
    movl %eax, -2748(%ebp)
    jmp .L1194
.L203:
    leal -72(%ebp), %eax
    addl $76, (%eax)
    movl -72(%ebp), %eax
    movl (%eax), %eax
    movl %eax, -2748(%ebp)
    jmp .L1194
....
.L1194:
    jmp *-2748(%ebp)
As you can clearly see, the jump target is yet another jump instruction. For a pipelined CPU that's really bad news, especially when the jump is so far off. So I wrote up some assembly to remove the double jump and convert it into a single one.
#ifdef __i386__
#define ZEND_VM_CONTINUE() do { \
        __asm__ __volatile__ ( \
            "jmp *%0" \
            :: "r" (EX(opline)->handler) ); \
        /* just to fool the compiler */ \
        goto * ((void **)(EX(opline)->handler)); \
    } while (0)
#else
#define ZEND_VM_CONTINUE() goto *(void**)(EX(opline)->handler)
#endif
So in i386 land the jump is assembly code, marked volatile so that it will not be optimised away or rearranged to be more "efficient".
.L1031:
    leal -72(%ebp), %eax
    addl $76, (%eax)
    movl -72(%ebp), %eax
    movl (%eax), %eax
#APP
    jmp *%eax
#NO_APP
    movl -72(%ebp), %eax
    movl (%eax), %eax
    movl %eax, -2748(%ebp)
    jmp .L1194
.L203:
    leal -72(%ebp), %eax
    addl $76, (%eax)
    movl -72(%ebp), %eax
    movl (%eax), %eax
#APP
    jmp *%eax
#NO_APP
    movl -72(%ebp), %eax
    movl (%eax), %eax
    movl %eax, -2748(%ebp)
    jmp .L1194
The compiler requires a goto to actually realize it has to flush all the stack params inside the scope. I learnt that fact a long time ago, trying to do the same for dotgnu's amd64 unroller. Anyway, let's look at the numbers.
Before:                          After:

simple             0.579         simple             0.482
simplecall         0.759         simplecall         0.692
simpleucall        1.193         simpleucall        1.111
simpleudcall       1.409         simpleudcall       1.320
mandel             2.034         mandel             1.830
mandel2            2.551         mandel2            2.227
ackermann(7)       1.438         ackermann(7)       1.638
ary(50000)         0.100         ary(50000)         0.097
ary2(50000)        0.080         ary2(50000)        0.080
ary3(2000)         1.051         ary3(2000)         1.024
fibo(30)           3.914         fibo(30)           3.383
hash1(50000)       0.185         hash1(50000)       0.182
hash2(500)         0.209         hash2(500)         0.198
heapsort(20000)    0.616         heapsort(20000)    0.580
matrix(20)         0.500         matrix(20)         0.481
nestedloop(12)     0.953         nestedloop(12)     0.855
sieve(30)          0.499         sieve(30)          0.494
strcat(200000)     0.079         strcat(200000)     0.074
------------------------         ------------------------
Total             18.149         Total             16.750
This is in comparison to the default php5 core, which takes a pathetic 23.583 seconds to complete the tests. But there's more to the story. If you look carefully at the assembly above, you'll notice that the handler pointer is first loaded into a register and only then jumped through. But x86 does support an indirect indexed jump with a zero index, straight through memory.
__asm__ __volatile__ ("jmp *(%0)" :: "r" (&(EX(opline)->handler)));
That generates a nice jmp *(%eax), which is perfect for my purpose. Except that, as I can see in the assembly, the above fix didn't really do much for performance. For example, look at the following code :-
    leal -72(%ebp), %eax
    addl $76, (%eax)
#APP
    nop
#NO_APP
    movl -72(%ebp), %eax
#APP
    jmp *(%eax)
#NO_APP
The EAX load between the two custom asm statements is what I was trying to avoid. But the variable is reloaded from the stack because there is no register variable caching the handler pointer. One way around that is to do what pnet did: keep your PC (the equivalent of the handler variable) in a register, preferably EBX, and use it directly. The separation between operands (stack) and operator (handler) makes it hard to optimize both in one go. The opline contains both together, making it really, really hard to properly speed up.
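A minimal sketch of that pnet-style trick, assuming GCC's explicit register variable and computed-goto extensions and a non-PIC i386 build (PIC claims EBX for itself); the names here are mine, not the engine's:

#include <stdio.h>

typedef struct vm_op { void *handler; long arg; } vm_op;

static long run(void)
{
    /* handlers are label addresses -- a GCC extension */
    vm_op prog[3];
    prog[0] = (vm_op){ &&do_add, 40 };
    prog[1] = (vm_op){ &&do_add, 2 };
    prog[2] = (vm_op){ &&do_halt, 0 };

#if defined(__GNUC__) && defined(__i386__) && !defined(__PIC__)
    register vm_op *pc asm("ebx") = prog;  /* pin the pc in EBX */
#else
    vm_op *pc = prog;
#endif
    long acc = 0;
    goto *pc->handler;

do_add:
    acc += pc->arg;
    pc++;
    goto *pc->handler;  /* no reload from the stack frame */

do_halt:
    return acc;
}

int main(void)
{
    printf("%ld\n", run());  /* prints 42 */
    return 0;
}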
But there's this thing about me - I'm lazy.
--Captain, we have lost the entire left hamster section.
Now, pedal faster.