not my sock : hacks/ecx-vs-rcx-2005-11-13-14-45.html

Sun, 13 Nov 2005:

32 bits in, 64 bits out

Contrary to what I thought, Rhys can indeed make mistakes. For the last two weeks, I've been pissed off by an absolutely untraceable bug in the AMD64 engine. One single test case inside the engine fails to pass.

TestMarshalOffsetOf ... failed: 
    CSUnit.TestAssertFailed: Test failed: OffsetOf (6)
    at CSUnit.TestCase.Fail(String) in ./TestCase.cs:14
    at TestMarshal.TestMarshalOffsetOf() ....

Thanks to foresight by the FFOD ^[1], debugging the engine is far easier than most people would imagine. All it requires is for you to comment out things till stuff works and then hit on the one that you last commented ^[2] . The opcodes which aren't JIT converted to native code automatically fall back on the C-version which has been working on AMD64 for a couple of years now (that's what keeps my hair on my head and my stomach ulcer free). The real trick is to imagine what is happening and comment out the right one first off. So after commenting a lot of code, I finally hit on the right one (this was on 29th night, sometime near 3:30 AM). It was the following code

case COP_PWRITE_FIELD:
{
    /* Write a word field to an object */
    unsigned temp = (unsigned)CVM_ARG_BYTE;
    UNROLL_START();
    GetTopTwoWordRegisters(&unroll, &reg, &reg2,
                           MD_REG1_NATIVE | MD_REG2_32BIT);
    CheckForNull(&unroll, reg, pc, (unsigned char *)inst, 0);
    md_store_membase_word_native(unroll.out, reg2, reg, (int)temp);
    FreeTopRegister(&unroll, -1);
    FreeTopRegister(&unroll, -1);
    MODIFY_UNROLL_PC(CVM_LEN_BYTE);
}
break;

Now fastforward to saturday night, November 12th. I was bored, this was pending and valgrind wouldn't load 64 bit binaries, nor would it let me use cachegrind on the 32 bit VM (thanks to libgc's stack probes). OK, now that I have started on this whole bloody thing, the next obvious step was to figure out on what exact arguments the whole thingy goes for a toss. It turns out that it explodes to tiny bits when temp is 0. So the obvious workaround looked like this.

unsigned temp = (unsigned)CVM_ARG_BYTE;
if(temp == 0) goto defaultCase;
UNROLL_START();

But the md_store_membase_word_native was supposed to work even when displacement was zero. So I had a look at the code with gdb while in execution.

if(temp == 0) amd64_breakpoint(unroll.out);
md_store_membase_word_native(unroll.out, reg2, reg, (int)temp);

Now we have to head into assembly programming territory. But the good news is that AMD64 isn't that different from x86 at all. So this is what I do in gdb to debug code which was generated directly in binary form.

Program received signal SIGTRAP, Trace/breakpoint trap.
(gdb) x/5i $pc-2
0x2a97a6e277:   rex64 int3                 # breakpoint
0x2a97a6e279:   mov    %rcx,(%rax)         # mov_membase
0x2a97a6e27c:   sub    $0x20,%r14          # stack update 
0x2a97a6e280:   mov    $0x2a97a6e1c4,%r12  # load next insn
0x2a97a6e28a:   rex64Z jmpq   *(%r12)      # jump there

Everything there looks as it should. Around 15 minutes after staring at the binary code generation, I came to the conclusion that the displacement isn't the issue at all. There was something subtle I was missing in all this. After the first half hour, I had almost given up on the code - then I put a breakpoint before the registers are loaded, and ran the testcase again.

(gdb) x/12i $pc
0x2a97a6e250:   mov    -8(%r14),%ecx  
0x2a97a6e254:   mov    -16(%r14),%rax 
0x2a97a6e258:   or     %rax,%rax          # null check       
0x2a97a6e25b:   rex64 jne    0x2a97a6e279 # if == 0
0x2a97a6e262:   mov    %rax,-16(%r14)     # restore
0x2a97a6e266:   mov    %ecx,-8(%r14)      # arguments
0x2a97a6e26a:   mov    $0x2a97a6e1b4,%r12 
0x2a97a6e274:   jmpq   0x2b00451a8a       # rexecute 
0x2a97a6e279:   mov    %rcx,(%rax) 
0x2a97a6e27c:   sub    $0x20,%r14         # stack += 2 * 8
0x2a97a6e280:   mov    $0x2a97a6e1c4,%r12
0x2a97a6e28a:   rex64Z jmpq   *(%r12)

So after a couple of minutes, the highlighted code peices hit my brain properly. I was storing to ECX and writing from RCX. The word register was being loaded in 32 bit mode, not the full and glorious 64 bit mode. And this is a pointer write, so some pointer somewhere in the generated code was being truncated by the above code. In hindsight it was a very obvious bug - but hindsight is always 20/20. Fix was very simple, change the MD_REG2_32BIT to MD_REG2_NATIVE and it stops truncating the pointer to be stored. Next in line, array read & write inlining.

There is a moral to this story, so to speak. Do not judge anyone's contribution to any project by mere linecount. This single line fix too me over 6 working hours over a period of 17 days to locate, track down and fix. The results are obvious to anyone who runs my souped up engine.

Benchmark	Interpreter	Unroller
Sieve	1665	10775
Loop	1773	22638
Logic	1637	14895
String	16021	19465
Float	222	458
Method	1569	8013
PNetMark (Weighted Sum)	1731	7986

This new and improved engine beats the hell out of the 32 bit mode JIT on the same machine (and I barely put in 30-odd hours into building this). All the small CPU-specific optimisations add up to a huge difference in performance. I am a R!c3r and AMD64 R0xx0rs.

[1] Flying Fingers of Death : Rhys Weatherley
[2] ease of debugging : If you thought I was joking, think again. Most JITs follow an all or nothing policy, where if you comment out an opcode, code with that opcode will just fail. Inside pnet code, it falls back gracefully to the interpreter core.

posted at: 14:45 | path: /hacks | permalink | Tags: public

<	November 2005					>
Su	Mo	Tu	We	Th	Fr	Sa
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30