Mon, 29 May 2006:

Last night, I was attempting to build Fluxbox-0.1.14 on my home amd64 box. After struggling with a couple of errors in src/Resource.hh, I managed to get it built with gcc-3.4.2. Then the real bug hunting started in earnest. Consider the following code.

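#include <stdarg.h>
#include <stdio.h>

/* Prints every string argument after pattern, until a null pointer is read. */
void va_arg_test(const char * pattern, ...) 
{
    va_list va;
    int i = 0;
    const char *p;

    va_start(va,pattern);
    /* each va_arg() call here reads one pointer-sized argument */
    while((p = va_arg(va, char *)) != 0) {
        printf("%02d) '%s'\n", ++i, p);
    }
    va_end(va);
}

int main(int argc, char * argv[])
{
    /* the argument list is terminated with a literal 0 */
    va_arg_test("wvZX", "hello", "world", 0);
    return 0;
}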

Now, to the inexperienced eye it might look like perfectly good working code. But the code above is buggy. To really fix it, I'd have to rewrite the call like so.

    va_arg_test("wvZX", "hello", "world", NULL);

The old code pushes 2 pointers and an integer onto the stack as variable args, while the loop reads out 3 pointers. In normal x86 land that's not a big deal because ints and pointers are both 32 bits wide. But my box is an amd64, where LP64 holds sway: a pointer is as wide as a long (64 bits), not an int (32 bits). So the loop can read 32 bits of junk along with the last 0, miss the terminator, and go on reading junk off the stack.
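For what it's worth, a slightly more paranoid version of the fix casts the terminator explicitly, since NULL is allowed to be defined as a plain integer 0 in some environments. Here is a minimal sketch of that idea. The name count_strings is made up for illustration and is not anything from the Fluxbox source.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* count_strings is a hypothetical helper, not Fluxbox code.
 * It prints each char * argument after pattern and returns how many
 * strings it saw before hitting the null pointer terminator. */
int count_strings(const char *pattern, ...)
{
    va_list va;
    const char *p;
    int n = 0;

    va_start(va, pattern);
    /* every argument is read back as a full-width pointer */
    while ((p = va_arg(va, char *)) != NULL) {
        printf("%02d) '%s'\n", ++n, p);
    }
    va_end(va);
    return n;
}
```

A caller would then write count_strings("wvZX", "hello", "world", (char *)NULL); so that the terminator is passed as a pointer no matter how NULL happens to be defined.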

If you'd run my so-called buggy code on an amd64, you'd have found that it doesn't actually crash at all. That's where the plot thickens. To understand why it doesn't crash, you have to peer deep into the AMD64 ABI for function calls. As far as I remember, the ABI says that the first 6 integer or pointer arguments are passed to a function in registers. So the current assembly listing for my code shows up as

    movl    $0, %ecx
    movl    $.LC1, %edx
    movl    $.LC2, %esi
    movl    $.LC3, %edi
    movl    $0, %eax
    call    va_arg_test

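Here the 0 lands in %ecx, and on amd64 a write to a 32-bit register clears the upper half of the full 64-bit register, so va_arg reads back a clean null pointer. But if I increase the arguments to 8 parameters, the extra ones have to be pushed onto the stack to be passed around, and then you'll note the critical difference in the opcodes between pointer and integer handling.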

    movl    $0, 16(%rsp)
    movq    $.LC7, 8(%rsp)
    movq    $.LC8, (%rsp)
    movl    $.LC4, %r9d
    ...
    movl    $0, %eax
    call    va_arg_test

As you can see, the integer 0 is moved onto the stack using movl while the pointers are moved in using movq, viz. long word and quad word. A movl to memory writes only 4 bytes, so doing this for varargs on amd64 leaves the rest of the quad word in that stack slot uninitialized. Therefore you are not guaranteed a NULL pointer if you read that data out as a char *.

After that was fixed in XmbFontImp.cc, Fluxbox started working. God knows how many other places have similar code that will break in the same way.
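One way to hunt for those other places, assuming a newer compiler than the gcc-3.4.2 I used, is GCC's sentinel function attribute. As far as I know it arrived in GCC 4.0 and Clang understands it too. With -Wformat (which -Wall turns on), a call whose last argument is not a null pointer gets a "missing sentinel in function call" warning. The sketch below is only an illustration; total_length and the NULL_TERMINATED macro are names I made up.

```c
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* NULL_TERMINATED is a hypothetical macro name. It expands to the
 * sentinel attribute on GCC-compatible compilers and to nothing
 * elsewhere, so the code still builds without the extension. */
#if defined(__GNUC__)
#define NULL_TERMINATED __attribute__((sentinel))
#else
#define NULL_TERMINATED
#endif

/* total_length is a made-up example: it sums the lengths of the
 * char * arguments that follow pattern, up to a null pointer. */
size_t total_length(const char *pattern, ...) NULL_TERMINATED;

size_t total_length(const char *pattern, ...)
{
    va_list va;
    const char *s;
    size_t total = 0;

    va_start(va, pattern);
    while ((s = va_arg(va, char *)) != NULL) {
        total += strlen(s);
    }
    va_end(va);
    return total;
}
```

With the declaration in place, total_length("wvZX", "hello", "world", 0) should draw the warning, while the same call ending in (char *)NULL compiles quietly.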

--
At this rate you'd be dead and buried before that counter rolls over back to zero.
Better get some exercise if you want to fix it when it happens.

posted at: 10:11 | path: /hacks