In my write-up on porting Windows assembly to OS X, I mentioned that there is an easier way to keep the stack properly aligned than combing through your code by hand. It turns out gcc has a nice compiler flag, -mstackrealign, with the following effect:

Realign the stack at entry. On the Intel x86, the -mstackrealign option will generate an alternate prologue/epilogue that realigns the runtime stack. This supports mixing legacy codes that keep a 4-byte aligned stack with modern codes that keep a 16-byte stack for SSE compatibility. The alternate prologue and epilogue are slower and bigger than the regular ones, and they require one dedicated register for the entire function. This also lowers the number of registers available if used in conjunction with the "regparm" attribute. Nested functions encountered while -mstackrealign is on will generate warnings, and they will not realign the stack when called.

So this sounds great! For a small performance hit, you can use your legacy code without modification. OS X doesn't use registers to pass parameters, so there's no conflict with regparm. However, I tried to use this option in my project, and it still crashed. Not on the movdqa instruction as before, but in seemingly random places. And only in Release mode. It sounded like this option doesn't play nicely with compiler optimizations, and I just had to investigate.

I was able to narrow it down to a very simple bit of code. Take the following and put it in a new Xcode command line tool project:

#include <stdio.h>

#define NOINLINE __attribute__((noinline))

NOINLINE static void foo1(int i1)
{
    printf("foo1: %d\n", i1);
}

NOINLINE static void foo2(int i1, int i2)
{
    printf("foo2: %d, %d\n", i1, i2);
}

NOINLINE static void foo3(int i1, int i2, int i3)
{
    printf("foo3: %d, %d, %d\n", i1, i2, i3);
}

NOINLINE static void foo4(int i1, int i2, int i3, int i4)
{
    printf("foo4: %d, %d, %d, %d\n", i1, i2, i3, i4);
}

NOINLINE static void foo5(int i1, int i2, int i3, int i4, int i5)
{
    printf("foo5: %d, %d, %d, %d, %d\n", i1, i2, i3, i4, i5);
}

NOINLINE static void foo6(int i1, int i2, int i3, int i4, int i5, int i6)
{
    printf("foo6: %d, %d, %d, %d, %d, %d\n", i1, i2, i3, i4, i5, i6);
}

int main(int argc, char **argv)
{
    foo1(1);
    foo2(1, 2);
    foo3(1, 2, 3);
    foo4(1, 2, 3, 4);
    foo5(1, 2, 3, 4, 5);
    foo6(1, 2, 3, 4, 5, 6);
    return 0;
}

You would expect the output to be:

foo1: 1
foo2: 1, 2
foo3: 1, 2, 3
foo4: 1, 2, 3, 4
foo5: 1, 2, 3, 4, 5
foo6: 1, 2, 3, 4, 5, 6

And if you run it in Debug mode, that's what you get. But run it in Release mode with optimizations turned on, and you get:

foo1: 1
foo2: 1, 2
foo3: 1, 2, 1
foo4: 1, 2, 4, -1881117246
foo5: 1, 2, 4, 5, 0
foo6: 1, 2, 4, 5, 6, 0

What the? All of a sudden, the 3rd parameter has gone missing. The 4th and higher parameters get shifted over. And the last parameter is random garbage. This was quite interesting, and I decided to dig deeper.

I was able to narrow the problem down to a specific optimization known as unit-at-a-time. Read the gcc manual for the full description, but the relevant portion is:

Static functions now can use non-standard passing conventions that may break asm statements calling functions directly.

Ah ha... let's look at the generated assembly code in main() where it calls foo4(), when using -Os, but without -mstackrealign:

        movl    $4, (%esp)
        movl    $3, %ecx
        movl    $2, %edx
        movl    $1, %eax
        call    _foo4

Okay. The optimizer is being clever. Since foo4() is static, it knows it is only called within this one module. Thus, it doesn't have to follow the usual calling conventions, and will pass parameters in registers instead of on the stack for efficiency. So it takes 3 registers, %eax, %edx and %ecx, and uses them for the first three parameters. The fourth parameter and up go on the stack. The corresponding code in foo4() naturally expects the first three parameters in registers, too:

        # void foo4(int i1, int i2, int i3, int i4)
_foo4:
        pushl   %ebp                # Save old frame pointer on the stack
        movl    %esp, %ebp          # Setup new frame pointer
        pushl   %esi                # Save %esi on the stack
        pushl   %ebx                # Save %ebx on the stack
        subl    $32, %esp           # Allocate space for 5 4-byte arguments
                                    #  to printf(), plus 12 bytes of padding
        call    ___i686.get_pc_thunk.bx # Position independent magic
"L00000000004$pb":
        movl    8(%ebp), %esi       # Move i4 parameter to %esi
        movl    %esi, 16(%esp)      # Store i4 in printf()'s argument area
        movl    %ecx, 12(%esp)      # Store i3 parameter on the stack
        movl    %edx, 8(%esp)       # Store i2 parameter on the stack
        movl    %eax, 4(%esp)       # Store i1 parameter on the stack
        leal    LC3-"L00000000004$pb"(%ebx), %eax # Load address of printf()
        movl    %eax, (%esp)        #  format string and store it on the stack
        call    L_printf$stub       # Call printf()
        addl    $32, %esp           # Free stack space for parameters
        popl    %ebx                # Restore %ebx
        popl    %esi                # Restore %esi
        popl    %ebp                # Restore frame pointer
        ret                         # We're done

The problem shows up when we look at the generated assembly code with -mstackrealign in effect. The code for main() doesn't change. It still passes 3 parameters in registers and 1 on the stack. However, we don't have to get past the prologue of foo4() to see the problem:

_foo4:
        leal    4(%esp), %ecx       # Special prologue to realign
        andl    $-16, %esp          #  stack to 16-bytes
        pushl   -4(%ecx)            #  (cont.)
        pushl   %ebp                # Save old frame pointer on the stack 
        movl    %esp, %ebp          # Setup new frame pointer
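
The `andl $-16, %esp` line in this prologue is plain bit masking: -16 is 0xFFFFFFF0 as a 32-bit value, so the AND clears the low four bits and rounds the stack pointer down to a 16-byte boundary. A minimal C sketch of that arithmetic follows. The function names are made up for illustration, and the addresses in the note afterwards are arbitrary.

```c
#include <stdint.h>

/* Round a 32-bit address down to the previous 16-byte boundary.
 * Mirrors the effect of `andl $-16, %esp` on the stack pointer:
 * clearing the low four bits can only keep or lower the address,
 * which is the safe direction for a stack that grows downward. */
uint32_t align_down_16(uint32_t addr)
{
    return addr & ~(uint32_t)15;
}

/* Return 1 if addr is already 16-byte aligned, 0 otherwise.
 * This is the alignment that movdqa needs for its memory operand. */
int is_aligned_16(uint32_t addr)
{
    return (addr & 15u) == 0;
}
```

For example, align_down_16(0xbffff7ac) gives 0xbffff7a0, while an address that is already aligned comes back unchanged.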

This option added 3 new instructions to the prologue to realign the stack. In fact, the problem lies in the very first instruction, which clobbers %ecx. Remember, main() passed the 3rd parameter, i3, in %ecx. Well, this explains where the 3rd parameter disappeared to... the bit bucket. Looking at the rest of the code, it's apparent that foo4() expected the 3rd and 4th parameters to be on the stack. This also explains why the 4th parameter got shifted over to the 3rd, and why the last parameter was random garbage. This section in the -mstackrealign description has come home to roost:

The alternate prologue and epilogue [...] require one dedicated register for the entire function. This also lowers the number of registers available if used in conjunction with the "regparm" attribute.

We're not using regparm, but the effect is the same. The new prologue steals %ecx for stack realignment, but apparently it forgot to tell unit-at-a-time. Thus we have one of those miscommunications I alluded to in part 1, where the calling conventions are not followed. In this case, the caller is using 3 registers for parameter passing, whereas the callee is using 2 registers.
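
The mismatch can be modelled in plain C. This is a toy sketch, not how gcc represents anything internally: the struct machine type, the function names, and the four-slot stack array are all assumptions made for illustration. The caller follows the convention main() was compiled with, and the two callees read the same state under the two different conventions.

```c
/* Toy model of the machine state that matters for the call. */
struct machine {
    int eax, edx, ecx;  /* registers carrying the first three arguments */
    int stack[4];       /* argument slots; stack[0] stands in for (%esp) */
};

/* Caller side, as main() was compiled: i1..i3 in registers,
 * i4 in the first stack slot. stack[1] is never written, so it
 * keeps whatever value was there before. */
void caller_foo4(struct machine *m, int i1, int i2, int i3, int i4)
{
    m->eax = i1;
    m->edx = i2;
    m->ecx = i3;
    m->stack[0] = i4;
}

/* Callee that agrees with the caller: three register arguments. */
void callee_foo4_three_regs(const struct machine *m, int out[4])
{
    out[0] = m->eax;
    out[1] = m->edx;
    out[2] = m->ecx;
    out[3] = m->stack[0];
}

/* Callee under -mstackrealign: the prologue overwrites %ecx, and the
 * body takes only i1 and i2 from registers, reading i3 and i4 from
 * the stack instead. */
void callee_foo4_realigned(struct machine *m, int out[4])
{
    m->ecx = 0;  /* placeholder: the real prologue loads an address here */
    out[0] = m->eax;
    out[1] = m->edx;
    out[2] = m->stack[0];  /* really i4, read as i3 */
    out[3] = m->stack[1];  /* stale slot, read as i4 */
}
```

If stack[1] holds some leftover value G, then after caller_foo4(&m, 1, 2, 3, 4) the three-register callee yields 1, 2, 3, 4 and the realigned callee yields 1, 2, 4, G. That is the same shape as the foo4 line in the Release output.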

Fine, mystery solved. But can we work around it? It turns out unit-at-a-time is enabled for -O2, -O3, and -Os. So one solution is to not optimize at these levels. But that's, pardon the pun, suboptimal.

It's possible to disable unit-at-a-time individually by passing -fno-unit-at-a-time. I've verified this does fix the problem, even with -O3 and -mstackrealign. So this is your best bet. You'll lose some optimization, but at least your code will work, both with legacy Windows assembly code and with itself. I tested all of this in gcc 4.0.1, build 5367. rdar://problem/4861528 has been filed.
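
When trying flag combinations like these, it helps if the repro reports its own failures instead of relying on someone reading the output. A minimal sketch, assuming the same NOINLINE static functions as in the test program but formatting into a buffer; last_line and count_mismatches are hypothetical names, and this variant has not been run against the affected gcc build.

```c
#include <stdio.h>
#include <string.h>

#define NOINLINE __attribute__((noinline))

/* Holds the most recently formatted line so it can be compared. */
static char last_line[64];

NOINLINE static void foo3(int i1, int i2, int i3)
{
    snprintf(last_line, sizeof last_line, "foo3: %d, %d, %d", i1, i2, i3);
}

NOINLINE static void foo4(int i1, int i2, int i3, int i4)
{
    snprintf(last_line, sizeof last_line, "foo4: %d, %d, %d, %d",
             i1, i2, i3, i4);
}

/* Call each function and count how many formatted lines differ from
 * what the source code says they should be. Returns 0 when the
 * arguments arrived intact. */
int count_mismatches(void)
{
    int bad = 0;

    foo3(1, 2, 3);
    bad += strcmp(last_line, "foo3: 1, 2, 3") != 0;

    foo4(1, 2, 3, 4);
    bad += strcmp(last_line, "foo4: 1, 2, 3, 4") != 0;

    return bad;
}
```

Calling count_mismatches() from main() and returning its result as the exit status would let a build script flag a bad flag combination. A correctly compiled build returns 0.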