Here are some helpful hints to get optimal performance:
The EMMS call takes a lot of time, so try to group floating point and MMX operations into separate blocks of code, so that EMMS is called as rarely as possible.
Use MMX only in low-level routines, because the compiler saves all used MMX registers when calling a subroutine.
The NOT operator isn’t supported natively by MMX, so the compiler has to generate a workaround, which makes this operation inefficient.
Simple assignments of floating point numbers don’t access floating point registers, so they require no call to the EMMS procedure. Only when you do floating point arithmetic do you need to call the EMMS procedure.
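
The first and last hints can be illustrated with a minimal sketch. It assumes Free Pascal on an i386 target with the mmx unit available. The program name, variable names, and values are made up for illustration. All MMX work is done in one block, EMMS is called once, and floating point arithmetic only starts afterwards.

```pascal
{ Sketch only: assumes Free Pascal on i386 with MMX support available. }
program mmxphases;

uses
  mmx;  { provides tmmxword, is_mmx_cpu and emms }

const
  { illustrative values, chosen so that no word overflows }
  w1 : tmmxword = (1, 2, 3, 4);
  w2 : tmmxword = (10, 20, 30, 40);

var
  w3, w4 : tmmxword;
  scale : double;
  i : longint;

begin
  if is_mmx_cpu then
    begin
{$mmx+}
      { Phase 1: keep all MMX operations together }
      w3 := w1 + w2;
      w4 := w3 + w1;
{$mmx-}
      { one EMMS call, before any floating point arithmetic }
      emms;
    end
  else
    { fallback without MMX: element-wise addition }
    for i := 0 to 3 do
      begin
        w3[i] := w1[i] + w2[i];
        w4[i] := w3[i] + w1[i];
      end;

  { Phase 2: floating point arithmetic, after EMMS }
  scale := 0.5;
  for i := 0 to 3 do
    writeln(w4[i] * scale:0:1);
end.
```

Alternating between MMX statements and floating point arithmetic would instead need an EMMS call at each switch, which is the cost the first hint warns about.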