improved speed of 4x4 sse2 fdct.

* speed improvment of 30 percent achieved
* multiplies and adds remain the same
* non-arithmetic instructions minimized by hand, by:
   -expanding 2 pass loop
   -removing irrelivant "shuffles"
   -combining last two rounding steps
* further improvments may be possible

Change-Id: Idec2c3f52910c48e6a0e0f9aefed5cae31b0b8c0
1 file changed