Make fdct32 computation flow within 16bit range

This commit makes use of dual fdct32x32 versions for rate-distortion
optimization loop and encoding process, respectively. The one for
rd loop requires only 16 bits precision for intermediate steps.
The original fdct32x32 that allows higher intermediate precision (18
bits) was retained for the encoding process only.

This allows speed-up for fdct32x32 in the rd loop. No performance
loss observed.

Change-Id: I3237770e39a8f87ed17ae5513c87228533397cc3
6 files changed