arm64: cdef: Rewrite an expression slightly

Instead of apply_sign(imin(abs(diff), clip), diff), do
imax(imin(diff, clip), -clip).

Before:                  Cortex A53     A72     A73
cdef_filter_4x4_8bpc_neon:    592.7   374.5   384.5
cdef_filter_4x8_8bpc_neon:   1093.0   704.4   706.6
cdef_filter_8x8_8bpc_neon:   1962.6  1239.4  1252.1
After:
cdef_filter_4x4_8bpc_neon:    593.7   355.5   373.2
cdef_filter_4x8_8bpc_neon:   1091.6   663.2   685.3
cdef_filter_8x8_8bpc_neon:   1964.2  1182.5  1210.8
diff --git a/src/arm/64/cdef.S b/src/arm/64/cdef.S
index 333bdde..c51b451 100644
--- a/src/arm/64/cdef.S
+++ b/src/arm/64/cdef.S
@@ -299,17 +299,17 @@
         uabd            v20.8h, v0.8h,  \s2\().8h   // abs(diff)
         ushl            v17.8h, v16.8h, \shift      // abs(diff) >> shift
         ushl            v21.8h, v20.8h, \shift      // abs(diff) >> shift
-        uqsub           v17.8h, \thresh_vec, v17.8h // imax(0, threshold - (abs(diff) >> shift))
-        uqsub           v21.8h, \thresh_vec, v21.8h // imax(0, threshold - (abs(diff) >> shift))
-        cmhi            v18.8h, v0.8h,  \s1\().8h   // px > p0
-        cmhi            v22.8h, v0.8h,  \s2\().8h   // px > p1
-        umin            v17.8h, v17.8h, v16.8h      // imin(abs(diff), imax())
-        umin            v21.8h, v21.8h, v20.8h      // imin(abs(diff), imax())
+        uqsub           v17.8h, \thresh_vec, v17.8h // clip = imax(0, threshold - (abs(diff) >> shift))
+        uqsub           v21.8h, \thresh_vec, v21.8h // clip = imax(0, threshold - (abs(diff) >> shift))
+        sub             v18.8h, \s1\().8h,  v0.8h   // diff = p0 - px
+        sub             v22.8h, \s2\().8h,  v0.8h   // diff = p1 - px
+        neg             v16.8h, v17.8h              // -clip
+        neg             v20.8h, v21.8h              // -clip
+        smin            v18.8h, v18.8h, v17.8h      // imin(diff, clip)
+        smin            v22.8h, v22.8h, v21.8h      // imin(diff, clip)
         dup             v19.8h, \tap                // taps[k]
-        neg             v16.8h, v17.8h              // -imin()
-        neg             v20.8h, v21.8h              // -imin()
-        bsl             v18.16b, v16.16b, v17.16b   // constrain() = apply_sign()
-        bsl             v22.16b, v20.16b, v21.16b   // constrain() = apply_sign()
+        smax            v18.8h, v18.8h, v16.8h      // constrain() = imax(imin(diff, clip), -clip)
+        smax            v22.8h, v22.8h, v20.8h      // constrain() = imax(imin(diff, clip), -clip)
         mla             v1.8h,  v18.8h, v19.8h      // sum += taps[k] * constrain()
         mla             v1.8h,  v22.8h, v19.8h      // sum += taps[k] * constrain()
 3: