intel/compiler: remove branch weight heuristic

As a result of this patch, the compiler chooses SIMD32 shaders more
frequently.

The current logic is designed to avoid regressions from enabling SIMD32
at all cost, even though the cases where a regression can happen are
probably smaller draw calls (objects far away from the camera, and thus
smaller on screen).
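
For illustration (the numbers are hypothetical): the heuristic being
removed weighted the cycles of instructions between IF and ENDIF by 0.5
in SIMD8/SIMD16 but by 1.0 in SIMD32. A shader spending T cycles in
straight-line code and T cycles inside a single IF block was therefore
estimated at roughly 1.5*T for SIMD16 but 2*T for SIMD32, a built-in
penalty for the wider mode whenever control flow was present.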

In Intel perf CI this patch improves FPS in:
- gfxbench5 alu2:      21.92% (gen9), 23.7%  (gen11)
- synmark OglShMapVsm:  3.26% (gen9),  4.52% (gen11)
- gfxbench5 car chase:  1.34% (gen9),  1.32% (gen11)
No regressions were observed there.

In my testing, it also improves FPS in:
- The Talos Principle:   2.9% (gen9)

The other 16 games I tested had very minor changes in performance
(about two thirds positive, but none significant enough to list here).

Note: this patch harms synmark OglDrvState (which is not in Intel perf
CI) by ~2.9%, but this benchmark renders multiple scenes from other
workloads (including OglShMapVsm, which is helped in standalone mode)
in tiny rectangles. Rendering at such a small size drastically changes
the branching statistics, which favors smaller SIMD modes. I assume
this matters only in micro-benchmarks, because in real workloads the
more expensive draw calls (which have more uniform branching behavior)
dominate.

Signed-off-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Acked-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7137>
diff --git a/src/intel/compiler/brw_ir_performance.cpp b/src/intel/compiler/brw_ir_performance.cpp
index ce74f0a..c0fae61 100644
--- a/src/intel/compiler/brw_ir_performance.cpp
+++ b/src/intel/compiler/brw_ir_performance.cpp
@@ -1505,16 +1505,23 @@
                             const backend_instruction *),
                          unsigned dispatch_width)
    {
-      /* XXX - Plumbing the trip counts from NIR loop analysis would allow us
-       *       to do a better job regarding the loop weights.  And some branch
-       *       divergence analysis would allow us to do a better job with
-       *       branching weights.
+      /* XXX - Note that the previous version of this code used worst-case
+       *       scenario estimation of branching divergence for SIMD32 shaders,
+       *       but this heuristic was removed to improve performance in common
+       *       scenarios. Wider shader variants are less optimal when divergence
+       *       is high, e.g. when the application renders a complex scene on a
+       *       small surface. It is assumed that such renders are short, so
+       *       their time doesn't matter for overall performance, which is
+       *       dominated by larger renders where wider variants are optimal.
+       *
+       *       It's possible that we could do better with divergence analysis
+       *       by isolating branches which are 100% uniform.
+       *
+       *       Plumbing the trip counts from NIR loop analysis would allow us
+       *       to do a better job regarding the loop weights.
        *
        *       In the meantime use values that roughly match the control flow
-       *       weights used elsewhere in the compiler back-end -- Main
-       *       difference is the worst-case scenario branch_weight used for
-       *       SIMD32 which accounts for the possibility of a dynamically
-       *       uniform branch becoming divergent in SIMD32.
+       *       weights used elsewhere in the compiler back-end.
        *
        *       Note that we provide slightly more pessimistic weights on
        *       Gen12+ for SIMD32, since the effective warp size on that
@@ -1523,7 +1530,6 @@
        *       previous generations, giving narrower SIMD modes a performance
        *       advantage in several test-cases with non-uniform discard jumps.
        */
-      const float branch_weight = (dispatch_width > 16 ? 1.0 : 0.5);
       const float discard_weight = (dispatch_width > 16 || s->devinfo->gen < 12 ?
                                     1.0 : 0.5);
       const float loop_weight = 10;
@@ -1539,16 +1545,12 @@
 
             issue_instruction(st, s->devinfo, inst);
 
-            if (inst->opcode == BRW_OPCODE_ENDIF)
-               st.weight /= branch_weight;
-            else if (inst->opcode == FS_OPCODE_PLACEHOLDER_HALT && discard_count)
+            if (inst->opcode == FS_OPCODE_PLACEHOLDER_HALT && discard_count)
                st.weight /= discard_weight;
 
             elapsed += (st.unit_ready[unit_fe] - clock0) * st.weight;
 
-            if (inst->opcode == BRW_OPCODE_IF)
-               st.weight *= branch_weight;
-            else if (inst->opcode == BRW_OPCODE_DO)
+            if (inst->opcode == BRW_OPCODE_DO)
                st.weight *= loop_weight;
             else if (inst->opcode == BRW_OPCODE_WHILE)
                st.weight /= loop_weight;