[inductor] enable software pipelining on AMD devices (#125858)

Summary:
Per AMD, software pipelining is enabled by setting `num_stages=0` and should provide a nice perf boost for GEMMs. The caveat is that `num_stages=1` is preferred for back-to-back GEMMs, but `num_stages=0` is taken as the better default.
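
For illustration, a minimal sketch of what `num_stages=0` vs. `num_stages=1` means at the Triton config level (the block sizes and `num_warps` below are hypothetical, not values from this PR):

```python
import torch
import triton

# Sketch only: on ROCm, num_stages=0 asks the Triton AMD backend to apply its
# software-pipelining pass to the GEMM K-loop, while num_stages=1 disables it.
num_stages = 0 if torch.version.hip else 2  # 2 is an assumed non-ROCm default here

gemm_config = triton.Config(
    {"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32},  # hypothetical tile sizes
    num_stages=num_stages,
    num_warps=8,
)
```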

Wait to land until the Triton upstream change lands in OSS; pipelining does not work well on the fork.

Test Plan: n/a

Reviewed By: xw285cornell, yoyoyocmu

Differential Revision: D56221447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125858
Approved by: https://github.com/pragupta, https://github.com/yoyoyocmu
diff --git a/torch/_inductor/kernel/mm_common.py b/torch/_inductor/kernel/mm_common.py
index 5a7f60e5..26d0818 100644
--- a/torch/_inductor/kernel/mm_common.py
+++ b/torch/_inductor/kernel/mm_common.py
@@ -178,14 +178,14 @@
     if config["cond"]
 )
 
-# On ROCm convert num_stages to 1 as pipelining provides no benefit
+# On ROCm convert num_stages to 0 to enable software pipelining
 if torch.version.hip:
     mm_platform_configs = tuple(
-        (config[0], config[1], config[2], 1, config[4])
+        (config[0], config[1], config[2], 0, config[4])
         for config in mm_platform_configs
     )
     int8_platform_configs = tuple(
-        (config[0], config[1], config[2], 1, config[4])
+        (config[0], config[1], config[2], 0, config[4])
         for config in int8_platform_configs
     )
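
For context on the diff above, each platform config appears to be a `(BLOCK_M, BLOCK_N, BLOCK_K, num_stages, num_warps)` tuple, so index 3 is the `num_stages` field being rewritten. A standalone sketch of the transformation, with made-up input tuples:

```python
# Hypothetical inputs in the assumed (BLOCK_M, BLOCK_N, BLOCK_K, num_stages, num_warps) layout.
mm_platform_configs = ((64, 64, 32, 2, 4), (128, 64, 32, 3, 8))

# Same rewrite as the diff: keep the tile sizes and warp count, force num_stages=0
# so the AMD backend software-pipelines the GEMM loop.
rocm_configs = tuple(
    (config[0], config[1], config[2], 0, config[4])
    for config in mm_platform_configs
)
assert rocm_configs == ((64, 64, 32, 0, 4), (128, 64, 32, 0, 8))
```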