[inductor] enable software pipelining on AMD devices (#125858)
Summary:
Per AMD, software pipelining is enabled by setting `num_stages=0` and should provide a nice perf boost for GEMMs. The caveat is that `num_stages=1` is preferred for back-to-back GEMMs, but `num_stages=0` is the better default.
Wait to land until the Triton upstream changes land in OSS; pipelining does not work well on the fork.
Test Plan: n/a
Reviewed By: xw285cornell, yoyoyocmu
Differential Revision: D56221447
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125858
Approved by: https://github.com/pragupta, https://github.com/yoyoyocmu
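For context, a minimal sketch of a workload that exercises the affected Triton GEMM templates; this assumes a ROCm build of PyTorch with Triton available and uses `mode="max-autotune"` so inductor autotunes over the config tuples edited in the diff below:

```python
# Sketch only: compile a plain matmul with max-autotune so inductor's Triton
# mm templates (whose (BLOCK_M, BLOCK_N, BLOCK_K, num_stages, num_warps)
# configs are changed in this diff) are benchmarked on the ROCm device.
import torch

@torch.compile(mode="max-autotune")
def gemm(a, b):
    return a @ b

if torch.cuda.is_available() and torch.version.hip:
    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    out = gemm(a, b)  # on ROCm, the candidate configs now use num_stages=0
```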
diff --git a/torch/_inductor/kernel/mm_common.py b/torch/_inductor/kernel/mm_common.py
index 5a7f60e5..26d0818 100644
--- a/torch/_inductor/kernel/mm_common.py
+++ b/torch/_inductor/kernel/mm_common.py
@@ -178,14 +178,14 @@
     if config["cond"]
 )
-# On ROCm convert num_stages to 1 as pipelining provides no benefit
+# On ROCm convert num_stages to 0 to enable software pipelining
 if torch.version.hip:
     mm_platform_configs = tuple(
-        (config[0], config[1], config[2], 1, config[4])
+        (config[0], config[1], config[2], 0, config[4])
         for config in mm_platform_configs
     )
     int8_platform_configs = tuple(
-        (config[0], config[1], config[2], 1, config[4])
+        (config[0], config[1], config[2], 0, config[4])
         for config in int8_platform_configs
     )