Inductor cpp wrapper: add -ffast-math in linking flag (#104332)

Fix cpp wrapper CPU performance gap on `swsl_resnext101_32x16d` compared with the default python wrapper.

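The pre-trained weights of `swsl_resnext101_32x16d` contain denormal numbers (nonzero values very close to 0.0), and floating-point arithmetic on denormals is much slower on the CPU than on normal values.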
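
For reference, a denormal (subnormal) float32 is a nonzero value whose magnitude is below the smallest normal float32, `2**-126` (about 1.18e-38). The snippet below is a minimal sketch, not part of this PR, of how one might count such values. The helper names and the sample values are hypothetical and only serve as illustration.

```python
import struct

# Smallest positive normal float32 value (2**-126, about 1.18e-38).
F32_MIN_NORMAL = 2.0 ** -126


def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def is_subnormal_f32(x: float) -> bool:
    """True if x, stored as float32, is nonzero but below the normal range."""
    v = abs(to_f32(x))
    return 0.0 < v < F32_MIN_NORMAL


def count_subnormal_f32(values) -> int:
    """Count the float32-subnormal entries in an iterable of floats."""
    return sum(1 for v in values if is_subnormal_f32(v))


# Made-up sample "weights"; 1e-40 and -3e-42 are subnormal in float32.
sample = [0.5, 0.0, 1e-40, -3e-42, 1e-30]
print(count_subnormal_f32(sample))
```

A check of this kind on a model's parameters is one way to confirm whether denormals are a plausible cause of a slowdown before changing build flags.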

Linking with `-ffast-math` makes the CPU flush denormals to zero.
The default python wrapper compiles and links in a single command, so `-ffast-math` takes effect in both compilation and linking.
The cpp wrapper uses `cpp_extension`, which compiles and links in two separate stages, so `-ffast-math` must be added explicitly as a linking flag.
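
The difference between the two build paths can be sketched as follows. This is an illustration only: the helper names, the file names, and the `-O3` flag are hypothetical, and the real commands are generated by Inductor's `codecache.py` and by `torch.utils.cpp_extension`. The point is that in a two-stage build, a flag passed only at compile time never reaches the link command.

```python
def one_step_command(src, out, flags):
    """Compile and link in one invocation: every flag reaches both phases."""
    return [["g++", *flags, "-shared", "-fPIC", src, "-o", out]]


def two_step_commands(src, out, cflags, ldflags):
    """Compile to an object file, then link it in a separate invocation."""
    obj = src.rsplit(".", 1)[0] + ".o"
    compile_cmd = ["g++", *cflags, "-fPIC", "-c", src, "-o", obj]
    link_cmd = ["g++", obj, "-shared", *ldflags, "-o", out]
    return [compile_cmd, link_cmd]


flags = ["-O3", "-ffast-math"]

# Python-wrapper style: one command, so -ffast-math is seen when linking.
single = one_step_command("kernel.cpp", "kernel.so", flags)

# Two-stage style before the fix: -ffast-math appears only in the compile step.
before = two_step_commands("kernel.cpp", "kernel.so", flags, ldflags=[])

# After the fix: -ffast-math is also passed explicitly to the link step.
after = two_step_commands("kernel.cpp", "kernel.so", flags, ldflags=["-ffast-math"])

for name, cmds in [("single", single), ("before", before), ("after", after)]:
    print(name, "->", " && ".join(" ".join(c) for c in cmds))
```

With GCC on x86, it is typically the link-time flag that pulls in the startup code enabling flush-to-zero, which is presumably why the flag has to appear in the link command and not only in the compile command.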

Single-thread, single-batch results on ICX:
model | time (s), default python wrapper | time (s), cpp wrapper before fix | time (s), cpp wrapper after fix
-- | -- | -- | --
swsl_resnext101_32x16d | 0.459097836 | 13.82326214 | 0.448116195

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104332
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/EikanWang
diff --git a/torch/_inductor/codecache.py b/torch/_inductor/codecache.py
index db6748d..320228b 100644
--- a/torch/_inductor/codecache.py
+++ b/torch/_inductor/codecache.py
@@ -925,7 +925,12 @@
                     _use_custom_generated_macros = use_custom_generated_macros()
 
                     extra_cflags = f"{_cpp_flags} {_opt_flags} {_warning_all_flag} {_macros} {_use_custom_generated_macros}"
-                    extra_ldflags = f"{_shared} {_lpaths} {_libs}"
+                    # For CPP wrapper, add -ffast-math during linking to make CPU flush denormals.
+                    # CPP wrapper leverages cpp_extension which will do the compilation and linking in two stages.
+                    # We need to explicitly add -ffast-math as a linking flag.
+                    # For the default python wrapper, the compilation and linking are done in one command thus -ffast-math
+                    # will take effect in both compilation and linking.
+                    extra_ldflags = f"{_shared} {_lpaths} {_libs} -ffast-math"
                     extra_include_paths = f"{_ipaths}"
 
                     mod = torch.utils.cpp_extension.load_inline(