[inductor] fix dynamic size array(vla) build error on msvc v4 (#134221)
MSVC don't support dynamic array.
Ref: https://stackoverflow.com/questions/56555406/creating-dynamic-sized-array-using-msvc-c-compiler
We tried to solutions:
1. use std::vector to instead of it in previous PR: https://github.com/pytorch/pytorch/pull/134140, but it changed variable's type and failed at UTs.
2. Use `std::unique_ptr` to instead of it in PR: https://github.com/pytorch/pytorch/pull/134156, @jansel reviewed and give comments: https://github.com/pytorch/pytorch/pull/134156#pullrequestreview-2253091693. It is make sense, allocation memory maybe make code run slower.
3. Use fixed size array to instead of it in PR: https://github.com/pytorch/pytorch/pull/134210, fixed size is hard to process the situlation, reserved size if small than CPU number.
> a. Use min() function limited is local test failed: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304447729
> b. Dynamic select fixed size or dynamic array: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304128666 . It makes code too complex to maintains.
Discussed with origin PR(https://github.com/pytorch/pytorch/pull/115620) author @zhuhaozhe, we think:
1. MSVC it the only one compiler, which not support VLA.
2. MSVC it worse performance than other compilers, use `std::unique_ptr` for MSVC and make it works.
3. For other compilers, keep using current `VLA` code.
4. For Windows users, they can use `clang-cl` or `icx` to get better performance than MSVC.
5. Discussed with @jansel , we need to move compiler check to python side, and make output code cleaner.
Reproduce UT:
```cmd
pytest test/inductor/test_cpu_repro.py -v -k test_reduction_with_dynamic_threads
```
Error msg:
```cmd
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): error C2131: expression did not evaluate to a constant
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: failure was caused by a read of a variable outside its lifetime
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: see usage of 'max_threads'
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(16): error C3863: array type 'float [max_threads]' is not assignable
```
Genarated code:
```c++
#include "C:/Users/Xuhan/AppData/Local/Temp/tmpt6mxcjzi/j2/cj22tgrdgh42wbunl7gdptg2lintcziox2kmr7rdbcc6n2njrhgx.h"
extern "C" __declspec(dllexport) void kernel(const float* in_ptr0,
const float* in_ptr1,
float* out_ptr0,
float* out_ptr1)
{
{
{
float tmp_acc0 = 0;
at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
int max_threads = omp_get_max_threads();
float tmp_acc0_arr[max_threads];
for (int tid = 0; tid < max_threads; tid++)
{
tmp_acc0_arr[tid] = 0;
}
at::vec::Vectorized<float> tmp_acc0_vec_arr[max_threads];
for (int tid = 0; tid < max_threads; tid++)
{
tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
}
#pragma omp parallel
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134221
Approved by: https://github.com/zhuhaozhe, https://github.com/jansel
diff --git a/torch/_inductor/codegen/cpp.py b/torch/_inductor/codegen/cpp.py
index 49bb082..c93ccac 100644
--- a/torch/_inductor/codegen/cpp.py
+++ b/torch/_inductor/codegen/cpp.py
@@ -1578,12 +1578,26 @@
num_threads = (
"max_threads" if config.cpp.dynamic_threads else parallel_num_threads()
)
- acc_per_thread = f"{acc}_arr[{num_threads}]"
+ acc_per_thread_var_name = f"{acc}_arr"
+ acc_per_thread = f"{acc_per_thread_var_name}[{num_threads}]"
+ """
+ MSVC don't support dynamic array(VLA). Please use std::unique_ptr to instead of it.
+ Ref: https://stackoverflow.com/questions/56555406/creating-dynamic-sized-array-using-msvc-c-compiler
+ MSVC is the only one compiler, which not support VLA. And MSVC can't get good inductor performance.
+ So, we can use unique_ptr make it works on MSVC.
+ For other compilers, we continue to use VLA to get best performence.
+ """
+ acc_per_thread_unique_ptr_decl = f"auto {acc_per_thread_var_name} = std::make_unique<{acc_type}[]>({num_threads})"
+ acc_per_thread_vla_decl = f"{acc_per_thread_var_name}[{num_threads}]"
acc_local_in_array = acc_per_thread.replace(f"[{num_threads}]", "[tid]")
self.local_reduction_init.writeline(
f"{acc_type} {acc_local} = {reduction_init_fn(reduction_type, dtype)};"
)
- self.parallel_reduction_prefix.writeline(f"{acc_type} {acc_per_thread};")
+ self.parallel_reduction_prefix.writeline(
+ f"{acc_per_thread_unique_ptr_decl};"
+ if cpp_builder.is_msvc_cl()
+ else f"{acc_type} {acc_per_thread_vla_decl};"
+ )
self.parallel_reduction_prefix.writelines(
[
f"for (int tid = 0; tid < {num_threads}; tid++)",