Parallelize the quantization conversion operators (#45536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45536
Quantization conversion and reverse-conversion operators will be used in a critical serving path.
These operators can use `at::parallel_for` to parallelize the row-wise quantization of tensors.
Overall, I see a 20-25% improvement from the parallelization optimization added here.
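The change replaces each whole-tensor FBGEMM call with `at::parallel_for` over the rows, where every chunk invokes the existing single-row FBGEMM kernel. A minimal sketch of that pattern for the byte prepack case is below (the `byte_prepack_rowwise` wrapper and the header paths are illustrative, not the exact code; see the diff at the end of this summary for the real change):
```
#include <cstdint>

#include <ATen/Parallel.h>
#include <fbgemm/QuantUtils.h>  // header path assumed; declares the FBGEMM row-wise kernels

// Quantize each row of a float matrix into the fused 8-bit row-wise format,
// splitting the rows across threads. Each chunk writes only its own rows of
// the output, so no synchronization is needed.
void byte_prepack_rowwise(
    const float* weight_data,
    std::uint8_t* output_data,
    int64_t embedding_rows,
    int64_t embedding_cols,
    int64_t output_columns) {
  at::parallel_for(
      0, embedding_rows, /*grain_size=*/1,
      [&](int64_t start_idx, int64_t end_idx) {
        for (int64_t row = start_idx; row < end_idx; ++row) {
          // The existing FBGEMM kernel is invoked one row at a time.
          fbgemm::FloatToFused8BitRowwiseQuantizedSBFloat(
              weight_data + row * embedding_cols,
              /*input_rows=*/1,
              embedding_cols,
              output_data + row * output_columns);
        }
      });
}
```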
The following results are from running the benchmark on my `devvm`. I have requested a dedicated machine and will post benchmark results again.
Easier view to compare results: https://our.intern.facebook.com/intern/diffing/?paste_number=143973933
Baseline results are based on D23675777 (https://github.com/pytorch/pytorch/commit/677a59dcaa72fbc91abfe01731a41e0849e81154)
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 10.782
# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 17.443
# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 25.898
# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 13.903
# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 18.575
# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 30.650
# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 14.158
# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 19.818
# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 30.852
# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 47.596
# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 91.025
# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 131.425
# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 12.637
# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 20.856
# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 33.944
# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 21.181
# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 34.213
# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 59.622
```
Results with parallelization. For example, `qembeddingbag_byte_prepack` at embedding_dim 512 drops from 25.898 us to 20.120 us, a ~22% reduction in execution time.
```
# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 8.852
# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 13.594
# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 20.120
# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 12.049
# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 20.710
# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 23.320
# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 11.998
# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 15.972
# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 23.619
# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 30.764
# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 50.969
# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 129.960
# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 10.797
# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 15.767
# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 27.032
# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 16.521
# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 26.050
# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 45.231
```
Test Plan:
1. `buck test //caffe2/test:quantization -- 'test_embedding_bag*' --print-passing-details`
2. Ran benchmarks with `buck build mode/opt caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test; ./buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/qembedding_pack_test.par`
Reviewed By: qizzzh
Differential Revision: D24002456
fbshipit-source-id: 23b9b071b2ce944704b2582be40d0aaaaeceb298
diff --git a/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp b/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp
index bf17fe1..0305f0b 100644
--- a/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp
+++ b/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp
@@ -142,8 +142,14 @@
auto* output_data = output.data_ptr<uint8_t>();
#ifdef USE_FBGEMM
- fbgemm::FloatToFused8BitRowwiseQuantizedSBFloat(
- weight_data, embedding_rows, embedding_cols, output_data);
+ at::parallel_for(
+ 0, embedding_rows, 1, [&](int32_t start_idx, int32_t end_idx) {
+ for (int64_t row = start_idx; row < end_idx; ++row) {
+ fbgemm::FloatToFused8BitRowwiseQuantizedSBFloat(
+ weight_data + row * embedding_cols, 1,
+ embedding_cols, output_data + row * output_shape[1]);
+ }
+ });
#else
size_t output_columns = output_shape[1];
constexpr float kEpsilon = 1e-8f;
@@ -213,8 +219,14 @@
#ifdef USE_FBGEMM
if (!optimized_qparams) {
- fbgemm::FloatToFusedNBitRowwiseQuantizedSBHalf(
- bit_width, weight_data, embedding_rows, embedding_cols, output_data);
+ at::parallel_for(
+ 0, embedding_rows, 1, [&](int32_t start_idx, int32_t end_idx) {
+ for (int64_t row = start_idx; row < end_idx; ++row) {
+ fbgemm::FloatToFusedNBitRowwiseQuantizedSBHalf(
+ bit_width, weight_data + row * embedding_cols, 1,
+ embedding_cols, output_data + row * output_shape[1]);
+ }
+ });
} else {
#endif // USE_FBGEMM
const auto output_columns = output.size(output.dim() - 1);
diff --git a/aten/src/ATen/native/quantized/cpu/qembeddingbag_unpack.cpp b/aten/src/ATen/native/quantized/cpu/qembeddingbag_unpack.cpp
index 86c66b6..542d166 100644
--- a/aten/src/ATen/native/quantized/cpu/qembeddingbag_unpack.cpp
+++ b/aten/src/ATen/native/quantized/cpu/qembeddingbag_unpack.cpp
@@ -106,8 +106,16 @@
float* output_data = output.data_ptr<float>();
#ifdef USE_FBGEMM
- fbgemm::Fused8BitRowwiseQuantizedSBFloatToFloat(
- input, input_rows, input_columns, output_data);
+ at::parallel_for(
+ 0, input_rows, 1, [&](int32_t start_idx, int32_t end_idx) {
+ for (int64_t row = start_idx; row < end_idx; ++row) {
+ fbgemm::Fused8BitRowwiseQuantizedSBFloatToFloat(
+ input + row * input_columns,
+ 1,
+ input_columns,
+ output_data + row * output_columns);
+ }
+ });
#else
for (std::size_t row = 0; row < input_rows; ++row) {
const std::uint8_t* input_row = input + row * input_columns;
@@ -145,8 +153,16 @@
packed_weight.suggest_memory_format());
float* output_data = output.data_ptr<float>();
#ifdef USE_FBGEMM
- fbgemm::FusedNBitRowwiseQuantizedSBHalfToFloat(
- BIT_RATE, input_data, input_rows, input_columns, output_data);
+ at::parallel_for(
+ 0, input_rows, 1, [&](int32_t start_idx, int32_t end_idx) {
+ for (int64_t row = start_idx; row < end_idx; ++row) {
+ fbgemm::FusedNBitRowwiseQuantizedSBHalfToFloat(BIT_RATE,
+ input_data + row * input_columns,
+ 1,
+ input_columns,
+ output_data + row * output_dimensions[1]);
+ }
+ });
#else
auto output_columns = output_dimensions[1];
for (size_t row = 0; row < input_rows; ++row) {