| # Gemm Tuner |
| |
| ## Introduction |
| |
| This is a set of tools for tuning the performance of OpenCL GEMM kernels. Specifically, we tune 3 GEMM kernels, each |
| implementing a different **strategy** for the GEMM operation: **native**, **reshaped** and **reshaped only rhs**. |
| The details of these strategies can be found in the documentation of the corresponding kernels: |
| **CLGEMMMatrixMultiplyNativeKernel**, **CLGEMMMatrixMultiplyReshapedKernel** and |
| **CLGEMMMatrixMultiplyReshapedOnlyRHSKernel**. |
| |
| The Tuner consists of 2 scripts and 3 binaries: |
| * benchmark_gemm_examples.sh and GemmTuner.py under examples/gemm_tuner, and |
| * benchmark_cl_gemm_native, benchmark_cl_gemm_reshaped_rhs_only and benchmark_cl_gemm_reshaped under |
| build/tests/gemm_tuner (you'll need to build the library first) |
| |
| The input to the Tuner is a list of 4-valued tuples, each of which we call a **GEMM shape** or **GEMMParam** (M, N, K, B, |
| and possibly a data type). Each tuple defines the "shape" and other parameters (e.g. data type) of a GEMM operation: |
| ``` |
| LHS x RHS = DST |
| ``` |
| Where LHS is of shape MxK, RHS is of shape KxN and DST is of shape MxN, and B is the batch size. |
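
To make the relationship between a GEMM shape and the matrix dimensions concrete, here is a minimal Python sketch. The `GemmShape` class and its method names are hypothetical helpers introduced only for illustration; they are not part of GemmTuner.py or the Compute Library.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GemmShape:
    """Hypothetical container for one GEMM shape <M, N, K, B>."""
    M: int      # number of LHS rows
    N: int      # number of RHS columns
    K: int      # number of LHS columns / RHS rows
    B: int = 1  # batch size

    def lhs_shape(self) -> Tuple[int, int]:
        # LHS is of shape MxK
        return (self.M, self.K)

    def rhs_shape(self) -> Tuple[int, int]:
        # RHS is of shape KxN
        return (self.K, self.N)

    def dst_shape(self) -> Tuple[int, int]:
        # DST = LHS x RHS is of shape MxN
        return (self.M, self.N)


shape = GemmShape(M=100, N=100, K=30, B=3)
print(shape.lhs_shape(), shape.rhs_shape(), shape.dst_shape())
```

The batch size B does not change the per-matrix shapes; it only states how many such multiplications are performed.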
| |
| The outputs of the tuning process are 4 json files: |
| 1. gemm_type_selection.json: selects which kernel type is the best for each GEMMParam |
| 2. gemm_config_native.json: selects a list of best **GEMMConfigs** of the native kernel for each GEMMParam |
| 3. gemm_config_reshapedonlyrhs.json: selects a list of best GEMMConfigs of the reshaped_only_rhs kernel for each GEMMParam |
| 4. gemm_config_reshaped.json: selects a list of best GEMMConfigs of the reshaped kernel for each GEMMParam |
| |
| These 4 files are the current representation of what we call the **heuristics** of a GEMM op: given a GEMMParam, |
| which kernel, and subsequently which configurations for that kernel, are the most performant. |
| |
| ## Step-by-step example |
| |
| ### Step1: Prepare the shape and configs files |
| 1. We first need to identify the shapes that we are interested in and store them in a csv file, say *gemm_shapes.csv*. |
| 2. Then we need to specify a set of good GEMMConfig candidates for each kernel in 3 separate csv files (this requires |
| some prior heuristics, but the candidates can be provided by the Compute Library developers upon request, based on your target device). |
| |
| Say we have *gemm_configs_native.csv*, *gemm_configs_reshaped.csv* and *gemm_configs_reshaped_only_rhs.csv*. |
| |
| Please refer to the Prerequisite section for more details. |
| |
| ### Step2: Push relevant files to the target device |
| All the files that need to be present on the target device are: |
| * benchmark script: \<ComputeLibrary\>/examples/gemm_tuner/benchmark_gemm_examples.sh |
| * shapes and configs csv files: gemm_shapes.csv, gemm_configs_native.csv, gemm_configs_reshaped_only_rhs.csv, gemm_configs_reshaped.csv |
| * Example benchmark binaries: \<ComputeLibrary\>/build/tests/gemm_tuner/benchmark_cl_gemm* |
| |
| ### Step3: Collect benchmark data |
| With these files on device, we can collect benchmark data using the script. Assume all the example binaries are pushed |
| to a folder called *gemm_tuner*. While logged onto our device: |
| ``` |
| # Native |
| ./benchmark_gemm_examples.sh -s native -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_native.csv -o results/native |
| # Reshaped Only RHS |
| ./benchmark_gemm_examples.sh -s reshaped_rhs_only -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped_only_rhs.csv -o results/reshaped_only_rhs |
| # Reshaped |
| ./benchmark_gemm_examples.sh -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped |
| ``` |
| You can repeat the 3 commands above to add some redundancy to your benchmark data (measurements are noisy), |
| but you may need to use a different output folder for each repeat. |
| |
| ### Step4: Generate the heuristics |
| 1. After benchmarking, we pull the benchmark data, the *results* folder, from the target device to our host machine |
| 2. We use the GemmTuner.py script to give us the heuristics |
| ``` |
| python3 <ComputeLibrary>/examples/gemm_tuner/GemmTuner.py -b ./results -o heuristics |
| ``` |
| When it's finished, there should be 4 json files in the *heuristics* folder |
| |
| Note that the config heuristics might give more than one recommendation for each GEMMParam, because |
| we accept all good GEMMConfigs within a tolerance. If you want fewer recommendations, you can decrease the tolerance by |
| passing a lower value via *-t \<tolerance\>* to the GemmTuner.py script. |
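
The effect of the tolerance can be illustrated with a short Python sketch. This is only an assumption about the general idea (keep every config whose measured time is within the tolerance of the fastest one); it is not taken from GemmTuner.py, the function name `select_within_tolerance` is hypothetical, and the timings are made-up illustrative values rather than real measurements.

```python
from typing import Dict, List, Tuple

# A GEMMConfig is represented here as a plain tuple, e.g. (m0, n0, k0).
Config = Tuple[int, ...]


def select_within_tolerance(timings_ms: Dict[Config, float],
                            tolerance_ms: float) -> List[Config]:
    """Return all configs whose time is within tolerance_ms of the best time."""
    best = min(timings_ms.values())
    return sorted(cfg for cfg, t in timings_ms.items()
                  if t - best <= tolerance_ms)


# Made-up timings (in milliseconds) for three native configs.
timings = {(4, 4, 4): 1.00, (2, 3, 8): 1.05, (1, 4, 4): 1.40}

print(select_within_tolerance(timings, 0.1))   # two configs are accepted
print(select_within_tolerance(timings, 0.01))  # only the fastest config remains
```

With the larger tolerance both `(2, 3, 8)` and `(4, 4, 4)` are treated as equally good; lowering the tolerance narrows the list to the single fastest config.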
| |
| ## Prerequisite |
| * A target device to be tuned, plus the following on the device: |
| * Android or Linux OS |
| * Bash shell |
| * Built Compute Library with benchmark examples binaries |
| * benchmark_gemm_examples.sh script |
| * gemm shape file |
| |
| A csv file containing the **GEMMParam search list**. This is the list of GEMMParams/gemm shapes that we're |
| interested in (For more details see Approach section). The default list is prepared by Compute Library developers in advance |
| and can be provided on request. |
| |
| The format is described as: |
| |
| A headerless csv file with fields separated by commas. |
| |
| A gemm shape is a list of 4 positive integers \<M, N, K, B\> describing the shapes of the two matrices (LHS and |
| RHS) with: |
| |
| M - Number of lhs matrix rows |
| N - Number of rhs matrix columns |
| K - Number of lhs matrix columns/rhs matrix rows |
| B - Batch size |
| |
| An example gemm shape file looks like: |
| ``` |
| 100,100,30,1 |
| 100,100,30,3 |
| ... |
| ``` |
| * gemm config file |
| A csv file containing the **GEMMConfig search list**. This is the list of candidate GEMMConfigs among which we |
| search for the optimal one. **Note that we have a different list for each strategy.** |
| The default lists are prepared by Compute Library developers in advance and can be provided on request. |
| |
| The format of the file for each strategy is the same: |
| |
| A headerless csv file with fields separated by commas. |
| |
| However the fields of GEMMConfig differ for each strategy: |
| |
| * Strategy **native**: |
| A gemm config is a list of 3 positive integers \<m0, n0, k0\>, with: |
| |
| m0 - Number of rows processed by the matrix multiplication |
| n0 - Number of columns processed by the matrix multiplication |
| k0 - Number of partial accumulations performed by the matrix multiplication |
| |
| Only the following configurations of M0, N0 and K0 are currently supported: |
| |
| M0 = 1, 2, 3, 4, 5, 6, 7, 8 |
| N0 = 2, 3, 4, 8, 16 |
| K0 = 2, 3, 4, 8, 16 |
| |
| An example gemm config file looks like: |
| ``` |
| 1,4,4 |
| 2,3,8 |
| ... |
| ``` |
| * Strategy **reshaped_rhs_only**: |
| A gemm config is a list of 4 positive integers \<m0, n0, k0, h0\> and 3 boolean values: |
| |
| m0 - Number of rows processed by the matrix multiplication |
| n0 - Number of columns processed by the matrix multiplication |
| k0 - Number of partial accumulations performed by the matrix multiplication |
| h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row |
| interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0) |
| transpose_rhs - Transpose rhs matrix (1) / Do not transpose rhs matrix (0) |
| export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true |
| with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel |
| for more details |
| |
| Only the following configurations of M0, N0 and K0 are currently supported: |
| |
| M0 = 1, 2, 3, 4, 5, 6, 7, 8 |
| N0 = 2, 3, 4, 8, 16 |
| K0 = 2, 3, 4, 8, 16 |
| H0 >= 1 |
| |
| An example gemm config file looks like: |
| ``` |
| 4,4,4,1,1,1,0 |
| 4,4,4,3,1,0,1 |
| ... |
| ``` |
| * Strategy **reshaped**: |
| A gemm config is a list of 5 positive integers \<m0, n0, k0, v0, h0\> and 4 boolean values: |
| |
| m0 - Number of rows processed by the matrix multiplication |
| n0 - Number of columns processed by the matrix multiplication |
| k0 - Number of partial accumulations performed by the matrix multiplication |
| v0 - Number of vertical blocks of size (m0xk0) stored on the same output row |
| h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row |
| interleave_lhs - Interleave lhs matrix (1) / Do not interleave lhs matrix (0) |
| interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0) |
| transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do transpose lhs matrix (0) |
| export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true |
| with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel |
| for more details |
| |
| If rhs matrix is transposed only the following configurations are currently supported: |
| |
| M0 = 2, 3, 4, 5, 6, 7, 8 |
| N0 = 2, 3, 4, 8, 16 |
| K0 = 2, 3, 4, 8, 16 |
| V0 >= 1 |
| H0 >= 1 |
| |
| If lhs matrix is transposed only the following configurations are currently supported: |
| |
| M0 = 2, 3, 4, 8 |
| N0 = 2, 3, 4, 8, 16 |
| K0 = 2, 3, 4, 8, 16 |
| V0 >= 1 |
| H0 >= 1 |
| |
| An example gemm config file looks like: |
| ``` |
| 4,4,4,1,3,1,1,1,0 |
| 4,4,4,3,3,1,1,0,1 |
| ... |
| ``` |
| * A host machine, plus these on the machine: |
| * python >= 3.6 |
| * GemmTuner.py script |
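
Before pushing the csv files to the device, it can be useful to check them against the formats and supported values listed in this section. The following Python sketch does that on the host machine. All function and constant names are hypothetical and not part of the tuner scripts; the allowed values are copied from the lists above, and the extra constraints on `export_to_cl_image_rhs` are not checked.

```python
import csv
import io
from typing import List

# Supported block sizes, copied from the lists in this section.
M0_1_TO_8 = set(range(1, 9))  # native and reshaped_rhs_only
M0_2_TO_8 = set(range(2, 9))  # reshaped, rhs transposed
M0_LHS_T = {2, 3, 4, 8}       # reshaped, lhs transposed
N0_K0 = {2, 3, 4, 8, 16}


def parse_rows(text: str, n_fields: int) -> List[List[int]]:
    """Parse a headerless, comma-separated file into rows of integers."""
    rows = []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:  # skip blank lines
            continue
        if len(row) != n_fields:
            raise ValueError(f"line {line_no}: expected {n_fields} fields")
        rows.append([int(field) for field in row])
    return rows


def is_valid_shape(row: List[int]) -> bool:
    # <M, N, K, B>: 4 positive integers
    return len(row) == 4 and all(v > 0 for v in row)


def is_valid_native(row: List[int]) -> bool:
    m0, n0, k0 = row
    return m0 in M0_1_TO_8 and n0 in N0_K0 and k0 in N0_K0


def is_valid_reshaped_rhs_only(row: List[int]) -> bool:
    m0, n0, k0, h0, *flags = row
    return (m0 in M0_1_TO_8 and n0 in N0_K0 and k0 in N0_K0 and h0 >= 1
            and len(flags) == 3 and all(f in (0, 1) for f in flags))


def is_valid_reshaped(row: List[int]) -> bool:
    m0, n0, k0, v0, h0, *flags = row
    if len(flags) != 4 or any(f not in (0, 1) for f in flags):
        return False
    transpose_rhs = flags[2]
    allowed_m0 = M0_2_TO_8 if transpose_rhs else M0_LHS_T
    return (m0 in allowed_m0 and n0 in N0_K0 and k0 in N0_K0
            and v0 >= 1 and h0 >= 1)


shapes = parse_rows("100,100,30,1\n100,100,30,3\n", n_fields=4)
print(all(is_valid_shape(r) for r in shapes))
configs = parse_rows("4,4,4,1,3,1,1,1,0\n4,4,4,3,3,1,1,0,1\n", n_fields=9)
print(all(is_valid_reshaped(r) for r in configs))
```

The same `parse_rows` helper can be reused for the native (3 fields) and reshaped_rhs_only (7 fields) config files together with the matching validator.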
| |
| ## Usage |
| The usage of the 2 scripts: |
| |
| 1. benchmark_gemm_examples.sh |
| |
| Run the shell script (**benchmark_gemm_examples.sh**) on your **target device**. Note that all the built benchmark |
| examples: build/tests/gemm_tuner/benchmark_cl_gemm*, have to be present on your target device prior to running. |
| The benchmark results will be saved to json files in an output directory. |
| ``` |
| Usage: benchmark_gemm_examples.sh [-h] -s <strategy> -e <example_binary_dir> -g <gemm_shape_file> |
| -c <gemm_config_file> [-d <data_type>] [-o <out_dir>] |
| |
| Options: |
| -h |
| Print help messages. If a strategy is specified with -s <strategy>, then only display messages relevant to that |
| strategy. Otherwise if no strategy is specified, display messages for all available strategies. |
| |
| -s <strategy> |
| Strategy option. |
| Options: ${ALL_STRATEGY_OPTIONS[@]}. |
| |
| -e <example_binary_dir> |
| Path to directory that holds all example binaries |
| |
| -g <gemm_shape_file> |
| Path to gemm shape csv file |
| |
| -c <gemm_config_file> |
| Path to gemm config csv file |
| |
| -d <data_type> |
| Data type option with which to run benchmark examples |
| Default: ${DEFAULT_DATA_TYPE} |
| Supported options: |
| Strategy : Data Types |
| Native : F32 |
| Reshaped : F16, F32 |
| Reshaped RHS Only : F16, F32 |
| |
| -o <out_dir> |
| Path to output directory that holds output json files |
| Default: ${DEFAULT_OUT_DIR} |
| ``` |
| 2. GemmTuner.py: |
| |
| Run the python script (**GemmTuner.py**) on your **host machine**. |
| You'll need to transfer all the benchmark result json files generated from the previous step to your host machine |
| beforehand. The script will write the best kernel and gemm configurations for each gemm param to the 4 output json files. |
| ``` |
| Usage: GemmTuner.py [-h] -b PATH [-o PATH] [-t TOLERANCE] [-D] |
| |
| CL GEMM Tuner |
| optional arguments: |
| -h, --help show this help message and exit |
| -b PATH, --benchmark_results PATH |
| Path to benchmark result directory, where benchmark |
| result json files have a file extension of |
| 'gemmtuner_benchmark' |
| -o PATH, --output_dir PATH |
| Path to directory that holds output json files. |
| -t TOLERANCE, --tolerance TOLERANCE |
| For testing if two GEMMConfigs are equivalent in terms |
| of performance. The tolerance is OpenCL timer in |
| milliseconds. Recommended value: <= 0.1 ms |
| -D, --debug Enable script debugging output |
| |
| ``` |