| # `torch.compile()` Benchmarking |
| |
| This directory contains benchmarking code for TorchDynamo and many |
backends, including TorchInductor. It includes three main benchmark suites:
| |
| - [TorchBenchmark](https://github.com/pytorch/benchmark): A diverse set of models, initially seeded from |
| highly cited research models as ranked by [Papers With Code](https://paperswithcode.com). See [torchbench |
| installation](https://github.com/pytorch/benchmark#installation) and `torchbench.py` for the low-level runner. |
[Makefile](Makefile) also contains the commands needed to set up TorchBenchmark to match the versions used in
| PyTorch CI. |
| |
- Models from [HuggingFace](https://github.com/huggingface/transformers): Primarily transformer models, with a
representative model chosen for each available category. The low-level runner (`huggingface.py`) automatically
| downloads and installs the needed dependencies on first run. |
| |
- Models from [TIMM](https://github.com/huggingface/pytorch-image-models): Primarily vision models, with a
representative model chosen for each available category. The low-level runner (`timm_models.py`) automatically downloads and
| installs the needed dependencies on first run. |
| |
| |
| ## GPU Performance Dashboard |
| |
| Daily results from the benchmarks here are available in the [TorchInductor |
| Performance Dashboard](https://hud.pytorch.org/benchmark/compilers), |
| currently run on an NVIDIA A100 GPU. |
| |
| The [inductor-perf-test-nightly.yml](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly.yml) |
workflow generates the data in the performance dashboard. If you have the needed permissions, you can benchmark
your own branch on the PyTorch GitHub repo as follows (an equivalent GitHub CLI command is sketched after the steps):
| 1) Select "Run workflow" in the top right of the [workflow](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly.yml) |
2) Select the branch you want to benchmark
| 3) Choose the options (such as training vs inference) |
| 4) Click "Run workflow" |
| 5) Wait for the job to complete (4 to 12 hours depending on backlog) |
| 6) Go to the [dashboard](https://hud.pytorch.org/benchmark/compilers) |
| 7) Select your branch and commit at the top of the dashboard |
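
If you prefer the command line over the web UI, the GitHub CLI can trigger the same workflow dispatch. This is a sketch assuming you have `gh` installed and the required permissions; the workflow's input names (training vs. inference, etc.) are not listed here and would be passed with `-f` once known:
```
# Trigger the nightly inductor perf workflow on your branch (requires permissions)
gh workflow run inductor-perf-test-nightly.yml --repo pytorch/pytorch --ref <your-branch>
```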
| |
The dashboard compares two commits: a "Base Commit" and a "New Commit".
| An entry such as `2.38x → 2.41x` means that the performance improved |
| from `2.38x` in the base to `2.41x` in the new commit. All performance |
| results are normalized to eager mode PyTorch (`1x`), and higher is better. |
| |
| |
| ## CPU Performance Dashboard |
| |
| The [TorchInductor CPU Performance |
| Dashboard](https://github.com/pytorch/pytorch/issues/93531) is tracked |
| on a GitHub issue and updated periodically. |
| |
| ## Running Locally |
| |
| Raw commands used to generate the data for |
| the performance dashboards can be found |
| [here](https://github.com/pytorch/pytorch/blob/641ec2115f300a3e3b39c75f6a32ee3f64afcf30/.ci/pytorch/test.sh#L343-L418). |
| |
To summarize, there are three scripts, one to run each set of benchmarks:
| - `./benchmarks/dynamo/torchbench.py ...` |
| - `./benchmarks/dynamo/huggingface.py ...` |
| - `./benchmarks/dynamo/timm_models.py ...` |
| |
Each of these scripts takes the same set of arguments. The ones used by the dashboards are listed below; an example combining several of them follows the list.
- `--accuracy` or `--performance`: selects between checking correctness and measuring speedup (both are run for the dashboard).
- `--training` or `--inference`: selects between measuring training or inference (both are run for the dashboard).
- `--device=cuda` or `--device=cpu`: selects the device to measure.
- `--amp`, `--bfloat16`, `--float16`, `--float32`: selects the precision to use; the dashboard uses `--amp` for training and `--bfloat16` for inference.
- `--cold-start-latency`: disables caching to accurately measure compile times.
- `--backend=inductor`: selects TorchInductor as the compiler backend to measure. Many more are available; see `--help`.
- `--output=<filename>.csv`: where to write results.
- `--dynamic-shapes --dynamic-batch-only`: used when the `dynamic` config is enabled.
- `--disable-cudagraphs`: used by configurations that do not enable CUDA graphs (the default).
- `--freezing`: enables additional inference-only optimizations.
- `--cpp-wrapper`: enables C++ wrapper code to lower overheads.
- `TORCHINDUCTOR_MAX_AUTOTUNE=1` (environment variable): used to measure max-autotune mode, which is run weekly due to longer compile times.
- `--export-aot-inductor`: benchmarks ahead-of-time compilation mode.
- `--total-partitions` and `--partition-id`: used to parallelize benchmarking across multiple machines.
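
As an illustration of how these flags combine, the following commands sketch an accuracy run and a max-autotune performance run. The flag combinations and output filenames here are illustrative, not the dashboard's exact configuration:
```
# Illustrative accuracy check for CPU inference with freezing enabled
./benchmarks/dynamo/huggingface.py --accuracy --inference --bfloat16 --device=cpu --freezing --backend=inductor --output=huggingface_accuracy.csv

# Illustrative max-autotune performance run (enabled via the environment variable)
TORCHINDUCTOR_MAX_AUTOTUNE=1 ./benchmarks/dynamo/timm_models.py --performance --inference --bfloat16 --backend=inductor --cold-start-latency --output=timm_max_autotune.csv
```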
| |
For debugging, you can run just a single benchmark by adding the `--only=<NAME>` flag.
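
For example, a single-model performance run might look like the following (the model name is illustrative; use any name from the chosen suite):
```
# resnet50 is an illustrative model name from the TorchBench suite
./benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend=inductor --only=resnet50
```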
| |
| A complete list of options can be seen by running each of the runners with the `--help` flag. |
| |
As an example, the commands to run the first line of the dashboard (performance only) would be:
| ``` |
| ./benchmarks/dynamo/torchbench.py --performance --training --amp --backend=inductor --output=torchbench_training.csv |
| ./benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend=inductor --output=torchbench_inference.csv |
| |
| ./benchmarks/dynamo/huggingface.py --performance --training --amp --backend=inductor --output=huggingface_training.csv |
| ./benchmarks/dynamo/huggingface.py --performance --inference --bfloat16 --backend=inductor --output=huggingface_inference.csv |
| |
| ./benchmarks/dynamo/timm_models.py --performance --training --amp --backend=inductor --output=timm_models_training.csv |
| ./benchmarks/dynamo/timm_models.py --performance --inference --bfloat16 --backend=inductor --output=timm_models_inference.csv |
| ``` |