| ## Inference benchmarks |
| |
This folder contains a work-in-progress simulation of a Python inference server.
| |
The v0 version of this has a backend worker that is a single process. It loads a
ResNet-18 checkpoint onto 'cuda:0' and compiles the model. It accepts requests in
the form of `(tensor, request_time)` from a `multiprocessing.Queue`, runs
inference on the request, and returns `(output, request_time)` on a separate
response `multiprocessing.Queue`.
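
A minimal sketch of that backend loop, assuming a state-dict checkpoint named
`resnet18.pt` and a `None` shutdown sentinel (both are illustrative, not the
actual protocol):

```
import torch
import torchvision

def backend_worker(request_queue, response_queue, model_dir="."):
    # Load the ResNet-18 checkpoint onto the GPU; the filename is a placeholder.
    model = torchvision.models.resnet18()
    model.load_state_dict(torch.load(f"{model_dir}/resnet18.pt"))
    model = model.to("cuda:0").eval()
    model.compile()  # torch.compile() the model in place

    while True:
        item = request_queue.get()
        if item is None:  # assumed shutdown sentinel
            break
        tensor, request_time = item
        with torch.no_grad():
            output = model(tensor.to("cuda:0"))
        # Move the output back to CPU before crossing the process boundary.
        response_queue.put((output.cpu(), request_time))
```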
| |
The frontend worker is a process with three threads, sketched after this list:
1. A thread that generates fake data of a given batch size in the form of CPU
   tensors and puts it into the request queue.
2. A thread that reads responses from the response queue and collects latency
   metrics: the latency of the first response (which corresponds to the cold
   start time), the average, minimum, and maximum response latency, and the
   throughput.
3. A thread that polls `nvidia-smi` for GPU utilization metrics.
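
A condensed sketch of the three threads; the function names, input shape, and
polling interval are illustrative assumptions, not the actual implementation:

```
import subprocess
import time

import torch

def generate_requests(request_queue, num_iters, batch_size):
    # Thread 1: push fake CPU batches tagged with the time they were sent.
    for _ in range(num_iters):
        batch = torch.randn(batch_size, 3, 224, 224)  # assumed ResNet-18 input shape
        request_queue.put((batch, time.time()))

def collect_responses(response_queue, num_iters):
    # Thread 2: measure per-request latency; the first response is the cold start.
    latencies = []
    for _ in range(num_iters):
        _, request_time = response_queue.get()
        latencies.append(time.time() - request_time)
    print(f"cold start: {latencies[0]:.3f}s avg: {sum(latencies) / len(latencies):.3f}s "
          f"min: {min(latencies):.3f}s max: {max(latencies):.3f}s")

def poll_gpu_utilization(stop_event, samples):
    # Thread 3: sample GPU utilization once a second; assumes a single GPU.
    while not stop_event.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        )
        samples.append(float(out.stdout.strip()))
        time.sleep(1.0)
```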
| |
| For now we omit data preprocessing as well as result post-processing. |
| |
| ### Running a single benchmark |
| |
The configurable command-line arguments to the script are as follows (a sketch
of how these flags might be declared appears after the list):
- `num_iters` (default: 100): how many requests to send to the backend,
  excluding the first warmup request.
- `batch_size` (default: 32): the batch size of the requests.
- `model_dir` (default: '.'): the directory to load the checkpoint from.
- `compile` (default: on): use `--no-compile` to skip calling `torch.compile()`
  on the model.
- `output_file` (default: output.csv): the name of the csv file to write the outputs to in the `results/` directory.
- `num_workers` (default: 2): the `max_workers` value passed to the `ThreadPoolExecutor` in charge of model prediction.
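
A hedged sketch of how these flags might be declared with `argparse` (the
defaults mirror the list above; the actual script may differ):

```
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--num_iters", type=int, default=100)
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--model_dir", type=str, default=".")
# BooleanOptionalAction (Python 3.9+) provides the --compile/--no-compile pair.
parser.add_argument("--compile", default=True, action=argparse.BooleanOptionalAction)
parser.add_argument("--output_file", type=str, default="output.csv")
parser.add_argument("--num_workers", type=int, default=2)
args = parser.parse_args()
```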
| |
A sample command to run the benchmark:
| |
| ``` |
| python -W ignore server.py --num_iters 1000 --batch_size 32 |
| ``` |
| |
The results will be written to `results/output.csv`; if the file already exists, new results are appended.
| |
Note that the `m.compile()` time in the csv file is not the time it takes to
compile the model (compilation happens during the first iteration), but rather
the time for PT2 components to be lazily imported (e.g. triton).
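
To make the distinction concrete, here is a sketch that times the `m.compile()`
call separately from the first forward pass; the model and input shape are
illustrative:

```
import time

import torch
import torchvision

model = torchvision.models.resnet18().to("cuda:0").eval()
x = torch.randn(32, 3, 224, 224, device="cuda:0")

t0 = time.perf_counter()
model.compile()      # returns quickly: mostly lazy PT2 imports (e.g. triton)
t1 = time.perf_counter()
with torch.no_grad():
    model(x)         # the actual compilation happens on this first call
torch.cuda.synchronize()
t2 = time.perf_counter()

print(f"m.compile() call: {t1 - t0:.3f}s, first iteration: {t2 - t1:.3f}s")
```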
| |
| ### Running a sweep |
| |
The script `runner.sh` runs a sweep of the benchmark over different batch
sizes, with compile on and off, and collects the mean and standard deviation of
the warmup latency, average latency, throughput, and GPU utilization for each
configuration. The `results/` directory will contain the metrics from running a
sweep as we develop this benchmark, where `results/output_{batch_size}_{compile}.md`
will contain the mean and standard deviation of the results for a given batch
size and compile setting. If the file already exists, the metrics from the run
are appended as a new row in the markdown table.
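
As a rough illustration of what such a sweep does (the batch sizes, iteration
count, and output-file naming below are assumptions, not necessarily what
`runner.sh` uses), the same grid could be driven from Python:

```
import subprocess

# Assumed sweep grid; runner.sh may use different values.
for batch_size in (1, 32, 64, 128, 256):
    for compile_flag in ("--compile", "--no-compile"):
        compiled = compile_flag == "--compile"
        subprocess.run(
            [
                "python", "-W", "ignore", "server.py",
                "--num_iters", "100",
                "--batch_size", str(batch_size),
                compile_flag,
                # Hypothetical per-configuration output file name.
                "--output_file", f"output_{batch_size}_{compiled}.csv",
            ],
            check=True,
        )
```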