Getting Started
===============

Let’s start with a simple example. Note that the newer your GPU is, the
more significant a speedup you are likely to see.

The tutorial below covers inference; for a training-specific tutorial, check
out this `example on training <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`__.

.. code:: python

    import torch

    def fn(x, y):
        a = torch.cos(x)
        b = torch.sin(y)
        return a + b

    # Compile fn with the inductor backend; the first call triggers compilation
    new_fn = torch.compile(fn, backend="inductor")
    input_tensor = torch.randn(10000).to(device="cuda:0")
    a = new_fn(input_tensor, input_tensor)

This example will not actually run faster. Its purpose is to demonstrate
``torch.cos()`` and ``torch.sin()``, which are examples of pointwise ops:
they operate element by element on a tensor. A more famous pointwise op you
might want to use would be something like ``torch.relu()``. Pointwise ops in
eager mode are suboptimal because each one needs to read a tensor from
memory, make some changes, and then write back those changes. The single
most important optimization that inductor performs is fusion. In our
example we can turn 2 reads and 2 writes into 1 read and 1 write, which
is crucial especially for newer GPUs where the bottleneck is memory
bandwidth (how quickly a GPU can read and write data from memory) rather than
compute (how quickly a GPU can crunch floating point operations).
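
If you would like to measure the effect on your own hardware, a minimal
timing sketch along the following lines can help. This is an illustrative
sketch rather than part of the tutorial proper: it assumes a CUDA-capable
machine and uses ``torch.utils.benchmark``; exact numbers will vary by GPU.

.. code-block:: python

    import torch
    import torch.utils.benchmark as benchmark

    def fn(x, y):
        return torch.cos(x) + torch.sin(y)

    opt_fn = torch.compile(fn, backend="inductor")
    x = torch.randn(1_000_000, device="cuda")
    opt_fn(x, x)  # warm up so compile time is excluded from the measurement

    for label, f in [("eager", fn), ("inductor", opt_fn)]:
        t = benchmark.Timer(stmt="f(x, x)", globals={"f": f, "x": x})
        print(label, t.timeit(100))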

Another major optimization that inductor makes available is automatic
support for CUDA graphs. CUDA graphs help eliminate the overhead of
launching individual kernels from a Python program, which is especially
relevant for newer GPUs.
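
As a hedged sketch (exact behavior depends on your PyTorch version and
model), recent releases expose this through the ``mode="reduce-overhead"``
argument of ``torch.compile``, which asks inductor to use CUDA graphs:

.. code-block:: python

    import torch

    model = torch.nn.Linear(1024, 1024, device="cuda")
    # "reduce-overhead" asks inductor to use CUDA graphs to cut kernel launch overhead
    opt_model = torch.compile(model, mode="reduce-overhead")
    opt_model(torch.randn(64, 1024, device="cuda"))  # first call compiles and records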

TorchDynamo supports many different backends, but inductor specifically works
by generating `Triton <https://github.com/openai/triton>`__ kernels. If you
save the first example above as ``trig.py``, you can inspect the generated
kernels by running ``TORCH_COMPILE_DEBUG=1 python trig.py``. The actual
generated kernel looks like this:

.. code-block:: python

    @pointwise(size_hints=[16384], filename=__file__, meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32'}, 'device': 0, 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=())]})
    @triton.jit
    def kernel(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
        xnumel = 10000
        xoffset = tl.program_id(0) * XBLOCK
        xindex = xoffset + tl.reshape(tl.arange(0, XBLOCK), [XBLOCK])
        xmask = xindex < xnumel
        x0 = xindex
        tmp0 = tl.load(in_ptr0 + (x0), xmask)
        tmp1 = tl.cos(tmp0)
        tmp2 = tl.sin(tmp0)
        tmp3 = tmp1 + tmp2
        tl.store(out_ptr0 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp3, xmask)

And you can verify that the fusion actually occurred, because the ``cos``,
the ``sin``, and the addition all happen within a single Triton kernel, and
the temporary variables are held in registers with very fast access.

You can read a lot more about Triton’s performance
`here <https://openai.com/blog/triton/>`__, but the key point is that it’s
written in Python, so you can easily read it even if you have not written
many CUDA kernels.

Next, let’s try a real model, like resnet18 from the PyTorch
hub.

.. code-block:: python

    import torch

    model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
    opt_model = torch.compile(model, backend="inductor")
    opt_model(torch.randn(1, 3, 64, 64))  # run the compiled model, not the original

And inductor is not the only available backend; you can run
``torch._dynamo.list_backends()`` in a REPL to see all of the available
backends. Try the ``cudagraphs`` or nvFuser-based backends next as
inspiration.
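
For example (the exact list depends on your PyTorch build and which optional
dependencies are installed):

.. code-block:: python

    import torch._dynamo as dynamo

    # Prints the registered backend names, e.g. ['cudagraphs', 'inductor', ...]
    print(dynamo.list_backends())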

Let’s do something a bit more interesting now. Our community frequently
uses pretrained models from
`transformers <https://github.com/huggingface/transformers>`__ or
`TIMM <https://github.com/rwightman/pytorch-image-models>`__, and one of
our design goals is for Dynamo and inductor to work out of the box with
any model that people would like to author.

So we will directly download a pretrained model from the
HuggingFace hub and optimize it:

.. code-block:: python

    import torch
    from transformers import BertTokenizer, BertModel

    # Copy-pasted from here: https://huggingface.co/bert-base-uncased
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained("bert-base-uncased").to(device="cuda:0")
    model = torch.compile(model, backend="inductor")  # This is the only line of code that we changed
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt').to(device="cuda:0")
    output = model(**encoded_input)

If you remove the ``to(device="cuda:0")`` calls from the model and
``encoded_input``, then inductor will generate C++ kernels
optimized for running on your CPU. You can inspect both the Triton and the
C++ kernels for BERT; they’re obviously more complex than the trigonometry
example above, but you can similarly skim them and understand them if you
understand PyTorch.
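
For reference, the CPU-only variant is the same snippet without the device
moves (a sketch; the generated C++ can again be inspected by running with
``TORCH_COMPILE_DEBUG=1``):

.. code-block:: python

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained("bert-base-uncased")  # stays on the CPU
    model = torch.compile(model, backend="inductor")        # inductor emits C++ kernels here
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)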

Similarly, let’s try out a TIMM example:

.. code-block:: python

    import timm
    import torch

    model = timm.create_model('resnext101_32x8d', pretrained=True, num_classes=2)
    opt_model = torch.compile(model, backend="inductor")
    opt_model(torch.randn(64, 3, 7, 7))

Our goal with Dynamo and inductor is to build the highest-coverage ML
compiler, one that works with any model you throw at it.

Existing Backends
~~~~~~~~~~~~~~~~~

TorchDynamo has a growing list of backends, which can be found in the
`backends <https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/backends/>`__ folder
or by calling ``torch._dynamo.list_backends()``, each with its own optional
dependencies.

Some of the most commonly used backends include the following (a quick way to
experiment with them is sketched after these lists):

**Training & inference backends**:

* ``torch.compile(m, backend="inductor")`` - Uses ``TorchInductor`` backend. `Read more <https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747>`__
* ``torch.compile(m, backend="aot_ts_nvfuser")`` - nvFuser with AotAutograd/TorchScript. `Read more <https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593>`__
* ``torch.compile(m, backend="nvprims_nvfuser")`` - nvFuser with PrimTorch. `Read more <https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593>`__
* ``torch.compile(m, backend="cudagraphs")`` - cudagraphs with AotAutograd. `Read more <https://github.com/pytorch/torchdynamo/pull/757>`__

**Inference-only backends**:

* ``torch.compile(m, backend="onnxrt")`` - Uses ONNXRT for inference on CPU/GPU. `Read more <https://onnxruntime.ai/>`__
* ``torch.compile(m, backend="tensorrt")`` - Uses ONNXRT to run TensorRT for inference optimizations. `Read more <https://github.com/onnx/onnx-tensorrt>`__
* ``torch.compile(m, backend="ipex")`` - Uses IPEX for inference on CPU. `Read more <https://github.com/intel/intel-extension-for-pytorch>`__
* ``torch.compile(m, backend="tvm")`` - Uses Apache TVM for inference optimizations. `Read more <https://tvm.apache.org/>`__
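
As a quick experiment, you can compile the same model with several backends
and compare them. A minimal sketch (each backend only works if its optional
dependencies are installed; ``"eager"`` and ``"aot_eager"`` are
debugging-oriented backends that are always available):

.. code-block:: python

    import torch

    model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
    x = torch.randn(8, 16)

    for backend in ("eager", "aot_eager", "inductor"):
        opt = torch.compile(model, backend=backend)
        opt(x)  # the first call triggers compilation with that backend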

Why do you need another way of optimizing PyTorch code?
--------------------------------------------------------

While a number of other code optimization tools exist in the PyTorch
ecosystem, each of them has its own flow.
Here are a few examples of existing methods and their limitations:

- ``torch.jit.trace()`` is silently wrong if it cannot trace, for example
  in the presence of control flow (see the sketch after this list)
- ``torch.jit.script()`` requires modifications to user or library code:
  adding type annotations and removing non-PyTorch code
- ``torch.fx.symbolic_trace()`` either traces correctly or gives a hard
  error, but it’s limited to traceable code so it still can’t handle
  control flow
- ``torch._dynamo`` works out of the box and produces partial graphs.
  It still has the option of producing a single graph with
  ``nopython=True``, which is needed for `some
  situations <./documentation/FAQ.md#do-i-still-need-to-export-whole-graphs>`__,
  but it allows a smoother transition where partial graphs can be
  optimized without code modification
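
To make the first limitation concrete, here is a small illustrative sketch
(the function ``f`` is hypothetical): ``torch.jit.trace`` records only the
branch taken for the example input, while ``torch.compile`` handles the
data-dependent branch correctly.

.. code-block:: python

    import torch

    def f(x):
        if x.sum() > 0:  # data-dependent control flow
            return x * 2
        return x + 1

    traced = torch.jit.trace(f, torch.ones(3))  # warns: only the x * 2 branch is recorded
    compiled = torch.compile(f)                 # dynamo graph-breaks on the branch and stays correct

    x = -torch.ones(3)
    print(traced(x))    # tensor([-2., -2., -2.]) -- silently wrong branch
    print(compiled(x))  # tensor([0., 0., 0.])    -- matches eager f(x)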